# NLP using spaCy

### I authored this code as part of an assignment for the Machine Learning course I took at Johns Hopkins.

### This example uses text from an article found on CNN.com.  The link to the original page can be found here: 
#### https://www.cnn.com/2022/04/28/football/afghan-women-footballers-taliban-australia-spt-intl/index.html

### My code is broken down into six parts to do the following tasks:

1. Show the most common words in the article.
2. Show the most common words for each part of speech (e.g. NOUN: {'Bob': 12, 'Alice': 4}).
3. Find a subject/object relationship through the dependency parser in any sentence.
4. Show the most common Entities and their types.
5. Find Entities and their dependency (hint: entity.root.head).
6. Find the most similar words in the article.
 
 
### Skills used:
- NLP with spaCy, including part-of-speech tagging, entity types and dependencies, and word-similarity analysis


In [1]:
import pandas as pd
import numpy as np

from collections import Counter, defaultdict
from itertools import groupby
from operator import itemgetter

import spacy

In [2]:
article_text = """Even though she says the images play out like a movie in her mind, it's a scene she couldn't possibly ever have imagined: The end of her world as she knew it, and quite possibly her own imminent demise.

'They were beating our parents, our family members, our teammates,' Fatima, who is spokesperson for the Afghanistan Women's National Football team, told CNN Sport. 'You didn't know if you'll be alive, or dead soon.'

The scene being described is the fall of Kabul in 2021 and the frantic rush to exit Afghanistan before the Taliban seized full control of the country.

Women and girls were particularly vulnerable, as the future they thought they could look forward to simply evaporated in front of their eyes. 'You're losing your dreams in a blink,' added Fatima.

Earlier this year the United Nations warned that 'virtually every man, woman and child in Afghanistan could face acute poverty,' while in March the Taliban went back on their much-anticipated promise to let girls above sixth grade return to school.

 The Afghanistan Women's National Football team is a group of strong and independent women and knew they'd be in the Taliban's crosshairs. They were desperate to escape as fear and panic gripped Kabul last year as the city teetered on the edge.

Their proudest achievements, everything they had worked for, suddenly turned radioactive; identifying social media accounts were scrubbed and treasured football shirts, boots, medals, and trophies were burned.

For some, the elevated status of their positions could have proved fatal.

At least one team member was recognized in the crush of people outside of the airport. 'Oh look, there is a player from the Afghanistan National Football team,' somebody was heard to have said. To save their own lives, the players had to lie to the Taliban and deny it.

They spent two days hiding outside of the airport and it was another two before they boarded the C-130 transport planes that would airlift them to safety.

They got out in the nick of time – 48 hours later, a suicide bomber killed around 180 people, including 13 US servicemen and women. CNN has been told that some of the US military members killed in the blast had helped rescue the players.

But it was the moment that the planes' wheels left the ground, as those gigantic human lifeboats roared up into the sky, that the emotions crashed over the players in waves.

When a question about the emotion of that flight was posed to Fatima she visibly crumples and excuses herself from the interview.

'Your question is too deep,' she explains, once she has composed herself, adding: 'We all left everything. You know you're never going to have those things that you had before. You don't know about your future. You're saying goodbye to your country where you grew up, your childhood moments, so many memories.'

Many were forced to leave their families behind to an uncertain fate, and for some of the players, the weight of their guilt in that moment was crushing.  It's been eight months since the Afghanistan players were airlifted to safety and many have found themselves in Melbourne, Australia.

For some sports fans, Melbourne is the capital of the world, home of Formula One's Australian Grand Prix, the Australian Open tennis tournament, and the iconic Melbourne Cricket Ground. But for these Afghan players, it's simply their new home.

The Melbourne Victory Soccer Club has taken the Afghan team under its wing and is providing facilities and coaching to maximize their potential.

Director of Football John Didulica outlined the team's sporting opportunities and the goal of one day perhaps playing a World Cup qualifier, but he stressed that it's principally a humanitarian program for now.

'My first hope is better lives, they've got very complex personal situations and if football can help them get a foothold into Australian life, that's our ultimate objective,' Didulica told CNN Sport.

'Football has this overriding responsibility to support, not only its participants, but show the best humanity and the best in people

'And I think this team acts as a totem, to some degree, for a lot of the good that we see in football. They deserve whatever the game can offer them.' 
 Whilst she's had to make major adjustments in her life, Fatima says she has discovered something that she wasn't expecting: peace and security.

'I never felt that before in Afghanistan,' she explained, 'I was afraid of so many things. [But] I found it here, peaceful moments. I said, ‘That's it. You're alive. You made it.'

'Australia is a multicultural country,' defender Marsul told CNN Sport. 'They accept all kinds of people. They don't ask us, ‘Are you Muslim? Are you Christian? That's such a good thing and Australians are such a kind people. I love it.'

Yet life is still complicated. The backup goalkeeper, Montaha, fled Afghanistan tightly clutching the hand of her 15-year-old brother.

Now, she is responsible for looking after herself, raising her brother, working, studying, and trying to preserve her dream of playing international football.

At least she is surrounded by role models she couldn't have imagined back in Afghanistan.

'Women are playing better than men here in football, it was like, magical things! I was like, ‘Wow, women are [more] powerful than men.' And it was the happiest thing for me.' 
 At the end of April, the Afghanistan team played their first match together since fleeing last August. In one of their final training sessions, the players were reunited with their former coach, the American Haley Carter.

As a former Marine, Carter was one of the team's guardian angels, pulling strings and working back channels to get them out safely.

'It's exciting to see them on the pitch again,' said Carter. 'There's this sense of optimism and hope for what the future will bring.'

She believes that the team's revival is a powerful moment that transcends them all.

'The Afghanistan Women's national team plays for everyone. Every woman, every athlete, every sport, even non-athletes. They represent the power of the women of Afghanistan, the strength and the resilience of the women of Afghanistan.

'And they're a reminder to women, everywhere, that we can collectively do anything that we put our minds to, and we are stronger than others may think.'

But as with so many other aspects of their story, it is bittersweet for Carter.

She can't help thinking of the players that they couldn't get out, the families who were left behind and the military personnel who sacrificed their own lives to save so many others. 'It's heavy,' concedes Carter. 'There is this weight that's hanging over things.' 
 There is still so much uncertainty for this team and these players.

It's still not clear if FIFA will allow them to play under the flag and name of Afghanistan and compete as an international team in exile. But whatever happens, there can be no doubting the potency of their very existence.

'Nothing can stop us. We want to show the Taliban that we are never going to stop,' Montaha stated defiantly.

'The Taliban doesn't allow the girls to go to school or university. We want to be a voice for the voiceless who are still in Afghanistan, we want to assure the Taliban that they can never change anything.'

Montaha says that the spirit in the team is stronger than ever before, and they certainly need to be strong for each other now.

Amidst the smiles and laughter on the field, it would be easy to overlook the immense toll that their ordeal has taken. All are starting from scratch, some can't speak English, birthdays without their families present are difficult, and parents' meetings at school are a painful reminder of the absence of loved ones.

Fatima says she tries her best to lift her teammates whenever they're feeling down.

'I try to be helpful and give her the courage to stay powerful. Stay positive that one day you will have your parents beside you, and they will celebrate your day.' 
 It's impossible to know what the future holds for this team of players, but as individuals, in their own lives, they must look out for themselves.

Fatima says she dreams of being a businesswoman. 'I'm trying to achieve it,' she enthused. 'Every day it's motivated me to stay positive and work harder. I will feel more powerful.'

Carter has no doubt that bright futures lie ahead for all of them.

'The sky's the limit for this group, they've clearly proven to everyone that they're capable of incredible things.'

She acknowledges that some have personal challenges adjusting to a new life in a new country, and she shared one such conversation.

'She's frustrated because she's starting her life from zero. I mentioned to her, ‘Think about all of the opportunities that that now gives you. You can do anything that you want to. It's the start of the rest of your life, so dream big!'' 
"""

In [3]:
processor = spacy.load("en_core_web_md")

processed_text = processor(article_text)
processed_text

Even though she says the images play out like a movie in her mind, it's a scene she couldn't possibly ever have imagined: The end of her world as she knew it, and quite possibly her own imminent demise.

'They were beating our parents, our family members, our teammates,' Fatima, who is spokesperson for the Afghanistan Women's National Football team, told CNN Sport. 'You didn't know if you'll be alive, or dead soon.'

The scene being described is the fall of Kabul in 2021 and the frantic rush to exit Afghanistan before the Taliban seized full control of the country.

Women and girls were particularly vulnerable, as the future they thought they could look forward to simply evaporated in front of their eyes. 'You're losing your dreams in a blink,' added Fatima.

Earlier this year the United Nations warned that 'virtually every man, woman and child in Afghanistan could face acute poverty,' while in March the Taliban went back on their much-anticipated promise to let girls above sixth grade

## 1. Show the most common words in the article.

In [4]:
words = []
pos = []

for sentence in processed_text.sents:
    for token in sentence:
        # Skip punctuation, whitespace, symbols, and unclassified tokens
        if token.pos_ not in ("PUNCT", "SPACE", "SYM", "X"):
            words.append(token.text)
            pos.append({'Type': token.pos_, 'word': token.text})

### Showing just the top 25

In [5]:
Counter(words).most_common(25)

[('the', 82),
 ('of', 44),
 ('to', 44),
 ('and', 39),
 ('that', 30),
 ("'s", 27),
 ('in', 24),
 ('a', 23),
 ('she', 18),
 ('for', 18),
 ('it', 17),
 ('is', 17),
 ('their', 17),
 ('Afghanistan', 15),
 ('team', 15),
 ('they', 15),
 ('her', 12),
 ('was', 12),
 ('I', 11),
 ("n't", 10),
 ('you', 10),
 ('players', 10),
 ('as', 9),
 ('were', 9),
 ('are', 9)]
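
The top of this list is dominated by function words, and capitalized variants such as `The` and `the` are counted separately. Below is a minimal sketch of one way to get a more informative ranking, assuming lowercase normalization and a stop-word flag like spaCy's `token.is_stop`. The names `content_word_counts` and `FakeToken` are hypothetical; the stand-in tokens exist only so the snippet runs without loading a model.

```python
from collections import Counter, namedtuple

# Hypothetical stand-in exposing the same attribute names as a spaCy Token
# (text, pos_, is_stop), so the sketch runs without loading a model.
FakeToken = namedtuple("FakeToken", ["text", "pos_", "is_stop"])

SKIPPED_POS = {"PUNCT", "SPACE", "SYM", "X"}


def content_word_counts(tokens):
    """Count lowercased words, skipping stop words and non-word tokens."""
    return Counter(
        token.text.lower()
        for token in tokens
        if token.pos_ not in SKIPPED_POS and not token.is_stop
    )


sample = [
    FakeToken("The", "DET", True),
    FakeToken("team", "NOUN", False),
    FakeToken("and", "CCONJ", True),
    FakeToken("the", "DET", True),
    FakeToken("Team", "NOUN", False),
    FakeToken("players", "NOUN", False),
    FakeToken(".", "PUNCT", False),
]

counts = content_word_counts(sample)
print(counts.most_common(2))
```

With a real pipeline, the parsed document itself could be passed in place of `sample`, since iterating over it yields tokens with these attributes.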

## 2. Show the most common words for each part of speech (e.g. NOUN: {'Bob': 12, 'Alice': 4})

In [23]:
parts_of_speech = sorted(pos, key = itemgetter('Type'))

print("\nOnly displaying words with more than one occurrence:\n")
    
for key, speech_type in groupby(parts_of_speech, key = itemgetter('Type')):
    print("\n", key, ":") # prints my parts of speech type
    
    wordcount = defaultdict(int) # resets my word counter for each part of speech
       
    for s in speech_type:
        wordcount[s['word']] += 1
        
    sorted_words = sorted(wordcount.items(), key=lambda x: x[1], reverse=True)
    
    mydict = {} 
    
    for w in sorted_words:
        if w[1] != 1:
            mydict[w[0]] = w[1]
       
    print(mydict)



Only displaying words with more than one occurrence:


 ADJ :
{'many': 5, 'own': 4, 'powerful': 4, 'new': 3, 'alive': 2, 'strong': 2, 'last': 2, 'least': 2, 'military': 2, 'Australian': 2, 'Afghan': 2, 'first': 2, 'personal': 2, 'best': 2, 'international': 2, 'former': 2, 'stronger': 2, 'other': 2, 'positive': 2}

 ADP :
{'of': 44, 'in': 24, 'for': 18, 'to': 14, 'as': 6, 'out': 5, 'on': 4, 'from': 4, 'At': 3, 'about': 3, 'than': 3, 'like': 2, 'For': 2, 'up': 2, 'into': 2, 'over': 2, 'under': 2, 'with': 2}

 ADV :
{'so': 6, 'never': 4, 'still': 4, 'back': 3, 'now': 3, 'possibly': 2, 'ever': 2, 'simply': 2, 'before': 2, 'behind': 2, 'here': 2}

 AUX :
{"'s": 18, 'is': 14, 'was': 12, 'were': 9, 'are': 9, 'be': 7, "'re": 7, 'can': 7, 'could': 6, 'have': 5, 'has': 5, 'will': 5, 'do': 3, 'being': 2, 'had': 2, 'would': 2, 'been': 2, "'ve": 2, 'Are': 2, 'ca': 2}

 CCONJ :
{'and': 39, 'But': 5, 'but': 3, 'And': 3, 'or': 2}

 DET :
{'the': 82, 'a': 23, 'The': 8, 'this': 8, 'those': 2, 'that': 2
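
The sort-then-`groupby` approach works, but the same per-tag tally can be sketched with a dictionary of counters, which avoids the explicit sort. This is an alternative under stated assumptions, not the code that produced the output above; `counts_by_pos` and the sample pairs are hypothetical.

```python
from collections import Counter, defaultdict


def counts_by_pos(pairs, min_count=2):
    """Map each part-of-speech tag to its repeated words and their counts."""
    by_pos = defaultdict(Counter)
    for pos_tag, word in pairs:
        by_pos[pos_tag][word] += 1
    # Keep only words seen at least min_count times, most frequent first.
    return {
        pos_tag: {w: c for w, c in counter.most_common() if c >= min_count}
        for pos_tag, counter in sorted(by_pos.items())
    }


# Hypothetical (tag, word) pairs; with spaCy these would come from
# (token.pos_, token.text) for each token in the parsed document.
pairs = [
    ("NOUN", "team"), ("NOUN", "players"), ("NOUN", "team"),
    ("VERB", "told"), ("VERB", "told"), ("ADJ", "new"),
]

result = counts_by_pos(pairs)
print(result)
```

As in the cell above, a tag whose words all occur once still appears, with an empty dictionary.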

## 3. Find a subject/object relationship through the dependency parser in any sentence.

In [9]:
def pr_tree(word, level):
    if word.is_punct:
        return
    for child in word.lefts:
        pr_tree(child, level + 1)
    print('\t'* level + word.text + " - " + word.dep_)
    for child in word.rights:
        pr_tree(child, level + 1)

### Showing the subject/object relationships for the first 4 sentences

In [24]:
n = 1   
for sentence in processed_text.sents:
    print("\nSENTENCE", n, "~~~~~~~~~~~~\n")
    pr_tree(sentence.root, 0)
    n += 1
   
    if n > 4:
        break


SENTENCE 1 ~~~~~~~~~~~~

		Even - advmod
		though - mark
		she - nsubj
	says - advcl
				the - det
			images - nsubj
		play - ccomp
			out - prt
			like - prep
					a - det
				movie - pobj
					in - prep
							her - poss
						mind - pobj
	it - nsubj
's - ROOT
		a - det
	scene - attr
			she - nsubj
			could - aux
			n't - neg
			possibly - advmod
			ever - advmod
			have - aux
		imagined - relcl
		The - det
	end - attr
		of - prep
				her - poss
			world - pobj
			as - mark
			she - nsubj
		knew - advcl
			it - dobj
		and - cc
				quite - advmod
			possibly - advmod
			her - poss
			own - amod
			imminent - amod
		demise - conj

SENTENCE 2 ~~~~~~~~~~~~

	

 - dep
	They - nsubj
	were - aux
beating - ROOT
		our - poss
	parents - dobj
			our - poss
			family - compound
		members - appos
				our - poss
			teammates - conj
				Fatima - appos
						who - nsubj
					is - relcl
						spokesperson - attr
							for - prep
										the - det
										Afghanistan - compound
									Women - 
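
The tree printout shows every token's dependency label, but the subject/object pairs still have to be read off by eye. Here is a minimal sketch of extracting explicit (subject, verb, object) triples, assuming the `nsubj`/`nsubjpass` and `dobj` labels used by spaCy's English models. `FakeToken` and `subject_object_triples` are hypothetical names; the stand-in mirrors the `i`, `text`, `dep_`, and `head` attributes of a spaCy token so the snippet runs without a model.

```python
from dataclasses import dataclass
from typing import Optional

SUBJECT_DEPS = {"nsubj", "nsubjpass"}
OBJECT_DEPS = {"dobj"}


@dataclass(eq=False)
class FakeToken:
    """Hypothetical stand-in for a spaCy Token (i, text, dep_, head)."""
    i: int
    text: str
    dep_: str
    head: Optional["FakeToken"] = None


def subject_object_triples(tokens):
    """Return (subject, head, object) triples for heads that have both."""
    tokens = list(tokens)
    triples = []
    for head in tokens:
        # Compare by position so the check does not rely on object identity.
        subjects = [t.text for t in tokens
                    if t.head.i == head.i and t.dep_ in SUBJECT_DEPS]
        objects = [t.text for t in tokens
                   if t.head.i == head.i and t.dep_ in OBJECT_DEPS]
        triples.extend((s, head.text, o) for s in subjects for o in objects)
    return triples


# "They were beating our parents", labeled as in the SENTENCE 2 printout.
beating = FakeToken(2, "beating", "ROOT")
beating.head = beating  # a root token is its own head
they = FakeToken(0, "They", "nsubj", beating)
were = FakeToken(1, "were", "aux", beating)
parents = FakeToken(4, "parents", "dobj", beating)
our = FakeToken(3, "our", "poss", parents)

triples = subject_object_triples([they, were, beating, our, parents])
print(triples)
```

The nested scan is quadratic in sentence length, which is acceptable for a single sentence; with real spaCy tokens, iterating over `token.children` would be the more direct route.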

## 4. Show the most common Entities and their types.

### Displaying Top 20 

In [15]:
common_entities = pd.DataFrame({"Entity": [entity.text for entity in processed_text.ents],
                        "Type": [entity.label_ for entity in processed_text.ents]})


common_entities = common_entities.groupby(["Entity", "Type"]).size().reset_index(name = "Occurrences")
common_entities = common_entities.sort_values("Occurrences", ascending = False).head(20)
common_entities

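| | Entity | Type | Occurrences |
|---|---|---|---|
| 6 | Afghanistan | GPE | 11 |
| 36 | Taliban | ORG | 7 |
| 22 | Fatima | PERSON | 6 |
| 16 | Carter | PERSON | 5 |
| 32 | Montaha | ORG | 3 |
| 30 | Melbourne | GPE | 2 |
| 10 | Australian | NORP | 2 |
| 39 | US | GPE | 2 |
| 45 | one | CARDINAL | 2 |
| 37 | The Afghanistan Women's | ORG | 2 |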
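
The same grouping can be sketched without pandas by counting `(text, label)` pairs directly. This is an illustrative alternative, not the code that produced the table; `FakeEntity` and `top_entities` are hypothetical names, and the stand-in only mirrors the `text` and `label_` attributes of a spaCy entity span.

```python
from collections import Counter, namedtuple

# Hypothetical stand-in for a spaCy entity span (text, label_).
FakeEntity = namedtuple("FakeEntity", ["text", "label_"])


def top_entities(entities, n=20):
    """Return the n most frequent (text, label, count) entity rows."""
    counts = Counter((ent.text, ent.label_) for ent in entities)
    return [(text, label, count)
            for (text, label), count in counts.most_common(n)]


sample_entities = [
    FakeEntity("Afghanistan", "GPE"), FakeEntity("Taliban", "ORG"),
    FakeEntity("Afghanistan", "GPE"), FakeEntity("Fatima", "PERSON"),
    FakeEntity("Afghanistan", "GPE"), FakeEntity("Taliban", "ORG"),
]

ranking = top_entities(sample_entities, n=2)
print(ranking)
```

Keying on the pair rather than the text alone keeps an entity that was tagged with two different types as two separate rows, matching the `groupby(["Entity", "Type"])` behavior.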


## 5. Find Entities and their dependency (hint: entity.root.head)

In [16]:
pd.DataFrame({"Entity": [entity.text for entity in processed_text.ents],
              "Dependency": [entity.root.head for entity in processed_text.ents]})

| | Entity | Dependency |
|---|---|---|
| 0 | Fatima | teammates |
| 1 | the Afghanistan Women's National Football | team |
| 2 | CNN Sport | told |
| 3 | Kabul | of |
| 4 | 2021 | in |
| ... | ... | ... |
| 86 | Fatima | says |
| 87 | Fatima | says |
| 88 | Carter | has |
| 89 | one | conversation |
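
The table lists the head token each entity attaches to, but not the type of relation. A minimal sketch of also recording the relation label from `entity.root.dep_` follows. `entity_dependencies` and `fake_entity` are hypothetical helpers, and the dependency labels in the sample are illustrative values, not parser output.

```python
from types import SimpleNamespace


def entity_dependencies(entities):
    """Collect each entity's text, its root's relation label, and its head."""
    return [
        {
            "Entity": ent.text,
            "Relation": ent.root.dep_,
            "Head": ent.root.head.text,
        }
        for ent in entities
    ]


def fake_entity(text, dep, head_text):
    """Hypothetical stand-in mirroring span.text, span.root.dep_, root.head."""
    head = SimpleNamespace(text=head_text)
    root = SimpleNamespace(dep_=dep, head=head)
    return SimpleNamespace(text=text, root=root)


# Entity/head pairs taken from the table; the labels are assumed for the demo.
sample = [
    fake_entity("Kabul", "pobj", "of"),
    fake_entity("CNN Sport", "dobj", "told"),
]

rows = entity_dependencies(sample)
for row in rows:
    print(row)
```

The resulting list of dictionaries can be handed to `pd.DataFrame` if a tabular view is preferred.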


## 6. Find the most similar words in the article

In [17]:
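def most_similar(word, topn):
    """Return the closest vocabulary entries to word as (word, neighbor, similarity) tuples."""
    # Look up the lexeme for the word in the model's vocabulary
    word = processed_text.vocab[str(word)]
    # Candidates are lexemes from the model vocabulary (not only the article)
    # whose is_lower flag matches the query word
    queries = [
        w for w in word.vocab
        if w.is_lower == word.is_lower
    ]

    # Rank candidates by vector similarity, most similar first
    by_similarity = sorted(queries, key=lambda w: word.similarity(w), reverse=True)
    # Take topn + 1 candidates, then drop the query word itself
    similar_words = [(word.text, w.lower_, w.similarity(word)) for w in by_similarity[:topn+1] if w.lower_ != word.lower_]
    return similar_words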



### Only comparing Nouns, Verbs and Adjectives

In [18]:
words = []

for sentence in processed_text.sents:
    for token in sentence:
        if token.pos_ in ("NOUN", "ADJ", "VERB"):
            words.append(token.text)

# Remove duplicate words
words = list(set(words))

In [19]:
all_similar_words = []
comparison_df = pd.DataFrame(columns=['Word', 'Comparison', 'Similarity'])

for word in words:
    similar_words = most_similar(word, topn=3)
    all_similar_words.append(similar_words)




In [20]:
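# DataFrame.append was removed in pandas 2.0, so build the table in one step
rows = [{'Word': i[0], 'Comparison': i[1], 'Similarity': i[2]}
        for element in all_similar_words for i in element]
comparison_df = pd.DataFrame(rows, columns=['Word', 'Comparison', 'Similarity'])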


In [21]:
comparison_df = comparison_df.sort_values(["Similarity"], ascending=False)
comparison_df

| | Word | Comparison | Similarity |
|---|---|---|---|
| 657 | happiest | proudest | 1.000000 |
| 666 | proudest | happiest | 1.000000 |
| 496 | Afghan | afghanistan | 1.000000 |
| 663 | 's | ’s | 1.000000 |
| 495 | Afghan | kabul | 1.000000 |
| ... | ... | ... | ... |
| 779 | radioactive | blast | 0.315761 |
| 122 | totem | spirit | 0.298534 |
| 15 | crosshairs | totem | 0.271395 |
| 16 | crosshairs | blink | 0.262938 |
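
The similarity scores are, by spaCy's default, cosine similarities between word vectors. The sketch below computes that measure by hand and ranks pairs drawn only from a fixed word list, which is one way to restrict the comparison to words that actually occur in the article. The three-dimensional vectors are made-up toy values and `most_similar_pairs` is a hypothetical helper; with spaCy, the vectors would come from `token.vector`.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # treat an empty (all-zero) vector as dissimilar
    return dot / (norm_u * norm_v)


def most_similar_pairs(vectors, topn=3):
    """Rank every unordered word pair by cosine similarity, highest first."""
    scored = [
        (a, b, cosine_similarity(vectors[a], vectors[b]))
        for a, b in combinations(sorted(vectors), 2)
    ]
    return sorted(scored, key=lambda item: item[2], reverse=True)[:topn]


# Toy vectors, invented for illustration only.
toy_vectors = {
    "happiest": [0.9, 0.1, 0.0],
    "proudest": [0.8, 0.2, 0.0],
    "airport": [0.0, 0.1, 0.9],
}

pairs = most_similar_pairs(toy_vectors, topn=1)
print(pairs[0][:2])
```

Comparing all pairs is quadratic in the number of distinct words, which is manageable for a single article but would not scale to a full model vocabulary.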
