# NLP
Find your favorite news source and grab the article text.

1. Show the most common words in the article.
2. Show the most common words under a part of speech (e.g. NOUN: {'Bob': 12, 'Alice': 4}).
3. Find a subject/object relationship through the dependency parser in any sentence.
4. Show the most common entities and their types.
5. Find entities and their dependency (hint: entity.root.head).
6. Find the most similar words in the article.

Note: The notebook from the video is not provided. I leave it to you to make your own :) It's your final assignment for the semester. Enjoy!

In [1]:
import spacy
nlp = spacy.load("en_core_web_sm")

In [2]:
# Read the article text; the context manager closes the file afterwards.
with open('./economist.txt', newline='\n', encoding='utf-8') as f:
    txt = f.read()

In [3]:
doc = nlp(txt)
doc

On what terms could the war in Ukraine stop?
Russia's lightning attack on Ukraine’s capital, Kyiv, was a failure. Its creeping artillery war to seize the eastern region of Donbas has ground to a bloody halt. It has lost a chunk of stolen territory south of the city of Kharkiv, and this week announced a retreat from Kherson, the only provincial capital it had captured since its invasion in February. With each setback, Vladimir Putin, Russia’s president, has sought new ways to torment Ukraine. The latest is a relentless bombardment that seeks to wreck Ukraine’s infrastructure. Residents of the capital have been told they may have to evacuate if the power grid collapses, halting water and sewage services.
Power cuts have not sapped Ukraine’s will to fight. But they are a reminder that, eight months after his unprovoked invasion, Mr Putin keeps looking for ways to raise the stakes. Some worry he might blow up a dam on the Dnieper river, as Stalin did in 1941, to slow his adversaries’ advan

## 1. Show the most common words in the article.

In [4]:
from collections import Counter

In [26]:
# Collect lemmas, skipping stop words, punctuation, and whitespace tokens.
token_list = [token.lemma_ for token in doc
              if not (token.is_stop or token.is_punct or token.is_space)]

In [6]:
counted = Counter(token_list)
counted.most_common(15)

[('Ukraine', 52),
 ('Russia', 30),
 ('Mr', 29),
 ('Putin', 22),
 ('western', 15),
 ('war', 14),
 ('West', 14),
 ('talk', 12),
 ('weapon', 11),
 ('support', 11),
 ('nuclear', 9),
 ('country', 9),
 ('russian', 8),
 ('long', 8),
 ('Biden', 8)]

## 2. Show the most common words under a part of speech (e.g. NOUN: {'Bob': 12, 'Alice': 4})

In [7]:
test = doc[50]
[test.text,test.pos_,test.lemma_]

['stolen', 'VERB', 'steal']

In [8]:
import pandas as pd
# One row per token: surface text, coarse part-of-speech tag, and lemma.
df = pd.DataFrame({'token': [token.text for token in doc],
                   'pos': [token.pos_ for token in doc],
                   'lemma': [token.lemma_ for token in doc]})

In [9]:
df.head()

   token   pos  lemma
0     On   ADP     on
1   what  PRON   what
2  terms  NOUN   term
3  could   AUX  could
4    the   DET    the


In [10]:
noun_sub = df[df['pos']=='NOUN']
noun_sub.head()

        token   pos      lemma
2       terms  NOUN       term
5         war  NOUN        war
8        stop  NOUN       stop
13  lightning  NOUN  lightning
14     attack  NOUN     attack


In [11]:
noun_count = Counter(noun_sub['lemma'])
print('NOUN:',noun_count.most_common(10))

NOUN: [('war', 14), ('weapon', 11), ('’s', 9), ('support', 9), ('country', 9), ('defence', 8), ('territory', 7), ('power', 7), ('time', 7), ('term', 6)]
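
The task statement asks for counts keyed by part of speech (`NOUN: {'Bob': 12, 'Alice': 4}`). A minimal sketch of that shape, assuming the tokens have already been reduced to `(pos, lemma)` pairs. The `pairs` list is hand-written sample data rather than output from the article, and `count_by_pos` is a hypothetical helper name.

```python
from collections import Counter, defaultdict

def count_by_pos(pairs):
    """Group lemma counts under their part-of-speech tag."""
    counts = defaultdict(Counter)
    for pos, lemma in pairs:
        counts[pos][lemma] += 1
    return counts

# Hand-written sample pairs (illustrative only, not taken from the article).
pairs = [("NOUN", "war"), ("NOUN", "weapon"), ("NOUN", "war"),
         ("VERB", "seize"), ("PROPN", "Ukraine"), ("PROPN", "Ukraine")]

by_pos = count_by_pos(pairs)
for pos, counter in by_pos.items():
    # Show the two most frequent lemmas for each tag as a plain dict.
    print(pos + ":", dict(counter.most_common(2)))
```

With the notebook's `doc`, the pairs could be built as `[(t.pos_, t.lemma_) for t in doc if not (t.is_stop or t.is_punct or t.is_space)]`, which covers every part of speech in one pass instead of filtering one tag at a time.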


## 3. Find a subject/object relationship through the dependency parser in any sentence.

In [12]:
def pr_tree(word, level):
    if word.is_punct:
        return
    for child in word.lefts:
        pr_tree(child, level+1)
    print('\t'* level + word.text + ' - ' + word.dep_)
    for child in word.rights:
        pr_tree(child, level+1)

In [13]:
sent = doc[48:66]

In [14]:
pr_tree(sent.root, 0)

	a - det
chunk - dobj
	of - prep
			stolen - amod
		territory - pobj


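The slice `doc[48:66]` starts in the middle of a sentence, so the tree above covers only the phrase "a chunk of stolen territory": "chunk" is a direct object (`dobj`) whose governing verb lies outside the span (in the article text, "It has lost a chunk of stolen territory"), "territory" is the object of the preposition "of," and "stolen" modifies "territory." In the preceding sentence of the article, "region" is the direct object of the verb "to seize," which modifies the main subject, "war." The main verb, "ground," has no direct object.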
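
A subject/object relationship can also be pulled out programmatically by pairing the `nsubj` and `dobj` children of the same head. The sketch below is one way to do that, not the method from the video. So that it runs without a model, it uses a hypothetical stand-in class `Tok` with the same attribute names as a spaCy token (`i`, `text`, `dep_`, `head`), and the dependency labels in `words` are assigned by hand for illustration.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(eq=False)
class Tok:
    """Hypothetical stand-in exposing the token attributes used below."""
    i: int
    text: str
    dep_: str
    head: Optional["Tok"] = None

def subject_verb_object(tokens):
    """Return (subject, head, object) triples for heads that have both an nsubj and a dobj."""
    subj, obj, heads = {}, {}, {}
    for tok in tokens:
        if tok.dep_ == "nsubj":
            subj[tok.head.i] = tok.text
            heads[tok.head.i] = tok.head.text
        elif tok.dep_ == "dobj":
            obj[tok.head.i] = tok.text
            heads[tok.head.i] = tok.head.text
    return [(subj[i], heads[i], obj[i]) for i in sorted(heads) if i in subj and i in obj]

# (text, dependency label, index of head); labels are hand-assigned, not parser output.
words = [("It", "nsubj", 2), ("has", "aux", 2), ("lost", "ROOT", 2),
         ("a", "det", 4), ("chunk", "dobj", 2), ("of", "prep", 4),
         ("stolen", "amod", 7), ("territory", "pobj", 5)]
toks = [Tok(i, text, dep) for i, (text, dep, _) in enumerate(words)]
for tok, (_, _, head_i) in zip(toks, words):
    tok.head = toks[head_i]

triples = subject_verb_object(toks)
print(triples)
```

Because the function only reads `i`, `text`, `dep_`, and `head`, it should also accept real spaCy tokens, for example `subject_verb_object(sent)` for each `sent` in `doc.sents`. Passive subjects (`nsubjpass`) and prepositional objects are deliberately left out of this sketch.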

## 4. Show the most common Entities and their types.

In [15]:
for entity in doc.ents:
    print(entity,entity.label_)

Ukraine GPE
Russia GPE
Ukraine GPE
Kyiv PERSON
Donbas GPE
Kharkiv GPE
this week DATE
Kherson PERSON
February DATE
Vladimir Putin PERSON
Russia GPE
Ukraine GPE
Ukraine GPE
Ukraine GPE
eight months DATE
Putin PERSON
Stalin PERSON
1941 DATE
Russian NORP
America GPE
Europe LOC
Ukraine GPE
the billions of dollars MONEY
every month DATE
Russia GPE
Western NORP
Russia GPE
Tens of thousands CARDINAL
Rome GPE
November 5th DATE
one CARDINAL
America GPE
Democrats NORP
America GPE
Republicans NORP
November 8th DATE
American NORP
two years DATE
Ukraine GPE
Jake Sullivan PERSON
Biden PERSON
November 4th DATE
Ukraine GPE
Russian NORP
November 9th DATE
Biden PERSON
Russia GPE
Ukraine GPE
Kherson PERSON
Ukraine GPE
Western NORP
Ukrainian GPE
Ukraine GPE
Finland GPE
decades DATE
West Germany GPE
half CARDINAL
Israel GPE
America GPE
Russia GPE
Ukraine GPE
one CARDINAL
roughly 100,000 CARDINAL
Kherson PERSON
Putin PERSON
Russian NORP
Dnieper ORG
Putin PERSON
hundreds of thousands CARDINAL
next year DATE
U

In [16]:
ent_list = []
label_list = []
for entity in doc.ents:
    ent_list.append(entity.lemma_)
    label_list.append(entity.label_)
ent_list[0:5]

['Ukraine', 'Russia', 'Ukraine', 'Kyiv', 'Donbas']

In [17]:
label_list[0:5]

['GPE', 'GPE', 'GPE', 'PERSON', 'GPE']

In [18]:
ent_df = pd.DataFrame(columns=['entity','label'])
ent_df['entity'], ent_df['label'] = ent_list, label_list
ent_df.head()

    entity   label
0  Ukraine     GPE
1   Russia     GPE
2  Ukraine     GPE
3     Kyiv  PERSON
4   Donbas     GPE


In [19]:
ent_df.groupby(['entity','label']).size().sort_values(ascending=False)[0:10]

entity      label 
Ukraine     GPE       52
Russia      GPE       30
Putin       PERSON    16
russian     NORP       8
America     GPE        7
West        LOC        6
american    NORP       5
Ukrainians  NORP       5
Europe      LOC        5
Mr Putin    PERSON     5
dtype: int64
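
The same tally can be made without pandas by counting `(entity, label)` pairs directly. This is a minimal sketch under the assumption that the pairs are available as tuples; `top_entities` is a hypothetical helper name, and the sample list reuses the first five entities printed above.

```python
from collections import Counter

def top_entities(pairs, n=3):
    """Return the n most frequent (entity, label) pairs with their counts."""
    return Counter(pairs).most_common(n)

# First five (entity, label) pairs from the notebook's entity output.
pairs = [("Ukraine", "GPE"), ("Russia", "GPE"), ("Ukraine", "GPE"),
         ("Kyiv", "PERSON"), ("Donbas", "GPE")]

top = top_entities(pairs)
# Counting only the labels shows which entity types dominate.
label_counts = Counter(label for _, label in pairs)
print(top)
print(label_counts)
```

With the notebook's `doc`, the pairs could be built as `[(ent.lemma_, ent.label_) for ent in doc.ents]`.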

## 5. Find Entities and their dependency (hint: entity.root.head)

In [20]:
for entity in doc.ents[0:10]:
    print(entity, entity.root.head)

Ukraine stop
Russia attack
Ukraine capital
Kyiv attack
Donbas of
Kharkiv of
this week announced
Kherson from
February in
Vladimir Putin sought


## 6. Find the most similar words in the article

In [33]:
# Keep the Token objects (not lemma strings) so that .similarity() can be called on them.
token_list = [token for token in doc
              if not (token.is_stop or token.is_punct or token.is_space)]

In [34]:
token_list[0:5]

[terms, war, Ukraine, stop, Russia]

In [56]:
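import warnings
# en_core_web_sm ships without static word vectors, so Token.similarity() warns on
# every call and the scores are only approximate. Silence the repeated warning here.
warnings.filterwarnings('ignore')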

In [71]:
# For each token, keep its highest-scoring partner that has different text.
# Rows are collected in a list because DataFrame.append was removed in pandas 2.0.
rows = []
for count, token in enumerate(token_list, start=1):
    best = None
    for token2 in token_list:
        if token.text == token2.text:
            continue  # skip comparisons of a word with itself
        score = token.similarity(token2)
        if best is None or score > best['sim_score']:
            best = {'token1': token.text, 'token2': token2.text, 'sim_score': score}
    if best is not None:
        rows.append(best)
    print(count,' of ',len(token_list))
sim = pd.DataFrame(rows, columns=['token1','token2','sim_score'])

1  of  1495
2  of  1495
3  of  1495
...

In [76]:
sim.sort_values('sim_score',ascending=False).reset_index(drop=True).loc[0]

token1        British
token2       American
sim_score    0.936294
Name: 0, dtype: object
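
The loop above scores every ordered pair of the 1,495 tokens, about 2.2 million calls, and repeats work for words that occur more than once. One possible alternative is to keep a single vector per distinct word and score each unordered pair once. The sketch below assumes a dict mapping word to vector; the three toy vectors are made up for illustration, and `cosine` and `most_similar_pair` are hypothetical helper names.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def most_similar_pair(vectors):
    """vectors: dict of word -> vector. Returns (word1, word2, score) or None."""
    best = None
    # combinations() yields each unordered pair of distinct words exactly once.
    for (w1, v1), (w2, v2) in combinations(vectors.items(), 2):
        score = cosine(v1, v2)
        if best is None or score > best[2]:
            best = (w1, w2, score)
    return best

# Toy 3-dimensional vectors (made up for illustration, not spaCy output).
vectors = {"British": [0.9, 0.1, 0.0],
           "American": [0.8, 0.2, 0.0],
           "weapon": [0.0, 0.1, 0.9]}

best = most_similar_pair(vectors)
print(best)
```

With the notebook's tokens, the dict could be built as `{t.text: t.vector for t in token_list}`. That keeps only the last occurrence of each word, which is an assumption worth checking because the small model's vectors depend on context.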