# NLP
Find your favorite news source and grab the article text.

1. Show the most common words in the article.
2. Show the most common words under a part of speech (e.g. NOUN: {'Bob': 12, 'Alice': 4}).
3. Find a subject/object relationship through the dependency parser in any sentence.
4. Show the most common entities and their types.
5. Find entities and their dependencies (hint: entity.root.head).
6. Find the most similar words in the article.

Note: Yes, the notebook from the video is not provided. I leave it to you to make your own :) It's your final assignment for the semester. Enjoy!

In [1]:
from bs4 import BeautifulSoup
import requests
import re
import smartquote
import pandas as pd
import spacy
nlp = spacy.load("en_core_web_lg")

In [2]:
site = "https://www.foxsports.com/stories/soccer/christian-pulisic-scores-biggest-usa-goal-in-12-years-as-u-s-advances"
res = requests.get(site)
soup = BeautifulSoup(res.content, "html.parser")

In [3]:
content = soup.find(class_ = 'story-content').text

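# Keep only the article body, from the dateline to the closing sentence.
article = re.findall(r'DOHA, Qatar.+pain of failure.', content)[0]
# Drop the photo caption and the goal caption embedded in the story text.
article = re.sub(r'Deandre Yedlin .+ \(Photo by Claudio Villa/Getty Images\)', '', article)
article = re.sub(r"USA's .+ in 38'", '', article)
article = smartquote.substitute(article)
# Add the missing space after a period that runs into the next word, then collapse whitespace.
article = re.sub(r'\.(?=\w)', '. ', article)
article = re.sub(r'\s+', ' ', article)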
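The last two substitutions in the cell above are easy to check on a toy string. The sketch below wraps them in a hypothetical helper, `normalize_spacing` (not part of the notebook), and runs it on invented input. One thing worth keeping in mind: the lookahead also fires inside decimals, so "2.5" comes out as "2. 5".

```python
import re

def normalize_spacing(text):
    """Hypothetical helper applying the same two substitutions as the cleaning cell."""
    # Insert a space after a period that is immediately followed by a word character.
    text = re.sub(r'\.(?=\w)', '. ', text)
    # Collapse any run of whitespace (spaces, tabs, newlines) into a single space.
    return re.sub(r'\s+', ' ', text)

print(normalize_spacing("He scored.The  crowd\nroared."))  # He scored. The crowd roared.
print(normalize_spacing("A 2.5 goal line"))                # A 2. 5 goal line
```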

In [4]:
with open("article.txt", "w", encoding="utf-8") as text_file:
    text_file.write(article)

1. Show the most common words in the article

In [5]:
doc = nlp(article)
tokens = list(doc)

In [6]:
tokens_text = [token.text for token in tokens
          if not token.is_punct
          and not token.is_stop
          and not token.is_digit
          ]

In [7]:
words = pd.Series(tokens_text).value_counts(ascending = False)

In [8]:
words[0:9]

Iran         9
goal         7
face         5
Pulisic      5
ball         5
Americans    4
game         4
winning      3
team         3
dtype: int64

2. Show the most common words under a part of speech (e.g. NOUN: {'Bob': 12, 'Alice': 4}).

In [9]:
tokens_pos = [token.pos_ for token in tokens
          if not token.is_punct
          and not token.is_stop
          and not token.is_digit
          ]

In [10]:
tokens_df = pd.DataFrame(zip(tokens_text, tokens_pos), columns = ['word', 'pos'])

In [11]:
with pd.option_context('display.max_rows', None):
    print(tokens_df.groupby(by='pos')['word'].value_counts(ascending=False))

pos    word         
ADJ    American         2
       defensive        2
       free             2
       abdominal        1
       better           1
       busier           1
       calm             1
       cautious         1
       different        1
       emotional        1
       excellent        1
       fierce           1
       final            1
       finer            1
       heroic           1
       important        1
       little           1
       national         1
       new              1
       old              1
       paramount        1
       past             1
       political        1
       potential        1
       ready            1
       rife             1
       set              1
       sizable          1
       speedy           1
       total            1
       ultra            1
       unflinching      1
       unmarked         1
       winded           1
       young            1
ADP    alongside        1
ADV    forward          2
       longer    
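The assignment asks for output shaped like NOUN: {'Bob': 12, 'Alice': 4}, while the groupby above prints a long Series. One possible way to get the dictionary shape is sketched below with `collections.Counter`. The helper name `pos_word_counts` and the sample pairs are invented for illustration; in the notebook you would pass `zip(tokens_text, tokens_pos)` instead.

```python
from collections import Counter, defaultdict

def pos_word_counts(pairs):
    """Hypothetical helper: map each POS tag to a {word: count} dict, most frequent first."""
    counters = defaultdict(Counter)
    for word, pos in pairs:
        counters[pos][word] += 1
    # most_common() sorts by count (descending); the dict keeps that order.
    return {pos: dict(counter.most_common()) for pos, counter in counters.items()}

# Made-up sample; in the notebook, pass zip(tokens_text, tokens_pos) instead.
sample = [("goal", "NOUN"), ("ball", "NOUN"), ("goal", "NOUN"), ("scored", "VERB")]
print(pos_word_counts(sample))
# {'NOUN': {'goal': 2, 'ball': 1}, 'VERB': {'scored': 1}}
```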

3. Find a subject/object relationship through the dependency parser in any sentence.

In [12]:
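def pr_tree(word, level):
    """Print the subtree under `word` in order (left children, word, right children), one tab per level."""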
    if word.is_punct:
        return
    for child in word.lefts:
        pr_tree(child, level + 1)
    print('\t' * level + word.text + ' - ' + word.dep_)
    for child in word.rights:
        pr_tree(child, level + 1)

In [13]:
sel_sent = 11  # index of the sentence to inspect
for i, sent in enumerate(doc.sents):
    if i == sel_sent:
        print(sent)
        sel_sent_text = sent  # keep the Span for displacy below
        pr_tree(sent.root, 0)
        print('-' * 100)

He had done enough, if you consider scoring the most immediately important goal by an American man in 12 years to be "enough.
	He - nsubj
	had - aux
done - ROOT
	enough - dobj
		if - mark
		you - nsubj
	consider - advcl
		scoring - xcomp
				the - det
						most - advmod
					immediately - advmod
				important - amod
			goal - dobj
				by - prep
						an - det
						American - amod
					man - pobj
			in - prep
					12 - nummod
				years - pobj
			to - aux
		be - xcomp
			enough - acomp
----------------------------------------------------------------------------------------------------


In [14]:
spacy.displacy.render(sel_sent_text, style = 'dep')
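The tree shows the labels, but the subject/object relationship still has to be read off by eye. A minimal sketch of pulling it out programmatically follows. It pairs every `nsubj`/`nsubjpass` token with every `dobj` token that shares the same head. The helper name `subject_object_pairs` is hypothetical, and it only assumes spaCy-style tokens with `.text`, `.dep_` and `.head`.

```python
def subject_object_pairs(tokens):
    """Hypothetical helper: yield (subject, verb, object) text triples.

    `tokens` is any iterable of spaCy-style tokens exposing .text, .dep_ and .head.
    """
    subjects, objects = {}, {}
    for token in tokens:
        # Group subjects and direct objects by the token they attach to.
        if token.dep_ in ("nsubj", "nsubjpass"):
            subjects.setdefault(token.head, []).append(token)
        elif token.dep_ == "dobj":
            objects.setdefault(token.head, []).append(token)
    # A head that has both a subject and an object gives one triple per combination.
    for head, subs in subjects.items():
        for sub in subs:
            for obj in objects.get(head, []):
                yield sub.text, head.text, obj.text
```

In the notebook this would be called as `list(subject_object_pairs(sel_sent_text))`. Going by the tree printed above, the only head with both a subject and a direct object is "done", so the expected triple is ('He', 'done', 'enough').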

4. Show the most common Entities and their types. 

In [15]:
ent_text = []
ent_type = []
for ent in doc.ents:
    ent_text.append(ent.text)
    ent_type.append(ent.label_)

In [16]:
ent_df = pd.DataFrame(
    zip(ent_text, ent_type),
    columns = ['text', 'type']
)

In [17]:
ent_df.groupby(by='type')['text'].value_counts(ascending=False)

type      text                  
CARDINAL  0                         1
          1                         1
          10                        1
          16                        1
          one                       1
DATE      12 years                  1
          24-year-old               1
          Saturday                  1
          the first half            1
          the past days             1
EVENT     the World Cup             1
FAC       Al Thumama Stadium        1
GPE       Iran                      9
          USA                       2
          England                   1
          Netherlands               1
          Qatar                     1
          the United States         1
NORP      Americans                 4
          American                  3
ORDINAL   first                     3
ORG       Dest                      2
          Pulisic                   2
          Turner                    2
          Carter-Vickers            1
          Champio

5. Find entities and their dependencies (hint: entity.root.head)

In [18]:
sel_ent = 8  # index of the entity to inspect
for i, ent in enumerate(doc.ents):
    if i == sel_ent:
        print(ent, "-" * 5, ent.label_)
        pr_tree(ent.root.head, 0)
        print('*' * 100)

Pulisic ----- ORG
	Pulisic - nsubj
				the - det
			superstar - nsubj
		figurehead - appos
			for - prep
					this - det
					young - amod
					USA - compound
				team - pobj
saw - ROOT
		that - det
	equation - dobj
	liked - conj
		it - dobj
		and - cc
		decided - conj
				to - aux
			go - xcomp
				with - prep
						the - det
						winning - amod
					option - pobj
****************************************************************************************************
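The cell above inspects a single entity. To see the dependency of every entity at once, one option is to collect each entity's root dependency label and head word into rows. The sketch below does that with a hypothetical helper, `entity_heads`, which only assumes spaCy-style spans exposing `.text`, `.label_` and `.root`.

```python
def entity_heads(ents):
    """Hypothetical helper: list (entity, label, dependency, head word) for each entity.

    `ents` is any iterable of spaCy-style spans exposing .text, .label_ and .root.
    """
    rows = []
    for ent in ents:
        root = ent.root  # the token that ties the entity to the rest of the sentence
        rows.append((ent.text, ent.label_, root.dep_, root.head.text))
    return rows
```

In the notebook the rows could go straight into a frame, for example `pd.DataFrame(entity_heads(doc.ents), columns=['text', 'type', 'dep', 'head'])`. Going by the tree printed above, the Pulisic entity would show up as ('Pulisic', 'ORG', 'nsubj', 'saw').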


6. Find the most similar words in the article

In [19]:
doc = nlp(article.lower())
compare_list = []  # similarity scores already printed, used to skip mirrored pairs
for token1 in doc:
    for token2 in doc:
        sim = token1.similarity(token2)
        if 0.75 < sim < 1:
            if token1.text == token2.text or sim in compare_list:
                continue
            print(f'{token1.text}: {token2.text} {"-"*5} {sim}')
            compare_list.append(sim)


scored: scoring ----- 0.8124553561210632
goal: goals ----- 0.80790114402771
that: because ----- 0.7600496411323547
that: however ----- 0.7528941631317139
tuesday: saturday ----- 0.8429607152938843
really: because ----- 0.772476315498352
really: just ----- 0.7586804628372192
anyone: anything ----- 0.7538626790046692
because: but ----- 0.8247512578964233
because: though ----- 0.8218691349029541
because: however ----- 0.7560612559318542
nothing: everything ----- 0.8324678540229797
nothing: anything ----- 0.8896462917327881
everything: anything ----- 0.8457676768302917
team: teams ----- 0.8131204843521118
too: so ----- 0.8098674416542053
but: though ----- 0.8362677097320557
but: however ----- 0.776674747467041
may: will ----- 0.7649534940719604
38: 52 ----- 0.910517692565918
defense: defensive ----- 0.789395809173584
midfield: goalkeeper ----- 0.7685624957084656
midfield: striker ----- 0.7810852527618408
forward: forwards ----- 0.8401443958282471
24: 12 ----- 0.790008544921875
24: 16 -----
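The nested loop compares every token with every other token, so repeated words and mirrored pairs are scored many times over. One possible alternative is to score each unordered pair of distinct words exactly once. The sketch below does this with plain cosine similarity, which, as far as I know, is what `Token.similarity` computes by default. The names `cosine` and `similar_pairs` are hypothetical and the toy vectors are invented; in the notebook the dictionary could be built as `{t.text: t.vector for t in doc if t.has_vector}`.

```python
from itertools import combinations
from math import sqrt

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def similar_pairs(vectors, threshold=0.75):
    """Hypothetical helper: score each unordered word pair once, best match first."""
    pairs = []
    for (w1, v1), (w2, v2) in combinations(vectors.items(), 2):
        sim = cosine(v1, v2)
        if sim > threshold:
            pairs.append((w1, w2, sim))
    return sorted(pairs, key=lambda p: p[2], reverse=True)

# Invented toy vectors, only to show the shape of the result.
toy = {"goal": [1.0, 0.1], "goals": [0.9, 0.2], "team": [0.0, 1.0]}
for w1, w2, sim in similar_pairs(toy):
    print(f"{w1}: {w2} ----- {sim:.3f}")
```

Because the dictionary keys are unique, a word is never compared with itself, so the `sim < 1` guard from the loop above is not needed here.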