# NLP
Find your favorite news source and grab the article text.

1. Show the most common words in the article.
2. Show the most common words under each part of speech (e.g. NOUN: {'Bob': 12, 'Alice': 4}).
3. Find a subject/object relationship through the dependency parser in any sentence.
4. Show the most common entities and their types.
5. Find entities and their dependencies (hint: entity.root.head).
6. Find the most similar words in the article.

Note: Yes, the notebook from the video is not provided. I leave it to you to make your own :) It's your final assignment for the semester. Enjoy!

In [74]:
from bs4 import BeautifulSoup
import requests
import re
import smartquote
import pandas as pd
import spacy
nlp = spacy.load("en_core_web_lg")

### Tutorial on how to web scrape an article
1. Send Request
 - Requests is a Python library for sending HTTP requests; it points your computer to a specific URL.
    Every time you click on a website in the browser, your computer sends a 'request' to the URL.
    It is called a request because the server can refuse it, for example if you don't have permission to view that URL.
 - Note: keep the request in a separate cell and only rerun it when absolutely necessary.
    You could otherwise overload the server, and many sites automatically block an IP address
    that exceeds a certain number of requests per minute.
2. Parse it using BeautifulSoup
 - After sending your request, parse the response with BeautifulSoup using its 'html.parser' parser.

3. Find specific element
 - soup.find is the method you use to find the specific HTML element you are looking for.
 - .text is needed because you don't want all the '<div class = ...>' markup.
 - For this specific example, right-click the article on this website where it says "DOHA, Qatar" and click 'Inspect'.
   Keep scrolling toward the top of your Elements window until you mouse over the element that contains all the content.
   In this case the class is 'story-content'. Keep in mind that this step is trial and error, and it may take some time
   to figure out whether you have the right element.
 - When working with articles, the class is usually something like [story, content, article, article-content].

4. Clean text
 - Steps I performed here
   1. Get rid of all of the text before and after the article
   2. Get rid of all of the advertisements and photo captions within the article
   3. Substitute normal quotes for smart quotes
    - Smart quotes are explained here: https://en.wikipedia.org/wiki/Quotation_marks_in_English#Smart_quotes
    I believe this is only a problem on Windows, but I'm not sure. This step is necessary; otherwise you will run into
    a lot of problems later on
   4. Replace a period followed directly by a word (".word") with ". word". This helps because spaCy often depends on spaces between words to distinguish them
   5. Replace any run of two or more spaces with a single space
5. (Optional) Write to a text file or CSV

In [59]:
site = "https://www.foxsports.com/stories/soccer/christian-pulisic-scores-biggest-usa-goal-in-12-years-as-u-s-advances"
res = requests.get(site)
soup = BeautifulSoup(res.content, "html.parser")
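
One way to follow the advice about not re-sending requests is to save the downloaded HTML to disk and read it back on later runs. The sketch below is only an illustration: `load_html`, `fetch_with_requests`, and the cache file name are hypothetical names that do not appear in the original notebook, and the 10-second timeout is an arbitrary choice.

```python
from pathlib import Path


def load_html(url, cache_path, fetch):
    """Return the page HTML, calling fetch(url) only if no cached copy exists."""
    cache = Path(cache_path)
    if cache.exists():
        # A cached copy is present, so no request is sent.
        return cache.read_text(encoding="utf-8")
    html = fetch(url)
    cache.write_text(html, encoding="utf-8")
    return html


def fetch_with_requests(url):
    """Download a page with Requests (hypothetical helper, assumes requests is installed)."""
    # Imported here so the caching logic can be tried without network access.
    import requests

    res = requests.get(url, timeout=10)
    res.raise_for_status()  # Fail loudly on 4xx/5xx instead of parsing an error page.
    return res.text
```

With these helpers, `html = load_html(site, "article.html", fetch_with_requests)` followed by `BeautifulSoup(html, "html.parser")` would hit the server only on the first run; deleting the cache file forces a fresh download.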

In [60]:
content = soup.find(class_ = 'story-content').text

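# Keep only the article body, from the dateline to the closing sentence.
article = re.findall(r'DOHA, Qatar.+pain of failure.', content)[0]
# Drop photo captions embedded in the article text.
article = re.sub(r'Deandre Yedlin .+ \(Photo by Claudio Villa/Getty Images\)', '', article)
article = re.sub(r"USA's .+ in 38'", '', article)
# Replace smart quotes with normal quotes.
article = smartquote.substitute(article)
# Insert a space after a period that is immediately followed by a word character.
article = re.sub(r'\.(?=\w)', '. ', article)
# Collapse any run of whitespace into a single space.
article = re.sub(r'\s+', ' ', article)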
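
The last three cleaning steps can be tried on a small string before running them on the full article. This is a minimal sketch under stated assumptions: the sample text is made up, and the `str.translate` table is a stdlib stand-in for the quote substitution step, not the library call used in the notebook.

```python
import re

# Hypothetical sample text, not taken from the article.
sample = "He said \u201cgo\u201d.Then they scored.It\u2019s   over.\n\nDone."

# Map curly quotes to straight ones (stand-in for the smart quote substitution).
QUOTES = str.maketrans({"\u201c": '"', "\u201d": '"', "\u2018": "'", "\u2019": "'"})


def clean(text):
    text = text.translate(QUOTES)
    # Add a space after a period that is directly followed by a word character.
    text = re.sub(r"\.(?=\w)", ". ", text)
    # Collapse runs of spaces, tabs, and newlines into one space.
    text = re.sub(r"\s+", " ", text)
    return text.strip()


cleaned = clean(sample)
print(cleaned)  # He said "go". Then they scored. It's over. Done.
```

Note that `\w` also matches digits, so this pattern turns a decimal such as "3.5" into "3. 5". If the article contains scores or statistics written that way, a narrower lookahead such as `(?=[A-Za-z])` may be a safer choice.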

In [54]:
# Write with an explicit encoding so the file is the same on every platform.
with open("article.txt", "w", encoding="utf-8") as text_file:
    text_file.write(article)

1. Show the most common words in the article

In [118]:
doc = nlp(article)
tokens = [token for token in doc]

In [119]:
tokens_text = [token.text for token in tokens
          if not token.is_punct
          and not token.is_stop
          and not token.is_digit
          ]

In [120]:
words = pd.Series(tokens_text).value_counts(ascending = False)

In [121]:
words[0:9]

Iran         9
goal         7
face         5
Pulisic      5
ball         5
Americans    4
game         4
winning      3
team         3
dtype: int64
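
The same kind of frequency table can be produced without pandas. The sketch below uses `collections.Counter` from the standard library on a made-up token list that stands in for `tokens_text`; the counts are illustrative, not results from the article.

```python
from collections import Counter

# Hypothetical token list standing in for tokens_text.
sample_tokens = ["Iran", "goal", "Iran", "ball", "goal", "Iran"]

counts = Counter(sample_tokens)
# most_common(n) returns the n most frequent items as (item, count) pairs.
top = counts.most_common(2)
print(top)  # [('Iran', 3), ('goal', 2)]
```

Counting is case-sensitive in both approaches, so "Goal" and "goal" are separate entries. Lowercasing the text or counting `token.lemma_` instead of `token.text` would merge such variants, at the cost of losing the capitalization of names.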

2. Show the most common words under each part of speech (e.g. NOUN: {'Bob': 12, 'Alice': 4})

In [132]:
l = [('Bob','Noun'), ('Alice','Noun'), ('Ran', 'Verb')]


('Bob', 'Noun')

TypeError: list.count() takes exactly one argument (0 given)
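
`list.count()` needs the item to count as its argument, which is why calling it with no argument raises the TypeError above. For the dictionary format named in the task, one possible approach is to keep a `Counter` per part of speech. This is a sketch on made-up (word, pos) pairs, not the method used later in this notebook.

```python
from collections import Counter, defaultdict

# Hypothetical (word, pos) pairs; in the notebook these would come from spaCy tokens.
pairs = [("Bob", "NOUN"), ("Alice", "NOUN"), ("Bob", "NOUN"), ("ran", "VERB")]

# One Counter per part of speech, created on first use.
by_pos = defaultdict(Counter)
for word, pos in pairs:
    by_pos[pos][word] += 1

# Convert to plain dicts, with words ordered from most to least frequent.
result = {pos: dict(counter.most_common()) for pos, counter in by_pos.items()}
print(result)  # {'NOUN': {'Bob': 2, 'Alice': 1}, 'VERB': {'ran': 1}}
```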

In [140]:
tokens_pos = [token.pos_ for token in tokens
          if not token.is_punct
          and not token.is_stop
          and not token.is_digit
          ]

In [146]:
tokens_df = pd.DataFrame(zip(tokens_text, tokens_pos), columns = ['word', 'pos'])

In [159]:
with pd.option_context('display.max_rows', None):
    print(tokens_df.groupby(by='pos')['word'].value_counts(ascending = False))

pos    word         
ADJ    American         2
       defensive        2
       free             2
       abdominal        1
       better           1
       busier           1
       calm             1
       cautious         1
       different        1
       emotional        1
       excellent        1
       fierce           1
       final            1
       finer            1
       heroic           1
       important        1
       little           1
       lively           1
       national         1
       new              1
       old              1
       past             1
       political        1
       potential        1
       ready            1
       rife             1
       set              1
       sizable          1
       speedy           1
       total            1
       ultra            1
       unflinching      1
       unmarked         1
       winded           1
       wracking         1
       young            1
ADP    alongside        1
ADV    forward   