# Module 39 Topic Review: Natural Language Processing (NLP)

## Text Vectorization & Tokenization

In [106]:
import nltk
from nltk.corpus import gutenberg,stopwords
from nltk.collocations import *
from nltk import FreqDist
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer
import string
import re

In [107]:
space = """
It’s a marvelous day for a moon launch
NASA rocket in FloridaJoe Raedle/Getty Images
Did you know that a human hasn’t stepped foot on the moon since 1972?

Well, NASA wants to change that, and today marks a major milestone on its quest to put astronauts back on the lunar surface. At 8:33am ET, the agency is set to launch its new moon rocket called the Space Launch System (SLS), the most powerful rocket ever built. The uncrewed mission will head toward the moon and complete one and a half orbits during its 42-day mission.

If successful, the voyage will set the stage for a crewed “flyby” mission to the moon in 2024, teeing up a potential landing as soon as the following year.

But this is rocket science after all, and most things don’t go according to plan. The SLS was first ordered by Congress in 2010, and is only now hitting the launchpad after numerous delays and billions in cost overruns. Plus, getting humans on the moon will require not only the rocket but also a vehicle to send astronauts from the capsule to the moon’s surface. SpaceX has been tapped to provide that lunar lander, though it hasn’t successfully reached orbit yet.

“I would say simply that space is hard,” NASA Administrator Bill Nelson deadpanned on Saturday.

So why even go to the moon?
Here are a few reasons why NASA thinks it’s worth the trouble, per NPR.

Science: Lunar geologists say some parts of the moon are pivotal for understanding the beginnings of the solar system, because there’s no atmosphere or flowing water to erode rocks.
Dress rehearsal for Mars: Before astronauts head to the Red Planet, they can work out all the kinks on the moon, which is 200 times closer to Earth than Mars.
Marketing: Hey, we’re talking about NASA, right? Doing big, buzzy projects like moon landings could help boost the reputation of the agency and also inspire more Americans to pursue science and engineering careers.
Final fun fact: NASA’s new moon program is called Artemis. In Greek myth, Artemis was the twin sister of Apollo, the name of NASA’s OG moon program.
"""

world = """
Tour de headlines
Flooded street in PakistanAkram Shahid/AFP via Getty Images
 Pakistan faces a “climate catastrophe.” Pakistan officials said on Sunday that flooding from monsoon season has killed more than 1,000 people since mid-June, including 119 in the previous 24 hours. Flash floods that have destroyed villages and affected at least 33 million people amounted to a “climate-induced humanitarian disaster of epic proportions,” Pakistan’s climate change minister said.

 $3 movie tickets are coming. You’ll be able to tell your grandkids, “Back in my day, I paid $3 for a movie ticket,” because on Saturday the majority of movie theaters in the US will sell tickets for $3. The National Cinema Day initiative, launched by the nonprofit Cinema Foundation, is an effort to get people back into theaters during Labor Day weekend, which is typically one of the weakest all year. Plus, domestic box office sales this summer are still lagging 2019 levels by 20%.

 Baseball card sells for $12.6 million. A 1952 Mickey Mantle baseball card in mint condition sold for $12.6 million at auction—topping a $9.3 million Diego Maradona jersey as the most expensive piece of sports memorabilia ever sold. The sports collectibles space has exploded in popularity during the pandemic. In 2018, the size of the market was estimated to be around $5.4 million. By 2021, it grew to $26 billion.
"""

education = """
Missouri school district brings back spanking
Tom spanking Jerry in the cartoonTom and Jerry/Warner Bros. via Giphy
If all other disciplinary actions fail, teachers in Missouri’s Cassville R-IV School District will be allowed to spank a student with a paddle, parents learned last week. The superintendent said the school board brought back the practice after parents asked for more punishments other than suspension.

The parent has to approve the spanking of their child, but once they do, a teacher can use “reasonable physical force” on a student but give no “chance of bodily injury or harm.” A witness has to be present during the spanking and a teacher or principal must give notice to the superintendent justifying the punishment.

Big picture: You may be surprised to learn that corporal punishment—as it’s formally known—is legal in 19 states. That’s because a Supreme Court decision in 1977 said the technique was constitutional and let each state decide on its own rules.

More than 69,000 children were punished physically in the 2017–2018 school year, according to the most recent data.

Spanking in schools has loads of critics, including the American Academy of Pediatrics and American Psychological Association, which say it’s not effective and can give students trauma. A 2016 study found that boys, Black kids, and children with disabilities were more likely to be paddled than their peers.
"""


In [108]:
pattern = "([a-zA-Z]+(?:'[a-z]+)?)"
space_tokens_raw = nltk.regexp_tokenize(space,pattern)
space_tokens_raw

['It',
 's',
 'a',
 'marvelous',
 'day',
 'for',
 'a',
 'moon',
 'launch',
 'NASA',
 'rocket',
 'in',
 'FloridaJoe',
 'Raedle',
 'Getty',
 'Images',
 'Did',
 'you',
 'know',
 'that',
 'a',
 'human',
 'hasn',
 't',
 'stepped',
 'foot',
 'on',
 'the',
 'moon',
 'since',
 'Well',
 'NASA',
 'wants',
 'to',
 'change',
 'that',
 'and',
 'today',
 'marks',
 'a',
 'major',
 'milestone',
 'on',
 'its',
 'quest',
 'to',
 'put',
 'astronauts',
 'back',
 'on',
 'the',
 'lunar',
 'surface',
 'At',
 'am',
 'ET',
 'the',
 'agency',
 'is',
 'set',
 'to',
 'launch',
 'its',
 'new',
 'moon',
 'rocket',
 'called',
 'the',
 'Space',
 'Launch',
 'System',
 'SLS',
 'the',
 'most',
 'powerful',
 'rocket',
 'ever',
 'built',
 'The',
 'uncrewed',
 'mission',
 'will',
 'head',
 'toward',
 'the',
 'moon',
 'and',
 'complete',
 'one',
 'and',
 'a',
 'half',
 'orbits',
 'during',
 'its',
 'day',
 'mission',
 'If',
 'successful',
 'the',
 'voyage',
 'will',
 'set',
 'the',
 'stage',
 'for',
 'a',
 'crewed',
 'flyby

In [109]:
space_tokens = [token.lower() for token in space_tokens_raw]
space_FreqDist = FreqDist(space_tokens)
space_FreqDist

FreqDist({'the': 29, 'moon': 13, 'to': 13, 'a': 9, 'and': 8, 'nasa': 7, 's': 6, 'on': 6, 'is': 6, 'rocket': 5, ...})
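
The `'s': 6` entry above is a side effect of the pattern rather than a real word. The pattern only allows a straight apostrophe (`'`) inside a token, while the article text uses the typographic apostrophe (`’`), so "It’s" is split into `It` and `s`. Below is a minimal sketch of one possible workaround, assuming it is acceptable to normalize apostrophes before tokenizing. The helper names `normalize_apostrophes` and `tokenize` are hypothetical, and plain `re.findall` stands in for `nltk.regexp_tokenize` to keep the example self-contained.

```python
import re
from collections import Counter

# Same token shape as the notebook's pattern, written without the capturing group.
PATTERN = r"[a-zA-Z]+(?:'[a-z]+)?"

def normalize_apostrophes(text):
    """Replace the typographic apostrophe (U+2019) with a straight one."""
    return text.replace("\u2019", "'")

def tokenize(text):
    """Return lowercase word tokens, keeping contractions such as "it's" whole."""
    return [tok.lower() for tok in re.findall(PATTERN, normalize_apostrophes(text))]

sample = ("It’s a marvelous day for a moon launch. "
          "Did you know that a human hasn’t stepped foot on the moon since 1972?")
tokens = tokenize(sample)
counts = Counter(tokens)

print(tokens[:4])
print(counts.most_common(2))
```

With the apostrophes normalized, contractions such as "it's" should match the straight-apostrophe entries in NLTK's English stop word list instead of leaving stray `s` and `t` tokens behind.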

### Stop Words
Stop words are common words that appear in nearly every corpus and contribute little to the meaning or context of a piece of natural language data (e.g. a phrase or a text document). Digits are sometimes treated as stop words as well. Stop words are usually removed from the corpus before vectorizing a text document.

In [110]:
stopwords_list = stopwords.words('english')
stopwords_list += list(string.punctuation)
stopwords_list += ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']

space_stopped = [word for word in space_tokens if word not in stopwords_list]
space_stopped_FreqDist = FreqDist(space_stopped)
print("corpus size: ",len(space_stopped_FreqDist))
space_stopped_FreqDist


corpus size:  157


FreqDist({'moon': 13, 'nasa': 7, 'rocket': 5, 'launch': 3, 'astronauts': 3, 'lunar': 3, 'mission': 3, 'science': 3, 'day': 2, 'surface': 2, ...})

### Term Frequency & Inverse Document Frequency TF-IDF

$$TF(t)= \frac{\text{number of times t appears in a document}}{\text{total number of terms in the document}}$$

$$IDF(t) = \log_{e}\left(\frac{\text{total number of documents}}{\text{number of documents with t in it}}\right)$$
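
The two quantities are usually multiplied to give a single weight per term and document: TF-IDF(t) = TF(t) × IDF(t). The following is a minimal sketch of that calculation, applying the two formulas literally to a toy corpus. The `docs` data and the function names `term_frequency` and `inverse_document_frequency` are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

# Toy, already-tokenized documents (hypothetical data for illustration).
docs = {
    "space": ["moon", "rocket", "moon", "nasa"],
    "world": ["flood", "climate", "ticket", "day"],
    "education": ["school", "day", "student", "school"],
}

def term_frequency(tokens):
    """TF(t): occurrences of t divided by the total number of terms in the document."""
    counts = Counter(tokens)
    total = len(tokens)  # total terms, not unique terms
    return {term: count / total for term, count in counts.items()}

def inverse_document_frequency(documents):
    """IDF(t): natural log of (number of documents / documents containing t)."""
    n_docs = len(documents)
    doc_freq = Counter(term for tokens in documents.values() for term in set(tokens))
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}

idf = inverse_document_frequency(docs)
tf_idf = {
    name: {term: tf * idf[term] for term, tf in term_frequency(tokens).items()}
    for name, tokens in docs.items()
}

print(tf_idf["space"]["moon"])  # 0.5 * ln(3)
print(tf_idf["world"]["day"])   # 0.25 * ln(1.5)
```

A term that appears in every document gets an IDF of ln(1) = 0, so its TF-IDF weight vanishes no matter how often it occurs.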

In [111]:
def corpus_digest(text_doc:str):
    """Tokenize a document, drop stop words, and return a FreqDist of the remaining tokens."""
    pattern = "([a-zA-Z]+(?:'[a-z]+)?)"
    tokens_raw = nltk.regexp_tokenize(text_doc,pattern)
    lowered_tokens = [token.lower() for token in tokens_raw]

    stopwords_list = stopwords.words('english')
    stopwords_list += list(string.punctuation)
    stopwords_list += ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
    stopped = [word for word in lowered_tokens if word not in stopwords_list]

    # bigram scores are computed here but are not part of the return value
    bigram_measures = nltk.collocations.BigramAssocMeasures()
    finder = BigramCollocationFinder.from_words(stopped)
    scored = finder.score_ngrams(bigram_measures.raw_freq)

    freq_dist = FreqDist(stopped)
    return freq_dist

In [112]:
corpi = [corpus_digest(space),corpus_digest(world),corpus_digest(education)]

In [113]:
# count total number of instances for each word across all documents
# (FreqDist returns 0 for words it has not seen, so corpus[word] is safe)
big_corpus_tf = {word: 0 for word in (list(corpi[0].keys())+list(corpi[1].keys())+list(corpi[2].keys()))}
for word in big_corpus_tf.keys():
    for corpus in corpi:
        big_corpus_tf[word] += corpus[word]


# count number of documents each word occurs in
big_corpus_doc_freq = {}
for word in big_corpus_tf:
    num_of_docs = 0
    for freq_dist in corpi:
        if word in freq_dist:
            num_of_docs += 1
    big_corpus_doc_freq[word] = num_of_docs


# calculate term frequency and IDF for each document
# NOTE: TF here divides by the number of unique terms (len(corpus.keys())), not the
# total number of terms, and IDF is the raw ratio without the log, so both are
# simplified variants of the formulas above
tf_dicts = []
idf_dicts = []
for corpus in corpi:
    tf_dicts.append({word:(corpus[word]/len(corpus.keys())) for word in corpus})
for corpus in corpi:
    idf_dicts.append({word:(len(corpi)/big_corpus_doc_freq[word]) for word in corpus})


In [114]:
# pair each word with its (TF, IDF) values; the two are stored side by side, not multiplied
space_tf_idf = {word:(tf_dicts[0][word],idf_dicts[0][word]) for word in corpi[0].keys()}
world_tf_idf = {word:(tf_dicts[1][word],idf_dicts[1][word]) for word in corpi[1].keys()}
education_tf_idf = {word:(tf_dicts[2][word],idf_dicts[2][word]) for word in corpi[2].keys()}

In [115]:
space_tf_idf

{'marvelous': (0.006369426751592357, 3.0),
 'day': (0.012738853503184714, 1.5),
 'moon': (0.08280254777070063, 3.0),
 'launch': (0.01910828025477707, 3.0),
 'nasa': (0.044585987261146494, 3.0),
 'rocket': (0.03184713375796178, 3.0),
 'floridajoe': (0.006369426751592357, 3.0),
 'raedle': (0.006369426751592357, 3.0),
 'getty': (0.006369426751592357, 1.5),
 'images': (0.006369426751592357, 1.5),
 'know': (0.006369426751592357, 3.0),
 'human': (0.006369426751592357, 3.0),
 'stepped': (0.006369426751592357, 3.0),
 'foot': (0.006369426751592357, 3.0),
 'since': (0.006369426751592357, 1.5),
 'well': (0.006369426751592357, 3.0),
 'wants': (0.006369426751592357, 3.0),
 'change': (0.006369426751592357, 1.5),
 'today': (0.006369426751592357, 3.0),
 'marks': (0.006369426751592357, 3.0),
 'major': (0.006369426751592357, 3.0),
 'milestone': (0.006369426751592357, 3.0),
 'quest': (0.006369426751592357, 3.0),
 'put': (0.006369426751592357, 3.0),
 'astronauts': (0.01910828025477707, 3.0),
 'back': (0.0

### Bigrams

In [116]:
space_bigram_measures = nltk.collocations.BigramAssocMeasures()
space_finder = BigramCollocationFinder.from_words(space_stopped)
space_scored = space_finder.score_ngrams(space_bigram_measures.raw_freq)
space_scored[:25]

[(('moon', 'program'), 0.00975609756097561),
 (('new', 'moon'), 0.00975609756097561),
 (('according', 'plan'), 0.004878048780487805),
 (('administrator', 'bill'), 0.004878048780487805),
 (('agency', 'also'), 0.004878048780487805),
 (('agency', 'set'), 0.004878048780487805),
 (('also', 'inspire'), 0.004878048780487805),
 (('also', 'vehicle'), 0.004878048780487805),
 (('americans', 'pursue'), 0.004878048780487805),
 (('apollo', 'name'), 0.004878048780487805),
 (('artemis', 'greek'), 0.004878048780487805),
 (('artemis', 'twin'), 0.004878048780487805),
 (('astronauts', 'back'), 0.004878048780487805),
 (('astronauts', 'capsule'), 0.004878048780487805),
 (('astronauts', 'head'), 0.004878048780487805),
 (('atmosphere', 'flowing'), 0.004878048780487805),
 (('back', 'lunar'), 0.004878048780487805),
 (('beginnings', 'solar'), 0.004878048780487805),
 (('big', 'buzzy'), 0.004878048780487805),
 (('bill', 'nelson'), 0.004878048780487805),
 (('billions', 'cost'), 0.004878048780487805),
 (('boost', 'r
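
To make the scores above easier to read, here is a minimal sketch of a raw-frequency bigram score using only the standard library. It assumes the score is the bigram's count divided by the total number of tokens, which is consistent with the values above (the two bigrams that occur twice score exactly double the rest). The function name `bigram_raw_freq` and the sample tokens are hypothetical.

```python
from collections import Counter

def bigram_raw_freq(tokens):
    """Score each adjacent token pair by count / total number of tokens."""
    pair_counts = Counter(zip(tokens, tokens[1:]))
    n_tokens = len(tokens)
    scored = [(pair, count / n_tokens) for pair, count in pair_counts.items()]
    # Highest score first; ties broken alphabetically by the pair itself.
    return sorted(scored, key=lambda item: (-item[1], item[0]))

tokens = ["new", "moon", "rocket", "new", "moon", "program"]
scored = bigram_raw_freq(tokens)

for pair, score in scored:
    print(pair, round(score, 3))
```

Raw frequency rewards pairs simply for being common. Association measures such as pointwise mutual information instead reward pairs that co-occur more often than their individual word frequencies would predict.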

## Word Tokenization Methods

### Stemming and Lemmatization  
***Stemming*** is a fairly naive way of reducing a token to a base form: it simply strips any suffixes, prefixes, conjugations, or other modifications from the word, so the result is not always a real word.  

***Lemmatization*** is a more sophisticated method that uses known features of a word, such as its part of speech, to reduce an instance of the word to its 'lemma' (its dictionary form).  
<img src='images/stem_vs_lem.jpg'>
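
The contrast can be sketched with two toy functions. This is an illustration only, not how NLTK implements either technique: `naive_stem` strips a few hard-coded suffixes, and `toy_lemmatize` looks words up in a tiny hand-written dictionary. Both names and the `LEMMAS` table are hypothetical. In practice, NLTK's `PorterStemmer` and the `WordNetLemmatizer` imported at the top of this notebook fill these roles.

```python
# Suffixes are checked in order, so longer ones come first.
SUFFIXES = ("ing", "ed", "es", "s")

def naive_stem(word):
    """Strip the first matching suffix, leaving at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Tiny hand-written lookup table standing in for a real lexical database.
LEMMAS = {
    "running": "run",
    "studies": "study",
    "mice": "mouse",
    "better": "good",
}

def toy_lemmatize(word):
    """Return the dictionary form if the word is known, else the word itself."""
    return LEMMAS.get(word, word)

for word in ["running", "studies", "mice", "better"]:
    print(f"{word:>8}  stem={naive_stem(word):<8} lemma={toy_lemmatize(word)}")
```

The stemmer produces fragments such as `runn` and `studi` and cannot relate `mice` to `mouse`, while the lookup-based lemmatizer returns real words but only for entries it knows about.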

## Context-Free Grammars (CFG) & Part of Speech Tagging (POS)  
<img src='images/levelsOfLanguage.png' width=1000>

It's useful to have more than simple corpus statistics about a word; we also want vectorizable data about the *information* embedded within it. Context-Free Grammars (CFGs) describe how words with given "part of speech" (POS) tags combine into phrases and sentences, which helps make this dimension of language vectorizable. It's important to note that when discussing CFGs we are speaking about *syntax*, as shown above.  

A parse tree is generated by applying a sentence to a CFG that has been written by hand.

Consider the following sentence: 

***"While hunting in Africa, I shot an elephant in my pajamas. How he got into my pajamas, I don't know."***

A grammar can be written for this sentence that yields more than one valid parse:

- **S -> NP VP** A sentence (S) consists of a Noun Phrase (NP) followed by a Verb Phrase (VP).  
- **PP -> P NP** A Prepositional Phrase (PP) consists of a Preposition (P) followed by a Noun Phrase (NP)  
- **NP -> Det N | Det N PP | 'I'** A Noun Phrase (NP) can consist of:  
    - a Determiner (Det) followed by a Noun (N), or (as denoted by |)  
    - a Determiner (Det) followed by a Noun (N), followed by a Prepositional Phrase (PP), or  
    - The token 'I'.
- **VP -> V NP | VP PP** A Verb Phrase can consist of:
    - a Verb (V) followed by a Noun Phrase (NP) or
    - a Verb Phrase (VP) followed by a Prepositional Phrase (PP)
- **Det -> 'an' | 'my'** Determiners are the tokens 'an' or 'my'
- **N -> 'elephant' | 'pajamas'** Nouns are the tokens 'elephant' or 'pajamas'
- **V -> 'shot'** Verbs are the token 'shot'
- **P -> 'in'** Prepositions are the token 'in'  

This grammar parses this sentence as:  
<img src='images/sentence_parse.png' width=600>



In [117]:
nltk.download('punkt', quiet=True)
nltk.download('averaged_perceptron_tagger', quiet=True);

In [118]:
groucho_grammar = nltk.CFG.fromstring("""
S -> NP VP
PP -> P NP
NP -> Det N | Det N PP | 'I'
VP -> V NP | VP PP
Det -> 'an' | 'my'
N -> 'elephant' | 'pajamas'
V -> 'shot'
P -> 'in'
""")

In [119]:
parser = nltk.ChartParser(groucho_grammar)
sent = ['I', 'shot', 'an', 'elephant', 'in', 'my', 'pajamas']

for tree in parser.parse(sent):
    print(tree)

(S
  (NP I)
  (VP
    (VP (V shot) (NP (Det an) (N elephant)))
    (PP (P in) (NP (Det my) (N pajamas)))))
(S
  (NP I)
  (VP
    (V shot)
    (NP (Det an) (N elephant) (PP (P in) (NP (Det my) (N pajamas))))))


In [120]:
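# STEP 1: write a grammar
# NOTE: the trailing `|` on the N rule and the empty right-hand side of the P rule
# define empty productions, which is where the empty (N ) and (P ) nodes in the
# parse trees below come from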
grammar = nltk.CFG.fromstring("""
S -> NP VP
PP -> P NP
NP -> Det N PP | N | Det NP | Adj NP
VP -> V NP | VP PP
Det -> 'the'
Adj -> '100m'
N -> 'usain_bolt' | 'record' | 
V -> 'broke'
P -> 
""")
grammar

<Grammar with 15 productions>

In [121]:
# STEP 2: define sentence
from nltk import word_tokenize
sent = 'usain_bolt broke the 100m record'
sent

'usain_bolt broke the 100m record'

In [122]:
# STEP 3: tokenize sentence
tokenized_sent = word_tokenize(sent)
tokenized_sent

['usain_bolt', 'broke', 'the', '100m', 'record']

In [123]:
# STEP 4: fit grammar to the parser
parser = nltk.ChartParser(grammar)
parser

<nltk.parse.chart.ChartParser at 0x27cd90a9820>

In [124]:
# STEP 5: transform tokens with parser
for tree in parser.parse(tokenized_sent):
    print(tree)

(S
  (NP (N usain_bolt))
  (VP (V broke) (NP (Det the) (NP (Adj 100m) (NP (N record))))))
(S
  (NP (N usain_bolt))
  (VP
    (V broke)
    (NP (Det the) (N ) (PP (P ) (NP (Adj 100m) (NP (N record)))))))
(S
  (NP (N usain_bolt))
  (VP
    (VP (V broke) (NP (N )))
    (PP (P ) (NP (Det the) (NP (Adj 100m) (NP (N record)))))))
(S
  (NP (N usain_bolt))
  (VP
    (VP (V broke) (NP (N )))
    (PP
      (P )
      (NP (Det the) (N ) (PP (P ) (NP (Adj 100m) (NP (N record))))))))
(S
  (NP (N usain_bolt))
  (VP
    (VP (VP (V broke) (NP (N ))) (PP (P ) (NP (N ))))
    (PP (P ) (NP (Det the) (NP (Adj 100m) (NP (N record)))))))
(S
  (NP (N usain_bolt))
  (VP
    (VP (VP (V broke) (NP (N ))) (PP (P ) (NP (N ))))
    (PP
      (P )
      (NP (Det the) (N ) (PP (P ) (NP (Adj 100m) (NP (N record))))))))
(S
  (NP (N usain_bolt))
  (VP
    (VP (V broke) (NP (Det the) (NP (N ))))
    (PP (P ) (NP (Adj 100m) (NP (N record))))))
(S
  (NP (N usain_bolt))
  (VP
    (VP (V broke) (NP (Det the) (N ) (PP (P ) (

In [125]:
nltk.pos_tag(tokenized_sent)

[('usain_bolt', 'JJ'),
 ('broke', 'VBD'),
 ('the', 'DT'),
 ('100m', 'CD'),
 ('record', 'NN')]

- NLP has become increasingly popular over the past few years, and NLP researchers have produced many valuable insights  
- The Natural Language Tool Kit (NLTK) is one of the most popular Python libraries for NLP  
- Regular Expressions are an important part of NLP, which can be used for pattern matching and filtering  
- Regular Expressions can become confusing, so keep a regex cheat sheet handy the first few times you work with regex  
- It is strongly recommended that you spend some time with regex tester websites to understand how changing your regex pattern affects your results as you work towards a correct answer  
- Feature engineering is essential when working with text data and for understanding the dynamics of your text  
- Common feature engineering techniques are removing stop words, stemming, lemmatization, and n-grams  
- When diving deeper into grammar and linguistics, context-free grammars and part-of-speech tagging are important  
- In this context, parse trees can help computers when dealing with ambiguous words  
- How you clean and preprocess your data will have a major effect on the conclusions you'll be able to draw in your NLP classification problems  