# Natural Language Processing

<img src="https://media.giphy.com/media/msKNSs8rmJ5m/giphy.gif" alt="Drawing" style="width: 300px;"/>




<img src = 'images/thinking.jpeg'></img>

#### Scenario: You work for an international political consultant.  Your work is on the level and not shady at all.  Scott, a friendly coworker, is trying to sort through news articles to quickly filter political from non-political articles. He has heard you possess some solid NLP chops and has asked for your help in automating the task.

<div>
<p float = 'left'> What type of problem is this?</p>
<p float = 'left'> What steps do you anticipate carrying out? </p>
<p float = 'left'> What challenges do you foresee? </p>
</div>


<img src = 'https://media.giphy.com/media/WqLmcthJ7AgQKwYJbb/giphy.gif' alt="Drawing" style="width: 300px;"  float = 'right'> </img>

## Vanilla Python Text Exploration

Explore the example texts by:

1. Creating a list of words in each text.
2. Counting the number of occurrences of each word.
3. Ordering the words by number of occurrences.
4. Comparing the counts.
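The four steps above can be sketched with the standard library alone. This is a minimal illustration, not the lesson's own solution: the two sample strings and the helper `count_words` are made up here as stand-ins for the files in `text_examples/`.

```python
from collections import Counter

# Hypothetical stand-in texts; the lesson itself reads files from text_examples/
text_a = "the committee said the patent rules need a rewrite"
text_b = "the team said the match was a great win for the team"

def count_words(text):
    """Steps 1 and 2: split on whitespace, then count each word."""
    return Counter(text.split())

counts_a = count_words(text_a)
counts_b = count_words(text_b)

# Step 3: order by count, most frequent first
print(counts_a.most_common(3))
print(counts_b.most_common(3))

# Step 4: compare the two texts via shared and unique words
shared = set(counts_a) & set(counts_b)
only_a = set(counts_a) - set(counts_b)
print(sorted(shared))
print(sorted(only_a))
```

`Counter.most_common` covers the ordering step, which a plain `defaultdict(int)` does not provide on its own.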





In [1]:
import pandas as pd
import numpy as np
from collections import defaultdict
import string

with open('text_examples/A.txt', 'r') as read_file:
    text = read_file.read()

In [7]:
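# Create a list of words
# Note: replacing '\n' with '' fuses the words on either side of a line break
# (e.g. 'lawA' in the counts below); replacing it with ' ' avoids this.
text = text.replace('\n', '')
tokens = text.split(' ')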

In [10]:
# Count the number of occurrences
token_counts = defaultdict(int)
for t in tokens:
    token_counts[t] += 1

token_counts

defaultdict(int,
            {'Reboot': 1,
             'ordered': 2,
             'for': 4,
             'EU': 4,
             'patent': 4,
             'lawA': 1,
             'European': 2,
             'Parliament': 1,
             'committee': 1,
             'has': 3,
             'a': 6,
             'rewrite': 1,
             'of': 6,
             'the': 17,
             'proposals': 2,
             'controversial': 1,
             'new': 2,
             'Union': 1,
             'rules': 1,
             'which': 1,
             'govern': 1,
             'computer-based': 1,
             'inventions.The': 1,
             'Legal': 1,
             'Affairs': 1,
             'Committee': 1,
             '(JURI)': 1,
             'said': 2,
             'Commission': 1,
             'should': 1,
             're-submit': 1,
             'Computer': 1,
             'Implemented': 1,
             'Inventions': 1,
             'Directive': 1,
             'after': 1,
             'MEPs

In [11]:
file_names = [f'{letter}.txt' for letter in string.ascii_uppercase[:12]]

def vanilla_tokenizer(file_name):
    token_counts = defaultdict(int)
    with open(f'text_examples/{file_name}', 'r') as read_file:
        text = read_file.read().replace('\n', ' ')
    
    tokens = text.split(' ')
    
    for token in tokens:
        token_counts[token] += 1
        
    return token_counts

In [19]:
word_df = pd.DataFrame()
for f in file_names:
    f_token_df = pd.DataFrame(vanilla_tokenizer(f), index=[f'{f}'])
    word_df = pd.concat([word_df, f_token_df], axis=0, sort=True)

word_df.fillna(value=0)

Unnamed: 0,Unnamed: 1,"""And","""For","""I","""I'm","""Jason","""Recipients","""Skype's","""Take","""This",...,year,year's,"year,",year.,years,"£1,000",£12m.,£3bn,£4m,"£7,455."
A.txt,4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
B.txt,4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
C.txt,5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,1.0,2.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0
D.txt,4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
E.txt,4,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
F.txt,3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
G.txt,6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
H.txt,3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
I.txt,4,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
J.txt,3,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0


<h2>Bag of Words</h2>

<img src = "images/bag_of_word.jpg"></img>

What is the problem with text in relation to machine learning?

A bag of words (BOW) takes a text, breaks it up into small pieces (words, bigrams, stems, lemmas), and converts it into counts.  These counts can then be fed into our familiar machine learning algorithms.
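To make the idea concrete, here is a small hand-built bag of words in plain Python. It is only a sketch: the two sample sentences are invented for illustration, and splitting on whitespace is the simplest possible way to produce the "small pieces".

```python
# Hypothetical sample documents, used only for illustration
docs = ["the cat sat on the mat", "the dog sat"]

# The vocabulary is every distinct word across all documents, in sorted order
vocab = sorted({word for doc in docs for word in doc.split()})

# Each document becomes one row of counts, with one entry per vocabulary word
rows = [[doc.split().count(word) for word in vocab] for doc in docs]

print(vocab)
for row in rows:
    print(row)
```

Word order within each document is lost; only the counts survive, which is where the "bag" in the name comes from.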

Question: Did any algorithm pop into your head that might be particularly suited to bags of words?

## Steps for creating a bag of words

1. make lowercase 
2. remove punctuation
3. remove stopwords
4. apply stemmer/lemmatizer
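The four steps above can be chained into one small function. The sketch below is an assumption-laden illustration rather than the lesson's pipeline: `prepare` and the tiny `STOPWORDS` set are hypothetical, and `PorterStemmer` comes from NLTK, the library introduced next.

```python
import string
from nltk.stem import PorterStemmer

# Hypothetical, deliberately tiny stopword list for illustration only
STOPWORDS = {'the', 'were', 'on', 'a', 'of', 'and'}

def prepare(text):
    """Apply the four bag-of-words preparation steps to one string."""
    # 1. make lowercase
    text = text.lower()
    # 2. remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    # 3. remove stopwords
    tokens = [t for t in text.split() if t not in STOPWORDS]
    # 4. apply a stemmer
    stemmer = PorterStemmer()
    return [stemmer.stem(t) for t in tokens]

prepared = prepare('The cats were sitting on the bank.')
print(prepared)
```

Lowercasing comes first so that `The` and `the` are treated as the same stopword.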



To help us with the above steps, we will introduce a new library, **NLTK** ([documentation](https://www.nltk.org/)).




In [21]:
#We will come back to the articles later, but to practice preparing a text, 
#let's use another of the NLTK resources (https://www.nltk.org/book/ch02.html)

import nltk
from nltk import word_tokenize, regexp_tokenize
nltk.download('gutenberg')
nltk.corpus.gutenberg.fileids()

In [23]:
text = nltk.corpus.gutenberg.raw('carroll-alice.txt').replace('\n',' ')[:2000]

In [25]:
## tokenize the text
nltk.download('punkt')
text_tokens = word_tokenize(text)

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/flatironschool/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [27]:
# or use regexp which can take care of punctuation removal as well.
#https://regexr.com/
pattern = ("([a-zA-Z]+(?:'[a-z]+)?)")
tokenize = regexp_tokenize(text, pattern)
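To see what the pattern above keeps and drops, it can be tried on a short string with the standard-library `re` module. This is a minimal sketch: the sample sentence is made up, and the pattern is written without the outer capturing group so that `re.findall` returns whole matches.

```python
import re

# Same idea as the pattern above: a run of letters, optionally followed by
# an apostrophe and more lowercase letters (e.g. "Alice's", "didn't")
pattern = r"[a-zA-Z]+(?:'[a-z]+)?"

sample = "Alice's sister didn't read 2 books!"  # hypothetical example sentence
found = re.findall(pattern, sample)
print(found)
```

Digits and punctuation never match the pattern, so they disappear during tokenizing, while contractions and possessives stay in one piece.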

In [28]:
tokenize

["Alice's",
 'Adventures',
 'in',
 'Wonderland',
 'by',
 'Lewis',
 'Carroll',
 'CHAPTER',
 'I',
 'Down',
 'the',
 'Rabbit',
 'Hole',
 'Alice',
 'was',
 'beginning',
 'to',
 'get',
 'very',
 'tired',
 'of',
 'sitting',
 'by',
 'her',
 'sister',
 'on',
 'the',
 'bank',
 'and',
 'of',
 'having',
 'nothing',
 'to',
 'do',
 'once',
 'or',
 'twice',
 'she',
 'had',
 'peeped',
 'into',
 'the',
 'book',
 'her',
 'sister',
 'was',
 'reading',
 'but',
 'it',
 'had',
 'no',
 'pictures',
 'or',
 'conversations',
 'in',
 'it',
 'and',
 'what',
 'is',
 'the',
 'use',
 'of',
 'a',
 'book',
 'thought',
 'Alice',
 'without',
 'pictures',
 'or',
 'conversation',
 'So',
 'she',
 'was',
 'considering',
 'in',
 'her',
 'own',
 'mind',
 'as',
 'well',
 'as',
 'she',
 'could',
 'for',
 'the',
 'hot',
 'day',
 'made',
 'her',
 'feel',
 'very',
 'sleepy',
 'and',
 'stupid',
 'whether',
 'the',
 'pleasure',
 'of',
 'making',
 'a',
 'daisy',
 'chain',
 'would',
 'be',
 'worth',
 'the',
 'trouble',
 'of',
 'getti

In [31]:
# 1. Make Lowercase
token_lowercase = [t.lower() for t in tokenize]

In [38]:
# 2. Remove punctuation (already handled by the regexp tokenizer above)
# 3. Remove stopwords
from nltk.corpus import stopwords
nltk.download('stopwords')
# Use a separate name so the imported stopwords module is not overwritten
stop_words = stopwords.words('english')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/flatironschool/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [39]:
token_lowercase = [t for t in token_lowercase if t not in stop_words]

# Stemmers/Lemmatizers

In [40]:
## Stemmers/Lemmatizers
## Why would I use a stemmer and not a lemmatizer at all times?
from nltk.stem import PorterStemmer


example = ['caresses', 'flies', 'dies', 'mules', 'denied',
           'died', 'agreed', 'owned', 'humbled', 'sized',
           'meeting', 'stating', 'siezing', 'itemization',
           'sensational', 'traditional', 'reference', 'colonizer',
          'plotted']

p_stem = PorterStemmer()
stemmed_words = [p_stem.stem(word) for word in example]
stemmed_words

['caress',
 'fli',
 'die',
 'mule',
 'deni',
 'die',
 'agre',
 'own',
 'humbl',
 'size',
 'meet',
 'state',
 'siez',
 'item',
 'sensat',
 'tradit',
 'refer',
 'colon',
 'plot']

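Porter stemmer: the least aggressive of the commonly used stemmers.

Snowball stemmer (also known as Porter2): a revision of Porter, usually described as slightly more aggressive in how it stems, although the difference varies from word to word.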
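One way to check this for yourself is to run both stemmers on the same words and compare the results side by side. The sketch below assumes NLTK's `PorterStemmer` and `SnowballStemmer('english')`; the word list is an arbitrary choice for illustration.

```python
from nltk.stem import PorterStemmer, SnowballStemmer

# Arbitrary sample words chosen for illustration
words = ['generously', 'fairly', 'sensational', 'colonizer']

porter = PorterStemmer()
snowball = SnowballStemmer('english')

# Collect (word, porter stem, snowball stem) triples for comparison
comparison = [(w, porter.stem(w), snowball.stem(w)) for w in words]

for word, p_stem, s_stem in comparison:
    print(f'{word:12} porter={p_stem:10} snowball={s_stem}')
```

Neither stemmer needs a corpus download, so this runs with a bare NLTK install.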

In [42]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/flatironschool/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [43]:
# Lemmatizers

from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

example = ['caresses', 'flies', 'dies', 'mules', 'denied',
           'died', 'agreed', 'owned', 'humbled', 'sized',
           'meeting', 'stating', 'siezing', 'itemization',
           'sensational', 'traditional', 'reference', 'colonizer',
          'plotted']

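from nltk import pos_tag
# Problem with POS: lemmatize() assumes a noun unless given a part-of-speech tag,
# so the verb 'agreed' comes back unchanged
wordnet_lemmatizer.lemmatize(example[6])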

'agreed'

In [None]:
## Frequency Distributions the easy way

from nltk import FreqDist
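As a rough sketch of what `FreqDist` offers, the snippet below counts a short, made-up token list. `FreqDist` behaves like a `collections.Counter`, so it replaces the manual `defaultdict` counting loop used earlier.

```python
from nltk import FreqDist

# Hypothetical token list standing in for the prepared Alice tokens
tokens = ['the', 'rabbit', 'hole', 'the', 'rabbit', 'the']

fdist = FreqDist(tokens)

print(fdist.most_common(2))   # the most frequent tokens with their counts
print(fdist['rabbit'])        # the count for a single token
print(fdist.N())              # the total number of tokens counted
```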

## Comparing using data frames

In [None]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

docs = ['the bread week winner goes to the final', 
        'the winner never has a soggy bottom', 
        'the contestants at the bottom were poor bread bakers' ]
vec = CountVectorizer()
X = vec.fit_transform(docs)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
df = pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out())
print(df)

# Each word is a dimension!
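The sketch below, which reuses the three example sentences, shows what "a dimension per word" means in practice: the count matrix has one row per document and one column per vocabulary word. The new sentence passed to `transform` is made up for illustration. Note that `CountVectorizer`'s default tokenizer drops one-character tokens such as `a`.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ['the bread week winner goes to the final',
        'the winner never has a soggy bottom',
        'the contestants at the bottom were poor bread bakers']

vec = CountVectorizer()
X = vec.fit_transform(docs)

# One row per document, one column (dimension) per vocabulary word
n_docs, n_dims = X.shape
print(n_docs, n_dims, len(vec.vocabulary_))

# A new document is mapped into the same space; words not seen during
# fitting (here 'is') have no dimension and are ignored
new_vec = vec.transform(['the winner is bread']).toarray()[0]
print(new_vec.shape, new_vec.sum())
```

Because every distinct word adds a column, the number of dimensions grows quickly with the size of the corpus.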

![word](https://media.giphy.com/media/xT1R9ERHwyzbCkIwla/giphy.gif)