# The Natural Language Processing Pipeline
1. Segmentation
    * Break the document down into sentences
2. Tokenizing
    * Find the individual words used
    * Basically an array of words in the document
3. Stop Words
    * "a", "the", "and", ... commonly used words that help with sentence structure, but don't add context.
    * Remove the stop words
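4. Stemming
    * Reduce different words that share a root to that root.
        * Swimmer, Swimming, Swims, Swam, ...
        * All share the root "swim"
        * Covers plurals and tenses of words; irregular forms such as "Swam" usually cannot be reached by stripping suffixes and are left to lemmatization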
5. Lemmatization
    * Stemming alone does not always group words correctly
        * Ex. "Universal" and "University" mean different things than "Universe", yet a stemmer may reduce all three to the same stem
    * Look at dictionary definitions and associate words with a common meaning
    * "Better" is the comparative form of "Good", so we choose "Good" as the lemma for both.
6. Part-of-Speech Tagging
    * How is each token used in the sentence?
    * Label each word as Noun, Verb, Preposition, etc.
7. Named Entity Tagging or Named Entity Recognition
    * Flagging names of locations, movies, people, etc. that occur in the document.
    * Is there any entity associated with a particular token?
        * Ex. "Utah" might be associated with a state in the United States.
        * Ex. "Dusty Shaw" might be associated with a student at Snow College.
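
The first five steps can be sketched with nothing but the standard library. This is a toy illustration, not how a real NLP library works: the stop-word set, the `LEMMAS` table, and the suffix rules in `toy_stem` are all hand-written assumptions chosen to fit the "swim" and "good" examples above.

```python
import re

# Tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"a", "an", "the", "and", "is", "in", "of", "to"}

# Hand-written lemma table standing in for a real dictionary lookup.
LEMMAS = {"better": "good", "swam": "swim"}


def segment(document):
    """Step 1: split a document into sentences after '.', '!' or '?'."""
    parts = re.split(r"(?<=[.!?])\s+", document.strip())
    return [p for p in parts if p]


def tokenize(sentence):
    """Step 2: lowercase the sentence and pull out runs of letters."""
    return re.findall(r"[a-z]+", sentence.lower())


def remove_stop_words(tokens):
    """Step 3: drop words that add structure but little context."""
    return [t for t in tokens if t not in STOP_WORDS]


def toy_stem(token):
    """Step 4 (toy): strip one common suffix; this is not the Porter algorithm."""
    for suffix in ("ing", "er", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            token = token[: -len(suffix)]
            break
    # Undo consonant doubling, e.g. "swimm" -> "swim".
    if len(token) > 3 and token[-1] == token[-2] and token[-1] not in "aeiou":
        token = token[:-1]
    return token


def normalize(token):
    """Step 5 (toy): prefer a known lemma, otherwise fall back to the stem."""
    return LEMMAS.get(token) or toy_stem(token)


document = "The swimmer swims. She swam better than the others!"
processed = [
    [normalize(t) for t in remove_stop_words(tokenize(s))]
    for s in segment(document)
]
print(processed)
```

Note that "swam" only reaches "swim" through the lemma table; the suffix rules alone leave it unchanged, which is the gap lemmatization is meant to close.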

# Bag of Words  

Given a list of words, how likely is it to belong to each bag (class)?

Create a zero-filled array with one position for each of 20,000 commonly used words.

So we start with an array of zeros:

$$[0,0,0,0,...]$$

Where each position represents a specific word.

* array[0] = SOS (start of sentence)
* array[1] = EOS (end of sentence)
* array[n] = reserved for any special words not represented in the array  
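
One way to assign those positions is to count word frequencies across all samples and give each of the most common words a slot. The sketch below assumes the samples are already tokenized; the marker names `<SOS>`, `<EOS>`, and `<UNK>` and the function name `build_vocabulary` are made up for illustration.

```python
from collections import Counter

SOS, EOS, UNK = "<SOS>", "<EOS>", "<UNK>"  # hypothetical marker names


def build_vocabulary(token_lists, max_words=20000):
    """Map each of the most frequent words to a fixed array position."""
    counts = Counter(word for tokens in token_lists for word in tokens)
    index = {SOS: 0, EOS: 1}  # positions 0 and 1 are reserved
    for word, _ in counts.most_common(max_words):
        index[word] = len(index)
    index[UNK] = len(index)  # last slot catches words outside the vocabulary
    return index


samples = [["good", "food", "good"], ["good", "food", "service"]]
vocabulary = build_vocabulary(samples, max_words=2)
print(vocabulary)
```

With `max_words=2`, "service" does not get its own position and would later be counted in the final reserved slot.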

Then simply fill your array with a count of how often each word occurs in your sample.
* Note -> Words with the same root or lemma will be counted together.  
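
Filling the array for one sample might look like the sketch below. The six-slot `INDEX` is a hypothetical stand-in for the full vocabulary, the marker names are assumptions, and the tokens are assumed to be stemmed or lemmatized already so that related forms land in the same slot.

```python
# Hypothetical six-slot vocabulary; a real one would hold thousands of words.
INDEX = {"<SOS>": 0, "<EOS>": 1, "work": 2, "fantastic": 3, "question": 4, "<UNK>": 5}


def to_count_vector(tokens, index=INDEX):
    """Count how often each vocabulary word occurs in one sample."""
    vector = [0] * len(index)
    vector[index["<SOS>"]] = 1  # one start marker per sample
    vector[index["<EOS>"]] = 1  # one end marker per sample
    for token in tokens:
        # Words missing from the vocabulary fall into the reserved slot.
        vector[index.get(token, index["<UNK>"])] += 1
    return vector


tokens = ["fantastic", "work", "john", "your", "work"]
vector = to_count_vector(tokens)
print(vector)
```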

Do this with thousands of samples, and send these arrays through a machine learning model (such as naive Bayes, logistic regression, decision trees, random forests, or deep neural networks).  
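
To make the modelling step concrete, here is a minimal sketch of one of those options, multinomial naive Bayes with add-one smoothing, written from scratch over count arrays. The three-word vocabulary, the four training samples, and the "positive"/"negative" labels are invented for illustration; in practice a library implementation would normally be used.

```python
import math


def train_naive_bayes(vectors, labels):
    """Fit multinomial naive Bayes with add-one smoothing on count vectors."""
    size = len(vectors[0])
    model = {}
    for c in sorted(set(labels)):
        rows = [v for v, y in zip(vectors, labels) if y == c]
        totals = [sum(col) for col in zip(*rows)]  # per-word counts in class c
        denom = sum(totals) + size
        model[c] = {
            "log_prior": math.log(len(rows) / len(vectors)),
            "log_likelihood": [math.log((t + 1) / denom) for t in totals],
        }
    return model


def predict(model, vector):
    """Return the class with the highest log posterior for one vector."""
    def score(c):
        params = model[c]
        return params["log_prior"] + sum(
            n * ll for n, ll in zip(vector, params["log_likelihood"])
        )
    return max(model, key=score)


# Array positions stand for the words: [good, bad, food]
vectors = [[2, 0, 1], [1, 0, 1], [0, 2, 1], [0, 1, 1]]
labels = ["positive", "positive", "negative", "negative"]

model = train_naive_bayes(vectors, labels)
print(predict(model, [1, 0, 0]))
print(predict(model, [0, 1, 1]))
```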


Quick example: 

> Fantastic work, John! Your work is worthy of JPL! Let me know if you have questions. Michael.

Array:
$$[\text{SOS, EOS, if, is, did, not, me, you, have, get, good, perfect, worth, amazing, fantastic, job, work, ..., your, mine, ',', '.', '!', know, question, shall, go, let, ..., (special words)}]$$

Your feedback produces this array:
$$[1, 1, 1, 1, 0, 0, 1, 1, ...]$$  
Then send this array into your model along with the 6,732 other feedback samples like it...

```python
import pandas as pd
import numpy as np

# Tab-separated file; quoting=3 tells pandas to treat quote characters as plain text
data = pd.read_csv('./Data/Restaurant_Reviews.tsv', delimiter="\t", quoting=3)
```
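
The `quoting=3` argument is the numeric value of `csv.QUOTE_NONE` from the standard library, which leaves quote characters inside a review untouched. The sketch below shows that behaviour with the `csv` module on a made-up one-row sample; the `Liked` column name is an assumption, since only `Review` appears in this notebook.

```python
import csv
import io

# Made-up sample in a tab-separated layout; "Liked" is an assumed column name.
sample = 'Review\tLiked\n"Wow"... Loved this place.\t1\n'

reader = csv.DictReader(
    io.StringIO(sample), delimiter="\t", quoting=csv.QUOTE_NONE
)
rows = list(reader)

print(csv.QUOTE_NONE)     # the constant behind quoting=3
print(rows[0]["Review"])  # the quotes around Wow are kept as-is
```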

```python
# Requires the third-party nltk package: pip install nltk
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

nltk.download('stopwords')  # fetch the stop-word lists on first run

ps = PorterStemmer()
# Build the stop-word set once instead of on every loop iteration
all_stopwords = set(stopwords.words('english'))
corpus = []

for review in data['Review']:
    # Keep letters only, then lowercase and split into words
    review = re.sub(r'[^a-zA-Z]', ' ', review)
    review = review.lower().split()
    # Drop stop words and stem what remains
    review = [ps.stem(word) for word in review if word not in all_stopwords]
    # Rejoin with spaces so the stemmed words stay separate
    corpus.append(' '.join(review))
```
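
The preprocessing cell depends on the third-party `nltk` package and fails with `ModuleNotFoundError` when it is not installed. A small guard like the one below, using only the standard library, can report the problem before the cell runs. The helper name `is_installed` is made up for this sketch.

```python
import importlib.util


def is_installed(module_name):
    """Return True when the named top-level module can be found for import."""
    return importlib.util.find_spec(module_name) is not None


if not is_installed("nltk"):
    print("nltk is missing; install it with: pip install nltk")
```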