# Tokenizing Sentences
1. Split the corpus into sentences.
2. Split each sentence into words.

In [None]:
# Why not just tokenize myself?
import nltk
text = "I made two purchases today! I bought a bag of grapes for $4.99, \
but then... realized John Francis already bought some at the Y.M.C.A!"

In [None]:
# trying to write our own tokenizer




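As a rough sketch of why hand-rolled tokenizers struggle, here are two naive sentence splitters applied to the sample text. The function names `split_on_punctuation` and `split_on_punctuation_then_space` are hypothetical helpers written for this illustration, not part of NLTK.

```python
import re

text = ("I made two purchases today! I bought a bag of grapes for $4.99, "
        "but then... realized John Francis already bought some at the Y.M.C.A!")

def split_on_punctuation(text):
    """Naive attempt 1: treat every '.', '!' or '?' as a sentence boundary."""
    pieces = re.split(r"[.!?]", text)
    return [p.strip() for p in pieces if p.strip()]

def split_on_punctuation_then_space(text):
    """Naive attempt 2: split only where '.', '!' or '?' is followed by whitespace."""
    return [p for p in re.split(r"(?<=[.!?])\s+", text) if p]

attempt_1 = split_on_punctuation(text)
attempt_2 = split_on_punctuation_then_space(text)

# Attempt 1 breaks "$4.99" and "Y.M.C.A" apart.
for piece in attempt_1:
    print(repr(piece))

# Attempt 2 keeps those intact but still splits at the ellipsis "then...".
for piece in attempt_2:
    print(repr(piece))
```

The text contains two sentences, yet the first attempt yields seven fragments and the second yields three. Each extra rule fixes one case and exposes another, which is the motivation for using a trained tokenizer instead.
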
In [None]:
# Using NLTK sent_tokenize()



## Stemming

<img src="images/stemming-examples.png" alt="Different Stemming Techniques" style="width:600px;"/>

Stemming reduces inflected words to a root form, mapping a group of related words to the same stem even if that stem is not a valid word in the language ([Source](https://www.datacamp.com/community/tutorials/stemming-lemmatization-python)).

In Python, we can use **`nltk.stem.porter.PorterStemmer`** to stem our words:

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
print(stemmer.stem("caressed"))  # caress
print(stemmer.stem("athlete"))  # athlet
print(stemmer.stem("athletics"))  # athlet
print(stemmer.stem("media"))  # media
print(stemmer.stem("photography"))  # photographi
print(stemmer.stem("sexy"))  # sexi
print(stemmer.stem("journalling"))  # journal
print(stemmer.stem("Slovakia")) # slovakia
print(stemmer.stem("corpora")) # corpora
print(stemmer.stem("thieves")) # thiev
print(stemmer.stem("rocks")) # rock
```

## Lemmatization

```python
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("caressed"))  # caressed
print(lemmatizer.lemmatize("athlete"))  # athlete
print(lemmatizer.lemmatize("athletics"))  # athletics
print(lemmatizer.lemmatize("media"))  # medium
print(lemmatizer.lemmatize("photography"))  # photography
print(lemmatizer.lemmatize("sexy"))  # sexy
print(lemmatizer.lemmatize("journalling"))  # journalling
print(lemmatizer.lemmatize("Slovakia"))  # Slovakia
print(lemmatizer.lemmatize("corpora"))  # corpus
print(lemmatizer.lemmatize("thieves"))  # thief
print(lemmatizer.lemmatize("rocks"))  # rock
```

Why would you ever choose stemming over lemmatization?
- stemmers are smaller and faster: they apply suffix rules instead of looking words up in a dictionary
- they are simple, and the results are often "good enough"
- can often **provide higher recall (coverage)** in text search: in the Porter output above, `athlete` and `athletics` both reduce to `athlet`, so a query for one also matches the other. That helps when a search engine must surface every relevant document, even at the cost of surfacing a few irrelevant ones
- may help predictive models that tend to overfit, since collapsing word variants shrinks the vocabulary the model has to learn
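
As a rough illustration of the recall point, the sketch below compares exact-match search with stem-match search over a few made-up documents. `crude_stem` is a hypothetical suffix stripper written for this example only; it is not the Porter algorithm, and the documents are invented.

```python
def crude_stem(word):
    """Hypothetical suffix stripper for illustration only (not Porter)."""
    word = word.lower()
    for suffix in ("ics", "es", "e", "s"):
        # Only strip when at least three characters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Invented mini-corpus, keyed by document id.
docs = {
    1: "The athlete trained daily",
    2: "Athletics events start today",
    3: "Local athletes win medals",
    4: "Grapes are on sale",
}

def exact_search(query, docs):
    """Return ids of documents containing the query as an exact token."""
    return {doc_id for doc_id, body in docs.items()
            if query.lower() in body.lower().split()}

def stem_search(query, docs):
    """Return ids of documents containing any token with the same stem."""
    target = crude_stem(query)
    return {doc_id for doc_id, body in docs.items()
            if target in {crude_stem(tok) for tok in body.split()}}

exact_hits = exact_search("athlete", docs)
stem_hits = stem_search("athlete", docs)
print(sorted(exact_hits))
print(sorted(stem_hits))
```

Exact matching finds only document 1, while stem matching also surfaces documents 2 and 3. Whether document 2 (about athletics events) is actually relevant to someone searching for "athlete" is the precision cost mentioned above.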

## Scoring Metrics

<img src="images/confusion_matrix2.png" alt="Confusion matrix" style="width:600px;"/>

### Precision/Recall

**Recall:** Of all the examples that are actually positive, what percentage did the model correctly predict as positive?

**Precision:** Of all the examples the model predicted as positive, what percentage were actually positive?
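
In terms of confusion-matrix counts (true positives $TP$, false positives $FP$, false negatives $FN$), these two definitions are usually written as follows, where $P$ denotes precision and $R$ denotes recall:

$$
P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}
$$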

In terms of NLP / stemming / lemmatization:

**Recall**: After processing (tokenizing, stemming/lemmatizing) the data, what percent of the relevant items were surfaced in the search results? For example, when a user searches for "blue jeans", do the results include all of the relevant items (blue-ish colored denim pants)?

**Precision**: After processing (tokenizing, stemming/lemmatizing) the data, what percent of the results returned were relevant?
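
A minimal sketch of these two search-oriented definitions, treating the returned results and the truly relevant items as sets. The helper name `precision_recall` and the product names are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Compute (precision, recall) for a set of retrieved items."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant items that were actually returned
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Invented catalogue items for a "blue jeans" query.
relevant = {"blue jeans", "navy denim pants", "light-blue jeans", "indigo denim"}
retrieved = {"blue jeans", "light-blue jeans", "blue jean jacket"}

precision, recall = precision_recall(retrieved, relevant)
print(f"precision = {precision:.2f}")  # 2 of the 3 returned items are relevant
print(f"recall    = {recall:.2f}")     # 2 of the 4 relevant items were returned
```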

<img src="images/matrix_practice2.png" alt="Confusion matrix practice problem" style="width:600px;"/>

**Precision:** $\frac{?}{?}$

**Recall:** $\frac{?}{?}$

### F1 Score

The F1 score of a model is the harmonic mean of its precision ($P$) and recall ($R$), defined as:

$$
\begin{equation}
F_{1} = 2 \cdot \frac{P \cdot R}{P + R}
\end{equation}
$$
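
As a quick worked example of the formula, the sketch below computes F1 from confusion-matrix counts. The counts are invented and `f1_from_pr` is a hypothetical helper name.

```python
def f1_from_pr(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Made-up counts: true positives, false positives, false negatives.
tp, fp, fn = 8, 2, 4

precision = tp / (tp + fp)  # 8 / 10
recall = tp / (tp + fn)     # 8 / 12
f1 = f1_from_pr(precision, recall)
arithmetic_mean = (precision + recall) / 2

print(f"precision = {precision:.3f}")
print(f"recall    = {recall:.3f}")
print(f"F1        = {f1:.3f}")
print(f"mean      = {arithmetic_mean:.3f}")
```

The harmonic mean is never larger than the arithmetic mean, so F1 is pulled toward whichever of precision or recall is lower. A model cannot earn a high F1 by excelling at only one of the two.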

## Exercise:

##### 1. Label each of the following statements True or False. If False, briefly explain why:

A. Text typically should be processed via either stemming or lemmatization, but not both.

B. Texts processed using lemmatization will typically have higher recall than texts processed using stemming.

C. If the **F1 score** of a model is **1.0 (100%)**, then the accuracy of your model must also be **100%**.

##### 2. Calculate precision and recall given the following results from a confusion matrix:

<img src="images/exercise.jpeg" alt="Confusion matrix for the exercise" style="width:600px;"/>


In [None]:
from sklearn.feature_extraction.text import CountVectorizer
# list of text documents
text = ["It's still early, so box-office disappointments are still among the highest-grossing movies of the year.", 
        "That movie was terrific", "You love cats", 
        "Pay for top executives at big US companies is vastly higher than what everyday workers make, and a new report from The Wall Street Journal has found that CEOs have hit an eye-popping milestone in the size of their monthly paychecks."]
# create the transform


# tokenize and build vocab


# vectorize the corpus


# summarize encoded vector


# Notice what type of object this is



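For intuition about the steps named in the comments above (build a vocabulary, then encode each document as word counts), here is a plain-Python sketch of a bag-of-words encoding. It is a simplified stand-in, not scikit-learn's implementation: it lowercases and keeps tokens of two or more word characters, which mirrors `CountVectorizer`'s default settings, and it returns dense lists where the real class returns a SciPy sparse matrix. The third document is made up for the example.

```python
import re
from collections import Counter

docs = [
    "That movie was terrific",
    "You love cats",
    "That cat was a terrific cat",  # invented document with a repeated word
]

def tokenize(doc):
    """Lowercase, then keep runs of two or more word characters."""
    return re.findall(r"\b\w\w+\b", doc.lower())

# Vocabulary: every distinct token, sorted so each term gets a stable column.
vocabulary = sorted({tok for doc in docs for tok in tokenize(doc)})

def vectorize(doc):
    """Encode a document as a list of counts, one entry per vocabulary term."""
    counts = Counter(tokenize(doc))
    return [counts.get(term, 0) for term in vocabulary]

matrix = [vectorize(doc) for doc in docs]

print(vocabulary)
for row in matrix:
    print(row)
```

Note that the single-character token "a" is dropped, and that "cat" and "cats" occupy separate columns because no stemming or lemmatization was applied first.
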
In [None]:
# see the outputted vectors



In [None]:
# load vectorized corpus into Pandas dataframe
import pandas as pd


In [None]:
# run a quick correlation analysis to see if any word pairs show rough co-occurrence
