# NLP Modeling Lesson

In this lesson, we'll do a bit of feature engineering, and then model our text data. We'll be aiming to predict whether a given text message is spam or not, and trying to predict the category of news articles.

## Feature Extraction: TF-IDF
- **TF**: Term Frequency; how often a particular word appears in a document. 

        "apple" appears in this document 15 times
- **IDF**: Inverse Document Frequency; a measure of how many (related) documents contain a particular word. 

        "apple" is found in 10 of the 58 documents in our sample
        
- **TF-IDF**: A combination of the two measures above.

### TF: Term Frequency

Term frequency can be calculated in a number of ways, all of which reflect how frequently a word appears in a document.

- **Raw Count**: This is simply the count of the number of occurrences of each word.
- **Frequency**: The number of times each word appears divided by the total number of words.
- **Augmented Frequency**: The frequency of each word divided by the maximum frequency. This can help prevent bias towards larger documents.

Let's take a look at an example:

In [1]:
from pprint import pprint

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

from prepare import basic_clean, lemmatize


In [None]:
document = 'Mary had a little lamb, a little lamb, a little lamb.'

# clean up the text
document = document.lower().replace(',', '').replace('.', '')
# transform into a series
words = pd.Series(document.split())
words

From the series we can extract the value_counts, which is our raw count for term frequency. Once we have the raw counts, we can calculate the other measures.

In [None]:
words.value_counts()

In [None]:
lullaby = pd.DataFrame({'raw_count':words.value_counts()})
lullaby

In [None]:
lullaby['frequency'] = lullaby.raw_count.apply(lambda x: x/lullaby.raw_count.sum())
lullaby

In [None]:
lullaby['augmented_frequency'] = lullaby.frequency.apply(lambda x: x/lullaby.frequency.max())
lullaby

In [None]:
# We can accomplish the same task more concisely by chaining .assign
(pd.DataFrame({'raw_count': words.value_counts()})
 .assign(frequency=lambda df: df.raw_count / df.raw_count.sum())
 .assign(augmented_frequency=lambda df: df.frequency / df.frequency.max()))

**Takeaways**: These are simply numeric representations of one characteristic of the strings in our corpus (frequency). Aside from simply showing us that some words are more frequent than others, this information by itself doesn't provide us much value. 

### IDF: Inverse Document Frequency

Inverse Document Frequency also provides information about individual words, but, in order to use this measure, we must have multiple documents, i.e. several different bodies of text.

Inverse Document Frequency tells us how much **information** a word provides. It is based on how commonly a word appears across multiple documents. The metric is devised such that the more documents a word appears in, the lower the IDF for that word will be.

      idf(word) = log(# of documents / # of documents containing word)
      
> If a given word doesn't appear in any documents, the denominator in the equation above would be zero, so some definitions of idf will add 1 to the denominator.
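As a minimal sketch of the note above, the plain formula and one smoothed variant can be compared side by side. The function names `idf_plain` and `idf_smoothed` are illustrative and do not come from any library, and adding 1 to the denominator is only one of several smoothing conventions.

```python
import math

def idf_plain(n_documents, n_containing):
    """log(N / df); fails when no document contains the word."""
    return math.log(n_documents / n_containing)

def idf_smoothed(n_documents, n_containing):
    """log(N / (1 + df)); still defined when df is 0."""
    return math.log(n_documents / (1 + n_containing))

n_documents = 20
for n_containing in (0, 1, 10, 20):
    # the plain version would divide by zero when n_containing is 0
    plain = idf_plain(n_documents, n_containing) if n_containing else float("nan")
    smoothed = idf_smoothed(n_documents, n_containing)
    print(f"df={n_containing:>2}  plain={plain:.3f}  smoothed={smoothed:.3f}")
```

One side effect of this particular variant is that a word found in every document gets a slightly negative score, because the denominator becomes larger than the number of documents.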

For example, imagine we have 20 documents. We can visualize what the idf score looks like with the code below:

In [None]:
n_documents = 20

x = np.arange(1, n_documents + 1)
y = np.log(n_documents / x)

plt.figure(figsize=(12, 8))
plt.plot(x, y, marker='.')

plt.xticks(x)
plt.xlabel('# of Documents the word appears in')
plt.ylabel('IDF')
plt.title('IDF for a given word')

**Takeaways**: Suppose you are trying to create a model that predicts whether a given corpus was written before 1900 or after 1900. If a word appears in every document of your sample, it's not going to provide much insight. But if a word only appears in a small number of documents, then it could be representative of an underlying trend (e.g. "afternoonified" shows up in a small number of documents, all of which were written before 1900).

    High IDF = More information

Let's look at an example of calculating IDF:

In [None]:
# our 3 example documents
documents = {
    'news': 'Codeup announced last thursday that they just launched a new data science program. It is 18 weeks long.',
    'description': 'Codeup\'s data science program teaches hands on skills using Python and pandas.',
    'context': 'Codeup\'s data science program was created in response to a perceived lack of data science talent, and growing demand.'
}
pprint(documents)

print('\nCleaning and lemmatizing...\n')

documents = {topic: lemmatize(basic_clean(documents[topic])) for topic in documents}
pprint(documents)

In [None]:
# Visualize document values to help explain our upcoming idf function
documents.values()

In [None]:
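def idf(word):
    '''A simple way to calculate idf for demonstration. Note that this 
    function relies on a globally defined documents variable.'''
    # count the documents that contain the word as a whole token, not as a substring
    n_occurrences = sum([1 for doc in documents.values() if word in doc.split()])
    # take the log of the ratio, matching the idf formula above
    return np.log(len(documents) / n_occurrences)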

In [None]:
# Get a list of the unique words
unique_words = pd.Series(' '.join(documents.values()).split()).unique()
unique_words

In [None]:
# put the unique words into a data frame
(pd.DataFrame(dict(word=unique_words))
 # calculate the idf for each word
 .assign(idf=lambda df: df.word.apply(idf))
 # sort the data for presentation purposes
 .set_index('word')
 .sort_values(by='idf', ascending=False))

**Takeaways**: Words with the lowest IDF score were found in every document. They do us no good in helping us to distinguish whether a given corpus was "context", "description", or "news". Words with high IDF scores are more strongly linked to a particular classification. 

But this sample is so small that we should be cautious in using the word "on" as a means to classify a future corpus. 

> The calculation for an individual IDF score requires a word **and** a set of documents.

### TF-IDF

TF-IDF is simply the product of the two metrics we've discussed above. Let's calculate a TF-IDF for all of the words and documents:

In [None]:
# We will create an empty list to store values for us as we iterate through our data
tfs = []

# Start by iterating over all the documents. We can use .items() to get each document's name and text together:
documents.items()

In [None]:
# Create a for loop
for doc, text in documents.items():
    # We will make a dataframe that contains the term frequency for every word
    df = (pd.Series(text.split())
          .value_counts()
          .reset_index()
          .set_axis(['word', 'raw_count'], axis=1)
          .assign(tf=lambda df: df.raw_count / df.raw_count.sum())
          .drop(columns='raw_count')
          .assign(doc=doc))
    # Then add that data frame to our list
    tfs.append(df)

In [None]:
# What actually happened in that code block? Overexplanation using print statements:
print("BEGINNING LOOP")
print("\n")
for doc, text in documents.items():
    print("Text being manipulated:")
    print('-----------------------------------------')
    print(f'Document: {doc}')
    print(f'Text: {text}')
    print('\n')
    print("Step 1: Splitting the corpus into a list of words")
    print("df = (pd.Series(text.split()))")
    print('-----------------------------------------')
    df = (pd.Series(text.split()))
    print(df)
    print('\n')
    
    print("Step 2: Converting list of words into a value count array")
    print("df = df.value_counts()")
    print('-----------------------------------------')
    df = df.value_counts()
    print(df)
    print('\n')
    
    print("Step 3: Resetting the index")
    print("df = df.reset_index()")
    print('-----------------------------------------')
    df = df.reset_index()
    print(df)
    print('\n')
    
    print("Step 4: Relabeling the columns")
    print("df = df.set_axis(['word', 'raw_count'], axis=1)")
    print('-----------------------------------------')    
    df = df.set_axis(['word', 'raw_count'], axis=1)
    print(df)
    print('\n')    
    
    print("Step 5: Calculating the Term Frequency of Each Word within this one corpus")
    print("df['tf'] = df.raw_count / df.raw_count.sum()")
    print('-----------------------------------------')      
    df['tf'] = df.raw_count / df.raw_count.sum()
    print(df)
    print('\n')
    
    print("Step 6: Dropping the 'raw_count' column")
    print("df = df.drop(columns='raw_count')")
    print('-----------------------------------------')      
    df = df.drop(columns='raw_count')
    print(df)
    print('\n')    
    
    print("Step 7: Adding the document label for this corpus")
    print("df['doc'] = doc")
    print('-----------------------------------------') 
    df['doc'] = doc
    print(df)
    print('\n')
    print("ITERATION OF ELEMENT COMPLETE")
    print('\n', '\n')

In [None]:
tfs

In [None]:
# We'll then concatenate all the tf values together.
(pd.concat(tfs)
 # calculate the idf value for each word
 .assign(idf=lambda df: df.word.apply(idf))
 # then use the tf and idf values to calculate tf-idf 
 .assign(tf_idf=lambda df: df.idf * df.tf)
 .drop(columns=['tf', 'idf'])
 .sort_values(by='tf_idf', ascending=False)
 .reset_index(drop=True))

It's more common to see the data presented with the words as features, and the documents as observations, like this:

In [None]:
# We'll then concatenate all the tf values together.
print("TF-IDF for each word/doc combination")
(pd.concat(tfs)
 # calculate the idf value for each word
 .assign(idf=lambda df: df.word.apply(idf))
 # then use the tf and idf values to calculate tf-idf 
 .assign(tf_idf=lambda df: df.idf * df.tf)
 .drop(columns=['tf', 'idf'])
 .sort_values(by='tf_idf', ascending=False)
 # reshape so that each document is a row and each word is a column
 .pivot(index='doc', columns='word', values='tf_idf')
 .fillna(0))

## TF-IDF with scikit-learn

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
tfidfs = tfidf.fit_transform(documents.values())
tfidfs

We get back a sparse matrix, that is, a matrix in which most of the values are 0. SciPy provides special sparse matrix types that store only the non-zero values, which saves memory and makes some operations faster.
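As a rough sketch of what "sparse" means in practice, the example below builds a small hand-made matrix rather than using the lesson's data. It assumes SciPy is installed (scikit-learn depends on it), and the variable names are illustrative.

```python
import numpy as np
from scipy import sparse

# A mostly-zero matrix, similar in spirit to a document-term matrix
dense = np.array([
    [0.0, 0.0, 0.7, 0.0, 0.0, 0.0],
    [0.5, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.3, 0.0],
])

# CSR (compressed sparse row) format stores only the non-zero entries
sparse_matrix = sparse.csr_matrix(dense)

print(sparse_matrix.shape)      # same shape as the dense version
print(sparse_matrix.nnz)        # number of stored (non-zero) values
print(sparse_matrix.toarray())  # convert back to a regular array
```

With thousands of documents and tens of thousands of distinct words, the gap between the number of stored values and the full size of the matrix becomes very large, which is why vectorizers return this format.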

Because our data set is pretty small, we can convert our sparse matrix to a regular one, and put everything in a dataframe. If our data were larger, the operation below might take much longer and use far more memory.

In [None]:
# get_feature_names_out replaces get_feature_names, which newer scikit-learn versions removed
pd.DataFrame(tfidfs.toarray(), columns=tfidf.get_feature_names_out())

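Why are the values different? Because our manual version used a simplified formula. By default, scikit-learn's `TfidfVectorizer` uses the raw count for TF, a smoothed IDF of `ln((1 + n) / (1 + df)) + 1` (where `n` is the number of documents and `df` is the number of documents that contain the word), and then scales each document's row to unit length (L2 normalization).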
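To make the difference concrete, here is a minimal sketch that reproduces scikit-learn's default output by hand. It assumes the default `TfidfVectorizer` settings (raw counts, smoothed IDF, L2 normalization) and uses a made-up toy corpus rather than the lesson's documents.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus, not the lesson's documents
corpus = [
    "data science program",
    "data science talent",
    "python and pandas",
]

# Raw term counts: the TF that scikit-learn uses by default
counts = CountVectorizer().fit_transform(corpus).toarray()

n_documents = counts.shape[0]
# Number of documents that contain each term
doc_freq = (counts > 0).sum(axis=0)

# Smoothed IDF: ln((1 + n) / (1 + df)) + 1
idf = np.log((1 + n_documents) / (1 + doc_freq)) + 1

# Multiply TF by IDF, then scale each row to unit (L2) length
tfidf_manual = counts * idf
tfidf_manual = tfidf_manual / np.linalg.norm(tfidf_manual, axis=1, keepdims=True)

tfidf_sklearn = TfidfVectorizer().fit_transform(corpus).toarray()
print(np.allclose(tfidf_manual, tfidf_sklearn))
```

`CountVectorizer` is used for the counts so that the tokenization and column order match `TfidfVectorizer`; the only hand-written parts are the IDF formula and the normalization.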

## Modeling

Now we'll use the computed TF-IDF values as features in a model. We'll take a look at the spam data set first.

Because of the way we are modeling the data, we have a lot of columns, and it is not uncommon to have more columns than rows. Also, our data is very imbalanced in the class distribution, that is, there are many more ham messages than spam messages.
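One consequence of class imbalance is that accuracy alone can look good for a model that never finds any spam. The sketch below illustrates this with made-up labels; the 90/10 split is an assumption for illustration, not the actual distribution of the spam data set.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Made-up labels: 90 ham messages and 10 spam messages
y = np.array(["ham"] * 90 + ["spam"] * 10)
# The features are irrelevant to this baseline, so zeros are enough
X = np.zeros((len(y), 1))

# A baseline that always predicts the most frequent class
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
predicted = baseline.predict(X)

print(f"Accuracy: {accuracy_score(y, predicted):.2%}")
print(f"Spam recall: {recall_score(y, predicted, pos_label='spam'):.2%}")
```

This is why it helps to look at precision and recall for each class, for example with `classification_report`, and not only at accuracy.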

Other than these considerations, we can treat this as a standard classification problem. We'll use logistic regression as an example:

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score
from env import user, password, host

def get_db_url(database, host=host, user=user, password=password):
    return f'mysql+pymysql://{user}:{password}@{host}/{database}'

url = get_db_url("spam_db")
sql = "SELECT * FROM spam"

df = pd.read_sql(sql, url, index_col="id")
df.head()

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(df.text)
y = df.label

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=.2)

train = pd.DataFrame(dict(actual=y_train))
test = pd.DataFrame(dict(actual=y_test))

lm = LogisticRegression().fit(X_train, y_train)

train['predicted'] = lm.predict(X_train)
test['predicted'] = lm.predict(X_test)

In [None]:
print('Accuracy: {:.2%}'.format(accuracy_score(train.actual, train.predicted)))
print('---')
print('Confusion Matrix')
print(pd.crosstab(train.predicted, train.actual))
print('---')
print(classification_report(train.actual, train.predicted))

In [None]:
print('Accuracy: {:.2%}'.format(accuracy_score(test.actual, test.predicted)))
print('---')
print('Confusion Matrix')
print(pd.crosstab(test.predicted, test.actual))
print('---')
print(classification_report(test.actual, test.predicted))
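In the code above, the vectorizer is fit on every message before the split, so the IDF weights are computed with the test messages included. One possible alternative, sketched here under the assumption that a scikit-learn pipeline is acceptable for this workflow, is to split the raw text first and let the pipeline fit the vectorizer on the training portion only. The messages below are made up and stand in for the spam data set.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Made-up messages standing in for the spam data set
texts = [
    "win a free prize now", "free cash click now", "claim your free prize",
    "urgent win cash today", "are we still on for lunch", "see you at the meeting",
    "can you call me later", "thanks for the notes", "running late be there soon",
    "happy birthday have a great day",
]
labels = ["spam"] * 4 + ["ham"] * 6

# Split the raw text first, so the test set stays unseen
text_train, text_test, y_train, y_test = train_test_split(
    texts, labels, stratify=labels, test_size=.3, random_state=123
)

# The pipeline fits the vectorizer on the training text only
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(text_train, y_train)

print(model.predict(text_test))
print(f"Test accuracy: {model.score(text_test, y_test):.2%}")
```

With a data set this small the scores mean little; the point is only the order of operations, where the split happens before any fitting.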

## Exercises

Do your work for this exercise in a file named `model.ipynb`.

Take the work we did in the lessons further:

1. What other types of models (i.e. different classification algorithms) could you use? Create a model with a different algorithm.
2. How do the models compare when trained on term frequency data alone, instead of TF-IDF values?