## Part 2 - Preprocessing the Text Data

In this notebook, we will begin some NLP preprocessing and feed our text data into a TFIDF vectorizer. We need to do this before we can start modeling, specifically LDA Topic Modeling.

In [2]:
#Importing the holy trinity
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

#Importing NLP plugins
from nltk.corpus import stopwords 
ENGLISH_STOP_WORDS = stopwords.words('english')
from nltk.stem import PorterStemmer

#Import our TfidfVectorizer plugin
from sklearn.feature_extraction.text import TfidfVectorizer

# We don't need to see the warnings :) 
import warnings
warnings.filterwarnings('ignore')

#### Let us create a helper function to ease our workflow

In [6]:
def tokenizer(text):
    '''
    Simple tokenizer:
    1.) Removes stopwords
    2.) Use PorterStemmer
    '''
    
    #Split each word up in text, which is a long string of words. 
    #These words are called tokens
    
    list_of_tokens = text.split(' ')
    
    #Let us use a stemmer
    stemmer = PorterStemmer()
    
    #list of cleaned_tokens
    cleaned_tokens = []

    #Remove stopwords
    for token in list_of_tokens:
        if token not in ENGLISH_STOP_WORDS:
            #Stem the remaining word
            token_stemmed = stemmer.stem(token)

            cleaned_tokens.append(token_stemmed)
            
    return cleaned_tokens
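
One thing worth checking about the tokenizer above: `text.split(' ')` splits on single spaces only and leaves case untouched, while NLTK's English stopword list is all lowercase. The sketch below illustrates that behavior with the standard library only. It assumes a tiny made-up stopword set (`TOY_STOP_WORDS`) in place of the NLTK list, uses a hypothetical helper name (`split_and_filter`), and skips stemming entirely.

```python
# Hypothetical stand-in for ENGLISH_STOP_WORDS (assumption, for illustration only)
TOY_STOP_WORDS = {"the", "is", "a", "on"}

def split_and_filter(text, lowercase=False):
    """Split on single spaces and drop stopwords, mirroring the loop above (no stemming)."""
    if lowercase:
        text = text.lower()
    return [tok for tok in text.split(' ') if tok not in TOY_STOP_WORDS]

sample = "The cat is  on a mat"  # note the double space

raw_tokens = split_and_filter(sample)
lowered_tokens = split_and_filter(sample, lowercase=True)

print(raw_tokens)      # 'The' survives, and '' appears because of the double space
print(lowered_tokens)  # 'the' is now filtered; the empty token is still there
```

When the function is passed to `TfidfVectorizer` as `tokenizer`, the vectorizer's default `lowercase=True` lowercases the text before tokenizing, so the case issue mainly shows up when calling the tokenizer directly. Using `text.split()` with no argument is one way to avoid the empty tokens.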

Now that we have created our tokenizer function, which will be fed into our TFIDF vectorizer down below, ideally I want to optimize the min_df value of the vectorizer. The **min_df** parameter is used for removing terms that appear too infrequently.

For example:
* min_df = 0.01 means "ignore terms that appear in fewer than 1% of the documents".
* min_df = 5 means "ignore terms that appear in fewer than 5 documents".
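
To make the two readings of min_df concrete, here is a minimal standard-library sketch of the thresholding idea. It is not scikit-learn's implementation: the helper name `filter_by_min_df` and the toy corpus are made up for illustration, and it only mimics the documented rule that a float is a proportion of documents and an int is an absolute document count.

```python
from collections import Counter

def filter_by_min_df(tokenized_docs, min_df):
    """Return the sorted terms whose document frequency meets the min_df threshold."""
    n_docs = len(tokenized_docs)
    # Document frequency: number of documents containing each term at least once
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    # A float is read as a proportion of documents, an int as an absolute count
    threshold = min_df * n_docs if isinstance(min_df, float) else min_df
    return sorted(term for term, df in doc_freq.items() if df >= threshold)

# Toy corpus of already-tokenized documents (made up for illustration)
toy_docs = [
    ["cat", "sat", "mat"],
    ["cat", "ate", "fish"],
    ["dog", "sat", "log"],
    ["cat", "dog", "play"],
]

kept_by_count = filter_by_min_df(toy_docs, 2)     # keep terms in at least 2 documents
kept_by_share = filter_by_min_df(toy_docs, 0.75)  # keep terms in at least 75% of documents

print(kept_by_count)
print(kept_by_share)
```

Raising min_df shrinks the vocabulary, which is why scanning a range of values can be a reasonable way to pick one.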

In [None]:
min_df_range = range(100, 5000, 50)

for i in min_df_range:
    