# Text Cleaning Before Document Vectorization | BAIS:6100

**Instructor: Qihang Lin**

When we use vectorizers from **sklearn** to create a document-term matrix (DTM), we have less control over how the documents are cleaned before they are converted. For example, when stemming is added to a sklearn vectorizer, it is applied after the n-grams are created. As a result, only the last term in each n-gram is stemmed.

If we want to do something different, we can use text cleaning techniques and a for-loop to clean each document before it is passed to the vectorizer. This way, we control the process better and have more flexibility.
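To see why the order of stemming and n-gram creation matters, here is a minimal sketch in plain Python. `toy_stem` is a hypothetical stand-in that only strips a trailing "s"; it is not the Snowball stemmer used in this notebook, and the tokens are made up for illustration.

```python
def toy_stem(term):
    # Hypothetical stand-in for a real stemmer: strip one trailing "s".
    # Given a whole bigram string, it only looks at the end of that string.
    return term[:-1] if term.endswith("s") else term

def bigrams(tokens):
    # Join each pair of neighboring tokens with a space.
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]

tokens = ["drinks", "specials", "tonight"]

# Stem after building bigrams: only the end of each bigram string can change.
stem_after = [toy_stem(b) for b in bigrams(tokens)]

# Stem before building bigrams: every token is stemmed on its own.
stem_before = bigrams([toy_stem(t) for t in tokens])

print(stem_after)
print(stem_before)
```

In this toy case, "drinks" stays unstemmed whenever it is the first term of a bigram in `stem_after`, while `stem_before` stems both terms.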

In [1]:
import pandas as pd
import nltk
df = pd.read_csv("classdata/clinton-street-social-club.csv",encoding="latin-1")

In [2]:
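#Create the stop word list from nltk (requires the "stopwords" corpus, e.g. via nltk.download("stopwords"))
global_stopwords = nltk.corpus.stopwords.words("english")
#Create the Snowball stemmer; ignore_stopwords=True leaves stop words unstemmed
stemmer = nltk.stem.SnowballStemmer("english", ignore_stopwords=True)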

The following code cleans each document without using a vectorizer. We can put the cleaning steps in any order, keeping in mind that the order can change the result.

In [3]:
temptext=[]                              #Create an empty list to receive the cleaned docs.
for doc in df["reviews"]:                #Loop over the reviews one at a time.
  tokens = nltk.word_tokenize(doc)       #Tokenize a review (needs nltk's punkt tokenizer data)
  tokens = [s.lower() for s in tokens]   #Convert tokens to lowercase
  tokens = [s for s in tokens if s not in global_stopwords] #Remove stop words
  tokens = [s for s in tokens if len(s)>2]                  #Remove short tokens (<=2 characters)
  tokens = [stemmer.stem(s) for s in tokens]                #Stem each token
  doc = " ".join(tokens)                 #Join tokens with whitespace to create a cleaned review
  temptext.append(doc)                   #Add it to the receiving list.
df["reviews_clean"]=temptext             #Create a new column with the cleaned reviews.
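The steps in the loop can be reordered, but the order can change the result. As a minimal sketch, assuming a made-up three-word stop list and a made-up review (not the nltk list or the course data), compare lowercasing before and after stop word removal. The nltk English stop words are all lowercase, so a capitalized "The" is only caught if lowercasing comes first.

```python
# Hypothetical mini stop word list and tokens, for illustration only.
stopwords = {"the", "is", "a"}
tokens = ["The", "bar", "is", "a", "gem"]

# Lowercase first, then remove stop words (the order used in the loop above).
lowered = [s.lower() for s in tokens]
lower_then_filter = [s for s in lowered if s not in stopwords]

# Remove stop words first, then lowercase: "The" slips through the filter.
kept = [s for s in tokens if s not in stopwords]
filter_then_lower = [s.lower() for s in kept]

print(lower_then_filter)
print(filter_then_lower)
```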

In [4]:
#Check the first cleaned review.
df["reviews_clean"][0]

"jazzi vibe chill atmospher clinton street social club may speakeasi iowa citi entranc small door next short burger barber shop n't know re look may never find serious mimic new orlean cultur delici food creativ alcohol beverag must tri beignet shrimp cocktail house-batter curd sweet corn fritter alcohol drink love ramo gin fizz tend like anyth tast alcohol one right balanc right amount fizz ad foam egg white uniqu compar usual bar locat downtown opportun learn tend bar know bartend joy girl definit know alcohol made delici drink told histor discoveri"

Finally, we pass the cleaned documents to the vectorizer. Since we have already removed the stop words, we don't need to remove them again during vectorization. This example uses the standard count vectorizer. We can add options (e.g. TF-IDF weighting, a customized vocabulary, n-grams) if we want.

Here, we use bigrams to create the DTM. You can compare it with the bigram DTM we created in "Document Vectorization.ipynb". In that notebook, we could only stem the second term in each bigram, but here we stem both terms.

In [5]:
from sklearn.feature_extraction.text import CountVectorizer 
vectorizer = CountVectorizer(ngram_range=(2,2))
DTM = vectorizer.fit_transform(df["reviews_clean"])
DTM.shape

(132, 7082)
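One detail to keep in mind: `CountVectorizer` tokenizes the cleaned text again with its own default pattern, `(?u)\b\w\w+\b`, which keeps only runs of two or more word characters. The sketch below applies that same regular expression with Python's `re` module to a short fragment in the style of the cleaned review. It approximates only the tokenization step and is not a substitute for running the vectorizer.

```python
import re

# Default token_pattern of sklearn's CountVectorizer.
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"

# A short fragment in the style of the cleaned review above.
cleaned = "n't know house-batter curd"

# Tokens the default pattern would extract from the cleaned text.
retokenized = re.findall(DEFAULT_TOKEN_PATTERN, cleaned)
print(retokenized)
```

Here `n't` is dropped and `house-batter` is split into two tokens. If the tokens should be kept exactly as cleaned, one option is to pass a whitespace-based pattern such as `token_pattern=r"\S+"` to the vectorizer.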

In [6]:
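#Use a new name so the review DataFrame df is not overwritten.
#get_feature_names() was removed in recent sklearn versions; use get_feature_names_out().
term_freq = pd.DataFrame({'Term': vectorizer.get_feature_names_out(),
                          'Frequency': DTM.sum(axis=0).tolist()[0]
                         })

term_freq.sort_values(by="Frequency",inplace=True,ascending=False)
term_freq.reset_index(inplace=True,drop=True)
term_freq[0:10]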

| | Term | Frequency |
|---|---|---|
| 0 | iowa citi | 51 |
| 1 | clinton street | 37 |
| 2 | chees curd | 34 |
| 3 | social club | 31 |
| 4 | street social | 23 |
| 5 | happi hour | 18 |
| 6 | devil egg | 14 |
| 7 | mac chees | 12 |
| 8 | feel like | 11 |
| 9 | one best | 10 |
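
The `Frequency` column is the column sum of the DTM, i.e. the number of times each bigram occurs across all reviews. As a rough cross-check of that idea, the sketch below counts bigrams directly with `collections.Counter` on two made-up cleaned documents (not the review data).

```python
from collections import Counter

# Two made-up cleaned documents, for illustration only.
docs = ["iowa citi chees curd", "chees curd iowa citi bar"]

counts = Counter()
for doc in docs:
    tokens = doc.split()
    # Count each pair of neighboring tokens as one bigram.
    counts.update(" ".join(pair) for pair in zip(tokens, tokens[1:]))

# Show the bigrams from most to least frequent.
for term, freq in counts.most_common():
    print(term, freq)
```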
