# Tokenizing 

In this notebook I first lemmatize my data and then tokenize it with two methods: a term frequency-inverse document frequency (TF-IDF) vectorizer and a count vectorizer. I will use the count-vectorized dataframe mainly for EDA, since it simply counts how often a word appears in a document (our documents will be rap songs) and is therefore easy to interpret. I will reserve my TF-IDF data for modeling.
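
To make the difference between the two representations concrete, here is a minimal sketch on a made-up two-song corpus. The `toy_songs` list is an illustrative assumption, not part of my data. Counts are raw occurrences, while TF-IDF down-weights words that appear in many documents and, with scikit-learn's defaults, scales each row to unit length.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import numpy as np

# Hypothetical toy corpus, for illustration only
toy_songs = ["money money money", "money love"]

count_vec = CountVectorizer()
counts = count_vec.fit_transform(toy_songs).toarray()

tfidf_vec = TfidfVectorizer()
tfidf = tfidf_vec.fit_transform(toy_songs).toarray()

# Column positions of each word in the fitted vocabulary
love = count_vec.vocabulary_["love"]
money = count_vec.vocabulary_["money"]

print(counts)           # raw word counts per song
print(tfidf.round(3))   # TF-IDF weights per song
```

In the second song both words occur once, but "love" receives the larger TF-IDF weight because it appears in only one of the two songs.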

In [4]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer, TfidfTransformer, CountVectorizer
import numpy as np
from sklearn.feature_extraction import stop_words
import nltk
from nltk.stem import WordNetLemmatizer
import re
import time

Reading in my recently cleaned data

In [5]:
df = pd.read_csv('./clean_data_years.csv', index_col=0)

In [6]:
df.head(7)

Unnamed: 0,lyrics,date,title,artist,year,hot100
0,My servants began to forge what was to become...,1999-04-20,?,MF Doom,1999.0,0.0
1,"""Things take a turn for the worse"" ""Send him b...",2009-03-24,Absolutely,MF Doom,2009.0,0.0
2,One more beer And I'll take you all All of yo...,2002-01-01,All Outta Ale,MF Doom,2002.0,0.0
3,"Yea, that's right It's not a Hardy Boy myster...",2009-03-24,Angelz,MF Doom,2009.0,0.0
4,It was back in the days when I met a brillian...,1999-04-20,Back in the Days,MF Doom,1999.0,0.0
5,"HMMMM The flow is toe in, precision as an afr...",2009-03-24,Ballskin,MF Doom,2009.0,0.0
6,If you're waiting for a parade there ain't no...,2009-03-24,Batty Boyz,MF Doom,2009.0,0.0


Instantiating my WordNet lemmatizer, which I'll use for lemmatizing

In [7]:
wnt = WordNetLemmatizer()

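Creating a function that will lowercase my lyrics, get rid of double quotes, commas, and periods, and remove 'ing' and its variants ('in' and "in'"). The goal is to make the lemmatization go smoothly. Note that the pattern matches anywhere in the text, not only at word endings, so a word like 'things' comes out as 'thgs'.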

In [8]:
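def lemmatize(df):
    """Normalize and lemmatize df['lyrics'] in place (returns None)."""
    df['lyrics'] = df['lyrics'].str.lower()
    # Strip double quotes, commas, and periods
    df['lyrics'] = df['lyrics'].str.replace('"', '', regex=False)
    # regex=True is needed on newer pandas, where str.replace is literal by default
    df['lyrics'] = df['lyrics'].str.replace(r'(,|\.)', '', regex=True)
    # "in" is tried first in the alternation, so every "in" sequence is removed,
    # including ones in the middle of a word
    df['lyrics'] = df['lyrics'].map(lambda x: re.sub("(in|ing|in')", '', x))
    # Lemmatize each whitespace-separated word (WordNet treats it as a noun by default)
    df['lyrics'] = df['lyrics'].map(lambda x: ' '.join([wnt.lemmatize(word) for word in x.split()]))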
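
The suffix-stripping step can be checked in isolation. The sketch below applies the same pattern to a made-up sample string and compares it with one possible alternative that only strips a trailing "ing" or "in'". The name `suffix_pattern` and the rule it encodes are my own assumption, not what was run on the data, and it still has limits (for example, it would shorten "thing" to "th").

```python
import re

# Pattern used in lemmatize(): "in" is tried first, so "ing" never matches as a whole
original_pattern = "(in|ing|in')"

# Hypothetical alternative: strip only a trailing "ing" / "in'" that follows
# at least two word characters and is not followed by another word character
suffix_pattern = r"(?<=\w\w)(?:ing|in')(?!\w)"

sample = "things runnin' running king"
original_result = re.sub(original_pattern, "", sample)
suffix_result = re.sub(suffix_pattern, "", sample)

print(original_result)
print(suffix_result)
```

With the alternative, "runnin'" and "running" collapse to the same stem while "things" and "king" are left alone.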

Lemmatizing the data

In [9]:
lemmatize(df)

Checking my work

In [10]:
df.head()

Unnamed: 0,lyrics,date,title,artist,year,hot100
0,my servant began to forge what wa to become th...,1999-04-20,?,MF Doom,1999.0,0.0
1,thgs take a turn for the worse send him back w...,2009-03-24,Absolutely,MF Doom,2009.0,0.0
2,one more beer and i'll take you all all of you...,2002-01-01,All Outta Ale,MF Doom,2002.0,0.0
3,yea that's right it's not a hardy boy mystery ...,2009-03-24,Angelz,MF Doom,2009.0,0.0
4,it wa back the day when i met a brilliant stud...,1999-04-20,Back in the Days,MF Doom,1999.0,0.0


Saving my lemmatized data

In [12]:
df.to_csv('./final_clean_lemma_df.csv')

Creating a custom list of stop words to exclude from my tokenized dataframes. Many of the entries are word fragments left over from punctuation, like the 'll' in 'I'll'. They show up because the vectorizers split tokens on punctuation.

In [13]:
custom_stopwords = list(stop_words.ENGLISH_STOP_WORDS)

In [14]:
custom_stopwords.extend(['like', 'll', 'ain','don','em','er','wa',
                         'ya','just','let','got','den','ol','izz','im',
                         'letting','hol','right','hah','dat','ve','mon',
                         'la', 'aw','whit','ma','da','uhh','gon','wit'])
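
The fragments in this list come from how scikit-learn's default analyzer splits text: it lowercases the input and keeps only runs of two or more word characters, so an apostrophe ends a token. A small sketch on a made-up sentence:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Default analyzer: lowercase, then keep runs of two or more word characters
analyze = CountVectorizer().build_analyzer()

# Hypothetical sample sentence, for illustration only
tokens = analyze("I'll say they don't and ain't")
print(tokens)
```

The contractions leave behind 'll', 'don', and 'ain', which is why those fragments are added to the stop word list.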

Tokenizing my dataframe. I'm excluding words that appear in fewer than $5$ documents and words that appear in more than $70\%$ of the documents. This will hopefully tame the size of my tokenized dataframes.

In [15]:
cvec = CountVectorizer(stop_words=custom_stopwords, min_df=5,max_df=.7)

In [16]:
tvec = TfidfVectorizer(stop_words=custom_stopwords, min_df=5,max_df=.7)

In [17]:
Xc = cvec.fit_transform(df['lyrics'])

In [18]:
Xt = tvec.fit_transform(df['lyrics'])
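
To illustrate how `min_df` and `max_df` prune the vocabulary, here is a minimal sketch on a made-up four-document corpus. The `toy_docs` list and the threshold `min_df=2` are assumptions chosen so the effect is visible on tiny data. An integer `min_df` is a document count, and a float `max_df` is a fraction of documents.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, for illustration only
toy_docs = ["rap flow beat", "rap flow", "rap rhyme", "rap"]

# Keep words found in at least 2 documents and in at most 70% of documents
vec = CountVectorizer(min_df=2, max_df=0.7)
vec.fit(toy_docs)

kept = sorted(vec.vocabulary_)
print(kept)
```

Here "rap" is dropped for appearing in every document, "beat" and "rhyme" are dropped for appearing in only one, and only "flow" survives.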

Creating tokenized dataframes out of my tokenized arrays to be used for modeling and EDA

In [30]:
# Take column names from tvec, the vectorizer that produced Xt
df_token = pd.DataFrame(Xt.toarray(), columns=tvec.get_feature_names())

In [31]:
df_token['date_year'] = df['year']

In [32]:
X = pd.DataFrame(Xc.toarray(),columns=cvec.get_feature_names())

In [33]:
X['date_year'] = df['year']
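
Assigning a column this way matches rows on index labels, not on position. If the index read from the CSV is exactly 0 to n-1, the two agree. The sketch below, on made-up data, shows what would happen otherwise. The `meta` and `tokens` frames are hypothetical stand-ins for `df` and the tokenized dataframes.

```python
import pandas as pd

# Hypothetical metadata whose index has gaps (e.g. rows dropped during cleaning)
meta = pd.DataFrame({"year": [1999.0, 2009.0, 2002.0]}, index=[0, 2, 5])

# A frame built from an array gets a fresh 0..n-1 index
tokens = pd.DataFrame({"beer": [0, 1, 2]})

tokens["aligned_by_label"] = meta["year"]                 # matches on index labels
tokens["aligned_by_position"] = meta["year"].to_numpy()   # matches row by row
print(tokens)
```

Rows whose labels have no match end up as NaN in the label-aligned column, so a quick `isna().sum()` on `date_year` is a cheap check.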

Saving my dataframes

In [34]:
X.to_csv('./X.csv')

In [38]:
df_token.to_csv('./df_token.csv')

In [39]:
df_token.shape

(17228, 22950)
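
A dense frame of this shape holds 17228 × 22950 ≈ 395 million values, which is roughly 3 GB if stored as 8-byte floats. One possible way to reduce that, sketched here on made-up data and assuming a pandas version that provides `pd.DataFrame.sparse.from_spmatrix`, is to build the dataframe directly from the sparse matrix instead of calling `.toarray()`. I did not use this in the notebook.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, for illustration only
toy_docs = ["rap flow beat", "rap flow", "rap rhyme"]
vec = CountVectorizer()
X_sparse = vec.fit_transform(toy_docs)   # SciPy sparse matrix

# Column names in column order, taken from the fitted vocabulary
columns = sorted(vec.vocabulary_, key=vec.vocabulary_.get)
sparse_df = pd.DataFrame.sparse.from_spmatrix(X_sparse, columns=columns)

print(sparse_df.shape)
print(sparse_df.sparse.density)   # fraction of entries actually stored
```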