## BoW and Tf-idf NLP Modeling
August, 2018 - __Christopher Sanchez__

Building a model for natural language processing is a multi-step process. Some of the possible steps include:
- processing, cleaning, and parsing the language data
- creating features using various NLP methods (bag of words and tf-idf are used in this example)
- fitting supervised learning models to the created features
- examining the effectiveness of the models using cross-validation
- refining the models to improve their performance

Cell one contains some of the imports that will be used in our model.

In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import sklearn
import re
import sys
import warnings
from sklearn import ensemble
from sklearn.model_selection import train_test_split, cross_val_score

if not sys.warnoptions:
    warnings.simplefilter("ignore")
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

Next, the dataset is loaded and links are stripped from the posts.

In [2]:
df = pd.read_csv('mbti.csv')
df['posts'] = df['posts'].replace(r'http\S+', '', regex=True).replace(r'www\S+', '', regex=True)
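To see what the two patterns remove, here is a minimal sketch that applies the same substitutions to plain strings with the standard `re` module. The sample strings and the `strip_links_keep_delimiter` variant are illustrative assumptions, not part of the original notebook. The posts in a row appear to be joined by `|||`, so a link followed directly by that delimiter is worth checking.

```python
import re

def strip_links(text):
    """Apply the same two substitutions as the pandas call above."""
    text = re.sub(r'http\S+', '', text)
    return re.sub(r'www\S+', '', text)

# Hypothetical sample text, not taken from the dataset.
sample = "see https://example.com/page and www.example.org today"
cleaned = strip_links(sample)
print(repr(cleaned))

# \S+ runs until the next whitespace, so it also swallows a '|||'
# delimiter (and the first word after it) when no space follows the link.
joined = "http://example.com/a|||Next post starts here"
print(repr(strip_links(joined)))

def strip_links_keep_delimiter(text):
    """Hypothetical variant: stop the match at whitespace or a '|' character."""
    return re.sub(r'(?:http|www)[^\s|]+', '', text)

print(repr(strip_links_keep_delimiter(joined)))
```

Whether the swallowed delimiter matters depends on how the posts are tokenized later; the variant is only one possible way to preserve it.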

Display the head for a first look at the data.

In [3]:
df.head()

Unnamed: 0,type,posts
0,INFJ,' and intj moments sportscenter not top ten...
1,ENTP,'I'm finding the lack of me in these posts ver...
2,INTP,"'Good one _____ course, to which I say I k..."
3,INTJ,"'Dear INTP, I enjoyed our conversation the o..."
4,ENTJ,'You're fired.|||That's another silly misconce...


`value_counts` is a great way to display the unique class values and the count of each. If the data is dominated by one class, it can be hard to build an effective model.

In [4]:
df.type.value_counts()

INFP    1832
INFJ    1470
INTP    1304
INTJ    1091
ENTP     685
ENFP     675
ISTP     337
ISFP     271
ENTJ     231
ISTJ     205
ENFJ     190
ISFJ     166
ESTP      89
ESFP      48
ESFJ      42
ESTJ      39
Name: type, dtype: int64
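The counts above also give two simple reference points for judging a classifier: the accuracy of always guessing the most common type, and the accuracy of guessing uniformly at random. The sketch below copies the counts from the output above into a plain dictionary; the variable names are illustrative.

```python
# Class counts copied from the value_counts() output above.
counts = {
    'INFP': 1832, 'INFJ': 1470, 'INTP': 1304, 'INTJ': 1091,
    'ENTP': 685, 'ENFP': 675, 'ISTP': 337, 'ISFP': 271,
    'ENTJ': 231, 'ISTJ': 205, 'ENFJ': 190, 'ISFJ': 166,
    'ESTP': 89, 'ESFP': 48, 'ESFJ': 42, 'ESTJ': 39,
}

total = sum(counts.values())
majority_type, majority_count = max(counts.items(), key=lambda kv: kv[1])

# Accuracy of a model that always predicts the most common type.
majority_baseline = majority_count / total
# Accuracy expected from guessing each of the classes with equal probability.
uniform_baseline = 1 / len(counts)

print(f"rows: {total}")
print(f"always predict {majority_type}: {majority_baseline:.3f}")
print(f"uniform random guess: {uniform_baseline:.3f}")
```

These figures are computed on the full dataset; the share of each type in a particular train or test split may differ slightly.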

The map function converts the categorical class labels to numerical values.

In [5]:
df['type'] = df['type'].map({'INFJ':0, 'ENTP':1, 'INTP':2, 'INTJ':3, 'ENTJ':4, 'ENFJ':5, 'INFP':6, 'ENFP':7,
       'ISFP':8, 'ISTP':9, 'ISFJ':10, 'ISTJ':11, 'ESTP':12, 'ESFP':13, 'ESTJ':14, 'ESFJ':15})
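Once the labels are integers, it helps to be able to turn predictions back into type names. A minimal sketch, assuming the same dictionary as in the cell above; `code_to_type` and `predicted_codes` are hypothetical names introduced for illustration.

```python
# Same mapping as the cell above.
type_to_code = {'INFJ': 0, 'ENTP': 1, 'INTP': 2, 'INTJ': 3, 'ENTJ': 4, 'ENFJ': 5,
                'INFP': 6, 'ENFP': 7, 'ISFP': 8, 'ISTP': 9, 'ISFJ': 10, 'ISTJ': 11,
                'ESTP': 12, 'ESFP': 13, 'ESTJ': 14, 'ESFJ': 15}

# Invert the dictionary so integer predictions can be read as type names.
code_to_type = {code: name for name, code in type_to_code.items()}

# Hypothetical model output, for illustration only.
predicted_codes = [6, 0, 2]
decoded = [code_to_type[code] for code in predicted_codes]
print(decoded)
```

With pandas, any label missing from the dictionary passed to `map` becomes `NaN`, so a quick check for missing values after mapping is a cheap safeguard.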

Ensure that the map function worked as intended.

In [6]:
df.head()

Unnamed: 0,type,posts
0,0,' and intj moments sportscenter not top ten...
1,1,'I'm finding the lack of me in these posts ver...
2,2,"'Good one _____ course, to which I say I k..."
3,3,"'Dear INTP, I enjoyed our conversation the o..."
4,4,'You're fired.|||That's another silly misconce...


Create the input feature and the output in order to begin building the models, starting with bag of words.

In [65]:
X = df['posts']
y = df['type']

# Display the shape to make sure the length is the same.
print(X.shape)
print(y.shape)

(8675,)
(8675,)


Split the data with the default settings (the test set is 25%). It is important to split the data so the model can be properly evaluated later on.

In [66]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=24)
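The row counts produced by the default split can be checked by hand. This sketch assumes that scikit-learn rounds the test size up when the fraction does not divide evenly, which is consistent with the matrix shapes reported further down.

```python
import math

n_samples = 8675        # rows in the dataset, from the shapes printed above
test_fraction = 0.25    # default test size when none is given

# Assumption: the test size is rounded up and the rest goes to training.
n_test = math.ceil(test_fraction * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```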

CountVectorizer is an effective way to create the bag of words model.

In [67]:
from sklearn.feature_extraction.text import CountVectorizer

# max_features=1500 keeps only the 1500 terms with the highest total count across the corpus.
vectorizer = CountVectorizer(max_features=1500)
# Learn the vocabulary from X_train and transform X_train into a sparse count matrix.
X_train_matrix = vectorizer.fit_transform(X_train)

X_train_matrix

<6506x1500 sparse matrix of type '<class 'numpy.int64'>'
	with 2261801 stored elements in Compressed Sparse Row format>
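To make the bag-of-words idea concrete, here is a simplified standard-library sketch of what a count vectorizer with a vocabulary cap does. It is not scikit-learn's implementation: the tokenizer (lowercase, words of two or more characters) only approximates the default behavior, the rows are dense lists rather than a sparse matrix, and the toy documents are made up.

```python
import re
from collections import Counter

TOKEN = re.compile(r"\b\w\w+\b")

def tokenize(doc):
    """Lowercase the text and keep words of two or more characters."""
    return TOKEN.findall(doc.lower())

def fit_vocabulary(docs, max_features):
    """Keep the max_features most frequent terms and assign column indices."""
    totals = Counter()
    for doc in docs:
        totals.update(tokenize(doc))
    top_terms = [term for term, _ in totals.most_common(max_features)]
    return {term: index for index, term in enumerate(sorted(top_terms))}

def transform(docs, vocab):
    """Count vocabulary terms per document; unseen terms are ignored."""
    rows = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        rows.append([counts[term] for term in vocab])
    return rows

# Toy documents, for illustration only.
train_docs = ["the cat sat on the mat", "the cat ran", "the dog sat with the cat"]
vocab = fit_vocabulary(train_docs, max_features=3)
print(vocab)
print(transform(train_docs, vocab))
# A new document is encoded with the vocabulary learned from the training text.
print(transform(["the bird sat"], vocab))
```

The last line mirrors the reason the test set is only transformed, never fitted: the columns must mean the same thing in both matrices.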

In [68]:
# Repeat with X_test, reusing the vocabulary learned from X_train.
X_test_matrix = vectorizer.transform(X_test)
X_test_matrix

<2169x1500 sparse matrix of type '<class 'numpy.int64'>'
	with 757710 stored elements in Compressed Sparse Row format>

The first classifier that will be used is Multinomial Naive Bayes. Naive Bayes tends to do well on text classification tasks. The effectiveness of the classifier will be checked with cross-validation, which is run here on the test split.

In [69]:
from sklearn.naive_bayes import MultinomialNB
mnb = MultinomialNB()
mnb.fit(X_train_matrix, y_train)

print('Training set score:', mnb.score(X_train_matrix, y_train))
print('\nTest set score:', mnb.score(X_test_matrix, y_test))
print('\nCross Val score:',cross_val_score(mnb, X_test_matrix, y_test, cv=5))

Training set score: 0.6474023977866584

Test set score: 0.5394190871369294

Cross Val score: [0.5260771  0.57665904 0.48036952 0.51740139 0.53395785]
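The five fold scores are easier to compare across models when summarized as a mean and a spread. A small sketch using the values printed above; the variable names are illustrative.

```python
from statistics import mean, stdev

# Fold scores copied from the cross-validation output above.
fold_scores = [0.5260771, 0.57665904, 0.48036952, 0.51740139, 0.53395785]

cv_mean = mean(fold_scores)
cv_spread = stdev(fold_scores)  # sample standard deviation across folds

print(f"cross-val accuracy: {cv_mean:.3f} +/- {cv_spread:.3f}")
```

Because `cross_val_score` is called on the test split here, each fold trains on roughly four fifths of 2,169 rows, far fewer than the 6,506 rows used for the fitted model, so the two sets of numbers are not directly comparable.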


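Not too bad: with 16 classes, the model predicts far better than random guessing, and well above the roughly 21% accuracy that always predicting the most common type (INFP) would give. There is some overfitting going on, since the training score (about 65%) is noticeably higher than the test score (about 54%).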

Random forest is a very powerful and versatile classifier. The process is repeated below, this time with Random forest.

In [77]:
rfc = ensemble.RandomForestClassifier()

rfc.fit(X_train_matrix, y_train)

print('Training set score:', rfc.score(X_train_matrix, y_train))
print('\nTest set score:', rfc.score(X_test_matrix, y_test))
print('\nCross Val score:',cross_val_score(rfc, X_test_matrix, y_test, cv=5))

Training set score: 0.9932370119889333

Test set score: 0.4181650530198248

Cross Val score: [0.34013605 0.36613272 0.3187067  0.33410673 0.33489461]


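The Random Forest model performed much worse than the Naive Bayes model on the test set (about 42% versus about 54%), and its near-perfect training score points to heavy overfitting.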

spaCy is a powerful package for natural language processing. It is effective at parsing and processing text.

In [13]:
import spacy

# Making a copy of the dataframe for editing with spacy.
spacy_df = df.copy()

# loading spacy 
nlp = spacy.load('en')

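A lemma is the dictionary (base) form of a word, for example "be" for "is" or "find" for "finding". Reducing words to their lemmas groups inflected forms together, which can make the text much easier for a classifier to handle.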

In [14]:
# Use a lambda function to parse each post with nlp, take the lemma of every token, and remove punctuation and stop words.
spacy_df['posts'] = spacy_df['posts'].apply(lambda row: [token.lemma_ for token in nlp(row) if not token.is_punct and not token.is_stop])
# Convert the rows back to strings, since the step above leaves each row as a list of lemmas.
spacy_df['posts'] = spacy_df['posts'].astype(str).str.replace(r"\[|\]|'", '', regex=True)
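The string conversion above explains why the lemmatized posts come out comma-separated. The sketch below reproduces it on a single hypothetical token list and shows a space-joined alternative for comparison; the alternative is a suggestion, not what the notebook does.

```python
import re

# Hypothetical lemma list for one post, for illustration only.
tokens = ["dear", "intp", "enjoy", "conversation"]

# Converting a list to a string yields its Python representation.
as_text = str(tokens)
print(as_text)

# Stripping brackets and single quotes leaves the commas in place.
stripped = re.sub(r"\[|\]|'", "", as_text)
print(stripped)

# Alternative: join the lemmas with spaces, leaving no list syntax to clean up.
joined = " ".join(tokens)
print(joined)
```

For a word-level vectorizer the leftover commas should be harmless, since punctuation is not counted as a token, but the joined form avoids the cleanup step altogether.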

Display the new head to make sure the data looks ok.

In [15]:
spacy_df.head()

Unnamed: 0,type,posts
0,0,"intj, moment, , sportscenter, play, , pr..."
1,1,"-PRON-, be, find, lack, post, alarming.|||sex,..."
2,2,"good, , , course, -PRON-, -PRON-, know, be..."
3,3,"dear, intp, , -PRON-, enjoy, conversation, d..."
4,4,"-PRON-, be, fired.|||that, be, silly, misconce..."


Recreate X and y with the new spaCy dataframe, then split the data into training and test sets.

In [16]:
X = spacy_df['posts']
y = spacy_df['type']
print(X.shape)
print(y.shape)

(8675,)
(8675,)


In [17]:
spacy_X_train, spacy_X_test, spacy_y_train, spacy_y_test = train_test_split(X, y, random_state=24)

Create the bag of words with CountVectorizer and the new spaCy-parsed, cleaned, and processed X_train.

In [48]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(max_features=1000)

spacy_X_train_matrix = vectorizer.fit_transform(spacy_X_train)

spacy_X_train_matrix

<6506x1000 sparse matrix of type '<class 'numpy.int64'>'
	with 1480270 stored elements in Compressed Sparse Row format>

Repeat with the new X_test.

In [49]:
spacy_X_test_matrix = vectorizer.transform(spacy_X_test)
spacy_X_test_matrix

<2169x1000 sparse matrix of type '<class 'numpy.int64'>'
	with 495868 stored elements in Compressed Sparse Row format>

Run the Naive Bayes classifier to determine improvements.

In [50]:
from sklearn.naive_bayes import MultinomialNB
mnb = MultinomialNB()
mnb.fit(spacy_X_train_matrix, spacy_y_train)

print('Training set score:', mnb.score(spacy_X_train_matrix, spacy_y_train))
print('\nTest set score:', mnb.score(spacy_X_test_matrix, spacy_y_test))
print('\nCross Val score:',cross_val_score(mnb, spacy_X_test_matrix, spacy_y_test, cv=5))

Training set score: 0.6603135567168767

Test set score: 0.5850622406639004

Cross Val score: [0.57142857 0.59954233 0.53117783 0.55916473 0.55971897]


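There is still some overfitting going on, but test accuracy improved by about 5 percentage points (from about 54% to about 59%). Note that this run also reduced max_features from 1500 to 1000, so the gain cannot be attributed to the spaCy preprocessing alone.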

Repeating with Random Forest again.

In [76]:
rfc = ensemble.RandomForestClassifier()

rfc.fit(spacy_X_train_matrix, spacy_y_train)

print('Training set score:', rfc.score(spacy_X_train_matrix, spacy_y_train))
print('\nTest set score:', rfc.score(spacy_X_test_matrix, spacy_y_test))
print('\nCross Val score:',cross_val_score(rfc, spacy_X_test_matrix, spacy_y_test, cv=5))

Training set score: 0.9932370119889333

Test set score: 0.4319963116643615

Cross Val score: [0.37641723 0.3409611  0.38799076 0.38283063 0.37704918]


The improvement was slight: test accuracy rose by only about 1.4 percentage points (from about 41.8% to about 43.2%). The Random Forest classifier doesn't seem to be a very good model for the job.

### Tfidf Vectorizer
The tf-idf vectorizer is another method of feature extraction.

In [78]:
from sklearn.feature_extraction.text import TfidfVectorizer


tfidf_vectorizer = TfidfVectorizer(max_df=0.5, # drop words that occur in more than half of the documents (posts)
                             min_df=4, # only use words that appear in at least four documents
                             stop_words='english', # remove common English stop words
                             lowercase=True, # convert everything to lower case
                             use_idf=True, # weight terms by inverse document frequency
                             norm='l2', # scale each row to unit length so longer and shorter documents are treated equally
                             smooth_idf=True # adds 1 to all document frequencies, as if an extra document existed that used every word once; prevents divide-by-zero errors
                            )

# Fit and transform X_train with the tfidf vectorizer.
tfidf_X_train_matrix = tfidf_vectorizer.fit_transform(X_train)

tfidf_X_train_matrix

<6506x28366 sparse matrix of type '<class 'numpy.float64'>'
	with 2100905 stored elements in Compressed Sparse Row format>
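The weighting behind these settings can be sketched in a few lines of standard-library Python. This is a simplified illustration, assuming the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalization of each row, as described in the scikit-learn documentation. It ignores `max_df`, `min_df`, and stop words, and the toy documents are made up.

```python
import math
from collections import Counter

def tfidf_rows(tokenized_docs):
    """Return (vocabulary, idf weights, L2-normalized tf-idf rows)."""
    n_docs = len(tokenized_docs)

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))
    vocab = sorted(doc_freq)

    # Smoothed idf: acts as if one extra document contained every term once.
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}

    rows = []
    for tokens in tokenized_docs:
        tf = Counter(tokens)
        raw = [tf[t] * idf[t] for t in vocab]
        length = math.sqrt(sum(v * v for v in raw))
        # L2 normalization: every non-empty row ends up with unit length.
        rows.append([v / length if length else 0.0 for v in raw])
    return vocab, idf, rows

# Toy tokenized documents, for illustration only.
docs = [["cat", "sat"], ["cat", "ran"], ["dog", "sat", "sat"]]
vocab, idf, rows = tfidf_rows(docs)

print(vocab)
print({t: round(w, 3) for t, w in idf.items()})
print([[round(v, 3) for v in row] for row in rows])
```

Terms that appear in fewer documents ("ran", "dog") receive a larger idf than terms shared across documents ("cat", "sat"), which is the effect tf-idf is meant to have.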

Create sparse matrix for X_test 

In [79]:
tfidf_X_test_matrix = tfidf_vectorizer.transform(X_test)
tfidf_X_test_matrix

<2169x28366 sparse matrix of type '<class 'numpy.float64'>'
	with 699059 stored elements in Compressed Sparse Row format>

Determine effectiveness of the tfidf vectorizer with the naive bayes model.

In [80]:
mnb = MultinomialNB()
mnb.fit(tfidf_X_train_matrix, y_train)

print('Training set score:', mnb.score(tfidf_X_train_matrix, y_train))
print('\nTest set score:', mnb.score(tfidf_X_test_matrix, y_test))
print('\nCross Val score:',cross_val_score(mnb, tfidf_X_test_matrix,y_test, cv=5))

Training set score: 0.3061789117737473

Test set score: 0.24343015214384509

Cross Val score: [0.21768707 0.2173913  0.21939954 0.22041763 0.22248244]


The model is doing poorly, with only about 24% test accuracy, barely above the roughly 21% share of the most common class, though it doesn't seem to be overfitting much.

Hopefully the Random Forest classifier below will do better.

In [81]:
rfc = ensemble.RandomForestClassifier()

rfc.fit(tfidf_X_train_matrix, y_train)

print('Training set score:', rfc.score(tfidf_X_train_matrix, y_train))
print('\nTest set score:', rfc.score(tfidf_X_test_matrix, y_test))
print('\nCross Val score:',cross_val_score(rfc, tfidf_X_test_matrix,y_test, cv=5))

Training set score: 0.9933907162619121

Test set score: 0.334716459197787

Cross Val score: [0.29024943 0.30892449 0.30254042 0.27842227 0.35362998]


The Random Forest classifier did perform better on the test set (about 33% versus about 24%); however, it is overfitting heavily.

Will spacy improve the quality of the models?

In [82]:
spacy_tfidf_X_train_matrix = tfidf_vectorizer.fit_transform(spacy_X_train)

spacy_tfidf_X_train_matrix

<6506x23566 sparse matrix of type '<class 'numpy.float64'>'
	with 1824238 stored elements in Compressed Sparse Row format>

In [83]:
spacy_tfidf_X_test_matrix = tfidf_vectorizer.transform(spacy_X_test)
spacy_tfidf_X_test_matrix

<2169x23566 sparse matrix of type '<class 'numpy.float64'>'
	with 605914 stored elements in Compressed Sparse Row format>

In [84]:
from sklearn.naive_bayes import MultinomialNB
mnb = MultinomialNB()
mnb.fit(spacy_tfidf_X_train_matrix, spacy_y_train)

print('Training set score:', mnb.score(spacy_tfidf_X_train_matrix, spacy_y_train))
print('\nTest set score:', mnb.score(spacy_tfidf_X_test_matrix, spacy_y_test))
print('\nCross Val score:',cross_val_score(mnb, spacy_tfidf_X_test_matrix, spacy_y_test, cv=5))

Training set score: 0.3130956040577928

Test set score: 0.2512678653757492

Cross Val score: [0.21768707 0.2173913  0.21939954 0.22041763 0.22248244]


spaCy gave only a very slight improvement in test accuracy, from about 24% to about 25%.

Random Forest:

In [85]:
rfc = ensemble.RandomForestClassifier()

rfc.fit(spacy_tfidf_X_train_matrix, spacy_y_train)

print('Training set score:', rfc.score(spacy_tfidf_X_train_matrix, spacy_y_train))
print('\nTest set score:', rfc.score(spacy_tfidf_X_test_matrix, spacy_y_test))
print('\nCross Val score:',cross_val_score(rfc, spacy_tfidf_X_test_matrix, spacy_y_test, cv=5))

Training set score: 0.9940055333538272

Test set score: 0.3084370677731674

Cross Val score: [0.27437642 0.32951945 0.27482679 0.29234339 0.27868852]


The random forest actually performs worse with spaCy: test accuracy drops from about 33% to about 31%.

## Discussion and conclusion

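Bag of words was the more effective of the two feature extraction methods used here: its best test accuracy was about 59% (Naive Bayes on the spaCy-lemmatized text), while no tf-idf model exceeded about 33%. It was interesting to see Naive Bayes, such a simple classifier, outperform the Random Forest on the bag-of-words features.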
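To compare all eight runs side by side, the training and test scores reported above can be collected in one place. The sketch below copies those scores (rounded to four decimals) into a dictionary and ranks the runs by test accuracy; the layout and names are illustrative.

```python
# (features, text, classifier) -> (training score, test score),
# copied from the outputs above and rounded to four decimals.
results = {
    ("bag of words", "raw", "naive bayes"): (0.6474, 0.5394),
    ("bag of words", "raw", "random forest"): (0.9932, 0.4182),
    ("bag of words", "spaCy", "naive bayes"): (0.6603, 0.5851),
    ("bag of words", "spaCy", "random forest"): (0.9932, 0.4320),
    ("tf-idf", "raw", "naive bayes"): (0.3062, 0.2434),
    ("tf-idf", "raw", "random forest"): (0.9934, 0.3347),
    ("tf-idf", "spaCy", "naive bayes"): (0.3131, 0.2513),
    ("tf-idf", "spaCy", "random forest"): (0.9940, 0.3084),
}

# Rank runs by test accuracy and show the train/test gap for each.
ranked = sorted(results.items(), key=lambda item: item[1][1], reverse=True)
for key, (train, test) in ranked:
    label = " / ".join(key)
    print(f"{label:40s} test={test:.3f}  gap={train - test:.3f}")

best = ranked[0][0]
print("best run:", " / ".join(best))
```

The gap column makes the overfitting pattern visible at a glance: every Random Forest run has a gap above 0.5, while the Naive Bayes gaps stay near 0.1 or below.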