# OkCupid Analysis Part 4 - Modeling the Data

The final part of our analysis focuses on modeling the data with machine learning. The overall goal is to predict attributes of OkCupid users so that they can be matched more easily. We focus on the text data, i.e. the profile essays. Because the data are labeled and the target is a category rather than a number, this is a supervised classification problem, and we use the Naive Bayes method.

Naive Bayes methods are a set of supervised learning algorithms based on applying Bayes’ theorem with the “naive” assumption of independence between every pair of features.
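
To make the "naive" assumption concrete, here is a minimal sketch of the posterior computation for a single class decision. The priors and word likelihoods are hypothetical toy numbers, not values estimated from the OkCupid data, and `posterior` is an illustrative helper name.

```python
import numpy as np

# Hypothetical toy numbers, NOT taken from the OkCupid data.
prior = {"m": 0.6, "f": 0.4}                       # P(class)
likelihood = {                                      # P(word | class)
    "m": {"computers": 0.03, "knitting": 0.001},
    "f": {"computers": 0.01, "knitting": 0.02},
}

def posterior(words):
    """Return P(class | words) under the naive independence assumption."""
    # Unnormalized score: P(class) * product over words of P(word | class)
    scores = {
        c: prior[c] * np.prod([likelihood[c][w] for w in words])
        for c in prior
    }
    total = sum(scores.values())
    # Normalize so the class probabilities sum to 1 (Bayes' theorem)
    return {c: s / total for c, s in scores.items()}

post = posterior(["computers"])
print(post)
```

Because the features are treated as independent given the class, the likelihood of a whole essay is simply the product of per-word likelihoods, which is what keeps the method cheap enough for very large vocabularies.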

In [1]:
# Packages for Data Analysis and Machine Learning
%matplotlib inline
import numpy as np
import scipy as sp
import matplotlib as mpl
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Setup Pandas
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.notebook_repr_html', True)

# Setup Seaborn
sns.set_style("whitegrid")
sns.set_context("poster")

## OkCupid Dataset

In [2]:
# Load .csv into dataframe for use 
df = pd.read_csv('new_df.csv')
df = df.drop('Unnamed: 0', axis=1)  # the positional axis argument was removed in pandas 2.0
df.head()

Unnamed: 0,age,body_type,diet,drinks,drugs,education,essay0,essay1,essay2,essay3,essay4,essay5,essay6,essay7,essay8,essay9,ethnicity,height,income,job,last_online,location,offspring,orientation,pets,religion,sex,sign,smokes,speaks,status
0,22,a little extra,strictly anything,socially,never,working on college/university,about me: i would love to think that i was ...,currently working as an international agent fo...,making people laugh. ranting about a good sal...,"the way i look. i am a six foot half asian, ha...","books: absurdistan, the republic, of mice and...",food. water. cell phone. shelter.,duality and humorous things,trying to find someone to hang out with. i am ...,i am new to california and looking for someone...,you want to be swept off your feet! you are t...,"asian, white",75.0,-1.0,transportation,2012-06-28-20-30,"south san francisco, california","doesn't have kids, but might want them",straight,likes dogs and likes cats,agnosticism and very serious about it,m,gemini,sometimes,english,single
1,35,average,mostly other,often,sometimes,working on space camp,i am a chef: this is what that means. 1. i am...,dedicating everyday to being an unbelievable b...,being silly. having ridiculous amonts of fun w...,,i am die hard christopher moore fan. i don't r...,delicious porkness in all of its glories. my ...,,,i am very open and will share just about anyth...,,white,70.0,,hospitality / travel,2012-06-29-21-41,"oakland, california","doesn't have kids, but might want them",straight,likes dogs and likes cats,agnosticism but not too serious about it,m,cancer,no,"english (fluently), spanish (poorly), french (...",single
2,38,thin,anything,socially,,graduated from masters program,"i'm not ashamed of much, but writing public te...","i make nerdy software for musicians, artists, ...",improvising in different contexts. alternating...,my large jaw and large glasses are the physica...,okay this is where the cultural matrix gets so...,movement conversation creation contemplatio...,,viewing. listening. dancing. talking. drinking...,"when i was five years old, i was known as ""the...","you are bright, open, intense, silly, ironic, ...",,68.0,-1.0,,2012-06-27-09-10,"san francisco, california",,straight,has cats,,m,pisces but it doesn&rsquo;t matter,no,"english, french, c++",available
3,23,thin,vegetarian,socially,,working on college/university,i work in a library and go to school. . .,reading things written by old dead people,playing synthesizers and organizing books acco...,socially awkward but i do my best,"bataille, celine, beckett. . . lynch, jarmusc...",,cats and german philosophy,,,you feel so inclined.,white,71.0,20000.0,student,2012-06-28-14-22,"berkeley, california",doesn't want kids,straight,likes cats,,m,pisces,no,"english, german (poorly)",single
4,29,athletic,,socially,never,graduated from college/university,hey how's it going? currently vague on the pro...,work work work work + play,creating imagery to look at: http://bagsbrown...,i smile a lot and my inquisitive nature,"music: bands, rappers, musicians at the momen...",,,,,,"asian, black, other",66.0,-1.0,artistic / musical / writer,2012-06-27-21-26,"san francisco, california",,straight,likes dogs and likes cats,,m,aquarius,no,english,single


### Exploring data

Here, we check different categories for their potential helpfulness in making predictions about OkCupid users. We have several variables at our disposal. One thing that we could do is combine categories for clustering. However, the essays are where we have the most data. With over half a million essay fields, the sample could very well be large enough to predict another category. Something very simple that we can do is predict gender. However, we could predict education, income, or just about any other category with the same algorithms used in this notebook.

We predict gender first because it has the least missing data. It is also arguably harder to misreport than most of the other variables.

In [3]:
# Exploring possible variables for Machine Learning
n_men = len(df[df.sex == 'm'])
n_women = len(df[df.sex == 'f'])
n_essays = (df.loc[:,'essay0':'essay9']).size
n_body = df.body_type.unique().size
n_drink = df.drinks.unique().size
n_drug = df.drugs.unique().size
n_education = df.education.unique().size
n_ethnicity = df.ethnicity.unique().size
n_income = df.income.unique().size
n_orientation = df.orientation.unique().size

print("Number of male users: {:d}".format(n_men))
print("Number of essays: {:d}".format(n_essays))
print("Number of female users: {:d}".format(n_women))
print("Number of body types: {:d}".format(n_body))
print("Number of drinking types: {:d}".format(n_drink))
print("Number of drug types: {:d}".format(n_drug))
print("Number of ethnicity combinations: {:d}".format(n_ethnicity))
print("Number of income levels: {:d}".format(n_income))
print("Number of orientation types: {:d}".format(n_orientation))


Number of male users: 35829
Number of essays: 599460
Number of female users: 24117
Number of body types: 13
Number of drinking types: 7
Number of drug types: 4
Number of ethnicity combinations: 218
Number of income levels: 14
Number of orientation types: 3


In [4]:
# Combining all 10 essays to a single column for each user
# Also fixing missing essays to blank values
df['all_essays'] = ''
essay_names = df.loc[:,'essay0':'essay9']
for essay_name in essay_names:
    df[essay_name] = df[essay_name].replace(np.nan, ' ')
    df['all_essays'] = df[essay_name] + ' ' + df['all_essays']

In [5]:
# Dataframe for essay to gender predictor
# Could reproduce to include other variables
essay_sex = df[['all_essays', 'sex']]
essay_sex.head(15)

Unnamed: 0,all_essays,sex
0,you want to be swept off your feet! you are t...,m
1,i am very open and will share just about any...,m
2,"you are bright, open, intense, silly, ironic, ...",m
3,you feel so inclined. cats and german phil...,m
4,"music: bands, rappers, musicians at...",m
5,you're awesome. i cried on my first day at sch...,m
6,my typical friday night plotting to take ove...,f
7,out and about or relaxing at home with a g...,f
8,http://www.youtube.com/watch?v=4dxbwzuwsxk let...,f
9,you can rock the bells <em><strong>and say hi....,m


In [6]:
# Converting essays into a vector space model
from sklearn.feature_extraction.text import CountVectorizer

def make_xy(essay_sex, vectorizer=None):
    # Build the document-term matrix X and the label vector y (1 = male, 0 = female)
    if vectorizer is None:
        vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(essay_sex.all_essays)
    X = X.tocsc()  # some versions of sklearn return COO format
    y = (essay_sex.sex == 'm').values.astype(int)  # np.int was removed in NumPy 1.24
    return X, y
X, y = make_xy(essay_sex)
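
As a small illustration of what the vector space model looks like, the sketch below runs `CountVectorizer` on two made-up sentences (they are not from the dataset). Each row of the resulting matrix is a document and each column counts one vocabulary word.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up documents, used only to show the shape of the output
docs = ["i like computers", "i like knitting and computers"]

vec = CountVectorizer()
X_toy = vec.fit_transform(docs)  # sparse document-term matrix

# Recover the column order from the fitted vocabulary (word -> column index)
vocab = sorted(vec.vocabulary_, key=vec.vocabulary_.get)
print(vocab)
print(X_toy.toarray())
```

Note that the default tokenizer lowercases the text and drops single-character tokens, so "i" never becomes a feature.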

We use MultinomialNB, which implements the naive Bayes algorithm for multinomially distributed data such as word counts. It is one of the two classic naive Bayes variants used in text classification.

In [7]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
xtrain, xtest, ytrain, ytest = train_test_split(X, y)
clf = MultinomialNB().fit(xtrain, ytrain)
print("MN Accuracy: %0.2f%%" % (100 * clf.score(xtest, ytest)))

MN Accuracy: 72.74%


In [8]:
training_accuracy = clf.score(xtrain, ytrain)
test_accuracy = clf.score(xtest, ytest)

print("Accuracy on training data: %0.2f" % (training_accuracy))
print("Accuracy on test data:     %0.2f" % (test_accuracy))

Accuracy on training data: 0.79
Accuracy on test data:     0.73


In [9]:
from sklearn.metrics import confusion_matrix
print(confusion_matrix(ytest, clf.predict(xtest)))

[[4959 1130]
 [2955 5943]]
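
The confusion matrix above can be turned into a few summary metrics. This is a minimal sketch that hard-codes the printed matrix; it assumes the scikit-learn convention that rows are true labels and columns are predicted labels, with 0 = female and 1 = male as encoded in `make_xy`.

```python
import numpy as np

# Confusion matrix printed above: rows = true label, columns = predicted label
cm = np.array([[4959, 1130],
               [2955, 5943]])

# For binary labels [0, 1], ravel() yields TN, FP, FN, TP (treating male = 1 as "positive")
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()      # share of all essays classified correctly
precision_male = tp / (tp + fp)      # of essays predicted male, how many are male
recall_male = tp / (tp + fn)         # of male essays, how many were found
recall_female = tn / (tn + fp)       # of female essays, how many were found

print("accuracy:       {:.4f}".format(accuracy))
print("precision (m):  {:.4f}".format(precision_male))
print("recall (m):     {:.4f}".format(recall_male))
print("recall (f):     {:.4f}".format(recall_female))
```

The accuracy computed this way agrees with the 72.74% reported by `clf.score`, and the per-class recalls show that the errors are not symmetric: more men are labeled as women than the other way around.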


In K-fold cross-validation, we divide the data into $K$ non-overlapping parts. We train on $K-1$ of the folds and test on the remaining fold. We then iterate, so that each fold serves as the test fold exactly once. The function cv_score performs the K-fold cross-validation algorithm for us, but we need to pass a function that measures the performance of the algorithm on each fold.

In [10]:
from sklearn.model_selection import KFold
def cv_score(clf, X, y, scorefunc):
    result = 0.
    nfold = 5
    for train, test in KFold(nfold).split(X): # split data into train/test groups, 5 times
        clf.fit(X[train], y[train]) # fit the classifier, passed in as clf
        result += scorefunc(clf, X[test], y[test]) # evaluate score function on held-out data
    return result / nfold # average
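
The fold structure described above can be checked on a tiny example. This sketch uses ten dummy samples (an assumption for illustration) and verifies that the train and test indices never overlap and that every sample lands in a test fold exactly once.

```python
import numpy as np
from sklearn.model_selection import KFold

n_samples = 10                                   # dummy sample count for illustration
test_counts = np.zeros(n_samples, dtype=int)     # how often each sample is held out

for train_idx, test_idx in KFold(n_splits=5).split(np.arange(n_samples)):
    # Train and test indices are disjoint within a fold
    assert len(set(train_idx) & set(test_idx)) == 0
    test_counts[test_idx] += 1
    print("train:", train_idx, "test:", test_idx)

print(test_counts)
```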

We use the log-likelihood as the score here in scorefunc. The higher the log-likelihood, the better. Indeed, what we do in cv_score above is to implement the cross-validation part of GridSearchCV.
The custom scoring function scorefunc allows us to use different metrics depending on the decision risk we care about (precision, accuracy, profit etc.) directly on the validation set. You will often find people using roc_auc, precision, recall, or F1-score as the scoring function.

In [11]:
def log_likelihood(clf, x, y):
    prob = clf.predict_log_proba(x)
    female = y == 0
    male = ~female
    return prob[female, 0].sum() + prob[male, 1].sum()
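
To see what this score adds up, here is a minimal sketch with three made-up predictions. The probability table is hypothetical and stands in for the output of `predict_log_proba`.

```python
import numpy as np

# Hypothetical class probabilities for three essays: column 0 = female, column 1 = male
log_prob = np.log(np.array([[0.9, 0.1],
                            [0.2, 0.8],
                            [0.6, 0.4]]))
y_toy = np.array([0, 1, 1])       # true labels: female, male, male

female = y_toy == 0
male = ~female

# Sum the log probability assigned to the TRUE class of each essay
ll = log_prob[female, 0].sum() + log_prob[male, 1].sum()
print(ll)                          # equals log(0.9) + log(0.8) + log(0.4)
```

The third essay is a man who received only 0.4 probability for "male", so it pulls the total down. A confident wrong prediction is penalized much more heavily than under plain accuracy.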

We'll cross-validate over the regularization parameter $\alpha$.
Let's set up the train and test masks first, and then we can run the cross-validation procedure.

In [12]:
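from sklearn.model_selection import train_test_split
# itest holds a random 30% of the row indices
_, itest = train_test_split(range(essay_sex.shape[0]), train_size=0.7)
mask = np.zeros(essay_sex.shape[0], dtype=bool)  # np.bool was removed in NumPy 1.24
mask[itest] = True  # note: later cells fit on X[mask] and evaluate on X[~mask]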

The log_likelihood function scores the classifier by summing the log probabilities it assigns to the correct class of each essay: the 'female' log probability for essays written by women and the 'male' log probability for essays written by men. We are trying to maximize this log-likelihood.
If we choose a value of $\alpha$ that is too high, the smoothing washes out too much information and the model becomes too general. Only the words with the highest occurrences will still influence the prediction.
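
The effect of $\alpha$ can be illustrated with the additive smoothing formula that multinomial naive Bayes uses for its word probabilities, $(N_{wc} + \alpha) / (N_c + \alpha V)$, where $N_{wc}$ is the count of word $w$ in class $c$, $N_c$ is the total word count in that class, and $V$ is the vocabulary size. The counts below are hypothetical, and `smoothed_probs` is an illustrative helper, not part of scikit-learn.

```python
import numpy as np

# Hypothetical counts of three words within one class
counts = np.array([50, 5, 0])

def smoothed_probs(counts, alpha):
    """Additive (Laplace/Lidstone) smoothing of word probabilities."""
    return (counts + alpha) / (counts.sum() + alpha * len(counts))

# Same alpha grid as the one searched in this notebook
for alpha in [0.1, 1, 5, 10, 50]:
    print(alpha, np.round(smoothed_probs(counts, alpha), 3))
```

As $\alpha$ grows, the three probabilities drift toward the uniform value of one third, so rare but informative words lose their influence.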

In [13]:
# The grid of parameters to search over
alphas = [.1, 1, 5, 10, 50]
min_dfs = [1e-5, 1e-4, 1e-3, 1e-2]

# Find the best values for alpha and min_df by cross-validated log-likelihood
best_alpha = None
best_min_df = None
maxscore = -np.inf
for alpha in alphas:
    for min_df in min_dfs:
        vectorizer = CountVectorizer(min_df=min_df)
        Xthis, ythis = make_xy(essay_sex, vectorizer)
        Xtrainthis = Xthis[mask]
        ytrainthis = ythis[mask]
        clf = MultinomialNB(alpha=alpha)
        cvscore = cv_score(clf, Xtrainthis, ytrainthis, log_likelihood)

        if cvscore > maxscore:
            maxscore = cvscore
            best_alpha, best_min_df = alpha, min_df

In [14]:
print('alpha: {:.2f}'.format(best_alpha))
print('min_df: {}'.format(best_min_df))

alpha: 5.00
min_df: 0.0001
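
The manual search above could also be expressed with scikit-learn's `Pipeline` and `GridSearchCV`. The following is only a sketch on a made-up eight-document corpus, with a smaller grid than the one used in this notebook. It uses the built-in `neg_log_loss` scorer, which is a per-sample-averaged version of the log-likelihood rather than the summed score defined earlier, so it is a swapped-in metric and not the exact procedure of this notebook.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up toy corpus (1 = male, 0 = female), NOT the OkCupid essays
toy_docs = ["programmer computers", "computers engineering",
            "computers mechanic", "skateboarding computers",
            "knitting lipstick", "knitting heels",
            "zumba knitting", "knitting pjs"]
toy_y = np.array([1, 1, 1, 1, 0, 0, 0, 0])

# Vectorizer and classifier are tuned together, refit inside every fold
pipe = Pipeline([("vec", CountVectorizer()), ("nb", MultinomialNB())])
grid = {"vec__min_df": [1, 2], "nb__alpha": [0.1, 1, 5]}

search = GridSearchCV(pipe, grid, scoring="neg_log_loss", cv=2)
search.fit(toy_docs, toy_y)
print(search.best_params_)
```

One design difference worth noting: inside a pipeline the vocabulary is learned from the training folds only, whereas the manual loop fits the vectorizer on all essays before splitting.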


In [15]:
vectorizer = CountVectorizer(min_df=best_min_df)
X, y = make_xy(essay_sex, vectorizer)
xtrain=X[mask]
ytrain=y[mask]
xtest=X[~mask]
ytest=y[~mask]

clf = MultinomialNB(alpha=best_alpha).fit(xtrain, ytrain)

training_accuracy = clf.score(xtrain, ytrain)
test_accuracy = clf.score(xtest, ytest)

print("Accuracy on training data: %0.2f" % (training_accuracy))
print("Accuracy on test data:     %0.2f" % (test_accuracy))

Accuracy on training data: 0.81
Accuracy on test data:     0.75


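With the tuned parameters, the model is slightly more accurate than our first attempt on both sets (0.81/0.75 versus 0.79/0.73 for training/test data). The gap between training and test accuracy is about the same, so the model is not noticeably less overfit, but the higher test accuracy is a good thing: it makes our predictions more accurate.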

## Interpretation

### What are the strongly predictive features?
We use a neat trick to identify strongly predictive features (i.e. words):

1. First, we create a data set such that each row has exactly one feature. This is represented by the identity matrix.
2. We use the trained classifier to make predictions on this matrix.
3. We sort the rows by predicted probabilities, and pick the top and bottom $K$ rows.
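
The steps above can be sketched on a toy corpus. The four documents and their labels are made up for illustration (1 = male, 0 = female), and the `_toy` names are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up documents, NOT taken from the OkCupid essays
docs = ["programmer computers engineering", "computers programmer",
        "knitting lipstick heels", "lipstick knitting"]
labels = np.array([1, 1, 0, 0])

vec = CountVectorizer()
X_toy = vec.fit_transform(docs)
clf_toy = MultinomialNB().fit(X_toy, labels)

# Column order of the vocabulary (word -> column index)
words_toy = np.array(sorted(vec.vocabulary_, key=vec.vocabulary_.get))

# Step 1: one "document" per word, each containing that word exactly once
one_word_docs = np.eye(X_toy.shape[1])
# Step 2: predicted probability of the male class for each one-word document
p_male = clf_toy.predict_proba(one_word_docs)[:, 1]
# Step 3: sort from most female-leaning to most male-leaning
order = np.argsort(p_male)
for w, p in zip(words_toy[order], p_male[order]):
    print("{:>12} {:.2f}".format(w, p))
```

On the real data the identity matrix has one row and one column per vocabulary word, so it can become very large when stored densely.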

In [16]:
# Getting list of words that predict gender
words = np.array(vectorizer.get_feature_names_out())  # get_feature_names() was removed in scikit-learn 1.2

x = np.eye(xtest.shape[1])
probs = clf.predict_log_proba(x)[:, 0]
ind = np.argsort(probs)

men_words = words[ind[:10]]
women_words = words[ind[-10:]]

men_prob = probs[ind[:10]]
women_prob = probs[ind[-10:]]

print("Men words\t           P(Men word | word)")
for w, p in zip(men_words, men_prob):
    print("{:>20}".format(w), "{:.2f}".format(1 - np.exp(p)))
    
print("Women words\t        P(Men word | word)")
for w, p in zip(women_words, women_prob):
    print("{:>20}".format(w), "{:.2f}".format(1 - np.exp(p)))

Men words	           P(Men word | word)
                nbsp 0.93
                 sup 0.93
          programmer 0.91
       skateboarding 0.91
           computers 0.90
            blahblah 0.89
              badger 0.89
            mechanic 0.88
         engineering 0.87
                disc 0.87
Women words	        P(Men word | word)
               heels 0.19
                aunt 0.19
                 pjs 0.18
             trapeze 0.17
            knitting 0.17
               girly 0.17
               zumba 0.17
              tomboy 0.16
            lipstick 0.16
               gloss 0.13


### Misinterpretations

Our model isn't perfect and can't always predict gender accurately. Variables such as orientation and education could influence how the essays are written. Here we explore the mis-predicted essays.

In [17]:
x, y = make_xy(essay_sex, vectorizer)

prob = clf.predict_proba(x)[:, 0]
predict = clf.predict(x)

bad_women = np.argsort(prob[y == 0])[:3]
bad_men = np.argsort(prob[y == 1])[-3:]

It seems that a simple reason for mis-predicting a woman's essay is too much HTML formatting. Things like hyperlinks and formatting tags could throw off the algorithm. Crude language appears to be another cause. See below for an example.

In [18]:
print("Mis-predicted Women quotes")
print('---------------------------')
for row in bad_women:
    print(essay_sex[y == 0].all_essays.iloc[row])
    print('')

Mis-predicted Women quotes
---------------------------
you want to chat or get to know me, have similar interests. i like guys who are responsible, fun/silly, ambitious, have a job (or are in school), and i prefer that they don't smoke. i like when they are caring, smart, have (some) manners (haha), is ambitious, kind, honest, near to my age, don't live too far, and it's nice if they like to work out a few times a week. i need more friends that live nearby! either home watching tv or a movie, doing homework,  out with a friend/friends, or  online... what i want to "do with my life", career/going back to school  worrying :)  homework/studying  forcing myself to go to the gym...haha  guys  music  movies  tv  family  friends my family  my friends  food  music  movies/tv  my contacts/glasses books: <a class="ilink" href="/interests?i=harry+potter">harry potter</a> series, <a class="ilink" href= "/interests?i=the+time+traveller%27s+wife">the time traveller's wife</a> , <a class="ilink" href

For men, it seems that sexual orientation strongly affects predictions. See below for an example.

In [19]:
print("Mis-predicted Men quotes")
print('--------------------------')
for row in bad_men:
    print(essay_sex[y == 1].all_essays.iloc[row])
    print('')

Mis-predicted Men quotes
--------------------------
you feel me and it resonates with you, you're intrigued, and you can probably tell i'm for real. i masturbated once. dinner and a movie the interesting times we live in and what's unfolding on the planet the sun  a woman's body and voice  the heart and the voice that arises from it  water in all it's forms  wisdom  good friends movies: life is beautiful, wizard of oz, the fountain, dead poets society, avatar (parts), raising arizona, i am.  books: the water of life, the seat of the soul, the way of the superior man, the way of the peaceful warrior, the power of now.  shows: the daily show and colbert report  my music tastes are too diverse to list, from rock to pop to classical. they tend to feel at ease around me  depth, friendliness, playfulness and loving energy my work  loving and going deep with myself and others  inspiring myself and others  honest communication  feeling intuitively and reading the energy of people and places  a

### Testing for new essay

It is important that we test our model on new data. Here we try a bunch of sentences to predict gender.

First example is very simple. As expected, we get a good result.

In [20]:
text = ['I am a programmer with a beard']

vectorizer = CountVectorizer(min_df=best_min_df,
                             vocabulary=vectorizer.vocabulary_)  # reuse the fitted vocabulary
x = vectorizer.fit_transform(text)
prob = clf.predict_proba(vectorizer.transform(text))
man_prob = prob[0, 1]
woman_prob = prob[0, 0]
man_prob = ('%.2f' % (100*man_prob))
woman_prob = ('%.2f' % (100*woman_prob))
# Prediction
print("Text/Essay: ", text)
print('')
if clf.predict(x) == 1:
    print("This is", man_prob, "percent likely from a man's essay")
else:
    print("This is", woman_prob, "percent likely from a woman's essay")

Text/Essay:  ['I am a programmer with a beard']

This is 96.85 percent likely from a man's essay


The second example is similar to the first, and again we get a good result.

In [21]:
text = ['I like lipgloss and lipstick']

vectorizer = CountVectorizer(min_df=best_min_df,
                             vocabulary=vectorizer.vocabulary_)  # reuse the fitted vocabulary
x = vectorizer.fit_transform(text)
prob = clf.predict_proba(vectorizer.transform(text))
man_prob = prob[0, 1]
woman_prob = prob[0, 0]
man_prob = ('%.2f' % (100*man_prob))
woman_prob = ('%.2f' % (100*woman_prob))
# Prediction
print("Text/Essay: ", text)
print('')
if clf.predict(x) == 1:
    print("This is", man_prob, "percent likely from a man's essay")
else:
    print("This is", woman_prob, "percent likely from a woman's essay")

Text/Essay:  ['I like lipgloss and lipstick']

This is 95.43 percent likely from a woman's essay


Let's try a manipulation that reverses the meaning of a simple sentence. We start with the sentence below.

In [22]:
text = ['I like computers']
vectorizer = CountVectorizer(min_df=best_min_df,
                             vocabulary=vectorizer.vocabulary_)  # reuse the fitted vocabulary
x = vectorizer.fit_transform(text)
prob = clf.predict_proba(vectorizer.transform(text))
man_prob = prob[0, 1]
woman_prob = prob[0, 0]
man_prob = ('%.2f' % (100*man_prob))
woman_prob = ('%.2f' % (100*woman_prob))
# Prediction
print("Text/Essay: ", text)
print('')
if clf.predict(x) == 1:
    print("This is", man_prob, "percent likely from a man's essay")
else:
    print("This is", woman_prob, "percent likely from a woman's essay")

Text/Essay:  ['I like computers']

This is 90.05 percent likely from a man's essay


Seeing the positive result above, we check whether we can get the opposite result by changing the word "like" to "hate". Our model does not do so well: it still classifies the user as male, only with slightly less confidence.

In [23]:
text = ['I hate computers']
vectorizer = CountVectorizer(min_df=best_min_df,
                             vocabulary=vectorizer.vocabulary_)  # reuse the fitted vocabulary
x = vectorizer.fit_transform(text)
prob = clf.predict_proba(vectorizer.transform(text))
man_prob = prob[0, 1]
woman_prob = prob[0, 0]
man_prob = ('%.2f' % (100*man_prob))
woman_prob = ('%.2f' % (100*woman_prob))
# Prediction
print("Text/Essay: ", text)
print('')
if clf.predict(x) == 1:
    print("This is", man_prob, "percent likely from a man's essay")
else:
    print("This is", woman_prob, "percent likely from a woman's essay")

Text/Essay:  ['I hate computers']

This is 87.64 percent likely from a man's essay


Here, we check whether the words "hate" and "like" on their own lean more towards the male or the female category.

In [24]:
text = ['hate']
vectorizer = CountVectorizer(min_df=best_min_df,
                             vocabulary=vectorizer.vocabulary_)  # reuse the fitted vocabulary
x = vectorizer.fit_transform(text)
prob = clf.predict_proba(vectorizer.transform(text))
man_prob = prob[0, 1]
woman_prob = prob[0, 0]
man_prob = ('%.2f' % (100*man_prob))
woman_prob = ('%.2f' % (100*woman_prob))
# Prediction
print("Text/Essay: ", text)
print('')
if clf.predict(x) == 1:
    print("This is", man_prob, "percent likely from a man's essay")
else:
    print("This is", woman_prob, "percent likely from a woman's essay")

Text/Essay:  ['hate']

This is 54.40 percent likely from a man's essay


In [25]:
text = ['like']
vectorizer = CountVectorizer(min_df=best_min_df,
                             vocabulary=vectorizer.vocabulary_)  # reuse the fitted vocabulary
x = vectorizer.fit_transform(text)
prob = clf.predict_proba(vectorizer.transform(text))
man_prob = prob[0, 1]
woman_prob = prob[0, 0]
man_prob = ('%.2f' % (100*man_prob))
woman_prob = ('%.2f' % (100*woman_prob))
# Prediction
print("Text/Essay: ", text)
print('')
if clf.predict(x) == 1:
    print("This is", man_prob, "percent likely from a man's essay")
else:
    print("This is", woman_prob, "percent likely from a woman's essay")

Text/Essay:  ['like']

This is 60.35 percent likely from a man's essay
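
The prediction cells above all repeat the same steps, so they could be wrapped in a small helper. This is a minimal sketch: `predict_gender` is a hypothetical name, and the toy corpus below only stands in for the fitted `vectorizer` and `clf` from this notebook. It calls `transform` on the already fitted vectorizer instead of building a new one for each sentence.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

def predict_gender(text, vectorizer, clf):
    """Return (label, probability) for one essay; label 1 = man, 0 = woman."""
    x = vectorizer.transform([text])        # reuse the fitted vocabulary, no refit
    prob = clf.predict_proba(x)[0]          # probabilities ordered like clf.classes_
    best = int(np.argmax(prob))
    return int(clf.classes_[best]), float(prob[best])

# Made-up training data standing in for the OkCupid essays
toy_docs = ["programmer computers engineering", "computers programmer",
            "knitting lipstick heels", "lipstick knitting"]
toy_y = np.array([1, 1, 0, 0])
toy_vec = CountVectorizer().fit(toy_docs)
toy_clf = MultinomialNB().fit(toy_vec.transform(toy_docs), toy_y)

label, p = predict_gender("I am a programmer", toy_vec, toy_clf)
who = "man" if label == 1 else "woman"
print("This is {:.2f} percent likely from a {}'s essay".format(100 * p, who))
```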


## Possible improvements

Seeing that our model does not differentiate "like" and "hate" properly, we can look deeper into the text analysis that we used.

One thing that we could look into is n-grams.

In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. The items can be phonemes, syllables, letters, words or base pairs according to the application. The n-grams typically are collected from a text or speech corpus. When the items are words, n-grams may also be called shingles.

An n-gram of size 1 is referred to as a "unigram"; size 2 is a "bigram" (or, less commonly, a "digram"); size 3 is a "trigram". Larger sizes are sometimes referred to by the value of n in modern language, e.g., "four-gram", "five-gram", and so on.

A simple example can be built from the sentence we used previously.
For unigrams: "I", "hate", "computers". The prediction would be influenced most by the term "computers", which is generally associated with males.
For bigrams: "I hate", "hate computers". The prediction would be influenced most by the term "hate computers", which is generally not associated with males.
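
As a quick check of what `CountVectorizer` actually extracts, the sketch below lists the features for different `ngram_range` settings on a made-up sentence. The dictionary name `features` is only for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

text = ["i really hate computers"]   # made-up sentence for illustration

features = {}
for ngram_range in [(1, 1), (2, 2), (1, 2)]:
    vec = CountVectorizer(ngram_range=ngram_range).fit(text)
    features[ngram_range] = sorted(vec.vocabulary_)
    print(ngram_range, features[ngram_range])
```

With the default tokenizer, single-character tokens such as "i" are dropped before the n-grams are built, so a bigram like "i hate" would not appear unless the `token_pattern` is changed.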

#### Testing with uni/bigrams

Compared to our tuned unigram model (0.81/0.75 for training/test data), adding bigrams raises the training accuracy to 0.87 but leaves the test accuracy at about 0.75. This could be explored in a follow-up study, but it is not enough to change our current results.

In [26]:
vectorizer = CountVectorizer(ngram_range=(1,2),min_df=best_min_df)
X, y = make_xy(essay_sex, vectorizer)
xtrain=X[mask]
ytrain=y[mask]
xtest=X[~mask]
ytest=y[~mask]

clf = MultinomialNB(alpha=best_alpha).fit(xtrain, ytrain)

training_accuracy = clf.score(xtrain, ytrain)
test_accuracy = clf.score(xtest, ytest)

print("Accuracy on training data: {:2f}".format(training_accuracy))
print("Accuracy on test data:     {:2f}".format(test_accuracy))

Accuracy on training data: 0.873832
Accuracy on test data:     0.748105


#### Testing with only bigrams

With bigrams only, the training accuracy rises to 0.91 but the test accuracy drops to 0.73, so the model is more overfitted than our previous one. We should skip this.

In [27]:
vectorizer = CountVectorizer(ngram_range=(2,2),min_df=best_min_df)
X, y = make_xy(essay_sex, vectorizer)
xtrain=X[mask]
ytrain=y[mask]
xtest=X[~mask]
ytest=y[~mask]

clf = MultinomialNB(alpha=best_alpha).fit(xtrain, ytrain)

training_accuracy = clf.score(xtrain, ytrain)
test_accuracy = clf.score(xtest, ytest)

print("Accuracy on training data: {:2f}".format(training_accuracy))
print("Accuracy on test data:     {:2f}".format(test_accuracy))

Accuracy on training data: 0.905805
Accuracy on test data:     0.734665


#### Testing with random forests

Here, we get 98 percent accuracy on the training data but only 66 percent on the test data, which makes this the most overfitted model we have seen. Skip this.

In [28]:
from sklearn.ensemble import RandomForestClassifier

vectorizer = CountVectorizer(min_df=best_min_df)
X, y = make_xy(essay_sex, vectorizer)
xtrain=X[mask]
ytrain=y[mask]
xtest=X[~mask]
ytest=y[~mask]

rforest = RandomForestClassifier()
clf = rforest.fit(xtrain,ytrain)

training_accuracy = clf.score(xtrain, ytrain)
test_accuracy = clf.score(xtest, ytest)

print("Accuracy on training data: {:2f}".format(training_accuracy))
print("Accuracy on test data:     {:2f}".format(test_accuracy))

Accuracy on training data: 0.979704
Accuracy on test data:     0.659049


#### Testing with TfidfVectorizer

TF-IDF Weighting for Term Importance
TF-IDF stands for Term-Frequency X Inverse Document Frequency.
In the standard CountVectorizer model above, we used just the term frequency in a document of words in our vocabulary. In TF-IDF, we weight this term frequency by the inverse of its popularity in all documents. For example, if the word "movie" showed up in all the documents, it would not have much predictive value. It could actually be considered a stopword. By weighting its counts by 1 divided by its overall frequency, we downweight it. We can then use these TF-IDF weighted features as inputs to any classifier. TF-IDF is essentially a measure of term importance, and of how discriminative a word is in a corpus. There are a variety of nuances involved in computing TF-IDF, mainly involving where to add the smoothing term to avoid division by 0, or log of 0 errors. The formula for TF-IDF in scikit-learn differs from that of most textbooks. With the default setting smooth_idf=True it is:
$$\mbox{TF-IDF}(t, d) = \mbox{TF}(t, d)\times \mbox{IDF}(t) = n_{td} \left( \log{\left( \frac{1 + \vert D \vert}{1 + \vert \{d : t \in d\} \vert} \right)} + 1 \right)$$
where $n_{td}$ is the number of times term $t$ occurs in document $d$, $\vert D \vert$ is the number of documents, and $\vert \{d : t \in d\} \vert$ is the number of documents that contain $t$. By default, scikit-learn then also scales each document vector to unit Euclidean length.
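
The weighting can be checked by hand on a toy corpus. This sketch assumes scikit-learn's default `smooth_idf=True`, under which the inverse document frequency is $\ln\left(\frac{1 + \vert D \vert}{1 + \mathrm{df}(t)}\right) + 1$ with $\mathrm{df}(t)$ the number of documents containing $t$. It also passes `norm=None` to switch off the default length normalization so that the raw weights can be compared. The three documents are made up.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Made-up corpus in which "movie" appears in every document
docs = ["movie movie popcorn", "movie knitting", "movie computers computers"]

count_vec = CountVectorizer()
counts = count_vec.fit_transform(docs).toarray()   # n_td: raw term counts
n_docs = counts.shape[0]                           # |D|
doc_freq = (counts > 0).sum(axis=0)                # number of documents containing each term

# Smoothed IDF as assumed above for scikit-learn's default settings
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1
manual = counts * idf

# norm=None disables the default L2 normalization of each row
tfidf = TfidfVectorizer(norm=None).fit_transform(docs).toarray()
print(np.allclose(manual, tfidf))
print(dict(zip(sorted(count_vec.vocabulary_), np.round(idf, 3))))
```

The word that occurs in every document gets the smallest IDF weight, which is the downweighting of uninformative words described above.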

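The training and test accuracies are very similar (about 0.59 and 0.60), so this model does not overfit. However, the accuracy is too low to use in an analysis compared to the more than 70% reached by the count-based models. It is close to the share of male users in the data (35829 of 59946, about 60%), which is roughly what always predicting "male" would score.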

In [29]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(min_df=best_min_df)
X, y = make_xy(essay_sex, vectorizer)
xtrain=X[mask]
ytrain=y[mask]
xtest=X[~mask]
ytest=y[~mask]

clf = MultinomialNB(alpha=best_alpha).fit(xtrain, ytrain)

training_accuracy = clf.score(xtrain, ytrain)
test_accuracy = clf.score(xtest, ytest)

print("Accuracy on training data: {:2f}".format(training_accuracy))
print("Accuracy on test data:     {:2f}".format(test_accuracy))

Accuracy on training data: 0.593861
Accuracy on test data:     0.600234


Using n-grams with TF-IDF weighting gives essentially the same low accuracy. Skip.

In [30]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(ngram_range=(1,2), min_df=best_min_df)
X, y = make_xy(essay_sex, vectorizer)
xtrain=X[mask]
ytrain=y[mask]
xtest=X[~mask]
ytest=y[~mask]

clf = MultinomialNB(alpha=best_alpha).fit(xtrain, ytrain)

training_accuracy = clf.score(xtrain, ytrain)
test_accuracy = clf.score(xtest, ytest)

print("Accuracy on training data: {:2f}".format(training_accuracy))
print("Accuracy on test data:     {:2f}".format(test_accuracy))

Accuracy on training data: 0.593416
Accuracy on test data:     0.599662
