# Testing a fiction filter

This notebook loads some training data and trains a regularized logistic regression on it, using a grid search to find a good number of features and regularization constant. We optimize on a precision-weighted F score (called f05 in the code below) rather than on plain F1.

There are more sophisticated feature-selection strategies than this, but if I used them I would also need a more sophisticated validation strategy to avoid fooling myself; e.g. a validation set separate from the test set.
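As a rough illustration of that more careful strategy, the sketch below sets a test set aside first and runs model selection only on the remainder. The synthetic data from `make_classification`, the 80/20 split, and the small grid of `C` values are all assumptions made for illustration; none of this is part of the pipeline used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real feature matrix (assumption).
X, y = make_classification(n_samples=400, n_features=20, random_state=0)

# Hold out a test set that model selection never sees.
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Tune C by cross-validation on the development portion only.
pipeline = make_pipeline(StandardScaler(), LogisticRegression())
search = GridSearchCV(
    pipeline,
    param_grid={'logisticregression__C': [0.01, 0.1, 1.0]},
    scoring='f1', cv=5)
search.fit(X_dev, y_dev)

# Score the selected model once on the untouched test set.
test_f1 = f1_score(y_test, search.predict(X_test))
print(search.best_params_, test_f1)
```

Because the test rows play no part in choosing `C`, the final score is not flattered by the selection process.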

In [1]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from sklearn.linear_model import LogisticRegression
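# sklearn.cross_validation was removed in scikit-learn 0.20;
# cross_val_score now lives in sklearn.model_selection.
from sklearn.model_selection import cross_val_score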
from sklearn.preprocessing import StandardScaler
import pickle

In [22]:
rawdata = pd.read_csv('trainingdata.tsv', sep = '\t')
rawdata.head()

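| Unnamed: 0 | sequenceID | genrecode | #rareword | tale | novels | letters | the | of | and | tales | ... | therefore | rest | self | late | person | somewhat | soon | excellent | instead | seem |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 401 | n | 0.44086 | 0.0 | 0.0 | 0.0 | 0.010753 | 0.010753 | 0.010753 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.010753 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 433 | y | 0.564748 | 0.0 | 0.0 | 0.0 | 0.003597 | 0.003597 | 0.003597 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.003597 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 209 | y | 0.45098 | 0.009804 | 0.0 | 0.0 | 0.009804 | 0.009804 | 0.009804 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 280 | y | 0.57047 | 0.0 | 0.0 | 0.0 | 0.006711 | 0.006711 | 0.006711 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 393 | n | 0.771156 | 0.0 | 0.0 | 0.0 | 0.001192 | 0.001192 | 0.001192 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.001192 | 0.0 |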


This is the complete dataset as it exists on disk. We may not use all the features. Notice that the most common feature is '#rareword', i.e., any English word outside this limited vocabulary.

We're going to select only the columns for features, leaving out the first two metadata columns.

In [23]:
termdoc = rawdata.iloc[ : , 2 : 402]
termdoc.shape

(426, 400)
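Selecting by position works as long as the column order on disk never changes. A name-based alternative is sketched below on a made-up toy frame; the list `metadata_columns` is a hypothetical name, and which columns count as metadata in the real file is an assumption you would need to check.

```python
import pandas as pd

# Toy frame with the same kind of layout (values are made up).
toy = pd.DataFrame({
    'sequenceID': [401, 433],
    'genrecode': ['n', 'y'],
    '#rareword': [0.44, 0.56],
    'tale': [0.0, 0.01],
})

# Hypothetical list of non-feature columns; adjust to the real file.
metadata_columns = ['sequenceID', 'genrecode']
features = toy.drop(columns=metadata_columns)

# Every remaining column should be numeric before scaling.
all_numeric = all(pd.api.types.is_numeric_dtype(t) for t in features.dtypes)
print(list(features.columns), all_numeric)
```

Checking the dtypes right after selection catches a stray text column before it reaches the scaler.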

Let's create a vector that maps genrecode (is it fiction? 'n' or 'y') to an integer code (0 or 1).

In [24]:
classvec = rawdata.genrecode.map({'n' : 0, 'y': 1})

Now the function that actually trains a logistic model.

In [25]:
def onepass(cval, numfeatures, termdoc, classvec):
    '''
    cval is the regularization constant
    numfeatures the number of features to use in the model
    termdoc is X, aka the feature matrix
    classvec is y, aka the vector of class integers to be predicted
    '''
    
    scaler = StandardScaler()
    data = scaler.fit_transform(termdoc.iloc[ : , 0 : numfeatures])
    
    # Note that we scale and center the columns of the feature matrix.
    
    model = LogisticRegression(C = cval)
    f1_scores = cross_val_score(model, data, classvec,
                             scoring = 'f1', cv=10)
    f1 = sum(f1_scores) / len(f1_scores)
    # Tenfold crossvalidation, using F1 score.
    
    precision_scores = cross_val_score(model, data, classvec,
                             scoring = 'precision', cv=10)
    precision = sum(precision_scores) / len(precision_scores)
    
    recall_scores = cross_val_score(model, data, classvec,
                             scoring = 'recall', cv=10)
    recall = sum(recall_scores) / len(recall_scores)
    
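    # Precision-weighted F score computed from the mean precision and recall.
    # With these constants beta**2 = 0.5 (beta is about 0.71); the conventional
    # F0.5 would use 1.25 and .25. The scores reported below use this formula.
    f05 = 1.5 * (precision * recall) / ((.5 * precision) + recall)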
    
    model.fit(data, classvec)
    predictions = model.predict(data)
    
    # We return the precision-weighted F score (f05) computed
    # from cross-validated precision and recall, those two
    # averages themselves, and the predictions of a model
    # trained on all the data. The plain F1 is not returned.
    
    return f05, precision, recall, predictions
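The three separate `cross_val_score` calls above refit the model once per metric, and the scaler sees every row before the folds are split. A possible variant, sketched here on synthetic data rather than the real matrix, gathers all metrics in a single `cross_validate` pass and puts the scaler inside a pipeline so that it is refit within each training fold. The function name `cv_metrics` and the synthetic data are assumptions, and the scores this approach produces would not match those reported in this notebook exactly.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def cv_metrics(cval, X, y, folds=10):
    '''Mean cross-validated F1, precision, and recall for one C value.'''
    # Scaling happens inside each fold, so held-out rows never inform the scaler.
    pipeline = make_pipeline(StandardScaler(), LogisticRegression(C=cval))
    results = cross_validate(pipeline, X, y, cv=folds,
                             scoring=['f1', 'precision', 'recall'])
    return {name: float(np.mean(results['test_' + name]))
            for name in ('f1', 'precision', 'recall')}

# Synthetic stand-in for termdoc and classvec (assumption).
X, y = make_classification(n_samples=300, n_features=30, random_state=0)
metrics = cv_metrics(0.0265, X, y)
print(metrics)
```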

Grid search across a range of feature counts and regularization constants. Notice that for regularization, we iterate across an integer range but then divide by ten thousand, so a cval of 265 is actually C = .0265.

In [30]:
bestscores = []
for features in range(150, 400, 10):
    for cval in range(150, 300, 5):
        f05, precision, recall, predictions = onepass(cval/ 10000, features, termdoc, classvec)
        print(f05, cval, features)
        bestscores.append((f05, cval, features))
        

0.870240289584 150 150
0.871831652962 155 150
0.871831652962 160 150
0.871831652962 165 150
0.874160246982 170 150
0.874160246982 175 150
0.874160246982 180 150
0.875620897134 185 150
0.877181549435 190 150
0.877181549435 195 150
0.877361340513 200 150
0.877361340513 205 150
0.877361340513 210 150
0.878897989729 215 150
0.878897989729 220 150
0.878897989729 225 150
0.877337867087 230 150
0.879101544268 235 150
0.879101544268 240 150
0.879101544268 245 150
0.879101544268 250 150
0.879101544268 255 150
0.879101544268 260 150
0.879101544268 265 150
0.879101544268 270 150
0.879101544268 275 150
0.879101544268 280 150
0.879101544268 285 150
0.880546258816 290 150
0.880546258816 295 150
0.86406798538 150 160
0.86406798538 155 160
0.86406798538 160 160
0.86959676444 165 160
0.868142853746 170 160
0.868142853746 175 160
0.866750982742 180 160
0.865158173497 185 160
0.865158173497 190 160
0.865158173497 195 160
0.865158173497 200 160
0.863264009496 205 160
0.863264009496 210 160
0.864966033919 

In [31]:
bestscores.sort()
bestscores[-1]

(0.90416108246769544, 265, 210)
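The nested loops above could also lean on scikit-learn's `GridSearchCV` for the regularization constant. The sketch below is one way to do that, under several assumptions: synthetic data, a much smaller grid, and the conventional F0.5 from `fbeta_score`, which is scored per fold and then averaged. That is not the same quantity as the f05 computed in `onepass`, so the numbers would not be directly comparable.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data (assumption).
X, y = make_classification(n_samples=300, n_features=40, random_state=0)

pipeline = Pipeline([
    ('scale', StandardScaler()),
    ('model', LogisticRegression()),
])

# A deliberately small grid; the ranges used in this notebook are far larger.
c_values = [0.015, 0.02, 0.0265, 0.03]
feature_counts = (10, 20, 40)

# Conventional F0.5 (beta = 0.5), which weights precision above recall.
scorer = make_scorer(fbeta_score, beta=0.5)

best = None
for k in feature_counts:
    # Use the first k columns, mirroring the slice in onepass.
    search = GridSearchCV(pipeline, {'model__C': c_values},
                          scoring=scorer, cv=10)
    search.fit(X[:, :k], y)
    candidate = (search.best_score_, search.best_params_['model__C'], k)
    if best is None or candidate > best:
        best = candidate

print(best)
```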

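Test the best model more specifically. Note that precision is good (about 0.94, against a recall of about 0.83), which is important. The last number printed is how many of the 426 training rows the model trained on all the data labels as fiction.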

In [32]:
f05, precision, recall, predictions = onepass(.0265, 210, termdoc, classvec)
print(f05, precision, recall, sum(predictions))

0.904161082468 0.944881633054 0.832413793103 270
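The reported score can be reproduced by hand from the printed precision and recall, which also shows where it sits relative to other F measures. The helper `weighted_f` is a hypothetical name introduced here; it implements the general F-beta formula, of which the expression in `onepass` is the case where beta squared equals 0.5.

```python
def weighted_f(precision, recall, beta_squared):
    '''General F-beta: (1 + b^2) * P * R / (b^2 * P + R).'''
    return ((1 + beta_squared) * precision * recall
            / (beta_squared * precision + recall))

# Mean precision and recall as printed above.
precision, recall = 0.944881633054, 0.832413793103

# The formula in onepass corresponds to beta**2 = 0.5.
notebook_score = weighted_f(precision, recall, 0.5)

# The conventional F0.5 uses beta = 0.5, i.e. beta**2 = 0.25.
standard_f05 = weighted_f(precision, recall, 0.25)

# Plain F1 for comparison (beta**2 = 1).
f1 = weighted_f(precision, recall, 1.0)

print(notebook_score, standard_f05, f1)
```

Since precision is higher than recall here, the more weight a measure puts on precision, the higher the score comes out.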


### Produce a model to export

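Ultimately we have to make a model to use. It is trained on all the data, so it can't itself be cross-validated; the cross-validated scores above are our estimate of how it will perform.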

In [33]:
scaler = StandardScaler()
data = scaler.fit_transform(termdoc.iloc[ : , 0 : 210])
model = LogisticRegression(C = .0265)
model.fit(data, classvec)
print('Model trained.')

Model trained.


In [34]:
with open('fictionreview_scaler.pkl', mode = 'wb') as f:
    pickle.dump(scaler, f)

with open('fictionreview_model.pkl', mode = 'wb') as f:
    pickle.dump(model, f)

words = list(termdoc.columns)
with open('fictionreview_vocab.txt', mode = 'w', encoding = 'utf-8') as f:
    for w in words[0: 400]:
        f.write(w + '\n')
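For completeness, here is a minimal sketch of how a saved scaler and model might be loaded and applied later. To keep it runnable without the real files, it trains a stand-in model on random data and round-trips it through a temporary folder; the file names, the random data, and the five-column shape are all assumptions.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Stand-in training step on random data (assumption), so the sketch
# runs without the real files.
rng = np.random.default_rng(0)
X_train = rng.random((100, 5))
y_train = (X_train[:, 0] > 0.5).astype(int)
scaler = StandardScaler().fit(X_train)
model = LogisticRegression(C=0.0265).fit(scaler.transform(X_train), y_train)

with tempfile.TemporaryDirectory() as folder:
    scaler_path = os.path.join(folder, 'scaler.pkl')  # hypothetical names
    model_path = os.path.join(folder, 'model.pkl')
    with open(scaler_path, mode='wb') as f:
        pickle.dump(scaler, f)
    with open(model_path, mode='wb') as f:
        pickle.dump(model, f)

    # Later, possibly in another script: reload both objects.
    with open(scaler_path, mode='rb') as f:
        loaded_scaler = pickle.load(f)
    with open(model_path, mode='rb') as f:
        loaded_model = pickle.load(f)

# New rows need the same columns, in the same order, as the training
# slice. Apply the saved scaler with transform (not fit_transform).
X_new = rng.random((3, 5))
probabilities = loaded_model.predict_proba(loaded_scaler.transform(X_new))[:, 1]
print(probabilities)
```

One thing to keep in mind with the real files: the exported model was fit on the first 210 columns, while the vocabulary file lists up to 400 words, so a consumer would presumably need to use only the first 210 of them.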
    
    