# Testing a fiction filter

This notebook simply loads some training data and trains a regularized logistic regression on it, using a grid search to find the best number of features and regularization constant. We optimize on F0.5 score, a weighted harmonic mean of precision and recall that puts more emphasis on precision. In practice, I don't think this produces results hugely different from F1 score.
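For reference, the general F-beta family that F0.5 and F1 belong to can be sketched as below. This is a minimal illustration, not code from the notebook; the helper name `f_beta` and the example values are made up here.

```python
def f_beta(precision, recall, beta):
    """Weighted harmonic mean of precision and recall.

    beta < 1 favors precision, beta > 1 favors recall,
    and beta == 1 gives the ordinary F1 score.
    """
    b2 = beta ** 2
    denominator = b2 * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + b2) * precision * recall / denominator


# Illustrative values only, not results from this dataset.
p, r = 0.9, 0.6
print("F1  :", f_beta(p, r, 1.0))
print("F0.5:", f_beta(p, r, 0.5))
```

When precision is higher than recall, F0.5 comes out above F1, since it rewards precision. The `onepass` function below computes `1.5 * P * R / (0.5 * P + R)`, which in this notation corresponds to beta squared equal to 0.5 rather than beta equal to 0.5.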

There are more sophisticated feature-selection strategies than simply using *n* most common, but if I used more sophisticated selection strategies I would also need a more sophisticated validation strategy to avoid fooling myself; e.g. a validation set separate from the test set. Without extensive resources for generating labeled data, I'm trying to keep things relatively quick and simple.
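To make that concern concrete: a more elaborate feature-selection step would be tuned on a validation split and scored once on a separate test split. Here is a rough sketch using only the standard library. The function name, the split fractions, and the seed are illustrative assumptions, not part of this notebook.

```python
import random


def three_way_split(n_rows, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle row indices and split them into train / validation / test."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = int(n_rows * test_frac)
    n_val = int(n_rows * val_frac)
    test = indices[:n_test]
    val = indices[n_test:n_test + n_val]
    train = indices[n_test + n_val:]
    return train, val, test


# 514 is the number of rows in the training data loaded below.
train_idx, val_idx, test_idx = three_way_split(514)
print(len(train_idx), len(val_idx), len(test_idx))
```

Feature selection and hyperparameters would be chosen using only `train_idx` and `val_idx`, and `test_idx` would be touched once at the end. With about five hundred labeled rows, each split gets small, which is the cost of the more careful setup.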

In [1]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score  # formerly sklearn.cross_validation, removed in scikit-learn 0.20
from sklearn.preprocessing import StandardScaler
import pickle



In [2]:
rawdata = pd.read_csv('trainingdata.tsv', sep = '\t')
rawdata.head()

Unnamed: 0,sequenceID,genrecode,#matchquality,#rareword,of,the,and,a,is,to,...,neither,self,rest,drawn,effect,somewhat,latter,lord,u,beauty
0,393,n,2.351964,0.769964,0.001192,0.001192,0.001192,0.001192,0.001192,0.001192,...,0.0,0.0,0.0,0.001192,0.001192,0.0,0.0,0.001192,0.0,0.0
1,263,n,2.757639,0.733205,0.00096,0.00096,0.00096,0.00096,0.00096,0.00096,...,0.00096,0.00096,0.0,0.0,0.0,0.00096,0.00096,0.00096,0.00096,0.0
2,56,y,2.404113,0.481013,0.004219,0.004219,0.004219,0.004219,0.004219,0.004219,...,0.0,0.0,0.0,0.0,0.004219,0.0,0.0,0.0,0.0,0.0
3,8-3,y,2.256296,0.416667,0.016667,0.016667,0.016667,0.016667,0.016667,0.016667,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,585,y,2.243333,0.414894,0.010638,0.010638,0.010638,0.010638,0.010638,0.010638,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


This is the complete dataset as it exists on disk. We may not use all the features. Notice that the most common feature is '#rareword', i.e., any English word outside this limited vocabulary.

We're going to select only the columns for features, leaving out the first two metadata columns.

In [3]:
termdoc = rawdata.iloc[ : , 2 : 402]
termdoc.shape

(514, 400)

Let's create a vector that maps genrecode (is it fiction? n/y) to an integer code.

In [4]:
classvec = rawdata.genrecode.map({'n' : 0, 'y': 1})

Now the function that actually trains a logistic model.

In [5]:
def onepass(cval, numfeatures, termdoc, classvec):
    '''
    cval is the regularization constant
    numfeatures the number of features to use in the model
    termdoc is X, aka the feature matrix
    classvec is y, aka the vector of class integers to be predicted
    '''
    
    scaler = StandardScaler()
    data = scaler.fit_transform(termdoc.iloc[ : , 0 : numfeatures])
    
    # Note that we scale and center the columns of the feature matrix.
    
    model = LogisticRegression(C = cval)
    f1_scores = cross_val_score(model, data, classvec,
                             scoring = 'f1', cv=10)
    f1 = sum(f1_scores) / len(f1_scores)
    # Tenfold crossvalidation, using F1 score.
    
    precision_scores = cross_val_score(model, data, classvec,
                             scoring = 'precision', cv=10)
    precision = sum(precision_scores) / len(precision_scores)
    
    recall_scores = cross_val_score(model, data, classvec,
                             scoring = 'recall', cv=10)
    recall = sum(recall_scores) / len(recall_scores)
    
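    # Precision-weighted F score. Note that this is F-beta with
    # beta**2 = 0.5; the textbook F0.5 (beta = 0.5) would be
    # 1.25 * P * R / (0.25 * P + R).
    f05 = 1.5 * (precision * recall) / ((.5 * precision) + recall)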
    
    model.fit(data, classvec)
    predictions = model.predict(data)
    
    # We return the precision-weighted F score, precision, and
    # recall of a cross-validated model, plus the predictions
    # of a model trained on all the data.
    
    return f05, precision, recall, predictions
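
One possible variation, sketched here rather than taken from the notebook: `onepass` fits the `StandardScaler` on all rows before cross-validating, so each held-out fold contributes to the scaling statistics. Wrapping the scaler and the model in a pipeline refits the scaler inside every fold, and `cross_validate` collects several metrics in one pass. The function name `onepass_pipeline` and the synthetic data are assumptions for illustration. Note also that this averages the per-fold F0.5 (beta = 0.5), which is not the same number as a formula applied to averaged precision and recall.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def onepass_pipeline(cval, X, y, folds=10):
    """Cross-validate scaler + logistic regression as one unit."""
    pipeline = make_pipeline(StandardScaler(), LogisticRegression(C=cval))
    scoring = {
        "f05": make_scorer(fbeta_score, beta=0.5),
        "precision": "precision",
        "recall": "recall",
    }
    results = cross_validate(pipeline, X, y, scoring=scoring, cv=folds)
    # cross_validate reports each metric under the key "test_<name>".
    return {name: float(np.mean(results["test_" + name])) for name in scoring}


# Synthetic stand-in for the term-document matrix and class vector.
X_demo, y_demo = make_classification(n_samples=200, n_features=30, random_state=0)
scores = onepass_pipeline(0.025, X_demo, y_demo)
print(scores)
```

With the real data, `X` would be the first `numfeatures` columns of `termdoc` and `y` would be `classvec`.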

Grid search across a range of feature counts and regularization constants. Notice that for regularization, we iterate across an integer range but then divide by ten thousand. So the starting value of 100 is actually .01.
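
The size of that grid, and the integer-to-constant conversion, can be checked with a few lines. This is a small side sketch using the same ranges as the search; the variable names are illustrative.

```python
# The same ranges as the grid search in this notebook.
feature_counts = list(range(140, 410, 10))
c_values = [cval / 10000 for cval in range(100, 350, 10)]

print(feature_counts[0], feature_counts[-1], len(feature_counts))
print(c_values[0], c_values[-1], len(c_values))
print(len(feature_counts) * len(c_values), "combinations")
```

That is 27 feature counts times 25 constants, or 675 calls to `onepass`, each of which runs three tenfold cross-validations, so the loop takes a while.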

In [6]:
bestscores = []
for features in range(140, 410, 10):
    for cval in range(100, 350, 10):
        f05, precision, recall, predictions = onepass(cval/ 10000, features, termdoc, classvec)
        print(f05, cval, features)
        bestscores.append((f05, cval, features))
        

0.8492405418 100 140
0.847416426482 110 140
0.850099335925 120 140
0.849488127658 130 140
0.850940640231 140 140
0.850953064438 150 140
0.848344292706 160 140
0.849641291058 170 140
0.846607974572 180 140
0.846188518084 190 140
0.847473999941 200 140
0.848743070063 210 140
0.850085351866 220 140
0.85261777083 230 140
0.853473000544 240 140
0.854775704749 250 140
0.851574026675 260 140
0.849935311063 270 140
0.849935311063 280 140
0.849935311063 290 140
0.851207614223 300 140
0.851207614223 310 140
0.851207614223 320 140
0.852489834096 330 140
0.852489834096 340 140
0.855808583796 100 150
0.857183175798 110 150
0.854353873178 120 150
0.854225643179 130 150
0.855564658516 140 150
0.857670900322 150 150
0.857670900322 160 150
0.858914859107 170 150
0.860214972866 180 150
0.861952001035 190 150
0.864464332044 200 150
0.862461831751 210 150
0.863716861444 220 150
0.863716861444 230 150
0.864936831272 240 150
0.864356934474 250 150
0.865864767729 260 150
0.865091222178 270 150
0.865091222178

0.873594116549 180 280
0.870152260062 190 280
0.871516536179 200 280
0.868176271068 210 280
0.868176271068 220 280
0.869964571279 230 280
0.871261754041 240 280
0.867413441514 250 280
0.867413441514 260 280
0.866061449432 270 280
0.866061449432 280 280
0.870147129385 290 280
0.868803431525 300 280
0.868803431525 310 280
0.868803431525 320 280
0.870124963723 330 280
0.870124963723 340 280
0.87474796975 100 290
0.874842112922 110 290
0.880953922062 120 290
0.88046154544 130 290
0.879165454661 140 290
0.880481757443 150 290
0.881910449235 160 290
0.881910449235 170 290
0.884573127037 180 290
0.881510741405 190 290
0.882801613137 200 290
0.884094996539 210 290
0.882801613137 220 290
0.885407522842 230 290
0.884069884319 240 290
0.885407522842 250 290
0.882614544851 260 290
0.88144285139 270 290
0.88144285139 280 290
0.882206608578 290 290
0.882206608578 300 290
0.878343675407 310 290
0.878343675407 320 290
0.877033389997 330 290
0.875302995576 340 290
0.873993101547 100 300
0.873993101547 

In [7]:
bestscores.sort()
bestscores[-1]

(0.8854075228418341, 250, 290)
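
One detail worth noting: sorting tuples compares the score first and then breaks ties on `cval` and `features`. In the printed output above, the constants 230 and 250 show the same score with 290 features, so the sort reports the larger one. A small sketch with three rows copied from that output (as printed, so rounded to twelve digits):

```python
# Three (score, cval, features) rows taken from the printed grid-search output.
rows = [
    (0.885407522842, 230, 290),
    (0.884069884319, 240, 290),
    (0.885407522842, 250, 290),
]

# Tuples sort element by element, so equal scores are ordered by cval.
print(sorted(rows)[-1])

# To prefer the smaller constant (stronger regularization) among ties,
# compare on the score and the negated cval instead.
print(max(rows, key=lambda row: (row[0], -row[1])))
```

Since the tied scores are indistinguishable here, either constant is a defensible choice; the notebook goes on to use .025.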

Test the best model more specifically. Note that precision is good, which is important.

**Recall 78.8%, precision 94.4%.**

In [8]:
f05, precision, recall, predictions = onepass(.025, 290, termdoc, classvec)
print(f05, precision, recall, sum(predictions))

0.885407522842 0.943554839776 0.788253968254 337


### Produce a model to export

Ultimately we need a model to use in practice. It is trained on all the data, so it can't itself be cross-validated.

In [9]:
scaler = StandardScaler()
data = scaler.fit_transform(termdoc.iloc[ : , 0 : 290])
model = LogisticRegression(C = .0250)
model.fit(data, classvec)
print('Model trained.')

Model trained.


In [10]:
with open('fictionreview_scaler.pkl', mode = 'wb') as f:
    pickle.dump(scaler, f)

with open('fictionreview_model.pkl', mode = 'wb') as f:
    pickle.dump(model, f)

words = list(termdoc.columns)
with open('fictionreview_vocab.txt', mode = 'w', encoding = 'utf-8') as f:
    for w in words[0: 400]:
        f.write(w + '\n')
    
    

The cell above is where we actually export it.
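
A rough sketch of how the exported pieces might be used later. It round-trips a small stand-in scaler and model through `pickle` in a temporary directory, since the real files are not assumed to exist here; the synthetic data and the `apply_model` helper are illustrative assumptions. Because the vocabulary file lists 400 words while the model was trained on the first 290, the helper reads the feature count from the model itself.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler


def apply_model(scaler, model, termdoc_values):
    """Scale the leading columns the model was trained on, then predict."""
    n_features = model.coef_.shape[1]
    scaled = scaler.transform(termdoc_values[:, :n_features])
    return model.predict(scaled)


# Stand-in data: 60 rows, 40 columns, of which the model uses the first 29.
rng = np.random.default_rng(0)
X = rng.random((60, 40))
y = (X[:, 0] > 0.5).astype(int)

scaler = StandardScaler()
model = LogisticRegression(C=0.025)
model.fit(scaler.fit_transform(X[:, :29]), y)

with tempfile.TemporaryDirectory() as tmpdir:
    paths = {}
    for name, obj in (("scaler", scaler), ("model", model)):
        paths[name] = os.path.join(tmpdir, name + ".pkl")
        with open(paths[name], mode="wb") as f:
            pickle.dump(obj, f)

    with open(paths["scaler"], mode="rb") as f:
        loaded_scaler = pickle.load(f)
    with open(paths["model"], mode="rb") as f:
        loaded_model = pickle.load(f)

same = np.array_equal(apply_model(scaler, model, X),
                      apply_model(loaded_scaler, loaded_model, X))
print("Predictions match after reload:", same)
```

Pickled scikit-learn objects are only reliably loadable with the library version that wrote them, so it's worth recording that version alongside the files.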

#### Examine feature weights.

In [11]:
words = list(termdoc.columns)

In [13]:
features = list(zip(list(model.coef_[0]), words[0: 290]))
features.sort()
for x, y in features:
    print(y, x)

voyages -0.261827961886
letters -0.224647052144
subject -0.198401479471
set -0.150085831867
history -0.128063938661
#rareword -0.126347714836
every -0.125106509586
under -0.121826325571
i -0.118591572494
those -0.117814165713
years -0.113434810042
must -0.112848852913
have -0.0992388495922
here -0.0985770540385
think -0.0923195072766
no -0.0916571324327
my -0.0904910984421
something -0.0821340851033
whose -0.0788725257495
even -0.0780926426055
own -0.0773125400245
#romannumeral -0.0755540278862
j -0.0751430955746
seems -0.0730622234412
fact -0.0725249282783
present -0.0720757908135
among -0.071412565102
#placename -0.0712506528482
me -0.0695599992231
who -0.0606753957184
p -0.0606284239244
each -0.0598917045514
their -0.0593917753438
general -0.0578274678573
also -0.0574639018566
there -0.05629184788
natural -0.0562326352692
mind -0.0553297658734
h -0.0545785112665
most -0.0536637550975
t -0.052766847354
co -0.049981536323
till -0.0488838405709
so -0.0477713800443
cannot -0.04767782582