## Generate Topic Model
Create a topic model using the 20 Newsgroups dataset.
The model will be used to split the Dailymail dataset into separate topics.

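Baseline Naive Bayes / SVM (NB-SVM) model adapted from: https://www.kaggle.com/jhoward/nb-svm-strong-linear-baseline
A linear SVM and logistic regression are not mathematically equivalent (hinge loss vs. log loss), but both are linear classifiers that usually perform similarly, so the linked baseline uses logistic regression in place of the SVM.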

I first tested multinomial NB, but found that, without any fine-tuning or feature engineering, NB-SVM performed much better on the full dataset.
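
For reference, the NB-SVM idea can be sketched as follows. This is a minimal illustration and not the code in `NBSVM_TopicModel` (which is not shown here): the helper names `log_count_ratio` and `fit_nbsvm`, the smoothing value `alpha`, and the regularisation strength `C` are all assumptions. Each feature is scaled by its smoothed Naive Bayes log-count ratio `r`, and a logistic regression is then fitted on the scaled features.

```python
import numpy as np
from scipy import sparse
from sklearn.linear_model import LogisticRegression

def log_count_ratio(x, y, alpha=1.0):
    """Smoothed Naive Bayes log-count ratio for a binary label vector y."""
    y = np.asarray(y)
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == 0)
    # smoothed average feature weight per class
    p = (x[pos].sum(axis=0) + alpha) / (len(pos) + alpha)
    q = (x[neg].sum(axis=0) + alpha) / (len(neg) + alpha)
    return np.asarray(np.log(p / q))  # shape (1, n_features)

def fit_nbsvm(x, y, C=4.0):
    """Fit logistic regression on features scaled by the log-count ratio."""
    r = log_count_ratio(x, y)
    model = LogisticRegression(C=C, max_iter=1000)
    model.fit(x.multiply(r).tocsr(), y)
    return model, r

# toy document-term matrix: feature 0 marks class 1, feature 1 marks class 0
x_toy = sparse.csr_matrix(np.array([[2, 0, 1], [3, 0, 0], [0, 2, 1], [0, 3, 0]], dtype=float))
y_toy = np.array([1, 1, 0, 0])
toy_model, toy_r = fit_nbsvm(x_toy, y_toy)
toy_probs = toy_model.predict_proba(x_toy.multiply(toy_r).tocsr())[:, 1]
```

At prediction time the same `r` has to be applied to the test matrix, which is why the pipeline further down multiplies `test_vec` by `r` before calling `predict_proba`.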

In [7]:
import re
import pandas as pd
import numpy as np

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

import sys, os
sys.path.append(os.getcwd())
from NBSVM_TopicModel import model_20ng


In [2]:
# use sklearn to easily acquire data - remove extraneous information
def extract_20ng(print_sample = False):
    train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))
    test = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'))
    
    if print_sample:
        print("Print 20 Newsgroup sample \n")
        print(train.data[0], "\n", train.target_names[train.target[0]])
    
    return train, test

Generate labels dataframe for binary classification

In [3]:
def gen_y(data):
    df = []
    # for each target, create a row with file id and dummies for each of the labels
    for i in range(len(data.data)):
        row = [re.sub("^.*/", "", data.filenames[i])]
        labels = [0]*len(data.target_names) #20
        labels[data.target[i]] = 1
        row += labels
        df.append(row)
      
    cols = ['filename'] 
    cols += data.target_names
    
    return pd.DataFrame(df, columns = cols)

#df = gen_y(train)
#df.head()
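
As a cross-check on the label columns that `gen_y` produces, the same one-hot layout can be built without an explicit loop. This is only a sketch: `one_hot_labels` is a hypothetical helper, and it omits the `filename` column.

```python
import numpy as np
import pandas as pd

def one_hot_labels(targets, target_names):
    """One row per document, one 0/1 column per topic name."""
    eye = np.eye(len(target_names), dtype=int)
    # indexing the identity matrix by class id selects the matching one-hot row
    return pd.DataFrame(eye[np.asarray(targets)], columns=list(target_names))

demo_labels = one_hot_labels([2, 0, 1], ["alt.atheism", "comp.graphics", "rec.autos"])
```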

Run our model pipeline 

In [8]:
clf20 = model_20ng()

# run model
def main(clf):
    train, test = extract_20ng(True)
    clf.add_topics(train.target_names)
    
    # create TF-IDF vectorizer over unigrams and bigrams
    print("\n...Generating TF-IDF Matrices...")
    vec = TfidfVectorizer(ngram_range=(1,2), tokenizer=clf.spacy_tokenizer,
               min_df=3, max_df=0.9, strip_accents='unicode', use_idf=True,
               smooth_idf=True, sublinear_tf=True)
    train_vec = vec.fit_transform(train.data)
    test_vec = vec.transform(test.data)
    clf.vec = vec # save vectorizer
    
    # generate label dataframes
    y_train = gen_y(train)
    y_test = gen_y(test)
    
    # one column per topic, plus a final column for the selected label
    # (y_test has a filename column followed by one column per topic)
    preds = np.zeros((len(test.data), y_test.shape[1]))
    
    # run a binary (one-vs-rest) model for each label
    print("\n...Fitting models...")
    for i, col in enumerate(y_train.iloc[:,1:]):
        print(col)
        model, r = clf.get_mdl(train_vec, y_train[col])
        # scale test features by the same log-count ratio used in training
        preds[:,i] = model.predict_proba(test_vec.multiply(r))[:,1]
        
        # add results
        clf.add_result(r, model)
        
    # select max label from the per-topic probabilities only
    preds[:,-1], _ = clf.select_max_labels(preds[:,:-1])
    
    print("\n...Evaluating model...")
    clf.evaluate(test.target, preds[:,-1])
        
    return clf, preds
    
clf20, preds = main(clf20)

Print 20 Newsgroup sample 

I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail. 
 rec.autos

...Generating Bag of Word Matrices...

...Fitting models...
alt.atheism
comp.graphics
comp.os.ms-windows.misc
comp.sys.ibm.pc.hardware
comp.sys.mac.hardware
comp.windows.x
misc.forsale
rec.autos
rec.motorcycles
rec.sport.baseball
rec.sport.hockey
sci.crypt
sci.electronics
sci.med
sci.space
soc.religion.christian
talk.politics.guns
talk.politics.mideast
talk.politics.misc
talk.religion.misc

...Evaluating model...
Model Accuracy is....  71.18
              precision    recall  f1
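
The last two steps of `main` (picking the highest-scoring topic from the binary models, then scoring it) can be sketched as below. The real `select_max_labels` and `evaluate` methods live in `NBSVM_TopicModel` and are not shown here, so this assumes a plain argmax over the per-topic probabilities; `pick_top_label` and `accuracy` are hypothetical names.

```python
import numpy as np

def pick_top_label(probs):
    """Return the index and score of the highest-probability topic per row."""
    probs = np.asarray(probs)
    return probs.argmax(axis=1), probs.max(axis=1)

def accuracy(y_true, y_pred):
    """Fraction of rows where the predicted topic matches the true topic."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

demo_probs = np.array([[0.1, 0.7, 0.2],
                       [0.6, 0.3, 0.1]])
demo_top, demo_scores = pick_top_label(demo_probs)
demo_acc = accuracy([1, 1], demo_top)
```

Because each binary model is trained independently, the per-topic probabilities in a row need not sum to 1; the argmax only compares them against each other.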

Save model object

In [11]:
import pickle
import os
os.chdir("../..")
with open("./Models/20ng_topicModel", "wb") as f:
    pickle.dump(clf20, f)
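
To reuse the saved object later, it has to be loaded with `pickle.load`. The sketch below shows the round trip on a stand-in dictionary written to a temporary directory; `save_model` and `load_model` are hypothetical helpers. For the real `clf20` object, the `NBSVM_TopicModel` module must be importable when loading, because pickle stores class instances by reference to their defining module.

```python
import pickle
import tempfile
from pathlib import Path

def save_model(obj, path):
    """Pickle obj to path, creating the parent folder if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(obj, f)

def load_model(path):
    """Load a pickled object; only use this on files you trust."""
    with open(path, "rb") as f:
        return pickle.load(f)

# round trip on a stand-in object in a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "Models" / "demo_topicModel"
    save_model({"topics": ["rec.autos", "sci.space"]}, target)
    restored = load_model(target)
```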