# Logistic Regression
This logistic regression model has a high accuracy score, but it also shows a significant amount of overfitting. I spent some time tuning various parameters, with only marginal gains in accuracy or reductions in overfitting. Logistic regression seemed like a natural choice for this project: it is fairly straightforward and it fit my data relatively quickly.


In [21]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search was removed in scikit-learn 0.20
from sklearn.metrics import confusion_matrix
import re
import pickle
from sklearn.dummy import DummyClassifier
import matplotlib.pyplot as plt

# Loading my data

In [2]:
combined = pd.read_pickle('../assets/combined.pkl')

In [3]:
combined.columns

Index(['author', 'author_cakeday', 'author_flair_background_color',
       'author_flair_css_class', 'author_flair_richtext',
       'author_flair_template_id', 'author_flair_text',
       'author_flair_text_color', 'author_flair_type', 'author_fullname',
       'author_id', 'body', 'created_utc', 'distinguished', 'edited', 'id',
       'link_id', 'no_follow', 'parent_id', 'permalink', 'retrieved_on',
       'rte_mode', 'score', 'send_replies', 'stickied', 'subreddit',
       'subreddit_id'],
      dtype='object')

# Loading my assets

In [4]:
X_train = pd.read_pickle('../assets/X_train.pkl')
X_test = pd.read_pickle('../assets/X_test.pkl')
y_train = pd.read_pickle('../assets/y_train.pkl')
y_test = pd.read_pickle('../assets/y_test.pkl')

# Setting up the Pipeline
This included a tf-idf vectorizer, which I chose to help manage the sheer volume of data I was working with. I like that tf-idf scales the data somewhat, since it is not just a direct count of word occurrences. That said, tf-idf vectorizing is still a bag-of-words model, so it disregards grammar and sentiment.
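To make the "not just a direct count" point concrete, here is a minimal pure-Python sketch of the weighting. It assumes the smoothed formula that scikit-learn documents for the `TfidfVectorizer` defaults, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each document. The toy corpus and the `idf` and `tfidf_row` helpers are hypothetical, and the sketch skips the stop-word removal and bigrams that the pipeline in this notebook uses.

```python
import math
from collections import Counter

# Hypothetical toy corpus, not the Reddit comments used in this notebook.
docs = [
    "taxes mean theft",
    "taxes fund schools",
    "workers deserve more",
]

tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: the number of documents each term appears in.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))


def idf(term):
    # Smoothed inverse document frequency (assumed to match smooth_idf=True).
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1


def tfidf_row(tokens):
    # Raw term counts scaled by idf, then L2-normalized per document.
    counts = Counter(tokens)
    weights = {term: count * idf(term) for term, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}


row = tfidf_row(tokenized[0])
for term, weight in sorted(row.items()):
    print(f"{term:8s} {weight:.3f}")
```

In the first document every term occurs once, yet "taxes" gets a smaller weight than "theft" because it also appears in a second document. A plain count vectorizer would treat the two terms identically.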

In [5]:
pipe = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english', ngram_range=(1, 2))),
    ('lr', LogisticRegression(solver='liblinear')),
])


# Setting up the parameter grid

In [6]:
param_grid = {
    'tfidf__min_df': np.arange(1, 5, 2),
    'tfidf__max_df': [.10, .98, 1.0],
    'lr__C': np.linspace(0.1, .9, 3),
}
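As a quick sanity check on how large this search is, the sketch below enumerates every combination in the grid using only the standard library. The values are copied from the grid as it prints in the fitted `GridSearchCV` output in this notebook, and the fold count is the one reported in the fit log. The `candidates` list is only an illustration of what the search iterates over, not how scikit-learn builds it internally.

```python
from itertools import product

# Values as they appear in the fitted GridSearchCV repr:
# np.arange(1, 5, 2) -> [1, 3] and np.linspace(0.1, .9, 3) -> [0.1, 0.5, 0.9].
grid = {
    "tfidf__min_df": [1, 3],
    "tfidf__max_df": [0.10, 0.98, 1.0],
    "lr__C": [0.1, 0.5, 0.9],
}

# One dict per parameter combination the search will try.
names = list(grid)
candidates = [dict(zip(names, combo)) for combo in product(*grid.values())]

n_folds = 3  # the fit log in this notebook reports 3 folds
total_fits = len(candidates) * n_folds
print(len(candidates), "candidates,", total_fits, "fits")
```

Each extra value in any one list multiplies the number of fits, which is worth keeping in mind given how long a single search takes on this data.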

In [7]:
gs = GridSearchCV(pipe, param_grid=param_grid, verbose=1)

# Fitting the model

In [8]:
gs.fit(X_train,y_train)

Fitting 3 folds for each of 18 candidates, totalling 54 fits


[Parallel(n_jobs=1)]: Done  54 out of  54 | elapsed: 24.7min finished


GridSearchCV(cv=None, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('tfidf', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 2), norm='l2', preprocessor=None, smooth_idf=True,
 ...ty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))]),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'tfidf__min_df': array([1, 3]), 'tfidf__max_df': [0.1, 0.98, 1.0], 'lr__C': array([0.1, 0.5, 0.9])},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=1)

# Scoring the model
The training accuracy (about 0.81) is noticeably higher than the test accuracy (about 0.74), which indicated to me that the model overfit the training data.

In [18]:
gs.score(X_train,y_train)

0.8130511946459845

In [19]:
gs.score(X_test,y_test)

0.7387753482281043
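Accuracy on its own does not show which subreddit is being misclassified. The notebook imports `confusion_matrix` but never calls it, so here is a rough sketch of the same idea, counted by hand on made-up labels so that it runs without the fitted model. The `y_true` and `y_pred` lists are hypothetical stand-ins for `y_test` and `gs.predict(X_test)`.

```python
from collections import Counter

labels = ["LateStageCapitalism", "Libertarian"]

# Made-up labels for illustration only, not results from this notebook.
y_true = ["Libertarian", "Libertarian", "LateStageCapitalism",
          "LateStageCapitalism", "Libertarian", "LateStageCapitalism"]
y_pred = ["Libertarian", "LateStageCapitalism", "LateStageCapitalism",
          "Libertarian", "Libertarian", "LateStageCapitalism"]

# Count each (true label, predicted label) pair.
pair_counts = Counter(zip(y_true, y_pred))

# Rows are true classes, columns are predicted classes.
matrix = [[pair_counts[(t, p)] for p in labels] for t in labels]

# Correct predictions sit on the diagonal.
accuracy = sum(matrix[i][i] for i in range(len(labels))) / len(y_true)

for label, row in zip(labels, matrix):
    print(f"{label:20s} {row}")
print("accuracy:", round(accuracy, 3))
```

On the real data, `confusion_matrix(y_test, gs.predict(X_test))` should give the same layout, with true classes as rows and predicted classes as columns.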

In [12]:
gs.best_score_

0.7318809419564674

In [13]:
gs.best_estimator_

Pipeline(memory=None,
     steps=[('tfidf', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=0.98, max_features=None, min_df=3,
        ngram_range=(1, 2), norm='l2', preprocessor=None, smooth_idf=True,
...ty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))])

# Seeing how it generalizes on new data

In [14]:
gs.predict(["Trump mocks NFL for ratings drop; suggests numbers would improve if players didn't kneel"])  # Fox News headline

array(['Libertarian'], dtype='<U19')

In [15]:
gs.predict(["The Simple Solution to Inequality"])  # Jacobin headline

array(['LateStageCapitalism'], dtype='<U19')

# Baseline score

In [16]:
combined['subreddit'].value_counts()/combined.shape[0]

Libertarian            0.526845
LateStageCapitalism    0.473155
Name: subreddit, dtype: float64
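The class shares above give the baseline: a model that always predicts the majority class would be right about 52.7% of the time. The sketch below shows that calculation on hypothetical labels and then compares the two figures reported in this notebook. The `majority_baseline` helper is illustrative; `DummyClassifier(strategy='most_frequent')`, which the notebook imports, is the scikit-learn equivalent. Note that the share above is computed on the full `combined` frame, so the baseline on `y_test` alone may differ slightly.

```python
from collections import Counter


def majority_baseline(labels):
    """Return the most common label and the accuracy of always predicting it."""
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)


# Hypothetical labels with roughly the class balance shown above.
toy_labels = ["Libertarian"] * 53 + ["LateStageCapitalism"] * 47
toy_label, toy_baseline = majority_baseline(toy_labels)
print(toy_label, toy_baseline)

# Figures reported in this notebook.
reported_baseline = 0.526845               # Libertarian share of `combined`
reported_test_accuracy = 0.7387753482281043  # gs.score(X_test, y_test)
lift = reported_test_accuracy - reported_baseline
print("improvement over baseline:", round(lift, 3))
```

So while the model overfits, its test accuracy is still roughly 21 percentage points above always guessing the larger subreddit.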

In [17]:
with open('../assets/logistic_regression.pkl', 'wb') as f:
    pickle.dump(gs, f)
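Reading the model back in a later session is the mirror image of the cell above. This is a minimal round-trip sketch that uses a stand-in object and a temporary file so it runs anywhere. In practice the path would be `'../assets/logistic_regression.pkl'` and the restored object would be the fitted grid search.

```python
import os
import pickle
import tempfile

# Stand-in object; in the notebook this would be the fitted GridSearchCV `gs`.
model_stub = {"name": "stand-in for gs"}

# Hypothetical temporary path used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "model_stub.pkl")

with open(path, "wb") as f:
    pickle.dump(model_stub, f)

with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == model_stub)
```

Two caveats apply: a pickled scikit-learn model is only reliable when loaded with a compatible scikit-learn version, and `pickle.load` should only be used on files you trust.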