# Logistic Regression
My Regression model has a high accuracy score but it also indicates some oversfitting. I spent some time tuning various parameters with only marginal gains in accuracy or reduction in overfitting. 
Initailly Logisitc regression seemed like a natrual choice for this project. It's a little more straightforward than a decsion tree based model and the output end's up being a pretty interpretable. It gives you the ability to see the probability of each prediction. I liked how quickly this model fit given the large amount of data present. In general Logistic Regression is a low variance, high bias model and it makes the assumption that the log odds of the probability of an event are a linear combination of independent or predictor variables.
My results mirrored the basic assumption about LR. It was a very low variance model with a high amount of bias.

In [60]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.grid_search import GridSearchCV
from sklearn.metrics import confusion_matrix
import re
import pickle
import matplotlib.pyplot as plt

# Loading  my data

In [61]:
combined = pd.read_pickle('../assets/combined.pkl')

In [3]:
combined.columns

Index(['author', 'author_cakeday', 'author_flair_background_color',
       'author_flair_css_class', 'author_flair_richtext',
       'author_flair_template_id', 'author_flair_text',
       'author_flair_text_color', 'author_flair_type', 'author_fullname',
       'author_id', 'body', 'created_utc', 'distinguished', 'edited', 'id',
       'link_id', 'no_follow', 'parent_id', 'permalink', 'retrieved_on',
       'rte_mode', 'score', 'send_replies', 'stickied', 'subreddit',
       'subreddit_id'],
      dtype='object')

# Loading my assets

In [62]:
X_train = pd.read_pickle('../assets/X_train.pkl')
X_test = pd.read_pickle('../assets/X_test.pkl')
y_train = pd.read_pickle('../assets/y_train.pkl')
y_test = pd.read_pickle('../assets/y_test.pkl')

# Setting up the Pipeline
This included a tfidf vectorizer which was chosen to help manage the sheer volume of data I was working with. I like that tfidf helps scale the data a bit given that it's not just a direct count of word occurence. That said tfidf vectorizing is still a bag of words model so it disregards grammar and sentiment.

In [5]:
pipe = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english',ngram_range=(1,2))),
    ('lr',LogisticRegression(solver='liblinear')),
    
])


# Setting up the parameter grid

In [6]:
param_grid =  {
    'tfidf__min_df': np.arange(1,5,2),
    'tfidf__max_df': [.10, .98, 1.0],
    'lr__C': np.linspace(0.1,.9,3)
    
    
}

In [7]:
gs = GridSearchCV(pipe, param_grid=param_grid,verbose=1)

# Fitting the model

In [8]:
gs.fit(X_train,y_train)

Fitting 3 folds for each of 18 candidates, totalling 54 fits


[Parallel(n_jobs=1)]: Done  54 out of  54 | elapsed: 24.7min finished


GridSearchCV(cv=None, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('tfidf', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 2), norm='l2', preprocessor=None, smooth_idf=True,
 ...ty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))]),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'tfidf__min_df': array([1, 3]), 'tfidf__max_df': [0.1, 0.98, 1.0], 'lr__C': array([0.1, 0.5, 0.9])},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=1)

# Scoring the model
This score indicated to me that the model overfit the training data

In [18]:
gs.score(X_train,y_train)

0.8130511946459845

In [19]:
gs.score(X_test,y_test)

0.7387753482281043

In [12]:
gs.best_score_

0.7318809419564674

In [13]:
gs.best_estimator_

Pipeline(memory=None,
     steps=[('tfidf', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=0.98, max_features=None, min_df=3,
        ngram_range=(1, 2), norm='l2', preprocessor=None, smooth_idf=True,
...ty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))])

# Seeing how it generalizes on new data

In [14]:
gs.predict(["Trump mocks NFL for ratings drop; suggests numbers would improve if players didn't kneel"]) #Fox news Headline

array(['Libertarian'], dtype='<U19')

In [15]:
gs.predict(["The Simple Solution to Inequality"])#Jacobin Headline

array(['LateStageCapitalism'], dtype='<U19')

# Baseline score

In [16]:
combined['subreddit'].value_counts()/combined.shape[0]

Libertarian            0.526845
LateStageCapitalism    0.473155
Name: subreddit, dtype: float64

In [17]:
with open('../assets/logistic_regression.pkl','wb+') as f:
    pickle.dump(gs,f)

In [63]:
 gs = pickle.load(open( "../assets/logistic_regression.pkl", "rb" ))

In [64]:
X_train_preds = gs.predict(X_train)

# Making a predictions dataframe
Many of my models achieved very similar accuracy scores. Interestingly enough many of the same posts showed up as important predictors across multiple models.

In [7]:
preds = pd.DataFrame({
    "preds":X_train_preds,
    "features":X_train,
    "truth":y_train
})

In [57]:
preds.head(10)

Unnamed: 0,preds,features,truth
30147,LateStageCapitalism,all white meat. babies of colour are kept in a...,LateStageCapitalism
57685,Libertarian,it was illegal for people of other religions t...,Libertarian
29468,Libertarian,the usa is already spending way more per capit...,LateStageCapitalism
110065,LateStageCapitalism,"overall, yes. the genocide of the natives was ...",LateStageCapitalism
120518,Libertarian,&; this is basic burden of proof lol ...,Libertarian
103194,LateStageCapitalism,"well, i work at amazon and love everything abo...",LateStageCapitalism
165256,Libertarian,property is who is in charge now?,Libertarian
92019,Libertarian,no they haven't. the number of actual nazis is...,Libertarian
108695,LateStageCapitalism,"so just...fuck disabled people, i guess?",LateStageCapitalism
164912,LateStageCapitalism,i don't know.... shared services sounds alot l...,Libertarian


# Exploring  posts
With more compute power I would love to try and genralize this idea by feeding it posts from news sources like breitbart and slate. Paring that with sentiment analysis to build a model that may be able accurately predict political affiliation or bias in the writings based on the style and keyword density. A potential difficulty with that is being able have a model genralize well in a situation where one side maybe referencing the other thereby using keywords that might trigger a false positive. At the very least if the model decent enough accuracy it could be used to flag an article for a human to review. 

In [43]:
preds.iloc[29468]

preds                                     LateStageCapitalism
features    countries without profit-driven, capitalistic ...
truth                                             Libertarian
Name: 63290, dtype: object

# This post incorrectly predicted my 1 class or r/Libertarian
It looks to me based on my eda that this is the case because of the density of the word America in the sentence. An interesting thing to note is that although my post is misclassifed. The content of the post seems to be from a user who holds values that would fall inline with libertarian philosophy responding to another user who may have made some sort of critique on market driven economies when in relation to healthcare. So in this case, through the keyword density, this model was able to correctly predict in a crude wat the underlying philosophy associated with the post. This demonstrates to me that through some more fine tuning of the model and with a slighty modified mission. This model could be used to understand or predict where someone falls on the political specturm based on the language that they use. That type of technology might be extended to provide a sort of bias score for a given body of text. One idea I would also love to try is extending this by taking articles written on the same subject by multiple news outlets and examning the similarities and differences between the language used.  From there one might be able to create a score that indicates the potential level of bias in the writing based on the density of certain keywords. A step further would be having the score indicate from which side of the aisle the bias in coming from. This could be of enourmous use in helping slow down the trend of fake news.

In [59]:
preds.iloc[29468]

preds                                     LateStageCapitalism
features    countries without profit-driven, capitalistic ...
truth                                             Libertarian
Name: 63290, dtype: object

In [44]:
preds['features'].iloc[29468]

'countries without profit-driven, capitalistic economies haven’t produced the medical advancements that everyone gets to use today.  antibiotics?  america.  chemotherapy?  america.  almost every modern drug?  america.  most modern surgical procedures? america.  it’s not as much a “blanket statement” as it is a “historical statement”.'

# True Negative
Here the model identified a true negative again certain keywords that showed up as important in making a distinction between LSC and LIB are present: Jobs,poverty,wages,. Again the key thing to note is the density of the words in each excerpt.

In [58]:
preds.iloc[195414]

preds                                     LateStageCapitalism
features    i'm not exactly sure what you're arguing about...
truth                                     LateStageCapitalism
Name: 24594, dtype: object

In [48]:
preds['features'].iloc[195414]

"i'm not exactly sure what you're arguing about? i'm saying they need to be paid more and you're saying there is no conspiracy... i never said there was.   people need jobs and they're paying poverty wages for a position that can mean life or death. if you have a certificate that says you can make $ versus your state minimum wage which is probably lower you'll probably take it.   i feel like you're just being argumentative to be that way, or you've never been in a position where you just had to take a job and couldn't be picky."

In [50]:
preds['features'].iloc[42368]


'sometimes i try to look up ways to save money/cut my budget and the advice is always things i already do. i already spend less than the / of income on housing rule, and my grocery budget is way lower than these advice columns recommend. there’s really not much left to cut 😞'

In [17]:
mis_pred = preds[(preds['preds'] != preds['truth'])]

In [56]:
mis_pred.head(10)

Unnamed: 0,preds,features,truth
29468,Libertarian,the usa is already spending way more per capit...,LateStageCapitalism
164912,LateStageCapitalism,i don't know.... shared services sounds alot l...,Libertarian
194812,Libertarian,tribalism.,LateStageCapitalism
178569,LateStageCapitalism,cheers!,Libertarian
124230,LateStageCapitalism,g,Libertarian
156197,LateStageCapitalism,you incorrectly used a word and moved on cool ...,Libertarian
148420,LateStageCapitalism,"nah, i don't want that. i'm pissed when a ven...",Libertarian
59412,LateStageCapitalism,"""you're shit"" ""fuck you"". ?",Libertarian
195414,Libertarian,bro there’s a difference between model and cas...,LateStageCapitalism
183841,Libertarian,"you would be an expert on shitposting, so i wi...",LateStageCapitalism


# Only around 18% percent of my posts are being misclassified.


In [54]:
len(mis_pred)/len(preds)

0.1869488053540155

In [88]:
gs.predict_proba(X_train)

array([[0.79170234, 0.20829766],
       [0.27292397, 0.72707603],
       [0.41852437, 0.58147563],
       ...,
       [0.40897996, 0.59102004],
       [0.60113064, 0.39886936],
       [0.1994428 , 0.8005572 ]])