In [1]:
from helpers import *

import sys

import bz2
import json

import pickle

import numpy as np
# import scipy

import pandas as pd

from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.pipeline import Pipeline

from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
# from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
# from sklearn.linear_model import SGDClassifier
# from sklearn.neural_network import MLPClassifier
from sklearn.svm import LinearSVC

# from xgboost import XGBClassifier

from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score, cross_validate
from sklearn.model_selection import GridSearchCV

%load_ext autoreload
%autoreload 2

In [2]:
# Printing without truncation
# np.set_printoptions(threshold=sys.maxsize)
# pd.set_option('display.max_colwidth', None)

---

**TODO describe a bit about party labeling and how we only keep politicians!!**

----

## Predicting which political party a quote's message leans towards

Our first task is to create a model that can classify the political inclination of a single quote.

Since our classification task strongly resembles NLP sentiment analysis,
we applied a corresponding methodology.

It can be summed up in the following steps:
1. Label data (done in part2)
2. Clean quotations (augmented in part3, this part)
3. Vectorize quotations
4. Train models and select optimal model for prediction

The task is to predict whether a single quote was said by a Republican or a
Democratic politician. Reaching a high accuracy on it is very difficult,
so we had to optimize all the steps described above.

Also, as noted in the course and in various online resources, lighter data
cleaning and text preprocessing sometimes works better in NLP sentiment
analysis tasks.

We therefore had to find the optimal pipeline. This required finding the best
combination of text preprocessor/cleaner, vectorizer and ML model. Given that
this isn't an ML class, we tested a few computationally simple models
(which can also be trained in a reasonable time frame) and focused instead on
optimizing the preprocessing.

### Finding the optimal level of preprocessing

Our strategy for this task is to generate a dataset of quotes
where each quote is stored with 5 different levels of text preprocessing, ranging
from very light to very strong. Then we run cross validation with a few simple models
(to keep execution time reasonable) and aggregate a few performance metrics to
identify which level of preprocessing yields the best performance with each model.

Unlike for the final classification pipeline/model, we perform all of this only with
quotes from 2020, since that is a reasonably sized snapshot of the data for this task.

Each level of preprocessing was given a 1 letter name A,B,...,E.
Here is a description of the 5 different levels of preprocessing:
- A: Some trivial cleanup, removing digits and diacritics.
- B: All steps in A + casefolding and removing punctuation.
- C: All steps in B + removing stopwords.
- D: All steps in C + stemming. Using the snowball stemmer.
- E: All steps in C + lemmatization. Using nltk's WordNetLemmatizer.
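
To make levels A to C concrete, here is a minimal sketch using only the standard library. The function names `clean_a`, `clean_b`, `clean_c` and the tiny `STOPWORDS` set are hypothetical and only for illustration; the actual cleaning code from part2/part3 is not shown here and may differ in its details. Levels D and E would additionally pass each remaining token through a stemmer or a lemmatizer.

```python
import re
import string
import unicodedata

# Hypothetical, deliberately tiny stopword set (illustration only).
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "was", "be", "it", "s"}


def clean_a(text):
    """Level A (sketch): strip diacritics and digits."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return re.sub(r"\d+", "", no_marks)


def clean_b(text):
    """Level B (sketch): level A, then casefold and replace punctuation by spaces."""
    text = clean_a(text).casefold()
    text = re.sub(f"[{re.escape(string.punctuation)}]", " ", text)
    return " ".join(text.split())  # collapse repeated whitespace


def clean_c(text):
    """Level C (sketch): level B, then drop stopwords."""
    return " ".join(tok for tok in clean_b(text).split() if tok not in STOPWORDS)


print(clean_c("The Department of Homeland Security was livid"))
# department homeland security livid
```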

We perform the analysis in the cells below. We first prepare the data
before running our tests.

In [3]:
# Load preprocessed data containing all variants of text processing
path = fixpath(QUOTES_2020_LABELED_CLEANED_VARIANTS)

df_raw = pd.read_json(path, orient='records', lines=True)
print(df_raw.shape)
df_raw.head()

(396818, 11)


Unnamed: 0,quoteID,quotation,speaker,date,id,party_label,quotation_cleanA,quotation_cleanB,quotation_cleanC,quotation_cleanD,quotation_cleanE
0,2020-01-16-000088,[ Department of Homeland Security ] was livid ...,Sue Myrick,2020-01-16 12:00:13,Q367796,R,[ Department of Homeland Security ] was livid ...,department of homeland security was livid and ...,department homeland security livid strongly ur...,depart homeland secur livid strong urg agenda ...,department homeland security livid strongly ur...
1,2020-03-19-000276,[ These ] actions will allow households who ha...,Ben Carson,2020-03-19 19:14:00,Q816459,R,[ These ] actions will allow households who ha...,these actions will allow households who have a...,actions allow households fha insured mortgage ...,action allow household fha insur mortgag meet ...,action allow household fha insured mortgage me...
2,2020-01-22-009723,be pivotal in addressing financial frustrations,Ben Carson,2020-01-22 21:07:39,Q816459,R,be pivotal in addressing financial frustrations,be pivotal in addressing financial frustrations,pivotal addressing financial frustrations,pivot address financi frustrat,pivotal addressing financial frustration
3,2020-02-04-110477,We're talking about `Do we want to continue th...,Ben Carson,2020-02-04 23:02:36,Q816459,R,We're talking about `Do we want to continue th...,we re talking about do we want to continue the...,talking want continue lifestyle characterized ...,talk want continu lifestyl character american ...,talking want continue lifestyle characterized ...
4,2020-01-28-051506,It's not just a matter of throwing more and mo...,Ben Carson,2020-01-28 19:23:36,Q816459,R,It's not just a matter of throwing more and mo...,it s not just a matter of throwing more and mo...,matter throwing money vouchers services gettin...,matter throw money voucher servic get peopl sy...,matter throwing money voucher service getting ...


In [4]:
df = df_raw.copy()

# Dropping quotes of people in both parties (except the most popular members, who were labeled manually)
df = df[df['party_label'] != 'RD']

We drop very short quotes, since a particularly short quote carries little
information and would most likely be irrelevant in the classification
task. Concretely, we drop the shortest 10% of quotes (those below the 0.1
quantile of quote length).
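
The helper `drop_short_quotes` lives in `helpers` and is not shown in this notebook. A minimal sketch of what such a filter could look like is given below; it assumes that length is measured in characters and that quotes exactly at the cutoff are kept, neither of which is confirmed by the source. The name `drop_short_quotes_sketch` is hypothetical.

```python
import pandas as pd


def drop_short_quotes_sketch(df, threshold_quantile=0.1, quote_col_name="quotation"):
    """Keep only rows whose quote length is at or above the given length quantile.

    Assumption: length is the number of characters in `quote_col_name`.
    """
    lengths = df[quote_col_name].str.len()
    cutoff = lengths.quantile(threshold_quantile)
    return df[lengths >= cutoff]


# Toy example: ten quotes with lengths 1..10 characters.
toy = pd.DataFrame({"quotation": ["x" * n for n in range(1, 11)]})
kept = drop_short_quotes_sketch(toy, threshold_quantile=0.1)
print(len(toy), "->", len(kept))
```

With a quantile of 0.1, roughly the shortest tenth of the rows is removed, which matches the intent described above.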

In [5]:
# Dropping short quotes, i.e. quotes below the 0.1 quantile of quote length
df = drop_short_quotes(df, threshold_quantile=0.1, quote_col_name='quotation_cleanE')
df.shape

(349675, 11)

In [6]:
# Checking if dataset is balanced
df.party_label.value_counts()

D    198353
R    151322
Name: party_label, dtype: int64

In [7]:
# Data is unbalanced. Since we have a lot of data we just downsample
df = downsample(df, 'party_label')

In [8]:
# Checking if the data is well balanced now
df['party_label'].value_counts()

D    151322
R    151322
Name: party_label, dtype: int64
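
The `downsample` helper also comes from `helpers` and is not shown. One common way to balance classes, sketched here under the assumption that every class is randomly reduced to the size of the smallest one, is the following. The name `downsample_sketch` and the `random_state` parameter are hypothetical additions for illustration.

```python
import pandas as pd


def downsample_sketch(df, label_col, random_state=0):
    """Randomly reduce every class to the size of the smallest class."""
    n_min = df[label_col].value_counts().min()
    parts = [
        group.sample(n=n_min, random_state=random_state)
        for _, group in df.groupby(label_col)
    ]
    return pd.concat(parts)


# Toy example: 5 'D' rows and 3 'R' rows become 3 of each.
toy = pd.DataFrame({"party_label": list("DDDDDRRR"), "quotation": list("abcdefgh")})
balanced = downsample_sketch(toy, "party_label")
print(balanced["party_label"].value_counts().to_dict())
```

Fixing `random_state` makes the balanced subset reproducible between runs, which helps when comparing preprocessing variants.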

In [9]:
# Several differently sized versions of the data for convenience, since some
# models we test take long to train. Our final choice of the best level of
# preprocessing will be made on the full data frame (~300k quotes from 2020)
# that we generated above.

df_micro = df.sample(1000)
df_mini = df.sample(10000)
df = df.sample(frac=1)

In [10]:
def test_classifer(df, pipeline, break_after_one_iter=False):
    """
    Cross-validate the given pipeline on every preprocessing variant of the
    quotes (columns quotation_cleanA ... quotation_cleanE) and print the
    aggregated metrics for each variant.
    """
    
    cols = [
        'quotation_cleanA',
        'quotation_cleanB',
        'quotation_cleanC',
        'quotation_cleanD',
        'quotation_cleanE',
    ]

    for col in cols:
        
        # Get quotation preprocessing variant
        X = df[col].values

        # Get label and convert to useful format
        y = df['party_label'].values
        y = convert_labels(y)

        # Run cross validation with different metrics
        # scoring=['accuracy', 'precision', 'recall', 'f1']
        scoring=['accuracy', 'f1']
        res = cross_validate(pipeline, X, y, scoring=scoring, cv=3)
        res.pop('score_time')

        # Print results
        print(f'Col: {col}')
        print_cross_validate_results(res)
        
        if break_after_one_iter:
            break
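
`cross_validate` returns a dictionary that maps names such as `fit_time` or `test_accuracy` to one value per fold. The helper `print_cross_validate_results` is not shown in this notebook; a minimal sketch that would produce lines in the same shape as the output below is given here. The name `format_cv_results` is hypothetical, and the use of the population standard deviation is an assumption.

```python
from statistics import fmean, pstdev


def format_cv_results(results):
    """Turn a {metric_name: per-fold values} mapping into printable summary lines."""
    lines = []
    for name, values in results.items():
        values = [float(v) for v in values]
        avg = fmean(values)
        std = pstdev(values)  # assumption: population std over the folds
        lines.append(f"\t{name:<20} - \tavg: {avg:.3f}\tstd: {std:.3f}")
    return lines


# Toy per-fold scores (made up for illustration, not results from this notebook).
for line in format_cv_results({"test_accuracy": [0.5, 0.7], "test_f1": [0.6, 0.6]}):
    print(line)
```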

On each level of preprocessed text we run cross validation with 4 different ML models:
Multinomial Naive Bayes, Logistic Regression, Gradient Boosted Trees and a linear SVM (LinearSVC).

We only use TF-IDF vectorization but test different levels of n-gram expansion.
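
To illustrate what n-gram expansion does to the feature space, the sketch below fits `TfidfVectorizer` on two toy sentences (made up for this example) and lists the resulting vocabulary for several `ngram_range` values. The helper name `vocab_for` is hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Two toy documents, invented for illustration.
DOCS = ["tax cuts help", "tax cuts hurt"]


def vocab_for(ngram_range):
    """Return the sorted vocabulary learned for the given n-gram range."""
    vect = TfidfVectorizer(ngram_range=ngram_range)
    vect.fit(DOCS)
    return sorted(vect.vocabulary_)


for ngram_range in [(1, 1), (1, 2), (1, 3)]:
    vocab = vocab_for(ngram_range)
    print(ngram_range, len(vocab), vocab)
```

The vocabulary grows quickly as longer n-grams are added, which is consistent with the longer fit times observed in the bigram and trigram runs.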

In [11]:
pipeline = Pipeline([
            ('vect', TfidfVectorizer()),
            ('clf', MultinomialNB()),
        ])

test_classifer(df, pipeline)

Col: quotation_cleanA
	fit_time             - 	avg: 5.078	std: 0.628
	test_accuracy        - 	avg: 0.690	std: 0.001
	test_f1              - 	avg: 0.694	std: 0.001
Col: quotation_cleanB
	fit_time             - 	avg: 4.875	std: 0.612
	test_accuracy        - 	avg: 0.690	std: 0.001
	test_f1              - 	avg: 0.694	std: 0.001
Col: quotation_cleanC
	fit_time             - 	avg: 2.725	std: 0.088
	test_accuracy        - 	avg: 0.689	std: 0.001
	test_f1              - 	avg: 0.692	std: 0.001
Col: quotation_cleanD
	fit_time             - 	avg: 2.699	std: 0.092
	test_accuracy        - 	avg: 0.680	std: 0.001
	test_f1              - 	avg: 0.683	std: 0.000
Col: quotation_cleanE
	fit_time             - 	avg: 2.971	std: 0.288
	test_accuracy        - 	avg: 0.686	std: 0.001
	test_f1              - 	avg: 0.689	std: 0.001


In [32]:
pipeline = Pipeline([
            ('vect', TfidfVectorizer()),
            ('clf', LogisticRegression(max_iter=1000)),
        ])

# test_classifer(df, pipeline)

Col: quotation_cleanA
	fit_time             - 	avg: 10.666	std: 0.746
	test_accuracy        - 	avg: 0.670	std: 0.001
	test_f1              - 	avg: 0.666	std: 0.000
Col: quotation_cleanB
	fit_time             - 	avg: 11.970	std: 2.888
	test_accuracy        - 	avg: 0.670	std: 0.001
	test_f1              - 	avg: 0.666	std: 0.000
Col: quotation_cleanC
	fit_time             - 	avg: 7.622	std: 2.836
	test_accuracy        - 	avg: 0.669	std: 0.001
	test_f1              - 	avg: 0.665	std: 0.000
Col: quotation_cleanD
	fit_time             - 	avg: 4.686	std: 0.382
	test_accuracy        - 	avg: 0.664	std: 0.001
	test_f1              - 	avg: 0.659	std: 0.000
Col: quotation_cleanE
	fit_time             - 	avg: 5.684	std: 0.295
	test_accuracy        - 	avg: 0.668	std: 0.001
	test_f1              - 	avg: 0.663	std: 0.000


In [40]:
pipeline = Pipeline([
            ('vect', TfidfVectorizer()),
            ('clf', GradientBoostingClassifier()),
        ])

# Takes very long to run; to be done at a later time
# test_classifer(df, pipeline)

Col: quotation_cleanA
	fit_time             - 	avg: 157.447	std: 2.448
	test_accuracy        - 	avg: 0.589	std: 0.001
	test_f1              - 	avg: 0.494	std: 0.003


KeyboardInterrupt: 

In [45]:
pipeline = Pipeline([
            ('vect', TfidfVectorizer()),
            ('clf', LinearSVC()),
        ])

test_classifer(df, pipeline)

Col: quotation_cleanA
	fit_time             - 	avg: 6.733	std: 0.309
	test_accuracy        - 	avg: 0.674	std: 0.001
	test_f1              - 	avg: 0.672	std: 0.001
Col: quotation_cleanB
	fit_time             - 	avg: 5.830	std: 0.211
	test_accuracy        - 	avg: 0.674	std: 0.001
	test_f1              - 	avg: 0.672	std: 0.001
Col: quotation_cleanC
	fit_time             - 	avg: 4.404	std: 0.248
	test_accuracy        - 	avg: 0.672	std: 0.001
	test_f1              - 	avg: 0.670	std: 0.000
Col: quotation_cleanD
	fit_time             - 	avg: 4.410	std: 0.185
	test_accuracy        - 	avg: 0.666	std: 0.001
	test_f1              - 	avg: 0.664	std: 0.001
Col: quotation_cleanE
	fit_time             - 	avg: 4.574	std: 0.207
	test_accuracy        - 	avg: 0.669	std: 0.001
	test_f1              - 	avg: 0.667	std: 0.000


Now we'll try using unigrams and bigrams!

In [29]:
pipeline = Pipeline([
            ('vect', TfidfVectorizer(ngram_range=(1,2))),
            ('clf', MultinomialNB()),
        ])

test_classifer(df, pipeline)

Col: quotation_cleanA
	fit_time             - 	avg: 17.314	std: 0.677
	test_accuracy        - 	avg: 0.733	std: 0.001
	test_f1              - 	avg: 0.741	std: 0.000
Col: quotation_cleanB
	fit_time             - 	avg: 14.906	std: 0.187
	test_accuracy        - 	avg: 0.733	std: 0.001
	test_f1              - 	avg: 0.740	std: 0.000
Col: quotation_cleanC
	fit_time             - 	avg: 11.646	std: 0.479
	test_accuracy        - 	avg: 0.743	std: 0.002
	test_f1              - 	avg: 0.747	std: 0.002
Col: quotation_cleanD
	fit_time             - 	avg: 10.702	std: 0.008
	test_accuracy        - 	avg: 0.739	std: 0.001
	test_f1              - 	avg: 0.743	std: 0.001
Col: quotation_cleanE
	fit_time             - 	avg: 11.538	std: 0.058
	test_accuracy        - 	avg: 0.742	std: 0.002
	test_f1              - 	avg: 0.745	std: 0.002


In [42]:
pipeline = Pipeline([
            ('vect', TfidfVectorizer(ngram_range=(1,2))),
            ('clf', LogisticRegression(max_iter=1000)),
        ])

# test_classifer(df, pipeline)

Col: quotation_cleanA
	fit_time             - 	avg: 28.409	std: 1.658
	test_accuracy        - 	avg: 0.695	std: 0.002
	test_f1              - 	avg: 0.690	std: 0.002
Col: quotation_cleanB
	fit_time             - 	avg: 29.686	std: 5.530
	test_accuracy        - 	avg: 0.695	std: 0.002
	test_f1              - 	avg: 0.690	std: 0.002
Col: quotation_cleanC
	fit_time             - 	avg: 40.314	std: 4.547
	test_accuracy        - 	avg: 0.700	std: 0.002
	test_f1              - 	avg: 0.694	std: 0.001
Col: quotation_cleanD
	fit_time             - 	avg: 36.972	std: 4.372
	test_accuracy        - 	avg: 0.699	std: 0.002
	test_f1              - 	avg: 0.693	std: 0.002
Col: quotation_cleanE
	fit_time             - 	avg: 43.261	std: 2.992
	test_accuracy        - 	avg: 0.700	std: 0.003
	test_f1              - 	avg: 0.695	std: 0.002


In [46]:
pipeline = Pipeline([
            ('vect', TfidfVectorizer(ngram_range=(1,2))),
            ('clf', LinearSVC()),
        ])

# test_classifer(df, pipeline)

Col: quotation_cleanA
	fit_time             - 	avg: 18.149	std: 1.085
	test_accuracy        - 	avg: 0.708	std: 0.002
	test_f1              - 	avg: 0.706	std: 0.002
Col: quotation_cleanB
	fit_time             - 	avg: 17.297	std: 1.203
	test_accuracy        - 	avg: 0.708	std: 0.002
	test_f1              - 	avg: 0.706	std: 0.002
Col: quotation_cleanC
	fit_time             - 	avg: 12.830	std: 0.049
	test_accuracy        - 	avg: 0.718	std: 0.003
	test_f1              - 	avg: 0.716	std: 0.002
Col: quotation_cleanD
	fit_time             - 	avg: 10.893	std: 0.191
	test_accuracy        - 	avg: 0.714	std: 0.003
	test_f1              - 	avg: 0.712	std: 0.002
Col: quotation_cleanE
	fit_time             - 	avg: 11.049	std: 0.225
	test_accuracy        - 	avg: 0.717	std: 0.003
	test_f1              - 	avg: 0.714	std: 0.003


Once again the best performing classifier is Multinomial Naive Bayes; it is also
among the fastest to train.

How about adding trigrams too! We don't train all 3 models, as training times
get out of hand.

In [30]:
pipeline = Pipeline([
            ('vect', TfidfVectorizer(ngram_range=(1,3))),
            ('clf', MultinomialNB()),
        ])

test_classifer(df, pipeline)

Col: quotation_cleanA
	fit_time             - 	avg: 34.931	std: 0.698
	test_accuracy        - 	avg: 0.743	std: 0.001
	test_f1              - 	avg: 0.751	std: 0.000
Col: quotation_cleanB
	fit_time             - 	avg: 36.490	std: 1.254
	test_accuracy        - 	avg: 0.743	std: 0.001
	test_f1              - 	avg: 0.751	std: 0.000
Col: quotation_cleanC
	fit_time             - 	avg: 23.508	std: 0.794
	test_accuracy        - 	avg: 0.751	std: 0.002
	test_f1              - 	avg: 0.754	std: 0.002
Col: quotation_cleanD
	fit_time             - 	avg: 23.414	std: 1.621
	test_accuracy        - 	avg: 0.749	std: 0.002
	test_f1              - 	avg: 0.753	std: 0.002
Col: quotation_cleanE
	fit_time             - 	avg: 22.466	std: 1.378
	test_accuracy        - 	avg: 0.750	std: 0.002
	test_f1              - 	avg: 0.753	std: 0.002


In [47]:
pipeline = Pipeline([
            ('vect', TfidfVectorizer(ngram_range=(1,3))),
            ('clf', LinearSVC()),
        ])

# test_classifer(df, pipeline)

Col: quotation_cleanA
	fit_time             - 	avg: 37.191	std: 1.032
	test_accuracy        - 	avg: 0.718	std: 0.001
	test_f1              - 	avg: 0.715	std: 0.001
Col: quotation_cleanB
	fit_time             - 	avg: 32.412	std: 0.392
	test_accuracy        - 	avg: 0.719	std: 0.001
	test_f1              - 	avg: 0.715	std: 0.001
Col: quotation_cleanC
	fit_time             - 	avg: 21.432	std: 0.398
	test_accuracy        - 	avg: 0.724	std: 0.002
	test_f1              - 	avg: 0.721	std: 0.002
Col: quotation_cleanD
	fit_time             - 	avg: 20.245	std: 0.462
	test_accuracy        - 	avg: 0.723	std: 0.002
	test_f1              - 	avg: 0.720	std: 0.002
Col: quotation_cleanE
	fit_time             - 	avg: 21.807	std: 1.688
	test_accuracy        - 	avg: 0.724	std: 0.002
	test_f1              - 	avg: 0.720	std: 0.002


Here we achieve our best accuracy of 75.1% with MultinomialNB (variant C), while LinearSVC peaks at 72.4%. Even though this is a rather low score, we found it acceptable given the difficulty of the task. After all, predicting the political party of a speaker based on one quote alone is quite a feat. We assume this could be further improved by using word2vec or BERT, but that would be beyond the scope of this project.

Later, we will also aggregate our predictions to predict the political affiliation of a speaker based on all the quotes that are attributed to them and as such we can expect a better performance there!
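
The aggregation method itself is not described in this section. One simple option, shown here purely as an assumption, is a majority vote over the per-quote predictions of each speaker. The function name `aggregate_by_speaker` is hypothetical.

```python
from collections import Counter, defaultdict


def aggregate_by_speaker(speaker_ids, quote_predictions):
    """Majority vote: assign each speaker the label predicted most often for their quotes.

    Ties go to the label that was seen first for that speaker.
    """
    votes = defaultdict(Counter)
    for speaker, label in zip(speaker_ids, quote_predictions):
        votes[speaker][label] += 1
    return {speaker: counts.most_common(1)[0][0] for speaker, counts in votes.items()}


# Toy example with made-up per-quote predictions.
speakers = ["Q1", "Q1", "Q1", "Q2"]
predictions = ["R", "D", "R", "D"]
print(aggregate_by_speaker(speakers, predictions))
# {'Q1': 'R', 'Q2': 'D'}
```

Because individual errors tend to be outvoted when a speaker has many quotes, a per-speaker vote can be more accurate than the per-quote classifier it is built on.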

In [None]:
pipeline = Pipeline([
            ('vect', TfidfVectorizer(ngram_range=(1,4))),
            ('clf', LinearSVC()),
        ])

# test_classifer(df, pipeline)

In [None]:
pipeline = Pipeline([
            ('vect', TfidfVectorizer(ngram_range=(2,2))),
            ('clf', MultinomialNB()),
        ])

test_classifer(df, pipeline)