In [1]:
from helpers import *

import sys

import bz2
import json

import pickle

import numpy as np
# import scipy

import pandas as pd

from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.pipeline import Pipeline

from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
# from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
# from sklearn.linear_model import SGDClassifier
# from sklearn.neural_network import MLPClassifier
from sklearn.svm import LinearSVC

# from xgboost import XGBClassifier

from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score, cross_validate
from sklearn.model_selection import GridSearchCV

%load_ext autoreload
%autoreload 2

In [2]:
# Printing without truncation
# np.set_printoptions(threshold=sys.maxsize)
# pd.set_option('display.max_colwidth', None)

---

## Improvements to preprocessing pipeline since part2

We implemented multiple improvements to our preprocessing pipeline since the previous milestone in order to improve classification results. Here's a list of issues we noticed and improvements to address them:

1. To select relevant quotes, we used the occupation attribute to keep only speakers who are politicians, since quotes by actors, for example, would have little relevance to our study.
2. Several different people sometimes share the same name. For example, Wikidata lists both Donald Trump the politician we all know and love and another Donald Trump (a physician). Here we can assume that all quotes with speaker Donald Trump correspond to the politician, but other cases are less clear-cut. The name Tim Cahill is shared by an American football player, an American politician, a screenwriter and more. Most quotes here come from the American football player, but it is very complicated to identify which quotes correspond to which Tim Cahill. Since we have a vast amount of data, we decided to manually handle a few cases that correspond to famous politicians such as Donald Trump; for other cases such as Tim Cahill, we discard all quotes related to that name.
3. Several speakers, such as Hillary Clinton and Donald Trump, have been members of both the Democratic and Republican parties over their lives. At first we discarded all such cases, but since Trump and Clinton are famous politicians we manually assigned them the label of the party they are most closely associated with (Democrats for Clinton and Republicans for Trump). We apply the same method to a few other politicians such as Michael Bloomberg.

----

## Predicting which political party a quote's message leans towards

Our first task is to create a model that can classify the political inclinations of a single quote.

Since our classification task strongly resembles NLP sentiment analysis,
we applied a corresponding methodology.

It can be summed up by the following steps:
1. Label data (done in part2)
2. Clean quotations (augmented in part3)
3. Vectorize quotations
4. Train models and select optimal model for prediction

The task is to predict whether a single quote was said by a Republican or a
Democratic politician. Reaching a high accuracy on such a task is very
difficult, so we had to optimize all the steps described above.

Also, as noted in the course and in various online resources, it is sometimes
better to apply less data cleaning and text preprocessing in
NLP sentiment analysis tasks.

We therefore had to find the optimal pipeline. This required finding the best
combination of text preprocessor/cleaner, vectorizer and ML model. Given that
this isn't an ML class, we tested a few computationally simple models
(which can also be trained in a reasonable time frame) and focused instead on
optimizing the preprocessing.

### Finding the optimal level of preprocessing

Our strategy for this task is to generate a dataset of quotes
where each quote is stored at 5 different levels of text preprocessing, ranging
from very light to very strong. We then run cross validation with a few simple models
(to keep execution time reasonable) and aggregate a few performance metrics to
identify which level of preprocessing yields the best performance with each model.

Unlike for the final classification pipeline/model, we perform all of this only on
quotes from 2020, since it is a reasonably sized snapshot of the data for this task.

Each level of preprocessing was given a 1 letter name A,B,...,E.
Here is a description of the 5 different levels of preprocessing:
- A: Some trivial cleanup, removing digits and diacritics.
- B: All steps in A + casefolding and removing punctuation.
- C: All steps in B + removing stopwords.
- D: All steps in C + stemming. Using the snowball stemmer.
- E: All steps in C + lemmatization. Using nltk's WordNetLemmatizer.
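
The first three levels can be sketched with the standard library alone. This is a minimal illustration, not our actual cleaning code: the function names `clean_a`, `clean_b`, `clean_c` and the tiny `STOPWORDS` set are hypothetical. Levels D and E would apply a Snowball stemmer or nltk's WordNetLemmatizer to the output of level C and are left out to keep the sketch free of external dependencies.

```python
import re
import string
import unicodedata

# Hypothetical, deliberately tiny stopword list (the real list is much longer).
STOPWORDS = {"a", "and", "be", "in", "of", "re", "the", "to", "was", "we"}


def clean_a(text):
    """Level A: strip diacritics and digits, keep case and punctuation."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return re.sub(r"\d+", "", no_marks)


def clean_b(text):
    """Level B: level A, then casefold and replace punctuation with spaces."""
    lowered = clean_a(text).casefold()
    no_punct = re.sub(f"[{re.escape(string.punctuation)}]", " ", lowered)
    return " ".join(no_punct.split())  # collapse repeated whitespace


def clean_c(text):
    """Level C: level B, then drop stopwords."""
    return " ".join(tok for tok in clean_b(text).split() if tok not in STOPWORDS)


quote = "We're told the café was 100% full!"
print(clean_a(quote))
print(clean_b(quote))
print(clean_c(quote))
```

Each level builds on the previous one, which is why the variants in the dataset get progressively shorter from A to C.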

We perform the analysis in the cells below. We first prepare the data
before running our tests.

In [3]:
# Load preprocessed data containing all variants of text processing
path = fixpath(QUOTES_2020_LABELED_CLEANED_VARIANTS)

df_raw = pd.read_json(path, orient='records', lines=True)
print(df_raw.shape)
df_raw.head()

(396818, 11)


Unnamed: 0,quoteID,quotation,speaker,date,id,party_label,quotation_cleanA,quotation_cleanB,quotation_cleanC,quotation_cleanD,quotation_cleanE
0,2020-01-16-000088,[ Department of Homeland Security ] was livid ...,Sue Myrick,2020-01-16 12:00:13,Q367796,R,[ Department of Homeland Security ] was livid ...,department of homeland security was livid and ...,department homeland security livid strongly ur...,depart homeland secur livid strong urg agenda ...,department homeland security livid strongly ur...
1,2020-03-19-000276,[ These ] actions will allow households who ha...,Ben Carson,2020-03-19 19:14:00,Q816459,R,[ These ] actions will allow households who ha...,these actions will allow households who have a...,actions allow households fha insured mortgage ...,action allow household fha insur mortgag meet ...,action allow household fha insured mortgage me...
2,2020-01-22-009723,be pivotal in addressing financial frustrations,Ben Carson,2020-01-22 21:07:39,Q816459,R,be pivotal in addressing financial frustrations,be pivotal in addressing financial frustrations,pivotal addressing financial frustrations,pivot address financi frustrat,pivotal addressing financial frustration
3,2020-02-04-110477,We're talking about `Do we want to continue th...,Ben Carson,2020-02-04 23:02:36,Q816459,R,We're talking about `Do we want to continue th...,we re talking about do we want to continue the...,talking want continue lifestyle characterized ...,talk want continu lifestyl character american ...,talking want continue lifestyle characterized ...
4,2020-01-28-051506,It's not just a matter of throwing more and mo...,Ben Carson,2020-01-28 19:23:36,Q816459,R,It's not just a matter of throwing more and mo...,it s not just a matter of throwing more and mo...,matter throwing money vouchers services gettin...,matter throw money voucher servic get peopl sy...,matter throwing money voucher service getting ...


In [4]:
df = df_raw.copy()

# Dropping quotes of people in both parties (except the most popular members, who were labeled manually)
df = df[df['party_label'] != 'RD']

We drop very short quotes, since a particularly short quote carries little
information and would most likely be irrelevant to the classification
task. Concretely, we drop the shortest 10% of quotes, i.e. those whose length
falls below the 10th percentile (`threshold_quantile=0.1`).
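
The `drop_short_quotes` helper lives in `helpers` and is not shown in this notebook. A minimal sketch of the idea, assuming length is measured in characters and that quotes exactly at the cutoff are kept (both are assumptions, as is the name `drop_short_quotes_sketch`), could look like this:

```python
import pandas as pd


def drop_short_quotes_sketch(df, threshold_quantile=0.1, quote_col_name="quotation"):
    """Keep rows whose quote length is at or above the given length quantile."""
    lengths = df[quote_col_name].str.len()         # length in characters (assumption)
    cutoff = lengths.quantile(threshold_quantile)  # e.g. the 10th percentile
    return df[lengths >= cutoff]


# Toy data, for illustration only.
toy = pd.DataFrame({"quotation": ["no", "yes we can", "a much longer quote here",
                                  "short", "tax cuts for everyone"]})
kept = drop_short_quotes_sketch(toy, threshold_quantile=0.2)
print(kept["quotation"].tolist())
```

Using a quantile rather than a fixed length means the cutoff adapts to whichever preprocessing variant is passed in `quote_col_name`.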

In [5]:
# Dropping short quotes: the shortest 10% of quotes are removed
df = drop_short_quotes(df, threshold_quantile=0.1, quote_col_name='quotation_cleanE')
df.shape

(349675, 11)

In [6]:
# Checking if dataset is balanced
df.party_label.value_counts()

D    198353
R    151322
Name: party_label, dtype: int64

In [7]:
# Data is unbalanced. Since we have a lot of data we just downsample
df = downsample(df, 'party_label')

In [8]:
# Checking if the data is well balanced now
df['party_label'].value_counts()

D    151322
R    151322
Name: party_label, dtype: int64
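
The `downsample` helper also comes from `helpers`. One plausible way to implement it, shown here only as a sketch with a hypothetical name and a hypothetical `random_state` parameter, is to randomly shrink every class to the size of the smallest one:

```python
import pandas as pd


def downsample_sketch(df, label_col, random_state=0):
    """Randomly shrink every class to the size of the smallest class."""
    smallest = df[label_col].value_counts().min()
    parts = [
        group.sample(n=smallest, random_state=random_state)
        for _, group in df.groupby(label_col)
    ]
    return pd.concat(parts)


# Toy data, for illustration only: 5 "D" rows and 3 "R" rows.
toy = pd.DataFrame({"party_label": ["D"] * 5 + ["R"] * 3, "quote_id": range(8)})
balanced = downsample_sketch(toy, "party_label")
print(balanced["party_label"].value_counts().to_dict())
```

Downsampling discards data from the majority class, which is acceptable here because the dataset is large.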

In [9]:
# Several differently sized versions of the data for convenience, since some
# models we test take long to train. Our final prediction for best level of
# preprocessing will be done on the full data frame (~300k quotes from 2020)
# that we generated above.

df_micro = df.sample(1000)
df_mini = df.sample(10000)
df = df.sample(frac=1)

### Cross validation

Now we'll run cross validation on all combinations of preprocessing variant and model to find the best model and level of preprocessing. Here's a function for convenience and cleanliness.

In [10]:
def test_classifer(df, pipeline, break_after_one_iter=False):
    """
    Cross-validate the given pipeline on every preprocessed variant of the
    quotes and print the results for each variant.
    """
    
    cols = [
        'quotation_cleanA',
        'quotation_cleanB',
        'quotation_cleanC',
        'quotation_cleanD',
        'quotation_cleanE',
    ]

    for col in cols:
        
        # Get quotation preprocessing variant
        X = df[col].values

        # Get label and convert to useful format
        y = df['party_label'].values
        y = convert_labels(y)

        # Run cross validation with different metrics
        # scoring=['accuracy', 'precision', 'recall', 'f1']
        scoring=['accuracy', 'f1']
        res = cross_validate(pipeline, X, y, scoring=scoring, cv=3)
        res.pop('score_time')

        # Print results
        print(f'Col: {col}')
        print_cross_validate_results(res)
        
        if break_after_one_iter:
            break
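
The function above relies on `print_cross_validate_results` from `helpers`, which is not shown. As a rough sketch of what such a helper might do (the name `format_cv_results` and the use of the population standard deviation are assumptions), each metric returned by `cross_validate` is reduced to its mean and standard deviation across folds:

```python
from statistics import mean, pstdev


def format_cv_results(results):
    """Return one 'avg/std' summary line per metric in a cross_validate-style dict."""
    lines = []
    for name, values in results.items():
        values = [float(v) for v in values]
        lines.append(
            f"\t{name:<20} - \tavg: {mean(values):.3f}\tstd: {pstdev(values):.3f}"
        )
    return lines


# Made-up fold scores, for illustration only.
fake = {"test_accuracy": [0.70, 0.72, 0.74]}
summary = format_cv_results(fake)
print("\n".join(summary))
```

With only 3 folds the standard deviation is a coarse estimate, so small differences between variants should be read with caution.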

On each level of preprocessed text we run cross validation with 4 different ML models:
Multinomial Naive Bayes, Logistic Regression, Gradient Boosted Trees and a linear SVM (LinearSVC).

We only use Tfidf vectorization but test different levels of N-gram expansion.
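
To make "N-gram expansion" concrete, here is a small pure-Python sketch of which features a range such as `(1, 2)` or `(1, 3)` produces. It uses a simplified whitespace tokenizer, so it only approximates what `TfidfVectorizer` does internally, and the function name `word_ngrams` is hypothetical.

```python
def word_ngrams(text, ngram_range=(1, 1)):
    """List all word n-grams of a whitespace-tokenized text for n in the given range."""
    tokens = text.split()
    low, high = ngram_range
    grams = []
    for n in range(low, high + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


quote = "pivotal addressing financial frustration"
print(word_ngrams(quote, (1, 1)))  # unigrams only
print(word_ngrams(quote, (1, 2)))  # unigrams and bigrams
print(word_ngrams(quote, (1, 3)))  # unigrams, bigrams and trigrams
```

A quote of 4 words yields 4 features with unigrams, 7 with bigrams added and 9 with trigrams added, which is why the vocabulary and the fit times grow quickly with the n-gram range.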

In [11]:
pipeline = Pipeline([
            ('vect', TfidfVectorizer()),
            ('clf', MultinomialNB()),
        ])

test_classifer(df, pipeline)

Col: quotation_cleanA
	fit_time             - 	avg: 4.220	std: 0.112
	test_accuracy        - 	avg: 0.691	std: 0.002
	test_f1              - 	avg: 0.695	std: 0.002
Col: quotation_cleanB
	fit_time             - 	avg: 4.399	std: 0.373
	test_accuracy        - 	avg: 0.691	std: 0.002
	test_f1              - 	avg: 0.695	std: 0.002
Col: quotation_cleanC
	fit_time             - 	avg: 3.043	std: 0.216
	test_accuracy        - 	avg: 0.689	std: 0.002
	test_f1              - 	avg: 0.692	std: 0.002
Col: quotation_cleanD
	fit_time             - 	avg: 2.657	std: 0.028
	test_accuracy        - 	avg: 0.679	std: 0.002
	test_f1              - 	avg: 0.682	std: 0.002
Col: quotation_cleanE
	fit_time             - 	avg: 2.740	std: 0.140
	test_accuracy        - 	avg: 0.686	std: 0.002
	test_f1              - 	avg: 0.688	std: 0.002


In [15]:
pipeline = Pipeline([
            ('vect', TfidfVectorizer()),
            ('clf', LogisticRegression(max_iter=1000)),
        ])

test_classifer(df, pipeline)

Col: quotation_cleanA
	fit_time             - 	avg: 13.313	std: 0.758
	test_accuracy        - 	avg: 0.692	std: 0.002
	test_f1              - 	avg: 0.692	std: 0.002
Col: quotation_cleanB
	fit_time             - 	avg: 12.015	std: 0.806
	test_accuracy        - 	avg: 0.692	std: 0.002
	test_f1              - 	avg: 0.692	std: 0.002
Col: quotation_cleanC
	fit_time             - 	avg: 5.339	std: 0.540
	test_accuracy        - 	avg: 0.689	std: 0.003
	test_f1              - 	avg: 0.689	std: 0.003
Col: quotation_cleanD
	fit_time             - 	avg: 5.875	std: 0.248
	test_accuracy        - 	avg: 0.682	std: 0.001
	test_f1              - 	avg: 0.680	std: 0.002
Col: quotation_cleanE
	fit_time             - 	avg: 6.857	std: 1.005
	test_accuracy        - 	avg: 0.687	std: 0.002
	test_f1              - 	avg: 0.686	std: 0.002


In [16]:
pipeline = Pipeline([
            ('vect', TfidfVectorizer()),
            ('clf', GradientBoostingClassifier(learning_rate=1)),
        ])

# Very slow to run (several minutes per preprocessing variant)
test_classifer(df, pipeline)

Col: quotation_cleanA
	fit_time             - 	avg: 193.228	std: 5.168
	test_accuracy        - 	avg: 0.632	std: 0.001
	test_f1              - 	avg: 0.619	std: 0.003
Col: quotation_cleanB
	fit_time             - 	avg: 199.264	std: 14.585
	test_accuracy        - 	avg: 0.632	std: 0.002
	test_f1              - 	avg: 0.623	std: 0.003
Col: quotation_cleanC
	fit_time             - 	avg: 107.047	std: 0.712
	test_accuracy        - 	avg: 0.628	std: 0.003
	test_f1              - 	avg: 0.581	std: 0.004
Col: quotation_cleanD
	fit_time             - 	avg: 104.179	std: 1.879
	test_accuracy        - 	avg: 0.631	std: 0.003
	test_f1              - 	avg: 0.596	std: 0.006
Col: quotation_cleanE
	fit_time             - 	avg: 105.497	std: 1.417
	test_accuracy        - 	avg: 0.630	std: 0.002
	test_f1              - 	avg: 0.588	std: 0.002


In [17]:
pipeline = Pipeline([
            ('vect', TfidfVectorizer()),
            ('clf', LinearSVC()),
        ])

test_classifer(df, pipeline)

Col: quotation_cleanA
	fit_time             - 	avg: 6.754	std: 0.082
	test_accuracy        - 	avg: 0.696	std: 0.002
	test_f1              - 	avg: 0.697	std: 0.002
Col: quotation_cleanB
	fit_time             - 	avg: 6.652	std: 0.049
	test_accuracy        - 	avg: 0.696	std: 0.002
	test_f1              - 	avg: 0.697	std: 0.002
Col: quotation_cleanC
	fit_time             - 	avg: 4.918	std: 0.310
	test_accuracy        - 	avg: 0.693	std: 0.001
	test_f1              - 	avg: 0.693	std: 0.002
Col: quotation_cleanD
	fit_time             - 	avg: 4.893	std: 0.268
	test_accuracy        - 	avg: 0.685	std: 0.002
	test_f1              - 	avg: 0.684	std: 0.002
Col: quotation_cleanE
	fit_time             - 	avg: 4.830	std: 0.004
	test_accuracy        - 	avg: 0.691	std: 0.002
	test_f1              - 	avg: 0.690	std: 0.003


Now we'll try using unigrams and bigrams!

In [18]:
pipeline = Pipeline([
            ('vect', TfidfVectorizer(ngram_range=(1,2))),
            ('clf', MultinomialNB()),
        ])

test_classifer(df, pipeline)

Col: quotation_cleanA
	fit_time             - 	avg: 12.863	std: 0.103
	test_accuracy        - 	avg: 0.733	std: 0.002
	test_f1              - 	avg: 0.740	std: 0.002
Col: quotation_cleanB
	fit_time             - 	avg: 12.739	std: 0.025
	test_accuracy        - 	avg: 0.733	std: 0.002
	test_f1              - 	avg: 0.740	std: 0.002
Col: quotation_cleanC
	fit_time             - 	avg: 9.833	std: 0.147
	test_accuracy        - 	avg: 0.744	std: 0.002
	test_f1              - 	avg: 0.747	std: 0.002
Col: quotation_cleanD
	fit_time             - 	avg: 8.791	std: 0.113
	test_accuracy        - 	avg: 0.739	std: 0.001
	test_f1              - 	avg: 0.743	std: 0.001
Col: quotation_cleanE
	fit_time             - 	avg: 9.661	std: 0.021
	test_accuracy        - 	avg: 0.742	std: 0.002
	test_f1              - 	avg: 0.746	std: 0.002


In [19]:
pipeline = Pipeline([
            ('vect', TfidfVectorizer(ngram_range=(1,2))),
            ('clf', LogisticRegression(max_iter=1000)),
        ])

test_classifer(df, pipeline)

Col: quotation_cleanA
	fit_time             - 	avg: 32.950	std: 4.012
	test_accuracy        - 	avg: 0.723	std: 0.002
	test_f1              - 	avg: 0.724	std: 0.002
Col: quotation_cleanB
	fit_time             - 	avg: 37.281	std: 4.540
	test_accuracy        - 	avg: 0.723	std: 0.002
	test_f1              - 	avg: 0.724	std: 0.002
Col: quotation_cleanC
	fit_time             - 	avg: 28.543	std: 7.263
	test_accuracy        - 	avg: 0.727	std: 0.002
	test_f1              - 	avg: 0.728	std: 0.002
Col: quotation_cleanD
	fit_time             - 	avg: 23.331	std: 3.219
	test_accuracy        - 	avg: 0.724	std: 0.001
	test_f1              - 	avg: 0.725	std: 0.001
Col: quotation_cleanE
	fit_time             - 	avg: 31.883	std: 5.888
	test_accuracy        - 	avg: 0.727	std: 0.001
	test_f1              - 	avg: 0.727	std: 0.001


In [20]:
pipeline = Pipeline([
            ('vect', TfidfVectorizer(ngram_range=(1,2))),
            ('clf', LinearSVC()),
        ])

test_classifer(df, pipeline)

Col: quotation_cleanA
	fit_time             - 	avg: 18.643	std: 0.121
	test_accuracy        - 	avg: 0.740	std: 0.001
	test_f1              - 	avg: 0.740	std: 0.001
Col: quotation_cleanB
	fit_time             - 	avg: 18.911	std: 0.357
	test_accuracy        - 	avg: 0.740	std: 0.001
	test_f1              - 	avg: 0.740	std: 0.001
Col: quotation_cleanC
	fit_time             - 	avg: 13.641	std: 0.110
	test_accuracy        - 	avg: 0.747	std: 0.001
	test_f1              - 	avg: 0.748	std: 0.001
Col: quotation_cleanD
	fit_time             - 	avg: 12.134	std: 0.333
	test_accuracy        - 	avg: 0.743	std: 0.001
	test_f1              - 	avg: 0.744	std: 0.001
Col: quotation_cleanE
	fit_time             - 	avg: 13.292	std: 0.474
	test_accuracy        - 	avg: 0.746	std: 0.001
	test_f1              - 	avg: 0.747	std: 0.001


With bigrams, LinearSVC and Multinomial Naive Bayes perform almost identically
(0.747 vs 0.744 accuracy on variant C), and Multinomial Naive Bayes is the
fastest one to train.

How about adding trigrams too? We don't train all of the models this time, as
training times get out of hand.

In [21]:
pipeline = Pipeline([
            ('vect', TfidfVectorizer(ngram_range=(1,3))),
            ('clf', MultinomialNB()),
        ])

test_classifer(df, pipeline)

Col: quotation_cleanA
	fit_time             - 	avg: 30.872	std: 0.259
	test_accuracy        - 	avg: 0.743	std: 0.002
	test_f1              - 	avg: 0.750	std: 0.001
Col: quotation_cleanB
	fit_time             - 	avg: 30.110	std: 0.378
	test_accuracy        - 	avg: 0.743	std: 0.002
	test_f1              - 	avg: 0.750	std: 0.001
Col: quotation_cleanC
	fit_time             - 	avg: 20.760	std: 1.678
	test_accuracy        - 	avg: 0.752	std: 0.002
	test_f1              - 	avg: 0.755	std: 0.002
Col: quotation_cleanD
	fit_time             - 	avg: 18.049	std: 0.195
	test_accuracy        - 	avg: 0.749	std: 0.001
	test_f1              - 	avg: 0.752	std: 0.001
Col: quotation_cleanE
	fit_time             - 	avg: 18.768	std: 0.158
	test_accuracy        - 	avg: 0.752	std: 0.001
	test_f1              - 	avg: 0.755	std: 0.001


In [22]:
pipeline = Pipeline([
            ('vect', TfidfVectorizer(ngram_range=(1,3))),
            ('clf', LinearSVC()),
        ])

test_classifer(df, pipeline)

Col: quotation_cleanA
	fit_time             - 	avg: 34.433	std: 0.513
	test_accuracy        - 	avg: 0.750	std: 0.001
	test_f1              - 	avg: 0.752	std: 0.001
Col: quotation_cleanB
	fit_time             - 	avg: 34.416	std: 0.420
	test_accuracy        - 	avg: 0.750	std: 0.001
	test_f1              - 	avg: 0.752	std: 0.001
Col: quotation_cleanC
	fit_time             - 	avg: 22.700	std: 0.272
	test_accuracy        - 	avg: 0.754	std: 0.001
	test_f1              - 	avg: 0.755	std: 0.001
Col: quotation_cleanD
	fit_time             - 	avg: 21.397	std: 0.053
	test_accuracy        - 	avg: 0.752	std: 0.001
	test_f1              - 	avg: 0.754	std: 0.001
Col: quotation_cleanE
	fit_time             - 	avg: 22.074	std: 0.173
	test_accuracy        - 	avg: 0.753	std: 0.002
	test_f1              - 	avg: 0.755	std: 0.002


In [24]:
pipeline = Pipeline([
            ('vect', TfidfVectorizer(ngram_range=(1,3))),
            ('clf', MultinomialNB(alpha=1.8)),
        ])

test_classifer(df, pipeline)

Col: quotation_cleanA
	fit_time             - 	avg: 32.895	std: 0.624
	test_accuracy        - 	avg: 0.736	std: 0.002
	test_f1              - 	avg: 0.744	std: 0.002
Col: quotation_cleanB
	fit_time             - 	avg: 33.253	std: 1.347
	test_accuracy        - 	avg: 0.736	std: 0.002
	test_f1              - 	avg: 0.744	std: 0.002
Col: quotation_cleanC
	fit_time             - 	avg: 21.559	std: 0.430
	test_accuracy        - 	avg: 0.747	std: 0.003
	test_f1              - 	avg: 0.750	std: 0.003
Col: quotation_cleanD
	fit_time             - 	avg: 20.027	std: 0.142
	test_accuracy        - 	avg: 0.745	std: 0.002
	test_f1              - 	avg: 0.748	std: 0.001
Col: quotation_cleanE
	fit_time             - 	avg: 21.259	std: 0.317
	test_accuracy        - 	avg: 0.747	std: 0.002
	test_f1              - 	avg: 0.749	std: 0.002


As we can see, we achieve our best accuracy yet by using unigrams, bigrams and trigrams, so we will use `TfidfVectorizer(ngram_range=(1,3))`.

Here we achieve our best accuracy of about 75% with both LinearSVC and MultinomialNB. Since the difference in performance between the two models is negligible, we decided to use MultinomialNB from now on, as it is also faster to train. The same reasoning applies to the level of preprocessing: variants C, D and E score within a fraction of a percentage point of each other, so we chose the most thorough level of cleaning (E), since it reduces the complexity of our model (by reducing the size of the vectorizer's dictionary).

Our predictions could likely be further improved by using a more advanced method for vectorizing the quotes, such as word2vec or BERT, but we decided to stick with a simpler, less convoluted model and focus on further data analysis.

Later, we will also aggregate our predictions to predict the political affiliation of a speaker based on all the quotes that are attributed to them and as such we can expect a better performance there!
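
The intuition is that individual errors can cancel out when many quotes from the same speaker are combined. The exact aggregation is defined later in the project; the sketch below assumes a simple majority vote over per-quote predictions, and the function name and toy data are hypothetical.

```python
from collections import Counter, defaultdict


def predict_speaker_party(speakers, quote_predictions):
    """Assign each speaker the label predicted most often across their quotes."""
    votes = defaultdict(Counter)
    for speaker, label in zip(speakers, quote_predictions):
        votes[speaker][label] += 1
    # most_common(1) returns the top label; ties fall to the label seen first.
    return {speaker: counts.most_common(1)[0][0] for speaker, counts in votes.items()}


# Made-up per-quote predictions, for illustration only.
speakers = ["Speaker 1", "Speaker 1", "Speaker 1", "Speaker 2", "Speaker 2"]
predicted = ["R", "D", "R", "D", "D"]
speaker_labels = predict_speaker_party(speakers, predicted)
print(speaker_labels)
```

Note that the gain from voting depends on how correlated the errors are: if a speaker's quotes are all misclassified for the same reason, aggregating them will not help.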

Now we will train our main model, which will be used throughout our project. We do this in the [next notebook](part3_2-model_training.ipynb).