# Modeling

## Summary


### Data Source

Data was sourced from the subreddits `r/TwoSentenceHorror` and `r/TwoSentenceComedy` using the `redshift` API.

### Questions

- Is there a model that can accurately predict whether a two-sentence post is from the `r/TwoSentenceHorror` subreddit?
- Can tuning a model's hyperparameters improve its accuracy?

### Results

The baseline accuracy, the proportion of posts in the majority class (text from `r/TwoSentenceHorror`), was 66%. The table below lists the accuracy score and Matthews correlation coefficient (MCC) of each un-tuned model. All models except the SVM classifier with count vectorizer outperform the baseline accuracy.

In addition to the accuracy score, the Matthews correlation coefficient was used to score model performance. This metric measures the correlation between the predicted and actual labels and falls between -1 and 1. A value of 1 is a perfect prediction, a value of 0 means that the prediction is no better than a completely random prediction, and a value of -1 means the predictions are entirely wrong.
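As a worked example, the sketch below computes the coefficient directly from the four cells of a binary confusion matrix. The helper name `mcc_from_counts` is hypothetical; the counts are the confusion matrix reported later in this notebook for logistic regression with tfidf on the held-out data.

```python
import math

def mcc_from_counts(tn, fp, fn, tp):
    """Matthews correlation coefficient from binary confusion-matrix counts."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention the coefficient is taken as 0 when any marginal total is 0.
    return numerator / denominator if denominator else 0.0

# scikit-learn lays out a binary confusion matrix as [[TN, FP], [FN, TP]].
mcc = mcc_from_counts(tn=118, fp=186, fn=29, tp=563)
print(round(mcc, 4))
```

The result should agree with the value that `matthews_corrcoef` reports for the same predictions.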

|vectorizer | model | accuracy score | matthews correlation coefficient (mcc) |
| :---: | :---: | :---: | :---: |
|count vectorizer | logistic regression | 70% | 26% |
| | knn classifier | 70% | 25% |
| | random forest classifier | 71% | 32% |
| | adaboost | 72% | 33% |
| | svm classifier | 34% | 5% |
| | multinomial naive bayes | 69% | 21% |
| tfidf | logistic regression | 76% | 43% |
| | knn classifier | 72% | 31% |
| | random forest classifier | 73% | 35% |
| | adaboost | 73% | 38% |
| | svm classifier | 77% | 46% |
| | multinomial naive bayes | 68% | 20% |

The models chosen for tuning were logistic regression and the SVM classifier, both with the tfidf vectorizer. The results of the tuning were:

| model | accuracy score | matthews correlation coefficient |
| :---: | :---: | :---: |
| logistic regression | 77% | 46% |
| svm classifier | 76% | 44% |

While the logistic regression accuracy and MCC scores improved with tuning, both scores actually dropped for the SVM classifier after tuning.

In [1]:
# Basic Data Analysis Modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns



In [2]:
df = pd.read_csv('./datasets/data_1.csv')
df.head()

Unnamed: 0,subreddit,text
0,TwoSentenceHorror,"I was watching a movie with my 5 year old son,..."
1,TwoSentenceHorror,"“You know, I’ve never bungee jumped before– ma..."
2,TwoSentenceHorror,I gently put down my baby in his crib before I...
3,TwoSentenceHorror,My cat has a very annoying habit of shoving hi...
4,TwoSentenceHorror,I shuddered as I heard the screams coming from...


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3606 entries, 0 to 3605
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   subreddit  3606 non-null   object
 1   text       3582 non-null   object
dtypes: object(2)
memory usage: 56.5+ KB


In [4]:
# Create the binary target column 'is_horror' (1 = horror, 0 = comedy)

df.loc[:,'is_horror'] = df.loc[:,'subreddit'].map({
    'TwoSentenceHorror' : 1,
    'TwoSentenceComedy' : 0
})

df.loc[:, 'is_horror'].value_counts(normalize = True)

1    0.662784
0    0.337216
Name: is_horror, dtype: float64

In [5]:
# Drop rows with a missing value in any column (here, posts with no text)
df.dropna(inplace = True)

## Baseline Accuracy

Baseline accuracy (the proportion of posts in the majority class) is about $66\%$.
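The sketch below shows the arithmetic behind this baseline, assuming a classifier that always predicts the majority class. The function name `majority_baseline` and the toy label list are illustrative and not part of the original notebook.

```python
from collections import Counter

def majority_baseline(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)

# Toy labels with roughly the same 2:1 imbalance (1 = horror, 0 = comedy).
toy_labels = [1, 1, 0, 1, 1, 0]
baseline = majority_baseline(toy_labels)
print(f"{baseline:.3f}")
```

Any model worth keeping should beat this number on held-out data.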

## Train Test Split

In [6]:
from sklearn.model_selection import train_test_split

X = df.loc[:,'text']
y = df.loc[:,'is_horror']

X_train, X_val, y_train, y_val = train_test_split(X, y, random_state = 42, stratify = y)

In [7]:
X_train.shape

(2686,)

In [8]:
X_val.shape

(896,)
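The two shapes follow from the default split. When neither `test_size` nor `train_size` is passed, `train_test_split` holds out 25% of the rows and rounds the held-out count up. A minimal check of that arithmetic, using the 3582 rows with non-null text reported by `df.info()`:

```python
import math

n_rows = 3582      # rows left after dropping posts with missing text
test_size = 0.25   # scikit-learn's default when no size is passed

n_val = math.ceil(test_size * n_rows)
n_train = n_rows - n_val
print(n_train, n_val)
```

Passing `stratify = y` keeps the horror/comedy ratio roughly the same in both parts.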

In [9]:
# Pipeline Setup: word vectorizers and models

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

from sklearn.model_selection import GridSearchCV

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

from sklearn.svm import SVC

# Classification Metrics

from sklearn.metrics import confusion_matrix, matthews_corrcoef, precision_score, recall_score, accuracy_score

## Modelling and Text Vectorization

### Workflow

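We begin by fitting each classifier with its default hyperparameters, once with a count vectorizer and once with a tfidf vectorizer. The best-performing combinations are then selected for hyperparameter tuning.

Rather than rewriting the same lines of code for every model, I define a class that, when initialized, handles text vectorization and model fitting in one step.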

### Vectorization and Modelling without tuning

In [10]:
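class vec_n_model_pipe_fit:
    """Fit one classifier behind a count vectorizer and a tfidf vectorizer and store the predictions."""
    def __init__(self, X_train, y_train, X_val, y_val, classifier):
        self.X_train = X_train
        self.y_train = y_train
        self.X_val = X_val
        self.y_val = y_val
        self.classifier = classifier
        
        # NOTE: both pipelines wrap the same `classifier` object. Pipeline does not
        # copy its final step, so the tfidf fit below refits the estimator that the
        # count-vectorizer pipeline also uses.
        pipe_cvec = Pipeline([('vec', CountVectorizer()), ('clsfr', classifier)])
        pipe_tvec = Pipeline([('vec', TfidfVectorizer()), ('clsfr', classifier)])
        
        # Pipeline.fit returns the pipeline itself
        self.fit_pipe_cvec = pipe_cvec.fit(X_train,y_train)
        self.fit_pipe_tvec = pipe_tvec.fit(X_train,y_train)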
                
        self.train_pred_cvec = self.fit_pipe_cvec.predict(X_train)
        self.test_pred_cvec = self.fit_pipe_cvec.predict(X_val)
        
        self.train_pred_tvec = self.fit_pipe_tvec.predict(X_train)
        self.test_pred_tvec = self.fit_pipe_tvec.predict(X_val)
        
    
    def vec_model_test_perf(self):
        return {'count_vec' : 
             {'accuracy' : self.fit_pipe_cvec.score(self.X_val, self.y_val),
              'precision' : precision_score(self.y_val, self.test_pred_cvec),
              'recall' : recall_score(self.y_val, self.test_pred_cvec),
              'mcc' : matthews_corrcoef(self.y_val, self.test_pred_cvec),
              'confusion_matrix' : confusion_matrix(self.y_val,self.test_pred_cvec)},
             'tfidf' : 
             {'accuracy' : self.fit_pipe_tvec.score(self.X_val, self.y_val),
              'precision' : precision_score(self.y_val, self.test_pred_tvec),
              'recall' : recall_score(self.y_val, self.test_pred_tvec),
              'mcc' : matthews_corrcoef(self.y_val, self.test_pred_tvec),
              'confusion_matrix' : confusion_matrix(self.y_val,self.test_pred_tvec)}}
    
    def vec_model_train_perf(self):
        return {'count_vec' : 
             {'accuracy' : self.fit_pipe_cvec.score(self.X_train, self.y_train),
              'precision' : precision_score(self.y_train, self.train_pred_cvec),
              'recall' : recall_score(self.y_train, self.train_pred_cvec),
              'mcc' : matthews_corrcoef(self.y_train, self.train_pred_cvec),
              'confusion_matrix' : confusion_matrix(self.y_train,self.train_pred_cvec)},
             'tfidf' : 
             {'accuracy' : self.fit_pipe_tvec.score(self.X_train, self.y_train),
              'precision' : precision_score(self.y_train, self.train_pred_tvec),
              'recall' : recall_score(self.y_train, self.train_pred_tvec),
              'mcc' : matthews_corrcoef(self.y_train, self.train_pred_tvec),
              'confusion_matrix' : confusion_matrix(self.y_train,self.train_pred_tvec)}}
    
    def best_vectorizer(self, test = True):
        if test:
            return 'cvec' if self.vec_model_test_perf()['count_vec']['accuracy'] >  self.vec_model_test_perf()['tfidf']['accuracy'] else 'tfidf'
        else:
            return 'cvec' if self.vec_model_train_perf()['count_vec']['accuracy'] >  self.vec_model_train_perf()['tfidf']['accuracy'] else 'tfidf'
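
One detail worth checking in a helper like this is whether the two pipelines hold separate estimators. scikit-learn's `Pipeline` stores the step objects it is given rather than copying them, so passing one classifier instance to two pipelines makes them share it. The sketch below is an illustration rather than the notebook's code, and its variable names are hypothetical. It shows what `sklearn.base.clone` changes.

```python
from sklearn.base import clone
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

classifier = LogisticRegression()

# Passing the same instance twice: both pipelines point at one estimator.
shared_cvec = Pipeline([('vec', CountVectorizer()), ('clsfr', classifier)])
shared_tvec = Pipeline([('vec', TfidfVectorizer()), ('clsfr', classifier)])
shares_estimator = shared_cvec.named_steps['clsfr'] is shared_tvec.named_steps['clsfr']

# clone() builds a new unfitted estimator with the same hyperparameters.
indep_cvec = Pipeline([('vec', CountVectorizer()), ('clsfr', clone(classifier))])
indep_tvec = Pipeline([('vec', TfidfVectorizer()), ('clsfr', clone(classifier))])
still_shared = indep_cvec.named_steps['clsfr'] is indep_tvec.named_steps['clsfr']

print(shares_estimator, still_shared)
```

With cloned estimators, fitting the tfidf pipeline leaves the count-vectorizer pipeline's classifier untouched.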
    

### Logistic Regression with Defaults

In [11]:
logit = vec_n_model_pipe_fit(X_train, y_train, X_val, y_val, LogisticRegression())

In [12]:
logit.vec_model_train_perf()

{'count_vec': {'accuracy': 0.7233804914370812,
  'precision': 0.7062825130052021,
  'recall': 0.9949267192784668,
  'mcc': 0.3537181477896348,
  'confusion_matrix': array([[ 178,  734],
         [   9, 1765]], dtype=int64)},
 'tfidf': {'accuracy': 0.8674609084139985,
  'precision': 0.8385864374403056,
  'recall': 0.9898534385569335,
  'mcc': 0.707430236965049,
  'confusion_matrix': array([[ 574,  338],
         [  18, 1756]], dtype=int64)}}

In [13]:
logit.vec_model_test_perf()

{'count_vec': {'accuracy': 0.6997767857142857,
  'precision': 0.6934131736526946,
  'recall': 0.9780405405405406,
  'mcc': 0.25551809894519173,
  'confusion_matrix': array([[ 48, 256],
         [ 13, 579]], dtype=int64)},
 'tfidf': {'accuracy': 0.7600446428571429,
  'precision': 0.7516688918558078,
  'recall': 0.9510135135135135,
  'mcc': 0.43362798610703973,
  'confusion_matrix': array([[118, 186],
         [ 29, 563]], dtype=int64)}}

### KNeighborsClassifier with Defaults

In [14]:
knn_classifier = vec_n_model_pipe_fit(X_train, y_train, X_val, y_val, KNeighborsClassifier())

In [15]:
knn_classifier.vec_model_train_perf()

{'count_vec': {'accuracy': 0.7691734921816828,
  'precision': 0.7495674740484429,
  'recall': 0.9768883878241262,
  'mcc': 0.46783851389848624,
  'confusion_matrix': array([[ 333,  579],
         [  41, 1733]], dtype=int64)},
 'tfidf': {'accuracy': 0.7896500372300819,
  'precision': 0.7638585770405937,
  'recall': 0.9864712514092446,
  'mcc': 0.525839731167524,
  'confusion_matrix': array([[ 371,  541],
         [  24, 1750]], dtype=int64)}}

In [16]:
knn_classifier.vec_model_test_perf()

{'count_vec': {'accuracy': 0.7008928571428571,
  'precision': 0.7014925373134329,
  'recall': 0.9527027027027027,
  'mcc': 0.2546086043574658,
  'confusion_matrix': array([[ 64, 240],
         [ 28, 564]], dtype=int64)},
 'tfidf': {'accuracy': 0.7176339285714286,
  'precision': 0.7153748411689962,
  'recall': 0.9510135135135135,
  'mcc': 0.31021194901810845,
  'confusion_matrix': array([[ 80, 224],
         [ 29, 563]], dtype=int64)}}

### MultinomialNB with Defaults

In [17]:
mnb = vec_n_model_pipe_fit(X_train, y_train, X_val, y_val, MultinomialNB())

In [18]:
mnb.vec_model_train_perf()

{'count_vec': {'accuracy': 0.7345495160089353,
  'precision': 0.7133092078809811,
  'recall': 1.0,
  'mcc': 0.394519100398565,
  'confusion_matrix': array([[ 199,  713],
         [   0, 1774]], dtype=int64)},
 'tfidf': {'accuracy': 0.7531645569620253,
  'precision': 0.727944193680755,
  'recall': 1.0,
  'mcc': 0.44581153114404254,
  'confusion_matrix': array([[ 249,  663],
         [   0, 1774]], dtype=int64)}}

In [19]:
mnb.vec_model_test_perf()

{'count_vec': {'accuracy': 0.6852678571428571,
  'precision': 0.6785714285714286,
  'recall': 0.9949324324324325,
  'mcc': 0.20999221011088634,
  'confusion_matrix': array([[ 25, 279],
         [  3, 589]], dtype=int64)},
 'tfidf': {'accuracy': 0.6830357142857143,
  'precision': 0.676605504587156,
  'recall': 0.9966216216216216,
  'mcc': 0.20231132546708322,
  'confusion_matrix': array([[ 22, 282],
         [  2, 590]], dtype=int64)}}

### RandomForest with Defaults

In [20]:
rf = vec_n_model_pipe_fit(X_train, y_train, X_val, y_val, RandomForestClassifier())

In [21]:
rf.vec_model_train_perf()

{'count_vec': {'accuracy': 0.9609084139985108,
  'precision': 0.9765848086807538,
  'recall': 0.963923337091319,
  'mcc': 0.9135353504573496,
  'confusion_matrix': array([[ 871,   41],
         [  64, 1710]], dtype=int64)},
 'tfidf': {'accuracy': 1.0,
  'precision': 1.0,
  'recall': 1.0,
  'mcc': 1.0,
  'confusion_matrix': array([[ 912,    0],
         [   0, 1774]], dtype=int64)}}

In [22]:
rf.vec_model_test_perf()

{'count_vec': {'accuracy': 0.7120535714285714,
  'precision': 0.7427325581395349,
  'recall': 0.8631756756756757,
  'mcc': 0.3150529376824144,
  'confusion_matrix': array([[127, 177],
         [ 81, 511]], dtype=int64)},
 'tfidf': {'accuracy': 0.7321428571428571,
  'precision': 0.7315789473684211,
  'recall': 0.9391891891891891,
  'mcc': 0.3538159641596086,
  'confusion_matrix': array([[100, 204],
         [ 36, 556]], dtype=int64)}}

### AdaBoost with Defaults

In [23]:
ada = vec_n_model_pipe_fit(X_train, y_train, X_val, y_val, AdaBoostClassifier())

In [24]:
ada.vec_model_train_perf()

{'count_vec': {'accuracy': 0.7308265078183172,
  'precision': 0.7332445628051487,
  'recall': 0.9312288613303269,
  'mcc': 0.3505886429680324,
  'confusion_matrix': array([[ 311,  601],
         [ 122, 1652]], dtype=int64)},
 'tfidf': {'accuracy': 0.7639612807148176,
  'precision': 0.8028692879914984,
  'recall': 0.8517474633596392,
  'mcc': 0.4600942869172248,
  'confusion_matrix': array([[ 541,  371],
         [ 263, 1511]], dtype=int64)}}

In [25]:
ada.vec_model_test_perf()

{'count_vec': {'accuracy': 0.7243303571428571,
  'precision': 0.7260812581913499,
  'recall': 0.9358108108108109,
  'mcc': 0.33067790974010014,
  'confusion_matrix': array([[ 95, 209],
         [ 38, 554]], dtype=int64)},
 'tfidf': {'accuracy': 0.7321428571428571,
  'precision': 0.7691131498470948,
  'recall': 0.8496621621621622,
  'mcc': 0.37637111626256814,
  'confusion_matrix': array([[153, 151],
         [ 89, 503]], dtype=int64)}}

### SVC with Defaults

In [26]:
svc = vec_n_model_pipe_fit(X_train, y_train, X_val, y_val, SVC())

In [27]:
svc.vec_model_train_perf()

{'count_vec': {'accuracy': 0.34363365599404316,
  'precision': 1.0,
  'recall': 0.0062006764374295375,
  'mcc': 0.04597852774321858,
  'confusion_matrix': array([[ 912,    0],
         [1763,   11]], dtype=int64)},
 'tfidf': {'accuracy': 0.9817572598659717,
  'precision': 0.9731212287438289,
  'recall': 1.0,
  'mcc': 0.9596026797986088,
  'confusion_matrix': array([[ 863,   49],
         [   0, 1774]], dtype=int64)}}

In [28]:
svc.vec_model_test_perf()

{'count_vec': {'accuracy': 0.34375,
  'precision': 1.0,
  'recall': 0.006756756756756757,
  'mcc': 0.04798698971257677,
  'confusion_matrix': array([[304,   0],
         [588,   4]], dtype=int64)},
 'tfidf': {'accuracy': 0.7689732142857143,
  'precision': 0.7542932628797886,
  'recall': 0.964527027027027,
  'mcc': 0.46124237327231543,
  'confusion_matrix': array([[118, 186],
         [ 21, 571]], dtype=int64)}}

In [29]:
estimators = [(logit, 'logistic regression'), (knn_classifier, 'knn classifier'), (rf, 'random forest classifier'), (ada, 'adaboost'), (svc, 'svm classifier'), (mnb, 'multinomial naive bayes')]

In [30]:
[(name, estim.best_vectorizer()) for (estim, name) in estimators]

[('logistic regression', 'tfidf'),
 ('knn classifier', 'tfidf'),
 ('random forest classifier', 'tfidf'),
 ('adaboost', 'tfidf'),
 ('svm classifier', 'tfidf'),
 ('multinomial naive bayes', 'cvec')]

In [31]:
[(name, estim.vec_model_test_perf()['tfidf']['accuracy']) for (estim, name) in estimators]

[('logistic regression', 0.7600446428571429),
 ('knn classifier', 0.7176339285714286),
 ('random forest classifier', 0.7321428571428571),
 ('adaboost', 0.7321428571428571),
 ('svm classifier', 0.7689732142857143),
 ('multinomial naive bayes', 0.6830357142857143)]

In [32]:
[(name, estim.vec_model_test_perf()['count_vec']['accuracy']) for (estim, name) in estimators]

[('logistic regression', 0.6997767857142857),
 ('knn classifier', 0.7008928571428571),
 ('random forest classifier', 0.7120535714285714),
 ('adaboost', 0.7243303571428571),
 ('svm classifier', 0.34375),
 ('multinomial naive bayes', 0.6852678571428571)]

The best estimator/vectorizer combo based on test accuracy is the SVM classifier with `tfidf`, with an accuracy of about $76.9\%$. The worst performing is the SVM classifier with count vectorization, at about $34.4\%$.

The next best performing combo is logistic regression with `tfidf`, with an accuracy of about $76.0\%$.

In [33]:
[(name, estim.vec_model_test_perf()['tfidf']['precision']) for (estim, name) in estimators]

[('logistic regression', 0.7516688918558078),
 ('knn classifier', 0.7153748411689962),
 ('random forest classifier', 0.7315789473684211),
 ('adaboost', 0.7691131498470948),
 ('svm classifier', 0.7542932628797886),
 ('multinomial naive bayes', 0.676605504587156)]

In [34]:
[(name, estim.vec_model_test_perf()['count_vec']['precision']) for (estim, name) in estimators]

[('logistic regression', 0.6934131736526946),
 ('knn classifier', 0.7014925373134329),
 ('random forest classifier', 0.7427325581395349),
 ('adaboost', 0.7260812581913499),
 ('svm classifier', 1.0),
 ('multinomial naive bayes', 0.6785714285714286)]

The best estimator/vectorizer combo, based on test precision, is the SVM classifier with count vectorization, with a precision of $100\%$. The worst performing is the multinomial naive Bayes classifier with `tfidf`, at about $67.7\%$.

The next best performing combo is the AdaBoost classifier with `tfidf`, with a precision of about $76.9\%$.
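
The $100\%$ precision of the SVM classifier with count vectorization deserves a closer look. Recomputing precision and recall from the confusion matrix printed above for that combination shows why precision alone can mislead. The helper name `precision_recall` is hypothetical.

```python
def precision_recall(tn, fp, fn, tp):
    """Precision and recall of the positive class from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Confusion matrix [[TN, FP], [FN, TP]] = [[304, 0], [588, 4]]
precision, recall = precision_recall(tn=304, fp=0, fn=588, tp=4)
print(precision, round(recall, 4))
```

Only 4 posts were predicted as horror, all of them correctly, so precision is perfect while recall stays below $1\%$.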

In [35]:
[(name, estim.vec_model_test_perf()['tfidf']['recall']) for (estim, name) in estimators]

[('logistic regression', 0.9510135135135135),
 ('knn classifier', 0.9510135135135135),
 ('random forest classifier', 0.9391891891891891),
 ('adaboost', 0.8496621621621622),
 ('svm classifier', 0.964527027027027),
 ('multinomial naive bayes', 0.9966216216216216)]

In [36]:
[(name, estim.vec_model_test_perf()['count_vec']['recall']) for (estim, name) in estimators]

[('logistic regression', 0.9780405405405406),
 ('knn classifier', 0.9527027027027027),
 ('random forest classifier', 0.8631756756756757),
 ('adaboost', 0.9358108108108109),
 ('svm classifier', 0.006756756756756757),
 ('multinomial naive bayes', 0.9949324324324325)]

The best estimator/vectorizer combo, based on test recall, is the multinomial naive Bayes classifier with `tfidf`, with a recall of about $99.7\%$. The worst performing is the SVM classifier with count vectorization, at about $0.7\%$.

The next best performing combo is the multinomial naive Bayes classifier with count vectorization, with a recall of about $99.5\%$.

In [37]:
[(name, estim.vec_model_test_perf()['tfidf']['mcc']) for (estim, name) in estimators]

[('logistic regression', 0.43362798610703973),
 ('knn classifier', 0.31021194901810845),
 ('random forest classifier', 0.3538159641596086),
 ('adaboost', 0.37637111626256814),
 ('svm classifier', 0.46124237327231543),
 ('multinomial naive bayes', 0.20231132546708322)]

In [38]:
[(name, estim.vec_model_test_perf()['count_vec']['mcc']) for (estim, name) in estimators]

[('logistic regression', 0.25551809894519173),
 ('knn classifier', 0.2546086043574658),
 ('random forest classifier', 0.3150529376824144),
 ('adaboost', 0.33067790974010014),
 ('svm classifier', 0.04798698971257677),
 ('multinomial naive bayes', 0.20999221011088634)]

The best estimator/vectorizer combo based on the Matthews correlation coefficient is again the SVM classifier with `tfidf`, with a coefficient of about $46.1\%$. The worst performing is again the SVM classifier, but with count vectorization, with a coefficient of less than $5\%$.

The next best performing combo is logistic regression with `tfidf`, with a coefficient of about $43.4\%$.

### Overall Winner

The combo with the best overall score is the SVM classifier with tfidf vectorization, and it appears that tfidf vectorization is the best overall text vectorizer.
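
To make the comparison concrete, the sketch below ranks the twelve combinations by MCC, using the values printed above rounded to three decimals. The dictionary layout is an assumption made for illustration.

```python
# MCC per (model, vectorizer), copied from the outputs above.
val_mcc = {
    ('logistic regression', 'tfidf'): 0.434,
    ('knn classifier', 'tfidf'): 0.310,
    ('random forest classifier', 'tfidf'): 0.354,
    ('adaboost', 'tfidf'): 0.376,
    ('svm classifier', 'tfidf'): 0.461,
    ('multinomial naive bayes', 'tfidf'): 0.202,
    ('logistic regression', 'count_vec'): 0.256,
    ('knn classifier', 'count_vec'): 0.255,
    ('random forest classifier', 'count_vec'): 0.315,
    ('adaboost', 'count_vec'): 0.331,
    ('svm classifier', 'count_vec'): 0.048,
    ('multinomial naive bayes', 'count_vec'): 0.210,
}

# Sort from highest to lowest coefficient and show the top three.
ranked = sorted(val_mcc.items(), key=lambda item: item[1], reverse=True)
for (model, vectorizer), score in ranked[:3]:
    print(f"{model:<26} {vectorizer:<10} {score:.3f}")
```

The top three entries all use tfidf.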

### Candidates for Tuning

- SVM classifier with tfidf vectorization
- Logistic regression with tfidf

## Hyperparameter Tuning - Tfidf Parameters

In [56]:
pipe_params_tf = {
    'tf__max_features' : (2000, 3000, 5000),
    'tf__max_df' : [0.9, 0.95],
    'tf__min_df' : [2, 3],
    'tf__stop_words' : (None,'english'),
    'tf__ngram_range' : [(1,1), (1,2), (1,3)],
    'svc__C' : [0.01, 0.1, 1, 10, 100],
    'svc__kernel' : ['rbf', 'poly'],
    'svc__degree' : [2, 3]
}

pipe_sc = Pipeline([('tf', TfidfVectorizer()), ('svc', SVC())])

# Instantiate the grid search over the tfidf + SVC pipeline.
grid_tf = GridSearchCV(pipe_sc, param_grid=pipe_params_tf, cv = 5, verbose=2, n_jobs = -1)


# Fit on training data.
grid_tf.fit(X_train, y_train)



Fitting 5 folds for each of 1440 candidates, totalling 7200 fits


[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  33 tasks      | elapsed:   15.0s
[Parallel(n_jobs=4)]: Done 154 tasks      | elapsed:  1.1min
[Parallel(n_jobs=4)]: Done 357 tasks      | elapsed:  2.5min
[Parallel(n_jobs=4)]: Done 640 tasks      | elapsed:  4.5min
[Parallel(n_jobs=4)]: Done 1005 tasks      | elapsed:  7.2min
[Parallel(n_jobs=4)]: Done 1450 tasks      | elapsed: 10.6min
[Parallel(n_jobs=4)]: Done 1977 tasks      | elapsed: 14.5min
[Parallel(n_jobs=4)]: Done 2584 tasks      | elapsed: 18.9min
[Parallel(n_jobs=4)]: Done 3273 tasks      | elapsed: 24.0min
[Parallel(n_jobs=4)]: Done 4042 tasks      | elapsed: 29.4min
[Parallel(n_jobs=4)]: Done 4893 tasks      | elapsed: 35.5min
[Parallel(n_jobs=4)]: Done 5824 tasks      | elapsed: 42.2min
[Parallel(n_jobs=4)]: Done 6837 tasks      | elapsed: 49.7min
[Parallel(n_jobs=4)]: Done 7200 out of 7200 | elapsed: 52.6min finished


GridSearchCV(cv=5, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('tf',
                                        TfidfVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.float64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=1.0,
                                                        max_features=None,
                                                        min_df=1,
                                                        ngram_range=(1, 1),
                                                        norm
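
The candidate count in the log can be checked by multiplying the number of options per hyperparameter. A small sketch of that arithmetic, with the option counts taken from `pipe_params_tf`:

```python
import math

# Number of options per hyperparameter in pipe_params_tf.
options_per_param = {
    'tf__max_features': 3,
    'tf__max_df': 2,
    'tf__min_df': 2,
    'tf__stop_words': 2,
    'tf__ngram_range': 3,
    'svc__C': 5,
    'svc__kernel': 2,
    'svc__degree': 2,
}

n_candidates = math.prod(options_per_param.values())
n_fits = n_candidates * 5  # cv = 5 folds per candidate
print(n_candidates, n_fits)
```

Because `degree` only affects the `poly` kernel, each `rbf` candidate is fitted once per `degree` value with identical results, so part of this search is redundant.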

In [57]:
# Evaluate model.
grid_tf.best_score_

0.7647006292704202

In [58]:
grid_tf.best_params_

{'svc__C': 10,
 'svc__degree': 2,
 'svc__kernel': 'rbf',
 'tf__max_df': 0.9,
 'tf__max_features': 5000,
 'tf__min_df': 3,
 'tf__ngram_range': (1, 2),
 'tf__stop_words': None}

In [59]:
grid_tf.best_estimator_

Pipeline(memory=None,
         steps=[('tf',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=0.9, max_features=5000,
                                 min_df=3, ngram_range=(1, 2), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words=None, strip_accents=None,
                                 sublinear_tf=False,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, use_idf=True,
                                 vocabulary=None)),
                ('svc',
                 SVC(C=10, break_ties=False, cache_size=200, class_weight=None,
                     coef0=0.0, decision_function_shape='ovr', degree=

In [61]:
pred_val_tf = grid_tf.best_estimator_.predict(X_val)
matthews_corrcoef(y_val,pred_val_tf)

0.4402177873038213

In [68]:
accuracy_score(y_val,pred_val_tf)

0.7622767857142857

In [63]:
pipe_params_lr = {
    'tf__max_features' : (2000, 3000, 5000),
    'tf__max_df' : [0.9, 0.95],
    'tf__min_df' : [2, 3],
    'tf__stop_words' : (None,'english'),
    'tf__ngram_range' : [(1,1), (1,2), (1,3)],
    'lr__C' : [0.1, 1, 10],
    'lr__penalty' : ['l1', 'l2', 'elasticnet']
}

pipe_lr = Pipeline([('tf', TfidfVectorizer()), ('lr', LogisticRegression(solver = 'saga'))])

# Instantiate the grid search over the tfidf + logistic regression pipeline.
grid_lr = GridSearchCV(pipe_lr, param_grid=pipe_params_lr, cv = 5, verbose=2, n_jobs = 5)


# Fit on training data.
grid_lr.fit(X_train, y_train)

# Evaluate model.
grid_lr.best_score_

Fitting 5 folds for each of 648 candidates, totalling 3240 fits


[Parallel(n_jobs=5)]: Using backend LokyBackend with 5 concurrent workers.
[Parallel(n_jobs=5)]: Done  31 tasks      | elapsed:    3.8s
[Parallel(n_jobs=5)]: Done 152 tasks      | elapsed:   17.0s
[Parallel(n_jobs=5)]: Done 355 tasks      | elapsed:   44.9s
[Parallel(n_jobs=5)]: Done 638 tasks      | elapsed:  1.4min
[Parallel(n_jobs=5)]: Done 1003 tasks      | elapsed:  2.0min
[Parallel(n_jobs=5)]: Done 1448 tasks      | elapsed:  2.9min
[Parallel(n_jobs=5)]: Done 1975 tasks      | elapsed:  3.8min
[Parallel(n_jobs=5)]: Done 2582 tasks      | elapsed:  6.7min
[Parallel(n_jobs=5)]: Done 3240 out of 3240 | elapsed:  7.8min finished


0.7572559932988584

In [64]:
grid_lr.best_params_

{'lr__C': 10,
 'lr__penalty': 'l2',
 'tf__max_df': 0.9,
 'tf__max_features': 5000,
 'tf__min_df': 3,
 'tf__ngram_range': (1, 2),
 'tf__stop_words': None}
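
One caveat about the grid above: in scikit-learn, `penalty='elasticnet'` is meant to be used together with an `l1_ratio` value, which `pipe_params_lr` does not supply. One possible way to handle this is to pass the grid as a list of dictionaries, so that `l1_ratio` is only combined with the elastic-net penalty. The sketch below does this with hypothetical `l1_ratio` values and leaves out the tfidf options for brevity.

```python
from sklearn.model_selection import ParameterGrid

# Sketch only: the l1_ratio values are illustrative and were not part of the original search.
lr_param_grid = [
    {'lr__penalty': ['l1', 'l2'], 'lr__C': [0.1, 1, 10]},
    {'lr__penalty': ['elasticnet'], 'lr__C': [0.1, 1, 10],
     'lr__l1_ratio': [0.25, 0.5, 0.75]},
]

# ParameterGrid expands each dictionary separately and concatenates the results.
n_candidates = len(ParameterGrid(lr_param_grid))
print(n_candidates)
```

The same list-of-dictionaries form is accepted by the `param_grid` argument of `GridSearchCV`.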

In [65]:
estimator_grid_lr = grid_lr.best_estimator_

In [66]:
matthews_corrcoef(y_val,estimator_grid_lr.predict(X_val))

0.46182695964436876

In [67]:
accuracy_score(y_val,estimator_grid_lr.predict(X_val))

0.7678571428571429

# Conclusions

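- Hyperparameter tuning doesn't always yield the best results: the tuned SVM classifier scored slightly lower on both accuracy and MCC than the un-tuned one.
- When tuned properly, even 'basic' models can perform better than more elaborate models: the tuned logistic regression outperformed the tuned SVM classifier on both accuracy and MCC.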
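
The first point can be read off the scores reported in this notebook. The sketch below simply subtracts the un-tuned scores from the tuned ones; the dictionary layout is an assumption for illustration.

```python
# (accuracy, mcc) on the held-out data, copied from the outputs above.
untuned = {'logistic regression': (0.7600, 0.4336), 'svm classifier': (0.7690, 0.4612)}
tuned = {'logistic regression': (0.7679, 0.4618), 'svm classifier': (0.7623, 0.4402)}

# Positive deltas mean tuning helped, negative deltas mean it hurt.
deltas = {
    model: (round(tuned[model][0] - untuned[model][0], 4),
            round(tuned[model][1] - untuned[model][1], 4))
    for model in untuned
}
for model, (d_acc, d_mcc) in deltas.items():
    print(f"{model}: accuracy {d_acc:+.4f}, mcc {d_mcc:+.4f}")
```

Both deltas are positive for logistic regression and negative for the SVM classifier.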