# Data Science Workflow
## Find the Best Model

This notebook uses functions from `reddit_functions` to compare how different models perform on the data.

A second workflow takes the parameters of the best model, builds a new pipeline with them, refits it on the entire dataset, and measures the improvement.

In [1]:
from pprint import pprint
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, ENGLISH_STOP_WORDS
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.model_selection import cross_val_score, RandomizedSearchCV
import seaborn as sns
import matplotlib.pyplot as plt
import datetime
%matplotlib inline
%load_ext autoreload
%autoreload 2

In [2]:
from helpers import databases
from helpers import dataloader
from helpers import grid_models
from helpers.reddit_functions import Reddit

In [3]:
subreddit_list = ['css', 'html', 'javascript', 'datascience', 'machinelearning', 'etl', 'python', 'dataengineering']

In [4]:
# subreddit_list = ['datascience','machinelearning','dataengineering','python','aws','sql']

In [5]:
df = dataloader.data_selector(subreddit_list, 'sqlite')

Connection to SQLite DB successful


In [6]:
# drop subreddits for which no data was retrieved
subreddit_list = [sub for sub in subreddit_list if sub in df.subreddit.unique()]
subreddit_list

['css',
 'html',
 'javascript',
 'datascience',
 'machinelearning',
 'etl',
 'python',
 'dataengineering']

In [7]:
df = dataloader.subreddit_encoder(df)

Subreddits and codes added: {'css': 0, 'html': 1, 'javascript': 2, 'datascience': 3, 'machinelearning': 4, 'etl': 5, 'python': 6, 'dataengineering': 7}
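
The mapping printed above assigns consecutive integers in list order. A minimal sketch of that idea in plain Python, assuming the codes simply follow the order of `subreddit_list`; `encode_subreddits` is a hypothetical stand-in and not the actual `dataloader.subreddit_encoder`:

```python
def encode_subreddits(subreddits):
    """Map each subreddit name to an integer code in list order (hypothetical helper)."""
    return {name: code for code, name in enumerate(subreddits)}


subreddit_list = ['css', 'html', 'javascript', 'datascience',
                  'machinelearning', 'etl', 'python', 'dataengineering']
codes = encode_subreddits(subreddit_list)

# Apply the mapping to a few example labels, as one would to a 'subreddit' column
labels = ['html', 'python', 'css']
sub_codes = [codes[name] for name in labels]
print(codes)
print(sub_codes)
```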


In [8]:
df.sample(10)

Unnamed: 0,title,subreddit,date,sub_code
10873,[R][P] Online Advanced Machine Learning Study ...,machinelearning,2020-04-02,4
1591,HELP!!! Does anyone know how to fix it?,html,2020-03-29,1
10763,[P] We are looking to detect hate speech in tw...,machinelearning,2020-04-02,4
10257,Can anyone tell me where to find live and hist...,datascience,2020-04-02,3
9533,VS Code Extension for Base Web and React View ...,javascript,2020-04-02,2
9213,Dark Reader is now available as a JavaScript l...,javascript,2020-04-02,2
3964,Transfer learning paper help [Project],machinelearning,2020-03-29,4
10469,What machine learning models have you created ...,datascience,2020-04-02,3
8530,CSS var() animation redifiner using javascript,html,2020-04-02,1
7186,"My pure CSS morph effect, what do you think",css,2020-04-02,0


In [9]:
X = df['title']
y = df['sub_code']

In [10]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=7)

In [11]:
useless_words = {'using', 'help', 'new', 'data', 'science', 'machine', 'learning', 'use', 'need'}

custom_stop_words = ENGLISH_STOP_WORDS.union(subreddit_list, useless_words)
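
`ENGLISH_STOP_WORDS` is a `frozenset`, and `frozenset.union` accepts any number of iterables, which is why a list and a set can be passed together. A small sketch with a stand-in stop-word set (the three words in `base_stop_words` are placeholders, not scikit-learn's list):

```python
# Placeholder for ENGLISH_STOP_WORDS; the real set is much larger
base_stop_words = frozenset({'the', 'and', 'of'})
subreddit_list = ['css', 'html']
useless_words = {'using', 'help'}

# union() takes several iterables at once and returns a new frozenset
custom_stop_words = base_stop_words.union(subreddit_list, useless_words)
print(sorted(custom_stop_words))
print(type(custom_stop_words).__name__)
```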

In [12]:
redfuncs = Reddit()

In [13]:
preprocessors = grid_models.preprocessors
estimators = grid_models.estimators

In [14]:
preprocessors['count_vec']['pipe_params']['count_vec__stop_words'].append(custom_stop_words)
# preprocessors['count_vec']['pipe_params']['count_vec__stop_words'].remove('english')

In [15]:
preprocessors['tfidf']['pipe_params']['tfidf__stop_words'].append(custom_stop_words)
# preprocessors['tfidf']['pipe_params']['tfidf__stop_words'].remove('english')

### Compare All Models

In [16]:
compare_df = redfuncs.compare_models(X_train, X_test, y_train, y_test, cv=3, verbose=1)

Fitting model with CountVectorizer and Extra Trees Classifier
Fitting 3 folds for each of 54 candidates, totalling 162 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:  3.4min
[Parallel(n_jobs=-1)]: Done 162 out of 162 | elapsed: 11.7min finished


Fitting model with TfidVectorizer and Extra Trees Classifier
Fitting 3 folds for each of 48 candidates, totalling 144 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:  3.0min
[Parallel(n_jobs=-1)]: Done 144 out of 144 | elapsed: 12.6min finished


Fitting model with CountVectorizer and Gradient Boosting Classifier
Fitting 3 folds for each of 54 candidates, totalling 162 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:  2.9min
[Parallel(n_jobs=-1)]: Done 162 out of 162 | elapsed: 11.0min finished


Fitting model with TfidVectorizer and Gradient Boosting Classifier
Fitting 3 folds for each of 48 candidates, totalling 144 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:  3.2min
[Parallel(n_jobs=-1)]: Done 144 out of 144 | elapsed: 21.0min finished


Fitting model with CountVectorizer and ElasticNet Classifier
Fitting 3 folds for each of 18 candidates, totalling 54 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:    2.6s
[Parallel(n_jobs=-1)]: Done  54 out of  54 | elapsed:    3.3s finished
  (train_score - test_score) / train_score * 100,
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.


Fitting model with TfidVectorizer and ElasticNet Classifier
Fitting 3 folds for each of 16 candidates, totalling 48 fits


[Parallel(n_jobs=-1)]: Done  48 out of  48 | elapsed:    2.9s finished
  (train_score - test_score) / train_score * 100,
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.


Fitting model with CountVectorizer and Passive Agressive Classifier
Fitting 3 folds for each of 9 candidates, totalling 27 fits


[Parallel(n_jobs=-1)]: Done  27 out of  27 | elapsed:    2.2s finished


Fitting model with TfidVectorizer and Passive Agressive Classifier
Fitting 3 folds for each of 8 candidates, totalling 24 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  24 out of  24 | elapsed:    1.7s finished


Fitting model with CountVectorizer and Stochastic Gradient Descent Classifier
Fitting 3 folds for each of 27 candidates, totalling 81 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:    3.5s
[Parallel(n_jobs=-1)]: Done  81 out of  81 | elapsed:    6.6s finished


Fitting model with TfidVectorizer and Stochastic Gradient Descent Classifier
Fitting 3 folds for each of 24 candidates, totalling 72 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  72 out of  72 | elapsed:    4.2s finished


Fitting model with CountVectorizer and Nu Support Vector Classifier
Fitting 3 folds for each of 27 candidates, totalling 81 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:  1.1min
[Parallel(n_jobs=-1)]: Done  81 out of  81 | elapsed:  2.1min finished


Fitting model with TfidVectorizer and Nu Support Vector Classifier
Fitting 3 folds for each of 24 candidates, totalling 72 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:  1.1min
[Parallel(n_jobs=-1)]: Done  72 out of  72 | elapsed:  1.8min finished
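
Each "candidates" figure in the log is the number of parameter combinations in that search, and the total number of fits is candidates times folds. A sketch with a hypothetical grid (the parameter values below are illustrative and not taken from `grid_models`) that happens to give 54 candidates:

```python
from itertools import product

# Hypothetical grid: 3 * 3 * 3 * 2 = 54 combinations
param_grid = {
    'count_vec__max_df': [0.3, 0.4, 0.5],
    'count_vec__max_features': [3000, 4000, 5000],
    'count_vec__ngram_range': [(1, 1), (1, 2), (1, 3)],
    'extratrees__bootstrap': [True, False],
}
cv = 3

# Every candidate is one combination of values, one per parameter
candidates = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]
total_fits = len(candidates) * cv
print(len(candidates), total_fits)  # 54 162
```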


In [17]:
compare_df.sort_values(by='best_test_score', ascending=False)

Unnamed: 0,date,preprocessor,estimator,best_params,best_train_score,best_test_score,variance,prep_code,est_code,sub_list
1,2020-04-03 20:49:46.042858,TfidVectorizer,Extra Trees Classifier,"{'extratrees__bootstrap': True, 'extratrees__c...",0.991015,0.878154,11.388467,tfidf,extratrees,na
3,2020-04-03 21:24:23.174866,TfidVectorizer,Gradient Boosting Classifier,"{'gradboost__learning_rate': 0.1, 'gradboost__...",0.985089,0.862672,12.426988,tfidf,gradboost,na
0,2020-04-03 20:36:23.190340,CountVectorizer,Extra Trees Classifier,"{'count_vec__max_df': 0.4, 'count_vec__max_fea...",0.986523,0.861525,12.670507,count_vec,extratrees,na
6,2020-04-03 21:24:32.781211,CountVectorizer,Passive Agressive Classifier,"{'count_vec__max_df': 0.5, 'count_vec__max_fea...",0.978589,0.842317,13.925421,count_vec,passive,na
2,2020-04-03 21:01:33.374847,CountVectorizer,Gradient Boosting Classifier,"{'count_vec__max_df': 0.3, 'count_vec__max_fea...",0.961671,0.834862,13.186261,count_vec,gradboost,na
7,2020-04-03 21:24:34.843634,TfidVectorizer,Passive Agressive Classifier,"{'passive__C': 1.0, 'passive__average': False,...",0.967597,0.832282,13.984635,tfidf,passive,na
8,2020-04-03 21:24:41.879519,CountVectorizer,Stochastic Gradient Descent Classifier,"{'count_vec__max_df': 0.4, 'count_vec__max_fea...",0.944944,0.81078,14.198074,count_vec,sgd,na
9,2020-04-03 21:24:46.356070,TfidVectorizer,Stochastic Gradient Descent Classifier,"{'sgd__alpha': 0.0001, 'sgd__average': False, ...",0.91732,0.790711,13.802036,tfidf,sgd,na
11,2020-04-03 21:29:11.756375,TfidVectorizer,Nu Support Vector Classifier,"{'nusvc__cache_size': 200, 'nusvc__decision_fu...",0.822023,0.736525,10.400849,tfidf,nusvc,na
10,2020-04-03 21:27:07.690591,CountVectorizer,Nu Support Vector Classifier,"{'count_vec__max_df': 0.3, 'count_vec__max_fea...",0.789906,0.707282,10.460002,count_vec,nusvc,na
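
The `variance` column is consistent with the relative gap between train and test score expressed as a percentage, `(train_score - test_score) / train_score * 100`, which is the expression echoed in the fit log. A sketch that recomputes it for the top row of the table from the rounded scores shown; `score_gap_pct` is a hypothetical name:

```python
def score_gap_pct(train_score, test_score):
    """Relative train/test gap as a percentage of the train score."""
    return (train_score - test_score) / train_score * 100


# TfidVectorizer + Extra Trees Classifier row, scores as displayed in the table
gap = score_gap_pct(0.991015, 0.878154)
print(round(gap, 2))  # 11.39
```

A larger value means the model fits the training folds much better than the held-out data, which is a sign of overfitting.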


In [1]:
!pwd

/home/datapointchris/github/reddit_nlp


In [19]:
date = str(datetime.datetime.now())
compare_df.to_csv(f'data/compare_df/{date}')
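
`str(datetime.datetime.now())` contains a space and colons, which some filesystems reject, and the path above has no file extension. One possible alternative, shown as a sketch: format the timestamp with `strftime` and add `.csv`. The format string and the `compare_path` helper are suggestions, not part of the original workflow:

```python
import datetime
from pathlib import Path


def compare_path(now, directory='data/compare_df'):
    """Build a filesystem-friendly CSV path from a timestamp (hypothetical helper)."""
    stamp = now.strftime('%Y-%m-%d_%H-%M-%S')
    return Path(directory) / f'{stamp}.csv'


path = compare_path(datetime.datetime(2020, 4, 3, 21, 30, 0))
print(path.as_posix())  # data/compare_df/2020-04-03_21-30-00.csv
```

In the notebook this would be used as `compare_df.to_csv(compare_path(datetime.datetime.now()))`.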

In [20]:
# [pprint(params) for params in compare_df.sort_values(by='best_test_score', ascending=False)['best_params']]

In [21]:
best_model = compare_df.sort_values(by='best_test_score', ascending=False).iloc[0, :].to_dict()
best_model

{'date': Timestamp('2020-04-03 20:49:46.042858'),
 'preprocessor': 'TfidVectorizer',
 'estimator': 'Extra Trees Classifier',
 'best_params': {'extratrees__bootstrap': True,
  'extratrees__class_weight': None,
  'extratrees__max_depth': None,
  'extratrees__max_leaf_nodes': None,
  'extratrees__min_samples_leaf': 1,
  'extratrees__min_samples_split': 2,
  'extratrees__min_weight_fraction_leaf': 0.0,
  'extratrees__n_estimators': 500,
  'tfidf__max_features': 5000,
  'tfidf__ngram_range': (1, 1),
  'tfidf__norm': 'l1',
  'tfidf__stop_words': 'english',
  'tfidf__strip_accents': None,
  'tfidf__use_idf': True},
 'best_train_score': 0.9910151022748996,
 'best_test_score': 0.8781536697247706,
 'variance': 11.388467470480807,
 'prep_code': 'tfidf',
 'est_code': 'extratrees',
 'sub_list': 'na'}
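
The keys in `best_params` follow scikit-learn's `<step>__<parameter>` convention, which is how `set_params` routes each value to the right pipeline step. A plain-Python sketch that groups a few of the values above by step; `group_by_step` is a hypothetical helper for illustration only:

```python
from collections import defaultdict


def group_by_step(params):
    """Split 'step__param' keys into {step: {param: value}}."""
    grouped = defaultdict(dict)
    for key, value in params.items():
        step, _, name = key.partition('__')
        grouped[step][name] = value
    return dict(grouped)


# A subset of the best parameters shown above
best_params = {
    'extratrees__bootstrap': True,
    'extratrees__n_estimators': 500,
    'tfidf__max_features': 5000,
    'tfidf__ngram_range': (1, 1),
}
by_step = group_by_step(best_params)
print(by_step)
```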

## Make a new model with the best params from the search

In [22]:
best_pipe = Pipeline([
    (best_model['prep_code'], preprocessors[best_model['prep_code']]['processor']),
    (best_model['est_code'], estimators[best_model['est_code']]['estimator'])
])
best_pipe.set_params(**best_model['best_params'])
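# fit on the entire dataset; the score below is computed on the same data,
# so it is training accuracy rather than a held-out estimate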
best_pipe.fit(X, y)
best_pipe_score = best_pipe.score(X, y)
best_pipe_score

0.9886021505376344

In [23]:
cross_score = cross_val_score(best_pipe, X, y)
print(cross_score, cross_score.mean())




[0.98151333 0.89591398 0.98881239] 0.9554132328408352
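
The mean hides how much the folds disagree: the middle fold is noticeably lower than the other two. A short sketch that recomputes the mean and the spread from the printed scores, using only the standard library:

```python
from statistics import mean, pstdev

# Fold scores copied from the output above
cross_score = [0.98151333, 0.89591398, 0.98881239]

fold_mean = mean(cross_score)
fold_std = pstdev(cross_score)                      # population standard deviation
fold_range = max(cross_score) - min(cross_score)    # best fold minus worst fold

print(round(fold_mean, 4))   # 0.9554
print(round(fold_std, 4))
print(round(fold_range, 4))  # 0.0929
```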


### Model Improvement

In [24]:
# baseline: class proportions; always predicting the most common subreddit scores the largest share
y.value_counts(normalize=True)

4    0.142437
7    0.140000
1    0.139140
2    0.133692
6    0.132760
0    0.131971
3    0.126237
5    0.053763
Name: sub_code, dtype: float64
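
The usual baseline for a classifier is the share of the most frequent class, here label 4 at about 14.2%. A sketch that picks it out, using the rounded proportions printed above in a plain dict instead of a pandas Series:

```python
# Class proportions copied from the value_counts output above
class_share = {4: 0.142437, 7: 0.140000, 1: 0.139140, 2: 0.133692,
               6: 0.132760, 0: 0.131971, 3: 0.126237, 5: 0.053763}

# The majority-class baseline is the largest share, whatever its label
majority_label = max(class_share, key=class_share.get)
baseline = class_share[majority_label]
print(majority_label, baseline)  # 4 0.142437
```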

In [25]:
# improvement over baseline; note that [0] looks up the class labelled 0, not the most frequent class (label 4)
best_pipe_score - y.value_counts(normalize=True)[0]

0.8566308243727598

In [26]:
# difference between the refit pipeline's score and the lowest best_test_score in the comparison
best_pipe_score - min(compare_df['best_test_score'])

0.9889406777941373

In [27]:
# apparent improvement from refitting on the entire dataset (training accuracy vs. the earlier held-out test score)
best_pipe_score - best_model['best_test_score']

0.11044848081286385