# Data Science Workflow
## Find the Best Model

This notebook shows how to use functions from `reddit_functions` to compare the performance of different models on the data.

A second workflow takes the parameters of the best model, builds a new model with them, fits it on the entire dataset, and measures the improvement.

In [1]:
from pprint import pprint
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, ENGLISH_STOP_WORDS
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.model_selection import cross_val_score, RandomizedSearchCV
import seaborn as sns
import matplotlib.pyplot as plt
import datetime
%matplotlib inline
%load_ext autoreload
%autoreload 2

In [2]:
from helpers import databases
from helpers import dataloader
from helpers import grid_models
from helpers.reddit_functions import Reddit

In [3]:
subreddit_list = ['css', 'html', 'javascript', 'datascience', 'machinelearning', 'etl', 'python', 'dataengineering']

In [4]:
# subreddit_list = ['datascience','machinelearning','dataengineering','python','aws','sql']

In [5]:
df = dataloader.data_selector(subreddit_list, 'sqlite')

Connection to SQLite DB successful


In [6]:
# get rid of list items with no data retrieved
subreddit_list = [sub for sub in subreddit_list if sub in df.subreddit.unique()]
subreddit_list

['css',
 'html',
 'javascript',
 'datascience',
 'machinelearning',
 'etl',
 'python',
 'dataengineering']

In [7]:
df = dataloader.subreddit_encoder(df)

Subreddits and codes added: {'css': 0, 'html': 1, 'javascript': 2, 'datascience': 3, 'machinelearning': 4, 'etl': 5, 'python': 6, 'dataengineering': 7}
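
The mapping printed above assigns each subreddit an integer code in list order. As a rough sketch of that idea, the encoding could look like the following. The function `encode_subreddits` and the toy frame are hypothetical; this is not the actual `dataloader.subreddit_encoder`, whose implementation is not shown in this notebook.

```python
import pandas as pd


def encode_subreddits(df, column="subreddit"):
    """Hypothetical stand-in for dataloader.subreddit_encoder."""
    # dict.fromkeys keeps first-appearance order and drops duplicates
    names = list(dict.fromkeys(df[column]))
    codes = {name: code for code, name in enumerate(names)}
    out = df.copy()
    out["sub_code"] = out[column].map(codes)
    return out, codes


# made-up rows, only to show the shape of the result
toy = pd.DataFrame(
    {
        "title": ["Glowing Nav on hover", "cart code", "css grid limits", "PEP guidelines"],
        "subreddit": ["css", "html", "css", "python"],
    }
)
encoded, codes = encode_subreddits(toy)
print(codes)
print(encoded[["subreddit", "sub_code"]])
```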


In [8]:
df.sample(10)

| | title | subreddit | date | sub_code |
|---|---|---|---|---|
| 8872 | Need urgent help with a small project | html | 2020-04-02 | 1 |
| 3378 | International Students beware of Data Science ... | datascience | 2020-03-29 | 3 |
| 9402 | About coding the “FizzBuzz” interview question | javascript | 2020-04-02 | 2 |
| 529 | Glowing Nav on hover | css | 2020-03-29 | 0 |
| 6127 | Pyspark - how do I use groupby with lists? | dataengineering | 2020-03-29 | 7 |
| 2043 | Everything You Need to Know About Regular Expr... | javascript | 2020-03-29 | 2 |
| 11374 | [R] A Road Map to Strong Intelligence | machinelearning | 2020-04-02 | 4 |
| 5560 | How important is it to follow PEP guidelines f... | python | 2020-03-29 | 6 |
| 384 | What are the limitations of css grid in terms ... | css | 2020-03-29 | 0 |
| 8128 | cart code | html | 2020-04-02 | 1 |


In [9]:
X = df['title']
y = df['sub_code']

In [10]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=7)

In [11]:
useless_words = set(['using', 'help', 'new', 'data', 'science', 'machine', 'learning', 'use', 'need'])

# stop_words is documented as 'english', a list, or None, so convert the frozenset to a list
custom_stop_words = list(ENGLISH_STOP_WORDS.union(subreddit_list, useless_words))
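
To illustrate what a custom stop-word list does to the vocabulary, here is a minimal sketch on two made-up titles. The extra words and the titles are assumptions chosen for the example; only `CountVectorizer` and `ENGLISH_STOP_WORDS` come from scikit-learn.

```python
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# illustrative extra words, standing in for subreddit names and filler terms
extra_words = {"data", "help", "python"}
stop_words = sorted(ENGLISH_STOP_WORDS.union(extra_words))  # pass a list

vec = CountVectorizer(stop_words=stop_words)
vec.fit(["Need help with python data pipelines", "Glowing nav on hover"])

# the fitted vocabulary no longer contains any stop word
vocab = sorted(vec.vocabulary_)
print(vocab)
```

Removing the subreddit names themselves stops the model from relying on titles that simply mention their own subreddit.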

In [12]:
redfuncs = Reddit()

In [13]:
preprocessors = grid_models.preprocessors
estimators = grid_models.estimators

In [14]:
preprocessors['count_vec']['pipe_params']['count_vec__stop_words'].append(custom_stop_words)
# preprocessors['count_vec']['pipe_params']['count_vec__stop_words'].remove('english')

In [15]:
preprocessors['tfidf']['pipe_params']['tfidf__stop_words'].append(custom_stop_words)
# preprocessors['tfidf']['pipe_params']['tfidf__stop_words'].remove('english')
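
Appending to a `pipe_params` list adds one more option for the grid search to try, which multiplies the number of candidates. The sketch below assumes a dictionary layout loosely modelled on the keys used above; the actual contents of `grid_models` are not shown in this notebook, so every value here is hypothetical. Candidates are counted with scikit-learn's `ParameterGrid`.

```python
from sklearn.model_selection import ParameterGrid

# hypothetical grids; the real ones live in helpers/grid_models.py
prep_params = {
    "count_vec__stop_words": [None, "english"],
    "count_vec__ngram_range": [(1, 1), (1, 2)],
}
est_params = {"logreg__C": [0.1, 1.0, 10.0]}

# append adds the whole custom list as ONE additional option
custom_stop_words = ["css", "html", "python"]
prep_params["count_vec__stop_words"].append(custom_stop_words)

# a pipeline grid is the union of the preprocessor and estimator grids
grid = {**prep_params, **est_params}

cv = 3
n_candidates = len(ParameterGrid(grid))  # 3 * 2 * 3 combinations
n_fits = n_candidates * cv               # one fit per candidate per fold
print(n_candidates, n_fits)
```

The same arithmetic explains the log lines in the next section, where the number of fits is the number of candidates times the number of folds.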

### Compare All Models

In [None]:
compare_df = redfuncs.compare_models(X_train, X_test, y_train, y_test, cv=3, verbose=1)

Fitting model with CountVectorizer and Logistic Regression
Fitting 3 folds for each of 576 candidates, totalling 1728 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   13.8s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:   57.8s
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed:  1.9min
[Parallel(n_jobs=-1)]: Done 792 tasks      | elapsed:  3.4min
[Parallel(n_jobs=-1)]: Done 1242 tasks      | elapsed:  5.1min
[Parallel(n_jobs=-1)]: Done 1728 out of 1728 | elapsed:  7.1min finished


Fitting model with TfidVectorizer and Logistic Regression
Fitting 3 folds for each of 512 candidates, totalling 1536 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:    3.5s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:   15.1s
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed:   34.8s
[Parallel(n_jobs=-1)]: Done 792 tasks      | elapsed:  1.1min
[Parallel(n_jobs=-1)]: Done 1242 tasks      | elapsed:  1.9min
[Parallel(n_jobs=-1)]: Done 1536 out of 1536 | elapsed:  2.5min finished


Fitting model with CountVectorizer and Random Forest
Fitting 3 folds for each of 108 candidates, totalling 324 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   41.9s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:  2.4min
[Parallel(n_jobs=-1)]: Done 324 out of 324 | elapsed:  3.8min finished


Fitting model with TfidVectorizer and Random Forest
Fitting 3 folds for each of 96 candidates, totalling 288 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:  1.1min
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:  2.9min
[Parallel(n_jobs=-1)]: Done 288 out of 288 | elapsed:  3.7min finished


Fitting model with CountVectorizer and K Nearest Neighbors
Fitting 3 folds for each of 27 candidates, totalling 81 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:  1.7min
[Parallel(n_jobs=-1)]: Done  81 out of  81 | elapsed:  3.0min finished


Fitting model with TfidVectorizer and K Nearest Neighbors
Fitting 3 folds for each of 24 candidates, totalling 72 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:  5.6min
[Parallel(n_jobs=-1)]: Done  72 out of  72 | elapsed:  9.2min finished


Fitting model with CountVectorizer and Multinomial Bayes Classifier
Fitting 3 folds for each of 27 candidates, totalling 81 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:    4.5s
[Parallel(n_jobs=-1)]: Done  81 out of  81 | elapsed:    8.1s finished


Fitting model with TfidVectorizer and Multinomial Bayes Classifier
Fitting 3 folds for each of 24 candidates, totalling 72 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  72 out of  72 | elapsed:    5.6s finished


Fitting model with CountVectorizer and Support Vector Classifier
Fitting 3 folds for each of 18 candidates, totalling 54 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:  4.9min
[Parallel(n_jobs=-1)]: Done  54 out of  54 | elapsed:  6.1min finished


Fitting model with TfidVectorizer and Support Vector Classifier
Fitting 3 folds for each of 16 candidates, totalling 48 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  48 out of  48 | elapsed:  6.0min finished


Fitting model with CountVectorizer and AdaBoost Classifier
Fitting 3 folds for each of 81 candidates, totalling 243 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:    4.5s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:   15.3s
[Parallel(n_jobs=-1)]: Done 243 out of 243 | elapsed:   18.0s finished


Fitting model with TfidVectorizer and AdaBoost Classifier
Fitting 3 folds for each of 72 candidates, totalling 216 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:    1.9s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:    8.3s
[Parallel(n_jobs=-1)]: Done 216 out of 216 | elapsed:    9.4s finished


Fitting model with CountVectorizer and Bagging Classifier
Fitting 3 folds for each of 108 candidates, totalling 324 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   20.9s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:  2.4min


In [None]:
compare_df.sort_values(by='best_test_score', ascending=False)

In [None]:
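# timestamp without spaces or colons so the file name is valid on every OS
date = datetime.datetime.now().strftime('%Y-%m-%d_%H-%M-%S')
compare_df.to_csv(f'data/compare_df/{date}.csv')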

In [None]:
# [pprint(params) for params in compare_df.sort_values(by='best_test_score', ascending=False)['best_params']]

In [None]:
best_model = compare_df.sort_values(by='best_test_score', ascending=False).iloc[0, :].to_dict()
best_model

## Make a New Model with the Best Params from the Search

In [None]:
best_pipe = Pipeline([
    (best_model['prep_code'], preprocessors[best_model['prep_code']]['processor']),
    (best_model['est_code'], estimators[best_model['est_code']]['estimator'])
])
best_pipe.set_params(**best_model['best_params'])
# fit on entire dataset
best_pipe.fit(X, y)
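# accuracy on the same data the model was fitted on (a training score)
best_pipe_score = best_pipe.score(X, y)
best_pipe_score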

In [None]:
cross_score = cross_val_score(best_pipe, X, y)
print(cross_score, cross_score.mean())
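
A score computed on the data a model was fitted on and a cross-validated score answer different questions. The former measures fit, the latter estimates performance on unseen titles. The sketch below contrasts the two on made-up titles with a fixed, hypothetical pipeline. It does not use the tuned `best_pipe`, and the numbers it prints say nothing about the real dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# made-up titles: class 0 is css, class 1 is python
titles = [
    "css grid layout question", "center a div with css flexbox",
    "css hover animation glow", "responsive css nav bar",
    "css selector specificity", "sticky footer css trick",
    "python list comprehension tips", "pandas groupby in python",
    "python virtualenv setup", "python decorators explained",
    "async python requests", "python type hints guide",
]
labels = [0] * 6 + [1] * 6

pipe = Pipeline([("count_vec", CountVectorizer()), ("mnb", MultinomialNB())])

# accuracy on the very rows used for fitting
train_score = pipe.fit(titles, labels).score(titles, labels)

# each fold is scored by a clone fitted on the other folds only
cv_scores = cross_val_score(pipe, titles, labels, cv=3)

print(train_score, cv_scores, cv_scores.mean())
```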


### Model Improvement

In [None]:
# baseline
y.value_counts(normalize=True)

In [None]:
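# how much improvement over the majority-class baseline
# (.max() picks the largest class share; [0] would look up the class labelled 0)
best_pipe_score - y.value_counts(normalize=True).max()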
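
The baseline for a classifier is usually the accuracy of always predicting the most frequent class. The short example below, on made-up labels and a hypothetical model score, shows how to read that baseline from `value_counts`. It also shows why indexing the result with `[0]` is a label lookup rather than a "first row" lookup when the labels are integers.

```python
import pandas as pd

# made-up class codes: class 1 is the most frequent
y_toy = pd.Series([0, 1, 1, 2, 1, 2, 1, 0])

shares = y_toy.value_counts(normalize=True)

baseline = shares.max()        # share of the majority class (class 1)
share_of_label_zero = shares[0]  # label-based: share of the class coded 0

model_score = 0.8              # hypothetical accuracy, for illustration only
improvement = model_score - baseline

print(baseline, share_of_label_zero, improvement)
```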

In [None]:
# how much difference between the worst and the best tuned model
best_pipe_score - compare_df['best_test_score'].min()

In [None]:
# how much improvement from retraining on the entire dataset
# (best_pipe_score is a training score, so this gap is an optimistic estimate)
best_pipe_score - best_model['best_test_score']