# Data Science Workflow
## Find the Best Model

This notebook shows how to use functions from `reddit_functions` to compare how different models perform on the data.

A second workflow takes the parameters of the best model, builds a new model with them, fits it on the entire dataset, and measures the improvement.

In [1]:
# for jupyter to find my modules and packages
# import sys
# sys.path
# sys.path.append("../")

In [32]:
from pprint import pprint
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, ENGLISH_STOP_WORDS
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.model_selection import cross_val_score, RandomizedSearchCV
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
%load_ext autoreload
%autoreload 2


In [3]:
from helpers import databases

In [4]:
from helpers import dataloader

In [5]:
from helpers import grid_models

In [6]:
from helpers.reddit_functions import Reddit

In [7]:
# subreddit_list = ['css', 'html', 'javascript', 'php', 'perl', 'java', 'datascience', 'machinelearning', 'etl', 'python', 'dataengineering']

In [8]:
subreddit_list = ['datascience','machinelearning','dataengineering','python','aws','sql']

In [9]:
df = dataloader.data_selector(subreddit_list, 'sqlite')

Connection to SQLite DB successful


In [10]:
# get rid of list items with no data retrieved
subreddit_list = [sub for sub in subreddit_list if sub in df.subreddit.unique()]
subreddit_list

['datascience', 'machinelearning', 'dataengineering', 'python', 'aws', 'sql']

In [11]:
df = dataloader.subreddit_encoder(df)

Subreddits and codes added: {'aws': 0, 'sql': 1, 'datascience': 2, 'machinelearning': 3, 'python': 4, 'dataengineering': 5}
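
The source of `dataloader.subreddit_encoder` is not shown in this notebook. As a rough sketch of how such a label-to-code mapping could be built with pandas, the example below uses a hypothetical `encode_subreddits` function and a made-up toy frame; the real helper may assign codes differently.

```python
import pandas as pd


def encode_subreddits(df, column="subreddit", code_column="sub_code"):
    """Hypothetical stand-in for dataloader.subreddit_encoder.

    Assigns an integer code to each distinct subreddit in order of first
    appearance and returns a copy of the frame plus the mapping.
    """
    codes = {name: code for code, name in enumerate(df[column].unique())}
    encoded = df.copy()
    encoded[code_column] = encoded[column].map(codes)
    return encoded, codes


# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "title": ["S3 bucket policy question", "Window functions", "Lambda cold starts"],
    "subreddit": ["aws", "sql", "aws"],
})
encoded, codes = encode_subreddits(toy)
print(f"Subreddits and codes added: {codes}")
```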


In [12]:
df.sample(10)

| | title | subreddit | date | sub_code |
|---|---|---|---|---|
| 8893 | A quick speech synthesis project—is Tacotron 2... | machinelearning | 2020-04-02 | 3 |
| 7629 | Finding duplicates across multiple columns. Is... | sql | 2020-04-02 | 1 |
| 2835 | Newbie Question - Which is easier to learn Pyt... | datascience | 2020-03-29 | 2 |
| 2518 | An outsider's opinion of data quality. (My bac... | datascience | 2020-03-29 | 2 |
| 8224 | Looking for experienced team buddies for Faceb... | datascience | 2020-04-02 | 2 |
| 5449 | Apache Spark for dotnet developers | dataengineering | 2020-03-29 | 5 |
| 10192 | Python package to detect emotion using tone of... | python | 2020-04-02 | 4 |
| 8240 | [UK] Bachelors degree in engineering, what els... | datascience | 2020-04-02 | 2 |
| 4017 | Boids - organic motion from 3 simple rules | python | 2020-03-29 | 4 |
| 9312 | [R] KaoKore: A Pre-modern Japanese Art Facial ... | machinelearning | 2020-04-02 | 3 |


In [13]:
X = df['title']
y = df['sub_code']

In [14]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=7)

In [15]:
useless_words = set(['using', 'help', 'new', 'data', 'science', 'machine', 'learning', 'use', 'need'])

custom_stop_words = ENGLISH_STOP_WORDS.union(subreddit_list, useless_words)
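
`custom_stop_words` is not referenced again in the cells shown here, so how it reaches the vectorizers is not visible. A minimal sketch of one way to use it, assuming it is passed straight to a vectorizer's `stop_words` parameter (the toy titles and shortened word lists are made up):

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer

# Shortened, illustrative versions of the lists defined above.
subreddit_list = ["datascience", "python"]
useless_words = {"using", "help", "data"}
custom_stop_words = ENGLISH_STOP_WORDS.union(subreddit_list, useless_words)

# Recent scikit-learn versions expect a list (or 'english' / None),
# so the frozenset is converted before being passed in.
vectorizer = TfidfVectorizer(stop_words=list(custom_stop_words))
vectorizer.fit([
    "Help using python for data cleaning",
    "Favourite datascience books about pandas",
])

# vocabulary_ maps each kept token to its column index.
vocab = set(vectorizer.vocabulary_)
print(sorted(vocab))
```

Removing the subreddit names themselves keeps the classifier from relying on titles that simply mention their own subreddit.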

In [16]:
redfuncs = Reddit()

In [17]:
preprocessors = grid_models.preprocessors
estimators = grid_models.estimators

In [18]:
# pprint(preprocessors)

In [19]:
# pprint(estimators)

### Compare Subset of Models

In [20]:
# esty = {'logreg': estimators['logreg']}

# compare_df = redfuncs.compare_models(X_train, X_test, y_train, y_test, estimators=esty, cv=3, verbose=0)

### Compare All Models

In [21]:
compare_df = redfuncs.compare_models(X_train, X_test, y_train, y_test, cv=3, verbose=1)

Fitting 3 folds for each of 72 candidates, totalling 216 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  26 tasks      | elapsed:    3.4s
[Parallel(n_jobs=-1)]: Done 176 tasks      | elapsed:   11.2s
[Parallel(n_jobs=-1)]: Done 216 out of 216 | elapsed:   13.2s finished
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.


Fitting 3 folds for each of 8 candidates, totalling 24 fits


[Parallel(n_jobs=-1)]: Done  24 out of  24 | elapsed:    1.4s finished
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.


Fitting 3 folds for each of 108 candidates, totalling 324 fits


[Parallel(n_jobs=-1)]: Done  26 tasks      | elapsed:   12.5s
[Parallel(n_jobs=-1)]: Done 176 tasks      | elapsed:  1.4min
[Parallel(n_jobs=-1)]: Done 324 out of 324 | elapsed:  2.6min finished


Fitting 3 folds for each of 12 candidates, totalling 36 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  36 out of  36 | elapsed:   19.1s finished


Fitting 3 folds for each of 27 candidates, totalling 81 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  26 tasks      | elapsed:    9.0s
[Parallel(n_jobs=-1)]: Done  81 out of  81 | elapsed:   19.4s finished


Fitting 3 folds for each of 3 candidates, totalling 9 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done   6 out of   9 | elapsed:    1.9s remaining:    1.0s
[Parallel(n_jobs=-1)]: Done   9 out of   9 | elapsed:    2.0s finished


Fitting 3 folds for each of 27 candidates, totalling 81 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  26 tasks      | elapsed:    1.3s
[Parallel(n_jobs=-1)]: Done  81 out of  81 | elapsed:    2.9s finished


Fitting 3 folds for each of 3 candidates, totalling 9 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done   6 out of   9 | elapsed:    0.2s remaining:    0.1s
[Parallel(n_jobs=-1)]: Done   9 out of   9 | elapsed:    0.2s finished


Fitting 3 folds for each of 162 candidates, totalling 486 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  26 tasks      | elapsed:   56.9s
[Parallel(n_jobs=-1)]: Done 176 tasks      | elapsed:  5.5min
[Parallel(n_jobs=-1)]: Done 426 tasks      | elapsed: 13.2min
[Parallel(n_jobs=-1)]: Done 486 out of 486 | elapsed: 15.0min finished


Fitting 3 folds for each of 18 candidates, totalling 54 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  26 tasks      | elapsed:  1.1min
[Parallel(n_jobs=-1)]: Done  54 out of  54 | elapsed:  2.2min finished
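
The log above shows one grid search per preprocessor/estimator pair. The implementation of `compare_models` is not included in this notebook, so the following is only a sketch of how such a comparison could be structured with `Pipeline` and `GridSearchCV`. The function name `compare_models_sketch`, the dictionary layout, the toy grids, and the toy titles are all assumptions made for illustration.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline


def compare_models_sketch(X_train, X_test, y_train, y_test,
                          preprocessors, estimators, cv=3):
    """Grid-search every preprocessor/estimator pair and tabulate the results."""
    rows = []
    for prep_name, (prep, prep_grid) in preprocessors.items():
        for est_name, (est, est_grid) in estimators.items():
            pipe = Pipeline([(prep_name, prep), (est_name, est)])
            grid = GridSearchCV(pipe, {**prep_grid, **est_grid}, cv=cv)
            grid.fit(X_train, y_train)
            rows.append({
                "Preprocessor": prep_name,
                "Estimator": est_name,
                "Best Params": grid.best_params_,
                # Scores of the refit best estimator (accuracy by default).
                "Best Train Score": grid.score(X_train, y_train),
                "Best Test Score": grid.score(X_test, y_test),
            })
    return pd.DataFrame(rows)


# Toy grids and data, made up for illustration only.
toy_preprocessors = {
    "count_vec": (CountVectorizer(), {"count_vec__ngram_range": [(1, 1), (1, 2)]}),
    "tfidf": (TfidfVectorizer(), {"tfidf__ngram_range": [(1, 1), (1, 2)]}),
}
toy_estimators = {
    "logreg": (LogisticRegression(max_iter=1000), {"logreg__C": [1, 3]}),
    "multinomialnb": (MultinomialNB(), {"multinomialnb__alpha": [0.5, 1.0]}),
}
titles = [
    "join two tables query", "index slow query tables", "query window functions",
    "group by query count", "tables primary key query", "nested query tables",
    "lambda bucket deploy", "bucket policy lambda", "deploy lambda function",
    "bucket storage cost", "lambda timeout deploy", "bucket lifecycle deploy",
]
labels = [0] * 6 + [1] * 6
X_tr, X_te, y_tr, y_te = train_test_split(
    titles, labels, test_size=0.25, stratify=labels, random_state=7
)
results = compare_models_sketch(
    X_tr, X_te, y_tr, y_te, toy_preprocessors, toy_estimators, cv=2
)
print(results.sort_values(by="Best Test Score", ascending=False))
```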


In [22]:
compare_df.sort_values(by='Best Test Score', ascending=False)

| | Preprocessor | Estimator | Best Params | Best Train Score | Best Test Score | Variance |
|---|---|---|---|---|---|---|
| 9 | tfidf | svc | {'svc__C': 3, 'svc__degree': 1, 'svc__gamma': ... | 0.993373 | 0.894003 | 10.003301 |
| 8 | count_vec | svc | {'count_vec__max_df': 0.3, 'count_vec__max_fea... | 0.980467 | 0.862622 | 12.019304 |
| 0 | count_vec | logreg | {'count_vec__max_df': 0.3, 'count_vec__max_fea... | 0.97105 | 0.848675 | 12.602323 |
| 1 | tfidf | logreg | {'logreg__C': 3, 'logreg__penalty': 'l2', 'tfi... | 0.955121 | 0.841702 | 11.874925 |
| 3 | tfidf | randomforest | {'randomforest__max_depth': 200, 'randomforest... | 0.954191 | 0.838215 | 12.154437 |
| 2 | count_vec | randomforest | {'count_vec__max_df': 0.3, 'count_vec__max_fea... | 0.935473 | 0.822176 | 12.111192 |
| 7 | tfidf | multinomialnb | {'multinomialnb__alpha': 1, 'multinomialnb__fi... | 0.894663 | 0.795328 | 11.103132 |
| 6 | count_vec | multinomialnb | {'count_vec__max_df': 0.3, 'count_vec__max_fea... | 0.853738 | 0.776499 | 9.047113 |
| 4 | count_vec | knearest | {'count_vec__max_df': 0.3, 'count_vec__max_fea... | 0.806999 | 0.625523 | 22.487777 |
| 5 | tfidf | knearest | {'knearest__metric': 'manhattan', 'knearest__n... | 0.787699 | 0.313808 | 60.161497 |
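
The notebook does not say how the `Variance` column is computed. The reported values are consistent with the train/test gap expressed as a percentage of the train score, which is an inference from the numbers rather than something confirmed by the source. The check below uses a hypothetical helper, `relative_gap_pct`, on three rows copied from the table.

```python
def relative_gap_pct(train_score, test_score):
    """Train/test gap as a percentage of the train score (assumed formula)."""
    return (train_score - test_score) / train_score * 100


# (train score, test score, reported "Variance") copied from the table above.
reported_rows = {
    ("tfidf", "svc"): (0.993373, 0.894003, 10.003301),
    ("count_vec", "multinomialnb"): (0.853738, 0.776499, 9.047113),
    ("tfidf", "knearest"): (0.787699, 0.313808, 60.161497),
}

gaps = {}
for key, (train, test, reported) in reported_rows.items():
    gaps[key] = relative_gap_pct(train, test)
    print(key, round(gaps[key], 3), "reported:", reported)
```

Read this way, a larger value means a bigger drop from train to test score, which is a sign of overfitting rather than variance in the statistical sense.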


In [60]:
compare_df.to_csv('compare_df_04022020.csv')

In [31]:
[pprint(params) for params in compare_df.sort_values(by='Best Test Score', ascending=False)['Best Params']]

{'svc__C': 3,
 'svc__degree': 1,
 'svc__gamma': 'scale',
 'svc__kernel': 'rbf',
 'svc__probability': True,
 'tfidf__max_features': 5000,
 'tfidf__ngram_range': (1, 1),
 'tfidf__stop_words': 'english',
 'tfidf__strip_accents': None}
{'count_vec__max_df': 0.3,
 'count_vec__max_features': 5000,
 'count_vec__min_df': 4,
 'count_vec__ngram_range': (1, 2),
 'count_vec__stop_words': 'english',
 'svc__C': 3,
 'svc__degree': 1,
 'svc__gamma': 'scale',
 'svc__kernel': 'rbf',
 'svc__probability': True}
{'count_vec__max_df': 0.3,
 'count_vec__max_features': 5000,
 'count_vec__min_df': 4,
 'count_vec__ngram_range': (1, 2),
 'count_vec__stop_words': 'english',
 'logreg__C': 3,
 'logreg__penalty': 'l2'}
{'logreg__C': 3,
 'logreg__penalty': 'l2',
 'tfidf__max_features': 5000,
 'tfidf__ngram_range': (1, 1),
 'tfidf__stop_words': 'english',
 'tfidf__strip_accents': None}
{'randomforest__max_depth': 200,
 'randomforest__min_samples_leaf': 1,
 'randomforest__min_samples_split': 0.001,
 'randomforest__n_es

[None, None, None, None, None, None, None, None, None, None]

In [23]:
best_model = compare_df.sort_values(by='Best Test Score', ascending=False).iloc[0, :].to_dict()
best_model

{'Preprocessor': 'tfidf',
 'Estimator': 'svc',
 'Best Params': {'svc__C': 3,
  'svc__degree': 1,
  'svc__gamma': 'scale',
  'svc__kernel': 'rbf',
  'svc__probability': True,
  'tfidf__max_features': 5000,
  'tfidf__ngram_range': (1, 1),
  'tfidf__stop_words': 'english',
  'tfidf__strip_accents': None},
 'Best Train Score': 0.9933728636205092,
 'Best Test Score': 0.8940027894002789,
 'Variance': 10.003300659740184}

## Make a new model with the best params from the search

In [34]:
best_pipe = Pipeline([
    (best_model['Preprocessor'], preprocessors[best_model['Preprocessor']]['processor']),
    (best_model['Estimator'], estimators[best_model['Estimator']]['estimator'])
])
best_pipe.set_params(**best_model['Best Params'])
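# fit on entire dataset so that best_pipe can be scored in the next cell
best_pipe.fit(X, y)
# cross_val_score fits clones of best_pipe, so it gives an estimate on held-out folds
crossval = cross_val_score(best_pipe, X=X, y=y, cv=5)
crossval.mean()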

0.9566688123628602

In [35]:
# accuracy on the full dataset, i.e. the same data the final model is trained on
best_pipe_score = best_pipe.score(X, y)
best_pipe_score

0.9911936524544425
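
The two numbers above measure different things: the cross-validated mean is computed on held-out folds, while `score(X, y)` evaluates the model on the data it was trained on. A small sketch of the two measurements side by side, using made-up titles and an SVC pipeline that only loosely mirrors the one above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Toy titles and labels, invented for illustration only.
titles = [
    "join two tables query", "index slow query tables", "query window functions",
    "group by query count", "tables primary key query", "nested query tables",
    "lambda bucket deploy", "bucket policy lambda", "deploy lambda function",
    "bucket storage cost", "lambda timeout deploy", "bucket lifecycle deploy",
]
labels = [0] * 6 + [1] * 6

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("svc", SVC(C=3, kernel="rbf", gamma="scale")),
])

# Held-out estimate: each fold is scored by a clone that never saw it.
cv_scores = cross_val_score(pipe, titles, labels, cv=3)

# Training accuracy: the model is scored on the data it was fit on.
train_score = pipe.fit(titles, labels).score(titles, labels)

print("cross-validated mean:", cv_scores.mean())
print("training accuracy:", train_score)
```

Training accuracy is usually the more optimistic of the two, so the cross-validated mean is the safer figure to quote as expected performance on new titles.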

### Model Improvement

In [26]:
# baseline
y.value_counts(normalize=True)

0    0.173773
3    0.173250
5    0.170285
1    0.167669
4    0.161479
2    0.153544
Name: sub_code, dtype: float64
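
The largest class share above is the accuracy a model would get by always predicting the most frequent subreddit. As a sketch, the same baseline can be obtained from scikit-learn's `DummyClassifier`; the toy labels here are made up and stand in for `y`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Toy labels: class 0 appears in 3 of 6 rows.
y_toy = np.array([0, 0, 0, 1, 1, 2])
# The dummy model ignores the features, so a placeholder column is enough.
X_toy = np.zeros((len(y_toy), 1))

baseline = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
baseline_accuracy = baseline.score(X_toy, y_toy)

# Same number computed directly from the class frequencies.
majority_share = np.bincount(y_toy).max() / len(y_toy)
print(baseline_accuracy, majority_share)
```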

In [27]:
# how much improvement over the majority-class baseline
best_pipe_score - y.value_counts(normalize=True).max()

0.8174208736594298

In [28]:
# gap between the refit best model's score and the lowest best test score among the compared models
best_pipe_score - min(compare_df['Best Test Score'])

0.6773861210736893

In [29]:
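# gap between the score after retraining on the entire dataset and the held-out test score
# (best_pipe_score is measured on the training data, so part of this gap is not real improvement)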
best_pipe_score - best_model['Best Test Score']

0.09719086305416358