# Data Science Workflow
## Find the Best Model

This notebook shows how to use functions from `reddit_functions` to compare the performance of different models on the data.

A second workflow takes the parameters of the best model, builds a new model with them, fits it on the entire dataset, and measures the improvement.

In [1]:
from pprint import pprint
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, ENGLISH_STOP_WORDS
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.model_selection import cross_val_score, RandomizedSearchCV
import seaborn as sns
import matplotlib.pyplot as plt
import datetime
%matplotlib inline
%load_ext autoreload
%autoreload 2

In [2]:
from util import databases
from util import dataloader
from util import grid_models
from util.reddit_functions import Reddit, Labeler

In [3]:
!pwd

/Users/chris/github/reddit_nlp/util


In [4]:
subreddit_list = ['css', 'html', 'machinelearning', 'python']

In [5]:
df = dataloader.data_selector(subreddit_list, 'sqlite')

Connection to SQLite DB successful


In [6]:
df.sample(10)

Unnamed: 0,title,subreddit,date
11494,Why does the span element have a vertical offs...,css,2020-04-14
22526,Python Scraping Question,python,2020-04-21
32497,[D] IJCAI 20 notifications,machinelearning,2020-04-24
26053,I made a Twitter market watch bot,python,2020-04-22
33995,Sniffing for File Types On Local Machine?,python,2020-04-24
430,How can I recreate this?,css,2020-03-29
12919,Adding content,html,2020-04-14
12369,Developed a pure CSS count up timer using HTML...,html,2020-04-14
34778,Why side nav works only with sections?,css,2020-04-25
19783,Review my small Book chapter on CSS,css,2020-04-21


In [7]:
X = df['title']
y = df['subreddit']

In [8]:
labeler = Labeler()
labeler.fit(y)
y = labeler.transform(y)
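
`Labeler` is a project helper whose implementation is not shown in this notebook. Assuming it maps each subreddit name to an integer code in the way scikit-learn's `LabelEncoder` does, the step above can be sketched as follows. The sample labels are made up for illustration.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of subreddit labels (not taken from the real dataset).
labels = ['css', 'html', 'python', 'css', 'machinelearning']

encoder = LabelEncoder()
encoder.fit(labels)                    # learns the sorted set of class names
encoded = encoder.transform(labels)    # maps each name to its integer code

print(list(encoder.classes_))
print(list(encoded))
# inverse_transform recovers the original names from the integer codes
print(list(encoder.inverse_transform(encoded)))
```

Keeping the fitted encoder around makes it possible to translate predictions back into subreddit names later.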

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=7)

In [10]:
model = Reddit()

In [11]:
preprocessors = grid_models.preprocessors
estimators = grid_models.estimators

### Compare All Models

In [12]:
compare_df = model.compare_models(X_train, X_test, y_train, y_test, cv=3, verbose=1)

  0%|          | 0/3 [00:00<?, ?it/s]

Fitting model with CountVectorizer and Logistic Regression
Fitting 3 folds for each of 16 candidates, totalling 48 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  48 out of  48 | elapsed:   11.3s finished


Fitting model with TfidVectorizer and Logistic Regression
Fitting 3 folds for each of 32 candidates, totalling 96 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  26 tasks      | elapsed:    4.0s
[Parallel(n_jobs=-1)]: Done  96 out of  96 | elapsed:   17.4s finished
 33%|███▎      | 1/3 [00:34<01:08, 34.45s/it][Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.


Fitting model with CountVectorizer and Random Forest
Fitting 3 folds for each of 12 candidates, totalling 36 fits


[Parallel(n_jobs=-1)]: Done  36 out of  36 | elapsed:   42.6s finished


Fitting model with TfidVectorizer and Random Forest
Fitting 3 folds for each of 24 candidates, totalling 72 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  26 tasks      | elapsed:   41.5s
[Parallel(n_jobs=-1)]: Done  72 out of  72 | elapsed:  1.5min finished
 67%|██████▋   | 2/3 [03:19<01:13, 73.55s/it]

Fitting model with CountVectorizer and Passive Agressive Classifier
Fitting 3 folds for each of 1 candidates, totalling 3 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done   3 out of   3 | elapsed:    0.6s finished


Fitting model with TfidVectorizer and Passive Agressive Classifier
Fitting 3 folds for each of 2 candidates, totalling 6 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done   4 out of   6 | elapsed:    0.7s remaining:    0.4s
[Parallel(n_jobs=-1)]: Done   6 out of   6 | elapsed:    1.2s finished
100%|██████████| 3/3 [03:23<00:00, 67.67s/it]
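
The log above suggests that `compare_models` runs one grid search per preprocessor/estimator pair and records the scores. Its internals are not shown in this notebook, so the following is only a minimal sketch of that idea with plain scikit-learn. The toy titles, the `candidates` dictionary, the parameter grid, and the way the train and test scores are computed are all illustrative assumptions.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy data (made up): 0 = css, 1 = python.
train_titles = [
    'center a div with css flexbox', 'css grid layout question',
    'css hover animation not working', 'style a button with css',
    'python list comprehension help', 'python pandas dataframe merge',
    'python script to scrape a site', 'learning python decorators',
]
train_labels = [0, 0, 0, 0, 1, 1, 1, 1]
test_titles = ['css flexbox question', 'python pandas help']
test_labels = [0, 1]

# Hypothetical stand-in for the preprocessors defined in grid_models.
candidates = {
    'count_vec': CountVectorizer(),
    'tfidf': TfidfVectorizer(),
}

rows = []
for prep_code, preprocessor in candidates.items():
    # One pipeline per preprocessor, with the same estimator step each time.
    pipe = Pipeline([(prep_code, preprocessor),
                     ('logreg', LogisticRegression(max_iter=1000))])
    grid = {f'{prep_code}__ngram_range': [(1, 1), (1, 2)],
            'logreg__C': [1, 3]}
    search = GridSearchCV(pipe, grid, cv=2)
    search.fit(train_titles, train_labels)
    # Assumption: scores are taken from the refit best estimator.
    rows.append({
        'prep_code': prep_code,
        'est_code': 'logreg',
        'best_params': search.best_params_,
        'best_train_score': search.score(train_titles, train_labels),
        'best_test_score': search.score(test_titles, test_labels),
    })

results = pd.DataFrame(rows).sort_values(by='best_test_score', ascending=False)
print(results[['prep_code', 'best_train_score', 'best_test_score']])
```

Collecting one row per pair in a DataFrame is what makes the later `sort_values(by='best_test_score')` step possible.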


In [13]:
compare_df.sort_values(by='best_test_score', ascending=False)

Unnamed: 0,date,preprocessor,estimator,best_params,best_train_score,best_test_score,variance,prep_code,est_code,subreddits,fit_and_score_time
4,2020-04-25 22:38:35.548883,CountVectorizer,Passive Agressive Classifier,"{'count_vec__max_features': 5000, 'count_vec__...",0.983799,0.972184,1.180568,count_vec,passive,na,1.578906
5,2020-04-25 22:38:37.741098,TfidVectorizer,Passive Agressive Classifier,"{'passive__C': 1.0, 'passive__average': False,...",0.984009,0.972079,1.212319,tfidf,passive,na,2.188445
0,2020-04-25 22:35:28.729328,CountVectorizer,Logistic Regression,"{'count_vec__max_features': 5000, 'count_vec__...",0.980929,0.97019,1.094819,count_vec,logreg,na,13.99459
1,2020-04-25 22:35:49.185874,TfidVectorizer,Logistic Regression,"{'logreg__C': 3, 'logreg__max_iter': 1000, 'lo...",0.971936,0.959694,1.259645,tfidf,logreg,na,20.452396
2,2020-04-25 22:36:50.849978,CountVectorizer,Random Forest,"{'count_vec__max_features': 5000, 'count_vec__...",0.928826,0.915188,1.4683,count_vec,randomforest,na,61.659652
3,2020-04-25 22:38:33.965580,TfidVectorizer,Random Forest,"{'randomforest__max_depth': 200, 'randomforest...",0.931276,0.914349,1.817626,tfidf,randomforest,na,103.111789
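
The `variance` column is not defined in this notebook. Its values are consistent with the relative gap between the train and test scores, expressed in percent, that is `(train - test) / train * 100`. This is an inference from the table, not a documented formula. A quick check against the top row:

```python
# Scores copied from the first row of the table above (rounded as displayed).
best_train_score = 0.983799
best_test_score = 0.972184

# Assumed formula: relative train/test gap in percent.
variance = (best_train_score - best_test_score) / best_train_score * 100
print(round(variance, 2))
```

The result agrees with the reported 1.180568 to two decimal places; the small remaining difference comes from the scores being rounded in the display.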



In [15]:
# timestamp for the file name; colons are avoided because some filesystems reject them
date = datetime.datetime.now().strftime('%Y-%m-%d_%H-%M-%S')
compare_df.to_csv(f'../data/compare_df/{date}.csv')

In [None]:
# [pprint(params) for params in compare_df.sort_values(by='best_test_score', ascending=False)['best_params']]

In [None]:
best_model = compare_df.sort_values(by='best_test_score', ascending=False).iloc[0, :].to_dict()
best_model

## Make a New Model with the Best Params from the Search

In [None]:
best_pipe = Pipeline([
    (best_model['prep_code'], preprocessors[best_model['prep_code']]['processor']),
    (best_model['est_code'], estimators[best_model['est_code']]['estimator'])
])
best_pipe.set_params(**best_model['best_params'])
# fit on the entire dataset
best_pipe.fit(X, y)
# note: this is an in-sample score, measured on the same data used for fitting
best_pipe_score = best_pipe.score(X, y)
best_pipe_score

In [None]:
cross_score = cross_val_score(best_pipe, X, y)
print(cross_score, cross_score.mean())
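
`best_pipe.score(X, y)` is computed on the same rows the pipeline was fit on, whereas `cross_val_score` refits the pipeline on each training fold and scores it on the held-out fold. A minimal sketch of the difference, using made-up titles and a plain `CountVectorizer` plus `LogisticRegression` pipeline as a stand-in for `best_pipe`:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Toy data (made up): 0 = css, 1 = python.
titles = [
    'center a div with css flexbox', 'css grid layout question',
    'css hover animation not working', 'style a button with css',
    'python list comprehension help', 'python pandas dataframe merge',
    'python script to scrape a site', 'learning python decorators',
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

# Hypothetical stand-in for best_pipe.
pipe = Pipeline([
    ('count_vec', CountVectorizer()),
    ('logreg', LogisticRegression(max_iter=1000)),
])

# In-sample score: fit and score on the same rows.
pipe.fit(titles, labels)
in_sample_score = pipe.score(titles, labels)

# Held-out scores: each fold is scored by a model that never saw it.
cv_scores = cross_val_score(pipe, titles, labels, cv=2)

print(in_sample_score, cv_scores, cv_scores.mean())
```

The cross-validated mean is usually the more honest figure to compare against the test scores from the model comparison.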


### Model Improvement

In [None]:
# baseline: share of each class (wrapped in a Series in case the labeler returns an array)
pd.Series(y).value_counts(normalize=True)

In [None]:
# how much improvement over the baseline (share of the most frequent class)
best_pipe_score - pd.Series(y).value_counts(normalize=True).max()
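
A common baseline for a classifier is the share of the most frequent class, which is the accuracy of always predicting that class. With integer-encoded labels, indexing the result of `value_counts` with `[0]` looks up the class labelled 0, which is not necessarily the majority class. A small sketch with hypothetical labels:

```python
import pandas as pd

# Hypothetical encoded labels; class 2 is the most frequent here.
encoded_labels = pd.Series([0, 1, 2, 2, 2, 3, 3, 1])

class_shares = encoded_labels.value_counts(normalize=True)
baseline = class_shares.max()            # share of the majority class
majority_class = class_shares.idxmax()   # label of the majority class

print(majority_class, baseline)
```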

In [None]:
# difference between the refit best model and the worst model from the comparison
best_pipe_score - min(compare_df['best_test_score'])

In [None]:
# difference between the in-sample score after refitting on the entire dataset and the best held-out test score
best_pipe_score - best_model['best_test_score']