<img src="../images/vegan-logo-resized.png" style="float: right; margin: 10px;">

# Model Tuning

Author: Gifford Tompkins

---

Project 03 | Notebook 5 of 6

## OBJECTIVE 
In this notebook, we build helper functions that loop over every candidate vectorizer and classifier pairing and use a grid search to find the best hyperparameters for each one.

## Vectorizers 

- `CountVectorizer`
- `TfidfVectorizer`
- `TruncatedSVD`

We will want to place these vectorizers into a GridSearchCV, so we will create the grid parameters for the specific vectorizers that we pass.
```python
CountVectorizer(input='content',
    encoding='utf-8',
    decode_error='strict',
    strip_accents=None,
    lowercase=True,
    preprocessor=None,
    tokenizer=None,
    stop_words=None,
    token_pattern='(?u)\\b\\w\\w+\\b',
    ngram_range=(1, 1),
    analyzer='word',
    max_df=1.0,
    min_df=1,
    max_features=None,
    vocabulary=None,
    binary=False,
    dtype=<class 'numpy.int64'>
)

TfidfVectorizer(
    input='content',
    encoding='utf-8',
    decode_error='strict',
    strip_accents=None,
    lowercase=True,
    preprocessor=None,
    tokenizer=None,
    analyzer='word',
    stop_words=None,
    token_pattern='(?u)\\b\\w\\w+\\b',
    ngram_range=(1, 1),
    max_df=1.0,
    min_df=1,
    max_features=None,
    vocabulary=None,
    binary=False,
    dtype=<class 'numpy.float64'>,
    norm='l2',
    use_idf=True,
    smooth_idf=True,
    sublinear_tf=False,
)
```
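
To make the grid-parameter idea concrete, here is a minimal sketch of how a vectorizer's arguments are addressed once it sits inside a `Pipeline`: each key is the step name, two underscores, then the argument name. The corpus, labels, and grid values are made up for illustration and are not the project's data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny made-up corpus, for illustration only.
docs = ["tofu stir fry", "beef stew", "lentil soup", "chicken soup",
        "tofu scramble", "beef burger", "lentil curry", "chicken curry"]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

pipe = Pipeline([("vec", CountVectorizer()), ("nb", MultinomialNB())])

# Keys follow the pattern "<step name>__<parameter name>".
param_grid = {
    "vec__ngram_range": [(1, 1), (1, 2)],
    "vec__stop_words": [None, "english"],
}

search = GridSearchCV(pipe, param_grid=param_grid, cv=2, scoring="accuracy")
search.fit(docs, labels)
print(search.best_params_)
```

Any argument shown in the signatures above can be tuned the same way, as long as the prefix matches the step name used in the pipeline.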

### SVD: Singular Value Decomposition
```python
TruncatedSVD(
    n_components=2,
    algorithm='randomized',
    n_iter=5,
    random_state=None,
    tol=0.0,
)
```
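
As a rough illustration of what `TruncatedSVD` does to vectorizer output, the sketch below reduces a small TF-IDF matrix to two components. The documents are made up, and `n_components=2` is only the library default, not a tuned value.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up documents, for illustration only.
docs = [
    "tofu stir fry with rice",
    "beef stew with potatoes",
    "lentil soup with carrots",
    "chicken soup with noodles",
    "tofu scramble with spinach",
    "beef burger with cheese",
]

tfidf = TfidfVectorizer().fit_transform(docs)  # sparse, shape (n_docs, n_terms)

# TruncatedSVD accepts the sparse matrix directly and returns a dense array.
svd = TruncatedSVD(n_components=2, random_state=42)
reduced = svd.fit_transform(tfidf)

print(tfidf.shape, "->", reduced.shape)
print(svd.explained_variance_ratio_)
```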

## Classifiers
The list of Estimators we have is also extensive.

- Naive Bayes
    - `BernoulliNB`
    - `MultinomialNB`
    - `GaussianNB`
- Decision Trees
    - `DecisionTreeClassifier`
    - `RandomForestClassifier`
    - `Extraa

In [1]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

from sklearn.naive_bayes import MultinomialNB, GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier, VotingClassifier
from sklearn.svm import SVC, NuSVC
from sklearn.linear_model import LogisticRegression, Lasso, RidgeClassifier
import numpy as np
import pandas as pd

from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

import time
from project_python_code import parse_performance

from sklearn.preprocessing import FunctionTransformer

## Parameters 

In [74]:
vectorizer_params = {'vec__max_df': [0.80, 0.90, 1.0],
    'vec__max_features': [1000, 3000],
    'vec__min_df': [3, 5],
    'vec__ngram_range': [(1, 1),(1,2)],
    'vec__stop_words': [None,'english'],
}

param_reference = {}

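# Vectorizer Parameter Dictionaries
# Note: count_vectorizer_params and tfid_vectorizer_params are defined in the
# cells In [75] and In [76] below, so those cells must be run before this one.
param_reference[type(CountVectorizer())] = {'name':'vec','parameters':count_vectorizer_params}
param_reference[type(TfidfVectorizer())] = {'name':'vec','parameters':tfid_vectorizer_params}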


# SVD Parameter Dictionaries
param_reference[type(TruncatedSVD())] = {'name':'svd',
                                         'parameters':{
                                             'svd__n_components':[1000, 2000, 3000]}}


# Decision Trees, Baggers, Random and Extra Forests
param_reference[type(DecisionTreeClassifier())] = {'name':'tree',
                                                   'parameters':{
                                                       'tree__max_depth': [3, 7],
                                                       'tree__min_samples_split': [5, 10],
                                                       'tree__min_samples_leaf': [2, 5, 7]}
                                                  }

param_reference[type(BaggingClassifier())] = {'name':'bagg',
                                              'parameters':{
                                                  'bagg__base_estimator': [None],
                                                  'bagg__n_estimators': [10, 15, 20]}}

param_reference[type(RandomForestClassifier())] = {'name':'tree',
                                                   'parameters':{
                                                       'tree__max_depth': [None, 3, 7, 10],
                                                       'tree__min_samples_split': [5, 10, 20],
                                                       'tree__min_samples_leaf': [2, 4, 7],
                                                       'tree__n_estimators': [6, 8, 10]}}

param_reference[type(ExtraTreesClassifier())] = param_reference[type(RandomForestClassifier())]


# Adaboosters
param_reference[type(AdaBoostClassifier())] = {'name':'boost',
                                               'parameters':{
                                                   'boost__n_estimators': [45, 50, 55],
                                                   'boost__base_estimator__max_depth': [1, 2, 3]}}

# param_reference[type(GradientBoostingClassifier())]


# Logistic and Linear Model Regressions
param_reference[type(LogisticRegression())] = {'name':'logreg',
                                               'parameters':{
                                                   'logreg__C':np.logspace(-2,2,4)}}

param_reference[type(Lasso())] = {'name':'lasso',
                                  'parameters':{
                                      'lasso__alpha':np.logspace(-2,2,4)}}

param_reference[type(RidgeClassifier())] = {'name':'ridge',
                                            'parameters':{
                                                'ridge__alpha':np.logspace(-2,2,4)}}


# Naive Bayes Models
param_reference[type(MultinomialNB())] = {'name':'nb',
                                          'parameters':{
                                              'nb__alpha':np.logspace(-2,2,4)}}

param_reference[type(GaussianNB())] = {'name':'nb',
                                       'parameters':{}}
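
The lookup pattern used by `param_reference` can be sketched in isolation. Since `type(MultinomialNB())` is simply the class `MultinomialNB`, the class itself works as the dictionary key. The two-entry `registry` and the `lookup` helper below are hypothetical stand-ins, not part of the notebook.

```python
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier

# A two-entry stand-in for the full param_reference dictionary (hypothetical values).
registry = {
    DecisionTreeClassifier: {"name": "tree", "parameters": {"tree__max_depth": [3, 7]}},
    MultinomialNB: {"name": "nb", "parameters": {"nb__alpha": [0.1, 1.0]}},
}


def lookup(estimator, registry):
    """Return (step name, parameter grid) for an estimator instance."""
    try:
        entry = registry[type(estimator)]
    except KeyError:
        # Fail with a readable message when a classifier has no registered grid.
        raise KeyError(
            f"No parameter grid registered for {type(estimator).__name__}"
        ) from None
    return entry["name"], entry["parameters"]


name, grid = lookup(MultinomialNB(alpha=0.5), registry)
print(name, grid)
```

One consequence of keying on the class: two estimators of the same class always share one grid, whatever arguments they were constructed with.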

In [75]:
count_vectorizer_params = {'vec__max_df': [0.8],
                             'vec__max_features': [3000],
                             'vec__min_df': [5],
                             'vec__ngram_range': [(1, 2)],
                             'vec__stop_words': ['english']}

In [76]:
tfid_vectorizer_params = {'vec__max_df': [0.8],
                          'vec__max_features': [3000],
                          'vec__min_df': [3],
                          'vec__ngram_range': [(1, 2)],
                          'vec__stop_words': ['english']}

In [108]:
transformers = [CountVectorizer(max_df = 0.8,
                             max_features = 3000,
                             min_df = 5,
                             ngram_range = (1, 2),
                             stop_words = 'english')
                ,TfidfVectorizer(max_df = 0.8,
                             max_features = 3000,
                             min_df = 3,
                             ngram_range = (1, 2),
                             stop_words = 'english')
#                 ,TruncatedSVD()
               ]

classifiers = [DecisionTreeClassifier(random_state=42)
               ,BaggingClassifier(random_state=42)
               ,RandomForestClassifier(random_state=42)
               ,ExtraTreesClassifier(random_state=42)
               ,AdaBoostClassifier(base_estimator=DecisionTreeClassifier(),random_state=42)
               ,LogisticRegression(random_state=42)
#                ,Lasso()
               ,RidgeClassifier(random_state=42)
               ,MultinomialNB()
               ,GaussianNB()]

In [109]:
pd.DataFrame({'steps': None,
              'best_cross_val': None,
              'best_params': None,
              'train_score': None,
              'test_score': None,
              'sensitivity': None,
              'specificity': None,
              'confusion_matrix': None,
              'runtime': None},
             index=[0]
            ).to_csv('../data/model_scores.csv', index=False)
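
One possible alternative to seeding the file with a row of missing values is to write only the header and append result rows later. This is a sketch under assumptions: it uses a temporary path rather than `../data/model_scores.csv`, a shortened column list, and a made-up result row.

```python
import os
import tempfile

import pandas as pd

columns = ["steps", "best_cross_val", "test_score", "runtime"]  # shortened for the example
path = os.path.join(tempfile.mkdtemp(), "model_scores.csv")     # temporary stand-in path

# Write the header only, so no placeholder row is stored.
pd.DataFrame(columns=columns).to_csv(path, index=False)

# A result dict shaped like the ones built in this notebook; the values are made up.
result = {"steps": "vec+nb", "best_cross_val": 0.5, "test_score": 0.5, "runtime": 1.0}

# Append one row without repeating the header.
pd.DataFrame([result], columns=columns).to_csv(path, mode="a", header=False, index=False)

scores = pd.read_csv(path)
print(scores)
```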

In [110]:
def condenser(sparse_matrix):
    return sparse_matrix.toarray()

condenser = FunctionTransformer(condenser, accept_sparse=True, validate=False)
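
The reason for a densifying step is that the vectorizers emit SciPy sparse matrices, while some estimators in the list (`GaussianNB` among them) only accept dense arrays. The sketch below shows the same idea on a tiny hand-built matrix; the names `to_dense` and `densify` are illustrative, not the notebook's.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import FunctionTransformer


def to_dense(matrix):
    """Convert a SciPy sparse matrix to a dense NumPy array."""
    return matrix.toarray()


# accept_sparse=True lets the sparse input through; validate=False skips input checks.
densify = FunctionTransformer(to_dense, accept_sparse=True, validate=False)

X_sparse = sparse.csr_matrix(np.array([[0, 1, 0], [2, 0, 0]]))
X_dense = densify.fit_transform(X_sparse)

print(type(X_dense), X_dense.shape)
```

The trade-off is memory: a dense matrix stores every zero, so this step is best applied only where an estimator needs it.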

In [111]:
def build_pipeline(transformer, classifier, param_reference=param_reference):
    pipeline = Pipeline([
        ('vec',transformer),
        ('condenser', condenser),
        (param_reference[type(classifier)]['name'],classifier)
    ])
    print(f"Pipeline steps: {pipeline.steps}")
    return pipeline

In [112]:
def build_gridsearch(transformer, classifier, param_dict=param_reference):
    pipe_params = param_dict[type(classifier)]['parameters']
    
#     print(pipe_params)
    
    pipeline = build_pipeline(transformer, classifier, param_dict)
    
    gridsearch = GridSearchCV(estimator=pipeline,
                              param_grid=pipe_params,
                              scoring='accuracy',
                              verbose=4,
                              n_jobs=4,
                              cv=3)
    return gridsearch
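
The number of fits a grid search performs is the number of parameter combinations times the number of folds. As a quick check, the sketch below counts the combinations for the decision tree and random forest grids defined above using `ParameterGrid`; with `cv=3` the totals line up with the task counts reported in the log output further down.

```python
from sklearn.model_selection import ParameterGrid

cv = 3  # same fold count as build_gridsearch

tree_grid = {
    "tree__max_depth": [3, 7],
    "tree__min_samples_split": [5, 10],
    "tree__min_samples_leaf": [2, 5, 7],
}

forest_grid = {
    "tree__max_depth": [None, 3, 7, 10],
    "tree__min_samples_split": [5, 10, 20],
    "tree__min_samples_leaf": [2, 4, 7],
    "tree__n_estimators": [6, 8, 10],
}

tree_candidates = len(ParameterGrid(tree_grid))      # 2 * 2 * 3 = 12
forest_candidates = len(ParameterGrid(forest_grid))  # 4 * 3 * 3 * 3 = 108

print(tree_candidates * cv)    # 36 fits
print(forest_candidates * cv)  # 324 fits
```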

In [113]:
def run_evaluate_grid(grid, X_train, X_test, y_train, y_test):
    model = {}
    t_0 = time.time()
    print(f"Fitting model: {t_0}")
    grid.fit(X_train, y_train)
    print(f"Model fit: {time.time()-t_0}")
    
    model['steps'] = [type(step) for _,step in grid.estimator.steps]
    model['best_cross_val'] = grid.best_score_
    model['best_params'] = grid.best_params_
    model['train_score'] = grid.score(X_train, y_train)
    model['test_score'] = grid.score(X_test, y_test)
    model['sensitivity'],model['specificity'],model['confusion_matrix'] = parse_performance(grid, X_test, y_test)
    model['runtime'] = time.time() - t_0
    
#     `model` is a plain dict, so it is wrapped in a DataFrame before writing.
#     with open('../data/model_scores.csv', 'a') as f:
#         pd.DataFrame([model]).to_csv(f, header=False, index=False)
#     print(f"Model with Testing Score: {model['test_score']} appended to model scores.")
    
    return model
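
`parse_performance` is imported from the project's own module and its source is not shown here. As a hedged illustration of what a helper returning sensitivity, specificity, and a confusion matrix might compute, the sketch below derives all three from predictions. The function name, the 0/1 label assumption, and the sample labels are hypothetical.

```python
from sklearn.metrics import confusion_matrix


def sensitivity_specificity(y_true, y_pred):
    """Hypothetical stand-in for parse_performance; assumes binary labels 0 and 1."""
    cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
    tn, fp, fn, tp = cm.ravel()
    sensitivity = tp / (tp + fn)  # share of actual positives predicted positive
    specificity = tn / (tn + fp)  # share of actual negatives predicted negative
    return sensitivity, specificity, cm


# Made-up labels and predictions, for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]

sens, spec, cm = sensitivity_specificity(y_true, y_pred)
print(sens, spec)
print(cm)
```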

## Run Model

In [114]:
df = pd.read_csv('../data/model_data.csv')

In [115]:
X = df['text']
y = df['vegan']

In [116]:
from sklearn.model_selection import train_test_split

In [117]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [118]:
y_train.shape

(10931,)
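
With no `test_size` argument, `train_test_split` holds out 25% of the rows. If the classes are imbalanced, passing `stratify=y` keeps the class proportions the same in both splits. The sketch below uses made-up labels to show the effect; whether stratification is needed for this dataset is not established here.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 80 negatives, 20 positives.
y = np.array([0] * 80 + [1] * 20)
X = np.arange(len(y)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42, stratify=y)

print(len(y_tr), len(y_te))      # 75 / 25 split by default
print(y_tr.mean(), y_te.mean())  # positive rate is preserved in both splits
```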

In [120]:
models = []

In [121]:
for transformer in transformers:
    for classifier in classifiers:
        
        grid = build_gridsearch(transformer, classifier, param_reference)
        print(f"\nCreated grid: {grid.param_grid}")
        
        t_0 = time.time()
        print("Begin fitting and evaluation.")
        models.append(run_evaluate_grid(grid, X_train, X_test, y_train, y_test))
        print(f"Model fit after: {time.time() - t_0} seconds.")
        
        
        
pd.DataFrame(models)

pd.read_csv('../data/model_scores.csv')

Pipeline steps: [('vec', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=0.8, max_features=3000, min_df=5,
        ngram_range=(1, 2), preprocessor=None, stop_words='english',
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)), ('condenser', FunctionTransformer(accept_sparse=True, check_inverse=True,
          func=<function condenser at 0x1a2deea8c8>, inv_kw_args=None,
          inverse_func=None, kw_args=None, pass_y='deprecated',
          validate=False)), ('tree', DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=42,
            splitter='best')

[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  17 tasks      | elapsed:   18.8s
[Parallel(n_jobs=4)]: Done  36 out of  36 | elapsed:   37.4s finished


Model fit: 42.90650486946106
Model fit after: 43.94465708732605 seconds.
Pipeline steps: [('vec', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=0.8, max_features=3000, min_df=5,
        ngram_range=(1, 2), preprocessor=None, stop_words='english',
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)), ('condenser', FunctionTransformer(accept_sparse=True, check_inverse=True,
          func=<function condenser at 0x1a2deea8c8>, inv_kw_args=None,
          inverse_func=None, kw_args=None, pass_y='deprecated',
          validate=False)), ('bagg', BaggingClassifier(base_estimator=None, bootstrap=True,
         bootstrap_features=False, max_features=1.0, max_samples=1.0,
         n_estimators=10, n_jobs=None, oob_score=False, random_state=42,
         verbose=0, warm_start=False))]

Created grid: {'bagg__base_estimato

[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done   5 out of   9 | elapsed:  2.0min remaining:  1.6min
[Parallel(n_jobs=4)]: Done   9 out of   9 | elapsed:  3.3min finished


Model fit: 316.94268894195557
Model fit after: 330.54615592956543 seconds.
Pipeline steps: [('vec', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=0.8, max_features=3000, min_df=5,
        ngram_range=(1, 2), preprocessor=None, stop_words='english',
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)), ('condenser', FunctionTransformer(accept_sparse=True, check_inverse=True,
          func=<function condenser at 0x1a2deea8c8>, inv_kw_args=None,
          inverse_func=None, kw_args=None, pass_y='deprecated',
          validate=False)), ('tree', RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
         

[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  17 tasks      | elapsed:   21.3s
[Parallel(n_jobs=4)]: Done  90 tasks      | elapsed:  1.3min
[Parallel(n_jobs=4)]: Done 213 tasks      | elapsed:  2.3min
[Parallel(n_jobs=4)]: Done 324 out of 324 | elapsed:  3.3min finished


Model fit: 203.439129114151
Model fit after: 204.6124243736267 seconds.
Pipeline steps: [('vec', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=0.8, max_features=3000, min_df=5,
        ngram_range=(1, 2), preprocessor=None, stop_words='english',
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)), ('condenser', FunctionTransformer(accept_sparse=True, check_inverse=True,
          func=<function condenser at 0x1a2deea8c8>, inv_kw_args=None,
          inverse_func=None, kw_args=None, pass_y='deprecated',
          validate=False)), ('tree', ExtraTreesClassifier(bootstrap=False, class_weight=None, criterion='gini',
           max_depth=None, max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_w

[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  17 tasks      | elapsed:   43.0s
[Parallel(n_jobs=4)]: Done  90 tasks      | elapsed:  2.7min
[Parallel(n_jobs=4)]: Done 213 tasks      | elapsed:  3.7min
[Parallel(n_jobs=4)]: Done 324 out of 324 | elapsed:  4.7min finished


Model fit: 295.1760609149933
Model fit after: 296.3481879234314 seconds.
Pipeline steps: [('vec', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=0.8, max_features=3000, min_df=5,
        ngram_range=(1, 2), preprocessor=None, stop_words='english',
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)), ('condenser', FunctionTransformer(accept_sparse=True, check_inverse=True,
          func=<function condenser at 0x1a2deea8c8>, inv_kw_args=None,
          inverse_func=None, kw_args=None, pass_y='deprecated',
          validate=False)), ('boost', AdaBoostClassifier(algorithm='SAMME.R',
          base_estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_

[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  17 tasks      | elapsed:  4.0min
[Parallel(n_jobs=4)]: Done  27 out of  27 | elapsed:  7.9min remaining:    0.0s
[Parallel(n_jobs=4)]: Done  27 out of  27 | elapsed:  7.9min finished


Model fit: 554.1015543937683
Model fit after: 555.8675911426544 seconds.
Pipeline steps: [('vec', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=0.8, max_features=3000, min_df=5,
        ngram_range=(1, 2), preprocessor=None, stop_words='english',
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)), ('condenser', FunctionTransformer(accept_sparse=True, check_inverse=True,
          func=<function condenser at 0x1a2deea8c8>, inv_kw_args=None,
          inverse_func=None, kw_args=None, pass_y='deprecated',
          validate=False)), ('logreg', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=42, solver='warn',
          tol=0.0001, verbose=0, warm_start=False))]

Created

[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done   9 out of  12 | elapsed:    6.9s remaining:    2.3s
[Parallel(n_jobs=4)]: Done  12 out of  12 | elapsed:    7.4s finished


Model fit: 8.986255884170532
Model fit after: 10.185841083526611 seconds.
Pipeline steps: [('vec', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=0.8, max_features=3000, min_df=5,
        ngram_range=(1, 2), preprocessor=None, stop_words='english',
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)), ('condenser', FunctionTransformer(accept_sparse=True, check_inverse=True,
          func=<function condenser at 0x1a2deea8c8>, inv_kw_args=None,
          inverse_func=None, kw_args=None, pass_y='deprecated',
          validate=False)), ('ridge', RidgeClassifier(alpha=1.0, class_weight=None, copy_X=True, fit_intercept=True,
        max_iter=None, normalize=False, random_state=42, solver='auto',
        tol=0.001))]

Created grid: {'ridge__alpha': array([1.00000000e-02, 2.15443469e-01, 4.64158883e+00, 1.00000000e+0

[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done   9 out of  12 | elapsed:   15.4s remaining:    5.1s
[Parallel(n_jobs=4)]: Done  12 out of  12 | elapsed:   15.5s finished


Model fit: 18.00074601173401
Model fit after: 19.333905935287476 seconds.
Pipeline steps: [('vec', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=0.8, max_features=3000, min_df=5,
        ngram_range=(1, 2), preprocessor=None, stop_words='english',
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)), ('condenser', FunctionTransformer(accept_sparse=True, check_inverse=True,
          func=<function condenser at 0x1a2deea8c8>, inv_kw_args=None,
          inverse_func=None, kw_args=None, pass_y='deprecated',
          validate=False)), ('nb', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))]

Created grid: {'nb__alpha': array([1.00000000e-02, 2.15443469e-01, 4.64158883e+00, 1.00000000e+02])}
Begin fitting and evaluation.
Fitting model: 1571366162.790238
Fitting 3 folds for each of 4 candidates, totalli

[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done   9 out of  12 | elapsed:    7.4s remaining:    2.5s
[Parallel(n_jobs=4)]: Done  12 out of  12 | elapsed:    7.4s finished


Model fit: 8.737454175949097
Model fit after: 10.095630884170532 seconds.
Pipeline steps: [('vec', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=0.8, max_features=3000, min_df=5,
        ngram_range=(1, 2), preprocessor=None, stop_words='english',
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)), ('condenser', FunctionTransformer(accept_sparse=True, check_inverse=True,
          func=<function condenser at 0x1a2deea8c8>, inv_kw_args=None,
          inverse_func=None, kw_args=None, pass_y='deprecated',
          validate=False)), ('nb', GaussianNB(priors=None, var_smoothing=1e-09))]

Created grid: {}
Begin fitting and evaluation.
Fitting model: 1571366172.906958
Fitting 3 folds for each of 1 candidates, totalling 3 fits


[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done   3 out of   3 | elapsed:    3.1s remaining:    0.0s
[Parallel(n_jobs=4)]: Done   3 out of   3 | elapsed:    3.1s finished


Model fit: 4.835395812988281
Model fit after: 6.8293821811676025 seconds.
Pipeline steps: [('vec', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=0.8, max_features=3000, min_df=3,
        ngram_range=(1, 2), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words='english', strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)), ('condenser', FunctionTransformer(accept_sparse=True, check_inverse=True,
          func=<function condenser at 0x1a2deea8c8>, inv_kw_args=None,
          inverse_func=None, kw_args=None, pass_y='deprecated',
          validate=False)), ('tree', DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
         

[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  17 tasks      | elapsed:   17.7s
[Parallel(n_jobs=4)]: Done  36 out of  36 | elapsed:   36.7s finished


Model fit: 42.13371801376343
Model fit after: 43.18635892868042 seconds.
Pipeline steps: [('vec', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=0.8, max_features=3000, min_df=3,
        ngram_range=(1, 2), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words='english', strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)), ('condenser', FunctionTransformer(accept_sparse=True, check_inverse=True,
          func=<function condenser at 0x1a2deea8c8>, inv_kw_args=None,
          inverse_func=None, kw_args=None, pass_y='deprecated',
          validate=False)), ('bagg', BaggingClassifier(base_estimator=None, bootstrap=True,
         bootstrap_features=False, max_features=1.0, max_samples=1.0,
         n_estimators=10, n_jobs=None, oob_score=False, random_state=42,
    

[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done   5 out of   9 | elapsed:  2.5min remaining:  2.0min
[Parallel(n_jobs=4)]: Done   9 out of   9 | elapsed:  4.1min finished


Model fit: 418.8007869720459
Model fit after: 432.46011304855347 seconds.
Pipeline steps: [('vec', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=0.8, max_features=3000, min_df=3,
        ngram_range=(1, 2), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words='english', strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)), ('condenser', FunctionTransformer(accept_sparse=True, check_inverse=True,
          func=<function condenser at 0x1a2deea8c8>, inv_kw_args=None,
          inverse_func=None, kw_args=None, pass_y='deprecated',
          validate=False)), ('tree', RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_spl

[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  17 tasks      | elapsed:   21.6s
[Parallel(n_jobs=4)]: Done  90 tasks      | elapsed:  1.3min
[Parallel(n_jobs=4)]: Done 213 tasks      | elapsed:  2.4min
[Parallel(n_jobs=4)]: Done 324 out of 324 | elapsed:  3.4min finished


Model fit: 206.09593987464905
Model fit after: 207.2550449371338 seconds.
Pipeline steps: [('vec', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=0.8, max_features=3000, min_df=3,
        ngram_range=(1, 2), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words='english', strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)), ('condenser', FunctionTransformer(accept_sparse=True, check_inverse=True,
          func=<function condenser at 0x1a2deea8c8>, inv_kw_args=None,
          inverse_func=None, kw_args=None, pass_y='deprecated',
          validate=False)), ('tree', ExtraTreesClassifier(bootstrap=False, class_weight=None, criterion='gini',
           max_depth=None, max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=

[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  17 tasks      | elapsed:   44.2s
[Parallel(n_jobs=4)]: Done  90 tasks      | elapsed:  2.8min
[Parallel(n_jobs=4)]: Done 213 tasks      | elapsed:  3.8min
[Parallel(n_jobs=4)]: Done 324 out of 324 | elapsed:  4.8min finished


Model fit: 300.5494351387024
Model fit after: 301.746150970459 seconds.
Pipeline steps: [('vec', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=0.8, max_features=3000, min_df=3,
        ngram_range=(1, 2), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words='english', strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)), ('condenser', FunctionTransformer(accept_sparse=True, check_inverse=True,
          func=<function condenser at 0x1a2deea8c8>, inv_kw_args=None,
          inverse_func=None, kw_args=None, pass_y='deprecated',
          validate=False)), ('boost', AdaBoostClassifier(algorithm='SAMME.R',
          base_estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
         

[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  17 tasks      | elapsed:  3.9min
[Parallel(n_jobs=4)]: Done  27 out of  27 | elapsed:  7.7min remaining:    0.0s
[Parallel(n_jobs=4)]: Done  27 out of  27 | elapsed:  7.7min finished


Model fit: 525.0438668727875
Model fit after: 526.7477450370789 seconds.
Pipeline steps: [('vec', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=0.8, max_features=3000, min_df=3,
        ngram_range=(1, 2), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words='english', strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)), ('condenser', FunctionTransformer(accept_sparse=True, check_inverse=True,
          func=<function condenser at 0x1a2deea8c8>, inv_kw_args=None,
          inverse_func=None, kw_args=None, pass_y='deprecated',
          validate=False)), ('logreg', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=42, solv

[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done   9 out of  12 | elapsed:    5.4s remaining:    1.8s
[Parallel(n_jobs=4)]: Done  12 out of  12 | elapsed:    5.6s finished


Model fit: 6.781074047088623
Model fit after: 7.8101513385772705 seconds.
Pipeline steps: [('vec', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=0.8, max_features=3000, min_df=3,
        ngram_range=(1, 2), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words='english', strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)), ('condenser', FunctionTransformer(accept_sparse=True, check_inverse=True,
          func=<function condenser at 0x1a2deea8c8>, inv_kw_args=None,
          inverse_func=None, kw_args=None, pass_y='deprecated',
          validate=False)), ('ridge', RidgeClassifier(alpha=1.0, class_weight=None, copy_X=True, fit_intercept=True,
        max_iter=None, normalize=False, random_state=42, solver='auto',
        tol=0.001))]

Created grid: {'ridge__alph

[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done   9 out of  12 | elapsed:   14.0s remaining:    4.7s
[Parallel(n_jobs=4)]: Done  12 out of  12 | elapsed:   14.1s finished


Model fit: 16.4378719329834
Model fit after: 17.614206075668335 seconds.
Pipeline steps: [('vec', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=0.8, max_features=3000, min_df=3,
        ngram_range=(1, 2), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words='english', strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)), ('condenser', FunctionTransformer(accept_sparse=True, check_inverse=True,
          func=<function condenser at 0x1a2deea8c8>, inv_kw_args=None,
          inverse_func=None, kw_args=None, pass_y='deprecated',
          validate=False)), ('nb', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))]

Created grid: {'nb__alpha': array([1.00000000e-02, 2.15443469e-01, 4.64158883e+00, 1.00000000e+02])}
Begin fitting and evaluation.
Fitting mod

[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done   9 out of  12 | elapsed:    6.6s remaining:    2.2s
[Parallel(n_jobs=4)]: Done  12 out of  12 | elapsed:    6.6s finished


Model fit: 7.900869131088257
Model fit after: 9.061836004257202 seconds.
Pipeline steps: [('vec', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=0.8, max_features=3000, min_df=3,
        ngram_range=(1, 2), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words='english', strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)), ('condenser', FunctionTransformer(accept_sparse=True, check_inverse=True,
          func=<function condenser at 0x1a2deea8c8>, inv_kw_args=None,
          inverse_func=None, kw_args=None, pass_y='deprecated',
          validate=False)), ('nb', GaussianNB(priors=None, var_smoothing=1e-09))]

Created grid: {}
Begin fitting and evaluation.
Fitting model: 1571367725.802698
Fitting 3 folds for each of 1 candidates, totalling 3 fits


[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done   3 out of   3 | elapsed:    3.1s remaining:    0.0s
[Parallel(n_jobs=4)]: Done   3 out of   3 | elapsed:    3.1s finished


Model fit: 4.797494173049927
Model fit after: 6.832221031188965 seconds.
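
The `nb__alpha` grid logged above (`1e-02, 2.15e-01, 4.64e+00, 1e+02`) is four points evenly spaced on a log scale between 10^-2 and 10^2. The code that builds the grids is not shown in this part of the notebook, so the following is only a sketch of how such a grid can be reproduced. That the values came from something like `np.logspace(-2, 2, 4)` is an assumption, and `log_grid` is a hypothetical helper written in pure Python for illustration.

```python
import math


def log_grid(start_exp, stop_exp, num):
    """Return `num` values evenly spaced on a log10 scale (hypothetical helper).

    Computes the same values as np.logspace(start_exp, stop_exp, num),
    without requiring NumPy.
    """
    if num == 1:
        return [10.0 ** start_exp]
    step = (stop_exp - start_exp) / (num - 1)
    return [10.0 ** (start_exp + i * step) for i in range(num)]


# Four candidates between 1e-2 and 1e2, as in the logged nb__alpha grid.
alphas = log_grid(-2, 2, 4)
print(alphas)
```

A log-spaced grid is a common choice for regularization strengths such as `alpha` or `C`, because their effect tends to change by orders of magnitude rather than linearly.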


|   | steps | best_cross_val | best_params | train_score | test_score | sensitivity | specificity | confusion_matrix | runtime |
|---|---|---|---|---|---|---|---|---|---|
| 0 |  |  |  |  |  |  |  |  |  |


In [123]:
model_df = pd.DataFrame(models)

In [81]:
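# Grid-search a TF-IDF -> condenser -> decision-tree pipeline.
# The vectorizer grid is looked up under the CountVectorizer key; its
# `vec__` parameters are ones TfidfVectorizer also accepts.
grid = GridSearchCV(
    estimator=Pipeline([
        ('vec', TfidfVectorizer()),
        ('condenser', condenser),
        ('tree', DecisionTreeClassifier()),
    ]),
    param_grid={
        **param_reference[CountVectorizer]['parameters'],
        **param_reference[DecisionTreeClassifier]['parameters'],
    },
    verbose=3,
    n_jobs=3,
    cv=3,
)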

In [82]:
grid.param_grid

{'vec__stop_words': [None, 'english'],
 'vec__ngram_range': [(1, 1), (1, 2)],
 'vec__max_df': [0.8, 0.9, 1.0],
 'vec__min_df': [3, 5],
 'vec__max_features': [1000, 3000],
 'tree__random_state': [42]}
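
The fit log for this search reports 48 candidates and 144 fits. As a quick check of where those numbers come from, the sketch below expands the grid shown above into its Cartesian product using only the standard library. It mirrors what scikit-learn's `ParameterGrid` enumerates, but it is an illustration, not the notebook's own code; `cv_folds` is a name introduced here.

```python
from itertools import product

# Parameter grid copied from the `grid.param_grid` output above.
param_grid = {
    'vec__stop_words': [None, 'english'],
    'vec__ngram_range': [(1, 1), (1, 2)],
    'vec__max_df': [0.8, 0.9, 1.0],
    'vec__min_df': [3, 5],
    'vec__max_features': [1000, 3000],
    'tree__random_state': [42],
}
cv_folds = 3  # matches cv=3 in the GridSearchCV call

# One candidate per element of the Cartesian product of all value lists.
candidates = [
    dict(zip(param_grid, combo)) for combo in product(*param_grid.values())
]
n_candidates = len(candidates)
n_fits = n_candidates * cv_folds  # each candidate is fit once per fold

print(n_candidates, n_fits)  # 48 144
```

Because the count is a product of the list lengths, adding a single two-valued parameter doubles the number of fits, which is worth keeping in mind given the runtimes reported below.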

In [83]:
t_0 = time.time()

In [84]:
grid.fit(X_train, y_train)

Fitting 3 folds for each of 48 candidates, totalling 144 fits


[Parallel(n_jobs=3)]: Using backend LokyBackend with 3 concurrent workers.
[Parallel(n_jobs=3)]: Done  26 tasks      | elapsed:  1.1min
[Parallel(n_jobs=3)]: Done 122 tasks      | elapsed:  9.1min
[Parallel(n_jobs=3)]: Done 144 out of 144 | elapsed: 12.3min finished


GridSearchCV(cv=3, error_score='raise-deprecating',
       estimator=Pipeline(memory=None,
     steps=[('vec', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
 ...      min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best'))]),
       fit_params=None, iid='warn', n_jobs=3,
       param_grid={'vec__stop_words': [None, 'english'], 'vec__ngram_range': [(1, 1), (1, 2)], 'vec__max_df': [0.8, 0.9, 1.0], 'vec__min_df': [3, 5], 'vec__max_features': [1000, 3000], 'tree__random_state': [42]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=3)

In [85]:
print((time.time() - t_0)/60)

13.282081695397695
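
The timing cells above set `t_0` in one cell and read it in another, which is easy to get wrong if cells are re-run out of order. A minimal sketch of an alternative is a small context manager that records elapsed seconds under a label. `timer` and `runtimes` are hypothetical names, and the `time.sleep` call only stands in for `grid.fit(X_train, y_train)`.

```python
import time
from contextlib import contextmanager


@contextmanager
def timer(label, results):
    """Store the wall-clock seconds spent in the block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


runtimes = {}

# Stand-in workload; in the notebook this would wrap grid.fit(...).
with timer('tree', runtimes):
    time.sleep(0.01)

print(f"{runtimes['tree'] / 60:.4f} minutes")
```

Note that the parallel log reports 12.3 minutes for the 144 cross-validation fits while the wall-clock total is about 13.3 minutes. With `refit=True`, the search refits the best pipeline on the full training set after cross-validation, which likely accounts for part of that difference.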


In [86]:
grid.best_params_

{'tree__random_state': 42,
 'vec__max_df': 0.8,
 'vec__max_features': 3000,
 'vec__min_df': 3,
 'vec__ngram_range': (1, 2),
 'vec__stop_words': 'english'}
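
The keys in `best_params_` follow scikit-learn's `<step>__<parameter>` naming, where the part before the first double underscore is the pipeline step. The sketch below groups the result above by step so the vectorizer and classifier settings can be read separately; `group_by_step` is a hypothetical helper, not part of the notebook.

```python
# Copied from the `grid.best_params_` output above.
best_params = {
    'tree__random_state': 42,
    'vec__max_df': 0.8,
    'vec__max_features': 3000,
    'vec__min_df': 3,
    'vec__ngram_range': (1, 2),
    'vec__stop_words': 'english',
}


def group_by_step(params):
    """Group 'step__param' keys into {step: {param: value}}."""
    grouped = {}
    for key, value in params.items():
        # Split on the first '__' only, so nested names such as
        # 'boost__base_estimator__max_depth' keep their remainder intact.
        step, _, name = key.partition('__')
        grouped.setdefault(step, {})[name] = value
    return grouped


by_step = group_by_step(best_params)
print(by_step['vec'])
print(by_step['tree'])
```

The `vec` settings found here (`max_df=0.8`, `max_features=3000`, `min_df=3`, `ngram_range=(1, 2)`, `stop_words='english'`) are the same values that appear in the `TfidfVectorizer` repr in the fitting logs earlier in this notebook.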

In [88]:
param_reference[type(CountVectorizer())] = count_vectorizer_params
param_reference[type(TfidfVectorizer())] = tfid_vectorizer_params

In [None]:
# cvec = CountVectorizer()
# X_train_cv = cvec.fit_transform(X_train)

# X_train_cv

# X_train_cv.toarray()

# X_train

# logreg = LogisticRegression()

# transformers

# classifiers

# t_0 = time.time()
# tfid = TfidfVectorizer()
# X_train_vec = tfid.fit_transform(X_train)
# X_test_vec = tfid.transform(X_test)

# gs = GridSearchCV(DecisionTreeClassifier(),
#                   {'max_depth': [None, 3, 7, 10],
#                    'min_samples_split': [5, 10, 20],
#                    'min_samples_leaf': [2, 4, 7],
#                    'random_state': [42]},
#                   cv=5
#                  )

# gs.fit(X_train_vec,y_train)
# print(time.time()-t_0)

# param_reference[type(DecisionTreeClassifier())]

In [125]:
model_df.to_csv('../data/model_scores.csv', index=False)

In [126]:
model_df

|   | steps | best_cross_val | best_params | train_score | test_score | sensitivity | specificity | confusion_matrix | runtime |
|---|---|---|---|---|---|---|---|---|---|
| 0 | `[<class 'sklearn.feature_extraction.text.Count...` | 0.723447 | `{'tree__max_depth': 7, 'tree__min_samples_leaf...` | 0.732961 | 0.720088 | 0.940437 | 0.497795 | (903, 911, 109, 1721) | 43.944631 |
| 1 | `[<class 'sklearn.feature_extraction.text.Count...` | 0.770927 | `{'bagg__base_estimator': None, 'bagg__n_estima...` | 0.974476 | 0.760154 | 0.740984 | 0.779493 | (1414, 400, 474, 1356) | 330.546137 |
| 2 | `[<class 'sklearn.feature_extraction.text.Count...` | 0.786662 | `{'tree__max_depth': None, 'tree__min_samples_l...` | 0.864056 | 0.777168 | 0.778689 | 0.775634 | (1407, 407, 405, 1425) | 204.61241 |
| 3 | `[<class 'sklearn.feature_extraction.text.Count...` | 0.782454 | `{'tree__max_depth': None, 'tree__min_samples_l...` | 0.844113 | 0.761526 | 0.748087 | 0.775083 | (1406, 408, 461, 1369) | 296.348172 |
| 4 | `[<class 'sklearn.feature_extraction.text.Count...` | 0.770286 | `{'boost__base_estimator__max_depth': 2, 'boost...` | 0.80505 | 0.76674 | 0.833333 | 0.699559 | (1269, 545, 305, 1525) | 555.867576 |
| 5 | `[<class 'sklearn.feature_extraction.text.Count...` | 0.788949 | `{'logreg__C': 0.01}` | 0.80752 | 0.787322 | 0.853005 | 0.721058 | (1308, 506, 269, 1561) | 10.185824 |
| 6 | `[<class 'sklearn.feature_extraction.text.Count...` | 0.777239 | `{'ridge__alpha': 100.0}` | 0.820968 | 0.77854 | 0.863388 | 0.692944 | (1257, 557, 250, 1580) | 19.333888 |
| 7 | `[<class 'sklearn.feature_extraction.text.Count...` | 0.752447 | `{'nb__alpha': 4.6415888336127775}` | 0.773488 | 0.749726 | 0.692896 | 0.807056 | (1464, 350, 562, 1268) | 10.095613 |
| 8 | `[<class 'sklearn.feature_extraction.text.Count...` | 0.688592 | `{}` | 0.733693 | 0.705543 | 0.831694 | 0.57828 | (1049, 765, 308, 1522) | 6.829363 |
| 9 | `[<class 'sklearn.feature_extraction.text.Tfidf...` | 0.72052 | `{'tree__max_depth': 7, 'tree__min_samples_leaf...` | 0.737718 | 0.722558 | 0.942077 | 0.501103 | (909, 905, 106, 1724) | 43.186298 |
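
The `sensitivity`, `specificity`, and `test_score` columns can all be recovered from the `confusion_matrix` tuple. The sketch below does this for row 0. The `(tn, fp, fn, tp)` ordering is inferred from the numbers in the table (it is the order produced by flattening a binary scikit-learn confusion matrix), and `metrics_from_confusion` is a hypothetical helper, not the notebook's own function.

```python
def metrics_from_confusion(tn, fp, fn, tp):
    """Return (sensitivity, specificity, accuracy) from binary counts."""
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    return sensitivity, specificity, accuracy


# Row 0 of the table above: confusion_matrix = (903, 911, 109, 1721).
sens, spec, acc = metrics_from_confusion(903, 911, 109, 1721)
print(round(sens, 6), round(spec, 6), round(acc, 6))
# 0.940437 0.497795 0.720088
```

Read this way, row 0 shows a model that catches about 94% of the positive class while misclassifying roughly half of the negative class, which the single accuracy figure of 0.72 hides.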


In [128]:
model_df.sort_values(by='test_score',ascending=False)

|   | steps | best_cross_val | best_params | train_score | test_score | sensitivity | specificity | confusion_matrix | runtime |
|---|---|---|---|---|---|---|---|---|---|
| 14 | `[<class 'sklearn.feature_extraction.text.Tfidf...` | 0.795261 | `{'logreg__C': 0.21544346900318834}` | 0.821974 | 0.78787 | 0.813661 | 0.761852 | (1382, 432, 341, 1489) | 7.810136 |
| 5 | `[<class 'sklearn.feature_extraction.text.Count...` | 0.788949 | `{'logreg__C': 0.01}` | 0.80752 | 0.787322 | 0.853005 | 0.721058 | (1308, 506, 269, 1561) | 10.185824 |
| 15 | `[<class 'sklearn.feature_extraction.text.Tfidf...` | 0.789955 | `{'ridge__alpha': 100.0}` | 0.804592 | 0.78101 | 0.796175 | 0.765711 | (1389, 425, 373, 1457) | 17.614181 |
| 6 | `[<class 'sklearn.feature_extraction.text.Count...` | 0.777239 | `{'ridge__alpha': 100.0}` | 0.820968 | 0.77854 | 0.863388 | 0.692944 | (1257, 557, 250, 1580) | 19.333888 |
| 2 | `[<class 'sklearn.feature_extraction.text.Count...` | 0.786662 | `{'tree__max_depth': None, 'tree__min_samples_l...` | 0.864056 | 0.777168 | 0.778689 | 0.775634 | (1407, 407, 405, 1425) | 204.61241 |
| 10 | `[<class 'sklearn.feature_extraction.text.Tfidf...` | 0.781539 | `{'bagg__base_estimator': None, 'bagg__n_estima...` | 0.975025 | 0.773875 | 0.772678 | 0.775083 | (1406, 408, 416, 1414) | 432.460099 |
| 11 | `[<class 'sklearn.feature_extraction.text.Tfidf...` | 0.782271 | `{'tree__max_depth': None, 'tree__min_samples_l...` | 0.8507 | 0.772503 | 0.781421 | 0.763506 | (1385, 429, 400, 1430) | 207.255031 |
| 13 | `[<class 'sklearn.feature_extraction.text.Tfidf...` | 0.77239 | `{'boost__base_estimator__max_depth': 2, 'boost...` | 0.807428 | 0.767838 | 0.820765 | 0.714443 | (1296, 518, 328, 1502) | 526.74773 |
| 4 | `[<class 'sklearn.feature_extraction.text.Count...` | 0.770286 | `{'boost__base_estimator__max_depth': 2, 'boost...` | 0.80505 | 0.76674 | 0.833333 | 0.699559 | (1269, 545, 305, 1525) | 555.867576 |
| 12 | `[<class 'sklearn.feature_extraction.text.Tfidf...` | 0.783368 | `{'tree__max_depth': None, 'tree__min_samples_l...` | 0.835422 | 0.765917 | 0.757377 | 0.774531 | (1405, 409, 444, 1386) | 301.746136 |
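
Test accuracy alone does not show how much a model overfits or what it costs to train. As one possible way to read the sorted table, the sketch below computes the train/test gap for a few rows copied from it. The choice of rows and the `gap` measure are illustrative assumptions, not the selection criterion used in the next notebook.

```python
# (index, train_score, test_score, runtime in seconds), copied from the
# sorted table above.
rows = [
    (14, 0.821974, 0.78787, 7.810136),
    (5, 0.80752, 0.787322, 10.185824),
    (15, 0.804592, 0.78101, 17.614181),
    (10, 0.975025, 0.773875, 432.460099),
]

for idx, train, test, runtime in rows:
    gap = train - test  # a larger gap suggests more overfitting
    print(f"model {idx}: test={test:.3f} gap={gap:.3f} runtime={runtime:.0f}s")

# Row with the widest train/test gap among those listed.
largest_gap = max(rows, key=lambda r: r[1] - r[2])
print("largest gap: model", largest_gap[0])
```

Among these rows, model 10 has a train score near 0.975 against a test score of about 0.774 and takes several minutes to fit, while models 14 and 5 reach slightly higher test scores with small gaps and runtimes of roughly ten seconds or less.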


## Summary
We will use the scores gathered here, saved to `../data/model_scores.csv`, to select our model in the next notebook.