# Advanced Pipelines with Grid Search (classification)
**OPIM 5512: Data Science Using Python - University of Connecticut**

---------------------------------
This is where the real magic happens! Because we will fit HUNDREDS or THOUSANDS of models, we won't make boxplots of the output. Instead, we rely on a grid search and simply retrieve the model with the best average cross-validated score. Although this notebook focuses on classification, you can apply the same logic and wisdom to regression problems.

## Load Modules

In [None]:
# Load libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from collections import Counter

# preprocessing
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# model evaluation
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score


# classification spot check models!
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

# more advanced ensemble models
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import AdaBoostClassifier

# Read Data

In [None]:
# let's use gdown to get the data instead of mounting the drive
# https://drive.google.com/file/d/1UwCOmgdOwvpMd58lVlwqUL3w1IRaYJa-/view?usp=sharing
!gdown 1UwCOmgdOwvpMd58lVlwqUL3w1IRaYJa-

Downloading...
From: https://drive.google.com/uc?id=1UwCOmgdOwvpMd58lVlwqUL3w1IRaYJa-
To: /content/breastcancer.csv
100% 125k/125k [00:00<00:00, 48.3MB/s]


In [None]:
df = pd.read_csv('breastcancer.csv')
df.head()

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
0,842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,
2,84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,
3,84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,
4,84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,


The target variable will be `diagnosis`. Let's drop that last unnamed column while we are here. And since `id` doesn't have predictive power, let's drop that too.

In [None]:
df.drop('Unnamed: 32', axis=1, inplace=True)
df.drop('id', axis=1, inplace=True)
df.columns # voila - it's gone!

Index(['diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean',
       'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst',
       'symmetry_worst', 'fractal_dimension_worst'],
      dtype='object')

In [None]:
df.info() # check for any missing values - all looks good!

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 31 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   diagnosis                569 non-null    object 
 1   radius_mean              569 non-null    float64
 2   texture_mean             569 non-null    float64
 3   perimeter_mean           569 non-null    float64
 4   area_mean                569 non-null    float64
 5   smoothness_mean          569 non-null    float64
 6   compactness_mean         569 non-null    float64
 7   concavity_mean           569 non-null    float64
 8   concave points_mean      569 non-null    float64
 9   symmetry_mean            569 non-null    float64
 10  fractal_dimension_mean   569 non-null    float64
 11  radius_se                569 non-null    float64
 12  texture_se               569 non-null    float64
 13  perimeter_se             569 non-null    float64
 14  area_se                  5

If you look at the unique values in `diagnosis`, you will see **M** for malignant and **B** for benign.



In [None]:
Counter(df['diagnosis'])

Counter({'M': 212, 'B': 357})

Our data is imbalanced, but we will ignore this for now. We can use SMOTE later on with an imblearn `Pipeline` (which is different from an sklearn `Pipeline` - be careful!).

So that we don't have to deal with problems in a logistic regression, let's use `LabelEncoder()` from `sklearn`.

In [None]:
from sklearn.preprocessing import LabelEncoder
LE = LabelEncoder()
df['diagnosis'] = LE.fit_transform(df['diagnosis'])
Counter(df['diagnosis'])

Counter({1: 212, 0: 357})

As you can see, B is 0 and M is 1. You could use SMOTE now before all of your pipelines (if you wanted to use it for everything). But for now, we simply ignore the class imbalance.
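
If you ever need to confirm which label received which integer, the fitted encoder keeps that information. The snippet below is a minimal sketch on a handful of made-up labels (not the real `diagnosis` column); `labels`, `le`, and `mapping` are illustrative names.

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels standing in for the `diagnosis` column (illustrative only)
labels = ['M', 'B', 'B', 'M', 'B']

le = LabelEncoder()
encoded = le.fit_transform(labels)

# classes_ is sorted, so a label's position in the array is its integer code
mapping = {str(label): int(code) for code, label in enumerate(le.classes_)}
print(mapping)

# inverse_transform recovers the original labels from the integer codes
decoded = le.inverse_transform(encoded)
print(list(decoded))
```

Because the classes are sorted alphabetically, B comes before M, which is why benign ends up as 0 and malignant as 1.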

# Prepare Data for Modeling (Split, CV, error metrics)
At this point you are ready to:
* Split into X and y
* Make a train and test partition
* Leverage 10-fold cross-validation
* Add a seed for reproducibility
* Make a list of all of the models you are interested in evaluating

In [None]:
# Split-out validation df
X = df.drop('diagnosis', axis=1) #covariates - just drop the target!
y = df['diagnosis'] #target variable
validation_size = 0.20
seed = 123 # so you will split the same way and evaluate the SAME dataset

# split!
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=validation_size,
                                                    random_state=seed)
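
Since the classes are imbalanced, one optional variation is to pass `stratify=y` so that the train and test partitions keep roughly the same class proportions. This is a sketch only, not what the cell above does; it uses synthetic labels with the same class counts as this dataset, and `X_demo` and `y_demo` are placeholder names.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same class counts as the notebook (357 B, 212 M)
y_demo = np.array([0] * 357 + [1] * 212)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature

# stratify=y_demo keeps the class proportions similar in both partitions
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=123, stratify=y_demo)

print('overall share of class 1:', round(y_demo.mean(), 3))
print('train share of class 1:  ', round(y_tr.mean(), 3))
print('test share of class 1:   ', round(y_te.mean(), 3))
```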

## Build Pipeline
* [Logistic Regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)
* [KNN](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)
* [Decision Tree](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html)
* [AdaBoost](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html)
* [GBM](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html)
* [Random Forest](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)
* [Extra Trees](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html)

All of the hyperparameters that you see below came from the documentation. Of course you could include PCA or polynomial features as pre-processing here...

In [None]:
# Construct some pipelines
pipe_lr = Pipeline([('scl', StandardScaler()),
			('clf', LogisticRegression(random_state=42))])

pipe_knn = Pipeline([('scl', StandardScaler()),
			('clf', KNeighborsClassifier())])

pipe_dt = Pipeline([('scl', StandardScaler()),
			('clf', DecisionTreeClassifier(random_state=42))])

pipe_ada = Pipeline([('scl', StandardScaler()),
			('clf', AdaBoostClassifier(random_state=42))])

pipe_gb = Pipeline([('scl', StandardScaler()),
			('clf', GradientBoostingClassifier(random_state=42))])

pipe_rf = Pipeline([('scl', StandardScaler()),
			('clf', RandomForestClassifier(random_state=42))])

pipe_et = Pipeline([('scl', StandardScaler()),
			('clf', ExtraTreesClassifier(random_state=42))])

# and remember, for now these are just boring vanilla defaults... the grid is coming!
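
As mentioned before the cell above, PCA could be added as a pre-processing step. Here is a minimal sketch of what that might look like, fit on synthetic data rather than the breast cancer data. The step name `'pca'`, the name `pipe_lr_pca`, and `n_components=5` are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with 30 features, standing in for the real covariates
X_demo, y_demo = make_classification(n_samples=200, n_features=30,
                                     n_informative=5, random_state=42)

# 'pca' is a hypothetical step name; n_components=5 is an arbitrary choice
pipe_lr_pca = Pipeline([('scl', StandardScaler()),
                        ('pca', PCA(n_components=5)),
                        ('clf', LogisticRegression(random_state=42))])

pipe_lr_pca.fit(X_demo, y_demo)

# The classifier only ever sees the 5 principal components
print(pipe_lr_pca.named_steps['clf'].coef_.shape)
```

With a step like this in place, `pca__n_components` becomes one more hyperparameter you could put in a grid.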

## Define your Parameters for Grid Search

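Note the `clf__` prefix on every key: `clf` is the name we gave the classifier step in the pipelines above, and the double underscore tells `GridSearchCV` which step each hyperparameter belongs to.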
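
To see where these names come from, you can ask a pipeline for its parameters. The following is a small sketch using a throwaway pipeline (`pipe` is an illustrative name) built the same way as the ones above.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([('scl', StandardScaler()),
                 ('clf', LogisticRegression(random_state=42))])

# Every tunable parameter is exposed as <step name>__<parameter name>
clf_params = sorted(k for k in pipe.get_params() if k.startswith('clf__'))
print(clf_params[:3])

# set_params understands the same naming convention
pipe.set_params(clf__C=10)
print(pipe.named_steps['clf'].C)
```

If you had named the step `'model'` instead of `'clf'`, the grid keys would need to start with `model__`.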

In [None]:
# Set grid search params
grid_params_lr = [{'clf__penalty': ['l1', 'l2'],
                  'clf__C': [1, 10],
                  'clf__solver': ['liblinear'],
                   'clf__max_iter': [1000000]}]

grid_params_knn = [{'clf__n_neighbors': [1, 3, 5, 10, 50]}]

grid_params_dt = [{'clf__criterion': ['gini', 'entropy'],
                  'clf__min_samples_leaf': [5, 10, 20, 25],
                  'clf__max_depth': [3, 5, 10, 15, 20],
                  'clf__min_samples_split': [5, 10, 20, 25]}]

grid_params_ada = [{'clf__n_estimators': [3, 5, 10, 15, 20],
		                'clf__learning_rate': [0.001, 0.01]}]

grid_params_gb = [{'clf__n_estimators': [3, 5, 10, 15, 20],
                'clf__learning_rate': [0.001, 0.01],
                # note: scikit-learn 1.1 renamed 'deviance' to 'log_loss'; 'deviance' was removed in 1.3
                'clf__loss': ['deviance', 'exponential']}]

grid_params_rf = [{'clf__criterion': ['gini', 'entropy'],
                  'clf__min_samples_leaf': [5, 10, 20, 25],
                  'clf__max_depth': [3, 5, 10, 15, 20],
                  'clf__min_samples_split': [5, 10, 20, 25],
                  'clf__n_estimators': [30, 50, 100, 200, 500]}]

grid_params_et = [{'clf__criterion': ['gini', 'entropy'],
                  'clf__min_samples_leaf': [5, 10, 20, 25],
                  'clf__max_depth': [3, 5, 10, 15, 20],
                  'clf__min_samples_split': [5, 10, 20, 25],
                  'clf__n_estimators': [30, 50, 100, 200, 500]}]
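
It helps to know how much work a grid implies before running it. The sketch below counts the candidate combinations in the random forest grid and multiplies by the 10 folds listed earlier; `n_candidates` and `n_fits` are illustrative names, and the count ignores the one extra refit of the best model at the end.

```python
from math import prod

# Copy of the random forest grid defined above
grid_params_rf = [{'clf__criterion': ['gini', 'entropy'],
                   'clf__min_samples_leaf': [5, 10, 20, 25],
                   'clf__max_depth': [3, 5, 10, 15, 20],
                   'clf__min_samples_split': [5, 10, 20, 25],
                   'clf__n_estimators': [30, 50, 100, 200, 500]}]

n_folds = 10  # matches the 10-fold cross-validation listed earlier

# Each dict in the list is its own grid: multiply the number of options per key
n_candidates = sum(prod(len(v) for v in grid.values()) for grid in grid_params_rf)
n_fits = n_candidates * n_folds

print(n_candidates, 'candidates x', n_folds, 'folds =', n_fits, 'fits')
# prints: 800 candidates x 10 folds = 8000 fits
```

The extra trees grid is the same size, so those two models account for most of the run time.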



## Define your Grid Search

In [None]:
# Construct grid searches

gs_lr = GridSearchCV(estimator=pipe_lr,
    param_grid=grid_params_lr,
    scoring='accuracy',
    cv=10)

gs_knn = GridSearchCV(estimator=pipe_knn,
    param_grid=grid_params_knn,
    scoring='accuracy',
    cv=10)

gs_dt = GridSearchCV(estimator=pipe_dt,
    param_grid=grid_params_dt,
    scoring='accuracy',
    cv=10)

gs_ada = GridSearchCV(estimator=pipe_ada,
    param_grid=grid_params_ada,
    scoring='accuracy',
    cv=10)

gs_gb = GridSearchCV(estimator=pipe_gb,
    param_grid=grid_params_gb,
    scoring='accuracy',
    cv=10)

gs_rf = GridSearchCV(estimator=pipe_rf,
    param_grid=grid_params_rf,
    scoring='accuracy',
    cv=10)

gs_et = GridSearchCV(estimator=pipe_et,
    param_grid=grid_params_et,
    scoring='accuracy',
    cv=10)

# List of pipelines for ease of iteration
grids = [gs_lr, gs_knn, gs_dt, gs_ada, gs_gb, gs_rf, gs_et]

# Dictionary of pipelines and classifier types for ease of reference
grid_dict = {0: 'Logistic Regression',
             1: 'KNN',
             2: 'DTC',
             3: 'ADA',
             4: 'GBC',
             5: 'RFC',
             6: 'ET'}
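
Passing `cv=10` with a classifier gives stratified folds without shuffling. If you would rather have shuffled folds controlled by a seed, one option is to pass a `StratifiedKFold` object instead. This is a sketch on synthetic data with a deliberately tiny grid, not a replacement for the searches above; `gs_demo` and the `clf__C` values are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=123)

pipe = Pipeline([('scl', StandardScaler()),
                 ('clf', LogisticRegression(max_iter=1000))])

# Shuffled, seeded folds so the same splits are used on every run
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=123)

gs_demo = GridSearchCV(estimator=pipe,
                       param_grid=[{'clf__C': [0.1, 1, 10]}],
                       scoring='accuracy',
                       cv=cv)
gs_demo.fit(X_demo, y_demo)
print(gs_demo.best_params_, round(gs_demo.best_score_, 3))
```

`GridSearchCV` also accepts `n_jobs=-1` to spread the fits across all available cores, which can help with the larger grids.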


## Run it! Find the best model
Go get some coffee - this will take a minute!

In [None]:

# Fit the grid search objects
print('Performing model optimizations...')
best_acc = 0.0
best_clf = 0
best_gs = ''
for idx, gs in enumerate(grids):
	print('\nEstimator: %s' % grid_dict[idx])
	# Fit grid search
	gs.fit(X_train, y_train)
	# Best params
	print('Best params: %s' % gs.best_params_)
	# Best mean cross-validated accuracy on the training data (gs.best_score_)
	print('Best training accuracy: %.3f' % gs.best_score_)
	# Predict on test data with best params
	y_pred = gs.predict(X_test)
	# Test data accuracy of model with best params
	print('Test set accuracy score for best params: %.3f ' % accuracy_score(y_test, y_pred))
	# Track best (highest test accuracy) model
	if accuracy_score(y_test, y_pred) > best_acc:
		best_acc = accuracy_score(y_test, y_pred)
		best_gs = gs
		best_clf = idx
print('\nClassifier with best test set accuracy: %s' % grid_dict[best_clf])

Performing model optimizations...

Estimator: Logistic Regression
Best params: {'clf__C': 1, 'clf__max_iter': 1000000, 'clf__penalty': 'l2', 'clf__solver': 'liblinear'}
Best training accuracy: 0.978
Test set accuracy score for best params: 0.991 

Estimator: KNN
Best params: {'clf__n_neighbors': 5}
Best training accuracy: 0.956
Test set accuracy score for best params: 0.982 

Estimator: DTC
Best params: {'clf__criterion': 'entropy', 'clf__max_depth': 5, 'clf__min_samples_leaf': 5, 'clf__min_samples_split': 5}
Best training accuracy: 0.947
Test set accuracy score for best params: 0.965 

Estimator: ADA
Best params: {'clf__learning_rate': 0.01, 'clf__n_estimators': 10}
Best training accuracy: 0.901
Test set accuracy score for best params: 0.930 

Estimator: GBC
Best params: {'clf__learning_rate': 0.001, 'clf__loss': 'deviance', 'clf__n_estimators': 3}
Best training accuracy: 0.624
Test set accuracy score for best params: 0.640 

Estimator: RFC
Best params: {'clf__criterion': 'entropy', '
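
Accuracy alone hides which kind of mistake a model makes, and `confusion_matrix` and `classification_report` were imported at the top for exactly this. Below is a minimal sketch of how they could be applied to a fitted grid search. It uses synthetic data and a small KNN grid so that it runs on its own; `gs_demo` is an illustrative name, not one of the searches fit above.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=123)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.20, random_state=123)

gs_demo = GridSearchCV(Pipeline([('scl', StandardScaler()),
                                 ('clf', KNeighborsClassifier())]),
                       param_grid=[{'clf__n_neighbors': [3, 5, 10]}],
                       scoring='accuracy', cv=5)
gs_demo.fit(X_tr, y_tr)

# predict() uses the best pipeline, refit on all of the training data
y_pred = gs_demo.predict(X_te)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_te, y_pred)
print(cm)
print(classification_report(y_te, y_pred))
```

In this notebook the same two calls could be made with `best_gs.predict(X_test)` and `y_test`.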

**On Your Own:** Try adding a few more models, or go back and tune the GBC grid. The tiny learning rates and small number of estimators in its grid are likely holding it back, so you can probably do better than this!
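
For a sense of how low the GBC score is, compare it with the accuracy you would get by always predicting the most common class. The sketch below uses the class counts reported earlier for the full dataset (the training split will differ slightly); `baseline_acc` is an illustrative name.

```python
from collections import Counter

# Class counts reported earlier in this notebook
counts = Counter({'M': 212, 'B': 357})

# Accuracy of always predicting the most common class
majority_label, majority_count = counts.most_common(1)[0]
baseline_acc = majority_count / sum(counts.values())

print(majority_label, round(baseline_acc, 3))
# prints: B 0.627
```

The GBC's best cross-validated accuracy of 0.624 is essentially this baseline, which suggests the model is doing little more than predicting benign for everything.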