### Outline
1. Projects -- Proposals due 4/12 !
2. SVM
3. Boosting
4. Ensembles; SuperLearners
5. Review: Bagging vs Boosting vs Ensembles
6. Intro to GRIDSEARCH
7. Lab

### SVM
* Supervised learning
* Used for Regression or Classification

<center>
<img src="../images/svm_example.jpg" alt="drawing" style="width: 1200px;"/>
</center>

"Kernel: A kernel is a similarity function for pattern analysis. It must be one of rbf/linear/poly/sigmoid/precomputed; default = “rbf” (radial basis function). Choosing the appropriate kernel will result in a better model fit"

"Mastering Machine Learning with Python in Six Steps..." - Manohar Swamynathan
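
To make the kernel choice concrete, here is a minimal sketch that cross-validates `SVC` with each of the built-in kernel names from the quote. It assumes synthetic two-moons data (`X_demo`, `y_demo` are hypothetical names, not the census data used later), and the `StandardScaler` step is an added assumption, since SVMs are sensitive to feature scale.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic, non-linearly separable data (illustration only).
X_demo, y_demo = make_moons(n_samples=300, noise=0.2, random_state=0)

kernel_scores = {}
for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    # Scale features first, then fit an SVC with the chosen kernel.
    model = make_pipeline(StandardScaler(), SVC(kernel=kernel))
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    kernel_scores[kernel] = scores.mean()
    print("%-8s CV accuracy: %0.3f" % (kernel, kernel_scores[kernel]))
```

Which kernel wins depends on the data, so treating `kernel` as a hyperparameter to tune is usually safer than relying on the default.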

'''
Description of fnlwgt (final weight)
|
| The weights on the CPS files are controlled to independent estimates of the
| civilian noninstitutional population of the US.  These are prepared monthly
| for us by Population Division here at the Census Bureau.  We use 3 sets of
| controls.
|  These are:
|          1.  A single cell estimate of the population 16+ for each state.
|          2.  Controls for Hispanic Origin by age and sex.
|          3.  Controls by Race, age and sex.
|
| We use all three sets of controls in our weighting program and "rake" through
| them 6 times so that by the end we come back to all the controls we used.
|
| The term estimate refers to population totals derived from CPS by creating
| "weighted tallies" of any specified socio-economic characteristics of the
| population.
|
| People with similar demographic characteristics should have
| similar weights.  There is one important caveat to remember
| about this statement.  That is that since the CPS sample is
| actually a collection of 51 state samples, each with its own
| probability of selection, the statement only applies within
| state.
'''

In [None]:
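import pandas
import sklearn
import sklearn.metrics          # imported explicitly; later cells call sklearn.metrics.*
import sklearn.model_selection  # likewise for sklearn.model_selection.*
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import SVC

col_names = [
    "Age", "WorkClass", "fnlwgt", "Education", "EducationNum",
    "MaritalStatus", "Occupation", "Relationship", "Race", "Gender",
    "CapitalGain", "CapitalLoss", "HoursPerWeek", "NativeCountry", "Income"
]

# NOTE: adult.data has no header row, so read_csv treats the first record as
# column names and drops it from the data. Passing header=None, names=col_names
# would keep it; the outputs below were produced with the call as written.
df = pandas.read_csv('adult.data')
df.columns = col_names
# Values keep the leading space from the raw file, hence ' <=50K'.
df['Income'] = df['Income'].apply(lambda x: 0 if x == ' <=50K' else 1)

# Label-encode every string (object-dtype) column as integer codes.
for column in df.columns:
    if df[column].dtype == object:
        le = LabelEncoder()
        df[column] = le.fit_transform(df[column])
df.tail()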

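| | Age | WorkClass | fnlwgt | Education | EducationNum | MaritalStatus | Occupation | Relationship | Race | Gender | CapitalGain | CapitalLoss | HoursPerWeek | NativeCountry | Income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 32555 | 27 | 4 | 257302 | 7 | 12 | 2 | 13 | 5 | 4 | 0 | 0 | 0 | 38 | 39 | 0 |
| 32556 | 40 | 4 | 154374 | 11 | 9 | 2 | 7 | 0 | 4 | 1 | 0 | 0 | 40 | 39 | 1 |
| 32557 | 58 | 4 | 151910 | 11 | 9 | 6 | 1 | 4 | 4 | 0 | 0 | 0 | 40 | 39 | 0 |
| 32558 | 22 | 4 | 201490 | 11 | 9 | 4 | 1 | 3 | 4 | 1 | 0 | 0 | 20 | 39 | 0 |
| 32559 | 52 | 5 | 287927 | 11 | 9 | 2 | 4 | 5 | 4 | 0 | 15024 | 0 | 40 | 39 | 1 |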


In [3]:
X = df.drop(columns='Income')
y = df['Income']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=555, stratify=y)

In [4]:
%%time

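# SVM TAKES A LONG TIME!
# probability=True adds an internal 5-fold cross-validation (Platt scaling) to
# every fit, and SVMs are sensitive to feature scale; the features here are unscaled.

# Run in another notebook.

SVM = SVC(random_state=0, probability=True)

# Reminder: cross_val_score refits a fresh copy of SVM on each of the 5
# train/validation splits of X_train, so the model is trained 5 times here
# before the final fit on all of X_train below.
scores = sklearn.model_selection.cross_val_score(SVM, X_train, y_train, cv=5, scoring="accuracy")
print("Train CV Accuracy: %0.3f (+/- %0.3f)" % (scores.mean(), scores.std()))
SVM.fit(X_train, y_train)
print("Test Accuracy: %0.3f " % (sklearn.metrics.accuracy_score(SVM.predict(X_test), y_test)))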

Train CV Accuracy: 0.792 (+/- 0.001)
Test Accuracy: 0.794 
Wall time: 16min 9s
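
As a reminder of what `cross_val_score` does, the sketch below writes the 5-fold loop out by hand and compares it with the one-line call. It assumes synthetic data and a fast `LogisticRegression` stand-in (both hypothetical choices, so it runs in seconds instead of minutes). For a classifier, `cv=5` means a non-shuffled `StratifiedKFold`.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data and a fast model (illustration only).
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
model = LogisticRegression(max_iter=1000)

# Manual cross-validation: train on 4 folds, score on the held-out fold.
skf = StratifiedKFold(n_splits=5)
manual_scores = []
for train_idx, val_idx in skf.split(X_demo, y_demo):
    fold_model = clone(model)  # fresh, unfitted copy for every fold
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    manual_scores.append(fold_model.score(X_demo[val_idx], y_demo[val_idx]))
manual_scores = np.array(manual_scores)

# The one-liner used in the notebook.
auto_scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")

print("manual:", manual_scores.round(3))
print("auto:  ", auto_scores.round(3))
```

Each fold trains its own model, which is why cross-validating a slow estimator such as `SVC` multiplies the wall time.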


### Boosting

"The core concept of boosting is that rather than an independent individual hypothesis, combining hypotheses in a sequential order increases the accuracy."

<center>
<img src="../images/boosting_ada.jpg" alt="drawing" style="width: 900px;"/>
</center>

<center>
<img src="../images/boosting_w_weights.jpg" alt="drawing" style="width: 900px;"/>
</center>

### Common types of Boosting you will see:
1. AdaBoost  
2. Gradient Boost
3. XGBoost (Extreme Gradient Boost)
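
To show the "sequential" part in code, here is a rough sketch of discrete AdaBoost with decision stumps. It is a teaching version under stated assumptions (synthetic data, labels recoded to -1/+1, hypothetical names such as `boosted_predict`), not the implementation inside scikit-learn.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data; labels recoded to {-1, +1} for the classic AdaBoost update.
X_demo, y01 = make_classification(n_samples=400, n_features=8, random_state=0)
y_pm = np.where(y01 == 1, 1, -1)

n_rounds = 20
weights = np.full(len(y_pm), 1.0 / len(y_pm))  # start with uniform sample weights
stumps, alphas = [], []

for _ in range(n_rounds):
    # Fit a weak learner (depth-1 tree) on the currently weighted samples.
    stump = DecisionTreeClassifier(max_depth=1, random_state=0)
    stump.fit(X_demo, y_pm, sample_weight=weights)
    pred = stump.predict(X_demo)

    # Weighted error, clipped to avoid division by zero or log of zero.
    err = np.clip(np.sum(weights[pred != y_pm]), 1e-10, 1 - 1e-10)
    alpha = 0.5 * np.log((1 - err) / err)  # this learner's vote strength

    # Increase weights of misclassified samples, decrease the rest, renormalize.
    weights = weights * np.exp(-alpha * y_pm * pred)
    weights = weights / weights.sum()

    stumps.append(stump)
    alphas.append(alpha)

def boosted_predict(X):
    """Weighted vote of all stumps, returned as -1/+1 labels."""
    votes = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
    return np.where(votes >= 0, 1, -1)

train_acc = np.mean(boosted_predict(X_demo) == y_pm)
first_stump_acc = np.mean(stumps[0].predict(X_demo) == y_pm)
print("first stump train accuracy: %0.3f" % first_stump_acc)
print("boosted train accuracy:     %0.3f" % train_acc)
```

Each round depends on the weights left by the previous one, so the learners cannot be trained independently.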

### Compare Bagging and Boosting
* Sequential ensemble of models:  Boosting

* Parallel ensemble of models:  Bagging
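
The sequential/parallel split shows up in the scikit-learn API itself. A small sketch, assuming synthetic data (`X_demo`, `y_demo` are hypothetical): `BaggingClassifier` exposes `n_jobs` because its members are fit independently on bootstrap samples, while `AdaBoostClassifier` has no such parameter because each member depends on the previous one.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (illustration only).
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

bagging = BaggingClassifier(n_estimators=25, random_state=0)   # independent members
boosting = AdaBoostClassifier(n_estimators=25, random_state=0)  # sequential members

# Bagging can be parallelized across members; boosting cannot.
print("bagging has n_jobs: ", "n_jobs" in bagging.get_params())
print("boosting has n_jobs:", "n_jobs" in boosting.get_params())

bag_scores = cross_val_score(bagging, X_demo, y_demo, cv=5, scoring="accuracy")
boost_scores = cross_val_score(boosting, X_demo, y_demo, cv=5, scoring="accuracy")
print("Bagging  CV accuracy: %0.3f" % bag_scores.mean())
print("Boosting CV accuracy: %0.3f" % boost_scores.mean())
```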

In [5]:
%%time
from sklearn.ensemble import GradientBoostingClassifier
GBC = GradientBoostingClassifier(n_estimators = 100)
scores = sklearn.model_selection.cross_val_score(GBC, X_train, y_train, cv=5, scoring="accuracy")
print("Train CV Accuracy: %0.3f (+/- %0.3f) [%s]" % (scores.mean(), scores.std(), 'GBC'))
GBC.fit(X_train, y_train)
print("Test Accuracy: %0.3f " % (sklearn.metrics.accuracy_score(GBC.predict(X_test), y_test)))


Train CV Accuracy: 0.864 (+/- 0.003) [GBC]
Test Accuracy: 0.865 
Wall time: 30.1 s


In [6]:
scores

array([0.86466011, 0.86261261, 0.86527437, 0.86036036, 0.86793612])

#### Hey that's pretty good!

#### Let's compare to other estimators.

In [7]:
%%time
from sklearn.ensemble import AdaBoostClassifier

# We can specify a base model with AdaBoost. The default is a depth-1 decision
# tree (a "stump"): DecisionTreeClassifier(max_depth=1).

ADBC = AdaBoostClassifier(n_estimators = 100)
scores = sklearn.model_selection.cross_val_score(ADBC, X_train, y_train, cv=5, scoring="accuracy")
print("Train CV Accuracy: %0.3f (+/- %0.3f) [%s]" % (scores.mean(), scores.std(), 'ADBC'))
ADBC.fit(X_train, y_train)
print("Test Accuracy: %0.3f " % (sklearn.metrics.accuracy_score(ADBC.predict(X_test), y_test)))


Train CV Accuracy: 0.864 (+/- 0.002) [ADBC]
Test Accuracy: 0.864 
Wall time: 18.9 s


In [8]:
%%time
# What if we apply boosting to a KNN model instead of the DT?
# AdaBoost reweights the samples each round, so the base model's fit() must
# accept sample_weight. KNeighborsClassifier does not, so the knn_ab line stays
# commented out and plain KNN is cross-validated for comparison instead.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
knn = KNeighborsClassifier()
dt = DecisionTreeClassifier()  # full-depth tree, unlike AdaBoost's default stump
# Note: scikit-learn 1.2+ renames base_estimator to estimator.
dt_ab = AdaBoostClassifier(n_estimators = 100, base_estimator=dt)
# knn_ab = AdaBoostClassifier(n_estimators = 100, base_estimator=knn)
scores = sklearn.model_selection.cross_val_score(dt_ab, X_train, y_train, cv=5, scoring="accuracy")
print("Train CV Accuracy: %0.3f (+/- %0.3f) [%s]" % (scores.mean(), scores.std(), 'DT'))
# knn.fit(X_train, y_train)
scores_knn = sklearn.model_selection.cross_val_score(knn, X_train, y_train, cv=5, scoring="accuracy")
# print("Test Accuracy: %0.3f " % (sklearn.metrics.accuracy_score(knn.predict(X_test), y_test)))
print("Train CV Accuracy: %0.3f (+/- %0.3f) [%s]" % (scores_knn.mean(), scores_knn.std(), 'KNN'))

Train CV Accuracy: 0.814 (+/- 0.006) [DT]
Train CV Accuracy: 0.772 (+/- 0.002) [KNN]
Wall time: 1min 13s


In [9]:
import statsmodels.api as sm
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import GradientBoostingClassifier

import xgboost as xgb
from xgboost.sklearn import XGBClassifier

In [10]:
%%time
LR = LogisticRegression(solver='lbfgs', max_iter=10000, random_state=555)
RF = RandomForestClassifier(n_estimators = 100, random_state=555)
SVM = SVC(random_state=0, probability=True)  # defined, but left out of the loop below (slow to train)
KNC = KNeighborsClassifier()
DTC = DecisionTreeClassifier()
ABC = AdaBoostClassifier(n_estimators = 100)
BC = BaggingClassifier(n_estimators = 100)
GBC = GradientBoostingClassifier(n_estimators = 100)
# clf_XGB = XGBClassifier(n_estimators = 100, objective= 'binary:logistic', seed=555, use_label_encoder=False)
clf_XGB = XGBClassifier(n_estimators = 100, seed=555, use_label_encoder=False, eval_metric='logloss')
clfs = []
print('5-fold cross validation:\n')
for clf, label in zip([LR, RF, KNC, DTC, ABC, BC, GBC, clf_XGB],
                      ['Logistic Regression',
                       'Random Forest',
                       #'Support Vector Machine',
                       'KNeighbors',
                       'Decision Tree',
                       'Ada Boost',
                       'Bagging',
                       'Gradient Boosting',
                       'XGBoost']):
    scores = sklearn.model_selection.cross_val_score(clf, X_train, y_train, cv=5, scoring="accuracy")
    print("Train CV Accuracy: %0.3f (+/- %0.3f) [%s]" % (scores.mean(), scores.std(), label))
    md = clf.fit(X_train, y_train)
    clfs.append(md)
    print("Test Accuracy: %0.4f " % (sklearn.metrics.accuracy_score(clf.predict(X_test), y_test)))

5-fold cross validation:

Train CV Accuracy: 0.795 (+/- 0.008) [Logistic Regression]
Test Accuracy: 0.8034 
Train CV Accuracy: 0.856 (+/- 0.004) [Random Forest]
Test Accuracy: 0.8575 
Train CV Accuracy: 0.772 (+/- 0.002) [KNeighbors]
Test Accuracy: 0.7792 
Train CV Accuracy: 0.809 (+/- 0.004) [Decision Tree]
Test Accuracy: 0.8152 
Train CV Accuracy: 0.864 (+/- 0.002) [Ada Boost]
Test Accuracy: 0.8636 
Train CV Accuracy: 0.849 (+/- 0.003) [Bagging]
Test Accuracy: 0.8585 
Train CV Accuracy: 0.864 (+/- 0.002) [Gradient Boosting]
Test Accuracy: 0.8650 
Train CV Accuracy: 0.866 (+/- 0.002) [XGBoost]
Test Accuracy: 0.8709 
Wall time: 2min 53s


In [11]:
# clf_XGB.predict(X_test)

In [12]:
y_test

9090     1
16487    0
20301    0
22508    0
1260     1
        ..
21348    0
10582    1
22634    0
23939    0
28337    0
Name: Income, Length: 8140, dtype: int64

### What if we want to create our own custom ensemble ??

### This is python.  There is a library for that!

### Discussion: Super Learners.

<center>
<img src="../images/ensemble.jpg" alt="drawing" style="width: 900px;"/>
</center>

#### Also known as stacking (stacked generalization)
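
Stacking (stacked generalization) is the name scikit-learn uses for this kind of layered ensemble. As a point of comparison with the `mlens` cells that follow, here is a minimal sketch using scikit-learn's own `StackingClassifier` instead, with the same KNN plus logistic regression base layer and a gradient boosting meta-learner. It assumes synthetic data and hypothetical variable names, and it is a swapped-in alternative, not the `mlens` API.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (illustration only).
X_demo, y_demo = make_classification(n_samples=600, n_features=10, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

stack = StackingClassifier(
    # Base layer: each model produces out-of-fold predictions.
    estimators=[
        ("knn", KNeighborsClassifier()),
        ("lr", LogisticRegression(max_iter=1000)),
    ],
    # Meta learner: trained on the base layer's out-of-fold predictions.
    final_estimator=GradientBoostingClassifier(random_state=0),
    cv=2,
)
stack.fit(Xtr, ytr)
stack_acc = stack.score(Xte, yte)
print("Stacked test accuracy: %0.3f" % stack_acc)
```

Training the meta-learner on out-of-fold predictions keeps it from simply memorizing base models that overfit the training set.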

In [13]:
from mlens.ensemble import SuperLearner
from mlens.model_selection import Evaluator
from mlens.metrics import make_scorer

from sklearn.metrics import accuracy_score

#We'll use threading.  Discuss: threads vs processes.

[MLENS] backend: threading


In [14]:
# --- Build ---
# Passing a scoring function will create cv scores during fitting;
# the scorer should be a simple function accepting two vectors and returning a scalar
ensemble = SuperLearner(scorer=accuracy_score, random_state=555, verbose=2)

In [15]:
# Build the first layer
ensemble.add([KNC, LR])

SuperLearner(array_check=None, backend=None, folds=2,
       layers=[Layer(backend='threading', dtype=<class 'numpy.float32'>, n_jobs=-1,
   name='layer-1', propagate_features=None, raise_on_exception=True,
   random_state=4782, shuffle=False,
   stack=[Group(backend='threading', dtype=<class 'numpy.float32'>,
   indexer=FoldIndex(X=None, folds=2, raise_on_ex...0AF4160>)],
   n_jobs=-1, name='group-0', raise_on_exception=True, transformers=[])],
   verbose=1)],
       model_selection=False, n_jobs=None, raise_on_exception=True,
       random_state=555, sample_size=20,
       scorer=<function accuracy_score at 0x00000210D0AF4160>,
       shuffle=False, verbose=2)

In [16]:
# Attach the final meta estimator
# ensemble.add_meta(LogisticRegression())
ensemble.add_meta(GradientBoostingClassifier())

SuperLearner(array_check=None, backend=None, folds=2,
       layers=[Layer(backend='threading', dtype=<class 'numpy.float32'>, n_jobs=-1,
   name='layer-1', propagate_features=None, raise_on_exception=True,
   random_state=4782, shuffle=False,
   stack=[Group(backend='threading', dtype=<class 'numpy.float32'>,
   indexer=FoldIndex(X=None, folds=2, raise_on_ex...0AF4160>)],
   n_jobs=-1, name='group-1', raise_on_exception=True, transformers=[])],
   verbose=1)],
       model_selection=False, n_jobs=None, raise_on_exception=True,
       random_state=555, sample_size=20,
       scorer=<function accuracy_score at 0x00000210D0AF4160>,
       shuffle=False, verbose=2)

In [17]:
# Fit ensemble
ensemble.fit(X_train, y_train)


Fitting 2 layers
Processing layer-1             done | 00:00:04
Processing layer-2             done | 00:00:01
Fit complete                        | 00:00:06


SuperLearner(array_check=None, backend=None, folds=2,
       layers=[Layer(backend='threading', dtype=<class 'numpy.float32'>, n_jobs=-1,
   name='layer-1', propagate_features=None, raise_on_exception=True,
   random_state=4782, shuffle=False,
   stack=[Group(backend='threading', dtype=<class 'numpy.float32'>,
   indexer=FoldIndex(X=None, folds=2, raise_on_ex...0AF4160>)],
   n_jobs=-1, name='group-1', raise_on_exception=True, transformers=[])],
   verbose=1)],
       model_selection=False, n_jobs=None, raise_on_exception=True,
       random_state=555, sample_size=20,
       scorer=<function accuracy_score at 0x00000210D0AF4160>,
       shuffle=False, verbose=2)

In [18]:
#pred_vals = ensemble.predict(X_test)
print ("Accuracy - Train : ", sklearn.metrics.accuracy_score(ensemble.predict(X_train), y_train))
print ("Accuracy - Test : ", sklearn.metrics.accuracy_score(ensemble.predict(X_test), y_test))


Predicting 2 layers
Processing layer-1             done | 00:00:02
Processing layer-2             done | 00:00:00
Predict complete                    | 00:00:02
Accuracy - Train :  0.7996723996723997

Predicting 2 layers
Processing layer-1             done | 00:00:00
Processing layer-2             done | 00:00:00
Predict complete                    | 00:00:01
Accuracy - Test :  0.8034398034398035


In [19]:
# pred_vals

In [20]:
print("Fit data:\n%r" % ensemble.data)

Fit data:
                                 score-m  score-s  ft-m  ft-s  pt-m  pt-s
layer-1  kneighborsclassifier       0.77     0.00  0.52  0.05  4.30  0.05
layer-1  logisticregression         0.80     0.00  2.88  0.29  0.01  0.00



In [21]:
# Let's try some GridSearch: 7 eta values x 5 depths = 35 candidate settings.
# eta is XGBoost's name for the learning rate.

xg_params = {
    'eta': [0.01, 0.025, 0.05, 0.075, 0.25, 0.3, 0.5],
    'max_depth': [3, 5, 6, 7, 9],
    'tree_method': ['auto']
}
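
A grid search tries every combination in the dictionary. The sketch below uses scikit-learn's `ParameterGrid` to count and preview the combinations for the same grid (copied into a hypothetical `xg_params_demo` so the block is self-contained); the 5 folds are an assumption matching `cv=5`.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as above, repeated here so the example runs on its own.
xg_params_demo = {
    'eta': [0.01, 0.025, 0.05, 0.075, 0.25, 0.3, 0.5],
    'max_depth': [3, 5, 6, 7, 9],
    'tree_method': ['auto']
}

grid = ParameterGrid(xg_params_demo)
n_candidates = len(grid)     # 7 * 5 * 1 combinations
n_fits = n_candidates * 5    # one fit per candidate per CV fold

print("candidates:", n_candidates)
print("total fits with cv=5:", n_fits)
print("first candidate:", list(grid)[0])
```

The number of fits grows multiplicatively with each added parameter, so keeping grids small (or searching coarse first, fine second) saves a lot of time.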

In [22]:
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split, StratifiedKFold
seed = 555
#kfold = StratifiedKFold(n_splits=5)

In [23]:
# clf_XGB = XGBClassifier(n_estimators = 100, objective= 'binary:logistic', seed=555)
clf_XGB = XGBClassifier(n_estimators = 100, seed=555, use_label_encoder=False, eval_metric='logloss')
grid2 = GridSearchCV(clf_XGB, xg_params, scoring="roc_auc", cv=5, verbose=10, n_jobs=-1)
grid2.fit(X_train, y_train)

Fitting 5 folds for each of 35 candidates, totalling 175 fits


GridSearchCV(cv=5,
             estimator=XGBClassifier(base_score=None, booster=None,
                                     colsample_bylevel=None,
                                     colsample_bynode=None,
                                     colsample_bytree=None,
                                     enable_categorical=False,
                                     eval_metric='logloss', gamma=None,
                                     gpu_id=None, importance_type=None,
                                     interaction_constraints=None,
                                     learning_rate=None, max_delta_step=None,
                                     max_depth=None, min_child_weight=None,
                                     missing=nan, mono...
                                     n_estimators=100, n_jobs=None,
                                     num_parallel_tree=None, predictor=None,
                                     random_state=None, reg_alpha=None,
                             

In [24]:
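# Note: grid2 picked these parameters by ROC AUC (scoring="roc_auc");
# the scores printed below are accuracy.
print ('Best Parameters: ', grid2.best_params_)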
results = sklearn.model_selection.cross_val_score(grid2.best_estimator_, X_train,y_train, cv=5)
print ("Accuracy - Train CV: ", results.mean())
print ("Accuracy - Train : ", sklearn.metrics.accuracy_score(grid2.best_estimator_.predict(X_train), y_train))
print ("Accuracy - Test : ", sklearn.metrics.accuracy_score(grid2.best_estimator_.predict(X_test), y_test))

Best Parameters:  {'eta': 0.25, 'max_depth': 3, 'tree_method': 'auto'}
Accuracy - Train CV:  0.8699017199017198
Accuracy - Train :  0.8770270270270271
Accuracy - Test :  0.8716216216216216
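
When a search is scored with `roc_auc`, it can help to report the same metric on the test set alongside accuracy. A minimal sketch, assuming synthetic data and a `GradientBoostingClassifier` stand-in (hypothetical choices so the block runs without xgboost); the same pattern should apply to any fitted `GridSearchCV` whose estimator has `predict_proba`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (illustration only).
X_demo, y_demo = make_classification(n_samples=600, n_features=10, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

search = GridSearchCV(
    GradientBoostingClassifier(n_estimators=50, random_state=0),
    {"learning_rate": [0.05, 0.1, 0.3], "max_depth": [2, 3]},
    scoring="roc_auc",
    cv=3,
)
search.fit(Xtr, ytr)

# best_score_ is the mean CV score in the search metric (ROC AUC here).
print("Best parameters:", search.best_params_)
print("CV ROC AUC:   %0.3f" % search.best_score_)

# ROC AUC needs scores or probabilities, not hard class labels.
test_auc = roc_auc_score(yte, search.predict_proba(Xte)[:, 1])
test_acc = accuracy_score(yte, search.predict(Xte))
print("Test ROC AUC: %0.3f" % test_auc)
print("Test accuracy: %0.3f" % test_acc)
```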
