# Chapter 15 - Improve Performance with Ensembles

- Boosting (AdaBoost, stochastic gradient boosting) builds models in sequence, each one trying to correct the errors of the model before it
- Bagging (bagged decision trees, random forest, extra trees) builds multiple models, typically of the same type, from different subsamples of the training data
- Voting builds multiple models, typically of different types, and combines their predictions with simple statistics such as the mean or the majority class


Bootstrap Aggregation (or Bagging) takes multiple samples from the training data with replacement and trains a model on each sample. The final prediction is averaged across the predictions of all the sub-models.
- Bagged Decision Trees
- Random Forest
- Extra Trees

### Bagging
- performs best with algorithms that have high variance 
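
To make the mechanics concrete, here is a minimal hand-rolled sketch of bagging. It is not how BaggingClassifier is implemented internally; it only mirrors the two steps described above (bootstrap sampling, then aggregating votes). The synthetic data from `make_classification` and the names `X_demo`, `y_demo`, `trees`, and `bagged_pred` are assumptions made for illustration, not part of the chapter's Pima dataset workflow.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: NOT the Pima dataset used in this chapter)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=7)

rng = np.random.default_rng(7)
n_models = 25  # odd, so a binary majority vote cannot tie
trees = []
for _ in range(n_models):
    # Bootstrap sample: draw row indices with replacement
    idx = rng.integers(0, len(X_demo), size=len(X_demo))
    tree = DecisionTreeClassifier(random_state=0)
    tree.fit(X_demo[idx], y_demo[idx])
    trees.append(tree)

# Aggregate: every tree votes, and the majority class wins
votes = np.stack([t.predict(X_demo) for t in trees])  # shape (n_models, n_samples)
bagged_pred = (votes.mean(axis=0) >= 0.5).astype(int)
print(bagged_pred[:10])
```

Unpruned decision trees are a natural base model here because each one overfits its own bootstrap sample differently, and averaging many of them reduces that variance.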

http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html



In [1]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from pandas import read_csv

filename = 'pima-indians-diabetes.data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(filename, names=names)


In [3]:
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]

In [7]:
seed= 7
kfold = KFold(n_splits = 10, random_state=seed, shuffle=True)
cart = DecisionTreeClassifier()
num_trees = 100
model = BaggingClassifier(estimator=cart, n_estimators=num_trees, random_state=seed)
results = cross_val_score(model, X,Y,cv=kfold)
print(results.mean())

0.7578263841421736


The mean accuracy across the 10 folds gives a robust estimate of the model's accuracy.

### Random Forest
- an extension of bagged decision trees: samples of the training data are taken with replacement, but the trees are constructed in a way that reduces the correlation between individual classifiers.
- instead of greedily choosing the best split point over all features, each split considers only a random subset of the features (max_features in the code below).
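
As a small check of the feature-subset idea, the sketch below fits a forest on synthetic data and inspects one of its trees. The data and the names `X_demo`, `y_demo`, `forest`, and `first_tree` are assumptions for illustration; `estimators_`, `max_features_`, and `feature_importances_` are attributes that scikit-learn exposes after fitting.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption: NOT the Pima dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=7)

forest = RandomForestClassifier(n_estimators=20, max_features=3, random_state=7)
forest.fit(X_demo, y_demo)

# Each fitted tree records how many features it considers at every split
first_tree = forest.estimators_[0]
print(first_tree.max_features_)

# Importances are averaged over all trees in the forest
print(forest.feature_importances_.round(3))
```

Setting random_state on the forest itself, as done here, makes the result repeatable from run to run.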

http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

In [8]:
from sklearn.ensemble import RandomForestClassifier

num_trees = 100
max_features = 3
kfold = KFold(n_splits=10, random_state=7, shuffle=True)
model = RandomForestClassifier(n_estimators=num_trees, max_features=max_features)
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())

0.7695317840054682


### Extra Trees
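- a variation of bagging in which heavily randomized trees are built: candidate split thresholds are drawn at random instead of being searched for
- note that scikit-learn's ExtraTreesClassifier fits each tree on the whole training set by default (bootstrap=False), not on a bootstrap sample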

http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html

In [10]:
from sklearn.ensemble import ExtraTreesClassifier

num_trees = 100
max_features = 7
kfold = KFold(n_splits=10, random_state=7, shuffle=True)
model = ExtraTreesClassifier(n_estimators=num_trees, max_features=max_features)
results = cross_val_score(model, X,Y,cv=kfold)
print(results.mean())

0.7604237867395762


## Boosting Algorithms
Boosting algorithms create a sequence of models, each attempting to correct the mistakes of the models before it.
- AdaBoost
- Stochastic Gradient Boosting

### AdaBoost
- perhaps the first successful boosting ensemble algorithm
- weights the instances in the dataset by how difficult they are to classify
- this lets the algorithm pay more or less attention to them when constructing subsequent models
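
The reweighting step can be sketched by hand. The block below runs a single round of the classic discrete AdaBoost update for a binary problem; it is an illustrative sketch under stated assumptions, not scikit-learn's AdaBoostClassifier code. The synthetic data and the names `weights`, `stump`, `miss`, `err`, and `alpha` are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: NOT the Pima dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=7)

# Start with equal weight on every instance
weights = np.full(len(y_demo), 1.0 / len(y_demo))

# Weak learner: a one-split decision tree (a "stump")
stump = DecisionTreeClassifier(max_depth=1, random_state=0)
stump.fit(X_demo, y_demo, sample_weight=weights)

miss = stump.predict(X_demo) != y_demo
# Weighted error, clipped to avoid division by zero for a perfect stump
err = np.clip(weights[miss].sum(), 1e-10, 1 - 1e-10)
alpha = 0.5 * np.log((1 - err) / err)  # the stump's say in the final vote

# Raise the weight of misclassified instances, lower the rest, renormalize
weights = weights * np.exp(np.where(miss, alpha, -alpha))
weights /= weights.sum()

print(round(float(err), 3), round(float(alpha), 3))
print(weights[miss].sum())  # misclassified instances now carry half the total weight
```

The next model in the sequence would be fitted with these updated weights, so it concentrates on the instances the first stump got wrong.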

http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html

In [13]:
from sklearn.ensemble import AdaBoostClassifier

num_trees = 30
seed=7
kfold = KFold(n_splits=10, random_state=seed, shuffle=True)
model = AdaBoostClassifier(n_estimators=num_trees, random_state=seed)
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())

#The SAMME.R algorithm (the default) is deprecated and will be removed in 1.6. 
#Use the SAMME algorithm to circumvent this warning.



0.7552802460697198




### Stochastic Gradient Boosting (Gradient Boosting Machines)

- One of the most sophisticated ensemble techniques
- Perhaps the best for improving performance via ensembles
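
In scikit-learn, the "stochastic" part corresponds to the `subsample` parameter: with the default of 1.0 every tree sees all training rows, and a value below 1.0 fits each tree on a random fraction of them. The sketch below shows this on synthetic data; the dataset, the value 0.8, and the names `gbm`, `X_tr`, `X_te` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: NOT the Pima dataset)
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.33, random_state=7)

# subsample < 1.0 fits each tree on a random 80% of the training rows
gbm = GradientBoostingClassifier(n_estimators=50, subsample=0.8, random_state=7)
gbm.fit(X_tr, y_tr)

# train_score_[i] is the loss after boosting stage i (on the in-bag rows)
print(gbm.train_score_[0], gbm.train_score_[-1])
test_accuracy = gbm.score(X_te, y_te)
print(test_accuracy)
```

The falling loss across stages reflects the sequential error correction that defines boosting.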

http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html

In [15]:
from sklearn.ensemble import GradientBoostingClassifier

seed = 7
num_trees = 100
kfold = KFold(n_splits=10, random_state=seed, shuffle=True)
model = GradientBoostingClassifier(n_estimators=num_trees, random_state=seed) 
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())

0.7578947368421053


### Voting Ensemble
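- One of the simplest ways of combining the predictions from multiple ML algorithms
- A more advanced method, stacking (stacked generalization), trains a model to learn how best to combine the sub-models' predictions; scikit-learn provides it as StackingClassifier since version 0.22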
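
To show what "combining predictions" means in practice, here is a small NumPy sketch of the two common schemes, which VotingClassifier selects through its `voting` parameter ('hard', the default, or 'soft'). The probability values are made up for illustration and do not come from any fitted model.

```python
import numpy as np

# Hypothetical class-1 probabilities from three models for four samples
probs = np.array([
    [0.9, 0.4, 0.6, 0.2],  # model A
    [0.6, 0.4, 0.4, 0.3],  # model B
    [0.4, 0.9, 0.4, 0.1],  # model C
])

# Hard voting: each model casts a label, the majority wins
labels = (probs >= 0.5).astype(int)
hard_vote = (labels.sum(axis=0) >= 2).astype(int)

# Soft voting: average the probabilities, then threshold
soft_vote = (probs.mean(axis=0) >= 0.5).astype(int)

print(hard_vote)  # [1 0 0 0]
print(soft_vote)  # [1 1 0 0]
```

The second sample shows the difference: two models lean weakly towards class 0, but one model is confident about class 1, so the averaged probability tips the soft vote the other way.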


http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingClassifier.html

In [17]:
# Voting Ensemble for Classification
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import VotingClassifier
filename = 'pima-indians-diabetes.data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class'] 
dataframe = read_csv(filename, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
kfold = KFold(n_splits=10, random_state=7, shuffle=True)
# create the sub models
estimators = []
model1 = LogisticRegression(solver='liblinear')
estimators.append(('logistic', model1))
model2 = DecisionTreeClassifier()
estimators.append(('cart', model2))
model3 = SVC(gamma='auto')
estimators.append(('svm', model3))
# create the ensemble model
ensemble = VotingClassifier(estimators)
results = cross_val_score(ensemble, X, Y, cv=kfold)
print(results.mean())


0.7449248120300751


# Chapter 16 - Improve Performance with Algorithm Tuning

- ML models are parameterized
- Their behavior can be tuned for a given problem

## ML Algorithm Parameters 

Sometimes called hyperparameter optimization
- algorithm parameters (set before training) are referred to as hyperparameters
- coefficients found by ML algorithms during training are referred to as parameters.
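
The distinction shows up directly in the scikit-learn API: hyperparameters are constructor arguments available through `get_params()`, while learned parameters appear as attributes with a trailing underscore only after `fit`. A minimal sketch on synthetic data (the data and the name `model_demo` are assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import RidgeClassifier

# Synthetic stand-in data (assumption: NOT the Pima dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=7)

model_demo = RidgeClassifier(alpha=0.5)

# Hyperparameter: chosen by us, known before any training
print(model_demo.get_params()['alpha'])
fitted_before = hasattr(model_demo, 'coef_')  # False: nothing learned yet

# Parameters: coefficients learned from the data during fit
model_demo.fit(X_demo, y_demo)
print(model_demo.coef_.shape)
```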

Two simple methods in scikit-learn:
- Grid Search Parameter Tuning
- Random Search Parameter Tuning

### Grid Search Parameter Tuning

- GridSearchCV class will build a model for each combination of algorithm parameters specified in a grid
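
The phrase "each combination" can be made visible with `ParameterGrid`, the scikit-learn helper that enumerates a grid. The two-parameter grid below is a hypothetical example (the chapter's own grid only varies alpha); `fit_intercept` is a real RidgeClassifier argument.

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical grid: 3 alpha values x 2 intercept settings
grid_demo = {'alpha': [1.0, 0.1, 0.01], 'fit_intercept': [True, False]}

combos = list(ParameterGrid(grid_demo))
print(len(combos))  # 6
for combo in combos:
    print(combo)
```

With cv=3, a grid search over these six combinations would fit 18 models, plus one final refit of the best combination when refit is left at its default. The cost grows multiplicatively with every parameter added to the grid.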


http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html

In [27]:
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import GridSearchCV
import numpy

# different values for alpha we want to test
alphas = numpy.array([1,0.1,0.01,0.001,0.0001,0])
param_grid = dict(alpha=alphas)


In [29]:
model = RidgeClassifier()
grid = GridSearchCV(estimator=model, param_grid=param_grid , cv=3)
grid.fit(X,Y)

In [30]:
print(grid.best_score_)
print(grid.best_estimator_.alpha)

0.7708333333333334
1.0


### Random Search Parameter Tuning
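- Samples algorithm parameters from a random distribution (e.g. uniform) for a fixed number of iterations
- A model is built and evaluated for each sampled combination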
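
What the search actually draws can be previewed with `ParameterSampler`, which generates the candidate settings that a randomized search would evaluate. This is a sketch; the choice of 5 iterations and the name `samples` are assumptions for illustration.

```python
from scipy.stats import uniform
from sklearn.model_selection import ParameterSampler

# alpha is drawn uniformly from [0, 1]; uniform() defaults to loc=0, scale=1
sampler = ParameterSampler({'alpha': uniform(loc=0, scale=1)},
                           n_iter=5, random_state=7)

samples = [params['alpha'] for params in sampler]
for value in samples:
    print(value)
```

Unlike a grid, the number of candidates is fixed by n_iter regardless of how many parameters are being searched.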

http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html

In [32]:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform

para_grid = {'alpha': uniform()}
model = RidgeClassifier()
rsearch = RandomizedSearchCV(estimator=model, param_distributions=para_grid, n_iter=100, cv=3, random_state=7)
rsearch.fit(X,Y)
print(rsearch.best_score_)
print(rsearch.best_estimator_.alpha)

0.7708333333333334
0.07630828937395717


-----
# Chapter 17 - Save and Load ML Models

### Finalize Your Model with pickle
- Pickle is the standard way of serializing objects in Python.
- You can load it to deserialize it.
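
Serialization does not have to go through a file. The sketch below round-trips a fitted model through an in-memory bytes object and checks that the restored copy predicts identically. The synthetic data and the names `clf`, `blob`, and `restored` are assumptions for illustration.

```python
import pickle

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (assumption: NOT the Pima dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=7)
clf = LogisticRegression(solver='liblinear').fit(X_demo, y_demo)

blob = pickle.dumps(clf)       # serialize to bytes
restored = pickle.loads(blob)  # deserialize into a new object

same_predictions = np.array_equal(clf.predict(X_demo), restored.predict(X_demo))
print(type(blob).__name__, same_predictions)
```

Two practical cautions: only unpickle data from a source you trust, because loading a pickle can execute arbitrary code, and load models with the same scikit-learn version that saved them where possible.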

 https://docs.python.org/2/library/pickle.html

In [24]:
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from pickle import dump
from pickle import load
filename = 'pima-indians-diabetes.data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class'] 
dataframe = read_csv(filename, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.33, random_state=7)
# Fit the model on the 67% training split (33% is held out for testing)
model = LogisticRegression(solver='liblinear') 
model.fit(X_train, Y_train)
# save the model to disk
filename = 'finalized_model_pickle.sav'
with open(filename, 'wb') as f:
    dump(model, f)




# some time later...
# load the model from disk
with open(filename, 'rb') as f:
    loaded_model = load(f)
result = loaded_model.score(X_test, Y_test) 
print(result)

0.7559055118110236


### Finalize Your Model with Joblib
- part of the SciPy ecosystem 
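
One option joblib adds over plain pickle is built-in compression through the `compress` argument of `dump`. The sketch below saves a model to a temporary directory so that it leaves no files behind; the synthetic data, the compression level 3, and the file name `model_demo.joblib` are assumptions for illustration.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (assumption: NOT the Pima dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=7)
clf = LogisticRegression(solver='liblinear').fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'model_demo.joblib')  # hypothetical file name
    dump(clf, path, compress=3)  # compress takes a level from 0 to 9
    restored = load(path)
    file_size = os.path.getsize(path)

same_predictions = np.array_equal(clf.predict(X_demo), restored.predict(X_demo))
print(file_size, same_predictions)
```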

 https://pypi.python.org/pypi/joblib
 
 https://pythonhosted.org/joblib/generated/joblib.dump.html

In [25]:
# Save Model Using joblib
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from joblib import dump
from joblib import load
filename = 'pima-indians-diabetes.data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class'] 
dataframe = read_csv(filename, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.33, random_state=7)
# Fit the model on the 67% training split (33% is held out for testing)
model = LogisticRegression(solver='liblinear')
model.fit(X_train, Y_train)

# save the model to disk
filename = 'finalized_model_joblib.sav' 
dump(model, filename)
# some time later...
# load the model from disk
loaded_model = load(filename)
result = loaded_model.score(X_test, Y_test)
print(result)


0.7559055118110236


----------
# Chapter 18 - Predictive Modeling Project Template

Practice with dataset from [UCI Machine Learning Repository](https://archive.ics.uci.edu)

Work end-to-end with each project using Recipes



6 Step ML Project Template
1) Define Problem
- Ideally your dataset should be small enough to build a model or run a visualization within a minute. If it is larger, break it into small enough chunks.
2) Summarize Data
- Take your time and write down questions, assumptions and hypotheses to explore later
3) Prepare Data
- cleaning by removing duplicates, marking missing values, imputing missing values
- start simple and revisit this step often until you begin to show accurate results
4) Evaluate Algorithms
- This is about finding a subset of ML algorithms that are good at exploiting the structure of your data
- You will spend most of your time on Step 3 and Step 4 (here) to narrow down to 3 - 5 well-performing algorithms
5) Improve Results 
- Step 5 may blur into Step 4 in a concrete problem because there may be some tuning involved in Step 4
6) Present Results 
- Finalizing a model that you can present to stakeholders or deploy


In [35]:
# Python Project Template

# 1. Prepare Problem
## Ideally your dataset should be small enough to build a model or run a visualization within a minute. If it is larger, break it into small enough chunks.
# a) Load Libraries
# b) Load Dataset

# 2. Summarize Data
## Take your time and write down questions, assumptions and hypotheses to explore later
# a) Descriptive Statistics
# b) Data Visualization

# 3. Prepare Data
## cleaning by removing duplicates, marking missing values, imputing missing values
## start simple and revisit this step often until you begin to show accurate results
# a) Data Cleaning
# b) Feature Selection
# c) Data Transforms (splits/folds)

# 4. Evaluate Algorithms
## This is about finding a subset of ML algorithms that are good at exploiting the structure of your data
## You will spend most of your time on Step 3 and Step 4 (here) to narrow down to 3 - 5 well-performing algorithms
# a) Split-out Validation Dataset
# b) Test options and Evaluate Metrics
# c) Spot Check Algorithms
# d) Compare Algorithms

# 5. Improve Accuracy
## Step 5 may blur into Step 4 in a concrete problem because there may be some tuning involved in Step 4
# a) Algorithm Tuning
# b) Ensembles

# 6. Finalize Model
## You can present to stakeholders or deploy
# a) Predictions on validation dataset
# b) Create standalone Model on entire training dataset
# c) Save Model for later use
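
The template above can be filled in as a compact end-to-end run. The sketch below is one possible instantiation under stated assumptions, not a prescribed workflow: it uses synthetic data instead of a UCI dataset, spot-checks only two algorithms, and tunes a single hyperparameter. All variable names are illustrative.

```python
# 1. Prepare Problem: load libraries and data
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (GridSearchCV, KFold, cross_val_score,
                                     train_test_split)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: replace with your own dataset)
X_all, y_all = make_classification(n_samples=300, n_features=8, random_state=7)

# 2. Summarize Data: descriptive statistics
summary = pd.DataFrame(X_all).describe()
print(summary.loc[['mean', 'std']].round(2))

# 3/4a. Prepare Data and split out a validation set
X_tr, X_val, y_tr, y_val = train_test_split(
    X_all, y_all, test_size=0.2, random_state=7)

# 4. Evaluate Algorithms: spot-check candidates with the same folds
kfold = KFold(n_splits=5, shuffle=True, random_state=7)
candidates = {
    'LR': Pipeline([('scale', StandardScaler()),
                    ('clf', LogisticRegression(solver='liblinear'))]),
    'CART': DecisionTreeClassifier(random_state=7),
}
cv_scores = {name: cross_val_score(model, X_tr, y_tr, cv=kfold).mean()
             for name, model in candidates.items()}
print(cv_scores)

# 5. Improve Accuracy: tune the regularization strength C of the LR pipeline
search = GridSearchCV(candidates['LR'],
                      {'clf__C': [0.01, 0.1, 1.0, 10.0]}, cv=kfold)
search.fit(X_tr, y_tr)

# 6. Finalize Model: best_estimator_ is refit on all of X_tr by default
final_model = search.best_estimator_
val_accuracy = final_model.score(X_val, y_val)
print(search.best_params_, val_accuracy)
```

Putting the scaler inside the Pipeline keeps the data transform within each cross-validation fold, so the validation rows never influence the scaling statistics.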