### Python Basics Tutorial

#### Improve Performance with Ensembles

#### Machine Learning Mastery with Python
#### Jason Brownlee

### 3 common ensemble models
1. bagging: multiple models of the same type, each trained on a different subsample of the data
2. boosting: multiple models of the same type, each learning to fix the prediction errors of the prior model
3. voting: multiple different models, with simple statistics used to combine their predictions
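
The bagging idea can be sketched by hand: draw bootstrap samples, fit one tree per sample, and combine the trees by majority vote. This is a simplified illustration on synthetic data, not necessarily what `BaggingClassifier` does internally; `X_demo`, `y_demo`, `n_models` and the other names are made up for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the Pima dataset used in these notes)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=7)

rng = np.random.default_rng(7)
n_models = 25  # odd number, so a 0/1 majority vote cannot tie
trees = []
for _ in range(n_models):
    # Bootstrap sample: draw row indices with replacement
    idx = rng.integers(0, len(X_demo), size=len(X_demo))
    trees.append(DecisionTreeClassifier().fit(X_demo[idx], y_demo[idx]))

# Stack the predictions, shape (n_models, n_samples), then take the majority vote
votes = np.stack([t.predict(X_demo) for t in trees])
bagged_pred = (votes.mean(axis=0) > 0.5).astype(int)
print('Training accuracy of the vote: %.4f' % (bagged_pred == y_demo).mean())
```

The accuracy printed here is measured on the training rows, so it is optimistic; the cross-validated scores further down are the fairer estimate.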

## Bagging Algorithms
- bagged decision trees
- random forest
- extra trees

#### Bagged Decision Trees

In [5]:
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

In [6]:
# raw string, so the backslashes are not interpreted as escape sequences
path = r'D:\OneDrive - QJA\My Files\DataScience\DataSets'
filename = 'pima-indians-diabetes.data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 
         'mass', 'pedi', 'age', 'class']

df = read_csv(path + '\\' + filename, names = names)


In [8]:
array = df.values
X = array[:, 0:8]
Y = array[:, 8]

seed = 7
n_splits = 10
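# random_state only has an effect when shuffle=True; recent scikit-learn
# versions raise an error if it is set without shuffling, so it is omitted
kfold = KFold(n_splits = n_splits)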

cart = DecisionTreeClassifier()
num_trees = 100

# base model passed positionally: the keyword was renamed from
# base_estimator to estimator in scikit-learn 1.2
model = BaggingClassifier(cart,
                          n_estimators = num_trees,
                          random_state = seed)

results = cross_val_score(model, X, Y, cv = kfold)
# cross_val_score reports accuracy per fold for classifiers by default
print('Mean accuracy: %.4f' % results.mean())

Mean accuracy: 0.7707


#### Random Forest
- samples taken with replacement
- trees constructed to reduce correlation between individual classifiers
    - rather than greedily choosing the best split point over all features
      when building each tree, only a random subset of features is considered for each split
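
As a small check of the two points above, the following sketch fits a forest on synthetic data and inspects it. The names `X_demo`, `y_demo`, `forest` and `first_tree` are illustrative; the data is a stand-in, not the Pima dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption: not the Pima dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=7)

forest = RandomForestClassifier(n_estimators=10, max_features=3, random_state=7)
forest.fit(X_demo, y_demo)

# Rows are resampled with replacement by default
print('bootstrap sampling:', forest.bootstrap)

# Each fitted tree looks at only max_features candidate features per split
first_tree = forest.estimators_[0]
print('features considered per split:', first_tree.max_features_)
```

Lowering `max_features` makes the trees less alike, which is the decorrelation effect described above, usually at the cost of weaker individual trees.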

In [12]:
#from pandas import read_csv
#from sklearn.model_selection import KFold
#from sklearn.model_selection import cross_val_score

from sklearn.ensemble import RandomForestClassifier

array = df.values
X = array[:, 0:8]
Y = array[:, 8]

num_trees = 100
max_features = 3 # number of randomly selected features considered at each split
n_splits = 10
seed = 7

kfold = KFold(n_splits = n_splits) # no shuffling, so random_state is omitted

model = RandomForestClassifier(n_estimators = num_trees,
                                max_features = max_features)

results = cross_val_score(model, X, Y, cv = kfold)
print('Mean accuracy: %.4f' % results.mean())

Mean accuracy: 0.7591


#### Extra Trees
- modification of bagging where random trees are constructed from samples of training data
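
To see how this differs from a random forest in scikit-learn, the sketch below inspects a fitted `ExtraTreesClassifier` on synthetic data (`X_demo`, `y_demo` and `extra` are illustrative names). With the library defaults, bootstrap sampling is switched off and each tree uses a random splitter, so most of the randomness comes from the split thresholds rather than from resampling rows.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic stand-in data (assumption: not the Pima dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=7)

extra = ExtraTreesClassifier(n_estimators=10, max_features=7, random_state=7)
extra.fit(X_demo, y_demo)

# Default is False: every tree is fitted on the full training data
print('bootstrap sampling:', extra.bootstrap)

# The underlying trees draw split thresholds at random
print('splitter used by each tree:', extra.estimators_[0].splitter)
```

Passing `bootstrap=True` would add row resampling on top, which is closer to the "samples of training data" wording above.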

In [14]:
#from pandas import read_csv
#from sklearn.model_selection import KFold
#from sklearn.model_selection import cross_val_score

from sklearn.ensemble import ExtraTreesClassifier

array = df.values
X = array[:, 0:8]
Y = array[:, 8]

num_trees = 100
max_features = 7
n_splits = 10
seed = 7

kfold = KFold(n_splits = n_splits) # no shuffling, so random_state is omitted
model = ExtraTreesClassifier(n_estimators = num_trees,
                             max_features = max_features)

results = cross_val_score(model, X, Y, cv = kfold)
print('Mean accuracy: %.4f' % results.mean())

Mean accuracy: 0.7656


## Boosting Algorithms
- AdaBoost
- Stochastic Gradient Boosting

Once created, the models in a boosting ensemble make predictions that are weighted by their demonstrated accuracy and combined into a final prediction.
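
The accuracy-based weighting can be illustrated with a hand-rolled version of discrete AdaBoost on decision stumps. This is a teaching sketch on synthetic data, not necessarily how `AdaBoostClassifier` is implemented; names such as `y_pm`, `sample_w` and `alphas` are made up for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the Pima dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=8,
                                     n_clusters_per_class=1, random_state=7)
y_pm = np.where(y_demo == 1, 1, -1)  # recode labels as -1 / +1

n_rounds = 5
sample_w = np.full(len(X_demo), 1.0 / len(X_demo))  # start with equal row weights
stumps, alphas = [], []
for _ in range(n_rounds):
    stump = DecisionTreeClassifier(max_depth=1)
    stump.fit(X_demo, y_pm, sample_weight=sample_w)
    pred = stump.predict(X_demo)
    err = sample_w[pred != y_pm].sum()                 # weighted error of this stump
    alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))  # lower error -> larger vote
    sample_w *= np.exp(-alpha * y_pm * pred)           # upweight misclassified rows
    sample_w /= sample_w.sum()
    stumps.append(stump)
    alphas.append(alpha)

# Final prediction: sign of the alpha-weighted sum of stump votes
ensemble_pred = np.sign(sum(a * s.predict(X_demo) for a, s in zip(alphas, stumps)))
print('vote weights:', np.round(alphas, 3))
print('Training accuracy: %.4f' % (ensemble_pred == y_pm).mean())
```

Each stump's vote weight `alpha` grows as its weighted error shrinks, and the rows it gets wrong receive more weight in the next round.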

#### AdaBoost

In [18]:
#from pandas import read_csv
#from sklearn.model_selection import KFold
#from sklearn.model_selection import cross_val_score

from sklearn.ensemble import AdaBoostClassifier

array = df.values
X = array[:, 0:8]
Y = array[:, 8]

num_trees = 30
seed = 7
n_splits = 10

kfold = KFold(n_splits = n_splits) # no shuffling, so random_state is omitted
model = AdaBoostClassifier(n_estimators = num_trees,
                           random_state = seed)

results = cross_val_score(model, X, Y, cv = kfold)
print('Mean accuracy: %.4f' % results.mean())

Mean accuracy: 0.7605


#### Stochastic Gradient Boosting (Gradient Boosting Machines)

In [20]:
#from pandas import read_csv
#from sklearn.model_selection import KFold
#from sklearn.model_selection import cross_val_score

from sklearn.ensemble import GradientBoostingClassifier

array = df.values
X = array[:, 0:8]
Y = array[:, 8]

num_trees = 30
seed = 7
n_splits = 10

kfold = KFold(n_splits = n_splits) # no shuffling, so random_state is omitted
model = GradientBoostingClassifier(n_estimators = num_trees,
                                   random_state = seed)

results = cross_val_score(model, X, Y, cv = kfold)
print('Mean accuracy: %.4f' % results.mean())

Mean accuracy: 0.7720


## Voting Ensembles

- voting classifier wraps several models and combines the predictions of the sub-models (by majority vote by default)
- predictions of sub-models can be weighted
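
Weighting the sub-models could look like the sketch below, which uses the `weights` parameter of `VotingClassifier` on synthetic data. The weights `[2, 1, 1]` are hypothetical and chosen only to show the mechanism; `X_demo`, `y_demo` and `weighted` are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the Pima dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=7)

estimators = [
    ('logistic', LogisticRegression(solver='liblinear')),
    ('cart', DecisionTreeClassifier(random_state=7)),
    ('svm', SVC(gamma='auto')),
]

# Hypothetical weights: the logistic regression vote counts double
weighted = VotingClassifier(estimators, voting='hard', weights=[2, 1, 1])
weighted.fit(X_demo, y_demo)
print('Training accuracy: %.4f' % weighted.score(X_demo, y_demo))
```

In practice the weights would normally be chosen from each model's cross-validated performance rather than set by hand.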

In [23]:
# in this example, logistic regression, a CART decision tree and an SVM are combined and wrapped in a voting classifier

#from pandas import read_csv
#from sklearn.model_selection import KFold
#from sklearn.model_selection import cross_val_score

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

from sklearn.ensemble import VotingClassifier

array = df.values
X = array[:, 0:8]
Y = array[:, 8]

n_splits = 10
seed = 7

kfold = KFold(n_splits = n_splits) # no shuffling, so random_state is omitted

# create an empty list and add the sub-models to it
estimators = []
mod1 = LogisticRegression(solver = 'liblinear')
estimators.append(('logistic', mod1))
mod2 = DecisionTreeClassifier()
estimators.append(('cart', mod2))
mod3 = SVC(gamma = 'auto')
estimators.append(('svm', mod3))

# create ensemble model
ensemble = VotingClassifier(estimators)

results = cross_val_score(ensemble, X, Y, cv = kfold)
print('Mean accuracy: %.4f' % results.mean())

Mean accuracy: 0.7356
