# Chapter 6 - Other Popular Machine Learning Methods
## Segment 6 - Ensemble methods with random forest

This is a classification problem in which we estimate the species label for iris flowers.

## Background 
#### Ensemble Method:
- a model that combines several base models to produce one optimal predictive model
- the ensemble should predict better than any of its individual components
- can be the same algorithm used more than once
    - e.g. random forest: an ensemble of decision trees
- or can be several different algorithms aggregated

#### Methods
- vote
    - several models are run and the prediction with the most votes is chosen
- average
    - several models are run and the average of their results is returned
- weighted average
    - a weight is assigned to each model before averaging, so some models count more than others
- bagging
    - trains each model on a random subset of the input data (sampled with replacement) and combines the results to give the final output
    - random forest uses this
    - decision tree application: different trees are trained on different subsets of the input and their results are aggregated
        - the aggregation method must be defined
- boosting
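
The vote, average, and weighted average methods listed above can be sketched with plain NumPy. The prediction and probability values below are made up for illustration, and the weights are an arbitrary assumption, so this is only a toy demonstration of the arithmetic rather than output from real models.

```python
import numpy as np

# Hypothetical class predictions from three models for five samples
# (rows = models, columns = samples). Values are invented for illustration.
preds = np.array([
    [0, 1, 2, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 2, 2, 1, 1],
])

# Vote: for each sample, pick the class predicted by the most models.
voted = np.array([np.bincount(col, minlength=3).argmax() for col in preds.T])
print("vote:", voted)

# Hypothetical class probabilities from three models for ONE sample
# (rows = models, columns = classes).
probas = np.array([
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.5, 0.4, 0.1],
])

# Average: every model counts equally.
avg = probas.mean(axis=0)
print("average:", avg, "-> class", avg.argmax())

# Weighted average: the second model is trusted more (weights are an assumption).
weights = np.array([0.2, 0.6, 0.2])
weighted = np.average(probas, axis=0, weights=weights)
print("weighted average:", weighted, "-> class", weighted.argmax())
```

With these invented numbers the plain average favours class 0 while the weighted average favours class 1, which shows how the choice of weights can change the final answer.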

#### Random Forest Steps 
1. create a random subset of the original data
2. randomly select a set of features at each node of the decision tree
3. decide the best split among those features
4. create such a model for each subset of data
5. aggregate the individual models' predictions (an average for regression; a vote or averaged class probabilities for classification)
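
The steps above can be sketched by hand with scikit-learn's `DecisionTreeClassifier`. This is a simplified illustration, not how `RandomForestClassifier` is implemented internally: the tree count (`n_trees = 25`), the seeds, and the use of a hard majority vote in the last step are all assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)   # seed chosen arbitrarily for reproducibility
n_trees = 25                     # hypothetical ensemble size

trees = []
for i in range(n_trees):
    # Step 1: random subset of the data (rows drawn with replacement)
    idx = rng.integers(0, len(X), size=len(X))
    # Steps 2-3: max_features="sqrt" makes the tree consider a random
    # subset of features at each node and pick the best split among them
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    # Step 4: one model per subset
    tree.fit(X[idx], y[idx])
    trees.append(tree)

# Step 5: aggregate the individual predictions, here by majority vote
all_preds = np.array([t.predict(X) for t in trees])   # shape (n_trees, n_samples)
forest_pred = np.array(
    [np.bincount(col, minlength=3).argmax() for col in all_preds.T]
)
print("training accuracy of the hand-built ensemble:", (forest_pred == y).mean())
```

The accuracy printed here is measured on the training data, so it says little about generalisation; it only confirms that the aggregation works.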

In [1]:
import numpy as np
import pandas as pd

import sklearn.datasets as datasets
from sklearn.model_selection import train_test_split 
from sklearn import metrics

In [2]:
from sklearn.ensemble import RandomForestClassifier

In [3]:
iris = datasets.load_iris()

df = pd.DataFrame(iris.data, columns=iris.feature_names)
y = pd.DataFrame(iris.target)

y.columns = ['labels']

print(df.head())
y[0:5]

   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                5.1               3.5                1.4               0.2
1                4.9               3.0                1.4               0.2
2                4.7               3.2                1.3               0.2
3                4.6               3.1                1.5               0.2
4                5.0               3.6                1.4               0.2


   labels
0       0
1       0
2       0
3       0
4       0


The data set contains information on the:
- sepal length (cm)
- sepal width (cm)  
- petal length (cm)  
- petal width (cm)
- species type

In [4]:
# check each column for missing values
df.isnull().any()

sepal length (cm)    False
sepal width (cm)     False
petal length (cm)    False
petal width (cm)     False
dtype: bool

In [5]:
print(y.labels.value_counts())

2    50
1    50
0    50
Name: labels, dtype: int64


# Preparing the data for training the model

In [6]:
X_train, X_test, y_train, y_test = train_test_split(df, y, test_size=.2, random_state=17)
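
The split above is not stratified, which is why the support column of the classification report further down shows 7, 11 and 12 test samples per class. As an optional variation (an assumption, not what this notebook does), `train_test_split` accepts a `stratify` argument that keeps the class proportions equal in both parts. A minimal self-contained sketch:

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
y = pd.DataFrame(iris.target, columns=["labels"])

# Same test_size and random_state as the notebook; stratify=y is the only addition
X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.2, random_state=17, stratify=y
)

print(X_train.shape, X_test.shape)
print(y_test["labels"].value_counts().sort_index())
```

Because the stratified split selects different rows, the scores it leads to will not match the ones reported in this notebook.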

# Build a Random Forest model

In [7]:
# n_estimators = number of trees to build in the forest
classifier = RandomForestClassifier(n_estimators=200, random_state=0)

# flatten the single-column target DataFrame into the 1-D array that fit() expects
y_train_array = np.ravel(y_train)

classifier.fit(X_train, y_train_array)

y_pred = classifier.predict(X_test)
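
To connect the fitted model back to the aggregation step in the background notes, the sketch below rebuilds the same model on plain arrays and checks that the forest's class probabilities are the mean of its individual trees' probabilities. The variable names (`forest`, `manual_proba`, `manual_pred`) are hypothetical and introduced only for this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Same split and model settings as above, rebuilt with NumPy arrays
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=17
)

forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X_train, y_train)

# estimators_ holds the individual fitted decision trees
tree_probas = np.array([tree.predict_proba(X_test) for tree in forest.estimators_])
manual_proba = tree_probas.mean(axis=0)          # average over the trees

# Predicted class = class with the highest averaged probability
manual_pred = forest.classes_[manual_proba.argmax(axis=1)]

print("number of trees:", len(forest.estimators_))
print("matches predict_proba:", np.allclose(manual_proba, forest.predict_proba(X_test)))
print("matches predict:", bool((manual_pred == forest.predict(X_test)).all()))
```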

# Evaluating the model on the test data

In [8]:
print(metrics.classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00         7
           1       0.92      1.00      0.96        11
           2       1.00      0.92      0.96        12

    accuracy                           0.97        30
   macro avg       0.97      0.97      0.97        30
weighted avg       0.97      0.97      0.97        30



In [9]:
y_test_array = np.ravel(y_test)
print(y_test_array)

[0 1 2 1 2 2 1 2 1 2 2 0 1 0 2 0 0 2 2 2 2 0 2 1 1 1 1 1 0 1]


In [10]:
print(y_pred)

[0 1 2 1 2 2 1 2 1 2 2 0 1 0 2 0 0 2 2 2 1 0 2 1 1 1 1 1 0 1]
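
Rather than comparing the two printed arrays by eye, the differences can be located programmatically. The sketch below copies the two arrays printed above into hypothetical variables `y_true` and `y_hat` and uses only NumPy to find the mismatches and to recompute the precision and recall shown in the classification report.

```python
import numpy as np

# Copied from the printed y_test_array and y_pred above
y_true = np.array([0, 1, 2, 1, 2, 2, 1, 2, 1, 2, 2, 0, 1, 0, 2,
                   0, 0, 2, 2, 2, 2, 0, 2, 1, 1, 1, 1, 1, 0, 1])
y_hat = np.array([0, 1, 2, 1, 2, 2, 1, 2, 1, 2, 2, 0, 1, 0, 2,
                  0, 0, 2, 2, 2, 1, 0, 2, 1, 1, 1, 1, 1, 0, 1])

# Positions where the prediction differs from the true label
mismatch_idx = np.flatnonzero(y_true != y_hat)
for i in mismatch_idx:
    print(f"position {i}: true={y_true[i]}, predicted={y_hat[i]}")

accuracy = (y_true == y_hat).mean()
print("accuracy:", round(accuracy, 2))

# Confusion matrix: rows = true class, columns = predicted class
cm = np.zeros((3, 3), dtype=int)
np.add.at(cm, (y_true, y_hat), 1)
print(cm)

# Per-class precision (column-wise) and recall (row-wise)
precision = cm.diagonal() / cm.sum(axis=0)
recall = cm.diagonal() / cm.sum(axis=1)
print("precision:", precision.round(2))
print("recall:", recall.round(2))
```

The single mismatch is a class 2 flower predicted as class 1, which accounts for the 0.92 precision of class 1 and the 0.92 recall of class 2 in the report.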
