# Homework 4 - Ensemble Methods and Decision Trees 
## CSCI 5622 - Spring 2019
***
**Name**: Akash Iyengar
***

This assignment is due on Canvas by **11:59 PM on Wednesday, March 20**. Submit only this Jupyter notebook to Canvas. Do not compress it using tar, rar, zip, etc. Your solutions to analysis questions should be done in Markdown directly below the associated question. Remember that you are encouraged to discuss the problems with your classmates and instructors, but **you must write all code and solutions on your own**, and list any people or sources consulted.


## Dataset
***
Please do not change this class. We will use the MNIST dataset for this assignment. You have previously trained KNN and Logistic Regression on this dataset. You will now be using Ensemble methods and Decision Trees. This is a good opportunity to compare the results of different Machine Learning Algorithms on the dataset.

This is a binary classification task. We have only included 3's and 8's for this task.

In [17]:
import numpy as np
from sklearn.base import clone

In [18]:
class ThreesAndEights:
    """
    Class to store MNIST data
    """

    def __init__(self, location):

        import pickle, gzip

        # Load the dataset
        f = gzip.open(location, 'rb')

        # Split the data set
        train_set, valid_set, test_set = pickle.load(f)
    
        X_train, y_train = train_set
        X_valid, y_valid = valid_set

        # Extract only 3's and 8's for training set 
        self.X_train = X_train[np.logical_or( y_train==3, y_train == 8), :]
        self.y_train = y_train[np.logical_or( y_train==3, y_train == 8)]
        self.y_train = np.array([1 if y == 8 else -1 for y in self.y_train])
        
        # Shuffle the training data 
        shuff = np.arange(self.X_train.shape[0])
        np.random.shuffle(shuff)
        self.X_train = self.X_train[shuff,:]
        self.y_train = self.y_train[shuff]

        # Extract only 3's and 8's for validation set 
        self.X_valid = X_valid[np.logical_or( y_valid==3, y_valid == 8), :]
        self.y_valid = y_valid[np.logical_or( y_valid==3, y_valid == 8)]
        self.y_valid = np.array([1 if y == 8 else -1 for y in self.y_valid])
        
        f.close()

In [19]:
data = ThreesAndEights("data/mnist.pklz")

Feel free to explore this data and get comfortable with it before proceeding further.

## Bagging
Bootstrap aggregating, also called bagging, is a machine learning ensemble meta-algorithm designed to improve the stability and accuracy of machine learning algorithms used in statistical classification and regression. It also reduces variance and helps to avoid overfitting. Although it is usually applied to decision tree methods, it can be used with any type of method. Bagging is a special case of the model averaging approach.

Given a standard training set $D$ of size $n$, bagging generates $N$ new training sets $D_i$, each of size roughly $n \times \text{ratio}$, by sampling from $D$ uniformly and with replacement. Because the sampling is with replacement, some observations may be repeated in each $D_i$. The $N$ models are fitted using the $N$ bootstrapped samples above and combined by averaging the outputs (for regression) or voting (for classification).

-Source [Wiki](https://en.wikipedia.org/wiki/Bootstrap_aggregating)
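
As a minimal sketch of the sampling step described above (the helper name `bootstrap_indices` and the seed are illustrative assumptions, not part of the assignment skeleton), drawing indices uniformly and with replacement could look like this:

```python
import numpy as np

def bootstrap_indices(n, ratio, rng):
    """Draw int(n * ratio) indices from range(n) uniformly, with replacement."""
    size = int(n * ratio)
    return rng.integers(0, n, size=size)

rng = np.random.default_rng(0)  # seed chosen only to make the sketch repeatable
n = 1000
idx = bootstrap_indices(n, ratio=1.0, rng=rng)

# Because sampling is with replacement, some rows repeat and others are left out.
unique_fraction = len(np.unique(idx)) / n
print(len(idx), unique_fraction)
```

With `ratio = 1`, the expected fraction of distinct observations in a bootstrap sample is $1 - (1 - 1/n)^n$, which approaches $1 - 1/e \approx 0.632$ for large $n$.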

## Implementing Bagging [25 points]
***

We've given you a skeleton of the class `BaggingClassifier` below which will train a classifier based on the decision trees as implemented by sklearn. Your tasks are as follows, please approach step by step to understand the code flow:
* Implement `bootstrap` method which takes in two parameters (`X_train, y_train`) and returns a bootstrapped training set ($D_i$)
* Implement `fit` method which takes in two parameters (`X_train, y_train`) and trains `N` base models on different bootstrap samples. You should call the `bootstrap` method to get bootstrapped training data for each of your base models.
* Implement `voting` method which takes the predictions `y_hats` from the learners trained on bootstrapped data points and returns the final prediction by majority rule. In case of a tie, return either class at random.
* Implement `predict` method which takes in multiple data points and returns final prediction for each one of those. Please use the `voting` method to reach consensus on final prediction.
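
The majority rule with random tie-breaking can be sketched for a whole batch of points at once. This is only an illustration under stated assumptions: the function name `majority_vote`, the `[n_learners, n_samples]` layout, and the seeded generator are hypothetical and are not part of the skeleton below.

```python
import numpy as np

def majority_vote(votes, rng):
    """Combine votes of shape [n_learners, n_samples] with labels in {-1, +1}."""
    totals = np.sum(votes, axis=0)       # positive: +1 wins, negative: -1 wins
    result = np.sign(totals).astype(int)
    ties = result == 0
    # Break ties by picking -1 or +1 uniformly at random.
    result[ties] = rng.choice([-1, 1], size=int(ties.sum()))
    return result

rng = np.random.default_rng(0)
votes = np.array([[ 1, -1,  1],
                  [ 1, -1, -1],
                  [-1, -1,  1],
                  [ 1,  1, -1]])
print(majority_vote(votes, rng))
```

The first column has three votes for +1, the second has three votes for -1, and the third is a 2-2 tie that is resolved at random.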

In [20]:
from sklearn.tree import DecisionTreeClassifier

class BaggingClassifier:
    def __init__(self, ratio = 0.20, N = 20, base=DecisionTreeClassifier(max_depth=4)):
        """
        Create a new BaggingClassifier
        
        Args:
            base (BaseEstimator, optional): Sklearn implementation of decision tree
            ratio: ratio of number of data points in subsampled data to the actual training data
            N: number of base estimator in the ensemble
        
        Attributes:
            base (estimator): Sklearn implementation of decision tree
            N: Number of decision trees
            learners: List of models trained on bootstrapped data sample
        """
        
        assert ratio <= 1.0, "Cannot have ratio greater than one"
        self.base = base
        self.ratio = ratio
        self.N = N
        self.learners = []
        
    def fit(self, X_train, y_train):
        """
        Train Bagging Ensemble Classifier on data
        
        Args:
            X_train (ndarray): [n_samples x n_features] ndarray of training data   
            y_train (ndarray): [n_samples] ndarray of data 
        """
        # Fit N independent clones of the base model, each on its own bootstrap sample.
        self.learners = []  # reset so that calling fit again does not accumulate learners
        for _ in range(self.N):
            h = clone(self.base)
            bt_X, bt_y = self.boostrap(X_train, y_train)
            self.learners.append(h.fit(bt_X, bt_y))
  
        
        
    def boostrap(self, X_train, y_train):
        
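        """
        Draw a bootstrap sample of size int(n_samples * ratio), with replacement

        Args:
            X_train (ndarray): [n_samples x n_features] ndarray of training data   
            y_train (ndarray): [n_samples] ndarray of data 

        Returns:
            (bstrapped_X, bstrapped_y): the sampled rows and their labels
        """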
        n=len(X_train)
        r_indi = np.random.choice(np.arange(n),size = int(n*self.ratio), replace = True)
        bstrapped_X = X_train[r_indi]
        bstrapped_y = y_train[r_indi]
       
        return bstrapped_X, bstrapped_y
    
    def predict(self, X):
        """
        BaggingClassifier prediction for data points in X
        
        Args:
            X (ndarray): [n_samples x n_features] ndarray of data 
            
        Returns:
            yhat (ndarray): [n_samples] ndarray of predicted labels {-1,1}
        """
        yhat = []
        for di in X:
            # Collect one prediction per learner for this point, then vote.
            y_pdict = [learn.predict([di]) for learn in self.learners]
            yhat.append(self.voting(y_pdict))

        return np.array(yhat)
    
    
    def voting(self, y_hats):
        """
        Args:
            y_hats (ndarray): [N] ndarray of data
        Returns:
            y_final : int, final prediction of the 
        """
        # Labels are in {-1, +1}, so the sign of the summed votes gives the majority.
        tot = 0
        for xi in y_hats:
            tot += sum(xi)
            
        if tot < 0:
            y_f = -1
        elif tot > 0:
            y_f = 1
        else:
            # Tie: pick either label uniformly at random.
            y_f = int(np.random.choice([-1, 1]))
            
        return y_f
     

## BaggingClassifier for Handwritten Digit Recognition [10 points]
***

After you've successfully completed `BaggingClassifier` find the optimal values of `ratio`, `N` and `depth` using k-fold cross validation. You are allowed to use sklearn library to split your training data in folds. Use the data from `ThreesAndEights` class initialized variable `data`. Hyperparameter tuning as you may have noticed is very important in Machine Learning.  

Justify why those values are optimal.

Report accuracy on the validation data using the optimal parameter values.

__Note__: This might take a little longer time than usual to run (i.e. several minutes). This is true for the ensemble methods you will implement below as well.
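
The cross-validation procedure can be sketched as follows. This is a hand-rolled NumPy stand-in for sklearn's `KFold`, shown only to make the mechanics explicit; the names `kfold_indices`, `cross_val_accuracy`, and the toy `MajorityLabel` model are hypothetical, and the data is synthetic.

```python
import numpy as np

def kfold_indices(n, k, rng):
    """Yield (train_idx, test_idx) pairs for k shuffled folds over n samples."""
    folds = np.array_split(rng.permutation(n), k)
    for i in range(k):
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, folds[i]

def cross_val_accuracy(make_model, X, y, k, rng):
    """Average held-out accuracy of make_model() over k folds."""
    scores = []
    for train, test in kfold_indices(len(y), k, rng):
        model = make_model()
        model.fit(X[train], y[train])
        scores.append(np.mean(np.asarray(model.predict(X[test])) == y[test]))
    return float(np.mean(scores))

class MajorityLabel:
    """Toy stand-in for a real classifier: always predicts the commonest training label."""
    def fit(self, X, y):
        labels, counts = np.unique(y, return_counts=True)
        self.label_ = labels[np.argmax(counts)]
        return self

    def predict(self, X):
        return np.full(len(X), self.label_)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = np.where(np.arange(100) < 70, 1, -1)   # 70 positives, 30 negatives
print(cross_val_accuracy(MajorityLabel, X, y, k=5, rng=rng))
```

Because the toy model always predicts the majority class, its cross-validated accuracy is about 0.7 here. Any classifier exposing `fit` and `predict` could be passed as `make_model` instead.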

In [139]:
N_r = np.arange(5,25,5)
r_r = np.arange(.1, 1, 0.4)
d_r = np.arange(3,24,7)

In [140]:
from sklearn.model_selection import KFold
Num_k = 5

print("N", N_r)
print("ratio", r_r)
print("depth", d_r)

kfold = KFold(n_splits=Num_k, shuffle=True, random_state=555)
D = []

itera = 0

for n in N_r:
    for r in r_r:
        for d in d_r:
            itera += 1
            param = (n, r, d)
            error = 0
            c_f_a = [] 
            for train, test in kfold.split(data.X_train):
                trained_tree = BaggingClassifier(ratio = r, N = n, base = DecisionTreeClassifier(max_depth = d))
                trained_tree.fit(data.X_train[train], data.y_train[train])
                
                y_pred = np.array(trained_tree.predict(data.X_train[test]))
                y_actual = data.y_train[test]     
                
                total_errors = y_pred != y_actual
                error = np.sum(total_errors)
                percent_accuracy = 1 - error/len(y_actual)
                c_f_a.append(percent_accuracy)

            
            average_accuracy = np.average(c_f_a)
            D.append((average_accuracy, param))
            print(itera)

N [ 5 10 15 20]
ratio [0.1 0.5 0.9]
depth [ 3 10 17]
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36


<center>__Expected accuracy is about 97%__</center>

In [170]:
s_b_e = sorted(D, key = lambda tup: tup[0], reverse = True)
    
for e in s_b_e:
    
    print("Accuracy for : %.2f, the value of N: %i, with  ratio: %.1f, and depth: %i" %(e[0],e[1][0],e[1][1],e[1][2]))

Accuracy for : 0.98, the value of N: 15, with  ratio: 0.5, and depth: 17
Accuracy for : 0.97, the value of N: 15, with  ratio: 0.9, and depth: 17
Accuracy for : 0.97, the value of N: 20, with  ratio: 0.9, and depth: 17
Accuracy for : 0.97, the value of N: 20, with  ratio: 0.5, and depth: 10
Accuracy for : 0.97, the value of N: 20, with  ratio: 0.9, and depth: 10
Accuracy for : 0.97, the value of N: 15, with  ratio: 0.9, and depth: 10
Accuracy for : 0.97, the value of N: 10, with  ratio: 0.9, and depth: 17
Accuracy for : 0.97, the value of N: 10, with  ratio: 0.5, and depth: 17
Accuracy for : 0.97, the value of N: 20, with  ratio: 0.5, and depth: 17
Accuracy for : 0.97, the value of N: 5, with  ratio: 0.9, and depth: 17
Accuracy for : 0.97, the value of N: 15, with  ratio: 0.5, and depth: 10
Accuracy for : 0.97, the value of N: 10, with  ratio: 0.9, and depth: 10
Accuracy for : 0.97, the value of N: 5, with  ratio: 0.9, and depth: 10
Accuracy for : 0.97, the value of N: 10, with  ratio:

In [171]:
### Testing w/ chosen values from above

f_N =15
f_r = 0.5
f_d = 17


BaggedTr = BaggingClassifier(ratio= f_r,N= f_N,base=DecisionTreeClassifier(max_depth = f_d))

BaggedTr.fit(data.X_train, data.y_train)
Bagged_predic = BaggedTr.predict(data.X_valid)
act = data.y_valid

Bagged_t_errors = Bagged_predic != act
B_error = np.sum(Bagged_t_errors)
B_percent = 100*(1 - B_error/len(act))

In [172]:
print("Using %i trees, a ratio of %.1f, and a maximum depth of %i, I got an accuracy of %.1f on my test set." %(f_N, f_r, f_d, B_percent))

Using 15 trees, a ratio of 0.5, and a maximum depth of 17, I got an accuracy of 98.0 on my test set.


> I ran a nested for loop over all combinations of the three hyperparameters (4 values of `N`, 3 of `ratio` and 3 of `depth`, so 36 configurations, each evaluated with 5-fold cross-validation). To improve the running time, I could instead have used three separate loops, varying one hyperparameter at a time while holding the others constant. From the results, depth appears to have the largest influence on accuracy: the top configurations all use a depth of 10 or 17. In general a deeper tree is more prone to overfitting, which gives better accuracy on the training data but worse accuracy on held-out data. Here, however, the deepest value I tried (17) still gave the best cross-validated accuracy, so I used a depth of 17. I used a ratio of 0.5, so each tree is trained on a bootstrap sample half the size of the training set. I expected a lower ratio to give better results, but the top configurations all use 0.5 or 0.9. The number of trees has minimal effect on the results over the range tested. I used 15 trees since that gave the best result.
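
The running-time trade-off between the two search strategies can be made concrete by counting configurations. This is a sketch only: the `base` values below are an arbitrary illustrative baseline, not values taken from the experiments.

```python
import itertools
import numpy as np

N_values = np.arange(5, 25, 5)
ratio_values = np.arange(0.1, 1, 0.4)
depth_values = np.arange(3, 24, 7)

# Full grid: every combination of the three hyperparameters.
grid = list(itertools.product(N_values, ratio_values, depth_values))

# One-at-a-time: vary a single hyperparameter while the others stay fixed.
base = {"N": 15, "ratio": 0.5, "depth": 10}   # baseline chosen for illustration only
one_at_a_time = (
    [dict(base, N=n) for n in N_values]
    + [dict(base, ratio=r) for r in ratio_values]
    + [dict(base, depth=d) for d in depth_values]
)
print(len(grid), len(one_at_a_time))   # 36 10
```

The one-at-a-time search trains far fewer models, but it cannot detect interactions between hyperparameters (for example, a depth that only helps when the ratio is large).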

# Random Decision Tree [35 points]

In this assignment you are going to implement a random decision tree using random vector method as discussed in the lecture.

Best split: One that achieves maximum reduction in gini index across multiple candidate splits. (decided by `candidate_splits` attribute of the class `RandomDecisionTree`)
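
As a small worked example of the criterion above, the Gini impurity of a label set is $1 - \sum_k p_k^2$, and the gain of a split is the parent impurity minus the size-weighted impurities of the two children. The helper names `gini` and `gini_gain` are illustrative and are not the method names required by the assignment.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum_k p_k^2 (equals 2p(1-p) for two classes)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_gain(parent, left, right):
    """Reduction in impurity from splitting parent into left and right."""
    n = len(parent)
    return gini(parent) - len(left) / n * gini(left) - len(right) / n * gini(right)

parent = [1, 1, 1, -1, -1, -1]
print(gini(parent))                                  # evenly mixed: 0.5
print(gini_gain(parent, [1, 1, 1], [-1, -1, -1]))    # perfect split: gain 0.5
print(gini_gain(parent, [1, -1], [1, 1, -1, -1]))    # uninformative split: gain about 0
```

A split that leaves both children as mixed as the parent yields no gain, so it should never be preferred over a split that separates the classes.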

Use `TreeNode` class as node abstraction to build the tree

You are allowed to add new attributes in the `TreeNode` and `RandomDecisionTree` class - if that helps.

Your tasks are as follows:
* Implement `gini_index` method which takes in class labels as parameter and returns the gini impurity as measure of uncertainty

* Implement `majority` method which picks the most frequent class label. In case of tie return any random class label

* Implement `find_best_split` method which finds the random vector/hyperplane which causes most reduction in the gini index. 

* Implement `build_tree` method which uses `find_best_split` method to get the best random split vector for current set of training points. This vector partitions the training points into two sets, and you should call `build_tree` method on two partitioned sets and build left subtree and right subtree. Use `TreeNode` as abstraction for a node.

> The method calls itself recursively to generate the left and right subtrees until either `max_depth` is reached or no good random split is found. When either of these two cases is encountered, make that node a leaf and set its label to the most frequent class (use the `majority` method). Go through the lecture slides for a better understanding.

* Implement `predict` method which takes in multiple data points and returns final prediction for each one of those using the tree built. (`root` attribute of the class)
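
The partitioning step in `build_tree` can be sketched as below. The sign convention (negative dot product goes left) and drawing each component from a uniform distribution on $[-1, 1]$ are assumptions that happen to match the implementation in this notebook; the helper name `split_by_hyperplane` is hypothetical.

```python
import numpy as np

def split_by_hyperplane(X, w):
    """Return index arrays for points with X @ w < 0 (left) and >= 0 (right)."""
    side = X @ w
    return np.where(side < 0)[0], np.where(side >= 0)[0]

rng = np.random.default_rng(0)
X = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])

w = np.array([1.0, 1.0])            # fixed vector so the result is easy to check
left, right = split_by_hyperplane(X, w)
print(left, right)

# A random candidate, drawn the same way for every feature dimension.
w_random = rng.uniform(-1, 1, size=X.shape[1])
left_r, right_r = split_by_hyperplane(X, w_random)
```

With `w = (1, 1)` the last two points have a negative projection and go left, and the first two go right. Note that a vector without an offset term always defines a hyperplane through the origin.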

In [11]:
class TreeNode:
    def __init__(self):
        self.left = None
        self.right = None
        self.isLeaf = False
        self.label = None
        self.split_vector = None

    def getLabel(self):
        if not self.isLeaf:
            raise Exception("Should not call getLabel on a non-leaf node")
        return self.label
    
class RandomDecisionTree:
            
    def __init__(self, candidate_splits = 100, depth = 10):
        """
        Args:
            candidate_splits (int) : number of random decision splits to test
            depth (int) : maximum depth of the random decision tree
        """
        self.candidate_splits = candidate_splits
        self.depth = depth
        self.root = None
    
    def fit(self, X_train, y_train):
        """
        Args:
            X_train (ndarray): [n_samples x n_features] ndarray of training data   
            y_train (ndarray): [n_samples] ndarray of data
            
        """
        self.root = self.build_tree(X_train[:], y_train[:], 0)
        return self
        
    def build_tree(self, X_train, y_train, height):
        """
        Args:
            X_train (ndarray): [n_samples x n_features] ndarray of training data   
            y_train (ndarray): [n_samples] ndarray of data
            
        """
        node = TreeNode()

        # Stop before searching for a split when the depth limit is reached
        # or there are too few points left to partition.
        if height == self.depth or len(y_train) < 2:
            node.isLeaf = True
            node.label = self.majority(y_train)
            return node

        c_s = self.find_best_split(X_train, y_train)

        # Partition the points by which side of the hyperplane they fall on.
        goes_left = np.dot(X_train, c_s) < 0
        l_p, r_p = np.where(goes_left)[0], np.where(~goes_left)[0]

        # A split that leaves one side empty is useless, so make this node a leaf.
        if len(l_p) == 0 or len(r_p) == 0:
            node.isLeaf = True
            node.label = self.majority(y_train)
            return node

        node.left = self.build_tree(X_train[l_p], y_train[l_p], height + 1)
        node.right = self.build_tree(X_train[r_p], y_train[r_p], height + 1)
        node.split_vector = c_s
        return node
    
    def find_best_split(self, X_train, y_train):
        """
        Args:
            X_train (ndarray): [n_samples x n_features] ndarray of training data   
            y_train (ndarray): [n_samples] ndarray of data
            
        """
        u_i = self.gini_index(y_train)
        T_P = len(y_train)
        
        dims = len(X_train[0])
        n_c = self.candidate_splits
        max_g = float(-np.inf)
        split_vector = None
        for _ in range(n_c):
            # Draw a random hyperplane (through the origin) and partition the points by side.
            candidate_split = np.random.uniform(-1, 1, dims)
            goes_left = np.dot(X_train, candidate_split) < 0
            left, right = np.where(goes_left)[0], np.where(~goes_left)[0]
            p_Left, p_Right = len(left)/T_P, len(right)/T_P

            # An empty side contributes no impurity.
            u_Left = self.gini_index(y_train[left]) if p_Left > 0 else 0
            u_Right = self.gini_index(y_train[right]) if p_Right > 0 else 0

            # Keep the candidate with the largest reduction in Gini impurity.
            Gain_of_split = u_i - (p_Left*u_Left) - (p_Right*u_Right)
            if Gain_of_split > max_g:
                split_vector = candidate_split
                max_g = Gain_of_split
        return split_vector
        
            
        
    def gini_index(self, y):
        """
        Args:
            y (ndarray): [n_samples] ndarray of data
        """
        T_P = len(y)
        if T_P == 0:
            return 0.0

        # With two classes the Gini impurity is 2p(1-p), where p is the
        # fraction of points in either class (the formula is symmetric in p).
        P = np.sum(np.asarray(y) == -1)
        u = 2 * (P/T_P) * (1 - (P/T_P))
        return u

    
    def majority(self, y):
        """
        Return the major class in ndarray y
        """
        labels, counts = np.unique(y, return_counts=True)
        # Break ties uniformly at random among the most frequent labels.
        m_l = np.random.choice(labels[counts == counts.max()])
        return m_l
                    
    
    def predict(self, X):
        """
        RandomDecisionTree prediction for new data points in X
        
        Args:
            X (ndarray): [n_samples x n_features] ndarray of data 
            
        Returns:
            yhat (ndarray): [n_samples] ndarray of predicted labels {-1,1}
        """
        yhat = []
        
        for point in X:
            node = self.root            
            while not node.isLeaf:
                direction = np.dot(point, node.split_vector)
                if direction < 0:
                    node = node.left

                else:
                    node = node.right
   
            y_pred = node.getLabel()
            yhat.append(y_pred)
                
        return yhat

## RandomDecisionTree for Handwritten Digit Recognition

After you've successfully completed `RandomDecisionTree`, train it using the default values in the constructor and report the accuracy on the test set. Use the data from the `ThreesAndEights` class initialized variable `data`.

In [196]:
r_t = RandomDecisionTree()

r_t.fit(data.X_train, data.y_train)
predict = r_t.predict(data.X_valid)
act = data.y_valid

total_e = predict != act
err = np.sum(total_e)
perce = 100*(1 - err/len(data.y_valid))

In [199]:
print("The accuracy of my random decision tree is %.1f percent. " %(perce))

The accuracy of my random decision tree is 90.1 percent. 


<center>__Expected accuracy is about 90%__</center>

# Random Forest [20 points]
Random forests or random decision forests are an ensemble learning method for classification, regression and other tasks, that operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees. Random decision forests correct for decision trees' habit of overfitting to their training set.
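
One way to see why voting over many trees can help is a small calculation under an idealized assumption: if each tree errs independently with the same probability, the majority vote is wrong only when more than half of the trees are wrong. The function name `majority_error` and the error rate 0.1 are illustrative assumptions, not measurements from this notebook.

```python
from math import comb

def majority_error(n_trees, p_err):
    """P(majority wrong) for n_trees independent voters, each wrong with prob p_err.

    n_trees is assumed odd so that ties cannot occur.
    """
    k_min = n_trees // 2 + 1
    return sum(comb(n_trees, k) * p_err**k * (1 - p_err)**(n_trees - k)
               for k in range(k_min, n_trees + 1))

for n in (1, 5, 15):
    print(n, round(majority_error(n, 0.1), 6))
```

Trees trained on overlapping bootstrap samples are correlated in practice, so the real improvement is smaller than this independent-voter bound suggests.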

Random forest trains random decision trees on bootstrapped training points. Thus, you can reuse the implementations of the methods (`bootstrap`, `predict`) from the `BaggingClassifier` class directly. The only difference is that you have to use the `RandomDecisionTree` you implemented previously as the base, instead of sklearn's implementation of `DecisionTreeClassifier`. Implement the `fit` method in the class below accordingly.

In [21]:
class RandomForest(BaggingClassifier):
    def __init__(self, ratio = 0.20, N = 20, max_depth = 10, candidate_splits = 100):
        self.ratio = ratio
        self.N = N
        self.learners = []
        self.candidate_splits = candidate_splits
        self.max_depth = max_depth
        
    def fit(self, X_train, y_train):
        """
        Train Bagging Ensemble Classifier on data
        
        Args:
            X_train (ndarray): [n_samples x n_features] ndarray of training data   
            y_train (ndarray): [n_samples] ndarray of data 
        """
        self.learners = []  # reset so that calling fit again does not accumulate learners
        for _ in range(self.N):
            r_t = RandomDecisionTree(candidate_splits = self.candidate_splits, depth = self.max_depth)
            bt_X, bt_y = self.boostrap(X_train, y_train)
            self.learners.append(r_t.fit(bt_X, bt_y))

      

## RandomForest for Handwritten Digit Recognition [10 points]
***

After you've successfully completed `RandomForest`, find the optimal values of `ratio`, `N`, `candidate_splits` and `depth` using k-fold cross validation. Feel free to use the sklearn library to split your training data. Use the data from the `ThreesAndEights` class initialized variable `data`.

Justify why those values are optimal.

Report best accuracy on the testing data using those optimal parameter values.

In [22]:
N_c = 25
r_c = .2
d_c = 4
c_c = 100


In [204]:
# by varying N

from sklearn.model_selection import KFold
Num_k = 5

N_r = np.arange(5, 25, 5)
r = r_c
d = d_c
c = c_c
print("N", N_r)
print("ratio", r)
print("depth", d)
print("candidates", c)

kfold = KFold(n_splits=Num_k, shuffle=True, random_state=555)
N_s_f = []


for n in N_r:
    e = 0
    c_f_a = []
    params = (n,r,d,c)
    for train, test in kfold.split(data.X_train):
        forest = RandomForest(ratio = r, N = n, max_depth = d, candidate_splits = c )
        forest.fit(data.X_train[train], data.y_train[train])

        y_pred = np.array(forest.predict(data.X_train[test]))
        y_actual = data.y_train[test]     

        t_e = y_pred != y_actual
        e = np.sum(t_e)
        per_acc = 1 - e/len(y_actual)
        c_f_a.append(per_acc)


    N_a_a_f = np.average(c_f_a)
    N_s_f.append((N_a_a_f, n))

N [ 5 10 15 20]
ratio 0.2
depth 4
candidates 100


In [207]:
s_b_e1 = sorted(N_s_f, key = lambda tup: tup[0], reverse = True)
    
for e in s_b_e1:
  
    print("Accuracy: %.2f, N: %i" %(e[0],e[1]))


Accuracy: 0.93, N: 20
Accuracy: 0.93, N: 15
Accuracy: 0.93, N: 10
Accuracy: 0.91, N: 5


In [208]:
#Ratio varied

from sklearn.model_selection import KFold
Num_k = 5


r_r = np.arange(.1, 1, .4)
n = N_c
d = d_c
c = c_c
print("ratio",r_r)
print("N", n)
print("depth", d)
print("candidates", c)

kfold = KFold(n_splits=Num_k, shuffle=True, random_state=555)
r_s_f = []

for r in r_r:
    e = 0
    c_f_a = [] 
    for train, test in kfold.split(data.X_train):
        forest = RandomForest(ratio = r, N = n, max_depth = d, candidate_splits = c )
        forest.fit(data.X_train[train], data.y_train[train])

        y_pred = np.array(forest.predict(data.X_train[test]))
        y_actual = data.y_train[test] 
        
        t_e = y_pred != y_actual
        e = np.sum(t_e)
        per_acc = 1 - e/len(y_actual)
        c_f_a.append(per_acc)


    r_a_a_f = np.average(c_f_a)
    r_s_f.append((r_a_a_f,r))

ratio [0.1 0.5 0.9]
N 25
depth 4
candidates 100


In [209]:
s_b_e2 = sorted(r_s_f, key = lambda tup: tup[0], reverse = True)
    
for e in s_b_e2:
   
    print("Accuracy: %.2f,r: %.1f" %(e[0],e[1]))

Accuracy: 0.94,r: 0.9
Accuracy: 0.93,r: 0.5
Accuracy: 0.93,r: 0.1


In [23]:
#depth varied

from sklearn.model_selection import KFold
Num_k = 5

d_r = np.arange(3, 24, 7)
n = N_c
r = r_c
c = c_c

print("depth", d_r)
print("ratio",r)
print("N", n)
print("candidates", c)


kfold = KFold(n_splits=Num_k, shuffle=True, random_state=555)
d_s_f = []

for d in d_r:
    e = 0
    c_f_a = [] 
    for train, test in kfold.split(data.X_train):
        forest = RandomForest(ratio = r, N = n, max_depth = d, candidate_splits = c )
        forest.fit(data.X_train[train], data.y_train[train])

        y_pred = np.array(forest.predict(data.X_train[test]))
        y_actual = data.y_train[test]      

        t_e = y_pred != y_actual
        e = np.sum(t_e)
        per_acc = 1 - e/len(y_actual)
        c_f_a.append(per_acc)


    d_a_a_f = np.average(c_f_a)
    d_s_f.append((d_a_a_f, d))

depth [ 3 10 17]
ratio 0.2
N 25
candidates 100


In [24]:
s_b_e3 = sorted(d_s_f, key = lambda tup: tup[0], reverse = True)
    
for e in s_b_e3:
    print("Accuracy: %.2f, d: %i" %(e[0],e[1]))

Accuracy: 0.96, d: 17
Accuracy: 0.96, d: 10
Accuracy: 0.92, d: 3


In [212]:
#number of candidates varied

from sklearn.model_selection import KFold
Num_k = 5

c_r = np.arange(10, 100, 10)
n = N_c
r = r_c
d = d_c
print("candidates", c_r)
print("depth", d)
print("ratio", r)
print("N", n)


kfold = KFold(n_splits=Num_k, shuffle=True, random_state=555)
c_s_f = []

for c in c_r:
    e = 0
    c_f_a = [] 
    for train, test in kfold.split(data.X_train):
        forest = RandomForest(ratio = r, N = n, max_depth = d, candidate_splits = c )
        forest.fit(data.X_train[train], data.y_train[train])

        y_pred = np.array(forest.predict(data.X_train[test]))
        y_actual = data.y_train[test]      

        t_e = y_pred != y_actual
        e = np.sum(t_e)
        per_acc = 1 - e/len(y_actual)
        c_f_a.append(per_acc)


    c_a_a_f = np.average(c_f_a)
    c_s_f.append((c_a_a_f, c))

candidates [10 20 30 40 50 60 70 80 90]
depth 4
ratio 0.2
N 25


In [213]:
s_b_e4 = sorted(c_s_f, key = lambda tup: tup[0], reverse = True)
    
for e in s_b_e4:
    print("Accuracy: %.2f,c: %i" %(e[0],e[1]))

Accuracy: 0.93,c: 90
Accuracy: 0.93,c: 70
Accuracy: 0.93,c: 80
Accuracy: 0.93,c: 60
Accuracy: 0.92,c: 40
Accuracy: 0.92,c: 50
Accuracy: 0.92,c: 30
Accuracy: 0.92,c: 20
Accuracy: 0.91,c: 10


In [25]:
f_N_f = 15
f_r_f = 0.9
f_d_f = 17
f_c_f = 80


FF = RandomForest(ratio = f_r_f, N = f_N_f, max_depth = f_d_f, candidate_splits = f_c_f)


FF.fit(data.X_train, data.y_train)
Fp = FF.predict(data.X_valid)
actual = data.y_valid

Fte = Fp != actual
Fe = np.sum(Fte)
Fpercent = 100*(1 - Fe/len(actual))
print("With %i trees, a ratio of %.1f, a depth of %i, and %i candidate splits, the accuracy is %.1f on the test set." %(f_N_f, f_r_f, f_d_f,f_c_f, Fpercent))

With 15 trees, a ratio of 0.9, a depth of 17, and 80 candidate splits, the accuracy is 96.3 on the test set.


<center>__Expected accuracy is about 97%__</center>

In [26]:
f_N_f = 15
f_r_f = 0.9
f_d_f = 17
f_c_f = 60


FF = RandomForest(ratio = f_r_f, N = f_N_f, max_depth = f_d_f, candidate_splits = f_c_f)


FF.fit(data.X_train, data.y_train)
Fp = FF.predict(data.X_valid)
actual = data.y_valid

Fte = Fp != actual
Fe = np.sum(Fte)
Fpercent = 100*(1 - Fe/len(actual))
print("With %i trees, a ratio of %.1f, a depth of %i, and %i candidate splits, the accuracy is %.1f on the test set." %(f_N_f, f_r_f, f_d_f,f_c_f, Fpercent))

With 15 trees, a ratio of 0.9, a depth of 17, and 60 candidate splits, the accuracy is 97.1 on the test set.


In [27]:
f_N_f = 20
f_r_f = 0.9
f_d_f = 17
f_c_f = 70


FF = RandomForest(ratio = f_r_f, N = f_N_f, max_depth = f_d_f, candidate_splits = f_c_f)


FF.fit(data.X_train, data.y_train)
Fp = FF.predict(data.X_valid)
actual = data.y_valid

Fte = Fp != actual
Fe = np.sum(Fte)
Fpercent = 100*(1 - Fe/len(actual))
print("With %i trees, a ratio of %.1f, a depth of %i, and %i candidate splits, the accuracy is %.1f on the test set." %(f_N_f, f_r_f, f_d_f,f_c_f, Fpercent))

With 20 trees, a ratio of 0.9, a depth of 17, and 70 candidate splits, the accuracy is 96.8 on the test set.


> Instead of the nested for loops I used for `BaggingClassifier`, I ran four separate loops, varying one hyperparameter at a time while holding the others at their baseline values. For the number of trees I settled on N = 20, which is close to the baseline of 25 used in the other sweeps; cross-validated accuracy was flat at 0.93 for N = 10, 15 and 20. I used a ratio of 0.9, since it gave the best cross-validated accuracy (0.94). I again used a depth of 17, which tied with 10 for the best cross-validated accuracy (0.96). For the number of candidate splits I first tried 80 and then 60, both with N = 15; the lower value gave the better result on the held-out data (97.1% versus 96.3%). I then increased N to 20 with 70 candidates, a value between the two: the accuracy (96.8%) was better than with 80 candidates and 15 trees, but worse than with 60 candidates and 15 trees.