# Problem 6

## Problem Description
In this problem you will train decision tree and random forest models using sklearn on a real-world dataset. The dataset is the *Cylinder Bands Data Set* from the UCI Machine Learning Repository: [https://archive.ics.uci.edu/ml/datasets/Cylinder+Bands](https://archive.ics.uci.edu/ml/datasets/Cylinder+Bands). The dataset is generated from rotogravure printers, with 39 unique features and a binary classification label for each sample. The class is either 0 for 'band' or 1 for 'no band', where banding is an undesirable process delay that arises during the rotogravure printing process. By training ML models on this dataset, you could help identify or predict cases where these process delays are avoidable, thereby improving the efficiency of the printing. For the sake of this exercise, we only consider features 21-39 in the above link, and have removed any samples with missing values in that range. No further processing of the data is required on your part. The data has been partitioned into training and testing sets using an 80/20 split. Your models will be trained on the training set only, and accuracy results will be reported on both the training and testing sets.

Fill out the notebook as instructed, making the requested plots and printing necessary values. 

*You are welcome to use any of the code provided in the lecture activities.*

#### Summary of deliverables:

- Accuracy function
- Report accuracy of the DT model on the training and testing set
- Report accuracy of the Random Forest model on the training and testing set

#### Imports and Utility Functions:

In [1]:
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

## Load the data

Use the `np.load()` function to load "w5-hw1-train.npy" (training data) and "w5-hw1-test.npy" (testing data). The first 19 columns of each array are the features. The last column is the label.

In [12]:
# YOUR CODE GOES HERE
train_data = np.load("data/w5-hw1-train.npy")
test_data = np.load("data/w5-hw1-test.npy")

X_train = train_data[:, :19]
y_train = train_data[:, -1]

X_test = test_data[:, :19]
y_test = test_data[:, -1]
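
A quick shape check after slicing can catch an off-by-one in the column indices. The sketch below is only illustrative: it uses a small synthetic array in place of the real `.npy` files, and `split_features_label` is a hypothetical helper, not part of the assignment.

```python
import numpy as np

def split_features_label(data, n_features=19):
    """Split a 2-D array into (features, label), assuming the label is the last column."""
    data = np.asarray(data)
    if data.ndim != 2 or data.shape[1] != n_features + 1:
        raise ValueError(f"expected shape (n_samples, {n_features + 1}), got {data.shape}")
    return data[:, :n_features], data[:, -1]

# Synthetic stand-in for the real data (assumption): 5 samples, 19 features, 1 binary label.
rng = np.random.default_rng(0)
fake = np.hstack([rng.normal(size=(5, 19)), rng.integers(0, 2, size=(5, 1))])

X, y = split_features_label(fake)
print(X.shape, y.shape)  # (5, 19) (5,)
```

The same helper could be applied to `train_data` and `test_data`, assuming both arrays have exactly 20 columns.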

## Write an accuracy function

Write a function `accuracy(pred,label)` that takes in the model's predictions and the corresponding labels, and returns the percentage of predictions that match the labels.

In [9]:
# YOUR CODE GOES HERE
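def accuracy(pred, label):
    """Return the percentage of predictions that match the labels."""
    # The mean of the element-wise match is the fraction of correct predictions.
    return np.mean(pred == label) * 100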
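
As a small check of the accuracy definition, the sketch below applies it to a hand-made example where three of four predictions match. The arrays are made up for illustration; the `np.mean` form is an equivalent way to write the same computation.

```python
import numpy as np

def accuracy(pred, label):
    # Same definition as in the cell above: percentage of matching entries.
    return np.sum(pred == label) * 100 / len(pred)

pred = np.array([1, 0, 1, 1])
label = np.array([1, 0, 0, 1])

acc = accuracy(pred, label)               # 3 of 4 entries match
acc_mean = np.mean(pred == label) * 100   # equivalent one-liner
print(acc, acc_mean)  # 75.0 75.0
```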

## Train a decision tree model

Train a decision tree using `DecisionTreeClassifier()` with a `max_depth` of 10 and using a `random_state` of 0 to ensure repeatable results. Print the accuracy of the model on both the training and testing sets.

In [19]:
# YOUR CODE GOES HERE
# Fit a depth-limited tree; a fixed random_state makes the fit repeatable.
model = DecisionTreeClassifier(max_depth=10, random_state=0)
model.fit(X_train, y_train)

# Report accuracy (in percent) on both splits.
print("Model accuracy of DecisionTreeClassifier on training data: ", accuracy(model.predict(X_train), y_train))
print("Model accuracy of DecisionTreeClassifier on test data: ", accuracy(model.predict(X_test), y_test))

Model accuracy of DecisionTreeClassifier on training data:  93.12714776632302
Model accuracy of DecisionTreeClassifier on test data:  65.75342465753425
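
The gap between training and test accuracy above suggests the tree is overfitting. One way to probe this is to sweep `max_depth` and watch the two accuracies diverge. The sketch below does so on synthetic data generated with `make_classification` as a stand-in for the real dataset (an assumption made so the example is self-contained), so its numbers will not match the results above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real data (assumption): 19 features, noisy binary labels.
X, y = make_classification(n_samples=400, n_features=19, n_informative=5,
                           flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

results = {}
for depth in (1, 3, 10, None):  # None lets the tree grow until its leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    # score() returns the fraction correct; convert to percent.
    results[depth] = (tree.score(X_tr, y_tr) * 100, tree.score(X_te, y_te) * 100)
    print(depth, results[depth])
```

With no depth limit the tree can fit every training sample, so training accuracy alone says little about how well the model will do on unseen data.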


## Train a random forest model

Train a random forest model using `RandomForestClassifier()` with a `max_depth` of 10, `n_estimators` of 100, and a `random_state` of 0 to ensure repeatable results. Print the accuracy of the model on both the training and testing sets.

In [20]:
# YOUR CODE GOES HERE
# Fit 100 depth-limited trees; a fixed random_state makes the fit repeatable.
model_ = RandomForestClassifier(max_depth=10, n_estimators=100, random_state=0)
model_.fit(X_train, y_train)

# Report accuracy (in percent) on both splits.
print("Model accuracy of RandomForestClassifier on training data: ", accuracy(model_.predict(X_train), y_train))
print("Model accuracy of RandomForestClassifier on test data: ", accuracy(model_.predict(X_test), y_test))

Model accuracy of RandomForestClassifier on training data:  100.0
Model accuracy of RandomForestClassifier on test data:  82.1917808219178


## Discuss the performance of the models

Compare the training and testing accuracy of the two models, and explain why the random forest model is advantageous compared to a standard decision tree model.


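The decision tree reaches about 93.1% accuracy on the training set but only about 65.8% on the test set, while the random forest reaches 100% on the training set and about 82.2% on the test set. Both models fit the training data closely, but the random forest generalizes noticeably better.

A single decision tree is a high-variance model: noise or small changes in the training data can produce a very different tree, and a deep tree can overfit by memorizing the training samples. A random forest trains many trees, each on a bootstrap sample of the data with a random subset of features considered at each split, and combines their predictions. Averaging over many trees smooths out the irregularities of any individual tree, which makes the model less sensitive to noise and outliers and reduces, though does not eliminate, overfitting.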
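
The benefit of combining many predictors can be illustrated with a toy simulation. The sketch below assumes idealized voters that are each correct with probability `p` independently of one another. Trees in a real forest are correlated, so the actual gain is smaller than this suggests. `majority_vote_accuracy` is a hypothetical helper written for this illustration.

```python
import numpy as np

def majority_vote_accuracy(p, n_voters, n_trials=20000, seed=0):
    """Estimate how often a majority of independent voters is correct.

    Each voter is right with probability p, independently of the others
    (an idealization; trees in a real forest are correlated).
    """
    rng = np.random.default_rng(seed)
    correct = rng.random((n_trials, n_voters)) < p  # True where a voter is right
    # A trial counts as correct when more than half of the voters are right.
    return np.mean(correct.sum(axis=1) > n_voters / 2)

single = majority_vote_accuracy(0.65, 1)
ensemble = majority_vote_accuracy(0.65, 101)
print(f"single voter: {single:.3f}, 101 voters: {ensemble:.3f}")
```

Under this independence assumption, the majority of 101 weak voters is right far more often than any single voter, which is the intuition behind averaging many randomized trees.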