# Train a Random Forest Classifier on the ISOLET Dataset

You are working for a technology company that is planning to launch a new voice assistant product. You have been tasked with building a classification model that recognizes the letters spelled out by a user from the captured signal frequencies. Each sound can be captured and represented as a signal composed of multiple frequencies.

> Note: This is the ISOLET [dataset](https://archive.ics.uci.edu/ml/datasets/ISOLET), taken from the UCI Machine Learning Repository. A CSV version of this [dataset](https://raw.githubusercontent.com/PacktWorkshops/The-Data-Science-Workshop/master/Chapter04/Dataset/phpB0xrNj.csv) is also available.

### Download and load the dataset using .read_csv() from pandas.
### Extract the response variable using .pop() from pandas.
### Split the dataset into training and test sets using train_test_split() from sklearn.model_selection.

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

In [2]:
# Forward slashes work on every OS and avoid accidental backslash escape sequences
df = pd.read_csv('../Dataset/phpB0xrNj.csv')
y = df.pop('class')
X_train, X_test, y_train, y_test = train_test_split(df, y, random_state=1, test_size=0.3)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(5457, 617) (2340, 617) (5457,) (2340,)
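
The row counts in the printed shapes can be checked by hand. The sketch below assumes that a fractional `test_size` is rounded up to a whole number of test rows and that the remaining rows go to the training set; the total row count is taken from the shapes printed above.

```python
import math

n_samples = 5457 + 2340   # total rows, from the shapes printed above
test_size = 0.3

# Assumption: the test count is rounded up and the rest goes to training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_samples, n_train, n_test)  # 7797 5457 2340
```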


### Create a function that will instantiate a RandomForestClassifier from sklearn.ensemble and fit it using .fit().

In [3]:
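def build_model(X, y, n_estimators=100, max_depth=None, min_samples_leaf=1, max_features='sqrt'):
    # 'sqrt' is the classifier default; the older 'auto' alias meant the same thing
    # for classifiers but is rejected by recent scikit-learn releases
    model = RandomForestClassifier(
        n_estimators=n_estimators,
        max_depth=max_depth,
        min_samples_leaf=min_samples_leaf,
        max_features=max_features,
        random_state=1  # fixed seed so that repeated runs give the same forest
    )
    model.fit(X, y)
    return model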

### Create a function that will predict the outcome for the training and testing sets using .predict().

In [4]:
def get_predictions(model, X_train, X_test):
    pred_train = model.predict(X_train)
    pred_test = model.predict(X_test)
    return pred_train, pred_test

### Create a function that will print the accuracy score for the training and testing sets using accuracy_score() from sklearn.metrics.

In [5]:
def print_accuracy_scores(y_train, y_test, pred_train, pred_test):
    print(f'Train accuracy: {accuracy_score(y_train, pred_train):.3f}')
    print(f'Test accuracy: {accuracy_score(y_test, pred_test):.3f}')

In [6]:
model0 = build_model(X_train, y_train)
pred_train0, pred_test0 = get_predictions(model0, X_train, X_test)
print_accuracy_scores(y_train, y_test, pred_train0, pred_test0)

Train accuracy: 1.000
Test accuracy: 0.944
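
The accuracy reported by `accuracy_score()` is the fraction of predictions that match the true labels. A minimal pure-Python sketch of the same calculation is shown below; the `accuracy` helper and the example labels are hypothetical and only serve as an illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError('y_true and y_pred must have the same length')
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Hypothetical labels for illustration (not taken from the dataset).
y_true = ['a', 'b', 'c', 'd', 'e']
y_pred = ['a', 'b', 'c', 'd', 'x']
print(f'{accuracy(y_true, y_pred):.3f}')  # 0.800
```

A train accuracy of 1.000 next to a lower test accuracy means the forest reproduces every training label but misses some unseen samples, which is the usual sign of overfitting.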


### Train and get the accuracy score for n_estimators = 20 and 50.

In [7]:
model1 = build_model(X_train, y_train, n_estimators=20)
pred_train1, pred_test1 = get_predictions(model1, X_train, X_test)
print_accuracy_scores(y_train, y_test, pred_train1, pred_test1)

Train accuracy: 1.000
Test accuracy: 0.922


In [8]:
model2 = build_model(X_train, y_train, n_estimators=50)
pred_train2, pred_test2 = get_predictions(model2, X_train, X_test)
print_accuracy_scores(y_train, y_test, pred_train2, pred_test2)

Train accuracy: 1.000
Test accuracy: 0.939
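
Instead of writing one cell per value, the same comparison can be run in a loop. The sketch below is only an illustration: it uses a small synthetic dataset from `make_classification()` as a stand-in for ISOLET, and the `sweep_n_estimators` helper is a hypothetical name, so its scores will not match the ones printed above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; this is not the ISOLET dataset).
X, y = make_classification(n_samples=300, n_features=20, n_informative=10,
                           n_classes=3, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

def sweep_n_estimators(values):
    """Return {n_estimators: (train_accuracy, test_accuracy)}."""
    scores = {}
    for n in values:
        model = RandomForestClassifier(n_estimators=n, random_state=1)
        model.fit(X_tr, y_tr)
        scores[n] = (accuracy_score(y_tr, model.predict(X_tr)),
                     accuracy_score(y_te, model.predict(X_te)))
    return scores

scores = sweep_n_estimators([5, 20, 50])
for n, (train_acc, test_acc) in scores.items():
    print(f'n_estimators={n}: train={train_acc:.3f}, test={test_acc:.3f}')
```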


### Train and get the accuracy score for max_depth = 5 and 10.

In [9]:
model3 = build_model(X_train, y_train, max_depth=5)
pred_train3, pred_test3 = get_predictions(model3, X_train, X_test)
print_accuracy_scores(y_train, y_test, pred_train3, pred_test3)

Train accuracy: 0.885
Test accuracy: 0.851


In [10]:
model4 = build_model(X_train, y_train, max_depth=10)
pred_train4, pred_test4 = get_predictions(model4, X_train, X_test)
print_accuracy_scores(y_train, y_test, pred_train4, pred_test4)

Train accuracy: 0.986
Test accuracy: 0.935
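
One way to read the effect of `max_depth` is through the number of leaves a tree can have: a binary tree of depth d has at most 2^d leaves. The sketch below works this out for the two values tried; the `max_leaves` helper is hypothetical, and the class count of 26 assumes one class per letter of the alphabet.

```python
n_classes = 26  # assumption: one class per letter of the alphabet

def max_leaves(max_depth):
    """Upper bound on the number of leaves in a binary tree of this depth."""
    return 2 ** max_depth

for depth in (5, 10):
    print(f'max_depth={depth}: at most {max_leaves(depth)} leaves per tree '
          f'for {n_classes} classes')
```

With at most 32 leaves, a single tree has little room to separate 26 classes, which may explain why both accuracies drop at `max_depth=5`.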


### Train and get the accuracy score for min_samples_leaf = 10 and 50.

In [11]:
model5 = build_model(X_train, y_train, min_samples_leaf=10)
pred_train5, pred_test5 = get_predictions(model5, X_train, X_test)
print_accuracy_scores(y_train, y_test, pred_train5, pred_test5)

Train accuracy: 0.975
Test accuracy: 0.938


In [12]:
model6 = build_model(X_train, y_train, min_samples_leaf=50)
pred_train6, pred_test6 = get_predictions(model6, X_train, X_test)
print_accuracy_scores(y_train, y_test, pred_train6, pred_test6)

Train accuracy: 0.921
Test accuracy: 0.899
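
`min_samples_leaf` also caps the size of each tree: if every leaf must hold at least that many samples, a tree cannot have more leaves than the number of training rows divided by the threshold. The sketch below computes this upper bound; the `leaf_upper_bound` helper is hypothetical, and real trees will usually have fewer leaves than the bound.

```python
n_train = 5457  # training rows, from the shapes printed earlier

def leaf_upper_bound(n_samples, min_samples_leaf):
    """Most leaves a tree can have if each leaf holds at least min_samples_leaf rows."""
    return n_samples // min_samples_leaf

for msl in (1, 10, 50):
    print(f'min_samples_leaf={msl}: at most {leaf_upper_bound(n_train, msl)} leaves per tree')
```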


### Train and get the accuracy score for max_features = 0.5 and 0.3.

In [13]:
model7 = build_model(X_train, y_train, max_features=0.5)
pred_train7, pred_test7 = get_predictions(model7, X_train, X_test)
print_accuracy_scores(y_train, y_test, pred_train7, pred_test7)

Train accuracy: 1.000
Test accuracy: 0.928


In [14]:
model8 = build_model(X_train, y_train, max_features=0.3)
pred_train8, pred_test8 = get_predictions(model8, X_train, X_test)
print_accuracy_scores(y_train, y_test, pred_train8, pred_test8)

Train accuracy: 1.000
Test accuracy: 0.939
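
A float `max_features` is a fraction of the feature columns considered at each split, whereas the classifier's square-root setting (`'sqrt'`, which older scikit-learn versions also accepted as `'auto'`) uses the square root of the column count. The sketch below converts each setting into a number of features for the 617 columns here; the `features_per_split` helper is hypothetical, and the rounding rule (truncate, with a minimum of 1) is an assumption about how scikit-learn resolves the value.

```python
import math

n_features = 617  # feature columns, from the shapes printed earlier

def features_per_split(max_features, n_features):
    """Candidate features per split (assumed rounding: truncate, minimum 1)."""
    if max_features == 'sqrt':
        return max(1, int(math.sqrt(n_features)))
    return max(1, int(max_features * n_features))

for setting in ('sqrt', 0.3, 0.5):
    print(f'max_features={setting!r}: '
          f'{features_per_split(setting, n_features)} features per split')
```

Under these assumptions the settings correspond to 24, 185, and 308 features per split, so both float values make every split far more expensive than the square-root setting.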


### Select the best hyperparameter values.

In [17]:
model9 = build_model(X_train, y_train, n_estimators=50, max_depth=10, min_samples_leaf=50, max_features=0.3)
pred_train9, pred_test9 = get_predictions(model9, X_train, X_test)
print_accuracy_scores(y_train, y_test, pred_train9, pred_test9)

Train accuracy: 0.899
Test accuracy: 0.870


In [18]:
model10 = build_model(X_train, y_train, n_estimators=50, max_depth=10, min_samples_leaf=10, max_features=0.3)
pred_train10, pred_test10 = get_predictions(model10, X_train, X_test)
print_accuracy_scores(y_train, y_test, pred_train10, pred_test10)

Train accuracy: 0.953
Test accuracy: 0.911
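
To compare the runs side by side, the printed scores can be collected and ranked. The sketch below copies the train and test accuracies from the outputs above, sorts them by test accuracy, and adds the train/test gap as a rough overfitting indicator; the configuration labels are shorthand chosen for this table.

```python
# (label, train accuracy, test accuracy), copied from the printed outputs above
results = [
    ('baseline (defaults)', 1.000, 0.944),
    ('n_estimators=20',     1.000, 0.922),
    ('n_estimators=50',     1.000, 0.939),
    ('max_depth=5',         0.885, 0.851),
    ('max_depth=10',        0.986, 0.935),
    ('min_samples_leaf=10', 0.975, 0.938),
    ('min_samples_leaf=50', 0.921, 0.899),
    ('max_features=0.5',    1.000, 0.928),
    ('max_features=0.3',    1.000, 0.939),
    ('combined, leaf=50',   0.899, 0.870),
    ('combined, leaf=10',   0.953, 0.911),
]

# Highest test accuracy first
ranked = sorted(results, key=lambda r: r[2], reverse=True)
print(f"{'configuration':<22}{'train':>7}{'test':>7}{'gap':>7}")
for label, train_acc, test_acc in ranked:
    print(f'{label:<22}{train_acc:>7.3f}{test_acc:>7.3f}{train_acc - test_acc:>7.3f}')
```

On these numbers the default configuration has the highest test accuracy (0.944), and both combined runs score below it. `min_samples_leaf=50` has the smallest gap (0.022) but a lower test accuracy (0.899), so a smaller gap alone does not make a model better.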
