# Training and Testing the Random Forest (RF) Classifier

In this notebook I established a baseline accuracy for my model, trained my RF classifier, and then tested it.

## Importing Modules

In [61]:
import pandas as pd
import numpy as np
import csv
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

## Importing the train and test features and labels

In [62]:
train_features = np.loadtxt('../train_features.txt', dtype=int)

In [63]:
train_labels = np.loadtxt('../train_labels.txt', dtype=int)

In [64]:
test_features = np.loadtxt('../test_features.txt', dtype=int)

In [65]:
test_labels = np.loadtxt('../test_labels.txt', dtype=int)

In [66]:
feature_list = np.loadtxt('../feature_list.txt', dtype=object)

## Establishing a baseline accuracy with a random prediction model

In [67]:
baseline_preds = []
for i in np.arange(len(test_labels)):
    pred = np.random.choice(train_labels)
    baseline_preds.append(pred)

I picked random labels from my train_labels set to establish the baseline accuracy for the algorithm. Since I'm choosing the labels from the train_labels set, labels that appear more often in training will also appear more often in the baseline predictions. This accounts for the class imbalance.
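To illustrate why sampling from the training labels preserves the class proportions, here is a minimal sketch on a made-up label array. The 70/20/10 split and the `toy_` names are assumptions for illustration only, not the real distribution in train_labels.

```python
import numpy as np

# Hypothetical imbalanced labels: 10% class 0, 20% class 1, 70% class 2.
rng = np.random.default_rng(0)
toy_train_labels = np.array([0] * 100 + [1] * 200 + [2] * 700)

# Draw one random training label per prediction (vectorized form of the loop above).
toy_baseline_preds = rng.choice(toy_train_labels, size=10_000)

# The sampled proportions should track the training proportions closely.
train_props = np.bincount(toy_train_labels) / len(toy_train_labels)
pred_props = np.bincount(toy_baseline_preds, minlength=3) / len(toy_baseline_preds)
print('Train proportions:   ', train_props)
print('Baseline proportions:', pred_props.round(3))
```

A frequent class is drawn more often simply because it occupies more positions in the array being sampled.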

In [68]:
from sklearn.metrics import f1_score
print('Baseline F1 score: ', round(f1_score(test_labels, baseline_preds, average = 'weighted'), 3)*100, '%')

Baseline F1 score:  61.7 %


In [69]:
accurate = 0
for a, b in zip(baseline_preds, test_labels):
    if a == b:
        accurate += 1
print('Baseline accuracy: ', round(accurate/len(test_labels), 3)* 100, '%')

Baseline accuracy:  61.9 %


I calculated both the baseline F1 score and the baseline accuracy. The F1 score will be a better indicator of how well the algorithm is performing because of the class imbalance.
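As a rough illustration of why accuracy can flatter a model on imbalanced data, the sketch below computes a support-weighted F1 by hand for a hypothetical "model" that always predicts the majority class. The data and the `weighted_f1` helper are made up for this example; `f1_score(..., average='weighted')` should give the same value on the same inputs.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Support-weighted mean of the per-class F1 scores."""
    support = Counter(y_true)
    total = 0.0
    for cls, n_cls in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = n_cls - tp
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); treated as 0 when undefined.
        f1 = 2 * tp / denom if denom else 0.0
        total += f1 * n_cls
    return total / len(y_true)

# Hypothetical 90/10 imbalance; every prediction is the majority class.
toy_true = [2] * 90 + [0] * 10
toy_pred = [2] * 100

toy_accuracy = sum(t == p for t, p in zip(toy_true, toy_pred)) / len(toy_true)
toy_f1 = weighted_f1(toy_true, toy_pred)
print('Accuracy:   ', round(toy_accuracy, 3))
print('Weighted F1:', round(toy_f1, 3))
```

Here the accuracy is 0.9 while the weighted F1 is about 0.853, because the minority class contributes an F1 of zero.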

## Training the RF Classifier

In [70]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_jobs=-1, random_state=123, verbose = True, n_estimators=1000)
rf = clf.fit(train_features, train_labels)

KeyboardInterrupt: 

## Testing the RF Classifier

In [None]:
rf_preds = clf.predict(test_features)

In [None]:
from sklearn.metrics import f1_score
print('Random Forest Classifier F1 score: ', round(f1_score(test_labels, rf_preds, average = 'weighted'), 3)*100, '%')

In [None]:
accurate = 0
for a, b in zip(rf_preds, test_labels):
    if a == b:
        accurate += 1
print('Random Forest Classifier Accuracy: ', round(accurate/len(test_labels), 3)* 100, '%')

There was an improvement from 61.7% F1 and 61.9% accuracy with random predictions to 94.6% F1 and 94.8% accuracy with the RF classifier. That's an improvement of nearly 33 points in both F1 score and accuracy. (The baseline predictions are not seeded, so the baseline numbers shift slightly from run to run.)

In [None]:
pd.crosstab(test_labels, rf_preds, rownames=['True Recommendations'], colnames=['Predicted Recommendations'])

In [None]:
from sklearn.metrics import confusion_matrix
import itertools

In [None]:
sns.set_style("whitegrid")
classes = ['Buy', 'Buy or Wait', 'Wait']
cmap=plt.cm.Blues
cnf_matrix = confusion_matrix(test_labels, rf_preds)
cnf_matrix = cnf_matrix.astype('float') / cnf_matrix.sum(axis=1)[:, np.newaxis]
tick_marks = np.arange(len(classes))
_ = plt.figure(figsize = (10,10))
_ = plt.imshow(cnf_matrix, cmap = cmap)
_ = plt.title("Confusion Matrix with Normalization")
_ = plt.xticks(tick_marks, classes, rotation = 45)
_ = plt.yticks(tick_marks, classes, rotation = 45)
_ = plt.ylabel('True label')
_ = plt.xlabel('Predicted label')
_ = plt.grid(False)
_ = plt.tight_layout()
fmt = '.2f'
thresh = cnf_matrix.max() / 2.
for i, j in itertools.product(range(cnf_matrix.shape[0]), range(cnf_matrix.shape[1])):
    plt.text(j, i, format(cnf_matrix[i, j], fmt), horizontalalignment="center", color="white" if cnf_matrix[i, j] > thresh else "black")

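Above is the confusion matrix of my results, normalized by row so that each cell shows the fraction of a true class that was assigned to each predicted class.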
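The normalization step in the cell above divides each row by the number of true examples in that class, so the diagonal reads as per-class recall. A small sketch with made-up counts (the `toy_` arrays are hypothetical, not my actual results):

```python
import numpy as np

# Hypothetical raw counts: rows are true classes, columns are predicted classes.
toy_counts = np.array([[50,  5,  5],
                       [ 2, 16,  2],
                       [10, 10, 80]])

# Divide each row by its total so that every row sums to 1.
toy_normalized = toy_counts / toy_counts.sum(axis=1, keepdims=True)
print(toy_normalized.round(2))
```

Using `keepdims=True` is equivalent to the `[:, np.newaxis]` indexing used in the plotting cell.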

In [None]:
sns.set()
f_i = list(zip(feature_list, clf.feature_importances_))
feature_importances = pd.DataFrame(f_i, columns =['features', 'importances']).sort_values('importances', ascending = False)
top15 = feature_importances.iloc[:15,:]

sns.set(font_scale = 1.5)
_ = plt.figure(figsize = (15,10))
_ = plt.bar(top15['features'], top15['importances'])
_ = plt.title("Top 15 Most Important Features in RF Classifier")
_ = plt.ylabel("Relative Feature Importance")
_ = plt.xticks(rotation=55)

In [None]:
test_labels_2=test_labels.copy()
np.place(test_labels_2, test_labels_2 == 0, [1])
rf_preds_2=rf_preds.copy()
np.place(rf_preds_2, rf_preds_2 == 0, [1])

In [None]:
print('Random Forest Classifier F1 score: ', round(f1_score(test_labels_2, rf_preds_2), 3) * 100, '%')

In [None]:
accurate = 0
for a, b in zip(rf_preds, test_labels):
    if a == 0:
        a = a + 1
    if b == 0:
        b = b + 1
    if a == b:
        accurate += 1
print('Random Forest Classifier Accuracy: ', round(accurate/len(test_labels), 3)* 100, '%')

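In the cells above I merged the "Buy or Wait" category with "Buy", since either recommendation would save the customer the same amount of money. With the merged labels my accuracy improves by 1.5 points and my F1 score goes down by 3.2 points. Note that this F1 score uses f1_score's default binary averaging rather than the weighted averaging used earlier, so the two F1 values are not directly comparable.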
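The effect of merging two classes on accuracy can be sketched on a toy example. The label encoding (0 = Buy, 1 = Buy or Wait, 2 = Wait) is an assumption based on the order of the `classes` list, and the arrays and the `merge_buy_classes` helper are made up for illustration.

```python
import numpy as np

# Hypothetical labels, assuming 0 = Buy, 1 = Buy or Wait, 2 = Wait.
toy_true = np.array([0, 1, 1, 2, 2, 0, 1, 2])
toy_pred = np.array([1, 1, 0, 2, 2, 0, 2, 2])

def merge_buy_classes(labels):
    """Return a copy of labels with class 0 folded into class 1."""
    return np.where(labels == 0, 1, labels)

acc_three_class = np.mean(toy_true == toy_pred)
acc_merged = np.mean(merge_buy_classes(toy_true) == merge_buy_classes(toy_pred))
print('Three-class accuracy:', acc_three_class)
print('Merged accuracy:     ', acc_merged)
```

Merging can only turn mismatches between the two merged classes into matches, so accuracy after merging is never lower than before.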

## Part II: Using the datasets that have reduced dimensionality from PCA

In [None]:
# PCA components are continuous values, so load them as floats rather than ints
train_features_pca = np.loadtxt('../train_features_pca.txt', dtype=float)
test_features_pca = np.loadtxt('../test_features_pca.txt', dtype=float)

## Training and Testing the RF Classifier

In [None]:
n_components = []
accuracies = []
cnf_matricies = []
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_jobs=-1, random_state=123, n_estimators=1000)
for n in [52, 34, 29, 25]:
    rf = clf.fit(train_features_pca[:, :n], train_labels)
    rf_preds = clf.predict(test_features_pca[:, :n])
    accurate = 0
    for a, b in zip(rf_preds, test_labels):
        if a==b:
            accurate += 1
    n_components.append(n)
    accuracies.append(accurate/len(test_labels))
    print('RF Classifier Accuracy for ', n, ' components:', round((accurate/len(test_labels)), 3) * 100, '%')
    print('RF Classifier F1 score for ', n, ' components:', round(f1_score(test_labels, rf_preds, average = 'weighted'), 3) * 100, '%')
    cnf_matrix = confusion_matrix(test_labels, rf_preds)
    cnf_matricies.append(cnf_matrix)

The accuracies and F1 scores of the models trained on the PCA data are much worse than those of the model trained on all of the original features. This might be because the collinearity of the features actually helps the model.
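One way to probe this would be to check how much variance the leading components retain. The sketch below does PCA through NumPy's SVD on synthetic, deliberately collinear data. It is not the pipeline that produced train_features_pca (which this notebook does not show), and all names in it are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 200 samples, 6 features; the last 3 are noisy copies of the first 3.
base = rng.normal(size=(200, 3))
toy_X = np.hstack([base, base + 0.1 * rng.normal(size=(200, 3))])

# PCA via SVD of the centered data matrix.
centered = toy_X - toy_X.mean(axis=0)
_, singular_values, _ = np.linalg.svd(centered, full_matrices=False)
explained_ratio = singular_values**2 / np.sum(singular_values**2)
cumulative = np.cumsum(explained_ratio)

for n in range(1, len(cumulative) + 1):
    print(n, 'components keep', round(cumulative[n - 1] * 100, 1), '% of the variance')
```

Retaining most of the variance does not guarantee retaining the information that separates the classes, which could be one reason a model on reduced data performs worse.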

## Conclusion

The best F1 score and accuracy (94.6% and 94.8% respectively) came from the RF classifier trained on all 52 features. This was an improvement of roughly 33 points in both F1 score and accuracy over the baselines. The date features were broken into week number and day of week, and coded numerically. The categorical variables, such as highest class available and carrier, were encoded using one-hot encoding. The accuracy and F1 score went down when using data that was first reduced by PCA.
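The feature encoding described above could look roughly like the sketch below. The field names, the sample rows, and the use of ISO week numbers are all assumptions for illustration; the actual preprocessing was done outside this notebook.

```python
from datetime import date

# Hypothetical rows; the real column names and values are not shown in this notebook.
toy_rows = [
    {'date': date(2018, 3, 5), 'carrier': 'A'},
    {'date': date(2018, 3, 10), 'carrier': 'B'},
]
carriers = sorted({row['carrier'] for row in toy_rows})

encoded = []
for row in toy_rows:
    d = row['date']
    features = {
        'week_number': d.isocalendar()[1],  # ISO week of the year (assumed convention)
        'day_of_week': d.weekday(),         # Monday = 0 ... Sunday = 6
    }
    # One-hot encoding: one 0/1 column per carrier value.
    for c in carriers:
        features['carrier_' + c] = int(row['carrier'] == c)
    encoded.append(features)

for features in encoded:
    print(features)
```

With pandas, `pd.get_dummies` would be the usual way to produce the one-hot columns in a single call.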