# 4.8 Assignment 3: Decision Trees and Random Forest

## Table of Contents:
[Part 1: Decision Trees](#Part-1:-Decision-Trees)

[Part 2: Random Forest](#Part-2:-Random-Forest) 

Both sections work with the “Blues Guitarists Hand Posture and Thumbing Style by Region and Birth Period” dataset, which has 93 entries, one per blues guitarist, with birth years ranging from 1874 to 1940.

#### Features included

Regions ('region'): 1 = East, 2 = Delta, 3 = Texas

Years ('post1906'): 0 = born before 1906, 1 = born in 1906 or later

Hand postures ('handPost'): 1 = Extended, 2 = Stacked, 3 = Lutiform

Thumb styles ('thumbSty'): 1 = Alternating, 2 = Utility, 3 = Dead
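
The integer codes above can be mapped back to readable labels when inspecting rows. The following is a minimal sketch and not part of the original analysis: the dictionary names and the three toy rows are illustrative assumptions, and only the code-to-label pairs come from the list above.

```python
import pandas as pd

# Code-to-label maps taken from the feature list above.
REGION_LABELS = {1: "East", 2: "Delta", 3: "Texas"}
HAND_LABELS = {1: "Extended", 2: "Stacked", 3: "Lutiform"}
THUMB_LABELS = {1: "Alternating", 2: "Utility", 3: "Dead"}

# Toy rows that use the same integer coding (values are illustrative).
toy = pd.DataFrame({"region": [3, 2, 1], "handPost": [1, 1, 2], "thumbSty": [3, 2, 2]})

# Replace each code with its label; the original frame is left untouched.
decoded = toy.assign(
    region=toy["region"].map(REGION_LABELS),
    handPost=toy["handPost"].map(HAND_LABELS),
    thumbSty=toy["thumbSty"].map(THUMB_LABELS),
)
print(decoded)
```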

### Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

from sklearn.ensemble import RandomForestClassifier

# Part 1: Decision Trees

In this section, I'll use a decision tree classifier to see how accurately a guitarist's birth year ('brthYr') can be predicted from their hand posture ('handPost') and thumb style ('thumbSty'). Afterward, I will retrain the model with the 'region' variable included and measure how it affects the predictions.

[Data Processing](#Data-Processing)

[Model Training](#Model-Training)

# Data Processing

In [2]:
guitarists_data = pd.read_csv('Assignment 4-blues_hand.csv')
guitarists_data.head()

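|  | name | state | brthYr | post1906 | region | handPost | thumbSty |
|---|---|---|---|---|---|---|---|
| 0 | Henry Thomas | TX | 1874 | 0 | 3 | 1 | 3 |
| 1 | Frank Stokes | TN | 1887 | 0 | 2 | 1 | 3 |
| 2 | Sam Collins | MS | 1887 | 0 | 2 | 1 | 2 |
| 3 | Peg Leg Howell | GA | 1888 | 0 | 1 | 2 | 2 |
| 4 | Huddie Ledbetter | TX | 1888 | 0 | 3 | 2 | 3 |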


In [3]:
#ensure correct upload
guitarists_data.shape

(93, 7)

In [4]:
#examine value distributions
guitarists_data.describe()

|  | brthYr | post1906 | region | handPost | thumbSty |
|---|---|---|---|---|---|
| count | 93.0 | 93.0 | 93.0 | 93.0 | 93.0 |
| mean | 1908.903226 | 0.548387 | 1.741935 | 1.580645 | 2.043011 |
| std | 13.44802 | 0.500351 | 0.657783 | 0.712048 | 0.832936 |
| min | 1874.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| 25% | 1898.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| 50% | 1908.0 | 1.0 | 2.0 | 1.0 | 2.0 |
| 75% | 1917.0 | 1.0 | 2.0 | 2.0 | 3.0 |
| max | 1940.0 | 1.0 | 3.0 | 3.0 | 3.0 |


In [5]:
#ensure no null values exist
guitarists_data.isnull().any()

name        False
state       False
brthYr      False
post1906    False
region      False
handPost    False
thumbSty    False
dtype: bool

In [6]:
guitarists_data.dtypes

name        object
state       object
brthYr       int64
post1906     int64
region       int64
handPost     int64
thumbSty     int64
dtype: object

In [7]:
#one-hot encode 'region', 'handPost', and 'thumbSty'

key_encoder = OneHotEncoder()

processed_columns = key_encoder.fit_transform(guitarists_data[['region', 'handPost', 'thumbSty']])

features = key_encoder.get_feature_names_out(['region', 'handPost', 'thumbSty'])

processed_df = pd.DataFrame(processed_columns.toarray(), columns = features)

guitarists_df = pd.concat([guitarists_data, processed_df], axis = 1)

In [8]:
guitarists_df = guitarists_df.drop(columns = ['region', 'handPost', 'thumbSty'])
guitarists_df

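|  | name | state | brthYr | post1906 | region_1 | region_2 | region_3 | handPost_1 | handPost_2 | handPost_3 | thumbSty_1 | thumbSty_2 | thumbSty_3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Henry Thomas | TX | 1874 | 0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 1 | Frank Stokes | TN | 1887 | 0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 2 | Sam Collins | MS | 1887 | 0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | Peg Leg Howell | GA | 1888 | 0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 4 | Huddie Ledbetter | TX | 1888 | 0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 88 | Jimmie Lee Harris | AL | 1935 | 1 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 89 | Snooks Eaglin | LA | 1936 | 1 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 90 | Larry Johnson | GA | 1938 | 1 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 91 | Tom Winslow | NC | 1938 | 1 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |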
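
For comparison, the same kind of indicator columns can be produced with `pd.get_dummies`. This is a small sketch on a made-up three-row frame (the `toy` name and its values are assumptions, not rows from the dataset). It is meant only to make the layout of the encoded columns concrete and is not the method used above.

```python
import pandas as pd

# Toy frame using the same integer coding as the dataset (values are made up).
toy = pd.DataFrame({"region": [3, 2, 1], "handPost": [1, 1, 2]})

# One indicator column per observed category; dtype=float mirrors the
# 0.0/1.0 values shown in the encoded table above.
encoded = pd.get_dummies(toy, columns=["region", "handPost"], dtype=float)
print(encoded)
```

Unlike a fitted `OneHotEncoder`, `get_dummies` only creates columns for categories present in the frame it is given, so `handPost_3` is absent in this toy output.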


# Model Training

In [9]:
#feature selection for 1st tree: 'handPost' & 'thumbSty'
X1 = guitarists_df[['handPost_1', 'handPost_2', 'handPost_3', 
                   'thumbSty_1', 'thumbSty_2', 'thumbSty_3']]

y1 = guitarists_df['brthYr']


#split data
X1_train, X1_test, y1_train, y1_test = train_test_split(X1, y1, test_size = 0.3, random_state = 15)

tree_classifier = DecisionTreeClassifier()

#fit model
tree_classifier.fit(X1_train, y1_train)

tree_preds = tree_classifier.predict(X1_test)

handpost_thumb_accuracy = accuracy_score(y1_test, tree_preds)
handpost_thumb_accuracy

0.03571428571428571

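The model performed poorly: an accuracy of about 0.036 means that only 1 of the 28 test guitarists was assigned the correct birth year. Because 'brthYr' is used as the class label, every distinct birth year is its own class, and with only nine possible hand posture/thumb style combinations the tree can output at most nine different years. To check that the low score is not an artifact of one particular split, I decided to run k-fold cross-validation and average the accuracy across folds.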
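
One way to put a score like this in context is to compare it with a majority-class baseline. The sketch below is an illustration under stated assumptions: `X_demo` and `y_demo` are synthetic stand-ins with the same shape as the real data (93 rows, six binary columns, integer years as labels), so the printed accuracy says nothing about the guitarist data itself.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): same shape as X1/y1, random values.
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 2, size=(93, 6))
y_demo = rng.integers(1874, 1941, size=93)

# Same split settings as the cell above: 30% test, fixed seed.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=15)

# Baseline that always predicts the most frequent training label.
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
baseline_accuracy = accuracy_score(y_te, baseline.predict(X_te))

print(len(y_te), round(baseline_accuracy, 3))
```

A tree that does not beat this baseline on the real data is not extracting usable signal from its features.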

In [32]:
#kfold validation
k_fold = KFold(n_splits = 9, shuffle = True, random_state = 7)

#initialize list to store accuracy scores
accuracies_tree1 = []


for train_index, test_index in k_fold.split(X1):
    X1_trainf, X1_testf = X1.iloc[train_index], X1.iloc[test_index]
    y1_trainf, y1_testf = y1.iloc[train_index], y1.iloc[test_index]
    
    kftree_classifier = DecisionTreeClassifier()
    
    #train tree
    kftree_classifier.fit(X1_trainf, y1_trainf)
    
    #test tree
    kftree_preds = kftree_classifier.predict(X1_testf)
    
    kfold_accuracy = accuracy_score(y1_testf, kftree_preds)
    
    accuracies_tree1.append(kfold_accuracy)

avg_accuracy = round(sum(accuracies_tree1)/len(accuracies_tree1),3)
avg_accuracy

0.011

In [11]:
min(accuracies_tree1)

0.0

In [12]:
max(accuracies_tree1)

0.1

The first tree performed poorly on the single hold-out split, with an accuracy of about 0.036, and the 9-fold cross-validation average was lower still at 0.011. Cross-validation therefore did not improve the score; it gives a steadier estimate of it, and that estimate is close to zero. The worst-performing fold with the handPost and thumbSty variables scored 0, while the best-performing fold scored 0.1. I will now add the 'region' variable and compare these results.
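
The manual loop above can also be written more compactly with scikit-learn's `cross_val_score`. This is a hedged sketch rather than a replacement for the cell above: `X_demo` and `y_demo` are synthetic stand-ins (an assumption), and the tree is given a `random_state` here only so that repeated runs of the sketch agree.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 93 rows, six binary features,
# integer years as class labels.
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 2, size=(93, 6))
y_demo = rng.integers(1874, 1941, size=93)

# Same fold settings as the loop above.
cv = KFold(n_splits=9, shuffle=True, random_state=7)

# Fits a fresh tree on each training fold and scores it on the held-out fold.
scores = cross_val_score(
    DecisionTreeClassifier(random_state=0), X_demo, y_demo, cv=cv, scoring="accuracy"
)

print(scores.round(3))
print(round(scores.mean(), 3), scores.min(), scores.max())
```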

In [13]:
#feature selection for 2nd tree: 'handPost', 'thumbSty' & 'region'
X2 = guitarists_df[['region_1', 'region_2', 'region_3', 'handPost_1', 'handPost_2', 'handPost_3', 
                   'thumbSty_1', 'thumbSty_2', 'thumbSty_3']]

y2 = guitarists_df['brthYr']


#split data
X2_train, X2_test, y2_train, y2_test = train_test_split(X2, y2, test_size = 0.3, random_state = 45)

tree_classifier2 = DecisionTreeClassifier()

#fit model
tree_classifier2.fit(X2_train, y2_train)

tree_preds2 = tree_classifier2.predict(X2_test)

accuracy3_variables = accuracy_score(y2_test, tree_preds2)
accuracy3_variables

0.07142857142857142

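On its hold-out split, this tree with the region, hand posture, and thumb style variables scored about 0.071, up from 0.036, though the accuracy is still very low. With 28 test rows the gain amounts to one additional correct prediction (2 instead of 1), and the two trees were evaluated on different splits (random_state 15 and 45), so it should be read cautiously. I will also perform k-fold CV and evaluate the averages.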
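
To separate the effect of the extra feature from the effect of the split, both feature sets can be scored on identical folds. The sketch below assumes synthetic data (`X_all`, `y_demo`) in which the first three columns play the role of the `region_*` indicators. It illustrates the comparison setup only and says nothing about the real scores.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): nine binary columns, the first
# three standing in for region_1..region_3.
rng = np.random.default_rng(1)
X_all = rng.integers(0, 2, size=(93, 9))
y_demo = rng.integers(1874, 1941, size=93)

# A KFold with an integer random_state yields the same folds on every call,
# so both feature sets are evaluated on exactly the same rows.
cv = KFold(n_splits=9, shuffle=True, random_state=7)
tree = DecisionTreeClassifier(random_state=0)

scores_without = cross_val_score(tree, X_all[:, 3:], y_demo, cv=cv)
scores_with = cross_val_score(tree, X_all, y_demo, cv=cv)

print(round(scores_without.mean(), 3), round(scores_with.mean(), 3))
```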

In [31]:
#kfold validation
k_fold2 = KFold(n_splits = 9, shuffle = True, random_state = 3)

#initialize list to store accuracy scores
accuracies_tree2 = []


for train_index, test_index in k_fold2.split(X2):
    X2_trainf, X2_testf = X2.iloc[train_index], X2.iloc[test_index]
    y2_trainf, y2_testf = y2.iloc[train_index], y2.iloc[test_index]
    
    kftree_classifier2 = DecisionTreeClassifier()
    
    #train tree
    kftree_classifier2.fit(X2_trainf, y2_trainf)
    
    #test tree
    kftree_preds2 = kftree_classifier2.predict(X2_testf)
    
    kfold_accuracy2 = accuracy_score(y2_testf, kftree_preds2)
    
    accuracies_tree2.append(kfold_accuracy2)

avg_accuracy2 = round(sum(accuracies_tree2)/len(accuracies_tree2),3)
avg_accuracy2

0.055

In [15]:
min(accuracies_tree2)

0.0

In [16]:
max(accuracies_tree2)

0.2

In [17]:
print(accuracies_tree1, accuracies_tree2)

[0.0, 0.0, 0.0, 0.0, 0.1, 0.0, 0.0, 0.0, 0.0] [0.0, 0.09090909090909091, 0.0, 0.1, 0.1, 0.0, 0.2, 0.0, 0.0]


Averaging the per-fold scores printed above gives about 0.011 for the first tree and about 0.055 for the second, so the tree that includes 'region' does somewhat better on average. The worst-performing fold scored 0 in both instances, and the best-performing folds scored 0.1 and 0.2 respectively. Neither tree performs well, so any information gain from the 'region' variable is small in absolute terms.
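
The fold averages can be checked directly from the two lists printed above. The short sketch below copies those lists verbatim and recomputes each mean from its own list. The `mean_tree1` and `mean_tree2` names are illustrative.

```python
# Per-fold accuracies copied from the printed output above.
accuracies_tree1 = [0.0, 0.0, 0.0, 0.0, 0.1, 0.0, 0.0, 0.0, 0.0]
accuracies_tree2 = [0.0, 0.09090909090909091, 0.0, 0.1, 0.1, 0.0, 0.2, 0.0, 0.0]

# Each average must be taken over its own list.
mean_tree1 = round(sum(accuracies_tree1) / len(accuracies_tree1), 3)
mean_tree2 = round(sum(accuracies_tree2) / len(accuracies_tree2), 3)

print(mean_tree1, mean_tree2)  # 0.011 0.055
```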

# Part 2: Random Forest

In [18]:
#random forest for first decision tree model with 'handPost' & 'thumbSty' variables

rand_forest = RandomForestClassifier(n_estimators = 500)

rand_forest.fit(X1_train, y1_train)

rand_forest1_preds = rand_forest.predict(X1_test)

rand_forest1_accuracy = accuracy_score(y1_test, rand_forest1_preds)

rand_forest1_accuracy

0.03571428571428571

This first random forest, trained on the handPost and thumbSty variables, performed the same as the single decision tree from earlier: both scored about 0.036.
Since I performed k-fold CV on the decision tree, I'll do the same for this random forest even though it is already an ensemble method, to get a more robust estimate of the model's performance and reduce the variability of that estimate.

In [30]:
#kfold validation
k_fold = KFold(n_splits = 7, shuffle = True, random_state = 19)

#initialize list to store accuracy scores
accuracies_forest1 = []


for train_index, test_index in k_fold.split(X1):
    X1_trainrf, X1_testrf = X1.iloc[train_index], X1.iloc[test_index]
    y1_trainrf, y1_testrf = y1.iloc[train_index], y1.iloc[test_index]
    
    #note: a single DecisionTreeClassifier is fit per fold here, not a RandomForestClassifier
    rf_classifier = DecisionTreeClassifier()
    
    #train tree
    rf_classifier.fit(X1_trainrf, y1_trainrf)
    
    #test tree
    rf_preds = rf_classifier.predict(X1_testrf)
    
    rf_accuracy = accuracy_score(y1_testrf, rf_preds)
    
    accuracies_forest1.append(rf_accuracy)

f_accuracy = round(sum(accuracies_forest1)/len(accuracies_forest1),3)
f_accuracy

0.01

In [21]:
min(accuracies_forest1)

0.0

In [22]:
max(accuracies_forest1)

0.07142857142857142

Note that the loop above fits a DecisionTreeClassifier in each fold, so these numbers describe single trees under 7-fold CV rather than the 500-tree forest. On the handPost and thumbSty variables the average accuracy was 0.010, below the 0.036 that the random forest scored on its hold-out split. The best-performing fold had an accuracy of about 0.071, well above both the average and the hold-out score, which shows how strongly the result depends on the split.
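
For reference, cross-validating an actual forest only requires passing a `RandomForestClassifier` as the estimator. This is a minimal sketch on synthetic stand-in data (`X_demo`, `y_demo` are assumptions), with 100 trees instead of 500 to keep it quick, so its output is not comparable to the numbers reported above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (assumption): same shape as X1/y1.
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 2, size=(93, 6))
y_demo = rng.integers(1874, 1941, size=93)

# Same fold settings as the loop above.
cv = KFold(n_splits=7, shuffle=True, random_state=19)

# A fresh forest is cloned and fit on each training fold.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest_scores = cross_val_score(forest, X_demo, y_demo, cv=cv, scoring="accuracy")

print(forest_scores.round(3), round(forest_scores.mean(), 3))
```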

In [23]:
#random forest for second decision tree model with 'handPost', 'thumbSty' & 'region' variables

rand_forest2 = RandomForestClassifier(n_estimators = 350)

rand_forest2.fit(X2_train, y2_train)

rand_forest2_preds = rand_forest2.predict(X2_test)

rand_forest2_accuracy = accuracy_score(y2_test, rand_forest2_preds)

rand_forest2_accuracy

0.03571428571428571

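The random forest including the region variable scored about 0.036 on its hold-out split, the same as the forest without it, so this single split shows no gain from 'region'. The two forests were evaluated on different splits (random_state 15 and 45), which makes the comparison approximate.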

In [27]:
#kfold validation
k_fold = KFold(n_splits = 7, shuffle = True, random_state = 20)

#initialize list to store accuracy scores
accuracies_forest2 = []


for train_index, test_index in k_fold.split(X2):
    X2_trainrf, X2_testrf = X2.iloc[train_index], X2.iloc[test_index]
    y2_trainrf, y2_testrf = y2.iloc[train_index], y2.iloc[test_index]
    
    #note: a single DecisionTreeClassifier is fit per fold here, not a RandomForestClassifier
    rf2_classifier = DecisionTreeClassifier()
    
    #train tree
    rf2_classifier.fit(X2_trainrf, y2_trainrf)
    
    #test tree
    rf2_preds = rf2_classifier.predict(X2_testrf)
    
    rf2_accuracy = accuracy_score(y2_testrf, rf2_preds)
    
    accuracies_forest2.append(rf2_accuracy)

f2_accuracy = round(sum(accuracies_forest2)/len(accuracies_forest2),3)
f2_accuracy

0.044

In [25]:
min(accuracies_forest2)

0.0

In [26]:
max(accuracies_forest2)

0.15384615384615385

The k-fold CV results are higher for the feature set that includes 'region': the average accuracy was 0.010 without it and 0.044 with it. Both loops fit a DecisionTreeClassifier in each fold rather than a random forest, and they use different fold seeds (19 and 20), so this is suggestive rather than conclusive. It is consistent with Part 1, where adding 'region' raised the best-fold accuracy from 0.1 to 0.2, and it hints that 'region' carries a small amount of information about birth year. Whether a random forest captures that effect better than a single decision tree cannot be concluded from these runs, since both hold-out forests scored about 0.036.
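
A more direct way to see how much a forest relies on 'region' is to sum its impurity-based feature importances per original variable. The sketch below runs on synthetic data (the random `X_demo` and binary `y_demo` are assumptions), so the printed shares are meaningless in themselves. Only the column names match the encoded frame above.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

columns = ["region_1", "region_2", "region_3",
           "handPost_1", "handPost_2", "handPost_3",
           "thumbSty_1", "thumbSty_2", "thumbSty_3"]

# Synthetic stand-in data (assumption): random indicators and labels.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.integers(0, 2, size=(93, 9)), columns=columns)
y_demo = rng.integers(0, 2, size=93)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_demo, y_demo)

# One importance per encoded column, then summed per original variable
# by grouping on the prefix before the underscore.
importances = pd.Series(forest.feature_importances_, index=columns)
by_variable = importances.groupby(importances.index.str.split("_").str[0]).sum()

print(by_variable.round(3))
```

Impurity-based importances are computed on the training data, so they are best treated as a rough indication alongside the cross-validated accuracy.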