In this exercise you will work with the Blues Guitarists Hand Posture and Thumbing Style by Region and Birth Period data, which you can download from here. This dataset has 93 entries for blues guitarists born between 1874 and 1940. Apart from the guitarists' names, the dataset contains the following four features:

Regions: 1 = East, 2 = Delta, 3 = Texas  
Years: 0 = born before 1906, 1 = born in 1906 or later  
Hand postures: 1 = Extended, 2 = Stacked, 3 = Lutiform  
Thumb styles: 1 = Alternating, 2 = Utility, 3 = Dead  

Using a decision tree on this dataset, how accurately can you tell a guitarist's birth period (before 1906, or 1906 and later) from their hand posture and thumb style? How does including the region when training the model affect the evaluation? Now do the same using a random forest (for both of the above cases) and report the difference.

Make sure to use appropriate training-testing parameters for your evaluation. You should also run the algorithms multiple times, measure various accuracies, and report the average (and perhaps the range).

### import packages

In [18]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier

# Visualize the decision tree
import graphviz

### read data

In [4]:
df = pd.read_csv("blues_hand.csv")
df.head()

Unnamed: 0,name,state,brthYr,post1906,region,handPost,thumbSty
0,Henry Thomas,TX,1874,0,3,1,3
1,Frank Stokes,TN,1887,0,2,1,3
2,Sam Collins,MS,1887,0,2,1,2
3,Peg Leg Howell,GA,1888,0,1,2,2
4,Huddie Ledbetter,TX,1888,0,3,2,3


### deploy models
- decision tree

In [16]:
# Creating X and Y
X = df[['handPost','thumbSty','region']]
y = df['post1906']

# Making training and testing data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)

# Creating the DTC and fitting the model
dtree = DecisionTreeClassifier()
dtree.fit(X_train,y_train)

#Predicting on test data
predictions = dtree.predict(X_test)

# Printing the accuracy score and confusion matrix
print(accuracy_score(y_test,predictions))
print(confusion_matrix(y_test,predictions))

# Feature names used to label the tree nodes
features = list(X.columns)

# DOT data
dot_data = tree.export_graphviz(dtree, out_file=None,
                                feature_names=features,
                                class_names=['0', '1'],
                                filled=True)

# Draw graph
graph = graphviz.Source(dot_data, format="png")
graph.render('blues_hand_dt', view=True)

0.4642857142857143
[[ 8  4]
 [11  5]]


'blues_hand_dt.png'
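
Rendering the PNG depends on the Graphviz binaries being installed. As a possible alternative, scikit-learn's `export_text` prints a fitted tree's split rules as plain text. The sketch below is only an illustration: `X_demo` and `y_demo` are a tiny made-up sample that follows the notebook's coding (features 1 to 3, label 0/1), not rows from blues_hand.csv.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up sample: columns are [handPost, thumbSty], label is post1906
X_demo = [[1, 3], [1, 2], [2, 2], [2, 3], [3, 1], [3, 3]]
y_demo = [0, 0, 0, 1, 1, 1]

# Fit a small tree and print its rules as indented text
demo_tree = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)
rules = export_text(demo_tree, feature_names=["handPost", "thumbSty"])
print(rules)
```

With the notebook's own model, the equivalent call would presumably be `export_text(dtree, feature_names=list(X.columns))`.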

- random forest

In [20]:
rfc = RandomForestClassifier(n_estimators=100)
# y_train is a pandas Series, which is already 1-D, so it can be passed to fit directly
rfc.fit(X_train, y_train)

predictions = rfc.predict(X_test)
print("Accuracy:", accuracy_score(y_test, predictions))
print(confusion_matrix(y_test,predictions))

Accuracy: 0.4642857142857143
[[4 8]
 [7 9]]
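
The two cells above each report a single 70/30 split that includes the region, whereas the exercise asks for repeated runs, an average (and range), and a comparison with and without the region. One way this could be done is sketched below. `repeated_accuracy` is a hypothetical helper, and the number of runs, the stratified split, and the seeding scheme are assumptions rather than part of the original notebook. The `demo` frame is random synthetic data that only mimics the column names and value coding so the block runs without the CSV; its accuracies say nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def repeated_accuracy(make_model, X, y, n_runs=50, test_size=0.30):
    """Fit a fresh model on n_runs random splits; return (mean, min, max) accuracy."""
    scores = []
    for seed in range(n_runs):
        # A different seed per run gives a different stratified 70/30 split
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=test_size, stratify=y, random_state=seed
        )
        model = make_model(seed)
        model.fit(X_train, y_train)
        scores.append(accuracy_score(y_test, model.predict(X_test)))
    return float(np.mean(scores)), float(np.min(scores)), float(np.max(scores))


# Synthetic stand-in (assumption): same columns and coding as blues_hand.csv
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "handPost": rng.integers(1, 4, size=93),
    "thumbSty": rng.integers(1, 4, size=93),
    "region": rng.integers(1, 4, size=93),
    "post1906": rng.integers(0, 2, size=93),
})

feature_sets = {
    "without region": ["handPost", "thumbSty"],
    "with region": ["handPost", "thumbSty", "region"],
}
models = {
    "decision tree": lambda seed: DecisionTreeClassifier(random_state=seed),
    "random forest": lambda seed: RandomForestClassifier(n_estimators=100, random_state=seed),
}

# Evaluate every model on every feature set and report mean and range
results = {}
for model_name, make_model in models.items():
    for set_name, cols in feature_sets.items():
        mean, low, high = repeated_accuracy(
            make_model, demo[cols], demo["post1906"], n_runs=10
        )
        results[(model_name, set_name)] = (mean, low, high)
        print(f"{model_name}, {set_name}: mean={mean:.3f}, range=[{low:.3f}, {high:.3f}]")
```

To apply this to the real data, replace `demo` with the `df` loaded earlier. With a test set of only 28 rows, accuracy can swing noticeably from split to split, which is why the range is worth reporting alongside the mean.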
