# Decision Tree Classifier - random_state

In the previous notebook we got an accuracy score of just over 40%.

Let's do that again.

In [15]:
# Imports
from sklearn import metrics
from sklearn.tree import DecisionTreeClassifier
import pandas as pd

# Training Data
training_raw = pd.read_table("../data/training_data.dat")
df_training = pd.DataFrame(training_raw)
# Test Data
test_raw = pd.read_table("../data/test_data.dat")
df_test = pd.DataFrame(test_raw)
# target names
target_categories = ['Unclassified','Art','Aviation','Boating','Camping /Walking /Climbing','Collecting']
# Extract target results from the pandas DataFrame
target = df_training["CategoryID"].values
# Create classifier class
model_dtc = DecisionTreeClassifier()
# features
feature_names_integers = ['Barcode','UnitRRP']
# Extract features from the pandas DataFrame (without description)
training_data_integers = df_training[feature_names_integers].values
training_data_integers[:3]
# train model
model_dtc.fit(training_data_integers, target)
# Extract test data and test the model
test_data_integers = df_test[feature_names_integers].values
test_target = df_test["CategoryID"].values
expected = test_target
predicted_dtc = model_dtc.predict(test_data_integers)
print(metrics.classification_report(expected, predicted_dtc, target_names=target_categories))
print(metrics.confusion_matrix(expected, predicted_dtc))
print(metrics.accuracy_score(expected, predicted_dtc, normalize=True, sample_weight=None))

                            precision    recall  f1-score   support

              Unclassified       0.33      0.05      0.08        43
                       Art       0.39      0.80      0.52        20
                  Aviation       0.50      0.56      0.53        54
                   Boating       0.47      0.57      0.52        28
Camping /Walking /Climbing       0.42      0.53      0.47        15
                Collecting       0.61      0.61      0.61        31

               avg / total       0.46      0.48      0.43       191

[[ 2  0 24  9  5  3]
 [ 1 16  1  2  0  0]
 [ 1 16 30  1  4  2]
 [ 0  4  3 16  1  4]
 [ 2  0  2  0  8  3]
 [ 0  5  0  6  1 19]]
0.476439790576
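
The accuracy score on the last line can be recovered by hand from the confusion matrix: the diagonal holds the correctly classified rows, and each row sums to that category's support. A minimal sketch, with the matrix copied from the output above (the variable names are only for illustration):

```python
# Confusion matrix copied from the output above
# (rows: expected category, columns: predicted category)
confusion = [
    [2, 0, 24, 9, 5, 3],
    [1, 16, 1, 2, 0, 0],
    [1, 16, 30, 1, 4, 2],
    [0, 4, 3, 16, 1, 4],
    [2, 0, 2, 0, 8, 3],
    [0, 5, 0, 6, 1, 19],
]

# Correct predictions sit on the diagonal
correct = sum(confusion[i][i] for i in range(len(confusion)))
# Every test row appears exactly once in the matrix
total = sum(sum(row) for row in confusion)

accuracy = correct / total
print(correct, total, round(accuracy, 6))  # 91 191 0.47644
```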


And again.

In [16]:
model_dtc = DecisionTreeClassifier()
model_dtc.fit(training_data_integers, target)
predicted_dtc = model_dtc.predict(test_data_integers)
print(metrics.accuracy_score(expected, predicted_dtc, normalize=True, sample_weight=None))

0.460732984293


One more time :)

In [17]:
model_dtc = DecisionTreeClassifier()
model_dtc.fit(training_data_integers, target)
predicted_dtc = model_dtc.predict(test_data_integers)
print(metrics.accuracy_score(expected, predicted_dtc, normalize=True, sample_weight=None))

0.434554973822


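We see that the results are not the same. This is because scikit-learn's Decision Tree Classifier randomly permutes the features at each split, so when several candidate splits improve the criterion equally, a different one may be chosen on each run. As we are about to start trying to improve the results with different strategies for preparing and loading the data, scores that vary from run to run would make it hard to tell whether a change actually helped.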
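
One way this kind of run-to-run variation can arise is a tie: two candidate features score equally well and the winner depends on the random order in which they are visited. The toy sketch below illustrates the idea only. It is not scikit-learn's implementation; `pick_split_feature` and the scores are hypothetical, and the seed 511 is simply the one used later in this notebook.

```python
import random

def pick_split_feature(scores, rng):
    """Return the best-scoring feature, visiting features in a random order."""
    names = list(scores)
    rng.shuffle(names)  # random visiting order, driven by the generator's seed
    best = names[0]
    for name in names[1:]:
        # Strict comparison: on a tie, the feature visited first wins
        if scores[name] > scores[best]:
            best = name
    return best

# Hypothetical case: both features improve the split criterion equally
scores = {"Barcode": 0.30, "UnitRRP": 0.30}

# Different seeds stand in for separate unseeded runs
varied = {pick_split_feature(scores, random.Random(seed)) for seed in range(200)}
# The same seed every time always gives the same choice
seeded = {pick_split_feature(scores, random.Random(511)) for _ in range(200)}

print(sorted(varied))
print(sorted(seeded))
```

With a fixed seed the set `seeded` contains a single feature, while across different seeds both features get picked.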

To avoid this we can manually set the random_state.

In [18]:
model_dtc = DecisionTreeClassifier(random_state=511)
model_dtc.fit(training_data_integers, target)
predicted_dtc = model_dtc.predict(test_data_integers)
print(metrics.accuracy_score(expected, predicted_dtc, normalize=True, sample_weight=None))

0.471204188482


In [19]:
model_dtc = DecisionTreeClassifier(random_state=511)
model_dtc.fit(training_data_integers, target)
predicted_dtc = model_dtc.predict(test_data_integers)
print(metrics.accuracy_score(expected, predicted_dtc, normalize=True, sample_weight=None))

0.471204188482
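
Rather than comparing printed scores by eye, the same check can be automated. The sketch below is self-contained, so it uses synthetic data from `make_classification` as a stand-in for the training and test files; `fit_and_predict` is a hypothetical helper, not part of scikit-learn.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real data: 300 rows, 2 features, 3 classes
X, y = make_classification(
    n_samples=300, n_features=2, n_informative=2, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)
X_train, y_train = X[:200], y[:200]
X_test = X[200:]

def fit_and_predict(seed):
    """Train a fresh tree with the given random_state and predict the test rows."""
    model = DecisionTreeClassifier(random_state=seed)
    model.fit(X_train, y_train)
    return model.predict(X_test)

first = fit_and_predict(511)
second = fit_and_predict(511)

# Same data and same random_state give identical predictions
print(np.array_equal(first, second))  # True
```

Any integer works as the seed; what matters is that it stays the same between the runs being compared.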
