# Decision Trees for Classification and Regression
https://stackabuse.com/decision-trees-in-python-with-scikit-learn/

Decision trees are widely used in supervised machine learning for both regression and classification. Each internal node of the tree tests one feature (attribute) of the dataset, and the most informative attribute is placed at the root node.

To make a prediction, start at the root node and work down the tree, following the branch that matches the sample at each node. Continue until a leaf node is reached; the leaf holds the prediction of the decision tree.
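
The root-to-leaf walk can be sketched in a few lines of plain Python. This is only an illustration: the tree below is hand-built, its thresholds are made up, and the `predict_one` helper is a hypothetical name, not part of sklearn.

```python
# Hypothetical hand-built tree: internal nodes test one feature against a
# threshold, leaves carry the predicted class. Thresholds are invented.
tree = {
    "feature": "Variance", "threshold": 0.3,
    "left": {                      # taken when Variance <= 0.3
        "feature": "Skewness", "threshold": 5.0,
        "left": {"leaf": 1},       # taken when Skewness <= 5.0
        "right": {"leaf": 0},
    },
    "right": {"leaf": 0},          # taken when Variance > 0.3
}

def predict_one(node, sample):
    """Walk from the root to a leaf and return the leaf's prediction."""
    while "leaf" not in node:
        go_left = sample[node["feature"]] <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["leaf"]

sample_a = {"Variance": 3.6216, "Skewness": 8.6661}
sample_b = {"Variance": -1.2, "Skewness": 2.0}
print(predict_one(tree, sample_a))  # 0
print(predict_one(tree, sample_b))  # 1
```

Each prediction only costs as many comparisons as the tree is deep, which is why evaluation is cheap.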

Advantages of decision trees:
    1. Can predict both continuous and discrete values
    2. Require relatively little effort to train
    3. Can classify non-linearly separable data
    4. Fast and efficient at prediction time compared to KNN and other techniques

# Classification Example

In [42]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [17]:
dataset = pd.read_csv("/home/colinphillips17/Downloads/bill_authentication.csv")

In [18]:
dataset.shape

(1372, 5)

In [19]:
dataset.head()

   Variance  Skewness  Curtosis   Entropy  Class
0    3.6216    8.6661   -2.8073  -0.44699      0
1    4.5459    8.1674   -2.4586   -1.4621      0
2     3.866   -2.6383    1.9242   0.10645      0
3    3.4566    9.5228   -4.0112   -3.5944      0
4   0.32924   -4.4552    4.5718   -0.9888      0


In [20]:
dataset.columns

Index([u'Variance', u'Skewness', u'Curtosis', u'Entropy', u'Class'], dtype='object')

In [21]:
# drop method returns dataset without that specific column
x = dataset.drop('Class',axis=1)

# make y the class column
y = dataset['Class']

# Preparing Data

In [22]:
# model_selection module of sklearn contains train_test_split
# splits data randomly into training (80%) and testing (20%) sets
# no random_state is set here, so the split and the scores change between runs

from sklearn.model_selection import train_test_split
x_train,x_test,y_train, y_test = train_test_split(x,y,test_size=0.20)
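
A rough sketch of what a random 80/20 split does, written in plain Python. The helper name `simple_train_test_split` and its `seed` argument are made up for illustration, and sklearn's own implementation differs in detail. The sketch assumes the test share is rounded up, which agrees with the 275 test rows in the classification report further down (0.20 × 1372 = 274.4).

```python
import math
import random

def simple_train_test_split(n_rows, test_size=0.20, seed=0):
    """Return (train_indices, test_indices) after a seeded shuffle."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)      # same seed, same shuffle
    n_test = math.ceil(test_size * n_rows)    # assumption: round the test share up
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = simple_train_test_split(1372, test_size=0.20, seed=0)
print(len(train_idx), len(test_idx))  # 1097 275
```

The `seed` here plays the role that `random_state` plays in `train_test_split`: fixing it makes the split repeatable.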

# Training the Decision Tree

In [23]:
# tree module of sklearn contains built-in classes for decision trees
# Use the DecisionTreeClassifier class
# fit method trains the algorithm on the training data

from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier()
classifier.fit(x_train,y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')
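
The `criterion='gini'` in the output above means candidate splits are scored by Gini impurity. A minimal sketch of that measure follows; the function names `gini` and `weighted_gini` are illustrative and are not sklearn's API.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def weighted_gini(left, right):
    """Impurity of a split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

print(gini([0, 0, 1, 1]))             # 0.5 (evenly mixed, worst case for two classes)
print(gini([0, 0, 0, 0]))             # 0.0 (pure node)
print(weighted_gini([0, 0], [1, 1]))  # 0.0 (a split that separates the classes)
```

A lower weighted impurity means a better split, so a tree builder would prefer the threshold that gives the smallest value.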

In [26]:
# predict method used to make predictions on test data
y_pred = classifier.predict(x_test)

# Evaluate the Algorithm
Determine how accurate the algorithm is. Common metrics for classification are:
    - Confusion Matrix
    - Precision
    - Recall
    - F1 Score
The metrics module of sklearn contains the classification_report and confusion_matrix functions.

In [27]:
from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_test,y_pred))
print(classification_report(y_test,y_pred))

[[153   1]
 [  1 120]]
              precision    recall  f1-score   support

           0       0.99      0.99      0.99       154
           1       0.99      0.99      0.99       121

   micro avg       0.99      0.99      0.99       275
   macro avg       0.99      0.99      0.99       275
weighted avg       0.99      0.99      0.99       275
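
The figures in the report can be reproduced by hand from the confusion matrix printed above it, reading rows as true classes and columns as predicted classes. A small sketch, with `per_class_metrics` as a made-up helper name:

```python
# Confusion matrix from the output above: rows = true class, columns = predicted
cm = [[153, 1],
      [1, 120]]

def per_class_metrics(cm, k):
    """Precision, recall and F1 for class k."""
    tp = cm[k][k]
    fp = sum(row[k] for row in cm) - tp   # predicted k, actually another class
    fn = sum(cm[k]) - tp                  # actually k, predicted another class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

for k in range(len(cm)):
    p, r, f = per_class_metrics(cm, k)
    print(k, round(p, 2), round(r, 2), round(f, 2))

# Overall accuracy: correct predictions on the diagonal over all test rows
accuracy = sum(cm[k][k] for k in range(len(cm))) / sum(map(sum, cm))
print(round(accuracy, 4))  # 0.9927
```

Both classes round to 0.99 on every metric, matching the report: only 2 of the 275 test rows were misclassified.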



# Regression Example

In [53]:
# Load the petrol consumption data; a DecisionTreeRegressor is trained on it below
dataset2 = pd.read_csv("/home/colinphillips17/Downloads/petrol_consumption.csv")

In [54]:
dataset2.head()

   Petrol_tax  Average_income  Paved_Highways  Population_Driver_licence(%)  Petrol_Consumption
0         9.0            3571            1976                         0.525                 541
1         9.0            4092            1250                         0.572                 524
2         9.0            3865            1586                          0.58                 561
3         7.5            4870            2351                         0.529                 414
4         8.0            4399             431                         0.544                 410


In [55]:
# describe method summarises each column: count, mean, std, min, quartiles, max
dataset2.describe()

       Petrol_tax  Average_income  Paved_Highways  Population_Driver_licence(%)  Petrol_Consumption
count        48.0            48.0            48.0                          48.0                48.0
mean     7.668333     4241.833333     5565.416667                      0.570333          576.770833
std       0.95077      573.623768     3491.507166                       0.05547          111.885816
min           5.0          3063.0           431.0                         0.451               344.0
25%           7.0          3739.0         3110.25                       0.52975               509.5
50%           7.5          4298.0          4735.5                        0.5645               568.5
75%         8.125         4578.75          7156.0                       0.59525              632.75
max          10.0          5342.0         17782.0                         0.724               968.0


In [56]:
#process the data
x1 = dataset2.drop('Petrol_Consumption',axis = 1)
y1 = dataset2['Petrol_Consumption']

In [57]:
#split the data for training and testing
from sklearn.model_selection import train_test_split
x1_train, x1_test, y1_train, y1_test = train_test_split(x1,y1,test_size=0.2,random_state=0)

In [61]:
#train the model
from sklearn.tree import DecisionTreeRegressor
regressor = DecisionTreeRegressor()
regressor.fit(x1_train,y1_train)

DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best')
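
The `criterion='mse'` in the output above means splits are chosen to reduce squared error, with each leaf predicting the mean of its training targets. The sketch below searches for the best threshold on a single feature. It is a simplified illustration on invented toy data, with hypothetical helper names `sse` and `best_split`; sklearn's actual search is more elaborate.

```python
def sse(values):
    """Sum of squared errors around the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(xs, ys):
    """Try midpoints between sorted unique x values; return (threshold, total_sse)."""
    unique_x = sorted(set(xs))
    best = None
    for lo, hi in zip(unique_x, unique_x[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        total = sse(left) + sse(right)
        if best is None or total < best[1]:
            best = (threshold, total)
    return best

# Invented toy data, not taken from petrol_consumption.csv
tax = [5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
consumption = [900, 880, 600, 590, 560, 550]

threshold, total = best_split(tax, consumption)
print(threshold, total)  # 6.5 1900.0
```

Splitting at 6.5 separates the two high-consumption rows from the rest, so each side sits close to its own mean.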

In [62]:
#predict based on model
y1_pred = regressor.predict(x1_test)

In [63]:
#see predictions of model and actual values
df = pd.DataFrame({"Actual":y1_test, "Predicted": y1_pred})
df

    Actual  Predicted
29     534      541.0
4      410      414.0
26     577      574.0
30     571      554.0
32     577      631.0
37     704      644.0
34     487      628.0
40     587      540.0
7      467      414.0
10     580      498.0


In [65]:
#Evaluate the algorithm
from sklearn import metrics
print("Mean Absolute Error:", metrics.mean_absolute_error(y1_test,y1_pred))
print("Mean Squared Error:", metrics.mean_squared_error(y1_test,y1_pred))
print("Root Mean Squared Error:", np.sqrt(metrics.mean_squared_error(y1_test,y1_pred)))

('Mean Absolute Error:', 46.8)
('Mean Squared Error:', 3850.2)
('Root Mean Squared Error:', 62.04997985495241)
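
The three error values can be checked by hand from the Actual and Predicted columns shown earlier. A short sketch in plain Python, with the ten test rows copied from that table:

```python
import math

# Values copied from the Actual / Predicted table above
actual = [534, 410, 577, 571, 577, 704, 487, 587, 467, 580]
predicted = [541.0, 414.0, 574.0, 554.0, 631.0, 644.0, 628.0, 540.0, 414.0, 498.0]

errors = [a - p for a, p in zip(actual, predicted)]

mae = sum(abs(e) for e in errors) / len(errors)   # mean absolute error
mse = sum(e ** 2 for e in errors) / len(errors)   # mean squared error
rmse = math.sqrt(mse)                             # root mean squared error

print("Mean Absolute Error:", mae)
print("Mean Squared Error:", mse)
print("Root Mean Squared Error:", rmse)
```

The RMSE of about 62 is roughly 11% of the mean consumption of 576.77 reported by describe(), and the single large miss (row 34, off by 141) contributes over half of the squared error.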
