Assignment 6 - Classification and Regression Trees

The problems in this assignment are based on the exercise 9.3 of Chapter 9 in Data Mining for Business Analytics.

Scenario: Predicting Prices of Used Cars using Regression Trees.

Data: The file ToyotaCorolla.csv contains the data on used cars (Toyota Corolla) on sale during late summer of 2004 in the Netherlands. It has 1436 records containing details on 38 attributes, including Price, Age, Kilometers, HP, and other specifications. The goal is to predict the price of a used Toyota Corolla based on its specifications.(The example in Section 9.7 is a subset of this dataset).

In [1]:
%matplotlib inline
import matplotlib.pylab as plt
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier
from sklearn.metrics import mean_squared_error, accuracy_score
from dmba import plotDecisionTree, regressionSummary, classificationSummary
import math

no display found. Using non-interactive Agg backend


In [2]:
# Load the data
cars_df = pd.read_csv("dmba/ToyotaCorolla.csv")

# Verify data is loaded correctly
print("Shape", cars_df.shape)  # determine data frame dimensions
cars_df.head(15)  # view the first 15 observations

Shape (1436, 39)


Unnamed: 0,Id,Model,Price,Age_08_04,Mfg_Month,Mfg_Year,KM,Fuel_Type,HP,Met_Color,...,Powered_Windows,Power_Steering,Radio,Mistlamps,Sport_Model,Backseat_Divider,Metallic_Rim,Radio_cassette,Parking_Assistant,Tow_Bar
0,1,TOYOTA Corolla 2.0 D4D HATCHB TERRA 2/3-Doors,13500,23,10,2002,46986,Diesel,90,1,...,1,1,0,0,0,1,0,0,0,0
1,2,TOYOTA Corolla 2.0 D4D HATCHB TERRA 2/3-Doors,13750,23,10,2002,72937,Diesel,90,1,...,0,1,0,0,0,1,0,0,0,0
2,3,TOYOTA Corolla 2.0 D4D HATCHB TERRA 2/3-Doors,13950,24,9,2002,41711,Diesel,90,1,...,0,1,0,0,0,1,0,0,0,0
3,4,TOYOTA Corolla 2.0 D4D HATCHB TERRA 2/3-Doors,14950,26,7,2002,48000,Diesel,90,0,...,0,1,0,0,0,1,0,0,0,0
4,5,TOYOTA Corolla 2.0 D4D HATCHB SOL 2/3-Doors,13750,30,3,2002,38500,Diesel,90,0,...,1,1,0,1,0,1,0,0,0,0
5,6,TOYOTA Corolla 2.0 D4D HATCHB SOL 2/3-Doors,12950,32,1,2002,61000,Diesel,90,0,...,1,1,0,1,0,1,0,0,0,0
6,7,TOYOTA Corolla 2.0 D4D 90 3DR TERRA 2/3-Doors,16900,27,6,2002,94612,Diesel,90,1,...,1,1,0,0,1,1,0,0,0,0
7,8,TOYOTA Corolla 2.0 D4D 90 3DR TERRA 2/3-Doors,18600,30,3,2002,75889,Diesel,90,1,...,1,1,0,0,0,1,0,0,0,0
8,9,TOYOTA Corolla 1800 T SPORT VVT I 2/3-Doors,21500,27,6,2002,19700,Petrol,192,0,...,1,1,1,0,0,0,1,1,0,0
9,10,TOYOTA Corolla 1.9 D HATCHB TERRA 2/3-Doors,12950,23,10,2002,71138,Diesel,69,0,...,0,1,0,0,0,1,0,0,0,0


Data Preparation: You will need to convert the categorical Fuel_Type column into dummy variables. Split the data into training (60%), and validation (40%) datasets (use random_state=1).

In [3]:
cars_df.columns

Index(['Id', 'Model', 'Price', 'Age_08_04', 'Mfg_Month', 'Mfg_Year', 'KM',
       'Fuel_Type', 'HP', 'Met_Color', 'Color', 'Automatic', 'CC', 'Doors',
       'Cylinders', 'Gears', 'Quarterly_Tax', 'Weight', 'Mfr_Guarantee',
       'BOVAG_Guarantee', 'Guarantee_Period', 'ABS', 'Airbag_1', 'Airbag_2',
       'Airco', 'Automatic_airco', 'Boardcomputer', 'CD_Player',
       'Central_Lock', 'Powered_Windows', 'Power_Steering', 'Radio',
       'Mistlamps', 'Sport_Model', 'Backseat_Divider', 'Metallic_Rim',
       'Radio_cassette', 'Parking_Assistant', 'Tow_Bar'],
      dtype='object')

In [4]:
len(cars_df.columns)

39

In [5]:
required = ['Price', 'Age_08_04', 'KM', 'Fuel_Type', 'HP', 'Automatic', 'Doors', 'Quarterly_Tax',
              'Mfr_Guarantee', 'Guarantee_Period', 'Airco', 'Automatic_airco', 'CD_Player',
              'Powered_Windows', 'Sport_Model', 'Tow_Bar']
cars_df = cars_df[required]

# Convert the categorical Fuel_Type to dummy variables
cars_df = pd.get_dummies(cars_df, drop_first=True)
cars_df.tail(15)  # view the first 15 observations

Unnamed: 0,Price,Age_08_04,KM,HP,Automatic,Doors,Quarterly_Tax,Mfr_Guarantee,Guarantee_Period,Airco,Automatic_airco,CD_Player,Powered_Windows,Sport_Model,Tow_Bar,Fuel_Type_Diesel,Fuel_Type_Petrol
1421,8500,78,36000,86,1,3,69,0,3,0,0,0,0,0,0,0,1
1422,7600,78,36000,110,0,3,69,1,3,0,0,0,1,0,0,0,1
1423,7950,80,35821,86,1,3,19,0,3,0,0,0,0,0,0,0,1
1424,7750,73,34717,86,0,3,69,0,6,0,0,0,0,0,0,0,1
1425,7950,80,34000,86,0,4,69,1,12,0,0,0,0,0,0,0,1
1426,9950,78,30964,110,1,3,85,1,12,1,0,0,0,0,1,0,1
1427,8950,71,29000,86,1,3,69,1,3,0,0,0,0,0,0,0,1
1428,8450,72,26000,86,0,3,69,0,3,0,0,0,0,1,0,0,1
1429,8950,78,24000,86,1,5,85,1,12,0,0,0,0,0,1,0,1
1430,8450,80,23000,86,0,3,69,0,3,0,0,0,0,1,0,0,1


In [6]:
print("New Shape", cars_df.shape)  # determine new data frame dimensions

New Shape (1436, 17)


In [7]:
# Split the data into training (60%), and validation (40%) datasets (use random_state=1).
X = cars_df.drop(columns='Price')
y = cars_df['Price']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=1)

In [8]:
print("The training set dimensions are:", X_train.shape, ".\n")
print("The test set dimensions are are:", X_test.shape, ".")

The training set dimensions are: (861, 16) .

The test set dimensions are are: (575, 16) .


##############################################################################################################

##############################################################################################################


Question 1 (10 points) Run a regression tree (RT) with the output variable Price and input variables Age_08_04, KM, Fuel_Type (= Petrol and Diesel), HP, Automatic, Doors, Quarterly_Tax, Mfr_Guarantee, Guarantee_Period, Airco, Automatic_Airco, CD_Player, Powered_Windows, Sport_Model, and Tow_Bar.

    Which appear to be the three or four most important car specifications for predicting the car’s price? Use the feature_importances_ property of the tree.
    Compare the prediction errors of the training, and validation, by examining their RMSE.
    Redo the tree, this time adjusting the parameters to yield a shallower tree. Compare the RMSE to the deeper tree.
    Determine an optimal set of parameters (max_depth, min_impurity_decrease, min_samples_split) using cross-validated grid search.
    What is the performance of the optimized model on the training and validation sets?
    Predict the price of a used Toyota Corolla with the specifications listed in the table below.

In [9]:
reg_Tree = DecisionTreeRegressor(random_state=1)
reg_Tree.fit(X_train, y_train)

DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=1, splitter='best')

In [10]:
print("Nodes: {}".format(reg_Tree.tree_.node_count))

Nodes: 1485


############################################################################################################

Which appear to be the three or four most important car specifications for predicting the car’s price? Use the feature_importances_ property of the tree.


In [11]:
specifications = pd.DataFrame({"features": X_train.columns, "importance": reg_Tree.feature_importances_})
# Sorting out the top four car specifications based on importance
specifications.sort_values(by="importance", ascending=False).head(4)

Unnamed: 0,features,importance
0,Age_08_04,0.844867
2,HP,0.053789
1,KM,0.049601
9,Automatic_airco,0.013358


#######################################################################################################

Compare the prediction errors of the training, and validation, by examining their RMSE.

In [12]:
# Prediction error of the training set
regressionSummary(y_train, reg_Tree.predict(X_train))
# Prediction error of the validation/test set
regressionSummary(y_test, reg_Tree.predict(X_test))


Regression statistics

                      Mean Error (ME) : 0.0000
       Root Mean Squared Error (RMSE) : 0.0000
            Mean Absolute Error (MAE) : 0.0000
          Mean Percentage Error (MPE) : 0.0000
Mean Absolute Percentage Error (MAPE) : 0.0000

Regression statistics

                      Mean Error (ME) : 76.6557
       Root Mean Squared Error (RMSE) : 1492.3365
            Mean Absolute Error (MAE) : 1152.4852
          Mean Percentage Error (MPE) : -0.3363
Mean Absolute Percentage Error (MAPE) : 11.3783


Training set RMSE of 0.0000 versus Validation set RMSE of 1492.3365 shows overfitting on the Training set.

Redo the tree, this time adjusting the parameters to yield a shallower tree. Compare the RMSE to the deeper tree.

In [13]:
reg_Tree2 = DecisionTreeRegressor(max_depth=5)
reg_Tree2.fit(X_train, y_train)
print("Nodes: {}".format(reg_Tree2.tree_.node_count))

Nodes: 59


In [14]:
print("Deeper Tree Regression Comparison Summary of Training Set vs. Test Set:")
# Prediction error of the training set
regressionSummary(y_train, reg_Tree.predict(X_train))
# Prediction error of the validation/test set
regressionSummary(y_test, reg_Tree.predict(X_test))

print("\nShallower Tree Regression Comparison Summary of Training Set vs. Test Set:")
# Prediction error of the training set
regressionSummary(y_train, reg_Tree2.predict(X_train))
# Prediction error of the validation/test set
regressionSummary(y_test, reg_Tree2.predict(X_test))

Deeper Tree Regression Comparison Summary of Training Set vs. Test Set:

Regression statistics

                      Mean Error (ME) : 0.0000
       Root Mean Squared Error (RMSE) : 0.0000
            Mean Absolute Error (MAE) : 0.0000
          Mean Percentage Error (MPE) : 0.0000
Mean Absolute Percentage Error (MAPE) : 0.0000

Regression statistics

                      Mean Error (ME) : 76.6557
       Root Mean Squared Error (RMSE) : 1492.3365
            Mean Absolute Error (MAE) : 1152.4852
          Mean Percentage Error (MPE) : -0.3363
Mean Absolute Percentage Error (MAPE) : 11.3783

Shallower Tree Regression Comparison Summary of Training Set vs. Test Set:

Regression statistics

                      Mean Error (ME) : -0.0000
       Root Mean Squared Error (RMSE) : 1028.0279
            Mean Absolute Error (MAE) : 773.2770
          Mean Percentage Error (MPE) : -1.0039
Mean Absolute Percentage Error (MAPE) : 7.6715

Regression statistics

                      Mean Error (M

Where on the Deeper Tree the Training set RMSE of 0.0000 versus Validation set RMSE of 1492.3365 showed overfitting on the Training set; the Shallower Tree's Training set RMSE is 1028.0279 versus Validation set RMSE of 1160.9679 indicating better predictive capability for the Shallower Tree.

################################################################################################################

Determine an optimal set of parameters (max_depth, min_impurity_decrease, min_samples_split) using cross-validated grid search.

In [15]:
### Using the Chapter 09 ipynb file as a guide
# user grid search to find optimized tree
param_grid = {
    'max_depth': [5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 25], 
    'min_impurity_decrease': [0, 0.001, 0.002, 0.003, 0.004, 0.005, 0.01], 
    'min_samples_split': [10, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 30, 35, 40, 45, 50], 
}
gridSearch = GridSearchCV(DecisionTreeRegressor(), param_grid, cv=5, n_jobs=-1)
gridSearch.fit(X_train, y_train)
print('Initial parameters: ', gridSearch.best_params_)

param_grid = {
    'max_depth': [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20], 
    'min_impurity_decrease': [0, 0.001, 0.002, 0.003, 0.005, 0.006, 0.007, 0.008, 0.009, 0.010], 
    'min_samples_split': [14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25], 
}
gridSearch = GridSearchCV(DecisionTreeRegressor(), param_grid, cv=5, n_jobs=-1)
gridSearch.fit(X_train, y_train)
print('\nImproved parameters: ', gridSearch.best_params_)

regTree = gridSearch.best_estimator_
print("\nOptimal Parameter Tree has {} Nodes.".format(regTree.tree_.node_count))

Initial parameters:  {'max_depth': 7, 'min_impurity_decrease': 0, 'min_samples_split': 19}

Improved parameters:  {'max_depth': 7, 'min_impurity_decrease': 0.001, 'min_samples_split': 19}

Optimal Parameter Tree has 89 Nodes.


##########################################################################################################

What is the performance of the optimized model on the training and validation sets?

In [16]:
print("Optimized Model Tree Regression Comparison Summary of Training Set vs. Test Set:")
# Prediction error of the training set
regressionSummary(y_train, regTree.predict(X_train))
# Prediction error of the validation/test set
regressionSummary(y_test, regTree.predict(X_test))

Optimized Model Tree Regression Comparison Summary of Training Set vs. Test Set:

Regression statistics

                      Mean Error (ME) : 0.0000
       Root Mean Squared Error (RMSE) : 1038.9184
            Mean Absolute Error (MAE) : 742.5818
          Mean Percentage Error (MPE) : -0.9049
Mean Absolute Percentage Error (MAPE) : 7.1758

Regression statistics

                      Mean Error (ME) : 28.3958
       Root Mean Squared Error (RMSE) : 1244.4135
            Mean Absolute Error (MAE) : 944.2862
          Mean Percentage Error (MPE) : -0.9772
Mean Absolute Percentage Error (MAPE) : 9.3784


In [17]:
print("Using the optimal parameters of max_depth=7, min_impurity_decrease=0.002, and min_samples_split=19")
optreg_Tree = DecisionTreeRegressor(max_depth=7, min_samples_split=19, min_impurity_decrease=0.002,random_state=1)
optreg_Tree.fit(X_train, y_train)
# Prediction error of the training set
regressionSummary(y_train, optreg_Tree.predict(X_train))
# Prediction error of the validation/test set
regressionSummary(y_test, optreg_Tree.predict(X_test))

Using the optimal parameters of max_depth=7, min_impurity_decrease=0.002, and min_samples_split=19

Regression statistics

                      Mean Error (ME) : 0.0000
       Root Mean Squared Error (RMSE) : 1038.9184
            Mean Absolute Error (MAE) : 742.5818
          Mean Percentage Error (MPE) : -0.9049
Mean Absolute Percentage Error (MAPE) : 7.1758

Regression statistics

                      Mean Error (ME) : 28.3958
       Root Mean Squared Error (RMSE) : 1244.4135
            Mean Absolute Error (MAE) : 944.2862
          Mean Percentage Error (MPE) : -0.9772
Mean Absolute Percentage Error (MAPE) : 9.3784


With the optimal parameters the training set to validation set RMSE comparison is 1038.9184 to 1244.4135.

#################################################################################################################

Predict the price of a used Toyota Corolla with the specifications listed in the table below.

Variable	Value

Age_08_04	77

KM	117,000

Fuel_Type	Petrol

HP	110

Automatic	No

Doors	5

Quarterly_Tax	100

Mfr_Guarantee	No

Guarantee_Period	3

Airco	Yes

Automatic_airco	No

CD_Player	No

Powered_Windows	No

Sport_Model	No

Tow_Bar	Yes

In [18]:
X_train.columns

Index(['Age_08_04', 'KM', 'HP', 'Automatic', 'Doors', 'Quarterly_Tax',
       'Mfr_Guarantee', 'Guarantee_Period', 'Airco', 'Automatic_airco',
       'CD_Player', 'Powered_Windows', 'Sport_Model', 'Tow_Bar',
       'Fuel_Type_Diesel', 'Fuel_Type_Petrol'],
      dtype='object')

In [19]:
used_Corrola = pd.DataFrame([{"Age_08_04": 77, "KM": 117000, "Petrol": 1, "Diesel": 0, "HP": 110,
                              "Automatic": 0, "Doors": 5, "Quarterly_Tax": 100, "Mfr_Guarantee": 0,
                              "Guarantee_Period": 3, "Airco": 1, "Automatic_airco": 0, "CD_Player": 0,
                              "Powered_Windows": 0, "Sport_Model": 0, "Tow_Bar": 1 
                             }])

In [20]:
regTree.predict(used_Corrola)

array([8461.53846154])

Using the optimal parameters the predicted price of a used Toyota Corolla with the listed specifications is $8461.53.

##############################################################################################################

##############################################################################################################

Question 2 (10 points) Let us see the effect of turning the price variable into a categorical variable. First, create a new variable that categorizes price into 20 bins. Now repartition the data keeping Binned Price instead of Price. 

In [21]:
cars_df.head()

Unnamed: 0,Price,Age_08_04,KM,HP,Automatic,Doors,Quarterly_Tax,Mfr_Guarantee,Guarantee_Period,Airco,Automatic_airco,CD_Player,Powered_Windows,Sport_Model,Tow_Bar,Fuel_Type_Diesel,Fuel_Type_Petrol
0,13500,23,46986,90,0,3,210,0,3,0,0,0,1,0,0,1,0
1,13750,23,72937,90,0,3,210,0,3,1,0,1,0,0,0,1,0
2,13950,24,41711,90,0,3,210,1,3,0,0,0,0,0,0,1,0
3,14950,26,48000,90,0,3,210,1,3,0,0,0,0,0,0,1,0
4,13750,30,38500,90,0,3,210,1,3,1,0,0,1,0,0,1,0


In [22]:
cars_df["binned_price"] = pd.cut(cars_df.Price, 20, labels=False)
cars_df.head()

Unnamed: 0,Price,Age_08_04,KM,HP,Automatic,Doors,Quarterly_Tax,Mfr_Guarantee,Guarantee_Period,Airco,Automatic_airco,CD_Player,Powered_Windows,Sport_Model,Tow_Bar,Fuel_Type_Diesel,Fuel_Type_Petrol,binned_price
0,13500,23,46986,90,0,3,210,0,3,0,0,0,1,0,0,1,0,6
1,13750,23,72937,90,0,3,210,0,3,1,0,1,0,0,0,1,0,6
2,13950,24,41711,90,0,3,210,1,3,0,0,0,0,0,0,1,0,6
3,14950,26,48000,90,0,3,210,1,3,0,0,0,0,0,0,1,0,7
4,13750,30,38500,90,0,3,210,1,3,1,0,0,1,0,0,1,0,6


In [23]:
X = cars_df.drop(columns=["Price", "binned_price"])
y = cars_df["binned_price"]

In [24]:
print(X.head())
print(y.head())

   Age_08_04     KM  HP  Automatic  Doors  Quarterly_Tax  Mfr_Guarantee  \
0         23  46986  90          0      3            210              0   
1         23  72937  90          0      3            210              0   
2         24  41711  90          0      3            210              1   
3         26  48000  90          0      3            210              1   
4         30  38500  90          0      3            210              1   

   Guarantee_Period  Airco  Automatic_airco  CD_Player  Powered_Windows  \
0                 3      0                0          0                1   
1                 3      1                0          1                0   
2                 3      0                0          0                0   
3                 3      0                0          0                0   
4                 3      1                0          0                1   

   Sport_Model  Tow_Bar  Fuel_Type_Diesel  Fuel_Type_Petrol  
0            0        0             

In [25]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.40, random_state=1)

Run a classification tree (CT) with the same set of input variables as in the RT, and with Binned Price as the output variable. Set the model parameters so as to build a deep tree. The function pandas.cut can be used to bin your data.

    Use cross-validated grid search to determine an optimal set of parameters.
    Compare the tree generated by the CT with the one generated by the RT. Are they different? (Look at structure, the top predictors, size of tree, etc.) Why?
    Predict the price, using the RT and the CT, of a used Toyota Corolla with the specifications listed in table below.

Compare the predictions in terms of the predictors that were used, the magnitude of the difference between the two predictions, and the advantages and disadvantages of the two methods.

In [26]:
class_DeepTree = DecisionTreeClassifier(random_state=1)
class_DeepTree.fit(X_train, y_train)

names_of_class = [str(s) for s in class_DeepTree.classes_]

print("Classes: {}".format(', '.join(names_of_class)))
print("Nodes: {}".format(class_DeepTree.tree_.node_count))

Classes: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 18, 19
Nodes: 779


The Regression Tree had 1485 nodes compared to the 779 nodes of the Decision Tree.

In [27]:
print("Remembering that the top four specifications of the Regression Tree were:")
specifications = pd.DataFrame({"features": X_train.columns, "importance": reg_Tree.feature_importances_})
# Sorting out the top four car specifications based on importance
specifications.sort_values(by="importance", ascending=False).head(4)

Remembering that the top four specifications of the Regression Tree were:


Unnamed: 0,features,importance
0,Age_08_04,0.844867
2,HP,0.053789
1,KM,0.049601
9,Automatic_airco,0.013358


In [28]:
print("\nCompared to the top four specifications of the Decision Tree Clssifier:")

specifications2 = pd.DataFrame({"features": X_train.columns, "importance": class_DeepTree.feature_importances_})
# Sorting out the top four car specifications based on importance
specifications2.sort_values(by="importance", ascending=False).head(4)


Compared to the top four specifications of the Decision Tree Clssifier:


Unnamed: 0,features,importance
0,Age_08_04,0.321882
1,KM,0.294123
5,Quarterly_Tax,0.064955
2,HP,0.04597


Quarterly_Tax has replaced Automatic_airco in the top four.

In [29]:
classificationSummary(y_train, class_DeepTree.predict(X_train))
classificationSummary(y_test, class_DeepTree.predict(X_test))

Confusion Matrix (Accuracy 1.0000)

       Prediction
Actual   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16
     0   4   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
     1   0  62   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
     2   0   0 179   0   0   0   0   0   0   0   0   0   0   0   0   0   0
     3   0   0   0 251   0   0   0   0   0   0   0   0   0   0   0   0   0
     4   0   0   0   0 109   0   0   0   0   0   0   0   0   0   0   0   0
     5   0   0   0   0   0  83   0   0   0   0   0   0   0   0   0   0   0
     6   0   0   0   0   0   0  58   0   0   0   0   0   0   0   0   0   0
     7   0   0   0   0   0   0   0  14   0   0   0   0   0   0   0   0   0
     8   0   0   0   0   0   0   0   0  32   0   0   0   0   0   0   0   0
     9   0   0   0   0   0   0   0   0   0  13   0   0   0   0   0   0   0
    10   0   0   0   0   0   0   0   0   0   0  23   0   0   0   0   0   0
    11   0   0   0   0   0   0   0   0   0   0

The 100% accuracy on the training set compared to 41.39% accurace on the test/validation set shows overfitting.

############################################################################################################

In [30]:
### Using the Chapter 09 ipynb file as a guide
# user grid search to find optimized tree
param_grid = {
    'max_depth': [5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 25], 
    'min_impurity_decrease': [0, 0.001, 0.002, 0.003, 0.004, 0.005, 0.01], 
    'min_samples_split': [10, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 30, 35, 40, 45, 50], 
}
gridSearch = GridSearchCV(DecisionTreeClassifier(), param_grid, cv=5, n_jobs=-1)
gridSearch.fit(X_train, y_train)
print('Initial parameters: ', gridSearch.best_params_)

param_grid = {
    'max_depth': [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20], 
    'min_impurity_decrease': [0, 0.001, 0.002, 0.003, 0.005, 0.006, 0.007, 0.008, 0.009, 0.010], 
    'min_samples_split': [14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25], 
}
gridSearch = GridSearchCV(DecisionTreeClassifier(), param_grid, cv=5, n_jobs=-1)
gridSearch.fit(X_train, y_train)
print('\nImproved parameters: ', gridSearch.best_params_)

classTree = gridSearch.best_estimator_
print("\nOptimal Parameter Tree has {} Nodes.".format(classTree.tree_.node_count))



Initial parameters:  {'max_depth': 5, 'min_impurity_decrease': 0.002, 'min_samples_split': 15}

Improved parameters:  {'max_depth': 5, 'min_impurity_decrease': 0, 'min_samples_split': 14}

Optimal Parameter Tree has 55 Nodes.




In [31]:
print("Nodes: {}".format(classTree.tree_.node_count))

Nodes: 55


In [32]:
classificationSummary(y_train, classTree.predict(X_train))
classificationSummary(y_test, classTree.predict(X_test))

Confusion Matrix (Accuracy 0.5714)

       Prediction
Actual   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16
     0   0   2   0   1   0   1   0   0   0   0   0   0   0   0   0   0   0
     1   0  22  38   2   0   0   0   0   0   0   0   0   0   0   0   0   0
     2   0   9 119  51   0   0   0   0   0   0   0   0   0   0   0   0   0
     3   0   1  42 197   4   7   0   0   0   0   0   0   0   0   0   0   0
     4   0   0   3  63  21  21   1   0   0   0   0   0   0   0   0   0   0
     5   0   0   0  14   4  52  13   0   0   0   0   0   0   0   0   0   0
     6   0   0   0   4   0  17  32   0   2   0   0   3   0   0   0   0   0
     7   0   0   0   0   0   2   9   0   2   0   1   0   0   0   0   0   0
     8   0   0   0   0   0   0   0   0  25   0   4   3   0   0   0   0   0
     9   0   0   0   0   0   0   0   0   4   4   3   2   0   0   0   0   0
    10   0   0   0   1   0   0   0   0   4   0  15   3   0   0   0   0   0
    11   0   0   0   0   0   0   0   0   1   0

Using the Improved parameters determined by gridsearch of max_depth=5, min_impurity_decrease=0.001, and min_samples_split=15 we get a 57.14% accuracy on the training set and a 47.65% accuracy on the test/validation set.

############################################################################################################

In [33]:
used_Corrola = pd.DataFrame([{"Age_08_04": 77, "KM": 117000, "Petrol": 1, "Diesel": 0, "HP": 110,
                              "Automatic": 0, "Doors": 5, "Quarterly_Tax": 100, "Mfr_Guarantee": 0,
                              "Guarantee_Period": 3, "Airco": 1, "Automatic_airco": 0, "CD_Player": 0,
                              "Powered_Windows": 0, "Sport_Model": 0, "Tow_Bar": 1 
                             }])

In [34]:
print("The predicted bin for the price of used Corrola is %s." % classTree.predict(used_Corrola))
pd.cut(cars_df.Price, 20).cat.categories[classTree.predict(used_Corrola)]

The predicted bin for the price of used Corrola is [2].


IntervalIndex([(7165.0, 8572.5]],
              closed='right',
              dtype='interval[float64]')

The predicted price range in dollars for the used Corrola is 7165.00 to 8572.50.

############################################################################################################

Comparing the Results:
With binning in the classification tree the accuracy is lower than the regression tree and the predicted price is given as a range.