In [1]:
%load_ext autoreload
%autoreload 2
import pandas as pd
from sklearn.tree import DecisionTreeRegressor


## Decision Tree Model


<p>A decision tree is a non-parametric (distribution-free) supervised learning method used for classification and regression.</p>
<p>A decision tree is a <span style="color:red;"> white box </span>type of ML algorithm. It <span style="color:red;">exposes its internal decision-making logic</span>, which is not available in black box algorithms such as neural networks. It typically trains faster than a neural network. The time complexity of decision trees is a function of the number of records and the number of attributes in the given data.</p>
<p>Decision trees can handle high-dimensional data with good accuracy.</p>
<p>Decision trees require little data preparation. Other techniques often require data normalisation, the creation of dummy variables, and the removal of blank values. Note, however, that the scikit-learn version used in this notebook does not support missing values.</p>

<img src="img/dt_process.png" width="400px">

#### Attribute Selection Measures (ASM)
<p><span style="color:#007bb5;">Information Gain</span>: computes the difference between the entropy (impurity/randomness) before the split and the average entropy after the split of the dataset on the given attribute's values. The attribute with the<span style="font-weight:bold;"> highest information gain </span>is chosen as the splitting attribute at the node.</p>
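
As a rough illustration of the information gain definition, the sketch below computes the entropy before a split and the weighted average entropy after it. The helper names (`entropy`, `information_gain`) and the toy data are hypothetical and are not used elsewhere in this notebook.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(labels, attribute_values):
    """Entropy before the split minus the weighted average entropy after it."""
    total = len(labels)
    # Group the labels by the value the attribute takes on each row.
    partitions = {}
    for value, label in zip(attribute_values, labels):
        partitions.setdefault(value, []).append(label)
    after = sum(len(part) / total * entropy(part) for part in partitions.values())
    return entropy(labels) - after


# Hypothetical toy data: four rows, two classes.
labels = ["yes", "yes", "no", "no"]
separating = ["a", "a", "b", "b"]   # splits the classes perfectly
uninformative = ["a", "b", "a", "b"]  # each partition is still 50/50

print(information_gain(labels, separating))     # 1.0
print(information_gain(labels, uninformative))  # 0.0
```

The attribute that separates the classes removes all of the entropy, so it would be chosen at this node.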

<p><span style="color:#007bb5;">Gain Ratio</span>: handles the bias of information gain towards attributes with many distinct values (for example, a surrogate primary key, which yields useless one-row partitions) by normalizing the information gain. The attribute with the <span style="font-weight:bold;"> highest gain ratio </span> is chosen as the splitting attribute.</p>
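
The normalization can be sketched as follows, assuming the usual definition in which the information gain is divided by the split information (the entropy of the attribute's own value distribution). The function names and the toy `row_id` column are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    total = len(values)
    return -sum((n / total) * log2(n / total) for n in Counter(values).values())


def gain_ratio(labels, attribute_values):
    """Information gain divided by the split information of the attribute."""
    total = len(labels)
    partitions = {}
    for value, label in zip(attribute_values, labels):
        partitions.setdefault(value, []).append(label)
    after = sum(len(part) / total * entropy(part) for part in partitions.values())
    gain = entropy(labels) - after
    # Split information grows with the number of distinct attribute values.
    split_info = entropy(attribute_values)
    return gain / split_info if split_info > 0 else 0.0


# Hypothetical toy data.
labels = ["yes", "yes", "no", "no"]
row_id = [1, 2, 3, 4]          # surrogate key: one row per partition
group = ["a", "a", "b", "b"]   # genuinely informative attribute

print(gain_ratio(labels, row_id))  # 0.5
print(gain_ratio(labels, group))   # 1.0
```

Both attributes have the same information gain (1.0) on this data, but the gain ratio penalizes the key-like column, so the informative attribute wins.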

<p><span style="color:#007bb5;">Gini Index</span>: CART (Classification and Regression Tree) uses the Gini method to create split points. The Gini index considers a binary split for each attribute and computes a weighted sum of the impurity of each partition. For a discrete-valued attribute, the subset that gives the minimum Gini index is selected as the splitting subset. For a continuous-valued attribute, the strategy is to consider each pair of adjacent values as a possible split point and to choose the point with the smallest Gini index.
The attribute with the <span style="font-weight:bold;">  minimum Gini index </span> is chosen as the splitting attribute.
</p>
<p>
If a binary split on attribute A partitions data D into D1 and D2, the Gini index of D is:
<img src="img/gini_formula.PNG" width="200px">
<img src="img/gini_weight.png" width="400px">

</p>
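
A minimal sketch of the weighted Gini index for a binary split, and of the adjacent-values strategy for a continuous attribute, might look like this. Using the midpoint between adjacent values as the candidate threshold is an assumption made for this example, and the function names are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def gini_split(left, right):
    """Weighted Gini index of a binary split into D1 (left) and D2 (right)."""
    total = len(left) + len(right)
    return len(left) / total * gini(left) + len(right) / total * gini(right)


def best_threshold(values, labels):
    """Try the midpoint of each pair of adjacent sorted values (assumption)."""
    pairs = sorted(zip(values, labels))
    best = None
    for (v1, _), (v2, _) in zip(pairs, pairs[1:]):
        if v1 == v2:
            continue  # identical values cannot be separated
        threshold = (v1 + v2) / 2
        left = [label for value, label in pairs if value <= threshold]
        right = [label for value, label in pairs if value > threshold]
        score = gini_split(left, right)
        if best is None or score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical toy data: one continuous attribute, two classes.
print(best_threshold([1, 2, 3, 4], ["a", "a", "b", "b"]))  # (2.5, 0.0)
```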

<p style="font-weight:bold;">Famous use cases:</p>
<p>Classification : Iris Dataset </p>
<img src="img/iris.png" width="250px">

<p>Regression : Boston Housing</p>
<img src="img/boston_decison.png" width="250px">
<h5>TUNING NOTES</h5>
 <p>A tree with many nodes can easily overfit. A utility function that tests different tree sizes (for example, different depths or leaf-node limits) can help with this.
    
 </p>

### graphviz and pydotplus are libraries for visualizing decision trees
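
For a quick look at a fitted tree without graphviz or pydotplus, a plain-text dump may be enough. The sketch below swaps in `sklearn.tree.export_text`, which assumes scikit-learn 0.21 or newer (possibly newer than the version this notebook was run with). The toy data and the names `X_toy`, `y_toy` and `toy_model` are hypothetical.

```python
from sklearn.tree import DecisionTreeRegressor, export_text

# Hypothetical toy data: one feature, four houses.
X_toy = [[1000], [1500], [3000], [3500]]
y_toy = [100000, 110000, 200000, 210000]

# A depth-1 tree has a single split, which keeps the printed rules short.
toy_model = DecisionTreeRegressor(max_depth=1, random_state=1).fit(X_toy, y_toy)

# export_text renders the learned rules as indented plain text.
rules = export_text(toy_model, feature_names=["LotArea"])
print(rules)
```

For a graphical rendering, `sklearn.tree.export_graphviz` produces DOT source that graphviz can draw.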

In [2]:
df = pd.read_csv('train.csv')

In [3]:
df.head(2)

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500


### Feature selection


In [4]:
#print(df.columns)
#print(df.describe())
feature_names = ["LotArea", "YearBuilt", "1stFlrSF", "2ndFlrSF","FullBath", "BedroomAbvGr", "TotRmsAbvGrd"]
df.loc[0:5, feature_names]

Unnamed: 0,LotArea,YearBuilt,1stFlrSF,2ndFlrSF,FullBath,BedroomAbvGr,TotRmsAbvGrd
0,8450,2003,856,854,2,3,8
1,9600,1976,1262,0,2,3,6
2,11250,2001,920,866,2,3,6
3,9550,1915,961,756,1,3,7
4,14260,2000,1145,1053,2,4,9
5,14115,1993,796,566,1,1,5


In [5]:
X = df.loc[:, feature_names]

In [6]:
Y = df.loc[:, 'SalePrice']


### Train - Validation - Test


In [7]:
from sklearn.model_selection import train_test_split

###### Method API: train_size / test_size: if both are None, test_size is set to 0.25 -- shuffle: boolean, optional (default=True) -- stratify: (default=None) if not None, data is split in a stratified fashion

In [8]:
X_train, X_val, Y_train, Y_val = train_test_split(X, Y, random_state=1)
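
The cell above produces a two-way split (train and validation). If a separate test set is also wanted, one possible approach is to call `train_test_split` twice. The sketch below uses synthetic arrays and hypothetical variable names so that it does not overwrite `X_train` or `X_val`, and the 60/20/20 proportions are an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 2 features.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100)

# First hold out 20% of the rows as the test set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1
)

# Then split the remaining 80 rows 75/25 into train and validation.
X_tr, X_va, y_tr, y_va = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=1
)

print(len(X_tr), len(X_va), len(X_test))  # 60 20 20
```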

### First Model


In [9]:
first_model = DecisionTreeRegressor(random_state=1)

# Fit the model
first_model.fit(X_train, Y_train) 

DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=1, splitter='best')

In [10]:
train_predictions = first_model.predict(X_train)
print("In sample first predictions", train_predictions[0:5])

In sample first predictions [307000. 223500. 145000. 155000. 140000.]


In [11]:
val_predictions = first_model.predict(X_val)
print("Out sample first predictions", val_predictions[0:5])

Out sample first predictions [186500. 184000. 130000.  92000. 164500.]


###  First Model Metrics


In [12]:
from sklearn.metrics import mean_absolute_error

<img src="img/mae_formula.png" width="200px">
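
To make the metric concrete, the mean absolute error can be computed by hand as the average of the absolute differences between actual and predicted values. The function name and the prices below are hypothetical; in the notebook itself the scikit-learn `mean_absolute_error` is used.

```python
def mean_absolute_error_manual(y_true, y_pred):
    """Average of |actual - predicted| over all samples."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical sale prices: errors of 10000, 10000 and 0.
actual = [200000, 150000, 100000]
predicted = [210000, 140000, 100000]

mae_demo = mean_absolute_error_manual(actual, predicted)
print(mae_demo)  # roughly 6666.67
```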

In [13]:
print("MAE in sample:", mean_absolute_error(Y_train, train_predictions))
print("MAE validation:", mean_absolute_error(Y_val, val_predictions))

MAE in sample: 61.85692541856926
MAE validation: 29652.931506849316


##### This shows that our model is overfitting the training set: the in-sample MAE (about 62) is far below the validation MAE (about 29,653)


###  Second Model 


In [14]:
#TODO

###  Tuning


In [15]:
def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    """Fit a tree limited to max_leaf_nodes leaves and return its validation MAE."""
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, preds_val)
    return mae

In [25]:
def mae_tuning_iterator(stop_signal):
    """Return the validation MAE for each max_leaf_nodes in range(2, stop_signal)."""
    mae_result = []
    # max_leaf_nodes must be at least 2, so index 0 of the result is max_leaf_nodes=2
    for i in range(2, stop_signal):
        mae_calculation = get_mae(i, X_train, X_val, Y_train, Y_val)
        mae_result.append(mae_calculation)
        print("Leaf Nodes:", i, "MAE:", mae_calculation)
    return mae_result

In [32]:
result_from_tuning = mae_tuning_iterator(1000)

Leaf Nodes: 2 MAE: 44268.47361143952
Leaf Nodes: 3 MAE: 39912.20512711714
Leaf Nodes: 4 MAE: 37786.689768600314
Leaf Nodes: 5 MAE: 35044.51299744237
Leaf Nodes: 6 MAE: 35713.94342638403
Leaf Nodes: 7 MAE: 34769.10089767185
Leaf Nodes: 8 MAE: 33155.05799844675
Leaf Nodes: 9 MAE: 31863.851616036944
Leaf Nodes: 10 MAE: 31585.432831537662
Leaf Nodes: 11 MAE: 30389.783612505194
Leaf Nodes: 12 MAE: 30235.08084999803
Leaf Nodes: 13 MAE: 29124.908937039498
Leaf Nodes: 14 MAE: 27956.436884725757
Leaf Nodes: 15 MAE: 28125.478430318668
Leaf Nodes: 16 MAE: 27418.101648804903
Leaf Nodes: 17 MAE: 27807.663665995344
Leaf Nodes: 18 MAE: 27807.663665995344
Leaf Nodes: 19 MAE: 28648.267042530915
Leaf Nodes: 20 MAE: 28707.31479747764
Leaf Nodes: 21 MAE: 28750.331097785598
Leaf Nodes: 22 MAE: 28513.702809268416
Leaf Nodes: 23 MAE: 28653.86284944501
Leaf Nodes: 24 MAE: 28856.62997273268
Leaf Nodes: 25 MAE: 29016.41319191076
Leaf Nodes: 26 MAE: 28959.76970980736
Leaf Nodes: 27 MAE: 28616.229360696358
...

Leaf Nodes: 226 MAE: 27777.023775578884
Leaf Nodes: 227 MAE: 27752.115099779792
Leaf Nodes: 228 MAE: 27784.567154574313
Leaf Nodes: 229 MAE: 27746.784810586487
Leaf Nodes: 230 MAE: 27744.424720448318
Leaf Nodes: 231 MAE: 27751.020329177485
Leaf Nodes: 232 MAE: 27751.020329177485
Leaf Nodes: 233 MAE: 27720.81784716365
Leaf Nodes: 234 MAE: 27701.490325370374
Leaf Nodes: 235 MAE: 27684.281421260785
Leaf Nodes: 236 MAE: 27774.67192634347
Leaf Nodes: 237 MAE: 27742.12889859496
Leaf Nodes: 238 MAE: 27742.12889859497
Leaf Nodes: 239 MAE: 27742.12889859497
Leaf Nodes: 240 MAE: 27725.35635812174
Leaf Nodes: 241 MAE: 27758.918001957354
Leaf Nodes: 242 MAE: 27815.098481409408
Leaf Nodes: 243 MAE: 27828.11218003955
Leaf Nodes: 244 MAE: 27884.139577299822
Leaf Nodes: 245 MAE: 27898.751449445943
Leaf Nodes: 246 MAE: 27925.80624396649
Leaf Nodes: 247 MAE: 27948.61674624959
Leaf Nodes: 248 MAE: 27948.61674624959
Leaf Nodes: 249 MAE: 27948.61674624959
Leaf Nodes: 250 MAE: 27893.822225701646
...

Leaf Nodes: 440 MAE: 29062.195629146474
Leaf Nodes: 441 MAE: 29062.195629146474
Leaf Nodes: 442 MAE: 29032.195629146474
Leaf Nodes: 443 MAE: 29032.195629146474
Leaf Nodes: 444 MAE: 29032.195629146474
Leaf Nodes: 445 MAE: 29049.30887115561
Leaf Nodes: 446 MAE: 29049.30887115561
Leaf Nodes: 447 MAE: 29037.8979122515
Leaf Nodes: 448 MAE: 29037.8979122515
Leaf Nodes: 449 MAE: 29054.79288942045
Leaf Nodes: 450 MAE: 29054.792889420452
Leaf Nodes: 451 MAE: 29088.064122297164
Leaf Nodes: 452 MAE: 29112.72165654374
Leaf Nodes: 453 MAE: 29065.87234147525
Leaf Nodes: 454 MAE: 29080.666862023194
Leaf Nodes: 455 MAE: 29052.03672503689
Leaf Nodes: 456 MAE: 29066.766405402184
Leaf Nodes: 457 MAE: 29085.076907685292
Leaf Nodes: 458 MAE: 29070.50156521954
Leaf Nodes: 459 MAE: 29070.50156521954
Leaf Nodes: 460 MAE: 29090.36457891817
Leaf Nodes: 461 MAE: 29090.36457891817
Leaf Nodes: 462 MAE: 29090.36457891817
Leaf Nodes: 463 MAE: 29089.223026406755
Leaf Nodes: 464 MAE: 29089.223026406755
...

Leaf Nodes: 666 MAE: 29964.761545988255
Leaf Nodes: 667 MAE: 29964.761545988255
Leaf Nodes: 668 MAE: 29986.679354207434
Leaf Nodes: 669 MAE: 29995.583463796473
Leaf Nodes: 670 MAE: 29995.583463796473
Leaf Nodes: 671 MAE: 29983.254696673186
Leaf Nodes: 672 MAE: 29983.254696673186
Leaf Nodes: 673 MAE: 29970.9259295499
Leaf Nodes: 674 MAE: 29983.254696673186
Leaf Nodes: 675 MAE: 30007.91223091976
Leaf Nodes: 676 MAE: 30017.419080234828
Leaf Nodes: 677 MAE: 30010.341454664056
Leaf Nodes: 678 MAE: 30010.341454664056
Leaf Nodes: 679 MAE: 30010.341454664056
Leaf Nodes: 680 MAE: 30024.268395303327
Leaf Nodes: 681 MAE: 30012.231409001957
Leaf Nodes: 682 MAE: 30007.36839530333
Leaf Nodes: 683 MAE: 30014.21771037182
Leaf Nodes: 684 MAE: 30034.7656555773
Leaf Nodes: 685 MAE: 30027.916340508807
Leaf Nodes: 686 MAE: 30027.916340508807
Leaf Nodes: 687 MAE: 30024.856979778215
Leaf Nodes: 688 MAE: 30001.56930854534
Leaf Nodes: 689 MAE: 30013.21314416178
Leaf Nodes: 690 MAE: 30013.21314416178
...

Leaf Nodes: 882 MAE: 30135.1803652968
Leaf Nodes: 883 MAE: 30135.1803652968
Leaf Nodes: 884 MAE: 30135.1803652968
Leaf Nodes: 885 MAE: 30135.1803652968
Leaf Nodes: 886 MAE: 30135.1803652968
Leaf Nodes: 887 MAE: 30135.1803652968
Leaf Nodes: 888 MAE: 30133.8105022831
Leaf Nodes: 889 MAE: 30137.9200913242
Leaf Nodes: 890 MAE: 30137.9200913242
Leaf Nodes: 891 MAE: 30129.700913242006
Leaf Nodes: 892 MAE: 30133.8105022831
Leaf Nodes: 893 MAE: 30133.8105022831
Leaf Nodes: 894 MAE: 30133.8105022831
Leaf Nodes: 895 MAE: 30133.8105022831
Leaf Nodes: 896 MAE: 30138.468036529677
Leaf Nodes: 897 MAE: 30143.125570776254
Leaf Nodes: 898 MAE: 30139.152968036527
Leaf Nodes: 899 MAE: 30139.152968036527
Leaf Nodes: 900 MAE: 30146.002283105023
Leaf Nodes: 901 MAE: 30146.002283105023
Leaf Nodes: 902 MAE: 30146.002283105023
Leaf Nodes: 903 MAE: 30146.002283105023
Leaf Nodes: 904 MAE: 30141.61872146119
Leaf Nodes: 905 MAE: 30141.61872146119
Leaf Nodes: 906 MAE: 30137.851598173518
Leaf Nodes: 907 MAE: 30143.1

In [33]:
min(result_from_tuning)

26704.033546536175

In [34]:
result_from_tuning.index(min(result_from_tuning))

69

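###### So, for this metric (MAE), the best result is at index 69 of the list. Because the loop starts at 2, that corresponds to max_leaf_nodes = 71. Note that this is a limit on the number of leaf nodes, not a tree depth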
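
Since the tuning loop starts at `max_leaf_nodes=2`, a list index has to be shifted by 2 to recover the parameter value. One way to avoid that bookkeeping is to pair each MAE with the candidate that produced it. The sketch below uses a short list of made-up MAE values, and the names `candidates` and `maes` are hypothetical.

```python
# Candidate max_leaf_nodes values, starting at 2 as in the tuning loop.
candidates = list(range(2, 8))

# Made-up validation MAE values, one per candidate (illustration only).
maes = [44.0, 40.0, 31.0, 35.0, 31.0, 36.0]

# min over (mae, candidate) pairs returns the lowest MAE;
# ties are broken in favour of the smaller tree.
best_mae, best_leaf_nodes = min(zip(maes, candidates))

print(best_leaf_nodes, best_mae)  # 4 31.0

# The same answer via the list index, shifted by the loop's start value.
assert maes.index(min(maes)) + 2 == best_leaf_nodes
```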
