In [1]:
%load_ext autoreload
%autoreload 2
import pandas as pd
from sklearn.tree import DecisionTreeRegressor


## Decision Tree Model


<p>A decision tree is a non-parametric (distribution-free) supervised learning method used for classification and regression.</p>
<p>A decision tree is a <span style="color:red;"> white box </span>type of ML algorithm. It <span style="color:red;">exposes its internal decision-making logic</span>, which black box algorithms such as neural networks do not. It typically trains faster than a neural network. The time complexity of decision trees is a function of the number of records and the number of attributes in the given data.</p>
<p>Decision trees can often handle high-dimensional data with good accuracy.</p>
<p>Decision trees require little data preparation. Other techniques often require data normalisation, the creation of dummy variables, and the removal of blank values. Note, however, that the scikit-learn implementation used in this notebook does not support missing values.</p>

<img src="img/dt_process.png" width="400px">

#### Attribute Selection Measures (ASM)
<p><span style="color:#007bb5;">Information Gain</span>: computes the difference between the entropy (impurity/randomness) of the dataset before a split and the weighted average entropy after splitting on a given attribute's values. The attribute with the<span style="font-weight:bold;"> highest information gain</span> is chosen as the splitting attribute at the node.</p>

<p><span style="color:#007bb5;">Gain Ratio</span>: Gain ratio corrects the bias of information gain toward attributes with many distinct values (e.g. a surrogate primary key, which produces useless one-row partitions) by normalizing the information gain. The attribute with the <span style="font-weight:bold;"> highest gain ratio </span> is chosen as the splitting attribute.</p>
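The two measures above can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation of any particular library. The function names and the toy `play`, `row_id` and `windy` data are hypothetical, and the gain ratio is assumed to normalize by the entropy of the split itself (the "split information").

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def partition(labels, attribute_values):
    """Group the labels by the value of the splitting attribute."""
    groups = {}
    for label, value in zip(labels, attribute_values):
        groups.setdefault(value, []).append(label)
    return list(groups.values())


def information_gain(labels, attribute_values):
    """Entropy before the split minus the weighted average entropy after it."""
    total = len(labels)
    groups = partition(labels, attribute_values)
    after = sum(len(g) / total * entropy(g) for g in groups)
    return entropy(labels) - after


def gain_ratio(labels, attribute_values):
    """Information gain divided by the entropy of the split itself."""
    split_info = entropy(attribute_values)
    if split_info == 0:
        return 0.0
    return information_gain(labels, attribute_values) / split_info


# Hypothetical toy data: 'row_id' acts like a surrogate key, 'windy' is a real feature.
play = ["yes", "yes", "no", "no", "yes", "no"]
row_id = [1, 2, 3, 4, 5, 6]
windy = [False, False, True, True, False, True]

print("row_id:", information_gain(play, row_id), gain_ratio(play, row_id))
print("windy: ", information_gain(play, windy), gain_ratio(play, windy))
```

In this toy data both attributes separate the classes perfectly, so their information gain is identical, but the gain ratio penalizes `row_id` for splitting the data into six one-row partitions.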

<p><span style="color:#007bb5;">Gini Index</span>: CART (Classification and Regression Tree) uses the Gini method to create split points. The Gini index considers a binary split for each attribute and is computed as the weighted sum of the impurity of each partition. For a discrete-valued attribute, the subset that gives the minimum Gini index for that attribute is selected as the splitting subset. For a continuous-valued attribute, each pair of adjacent values is considered as a possible split point, and the point with the smallest Gini index is chosen as the splitting point.
The attribute with <span style="font-weight:bold;">  minimum Gini index </span> is chosen as the splitting attribute.
</p>
<p>
If a binary split on attribute A partitions data D into D1 and D2, the Gini index of D is:
<img src="img/gini_formula.png" width="200px">
<img src="img/gini_weight.png" width="400px">

</p>
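As a rough sketch of the Gini computation described above, the following plain-Python example assumes the standard definitions: impurity is one minus the sum of squared class proportions, and a binary split is scored by the size-weighted impurity of its two partitions. The function names and the toy `ages` and `buys` data are hypothetical, and using the midpoint between adjacent values as the candidate threshold is an assumption, not something the text specifies.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def gini_split(left, right):
    """Size-weighted Gini index of a binary split into `left` and `right`."""
    total = len(left) + len(right)
    return len(left) / total * gini(left) + len(right) / total * gini(right)


def best_threshold(values, labels):
    """Score the midpoint of each pair of adjacent distinct values; keep the lowest."""
    pairs = sorted(zip(values, labels))
    distinct = sorted(set(values))
    best = None
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2  # assumed convention: midpoint of adjacent values
        left = [y for x, y in pairs if x <= threshold]
        right = [y for x, y in pairs if x > threshold]
        score = gini_split(left, right)
        if best is None or score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical toy data: one continuous attribute and a binary class label.
ages = [22, 25, 30, 35, 40, 52]
buys = ["no", "no", "no", "yes", "yes", "yes"]

print(gini(buys))                  # impurity before splitting
print(best_threshold(ages, buys))  # (threshold, weighted Gini after the split)
```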
<h5>TUNING NOTES</h5>
<p>Well-known use cases:</p>
<p>Classification: Iris dataset</p>
<p>Regression: Boston Housing</p>
 <p>*Too many nodes can easily lead to overfitting.
 </p>

### graphviz and pydotplus are libraries for visualizing decision trees
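A minimal sketch of how a fitted tree could be exported for visualization, assuming scikit-learn's `export_graphviz` is available. The tiny dataset and the feature names `feature_a` and `feature_b` are hypothetical. The rendering step is left commented out because it assumes that pydotplus and the Graphviz binaries are installed.

```python
from sklearn.tree import DecisionTreeRegressor, export_graphviz

# Tiny hypothetical dataset: two features and one numeric target.
X_demo = [[1, 10], [2, 20], [3, 30], [4, 40]]
y_demo = [100.0, 150.0, 200.0, 250.0]

demo_tree = DecisionTreeRegressor(max_depth=2, random_state=1).fit(X_demo, y_demo)

# With out_file=None, export_graphviz returns the DOT source as a string.
dot_source = export_graphviz(
    demo_tree,
    out_file=None,
    feature_names=["feature_a", "feature_b"],
    filled=True,
    rounded=True,
)
print(dot_source[:80])

# Optional rendering step (assumes pydotplus and Graphviz are installed):
# import pydotplus
# pydotplus.graph_from_dot_data(dot_source).write_png("tree.png")
```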

In [2]:
df = pd.read_csv('train.csv')

In [3]:
df.head(2)

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500


### Feature selection


In [4]:
#print(df.columns)
#print(df.describe())
feature_names = ["LotArea", "YearBuilt", "1stFlrSF", "2ndFlrSF","FullBath", "BedroomAbvGr", "TotRmsAbvGrd"]
df.loc[0:5, feature_names]

Unnamed: 0,LotArea,YearBuilt,1stFlrSF,2ndFlrSF,FullBath,BedroomAbvGr,TotRmsAbvGrd
0,8450,2003,856,854,2,3,8
1,9600,1976,1262,0,2,3,6
2,11250,2001,920,866,2,3,6
3,9550,1915,961,756,1,3,7
4,14260,2000,1145,1053,2,4,9
5,14115,1993,796,566,1,1,5


In [5]:
X = df.loc[:, feature_names]

In [6]:
Y = df.loc[:, 'SalePrice']


### Train - Validation - Test


In [7]:
from sklearn.model_selection import train_test_split

###### Method API: test_size: if None and train_size is also None, the test set is 25% of the data -- shuffle: boolean, optional (default=True) -- stratify: array-like or None (default=None); if not None, data is split in a stratified fashion

In [8]:
X_train, X_val, Y_train, Y_val = train_test_split(X, Y, random_state=1)
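To make the defaults concrete, here is a small sketch on hypothetical stand-in data (100 rows, 3 features) rather than the housing data. It shows the default 75/25 split and that fixing `random_state` makes the shuffle reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 100 rows, 3 features.
X_demo = np.arange(300).reshape(100, 3)
y_demo = np.arange(100)

# Defaults: test_size=0.25 (when train_size is also None), shuffle=True, stratify=None.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=1)
print(X_tr.shape, X_te.shape)

# The same seed reproduces the same split.
_, X_te_again, _, _ = train_test_split(X_demo, y_demo, random_state=1)
print(np.array_equal(X_te, X_te_again))
```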

### First Model


In [9]:
first_model = DecisionTreeRegressor(random_state=1)

# Fit the model
first_model.fit(X_train, Y_train) 

DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=1, splitter='best')

In [10]:
train_predictions = first_model.predict(X_train)
print("In sample first predictions", train_predictions[0:5])

In sample first predictions [307000. 223500. 145000. 155000. 140000.]


In [11]:
val_predictions = first_model.predict(X_val)
print("Out sample first predictions", val_predictions[0:5])

Out sample first predictions [186500. 184000. 130000.  92000. 164500.]


###  First Model Metrics


In [12]:
from sklearn.metrics import mean_absolute_error

<img src="img/mae_formula.png" width="200px">
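For intuition, the mean absolute error can be computed by hand as the average of the absolute differences between actual and predicted values. The sketch below uses a hypothetical helper name and made-up prices. It is meant to illustrate the formula and is not a replacement for `sklearn.metrics.mean_absolute_error`.

```python
def mean_absolute_error_manual(y_true, y_pred):
    """Average of |actual - predicted| over all samples."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up sale prices, for illustration only.
actual = [200000, 150000, 310000]
predicted = [210000, 150000, 300000]

manual_mae = mean_absolute_error_manual(actual, predicted)
print(manual_mae)
```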

In [13]:
print("MAE in sample:", mean_absolute_error(Y_train, train_predictions))
print("MAE validation:", mean_absolute_error(Y_val, val_predictions))

MAE in sample: 61.85692541856926
MAE validation: 29652.931506849316


##### This shows that our model is overfitting the training set: the in-sample MAE (about 62) is far below the validation MAE (about 29,653)


###  Second Model 


In [14]:
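One possible direction for a second model, given the tuning note that too many nodes can lead to overfitting, is to cap the tree size with `max_leaf_nodes` and compare training and validation MAE. The sketch below is only an illustration under stated assumptions: it uses synthetic data instead of `train.csv`, the helper `train_and_validation_mae` is hypothetical, and the candidate leaf counts are arbitrary.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the housing data (hypothetical): a noisy linear target.
rng = np.random.RandomState(1)
X_demo = rng.uniform(0, 100, size=(400, 3))
y_demo = 1000 * X_demo[:, 0] + 500 * X_demo[:, 1] + rng.normal(0, 5000, size=400)

X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, random_state=1)


def train_and_validation_mae(max_leaf_nodes):
    """Fit a tree with the given leaf limit and return (train MAE, validation MAE)."""
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=1)
    model.fit(X_tr, y_tr)
    return (
        mean_absolute_error(y_tr, model.predict(X_tr)),
        mean_absolute_error(y_va, model.predict(X_va)),
    )


# None means no limit, which matches the defaults of the first model.
for leaves in (5, 25, 50, None):
    train_mae, val_mae = train_and_validation_mae(leaves)
    print(f"max_leaf_nodes={leaves}: train MAE={train_mae:.0f}, validation MAE={val_mae:.0f}")
```

A large gap between the two numbers for a given setting suggests overfitting, and the setting with the lowest validation MAE is a reasonable candidate to keep.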
