### Decision Tree Regression

#### Predict House Price Based on Boston Housing Dataset

This dataset contains information collected by the U.S. Census Service concerning housing in the area of Boston, Massachusetts. It was obtained from the StatLib archive (http://lib.stat.cmu.edu/datasets/boston) and has been used extensively throughout the literature to benchmark algorithms.

Each record in the dataset has 14 attributes:

- CRIM - per capita crime rate by town
- ZN - proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS - proportion of non-retail business acres per town.
- CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise)
- NOX - nitric oxides concentration (parts per 10 million)
- RM - average number of rooms per dwelling
- AGE - proportion of owner-occupied units built prior to 1940
- DIS - weighted distances to five Boston employment centres
- RAD - index of accessibility to radial highways
- TAX - full-value property-tax rate per 10,000 dollars.
- PTRATIO - pupil-teacher ratio by town
- BLACK - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT - % lower status of the population
- MEDV - Median value of owner-occupied homes in 1000 dollars

### Importing Libraries

In [None]:
# Import useful libraries for data management and visualization

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline


In [None]:
# load Boston Dataset
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
data = pd.read_csv('Boston.csv', index_col=0)

# display the first 10 records
data.head(10)

In [None]:
data.info()

**Now let's fit a decision tree regression model with MEDV as the target variable and the other attributes as the predictors:**

In [None]:
# use the first 13 attributes as independent variables
features = list(data.columns[0:13])

features

In [None]:
# use the names of attributes to split them into independent variables X and target variable y

X = data[features]
y = data['medv']
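
Selecting predictors by position depends on the column order in the CSV file. A minimal sketch of the same split done by name, using a small made-up DataFrame (the `toy` frame and its values are illustrative assumptions, not rows from `Boston.csv`):

```python
import pandas as pd

# Hypothetical three-column frame standing in for the housing data
toy = pd.DataFrame({
    'crim': [0.1, 0.2, 0.3],
    'rm': [6.5, 6.1, 7.0],
    'medv': [24.0, 21.6, 34.7],
})

target = 'medv'
X_toy = toy.drop(columns=[target])   # every column except the target
y_toy = toy[target]                  # the target as a Series

print(list(X_toy.columns))
```

Dropping the target by name keeps the split correct even if the columns are reordered or a new predictor is added.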

In [None]:
# Import Decision tree Regression Model from sklearn
from sklearn.tree import DecisionTreeRegressor

# Define model to be Decision Tree regression
# https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html
dtr = DecisionTreeRegressor()


#### Use cross-validation to evaluate the model

In [None]:
# import cross validation 
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict

In [None]:
# 10-fold cross-validation; each score is the negated MSE of one fold
score_cv = cross_val_score(dtr, X, y, scoring='neg_mean_squared_error', cv=10)

In [None]:
score_cv

In [None]:
-score_cv.mean()
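
scikit-learn reports `neg_mean_squared_error` as a negative number so that larger is always better, which is why the sign is flipped above. The sketch below shows the same conversion and also derives a per-fold RMSE, which is in the units of the target. It runs on synthetic data from `make_regression` as a stand-in (an assumption, so the numbers do not describe the housing data); the names `X_demo`, `y_demo`, and `tree` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the housing data (assumption: 13 numeric features)
X_demo, y_demo = make_regression(n_samples=200, n_features=13,
                                 noise=10.0, random_state=0)

tree = DecisionTreeRegressor(random_state=0)
neg_mse = cross_val_score(tree, X_demo, y_demo,
                          scoring='neg_mean_squared_error', cv=10)

mse_per_fold = -neg_mse                   # flip the sign to get ordinary MSE
mean_mse = mse_per_fold.mean()            # average MSE across the 10 folds
mean_rmse = np.sqrt(mse_per_fold).mean()  # average per-fold RMSE

print(mean_mse, mean_rmse)
```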

In [None]:
# out-of-fold predictions: each row is predicted by a model that did not train on it
pred_y = cross_val_predict(dtr, X, y, cv=10)

In [None]:
pred_y

In [None]:
df = pd.DataFrame({'Actual': y, 'Predicted': pred_y})
df
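
A table of actual and predicted values is easier to read with a residual column and a summary metric or two. The following is a rough sketch on synthetic data (an assumption standing in for `Boston.csv`); `df_demo`, `pooled_mse`, and `pooled_r2` are illustrative names.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the housing data (assumption)
X_demo, y_demo = make_regression(n_samples=200, n_features=13,
                                 noise=10.0, random_state=0)

# Out-of-fold predictions from 10-fold cross-validation
pred_demo = cross_val_predict(DecisionTreeRegressor(random_state=0),
                              X_demo, y_demo, cv=10)

df_demo = pd.DataFrame({'Actual': y_demo, 'Predicted': pred_demo})
df_demo['Residual'] = df_demo['Actual'] - df_demo['Predicted']

# Metrics computed once over all out-of-fold predictions pooled together
pooled_mse = mean_squared_error(df_demo['Actual'], df_demo['Predicted'])
pooled_r2 = r2_score(df_demo['Actual'], df_demo['Predicted'])

print(df_demo.head())
print(pooled_mse, pooled_r2)
```

The pooled MSE can differ slightly from the mean of per-fold MSEs when the folds are not all the same size.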

#### Fit the model

In [None]:
# train the model on all of the available data
dtr.fit(X, y)

In [None]:
# show the depth of the fitted tree
dtr.get_depth()

In [None]:
dtr.get_n_leaves()
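
With default settings a regression tree keeps splitting until its leaves are pure, so the depth and leaf count above describe a tree that can nearly memorize its training data. The sketch below illustrates this on synthetic data (an assumption, not the housing data) by comparing the training error with a cross-validated error; all names are illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the housing data (assumption)
X_demo, y_demo = make_regression(n_samples=200, n_features=13,
                                 noise=10.0, random_state=0)

full_tree = DecisionTreeRegressor(random_state=0)  # no depth or leaf limit
full_tree.fit(X_demo, y_demo)

# Error on the same rows the tree was trained on
train_mse = mean_squared_error(y_demo, full_tree.predict(X_demo))

# Error estimated on held-out folds
cv_mse = -cross_val_score(DecisionTreeRegressor(random_state=0),
                          X_demo, y_demo,
                          scoring='neg_mean_squared_error', cv=10).mean()

print(full_tree.get_depth(), full_tree.get_n_leaves())
print(train_mse, cv_mse)
```

A large gap between the two errors is the usual motivation for limiting `max_depth` or `max_leaf_nodes`.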

In [None]:
# https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
from sklearn.model_selection import GridSearchCV

In [None]:
# make an array of depths to choose from: 2 to 18 (np.arange excludes the stop value 19)
depths = np.arange(2, 19)
depths

In [None]:
# define the values to try for max_leaf_nodes
num_leafs = [5, 10, 20, 50, 250]
num_leafs

In [None]:
try_grid = [{'max_depth':depths,
              'max_leaf_nodes':num_leafs}]

In [None]:
# define the model using GridSearchCV
# with no scoring argument, candidates are ranked by the estimator's default score (R^2)
DTM = GridSearchCV(DecisionTreeRegressor(), param_grid=try_grid, cv=10)

In [None]:
DTM.fit(X,y)

In [None]:
# find the best parameters
DTM.best_params_
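
The grid search above ranks candidates by the default score (R^2), while the evaluation cells use mean squared error. One possible way to keep the two consistent is to pass `scoring` explicitly and reuse the refitted `best_estimator_` instead of copying parameter values by hand. This is a sketch on synthetic data with a deliberately small grid (both are assumptions made to keep it quick); `search` and the grid values are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the housing data (assumption)
X_demo, y_demo = make_regression(n_samples=200, n_features=13,
                                 noise=10.0, random_state=0)

# Small illustrative grid; the notebook's grid is larger
param_grid = {'max_depth': np.arange(2, 7),
              'max_leaf_nodes': [5, 10, 20]}

search = GridSearchCV(DecisionTreeRegressor(random_state=0),
                      param_grid=param_grid,
                      scoring='neg_mean_squared_error',  # rank by MSE
                      cv=5)
search.fit(X_demo, y_demo)

best_params = search.best_params_    # dict with the winning combination
best_mse = -search.best_score_       # mean cross-validated MSE of the winner
best_tree = search.best_estimator_   # already refit on all of X_demo, y_demo

print(best_params, best_mse)
```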

In [None]:
# hard-coded hyperparameters; check them against DTM.best_params_, which can differ between runs
best_DTR_model = DecisionTreeRegressor(max_leaf_nodes=250, max_depth=6)

In [None]:
# 10-fold cross-validation of the limited tree; each score is the negated MSE of one fold
score_cv = cross_val_score(best_DTR_model, X, y, scoring='neg_mean_squared_error', cv=10)

In [None]:
score_cv

In [None]:
-score_cv.mean()
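
Because `DecisionTreeRegressor()` is created without a `random_state`, the scores above can change slightly from run to run, which makes a before-and-after comparison harder to trust. A minimal sketch of a repeatable comparison follows, assuming synthetic data and a shared shuffled `KFold` splitter; the helper `mean_cv_mse` is hypothetical and not part of the notebook.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the housing data (assumption)
X_demo, y_demo = make_regression(n_samples=200, n_features=13,
                                 noise=10.0, random_state=0)

# One shared splitter so both models are scored on identical folds
folds = KFold(n_splits=10, shuffle=True, random_state=0)

def mean_cv_mse(model):
    """Return the mean cross-validated MSE of `model` on the demo data."""
    scores = cross_val_score(model, X_demo, y_demo,
                             scoring='neg_mean_squared_error', cv=folds)
    return -scores.mean()

default_mse = mean_cv_mse(DecisionTreeRegressor(random_state=0))
limited_mse = mean_cv_mse(DecisionTreeRegressor(max_depth=6,
                                                max_leaf_nodes=250,
                                                random_state=0))

print(default_mse, limited_mse)
```

Fixing both seeds means any difference between the two numbers comes from the hyperparameters rather than from random tie-breaking or a different split.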