# Boston Housing Dataset - Decision Tree (Regression)
Source
***

# Data
***

## Features

1. **CRIM**:
   - **Description**: Per capita crime rate by town.
   - **Type**: Continuous numerical value.

2. **ZN**:
   - **Description**: Proportion of residential land zoned for lots over 25,000 sq. ft.
   - **Type**: Continuous numerical value.
   
3. **INDUS**:
   - **Description**: Proportion of non-retail business acres per town.
   - **Type**: Continuous numerical value.
   
4. **CHAS**:
   - **Description**: Charles River dummy variable (1 if tract bounds river, 0 otherwise).
   - **Type**: Categorical (binary).

5. **NOX**:
   - **Description**: Nitric oxide concentration (parts per 10 million).
   - **Type**: Continuous numerical value.

6. **RM**:
   - **Description**: Average number of rooms per dwelling.
   - **Type**: Continuous numerical value.

7. **AGE**:
   - **Description**: Proportion of owner-occupied units built prior to 1940.
   - **Type**: Continuous numerical value.
   
8. **DIS**:
   - **Description**: Weighted distances to five Boston employment centers.
   - **Type**: Continuous numerical value.

9. **RAD**:
   - **Description**: Index of accessibility to radial highways.
   - **Type**: Discrete numerical value (integer).

10. **TAX**:
    - **Description**: Full-value property tax rate per $10,000.
    - **Type**: Continuous numerical value.

11. **PTRATIO**:
    - **Description**: Pupil-teacher ratio by town.
    - **Type**: Continuous numerical value.

12. **LSTAT**:
    - **Description**: Percentage of the population classified as lower status.
    - **Type**: Continuous numerical value.

## Target (Label)

- **MEDV**:
   - **Description**: Median value of owner-occupied homes in $1000s.
   - **Type**: Continuous numerical value.

# Evaluation Criteria
***

The model is evaluated with mean squared error (`MSE`), root mean squared error (`RMSE`), mean absolute error (`MAE`), and mean absolute percentage error (`MAPE`).
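
As a quick reference, the four metrics can be computed by hand with NumPy. This is a minimal sketch: `y_true` and `y_pred` are made-up values chosen so the arithmetic is easy to follow, not numbers from the dataset.

```python
import numpy as np

# Hypothetical targets and predictions, chosen only for illustration.
y_true = np.array([20.0, 25.0, 40.0, 10.0])
y_pred = np.array([22.0, 24.0, 36.0, 11.0])

errors = y_true - y_pred                          # [-2, 1, 4, -1]

mse = np.mean(errors ** 2)                        # (4 + 1 + 16 + 1) / 4 = 5.5
rmse = np.sqrt(mse)                               # square root of the MSE
mae = np.mean(np.abs(errors))                     # (2 + 1 + 4 + 1) / 4 = 2.0
mape = np.mean(np.abs(errors) / np.abs(y_true))   # (0.1 + 0.04 + 0.1 + 0.1) / 4 = 0.085

print(f"MSE={mse:.3f} RMSE={rmse:.3f} MAE={mae:.3f} MAPE={mape:.3f}")
```

Note that scikit-learn's `mean_absolute_percentage_error` also returns a fraction (0.085 here), not a value on the 0-100 scale.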

# Importing necessary libraries

In [1]:
# EDA
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# Model
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Evaluation
from sklearn.model_selection import cross_val_score, RandomizedSearchCV, GridSearchCV
from sklearn.metrics import make_scorer, mean_squared_error, root_mean_squared_error, mean_absolute_error, mean_absolute_percentage_error

# Loading the data

In [2]:
df = pd.read_csv('./data/boston_house_prices.csv')

**Note:**
* There is no missing data.
* All data is numeric.
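
Both statements can be verified programmatically. The sketch below uses a small hypothetical frame named `toy` in place of `df` so that it runs on its own; the same two expressions could be applied to `df` directly.

```python
import pandas as pd

# Hypothetical stand-in for the notebook's `df`; the values are illustrative only.
toy = pd.DataFrame({
    "RM": [6.1, 5.9, 7.3],
    "CHAS": [0, 0, 1],
    "MEDV": [22.0, 19.5, 35.0],
})

# Total number of missing cells across all columns.
n_missing = int(toy.isna().sum().sum())

# True only if every column has a numeric dtype.
all_numeric = all(pd.api.types.is_numeric_dtype(t) for t in toy.dtypes)

print(n_missing, all_numeric)
```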

# Splitting the data

In [3]:
X, y = df.drop('MEDV', axis=1), df['MEDV']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
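
Because neither `test_size` nor `train_size` is passed, `train_test_split` falls back to its default of holding out 25% of the rows for testing. The sketch below illustrates this on synthetic data; the array names and the row count of 100 are assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 3 features (not the Boston data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = rng.normal(size=100)

# No test_size/train_size given, so 25% of the rows go to the test set.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

print(X_tr.shape, X_te.shape)
```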

# Model

In [4]:
reg = DecisionTreeRegressor(random_state=42)
reg.fit(X_train, y_train);
print(f"Test Score = {reg.score(X_test, y_test)}, Training Score = {reg.score(X_train, y_train)}")

Test Score = 0.8310149080832583, Training Score = 1.0


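**Observations:**
* `score` reports R². The training score is a perfect 1.0 while the test score is 0.83, so the unconstrained tree is overfitting the training data.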
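
The effect is easy to reproduce on synthetic data. In this sketch (the noisy sine data and the depth of 3 are assumptions for illustration), an unconstrained tree memorizes the training set, while a depth-limited tree cannot.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic noisy data (an assumption for illustration, not the Boston data).
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(300, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.3, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

results = {}
for depth in (None, 3):
    tree = DecisionTreeRegressor(max_depth=depth, random_state=42).fit(X_tr, y_tr)
    # .score() returns the coefficient of determination (R^2).
    results[depth] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))
    print(f"max_depth={depth}: train R^2={results[depth][0]:.3f}, "
          f"test R^2={results[depth][1]:.3f}")
```

A large gap between the training and test R² for `max_depth=None` is the same symptom seen in the cell above.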

# Hyperparameter Tuning

## RandomizedSearchCV

In [5]:
params_dist = {
    'criterion': ['squared_error', 'friedman_mse', 'absolute_error', 'poisson'],
    'splitter': ['best', 'random'],
    'max_depth': [None, 10, 20, 30, 40, 50],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2'],
    'max_leaf_nodes': [None, 10, 20, 30],
    'min_impurity_decrease': [0.0, 0.01, 0.1],
}
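
To put the search budget in context, the number of distinct combinations in `params_dist` can be counted with `ParameterGrid`. The sketch repeats the dictionary so that it runs on its own.

```python
from sklearn.model_selection import ParameterGrid

# Same search space as in the cell above, repeated to keep the example self-contained.
params_dist = {
    'criterion': ['squared_error', 'friedman_mse', 'absolute_error', 'poisson'],
    'splitter': ['best', 'random'],
    'max_depth': [None, 10, 20, 30, 40, 50],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2'],
    'max_leaf_nodes': [None, 10, 20, 30],
    'min_impurity_decrease': [0.0, 0.01, 0.1],
}

# 4 * 2 * 6 * 3 * 3 * 2 * 4 * 3 distinct parameter combinations.
n_combinations = len(ParameterGrid(params_dist))
print(n_combinations)
```

With `n_iter=4000`, the randomized search therefore samples fewer than half of the possible combinations, and with the default 5-fold cross-validation it fits 20,000 trees before the final refit.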

In [6]:
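reg_rs = RandomizedSearchCV(
    DecisionTreeRegressor(random_state=42),
    params_dist,
    n_iter=4000,      # number of sampled parameter combinations
    n_jobs=-1,        # use all available CPU cores
    refit=True,       # refit the best configuration on the full training set
    cv=None,          # None selects the default 5-fold cross-validation
    random_state=42
)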

In [7]:
%%time
reg_rs.fit(X_train, y_train);

CPU times: user 4.55 s, sys: 337 ms, total: 4.88 s
Wall time: 13.6 s


In [8]:
reg_rs.best_params_

{'splitter': 'best',
 'min_samples_split': 10,
 'min_samples_leaf': 2,
 'min_impurity_decrease': 0.01,
 'max_leaf_nodes': None,
 'max_features': 'sqrt',
 'max_depth': 30,
 'criterion': 'absolute_error'}

## GridSearchCV

In [9]:
param_grid = {
    'criterion': ['squared_error', 'friedman_mse', 'absolute_error', 'poisson'],
    'splitter': ['best', 'random'],
    'max_depth': [None, 30],
    'min_samples_split': [10],
    'min_samples_leaf': [2],
    'max_features': ['sqrt', 'log2'],
    'max_leaf_nodes': [None, 30],
    'min_impurity_decrease': [0.0, 0.01, 0.1],
}

In [10]:
reg_gs = GridSearchCV(
    DecisionTreeRegressor(random_state=42), 
    param_grid,  
    n_jobs=-1, 
    refit=True, 
    cv=None
)

In [11]:
%%time
reg_gs.fit(X_train, y_train);

CPU times: user 496 ms, sys: 53.3 ms, total: 550 ms
Wall time: 773 ms


In [12]:
reg_gs.best_params_

{'criterion': 'absolute_error',
 'max_depth': None,
 'max_features': 'sqrt',
 'max_leaf_nodes': None,
 'min_impurity_decrease': 0.01,
 'min_samples_leaf': 2,
 'min_samples_split': 10,
 'splitter': 'best'}
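
Besides `best_params_`, a fitted search object exposes `best_score_` and `cv_results_`, which show how close the competing configurations were. The following is a minimal sketch on synthetic data; `demo_grid` and the data are hypothetical and deliberately small so the example runs quickly.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (illustrative assumption, not the Boston data).
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=42
)

# A deliberately small, hypothetical grid: 3 * 2 = 6 combinations.
demo_grid = {"max_depth": [2, 4, None], "min_samples_leaf": [1, 5]}

search = GridSearchCV(DecisionTreeRegressor(random_state=42), demo_grid, cv=5)
search.fit(X_demo, y_demo)

# cv_results_ holds one row per parameter combination.
results = pd.DataFrame(search.cv_results_)[
    ["params", "mean_test_score", "std_test_score", "rank_test_score"]
].sort_values("rank_test_score")

print(search.best_params_, round(search.best_score_, 3))
print(results.head())
```

Comparing `mean_test_score` with `std_test_score` across the top rows indicates whether the winning configuration is clearly better or only marginally ahead.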

## Best Estimator

In [13]:
reg = DecisionTreeRegressor(
    criterion='absolute_error',
    max_depth=None,
    max_features='sqrt',
    max_leaf_nodes=None,
    min_impurity_decrease=0.01,
    min_samples_leaf=2,
    min_samples_split=10,
    splitter='best',
    random_state=42
)
reg.fit(X_train, y_train);
print(f"Test Score = {reg.score(X_test, y_test)}, Training Score = {reg.score(X_train, y_train)}")

Test Score = 0.6145943292938496, Training Score = 0.8814590776975002


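**Observations:**
* Tuning removed the perfect fit to the training data (training R² fell from 1.0 to 0.88), but the test R² also dropped from 0.83 to 0.61, so the tuned tree does not generalize better on this split.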

# Evaluation
***

In [14]:
%%time
np.random.seed(42)

# MSE score
reg_mse = cross_val_score(
    reg,
    X,
    y,
    scoring=make_scorer(mean_squared_error),
    cv=3,
    n_jobs=-1
)

# RMSE score
reg_rmse = cross_val_score(
    reg,
    X,
    y,
    scoring=make_scorer(root_mean_squared_error),
    cv=3,
    n_jobs=-1
)

# MAE score
reg_mae = cross_val_score(
    reg,
    X,
    y,
    scoring=make_scorer(mean_absolute_error),
    cv=3,
    n_jobs=-1
)

# MAPE score
reg_mape = cross_val_score(
    reg,
    X,
    y,
    scoring=make_scorer(mean_absolute_percentage_error),
    cv=3,
    n_jobs=-1
)

CPU times: user 30.9 ms, sys: 517 μs, total: 31.4 ms
Wall time: 105 ms
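
The four separate `cross_val_score` calls refit the model for every metric. As an alternative sketch, `cross_validate` accepts several scorers at once and fits each fold only once. The data here is synthetic and the names `X_demo`, `y_demo`, and `demo_scores` are assumptions for the example.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (illustrative assumption, not the Boston data).
X_demo, y_demo = make_regression(
    n_samples=150, n_features=4, noise=5.0, random_state=42
)

# Map short column names to scikit-learn's built-in scorer strings.
scoring = {
    "MSE": "neg_mean_squared_error",
    "RMSE": "neg_root_mean_squared_error",
    "MAE": "neg_mean_absolute_error",
    "MAPE": "neg_mean_absolute_percentage_error",
}

cv_out = cross_validate(
    DecisionTreeRegressor(random_state=42), X_demo, y_demo, scoring=scoring, cv=3
)

# Built-in error scorers are negated so that higher is better; flip the sign back.
demo_scores = pd.DataFrame({name: -cv_out[f"test_{name}"] for name in scoring})
print(demo_scores)
```

Because all metrics come from the same fitted model in each fold, the RMSE column is exactly the square root of the MSE column.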


In [18]:
scores = {
    "MSE": reg_mse,
    "RMSE": reg_rmse,
    "MAE": reg_mae,
    "MAPE": reg_mape
}
scores = pd.DataFrame(scores)
scores

| Fold | MSE | RMSE | MAE | MAPE |
|---|---|---|---|---|
| 0 | 23.918314 | 4.890635 | 3.759763 | 0.197041 |
| 1 | 111.056243 | 10.538323 | 7.080473 | 0.250893 |
| 2 | 41.965699 | 6.478094 | 4.581845 | 0.318049 |
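
The per-fold scores vary noticeably, so it helps to look at their mean and spread. The sketch below copies the values from the table above into a frame named `fold_scores` (a name introduced for this example) and aggregates them.

```python
import pandas as pd

# Per-fold scores copied from the table above.
fold_scores = pd.DataFrame({
    "MSE": [23.918314, 111.056243, 41.965699],
    "RMSE": [4.890635, 10.538323, 6.478094],
    "MAE": [3.759763, 7.080473, 4.581845],
    "MAPE": [0.197041, 0.250893, 0.318049],
})

# One row per metric, with the mean and sample standard deviation across folds.
summary = fold_scores.agg(["mean", "std"]).T
print(summary.round(3))
```

In the notebook itself, the same call can be applied to the `scores` frame directly.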


# Summary
***

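The tuned Decision Tree Regressor performs moderately on the Boston House Prices dataset: its test R² is 0.61, and 3-fold cross-validation yields RMSE values between 4.89 and 10.54 (in $1000s) and MAPE values between 19.7% and 31.8%, so performance varies considerably across folds.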

**End**