# My 1st Model - Decision Tree

In [1]:
from sklearn.tree import DecisionTreeRegressor
import pandas as pd

In [2]:
# Create a DataFrame from the CSV file.
prize_set = pd.read_csv('melb_data.csv')

In [3]:
# Show all the columns in the DataFrame
prize_set.columns

Index(['Suburb', 'Address', 'Rooms', 'Type', 'Price', 'Method', 'SellerG',
       'Date', 'Distance', 'Postcode', 'Bedroom2', 'Bathroom', 'Car',
       'Landsize', 'BuildingArea', 'YearBuilt', 'CouncilArea', 'Lattitude',
       'Longtitude', 'Regionname', 'Propertycount'],
      dtype='object')

In [4]:
# Select the prediction target
y = prize_set.Price

# Select the features
feature = ['Rooms', 'Bedroom2', 'Bathroom', 'Landsize', 'Lattitude', 'Longtitude']
X = prize_set[feature]
print(X[:10])

   Rooms  Bedroom2  Bathroom  Landsize  Lattitude  Longtitude
0      2       2.0       1.0     202.0   -37.7996    144.9984
1      2       2.0       1.0     156.0   -37.8079    144.9934
2      3       3.0       2.0     134.0   -37.8093    144.9944
3      3       3.0       2.0      94.0   -37.7969    144.9969
4      4       3.0       1.0     120.0   -37.8072    144.9941
5      2       2.0       1.0     181.0   -37.8041    144.9953
6      3       4.0       2.0     245.0   -37.8024    144.9993
7      2       2.0       1.0     256.0   -37.8060    144.9954
8      1       1.0       1.0       0.0   -37.8008    144.9973
9      2       3.0       1.0     220.0   -37.8010    144.9989


### Create a Model

In [5]:
model1 = DecisionTreeRegressor(random_state=1)
model1.fit(X, y)
# Model Created

### Give the Prediction

In [6]:
print(model1.predict(X.head()))
print(prize_set.head().Price)


[1480000. 1035000. 1465000.  850000. 1600000.]
0    1480000.0
1    1035000.0
2    1465000.0
3     850000.0
4    1600000.0
Name: Price, dtype: float64


- As you can see, the predicted and the actual values are the same for these rows, because the model was trained on them

### Model Validation

In [7]:
from sklearn.metrics import mean_absolute_error 

- With the **MAE metric**, we take the absolute value of each error. This converts each error to a positive number. We then take the average of those absolute errors. This is our measure of model quality. 
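
The calculation described above can be sketched by hand. This is a minimal illustration only: the three actual and predicted values are hypothetical and do not come from the Melbourne data.

```python
# Hypothetical example values, not taken from melb_data.csv.
actual = [100.0, 150.0, 200.0]
predicted = [110.0, 140.0, 240.0]

# Absolute value of each error, so over- and under-predictions do not cancel out.
absolute_errors = [abs(a - p) for a, p in zip(actual, predicted)]

# MAE is the average of those absolute errors.
mae = sum(absolute_errors) / len(absolute_errors)

print(absolute_errors)
print(mae)
```

sklearn's mean_absolute_error computes the same quantity, so the hand calculation is only useful for seeing what the number means.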

In [8]:
price_prediction = model1.predict(X)
mean_absolute_error(y,price_prediction)

979.8441826215021

- When we evaluate on the same data that the model was trained on, the mean absolute error is very low
- That number says little about how the model handles houses it has not seen, so let's split the data, train on one part and test on the other

#### Split your Data

In [9]:
from sklearn.model_selection import train_test_split

In [10]:
train_X, test_X, train_y, test_y = train_test_split(X, y, random_state=1)

- **train_X** --> the part of the DataFrame (features) used to train
- **test_X** --> the part of the DataFrame (features) used to test
- **train_y** --> the part of the Series that contains the target (Price), used to train
- **test_y** --> the part of the Series that contains the target (Price), used to test
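
To see how the four parts relate, here is a small sketch on hypothetical toy data (the toy_ names are made up for illustration). With no test_size given, train_test_split holds out 25% of the rows by default, and each feature row stays paired with its target.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 8 rows with one feature each, plus 8 matching targets.
toy_X = [[i] for i in range(8)]
toy_y = [10 * i for i in range(8)]

# No test_size is passed, so the default 25% test share applies.
toy_train_X, toy_test_X, toy_train_y, toy_test_y = train_test_split(
    toy_X, toy_y, random_state=1
)

# Sizes of the training and test parts.
print(len(toy_train_X), len(toy_test_X))

# Each feature row is still matched with its own target after shuffling.
print(all(y == 10 * x[0] for x, y in zip(toy_train_X, toy_train_y)))
```

Passing test_size explicitly (for example test_size=0.5) changes the proportion if a different split is wanted.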

In [11]:
model2 = DecisionTreeRegressor(random_state=1)
model2.fit(train_X, train_y)

- Now let's see the MAE of the new model2

In [12]:
predected_of_split = model2.predict(test_X)
print(predected_of_split[:10])

[2070000.  700000. 3625000.  835000. 2130000.  440000.  615000. 1285000.
  890000. 1782800.]


In [13]:
error_margin = mean_absolute_error(test_y, predected_of_split)
print(error_margin)

240548.23888070692


- Now you can see that the mean absolute error is much higher when the model is trained and tested on different parts of the data (about 240,548 here, versus about 980 before)

#### Overfitting & Underfitting

- When we divide the houses amongst many leaves, we also have fewer houses in each leaf. Leaves with very few houses will make predictions that are quite close to those homes' actual values, but they may make very unreliable predictions for new data (because each prediction is based on only a few houses). This phenomenon is called **overfitting**
- At the other extreme, if a tree divides houses into only 2 or 4 groups, each group still has a wide variety of houses. The resulting predictions may be far off for most houses, even in the training data (and they will be bad in validation too, for the same reason). When a model fails to capture important distinctions and patterns in the data, so that it performs poorly even on the training data, that is called **underfitting**

- Therefore, to control this, we set in advance the maximum number of leaf nodes (max_leaf_nodes) that the tree may have
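
The two extremes can be sketched on hypothetical toy data (the toy_ names and values are made up for illustration). The sketch only compares training error: an unrestricted tree can give every row its own leaf and reproduce the training targets, while a tree limited to two leaves cannot. The cost of the unrestricted tree shows up only on held-out data, as with model2 above.

```python
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

# Hypothetical toy data: 20 distinct inputs with a curved target.
toy_X = [[i] for i in range(20)]
toy_y = [float(i * i) for i in range(20)]

# Unrestricted tree: free to keep splitting until the leaves are pure.
deep_tree = DecisionTreeRegressor(random_state=1).fit(toy_X, toy_y)

# Heavily restricted tree: at most two leaves, so only two distinct predictions.
shallow_tree = DecisionTreeRegressor(max_leaf_nodes=2, random_state=1).fit(toy_X, toy_y)

# Error measured on the same rows the trees were trained on.
deep_train_mae = mean_absolute_error(toy_y, deep_tree.predict(toy_X))
shallow_train_mae = mean_absolute_error(toy_y, shallow_tree.predict(toy_X))

print(deep_tree.get_n_leaves(), deep_train_mae)
print(shallow_tree.get_n_leaves(), shallow_train_mae)
```

A good value for max_leaf_nodes sits between these extremes, which is why it is chosen by comparing validation error rather than training error.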

In [14]:
# A function that returns the validation MAE for a given maximum number of leaf nodes
def get_mae(max_nodes):
    sample_model = DecisionTreeRegressor(max_leaf_nodes=max_nodes, random_state=1)
    sample_model.fit(train_X, train_y)
    sample_predict = sample_model.predict(test_X)
    mae = mean_absolute_error(test_y, sample_predict)
    return mae

In [15]:
# Check which number of leaf nodes gives the lowest validation MAE
for max_num in [150, 200, 250, 300, 350]:
    print(max_num, "--->", get_mae(max_num))

150 ---> 235936.4698181066
200 ---> 232479.01740074818
250 ---> 229307.761730195
300 ---> 226191.48512112297
350 ---> 224707.54429560617
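
Reading the lowest value off a printed list is easy to get wrong, so the choice can also be made in code. The sketch below copies the MAE values from the output above into a dictionary (the variable names are made up for illustration) and picks the key with the smallest error.

```python
# MAE values copied from the output printed above, keyed by max_leaf_nodes.
mae_by_max_leaf_nodes = {
    150: 235936.4698181066,
    200: 232479.01740074818,
    250: 229307.761730195,
    300: 226191.48512112297,
    350: 224707.54429560617,
}

# The key whose MAE is smallest.
best_max_leaf_nodes = min(mae_by_max_leaf_nodes, key=mae_by_max_leaf_nodes.get)

print(best_max_leaf_nodes, mae_by_max_leaf_nodes[best_max_leaf_nodes])
```

In the notebook itself the same dictionary could be built with get_mae, for example {n: get_mae(n) for n in candidates}. The error is still falling at the largest candidate, so extending the list with larger values may be worth trying.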


- From the above we can see that, of the values tried, max_leaf_nodes = 350 gives the lowest error (about 224,708). The error was still falling at 350, so larger values may do better still
- Therefore, we can use 350 and, since the validation split is no longer needed, train on the whole data set

In [16]:
# The final model, using the best max_leaf_nodes found above
final_model = DecisionTreeRegressor(max_leaf_nodes=350, random_state=1)
final_model.fit(X, y)