
# Model Validation

- You'll want to evaluate almost every model you ever build. In most (though not all) applications, the relevant measure of model quality is predictive accuracy. In other words, will the model's predictions be close to what actually happens?

- Many people make a huge mistake when measuring predictive accuracy. They make predictions with their training data and compare those predictions to the target values in the training data. You'll see the problem with this approach and how to solve it in a moment, but let's think about how we'd do this first.

- You'd first need to summarize the model quality in an understandable way. If you compare predicted and actual home values for 10,000 houses, you'll likely find a mix of good and bad predictions. Looking through a list of 10,000 predicted and actual values would be pointless. We need to summarize this into a single metric.

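- There are many metrics for summarizing model quality, but we'll start with one called Mean Absolute Error (MAE). Let's break down this metric starting with the last word, error.

- The prediction error for each house is `error = actual - predicted`. So if a house cost 150,000 and you predicted it would cost 100,000, the error is 50,000.

- With MAE, we take the absolute value of each error, which converts every error to a positive number, and then average those absolute errors. In plain English, MAE says: on average, our predictions are off by about this much.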
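
As a minimal sketch of how MAE is computed, the snippet below averages the absolute differences between actual and predicted prices. The helper name `mean_abs_error` and the four prices are made up for illustration; scikit-learn's `mean_absolute_error`, used later in this notebook, computes the same quantity.

```python
def mean_abs_error(actual, predicted):
    """Average of |actual - predicted| over all examples (hypothetical helper)."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    # Error for each house: actual value minus predicted value
    errors = [a - p for a, p in zip(actual, predicted)]
    # Take absolute values so over- and under-predictions do not cancel out
    return sum(abs(e) for e in errors) / len(errors)


# Made-up prices, for illustration only
actual_prices = [150_000, 200_000, 320_000, 400_000]
predicted_prices = [160_000, 190_000, 300_000, 400_000]

mae = mean_abs_error(actual_prices, predicted_prices)
print(mae)  # 10000.0
```

Here the absolute errors are 10,000, 10,000, 20,000 and 0, so the predictions are off by 10,000 on average.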

In [18]:
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.impute import SimpleImputer
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import HistGradientBoostingRegressor

In [19]:
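data_path = "https://raw.githubusercontent.com/aryamanan/mydata/main/Home%20Data/melb_data%202.csv"
melbourne_data = pd.read_csv(data_path)
# Rows with missing values are kept: dropna() returns a new DataFrame rather than
# modifying melbourne_data in place, and the model used later handles NaN values.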

melbourne_data.head()

| | Suburb | Address | Rooms | Type | Price | Method | SellerG | Date | Distance | Postcode | ... | Bathroom | Car | Landsize | BuildingArea | YearBuilt | CouncilArea | Lattitude | Longtitude | Regionname | Propertycount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Abbotsford | 85 Turner St | 2 | h | 1480000.0 | S | Biggin | 3/12/2016 | 2.5 | 3067.0 | ... | 1.0 | 1.0 | 202.0 | NaN | NaN | Yarra | -37.7996 | 144.9984 | Northern Metropolitan | 4019.0 |
| 1 | Abbotsford | 25 Bloomburg St | 2 | h | 1035000.0 | S | Biggin | 4/02/2016 | 2.5 | 3067.0 | ... | 1.0 | 0.0 | 156.0 | 79.0 | 1900.0 | Yarra | -37.8079 | 144.9934 | Northern Metropolitan | 4019.0 |
| 2 | Abbotsford | 5 Charles St | 3 | h | 1465000.0 | SP | Biggin | 4/03/2017 | 2.5 | 3067.0 | ... | 2.0 | 0.0 | 134.0 | 150.0 | 1900.0 | Yarra | -37.8093 | 144.9944 | Northern Metropolitan | 4019.0 |
| 3 | Abbotsford | 40 Federation La | 3 | h | 850000.0 | PI | Biggin | 4/03/2017 | 2.5 | 3067.0 | ... | 2.0 | 1.0 | 94.0 | NaN | NaN | Yarra | -37.7969 | 144.9969 | Northern Metropolitan | 4019.0 |
| 4 | Abbotsford | 55a Park St | 4 | h | 1600000.0 | VB | Nelson | 4/06/2016 | 2.5 | 3067.0 | ... | 1.0 | 2.0 | 120.0 | 142.0 | 2014.0 | Yarra | -37.8072 | 144.9941 | Northern Metropolitan | 4019.0 |


# Selecting Data for Modeling

- Your dataset has too many variables to wrap your head around, or even to print out nicely.
- How can you pare down this overwhelming amount of data to something you can understand?

- We'll start by picking a few variables using our intuition. Later we'll see statistical techniques to automatically prioritize variables.

In [20]:
melbourne_data.describe()

| | Rooms | Price | Distance | Postcode | Bedroom2 | Bathroom | Car | Landsize | BuildingArea | YearBuilt | Lattitude | Longtitude | Propertycount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 13580.0 | 13580.0 | 13580.0 | 13580.0 | 13580.0 | 13580.0 | 13518.0 | 13580.0 | 7130.0 | 8205.0 | 13580.0 | 13580.0 | 13580.0 |
| mean | 2.937997 | 1075684.0 | 10.137776 | 3105.301915 | 2.914728 | 1.534242 | 1.610075 | 558.416127 | 151.96765 | 1964.684217 | -37.809203 | 144.995216 | 7454.417378 |
| std | 0.955748 | 639310.7 | 5.868725 | 90.676964 | 0.965921 | 0.691712 | 0.962634 | 3990.669241 | 541.014538 | 37.273762 | 0.07926 | 0.103916 | 4378.581772 |
| min | 1.0 | 85000.0 | 0.0 | 3000.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1196.0 | -38.18255 | 144.43181 | 249.0 |
| 25% | 2.0 | 650000.0 | 6.1 | 3044.0 | 2.0 | 1.0 | 1.0 | 177.0 | 93.0 | 1940.0 | -37.856822 | 144.9296 | 4380.0 |
| 50% | 3.0 | 903000.0 | 9.2 | 3084.0 | 3.0 | 1.0 | 2.0 | 440.0 | 126.0 | 1970.0 | -37.802355 | 145.0001 | 6555.0 |
| 75% | 3.0 | 1330000.0 | 13.0 | 3148.0 | 3.0 | 2.0 | 2.0 | 651.0 | 174.0 | 1999.0 | -37.7564 | 145.058305 | 10331.0 |
| max | 10.0 | 9000000.0 | 48.1 | 3977.0 | 20.0 | 8.0 | 10.0 | 433014.0 | 44515.0 | 2018.0 | -37.40853 | 145.52635 | 21650.0 |


In [21]:
# To choose variables/columns, we'll need to see a list of all columns in the dataset. 
# That is done with the columns property of the DataFrame (the bottom line of code below).

melbourne_data.columns

Index(['Suburb', 'Address', 'Rooms', 'Type', 'Price', 'Method', 'SellerG',
       'Date', 'Distance', 'Postcode', 'Bedroom2', 'Bathroom', 'Car',
       'Landsize', 'BuildingArea', 'YearBuilt', 'CouncilArea', 'Lattitude',
       'Longtitude', 'Regionname', 'Propertycount'],
      dtype='object')

There are many ways to select a subset of your data. We'll use two:
- Dot notation, which we use to select the "prediction target"
- Selecting with a column list, which we use to select the "features"

In [22]:
# You can pull out a single variable (here, the prediction target) with dot notation.
y = melbourne_data.Price

# We select multiple features by providing a list of column names inside brackets.
# We will model using only these features for now.
feature_list = ['Bathroom', 'YearBuilt', 'BuildingArea', 'Distance',
                'Landsize', 'Rooms', 'Car']

X = melbourne_data[feature_list]


In [23]:
X.describe()

| | Bathroom | YearBuilt | BuildingArea | Distance | Landsize | Rooms | Car |
|---|---|---|---|---|---|---|---|
| count | 13580.0 | 8205.0 | 7130.0 | 13580.0 | 13580.0 | 13580.0 | 13518.0 |
| mean | 1.534242 | 1964.684217 | 151.96765 | 10.137776 | 558.416127 | 2.937997 | 1.610075 |
| std | 0.691712 | 37.273762 | 541.014538 | 5.868725 | 3990.669241 | 0.955748 | 0.962634 |
| min | 0.0 | 1196.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 25% | 1.0 | 1940.0 | 93.0 | 6.1 | 177.0 | 2.0 | 1.0 |
| 50% | 1.0 | 1970.0 | 126.0 | 9.2 | 440.0 | 3.0 | 2.0 |
| 75% | 2.0 | 1999.0 | 174.0 | 13.0 | 651.0 | 3.0 | 2.0 |
| max | 8.0 | 2018.0 | 44515.0 | 48.1 | 433014.0 | 10.0 | 10.0 |


In [24]:
X.head()

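| | Bathroom | YearBuilt | BuildingArea | Distance | Landsize | Rooms | Car |
|---|---|---|---|---|---|---|---|
| 0 | 1.0 | NaN | NaN | 2.5 | 202.0 | 2 | 1.0 |
| 1 | 1.0 | 1900.0 | 79.0 | 2.5 | 156.0 | 2 | 0.0 |
| 2 | 2.0 | 1900.0 | 150.0 | 2.5 | 134.0 | 3 | 0.0 |
| 3 | 2.0 | NaN | NaN | 2.5 | 94.0 | 3 | 1.0 |
| 4 | 1.0 | 2014.0 | 142.0 | 2.5 | 120.0 | 4 | 2.0 |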


# The Problem with "In-Sample" Scores

- A score computed on the same data that was used to fit the model can be called an "in-sample" score: a single "sample" of houses is used both to build the model and to evaluate it. Here's why this is bad.

- Imagine that, in the large real estate market, door color is unrelated to home price.

- However, in the sample of data you used to build the model, all homes with green doors were very expensive. The model's job is to find patterns that predict home prices, so it will see this pattern, and it will always predict high prices for homes with green doors.

- Since this pattern was derived from the training data, the model will appear accurate in the training data.

- But if this pattern doesn't hold when the model sees new data, the model would be very inaccurate when used in practice.

- Since a model's practical value comes from making predictions on new data, we measure performance on data that wasn't used to build the model. The most straightforward way to do this is to exclude some data from the model-building process, and then use that data to test the model's accuracy on examples it hasn't seen before.
- This held-out data is called validation data.
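
To make the gap between in-sample and validation scores concrete, here is a small sketch in plain Python. `MemorizingModel` is a hypothetical toy class (not a scikit-learn estimator) that simply memorizes its training rows, standing in for a badly overfit model, and the `(rooms, land size)` rows and prices are made up for illustration.

```python
def mean_abs_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


class MemorizingModel:
    """Toy model that memorizes its training rows (an extreme case of overfitting)."""

    def fit(self, X, y):
        self.lookup = dict(zip(X, y))    # exact price for every row seen in training
        self.fallback = sum(y) / len(y)  # mean training price for unseen rows
        return self

    def predict(self, X):
        return [self.lookup.get(row, self.fallback) for row in X]


# Made-up (rooms, land size) rows and prices, for illustration only
train_X = [(2, 150), (3, 200), (4, 320), (3, 180)]
train_y = [500_000, 700_000, 900_000, 650_000]
val_X = [(2, 160), (4, 300)]
val_y = [520_000, 880_000]

model = MemorizingModel().fit(train_X, train_y)

in_sample_mae = mean_abs_error(train_y, model.predict(train_X))
validation_mae = mean_abs_error(val_y, model.predict(val_X))

print(in_sample_mae)   # 0.0
print(validation_mae)  # 180000.0
```

The in-sample score looks perfect because the model has already seen every training price, while the validation score shows how far off it is on houses it has not seen.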

In [29]:
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

train_X, test_X, train_y, test_y = train_test_split(X, y, random_state=0)

# Define the model; a fixed random_state makes the results reproducible.
# HistGradientBoostingRegressor is used instead of DecisionTreeRegressor
# because it can work even with NaN values in the dataframe, so no rows are dropped.
melbourne_model = HistGradientBoostingRegressor(random_state=1)

# Fit the model on the training data only
melbourne_model.fit(train_X, train_y)

- As you can see, we can create validation data with `train_test_split`. By default, it uses 75% of the data to train the model and holds out the remaining 25% to compare against the model's predictions. This ratio can be changed with the `test_size` parameter.

In [30]:
val_predictions=melbourne_model.predict(test_X) 
# get predicted prices on validation data
val_predictions

array([1690491.9673707 ,  819004.61344598,  853216.07833705, ...,
        978924.675788  ,  860393.78198684, 1170565.64704454])

In [31]:
print(mean_absolute_error(test_y, val_predictions))

234574.5243193647


In [32]:
print(train_X.shape)
print(test_X.shape)
print(train_y.shape)
print(test_y.shape)

(10185, 7)
(3395, 7)
(10185,)
(3395,)


In [None]:
# We can change the default train/test ratio with the `test_size` parameter.

train_X, test_X, train_y, test_y = train_test_split(X, y, random_state=0, test_size=0.1)
# The test set is now 10% of the data, so the training set is 90%.