<span style="color:green; font-size:38px">
    <div align=center><b>Cross-Validation Implementation</b></div>
</span>

In this notebook I implement cross-validation for a regression model. The same steps apply to classification.

<span style="color:green; font-size:28px">
    <b>Melbourne House Price Prediction</b>
</span>

**Columns**
* Rooms: Number of rooms
* Price: Price in dollars
* Method: S - property sold; SP - property sold prior; PI - property passed in; PN - sold prior not disclosed; SN - sold not disclosed; NB - no bid; VB - vendor bid; W - withdrawn prior to auction; SA - sold after auction; SS - sold after auction price not disclosed. N/A - price or highest bid not available.
* Type: br - bedroom(s); h - house,cottage,villa, semi,terrace; u - unit, duplex; t - townhouse; dev site - development site; o res - other residential.
* SellerG: Real Estate Agent
* Date: Date sold
* Distance: Distance from CBD
* Regionname: General Region (West, North West, North, North East, etc.)
* Propertycount: Number of properties that exist in the suburb.
* Bedroom2: Number of bedrooms, scraped from a different source
* Bathroom: Number of Bathrooms
* Car: Number of carspots
* Landsize: Land Size
* BuildingArea: Building Size
* CouncilArea: Governing council for the area

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline 

<span style="color:cyan; font-size:28px">
    <b>Get Data</b>
</span>

In [70]:
data = pd.read_csv( '../datasets/melb_data.csv' )
data.head()

Unnamed: 0,Suburb,Address,Rooms,Type,Price,Method,SellerG,Date,Distance,Postcode,...,Bathroom,Car,Landsize,BuildingArea,YearBuilt,CouncilArea,Lattitude,Longtitude,Regionname,Propertycount
0,Abbotsford,85 Turner St,2,h,1480000.0,S,Biggin,3/12/2016,2.5,3067.0,...,1.0,1.0,202.0,,,Yarra,-37.7996,144.9984,Northern Metropolitan,4019.0
1,Abbotsford,25 Bloomburg St,2,h,1035000.0,S,Biggin,4/02/2016,2.5,3067.0,...,1.0,0.0,156.0,79.0,1900.0,Yarra,-37.8079,144.9934,Northern Metropolitan,4019.0
2,Abbotsford,5 Charles St,3,h,1465000.0,SP,Biggin,4/03/2017,2.5,3067.0,...,2.0,0.0,134.0,150.0,1900.0,Yarra,-37.8093,144.9944,Northern Metropolitan,4019.0
3,Abbotsford,40 Federation La,3,h,850000.0,PI,Biggin,4/03/2017,2.5,3067.0,...,2.0,1.0,94.0,,,Yarra,-37.7969,144.9969,Northern Metropolitan,4019.0
4,Abbotsford,55a Park St,4,h,1600000.0,VB,Nelson,4/06/2016,2.5,3067.0,...,1.0,2.0,120.0,142.0,2014.0,Yarra,-37.8072,144.9941,Northern Metropolitan,4019.0


<span style="color:green; font-size:28px">
    <b>Data Processing</b>
</span>

In [71]:
# Summary of the data: info() prints directly and returns None, so it is not wrapped in print()
data.info()

# Select the column groups once and reuse them
categorical_columns = data.select_dtypes('object').columns
numerical_columns = data.select_dtypes(['int64', 'float64']).columns
print(f"No of categorical columns: {len(categorical_columns)}")
print(f"No of numerical columns: {len(numerical_columns)}")
print(f"Categorical columns: {categorical_columns}")
print(f"Numerical columns: {numerical_columns}")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13580 entries, 0 to 13579
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Suburb         13580 non-null  object 
 1   Address        13580 non-null  object 
 2   Rooms          13580 non-null  int64  
 3   Type           13580 non-null  object 
 4   Price          13580 non-null  float64
 5   Method         13580 non-null  object 
 6   SellerG        13580 non-null  object 
 7   Date           13580 non-null  object 
 8   Distance       13580 non-null  float64
 9   Postcode       13580 non-null  float64
 10  Bedroom2       13580 non-null  float64
 11  Bathroom       13580 non-null  float64
 12  Car            13518 non-null  float64
 13  Landsize       13580 non-null  float64
 14  BuildingArea   7130 non-null   float64
 15  YearBuilt      8205 non-null   float64
 16  CouncilArea    12211 non-null  object 
 17  Lattitude      13580 non-null  float64
 18  Longti

OK, we have 8 categorical features and 13 numerical features.

<span style="color:green; font-size:18px">
    <b>Check for missing values</b>
</span>

In [72]:
# Count missing values once, then summarize count, percentage and dtype per column
missing_counts = data.isna().sum()
missing_values_df = pd.DataFrame({'Column': missing_counts.index,
                                  'Missing Values': missing_counts.values,
                                  '%': np.round(missing_counts.values / data.shape[0] * 100, 2),
                                  'Dtype': data.dtypes.values})
missing_values_df

Unnamed: 0,Column,Missing Values,%,Dtype
0,Suburb,0,0.0,object
1,Address,0,0.0,object
2,Rooms,0,0.0,int64
3,Type,0,0.0,object
4,Price,0,0.0,float64
5,Method,0,0.0,object
6,SellerG,0,0.0,object
7,Date,0,0.0,object
8,Distance,0,0.0,float64
9,Postcode,0,0.0,float64


Here we can see that three numerical features (Car, BuildingArea and YearBuilt) and one categorical feature (CouncilArea) have missing values. BuildingArea is missing in almost 50% of the rows, but dropping the column could discard important information. In my opinion, the better solution is to fill the missing values: with the median for the numerical features, since their distributions are skewed, and with the mode for the categorical feature.

In [73]:
data['Car'] = data['Car'].fillna( data['Car'].median() )
data['BuildingArea'] = data['BuildingArea'].fillna( data['BuildingArea'].median() )
data['YearBuilt'] = data['YearBuilt'].fillna( data['YearBuilt'].median() )

data['CouncilArea'] = data['CouncilArea'].fillna( data['CouncilArea'].mode()[0] )

In [74]:
data.isna().sum()

Suburb           0
Address          0
Rooms            0
Type             0
Price            0
Method           0
SellerG          0
Date             0
Distance         0
Postcode         0
Bedroom2         0
Bathroom         0
Car              0
Landsize         0
BuildingArea     0
YearBuilt        0
CouncilArea      0
Lattitude        0
Longtitude       0
Regionname       0
Propertycount    0
dtype: int64
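The cells above compute the medians and the mode on the full dataset, before any train/test split. As a hedged alternative, the sketch below uses scikit-learn's `SimpleImputer` fitted on the training rows only, so the held-out rows do not influence the fill values. The toy frame, its values and the 4/2 split are made up for illustration and are not taken from the housing data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame standing in for two of the numerical columns (values are made up)
toy = pd.DataFrame({
    "Car": [1.0, 2.0, np.nan, 0.0, np.nan, 1.0],
    "BuildingArea": [79.0, np.nan, 150.0, np.nan, 142.0, np.nan],
})
train, test = toy.iloc[:4], toy.iloc[4:]  # hypothetical split: first 4 rows train

# Learn the medians from the training rows only
imputer = SimpleImputer(strategy="median")
train_filled = pd.DataFrame(imputer.fit_transform(train),
                            columns=toy.columns, index=train.index)

# Reuse the training medians to fill the held-out rows
test_filled = pd.DataFrame(imputer.transform(test),
                           columns=toy.columns, index=test.index)

print(imputer.statistics_)  # medians learned from the training rows
print(test_filled)
```

For a categorical column such as CouncilArea, `SimpleImputer(strategy="most_frequent")` plays the role of the mode.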

<span style="color:green; font-size:18px">
    <b>Check for outliers and replace them with the mean</b>
</span>

In [75]:
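def replace_outliers(column):
    """Replace values outside the 1.5 * IQR fences of `column` with the column mean."""
    Q1 = data[column].quantile(0.25)
    Q3 = data[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_extreme = Q1 - 1.5 * IQR
    upper_extreme = Q3 + 1.5 * IQR
    mean = data[column].mean()  # computed before replacement, so it still includes the outliers
    is_outlier = (data[column] > upper_extreme) | (data[column] < lower_extreme)
    # Assign back instead of calling replace(..., inplace=True) on a column selection,
    # which may not modify `data` in recent pandas versions
    data[column] = data[column].mask(is_outlier, mean)

# Note: the numerical columns include the target, Price
numerical_features = list( data.select_dtypes(['int64', 'float64']).columns )
for column in numerical_features:
    replace_outliers(column)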
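To see what the 1.5 × IQR rule flags, here is a small sketch on a made-up series (the values are illustrative, not from the housing data). It follows the same steps as the function above: compute the quartiles, build the fences, and replace flagged values with the mean.

```python
import pandas as pd

# Made-up values; 100.0 is an obvious outlier
values = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 100.0])

q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag values outside the fences and replace them with the mean
is_outlier = (values < lower) | (values > upper)
cleaned = values.mask(is_outlier, values.mean())

print(f"fences: [{lower}, {upper}]")
print(cleaned.tolist())
```

Because the mean is computed with the outlier still present, the replacement value here (about 26.3) remains above the upper fence of 15.0. Using the median as the replacement is one possible alternative if that matters.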

<span style="color:green; font-size:18px">
    <b>Feature Selection</b>
</span>

In [76]:
categorical_features = list(data.select_dtypes('object').columns)
categorical_features

['Suburb',
 'Address',
 'Type',
 'Method',
 'SellerG',
 'Date',
 'CouncilArea',
 'Regionname']

We have 8 categorical features, but in my opinion not all of them are important for this model. The ones I consider important are Suburb, Type, Method, CouncilArea and Regionname.

In [77]:
numerical_features

['Rooms',
 'Price',
 'Distance',
 'Postcode',
 'Bedroom2',
 'Bathroom',
 'Car',
 'Landsize',
 'BuildingArea',
 'YearBuilt',
 'Lattitude',
 'Longtitude',
 'Propertycount']

In [78]:
# correlation between features and label
numerical_data = data.select_dtypes(['int64', 'float64'])
numerical_data.corr()['Price'].sort_values()

Lattitude       -0.246160
Distance        -0.099266
YearBuilt       -0.088408
Propertycount   -0.007489
BuildingArea    -0.006969
Car              0.213670
Longtitude       0.228160
Postcode         0.249406
Landsize         0.289639
Bathroom         0.365221
Bedroom2         0.440741
Rooms            0.454538
Price            1.000000
Name: Price, dtype: float64

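For this example we use only numerical features that correlate positively with Price: Rooms, Bedroom2, Bathroom, Landsize and Car. Postcode and Longtitude also correlate positively but are left out.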

In [141]:
X = data[['Rooms', 'Bedroom2', 'Bathroom', 'Landsize', 'Car']]
y = data['Price']
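The selection above was made by reading the correlation table. A minimal sketch of doing the same thing programmatically follows, on synthetic data. The column names, the `target` column and the 0.2 threshold are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic data: `a` raises the target, `b` lowers it, `c` is unrelated noise
rng = np.random.default_rng(0)
n = 500
a, b, c = rng.normal(size=n), rng.normal(size=n), rng.normal(size=n)
toy = pd.DataFrame({"a": a, "b": b, "c": c})
toy["target"] = 3 * a - 2 * b + rng.normal(size=n)

# Correlation of every feature with the target
corr = toy.corr()["target"].drop("target")

threshold = 0.2  # hypothetical cut-off
positive_only = corr[corr > threshold].index.tolist()
by_magnitude = corr[corr.abs() > threshold].index.tolist()

print(corr.sort_values())
print("positive only:", positive_only)
print("by magnitude: ", by_magnitude)
```

Filtering on the absolute value also keeps features with a strong negative correlation, which a linear model can use just as well as positive ones.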

<span style="color:green; font-size:22px">
    <b>Model Building</b>
</span>

<span style="color:green; font-size:18px">
    <b>Train | Test Split</b>
</span>

In [142]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

<span style="color:green; font-size:18px">
    <b>Scaling Data</b>
</span>

In [143]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

<span style="color:green; font-size:18px">
    <b>Train Model</b>
</span>

In [144]:
from sklearn.linear_model import LinearRegression

lr = LinearRegression()
lr.fit(X_train, y_train)

# train preds
y_train_hat = lr.predict(X_train)

# test_preds 
y_hat = lr.predict(X_test)

In [145]:
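from sklearn.metrics import mean_squared_error
# The errors below are RMSE values, in the same units as Price
print(f"Train error = {np.sqrt(mean_squared_error(y_train, y_train_hat))}")
print(f"Test error = {np.sqrt(mean_squared_error(y_test, y_hat))}")
# score() returns the coefficient of determination (R^2) on the test set
print( lr.score(X_test, y_test) )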

Train error = 387321.6336427131
Test error = 385411.40775858366
0.2416632148455451
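The first two numbers above are root mean squared errors and the third is the R² returned by `score`. As a small sketch of what these metrics compute, the block below derives both by hand with NumPy on made-up values and compares them with scikit-learn's `mean_squared_error` and `r2_score`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up targets and predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 12.0])

# RMSE: square root of the mean squared residual
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))

# R^2: 1 minus the ratio of residual to total sum of squares
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(rmse, np.sqrt(mean_squared_error(y_true, y_pred)))
print(r2, r2_score(y_true, y_pred))
```

An R² of 0.24 therefore means the model removes about a quarter of the squared error that a constant prediction of the mean price would make.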


<span style="color:green; font-size:18px">
    <b>Model Validation</b>
</span>

<span style="color:green; font-size:18px">
    <b>Cross-Validation with for loop</b>
</span>

In [123]:
X1 = X.values
y1 = y.values

In [137]:
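from sklearn.model_selection import KFold
# 10 consecutive folds (no shuffling, which is the KFold default)
kfold = KFold(n_splits=10)
rmse_list = []
scores = []
for train_idx, test_idx in kfold.split(X1, y1):
    # Note: these names overwrite the earlier train/test split variables
    X_train, X_test = X1[train_idx], X1[test_idx]
    y_train, y_test = y1[train_idx], y1[test_idx]

    # Fit a fresh model on each training fold and evaluate on the held-out fold
    model = LinearRegression()
    model.fit(X_train, y_train)
    y_hat = model.predict(X_test)
    rmse = np.sqrt( mean_squared_error(y_test, y_hat) )
    rmse_list.append( rmse )
    scores.append( model.score(X_test, y_test) )  # R^2 on the held-out fold
print(np.mean(rmse_list))  # mean RMSE across folds
print(f"std: {np.std(rmse_list)}")  # standard deviation of the fold RMSEs
print(f"score: {np.mean(scores)}")  # mean R^2 across folds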

388496.73956152395
std: 25027.61509555682
score: 0.20880675826214512


Cross-validation gives a mean R² of about 0.21, so linear regression with these features explains only about a fifth of the variance in price. The RMSE varies across folds with a standard deviation of about 25,000 around a mean of about 388,000, and whichever fold is held out, the score stays close to 0.20. The weak performance is therefore not an artifact of one particular train/test split. We could try more flexible algorithms such as decision trees, random forests or boosting methods, but that is left for another occasion.
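To make the splitting in the loop above concrete, this sketch prints the indices that `KFold` produces on ten toy samples. The array is made up, and 5 folds are used instead of 10 only to keep the output short.

```python
import numpy as np
from sklearn.model_selection import KFold

X_demo = np.arange(10).reshape(-1, 1)  # ten toy samples

test_folds = []
for train_idx, test_idx in KFold(n_splits=5).split(X_demo):
    # Each sample lands in the test fold exactly once
    test_folds.append(test_idx.tolist())
    print(f"train={train_idx.tolist()} test={test_idx.tolist()}")
```

Without shuffling, the folds are consecutive blocks of rows. If the rows of a dataset are ordered in some meaningful way, `KFold(n_splits=5, shuffle=True, random_state=42)` is an option worth considering.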

<span style="color:green; font-size:18px">
    <b>Cross-Validation with cross_val_score</b>
</span>

In [161]:
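from sklearn.model_selection import cross_val_score
# For a regressor, cv=10 uses an unshuffled 10-fold KFold and the default score is R^2,
# so the mean matches the manual loop above
score = cross_val_score( LinearRegression(), X1, y1, cv=10 )
print(score.mean())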

0.20880675826214512
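The cross-validation runs above use the unscaled features. For ordinary linear regression this does not change R², but for models that are sensitive to feature scale the scaler should be fitted inside each fold. A hedged sketch of one way to do that follows, using `make_pipeline` and `cross_validate` on synthetic data from `make_regression`. The data, the shuffled `KFold` and the choice of scorers are assumptions for illustration, not part of the notebook's results.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data standing in for X1 and y1
X_demo, y_demo = make_regression(n_samples=200, n_features=5,
                                 noise=10.0, random_state=0)

# The scaler is refitted on the training part of every fold
pipeline = make_pipeline(StandardScaler(), LinearRegression())
cv = KFold(n_splits=10, shuffle=True, random_state=42)

# Collect R^2 and RMSE in one pass
results = cross_validate(pipeline, X_demo, y_demo, cv=cv,
                         scoring=("r2", "neg_root_mean_squared_error"))
r2_scores = results["test_r2"]
rmse_scores = -results["test_neg_root_mean_squared_error"]  # flip the sign back

print(f"R^2:  {r2_scores.mean():.3f} (std {r2_scores.std():.3f})")
print(f"RMSE: {rmse_scores.mean():.3f} (std {rmse_scores.std():.3f})")
```

scikit-learn reports error metrics as negated scores so that higher is always better, which is why the RMSE values are multiplied by -1 before printing.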
