## Welcome to P7 Hands On!

In [55]:
import pandas as pd
import numpy as np

### Part 1: Regularized Regression

In this session, we will download data from: http://archive.ics.uci.edu/ml/datasets/Communities+and+Crime. The task is to predict the amount of `Violent Crime` in each location (each row represents one location). There are 122 feature columns and only 1 target column.

This means the data is high-dimensional: it has many columns, and many of them may be redundant or simply unimportant. If we use all of these columns, the risk of overfitting is high. So let's see whether regularization helps.

In addition, this data has many missing values. Out of almost 2,000 rows, only about 300 are complete (no missing values). The missing values are also hard to impute because there are so many columns.

This is an example of 'poor' quality data: few rows, a very large number of columns for which we lack sufficient domain knowledge, and many missing values.
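A quick way to quantify missing values before dropping rows is to count them per column and count the rows that are fully populated. The sketch below uses a small made-up frame (`toy` is a hypothetical stand-in, not the crime data), so the numbers it prints say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the crime data (values are made up).
toy = pd.DataFrame({
    "a": [0.1, np.nan, 0.3, 0.4],
    "b": [1.0, 2.0, np.nan, 4.0],
    "c": [5.0, 6.0, 7.0, 8.0],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()

# Number of rows without any missing value (what dropna() would keep)
complete_rows = int(toy.notna().all(axis=1).sum())

print(missing_per_column)
print(f"Complete rows: {complete_rows} of {len(toy)}")
```

Running the same two lines on the loaded crime frame should show how many rows survive `dropna()`.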

In [56]:
from sklearn.linear_model import LinearRegression

from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.model_selection import train_test_split

from sklearn.metrics import mean_squared_error as mse

In [57]:
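url = 'https://github.com/albahnsen/PracticalMachineLearningClass/raw/master/datasets/communities.data'
crime = pd.read_csv(url, header=None, na_values=['?'])

# Keep the predictive columns and the target (columns 5 to 127);
# .copy() avoids a SettingWithCopyWarning when adding a column below
df = crime.iloc[:, 5:].copy()
# Column 3 holds the community name
df['location'] = crime[3]

# Drop every row that has at least one missing value
df = df.dropna().reset_index(drop=True)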

In [58]:
df

| | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | ... | 119 | 120 | 121 | 122 | 123 | 124 | 125 | 126 | 127 | location |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.19 | 0.33 | 0.02 | 0.90 | 0.12 | 0.17 | 0.34 | 0.47 | 0.29 | 0.32 | ... | 0.26 | 0.20 | 0.06 | 0.04 | 0.90 | 0.5 | 0.32 | 0.14 | 0.20 | Lakewoodcity |
| 1 | 0.15 | 0.31 | 0.40 | 0.63 | 0.14 | 0.06 | 0.58 | 0.72 | 0.65 | 0.47 | ... | 0.39 | 0.84 | 0.06 | 0.06 | 0.91 | 0.5 | 0.88 | 0.26 | 0.49 | Albanycity |
| 2 | 0.25 | 0.54 | 0.05 | 0.71 | 0.48 | 0.30 | 0.42 | 0.48 | 0.28 | 0.32 | ... | 0.46 | 0.05 | 0.09 | 0.05 | 0.88 | 0.5 | 0.76 | 0.13 | 0.34 | Modestocity |
| 3 | 1.00 | 0.42 | 0.47 | 0.59 | 0.12 | 0.05 | 0.41 | 0.53 | 0.34 | 0.33 | ... | 0.07 | 0.15 | 1.00 | 0.35 | 0.73 | 0.0 | 0.31 | 0.21 | 0.69 | Jacksonvillecity |
| 4 | 0.11 | 0.43 | 0.04 | 0.89 | 0.09 | 0.06 | 0.45 | 0.48 | 0.31 | 0.46 | ... | 0.12 | 0.07 | 0.04 | 0.01 | 0.81 | 1.0 | 0.56 | 0.09 | 0.63 | SiouxCitycity |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 314 | 1.00 | 0.29 | 0.21 | 0.29 | 1.00 | 0.26 | 0.24 | 0.47 | 0.28 | 0.46 | ... | 1.00 | 1.00 | 0.53 | 0.62 | 0.64 | 0.5 | 0.64 | 0.35 | 0.75 | SanFranciscocity |
| 315 | 0.07 | 0.38 | 0.17 | 0.84 | 0.11 | 0.04 | 0.35 | 0.41 | 0.30 | 0.64 | ... | 0.13 | 0.17 | 0.02 | 0.01 | 0.72 | 0.0 | 0.62 | 0.15 | 0.07 | Hamdentown |
| 316 | 0.16 | 0.37 | 0.25 | 0.69 | 0.04 | 0.25 | 0.35 | 0.50 | 0.31 | 0.54 | ... | 0.32 | 0.18 | 0.08 | 0.06 | 0.78 | 0.0 | 0.91 | 0.28 | 0.23 | Waterburytown |
| 317 | 0.08 | 0.51 | 0.06 | 0.87 | 0.22 | 0.10 | 0.58 | 0.74 | 0.63 | 0.41 | ... | 0.38 | 0.33 | 0.02 | 0.02 | 0.79 | 0.0 | 0.22 | 0.18 | 0.19 | Walthamcity |


In [59]:
X = df.drop(127, axis=1)
X = X.drop('location', axis = 1)
y = df[127]

In [60]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1, test_size = 0.2)

In [61]:
def k_fold_eval(model):
    """Run 5-fold CV on the training set, then refit and score on the test set.

    Relies on the global X_train, X_test, y_train and y_test defined above.
    """
    kf = KFold(n_splits = 5)
    RMSE_list = []

    for train, val in kf.split(X_train):
        # Split the training data into this fold's train and validation parts
        train_features = X_train.iloc[train]
        train_target = y_train.iloc[train]

        val_features = X_train.iloc[val]
        val_target = y_train.iloc[val]

        # Fit on the fold's training part and predict the held-out part
        ml_model = model.fit(train_features, train_target)
        prediction = ml_model.predict(val_features)

        rmse_score = np.sqrt(mse(val_target, prediction))
        RMSE_list.append(rmse_score)

    print('RMSE Scores:')
    print(RMSE_list)
    print('')
    print(f'Average RMSE Score: {np.mean(RMSE_list)}')
    print('')

    # Refit on the full training set and evaluate once on the test set
    ml_model_final = model.fit(X_train, y_train)
    test_prediction = ml_model_final.predict(X_test)
    rmse_final = np.sqrt(mse(y_test, test_prediction))

    print(f'RMSE Evaluate on Test Set: {rmse_final}')
    return ml_model_final

In [62]:
linear_reg = k_fold_eval(LinearRegression())

RMSE Scores:
[0.20371217588591764, 0.2596038679337269, 0.24344012193554357, 0.25753517046740365, 0.28869148341738304]

Average RMSE Score: 0.25059656392799495

RMSE Evaluate on Test Set: 0.23422645103031733


Okay, with Linear Regression we get an average RMSE of 0.25 (in cross-validation) and an RMSE of 0.23 on the test set.
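The manual loop in `k_fold_eval` can also be expressed with `cross_val_score`, which is imported earlier in this notebook but not used. The sketch below runs on synthetic data from `make_regression` (the names `X_demo` and `y_demo` are illustrative assumptions, not the crime data), so its scores are not comparable to the ones above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (not the crime dataset)
X_demo, y_demo = make_regression(n_samples=200, n_features=10,
                                 noise=5.0, random_state=0)

kf = KFold(n_splits=5)  # unshuffled 5-fold split, as in k_fold_eval
neg_rmse = cross_val_score(LinearRegression(), X_demo, y_demo, cv=kf,
                           scoring="neg_root_mean_squared_error")

# scikit-learn reports error metrics as negative scores, so flip the sign
rmse_scores = -neg_rmse

print("RMSE Scores:", rmse_scores)
print("Average RMSE Score:", rmse_scores.mean())
```

The manual loop remains useful when extra per-fold logic is needed; `cross_val_score` is simply shorter when only the scores matter.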

In [63]:
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.linear_model import ElasticNet

In [64]:
ridge_reg = k_fold_eval(Ridge())

RMSE Scores:
[0.1538000424041132, 0.21349040597832789, 0.18486171017651526, 0.16674280558306057, 0.19832156274921264]

Average RMSE Score: 0.1834433053782459

RMSE Evaluate on Test Set: 0.16683820768815263


In [65]:
lasso_reg = k_fold_eval(Lasso())

RMSE Scores:
[0.26494020130075124, 0.3072293744166097, 0.2959851861025427, 0.2707031627558652, 0.2784957810065473]

Average RMSE Score: 0.28347074111646325

RMSE Evaluate on Test Set: 0.24938730808883183


In [66]:
elastic_net = k_fold_eval(ElasticNet())

RMSE Scores:
[0.26494020130075124, 0.3072293744166097, 0.2959851861025427, 0.2707031627558652, 0.2784957810065473]

Average RMSE Score: 0.28347074111646325

RMSE Evaluate on Test Set: 0.24938730808883183


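It turns out that, on this data, the most effective regularization technique is Ridge Regression: its average cross-validation RMSE is 0.18, versus 0.25 for plain Linear Regression and 0.28 for Lasso and ElasticNet with their default settings.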
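The Lasso and ElasticNet runs above report exactly the same scores. One plausible explanation, not verified on the crime data here, is that the default `alpha=1.0` is so strong for features scaled to [0, 1] that every coefficient is driven to zero, leaving both models to predict the training mean. The sketch below reproduces that effect on synthetic data (`X_demo` and `y_demo` are made up) and shows that a smaller `alpha`, chosen arbitrarily for illustration, keeps some coefficients.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(0)
# Features in [0, 1]; only the first two columns drive the target
X_demo = rng.uniform(0, 1, size=(200, 20))
y_demo = 0.3 * X_demo[:, 0] + 0.2 * X_demo[:, 1] + rng.normal(0, 0.05, 200)

lasso_default = Lasso().fit(X_demo, y_demo)          # alpha=1.0
enet_default = ElasticNet().fit(X_demo, y_demo)      # alpha=1.0, l1_ratio=0.5
lasso_small = Lasso(alpha=0.001).fit(X_demo, y_demo)

print("Non-zero coefs, Lasso default:     ", np.count_nonzero(lasso_default.coef_))
print("Non-zero coefs, ElasticNet default:", np.count_nonzero(enet_default.coef_))
print("Non-zero coefs, Lasso alpha=0.001: ", np.count_nonzero(lasso_small.coef_))
```

If the same check on `lasso_reg.coef_` and `elastic_net.coef_` shows only zeros, tuning `alpha` would be the natural next step before concluding that Lasso and ElasticNet perform worse than Ridge.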

How does Ridge Regression differ from Linear Regression? Recall from the theory session that Ridge Regression adds a penalty on the size of the regression coefficients, which pushes them to be as small as possible. Let's compare the regression coefficients from Ridge and from Linear Regression!

In [67]:
np.abs(ridge_reg.coef_).sum()

6.339058904227935

In [68]:
np.abs(linear_reg.coef_).sum()

148.75334563710373
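The gap between the two sums above reflects the shrinkage that the Ridge penalty applies. As a rough illustration on synthetic data (not the crime data; `X_demo`, `y_demo` and the `alpha` values are arbitrary assumptions), the sketch below shows the coefficient norm falling as `alpha` grows. It tracks the L2 norm, which is the quantity Ridge penalizes, alongside the sum of absolute values used in the cells above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge

# Synthetic data with relatively many features per row
X_demo, y_demo = make_regression(n_samples=60, n_features=40,
                                 noise=10.0, random_state=0)

ols_coef = LinearRegression().fit(X_demo, y_demo).coef_
print(f"LinearRegression: sum|coef| = {np.abs(ols_coef).sum():.2f}, "
      f"L2 norm = {np.linalg.norm(ols_coef):.2f}")

l2_norms = []
for alpha in [0.1, 1.0, 10.0, 100.0]:
    coef = Ridge(alpha=alpha).fit(X_demo, y_demo).coef_
    l2_norms.append(np.linalg.norm(coef))
    print(f"Ridge(alpha={alpha}): sum|coef| = {np.abs(coef).sum():.2f}, "
          f"L2 norm = {l2_norms[-1]:.2f}")
```

A larger `alpha` means stronger shrinkage; the notebook's `Ridge()` call uses the default `alpha=1.0`.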

### Part 2: Random Forest Tuning

For Random Forest, we will use the Apartment dataset, which has already been cleaned of anomalies.

In [69]:
apt = pd.read_csv('cleaned_apt_data.csv')

In [70]:
apt.head()

| | No_Rooms | Bathroom | Longitude | Latitude | Furnished | Area | Total_Facilities | AnnualPrice |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 106.819159 | -6.226598 | 1 | 43.0 | 23 | 96000000 |
| 1 | 2 | 1 | 106.756061 | -6.192081 | 0 | 35.0 | 19 | 30000000 |
| 2 | 2 | 1 | 106.757651 | -6.186415 | 1 | 53.0 | 22 | 70000000 |
| 3 | 2 | 2 | 106.7846 | -6.272637 | 1 | 85.0 | 24 | 576000000 |
| 4 | 2 | 1 | 106.796056 | -6.153652 | 0 | 48.0 | 15 | 32000000 |


In [71]:
X = apt.drop('AnnualPrice', axis = 'columns')
y = apt['AnnualPrice']

In [72]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

base_model = RandomForestRegressor()

In [73]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1, test_size = 0.2)

In [74]:
base_model.fit(X_train, y_train)
base_pred = base_model.predict(X_test)
base_rmse = np.sqrt(mse(y_test, base_pred))
print('Base Model has RMSE:', base_rmse)
print('Base Model has R2-Score:', r2_score(y_test, base_pred))

Base Model has RMSE: 28704746.28643344
Base Model has R2-Score: 0.9017898035289309


### 1. Randomized Search

In [75]:
from sklearn.model_selection import RandomizedSearchCV

# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
# Number of features to consider at every split
# ('auto' was removed in scikit-learn 1.3; on newer versions use 1.0, i.e. all features)
max_features = ['auto', 'sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]

# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf}

print(random_grid)

{'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000], 'max_features': ['auto', 'sqrt'], 'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, None], 'min_samples_split': [2, 5, 10], 'min_samples_leaf': [1, 2, 4]}
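To get a sense of how much of this grid a random search can cover, the sketch below counts the number of distinct combinations by multiplying the number of options for each hyperparameter. It rebuilds the same grid so that it runs on its own; the variable names `n_combinations` and `n_iter` are introduced here only for illustration.

```python
from math import prod

import numpy as np

# Same grid as above, rebuilt so the snippet is self-contained
random_grid = {
    'n_estimators': [int(x) for x in np.linspace(start=200, stop=2000, num=10)],
    'max_features': ['auto', 'sqrt'],
    'max_depth': [int(x) for x in np.linspace(10, 110, num=11)] + [None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
}

# Total combinations = product of the number of options per hyperparameter
n_combinations = prod(len(values) for values in random_grid.values())
n_iter = 100  # number of candidates a random search would sample

print(f"Total combinations: {n_combinations}")
print(f"Share sampled with n_iter={n_iter}: {n_iter / n_combinations:.1%}")
```

With 2,160 combinations and 5-fold cross-validation, an exhaustive search would need 10,800 fits, which is why sampling a subset is attractive here.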


In [77]:
# Use the random grid to search for best hyperparameters
# First create the base model to tune
rf = RandomForestRegressor()
# Random search of parameters, using 5 fold cross validation, 
# search across 100 different combinations, and use all available cores
rf_random = RandomizedSearchCV(estimator=rf, param_distributions=random_grid,
                              n_iter = 100, cv = 5, verbose=3, random_state=42, n_jobs=-1,
                              return_train_score=True)

# Fit the random search model
rf_random.fit(X_train, y_train)

Fitting 5 folds for each of 100 candidates, totalling 500 fits



In [78]:
rf_random.best_params_

{'n_estimators': 1000,
 'min_samples_split': 2,
 'min_samples_leaf': 1,
 'max_features': 'auto',
 'max_depth': 40}
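Beyond `best_params_`, the fitted search object stores the scores of every sampled candidate in `cv_results_`. The sketch below shows one way to inspect them as a table. It uses a deliberately tiny grid and synthetic data (`X_demo`, `y_demo` and the grid values are illustrative assumptions) so that it runs quickly; the same lines should apply to `rf_random`.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Small synthetic problem so the search finishes in seconds
X_demo, y_demo = make_regression(n_samples=150, n_features=5,
                                 noise=10.0, random_state=0)

search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_distributions={'n_estimators': [10, 20, 30],
                         'max_depth': [3, 5, None]},
    n_iter=4, cv=3, random_state=42, return_train_score=True,
)
search.fit(X_demo, y_demo)

# One row per sampled candidate, best candidate first
results = pd.DataFrame(search.cv_results_)
summary = results[['params', 'mean_train_score',
                   'mean_test_score', 'rank_test_score']]
summary = summary.sort_values('rank_test_score')
print(summary)
```

A large gap between `mean_train_score` and `mean_test_score` for a candidate is a hint that it overfits; those train scores are only available because `return_train_score=True` was set.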

In [79]:
new_pred = rf_random.best_estimator_.predict(X_test)
new_rmse = np.sqrt(mse(y_test, new_pred))
print('New Model has RMSE:', new_rmse)
print('New Model has R2-Score:', r2_score(y_test, new_pred))

New Model has RMSE: 28077739.234876532
New Model has R2-Score: 0.906033418555545


In [81]:
print('Improvement of:', ((base_rmse - new_rmse)/base_rmse)*100,'%')

Improvement of: 2.1843323236521566 %


### 2. Grid Search

In [89]:
from sklearn.model_selection import GridSearchCV

grid_search_params = {
    'max_depth': [2,5,10],
    'min_samples_leaf': [5,50,100],
    'n_estimators': [10,50,200]
}

# Instantiate the grid search model
rf_grid = GridSearchCV(estimator=rf, param_grid=grid_search_params,
                            cv = 5, n_jobs=-1, verbose=3, return_train_score=True)

rf_grid.fit(X_train, y_train)

Fitting 5 folds for each of 27 candidates, totalling 135 fits


GridSearchCV(cv=5, estimator=RandomForestRegressor(), n_jobs=-1,
             param_grid={'max_depth': [2, 5, 10],
                         'min_samples_leaf': [5, 50, 100],
                         'n_estimators': [10, 50, 200]},
             return_train_score=True, verbose=3)

In [90]:
rf_grid.best_params_

{'max_depth': 10, 'min_samples_leaf': 5, 'n_estimators': 200}

In [91]:
new_pred_grid = rf_grid.best_estimator_.predict(X_test)

new_rmse_grid = np.sqrt(mse(y_test, new_pred_grid))

print('New Model has RMSE:', new_rmse_grid)
print('New Model has R2-Score:', r2_score(y_test, new_pred_grid))

New Model has RMSE: 31265074.756521944
New Model has R2-Score: 0.8834886867200987


In [92]:
print('Improvement of:', ((base_rmse - new_rmse_grid)/base_rmse)*100,'%')

Improvement of: -8.919530047539832 %
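The negative improvement is worth a second look. The grid above caps `max_depth` at 10 and starts `min_samples_leaf` at 5, so the defaults used by the base model (`max_depth=None`, `min_samples_leaf=1`) are not among the candidates. A possible remedy, sketched below on synthetic data with an illustrative grid (`X_demo`, `y_demo` and the values are assumptions, not a tuned recommendation), is to include the defaults in the grid so the search can fall back on them.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Small synthetic problem so the search finishes quickly
X_demo, y_demo = make_regression(n_samples=150, n_features=5,
                                 noise=10.0, random_state=0)

# Grid that also contains the RandomForestRegressor defaults
# (max_depth=None, min_samples_leaf=1)
param_grid = {
    'max_depth': [2, 5, 10, None],
    'min_samples_leaf': [1, 5, 50],
}

grid = GridSearchCV(
    estimator=RandomForestRegressor(n_estimators=20, random_state=0),
    param_grid=param_grid, cv=3,
)
grid.fit(X_demo, y_demo)

print("Best params:", grid.best_params_)
print("Best mean CV R2:", grid.best_score_)
```

Comparing `best_score_` (a cross-validated score) between searches is also less noisy than comparing single test-set RMSE values.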
