# Capstone: Airbnb Price Listing Prediction
## Part 4 Model Tuning

_Authors: Evonne Tham_

In the previous notebook, XGBoost produced high R² scores of _[]_ and 0.908 on the train and validation sets respectively, and an RMSE of _[]_. Even so, the model still needs tuning: narrowing the _[]_ features down to a more manageable number should make the model more generalisable and make inferences about the data easier to draw.

This will be done using feature importances, which XGBoost exposes through the fitted model's built-in `feature_importances_` attribute. The resulting model will be used as the production model in the next notebook.

## Contents of this notebook
1. [Import Necessary Libraries and Load Data](#1.-Import-Necessary-Libraries-and-Load-Data)
2. [Re-training the Best Model (XGBoost)](#2.-Re-training-the-Best-Model-(XGBoost))
3. [Feature Selection](#3.-Feature-Selection)
4. [Re-training the XGBoost with Selected Features](#4.-Re-training-the-XGBoost-with-Selected-Features) 


## 1. Import Necessary Libraries and Load Data

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# modelling
import time
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler 
from sklearn.linear_model import LinearRegression, ElasticNetCV
from sklearn.svm import SVR
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error, r2_score

# from sklearn.metrics import explained_variance_score, mean_squared_error, r2_score, accuracy_score
# from xgboost import plot_importance


#Hide warnings
import warnings
warnings.filterwarnings('ignore')



In [None]:
# Load in Data 
train = pd.read_csv('../datasets/train.csv')

# #Set id as index 
# df.set_index('id', inplace=True)

print(f"Total Number of Listings: {train.shape[0]} | Total Number of Features: {train.shape[1]}")
train.head(4).T

---
## 2. Re-training the Best Model (XGBoost)

In [None]:
# Create X and y variables
# Use the public select_dtypes API instead of the private _get_numeric_data()
features = [col for col in train.select_dtypes(include='number').columns if col != 'price']
X = train[features]
y = train['price']

# Train/Validation Split
X_train, X_val, y_train, y_val = train_test_split(X, 
                                                  y, 
                                                  test_size=0.25,
                                                  random_state = 42) 

ss = StandardScaler()
X_train_ss = ss.fit_transform(X_train)
X_val_ss = ss.transform(X_val)

In [None]:
# Instantiate Best Model
xgb = XGBRegressor(gamma = 0.3, 
                    learning_rate = 0.1, 
                    max_depth = 5, 
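                    # n_estimators = ,  # value left blank in this draft
                    # reg_alpha = ,     # value left blank in this draft
                    # reg_lambda = ,    # value left blank in this draft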
                    subsample = 0.3)

In [None]:
# Fit Model
xgb.fit(X_train, y_train)

--- 
## 3. Feature Selection

### 3.1. Feature Importances

In [None]:
# # option 1 
# # ft_weights_xgb_reg = pd.DataFrame(xgb_reg.feature_importances_, columns=['weight'], index=X_train.columns)
# # ft_weights_xgb_reg.sort_values('weight', ascending=False, inplace=True)
# # ft_weights_xgb_reg.head(10)

# # option 2
# #Visualizing top features in our production model. 
# features = xgb.feature_importances_
# key_features = pd.Series(features, index=X.columns)
# sorted_features = key_features.sort_values(ascending=False).head(30)
# sorted_features.head(10)

In [None]:
# Visualizing top features in our production model.
# One row per feature, with the importance stored in a named 'weight' column
key_features = pd.DataFrame(xgb.feature_importances_, index = X.columns, columns = ['weight'])
key_features.sort_values('weight', ascending = False, inplace = True)
key_features.head(10)

In [None]:
# Plotting feature importances
plt.figure(figsize=(15,25))
plt.barh(key_features.index, key_features.weight, align='center') 
plt.title("Feature importances in the XGBoost model", fontsize=20)
plt.xlabel("Feature importance")
plt.margins(y=0.01)
plt.show()

### 3.2. Dropping Features
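
This step is not yet filled in. As a minimal sketch of one way to narrow the feature set, assuming the importances are held in a pandas Series indexed by feature name (for example `pd.Series(xgb.feature_importances_, index=X.columns)`), the hypothetical helper `select_top_features` below keeps the highest-ranked features until their cumulative importance reaches a chosen share. The helper name, the cut-off values, and the toy feature names are assumptions for illustration, not values from this project.

```python
import pandas as pd


def select_top_features(importances: pd.Series, cum_share: float = 0.95) -> list:
    """Return the top-ranked features whose cumulative importance first reaches `cum_share`."""
    if not 0 < cum_share <= 1:
        raise ValueError("cum_share must be in (0, 1]")
    ranked = importances.sort_values(ascending=False)
    # Normalise so the shares sum to 1 even if the raw importances do not
    shares = ranked / ranked.sum()
    cumulative = shares.cumsum()
    # Count the features that fall short of the target, then add the one that reaches it
    n_keep = int((cumulative < cum_share).sum()) + 1
    n_keep = min(n_keep, len(ranked))
    return ranked.index[:n_keep].tolist()


# Toy importances (made-up names and values, for illustration only)
toy = pd.Series({"feat_a": 0.50, "feat_b": 0.30, "feat_c": 0.15, "feat_d": 0.05})
print(select_top_features(toy, cum_share=0.90))
```

A cumulative-share rule is only one option; a fixed top-N cut or a minimum-importance threshold would work in the same way, and whichever rule is chosen should be checked against the validation scores.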

## 4. Re-training the XGBoost with Selected Features
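
This section is also still to be written. One possible shape for it, sketched under assumptions, is to fit a fresh model on all features and another on the selected subset, then compare R² and RMSE on the validation set. The helpers `r2_and_rmse` and `compare_feature_sets` are hypothetical names introduced here; the metrics are computed with NumPy so the sketch works with any estimator that has `fit` and `predict`.

```python
import numpy as np
import pandas as pd


def r2_and_rmse(y_true, y_pred):
    """Return (R^2, RMSE) for one set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residual = y_true - y_pred
    ss_res = float(np.sum(residual ** 2))
    # Assumes y_true is not constant, otherwise ss_tot is zero
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    return 1.0 - ss_res / ss_tot, float(np.sqrt(np.mean(residual ** 2)))


def compare_feature_sets(make_model, X_train, X_val, y_train, y_val, selected):
    """Fit a fresh model on all features and on `selected`, and tabulate the scores."""
    feature_sets = {"all features": list(X_train.columns), "selected": list(selected)}
    rows = {}
    for label, cols in feature_sets.items():
        model = make_model()  # new, unfitted estimator for each run
        model.fit(X_train[cols], y_train)
        train_r2, _ = r2_and_rmse(y_train, model.predict(X_train[cols]))
        val_r2, val_rmse = r2_and_rmse(y_val, model.predict(X_val[cols]))
        rows[label] = {"n_features": len(cols), "train_r2": train_r2,
                       "val_r2": val_r2, "val_rmse": val_rmse}
    return pd.DataFrame.from_dict(rows, orient="index")
```

With the objects from this notebook, a call could look like `compare_feature_sets(lambda: XGBRegressor(**best_params), X_train, X_val, y_train, y_val, selected)`, where `best_params` and `selected` are placeholders for the tuned hyperparameters and the reduced feature list. If the validation scores for the selected set stay close to those for all features, the smaller model is the better candidate for production.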

## 5. Learning Curve

In [None]:
# from sklearn.model_selection import learning_curve
# import matplotlib.pyplot as plt
# plt.style.use('ggplot')
# %matplotlib inline

# def plot_learning_curve(estimator, clf, X, y, ylim=None, cv=None, train_sizes=None):
#     plt.figure()
#     plt.title(f'Learning Curves ({clf})')
#     plt.ylim(*ylim)
#     plt.xlabel("Training examples")
#     plt.ylabel("Score")
#     train_sizes, train_scores, test_scores = learning_curve(
#         estimator, X, y, cv=cv, train_sizes=train_sizes)
#     train_scores_mean = np.mean(train_scores, axis=1)
#     train_scores_std = np.std(train_scores, axis=1)
#     test_scores_mean = np.mean(test_scores, axis=1)
#     test_scores_std = np.std(test_scores, axis=1)

#     plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
#                      train_scores_mean + train_scores_std, alpha=0.1,
#                      color="r")
#     plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
#                      test_scores_mean + test_scores_std, alpha=0.1, color="g")
#     plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
#              label="Training score")
#     plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
#              label="Cross-validation score")

#     plt.legend(loc="best")
#     plt.grid(True)
#     return

In [None]:
# train_sizes = np.linspace(.1, 1.0, 5)
# ylim = (0.9, 1.01)
# cv = 5

# plot_learning_curve(pipe_lr, "Linear Regression", X_val, y_val, 
#                     ylim=ylim, cv=cv, train_sizes=train_sizes)
# plot_learning_curve(pipe_enet, "ElasticNetCV", X_val, y_val, 
#                     ylim=ylim, cv=cv, train_sizes=train_sizes)
# plot_learning_curve(pipe_svr, "SVR", X_val, y_val, 
#                     ylim=ylim, cv=cv, train_sizes=train_sizes)
# plot_learning_curve(pipe_xgb, "XGBoost", X_val, y_val,
#                     ylim=ylim, cv=cv, train_sizes=train_sizes)

# plt.show()

----> Proceed to the next notebook for [Production Model](./05_Production_Model.ipynb)