# Lab 8: Implement Your Machine Learning Project Plan

In this lab assignment, you will implement the machine learning project plan you created in the written assignment. You will:

1. Load your data set and save it to a Pandas DataFrame.
2. Perform exploratory data analysis on your data to determine which feature engineering and data preparation techniques you will use.
3. Prepare your data for your model and create features and a label.
4. Fit your model to the training data and evaluate your model.
5. Improve your model by performing model selection and/or feature selection techniques to find the best model for your problem.

### Import Packages

Before you get started, import a few packages.

In [1]:
import pandas as pd
import numpy as np

import os 
import matplotlib.pyplot as plt
import seaborn as sns

<b>Task:</b> In the code cell below, import additional packages that you have used in this course that you will need for this task.

In [2]:
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.ensemble import StackingRegressor
from sklearn.model_selection import GridSearchCV


## Part 1: Load the Data Set


You have chosen to work with one of four data sets. The data sets are located in a folder named "data." The file names of the four data sets are as follows:

* The "adult" data set that contains Census information from 1994 is located in file `adultData.csv`
* The Airbnb NYC "listings" data set is located in file `airbnbListingsData.csv`
* The World Happiness Report (WHR) data set is located in file `WHR2018Chapter2OnlineData.csv`
* The book review data set is located in file `bookReviewsData.csv`



<b>Task:</b> In the code cell below, use the same method you have been using to load your data using `pd.read_csv()` and save it to DataFrame `df`.

In [3]:
airbnbDataSet_filename = os.path.join(os.getcwd(), "data", "airbnbListingsData.csv")
df = pd.read_csv(airbnbDataSet_filename)


## Part 2: Exploratory Data Analysis

The next step is to inspect and analyze your data set with your machine learning problem and project plan in mind. 

This step will help you determine data preparation and feature engineering techniques you will need to apply to your data to build a balanced modeling data set for your problem and model. These data preparation techniques may include:
* addressing missingness, such as replacing missing values with means
* renaming features and labels
* finding and replacing outliers
* performing winsorization if needed
* performing one-hot encoding on categorical features
* performing vectorization for an NLP problem
* addressing class imbalance in your data sample to promote fair AI
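
As an illustration of one technique from the list above, here is a minimal winsorization sketch. The helper name `winsorize_series`, the quantile cutoffs, and the toy prices are assumptions made for the example; they are not part of the lab or the data set.

```python
import pandas as pd


def winsorize_series(s: pd.Series, lower: float = 0.01, upper: float = 0.99) -> pd.Series:
    """Clip values outside the [lower, upper] quantile range to the quantile values."""
    low_value, high_value = s.quantile(lower), s.quantile(upper)
    return s.clip(lower=low_value, upper=high_value)


# Toy nightly prices with one extreme value (made-up numbers).
prices = pd.Series([50, 60, 70, 80, 90, 100, 110, 120, 130, 5000], dtype=float)

# Hypothetical cutoffs: clip everything below the 10th and above the 90th percentile.
winsorized = winsorize_series(prices, lower=0.1, upper=0.9)

print("max before:", prices.max())
print("max after: ", winsorized.max())
```

Winsorization keeps every row and only caps the extreme values, which can be preferable to dropping rows when the data set is small.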


Think of the different techniques you have used to inspect and analyze your data in this course. These include using Pandas to apply data filters, using the Pandas `describe()` method to get insight into key statistics for each column, using the Pandas `dtypes` property to inspect the data type of each column, and using Matplotlib and Seaborn to detect outliers and visualize relationships between features and labels. If you are working on a classification problem, use techniques you have learned to determine if there is class imbalance.
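
The inspection techniques described above can be sketched on a small stand-in DataFrame. The column names mirror the Airbnb data, but the values are made up, and the 1.5 × IQR rule is just one common convention for flagging outliers.

```python
import pandas as pd

# Toy frame standing in for a real data set (values are made up).
toy = pd.DataFrame({
    "price": [70.0, 115.0, 180.0, 95.0, 1000.0, 120.0],
    "room_type": ["Private room", "Entire home/apt", "Entire home/apt",
                  "Private room", "Entire home/apt", "Shared room"],
})

print(toy.dtypes)                       # data type of each column
print(toy.describe())                   # key statistics for numeric columns
print(toy["room_type"].value_counts())  # frequency of each category

# Flag outliers with the 1.5 * IQR rule.
q1, q3 = toy["price"].quantile([0.25, 0.75])
iqr = q3 - q1
outlier_mask = (toy["price"] < q1 - 1.5 * iqr) | (toy["price"] > q3 + 1.5 * iqr)
print(toy.loc[outlier_mask, "price"])
```

For a classification problem, the same `value_counts()` call on the label column is a quick way to check for class imbalance.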


<b>Task</b>: Use the techniques you have learned in this course to inspect and analyze your data. 

<b>Note</b>: You can add code cells if needed by going to the <b>Insert</b> menu and clicking on <b>Insert Cell Below</b> in the drop-down menu.

In [4]:
print("data features: \n")
print(list(df.columns),len(list(df.columns))) 

data features: 

['name', 'description', 'neighborhood_overview', 'host_name', 'host_location', 'host_about', 'host_response_rate', 'host_acceptance_rate', 'host_is_superhost', 'host_listings_count', 'host_total_listings_count', 'host_has_profile_pic', 'host_identity_verified', 'neighbourhood_group_cleansed', 'room_type', 'accommodates', 'bathrooms', 'bedrooms', 'beds', 'amenities', 'price', 'minimum_nights', 'maximum_nights', 'minimum_minimum_nights', 'maximum_minimum_nights', 'minimum_maximum_nights', 'maximum_maximum_nights', 'minimum_nights_avg_ntm', 'maximum_nights_avg_ntm', 'has_availability', 'availability_30', 'availability_60', 'availability_90', 'availability_365', 'number_of_reviews', 'number_of_reviews_ltm', 'number_of_reviews_l30d', 'review_scores_rating', 'review_scores_cleanliness', 'review_scores_checkin', 'review_scores_communication', 'review_scores_location', 'review_scores_value', 'instant_bookable', 'calculated_host_listings_count', 'calculated_host_listings_count_

In [5]:
# drop irrelevant features
irrelevantColumns = ["name", "host_name", "neighborhood_overview", "host_total_listings_count",
                     "calculated_host_listings_count", "description", "has_availability",
                     "host_has_profile_pic", "host_about", "amenities", "host_location"]
featuresToWorkWith = df.drop(columns=irrelevantColumns)
len(featuresToWorkWith.columns)
featuresToWorkWith.head()


| | host_response_rate | host_acceptance_rate | host_is_superhost | host_listings_count | host_identity_verified | neighbourhood_group_cleansed | room_type | accommodates | bathrooms | bedrooms | ... | review_scores_checkin | review_scores_communication | review_scores_location | review_scores_value | instant_bookable | calculated_host_listings_count_entire_homes | calculated_host_listings_count_private_rooms | calculated_host_listings_count_shared_rooms | reviews_per_month | n_host_verifications |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.8 | 0.17 | True | 8.0 | True | Manhattan | Entire home/apt | 1 | 1.0 | NaN | ... | 4.76 | 4.79 | 4.86 | 4.41 | False | 3 | 0 | 0 | 0.33 | 9 |
| 1 | 0.09 | 0.69 | True | 1.0 | True | Brooklyn | Entire home/apt | 3 | 1.0 | 1.0 | ... | 4.78 | 4.8 | 4.71 | 4.64 | False | 1 | 0 | 0 | 4.86 | 6 |
| 2 | 1.0 | 0.25 | True | 1.0 | True | Brooklyn | Entire home/apt | 4 | 1.5 | 2.0 | ... | 5.0 | 5.0 | 4.5 | 5.0 | False | 1 | 0 | 0 | 0.02 | 3 |
| 3 | 1.0 | 1.0 | True | 1.0 | True | Manhattan | Private room | 2 | 1.0 | 1.0 | ... | 4.66 | 4.42 | 4.87 | 4.36 | False | 0 | 1 | 0 | 3.68 | 4 |
| 4 | NaN | NaN | True | 1.0 | True | Manhattan | Private room | 1 | 1.0 | 1.0 | ... | 4.97 | 4.95 | 4.94 | 4.92 | False | 0 | 1 | 0 | 0.87 | 7 |


In [6]:
featuresToWorkWith.describe()


| | host_response_rate | host_acceptance_rate | host_listings_count | accommodates | bathrooms | bedrooms | beds | price | minimum_nights | maximum_nights | ... | review_scores_cleanliness | review_scores_checkin | review_scores_communication | review_scores_location | review_scores_value | calculated_host_listings_count_entire_homes | calculated_host_listings_count_private_rooms | calculated_host_listings_count_shared_rooms | reviews_per_month | n_host_verifications |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 16179.0 | 16909.0 | 28022.0 | 28022.0 | 28022.0 | 25104.0 | 26668.0 | 28022.0 | 28022.0 | 28022.0 | ... | 28022.0 | 28022.0 | 28022.0 | 28022.0 | 28022.0 | 28022.0 | 28022.0 | 28022.0 | 28022.0 | 28022.0 |
| mean | 0.906901 | 0.791953 | 14.554778 | 2.874491 | 1.142174 | 1.329708 | 1.629556 | 154.228749 | 18.689387 | 78695.41 | ... | 4.613352 | 4.8143 | 4.808041 | 4.750393 | 4.64767 | 5.562986 | 3.902077 | 0.048283 | 1.758325 | 5.16951 |
| std | 0.227282 | 0.276732 | 120.721287 | 1.860251 | 0.421132 | 0.700726 | 1.097104 | 140.816605 | 25.569151 | 12829730.0 | ... | 0.573891 | 0.438603 | 0.464585 | 0.415717 | 0.518023 | 26.121426 | 17.972386 | 0.442459 | 4.446143 | 2.028497 |
| min | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 29.0 | 1.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.01 | 1.0 |
| 25% | 0.94 | 0.68 | 1.0 | 2.0 | 1.0 | 1.0 | 1.0 | 70.0 | 2.0 | 40.0 | ... | 4.5 | 4.81 | 4.81 | 4.67 | 4.55 | 0.0 | 0.0 | 0.0 | 0.13 | 4.0 |
| 50% | 1.0 | 0.91 | 1.0 | 2.0 | 1.0 | 1.0 | 1.0 | 115.0 | 30.0 | 1124.0 | ... | 4.8 | 4.96 | 4.97 | 4.88 | 4.78 | 1.0 | 0.0 | 0.0 | 0.51 | 5.0 |
| 75% | 1.0 | 1.0 | 3.0 | 4.0 | 1.0 | 1.0 | 2.0 | 180.0 | 30.0 | 1125.0 | ... | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 1.0 | 1.0 | 0.0 | 1.83 | 7.0 |
| max | 1.0 | 1.0 | 3387.0 | 16.0 | 8.0 | 12.0 | 21.0 | 1000.0 | 1250.0 | 2147484000.0 | ... | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 308.0 | 359.0 | 8.0 | 141.0 | 13.0 |


In [7]:
# count the null values in each column to decide whether to fill or drop them
print(featuresToWorkWith.isnull().sum())



host_response_rate                              11843
host_acceptance_rate                            11113
host_is_superhost                                   0
host_listings_count                                 0
host_identity_verified                              0
neighbourhood_group_cleansed                        0
room_type                                           0
accommodates                                        0
bathrooms                                           0
bedrooms                                         2918
beds                                             1354
price                                               0
minimum_nights                                      0
maximum_nights                                      0
minimum_minimum_nights                              0
maximum_minimum_nights                              0
minimum_maximum_nights                              0
maximum_maximum_nights                              0
minimum_nights_avg_ntm      

In [8]:
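# fill missing bed and bedroom counts with 0; assign the result back rather than
# calling fillna(..., inplace=True) on a column selection, which newer pandas versions warn about
featuresToWorkWith['beds'] = featuresToWorkWith['beds'].fillna(0)
featuresToWorkWith['bedrooms'] = featuresToWorkWith['bedrooms'].fillna(0)

# impute these two with the median instead of dropping them, because these numerical features are important for predicting the label
featuresToWorkWith['host_response_rate'] = featuresToWorkWith['host_response_rate'].fillna(featuresToWorkWith['host_response_rate'].median())
featuresToWorkWith['host_acceptance_rate'] = featuresToWorkWith['host_acceptance_rate'].fillna(featuresToWorkWith['host_acceptance_rate'].median())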
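
The median imputation used for the rate columns can be sketched in isolation on toy data. The values below are made up, and the extra `host_response_rate_missing` flag column is an optional addition for illustration, not something this notebook does.

```python
import numpy as np
import pandas as pd

# Toy column with missing values (made-up numbers).
rates = pd.DataFrame({"host_response_rate": [1.0, np.nan, 0.8, 0.9, np.nan]})

# Optional: record which rows were imputed before overwriting the NaNs.
rates["host_response_rate_missing"] = rates["host_response_rate"].isna()

# median() skips NaN by default, so it is computed from the observed values only.
median_rate = rates["host_response_rate"].median()

# Assign the filled column back to the DataFrame.
rates["host_response_rate"] = rates["host_response_rate"].fillna(median_rate)

print(rates)
```

The median is less sensitive to extreme values than the mean, which is one reason to prefer it for skewed columns.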


In [9]:
# check that no missing values remain; .any().any() collapses the result to a single True/False
featuresToWorkWith.isnull().any().any()

False

In [10]:
#going to one-hot-encode these features
stringFeatures = featuresToWorkWith.select_dtypes(include=['object','boolean'])
stringFeaturesList = list(stringFeatures)
stringFeaturesList

['host_is_superhost',
 'host_identity_verified',
 'neighbourhood_group_cleansed',
 'room_type',
 'instant_bookable']

In [11]:
# One-hot encode:
featuresToWorkWith.drop(columns=stringFeaturesList, inplace=True)  # remove the original categorical columns before joining the encoded ones

stringFeatures_encoded = pd.get_dummies(stringFeatures, columns=stringFeaturesList, drop_first=False)

featuresToWorkWith = featuresToWorkWith.join(stringFeatures_encoded)
featuresToWorkWith

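| | host_response_rate | host_acceptance_rate | host_listings_count | accommodates | bathrooms | bedrooms | beds | price | minimum_nights | maximum_nights | ... | neighbourhood_group_cleansed_Brooklyn | neighbourhood_group_cleansed_Manhattan | neighbourhood_group_cleansed_Queens | neighbourhood_group_cleansed_Staten Island | room_type_Entire home/apt | room_type_Hotel room | room_type_Private room | room_type_Shared room | instant_bookable_False | instant_bookable_True |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.80 | 0.17 | 8.0 | 1 | 1.0 | 0.0 | 1.0 | 150.0 | 30 | 1125 | ... | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 1 | 0.09 | 0.69 | 1.0 | 3 | 1.0 | 1.0 | 3.0 | 75.0 | 1 | 730 | ... | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 2 | 1.00 | 0.25 | 1.0 | 4 | 1.5 | 2.0 | 2.0 | 275.0 | 5 | 1125 | ... | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 3 | 1.00 | 1.00 | 1.0 | 2 | 1.0 | 1.0 | 1.0 | 68.0 | 2 | 14 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| 4 | 1.00 | 0.91 | 1.0 | 1 | 1.0 | 1.0 | 1.0 | 75.0 | 2 | 14 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 28017 | 1.00 | 1.00 | 8.0 | 2 | 1.0 | 1.0 | 1.0 | 89.0 | 1 | 365 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 28018 | 0.91 | 0.89 | 0.0 | 6 | 1.0 | 2.0 | 2.0 | 1000.0 | 1 | 1 | ... | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 28019 | 0.99 | 0.99 | 6.0 | 2 | 2.0 | 1.0 | 1.0 | 64.0 | 1 | 10 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 28020 | 0.90 | 1.00 | 3.0 | 3 | 1.0 | 1.0 | 2.0 | 84.0 | 7 | 365 | ... | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |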
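
To make the one-hot encoding step concrete, here is a small sketch on made-up rows that compares `drop_first=False` (what the cell above uses) with `drop_first=True`. The toy frame and the `dtype=int` argument are assumptions for the example; `dtype=int` is there only so the columns come out as 0/1 regardless of pandas version.

```python
import pandas as pd

# Toy rows with one string feature and one boolean feature (made-up values).
toy = pd.DataFrame({
    "room_type": ["Private room", "Entire home/apt", "Shared room"],
    "instant_bookable": [True, False, True],
})
categorical = ["room_type", "instant_bookable"]

# One indicator column per category value.
full = pd.get_dummies(toy, columns=categorical, dtype=int)

# Drop the first level of each feature to avoid perfectly redundant columns.
reduced = pd.get_dummies(toy, columns=categorical, drop_first=True, dtype=int)

print(list(full.columns))
print(list(reduced.columns))
```

Redundant indicator columns matter mostly for linear models; tree-based models such as the ones used later in this notebook are generally less sensitive to them.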


# At this point the data has been cleaned and preprocessed
# ------------------------------------------------------------------------

## Part 3: Implement Your Project Plan

<b>Task:</b> Use the rest of this notebook to carry out your project plan. You will:

1. Prepare your data for your model and create features and a label.
2. Fit your model to the training data and evaluate your model.
3. Improve your model by performing model selection and/or feature selection techniques to find the best model for your problem.


Add code cells below and populate the notebook with commentary, code, analyses, results, and figures as you see fit.

In [13]:
y = featuresToWorkWith['price']
X = featuresToWorkWith.drop(columns = 'price', axis = 1)


In [27]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1234)

# Random Forest:

In [28]:
randomForest_model = RandomForestRegressor()

randomForest_model.fit(X_train,y_train)

y_randomForest_pred = randomForest_model.predict(X_test)

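# np.sqrt of the MSE works across scikit-learn versions; the squared=False argument is deprecated in newer releases
randomForest_rmse = np.sqrt(mean_squared_error(y_test, y_randomForest_pred))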

randomForest_r2 = r2_score(y_test, y_randomForest_pred)



print("Random Forest Model:")
print('Root Mean Squared Error: ', randomForest_rmse)
print('R2: ', randomForest_r2)  

Random Forest Model:
Root Mean Squared Error:  89.7069630595804
R2:  0.6187717907199021
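
For reference, both metrics printed above can be computed by hand. The sketch below uses four made-up target values and predictions (not results from this notebook) and checks the manual formulas against scikit-learn.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up targets and predictions, purely for illustration.
y_true = np.array([100.0, 150.0, 200.0, 250.0])
y_pred = np.array([110.0, 140.0, 210.0, 230.0])

# RMSE: square root of the mean squared residual, in the same units as the label.
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))

# R^2: 1 minus (residual sum of squares / total sum of squares around the mean).
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_pred))
r2_sklearn = r2_score(y_true, y_pred)

print(rmse_manual, rmse_sklearn)
print(r2_manual, r2_sklearn)
```

Because RMSE is in the units of the label, an RMSE near 90 here means a typical error of roughly 90 price units per listing.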


# Gradient Boosting:

In [29]:
gradientBoostdt_model = GradientBoostingRegressor()

gradientBoostdt_model.fit(X_train,y_train)

y_gradientBoostdt_pred = gradientBoostdt_model.predict(X_test)

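# np.sqrt of the MSE works across scikit-learn versions; the squared=False argument is deprecated in newer releases
gradientBoostdt_rmse = np.sqrt(mean_squared_error(y_test, y_gradientBoostdt_pred))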

gradientBoostdt_r2 = r2_score(y_test, y_gradientBoostdt_pred)


print("Gradient Boosted Model:")
print('Root Mean Squared Error: ', gradientBoostdt_rmse)
print('R2: ', gradientBoostdt_r2)             


Gradient Boosted Model:
Root Mean Squared Error:  94.12155522893467
R2:  0.580327115551816


 We can see that the <b>Random Forest</b> performs better than <b>Gradient Boosted DT</b> on the test set: its RMSE is lower (89.71 vs. 94.12) and its R2 is higher (0.619 vs. 0.580).

# Now to improve the models:
- I use stacking, with Random Forest and Gradient Boosting as the base models


# Stacking:

In [17]:
estimators = [("RF", RandomForestRegressor()),("GBDT", GradientBoostingRegressor())]
stacking_model = StackingRegressor(estimators = estimators, cv = 5, passthrough = False)

In [20]:
print('Executing Stacking...')


cvScore = cross_val_score(stacking_model, X_train,y_train, cv = 3, scoring = 'neg_root_mean_squared_error', n_jobs=-1 )
stacking_rmse = (-1* cvScore).mean()

print('RMSE average : ', stacking_rmse)


Executing Stacking...
RMSE average :  89.2258317580845


In [31]:
#results so far:
print("Random Forest Model:")
print('Root Mean Squared Error: ', randomForest_rmse)
print('R2: ', randomForest_r2,'\n')  

print("Gradient Boosted Model:")
print('Root Mean Squared Error: ', gradientBoostdt_rmse)
print('R2: ', gradientBoostdt_r2,'\n')  


print("Stacking Model:")
print('RMSE average : ' , stacking_rmse)


Random Forest Model:
Root Mean Squared Error:  89.7069630595804
R2:  0.6187717907199021 

Gradient Boosted Model:
Root Mean Squared Error:  94.12155522893467
R2:  0.580327115551816 

Stacking Model:
RMSE average :  89.2258317580845


# --------------------------------------------------------------------
# Analysis of results:


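1. The stacking model is only slightly better than RF, with one caveat: the stacking RMSE is a 3-fold cross-validation average on the training set, while the RF RMSE is measured on the held-out test set, so the two numbers are not strictly comparable.
 - Stacking RMSE = 89.<b>23</b> 
 - Random Forest RMSE = 89.<b>71</b>
<br>
<br>

2. Random Forest is better than GBDT in terms of the <b>R^2</b> metric (the closer to 1, the more of the variance in price the model explains), and it also has the lower RMSE (89.71 vs. 94.12).
 - RF: <b>0.619</b>
 - GBDT: <b>0.580</b>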
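
One way to put all three models on the same footing is to fit each on the training split and score it on the same held-out test split. The sketch below does this on synthetic data from `make_regression`, so the numbers it prints say nothing about the Airbnb results; the sample size, `n_estimators=50`, and the seeds are arbitrary choices for the example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import (GradientBoostingRegressor, RandomForestRegressor,
                              StackingRegressor)
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for the real features and label.
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=1234)

models = {
    "RF": RandomForestRegressor(n_estimators=50, random_state=0),
    "GBDT": GradientBoostingRegressor(random_state=0),
    "Stacking": StackingRegressor(
        estimators=[("RF", RandomForestRegressor(n_estimators=50, random_state=0)),
                    ("GBDT", GradientBoostingRegressor(random_state=0))],
        cv=5,
    ),
}

# Fit every model on the same training split and score it on the same test split.
test_rmse = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    test_rmse[name] = float(np.sqrt(mean_squared_error(y_te, model.predict(X_te))))
    print(f"{name}: test RMSE = {test_rmse[name]:.2f}")
```

Fixing `random_state` on the estimators also makes the comparison repeatable from run to run.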
 

# --------------------------------------------------------------------

 # Conclusion:
 <br>

- Although my final results weren't as good as expected, I tried stacking to combine the random forest and gradient boosting models, which gave the lowest average RMSE. 
<br>
- I also ran GridSearchCV and used best_estimator_ to find the best random forest model, but decided not to include it in the final submission because its RMSE was as high as the one I got for GBDT. 
<br>

- Some possible causes of the high RMSE: my data preprocessing may not have been thorough enough, and the data may need a more detailed analysis that is beyond my current skill set.
<br>

- A good takeaway is that the Random Forest's R^2 of 0.62 means the model explains about 62% of the variance in price on the test set, which suggests a moderate goodness of fit. 
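
The GridSearchCV experiment mentioned in the conclusion was not included in the notebook, so its actual grid is unknown. A minimal sketch of that kind of search, run on synthetic data with a hypothetical parameter grid, could look like this:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data standing in for the real features and label.
X_demo, y_demo = make_regression(n_samples=200, n_features=6, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=1234)

# Hypothetical grid: these values are assumptions, not the ones used for the project.
param_grid = {"n_estimators": [25, 50], "max_depth": [4, None]}

grid = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=3,
    scoring="neg_root_mean_squared_error",
)
grid.fit(X_tr, y_tr)

best_model = grid.best_estimator_  # refit on the full training split by default
cv_rmse = -grid.best_score_        # the scorer is negated, so flip the sign
test_rmse = float(np.sqrt(mean_squared_error(y_te, best_model.predict(X_te))))

print("Best parameters:", grid.best_params_)
print("Cross-validated RMSE:", cv_rmse)
print("Test RMSE:", test_rmse)
```

Reporting both the cross-validated RMSE and the test RMSE makes it easier to tell whether a tuned model actually generalizes better than the untuned one.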