# Lab 8: Define and Solve an ML Problem of Your Choosing

In [1]:
import pandas as pd
import numpy as np
import os 
import matplotlib.pyplot as plt
import seaborn as sns

In this lab assignment, you will follow the machine learning life cycle and implement a model to solve a machine learning problem of your choosing. You will select a data set and choose a predictive problem that the data set supports. You will then inspect the data with your problem in mind and begin to formulate a project plan. You will then implement the machine learning project plan.

You will complete the following tasks:

1. Build Your DataFrame
2. Define Your ML Problem
3. Perform exploratory data analysis to understand your data.
4. Define Your Project Plan
5. Implement Your Project Plan:
    * Prepare your data for your model.
    * Fit your model to the training data and evaluate your model.
    * Improve your model's performance.

## Part 1: Build Your DataFrame

You will have the option to choose one of four data sets that you have worked with in this program:

* The "census" data set that contains Census information from 1994: `censusData.csv`
* Airbnb NYC "listings" data set: `airbnbListingsData.csv`
* World Happiness Report (WHR) data set: `WHR2018Chapter2OnlineData.csv`
* Book Review data set: `bookReviewsData.csv`

Note that these are variations of the data sets that you have worked with in this program. For example, some do not include some of the preprocessing necessary for specific models. 

#### Load a Data Set and Save it as a Pandas DataFrame

The code cell below contains filenames (path + filename) for each of the four data sets available to you.

<b>Task:</b> In the code cell below, load your chosen data set with `pd.read_csv()`, as you have been doing, and save it to DataFrame `df`.

You can load each file as a new DataFrame to inspect the data before choosing your data set.

In [2]:
# File names of the four data sets
adultDataSet_filename = os.path.join(os.getcwd(), "data", "censusData.csv")
airbnbDataSet_filename = os.path.join(os.getcwd(), "data", "airbnbListingsData.csv")
WHRDataSet_filename = os.path.join(os.getcwd(), "data", "WHR2018Chapter2OnlineData.csv")
bookReviewDataSet_filename = os.path.join(os.getcwd(), "data", "bookReviewsData.csv")


df = pd.read_csv(airbnbDataSet_filename)

df.head()

Unnamed: 0,name,description,neighborhood_overview,host_name,host_location,host_about,host_response_rate,host_acceptance_rate,host_is_superhost,host_listings_count,...,review_scores_communication,review_scores_location,review_scores_value,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month,n_host_verifications
0,Skylit Midtown Castle,"Beautiful, spacious skylit studio in the heart...",Centrally located in the heart of Manhattan ju...,Jennifer,"New York, New York, United States",A New Yorker since 2000! My passion is creatin...,0.8,0.17,True,8.0,...,4.79,4.86,4.41,False,3,3,0,0,0.33,9
1,"Whole flr w/private bdrm, bath & kitchen(pls r...","Enjoy 500 s.f. top floor in 1899 brownstone, w...",Just the right mix of urban center and local n...,LisaRoxanne,"New York, New York, United States",Laid-back Native New Yorker (formerly bi-coast...,0.09,0.69,True,1.0,...,4.8,4.71,4.64,False,1,1,0,0,4.86,6
2,"Spacious Brooklyn Duplex, Patio + Garden",We welcome you to stay in our lovely 2 br dupl...,,Rebecca,"Brooklyn, New York, United States","Rebecca is an artist/designer, and Henoch is i...",1.0,0.25,True,1.0,...,5.0,4.5,5.0,False,1,1,0,0,0.02,3
3,Large Furnished Room Near B'way,Please don’t expect the luxury here just a bas...,"Theater district, many restaurants around here.",Shunichi,"New York, New York, United States",I used to work for a financial industry but no...,1.0,1.0,True,1.0,...,4.42,4.87,4.36,False,1,0,1,0,3.68,4
4,Cozy Clean Guest Room - Family Apt,"Our best guests are seeking a safe, clean, spa...",Our neighborhood is full of restaurants and ca...,MaryEllen,"New York, New York, United States",Welcome to family life with my oldest two away...,,,True,1.0,...,4.95,4.94,4.92,False,1,0,1,0,0.87,7


## Part 2: Define Your ML Problem

Next you will formulate your ML Problem. In the markdown cell below, answer the following questions:

1. List the data set you have chosen.
2. What will you be predicting? What is the label?
3. Is this a supervised or unsupervised learning problem? Is this a clustering, classification or regression problem? Is it a binary classification or multi-class classification problem?
4. What are your features? (note: this list may change after you explore your data)
5. Explain why this is an important problem. In other words, how would a company create value with a model that predicts this label?


1. I have chosen the Airbnb NYC listings data set (`airbnbListingsData.csv`).
2. I will predict the price of an Airbnb listing, so the `price` column is my label.
3. This is a supervised learning problem. It is a regression problem because the label is a continuous numeric value.
4. My initial features are accommodates, room_type, bathrooms, bedrooms, beds, review_scores_rating, review_scores_cleanliness, review_scores_checkin, review_scores_location, and review_scores_value.
5. With a model that predicts listing prices, hosts can set a competitive price instead of underpricing or overpricing their listings, which should increase their profit. Additionally, customers who can afford higher prices are more likely to pay them when a listing is priced in line with what it offers.

## Part 3: Understand Your Data

The next step is to perform exploratory data analysis. Inspect and analyze your data set with your machine learning problem in mind. Consider the following as you inspect your data:

1. What data preparation techniques would you like to use? These data preparation techniques may include:

    * addressing missingness, such as replacing missing values with means
    * finding and replacing outliers
    * renaming features and labels
    * performing feature engineering techniques such as one-hot encoding on categorical features
    * selecting appropriate features and removing irrelevant features
    * performing specific data cleaning and preprocessing techniques for an NLP problem
    * addressing class imbalance in your data sample to promote fair AI
    

2. What machine learning model (or models) would you like to use that is suitable for your predictive problem and data?
    * Are there other data preparation techniques that you will need to apply to build a balanced modeling data set for your problem and model? For example, will you need to scale your data?
 
 
3. How will you evaluate and improve the model's performance?
    * Are there specific evaluation metrics and methods that are appropriate for your model?
    

Think of the different techniques you have used to inspect and analyze your data in this course. These include using Pandas to apply data filters, using the Pandas `describe()` method to get insight into key statistics for each column, using the Pandas `dtypes` property to inspect the data type of each column, and using Matplotlib and Seaborn to detect outliers and visualize relationships between features and labels. If you are working on a classification problem, use techniques you have learned to determine if there is class imbalance.
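As a minimal sketch of the inspection techniques listed above, the example below runs `dtypes`, `describe()`, and a simple outlier check on a small made-up DataFrame. The `toy` frame and its values are assumptions for illustration only, and the 1.5 * IQR rule is one common convention for flagging outliers, not the only option.

```python
import pandas as pd

# Toy stand-in for the real listings data (values are made up for illustration).
toy = pd.DataFrame({
    "price": [80.0, 95.0, 100.0, 110.0, 120.0, 130.0, 150.0, 900.0],
    "beds": [1, 1, 1, 2, 2, 2, 3, 4],
    "room_type": ["Private room", "Private room", "Entire home/apt",
                  "Entire home/apt", "Private room", "Entire home/apt",
                  "Entire home/apt", "Entire home/apt"],
})

print(toy.dtypes)      # data type of each column
print(toy.describe())  # summary statistics for the numeric columns

# Flag outliers in the label with the 1.5 * IQR rule.
q1, q3 = toy["price"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (toy["price"] < q1 - 1.5 * iqr) | (toy["price"] > q3 + 1.5 * iqr)
print(toy.loc[is_outlier, ["price", "beds"]])
```

The same pattern should carry over to the real `df` by swapping in its column names; a box plot from Seaborn would show the same flagged points visually.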

<b>Task</b>: Use the techniques you have learned in this course to inspect and analyze your data. You can import additional packages that you have used in this course that you will need to perform this task.

<b>Note</b>: You can add code cells if needed by going to the <b>Insert</b> menu and clicking on <b>Insert Cell Below</b> in the drop-down menu.

1. Data preparation techniques: handling missing values, replacing outliers, one-hot encoding `room_type`, and removing irrelevant features.
2. I will use a Linear Regression model and a Random Forest ensemble model.
3. I will compute the RMSE and R^2 score of each model, aiming for a low RMSE and a high R^2.
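To make the two evaluation metrics concrete, here is a small sketch that computes them by hand with NumPy. The helper names `rmse` and `r2` and the sample numbers are hypothetical; the formulas are the standard definitions. Comparing against a baseline that always predicts the mean gives a reference point: that baseline has an R^2 of exactly 0, so a model is only useful to the extent that it beats it.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: typical error size, in the units of the label."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Made-up prices and predictions, for illustration only.
y_true = [100.0, 150.0, 200.0, 250.0]
y_pred = [110.0, 140.0, 210.0, 240.0]
baseline = [float(np.mean(y_true))] * len(y_true)  # always predict the mean

print("model    RMSE:", rmse(y_true, y_pred), "R2:", r2(y_true, y_pred))
print("baseline RMSE:", rmse(y_true, baseline), "R2:", r2(y_true, baseline))
```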

In [3]:
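#Correlate only the numeric and boolean columns with the label (text columns cannot be correlated)
df.select_dtypes(include=['number', 'bool']).corr()['price']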

host_response_rate                              0.006480
host_acceptance_rate                            0.037550
host_is_superhost                                    NaN
host_listings_count                             0.080995
host_total_listings_count                       0.080995
host_has_profile_pic                                 NaN
host_identity_verified                               NaN
accommodates                                    0.519057
bathrooms                                       0.331297
bedrooms                                        0.475506
beds                                            0.409236
price                                           1.000000
minimum_nights                                 -0.079945
maximum_nights                                 -0.001024
minimum_minimum_nights                         -0.071261
maximum_minimum_nights                         -0.007691
minimum_maximum_nights                          0.064011
maximum_maximum_nights         

In [4]:
#Check for int and float values
int_float_columns = df.select_dtypes(include=['int', 'float'])
print(int_float_columns.columns)

#Check for null values
np.sum(df.isnull())
has_na = ['bedrooms', 'beds']

features_draft1 = ['accommodates', 'room_type', 'bathrooms', 'bedrooms', 'beds', 'review_scores_rating', 'review_scores_cleanliness', 'review_scores_checkin', 'review_scores_location', 'review_scores_value']

#Check for outlier 
#sns.pairplot(df[features_draft1])
#plt.show()

Index(['host_response_rate', 'host_acceptance_rate', 'host_listings_count',
       'host_total_listings_count', 'accommodates', 'bathrooms', 'bedrooms',
       'beds', 'price', 'minimum_nights', 'maximum_nights',
       'minimum_minimum_nights', 'maximum_minimum_nights',
       'minimum_maximum_nights', 'maximum_maximum_nights',
       'minimum_nights_avg_ntm', 'maximum_nights_avg_ntm', 'availability_30',
       'availability_60', 'availability_90', 'availability_365',
       'number_of_reviews', 'number_of_reviews_ltm', 'number_of_reviews_l30d',
       'review_scores_rating', 'review_scores_cleanliness',
       'review_scores_checkin', 'review_scores_communication',
       'review_scores_location', 'review_scores_value',
       'calculated_host_listings_count',
       'calculated_host_listings_count_entire_homes',
       'calculated_host_listings_count_private_rooms',
       'calculated_host_listings_count_shared_rooms', 'reviews_per_month',
       'n_host_verifications'],
      dtype

## Part 4: Define Your Project Plan

Now that you understand your data, in the markdown cell below, define your plan to implement the remaining phases of the machine learning life cycle (data preparation, modeling, evaluation) to solve your ML problem. Answer the following questions:

* Do you have a new feature list? If so, what are the features that you chose to keep and remove after inspecting the data? 
* Explain different data preparation techniques that you will use to prepare your data for modeling.
* What is your model (or models)?
* Describe your plan to train your model, analyze its performance and then improve the model. That is, describe your model building, validation and selection plan to produce a model that generalizes well to new data. 


1. After analyzing the data, I removed several features because they showed a negative correlation or little to no correlation with the label and did not seem necessary for predicting it.
   My new feature list contains the following features: 'accommodates', 'room_type', 'bathrooms', 'bedrooms', 'beds'.

2. Data preparation techniques: handling missing values, replacing outliers, one-hot encoding `room_type`, and removing irrelevant features.
3. I will start with Linear Regression and a Random Forest ensemble model. If the results call for it, I will add more models for comparison.
4. I will split the data into training and test sets. If needed, I will tune the hyperparameters and adjust the feature list to improve the overall performance of the model.
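One way the tuning step in this plan could look is a cross-validated grid search, sketched below on synthetic data. The data, the grid values, and the fold count are all assumptions chosen to keep the example small; they are not tuned for the Airbnb data. The point is that the hyperparameters are chosen using only the training split, and the test split is scored once at the end.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (not the Airbnb data): 300 rows, 4 numeric features.
rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=(300, 4))
y = 40 * X[:, 0] + 15 * X[:, 1] + rng.normal(0, 10, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1234)

# Hypothetical, deliberately small grid; real ranges would need experimentation.
param_grid = {"max_depth": [4, 8], "n_estimators": [50, 100]}

search = GridSearchCV(
    RandomForestRegressor(random_state=1234),
    param_grid,
    cv=3,                                   # 3-fold cross-validation on the training split
    scoring="neg_root_mean_squared_error",  # higher is better, so RMSE is negated
)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Cross-validated RMSE:", -search.best_score_)
print("Test R2:", search.best_estimator_.score(X_test, y_test))
```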


## Part 5: Implement Your Project Plan

<b>Task:</b> In the code cell below, import additional packages that you have used in this course that you will need to implement your project plan.

In [5]:
import scipy.stats as stats

from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

<b>Task:</b> Use the rest of this notebook to carry out your project plan. 

You will:

1. Prepare your data for your model.
2. Fit your model to the training data and evaluate your model.
3. Improve your model's performance by performing model selection and/or feature selection techniques to find best model for your problem.

Add code cells below and populate the notebook with commentary, code, analyses, results, and figures as you see fit. 

In [6]:
features = ['accommodates', 'room_type', 'bathrooms', 'bedrooms', 'beds']

In [7]:
df[features].describe()

Unnamed: 0,accommodates,bathrooms,bedrooms,beds
count,28022.0,28022.0,25104.0,26668.0
mean,2.874491,1.142174,1.329708,1.629556
std,1.860251,0.421132,0.700726,1.097104
min,1.0,0.0,1.0,1.0
25%,2.0,1.0,1.0,1.0
50%,2.0,1.0,1.0,1.0
75%,4.0,1.0,1.0,2.0
max,16.0,8.0,12.0,21.0


#### Cleaning the Data

In [8]:
#Handle outliers

#Winsorizing is left commented out because the results were better without it

# df['bathrooms'] = stats.mstats.winsorize(df['bathrooms'], limits=[0.01, 0.01])
# df['bedrooms'] = stats.mstats.winsorize(df['bedrooms'], limits=[0.01, 0.01])
# df['beds'] = stats.mstats.winsorize(df['beds'], limits=[0.01, 0.01])

# df[features].describe()
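For reference, the short sketch below shows what `winsorize` does on a made-up array: values in the lowest and highest tails are replaced by the nearest value that remains, rather than being dropped. The array and the 10% limits are assumptions for illustration. One thing that may be worth checking in the commented-out cell above: `bedrooms` and `beds` still contain missing values at that point, and SciPy's documentation notes that the default `nan_policy` may overwrite or propagate NaN, so winsorizing after imputation could behave differently.

```python
import numpy as np
from scipy.stats.mstats import winsorize

# Made-up values with one extreme point on each side.
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)

# limits=[0.1, 0.1] replaces the lowest 10% and the highest 10% of the values
# with the closest value that is kept. The input array is not modified.
clipped = np.asarray(winsorize(values, limits=[0.1, 0.1]))

print("original:  ", values)
print("winsorized:", clipped)
```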

In [9]:
#Missing Data
print(np.sum(df[features].isnull(), axis=0))
has_na = ['bedrooms', 'beds']

accommodates       0
room_type          0
bathrooms          0
bedrooms        2918
beds            1354
dtype: int64


In [10]:
#Keep record of the missingness
for column in has_na: 
    df[column + '_na'] = df[column].isnull()

In [11]:
#Replace missing values with the mean of each column
for column in has_na: 
    df[column] = df[column].fillna(df[column].mean())

print(np.sum(df[features].isnull(), axis=0))

accommodates    0
room_type       0
bathrooms       0
bedrooms        0
beds            0
dtype: int64
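The cells above fill the gaps with a mean computed over the whole data set, before the train/test split. A possible alternative, sketched below on a made-up frame, is to split first and learn the fill value from the training rows only, so that no information from the test rows leaks into the preparation step. The `toy` frame is an assumption for illustration. The sketch also keeps the `_na` indicator columns so they can be added to the feature list if they turn out to help.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with gaps (made-up values, not the Airbnb data).
toy = pd.DataFrame({
    "bedrooms": [1.0, np.nan, 2.0, 1.0, np.nan, 3.0, 1.0, 2.0],
    "beds":     [1.0, 2.0, np.nan, 1.0, 2.0, 4.0, 1.0, 2.0],
})

train, test = train_test_split(toy, test_size=0.25, random_state=1234)
train, test = train.copy(), test.copy()

for column in ["bedrooms", "beds"]:
    # Record which values were missing before they are filled.
    train[column + "_na"] = train[column].isnull()
    test[column + "_na"] = test[column].isnull()

    # Learn the fill value from the training rows only, then reuse it on the test rows.
    train_mean = train[column].mean()
    train[column] = train[column].fillna(train_mean)
    test[column] = test[column].fillna(train_mean)

print(train.isnull().sum().sum(), test.isnull().sum().sum())
```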


In [12]:
#One hot encoding for the room_type
df['room_type'].unique()

array(['Entire home/apt', 'Private room', 'Hotel room', 'Shared room'],
      dtype=object)

In [13]:
df_room_type = pd.get_dummies(df['room_type'], prefix='room_type')
df_room_type

Unnamed: 0,room_type_Entire home/apt,room_type_Hotel room,room_type_Private room,room_type_Shared room
0,1,0,0,0
1,1,0,0,0
2,1,0,0,0
3,0,0,1,0
4,0,0,1,0
...,...,...,...,...
28017,0,0,1,0
28018,1,0,0,0
28019,0,0,1,0
28020,1,0,0,0


In [14]:
df = df.join(df_room_type)
df.drop(columns='room_type', inplace=True)
df

Unnamed: 0,name,description,neighborhood_overview,host_name,host_location,host_about,host_response_rate,host_acceptance_rate,host_is_superhost,host_listings_count,...,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month,n_host_verifications,bedrooms_na,beds_na,room_type_Entire home/apt,room_type_Hotel room,room_type_Private room,room_type_Shared room
0,Skylit Midtown Castle,"Beautiful, spacious skylit studio in the heart...",Centrally located in the heart of Manhattan ju...,Jennifer,"New York, New York, United States",A New Yorker since 2000! My passion is creatin...,0.80,0.17,True,8.0,...,0,0,0.33,9,True,False,1,0,0,0
1,"Whole flr w/private bdrm, bath & kitchen(pls r...","Enjoy 500 s.f. top floor in 1899 brownstone, w...",Just the right mix of urban center and local n...,LisaRoxanne,"New York, New York, United States",Laid-back Native New Yorker (formerly bi-coast...,0.09,0.69,True,1.0,...,0,0,4.86,6,False,False,1,0,0,0
2,"Spacious Brooklyn Duplex, Patio + Garden",We welcome you to stay in our lovely 2 br dupl...,,Rebecca,"Brooklyn, New York, United States","Rebecca is an artist/designer, and Henoch is i...",1.00,0.25,True,1.0,...,0,0,0.02,3,False,False,1,0,0,0
3,Large Furnished Room Near B'way,Please don’t expect the luxury here just a bas...,"Theater district, many restaurants around here.",Shunichi,"New York, New York, United States",I used to work for a financial industry but no...,1.00,1.00,True,1.0,...,1,0,3.68,4,False,False,0,0,1,0
4,Cozy Clean Guest Room - Family Apt,"Our best guests are seeking a safe, clean, spa...",Our neighborhood is full of restaurants and ca...,MaryEllen,"New York, New York, United States",Welcome to family life with my oldest two away...,,,True,1.0,...,1,0,0.87,7,False,False,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
28017,Astoria Luxury suite 2A,THIS LOVELY HOME IS THE SPACIOUS SUITE WITH PR...,,Vicky,"Queens, New York, United States",,1.00,1.00,True,8.0,...,8,0,1.00,2,False,False,0,0,1,0
28018,Newly renovated suite in the heart of Williams...,Just fully renovated from head to toe. On the ...,,Samuel,"New York, New York, United States","Hello, my name is Sam. I am a real estate prof...",0.91,0.89,True,0.0,...,0,0,2.00,5,False,False,1,0,0,0
28019,Perfect Room to Stay in Brooklyn! Near Metro!,"Amazing and comfortable space in Brooklyn, sam...",,Carlos,US,,0.99,0.99,True,6.0,...,7,0,1.00,2,False,False,0,0,1,0
28020,New Beautiful Modern One Bedroom in Brooklyn,This stylish place to stay is perfect for a gr...,,Lexia,"New York, New York, United States","I am a graphic designer, swell chaser and duri...",0.90,1.00,True,3.0,...,0,0,1.00,7,False,False,1,0,0,0


In [15]:
#Re adjusting the features
features.remove('room_type')
features

['accommodates', 'bathrooms', 'bedrooms', 'beds']

In [16]:
features.extend(df_room_type.columns)
features

['accommodates',
 'bathrooms',
 'bedrooms',
 'beds',
 'room_type_Entire home/apt',
 'room_type_Hotel room',
 'room_type_Private room',
 'room_type_Shared room']
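The feature list above keeps all four `room_type` indicator columns. For a linear model with an intercept, those four columns always sum to 1, so they are perfectly collinear with the intercept. A common remedy is to drop one level, as in the sketch below; the sample `room_type` values are made up, and whether this changes the linear regression results on the real data is untested here. Tree-based models such as a random forest are generally not affected by this.

```python
import pandas as pd

# A few made-up room types covering the four categories in the data set.
room_type = pd.Series(
    ["Entire home/apt", "Private room", "Hotel room", "Shared room", "Private room"],
    name="room_type",
)

full = pd.get_dummies(room_type, prefix="room_type", dtype=int)
reduced = pd.get_dummies(room_type, prefix="room_type", drop_first=True, dtype=int)

# Every row of the full encoding sums to 1, which duplicates the intercept column.
print(full.sum(axis=1).tolist())

# drop_first=True removes the first category, which becomes the reference level.
print(list(reduced.columns))
```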

In [17]:
df[features]

Unnamed: 0,accommodates,bathrooms,bedrooms,beds,room_type_Entire home/apt,room_type_Hotel room,room_type_Private room,room_type_Shared room
0,1,1.0,1.329708,1.0,1,0,0,0
1,3,1.0,1.000000,3.0,1,0,0,0
2,4,1.5,2.000000,2.0,1,0,0,0
3,2,1.0,1.000000,1.0,0,0,1,0
4,1,1.0,1.000000,1.0,0,0,1,0
...,...,...,...,...,...,...,...,...
28017,2,1.0,1.000000,1.0,0,0,1,0
28018,6,1.0,2.000000,2.0,1,0,0,0
28019,2,2.0,1.000000,1.0,0,0,1,0
28020,3,1.0,1.000000,2.0,1,0,0,0


In [18]:
#Check for correlation between features and the label
corrs = df[features].corrwith(df['price'])
print(corrs)

accommodates                 0.519057
bathrooms                    0.331297
bedrooms                     0.457171
beds                         0.400325
room_type_Entire home/apt    0.346902
room_type_Hotel room         0.127915
room_type_Private room      -0.355462
room_type_Shared room       -0.047938
dtype: float64


In [19]:
#Label and feature, then splitting
y = df['price']
X = df[features]



In [20]:
#Create a test set that is 30% of the data set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30, random_state=1234)

In [21]:
#Train both models on the training set
lr_model = LinearRegression()     
rf_model = RandomForestRegressor(max_depth=30, n_estimators=300)

lr_model.fit(X_train, y_train)
rf_model.fit(X_train, y_train)

In [22]:
#Predictions
y_lr_pred = lr_model.predict(X_test)
y_rf_pred = rf_model.predict(X_test)


# DataFrame of actual and predicted prices for the linear regression model
lr_results = pd.DataFrame({'Actual': y_test, 'Predicted': y_lr_pred})

# DataFrame of actual and predicted prices for the random forest model
rf_results = pd.DataFrame({'Actual': y_test, 'Predicted': y_rf_pred})

# Display the first 20 rows of each DataFrame
print(lr_results.head(20))
print("-------------------------")
print(rf_results.head(20))

       Actual  Predicted
17758   385.0  333.18750
19492    40.0  146.90625
12485   120.0  159.06250
14553   113.0  141.34375
13532    80.0  188.81250
1059     82.0   93.00000
20303   168.0  182.81250
9808     45.0   90.37500
23496    58.0   66.65625
24551   170.0  276.65625
16979   400.0   86.81250
19417   250.0  368.65625
21948   100.0   90.37500
23952   250.0  137.84375
3854     75.0   68.06250
9010     60.0  188.81250
3667    180.0  125.84375
20860   150.0  276.65625
3977   1000.0  506.78125
7612    279.0  159.06250
-------------------------
       Actual   Predicted
17758   385.0  306.860146
19492    40.0   75.525448
12485   120.0  158.352553
14553   113.0  159.573531
13532    80.0  173.003036
1059     82.0  129.520502
20303   168.0  168.041498
9808     45.0   93.030291
23496    58.0   69.523025
24551   170.0  231.166423
16979   400.0   94.836974
19417   250.0  422.291721
21948   100.0   93.030291
23952   250.0  169.958768
3854     75.0   70.161544
9010     60.0  173.003036
3667   

In [23]:
#Compute RMSE as the square root of the MSE (the `squared` argument is deprecated in newer scikit-learn releases)
lr_rmse = np.sqrt(mean_squared_error(y_test, y_lr_pred))
rf_rmse = np.sqrt(mean_squared_error(y_test, y_rf_pred))

#Compute R2 Score
lr_r2 = r2_score(y_test, y_lr_pred)
rf_r2 = r2_score(y_test, y_rf_pred)

print('LR - Root Mean Squared Error: {0}'.format(lr_rmse))
print('LR - R2: {0}'.format(lr_r2))   

print("-------------------------")

print('RF - Root Mean Squared Error: {0}'.format(rf_rmse))
print('RF - R2: {0}'.format(rf_r2))   

LR - Root Mean Squared Error: 116.80923983485795
LR - R2: 0.3482343022583574
-------------------------
RF - Root Mean Squared Error: 113.13962566023892
RF - R2: 0.38854207306894273




RESULT: 

Even after preparing the data and tuning the hyperparameters, both models still produce a high RMSE (about 117 for linear regression and 113 for the random forest) and a low R2 (about 0.35 and 0.39). I would greatly appreciate any suggestions or feedback on how I can further improve this model. Thank you!
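One idea that might be worth trying, offered as a sketch rather than a confirmed fix: prices are often right-skewed (the `describe()` and prediction outputs above include listings far above the typical price), and fitting the model on a log-transformed label can reduce the influence of those extreme values. The example below uses synthetic skewed data, not the Airbnb data, and scikit-learn's `TransformedTargetRegressor` to apply `np.log1p` before fitting and `np.expm1` when predicting. Whether it improves RMSE or R2 on the real data would need to be checked.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic right-skewed "price" (not the Airbnb data): multiplicative noise.
rng = np.random.default_rng(0)
X = rng.uniform(1, 8, size=(1000, 2))
y = np.exp(3.5 + 0.25 * X[:, 0] + 0.1 * X[:, 1] + rng.normal(0, 0.4, size=1000))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1234)

# Same regressor twice: once on the raw label, once on log1p(label).
plain = LinearRegression().fit(X_train, y_train)
logged = TransformedTargetRegressor(
    regressor=LinearRegression(), func=np.log1p, inverse_func=np.expm1
).fit(X_train, y_train)

for name, model in [("raw target", plain), ("log1p target", logged)]:
    pred = model.predict(X_test)  # predictions are returned on the original price scale
    rmse = np.sqrt(mean_squared_error(y_test, pred))
    print(name, "RMSE:", rmse, "R2:", r2_score(y_test, pred))
```

The same wrapper accepts any regressor, so `RandomForestRegressor` could be swapped in for `LinearRegression` to compare both models from this notebook.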