# Lab 8: Define and Solve an ML Problem of Your Choosing

In [1]:
import pandas as pd
import numpy as np
import os 
import matplotlib.pyplot as plt
import seaborn as sns

In this lab assignment, you will follow the machine learning life cycle and implement a model to solve a machine learning problem of your choosing. You will select a data set and choose a predictive problem that the data set supports. You will then inspect the data with your problem in mind and begin to formulate a project plan. You will then implement the machine learning project plan.

You will complete the following tasks:

1. Build Your DataFrame
2. Define Your ML Problem
3. Perform exploratory data analysis to understand your data.
4. Define Your Project Plan
5. Implement Your Project Plan:
    * Prepare your data for your model.
    * Fit your model to the training data and evaluate your model.
    * Improve your model's performance.

## Part 1: Build Your DataFrame

You will have the option to choose one of four data sets that you have worked with in this program:

* The "census" data set that contains Census information from 1994: `censusData.csv`
* Airbnb NYC "listings" data set: `airbnbListingsData.csv`
* World Happiness Report (WHR) data set: `WHR2018Chapter2OnlineData.csv`
* Book Review data set: `bookReviewsData.csv`

Note that these are variations of the data sets that you have worked with in this program. For example, some do not include some of the preprocessing necessary for specific models. 

#### Load a Data Set and Save it as a Pandas DataFrame

The code cell below contains filenames (path + filename) for each of the four data sets available to you.

<b>Task:</b> In the code cell below, use the same method you have been using to load the data using `pd.read_csv()` and save it to DataFrame `df`. 

You can load each file as a new DataFrame to inspect the data before choosing your data set.

In [2]:
# File names of the four data sets
adultDataSet_filename = os.path.join(os.getcwd(), "data", "censusData.csv")
airbnbDataSet_filename = os.path.join(os.getcwd(), "data", "airbnbListingsData.csv")
WHRDataSet_filename = os.path.join(os.getcwd(), "data", "WHR2018Chapter2OnlineData.csv")
bookReviewDataSet_filename = os.path.join(os.getcwd(), "data", "bookReviewsData.csv")


df = pd.read_csv(airbnbDataSet_filename, header = 0)

df.head()
df.columns

Index(['name', 'description', 'neighborhood_overview', 'host_name',
       'host_location', 'host_about', 'host_response_rate',
       'host_acceptance_rate', 'host_is_superhost', 'host_listings_count',
       'host_total_listings_count', 'host_has_profile_pic',
       'host_identity_verified', 'neighbourhood_group_cleansed', 'room_type',
       'accommodates', 'bathrooms', 'bedrooms', 'beds', 'amenities', 'price',
       'minimum_nights', 'maximum_nights', 'minimum_minimum_nights',
       'maximum_minimum_nights', 'minimum_maximum_nights',
       'maximum_maximum_nights', 'minimum_nights_avg_ntm',
       'maximum_nights_avg_ntm', 'has_availability', 'availability_30',
       'availability_60', 'availability_90', 'availability_365',
       'number_of_reviews', 'number_of_reviews_ltm', 'number_of_reviews_l30d',
       'review_scores_rating', 'review_scores_cleanliness',
       'review_scores_checkin', 'review_scores_communication',
       'review_scores_location', 'review_scores_value',

In [3]:
#label:
df['price']

0         150.0
1          75.0
2         275.0
3          68.0
4          75.0
          ...  
28017      89.0
28018    1000.0
28019      64.0
28020      84.0
28021      70.0
Name: price, Length: 28022, dtype: float64

## Part 2: Define Your ML Problem

Next you will formulate your ML Problem. In the markdown cell below, answer the following questions:

1. List the data set you have chosen.
2. What will you be predicting? What is the label?
3. Is this a supervised or unsupervised learning problem? Is this a clustering, classification or regression problem? Is it a binary classification or multi-class classification problem?
4. What are your features? (note: this list may change after you explore your data)
5. Explain why this is an important problem. In other words, how would a company create value with a model that predicts this label?

1. The data set I have chosen is the Airbnb NYC listings data set (`airbnbListingsData.csv`).
2. I will be predicting the price of an Airbnb listing from its other attributes. The label is `price`.
3. This is a supervised learning problem, specifically a regression problem, because the label is a continuous numeric value.
4. My initial features are all columns in the data set other than `price`.
5. This is an important problem because a price-prediction model could be offered on the Airbnb website as a tool that helps hosts make informed decisions about how much to charge for their listing.

## Part 3: Understand Your Data

The next step is to perform exploratory data analysis. Inspect and analyze your data set with your machine learning problem in mind. Consider the following as you inspect your data:

1. What data preparation techniques would you like to use? These data preparation techniques may include:

    * addressing missingness, such as replacing missing values with means
    * finding and replacing outliers
    * renaming features and labels
    * performing feature engineering techniques such as one-hot encoding on categorical features
    * selecting appropriate features and removing irrelevant features
    * performing specific data cleaning and preprocessing techniques for an NLP problem
    * addressing class imbalance in your data sample to promote fair AI
    

2. What machine learning model (or models) you would like to use that is suitable for your predictive problem and data?
    * Are there other data preparation techniques that you will need to apply to build a balanced modeling data set for your problem and model? For example, will you need to scale your data?
 
 
3. How will you evaluate and improve the model's performance?
    * Are there specific evaluation metrics and methods that are appropriate for your model?
    

Think of the different techniques you have used to inspect and analyze your data in this course. These include using Pandas to apply data filters, using the Pandas `describe()` method to get insight into key statistics for each column, using the Pandas `dtypes` property to inspect the data type of each column, and using Matplotlib and Seaborn to detect outliers and visualize relationships between features and labels. If you are working on a classification problem, use techniques you have learned to determine if there is class imbalance.

<b>Task</b>: Use the techniques you have learned in this course to inspect and analyze your data. You can import additional packages that you have used in this course that you will need to perform this task.

<b>Note</b>: You can add code cells if needed by going to the <b>Insert</b> menu and clicking on <b>Insert Cell Below</b> in the drop-down menu.
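
One way to look for outliers in a numeric column is the interquartile-range (IQR) rule. The sketch below is only an illustration: the toy `prices` Series and the helper name `iqr_outlier_mask` are assumptions made for this example and are not part of the lab; in the notebook the same helper could be applied to `df['price']`.

```python
import pandas as pd

# Hypothetical toy prices; in the notebook this would be df['price'].
prices = pd.Series([50.0, 60.0, 70.0, 80.0, 90.0, 100.0, 1000.0], name="price")


def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Keep only the rows flagged as outliers.
outliers = prices[iqr_outlier_mask(prices)]
print(outliers.tolist())
```

The multiplier `k = 1.5` is the conventional default for this rule; a larger value flags fewer points.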

In [4]:
df.dtypes

name                                             object
description                                      object
neighborhood_overview                            object
host_name                                        object
host_location                                    object
host_about                                       object
host_response_rate                              float64
host_acceptance_rate                            float64
host_is_superhost                                  bool
host_listings_count                             float64
host_total_listings_count                       float64
host_has_profile_pic                               bool
host_identity_verified                             bool
neighbourhood_group_cleansed                     object
room_type                                        object
accommodates                                      int64
bathrooms                                       float64
bedrooms                                        

In [5]:
df.describe()

Unnamed: 0,host_response_rate,host_acceptance_rate,host_listings_count,host_total_listings_count,accommodates,bathrooms,bedrooms,beds,price,minimum_nights,...,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month,n_host_verifications
count,16179.0,16909.0,28022.0,28022.0,28022.0,28022.0,25104.0,26668.0,28022.0,28022.0,...,28022.0,28022.0,28022.0,28022.0,28022.0,28022.0,28022.0,28022.0,28022.0,28022.0
mean,0.906901,0.791953,14.554778,14.554778,2.874491,1.142174,1.329708,1.629556,154.228749,18.689387,...,4.8143,4.808041,4.750393,4.64767,9.5819,5.562986,3.902077,0.048283,1.758325,5.16951
std,0.227282,0.276732,120.721287,120.721287,1.860251,0.421132,0.700726,1.097104,140.816605,25.569151,...,0.438603,0.464585,0.415717,0.518023,32.227523,26.121426,17.972386,0.442459,4.446143,2.028497
min,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,29.0,1.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.01,1.0
25%,0.94,0.68,1.0,1.0,2.0,1.0,1.0,1.0,70.0,2.0,...,4.81,4.81,4.67,4.55,1.0,0.0,0.0,0.0,0.13,4.0
50%,1.0,0.91,1.0,1.0,2.0,1.0,1.0,1.0,115.0,30.0,...,4.96,4.97,4.88,4.78,1.0,1.0,0.0,0.0,0.51,5.0
75%,1.0,1.0,3.0,3.0,4.0,1.0,1.0,2.0,180.0,30.0,...,5.0,5.0,5.0,5.0,3.0,1.0,1.0,0.0,1.83,7.0
max,1.0,1.0,3387.0,3387.0,16.0,8.0,12.0,21.0,1000.0,1250.0,...,5.0,5.0,5.0,5.0,421.0,308.0,359.0,8.0,141.0,13.0


In [6]:
# Handle Outliers
import scipy.stats as stats
df['price'] = stats.mstats.winsorize(df['price'], limits = [0.01, 0.01])
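
To make the effect of limiting extreme values concrete, the sketch below clips a toy array at its 10th and 90th percentiles with NumPy. This is percentile clipping, a close relative of winsorizing, and not a reimplementation of `scipy.stats.mstats.winsorize`; the toy array and the 10%/90% cut-offs are assumptions chosen so the effect is visible on ten values.

```python
import numpy as np

# Hypothetical toy data with one extreme value at the top.
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 100.0])

# Compute the cut-offs, then pull anything outside them back to the cut-off.
lower, upper = np.percentile(values, [10, 90])
clipped = np.clip(values, lower, upper)

print("before:", values.min(), values.max())
print("after: ", clipped.min(), clipped.max())
```

No rows are dropped: the array keeps its length, and only the tails are pulled in toward the bulk of the data.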

In [7]:
#Handle Missing Values
nan_count = df.isnull().sum()  # per-column count of missing values
nan_count

name                                                5
description                                       570
neighborhood_overview                            9816
host_name                                           0
host_location                                      60
host_about                                      10945
host_response_rate                              11843
host_acceptance_rate                            11113
host_is_superhost                                   0
host_listings_count                                 0
host_total_listings_count                           0
host_has_profile_pic                                0
host_identity_verified                              0
neighbourhood_group_cleansed                        0
room_type                                           0
accommodates                                        0
bathrooms                                           0
bedrooms                                         2918
beds                        

In [8]:
nan_detected = nan_count != 0

In [9]:
is_int_or_float = (df.dtypes == int) | (df.dtypes == float)
        
is_int_or_float

name                                            False
description                                     False
neighborhood_overview                           False
host_name                                       False
host_location                                   False
host_about                                      False
host_response_rate                               True
host_acceptance_rate                             True
host_is_superhost                               False
host_listings_count                              True
host_total_listings_count                        True
host_has_profile_pic                            False
host_identity_verified                          False
neighbourhood_group_cleansed                    False
room_type                                       False
accommodates                                     True
bathrooms                                        True
bedrooms                                         True
beds                        

In [10]:
to_impute = nan_detected & is_int_or_float
to_impute

name                                            False
description                                     False
neighborhood_overview                           False
host_name                                       False
host_location                                   False
host_about                                      False
host_response_rate                               True
host_acceptance_rate                             True
host_is_superhost                               False
host_listings_count                             False
host_total_listings_count                       False
host_has_profile_pic                            False
host_identity_verified                          False
neighbourhood_group_cleansed                    False
room_type                                       False
accommodates                                    False
bathrooms                                       False
bedrooms                                         True
beds                        

In [11]:
to_impute_selected = ['host_response_rate', 'host_acceptance_rate', 'bedrooms', 'beds'] 

In [12]:
for col in to_impute_selected:
    new_name = col + '_na'
    df[new_name] = df[col].isnull()

df.head()

Unnamed: 0,name,description,neighborhood_overview,host_name,host_location,host_about,host_response_rate,host_acceptance_rate,host_is_superhost,host_listings_count,...,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month,n_host_verifications,host_response_rate_na,host_acceptance_rate_na,bedrooms_na,beds_na
0,Skylit Midtown Castle,"Beautiful, spacious skylit studio in the heart...",Centrally located in the heart of Manhattan ju...,Jennifer,"New York, New York, United States",A New Yorker since 2000! My passion is creatin...,0.8,0.17,True,8.0,...,3,3,0,0,0.33,9,False,False,True,False
1,"Whole flr w/private bdrm, bath & kitchen(pls r...","Enjoy 500 s.f. top floor in 1899 brownstone, w...",Just the right mix of urban center and local n...,LisaRoxanne,"New York, New York, United States",Laid-back Native New Yorker (formerly bi-coast...,0.09,0.69,True,1.0,...,1,1,0,0,4.86,6,False,False,False,False
2,"Spacious Brooklyn Duplex, Patio + Garden",We welcome you to stay in our lovely 2 br dupl...,,Rebecca,"Brooklyn, New York, United States","Rebecca is an artist/designer, and Henoch is i...",1.0,0.25,True,1.0,...,1,1,0,0,0.02,3,False,False,False,False
3,Large Furnished Room Near B'way,Please don’t expect the luxury here just a bas...,"Theater district, many restaurants around here.",Shunichi,"New York, New York, United States",I used to work for a financial industry but no...,1.0,1.0,True,1.0,...,1,0,1,0,3.68,4,False,False,False,False
4,Cozy Clean Guest Room - Family Apt,"Our best guests are seeking a safe, clean, spa...",Our neighborhood is full of restaurants and ca...,MaryEllen,"New York, New York, United States",Welcome to family life with my oldest two away...,,,True,1.0,...,1,0,1,0,0.87,7,True,True,False,False


In [13]:
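for col in to_impute_selected:
    # Assign the result back to the column: fillna(inplace=True) on a column
    # selection is deprecated in recent pandas and may not modify df.
    df[col] = df[col].fillna(df[col].mean())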

In [14]:
for colname in to_impute_selected:
    print("{} missing values count :{}".format(colname, np.sum(df[colname].isnull(), axis = 0)))

host_response_rate missing values count :0
host_acceptance_rate missing values count :0
bedrooms missing values count :0
beds missing values count :0
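
The imputation steps above can be sketched on a toy frame. The column name `bedrooms` mirrors the notebook, but the four values here are made up for illustration. The indicator column is created before filling, so the information about which rows were originally missing is preserved.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing values.
toy = pd.DataFrame({"bedrooms": [1.0, np.nan, 3.0, np.nan]})

# 1. Record which rows were missing before anything is filled in.
toy["bedrooms_na"] = toy["bedrooms"].isnull()

# 2. Replace missing values with the mean of the observed values (2.0 here).
toy["bedrooms"] = toy["bedrooms"].fillna(toy["bedrooms"].mean())

print(toy)
```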


In [15]:
is_obj = df.select_dtypes(include = [object])
is_obj.columns

Index(['name', 'description', 'neighborhood_overview', 'host_name',
       'host_location', 'host_about', 'neighbourhood_group_cleansed',
       'room_type', 'amenities'],
      dtype='object')

In [16]:
to_transform_selected = ['neighbourhood_group_cleansed', 'room_type']

In [17]:
for colname in to_transform_selected:
    print("{} unique values :{}".format(colname, df[colname].unique()))

neighbourhood_group_cleansed unique values :['Manhattan' 'Brooklyn' 'Queens' 'Staten Island' 'Bronx']
room_type unique values :['Entire home/apt' 'Private room' 'Hotel room' 'Shared room']


In [18]:
# Assign back instead of using inplace=True on a column selection.
df['neighbourhood_group_cleansed'] = df['neighbourhood_group_cleansed'].fillna('unavailable')
df['room_type'] = df['room_type'].fillna('unavailable')

In [19]:
df.drop(columns = ['name', 'description', 'neighborhood_overview', 'host_name', 'host_location', 'host_about', 'amenities'], axis = 1, inplace = True)

In [20]:
check_nan_count = df.isnull().sum()
check_nan_count

host_response_rate                              0
host_acceptance_rate                            0
host_is_superhost                               0
host_listings_count                             0
host_total_listings_count                       0
host_has_profile_pic                            0
host_identity_verified                          0
neighbourhood_group_cleansed                    0
room_type                                       0
accommodates                                    0
bathrooms                                       0
bedrooms                                        0
beds                                            0
price                                           0
minimum_nights                                  0
maximum_nights                                  0
minimum_minimum_nights                          0
maximum_minimum_nights                          0
minimum_maximum_nights                          0
maximum_maximum_nights                          0


In [21]:
#one-hot encode columns:
df_with_dummies = pd.get_dummies(df, columns = ['neighbourhood_group_cleansed', 'room_type'])
df_with_dummies

Unnamed: 0,host_response_rate,host_acceptance_rate,host_is_superhost,host_listings_count,host_total_listings_count,host_has_profile_pic,host_identity_verified,accommodates,bathrooms,bedrooms,...,beds_na,neighbourhood_group_cleansed_Bronx,neighbourhood_group_cleansed_Brooklyn,neighbourhood_group_cleansed_Manhattan,neighbourhood_group_cleansed_Queens,neighbourhood_group_cleansed_Staten Island,room_type_Entire home/apt,room_type_Hotel room,room_type_Private room,room_type_Shared room
0,0.800000,0.170000,True,8.0,8.0,True,True,1,1.0,1.329708,...,False,0,0,1,0,0,1,0,0,0
1,0.090000,0.690000,True,1.0,1.0,True,True,3,1.0,1.000000,...,False,0,1,0,0,0,1,0,0,0
2,1.000000,0.250000,True,1.0,1.0,True,True,4,1.5,2.000000,...,False,0,1,0,0,0,1,0,0,0
3,1.000000,1.000000,True,1.0,1.0,True,True,2,1.0,1.000000,...,False,0,0,1,0,0,0,0,1,0
4,0.906901,0.791953,True,1.0,1.0,True,True,1,1.0,1.000000,...,False,0,0,1,0,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
28017,1.000000,1.000000,True,8.0,8.0,True,True,2,1.0,1.000000,...,False,0,0,0,1,0,0,0,1,0
28018,0.910000,0.890000,True,0.0,0.0,True,True,6,1.0,2.000000,...,False,0,1,0,0,0,1,0,0,0
28019,0.990000,0.990000,True,6.0,6.0,True,True,2,2.0,1.000000,...,False,0,1,0,0,0,0,0,1,0
28020,0.900000,1.000000,True,3.0,3.0,True,True,3,1.0,1.000000,...,False,0,1,0,0,0,1,0,0,0


In [22]:
# pd.get_dummies(df, columns=[...]) already returns every other column of df
# together with the new indicator columns, so df_with_dummies is the complete
# encoded frame and no separate drop/concat step is needed.
df = df_with_dummies.copy()
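
As a small illustration of the encoding step, the sketch below shows that `pd.get_dummies` called with `columns=` returns the untouched columns together with the new indicator columns, and removes the original categorical column. The toy frame is an assumption made for this example; `dtype=int` is passed so the indicators are 0/1 regardless of the pandas version.

```python
import pandas as pd

# Hypothetical toy listings with one categorical and one numeric column.
toy = pd.DataFrame(
    {
        "room_type": ["Private room", "Shared room", "Private room"],
        "beds": [1, 2, 1],
    }
)

# One-hot encode only 'room_type'; 'beds' is carried over unchanged.
encoded = pd.get_dummies(toy, columns=["room_type"], dtype=int)

print(encoded.columns.tolist())
print(encoded)
```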

In [23]:
df.head()

Unnamed: 0,host_response_rate,host_acceptance_rate,host_is_superhost,host_listings_count,host_total_listings_count,host_has_profile_pic,host_identity_verified,accommodates,bathrooms,bedrooms,...,beds_na,neighbourhood_group_cleansed_Bronx,neighbourhood_group_cleansed_Brooklyn,neighbourhood_group_cleansed_Manhattan,neighbourhood_group_cleansed_Queens,neighbourhood_group_cleansed_Staten Island,room_type_Entire home/apt,room_type_Hotel room,room_type_Private room,room_type_Shared room
0,0.8,0.17,True,8.0,8.0,True,True,1,1.0,1.329708,...,False,0,0,1,0,0,1,0,0,0
1,0.09,0.69,True,1.0,1.0,True,True,3,1.0,1.0,...,False,0,1,0,0,0,1,0,0,0
2,1.0,0.25,True,1.0,1.0,True,True,4,1.5,2.0,...,False,0,1,0,0,0,1,0,0,0
3,1.0,1.0,True,1.0,1.0,True,True,2,1.0,1.0,...,False,0,0,1,0,0,0,0,1,0
4,0.906901,0.791953,True,1.0,1.0,True,True,1,1.0,1.0,...,False,0,0,1,0,0,0,0,1,0


In [24]:
bool_col = df.select_dtypes(include = [bool]).columns.tolist()

In [25]:
df[bool_col] = df[bool_col].astype(int)
df.dtypes

host_response_rate                              float64
host_acceptance_rate                            float64
host_is_superhost                                 int64
host_listings_count                             float64
host_total_listings_count                       float64
host_has_profile_pic                              int64
host_identity_verified                            int64
accommodates                                      int64
bathrooms                                       float64
bedrooms                                        float64
beds                                            float64
price                                           float64
minimum_nights                                    int64
maximum_nights                                    int64
minimum_minimum_nights                          float64
maximum_minimum_nights                          float64
minimum_maximum_nights                          float64
maximum_maximum_nights                          

In [26]:
#evaluate correlation
corr_matrix = round(df.corr(),5)
corr_matrix

Unnamed: 0,host_response_rate,host_acceptance_rate,host_is_superhost,host_listings_count,host_total_listings_count,host_has_profile_pic,host_identity_verified,accommodates,bathrooms,bedrooms,...,beds_na,neighbourhood_group_cleansed_Bronx,neighbourhood_group_cleansed_Brooklyn,neighbourhood_group_cleansed_Manhattan,neighbourhood_group_cleansed_Queens,neighbourhood_group_cleansed_Staten Island,room_type_Entire home/apt,room_type_Hotel room,room_type_Private room,room_type_Shared room
host_response_rate,1.0,0.42738,,0.04103,0.04103,,,0.022,0.01969,0.01447,...,-0.03036,0.02224,0.01739,-0.04979,0.03067,0.01489,0.02917,-0.01695,-0.02177,-0.02203
host_acceptance_rate,0.42738,1.0,,0.04082,0.04082,,,0.05188,-0.00282,0.01719,...,-0.02007,0.02646,-0.03139,-0.01014,0.04209,0.01146,-0.00553,0.0377,0.00182,-0.00678
host_is_superhost,,,,,,,,,,,...,,,,,,,,,,
host_listings_count,0.04103,0.04082,,1.0,1.0,,,-0.0039,0.01326,-0.00432,...,0.00145,-0.00165,-0.07464,0.08258,-0.00814,-0.00975,0.02274,-5e-05,-0.02028,-0.01098
host_total_listings_count,0.04103,0.04082,,1.0,1.0,,,-0.0039,0.01326,-0.00432,...,0.00145,-0.00165,-0.07464,0.08258,-0.00814,-0.00975,0.02274,-5e-05,-0.02028,-0.01098
host_has_profile_pic,,,,,,,,,,,...,,,,,,,,,,
host_identity_verified,,,,,,,,,,,...,,,,,,,,,,
accommodates,0.022,0.05188,,-0.0039,-0.0039,,,1.0,0.36944,0.72124,...,-0.05234,-0.00831,0.027,-0.02531,-0.00212,0.01391,0.45266,-0.01507,-0.43853,-0.06092
bathrooms,0.01969,-0.00282,,0.01326,0.01326,,,0.36944,1.0,0.47263,...,-0.01265,-0.02246,0.07051,-0.04493,-0.02558,0.00349,0.03139,-0.01759,-0.0288,-0.00112
bedrooms,0.01447,0.01719,,-0.00432,-0.00432,,,0.72124,0.47263,1.0,...,-0.0424,-0.0129,0.0563,-0.05188,-0.0044,0.01698,0.35617,-0.02895,-0.34022,-0.05829


In [27]:
corrs = corr_matrix['price']
corrs

host_response_rate                              0.00649
host_acceptance_rate                            0.03299
host_is_superhost                                   NaN
host_listings_count                             0.08155
host_total_listings_count                       0.08155
host_has_profile_pic                                NaN
host_identity_verified                              NaN
accommodates                                    0.52443
bathrooms                                       0.32857
bedrooms                                        0.46218
beds                                            0.40576
price                                           1.00000
minimum_nights                                 -0.08205
maximum_nights                                 -0.00102
minimum_minimum_nights                         -0.07308
maximum_minimum_nights                         -0.00728
minimum_maximum_nights                          0.06550
maximum_maximum_nights                          

In [28]:
exclude = ['price']
corrs_sorted = corrs.sort_values(ascending = False)
corrs_sorted.drop(exclude, axis = 0, inplace = True)
corrs_sorted

accommodates                                    0.52443
bedrooms                                        0.46218
beds                                            0.40576
room_type_Entire home/apt                       0.35682
bathrooms                                       0.32857
neighbourhood_group_cleansed_Manhattan          0.24194
availability_60                                 0.15246
availability_90                                 0.14833
availability_30                                 0.14538
room_type_Hotel room                            0.12797
availability_365                                0.12420
maximum_maximum_nights                          0.11148
review_scores_location                          0.09886
maximum_nights_avg_ntm                          0.08395
review_scores_cleanliness                       0.08350
host_total_listings_count                       0.08155
host_listings_count                             0.08155
minimum_maximum_nights                          

In [29]:
relevant_features = corrs_sorted[abs(corrs_sorted) > 0.1].index.tolist()
relevant_features

['accommodates',
 'bedrooms',
 'beds',
 'room_type_Entire home/apt',
 'bathrooms',
 'neighbourhood_group_cleansed_Manhattan',
 'availability_60',
 'availability_90',
 'availability_30',
 'room_type_Hotel room',
 'availability_365',
 'maximum_maximum_nights',
 'host_acceptance_rate_na',
 'neighbourhood_group_cleansed_Brooklyn',
 'host_response_rate_na',
 'neighbourhood_group_cleansed_Queens',
 'room_type_Private room']
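
The correlation filter used above can be sketched on a toy frame. The data below is invented for illustration: `accommodates` is perfectly correlated with `price`, while `noise` is constructed to have zero correlation with it, so only the first survives the 0.1 threshold.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame(
    {
        "price": [100.0, 150.0, 200.0, 250.0, 300.0],
        "accommodates": [1, 2, 3, 4, 5],
        "noise": [3, 1, 2, 1, 3],
    }
)

# Correlation of every feature with the label, excluding the label itself.
corrs = toy.corr()["price"].drop("price")

# Keep features whose absolute correlation exceeds the threshold.
selected = corrs[corrs.abs() > 0.1].index.tolist()
print(selected)
```

Taking the absolute value keeps strongly negative correlations as well as strongly positive ones. Pearson correlation only captures linear relationships, so a feature dropped by this filter can still be useful to a non-linear model.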

In [30]:
all_columns = ['price'] + relevant_features
df = df[all_columns]
df.head()

Unnamed: 0,price,accommodates,bedrooms,beds,room_type_Entire home/apt,bathrooms,neighbourhood_group_cleansed_Manhattan,availability_60,availability_90,availability_30,room_type_Hotel room,availability_365,maximum_maximum_nights,host_acceptance_rate_na,neighbourhood_group_cleansed_Brooklyn,host_response_rate_na,neighbourhood_group_cleansed_Queens,room_type_Private room
0,150.0,1,1.329708,1.0,1,1.0,1,33,63,3,0,338,1125.0,0,0,0,0,0
1,75.0,3,1.0,3.0,1,1.0,0,6,18,3,0,194,730.0,0,1,0,0,0
2,275.0,4,2.0,2.0,1,1.5,0,3,12,3,0,123,1125.0,0,1,0,0,0
3,68.0,2,1.0,1.0,0,1.0,1,16,34,1,0,192,14.0,0,0,0,0,1
4,75.0,1,1.0,1.0,0,1.0,1,0,0,0,0,0,14.0,1,0,1,0,1


In [31]:
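#standardize data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

# Note: the scaler is fit on every column, including the label 'price', and on
# all rows before the train/test split. Every error reported later is therefore
# in standardized price units rather than dollars.
scaled_df = scaler.fit_transform(df)
df = pd.DataFrame(scaled_df, columns=all_columns)
df.head()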

Unnamed: 0,price,accommodates,bedrooms,beds,room_type_Entire home/apt,bathrooms,neighbourhood_group_cleansed_Manhattan,availability_60,availability_90,availability_30,room_type_Hotel room,availability_365,maximum_maximum_nights,host_acceptance_rate_na,neighbourhood_group_cleansed_Brooklyn,host_response_rate_na,neighbourhood_group_cleansed_Queens,room_type_Private room
0,-0.024752,-1.007673,0.0,-0.588233,0.892088,-0.337606,1.186421,0.656954,0.873381,-0.356395,-0.070093,1.494669,-0.041001,-0.810694,-0.819486,-0.855569,-0.40824,-0.859203
1,-0.577126,0.06747,-0.497128,1.28049,0.892088,-0.337606,-0.842871,-0.626587,-0.458538,-0.356395,-0.070093,0.458908,-0.041006,-0.810694,1.220276,-0.855569,-0.40824,-0.859203
2,0.895871,0.605041,1.010653,0.346128,0.892088,0.849692,-0.842871,-0.769202,-0.636127,-0.356395,-0.070093,-0.051779,-0.041001,-0.810694,1.220276,-0.855569,-0.40824,-0.859203
3,-0.628681,-0.470102,-0.497128,-0.588233,-1.120965,-0.337606,1.186421,-0.151201,0.015034,-0.572258,-0.070093,0.444523,-0.041014,-0.810694,-0.819486,-0.855569,-0.40824,1.163869
4,-0.577126,-1.007673,-0.497128,-0.588233,-1.120965,-0.337606,1.186421,-0.911818,-0.991305,-0.68019,-0.070093,-0.936492,-0.041014,1.233512,-0.819486,1.168813,-0.40824,1.163869
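
A common alternative to scaling the whole frame is to fit the scaler on the training features only and reuse it on the test features, which keeps test-set statistics out of preprocessing. The sketch below is not what this notebook does; it uses synthetic data and leaves the label unscaled, both of which are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features and label, standing in for the listings data.
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y = X[:, 0] * 2.0 + rng.normal(size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.10, random_state=1234
)

# Learn the mean and standard deviation from the training rows only...
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
# ...and apply the same transformation to the held-out rows.
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.mean(axis=0).round(6))
print(X_test_scaled.shape)
```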


## Part 4: Define Your Project Plan

Now that you understand your data, in the markdown cell below, define your plan to implement the remaining phases of the machine learning life cycle (data preparation, modeling, evaluation) to solve your ML problem. Answer the following questions:

* Do you have a new feature list? If so, what are the features that you chose to keep and remove after inspecting the data? 
* Explain different data preparation techniques that you will use to prepare your data for modeling.
* What is your model (or models)?
* Describe your plan to train your model, analyze its performance and then improve the model. That is, describe your model building, validation and selection plan to produce a model that generalizes well to new data. 

1. The new feature list I chose to keep: ['accommodates',
 'bedrooms',
 'beds',
 'room_type_Entire home/apt',
 'bathrooms',
 'neighbourhood_group_cleansed_Manhattan',
 'availability_60',
 'availability_90',
 'availability_30',
 'room_type_Hotel room',
 'availability_365',
 'maximum_maximum_nights',
 'host_acceptance_rate_na',
 'neighbourhood_group_cleansed_Brooklyn',
 'host_response_rate_na',
 'neighbourhood_group_cleansed_Queens',
 'room_type_Private room']

2. The data preparation techniques I chose to perform are handling missing values, handling outliers, one-hot encoding categorical columns, binary encoding, data standardization, and feature engineering.
3. I do not yet know which model is best suited for this project, so I plan to do model selection with the following models: linear regression, decision trees, and random forest.
4. To train, analyze, and improve my model, I plan to split my data into training/validation/testing sets, use a grid search to compare candidate models on different evaluation metrics, and evaluate the chosen model's accuracy/loss.
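
A training/validation/testing split like the one described in the plan can be produced with two calls to `train_test_split`. The sketch below uses placeholder arrays and a 60/20/20 split; both the data and the proportions are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 100 rows, 2 features.
X = np.arange(200).reshape(100, 2)
y = np.arange(100)

# First hold out 20% of all rows as the test set.
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.20, random_state=1234
)

# Then take 25% of the remaining 80 rows (20 rows) as the validation set.
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=1234
)

print(len(X_train), len(X_val), len(X_test))
```

When `GridSearchCV` is used with `cv = 5`, cross-validation on the training set plays the role of the validation set, so a separate validation split is optional.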

## Part 5: Implement Your Project Plan

<b>Task:</b> In the code cell below, import additional packages that you have used in this course that you will need to implement your project plan.

In [32]:
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
# Regression metrics; classification metrics are not needed for this problem.
from sklearn.metrics import mean_squared_error, r2_score

<b>Task:</b> Use the rest of this notebook to carry out your project plan. 

You will:

1. Prepare your data for your model.
2. Fit your model to the training data and evaluate your model.
3. Improve your model's performance by performing model selection and/or feature selection techniques to find best model for your problem.

Add code cells below and populate the notebook with commentary, code, analyses, results, and figures as you see fit. 

In [33]:
y = df['price']
X = df.drop(columns = ['price'], axis = 1)

In [34]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.10, random_state = 1234)

In [35]:
print(X_train.shape)
print(X_test.shape)

(25219, 17)
(2803, 17)


In [36]:
#perform grid search for gb:
#grid search to find best hyperparameter
hyperparams_est_gb = [50, 100, 150]

hyperparams_depth_gb = [2, 6, 10]

# Create parameter grid.
param_grid_gb={'n_estimators':hyperparams_est_gb, 'max_depth':hyperparams_depth_gb}


In [42]:
#grid search for gb:
print('Running Grid Search...')

gb_model = GradientBoostingRegressor(random_state = 1234)

grid_gb = GridSearchCV(gb_model, param_grid_gb, cv = 5, scoring = 'neg_mean_squared_error')

grid_search_gb = grid_gb.fit(X_train, y_train)

print('Done!')

Running Grid Search...
Done!


In [43]:
print('Optimal hyperparameters Gradient Boosting: {0}'.format(grid_search_gb.best_params_))

# best_score_ is the negated MSE (higher is better), so flip the sign to report MSE.
print('GB cross-validated MSE: {0}'.format(-grid_search_gb.best_score_))

Optimal hyperparameters Gradient Boosting: {'max_depth': 6, 'n_estimators': 100}
GB cross-validated MSE: 0.4546065545670167
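
scikit-learn scorers follow a "higher is better" convention, so the `neg_mean_squared_error` scorer returns the MSE multiplied by -1. The sketch below shows the sign convention and how to turn the scores into an RMSE; the synthetic data and the use of `LinearRegression` are assumptions made to keep the example fast.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=100)

# Each fold's score is the negated MSE, so every entry is <= 0.
scores = cross_val_score(
    LinearRegression(), X, y, cv=5, scoring="neg_mean_squared_error"
)

# Flip the sign to recover the MSE, then take the root for RMSE.
mse_per_fold = -scores
rmse = np.sqrt(mse_per_fold.mean())

print("neg MSE per fold:", scores.round(4))
print("RMSE:", round(float(rmse), 4))
```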


In [44]:
linear_model = LinearRegression()
linear_model.fit(X_train, y_train)
y_pred = linear_model.predict(X_test)
# mean_squared_error returns the MSE, not R2, so name and label it accordingly.
lr_mse = mean_squared_error(y_test, y_pred)
print(f"Linear Regression MSE: {lr_mse}")

Linear Regression MSE: 0.5694999989539666


In [46]:
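# The GB model has the lower MSE (about 0.455 cross-validated on the training set,
# versus about 0.569 on the test set for linear regression), so refit it with the
# best hyperparameters found by the grid search.
model = GradientBoostingRegressor(n_estimators = 100, max_depth = 6, random_state = 1234)
model.fit(X_train, y_train)
model_pred = model.predict(X_test)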

print('End')

End
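
A natural last step is to score the final model's predictions on the held-out test set, so that it can be compared with linear regression on the same rows. The sketch below shows the pattern on synthetic data; the data, the `max_depth = 3` setting, and the variable names are assumptions made for illustration, and no result from the Airbnb data is implied.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data with a non-linear term.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = 2.0 * X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.1, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.10, random_state=1234
)

model = GradientBoostingRegressor(n_estimators=100, max_depth=3, random_state=1234)
model.fit(X_train, y_train)
pred = model.predict(X_test)

# Report both metrics on the same held-out rows.
test_mse = mean_squared_error(y_test, pred)
test_r2 = r2_score(y_test, pred)
print(f"Test MSE: {test_mse:.4f}")
print(f"Test R2:  {test_r2:.4f}")
```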
