# Lab 8: Define and Solve an ML Problem of Your Choosing

In [1]:
import pandas as pd
import numpy as np
import os 
import matplotlib.pyplot as plt
import seaborn as sns

In this lab assignment, you will follow the machine learning life cycle and implement a model to solve a machine learning problem of your choosing. You will select a data set and choose a predictive problem that the data set supports. You will then inspect the data with your problem in mind and begin to formulate a project plan. You will then implement the machine learning project plan.

You will complete the following tasks:

1. Build Your DataFrame
2. Define Your ML Problem
3. Perform exploratory data analysis to understand your data.
4. Define Your Project Plan
5. Implement Your Project Plan:
    * Prepare your data for your model.
    * Fit your model to the training data and evaluate your model.
    * Improve your model's performance.

## Part 1: Build Your DataFrame

You will have the option to choose one of four data sets that you have worked with in this program:

* The "census" data set that contains Census information from 1994: `censusData.csv`
* Airbnb NYC "listings" data set: `airbnbListingsData.csv`
* World Happiness Report (WHR) data set: `WHR2018Chapter2OnlineData.csv`
* Book Review data set: `bookReviewsData.csv`

Note that these are variations of the data sets that you have worked with in this program. For example, some do not include some of the preprocessing necessary for specific models. 

#### Load a Data Set and Save it as a Pandas DataFrame

The code cell below contains filenames (path + filename) for each of the four data sets available to you.

<b>Task:</b> In the code cell below, use the same method you have been using to load the data using `pd.read_csv()` and save it to DataFrame `df`. 

You can load each file as a new DataFrame to inspect the data before choosing your data set.

In [2]:
# File names of the four data sets
adultDataSet_filename = os.path.join(os.getcwd(), "data", "censusData.csv")
airbnbDataSet_filename = os.path.join(os.getcwd(), "data", "airbnbListingsData.csv")
WHRDataSet_filename = os.path.join(os.getcwd(), "data", "WHR2018Chapter2OnlineData.csv")
bookReviewDataSet_filename = os.path.join(os.getcwd(), "data", "bookReviewsData.csv")

df = pd.read_csv(airbnbDataSet_filename)
print(df.shape)
print(df.columns)
df.head()

(28022, 50)
Index(['name', 'description', 'neighborhood_overview', 'host_name',
       'host_location', 'host_about', 'host_response_rate',
       'host_acceptance_rate', 'host_is_superhost', 'host_listings_count',
       'host_total_listings_count', 'host_has_profile_pic',
       'host_identity_verified', 'neighbourhood_group_cleansed', 'room_type',
       'accommodates', 'bathrooms', 'bedrooms', 'beds', 'amenities', 'price',
       'minimum_nights', 'maximum_nights', 'minimum_minimum_nights',
       'maximum_minimum_nights', 'minimum_maximum_nights',
       'maximum_maximum_nights', 'minimum_nights_avg_ntm',
       'maximum_nights_avg_ntm', 'has_availability', 'availability_30',
       'availability_60', 'availability_90', 'availability_365',
       'number_of_reviews', 'number_of_reviews_ltm', 'number_of_reviews_l30d',
       'review_scores_rating', 'review_scores_cleanliness',
       'review_scores_checkin', 'review_scores_communication',
       'review_scores_location', 'review_sc

Unnamed: 0,name,description,neighborhood_overview,host_name,host_location,host_about,host_response_rate,host_acceptance_rate,host_is_superhost,host_listings_count,...,review_scores_communication,review_scores_location,review_scores_value,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month,n_host_verifications
0,Skylit Midtown Castle,"Beautiful, spacious skylit studio in the heart...",Centrally located in the heart of Manhattan ju...,Jennifer,"New York, New York, United States",A New Yorker since 2000! My passion is creatin...,0.8,0.17,True,8.0,...,4.79,4.86,4.41,False,3,3,0,0,0.33,9
1,"Whole flr w/private bdrm, bath & kitchen(pls r...","Enjoy 500 s.f. top floor in 1899 brownstone, w...",Just the right mix of urban center and local n...,LisaRoxanne,"New York, New York, United States",Laid-back Native New Yorker (formerly bi-coast...,0.09,0.69,True,1.0,...,4.8,4.71,4.64,False,1,1,0,0,4.86,6
2,"Spacious Brooklyn Duplex, Patio + Garden",We welcome you to stay in our lovely 2 br dupl...,,Rebecca,"Brooklyn, New York, United States","Rebecca is an artist/designer, and Henoch is i...",1.0,0.25,True,1.0,...,5.0,4.5,5.0,False,1,1,0,0,0.02,3
3,Large Furnished Room Near B'way,Please don’t expect the luxury here just a bas...,"Theater district, many restaurants around here.",Shunichi,"New York, New York, United States",I used to work for a financial industry but no...,1.0,1.0,True,1.0,...,4.42,4.87,4.36,False,1,0,1,0,3.68,4
4,Cozy Clean Guest Room - Family Apt,"Our best guests are seeking a safe, clean, spa...",Our neighborhood is full of restaurants and ca...,MaryEllen,"New York, New York, United States",Welcome to family life with my oldest two away...,,,True,1.0,...,4.95,4.94,4.92,False,1,0,1,0,0.87,7


## Part 2: Define Your ML Problem

Next you will formulate your ML Problem. In the markdown cell below, answer the following questions:

1. List the data set you have chosen.
2. What will you be predicting? What is the label?
3. Is this a supervised or unsupervised learning problem? Is this a clustering, classification or regression problem? Is it a binary classification or multi-class classification problem?
4. What are your features? (note: this list may change after you explore your data)
5. Explain why this is an important problem. In other words, how would a company create value with a model that predicts this label?

1. I will be using the Airbnb dataset.
2. I would like to predict the overall rating of an Airbnb listing. The label column within the dataset will be a transformed version of the 'review_scores_rating' column or the overall review rating for a given listing. The transformed version will have integer values ranging from 0 to 5.
3. This is a supervised learning problem since I will have a target label to predict (rather than just trying to learn some pattern/grouping/clustering of the data). The project will also be a multiclass classification problem (classes being integer values 0 to 5 inclusive).
4. The features will include: 'host_response_rate', 'host_is_superhost', 'neighbourhood_group_cleansed', 'room_type', 'accommodates', 'bathrooms', 'bedrooms', 'beds', 'amenities', 'price', 'review_scores_cleanliness', 'review_scores_checkin', 'review_scores_communication', 'review_scores_location', and 'review_scores_value'. I selected these features based on intuition about what reviewers most likely consider when rating a listing. For example, the host's name is unlikely to influence a rating. Intuition is not foolproof, however, and other factors may also be influential.
5. Highly rated Airbnb listings are more attractive to customers because they build trust in the listing and in the experience customers can expect. Highly rated listings therefore generate more customer interaction and revenue for the platform. Customer feedback is not guaranteed, however. Customers rarely have an incentive to leave reviews, and providing one would be expensive for the company, so many listings have the potential to be highly rated but have not yet been. With a model that predicts which listings will be highly rated, we can recommend listings that customers are more likely to enjoy. With more positive experiences, more customers are likely to return and rely on the platform.

## Part 3: Understand Your Data

The next step is to perform exploratory data analysis. Inspect and analyze your data set with your machine learning problem in mind. Consider the following as you inspect your data:

1. What data preparation techniques would you like to use? These data preparation techniques may include:

    * addressing missingness, such as replacing missing values with means
    * finding and replacing outliers
    * renaming features and labels
    * performing feature engineering techniques such as one-hot encoding on categorical features
    * selecting appropriate features and removing irrelevant features
    * performing specific data cleaning and preprocessing techniques for an NLP problem
    * addressing class imbalance in your data sample to promote fair AI
    

2. What machine learning model (or models) would you like to use that is suitable for your predictive problem and data?
    * Are there other data preparation techniques that you will need to apply to build a balanced modeling data set for your problem and model? For example, will you need to scale your data?
 
 
3. How will you evaluate and improve the model's performance?
    * Are there specific evaluation metrics and methods that are appropriate for your model?
    

Think of the different techniques you have used to inspect and analyze your data in this course. These include using Pandas to apply data filters, using the Pandas `describe()` method to get insight into key statistics for each column, using the Pandas `dtypes` property to inspect the data type of each column, and using Matplotlib and Seaborn to detect outliers and visualize relationships between features and labels. If you are working on a classification problem, use techniques you have learned to determine if there is class imbalance.

<b>Task</b>: Use the techniques you have learned in this course to inspect and analyze your data. You can import additional packages that you have used in this course that you will need to perform this task.

<b>Note</b>: You can add code cells if needed by going to the <b>Insert</b> menu and clicking on <b>Insert Cell Below</b> in the drop-down menu.

In [3]:
df.describe()

Unnamed: 0,host_response_rate,host_acceptance_rate,host_listings_count,host_total_listings_count,accommodates,bathrooms,bedrooms,beds,price,minimum_nights,...,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month,n_host_verifications
count,16179.0,16909.0,28022.0,28022.0,28022.0,28022.0,25104.0,26668.0,28022.0,28022.0,...,28022.0,28022.0,28022.0,28022.0,28022.0,28022.0,28022.0,28022.0,28022.0,28022.0
mean,0.906901,0.791953,14.554778,14.554778,2.874491,1.142174,1.329708,1.629556,154.228749,18.689387,...,4.8143,4.808041,4.750393,4.64767,9.5819,5.562986,3.902077,0.048283,1.758325,5.16951
std,0.227282,0.276732,120.721287,120.721287,1.860251,0.421132,0.700726,1.097104,140.816605,25.569151,...,0.438603,0.464585,0.415717,0.518023,32.227523,26.121426,17.972386,0.442459,4.446143,2.028497
min,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,29.0,1.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.01,1.0
25%,0.94,0.68,1.0,1.0,2.0,1.0,1.0,1.0,70.0,2.0,...,4.81,4.81,4.67,4.55,1.0,0.0,0.0,0.0,0.13,4.0
50%,1.0,0.91,1.0,1.0,2.0,1.0,1.0,1.0,115.0,30.0,...,4.96,4.97,4.88,4.78,1.0,1.0,0.0,0.0,0.51,5.0
75%,1.0,1.0,3.0,3.0,4.0,1.0,1.0,2.0,180.0,30.0,...,5.0,5.0,5.0,5.0,3.0,1.0,1.0,0.0,1.83,7.0
max,1.0,1.0,3387.0,3387.0,16.0,8.0,12.0,21.0,1000.0,1250.0,...,5.0,5.0,5.0,5.0,421.0,308.0,359.0,8.0,141.0,13.0


In [4]:
label_to_transform = ['review_scores_rating']
features = ['host_response_rate', 'host_is_superhost', 'host_has_profile_pic', 'host_identity_verified', 'neighbourhood_group_cleansed', 'room_type', 'accommodates', 'bathrooms', 'bedrooms', 'beds', 'amenities', 'price', 'review_scores_cleanliness', 'review_scores_checkin', 'review_scores_communication', 'review_scores_location', 'review_scores_value']
print(features)

['host_response_rate', 'host_is_superhost', 'host_has_profile_pic', 'host_identity_verified', 'neighbourhood_group_cleansed', 'room_type', 'accommodates', 'bathrooms', 'bedrooms', 'beds', 'amenities', 'price', 'review_scores_cleanliness', 'review_scores_checkin', 'review_scores_communication', 'review_scores_location', 'review_scores_value']


In [5]:
df.dtypes

name                                             object
description                                      object
neighborhood_overview                            object
host_name                                        object
host_location                                    object
host_about                                       object
host_response_rate                              float64
host_acceptance_rate                            float64
host_is_superhost                                  bool
host_listings_count                             float64
host_total_listings_count                       float64
host_has_profile_pic                               bool
host_identity_verified                             bool
neighbourhood_group_cleansed                     object
room_type                                        object
accommodates                                      int64
bathrooms                                       float64
bedrooms                                        

In [6]:
df[features].dtypes

host_response_rate              float64
host_is_superhost                  bool
host_has_profile_pic               bool
host_identity_verified             bool
neighbourhood_group_cleansed     object
room_type                        object
accommodates                      int64
bathrooms                       float64
bedrooms                        float64
beds                            float64
amenities                        object
price                           float64
review_scores_cleanliness       float64
review_scores_checkin           float64
review_scores_communication     float64
review_scores_location          float64
review_scores_value             float64
dtype: object

In [7]:
to_encode = df[features].select_dtypes(include='object').columns
to_encode

Index(['neighbourhood_group_cleansed', 'room_type', 'amenities'], dtype='object')

In [8]:
nan = df[features].isnull().sum(axis=0)
nan

host_response_rate              11843
host_is_superhost                   0
host_has_profile_pic                0
host_identity_verified              0
neighbourhood_group_cleansed        0
room_type                           0
accommodates                        0
bathrooms                           0
bedrooms                         2918
beds                             1354
amenities                           0
price                               0
review_scores_cleanliness           0
review_scores_checkin               0
review_scores_communication         0
review_scores_location              0
review_scores_value                 0
dtype: int64
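Part 3 also suggests checking for class imbalance when working on a classification problem. The following is a minimal sketch of such a check, assuming the round-to-integer label transformation described in Part 2. The `ratings` values are made up for illustration and are not taken from the data set.

```python
import pandas as pd

# Hypothetical ratings standing in for df['review_scores_rating'].
ratings = pd.Series([4.7, 4.45, 5.0, 4.21, 4.91, 3.4, 4.8, 0.0, 4.6, 2.2])

# Round each rating to the nearest integer class (0 to 5).
labels = ratings.round().astype('int64')

# Share of examples in each class, most frequent class first.
class_share = labels.value_counts(normalize=True)
print(class_share)

# Accuracy a model would reach by always predicting the most frequent class.
majority_baseline = class_share.max()
print('majority-class baseline accuracy:', majority_baseline)
```

On the real data, `ratings` would be replaced by `df['review_scores_rating']`. If one class dominates, test accuracy is easier to interpret when compared against this majority-class baseline.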

## Part 4: Define Your Project Plan

Now that you understand your data, in the markdown cell below, define your plan to implement the remaining phases of the machine learning life cycle (data preparation, modeling, evaluation) to solve your ML problem. Answer the following questions:

* Do you have a new feature list? If so, what are the features that you chose to keep and remove after inspecting the data? 
* Explain different data preparation techniques that you will use to prepare your data for modeling.
* What is your model (or models)?
* Describe your plan to train your model, analyze its performance and then improve the model. That is, describe your model building, validation and selection plan to produce a model that generalizes well to new data. 

I will also add the features 'host_has_profile_pic' and 'host_identity_verified' to the features list. Upon re-evaluation, these factors may contribute to the likeability or approachability of a host on Airbnb which could affect their rating/perception from customers.

There are a significant number of null values in 'host_response_rate', 'bedrooms', and 'beds' that I will need to address and either replace (for example with some median value or 0) or remove the specific examples with null values.

To add complexity to my problem, I would like to train and test multiple models to compare their performances: logistic regression model, decision tree, random forest, gradient boosted trees, and a neural network.

I will take an iterative approach to evaluating and improving my models. For each model, I will try different combinations of hyperparameters and then compare the tuned models against one another, with the aim of selecting a model that generalizes well to new data.

## Part 5: Implement Your Project Plan

<b>Task:</b> In the code cell below, import additional packages that you have used in this course that you will need to implement your project plan.

In [9]:
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.base import clone

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

from sklearn.metrics import accuracy_score, confusion_matrix, precision_recall_curve


<b>Task:</b> Use the rest of this notebook to carry out your project plan. 

You will:

1. Prepare your data for your model.
2. Fit your model to the training data and evaluate your model.
3. Improve your model's performance by performing model selection and/or feature selection techniques to find best model for your problem.

Add code cells below and populate the notebook with commentary, code, analyses, results, and figures as you see fit. 

In [10]:
# features that have nan values (host_response_rate, bedrooms, beds)
nan_cols = nan[nan != 0].index

# replace nan host_response_rate with mean value
# (assign the result back rather than calling fillna(inplace=True) on a column
# selection, which newer pandas versions warn may not modify df)
mean_host_resp = df[nan_cols[0]].mean()
df[nan_cols[0]] = df[nan_cols[0]].fillna(mean_host_resp)

# replace nan bedrooms, beds with 0 
for nc in nan_cols[1:]:
    df[nc] = df[nc].fillna(0)

# check if any nan remaining in features
df[features].isnull().sum(axis=0)

host_response_rate              0
host_is_superhost               0
host_has_profile_pic            0
host_identity_verified          0
neighbourhood_group_cleansed    0
room_type                       0
accommodates                    0
bathrooms                       0
bedrooms                        0
beds                            0
amenities                       0
price                           0
review_scores_cleanliness       0
review_scores_checkin           0
review_scores_communication     0
review_scores_location          0
review_scores_value             0
dtype: int64
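The plan in Part 4 mentions two options for missing `bedrooms` and `beds` values: a median or 0. The sketch below compares the two on a small made-up column and also keeps a flag recording which rows were originally missing. The `toy` DataFrame and the new column names are hypothetical and are not part of the notebook's pipeline.

```python
import numpy as np
import pandas as pd

# Hypothetical column standing in for df['bedrooms'].
toy = pd.DataFrame({'bedrooms': [1.0, np.nan, 2.0, 1.0, np.nan, 3.0]})

# Record missingness before filling, so the information is not lost.
toy['bedrooms_was_missing'] = toy['bedrooms'].isnull().astype(int)

# Option 1: fill with the median of the observed values.
median_bedrooms = toy['bedrooms'].median()
toy['bedrooms_median'] = toy['bedrooms'].fillna(median_bedrooms)

# Option 2: fill with 0, as the cell above does.
toy['bedrooms_zero'] = toy['bedrooms'].fillna(0)

print(toy)
```

Filling with 0 treats a missing value as "no bedrooms", which may or may not be what the missing entries mean. The median keeps filled rows close to typical listings.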

In [11]:
# winsorize price column to address outliers
from scipy.stats.mstats import winsorize

print(df['price'].min(), df['price'].max())
df['price'] = winsorize(df['price'], limits=[0.01, 0.01])
print(df['price'].min(), df['price'].max())

29.0 1000.0
30.0 899.0
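To make the effect of winsorizing concrete, here is a rough NumPy-only sketch of the idea: the lowest and highest fractions of values are replaced by the nearest value that is kept. The helper `winsorize_sketch` and the `prices` list are hypothetical. The helper is written to mirror what `scipy.stats.mstats.winsorize` does with its default settings as I understand them, but it is not a substitute for that function.

```python
import numpy as np

def winsorize_sketch(values, lower=0.1, upper=0.1):
    """Clip the lowest `lower` and highest `upper` fractions of values."""
    arr = np.asarray(values, dtype=float)
    n = arr.size
    ordered = np.sort(arr)
    k_low = int(lower * n)    # number of values to replace at the low end
    k_high = int(upper * n)   # number of values to replace at the high end
    low_val = ordered[k_low]             # smallest value that is kept
    high_val = ordered[n - k_high - 1]   # largest value that is kept
    return np.clip(arr, low_val, high_val)

# Hypothetical prices with one very cheap and one very expensive listing.
prices = [29, 50, 70, 90, 115, 150, 180, 250, 400, 1000]
clipped = winsorize_sketch(prices, lower=0.1, upper=0.1)
print(clipped.min(), clipped.max())
```

With 10 values and 10% limits, one value at each end is replaced, so the extremes 29 and 1000 become 50 and 400.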


In [12]:
# see categorical features that need to be transformed 
for enc in to_encode: 
    print(enc, '\n', df[enc].unique(), '\n')

neighbourhood_group_cleansed 
 ['Manhattan' 'Brooklyn' 'Queens' 'Staten Island' 'Bronx'] 

room_type 
 ['Entire home/apt' 'Private room' 'Hotel room' 'Shared room'] 

amenities 
 ['["Extra pillows and blankets", "Baking sheet", "Luggage dropoff allowed", "TV", "Hangers", "Ethernet connection", "Long term stays allowed", "Carbon monoxide alarm", "Wifi", "Heating", "Dishes and silverware", "Air conditioning", "Free street parking", "Essentials", "Hot water", "Bathtub", "Kitchen", "Fire extinguisher", "Cooking basics", "Dedicated workspace", "Hair dryer", "Stove", "Smoke alarm", "Keypad", "Iron", "Oven", "Paid parking off premises", "Refrigerator", "Bed linens", "Cleaning before checkout", "Coffee maker"]'
 '["Extra pillows and blankets", "Luggage dropoff allowed", "Free parking on premises", "Pack \\u2019n play/Travel crib", "Microwave", "Hangers", "Lockbox", "Long term stays allowed", "Carbon monoxide alarm", "High chair", "Wifi", "Heating", "Shampoo", "Dishes and silverware", "Air cond

In [13]:
# one hot encode neighbourhood_group_cleansed and room_type (since low cardinality)
one_hot_enc = to_encode[:-1]

df_ohe = pd.get_dummies(df[features+label_to_transform], columns=one_hot_enc)
print(df_ohe.columns)
df_ohe.head()

Index(['host_response_rate', 'host_is_superhost', 'host_has_profile_pic',
       'host_identity_verified', 'accommodates', 'bathrooms', 'bedrooms',
       'beds', 'amenities', 'price', 'review_scores_cleanliness',
       'review_scores_checkin', 'review_scores_communication',
       'review_scores_location', 'review_scores_value', 'review_scores_rating',
       'neighbourhood_group_cleansed_Bronx',
       'neighbourhood_group_cleansed_Brooklyn',
       'neighbourhood_group_cleansed_Manhattan',
       'neighbourhood_group_cleansed_Queens',
       'neighbourhood_group_cleansed_Staten Island',
       'room_type_Entire home/apt', 'room_type_Hotel room',
       'room_type_Private room', 'room_type_Shared room'],
      dtype='object')


Unnamed: 0,host_response_rate,host_is_superhost,host_has_profile_pic,host_identity_verified,accommodates,bathrooms,bedrooms,beds,amenities,price,...,review_scores_rating,neighbourhood_group_cleansed_Bronx,neighbourhood_group_cleansed_Brooklyn,neighbourhood_group_cleansed_Manhattan,neighbourhood_group_cleansed_Queens,neighbourhood_group_cleansed_Staten Island,room_type_Entire home/apt,room_type_Hotel room,room_type_Private room,room_type_Shared room
0,0.8,True,True,True,1,1.0,0.0,1.0,"[""Extra pillows and blankets"", ""Baking sheet"",...",150.0,...,4.7,0,0,1,0,0,1,0,0,0
1,0.09,True,True,True,3,1.0,1.0,3.0,"[""Extra pillows and blankets"", ""Luggage dropof...",75.0,...,4.45,0,1,0,0,0,1,0,0,0
2,1.0,True,True,True,4,1.5,2.0,2.0,"[""Kitchen"", ""BBQ grill"", ""Cable TV"", ""Carbon m...",275.0,...,5.0,0,1,0,0,0,1,0,0,0
3,1.0,True,True,True,2,1.0,1.0,1.0,"[""Room-darkening shades"", ""Lock on bedroom doo...",68.0,...,4.21,0,0,1,0,0,0,0,1,0
4,0.906901,True,True,True,1,1.0,1.0,1.0,"[""Breakfast"", ""Carbon monoxide alarm"", ""Fire e...",75.0,...,4.91,0,0,1,0,0,0,0,1,0


In [14]:
import re 

# consolidate and encode amenities
print(df['amenities'].nunique())

# create a dictionary mapping each amenity to its occurrence frequency across dataset
amen_freq = {}
for list_am_str in df['amenities']: 
    # raw string avoids invalid-escape warnings; captures quoted names
    # that contain no quotes, brackets, newlines, digits, or commas
    list_am = re.findall(r'"([^"^\]^\n^0-9^,]*)"', list_am_str)
    for am in list_am: 
        amen_freq[am] = amen_freq.get(am, 0) + 1

# sort the amenities frequency dictionary in non-increasing order 
amen_freq_sorted = {k:v for k, v in sorted(amen_freq.items(), key=lambda item: item[1], reverse=True)}
print(len(amen_freq_sorted))

# get the top 15 most frequently occurring amenities
amen_freq_top = list(amen_freq_sorted.keys())[:15]

# create amenities one hot encoding columns and set values
# note: `amen in row['amenities']` is a substring test on the raw string,
# so a name also matches any longer amenity name that contains it
for amen in amen_freq_top: 
    features.append('amenities_'+amen)
    df_ohe['amenities_'+amen] = df_ohe.apply(lambda row: 1 if amen in row['amenities'] else 0, axis=1)
    
# remove original column 
df_ohe.drop(columns='amenities', inplace=True)
print(df_ohe.columns)
df_ohe.head()

25020
1691
Index(['host_response_rate', 'host_is_superhost', 'host_has_profile_pic',
       'host_identity_verified', 'accommodates', 'bathrooms', 'bedrooms',
       'beds', 'price', 'review_scores_cleanliness', 'review_scores_checkin',
       'review_scores_communication', 'review_scores_location',
       'review_scores_value', 'review_scores_rating',
       'neighbourhood_group_cleansed_Bronx',
       'neighbourhood_group_cleansed_Brooklyn',
       'neighbourhood_group_cleansed_Manhattan',
       'neighbourhood_group_cleansed_Queens',
       'neighbourhood_group_cleansed_Staten Island',
       'room_type_Entire home/apt', 'room_type_Hotel room',
       'room_type_Private room', 'room_type_Shared room', 'amenities_Wifi',
       'amenities_Essentials', 'amenities_Long term stays allowed',
       'amenities_Smoke alarm', 'amenities_Heating', 'amenities_Kitchen',
       'amenities_Air conditioning', 'amenities_Hangers',
       'amenities_Carbon monoxide alarm', 'amenities_Hair dryer',
  

Unnamed: 0,host_response_rate,host_is_superhost,host_has_profile_pic,host_identity_verified,accommodates,bathrooms,bedrooms,beds,price,review_scores_cleanliness,...,amenities_Kitchen,amenities_Air conditioning,amenities_Hangers,amenities_Carbon monoxide alarm,amenities_Hair dryer,amenities_Iron,amenities_Hot water,amenities_Shampoo,amenities_Dedicated workspace,amenities_Dishes and silverware
0,0.8,True,True,True,1,1.0,0.0,1.0,150.0,4.62,...,1,1,1,1,1,1,1,0,1,1
1,0.09,True,True,True,3,1.0,1.0,3.0,75.0,4.49,...,1,1,1,1,1,1,1,1,1,1
2,1.0,True,True,True,4,1.5,2.0,2.0,275.0,5.0,...,1,1,0,1,0,0,0,0,0,0
3,1.0,True,True,True,2,1.0,1.0,1.0,68.0,3.73,...,0,1,1,0,1,1,1,1,0,0
4,0.906901,True,True,True,1,1.0,1.0,1.0,75.0,4.82,...,0,1,0,1,1,0,1,1,1,0
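Because each `amenities` value appears to be a JSON-style list stored as a string, an alternative to regex extraction and substring matching is to parse the string and test exact membership. The sketch below is one possible approach, not the method used in the cell above. The `rows` list is made up for illustration.

```python
import json
from collections import Counter

# Hypothetical amenities strings in the same format as df['amenities'].
rows = [
    '["Wifi", "Cable TV", "Kitchen"]',
    '["Wifi", "TV", "Hot water kettle"]',
    '["Wifi", "Hot water"]',
]

# Parse each string into a set of amenity names.
parsed = [set(json.loads(r)) for r in rows]

# Count how many listings offer each amenity.
freq = Counter(amenity for amenities in parsed for amenity in amenities)
top = [name for name, _ in freq.most_common(1)]

# Exact membership: "TV" matches only listings that list "TV" itself.
has_tv_exact = [int("TV" in amenities) for amenities in parsed]

# Substring test on the raw string: "Cable TV" also counts as "TV".
has_tv_substring = [int("TV" in r) for r in rows]

print(top, has_tv_exact, has_tv_substring)
```

The two encodings differ for the first listing, which has "Cable TV" but not "TV". Which behavior is preferable depends on whether such variants should count as the same amenity.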


In [15]:
# extract features and label
y = df_ohe[label_to_transform]
X = df_ohe.drop(columns=label_to_transform)

# transform label, round to nearest integer 
y = y.round().astype('int64')
print(y.head())

# create training and test dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1234)
y_train = y_train.values.ravel()
y_test = y_test.values.ravel()

print('X_train: ', X_train.shape)
print('y_train: ', y_train.shape)
print('X_test: ', X_test.shape)
print('y_test: ', y_test.shape)

dataset = [X_train, X_test, y_train, y_test]

   review_scores_rating
0                     5
1                     4
2                     5
3                     4
4                     5
X_train:  (22417, 38)
y_train:  (22417,)
X_test:  (5605, 38)
y_test:  (5605,)
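If the rounded ratings turn out to be imbalanced, one option is to pass `stratify` to `train_test_split` so that the training and test sets keep the same class proportions. The sketch below uses a made-up label array (`y_toy`) with an 80/20 class split. It illustrates the parameter and is not a change to the split used above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 80 examples of class 5 and 20 examples of class 4.
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([5] * 80 + [4] * 20)

# stratify=y_toy keeps the class proportions equal in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=1234, stratify=y_toy
)

share_train = (y_tr == 4).mean()
share_test = (y_te == 4).mean()
print('share of class 4 in train:', share_train)
print('share of class 4 in test: ', share_test)
```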


In [16]:
def best_params(model, params_grid, X_train, y_train): 
    '''
    run 2-fold cross-validation grid search (cv=2) on a clone of the model and return the best hyperparameters (dict) 
    '''
    mod = clone(model)
    grid = GridSearchCV(mod, params_grid, cv=2)
    grid.fit(X_train, y_train)
    
    return grid.best_params_

def train_test(model, X_train, X_test, y_train, y_test):
    '''
    fit the model and return its accuracy on testing data
    '''
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    acc = accuracy_score(y_test, preds)
    return acc

def visualize_accuracy(params, scores): 
    '''
    create lineplot to show relationship between hyperparam values and accuracy scores
    '''
    sns.lineplot(x=params, y=scores)
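`best_params` returns only the winning hyperparameter combination. When comparing settings, it can also help to look at the cross-validated score of every combination, which `GridSearchCV` stores in `cv_results_`. The sketch below shows this on synthetic data from `make_classification`. The data, the grid, and `cv=5` are all assumptions chosen for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X_train, y_train.
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=0)

grid = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    {'max_depth': [1, 2, 4, 8]},
    cv=5,
)
grid.fit(X_toy, y_toy)

# One row per hyperparameter combination, with mean and spread of CV accuracy.
results = pd.DataFrame(grid.cv_results_)[
    ['param_max_depth', 'mean_test_score', 'std_test_score']
]
print(results.sort_values('mean_test_score', ascending=False))
print('best:', grid.best_params_, grid.best_score_)
```

The `param_max_depth` and `mean_test_score` columns are the kind of values that could be passed to `visualize_accuracy` to plot accuracy against a single hyperparameter.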

In [17]:
# Decision Tree Model
dt = DecisionTreeClassifier(max_depth=2, criterion='entropy')

depths = [2**i for i in range(6)]
leaves = [1, 5, 15, 20, 30]
splits = [2, 5, 10, 20, 35]
crit = ['gini', 'entropy', 'log_loss']

dt_params = {'max_depth': depths, 
             'min_samples_leaf': leaves, 
             'min_samples_split': splits, 
             'criterion': crit}

dt_best_p = best_params(dt, dt_params, X_train, y_train)
dt_best = DecisionTreeClassifier(**dt_best_p)
dt_acc = train_test(dt_best, *dataset)
print('dt accuracy: ', dt_acc)

dt accuracy:  0.8845673505798395


In [18]:
# K-Nearest Neighbors model
knn = KNeighborsClassifier()

kvals = [3*10**i for i in range(4)]
weights = ['uniform', 'distance']
leaves = [10*i for i in range(1,5)]
knn_params = {'n_neighbors': kvals, 
              'weights': weights, 
              'leaf_size': leaves}

knn_best_p = best_params(knn, knn_params, X_train, y_train)
knn_best = KNeighborsClassifier(**knn_best_p)
knn_acc = train_test(knn_best, *dataset)
print('knn accuracy: ', knn_acc)

knn accuracy:  0.8155218554861731
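Part 3 asks whether the data needs to be scaled. K-nearest neighbors relies on distances, so a feature with a large range (such as price) can dominate features with small ranges (such as review scores). The sketch below compares KNN with and without a `StandardScaler` on synthetic data. The feature names and the labeling rule are made up, and the outcome on the Airbnb data may differ.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 400

# Hypothetical features: an informative small-range score and a large-range price.
rating_like = rng.uniform(3.0, 5.0, size=n)
price_like = rng.uniform(30.0, 900.0, size=n)
X_toy = np.column_stack([rating_like, price_like])

# The label depends only on the small-range feature.
y_toy = (rating_like > 4.5).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.25, random_state=0)

# KNN on raw features versus KNN on standardized features.
unscaled = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
scaled = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)).fit(X_tr, y_tr)

acc_unscaled = unscaled.score(X_te, y_te)
acc_scaled = scaled.score(X_te, y_te)
print('unscaled accuracy:', acc_unscaled)
print('scaled accuracy:  ', acc_scaled)
```

Wrapping the scaler and the classifier in a pipeline means the scaler is fit only on the training portion of each split, including inside a grid search.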


In [19]:
# Gradient Boosted Tree
gb = GradientBoostingClassifier()

gb_params = {'n_estimators': [50, 100, 200, 300], 
             'learning_rate': [0.05, 0.1, 0.15]}

gb_best_p = best_params(gb, gb_params, X_train, y_train)
gb_best = GradientBoostingClassifier(**gb_best_p)
gb_acc = train_test(gb_best, *dataset)
print('gb accuracy: ', gb_acc)

gb accuracy:  0.8947368421052632


In [20]:
# Random Forest 
rf = RandomForestClassifier()

rf_params = {'n_estimators': [50, 100, 200, 300]}

rf_best_p = best_params(rf, rf_params, X_train, y_train)
rf_best = RandomForestClassifier(**rf_best_p)
rf_acc = train_test(rf_best, *dataset)
print('rf accuracy: ', rf_acc)

rf accuracy:  0.8924174843889384


In [21]:
# hyperparameter values of gradient boosted tree
print(gb_best_p)

{'learning_rate': 0.05, 'n_estimators': 100}


The gradient boosted tree classifier achieved the highest test-set accuracy (0.895), using a learning rate of 0.05 and 100 estimators. It was followed by the random forest (0.892), the decision tree (0.885), and K-nearest neighbors (0.816).
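Accuracy alone does not show which rating classes a model gets wrong. A confusion matrix, which the notebook imports but does not use, gives a per-class view. The sketch below uses short made-up label lists (`y_true_toy`, `y_pred_toy`), not the notebook's actual predictions.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true and predicted rating classes.
y_true_toy = [5, 5, 5, 4, 4, 3]
y_pred_toy = [5, 5, 4, 4, 5, 5]
class_labels = [3, 4, 5]

# Rows are true classes, columns are predicted classes, in class_labels order.
cm = confusion_matrix(y_true_toy, y_pred_toy, labels=class_labels)
print(cm)

# Recall per class: correct predictions divided by the number of true examples.
per_class_recall = cm.diagonal() / cm.sum(axis=1)
overall_acc = accuracy_score(y_true_toy, y_pred_toy)
print('per-class recall:', per_class_recall)
print('overall accuracy:', overall_acc)
```

In this toy case the overall accuracy is 0.5, yet the single class-3 example is never predicted correctly. Overall accuracy by itself hides that kind of per-class failure.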