# Assignment 8: Choose Your ML Problem and Data

In this unit's lab, you will implement a model to solve a machine learning problem of your choosing. First, you will have to make some decisions, such as which model to choose and which data preparation techniques may be necessary, and formulate a project plan accordingly. 

In this assignment, you will select a data set and choose a predictive problem that the data set supports. You will then inspect the data with your problem in mind and begin to formulate your project plan. You will write up this project plan in the written assignment that follows.


### Import Packages

Before you get started, import a few packages. You can also import any other packages from this course that you need for this task.

In [1]:
import pandas as pd
import numpy as np
import os 
import matplotlib.pyplot as plt
import seaborn as sns

## Step 1: Choose Your Data Set and Load the Data

You will have the option to choose one of four data sets that you have worked with in this program:

* The "adult" data set that contains Census information from 1994: `adultData.csv`
* Airbnb NYC "listings" data set: `airbnbListingsData.csv`
* World Happiness Report (WHR) data set: `WHR2018Chapter2OnlineData.csv`
* Book Review data set: `bookReviewsData.csv`

Note that these are variations of the data sets that you have worked with in this program. For example, some have not had all of the preprocessing that specific models require.

#### Load the Data Set

The code cell below contains filenames (path + filename) for each of the four data sets available to you.

<b>Task:</b> In the code cell below, load the data with `pd.read_csv()`, as you have been doing in this course, and save it to the DataFrame `df`.

You can load each file as a new DataFrame to inspect the data before choosing your data set.
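One way to compare the candidate files before committing to one is to load each in turn and tabulate its size. The sketch below is only an illustration: the helper `summarize_csvs` and its column names are hypothetical, and it assumes the four files sit in a `data` folder under the working directory, as the filenames in the next cell suggest. Files that are not found are skipped.

```python
import os
import pandas as pd

def summarize_csvs(paths):
    """Return one row per readable CSV: row count, column count, missing cells.

    `paths` maps a short data set name to a file path (hypothetical helper).
    """
    rows = []
    for name, path in paths.items():
        if not os.path.exists(path):
            # Skip files that are not present instead of raising an error
            continue
        frame = pd.read_csv(path, header=0)
        rows.append({
            "data_set": name,
            "n_rows": frame.shape[0],
            "n_columns": frame.shape[1],
            "n_missing": int(frame.isnull().sum().sum()),
        })
    return pd.DataFrame(rows, columns=["data_set", "n_rows", "n_columns", "n_missing"])

# Assumed location of the course data files
data_dir = os.path.join(os.getcwd(), "data")
candidate_paths = {
    "adult": os.path.join(data_dir, "adultData.csv"),
    "airbnb": os.path.join(data_dir, "airbnbListingsData.csv"),
    "WHR": os.path.join(data_dir, "WHR2018Chapter2OnlineData.csv"),
    "bookReviews": os.path.join(data_dir, "bookReviewsData.csv"),
}
print(summarize_csvs(candidate_paths))
```

A summary like this is only a first filter. Inspecting `head()` and `dtypes` for the data set you pick is still needed.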

In [2]:
# File names of the four data sets
adultDataSet_filename = os.path.join(os.getcwd(), "data", "adultData.csv")
airbnbDataSet_filename = os.path.join(os.getcwd(), "data", "airbnbListingsData.csv")
WHRDataSet_filename = os.path.join(os.getcwd(), "data", "WHR2018Chapter2OnlineData.csv")
bookReviewDataSet_filename = os.path.join(os.getcwd(), "data", "bookReviewsData.csv")


df = pd.read_csv(airbnbDataSet_filename, header = 0)

df.head()

Unnamed: 0,name,description,neighborhood_overview,host_name,host_location,host_about,host_response_rate,host_acceptance_rate,host_is_superhost,host_listings_count,...,review_scores_communication,review_scores_location,review_scores_value,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month,n_host_verifications
0,Skylit Midtown Castle,"Beautiful, spacious skylit studio in the heart...",Centrally located in the heart of Manhattan ju...,Jennifer,"New York, New York, United States",A New Yorker since 2000! My passion is creatin...,0.8,0.17,True,8.0,...,4.79,4.86,4.41,False,3,3,0,0,0.33,9
1,"Whole flr w/private bdrm, bath & kitchen(pls r...","Enjoy 500 s.f. top floor in 1899 brownstone, w...",Just the right mix of urban center and local n...,LisaRoxanne,"New York, New York, United States",Laid-back Native New Yorker (formerly bi-coast...,0.09,0.69,True,1.0,...,4.8,4.71,4.64,False,1,1,0,0,4.86,6
2,"Spacious Brooklyn Duplex, Patio + Garden",We welcome you to stay in our lovely 2 br dupl...,,Rebecca,"Brooklyn, New York, United States","Rebecca is an artist/designer, and Henoch is i...",1.0,0.25,True,1.0,...,5.0,4.5,5.0,False,1,1,0,0,0.02,3
3,Large Furnished Room Near B'way,Please don’t expect the luxury here just a bas...,"Theater district, many restaurants around here.",Shunichi,"New York, New York, United States",I used to work for a financial industry but no...,1.0,1.0,True,1.0,...,4.42,4.87,4.36,False,1,0,1,0,3.68,4
4,Cozy Clean Guest Room - Family Apt,"Our best guests are seeking a safe, clean, spa...",Our neighborhood is full of restaurants and ca...,MaryEllen,"New York, New York, United States",Welcome to family life with my oldest two away...,,,True,1.0,...,4.95,4.94,4.92,False,1,0,1,0,0.87,7


In [3]:
df.columns

Index(['name', 'description', 'neighborhood_overview', 'host_name',
       'host_location', 'host_about', 'host_response_rate',
       'host_acceptance_rate', 'host_is_superhost', 'host_listings_count',
       'host_total_listings_count', 'host_has_profile_pic',
       'host_identity_verified', 'neighbourhood_group_cleansed', 'room_type',
       'accommodates', 'bathrooms', 'bedrooms', 'beds', 'amenities', 'price',
       'minimum_nights', 'maximum_nights', 'minimum_minimum_nights',
       'maximum_minimum_nights', 'minimum_maximum_nights',
       'maximum_maximum_nights', 'minimum_nights_avg_ntm',
       'maximum_nights_avg_ntm', 'has_availability', 'availability_30',
       'availability_60', 'availability_90', 'availability_365',
       'number_of_reviews', 'number_of_reviews_ltm', 'number_of_reviews_l30d',
       'review_scores_rating', 'review_scores_cleanliness',
       'review_scores_checkin', 'review_scores_communication',
       'review_scores_location', 'review_scores_value',

In [4]:
df.dtypes

name                                             object
description                                      object
neighborhood_overview                            object
host_name                                        object
host_location                                    object
host_about                                       object
host_response_rate                              float64
host_acceptance_rate                            float64
host_is_superhost                                  bool
host_listings_count                             float64
host_total_listings_count                       float64
host_has_profile_pic                               bool
host_identity_verified                             bool
neighbourhood_group_cleansed                     object
room_type                                        object
accommodates                                      int64
bathrooms                                       float64
bedrooms                                        

In [5]:
df['calculated_host_listings_count'].unique()

array([  3,   1,   2,   4,   6,  30,  10,   5,  12,   7,   9,   8,  18,
        13,  20,  45,  34,  23,  11,  19,  51,  59, 142,  28, 177,  14,
       105,  16,  17,  44,  21,  83, 201, 180,  56,  24, 308, 162, 110,
       108,  15,  49,  46,  48,  27, 160, 421,  50,  25,  38,  22,  32,
        62,  66,  33,  31,  26,  36,  43])

In [6]:
type(df['host_is_superhost'])

pandas.core.series.Series

## Step 2: Choose Your Predictive Problem and Label 

Now that you have chosen your data set, you can: 

1. Choose what you would like to predict (i.e. the label) 
2. Identify your problem type: is it a classification or regression problem?

<b>Task:</b> In the markdown cell below, state what you are predicting (the label) and whether this is a classification or regression problem.

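I am choosing to predict `host_is_superhost`, which will be my label. Because this column is boolean (a host either is or is not a superhost), this is a binary classification problem.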
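Since `host_is_superhost` is a boolean column, it can help to check how balanced its two values are before planning a model. The sketch below uses a made-up four-row frame in place of the Airbnb data, and `label_balance` is a hypothetical helper, not part of the assignment.

```python
import pandas as pd

def label_balance(frame, label):
    """Return the share of rows for each label value, counting missing labels too."""
    return frame[label].value_counts(normalize=True, dropna=False)

# Toy stand-in for the Airbnb data; these values are invented for illustration
toy = pd.DataFrame({"host_is_superhost": [True, False, False, False]})
balance = label_balance(toy, "host_is_superhost")
print(balance)
```

On the real data the call would be `label_balance(df, "host_is_superhost")`. A strongly skewed result would suggest looking beyond plain accuracy when evaluating the model.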


<b>FEATURE ENGINEERING</b>

In [7]:
df_host = df[['host_location','host_response_rate', 'host_acceptance_rate','host_is_superhost',
             'host_listings_count','host_total_listings_count','host_identity_verified',
             'neighbourhood_group_cleansed','reviews_per_month','n_host_verifications',
             'calculated_host_listings_count','review_scores_rating','accommodates', 'bathrooms', 'bedrooms', 'beds']]

In [8]:
df_host.shape

(28022, 16)

In [9]:
df_host.dtypes

host_location                      object
host_response_rate                float64
host_acceptance_rate              float64
host_is_superhost                    bool
host_listings_count               float64
host_total_listings_count         float64
host_identity_verified               bool
neighbourhood_group_cleansed       object
reviews_per_month                 float64
n_host_verifications                int64
calculated_host_listings_count      int64
review_scores_rating              float64
accommodates                        int64
bathrooms                         float64
bedrooms                          float64
beds                              float64
dtype: object

In [10]:
df_host.select_dtypes(include = 'object')
# These two text columns would need to be one-hot encoded

Unnamed: 0,host_location,neighbourhood_group_cleansed
0,"New York, New York, United States",Manhattan
1,"New York, New York, United States",Brooklyn
2,"Brooklyn, New York, United States",Brooklyn
3,"New York, New York, United States",Manhattan
4,"New York, New York, United States",Manhattan
...,...,...
28017,"Queens, New York, United States",Queens
28018,"New York, New York, United States",Brooklyn
28019,US,Brooklyn
28020,"New York, New York, United States",Brooklyn


In [11]:
# Count the distinct values in each text column to see how many one-hot columns each would produce
col_tobe_encoded = [len(df_host['host_location'].unique()), len(df_host['neighbourhood_group_cleansed'].unique())]

col_tobe_encoded

[1365, 5]

In [20]:
# host_location has 1365 unique values, far too many to one-hot encode, and it is not
# an essential feature, so we drop it from our features
df_clean = df_host.drop(['host_location'], axis = 1)

In [21]:
df_clean.shape

(28022, 15)

In [22]:
df_clean.dtypes

host_response_rate                float64
host_acceptance_rate              float64
host_is_superhost                    bool
host_listings_count               float64
host_total_listings_count         float64
host_identity_verified               bool
neighbourhood_group_cleansed       object
reviews_per_month                 float64
n_host_verifications                int64
calculated_host_listings_count      int64
review_scores_rating              float64
accommodates                        int64
bathrooms                         float64
bedrooms                          float64
beds                              float64
dtype: object
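The one remaining text column, `neighbourhood_group_cleansed`, has only five distinct values, so one-hot encoding it is manageable. A minimal sketch with `pd.get_dummies` follows. The toy frame and the `borough` prefix are assumptions chosen for illustration, not values taken from the data set.

```python
import pandas as pd

# Toy stand-in for df_clean; the rows are invented for illustration
toy = pd.DataFrame({
    "neighbourhood_group_cleansed": ["Manhattan", "Brooklyn", "Queens", "Brooklyn"],
    "accommodates": [2, 4, 1, 3],
})

# Replace the text column with one 0/1 indicator column per distinct value
encoded = pd.get_dummies(
    toy,
    columns=["neighbourhood_group_cleansed"],
    prefix="borough",
    dtype=int,
)
print(encoded.columns.tolist())
```

The original text column is removed and the indicator columns are appended after the remaining columns. Applied to `df_clean`, the same call should add one column per borough.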

## Step 3: Inspect Your Data

In the code cell below, use some of the techniques you have learned in this course to take a look at your data. As you are investigating your data, consider the following to help you formulate your project plan:

1. What are my features?
2. Which model (or models) should I select that is appropriate for my machine learning problem and data?
3. Which data preparation techniques may be needed for my model (e.g. perform one-hot encoding)?
4. Which techniques should I use to evaluate my model's performance and improve my model?
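A quick per-column summary can inform several of these questions at once, for example which features have missing values and which text columns have few enough distinct values to encode. The sketch below is one possible approach. `inspect_frame` is a hypothetical helper and the toy frame only imitates a few Airbnb columns.

```python
import numpy as np
import pandas as pd

def inspect_frame(frame):
    """Summarize each column: dtype, number of missing values, number of distinct values."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "n_missing": frame.isnull().sum(),
        "n_unique": frame.nunique(),  # distinct non-missing values
    })

# Toy stand-in with invented values
toy = pd.DataFrame({
    "host_response_rate": [0.8, np.nan, 1.0, 1.0],
    "host_is_superhost": [True, True, False, False],
    "neighbourhood_group_cleansed": ["Manhattan", "Brooklyn", "Brooklyn", "Queens"],
})
summary = inspect_frame(toy)
print(summary)
```

Columns with many missing values may need imputation or removal, and `n_unique` gives a rough sense of how wide a one-hot encoding would be.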

Note: You will use this notebook to take a glimpse at your data to help you start making some considerations. In the written assignment you will outline your project plan, and in the lab assignment you will perform a deeper exploratory analysis of the data before implementing data preparation and feature engineering techniques.

<b>Task</b>: Use the techniques you have learned in this course to inspect your data.

<b>Note</b>: You can add code cells if needed by going to the <b>Insert</b> menu and clicking on <b>Insert Cell Below</b> in the drop-down menu.


In [14]:
# All of the analysis appears in the cells above