# Lab 8: Implement Your Machine Learning Project Plan

In this lab assignment, you will implement the machine learning project plan you created in the written assignment. You will:

1. Load your data set and save it to a Pandas DataFrame.
2. Perform exploratory data analysis on your data to determine which feature engineering and data preparation techniques you will use.
3. Prepare your data for your model and create features and a label.
4. Fit your model to the training data and evaluate your model.
5. Improve your model by performing model selection and/or feature selection techniques to find the best model for your problem.

### Import Packages

Before you get started, import a few packages.

In [1]:
import pandas as pd
import numpy as np
import os 
import matplotlib.pyplot as plt
import seaborn as sns

<b>Task:</b> In the code cell below, import additional packages that you have used in this course that you will need for this task.

In [28]:
# YOUR CODE HERE
# scikit-learn: imputation, metrics, splitting, model, and encoders
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
# SciPy: statistics helpers and winsorization
import scipy.stats as stats
from scipy.stats.mstats import winsorize

## Part 1: Load the Data Set


You have chosen to work with one of four data sets. The data sets are located in a folder named "data." The file names of the four data sets are as follows:

* The "adult" data set that contains Census information from 1994 is located in file `adultData.csv`
* The Airbnb NYC "listings" data set is located in file `airbnbListingsData.csv`
* The World Happiness Report (WHR) data set is located in file `WHR2018Chapter2OnlineData.csv`
* The book review data set is located in file `bookReviewsData.csv`



<b>Task:</b> In the code cell below, use the same method you have been using to load your data using `pd.read_csv()` and save it to DataFrame `df`.

In [3]:
# YOUR CODE HERE
# File names of the four data sets
adultDataSet_filename = os.path.join(os.getcwd(), "data", "adultData.csv")
airbnbDataSet_filename = os.path.join(os.getcwd(), "data", "airbnbListingsData.csv")
WHRDataSet_filename = os.path.join(os.getcwd(), "data", "WHR2018Chapter2OnlineData.csv")
bookReviewDataSet_filename = os.path.join(os.getcwd(), "data", "bookReviewsData.csv")


df = pd.read_csv(airbnbDataSet_filename) # YOUR CODE HERE

df.head()

Unnamed: 0,name,description,neighborhood_overview,host_name,host_location,host_about,host_response_rate,host_acceptance_rate,host_is_superhost,host_listings_count,...,review_scores_communication,review_scores_location,review_scores_value,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month,n_host_verifications
0,Skylit Midtown Castle,"Beautiful, spacious skylit studio in the heart...",Centrally located in the heart of Manhattan ju...,Jennifer,"New York, New York, United States",A New Yorker since 2000! My passion is creatin...,0.8,0.17,True,8.0,...,4.79,4.86,4.41,False,3,3,0,0,0.33,9
1,"Whole flr w/private bdrm, bath & kitchen(pls r...","Enjoy 500 s.f. top floor in 1899 brownstone, w...",Just the right mix of urban center and local n...,LisaRoxanne,"New York, New York, United States",Laid-back Native New Yorker (formerly bi-coast...,0.09,0.69,True,1.0,...,4.8,4.71,4.64,False,1,1,0,0,4.86,6
2,"Spacious Brooklyn Duplex, Patio + Garden",We welcome you to stay in our lovely 2 br dupl...,,Rebecca,"Brooklyn, New York, United States","Rebecca is an artist/designer, and Henoch is i...",1.0,0.25,True,1.0,...,5.0,4.5,5.0,False,1,1,0,0,0.02,3
3,Large Furnished Room Near B'way,Please don’t expect the luxury here just a bas...,"Theater district, many restaurants around here.",Shunichi,"New York, New York, United States",I used to work for a financial industry but no...,1.0,1.0,True,1.0,...,4.42,4.87,4.36,False,1,0,1,0,3.68,4
4,Cozy Clean Guest Room - Family Apt,"Our best guests are seeking a safe, clean, spa...",Our neighborhood is full of restaurants and ca...,MaryEllen,"New York, New York, United States",Welcome to family life with my oldest two away...,,,True,1.0,...,4.95,4.94,4.92,False,1,0,1,0,0.87,7


In [4]:
# Shape of the data set before any modification
df.shape

(28022, 50)

In [5]:
# Take a random third (33%) of the rows for the exploration that follows
percentage = 0.33
# Total number of rows in the full data set
num_rows = df.shape[0]
# Randomly pick 33% of the row labels without replacement (the remaining ~67% are left out)
indices = np.random.choice(df.index, size=int(percentage*num_rows), replace=False)
# Create the new DataFrame subset
df_subset = df.loc[indices]

In [6]:
# Shape of the full data set
print(df.shape)
# Shape of the 33% row subset
df_subset.shape

(28022, 50)


(9247, 50)
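The random subset above changes every time the cell is re-run because no seed is set. As a minimal sketch of a reproducible alternative, the block below uses `DataFrame.sample` with a fixed `random_state`. The `toy` DataFrame and its column values are hypothetical stand-ins, not the Airbnb data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the listings DataFrame (not the real data)
toy = pd.DataFrame({"price": np.arange(100), "beds": np.arange(100) % 4})

# Draw 33% of the rows without replacement; random_state fixes the draw
subset_a = toy.sample(frac=0.33, random_state=42)
subset_b = toy.sample(frac=0.33, random_state=42)

print(subset_a.shape)
# The same seed selects the same rows on every run
print(subset_a.index.equals(subset_b.index))
```

With a fixed seed, downstream row counts (such as the size of the superhost subset) stay the same across runs, which makes results easier to compare.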

In [7]:
# Build a filter for reliable hosts
# I am using 5 listings as the minimum cutoff (host_listings_count > 4) so that hosts with only one or a few listings are excluded, which reduces one-off experiences

# The filter is applied to the subset and also requires the host to be a superhost
condition = (df_subset['host_listings_count'] > 4) & (df_subset['host_is_superhost'] == True)
condition


5905      True
10334    False
157      False
17085    False
7523     False
         ...  
24000     True
19490    False
27607    False
14261     True
2012     False
Length: 9247, dtype: bool

In [8]:
#now this is my superhost subset
superhost_subset = df_subset[condition]
superhost_subset.head()

Unnamed: 0,name,description,neighborhood_overview,host_name,host_location,host_about,host_response_rate,host_acceptance_rate,host_is_superhost,host_listings_count,...,review_scores_communication,review_scores_location,review_scores_value,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month,n_host_verifications
5905,"Super Quiet, Your Oasis",This 331 State Street studio features a queen ...,"Popular for its boutiques, restaurants and row...",Benjamin,"Brooklyn, New York, United States",,1.0,0.56,True,20.0,...,5.0,5.0,4.68,False,19,19,0,0,0.42,4
13917,Room super close to Midtown! 2 mins->subway,"Spacious private room in Woodside, Queens area...",Woodside is a community with a harmonious blen...,Kaz,"New York, New York, United States",Dear Airbnb guests!\n\nI manage furnished mont...,1.0,0.6,True,173.0,...,4.75,5.0,4.75,False,162,33,129,0,0.12,8
3679,"New City View 1BR Loft, Greenpoint",Beautiful 1 Bedroom Loft in the 100+ year old ...,,Vida,"New York, New York, United States","Enthusiastic, sociable, and creative.",0.93,0.99,True,52.0,...,5.0,4.5,5.0,False,45,44,0,0,0.04,5
26368,Spacious Gem - 1 Bedroom in Hell's Kitchen,"Location, location, location!<br />Our Spaciou...",One of the best features of our apartment is t...,Ignacio,US,,1.0,1.0,True,12.0,...,5.0,5.0,5.0,True,13,13,0,0,0.46,3
5738,2 BR/2BA UWS Luxury Apartment w Private Backyard,2 Bedroom 2 Baths Luxury Apartment with washer...,"This this bright, sunny high-end two bedrooms,...",Izi,"New York, New York, United States",Please fill in,1.0,0.81,True,8.0,...,4.83,5.0,4.83,False,8,8,0,0,0.1,7


In [9]:
superhost_subset.shape

(1669, 50)

## Part 2: Exploratory Data Analysis

The next step is to inspect and analyze your data set with your machine learning problem and project plan in mind. 

This step will help you determine data preparation and feature engineering techniques you will need to apply to your data to build a balanced modeling data set for your problem and model. These data preparation techniques may include:
* addressing missingness, such as replacing missing values with means
* renaming features and labels
* finding and replacing outliers
* performing winsorization if needed
* performing one-hot encoding on categorical features
* performing vectorization for an NLP problem
* addressing class imbalance in your data sample to promote fair AI


Think of the different techniques you have used to inspect and analyze your data in this course. These include using Pandas to apply data filters, using the Pandas `describe()` method to get insight into key statistics for each column, using the Pandas `dtypes` property to inspect the data type of each column, and using Matplotlib and Seaborn to detect outliers and visualize relationships between features and labels. If you are working on a classification problem, use techniques you have learned to determine if there is class imbalance.
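The inspection techniques named above can be sketched on a small example. The `toy` DataFrame below is hypothetical; only the column names are borrowed from the listings data, and the values are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data; values are invented for illustration
toy = pd.DataFrame({
    "price": [50.0, 75.0, 80.0, 1200.0, 65.0, 90.0],
    "room_type": ["Private room", "Entire home/apt", "Private room",
                  "Entire home/apt", "Private room", "Shared room"],
    "host_is_superhost": [True, False, False, False, True, False],
})

# Data type of each column
print(toy.dtypes)

# Key statistics for the numeric columns; a max far above the 75th
# percentile (as with price here) hints at outliers
print(toy.describe())

# Class balance of a candidate label: counts per class and the positive share
print(toy["host_is_superhost"].value_counts())
positive_share = toy["host_is_superhost"].mean()
print(positive_share)
```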


<b>Task</b>: Use the techniques you have learned in this course to inspect and analyze your data. 

<b>Note</b>: You can add code cells if needed by going to the <b>Insert</b> menu and clicking on <b>Insert Cell Below</b> in the drop-down menu.

In [10]:
# YOUR CODE HERE
# Check the data types of the columns that remain in the filtered data set
df = superhost_subset
df.dtypes

name                                             object
description                                      object
neighborhood_overview                            object
host_name                                        object
host_location                                    object
host_about                                       object
host_response_rate                              float64
host_acceptance_rate                            float64
host_is_superhost                                  bool
host_listings_count                             float64
host_total_listings_count                       float64
host_has_profile_pic                               bool
host_identity_verified                             bool
neighbourhood_group_cleansed                     object
room_type                                        object
accommodates                                      int64
bathrooms                                       float64
bedrooms                                        

In [11]:
# As seen in Part 1, I accidentally did the majority of the filtering already
df.shape

# All that is left to do is to find and replace outliers and perform one-hot encoding

(1669, 50)

In [12]:
# I tried notnull()-style filtering, but it turned out not to be needed

# Limit the effect of outliers in price and host_listings_count that might skew the data
# Instead of clipping at NumPy percentiles, I chose to winsorize the bottom and top 1%
# Note: df is a filtered slice of df_subset, so pandas still emits SettingWithCopyWarning below even though .loc is used
df.loc[:, 'price_win'] = stats.mstats.winsorize(df['price'], limits=[0.01, 0.01])
df.loc[:, 'host_listings_count_win'] = stats.mstats.winsorize(df['host_listings_count'], limits=[0.01, 0.01])

df.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[key] = _infer_fill_value(value)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  isetter(ilocs[0], value)


Unnamed: 0,name,description,neighborhood_overview,host_name,host_location,host_about,host_response_rate,host_acceptance_rate,host_is_superhost,host_listings_count,...,review_scores_value,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month,n_host_verifications,price_win,host_listings_count_win
5905,"Super Quiet, Your Oasis",This 331 State Street studio features a queen ...,"Popular for its boutiques, restaurants and row...",Benjamin,"Brooklyn, New York, United States",,1.0,0.56,True,20.0,...,4.68,False,19,19,0,0,0.42,4,162.0,20.0
13917,Room super close to Midtown! 2 mins->subway,"Spacious private room in Woodside, Queens area...",Woodside is a community with a harmonious blen...,Kaz,"New York, New York, United States",Dear Airbnb guests!\n\nI manage furnished mont...,1.0,0.6,True,173.0,...,4.75,False,162,33,129,0,0.12,8,50.0,173.0
3679,"New City View 1BR Loft, Greenpoint",Beautiful 1 Bedroom Loft in the 100+ year old ...,,Vida,"New York, New York, United States","Enthusiastic, sociable, and creative.",0.93,0.99,True,52.0,...,5.0,False,45,44,0,0,0.04,5,149.0,52.0
26368,Spacious Gem - 1 Bedroom in Hell's Kitchen,"Location, location, location!<br />Our Spaciou...",One of the best features of our apartment is t...,Ignacio,US,,1.0,1.0,True,12.0,...,5.0,True,13,13,0,0,0.46,3,90.0,12.0
5738,2 BR/2BA UWS Luxury Apartment w Private Backyard,2 Bedroom 2 Baths Luxury Apartment with washer...,"This this bright, sunny high-end two bedrooms,...",Izi,"New York, New York, United States",Please fill in,1.0,0.81,True,8.0,...,4.83,False,8,8,0,0,0.1,7,295.0,8.0
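The `SettingWithCopyWarning` shown above appears because the DataFrame being modified was produced by filtering another DataFrame. One common way to avoid it, sketched here under the assumption that the filtered data should be independent of its parent, is to call `.copy()` right after filtering. The `toy` data and the wide `limits` are hypothetical and chosen only so the effect is visible on a few rows.

```python
import numpy as np
import pandas as pd
from scipy.stats.mstats import winsorize

# Hypothetical toy data with one extreme price
toy = pd.DataFrame({
    "price": [10.0, 20.0, 30.0, 40.0, 5000.0],
    "host_listings_count": [1.0, 6.0, 8.0, 12.0, 200.0],
})

# .copy() makes the filtered result an independent DataFrame,
# so adding a column does not touch (or warn about) the parent
filtered = toy[toy["host_listings_count"] > 4].copy()

# winsorize returns a masked array; np.asarray converts it to a plain array
filtered["price_win"] = np.asarray(
    winsorize(filtered["price"].to_numpy(), limits=[0.25, 0.25])
)

print(filtered)
```

The parent `toy` DataFrame is left unchanged, and the extreme price in `filtered` is pulled in to the next-largest value.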


In [13]:
#Your code here 
nan_count = np.sum(df.isnull(), axis = 0)

condition = nan_count != 0 # look for all columns with missing values

In [14]:
col_names = nan_count[condition].index # get the column names
print(col_names)

nan_cols = list(col_names) # convert column names to list
print(nan_cols)

nan_col_types = df[nan_cols].dtypes
nan_col_types

Index(['description', 'neighborhood_overview', 'host_location', 'host_about',
       'host_response_rate', 'host_acceptance_rate', 'bedrooms', 'beds'],
      dtype='object')
['description', 'neighborhood_overview', 'host_location', 'host_about', 'host_response_rate', 'host_acceptance_rate', 'bedrooms', 'beds']


description               object
neighborhood_overview     object
host_location             object
host_about                object
host_response_rate       float64
host_acceptance_rate     float64
bedrooms                 float64
beds                     float64
dtype: object
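Once the columns with missing values and their types are known, the numeric ones can be filled with their column means, which is one of the techniques listed at the start of Part 2. The block below is a minimal sketch on hypothetical `toy` data; text columns are deliberately left untouched because a mean is not defined for them.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing values in numeric and text columns
toy = pd.DataFrame({
    "host_response_rate": [1.0, np.nan, 0.5],
    "bedrooms": [1.0, 2.0, np.nan],
    "host_about": ["Artist", None, "Engineer"],
})

# Count missing values per column
nan_count = toy.isnull().sum()

# Keep only numeric columns that actually have missing values
numeric_nan_cols = [
    c for c in toy.columns
    if nan_count[c] > 0 and pd.api.types.is_numeric_dtype(toy[c])
]

# Replace missing values with the column mean
for col in numeric_nan_cols:
    toy[col] = toy[col].fillna(toy[col].mean())

print(toy.isnull().sum())
```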

In [15]:
to_encode = list(df.select_dtypes(include=['object']).columns)
print(to_encode)

['name', 'description', 'neighborhood_overview', 'host_name', 'host_location', 'host_about', 'neighbourhood_group_cleansed', 'room_type', 'amenities']


In [16]:
#OHE
to_encode = ['name', 'description', 'neighborhood_overview', 'host_name', 'host_location', 'host_about', 'neighbourhood_group_cleansed', 'room_type', 'amenities']

# Apply one-hot encoding to the selected columns
df_encoded = pd.get_dummies(df, columns=to_encode, drop_first=True)  # drop_first=True to avoid multicollinearity


In [17]:
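# Note: this is the shape of df (50 original columns plus the 2 winsorized columns), not of df_encoded
df.shape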

(1669, 52)
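One-hot encoding free-text columns such as `name` or `description` creates roughly one column per row, since almost every value is unique. A possible alternative, sketched below, is to encode only low-cardinality columns and drop the rest. The `toy` data and the `max_levels` threshold are hypothetical choices for illustration, not values taken from this notebook.

```python
import pandas as pd

# Hypothetical toy data: one free-text column, one true categorical column
toy = pd.DataFrame({
    "name": ["Loft A", "Loft B", "Studio C", "Room D"],
    "room_type": ["Entire home/apt", "Private room",
                  "Private room", "Entire home/apt"],
    "price": [150.0, 60.0, 75.0, 210.0],
})

max_levels = 3  # hypothetical cutoff for "few enough categories"

# Non-numeric, non-boolean columns are the candidates for encoding
text_cols = toy.select_dtypes(exclude=["number", "bool"]).columns

# Split candidates by number of distinct values
low_card = [c for c in text_cols if toy[c].nunique() <= max_levels]
high_card = [c for c in text_cols if c not in low_card]

# Drop high-cardinality text and one-hot encode the rest
encoded = pd.get_dummies(toy.drop(columns=high_card), columns=low_card)

print(encoded.columns.tolist())
```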

## Part 3: Implement Your Project Plan

<b>Task:</b> Use the rest of this notebook to carry out your project plan. You will:

1. Prepare your data for your model and create features and a label.
2. Fit your model to the training data and evaluate your model.
3. Improve your model by performing model selection and/or feature selection techniques to find best model for your problem.


Add code cells below and populate the notebook with commentary, code, analyses, results, and figures as you see fit.

In [18]:
# YOUR CODE HERE
#KNN prep
print(df.shape)

df.sample(n=10, replace=False, random_state=1)

(1669, 52)


Unnamed: 0,name,description,neighborhood_overview,host_name,host_location,host_about,host_response_rate,host_acceptance_rate,host_is_superhost,host_listings_count,...,review_scores_value,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month,n_host_verifications,price_win,host_listings_count_win
3080,Cozy & Clean #4,"One large room with 1 queen bed, 1 twin bed, a...",We are located in the wonderful neighborhood o...,Fatou,"New York, New York, United States",I am a retired Inspector from the New York Cit...,,,True,7.0,...,4.8,True,6,0,6,0,0.53,5,150.0,7.0
23050,Luxury Master Suite + Movie Theater,•••WELCOME•••<br /><br />Welcome to our one of...,,Peter,"Miami, Florida, United States",,0.96,0.64,True,18.0,...,5.0,False,16,14,2,0,0.19,6,274.0,18.0
9448,Emerald Suite @ Northern Lights Mansion.,Guests will be comfortable w. two twin beds & ...,"We are located in Central Harlem, about a mile...",Doungrat (Diane),"New York, New York, United States","As a global traveller, I enjoy music, foods, m...",0.97,0.58,True,8.0,...,4.91,False,9,0,9,0,0.8,3,155.0,8.0
11277,Sunny church view room in Harlem brownstone,Bright and sunny upper floor bedroom in renova...,"The house is located in central Harlem, across...",Judie And Steven,"New York, New York, United States","We love travel and historic homes, and have sp...",1.0,0.94,True,6.0,...,4.93,False,4,1,3,0,0.9,6,50.0,6.0
22309,Easy access to Manhattan : Nice Location apart...,"The apartment is located in Woodside, Queens.<...",The surroundings of the apartment are a commut...,Shogo,"Queens, New York, United States",,1.0,0.8,True,131.0,...,5.0,False,110,4,106,0,0.3,4,29.0,131.0
18264,"Modern Twin Room | Free GYM, coworking",SharedEasy is the community-oriented Coliving ...,Centered in one of Brooklyn’s finest area’s yo...,SharedEasy Coliving,US,SharedEasy is the only community-oriented Coli...,1.0,0.78,True,7.0,...,5.0,False,8,0,4,4,0.07,2,37.0,7.0
22463,Beautiful King Bed Hotel Room,Relax in out Timeless European Designed Rooms ...,,Justin,"New York, New York, United States",,0.99,1.0,True,107.0,...,4.55,True,105,0,105,0,2.61,4,296.0,107.0
16180,"1st Floor, Room # 8 (12' x 15')","6 to 10 min walk to subway ( 2,3,4, A & L trai...",,Aminul,"Queens, New York, United States",,0.7,0.97,True,13.0,...,5.0,False,9,0,9,0,0.59,2,39.0,13.0
27780,Near Park! Great Home base for active traveler...,"Located in beautiful South Harlem, a convenien...",,Jeff,"New York, New York, United States",I am a 36 year old young professional. In my s...,,,True,5.0,...,5.0,False,4,1,3,0,0.91,8,100.0,5.0
21966,Upper West Side Studio near Central Park/River...,This apartment is professionally managed by Fu...,The Upper West Side is an inviting neighborhoo...,Ken,"Westport, Connecticut, United States","I work for Furnished Quarters, the largest pro...",0.96,1.0,True,204.0,...,4.67,True,105,105,0,0,0.27,7,150.0,204.0


In [19]:
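# I originally planned to use host_listings_count as the label
# Note: only host_listings_count is dropped here, so X still contains the label column
# calculated_host_listings_count (and host_listings_count_win), which leaks the label into the features
X = df_encoded.drop(columns=['host_listings_count'])
y = df_encoded['calculated_host_listings_count']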

In [20]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
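When features and label are created, the label column should not remain among the features, otherwise the model can read the answer directly. The sketch below shows the pattern on hypothetical `toy` data with a made-up `label` column; the names are illustrative only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data; "label" stands in for the column to predict
toy = pd.DataFrame({
    "accommodates": [1, 2, 2, 4, 6, 3, 2, 1, 5, 4],
    "price": [50.0, 80.0, 75.0, 150.0, 300.0, 120.0, 90.0, 45.0, 250.0, 160.0],
    "label": [0, 0, 0, 1, 1, 1, 0, 0, 1, 1],
})

label = "label"

# Features are every column except the label
X_toy = toy.drop(columns=[label])
y_toy = toy[label]

# Guard against accidental leakage before splitting
assert label not in X_toy.columns

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)
print(X_tr.shape, X_te.shape)
```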

In [21]:
n_neighbors = 5 

knn_classifier = KNeighborsClassifier(n_neighbors=n_neighbors)

In [22]:
# Handling Missing Values
mean_imputer = SimpleImputer(strategy='mean')
X_train_imputed = mean_imputer.fit_transform(X_train)
X_test_imputed = mean_imputer.transform(X_test)

In [23]:
X_train_imputed_winsorized = winsorize(X_train_imputed, limits=[0.01, 0.01])
X_test_imputed_winsorized = winsorize(X_test_imputed, limits=[0.01, 0.01])

knn_classifier.fit(X_train_imputed_winsorized, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                     weights='uniform')
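Called on a 2-D array without an `axis` argument, `winsorize` computes its limits over the whole array rather than per feature, and calling it separately on the test set derives the limits from the test data. A possible alternative, sketched below on hypothetical random data, is to compute per-column percentiles on the training set and clip both sets to them with NumPy. This is percentile clipping, a close relative of winsorization, not the same function.

```python
import numpy as np

# Hypothetical data: three features on very different scales
rng = np.random.default_rng(0)
scales = np.array([1.0, 10.0, 100.0])
X_train_toy = rng.normal(size=(200, 3)) * scales
X_test_toy = rng.normal(size=(50, 3)) * scales

# Per-column 1st and 99th percentiles, learned from the training set only
lower = np.percentile(X_train_toy, 1, axis=0)
upper = np.percentile(X_train_toy, 99, axis=0)

# Clip both sets to the training bounds (broadcast across rows)
X_train_clipped = np.clip(X_train_toy, lower, upper)
X_test_clipped = np.clip(X_test_toy, lower, upper)

print(lower, upper)
```

Because the bounds have one entry per column, a large-scale feature no longer determines the cutoffs for the small-scale ones.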

In [24]:
y_pred = knn_classifier.predict(X_test_imputed_winsorized)
acc = accuracy_score(y_test, y_pred)
print("Accuracy:", acc)

Accuracy: 0.05389221556886228


In [25]:
def train_test_knn(X_train, X_test, y_train, y_test, k):
    knn_classifier = KNeighborsClassifier(n_neighbors=k)
    knn_classifier.fit(X_train, y_train)
    
    y_pred = knn_classifier.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    return acc

In [26]:
k_values = [5, 50, 500]

k_tune = []

for k in k_values:
    score = train_test_knn(X_train_imputed_winsorized, X_test_imputed_winsorized, y_train, y_test, k)
    print('k=' + str(k) + ', accuracy score: ' + str(score))
    k_tune.append(float(score))

print("Accuracy scores:", k_tune)

k=5, accuracy score: 0.05389221556886228
k=50, accuracy score: 0.029940119760479042
k=500, accuracy score: 0.05389221556886228
Accuracy scores: [0.05389221556886228, 0.029940119760479042, 0.05389221556886228]


In [30]:
# Model selection attempt using k-fold cross-validation
n_splits = 5  # Number of folds
kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)

In [33]:
n_splits = 5  # Number of folds
kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)
acc_scores = []

for train_row_index, test_row_index in kf.split(X_train_imputed_winsorized):
    # our new partition of X_train and X_val
    X_train_new = X_train_imputed_winsorized[train_row_index] 
    X_val = X_train_imputed_winsorized[test_row_index]
    
    # our new partition of y_train and y_val
    y_train_new = y_train.iloc[train_row_index]
    y_val = y_train.iloc[test_row_index]
    
    # k is not set in this cell; it still holds the last value from the k_values loop above (500)
    knn_classifier = KNeighborsClassifier(n_neighbors=k)
    knn_classifier.fit(X_train_new, y_train_new)
    
    predictions = knn_classifier.predict(X_val)
     
    # accuracy_score expects (y_true, y_pred); the value is the same either way
    iteration_accuracy = accuracy_score(y_val, predictions)
    acc_scores.append(iteration_accuracy)
     
for i in range(len(acc_scores)):
    print('Accuracy score for iteration {0}: {1}'.format(i+1, acc_scores[i]))
    
avg_scores = sum(acc_scores) / n_splits
print('\nAverage accuracy score: {}'.format(avg_scores))


Accuracy score for iteration 1: 0.0299625468164794
Accuracy score for iteration 2: 0.07116104868913857
Accuracy score for iteration 3: 0.10486891385767791
Accuracy score for iteration 4: 0.033707865168539325
Accuracy score for iteration 5: 0.04868913857677903

Average accuracy score: 0.05767790262172284
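The cross-validation loop above scores a single value of `k`. To compare several values, one option is scikit-learn's `GridSearchCV`, which repeats the k-fold procedure for each candidate. The sketch below runs on synthetic data from `make_classification`, and the candidate grid is a hypothetical choice; it is not a rerun of this notebook's experiment.

```python
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Synthetic stand-in data (not the Airbnb listings)
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)

# Imputation lives inside the pipeline so it is refit on each training fold
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("knn", KNeighborsClassifier()),
])

# Hypothetical candidate values for the number of neighbors
param_grid = {"knn__n_neighbors": [3, 5, 11, 21]}

cv = KFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(pipe, param_grid, cv=cv, scoring="accuracy")
search.fit(X_toy, y_toy)

print(search.best_params_)
print(search.best_score_)
```

Keeping the preprocessing inside the pipeline means each validation fold is scored with statistics learned only from its own training folds.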
