# Dataquest Project: Introduction to K-Nearest Neighbors using an AirBnB dataset

From Wikipedia:

In pattern recognition, the k-nearest neighbors algorithm (k-NN) is a non-parametric method used for classification and regression. In both cases, the input consists of the k closest training examples in the feature space. The output depends on whether k-NN is used for classification or regression:

* In k-NN classification, the output is a class membership. An object is classified by a majority vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small). If k = 1, then the object is simply assigned to the class of that single nearest neighbor.

* In k-NN regression, the output is the property value for the object. This value is the average of the values of its k nearest neighbors.
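
As a minimal sketch of the two output rules above, assume the k = 5 nearest neighbors have already been found. The class labels and values below are made up for illustration:

```python
from collections import Counter

import numpy as np

# Hypothetical labels and target values of the k = 5 nearest neighbors.
neighbor_classes = ['house', 'apartment', 'apartment', 'house', 'apartment']
neighbor_values = np.array([100.0, 120.0, 80.0, 150.0, 90.0])

# Classification: the most common class among the neighbors wins.
predicted_class = Counter(neighbor_classes).most_common(1)[0][0]

# Regression: the prediction is the average of the neighbors' values.
predicted_value = neighbor_values.mean()

print(predicted_class)
print(predicted_value)
```

Here three of the five neighbors are apartments, so the vote returns that class, and the regression output is the plain mean of the five values.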


From Dataquest:

We can't use K-nearest neighbors method for larger datasets because the model itself is represented using the entire training set. Each time we want to make a prediction on a new observation, we need to calculate the distance between each observation in our training set and our new observation, then rank by ascending distance. This is a computationally intensive technique!
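
To illustrate why each prediction touches the whole training set, here is a brute-force sketch on made-up data. The function name `knn_predict` and the toy arrays are assumptions for illustration, not part of the Dataquest material:

```python
import numpy as np

def knn_predict(train_features, train_targets, new_point, k=3):
    """Brute-force k-NN regression: one distance per training row, per prediction."""
    # Euclidean distance from the new observation to every training observation.
    distances = np.sqrt(((train_features - new_point) ** 2).sum(axis=1))
    # Rank by ascending distance and keep the k closest rows.
    nearest = np.argsort(distances, kind='stable')[:k]
    return train_targets[nearest].mean()

# Toy training set: one feature (guests) and one target (price).
train_features = np.array([[1.0], [2.0], [3.0], [4.0], [6.0]])
train_targets = np.array([50.0, 80.0, 100.0, 130.0, 200.0])

prediction = knn_predict(train_features, train_targets, np.array([3.0]), k=3)
print(prediction)
```

The distance computation and the ranking both run over every training row, so the work per prediction grows with the size of the training set.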


## Data Description


Download the dataset for the Washington, D.C. area:
http://insideairbnb.com/get-the-data.html

Each row in the dataset is a specific listing that's available for renting on AirBnB in the Washington, D.C. area:

* host_response_rate: the response rate of the host
* host_acceptance_rate: number of requests to the host that convert to rentals
* host_listings_count: number of other listings the host has
* latitude: latitude of the geographic coordinates
* longitude: longitude of the geographic coordinates
* city: the city where the living space is located
* zipcode: the zip code where the living space is located
* state: the state where the living space is located
* **accommodates: the number of guests the rental can accommodate**
* room_type: the type of living space (Private room, Shared room or Entire home/apt)
* bedrooms: number of bedrooms included in the rental
* bathrooms: number of bathrooms included in the rental
* beds: number of beds included in the rental
* **price: nightly price for the rental**
* cleaning_fee: additional fee used for cleaning the living space after the guest leaves
* security_deposit: refundable security deposit, in case of damages
* minimum_nights: minimum number of nights a guest can stay for the rental
* maximum_nights: maximum number of nights a guest can stay for the rental
* number_of_reviews: number of reviews that previous guests have left

## Reading and storing the Data into a Pandas Dataframe

In [128]:
import pandas as pd
import numpy as np
dc_listings = pd.read_csv('data/listings.csv')

## Selecting some Specific Columns

In [129]:
dc_list = dc_listings.loc[:,['host_response_rate', 
                                 'host_listings_count', 
                                 'latitude', 
                                 'longitude', 
                                 'city', 
                                 'zipcode', 
                                 'state', 
                                 'accommodates',
                                 'room_type',
                                 'bedrooms',
                                 'bathrooms',
                                 'beds',
                                 'price',
                                 'cleaning_fee',
                                 'security_deposit',
                                 'minimum_nights',
                                 'maximum_nights',
                                 'number_of_reviews']]

dc_list.head(5)

Unnamed: 0,host_response_rate,host_listings_count,latitude,longitude,city,zipcode,state,accommodates,room_type,bedrooms,bathrooms,beds,price,cleaning_fee,security_deposit,minimum_nights,maximum_nights,number_of_reviews
0,92%,26,38.890046,-77.002808,Washington,20003,DC,4,Entire home/apt,1.0,1.0,2.0,$160.00,$115.00,,1,1125,0
1,90%,1,38.880413,-76.990485,Washington,20003,DC,6,Entire home/apt,3.0,3.0,3.0,$350.00,$100.00,,2,30,65
2,90%,2,38.955291,-76.986006,Hyattsville,20782,MD,1,Private room,1.0,2.0,1.0,$50.00,,,2,1125,1
3,100%,1,38.872134,-77.019639,Washington,20024,DC,2,Private room,1.0,1.0,1.0,$95.00,,,1,1125,0
4,92%,1,38.996382,-77.041541,Silver Spring,20910,MD,4,Entire home/apt,1.0,1.0,1.0,$50.00,$15.00,,7,1125,0


## Calculating the Euclidean Distance

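Let us calculate the Euclidean distance between our living space, which can accommodate 3 people, and each living space in the dc_listings dataframe. Since we compare a single feature (the accommodates column), the Euclidean distance reduces to the absolute difference between the two values.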
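
For reference, the general Euclidean distance between two observations is the square root of the sum of their squared feature differences. A minimal sketch follows; the helper name `euclidean_distance` is hypothetical and the feature values are made up:

```python
import numpy as np

def euclidean_distance(p, q):
    """Square root of the sum of squared feature differences."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return np.sqrt(((p - q) ** 2).sum())

# One feature: the distance is just the absolute difference.
one_feature = euclidean_distance([5], [3])
# Two features (for example accommodates and bedrooms): a 3-4-5 triangle.
two_features = euclidean_distance([4, 5], [1, 1])

print(one_feature, two_features)
```

With one feature the result equals the absolute difference, which is why the notebook cell can use `np.abs` directly.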

In [152]:
dc_listings = dc_list.copy()
dc_listings['distance'] = np.abs(dc_listings['accommodates'] - 3)
print(dc_listings['distance'].value_counts())

1     2294
2      503
0      461
3      279
5       73
4       35
7       22
6       17
9       12
13       8
8        7
12       6
11       4
10       2
Name: distance, dtype: int64


We can deduce that 461 living spaces accommodate exactly 3 people (distance 0). To find the 5 "nearest neighbors", we could sort the dataframe by the distance column and select the first 5 living spaces. However, because far more than 5 listings are tied at distance 0, which of them come first depends on the ordering of the dataset, and this would bias the result.


## Using the K-nearest Neighbors method for regression

Let us randomize the ordering of the dataset and then sort the dataframe by the distance column: all of the living spaces that accommodate exactly 3 people will be at the top of the dataframe, in random order across the first 461 rows.

In [153]:
# Seed the random number generator so that the shuffle is reproducible.
np.random.seed(1)
index = np.random.permutation(len(dc_listings))
# Shuffle the rows of dc_listings.
dc_listings = dc_listings.loc[index]
dc_listings = dc_listings.sort_values('distance')
# Convert price strings such as "$1,250.00" to floats.
stripped_commas = dc_listings['price'].str.replace(',', '', regex=False)
stripped_dollars = stripped_commas.str.replace('$', '', regex=False)
dc_listings['price'] = stripped_dollars.astype('float')
# Select the 5 nearest neighbors by position. The loc method slices by index label
# (both ends included), so it cannot be used here: the original labels were permuted.
prices_for_three = dc_listings.iloc[0:5]['price']
print(prices_for_three.mean())
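
The price cleanup can be checked on a small sample. This sketch uses made-up price strings; passing `regex=False` is a precaution so that `$` is treated as a literal character rather than a regular-expression anchor, whatever the pandas version:

```python
import pandas as pd

# Made-up price strings in the same format as the price column.
raw_prices = pd.Series(['$160.00', '$1,250.00', '$50.00'])

cleaned = (
    raw_prices
    .str.replace(',', '', regex=False)   # drop thousands separators
    .str.replace('$', '', regex=False)   # drop the currency symbol
    .astype('float')
)

print(cleaned.tolist())
```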


## Calculate the suggested price for any number of guests

Let us write a function named predict_price that uses the k-nearest neighbors method to calculate the suggested price for any number of guests (accommodates column). We are going to split dc_listings into two sets: the training set (the first 2,792 of the 3,723 rows, about 75% of the data) and the test set (the remaining 931 rows, about 25% of the data).

In [162]:
dc_listings = dc_list.copy()
# Optional: shuffle the rows of dc_listings before splitting.
# index = np.random.permutation(len(dc_listings))
# dc_listings = dc_listings.loc[index]

# Convert price strings such as "$1,250.00" to floats.
stripped_commas = dc_listings['price'].str.replace(',', '', regex=False)
stripped_dollars = stripped_commas.str.replace('$', '', regex=False)
dc_listings['price'] = stripped_dollars.astype('float')
# Split by position: the first 2792 rows for training, the rest for testing.
train_df = dc_listings.iloc[0:2792]
test_df = dc_listings.iloc[2792:]

def predict_price(new_listing):
    # DataFrame.copy() performs a deep copy, so train_df itself is left unchanged.
    temp_df = train_df.copy()
    # Distance between each training listing and the new listing.
    temp_df['distance'] = (temp_df['accommodates'] - new_listing).abs()
    temp_df = temp_df.sort_values('distance')
    # Average the prices of the 5 nearest neighbors.
    nearest_neighbor_prices = temp_df.iloc[0:5]['price']
    predicted_price = nearest_neighbor_prices.mean()
    return predicted_price

In [163]:
print('One guest:', predict_price(1))
print('Two guests:', predict_price(2))
print('Four guests:', predict_price(4))

One guest: 89.0
Two guests: 104.0
Four guests: 145.8
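
To check the logic by hand, here is a self-contained variant on a made-up training set. The `k` parameter, the helper name `predict_price_k`, and the toy rows are assumptions added for illustration; the function above fixes the number of neighbors at 5:

```python
import pandas as pd

# Made-up training listings: guests accommodated and nightly price.
toy_train = pd.DataFrame({
    'accommodates': [1, 2, 2, 3, 4, 4, 6],
    'price': [60.0, 80.0, 90.0, 110.0, 140.0, 150.0, 220.0],
})

def predict_price_k(new_listing, train, k=5):
    """Average price of the k training listings closest in 'accommodates'."""
    distances = (train['accommodates'] - new_listing).abs()
    # mergesort is stable: tied rows keep their original order, so the result is reproducible.
    nearest = distances.sort_values(kind='mergesort').index[:k]
    return train.loc[nearest, 'price'].mean()

two_guests = predict_price_k(2, toy_train, k=3)
four_guests = predict_price_k(4, toy_train, k=2)
print(two_guests, four_guests)
```

For four guests with k = 2, the two listings at distance 0 are averaged. For two guests with k = 3, the third neighbor is picked among listings tied at distance 1, and the stable sort resolves that tie by row order; shuffling the training rows first would avoid always favoring the earlier rows.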


In [145]:
test_df = test_df.copy()
test_df['predicted_price'] = test_df['accommodates'].apply(predict_price)

In [146]:
test_df[['accommodates', 'predicted_price']].head(10)

Unnamed: 0,accommodates,predicted_price
2792,2,178.007812
2793,3,121.719469
2794,4,164.314229
2795,3,121.719469
2796,6,170.296
2797,6,170.296
2798,4,164.314229
2799,2,178.007812
2800,4,164.314229
2801,3,121.719469
