# ML: K-Nearest Neighbours

AirBnB strategy:

* find a few listings that are similar to ours,
* average the listed price for the ones most similar to ours,
* set our listing price to this calculated average price.
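The three steps above can be sketched in a few lines. This is a minimal illustration only: the function name `knn_price`, the toy data, and the choice of absolute difference on a single feature as the similarity measure are assumptions made for the example, not part of the notebook.

```python
import pandas as pd

def knn_price(listings, our_value, feature="accommodates", k=5):
    """Average the price of the k listings closest to our_value on one feature."""
    # Step 1: measure how far each listing is from ours (smaller = more similar).
    distances = (listings[feature] - our_value).abs()
    # Step 2: keep the k most similar listings (stable sort keeps ties in row order).
    nearest = distances.sort_values(kind="mergesort").index[:k]
    # Step 3: use their average price as our suggested price.
    return listings.loc[nearest, "price"].mean()

# Hypothetical toy listings, not taken from dc_airbnb.csv
toy = pd.DataFrame({
    "accommodates": [1, 2, 3, 3, 4, 6],
    "price": [50.0, 80.0, 100.0, 120.0, 160.0, 350.0],
})

estimate = knn_price(toy, our_value=3, k=3)
print(estimate)
```

With `k=3` the two exact matches and the first listing at distance 1 are averaged; a different `k` would pull in more (and less similar) listings.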

Similarity metric

* works by comparing the numerical features of two observations,
* when predicting a continuous value from numerical features, a common choice is the Euclidean distance.



$ d = \sqrt{(q_1 - p_1)^2 + (q_2 - p_2)^2 + ... + (q_n - p_n)^2} $


<br>


<img src="img/euclidean_distance_five_features.png">
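As a rough sketch, the Euclidean distance formula above translates directly into NumPy. The helper name `euclidean_distance` and the feature values below are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import numpy as np

def euclidean_distance(q, p):
    """Euclidean distance between two equal-length feature vectors."""
    q = np.asarray(q, dtype=float)
    p = np.asarray(p, dtype=float)
    # Square each per-feature difference, sum them, then take the square root.
    return np.sqrt(np.sum((q - p) ** 2))

# Hypothetical values for (accommodates, bedrooms, bathrooms)
ours = [3, 1, 1]
other = [4, 1, 2]

d = euclidean_distance(ours, other)
print(d)  # square root of (1 + 0 + 1)
```

With a single feature the same function reduces to the absolute difference between the two values.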

### Univariate KNN

$ d = \sqrt{(q_1 - p_1)^2} $

<br>
$ d = |q_1 - p_1| $

In [2]:
import pandas as pd

In [4]:
dc_listings = pd.read_csv('dc_airbnb.csv')

In [5]:
print(dc_listings)

     host_response_rate host_acceptance_rate  host_listings_count  \
0                   92%                  91%                   26   
1                   90%                 100%                    1   
2                   90%                 100%                    2   
3                  100%                  NaN                    1   
4                   92%                  67%                    1   
...                 ...                  ...                  ...   
3718               100%                  60%                    1   
3719               100%                  50%                    1   
3720               100%                 100%                    2   
3721                88%                 100%                    1   
3722                70%                 100%                    1   

      accommodates        room_type  bedrooms  bathrooms  beds    price  \
0                4  Entire home/apt       1.0        1.0   2.0  $160.00   
1                6  E

In [7]:
import numpy as np

In [13]:
our_living_space = 3
first_living_space_value = dc_listings['accommodates'][0]
first_distance = np.abs(our_living_space - first_living_space_value)
print(first_distance) # The closer to 0 the distance the more similar the living spaces are.

1


We can rank the existing living spaces by ascending distance, which serves as our proxy for similarity: the smaller the distance, the more similar the listing.

In [22]:
dc_listings['distance'] = np.abs(dc_listings['accommodates'] - our_living_space)

dc_listings['distance'].value_counts()

1     2294
2      503
0      461
3      279
5       73
4       35
7       22
6       17
9       12
13       8
8        7
12       6
11       4
10       2
Name: distance, dtype: int64

In [23]:
np.random.seed(1)
dc_listings[dc_listings['distance']==0]['accommodates']

26      3
34      3
36      3
40      3
44      3
       ..
3675    3
3697    3
3707    3
3714    3
3722    3
Name: accommodates, Length: 461, dtype: int64

In [28]:
dc_listings = dc_listings.loc[np.random.permutation(len(dc_listings))]


In [33]:
dc_listings = dc_listings.sort_values(by='distance')
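One likely reason for shuffling before sorting is that many listings tie at the same distance (461 of them at distance 0), so without a shuffle the "nearest" rows would simply be whichever tied rows appear first in the file. The sketch below shows the idea on hypothetical toy data; it uses `DataFrame.sample` and a stable sort (`kind="mergesort"`) so that tied rows keep their shuffled order, which differs slightly from the cells above.

```python
import pandas as pd

# Hypothetical toy data: three listings tie at distance 0 from our value of 3.
toy = pd.DataFrame({
    "accommodates": [3, 3, 3, 5, 1],
    "price": [100.0, 200.0, 300.0, 400.0, 50.0],
})
our_living_space = 3
toy["distance"] = (toy["accommodates"] - our_living_space).abs()

# Shuffle the rows, then sort with a stable algorithm so that rows with
# equal distance stay in their shuffled (random) relative order.
shuffled = toy.sample(frac=1, random_state=1)
ranked = shuffled.sort_values("distance", kind="mergesort")

top2 = ranked.head(2)
print(top2)
```

Which two of the three tied listings end up in `top2` depends on the random seed, not on their position in the original file.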

In [59]:
dc_listings.head()

Unnamed: 0,host_response_rate,host_acceptance_rate,host_listings_count,accommodates,room_type,bedrooms,bathrooms,beds,price,cleaning_fee,security_deposit,minimum_nights,maximum_nights,number_of_reviews,latitude,longitude,city,zipcode,state
0,92%,91%,26,4,Entire home/apt,1.0,1.0,2.0,160.0,$115.00,$100.00,1,1125,0,38.890046,-77.002808,Washington,20003,DC
1,90%,100%,1,6,Entire home/apt,3.0,3.0,3.0,350.0,$100.00,,2,30,65,38.880413,-76.990485,Washington,20003,DC
2,90%,100%,2,1,Private room,1.0,2.0,1.0,50.0,,,2,1125,1,38.955291,-76.986006,Hyattsville,20782,MD
3,100%,,1,2,Private room,1.0,1.0,1.0,95.0,,,1,1125,0,38.872134,-77.019639,Washington,20024,DC
4,92%,67%,1,4,Entire home/apt,1.0,1.0,1.0,50.0,$15.00,$450.00,7,1125,0,38.996382,-77.041541,Silver Spring,20910,MD


In [34]:
dc_listings['price'].head(10)

3098     $98.00
1499     $94.00
1053    $190.00
2113     $75.00
150     $150.00
530     $175.00
533     $135.00
3420    $150.00
1208    $139.00
2257    $129.00
Name: price, dtype: object

In [37]:
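# regex=False treats '$' as a literal character rather than a regex end-of-string anchor
dc_listings['price'] = dc_listings['price'].str.replace('$', '', regex=False).str.replace(',', '', regex=False).astype('float')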

In [42]:
mean_price = dc_listings['price'][:5].mean()
mean_price

121.4

In [43]:
def predict_price(new_listing):
    temp_df = dc_listings.copy()
    temp_df['distance'] = np.abs(temp_df['accommodates'] - new_listing)
    temp_df = temp_df.sort_values(by='distance')
    return temp_df['price'][:5].mean()

In [44]:
predict_price(1)

77.0

In [45]:
predict_price(2)

327.0

In [46]:
predict_price(3)

141.0

---

Evaluating Model Performance



In [50]:
dc_listings = pd.read_csv("dc_airbnb.csv")
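# regex=False makes both replacements literal; '$' is otherwise a regex anchor
stripped_commas = dc_listings['price'].str.replace(',', '', regex=False)
stripped_dollars = stripped_commas.str.replace('$', '', regex=False)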
dc_listings['price'] = stripped_dollars.astype('float')

In [54]:
# Take explicit copies so that adding columns later does not trigger
# pandas' SettingWithCopyWarning.
train_df = dc_listings.iloc[0:2792].copy()
test_df = dc_listings.iloc[2792:].copy()

In [55]:

def predict_price(new_listing):
    temp_df = train_df.copy()
    # Distance between each training listing and the new listing
    temp_df['distance'] = temp_df['accommodates'].apply(lambda x: np.abs(x - new_listing))
    temp_df = temp_df.sort_values('distance')
    # Average the price of the 5 nearest neighbours
    nearest_neighbor_prices = temp_df.iloc[0:5]['price']
    predicted_price = nearest_neighbor_prices.mean()
    return predicted_price

test_df['predicted_price'] = test_df['accommodates'].apply(predict_price)


Error metric

* quantifies how good the predictions on the test set are,
* more precisely, it measures how far our predictions were from the actual values,
* a low error means the gap between the predicted and actual list prices is small, while a high error means the gap is large.


Mean absolute error

$MAE = \frac{1}{n} \sum_{k=1}^n |actual_k - predicted_k| = \frac{1}{n} \left( |actual_1 - predicted_1| + ... + |actual_n - predicted_n| \right)$



In [66]:
mae = np.sum(np.abs(test_df['predicted_price'] - test_df['price']))/len(test_df)
mae

56.29001074113856

Mean squared error

$MSE = \frac{1}{n} \sum_{k=1}^n (actual_k - predicted_k)^2 = \frac{1}{n} \left( (actual_1 - predicted_1)^2 + ... + (actual_n - predicted_n)^2 \right)$

In [67]:
mse = np.sum((test_df['price'] - test_df['predicted_price'])**2)/len(test_df)
mse

18646.525370569278

In [71]:
# Take explicit copies so that adding columns later does not trigger
# pandas' SettingWithCopyWarning.
train_df = dc_listings.iloc[0:2792].copy()
test_df = dc_listings.iloc[2792:].copy()

def predict_price(new_listing):
    temp_df = train_df.copy()
    # Same model as before, but using 'bathrooms' as the single feature
    temp_df['distance'] = temp_df['bathrooms'].apply(lambda x: np.abs(x - new_listing))
    temp_df = temp_df.sort_values('distance')
    nearest_neighbors_prices = temp_df.iloc[0:5]['price']
    predicted_price = nearest_neighbors_prices.mean()
    return predicted_price

test_df['predicted_price'] = test_df['bathrooms'].apply(predict_price)

test_df['squared_error'] = (test_df['price'] - test_df['predicted_price'])**2

mse = test_df['squared_error'].mean()

print(mse)

18405.44408163265


Root mean squared error

* an error metric expressed in the same units as the target (here, dollars), because taking the square root undoes the squaring in the MSE


$ RMSE = \sqrt{MSE}$

In [72]:
np.sqrt(mse)

135.66666532952246

In [73]:
errors_one = pd.Series([5, 10, 5, 10, 5, 10, 5, 10, 5, 10, 5, 10, 5, 10, 5, 10, 5, 10])
errors_two = pd.Series([5, 10, 5, 10, 5, 10, 5, 10, 5, 10, 5, 10, 5, 10, 5, 10, 5, 1000])
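
The two series above differ only in their last value (10 versus 1000), which makes them a convenient way to compare how MAE and RMSE react to a single large error. The notebook stops here, so the following is a hedged sketch of that comparison; the helper names `mae_from_errors` and `rmse_from_errors` are hypothetical.

```python
import numpy as np
import pandas as pd

# Same values as errors_one and errors_two above, written compactly.
errors_one = pd.Series([5, 10] * 9)
errors_two = pd.Series([5, 10] * 8 + [5, 1000])

def mae_from_errors(errors):
    """Mean absolute error, given a series of per-row errors."""
    return errors.abs().mean()

def rmse_from_errors(errors):
    """Root mean squared error, given a series of per-row errors."""
    return np.sqrt((errors ** 2).mean())

mae_one, rmse_one = mae_from_errors(errors_one), rmse_from_errors(errors_one)
mae_two, rmse_two = mae_from_errors(errors_two), rmse_from_errors(errors_two)

print(mae_one, rmse_one)
print(mae_two, rmse_two)
```

For the first series the two metrics are close to each other. For the second, the single outlier raises both, but RMSE grows far more than MAE because each error is squared before averaging, so a large gap between RMSE and MAE hints at a few very large errors.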