# ML: K-Nearest Neighbors

AirBnB strategy:

* find a few listings that are similar to ours,
* average the listed price for the ones most similar to ours,
* set our listing price to this calculated average price.

Similarity metric
* compares the numerical features of two observations
* because we are predicting a continuous value, we use Euclidean distance



$ d = \sqrt{(q_1 - p_1)^2 + (q_2 - p_2)^2 + ... + (q_n - p_n)^2} $


<br>


<img src="img/euclidean_distance_five_features.png">
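
As a quick numeric check of the formula, the sketch below computes the distance between two made-up five-feature observations with NumPy. The vectors `q` and `p` are illustrative values, not rows from `dc_airbnb.csv`.

```python
import numpy as np

# Hypothetical observations with five numerical features each.
q = np.array([3.0, 1.0, 1.0, 2.0, 4.0])
p = np.array([4.0, 1.0, 2.0, 2.0, 2.0])

# Square the per-feature differences, sum them, then take the square root.
d = np.sqrt(np.sum((q - p) ** 2))

# np.linalg.norm of the difference vector computes the same quantity.
same_d = np.linalg.norm(q - p)

print(d, same_d)
```

The squared differences are 1, 0, 1, 0 and 4, so the distance is the square root of 6.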

### Univariate KNN

$ d = \sqrt{(q_1 - p_1)^2} $

<br>
$ d = |q_1 - p_1| $

In [1]:
import pandas as pd

In [2]:
dc_listings = pd.read_csv('dc_airbnb.csv')

In [3]:
print(dc_listings)

     host_response_rate host_acceptance_rate  host_listings_count  \
0                   92%                  91%                   26   
1                   90%                 100%                    1   
2                   90%                 100%                    2   
3                  100%                  NaN                    1   
4                   92%                  67%                    1   
...                 ...                  ...                  ...   
3718               100%                  60%                    1   
3719               100%                  50%                    1   
3720               100%                 100%                    2   
3721                88%                 100%                    1   
3722                70%                 100%                    1   

      accommodates        room_type  bedrooms  bathrooms  beds    price  \
0                4  Entire home/apt       1.0        1.0   2.0  $160.00   
1                6  E

In [4]:
import numpy as np

In [5]:
our_living_space = 3
first_living_space_value = dc_listings['accommodates'][0]
first_distance = np.abs(our_living_space - first_living_space_value)
print(first_distance) # The closer to 0 the distance the more similar the living spaces are.

1


We can rank the existing living spaces by ascending distance, which serves as our proxy for similarity.

In [6]:
dc_listings['distance'] = np.abs(dc_listings['accommodates'] - our_living_space)

dc_listings['distance'].value_counts()

1     2294
2      503
0      461
3      279
5       73
4       35
7       22
6       17
9       12
13       8
8        7
12       6
11       4
10       2
Name: distance, dtype: int64

In [7]:
np.random.seed(1)
dc_listings[dc_listings['distance']==0]['accommodates']

26      3
34      3
36      3
40      3
44      3
       ..
3675    3
3697    3
3707    3
3714    3
3722    3
Name: accommodates, Length: 461, dtype: int64

In [8]:
dc_listings = dc_listings.loc[np.random.permutation(len(dc_listings))]
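
The shuffle matters because many listings tie at the same distance (461 at distance 0 in the counts above), so without it the rows picked among ties could simply reflect the file's original order. A minimal sketch on made-up data, assuming a stable sort (`kind='mergesort'`) so that tied rows keep their shuffled order:

```python
import numpy as np
import pandas as pd

# Toy data: six listings, four of which tie at distance 0.
toy = pd.DataFrame({
    'distance': [0, 0, 0, 0, 1, 2],
    'price': [100.0, 110.0, 120.0, 130.0, 200.0, 300.0],
})

rng = np.random.RandomState(1)

# Shuffle the row order first ...
shuffled = toy.loc[rng.permutation(len(toy))]

# ... then sort with a stable algorithm, so tied rows stay in shuffled order.
ranked = shuffled.sort_values('distance', kind='mergesort')

print(ranked)
```

Averaging the first few rows of `ranked` then draws on a random subset of the tied listings rather than on whichever ones happened to come first in the file.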


In [9]:
dc_listings = dc_listings.sort_values(by='distance')

In [10]:
dc_listings.head()

Unnamed: 0,host_response_rate,host_acceptance_rate,host_listings_count,accommodates,room_type,bedrooms,bathrooms,beds,price,cleaning_fee,security_deposit,minimum_nights,maximum_nights,number_of_reviews,latitude,longitude,city,zipcode,state,distance
577,98%,52%,49,3,Private room,1.0,1.0,2.0,$185.00,,,2,14,1,38.908356,-77.028146,Washington,20005,DC,0
2166,100%,89%,2,3,Entire home/apt,1.0,1.0,1.0,$180.00,,$100.00,1,14,10,38.905808,-77.000012,Washington,20002,DC,0
3631,98%,52%,49,3,Entire home/apt,1.0,1.0,2.0,$175.00,,,3,14,1,38.889065,-76.993576,Washington,20003,DC,0
71,100%,94%,1,3,Entire home/apt,1.0,1.0,1.0,$128.00,$40.00,,1,1125,9,38.87996,-77.006491,Washington,20003,DC,0
1011,,,1,3,Entire home/apt,0.0,1.0,1.0,$115.00,,,1,1125,0,38.907382,-77.035075,Washington,20005,DC,0


In [11]:
dc_listings['price'].head(10)

577     $185.00
2166    $180.00
3631    $175.00
71      $128.00
1011    $115.00
380     $219.00
943     $125.00
3107    $250.00
1499     $94.00
625     $150.00
Name: price, dtype: object

In [12]:
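# regex=False treats '$' as a literal character rather than a regex end-of-string anchor
dc_listings['price'] = dc_listings['price'].str.replace('$', '', regex=False).str.replace(',', '', regex=False).astype('float')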

In [13]:
mean_price = dc_listings['price'][:5].mean()
mean_price

156.6

In [14]:
def predict_price(new_listing):
    temp_df = dc_listings.copy()
    temp_df['distance'] = np.abs(temp_df['accommodates'] - new_listing)
    temp_df = temp_df.sort_values(by='distance')
    return temp_df['price'][:5].mean()

In [15]:
predict_price(1)

78.8

In [16]:
predict_price(2)

126.0

In [17]:
predict_price(3)

194.8

---

Evaluating Model Performance



In [18]:
dc_listings = pd.read_csv("dc_airbnb.csv")
stripped_commas = dc_listings['price'].str.replace(',', '', regex=False)
stripped_dollars = stripped_commas.str.replace('$', '', regex=False)  # literal '$', not a regex anchor
dc_listings['price'] = stripped_dollars.astype('float')

In [19]:
# Roughly the first 75% of the 3,723 rows for training, the rest for testing.
# .copy() lets us add columns later without a SettingWithCopyWarning.
train_df = dc_listings.iloc[0:2792].copy()
test_df = dc_listings.iloc[2792:].copy()

In [20]:

def predict_price(new_listing):
    # Rank the training listings by distance in 'accommodates'
    # and average the prices of the 5 nearest ones.
    temp_df = train_df.copy()
    temp_df['distance'] = temp_df['accommodates'].apply(lambda x: np.abs(x - new_listing))
    temp_df = temp_df.sort_values('distance')
    nearest_neighbor_prices = temp_df.iloc[0:5]['price']
    predicted_price = nearest_neighbor_prices.mean()
    return predicted_price

test_df['predicted_price'] = test_df['accommodates'].apply(predict_price)


Error metric

* quantifies how good the predictions are on the test set
* more precisely, an error metric measures how far our predictions were from the actual values
* a low error metric means the gap between the predicted and actual list prices is small, while a high error metric means the gap is large


Mean absolute error

$ MAE = \frac{1}{n} \sum_{k=1}^n |actual_k - predicted_k| $



In [21]:
mae = np.sum(np.abs(test_df['predicted_price'] - test_df['price']))/len(test_df)
mae

56.29001074113856

Mean squared error

$ MSE = \frac{1}{n} \sum_{k=1}^n (actual_k - predicted_k)^2 $

In [22]:
mse = np.sum((test_df['price'] - test_df['predicted_price'])**2)/len(test_df)
mse

18646.525370569278

In [23]:
# .copy() lets us add columns to the splits without a SettingWithCopyWarning.
train_df = dc_listings.iloc[0:2792].copy()
test_df = dc_listings.iloc[2792:].copy()

def predict_price(new_listing):
    # Same model as before, but similarity is now measured on 'bathrooms'.
    temp_df = train_df.copy()
    temp_df['distance'] = temp_df['bathrooms'].apply(lambda x: np.abs(x - new_listing))
    temp_df = temp_df.sort_values('distance')
    nearest_neighbors_prices = temp_df.iloc[0:5]['price']
    predicted_price = nearest_neighbors_prices.mean()
    return predicted_price

test_df['predicted_price'] = test_df['bathrooms'].apply(predict_price)
test_df['squared_error'] = (test_df['price'] - test_df['predicted_price'])**2
mse = test_df['squared_error'].mean()
print(mse)

18405.44408163265


Root mean squared error 

* an error metric expressed in the same units as the target (dollars here), unlike MSE, whose units are squared


$ RMSE = \sqrt{MSE}$

In [24]:
np.sqrt(mse)

135.66666532952246

In [25]:
errors_one = pd.Series([5, 10, 5, 10, 5, 10, 5, 10, 5, 10, 5, 10, 5, 10, 5, 10, 5, 10])
errors_two = pd.Series([5, 10, 5, 10, 5, 10, 5, 10, 5, 10, 5, 10, 5, 10, 5, 10, 5, 1000])

In [31]:
mae_one = np.mean(errors_one)
mse_one = np.mean(np.square(errors_one))
rmse_one = np.sqrt(mse_one)


mae_two = np.mean(errors_two)
mse_two = np.mean(np.square(errors_two))
rmse_two = np.sqrt(mse_two)
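
Cells 25 and 31 compute the metrics without displaying them. The sketch below recomputes and prints them side by side. The helper `mae_and_rmse` and the ratio comparison are additions for illustration, not part of the original cells. The two series differ in a single value (10 versus 1000), yet RMSE reacts far more strongly than MAE because errors are squared before averaging.

```python
import numpy as np
import pandas as pd

# Same values as in cell 25, written compactly.
errors_one = pd.Series([5, 10] * 9)
errors_two = pd.Series([5, 10] * 8 + [5, 1000])

def mae_and_rmse(errors):
    """Return (MAE, RMSE) for a series of absolute errors (hypothetical helper)."""
    mae = np.mean(errors)
    rmse = np.sqrt(np.mean(np.square(errors)))
    return mae, rmse

mae_one, rmse_one = mae_and_rmse(errors_one)
mae_two, rmse_two = mae_and_rmse(errors_two)

print(mae_one, rmse_one)   # 7.5 and about 7.91
print(mae_two, rmse_two)   # 62.5 and about 235.82

# Ratio of RMSE to MAE: close to 1 without the outlier, much larger with it.
print(rmse_one / mae_one, rmse_two / mae_two)
```

When the RMSE is much larger than the MAE, a few large errors are likely dominating the metric.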

---

# Multivariate K-Nearest Neighbors

There are two ways we can tweak the model to try to improve its accuracy (that is, decrease the RMSE during validation):

> increase the number of attributes the model uses to calculate similarity when ranking the closest neighbors

> increase k, the number of nearby neighbors the model uses when computing the prediction

We need to watch out for columns that don't work well with the distance equation. This includes columns containing:

* non-numerical values (e.g. city or state)
    * the Euclidean distance equation expects numerical values
* missing values
    * the distance equation expects a value for each observation and attribute
* non-ordinal values (e.g. latitude or longitude)
    * ranking by Euclidean distance doesn't make sense unless all attributes are ordinal
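
One way to spot the first two kinds of column before computing distances is to inspect dtypes and missing-value counts. The sketch below uses a small made-up frame (the values are illustrative, not taken from `dc_airbnb.csv`). `select_dtypes` and `isnull` are standard pandas calls.

```python
import numpy as np
import pandas as pd

# Toy frame with one text column and one column containing a missing value.
toy = pd.DataFrame({
    'accommodates': [2, 4, 3],
    'bedrooms': [1.0, np.nan, 1.0],
    'city': ['Washington', 'Washington', 'Washington'],
})

# Non-numeric columns cannot be fed to the distance equation.
non_numeric = toy.select_dtypes(exclude='number').columns.tolist()

# Columns with at least one missing value need dropping or filling.
missing_counts = toy.isnull().sum()
with_missing = missing_counts[missing_counts > 0].index.tolist()

print(non_numeric)    # ['city']
print(with_missing)   # ['bedrooms']
```

Non-ordinal numeric columns such as latitude cannot be detected this way and still need to be identified by hand.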

In [32]:
import pandas as pd
import numpy as np
np.random.seed(1)

dc_listings = pd.read_csv('dc_airbnb.csv')
dc_listings = dc_listings.loc[np.random.permutation(len(dc_listings))]
stripped_commas = dc_listings['price'].str.replace(',', '')
stripped_dollars = stripped_commas.str.replace('$', '', regex=False)  # literal '$', not a regex anchor
dc_listings['price'] = stripped_dollars.astype('float')

In [33]:
dc_listings.columns

Index(['host_response_rate', 'host_acceptance_rate', 'host_listings_count',
       'accommodates', 'room_type', 'bedrooms', 'bathrooms', 'beds', 'price',
       'cleaning_fee', 'security_deposit', 'minimum_nights', 'maximum_nights',
       'number_of_reviews', 'latitude', 'longitude', 'city', 'zipcode',
       'state'],
      dtype='object')

In [37]:
dc_listings.drop(['room_type','city','state','latitude','longitude','zipcode','host_response_rate','host_acceptance_rate','host_listings_count'],axis=1,inplace=True)

In [38]:
dc_listings.drop(['cleaning_fee','security_deposit'],axis=1,inplace=True)

In [39]:
dc_listings.head()

Unnamed: 0,accommodates,bedrooms,bathrooms,beds,price,minimum_nights,maximum_nights,number_of_reviews
574,2,1.0,1.0,1.0,125.0,1,4,149
1593,2,1.0,1.5,1.0,85.0,1,30,49
3091,1,1.0,0.5,1.0,50.0,1,1125,1
420,2,1.0,1.0,1.0,209.0,4,730,2
808,12,5.0,2.0,5.0,215.0,2,1825,34


In [42]:
dc_listings.dropna(axis=0,subset=['bedrooms','bathrooms','beds'],inplace=True)

In [50]:
dc_listings.isnull().sum()

accommodates         0
bedrooms             0
bathrooms            0
beds                 0
price                0
minimum_nights       0
maximum_nights       0
number_of_reviews    0
dtype: int64

# Normalization 

Some columns, such as `maximum_nights`, contain much larger values than the others (e.g. 4 versus 1125 in the rows shown above). Because of the way Euclidean distance is calculated, two listings that differ mainly in such a column would be considered very far apart, purely because of the outsized effect of its large values. To prevent any single column from having too much of an impact on the distance, we can normalize all of the columns to have a mean of 0 and a standard deviation of 1.


> Normalizing the values in each column to a mean of 0 and a standard deviation of 1 preserves the shape of each column's distribution while aligning the scales. To normalize the values in a column, subtract the column's mean $\mu$ from each value $x$ and divide by the column's standard deviation $\sigma$:

$ \large z = \frac{x - \mu}{\sigma}$ 

In [54]:
normalized_listings = (dc_listings - dc_listings.mean())/dc_listings.std()

normalized_listings['price'] = dc_listings['price']
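
To see the effect of this rescaling, the sketch below standardizes a small made-up frame and compares the Euclidean distance between its first two rows before and after. The values and the helper `euclidean` are illustrative assumptions. Only the standardization formula matches the cell above.

```python
import numpy as np
import pandas as pd

# Toy frame: one small-scale column and one large-scale column.
toy = pd.DataFrame({
    'accommodates': [2, 2, 12, 4],
    'maximum_nights': [4, 1125, 30, 730],
})

def euclidean(a, b):
    """Euclidean distance between two rows (hypothetical helper)."""
    return np.sqrt(np.sum((a - b) ** 2))

# Rows 0 and 1 accommodate the same number of guests,
# yet the raw distance is huge because of maximum_nights alone.
raw_distance = euclidean(toy.iloc[0], toy.iloc[1])

# Standardize each column: subtract its mean, divide by its standard deviation.
normalized = (toy - toy.mean()) / toy.std()
normalized_distance = euclidean(normalized.iloc[0], normalized.iloc[1])

print(raw_distance)          # 1121.0
print(normalized_distance)   # far smaller, since both columns now share a scale
```

After standardizing, every column has mean 0 and standard deviation 1 (pandas uses the sample standard deviation by default), so no column dominates merely because of its units.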


In [55]:
normalized_listings.head(3)

Unnamed: 0,accommodates,bedrooms,bathrooms,beds,price,minimum_nights,maximum_nights,number_of_reviews
574,-0.596544,-0.249467,-0.439151,-0.546858,125.0,-0.341375,-0.016604,4.57965
1593,-0.596544,-0.249467,0.412923,-0.546858,85.0,-0.341375,-0.016603,1.159275
3091,-1.095499,-0.249467,-1.291226,-0.546858,50.0,-0.341375,-0.016573,-0.482505


## Euclidean distance for multivariate case


$ d = \sqrt{(q_1 - p_1)^2 + (q_2 - p_2)^2 + ... +(q_n - p_n)^2}$

<br>

<img src="img/distance_two_features.png">

In [56]:
from scipy.spatial import distance
first_listing = [-0.596544, -0.439151]
second_listing = [-0.596544, 0.412923]
dist = distance.euclidean(first_listing, second_listing)
dist

0.852074

In [75]:
# normalized_listings.iloc[0][['accommodates','bathrooms']]

In [74]:
features = ['accommodates', 'bathrooms']
first_fifth_distance = distance.euclidean(normalized_listings[features].iloc[0], normalized_listings[features].iloc[4])
print(first_fifth_distance)

5.272543124668404
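
Combining this distance with the averaging step from the univariate case gives a multivariate predictor with an adjustable k. The function below is a minimal sketch, not the notebook's own implementation: `knn_predict`, its parameters, and the toy training frame are all hypothetical, and the feature columns are assumed to be already normalized.

```python
import numpy as np
import pandas as pd

def knn_predict(train, query, features, k=5, target='price'):
    """Average the target of the k training rows closest to `query`."""
    diffs = train[features].to_numpy(dtype=float) - np.asarray(query, dtype=float)
    # Euclidean distance from the query to every training row.
    distances = np.sqrt((diffs ** 2).sum(axis=1))
    # Positions of the k smallest distances (stable sort keeps tie order).
    nearest = np.argsort(distances, kind='stable')[:k]
    return train[target].to_numpy()[nearest].mean()

# Made-up, already-normalized training data.
train = pd.DataFrame({
    'accommodates': [-1.0, -0.5, 0.0, 0.5, 2.0],
    'bathrooms': [-0.5, -0.5, 0.0, 0.5, 1.5],
    'price': [60.0, 80.0, 100.0, 150.0, 400.0],
})

prediction = knn_predict(train, [0.0, 0.0], ['accommodates', 'bathrooms'], k=3)
print(prediction)  # 110.0, the mean of 100, 80 and 150
```

Passing more column names in `features` or a different `k` corresponds to the two tuning options listed at the start of this section.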


---
# Scikit-learn


> Any model that predicts numerical values, like listing price in our case, is known as a regression model.

> The other main class of machine learning models is classification, where we're trying to predict a label from a fixed set of labels.
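
For the regression case, scikit-learn provides `KNeighborsRegressor`. The sketch below is one assumption about how the workflow above might be expressed with it, using made-up training and test frames rather than `dc_airbnb.csv`. The choices `n_neighbors=2` and `algorithm='brute'` are illustrative, and the features are left unnormalized to keep the arithmetic easy to follow.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.neighbors import KNeighborsRegressor

# Made-up listings for illustration only.
train = pd.DataFrame({
    'accommodates': [1, 2, 3, 4, 6],
    'bathrooms': [1.0, 1.0, 1.0, 2.0, 2.0],
    'price': [50.0, 80.0, 120.0, 200.0, 300.0],
})
test = pd.DataFrame({
    'accommodates': [3, 5],
    'bathrooms': [1.0, 2.0],
    'price': [110.0, 240.0],
})
features = ['accommodates', 'bathrooms']

# Fit on the training features and prices, then predict the test prices.
knn = KNeighborsRegressor(n_neighbors=2, algorithm='brute')
knn.fit(train[features], train['price'])
predictions = knn.predict(test[features])

# RMSE as the square root of scikit-learn's mean squared error.
rmse = np.sqrt(mean_squared_error(test['price'], predictions))
print(predictions, rmse)
```

Each prediction is the mean price of the two nearest training rows, which matches the manual approach used earlier with k set to 2.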
