### K-Nearest Neighbors

- Two major aspects of KNN are
    - which similarity metric to use
    - how to set the value of k
- A common similarity metric is the Euclidean distance
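
As a minimal sketch of the Euclidean distance, assuming each listing is represented as a numeric feature vector (the helper name `euclidean_distance` and the feature values are made up for illustration):

```python
import numpy as np

def euclidean_distance(a, b):
    """Return the Euclidean distance between two equal-length numeric vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return np.sqrt(np.sum((a - b) ** 2))

# Hypothetical listings described by (accommodates, bedrooms).
listing_a = [4, 1]
listing_b = [1, 5]
print(euclidean_distance(listing_a, listing_b))  # 5.0
```

With a single feature, this reduces to the absolute difference between the two values.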


#### Regression vs Classification
- Any model that predicts a numerical value, like the listing price in our case, is known as a regression model.
- Any model that predicts a label from a fixed set of labels (e.g. blood type or gender) is called a classification model.
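
For KNN, the difference shows up in how the neighbours' targets are combined. The sketch below assumes the 3 nearest neighbours have already been found; the neighbour values are hypothetical.

```python
from collections import Counter
from statistics import mean

# Hypothetical targets of the k = 3 nearest neighbours of some query listing.
neighbor_prices = [100.0, 120.0, 140.0]  # numerical target -> regression
neighbor_room_types = ["Private room", "Entire home/apt", "Entire home/apt"]  # labels -> classification

# Regression: predict a number, here the mean of the neighbours' values.
predicted_price = mean(neighbor_prices)

# Classification: predict a label, here the most common label among the neighbours.
predicted_room_type = Counter(neighbor_room_types).most_common(1)[0][0]

print(predicted_price)      # 120.0
print(predicted_room_type)  # Entire home/apt
```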


#### Hyperparameters
Values that affect the behavior and performance of a model but are unrelated to the data that's used are referred to as hyperparameters.

A simple but common hyperparameter optimization technique is known as grid search, which involves:

- selecting a subset of the possible hyperparameter values,
- training a model using each of these hyperparameter values,
- evaluating each model's performance,
- selecting the hyperparameter value that resulted in the lowest error value.
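
The four steps above can be sketched for the hyperparameter k as follows. This is only an illustration under stated assumptions: the data is synthetic, and `knn_predict` and `rmse_for_k` are hypothetical helpers for a univariate KNN, not functions from any library.

```python
import numpy as np

def knn_predict(train_x, train_y, query, k):
    """Predict by averaging the targets of the k training points closest to query."""
    nearest = np.argsort(np.abs(train_x - query), kind="stable")[:k]
    return train_y[nearest].mean()

def rmse_for_k(train_x, train_y, test_x, test_y, k):
    """Root mean squared error of univariate KNN on the held-out points."""
    preds = np.array([knn_predict(train_x, train_y, q, k) for q in test_x])
    return np.sqrt(np.mean((preds - test_y) ** 2))

# Synthetic data (an assumption for illustration): y is roughly 50 * x plus noise.
rng = np.random.default_rng(0)
x = rng.integers(1, 9, size=200).astype(float)
y = 50 * x + rng.normal(0, 20, size=200)
train_x, test_x = x[:150], x[150:]
train_y, test_y = y[:150], y[150:]

# Step 1: select a subset of the possible hyperparameter values.
candidate_ks = [1, 3, 5, 7, 9]
# Steps 2 and 3: build and evaluate a model for each candidate value.
errors = {k: rmse_for_k(train_x, train_y, test_x, test_y, k) for k in candidate_ks}
# Step 4: keep the value that resulted in the lowest error.
best_k = min(errors, key=errors.get)

print(errors)
print("best k:", best_k)
```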

#### Normalizing a column

Normalizing the values in each column to a mean of 0 and a standard deviation of 1 (the scale of the standard normal distribution) preserves the shape of each column's distribution while aligning the scales. To normalize a column this way, you need to:

- subtract the mean of the column from each value
- divide the result by the standard deviation of the column

    $ z = \frac{x - \mu}{\sigma} $
    
Code

df = (df - df.mean())/df.std()
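
A quick way to check the expression is to apply it to a small frame and confirm the resulting mean and standard deviation. The frame `toy` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical two-column frame whose columns live on very different scales.
toy = pd.DataFrame({
    "accommodates": [1, 2, 3, 4, 5],
    "maximum_nights": [30, 60, 365, 730, 1125],
})

# Subtract each column's mean, then divide by each column's standard deviation.
normalized = (toy - toy.mean()) / toy.std()

print(normalized.mean().round(6))  # each column is centred on 0
print(normalized.std().round(6))   # each column has standard deviation 1
```

Note that pandas' `std()` uses the sample standard deviation (`ddof=1`) by default, whereas `np.std` defaults to `ddof=0`, so the two give slightly different scaled values.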

#### Types of errors

- Bias: error that results from wrong assumptions about the learning algorithm. A model built on wrong assumptions has high bias.
- Variance: error that occurs because of the variability of the model's predicted values. It typically grows as more features are added, which makes the model a highly complex multivariate model.


The standard deviation of the RMSE values can be a proxy for a model's variance while the average RMSE is a proxy for a model's bias. Bias and variance are the 2 observable sources of error in a model that we can indirectly control.
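
A minimal sketch of how these two proxies could be computed, assuming the RMSE values come from 5-fold cross-validation (the notes do not say how they are obtained). The data is synthetic and `knn_rmse` is a hypothetical helper.

```python
import numpy as np

def knn_rmse(train_x, train_y, test_x, test_y, k=5):
    """RMSE of a univariate KNN regressor that averages the k nearest targets."""
    preds = []
    for q in test_x:
        nearest = np.argsort(np.abs(train_x - q), kind="stable")[:k]
        preds.append(train_y[nearest].mean())
    return np.sqrt(np.mean((np.array(preds) - test_y) ** 2))

# Synthetic data (an assumption for illustration).
rng = np.random.default_rng(1)
x = rng.integers(1, 9, size=300).astype(float)
y = 50 * x + rng.normal(0, 20, size=300)

# Split the shuffled row positions into 5 folds; each fold is the test set once.
folds = np.array_split(rng.permutation(len(x)), 5)
rmses = []
for i, test_idx in enumerate(folds):
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    rmses.append(knn_rmse(x[train_idx], y[train_idx], x[test_idx], y[test_idx]))

print("average RMSE (proxy for bias):", np.mean(rmses))
print("std of RMSE (proxy for variance):", np.std(rmses))
```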

While k-nearest neighbors can make predictions, it isn't a mathematical model. A mathematical model is usually an equation that can exist without the original data, which isn't true with k-nearest neighbors. In the next two courses, we'll learn about a mathematical model called linear regression. We'll explore the bias-variance tradeoff in greater depth in these next 2 courses because of its importance when working with mathematical models in particular.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
df = pd.read_csv('dc_airbnb.csv')
df.head()

Unnamed: 0,host_response_rate,host_acceptance_rate,host_listings_count,accommodates,room_type,bedrooms,bathrooms,beds,price,cleaning_fee,security_deposit,minimum_nights,maximum_nights,number_of_reviews,latitude,longitude,city,zipcode,state
0,92%,91%,26,4,Entire home/apt,1.0,1.0,2.0,$160.00,$115.00,$100.00,1,1125,0,38.890046,-77.002808,Washington,20003,DC
1,90%,100%,1,6,Entire home/apt,3.0,3.0,3.0,$350.00,$100.00,,2,30,65,38.880413,-76.990485,Washington,20003,DC
2,90%,100%,2,1,Private room,1.0,2.0,1.0,$50.00,,,2,1125,1,38.955291,-76.986006,Hyattsville,20782,MD
3,100%,,1,2,Private room,1.0,1.0,1.0,$95.00,,,1,1125,0,38.872134,-77.019639,Washington,20024,DC
4,92%,67%,1,4,Entire home/apt,1.0,1.0,1.0,$50.00,$15.00,$450.00,7,1125,0,38.996382,-77.041541,Silver Spring,20910,MD


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3723 entries, 0 to 3722
Data columns (total 19 columns):
host_response_rate      3289 non-null object
host_acceptance_rate    3109 non-null object
host_listings_count     3723 non-null int64
accommodates            3723 non-null int64
room_type               3723 non-null object
bedrooms                3702 non-null float64
bathrooms               3696 non-null float64
beds                    3712 non-null float64
price                   3723 non-null object
cleaning_fee            2335 non-null object
security_deposit        1426 non-null object
minimum_nights          3723 non-null int64
maximum_nights          3723 non-null int64
number_of_reviews       3723 non-null int64
latitude                3723 non-null float64
longitude               3723 non-null float64
city                    3723 non-null object
zipcode                 3714 non-null object
state                   3723 non-null object
dtypes: float64(5), int64(5), object(9)

In [3]:
df.shape

(3723, 19)

In [8]:
living_space = 3

In [16]:
df['distance'] = df['accommodates'].apply(lambda x: abs(x - living_space))

In [17]:
df['distance'].value_counts()

1     2294
2      503
0      461
3      279
5       73
4       35
7       22
6       17
9       12
13       8
8        7
12       6
11       4
10       2
Name: distance, dtype: int64

In [18]:
df[(df.distance == 0)]['accommodates']

26      3
34      3
36      3
40      3
44      3
       ..
3675    3
3697    3
3707    3
3714    3
3722    3
Name: accommodates, Length: 461, dtype: int64

In [19]:
np.random.seed(1)

In [20]:
random_index = np.random.permutation(df.index)

In [22]:
dc_listings = df.loc[random_index]

In [25]:
dc_listings = dc_listings.sort_values('distance')

In [28]:
dc_listings[["accommodates","price"]].head(10)

Unnamed: 0,accommodates,price
577,3,$185.00
2166,3,$180.00
3631,3,$175.00
71,3,$128.00
1011,3,$115.00
380,3,$219.00
943,3,$125.00
3107,3,$250.00
1499,3,$94.00
625,3,$150.00


In [32]:
dc_listings['price'] = dc_listings['price'].apply(lambda x: x.replace('$','').replace(',',''))

In [44]:
dc_listings['price'] = dc_listings['price'].astype(float)  # np.float was removed in NumPy 1.24

In [47]:
dc_listings['price'][:5].mean()

156.6

In [49]:
from sklearn.model_selection import train_test_split

In [95]:
df['price'] = df['price'].apply(lambda x: x.replace('$','').replace(',',''))
df['price'] = df['price'].astype(float)  # np.float was removed in NumPy 1.24

In [96]:
train,test = train_test_split(df)

In [97]:
print(f"{train.shape} and {test.shape}")

(2792, 21) and (931, 21)


In [None]:
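def predict_price(listing, col_name="accommodates", df=train, k=5):
    # Assumed completion: average the prices of the k rows closest to `listing`.
    temp = df.copy()
    # Absolute difference, so rows above and below `listing` are treated alike.
    temp['distance'] = (temp[col_name] - listing).abs()
    # `price` was already converted to float in In [95], so no string cleanup is needed.
    nearest = temp.sort_values('distance').head(k)
    return nearest['price'].mean()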

#### Using transform and apply to create a new column with the mean of the k lowest prices

In [177]:
def mean_price(df,k=5):
    df['mean_price'] = df.sort_values("price").head(k)["price"].mean()
    return df

In [173]:
def mean_price_app(series):
    return series.sort_values().head(5).mean()

In [178]:
train = train.groupby("accommodates").apply(mean_price)

In [174]:
train["mean_price_a"] = train.groupby("accommodates")["price"].transform(mean_price_app)
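
To see what `transform` does here, a small sketch on a hypothetical frame may help. The frame `toy` and the helper `mean_of_k_lowest` are assumptions for illustration and use k = 2 instead of 5.

```python
import pandas as pd

# Hypothetical listings; prices are already numeric.
toy = pd.DataFrame({
    "accommodates": [2, 2, 2, 4, 4],
    "price": [80.0, 100.0, 60.0, 200.0, 150.0],
})

def mean_of_k_lowest(series, k=2):
    """Mean of the k smallest values in a group."""
    return series.sort_values().head(k).mean()

# transform broadcasts each group's scalar back onto every row of that group,
# so the result is aligned with the original index.
toy["mean_price"] = toy.groupby("accommodates")["price"].transform(mean_of_k_lowest)
print(toy["mean_price"].tolist())  # [70.0, 70.0, 70.0, 175.0, 175.0]
```

Because `transform` returns a result aligned with the original rows, it can be assigned directly to a new column, whereas `apply` with a function that returns the whole group hands back the modified frame.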

In [179]:
train.head()

Unnamed: 0,host_response_rate,host_acceptance_rate,host_listings_count,accommodates,room_type,bedrooms,bathrooms,beds,price,cleaning_fee,...,latitude,longitude,city,zipcode,state,first_distance,distance,mean,mean_price_a,mean_price
1961,86%,88%,8,8,Entire home/apt,3.0,1.0,5.0,214.0,$75.00,...,38.867161,-76.983278,Washington,20020,DC,5,5,146.6,146.6,146.6
1252,100%,100%,1,3,Entire home/apt,0.0,1.0,1.0,125.0,$10.00,...,38.912726,-77.040933,Washington,20009,DC,0,0,45.4,45.4,45.4
448,75%,100%,1,5,Entire home/apt,2.0,2.0,2.0,200.0,$50.00,...,38.905704,-77.025056,Washington,20001,DC,2,2,72.2,72.2,72.2
2468,93%,40%,1,2,Entire home/apt,1.0,1.0,1.0,100.0,$50.00,...,38.917778,-77.046039,Washington,20009,DC,1,1,29.2,29.2,29.2
550,92%,91%,26,4,Entire home/apt,1.0,1.0,3.0,140.0,$100.00,...,38.907582,-77.023823,Washington,20001,DC,1,1,46.0,46.0,46.0


In [189]:
def predict(val,train=train):
    return train[(train.accommodates == val)]["mean_price"].values[0]

In [190]:
predict(1)

23.8

In [191]:
test = test.copy()  # explicit copy, so the assignment below does not raise SettingWithCopyWarning
test["predicted_price"] = test["accommodates"].apply(predict)


In [194]:
test["squared_error"] = (test.price - test.predicted_price)**2



In [195]:
mse = test["squared_error"].mean()

In [196]:
mse

20316.65328440148

In [198]:
rmse = np.sqrt(mse)

In [199]:
rmse

142.53649807821674
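
The cells above compute the following two quantities, where $y_i$ is the actual price, $\hat{y}_i$ the predicted price, and $n$ the number of test rows (the symbols are introduced here for illustration):

```latex
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2,
\qquad
\mathrm{RMSE} = \sqrt{\mathrm{MSE}}
```

Unlike the MSE, the RMSE is in the same unit as the price, so the value above can be read as an error of roughly 142.54 dollars.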

In [200]:
df.head()

Unnamed: 0,host_response_rate,host_acceptance_rate,host_listings_count,accommodates,room_type,bedrooms,bathrooms,beds,price,cleaning_fee,...,minimum_nights,maximum_nights,number_of_reviews,latitude,longitude,city,zipcode,state,first_distance,distance
0,92%,91%,26,4,Entire home/apt,1.0,1.0,2.0,160.0,$115.00,...,1,1125,0,38.890046,-77.002808,Washington,20003,DC,1,1
1,90%,100%,1,6,Entire home/apt,3.0,3.0,3.0,350.0,$100.00,...,2,30,65,38.880413,-76.990485,Washington,20003,DC,3,3
2,90%,100%,2,1,Private room,1.0,2.0,1.0,50.0,,...,2,1125,1,38.955291,-76.986006,Hyattsville,20782,MD,2,2
3,100%,,1,2,Private room,1.0,1.0,1.0,95.0,,...,1,1125,0,38.872134,-77.019639,Washington,20024,DC,1,1
4,92%,67%,1,4,Entire home/apt,1.0,1.0,1.0,50.0,$15.00,...,7,1125,0,38.996382,-77.041541,Silver Spring,20910,MD,1,1


#### Normalizing a dataframe (only numerical columns)

In [201]:
selected = df[["host_listings_count","accommodates","bedrooms","bathrooms","beds","minimum_nights","maximum_nights","number_of_reviews"]]

In [202]:
selected = (selected - selected.mean())/selected.std()

In [203]:
selected.head()

Unnamed: 0,host_listings_count,accommodates,bedrooms,bathrooms,beds,minimum_nights,maximum_nights,number_of_reviews
0,0.193427,0.400054,-0.250231,-0.437816,0.301731,-0.345048,-0.016456,-0.516324
1,-0.193964,1.393984,2.131144,2.977843,1.147671,-0.069024,-0.016487,1.676245
2,-0.178468,-1.090839,-0.250231,1.270013,-0.544209,-0.069024,-0.016456,-0.482593
3,-0.193964,-0.593875,-0.250231,-0.437816,-0.544209,-0.345048,-0.016456,-0.516324
4,-0.193964,0.400054,-0.250231,-0.437816,-0.544209,1.311093,-0.016456,-0.516324


#### KNN using the scikit-learn library

In [206]:
from scipy.spatial import distance

In [207]:
from sklearn.neighbors import KNeighborsRegressor

In [209]:
knn = KNeighborsRegressor(p=2,algorithm="brute")

In [208]:
from sklearn.metrics import mean_squared_error
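
The imports above are typically followed by a fit, predict, and evaluate step. The sketch below shows one way this might look, using synthetic features as a stand-in for the normalized listing columns. The data, the split, and `n_neighbors=5` are assumptions, not the notebook's actual values.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for the normalized listing features (an assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 100 + 30 * X[:, 0] - 10 * X[:, 1] + rng.normal(0, 5, size=200)
X_train, X_test = X[:150], X[150:]
y_train, y_test = y[:150], y[150:]

# p=2 selects the Euclidean distance; algorithm="brute" compares each query
# against every training row.
knn = KNeighborsRegressor(n_neighbors=5, p=2, algorithm="brute")
knn.fit(X_train, y_train)
predictions = knn.predict(X_test)

mse = mean_squared_error(y_test, predictions)
rmse = np.sqrt(mse)
print(mse, rmse)
```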