# Basic K-Nearest Neighbors Algorithm

In [1]:
import warnings

import pandas as pd

warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', 100)
listings = pd.read_csv('listings.csv')
listings.head()

Unnamed: 0,id,listing_url,scrape_id,last_scraped,name,summary,space,description,experiences_offered,neighborhood_overview,...,review_scores_value,requires_license,license,jurisdiction_names,instant_bookable,cancellation_policy,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count,reviews_per_month
0,7087327,https://www.airbnb.com/rooms/7087327,20151002231825,2015-10-03,Historic DC Condo-Walk to Capitol!,Professional pictures coming soon! Welcome to ...,,Professional pictures coming soon! Welcome to ...,none,,...,,f,,"DISTRICT OF COLUMBIA, WASHINGTON",f,flexible,f,f,18,
1,975833,https://www.airbnb.com/rooms/975833,20151002231825,2015-10-03,Spacious Capitol Hill Townhouse,,Beautifully renovated Capitol Hill townhouse. ...,Beautifully renovated Capitol Hill townhouse. ...,none,,...,9.0,f,,"DISTRICT OF COLUMBIA, WASHINGTON",f,strict,f,f,1,2.11
2,8249488,https://www.airbnb.com/rooms/8249488,20151002231825,2015-10-03,Spacious/private room for single,This is an ideal room for a single traveler th...,,This is an ideal room for a single traveler th...,none,,...,,f,,,f,flexible,f,f,1,1.0
3,8409022,https://www.airbnb.com/rooms/8409022,20151002231825,2015-10-03,A wonderful bedroom with library,Prime location right on the Potomac River in W...,,Prime location right on the Potomac River in W...,none,,...,,f,,"DISTRICT OF COLUMBIA, WASHINGTON",f,flexible,f,f,1,
4,8411173,https://www.airbnb.com/rooms/8411173,20151002231825,2015-10-03,Downtown Silver Spring,"Hi travellers! I live in this peaceful spot, b...",This is a 750 sq ft 1 bedroom 1 bathroom. Whi...,"Hi travellers! I live in this peaceful spot, b...",none,Silver Spring is booming. You can walk to a n...,...,,f,,,f,flexible,f,f,1,


In [2]:
listings.columns

Index(['id', 'listing_url', 'scrape_id', 'last_scraped', 'name', 'summary',
       'space', 'description', 'experiences_offered', 'neighborhood_overview',
       'notes', 'transit', 'thumbnail_url', 'medium_url', 'picture_url',
       'xl_picture_url', 'host_id', 'host_url', 'host_name', 'host_since',
       'host_location', 'host_about', 'host_response_time',
       'host_response_rate', 'host_acceptance_rate', 'host_is_superhost',
       'host_thumbnail_url', 'host_picture_url', 'host_neighbourhood',
       'host_listings_count', 'host_total_listings_count',
       'host_verifications', 'host_has_profile_pic', 'host_identity_verified',
       'street', 'neighbourhood', 'neighbourhood_cleansed',
       'neighbourhood_group_cleansed', 'city', 'state', 'zipcode', 'market',
       'smart_location', 'country_code', 'country', 'latitude', 'longitude',
       'is_location_exact', 'property_type', 'room_type', 'accommodates',
       'bathrooms', 'bedrooms', 'beds', 'bed_type', 'amenities', '

In [3]:
# Keep only the columns that will be useful.
# .copy() gives an independent DataFrame, so adding columns later does not raise SettingWithCopyWarning.
df = listings[['host_response_rate', 'host_acceptance_rate', 'host_listings_count',
              'accommodates', 'room_type', 'bedrooms', 'bathrooms', 'beds', 'price',
              'cleaning_fee', 'security_deposit', 'minimum_nights', 'maximum_nights',
              'number_of_reviews', 'latitude', 'longitude', 'city', 'zipcode', 'state']].copy()

In [4]:
df.head()

Unnamed: 0,host_response_rate,host_acceptance_rate,host_listings_count,accommodates,room_type,bedrooms,bathrooms,beds,price,cleaning_fee,security_deposit,minimum_nights,maximum_nights,number_of_reviews,latitude,longitude,city,zipcode,state
0,92%,91%,26,4,Entire home/apt,1.0,1.0,2.0,$160.00,$115.00,$100.00,1,1125,0,38.890046,-77.002808,Washington,20003,DC
1,90%,100%,1,6,Entire home/apt,3.0,3.0,3.0,$350.00,$100.00,,2,30,65,38.880413,-76.990485,Washington,20003,DC
2,90%,100%,2,1,Private room,1.0,2.0,1.0,$50.00,,,2,1125,1,38.955291,-76.986006,Hyattsville,20782,MD
3,100%,,1,2,Private room,1.0,1.0,1.0,$95.00,,,1,1125,0,38.872134,-77.019639,Washington,20024,DC
4,92%,67%,1,4,Entire home/apt,1.0,1.0,1.0,$50.00,$15.00,$450.00,7,1125,0,38.996382,-77.041541,Silver Spring,20910,MD


In [5]:
df.shape

(3723, 19)

# Column Description

* **host_response_rate:** the response rate of the host
* **host_acceptance_rate:** share of requests to the host that convert to rentals
* **host_listings_count:** number of other listings the host has
* **latitude:** latitude of the geographic coordinates
* **longitude:** longitude of the geographic coordinates
* **city:** the city where the living space is located
* **zipcode:** the zip code where the living space is located
* **state:** the state where the living space is located
* **accommodates:** the number of guests the rental can accommodate
* **room_type:** the type of living space (**Private room, Shared room or Entire home/apt**)
* **bedrooms:** number of bedrooms included in the rental
* **bathrooms:** number of bathrooms included in the rental
* **beds:** number of beds included in the rental
* **price:** nightly price for the rental
* **cleaning_fee:** additional fee used for cleaning the living space after the guest leaves
* **security_deposit:** refundable security deposit, in case of damages
* **minimum_nights:** minimum number of nights a guest can stay for the rental
* **maximum_nights:** maximum number of nights a guest can stay for the rental
* **number_of_reviews:** number of reviews that previous guests have left

**1. Similarity Metric**

The **Similarity Metric** works by comparing a fixed set of numerical features (attributes) between 2 observations. When trying to predict a ***continuous value***, the main similarity metric that's used is ***Euclidean Distance***.

\begin{equation*}
d   = {\sqrt{(q_1 - p_1)^2 + (q_2 - p_2)^2 + \ldots + (q_n - p_n)^2}}
\end{equation*}

where $q_1$  to $q_n$ represent the feature values for one observation and $p_1$ to $p_n$ represent the feature values for the other observation.
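
As a minimal sketch of the formula above, the distance can be computed with NumPy. The function name `euclidean_distance` and the two sample listings are assumptions made for illustration; they are not part of the dataset.

```python
import numpy as np

def euclidean_distance(q, p):
    """Return the Euclidean distance between two equal-length feature vectors."""
    q = np.asarray(q, dtype=float)
    p = np.asarray(p, dtype=float)
    return float(np.sqrt(np.sum((q - p) ** 2)))

# Two hypothetical listings described by (accommodates, bedrooms)
listing_a = [3, 1]
listing_b = [6, 5]
print(euclidean_distance(listing_a, listing_b))  # 5.0

# With a single feature, the formula reduces to the absolute difference
print(euclidean_distance([3], [4]))  # 1.0
```

The second call shows why the univariate examples in this notebook can use `np.abs` directly.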


**Example 1:**
1. Assume that you own a living space which can accommodate 3 people. Calculate the Euclidean distance between your living space and the first living space in the dataframe `df`.

In [6]:
import numpy as np

your_living_space = 3
first_living_space_value = df.iloc[0]['accommodates']  #Locate the first living space in the df
first_distance = np.abs(first_living_space_value - your_living_space) #the formula looks like this since the case is univariate.
print(first_distance)

1


An output of **1** means that the capacity of the first living space in `df` differs from yours by one person; since it accommodates 4 people, it holds one more guest than your space. A distance of **0** means that the two listings match exactly on this feature.

Now we want to know how many living spaces in your area can also accommodate 3 people.

In [7]:
your_living_space = 3
df['distance'] = df.accommodates.apply(lambda x: np.abs(x - your_living_space))
print(df[df["distance"] == 0]["accommodates"].value_counts())

3    461
Name: accommodates, dtype: int64




There are **461** living spaces that accommodate 3 people, exactly like yours. Because all of them tie at a distance of 0, taking the **k=5 nearest neighbors** directly would be biased: it would simply pick the first 5 tied rows in the order they appear in the dataframe. In a situation like this, it is best to shuffle the rows before selecting the 5 nearest living spaces.

In [8]:
np.random.seed(1234)  # set a seed so that the shuffle is reproducible
df = df.loc[np.random.permutation(len(df))]
df = df.sort_values('distance')
print(df.iloc[0:5]['price'])

274     $120.00
1124    $110.00
1744    $110.00
2729    $200.00
2500    $150.00
Name: price, dtype: object
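
One detail worth keeping in mind when breaking ties by shuffling: a sort only preserves the shuffled order of tied rows if the sorting algorithm is stable. The sketch below, on a made-up `toy` dataframe, shuffles with `DataFrame.sample` and then sorts with `kind="mergesort"`, which is a stable algorithm. It is an illustration under these assumptions, not the cell used in this notebook.

```python
import pandas as pd

# Toy data: hypothetical listings, four of which tie at distance 0
toy = pd.DataFrame({
    "accommodates": [3, 3, 3, 3, 2, 5],
    "price": [100.0, 120.0, 140.0, 160.0, 80.0, 300.0],
})
toy["distance"] = (toy["accommodates"] - 3).abs()

# Shuffle the rows reproducibly, then sort with a stable algorithm so that
# tied rows keep the order produced by the shuffle.
shuffled = toy.sample(frac=1, random_state=1234)
ranked = shuffled.sort_values("distance", kind="mergesort")

print(ranked.head(3))
```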


To estimate the price of your living space from its nearest neighbors, that is, from living spaces with similar features, we average the prices of the 5 selected neighbors. However, as you may notice, the dtype of `price` is `object` and the values still contain a dollar sign (and, for larger amounts, a thousands separator). Clean the data and then determine the price of your living space.

In [9]:
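# regex=False treats ',' and '$' as literal characters ('$' is an end-of-string anchor in a regex)
stripped_commas = df['price'].str.replace(',', '', regex=False)
stripped_dollars = stripped_commas.str.replace('$', '', regex=False)
df['price'] = stripped_dollars.astype('float')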
mean_price = df.iloc[0:5]['price'].mean()
print(mean_price)

138.0


Based on K-Nearest Neighbors with $k=5$, the estimated price of your living space is **\$138.00** per night.

**Example 2:**
2. Suppose you have three other living spaces that can accommodate 1, 2 and 4 people respectively. Define a function that predicts the price of those living spaces.

In [10]:
def predict_price(accom):
    """Predict the nightly price as the mean price of the 5 nearest listings."""
    temp_df = df.copy()  # work on a copy so that df itself is not modified
    # Univariate distance: absolute difference in the number of guests
    temp_df['distance'] = temp_df['accommodates'].apply(lambda x: np.abs(x - accom))
    temp_df = temp_df.sort_values('distance')
    nearest_neighbors = temp_df.iloc[0:5]['price']
    predicted_price = nearest_neighbors.mean()
    return predicted_price

accom_one = predict_price(1)
accom_two = predict_price(2)
accom_four = predict_price(4)
print("The Price of the living space that can accommodate 1 person is $" + str(accom_one))
print("The Price of the living space that can accommodate 2 people is $" + str(accom_two))
print("The Price of the living space that can accommodate 4 people is $" + str(accom_four))

The Price of the living space that can accommodate 1 person is $86.0
The Price of the living space that can accommodate 2 people is $92.8
The Price of the living space that can accommodate 4 people is $290.2
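
The function above fixes the number of neighbors at 5. A minimal sketch of a version that takes `k` as a parameter is shown below. The name `predict_price_knn` and the `toy` dataframe are hypothetical and only stand in for the cleaned listings.

```python
import pandas as pd

def predict_price_knn(data, accom, k=5):
    """Average the price of the k rows whose 'accommodates' is closest to accom."""
    distances = (data["accommodates"] - accom).abs()
    # A stable sort keeps tied rows in their existing order
    nearest = distances.sort_values(kind="mergesort").index[:k]
    return data.loc[nearest, "price"].mean()

# Toy stand-in for the cleaned listings (values are made up)
toy = pd.DataFrame({
    "accommodates": [1, 2, 2, 3, 4, 4, 6],
    "price": [50.0, 80.0, 90.0, 120.0, 150.0, 170.0, 300.0],
})

print(predict_price_knn(toy, accom=4, k=2))  # 160.0
print(predict_price_knn(toy, accom=2, k=3))
```

Passing the dataframe as an argument, rather than reading a global `df`, makes it easy to reuse the same function on a training set later.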


**Exercise:**
1. What would the price of your living space be if:
 * $k=7$
 * $k=9$
2. You have another property which can accommodate 6 people and has 3 bedrooms. What would its price be based on KNN with $k=3$?

# Evaluating Model Performance (Univariate Case)

**Model Evaluation** is an integral part of the model development process. It helps us find the model that best represents our data and estimate how well the chosen model will work in the future. Evaluating model performance on the same data used for training is not acceptable in data science, because it easily produces overoptimistic, overfitted models. Two common methods of evaluating models are ***Hold-Out*** and ***Cross-Validation***. To avoid overfitting, both methods use a test set (not seen by the model) to evaluate model performance.

<font size="4">**Hold-Out**	</font>			
In this method, the dataset (usually a large one) is randomly divided into three subsets:		
1. **Training set** is a subset of the dataset used to build predictive models.
2. **Validation set** is a subset of the dataset used to assess the performance of the model built in the training phase. It provides a test platform for fine-tuning the model's parameters and selecting the best-performing model. Not all modeling algorithms need a validation set.
3. **Test set**, or unseen examples, is a subset of the dataset used to assess the likely future performance of a model. If a model fits the training set much better than it fits the test set, overfitting is probably the cause.

<font size="4">**Cross-Validation**	</font>	

When only a limited amount of data is available, we use k-fold cross-validation to obtain a less biased estimate of the model performance. In k-fold cross-validation, we divide the data into k subsets of equal size. We build models k times, each time leaving one of the subsets out of training and using it as the test set. If k equals the sample size, this is called "leave-one-out".
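
The k-fold procedure described above can be sketched with scikit-learn's `KFold`. The data here is synthetic (a made-up relation between capacity and price), and the choice of 5 folds and 5 neighbors is an assumption for illustration only.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data (an assumption for illustration): price grows with capacity
rng = np.random.RandomState(1234)
X = rng.randint(1, 9, size=(200, 1)).astype(float)
y = 40.0 * X[:, 0] + rng.normal(0, 10, size=200)

kf = KFold(n_splits=5, shuffle=True, random_state=1234)
fold_rmses = []
for train_idx, test_idx in kf.split(X):
    # Fit on four folds, evaluate on the fold that was left out
    model = KNeighborsRegressor(n_neighbors=5)
    model.fit(X[train_idx], y[train_idx])
    predictions = model.predict(X[test_idx])
    fold_rmses.append(mean_squared_error(y[test_idx], predictions) ** 0.5)

print(fold_rmses)
print(np.mean(fold_rmses))
```

Averaging the per-fold errors gives a single estimate that does not depend on one particular train/test split.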

In [11]:
#Create a df for Train and Test.
from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(df, test_size=0.25)

In [12]:
def predict_price(accom):
    # Use only the training set to find neighbors, so the test set stays unseen
    temp_df = train_df.copy()
    temp_df['distance'] = temp_df['accommodates'].apply(lambda x: np.abs(x - accom))
    temp_df = temp_df.sort_values('distance')
    nearest_neighbor_prices = temp_df.iloc[0:5]['price']
    predicted_price = nearest_neighbor_prices.mean()
    return predicted_price

test_df = test_df.copy()  # work on an explicit copy to avoid SettingWithCopyWarning
test_df['predicted_price'] = test_df['accommodates'].apply(predict_price)



In [24]:
test_df.head()

Unnamed: 0,host_response_rate,host_acceptance_rate,host_listings_count,accommodates,room_type,bedrooms,bathrooms,beds,price,cleaning_fee,security_deposit,minimum_nights,maximum_nights,number_of_reviews,latitude,longitude,city,zipcode,state,distance,predicted_price,error,squared_error
180,100%,45%,1,4,Entire home/apt,1.0,1.0,2.0,150.0,$45.00,$200.00,2,90,25,38.897977,-77.014872,Washington,20001,DC,1,110.8,39.2,1536.64
2766,100%,80%,1,4,Entire home/apt,1.0,1.0,2.0,70.0,$50.00,,2,1125,35,38.936094,-77.037573,Washington,20010,DC,1,110.8,40.8,1664.64
1647,100%,67%,1,2,Private room,1.0,1.0,1.0,60.0,$30.00,$200.00,1,730,1,38.937241,-77.026488,Washington,20011,DC,1,106.8,46.8,2190.24
2243,100%,100%,1,5,Entire home/apt,1.0,1.0,3.0,130.0,,$150.00,2,1125,51,38.900826,-76.993069,Washington,20002,DC,2,175.0,45.0,2025.0
704,,,1,1,Private room,1.0,1.0,1.0,1300.0,,,1,1125,0,38.886241,-76.930404,Washington,20019,DC,2,64.0,1236.0,1527696.0


**2. Error Metric**
An ***Error Metric*** is a metric used to measure the error of a forecasting model. Error metrics give forecasters a way to quantitatively compare the performance of competing models. Some common error metrics are:

* **Mean Absolute Error (MAE)**
\begin{equation*}
MAE   = {\frac{1}{n}\sum_{k=1}^{n} |Actual_k-Predicted_k| }
\end{equation*}

* **Mean Squared Error (MSE)**
\begin{equation*}
MSE   = {\frac{1}{n}\sum_{k=1}^{n} (Actual_k-Predicted_k)^2 }
\end{equation*}

* **Root Mean Square Error (RMSE)**
\begin{equation*}
RMSE   = {\sqrt{MSE} }
\end{equation*}
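
As a quick worked example of the three metrics, here is a minimal sketch on three made-up actual and predicted prices (the numbers are not from the dataset):

```python
import numpy as np

# Hypothetical actual and predicted nightly prices
actual = np.array([100.0, 200.0, 300.0])
predicted = np.array([110.0, 190.0, 320.0])

errors = actual - predicted           # [-10, 10, -20]
mae = np.mean(np.abs(errors))         # (10 + 10 + 20) / 3
mse = np.mean(errors ** 2)            # (100 + 100 + 400) / 3 = 200
rmse = np.sqrt(mse)                   # square root of the MSE

print(mae, mse, rmse)
```

Note that the RMSE (about 14.14) is larger than the MAE (about 13.33) here, because squaring gives the single error of 20 more weight.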


In [13]:
#Calculate the Mean Absolute Error
test_df['error'] = np.absolute(test_df['predicted_price'] - test_df['price'])
mae = test_df['error'].mean()
print(mae)

56.84189044038669




*Mean Absolute Error* takes the absolute value of each error before averaging all the errors.

In [14]:
#Calculate the Mean Squared Error
test_df['squared_error'] = (test_df['predicted_price'] - test_df['price'])**(2)
mse = test_df['squared_error'].mean()
print(mse)

14636.604554242804




The *Mean Squared Error* penalizes large gaps between the predicted and actual values more heavily. Comparing MSE values helps us identify which model performs better on a relative basis, but it doesn't tell us whether the performance is good enough in absolute terms. This is because the units of the MSE are squared (in this case, dollars squared). An MSE value of 14636.60 dollars squared doesn't give us an intuitive sense of how far off the model's predictions are from the true price in dollars. This is where the Root Mean Squared Error enters the picture.

In [15]:
rmse = mse ** (1/2)  # parentheses are required: mse ** 1/2 would compute (mse ** 1) / 2
print(round(rmse, 2))

120.98

Since the *RMSE* uses the same units as the target column, it tells us how far off, in real dollars, we can expect the model's predictions to be. We can try to improve this model by *increasing the number of nearby neighbors* or by *increasing the number of attributes*, which turns the problem into a **multivariate case**.
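
To see how the number of neighbors affects the error, one can refit the model for several values of $k$ and compare the RMSE on a held-out test set. The sketch below does this on synthetic data; the data-generating rule and the candidate values of $k$ are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data (an assumption for illustration): price grows with capacity
rng = np.random.RandomState(1234)
X = rng.randint(1, 9, size=(300, 1)).astype(float)
y = 40.0 * X[:, 0] + rng.normal(0, 15, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1234
)

rmse_by_k = {}
for k in [1, 3, 5, 7, 9]:
    model = KNeighborsRegressor(n_neighbors=k)
    model.fit(X_train, y_train)
    rmse_by_k[k] = mean_squared_error(y_test, model.predict(X_test)) ** 0.5

for k, rmse in rmse_by_k.items():
    print(k, round(rmse, 2))
```

A larger $k$ does not always help; the best value depends on the data, which is why it is usually chosen by comparing errors like this.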

# Multivariate K-Nearest Neighbors Algorithm

When adding more attributes to a K-Nearest Neighbors model, we need to watch out for columns that don't work well with the distance equation. This includes columns containing:

* non-numerical values (e.g. city or state)
    * Euclidean distance equation expects numerical values
* missing values
    * distance equation expects a value for each observation and attribute
* non-ordinal values (e.g. latitude or longitude)
    * ranking by Euclidean distance only makes sense if all attributes are ordinal
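
A quick way to spot the first two kinds of problem columns is to inspect the dtypes and the missing-value counts. The sketch below uses a small made-up `toy` dataframe; note that a price stored as text such as `$160.00` also shows up as non-numeric.

```python
import numpy as np
import pandas as pd

# Toy dataframe (made-up values) with a text column, a price stored as text,
# and one missing value
toy = pd.DataFrame({
    "accommodates": [4, 6, 1],
    "bedrooms": [1.0, np.nan, 1.0],
    "city": ["Washington", "Washington", "Hyattsville"],
    "price": ["$160.00", "$350.00", "$50.00"],
})

# Columns whose dtype is not numeric cannot be used in the distance equation as-is
non_numeric = toy.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)

# Columns with at least one missing value
missing = toy.isnull().sum()
print(missing[missing > 0])
```

Numeric but non-ordinal columns such as zip codes cannot be detected this way and have to be identified from their meaning.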

In our dataset, there are columns that contain non-numerical values, such as:
* room_type: e.g. Private room
* city: e.g. Washington
* state: e.g. DC

There are also columns that contain numeric but non-ordinal values, such as:
* latitude: e.g. 38.913458
* longitude: e.g. -77.031
* zipcode: e.g. 20009

In this tutorial we want to model the price of the living space, and some columns (***host_response_rate***, ***host_acceptance_rate*** and ***host_listings_count***) do not directly describe the living space itself. Thus we drop them as well.




In [16]:
drop_columns = ['room_type', 'city', 'state', 'latitude', 'longitude', 'zipcode', 'host_response_rate', 
                'host_acceptance_rate', 'host_listings_count']
df = df.drop(drop_columns, axis=1)
print(df.isnull().sum())
print(df.isnull().sum()/ df.shape[0] * 100)

accommodates            0
bedrooms               21
bathrooms              27
beds                   11
price                   0
cleaning_fee         1388
security_deposit     2297
minimum_nights          0
maximum_nights          0
number_of_reviews       0
distance                0
dtype: int64
accommodates          0.000000
bedrooms              0.564061
bathrooms             0.725222
beds                  0.295461
price                 0.000000
cleaning_fee         37.281762
security_deposit     61.697556
minimum_nights        0.000000
maximum_nights        0.000000
number_of_reviews     0.000000
distance              0.000000
dtype: float64


Notice that there are a lot of missing values in the dataframe.
**cleaning_fee** is missing in **37.28%** of the rows and **security_deposit** in **61.7%**.
That is too much missing data to handle reliably, so it is better to drop these two columns from the model; dropping the affected rows instead would discard a large share of the dataset.
For the columns that have only a few missing values (**bedrooms**, **bathrooms** and **beds**), we remove the rows that contain them instead.

In [17]:
df = df.drop(['cleaning_fee', 'security_deposit'], axis=1) #remove column
df = df.dropna(axis=0) #remove the rows which has a missing value
print(df.isnull().sum())

accommodates         0
bedrooms             0
bathrooms            0
beds                 0
price                0
minimum_nights       0
maximum_nights       0
number_of_reviews    0
distance             0
dtype: int64


In [18]:
df.head()

Unnamed: 0,accommodates,bedrooms,bathrooms,beds,price,minimum_nights,maximum_nights,number_of_reviews,distance
274,3,1.0,1.0,1.0,120.0,1,1125,3,0
1124,3,1.0,1.0,1.0,110.0,2,7,8,0
1744,3,1.0,1.0,2.0,110.0,3,1125,0,0
2729,3,1.0,1.0,1.0,200.0,1,1125,3,0
2500,3,2.0,2.0,2.0,150.0,14,30,0,0


In [19]:
df.describe()

Unnamed: 0,accommodates,bedrooms,bathrooms,beds,price,minimum_nights,maximum_nights,number_of_reviews,distance
count,3671.0,3671.0,3671.0,3671.0,3671.0,3671.0,3671.0,3671.0,3671.0
mean,3.195587,1.209752,1.257695,1.64778,148.843639,2.235358,588519.4,15.106783,1.421411
std,2.00419,0.840801,0.586803,1.184549,137.550045,3.618777,35443910.0,29.236563,1.426212
min,1.0,0.0,0.0,1.0,10.0,1.0,1.0,0.0,0.0
25%,2.0,1.0,1.0,1.0,85.0,1.0,120.0,1.0,1.0
50%,2.0,1.0,1.0,1.0,115.0,2.0,1125.0,4.0,1.0
75%,4.0,1.0,1.0,2.0,165.0,3.0,1125.0,16.0,2.0
max,16.0,10.0,8.0,16.0,2822.0,180.0,2147484000.0,362.0,13.0


As you can see in the summary table, the maximum_nights and number_of_reviews columns span a much larger range than the other columns. If we used these two columns as they are in our KNN model, they could end up having an outsized effect on the Euclidean distance calculations. To prevent this, we can ***normalize*** all of the columns to have a mean of ***0*** and a standard deviation of ***1***. 
The mathematical formula for normalizing attributes is:

\begin{equation*}
z   = {\frac{x-\mu}{\sigma}}
\end{equation*}

where $x$ is a value in a specific column, $\mu$ is the mean of all the values in the column, $\sigma$ is the standard deviation of all the values in the column, and $z$ is the normalized value.

In [20]:
normalized_df = (df - df.mean())/(df.std())
normalized_df['price'] = df['price'] #Normalize all columns except price
print(normalized_df.head(5))

      accommodates  bedrooms  bathrooms      beds  price  minimum_nights  \
274      -0.097589 -0.249467  -0.439151 -0.546858  120.0       -0.341375   
1124     -0.097589 -0.249467  -0.439151 -0.546858  110.0       -0.065038   
1744     -0.097589 -0.249467  -0.439151  0.297345  110.0        0.211298   
2729     -0.097589 -0.249467  -0.439151 -0.546858  200.0       -0.341375   
2500     -0.097589  0.939875   1.264998  0.297345  150.0        3.251000   

      maximum_nights  number_of_reviews  distance  
274        -0.016573          -0.414097 -0.996634  
1124       -0.016604          -0.243079 -0.996634  
1744       -0.016573          -0.516709 -0.996634  
2729       -0.016573          -0.414097 -0.996634  
2500       -0.016603          -0.516709 -0.996634  


Alternatively, we can use `StandardScaler()` from sklearn to standardize all the columns.

In [25]:
from sklearn import preprocessing
names = df.columns
scaler = preprocessing.StandardScaler()
scaled_values = scaler.fit_transform(df)
# Keep the original index so that the unscaled prices align with the right rows
scaled_df = pd.DataFrame(scaled_values, columns=names, index=df.index)
scaled_df['price'] = df['price']
scaled_df.head(5)

Unnamed: 0,accommodates,bedrooms,bathrooms,beds,price,minimum_nights,maximum_nights,number_of_reviews,distance
274,-0.097602,-0.249501,-0.439211,-0.546933,120.0,-0.341421,-0.016575,-0.414154,-0.996769
1124,-0.097602,-0.249501,-0.439211,-0.546933,110.0,-0.065047,-0.016606,-0.243112,-0.996769
1744,-0.097602,-0.249501,-0.439211,0.297386,110.0,0.211327,-0.016575,-0.516779,-0.996769
2729,-0.097602,-0.249501,-0.439211,-0.546933,200.0,-0.341421,-0.016575,-0.414154,-0.996769
2500,-0.097602,0.940003,1.26517,0.297386,150.0,3.251442,-0.016606,-0.516779,-0.996769
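
The scaled values differ slightly from the manually normalized ones (for example -0.097602 versus -0.097589 for `accommodates`). This is consistent with the two approaches using different standard deviations: pandas' `.std()` defaults to the sample standard deviation (`ddof=1`), while `StandardScaler` divides by the population standard deviation (`ddof=0`). The sketch below checks this on a small made-up column.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy column (made-up values)
toy = pd.DataFrame({"accommodates": [1.0, 2.0, 3.0, 4.0, 10.0]})

manual_sample = (toy - toy.mean()) / toy.std()            # ddof=1, pandas default
manual_population = (toy - toy.mean()) / toy.std(ddof=0)  # ddof=0
scaled = StandardScaler().fit_transform(toy)

print(np.allclose(scaled, manual_population.to_numpy()))  # True
print(np.allclose(scaled, manual_sample.to_numpy()))      # False
```

With thousands of rows the two versions are nearly identical, so either can be used for KNN as long as the same one is applied to every column.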


In [22]:
df.columns

Index(['accommodates', 'bedrooms', 'bathrooms', 'beds', 'price',
       'minimum_nights', 'maximum_nights', 'number_of_reviews', 'distance'],
      dtype='object')

For now, let's use `normalized_df` and run a KNN with two predictors, `accommodates` and `bathrooms`.

In [26]:
from sklearn.neighbors import KNeighborsRegressor

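# Positional split: the first 2792 rows for training, the rest for testing.
# Note: df is still sorted by 'distance' from the earlier example, so this split is not random.
train_df = normalized_df.iloc[0:2792]
test_df = normalized_df.iloc[2792:]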
train_columns = ['accommodates', 'bathrooms']

# Instantiate ML model.
knn = KNeighborsRegressor(n_neighbors=5, algorithm='brute')

# Fit model to data.
knn.fit(train_df[train_columns], train_df['price'])

# Use model to make predictions.
predictions = knn.predict(test_df[train_columns])

In the univariate case, I demonstrated how to calculate the MSE and RMSE using the formulas. We can also use `sklearn.metrics`, specifically the `mean_squared_error` function, to get the MSE and then the RMSE.

In [27]:
from sklearn.metrics import mean_squared_error

train_columns = ['accommodates', 'bathrooms']
knn = KNeighborsRegressor(n_neighbors=5, algorithm='brute', metric='euclidean')
knn.fit(train_df[train_columns], train_df['price'])
predictions = knn.predict(test_df[train_columns])

two_features_mse = mean_squared_error(test_df['price'], predictions)
two_features_rmse = two_features_mse ** (1/2)
print(two_features_mse)
print(two_features_rmse)

41606.82002275313
203.97749881482792


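Compare the MSE and RMSE with those of the univariate model. Here the MSE (about 41,607) is higher than the univariate MSE (about 14,637), not lower, and the same holds for the RMSE, which is simply the square root of the MSE. Keep in mind that the two results are not directly comparable, because the models were evaluated on different train/test splits.

Let's expand from two predictors to five and see how the model behaves.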

In [29]:
normalized_df.head()

Unnamed: 0,accommodates,bedrooms,bathrooms,beds,price,minimum_nights,maximum_nights,number_of_reviews,distance
274,-0.097589,-0.249467,-0.439151,-0.546858,120.0,-0.341375,-0.016573,-0.414097,-0.996634
1124,-0.097589,-0.249467,-0.439151,-0.546858,110.0,-0.065038,-0.016604,-0.243079,-0.996634
1744,-0.097589,-0.249467,-0.439151,0.297345,110.0,0.211298,-0.016573,-0.516709,-0.996634
2729,-0.097589,-0.249467,-0.439151,-0.546858,200.0,-0.341375,-0.016573,-0.414097,-0.996634
2500,-0.097589,0.939875,1.264998,0.297345,150.0,3.251,-0.016603,-0.516709,-0.996634


In [30]:
features = ['accommodates', 'bedrooms', 'bathrooms', 'beds', 'number_of_reviews']
from sklearn.neighbors import KNeighborsRegressor
knn = KNeighborsRegressor(n_neighbors=5, algorithm='brute')
knn.fit(train_df[features], train_df['price'])
five_predictions = knn.predict(test_df[features])
five_mse = mean_squared_error(test_df['price'], five_predictions)
five_rmse = five_mse ** (1/2)
print(five_mse)
print(five_rmse)

41229.56882821388
203.05065581823143


As you can see, the MSE and RMSE have decreased slightly compared with the two-feature model. Now let's run the KNN using all the predictors.

In [32]:
knn = KNeighborsRegressor(n_neighbors=5, algorithm='brute')

features = train_df.columns.tolist()
features.remove('price')

knn.fit(train_df[features], train_df['price'])
all_features_predictions = knn.predict(test_df[features])
all_features_mse = mean_squared_error(test_df['price'], all_features_predictions)
all_features_rmse = all_features_mse ** (1/2)
print(all_features_mse)
print(all_features_rmse)

40021.288737201365
200.05321476347578


Again, the MSE and RMSE have decreased slightly. Bear in mind, however, that using all the features does not always decrease the RMSE and MSE, and does not automatically improve prediction accuracy or performance metrics in general. Sometimes it worsens the model's performance. This suggests that features should be selected carefully, and that comparing different feature selection techniques is worthwhile for achieving better predictive performance.
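
One simple way to compare feature sets is to score each candidate set with cross-validation. The sketch below does this on synthetic data in which a hypothetical `noise` column is deliberately unrelated to the price. The column names, coefficients and candidate sets are assumptions for illustration, not results from the listings data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data (an assumption for illustration)
rng = np.random.RandomState(1234)
n = 300
toy = pd.DataFrame({
    "accommodates": rng.randint(1, 9, size=n).astype(float),
    "bathrooms": rng.randint(1, 4, size=n).astype(float),
    "noise": rng.normal(size=n),  # deliberately unrelated to price
})
toy["price"] = (40 * toy["accommodates"] + 25 * toy["bathrooms"]
                + rng.normal(0, 10, size=n))

candidate_sets = {
    "accommodates": ["accommodates"],
    "accommodates+bathrooms": ["accommodates", "bathrooms"],
    "all": ["accommodates", "bathrooms", "noise"],
}

rmse_by_set = {}
for name, cols in candidate_sets.items():
    # Normalize the features so that each contributes comparably to the distance
    X = (toy[cols] - toy[cols].mean()) / toy[cols].std()
    scores = cross_val_score(KNeighborsRegressor(n_neighbors=5), X, toy["price"],
                             cv=5, scoring="neg_mean_squared_error")
    rmse_by_set[name] = float(np.sqrt(-scores.mean()))

for name, rmse in rmse_by_set.items():
    print(name, round(rmse, 2))
```

scikit-learn reports the negated MSE so that higher scores are always better, which is why the sign is flipped before taking the square root.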