Previously, we explored how to use a simple k-nearest neighbors machine learning model that used just one feature, or attribute, of the listing to predict the rent price. Using just a single feature to compare listings clearly doesn't reflect the reality of the market. An apartment that can accommodate 4 guests in a popular part of Washington D.C. will rent for much more than one that can accommodate 4 guests in a crime-ridden area.

There are 2 ways we can tweak the model to try to improve the accuracy (decrease the RMSE during validation):

* increase the number of attributes the model uses to calculate similarity when ranking the closest neighbors
* increase k, the number of nearby neighbors the model uses when computing the prediction

When selecting more attributes to use in the model, we need to watch out for columns that don't work well with the distance equation. This includes columns containing:

* non-numerical values (e.g. city or state)
 * Euclidean distance equation expects numerical values
* missing values
 * distance equation expects a value for each observation and attribute
* non-ordinal values (e.g. latitude or longitude)
 * ranking by Euclidean distance doesn't make sense unless all attributes are ordinal
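
The first two checks in the list above can be automated. The snippet below is a minimal sketch on a hypothetical toy DataFrame (the `toy` data and variable names are made up for illustration, not taken from the dataset). Non-ordinal columns such as latitude still have to be spotted by reading what the column means.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for a listings table.
toy = pd.DataFrame({
    "accommodates": [2, 4, 1],
    "bedrooms": [1.0, np.nan, 1.0],
    "city": ["Washington", "Washington", "Washington"],
})

# Columns whose dtype is not numeric cannot go into the distance equation.
non_numeric_cols = toy.select_dtypes(exclude="number").columns.tolist()

# Columns with at least one missing value need their rows dropped
# or the whole column removed before computing distances.
missing_counts = toy.isnull().sum()
cols_with_missing = missing_counts[missing_counts > 0].index.tolist()

print(non_numeric_cols)
print(cols_with_missing)
```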

In [5]:
import pandas as pd
import numpy as np

dc_listings = pd.read_csv('dc_airbnb.csv')

In [6]:
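# regex=False treats "$" and "," as literal characters rather than regular expressions
dc_listings["price"] = dc_listings["price"].str.replace("$", "", regex=False).str.replace(",", "", regex=False).astype(float)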

In [12]:
np.random.seed(1)
dc_listings = dc_listings.loc[np.random.permutation(len(dc_listings))] # np.random.permutation(dc_listings.index)

In [18]:
# Use the DataFrame.info() method to return the number of non-null values in each column.
dc_listings.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3723 entries, 574 to 1061
Data columns (total 19 columns):
host_response_rate      3289 non-null object
host_acceptance_rate    3109 non-null object
host_listings_count     3723 non-null int64
accommodates            3723 non-null int64
room_type               3723 non-null object
bedrooms                3702 non-null float64
bathrooms               3696 non-null float64
beds                    3712 non-null float64
price                   3723 non-null float64
cleaning_fee            2335 non-null object
security_deposit        1426 non-null object
minimum_nights          3723 non-null int64
maximum_nights          3723 non-null int64
number_of_reviews       3723 non-null int64
latitude                3723 non-null float64
longitude               3723 non-null float64
city                    3723 non-null object
zipcode                 3714 non-null object
state                   3723 non-null object
dtypes: float64(6), int64(5), objec

The following columns contain non-numerical values:

* room_type: e.g. Private room
* city: e.g. Washington
* state: e.g. DC

while these columns contain numerical but non-ordinal values:

* latitude: e.g. 38.913458
* longitude: e.g. -77.031
* zipcode: e.g. 20009

Geographic values like these aren't ordinal, because a smaller numerical value doesn't directly correspond to a smaller value in a meaningful way. For example, the zip code 20009 isn't smaller or larger than the zip code 75023 and instead both are unique, identifier values. Latitude and longitude value pairs describe a point on a geographic coordinate system and different equations are used in those cases (e.g. haversine).
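
For reference, here is a minimal sketch of the haversine formula mentioned above. The function name `haversine_km` and the mean Earth radius of 6371 km are assumptions made for this illustration, and the function is not used anywhere else in this project.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))

# Two points on the equator, half a world apart: half of the circumference.
half_way_round = haversine_km(0.0, 0.0, 0.0, 180.0)
print(half_way_round)
```

Unlike Euclidean distance on raw latitude and longitude values, this treats the pair of coordinates as a single point on a sphere.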

While we could convert the host_response_rate and host_acceptance_rate columns to be numerical (right now they're object data types and contain the % sign), these columns describe the host and not the living space itself. Since a host could have many living spaces and we don't have enough information to uniquely group living spaces to the hosts themselves, let's avoid using any columns that don't directly describe the living space or the listing itself:

* host_response_rate
* host_acceptance_rate
* host_listings_count

In [21]:
# Remove columns 

cols = ["host_response_rate","host_acceptance_rate","host_listings_count","room_type","city","state","latitude","longitude","zipcode"]
dc_listings.drop(cols, inplace = True, axis = 1)

In [26]:
dc_listings.isnull().sum()

accommodates            0
bedrooms               21
bathrooms              27
beds                   11
price                   0
cleaning_fee         1388
security_deposit     2297
minimum_nights          0
maximum_nights          0
number_of_reviews       0
dtype: int64

Of the remaining columns, 3 columns have a few missing values (less than 1% of the total number of rows):

* bedrooms
* bathrooms
* beds

Since the number of rows containing missing values for one of these 3 columns is low, we can select and remove those rows without losing much information. There are also 2 columns that have a large number of missing values:

* cleaning_fee - 37.3% of the rows
* security_deposit - 61.7% of the rows

and we can't handle these easily. We can't just remove the rows containing missing values for these 2 columns because we'd miss out on the majority of the observations in the dataset. Instead, let's remove these 2 columns entirely from consideration.
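
The percentages quoted above follow directly from the missing-value counts and the 3723 rows in the dataset. The sketch below redoes that arithmetic in plain Python. On the DataFrame itself, `dc_listings.isnull().mean() * 100` should give the same figures, assuming it is run before any rows or columns are dropped.

```python
total_rows = 3723

# Missing-value counts copied from the isnull().sum() output.
missing = {
    "bedrooms": 21,
    "bathrooms": 27,
    "beds": 11,
    "cleaning_fee": 1388,
    "security_deposit": 2297,
}

# Share of rows with a missing value, as a percentage rounded to 1 decimal.
pct_missing = {col: round(100 * n / total_rows, 1) for col, n in missing.items()}

for col, pct in pct_missing.items():
    print(f"{col}: {pct}%")
```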

In [27]:
dc_listings = dc_listings.drop(["cleaning_fee", "security_deposit"], axis = 1)

In [30]:
# remove all rows that contain a missing value

dc_listings.dropna(inplace = True)

In [31]:
dc_listings.isnull().sum()

accommodates         0
bedrooms             0
bathrooms            0
beds                 0
price                0
minimum_nights       0
maximum_nights       0
number_of_reviews    0
dtype: int64

In [32]:
dc_listings.head()

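| | accommodates | bedrooms | bathrooms | beds | price | minimum_nights | maximum_nights | number_of_reviews |
|---|---|---|---|---|---|---|---|---|
| 574 | 2 | 1.0 | 1.0 | 1.0 | 125.0 | 1 | 4 | 149 |
| 1593 | 2 | 1.0 | 1.5 | 1.0 | 85.0 | 1 | 30 | 49 |
| 3091 | 1 | 1.0 | 0.5 | 1.0 | 50.0 | 1 | 1125 | 1 |
| 420 | 2 | 1.0 | 1.0 | 1.0 | 209.0 | 4 | 730 | 2 |
| 808 | 12 | 5.0 | 2.0 | 5.0 | 215.0 | 2 | 1825 | 34 |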


Notice that while the accommodates, bedrooms, bathrooms, beds, and minimum_nights columns hover between 0 and 12 (at least in the first few rows), the values in the maximum_nights and number_of_reviews columns span much larger ranges. For example, the maximum_nights column has values as low as 4 and as high as 1825 in the first few rows alone. If we use these 2 columns as part of a k-nearest neighbors model, they could end up having an outsized effect on the distance calculations simply because their values are so large.

Because Euclidean distance sums the squared differences across all attributes, two listings that differ by hundreds of nights in maximum_nights would be considered very far apart even if they were identical in every other attribute. To prevent any single column from having too much of an impact on the distance, we can **normalize** all of the columns to have a mean of 0 and a standard deviation of 1.
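
To see the effect concretely, the sketch below takes the raw accommodates and maximum_nights values of listings 574 and 808 from the head() output and computes their Euclidean distance with NumPy. The variable names are chosen for this illustration only.

```python
import numpy as np

# Raw (accommodates, maximum_nights) values from the first rows of the data.
listing_574 = np.array([2, 4])
listing_808 = np.array([12, 1825])

# Euclidean distance using both raw columns.
combined_distance = np.sqrt(np.sum((listing_574 - listing_808) ** 2))

# Difference in each column taken on its own.
accommodates_gap = abs(listing_574[0] - listing_808[0])
nights_gap = abs(listing_574[1] - listing_808[1])

print(combined_distance, accommodates_gap, nights_gap)
```

The combined distance is almost exactly the maximum_nights gap of 1821, so the difference of 10 guests barely registers.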

Rescaling the values in each column to have a mean of 0 and a standard deviation of 1 (the same mean and standard deviation as the standard normal distribution) preserves the shape of the distribution of the values in each column while aligning the scales. To normalize the values in a column this way, you need to:

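* from each value, subtract the mean of the column
* divide each result by the standard deviation of the column

$$z = \frac{x - \mu}{\sigma}$$

where $x$ is a value in the column, $\mu$ is the column mean, and $\sigma$ is the column standard deviation.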

In [34]:
# Normalize all of the feature columns in dc_listings

normalized_listings = (dc_listings-dc_listings.mean())/dc_listings.std()
normalized_listings.head()

| | accommodates | bedrooms | bathrooms | beds | price | minimum_nights | maximum_nights | number_of_reviews |
|---|---|---|---|---|---|---|---|---|
| 574 | -0.596544 | -0.249467 | -0.439151 | -0.546858 | -0.173345 | -0.341375 | -0.016604 | 4.57965 |
| 1593 | -0.596544 | -0.249467 | 0.412923 | -0.546858 | -0.464148 | -0.341375 | -0.016603 | 1.159275 |
| 3091 | -1.095499 | -0.249467 | -1.291226 | -0.546858 | -0.718601 | -0.341375 | -0.016573 | -0.482505 |
| 420 | -0.596544 | -0.249467 | -0.439151 | -0.546858 | 0.437342 | 0.487635 | -0.016584 | -0.448301 |
| 808 | 4.393004 | 4.507903 | 1.264998 | 2.829956 | 0.480962 | -0.065038 | -0.016553 | 0.646219 |


The pandas methods used above were written with mass column transformation in mind: when we call mean() or std() on a DataFrame, we get one value per column, and the appropriate column mean and column standard deviation are then applied to each value in the DataFrame.
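
A small sketch of that column-wise behavior, using the raw accommodates and maximum_nights values from the first five rows (the `toy` DataFrame is built by hand for this illustration). Note that pandas' std() computes the sample standard deviation (ddof=1) by default.

```python
import pandas as pd

toy = pd.DataFrame({
    "accommodates": [2, 2, 1, 2, 12],
    "maximum_nights": [4, 30, 1125, 730, 1825],
})

col_means = toy.mean()  # Series with one mean per column
col_stds = toy.std()    # Series with one sample std per column

# Subtracting a Series from a DataFrame aligns on column labels,
# so each column is shifted and scaled by its own statistics.
standardized = (toy - col_means) / col_stds

print(standardized.mean().round(6))
print(standardized.std().round(6))
```

After the transformation every column has a mean of 0 and a standard deviation of 1, up to floating-point error.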

In [39]:
# Add the price column from dc_listings to normalized_listings.

normalized_listings["price"] = dc_listings["price"]
normalized_listings.head(3)

| | accommodates | bedrooms | bathrooms | beds | price | minimum_nights | maximum_nights | number_of_reviews |
|---|---|---|---|---|---|---|---|---|
| 574 | -0.596544 | -0.249467 | -0.439151 | -0.546858 | 125.0 | -0.341375 | -0.016604 | 4.57965 |
| 1593 | -0.596544 | -0.249467 | 0.412923 | -0.546858 | 85.0 | -0.341375 | -0.016603 | 1.159275 |
| 3091 | -1.095499 | -0.249467 | -1.291226 | -0.546858 | 50.0 | -0.341375 | -0.016573 | -0.482505 |


Let's now train a model that uses both accommodates and bathrooms attributes when determining how similar 2 living spaces are.

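Let's look at the Euclidean distance equation again to see what the distance calculation using 2 attributes would look like:

$$d = \sqrt{(q_1 - p_1)^2 + (q_2 - p_2)^2}$$

where $q_1$ and $q_2$ are the accommodates and bathrooms values of one listing, and $p_1$ and $p_2$ are the corresponding values of the other.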

So far, we've been calculating Euclidean distance by writing the logic for the equation ourselves. We can instead use the distance.euclidean() function from scipy.spatial, which takes in 2 vectors as the parameters and calculates the Euclidean distance between them. The euclidean() function expects:

* both of the vectors to be represented using a list-like object (Python list, NumPy array, or pandas Series)
* both of the vectors to be 1-dimensional and have the same number of elements

In [40]:
# Calculate the Euclidean distance using only the accommodates and bathrooms features 
# between the first row and fifth row in normalized_listings

from scipy.spatial import distance

first_listing = normalized_listings[["accommodates","bathrooms"]].iloc[0]
fifth_listing = normalized_listings[["accommodates","bathrooms"]].iloc[4]

first_fifth_distance = distance.euclidean(first_listing, fifth_listing)
first_fifth_distance

5.272543124668404
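
As a sanity check, the same distance can be reproduced by hand with NumPy. The sketch below copies the normalized accommodates and bathrooms values of the first and fifth rows from the table above, so the result only agrees with the SciPy output up to the rounding of those printed values.

```python
import numpy as np

# (accommodates, bathrooms) for rows 574 and 808 of normalized_listings,
# copied from the printed table.
first = np.array([-0.596544, -0.439151])
fifth = np.array([4.393004, 1.264998])

# Square the per-attribute differences, sum them, take the square root.
by_hand = np.sqrt(np.sum((first - fifth) ** 2))

# np.linalg.norm of the difference vector computes the same quantity.
by_norm = np.linalg.norm(first - fifth)

print(by_hand, by_norm)
```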

So far, we've been writing functions from scratch to train the k-nearest neighbor models. While this is helpful deliberate practice to understand how the mechanics work, we can be more productive and iterate more quickly by using a library that handles most of the implementation.

We will use the **scikit-learn library**, which is the most popular machine learning library in Python. Scikit-learn contains functions for all of the major machine learning algorithms and a simple, unified workflow. Both of these properties allow data scientists to be incredibly productive when training and testing different models on a new dataset.

The scikit-learn workflow consists of 4 main steps:

1. instantiate the specific machine learning model we want to use
2. fit the model to the training data
3. use the model to make predictions
4. evaluate the accuracy of the predictions

Each model in scikit-learn is implemented as a separate class and the first step is to identify the class we want to create an instance of. In our case, we want to use the KNeighborsRegressor class.

Any model that helps us predict numerical values, like listing price in our case, is known as a **regression model**. The other main class of machine learning models is called **classification**, where we're trying to predict a label from a fixed set of labels (e.g. blood type or gender). The word **regressor** in the class name KNeighborsRegressor refers to the regression model class.

from sklearn.neighbors import KNeighborsRegressor
knn = KNeighborsRegressor()

If we refer to the documentation, we'll notice that by default:

* n_neighbors: the number of neighbors, is set to **5**
* algorithm: for computing nearest neighbors, is set to **auto**
* p: set to 2, corresponding to Euclidean distance
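
These defaults can also be read off an instance directly instead of from the documentation. A minimal sketch, assuming a reasonably recent scikit-learn installation:

```python
from sklearn.neighbors import KNeighborsRegressor

# get_params() returns the constructor parameters of an estimator as a dict.
default_params = KNeighborsRegressor().get_params()

for name in ("n_neighbors", "algorithm", "p", "metric"):
    print(name, "=", default_params[name])
```

The metric parameter defaults to minkowski, which reduces to Euclidean distance when p is 2.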
    
Let's set the algorithm parameter to **brute** and leave the n_neighbors value as 5, which matches the implementation we wrote in the last project. If we leave the algorithm parameter set to the default value of auto, scikit-learn will try to use tree-based optimizations to improve performance (which are outside of the scope of this project):

knn = KNeighborsRegressor(algorithm='brute')

Now, we can fit the model to the data using the fit method. For supervised models like this one, the fit method takes in 2 required parameters:

* matrix-like object, containing the feature columns we want to use from the training set.
* list-like object, containing correct target values.


Matrix-like object means that the method is flexible in the input and either a DataFrame or a NumPy 2D array of values is accepted.

List-like objects include:

* NumPy array
* Python list
* pandas Series object (e.g. when selecting a column)

When the **fit() method** is called, scikit-learn stores the training data we specified within the KNeighborsRegressor instance (knn). If we try passing in data containing missing values or non-numerical values into the fit method, scikit-learn will raise an error. Scikit-learn contains many such features that help prevent us from making common mistakes.
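
The sketch below illustrates the missing-value check on a tiny hand-made array (the data is hypothetical). It assumes the default distance metric and a recent scikit-learn version, where the validation failure surfaces as a ValueError.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical feature matrix with one missing value, plus target prices.
X_with_nan = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, 5.0]])
y = np.array([100.0, 150.0, 200.0])

knn = KNeighborsRegressor(n_neighbors=1, algorithm="brute")

try:
    knn.fit(X_with_nan, y)
    fit_failed = False
except ValueError as err:
    # scikit-learn validates the input before storing the training data.
    fit_failed = True
    print("fit() rejected the data:", err)
```

This is why the rows with missing values were dropped before training.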

We can use the predict method to make predictions on the test set. The predict method has only one required parameter:

* matrix-like object, containing the feature columns from the dataset we want to make predictions on

The number of feature columns we use during both training and testing needs to match or scikit-learn will raise an error:

In [43]:
# Split the data into train and test sets, then create an instance of the
# KNeighborsRegressor class with the following parameters:

# n_neighbors: 5
# algorithm: brute

from sklearn.neighbors import KNeighborsRegressor

train_df = normalized_listings.iloc[:2792]
test_df = normalized_listings.iloc[2792:]

train_columns = ['accommodates', 'bathrooms']
target_column = ["price"]

In [48]:
knn = KNeighborsRegressor(n_neighbors=5, algorithm="brute")
knn.fit(train_df[train_columns], train_df[target_column])
predictions = knn.predict(test_df[train_columns])
predictions[:5]

array([[ 80.8],
       [251.2],
       [ 89.4],
       [ 80.8],
       [ 80.8]])

Earlier, we calculated the MSE and RMSE values using the pandas arithmetic operators to compare each predicted value with the actual value from the price column of our test set. Alternatively, we can use the **sklearn.metrics.mean_squared_error()** function.

The mean_squared_error() function takes in 2 inputs:

1. list-like object, representing the true values
2. list-like object, representing the predicted values using the model

In [51]:
from sklearn.metrics import mean_squared_error

mse = mean_squared_error(test_df[target_column], predictions)
rmse = np.sqrt(mse)
rmse

124.90201702396679
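
For comparison, here is what the same two metrics look like when computed by hand with NumPy. The three true and predicted prices are hypothetical values chosen to keep the arithmetic easy to follow.

```python
import numpy as np

# Hypothetical true and predicted prices.
y_true = np.array([100.0, 150.0, 200.0])
y_pred = np.array([110.0, 140.0, 230.0])

# MSE: mean of the squared differences (100 + 100 + 900) / 3.
mse_by_hand = np.mean((y_true - y_pred) ** 2)

# RMSE: square root of the MSE, in the same units as the prices.
rmse_by_hand = np.sqrt(mse_by_hand)

print(mse_by_hand, rmse_by_hand)
```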

The model we trained using both features ended up performing better (lower error score) than either of the univariate models from the last project.

Let's now train a model using the following 4 features:

* accommodates
* bedrooms
* bathrooms
* number_of_reviews

In [52]:
train_columns = ['accommodates', 'bathrooms',"bedrooms", "number_of_reviews"]
target_column = ["price"]

In [59]:
from sklearn.neighbors import KNeighborsRegressor

knn = KNeighborsRegressor(n_neighbors=5, algorithm="brute")
knn.fit(train_df[train_columns], train_df[target_column])
predictions = knn.predict(test_df[train_columns])
predictions[:5]

array([[102. ],
       [313. ],
       [ 82.2],
       [ 78. ],
       [ 78. ]])

In [61]:
from sklearn.metrics import mean_squared_error

mse = mean_squared_error(test_df["price"], predictions)
rmse = np.sqrt(mse)
rmse

115.30620242876704

As we increased the number of features the model used from 2 to 4, we observed lower MSE and RMSE values (the RMSE dropped from about 124.90 to about 115.31).

In [65]:
# Use all of the columns

features = ['accommodates', 'bedrooms', 'bathrooms', 'beds',
       'minimum_nights', 'maximum_nights', 'number_of_reviews']
knn = KNeighborsRegressor(algorithm = "brute", n_neighbors = 5, metric  = "euclidean")
knn.fit(train_df[features], train_df["price"])
all_features_predictions = knn.predict(test_df[features])
all_features_mse = mean_squared_error(test_df["price"], all_features_predictions)
all_features_rmse = all_features_mse **(1/2)

print(all_features_mse,all_features_rmse)

15455.275631399316 124.31924883701363


Interestingly enough, the RMSE value actually increased to **124.32** when we used all of the features available to us. This means that selecting the right features is important and that using more features doesn't automatically improve prediction accuracy. We should re-phrase the lever we mentioned earlier from:

* increase the number of attributes the model uses to calculate similarity when ranking the closest neighbors

to:

* select the relevant attributes the model uses to calculate similarity when ranking the closest neighbors

The process of selecting features to use in a model is known as **feature selection**.
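
One simple way to approach feature selection is to train the same model on several candidate feature subsets and compare the validation RMSE of each. The sketch below does this on synthetic data in which, by construction, the first two features drive the target and the third is pure noise. The data, the subset labels, and the 300/100 split are all assumptions made for this illustration and say nothing about the Airbnb columns.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(1)
n_rows, split = 400, 300

# Synthetic features that are already on a common scale.
X = rng.normal(size=(n_rows, 3))
# The target depends on columns 0 and 1 only; column 2 is noise.
y = 50 * X[:, 0] + 20 * X[:, 1] + rng.normal(scale=5, size=n_rows)

# Candidate feature subsets, given as column positions.
subsets = {
    "first feature": [0],
    "first two features": [0, 1],
    "all three features": [0, 1, 2],
}

rmse_by_subset = {}
for label, cols in subsets.items():
    knn = KNeighborsRegressor(n_neighbors=5, algorithm="brute")
    knn.fit(X[:split][:, cols], y[:split])
    preds = knn.predict(X[split:][:, cols])
    mse = mean_squared_error(y[split:], preds)
    rmse_by_subset[label] = float(np.sqrt(mse))

for label, rmse in rmse_by_subset.items():
    print(f"{label}: RMSE = {rmse:.2f}")
```

The same loop could be pointed at train_df and test_df with lists of column names in place of the column positions.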