# Midterm Project Part 2
# Predicting Bike Sharing Demand

We will test the k-nearest-neighbor (kNN) method for predicting bike sharing demand. The core idea of kNN is to predict the demand on a given day from the k closest records in the dataset. The main challenge is finding a good way to measure "closeness".

From the analysis in Part 1, we know that the number of bike rentals correlates with year, season, weekday, working day, weather, temperature, humidity, and windspeed. However, no single feature is powerful enough to provide a good prediction on its own. We need a similarity measure that combines the closeness across all relevant features.

## I. Create a Test Set

We will randomly extract 3 records from the dataset as the test set to evaluate the performance of the kNN method.

1. Load the `day.csv` as a data frame

In [1]:
import pandas as pd

data = pd.read_csv("day.csv")
data.head()

Unnamed: 0,instant,dteday,season,yr,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
0,1,2011-01-01,1,0,1,0,6,0,2,0.344167,0.363625,0.805833,0.160446,331,654,985
1,2,2011-01-02,1,0,1,0,0,0,2,0.363478,0.353739,0.696087,0.248539,131,670,801
2,3,2011-01-03,1,0,1,0,1,1,1,0.196364,0.189405,0.437273,0.248309,120,1229,1349
3,4,2011-01-04,1,0,1,0,2,1,1,0.2,0.212122,0.590435,0.160296,108,1454,1562
4,5,2011-01-05,1,0,1,0,3,1,1,0.226957,0.22927,0.436957,0.1869,82,1518,1600


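2. Use `numpy.random.choice()` to select 3 row indices. Pass `replace=False` so that the same row cannot be selected twice (the function samples with replacement by default). Print the 3 selected indices. Read the [documentation page](https://numpy.org/doc/stable/reference/random/generated/numpy.random.choice.html) for the details of this function.

In [4]:
import numpy as np
# replace=False guarantees three distinct rows
selected_indices = np.random.choice(data.index, 3, replace=False)
print(selected_indices)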

[242 415 264]
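
The draw above changes on every run. As a minimal sketch of a reproducible alternative, NumPy's `Generator` API can be seeded. The seed value `0` and the name `row_labels` are assumptions made for illustration, and 731 is the row count implied by the shape reported later (728 rows after dropping 3).

```python
import numpy as np

# Seeded generator: the same three labels are drawn on every run.
# The seed 0 is an arbitrary choice for this sketch.
rng = np.random.default_rng(0)

# Stand-in for data.index (731 rows, labels 0..730).
row_labels = np.arange(731)

# replace=False guarantees three distinct labels.
selected = rng.choice(row_labels, size=3, replace=False)
print(selected)
```

Fixing the seed makes the test set, and therefore every number downstream, repeatable between runs.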


3. Create a test data frame using the selected rows. Display the test data frame.

In [5]:
test = data.loc[selected_indices]
test

Unnamed: 0,instant,dteday,season,yr,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
242,243,2011-08-31,3,0,8,0,3,1,1,0.656667,0.611121,0.597917,0.083333,688,4370,5058
415,416,2012-02-20,1,1,2,1,1,0,1,0.28,0.273391,0.507826,0.229083,502,2627,3129
264,265,2011-09-22,3,0,9,0,4,1,2,0.628333,0.554963,0.902083,0.128125,555,4240,4795


4. Update the `data` data frame by dropping the 3 test rows. Display the shape of the updated data frame.

In [11]:
data = data.drop(selected_indices)
data.shape

(728, 16)
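
Steps 2 to 4 can also be done with pandas alone. The sketch below uses a small made-up frame (`toy`, not the real `day.csv`) to show `DataFrame.sample` followed by `drop`; the `random_state` value is an arbitrary assumption.

```python
import pandas as pd

# Toy frame standing in for day.csv (values are made up).
toy = pd.DataFrame({"temp": [0.2, 0.4, 0.6, 0.8, 0.5],
                    "cnt": [900, 1500, 4200, 5100, 3000]})

# sample() draws without replacement by default; random_state fixes the draw.
toy_test = toy.sample(n=2, random_state=0)

# Drop the sampled labels so that no test row remains in the training data.
toy_train = toy.drop(toy_test.index)

print(toy_test.shape, toy_train.shape)
```

Checking that the two shapes add up to the original row count is a quick way to confirm that the split is clean.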

## II. Create a Similarity Measure

A reasonable similarity measure between two data records should do the following:
- Combine the differences on each relevant feature: season, yr, mnth, holiday, weekday, workingday, weathersit, temp, hum, windspeed.
- Use the absolute value of each difference, so that positive and negative differences do not cancel each other.
- Weight each difference by the reciprocal of that feature's standard deviation, so that features measured on different scales contribute comparably.
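
Written as a formula, the three requirements above amount to a weighted sum of absolute differences. The symbols are notation introduced here for illustration: $x$ and $y$ are two records, $p$ is the number of relevant features (10 in this project), and $\sigma_j$ is the standard deviation of feature $j$.

```latex
d(x, y) = \sum_{j=1}^{p} \frac{\lvert x_j - y_j \rvert}{\sigma_j}
```

Note that $d$ behaves like a distance: $d(x, y) = d(y, x)$, and a smaller value means the two records are more alike.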

1. Calculate the standard deviation of each relevant feature. Store the reciprocals of the standard deviations in a `weights` list and print the list.

In [9]:
relevant_features = ['season', 'yr', 'mnth', 'holiday', 'weekday', 'workingday',
                     'weathersit', 'temp', 'hum', 'windspeed']
# Weight each feature by the reciprocal of its standard deviation.
weights = [1 / data[feature].std() for feature in relevant_features]

print(weights)

[0.9002463222520046, 1.9986334128913277, 0.28969445685064626, 5.982480570464832, 0.49880612799753477, 2.1494588063448, 1.8352181753331143, 5.462958526546023, 7.021037373217964, 12.903580333292048]
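
The printed weights span a wide range (about 0.29 for `mnth` and about 12.9 for `windspeed`) because the features live on very different scales. The sketch below, on a made-up two-column frame, illustrates what the reciprocal weighting achieves: a gap of one standard deviation contributes exactly 1 to the sum, whatever the feature's units.

```python
import pandas as pd

# Toy data (made up): mnth is on a 1-12 scale, temp on a 0-1 scale.
toy = pd.DataFrame({"mnth": [1, 4, 7, 10],
                    "temp": [0.2, 0.4, 0.6, 0.8]})

stds = toy.std()      # sample standard deviation (ddof=1) of each column
weights = 1 / stds    # reciprocal weights, indexed by column name

# A one-standard-deviation gap is worth exactly 1 for every feature.
print(weights * stds)
```

Without the weights, a difference of a few months would swamp any difference in temperature.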


2. Create a `similarity(test_row, data_row, weights)` function that does the following:
- Calculate the absolute value of the difference between `test_row` and `data_row` on each relevant feature.
- Calculate the sum of all differences, each multiplied by its corresponding weight.
- Return the sum. A smaller value means the two rows are more alike.

In [10]:
def similarity(test_row, data_row, weights):
    """Weighted sum of absolute feature differences (smaller = more alike)."""
    diffs = [np.abs(test_row[feature] - data_row[feature])
             for feature in relevant_features]
    return np.sum([weight * diff for weight, diff in zip(weights, diffs)])
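
The same quantity can be computed for all rows at once with pandas arithmetic instead of a per-row function call. The sketch below compares the two approaches on made-up data. The names `toy`, `query`, `features`, and `similarity_loop` are hypothetical and used only for this illustration.

```python
import numpy as np
import pandas as pd

features = ["mnth", "temp"]

# Toy data and query row (values are made up).
toy = pd.DataFrame({"mnth": [1, 4, 7, 10],
                    "temp": [0.2, 0.4, 0.6, 0.8]})
query = pd.Series({"mnth": 6, "temp": 0.5})
weights = (1 / toy[features].std()).to_numpy()

def similarity_loop(test_row, data_row, weights):
    """Row-by-row version: weighted sum of absolute differences."""
    diffs = [abs(test_row[f] - data_row[f]) for f in features]
    return sum(w * d for w, d in zip(weights, diffs))

looped = toy.apply(lambda row: similarity_loop(query, row, weights), axis=1)

# Vectorized version: subtract the query from every row, take absolute
# values, scale each column by its weight, then sum across the columns.
vectorized = (toy[features] - query[features]).abs().mul(weights, axis=1).sum(axis=1)

print(np.allclose(looped, vectorized))
```

On a dataset of a few hundred rows either version is fast enough; the vectorized form mainly pays off on larger data.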

3. Test the `similarity` function by calculating the similarity between the first row in the test data frame and the first row in `data`.

In [14]:
index1 = test.index[0]
# iloc[0] selects the first remaining row even if label 0 was moved to the test set
similarity(test.loc[index1, :], data.iloc[0], weights)

13.47143967333172

## III. Test the kNN Method

Apply the kNN method with k=5 on each of the test rows.

1. In `data`, add a new column named `'sim1'` that holds the similarity between each data row and the first test row. Display the first 5 rows of `data`.

In [15]:
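# apply() passes each data row as the first argument; the measure is
# symmetric, so the argument order does not affect the result.
data['sim1'] = data.apply(similarity, axis=1, args=(test.loc[index1, :], 
                                                   weights))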
data.head()

Unnamed: 0,instant,dteday,season,yr,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt,sim1
0,1,2011-01-01,1,0,1,0,6,0,2,0.344167,0.363625,0.805833,0.160446,331,654,985,13.47144
1,2,2011-01-02,1,0,1,0,0,0,2,0.363478,0.353739,0.696087,0.248539,131,670,801,13.732129
2,3,2011-01-03,1,0,1,0,1,1,1,0.196364,0.189405,0.437273,0.248309,120,1229,1349,10.597247
3,4,2011-01-04,1,0,1,0,2,1,1,0.2,0.212122,0.590435,0.160296,108,1454,1562,7.867539
4,5,2011-01-05,1,0,1,0,3,1,1,0.226957,0.22927,0.436957,0.1869,82,1518,1600,8.642329


2. Sort the rows according to their `sim1` value, and extract 5 rows with the smallest `sim1` values. Display these rows.

In [16]:
nn1 = data.sort_values('sim1').head(5)  # k = 5 nearest neighbors
nn1

Unnamed: 0,instant,dteday,season,yr,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt,sim1
228,229,2011-08-17,3,0,8,0,3,1,1,0.723333,0.666671,0.575417,0.143667,668,4026,4694,1.300688
241,242,2011-08-30,3,0,8,0,2,1,1,0.639167,0.594704,0.548333,0.125008,775,4429,5204,1.480292
193,194,2011-07-13,3,0,7,0,3,1,1,0.746667,0.689404,0.631667,0.146133,748,3594,4342,1.828662
243,244,2011-09-01,3,0,9,0,4,1,1,0.655,0.614921,0.639167,0.141796,783,4332,5115,1.841603
185,186,2011-07-05,3,0,7,0,2,1,1,0.746667,0.696338,0.590417,0.126258,1031,3634,4665,1.886707


3. Display the average number of renters from these 5 rows. The number of bike renters is the value in the 'cnt' column. This is kNN's prediction on the first test row.

In [17]:
nn1['cnt'].mean()

4804.0

4. Display the number of renters from the first test row.

In [18]:
test.loc[index1, 'cnt']

5058
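
Steps 1 to 4 can be bundled into one helper so that each test row needs a single call. This is a minimal sketch rather than the notebook's own code: `knn_predict` and its parameters are hypothetical names, the data is made up, and `nsmallest` is swapped in for the sort-then-`head` approach used above.

```python
import pandas as pd

def knn_predict(train, test_row, features, weights, k=5, target="cnt"):
    """Mean target value of the k training rows closest to test_row."""
    query = test_row[features].astype(float)
    # Weighted sum of absolute differences for every training row.
    dist = (train[features] - query).abs().mul(weights, axis=1).sum(axis=1)
    nearest = dist.nsmallest(k).index
    return train.loc[nearest, target].mean()

# Toy training data and query (values are made up).
features = ["temp", "hum"]
train = pd.DataFrame({"temp": [0.1, 0.2, 0.3, 0.8, 0.9],
                      "hum": [0.5, 0.6, 0.5, 0.6, 0.5],
                      "cnt": [1000, 1200, 1400, 5000, 5200]})
weights = 1 / train[features].std()
query_row = pd.Series({"temp": 0.15, "hum": 0.5})

prediction = knn_predict(train, query_row, features, weights, k=2)
print(prediction)
```

Because the helper writes no `sim` columns into the training frame, the data frame stays unchanged between test rows.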

5. Repeat Steps 1-4 for the other two test rows. Display kNN's predictions and the actual number of renters.

In [20]:
index2 = test.index[1]
data['sim2'] = data.apply(similarity, axis=1, args=(test.loc[index2, :], 
                                                   weights))
nn2 = data.sort_values('sim2').head()
print("Prediction:", nn2['cnt'].mean(), "Actual:", test.loc[index2, 'cnt'])

Prediction: 2587.2 Actual: 3129


In [22]:
index3 = test.index[2]
data['sim3'] = data.apply(similarity, axis=1, args=(test.loc[index3, :], 
                                                   weights))
nn3 = data.sort_values('sim3').head()
print("Prediction:", nn3['cnt'].mean(), "Actual:", test.loc[index3, 'cnt'])

Prediction: 3835.4 Actual: 4795
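
To put the three results side by side, the predictions and actual counts printed above can be summarized with two common error measures. Computing a mean absolute error (MAE) and a mean absolute percentage error (MAPE) is an addition made here for illustration, not a step of the original assignment.

```python
import numpy as np

# Predictions and actual counts copied from the outputs above.
predicted = np.array([4804.0, 2587.2, 3835.4])
actual = np.array([5058, 3129, 4795])

abs_err = np.abs(predicted - actual)   # error in number of renters
pct_err = 100 * abs_err / actual       # error as a percentage of actual

print("MAE:", round(float(abs_err.mean()), 1))
print("MAPE (%):", round(float(pct_err.mean()), 1))
print("All underestimates:", bool((predicted < actual).all()))
```

With only three test rows these figures are anecdotal; a larger test set would be needed to judge the method reliably.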
