<a href="https://colab.research.google.com/github/avenenma/BikeShare/blob/main/Milestone1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Deep Learning Final Project, Milestone 1

__Title: Predicting the Number of Bike Share Users__

_Authors: Hannah Babe and Alessandra Vennema
<br>Due: March 6th, 2022_

__Abstract:__ Biking is an integral part of urban transportation systems: it reduces congestion and CO2 emissions, increases accessibility, and provides both physical and mental health benefits. Bike share programs in particular are hailed for their convenience and practicality. According to the Institute for Transportation and Development Policy, over 600 cities have implemented bike-share systems. _In this project_ we will build and train a neural network to predict the number of bike share users on a given day. This will allow us to analyze bike share usage and begin to understand what impacts user behavior. We are particularly interested in how weather, day of the week, and time of day affect the number of riders. The data set used is a subset of rides from the NYC CitiBike Open Trip Data (https://ride.citibikenyc.com/system-data).
The set consists of 34,000 rides occurring between Jan 2015 and June 2017.

Additional Resources Used: https://medium.com/@danny_68946/predicting-bike-lane-usage-using-linear-regression-and-pytorch-ec30e6e7d2e9




In [12]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

### Loading the Data

In [13]:
data_path = '/NYC-BikeShare-2015-2017-combined.csv'
rides = pd.read_csv(data_path)


In [14]:
rides.head()

Unnamed: 0.1,Unnamed: 0,Trip Duration,Start Time,Stop Time,Start Station ID,Start Station Name,Start Station Latitude,Start Station Longitude,End Station ID,End Station Name,End Station Latitude,End Station Longitude,Bike ID,User Type,Birth Year,Gender,Trip_Duration_in_min
0,0,376,2015-10-01 00:16:26,2015-10-01 00:22:42,3212,Christ Hospital,40.734786,-74.050444,3207,Oakland Ave,40.737604,-74.052478,24470.0,Subscriber,1960.0,1.0,6.0
1,1,739,2015-10-01 00:27:12,2015-10-01 00:39:32,3207,Oakland Ave,40.737604,-74.052478,3212,Christ Hospital,40.734786,-74.050444,24481.0,Subscriber,1960.0,1.0,12.0
2,2,2714,2015-10-01 00:32:46,2015-10-01 01:18:01,3193,Lincoln Park,40.724605,-74.078406,3193,Lincoln Park,40.724605,-74.078406,24628.0,Subscriber,1983.0,1.0,45.0
3,3,275,2015-10-01 00:34:31,2015-10-01 00:39:06,3199,Newport Pkwy,40.728745,-74.032108,3187,Warren St,40.721124,-74.038051,24613.0,Subscriber,1975.0,1.0,5.0
4,4,561,2015-10-01 00:40:12,2015-10-01 00:49:33,3183,Exchange Place,40.716247,-74.033459,3192,Liberty Light Rail,40.711242,-74.055701,24668.0,Customer,1984.0,0.0,9.0
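
Each row above is a single ride, while the stated goal is to predict the number of users per day. One possible way to get daily counts is to parse `Start Time` and resample by calendar day. This is a minimal sketch on a toy frame, not the notebook's pipeline; `toy_rides` and the `ride_count` name are assumptions made for illustration.

```python
import pandas as pd

# Toy stand-in for `rides`; only the 'Start Time' column from the notebook is used.
toy_rides = pd.DataFrame({
    "Start Time": [
        "2015-10-01 00:16:26",
        "2015-10-01 00:27:12",
        "2015-10-01 23:59:00",
        "2015-10-02 08:05:10",
        "2015-10-04 17:30:00",
    ]
})

# Parse the timestamp strings into datetimes.
toy_rides["Start Time"] = pd.to_datetime(toy_rides["Start Time"])

# Count rides per calendar day; resampling also emits days with zero rides.
daily_counts = (
    toy_rides.set_index("Start Time")
    .resample("D")
    .size()
    .rename("ride_count")  # hypothetical name for the target column
)
print(daily_counts)
```

Keeping the zero-ride days matters if the counts are later used as a regression target, since a missing day and a day with no rides are different things.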


### Categorical Encoding
We need to convert categorical variables to binary indicator (one-hot) columns. Right now we're only encoding user type, but later, when we add more data, we will also need to encode season, weather, month, etc.

The following code was adapted from this GitHub repository: https://github.com/vmartinezalvarez/Predicting-Bike-Sharing-Patterns/blob/master/Neural_networks_from_scratch_Predicting_Bike_Sharing_Patterns.ipynb

In [16]:
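# Categorical columns to one-hot encode
dummy_fields = ['User Type']
for each in dummy_fields:
    # One indicator column per category, e.g. 'User Type_Customer'
    dummies = pd.get_dummies(rides[each], prefix=each, drop_first=False)
    rides = pd.concat([rides, dummies], axis=1)

# Drop the original categorical columns now that they are encoded
fields_to_drop = ['User Type']
data = rides.drop(fields_to_drop, axis=1)
data.head()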

Unnamed: 0.1,Unnamed: 0,Trip Duration,Start Time,Stop Time,Start Station ID,Start Station Name,Start Station Latitude,Start Station Longitude,End Station ID,End Station Name,End Station Latitude,End Station Longitude,Bike ID,Birth Year,Gender,Trip_Duration_in_min,User Type_Customer,User Type_Subscriber
0,0,376,2015-10-01 00:16:26,2015-10-01 00:22:42,3212,Christ Hospital,40.734786,-74.050444,3207,Oakland Ave,40.737604,-74.052478,24470.0,1960.0,1.0,6.0,0,1
1,1,739,2015-10-01 00:27:12,2015-10-01 00:39:32,3207,Oakland Ave,40.737604,-74.052478,3212,Christ Hospital,40.734786,-74.050444,24481.0,1960.0,1.0,12.0,0,1
2,2,2714,2015-10-01 00:32:46,2015-10-01 01:18:01,3193,Lincoln Park,40.724605,-74.078406,3193,Lincoln Park,40.724605,-74.078406,24628.0,1983.0,1.0,45.0,0,1
3,3,275,2015-10-01 00:34:31,2015-10-01 00:39:06,3199,Newport Pkwy,40.728745,-74.032108,3187,Warren St,40.721124,-74.038051,24613.0,1975.0,1.0,5.0,0,1
4,4,561,2015-10-01 00:40:12,2015-10-01 00:49:33,3183,Exchange Place,40.716247,-74.033459,3192,Liberty Light Rail,40.711242,-74.055701,24668.0,1984.0,0.0,9.0,1,0
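
When more categorical columns are added later, the same encoding can be applied to several fields at once. The sketch below assumes a hypothetical `season` column (it does not exist in the current data) and a hypothetical helper `one_hot_encode`; it uses the `columns` argument of `pd.get_dummies`, which replaces the listed columns with their indicator columns in one call.

```python
import pandas as pd

def one_hot_encode(df, fields):
    """Return a copy of df with each column in `fields` replaced by 0/1 indicator columns."""
    return pd.get_dummies(df, columns=fields, dtype=int)

# Toy frame: 'season' is a hypothetical column used only for illustration.
toy = pd.DataFrame({
    "User Type": ["Subscriber", "Customer", "Subscriber"],
    "season": ["fall", "fall", "winter"],
    "Trip Duration": [376, 561, 739],
})

encoded = one_hot_encode(toy, ["User Type", "season"])
print(encoded.columns.tolist())
```

Passing `dtype=int` keeps the indicators as 0/1 integers, matching the output shown above, rather than booleans.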


### Splitting the data into training, testing, and validation sets

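For our purposes, we hold out the last rows of the data as our test set; we'll use this set to make predictions and compare them with the actual number of riders. The adapted code takes the last 21*24 rows, which spans 3 weeks when each row is one hour; because each row here is a single ride, the window covers 504 rides rather than 3 weeks.

Code adapted from the GitHub repository cited above.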

In [23]:
# Hold out the last 21*24 = 504 rows as the test set
test_data = data[-21*24:]

# Now remove the test data from the data set
data = data[:-21*24]

# Separate the data into features and targets
target_fields = ['Unnamed: 0', 'User Type_Customer', 'User Type_Subscriber']
features, targets = data.drop(target_fields, axis=1), data[target_fields]
test_features, test_targets = test_data.drop(target_fields, axis=1), test_data[target_fields]

Now we can identify the training and validation sets. We'll train on historical data, then try to predict on future data (the validation set).

In [19]:
# Hold out the last 60*24 = 1440 of the remaining rows as a validation set
train_features, train_targets = features[:-60*24], targets[:-60*24]
val_features, val_targets = features[-60*24:], targets[-60*24:]
train_targets.shape

(692908, 3)
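
The slices above cut by row count. If the intent is to hold out calendar windows (the last 3 weeks for testing and the 60 days before that for validation), one option is to split on the timestamp instead. This is a minimal sketch on toy data, assuming `Start Time` has already been parsed as datetimes; `split_by_time` is a hypothetical helper, not part of the adapted code.

```python
import pandas as pd

def split_by_time(df, time_col, test_days, val_days):
    """Split df chronologically into (train, val, test) using calendar-day windows."""
    df = df.sort_values(time_col)
    last = df[time_col].max()
    test_start = last - pd.Timedelta(days=test_days)
    val_start = test_start - pd.Timedelta(days=val_days)

    train = df[df[time_col] <= val_start]
    val = df[(df[time_col] > val_start) & (df[time_col] <= test_start)]
    test = df[df[time_col] > test_start]
    return train, val, test

# Toy data: one ride per day for 100 days.
toy = pd.DataFrame({
    "Start Time": pd.date_range("2017-01-01", periods=100, freq="D"),
    "Trip Duration": range(100),
})

train, val, test = split_by_time(toy, "Start Time", test_days=21, val_days=60)
print(len(train), len(val), len(test))
```

Because the boundaries are timestamps, the three sets stay in chronological order regardless of how many rides fall on each day.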