<a href="https://colab.research.google.com/github/giacomogreggio/HSL-citybikes-predictor/blob/master/HSL_citybikes_predictor.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Citybike predictor

### Elevator pitch
Planning the day matters to everyone, yet planning an itinerary still runs into everyday problems. If you want to take a city bike from one place to another, you may find yourself at an empty bike station. Could bike availability be predicted? Our proposed solution is an application that predicts exactly that, based on the time and the weather.


### Data: sources, wrangling, management
- The original purpose of the data does not match our needs: it describes bike trips and routing, not bike availability.

### Data analysis: statistics, machine learning
- We need a predictive model.
- The task is time-series prediction with many different variables.
- We combine different data sources to base the prediction on the current situation: weather, time of day, and current bike availability.
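
As a rough illustration of the "time of day" input listed above, the sketch below derives calendar features from an hourly timestamp column. The helper `add_time_features` and the column names `hour`, `weekday` and `is_weekend` are assumptions made for this example, not part of the project code; weather features would have to be joined in from a separate source.

```python
import pandas as pd

def add_time_features(frame, date_column="Date"):
    """Return a copy of `frame` with simple calendar features (hypothetical helper)."""
    out = frame.copy()
    dates = pd.to_datetime(out[date_column])
    out["hour"] = dates.dt.hour
    out["weekday"] = dates.dt.weekday  # Monday = 0, Sunday = 6
    out["is_weekend"] = out["weekday"] >= 5
    return out

# Two invented timestamps: a Sunday morning and a Monday afternoon
example = pd.DataFrame({"Date": ["2019-09-01 08:00:00", "2019-09-02 17:00:00"]})
features = add_time_features(example)
print(features)
```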


### Communication of results: summarization & visualization
- Finding a clear and intuitive way to summarize and visualize the data so that it is accessible to the user

### Operationalization: creating added value, end-user point of view
- Mobile-optimized web application

## Preprocessing the HSL-data

### Initializing everything

In [None]:
!pip install mpld3

Collecting mpld3
  Downloading https://files.pythonhosted.org/packages/66/31/89bd2afd21b920e3612996623e7b3aac14d741537aa77600ea5102a34be0/mpld3-0.5.1.tar.gz (1.0MB)
Building wheels for collected packages: mpld3
  Building wheel for mpld3 (setup.py) ... done
  Created wheel for mpld3: filename=mpld3-0.5.1-cp36-none-any.whl size=364064 sha256=5d46f97779b2355b495eed80a61d46971a829a10c06d1b11c484759f6c027404
  Stored in directory: /root/.cache/pip/wheels/38/68/06/d119af6c3f9a2d1e123c1f72d276576b457131b3a7bf94e402
Successfully built mpld3
Installing collected packages: mpld3
Successfully installed mpld3-0.5.1


In [None]:
# All imports

import pandas as pd
import matplotlib.pyplot as plt
from google.colab import drive
from datetime import datetime
from pandas.tseries.offsets import MonthEnd
import mpld3
from mpld3 import plugins
mpld3.enable_notebook()

In [None]:
drive.mount('/content/drive')


Mounted at /content/drive


### The function that processes the data for a month

In [None]:
def get_station_data():
  stations = pd.read_csv("/content/drive/My Drive/HSLDataset/Helsingin_ja_Espoon_kaupunkipyöräasemat.csv")
  stations = stations.drop(["FID", "Nimi", "Namn", "Adress", "Kaupunki", "Stad", "Operaattor"], axis = 1)
  return stations

In [None]:
def preprocess_month(month):
  path = "/content/drive/My Drive/HSLDataset/od-trips-2019/"
  extension = ".csv"
  filename = "2019-" + '{:02.0f}'.format(month)
  full_path = path + filename + extension
  
  data = pd.read_csv(full_path, sep = ",")

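  # Parse the timestamps and floor them to the start of the hour.
  # errors = "coerce" turns unparsable values into NaT; with errors = "ignore" the column
  # could stay as plain strings, and the .dt accessor would then raise.
  data["Dep date"] = pd.to_datetime(data["Departure"], errors = "coerce").dt.floor(freq = "H")
  data["Return date"] = pd.to_datetime(data["Return"], errors = "coerce").dt.floor(freq = "H")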

  # For our analysis we shouldn't need this information
  data = data.drop(columns=["Covered distance (m)", "Duration (sec.)", "Departure", "Return"])

  # Get the outgoing bikes per station at timeframe
  outgoing = data.groupby("Departure station id")["Dep date"].value_counts()
  outgoing = outgoing.sort_index()
  outgoing = outgoing.rename_axis(index = {"Dep date" : "Date", "Departure station id" : "ID"})
  outgoing = outgoing.rename("Outgoing")

  # Get the arriving bikes per station at timeframe
  arriving = data.groupby("Return station id")["Return date"].value_counts()
  arriving = arriving.sort_index()
  arriving = arriving.rename_axis(index = {"Return date" : "Date", "Return station id" : "ID"})
  arriving = arriving.rename("Arriving")

  outgoing_arriving_merge = pd.merge(outgoing, arriving, on = ["ID", "Date"], how = "outer")
  outgoing_arriving_merge = outgoing_arriving_merge.fillna(0)

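
  # All station IDs that appear as a departure or return station this month (sorted for a stable order)
  stations = sorted(set(outgoing_arriving_merge.index.get_level_values(0)))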

  # Build every hour of the month so that hours without any trips get explicit zero rows
  first_day_of_month = "2019-" + '{:02.0f}'.format(month) + "-01 00:00:00"
  last_day_of_month = pd.Timestamp("2019-" + '{:02.0f}'.format(month) + "-01 23:00:00") + MonthEnd(0)
  all_dates = pd.date_range(first_day_of_month, last_day_of_month, freq = "H")
  idx = pd.MultiIndex.from_product([stations, all_dates], names = ["ID", "Date"])
  mega_frame_with_station_date_cartesian_product = pd.DataFrame(index = idx)
  processed = pd.merge(mega_frame_with_station_date_cartesian_product, outgoing_arriving_merge, on = ["ID", "Date"], how = "left")
  processed = processed.fillna(0)
  processed = processed.reset_index()

  # Merge with the station data from HSL
  station_data = get_station_data()
  processed_with_station_data = pd.merge(processed, station_data, on = "ID", how = "inner")

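  # Written with the default row index, which shows up as "Unnamed: 0" when the file is read back
  processed_with_station_data.to_csv("/content/drive/My Drive/HSLDataset/processed/" + filename + "-processed.csv")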
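
To make the aggregation easier to follow, the sketch below applies the same idea to a tiny invented trip table: count departures and returns per station and hour, then expand onto the full station × hour grid so that idle hours appear as zeros. The trip rows and the helper `hourly_counts` are made up for illustration, and the sketch uses `groupby(...).size()`, `concat` and `reindex` instead of the `value_counts` and `merge` calls in `preprocess_month`, so it shows the idea rather than the exact code.

```python
import pandas as pd

# Invented trips between two stations
trips = pd.DataFrame({
    "Departure station id": [1, 1, 2],
    "Departure": ["2019-04-01 08:05:00", "2019-04-01 08:40:00", "2019-04-01 09:10:00"],
    "Return station id": [2, 2, 1],
    "Return": ["2019-04-01 08:20:00", "2019-04-01 09:05:00", "2019-04-01 09:30:00"],
})

def hourly_counts(frame, station_column, time_column, name):
    """Count rows per (station, hour); returns a Series indexed by ID and Date."""
    hours = pd.to_datetime(frame[time_column]).dt.floor("h")
    keys = [frame[station_column].rename("ID"), hours.rename("Date")]
    return frame.groupby(keys).size().rename(name)

outgoing = hourly_counts(trips, "Departure station id", "Departure", "Outgoing")
arriving = hourly_counts(trips, "Return station id", "Return", "Arriving")

# Outer-join the two counts; a station/hour pair missing on one side becomes 0
counts = pd.concat([outgoing, arriving], axis=1).fillna(0)

# Full station x hour grid, so hours without any trips show up as zero rows
stations = sorted(counts.index.get_level_values(0).unique())
all_hours = pd.date_range("2019-04-01 08:00:00", "2019-04-01 10:00:00", freq="h")
grid = pd.MultiIndex.from_product([stations, all_hours], names=["ID", "Date"])
processed = counts.reindex(grid, fill_value=0).reset_index()
print(processed)
```

The 10:00 rows have no trips at all in the invented data, so they only exist because of the grid.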

In [None]:
def get_processed_data_for_month(month):
  month = '{:02.0f}'.format(month)
  data = pd.read_csv("/content/drive/My Drive/HSLDataset/processed/2019-" + month + "-processed.csv")
  data["Date"] = pd.to_datetime(data["Date"])
  
  # "Unnamed: 0" is the old row index that to_csv wrote out by default; it carries no information
  data = data.drop("Unnamed: 0", axis = 1)
  return data
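
The `Unnamed: 0` column comes from how pandas round-trips a frame through CSV. The short sketch below reproduces it on an in-memory buffer with invented values, and shows that writing with `index=False` avoids the extra column. It is an illustration only and does not change how the project files are written.

```python
import io
import pandas as pd

frame = pd.DataFrame({"ID": [19, 19], "Outgoing": [36.0, 20.0]})

# Default: the row index is written as a first column with an empty header
with_index = io.StringIO()
frame.to_csv(with_index)
with_index.seek(0)
columns_default = list(pd.read_csv(with_index).columns)

# index=False leaves the row index out of the file
without_index = io.StringIO()
frame.to_csv(without_index, index=False)
without_index.seek(0)
columns_clean = list(pd.read_csv(without_index).columns)

print(columns_default)
print(columns_clean)
```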

In [None]:
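# Process April through October 2019 (months 4-10) and save each month as a CSV file.
# preprocess_month writes to disk and returns nothing, so there is no result to keep.
for month in range(4,11):
  preprocess_month(month)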

## Looking at the data

In [None]:
def data_of_station_for_weekdays_in_month(dataframe, station, month, weekday):
  station_data = dataframe[dataframe["ID"] == station]
  station_data_for_month = station_data[(station_data["Date"].dt.month == month) & (station_data["Date"].dt.weekday == weekday)]
  return station_data_for_month

In [None]:
# September data for station 19
data = get_processed_data_for_month(9)
data[data["ID"] == 19]

| | ID | Date | Outgoing | Arriving | Name | Osoite | Kapasiteet | x | y |
|---|---|---|---|---|---|---|---|---|---|
| 12240 | 19 | 2019-09-01 00:00:00 | 36.0 | 16.0 | Central Railway Station/East | Rautatientori 1 | 20 | 24.942527 | 60.170824 |
| 12241 | 19 | 2019-09-01 01:00:00 | 20.0 | 14.0 | Central Railway Station/East | Rautatientori 1 | 20 | 24.942527 | 60.170824 |
| 12242 | 19 | 2019-09-01 02:00:00 | 15.0 | 12.0 | Central Railway Station/East | Rautatientori 1 | 20 | 24.942527 | 60.170824 |
| 12243 | 19 | 2019-09-01 03:00:00 | 7.0 | 7.0 | Central Railway Station/East | Rautatientori 1 | 20 | 24.942527 | 60.170824 |
| 12244 | 19 | 2019-09-01 04:00:00 | 9.0 | 8.0 | Central Railway Station/East | Rautatientori 1 | 20 | 24.942527 | 60.170824 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 12955 | 19 | 2019-09-30 19:00:00 | 10.0 | 4.0 | Central Railway Station/East | Rautatientori 1 | 20 | 24.942527 | 60.170824 |
| 12956 | 19 | 2019-09-30 20:00:00 | 7.0 | 7.0 | Central Railway Station/East | Rautatientori 1 | 20 | 24.942527 | 60.170824 |
| 12957 | 19 | 2019-09-30 21:00:00 | 2.0 | 2.0 | Central Railway Station/East | Rautatientori 1 | 20 | 24.942527 | 60.170824 |
| 12958 | 19 | 2019-09-30 22:00:00 | 2.0 | 4.0 | Central Railway Station/East | Rautatientori 1 | 20 | 24.942527 | 60.170824 |


## Different data visualizations

In [None]:
# Sundays (weekday 6, with Monday = 0) in September at station 19
wanted_data = data_of_station_for_weekdays_in_month(data, 19, 9, 6)
# Sort the dates so the subplots appear in chronological order (a bare set has no guaranteed order)
weekday_occurences = sorted(set(wanted_data["Date"].dt.date))
fig, axs = plt.subplots(1, len(weekday_occurences), figsize = (30,6))

# One subplot per Sunday: outgoing bikes in red, arriving bikes in blue, by hour of day
for idx,weekday_occurence in enumerate(weekday_occurences):
  weekday_occurence_data = wanted_data[wanted_data["Date"].dt.date == weekday_occurence]
  hours = weekday_occurence_data["Date"].dt.hour
  axs[idx].grid(True, alpha = 0.3)
  axs[idx].plot(hours, weekday_occurence_data["Outgoing"], 'r', hours, weekday_occurence_data["Arriving"], 'b', alpha = 0.3)
  axs[idx].title.set_text(weekday_occurence)
mpld3.display()
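
Because the trip data records departures and returns rather than bike counts, one rough proxy for availability is the hourly net flow (`Arriving` minus `Outgoing`) and its running total. The sketch below computes both on a few invented rows. The column names `Net flow` and `Cumulative net flow` are assumptions, and the calculation ignores any bikes moved by the operator as well as the unknown starting stock, so the cumulative value is only meaningful relative to the first hour.

```python
import pandas as pd

# Invented hourly counts for a single station
hourly = pd.DataFrame({
    "Date": pd.date_range("2019-09-01 00:00:00", periods=4, freq="h"),
    "Outgoing": [3.0, 1.0, 0.0, 2.0],
    "Arriving": [1.0, 2.0, 0.0, 4.0],
})

# Positive values mean more bikes arrived than left during that hour
hourly["Net flow"] = hourly["Arriving"] - hourly["Outgoing"]
# Running total of the net flow, relative to the (unknown) stock at the start
hourly["Cumulative net flow"] = hourly["Net flow"].cumsum()
print(hourly)
```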