# Imports

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from datetime import datetime, timedelta

%matplotlib inline

# Load bike dataset

In [2]:
df_chicago = pd.read_csv('../dataset/chicago_2018_clean.csv') 
df_chicago.head()

Unnamed: 0,start_time,end_time,start_station_id,end_station_id,start_station_name,end_station_name,bike_id,user_type,duration_per_trip,trip_time_in_hours
0,2018-04-01 00:04:44,2018-04-01 00:13:03,22,171,May St & Taylor St,May St & Cullerton St,3819,Subscriber,0 days 00:08:19,0.138611
1,2018-04-01 00:06:42,2018-04-01 00:27:07,157,190,Lake Shore Dr & Wellington Ave,Southport Ave & Wrightwood Ave,5000,Subscriber,0 days 00:20:25,0.340278
2,2018-04-01 00:07:19,2018-04-01 00:23:19,106,106,State St & Pearson St,State St & Pearson St,5165,Customer,0 days 00:16:00,0.266667
3,2018-04-01 00:07:33,2018-04-01 00:14:47,241,171,Morgan St & Polk St,May St & Cullerton St,3851,Subscriber,0 days 00:07:14,0.120556
4,2018-04-01 00:10:23,2018-04-01 00:22:12,228,219,Damen Ave & Melrose Ave,Damen Ave & Cortland St,5065,Subscriber,0 days 00:11:49,0.196944


In [3]:
# Convert the time columns from object dtype to datetime
df_chicago['start_time'] = pd.to_datetime(df_chicago['start_time'])
df_chicago['end_time'] = pd.to_datetime(df_chicago['end_time'])

In [4]:
df_chicago = df_chicago.sort_values(by = 'start_time')
df_chicago

Unnamed: 0,start_time,end_time,start_station_id,end_station_id,start_station_name,end_station_name,bike_id,user_type,duration_per_trip,trip_time_in_hours
3212538,2018-01-01 00:12:00,2018-01-01 00:17:23,69,159,Damen Ave & Pierce Ave,Claremont Ave & Hirsch St,3304,Subscriber,0 days 00:05:23,0.089722
3212539,2018-01-01 00:41:35,2018-01-01 00:47:52,253,325,Winthrop Ave & Lawrence Ave,Clark St & Winnemac Ave (Temp),5367,Subscriber,0 days 00:06:17,0.104722
3212540,2018-01-01 00:44:46,2018-01-01 01:33:10,98,509,LaSalle St & Washington St,Troy St & North Ave,4599,Subscriber,0 days 00:48:24,0.806667
3212541,2018-01-01 00:53:10,2018-01-01 01:05:37,125,364,Rush St & Hubbard St,Larrabee St & Oak St,2302,Subscriber,0 days 00:12:27,0.207500
3212542,2018-01-01 00:53:37,2018-01-01 00:56:40,129,205,Blue Island Ave & 18th St,Paulina St & 18th St,3696,Subscriber,0 days 00:03:03,0.050833
...,...,...,...,...,...,...,...,...,...,...
3212533,2018-12-31 23:45:13,2018-12-31 23:50:05,49,164,Dearborn St & Monroe St,Franklin St & Lake St,246,Subscriber,0 days 00:04:52,0.081111
3212534,2018-12-31 23:45:17,2018-12-31 23:50:05,49,164,Dearborn St & Monroe St,Franklin St & Lake St,2931,Subscriber,0 days 00:04:48,0.080000
3212535,2018-12-31 23:48:48,2018-12-31 23:57:22,624,44,Dearborn St & Van Buren St (*),State St & Randolph St,4386,Subscriber,0 days 00:08:34,0.142778
3212536,2018-12-31 23:50:09,2018-12-31 23:57:16,41,52,Federal St & Polk St,Michigan Ave & Lake St,4927,Subscriber,0 days 00:07:07,0.118611


# KPI Definition

As with any other business, covering demand is a core challenge. For high-demand stations in particular, it is important that enough bikes are present during peak hours to serve customers who want to rent one; otherwise, potential profit is lost. From the descriptive analysis so far, we know when and where high demand occurs. The remaining problem is how to cover this demand in the best way. Simply adding new bikes to the system is not possible, at least not in the short term. The bikes used to cover demand at one station have to come from another station, ideally one with excess bikes that are not needed there at that time. This KPI, defined as the **count of available bikes** at a station, supports the decision of where to source those bikes. It shows where free bikes are located so that they can be shifted to stations that do not have enough bikes to cover demand. Furthermore, it gives an overview of the utilization and coverage of bikes.

A bike is **available** at a given time if no ride in the dataset uses that bike at that time. In other words, any bike that is not active at a given time is **available**. We assume that every available bike is located at a specific station. Because the dataset contains no operational data, we have to assume that the position of a bike at a given time is the end station of the last ride taken with that bike. Of course, Divvy staff may move bikes around manually. These movements are most likely not recorded in the dataset, since they would probably have their own user type that is neither customer nor subscriber. Therefore, we have no data on these movements and cannot incorporate them into the calculation.

# KPI Calculation

## Availability

The set of active bikes at a time t can be computed by masking the ride dataframe and keeping every ride that started at or before t and ended at or after t.

In [5]:
def get_active_bikes(time):
    return df_chicago.loc[
        (df_chicago['start_time'] <= time) & 
        (df_chicago['end_time'] >= time)
    ]
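The masking step can be tried on a small hand-made table. The following is a minimal sketch, assuming a toy dataframe with the same column names as `df_chicago`; the rides and the helper name `active_rides_at` are hypothetical and only serve as an illustration.

```python
import pandas as pd

# Toy rides table (hypothetical data, same column names as df_chicago).
rides = pd.DataFrame({
    'bike_id': [1, 2, 3],
    'start_time': pd.to_datetime(
        ['2018-06-13 07:00', '2018-06-13 07:30', '2018-06-13 08:10']),
    'end_time': pd.to_datetime(
        ['2018-06-13 07:20', '2018-06-13 08:15', '2018-06-13 08:30']),
})

def active_rides_at(rides, time):
    # A ride is active at `time` if it started at or before `time`
    # and ended at or after `time`.
    mask = (rides['start_time'] <= time) & (rides['end_time'] >= time)
    return rides.loc[mask]

t = pd.Timestamp('2018-06-13 08:12')
active_ids = sorted(active_rides_at(rides, t)['bike_id'])
print(active_ids)  # bikes 2 and 3 are in the middle of a ride at 08:12
```

Bike 1 finished its ride at 07:20, so it does not appear in the active set for 08:12.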

To compute the set of available bikes, we calculate the set of active bikes and take its complement with respect to all bike IDs in the dataset.

In [6]:
all_bike_ids = df_chicago['bike_id'].unique()
def complement_bikes(bike_ids):
    return np.setdiff1d(all_bike_ids, bike_ids)

In [7]:
def get_non_active_bikes(time):
    return complement_bikes(get_active_bikes(time)['bike_id'].unique())
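As a quick illustration of the complement step, the sketch below applies `np.setdiff1d` to toy bike IDs. The ID values are made up for the example.

```python
import numpy as np

# Every bike ID seen in the dataset (toy values).
all_ids = np.array([10, 11, 12, 13])
# Bikes that are in the middle of a ride at time t (toy values).
active_ids = np.array([11, 13])

# setdiff1d returns the sorted, unique IDs that are in all_ids
# but not in active_ids, i.e. the available bikes.
available_ids = np.setdiff1d(all_ids, active_ids)
print(available_ids)
```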

## Position

After calculating the set of available bikes for a given time, we determine the station at which each bike is positioned at that time. To do so, we look up the last ride taken with that bike that ended before the given time. If no such ride is recorded, the bike is assigned to no station.

In [8]:
def bike_position_at(time):
    rides = df_chicago.loc[df_chicago['end_time'] < time]
    out = { 'bike_id': [], 'station_name': [], 'station_id': [] }
    for bike_id in get_non_active_bikes(time):
        rides_with_bike = rides[rides['bike_id'] == bike_id]
        last_station = None
        last_station_id = None
        if len(rides_with_bike) >= 1:
            last_station = rides_with_bike.iloc[-1]['end_station_name']
            last_station_id = rides_with_bike.iloc[-1]['end_station_id']
        out['bike_id'].append(bike_id)
        out['station_name'].append(last_station)
        out['station_id'].append(last_station_id)
    return out
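The per-bike loop above filters the ride table once for every available bike. A vectorized variant is sketched below as an alternative, not as the method used in this notebook. It assumes the same column names as `df_chicago`; the function name `last_known_stations` and the toy rides are hypothetical. Unlike `bike_position_at`, it omits bikes without an earlier ride instead of assigning them `None`.

```python
import pandas as pd

def last_known_stations(rides, time):
    """Last end station per available bike at `time` (vectorized sketch)."""
    # Bikes that are mid-ride at `time` are not available.
    is_active = (rides['start_time'] <= time) & (rides['end_time'] >= time)
    active_ids = rides.loc[is_active, 'bike_id']
    # Rides that finished strictly before `time`, in start-time order.
    finished = rides.loc[rides['end_time'] < time].sort_values('start_time')
    # Keep only the most recent finished ride of each bike.
    last = finished.drop_duplicates('bike_id', keep='last')
    last = last.loc[~last['bike_id'].isin(active_ids)]
    return last.set_index('bike_id')['end_station_id']

# Toy rides (hypothetical data).
rides = pd.DataFrame({
    'bike_id': [2, 1, 2, 1, 3],
    'start_time': pd.to_datetime([
        '2018-06-13 06:00', '2018-06-13 07:00', '2018-06-13 07:30',
        '2018-06-13 07:40', '2018-06-13 08:10']),
    'end_time': pd.to_datetime([
        '2018-06-13 06:10', '2018-06-13 07:20', '2018-06-13 08:15',
        '2018-06-13 07:55', '2018-06-13 08:30']),
    'end_station_id': [7, 5, 5, 9, 2],
})

positions = last_known_stations(rides, pd.Timestamp('2018-06-13 08:00'))
print(positions)
```

At 08:00, bike 2 is mid-ride and bike 3 has no earlier ride, so only bike 1 is listed, at the end station of its most recent ride.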

### Import station dataset

In [9]:
df_stations = pd.read_csv('../dataset/chicago_stations.csv')
df_stations.head()

Unnamed: 0,ID,station_name,x,y,position
0,2,Buckingham Fountain,41.876423,-87.620339,"(41.876423, -87.620339)"
1,3,Shedd Aquarium,41.867226,-87.615355,"(41.86722595682, -87.6153553902)"
2,4,Burnham Harbor,41.857412,-87.613792,"(41.85741178707404, -87.61379152536392)"
3,5,State St & Harrison St,41.874053,-87.627716,"(41.874053, -87.627716)"
4,6,Dusable Harbor,41.886976,-87.612813,"(41.886976, -87.612813)"


The function below looks up the position of each station in the station data and adds it to the dataframe as a `position` column.

In [10]:
df_stations_indexed_by_id = df_stations.set_index('ID')
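def enrich_with_position(df):
    # Work on a copy so the caller's dataframe is not modified in place.
    out = df.copy()
    # Look up each station's position by its ID (the index of df).
    out['position'] = df_stations_indexed_by_id.loc[out.index, 'position']
    return out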
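A label-based lookup with `.loc` can fail with a `KeyError` if a station ID from the rides is missing in the station data. The sketch below shows one possible alternative using `reindex`, which yields a missing value for unknown IDs instead. The function name `enrich_with_position_safe`, the station ID 999 and the bike counts are hypothetical.

```python
import pandas as pd

# Toy station table indexed by station ID (positions taken from the preview above).
stations = pd.DataFrame({
    'ID': [2, 3],
    'position': ['(41.876423, -87.620339)', '(41.867226, -87.615355)'],
}).set_index('ID')

# Toy bike counts per station; station 999 does not exist in `stations`.
counts = pd.DataFrame(
    {'bike_id': [4, 1]},
    index=pd.Index([2, 999], name='station_id'),
)

def enrich_with_position_safe(df, stations):
    # Copy first so the input dataframe is left untouched.
    out = df.copy()
    # reindex returns NaN for IDs that are missing in the station table.
    out['position'] = stations['position'].reindex(out.index)
    return out

enriched = enrich_with_position_safe(counts, stations)
print(enriched)
```

Rows with a missing position can then be removed with `dropna()`, as is done before drawing the map.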

# KPI Visualization

We visualized the distribution of bikes using the folium library. Folium makes it possible to draw markers at specific positions on a real-world map. The function below draws a circle on such a map for every station passed to it. The size and color of each circle reflect the number of bikes at the station relative to the fullest station, with the color running from green (low share) to red (high share). Stations with a share above 40% are additionally labeled with their bike count.

In [11]:
import folium
from folium.features import DivIcon
from ast import literal_eval as make_tuple

def create_map(stations, zoom, color, offset):
    # `color` is currently unused; circle colors are derived from the bike share.
    # Center the map on the first station passed in, shifted by `offset`.
    origin = make_tuple(stations.position.iloc[0])
    m = folium.Map(location=[origin[0] + offset[0], origin[1] + offset[1]], zoom_start=zoom)

    max_bikes_at_one_station = stations['bike_id'].max()
    max_radius = 600  # circle radius in meters for the fullest station
    for _, station in stations.iterrows():
        coord = make_tuple(station['position'])

        # Share of bikes relative to the fullest station, mapped from green (low) to red (high).
        pct = station['bike_id'] / max_bikes_at_one_station
        pct_diff = 1.0 - pct
        green_color = min(255, pct_diff * 2 * 255)
        red_color = min(255, pct * 2 * 255)
        col = '#%02x%02x%02x' % (int(red_color), int(green_color), 50)
        # Label only the fuller stations with their bike count to keep the map readable.
        if pct > 0.4:
            folium.Marker(
                location=coord,
                icon=DivIcon(
                    icon_size=(250, 36),
                    icon_anchor=(10, 10),
                    html='<div style="font-size: 20pt">' + str(station['bike_id']) + '</div>',
                )
            ).add_to(m)
        folium.Circle(
            radius=max_radius * pct,
            location=coord,
            color=col,
            fill=True,
        ).add_to(m)

    return m
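The color coding used in `create_map` can be isolated into a small function to see which colors result from which share. This is a sketch for illustration only; the helper name `share_to_hex` is not part of the notebook.

```python
def share_to_hex(pct):
    """Map a share in [0, 1] to a green-to-red hex color, as in create_map."""
    # Red rises from 0 to 255 over the lower half of the range and then stays at 255.
    red = min(255, pct * 2 * 255)
    # Green stays at 255 over the lower half and falls to 0 over the upper half.
    green = min(255, (1.0 - pct) * 2 * 255)
    # The blue channel is fixed at 50 (hex 32).
    return '#%02x%02x%02x' % (int(red), int(green), 50)

for share in (0.0, 0.5, 1.0):
    print(share, share_to_hex(share))
# 0.0 -> '#00ff32' (green), 0.5 -> '#ffff32' (yellow), 1.0 -> '#ff0032' (red)
```

A station holding half as many bikes as the fullest station is therefore drawn in yellow.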

To draw the bike availability map for a time t, we first calculate the bike positions at t using the approach described in the previous section. Then we group the data by station ID and count the number of bikes at each station. Finally, we enrich this data with the station positions and draw the map using the function above.

In [12]:
def draw_bike_availability_map(time, offset = (0, 0), zoom = 11):
    bike_position = bike_position_at(time)
    df_bike_positions = pd.DataFrame(bike_position.values(),bike_position.keys()).T
    df_bike_positions = df_bike_positions.groupby('station_id').count()
    df_bike_positions = enrich_with_position(df_bike_positions)
    df_bike_positions = df_bike_positions.dropna().sort_values(by = 'bike_id', ascending = False)
    return create_map(
        df_bike_positions,
        zoom,
        'crimson',
        offset
    )
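The grouping and counting step can be checked on a small hand-made example. The sketch below assumes a dictionary in the shape returned by `bike_position_at`; the bike IDs, station names and station IDs are hypothetical.

```python
import pandas as pd

# Toy output in the shape returned by bike_position_at (hypothetical values).
positions = {
    'bike_id': [1, 2, 3, 4],
    'station_name': ['A', 'A', 'B', None],
    'station_id': [10, 10, 20, None],
}

df_positions = pd.DataFrame(positions)
counts = (
    df_positions
    .dropna(subset=['station_id'])   # bikes without a known station are skipped
    .astype({'station_id': int})     # the None entry made this column float
    .groupby('station_id')['bike_id']
    .count()                         # number of bikes per station
    .sort_values(ascending=False)
)
print(counts)
```

Station 10 holds two bikes and station 20 holds one; bike 4 has no known station and is not counted.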

# KPI instantiation

To detect and analyse potential patterns in the defined KPI, we computed hourly values for it across different types of days and times of day and visualized them. We chose two summer days and two winter days for instantiating the KPI, in each season one weekday and one weekend day.
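The hourly snapshots for the four chosen days can be generated programmatically. The sketch below uses the four dates from this notebook; the dictionary layout and the helper name `hourly_timestamps` are assumptions made for the illustration.

```python
import pandas as pd

# The four days used for instantiating the KPI.
days = {
    'summer_wednesday': pd.Timestamp('2018-06-13'),
    'summer_saturday': pd.Timestamp('2018-06-16'),
    'winter_wednesday': pd.Timestamp('2018-12-12'),
    'winter_saturday': pd.Timestamp('2018-12-15'),
}

def hourly_timestamps(day):
    # One timestamp per full hour of the given day (00:00 to 23:00).
    return [day + pd.Timedelta(hours=h) for h in range(24)]

# Sanity check that each date falls on the intended weekday.
weekday_names = {label: day.day_name() for label, day in days.items()}
snapshots = {label: hourly_timestamps(day) for label, day in days.items()}

print(weekday_names)
print(snapshots['summer_wednesday'][7])
```

Each timestamp in `snapshots` could then be passed to `draw_bike_availability_map`.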

## Summer Wednesday

In [13]:
summer_wednesday = pd.Timestamp('2018-06-13')

### Morning

In [14]:
draw_bike_availability_map(summer_wednesday.replace(hour = 7), zoom = 13)

In [15]:
draw_bike_availability_map(summer_wednesday.replace(hour = 9), zoom = 13)

In the early morning, until approximately 8am, most bikes are located at the station "Canal St & Adams St". After 8am the bikes get distributed to different stations throughout the city center. This pattern most likely reflects working people arriving at "Canal St & Adams St" via Chicago Union Station and then commuting to their workplaces by bike. At 9am only about 30 bikes are left at "Canal St & Adams St", compared with the 104 bikes positioned there at 7am, meaning that about 70 bikes left the station in that time range. Fleet operators should ensure that enough bikes are positioned at "Canal St & Adams St" between 6am and 7am to cover this spike in demand. Since some bikes are still left at the station at 9am, about 70 bikes appear to be enough to cover the demand.
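The net change at each station between two snapshots can also be computed directly from the per-station counts. The sketch below uses the 104 bikes at 7am and a rounded value of 30 bikes at 9am for "Canal St & Adams St"; the other two stations and their counts are hypothetical placeholders.

```python
import pandas as pd

# Bikes per station at two points in time ("Station X" and "Station Y" are made up).
counts_7am = pd.Series({'Canal St & Adams St': 104, 'Station X': 5})
counts_9am = pd.Series({'Canal St & Adams St': 30, 'Station X': 12, 'Station Y': 8})

# Subtract the earlier snapshot from the later one. fill_value=0 treats a
# station that is missing from one snapshot as holding zero bikes.
net_change = counts_9am.sub(counts_7am, fill_value=0).astype(int).sort_values()
print(net_change)
```

Negative values mark stations that lost bikes between the two snapshots, so the station with the largest outflow is listed first.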

### Afternoon

In [16]:
draw_bike_availability_map(summer_wednesday.replace(hour = 17), zoom = 13)

In [17]:
draw_bike_availability_map(summer_wednesday.replace(hour = 18), zoom = 13)

Towards 6pm, many bikes return to "Canal St & Adams St". This is most likely the time frame in which many people get off work and return home or use Union Station to reach their next destination. Knowing this, fleet operators should avoid having many bikes positioned at "Canal St & Adams St" before this time frame and should instead distribute them to nearby stations so that people can ride to "Canal St & Adams St".

### Evening / Night

In [18]:
draw_bike_availability_map(summer_wednesday.replace(hour = 19), zoom = 13)

The destinations many users choose during the evening are stations close to the lake. The most popular are the "Grand & Streeter" and "Lake Shore" stations.

In [19]:
draw_bike_availability_map(summer_wednesday.replace(hour = 23), zoom = 13)

At the station "Grand & Streeter", most people seem to leave by bike again, so only a small number of excess bikes remains. However, many bikes are left at the station "Lake Shore" towards the end of the day. Fleet operators could consider these bikes as a source for covering the high demand at "Canal St & Adams St" in the morning.

## Summer Saturday

In [20]:
summer_saturday = pd.Timestamp('2018-06-16')

### Morning

In [21]:
draw_bike_availability_map(summer_saturday.replace(hour = 8), zoom = 14)

Similar to Wednesday evening, many bikes are left over at the station "Grand & Streeter" from the day before. Again, this is a potential source of bikes that can be used for redistribution. Since this is the weekend, people are not commuting to work and fewer bikes are needed at Union Station. However, 35 bikes might still be too few to supply customers who want to travel from Union Station to other destinations.

### Afternoon

In [22]:
draw_bike_availability_map(summer_saturday.replace(hour = 17), zoom = 14)

### Evening / Night

In [23]:
draw_bike_availability_map(summer_saturday.replace(hour = 18), zoom = 13, offset = (0.02, 0))

In [24]:
draw_bike_availability_map(summer_saturday.replace(hour = 19), zoom = 14)

Throughout the late afternoon and into the evening, one can observe a shift from the two northern lake stations to the "Grand & Streeter" station. This illustrates another interesting feature of this KPI: it shows how the distribution of bikes shifts throughout the day. In this particular instance, the only insight that can be derived is that many bikes could be left over at the "Grand & Streeter" station.

## Winter Wednesday

In [25]:
winter_wednesday = pd.Timestamp('2018-12-12')

### Morning

In [26]:
draw_bike_availability_map(winter_wednesday.replace(hour = 6), zoom = 14)

In [27]:
draw_bike_availability_map(winter_wednesday.replace(hour = 7), zoom = 14)

In [28]:
draw_bike_availability_map(winter_wednesday.replace(hour = 8), zoom = 14)

Patterns similar to the summer Wednesday can be observed here: many bikes are positioned at "Canal St & Adams St" in the early morning, and throughout the morning they spread out to many stations in the city center. Again, this could be explained by people commuting to work in the morning. An interesting observation is the constant 78 bikes located at the station "Grand & Streeter". One explanation could be that these are actually "dead" bikes: the last recorded station for these bikes is "Grand & Streeter", but they were moved by operational staff and are no longer located there. As our analysis of the geographical demand patterns showed, the "Grand & Streeter" station is highly unpopular during the winter months. It is therefore unlikely that so many bikes ended up at that station through rides in the winter months. Instead, these bikes could be left over from rides taken towards the end of autumn. Nevertheless, as on the summer day, we observe a spike in demand at "Canal St & Adams St" during the morning hours, which is why fleet operators should ensure a sufficient stock of bikes at that station.

### Afternoon

In [29]:
draw_bike_availability_map(pd.Timestamp('2018-12-12 17:00:00'), zoom = 14)

During the afternoon many bikes return to "Canal St & Adams St".

### Night

In [30]:
draw_bike_availability_map(pd.Timestamp('2018-12-12 23:00:00'), zoom = 14)

In contrast to the summer day, the lake stations are not popular destinations. At this time, most bikes remain at "Canal St & Adams St". Therefore, no shifting is necessary to cover the demand spike the next morning.

## Winter Saturday

### Morning

In [31]:
draw_bike_availability_map(pd.Timestamp('2018-12-15 08:00:00'), zoom = 14)

### Afternoon

In [32]:
draw_bike_availability_map(pd.Timestamp('2018-12-15 17:00:00'), zoom = 14)

### Night

In [None]:
draw_bike_availability_map(pd.Timestamp('2018-12-15 23:00:00'), zoom = 14)

On a winter weekend day, many different stations seem to be popular. While many bikes are positioned at the "Grand & Streeter" station, most of them are most likely "dead" bikes again. The station "Millenium Park" attracts more users and also retains many residual bikes. During the winter months, this station could serve as an alternative bike source to the "Grand & Streeter" station for supplying morning customers.
