# Model and visualize some Key Performance Indicators of a bike sharing service
### You operate a fleet of 1000 bikes over 5 different cities. Here, data for the year 2020 are generated from scratch to emulate the service usage.
The following code generates a dataset and writes it to a CSV file with one row per day and per city. The code is structured in two parts: first, helper functions and variables for data generation are defined; these are then used in the main block, where a loop over all days of 2020 contains a second loop over the 5 selected cities. For each day-city pair, the columns of the CSV file are filled with suitable indicators.

We assume that the number of bike trips per day depends on key parameters such as weather conditions, season and temperature, which in turn differ from city to city. Once the number of bike trips is estimated, we can model the number of bike users per day. Users are divided into two categories: registered users, who subscribed to our service and pay a fixed monthly fee to obtain a lower price per minute, and occasional users. We model an increase of registered users during the year, due to performance improvements and the growing popularity of our service. With this in mind, and with some assumptions on the average bike ride time and frequency, we compute the minutes of bike usage per day and the corresponding income in euros. Finally, we estimate the customer satisfaction and the number of new subscriptions per day.

In [39]:
import csv
import pandas as pd
from datetime import date, timedelta
import random

We start with date generation, from 1 Jan 2020 to 31 Dec 2020, using the datetime package.
These variables are used in the main loop to fill the first column of the dataset with
the date in year-month-day format.

In [40]:
# variables for date generation from 1 Jan 2020 to 31 Dec 2020
sdate = date(2020, 1, 1)   # start date
edate = date(2020, 12, 31)   # end date

delta = edate - sdate       # timedelta 
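
As a quick sanity check, the following sketch (not part of the original notebook) enumerates the days covered by `sdate` and `edate`. Since 2020 is a leap year, the main loop should visit 366 days. The name `all_days` is introduced here for illustration only.

```python
from datetime import date, timedelta

sdate = date(2020, 1, 1)    # start date
edate = date(2020, 12, 31)  # end date
delta = edate - sdate       # timedelta between the two dates

# one date object per day, both endpoints included
all_days = [sdate + timedelta(days=i) for i in range(delta.days + 1)]

print(len(all_days))             # 366, because 2020 is a leap year
print(all_days[0].isoformat())   # 2020-01-01
print(all_days[-1].isoformat())  # 2020-12-31
```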

Definition of a list with the cities where our bikes are available.
We assume that all cities have the same total number of bikes. This value will later be reduced to take into account possible bike malfunctions.

In [41]:
city_list = ['Paris', 'Milan', 'Madrid', 'Berlin', 'London']
bikes_per_city = 200

An important factor impacting the number of bike trips per day is the weather.
Since each city has its own local weather, we construct a list of weights
describing the probability for a given city to have a certain weather condition.
This is an estimate averaged over the whole year.
For example, in Paris there is a 20% probability of having a sunny (clear) day and a 35% probability of having a cloudy one.

In [42]:
city_weather = ['clear', 'cloudy', 'light rain', 'heavy rain']

# probability of being: clear, cloudy, light rain or heavy rain
weigths_paris = [0.2, 0.35, 0.3, 0.15]
weigths_milan = [0.3, 0.2, 0.3, 0.2]
weigths_madrid = [0.4, 0.2, 0.2, 0.2]
weigths_berlin = [0.2, 0.3, 0.35, 0.15]
weigths_london = [0.15, 0.3, 0.4, 0.15]
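
The weights are used as probabilities, so each list should contain one entry per weather condition and sum to 1. A minimal sketch of such a check, followed by a single draw, is given below. The dictionary `weather_weights` and the local generator `rng` are names introduced here for illustration; the notebook itself keeps one list per city.

```python
import math
import random

city_weather = ['clear', 'cloudy', 'light rain', 'heavy rain']

# same weights as in the cell above, gathered in a dictionary (hypothetical name)
weather_weights = {
    'Paris':  [0.2, 0.35, 0.3, 0.15],
    'Milan':  [0.3, 0.2, 0.3, 0.2],
    'Madrid': [0.4, 0.2, 0.2, 0.2],
    'Berlin': [0.2, 0.3, 0.35, 0.15],
    'London': [0.15, 0.3, 0.4, 0.15],
}

for city, weights in weather_weights.items():
    # one probability per weather condition, summing to 1
    assert len(weights) == len(city_weather), city
    assert math.isclose(sum(weights), 1.0), city

# random.choices returns a list, even for a single draw
rng = random.Random(0)  # local generator with a fixed seed, for reproducibility
sample = rng.choices(city_weather, weather_weights['Paris'])
print(sample[0])
```

Because `random.choices` returns a list, the main loop reads the sampled condition as `weather[0]`.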

The temperature is also an important factor to consider when estimating bike usage.
It depends on the city and on the season. We construct lists that give the average temperature of each city in a given season.

In [43]:
def average_temperature(day):
    # season identification depending on the date (start included, end excluded);
    # plain datetime.date comparisons do not depend on the installed pandas version
    if date(2020, 3, 20) <= day < date(2020, 6, 20):
        # average temperature in Paris, Milan, Madrid, Berlin, London in spring
        return [14, 16, 18, 12, 13, 'spring']
    if date(2020, 6, 20) <= day < date(2020, 9, 22):
        # average temperature in Paris, Milan, Madrid, Berlin, London in summer
        return [24, 28, 30, 23, 21, 'summer']
    if date(2020, 9, 22) <= day < date(2020, 12, 20):
        # average temperature in Paris, Milan, Madrid, Berlin, London in autumn
        return [13, 16, 19, 11, 12, 'autumn']
    # all remaining days of 2020 (before 20 March and from 20 December) are winter
    # average temperature in Paris, Milan, Madrid, Berlin, London in winter
    return [5, 7, 10, 2, 2, 'winter']

The following function estimates the temperature and the weather according to the season and the city. For each city, it picks the weather using the previously defined list of weights, which gives the probability of each weather condition. The temperature is estimated from the list of average temperatures returned by the function 'average_temperature' for the current season: a gaussian sampling with a standard deviation of 5 degrees is applied to take possible fluctuations into account. The season is also returned.

In [44]:
# weather weights and position in the average-temperature list for each city
city_settings = {'Paris': (weigths_paris, 0), 'Milan': (weigths_milan, 1), 'Madrid': (weigths_madrid, 2),
                 'Berlin': (weigths_berlin, 3), 'London': (weigths_london, 4)}

def condition_estimation(city):
    # note: this function relies on the global variable 'day' set in the main loop
    weights, index = city_settings[city]
    # random.choices returns a one-element list, e.g. ['cloudy']
    weather = random.choices(city_weather, weights)
    # t is a list with average temperature in Paris, Milan, Madrid, Berlin, London and season
    t = average_temperature(day)
    # gaussian fluctuation with 5 degrees standard deviation around the seasonal average
    temperature = round(random.gauss(t[index], 5), 2)
    season = t[5]
    condition = [temperature, season, weather]
    return condition

Here we want to estimate the average number of trips per day, making some assumptions. Suppose each bike is taken every 20 minutes (averaged over the year for all cities): each bike then provides on average 3 rides per hour and 72 rides per day (24h x 60min / 20min). Considering about 7 'night' hours per day, during which the probability for a bike to be rented is reduced by a factor of 3 (meaning 1 ride per hour), we have:
(24h - 7h) x 3 rides + 7h x 1 ride = 58 rides per day.

In [45]:
av_rides_1day_1bike = 58 
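
The arithmetic behind the value 58 can be reproduced in a few lines. This is only a worked check of the assumptions stated above; the variable names are introduced here for illustration.

```python
minutes_between_rides = 20   # assumption from the text: one ride every 20 minutes
night_hours = 7              # hours per day with reduced demand

rides_per_hour_day = 60 // minutes_between_rides   # 3 rides per hour
rides_per_hour_night = rides_per_hour_day // 3      # demand reduced by a factor of 3

# without any night reduction
rides_without_night = 24 * rides_per_hour_day
# with 7 night hours at the reduced rate
rides_with_night = ((24 - night_hours) * rides_per_hour_day
                    + night_hours * rides_per_hour_night)

print(rides_without_night)  # 72
print(rides_with_night)     # 58
```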

We can also suppose that the number of bike rides grows linearly with the temperature.
Assuming that at 0 degrees the number of bike rides is only 10% of the average (58 x 0.1 = 5.8)
and that at 25 degrees it is 120% of the average (58 x 1.2 = 69.6),
the slope is (69.6 - 5.8) / 25, about 2.6, and the linear behaviour is described by:
bike rides per day = 2.6 x temperature + 5.8. The following function implements this temperature dependence to compute the bike rides per bike and per day.
If the temperature is too high (e.g. more than 30 degrees), users are likely to avoid riding bikes, so a decrease in the number of bike rides of around 5% is introduced for such temperatures.

In [46]:
def temperature_dep(temp):
    # linear dependence extracted from the assumptions on the average number of bike rides at a given temperature
    rides = round(2.6*temp+5.8,2)
    # gaussian sampling with 10 bike rides standard deviation to add randomness
    rides = random.gauss(rides, 10)
    # make sure that rides are always > 0: a day without any ride is excluded
    # to keep the later calculations (e.g. divisions by the number of users) safe
    if rides <= 0:
        rides=1
    # above 30 degrees, reduce the rides by about 5% (gaussian with 2% standard deviation)
    if temp > 30.0:
        rides = rides - abs(random.gauss(rides*0.05,rides*0.02))
    return rides
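
The coefficients 2.6 and 5.8 used in `temperature_dep` can be recovered from the two anchor points given in the text. The sketch below is only a worked derivation; the variable names are introduced here, and the notebook rounds the resulting slope of about 2.55 up to 2.6.

```python
av_rides = 58                     # average rides per bike per day

t_low, share_low = 0.0, 0.10      # 10% of the average at 0 degrees
t_high, share_high = 25.0, 1.20   # 120% of the average at 25 degrees

rides_low = av_rides * share_low    # about 5.8 rides
rides_high = av_rides * share_high  # about 69.6 rides

# straight line through the two points (t_low, rides_low) and (t_high, rides_high)
slope = (rides_high - rides_low) / (t_high - t_low)
intercept = rides_low - slope * t_low

print(round(slope, 3), round(intercept, 1))  # 2.552 5.8
```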

Bike occupation, namely the fraction of bikes used in a given weather condition with respect to the average, depends on the weather: we define a list giving the fraction of all available bikes that are used
in each weather condition.

In [47]:
# bike occupation if 1:clear, 2:cloudy, 3:light rain, 4: heavy rain
bike_occupation = [0.95, 0.75, 0.35, 0.15]
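
Combining these occupation factors with the weather probabilities of each city gives the expected occupation averaged over the year. The sketch below computes this weighted average; `weather_weights` and `expected_occupation` are names introduced here for illustration and do not appear in the notebook.

```python
# occupation factors in the order: clear, cloudy, light rain, heavy rain
bike_occupation = [0.95, 0.75, 0.35, 0.15]

# weather probabilities per city, copied from the weights defined earlier
weather_weights = {
    'Paris':  [0.2, 0.35, 0.3, 0.15],
    'Milan':  [0.3, 0.2, 0.3, 0.2],
    'Madrid': [0.4, 0.2, 0.2, 0.2],
    'Berlin': [0.2, 0.3, 0.35, 0.15],
    'London': [0.15, 0.3, 0.4, 0.15],
}

# expected occupation = sum over weather conditions of probability x occupation
expected_occupation = {
    city: sum(w * occ for w, occ in zip(weights, bike_occupation))
    for city, weights in weather_weights.items()
}

for city, value in expected_occupation.items():
    print(f'{city}: {value:.2f}')
```

With these inputs Madrid has the highest expected occupation (about 0.63) and London the lowest (about 0.53).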

There are two types of users: occasional users and registered users.
Registered users have a subscription to the bike sharing service, which gives them lower prices for bike rides.
We estimate the percentage of registered users assuming an optimistic increase of 20 percentage points during 2020: 50% of all users are registered in January, and the percentage becomes 70% in December of the same year. With these values we model a linear dependence on the month.
We smooth the linear dependence with a gaussian sampling to take into account possible fluctuations of about 5%.

In [48]:
def user_type(users, date):    
    # linear growth of registered users describing a 20% increase during 2020
    percentage_registered_users = 0.018*date.month + 0.48
    registered_users = percentage_registered_users*users
    #gaussian smoothing with 5% std
    registered_users = abs(random.gauss(registered_users, registered_users*0.05))
    
    return registered_users
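
The coefficients 0.018 and 0.48 in `user_type` follow from the two values given in the text (50% in January, 70% in December). A short derivation is sketched below; `registered_share` is a hypothetical helper used only for this check.

```python
share_january, share_december = 0.50, 0.70

# straight line through (month 1, 50%) and (month 12, 70%)
slope = (share_december - share_january) / (12 - 1)
intercept = share_january - slope * 1

def registered_share(month):
    # expected fraction of registered users in a given month (1 to 12)
    return slope * month + intercept

print(round(slope, 3), round(intercept, 2))  # 0.018 0.48
```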

Estimate the total ride time for all bikes in a given city on a given day: we assume that each trip is on average 15 minutes long, gaussian distributed with a standard deviation of 10 minutes. As a possible improvement, note that the gaussian distribution is oversimplified because it does not account for long rides (i.e. the probability of a 1 hour ride is extremely low, while in reality such rides do happen); a distribution with a longer tail towards long bike trips could be more suitable. The average ride time is multiplied by the number of rides per day. A very rough estimate of the waiting time due to the lack of bikes is also given; it is used later to add a penalty for long waiting times.

In [49]:
def ride_time_calc(rides, avaliable_bikes):            
    ride_time = round(rides * abs(random.gauss(15, 10)),1)
    # as a control: the total ride time is what we would have if all available bikes were in use all day
    # the computed ride time should be less than the total ride time
    # if it is more, the excess is considered as waiting time to get a bike
    tot_ride_time = round(24*60*avaliable_bikes,2) #24h x 60min x all available bikes
    # a possible difference is taken as waiting time
    if ride_time >= tot_ride_time:
        waiting_time = ride_time - tot_ride_time
    else:
        waiting_time = 0
    time = [ride_time, waiting_time]
    return time
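
The text above suggests that a distribution with a longer tail could describe ride times better than a gaussian. One possible choice, which is an assumption of this sketch and not what the notebook uses, is a log-normal distribution tuned to keep the same mean (15 minutes) and standard deviation (10 minutes). The helper `sample_ride_minutes` is hypothetical.

```python
import math
import random

mean_minutes = 15.0   # target mean ride time, as in the text
std_minutes = 10.0    # target standard deviation, as in the text

# parameters of the underlying normal distribution that give the
# log-normal distribution the requested mean and standard deviation
sigma2 = math.log(1.0 + (std_minutes / mean_minutes) ** 2)
sigma = math.sqrt(sigma2)
mu = math.log(mean_minutes) - sigma2 / 2.0

def sample_ride_minutes(rng=random):
    # always positive, with a longer tail towards long rides than a gaussian
    return rng.lognormvariate(mu, sigma)

# analytical mean of the log-normal, which should reproduce the target mean
lognormal_mean = math.exp(mu + sigma2 / 2.0)
print(round(lognormal_mean, 6))  # 15.0
```

A log-normal sample is never negative, so the `abs()` used with the gaussian draw would not be needed in this variant.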


This function estimates the income from all rides for the two types of users: prices differ depending on whether
the user has a subscription or not. Note that it is always convenient for the bike sharing service to have registered users: despite the lower price per minute, the subscription ensures a fixed fee per month and a regular usage of our bikes.

In [50]:
def price(minutes, registered_users, occasional_users):
    # fraction of registered/occasional users
    total_users = registered_users + occasional_users
    reg_fraction = registered_users/total_users
    occ_fraction = occasional_users/total_users
    # fees per minute for the two types of users
    minute_fee_reg = 0.30  # 30 cents per minute
    minute_fee_occ = 0.50  # 50 cents per minute
    # total income: the minutes are split between the user types according to their fractions
    fee = minutes*reg_fraction*minute_fee_reg + minutes*occ_fraction*minute_fee_occ
    return fee
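
A small worked example may help to read the formula. The function `day_income` below is a hypothetical, self-contained restatement of the same calculation, and the input numbers are made up for illustration.

```python
def day_income(minutes, registered_users, occasional_users,
               minute_fee_reg=0.30, minute_fee_occ=0.50):
    # split the total minutes between the two user types, then apply each fee
    total_users = registered_users + occasional_users
    reg_fraction = registered_users / total_users
    occ_fraction = occasional_users / total_users
    return minutes * (reg_fraction * minute_fee_reg + occ_fraction * minute_fee_occ)

# 1000 minutes in total, 600 registered and 400 occasional users:
# 600 minutes at 0.30 euros + 400 minutes at 0.50 euros = 180 + 200 euros
example_income = day_income(1000, 600, 400)
print(round(example_income, 2))  # 380.0
```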

Customer satisfaction evaluation. A 20% increase is observed due to performance improvements and app popularity; it is modeled as a linear function of the month. A negative impact due to waiting time is included, while a positive impact is introduced if many bikes are available.

In [51]:
def customer_satisfaction(date, waiting_time, ava_bikes):    
    # linear growth of satisfaction describing a 20% increase during 2020 due to
    # performance improvements
    evaluation = 0.18*date.month + 5.8 + 0.005*date.day
    #gaussian smoothing
    evaluation = abs(random.gauss(evaluation, evaluation*0.1))
    #negative impact if non negligible waiting time
    if waiting_time>0:
        evaluation = evaluation - abs(random.gauss(0.1, 0.01)) 
    #positive impact if many bikes are available
    if ava_bikes > 195: 
        evaluation = evaluation + abs(random.gauss(0.1, 0.01)) 
    # limit the result between 0 and 10, after the penalty and the bonus have been applied
    evaluation = min(max(evaluation, 0), 10)
    return evaluation

Total number of new subscriptions to the bike sharing app for each city. During 2020 a global increase of new subscriptions has been observed.
This function assumes a different increase of registered users per city: for example, in Paris we set a threshold of 0.75, meaning that on 3/4 of the days the number of registrations increases, while on 1/4 of the days we observe cancellations. Berlin has the highest increase because, despite the bad weather, Germans are very environmentally conscious and our marketing strategy worked well in this city. London has the lowest increase because many bike sharing competitors are present in the city.
We assume a random gaussian increase or decrease of subscriptions with a mean of 5 and a standard deviation of 2. Given that we have 200 bikes per city, a change of about 5 users per day in either direction seems a reasonable amount.

In [52]:
threshold_per_city= {'Paris':0.75, 'Milan':0.65,'Madrid': 0.6, 'Berlin':0.85,'London': 0.52}
def new_subscription(city):
    # sampling uniformly between 0 and 1
    r = random.uniform(0, 1)
    # apply the threshold to decide between an increase and a decrease of registered users
    if r < threshold_per_city[city]:
        # increase
        new_subscriptions = abs(random.gauss(5,2))
    else:
        # decrease (also covers r equal to the threshold, so the variable is always defined)
        new_subscriptions = -abs(random.gauss(5,2))
    return new_subscriptions
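
The thresholds translate into an expected net number of new subscriptions per day. The sketch below uses the approximation that the absolute value of a gaussian with mean 5 and standard deviation 2 has a mean close to 5, since negative draws are rare. This approximation and the name `expected_net` are assumptions of the sketch.

```python
threshold_per_city = {'Paris': 0.75, 'Milan': 0.65, 'Madrid': 0.6,
                      'Berlin': 0.85, 'London': 0.52}

mean_change = 5  # approximate mean size of a daily increase or decrease

# with probability p the change is positive, with probability 1 - p negative:
# expected net change = p * mean_change - (1 - p) * mean_change
expected_net = {city: (2 * p - 1) * mean_change
                for city, p in threshold_per_city.items()}

for city, value in expected_net.items():
    print(f'{city}: {value:.1f}')
```

Under this approximation Berlin gains about 3.5 subscriptions per day and London about 0.2. The values written to the CSV file are truncated with `int()`, so the averages observed in the data can be somewhat smaller in magnitude.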

Main loop creating the CSV file and filling all the columns.

In [53]:
# Open the csv file we want to fill with our data in writing mode
with open('bike_summary.csv', 'w', encoding='UTF8', newline='') as f:
    writer = csv.writer(f)
    
    #define the header
    header = ['date', 'month','city', 'available bikes', 'season', 'weather condition', 'temperature (degree C)', 'trips per bike','total bike trips', 'number of users', 'registered users (%)', 'occasional_users (%)', 'tot bike trip duration (min)', 'day income (euros)', 'customer satisfaction', 'new day subscriptions']
    # write the header
    writer.writerow(header)    
    
    #loop over all days of 2020
    for i in range(delta.days + 1):
        day = sdate + timedelta(days=i)
        
        #loop over all 5 cities
        for city in city_list:
            
            # estimate the number of available bikes considering an average of 10 broken bikes
            # with a gaussian uncertainty of 5 bikes. Constant over the year
            avaliable_bikes = bikes_per_city - abs(random.gauss(10, 5))
            
            # estimate weather and temperature conditions and keep track of the season
            condition = condition_estimation(city)
            temperature = condition[0]
            season = condition[1]
            weather = condition[2]

            #estimation of rides per bike and per day adding temperature dependence
            rides_1bike_1day = temperature_dep(temperature)
            
            # bike rides per day, considering all available bikes
            rides_1day = rides_1bike_1day * avaliable_bikes

            # correct the number of bike rides per day by the bike occupation, which depends on the weather
            # bike_occupation follows the order of city_weather ([0]=clear, [1]=cloudy, ...)
            rides = rides_1day * bike_occupation[city_weather.index(weather[0])]
            
            # estimate the number of users: it should be lower than the number of bike rides
            # we assume that about 20% of the trips are repeated trips by the same users and smooth
            # with a gaussian distribution with 5% standard deviation
            users = rides - abs(random.gauss(rides*0.20, rides*0.05))
            
            # estimate the number of registered and occasional users
            registered_users = user_type(users,day)
            occasional_users = users - registered_users    
            
            # estimate the total ride time
            ride_time = ride_time_calc(rides, avaliable_bikes)  
            
            #estimate the price of all bike trips depending on user type
            ride_prices = round(price(ride_time[0], registered_users, occasional_users),1)
            
            # convert the user types to percentages: more useful for data visualization
            registered_users = registered_users/users*100
            occasional_users = occasional_users/users*100
            
            # customer satisfaction
            satisfaction = round(customer_satisfaction(day, ride_time[1], avaliable_bikes),2)
            
            #new subscriptions of the day (negative or positive)
            day_subscriptions =  new_subscription(city)
            

            #fill the file
            data =[day, day.month, city, int(avaliable_bikes), season, weather[0], temperature, int(rides_1bike_1day),int(rides), int(users), round(registered_users,1), round(occasional_users,1), ride_time[0], ride_prices, satisfaction, int(day_subscriptions)]
            # write the data
            writer.writerow(data)
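
Once the file is written, the indicators can be aggregated for visualization. The sketch below shows one possible way to sum the daily income per city with the standard `csv` module. To stay self-contained it reads a tiny hand-written stand-in for `bike_summary.csv` (the values are made up and only a few columns are included); with the real file, the `io.StringIO` object would be replaced by the opened file.

```python
import csv
import io
from collections import defaultdict

# tiny hand-written stand-in for bike_summary.csv (values are made up)
sample_csv = io.StringIO(
    "date,month,city,day income (euros)\n"
    "2020-01-01,1,Paris,100.0\n"
    "2020-01-02,1,Paris,150.0\n"
    "2020-01-01,1,Milan,80.0\n"
)

income_per_city = defaultdict(float)
# with the real file: open('bike_summary.csv', newline='', encoding='UTF8')
for row in csv.DictReader(sample_csv):
    # DictReader yields strings, so the income has to be converted to float
    income_per_city[row['city']] += float(row['day income (euros)'])

print(dict(income_per_city))  # {'Paris': 250.0, 'Milan': 80.0}
```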