<h1>CITYBIKE 2017

<h3>Import the data, add additional columns and start the analysis.

In [60]:
import pandas as pd
import numpy as np
import datetime

In [61]:
#import dataset
df2017 = pd.read_csv('wyp_2017.csv')

In [62]:
#check the size
df2017.shape

(102256, 6)

In [63]:
#check it out
df2017.head(5)

Unnamed: 0,bike_num,start_time,end_time,departure,return,duration_sec
0,58516,2017-05-01 00:26:45,2017-05-01 01:04:49,Al. Bolesława Krzywoustego,Al. Bolesława Krzywoustego,2284
1,58416,2017-05-01 00:26:28,2017-05-01 01:05:08,Al. Bolesława Krzywoustego,Al. Bolesława Krzywoustego,2320
2,58498,2017-05-01 01:23:07,2017-05-01 03:04:20,Al. Księżnej Jadwigi Śląskiej,Al. Księcia Henryka Pobożnego,6073
3,58389,2017-05-01 06:23:11,2017-05-01 08:25:08,Katowice Rynek,Katowice Rynek,7317
4,58520,2017-05-01 09:06:31,2017-05-01 09:07:04,Al. Księżnej Jadwigi Śląskiej,Al. Księżnej Jadwigi Śląskiej,33


In [64]:
#create additional TIME columns so it would be easier to manipulate the data
df2017['start_day'] = pd.DatetimeIndex(df2017['start_time']).day
df2017['start_month'] = pd.DatetimeIndex(df2017['start_time']).month
df2017['start_hour'] = pd.DatetimeIndex(df2017['start_time']).hour
df2017['start_minute'] = pd.DatetimeIndex(df2017['start_time']).minute
df2017['month_day'] = pd.DatetimeIndex(df2017['start_time']).date
df2017['duration_min'] = df2017['duration_sec']/60
df2017['duration_hour'] = (df2017['duration_min']/60).apply(int)
#which day of week was it? monday = 0, sunday = 6
df2017['which_day'] = pd.DatetimeIndex(df2017['start_time']).dayofweek
#which day of year was it?
df2017['daynumber'] = pd.DatetimeIndex(df2017['start_time']).dayofyear
#was it weekend?
df2017['is_weekend'] = df2017['which_day'].apply(lambda x: 1 if (x == 6 or x == 5 ) else 0 )
#was it a day off because of a public holiday or other event?
free=[ datetime.date(2017,1,1),
        datetime.date(2017,1,6),
        datetime.date(2017,4,16),
        datetime.date(2017,4,17),
        datetime.date(2017,5,1),
        datetime.date(2017,5,3),
        datetime.date(2017,6,4),
        datetime.date(2017,6,15),
        datetime.date(2017,8,15),
        datetime.date(2017,11,1),
        datetime.date(2017,11,11),
        datetime.date(2017,12,25),
        datetime.date(2017,12,26)
        ]
df2017['is_free'] = df2017['month_day'].apply(lambda x: 1 if (x in free) else 0 )
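The cell above parses `start_time` once per derived column. As a minimal sketch of the same idea, the timestamps can be parsed a single time and the calendar fields read through the `.dt` accessor. The helper name `add_time_columns` and the three-row sample are hypothetical and only illustrate the technique; they are not part of the original notebook.

```python
import datetime
import pandas as pd

def add_time_columns(df, holidays):
    """Return a copy of df with calendar columns derived from 'start_time'."""
    out = df.copy()
    start = pd.to_datetime(out['start_time'])   # parse the strings only once
    out['start_hour'] = start.dt.hour
    out['month_day'] = start.dt.date
    out['which_day'] = start.dt.dayofweek        # Monday = 0, Sunday = 6
    out['is_weekend'] = (out['which_day'] >= 5).astype(int)
    out['is_free'] = out['month_day'].isin(list(holidays)).astype(int)
    out['duration_min'] = out['duration_sec'] / 60
    out['duration_hour'] = (out['duration_sec'] // 3600).astype(int)  # full hours, truncated
    return out

# hypothetical sample rows, not taken from wyp_2017.csv
sample = pd.DataFrame({
    'start_time': ['2017-05-01 00:26:45', '2017-05-06 10:00:00', '2017-05-09 08:15:00'],
    'duration_sec': [2284, 7317, 33],
})
enriched = add_time_columns(sample, [datetime.date(2017, 5, 1)])
print(enriched[['which_day', 'is_weekend', 'is_free', 'duration_hour']])
```

Working on a copy keeps the input frame untouched, which makes the helper safe to re-run in a notebook.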

In [65]:
#create additional column for routes
df2017["route"] = df2017["departure"] + ' - ' + df2017["return"]

<h3>Basic analysis.

<h3>Remove unnecessary data.

In [66]:
#Before we start, we need to remove data that distorts the results. We assume that if a bike was returned
#to the same station after less than 3 minutes, it was probably damaged or the rider changed their mind.
short_same_station = df2017[(df2017['duration_min'] < 3) & (df2017['departure'] == df2017['return'])]
to_dump = short_same_station.index

print('Number of rentals that last < 3 minutes: ', short_same_station['bike_num'].count())
print('Percentage : {0:.2f} '.format(short_same_station['bike_num'].count()/df2017['duration_min'].count()))

Number of rentals that last < 3 minutes:  13325
Percentage : 0.13 


In [67]:
#for how long (in hours) are the bikes being rented?
df2017.groupby('duration_hour')['bike_num'].count().head(10)

duration_hour
0    85456
1    10752
2     3852
3     1356
4      455
5      134
6       65
7       36
8       35
9       16
Name: bike_num, dtype: int64

In [32]:
#rentals that last 4 hours or more (duration_hour is truncated to full hours, so > 3 means >= 4)
to_dump2 = df2017[df2017['duration_hour'] >3]['bike_num'].index
print('rentals that last more than 3 hours: ', df2017[df2017['duration_hour'] >3]['bike_num'].count())
print('Percentage:  {0:.2f} '.format(df2017[df2017['duration_hour'] >3]['bike_num'].count()/df2017['duration_hour'].count()))

rentals that last more than 3 hours:  0
Percentage:  0.00 


In [33]:
#let's remove the outliers
#rentals that last 4 hours or more (duration_hour > 3) are less than 1% of the dataset, so we remove them
#we also remove all rentals under 3 minutes that were returned to the departure station
df2017 = df2017.drop(to_dump.union(to_dump2))
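The two filters above can also be expressed as boolean masks in one reusable function. This is only a sketch under the notebook's stated thresholds (under 3 minutes at the same station, 4 hours or more); the function name `drop_outliers`, its parameters, and the sample rows are hypothetical.

```python
import pandas as pd

def drop_outliers(df, min_minutes=3, max_hours=4):
    """Drop very short same-station rentals and very long rentals."""
    # short rental returned to the station it was taken from
    too_short = (df['duration_min'] < min_minutes) & (df['departure'] == df['return'])
    # rental lasting max_hours full hours or more
    too_long = df['duration_min'] >= max_hours * 60
    return df[~(too_short | too_long)]

# hypothetical sample rows
sample = pd.DataFrame({
    'departure': ['A', 'A', 'B', 'C'],
    'return': ['A', 'B', 'B', 'C'],
    'duration_min': [0.55, 2.0, 121.95, 250.0],
})
cleaned = drop_outliers(sample)
print(cleaned)
```

Keeping the thresholds as parameters makes it easy to check how sensitive the later averages are to the cut-offs.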

In [34]:
#let's save what we have so far
df2017.to_csv('df2017_add.csv')

In [35]:
#check out what's left
df2017.shape

(88091, 18)

<h3>Get general info.

In [36]:
#For how many days and months were the bikes available for rental?
print('Bikes were available for {0} days.'.format(df2017['month_day'].nunique()))
print('Bikes were available for {0} months.'.format(df2017['start_month'].nunique()))

Bikes were available for 202 days.
Bikes were available for 8 months.


In [37]:
#how many rentals and bikes were there?
print('{0} bikes were available for rental.'.format(df2017['bike_num'].nunique()))
print('They were rented {0} times.'.format(df2017['bike_num'].count()))

284 bikes were available for rental.
They were rented 88091 times.


In [38]:
#The average time of a rental:
print('The average time of a rental is:')
df2017['duration_min'].mean()

The average time of a rental is:


34.62957339569241

In [39]:
#the median
print('The median rental time is: ')
df2017['duration_min'].median()

The median rental time is: 


14.683333333333334

In [40]:
#rentals under 15 minutes
print('The number of rentals under 15 minutes (they were free): ')
df2017[df2017['duration_min'] < 15]['bike_num'].count()

The number of rentals under 15 minutes (they were free): 


44673

In [41]:
#rentals under 15 minutes as a share of all rentals
print('Percentage of free rides: {0:.2f}'.format(df2017[df2017['duration_min'] < 15]['bike_num'].count()/(df2017['bike_num'].count())))

Percentage of free rides: 0.51
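The values labelled "Percentage" in this notebook are printed as fractions (0.51 means 51%). A small sketch of how the same ratio could be shown as a true percentage with Python's `%` format type; the helper name `share_as_percent` is hypothetical, and the counts are the ones printed above.

```python
def share_as_percent(part, total):
    """Format part/total as a percentage with one decimal place."""
    return '{0:.1%}'.format(part / total)

# rentals under 15 minutes vs. all rentals left after cleaning
print('Percentage of free rides:', share_as_percent(44673, 88091))  # 50.7%
```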


<h3>More detailed info.

In [42]:
#number of rentals per month
print('The number of rentals per month:')
print(df2017.groupby('start_month')['bike_num'].count())
print('Most rentals in: ', df2017.groupby('start_month')['bike_num'].count().sort_values(ascending=False).index[0])

The number of rentals per month:
start_month
4      3150
5     21255
6     20703
7     15809
8     14786
9      6540
10     5811
11       37
Name: bike_num, dtype: int64
Most rentals in:  5


In [43]:
#number of rentals per day of week (Monday = 0, Sunday = 6)
print('The number of rentals per day of week:')
print(df2017.groupby('which_day')['bike_num'].count())
print('Most rentals in: ', df2017.groupby('which_day')['bike_num'].count().sort_values(ascending=False).index[0])

The number of rentals per day of week:
which_day
0    13336
1    12761
2    13083
3    12846
4    11742
5    11346
6    12977
Name: bike_num, dtype: int64
Most rentals in:  0


In [44]:
#average time per week day
print('average time per week day:')
print(df2017.groupby('which_day')['duration_min'].mean())

average time per week day:
which_day
0    30.995417
1    29.794407
2    30.598450
3    30.478074
4    30.009333
5    41.192883
6    49.734727
Name: duration_min, dtype: float64


In [46]:
# median time per week day
print('median time per week day:')
print(df2017.groupby('which_day')['duration_min'].median())

median time per week day:
which_day
0    13.300000
1    12.766667
2    12.983333
3    13.266667
4    13.083333
5    21.150000
6    30.333333
Name: duration_min, dtype: float64


In [47]:
#daily top score
print('Days with most rentals:')
df2017.groupby('month_day')['bike_num'].count().sort_values(ascending=False).head(5)

Days with most rentals:


month_day
2017-05-18    1124
2017-05-17    1075
2017-05-19    1015
2017-05-28     984
2017-06-01     968
Name: bike_num, dtype: int64

In [48]:
#how many of the listed 2017 holidays had at least one rental?
df2017[df2017['is_free'] == 1].groupby('month_day').count().index.nunique()

8

In [49]:
#how many rentals during holidays?
df2017[df2017['is_free'] == 1]['bike_num'].count()

3601

In [50]:
#rentals during holidays as a share of all rentals
print('Percentage :  {0:.2f}'.format(df2017[df2017['is_free'] == 1]['bike_num'].count()/(df2017['bike_num'].count())))

Percentage :  0.04


In [51]:
#daily top score during holidays
print('daily top score during holidays: ')
df2017[df2017['is_free'] == 1].groupby('month_day')['bike_num'].count().sort_values(ascending=False).head(5)

daily top score during holidays: 


month_day
2017-06-15    802
2017-05-01    701
2017-08-15    663
2017-05-03    576
2017-06-04    535
Name: bike_num, dtype: int64

In [52]:
# average number of rentals during working days vs weekends vs holidays
print('average number of rentals during working days: ', df2017[df2017['is_weekend'] == 0]['bike_num'].count() / df2017[df2017['which_day'].isin(range(0,5))]['month_day'].nunique())
print('average number of rentals during weekends: ', df2017[df2017['is_weekend'] == 1]['bike_num'].count() / df2017[df2017['which_day'].isin([5,6])]['month_day'].nunique())
print('average number of rentals during holidays: ', df2017[df2017['is_free'] == 1]['bike_num'].count() / df2017[df2017['is_free'] == 1].groupby('month_day').count().index.nunique())

average number of rentals during working days:  442.8333333333333
average number of rentals during weekends:  419.36206896551727
average number of rentals during holidays:  450.125
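In the cell above the three groups overlap: a holiday that falls on a weekday is counted both as a working day and as a holiday. One possible way to compare them on disjoint groups is to give every rental exactly one label and average the daily counts per label. This is a sketch only; the `day_type` column, the function name, and the sample rows are hypothetical.

```python
import pandas as pd

def average_rentals_per_day_type(df):
    """Average number of rentals per calendar day, by exclusive day type."""
    labeled = df.assign(day_type='working')
    labeled.loc[labeled['is_weekend'] == 1, 'day_type'] = 'weekend'
    labeled.loc[labeled['is_free'] == 1, 'day_type'] = 'holiday'  # holiday takes precedence
    per_day = labeled.groupby(['day_type', 'month_day']).size()   # rentals on each day
    return per_day.groupby(level='day_type').mean()

# hypothetical sample: one holiday, two working days, one weekend day
sample = pd.DataFrame({
    'month_day': ['2017-05-01'] * 2 + ['2017-05-02'] * 3 + ['2017-05-04'] + ['2017-05-06'] * 4,
    'is_weekend': [0] * 6 + [1] * 4,
    'is_free': [1] * 2 + [0] * 8,
})
averages = average_rentals_per_day_type(sample)
print(averages)
```

Letting the holiday label override the others is a design choice; the opposite precedence would be just as easy to implement by swapping the two `.loc` assignments.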


In [53]:
# average time (in minutes) of rentals during working days vs weekends vs holidays
print('average time of rentals during working days: ', df2017[df2017['is_weekend'] == 0]['duration_min'].mean())
print('average time of rentals during weekends: ', df2017[df2017['is_weekend'] == 1]['duration_min'].mean())
print('average time of rentals during holidays: ', df2017[df2017['is_free'] == 1]['duration_min'].mean())

average time of rentals during working days:  30.387839511981237
average time of rentals during weekends:  45.7501952884101
average time of rentals during holidays:  57.709756549106785


In [54]:
#number of working days during which you could rent a bike
lr = df2017[df2017['which_day'].isin(range(0,5))]['month_day'].nunique()
#number of weekend days during which you could rent a bike
lw = df2017[df2017['which_day'].isin([5,6])]['month_day'].nunique()
#number of holidays during which you could rent a bike
sw = df2017[df2017['is_free'] == 1].groupby('month_day').count().index.nunique()

#sum of rentals for hour interval during working days
a = df2017[df2017['is_weekend'] == False].groupby('start_hour')['bike_num'].count()
#sum of rentals for hour interval during weekends
b = df2017[df2017['is_weekend'] == True].groupby('start_hour')['bike_num'].count()
#sum of rentals for hour interval during holidays
s = df2017[df2017['is_free'] == True].groupby('start_hour')['bike_num'].count()

#average number of rentals per hour interval during working days
c = a / lr
#average number of rentals per hour interval during weekends
d = b / lw
#average number of rentals per hour interval during holidays
t = s / sw

#average time of rentals per hour interval during working days
e = df2017[df2017['is_weekend'] == False].groupby('start_hour')['duration_min'].sum() / a
#average time of rentals per hour interval during weekends
f = df2017[df2017['is_weekend'] == True].groupby('start_hour')['duration_min'].sum() / b
#average time of rentals per hour interval during holidays
u = df2017[df2017['is_free'] == True].groupby('start_hour')['duration_min'].sum() / s

hour_data = {'hour' : a.index, 'sum_working' : a.values, 'sum_weekends' : b.values, 'sum_holidays' : s.values,
             'avg_working' : c.values, 'avg_weekends' : d.values, 'avg_holidays' : t.values,
             'avg_time_working' : e.values, 'avg_time_weekends' : f.values, 'avg_time_hoidays' : u.values}

hourdf = pd.DataFrame(data=hour_data)
hourdf

Unnamed: 0,hour,sum_working,sum_weekends,sum_holidays,avg_working,avg_weekends,avg_holidays,avg_time_working,avg_time_weekends,avg_time_hoidays
0,0,439,441,36,3.048611,7.603448,4.5,21.327828,26.00941,36.8
1,1,231,356,17,1.604167,6.137931,2.125,19.635209,23.736657,24.438235
2,2,185,173,8,1.284722,2.982759,1.0,18.837027,22.824566,16.954167
3,3,70,158,8,0.486111,2.724138,1.0,21.392143,21.067722,6.945833
4,4,86,93,2,0.597222,1.603448,0.25,12.906783,19.471326,9.933333
5,5,200,82,10,1.388889,1.413793,1.25,16.494417,18.036992,9.951667
6,6,1191,101,15,8.270833,1.741379,1.875,13.708578,13.629043,23.432222
7,7,2993,148,13,20.784722,2.551724,1.625,12.522012,24.931081,56.962821
8,8,2368,332,42,16.444444,5.724138,5.25,16.033875,38.643223,52.030159
9,9,1973,522,74,13.701389,9.0,9.25,18.675427,40.05514,52.138514
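The table above is assembled from `.values` arrays, which only works if every group has rentals in every hour. As a hedged alternative, building the frame from the Series themselves lets pandas align them on the hour index and fill missing hours with zero. The function name `hourly_counts` and the sample rows are hypothetical; the column names follow the notebook.

```python
import pandas as pd

def hourly_counts(df):
    """Rentals per start hour for working days, weekends and holidays."""
    groups = {
        'sum_working': df[df['is_weekend'] == 0],
        'sum_weekends': df[df['is_weekend'] == 1],
        'sum_holidays': df[df['is_free'] == 1],
    }
    counts = {name: part.groupby('start_hour').size() for name, part in groups.items()}
    # Series are aligned on start_hour; hours missing in a group become 0
    table = pd.DataFrame(counts).fillna(0).astype(int).sort_index()
    table.index.name = 'hour'
    return table

# hypothetical sample rows
sample = pd.DataFrame({
    'start_hour': [7, 7, 8, 10, 10, 7],
    'is_weekend': [0, 0, 0, 1, 1, 0],
    'is_free': [0, 0, 0, 0, 0, 1],
})
table = hourly_counts(sample)
print(table)
```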


In [55]:
#most popular departure station
print('most popular departure stations: ')
df2017.groupby('departure')['bike_num'].count().sort_values(ascending=False).head()

most popular departure stations: 


departure
Katowice Rynek                   15262
Silesia City Center               4774
KTBS – Krasińskiego 14            4680
Murapol Mariacka                  4412
Al. Księcia Henryka Pobożnego     4401
Name: bike_num, dtype: int64

In [56]:
#most popular return station
print('most popular return stations: ')
df2017.groupby('return')['bike_num'].count().sort_values(ascending=False).head()

most popular return stations: 


return
Katowice Rynek                   15598
KTBS – Krasińskiego 14            4712
Al. Bolesława Krzywoustego        4669
Al. Księcia Henryka Pobożnego     4645
Murapol Mariacka                  4423
Name: bike_num, dtype: int64

In [57]:
#most popular routes
print('most popular routes: ')
df2017.groupby('route')['bike_num'].count().sort_values(ascending=False).head()

most popular routes: 


route
Katowice Rynek - Katowice Rynek                                  2570
Dolina 3-ch Stawów - Dolina 3-ch Stawów                          2509
Al. Księcia Henryka Pobożnego - Al. Księcia Henryka Pobożnego    2430
KTBS – Krasińskiego 14 - Katowice Rynek                          2033
Katowice Rynek - KTBS – Krasińskiego 14                          1965
Name: bike_num, dtype: int64

In [58]:
#most popular routes where departure != return
print('most popular routes where departure != return: ')
df2017[df2017['departure'] != df2017['return']].groupby('route')['bike_num'].count().sort_values(ascending=False).head()

most popular routes where departure != return: 


route
KTBS – Krasińskiego 14 - Katowice Rynek                2033
Katowice Rynek - KTBS – Krasińskiego 14                1965
Silesia City Center - Katowice Rynek                   1091
Ul. Powstańców - Biblioteka Śląska - Katowice Rynek    1084
Katowice Rynek - Ul. Powstańców - Biblioteka Śląska    1034
Name: bike_num, dtype: int64

In [59]:
#most rented bike
print('ID of most rented bike and number of its rentals: ')
df2017.groupby('bike_num')['duration_sec'].count().sort_values(ascending=False).head(1)

ID of most rented bike and number of its rentals: 


bike_num
58521    591
Name: duration_sec, dtype: int64