<h1>CITYBIKE 2018

<h3>Import the data, add additional columns and start the analysis.

In [1]:
import pandas as pd
import numpy as np
import datetime

In [2]:
#import dataset
df2018 = pd.read_csv('wyp_2018.csv')

In [None]:
#check the size
df2018.shape

In [1253]:
#check it out
df2018.head(5)

Unnamed: 0,bike_num,start_time,end_time,departure,return,duration_sec
0,58388,2018-04-01 00:16:10,2018-04-01 00:20:17,Murapol Mariacka,KTBS – Krasińskiego 14,247
1,58745,2018-04-01 00:10:28,2018-04-01 00:20:28,Murapol Mariacka,KTBS – Krasińskiego 14,600
2,58547,2018-04-01 10:34:09,2018-04-01 10:41:38,KTBS – Krasińskiego 14,Katowice Rynek,449
3,58786,2018-04-01 11:31:14,2018-04-01 11:45:20,Bogucice Szpital,KTBS – Saint Etienne 1,846
4,58884,2018-04-01 11:29:51,2018-04-01 11:45:39,Bogucice Szpital,KTBS – Saint Etienne 1,948


In [3]:
#create additional TIME columns so it would be easier to manipulate the data
df2018['start_day'] = pd.DatetimeIndex(df2018['start_time']).day
df2018['start_month'] = pd.DatetimeIndex(df2018['start_time']).month
df2018['start_hour'] = pd.DatetimeIndex(df2018['start_time']).hour
df2018['start_minute'] = pd.DatetimeIndex(df2018['start_time']).minute
df2018['month_day'] = pd.DatetimeIndex(df2018['start_time']).date
df2018['duration_min'] = df2018['duration_sec']/60
df2018['duration_hour'] = (df2018['duration_min']/60).apply(int)
#which day of week was it? monday = 0, sunday = 6
df2018['which_day'] = pd.DatetimeIndex(df2018['start_time']).dayofweek
#which day of year was it?
df2018['daynumber'] = pd.DatetimeIndex(df2018['start_time']).dayofyear
#was it a weekend day?
df2018['is_weekend'] = df2018['which_day'].apply(lambda x: 1 if (x == 6 or x == 5 ) else 0 )
#was it a day off because of a public holiday or other event?
free=[ datetime.date(2018,1,1),
        datetime.date(2018,1,6),
        datetime.date(2018,4,1),
        datetime.date(2018,4,2),
        datetime.date(2018,5,1),
        datetime.date(2018,5,3),
        datetime.date(2018,5,20),
        datetime.date(2018,5,31),
        datetime.date(2018,8,15),
        datetime.date(2018,11,1),
        datetime.date(2018,11,11),
        datetime.date(2018,11,12),
        datetime.date(2018,12,25),
        datetime.date(2018,12,26)
        ]
df2018['is_free'] = df2018['month_day'].apply(lambda x: 1 if (x in free) else 0 )
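
The cell above rebuilds a `DatetimeIndex` from the `start_time` strings for every new column. A minimal sketch of an alternative, assuming the same column names, parses the timestamps once with `pd.to_datetime` and reads the parts through the `.dt` accessor. The two sample rows and the shortened `free` list are made up for illustration and are not taken from the dataset.

```python
import datetime
import pandas as pd

# Tiny made-up sample with the same column names as df2018 (illustration only)
sample = pd.DataFrame({
    'start_time': ['2018-04-01 00:16:10', '2018-04-03 11:31:14'],
    'duration_sec': [247, 7300],
})

# Parse the timestamps once and reuse the result for every derived column
start = pd.to_datetime(sample['start_time'])
sample['start_month'] = start.dt.month
sample['start_hour'] = start.dt.hour
sample['month_day'] = start.dt.date
sample['which_day'] = start.dt.dayofweek                    # monday = 0, sunday = 6
sample['is_weekend'] = (sample['which_day'] >= 5).astype(int)
sample['duration_hour'] = sample['duration_sec'] // 3600    # whole hours, rounded down

# Shortened holiday list (assumption, for illustration only)
free = [datetime.date(2018, 4, 1), datetime.date(2018, 4, 2)]
sample['is_free'] = sample['month_day'].isin(free).astype(int)

print(sample[['which_day', 'is_weekend', 'duration_hour', 'is_free']])
```

Vectorised comparisons such as `>= 5` and `isin` replace the row-by-row `apply(lambda ...)` calls, which tends to matter on a dataset with a few hundred thousand rows.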

In [6]:
#create additional column for routes
df2018["route"] = df2018["departure"] + ' - ' + df2018["return"]

<h3>Basic analysis.

<h3>Remove unnecessary data.

In [10]:
#Before we start, we need to remove rentals that would distort the results. We assume that if a bike was
#returned to the same station in under 3 minutes, it was probably damaged or the rider changed their mind
to_dump = df2018[(df2018['duration_min'] < 3) & (df2018['departure'] == df2018['return'])].index

print('Number of rentals that last < 3 minutes: ', df2018[(df2018['duration_min'] < 3) & (df2018['departure'] == df2018['return'])]['bike_num'].count())
print('Percentage : {0:.2f} '.format(df2018[(df2018['duration_min'] < 3) & (df2018['departure'] == df2018['return'])]['bike_num'].count()/df2018['duration_min'].count()))

Number of rentals that last < 3 minutes:  20363
Percentage : 0.11 
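
The same filter expression is evaluated three times in the cell above. A sketch of the same idea on made-up rows builds the boolean mask once: its `sum()` is the number of matching rentals and its `mean()` is their share of all rows. The station names `A`, `B`, `C` and the variable names are hypothetical.

```python
import pandas as pd

# Made-up rows with the same column names as df2018 (illustration only)
sample = pd.DataFrame({
    'departure':    ['A', 'A', 'B', 'C'],
    'return':       ['A', 'B', 'B', 'C'],
    'duration_min': [1.5, 2.0, 10.0, 2.9],
})

# Short rental AND returned to the station it was taken from
short_same_station = (sample['duration_min'] < 3) & (sample['departure'] == sample['return'])

n_short = int(short_same_station.sum())          # True counts as 1
share_short = float(short_same_station.mean())   # fraction of all rows
to_dump = sample[short_same_station].index       # row labels to drop later

print('Number of short same-station rentals:', n_short)
print('Share of all rentals: {0:.2f}'.format(share_short))
```

Note that `mean()` divides by the number of rows, while the cell above divides by the count of non-null `duration_min` values; the two agree only when that column has no missing values.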


In [11]:
#for how long (in hours) are the bikes being rented?
df2018.groupby('duration_hour')['bike_num'].count().head(10)

duration_hour
0    165892
1     17859
2      6002
3      2196
4       741
5       254
6       117
7        59
8        32
9        29
Name: bike_num, dtype: int64

In [12]:
#rentals that last 4 hours or more (duration_hour is truncated to whole hours, so > 3 means >= 4)
to_dump2 = df2018[df2018['duration_hour'] >3]['bike_num'].index
print('rentals that last more than 3 hours: ', df2018[df2018['duration_hour'] >3]['bike_num'].count())
print('Percentage:  {0:.2f} '.format(df2018[df2018['duration_hour'] >3]['bike_num'].count()/df2018['duration_hour'].count()))

rentals that last more than 3 hours:  1379
Percentage:  0.01 
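
Because `duration_hour` is truncated to whole hours, the condition `duration_hour > 3` only matches rentals of 4 hours or longer. A small sketch with made-up durations shows how that differs from a cut-off applied directly to minutes; which of the two is intended is a choice for the analysis, not something this sketch settles.

```python
import pandas as pd

# Made-up durations in minutes (illustration only): 2h59, 3h00, 3h30, 4h00, 5h10
duration_min = pd.Series([179.0, 180.0, 210.0, 240.0, 310.0])

# Same truncation as the duration_hour column in the notebook
duration_hour = (duration_min / 60).apply(int)

by_whole_hours = duration_hour > 3   # the condition used in the cell above
by_minutes = duration_min > 180      # strictly longer than 3 hours

print(by_whole_hours.tolist())
print(by_minutes.tolist())
```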


In [13]:
#let's remove the outliers
#rentals with duration_hour > 3 (4 hours or more) are less than 1% of the dataset, so we will remove them
#we will also remove rentals under 3 minutes that were returned to the departure station
df2018 = df2018.drop(to_dump.union(to_dump2))

In [1260]:
#let's save what we have so far
df2018.to_csv('wyp2018_add.csv')

In [1214]:
#check out what's left
df2018.shape

(171586, 17)

<h3>Get general info.

In [14]:
#For how many days and months were the bikes available for rental?
print('Bikes were available for {0} days.'.format(df2018['month_day'].nunique()))
print('Bikes were available for {0} months.'.format(df2018['start_month'].nunique()))

Bikes were available for 260 days.
Bikes were available for 9 months.


In [16]:
#how many rentals and bikes were there?
print('{0} bikes were available for rental.'.format(df2018['bike_num'].nunique()))
print('They were rented {0} times.'.format(df2018['bike_num'].count()))

420 bikes were available for rental.
They were rented 171586 times.


In [17]:
#The average time of a rental:
print('The average time of a rental is:')
df2018['duration_min'].mean()

The average time of a rental is:


30.931685180220065

In [18]:
#the median
print('The median rental time is: ')
df2018['duration_min'].median()

The median rental time is: 


13.266666666666667

In [20]:
#rentals under 15 minutes
print('The number of rentals under 15 minutes (they were free): ')
df2018[df2018['duration_min'] < 15]['bike_num'].count()

The number of rentals under 15 minutes (they were free): 


94009

In [21]:
#rentals under 15 minutes as a share of all rentals
print('Percentage of free rides: {0:.2f}'.format(df2018[df2018['duration_min'] < 15]['bike_num'].count()/(df2018['bike_num'].count())))

Percentage of free rides: 0.55


<h3>More detailed info.

In [22]:
#number of rentals per month
print('The number of rentals per month:')
print(df2018.groupby('start_month')['bike_num'].count())
print('Most rentals in: ', df2018.groupby('start_month')['bike_num'].count().sort_values(ascending=False).index[0])

The number of rentals per month:
start_month
4     24514
5     26042
6     28372
7     26670
8     29263
9     17238
10    12101
11     6224
12     1162
Name: bike_num, dtype: int64
Most rentals in:  8


In [23]:
#number of rentals per day of the week (monday = 0, sunday = 6)
print('The number of rentals per day of the week:')
print(df2018.groupby('which_day')['bike_num'].count())
print('Most rentals in: ', df2018.groupby('which_day')['bike_num'].count().sort_values(ascending=False).index[0])

The number of rentals per day of the week:
which_day
0    24741
1    24417
2    25591
3    24437
4    23956
5    23487
6    24957
Name: bike_num, dtype: int64
Most rentals in:  2


In [24]:
#average time per week day
print('average time per week day:')
print(df2018.groupby('which_day')['duration_min'].mean())

average time per week day:
which_day
0    26.027197
1    25.450449
2    26.362291
3    27.659074
4    25.495636
5    38.667184
6    46.984407
Name: duration_min, dtype: float64


In [25]:
# median time per week day
print('median time per week day:')
print(df2018.groupby('which_day')['duration_min'].median())

median time per week day:
which_day
0    11.950000
1    11.750000
2    11.850000
3    12.250000
4    11.916667
5    18.016667
6    25.850000
Name: duration_min, dtype: float64
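
The count, mean and median per weekday come from three separate cells. A sketch of a single-pass version, on made-up rows, uses `agg` with several functions and relabels the day numbers for readability. The `day_names` mapping is an illustrative helper, not part of the notebook.

```python
import pandas as pd

# Made-up rows with the same column names as df2018 (illustration only)
sample = pd.DataFrame({
    'which_day':    [0, 0, 5, 5, 5],
    'duration_min': [10.0, 20.0, 10.0, 30.0, 80.0],
})

# Count, mean and median of the rental time for each day of the week
summary = sample.groupby('which_day')['duration_min'].agg(['count', 'mean', 'median'])

# Hypothetical helper: readable labels for monday = 0 ... sunday = 6
day_names = {0: 'Mon', 1: 'Tue', 2: 'Wed', 3: 'Thu', 4: 'Fri', 5: 'Sat', 6: 'Sun'}
summary = summary.rename(index=day_names)

print(summary)
```

Seeing the mean next to the median makes the skew visible at a glance: a few long rentals pull the mean well above the median, as in the weekend rows of the real data.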


In [26]:
#daily top score
print('Days with most rentals:')
df2018.groupby('month_day')['bike_num'].count().sort_values(ascending=False).head(5)

Days with most rentals:


month_day
2018-06-20    1322
2018-08-15    1312
2018-06-06    1304
2018-06-07    1255
2018-05-31    1247
Name: bike_num, dtype: int64

In [27]:
#how many of the listed holidays had at least one rental?
df2018[df2018['is_free'] == 1].groupby('month_day').count().index.nunique()

10

In [30]:
#how many rentals during holidays?
df2018[df2018['is_free'] == 1]['bike_num'].count()

6605

In [29]:
#rentals during holidays as a share of all rentals
print('Percentage :  {0:.2f}'.format(df2018[df2018['is_free'] == 1]['bike_num'].count()/(df2018['bike_num'].count())))

Percentage :  0.04


In [31]:
#daily top score during holidays
print('daily top score during holidays: ')
df2018[df2018['is_free'] == 1].groupby('month_day')['bike_num'].count().sort_values(ascending=False).head(5)

daily top score during holidays: 


month_day
2018-08-15    1312
2018-05-31    1247
2018-05-01    1173
2018-05-20     927
2018-05-03     841
Name: bike_num, dtype: int64

In [32]:
# average number of rentals during working days vs weekends vs holidays
print('average number of rentals during working days: ', df2018[df2018['is_weekend'] == 0]['bike_num'].count() / df2018[df2018['which_day'].isin(range(0,5))]['month_day'].nunique())
print('average number of rentals during weekends: ', df2018[df2018['is_weekend'] == 1]['bike_num'].count() / df2018[df2018['which_day'].isin([5,6])]['month_day'].nunique())
print('average number of rentals during holidays: ', df2018[df2018['is_free'] == 1]['bike_num'].count() / df2018[df2018['is_free'] == 1].groupby('month_day').count().index.nunique())

average number of rentals during working days:  665.6324324324324
average number of rentals during weekends:  645.92
average number of rentals during holidays:  660.5
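
The averages above divide a rental count by a count of distinct dates. A sketch of an equivalent route, on made-up rows, first counts rentals per date and then averages those daily counts within each group. As in the cell above, a holiday that falls on a weekday would still be counted among the working days here.

```python
import datetime
import pandas as pd

# Made-up rentals (illustration only): 3 on a Tuesday, 1 on a Wednesday, 4 on a Saturday
sample = pd.DataFrame({
    'month_day': [datetime.date(2018, 4, 3)] * 3
               + [datetime.date(2018, 4, 4)] * 1
               + [datetime.date(2018, 4, 7)] * 4,
})
sample['is_weekend'] = sample['month_day'].apply(lambda d: 1 if d.weekday() >= 5 else 0)

# One row per date: number of rentals and whether the date is a weekend day
daily = sample.groupby('month_day').agg(rentals=('is_weekend', 'count'),
                                        is_weekend=('is_weekend', 'first'))

# Average number of rentals per day, for working days (0) and weekend days (1)
avg_per_day = daily.groupby('is_weekend')['rentals'].mean()

print(avg_per_day)
```

Keeping the intermediate `daily` table also makes it easy to look at the spread of daily counts, not only their mean.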


In [33]:
# average time (in minutes) of rentals during working days vs weekends vs holidays
print('average time of rentals during working days: ', df2018[df2018['is_weekend'] == 0]['duration_min'].mean())
print('average time of rentals during weekends: ', df2018[df2018['is_weekend'] == 1]['duration_min'].mean())
print('average time of rentals during holidays: ', df2018[df2018['is_free'] == 1]['duration_min'].mean())

average time of rentals during working days:  26.20290518263462
average time of rentals during weekends:  42.9519854539948
average time of rentals during holidays:  50.58347968710567


In [34]:
#number of working days during which you could rent a bike
lr = df2018[df2018['which_day'].isin(range(0,5))]['month_day'].nunique()
#number of weekend days during which you could rent a bike
lw = df2018[df2018['which_day'].isin([5,6])]['month_day'].nunique()
#number of holidays during which you could rent a bike
sw = df2018[df2018['is_free'] == 1].groupby('month_day').count().index.nunique()

#sum of rentals for hour interval during working days
a = df2018[df2018['is_weekend'] == False].groupby('start_hour')['bike_num'].count()
#sum of rentals for hour interval during weekends
b = df2018[df2018['is_weekend'] == True].groupby('start_hour')['bike_num'].count()
#sum of rentals for hour interval during holidays
s = df2018[df2018['is_free'] == True].groupby('start_hour')['bike_num'].count()

#average number of rentals per hour interval on working days
c = df2018[df2018['is_weekend'] == False].groupby('start_hour')['bike_num'].count() / lr
#average number of rentals per hour interval on weekends
d = df2018[df2018['is_weekend'] == True].groupby('start_hour')['bike_num'].count() / lw
#average number of rentals per hour interval on holidays
t = df2018[df2018['is_free'] == True].groupby('start_hour')['bike_num'].count() / sw

#average rental time (in minutes) per hour interval on working days
e = df2018[df2018['is_weekend'] == False].groupby('start_hour')['duration_min'].sum() / df2018[df2018['is_weekend'] == False].groupby('start_hour')['bike_num'].count()
#average rental time (in minutes) per hour interval on weekends
f = df2018[df2018['is_weekend'] == True].groupby('start_hour')['duration_min'].sum() / df2018[df2018['is_weekend'] == True].groupby('start_hour')['bike_num'].count()
#average rental time (in minutes) per hour interval on holidays
u = df2018[df2018['is_free'] == True].groupby('start_hour')['duration_min'].sum() / df2018[df2018['is_free'] == True].groupby('start_hour')['bike_num'].count()

hour_data = {'hour' : a.index, 'sum_working' : a.values, 'sum_weekends' : b.values, 'sum_holidays' : s.values,
             'avg_working' : c.values, 'avg_weekends' : d.values, 'avg_holidays' : t.values,
             'avg_time_working' : e.values, 'avg_time_weekends' : f.values, 'avg_time_holidays' : u.values}

hourdf = pd.DataFrame(data=hour_data)
hourdf

Unnamed: 0,hour,sum_working,sum_weekends,sum_holidays,avg_working,avg_weekends,avg_holidays,avg_time_working,avg_time_weekends,avg_time_holidays
0,0,1039,978,113,5.616216,13.04,11.3,19.36864,22.637986,17.70472
1,1,621,712,72,3.356757,9.493333,7.2,24.315862,18.002528,19.511806
2,2,355,499,36,1.918919,6.653333,3.6,18.623333,19.28016,18.177778
3,3,314,354,34,1.697297,4.72,3.4,15.067091,18.786158,15.368137
4,4,237,183,10,1.281081,2.44,1.0,16.058087,16.296266,11.611667
5,5,659,182,17,3.562162,2.426667,1.7,11.231108,16.230861,35.865686
6,6,2965,203,21,16.027027,2.706667,2.1,11.447937,24.104762,16.554762
7,7,7121,348,31,38.491892,4.64,3.1,12.745298,23.676102,20.901075
8,8,4808,561,70,25.989189,7.48,7.0,14.951764,25.0388,27.954762
9,9,3919,1128,143,21.183784,15.04,14.3,19.728608,36.715647,50.984965


In [35]:
#most popular departure station
print('most popular departure stations: ')
df2018.groupby('departure')['bike_num'].count().sort_values(ascending=False).head()

most popular departure stations: 


departure
Katowice Rynek                27493
Silesia City Center            8750
KTBS – Krasińskiego 14         8381
Murapol Mariacka               7775
Al. Bolesława Krzywoustego     6526
Name: bike_num, dtype: int64

In [36]:
#most popular return station
print('most popular return stations: ')
df2018.groupby('return')['bike_num'].count().sort_values(ascending=False).head()

most popular return stations: 


return
Katowice Rynek                29084
KTBS – Krasińskiego 14         8236
Murapol Mariacka               8083
Silesia City Center            8082
Al. Bolesława Krzywoustego     7227
Name: bike_num, dtype: int64

In [38]:
#most popular routes
print('most popular routes: ')
df2018.groupby('route')['bike_num'].count().sort_values(ascending=False).head()

most popular routes: 


route
Katowice Rynek - Katowice Rynek                                  3807
KTBS – Krasińskiego 14 - Katowice Rynek                          3418
Dolina 3-ch Stawów - Dolina 3-ch Stawów                          3253
Katowice Rynek - KTBS – Krasińskiego 14                          3207
Al. Księcia Henryka Pobożnego - Al. Księcia Henryka Pobożnego    2806
Name: bike_num, dtype: int64

In [39]:
#most popular routes where departure != return
print('most popular routes where departure != return: ')
df2018[df2018['departure'] != df2018['return']].groupby('route')['bike_num'].count().sort_values(ascending=False).head()

most popular routes where departure != return: 


route
KTBS – Krasińskiego 14 - Katowice Rynek    3418
Katowice Rynek - KTBS – Krasińskiego 14    3207
Silesia City Center - Katowice Rynek       2289
Katowice Rynek - Murapol Mariacka          1995
Katowice Rynek - Silesia City Center       1797
Name: bike_num, dtype: int64

In [40]:
#most rented bike
print('ID of most rented bike and number of its rentals: ')
df2018.groupby('bike_num')['duration_sec'].count().sort_values(ascending=False).head(1)

ID of most rented bike and number of its rentals: 


bike_num
58463    772
Name: duration_sec, dtype: int64