In [55]:
import pandas as pd
import numpy as np
import math

#### In this exercise session, you will continue with the bike dataset. Please load it first.

In [56]:
bikes = pd.read_csv("../data/bikesharing/data.csv")

#### Convert the "timestamp" column to an actual timestamp. With the first timestamp in the list, find 2 new members (attributes) and 2 new member functions (methods). Hint: after typing the dot ".", press "Tab" and explore the list.

In [57]:
bikes["timestamp"] = pd.to_datetime(bikes["timestamp"], format="%Y-%m-%d %H:%M:%S")   
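The second half of the question (members and member functions) is not shown in the cell above. A minimal sketch of what such an exploration could look like, using a hypothetical stand-in value instead of the first timestamp of the dataset (the real value may differ):

```python
import pandas as pd

# Hypothetical stand-in for bikes["timestamp"][0]; the real first value may differ.
ts = pd.Timestamp("2015-01-04 00:00:00")

# Members (attributes) are accessed without parentheses
year = ts.year
weekday_number = ts.dayofweek  # Monday=0 ... Sunday=6

# Member functions (methods) are called with parentheses
weekday_name = ts.day_name()
iso_string = ts.isoformat()

print(year, weekday_number, weekday_name, iso_string)
```

Attributes return a stored property of the timestamp, while methods compute something and may accept arguments.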

#### How many full weeks does this dataset cover? (a full week meaning from Monday to Sunday)

In [58]:
max_dt = bikes["timestamp"].max()
min_dt = bikes["timestamp"].min()
days = (max_dt - min_dt).days
# days = (bikes["timestamp"].max() - bikes["timestamp"].min()).days  # or in one line
full_weeks = math.floor(days / 7)  # floor rounds down to the nearest integer; note that this counts complete 7-day spans, which are not necessarily aligned Monday to Sunday
full_weeks

104
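The solution above counts 7-day spans between the first and last timestamp. If the question is read strictly (weeks running from Monday to Sunday), one possible approach is to move the start forward to the first Monday and the end back to the last Sunday before dividing. The sketch below is an assumption about how this could be done, not part of the original solution; the helper name `count_full_weeks` is hypothetical, and it treats any day with at least one record as fully covered.

```python
import pandas as pd

def count_full_weeks(first, last):
    """Count Monday-to-Sunday weeks lying entirely between two dates (hypothetical helper)."""
    first_day = pd.Timestamp(first).normalize()
    last_day = pd.Timestamp(last).normalize()
    # Move forward to the first Monday on or after the start (weekday: Monday=0)
    first_monday = first_day + pd.Timedelta(days=(7 - first_day.weekday()) % 7)
    # Move back to the last Sunday on or before the end (weekday: Sunday=6)
    last_sunday = last_day - pd.Timedelta(days=(last_day.weekday() + 1) % 7)
    covered_days = (last_sunday - first_monday).days + 1
    return max(covered_days // 7, 0)

# Tuesday to Saturday of the following week: 11 days apart, but no Monday-to-Sunday week fits
weeks_unaligned = count_full_weeks("2015-01-06", "2015-01-17")
# Monday to Sunday two weeks later: exactly two full weeks
weeks_aligned = count_full_weeks("2015-01-05", "2015-01-18")
print(weeks_unaligned, weeks_aligned)
```

With the bike data, the call would be `count_full_weeks(min_dt, max_dt)`; the result may or may not match the 7-day-span count, depending on which weekdays the dataset starts and ends on.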

#### What was the maximum temperature (column: t1) recorded on a Friday night (between 10p.m. and 6a.m.)?

In [59]:
bikes["weekday"] = bikes["timestamp"].dt.weekday   # https://pandas.pydata.org/docs/reference/api/pandas.Series.dt.weekday.html
bikes["hour"] = bikes["timestamp"].dt.hour

t1max_fri = bikes[(bikes["weekday"] == 4) & (bikes["hour"] >= 22)]["t1"].max()  # maximum value for Friday between 10pm and midnight
t1max_sat = bikes[(bikes["weekday"] == 5) & (bikes["hour"] < 6)]["t1"].max()  # maximum value for Saturday between midnight and 6am
max(t1max_fri, t1max_sat)

# bikes.drop(columns=["weekday", "hour"], inplace=True) # drop the columns again because you don't need them anymore

21.5
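The two filters above can also be combined into a single boolean mask with `|` (logical or), which avoids the final call to Python's `max`. A small sketch on made-up data (the `toy` frame and its values are invented for illustration only):

```python
import pandas as pd

# Made-up data: 2015-01-09 is a Friday, 2015-01-10 a Saturday
toy = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2015-01-09 21:00",  # Friday, before 10pm  -> excluded
        "2015-01-09 23:00",  # Friday night         -> included
        "2015-01-10 03:00",  # Saturday early hours -> included
        "2015-01-10 07:00",  # Saturday, after 6am  -> excluded
        "2015-01-11 01:00",  # Sunday early hours   -> excluded
    ]),
    "t1": [30.0, 5.0, 8.0, 20.0, 25.0],
})

weekday = toy["timestamp"].dt.weekday
hour = toy["timestamp"].dt.hour

# Friday from 10pm onwards OR Saturday before 6am
friday_night = ((weekday == 4) & (hour >= 22)) | ((weekday == 5) & (hour < 6))
friday_night_max = toy.loc[friday_night, "t1"].max()
print(friday_night_max)
```

The parentheses around each comparison are required because `&` and `|` bind more tightly than `==` and `>=` in Python.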

#### Try to figure out what the individual values of the column weather_code mean. Create a new column which contains the names of the seasons.

In [60]:
# from the README.md file in data/bikesharing we know 
# "season" - category field meteorological seasons: 0-spring ; 1-summer; 2-fall; 3-winter

seasons = {0: "spring", 1: "summer", 2: "fall", 3: "winter"}
# preferred solution
bikes["season_name"] = bikes["season"].replace(to_replace=seasons)

# not so nice solution because it doesn't use vectorization and requires more code
def assign_season_name(season_code, seasons):
    return seasons[season_code]
    
bikes["season_name"] = bikes["season"].apply(lambda x: assign_season_name(x, seasons)) 
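Another vectorized option, not used in the original solution, is `Series.map` with a dictionary. A minimal sketch on a made-up series of season codes:

```python
import pandas as pd

seasons = {0: "spring", 1: "summer", 2: "fall", 3: "winter"}

# Made-up season codes standing in for bikes["season"]
codes = pd.Series([0, 1, 2, 3, 0])

# map looks up every value in the dictionary
season_names = codes.map(seasons)
print(season_names.tolist())

# A code missing from the dictionary becomes NaN with map,
# whereas replace would leave the original value unchanged
unknown_code = pd.Series([0, 9]).map(seasons)
print(unknown_code.isna().tolist())
```

Which behaviour is preferable depends on whether unexpected codes should stand out as missing values or pass through silently.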

In [71]:
# in jupyter notebooks you can easily compare the runtime (how long it takes to execute (run) some code) by using magic commands
# have a look at https://towardsdatascience.com/top-8-magic-commands-in-jupyter-notebook-c1582e813560
# the following two cells compare the runtime of the vectorized and non-vectorized solutions. as expected, the vectorized solution is faster (about 3 times as fast as the non-vectorized one)

In [70]:
%%timeit 
bikes["season"].replace(to_replace=seasons)

744 µs ± 6.72 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [64]:
%%timeit
bikes["season"].apply(lambda x: assign_season_name(x, seasons)) 

2.28 ms ± 16.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
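Outside of a notebook, where the `%%timeit` magic is not available, a similar comparison can be sketched with the standard-library `timeit` module. The series below is synthetic, so the timings will differ from the ones shown above and depend on the machine and pandas version.

```python
import timeit
import pandas as pd

seasons = {0: "spring", 1: "summer", 2: "fall", 3: "winter"}

# Synthetic stand-in for bikes["season"]
codes = pd.Series([0, 1, 2, 3] * 1000)

def with_replace():
    # vectorized: pandas handles the whole column at once
    return codes.replace(to_replace=seasons)

def with_apply():
    # non-vectorized: a Python function is called for every element
    return codes.apply(lambda code: seasons[code])

# Total time in seconds for 20 calls of each variant
replace_seconds = timeit.timeit(with_replace, number=20)
apply_seconds = timeit.timeit(with_apply, number=20)
print(f"replace: {replace_seconds:.4f} s, apply: {apply_seconds:.4f} s")
```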


#### What was the average number of bike rides in winter? What was the number in summer? What can you say about these numbers, and is a direct comparison the right thing to do? Why? Why not?

In [72]:
bikes.groupby(["season_name"])["cnt"].mean()  # returns the average (mean) for all 4 seasons

season_name
fall      1178.954218
spring    1103.831589
summer    1464.465238
winter     821.729099
Name: cnt, dtype: float64

In [None]:
# making statements/drawing conclusions based only on the mean can be dangerous because the mean can be affected by outliers (in this example, very large values that increase the average). The median 
# is more robust against outliers. Have a look at https://www.clinfo.eu/mean-median/

In [73]:
bikes.groupby(["season_name"])["cnt"].median()

season_name
fall       898
spring     823
summer    1214
winter     632
Name: cnt, dtype: int64
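The effect of outliers on the mean can be illustrated with a handful of made-up hourly counts (these numbers are invented and not taken from the bike dataset):

```python
import pandas as pd

# Four ordinary hours and one extreme hour
counts = pd.Series([100, 110, 120, 130, 5000])

mean_count = counts.mean()      # pulled upwards by the single extreme value
median_count = counts.median()  # middle value, unaffected by how extreme the outlier is

print(mean_count, median_count)
```

Here the mean (1092.0) is far above every ordinary value, while the median (120.0) still describes a typical hour.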

In [None]:
# To answer the questions: there seems to be evidence that the number of shared bikes per hour is highest during the summer, which is also the warmest season (on average) (see the following cell). There seems to be no
# large difference between the number of shared bikes per hour in spring and fall. During winter people seem to use fewer rental bikes 

In [75]:
bikes.groupby(["season_name"])["t1"].median()

season_name
fall      13.0
spring    10.0
summer    18.0
winter     7.5
Name: t1, dtype: float64
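When judging whether a direct comparison between seasons is fair, it can help to see several statistics side by side, including how many observations each season contains. A possible sketch using `agg` on made-up data (the `toy` frame is invented for illustration):

```python
import pandas as pd

# Made-up hourly counts for two seasons
toy = pd.DataFrame({
    "season_name": ["summer", "summer", "summer", "winter", "winter"],
    "cnt": [100, 200, 300, 50, 150],
})

# One row per season, one column per statistic
summary = toy.groupby("season_name")["cnt"].agg(["mean", "median", "count"])
print(summary)
```

A large gap between mean and median hints at outliers, and very different `count` values would suggest that the seasons are not equally well represented.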