# Challenge Questions - TfL Dataset

# Instructions:
• Please ensure you don't overwrite any existing cells. Add new cells below by pressing ALT+ENTER

• Attempt all of the questions

• You are encouraged to look online for help should you need it

# Dataset overview:
There are three datasets stored in the same directory as this Notebook, and they are all related to each other:

• **tfl-daily-cycle-hires.csv**: This dataset contains bike hire data from Transport for London for the period 30th July 2010 to 30th September 2021. 'Day' is the day in '%d/%m/%Y' format. 'Number of Bicycle Hires' is the total number of bikes hired that day.



## Import pandas, numpy and datetime

In [1]:
import pandas as pd


In [2]:
import numpy as np
import datetime

## Load the files:
• "tfl-daily-cycle-hires.csv" should be assigned to the variable **tfl**

In [9]:
tfl = pd.read_csv('tfl-daily-cycle-hires.csv')

## Check the head of the DataFrame

In [10]:
tfl.head(5)

,Day,Number of Bicycle Hires,Unnamed: 2
0,30/07/2010,6897.0,
1,31/07/2010,5564.0,
2,01/08/2010,4303.0,
3,02/08/2010,6642.0,
4,03/08/2010,7966.0,


## Check the data types of the DataFrame columns

In [11]:
tfl.dtypes

Day                         object
Number of Bicycle Hires    float64
Unnamed: 2                 float64
dtype: object

## Change the data types and remove unnecessary columns 

• 'Day' should be a datetime64 data type

• 'Number of Bicycle Hires' should be float64

• Any other columns should be deleted

In [12]:
tfl.columns

Index(['Day', 'Number of Bicycle Hires', 'Unnamed: 2'], dtype='object')

In [13]:
tfl.drop(columns='Unnamed: 2', inplace=True)

In [15]:
tfl.columns

Index(['Day', 'Number of Bicycle Hires'], dtype='object')
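An alternative to dropping the column afterwards is to select the wanted columns while loading. The following is a minimal sketch, assuming the stray 'Unnamed: 2' column comes from a trailing comma on each line of the CSV. The in-memory `sample_csv` text is a hypothetical stand-in for the real file, built from the first rows shown in the head above.

```python
import io

import pandas as pd

# Stand-in for 'tfl-daily-cycle-hires.csv'; the trailing commas are an assumption
sample_csv = io.StringIO(
    "Day,Number of Bicycle Hires,\n"
    "30/07/2010,6897.0,\n"
    "31/07/2010,5564.0,\n"
    "01/08/2010,4303.0,\n"
)

# usecols keeps only the named columns, so the empty third column is never loaded
sample = pd.read_csv(sample_csv, usecols=["Day", "Number of Bicycle Hires"])
print(sample.columns.tolist())
```

With the real file, the same `usecols` argument could be passed to the existing `pd.read_csv('tfl-daily-cycle-hires.csv')` call.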

In [19]:
# 'Day' is stored day-first ('%d/%m/%Y'); an explicit format avoids month/day swaps
tfl['Day'] = pd.to_datetime(tfl['Day'], format='%d/%m/%Y')


In [20]:
tfl.dtypes

Day                        datetime64[ns]
Number of Bicycle Hires           float64
dtype: object
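The dataset overview states that 'Day' uses the '%d/%m/%Y' format, so strings such as 01/08/2010 are ambiguous unless the parser is told which part is the day. The sketch below uses a few rows copied from the head of the dataset to show the difference between a day-first and a month-first reading. The variable names are illustrative only.

```python
import pandas as pd

# Day-first strings copied from the first rows of the dataset
days = pd.Series(["30/07/2010", "31/07/2010", "01/08/2010", "02/08/2010"])

# Explicit format: every string is read as day/month/year
parsed = pd.to_datetime(days, format="%d/%m/%Y")

# The third row read month-first instead, for comparison
month_first = pd.to_datetime("01/08/2010", format="%m/%d/%Y")

print(parsed.dt.month.tolist())  # months under the day-first reading
print(month_first.month)         # month under the month-first reading
```

Under the day-first reading the third row is 1 August 2010. Under the month-first reading it becomes 8 January 2010, which silently moves the row to a different month.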

## What is the average number of bicycle hires per day across the entire dataset?

In [22]:
round(tfl['Number of Bicycle Hires'].mean(),2)

26261.93

## Create a new column called 'Year' which contains the 4 digit year

In [24]:
tfl['Year'] = tfl['Day'].dt.year

In [25]:
tfl['Year']

0       2010
1       2010
2       2010
3       2010
4       2010
        ... 
4076    2021
4077    2021
4078    2021
4079    2021
4080    2021
Name: Year, Length: 4081, dtype: int64

## What is the average number of bicycle hires per Year across the entire dataset?

In [26]:
tfl.columns

Index(['Day', 'Number of Bicycle Hires', 'Year'], dtype='object')

In [28]:
round(tfl.groupby('Year')['Number of Bicycle Hires'].mean(),2)

Year
2010    14069.76
2011    19568.35
2012    26008.97
2013    22042.35
2014    27462.73
2015    27046.13
2016    28152.01
2017    28619.30
2018    28952.16
2019    28561.52
2020    28508.65
2021    30091.07
Name: Number of Bicycle Hires, dtype: float64

## What is the total number of bicycle hires per Year across the entire dataset?

In [29]:
tfl.groupby('Year')['Number of Bicycle Hires'].sum()

Year
2010     2180813.0
2011     7142449.0
2012     9519283.0
2013     8045459.0
2014    10023897.0
2015     9871839.0
2016    10303637.0
2017    10446044.0
2018    10567540.0
2019    10424955.0
2020    10434167.0
2021     8214862.0
Name: Number of Bicycle Hires, dtype: float64
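The yearly mean and the yearly total can also be computed in a single pass with named aggregation. This is a minimal sketch on a toy frame built from a handful of rows shown elsewhere in this Notebook. The names `toy`, `mean_hires` and `total_hires` are illustrative, not part of the dataset.

```python
import pandas as pd

# A few rows taken from the start and end of the dataset
toy = pd.DataFrame({
    "Year": [2010, 2010, 2010, 2021, 2021, 2021],
    "Number of Bicycle Hires": [6897.0, 5564.0, 4303.0, 45120.0, 32167.0, 32539.0],
})

# Named aggregation: one output column per keyword argument
summary = (
    toy.groupby("Year")["Number of Bicycle Hires"]
    .agg(mean_hires="mean", total_hires="sum")
    .round(2)
)
print(summary)
```

On the full DataFrame the same call would be `tfl.groupby('Year')['Number of Bicycle Hires'].agg(...)`.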

In [30]:
# numeric_only=True skips the datetime 'Day' column, which cannot be summed
tfl.groupby('Year').sum(numeric_only=True)


Year,Number of Bicycle Hires
2010,2180813.0
2011,7142449.0
2012,9519283.0
2013,8045459.0
2014,10023897.0
2015,9871839.0
2016,10303637.0
2017,10446044.0
2018,10567540.0
2019,10424955.0


In [33]:
tfl[['Year','Number of Bicycle Hires']].groupby('Year').sum()

Year,Number of Bicycle Hires
2010,2180813.0
2011,7142449.0
2012,9519283.0
2013,8045459.0
2014,10023897.0
2015,9871839.0
2016,10303637.0
2017,10446044.0
2018,10567540.0
2019,10424955.0


## Create a new column called 'Category' on the tfl DataFrame that classifies the number of bike hires per day as:
* 'Low' if the 'Number of Bicycle Hires' is below 10,000
* 'Medium' if the 'Number of Bicycle Hires' is below 40,000 but greater than or equal to 10,000
* 'High' if the 'Number of Bicycle Hires' is greater than or equal to 40,000

In [35]:
tfl['Category'] = tfl['Number of Bicycle Hires'].apply(lambda x: 'Low' if x < 10000 else 'High' if x >= 40000 else 'Medium')
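The same three bands can be expressed with `pd.cut`, which avoids a Python-level function call per row. The sketch below is one possible alternative, not the required answer. The values 9999, 10000 and 40000 are made-up boundary cases added to show which side of each threshold is included.

```python
import numpy as np
import pandas as pd

# Two real values from the dataset plus made-up boundary cases
hires = pd.Series([6897.0, 9999.0, 10000.0, 39889.0, 40000.0, 45120.0])

# right=False makes each bin closed on the left: [-inf, 10000), [10000, 40000), [40000, inf)
category = pd.cut(
    hires,
    bins=[-np.inf, 10_000, 40_000, np.inf],
    labels=["Low", "Medium", "High"],
    right=False,
)
print(category.tolist())
```

One behavioural difference: `pd.cut` leaves missing values as NaN, whereas the lambda labels them 'Medium' because both comparisons are False for NaN.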



In [36]:
tfl

,Day,Number of Bicycle Hires,Year,Category
0,2010-07-30,6897.0,2010,Low
1,2010-07-31,5564.0,2010,Low
2,2010-08-01,4303.0,2010,Low
3,2010-08-02,6642.0,2010,Low
4,2010-08-03,7966.0,2010,Low
...,...,...,...,...
4076,2021-09-26,45120.0,2021,High
4077,2021-09-27,32167.0,2021,Medium
4078,2021-09-28,32539.0,2021,Medium
4079,2021-09-29,39889.0,2021,Medium


## For each year in the tfl DataFrame how many days are classed as 'Low', 'Medium' or 'High'?

In [39]:
tfl[['Year','Category','Day']].groupby(by=['Year','Category']).count()

Year,Category,Day
2010,Low,44
2010,Medium,111
2011,Low,30
2011,Medium,335
2012,High,27
2012,Low,19
2012,Medium,320
2013,Low,25
2013,Medium,340
2014,High,27
