# TFL Bike data prep
___

#### Data prep as part of my MSc thesis, "Using machine learning to analyse and predict Transport for London bike sharing habits in the post COVID-19 era".

The following code for downloading the data has been adopted from [Markus Hauru's](https://github.com/mhauru) analysis, 'Predicting Boris Bike usage'.



In [2]:
# importing libraries

import os
import pickle
import requests
import zipfile
import pandas as pd
import numpy as np
import scipy as sp
import statsmodels.api as sm
from sklearn import linear_model, svm, neighbors, tree
from matplotlib import pyplot as plt
import matplotlib
import seaborn as sns
from pathlib import Path
from timeit import default_timer as timer
from IPython.display import set_matplotlib_formats
from urllib.parse import urlparse
import openpyxl

try:
    import xlrd
except Exception as e:
    msg = (
        "Please install the package xlrd: `pip install --user xlrd`"
        "It's an optional requirement for pandas, and we'll be needing it."
    )
    print(msg)
    raise e

In [3]:
# For pretty and exportable matplotlib plots.
# If you are running this yourself and want interactivity,
# try `%matplotlib widget` instead.
set_matplotlib_formats("svg")
%matplotlib inline
# %matplotlib widget
# Set a consistent plotting style across the notebook using Seaborn.
sns.set_style("darkgrid")
sns.set_context("notebook")
# Make pandas cooperate with pyplot
pd.plotting.register_matplotlib_converters()


  set_matplotlib_formats("svg")


1. Processing and cleaning the bike data
Before getting anywhere with it, we'll need to process the bike data quite a bit. The data comes in CSV files, each of which covers a period of time. Up first, we need to download the data from the TfL website. If you are running this code yourself, here's a script that does that. Be warned though, it's almost seven gigs of data. You can run it repeatedly, and it'll only download data that it doesn't have already.

In [4]:
bikefolder = "data/bikes"

In [5]:
def download_file(datafolder, url, verbosity=0):
    """Download the data from the given URL into the datafolder, unless it's
    already there. Return path to downloaded file.
    """
    # data folder variable for where the folder for where the downloaded file should be stores 
    # using the path() function to converted the data folder string into a path
    datafolder = Path(datafolder)
    datafolder.mkdir(parents=True, exist_ok=True)

    # using the url parse function to extract the file from the url and create a filepath for it to be stored
    a = urlparse(url)
    filename = Path(os.path.basename(a.path))
    filepath = datafolder / filename
    # Don't redownload if we already have this file.
    if filepath.exists():
        if verbosity > 1:
            print("Already have {}".format(filename))
    else:
        if verbosity > 0:
            print("Downloading {}".format(filename))
        # sends a GET request to the URL using the requests module and raises an exception if there is an error
        rqst = requests.get(url)
        rqst.raise_for_status()
        with open(filepath, "wb") as f:
            f.write(rqst.content)
    return filepath


In [6]:
# Adjust whether to print progress reports of the downloads.
# verbosity=0 is silence, verbosity=1 reports only when actually doing things,
# verbosity>1 also reports when there's nothing to do.
verbosity = 1

# Most files are individual CSV files, listed in bike_data_urls.txt. Download them.
urlsfile = "data/bikes/bike_data_urls.txt"
with open(urlsfile, "r") as f:
    urls = f.read().splitlines()
# There are a few comments in the file, marked by lines starting with #.
# Filter them out.
urls = [u for u in urls if u[0] != "#"]
for url in urls:
    download_file(bikefolder, url, verbosity)

# The early years come in zips. Download and unzip them.
zipsfolder = Path("data/bikes/bikezips")
bikezipurls = [
    "https://cycling.data.tfl.gov.uk/usage-stats/cyclehireusagestats-2012.zip",
    "https://cycling.data.tfl.gov.uk/usage-stats/cyclehireusagestats-2013.zip",
    "https://cycling.data.tfl.gov.uk/usage-stats/cyclehireusagestats-2014.zip",
    "https://cycling.data.tfl.gov.uk/usage-stats/2015TripDatazip.zip",
    "https://cycling.data.tfl.gov.uk/usage-stats/2016TripDataZip.zip",
]
# A list of CSV files that are already there. Only unzip if some of the files
# in the zip aren't present already.
current_csvs = sorted(os.listdir(bikefolder))
for url in bikezipurls:
    zippath = download_file(zipsfolder, url, verbosity)
    with zipfile.ZipFile(zippath, "r") as z:
        namelist = z.namelist()
        has_been_extracted = any(name not in current_csvs for name in namelist)
        if has_been_extracted:
            if verbosity > 0:
                print("Unzipping {}".format(zippath))
            z.extractall(bikefolder)
        else:
            if verbosity > 1:
                print("{} has already been extracted.".format(zippath))

# Finally, there's an odd one out: One week's data comes in as an .xlsx.
# Download it and use pandas to convert it to csv.
xlsxurl = "https://cycling.data.tfl.gov.uk/usage-stats/49JourneyDataExtract15Mar2017-21Mar2017.xlsx"
xlsxfile = download_file(bikefolder, xlsxurl)
csvfile = xlsxfile.with_suffix(".csv")
if not csvfile.exists():
    if verbosity > 0:
        print("Converting .xlsx to .csv.")
    pd.read_excel(xlsxfile).to_csv(csvfile, date_format="%d/%m/%Y %H:%M:%S")
else:
    if verbosity > 1:
        print("Already have {}".format(csvfile))

The data we have now lists on each line of the CSV file a single bike trip, with starting point and time, end point and time, and things like bike ID number. Here's an example.

In [7]:
example_file  = Path(bikefolder) / Path("47JourneyDataExtract01Mar2017-07Mar2017.csv")
pd.read_csv(example_file, encoding="ISO-8859-2").head()

Unnamed: 0,Rental Id,Duration,Bike Id,End Date,EndStation Id,EndStation Name,Start Date,StartStation Id,StartStation Name
0,62857677,3780.0,7851,06/03/2017 19:20,43.0,"Crawford Street, Marylebone",06/03/2017 18:17,811,"Westferry Circus, Canary Wharf"
1,62863035,540.0,4089,06/03/2017 22:17,295.0,"Swan Street, The Borough",06/03/2017 22:08,272,"Baylis Road, Waterloo"
2,62775896,600.0,4895,02/03/2017 21:27,295.0,"Swan Street, The Borough",02/03/2017 21:17,197,"Stamford Street, South Bank"
3,62747748,420.0,4347,01/03/2017 21:08,295.0,"Swan Street, The Borough",01/03/2017 21:01,803,"Southwark Street, Bankside"
4,62843939,420.0,3192,06/03/2017 09:28,193.0,"Bankside Mix, Bankside",06/03/2017 09:21,197,"Stamford Street, South Bank"


In [8]:
from glob import glob 

# using glob to list all the csv file in the bikefolder filepath
all_csv = glob(bikefolder+str('/*.csv'))
all_csv

['data/bikes\\01aJourneyDataExtract10Jan16-23Jan16.csv',
 'data/bikes\\01bJourneyDataExtract24Jan16-06Feb16.csv',
 'data/bikes\\02aJourneyDataExtract07Feb16-20Feb2016.csv',
 'data/bikes\\02bJourneyDataExtract21Feb16-05Mar2016.csv',
 'data/bikes\\03JourneyDataExtract06Mar2016-31Mar2016.csv',
 'data/bikes\\04JourneyDataExtract01Apr2016-30Apr2016.csv',
 'data/bikes\\05JourneyDataExtract01May2016-17May2016.csv',
 'data/bikes\\06JourneyDataExtract18May2016-24May2016.csv',
 'data/bikes\\07JourneyDataExtract25May2016-31May2016.csv',
 'data/bikes\\08JourneyDataExtract01Jun2016-07Jun2016.csv',
 'data/bikes\\09JourneyDataExtract08Jun2016-14Jun2016.csv',
 'data/bikes\\1. Journey Data Extract 01Jan-05Jan13.csv',
 'data/bikes\\1. Journey Data Extract 04Jan-31Jan 12.csv',
 'data/bikes\\1. Journey Data Extract 05Jan14-02Feb14.csv',
 'data/bikes\\10. Journey Data Extract 18Aug-13Sep13.csv',
 'data/bikes\\10. Journey Data Extract 21Aug-22 Aug12.csv',
 'data/bikes\\10a Journey Data Extract 20Sep15-03Oct

### 2019 data prep

In [35]:
# creating a list of csv files that contain '2019' and '2022' respectively
csv_2019 = [item for item in all_csv if '2019' in item]
csv_2022 = [item for item in all_csv if '2022' in item]

In [13]:
csv_2019

['data/bikes\\142JourneyDataExtract26Dec2018-01Jan2019.csv',
 'data/bikes\\143JourneyDataExtract02Jan2019-08Jan2019.csv',
 'data/bikes\\144JourneyDataExtract09Jan2019-15Jan2019.csv',
 'data/bikes\\145JourneyDataExtract16Jan2019-22Jan2019.csv',
 'data/bikes\\146JourneyDataExtract23Jan2019-29Jan2019.csv',
 'data/bikes\\147JourneyDataExtract30Jan2019-05Feb2019.csv',
 'data/bikes\\148JourneyDataExtract06Feb2019-12Feb2019.csv',
 'data/bikes\\149JourneyDataExtract13Feb2019-19Feb2019.csv',
 'data/bikes\\150JourneyDataExtract20Feb2019-26Feb2019.csv',
 'data/bikes\\151JourneyDataExtract27Feb2019-05Mar2019.csv',
 'data/bikes\\152JourneyDataExtract06Mar2019-12Mar2019.csv',
 'data/bikes\\153JourneyDataExtract13Mar2019-19Mar2019.csv',
 'data/bikes\\154JourneyDataExtract20Mar2019-26Mar2019.csv',
 'data/bikes\\155JourneyDataExtract27Mar2019-02Apr2019.csv',
 'data/bikes\\156JourneyDataExtract03Apr2019-09Apr2019.csv',
 'data/bikes\\157JourneyDataExtract10Apr2019-16Apr2019.csv',
 'data/bikes\\158Journey

In [14]:
# using list comprehension that reads each csv file from the list and gnerators a sequence of dataframes
dfs = (pd.read_csv(csv) for csv in csv_2019)

# concatenate csvs them into a single DataFrame using pd.concat()
# ignore_index=True parameter resets the index of the resulting DataFrame, so that it is a continuous sequence of integers.
data_2019 = pd.concat(dfs, ignore_index=True)

In [15]:
print(data_2019.shape)
data_2019.head()

(10388411, 9)


Unnamed: 0,Rental Id,Duration,Bike Id,End Date,EndStation Id,EndStation Name,Start Date,StartStation Id,StartStation Name
0,83252102,720,2077,31/12/2018 19:05,272,"Baylis Road, Waterloo",31/12/2018 18:53,94,"Bricklayers Arms, Borough"
1,83195883,120,10781,27/12/2018 19:47,93,"Cloudesley Road, Angel",27/12/2018 19:45,339,"Risinghill Street, Angel"
2,83196070,120,2977,27/12/2018 20:11,339,"Risinghill Street, Angel",27/12/2018 20:09,234,"Liverpool Road (N1 Centre), Angel"
3,83197932,660,10802,28/12/2018 07:35,282,"Royal London Hospital, Whitechapel",28/12/2018 07:24,698,"Shoreditch Court, Haggerston"
4,83176351,1380,15749,26/12/2018 11:55,785,"Aquatic Centre, Queen Elizabeth Olympic Park",26/12/2018 11:32,783,"Monier Road, Hackney Wick"


In [16]:
# 2019

## Add some extra variables to the dataset for use later in filtering

import datetime

## Feeding a specififed date format speeds up the pd.to_datetime function immeasurably, especially over large datasets
## e.g. http://stackoverflow.com/questions/32034689/why-is-pandas-to-datetime-slow-for-non-standard-time-format-such-as-2014-12-31

format = "%d/%m/%Y %H:%M"

## Some routes had dates with a seconds component, whereas some didn't - the below code cuts these seconds off
data_2019['Start Date']= data_2019['Start Date'].str[:16]

data_2019['Start Date Time']= pd.to_datetime(data_2019['Start Date'], format=format)

data_2019['Hour']= pd.to_datetime(data_2019['Start Date'], format=format).dt.hour

data_2019['Day']= pd.to_datetime(data_2019['Start Date'], format=format).dt.weekday

data_2019.head()


Unnamed: 0,Rental Id,Duration,Bike Id,End Date,EndStation Id,EndStation Name,Start Date,StartStation Id,StartStation Name,Start Date Time,Hour,Day
0,83252102,720,2077,31/12/2018 19:05,272,"Baylis Road, Waterloo",31/12/2018 18:53,94,"Bricklayers Arms, Borough",2018-12-31 18:53:00,18,0
1,83195883,120,10781,27/12/2018 19:47,93,"Cloudesley Road, Angel",27/12/2018 19:45,339,"Risinghill Street, Angel",2018-12-27 19:45:00,19,3
2,83196070,120,2977,27/12/2018 20:11,339,"Risinghill Street, Angel",27/12/2018 20:09,234,"Liverpool Road (N1 Centre), Angel",2018-12-27 20:09:00,20,3
3,83197932,660,10802,28/12/2018 07:35,282,"Royal London Hospital, Whitechapel",28/12/2018 07:24,698,"Shoreditch Court, Haggerston",2018-12-28 07:24:00,7,4
4,83176351,1380,15749,26/12/2018 11:55,785,"Aquatic Centre, Queen Elizabeth Olympic Park",26/12/2018 11:32,783,"Monier Road, Hackney Wick",2018-12-26 11:32:00,11,2


In [17]:
# 2019 filtering data - remove any rows that aren't from 2019
# remember the first csv contained data from 2018... 26Dec2018-01Jan2019.csv
bike_data_2019 = data_2019[data_2019['Start Date Time'].dt.year == 2019]
print(bike_data_2019.shape)

(10310063, 12)


In [19]:
# bike_data_2019 has no null values, perfect
#bike_data_2019.isnull().sum()

Rental Id            0
Duration             0
Bike Id              0
End Date             0
EndStation Id        0
EndStation Name      0
Start Date           0
StartStation Id      0
StartStation Name    0
Start Date Time      0
Hour                 0
Day                  0
dtype: int64

### 2022 data prep

- In September 2022 the column names change slightly and additional clumns have been added
- for example the 'Bike model' column has been added (classic or PBSC_EBIKE)

Cycle Hire Data - data format change & new data https://techforum.tfl.gov.uk/t/cycle-hire-data-data-format-change-new-data/2520

### Exploring the 2022 data

In [9]:
csv_2022 = [item for item in all_csv if '2022' in item]

In [10]:
# CSVs before September 2022 part 1 data 
# use slicing to includes all elements of the previous list except for the last 16
csv_2022_p1 = csv_2022[:-16]

# CSVs From september 12th 2022 
# use slicing to create a new list that includes only the last 16 elements
csv_2022_p2 = csv_2022[-16:]

In [11]:
# doing the same for the 2022 data
# passing errors within the csv files as per https://stackoverflow.com/questions/52105659/pandas-read-csv-unexpected-end-of-data-error
dfs_2022_p1 = (pd.read_csv(csv, engine='python', encoding='utf-8', on_bad_lines='skip') for csv in csv_2022_p1)
data_2022_p1 = pd.concat(dfs_2022_p1, ignore_index=True)

In [12]:
data_2022_p1.isnull().sum()
# for the part 1 data, there were 312144 records with null station ids  

#es_id_null = data_2022_p1.loc[data_2022_p1['EndStation Id'].isnull()] 
#es_id_null.sort_values(by='Start Date', ascending=False)

# filtering the data above reveal the journeys taken between 06/07/2022 00:00 and 12/07/2022 23:56 did not record an end station Id

Rental Id                 0
Duration                  0
Bike Id                   0
End Date                  0
EndStation Id        312144
EndStation Name           0
Start Date                0
StartStation Id           0
StartStation Name         0
dtype: int64

In [13]:
data_2022_p1.count()

Rental Id            8677104
Duration             8677104
Bike Id              8677104
End Date             8677104
EndStation Id        8364960
EndStation Name      8677104
Start Date           8677104
StartStation Id      8677104
StartStation Name    8677104
dtype: int64

In [14]:
# read in data with datetime data type for column 2 and column 5
dfs_2022_p2 = (pd.read_csv(csv) for csv in csv_2022_p2)
#dfs_2022_p2 = (pd.read_csv(csv, parse_dates={'Start date': 'datetime64', 'End date': 'datetime64'}) for csv in csv_2022_p2)
data_2022_p2 = pd.concat(dfs_2022_p2, ignore_index=True)

  dfs_2022_p2 = (pd.read_csv(csv) for csv in csv_2022_p2)


In [15]:
data_2022_p2.isnull().sum()

Number                  0
Start date              0
Start station number    0
Start station           0
End date                0
End station number      0
End station             0
Bike number             0
Bike model              0
Total duration          0
Total duration (ms)     0
dtype: int64

In [16]:
data_2022_p2.count()

Number                  2555077
Start date              2555077
Start station number    2555077
Start station           2555077
End date                2555077
End station number      2555077
End station             2555077
Bike number             2555077
Bike model              2555077
Total duration          2555077
Total duration (ms)     2555077
dtype: int64

In [17]:
# doing the same for the 2022 data
# passing errors within the csv files as per https://stackoverflow.com/questions/52105659/pandas-read-csv-unexpected-end-of-data-error
dfs_2022 = (pd.read_csv(csv, engine='python', encoding='utf-8', on_bad_lines='skip') for csv in csv_2022)
data_2022 = pd.concat(dfs_2022, ignore_index=True)

In [18]:
# check the data type of the 'date' column
print(data_2022['Start date'].dtype)

object


In [19]:
# 2022

# Let's clean this up and get all the data into single columns


#creating a copy of the orginal data
data_2022_clean = data_2022.copy()


In [20]:
#let's start by sorting out the date time formatting
format = "%d/%m/%Y %H:%M"
format2 = "%Y/%m/%d %H:%M"


data_2022_clean['Start Date'] = data_2022_clean['Start Date'].str[:16]


# let's create some extra columns to store the newly formatted datetime data,
#remember the date columns have different formatting before and after September
# we will merge them into single columns later on 
data_2022_clean['Start Date Time'] = pd.to_datetime(data_2022_clean['Start Date'], format=format)
data_2022_clean['Start Date Time 2']= pd.to_datetime(data_2022_clean['Start date'], format=format2)
data_2022_clean['End Date Time'] = pd.to_datetime(data_2022_clean['End Date'], format=format)
data_2022_clean['End Date Time 2'] = pd.to_datetime(data_2022_clean['Start date'], format=format2)

In [21]:
data_2022_clean

Unnamed: 0,Rental Id,Duration,Bike Id,End Date,EndStation Id,EndStation Name,Start Date,StartStation Id,StartStation Name,Number,...,End station number,End station,Bike number,Bike model,Total duration,Total duration (ms),Start Date Time,Start Date Time 2,End Date Time,End Date Time 2
0,115967515.0,1260.0,15338.0,01/01/2022 23:13,310.0,"Black Prince Road, Vauxhall",01/01/2022 22:52,529.0,"Manresa Road, Chelsea",,...,,,,,,,2022-01-01 22:52:00,NaT,2022-01-01 23:13:00,NaT
1,116017034.0,720.0,19861.0,04/01/2022 19:08,11.0,"Brunswick Square, Bloomsbury",04/01/2022 18:56,804.0,"Good's Way, King's Cross",,...,,,,,,,2022-01-04 18:56:00,NaT,2022-01-04 19:08:00,NaT
2,115895660.0,360.0,19666.0,29/12/2021 16:34,70.0,"Calshot Street , King's Cross",29/12/2021 16:28,57.0,"Guilford Street , Bloomsbury",,...,,,,,,,2021-12-29 16:28:00,NaT,2021-12-29 16:34:00,NaT
3,116016563.0,480.0,19861.0,04/01/2022 18:46,804.0,"Good's Way, King's Cross",04/01/2022 18:38,57.0,"Guilford Street , Bloomsbury",,...,,,,,,,2022-01-04 18:38:00,NaT,2022-01-04 18:46:00,NaT
4,116014412.0,1260.0,17235.0,04/01/2022 17:45,14.0,"Belgrove Street , King's Cross",04/01/2022 17:24,297.0,"Geraldine Street, Elephant & Castle",,...,,,,,,,2022-01-04 17:24:00,NaT,2022-01-04 17:45:00,NaT
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11232176,,,,,,,,,,127641458.0,...,200249,"Queen Mary's, Mile End",53664.0,CLASSIC,1h 49m 4s,6544593.0,NaT,2022-12-26 00:02:00,NaT,2022-12-26 00:02:00
11232177,,,,,,,,,,127641459.0,...,200147,"Salmon Lane, Limehouse",54303.0,CLASSIC,32m 16s,1936877.0,NaT,2022-12-26 00:02:00,NaT,2022-12-26 00:02:00
11232178,,,,,,,,,,127641453.0,...,200160,"Langdon Park, Poplar",21426.0,CLASSIC,49m 15s,2955280.0,NaT,2022-12-26 00:00:00,NaT,2022-12-26 00:00:00
11232179,,,,,,,,,,127641454.0,...,22167,"Millharbour, Millwall",54786.0,CLASSIC,1h 30m 27s,5427555.0,NaT,2022-12-26 00:00:00,NaT,2022-12-26 00:00:00


In [22]:
data_2022_clean.isnull().sum()

Rental Id               2555077
Duration                2555077
Bike Id                 2555077
End Date                2555077
EndStation Id           2867221
EndStation Name         2555077
Start Date              2555077
StartStation Id         2555077
StartStation Name       2555077
Number                  8677104
Start date              8677104
Start station number    8677104
Start station           8677104
End date                8677104
End station number      8677104
End station             8677104
Bike number             8677104
Bike model              8677104
Total duration          8677104
Total duration (ms)     8677104
Start Date Time         2555077
Start Date Time 2       8677104
End Date Time           2555077
End Date Time 2         8677104
dtype: int64

In [23]:
# transfering values from one pandas column to another pandas column only for null rows

data_2022_clean.loc[data_2022_clean['Rental Id'].isnull(), 'Rental Id'] = data_2022_clean['Number']
# converting from milliseconds to seconds, multipyling by 1000 
data_2022_clean.loc[data_2022_clean['Duration'].isnull(), 'Duration'] = data_2022_clean['Total duration (ms)'] / 1000
data_2022_clean.loc[data_2022_clean['Bike Id'].isnull(), 'Bike Id'] = data_2022_clean['Bike number']
data_2022_clean.loc[data_2022_clean['End Date'].isnull(), 'End Date'] = data_2022_clean['End date']
data_2022_clean.loc[data_2022_clean['EndStation Name'].isnull(), 'EndStation Name'] = data_2022_clean['End station']
data_2022_clean.loc[data_2022_clean['Start Date'].isnull(), 'Start Date'] = data_2022_clean['Start Date Time 2']
data_2022_clean.loc[data_2022_clean['End Date'].isnull(), 'End Date'] = data_2022_clean['End Date Time']
data_2022_clean.loc[data_2022_clean['StartStation Name'].isnull(), 'StartStation Name'] = data_2022_clean['Start station']

#data_2022_clean.sort_values(by='Bike model', ascending=False)

In [24]:
data_2022_clean.isnull().sum()

Rental Id                     0
Duration                      0
Bike Id                       0
End Date                      0
EndStation Id           2867221
EndStation Name               0
Start Date                    0
StartStation Id         2555077
StartStation Name             0
Number                  8677104
Start date              8677104
Start station number    8677104
Start station           8677104
End date                8677104
End station number      8677104
End station             8677104
Bike number             8677104
Bike model              8677104
Total duration          8677104
Total duration (ms)     8677104
Start Date Time         2555077
Start Date Time 2       8677104
End Date Time           2555077
End Date Time 2         8677104
dtype: int64

In [25]:
#ensuring a consistent format of the date columns, initially in string format
data_2022_clean['Start Date'] = pd.to_datetime(data_2022_clean['Start Date']).dt.strftime('%d/%m/%Y %H:%M')



In [26]:
# doing the same for end date
data_2022_clean['End Date'] = pd.to_datetime(data_2022_clean['End Date']).dt.strftime('%d/%m/%Y %H:%M')

In [1]:
data_2022_clean

NameError: name 'data_2022_clean' is not defined

In [95]:
data_2022_clean1 = data_2022_clean.copy()

MemoryError: Unable to allocate 686. MiB for an array with shape (8, 11232181) and data type float64

In [35]:
# adding hour and day columns  
data_2022_clean1['Hour']= pd.to_datetime(data_2022_clean1['Start Date'], format = "%d/%m/%Y %H:%M").dt.hour
data_2022_clean1['Day']= pd.to_datetime(data_2022_clean1['Start Date'], format = "%d/%m/%Y %H:%M").dt.weekday

In [94]:
# removing columns that are no longer needed
data_2022_clean_drop = data_2022_clean1.drop(['Number', 'Start date', 'Start station', 'End date', 'End station',
                                             'Bike number', 'Total duration', 'Total duration (ms)', 'Start Date Time', 'Start Date Time 2','End Date Time','End Date Time 2'], axis=1)

MemoryError: Unable to allocate 428. MiB for an array with shape (5, 11232181) and data type float64

In [37]:
data_2022_clean_drop

Unnamed: 0,Rental Id,Duration,Bike Id,End Date,EndStation Id,EndStation Name,Start Date,StartStation Id,StartStation Name,Start station number,End station number,Bike model,Hour,Day
0,115967515.0,1260.000,15338.0,01/01/2022 23:13,310.0,"Black Prince Road, Vauxhall",01/01/2022 22:52,529.0,"Manresa Road, Chelsea",,,,22,5
1,116017034.0,720.000,19861.0,01/04/2022 19:08,11.0,"Brunswick Square, Bloomsbury",01/04/2022 18:56,804.0,"Good's Way, King's Cross",,,,18,4
2,115895660.0,360.000,19666.0,29/12/2021 16:34,70.0,"Calshot Street , King's Cross",29/12/2021 16:28,57.0,"Guilford Street , Bloomsbury",,,,16,2
3,116016563.0,480.000,19861.0,01/04/2022 18:46,804.0,"Good's Way, King's Cross",01/04/2022 18:38,57.0,"Guilford Street , Bloomsbury",,,,18,4
4,116014412.0,1260.000,17235.0,01/04/2022 17:45,14.0,"Belgrove Street , King's Cross",01/04/2022 17:24,297.0,"Geraldine Street, Elephant & Castle",,,,17,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11232176,127641458.0,6544.593,53664.0,26/12/2022 01:51,,"Queen Mary's, Mile End",26/12/2022 00:02,,"Woodstock Grove, Shepherd's Bush",200214,200249,CLASSIC,0,0
11232177,127641459.0,1936.877,54303.0,26/12/2022 00:34,,"Salmon Lane, Limehouse",26/12/2022 00:02,,"Curlew Street, Shad Thames",1213,200147,CLASSIC,0,0
11232178,127641453.0,2955.280,21426.0,26/12/2022 00:49,,"Langdon Park, Poplar",26/12/2022 00:00,,"Curlew Street, Shad Thames",1213,200160,CLASSIC,0,0
11232179,127641454.0,5427.555,54786.0,26/12/2022 01:31,,"Millharbour, Millwall",26/12/2022 00:00,,"Millharbour, Millwall",22167,22167,CLASSIC,0,0


In [38]:
data_2022_clean_drop.isnull().sum()

Rental Id                     0
Duration                      0
Bike Id                       0
End Date                      0
EndStation Id           2867221
EndStation Name               0
Start Date                    0
StartStation Id         2555077
StartStation Name             0
Start station number    8677104
End station number      8677104
Bike model              8677104
Hour                          0
Day                           0
dtype: int64

In [92]:
# let's rename a couple of columns to make it clearer
# we will rename the Start and End station number column 
# these columns actually terminal to the station 'terminalName' as per https://tfl.gov.uk/tfl/syndication/feeds/cycle-hire/livecyclehireupdates.xml

data_2022_clean_drop = data_2022_clean_drop.rename(columns={'Start station number': 'SS Terminal Name', 'End station number': 'ES Terminal Name'})

In [93]:

# finally, 2022 filtering data - remove any rows that aren't from 2022
data_2022_clean_drop['Start Date Time'] = pd.to_datetime(data_2022_clean_drop["Start Date"], format="%d/%m/%Y %H:%M")
data_2022_clean_drop1 = data_2022_clean_drop[data_2022_clean_drop['Start Date Time'].dt.year == 2022]
print(data_2022_clean_drop1.shape)

MemoryError: Unable to allocate 85.2 MiB for an array with shape (11166111, 1) and data type datetime64[ns]

In [None]:
#removing the additional column 
data_2022_clean_drop2 = data_2022_clean_drop1 .drop(['Start Date Time'], axis=1)

In [None]:
data_2022_clean_drop2 

Unnamed: 0,Rental Id,Duration,Bike Id,End Date,EndStation Id,EndStation Name,Start Date,StartStation Id,StartStation Name,SS Terminal Name,ES Terminal Name,Bike model,Hour,Day
0,115967515.0,1260.000,15338.0,01/01/2022 23:13,310.0,"Black Prince Road, Vauxhall",01/01/2022 22:52,529.0,"Manresa Road, Chelsea",,,,22,5
1,116017034.0,720.000,19861.0,01/04/2022 19:08,11.0,"Brunswick Square, Bloomsbury",01/04/2022 18:56,804.0,"Good's Way, King's Cross",,,,18,4
3,116016563.0,480.000,19861.0,01/04/2022 18:46,804.0,"Good's Way, King's Cross",01/04/2022 18:38,57.0,"Guilford Street , Bloomsbury",,,,18,4
4,116014412.0,1260.000,17235.0,01/04/2022 17:45,14.0,"Belgrove Street , King's Cross",01/04/2022 17:24,297.0,"Geraldine Street, Elephant & Castle",,,,17,4
5,116013350.0,480.000,13790.0,01/04/2022 16:50,252.0,"Jubilee Gardens, South Bank",01/04/2022 16:42,310.0,"Black Prince Road, Vauxhall",,,,16,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11232176,127641458.0,6544.593,53664.0,26/12/2022 01:51,,"Queen Mary's, Mile End",26/12/2022 00:02,,"Woodstock Grove, Shepherd's Bush",200214,200249,CLASSIC,0,0
11232177,127641459.0,1936.877,54303.0,26/12/2022 00:34,,"Salmon Lane, Limehouse",26/12/2022 00:02,,"Curlew Street, Shad Thames",1213,200147,CLASSIC,0,0
11232178,127641453.0,2955.280,21426.0,26/12/2022 00:49,,"Langdon Park, Poplar",26/12/2022 00:00,,"Curlew Street, Shad Thames",1213,200160,CLASSIC,0,0
11232179,127641454.0,5427.555,54786.0,26/12/2022 01:31,,"Millharbour, Millwall",26/12/2022 00:00,,"Millharbour, Millwall",22167,22167,CLASSIC,0,0


In [41]:
bike_data_2022 = data_2022_clean_drop2.copy()

### Storing the data in an PostgreSQL databse

In [42]:
# psycopg2 library installed to connect to a PostgreSQL database from Python

import psycopg2
from sqlalchemy import create_engine

In [43]:
# connection to postgres database
conn = psycopg2.connect(
    user="postgres",
    password="password123",
    host="localhost",
    database="diss_data",
)


In [44]:
# Create a SQLAlchemy engine: Create a SQLAlchemy engine using the create_engine function, which will be used to write the DataFrame to the database.
engine = create_engine('postgresql+psycopg2://postgres:password123@localhost:5432/diss_data')

In [45]:
# Export the DataFrame to the database: Once you have the connection and engine set up, you can use the to_sql method of the DataFrame to export it to the database.
# save the DataFrame to the PostgreSQL database
# set the index parameter to False to avoid saving the DataFrame's index as a separate column in the database.
bike_data_2019.to_sql('bike_data_2019_tb', engine, if_exists='replace', index=False)

NameError: name 'bike_data_2019' is not defined

In [46]:
# save the DataFrame to the PostgreSQL database
bike_data_2022.to_sql('bike_data_2022_tb_v02', engine, if_exists='replace', index=False)

181

### Transforming the dataframes into a matrix, whereby the value of each cell is the number of events per hour

In [47]:
import pandas as pd
import psycopg2
import sqlalchemy
from sqlalchemy import create_engine

# connection to postgres database
conn = psycopg2.connect(
    user="postgres",
    password="password123",
    host="localhost",
    database="diss_data",
)

engine = sqlalchemy.create_engine('postgresql://postgres:password123@localhost:5432/diss_data')

# create a connection to the database
conn = psycopg2.connect(database="diss_data", user="postgres", password="password123", host="localhost", port="5432")



In [50]:
# define the SQL query to retrieve the data from the table
sql_query = "SELECT * FROM bike_data_2019_tb"

# use the read_sql function to read the table into a Pandas dataframe
df = pd.read_sql(sql_query, conn)


  df = pd.read_sql(sql_query, conn)


In [48]:
# doing ther same for 2022
# define the SQL query to retrieve the data from the table
sql_query_2022 = "SELECT * FROM bike_data_2022_tb_v02"

# use the read_sql function to read the table into a Pandas dataframe
df_2022 = pd.read_sql(sql_query_2022, conn)


  df_2022 = pd.read_sql(sql_query_2022, conn)


# Now lets create a matrix for all the data in 2019

In [51]:
#copying the dataframe
bike_data_2019 = df.copy()
bike_data_2022 = df_2022.copy()


In [67]:
def add_station_names(station_names, df, namecolumn, idcolumn):
    """Given a DataFrame df that has df[namecolumn] listing names of stations
    and df[idcolumn] listing station ID numbers, add to the dictionary
    station_names all the names that each ID is attached to.

    """
    namemaps = (
        df[[idcolumn, namecolumn]]
        .groupby(idcolumn)
        .aggregate(lambda x: x.unique())
    )
    for number, names in namemaps.iterrows():
        current_names = station_names.get(number, set())
        # The following two lines are a stupid dance around the annoying fact
        # that pd.unique sometimes returns a single value, sometimes a numpy
        # array of values, but since the single value is a string, it too is an
        # iterable.
        vals = names[0]
        new_names = set([vals]) if type(vals) == str else set(vals)
        current_names.update(new_names)
        station_names[number] = current_names


In [62]:
def clean_datetime_column(df, colname, roundto="H"):
    """Parse df[colname] from strings to datetime objects, and round the times
    to the nearest hour. 
    """

    format = "%d/%m/%Y %H:%M"
    df.loc[:, colname] = pd.to_datetime(df[colname], format=format)
    df.loc[:, colname] = df[colname].dt.round(roundto)

    return df

In [73]:
def compute_single_events(df, which):
    """Read from df all the events, either departures or arrivals depending on
    whether `which` is "Start" or "End", and collect them in a DataFrame that
    lists event counts per station and time.
    """
    stationcol = "{}Station Id".format(which)
    datecol = "{} Date".format(which)
    events = (
        df.rename(columns={stationcol: "Station", datecol: "Date"})
        .groupby(["Date", "Station"])
        .size()
        .unstack("Station")
    )
    return events

In [71]:
def compute_both_events(df):
    """Read from df all the events, both arrivals and departures, and collect
    them in a DataFrame that lists event counts per station and time.
    """
    arrivals = compute_single_events(df, "End")
    departures = compute_single_events(df, "Start")
    both = (
        pd.concat(
            [arrivals, departures], keys=["Arrivals", "Departures"], axis=1
        )
        .reorder_levels([1, 0], axis=1)
        .fillna(0.0)
    )
    return both

In [68]:
station_allnames = {}
add_station_names(station_allnames, bike_data_2019, "EndStation Name", "EndStation Id")
add_station_names(station_allnames, bike_data_2019, "StartStation Name", "StartStation Id")

In [54]:
station_allnames_newdic = {key: value.pop() for key, value in station_allnames.items()}

print(station_allnames_newdic)

NameError: name 'station_allnames' is not defined

In [55]:
#clean start and end dates
bd_data_2019_clean1 = clean_datetime_column(bike_data_2019, "Start Date", roundto="H")
bd_data_2019_clean2 = clean_datetime_column(bd_data_2019_clean1, "End Date", roundto="H")

  df.loc[:, colname] = pd.to_datetime(df[colname], format=format)
  df.loc[:, colname] = df[colname].dt.round(roundto)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df.loc[:, colname] = pd.to_datetime(df[colname], format=format)
  df.loc[:, colname] = pd.to_datetime(df[colname], format=format)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df.loc[:, colname] = df[colname].dt.round(roundto)
  df.loc[:, colname] = df[colname].dt.round(roundto)


In [59]:
bike_data_2022.sort_values(by="End Date")

Unnamed: 0,Rental Id,Duration,Bike Id,End Date,EndStation Id,EndStation Name,Start Date,StartStation Id,StartStation Name,Start station number,End station number,Bike model,Hour,Day
70045,115944485.0,780.000,17247.0,01/01/2022 00:00,804.0,"Good's Way, King's Cross",2022-01-01 00:00:00,109.0,"Soho Square , Soho",,,,23,4
97426,115944427.0,900.000,16287.0,01/01/2022 00:00,100.0,"Albert Embankment, Vauxhall",2022-01-01 00:00:00,653.0,"Simpson Street, Clapham Junction",,,,23,4
105178,115943720.0,2520.000,6665.0,01/01/2022 00:00,213.0,"Wellington Arch, Hyde Park",2021-12-31 23:00:00,839.0,"Sea Containers, South Bank",,,,23,4
79304,115944542.0,660.000,21514.0,01/01/2022 00:00,163.0,"Sloane Avenue, Knightsbridge",2022-01-01 00:00:00,169.0,"Porchester Place, Paddington",,,,23,4
7682,115944488.0,780.000,17373.0,01/01/2022 00:00,804.0,"Good's Way, King's Cross",2022-01-01 00:00:00,109.0,"Soho Square , Soho",,,,23,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11183500,127691861.0,992.086,53770.0,31/12/2022 23:57,,"Lower Thames Street, Monument",2023-01-01 00:00:00,,"Pott Street, Bethnal Green",200156,1098,CLASSIC,23,5
11183944,127691406.0,3039.877,53091.0,31/12/2022 23:57,,"Abbey Orchard Street, Westminster",2022-12-31 23:00:00,,"Grosvenor Square, Mayfair",10627,3429,CLASSIC,23,5
11183599,127691782.0,1399.255,41612.0,31/12/2022 23:58,,"Kennington Lane Rail Bridge, Vauxhall",2023-01-01 00:00:00,,"Sheepcote Lane, Battersea",200176,1190,CLASSIC,23,5
11183597,127691780.0,1395.675,41081.0,31/12/2022 23:58,,"Kennington Lane Rail Bridge, Vauxhall",2023-01-01 00:00:00,,"Sheepcote Lane, Battersea",200176,1190,CLASSIC,23,5


In [66]:
#clean start and end dates
bd_data_2022_clean1 = clean_datetime_column(bike_data_2022, "Start Date", roundto="H")
bd_data_2022_clean2 = clean_datetime_column(bd_data_2022_clean1, "End Date", roundto="H")
bd_data_2022_clean2

  df.loc[:, colname] = pd.to_datetime(df[colname], format=format)
  df.loc[:, colname] = df[colname].dt.round(roundto)
  df.loc[:, colname] = pd.to_datetime(df[colname], format=format)
  df.loc[:, colname] = df[colname].dt.round(roundto)


In [97]:
bd_data_2022_clean2.sort_values(by="Start Date")

MemoryError: Unable to allocate 42.8 MiB for an array with shape (11232181,) and data type int32

In [76]:
events_data_2019 = compute_both_events(bd_data_2019_clean2)

In [74]:
events_data_2022 = compute_both_events(bd_data_2022_clean2)

In [186]:
# Finally rename the columns according to the chosen names for stations.
events_2019 = events_data_2019.rename(mapper=station_allnames_newdic, axis=1, level=0)
events_2019 = events_2019.sort_index(axis=1, level=0)

In [77]:
events_data_2019

Station,1,2,3,4,5,6,7,8,9,10,...,829,830,831,832,833,834,835,836,838,839
Unnamed: 0_level_1,Arrivals,Arrivals,Arrivals,Arrivals,Arrivals,Arrivals,Arrivals,Arrivals,Arrivals,Arrivals,...,Departures,Departures,Departures,Departures,Departures,Departures,Departures,Departures,Departures,Departures
Date,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2
2019-01-01 00:00:00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,4.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2019-01-01 01:00:00,1.0,0.0,0.0,5.0,1.0,0.0,0.0,2.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
2019-01-01 02:00:00,0.0,0.0,5.0,2.0,1.0,0.0,2.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,2.0,0.0
2019-01-01 03:00:00,1.0,1.0,2.0,0.0,2.0,1.0,1.0,1.0,0.0,0.0,...,0.0,0.0,7.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0
2019-01-01 04:00:00,0.0,0.0,0.0,2.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,4.0,0.0,0.0,3.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2019-12-31 19:00:00,0.0,2.0,0.0,0.0,3.0,0.0,1.0,1.0,3.0,0.0,...,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,3.0
2019-12-31 20:00:00,0.0,0.0,0.0,0.0,1.0,0.0,3.0,2.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,2.0,0.0,1.0
2019-12-31 21:00:00,0.0,3.0,0.0,0.0,0.0,1.0,0.0,0.0,4.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,2.0,0.0,0.0,0.0
2019-12-31 22:00:00,0.0,0.0,0.0,0.0,1.0,0.0,0.0,2.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [75]:
events_data_2022

Station,1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0,10.0,...,836.0,838.0,839.0,840.0,841.0,842.0,844.0,845.0,846.0,850.0
Unnamed: 0_level_1,Arrivals,Arrivals,Arrivals,Arrivals,Arrivals,Arrivals,Arrivals,Arrivals,Arrivals,Arrivals,...,Departures,Departures,Departures,Departures,Departures,Departures,Departures,Departures,Departures,Departures
Date,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2
2021-12-29 00:00:00,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,4.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0
2021-12-29 01:00:00,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0
2021-12-29 02:00:00,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,2.0,2.0,0.0,0.0,0.0,0.0,0.0
2021-12-29 03:00:00,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2021-12-29 04:00:00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2022-12-09 09:00:00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2022-12-09 10:00:00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2022-12-09 15:00:00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2022-12-09 18:00:00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Exporting the the events data frame as a pickle

In [203]:
import pickle
import os

events_path = Path("data/events_2019.p")

# Store the file on disk so we can read it later.
events_2019.to_pickle(events_path)