# TfL Bike data prep
___

#### Data prep as part of my MSc thesis, "Using machine learning to analyse and predict Transport for London bike sharing habits in the post COVID-19 era".

The following code for downloading the data has been adapted from [Markus Hauru's](https://github.com/mhauru) analysis, 'Predicting Boris Bike usage'.



In [5]:
# importing libraries

import os
import pickle
import requests
import zipfile
import pandas as pd
import numpy as np
import scipy as sp
import statsmodels.api as sm
from sklearn import linear_model, svm, neighbors, tree
from matplotlib import pyplot as plt
import matplotlib
import seaborn as sns
from pathlib import Path
from timeit import default_timer as timer
from IPython.display import set_matplotlib_formats
from urllib.parse import urlparse
import openpyxl

try:
    import xlrd
except Exception as e:
    msg = (
        "Please install the package xlrd: `pip install --user xlrd`. "
        "It's an optional requirement for pandas, and we'll be needing it."
    )
    print(msg)
    raise e

In [6]:
# For pretty and exportable matplotlib plots.
# If you are running this yourself and want interactivity,
# try `%matplotlib widget` instead.
set_matplotlib_formats("svg")
%matplotlib inline
# %matplotlib widget
# Set a consistent plotting style across the notebook using Seaborn.
sns.set_style("darkgrid")
sns.set_context("notebook")
# Make pandas cooperate with pyplot
pd.plotting.register_matplotlib_converters()




### 1. Processing and cleaning the bike data
Before getting anywhere with it, we'll need to process the bike data quite a bit. The data comes in CSV files, each of which covers a period of time. Up first, we need to download the data from the TfL website. If you are running this code yourself, here's a script that does that. Be warned though, it's almost seven gigs of data. You can run it repeatedly, and it'll only download data that it doesn't have already.

In [7]:
bikefolder = "data/bikes"

In [8]:
def download_file(datafolder, url, verbosity=0):
    """Download the data from the given URL into the datafolder, unless it's
    already there. Return path to downloaded file.
    """
    # Convert the data folder string into a Path and make sure the folder exists.
    # This is where the downloaded file will be stored.
    datafolder = Path(datafolder)
    datafolder.mkdir(parents=True, exist_ok=True)

    # Use urlparse to extract the file name from the URL and build the local file path
    a = urlparse(url)
    filename = Path(os.path.basename(a.path))
    filepath = datafolder / filename
    # Don't redownload if we already have this file.
    if filepath.exists():
        if verbosity > 1:
            print("Already have {}".format(filename))
    else:
        if verbosity > 0:
            print("Downloading {}".format(filename))
        # sends a GET request to the URL using the requests module and raises an exception if there is an error
        rqst = requests.get(url)
        rqst.raise_for_status()
        with open(filepath, "wb") as f:
            f.write(rqst.content)
    return filepath


In [9]:
# Adjust whether to print progress reports of the downloads.
# verbosity=0 is silence, verbosity=1 reports only when actually doing things,
# verbosity>1 also reports when there's nothing to do.
verbosity = 1

# Most files are individual CSV files, listed in bike_data_urls.txt. Download them.
urlsfile = "data/bikes/bike_data_urls.txt"
with open(urlsfile, "r") as f:
    urls = f.read().splitlines()
# There are a few comments in the file, marked by lines starting with #.
# Filter them out, along with any blank lines.
urls = [u for u in urls if u.strip() and not u.startswith("#")]
for url in urls:
    download_file(bikefolder, url, verbosity)

# The early years come in zips. Download and unzip them.
zipsfolder = Path("data/bikes/bikezips")
bikezipurls = [
    "https://cycling.data.tfl.gov.uk/usage-stats/cyclehireusagestats-2012.zip",
    "https://cycling.data.tfl.gov.uk/usage-stats/cyclehireusagestats-2013.zip",
    "https://cycling.data.tfl.gov.uk/usage-stats/cyclehireusagestats-2014.zip",
    "https://cycling.data.tfl.gov.uk/usage-stats/2015TripDatazip.zip",
    "https://cycling.data.tfl.gov.uk/usage-stats/2016TripDataZip.zip",
]
# A list of CSV files that are already there. Only unzip if some of the files
# in the zip aren't present already.
current_csvs = sorted(os.listdir(bikefolder))
for url in bikezipurls:
    zippath = download_file(zipsfolder, url, verbosity)
    with zipfile.ZipFile(zippath, "r") as z:
        namelist = z.namelist()
        needs_extracting = any(name not in current_csvs for name in namelist)
        if needs_extracting:
            if verbosity > 0:
                print("Unzipping {}".format(zippath))
            z.extractall(bikefolder)
        else:
            if verbosity > 1:
                print("{} has already been extracted.".format(zippath))

# Finally, there's an odd one out: One week's data comes in as an .xlsx.
# Download it and use pandas to convert it to csv.
xlsxurl = "https://cycling.data.tfl.gov.uk/usage-stats/49JourneyDataExtract15Mar2017-21Mar2017.xlsx"
xlsxfile = download_file(bikefolder, xlsxurl)
csvfile = xlsxfile.with_suffix(".csv")
if not csvfile.exists():
    if verbosity > 0:
        print("Converting .xlsx to .csv.")
    pd.read_excel(xlsxfile).to_csv(csvfile, date_format="%d/%m/%Y %H:%M:%S")
else:
    if verbosity > 1:
        print("Already have {}".format(csvfile))

The data we have now lists on each line of the CSV file a single bike trip, with starting point and time, end point and time, and things like bike ID number. Here's an example.

In [10]:
example_file  = Path(bikefolder) / Path("47JourneyDataExtract01Mar2017-07Mar2017.csv")
pd.read_csv(example_file, encoding="ISO-8859-2").head()

Unnamed: 0,Rental Id,Duration,Bike Id,End Date,EndStation Id,EndStation Name,Start Date,StartStation Id,StartStation Name
0,62857677,3780.0,7851,06/03/2017 19:20,43.0,"Crawford Street, Marylebone",06/03/2017 18:17,811,"Westferry Circus, Canary Wharf"
1,62863035,540.0,4089,06/03/2017 22:17,295.0,"Swan Street, The Borough",06/03/2017 22:08,272,"Baylis Road, Waterloo"
2,62775896,600.0,4895,02/03/2017 21:27,295.0,"Swan Street, The Borough",02/03/2017 21:17,197,"Stamford Street, South Bank"
3,62747748,420.0,4347,01/03/2017 21:08,295.0,"Swan Street, The Borough",01/03/2017 21:01,803,"Southwark Street, Bankside"
4,62843939,420.0,3192,06/03/2017 09:28,193.0,"Bankside Mix, Bankside",06/03/2017 09:21,197,"Stamford Street, South Bank"


In [11]:
bikefolder

'data/bikes'

In [12]:
from glob import glob 

# Use glob to list all the CSV files in the bike folder.
# Sorting keeps the order stable, which the list slicing further down relies on.
all_csv = sorted(glob(os.path.join(bikefolder, "*.csv")))
all_csv

['data/bikes\\01aJourneyDataExtract10Jan16-23Jan16.csv',
 'data/bikes\\01bJourneyDataExtract24Jan16-06Feb16.csv',
 'data/bikes\\02aJourneyDataExtract07Feb16-20Feb2016.csv',
 'data/bikes\\02bJourneyDataExtract21Feb16-05Mar2016.csv',
 'data/bikes\\03JourneyDataExtract06Mar2016-31Mar2016.csv',
 'data/bikes\\04JourneyDataExtract01Apr2016-30Apr2016.csv',
 'data/bikes\\05JourneyDataExtract01May2016-17May2016.csv',
 'data/bikes\\06JourneyDataExtract18May2016-24May2016.csv',
 'data/bikes\\07JourneyDataExtract25May2016-31May2016.csv',
 'data/bikes\\08JourneyDataExtract01Jun2016-07Jun2016.csv',
 'data/bikes\\09JourneyDataExtract08Jun2016-14Jun2016.csv',
 'data/bikes\\1. Journey Data Extract 01Jan-05Jan13.csv',
 'data/bikes\\1. Journey Data Extract 04Jan-31Jan 12.csv',
 'data/bikes\\1. Journey Data Extract 05Jan14-02Feb14.csv',
 'data/bikes\\10. Journey Data Extract 18Aug-13Sep13.csv',
 'data/bikes\\10. Journey Data Extract 21Aug-22 Aug12.csv',
 'data/bikes\\10a Journey Data Extract 20Sep15-03Oct

### 2019 data prep

In [13]:
# creating a list of csv files that contain '2019' and '2022' respectively
csv_2019 = [item for item in all_csv if '2019' in item]
csv_2022 = [item for item in all_csv if '2022' in item]
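Matching on the substring `'2019'` works for these filenames, but it would also pick up any path that happens to contain those characters. As a rough sketch of a stricter alternative, the date range can be parsed out of the filename itself. The helper names `file_date_range` and `covers_year` are hypothetical, and the regular expression only handles the `DDMonYYYY-DDMonYYYY` naming seen in the later files; the older files with two-digit or missing years would need extra patterns.

```python
import re
from datetime import datetime

# Matches names such as "142JourneyDataExtract26Dec2018-01Jan2019.csv".
# Assumption: four-digit years on both sides of the hyphen.
_RANGE = re.compile(r"(\d{2}[A-Za-z]{3}\d{4})-(\d{2}[A-Za-z]{3}\d{4})\.csv$")


def file_date_range(path):
    """Return (start, end) dates parsed from a filename, or None if it does not match."""
    match = _RANGE.search(path)
    if match is None:
        return None
    # %b parses abbreviated month names and depends on the locale (English assumed).
    start, end = (datetime.strptime(s, "%d%b%Y").date() for s in match.groups())
    return start, end


def covers_year(path, year):
    """True if the file's date range overlaps the given calendar year."""
    date_range = file_date_range(path)
    return date_range is not None and date_range[0].year <= year <= date_range[1].year


name = "data/bikes\\142JourneyDataExtract26Dec2018-01Jan2019.csv"
print(file_date_range(name))
print(covers_year(name, 2019), covers_year(name, 2020))
```

A file that straddles two years, like the one above, is reported for both, so rows from the neighbouring year still need filtering after loading.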

In [14]:
csv_2019

['data/bikes\\142JourneyDataExtract26Dec2018-01Jan2019.csv',
 'data/bikes\\143JourneyDataExtract02Jan2019-08Jan2019.csv',
 'data/bikes\\144JourneyDataExtract09Jan2019-15Jan2019.csv',
 'data/bikes\\145JourneyDataExtract16Jan2019-22Jan2019.csv',
 'data/bikes\\146JourneyDataExtract23Jan2019-29Jan2019.csv',
 'data/bikes\\147JourneyDataExtract30Jan2019-05Feb2019.csv',
 'data/bikes\\148JourneyDataExtract06Feb2019-12Feb2019.csv',
 'data/bikes\\149JourneyDataExtract13Feb2019-19Feb2019.csv',
 'data/bikes\\150JourneyDataExtract20Feb2019-26Feb2019.csv',
 'data/bikes\\151JourneyDataExtract27Feb2019-05Mar2019.csv',
 'data/bikes\\152JourneyDataExtract06Mar2019-12Mar2019.csv',
 'data/bikes\\153JourneyDataExtract13Mar2019-19Mar2019.csv',
 'data/bikes\\154JourneyDataExtract20Mar2019-26Mar2019.csv',
 'data/bikes\\155JourneyDataExtract27Mar2019-02Apr2019.csv',
 'data/bikes\\156JourneyDataExtract03Apr2019-09Apr2019.csv',
 'data/bikes\\157JourneyDataExtract10Apr2019-16Apr2019.csv',
 'data/bikes\\158Journey

In [15]:
# Use a generator expression to read each CSV file in the list, yielding one DataFrame at a time
dfs = (pd.read_csv(csv) for csv in csv_2019)

# Concatenate the DataFrames into a single DataFrame using pd.concat().
# The ignore_index=True parameter resets the index of the resulting DataFrame, so that it is a continuous sequence of integers.
data_2019 = pd.concat(dfs, ignore_index=True)

In [85]:
print(data_2019.shape)
data_2019.head()

(10388411, 9)


Unnamed: 0,Rental Id,Duration,Bike Id,End Date,EndStation Id,EndStation Name,Start Date,StartStation Id,StartStation Name
0,83252102,720,2077,31/12/2018 19:05,272,"Baylis Road, Waterloo",31/12/2018 18:53,94,"Bricklayers Arms, Borough"
1,83195883,120,10781,27/12/2018 19:47,93,"Cloudesley Road, Angel",27/12/2018 19:45,339,"Risinghill Street, Angel"
2,83196070,120,2977,27/12/2018 20:11,339,"Risinghill Street, Angel",27/12/2018 20:09,234,"Liverpool Road (N1 Centre), Angel"
3,83197932,660,10802,28/12/2018 07:35,282,"Royal London Hospital, Whitechapel",28/12/2018 07:24,698,"Shoreditch Court, Haggerston"
4,83176351,1380,15749,26/12/2018 11:55,785,"Aquatic Centre, Queen Elizabeth Olympic Park",26/12/2018 11:32,783,"Monier Road, Hackney Wick"


In [86]:
# 2019

## Add some extra variables to the dataset for use later in filtering

import datetime

## Passing an explicit date format speeds up the pd.to_datetime function considerably, especially over large datasets
## e.g. http://stackoverflow.com/questions/32034689/why-is-pandas-to-datetime-slow-for-non-standard-time-format-such-as-2014-12-31

date_format = "%d/%m/%Y %H:%M"

## Some routes had dates with a seconds component, whereas some didn't - the code below cuts these seconds off
data_2019['Start Date'] = data_2019['Start Date'].str[:16]

## Parse the timestamps once, then derive the other columns from the parsed values
data_2019['Start Date Time'] = pd.to_datetime(data_2019['Start Date'], format=date_format)

data_2019['Hours'] = data_2019['Start Date Time'].dt.hour

data_2019['Minute'] = data_2019['Start Date Time'].dt.minute

## Day of the week, where Monday is 0 and Sunday is 6
data_2019['Day'] = data_2019['Start Date Time'].dt.weekday

data_2019.head()


Unnamed: 0,Rental Id,Duration,Bike Id,End Date,EndStation Id,EndStation Name,Start Date,StartStation Id,StartStation Name,Start Date Time,Hours,Minute,Day
0,83252102,720,2077,31/12/2018 19:05,272,"Baylis Road, Waterloo",31/12/2018 18:53,94,"Bricklayers Arms, Borough",2018-12-31 18:53:00,18,53,0
1,83195883,120,10781,27/12/2018 19:47,93,"Cloudesley Road, Angel",27/12/2018 19:45,339,"Risinghill Street, Angel",2018-12-27 19:45:00,19,45,3
2,83196070,120,2977,27/12/2018 20:11,339,"Risinghill Street, Angel",27/12/2018 20:09,234,"Liverpool Road (N1 Centre), Angel",2018-12-27 20:09:00,20,9,3
3,83197932,660,10802,28/12/2018 07:35,282,"Royal London Hospital, Whitechapel",28/12/2018 07:24,698,"Shoreditch Court, Haggerston",2018-12-28 07:24:00,7,24,4
4,83176351,1380,15749,26/12/2018 11:55,785,"Aquatic Centre, Queen Elizabeth Olympic Park",26/12/2018 11:32,783,"Monier Road, Hackney Wick",2018-12-26 11:32:00,11,32,2
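To make the truncation step concrete, here is a small standard-library sketch of the same idea applied to single values. The timestamp with a seconds component is an invented example of the longer form, and `parse_start` is a hypothetical helper rather than part of the notebook.

```python
from datetime import datetime

DATE_FORMAT = "%d/%m/%Y %H:%M"


def parse_start(raw):
    """Parse a start timestamp, ignoring any trailing ':SS' seconds component."""
    # "dd/mm/YYYY HH:MM" is exactly 16 characters long, so slicing drops the seconds.
    return datetime.strptime(raw[:16], DATE_FORMAT)


with_seconds = parse_start("31/12/2018 18:53:00")  # invented value with seconds
without_seconds = parse_start("31/12/2018 18:53")

print(with_seconds == without_seconds)
# weekday() counts from Monday = 0, the same convention as pandas' .dt.weekday
print(with_seconds.hour, with_seconds.minute, with_seconds.weekday())
```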


In [95]:
# 2019 filtering data - remove any rows that aren't from 2019
# remember the first csv contained data from 2018... 26Dec2018-01Jan2019.csv
bike_data_2019 = data_2019[data_2019['Start Date Time'].dt.year == 2019]
print(bike_data_2019.shape)

(10310063, 13)


### 2022 data prep

- In September 2022 the column names changed slightly and additional columns were added
- for example, a 'Bike model' column was added (CLASSIC or PBSC_EBIKE)

Cycle Hire Data - data format change & new data: https://techforum.tfl.gov.uk/t/cycle-hire-data-data-format-change-new-data/2520
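One possible way to reconcile the two layouts is to rename the new columns to their old counterparts before combining anything. The sketch below is only an illustration: the mapping `NEW_TO_OLD` and the helper `harmonise` are assumptions inferred from the column names, and the single sample row is copied from the data shown in this notebook.

```python
import pandas as pd

# New-layout name -> old-layout name. Only columns with an obvious counterpart
# are mapped; the station numbers and 'Bike model' are left untouched.
NEW_TO_OLD = {
    "Number": "Rental Id",
    "Start date": "Start Date",
    "Start station": "StartStation Name",
    "End date": "End Date",
    "End station": "EndStation Name",
    "Bike number": "Bike Id",
}


def harmonise(df):
    """Return a copy of a new-layout frame that uses the old column names."""
    out = df.rename(columns=NEW_TO_OLD)
    if "Total duration (ms)" in out.columns:
        # The old 'Duration' column is in seconds; the new one is in milliseconds.
        out["Duration"] = out["Total duration (ms)"] / 1000
        out = out.drop(columns=["Total duration", "Total duration (ms)"], errors="ignore")
    return out


new_rows = pd.DataFrame({
    "Number": [127641458],
    "Start date": ["2022-12-26 00:02"],
    "Start station number": [200214],
    "Start station": ["Woodstock Grove, Shepherd's Bush"],
    "End date": ["2022-12-26 01:51"],
    "End station number": [200249],
    "End station": ["Queen Mary's, Mile End"],
    "Bike number": [53664],
    "Bike model": ["CLASSIC"],
    "Total duration": ["1h 49m 4s"],
    "Total duration (ms)": [6544593],
})

harmonised = harmonise(new_rows)
print(harmonised.columns.tolist())
```

Renaming first means a later `pd.concat` lines the shared columns up directly, at the cost of having to handle the two date formats separately before the frames are combined.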

### Exploring the 2022 data

In [98]:
csv_2022 = [item for item in all_csv if '2022' in item]
csv_2022

['data/bikes\\298JourneyDataExtract29Dec2021-04Jan2022.csv',
 'data/bikes\\299JourneyDataExtract05Jan2022-11Jan2022.csv',
 'data/bikes\\300JourneyDataExtract12Jan2022-18Jan2022.csv',
 'data/bikes\\301JourneyDataExtract19Jan2022-25Jan2022.csv',
 'data/bikes\\302JourneyDataExtract26Jan2022-01Feb2022.csv',
 'data/bikes\\303JourneyDataExtract02Feb2022-08Feb2022.csv',
 'data/bikes\\304JourneyDataExtract09Feb2022-15Feb2022.csv',
 'data/bikes\\305JourneyDataExtract16Feb2022-22Feb2022.csv',
 'data/bikes\\306JourneyDataExtract23Feb2022-01Mar2022.csv',
 'data/bikes\\307JourneyDataExtract02Mar2022-08Mar2022.csv',
 'data/bikes\\308JourneyDataExtract09Mar2022-15Mar2022.csv',
 'data/bikes\\309JourneyDataExtract16Mar2022-22Mar2022.csv',
 'data/bikes\\310JourneyDataExtract23Mar2022-29Mar2022.csv',
 'data/bikes\\311JourneyDataExtract30Mar2022-05Apr2022.csv',
 'data/bikes\\312JourneyDataExtract06Apr2022-12Apr2022.csv',
 'data/bikes\\313JourneyDataExtract13Apr2022-19Apr2022.csv',
 'data/bikes\\314Journey

In [None]:
# Part 1: CSVs from before the September 2022 format change
# use slicing to include all elements of the list except for the last 16
csv_2022_p1 = csv_2022[:-16]

# Part 2: CSVs from 12 September 2022 onwards
# use slicing to create a new list that includes only the last 16 elements
csv_2022_p2 = csv_2022[-16:]
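Slicing off the last 16 files depends on the list order and on how many files have been downloaded. A more defensive option, sketched here under the assumption that only the new layout has a column called `Number`, is to look at each file's header row. The helpers `header_of` and `split_by_layout` are hypothetical, and the two throwaway files exist only to make the example runnable.

```python
import csv
import tempfile
from pathlib import Path


def header_of(path):
    """Return the column names from the first line of a CSV file."""
    # utf-8-sig strips a byte-order mark if present; undecodable bytes are replaced.
    with open(path, newline="", encoding="utf-8-sig", errors="replace") as f:
        return next(csv.reader(f), [])


def split_by_layout(paths):
    """Partition paths into (old_layout, new_layout) by inspecting their headers."""
    old, new = [], []
    for path in paths:
        (new if "Number" in header_of(path) else old).append(path)
    return old, new


# Tiny demonstration with two temporary files, one per layout.
with tempfile.TemporaryDirectory() as tmp:
    old_file = Path(tmp) / "old.csv"
    new_file = Path(tmp) / "new.csv"
    old_file.write_text("Rental Id,Duration,Bike Id\n1,60,2\n", encoding="utf-8")
    new_file.write_text("Number,Start date,Bike model\n1,2022-12-26 00:02,CLASSIC\n", encoding="utf-8")
    old_paths, new_paths = split_by_layout([old_file, new_file])
    demo = ([p.name for p in old_paths], [p.name for p in new_paths])

print(demo)
```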

In [None]:
# doing the same for the 2022 part 1 data
# skipping malformed lines within the csv files as per https://stackoverflow.com/questions/52105659/pandas-read-csv-unexpected-end-of-data-error
dfs_2022_p1 = (pd.read_csv(csv, engine='python', encoding='utf-8', on_bad_lines='skip') for csv in csv_2022_p1)
data_2022_p1 = pd.concat(dfs_2022_p1, ignore_index=True)

In [None]:
data_2022_p1.isnull().sum()
# for the part 1 data, there were 312144 records with null end station ids

#es_id_null = data_2022_p1.loc[data_2022_p1['EndStation Id'].isnull()] 
#es_id_null.sort_values(by='Start Date', ascending=False)

# filtering the data as above reveals that the journeys taken between 06/07/2022 00:00 and 12/07/2022 23:56 did not record an end station Id

Rental Id                 0
Duration                  0
Bike Id                   0
End Date                  0
EndStation Id        312144
EndStation Name           0
Start Date                0
StartStation Id           0
StartStation Name         0
dtype: int64

In [None]:
data_2022_p1.count()

Rental Id            8677104
Duration             8677104
Bike Id              8677104
End Date             8677104
EndStation Id        8364960
EndStation Name      8677104
Start Date           8677104
StartStation Id      8677104
StartStation Name    8677104
dtype: int64

In [None]:
# read in the part 2 data; the date columns are read as plain strings here
dfs_2022_p2 = (pd.read_csv(csv) for csv in csv_2022_p2)
# alternative: parse the date columns while reading
#dfs_2022_p2 = (pd.read_csv(csv, parse_dates=['Start date', 'End date']) for csv in csv_2022_p2)
data_2022_p2 = pd.concat(dfs_2022_p2, ignore_index=True)



In [None]:
data_2022_p2.isnull().sum()

Number                  0
Start date              0
Start station number    0
Start station           0
End date                0
End station number      0
End station             0
Bike number             0
Bike model              0
Total duration          0
Total duration (ms)     0
dtype: int64

In [None]:
data_2022_p2.count()

Number                  2555077
Start date              2555077
Start station number    2555077
Start station           2555077
End date                2555077
End station number      2555077
End station             2555077
Bike number             2555077
Bike model              2555077
Total duration          2555077
Total duration (ms)     2555077
dtype: int64

In [52]:
# now reading all of the 2022 CSVs (both formats) into a single DataFrame
# skipping malformed lines within the csv files as per https://stackoverflow.com/questions/52105659/pandas-read-csv-unexpected-end-of-data-error
dfs_2022 = (pd.read_csv(csv, engine='python', encoding='utf-8', on_bad_lines='skip') for csv in csv_2022)
data_2022 = pd.concat(dfs_2022, ignore_index=True)

In [26]:
# check the data type of the 'Start date' column
print(data_2022['Start date'].dtype)

object


In [57]:
# 2022

# Let's clean this up and get all the data into single columns


# creating a copy of the original data
data_2022_clean = data_2022.copy()


In [59]:
# let's start by sorting out the date time formatting
# old layout: day first with slashes, e.g. 01/01/2022 22:52
old_format = "%d/%m/%Y %H:%M"
# new layout: year first with hyphens, e.g. 2022-12-26 00:02
new_format = "%Y-%m-%d %H:%M"

data_2022_clean['Start Date'] = data_2022_clean['Start Date'].str[:16]
data_2022_clean['Start Date Time'] = pd.to_datetime(data_2022_clean['Start Date'], format=old_format)
data_2022_clean['Start Date Time 2'] = pd.to_datetime(data_2022_clean['Start date'], format=new_format)

In [60]:
data_2022_clean

Unnamed: 0,Rental Id,Duration,Bike Id,End Date,EndStation Id,EndStation Name,Start Date,StartStation Id,StartStation Name,Number,...,Start station,End date,End station number,End station,Bike number,Bike model,Total duration,Total duration (ms),Start Date Time,Start Date Time 2
0,115967515.0,1260.0,15338.0,01/01/2022 23:13,310.0,"Black Prince Road, Vauxhall",01/01/2022 22:52,529.0,"Manresa Road, Chelsea",,...,,,,,,,,,2022-01-01 22:52:00,NaT
1,116017034.0,720.0,19861.0,04/01/2022 19:08,11.0,"Brunswick Square, Bloomsbury",04/01/2022 18:56,804.0,"Good's Way, King's Cross",,...,,,,,,,,,2022-01-04 18:56:00,NaT
2,115895660.0,360.0,19666.0,29/12/2021 16:34,70.0,"Calshot Street , King's Cross",29/12/2021 16:28,57.0,"Guilford Street , Bloomsbury",,...,,,,,,,,,2021-12-29 16:28:00,NaT
3,116016563.0,480.0,19861.0,04/01/2022 18:46,804.0,"Good's Way, King's Cross",04/01/2022 18:38,57.0,"Guilford Street , Bloomsbury",,...,,,,,,,,,2022-01-04 18:38:00,NaT
4,116014412.0,1260.0,17235.0,04/01/2022 17:45,14.0,"Belgrove Street , King's Cross",04/01/2022 17:24,297.0,"Geraldine Street, Elephant & Castle",,...,,,,,,,,,2022-01-04 17:24:00,NaT
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11232176,,,,,,,,,,127641458.0,...,"Woodstock Grove, Shepherd's Bush",2022-12-26 01:51,200249,"Queen Mary's, Mile End",53664.0,CLASSIC,1h 49m 4s,6544593.0,NaT,2022-12-26 00:02:00
11232177,,,,,,,,,,127641459.0,...,"Curlew Street, Shad Thames",2022-12-26 00:34,200147,"Salmon Lane, Limehouse",54303.0,CLASSIC,32m 16s,1936877.0,NaT,2022-12-26 00:02:00
11232178,,,,,,,,,,127641453.0,...,"Curlew Street, Shad Thames",2022-12-26 00:49,200160,"Langdon Park, Poplar",21426.0,CLASSIC,49m 15s,2955280.0,NaT,2022-12-26 00:00:00
11232179,,,,,,,,,,127641454.0,...,"Millharbour, Millwall",2022-12-26 01:31,22167,"Millharbour, Millwall",54786.0,CLASSIC,1h 30m 27s,5427555.0,NaT,2022-12-26 00:00:00


In [62]:
data_2022_clean.loc[data_2022_clean['Start Date Time'].isnull(), 'Start Date Time'] = data_2022_clean['Start Date Time 2']

In [63]:
data_2022_clean.isnull().sum()

Rental Id               2555077
Duration                2555077
Bike Id                 2555077
End Date                2555077
EndStation Id           2867221
EndStation Name         2555077
Start Date              2555077
StartStation Id         2555077
StartStation Name       2555077
Number                  8677104
Start date              8677104
Start station number    8677104
Start station           8677104
End date                8677104
End station number      8677104
End station             8677104
Bike number             8677104
Bike model              8677104
Total duration          8677104
Total duration (ms)     8677104
Start Date Time               0
Start Date Time 2       8677104
dtype: int64

In [64]:
# transferring values from one pandas column to another, only for rows where the target column is null

data_2022_clean.loc[data_2022_clean['Rental Id'].isnull(), 'Rental Id'] = data_2022_clean['Number']
# converting from milliseconds to seconds by dividing by 1000
data_2022_clean.loc[data_2022_clean['Duration'].isnull(), 'Duration'] = data_2022_clean['Total duration (ms)'] / 1000
data_2022_clean.loc[data_2022_clean['Bike Id'].isnull(), 'Bike Id'] = data_2022_clean['Bike number']
data_2022_clean.loc[data_2022_clean['End Date'].isnull(), 'End Date'] = data_2022_clean['End date']
data_2022_clean.loc[data_2022_clean['EndStation Name'].isnull(), 'EndStation Name'] = data_2022_clean['End station']
data_2022_clean.loc[data_2022_clean['Start Date'].isnull(), 'Start Date'] = data_2022_clean['Start date']
data_2022_clean.loc[data_2022_clean['StartStation Name'].isnull(), 'StartStation Name'] = data_2022_clean['Start station']

#data_2022_clean.sort_values(by='Bike model', ascending=False)
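The new layout carries the duration twice: as text in 'Total duration' and as milliseconds in 'Total duration (ms)'. As a quick sanity check on the unit conversion, the text form can be parsed and compared with the millisecond value. This is a standard-library sketch; `duration_to_seconds` is a hypothetical helper, the `d` (days) unit is an assumption that does not appear in the rows shown here, and the four sample pairs are copied from the output above.

```python
import re

_PART = re.compile(r"(\d+)\s*([dhms])")
_SECONDS_PER_UNIT = {"d": 86400, "h": 3600, "m": 60, "s": 1}


def duration_to_seconds(text):
    """Convert a string such as '1h 49m 4s' into whole seconds."""
    return sum(int(number) * _SECONDS_PER_UNIT[unit] for number, unit in _PART.findall(text))


# ('Total duration', 'Total duration (ms)') pairs taken from the rows shown above.
samples = [
    ("1h 49m 4s", 6544593),
    ("32m 16s", 1936877),
    ("49m 15s", 2955280),
    ("1h 30m 27s", 5427555),
]

for text, milliseconds in samples:
    seconds = milliseconds / 1000
    # For these rows the text form matches the millisecond value rounded down.
    print(text, duration_to_seconds(text), seconds, duration_to_seconds(text) == milliseconds // 1000)
```

Dividing by 1000 keeps the fractional part, so the filled-in 'Duration' values are floats with millisecond precision, while the old-layout values are whole seconds.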

In [68]:
data_2022_clean.isnull().sum()

Rental Id                     0
Duration                      0
Bike Id                       0
End Date                      0
EndStation Id           2867221
EndStation Name               0
Start Date                    0
StartStation Id         2555077
StartStation Name             0
Number                  8677104
Start date              8677104
Start station number    8677104
Start station           8677104
End date                8677104
End station number      8677104
End station             8677104
Bike number             8677104
Bike model              8677104
Total duration          8677104
Total duration (ms)     8677104
Start Date Time               0
Start Date Time 2       8677104
Hours                         0
dtype: int64

In [69]:
#adding the additional columns
data_2022_clean['Hours']= data_2022_clean['Start Date Time'].dt.hour
data_2022_clean['Day']= data_2022_clean['Start Date Time'].dt.weekday

In [81]:
# removing columns that are no longer needed
data_2022_clean_drop = data_2022_clean.drop(['Number', 'Start date', 'Start station', 'End date', 'End station',
                                             'Bike number', 'Total duration', 'Total duration (ms)', 'Start Date Time 2'], axis=1)

In [82]:
data_2022_clean_drop

Unnamed: 0,Rental Id,Duration,Bike Id,End Date,EndStation Id,EndStation Name,Start Date,StartStation Id,StartStation Name,Start station number,End station number,Bike model,Start Date Time,Hours,Day
0,115967515.0,1260.000,15338.0,01/01/2022 23:13,310.0,"Black Prince Road, Vauxhall",01/01/2022 22:52,529.0,"Manresa Road, Chelsea",,,,2022-01-01 22:52:00,22,5
1,116017034.0,720.000,19861.0,04/01/2022 19:08,11.0,"Brunswick Square, Bloomsbury",04/01/2022 18:56,804.0,"Good's Way, King's Cross",,,,2022-01-04 18:56:00,18,1
2,115895660.0,360.000,19666.0,29/12/2021 16:34,70.0,"Calshot Street , King's Cross",29/12/2021 16:28,57.0,"Guilford Street , Bloomsbury",,,,2021-12-29 16:28:00,16,2
3,116016563.0,480.000,19861.0,04/01/2022 18:46,804.0,"Good's Way, King's Cross",04/01/2022 18:38,57.0,"Guilford Street , Bloomsbury",,,,2022-01-04 18:38:00,18,1
4,116014412.0,1260.000,17235.0,04/01/2022 17:45,14.0,"Belgrove Street , King's Cross",04/01/2022 17:24,297.0,"Geraldine Street, Elephant & Castle",,,,2022-01-04 17:24:00,17,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11232176,127641458.0,6544.593,53664.0,2022-12-26 01:51,,"Queen Mary's, Mile End",2022-12-26 00:02,,"Woodstock Grove, Shepherd's Bush",200214,200249,CLASSIC,2022-12-26 00:02:00,0,0
11232177,127641459.0,1936.877,54303.0,2022-12-26 00:34,,"Salmon Lane, Limehouse",2022-12-26 00:02,,"Curlew Street, Shad Thames",1213,200147,CLASSIC,2022-12-26 00:02:00,0,0
11232178,127641453.0,2955.280,21426.0,2022-12-26 00:49,,"Langdon Park, Poplar",2022-12-26 00:00,,"Curlew Street, Shad Thames",1213,200160,CLASSIC,2022-12-26 00:00:00,0,0
11232179,127641454.0,5427.555,54786.0,2022-12-26 01:31,,"Millharbour, Millwall",2022-12-26 00:00,,"Millharbour, Millwall",22167,22167,CLASSIC,2022-12-26 00:00:00,0,0


In [84]:
data_2022_clean_drop.isnull().sum()

Rental Id                     0
Duration                      0
Bike Id                       0
End Date                      0
EndStation Id           2867221
EndStation Name               0
Start Date                    0
StartStation Id         2555077
StartStation Name             0
Start station number    8677104
End station number      8677104
Bike model              8677104
Start Date Time               0
Hours                         0
Day                           0
dtype: int64

In [96]:
# let's rename a couple of columns to make it clearer
# we will rename the Start and End station number columns
# these columns actually correspond to the station 'terminalName' as per https://tfl.gov.uk/tfl/syndication/feeds/cycle-hire/livecyclehireupdates.xml

# rename() returns a new DataFrame, so assign the result back to keep the new names
data_2022_clean_drop = data_2022_clean_drop.rename(columns={'Start station number': 'SS Terminal Name', 'End station number': 'ES Terminal Name'})
data_2022_clean_drop


Unnamed: 0,Rental Id,Duration,Bike Id,End Date,EndStation Id,EndStation Name,Start Date,StartStation Id,StartStation Name,SS Terminal Name,ES Terminal Name,Bike model,Start Date Time,Hours,Day
0,115967515.0,1260.000,15338.0,01/01/2022 23:13,310.0,"Black Prince Road, Vauxhall",01/01/2022 22:52,529.0,"Manresa Road, Chelsea",,,,2022-01-01 22:52:00,22,5
1,116017034.0,720.000,19861.0,04/01/2022 19:08,11.0,"Brunswick Square, Bloomsbury",04/01/2022 18:56,804.0,"Good's Way, King's Cross",,,,2022-01-04 18:56:00,18,1
2,115895660.0,360.000,19666.0,29/12/2021 16:34,70.0,"Calshot Street , King's Cross",29/12/2021 16:28,57.0,"Guilford Street , Bloomsbury",,,,2021-12-29 16:28:00,16,2
3,116016563.0,480.000,19861.0,04/01/2022 18:46,804.0,"Good's Way, King's Cross",04/01/2022 18:38,57.0,"Guilford Street , Bloomsbury",,,,2022-01-04 18:38:00,18,1
4,116014412.0,1260.000,17235.0,04/01/2022 17:45,14.0,"Belgrove Street , King's Cross",04/01/2022 17:24,297.0,"Geraldine Street, Elephant & Castle",,,,2022-01-04 17:24:00,17,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11232176,127641458.0,6544.593,53664.0,2022-12-26 01:51,,"Queen Mary's, Mile End",2022-12-26 00:02,,"Woodstock Grove, Shepherd's Bush",200214,200249,CLASSIC,2022-12-26 00:02:00,0,0
11232177,127641459.0,1936.877,54303.0,2022-12-26 00:34,,"Salmon Lane, Limehouse",2022-12-26 00:02,,"Curlew Street, Shad Thames",1213,200147,CLASSIC,2022-12-26 00:02:00,0,0
11232178,127641453.0,2955.280,21426.0,2022-12-26 00:49,,"Langdon Park, Poplar",2022-12-26 00:00,,"Curlew Street, Shad Thames",1213,200160,CLASSIC,2022-12-26 00:00:00,0,0
11232179,127641454.0,5427.555,54786.0,2022-12-26 01:31,,"Millharbour, Millwall",2022-12-26 00:00,,"Millharbour, Millwall",22167,22167,CLASSIC,2022-12-26 00:00:00,0,0


In [102]:
# 2022 filtering data - remove any rows that aren't from 2022
bike_data_2022 = data_2022_clean_drop[data_2022_clean_drop['Start Date Time'].dt.year == 2022]
print(bike_data_2022.shape)

(11166111, 15)


### Storing the data in a PostgreSQL database

In [87]:
# psycopg2 is used to connect to a PostgreSQL database from Python

import psycopg2
from sqlalchemy import create_engine

In [88]:
# connection to postgres database
conn = psycopg2.connect(
    user="postgres",
    password="password123",
    host="localhost",
    database="diss_data",
)


In [89]:
# Create a SQLAlchemy engine with create_engine; it is used to write the DataFrames to the database.
engine = create_engine('postgresql+psycopg2://postgres:password123@localhost:5432/diss_data')
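Hard-coding the password in the notebook is convenient but easy to leak when the notebook is shared. A hedged alternative is to build the connection URL from environment variables. The sketch below reuses the variable names that PostgreSQL's own client tools read (`PGUSER`, `PGPASSWORD`, `PGHOST`, `PGPORT`, `PGDATABASE`); reading them here is a choice made for this example, and `postgres_url` is a hypothetical helper.

```python
import os
from urllib.parse import quote


def postgres_url(env=os.environ):
    """Build a SQLAlchemy-style PostgreSQL URL from environment variables."""
    user = env.get("PGUSER", "postgres")
    password = env.get("PGPASSWORD", "")
    host = env.get("PGHOST", "localhost")
    port = env.get("PGPORT", "5432")
    database = env.get("PGDATABASE", "diss_data")
    # Percent-encode the credentials so characters such as '@' or '/' do not break the URL.
    return (
        f"postgresql+psycopg2://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}/{database}"
    )


# Example with an explicit mapping instead of the real environment.
print(postgres_url({"PGPASSWORD": "p@ss word"}))
```

The resulting string can be passed to `create_engine` in place of the literal URL.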

In [None]:
# Export the DataFrame to the database with its to_sql method, using the engine set up above.
# index=False avoids saving the DataFrame's index as a separate column in the database.
bike_data_2019.to_sql('bike_data_2019_tb', engine, if_exists='replace', index=False)

In [None]:
# save the DataFrame to the PostgreSQL database
bike_data_2022.to_sql('bike_data_2022_tb', engine, if_exists='replace', index=False)
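With roughly ten million rows per table, it may help to write the data in batches rather than all at once; `to_sql` accepts a `chunksize` argument for this. The sketch below uses an in-memory SQLite connection and a made-up table name, `bike_demo_tb`, purely so that it runs without a PostgreSQL server. The batch size of 4 is arbitrary and the rows are dummy values.

```python
import sqlite3

import pandas as pd

# Dummy rows standing in for the real journey data.
frame = pd.DataFrame({"Rental Id": list(range(10)), "Duration": [60.0] * 10})

conn = sqlite3.connect(":memory:")
try:
    # chunksize sets how many rows are written per batch.
    frame.to_sql("bike_demo_tb", conn, if_exists="replace", index=False, chunksize=4)
    n_rows = conn.execute("SELECT COUNT(*) FROM bike_demo_tb").fetchone()[0]
finally:
    conn.close()

print(n_rows)
```

A suitable batch size for the real tables would need to be found by experiment.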