# Tales of London's Busiest & Quietest Stations

Author: Eu Meng Chong  
Course: Final Capstone Project under Applied Data Science Capstone by IBM  
Date Published: 08th June 2020

## Table of Contents

1. Introduction
2. Data
3. Methodology
4. Results
5. Discussions
6. Conclusion
7. Reference
8. Acknowledgement
9. Appendices (if any)

## 1. Introduction

With the world's oldest metro system, the London Underground, alongside the Overground and privately owned commuter train services, London has one of the most extensive rail networks in the world, comprising a total of 330 stations. Moreover, as mentioned on this [webpage](https://en.wikipedia.org/wiki/Rail_transport_in_Great_Britain), London and South East England (counted together as their railway networks are more integrated) had an annual passenger count of 1.216 billion in 2018-2019, and the trend is steadily increasing.

In this project, we first look into the neighbourhoods surrounding these railway stations in London and classify them according to whether a high volume of trains stops at the station or not. Fortunately, the Department for Transport (DfT) has categorised all the stations in the UK (London included) based on that metric, from A being the highest to F2 being the lowest. Moreover, as not all the neighbourhoods surrounding the stations are alike (i.e. some neighbourhoods are solely residential, while others have more commercial or corporate areas), we assume that the purpose for which each station is used is determined solely by the venues closest to it. For example, if there are no residential areas in a neighbourhood, then the passengers at that station are likely using it to travel back home from work. This is one example of the daily migration of people in a city as populous as London.

What we would like to achieve in this project is to construct a classification model, based on the data for the neighbourhood surrounding each station, that predicts the volume of trains stopping at that station. Such a model has the potential to identify the factors that affect ridership at each train station in London, and could be used to pinpoint profitable extensions to London's railway network.
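
As a minimal sketch of how the DfT categories could serve as labels for such a classifier, the snippet below maps each category to an ordinal rank. The helper name `dft_rank` and the decision to collapse sub-categories (such as F1 and F2) onto their leading letter are assumptions made for illustration only; they are not part of the original analysis.

```python
# Hypothetical helper: ordinal rank of a DfT category (0 = highest category).
# Assumption: sub-categories such as "C1" or "F2" share the rank of their letter.
DFT_LETTER_RANK = {"A": 0, "B": 1, "C": 2, "D": 3, "E": 4, "F": 5}


def dft_rank(category: str) -> int:
    """Return the ordinal rank of a DfT category string such as 'C' or 'F2'."""
    letter = category.strip().upper()[:1]
    if letter not in DFT_LETTER_RANK:
        raise ValueError(f"Unknown DfT category: {category!r}")
    return DFT_LETTER_RANK[letter]


# Example: turn a handful of category strings into numeric labels.
labels = [dft_rank(c) for c in ["C", "D", "E", "F2", "A"]]
print(labels)
```

Whether the model should treat the categories as ordered ranks or as unordered classes is a design choice; the ordinal form above simply keeps the "A highest, F2 lowest" ordering available.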

## 2. Data

We will use the following sources to retrieve the abovementioned data:
1. https://en.wikipedia.org/wiki/List_of_London_railway_stations: This webpage lists all the railway stations in London with their respective coordinates, operators, DfT categories and London boroughs. Here we shall use the Python package `BeautifulSoup` to scrape the table from the webpage.


In [2]:
# Installing packages
! pip install bs4 lxml
print("Package Installed!")

Package Installed!


In [6]:
# Importing libraries
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd
import requests
print("Packages Imported!")

Packages Imported!


In [7]:
webpage = requests.get('https://en.wikipedia.org/wiki/List_of_London_railway_stations').text
soup = BeautifulSoup(webpage,'lxml')
print(soup.prettify())

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   List of London railway stations - Wikipedia
  </title>
  <script>
   document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"d88b9800-edcf-4a7d-b845-bdfa7ca89bff","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"List_of_London_railway_stations","wgTitle":"List of London railway stations","wgCurRevisionId":957330329,"wgRevisionId":957330329,"wgArticleId":361268,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["CS1 maint: archived copy as title","Webarchive template wayback links","CS1: Julian–Gregorian unce

In [15]:
My_table = soup.find('table',{'class':'wikitable sortable'})
My_table

<table class="wikitable sortable" style="font-size:95%;clear:both;">
<tbody><tr>
<th>Station
</th>
<th>Local authority
</th>
<th>Managed by
</th>
<th>Station<br/>code
</th>
<th>Fare<br/>zone
</th>
<th>Year<br/>opened
</th>
<th><a href="/wiki/United_Kingdom_railway_station_categories" title="United Kingdom railway station categories">Category</a>
</th>
<th>Coordinates
</th></tr>
<tr>
<td><a href="/wiki/Abbey_Wood_railway_station" title="Abbey Wood railway station">Abbey Wood</a>
</td>
<td><a href="/wiki/Royal_Borough_of_Greenwich" title="Royal Borough of Greenwich">Greenwich</a>
</td>
<td><a href="/wiki/TfL_Rail" title="TfL Rail">TfL Rail</a><sup class="reference" id="cite_ref-1"><a href="#cite_note-1">[1]</a></sup>
</td>
<td>ABW
</td>
<td>4
</td>
<td>1849
</td>
<td>C
</td>
<td><span class="plainlinks nourlexpansion"><a class="external text" href="//tools.wmflabs.org/geohack/geohack.php?pagename=List_of_London_railway_stations&amp;params=51.4915_N_0.1229_E_region:GB&amp;title=Abbey+Wood

In [37]:
# Column positions used from the Wikipedia table:
# 0 = Station, 1 = Local authority, 2 = Managed by, 6 = Category, 7 = Coordinates
rows = []

for tr in My_table.find_all('tr')[1:]:
    cells = tr.find_all('td')
    geo = cells[7].find('span', {'class': 'geo'}) if len(cells) > 7 else None
    # Skip rows without a station name or without coordinates
    if geo is None or cells[0].text.strip() == '':
        continue
    rows.append({
        'Station': cells[0].text.strip(),
        'Borough': cells[1].text.strip(),
        'Managed By': cells[2].text.strip(),
        'DfT Category': cells[6].text.strip(),
        'Coordinates': geo.text.strip().replace('; ', ',')
    })

# DataFrame.append was removed in pandas 2.0, so build the frame from the list of rows
stations_df = pd.DataFrame(rows, columns=['Station','Borough','Managed By','DfT Category','Coordinates'])
stations_df.to_csv('London_stations.csv')

In [38]:
# Now, let's read the file
stations_df = pd.read_csv('London_stations.csv', index_col=0)
stations_df.head()

| | Station | Borough | Managed By | DfT Category | Coordinates |
|---|---|---|---|---|---|
| 0 | Abbey Wood | Greenwich | TfL Rail[1] | C | 51.4915,0.1229 |
| 1 | Acton Central | Ealing | London Overground | D | 51.5088,-0.2634 |
| 2 | Acton Main Line[2] | Ealing | TfL Rail | E | 51.5169,-0.2669 |
| 3 | Albany Park | Bexley | Southeastern | D | 51.4358,0.1266 |
| 4 | Alexandra Palace[3] | Haringey | Great Northern | D | 51.5983,-0.1197 |
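
The scraped values still carry Wikipedia footnote markers (for example `TfL Rail[1]`), and the coordinates are stored as a single string. A minimal sketch of how these could be tidied is given below; the helper names `strip_footnotes` and `split_coordinates` are hypothetical and are not part of the original notebook.

```python
import re
from typing import Tuple

# Matches Wikipedia footnote markers such as "[1]" or "[12]".
FOOTNOTE = re.compile(r"\[\d+\]")


def strip_footnotes(text: str) -> str:
    """Remove footnote markers like '[1]' from a scraped cell value."""
    return FOOTNOTE.sub("", text).strip()


def split_coordinates(coords: str) -> Tuple[float, float]:
    """Split a 'lat,lon' string into a (latitude, longitude) pair of floats."""
    lat, lon = coords.split(",")
    return float(lat), float(lon)


print(strip_footnotes("Acton Main Line[2]"))
print(split_coordinates("51.5169,-0.2669"))
```

With pandas, helpers like these could be applied column-wise, for instance via `stations_df['Station'].map(strip_footnotes)`.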


2. Next, we shall use the Foursquare API to explore the venue types surrounding each station. Foursquare groups its venues into top-level categories, each of which is divided into more refined sub-categories. The full list of categories with their corresponding Category IDs can be found [here](https://developer.foursquare.com/docs/build-with-foursquare/categories/). The top-level categories we are interested in are:
    - Arts & Entertainment; 4d4b7104d754a06370d81259
    - College & University; 4d4b7105d754a06372d81259
    - Events; 4d4b7105d754a06373d81259
    - Food; 4d4b7105d754a06374d81259
    - Outdoors & Recreation; 4d4b7105d754a06377d81259
    - Professional & Other Places; 4d4b7105d754a06375d81259
    - Residence; 4e67e38e036454776db1fb3a
    - Shop & Service; 4d4b7105d754a06378d81259
    - Travel & Transport; 4d4b7105d754a06379d81259
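
To illustrate how these category IDs might be used, the sketch below builds a request URL for one station and one category. It assumes the legacy Foursquare v2 `venues/search` endpoint; the function name `build_search_url`, the placeholder credentials, the 500 m radius, the limit of 50 and the version date are all assumptions chosen for illustration, and no request is actually sent.

```python
from urllib.parse import urlencode

# Category IDs copied from the list above (a subset, for brevity).
CATEGORY_IDS = {
    "Food": "4d4b7105d754a06374d81259",
    "Residence": "4e67e38e036454776db1fb3a",
    "Travel & Transport": "4d4b7105d754a06379d81259",
}

# Assumption: the legacy Foursquare v2 venue search endpoint.
SEARCH_URL = "https://api.foursquare.com/v2/venues/search"


def build_search_url(lat, lon, category, radius=500, limit=50,
                     client_id="YOUR_CLIENT_ID",
                     client_secret="YOUR_CLIENT_SECRET",
                     version="20200608"):
    """Build (but do not send) a venue search URL around a station."""
    params = {
        "client_id": client_id,          # placeholder credential
        "client_secret": client_secret,  # placeholder credential
        "v": version,                    # API version date (assumed value)
        "ll": f"{lat},{lon}",            # station latitude,longitude
        "categoryId": CATEGORY_IDS[category],
        "radius": radius,                # search radius in metres (assumed)
        "limit": limit,                  # maximum venues returned (assumed)
    }
    return f"{SEARCH_URL}?{urlencode(params)}"


# Example: food venues around Abbey Wood (coordinates from the table above).
url = build_search_url(51.4915, 0.1229, "Food")
print(url)
```

The resulting URL could then be fetched with `requests.get(url).json()` once real credentials are supplied.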

## 3. Methodology