# Rush Hour Dynamics: Using Python to Study the London Underground
#### Camilla Montonen
#### PyData Paris 2015

<img src="londontube.png" height=500 width=500>

# Introduction
<img src="introimage.png" height=800 width=800>



# Background

* Bryn Mawr College 2013 
* University of Edinburgh 2014
* Currently working in QA at Caplin Systems Ltd.
* Member of PyLadies London and Women in Data. If you're ever in London, please drop in to one of our meetups!


# Roadmap

1. Motivation: Why would you want to analyse the London Underground?! Commuting on it is bad enough.
2. Data collection: The Challenge of Collecting Data Stored in a Map 
3. Data analysis: Leveraging graph-tool to analyse the London Underground
4. Simulations: Creating simulations using Bokeh

# There are interesting data problems everywhere...

* Python gives you the tools, but you have to ask the questions!

<img src="fancy_graph_arf.png">

# Back in August 2014...

<img src="train.gif">


# Which Tube line should I take to work?
<img src="tube.png">

## Some days it was all good...

<img src="norush.jpg" height=400 width=600>



# Other days ...not so good

<img src="congestion.jpg" height=400 width=400>

# A pattern starts to emerge

<img src="delays.jpg">
Source: [BBC News](http://news.bbc.co.uk/1/hi/in_pictures/8092917.stm)

### Observation: delays or suspensions at one station can affect remote stations

<img src="tube.png">


## Questions that demand an answer



What are the most "important" stations in the London Underground network?


How does suspending these "important" stations affect the rest of the network?

# Let's bring the Python to the Data

<img src="python-logo.png" height=50 width=50>
<img src="bokeh_logo_small.png">
<img src="graph-tool-small-logo.png">

# In the beginning, there was the 'Data'

How do I translate a physical map of the London Underground into a Graph I can process with Python?





## Start


<img src="tube.png">

## Goal

<img src="fancy_graph_sfdp.png">

# Goal

<img src="betweenness.png">

## Data collection:

It would be cool to program some kind of OCR to automatically read the data from the map and produce a data file!
But alas, I had to resort to manually creating a data file:

```
#Station #Neighbour(line)
Acton Town	        Chiswick Park (District), South Ealing (Picadilly), Turnham Green (Picadilly)
Aldgate		        Tower Hill (Circle; District), Liverpool Street (Metropolitan; Circle; District)
Aldgate East	    Tower Hill (District), Liverpool Street (HammersmithCity; Metropolitan)
Alperton	        Sudbury Town (Picadilly), Park Royal (Picadilly)
```
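
A data file in this shape is straightforward to read back into Python. The following is only a minimal sketch, not the parser used for the talk: it assumes the station and its neighbours are separated by a tab, that each neighbour is followed by its line names in parentheses, and that line names are separated by semicolons. The names `parse_station_line` and `build_adjacency` are made up for this illustration.

```python
import re

# Matches one "Neighbour Name (Line A; Line B)" entry in the neighbour column.
NEIGHBOUR_RE = re.compile(r"\s*([^,(]+?)\s*\(([^)]*)\)")


def parse_station_line(line):
    """Split one data line into (station, {neighbour: [line names]})."""
    # Assumption: the station name ends at the first tab.
    station, rest = re.split(r"\t\s*", line.strip(), maxsplit=1)
    neighbours = {}
    for name, lines in NEIGHBOUR_RE.findall(rest):
        neighbours[name] = [part.strip() for part in lines.split(";")]
    return station, neighbours


def build_adjacency(rows):
    """Build an undirected adjacency map {station: set of neighbours}."""
    adjacency = {}
    for row in rows:
        if not row.strip() or row.lstrip().startswith("#"):
            continue  # skip blank lines and the header comment
        station, neighbours = parse_station_line(row)
        for neighbour in neighbours:
            adjacency.setdefault(station, set()).add(neighbour)
            adjacency.setdefault(neighbour, set()).add(station)
    return adjacency


# Two rows copied from the sample above, with the separator written as "\t".
sample = [
    "#Station #Neighbour(line)",
    "Acton Town\tChiswick Park (District), South Ealing (Picadilly), Turnham Green (Picadilly)",
    "Aldgate\t\tTower Hill (Circle; District), Liverpool Street (Metropolitan; Circle; District)",
]
adjacency = build_adjacency(sample)
print(sorted(adjacency["Acton Town"]))
```

Storing each link in both directions keeps the map undirected, which suits a network where trains run both ways between neighbouring stations.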

# Now it's a piece of cake...

<img src="data-slide.png" >

# ... to make a graph

<img src="arrow-graph.png">

# Let's go back to question 1

1. What is the most "important" station in the London Underground network?


# Defining "importance"

<img src="tube.png">

# Let's talk about betweenness centrality

<img src="betweenness_illustration.png" height=400 width=400>

## Betweenness seems like a good metric to measure the "importance" of a station


# Graphs and Python:  `graph-tool`

<img src="graph-tool-small-logo.png">

* `graph-tool` is a Python library written by Tiago Peixoto that provides a number of tools for analyzing and plotting graphs.




# What can you do with `graph-tool` ?



## Create a graph object 

```python
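from graph_tool.all import Graph  # assumed completion of the truncated import
g = Graph(directed=False)  # assumption: the Tube network is modelled as undirected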
```

## Add edges and vertices to the graph

# Data analysis using graph-tool

Let's see what the graph looks like!

<img src="fancy_graph_arf.png">

# Data analysis using graph-tool

`graph-tool` allows us to compute several interesting metrics, which are often used to characterize graphs:

1. Degree distribution
2. Average shortest path
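
To make the first metric concrete, here is a small sketch of a degree distribution computed by hand. It uses plain Python instead of `graph-tool`, and the four-vertex network is invented purely for illustration; the labels are not real stations.

```python
from collections import Counter

# Toy undirected network as an adjacency map (illustrative labels only).
toy_network = {
    "A": {"B"},
    "B": {"A", "C", "D"},
    "C": {"B", "D"},
    "D": {"B", "C"},
}


def degree_distribution(adjacency):
    """Return {degree: number of vertices with that degree}."""
    return Counter(len(neighbours) for neighbours in adjacency.values())


distribution = degree_distribution(toy_network)
for degree in sorted(distribution):
    print(degree, distribution[degree])
```

In a transport network, the vertices with the highest degree are the interchange stations, so the tail of this distribution is usually the interesting part.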



```python
# import necessary packages

# define data files
geographical_data="/home/winterflower/programming_projects/python-londontube/src/data/london_stations.csv"
network_data="/home/winterflower/programming_projects/python-londontube/src/data/londontubes.txt"
```

# Data analysis using graph-tool

1. What are the most "important" stations in the London Underground network?

There are, of course, many ways to measure the importance of a vertex in a graph. One such measure is *betweenness centrality*. Simply stated, for every pair of other vertices it takes the fraction of shortest paths between them that pass through the vertex, and then sums those fractions.
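
Written out, one common form of the definition looks like this. The symbols are introduced here for illustration and do not appear in the original slides:

$$
C_B(v) = \sum_{\substack{s < t \\ s \neq v,\; t \neq v}} \frac{\sigma_{st}(v)}{\sigma_{st}}
$$

Here $\sigma_{st}$ is the number of shortest paths between vertices $s$ and $t$, and $\sigma_{st}(v)$ is the number of those paths that pass through $v$. For an undirected graph with $n$ vertices, dividing by the number of pairs that exclude $v$, $(n-1)(n-2)/2$, gives a value between 0 and 1.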

`graph-tool` provides a module, `graph_tool.centrality`, which lets you compute various centrality measures out of the box.

## Betweenness Centrality
What fraction of all shortest paths passes through this vertex?

```python
from src import simulation_utils
from src.graph_analytics import graph_analysis
import pandas as pd

# define some useful preliminaries
geographical_data = "/home/winterflower/programming_projects/python-londontube/src/data/london_stations.csv"
network_data = "/home/winterflower/programming_projects/python-londontube/src/data/londontubes.txt"

# calculate the betweenness centrality of every station
betweenness_centrality_series_object = graph_analysis.calculate_betweenness(network_data)
# sort_values returns a new Series; Series.sort() was removed from later pandas releases
betweenness_centrality_series_object = betweenness_centrality_series_object.sort_values(ascending=False)
print(betweenness_centrality_series_object.head(10))
```

```
Baker Street               0.344084
King's Cross St.Pancras    0.303868
Liverpool Street           0.267392
Green Park                 0.263264
Mile End                   0.229449
Bethnal Green              0.227822
Victoria                   0.222771
Stratford                  0.220119
Finchley Road              0.211660
Waterloo                   0.207129
dtype: float64
```
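
To show what a betweenness computation involves, here is a pure-Python sketch of Brandes' algorithm for an unweighted, undirected graph. It is not the code behind the numbers above, which come from `graph-tool` via the project's own helper. The function name, the five-vertex toy graph and the choice of normalisation are all assumptions made for this illustration.

```python
from collections import deque


def betweenness_centrality(adjacency):
    """Brandes' algorithm on an unweighted, undirected adjacency map.

    Scores are divided by (n - 1)(n - 2) / 2, the number of vertex
    pairs that exclude the vertex itself (an assumed normalisation).
    """
    scores = dict.fromkeys(adjacency, 0.0)
    for source in adjacency:
        # Breadth-first search that also counts shortest paths from source.
        order = []
        predecessors = {v: [] for v in adjacency}
        path_count = dict.fromkeys(adjacency, 0.0)
        distance = dict.fromkeys(adjacency, -1)
        path_count[source] = 1.0
        distance[source] = 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adjacency[v]:
                if distance[w] < 0:  # first time w is reached
                    distance[w] = distance[v] + 1
                    queue.append(w)
                if distance[w] == distance[v] + 1:  # v lies on a shortest path to w
                    path_count[w] += path_count[v]
                    predecessors[w].append(v)
        # Walk back from the farthest vertices, accumulating dependencies.
        dependency = dict.fromkeys(adjacency, 0.0)
        for w in reversed(order):
            for v in predecessors[w]:
                dependency[v] += path_count[v] / path_count[w] * (1.0 + dependency[w])
            if w != source:
                scores[w] += dependency[w]
    n = len(adjacency)
    if n < 3:
        return scores  # no vertex can lie between two others
    pairs = (n - 1) * (n - 2) / 2.0
    # Each undirected pair was visited from both endpoints, hence the extra / 2.
    return {v: score / 2.0 / pairs for v, score in scores.items()}


# Toy graph: five vertices in a row, A - B - C - D - E (illustrative labels).
line_graph = {
    "A": {"B"},
    "B": {"A", "C"},
    "C": {"B", "D"},
    "D": {"C", "E"},
    "E": {"D"},
}
centrality = betweenness_centrality(line_graph)
for vertex in sorted(centrality, key=centrality.get, reverse=True):
    print(vertex, round(centrality[vertex], 3))
```

The middle vertex scores highest because every route between the two halves of the line has to pass through it, which is the same intuition that puts a central interchange at the top of a ranking.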


## Betweenness Centrality

<img src="baker_street.jpg">

## Betweenness Centrality

<img src="tube.png">

## Shortest paths

Which station has the smallest average shortest path to all other stations in the graph?






```python
from src.graph_analytics import graph_analysis

# calculate the length of the shortest path between every pair of stations
shortest_paths = graph_analysis.calculate_all_shortest_paths(network_data)
# calculate the mean shortest path for each station
mean_shortest_path = shortest_paths.mean(axis=0)
# sort ascending so the stations with the smallest mean come first
# (sort_values replaces Series.order(), which later pandas releases removed)
mean_shortest_path = mean_shortest_path.sort_values(ascending=True)
# show the top 5 stations
mean_shortest_path.head(5)
```

```
Green Park       8.901887
Oxford Circus    9.007547
Bond Street      9.090566
Baker Street     9.211321
Westminster      9.339623
dtype: float64
```
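
The same kind of ranking can be sketched without any libraries using breadth-first search. This is an illustration under stated assumptions, not the project's implementation: it counts every link as one hop, ignores the zero distance from a station to itself, and runs on an invented five-vertex line. The function names are hypothetical.

```python
from collections import deque


def hop_distances(adjacency, source):
    """Breadth-first search: hops from source to every reachable vertex."""
    distance = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adjacency[v]:
            if w not in distance:
                distance[w] = distance[v] + 1
                queue.append(w)
    return distance


def mean_shortest_path(adjacency):
    """Mean hop count from each vertex to every other vertex it can reach."""
    means = {}
    for source in adjacency:
        distance = hop_distances(adjacency, source)
        others = [d for v, d in distance.items() if v != source]
        means[source] = sum(others) / float(len(others)) if others else 0.0
    return means


# Toy graph: five vertices in a row, A - B - C - D - E (illustrative labels).
line_graph = {
    "A": {"B"},
    "B": {"A", "C"},
    "C": {"B", "D"},
    "D": {"C", "E"},
    "E": {"D"},
}
means = mean_shortest_path(line_graph)
for vertex, mean in sorted(means.items(), key=lambda item: item[1]):
    print(vertex, mean)
```

Because the averages depend on whether the self-distance is included and on how links are weighted, values from a sketch like this would not necessarily match the figures printed above.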

# Simulating commuter flow between stations

Designing a simple two-component simulation:

<img src="simulation_diagram.png">





## Simulating commuter flow between stations

<img src="bokeh_logo_small.png">

* Bokeh allows you to create graphs that update in "real-time"




## Summary 

* Python provides excellent libraries for studying real-world problems where the natural representation of the data is a graph
* In addition to calculating metrics, you can easily make amazing animations by integrating `graph-tool` with Bokeh
* Find interesting problems, ask hard questions and start exploring!


## Thank you! (And please ask questions!)