# Tube Twin: Passenger count forecasting/general tube analysis 

© Explore Data Science Academy

## Introduction 

*Image: `Assets/LU_Baker-street.jpg`*

**Client**: Transport for London (TfL) 

Transport for London runs the London Underground (also known as “The Tube”), a rail network whose stations connect the city of London.

**Team**: 

This is Team 6, a group of data scientists and data engineers assigned to the Tube Twin project. This notebook is used to execute various aspects of the project workflow.

## Context

The objective of this project is to create a digital twin of the London Tube that can be used to analyse and forecast passenger counts and traffic.


## Importing libraries
Below we import the libraries necessary for the steps in this notebook.

In [15]:
# import findspark
# findspark.init()
# findspark.find() 

# from pyspark import SparkContext
# from pyspark.sql import SparkSession
# from pyspark.sql import functions as F
# from pyspark.sql.types import * 

# All the above will/might be useful at a later stage in the workflow. 

# These are the basic packages to work with for now.
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
from statsmodels.tsa.stattools import adfuller,acf,pacf
from statsmodels.tsa.seasonal import seasonal_decompose
# statsmodels.tsa.arima_model was removed in statsmodels 0.13; use tsa.arima.model instead.
from statsmodels.tsa.arima.model import ARIMA
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error



Later stages of the workflow may use a `SparkContext` and a `SparkSession` to interface with Spark.
The `SparkContext` is mostly used to interact with RDDs,
and the `SparkSession` to work with DataFrames from Python.

> ℹ️ **Objective** ℹ️
>
>Initialise a new **Spark Context** and **Session** that you will use to interface with Spark.

In [2]:
# This cell might be useful later. 

#sc = SparkContext.getOrCreate()
#spark = SparkSession(sc) 
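
The objective above can be sketched as follows, assuming PySpark and a Java runtime are installed locally. The helper name `make_spark` and the application name `tube-twin` are illustrative and not part of the project. The sketch uses `SparkSession.builder` and reads the `SparkContext` from the session rather than constructing it separately.

```python
def make_spark(app_name="tube-twin"):
    """Return (SparkContext, SparkSession), creating them if needed."""
    # Imported lazily so the rest of the notebook runs without PySpark installed.
    from pyspark.sql import SparkSession

    # getOrCreate() reuses an existing session instead of starting a second one.
    spark = SparkSession.builder.appName(app_name).getOrCreate()
    return spark.sparkContext, spark
```

Calling `sc, spark = make_spark()` in a later cell would give the same two handles as the commented-out cell above.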

## Importing source files
Historical Tube data has been merged into a single CSV file, which will be read and analysed using the pandas library.

> ℹ️ **Objective** ℹ️
>
> Read the CSV file stored in the Data directory into a pandas DataFrame.

In [3]:
# Reading and viewing the historical_tube_data CSV file

df = pd.read_csv('../Data/historical_tube_data.csv') 

df.head(5) 

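| | Mode | NLC | ASC | Station | Coverage | year | day | dir | Total | Early | ... | 0230-0245 | 0245-0300 | 0300-0315 | 0315-0330 | 0330-0345 | 0345-0400 | 0400-0415 | 0415-0430 | 0430-0445 | 0445-0500 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | LU | 500 | ACTu | Acton Town | Station entry / exit | 2020 | MTT | IN | 3701.9 | 288.2 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.6 | 15.3 |
| 1 | LU | 502 | ALDu | Aldgate | Station entry / exit | 2020 | MTT | IN | 2489.416667 | 172.583333 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 |
| 2 | LU | 503 | ALEu | Aldgate East | Station entry / exit | 2020 | MTT | IN | 3198.307692 | 103.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.538462 |
| 3 | LU | 505 | ALPu | Alperton | Station entry / exit | 2020 | MTT | IN | 2072.538462 | 360.230769 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.230769 | 0.076923 |
| 4 | LU | 506 | AMEu | Amersham | Station entry / exit | 2020 | MTT | IN | 980.466667 | 148.4 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.133333 | 0.2 |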


In [4]:
# df.info() prints its summary and returns None, so it does not need display().
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17458 entries, 0 to 17457
Columns: 111 entries, Mode to 0445-0500
dtypes: float64(93), int64(12), object(6)
memory usage: 14.8+ MB

In [5]:
print(df.isnull().sum())

Mode          0
NLC           0
ASC           0
Station       0
Coverage     32
             ..
0345-0400     0
0400-0415     0
0415-0430     0
0430-0445     0
0445-0500     0
Length: 111, dtype: int64
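
The printed summary is truncated, so any columns with missing values in the elided middle are hidden. A minimal sketch of one way to list only the affected columns is shown here. It runs on a small made-up frame named `toy`, not on the project data; the same two lines could be applied to `df`.

```python
import pandas as pd

# Toy stand-in for the notebook's DataFrame; the values are made up.
toy = pd.DataFrame({
    "Station": ["Acton Town", "Aldgate", "Alperton"],
    "Coverage": ["Station entry / exit", None, None],
    "Total": [3701.9, 2489.4, 2072.5],
})

null_counts = toy.isnull().sum()
# Keep only the columns with at least one missing value, largest count first.
missing = null_counts[null_counts > 0].sort_values(ascending=False)
print(missing)
```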


In [10]:
# look at data statistics
df.describe(include='all').T

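| | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Mode | 17458 | 6 | LU | 10200 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| NLC | 17458.0 | NaN | NaN | NaN | 1852.831138 | 2424.449102 | 500.0 | 608.0 | 719.0 | 1083.0 | 9846.0 |
| ASC | 17458 | 698 | SFDu | 144 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Station | 17458 | 467 | Stratford | 152 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Coverage | 17426 | 6 | Station entry / exit | 14806 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 0345-0400 | 17458.0 | NaN | NaN | NaN | 1.280101 | 8.398658 | 0.0 | 0.0 | 0.0 | 0.0 | 291.0 |
| 0400-0415 | 17458.0 | NaN | NaN | NaN | 1.237599 | 7.863148 | 0.0 | 0.0 | 0.0 | 0.0 | 242.0 |
| 0415-0430 | 17458.0 | NaN | NaN | NaN | 1.024115 | 6.263069 | 0.0 | 0.0 | 0.0 | 0.0 | 189.0 |
| 0430-0445 | 17458.0 | NaN | NaN | NaN | 1.078293 | 5.715259 | 0.0 | 0.0 | 0.0 | 0.0 | 129.0 |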


In [11]:
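# Kurtosis is only defined for numeric columns, so exclude the object columns explicitly.
df.kurtosis(numeric_only=True)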

NLC             1.778086
year           -1.246407
Total          33.681702
Early          39.350814
AM Peak        94.900374
                 ...    
0345-0400     262.797762
0400-0415     204.286460
0415-0430     181.257843
0430-0445     140.173200
0445-0500      84.275677
Length: 105, dtype: float64
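
pandas reports Fisher's excess kurtosis, for which a normal distribution scores about 0. Values in the hundreds for the early-morning columns are therefore consistent with the statistics above: most rows are 0 and a few rows hold large counts. The sketch below illustrates the effect on synthetic data. The frame `toy` and its column names are made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 5000

# Synthetic columns, not project data.
toy = pd.DataFrame({
    # Roughly bell-shaped counts: excess kurtosis should be close to 0.
    "normal_like": rng.normal(loc=100.0, scale=10.0, size=n),
    # Mostly zeros with a few large counts, similar in shape to 0345-0400.
    "mostly_zero": np.where(rng.random(n) < 0.02, rng.exponential(50.0, n), 0.0),
})

# Sort so the most heavy-tailed columns appear first.
kurt = toy.kurtosis(numeric_only=True).sort_values(ascending=False)
print(kurt)
```

Sorting the result of `df.kurtosis(numeric_only=True)` in the same way would show which of the 105 numeric columns are most dominated by rare, large values.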

In [14]:
df.shape

(17458, 111)