## 3. Data Scraping

This notebook in the **Delhi Metro Network Analysis** project collects temporal data on the construction phases and opening dates of the Delhi Metro lines and stations. It fetches Wikipedia pages with the `requests` module, parses them with BeautifulSoup to extract the relevant tables (including opening dates), and then cleans and organizes the results with Pandas. The exported CSV files provide historical context for the network's development and expansion.

URLs: https://en.wikipedia.org/wiki/Delhi_Metro, https://en.wikipedia.org/wiki/List_of_Delhi_Metro_stations

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

import warnings
warnings.filterwarnings('ignore')

output_dir = '../data/temporal_data/'

### 1. Scraping Phase-wise Development Data (Temporal Data for Each Phase)

In [2]:
url = "https://en.wikipedia.org/wiki/Delhi_Metro"

response = requests.get(url)
print("Response Code: ", response.status_code) # 200 = Success

Response Code:  200
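Printing the status code is enough for an interactive run, but a failed request would still flow into the parsing cells. A minimal sketch of a stricter fetch follows. The helper name `fetch_html` and the User-Agent string are assumptions for illustration and are not part of the original notebook; the function accepts any session-like object with a requests-style `get` method.

```python
def fetch_html(url, session, timeout=10.0):
    """Return the raw page bytes, raising if the request did not succeed.

    `session` is anything with a requests-style `get` method,
    for example `requests.Session()`. The User-Agent value is a placeholder.
    """
    headers = {"User-Agent": "delhi-metro-notebook/0.1 (educational scraping)"}
    # A timeout keeps the notebook from hanging on an unresponsive server
    response = session.get(url, headers=headers, timeout=timeout)
    # Raise for 4xx/5xx responses instead of continuing silently
    response.raise_for_status()
    return response.content
```

With `requests` imported as in the first cell, `content = fetch_html(url, requests.Session())` could stand in for the `requests.get(url)` call.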


In [3]:
content = response.content
html = BeautifulSoup(content, "lxml") # Using LXML Parser since I prefer this over html.parser

print(html.title)

<title>Delhi Metro - Wikipedia</title>


In [4]:
# find_all is the current name of the older findAll alias
table_data = html.find_all('table', class_="wikitable sortable")
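Passing a multi-word string to `class_` matches the exact value of the `class` attribute, so tables that carry extra classes or list them in a different order are not returned. The sketch below contrasts that with a CSS selector on a made-up HTML snippet (the sample tables are hypothetical, not taken from Wikipedia), and uses `html.parser` only to keep the example free of extra dependencies.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with three differently classed tables
sample = """
<table class="wikitable sortable"><tr><th>A</th></tr></table>
<table class="wikitable sortable static-row-numbers"><tr><th>B</th></tr></table>
<table class="sortable wikitable"><tr><th>C</th></tr></table>
"""
soup = BeautifulSoup(sample, "html.parser")

# Exact match on the full class attribute string
exact = soup.find_all("table", class_="wikitable sortable")

# CSS selector: any table that has both classes, in any order
by_selector = soup.select("table.wikitable.sortable")

exact_headers = [t.th.get_text() for t in exact]
selector_headers = [t.th.get_text() for t in by_selector]
print(exact_headers)
print(selector_headers)
```

Which behaviour is preferable depends on the page: the exact match keeps the positional indexes used in the following cells stable only as long as the page layout does not change.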

In [5]:
from io import StringIO
from typing import Optional, Tuple

def remove_trailing_n(text):
    # Strip leading/trailing whitespace, including the trailing newline
    return text.strip()

def remove_references(table):
    # Remove <sup> tags (footnote markers) from the table in place
    for sup_tag in table.find_all('sup'):
        sup_tag.decompose()
    return table

def handle_NaN(table):
    # Replace missing values with empty strings
    return table.fillna("")

def read_wiki_metro_tables(table, phase: Optional[str] = None) -> Tuple[str, pd.DataFrame]:
    # Clean up the table
    table = remove_references(table)

    # Text of the first header cell, returned as the table heading
    table_heading = remove_trailing_n(table.find_all("th")[0].text)

    # Convert table to DataFrame; StringIO avoids the deprecation warning
    # newer pandas versions raise for literal HTML strings
    table_df = pd.read_html(StringIO(str(table)))[0]

    # Keep the second element of each column label
    # (the lower row of a two-level header)
    table_df.columns = [column[1] for column in table_df.columns]

    # Add a 'Phase' column when a phase number is given
    if phase is not None:
        table_df['Phase'] = f"Phase {phase}"

    table_df = handle_NaN(table_df)

    return table_heading, table_df
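The column-renaming step appears to rely on `pd.read_html` returning two-level column labels for tables with a two-row header. A small sketch of what that list comprehension does, using a made-up two-level header (the top-level label "Phase I network" is hypothetical) rather than the real Wikipedia table:

```python
import pandas as pd

# Hypothetical two-level header, similar in shape to what pd.read_html
# produces when the first header row spans all columns
columns = pd.MultiIndex.from_tuples([
    ("Phase I network", "Line"),
    ("Phase I network", "Stations"),
    ("Phase I network", "Opening date"),
])
demo_df = pd.DataFrame([["Red Line", 6, "25 December 2002"]], columns=columns)

# Iterating over a MultiIndex yields tuples; index 1 is the lower header row
demo_df.columns = [column[1] for column in demo_df.columns]
print(list(demo_df.columns))

# Caveat: on single-level (plain string) labels the same expression
# returns the second character of each name instead
plain = pd.DataFrame(columns=["Line", "Stations"])
second_chars = [column[1] for column in plain.columns]
print(second_chars)
```

The caveat matters when the helper is reused on a table with a single header row, where the resulting column names are single characters.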

In [6]:
# Phase I
phase1_heading, phase1_table = read_wiki_metro_tables(table_data[0], phase="I")
phase1_table = phase1_table[['Line', 'Stations', 'Length (km)', 'Terminals', 'Terminals.1', 'Opening date', 'Phase']]
phase1_table.columns = ['Line', 'Stations', 'Length (km)', 'Terminal A', 'Terminal B', 'Opening Date', 'Phase']
phase1_table

Unnamed: 0,Line,Stations,Length (km),Terminal A,Terminal B,Opening Date,Phase
0,Red Line,6,8.35,Shahdara,Tis Hazari,25 December 2002,Phase I
1,Red Line,4,4.87,Tis Hazari,Inderlok,3 October 2003,Phase I
2,Red Line,8,8.84,Inderlok,Rithala,31 March 2004,Phase I
3,Yellow Line,4,4.06,Vishwa Vidyalaya,Kashmere Gate,20 December 2004,Phase I
4,Yellow Line,6,6.62,Kashmere Gate,Central Secretariat,3 July 2005,Phase I
5,Blue Line,22,22.74,Dwarka,Barakhamba Road,31 December 2005,Phase I
6,Blue Line,6,6.47,Dwarka,Dwarka Sector 9,1 April 2006,Phase I
7,Blue Line,3,2.8,Barakhamba Road,Indraprastha,11 November 2006,Phase I
8,Total,59,64.75,,,,Phase I
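The last row of the Phase I table is a summary row rather than a line segment, so it is worth separating before any aggregation. A quick consistency check, sketched with the values copied from the output above (the variable names are illustrative):

```python
import pandas as pd

# Rows copied from the Phase I output above
phase1 = pd.DataFrame({
    "Line": ["Red Line", "Red Line", "Red Line", "Yellow Line", "Yellow Line",
             "Blue Line", "Blue Line", "Blue Line", "Total"],
    "Stations": [6, 4, 8, 4, 6, 22, 6, 3, 59],
    "Length (km)": [8.35, 4.87, 8.84, 4.06, 6.62, 22.74, 6.47, 2.8, 64.75],
})

# Split the summary row from the per-segment rows
is_total = phase1["Line"].eq("Total")
segments = phase1[~is_total]
total = phase1[is_total].iloc[0]

stations_match = segments["Stations"].sum() == total["Stations"]
# Compare lengths with a tolerance to avoid floating-point noise
length_match = abs(segments["Length (km)"].sum() - total["Length (km)"]) < 0.01
print(stations_match, length_match)
```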


In [7]:
# Phase II
phase2_heading, phase2_table = read_wiki_metro_tables(table_data[1], phase="II")
phase2_table = phase2_table[['Line', 'Stations', 'Length (km)', 'Terminals', 'Terminals.1', 'Opening date', 'Phase']]
phase2_table.columns = ['Line', 'Stations', 'Length (km)', 'Terminal A', 'Terminal B', 'Opening Date', 'Phase']
phase2_table

Unnamed: 0,Line,Stations,Length (km),Terminal A,Terminal B,Opening Date,Phase
0,Red Line,3,2.86,Shahdara,Dilshad Garden,4 June 2008,Phase II
1,Yellow Line,5,6.38,Vishwavidyalaya,Jahangirpuri,4 February 2009,Phase II
2,Yellow Line,9,15.82,Millenium City Centre,Qutab Minar,21 June 2010,Phase II
3,Yellow Line,1,15.82,Chhatarpur,Chhatarpur,26 August 2010,Phase II
4,Yellow Line,9,11.76,Qutab Minar,Central Secretariat,3 September 2010,Phase II
5,Blue Line,1,2.17,Indraprastha,Yamuna Bank,10 May 2009,Phase II
6,Blue Line,10,12.85,Yamuna Bank,Noida City Centre,12 November 2009,Phase II
7,Blue Line,2,2.28,Dwarka Sector 9,Dwarka Sector 21,30 October 2010,Phase II
8,Blue Line Branch,6,6.25,Yamuna Bank,Anand Vihar,6 January 2010,Phase II
9,Blue Line Branch,2,2.26,Anand Vihar,Vaishali,14 July 2011,Phase II
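The opening dates are still plain strings at this point. For temporal analysis they would typically be parsed into datetimes; a minimal sketch on a few rows copied from the Phase II output above, assuming every value follows the "day month-name year" pattern seen there:

```python
import pandas as pd

# A few rows copied from the Phase II output, deliberately out of order
phase2 = pd.DataFrame({
    "Line": ["Blue Line Branch", "Red Line", "Blue Line", "Yellow Line"],
    "Opening Date": ["14 July 2011", "4 June 2008", "10 May 2009", "4 February 2009"],
})

# "%d %B %Y" matches day, full English month name, four-digit year
phase2["Opening Date"] = pd.to_datetime(phase2["Opening Date"], format="%d %B %Y")

# Chronological order is now a plain sort
phase2 = phase2.sort_values("Opening Date").reset_index(drop=True)
print(phase2)
```

An explicit `format` makes unexpected values (such as the empty string in a summary row) fail loudly instead of being guessed.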


In [8]:
# Phase III
phase3_heading, phase3_table = read_wiki_metro_tables(table_data[2], phase="III")
phase3_table = phase3_table[['Line', 'Stations', 'Length (km)', 'Terminals', 'Terminals.1', 'Opening date', 'Phase']]
phase3_table.columns = ['Line', 'Stations', 'Length (km)', 'Terminal A', 'Terminal B', 'Opening Date', 'Phase']
phase3_table

Unnamed: 0,Line,Stations,Length (km),Terminal A,Terminal B,Opening Date,Phase
0,Red Line,8,9.64,Dilshad Garden,Shaheed Sthal (New Bus Adda),9 March 2019,Phase III
1,Yellow Line,3,4.37,Jahangirpuri,Samaypur Badli,10 November 2015,Phase III
2,Blue Line,6,6.8,Noida City Centre,Noida Electronic City,9 March 2019,Phase III
3,Green Line,7,11.19,Mundka,Brigadier Hoshiyar Singh,24 June 2018,Phase III
4,Violet Line|,2,3.23,Mandi House,Central Secretariat,26 June 2014,Phase III
5,Violet Line|,1,0.97,Mandi House,ITO,8 June 2015,Phase III
6,Violet Line|,9,13.56,Badarpur Border,Escorts Mujesar,6 September 2015,Phase III
7,Violet Line|,4,5.07,Kashmere Gate,ITO,28 May 2017,Phase III
8,Violet Line|,2,3.35,Escorts Mujesar,Raja Nahar Singh,19 November 2018,Phase III
9,Airport Express,1,2.01,Dwarka Sector 21,Yashobhoomi - Dwarka Sector 25,17 September 2023,Phase III


In [9]:
# Phase IV
phase4_heading, phase4_table = read_wiki_metro_tables(table_data[4], phase="IV")
phase4_table = phase4_table[['Name', 'Stations', 'Length (km)', 'Terminals', 'Terminals.1', 'Via', 'Status', 'Expected completion date', 'Phase']]
phase4_table.columns = ['Line', 'Stations', 'Length (km)', 'Terminal A', 'Terminal B', 'Via', 'Status', 'Expected Completion date', 'Phase']
phase4_table['Expected Completion date'] = phase4_table['Expected Completion date'].apply(lambda x: int(x) if type(x) == float else x)
phase4_table

Unnamed: 0,Line,Stations,Length (km),Terminal A,Terminal B,Via,Status,Expected Completion date,Phase
0,Magenta Line,22,29.26,Janakpuri West,RK Ashram Marg,"Janakpuri West, Krishna Park Extension, Keshop...",Under construction,2026.0,Phase IV
1,Golden Line,15,23.62,Tughlakabad,Terminal 1-IGI Airport,"Tughlakabad, Tughlakabad Railway Colony, Anand...",Under construction,2026.0,Phase IV
2,Pink Line,8,12.32,Majlis Park,Maujpur - Babarpur,"Majlis Park, Burari, Jharoda Majra, Jagatpur V...",Under construction,2025.0,Phase IV
3,Green Line,10,12.38,Inderlok,Indraprastha,"Inderlok, Daya Basti, Sarai Rohilla, Ajmal Kha...",Approved,2029.0,Phase IV
4,Brown Line,8,8.4,Lajpat Nagar,Saket G-Block,"Lajpat Nagar, Andrews Ganj, Greater Kailash-1,...",Approved,2029.0,Phase IV
5,Red Line,21,27.32,Rithala,Nathupur,"Rohini Sector 25, Rohini Sector 26, Rohini Sec...",Pending approval,,Phase IV
6,Red Line,21,27.32,Rithala,Nathupur,"Rohini Sector 25, Rohini Sector 26, Rohini Sec...",Pending approval,,Phase IV
7,Blue Line,5,5.2,Noida Electronic City,Sahibabad,"Vaibhav Khand, DPS Indirapuram, Shakti Khand, ...",Proposed,,Phase IV
8,Delhi Metrolite,21,19.09,Kirti Nagar,Bamnoli Village,"Kirti Nagar, Saraswati Garden, Mayapuri Bus De...",Proposed,,Phase IV
9,,115,141.21,,,,,,Phase IV
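Because missing years were filled with empty strings, the `Expected Completion date` column mixes numbers and text. One alternative, assuming downstream code prefers a single integer dtype with proper missing values, is the pandas nullable integer type; this is a sketch on stand-in values, not the notebook's own approach:

```python
import pandas as pd

# Values shaped like the 'Expected Completion date' column after fillna("")
expected = pd.Series([2026.0, 2026.0, 2025.0, 2029.0, "", ""], dtype=object)

# Coerce non-numeric entries (such as "") to NaN,
# then switch to the nullable integer dtype
years = pd.to_numeric(expected, errors="coerce").astype("Int64")
print(years)
```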


In [10]:
# Current Delhi Metro Network
delhi_metro_network_heading, delhi_metro_network_table = read_wiki_metro_tables(table_data[3])
delhi_metro_network_table = delhi_metro_network_table[['Line Name', 'Opened', 'Last extension', 'Stations', 'Length (km)', 'Terminals', 'Terminals.1', 'Rolling stock']]
delhi_metro_network_table.columns = ['Line', 'Opened', 'Last extension', 'Stations', 'Length (km)', 'Terminal A', 'Terminal B', 'Rolling']
delhi_metro_network_table

Unnamed: 0,Line,Opened,Last extension,Stations,Length (km),Terminal A,Terminal B,Rolling
0,Red Line,25 December 2002,9 March 2019,29,34.55,Shaheed Sthal,Rithala,"31 trains, 219 coaches"
1,Yellow Line,20 December 2004,10 November 2015,37,49.02,Samaypur Badli,Millennium City Centre,"54 trains, 429 coaches"
2,Blue Line,31 December 2005,9 March 2019,50,56.11,Noida Electronic City,Dwarka Sector 21,"60 trains, 480 coaches"
3,Blue Line,7 January 2010,14 July 2011,8,8.51,Vaishali,Dwarka Sector 21,"60 trains, 480 coaches"
4,Green Line,3 April 2010,24 June 2018,24,28.78,Inderlok,Brigadier Hoshiyar Singh City Park,"20 trains, 80 coaches"
5,Green Line,27 August 2011,–,24,28.78,Kirti Nagar,Brigadier Hoshiyar Singh City Park,"20 trains, 80 coaches"
6,Violet Line,3 October 2010,19 November 2018,34,46.34,Kashmere Gate,Raja Nahar Singh Ballabhgarh,"37 trains, 220 coaches"
7,Airport Express Line,23 February 2011,17 September 2023,7,22.91,New Delhi,Yashobhoomi Dwarka Sector 25,"6 trains, 36 coaches"
8,Pink Line,14 March 2018,6 August 2021,38,59.24,Majlis Park,Shiv Vihar,"33 trains, 196 coaches"
9,Magenta Line,25 December 2017,29 May 2018,25,37.46,Botanical Garden,Janakpuri West,"24 trains, 144 coaches"


In [11]:
# Export Data
phase1_table.to_csv(output_dir + "Phase I.csv", index=False)
phase2_table.to_csv(output_dir + "Phase II.csv", index=False)
phase3_table.to_csv(output_dir + "Phase III.csv", index=False)
phase4_table.to_csv(output_dir + "Phase IV.csv", index=False)
delhi_metro_network_table.to_csv(output_dir + "Delhi Metro Network.csv", index=False)
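The export cell assumes that `output_dir` already exists. A small sketch of a more defensive export, using a hypothetical helper `export_tables` and a temporary directory so it can run anywhere:

```python
import tempfile
from pathlib import Path

import pandas as pd

def export_tables(tables, output_dir):
    """Write each DataFrame in `tables` (name -> DataFrame) to <output_dir>/<name>.csv."""
    out = Path(output_dir)
    # Create the directory (and parents) if it does not exist yet
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for name, df in tables.items():
        path = out / f"{name}.csv"
        df.to_csv(path, index=False)
        written.append(path)
    return written

# Demonstration with a temporary directory and a tiny stand-in table
with tempfile.TemporaryDirectory() as tmp:
    demo = {"Phase I": pd.DataFrame({"Line": ["Red Line"], "Stations": [6]})}
    paths = export_tables(demo, Path(tmp) / "temporal_data")
    written_names = [p.name for p in paths]
    round_trip = pd.read_csv(paths[0])
print(written_names)
```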

### 2. Scraping Station-wise Details (Opening Date and Layout)

In [12]:
url = 'https://en.wikipedia.org/wiki/List_of_Delhi_Metro_stations'

response = requests.get(url)
print("Response Code: ", response.status_code) # 200 = Success

Response Code:  200


In [13]:
content = response.content
html = BeautifulSoup(content, "lxml") # Using LXML Parser since I prefer this over html.parser

print(html.title)

<title>List of Delhi Metro stations - Wikipedia</title>


In [14]:
# find_all is the current name of the older findAll alias
table_data = html.find_all('table', class_="wikitable sortable static-row-numbers static-row-header-hash")

In [15]:
stations_df = read_wiki_metro_tables(table_data[0])[1]
stations_df.drop(['o', 'e'], axis=1, inplace=True)
stations_df.columns = ['Station Name', 'Metro Line', 'Opened', 'Station Layout', 'Platform Layout']

stations_df

Unnamed: 0,Station Name,Metro Line,Opened,Station Layout,Platform Layout
0,Adarsh Nagar,Yellow Line,4 February 2009,Elevated,Side
1,AIIMS,Yellow Line,3 September 2010,Underground,Island
2,Akshardham,Blue Line,12 November 2009,Elevated,Side
3,Anand Vihar**,Blue Line,6 January 2010,Elevated,Side
4,Anand Vihar**,Pink Line,31 October 2018,Elevated,Side
...,...,...,...,...,...
251,Welcome*,Red Line,24 December 2002,At Grade,Island
252,Welcome*,Pink Line,31 October 2018,Elevated,Side
253,Yamuna Bank*,Blue Line,10 May 2009,At-Grade,Island
254,Yamuna Bank*,Blue Line,6 January 2010,At-Grade,Side
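The `Station Layout` column in the output above contains both `At Grade` and `At-Grade`, which would be counted as separate categories. A possible normalization, sketched on stand-in values and assuming these two spellings are the only variants:

```python
import pandas as pd

# Stand-in for the 'Station Layout' column
layouts = pd.Series(
    ["Elevated", "Underground", "At Grade", "At-Grade"], name="Station Layout"
)

# Map the unhyphenated spelling onto the hyphenated one
normalized = layouts.str.strip().replace({"At Grade": "At-Grade"})
print(normalized.value_counts())
```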


In [16]:
import re

# Strip trailing marker characters (such as *, ** or †) from a station name
def clean_station_name(name):
    return re.sub(r'[*†]+$', '', name).strip()

# Apply the cleaning function to the 'Station Name' column
stations_df['Station Name'] = stations_df['Station Name'].apply(clean_station_name)

stations_df.head()

Unnamed: 0,Station Name,Metro Line,Opened,Station Layout,Platform Layout
0,Adarsh Nagar,Yellow Line,4 February 2009,Elevated,Side
1,AIIMS,Yellow Line,3 September 2010,Underground,Island
2,Akshardham,Blue Line,12 November 2009,Elevated,Side
3,Anand Vihar,Blue Line,6 January 2010,Elevated,Side
4,Anand Vihar,Pink Line,31 October 2018,Elevated,Side
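Stripping the trailing symbols discards whatever they denote on the source page (the notebook does not say). If that information might be useful later, the marker can be kept in its own column before the name is cleaned. A sketch on a few names from the output above; the `Marker` column name is an assumption:

```python
import pandas as pd

# Names copied from the station table before cleaning
names = pd.Series(
    ["Adarsh Nagar", "Anand Vihar**", "Welcome*", "Yamuna Bank*"],
    name="Station Name",
)

# Capture the trailing marker (if any) before stripping it
marker = names.str.extract(r"([*†]+)$", expand=False).fillna("")
clean = names.str.replace(r"[*†]+$", "", regex=True).str.strip()

stations = pd.DataFrame({"Station Name": clean, "Marker": marker})
print(stations)
```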


In [17]:
stations_df.to_csv(output_dir + "Station_Details.csv", index=False)