# Data Preprocessing for Google Maps Queries
### Before the automated commute-time extraction, the station names were preprocessed so that Google Maps could resolve them. Many station names in the original data end with abbreviations such as LU (London Underground), DLR (Docklands Light Railway), NR (National Rail), or H&C (Hammersmith & City). Google Maps often fails to recognize this shorthand, which leads to failed searches or inaccurate routing. To address this, each station name is scanned with a regular expression that detects a trailing abbreviation, with or without parentheses, and replaces it with the full name of the transport system. The result is a set of standardized station names that Google Maps can interpret more reliably in the subsequent Playwright-based web-scraping step, so that origin–destination pairs are located correctly and transit routes can be retrieved.

In [None]:
import pandas as pd
import re

transport_abbreviations = {
    'LU': 'London Underground',
    'DLR': 'Docklands Light Railway',
    'NR': 'National Rail',
    'TfL': 'Transport for London',
    'H&C': 'Hammersmith & City',
    'EL': 'Elizabeth Line',
    '(Bak)': 'Bakerloo',
    '(H&C)': 'Hammersmith & City'
}

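def replace_abbreviation(station):
    """Replace a trailing transport abbreviation with its full system name."""
    # Missing or non-string values (e.g. NaN) become an empty string.
    if not isinstance(station, str):
        return ""
    for abbr, full in transport_abbreviations.items():
        # Match the abbreviation at the end of the name, with optional
        # parentheses and whitespace, e.g. "Euston NR" or "Euston (NR)".
        pattern = rf"\s*\(?{re.escape(abbr)}\)?\s*$"
        if re.search(pattern, station):
            station_name = re.sub(pattern, '', station).strip()
            return f"{station_name} {full}"
    # No known abbreviation found: return the name with whitespace trimmed.
    return station.strip()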

df = pd.read_csv("frequent_routes_final_origin.csv")
df.columns = df.columns.str.strip()

df['Start_Station'] = df['Start_Station'].apply(replace_abbreviation)
df['End_Station'] = df['End_Station'].apply(replace_abbreviation)

df = df[['Start_Station', 'End_Station', 'Count']]

output_path = "/content/frequent_routes_final.csv"
df.to_csv(output_path, index=False, encoding='utf-8-sig')
print(f"finished and saved: {output_path}")


finished and saved: /content/frequent_routes_final.csv
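As a quick check of the rule described above, the sketch below reapplies the same trailing-abbreviation pattern to a few station strings. The sample inputs and the name `expand_trailing_abbreviation` are hypothetical and chosen for illustration only; the mapping is a shortened copy of the one in the preprocessing cell.

```python
import re

# Shortened copy of the notebook's mapping, for illustration only.
ABBREVIATIONS = {
    "LU": "London Underground",
    "DLR": "Docklands Light Railway",
    "NR": "National Rail",
    "H&C": "Hammersmith & City",
}


def expand_trailing_abbreviation(station: str) -> str:
    """Swap a trailing abbreviation, with or without parentheses, for its full name."""
    for abbr, full in ABBREVIATIONS.items():
        pattern = rf"\s*\(?{re.escape(abbr)}\)?\s*$"
        if re.search(pattern, station):
            return f"{re.sub(pattern, '', station).strip()} {full}"
    return station.strip()


# Hypothetical inputs, not rows from the dataset.
samples = ["Euston NR", "Balham LU", "Hammersmith (H&C)", "Morden"]
expanded = [expand_trailing_abbreviation(s) for s in samples]
for before, after in zip(samples, expanded):
    print(f"{before!r} -> {after!r}")
```

Note that the abbreviation is replaced by the full system name rather than dropped, so a name such as "Hammersmith (H&C)" becomes "Hammersmith Hammersmith & City". Names without a known trailing abbreviation pass through unchanged.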


# Automated Commute Time Extraction via Web Scraping
### Since the Google Maps API does not provide step‑free or wheelchair‑accessible route information, it is not possible to directly obtain total commute times for accessible journeys using the API alone. To address this limitation, we implemented an automated web‑scraping approach using Playwright in Python to collect the total transit time for each origin–destination pair under the wheelchair‑accessible constraint.

### The scraping pipeline automatically performs the following steps for each route:

### 1. Navigate to Google Maps and open the “Directions” panel.

### 2. Enter the origin and destination stations, which have been preprocessed so that Google Maps can correctly identify the locations.

### 3. Switch to transit mode and open the route options to enable the “Wheelchair accessible” filter, which the public API does not support.

### 4. Select the first suggested route card, expand its details, and extract the total commute time from the page (e.g. “1 hr 12 min” is converted to 72 minutes).

### 5. Record the result in a structured dataset for subsequent analysis, while logging any errors for retry or manual inspection.

### This approach leverages asynchronous Playwright for headless browser automation, combined with regular expressions to parse mixed “hours and minutes” text into numeric total minutes. The scraping script iterates over all routes in the dataset and outputs a final CSV file with the corresponding total commute times for wheelchair‑accessible travel.
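The conversion from mixed "hours and minutes" text to total minutes can be sketched on its own, without a browser. The function name `parse_duration_minutes` is hypothetical, and the sketch assumes that Google Maps labels durations with "hr" and "min", as in the example above. It searches for the hour and minute parts separately, so hour-only text such as "2 hr" is also handled.

```python
import re
from typing import Optional


def parse_duration_minutes(text: str) -> Optional[int]:
    """Convert text such as '1 hr 12 min' or '45 min' to total minutes."""
    lowered = text.lower()
    hours = re.search(r"(\d+)\s*hr", lowered)
    minutes = re.search(r"(\d+)\s*min", lowered)
    if not hours and not minutes:
        return None  # no recognisable duration in the text
    total = 0
    if hours:
        total += int(hours.group(1)) * 60
    if minutes:
        total += int(minutes.group(1))
    return total


for sample in ["1 hr 12 min", "31 min", "2 hr", "no route found"]:
    print(sample, "->", parse_duration_minutes(sample))
```

Returning `None` for unparseable text keeps failed extractions distinguishable from genuine zero-minute results when the values are later written to the dataset.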

In [None]:
!pip install nest_asyncio
!pip install -q playwright
!playwright install --with-deps chromium



In [None]:
import pandas as pd
import re, asyncio
from playwright.async_api import async_playwright

df = pd.read_csv("frequent_routes_final.csv")
df["commute_time"] = None

# Scrape the wheelchair-accessible transit time (in minutes) for one route.
async def get_commute_time_final(origin: str, destination: str):
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        # Use a UK locale and time zone so the page matches London settings.
        context = await browser.new_context(locale="en-GB", timezone_id="Europe/London")
        page = await context.new_page()

        # timeout=0 disables the navigation timeout for the initial page load.
        await page.goto("https://www.google.com/maps?hl=en", timeout=0)
        await page.click('button[aria-label="Directions"]')
        await page.fill('input[aria-label^="Choose starting"]', origin)
        await page.fill('input[aria-label^="Choose destination"]', destination)
        await page.keyboard.press("Enter")
        await asyncio.sleep(4)  # give the route results time to load
        # Select the transit travel mode.
        await page.locator('button[role="radio"] div[role="img"][aria-label="Transit"]').click(timeout=5000)
        await asyncio.sleep(1)

        # Open the route options; fall back to a text match if the aria-label lookup fails.
        try:
            await page.click('button[aria-label*="Options"]', timeout=5_000)
        except Exception:
            await page.click('text=Options', timeout=5_000)
        await page.click('text=Wheelchair accessible')
        await asyncio.sleep(3)

        # Wait for the route list, then expand the first suggested route.
        await page.wait_for_selector('div[role="listitem"]', timeout=30_000)
        first_card = page.locator('div[role="listitem"]').first
        try:
            await first_card.locator('text=Details').click(timeout=5_000)
        except Exception:
            await first_card.click()

        # "div.Fk3sm" appears to be an auto-generated class name, so it may change over time.
        await page.wait_for_selector("div.Fk3sm", timeout=10_000)
        time_text = await page.locator("div.Fk3sm").first.inner_text()
        print(f"-- Raw time text: {time_text}")

        # Parse "1 hr 12 min", "45 min", or "2 hr" into total minutes.
        text = time_text.lower()
        hr_match = re.search(r"(\d+)\s*hr", text)
        min_match = re.search(r"(\d+)\s*min", text)
        if hr_match or min_match:
            hours = int(hr_match.group(1)) if hr_match else 0
            minutes = int(min_match.group(1)) if min_match else 0
            total_minutes = hours * 60 + minutes
        else:
            total_minutes = None

        await browser.close()
        return total_minutes

N = len(df)

async def process_multiple_rows():
    for idx in range(N):
        row = df.loc[idx]
        origin = row["Start_Station"]
        dest = row["End_Station"]
        print(f"< Processing: {origin} → {dest} >")
        try:
            commute_time = await get_commute_time_final(origin, dest)
            df.at[idx, "commute_time"] = commute_time
            print(f"=>> {origin} → {dest}: {commute_time} min")
        except Exception as e:
            print(f"X Error for {origin} → {dest}: {e}")
            df.at[idx, "commute_time"] = "ERROR"

    df.to_csv("final_frequent_routes_with_commute.csv", index=False)
    print("Saved to 'final_frequent_routes_with_commute.csv'")

await process_multiple_rows()


from google.colab import files
files.download("final_frequent_routes_with_commute.csv")


< Processing: Wimbledon → Parsons Green >
X Error for Wimbledon → Parsons Green: Locator.click: Timeout 5000ms exceeded.
Call log:
  - waiting for locator("button[role=\"radio\"] div[role=\"img\"][aria-label=\"Transit\"]")

< Processing: Tooting Broadway → Clapham North >
-- Raw time text: 31 min
=>> Tooting Broadway → Clapham North: 31 min
< Processing: Euston National Rail → Maida Vale >
-- Raw time text: 39 min
=>> Euston National Rail → Maida Vale: 39 min
< Processing: Hammersmith Hammersmith & City → Ladbroke Grove >
-- Raw time text: 34 min
=>> Hammersmith Hammersmith & City → Ladbroke Grove: 34 min
< Processing: Euston National Rail → Willesden Junction >
-- Raw time text: 14 min
=>> Euston National Rail → Willesden Junction: 14 min
< Processing: Morden → Balham London Underground >
-- Raw time text: 1 hr 5 min
=>> Morden → Balham London Underground: 65 min
< Processing: Romford → Manor Park >
X Error for Romford → Manor Park: Locator.click: Timeout 5000ms exceeded.
Call log:
  

ERROR:asyncio:Future exception was never retrieved
future: <Future finished exception=TargetClosedError('Target page, context or browser has been closed')>
playwright._impl._errors.TargetClosedError: Target page, context or browser has been closed


X Error for Barking → Walthamstow Queen's Road: Locator.click: Timeout 5000ms exceeded.
Call log:
  - waiting for locator("button[role=\"radio\"] div[role=\"img\"][aria-label=\"Transit\"]")

< Processing: Richmond → Willesden Junction >
X Error for Richmond → Willesden Junction: Locator.click: Timeout 5000ms exceeded.
Call log:
  - waiting for locator("button[role=\"radio\"] div[role=\"img\"][aria-label=\"Transit\"]")

< Processing: Colindale → Edgware >
-- Raw time text: 38 min
=>> Colindale → Edgware: 38 min
< Processing: Maidenhead → Iver >
X Error for Maidenhead → Iver: Page.click: Timeout 5000ms exceeded.
Call log:
  - waiting for locator("text=Options")
    - locator resolved to <div class="BunUDe">…</div>
  - attempting click action
    2 × waiting for element to be visible, enabled and stable
      - element is visible, enabled and stable
      - scrolling into view if needed
      - done scrolling
      - <div class="fTryM">Did you mean:</div> from <div id="omnibox-container" 

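Several routes in the log above end in timeouts, and the pipeline description mentions retrying failed routes. One possible way to do this is a second pass over the rows marked "ERROR" or left empty. The helper `retry_failed_rows` and its parameters are hypothetical and not part of the original notebook. The demo uses a stand-in fetcher with a simulated timeout and a placeholder value of 40 minutes rather than real scraped data.

```python
import asyncio

import pandas as pd


async def retry_failed_rows(df, fetch, max_attempts=2, pause_seconds=0.0):
    """Re-run `fetch(origin, destination)` for rows marked "ERROR" or left empty."""
    failed = df["commute_time"].isna() | (df["commute_time"] == "ERROR")
    for idx in df.index[failed]:
        origin = df.at[idx, "Start_Station"]
        dest = df.at[idx, "End_Station"]
        for attempt in range(1, max_attempts + 1):
            try:
                df.at[idx, "commute_time"] = await fetch(origin, dest)
                break  # success: stop retrying this row
            except Exception as exc:
                print(f"Attempt {attempt} failed for {origin} -> {dest}: {exc}")
                await asyncio.sleep(pause_seconds)
    return df


# Demo data: one failed row and one successful row (values are placeholders).
demo = pd.DataFrame({
    "Start_Station": ["Wimbledon", "Colindale"],
    "End_Station": ["Parsons Green", "Edgware"],
    "commute_time": ["ERROR", 38],
})
calls = {"count": 0}


async def fake_fetch(origin, destination):
    """Stand-in for the real scraper: fails once, then returns a placeholder."""
    calls["count"] += 1
    if calls["count"] == 1:
        raise TimeoutError("simulated timeout")
    return 40  # placeholder value, not a real commute time


asyncio.run(retry_failed_rows(demo, fake_fetch))
print(demo)
```

Inside a notebook, where an event loop is already running, the call would be `await retry_failed_rows(df, get_commute_time_final)` instead of `asyncio.run(...)`. Rows that still fail after `max_attempts` keep their "ERROR" marker and can be inspected manually.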