# ***Singapore MRT Commuting Patterns in May 2024***

## **Project Goal**
- To uncover interesting trends in MRT trip data in May 2024

## **Research Questions**
1. What were the most visited MRT stations on weekdays vs. weekends?
2. Which stations do commuters from Bishan travel to, and which stations do commuters arriving at Bishan come from?

## **Project Steps**
1. Extract Data
2. Transform Data
3. Visualise Data

### **Step 1: Extract Data**
- Train station names and coordinates came from two separate files found online: an Excel workbook of station codes and names, and a CSV file of station coordinates.
- Monthly MRT trip counts between origin-destination pairs of stations were obtained by calling the API provided by the Land Transport Authority's (LTA) DataMall.
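
The request cell for this step reads the download link with `response.json()['value'][0]['Link']`, which raises an error if the payload is empty or shaped differently. A slightly more defensive way to read it is sketched here. `extract_dataset_link` is a hypothetical helper, and the payload shape is assumed from that expression rather than taken from the DataMall documentation.

```python
def extract_dataset_link(payload):
    """Return the first 'Link' in a payload shaped like {'value': [{'Link': ...}]}.

    Returns None when the payload has no records or the first record has no link.
    """
    records = payload.get('value') or []
    if not records:
        return None
    return records[0].get('Link')


# Illustrative payloads only; the URL is a placeholder, not a real dataset link.
sample_payload = {'value': [{'Link': 'https://example.com/od_train_202405.zip'}]}
print(extract_dataset_link(sample_payload))
print(extract_dataset_link({'value': []}))
```

Returning `None` instead of raising lets the caller decide whether a missing link is worth retrying or should stop the run.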

In [9]:
import requests

url = 'https://datamall2.mytransport.sg/ltaodataservice/PV/ODTrain'
headers = {'AccountKey': ''}  # fill in your own DataMall account key
date_to_check = 202405
params = {'Date': date_to_check}

# request the download link for the monthly origin-destination dataset
try:
    response = requests.get(url, headers = headers, params = params, timeout = 30)
    response.raise_for_status()
except requests.RequestException as err:
    print(f'request to api unsuccessful: {err}')
else:
    # only read the payload when the request succeeded
    dataset_url = response.json()['value'][0]['Link']
    print(dataset_url)

### **Step 2: Transform Data**

**Importing Libraries**

In [10]:
import numpy as np
import pandas as pd

**Checking Data Quality of Main Dataset**

In [11]:
df = pd.read_csv('origin_destination_train_202405.csv')
df.head()

| | YEAR_MONTH | DAY_TYPE | TIME_PER_HOUR | PT_TYPE | ORIGIN_PT_CODE | DESTINATION_PT_CODE | TOTAL_TRIPS |
|---|---|---|---|---|---|---|---|
| 0 | 2024-05 | WEEKDAY | 9 | TRAIN | DT24 | EW32 | 2 |
| 1 | 2024-05 | WEEKDAY | 9 | TRAIN | EW32 | DT24 | 7 |
| 2 | 2024-05 | WEEKENDS/HOLIDAY | 6 | TRAIN | BP4 | EW31 | 6 |
| 3 | 2024-05 | WEEKDAY | 6 | TRAIN | BP4 | EW31 | 38 |
| 4 | 2024-05 | WEEKENDS/HOLIDAY | 12 | TRAIN | NE15 | SW5 | 69 |


In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 808618 entries, 0 to 808617
Data columns (total 7 columns):
 #   Column               Non-Null Count   Dtype 
---  ------               --------------   ----- 
 0   YEAR_MONTH           808618 non-null  object
 1   DAY_TYPE             808618 non-null  object
 2   TIME_PER_HOUR        808618 non-null  int64 
 3   PT_TYPE              808618 non-null  object
 4   ORIGIN_PT_CODE       808618 non-null  object
 5   DESTINATION_PT_CODE  808618 non-null  object
 6   TOTAL_TRIPS          808618 non-null  int64 
dtypes: int64(2), object(5)
memory usage: 43.2+ MB


In [13]:
df.duplicated().sum()

np.int64(0)

While the data is clean, three issues remain:
- The provided station codes are not suitable for visualisation
- The dataset lacks the station coordinates needed for mapping later
- Comparing weekday and weekend traffic using monthly totals is skewed, since a month has more weekdays than weekend days

Hence, the following section attempts to
1. Replace train station codes with names
2. Add coordinates for each train station
3. Convert monthly totals into average trips per weekday and per weekend/holiday day


**Checking for Inconsistencies between Datasets**

In [14]:
station_names = pd.read_excel("Train Station Codes and Chinese Names.xls")
station_coords = pd.read_csv('mrt_coords.csv')
df = pd.read_csv('origin_destination_train_202405.csv')

# checking for train stations not present in both 'station name' and 'station coordinates' datasets
s1 = set(station_names['mrt_station_english'])
s2 = set(station_coords['station_name'])
common_items = s1.intersection(s2)
all_items = s1.union(s2)
#print(all_items  - common_items)

# checking for origin station codes in main df not present in 'station name' dataset
s1 = set(station_names['stn_code'])
s3 = set(df['ORIGIN_PT_CODE'].str.split('/').str[0])
#print(s3.difference(s1))

# checking for destination station codes in main df not present in 'station name' dataset
s4 = set(df['DESTINATION_PT_CODE'].str.split('/').str[0])
print(s4.difference(s1))



set()
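
The checks above keep only the part of each code before the first `/` before comparing, presumably because interchange stations carry several codes in one field. The same idea in plain Python is sketched here. `first_code` and `unmatched_codes` are hypothetical helpers, and the sample codes are illustrative rather than read from the datasets.

```python
def first_code(pt_code):
    """Keep only the first station code of a '/'-separated field."""
    return pt_code.split('/')[0]


def unmatched_codes(trip_codes, known_codes):
    """Return the normalised trip codes that have no entry in known_codes."""
    return {first_code(code) for code in trip_codes} - set(known_codes)


# Illustrative values: two single codes and one combined interchange code.
trip_codes = ['DT24', 'EW32', 'NS17/CC15']
known_codes = ['DT24', 'EW32', 'NS17']

print(unmatched_codes(trip_codes, known_codes))            # set()
print(unmatched_codes(trip_codes + ['XX1'], known_codes))  # {'XX1'}
```

An empty set means every trip code can be mapped to a station name. Note that this only works if the names file has a row for the first code of each interchange.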


**Changing train codes to names**

In [15]:
def code_to_name_2(df):
    # read in necessary files
    names = pd.read_excel('Train Station Codes and Chinese Names.xls')

    # change origin code to origin station
    df['ORIGIN_PT_CODE'] = df['ORIGIN_PT_CODE'].str.split('/').str[0]
    df = pd.merge(df, names, left_on = 'ORIGIN_PT_CODE', right_on = 'stn_code', how = 'left')
    df = df.drop(columns=['stn_code', 'mrt_station_chinese', 'mrt_line_english', 'mrt_line_chinese', 'ORIGIN_PT_CODE', 'PT_TYPE'])
    df = df.rename(columns = {'mrt_station_english':'ORIGIN_STATION_NAME'})

    # change destination code to destination station
    df['DESTINATION_PT_CODE'] = df['DESTINATION_PT_CODE'].str.split('/').str[0]
    df = pd.merge(df, names, left_on = 'DESTINATION_PT_CODE', right_on = 'stn_code', how = 'left')
    df = df.drop(columns=['stn_code', 'mrt_station_chinese', 'mrt_line_english', 'mrt_line_chinese', 'DESTINATION_PT_CODE'])
    df = df.rename(columns = {'mrt_station_english':'DESTINATION_STATION_NAME'})

    return df

df = pd.read_csv('origin_destination_train_202405.csv')
df = code_to_name_2(df)
#df.head()
df.isna().sum()

YEAR_MONTH                  0
DAY_TYPE                    0
TIME_PER_HOUR               0
TOTAL_TRIPS                 0
ORIGIN_STATION_NAME         0
DESTINATION_STATION_NAME    0
dtype: int64

**Adding coordinates corresponding to each train station**

In [16]:
def add_coordinates(df):
    coords = pd.read_csv('mrt_coords.csv')
    # add origin station coordinates
    df = pd.merge(df, coords, left_on = 'ORIGIN_STATION_NAME', right_on = 'station_name', how = 'left')
    df = df.drop(columns=['type', 'station_name'])
    df = df.rename(columns = {'lat':'ORIGIN_STATION_LAT', 'lng':'ORIGIN_STATION_LNG'})

    # add destination station coordinates
    df = pd.merge(df, coords, left_on = 'DESTINATION_STATION_NAME', right_on = 'station_name', how = 'left')
    df = df.drop(columns=['type', 'station_name'])
    df = df.rename(columns = {'lat':'DESTINATION_STATION_LAT', 'lng':'DESTINATION_STATION_LNG'})

    return df

df = add_coordinates(df)
#df.head()
df.isna().sum()


YEAR_MONTH                  0
DAY_TYPE                    0
TIME_PER_HOUR               0
TOTAL_TRIPS                 0
ORIGIN_STATION_NAME         0
DESTINATION_STATION_NAME    0
ORIGIN_STATION_LAT          0
ORIGIN_STATION_LNG          0
DESTINATION_STATION_LAT     0
DESTINATION_STATION_LNG     0
dtype: int64
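
A left merge followed by `isna().sum()` catches stations with no coordinates, but it would not reveal a lookup table that lists a station twice, which would silently duplicate trip rows. One way to guard against both is sketched below on toy data. `validate` and `indicator` are standard `pd.merge` parameters; the two small frames are illustrative stand-ins, not the real files.

```python
import pandas as pd

# Toy stand-ins for the trip table and the coordinates lookup.
trips = pd.DataFrame({
    'ORIGIN_STATION_NAME': ['Teck Whye', 'Buangkok', 'Teck Whye'],
    'TOTAL_TRIPS': [6, 69, 38],
})
coords = pd.DataFrame({
    'station_name': ['Teck Whye', 'Buangkok'],
    'lat': [1.376738, 1.382991],
    'lng': [103.753665, 103.893347],
})

merged = pd.merge(
    trips, coords,
    left_on='ORIGIN_STATION_NAME', right_on='station_name',
    how='left',
    validate='many_to_one',  # raises MergeError if a station appears twice in coords
    indicator=True,          # adds a '_merge' column: 'both' or 'left_only'
)

# Rows flagged 'left_only' are trips whose station had no match in the lookup.
unmatched = merged[merged['_merge'] == 'left_only']
print(len(merged) == len(trips), len(unmatched))
```

If the row count is unchanged and `unmatched` is empty, the merge added coordinates without dropping or duplicating trips.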

**Converting monthly totals to average trips per weekday and per weekend/holiday day**

In [17]:
def avg_volume_2(df, weekdays, weekend):
    # work on a copy so the caller's dataframe is not modified in place
    df = df.copy()
    df['TOTAL_TRIPS'] = df['TOTAL_TRIPS'].astype(float)

    # divide monthly totals by the number of days of each type
    df.loc[df['DAY_TYPE'] == 'WEEKDAY', 'TOTAL_TRIPS'] /= weekdays
    df.loc[df['DAY_TYPE'] == 'WEEKENDS/HOLIDAY', 'TOTAL_TRIPS'] /= weekend

    # round to the nearest whole trip and cast back to integer
    df['TOTAL_TRIPS'] = df['TOTAL_TRIPS'].round(0).astype(int)

    return df

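# demo run with 23 calendar weekdays and 8 weekend days; the final transform further down
# uses 21 and 10, which also count May's two public holidays as WEEKENDS/HOLIDAY
avg_volume_2(df, 23, 8).head()


| | YEAR_MONTH | DAY_TYPE | TIME_PER_HOUR | TOTAL_TRIPS | ORIGIN_STATION_NAME | DESTINATION_STATION_NAME | ORIGIN_STATION_LAT | ORIGIN_STATION_LNG | DESTINATION_STATION_LAT | DESTINATION_STATION_LNG |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2024-05 | WEEKDAY | 9 | 0 | Geylang Bahru | Tuas West Road | 1.321479 | 103.871457 | 1.330075 | 103.639636 |
| 1 | 2024-05 | WEEKDAY | 9 | 0 | Tuas West Road | Geylang Bahru | 1.330075 | 103.639636 | 1.321479 | 103.871457 |
| 2 | 2024-05 | WEEKENDS/HOLIDAY | 6 | 1 | Teck Whye | Tuas Crescent | 1.376738 | 103.753665 | 1.321091 | 103.649075 |
| 3 | 2024-05 | WEEKDAY | 6 | 2 | Teck Whye | Tuas Crescent | 1.376738 | 103.753665 | 1.321091 | 103.649075 |
| 4 | 2024-05 | WEEKENDS/HOLIDAY | 12 | 9 | Buangkok | Fernvale | 1.382991 | 103.893347 | 1.392033 | 103.876256 |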
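
To make the arithmetic concrete: each monthly total is divided by the number of days of its type and rounded to a whole number, so low-volume station pairs can round down to zero. A plain-Python sketch of the same calculation follows. `average_per_day` is a hypothetical helper, and the day counts in the usage lines describe a made-up month.

```python
def average_per_day(total_trips, day_type, weekdays, weekend):
    """Average a monthly total over the number of days of its day type."""
    days = weekdays if day_type == 'WEEKDAY' else weekend
    if days <= 0:
        raise ValueError('day counts must be positive')
    return round(total_trips / days)


# A hypothetical month with 20 weekdays and 10 weekend/holiday days.
print(average_per_day(38, 'WEEKDAY', 20, 10))          # 38 / 20 = 1.9 -> 2
print(average_per_day(6, 'WEEKENDS/HOLIDAY', 20, 10))  # 6 / 10 = 0.6 -> 1
print(average_per_day(2, 'WEEKDAY', 20, 10))           # 2 / 20 = 0.1 -> 0
```

Both Python's built-in `round` and the pandas `round` method resolve exact halves to the nearest even number, so an average of exactly 0.5 becomes 0 rather than 1.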


**Create function that strings together all previous functions**

In [18]:
def transform_2(date, avg, weekdays = 0, weekend = 0):
    df = pd.read_csv(f'origin_destination_train_{date}.csv')
    df = code_to_name_2(df)
    df = add_coordinates(df)
    if avg:
        # validate day counts up front: dividing by zero would fill TOTAL_TRIPS with inf
        if weekdays <= 0 or weekend <= 0:
            raise ValueError('Invalid weekday/weekend input')
        df = avg_volume_2(df, weekdays, weekend)
    return df

df_may = transform_2('202405', True, 21, 10)
df_jun = transform_2('202406', True, 19, 11)
df_jul = transform_2('202407', True, 23, 8)
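
The weekday and weekend counts passed to `transform_2` can be derived rather than counted by hand. A small sketch using only the standard library is shown below. `count_day_types` is a hypothetical helper, and the public holiday dates are supplied by the caller as an assumption that should be checked against an official calendar.

```python
import calendar
from datetime import date


def count_day_types(year, month, holidays=()):
    """Return (weekdays, weekend_or_holiday) day counts for a month.

    A day counts as weekend/holiday if it is a Saturday, a Sunday,
    or one of the dates in `holidays`.
    """
    holidays = set(holidays)
    weekdays = weekend = 0
    for day in range(1, calendar.monthrange(year, month)[1] + 1):
        current = date(year, month, day)
        if current.weekday() >= 5 or current in holidays:
            weekend += 1
        else:
            weekdays += 1
    return weekdays, weekend


# Assumed public holidays for May 2024 (Labour Day and Vesak Day).
may_holidays = [date(2024, 5, 1), date(2024, 5, 22)]
print(count_day_types(2024, 5, may_holidays))  # (21, 10)
print(count_day_types(2024, 7))                # (23, 8)
```

This treats every public holiday as a `WEEKENDS/HOLIDAY` day, which is an interpretation of the `DAY_TYPE` labels rather than something the dataset states explicitly.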

In [19]:
df_may.isna().sum()

YEAR_MONTH                  0
DAY_TYPE                    0
TIME_PER_HOUR               0
TOTAL_TRIPS                 0
ORIGIN_STATION_NAME         0
DESTINATION_STATION_NAME    0
ORIGIN_STATION_LAT          0
ORIGIN_STATION_LNG          0
DESTINATION_STATION_LAT     0
DESTINATION_STATION_LNG     0
dtype: int64

**Exporting transformed datasets**

In [20]:
def export_df(df, date):
    df.to_csv(f'origin_destination_{date}_updated.csv', index = False)

#export_df(df_may, '202405')
#export_df(df_jun, '202406')
#export_df(df_jul, '202407')
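
Once exported, the transformed rows are enough to start on the research questions before moving to a dashboard. As a rough sketch of the second question, the snippet below totals trips by destination for journeys starting at one station. `top_destinations` is a hypothetical helper, and the rows are made-up placeholders that only borrow the column names of the transformed dataset.

```python
from collections import Counter


def top_destinations(rows, origin, day_type, n=3):
    """Sum TOTAL_TRIPS by destination for one origin station and day type."""
    totals = Counter()
    for row in rows:
        if row['ORIGIN_STATION_NAME'] == origin and row['DAY_TYPE'] == day_type:
            totals[row['DESTINATION_STATION_NAME']] += row['TOTAL_TRIPS']
    return totals.most_common(n)


# Made-up placeholder rows; station names other than the origin are dummies.
rows = [
    {'DAY_TYPE': 'WEEKDAY', 'TIME_PER_HOUR': 8, 'TOTAL_TRIPS': 5,
     'ORIGIN_STATION_NAME': 'Bishan', 'DESTINATION_STATION_NAME': 'Station A'},
    {'DAY_TYPE': 'WEEKDAY', 'TIME_PER_HOUR': 18, 'TOTAL_TRIPS': 3,
     'ORIGIN_STATION_NAME': 'Bishan', 'DESTINATION_STATION_NAME': 'Station A'},
    {'DAY_TYPE': 'WEEKDAY', 'TIME_PER_HOUR': 8, 'TOTAL_TRIPS': 6,
     'ORIGIN_STATION_NAME': 'Bishan', 'DESTINATION_STATION_NAME': 'Station B'},
    {'DAY_TYPE': 'WEEKENDS/HOLIDAY', 'TIME_PER_HOUR': 12, 'TOTAL_TRIPS': 9,
     'ORIGIN_STATION_NAME': 'Bishan', 'DESTINATION_STATION_NAME': 'Station B'},
    {'DAY_TYPE': 'WEEKDAY', 'TIME_PER_HOUR': 8, 'TOTAL_TRIPS': 50,
     'ORIGIN_STATION_NAME': 'Station C', 'DESTINATION_STATION_NAME': 'Station A'},
]

print(top_destinations(rows, 'Bishan', 'WEEKDAY'))
```

Summing across `TIME_PER_HOUR` gives a daily figure per destination; swapping the roles of the origin and destination columns would answer the "come from" half of the question.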

### **Step 3: Visualise Data**
See the Tableau dashboard [here](https://public.tableau.com/app/profile/norman.ng4484/viz/SGMRTTripsMay2024/Outbound).