# Process Overview for Scraping Images

The first step in this traffic analysis project is to scrape traffic images from LTA. LTA provides these images through the [Datamall API](https://datamall.lta.gov.sg/content/datamall/en/dynamic-data.html):
- The API returns traffic images from 87 different locations in Singapore's expressways
- The images are updated every 1 to 5 minutes
- These images are retrieved by data.gov.sg and wrapped in another API
- We will be using the data.gov.sg API for the purposes of this project

The process for interacting with the data.gov.sg [Traffic Images API](https://data.gov.sg/dataset/traffic-images) is as follows:
- Call the API with a datetime, passed as the `date_time` query parameter
- The API will retrieve the latest available data at that moment in time, which will contain the following:
    - image metadata (image dimensions, camera longitude & latitude)
    - datetime of acquisition
    - a static link to the traffic image
    - an MD5 hash
    
Therefore, to obtain the traffic images, we would first need to use the data.gov.sg traffic images API to obtain the links to the images.
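
As a rough illustration of the request step described above, the sketch below builds the request URL with `urllib.parse.urlencode` rather than by hand. The helper name `build_traffic_images_url` is hypothetical; the endpoint and the `date_time` parameter are the ones used in the function definitions in this notebook. Seconds are zeroed on the assumption that minute-level resolution is enough.

```python
from datetime import datetime
from urllib.parse import urlencode

# Endpoint used by the API call in this notebook
BASE_URL = "https://api.data.gov.sg/v1/transport/traffic-images"


def build_traffic_images_url(moment):
    """Return the request URL for a naive datetime (hypothetical helper)."""
    # Zero the seconds so every call lands on a whole minute
    moment = moment.replace(second=0, microsecond=0)
    # urlencode percent-encodes the colons in the timestamp as %3A
    query = urlencode({"date_time": moment.isoformat(timespec="seconds")})
    return f"{BASE_URL}?{query}"


example_url = build_traffic_images_url(datetime(2022, 10, 1, 0, 0, 0))
print(example_url)
# https://api.data.gov.sg/v1/transport/traffic-images?date_time=2022-10-01T00%3A00%3A00
```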

For the purposes of this study, we will be scraping traffic images from selected cameras for the entire month of October.

# Imports

In [1]:
import datetime
import pytz
import requests
import ast
import numpy as np
import urllib 
import shutil
import matplotlib.pyplot as plt
import pandas as pd
import os
from tqdm import tqdm

# Function Definition for Image Link Scraping

First, we need to define a few functions that call the data.gov.sg traffic images API and transform the returned information into a dataframe.

In [2]:
def convert_to_dataframe(input_list):
    '''
    This function converts the output from the API (a list of dictionaries) into a pandas dataframe
    '''
    
    # casting the input list as a dataframe
    output = pd.DataFrame(input_list)

    # convert location columns from dictionary to columns
    output = (pd.concat([output, # concatenating the original output column
                         output['location'].apply(pd.Series)], # with the dictionary of data from the location column
                        axis=1). # concatenating on columns
              drop('location',axis = 1)) # dropping the original location column

    # convert image_metadata columns from dictionary to columns
    output = (pd.concat([output, # concatenating the original output column
                         output['image_metadata'].apply(pd.Series)], # with the dictionary of data from the image_metadata column
                        axis=1). # concatenating on columns
              drop('image_metadata',axis = 1)) # dropping the original image_metadata column
    
    # returning the output dataframe
    return output


def get_lta_traffic_camera_data(datetime_call):
    '''
    This function converts the datetime_call (in the form of a datetime object) to the required format, and then calls the traffic images API
    '''
    
    # formats the datetime_call
    datetime_call_formatted = datetime_call.strftime("%Y-%m-%d") + "T" + datetime_call.strftime("%H") + "%3A" + datetime_call.strftime("%M") + "%3A00" 
    
    # getting the api call
    api = 'https://api.data.gov.sg/v1/transport/traffic-images?date_time='+ datetime_call_formatted
    

    # reading the camera data from data.gov.sg and parsing the JSON response
    camera_data = requests.get(api).json()["items"][0]["cameras"]

    # returning the output, converted as a dataframe
    return convert_to_dataframe(camera_data)


def lta_traffic_camera_scraping(start_datetime, end_datetime, resolution_minute):
    '''
    This function takes in the start and end datetime as well as the required resolution, and returns a dataframe that contains the API call with links to the traffic images
    '''
    # calculate number of observations required:
    num_obs = (end_datetime - start_datetime) / datetime.timedelta(minutes=resolution_minute)
    num_obs = int(np.floor(num_obs)) # converting num_obs to an integer
    
    # create a list of datetime to be called inside the API
    datetime_list = [start_datetime + datetime.timedelta(minutes=resolution_minute*x) for x in range(num_obs)]
    
    # convert the datetime list into a tqdm for displaying progress bar
    datetime_list_pbar = tqdm(datetime_list)
    
    # instantiating a dictionary to contain the API calls
    output_dict = {}
    
    # iterating through all the datetime_call from the datetime_list above and calling the api for each of the datetime_call
    for datetime_call in datetime_list_pbar:
        
        # printing out current datetime_call
        datetime_list_pbar.set_description(f'Currently scraping: {datetime_call.strftime("%Y-%m-%d_%H:%M")}')

        # actual API call
        current_output = get_lta_traffic_camera_data(datetime_call) # calling the api on the specified datetime_call
        output_dict[datetime_call.strftime("%Y-%m-%d_%H:%M")] = current_output # appending the output dictionary with the current API call
        
    # CONVERTING THE DICTIONARY TO A DATAFRAME
    # converting the output_dict to a long dataframe
    output_df = pd.concat(output_dict.values(), axis=0)
    
    # filtering the columns of the df (copying so the assignment below does not trigger a SettingWithCopyWarning)
    output_df = output_df[['timestamp','image','camera_id','md5']].copy()
    
    # removing the image link prefix (same for all images) in order to save space
    image_link_prefix='https://images.data.gov.sg/api/traffic-images/'
    output_df['image'] = output_df['image'].str.replace(image_link_prefix,'',regex=False)
    
    # returns the dataframe
    return output_df
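
As a possible alternative to expanding the nested columns with `apply(pd.Series)`, the sketch below flattens a record with `pd.json_normalize`. The sample record is hypothetical: its values are placeholders, and the nested key names (`latitude`, `longitude`, `height`, `width`) are assumptions based on the metadata described in the overview, not captured API output.

```python
import pandas as pd

# Hypothetical record shaped like one entry of the "cameras" list;
# every value is a placeholder, not real API output
sample_cameras = [
    {
        "timestamp": "2022-10-01T00:00:00+08:00",
        "image": "https://images.data.gov.sg/api/traffic-images/placeholder.jpg",
        "location": {"latitude": 1.0, "longitude": 103.0},
        "camera_id": "0000",
        "image_metadata": {"height": 240, "width": 320, "md5": "placeholder"},
    }
]

# json_normalize expands nested dictionaries into "parent.child" columns
flat = pd.json_normalize(sample_cameras)

# keep only the child part of each name, e.g. "location.latitude" -> "latitude"
flat.columns = [name.split(".")[-1] for name in flat.columns]

print(sorted(flat.columns))
```

This gives the same flat column names in a single call, provided no two nested dictionaries share a key.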

# Scraping Image Links

We first need to scrape the image links from the data.gov.sg API. The actual images will be downloaded in the next notebook.

In [3]:
# for this example, we will only be scraping the data for 1 hour on 01/10/2022
# the actual data required for the study has already been scraped beforehand

# DEFINING THE PARAMETERS FOR SCRAPING
start_datetime = datetime.datetime(2022,10,1,0,0,0)
end_datetime = datetime.datetime(2022,10,1,1,0,0)
resolution_minute = 5

df = lta_traffic_camera_scraping(start_datetime=start_datetime,
                                 end_datetime=end_datetime,
                                 resolution_minute=resolution_minute,
                                )

Currently scraping: 2022-10-01_00:55: 100%|████████████████████████████████████████████| 12/12 [00:03<00:00,  3.16it/s]
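
The progress bar shows 12 calls because the end datetime is exclusive: one hour at a 5-minute resolution gives timestamps from 00:00 to 00:55. A minimal sketch of the same arithmetic, using only the standard library (the variable names here are illustrative):

```python
import datetime

start = datetime.datetime(2022, 10, 1, 0, 0, 0)
end = datetime.datetime(2022, 10, 1, 1, 0, 0)
step = datetime.timedelta(minutes=5)

# floor division of two timedeltas gives the number of whole steps
num_obs = (end - start) // step

# range(num_obs) stops one step short of the end datetime
timestamps = [start + step * i for i in range(num_obs)]

print(num_obs)                        # 12
print(timestamps[0], timestamps[-1])  # 2022-10-01 00:00:00 2022-10-01 00:55:00
```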


# Saving Scraped Data

In [4]:
# getting the filename for saving the file
filename = 'LTA_traffic_cam_' + start_datetime.strftime('%Y%m%d') + '-' + end_datetime.strftime('%Y%m%d') + '.csv'

# saving the data
df.to_csv(f'../data/{filename}')
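
Because the common link prefix is stripped before saving, it has to be added back before the images can be downloaded. The sketch below shows one way to do that when reading the file back. The CSV content is an in-memory stand-in with placeholder rows (including the hypothetical camera IDs), not scraped data.

```python
import io

import pandas as pd

# prefix removed by lta_traffic_camera_scraping before saving
IMAGE_LINK_PREFIX = "https://images.data.gov.sg/api/traffic-images/"

# stand-in for the saved CSV; the rows are placeholders, not scraped data
csv_text = (
    ",timestamp,image,camera_id,md5\n"
    "0,2022-10-01T00:00:00+08:00,placeholder-a.jpg,0001,placeholder\n"
    "1,2022-10-01T00:00:00+08:00,placeholder-b.jpg,0002,placeholder\n"
)

# index_col=0 reads back the index column written by to_csv;
# dtype=str keeps identifiers such as "0001" from being parsed as integers
links = pd.read_csv(io.StringIO(csv_text), index_col=0, dtype=str)

# rebuilding the full image link from the stored suffix
links["image_url"] = IMAGE_LINK_PREFIX + links["image"]

print(links["image_url"].tolist())
```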