# **Generation SG Junior Data Engineer Programme**
### **Interim Project presented by DPPS Team (5)**<br><span style="color:darkblue; font-weight:bold;">Members: Daniel | Pin Pin, Yvonne | Pin Yean, Erica | Shawn</span>


### <span style="color:darkblue; font-weight:bold;">API Data Extraction - Air Temperature</span>
<div>This document demonstrates how the team made requests to APIs in Python, using common libraries, to extract single-source-of-truth data for our project from the Data.Gov.SG open API. Working with the API offers the following benefits:</div>

- **Real-time data access**: retrieve up-to-date data on demand
- **Large datasets**: integrate data from multiple sources
- **Pre-processed data**: save significant time and resources


### **Understanding API Documentation**
<div>When working with APIs, consulting the documentation is crucial for making successful requests. It includes:</div>

- available endpoints
- required parameters
- authentication methods
- expected response formats
- license of the open data (free or paid, personal or commercial, etc.)


### **Evaluating The Options Available**
<div>We accessed different APIs on Data.Gov.SG and noticed that they are <em>REST APIs</em> (Representational State Transfer Application Programming Interfaces). A REST API is a popular type of web service that follows the <em>REST</em> architectural style: applications communicate by exchanging data through standardized methods, typically HTTP requests that access and manipulate resources on a server.</div><br>

**API request and response models:**
- **Request body**: JSON (JavaScript Object Notation) format
- **200**: Everything went okay and the result has been returned (if any)
- **400**: The server considers the request invalid, for example because we sent incorrect data or made another client-side error
- **404**: The resource we tried to access was not found on the server
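
As a minimal sketch of how these status codes might be handled in Python, the helper below maps each code to its meaning and only parses the JSON body when the status is 200. The names `STATUS_MEANINGS` and `handle_response` are hypothetical and are not part of the team's script.

```python
import json

# Meanings of the status codes listed above
STATUS_MEANINGS = {
    200: "OK: the request succeeded and the result (if any) was returned",
    400: "Bad Request: the server rejected the request due to a client-side error",
    404: "Not Found: the requested resource does not exist on the server",
}


def handle_response(status_code, body_text):
    """Return the parsed JSON body for a 200 response, otherwise None."""
    if status_code == 200:
        # An empty body is valid for 200, so only parse when there is content
        return json.loads(body_text) if body_text else None
    # Report known codes by meaning and anything else as unexpected
    print(STATUS_MEANINGS.get(status_code, f"Unexpected status code: {status_code}"))
    return None


print(handle_response(200, '{"code": 0}'))
print(handle_response(404, ""))
```

With the Requests library, the same helper could be called as `handle_response(response.status_code, response.text)` after `requests.get(...)`.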


### **Making GET Request To The API**
<div>We performed a <em>GET request</em> to the Singapore government's open data API endpoint for real-time air temperature data. The request returned a JSON response containing air temperature readings across various locations. The data obtained allows us to analyze:</div>

1. The number of monitoring stations deployed
2. The geographical distribution of these stations
3. The current air temperature readings at each location
4. Timestamps of the measurements

By examining this JSON data, we can gain insights into the scale and coverage of Singapore's air temperature monitoring network, as well as observe real-time temperature variations across different parts of the city-state.

**[Link](https://api-open.data.gov.sg/v2/real-time/api/air-temperature)**

<span style="color:darkgreen; font-weight:bold;">JSON code</span>

In [None]:
{
  "code": 0,
  "data": {
    "stations": [
      {
        "id": "S109",
        "deviceId": "S109",
        "name": "Ang Mo Kio Avenue 5",
        "location": {
          "latitude": 1.3764,
          "longitude": 103.8492
        }
      },
      {
        "id": "S106",
        "deviceId": "S106",
        "name": "Pulau Ubin",
        "location": {
          "latitude": 1.4168,
          "longitude": 103.9673
        }
      },
      {
        "id": "S117",
        "deviceId": "S117",
        "name": "Banyan Road",
        "location": {
          "latitude": 1.256,
          "longitude": 103.679
        }
      },
      {
        "id": "S107",
        "deviceId": "S107",
        "name": "East Coast Parkway",
        "location": {
          "latitude": 1.3135,
          "longitude": 103.9625
        }
      },
      {
        "id": "S104",
        "deviceId": "S104",
        "name": "Woodlands Avenue 9",
        "location": {
          "latitude": 1.44387,
          "longitude": 103.78538
        }
      },
      {
        "id": "S115",
        "deviceId": "S115",
        "name": "Tuas South Avenue 3",
        "location": {
          "latitude": 1.29377,
          "longitude": 103.61843
        }
      },
      {
        "id": "S116",
        "deviceId": "S116",
        "name": "West Coast Highway",
        "location": {
          "latitude": 1.281,
          "longitude": 103.754
        }
      },
      {
        "id": "S60",
        "deviceId": "S60",
        "name": "Sentosa",
        "location": {
          "latitude": 1.25,
          "longitude": 103.8279
        }
      },
      {
        "id": "S50",
        "deviceId": "S50",
        "name": "Clementi Road",
        "location": {
          "latitude": 1.3337,
          "longitude": 103.7768
        }
      },
      {
        "id": "S44",
        "deviceId": "S44",
        "name": "Nanyang Avenue",
        "location": {
          "latitude": 1.34583,
          "longitude": 103.68166
        }
      },
      {
        "id": "S43",
        "deviceId": "S43",
        "name": "Kim Chuan Road",
        "location": {
          "latitude": 1.3399,
          "longitude": 103.8878
        }
      },
      {
        "id": "S24",
        "deviceId": "S24",
        "name": "Upper Changi Road North",
        "location": {
          "latitude": 1.3678,
          "longitude": 103.9826
        }
      },
      {
        "id": "S06",
        "deviceId": "S06",
        "name": "Paya Lebar",
        "location": {
          "latitude": 1.3524,
          "longitude": 103.9007
        }
      },
      {
        "id": "S111",
        "deviceId": "S111",
        "name": "Scotts Road",
        "location": {
          "latitude": 1.31055,
          "longitude": 103.8365
        }
      },
      {
        "id": "S121",
        "deviceId": "S121",
        "name": "Old Choa Chu Kang Road",
        "location": {
          "latitude": 1.37288,
          "longitude": 103.72244
        }
      }
    ],
    "readings": [
      {
        "timestamp": "2024-11-27T19:59:00+08:00",
        "data": [
          {
            "stationId": "S109",
            "value": 25
          },
          {
            "stationId": "S106",
            "value": 24.7
          },
          {
            "stationId": "S117",
            "value": 26.1
          },
          {
            "stationId": "S107",
            "value": 25.6
          },
          {
            "stationId": "S104",
            "value": 25.3
          },
          {
            "stationId": "S115",
            "value": 26.3
          },
          {
            "stationId": "S116",
            "value": 26.3
          },
          {
            "stationId": "S60",
            "value": 25.8
          },
          {
            "stationId": "S50",
            "value": 25.4
          },
          {
            "stationId": "S44",
            "value": 25.7
          },
          {
            "stationId": "S43",
            "value": 25.6
          },
          {
            "stationId": "S24",
            "value": 24.7
          },
          {
            "stationId": "S06",
            "value": 25
          },
          {
            "stationId": "S111",
            "value": 24.9
          },
          {
            "stationId": "S121",
            "value": 25.4
          }
        ]
      }
    ],
    "readingType": "DBT 1M F",
    "readingUnit": "deg C"
  },
  "errorMsg": ""
}
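
The response above keeps station metadata and readings in separate lists linked by station ID. The sketch below shows one way to join them into flat rows, using a trimmed copy of the response (two stations only). The function name `flatten_air_temperature` and the output column names are assumptions made for illustration.

```python
def flatten_air_temperature(payload):
    """Join each reading with its station metadata; return one dict per reading."""
    data = payload.get("data", {})
    # Index stations by ID so each reading can look up its name and location
    stations = {s["id"]: s for s in data.get("stations", [])}
    rows = []
    for reading in data.get("readings", []):
        timestamp = reading.get("timestamp")
        for entry in reading.get("data", []):
            station = stations.get(entry["stationId"], {})
            location = station.get("location", {})
            rows.append({
                "station_id": entry["stationId"],
                "station_name": station.get("name"),
                "latitude": location.get("latitude"),
                "longitude": location.get("longitude"),
                "temperature": entry["value"],
                "timestamp": timestamp,
            })
    return rows


# Trimmed copy of the response shown above (first two stations only)
sample = {
    "code": 0,
    "data": {
        "stations": [
            {"id": "S109", "deviceId": "S109", "name": "Ang Mo Kio Avenue 5",
             "location": {"latitude": 1.3764, "longitude": 103.8492}},
            {"id": "S106", "deviceId": "S106", "name": "Pulau Ubin",
             "location": {"latitude": 1.4168, "longitude": 103.9673}},
        ],
        "readings": [
            {"timestamp": "2024-11-27T19:59:00+08:00",
             "data": [{"stationId": "S109", "value": 25},
                      {"stationId": "S106", "value": 24.7}]},
        ],
        "readingType": "DBT 1M F",
        "readingUnit": "deg C",
    },
    "errorMsg": "",
}

rows = flatten_air_temperature(sample)
for row in rows:
    print(row["station_name"], row["temperature"], row["timestamp"])
```

The resulting list of dictionaries can be passed directly to `pd.DataFrame(rows)` if a DataFrame is preferred.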

### **Python Libraries Used**
<div>We selected these libraries because together they provide a powerful toolkit for data processing, web interaction, database operations and parallel execution in Python.</div>

- **Requests**: used for making HTTP requests in Python
- **Pandas**: used for data manipulation for effectively handling structured data
- **Datetime and Timedelta**: used for working with dates, times and time intervals
- **SQLAlchemy**: a SQL toolkit and Object-Relational Mapping (ORM) to connect to relational databases
- **Logging**: provides a flexible framework for generating log messages
- **Concurrent.futures**: used for parallelizing tasks 

In [None]:
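# Install the following packages in your Anaconda Prompt or Terminal:
conda install requests
conda install pandas
conda install sqlalchemy
conda install psycopg2  # PostgreSQL driver that SQLAlchemy uses by default for postgresql:// URLs

# datetime, logging and concurrent.futures are part of the Python standard library and need no installation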

### **Accessing Data**
<div>The following script has been optimized: data ingestion time dropped from 15 minutes to 2 minutes, an improvement of <b>approximately 86.67%</b>. The script <b>collects</b>, <b>processes</b> and <b>stores</b> air temperature data from an API, with the following benefits:</div>

1. **Comprehensive data collection**: Fetches hourly air temperature data for rich dataset analysis
2. **Efficient concurrent processing**: Uses ThreadPoolExecutor for parallel data fetching, reducing overall execution time
3. **Robust error handling**: Implements try-except blocks to handle and log potential errors gracefully, which helps with debugging
4. **Data Integrity**: Removes duplicate entries before insertion
5. **Modular Design**: Enhances readability and maintainability with distinct functions
6. **Database Integration**: Seamlessly stores processed data in a PostgreSQL database
7. **Data Verification**: Verify number of records in database after insertion for quick sanity check
8. **Flexible Date Range**: Apply start and end dates for data collection period
9. **Pandas Integration**: Leverages pandas for efficient data manipulation and transformation
10. **API Error Resilience**: Handles potential API failures gracefully, logging errors without crashing entire process
11. **Scalability**: Allows for easy scaling to handle large date ranges or more frequent data points
12. **Reusability**: Easy to adapt the script for similar data collection tasks from different APIs
13. **Automated Workflow**: Script can run autonomously to collect and store data regularly
14. **Data Consistency**: Format data by processing API responses into a standardized structure before database insertion.
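
Points 2 and 4 can be illustrated in isolation. The sketch below replaces the real API call with a stub (`fake_fetch`, a hypothetical stand-in that returns a placeholder row) so the parallel fetch and de-duplication pattern can be run without network access. Note that it de-duplicates on the station and timestamp pair, whereas the full script calls pandas `drop_duplicates()` on whole rows.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from datetime import datetime, timedelta


def fake_fetch(moment):
    """Stand-in for an API call (hypothetical): one placeholder row per requested hour."""
    return [{"station_id": "S109", "airtemp_date": moment.isoformat(), "temperature": 25.0}]


def collect(moments, max_workers=4):
    """Fetch all moments in parallel and return the de-duplicated rows."""
    rows = []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(fake_fetch, m) for m in moments]
        # as_completed yields futures in completion order, not submission order
        for future in as_completed(futures):
            rows.extend(future.result())
    # Keep one row per (station, timestamp) pair, then restore chronological order
    unique = {(r["station_id"], r["airtemp_date"]): r for r in rows}
    return sorted(unique.values(), key=lambda r: r["airtemp_date"])


start = datetime(2023, 10, 1)
hours = [start + timedelta(hours=i) for i in range(3)]
result = collect(hours + hours[:1])  # the first hour is requested twice on purpose
print(len(result))
```

Because the work is I/O-bound (waiting on HTTP responses), threads are a reasonable fit here; the worker count is a tuning choice rather than a fixed rule.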

In [None]:
import requests
import pandas as pd
from datetime import datetime, timedelta
from sqlalchemy import create_engine, text
import logging
from concurrent.futures import ThreadPoolExecutor, as_completed

# Logging Configuration
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

# Constants Definition
API_URL = "https://api.data.gov.sg/v1/environment/air-temperature" # API endpoint
DB_CONFIG = {
    'user': 'postgres',              # Database username
    'password': 'admin',             # Database password
    'host': 'localhost',             # Database server address
    'port': '5432',                  # Database port number (PostgreSQL default)
    'database': 'data_gov_project'   # Name of database for connection
}
START_DATE = datetime(2023, 10, 1)       # Start of the collection period (00:00:00)
END_DATE = datetime(2024, 9, 30, 23, 59) # End of the collection period (23:59:00)


def fetch_air_temperature_data(date):
    """Fetch air temperature data from the API for a given date."""
    date_time_str = date.strftime("%Y-%m-%dT%H:%M:%S") # Formatting the date
    params = {"date_time": date_time_str} # Setting parameters for the API call
    
    try: # API call with error handling
        response = requests.get(API_URL, params=params, timeout=30) # Timeout prevents a stalled request from blocking a worker thread
        response.raise_for_status()
        
        # Handling the Response
        json_data = response.json()
        items = json_data.get("items", [])
        
        if not items: 
            logging.warning(f"No data returned for {date_time_str}.") # Check for data presence
            return None
        
        return process_items(items) # Processing data
    
    except requests.RequestException as e: # Exception Handling for Request Failures
        logging.error(f"Failed to fetch data for {date_time_str}. Error: {e}")
        return None

    
def process_items(items):
    """Process the fetched items and return a DataFrame."""
    data = [] # Data list initialization
    for item in items: # Iterating over items
        readings = item.get('readings', []) # Extracting readings and timestamp
        timestamp = item.get('timestamp')
        for sensor in readings: # Iterating over sensor readings
            data.append({ # Appending processed data to list
                'station_id': sensor['station_id'],
                'temperature': sensor['value'],
                'airtemp_date': timestamp
            })
    return pd.DataFrame(data) # Creating and returning a dataframe


def load_data_to_postgres(data_frame, engine):
    """Load the provided pandas DataFrame into the 'air_temp' table."""
    try:
        data_frame.to_sql('air_temp', engine, if_exists='append', index=False) # Loading data into PostgreSQL
        logging.info(f"Successfully loaded {len(data_frame)} records to PostgreSQL table.") # Logging successful load
    except Exception as e: # Exception handling
        logging.error(f"Error loading data into PostgreSQL: {e}")

        
def verify_data_in_db(engine):
    """Retrieves the number of rows from 'air_temp' table."""
    try:
        # Connecting to the database
        with engine.connect() as connection:
            result = connection.execute(text("SELECT COUNT(*) FROM air_temp")) # Executing the SQL Query
            count = result.fetchone()[0] # Fetching the row count
            logging.info(f"Total records in 'air_temp' table: {count}") # Logging result
    except Exception as e: # Exception handling
        logging.error(f"Error verifying data in PostgreSQL: {e}")

        
def main():
    logging.info("Starting the script...") # Logging start of the script

    # Create database engine
    engine = create_engine(f'postgresql://{DB_CONFIG["user"]}:{DB_CONFIG["password"]}@{DB_CONFIG["host"]}:{DB_CONFIG["port"]}/{DB_CONFIG["database"]}')
    
    # Connecting to database
    try:
        with engine.connect(): # Open a test connection and release it immediately
            logging.info("Database connection successful")
    except Exception as e:
        logging.error(f"Error connecting to PostgreSQL: {e}")
        return

    date_range = pd.date_range(start=START_DATE, end=END_DATE, freq='H') # Generating data range
    
    # Fetching data concurrently
    with ThreadPoolExecutor(max_workers=10) as executor:
        future_to_date = {executor.submit(fetch_air_temperature_data, date): date for date in date_range}
        data_frames = []
        
        for future in as_completed(future_to_date): # Processing results as complete
            df = future.result()
            if df is not None:
                data_frames.append(df)

    if data_frames: # Combining and processing dataframes
        combined_df = pd.concat(data_frames, ignore_index=True)
        combined_df['airtemp_date'] = pd.to_datetime(combined_df['airtemp_date'])
        combined_df = combined_df.drop_duplicates()
        
        load_data_to_postgres(combined_df, engine)
        verify_data_in_db(engine)
    else:
        logging.warning("No data collected.")

    logging.info("Script completed.") # Completion log

if __name__ == "__main__": # Running main function
    main()

### **Air Temperature Output Result**
<img src="https://raw.githubusercontent.com/YvonneLipLim/Images/main/Air_Temperature_Output.png" alt="Air temperature output result" width="800">
