# **Generation SG Junior Data Engineer Programme**
### **Interim Project presented by DPPS Team (5)**<br><span style="color:darkblue; font-weight:bold;">Members: Daniel | Pin Pin, Yvonne | Pin Yean, Erica | Shawn</span>


### <span style="color:darkblue; font-weight:bold;">API Data Extraction - Relative Humidity across Singapore</span>
<div>This document demonstrates how the team made API requests in Python, using common libraries, to extract data from the open API of Data.Gov.SG as the single source of truth for our project. Working with the API offers the following benefits:</div>

- **Real-time data access**: retrieve up-to-date data on demand
- **Large datasets**: integrate data from multiple sources
- **Pre-processed data**: save significant time and resources


### **Understanding API Documentation**
<div>When working with APIs, consulting the documentation is crucial for making successful requests. It includes:</div>

- available endpoints
- required parameters
- authentication methods
- expected response formats
- license of the Open Data (free or paid, personal or commercial, etc.)


### **Evaluating The Options Available**
<div>We accessed several APIs on Data.Gov.SG and noticed that they are <em>REST APIs</em> (Representational State Transfer Application Programming Interfaces). A REST API is a web service that follows the <em>REST</em> architectural style: applications communicate by exchanging data through standardized methods, typically HTTP requests that access and manipulate resources on a server.</div><br>


**API request and response models:**
- **Response body**: JSON (JavaScript Object Notation) format
- **200**: Everything went okay and the result has been returned (if any)
- **400**: The server rejected the request as invalid, for example because we sent incorrect data or made another client-side error
- **404**: The resource we tried to access was not found on the server
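
The status codes listed above can be turned into a small decision helper. The following is a minimal sketch using only the standard library; the function name `describe_status` and the hint messages are our own illustrative choices, not part of the Data.Gov.SG API.

```python
from http import HTTPStatus

def describe_status(status_code):
    """Map an HTTP status code to a short hint on what to do next (hypothetical helper)."""
    if status_code == HTTPStatus.OK:
        return "success: parse the JSON body"
    if status_code == HTTPStatus.BAD_REQUEST:
        return "bad request: check the query parameters"
    if status_code == HTTPStatus.NOT_FOUND:
        return "not found: check the endpoint URL"
    return f"unhandled status {status_code}"

# Print the hint for each code discussed above, plus one that is not handled
for code in (200, 400, 404, 500):
    print(code, describe_status(code))
```

In practice the integer passed in would be `response.status_code` from a `requests` response.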

### **Making GET Request To The API**
<div>We executed a <b>GET request</b> to the Singapore government’s open data API endpoint for real-time relative humidity data. This request retrieved a JSON response containing comprehensive information about relative humidity readings across various locations in Singapore. The data obtained allows us to analyze:</div>

1. The number and distribution of humidity monitoring stations
2. Real-time relative humidity levels at different locations
3. Temporal patterns in humidity measurements
4. Potential correlations between humidity and geographical features

The <b>GET</b> method efficiently fetches information without altering server-side data. By leveraging this API, we gain valuable insights into Singapore’s atmospheric conditions, supporting various applications in urban planning, environmental monitoring, and public health initiatives.

**[Link](https://api-open.data.gov.sg/v2/real-time/api/relative-humidity)**
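
Because a GET request carries its inputs in the URL, it can help to see how the final request URL is formed. This is a minimal sketch using only the standard library; `build_request_url` is a hypothetical helper, and the `date` query parameter is the one our extraction script sends later in this document.

```python
from urllib.parse import urlencode

BASE_URL = "https://api-open.data.gov.sg/v2/real-time/api/relative-humidity"

def build_request_url(date_str=None):
    """Return the GET URL, optionally filtered by a YYYY-MM-DD date (hypothetical helper)."""
    if date_str is None:
        return BASE_URL  # No parameters: the plain endpoint
    return f"{BASE_URL}?{urlencode({'date': date_str})}"

print(build_request_url())
print(build_request_url("2024-09-01"))
```

The `requests` library builds the same kind of URL automatically when a `params` dictionary is passed to `requests.get`.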

<span style="color:darkgreen; font-weight:bold;">JSON code</span>

In [None]:
{
  "code": 0,
  "data": {
    "stations": [
      {
        "id": "S109",
        "deviceId": "S109",
        "name": "Ang Mo Kio Avenue 5",
        "location": {
          "latitude": 1.3764,
          "longitude": 103.8492
        }
      },
      {
        "id": "S106",
        "deviceId": "S106",
        "name": "Pulau Ubin",
        "location": {
          "latitude": 1.4168,
          "longitude": 103.9673
        }
      },
      {
        "id": "S117",
        "deviceId": "S117",
        "name": "Banyan Road",
        "location": {
          "latitude": 1.256,
          "longitude": 103.679
        }
      },
      {
        "id": "S107",
        "deviceId": "S107",
        "name": "East Coast Parkway",
        "location": {
          "latitude": 1.3135,
          "longitude": 103.9625
        }
      },
      {
        "id": "S104",
        "deviceId": "S104",
        "name": "Woodlands Avenue 9",
        "location": {
          "latitude": 1.44387,
          "longitude": 103.78538
        }
      },
      {
        "id": "S115",
        "deviceId": "S115",
        "name": "Tuas South Avenue 3",
        "location": {
          "latitude": 1.29377,
          "longitude": 103.61843
        }
      },
      {
        "id": "S116",
        "deviceId": "S116",
        "name": "West Coast Highway",
        "location": {
          "latitude": 1.281,
          "longitude": 103.754
        }
      },
      {
        "id": "S60",
        "deviceId": "S60",
        "name": "Sentosa",
        "location": {
          "latitude": 1.25,
          "longitude": 103.8279
        }
      },
      {
        "id": "S50",
        "deviceId": "S50",
        "name": "Clementi Road",
        "location": {
          "latitude": 1.3337,
          "longitude": 103.7768
        }
      },
      {
        "id": "S44",
        "deviceId": "S44",
        "name": "Nanyang Avenue",
        "location": {
          "latitude": 1.34583,
          "longitude": 103.68166
        }
      },
      {
        "id": "S43",
        "deviceId": "S43",
        "name": "Kim Chuan Road",
        "location": {
          "latitude": 1.3399,
          "longitude": 103.8878
        }
      },
      {
        "id": "S24",
        "deviceId": "S24",
        "name": "Upper Changi Road North",
        "location": {
          "latitude": 1.3678,
          "longitude": 103.9826
        }
      },
      {
        "id": "S06",
        "deviceId": "S06",
        "name": "Paya Lebar",
        "location": {
          "latitude": 1.3524,
          "longitude": 103.9007
        }
      },
      {
        "id": "S111",
        "deviceId": "S111",
        "name": "Scotts Road",
        "location": {
          "latitude": 1.31055,
          "longitude": 103.8365
        }
      },
      {
        "id": "S121",
        "deviceId": "S121",
        "name": "Old Choa Chu Kang Road",
        "location": {
          "latitude": 1.37288,
          "longitude": 103.72244
        }
      }
    ],
    "readings": [
      {
        "timestamp": "2024-11-28T14:08:00+08:00",
        "data": [
          {
            "stationId": "S109",
            "value": 63.6
          },
          {
            "stationId": "S106",
            "value": 84.6
          },
          {
            "stationId": "S117",
            "value": 94
          },
          {
            "stationId": "S107",
            "value": 77.1
          },
          {
            "stationId": "S104",
            "value": 79.1
          },
          {
            "stationId": "S115",
            "value": 89.3
          },
          {
            "stationId": "S116",
            "value": 88.3
          },
          {
            "stationId": "S60",
            "value": 83.3
          },
          {
            "stationId": "S50",
            "value": 77.7
          },
          {
            "stationId": "S44",
            "value": 88.4
          },
          {
            "stationId": "S43",
            "value": 69
          },
          {
            "stationId": "S24",
            "value": 74.8
          },
          {
            "stationId": "S06",
            "value": 67
          },
          {
            "stationId": "S111",
            "value": 89.8
          },
          {
            "stationId": "S121",
            "value": 84.5
          }
        ]
      }
    ],
    "readingType": "RH 1M F",
    "readingUnit": "percentage"
  },
  "errorMsg": ""
}
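
The response keeps station details and readings in separate lists that are linked by the station id. As a minimal sketch of how the two can be combined, the example below parses an abridged copy of the sample response (two stations only) and attaches each station name to its reading. The function name `flatten_readings` and the output keys are illustrative assumptions.

```python
import json

# Two stations and their readings, abridged from the sample response above
SAMPLE_RESPONSE = """
{
  "code": 0,
  "data": {
    "stations": [
      {"id": "S109", "name": "Ang Mo Kio Avenue 5"},
      {"id": "S106", "name": "Pulau Ubin"}
    ],
    "readings": [
      {
        "timestamp": "2024-11-28T14:08:00+08:00",
        "data": [
          {"stationId": "S109", "value": 63.6},
          {"stationId": "S106", "value": 84.6}
        ]
      }
    ],
    "readingUnit": "percentage"
  },
  "errorMsg": ""
}
"""

def flatten_readings(payload):
    """Return one row per station reading, with the station name attached."""
    data = payload["data"]
    # Lookup table from station id to station name
    names = {station["id"]: station["name"] for station in data["stations"]}
    return [
        {
            "station_id": item["stationId"],
            "station_name": names.get(item["stationId"], "unknown"),
            "timestamp": reading["timestamp"],
            "humidity": item["value"],
        }
        for reading in data["readings"]
        for item in reading["data"]
    ]

rows = flatten_readings(json.loads(SAMPLE_RESPONSE))
for row in rows:
    print(row)
```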

### **Python Libraries Used**
<div>We selected these libraries because together they provide a powerful toolkit for data processing, web interaction, database operations and parallel execution in Python.</div>

- **Requests**: used for making HTTP requests in Python
- **Pandas**: used for data manipulation for effectively handling structured data
- **Datetime and Timedelta**: used for working with dates, times and time intervals
- **SQLAlchemy**: a SQL toolkit and Object-Relational Mapping (ORM) to connect to relational databases
- **Logging**: provides a flexible framework for generating log messages
- **Concurrent.futures**: used for parallelizing tasks 

In [None]:
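# Install the following packages in your Anaconda Prompt or Terminal:
conda install requests
conda install pandas
conda install sqlalchemy
conda install psycopg2  # SQLAlchemy's default PostgreSQL driver, needed for postgresql:// connection strings

# datetime, logging and concurrent.futures are part of the Python standard library, so no installation is needed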

### **Accessing Data**
<div>The data extraction has been optimized: the run time dropped from 30 minutes to 2 minutes, an improvement of <b>approximately 93.33%</b>. The script supports <b>collecting</b>, <b>processing</b> and <b>storing</b> humidity data from the API through:</div>

1. **Comprehensive data collection**: Fetches relative humidity readings for every day in the date range, giving a rich dataset for analysis
2. **Efficient concurrent processing**: Uses ThreadPoolExecutor for parallel data fetching, reducing overall execution time
3. **Robust error handling**: Implements try-except blocks to gracefully handle and log potential errors, which helps in debugging
4. **Data Integrity**: Removes duplicate entries before insertion
5. **Modular Design**: Enhances readability and maintainability with distinct functions
6. **Database Integration**: Seamlessly stores processed data in a PostgreSQL database
7. **Data Verification**: Verifies the number of records in the database after insertion as a quick sanity check
8. **Flexible Date Range**: Accepts start and end dates for the data collection period
9. **Pandas Integration**: Leverages pandas for efficient data manipulation and transformation
10. **API Error Resilience**: Handles potential API failures gracefully, logging errors without crashing the entire process
11. **Scalability**: Allows for easy scaling to handle large date ranges or more frequent data points
12. **Reusability**: Easy to adapt for similar data collection tasks from different APIs
13. **Automated Workflow**: Can run autonomously to collect and store data regularly
14. **Data Consistency**: Processes API responses into a standardized structure before database insertion
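
The improvement figure quoted above follows from the two run times. As a quick check, the sketch below computes the percentage reduction; the function name `percent_reduction` is our own.

```python
def percent_reduction(before, after):
    """Percentage by which `after` is smaller than `before`."""
    return (before - after) / before * 100

# Run time went from 30 minutes to 2 minutes
print(f"{percent_reduction(30, 2):.2f}%")  # prints 93.33%
```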

In [None]:
import requests
import pandas as pd
from datetime import datetime, timedelta
from sqlalchemy import create_engine
import logging
from concurrent.futures import ThreadPoolExecutor, as_completed

# Logging configuration
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

# Constants
URL = "https://api-open.data.gov.sg/v2/real-time/api/relative-humidity"  # API endpoint
START_DATE = datetime(2024, 9, 1)  # First day of the collection period (inclusive)
END_DATE = datetime(2024, 9, 30)  # Last day of the collection period (inclusive)
DB_CONNECTION_STRING = 'postgresql://postgres:admin@localhost:5432/data_gov_project'  # PostgreSQL connection string (replace the credentials with your own)

def fetch_humidity_data(current_datetime):
    """Fetch humidity data from the API for a given date."""
    date_str = current_datetime.strftime("%Y-%m-%d")  # Format the date as "YYYY-MM-DD"
    params = {"date": date_str}  # Query parameters sent with the API request

    try:  # Handle exceptions that may occur during the API request
        response = requests.get(URL, params=params, timeout=10)
        response.raise_for_status()  # Raise an exception for HTTP error statuses (4xx/5xx)
        json_data = response.json()  # Parse the JSON response

        # A "code" of 0 means success; also make sure the response contains a "data" section
        if json_data.get("code") == 0 and "data" in json_data:
            # Data extraction
            readings = json_data["data"].get("readings", [])
            return [
                {
                    'stationId': data.get("stationId"),
                    'humidity': data.get("value"),
                    'timestamp': reading["timestamp"]
                }
                for reading in readings
                for data in reading["data"]
            ]
        # Logging warnings
        else:
            logging.warning(f"No humidity readings returned for {date_str}. Code: {json_data.get('code')}")
            return []
    # Exception handling
    except requests.RequestException as e:
        logging.error(f"Failed to fetch data for {date_str}: {str(e)}")
        return []

def process_data(start_date, end_date):
    """Fetch and process humidity data for a range of start and end dates"""
    date_range = [start_date + timedelta(days=i) for i in range((end_date - start_date).days + 1)]  # Every date from start_date to end_date, inclusive
    humidity_data = []  # Accumulates the readings from all dates

    # Fetch the dates concurrently with a thread pool of at most 10 worker threads
    with ThreadPoolExecutor(max_workers=10) as executor:
        future_to_date = {executor.submit(fetch_humidity_data, date): date for date in date_range} # Submitting tasks
        # Collecting results
        for future in as_completed(future_to_date):
            humidity_data.extend(future.result())
    # Returning final data
    return humidity_data

def main():
    """The process of fetching, processing, logging and storing data"""
    humidity_data = process_data(START_DATE, END_DATE) # Fetching humidity data
    
    if humidity_data: # Checking data presence
        df = pd.DataFrame(humidity_data) # Creating a dataframe
        df['timestamp'] = pd.to_datetime(df['timestamp']) # Timestamp conversion
        df = df.rename(columns={'stationId': 'station_id', 'timestamp': 'humidity_date', 'humidity': 'humidity_readings'}) # Renaming columns
        
        # Logging data information
        logging.info(f"Data shape: {df.shape}")
        logging.info(f"\n{df.head()}")
        df.info()  # df.info() prints its own summary and returns None, so it is called directly instead of being logged
        
        # Loading data into PostgreSQL
        try:
            engine = create_engine(DB_CONNECTION_STRING)
            df.to_sql('humidity', engine, if_exists='append', index=False)
            logging.info("Data successfully loaded into PostgreSQL.")
        # Error handling
        except Exception as e:
            logging.error(f"Failed to load data into PostgreSQL: {str(e)}")
    else:
        logging.warning("No humidity data collected.")

if __name__ == "__main__": #  Running main function
    main()
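
Items 4 and 7 in the list of benefits (removing duplicates before insertion and verifying the record count afterwards) could be sketched as follows. This is a minimal illustration under stated assumptions, not the project's implementation: it uses the standard-library `sqlite3` module with an in-memory database as a stand-in for PostgreSQL so that it runs without a server, and the helper names `deduplicate` and `load_and_verify` are hypothetical. With pandas, the deduplication step would typically be a call to `DataFrame.drop_duplicates`.

```python
import sqlite3

# Illustrative rows (station_id, humidity_date, humidity_readings); the second row repeats the first
records = [
    ("S109", "2024-11-28T14:08:00+08:00", 63.6),
    ("S109", "2024-11-28T14:08:00+08:00", 63.6),
    ("S106", "2024-11-28T14:08:00+08:00", 84.6),
]

def deduplicate(rows):
    """Drop repeated (station_id, humidity_date) pairs, keeping the first occurrence."""
    seen = set()
    unique = []
    for row in rows:
        key = (row[0], row[1])
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

def load_and_verify(rows):
    """Insert rows into an in-memory table and return the stored record count."""
    conn = sqlite3.connect(":memory:")  # Stand-in for the PostgreSQL database
    conn.execute(
        "CREATE TABLE humidity (station_id TEXT, humidity_date TEXT, humidity_readings REAL)"
    )
    conn.executemany("INSERT INTO humidity VALUES (?, ?, ?)", rows)
    conn.commit()
    (count,) = conn.execute("SELECT COUNT(*) FROM humidity").fetchone()
    conn.close()
    return count

unique_rows = deduplicate(records)
stored = load_and_verify(unique_rows)
print(f"{len(records)} fetched, {len(unique_rows)} unique, {stored} stored")
```

Comparing the stored count with the number of unique rows gives the quick sanity check described in item 7.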

### **Humidity Output Result**
<img src="https://raw.githubusercontent.com/YvonneLipLim/Images/main/Humidity_Output.png" alt="Humidity output result" width="800">
