# **Generation SG Junior Data Engineer Programme**
### **Interim Project presented by DPPS Team (5)**<br><span style="color:darkblue; font-weight:bold;">Members: Daniel | Pin Pin, Yvonne | Pin Yean, Erica | Shawn</span>


### <span style="color:darkblue; font-weight:bold;">Singapore Resale Flat Prices: Comprehensive Data Ingestion Methodology</span>
<div>This documentation outlines our systematic approach to ingesting and processing the Singapore Housing & Development Board (HDB) resale flat transaction dataset. Our robust data pipeline transforms raw CSV data into a structured PostgreSQL database, enabling comprehensive real estate market analysis. Our data ingestion workflow:</div>

- **Database Schema Preparation**: developed a precise PostgreSQL schema to capture the nuanced details of HDB resale transactions
- **Data Source Acquisition**: sourced the HDB resale flat transaction dataset from data.gov.sg, covering resale transactions registered from January 2017 onwards
- **Data Transformation and Loading**: the Python-based data ingestion script implements a sophisticated ETL (Extract, Transform, Load) process

The resulting database provides a robust foundation for advanced real estate market analysis, enabling precise insights into Singapore's dynamic resale flat ecosystem.

**[Dataset source: Resale flat prices based on registration date from Jan 2017 onwards (data.gov.sg)](https://data.gov.sg/datasets/d_8b84c4ee58e3cfc0ece0d773c8ca6abc/view)**


### **API vs. CSV Data Ingestion: Strategic Decision for Singapore Resale Flat Transactions**
While developing our data pipeline for Singapore real estate market transactions, we conducted a comprehensive comparative analysis between API and CSV ingestion methodologies. Our objective was to optimize data retrieval efficiency for the daily-updated HDB resale flat transaction dataset.

**Key Challenges with API-Based Ingestion:**
1. **Limited Temporal Filtering Capabilities**: the API presented significant constraints in temporal data segmentation. Despite successfully implementing pagination strategies for JSON file extraction, we encountered critical limitations:
    - Inability to specify precise start and end dates
    - Restricted control over transaction and price data time periods
    - Reduced granularity in data retrieval
2. **Performance Bottlenecks**: performance metrics revealed stark disparities between API and CSV ingestion approaches:
    - API Ingestion: 15 minutes processing time
    - CSV Ingestion: 10 seconds processing time
    - Performance Acceleration: roughly 90 times faster, i.e. an **8,900%** increase in speed, or about 98.9% less processing time
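
The three ways of quoting this comparison (speed-up factor, percentage speed increase, percentage time reduction) are easy to mix up. The short calculation below derives each from the two timings listed above; the variable names are ours and purely illustrative.

```python
# Timings quoted in the list above
api_seconds = 15 * 60   # API ingestion: 15 minutes
csv_seconds = 10        # CSV ingestion: 10 seconds

speedup_factor = api_seconds / csv_seconds                   # how many times faster
speed_increase_pct = (speedup_factor - 1) * 100              # increase in speed
time_reduction_pct = (1 - csv_seconds / api_seconds) * 100   # share of time saved

print(f"Speed-up factor: {speedup_factor:.0f}x")
print(f"Speed increase:  {speed_increase_pct:,.0f}%")
print(f"Time reduction:  {time_reduction_pct:.1f}%")
```

Note that a time reduction can never exceed 100%, so the 8,900% figure only makes sense as an increase in speed.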

**Strategic Decision on CSV-Based Data Ingestion**: given these technical constraints, we strategically pivoted to CSV-based ingestion, enabling:
- Comprehensive data coverage from January 2017 to Present (November 2024)
- Full dataset loading in about 10 seconds
- Enhanced data retrieval flexibility
- Minimal computational overhead

**Technical Rationale**
- Maximized data acquisition efficiency
- Simplified data pipeline architecture
- Reduced computational resource consumption
- Improved overall system responsiveness


### **Python Libraries: Our Comprehensive Data Engineering Toolkit**
Our meticulously curated Python library selection represents a strategic approach to building a robust, scalable data processing ecosystem. Each library was deliberately chosen to address specific technical challenges in our data engineering workflow.

- **Pandas**: used to read, clean and filter structured (tabular) data
- **SQLAlchemy**: a SQL toolkit and Object-Relational Mapper (ORM) used to connect to relational databases
- **psycopg2**: PostgreSQL database adapter

**Strategic Library Synergies**
- Comprehensive data processing capabilities
- Seamless database interaction
- High-performance computational infrastructure
- Scalable and flexible data engineering architecture

By integrating these libraries, we've created a powerful, flexible toolkit capable of handling complex data engineering challenges with exceptional efficiency and precision.
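
As a rough illustration of how these libraries fit together, the sketch below writes a small DataFrame to a database through a SQLAlchemy engine and counts the rows back. It is not the project's ingestion script. To keep it runnable without a database server, it assumes an in-memory SQLite database in place of PostgreSQL, and the table name `resale_flat_txn_demo` and the sample rows are made up for the example.

```python
import pandas as pd
from sqlalchemy import create_engine, text

# Assumption: in-memory SQLite stands in for PostgreSQL so the sketch runs anywhere.
# With PostgreSQL the URL would have the form
# 'postgresql://user:password@host:port/dbname', with psycopg2 as the driver.
engine = create_engine("sqlite:///:memory:")

# Made-up sample rows
sample = pd.DataFrame({
    "town_name": ["ANG MO KIO", "BEDOK"],
    "resale_price": [288000.0, 378000.0],
})

with engine.begin() as conn:
    # pandas writes the DataFrame through the SQLAlchemy connection
    sample.to_sql("resale_flat_txn_demo", conn, index=False, if_exists="replace")
    # Read the row count back with a plain SQL statement
    row_count = conn.execute(
        text("SELECT COUNT(*) FROM resale_flat_txn_demo")
    ).scalar()

print(row_count)
```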

In [None]:
# Install the following packages in your Anaconda Prompt or Terminal:
conda install requests
conda install pandas
conda install sqlalchemy
conda install psycopg2

### **Hard Coding vs Function-based Code**
We started with "hard coding", where the script was one long sequence of statements. Once we realised that we would be reusing this ingestion script several times, we moved to function-based code.
Doing so gave us the following benefits:

1. **Reusable Logic**: The function can be used in multiple places with different inputs.
2. **Modular and Scalable**: Encourages separation of concerns, making the code easier to understand and modify.
3. **Flexible**: Parameters allow different values without altering the source code.
4. **Easier to Test**: Isolating logic in functions simplifies debugging and unit testing.
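
A small illustration of the difference, using the connection settings as the example. The helper `build_db_url` and the database name `data_gov_project_test` are hypothetical and not part of our script; the other values mirror the script's default constants.

```python
# Hard-coded: every value is fixed inside the string
HARD_CODED_URL = "postgresql://postgres:password@localhost:5432/data_gov_project"

def build_db_url(user, password, host="localhost", port="5432", name="data_gov_project"):
    """Build a PostgreSQL connection URL from its parts (hypothetical helper)."""
    return f"postgresql://{user}:{password}@{host}:{port}/{name}"

# Function-based: the same logic serves different inputs without editing the source
dev_url = build_db_url("postgres", "password")
test_url = build_db_url("postgres", "password", name="data_gov_project_test")

print(dev_url)
print(test_url)
```

Because the logic sits in one function, it can also be checked in isolation, for example by asserting that the default arguments reproduce the hard-coded URL.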

### **Function Sequence**
Here is a description of what each part of the script does and how they flow together:

1. Begin by defining the constants used in the script: CSV_FILE_PATH, DB_USER, DB_PASS, DB_HOST, DB_PORT and DB_NAME, plus START_DATE and END_DATE for the date filter
2. load_data_from_csv(file_path): reads the CSV into a pandas DataFrame with .read_csv(), converts and renames the columns, and filters the rows by date range
3. load_data_to_postgres(data_frame): connects to the PostgreSQL database and bulk-inserts the rows using psycopg2
4. main(): calls the two functions above in sequence and prints progress messages for the user
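
The flow can be tried end to end without a database. The sketch below follows the same sequence on a three-row, made-up CSV held in memory; `load_data_to_target` is a hypothetical stand-in for load_data_to_postgres that appends rows to a Python list instead of inserting them into PostgreSQL, and the column set is trimmed for brevity.

```python
import io
import pandas as pd

# Made-up sample data in place of the real CSV file
SAMPLE_CSV = io.StringIO(
    "month,town,resale_price\n"
    "2023-11,BEDOK,400000\n"
    "2023-12,BEDOK,450000\n"
    "2024-06,YISHUN,500000\n"
)
START_DATE = pd.Timestamp("2023-12-01")
END_DATE = pd.Timestamp("2024-11-30")

def load_data_from_csv(source):
    """Read the CSV, derive resale_date from 'month', and filter by date range."""
    df = pd.read_csv(source)
    df["resale_date"] = pd.to_datetime(df["month"], format="%Y-%m", errors="coerce")
    return df[(df["resale_date"] >= START_DATE) & (df["resale_date"] <= END_DATE)]

def load_data_to_target(data_frame, target):
    """Stand-in for the database load: append each row as a tuple to a list."""
    target.extend(data_frame.itertuples(index=False, name=None))
    return len(data_frame)

def main(source, target):
    """Run the two steps in order and return the number of rows loaded."""
    filtered = load_data_from_csv(source)
    return load_data_to_target(filtered, target)

loaded_rows = []
loaded_count = main(SAMPLE_CSV, loaded_rows)
print(f"Loaded {loaded_count} rows")
```

Only the rows for December 2023 and June 2024 fall inside the date range, so two rows are loaded.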

### **Optimized Data Ingestion: Resale Flat Transaction CSV Ingestion**
Our CSV-based ingestion script cuts data retrieval time by about 98.9%, turning a 15-minute process into a 10-second operation (roughly 90 times faster, i.e. an **8,900%** increase in speed). The script covers the collection, processing, and storage of resale flat transaction data.

**Technical Architecture: Key Performance Capabilities**
1. **Advanced Data Collection Strategy**: systematic CSV-based data retrieval for comprehensive resale flat transaction records
2. **Efficient Data Processing**: leverages pandas for data manipulation and transformation
3. **Intelligent Error Management**: robust try-except error handling mechanism
4. **Data Quality Assurance**: proactive data cleaning and preprocessing techniques
5. **Architectural Design Principles**: modular function-based architecture, enhances code readability and maintainability
6. **Database Integration Capabilities**: seamless PostgreSQL data persistence, utilizes SQLAlchemy for advanced database connectivity
7. **Flexible Data Acquisition Framework**: supports comprehensive historical data retrieval, enables flexible data loading from CSV sources
8. **Advanced Data Processing**: standardizes raw data into consistent database schema, prepares data for analytical processing
9. **Enterprise-Grade Scalability**: handles large
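
To make the data cleaning point concrete, the sketch below shows how `pd.to_numeric(..., errors='coerce')` (the call used in our script) turns unparseable values into NaN, and how the affected rows can then be counted or dropped. The sample values are made up, and dropping incomplete rows is shown only as one possible follow-up step; the script itself does not drop them.

```python
import pandas as pd

# Made-up raw values, including two that cannot be parsed as numbers
raw = pd.DataFrame({
    "floor_area_sqm": ["67", "n/a", "93.5"],
    "resale_price": ["288000", "378000", ""],
})

clean = raw.copy()
for column in ["floor_area_sqm", "resale_price"]:
    # Unparseable entries become NaN instead of raising an error
    clean[column] = pd.to_numeric(clean[column], errors="coerce")

missing_per_column = clean.isna().sum()   # NaN count per column
complete_rows = clean.dropna()            # optional: keep fully populated rows only

print(missing_per_column)
print(f"Complete rows: {len(complete_rows)} of {len(clean)}")
```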

**Library Ecosystem**
- **Pandas**: Core data manipulation and transformation
- **SQLAlchemy**: Advanced database connectivity and ORM
- **Psycopg2**: Low-level PostgreSQL database interactions
- **Psycopg2.extras**: Efficient bulk data insertion capabilities

Together, these pieces give us a robust, repeatable solution for loading and managing resale flat transaction data.

In [1]:
import pandas as pd # For data manipulation
from sqlalchemy import create_engine # For database connectivity
import psycopg2 # For PostgreSQL connectivity
from psycopg2 import sql # For SQL queries
from psycopg2.extras import execute_values # For bulk insert

# Define constants
CSV_FILE_PATH = '/Users/shawnwee/teams notes_Generation SCTP JDE 05/Week 5 Interim Project/ResaleflatpricesbasedonregistrationdatefromJan2017onwards.csv' # Update your file path
DB_USER = 'postgres'             # Update with your PostgreSQL username
DB_PASS = 'password'             # Update with your PostgreSQL password
DB_HOST = 'localhost'            # Update with your database host
DB_PORT = '5432'                 # Update with your database port
DB_NAME = 'data_gov_project'     # Update with your PostgreSQL database name

# Define date range for filtering
START_DATE = pd.Timestamp('2023-12-01')
END_DATE = pd.Timestamp('2024-11-30')

def load_data_from_csv(file_path):
    """Load data from a CSV file and filter by date range."""
    df = pd.read_csv(file_path)

    # Convert 'month' to a datetime representing resale_date
    df['resale_date'] = pd.to_datetime(df['month'], format='%Y-%m', errors='coerce')

    # Convert numeric columns to appropriate types
    df['floor_area_sqm'] = pd.to_numeric(df['floor_area_sqm'], errors='coerce')
    df['resale_price'] = pd.to_numeric(df['resale_price'], errors='coerce')
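    # 'lease_commence_date' holds a four-digit year; pd.to_datetime would read bare
    # integers as nanoseconds since 1970, so parse the column as a number instead
    df['lease_commence_year'] = pd.to_numeric(df['lease_commence_date'], errors='coerce')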

    # Prepare DataFrame and rename columns for PostgreSQL compatibility
    processed_df = df.rename(columns={
        'town': 'town_name',
        'block': 'block_no',
        'storey_range': 'storey_range'  # This remains unchanged
    })

    # Select relevant columns for PostgreSQL, excluding 'resale_id'
    processed_df = processed_df[['resale_date', 'town_name', 'flat_type', 'block_no', 
                                  'street_name', 'storey_range', 'floor_area_sqm', 
                                  'flat_model', 'lease_commence_year', 'remaining_lease', 
                                  'resale_price']]
    
    # Filter for valid date ranges
    filtered_df = processed_df[(processed_df['resale_date'] >= START_DATE) & 
                                (processed_df['resale_date'] <= END_DATE)]

    print(f"Filtered records count: {len(filtered_df)}")
    print("Unique resale prices:", filtered_df['resale_price'].unique())

    return filtered_df

def load_data_to_postgres(data_frame):
    """Load the provided DataFrame into the PostgreSQL database."""
    # Create database engine for SQLAlchemy usage
    engine = create_engine(f'postgresql://{DB_USER}:{DB_PASS}@{DB_HOST}:{DB_PORT}/{DB_NAME}')

    # Prepare for bulk insert using psycopg2
    conn = psycopg2.connect(host=DB_HOST, port=DB_PORT, database=DB_NAME, user=DB_USER, password=DB_PASS)
    cur = conn.cursor()

    # Prepare the insert query
    insert_query = sql.SQL("""
        INSERT INTO resale_flat_txn (resale_date, town_name, flat_type, block_no, street_name, 
                                      storey_range, floor_area_sqm, flat_model, lease_commence_year, 
                                      remaining_lease, resale_price)
        VALUES %s
    """)

    # Prepare tuples for the insert query
    data_tuples = [tuple(x) for x in data_frame.values]

    try:
        # Insert in batches using execute_values
        execute_values(cur, insert_query, data_tuples, template=None, page_size=1000)
        conn.commit()
        print(f"Successfully loaded {len(data_tuples)} records to PostgreSQL.")
    except Exception as e:
        conn.rollback()  # Discard the partial insert so the failed transaction is not left open
        print(f"Error loading data into PostgreSQL: {e}")
    finally:
        cur.close()
        conn.close()

def main():
    """Main function to execute the script."""
    print("Starting the script...")
    
    # Load the data from CSV
    filtered_df = load_data_from_csv(CSV_FILE_PATH)
    
    # Load the filtered data into PostgreSQL
    load_data_to_postgres(filtered_df)

    print("Script completed.")

# Execute the script
if __name__ == "__main__": # Running main function
    main()

Starting the script...
Filtered records count: 27656
Unique resale prices: [288000. 265000. 378000. ... 610599. 752888. 855500.]
Successfully loaded 27656 records to PostgreSQL.
Script completed.
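
One detail the script leaves implicit is how missing values travel to the database: values coerced to NaN by pandas stay NaN in the tuples passed to execute_values. If NULL is preferred, the rows can be converted first. The helper below, `to_insert_tuples`, is a hypothetical addition and not part of our script; the sample rows are made up.

```python
import pandas as pd

def to_insert_tuples(data_frame):
    """Convert rows to plain tuples, replacing missing values (NaN/NaT) with None."""
    return [
        tuple(None if pd.isna(value) else value for value in row)
        for row in data_frame.itertuples(index=False, name=None)
    ]

# Made-up sample with one missing floor area
sample = pd.DataFrame({
    "town_name": ["BEDOK", "YISHUN"],
    "floor_area_sqm": [67.0, float("nan")],
    "resale_price": [288000.0, 378000.0],
})

rows = to_insert_tuples(sample)
print(rows)
```

psycopg2 adapts Python `None` to SQL NULL, so tuples built this way could be passed to execute_values in place of the ones built from `data_frame.values`.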


### **Resale Flat Transaction Output Result**
<img src="https://raw.githubusercontent.com/YvonneLipLim/Images/main/Resale_Flat_Output.png" alt="Resale flat transaction output result" width="800">
