<img src="https://gallery.mailchimp.com/f98d5ac0a3fbbdcdda35136ab/images/2002af76-5fd4-4185-9d49-28558b6b8772.png">

# `sg-hdb-resale-bokeh` 
# Part 1: Extract, Transform, Load

In this notebook, I will carry out the steps and construct the structures that allow the following:
+ preliminary exploration of the raw data, in terms of the state and format it comes in
+ storing of extracted data in class variables
+ loading extracted data into a database

These steps take us through the **ETL** process, which is facilitated by a set of functions/classes that make up a data pipeline. Although it is a simple one, beginners can learn from the whole process that we're about to go through.

<img src="https://i.ibb.co/wJQ4fK7/etl-workflow-image.png">

What is **ETL**?
+ **Extract:** This is the process of extracting data/information from the raw files. In our context here, the raw files have been provided to us in CSV form. In other enterprise use cases these raw files can come in other forms such as streamed JSON objects or transactional data from OLTP databases.
+ **Transform:** The process of converting the extracted data into a format that can be ingested by another database or a data lake.
+ **Load:** Once the extracted data has been reformatted, it is loaded into a database.
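
The three stages above can be sketched end to end on a toy example. This is only a minimal illustration, not the pipeline built in this notebook: the CSV content, the table name `resale_demo` and the in-memory SQLite database are all assumptions made up for the sketch.

```python
import io
import sqlite3

import pandas as pd

# Hypothetical raw CSV content standing in for a downloaded file
raw_csv = io.StringIO(
    "month,town,flat_model,resale_price\n"
    "2019-01,BEDOK,Model A,300000\n"
    "2019-01,BEDOK,MODEL A,310000\n"
)

# Extract: read the raw file into a dataframe
extracted = pd.read_csv(raw_csv)

# Transform: normalise a text column so that equivalent labels match
transformed = extracted.assign(flat_model=extracted["flat_model"].str.lower())

# Load: write the result into a database table (in-memory SQLite here)
conn = sqlite3.connect(":memory:")
transformed.to_sql("resale_demo", conn, index=False)

# Read back simple counts to confirm what was loaded
loaded_rows = conn.execute("SELECT COUNT(*) FROM resale_demo").fetchone()[0]
distinct_models = conn.execute(
    "SELECT COUNT(DISTINCT flat_model) FROM resale_demo"
).fetchone()[0]
print(loaded_rows, distinct_models)
conn.close()
```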

In [None]:
# Begin by importing the packages we need
import os
import pandas as pd

<font color="blue"><h1><center>Extract</center></h1></font>
Download data at:
+ https://data.gov.sg/dataset/resale-flat-prices
+ https://data.gov.sg/dataset/hdb-resale-price-index

__Resale Flat Prices:__ This dataset consists of transactions for HDB resale units.

This section is where we construct some functions to carry out the main objective of extracting the data from the CSV files containing information regarding HDB resale units. A list of the things that we'd like to have the functions carry out:
+ Checking the names (and their naming conventions) of the CSV files
+ Checking the number of header columns that exist within each file and see if they tally with each other
+ Checking the common names of headers that exist across all the files
+ Combine/concatenate all the data into one single object

In [None]:
# Listing down the list of files in the relevant directory
# Where prefixed with `!`, a shell runs the command
!ls ../data/raw/resale-flat-prices/

In [None]:
# For Windows, run this line instead
!dir ..\data\raw\resale-flat-prices\

Here are the things to do:
+ List the files ending with .csv
+ Check whether the number of columns is the same across files
+ Check whether the names of the columns are the same across files

In [None]:
def ListFileNames(file_loc):
    """
    Returns a list of files in specified location.
    """
    file_list = os.listdir(file_loc)
    file_list_csv = [i for i in file_list if i.endswith(".csv")]
    return file_list_csv

In [None]:
ListFileNames('../data/raw/resale-flat-prices/')

In [None]:
def ColsLen(file_loc):
    """
    Returns the number of columns and the column names of each imported CSV file.
    Also prints how many distinct column counts exist across the files.
    """
    # Initialise empty arrays
    col_lens = []
    col_names = []
    
    # Retrieving list of .csv files using ListFileNames
    file_list_csv = ListFileNames(file_loc)

    
    for i in file_list_csv:
        # Import each CSV file
        curr_df = pd.read_csv('{}{}'.format(file_loc, i))
        # Get no. of columns for each dataframe
        curr_col_len = len(curr_df.columns)
        # Get column names for each dataframe
        curr_col_names = curr_df.columns.values
        col_lens.append(curr_col_len)
        col_names.append(curr_col_names)
        #print(col_lens)
    set_len = len(set(col_lens))
    
    print('There are {} sets of column lengths.'.format(set_len))
    return col_lens, col_names

In [None]:
ColsLen('../data/raw/resale-flat-prices/')

In [None]:
def CommonColNames(file_loc):
    """
    Returns common column names across all datasets
    """
    
    column_lengths, column_names = ColsLen(file_loc)
    
    common_cols = list(set(column_names[0]).intersection(*column_names))
    return common_cols

In [None]:
CommonColNames('../data/raw/resale-flat-prices/')

The output above shows 10 column names common to all the datasets. Only one column name differs: `remaining_lease` exists in just one dataset and is absent from the others.
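
One way to pinpoint which columns are not shared is to subtract the common set from the union of all headers. The sketch below uses made-up header lists in place of the real output of `ColsLen`; the names `all_cols` and `differing_cols` are illustrative only.

```python
# Hypothetical header lists standing in for the column names of each file
column_names = [
    ["month", "town", "resale_price"],
    ["month", "town", "resale_price", "remaining_lease"],
]

# Every column name seen in at least one file
all_cols = set().union(*column_names)
# Column names present in every file
common_cols = set(column_names[0]).intersection(*column_names)
# Whatever is left appears in some files but not in others
differing_cols = sorted(all_cols - common_cols)
print(differing_cols)
```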

The function below combines the data from all the CSV files into one `pandas` dataframe using the `pandas` function `concat`. Where a dataset lacks a column that exists in another dataset, the column is retained and its missing values are filled with 0.
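
The behaviour of `concat` with mismatched columns can be seen on two toy dataframes. The frames `older` and `newer` below are invented for illustration and only mimic the situation where one file has `remaining_lease` and another does not.

```python
import pandas as pd

# Toy frames: only `newer` carries the `remaining_lease` column
older = pd.DataFrame({"town": ["BEDOK"], "resale_price": [300000.0]})
newer = pd.DataFrame(
    {"town": ["YISHUN"], "resale_price": [350000.0], "remaining_lease": [70]}
)

# concat keeps the union of columns; rows from `older` get NaN for the missing one
combined = pd.concat([older, newer], ignore_index=True)
missing_before = int(combined["remaining_lease"].isna().sum())

# fillna(0) then replaces those NaN values with 0
filled = combined.fillna(0)
print(filled)
```

Note that `fillna(0)` applies to every column, so any other missing value in the combined data would also become 0.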

In [None]:
def CombineDF(file_loc):
    """
    Imports all the CSV files and concatenates them into one dataframe.
    Missing values, including those in mismatched columns, are filled with 0s.
    """
    
    file_list_csv = ListFileNames(file_loc)
    
    dataset_files = []
    for i in file_list_csv:
        dataset_files.append('{}{}'.format(file_loc, i))
    print(dataset_files)
    frames = [ pd.read_csv(f) for f in list(dataset_files) ]
    # .fillna() is used below because the other CSV files do not have the `remaining_lease` column
    combi_result = pd.concat(frames, ignore_index=True).fillna(0)
    return combi_result

In [None]:
hdb_combi_df = CombineDF('../data/raw/resale-flat-prices/')

In [None]:
# Printing the first 5 observations of dataframe
hdb_combi_df.head()

In [None]:
hdb_combi_df

Now, we will import the data containing the quarterly HDB resale price index.

In [None]:
hdb_rpi = pd.read_csv('../data/raw/hdb-resale-price-index/housing-and-development-board-resale-price-index-1q2009-100-quarterly.csv')
hdb_rpi

The imported HDB resale price index dataset looks fine as it is, so we will leave it out of the transformation process that follows.

<font color="blue"><h1><center>Transform</center></h1></font>

The data that we have extracted from the CSV files is quite clean, so we could choose not to do any transformation prior to the loading process. Of course, in the real world, we hardly ever get such luck.
Further transformations for the purpose of feature engineering can be implemented during the [modelling phase](./1.0-ryzk-model.ipynb).

Even though the state of the dataset is good enough for us to ingest into a database, for the purpose of this exercise, let us create a function that transforms the values of a single variable.

Currently, as seen below, the variable `flat_model` contains many (35) different categories and some are mismatched.

In [None]:
# Display unique values for the variable
hdb_combi_df['flat_model'].unique()

In [None]:
# Display no. of categories for the variable
len(hdb_combi_df['flat_model'].unique())

We have many different categories, but some of them refer to the same category and are simply spelled differently because of how they were entered. For example, we have the following categories as observed above:
+ 'Model A'
+ 'MODEL A'

Both refer to a single model category, but because of the different casing they are treated as different categories. A simple transformation that we can apply, through a function that we will create, is to convert every letter of the values in the `flat_model` column to lowercase.
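
The effect of lowercasing on the number of categories can be checked on a small made-up series before applying it to the real column. The values below are illustrative and not taken from the dataset.

```python
import pandas as pd

# Toy values where casing alone creates duplicate categories
flat_model = pd.Series(
    ["Model A", "MODEL A", "Improved", "IMPROVED", "New Generation"]
)

# Number of distinct categories before and after lowercasing
before = flat_model.nunique()
after = flat_model.str.lower().nunique()
print(before, after)
```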

In [None]:
def TransformFlatModelVar(df, colname = 'flat_model'):
    """
    A method that transforms specified column to lowercase.
    """
    df[colname] = df[colname].str.lower()
    return df

In [None]:
hdb_combi_df = TransformFlatModelVar(hdb_combi_df)

In [None]:
# Display unique values for the variable after applying the new function
hdb_combi_df['flat_model'].unique()

In [None]:
# Display no. of categories for the variable after transformation
len(hdb_combi_df['flat_model'].unique())

As you can see, the transformation lets us handle mismatched categories. Strictly speaking, this is only a preliminary form of data cleaning, but it is still very much in the spirit of the transformation process.

Let us now export the extracted data to a single .csv file as a checkpoint.

In [None]:
hdb_combi_df.to_csv("../data/interim/sg-resale-flat-prices-1990-to-2019-jan.csv", index = False)

In [None]:
# Turn the dataframe index into a column and rename that column to 'id'
# (rename is used because assigning into columns.values in place is unreliable)
hdb_combi_df = hdb_combi_df.reset_index().rename(columns={'index': 'id'})

In [None]:
hdb_combi_df

<font color="blue"><h1><center>Load</center></h1></font>

Following extraction and transformation, we now load the data derived from the above processes into a simple [SQLite](https://www.sqlite.org/index.html) RDBMS/database.
(For simplicity's sake, we'll use SQLite for now. In the future, one might want to look into remote alternatives.)

In [None]:
# Check max length of a value in a column of object data type
hdb_combi_df.town.str.len().max()

In [None]:
# Check number of null values across all columns
hdb_combi_df.isnull().sum()

# SQLite

In [None]:
import os
import sys
import sqlalchemy
from sqlalchemy import Table, Column, Integer, String, Float, Date
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

A quick look at the datasets that we intend to load into the database:

In [None]:
hdb_combi_df.head()

In [None]:
hdb_rpi.head()

For the next few cells, we will be working towards creating the database:
+ create engine to initialise connection
+ specify table names and their columns

In [None]:
# Create engine
engine = create_engine('sqlite:///../data/processed/sg_hdb.db')
Base = declarative_base()

In [None]:
# Specify tables
class HDBRes(Base):
    __tablename__ = 'sg_hdb_resale'
    
    id = Column(Integer, primary_key=True)
    block = Column(String(7))
    flat_model = Column(String(30))
    flat_type = Column(String(20))
    floor_area_sqm = Column(Float())
    lease_commence_date = Column(Integer())
    month = Column(String(7))
    remaining_lease = Column(Integer())
    resale_price = Column(Float())
    storey_range = Column(String(15))
    street_name = Column(String(50))
    town = Column(String(20))
    
class HDBPI(Base):
    __tablename__ = 'sg_hdb_pi'
    
    quarter = Column(String(7), primary_key=True)
    index = Column(Float())

In [None]:
# Create tables as defined above
Base.metadata.create_all(engine)

Here, we create a function that connects to the database created above and inserts values from the relevant `pandas` dataframes into the SQLite tables.
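
A bulk insert of this kind takes one dictionary per row, which `to_dict(orient='records')` produces from a dataframe. The toy dataframe below is an assumption for illustration and only mimics the shape of the price index data.

```python
import pandas as pd

# Toy frame with the same column names as the price index table
toy = pd.DataFrame({"quarter": ["1990-Q1", "1990-Q2"], "index": [24.3, 24.4]})

# One dict per row, keyed by column name
records = toy.to_dict(orient="records")
print(records)
```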

In [None]:
def SGHDBBulkInsert(table_name, df_to_insert, engine_loc):
    """
    Inserts all rows of a dataframe into an existing table in one bulk statement.
    """
    engine = create_engine(engine_loc)
    
    # orient='records' yields one dict per row, the format expected for bulk inserts
    list_to_write = df_to_insert.to_dict(orient='records')
    # Reflect the existing table definition from the database
    metadata = sqlalchemy.MetaData()
    table = sqlalchemy.Table(table_name, metadata, autoload_with=engine)
    
    # engine.begin() opens a connection, commits on success and rolls back on error
    with engine.begin() as conn:
        # Insert the dataframe into the database in one bulk
        conn.execute(table.insert(), list_to_write)

In [None]:
# Executing insertion of the HDB Resale data
SGHDBBulkInsert('sg_hdb_resale', hdb_combi_df, 'sqlite:///../data/processed/sg_hdb.db')

In [None]:
# Now for the HDB Resale Price Indexes
SGHDBBulkInsert('sg_hdb_pi', hdb_rpi, 'sqlite:///../data/processed/sg_hdb.db')
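
As a quick programmatic check, the row count of a table can be compared with the length of the dataframe that was inserted. The sketch below is a hedged illustration on an in-memory database with toy data: `count_rows` is a hypothetical helper, and checking the real database would mean connecting to `../data/processed/sg_hdb.db` instead of `:memory:`.

```python
import sqlite3

import pandas as pd


def count_rows(conn, table_name):
    """Return the number of rows in `table_name` (hypothetical helper)."""
    # Table names cannot be bound as parameters, so only pass trusted names
    query = 'SELECT COUNT(*) FROM "{}"'.format(table_name)
    return conn.execute(query).fetchone()[0]


# Toy stand-in for the price index table; nothing is written to disk
conn = sqlite3.connect(":memory:")
toy_rpi = pd.DataFrame({"quarter": ["1990-Q1", "1990-Q2"], "index": [24.3, 24.4]})
toy_rpi.to_sql("sg_hdb_pi", conn, index=False)

# The table should hold exactly as many rows as the dataframe
n_rows = count_rows(conn, "sg_hdb_pi")
print(n_rows == len(toy_rpi))
conn.close()
```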

To check whether the intended operations executed successfully, we can use a GUI tool to examine the contents of the database. For SQLite, we can use [DB Browser for SQLite](https://sqlitebrowser.org/dl/). Now that the relevant data is loaded into our database, it is time to work on a simple machine learning model. On to the next part!