# Parking Lots Availability and Forecasting App
### Data Engineering Capstone Project

#### Project Summary
The goal of this project is to build a data pipeline that gathers real-time carpark lot availability and weather datasets from Data.gov.sg. The data are extracted via API and stored in an S3 bucket before being ingested into the Data Warehouse. They will be used to power the Parking Lots Availability and Forecasting App.

The objectives are to:

1. Build an ETL pipeline using Apache Airflow to extract data via API and store it in AWS S3.
2. Ingest the data from AWS S3 and stage it in Redshift.
3. Transform the data into a set of dimensional and fact tables.
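
As a rough illustration of how objectives 1 and 2 connect, the sketch below builds a date-partitioned S3 object key for one API snapshot and a Redshift `COPY` statement that loads everything under a prefix. The key layout, the bucket name, the staging table name and the role ARN are all hypothetical placeholders, not the project's actual values, and `JSON 'auto'` assumes the stored files have already been flattened to one JSON object per row with keys matching the table columns.

```python
from datetime import datetime, timezone


def build_s3_key(dataset: str, ts: datetime) -> str:
    """Return a date-partitioned object key for one API snapshot (hypothetical layout)."""
    ts = ts.astimezone(timezone.utc)
    return f"{dataset}/{ts:%Y/%m/%d}/{dataset}_{ts:%Y%m%dT%H%M%S}.json"


def build_copy_sql(table: str, bucket: str, key_prefix: str, role_arn: str) -> str:
    """Return a Redshift COPY statement that loads the JSON files under a prefix."""
    return (
        f"COPY {table} "
        f"FROM 's3://{bucket}/{key_prefix}' "
        f"IAM_ROLE '{role_arn}' "
        "FORMAT AS JSON 'auto' "
        "TIMEFORMAT 'auto';"
    )


# Placeholder values for illustration only.
snapshot_time = datetime(2019, 8, 25, 1, 40, tzinfo=timezone.utc)
key = build_s3_key("temperature", snapshot_time)
copy_sql = build_copy_sql(
    "staging_temperature",                          # hypothetical staging table
    "example-carpark-bucket",                       # hypothetical bucket
    "temperature/2019/08/25/",
    "arn:aws:iam::123456789012:role/example-role",  # hypothetical role ARN
)
print(key)
print(copy_sql)
```

Partitioning keys by date keeps each `COPY` limited to one day's prefix, which makes reloading a single day straightforward.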

In [7]:
import requests
import configparser
import json
from datetime import datetime
import os
import logging
import psycopg2
import pandas as pd
import io
from pandas import json_normalize  # top-level import (pandas >= 1.0); pandas.io.json.json_normalize is deprecated

In [8]:
%load_ext sql

The sql extension is already loaded. To reload it, use:
  %reload_ext sql


### Project Scope and Datasets

#### Project Scope

I have always been curious whether weather data could help forecast carpark availability at specific locations over the next few hours or days. This project is the first step: building a data pipeline that collects past and live weather and carpark availability data at 10-minute intervals and stores them in Redshift.
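
To make the 10-minute collection interval concrete, here is a minimal sketch of how the timestamps for a backfill of past data could be generated. The helper names `floor_to_interval` and `backfill_slots` are made up for this example and are not part of the pipeline code.

```python
from datetime import datetime, timedelta, timezone

INTERVAL = timedelta(minutes=10)


def floor_to_interval(ts: datetime) -> datetime:
    """Round a timestamp down to the start of its 10-minute slot."""
    return ts.replace(minute=ts.minute - ts.minute % 10, second=0, microsecond=0)


def backfill_slots(start: datetime, end: datetime):
    """Yield every 10-minute slot from start (floored) up to and including end."""
    slot = floor_to_interval(start)
    while slot <= end:
        yield slot
        slot += INTERVAL


start = datetime(2019, 8, 25, 1, 37, tzinfo=timezone.utc)
end = datetime(2019, 8, 25, 2, 0, tzinfo=timezone.utc)
slots = list(backfill_slots(start, end))
print([s.strftime("%H:%M") for s in slots])  # ['01:30', '01:40', '01:50', '02:00']
```

Each generated slot could then be passed as the query time of one API call, so that past and live data end up on the same 10-minute grid.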

Although it is outside the scope of the capstone project, the plan is to use the data for exploration and for training machine learning models, and then to use the data and models to power a web app built with R Shiny.

##### Tools

I'm using Docker containers to build the data pipeline because they make it convenient to deploy the technology stack without the hassle of installing each component. Furthermore, once development is done, I can push the containers and host them on AWS services.

The tech stack I'm using to develop the pipeline is:

- __PostgreSQL:__ Two instances are deployed. One holds the Airflow metadata DB, while the other is used for initial local development of the Data Warehouse. The latter can be removed after migrating to Redshift.
- __pgAdmin:__ DB administration tool.
- __Jupyter Notebook:__ Used as the development environment for automating the Redshift Data Warehouse deployment, developing and testing ETL code, and running simple data exploration.
- __Airflow:__ Used with the local executor to design and deploy the code as workflows.
- __AWS S3:__ Used for storing the data after querying the APIs.
- __AWS Redshift:__ Used as the Data Warehouse.


#### Dataset Used
The following datasets are used in the project. They are all extracted from data.gov.sg using API calls:
- Temperature events: Temperature measurements taken at 10-minute intervals.
- Rainfall events: Rainfall measurements taken at 10-minute intervals.
- Carpark Availability: Carpark availability counts taken at 10-minute intervals.
- Carpark Information: Dataset that shows the carpark numbers and their locations.
- Weather Stations Information: Dataset that shows the weather station numbers and their locations.
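
The event datasets arrive as nested JSON, so they need flattening before they fit a tabular layout. The sketch below assumes a payload shaped like `{"items": [{"timestamp": ..., "readings": [{"station_id": ..., "value": ...}]}]}`; that shape, the sample payload and the helper name `flatten_readings` are assumptions for illustration and should be checked against the actual API response.

```python
from datetime import datetime, timezone

# Illustrative payload only; the real API response may carry more fields.
sample_payload = {
    "items": [
        {
            "timestamp": "2019-08-25T09:40:00+08:00",
            "readings": [
                {"station_id": "S100", "value": 27.7},
                {"station_id": "S104", "value": 28.0},
            ],
        }
    ]
}


def flatten_readings(payload: dict, value_name: str) -> list:
    """Turn one nested response into flat rows with UTC timestamps."""
    rows = []
    for item in payload.get("items", []):
        # Parse the ISO 8601 timestamp and normalise it to UTC.
        ts = datetime.fromisoformat(item["timestamp"]).astimezone(timezone.utc)
        for reading in item.get("readings", []):
            rows.append(
                {
                    "date_time": ts.isoformat(),
                    "station_id": reading["station_id"],
                    value_name: reading["value"],
                }
            )
    return rows


rows = flatten_readings(sample_payload, "temperature")
for row in rows:
    print(row)
```

The same function could serve the rainfall dataset by passing `"rainfall"` as `value_name`, provided its payload follows the same assumed shape.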


###### Parameters for connecting to Redshift

In [9]:
config = configparser.ConfigParser()
config.read_file(open('dwh.cfg'))

DWH_DB                 = config.get("DWH","DWH_DB")
DWH_DB_USER            = config.get("DWH","DWH_DB_USER")
DWH_DB_PASSWORD        = config.get("DWH","DWH_DB_PASSWORD")
DWH_PORT               = config.get("DWH","DWH_PORT")

DWH_ENDPOINT           = config.get("CLUSTER", "HOST")
DWH_ROLE_ARN           = config.get("IAM_ROLE", "ARN")
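
For reference, the `config.get` calls above imply a `dwh.cfg` with three sections. The sketch below reconstructs that layout from the section and key names only; every value is a placeholder, and the text is parsed from a string so no real file is needed.

```python
import configparser

# Layout inferred from the config.get calls; all values are placeholders.
EXAMPLE_CFG = """
[DWH]
DWH_DB = dev
DWH_DB_USER = example_user
DWH_DB_PASSWORD = example_password
DWH_PORT = 5439

[CLUSTER]
HOST = example-cluster.abc123.us-west-2.redshift.amazonaws.com

[IAM_ROLE]
ARN = arn:aws:iam::123456789012:role/example-redshift-role
"""

example = configparser.ConfigParser()
example.read_string(EXAMPLE_CFG)

print(example.sections())
print(example.get("DWH", "DWH_PORT"))
print(example.get("IAM_ROLE", "ARN"))
```

Because the file holds credentials, it is usually best kept out of version control.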

###### Connect to Redshift

In [10]:
conn_string="postgresql://{}:{}@{}:{}/{}".format(DWH_DB_USER, DWH_DB_PASSWORD, DWH_ENDPOINT, DWH_PORT,DWH_DB)
print(conn_string)
%sql $conn_string

postgresql://awsuser:passw0rD@dwhcluster.crttik8cimnv.us-west-2.redshift.amazonaws.com:5439/dev


'Connected: awsuser@dev'
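
Printing the full connection string exposes the password in the notebook output. One possible alternative, sketched here with hypothetical helper names and placeholder credentials, is to URL-encode the credentials when building the string and print only a masked copy.

```python
from urllib.parse import quote_plus


def build_conn_string(user, password, host, port, db):
    """Build a PostgreSQL URL, percent-encoding special characters in the credentials."""
    return f"postgresql://{quote_plus(user)}:{quote_plus(password)}@{host}:{port}/{db}"


def mask_password(conn_string):
    """Replace the password portion of a connection URL with *** for display."""
    scheme, rest = conn_string.split("://", 1)
    credentials, location = rest.rsplit("@", 1)
    user = credentials.split(":", 1)[0]
    return f"{scheme}://{user}:***@{location}"


# Placeholder credentials for illustration only.
example_conn = build_conn_string("example_user", "p@ss:word/1", "example-host", 5439, "dev")
print(mask_password(example_conn))  # postgresql://example_user:***@example-host:5439/dev
```

Encoding the password also keeps characters such as `@`, `:` and `/` from being misread as URL delimiters.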

In [12]:
%sql SELECT * FROM temperature_events LIMIT 3;

 * postgresql://awsuser:***@dwhcluster.crttik8cimnv.us-west-2.redshift.amazonaws.com:5439/dev
3 rows affected.


date_time,station_id,temperature
2019-08-25 01:40:00+00:00,S100,27.7
2019-08-25 01:40:00+00:00,S104,28.0
2019-08-25 01:40:00+00:00,S106,26.4


In [13]:
%sql SELECT * FROM rainfall_events LIMIT 3;

 * postgresql://awsuser:***@dwhcluster.crttik8cimnv.us-west-2.redshift.amazonaws.com:5439/dev
3 rows affected.


date_time,station_id,rainfall
2019-08-25 01:50:00+00:00,S07,0.0
2019-08-25 01:50:00+00:00,S08,0.0
2019-08-25 01:50:00+00:00,S100,0.0


In [14]:
%sql SELECT * FROM carpark_availability LIMIT 3;

 * postgresql://awsuser:***@dwhcluster.crttik8cimnv.us-west-2.redshift.amazonaws.com:5439/dev
3 rows affected.


date_time,carpark_id,lots_available
2019-08-25 01:50:00+00:00,A10,30
2019-08-25 01:50:00+00:00,A100,7
2019-08-25 01:50:00+00:00,A11,155


### Step 2: Explore and Assess the Data
#### Explore the Data 



Identify data quality issues, like missing values, duplicate data, etc.

#### Cleaning Steps
Document steps necessary to clean the data

In [None]:
# Performing cleaning tasks here





### Step 3: Define the Data Model
#### 3.1 Conceptual Data Model
Map out the conceptual data model and explain why you chose that model

#### 3.2 Mapping Out Data Pipelines
List the steps necessary to pipeline the data into the chosen data model

### Step 4: Run Pipelines to Model the Data 
#### 4.1 Create the data model
Build the data pipelines to create the data model.

In [None]:
# Write code here

#### 4.2 Data Quality Checks
Explain the data quality checks you'll perform to ensure the pipeline ran as expected. These could include:
 * Integrity constraints on the relational database (e.g., unique key, data type, etc.)
 * Unit tests for the scripts to ensure they are doing the right thing
 * Source/Count checks to ensure completeness
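
A minimal sketch of what count and completeness checks might look like is given below. It relies only on the `execute` and `fetchone` cursor methods, so the same functions should also work with a `psycopg2` cursor on Redshift; an in-memory SQLite database stands in for the warehouse here so the example runs anywhere. The function names are hypothetical, and table and column names are interpolated into the SQL, so they should only ever come from trusted constants.

```python
import sqlite3


def check_has_rows(cursor, table):
    """Fail if the table is empty; return the row count otherwise."""
    cursor.execute(f"SELECT COUNT(*) FROM {table}")
    count = cursor.fetchone()[0]
    if count == 0:
        raise ValueError(f"Quality check failed: {table} is empty")
    return count


def check_no_nulls(cursor, table, column):
    """Fail if the column contains NULLs; return the NULL count (0) otherwise."""
    cursor.execute(f"SELECT COUNT(*) FROM {table} WHERE {column} IS NULL")
    nulls = cursor.fetchone()[0]
    if nulls:
        raise ValueError(f"Quality check failed: {table}.{column} has {nulls} NULLs")
    return nulls


# SQLite is only a stand-in for Redshift in this demo.
demo = sqlite3.connect(":memory:")
cur = demo.cursor()
cur.execute(
    "CREATE TABLE temperature_events (date_time TEXT, station_id TEXT, temperature REAL)"
)
cur.execute(
    "INSERT INTO temperature_events VALUES ('2019-08-25 01:40:00+00:00', 'S100', 27.7)"
)

row_count = check_has_rows(cur, "temperature_events")
null_count = check_no_nulls(cur, "temperature_events", "station_id")
print(row_count, null_count)
```

Raising an exception on failure lets an orchestrator such as Airflow mark the task as failed instead of silently continuing.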
 
Run Quality Checks

In [None]:
# Perform quality checks here

#### 4.3 Data dictionary 
Create a data dictionary for your data model. For each field, provide a brief description of what the data is and where it came from. You can include the data dictionary in the notebook or in a separate file.

### Step 5: Complete Project Write Up
* Clearly state the rationale for the choice of tools and technologies for the project.
* Propose how often the data should be updated and why.
* Write a description of how you would approach the problem differently under the following scenarios:
    * The data was increased by 100x.
    * The data populates a dashboard that must be updated on a daily basis by 7am every day.
    * The database needed to be accessed by 100+ people.