### Data Engineering Capstone Project

#### Project Summary

##### Introduction

A core responsibility of the National Travel and Tourism Office (NTTO) is to collect, analyze, and disseminate international travel and tourism statistics.

NTTO's Board of Managers is charged with managing, improving, and expanding the system to fully account for and report the impact of travel and tourism in the United States. The analysis results support forecasting, operations, and decision making. NTTO also works to create a positive climate for growth in travel and tourism by reducing institutional barriers to tourism, administering joint marketing efforts, providing official travel and tourism statistics, and coordinating efforts across federal agencies.

##### Project Description

In this project, the following source datasets are used for data modeling:
* **I94 Immigration**: The I94 immigration source data is available on local disk in sas7bdat format. It comes from the US National Travel and Tourism Office, and the data dictionary is included in this project for reference. The original source of the data is https://travel.trade.gov/research/reports/i94/historical/2016.html. This data is already uploaded to the workspace.

* **World Temperature Data**: This dataset came from Kaggle. This data is already uploaded to the workspace.

* **Airport Code**: This is a simple table of airport codes. The source of this data is https://datahub.io/core/airport-codes#data. It is recommended for educational use only, not for commercial or any other purpose. This data is already uploaded to the workspace.

* **U.S. City Demographic Data**: This data comes from OpenDataSoft: https://public.opendatasoft.com/explore/dataset/us-cities-demographics/export/

* Other validation text files such as *I94Addr.txt*, *I94CIT_I94RES.txt*, *I94Mode.txt*, *I94Port.txt*, and *I94Visa.txt*, extracted from *I94_SAS_Labels_Descriptions.SAS*.

##### The project follows these steps:
* Step 1: Scope the Project and Gather Data
* Step 2: Explore and Assess the Data
* Step 3: Define the Data Model
* Step 4: Run ETL to Model the Data
* Step 5: Complete Project Write Up

#### Step 1: Scope the Project and Gather Data

##### Scope 

To decide the project scope and the technical approach, we assess the following datasets:
* I94 Immigration
* World Temperature Data
* U.S. City Demographic Data
* Airport Code
* Other reference *I94_SAS_Labels_Descriptions.SAS*.

Tools used and imported:
- Spark
- Python and its modules
- Pandas
- An AWS Redshift cluster is considered as an optional way to run one or more ETL steps.

In [None]:
# Do all imports and installs here - Done
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
import pandas as pd
import re
import configparser
import os

Create config file *etl.cfg* for path parameters:
~~~
[DIR]
INPUT_DIR = .
OUTPUT_DIR = ./storage

[DATA]
I94_IMMI = ../../data/18-83510-I94-Data-2016/i94_apr16_sub.sas7bdat
WORLD_TEMPE = ../../data2/GlobalLandTemperaturesByCity.csv
CITY_DEMOGRAPHIC = ./us-cities-demographics.csv
AIR_PORT = ./airport-codes_csv.csv

[SPLIT]
I94_IMMI_SPLITED_DIR = ./storage/.sas7bdat
WORLD_TEMPE_SPLITED_DIR = ./storage/.csv
~~~

Parse config file for path configurations - Done

In [None]:
config = configparser.ConfigParser()
config.read('etl.cfg')

input_data_source = config.get('DIR','INPUT_DIR')
output_processed_data = config.get('DIR','OUTPUT_DIR')

i94immi_dataset = config.get('DATA','I94_IMMI')
worldtempe_dataset = config.get('DATA','WORLD_TEMPE')
citydemo_dataset = config.get('DATA','CITY_DEMOGRAPHIC')
airport_dataset = config.get('DATA','AIR_PORT')

Create Spark session:

In [None]:
# Run on production only
spark = SparkSession.builder\
            .config("spark.jars.repositories", "https://repos.spark-packages.org/")\
            .config("spark.jars.packages", "saurfang:spark-sas7bdat:2.0.0-s_2.11")\
            .enableHiveSupport()\
            .getOrCreate()

In [None]:
# Run locally only (plain session without the SAS reader package)
local_spark = SparkSession.builder.appName("localSpark").getOrCreate()

##### Describe and Gather Data

Review each dataset's description, including its schema, column structure, attributes, sample records, number of rows, data size, and number of data files (if needed). Then choose the datasets to use for data modeling.

- I94 Immigration

- World Temperature Data

- U.S. City Demographic Data

- Airport Code

- Extract dictionary information from *I94_SAS_Labels_Descriptions.SAS*.
    - I94CIT & I94RES --> i94cntyl.txt
    - I94PORT --> i94prtl.txt
    - I94MODE --> i94model.txt
    - I94ADDR --> i94addrl.txt
    - I94VISA --> i94visa.txt

In [None]:
with open('./I94_SAS_Labels_Descriptions.SAS') as f:
    f_content = f.read()
    f_content = f_content.replace('\t', '')

In [None]:
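def code_mapper(file, idx):
    """Return a {code: label} dict for the SAS value block that starts at `idx`."""
    # Slice from the block name up to the terminating semicolon
    f_content2 = file[file.index(idx):]
    f_content2 = f_content2[:f_content2.index(';')].split('\n')
    f_content2 = [i.replace("'", "") for i in f_content2]
    # Skip the header line and keep only "code = label" pairs
    dic = [i.split('=') for i in f_content2[1:]]
    dic = dict([i[0].strip(), i[1].strip()] for i in dic if len(i) == 2)
    return dic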

In [None]:
i94_cit_and_res = code_mapper(f_content, "i94cntyl")
i94_port = code_mapper(f_content, "i94prtl")
i94_mode = code_mapper(f_content, "i94model")
i94_addr = code_mapper(f_content, "i94addrl")
i94_visa = {'1': 'Business',
            '2': 'Pleasure',
            '3': 'Student'}
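
The parsing approach above can be tried on a small inline sample. The following is a minimal sketch: `SAMPLE_LABELS` is an illustrative excerpt written in the style of the SAS label file (not copied from it), and `parse_labels` is a hypothetical stand-alone variant of `code_mapper` that takes the text as an argument.

```python
# Illustrative excerpt in the style of I94_SAS_Labels_Descriptions.SAS (assumed layout).
SAMPLE_LABELS = """
value i94model
    1 = 'Air'
    2 = 'Sea'
    3 = 'Land'
    9 = 'Not reported' ;
"""


def parse_labels(text, marker):
    """Return {code: label} for the value block that starts at `marker`."""
    block = text.replace("\t", "")
    # Keep everything from the block name up to the terminating semicolon
    block = block[block.index(marker):]
    block = block[:block.index(";")]
    # Skip the header line, drop quotes, and split each line on "="
    pairs = (line.replace("'", "").split("=") for line in block.split("\n")[1:])
    return {p[0].strip(): p[1].strip() for p in pairs if len(p) == 2}


mode_labels = parse_labels(SAMPLE_LABELS, "i94model")
print(mode_labels)
```

Lines that do not contain exactly one `=` are skipped, so blank lines and comments inside a block do not end up in the dictionary.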

In [None]:
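def convert_city_to_i94port(city):
    # i94_port maps port code -> "CITY, STATE", so match the city against the labels and return the code
    results = [k for k, v in i94_port.items() if city and re.match(re.escape(city), v, re.IGNORECASE)]
    return results[0] if results else None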

In [None]:
from pyspark.sql.functions import udf,col
convert_city_to_i94portUDF = udf(lambda z:convert_city_to_i94port(z))

temp_df_final = temp_df.withColumn('i94_port', convert_city_to_i94portUDF(col("city")))
temp_df_final.show(2)
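
Scanning the whole port dictionary for every row can be slow on a large temperature dataset. One possible alternative, sketched below, is to build a reverse index once and look cities up by exact name. The names `i94_port_sample`, `city_to_port`, and `city_to_i94port` are hypothetical, and the sample entries are illustrative; the sketch assumes each label has the form "CITY, STATE".

```python
# Illustrative subset of a port dictionary (port code -> "CITY, STATE").
i94_port_sample = {
    "NYC": "NEW YORK, NY",
    "LOS": "LOS ANGELES, CA",
    "CHI": "CHICAGO, IL",
}

# Build the reverse index once: upper-cased city name -> port code.
city_to_port = {
    label.split(",")[0].strip().upper(): code
    for code, label in i94_port_sample.items()
}


def city_to_i94port(city):
    """Return the port code for a city name, or None when there is no match."""
    if not city:
        return None
    return city_to_port.get(city.strip().upper())


print(city_to_i94port("Los Angeles"))
print(city_to_i94port("Springfield"))
```

An exact lookup is stricter than a prefix match, so cities whose labels are spelled differently in the two datasets would need an explicit mapping.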

Overview of the analysis steps:
- Stage the datasets *Immigration Data*, *Temperature Data*, *U.S. City Demographic Data*, and *Airport Code Table*.
- Create the data model diagram as a star schema.
- Create and load the fact table and dimension tables.
- Check data quality.

#### Step 2: Explore and Assess the Data

##### Explore the Data

Data quality and validation issues identified during the gathering step:

- I94 Immigration
    - Restructure columns and convert column names.
    - Check for and drop duplicates on the index column and important columns.
    - Clear empty cells.
    - Clear data in the wrong format.
    - Clear incorrect data.

- World Temperature Data
    - Convert the data type of column 'dt' to datetime.
    - Filter column "Country" to keep only US rows.
    - Restructure columns and convert column names.
    - Check for and drop duplicates on the index column and important columns: dt.
    - Clear empty cells.
    - Clear data in the wrong format.
    - Clear incorrect data.

- U.S. City Demographic Data
    - Restructure columns and convert column names.
    - Convert population values to percentage values.
    - Check the unique values of "City" and "State" and drop duplicates.
    - Clear empty cells.
    - Drop duplicated records.
    - Clear data in the wrong format.
    - Clear incorrect data.

- Airport Code
    - Drop records with non-US values in the "Country" column.
    - Restructure columns and convert column names.
    - Clear empty cells.
    - Drop duplicated records.

- Other reference *I94_SAS_Labels_Descriptions.SAS*.
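
As an illustration of the temperature cleaning steps listed above (filter to US rows, parse `dt`, clear empty cells, drop duplicates), here is a minimal sketch using only the standard library. The `RAW` rows are made up and only show a subset of columns, and `clean_temperatures` is a hypothetical helper; the actual pipeline would apply the same rules to a Spark or pandas dataframe.

```python
import csv
import io
from datetime import datetime

# Made-up rows shaped like a subset of the temperature CSV columns.
RAW = """dt,AverageTemperature,City,Country
2013-08-01,25.1,Chicago,United States
2013-08-01,25.1,Chicago,United States
2013-08-01,,Boston,United States
2013-08-01,18.3,Paris,France
"""


def clean_temperatures(raw_csv):
    """Keep US rows that have a temperature, parse `dt`, and drop duplicates."""
    seen = set()
    cleaned = []
    for row in csv.DictReader(io.StringIO(raw_csv)):
        # Filter out non-US rows and rows with an empty temperature cell
        if row["Country"] != "United States" or not row["AverageTemperature"]:
            continue
        # Treat (date, city) as the key when detecting duplicates
        key = (row["dt"], row["City"])
        if key in seen:
            continue
        seen.add(key)
        cleaned.append({
            "dt": datetime.strptime(row["dt"], "%Y-%m-%d").date(),
            "average_temperature": float(row["AverageTemperature"]),
            "city": row["City"],
            "country": row["Country"],
        })
    return cleaned


temperatures = clean_temperatures(RAW)
print(temperatures)
```

Of the four sample rows, only one Chicago row remains: the duplicate, the row with an empty temperature, and the non-US row are dropped.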

##### Cleaning Steps

- Parse *I94_SAS_Labels_Descriptions.SAS* for validation during the staging steps.

- Clean and stage I94 Immigration, including the columns:
    - i94cit = code for visitor origin country
    - i94port = code for destination USA city
    - arrdate = arrival date in the USA
    - i94mode = code for transportation mode
    - depdate = departure date from the USA
    - i94visa = code for visa type (reason for visiting)

- Clean and stage World Temperature Data, including the columns:
    - Date
    - AverageTemperature
    - City
    - State
    - Country

- Clean and stage U.S. City Demographic Data, including the columns:
    - City
    - State
    - Median Age
    - Male Population
    - Female Population
    - Total Population
    - Number of Veterans
    - Foreign-born
    - Average Household Size
    - Race
    - Count

- Clean and stage Airport Code.
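
The `arrdate` and `depdate` columns need a type conversion before they can be used as dates. The sketch below assumes they are stored as SAS numeric dates, that is, a count of days since 1960-01-01; `sas_to_date` is a hypothetical helper name.

```python
from datetime import date, timedelta

SAS_EPOCH = date(1960, 1, 1)  # SAS date value 0


def sas_to_date(sas_days):
    """Convert a SAS numeric date (days since 1960-01-01) to a date; None stays None."""
    if sas_days is None:
        return None
    return SAS_EPOCH + timedelta(days=int(sas_days))


print(sas_to_date(20545.0))  # 2016-04-01
```

Missing departure dates are passed through as `None`, so visitors who have not yet departed are not dropped at this stage.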

#### Step 3: Define the Data Model

##### 3.1 Conceptual Data Model

Transformed star schema diagram:
- Star_schema_diagram here

Fact table:
- Fact table fact_i94visits will contain information from the I94 immigration data joined with the daily average temperature on the port city and arrival date.

Dimension table:
- dim_us_ports contains information such as the US port of entry code, city, state code, and state name.
- dim_visa maps the visa type, which gives the reason for visiting.
- dim_countries tells which country the visitor comes from.
- dim_travelmode gives the mode of transportation: air, land, or sea.
- dim_demographics contains median age and population information about the US port city.
- dim_date contains date information such as year, month, day, week of year, and weekday.
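
To illustrate the attributes listed for dim_date, the following sketch derives one dimension row from a single date. The function name `date_dimension_row` and the output column names are assumptions, and the week and weekday follow the ISO 8601 convention, which may differ from the convention the actual pipeline uses.

```python
from datetime import date


def date_dimension_row(d):
    """Build one dim_date row with year, month, day, ISO week, and ISO weekday."""
    _iso_year, iso_week, iso_weekday = d.isocalendar()
    return {
        "date": d,
        "year": d.year,
        "month": d.month,
        "day": d.day,
        "week_of_year": iso_week,
        "weekday": iso_weekday,  # ISO: 1 = Monday ... 7 = Sunday
    }


row = date_dimension_row(date(2016, 4, 1))
print(row)
```

For 2016-04-01 this yields ISO week 13 and weekday 5 (Friday).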

##### 3.2 Mapping Out Data Pipelines

The pipeline steps are described below:
- Load raw dataset from source into Spark dataframe: df_spark_i94, df_spark_dem and df_spark_temp for one month.
- Clean each Spark dataframe as described in *Step 2 Cleaning Steps* and write each cleaned dataframe to parquet as a staging table: stage_i94_immigration, stage_cities_demographics and stage_uscities_temperatures.
- Create and load dimension tables: dim_us_ports, dim_visa, dim_countries, dim_travelmode and dim_demographics.
- Create and load fact table fact_i94_visits joining stage_i94_immigration and stage_uscities_temperatures.
- Create and load the dimension table dim_date.

#### Step 4: Run Pipelines to Model the Data

##### 4.1 Create the data model

Build the data pipelines to create the data model.

##### 4.2 Data Quality Checks

Explain the data quality checks you'll perform to ensure the pipeline ran as expected. These could include:

* Integrity constraints on the relational database (e.g., unique key, data type, etc.)
* Unit tests for the scripts to ensure they are doing the right thing
* Source/Count checks to ensure completeness

Run Quality Checks
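
A minimal sketch of two such checks follows: a row-count check for completeness and a unique-key check for integrity. It operates on plain lists of dictionaries so that it stays self-contained; `check_row_count`, `check_unique_key`, and `dim_visa_sample` (with its column names) are hypothetical and would be replaced by checks against the real tables.

```python
def check_row_count(table_name, rows, minimum=1):
    """Completeness check: fail when a table has fewer rows than expected."""
    if len(rows) < minimum:
        raise ValueError(
            f"{table_name}: expected at least {minimum} rows, found {len(rows)}"
        )
    return len(rows)


def check_unique_key(table_name, rows, key):
    """Integrity check: fail when the key column contains nulls or duplicates."""
    values = [row.get(key) for row in rows]
    if any(v is None for v in values):
        raise ValueError(f"{table_name}: null value in key column {key!r}")
    if len(values) != len(set(values)):
        raise ValueError(f"{table_name}: duplicate values in key column {key!r}")
    return True


# Hypothetical sample of dim_visa rows, used only to exercise the checks.
dim_visa_sample = [
    {"visa_code": "1", "visa_type": "Business"},
    {"visa_code": "2", "visa_type": "Pleasure"},
    {"visa_code": "3", "visa_type": "Student"},
]

print(check_row_count("dim_visa", dim_visa_sample))
print(check_unique_key("dim_visa", dim_visa_sample, "visa_code"))
```

Raising an exception on failure makes the pipeline stop at the first broken table instead of silently loading incomplete data.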

##### 4.3 Data dictionary

The first dimension table will contain events from the I94 immigration data. The columns below will be extracted from the immigration dataframe:
* i94yr = 4 digit year
* i94mon = numeric month
* i94cit = 3 digit code of origin country
* i94port = 3 character code of destination city
* arrdate = arrival date
* i94mode = 1 digit travel code
* depdate = departure date
* i94visa = reason for immigration

The second dimension table will contain city temperature data. The columns below will be extracted from the temperature dataframe:
* i94port = 3 character code of destination city (mapped from immigration data during cleanup step)
* AverageTemperature = average temperature
* City = city name
* Country = country name
* Latitude = latitude
* Longitude = longitude

The fact table will contain information from the I94 immigration data joined with the city temperature data on i94port:
* i94yr = 4 digit year
* i94mon = numeric month
* i94cit = 3 digit code of origin country
* i94port = 3 character code of destination city
* arrdate = arrival date
* i94mode = 1 digit travel code
* depdate = departure date
* i94visa = reason for immigration
* AverageTemperature = average temperature of destination city
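
The join that produces the fact table can be pictured as a left join on `i94port`. The sketch below shows the logic on plain Python rows and assumes one temperature value per port; `build_fact_rows` and the sample rows are hypothetical. In Spark, the equivalent would be a `DataFrame.join` on `i94port` with `how="left"`.

```python
def build_fact_rows(visits, temperatures):
    """Left-join visit rows to temperature rows on i94port."""
    # Assumes at most one temperature row per port code
    temp_by_port = {t["i94port"]: t["AverageTemperature"] for t in temperatures}
    return [
        {**visit, "AverageTemperature": temp_by_port.get(visit["i94port"])}
        for visit in visits
    ]


# Made-up staging rows, used only to show the shape of the result.
stage_i94_sample = [
    {"i94yr": 2016, "i94mon": 4, "i94cit": "209", "i94port": "NYC", "i94visa": "2"},
    {"i94yr": 2016, "i94mon": 4, "i94cit": "582", "i94port": "ZZZ", "i94visa": "1"},
]
stage_temp_sample = [{"i94port": "NYC", "AverageTemperature": 11.5}]

fact_rows = build_fact_rows(stage_i94_sample, stage_temp_sample)
print(fact_rows)
```

A left join keeps visits through ports that have no temperature record, with `AverageTemperature` set to `None`, so the visit counts in the fact table still match the staged immigration data.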

#### Step 5: Complete Project Write Up

- Clearly state the rationale for the choice of tools and technologies for the project.
    * Spark was chosen since it can easily handle multiple file formats (including SAS) containing large amounts of data.
    * Spark SQL was chosen to load the large input files into dataframes and to manipulate them via standard SQL join operations to form additional tables.

- Propose how often the data should be updated and why.
    * The data should be updated monthly in conjunction with the current raw file format.

- Write a description of how you would approach the problem differently under the following scenarios:
    * The data was increased by 100x.
        * We would no longer process the data as a single batch job. We could perhaps do incremental updates using a tool such as Apache Hudi (originally developed at Uber). We could also consider moving Spark to cluster mode using a cluster manager such as YARN.
    * The data populates a dashboard that must be updated on a daily basis by 7am every day.
        * To meet this daily SLA, we could use a scheduling tool such as Airflow to run the ETL pipeline overnight.
    * The database needed to be accessed by 100+ people.
        * We could consider publishing the parquet files to HDFS and giving read access to the users who need it.
        * If the users want to run SQL queries on the raw data, we could consider querying the files on HDFS with a tool such as Impala.