### Data Engineering Capstone Project

#### Project Summary

##### Introduction

A core responsibility of the National Travel and Tourism Office (NTTO) is to collect, analyze, and disseminate international travel and tourism statistics.

NTTO's Board of Managers is charged with managing, improving, and expanding the system to fully account for and report the impact of travel and tourism in the United States. The analysis results support forecasting, operations, and decision making. NTTO creates a positive climate for growth in travel and tourism by reducing institutional barriers to tourism, administers joint marketing efforts, provides official travel and tourism statistics, and coordinates efforts across federal agencies.

##### Project Description
The goal of this project is to analyze the relationship between the volume of travel immigration and the weather, by month and by city.

In this project, the following source datasets are used for data modeling:
* **I94 Immigration**: The source data for I94 immigration is available on local disk in `sas7bdat` format. This data comes from the US National Tourism and Trade Office. The data dictionary is also included in this project for reference. The original source of the data is https://travel.trade.gov/research/reports/i94/historical/2016.html. This data is already uploaded to the workspace.

* **World Temperature Data**: This dataset came from Kaggle. This data is already uploaded to the workspace.

* **I94_SAS_Labels_Descriptions.SAS**: Used to build the validation datasets. We will use `I94Port.txt` as the list of airports, cities, and states.

##### The project follows these steps:
* Step 1: Scope the Project and Gather Data
* Step 2: Explore and Assess the Data
* Step 3: Define the Data Model
* Step 4: Run ETL to Model the Data
* Step 5: Complete Project Write Up

#### Step 1: Scope the Project and Gather Data

##### Scope 

To decide the project scope and the technical solution, we assess the following datasets:
* I94 Immigration.
* World Temperature Data.
* I94_SAS_Labels_Descriptions.SAS.

Tools used and imported:
- Spark, Spark SQL
- Python, Pandas

##### Describe and Gather Data

- I94 Immigration dataset:
    - `cicid`: Unique ID issued to each traveller who passes through an immigration port.
    - `i94yr | i94mon`: Year and month of the immigration date.
    - `i94cit | i94res`: Country of citizenship and country of residence. From `I94_SAS_Labels_Descriptions.SAS`. Not used.
    - `i94port`: Port code of a specific immigration port in a US city. From `I94_SAS_Labels_Descriptions.SAS`. Some types of airport do not allow immigration entry.
    - `arrdate | depdate`: Arrival date in the USA and departure date from the USA.
    - `i94mode`: Code for the immigration transportation mode. From `I94_SAS_Labels_Descriptions.SAS`. Some transportation modes do not pass through an airport gateway. Not used.
    - `i94addr`: US state code. From `I94_SAS_Labels_Descriptions.SAS`.
    - `i94bir`: Age of the traveller in years. Not used.
    - `i94visa`: Code for the visa type, corresponding to the reason for visiting. Not used.
    - `count, tadfile, visapost, occup, entdepa, entdepd, entdepu, matflag, dtaddto, insnum`: Not used.
    - `biryear`: Immigrant's year of birth. The data type needs review. Not used.
    - `gender`: Immigrant's sex. Contains some uncommon values. Not used.
    - `airline`: Airline used to arrive in the U.S. Not used.
    - `admnum`: Admission number. Not used.
    - `fltno`: Flight number of the airline used to arrive in the U.S. Not used.
    - `visatype`: Class of admission legally admitting the non-immigrant to temporarily stay in the U.S. Not used.


- World Temperature dataset
    - `dt`: Date of the temperature record.
    - `AverageTemperature | AverageTemperatureUncertainty`: Recorded average temperature and its uncertainty.
    - `City | Country`: City and country where the temperature was recorded.
    - `Latitude | Longitude`: Geographical location as latitude and longitude. Helpful for a heatmap, but these columns are not needed in our project.

- I94_SAS_Labels_Descriptions.SAS contains the following parts:
    - `I94PORT_sas_label_validation` (to be used later)
        - `i94port_valid_code`: airport code.
        - `i94port_city_name`: the city corresponding to the airport code.
        - `i94port_state_code`: the state the city belongs to.
    - `I94MODE_sas_label_validation`
        - `i94mode_valid_code`: 
        - `i94mode_valid_value`:
    - `I94VISA_sas_label_validation`
        - `i94visa_valid_code`: 
        - `i94visa_valid_value`:
    - `I94ADDR_sas_label_validation`
        - `i94addr_valid_code`: 
        - `i94addr_valid_value`:
    - `I94RES_sas_label_validation`
        - `i94res_valid_code`: 
        - `i94res_valid_value`:

Our expectations:
- The chosen datasets are sufficient to build a data model of fact and dimension tables for analyzing the relationship between the volume of travel immigration and the weather, by month and by city.

#### Step 2: Explore and Assess the Data

##### Explore the Data

Preparation steps

In [None]:
# Do all imports and installs here - Done
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.functions import udf
from pyspark.sql.types import StructType as R, StructField as Fld,\
    DoubleType as Dbl, StringType as Str, IntegerType as Int,\
    TimestampType as Timestamp, DateType as Date, LongType as Long
import pandas as pd
import re
import configparser
import os
import shutil
from pathlib import Path

In [None]:
# Create Spark session - Using for production only
spark = SparkSession.builder\
            .config("spark.jars.repositories", "https://repos.spark-packages.org/")\
            .config("spark.jars.packages", "saurfang:spark-sas7bdat:2.0.0-s_2.11")\
            .enableHiveSupport()\
            .getOrCreate()

In [None]:
# Read validation code/value pairs from the SAS Labels Descriptions file
def get_validation_code_from_SAS_labels(sas_input_label):
    '''
    Read the SAS Labels Descriptions file and return the validation code/value pairs of one label.
    The file contains validation code datasets for the labels: I94RES (same as I94CIT), I94PORT, I94ADDR, I94MODE, I94VISA.
    
    Parameters
    ----------
    sas_input_label : str
        The label name of the validation code dataset. It can be one of I94RES (same as I94CIT), I94PORT, I94ADDR, I94MODE, I94VISA.
    
    Returns
    -------
    validation_code_list : list of tuple(str_valid_code, str_valid_value)
        The validation code/value pairs of the requested SAS label.
    '''

    # Read input SAS Labels Descriptions
    with open('I94_SAS_Labels_Descriptions.SAS') as sas_validation_code:
        labels_from_sas = sas_validation_code.read()

    # Keep the text from the requested label up to the terminating ';'
    sas_labels = labels_from_sas[labels_from_sas.index(sas_input_label):]
    sas_labels = sas_labels[:sas_labels.index(';')]
    
    # Process line by line, strip separator characters and quotes, then append the value pair
    lines = sas_labels.splitlines()
    validation_code_list = []
    for line in lines:
        try:
            valid_code, valid_value = line.split('=')
            valid_code = valid_code.strip().strip("'").strip('"')
            valid_value = valid_value.strip().strip("'").strip('"').strip()
            validation_code_list.append((valid_code, valid_value))
        except ValueError:
            # Skip lines that are not a single 'code = value' pair
            pass
        
    return validation_code_list
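
To make the expected input format concrete, the sketch below applies the same parsing idea to an inline sample rather than the real file. The sample text and the helper name `parse_label_block` are illustrative assumptions and are not copied from `I94_SAS_Labels_Descriptions.SAS`.

```python
# Hypothetical sample of one label block; the real file may differ in detail.
SAMPLE = """
/* I94MODE - sample label block */
value i94model
    1 = 'Air'
    2 = 'Sea'
    3 = 'Land'
    9 = 'Not reported' ;
"""


def parse_label_block(text, label):
    """Return (code, value) pairs found between `label` and the next ';'."""
    block = text[text.index(label):]
    block = block[:block.index(';')]
    pairs = []
    for line in block.splitlines():
        if '=' not in line:
            continue  # comment or header line, not a code/value pair
        code, value = line.split('=', 1)
        code = code.strip().strip("'\"")
        value = value.strip().strip("'\"").strip()
        pairs.append((code, value))
    return pairs


pairs = parse_label_block(SAMPLE, 'I94MODE')
print(pairs)
```

Checking for `'='` before splitting avoids relying on an exception to skip header lines, which is one possible alternative to the `try/except` used above.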

In [None]:
# Extract one label from the SAS Labels Descriptions file into a staging dataframe
def extract_staging_sas_label(label):
    '''
    Extract the validation codes of one label, stage them as csv files, and read them back as a Spark dataframe.
    
    Parameters
    ----------
    label : str
        The name of a specific label in "I94_SAS_Labels_Descriptions.SAS".
        
    Syntax note: 
        Pass the label as a quoted string. Ex: 'I94RES', 'I94PORT'
    
    Returns
    -------
    Spark dataframe of the staged label. The csv files are written to the directory "<label>_sas_label_validation".
    '''
    valid_code = label + "_valid_code"
    valid_value = label + "_valid_value"
    csv_output = label + "_sas_label_validation"

    schema = R([
        Fld(valid_code, Str()),
        Fld(valid_value, Str())
    ])

    df = spark.createDataFrame(
        data=get_validation_code_from_SAS_labels(label),
        schema=schema
    )

    # Remove any previous staging output; ignore_errors avoids a failure on the first run
    shutil.rmtree(csv_output, ignore_errors=True)
    df.write.options(header='True', delimiter=',').csv(csv_output)

    # Read the staged csv files back so the returned dataframe reflects what was written
    df = spark.read.options(inferSchema="true", delimiter=",", header = "true").csv(csv_output)

    print("Top 20 rows of {} ".format(csv_output))
    df.show()

    print("Count rows of {}: {} ".format(csv_output, df.count()))
    
    print("Count distinct codes of {}: {} ".format(csv_output, df.select(valid_code).distinct().count()))

    print("Staging csv files in: {}".format(csv_output))

    return df

In [None]:
# Convert column names to snake case
def convert_column_names(df):
    '''
    Standardize column names to snake case format. Format ex: customer_name, billing_address, total_price.
    
    Parameters
    ----------
    df : pandas dataframe
        The input dataframe, whose column names may be messy: different delimiters, mixed casing, and leading or trailing white space.
        This procedure strips surrounding white space, replaces spaces and hyphens with underscores, and converts all characters to lower case.
    
    Returns
    -------
    The same dataframe, with column names changed to snake_case format.
    '''
    cols = df.columns
    column_name_changed = []

    for col in cols:
        new_column = col.strip().lower().replace(" ", "_").replace("-", "_")
        column_name_changed.append(new_column)

    df.columns = column_name_changed
    return df
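
`convert_column_names` only replaces spaces and hyphens. If the column names can contain other symbols or runs of white space, a regular expression is one option. The sketch below is a minimal variant under that assumption; the helper name `to_snake_case` and the sample names are hypothetical.

```python
import re


def to_snake_case(name):
    """Lower-case a name and collapse runs of non-alphanumeric characters into '_'."""
    name = name.strip().lower()
    name = re.sub(r"[^0-9a-z]+", "_", name)
    return name.strip("_")


columns = ["AverageTemperature", " City ", "Avg-Temp  Uncertainty"]
print([to_snake_case(c) for c in columns])
```

Note that this sketch does not split CamelCase names, and it replaces accented characters with underscores, so it is only suitable for ASCII column names.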

In [None]:
# Remove a specific directory (if needed)
def rmdir(directory):
    '''
    Recursively delete a directory.
    
    Parameters
    ----------
    directory : str or Path
        Path to the target directory. This directory and all of its child objects will be deleted.
        Syntax note: rmdir(Path("target_path_to_dir"))
            where Path("target_path_to_dir") is the path to the directory to delete
    
    Returns
    -------
    None
    '''
    directory = Path(directory)
    for item in directory.iterdir():
        if item.is_dir():
            rmdir(item)
        else:
            item.unlink()
    directory.rmdir()

I94 Immigration dataset

In [None]:
# Read input dataset to Spark dataframe
# i94immi_dataset = '../../data/18-83510-I94-Data-2016/i94_apr16_sub.sas7bdat'
# i94immi_df = spark.read.format('com.github.saurfang.sas.spark').load(i94immi_dataset)
i94immi_df = spark.read.parquet("sas_data")

In [None]:
# Data type
i94immi_df.dtypes

In [None]:
# Attributes columns
i94immi_df.show(5)

In [None]:
# Count rows
i94immi_df.count()

World Temperature dataset

In [None]:
# read input dataset to Pandas dataframe
worldtempe_dataset = '../../data2/GlobalLandTemperaturesByCity.csv'
worldtempe_df = pd.read_csv(worldtempe_dataset,sep=",")

In [None]:
# Data type
worldtempe_df.info()

In [None]:
# Attributes columns
worldtempe_df.head(5)

In [None]:
# Count non-null values per column
worldtempe_df.count()

I94PORT_sas_label_validation

In [None]:
# Extract `I94PORT` label
# Create dir 'I94PORT_sas_label_validation' to save extracted label result
I94PORT_df = extract_staging_sas_label('I94PORT')
I94PORT_df = I94PORT_df.toPandas()

In [None]:
# Data type
I94PORT_df.info()

In [None]:
# Attributes columns
I94PORT_df.head(5)

In [None]:
# Count rows
I94PORT_df.shape

I94MODE_sas_label_validation

In [None]:
# Extract `I94MODE` label
# Create dir 'I94MODE_sas_label_validation' to save extracted label result
I94MODE_df = extract_staging_sas_label('I94MODE')
I94MODE_df = I94MODE_df.toPandas()
I94MODE_df.head(5)

I94VISA_sas_label_validation

In [None]:
# Extract `I94VISA` label
# Create dir 'I94VISA_sas_label_validation' to save extracted label result
I94VISA_df = extract_staging_sas_label('I94VISA')
I94VISA_df = I94VISA_df.toPandas()
I94VISA_df.head(5)

Exploration outputs:
- Number of records:
    - I94 Immigration Dataset: `3096313 rows`
    - World Temperature Dataset: `8599212 rows`
    - i94port SAS Labels Dataset: `660 rows`
- Data file extension formats included: 
    - I94 Immigration Dataset is a `.sas7bdat`
    - World Temperature Dataset is a `.csv`
    - SAS Labels Descriptions is a `.SAS`

Our chosen datasets satisfy the project rubric and will be used for data modeling.

##### Cleaning Steps

I94 Immigration dataset

- Clean one month of the I94 Immigration dataset: `../../data/18-83510-I94-Data-2016/i94_apr16_sub.sas7bdat`
- The output of this step: `i94immi_df_clean.csv`
- Cleaning tasks:
    - Read the dataset into a Spark dataframe.
    - Create a Spark SQL table from the dataframe.
    - Choose the primary key.
    - Verify that arrival date and departure date are logically consistent.
    - Add columns `arival_date`, `departure_date` with `datetime` datatype.
    - Check `arival_date`, `departure_date` for wrong values.
    - Keep only US airports where immigration is allowed.
    - Remove missing values.
    - Create and verify the staging table.
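
The date tasks above can be sketched as follows. This is a minimal sketch that assumes `arrdate` and `depdate` are stored as SAS numeric dates, that is, days counted from 1960-01-01; the helper names `sas_to_date` and `is_valid_stay` are hypothetical.

```python
from datetime import date, timedelta

SAS_EPOCH = date(1960, 1, 1)  # SAS date values count days from this day


def sas_to_date(sas_days):
    """Convert a SAS numeric date to a Python date; None stays None."""
    if sas_days is None:
        return None
    return SAS_EPOCH + timedelta(days=int(sas_days))


def is_valid_stay(arrdate, depdate):
    """An arrival is required; a departure, when present, must not precede it."""
    if arrdate is None:
        return False
    return depdate is None or depdate >= arrdate


print(sas_to_date(20545.0))  # 2016-04-01
```

In Spark, the same conversion could be applied through a UDF or date arithmetic, depending on the Spark version in use.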
    

In [None]:
# Check NULL values

In [None]:
# Check & drop duplicate
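
The two checks above could look like the following minimal sketch. It uses a small pandas dataframe with made-up rows as a stand-in for `i94immi_df`, and it assumes `cicid` is the key used to detect duplicates and that `i94port` and `arrdate` must not be missing.

```python
import pandas as pd

# Hypothetical sample rows; the real data comes from `i94immi_df`.
sample = pd.DataFrame({
    "cicid": [1.0, 2.0, 2.0, 4.0],
    "i94port": ["NYC", "LAX", "LAX", None],
    "arrdate": [20545.0, 20546.0, 20546.0, 20547.0],
})

# Check NULL values: number of missing entries per column
null_counts = sample.isnull().sum()

# Check duplicates: rows that repeat an already seen `cicid`
duplicate_count = int(sample.duplicated(subset=["cicid"]).sum())

# Drop duplicates first, then rows missing a required column
clean = (
    sample.drop_duplicates(subset=["cicid"])
          .dropna(subset=["i94port", "arrdate"])
)

print(null_counts)
print(duplicate_count, len(clean))
```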

World Temperature dataset

- Clean the World Temperature dataset `../../data2/GlobalLandTemperaturesByCity.csv`
- The output of this step: `worldtempe_df_clean.csv`.
- Cleaning tasks:
    - Read the dataset into a pandas dataframe.
    - Keep only records for `United States`.
    - Limit the dataset to the time period covered by the immigration data.
    - Clean the column with datetime datatype.
    - Standardize the column name format.
    - Create and verify the staging table.
- Jupyter notebook for cleaning
    ```code reference
        cleaning_staging_worldtempe-v3c.ipynb
    ```
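
The filtering tasks above can be sketched with the standard library alone. The rows below are made-up sample values that use the column names of the dataset; the helper name `filter_us_month`, the choice to match on month only, and the choice to drop rows without a temperature are assumptions.

```python
import csv
import io

# Hypothetical sample rows with the dataset's column names.
SAMPLE_CSV = """dt,AverageTemperature,AverageTemperatureUncertainty,City,Country,Latitude,Longitude
2013-04-01,12.5,0.3,Chicago,United States,42.59N,87.27W
2013-05-01,17.1,0.4,Chicago,United States,42.59N,87.27W
2013-04-01,9.8,0.5,Paris,France,49.03N,2.45E
2013-04-01,,,Denver,United States,39.38N,104.05W
"""


def filter_us_month(rows, month):
    """Keep US rows of the given month that have a temperature value."""
    kept = []
    for row in rows:
        if row["Country"] != "United States":
            continue
        if int(row["dt"][5:7]) != month:  # dt is formatted as YYYY-MM-DD
            continue
        if not row["AverageTemperature"]:
            continue  # assumption: rows without a temperature are dropped
        kept.append({
            "dt": row["dt"],
            "avg_tempe": float(row["AverageTemperature"]),
            "avg_uncertain_tempe": float(row["AverageTemperatureUncertainty"]),
            "city": row["City"],
        })
    return kept


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
us_april = filter_us_month(rows, month=4)
print(us_april)
```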
    

I94PORT_sas_label_validation

- Clean `i94port` from `I94_SAS_Labels_Descriptions.SAS`
- The output of this step: `i94_port.csv`
- Cleaning tasks:
    - Extract `I94PORT` from `I94_SAS_Labels_Descriptions.SAS`
    - Strip leading and trailing white space.
    - Split the value into port_code, city, state.
    - Drop the other columns to limit the dataset.
    - Create and verify the staging table.
- Jupyter notebook for cleaning
    ```code reference
        Extract_I94_SAS_Labels-v03d.ipynb
    ```
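
The split into city and state can be sketched as below. This assumes each port value has the form `CITY, ST` with optional padding; the helper name `split_port_value` and the sample strings are illustrative and not verified against the file.

```python
def split_port_value(raw_value):
    """Split 'CITY, ST' into (city, state); state is None when there is no comma."""
    text = raw_value.strip()
    if "," not in text:
        return text, None
    # Split on the last comma so that city names containing commas stay intact
    city, state = text.rsplit(",", 1)
    return city.strip(), state.strip()


print(split_port_value("ALCAN, AK             "))
print(split_port_value("No PORT Code (XXX)"))
```

Values without a state could then be filtered out when limiting the dataset to ports that allow immigration.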

#### Step 3: Define the Data Model

##### 3.1 Conceptual Data Model

As stated in our expectations, we want to find the relationship between US immigration traffic, the weather, and the arrival place (city). To achieve this, we create a star schema data model with the fact and dimension tables detailed below:

Star schema diagram
- Start_schema_diagram here

Fact table:
- The fact table `fact_i94immi` should include the columns:
    - `traveller_cicid`
    - `arr_airport_code`
    - `arr_city`
    - `avg_tempe`
    - `avg_uncertain_tempe`
    - `arr_datetime_iso`
    - `arr_year`
    - `arr_month`
    - `arr_state_code`

Dimension tables:

- `dim_immi_traveller` contains traveller information such as cicid, date, airport, city.
    - `immi_cicid` 
    - `immi_datetime_iso`
    - `arr_port_code`
    - `travel_city`
    - `travel_month`
    - `travel_year`

- `dim_i94immi_airline` contains airline, flight number, flight time (not yet implemented).
    - `immi_cicid` 
    - `immi_datetime_iso`
    - `arr_port_code`
    - `travel_city`
    - `travel_month`
    - `travel_year`

- `fact_us_temperature` contains temperature records of US cities, collected for the period covered by the immigration data.
    - `city_tempe_collect`
    - `avg_tempe`
    - `avg_uncertain_tempe`
    - `tempe_month`
    - `tempe_year`

- `dim_us_tempetime` contains the times at which temperatures were collected.
    - `tempe_datetime`
    - `tempe_month`
    - `tempe_year`

- `dim_port` contains the list of airports that allow immigration.
    - `port_code`
    - `city_name`
    - `state`

- `dim_datetime` contains date information such as year, month, day, week of year and weekday.
    - `arrival_year`
    - `arrival_month`
    - `arrival_date`
    - `dim_datetime` is created by appending datetimes from the staging data `i94immi_table`. In this project we use **April 2016** only.
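
How the fact table relates to the other tables can be sketched with a SQL join. The sketch below uses the standard-library `sqlite3` module as a stand-in for Spark SQL, so it runs without a cluster. The staging column names (`arr_year`, `arr_month`, `tempe_month`, `city`) and the sample rows are assumptions, as is matching cities by their upper-case name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE i94immi_table (cicid TEXT, i94port TEXT, arr_year INT, arr_month INT);
CREATE TABLE dim_port (port_code TEXT, city_name TEXT, state TEXT);
CREATE TABLE worldtempe_table (city TEXT, tempe_month INT, avg_tempe REAL);

INSERT INTO i94immi_table VALUES
    ('1', 'CHI', 2016, 4), ('2', 'CHI', 2016, 4), ('3', 'XXX', 2016, 4);
INSERT INTO dim_port VALUES ('CHI', 'CHICAGO', 'IL');
INSERT INTO worldtempe_table VALUES ('CHICAGO', 4, 12.5);
""")

# Inner joins drop arrivals whose port or city/month has no match (cicid '3' here)
fact_rows = conn.execute("""
SELECT i.cicid      AS traveller_cicid,
       p.port_code  AS arr_airport_code,
       p.city_name  AS arr_city,
       t.avg_tempe  AS avg_tempe,
       i.arr_year,
       i.arr_month,
       p.state      AS arr_state_code
FROM i94immi_table AS i
JOIN dim_port AS p
  ON i.i94port = p.port_code
JOIN worldtempe_table AS t
  ON t.city = p.city_name AND t.tempe_month = i.arr_month
ORDER BY i.cicid
""").fetchall()

print(fact_rows)
```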

##### 3.2 Mapping Out Data Pipelines

The pipeline steps are described below:
- Load raw dataset from source into dataframes:
    - I94 Immigration to `i94immi_df` as Spark dataframe.
    - World Temperature to `worldtempe_df` as Pandas dataframe.
    - Extract `I94PORT` from `I94_SAS_Labels_Descriptions.SAS` as a Spark dataframe.
- Describe and Gather Data on:
    - `i94immi_df` as Spark dataframe.
    - `worldtempe_df` as Pandas dataframe.
    - `I94PORT` as a Spark dataframe.
- Clean each dataframe as described and write it to a staging dataset:
    - `i94immi_df` cleaned output to `i94immi_df_clean` in csv format.
    - `worldtempe_df` cleaned output to `worldtempe_df_clean` in csv format.
    - `I94PORT` cleaned output to `i94port_staging` in csv format.
- Transform csv staging datasets to staging tables:
    - `i94immi_df_clean` staged to `i94immi_table` as Spark SQL table.
    - `worldtempe_df_clean` staged to `worldtempe_table` as Spark SQL table.
    - `i94port_staging` staged to `i94port_table` as Spark SQL table.
- Transform staging tables to fact and dim tables.
    - `dim_immi_traveller` is transformed from `i94immi_table`, `dim_port`.
    - `dim_us_temperature` is transformed from `worldtempe_table` and `dim_datetime`.
    - `dim_datetime` is transformed from `i94immi_table`.
    - `dim_port` is transformed from `i94port_table`.
    - `fact_immi_weather` is loaded from dim tables.
- Create and run pipeline to model data.
    - Load raw datasets; Describe and Gather Data: `Describe_and_Gather_Data-submit-03b.ipynb`
    - Clean and staging datasets:
        - `Extract_I94_SAS_Labels-v03d.ipynb`
        - `cleaning_staging_i94immi-v3c.ipynb`
        - `cleaning_staging_worldtempe-v3c.ipynb`
    - Transform to fact and dim tables:
        - `transform_fact_dims-v03c.ipynb`
- Create data quality check for fact and dim tables.
    - For dim tables `quality_check_dims.ipynb` (not yet)
    - For fact tables `quality_check_fact.ipynb` (not yet)

#### Step 4: Run Pipelines to Model the Data

##### 4.1 Create the data model
Run the pipeline steps:

- `Describe_and_Gather_Data-submit-03b.ipynb`
- `Extract_I94_SAS_Labels-v03d.ipynb`
- `cleaning_staging_i94immi-v3c.ipynb`
- `cleaning_staging_worldtempe-v3c.ipynb`
- `transform_fact_dims-v03c.ipynb`

##### 4.2 Data Quality Checks

Explain the data quality checks you'll perform to ensure the pipeline ran as expected. These could include:

* Integrity constraints on the relational database (e.g., unique key, data type, etc.)
* Unit tests for the scripts to ensure they are doing the right thing
* Source/Count checks to ensure completeness

Run Quality Checks
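
A minimal sketch of such checks is shown below. It uses `sqlite3` as a stand-in for Spark SQL so that it runs locally; the function names and the sample `dim_port` rows are hypothetical, and the table and column names passed to the helpers are trusted identifiers, not user input.

```python
import sqlite3


def check_not_empty(conn, table):
    """Count check: the table must contain at least one row."""
    count = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    return count > 0


def check_unique_key(conn, table, key):
    """Integrity check: no key value may occur more than once."""
    duplicates = conn.execute(
        f"SELECT COUNT(*) FROM "
        f"(SELECT {key} FROM {table} GROUP BY {key} HAVING COUNT(*) > 1) AS d"
    ).fetchone()[0]
    return duplicates == 0


def check_no_nulls(conn, table, column):
    """Completeness check: the column must not contain NULL."""
    nulls = conn.execute(
        f"SELECT COUNT(*) FROM {table} WHERE {column} IS NULL"
    ).fetchone()[0]
    return nulls == 0


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim_port (port_code TEXT, city_name TEXT, state TEXT)")
conn.executemany(
    "INSERT INTO dim_port VALUES (?, ?, ?)",
    [("CHI", "CHICAGO", "IL"), ("NYC", "NEW YORK", "NY")],
)

results = {
    "not_empty": check_not_empty(conn, "dim_port"),
    "unique_port_code": check_unique_key(conn, "dim_port", "port_code"),
    "no_null_city": check_no_nulls(conn, "dim_port", "city_name"),
}
print(results)
```

Each helper returns a boolean, so a pipeline step could fail fast by asserting that every value in `results` is true.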

##### 4.3 Data dictionary

- The fact table `fact_immi_weather` should include the columns:
    - `traveller_cicid` as `varchar` datatype
    - `arr_airport_code` as `varchar` datatype
    - `arr_city` as `varchar` datatype
    - `avg_tempe` as `double` datatype
    - `avg_uncertain_tempe` as `double` datatype
    - `arr_datetime_iso` as `datetime` datatype
    - `arr_year` as `datetime` datatype
    - `arr_month` as `datetime` datatype
    - `arr_state_code` as `varchar` datatype

Dim tables

- `dim_immi_traveller` contains traveller information such as cicid, date, airport, city.
    - `immi_cicid` as `varchar` datatype
    - `immi_datetime_iso` as `datetime` datatype
    - `arr_port_code` as `varchar` datatype
    - `travel_city` as `varchar` datatype
    - `travel_month` as `datetime` datatype
    - `travel_year` as `datetime` datatype

- `dim_us_temperature` contains temperature records of US cities, collected for the period covered by the immigration data.
    - `city_tempe_collect` as `datetime` datatype
    - `avg_tempe` as `double` datatype
    - `avg_uncertain_tempe` as `double` datatype
    - `tempe_month` as `datetime` datatype
    - `tempe_year` as `datetime` datatype

- `dim_port` contains the list of airports that allow immigration.
    - `port_code` as `varchar` datatype
    - `city_name` as `varchar` datatype
    - `state` as `varchar` datatype

- `dim_datetime` contains date information such as year, month, day, week of year and weekday.
    - `arrival_year` as `datetime` datatype
    - `arrival_month` as `datetime` datatype
    - `arrival_date` as `datetime` datatype
    - `dim_datetime` is created by appending datetimes from the staging data `i94immi_table`. In this project we use **April 2016** only.

#### Step 5: Complete Project Write Up

- Clearly state the rationale for the choice of tools and technologies for the project.
    - Pandas was chosen because it easily handles dataframes with csv input or output. It is easy to install and configure on a local computer, which speeds up development in an editor such as VS Code or Atom.
    - Spark was chosen because it can handle the required file formats (e.g. sas7bdat, SAS) with large volumes of data.
    - Spark and Pandas dataframes can be converted to each other.
    - Spark SQL was chosen because it can load the large input files into dataframes and manipulate them with common SQL operations: JOIN, modifying table structure, and aggregating data values.

- Propose how often the data should be updated and why.
    - The update frequency depends on the purpose of the analysis:
        * Yearly: to provide data for data analysis education.
        * Monthly: to provide a basis for predicting seasonal traveller traffic.
        * Weekly: to provide a basis for medical purposes (tracing infectious diseases such as COVID, flu, ...).

- Write a description of how you would approach the problem differently under the following scenarios. The data was increased by 100x:
    - If the pipeline must still meet a periodic deadline, 100x is too much volume to process on time with the current setup. In this scenario, we should design a regionally distributed big data system for collecting, processing, and modeling, running on multiple cloud-based clusters. The data is split into chunks, and each chunk is processed by a batch job.
    - If processing time is not a concern, we can process the data as a single batch job, using an existing cloud-based big data solution such as a data lake or a data warehouse.

- The data populates a dashboard that must be updated on a daily basis by 7am every day.
    - To update the dashboard at a fixed time, we can automate the pipeline. Airflow is a good candidate because it can schedule, automate, and run batch jobs.
    - The configuration of the cloud-based processing system is a key factor in speeding up the analysis.

- The database needed to be accessed by 100+ people.
    - We could consider publishing the parquet files to a read-only HDFS.
    - For workloads with high-frequency SQL queries, we could place a caching middleware layer (e.g. Redis, Memcached, MemSQL) in front of the database.
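
The chunked batch processing described for the 100x scenario can be illustrated on a single machine. The sketch below is a toy example: `iter_chunks` and `count_arrivals` are hypothetical names, the records are made up, and the loop stands in for batch jobs that a real system would run on separate workers.

```python
from collections import Counter
from itertools import islice


def iter_chunks(records, chunk_size):
    """Yield successive lists of at most `chunk_size` records."""
    iterator = iter(records)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


def count_arrivals(chunk):
    """Batch job stand-in: count arrivals per (port, month) within one chunk."""
    return Counter((port, month) for port, month in chunk)


# Hypothetical (port, month) records
records = [("CHI", 4), ("NYC", 4), ("CHI", 4), ("LAX", 4), ("CHI", 4)]

# Merge the partial results of every chunk into one total
totals = Counter()
for chunk in iter_chunks(records, chunk_size=2):
    totals.update(count_arrivals(chunk))

print(totals)
```

Because the per-chunk counts are merged by addition, the chunks can be processed in any order, which is what makes this kind of aggregation easy to distribute.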