### Data Engineering Capstone Project

#### Project Summary

##### Introduction

A core responsibility of The National Travel and Tourism Office (NTTO) is to collect, analyze, and disseminate international travel and tourism statistics. 

NTTO's Board of Managers is charged with managing, improving, and expanding the system to fully account for and report the impact of travel and tourism in the United States. The analysis results support forecasting, operations, and decision-making. NTTO also creates a positive climate for growth in travel and tourism by reducing institutional barriers to tourism, administers joint marketing efforts, provides official travel and tourism statistics, and coordinates efforts across federal agencies.

##### Project Description
The goal of this project is to analyze the relationship between the volume of traveller immigration and the weather, by month and by city.

The following source datasets are used for data modeling:
* **I94 Immigration**: The I94 immigration data is available on local disk in `sas7bdat` format. It comes from the US National Tourism and Trade Office, and its data dictionary is included in this project for reference. The original source is https://travel.trade.gov/research/reports/i94/historical/2016.html. This data is already uploaded to the workspace.

* **World Temperature Data**: This dataset comes from Kaggle. This data is already uploaded to the workspace.

* **I94_SAS_Labels_Descriptions.SAS**: Used to build the validation dataset. We use `I94Port.txt` as the list of airports, cities, and states.

##### The project follows these steps:
* Step 1: Scope the Project and Gather Data
* Step 2: Explore and Assess the Data
* Step 3: Define the Data Model
* Step 4: Run ETL to Model the Data
* Step 5: Complete Project Write Up

#### Step 1: Scope the Project and Gather Data

##### Scope 

To decide the project scope and the technical approach, we assess the following datasets:
* I94 Immigration.
* World Temperature Data.
* I94_SAS_Labels_Descriptions.SAS.

Tools used:
- Spark, Spark SQL
- Python, Pandas

##### Describe and Gather Data

Assess the datasets in a Jupyter Notebook, then review the outputs.
```code reference
    Describe_and_Gather_Data-submit-03b.ipynb
```

The assessment outputs:
- Number of records:
    - I94 Immigration Dataset: `3096313 rows`
    - World Temperature Dataset: `8599212 rows`
    - i94port SAS Labels Dataset: `660 rows`
- Data file formats:
    - I94 Immigration Dataset is a `.sas7bdat`
    - World Temperature Dataset is a `.csv`
    - SAS Labels Descriptions is a `.SAS`

Our chosen datasets satisfy the project rubric and will be used for data modeling.

Our expectations:
- The chosen datasets are sufficient to build a data model of fact and dimension tables for analyzing the relationship between the volume of traveller immigration and the weather, by month and by city.

#### Step 2: Explore and Assess the Data

##### Explore the Data

Data quality and validation issues identified during the gathering step:

- I94 Immigration dataset:
    - `cicid`: ID code issued to every traveller who passes through a US immigration port.
    - `i94yr | i94mon`: Year and month of the immigration date.
    - `i94cit | i94res`: Country of citizenship and country of residence. Codes come from `I94_SAS_Labels_Descriptions.SAS`. Not used.
    - `i94port`: Port code for a specific US immigration port city. Codes come from `I94_SAS_Labels_Descriptions.SAS`. Some airport types do not allow immigration entry.
    - `arrdate | depdate`: Arrival date in the USA and departure date from the USA.
    - `i94mode`: Code for the immigration transportation mode. Codes come from `I94_SAS_Labels_Descriptions.SAS`. Some transportation modes do not go through an airport gateway. Not used.
    - `i94addr`: US state code. Codes come from `I94_SAS_Labels_Descriptions.SAS`.
    - `i94bir`: Age of the traveller in years. Not used.
    - `i94visa`: Visa type code corresponding to the reason for visiting. Not used.
    - `count, tadfile, visapost, occup, entdepa, entdepd, entdepu, matflag, dtaddto, insnum`: Not used.
    - `biryear`: Immigrant's year of birth. The data type needs review. Not used.
    - `gender`: Immigrant's sex. Contains some uncommon values. Not used.
    - `airline`: Airline used to arrive in the U.S. Not used.
    - `admnum`: Admission number. Not used.
    - `fltno`: Flight number of the airline used to arrive in the U.S. Not used.
    - `visatype`: Class of admission legally admitting the non-immigrant to temporarily stay in the U.S. Not used.


- World Temperature dataset:
    - `dt`: Date on which the temperature was recorded.
    - `AverageTemperature | AverageTemperatureUncertainty`: Recorded average temperature and its uncertainty.
    - `City | Country`: City and country where the temperature was recorded.
    - `Latitude | Longitude`: Geographical location as latitude and longitude. Helpful for a heatmap, but these columns are not needed in our project.

- `i94port` extracted from *I94_SAS_Labels_Descriptions.SAS* (the column names below are used later):
    - `i94port_valid_code`: airport code.
    - `i94port_city_name`: the city corresponding to the airport code.
    - `i94port_state_code`: the state the city belongs to.

##### Cleaning Steps

- Cleaning `i94port` from `I94_SAS_Labels_Descriptions.SAS`
- Output of this step: `i94_port.csv`
- Cleaning tasks:
    - Extract `I94PORT` from `I94_SAS_Labels_Descriptions.SAS`.
    - Trim leading and trailing white space.
    - Split each entry into port_code, city, state.
    - Drop the other columns to limit the dataset.
    - Create and verify the staging table.
- Jupyter Notebook for cleaning
    ```code reference
        Extract_I94_SAS_Labels-v03d.ipynb
    ```
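
The extract, trim, and split tasks above could be sketched as follows. This is a minimal illustration in plain Python, not the notebook's actual code. It assumes each `I94PORT` entry in the label file has the shape `'ALC' = 'ALCAN, AK'` (a quoted code, an equals sign, and a quoted `city, state` label); the names `PORT_LINE` and `parse_port_line` and the sample lines are hypothetical.

```python
import re

# Assumed entry shape: 'CODE' = 'CITY NAME, ST   ' (quoted, padded with spaces).
PORT_LINE = re.compile(r"'(?P<code>[^']+)'\s*=\s*'(?P<label>[^']*)'")


def parse_port_line(line):
    """Return (port_code, city, state), or None if the line does not fit."""
    match = PORT_LINE.search(line)
    if match is None:
        return None
    code = match.group("code").strip()
    label = match.group("label").strip()
    # Split on the last comma so city names that contain commas stay intact.
    city, comma, state = label.rpartition(",")
    if not comma:
        return None  # no "city, state" pair, so the entry is skipped
    return code, city.strip(), state.strip()


# Illustrative sample lines; the second has no state and is dropped.
sample = [
    "   'ALC'\t=\t'ALCAN, AK             '",
    "   'XXX'\t=\t'NOT REPORTED/UNKNOWN  '",
]
ports = [p for p in map(parse_port_line, sample) if p is not None]
print(ports)
```

Entries without a `city, state` pair are dropped here; whether they should instead be kept with an empty state depends on how the staging table is validated.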

- Cleaning one month of the I94 Immigration dataset: `../../data/18-83510-I94-Data-2016/i94_apr16_sub.sas7bdat`
- Output of this step: `i94immi_df_clean.csv`
- Cleaning tasks:
    - Read the dataset into a Spark dataframe.
    - Create a Spark SQL table from the dataframe.
    - Choose the primary key.
    - Verify that arrival and departure dates are logically consistent.
    - Add columns `arival_date` and `departure_date` with the `datetime` datatype.
    - Check `arival_date` and `departure_date` for invalid values.
    - Keep only US airports that allow immigration entry.
    - Remove rows with missing values.
    - Create and verify the staging table.
- Jupyter Notebook for cleaning
    ```code reference
        cleaning_staging_i94immi.ipynb
    ```
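
The date conversion and the arrival/departure check above could look like the sketch below. It is plain Python rather than the Spark code in the notebook, and it assumes `arrdate` and `depdate` are SAS numeric dates, which count days from 1 January 1960. The helper names and the sample rows are hypothetical, and treating a missing departure date as invalid is an assumption that mirrors the "remove rows with missing values" task.

```python
from datetime import date, timedelta

# SAS stores dates as day counts from 1 January 1960.
SAS_EPOCH = date(1960, 1, 1)


def sas_to_date(days):
    """Convert a SAS numeric date (possibly a float) to a date, or None."""
    if days is None:
        return None
    return SAS_EPOCH + timedelta(days=int(days))


def is_logical_stay(arrdate, depdate):
    """True when both dates are present and departure is not before arrival."""
    if arrdate is None or depdate is None:
        return False
    return depdate >= arrdate


# Hypothetical rows shaped like (cicid, arrdate, depdate).
rows = [(1.0, 20545.0, 20550.0), (2.0, 20545.0, 20540.0), (3.0, 20545.0, None)]
clean = [
    (int(cicid), sas_to_date(arr), sas_to_date(dep))
    for cicid, arr, dep in rows
    if is_logical_stay(arr, dep)
]
print(clean)
```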
    

- Cleaning the World Temperature dataset: `../../data2/GlobalLandTemperaturesByCity.csv`
- Output of this step: `worldtempe_df_clean.csv`
- Cleaning tasks:
    - Read the dataset into a pandas dataframe.
    - Keep only rows for `United States`.
    - Limit the dataset's time range to match the immigration data.
    - Convert the date column to the datetime datatype.
    - Standardize the column name format.
    - Create and verify the staging table.
- Jupyter Notebook for cleaning
    ```code reference
        cleaning_staging_worldtempe.ipynb
    ```
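
As a rough illustration of the filtering and renaming tasks above, the sketch below uses only the Python standard library instead of pandas. The column names `dt`, `City`, and `Country` come from the dataset description; the function names, the snake_case naming convention, the sample rows, and the date range are assumptions made for the example.

```python
import csv
import io
import re
from datetime import date


def to_snake_case(name):
    """Turn 'AverageTemperature' into 'average_temperature'."""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


def clean_temperature_rows(rows, country, start, end):
    """Keep rows for one country inside [start, end] and rename the columns."""
    cleaned = []
    for row in rows:
        if row["Country"] != country:
            continue
        day = date.fromisoformat(row["dt"])
        if not start <= day <= end:
            continue
        record = {to_snake_case(key): value for key, value in row.items()}
        record["dt"] = day  # store a real date instead of a string
        cleaned.append(record)
    return cleaned


# Tiny hypothetical sample using a subset of the dataset's column names.
raw = io.StringIO(
    "dt,AverageTemperature,City,Country\n"
    "2012-04-01,14.2,Chicago,United States\n"
    "2012-04-01,18.9,Lyon,France\n"
    "1900-04-01,9.1,Chicago,United States\n"
)
result = clean_temperature_rows(
    csv.DictReader(raw), "United States", date(2012, 1, 1), date(2012, 12, 31)
)
print(result)
```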
    

#### Step 3: Define the Data Model

##### 3.1 Conceptual Data Model

As stated in our expectations, we want to explore the relationship between US immigration traffic, the weather, and the arrival city. To achieve this, we build a star schema with the fact and dimension tables detailed below:

Star schema diagram:
- Start_schema_diagram here

Fact table:
- The fact table `fact_immi_weather` should include the columns:
    - `traveller_cicid`
    - `arr_airport_code`
    - `arr_city`
    - `avg_tempe`
    - `avg_uncertain_tempe`
    - `arr_datetime_iso`
    - `arr_year`
    - `arr_month`
    - `arr_state_code`

Dimension tables:

- `dim_immi_traveller` contains traveller information such as cicid, date, airport, and city.
    - `immi_cicid` 
    - `immi_datetime_iso`
    - `arr_port_code`
    - `travel_city`
    - `travel_month`
    - `travel_year`

- `dim_us_temperature` contains temperature records of US cities, limited to the scope of the immigration data.
    - `city_tempe_collect`
    - `avg_tempe`
    - `avg_uncertain_tempe`
    - `tempe_month`
    - `tempe_year`

- `dim_port` contains the list of airports that allow immigration entry.
    - `port_code`
    - `city_name`
    - `state`

- `dim_datetime` contains date information like year, month, day, week of year and weekday.
    - `arrival_year`
    - `arrival_month`
    - `arrival_date`
    - `dim_datetime` is created by appending datetimes from the staging table `i94immi_table`. In this project we use **April 2016** only.
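
One way the calendar attributes of `dim_datetime` could be derived from arrival dates is sketched below in plain Python. The columns `arrival_year`, `arrival_month`, and `arrival_date` come from the list above; `arrival_day`, `arrival_week_of_year`, and `arrival_weekday` are hypothetical names for the day, week-of-year, and weekday attributes mentioned in the description, and the sample dates are made up.

```python
from datetime import date


def datetime_dim_row(arrival):
    """Derive calendar attributes from one arrival date."""
    iso_year, iso_week, iso_weekday = arrival.isocalendar()
    return {
        "arrival_date": arrival,
        "arrival_year": arrival.year,
        "arrival_month": arrival.month,
        "arrival_day": arrival.day,
        "arrival_week_of_year": iso_week,  # ISO 8601 week number
        "arrival_weekday": iso_weekday,    # 1 = Monday ... 7 = Sunday
    }


# De-duplicate arrival dates before building the dimension rows.
arrivals = [date(2016, 4, 1), date(2016, 4, 1), date(2016, 4, 30)]
dim_datetime = [datetime_dim_row(d) for d in sorted(set(arrivals))]
print(dim_datetime)
```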

##### 3.2 Mapping Out Data Pipelines

The pipeline steps are described below:
- Load raw dataset from source into dataframes:
    - I94 Immigration to `i94immi_df` as Spark dataframe.
    - World Temperature to `worldtempe_df` as Pandas dataframe.
    - Extract `I94PORT` from `I94_SAS_Labels_Descriptions.SAS` as a Spark dataframe.
- Describe and Gather Data on:
    - `i94immi_df` as Spark dataframe.
    - `worldtempe_df` as Pandas dataframe.
    - `I94PORT` as a Spark dataframe.
- Clean each dataframe as described and write it out as a staging dataset:
    - `i94immi_df` cleaned output to `i94immi_df_clean` in csv format.
    - `worldtempe_df` cleaned output to `worldtempe_df_clean` in csv format.
    - `I94PORT` cleaned output to `i94port_staging` in csv format.
- Load the csv staging datasets into staging tables:
    - `i94immi_df_clean` staged to `i94immi_table` as Spark SQL table.
    - `worldtempe_df_clean` staged to `worldtempe_table` as Spark SQL table.
    - `i94port_staging` staged to `i94port_table` as Spark SQL table.
- Transform staging tables to fact and dim tables.
    - `dim_immi_traveller` is transformed from `i94immi_table`, `dim_port`.
    - `dim_us_temperature` is transformed from `worldtempe_table` and `dim_datetime`.
    - `dim_datetime` is transformed from `i94immi_table`.
    - `dim_port` is transformed from `i94port_table`.
    - `fact_immi_weather` is loaded from dim tables.
- Create and run pipeline to model data.
    - Load raw datasets; Describe and Gather Data: `Describe_and_Gather_Data-submit-03b.ipynb`
    - Clean and staging datasets:
        - `Extract_I94_SAS_Labels-v03d.ipynb`
        - `cleaning_staging_i94immi.ipynb`
        - `cleaning_staging_worldtempe.ipynb`
    - Transform to fact and dim tables:
        - `transform_fact_dims-v03c.ipynb`
- Create data quality checks for fact and dim tables.
    - For dim tables `quality_check_dims.ipynb` (not yet implemented)
    - For fact tables `quality_check_fact.ipynb` (not yet implemented)
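
The step that loads `fact_immi_weather` from the dim tables could be expressed as a SQL join along the lines of the sketch below. It uses Python's built-in `sqlite3` as a stand-in for Spark SQL so that it runs without a cluster. The table and column names follow the data model above, but the join keys (port code, then city name and month), the simplified column types, and the sample rows are assumptions, not the project's confirmed logic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_immi_traveller (immi_cicid TEXT, immi_datetime_iso TEXT,
    arr_port_code TEXT, travel_city TEXT, travel_month INTEGER, travel_year INTEGER);
CREATE TABLE dim_port (port_code TEXT, city_name TEXT, state TEXT);
CREATE TABLE dim_us_temperature (city_tempe_collect TEXT, avg_tempe REAL,
    avg_uncertain_tempe REAL, tempe_month INTEGER, tempe_year INTEGER);

-- Hypothetical sample rows, for illustration only.
INSERT INTO dim_immi_traveller VALUES ('1', '2016-04-01', 'CHI', 'CHICAGO', 4, 2016);
INSERT INTO dim_port VALUES ('CHI', 'CHICAGO', 'IL');
INSERT INTO dim_us_temperature VALUES ('CHICAGO', 9.5, 0.3, 4, 2012);
""")

# Assumed join keys: port code, then city name and month.
FACT_QUERY = """
SELECT t.immi_cicid        AS traveller_cicid,
       p.port_code         AS arr_airport_code,
       p.city_name         AS arr_city,
       w.avg_tempe,
       w.avg_uncertain_tempe,
       t.immi_datetime_iso AS arr_datetime_iso,
       t.travel_year       AS arr_year,
       t.travel_month      AS arr_month,
       p.state             AS arr_state_code
FROM dim_immi_traveller AS t
JOIN dim_port           AS p ON p.port_code = t.arr_port_code
JOIN dim_us_temperature AS w ON w.city_tempe_collect = p.city_name
                            AND w.tempe_month = t.travel_month
"""
fact_rows = conn.execute(FACT_QUERY).fetchall()
print(fact_rows)
```

The same `SELECT` should carry over to `spark.sql(...)` once the dim tables are registered as temporary views, although that has not been verified against the project notebooks.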

#### Step 4: Run Pipelines to Model the Data

##### 4.1 Create the data model
Run the pipeline steps in order:

- `Describe_and_Gather_Data-submit-03b.ipynb`
- `Extract_I94_SAS_Labels-v03d.ipynb`
- `cleaning_staging_i94immi.ipynb`
- `cleaning_staging_worldtempe.ipynb`
- `transform_fact_dims-v03c.ipynb`

##### 4.2 Data Quality Checks

Explain the data quality checks you'll perform to ensure the pipeline ran as expected. These could include:

* Integrity constraints on the relational database (e.g., unique key, data type, etc.)
* Unit tests for the scripts to ensure they are doing the right thing
* Source/Count checks to ensure completeness

Run Quality Checks
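
Since the quality-check notebooks are not written yet, the sketch below shows one possible shape for count, uniqueness, and null checks. It again uses `sqlite3` in place of Spark SQL; the function names, the failure messages, and the demo table contents are all hypothetical.

```python
import sqlite3


def count_rows(conn, table):
    """Number of rows in a table (the table name must be trusted input)."""
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]


def count_duplicate_keys(conn, table, key):
    """Number of distinct key values that appear more than once."""
    query = (
        f"SELECT COUNT(*) FROM (SELECT {key} FROM {table} "
        f"GROUP BY {key} HAVING COUNT(*) > 1)"
    )
    return conn.execute(query).fetchone()[0]


def count_nulls(conn, table, column):
    """Number of rows where the column is NULL."""
    query = f"SELECT COUNT(*) FROM {table} WHERE {column} IS NULL"
    return conn.execute(query).fetchone()[0]


def run_quality_checks(conn, table, key):
    """Return failure messages; an empty list means every check passed."""
    failures = []
    if count_rows(conn, table) == 0:
        failures.append(f"{table}: table is empty")
    if count_duplicate_keys(conn, table, key) > 0:
        failures.append(f"{table}: duplicate values in {key}")
    if count_nulls(conn, table, key) > 0:
        failures.append(f"{table}: NULL values in {key}")
    return failures


# Hypothetical demo table with one duplicated key.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim_port (port_code TEXT, city_name TEXT, state TEXT)")
conn.executemany(
    "INSERT INTO dim_port VALUES (?, ?, ?)",
    [("CHI", "CHICAGO", "IL"), ("CHI", "CHICAGO", "IL")],
)
print(run_quality_checks(conn, "dim_port", "port_code"))
```

Returning a list of messages instead of raising on the first failure lets a single run report every problem in a table.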

##### 4.3 Data dictionary

- The fact table `fact_immi_weather` should include the columns:
    - `traveller_cicid` as `varchar` datatype
    - `arr_airport_code` as `varchar` datatype
    - `arr_city` as `varchar` datatype
    - `avg_tempe` as `double` datatype
    - `avg_uncertain_tempe` as `double` datatype
    - `arr_datetime_iso` as `datetime` datatype
    - `arr_year` as `datetime` datatype
    - `arr_month` as `datetime` datatype
    - `arr_state_code` as `varchar` datatype

Dimension tables:

- `dim_immi_traveller` contains traveller information such as cicid, date, airport, and city.
    - `immi_cicid` as `varchar` datatype
    - `immi_datetime_iso` as `datetime` datatype
    - `arr_port_code` as `varchar` datatype
    - `travel_city` as `varchar` datatype
    - `travel_month` as `datetime` datatype
    - `travel_year` as `datetime` datatype

- `dim_us_temperature` contains temperature records of US cities, limited to the scope of the immigration data.
    - `city_tempe_collect` as `datetime` datatype
    - `avg_tempe` as `double` datatype
    - `avg_uncertain_tempe` as `double` datatype
    - `tempe_month` as `datetime` datatype
    - `tempe_year` as `datetime` datatype

- `dim_port` contains the list of airports that allow immigration entry.
    - `port_code` as `varchar` datatype
    - `city_name` as `varchar` datatype
    - `state` as `varchar` datatype

- `dim_datetime` contains date information like year, month, day, week of year and weekday.
    - `arrival_year` as `datetime` datatype
    - `arrival_month` as `datetime` datatype
    - `arrival_date` as `datetime` datatype
    - `dim_datetime` is created by appending datetimes from the staging table `i94immi_table`. In this project we use **April 2016** only.

#### Step 5: Complete Project Write Up

- Clearly state the rationale for the choice of tools and technologies for the project.
    * Spark was chosen since it can easily handle multiple file formats (including SAS) containing large amounts of data. 
    * Spark SQL was chosen to load the large input files into dataframes and manipulate them with standard SQL join operations to form additional tables.

- Propose how often the data should be updated and why.
    * The data should be updated monthly in conjunction with the current raw file format.

- Write a description of how you would approach the problem differently under the following scenarios:
    * The data was increased by 100x.
        * We would no longer process the data as a single batch job. We could do incremental updates using a tool such as Apache Hudi (originally developed at Uber). We could also move Spark to cluster mode using a cluster manager such as YARN.
    * The data populates a dashboard that must be updated on a daily basis by 7am every day.
        * To meet this SLA, we could use a scheduling tool such as Airflow to run the ETL pipeline overnight.
    * The database needed to be accessed by 100+ people.
        * We could publish the Parquet files to HDFS and give read access to the users who need it.
        * If the users want to run SQL queries on the raw data, we could expose the data on HDFS through a SQL engine such as Impala.