### Data Engineering Capstone Project

#### Project Summary

##### Introduction

A core responsibility of The National Travel and Tourism Office (NTTO) is to collect, analyze, and disseminate international travel and tourism statistics. 

NTTO's Board of Managers is charged with managing, improving, and expanding the system to fully account for and report the impact of travel and tourism in the United States. The analysis results support forecasting, operations, and decision-making. The office also creates a positive climate for growth in travel and tourism by reducing institutional barriers to tourism, administers joint marketing efforts, provides official travel and tourism statistics, and coordinates efforts across federal agencies.

##### Project Description

In this project, the following source data is used for data modeling:
* **I94 Immigration**: The I94 immigration source data is available on local disk in sas7bdat format. It comes from the US National Tourism and Trade Office. The data dictionary is also included in this project for reference. The original source of the data is https://travel.trade.gov/research/reports/i94/historical/2016.html. This data is already uploaded to the workspace.

* **World Temperature Data**: This dataset came from Kaggle. This data is already uploaded to the workspace.

* **Airport Code**: This is a simple table of airport codes. The source of this data is https://datahub.io/core/airport-codes#data. It is recommended for educational use only, not for commercial or any other purpose. This data is already uploaded to the workspace.

* **U.S. City Demographic Data**: This data comes from OpenDataSoft: https://public.opendatasoft.com/explore/dataset/us-cities-demographics/export/

* Other validation text files such as *I94Addr.txt*, *I94CIT_I94RES.txt*, *I94Mode.txt*, *I94Port.txt*, and *I94Visa.txt*, extracted from *I94_SAS_Labels_Descriptions.SAS*.

##### The project follows these steps:
* Step 1: Scope the Project and Gather Data
* Step 2: Explore and Assess the Data
* Step 3: Define the Data Model
* Step 4: Run ETL to Model the Data
* Step 5: Complete Project Write Up

#### Step 1: Scope the Project and Gather Data

##### Scope 

To decide the project scope and the technical approach, we assess the following datasets:
* I94 Immigration
* World Temperature Data
* U.S. City Demographic Data
* Airport Code
* Other reference *I94_SAS_Labels_Descriptions.SAS*.

Tools used:
- Spark
- Python and its modules
- Pandas
- An AWS Redshift cluster is considered as an option for running one or more ETL steps.

##### Describe and Gather Data

Assess each dataset by looking at:
- Amount of records and data size.
- Schema, data column structure.
- Sample records and attributes. 

Finally, choose the datasets that will be used for data modeling.
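
The assessment points above can be sketched with pandas. This is a minimal illustration, not the project notebook: the helper name `assess_dataset` and the two inline sample rows are assumptions standing in for a real source file.

```python
import io

import pandas as pd


def assess_dataset(df: pd.DataFrame, sample_rows: int = 3) -> dict:
    """Summarise a dataset: record count, data size, schema and sample records."""
    return {
        "rows": len(df),
        "columns": df.shape[1],
        "memory_bytes": int(df.memory_usage(deep=True).sum()),
        "schema": {name: str(dtype) for name, dtype in df.dtypes.items()},
        "sample": df.head(sample_rows).to_dict(orient="records"),
    }


# Hypothetical stand-in for a real file such as the airport codes CSV.
SAMPLE_CSV = """ident,type,name,iso_country
00A,heliport,Total Rf Heliport,US
00AK,small_airport,Lowell Field,US
"""

airports = pd.read_csv(io.StringIO(SAMPLE_CSV))
report = assess_dataset(airports)
print(report["rows"], report["columns"], report["memory_bytes"])
print(report["schema"])
print(report["sample"])
```

For the sas7bdat files the same three questions apply, but the frame would be loaded through Spark rather than `pd.read_csv`.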

Datasets to be assessed:
- `I94 Immigration`
- `World Temperature Data`
- `U.S. City Demographic Data`
- `Airport Code`
- `I94_SAS_Labels_Descriptions.SAS`

Notebook for reviewing the output of the dataset assessment:
```code reference
    Describe_and_Gather_Data-submit-02.ipynb
```

Our expectations are to find:
- The relation between the volume of immigration travel and the weather in US regions over time.
- The relation between the volume of immigration travel and US airports.
- The relation between the volume of immigration travel and city demographics.
- Statistics on immigration types.

#### Step 2: Explore and Assess the Data

##### Explore the Data

Data quality and validation issues identified during the gathering step:

- I94 Immigration dataset:
    - `cicid`: Identifier of the immigration record. Perform uniqueness verification. Review the data type. Check for NULL values.
    - `i94yr | i94mon`: Year and month of the immigration date. Review the data type. Check for NULL or NaN values.
    - `i94cit | i94res`: Country of citizenship and country of residence. Validate these values against `I94_SAS_Labels_Descriptions.SAS`. Check for NULL or NaN values.
    - `i94port`: Code of the destination port of entry for a specific US city. Validate these values against **I94_SAS_Labels_Descriptions.SAS**. Some airport types do not allow immigration entry. Check for NULL or NaN values.
    - `arrdate | depdate`: Arrival date in the USA and departure date from the USA. Review the data type. Check for NULL values. Check the dependency between arrival date and departure date.
    - `i94mode`: Code for the mode of transportation. Validate these values against **I94_SAS_Labels_Descriptions.SAS**. Some modes of transportation do not pass through an airport.
    - `i94addr`: US state code. Validate these values against **I94_SAS_Labels_Descriptions.SAS**.
    - `i94bir`: Age of the respondent in years. Review the data type. Check for NULL or NaN values. We should limit the scope to passengers whose age suggests they will keep traveling through airports for many years to come, which also makes the analysis more useful.
    - `i94visa`: Code for the visa type, corresponding to the reason for the visit. Validate these values against **I94_SAS_Labels_Descriptions.SAS**.
    - `count`, `tadfile`, `visapost`, `occup`, `entdepa`, `entdepd`, `entdepu`, `matflag`, `dtaddto`, `insnum`: Not needed. Do not use these columns.
    - `biryear`: Immigrant's year of birth. Review the data type. Check for NULL or NaN values. We should limit the scope to passengers whose birth year suggests they will keep traveling through airports for many years to come, which also makes the analysis more useful.
    - `gender`: Immigrant's sex. Some records contain uncommon values; these are not needed.
    - `airline`: Airline used to arrive in the U.S. Review the data type. Check for NULL or NaN values. Check whether it forms part of a combination key.
    - `admnum`: Admission number. Review the data type. Check for NULL or NaN values.
    - `fltno`: Flight number of the airline used to arrive in the U.S. Review the data type. Check for NULL or NaN values.
    - `visatype`: Class of admission legally admitting the non-immigrant to temporarily stay in the U.S. Validate these values against **I94_SAS_Labels_Descriptions.SAS**.


- World Temperature dataset
    - `dt`: The date of the temperature record. The data type must be converted to datetime. Perform uniqueness verification.
    - `AverageTemperature | AverageTemperatureUncertainty`: Recorded temperature values. The column names are in title case and must be lowercased.
    - `City | Country`: City and country where the temperature was recorded. The column names are in title case and must be lowercased.
    - `Latitude | Longitude`: Geographical location as latitude and longitude. Helpful for a heatmap, but these columns are not needed in our project.
    - Column names mix upper case, lower case, and whitespace. Convert them to lower case and replace whitespace with '_'.

- Airport Code dataset
    - `ident`: The airport identification code. Perform uniqueness verification. Check the code naming format. Review the data type. Check for NULL values.
    - `type`: The airport type. Some airport types do not allow immigration travel. Keep only the airports that allow immigration.
    - `name`: The airport name. Map this column to city and country.
    - `elevation_ft | continent | gps_code | iata_code | local_code | coordinates`: Not needed for our project.
    - `iso_country | iso_region | municipality`: The country, state, and city or town of the airport. Keep only the combinations that belong to the US.

- U.S. City Demographic dataset
    - `city | state | state_code`: Identifies the population of a city in a state. Check the uniqueness of this combination, which also serves as the primary key.
    - `male_population | female_population | total_population`: Main population statistics: male, female, and total.
    - `median_age | number_of_veterans | foreign_born | average_household_size | race | count`: Not needed for our project.

- Other reference *I94_SAS_Labels_Descriptions.SAS*.
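
The I94 checks listed above (NULL counts, uniqueness, validation against the label file, date type review, and the arrival/departure dependency) can be sketched as follows. The sample rows and the `valid_ports` set are hypothetical; in the project the lookup values would come from the parsed *I94_SAS_Labels_Descriptions.SAS*. The date conversion relies on SAS numeric dates counting days from 1960-01-01.

```python
import pandas as pd

# Hypothetical sample rows; the real data comes from the sas7bdat files.
i94 = pd.DataFrame({
    "cicid": [1.0, 2.0, 2.0, 4.0],
    "i94port": ["NYC", "LOS", "XXX", None],
    "arrdate": [20545.0, 20546.0, 20547.0, 20548.0],
    "depdate": [20550.0, None, 20540.0, 20560.0],
})
valid_ports = {"NYC", "LOS"}  # assumed lookup, normally read from i94_port.csv

# 1. NULL / NaN count per column.
null_counts = i94.isna().sum().to_dict()

# 2. Uniqueness of the candidate key.
duplicate_ids = i94.loc[i94["cicid"].duplicated(keep=False), "cicid"].tolist()

# 3. Non-null codes that are missing from the lookup table.
is_unknown = i94["i94port"].notna() & ~i94["i94port"].isin(valid_ports)
invalid_ports = i94.loc[is_unknown, "i94port"].tolist()

# 4. SAS numeric dates count days since 1960-01-01; NaN becomes NaT.
sas_epoch = pd.Timestamp("1960-01-01")
i94["arr_date"] = sas_epoch + pd.to_timedelta(i94["arrdate"], unit="D")
i94["dep_date"] = sas_epoch + pd.to_timedelta(i94["depdate"], unit="D")

# 5. A departure date must not precede the arrival date.
bad_order = i94.loc[i94["dep_date"] < i94["arr_date"], "cicid"].tolist()

print(null_counts)
print(duplicate_ids, invalid_ports, bad_order)
```

Rows with a missing departure date are not flagged by the last check, since a comparison with `NaT` evaluates to false.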

##### Cleaning Steps

- Parse *I94_SAS_Labels_Descriptions.SAS* to build the validation tables used in the staging steps
    ```code reference
        Extract_I94_SAS_Labels_Descriptions_SAS-submit-02.ipynb
    ```
    - The outputs of this step:
        - `i94_addr.csv`
        - `i94_cit_and_res.csv`
        - `i94_mode.csv`
        - `i94_port.csv`
        - `i94_visa.csv`

- Cleaning I94 Immigration dataset

    ```code reference
        unitTest-cleaning_staging_i94.ipynb
    ```
    The outputs of this step: `i94immi_df_clean.csv`

- Cleaning World Temperature dataset

    ```code reference
        unitTest-cleaning_staging_world_tempe.ipynb
    ```
    The outputs of this step: `worldtempe_df_clean.csv`

- Cleaning U.S. City Demographic dataset

    ```code reference
        unitTest-cleaning_staging_usdemo.ipynb
    ```
    The outputs of this step: `citydemo_df_clean.csv`

- Cleaning Airport Code dataset

    ```code reference
        unitTest-cleaning_staging_airport.ipynb
    ```
    The outputs of this step: `airports_df_clean.csv`
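
As an illustration of the first cleaning step, the sketch below extracts code/label pairs from a `value ...;` block of the label file using only the standard library. The `SAMPLE` text is an assumed excerpt of the file layout, and `parse_labels` is a hypothetical helper, not the code in the referenced notebook.

```python
import re

# Assumed excerpt of the label file layout (not the full file).
SAMPLE = """
value i94model
    1 = 'Air'
    2 = 'Sea'
    3 = 'Land'
    9 = 'Not reported' ;

value $i94prtl
   'ALC'   =   'ALCAN, AK             '
   'ANC'   =   'ANCHORAGE, AK         '
;
"""

# Matches lines such as  1 = 'Air'  or  'ALC' = 'ALCAN, AK   '.
PAIR = re.compile(r"""^\s*'?([^'=\s]+)'?\s*=\s*'([^']*)'""")


def parse_labels(text: str, block_name: str) -> dict:
    """Return {code: label} for the `value <block_name>` block."""
    labels = {}
    inside = False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.lower().startswith("value "):
            # Block names may carry a leading '$' for character formats.
            name = stripped.split()[1].lstrip("$").lower()
            inside = name == block_name.lower()
            continue
        if not inside:
            continue
        match = PAIR.match(line)
        if match:
            labels[match.group(1)] = match.group(2).strip()
        if stripped.endswith(";"):
            inside = False  # a semicolon closes the block
    return labels


print(parse_labels(SAMPLE, "i94model"))
print(parse_labels(SAMPLE, "i94prtl"))
```

Each resulting dictionary could then be written out as one of the CSV files listed above, for example with the `csv` module.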

#### Step 3: Define the Data Model

##### 3.1 Conceptual Data Model

Star schema diagram:
- Start_schema_diagram here

Fact table:
- As mentioned in the expectations, we want to find the relations between US immigration and the weather, immigration traffic, and the arrival place (city).
- The fact table `fact_us_immigration` should include these columns:
    - `cicid`
    - `cit_country`
    - `res_country`
    - `year_olds`
    - `visa_type`
    - `visa_type_des`
    - `arr_city`
    - `arr_state`
    - `arr_date`
    - `dep_date`

Dimension tables:

- `dim_us_airports` contains information such as US port of entry code, city, state code and state name.
    - `airport_ident` 
    - `airport_name`
    - `airport_type`
    - `state`
    - `municipality`

- `dim_us_teperature` contains temperature records of US cities, collected for the scope of the immigration data.
    - `date`
    - `city`
    - `avg_tempe`
    - `avg_uncertain_tempe`

- `dim_demographics` contains median age and population information about US port cities.
    - `city`
    - `state`
    - `median`
    - `male`
    - `female`
    - `total_population`

- `dim_datetime` contains date information like year, month, day, week of year and weekday.
    - `year`
    - `month`
    - `week`
    - `weekday`
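
As a sketch of how `dim_datetime` could be derived from the arrival dates, the example below converts SAS numeric dates (days since 1960-01-01) and extracts the listed attributes. The helper names and the extra `date` key are assumptions added for illustration; week and weekday follow the ISO calendar (Monday = 1).

```python
from datetime import date, timedelta

SAS_EPOCH = date(1960, 1, 1)  # SAS numeric dates count days from this day


def sas_to_date(sas_days: float) -> date:
    """Convert a SAS numeric date to a Python date."""
    return SAS_EPOCH + timedelta(days=int(sas_days))


def build_dim_datetime(sas_dates) -> list:
    """Return one row per distinct date, sorted chronologically."""
    rows = {}
    for value in sas_dates:
        if value is None:  # skip missing dates
            continue
        day = sas_to_date(value)
        iso_year, iso_week, iso_weekday = day.isocalendar()
        rows[day] = {
            "date": day.isoformat(),  # assumed key column, not in the list above
            "year": day.year,
            "month": day.month,
            "week": iso_week,
            "weekday": iso_weekday,
        }
    return [rows[day] for day in sorted(rows)]


dim_datetime = build_dim_datetime([20546.0, 20545.0, 20545.0, None])
for row in dim_datetime:
    print(row)
```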

##### 3.2 Mapping Out Data Pipelines

The pipeline steps are described below:
- Load the raw datasets from source into Spark dataframes: df_spark_i94, df_spark_dem and df_spark_temp for one month.
- Clean each Spark dataframe as described in *Step 2 Cleaning Steps* and write each cleaned dataframe to parquet as a staging table: stage_i94_immigration, stage_cities_demographics and stage_uscities_temperatures.
- Create and load dimension tables: dim_us_ports, dim_visa, dim_countries, dim_travelmode and dim_demographics.
- Create and load the fact table fact_i94_visits by joining stage_i94_immigration and stage_uscities_temperatures.
- Create and load the dimension table dim_date.

#### Step 4: Run Pipelines to Model the Data

##### 4.1 Create the data model

Build the data pipelines to create the data model.

##### 4.2 Data Quality Checks

Explain the data quality checks you'll perform to ensure the pipeline ran as expected. These could include:

* Integrity constraints on the relational database (e.g., unique key, data type, etc.)
* Unit tests for the scripts to ensure they are doing the right thing
* Source/Count checks to ensure completeness

Run Quality Checks
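
A minimal sketch of such checks in plain Python is shown below. It covers a non-empty check, a key integrity check (no NULLs, no duplicates), and a source/count check. The function name `run_quality_checks` and the sample rows are hypothetical; in the pipeline the rows would come from the loaded tables.

```python
def run_quality_checks(table_name, rows, key_columns, expected_count=None):
    """Return a list of failure messages; an empty list means all checks passed."""
    failures = []

    # Completeness: the table must contain data.
    if not rows:
        failures.append(f"{table_name}: table is empty")

    # Integrity: key columns must be non-null and unique.
    keys = [tuple(row.get(col) for col in key_columns) for row in rows]
    if any(value is None for key in keys for value in key):
        failures.append(f"{table_name}: NULL value in key {key_columns}")
    if len(set(keys)) != len(keys):
        failures.append(f"{table_name}: duplicate key {key_columns}")

    # Source/count check: loaded rows must match the expected source count.
    if expected_count is not None and len(rows) != expected_count:
        failures.append(
            f"{table_name}: expected {expected_count} rows, found {len(rows)}"
        )
    return failures


good_rows = [{"city": "A", "state": "X"}, {"city": "B", "state": "X"}]
bad_rows = [
    {"city": "A", "state": "X"},
    {"city": "A", "state": "X"},
    {"city": None, "state": "Y"},
]

print(run_quality_checks("dim_demographics", good_rows, ["city", "state"], 2))
print(run_quality_checks("dim_demographics", bad_rows, ["city", "state"], 2))
```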

##### 4.3 Data dictionary

The first dimension table will contain events from the I94 immigration data. The columns below will be extracted from the immigration dataframe:
* i94yr = 4 digit year
* i94mon = numeric month
* i94cit = 3 digit code of origin country
* i94port = 3 character code of destination city
* arrdate = arrival date
* i94mode = 1 digit travel code
* depdate = departure date
* i94visa = reason for immigration

The second dimension table will contain city temperature data. The columns below will be extracted from the temperature dataframe:
* i94port = 3 character code of destination city (mapped from immigration data during cleanup step)
* AverageTemperature = average temperature
* City = city name
* Country = country name
* Latitude = latitude
* Longitude = longitude

The fact table will contain information from the I94 immigration data joined with the city temperature data on i94port:
* i94yr = 4 digit year
* i94mon = numeric month
* i94cit = 3 digit code of origin country
* i94port = 3 character code of destination city
* arrdate = arrival date
* i94mode = 1 digit travel code
* depdate = departure date
* i94visa = reason for immigration
* AverageTemperature = average temperature of destination city
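
The join on `i94port` can be illustrated with pandas on a few hypothetical rows. The inner join (which drops ports without temperature data) and the `many_to_one` validation are assumptions of this sketch; the project itself performs the join in Spark.

```python
import pandas as pd

# Hypothetical cleaned immigration rows.
immigration = pd.DataFrame({
    "i94yr": [2016, 2016, 2016],
    "i94mon": [4, 4, 4],
    "i94port": ["NYC", "LOS", "XXX"],
    "i94visa": [2, 1, 2],
})

# Hypothetical temperature rows, one per port after the cleanup mapping.
temperature = pd.DataFrame({
    "i94port": ["NYC", "LOS"],
    "AverageTemperature": [10.5, 17.2],
})

# validate="many_to_one" raises if a port appears twice in the temperature
# table, which would otherwise silently duplicate fact rows.
fact = immigration.merge(
    temperature, on="i94port", how="inner", validate="many_to_one"
)
print(fact)
```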

#### Step 5: Complete Project Write Up

- Clearly state the rationale for the choice of tools and technologies for the project.
    * Spark was chosen since it can easily handle multiple file formats (including SAS) containing large amounts of data. 
    * Spark SQL was chosen to load the large input files into dataframes and to combine them via standard SQL join operations to form additional tables.

- Propose how often the data should be updated and why.
    * The data should be updated monthly in conjunction with the current raw file format.

- Write a description of how you would approach the problem differently under the following scenarios:
    * The data was increased by 100x.
        * If the data was increased by 100x, we would no longer process the data as a single batch job. We could perhaps do incremental updates using a tool such as Apache Hudi (originally developed at Uber). We could also consider moving Spark to cluster mode using a cluster manager such as YARN.
    * The data populates a dashboard that must be updated on a daily basis by 7am every day.
        * If the data needs to populate a dashboard daily to meet an SLA, we could use a scheduling tool such as Airflow to run the ETL pipeline overnight.
    * The database needed to be accessed by 100+ people.
        * If the database needed to be accessed by 100+ people, we could consider publishing the parquet files to HDFS and giving read access to the users who need it.
        * If the users want to run SQL queries on the raw data, we could consider exposing the data on HDFS through a SQL engine such as Impala.