# Data Engineering Project 5
### Data Engineering Capstone Project

#### Project Summary

##### Introduction

A core responsibility of the National Travel and Tourism Office (NTTO) is to collect, analyze, and disseminate international travel and tourism statistics.

NTTO's Board of Managers is charged with managing, improving, and expanding the system to fully account for and report the impact of travel and tourism in the United States. The analysis results support forecasting, operations, and decision-making. The office creates a positive climate for growth in travel and tourism by reducing institutional barriers to tourism, administers joint marketing efforts, provides official travel and tourism statistics, and coordinates efforts across federal agencies.

##### Project Description

This project uses the following source datasets for data modeling:
* **I94 Immigration**: The I94 immigration source data is available on local disk in sas7bdat format. This data comes from the US National Tourism and Trade Office. The data dictionary is also included in this project for reference. The original source of the data is https://travel.trade.gov/research/reports/i94/historical/2016.html. This data is already uploaded to the workspace.

* **World Temperature Data**: This dataset came from Kaggle. This data is already uploaded to the workspace.

* **Airport Code**: This is a simple table of airport codes. The source of this data is https://datahub.io/core/airport-codes#data. It is recommended for educational use only, not for commercial or any other purpose. This data is already uploaded to the workspace.

* Other text files such as *I94Addr.txt*, *I94CIT_I94RES.txt*, *I94Mode.txt*, *I94Port.txt* and *I94Visa.txt* are used to enrich the immigration data for better analysis. These files are created from the *I94_SAS_Labels_Descriptions.SAS* file, which describes every field in the immigration data.

**The project follows these steps**:
* Step 1: Scope the Project and Gather Data
* Step 2: Explore and Assess the Data
* Step 3: Define the Data Model
* Step 4: Run ETL to Model the Data
* Step 5: Complete Project Write Up

#### Step 1: Scope the Project and Gather Data

##### Scope 

This project processes the Immigration Data, Temperature Data, and Airport Code Table data sources to create a star schema optimized for queries on international travel and tourism statistics. The schema includes a fact table and dimension tables.

Spark and Python modules (pyspark, os, pandas) are used in this project for gathering, exploring, cleaning, and modeling the data, and for building the ETL pipeline on the local system. An AWS Redshift cluster is considered as an option for running one or more ETL steps.

- Fact Table
    <tbd>
    
- Dimension Tables
    <tbd>

##### Describe and Gather Data

This section gives an overview of the data used for data modeling. The description of each data source includes its schema, sample records, the number of rows, and the number of data files (if needed).

In [None]:
# Do all imports and installs here
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
import pandas as pd
import re
import os

In [55]:
# Create SparkSession

spark = SparkSession.builder.\
config("spark.jars.repositories", "https://repos.spark-packages.org/").\
config("spark.jars.packages", "saurfang:spark-sas7bdat:2.0.0-s_2.11").\
enableHiveSupport().getOrCreate()

# df_spark = spark.read.format('com.github.saurfang.sas.spark').load('../../data/18-83510-I94-Data-2016/i94_apr16_sub.sas7bdat')

# write to parquet
# df_spark.write.parquet("sas_data")
# df_spark=spark.read.parquet("sas_data")

In [None]:
# Define a helper that prints descriptive information about a data source
def gather_datasource(input_datasource):
    '''
    Print the schema, sample records and row count of a data source.
    
    Parameters
    ----------
    input_datasource : str
        path to the data source file (.csv or .sas7bdat)
    
    Returns
    -------
    pyspark.sql.DataFrame or None
        the loaded DataFrame, or None if the file extension is not supported
    '''

    # Split the base name into the file name and its extension
    file_name, file_ext = os.path.splitext(os.path.basename(input_datasource))

    print("Datasource File Name: ", file_name)
    print("Datasource File Extension: ", file_ext)
    
    # Load the data source with the reader that matches its file extension
    if file_ext == '.csv':
        df = spark.read.csv(input_datasource, header=True)
    elif file_ext == '.sas7bdat':
        df = spark.read.format('com.github.saurfang.sas.spark').load(input_datasource)
    else:
        # Return early so that df is never referenced before assignment
        print("Datasource file extension {} is not supported yet.".format(file_ext))
        return None

    print('The schema: ')
    df.printSchema()
    print()
    print('Sample records: ')
    df.show(5)
    print()
    print('Total rows: ', df.count())
    print()

    return df

In [None]:
# Define a helper that lists and counts the entries in a data source's directory
def count_datafile(input_datasource):
    path, filename = os.path.split(input_datasource)
    # A bare file name has an empty directory part; fall back to the current directory
    path = path or '.'
    file_list = os.listdir(path)
    count_file = 0
    print("Files and directories in '", path, "' :")
    for file_name in file_list:
        print(file_name)
        count_file += 1

    print()
    print('Total data files:', count_file)

    return None
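
`count_datafile` counts every entry that `os.listdir` returns, including sub-directories. If only data files should be counted, one option is to filter the entries. The following is a minimal sketch, not part of the notebook: the `count_files` helper and the temporary directory contents are hypothetical.

```python
import os
import tempfile

def count_files(directory, extension=None):
    """Count regular files in directory, optionally restricted to one extension."""
    count = 0
    for entry in os.scandir(directory):
        if not entry.is_file():
            continue  # skip sub-directories
        if extension is not None and not entry.name.endswith(extension):
            continue  # skip files with a different extension
        count += 1
    return count

# Build a throwaway directory with two .csv files, one .txt file and a sub-directory
with tempfile.TemporaryDirectory() as tmp_dir:
    for name in ("a.csv", "b.csv", "notes.txt"):
        open(os.path.join(tmp_dir, name), "w").close()
    os.mkdir(os.path.join(tmp_dir, "nested"))
    total_files = count_files(tmp_dir)
    csv_files = count_files(tmp_dir, extension=".csv")

print("All files:", total_files)
print("CSV files:", csv_files)
```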

#### Step 2: Explore and Assess the Data

##### Explore the Data

Data quality issues:

- Missing or empty or wrong values
- Duplicate data
- Wrong format values
- NULL values
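
Issues like these can be counted before any cleaning is done. The following is a minimal sketch in plain Python, not the project's actual checks: the `profile_rows` helper, the `cicid` key column, and the sample rows are hypothetical and only illustrate the idea.

```python
from collections import Counter

def profile_rows(rows, key_column):
    """Count missing values per column and duplicated keys in a list of dict rows."""
    missing = Counter()
    keys = Counter()
    for row in rows:
        for column, value in row.items():
            # Treat None and empty or whitespace-only strings as missing
            if value is None or str(value).strip() == "":
                missing[column] += 1
        keys[row.get(key_column)] += 1
    duplicates = {key: n for key, n in keys.items() if n > 1}
    return dict(missing), duplicates

# Hypothetical sample rows, made up for illustration only
sample_rows = [
    {"cicid": 1, "i94port": "NYC", "i94mode": "1"},
    {"cicid": 2, "i94port": "", "i94mode": "1"},
    {"cicid": 2, "i94port": "LAX", "i94mode": None},
]

missing, duplicates = profile_rows(sample_rows, key_column="cicid")
print("Missing values per column:", missing)
print("Duplicated keys:", duplicates)
```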

##### Cleaning Steps
Document steps necessary to clean the data

In [None]:
# Testing from here
from pyspark.sql import SparkSession

spark_clean = SparkSession.builder.\
    config("spark.jars.repositories", "https://repos.spark-packages.org/").\
    config("spark.jars.packages", "saurfang:spark-sas7bdat:2.0.0-s_2.11").\
    appName('CleanSteps').getOrCreate()

In [None]:
# Define a helper that splits a data file into CSV chunks
def split_data(input_datasource, parent_dir):
    # Split the base name into the file name and its extension
    file_name, file_ext = os.path.splitext(os.path.basename(input_datasource))

    chunking_name = file_name + '_part'
    chunk_size = 100000
    numbering_batch = 1

    if file_ext not in ('.sas7bdat', '.csv'):
        print('Datasource type {} is not supported yet'.format(file_ext))
        return 0

    # The chunk directory is named after the file extension (e.g. '.csv')
    full_dir = os.path.join(parent_dir, file_ext)
    if os.path.exists(full_dir):
        # Raise instead of calling exit(1), which would stop the notebook kernel
        raise FileExistsError("Directory " + full_dir + " already exists. Clean existing directory and run again.")
    os.mkdir(full_dir)
    print("Directory ", full_dir, " Created ")

    # Choose the pandas reader that matches the file extension
    if file_ext == '.sas7bdat':
        batches = pd.read_sas(input_datasource, encoding="ISO-8859-1", chunksize=chunk_size)
    else:
        batches = pd.read_csv(input_datasource, chunksize=chunk_size)

    # Write each chunk to its own numbered CSV file
    for batch in batches:
        batch.to_csv(os.path.join(full_dir, chunking_name + str(numbering_batch) + '.csv'), index=False)
        numbering_batch += 1

    print("Separated to batches: ", numbering_batch - 1)
    return (numbering_batch - 1)
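
The chunking idea behind `split_data` can be shown without pandas. The following is a minimal sketch, assuming an in-memory iterable of rows; `iter_chunks` is a hypothetical helper and is not used by the notebook.

```python
from itertools import islice

def iter_chunks(rows, chunk_size):
    """Yield successive lists of at most chunk_size items from an iterable."""
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk

# 250 dummy rows split into chunks of 100, numbered from 1 like the part files
chunks = list(iter_chunks(range(250), chunk_size=100))
for number, chunk in enumerate(chunks, start=1):
    print("part{}: {} rows".format(number, len(chunk)))
```

The last chunk is simply shorter than the others, which mirrors how a chunked reader handles a row count that is not a multiple of the chunk size.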

In [None]:
# Return the directory that holds the chunks created by split_data
def get_chunk_dir(parent_dir, input_datasource):
    # The chunk directory is named after the source file extension (e.g. '.csv')
    file_ext = os.path.splitext(os.path.basename(input_datasource))[1]

    return os.path.join(parent_dir, file_ext)

In [62]:
# Build a lookup of valid port codes from i94port.txt
reg_exp_ops = re.compile(r'\'(.*)\'.*\'(.*)\'')
valid_i94port = {}
list_i94port_valid_code = '/home/workspace/i94port.txt'
with open(list_i94port_valid_code) as f:
    for port_name in f:
        matching_port = reg_exp_ops.search(port_name)
        # Skip lines that do not contain a quoted code and a quoted description
        if matching_port is None:
            continue
        valid_i94port[matching_port[1]] = [matching_port[2].strip()]
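
To see what the regular expression extracts, it can be run on a few lines in isolation. The following is a minimal sketch: the sample lines are hypothetical and assume that each line of `i94port.txt` has the layout `'code' = 'description'`, and the sketch stores each description as a plain stripped string.

```python
import re

# Same pattern as in the cell above: two single-quoted fields per line
reg_exp_ops = re.compile(r'\'(.*)\'.*\'(.*)\'')

# Hypothetical lines, assumed to follow the 'code' = 'description' layout
sample_lines = [
    "   'ALC'\t=\t'ALCAN, AK             '\n",
    "   'NYC'\t=\t'NEW YORK, NY          '\n",
    "this line has no quoted fields\n",
]

sample_ports = {}
for line in sample_lines:
    match = reg_exp_ops.search(line)
    if match is None:
        # Lines without two quoted fields are skipped
        continue
    code, description = match.group(1), match.group(2).strip()
    sample_ports[code] = description

print(sample_ports)
```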

In [None]:
# Keep only the rows of a .csv chunk whose i94port is a valid port code
def clean_i94_csv_data(input_datasource):
    df_immi = spark_clean.read.csv(input_datasource,inferSchema=True,header=True)
    df_immi = df_immi.filter(df_immi.i94port.isin(list(valid_i94port.keys())))

    return df_immi

In [None]:
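# Run clean_i94_csv_data on one CSV chunk of the immigration data.
# The function filters on the i94port column, so the chunk must come from the I94 file;
# the temperature chunks have no such column.
fname = '../../data/18-83510-I94-Data-2016/i94_apr16_sub.sas7bdat'
home_dir = '/home/workspace/enrich_data/'
immi_chunk_dir = get_chunk_dir(home_dir, fname)
chunk_name = os.path.join(immi_chunk_dir, 'i94_apr16_sub_part1.csv')

df_clean_csv = clean_i94_csv_data(chunk_name)
df_clean_csv.select(df_clean_csv.i94port).show(n=50)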

In [None]:
# Keep only the rows of a .sas7bdat file whose i94port is a valid port code
def clean_i94_sas7bdat_data(input_datasource):
    df_immigration = spark_clean.read.format('com.github.saurfang.sas.spark').load(input_datasource)
    df_immigration = df_immigration.filter(df_immigration.i94port.isin(list(valid_i94port.keys())))

    return df_immigration

In [None]:
# Run clean_i94_sas7bdat_data function
fname = '../../data/18-83510-I94-Data-2016/i94_apr16_sub.sas7bdat'
home_dir = '/home/workspace/enrich_data/'
immi_chunk_dir = get_chunk_dir(home_dir, fname)
chunk_name = os.path.join(immi_chunk_dir, 'i94_apr16_sub_part1.csv')

df_clean_sas7bdat = clean_i94_sas7bdat_data(fname)
df_clean_sas7bdat.select(df_clean_sas7bdat.i94port).show(n=50)

In [None]:
# Draft note - Do not run this block

fname = '../../data/18-83510-I94-Data-2016/i94_apr16_sub.sas7bdat'
fname_dir = '../../data/18-83510-I94-Data-2016'

fname = '../../data2/GlobalLandTemperaturesByCity.csv'
fname_dir = '../../data2'

fname = './airport-codes_csv.csv'
fname_dir = '.'

df = spark.read.format('com.github.saurfang.sas.spark').load(fname)
df.printSchema()
df.count()

fname = '../../data2/GlobalLandTemperaturesByCity.csv'
df = pd.read_csv(fname, sep=',')
df.head()

# input data source airport-codes_csv.csv
airport_input_data = "./airport-codes_csv.csv"
# Parse csv file
airport_df = pd.read_csv(airport_input_data, sep=',')
# Verify airport-codes_csv.csv parsed as dataframe
airport_df.head()

#### Step 3: Define the Data Model

##### 3.1 Conceptual Data Model

(Writing a little of data modeling here)

Result table: I94 immigration data joined with the city temperature data on i94port. Columns:

* i94yr = 4 digit year
* i94mon = numeric month
* i94cit = 3 digit code of the country of citizenship
* i94port = 3 character code of destination USA city
* arrdate = arrival date in the USA
* i94mode = 1 digit travel code
* depdate = departure date from the USA
* i94visa = reason for immigration
* AverageTemperature = average temperature of destination city
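
One way to picture the result table is as a lookup join. The following is a minimal sketch in plain Python, assuming the temperature data has already been reduced to one average temperature per port code. The `port_temperature` mapping, the sample rows, and the `sas_date_to_iso` helper are hypothetical. The date conversion relies on SAS storing dates as days since 1960-01-01 and assumes that `arrdate` and `depdate` are stored that way.

```python
from datetime import date, timedelta

SAS_EPOCH = date(1960, 1, 1)

def sas_date_to_iso(days):
    """Convert a SAS numeric date (days since 1960-01-01) to an ISO date string."""
    if days is None:
        return None
    return (SAS_EPOCH + timedelta(days=int(days))).isoformat()

# Hypothetical inputs, made up for illustration only
immigration_rows = [
    {"i94yr": 2016, "i94mon": 4, "i94port": "NYC", "arrdate": 20545.0, "depdate": 20552.0},
    {"i94yr": 2016, "i94mon": 4, "i94port": "XXX", "arrdate": 20546.0, "depdate": None},
]
port_temperature = {"NYC": 10.5}

result_rows = []
for row in immigration_rows:
    temperature = port_temperature.get(row["i94port"])
    if temperature is None:
        # Inner-join behaviour: drop rows without a matching port
        continue
    result_rows.append({
        **row,
        "arrdate": sas_date_to_iso(row["arrdate"]),
        "depdate": sas_date_to_iso(row["depdate"]),
        "AverageTemperature": temperature,
    })

print(result_rows)
```

In Spark the same step would typically be a DataFrame join on `i94port`; the dictionary lookup here only illustrates which rows survive and which column is added.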

##### 3.2 Mapping Out Data Pipelines

List the steps necessary to pipeline the data into the chosen data model

#### Step 4: Run Pipelines to Model the Data

##### 4.1 Create the data model

Build the data pipelines to create the data model.

##### 4.2 Data Quality Checks

Explain the data quality checks you'll perform to ensure the pipeline ran as expected. These could include:

* Integrity constraints on the relational database (e.g., unique key, data type, etc.)
* Unit tests for the scripts to ensure they are doing the right thing
* Source/Count checks to ensure completeness

Run Quality Checks
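
The three kinds of checks listed above could look roughly like this. The following is a minimal sketch in plain Python rather than the project's actual checks: the helper names, the `cicid` key column, and the sample table are hypothetical.

```python
def check_not_empty(rows, table_name):
    """Completeness check: fail if the table has no rows."""
    if not rows:
        raise ValueError("{}: table is empty".format(table_name))
    return len(rows)

def check_unique_key(rows, key_column, table_name):
    """Integrity check: fail if the key is missing or repeated."""
    keys = [row.get(key_column) for row in rows]
    if any(key is None for key in keys):
        raise ValueError("{}: {} contains missing values".format(table_name, key_column))
    if len(keys) != len(set(keys)):
        raise ValueError("{}: {} is not unique".format(table_name, key_column))

def check_row_count(source_count, target_count, table_name):
    """Source/count check: fail if rows were lost or added along the way."""
    if source_count != target_count:
        raise ValueError("{}: expected {} rows, found {}".format(
            table_name, source_count, target_count))

# Hypothetical table, made up for illustration only
sample_table = [{"cicid": 1, "i94port": "NYC"}, {"cicid": 2, "i94port": "LAX"}]

row_count = check_not_empty(sample_table, "sample_table")
check_unique_key(sample_table, "cicid", "sample_table")
check_row_count(source_count=2, target_count=row_count, table_name="sample_table")
print("All checks passed for sample_table ({} rows)".format(row_count))
```

Raising an exception on the first failed check stops the pipeline early, which is usually preferable to loading incomplete data.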

##### 4.3 Data dictionary

Create a data dictionary for your data model. For each field, provide a brief description of what the data is and where it came from. You can include the data dictionary in the notebook or in a separate file.

#### Step 5: Complete Project Write Up

* Clearly state the rationale for the choice of tools and technologies for the project.
* Propose how often the data should be updated and why.
* Write a description of how you would approach the problem differently under the following scenarios:
    * The data was increased by 100x.
    * The data populates a dashboard that must be updated on a daily basis by 7am every day.
    * The database needed to be accessed by 100+ people.