# Data Engineering Project 5
### Data Engineering Capstone Project

#### Project Summary
Introduction
A core responsibility of The National Travel and Tourism Office (NTTO) is to collect, analyze, and disseminate international travel and tourism statistics. 
NTTO's data engineering team is charged with managing, improving, and expanding the system so that it fully accounts for and reports the impact of travel and tourism in the United States. The analysis results support forecasting, operations, and decision-making. NTTO also creates a positive climate for growth in travel and tourism by reducing institutional barriers to tourism, administers joint marketing efforts, provides official travel and tourism statistics, and coordinates efforts across federal agencies.
As part of the project, data engineers were tasked with building an ETL pipeline that extracts data from the Immigration Data, Temperature Data, and Airport Code Table and prepares it for analysis. The resulting datasets and the ETL pipeline can be tested by running queries provided by the NTTO analytics team and comparing the results with the expected results.

Project Description
In this project, Spark and data lakes are used to build an ETL pipeline for a data lake hosted on local storage. To complete the project, the data is processed into analytics tables using Spark and loaded back into local storage. The Spark processes run on a cluster on AWS.

The project follows these steps:
* Step 1: Scope the Project and Gather Data
* Step 2: Explore and Assess the Data
* Step 3: Define the Data Model
* Step 4: Run ETL to Model the Data
* Step 5: Complete Project Write Up

In [26]:
# Do all imports and installs here
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
import pandas as pd, re

### Step 1: Scope the Project and Gather Data

#### Scope 
Spark is used to process the Immigration Data, Temperature Data, and Airport Code Table data sources and to create a star schema optimized for queries on international travel and tourism statistics. The schema includes the following tables.

Fact Table
    <tbd>
Dimension Tables
    <tbd>

#### Describe and Gather Data 
I94 Immigration Data: This data comes from the US National Tourism and Trade Office. 

World Temperature Data: This dataset comes from Kaggle. A copy of the dataset on local storage is used.

Airport Code Table: This is a simple table of airport codes and corresponding cities. The data is available on local storage.

##### For I94 Immigration Data
The data is stored locally as Spark parquet files in a directory named "sas_data". List them with the command:
    "$ ls -la ./sas_data/*.sas7bdat". 
Pandas can also read a sample of the immigration data with the parameters:
    read_sas(
            path_to_sas_data_file, 
            format="sas7bdat", 
            encoding="ISO-8859-1", 
            chunksize=5000
     )

In [27]:
# input sample data source i94_apr16_sub.sas7bdat
immigration_input_data = "../../data/18-83510-I94-Data-2016/i94_apr16_sub.sas7bdat"

# Parse sas7bdat file with encoding format parameter "ISO-8859-1"
immi_df = pd.read_sas(immigration_input_data, format="sas7bdat", encoding="ISO-8859-1", chunksize=5000)
immi_df = immi_df.read()

# Verify i94_apr16_sub.sas7bdat parsed as dataframe
immi_df.head()

Unnamed: 0,cicid,i94yr,i94mon,i94cit,i94res,i94port,arrdate,i94mode,i94addr,depdate,...,entdepu,matflag,biryear,dtaddto,gender,insnum,airline,admnum,fltno,visatype
0,6.0,2016.0,4.0,692.0,692.0,XXX,20573.0,,,,...,U,,1979.000411,10282016,,,,1897628000.0,,B2
1,7.0,2016.0,4.0,254.0,276.0,ATL,20551.0,1.0,AL,,...,Y,,1991.000411,D/S,M,,,3736796000.0,296.0,F1
2,15.0,2016.0,4.0,101.0,101.0,WAS,20545.0,1.0,MI,20691.0,...,,M,1961.000411,09302016,M,,OS,666643200.0,93.0,B2
3,16.0,2016.0,4.0,101.0,101.0,NYC,20545.0,1.0,MA,20567.0,...,,M,1988.000411,09302016,,,AA,92468460000.0,199.0,B2
4,17.0,2016.0,4.0,101.0,101.0,NYC,20545.0,1.0,MA,20567.0,...,,M,2012.000411,09302016,,,AA,92468460000.0,199.0,B2


##### For World Temperature Data
This is a local .csv file named "GlobalLandTemperaturesByCity.csv". To find the file, run the command:
    "$ ls -la ../../data2 | grep GlobalLandTemperaturesByCity"
Pandas can also read the file with the parameters:
    read_csv(
            path_to_csv_data_file, 
            sep = csv_separate_symbol
    )

In [28]:
# input data source GlobalLandTemperaturesByCity.csv
temperature_input_data = "../../data2/GlobalLandTemperaturesByCity.csv"
# Parse csv file
tempe_df = pd.read_csv(temperature_input_data, sep=',')
# Verify GlobalLandTemperaturesByCity.csv parsed as dataframe
tempe_df.head()

#tempe_df = tempe_df.filter(tempe_df.Country == 'Denmark')

Unnamed: 0,dt,AverageTemperature,AverageTemperatureUncertainty,City,Country,Latitude,Longitude
0,1743-11-01,6.068,1.737,Århus,Denmark,57.05N,10.33E
1,1743-12-01,,,Århus,Denmark,57.05N,10.33E
2,1744-01-01,,,Århus,Denmark,57.05N,10.33E
3,1744-02-01,,,Århus,Denmark,57.05N,10.33E
4,1744-03-01,,,Århus,Denmark,57.05N,10.33E


##### For Airport Code Table
This is a local .csv file named "airport-codes_csv.csv". To find the file, run the command:
    "$ ls -la ./ | grep airport-codes_csv"
Pandas can also read the file with the parameters:
    read_csv(
            path_to_csv_data_file, 
            sep = csv_separate_symbol
    )

In [29]:
# input data source airport-codes_csv.csv
airport_input_data = "./airport-codes_csv.csv"
# Parse csv file
airport_df = pd.read_csv(airport_input_data, sep=',')
# Verify airport-codes_csv.csv parsed as dataframe
airport_df.head()

Unnamed: 0,ident,type,name,elevation_ft,continent,iso_country,iso_region,municipality,gps_code,iata_code,local_code,coordinates
0,00A,heliport,Total Rf Heliport,11.0,,US,US-PA,Bensalem,00A,,00A,"-74.93360137939453, 40.07080078125"
1,00AA,small_airport,Aero B Ranch Airport,3435.0,,US,US-KS,Leoti,00AA,,00AA,"-101.473911, 38.704022"
2,00AK,small_airport,Lowell Field,450.0,,US,US-AK,Anchor Point,00AK,,00AK,"-151.695999146, 59.94919968"
3,00AL,small_airport,Epps Airpark,820.0,,US,US-AL,Harvest,00AL,,00AL,"-86.77030181884766, 34.86479949951172"
4,00AR,closed,Newport Hospital & Clinic Heliport,237.0,,US,US-AR,Newport,,,,"-91.254898, 35.6087"


#### Review data source schema with Spark
Spark is used to inspect each data source: schema (column names and data types), sample contents, and number of records.

In [55]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.\
config("spark.jars.repositories", "https://repos.spark-packages.org/").\
config("spark.jars.packages", "saurfang:spark-sas7bdat:2.0.0-s_2.11").\
enableHiveSupport().getOrCreate()

# df_spark = spark.read.format('com.github.saurfang.sas.spark').load('../../data/18-83510-I94-Data-2016/i94_apr16_sub.sas7bdat')

In [56]:
df_spark = spark.read.format('com.github.saurfang.sas.spark').load('../../data/18-83510-I94-Data-2016/i94_apr16_sub.sas7bdat')

Py4JJavaError: An error occurred while calling o347.load.
: java.lang.ClassNotFoundException: Failed to find data source: com.github.saurfang.sas.spark. Please find packages at http://spark.apache.org/third-party-projects.html
	at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:657)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:194)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: com.github.saurfang.sas.spark.DefaultSource
	at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
	at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20$$anonfun$apply$12.apply(DataSource.scala:634)
	at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20$$anonfun$apply$12.apply(DataSource.scala:634)
	at scala.util.Try$.apply(Try.scala:192)
	at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20.apply(DataSource.scala:634)
	at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20.apply(DataSource.scala:634)
	at scala.util.Try.orElse(Try.scala:84)
	at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:634)
	... 13 more


In [None]:
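# `immigration_input_data` is the .sas7bdat path defined in the pandas cell above
df_immigration = spark.read.format('com.github.saurfang.sas.spark').load(immigration_input_data)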

In [44]:
#write to parquet
# df_spark.write.parquet("sas_data")

#read immigration data from parquet files ./sas_data/*
immi_df=spark.read.parquet("sas_data")
immi_df.printSchema()
immi_df.show(5)
immi_df.count()

root
 |-- cicid: double (nullable = true)
 |-- i94yr: double (nullable = true)
 |-- i94mon: double (nullable = true)
 |-- i94cit: double (nullable = true)
 |-- i94res: double (nullable = true)
 |-- i94port: string (nullable = true)
 |-- arrdate: double (nullable = true)
 |-- i94mode: double (nullable = true)
 |-- i94addr: string (nullable = true)
 |-- depdate: double (nullable = true)
 |-- i94bir: double (nullable = true)
 |-- i94visa: double (nullable = true)
 |-- count: double (nullable = true)
 |-- dtadfile: string (nullable = true)
 |-- visapost: string (nullable = true)
 |-- occup: string (nullable = true)
 |-- entdepa: string (nullable = true)
 |-- entdepd: string (nullable = true)
 |-- entdepu: string (nullable = true)
 |-- matflag: string (nullable = true)
 |-- biryear: double (nullable = true)
 |-- dtaddto: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- insnum: string (nullable = true)
 |-- airline: string (nullable = true)
 |-- admnum: double (nullable = 

3096313

In [45]:
#read airport-code data from "./airport-codes_csv.csv"
airport_input_data = "./airport-codes_csv.csv"
airport_df = spark.read.csv(airport_input_data, header=True)
airport_df.printSchema()
airport_df.show(5)
airport_df.count()

root
 |-- ident: string (nullable = true)
 |-- type: string (nullable = true)
 |-- name: string (nullable = true)
 |-- elevation_ft: string (nullable = true)
 |-- continent: string (nullable = true)
 |-- iso_country: string (nullable = true)
 |-- iso_region: string (nullable = true)
 |-- municipality: string (nullable = true)
 |-- gps_code: string (nullable = true)
 |-- iata_code: string (nullable = true)
 |-- local_code: string (nullable = true)
 |-- coordinates: string (nullable = true)

+-----+-------------+--------------------+------------+---------+-----------+----------+------------+--------+---------+----------+--------------------+
|ident|         type|                name|elevation_ft|continent|iso_country|iso_region|municipality|gps_code|iata_code|local_code|         coordinates|
+-----+-------------+--------------------+------------+---------+-----------+----------+------------+--------+---------+----------+--------------------+
|  00A|     heliport|   Total Rf Heliport|    

55075

In [46]:
#read temperature data from "../../data2/GlobalLandTemperaturesByCity.csv"
temperature_input_data = "../../data2/GlobalLandTemperaturesByCity.csv"
tempe_df = spark.read.csv(temperature_input_data, header=True)
tempe_df.printSchema()
tempe_df.show(5)
tempe_df.count()

root
 |-- dt: string (nullable = true)
 |-- AverageTemperature: string (nullable = true)
 |-- AverageTemperatureUncertainty: string (nullable = true)
 |-- City: string (nullable = true)
 |-- Country: string (nullable = true)
 |-- Latitude: string (nullable = true)
 |-- Longitude: string (nullable = true)

+----------+------------------+-----------------------------+-----+-------+--------+---------+
|        dt|AverageTemperature|AverageTemperatureUncertainty| City|Country|Latitude|Longitude|
+----------+------------------+-----------------------------+-----+-------+--------+---------+
|1743-11-01|             6.068|           1.7369999999999999|Århus|Denmark|  57.05N|   10.33E|
|1743-12-01|              null|                         null|Århus|Denmark|  57.05N|   10.33E|
|1744-01-01|              null|                         null|Århus|Denmark|  57.05N|   10.33E|
|1744-02-01|              null|                         null|Århus|Denmark|  57.05N|   10.33E|
|1744-03-01|              nu

8599212

### Step 2: Explore and Assess the Data
#### Explore the Data 
Identify data quality issues, like missing values, duplicate data, etc.
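
As a starting point for this exploration, missing values and exact duplicate rows can be counted per dataset. The sketch below is a minimal, framework-free illustration that runs without a Spark session. The sample rows and the helper names `count_missing` and `count_duplicates` are hypothetical and only mirror the shape of the temperature data shown earlier. On the full datasets the same counts would more likely be computed on the Spark dataframes.

```python
from collections import Counter

# Hypothetical sample rows shaped like the temperature data shown earlier;
# None stands in for a missing value, and the last row is a made-up duplicate.
rows = [
    {"dt": "1743-11-01", "AverageTemperature": 6.068, "City": "Århus"},
    {"dt": "1743-12-01", "AverageTemperature": None, "City": "Århus"},
    {"dt": "1744-01-01", "AverageTemperature": None, "City": "Århus"},
    {"dt": "1743-11-01", "AverageTemperature": 6.068, "City": "Århus"},
]


def count_missing(rows):
    """Return {column: number of rows where the column is None or empty}."""
    missing = Counter()
    for row in rows:
        for column, value in row.items():
            if value is None or value == "":
                missing[column] += 1
    return dict(missing)


def count_duplicates(rows):
    """Return how many rows repeat an earlier row exactly."""
    # Sorting the items by column name gives each row a stable, hashable form.
    seen = Counter(tuple(sorted(row.items())) for row in rows)
    return sum(n - 1 for n in seen.values())


missing_per_column = count_missing(rows)
duplicate_rows = count_duplicates(rows)
print(missing_per_column, duplicate_rows)
```

Columns with a high share of missing values are candidates for dropping or filtering, and a non-zero duplicate count suggests deduplicating before loading.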

#### Cleaning Steps
Document steps necessary to clean the data
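
One cleaning step that may be needed is converting the numeric `arrdate` and `depdate` columns into calendar dates. The sketch below assumes these columns hold SAS date values, which count days from 1960-01-01. That assumption is consistent with the sample above, where `arrdate` 20545.0 appears in rows with `i94yr` 2016 and `i94mon` 4. The helper name `sas_date_to_iso` is hypothetical.

```python
from datetime import date, timedelta

# SAS date values count days from this date (assumed to apply to arrdate/depdate).
SAS_EPOCH = date(1960, 1, 1)


def sas_date_to_iso(value):
    """Convert a SAS date number such as 20545.0 to 'YYYY-MM-DD'.

    Returns None for missing values (None or NaN).
    """
    if value is None or value != value:  # value != value is True only for NaN
        return None
    return (SAS_EPOCH + timedelta(days=int(value))).isoformat()


print(sas_date_to_iso(20545.0))
```

In a Spark pipeline the same conversion could be wrapped in a UDF or expressed with built-in date functions.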

In [62]:
# Build a lookup of valid I94 ports from "valid_i94port.txt", a text file of
# port codes taken from "./I94_SAS_Labels_Descriptions.SAS"
import re

# Each line holds a quoted port code followed by a quoted port name
reg_exp_ops = re.compile(r'\'(.*)\'.*\'(.*)\'')
valid_i94port = {}
with open('valid_i94port.txt') as f:
    for port_name in f:
        matching_port = reg_exp_ops.search(port_name)
        # Skip blank or malformed lines instead of failing on a None match
        if matching_port:
            valid_i94port[matching_port[1]] = [matching_port[2]]

def clean_i94_data(file):
    '''
    Input: Path to I94 immigration data file (currently unused; the data is
           read from the local parquet copy "sas_data")

    Output: Spark dataframe of I94 immigration data with valid i94port
    '''

    # Read I94 data into Spark from the parquet copy
    # df_immigration = spark.read.format('com.github.saurfang.sas.spark').load(file)
    immi_df = spark.read.parquet("sas_data")

    # Filter out entries where i94port is invalid
    immi_df = immi_df.filter(immi_df.i94port.isin(list(valid_i94port.keys())))

    return immi_df

In [63]:
# Test clean_i94_data function
immigration_test_data = "../../data/18-83510-I94-Data-2016/i94_apr16_sub.sas7bdat"
#immigration_test_file = '../../data/18-83510-I94-Data-2016/i94_apr16_sub.sas7bdat'
immigration_test_data = clean_i94_data(immigration_test_data)
immigration_test_data.select(immigration_test_data.i94port).show(n=50)

+-------+
|i94port|
+-------+
+-------+
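
The port lookup used by `clean_i94_data` depends on how the regular expression splits each line of the port file. The sketch below applies an equivalent pattern to a few hypothetical sample lines so the parsing can be checked in isolation. The line layout (quoted code, `=`, quoted name padded with spaces) is an assumption about the file, and stripping whitespace from the captured values is an addition that the cell above does not perform.

```python
import re

# Equivalent to the pattern in the cell above: two single-quoted fields per line.
PORT_PATTERN = re.compile(r"'(.*)'.*'(.*)'")


def parse_port_lines(lines):
    """Map port code -> port name, skipping lines that do not match."""
    ports = {}
    for line in lines:
        match = PORT_PATTERN.search(line)
        if match is None:
            continue  # blank or malformed line
        # Strip padding so codes compare equal to values in the i94port column.
        ports[match.group(1).strip()] = match.group(2).strip()
    return ports


# Hypothetical sample lines; the real file layout is an assumption.
sample = [
    "   'ALC'\t=\t'ALCAN, AK             '",
    "   'NYC'\t=\t'NEW YORK, NY          '",
    "",
]
ports = parse_port_lines(sample)
print(ports)
```

If the parsed keys carry stray whitespace or the file yields no matches, the `isin` filter would match nothing, so printing a few parsed keys is a quick way to investigate an empty result.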



### Step 3: Define the Data Model
#### 3.1 Conceptual Data Model
Map out the conceptual data model and explain why you chose that model

#### 3.2 Mapping Out Data Pipelines
List the steps necessary to pipeline the data into the chosen data model

### Step 4: Run Pipelines to Model the Data 
#### 4.1 Create the data model
Build the data pipelines to create the data model.

In [None]:
# Write code here

#### 4.2 Data Quality Checks
Explain the data quality checks you'll perform to ensure the pipeline ran as expected. These could include:
 * Integrity constraints on the relational database (e.g., unique key, data type, etc.)
 * Unit tests for the scripts to ensure they are doing the right thing
 * Source/Count checks to ensure completeness
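
The checks listed above could be sketched as small functions that raise an error when a table fails. The example below is a minimal, framework-free illustration. The function names, the `airports` sample rows, and the choice of `ident` as a key are hypothetical and not part of the project's data model, which is still to be defined.

```python
def check_not_empty(rows, table_name):
    """Source/count check: fail if the table has no rows."""
    if len(rows) == 0:
        raise ValueError(f"{table_name}: table is empty")
    return len(rows)


def check_unique_key(rows, key, table_name):
    """Integrity check: fail if the key is missing, null, or repeated."""
    seen = set()
    for row in rows:
        value = row.get(key)
        if value is None:
            raise ValueError(f"{table_name}: null {key}")
        if value in seen:
            raise ValueError(f"{table_name}: duplicate {key}={value}")
        seen.add(value)
    return len(seen)


# Hypothetical rows for an airport dimension keyed by `ident`.
airports = [{"ident": "00A"}, {"ident": "00AA"}, {"ident": "00AK"}]
row_count = check_not_empty(airports, "airports")
key_count = check_unique_key(airports, "ident", "airports")
print(row_count, key_count)
```

Raising an exception rather than printing a warning makes a failed check stop the pipeline run, which is usually the desired behavior for completeness and key constraints.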
 
Run Quality Checks

In [None]:
# Perform quality checks here

#### 4.3 Data dictionary 
Create a data dictionary for your data model. For each field, provide a brief description of what the data is and where it came from. You can include the data dictionary in the notebook or in a separate file.

### Step 5: Complete Project Write Up
* Clearly state the rationale for the choice of tools and technologies for the project.
* Propose how often the data should be updated and why.
* Write a description of how you would approach the problem differently under the following scenarios:
 * The data was increased by 100x.
 * The data populates a dashboard that must be updated on a daily basis by 7am every day.
 * The database needed to be accessed by 100+ people.