# Data Engineering Capstone Project

## US I94 Immigration Data Lake

### Project Summary
This project performs ETL operations on the Udacity-provided I94 immigration and demographics datasets using PySpark. It writes a star schema to Parquet files, following the schema-on-read semantics of a data lake.

This notebook performs exploratory data analysis (EDA) on the datasets used.

The project follows these steps:
* Step 1: Scope the Project and Gather Data
* Step 2: Explore and Assess the Data
* Step 3: Define the Data Model
* Step 4: Run ETL to Model the Data
* Step 5: Complete Project Write Up

In [1]:
# Check readme for installation and env setup
import configparser

import pandas as pd

from pyspark.sql import SparkSession

### Step 1: Scope the Project and Gather Data

#### Scope 
*Explain what you plan to do in the project in more detail. What data do you use? What does your end solution look like? What tools did you use? etc.*

I plan to create a data lake with PySpark about immigrant destinations in the US. To achieve this, I use the I94 immigration dataset along with the US city demographics and ISO-3166 country code datasets. The processed data lake can be used to analyse immigration trends over particular time periods and by the travelers' origin. Output is written in the columnar Apache Parquet format, which performs better on aggregation queries.

**Tools/Tech Used**: Python, Apache Spark (PySpark), Pandas

#### Describe and Gather Data 
*Describe the data sets you're using. Where did it come from? What type of information is included?*

The following datasets are used for this project:
- **I94 Immigration Data 2016:** This data comes from the US National Travel and Tourism Office (NTTO).
    - Source: https://travel.trade.gov/research/reports/i94/historical/2016.html
    - Note: this data is behind a paywall and was provided by Udacity for this project.
    - The dataset consists of 12 files, one per month. Each file has around 3 million rows and 28 columns. A data dictionary explaining the columns is included at `data/I94_SAS_Labels_Descriptions.SAS`.
    - Sample CSV: `data/input/immigration_data_sample.csv`
    - NOTE: I've used the sample SAS dataset that Udacity provides in the `sas_data` directory of the workspace. It contains ~3 million rows, which satisfies the requirement of at least 1 million rows, and covers April 2016 only.
- **U.S. City Demographic Data:** This data comes from OpenDataSoft.
    - Source: https://public.opendatasoft.com/explore/dataset/us-cities-demographics/export/

##### 1.1 Read config and load sample from datasets

In [3]:
# Because the immigration data has 28 columns
pd.set_option('display.max_columns', 28)

# Read config
config = configparser.ConfigParser()
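# Use a context manager so the config file handle is closed after reading
with open('capstone.cfg') as cfg_file:
    config.read_file(cfg_file)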

I94_DATA_FILE_PATH = config['DATA']['I94_DATA_FILE_PATH']
print(I94_DATA_FILE_PATH)
# df = pd.read_sas(I94_DATA_FILE_PATH, format='sas7bdat', encoding="ISO-8859-1")

../../data/18-83510-I94-Data-2016/i94_apr16_sub.sas7bdat
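
For reference, here is a minimal sketch of what `capstone.cfg` might look like and how `configparser` resolves it. The file layout is an assumption on my part: only the `DATA` section name, the two option names, and the two printed paths come from this notebook. The sketch parses the text from a string so it runs without the file on disk.

```python
import configparser

# Assumed contents of capstone.cfg. The section name, option names and
# values are taken from the cells in this notebook; the layout is a guess.
SAMPLE_CFG = """
[DATA]
I94_DATA_FILE_PATH = ../../data/18-83510-I94-Data-2016/i94_apr16_sub.sas7bdat
I94_LOCAL_DATA_DIR = data/input/sas_data/
"""

config = configparser.ConfigParser()
config.read_string(SAMPLE_CFG)  # read_string avoids needing the file on disk

# Section names are case-sensitive; option names are not.
i94_data_file_path = config["DATA"]["I94_DATA_FILE_PATH"]
i94_local_data_dir = config["DATA"]["I94_LOCAL_DATA_DIR"]

print(i94_data_file_path)
print(i94_local_data_dir)
```

Keeping paths in a config file means the same notebook can run against the workspace data or a local copy without code changes.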


In [4]:
# Read local parquet dataset in `df`
I94_LOCAL_DATA_DIR = config['DATA']['I94_LOCAL_DATA_DIR']
print("I94_LOCAL_DATA_DIR: ", I94_LOCAL_DATA_DIR)

# To reduce memory usage locally
df = pd.read_parquet(I94_LOCAL_DATA_DIR).sample(n=1000)
df.describe()

I94_LOCAL_DATA_DIR:  data/input/sas_data/


| | cicid | i94yr | i94mon | i94cit | i94res | arrdate | i94mode | depdate | i94bir | i94visa | count | biryear | admnum |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 954.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 |
| mean | 2992814.0 | 2016.0 | 4.0 | 308.641 | 307.279 | 20559.497 | 1.073 | 20573.730608 | 41.745 | 1.823 | 1.0 | 1974.255 | 70355580000.0 |
| std | 1743423.0 | 0.0 | 0.0 | 211.357271 | 210.610242 | 8.672093 | 0.47948 | 21.677009 | 17.699923 | 0.409681 | 0.0 | 17.699923 | 22642220000.0 |
| min | 4606.0 | 2016.0 | 4.0 | 103.0 | 103.0 | 20545.0 | 1.0 | 20547.0 | 0.0 | 1.0 | 1.0 | 1925.0 | 673067200.0 |
| 25% | 1541770.0 | 2016.0 | 4.0 | 135.0 | 135.0 | 20552.0 | 1.0 | 20561.0 | 31.0 | 2.0 | 1.0 | 1961.75 | 56001250000.0 |
| 50% | 2913126.0 | 2016.0 | 4.0 | 213.0 | 213.0 | 20559.0 | 1.0 | 20569.0 | 42.0 | 2.0 | 1.0 | 1974.0 | 59347020000.0 |
| 75% | 4512445.0 | 2016.0 | 4.0 | 514.0 | 513.0 | 20567.0 | 1.0 | 20579.0 | 54.25 | 2.0 | 1.0 | 1985.0 | 93434720000.0 |
| max | 6059538.0 | 2016.0 | 4.0 | 746.0 | 745.0 | 20574.0 | 9.0 | 20703.0 | 91.0 | 3.0 | 1.0 | 2016.0 | 95012780000.0 |
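
The `arrdate` and `depdate` values in the `describe()` output look like SAS numeric dates, i.e. a count of days since 1960-01-01. That reading is an assumption, but it is consistent with the data: the `arrdate` minimum and maximum map exactly to the first and last day of April 2016. A small sketch of the conversion (the helper name `sas_date_to_date` is mine, not part of any library):

```python
from datetime import date, timedelta

# SAS stores dates as the number of days since 1 January 1960.
SAS_EPOCH = date(1960, 1, 1)


def sas_date_to_date(sas_days):
    """Convert a SAS date value to a datetime.date; None/NaN stay None."""
    # NaN is the only value that is not equal to itself.
    if sas_days is None or sas_days != sas_days:
        return None
    return SAS_EPOCH + timedelta(days=int(sas_days))


print(sas_date_to_date(20545.0))  # arrdate min -> 2016-04-01
print(sas_date_to_date(20574.0))  # arrdate max -> 2016-04-30
```

The `None` branch matters here because `depdate` is missing for some rows (its count is 954 out of 1000 in the sample).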


In [5]:
df.sample(n=10)

| | cicid | i94yr | i94mon | i94cit | i94res | i94port | arrdate | i94mode | i94addr | depdate | i94bir | i94visa | count | dtadfile | visapost | occup | entdepa | entdepd | entdepu | matflag | biryear | dtaddto | gender | insnum | airline | admnum | fltno | visatype |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 502069 | 1020474.0 | 2016.0 | 4.0 | 148.0 | 112.0 | WAS | 20550.0 | 1.0 | CA | 20566.0 | 29.0 | 2.0 | 1.0 | 20160406 | | | T | O | | M | 1987.0 | 7042016 | M | | DY | 698624000.0 | 7091 | WT |
| 296184 | 632364.0 | 2016.0 | 4.0 | 111.0 | 111.0 | NYC | 20548.0 | 1.0 | NJ | 20555.0 | 44.0 | 2.0 | 1.0 | 20160404 | | | O | O | | M | 1972.0 | 7022016 | | | AF | 55602600000.0 | 12 | WT |
| 1519835 | 3109118.0 | 2016.0 | 4.0 | 129.0 | 129.0 | NYC | 20560.0 | 1.0 | NY | 20567.0 | 31.0 | 2.0 | 1.0 | 20160417 | | | G | O | | M | 1985.0 | 7152016 | M | | IB | 56296300000.0 | 6251 | WT |
| 265387 | 517917.0 | 2016.0 | 4.0 | 438.0 | 438.0 | HHW | 20547.0 | 1.0 | HI | 20554.0 | 27.0 | 2.0 | 1.0 | 20160403 | | | O | O | | M | 1989.0 | 7012016 | | | QF | 55569370000.0 | 3 | WT |
| 1923636 | 3877989.0 | 2016.0 | 4.0 | 209.0 | 209.0 | CHI | 20565.0 | 1.0 | HI | 20568.0 | 51.0 | 2.0 | 1.0 | 20160421 | | | G | O | | M | 1965.0 | 7192016 | M | | NH | 56521880000.0 | 184 | WT |
| 2601274 | 5255090.0 | 2016.0 | 4.0 | 209.0 | 209.0 | HHW | 20572.0 | 1.0 | HI | 20576.0 | 68.0 | 2.0 | 1.0 | 20160428 | | | G | O | | M | 1948.0 | 7262016 | F | | HA | 59406040000.0 | 450 | WT |
| 2407736 | 4888421.0 | 2016.0 | 4.0 | 207.0 | 207.0 | LVG | 20570.0 | 1.0 | CA | 20593.0 | 38.0 | 2.0 | 1.0 | 20160426 | | | G | N | | M | 1978.0 | 7242016 | F | | CX | 59285320000.0 | 884 | WT |
| 1974753 | 4023006.0 | 2016.0 | 4.0 | 690.0 | 690.0 | MIA | 20565.0 | 1.0 | FL | 20575.0 | 33.0 | 1.0 | 1.0 | 20160421 | | | G | O | | M | 1983.0 | 7192016 | M | | LA | 56502940000.0 | 500 | WB |
| 2137331 | 4312434.0 | 2016.0 | 4.0 | 213.0 | 213.0 | NEW | 20567.0 | 1.0 | NJ | 20650.0 | 69.0 | 2.0 | 1.0 | 20160423 | BMB | | G | O | | M | 1947.0 | 10222016 | F | | AI | 94313490000.0 | 191 | B2 |
| 1695267 | 3467813.0 | 2016.0 | 4.0 | 585.0 | 585.0 | MIA | 20562.0 | 1.0 | FL | 20574.0 | 46.0 | 2.0 | 1.0 | 20160418 | SDO | | G | N | | M | 1970.0 | 10172016 | M | | AA | 93914810000.0 | 1481 | B2 |


##### 1.2 Configure Spark session

In [11]:
spark = SparkSession.builder \
            .appName("Capstone Project - Immigration Dataset") \
            .config("spark.jars.packages","saurfang:spark-sas7bdat:2.0.0-s_2.11") \
            .enableHiveSupport() \
            .getOrCreate()
print("Spark session created")


Spark session created


##### 1.3 Show sample immigration data

Read the April 2016 file into a Spark DataFrame (the same month as the pandas `df` above).

In [12]:
# Reading apr sas file in Spark df
immigration_df = spark.read.format('com.github.saurfang.sas.spark').load(I94_DATA_FILE_PATH)

immigration_df.printSchema()

root
 |-- cicid: double (nullable = true)
 |-- i94yr: double (nullable = true)
 |-- i94mon: double (nullable = true)
 |-- i94cit: double (nullable = true)
 |-- i94res: double (nullable = true)
 |-- i94port: string (nullable = true)
 |-- arrdate: double (nullable = true)
 |-- i94mode: double (nullable = true)
 |-- i94addr: string (nullable = true)
 |-- depdate: double (nullable = true)
 |-- i94bir: double (nullable = true)
 |-- i94visa: double (nullable = true)
 |-- count: double (nullable = true)
 |-- dtadfile: string (nullable = true)
 |-- visapost: string (nullable = true)
 |-- occup: string (nullable = true)
 |-- entdepa: string (nullable = true)
 |-- entdepd: string (nullable = true)
 |-- entdepu: string (nullable = true)
 |-- matflag: string (nullable = true)
 |-- biryear: double (nullable = true)
 |-- dtaddto: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- insnum: string (nullable = true)
 |-- airline: string (nullable = true)
 |-- admnum: double (nullable = true)

In [14]:
print("Partitions: ", immigration_df.rdd.getNumPartitions())
immigration_df.head()

Partitions:  14


[Row(cicid=6.0, i94yr=2016.0, i94mon=4.0, i94cit=692.0, i94res=692.0, i94port='XXX', arrdate=20573.0, i94mode=None, i94addr=None, depdate=None, i94bir=37.0, i94visa=2.0, count=1.0, dtadfile=None, visapost=None, occup=None, entdepa='T', entdepd=None, entdepu='U', matflag=None, biryear=1979.0, dtaddto='10282016', gender=None, insnum=None, airline=None, admnum=1897628485.0, fltno=None, visatype='B2')]
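
Columns such as `i94mode` and `i94visa` hold numeric codes stored as doubles, and the row above shows they can also be `None`. Below is a minimal sketch of decoding them into labels. The label mappings are written from my reading of `data/I94_SAS_Labels_Descriptions.SAS` and should be treated as assumptions until checked against that file; the `decode` helper is hypothetical.

```python
# Assumed label mappings; verify against I94_SAS_Labels_Descriptions.SAS.
I94_MODE_LABELS = {1: "Air", 2: "Sea", 3: "Land", 9: "Not reported"}
I94_VISA_LABELS = {1: "Business", 2: "Pleasure", 3: "Student"}


def decode(code, labels, default="Unknown"):
    """Map a numeric code (a double, possibly None or NaN) to its label."""
    if code is None or code != code:  # None or NaN
        return default
    return labels.get(int(code), default)


# Values taken from the Row printed above: i94visa=2.0, i94mode=None.
print(decode(2.0, I94_VISA_LABELS))   # Pleasure
print(decode(None, I94_MODE_LABELS))  # Unknown
```

In the Spark pipeline the same lookup would more likely be a join against a small dimension table than a Python function, but the mapping logic is the same.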

### Step 2: Explore and Assess the Data
#### 2.1 Explore the Data
*Identify data quality issues, like missing values, duplicate data, etc.*

##### 2.1.1 Immigration Data
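
A first pass at missing values can be as simple as the share of empty cells per column. The sketch below uses plain Python over a list of dicts so it runs anywhere; the helper name `missing_fraction` and the toy rows are made up for illustration. Rows from the real data could be fed in via `df.to_dict("records")` in pandas or `row.asDict()` in Spark.

```python
import math


def missing_fraction(rows, columns):
    """Return {column: share of rows whose value is None or NaN}."""
    rows = list(rows)
    if not rows:
        return {column: 0.0 for column in columns}

    def is_missing(value):
        return value is None or (isinstance(value, float) and math.isnan(value))

    return {
        column: sum(is_missing(row.get(column)) for row in rows) / len(rows)
        for column in columns
    }


# Toy rows for illustration only; they are not taken from the dataset.
sample = [
    {"cicid": 1.0, "depdate": 20566.0, "gender": "M"},
    {"cicid": 2.0, "depdate": None, "gender": None},
    {"cicid": 3.0, "depdate": 20555.0, "gender": None},
    {"cicid": 4.0, "depdate": float("nan"), "gender": "F"},
]

print(missing_fraction(sample, ["cicid", "depdate", "gender"]))
# {'cicid': 0.0, 'depdate': 0.5, 'gender': 0.5}
```

In the 1,000-row sample described in Step 1, `depdate` has a count of 954, so about 4.6% of its values are missing.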

#### 2.2 Cleaning Steps
Document the steps necessary to clean the data.

In [None]:
# Performing cleaning tasks here





### Step 3: Define the Data Model
#### 3.1 Conceptual Data Model
Map out the conceptual data model and explain why you chose that model.

#### 3.2 Mapping Out Data Pipelines
List the steps necessary to pipeline the data into the chosen data model.

### Step 4: Run Pipelines to Model the Data 
#### 4.1 Create the data model
Build the data pipelines to create the data model.

In [None]:
# Write code here

#### 4.2 Data Quality Checks
Explain the data quality checks you'll perform to ensure the pipeline ran as expected. These could include:
 * Integrity constraints on the relational database (e.g., unique key, data type, etc.)
 * Unit tests for the scripts to ensure they are doing the right thing
 * Source/Count checks to ensure completeness
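
The checks listed above can be sketched as small functions that raise on failure. This is a plain-Python illustration over lists of dicts, not the project's actual implementation. The function names and the `immigration_fact` table name are hypothetical, and the toy rows stand in for real output.

```python
def check_not_empty(rows, table_name):
    """Count check: a table with zero rows signals a failed load."""
    if len(rows) == 0:
        raise ValueError(f"{table_name}: no rows loaded")


def check_unique_key(rows, key, table_name):
    """Integrity check: the key column must be non-null and unique."""
    seen = set()
    for row in rows:
        value = row.get(key)
        if value is None:
            raise ValueError(f"{table_name}: null value in key column {key!r}")
        if value in seen:
            raise ValueError(f"{table_name}: duplicate {key!r} value {value!r}")
        seen.add(value)


def check_row_counts(source_count, target_count, table_name):
    """Completeness check: the target should hold as many rows as the source."""
    if source_count != target_count:
        raise ValueError(
            f"{table_name}: expected {source_count} rows, found {target_count}"
        )


# Toy rows standing in for a fact table keyed by cicid (hypothetical).
fact_rows = [{"cicid": 6}, {"cicid": 7}, {"cicid": 8}]
check_not_empty(fact_rows, "immigration_fact")
check_unique_key(fact_rows, "cicid", "immigration_fact")
check_row_counts(3, len(fact_rows), "immigration_fact")
print("All quality checks passed")
```

On Spark DataFrames the same ideas would typically use `df.count()` for the count checks and a `groupBy(key).count()` filtered to counts above 1 for the uniqueness check.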
 
Run Quality Checks

In [None]:
# Perform quality checks here

#### 4.3 Data dictionary 
Create a data dictionary for your data model. For each field, provide a brief description of what the data is and where it came from. You can include the data dictionary in the notebook or in a separate file.

### Step 5: Complete Project Write Up
* Clearly state the rationale for the choice of tools and technologies for the project.
* Propose how often the data should be updated and why.
* Write a description of how you would approach the problem differently under the following scenarios:
    * The data was increased by 100x.
    * The data populates a dashboard that must be updated on a daily basis by 7am every day.
    * The database needed to be accessed by 100+ people.