# Project Title
### Data Engineering Capstone Project

#### Project Summary
--describe your project at a high level--

The project follows these steps:
* Step 1: Scope the Project and Gather Data
* Step 2: Explore and Assess the Data
* Step 3: Define the Data Model
* Step 4: Run ETL to Model the Data
* Step 5: Complete Project Write Up

**Importing all libraries**

In [2]:
import pandas as pd
import configparser
import os
import datetime as dt
from datetime import timedelta, datetime
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg
from pyspark.sql import SQLContext
from pyspark.sql.types import StructField, StructType, StringType, LongType, IntegerType
from pyspark.sql.functions import isnan, when, count, col, udf, dayofmonth, dayofweek, month, year, weekofyear
from pyspark.sql.functions import monotonically_increasing_id
from pyspark.sql.types import *

import utils

**Creating Spark session**

In [3]:
spark = SparkSession.builder.\
config("spark.jars.repositories", "https://repos.spark-packages.org/").\
config("spark.jars.packages", "saurfang:spark-sas7bdat:2.0.0-s_2.11").\
enableHiveSupport().getOrCreate()

### Step 1: Scope the Project and Gather Data

#### Scope 
Explain what you plan to do in the project in more detail. What data do you use? What does your end solution look like? What tools did you use?

#### Describe and Gather Data 
Describe the data sets you're using. Where did they come from? What type of information is included?

***Load immigration data***

Below we load the available US immigration (I94) data for a single month, April 2016.

**Reading immigration data for month of April 2016**

In [4]:
df_immig= spark.read.format('com.github.saurfang.sas.spark')\
                .load('../../data/18-83510-I94-Data-2016/i94_apr16_sub.sas7bdat')

In [5]:
df_immig.count()

3096313

In [9]:
df_immig.limit(5).toPandas()

| | cicid | i94yr | i94mon | i94cit | i94res | i94port | arrdate | i94mode | i94addr | depdate | ... | entdepu | matflag | biryear | dtaddto | gender | insnum | airline | admnum | fltno | visatype |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 474.0 | 2016.0 | 4.0 | 103.0 | 103.0 | NEW | 20545.0 | 2.0 | | 20547.0 | ... | | M | 1991.0 | 6292016 | F | | VES | 55410440000.0 | 91285 | WT |
| 1 | 1508.0 | 2016.0 | 4.0 | 104.0 | 104.0 | NYC | 20545.0 | 1.0 | NY | 20552.0 | ... | | M | 2000.0 | 6292016 | F | | LX | 55416410000.0 | 16 | WT |
| 2 | 1669.0 | 2016.0 | 4.0 | 104.0 | 104.0 | NYC | 20545.0 | 1.0 | FL | 20561.0 | ... | | M | 1959.0 | 6292016 | M | | AA | 55457750000.0 | 39 | WT |
| 3 | 2025.0 | 2016.0 | 4.0 | 104.0 | 104.0 | NYC | 20545.0 | 1.0 | NY | 20549.0 | ... | | M | 1965.0 | 6292016 | | | SN | 55419980000.0 | 1401 | WT |
| 4 | 2048.0 | 2016.0 | 4.0 | 104.0 | 104.0 | MIA | 20545.0 | 1.0 | FL | 20554.0 | ... | | M | 2013.0 | 6292016 | | | UX | 55456900000.0 | 97 | WT |


In [12]:
df_immig =df_immig.dropDuplicates()
df_immig.count()

3096313

In [13]:
df_immig.summary("count").toPandas()

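| | summary | cicid | i94yr | i94mon | i94cit | i94res | i94port | arrdate | i94mode | i94addr | ... | entdepu | matflag | biryear | dtaddto | gender | insnum | airline | admnum | fltno | visatype |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | count | 3096313 | 3096313 | 3096313 | 3096313 | 3096313 | 3096313 | 3096313 | 3096074 | 2943721 | ... | 392 | 2957884 | 3095511 | 3095836 | 2682044 | 113708 | 3012686 | 3096313 | 3076764 | 3096313 |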


***Load airport codes***

Load and preview the airport codes dataset.

In [14]:
df_airportcodes=spark.read.csv('airport-codes_csv.csv',header=True, inferSchema=True)

In [15]:
df_airportcodes.limit(5).toPandas()

| | ident | type | name | elevation_ft | continent | iso_country | iso_region | municipality | gps_code | iata_code | local_code | coordinates |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 00A | heliport | Total Rf Heliport | 11 | | US | US-PA | Bensalem | 00A | | 00A | -74.93360137939453, 40.07080078125 |
| 1 | 00AA | small_airport | Aero B Ranch Airport | 3435 | | US | US-KS | Leoti | 00AA | | 00AA | -101.473911, 38.704022 |
| 2 | 00AK | small_airport | Lowell Field | 450 | | US | US-AK | Anchor Point | 00AK | | 00AK | -151.695999146, 59.94919968 |
| 3 | 00AL | small_airport | Epps Airpark | 820 | | US | US-AL | Harvest | 00AL | | 00AL | -86.77030181884766, 34.86479949951172 |
| 4 | 00AR | closed | Newport Hospital & Clinic Heliport | 237 | | US | US-AR | Newport | | | | -91.254898, 35.6087 |


In [16]:
df_airportcodes.summary("count").toPandas()

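| | summary | ident | type | name | elevation_ft | continent | iso_country | iso_region | municipality | gps_code | iata_code | local_code | coordinates |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | count | 55075 | 55075 | 55075 | 48069 | 55075 | 55075 | 55075 | 49399 | 41030 | 9189 | 28686 | 55075 |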


In [17]:
df_airportcodes.select("iso_country").distinct().show(5)

+-----------+
|iso_country|
+-----------+
|         DZ|
|         LT|
|         MM|
|         CI|
|         TC|
+-----------+
only showing top 5 rows



***Load US city demographics dataset***

Load and preview the US city demographics dataset.

In [18]:
df_usdemo=spark.read.csv('us-cities-demographics.csv',sep=';',header=True, inferSchema=True)

In [19]:
df_usdemo.limit(5).toPandas()

| | City | State | Median Age | Male Population | Female Population | Total Population | Number of Veterans | Foreign-born | Average Household Size | State Code | Race | Count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Silver Spring | Maryland | 33.8 | 40601 | 41862 | 82463 | 1562 | 30908 | 2.6 | MD | Hispanic or Latino | 25924 |
| 1 | Quincy | Massachusetts | 41.0 | 44129 | 49500 | 93629 | 4147 | 32935 | 2.39 | MA | White | 58723 |
| 2 | Hoover | Alabama | 38.5 | 38040 | 46799 | 84839 | 4819 | 8229 | 2.58 | AL | Asian | 4759 |
| 3 | Rancho Cucamonga | California | 34.5 | 88127 | 87105 | 175232 | 5821 | 33878 | 3.18 | CA | Black or African-American | 24437 |
| 4 | Newark | New Jersey | 34.6 | 138040 | 143873 | 281913 | 5829 | 86253 | 2.73 | NJ | White | 76402 |


In [20]:
df_usdemo.summary("count").toPandas()

| | summary | City | State | Median Age | Male Population | Female Population | Total Population | Number of Veterans | Foreign-born | Average Household Size | State Code | Race | Count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | count | 2891 | 2891 | 2891 | 2888 | 2888 | 2891 | 2878 | 2878 | 2875 | 2891 | 2891 | 2891 |


***Load Country codes***

Read the country codes and names from the I94 SAS labels file.

In [21]:
with open("I94_SAS_Labels_Descriptions.SAS") as f:
    read_lines = f.readlines()
code_country = {}
for countries in read_lines[10:298]:
    values = countries.split('=')
    code, country = values[0].strip(), values[1].strip().strip("'")
    code_country[code] = country

In [22]:
countryColumns = ["Code","Country"]
df_code_country_dim=spark.createDataFrame(data=list(code_country.items()),schema=countryColumns)
df_code_country_dim.limit(5).toPandas()

| | Code | Country |
|---|---|---|
| 0 | 236 | AFGHANISTAN |
| 1 | 101 | ALBANIA |
| 2 | 316 | ALGERIA |
| 3 | 102 | ANDORRA |
| 4 | 324 | ANGOLA |
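
The parsing step above splits each label line on `=` and strips the quotes. A minimal standalone sketch of the same idea is shown here; the helper name `parse_code_labels` is hypothetical, and the sample lines only assume the `code = 'label'` layout implied by the parsing code and the preview table, they are not copied from the labels file.

```python
def parse_code_labels(lines):
    """Parse lines shaped like  236 =  'AFGHANISTAN'  into a {code: label} dict."""
    mapping = {}
    for line in lines:
        # partition() splits on the first '=' only, so a label that itself
        # contains '=' is kept whole
        code, sep, label = line.partition("=")
        if not sep:
            continue  # skip lines that are not code/label pairs
        mapping[code.strip()] = label.strip().strip("'")
    return mapping


# Illustrative input (assumed layout), including one line that should be skipped
sample_lines = [
    "   236 =  'AFGHANISTAN'\n",
    "   101 =  'ALBANIA'\n",
    "/* not a code line */\n",
]
codes = parse_code_labels(sample_lines)
print(codes)
```

Skipping lines without `=` makes the parser less dependent on the exact slice `read_lines[10:298]` being correct.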


### Step 2: Explore and Assess the Data
#### Explore the Data 
Identify data quality issues, like missing values, duplicate data, etc.

#### Cleaning Steps
Document steps necessary to clean the data

## *Explore **Immigration** data*

In [30]:
utils.get_null_count(df_immig).T

| column | null count |
|---|---|
| cicid | 0 |
| i94cit | 0 |
| i94res | 0 |
| i94port | 0 |
| arrdate | 0 |
| i94mode | 239 |
| i94addr | 152592 |
| depdate | 142457 |
| i94bir | 802 |
| i94visa | 0 |


#### *Cleaning Steps - **Immigration** data*

In [33]:
# Drop columns that are mostly null or not needed in the data model
drop_cols=['i94yr','i94mon','dtadfile','visapost','occup',
           'count','entdepd','entdepu','entdepa','matflag','insnum','admnum']
df_immig=df_immig.drop(*drop_cols)
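
As a rough cross-check of which columns are mostly null, the null fraction of a column can be derived from the `summary("count")` output shown earlier: nulls = total rows minus non-null count. The sketch below uses the counts printed in this notebook; the helper name `null_fractions` is hypothetical, and only the columns visible in the truncated summary are included.

```python
TOTAL_ROWS = 3096313  # df_immig.count()

# Non-null counts copied from the summary("count") output above
non_null = {
    "i94mode": 3096074,
    "i94addr": 2943721,
    "entdepu": 392,
    "matflag": 2957884,
    "biryear": 3095511,
    "dtaddto": 3095836,
    "gender": 2682044,
    "insnum": 113708,
    "airline": 3012686,
    "fltno": 3076764,
}


def null_fractions(total, non_null_counts):
    """Return {column: fraction of rows that are null}."""
    return {name: (total - n) / total for name, n in non_null_counts.items()}


fractions = null_fractions(TOTAL_ROWS, non_null)
mostly_null = sorted(name for name, frac in fractions.items() if frac > 0.5)
print(mostly_null)  # ['entdepu', 'insnum']
```

Of the columns visible in the summary, only `entdepu` and `insnum` exceed the 50% threshold, so the other entries in `drop_cols` are presumably removed because the model does not use them.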

In [34]:
df_immig.printSchema()

root
 |-- cicid: double (nullable = true)
 |-- i94cit: double (nullable = true)
 |-- i94res: double (nullable = true)
 |-- i94port: string (nullable = true)
 |-- arrdate: double (nullable = true)
 |-- i94mode: double (nullable = true)
 |-- i94addr: string (nullable = true)
 |-- depdate: double (nullable = true)
 |-- i94bir: double (nullable = true)
 |-- i94visa: double (nullable = true)
 |-- biryear: double (nullable = true)
 |-- dtaddto: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- airline: string (nullable = true)
 |-- fltno: string (nullable = true)
 |-- visatype: string (nullable = true)



In [41]:
df_immig=df_immig.dropDuplicates(['cicid'])
df_immig.count()

3096313

In [42]:
df_immig=df_immig.dropna(how='all',subset=['cicid'])
df_immig.count()

3096313

> The row count is unchanged (3,096,313) after both steps, so the "cicid" column in the immigration dataset contains no duplicates and no null values.

## *Explore **Airport Codes** data*

In [37]:
df_airportcodes.count()

55075

In [46]:
utils.get_null_count(df_airportcodes).T

| column | null count |
|---|---|
| ident | 0 |
| name | 0 |
| iso_country | 0 |
| iso_region | 0 |


#### *Cleaning Steps - **Airport codes** data*

In [47]:
# dropping columns with missing data or not needed in the data model
code_cols=['iata_code','gps_code','municipality','elevation_ft','type',
          'continent','local_code','coordinates']
df_airportcodes=df_airportcodes.drop(*code_cols)
df_airportcodes.printSchema()

root
 |-- ident: string (nullable = true)
 |-- name: string (nullable = true)
 |-- iso_country: string (nullable = true)
 |-- iso_region: string (nullable = true)



In [48]:
df_airportcodes=df_airportcodes.dropna(how='all',subset=['iso_region','iso_country'])
df_airportcodes.count()

55075

> The row count is unchanged (55,075), so no row in the airport codes dataset is missing both "iso_region" and "iso_country".

## *Explore **US Demographics** data*

In [49]:
df_usdemo.count()

2891

In [50]:
utils.get_null_count(df_usdemo).T

| column | null count |
|---|---|
| City | 0 |
| State | 0 |
| Median Age | 0 |
| Male Population | 3 |
| Female Population | 3 |
| Total Population | 0 |
| Number of Veterans | 13 |
| Foreign-born | 13 |
| Average Household Size | 16 |
| State Code | 0 |


#### *Cleaning Steps - **US Demographics** data*

In [51]:
df_usdemo.printSchema()

root
 |-- City: string (nullable = true)
 |-- State: string (nullable = true)
 |-- Median Age: double (nullable = true)
 |-- Male Population: integer (nullable = true)
 |-- Female Population: integer (nullable = true)
 |-- Total Population: integer (nullable = true)
 |-- Number of Veterans: integer (nullable = true)
 |-- Foreign-born: integer (nullable = true)
 |-- Average Household Size: double (nullable = true)
 |-- State Code: string (nullable = true)
 |-- Race: string (nullable = true)
 |-- Count: integer (nullable = true)



In [52]:
df_usdemo=df_usdemo.dropna(how='all',subset=['State Code'])
df_usdemo.count()

2891

> The row count is unchanged (2,891), so the "State Code" column in the US Demographics dataset contains no null values.

## *Explore **country & code** data*

In [53]:
df_code_country_dim.count()

288

In [54]:
df_code_country_dim.dropDuplicates().count()

288

In [55]:
utils.get_null_count(df_code_country_dim).T

| column | null count |
|---|---|
| Code | 0 |
| Country | 0 |


> No cleaning required for this data frame

### Step 3: Define the Data Model
#### 3.1 Conceptual Data Model
Map out the conceptual data model and explain why you chose that model

#### 3.2 Mapping Out Data Pipelines
List the steps necessary to pipeline the data into the chosen data model

### Step 4: Run Pipelines to Model the Data 
#### 4.1 Create the data model
Build the data pipelines to create the data model.

***Create the immigration fact table***

In [56]:
destination_path="output_tables/"

In [64]:
from pyspark.sql.functions import to_date

def immigration_fact_table(df,dest_data):
    # arrdate and depdate are SAS dates (days since 1960-01-01);
    # the UDF converts them to ISO date strings
    date_format = "%Y-%m-%d"
    get_datetime = udf(lambda x: x if x is None else (timedelta(days=x) + datetime(1960, 1, 1)).strftime(date_format))
    immig_df = df.withColumnRenamed('cicid','id') \
                 .withColumnRenamed('i94res','immig_residence_country_code') \
                 .withColumnRenamed('i94cit','immig_birth_country_code') \
                 .withColumnRenamed('i94port','port_of_entry') \
                 .withColumnRenamed('i94mode','mode_of_transportation') \
                 .withColumnRenamed('i94addr','arrival_state_code') \
                 .withColumnRenamed('i94bir','age') \
                 .withColumnRenamed('i94visa','visa_code') \
                 .withColumnRenamed('biryear','birth_year')
    
    # dtaddto is a string column, not a SAS day count, so it is parsed directly
    # (assumes the MMddyyyy layout; values that do not parse become null)
    immig_df=immig_df.withColumn("arrival_date",get_datetime(col('arrdate'))) \
                     .withColumn("departure_date",get_datetime(col('depdate'))) \
                     .withColumn("date_until",to_date(col('dtaddto'),'MMddyyyy'))
    
    immig_df = immig_df.withColumn("arrival_date",immig_df["arrival_date"].cast(DateType())) \
                       .withColumn("departure_date",immig_df["departure_date"].cast(DateType()))
    
    
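    # write the fact table to parquet files; partition by arrival state rather than
    # by 'id', which is unique per row and would create one partition per record
    immig_df.write.mode("overwrite").partitionBy('arrival_state_code')\
            .parquet(path=dest_data + 'immigration_fact_table')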
    
    return immig_df

In [None]:
immigration_fact__table_df = immigration_fact_table(df_immig, destination_path)
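
The date handling in the fact table rests on two conversions: SAS day counts (days since 1960-01-01) for `arrdate` and `depdate`, and a date string for `dtaddto`. A plain-Python sketch of both follows. The helper names are hypothetical, and the `MMDDYYYY` layout of `dtaddto` is an assumption (under it, the `6292016` shown in the preview would be 06/29/2016).

```python
from datetime import date, datetime, timedelta

SAS_EPOCH = date(1960, 1, 1)  # SAS date values count days from this day


def sas_days_to_date(days):
    """Convert a SAS day count such as 20545.0 to a date; None stays None."""
    if days is None:
        return None
    return SAS_EPOCH + timedelta(days=int(days))


def parse_mmddyyyy(text):
    """Parse an MMDDYYYY string (assumed layout); return None if it is not a date."""
    if text is None:
        return None
    try:
        return datetime.strptime(text.strip(), "%m%d%Y").date()
    except ValueError:
        return None


print(sas_days_to_date(20545.0))   # 2016-04-01
print(parse_mmddyyyy("06292016"))  # 2016-06-29
print(parse_mmddyyyy("D/S"))       # None
```

The first result confirms that the `arrdate` value 20545 from the preview falls on 1 April 2016, consistent with the April 2016 extract.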

***Create airport code dimension table***

In [57]:
def airport_code_dim_table(df,dest_data):
    airport_df = df.filter(col("iso_country") == "US")
    airport_df = airport_df.withColumnRenamed('iso_country','country')\
                           .withColumnRenamed('iso_region','state_code')
    
    airport_df.write.parquet(dest_data + "airport_dim_table", mode="overwrite")
    return airport_df

In [58]:
airport_dim_table = airport_code_dim_table(df_airportcodes,destination_path)

***Create US Demographic dimension table***

In [59]:
def us_demo_dim_table(df,dest_data):
    us_demo_dim_df = df.select(col("City"),col("State"),col("State Code"),col("Total Population"))
    
    us_demo_dim_df =us_demo_dim_df.withColumnRenamed('State Code','state_code') \
                                  .withColumnRenamed('Total Population','total_population')
    
    us_demo_dim_df.write.parquet(dest_data + "us_demographic_dim_table", mode="overwrite")
    
    return us_demo_dim_df

In [60]:
us_demo_dim_df = us_demo_dim_table(df_usdemo,destination_path)

***Create Travel dimension table***

In [61]:
def travel_dim_table(df,dest_data):
    travel_dim_df = df.select(col("cicid"),col("visatype"),col("airline"),col("fltno"))
    
    travel_dim_df = travel_dim_df.withColumnRenamed('cicid','id')\
                                 .withColumnRenamed('visatype','visa_type')\
                                 .withColumnRenamed('airline','airline_code')\
                                 .withColumnRenamed('fltno', 'flight_number')
    
    travel_dim_df.write.parquet(dest_data + "travel_dim_table", mode="overwrite")
    
    return travel_dim_df                                    

In [62]:
travel_dim_df = travel_dim_table(df_immig,destination_path)

***Create Country code dimension table***

In [63]:
def country_dim_table(df,dest_data):
    country_dim_df = df
    
    country_dim_df.write.parquet(dest_data + "country_code_dim_table", mode="overwrite")
    
    return country_dim_df

In [64]:
country_dim_df = country_dim_table(df_code_country_dim,destination_path)

#### 4.2 Data Quality Checks
Explain the data quality checks you'll perform to ensure the pipeline ran as expected. These could include:
 * Integrity constraints on the relational database (e.g., unique key, data type, etc.)
 * Unit tests for the scripts to ensure they are doing the right thing
 * Source/Count checks to ensure completeness
 
Run Quality Checks

In [None]:
# Perform quality checks here
utils.data_quality_check(destination_path,spark)
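
The implementation of `utils.data_quality_check` is not shown in this notebook. As a minimal sketch of the count check listed above, the hypothetical helper below flags tables with no rows. It works on plain row counts; in the pipeline these would presumably come from reading each parquet output and calling `.count()` on it. The zero count is made up to show a failing check.

```python
def find_empty_tables(row_counts):
    """Return the names of tables whose row count is zero or missing."""
    return [name for name, n in row_counts.items() if not n]


# Counts for the first two tables are taken from this notebook;
# the third is deliberately zero to illustrate a failure.
row_counts = {
    "immigration_fact_table": 3096313,
    "country_code_dim_table": 288,
    "travel_dim_table": 0,
}

failed = find_empty_tables(row_counts)
if failed:
    print("Data quality check failed for:", ", ".join(failed))
else:
    print("All tables contain rows")
```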

#### 4.3 Data dictionary 
Create a data dictionary for your data model. For each field, provide a brief description of what the data is and where it came from. You can include the data dictionary in the notebook or in a separate file.

### Step 5: Complete Project Write Up
* Clearly state the rationale for the choice of tools and technologies for the project.
* Propose how often the data should be updated and why.
* Write a description of how you would approach the problem differently under the following scenarios:
    * The data was increased by 100x.
    * The data populates a dashboard that must be updated on a daily basis by 7am every day.
    * The database needed to be accessed by 100+ people.