# Project Capstone
### Data Engineering Capstone Project

#### Project Summary
This project aims to give analysts and immigration staff the information they need to make decisions about the immigration process. It gives them access to the current and historical situation of the population that has entered the country, how it relates to demographic indicators, and how COVID-19 has behaved in the countries where visitors reside.

The project follows these steps:
* Step 1: Scope the Project and Gather Data
* Step 2: Explore and Assess the Data
* Step 3: Define the Data Model
* Step 4: Run ETL to Model the Data
* Step 5: Complete Project Write Up

In [1]:
# Do all imports and installs here
import re

import pandas as pd

In [2]:
#df = pd.read_sas('I94_SAS_Labels_Descriptions.SAS')
fname = '../../data/18-83510-I94-Data-2016/i94_apr16_sub.sas7bdat'
dfi = pd.read_sas(fname, 'sas7bdat', encoding="ISO-8859-1")

### Step 1: Scope the Project and Gather Data

#### Scope 
This project aims to give analysts and immigration staff the information they need to make decisions about the immigration process: the current and historical situation of the population that has entered the country, how it relates to demographic indicators, and how COVID-19 has behaved in the countries where visitors reside. To that end we built a data pipeline in Python using pandas and Spark. The data is taken from different sources, organized into a model and written to parquet files.

#### Describe and Gather Data 

* Immigration data comes from the <a href="https://www.trade.gov/national-travel-and-tourism-office">US National Travel and Tourism Office</a>
* US cities demographics data is provided with the project, as are the lookup data for countries, US states and i94ports, which are used to ensure data quality.
* COVID19 data by country comes from <a href="https://www.kaggle.com/datasets/jcsantiago/covid19-by-country-with-government-response">Kaggle</a>

In [3]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.sql.functions import sum, avg
spark = SparkSession.builder.\
config("spark.jars.repositories", "https://repos.spark-packages.org/").\
config("spark.jars.packages", "saurfang:spark-sas7bdat:2.0.0-s_2.11").\
enableHiveSupport().getOrCreate()

### Step 2: Explore and Assess the Data
#### Explore the Data 
Identify data quality issues, like missing values, duplicate data, etc.

#### Cleaning Steps
Document steps necessary to clean the data

In [4]:
#Parses the file I94_SAS_Labels_Descriptions.SAS dynamically: iterates over the lines of one group of labels, discards the entries that shouldn't be used, and collects the rest into a dictionary that is later loaded into a pandas Dataframe.
def build_sas_table(target,array_words,keytype):
    # target: name of the label group, expected at the end of the group's header line
    # array_words: regex patterns; descriptions matching any of them are skipped
    # keytype: 1 when the keys are quoted in the file, any other value for unquoted keys
    array_valids = {}
    with open('I94_SAS_Labels_Descriptions.SAS') as f:
        start = False
        for line in f:
            if not start:
                # Look for the header line of the requested group
                match_start = re.compile(r'%s$' %(target)).search(line)
            if match_start:
                start = True
                # A line ending in ';' closes the group
                match_end = re.compile(r';$').search(line)
                if match_end:
                    break
                # Capture the key (group 1) and the description (group 2)
                if keytype == 1:
                    match = re.compile(r"'([^=]*)'.*'(.*)'").search(line)
                else:
                    match = re.compile(r"([^=]*).*'(.*)'").search(line)
                if match:
                    i94port_invalid = False
                    for i in array_words:
                        if re.search(i,match[2]):
                            i94port_invalid = True
                            break
                    if not i94port_invalid:
                        array_valids[match[1]]=match[2].strip()
    return array_valids
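
To make the two key patterns in `build_sas_table` easier to follow, here is a minimal sketch that applies them to a few sample lines. The sample lines are hypothetical, written in the style of the label file rather than copied from it, and the variable names are only for illustration.

```python
import re

# Hypothetical sample lines, written in the style of the label file
sample_lines = [
    "   'ALC'\t=\t'ALCAN, AK             '",
    "   'XXX'\t=\t'No PORT Code (XXX)'",
    "   582 =  'MEXICO Air Sea, and Not Reported (I-94, no land arrivals)'",
]

quoted_key = re.compile(r"'([^=]*)'.*'(.*)'")  # pattern used when keytype == 1
bare_key = re.compile(r"([^=]*).*'(.*)'")      # pattern used otherwise

# Quoted key: code and description are both captured without their quotes
port_match = quoted_key.search(sample_lines[0])
port_code, port_name = port_match[1], port_match[2].strip()

# A description matching one of the exclusion patterns would be skipped
excluded = any(
    re.search(word, quoted_key.search(sample_lines[1])[2])
    for word in ['No PORT', 'Collapsed']
)

# Bare key: the captured key keeps the whitespace around the number
country_match = bare_key.search(sample_lines[2])
raw_country_code = country_match[1]
country_code = raw_country_code.strip()
```

With unquoted keys the captured key still carries the surrounding whitespace, which appears consistent with the padded `index` values shown for the countries table further down. Stripping the key before storing it is one possible way to get clean codes, if the joins need them.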

In [5]:
# Extract the data from I94_SAS_Labels_Descriptions.SAS to clean it and generate the Dataframe for dim_i94ports
dfi94port = pd.DataFrame.from_dict(build_sas_table("i94prtl",['No PORT','Collapsed'],1), orient='index', columns=['i94port_name'])

In [6]:
# Extract the data from I94_SAS_Labels_Descriptions.SAS to clean it and generate the Dataframe for dim_countries
country = pd.DataFrame.from_dict(build_sas_table("i94cntyl",['INVALID','Collapsed','No Country'],2), orient='index', columns=['country_name'])

In [7]:
# Extract the data from I94_SAS_Labels_Descriptions.SAS to clean it and generate the Dataframe for dim_us_states
us_states = pd.DataFrame.from_dict(build_sas_table("i94addrl",[],1), orient='index', columns=['us_states_codes'])

In [8]:
#Read the file covid19_by_country.csv
covid19_by_country = pd.read_csv('covid19_by_country.csv', sep=',')
covid19_by_country_table = spark.createDataFrame(covid19_by_country) 

In [16]:
#Read the file us-cities-demographics.csv
demographic = pd.read_csv('us-cities-demographics.csv', sep=';')
demographic_table = spark.createDataFrame(demographic) 

### Step 3: Define the Data Model
#### 3.1 Conceptual Data Model

<img src="../assets/ER_diagram.PNG" width="1250"/>

#### 3.2 Mapping Out Data Pipelines

* The file I94_SAS_Labels_Descriptions.SAS and the other sources (covid19_by_country.csv, us-cities-demographics.csv) are expected in the project's root directory.
* Parse the file I94_SAS_Labels_Descriptions.SAS dynamically by iterating over the lines of each group of data within the file, clean the data, and move it into a dictionary and then into a pandas Dataframe.
* Generate the TempView for dim_i94ports table.
* Generate the TempView for dim_countries table.
* Generate the TempView for dim_us_states table.
* Read the file covid19_by_country.csv and then generate the TempView for dim_covid_by_country table, grouping the data by country and keeping the most recent number of active cases.
* Read the file us-cities-demographics.csv and then generate the TempView for dim_demographic table, grouping the data by us_state and summing the foreignborn and total population.
* Read the folder where the immigration data is, and clean the data by joining on the dim_i94ports, dim_countries, dim_us_states and dim_demographic tables.
* Generate the TempView for dim_immigration table.
* Join the dim_immigration table on dim_covid_by_country table and group the result by year, month and us_state.
* Generate the TempView for fact_immigration table.
* Write the parquet files for the tables: dim_i94ports, dim_countries, dim_us_states, dim_demographic, dim_covid_by_country, dim_immigration and fact_immigration.
* Run quality checks to ensure the data was loaded properly.

### Step 4: Run Pipelines to Model the Data 
#### 4.1 Create the data model
Build the data pipelines to create the data model.

In [8]:
#Generate TempView for dim_i94ports table
dfi94port.reset_index(drop=False,inplace=True)
i94ports_table = spark.createDataFrame(dfi94port) 
i94ports_table.createOrReplaceTempView("i94ports_table")
i94ports_table.show(3)

+-----+--------------------+
|index|        i94port_name|
+-----+--------------------+
|  ALC|           ALCAN, AK|
|  ANC|       ANCHORAGE, AK|
|  BAR|BAKER AAF - BAKER...|
+-----+--------------------+
only showing top 3 rows



In [9]:
#Generate TempView for dim_countries table
country.reset_index(drop=False,inplace=True)
countries_table = spark.createDataFrame(country) 
countries_table.createOrReplaceTempView("countries_table")
countries_table.show(3)

+-------+--------------------+
|  index|        country_name|
+-------+--------------------+
|   582 |MEXICO Air Sea, a...|
|   236 |         AFGHANISTAN|
|   101 |             ALBANIA|
+-------+--------------------+
only showing top 3 rows



In [10]:
#Generate TempView for dim_us_states table
us_states.reset_index(drop=False,inplace=True)
us_states_table = spark.createDataFrame(us_states) 
us_states_table.createOrReplaceTempView("us_states_table")
us_states_table.show(3)

+-----+---------------+
|index|us_states_codes|
+-----+---------------+
|   AL|        ALABAMA|
|   AK|         ALASKA|
|   AZ|        ARIZONA|
+-----+---------------+
only showing top 3 rows



In [None]:
#*****DIM COVID19 TABLE*****
#Generate TempView for dim_covid_by_country table grouped by country
covid19_table_grouped = covid19_by_country_table.select("Country","confirmed_inc","Date") \
                    .groupBy(col("Country").alias("country")) \
                    .agg({"Date":"last","confirmed_inc":"last"}) \
                    .withColumnRenamed('last(Date)','date') \
                    .withColumnRenamed('last(confirmed_inc)','confirmed_inc')
covid19_table_grouped.createOrReplaceTempView("covid19_table_grouped")
covid19_table_grouped.show(3)
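
One caveat about the cell above: Spark documents the `last` aggregate as non-deterministic, because its result depends on row order, which is not guaranteed after a shuffle. The sketch below shows one way to state "latest record per country" explicitly in plain SQL. It runs on SQLite from the standard library so it can be tried without a Spark session. The rows are made up, the dates are assumed to be ISO-formatted strings, and whether the same SELECT behaves identically through `spark.sql` on a temp view is an assumption that would need checking.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE covid (country TEXT, date TEXT, confirmed_inc INTEGER)")
# Made-up rows, deliberately inserted out of date order
conn.executemany(
    "INSERT INTO covid VALUES (?, ?, ?)",
    [
        ("Mexico", "2020-05-02", 30),
        ("Mexico", "2020-05-01", 10),
        ("Albania", "2020-05-01", 5),
        ("Mexico", "2020-05-03", 20),
    ],
)

# Keep, for each country, the row whose date equals that country's maximum date
latest_per_country = conn.execute(
    """
    SELECT c.country, c.date, c.confirmed_inc
    FROM covid AS c
    JOIN (
        SELECT country, MAX(date) AS max_date
        FROM covid
        GROUP BY country
    ) AS m
      ON c.country = m.country AND c.date = m.max_date
    ORDER BY c.country
    """
).fetchall()
conn.close()
```

The result no longer depends on the order in which the rows arrive, only on the date values themselves.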

In [17]:
#Generate TempView for dim_demographic table grouped by US state code
demographic_table_grouped = demographic_table.select("State Code","Foreign-born","Total Population","Median Age") \
                    .groupBy(col("State Code").alias("statecode")) \
                    .agg({"Foreign-born":"sum","Total Population":"sum","Median Age":"mean"}) \
                    .withColumnRenamed('sum(Foreign-born)','foreignborn') \
                    .withColumnRenamed('sum(Total Population)','totalpopulation') \
                    .withColumnRenamed('avg(Median Age)','medianage') 
demographic_table_grouped.createOrReplaceTempView("demographic_table_grouped")
demographic_table_grouped.show(3)

+---------+-----------------+---------------+-----------+
|statecode|        medianage|totalpopulation|foreignborn|
+---------+-----------------+---------------+-----------+
|       AZ|35.03750000000001|       22497710|  3411565.0|
|       SC|33.82500000000001|        2586976|   134019.0|
|       LA|34.62500000000001|        6502975|   417095.0|
+---------+-----------------+---------------+-----------+
only showing top 3 rows



In [None]:
#Generate the TempView for dim_immigration table joining on dim_countries, dim_i94ports, dim_us_states and dim_demographic 
df_immigration = spark.read.format('com.github.saurfang.sas.spark').load('/data/18-83510-I94-Data-2016/i94_apr16_sub.sas7bdat')
df_immigration.createOrReplaceTempView("immigration_table_temp")
immigration_table_temp = spark.sql("""
    SELECT i94yr, i94mon, i94res, i94port, i94addr,
           countries_table.country_name,
           demographic_table_grouped.foreignborn
    FROM immigration_table_temp
    JOIN countries_table ON (immigration_table_temp.i94res = countries_table.index)
    JOIN i94ports_table ON (immigration_table_temp.i94port = i94ports_table.index)
    JOIN us_states_table ON (immigration_table_temp.i94addr = us_states_table.index)
    LEFT JOIN demographic_table_grouped ON (immigration_table_temp.i94addr = demographic_table_grouped.statecode)
    """)
immigration_table_temp.show(3)
immigration_table_temp.createOrReplaceTempView("immigration_table_temp")

In [None]:
#Join the dim_immigration table on dim_covid_by_country to get the data for the fact table
immigration_table_temp = spark.sql("""
    SELECT i94yr, i94mon, i94res, i94port, i94addr,country_name,foreignborn,
            covid19_table_grouped.confirmed_inc
    FROM immigration_table_temp
    LEFT JOIN covid19_table_grouped ON (LOWER(immigration_table_temp.country_name) LIKE '%' || LOWER(covid19_table_grouped.country) || '%')
    """)
immigration_table_temp.show(3)
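
The `LIKE` condition in the join above matches every COVID19 country whose name is a substring of `country_name`, so a single immigration row can pair with more than one country. The following sketch reproduces the same join condition on SQLite (standard library) with made-up rows. The table and column names mirror the notebook, but the data is purely illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE immigration (country_name TEXT)")
conn.execute("CREATE TABLE covid (country TEXT)")
# Made-up rows for illustration only
conn.executemany(
    "INSERT INTO immigration VALUES (?)",
    [("NIGERIA",), ("MEXICO Air Sea, and Not Reported",), ("ALBANIA",)],
)
conn.executemany(
    "INSERT INTO covid VALUES (?)",
    [("Niger",), ("Nigeria",), ("Mexico",)],
)

# Same join condition as in the notebook cell
matches = conn.execute(
    """
    SELECT i.country_name, c.country
    FROM immigration AS i
    LEFT JOIN covid AS c
      ON LOWER(i.country_name) LIKE '%' || LOWER(c.country) || '%'
    ORDER BY i.country_name, c.country
    """
).fetchall()
conn.close()
```

Here `NIGERIA` pairs with both `Niger` and `Nigeria`, while `ALBANIA` has no match and keeps a NULL. If duplicates like this matter for the fact table, an exact match on a cleaned country name may be the safer choice.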

In [None]:
#Generate the TempView for fact_immigration table grouped by year, month and us_states
fact_immigration_table = immigration_table_temp.groupBy("i94yr","i94mon","i94addr") \
                    .agg({"foreignborn":"last","confirmed_inc":"last"}) \
                    .withColumnRenamed('last(foreignborn)','foreignborn') \
                    .withColumnRenamed('last(confirmed_inc)','confirmed_inc')
#createOrReplaceTempView returns None, so the grouped Dataframe is kept in its own variable
fact_immigration_table.createOrReplaceTempView("fact_immigration_table")
fact_immigration_table.show()

In [12]:
#Write dim_i94ports into parquet file
i94ports_table.write.parquet("output_data/dim_i94ports/dim_i94ports.parquet")

In [13]:
#Write dim_countries into parquet file
countries_table.write.parquet("output_data/dim_countries/dim_countries.parquet")

In [14]:
#Write dim_us_states into parquet file
us_states_table.write.parquet("output_data/dim_us_states/dim_us_states.parquet")

In [None]:
#Write dim_covid_by_country into parquet file
covid19_table_grouped.write.partitionBy("country").parquet("output_data/dim_covid_by_country/dim_covid_by_country.parquet")

In [None]:
#Write dim_demographic into parquet file
demographic_table_grouped.write.parquet("output_data/dim_demographic/dim_demographic.parquet")

In [None]:
#Write dim_immigration into parquet file
immigration_table_temp.write.partitionBy("i94addr").parquet("output_data/dim_immigration/dim_immigration.parquet")

In [None]:
#Write fact_immigration into parquet file
fact_immigration_table.write.partitionBy("i94addr").parquet("output_data/fact_immigration/fact_immigration.parquet")

#### 4.2 Data Quality Checks
Explain the data quality checks you'll perform to ensure the pipeline ran as expected. These could include:
 * Integrity constraints on the relational database (e.g., unique key, data type, etc.)
 * Unit tests for the scripts to ensure they are doing the right thing
 * Source/Count checks to ensure completeness
 
Run Quality Checks

In [23]:
# Perform quality checks here
print(f"Total records for dim_i94ports table is: {i94ports_table.count()}")
print(f"Total records for dim_countries table is: {countries_table.count()}")
print(f"Total records for dim_us_states table is: {us_states_table.count()}")
print(f"Total records for dim_covid_by_country table is: {covid19_table_grouped.count()}")
print(f"Total records for dim_demographic table is: {demographic_table_grouped.count()}")
print(f"Total records for dim_immigration table is: {immigration_table_temp.count()}")
print(f"Total records for fact_immigration table is: {fact_immigration_table.count()}")

Total records for dim_i94ports table is: 588
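
Printing the counts leaves the completeness check to the reader. As a minimal sketch of the count check described above, the helper below fails loudly when a table is empty. The name `check_row_counts` is hypothetical, and in the notebook the dictionary would be filled from the Spark Dataframes (for example `{"dim_i94ports": i94ports_table.count()}`). The zero used here is made up to trigger the failure.

```python
def check_row_counts(row_counts):
    """Raise ValueError if any table has zero rows; return the counts otherwise."""
    empty = sorted(name for name, count in row_counts.items() if count == 0)
    if empty:
        raise ValueError(
            f"Data quality check failed, empty tables: {', '.join(empty)}"
        )
    return row_counts


# 588 is the count printed above; the 0 is a made-up value to show a failure
sample_counts = {"dim_i94ports": 588, "dim_countries": 0}
try:
    check_row_counts(sample_counts)
    failure_message = None
except ValueError as exc:
    failure_message = str(exc)
```

Raising an exception, instead of only printing, lets a scheduler or test runner mark the pipeline run as failed.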


#### 4.3 Data dictionary 

dim_demographic
* statecode: US States code
* medianage: Mean of the city-level median ages reported for that US State
* totalpopulation: Sum of the Total Population values reported for that US State
* foreignborn: Foreign born population in that specific US State

dim_immigration
* i94yr: 4 digits year
* i94mon: Numeric month
* i94port: Port of entry code
* i94res: Country code where the immigrant resides
* i94addr: US State
* country_name: Country name where the immigrant resides
* foreignborn: Foreign born population in that specific US State
* confirmed_inc: COVID19 confirmed_inc value from the latest record kept for the country where the immigrant resides

dim_i94ports
* index: Port of entry code
* i94port_name: Port of entry name

dim_countries
* index: Country code
* country_name: Country name

dim_us_states
* index: US State code
* us_states_codes: US State name

dim_covid_by_country
* country: Country name
* date: Date of statistics
* confirmed_inc: Confirmed cases

fact_immigration
* i94yr: 4 digits year
* i94mon: Numeric month
* i94addr: US State
* foreignborn: Foreign born population in that specific US State
* confirmed_inc: COVID19 confirmed_inc value from the latest record kept for the country where the immigrant resides

### Step 5: Complete Project Write Up
* Data should be updated daily, because immigration events and COVID19 active cases change every day. Demographic information could be updated yearly.
* How the approach would change under the following scenarios:
 * If the data were increased by 100x, it would be advisable to move the pipeline to AWS EMR.
 * If the data populates a dashboard that must be updated by 7am every day, it would be advisable to schedule the pipeline with Airflow and implement the same code in Operators.
 * If the database needed to be accessed by 100+ people, it would be advisable to load the model into AWS Redshift.