# USA Migration
### Data Engineering Capstone Project

#### Project Summary

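We combine four datasets (I94 immigration records, world temperatures, US city demographics, and airport codes) in a Data Lake on AWS to explore how immigration to the USA relates to visa types, ports of entry, airlines, and countries of origin.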

The project follows these steps:
* Step 1: Scope the Project and Gather Data
* Step 2: Explore and Assess the Data
* Step 3: Define the Data Model
* Step 4: Run ETL to Model the Data
* Step 5: Complete Project Write Up


### Step 1: Scope the Project and Gather Data

#### Scope 

In this project, we show how to leverage the Cloud to create a Data Lake. We use the I94 Immigration data provided by Udacity, as well as other open data sources. The objective is to run analytics on the data and showcase the capabilities of a Serverless Infrastructure.

The Technologies we will be leveraging are:

  - **Storage**: AWS S3
  - **Infrastructure as Code**: Pulumi TypeScript SDK for AWS
  - **ETL Jobs**: AWS Glue and AWS Athena
  - **Analytics**: AWS Athena

The end solution will appear at the end of the notebook in the form of Visualizations based on Analytics Queries on our Data Lake.

#### Describe and Gather Data 

We use a total of 4 Data Sources. Udacity provides all of the data; the original source of each dataset is listed below.

  - **I94 Immigration Data**: Original comes from [US National Tourism and Trade Office](https://travel.trade.gov/research/reports/i94/historical/2016.html).
  - **World Temperature Data**: Original Source is [Kaggle](https://www.kaggle.com/berkeleyearth/climate-change-earth-surface-temperature-data).
  - **U.S. City Demographic Data**: Original comes from [OpenDataSoft](https://public.opendatasoft.com/explore/dataset/us-cities-demographics/export/).
  - **Airport Code Table**: Original comes from [Datahub](https://datahub.io/core/airport-codes#data).


In [1]:
# IMPORTS

import pandas as pd
from os import path 
from glob import glob
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
pd.set_option('display.max_rows', 500)

#### I94 Immigration Data

The 2016 immigration data contains information about each I94 Stamp, the Port of Entry, Flight Data (if applicable), and some minimal information about the Person. Udacity provides the data in SAS, Parquet, and CSV formats. All three contain mostly the same columns, with the CSV sample carrying one extra unnamed column. When working with Big Data, the file format can make a massive difference in cost. The SAS and CSV files are considerably larger than Parquet, so our first stage is a pipeline that transforms SAS to Parquet, and our queries then run against Parquet.

  - `I94YR`: 4 digit year.
  - `I94MON`: Numeric month.
  - `I94CIT & I94RES`: This format shows all the valid and invalid codes for processing.
  - `I94PORT`: This format shows all the valid and invalid codes for processing.
  - `ARRDATE`: the Arrival Date in the USA. It is a SAS numeric date field to which a permanent format has not been applied.
  - `I94MODE`: There are missing values as well as not reported (9).
    - 1 = Air
    - 2 = Sea
    - 3 = Land
    - 9 = Not Reported
  - `I94ADDR`: There are many invalid codes in this variable.
  - `DEPDATE`: the Departure Date from the USA. It is a SAS numeric date field to which a permanent format has not been applied.
  - `I94BIR`: Age of Respondent in Years.
  - `I94VISA`: Visa codes collapsed into three categories:
    - 1 = Business
    - 2 = Pleasure
    - 3 = Student
  - `COUNT`: Used for summary statistics.
  - `DTADFILE`: Character Date Field - Date added to I-94 Files - CIC does not use.
  - `VISAPOST`: Department of State where Visa was issued - CIC does not use.
  - `OCCUP`: Occupation that will be performed in U.S. - CIC does not use.
  - `ENTDEPA`: Arrival Flag - admitted or paroled into the U.S. - CIC does not use.
  - `ENTDEPD`: Departure Flag - Departed, lost I-94 or is deceased - CIC does not use.
  - `ENTDEPU`: Update Flag - Either apprehended, overstayed, adjusted to perm residence - CIC does not use.
  - `MATFLAG`: Match flag - Match of arrival and departure records.
  - `BIRYEAR`: 4 digit year of birth.
  - `DTADDTO`: Character Date Field - Date to which admitted to U.S. (allowed to stay until) - CIC does not use.
  - `GENDER`: Non-immigrant sex.
  - `INSNUM`: INS number.
  - `AIRLINE`: Airline used to arrive in U.S.
  - `ADMNUM`: Admission Number.
  - `FLTNO`: Flight number of Airline used to arrive in U.S.
  - `VISATYPE`: Class of admission legally admitting the non-immigrant to temporarily stay in U.S.
    - `B-1`: Visa Holders-Business
    - `B-2`: Visa Holders-Pleasure
    - `E-1`: Visa Holders-Treaty Trader
    - `E-2`: Visa Holders-Treaty Investor
    - `F-1`: Visa Holders-Students
    - `F-2`: Visa Holders-Family Members of Students
    - `I`: Visa Holders-Foreign Information Media
    - `M-1`: Visa Holders-Vocational Students
    - `M-2`: Visa Holders-Family Members of Vocational Students
    - `GMB`: Guam Visa Waiver-Business
    - `GMT`: Guam Visa Waiver-Tourist
    - `WB`: Visa Waiver-Business
    - `WT`: Visa Waiver-Pleasure
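
`ARRDATE` and `DEPDATE` are SAS date numbers, which count days from 1 January 1960. The following is a minimal sketch of the conversion using only the standard library. The helper name `sas_date_to_date` is ours and is not part of the dataset or of any library; the two input values are taken from the sample rows shown in this section.

```python
from datetime import date, timedelta
from typing import Optional

SAS_EPOCH = date(1960, 1, 1)  # SAS date values count days from this day


def sas_date_to_date(value: Optional[float]) -> Optional[date]:
    """Convert a SAS numeric date (days since 1960-01-01) to a date.

    Missing values (None or NaN) come back as None.
    """
    if value is None or value != value:  # NaN is the only value unequal to itself
        return None
    return SAS_EPOCH + timedelta(days=int(value))


print(sas_date_to_date(20789.0))  # 2016-12-01
print(sas_date_to_date(20547.0))  # 2016-04-03
```

Both results agree with the `dtadfile` values of the same rows (`20161201` and `20160403`). For a whole pandas column, `pd.to_datetime(df["arrdate"], unit="D", origin="1960-01-01")` should give the same dates.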

In [2]:
sas_data_parquet = glob('./sas-data/*.parquet')
immigration_data_full = glob('./immigration-data/18-83510-I94-Data-2016/*.sas7bdat')
immigration_data_sample = './immigration_data_sample.csv'

In [3]:
for chunk in pd.read_sas(immigration_data_full[0], format='sas7bdat', encoding="ISO-8859-1", chunksize=5):
    display(chunk.iloc[:2])
    print(chunk.columns.tolist())
    break

Unnamed: 0,cicid,i94yr,i94mon,i94cit,i94res,i94port,arrdate,i94mode,i94addr,depdate,i94bir,i94visa,count,dtadfile,visapost,occup,entdepa,entdepd,entdepu,matflag,biryear,dtaddto,gender,insnum,airline,admnum,fltno,visatype
0,46.0,2016.0,12.0,129.0,129.0,HOU,20789.0,1.0,TX,20802.0,46.0,1.0,1.0,20161201,MDD,,H,O,,M,1970.0,05262018,M,,RS,97554140000.0,7715,E2
1,56.0,2016.0,12.0,245.0,245.0,NEW,20789.0,1.0,OH,20835.0,28.0,3.0,1.0,20161201,BEJ,,U,O,,M,1988.0,D/S,F,,CA,90623720000.0,819,F1


['cicid', 'i94yr', 'i94mon', 'i94cit', 'i94res', 'i94port', 'arrdate', 'i94mode', 'i94addr', 'depdate', 'i94bir', 'i94visa', 'count', 'dtadfile', 'visapost', 'occup', 'entdepa', 'entdepd', 'entdepu', 'matflag', 'biryear', 'dtaddto', 'gender', 'insnum', 'airline', 'admnum', 'fltno', 'visatype']


In [4]:
immigration_data_df_from_parquet = pd.read_parquet(sas_data_parquet[0])
immigration_data_df_from_parquet.iloc[:2]

Unnamed: 0,cicid,i94yr,i94mon,i94cit,i94res,i94port,arrdate,i94mode,i94addr,depdate,i94bir,i94visa,count,dtadfile,visapost,occup,entdepa,entdepd,entdepu,matflag,biryear,dtaddto,gender,insnum,airline,admnum,fltno,visatype
0,459651.0,2016.0,4.0,135.0,135.0,ATL,20547.0,1.0,FL,20559.0,54.0,2.0,1.0,20160403,,,O,R,,M,1962.0,7012016,,,VS,55556250000.0,115,WT
1,459652.0,2016.0,4.0,135.0,135.0,ATL,20547.0,1.0,FL,20555.0,74.0,2.0,1.0,20160403,,,T,O,,M,1942.0,7012016,F,,VS,674406500.0,103,WT


In [5]:
immigration_data_df_from_sample = pd.read_csv(immigration_data_sample)
immigration_data_df_from_sample.iloc[:2]

Unnamed: 0.1,Unnamed: 0,cicid,i94yr,i94mon,i94cit,i94res,i94port,arrdate,i94mode,i94addr,depdate,i94bir,i94visa,count,dtadfile,visapost,occup,entdepa,entdepd,entdepu,matflag,biryear,dtaddto,gender,insnum,airline,admnum,fltno,visatype
0,2027561,4084316.0,2016.0,4.0,209.0,209.0,HHW,20566.0,1.0,HI,20573.0,61.0,2.0,1.0,20160422,,,G,O,,M,1955.0,7202016,F,,JL,56582670000.0,00782,WT
1,2171295,4422636.0,2016.0,4.0,582.0,582.0,MCA,20567.0,1.0,TX,20568.0,26.0,2.0,1.0,20160423,MTR,,G,R,,M,1990.0,10222016,M,,*GA,94362000000.0,XBLNG,B2
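
The SAS to Parquet stage described at the start of this section could be sketched as follows. This is an illustration under assumptions, not the pipeline we deployed: the function names, the part-file naming scheme, and the chunk size are hypothetical, and writing Parquet from pandas requires an engine such as `pyarrow` to be installed.

```python
from pathlib import Path


def part_path(sas_file: str, out_dir: str, part: int) -> Path:
    """Build the output path for one Parquet part of a SAS file."""
    stem = Path(sas_file).stem
    return Path(out_dir) / f"{stem}.part-{part:05d}.parquet"


def sas_to_parquet(sas_file: str, out_dir: str, chunksize: int = 100_000) -> int:
    """Stream a .sas7bdat file into Parquet part files; return the part count."""
    import pandas as pd  # imported here so the path helper has no dependencies

    Path(out_dir).mkdir(parents=True, exist_ok=True)
    # Same reader arguments as the exploration cell above, with larger chunks.
    reader = pd.read_sas(sas_file, format="sas7bdat",
                         encoding="ISO-8859-1", chunksize=chunksize)
    parts = 0
    for parts, chunk in enumerate(reader, start=1):
        # One file per chunk keeps memory use bounded for large inputs.
        chunk.to_parquet(part_path(sas_file, out_dir, parts), index=False)
    return parts
```

A call such as `sas_to_parquet(immigration_data_full[0], "./parquet-out")` would write one part file per chunk (the output directory is a placeholder). One caveat of chunked writes: a column that is entirely null in one chunk may be inferred with a different type than in another, so the schema may need to be set explicitly before the parts are queried together.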


#### World Temperature Data

World temperature data comes from Berkeley Earth by way of Kaggle. The city-level file loaded below has readings dating back to 1743. We will not be using the full dataset, as we're only interested in the relationships we can find with our Immigration data.

In [6]:
temperature_data = './temperature-data/GlobalLandTemperaturesByCity.csv'
temperature_data_df = pd.read_csv(temperature_data)
display(temperature_data_df.iloc[:2])

Unnamed: 0,dt,AverageTemperature,AverageTemperatureUncertainty,City,Country,Latitude,Longitude
0,1743-11-01,6.068,1.737,Århus,Denmark,57.05N,10.33E
1,1743-12-01,,,Århus,Denmark,57.05N,10.33E


#### US City Demographic Data

US Cities demographics data contains census information on places with a population of 65,000 or more, and it dates to 2015. The date matters: because this snapshot predates the 2016 Immigration data, we cannot infer any causal effect of the Immigration data on these demographics.

In [7]:
demographic_data = './us-cities-demographics.csv'
demographic_data_df = pd.read_csv(demographic_data, sep=';')
demographic_data_df.iloc[:2]

Unnamed: 0,City,State,Median Age,Male Population,Female Population,Total Population,Number of Veterans,Foreign-born,Average Household Size,State Code,Race,Count
0,Silver Spring,Maryland,33.8,40601.0,41862.0,82463,1562.0,30908.0,2.6,MD,Hispanic or Latino,25924
1,Quincy,Massachusetts,41.0,44129.0,49500.0,93629,4147.0,32935.0,2.39,MA,White,58723


#### Airport Code Table

Airport Codes come from Datahub.io, in a snapshot that dates to 2018. An airport code might be the IATA airport code, a three-letter code that could appear on the Immigration record, or the ICAO airport code, a four-letter code used by ATC systems. The Immigration records do not state which standard their airport-related columns follow, and that information is likely not available. This does not mean that the Immigration records cannot be joined with the Airport Codes, but the join is not straightforward, so we will likely keep all airport records.

In [8]:
airport_codes = './airport-codes.csv'
airport_codes_df = pd.read_csv(airport_codes)
display(airport_codes_df.iloc[:2])
display(airport_codes_df[(airport_codes_df.ident.str.contains('CA')) & (airport_codes_df.iso_region.str.contains('OH'))])
print(airport_codes_df.shape)

Unnamed: 0,ident,type,name,elevation_ft,continent,iso_country,iso_region,municipality,gps_code,iata_code,local_code,coordinates
0,00A,heliport,Total Rf Heliport,11.0,,US,US-PA,Bensalem,00A,,00A,"-74.93360137939453, 40.07080078125"
1,00AA,small_airport,Aero B Ranch Airport,3435.0,,US,US-KS,Leoti,00AA,,00AA,"-101.473911, 38.704022"


Unnamed: 0,ident,type,name,elevation_ft,continent,iso_country,iso_region,municipality,gps_code,iata_code,local_code,coordinates
26334,KCAK,medium_airport,Akron Canton Regional Airport,1228.0,,US,US-OH,Akron,KCAK,CAK,CAK,"-81.44219970703125, 40.916099548339844"


(55075, 12)
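
To make the IATA/ICAO distinction concrete, here is a small sketch that classifies a code by its shape alone. The function name and the patterns are our own simplification. Shape is only a hint: a three-letter value is not necessarily an assigned IATA code, which is part of why the join is not straightforward.

```python
import re

IATA_SHAPE = re.compile(r"[A-Z]{3}")  # three letters, e.g. CAK
ICAO_SHAPE = re.compile(r"[A-Z]{4}")  # four letters, e.g. KCAK


def code_shape(code: str) -> str:
    """Return 'iata', 'icao', or 'unknown' based only on the code's format."""
    code = code.strip().upper()
    if IATA_SHAPE.fullmatch(code):
        return "iata"
    if ICAO_SHAPE.fullmatch(code):
        return "icao"
    return "unknown"


print(code_shape("CAK"))   # iata
print(code_shape("KCAK"))  # icao
print(code_shape("00AA"))  # unknown
```

The Akron Canton row above shows both forms side by side: `KCAK` in `ident` and `gps_code`, and `CAK` in `iata_code`. Identifiers such as `00AA` match neither shape, which suggests `ident` mixes several kinds of code.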


### Step 2: Explore and Assess the Data

#### Summary

To create our model, we clean all datasets using a VM with Spark. Three of the Datasets yield Fact Tables: Demographic Facts, I94 Record Facts, and Temperature Reading Facts. The Dataset with Airport Codes is also useful, but it serves as a Dimension to locate our data. We analyze each Dataset further in its own Notebook:

  - [I94 Records](./cleaning_i94_records.ipynb)
  - [World Temperatures](./cleaning_world_temperatures.ipynb)
  - [Airport Codes and Demographics Data](./cleaning_us_demographics_and_airport_codes.ipynb)

For all datasets, we cast each column to the data type that most closely matches what the value really represents.

#### I94 Records
1. Remove `occup` from our staging table: few rows have an occupation, and a large share of the data is missing when matching visa types with occupations.
2. Remove other columns with more than 90% Null values (`entdepu`).
3. Remove `count` and `cicid` because they are internal variables.
4. Remove `insnum` because there is no source to join it with.
5. Transform all dates, handling multiple formats and treating incorrectly formatted values as missing.
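
The date handling in the last step can be sketched for the `dtaddto` column, whose sample values above include `05262018`, `7012016`, and `D/S`. The sketch assumes the format is month, day, year, and that a seven-digit value has lost its leading zero; both are our reading of the samples, not documented facts. The function name is hypothetical.

```python
from datetime import date, datetime
from typing import Optional


def parse_mmddyyyy(raw: Optional[str]) -> Optional[date]:
    """Parse an MMDDYYYY string; return None for anything that is not a date."""
    if raw is None:
        return None
    text = str(raw).strip()
    # Accept 7 digits as well, assuming a dropped leading zero in the month.
    if not text.isdigit() or len(text) not in (7, 8):
        return None
    try:
        return datetime.strptime(text.zfill(8), "%m%d%Y").date()
    except ValueError:  # e.g. month 13 or day 32
        return None


print(parse_mmddyyyy("05262018"))  # 2018-05-26
print(parse_mmddyyyy("7012016"))   # 2016-07-01
print(parse_mmddyyyy("D/S"))       # None
```

Returning `None` instead of raising keeps malformed values in the table as missing data, which matches the rule stated in the list above.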

#### World Temperature Data

- We validate latitudes and longitudes and augment our Dataset with State (in this case using Google Maps) to Join our I94 Records with World Temperature Data.
- We cache all [Google Geocoding API Responses](./google-maps/) for reproducible results.
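
As an illustration of the coordinate validation, the temperature file stores latitude and longitude as text with a hemisphere suffix, such as `57.05N` and `10.33E`. A minimal sketch of parsing and range-checking these values follows; the function names are ours, and the cleaning notebook may handle this differently.

```python
from typing import Optional


def _parse(text: str, positive: str, negative: str, limit: float) -> Optional[float]:
    """Turn '57.05N' style text into signed degrees, or None if invalid."""
    text = text.strip().upper()
    if len(text) < 2:
        return None
    hemisphere = text[-1]
    if hemisphere not in (positive, negative):
        return None
    try:
        magnitude = float(text[:-1])
    except ValueError:
        return None
    if not 0 <= magnitude <= limit:  # also rejects NaN and infinity
        return None
    return -magnitude if hemisphere == negative else magnitude


def parse_latitude(text: str) -> Optional[float]:
    return _parse(text, "N", "S", 90.0)


def parse_longitude(text: str) -> Optional[float]:
    return _parse(text, "E", "W", 180.0)


print(parse_latitude("57.05N"))   # 57.05
print(parse_longitude("10.33E"))  # 10.33
print(parse_latitude("95.00N"))   # None
```

Southern and western values come back negative, which is the signed-degree convention geocoding services generally expect.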

#### Airport Codes and Demographics Data

- We add State to Airport Codes and limit our Airports to US airports only. 
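
For the airport codes, the State can be read from the `iso_region` column, which holds values such as `US-PA`. A sketch of that step, with a hypothetical helper name, assuming that US regions always have the form `US-` followed by a two-letter state code:

```python
from typing import Optional


def us_state_code(iso_country: str, iso_region: str) -> Optional[str]:
    """Return the two-letter state for a US airport row, else None."""
    if iso_country != "US":
        return None
    prefix, _, state = iso_region.partition("-")
    if prefix != "US" or len(state) != 2 or not state.isalpha():
        return None
    return state.upper()


print(us_state_code("US", "US-PA"))  # PA
print(us_state_code("CA", "CA-ON"))  # None
```

Rows for which the helper returns `None` are either outside the US or carry a region that does not match the assumed pattern, so they can be filtered out in the same pass.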

After cleaning and augmenting each Dataset, we deploy it as a Staging Dataset in Parquet format to AWS S3. We create an AWS Glue Crawler to convert them into Tables to use in AWS Athena. 

#### Create a Crawler to Load the Staging Tables and use them from AWS Athena

In [4]:
!aws glue create-crawler --name dend-capstone --role service-role/AWSGlueServiceRole-dend --database-name capstone --targets "{ \"S3Targets\": [ { \"Path\": \"s3://claudiordgz-udacity-dend/capstone/\" } ] }"

In [10]:
!aws glue get-crawler --name dend-capstone --query "Crawler.Name"

"dend-capstone"


#### Run the Crawler to load the Data

In [12]:
!aws glue start-crawler --name dend-capstone
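
`start-crawler` returns as soon as the run has been requested, so the tables may not exist yet when the command finishes. One way to wait from Python is to poll the crawler state with `boto3`, as sketched below. The helper name and polling interval are our own choices; `get_crawler` and the `READY` state are part of the AWS Glue API.

```python
import time
from typing import Callable


def wait_for_crawler(glue, name: str, poll_seconds: float = 30.0,
                     sleep: Callable[[float], None] = time.sleep) -> int:
    """Poll until the crawler reports READY; return the number of polls."""
    polls = 0
    while True:
        polls += 1
        state = glue.get_crawler(Name=name)["Crawler"]["State"]
        if state == "READY":  # other states are RUNNING and STOPPING
            return polls
        sleep(poll_seconds)


# Usage, assuming boto3 is installed and AWS credentials are configured:
# import boto3
# glue = boto3.client("glue")
# glue.start_crawler(Name="dend-capstone")
# wait_for_crawler(glue, "dend-capstone")
```

The `sleep` parameter exists only so the loop can be exercised without waiting, for example with a stub client in a test.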

The end result is as follows:

<p align="center"><img src="./img/stagingTables.png" width="100%"/></p>


In [12]:
%%HTML
<style>
.flex-container {
  display: flex;
  flex-wrap: wrap;
}

.flex-container > div {
  width: 250px;
  margin: 10px;
  text-align: center;
  line-height: 75px;
  font-size: 30px;
}
</style>
<h2>And in AWS Athena</h2>
<div class="flex-container">
    <div><img src="./img/tablesLoadedInAthena.png"></div>
    <div><img src="./img/stagingI94Records.png"/></div>
    <div><img src="./img/stagingDemographicData.png"/></div>
    <div><img src="./img/stagingAirportCodes.png"/></div>
    <div><img src="./img/stagingWorldTemperatureData.png"/></div>
</div>

### Step 3: Define the Data Model
#### 3.1 Conceptual Data Model

We will create a Snowflake Schema using three Fact Tables named:

  - `fact_demographic`
  - `fact_i94_record`
  - `fact_temp_reading`
  
To support joining these tables we will create the following dimensions:

  - `dim_i94_citres_codes`
  - `dim_i94_model`
  - `dim_visa_type`
  - `dim_i94_port`
  - `dim_geography`
  - `dim_time`
  - `dim_person`
  - `dim_airline`

This schema will allow us to join across these tables and perform some analytics over the data.

The Data Model would look like the following:


![Capstone Data Model](./img/Dend-Capstone.png)
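
As an illustration of how one of these dimensions could be derived, a `dim_time` row can be computed from a single calendar date. The column names below are assumptions made for this sketch; the diagram above remains the reference for the actual model.

```python
from datetime import date
from typing import Dict, Union


def dim_time_row(day: date) -> Dict[str, Union[str, int]]:
    """Derive calendar attributes for one date (hypothetical column names)."""
    return {
        "date": day.isoformat(),
        "year": day.year,
        "quarter": (day.month - 1) // 3 + 1,
        "month": day.month,
        "day": day.day,
        "iso_week": day.isocalendar()[1],
        "weekday": day.isoweekday(),  # Monday = 1 ... Sunday = 7
    }


row = dim_time_row(date(2016, 12, 1))
print(row["year"], row["quarter"], row["month"], row["day"], row["weekday"])
# 2016 4 12 1 4
```

Precomputing these attributes once per date lets the fact tables store only a date key, while queries group by month or year through the dimension.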


For our business purpose, we will use the following questions:

  - What are the types of immigrants in the US? Relationship between profession and Visa Type.
  - Which Country has the most migration to/from the US? Provide a Visualization by Country that encodes the amount as color density.
  - Statistics (`AVG`, `MEDIAN`, `STDDEV`) per Port, Airport, Visa Type, Gender, Birth Year, and MIN MAX of Counts per Day, Month, Year.
  - Which airline brings most visitors per month?
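
To show the kind of aggregation the last question calls for, here is a pure-Python sketch over a handful of made-up records. In the project itself this would be a query in AWS Athena; the function name and the sample records are illustrative only, and ties are resolved in favor of the airline seen first.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def top_airline_per_month(records: Iterable[Dict]) -> Dict[int, Tuple[str, int]]:
    """Map each month to its (airline, arrivals) pair with the most arrivals."""
    counts = Counter(
        (int(r["i94mon"]), r["airline"]) for r in records if r.get("airline")
    )
    best: Dict[int, Tuple[str, int]] = {}
    for (month, airline), arrivals in counts.items():
        if month not in best or arrivals > best[month][1]:
            best[month] = (airline, arrivals)
    return best


# Made-up records that reuse airline codes seen in the samples above.
records = [
    {"i94mon": 4, "airline": "VS"},
    {"i94mon": 4, "airline": "VS"},
    {"i94mon": 4, "airline": "JL"},
    {"i94mon": 12, "airline": "RS"},
    {"i94mon": 12, "airline": None},  # records without an airline are skipped
]
print(top_airline_per_month(records))  # {4: ('VS', 2), 12: ('RS', 1)}
```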
  
  
Our process to create the data Model looks as follows:


![Data Pipeline Process](./img/DataPipelineProcess.png)



### Step 4: Run Pipelines to Model the Data 
#### 4.1 Create the data model
Build the data pipelines to create the data model.

In [None]:
# Write code here

#### 4.2 Data Quality Checks
Explain the data quality checks you'll perform to ensure the pipeline ran as expected. These could include:
 * Integrity constraints on the relational database (e.g., unique key, data type, etc.)
 * Unit tests for the scripts to ensure they are doing the right thing
 * Source/Count checks to ensure completeness
 
Run Quality Checks

In [None]:
# Perform quality checks here
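
A minimal sketch of what such checks could look like, assuming the rows of a table are available as Python dictionaries. The function names, the `port_code` key, and the sample rows are hypothetical; the checks cover completeness, unique keys, and missing values from the list above.

```python
from collections import Counter
from typing import Dict, List


def check_counts(source_rows: int, loaded_rows: int) -> List[str]:
    """Completeness: the loaded table should have as many rows as its source."""
    if source_rows != loaded_rows:
        return [f"row count mismatch: source={source_rows}, loaded={loaded_rows}"]
    return []


def check_unique(rows: List[Dict], key: str) -> List[str]:
    """Integrity: every value of the key column should appear exactly once."""
    counts = Counter(row[key] for row in rows)
    return [f"duplicate {key}: {value!r}" for value, n in counts.items() if n > 1]


def check_not_null(rows: List[Dict], column: str) -> List[str]:
    """Integrity: the column should have no missing values."""
    missing = sum(1 for row in rows if row.get(column) is None)
    return [f"{missing} null value(s) in {column}"] if missing else []


# Made-up rows containing one duplicate key and one missing value.
rows = [
    {"port_code": "ATL", "state": "GA"},
    {"port_code": "HOU", "state": "TX"},
    {"port_code": "ATL", "state": None},
]
problems = (check_counts(3, len(rows))
            + check_unique(rows, "port_code")
            + check_not_null(rows, "state"))
print(problems)
# ["duplicate port_code: 'ATL'", '1 null value(s) in state']
```

Collecting messages instead of raising on the first failure lets one pipeline run report every problem it finds.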

#### 4.3 Data dictionary 
Create a data dictionary for your data model. For each field, provide a brief description of what the data is and where it came from. You can include the data dictionary in the notebook or in a separate file.

### Step 5: Complete Project Write Up
* Clearly state the rationale for the choice of tools and technologies for the project.
* Propose how often the data should be updated and why.
* Write a description of how you would approach the problem differently under the following scenarios:
  * The data was increased by 100x.
  * The data populates a dashboard that must be updated on a daily basis by 7am every day.
  * The database needed to be accessed by 100+ people.