# US Visitors Data Warehouse
### Data Engineering Capstone Project

#### Project Summary
The project takes immigration data and performs ETL so that the data can be analysed further. The process uses Airflow and Spark to coordinate retrieval of the data and its transformation into fact and dimension tables. These are stored in Amazon Redshift, so that a backend web service can query them and serve insights into the dataset on request.

The project follows these steps:
* Step 1: Scope the Project and Gather Data
* Step 2: Explore and Assess the Data
* Step 3: Define the Data Model
* Step 4: Run ETL to Model the Data
* Step 5: Complete Project Write Up

In [1]:
import os
import pandas as pd
from datetime import datetime

from helper.util import convert_sas_date, convert_integer

### Step 1: Scope the Project and Gather Data

#### Scope 
Explain what you plan to do in the project in more detail. What data do you use? What does your end solution look like? What tools did you use?

#### Describe and Gather Data 
Describe the data sets you're using. Where did it come from? What type of information is included? 

***

__Immigration Data__

For decades, U.S. immigration officers issued the I-94 Form (Arrival/Departure Record) to foreign visitors (e.g., business visitors, tourists and foreign students) who lawfully entered the United States. The I-94 was a small white paper form that a foreign visitor received from cabin crews on arrival flights and from U.S. Customs and Border Protection at the time of entry into the United States. It listed the traveler's immigration category, port of entry, date of entry into the United States and status expiration date, and had a unique 11-digit identifying number assigned to it. Its purpose was to record the traveler's lawful admission to the United States.

_Each report contains international visitor arrival statistics by world regions and select countries (including top 20), type of visa, mode of transportation, age groups, states visited (first intended address only), and the top ports of entry (for select countries)._
_Data sources include:_
* _Overseas DHS/CBP I-94 Program data_
* _Canadian visitation data (Stats Canada)_
* _Mexican visitation data (Banco de Mexico)_

There is a file for each month of 2016 in the directory `../../data/18-83510-I94-Data-2016/`, in the [SAS](https://www.sas.com/en_us/home.html) binary storage format `sas7bdat`. Combined, the 12 datasets have more than 40 million rows (40,790,529) and 28 columns.

To keep things simple, we work with the month of April only. That dataset has more than three million records (3,096,313).

In [2]:
immigration_fname = 'data/i94_apr16_sub.sas7bdat'
immigration = pd.read_sas(immigration_fname, 'sas7bdat', encoding="ISO-8859-1")
#immigration = pd.read_csv('data/201604.csv')

In [90]:
immigration.head()

Unnamed: 0.1,Unnamed: 0,cicid,i94yr,i94mon,i94cit,i94res,i94port,arrdate,i94mode,i94addr,...,airline,admnum,fltno,visatype,i94port_desc,i94cit_desc,i94res_desc,i94mode_desc,i94addr_desc,i94visa_desc
0,0,6,2016,4,692,692,XXX,20573,,,...,,1897628000.0,,B2,NOT REPORTED/UNKNOWN,ECUADOR,ECUADOR,,,Pleasure
1,1,7,2016,4,254,276,ATL,20551,1.0,AL,...,,3736796000.0,296.0,F1,"ATLANTA, GA",,SOUTH KOREA,Air,ALABAMA,Student
2,2,15,2016,4,101,101,WAS,20545,1.0,MI,...,OS,666643200.0,93.0,B2,WASHINGTON DC,ALBANIA,ALBANIA,Air,MICHIGAN,Pleasure
3,3,16,2016,4,101,101,NYC,20545,1.0,MA,...,AA,92468460000.0,199.0,B2,"NEW YORK, NY",ALBANIA,ALBANIA,Air,MASSACHUSETTS,Pleasure
4,4,17,2016,4,101,101,NYC,20545,1.0,MA,...,AA,92468460000.0,199.0,B2,"NEW YORK, NY",ALBANIA,ALBANIA,Air,MASSACHUSETTS,Pleasure


In [9]:
immigration = convert_integer(immigration, ['cicid', 'i94yr', 'i94mon', 'i94cit', 'i94res', \
                              'arrdate', 'i94mode', 'i94bir', 'i94visa', 'count', 'biryear', 'dtadfile', 'depdate'])

In [75]:
# Read the port lookup once and build an ID -> port name mapping.
port_lookup = pd.read_csv("lookup/I94PORT.csv")
port = dict(zip(port_lookup["ID"], port_lookup["Port"]))
immigration["i94port_desc"] = immigration["i94port"].map(port, na_action='ignore')

In [78]:
# Read the country lookup once; it serves both citizenship and residence codes.
country_lookup = pd.read_csv("lookup/I94CIT_I94RES.csv")
countries = dict(zip(country_lookup["Code"], country_lookup["I94CTRY"]))
immigration["i94cit_desc"] = immigration["i94cit"].map(countries, na_action='ignore')
immigration["i94res_desc"] = immigration["i94res"].map(countries, na_action='ignore')

In [81]:
# Read the transport mode lookup once and build an ID -> mode mapping.
mode_lookup = pd.read_csv("lookup/I94MODE.csv")
modes = dict(zip(mode_lookup["ID"], mode_lookup["Mode"]))
immigration["i94mode_desc"] = immigration["i94mode"].map(modes, na_action='ignore')

In [85]:
# Read the state lookup once and build a code -> state name mapping.
addr_lookup = pd.read_csv("lookup/I94ADDR.csv")
addrs = dict(zip(addr_lookup["Code"], addr_lookup["State"]))
immigration["i94addr_desc"] = immigration["i94addr"].map(addrs, na_action='ignore')

In [88]:
# Read the visa category lookup once and build an ID -> type mapping.
visa_lookup = pd.read_csv("lookup/I94VISA.csv")
visas = dict(zip(visa_lookup["ID"], visa_lookup["Type"]))
immigration["i94visa_desc"] = immigration["i94visa"].map(visas, na_action='ignore')
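The five lookup cells above all follow the same pattern: read a two-column CSV and turn it into a code-to-description dictionary. As a minimal sketch, that pattern could be factored into one helper. The name `load_lookup` and its parameters are hypothetical and not part of this project; the sketch uses the standard `csv` module rather than pandas, and the inline sample stands in for a file such as `lookup/I94MODE.csv`.

```python
import csv
import io
from typing import Callable, Dict, TextIO


def load_lookup(
    handle: TextIO,
    key_col: str,
    value_col: str,
    key_type: Callable[[str], object] = str,
) -> Dict[object, str]:
    """Build a {code: description} dict from an open CSV file.

    key_type converts the code column, e.g. int for numeric codes, so the
    keys match the dtype of the column being mapped.
    """
    reader = csv.DictReader(handle)
    return {key_type(row[key_col]): row[value_col].strip() for row in reader}


# Inline sample (assumed layout: a header row followed by code,description pairs).
sample = io.StringIO("ID,Mode\n1,Air\n2,Sea\n")
print(load_lookup(sample, "ID", "Mode", key_type=int))  # {1: 'Air', 2: 'Sea'}
```

With a real file, the handle would come from `open("lookup/I94MODE.csv", newline="")`, and the resulting dictionary can be passed to `Series.map` exactly as in the cells above.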

__Data Dictionary__: Here, we describe the various fields of the dataset:

| Column Name | Description |
| :--- | :--- |
| CICID* | ID that uniquely identifies one record in the dataset |
| I94YR | 4 digit year |
| I94MON | Numeric month |
| I94CIT | 3 digit code of the visitor's country of citizenship |
| I94RES | 3 digit code of the visitor's country of residence |
| I94PORT | Port admitted through |
| ARRDATE | Arrival date in the USA |
| I94MODE | Mode of transportation (1 = Air; 2 = Sea; 3 = Land; 9 = Not reported) |
| I94ADDR | State of arrival |
| DEPDATE | Departure date |
| I94BIR | Age of Respondent in Years |
| I94VISA | Visa codes collapsed into three categories: (1 = Business; 2 = Pleasure; 3 = Student) |
| COUNT | Used for summary statistics |
| DTADFILE | Character Date Field |
| VISAPOST | Department of State post where the visa was issued |
| OCCUP | Occupation that will be performed in U.S. |
| ENTDEPA | Arrival Flag. Whether admitted or paroled into the US |
| ENTDEPD | Departure Flag. Whether departed, lost visa, or deceased |
| ENTDEPU | Update Flag. Update of visa, either apprehended, overstayed, or updated to PR |
| MATFLAG | Match flag |
| BIRYEAR | 4 digit year of birth |
| DTADDTO | Character date field to when admitted in the US |
| GENDER | Gender |
| INSNUM | INS number |
| AIRLINE | Airline used to arrive in U.S. |
| ADMNUM | Admission number, should be unique and not nullable |
| FLTNO | Flight number of Airline used to arrive in U.S. |
| VISATYPE | Class of admission legally admitting the non-immigrant to temporarily stay in U.S. |
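
The `ARRDATE` and `DEPDATE` columns hold SAS numeric dates, that is, a count of days since 1960-01-01. The value 20545 in the sample rows above, for instance, corresponds to 1 April 2016. The notebook imports `convert_sas_date` from `helper.util`, whose implementation is not shown here. The function below is a hypothetical stand-alone equivalent for illustration, not that helper.

```python
from datetime import date, timedelta
from typing import Optional

# SAS stores dates as the number of days elapsed since this epoch.
SAS_EPOCH = date(1960, 1, 1)


def sas_days_to_date(days: Optional[float]) -> Optional[date]:
    """Convert a SAS day count to a calendar date; None or NaN yields None."""
    if days is None or days != days:  # NaN is the only value not equal to itself
        return None
    return SAS_EPOCH + timedelta(days=int(days))


print(sas_days_to_date(20545))  # 2016-04-01
```

Missing departure dates are common in arrival records, which is why the sketch returns `None` rather than raising on a missing value.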

***

__Global Temperature Data__

A range of organizations collate climate trend data. The three most cited land and ocean temperature datasets are NOAA’s MLOST, NASA’s GISTEMP and the UK’s HadCrut.

Berkeley Earth, which is affiliated with Lawrence Berkeley National Laboratory, has repackaged the data from a newer compilation. The Berkeley Earth Surface Temperature Study combines 1.6 billion temperature reports from 16 pre-existing archives. It is nicely packaged and allows for slicing into interesting subsets (for example, by country). They publish the source data and the code for the transformations they applied. They also use methods that allow weather observations from shorter time series to be included, meaning fewer observations need to be thrown away.

In the original dataset from [Kaggle](https://www.kaggle.com/berkeleyearth/climate-change-earth-surface-temperature-data), several files are available, but in this capstone project we use only `GlobalLandTemperaturesByCity`.

In [None]:
temperature_fname = 'GlobalLandTemperaturesByCity.csv'
world_temperature = pd.read_csv(temperature_fname)

In [None]:
world_temperature.head()

__Data Dictionary__

| Column Name | Description |
| :--- | :--- |
| dt | Date in format YYYY-MM-DD |
| AverageTemperature | Average temperature of the city on a given date |
| City | City Name |
| Country | Country Name |
| Latitude | Latitude |
| Longitude | Longitude |

The dataset covers a long period of the world's temperature (from 1743 to 2013). However, since the immigration dataset from the US National Tourism Office only covers 2016, most of the temperature records cannot be matched to it. We keep only the American cities' latitude and longitude fields to form a dimension table for cities. It would be interesting to join the two tables to analyse how waves of immigration to the US relate to changes in temperature, but this is not feasible because the date ranges do not overlap.

In [None]:
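# Select several columns after groupby with a list (double brackets); a bare tuple of names is rejected by recent pandas.
us_cities = world_temperature[world_temperature.Country == "United States"].groupby(["Country", "City"])[["Latitude", "Longitude"]].agg('first').reset_index()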

In [None]:
us_cities = world_temperature[world_temperature.Country == "United States"]

In [None]:
us_cities = us_cities.groupby(["City"]).agg({"AverageTemperature": "mean", "Latitude": "first", "Longitude": "first"}).reset_index()

In [None]:
us_cities.head()
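
In the Kaggle file, `Latitude` and `Longitude` appear to be stored as strings with a hemisphere suffix (for example `32.95N` or `100.53W`) rather than as signed numbers. Assuming that format holds, a small helper can turn them into signed decimal degrees before they go into the cities dimension table. The function name `parse_coordinate` is hypothetical, and the format should be checked against `us_cities.head()` first.

```python
def parse_coordinate(value: str) -> float:
    """Convert a string such as '32.95N' or '100.53W' to signed decimal degrees.

    North and East are positive; South and West are negative.
    """
    value = value.strip()
    if not value:
        raise ValueError("empty coordinate")
    magnitude, hemisphere = float(value[:-1]), value[-1].upper()
    if hemisphere not in ("N", "S", "E", "W"):
        raise ValueError(f"unknown hemisphere suffix in {value!r}")
    return -magnitude if hemisphere in ("S", "W") else magnitude


print(parse_coordinate("32.95N"))   # 32.95
print(parse_coordinate("100.53W"))  # -100.53
```

In the notebook, this could be applied with `us_cities["Latitude"].map(parse_coordinate)` and likewise for `Longitude`.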

__Airports Data__

The airport codes may refer to either the [IATA](https://en.wikipedia.org/wiki/IATA_airport_code) airport code, a three-letter code used in passenger reservation, ticketing and baggage-handling systems, or the [ICAO](https://en.wikipedia.org/wiki/ICAO_airport_code) airport code, a four-letter code used by ATC systems and for airports that do not have an IATA airport code (from Wikipedia).

The data covers airport codes from around the world, downloaded from the public domain source http://ourairports.com/data/, which compiled it from multiple sources.

`airport-codes.csv` contains the list of all airport codes; the attributes are identified in the datapackage description. Some of the columns contain attributes identifying airport locations and other codes (IATA, local if they exist) that are relevant to identifying an airport.
The original source URL is http://ourairports.com/data/airports.csv (stored in archive/data.csv).

In [None]:
airport = pd.read_csv("airport-codes_csv.csv")

In [None]:
airport.head()

In [None]:
us_airports = airport[airport.iso_country == "US"]

In [None]:
us_airports.head()
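
To relate airports to the `i94addr` state codes in the immigration data, the two-letter state can be taken from each airport's region code. The sketch below assumes the file has an `iso_region` column with values such as `US-CA`. Both the column name and the format are assumptions to verify against `us_airports.head()`, and the function name is hypothetical.

```python
from typing import Optional


def state_from_iso_region(iso_region: object) -> Optional[str]:
    """Return the state part of a region code like 'US-CA', or None if it does not apply."""
    # Missing values read by pandas arrive as float NaN, so reject non-strings.
    if not isinstance(iso_region, str) or not iso_region:
        return None
    country, separator, region = iso_region.partition("-")
    if country != "US" or not separator or not region:
        return None
    return region


print(state_from_iso_region("US-CA"))  # CA
```

Applied with `us_airports["iso_region"].map(state_from_iso_region)`, this would give a column that lines up with the state codes used by `i94addr`.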

__U.S. City Demographic Data__

This dataset contains information about the demographics of all US cities and census-designated places with a population greater than or equal to 65,000. This data comes from the US Census Bureau's 2015 American Community Survey.

This product uses the Census Bureau Data API but is not endorsed or certified by the Census Bureau.

In [None]:
us_cities_demographics = pd.read_csv("us-cities-demographics.csv", sep=";")

In [None]:
us_cities_demographics.head()

In [None]:
us_cities_demographics[us_cities_demographics.State == "California"]["Total Population"]
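
One caveat when aggregating `Total Population` by state: this file appears to hold one row per city and race group, so the same city total is repeated on several rows and a plain sum would overcount. The sketch below counts each city once. It assumes column names `City`, `State` and `Total Population`, and the rows are made-up placeholders for illustration, not real figures.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping


def state_population(rows: Iterable[Mapping[str, object]]) -> Dict[str, int]:
    """Sum 'Total Population' per state, counting each (City, State) pair once."""
    seen = set()
    totals: Dict[str, int] = defaultdict(int)
    for row in rows:
        city_key = (row["City"], row["State"])
        if city_key in seen:
            continue  # repeated row for another race group of the same city
        seen.add(city_key)
        totals[str(row["State"])] += int(row["Total Population"])
    return dict(totals)


# Placeholder rows: "Town A" appears twice, once per race group.
rows = [
    {"City": "Town A", "State": "State X", "Total Population": 100, "Race": "R1"},
    {"City": "Town A", "State": "State X", "Total Population": 100, "Race": "R2"},
    {"City": "Town B", "State": "State X", "Total Population": 50, "Race": "R1"},
]
print(state_population(rows))  # {'State X': 150}
```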

In [None]:
	
from pyspark.sql import SparkSession
# Start a Spark session with the spark-sas7bdat package, then load the April 2016 file.
spark = (
    SparkSession.builder
    .config("spark.jars.packages", "saurfang:spark-sas7bdat:2.0.0-s_2.11")
    .enableHiveSupport()
    .getOrCreate()
)
df_spark = spark.read.format('com.github.saurfang.sas.spark').load('../../data/18-83510-I94-Data-2016/i94_apr16_sub.sas7bdat')


In [None]:
# Write the data to parquet, then read it back
df_spark.write.parquet("sas_data")
df_spark=spark.read.parquet("sas_data")

### Step 2: Explore and Assess the Data
#### Explore the Data 
Identify data quality issues, like missing values, duplicate data, etc.

#### Cleaning Steps
Document steps necessary to clean the data

In [None]:
# Performing cleaning tasks here





### Step 3: Define the Data Model
#### 3.1 Conceptual Data Model
Map out the conceptual data model and explain why you chose that model

#### 3.2 Mapping Out Data Pipelines
List the steps necessary to pipeline the data into the chosen data model

### Step 4: Run Pipelines to Model the Data 
#### 4.1 Create the data model
Build the data pipelines to create the data model.

In [None]:
# Write code here

#### 4.2 Data Quality Checks
Explain the data quality checks you'll perform to ensure the pipeline ran as expected. These could include:
 * Integrity constraints on the relational database (e.g., unique key, data type, etc.)
 * Unit tests for the scripts to ensure they are doing the right thing
 * Source/Count checks to ensure completeness
 
Run Quality Checks

In [None]:
# Perform quality checks here
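
The checks listed above (unique key, non-null columns, source/count completeness) can be sketched in plain Python as follows. This is a minimal illustration over rows held as dictionaries, not the project's actual checks; the function name `run_quality_checks` and its parameters are hypothetical, and in the pipeline the same checks would more likely run as queries against the warehouse tables.

```python
from typing import Iterable, List, Mapping, Optional, Sequence


def run_quality_checks(
    rows: Iterable[Mapping[str, object]],
    key: str,
    required: Sequence[str] = (),
    expected_count: Optional[int] = None,
) -> List[str]:
    """Return a list of failure messages; an empty list means every check passed."""
    rows = list(rows)
    failures: List[str] = []

    # Completeness: the table must not be empty and, if known, must match the source count.
    if not rows:
        failures.append("table is empty")
    if expected_count is not None and len(rows) != expected_count:
        failures.append(f"expected {expected_count} rows, found {len(rows)}")

    # Integrity: the key column must be unique.
    keys = [row.get(key) for row in rows]
    if len(set(keys)) != len(keys):
        failures.append(f"duplicate values in key column '{key}'")

    # Integrity: required columns must not contain nulls.
    for column in required:
        missing = sum(1 for row in rows if row.get(column) is None)
        if missing:
            failures.append(f"{missing} null value(s) in column '{column}'")

    return failures


sample = [
    {"cicid": 6, "i94port": "XXX"},
    {"cicid": 7, "i94port": "ATL"},
]
print(run_quality_checks(sample, key="cicid", required=["i94port"], expected_count=2))  # []
```

Returning messages instead of raising on the first problem lets a single run report every failed check at once.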

#### 4.3 Data dictionary 
Create a data dictionary for your data model. For each field, provide a brief description of what the data is and where it came from. You can include the data dictionary in the notebook or in a separate file.

### Step 5: Complete Project Write Up
* Clearly state the rationale for the choice of tools and technologies for the project.
* Propose how often the data should be updated and why.
* Write a description of how you would approach the problem differently under the following scenarios:
    * The data was increased by 100x.
    * The data populates a dashboard that must be updated on a daily basis by 7am every day.
    * The database needed to be accessed by 100+ people.