# US Demographic and Immigration Data

#### Project Summary

This project builds a Spark pipeline that loads US demographic and immigration data into a data lake on S3. It draws on three data sources:

* I94 Immigration Data
* USA Airport Codes
* USA City Demographic Data



# Imports and setup

In [2]:
import os
import configparser
import pandas
from pyspark.sql import SparkSession
from datetime import datetime

In [3]:
import pyspark.sql.functions as f
from pyspark.sql.functions import udf
from pyspark.sql import types as t

In [5]:
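config = configparser.ConfigParser()

# Read AWS credentials and S3 paths from the local config file,
# closing the file handle once it has been parsed
with open('config.cfg') as cfg_file:
    config.read_file(cfg_file)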

os.environ['AWS_ACCESS_KEY_ID'] = config['AWS']['AWS_ACCESS_KEY_ID']
os.environ['AWS_SECRET_ACCESS_KEY'] = config['AWS']['AWS_SECRET_ACCESS_KEY']

bucket = config['S3']['BUCKET']
immigration_path = bucket + config['S3']['IMMIGRATION_DATA']
immigration_labels = bucket + config['S3']['IMMIGRATION_LABELS']
airport_path = bucket + config['S3']['AIRPORT_CODES_DATA']
city_path = bucket + config['S3']['CITY_DATA']
output_path = bucket + config['S3']['OUTPUT']


print(f'{immigration_path}, {immigration_labels}')
print(f'{airport_path}')
print(f'{city_path}')
print(f'{output_path}')

s3a://udacity-data/capstone/immigration/sas_data/, s3a://udacity-data/capstone/immigration/I94_SAS_LABELS_Descriptions.SAS
s3a://udacity-data/capstone/airport/airport-codes_csv.csv
s3a://udacity-data/capstone/demographic/us-cities-demographic.csv
s3a://udacity-data/capstone/output


In [6]:
os.environ['PYSPARK_PYTHON']='/usr/bin/python3'
os.environ['PYSPARK_DRIVER_PYTHON']='/usr/bin/python3'

In [9]:
spark = (
    SparkSession.builder
    .config('spark.jars.packages', 'org.apache.hadoop:hadoop-aws:2.7.0')
    .getOrCreate()
)
spark

In [10]:
# Pass the AWS credentials to the S3A connector so Spark can read s3a:// paths
hadoop_conf = spark._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.access.key", os.environ['AWS_ACCESS_KEY_ID'])
hadoop_conf.set("fs.s3a.secret.key", os.environ['AWS_SECRET_ACCESS_KEY'])
hadoop_conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")

# Load data from buckets

In [12]:
immigration_data = spark.read.parquet(immigration_path)

In [17]:
immigration_data.limit(2).show(truncate=False, vertical=True)

-RECORD 0------------------
 cicid    | 5748517.0      
 i94yr    | 2016.0         
 i94mon   | 4.0            
 i94cit   | 245.0          
 i94res   | 438.0          
 i94port  | LOS            
 arrdate  | 20574.0        
 i94mode  | 1.0            
 i94addr  | CA             
 depdate  | 20582.0        
 i94bir   | 40.0           
 i94visa  | 1.0            
 count    | 1.0            
 dtadfile | 20160430       
 visapost | SYD            
 occup    | null           
 entdepa  | G              
 entdepd  | O              
 entdepu  | null           
 matflag  | M              
 biryear  | 1976.0         
 dtaddto  | 10292016       
 gender   | F              
 insnum   | null           
 airline  | QF             
 admnum   | 9.495387003E10 
 fltno    | 00011          
 visatype | B1             
-RECORD 1------------------
 cicid    | 5748518.0      
 i94yr    | 2016.0         
 i94mon   | 4.0            
 i94cit   | 245.0          
 i94res   | 438.0          
 i94port  | LOS     

### Step 2: Explore and Assess the Data
#### Explore the Data 
Identify data quality issues, like missing values, duplicate data, etc.
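As a starting point for this exploration, the sketch below profiles a small sample of records for missing values and repeated keys. It is a minimal pure-Python illustration, assuming the sample has been pulled to the driver as a list of dictionaries (for example with `[row.asDict() for row in immigration_data.limit(1000).collect()]`). The helper names `missing_fraction` and `duplicate_keys` are hypothetical, and profiling the full dataset would call for Spark aggregations instead.

```python
from collections import Counter


def missing_fraction(rows):
    """Return {column: fraction of rows where the value is None}."""
    if not rows:
        return {}
    total = len(rows)
    return {
        col: sum(1 for row in rows if row.get(col) is None) / total
        for col in rows[0].keys()
    }


def duplicate_keys(rows, key):
    """Return the key values that appear in more than one row."""
    counts = Counter(row[key] for row in rows)
    return sorted(value for value, n in counts.items() if n > 1)


# Hypothetical sample, shaped like the immigration records shown above
sample = [
    {"cicid": 5748517.0, "i94port": "LOS", "occup": None},
    {"cicid": 5748518.0, "i94port": "LOS", "occup": None},
    {"cicid": 5748518.0, "i94port": None, "occup": None},
]

print(missing_fraction(sample))
print(duplicate_keys(sample, "cicid"))
```

Columns that are almost entirely null in such a profile (as `occup` is in this made-up sample) are candidates for dropping, and repeated `cicid` values would point to duplicate records.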

#### Cleaning Steps
Document steps necessary to clean the data
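One cleaning step suggested by the sample records is converting `arrdate` and `depdate` into calendar dates. The sketch below assumes these columns hold SAS numeric dates, that is, day counts from 1960-01-01. That assumption is consistent with the first sample record, where `arrdate` 20574.0 maps to the same day as `dtadfile` 20160430. The helper name `sas_to_date` is hypothetical.

```python
from datetime import date, timedelta

SAS_EPOCH = date(1960, 1, 1)  # assumed origin of SAS numeric dates


def sas_to_date(sas_days):
    """Convert a SAS day count (float or int) to a date; None stays None."""
    if sas_days is None:
        return None
    return SAS_EPOCH + timedelta(days=int(sas_days))


print(sas_to_date(20574.0))  # arrdate of the first sample record -> 2016-04-30
print(sas_to_date(20582.0))  # depdate of the first sample record -> 2016-05-08
```

Inside the pipeline, a function like this could be wrapped with the `udf` helper imported earlier, using `t.DateType()` as the return type.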

In [None]:
# Performing cleaning tasks here





### Step 3: Define the Data Model
#### 3.1 Conceptual Data Model
Map out the conceptual data model and explain why you chose that model

#### 3.2 Mapping Out Data Pipelines
List the steps necessary to pipeline the data into the chosen data model

### Step 4: Run Pipelines to Model the Data 
#### 4.1 Create the data model
Build the data pipelines to create the data model.

In [None]:
# Write code here

#### 4.2 Data Quality Checks
Explain the data quality checks you'll perform to ensure the pipeline ran as expected. These could include:
 * Integrity constraints on the relational database (e.g., unique key, data type, etc.)
 * Unit tests for the scripts to ensure they are doing the right thing
 * Source/Count checks to ensure completeness
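The count and unique-key checks listed above can be sketched as small functions that take plain integers, so they work with values such as `df.count()` and `df.select(key).distinct().count()`. The function names and the table name in the example are hypothetical, not part of this project's final model.

```python
def check_not_empty(table_name, row_count):
    """Fail if a table was written with no rows."""
    if row_count <= 0:
        raise ValueError(f"{table_name}: expected rows, found {row_count}")


def check_unique_key(table_name, row_count, distinct_key_count):
    """Fail if the key column contains repeated values."""
    if row_count != distinct_key_count:
        extra = row_count - distinct_key_count
        raise ValueError(f"{table_name}: {extra} rows repeat an existing key")


def check_source_count(table_name, source_count, target_count):
    """Fail if rows were lost or gained between source and target."""
    if source_count != target_count:
        raise ValueError(
            f"{table_name}: source has {source_count} rows, "
            f"target has {target_count}"
        )


# Hypothetical counts, standing in for the results of df.count()
check_not_empty("immigration", 1000)
check_unique_key("immigration", 1000, 1000)
check_source_count("immigration", 1000, 1000)
print("all checks passed")
```

Raising an exception on the first failed check stops the notebook run at the point where the problem was detected, which keeps a bad load from passing unnoticed.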
 
Run Quality Checks

In [None]:
# Perform quality checks here

#### 4.3 Data dictionary 
Create a data dictionary for your data model. For each field, provide a brief description of what the data is and where it came from. You can include the data dictionary in the notebook or in a separate file.

### Step 5: Complete Project Write Up
* Clearly state the rationale for the choice of tools and technologies for the project.
* Propose how often the data should be updated and why.
* Write a description of how you would approach the problem differently under the following scenarios:
 * The data was increased by 100x.
 * The data populates a dashboard that must be updated by 7am every day.
 * The database needed to be accessed by 100+ people.