###### Data Engineering Capstone Project

# US Student Immigration
> The purpose of this project is to study foreign students arriving in the United States. The goal is to offer data teams and analysts a curated selection of data on immigration to the United States.

#### Project Summary

The project follows these steps:
* [Step 1: Scope the Project and Gather Data](#Step-1:-Scope-the-Project-and-Gather-Data)
* [Step 2: Explore and Assess the Data](#Step-2:-Explore-and-Assess-the-Data)
    * [I94 Description Labels](#I94-Descrition-Labels)
    * [Immigration data](#Immigration-data)
    * [Global Land Temperature Data](#Global-Land-Temperature-Data)
    * [Global Airports Data](#Global-Airports-Data)
    * [Airports Data](#Airports-Data)
    
* [Step 3: Define the Data Model](#Step-3:-Define-the-Data-Model)
* [Step 4: Run ETL to Model the Data](#Step-4:-Run-ETL-to-Model-the-Data)
* [Step 5: Complete Project Write Up](#Step-5:-Complete-Project-Write-Up)

In [1]:
import os
import re
import sys
import pandas as pd
from datetime import datetime
from pyspark.sql import SparkSession

In [4]:
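from pyspark.sql import SparkSession

# Build a Spark session with the spark-sas7bdat package so SAS files can be read.
spark = (
    SparkSession.builder
    .config("spark.jars.packages", "saurfang:spark-sas7bdat:2.0.0-s_2.11")
    .enableHiveSupport()
    .getOrCreate()
)

# Show every column when displaying pandas DataFrames.
pd.set_option("display.max_columns", None)
# pd.set_option("display.precision", 2)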

# Step 1: Scope the Project and Gather Data

A data warehouse lets us collect, transform, and manage data from varied sources. Data and business teams then connect to it and analyze the data.
Apache Spark is used to gather the data.
Amazon S3 buckets store the data as Parquet files for the data teams.
The main dataset covers immigration to the United States.

Questions about foreign students and their choice to come to the US may help in proposing services:

* How many students arrived in the US in April?
* Which airline brought the most students in April?
* Which cities are the top ports of arrival in the USA?
* Where are the students from?
* What are the student profiles (age, country of birth, country indicators)?

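The April questions above could be explored along the following lines. This is only a minimal pandas sketch on hypothetical rows: the column names (`i94mon`, `i94visa`, `airline`, `i94port`, `i94res`) and the convention that visa code 3 means "student" are assumptions to be checked against the I94 labels file, and the values are invented for illustration.

```python
import pandas as pd

# Hypothetical sample rows; the column names and codes are assumptions
# to be checked against the I94 labels file.
df_immigration = pd.DataFrame(
    {
        "i94mon": [4, 4, 4, 4, 4],
        "i94visa": [3, 3, 2, 3, 1],
        "airline": ["AA", "DL", "AA", "AA", "UA"],
        "i94port": ["NYC", "LOS", "NYC", "NYC", "MIA"],
        "i94res": [245, 213, 135, 245, 582],
    }
)

STUDENT_VISA = 3  # assumed coding: 1 = business, 2 = pleasure, 3 = student

# Keep only student arrivals recorded in April.
students_april = df_immigration[
    (df_immigration["i94visa"] == STUDENT_VISA) & (df_immigration["i94mon"] == 4)
]

n_students = len(students_april)                                 # how many students
top_airline = students_april["airline"].value_counts().idxmax()  # busiest airline
top_ports = students_april["i94port"].value_counts().head(3)     # top arrival ports
top_origins = students_april["i94res"].value_counts().head(3)    # top residence codes
```

On the full dataset the same filters and counts would more likely run in Spark, but the logic stays the same.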
#### Data Source

The [data dictionary](2_data_dictionnary.ipynb) provides information about the datasets and tables used. [This notebook](1_capstone_notebook_exploration_with_python.ipynb) performs a first exploration with Python and explains the datasets and which variables I kept.

Data|File|Data Source|DataFrame Name
-|-|-|-
I94 Immigration|immigration_data_sample.csv|[US National Tourism and Trade Office](https://travel.trade.gov/research/programs/i94/description.asp)|df_immigration
I94 Description Labels|I94_SAS_Labels_Descriptions.SAS|US National Tourism and Trade Office|
Global Land Temperature|GlobalLandTemperaturesByCity.csv|[Berkeley Earth](http://berkeleyearth.org/)|df_temperature
Global Airports|airports-extended.csv|[OpenFlights.org and user contributions](https://www.kaggle.com/open-flights/airports-train-stations-and-ferry-terminals)|df_global_airports
Airport Codes|airport-codes_csv.csv|provided by Udacity|df_airport_id
ISO Country Codes|wikipedia-iso-country-codes.csv|[Kaggle](https://www.kaggle.com/juanumusic/countries-iso-codes)|df_iso_country
US Cities Demographics|us-cities-demographics.csv|provided by Udacity|df_demograph
Development Indicators|WDIData.csv|[Kaggle](https://www.kaggle.com/xavier14/wdidata)|df_indicator_dev
Education Statistics|EdStatsData.csv|[World Bank, via Kaggle](https://www.kaggle.com/kostya23/worldbankedstatsunarchived)|df_Educ_data

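The I94 description labels file has no DataFrame of its own in the table above, since it is a SAS source file that maps codes to labels. One possible way to turn such mappings into a lookup is sketched below. The excerpt is a hypothetical sample written in the style of that file, not its real content, and `parse_labels` is an illustrative helper name rather than part of the project.

```python
import re

# Hypothetical excerpt in the style of I94_SAS_Labels_Descriptions.SAS;
# the real file is not reproduced here.
sample_labels = """
value i94cntyl
   582 =  'MEXICO'
   236 =  'AFGHANISTAN'
;
value $i94prtl
   'ALC' = 'ALCAN, AK             '
   'NYC' = 'NEW YORK, NY          '
;
"""

# One mapping per line: an optionally quoted code, "=", then a quoted label.
PAIR = re.compile(r"^\s*'?([^'=\s]+)'?\s*=\s*'([^']*)'", re.MULTILINE)


def parse_labels(text):
    """Return {code: label} for every "code = 'label'" line in text."""
    return {code: label.strip() for code, label in PAIR.findall(text)}


labels = parse_labels(sample_labels)
```

Codes are kept as strings here, so numeric country codes would need a cast before joining them to numeric columns.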
# Step 2: Explore and Assess the Data
#### Explore the Data 
Identify data quality issues, like missing values, duplicate data, etc.

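A first pass over missing values and duplicates might look like the sketch below. The rows are hypothetical stand-ins for one of the source files, and the column names (`cicid`, `airline`, `i94bir`) are assumptions used only for illustration.

```python
import pandas as pd

# Hypothetical rows standing in for one of the source files.
df = pd.DataFrame(
    {
        "cicid": [1, 2, 2, 3],
        "airline": ["AA", "DL", "DL", None],
        "i94bir": [21.0, 19.0, 19.0, None],
    }
)

# Missing values: absolute count and share per column.
missing_per_column = df.isna().sum()
missing_share = df.isna().mean()

# Fully duplicated rows: count them, then keep the first occurrence of each.
n_duplicate_rows = int(df.duplicated().sum())
df_deduplicated = df.drop_duplicates()
```

Columns with a high missing share are candidates for dropping, while duplicated rows can usually be removed before modelling.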
#### Cleaning Steps
Document the steps necessary to clean the data.