###### Data Engineering Capstone Project

# US Student Immigration Part 3
> The purpose of this project is to study foreign students. The goal is to offer data teams and analysts a selection of data about immigration to the United States.

#### Project Summary

The project follows these steps:
* [Step 1: Scope the Project and Gather Data](#Step-1:-Scope-the-Project-and-Gather-Data)
* [Step 2: Explore and Assess the Data](#Step-2:-Explore-and-Assess-the-Data)
* [Step 3: Define the Data Model](#Step-3:-Define-the-Data-Model)
* [Step 4: Run ETL to Model the Data](#Step-4:-Run-ETL-to-Model-the-Data)
* [Step 5: Complete Project Write Up](#Step-5:-Complete-Project-Write-Up)

In [1]:
import os
import io
import re
import sys
import datetime
import pandas as pd
from datetime import datetime
from pyspark.sql import SparkSession, SQLContext
import pyspark.sql.functions as func
from pyspark.sql.functions import udf
from pyspark.sql.types import *
from pyspark.sql import functions as F
from pyspark.sql import types as T
from pyspark.sql.functions import isnan, when, count, col
from pyspark.sql.functions import *
from pyspark.sql.types import FloatType, StringType, DecimalType

import datetime as dt
pd.set_option('display.max_colwidth', 200)
pd.set_option("display.precision", 2)

In [2]:
'pyspark.sql.types' in sys.modules

True

# Step 1: Scope the Project and Gather Data

A data warehouse lets us collect, transform and manage data from varied sources, so that the data and business teams can connect to it and analyse the data.  
Apache Spark is used to gather and process the data.  
Amazon S3 buckets store the data as parquet files for the data teams.  
The main dataset contains data on immigration to the United States; it is complemented by other datasets.  
Questions about foreign students and their choice to come to the US may be useful for proposing services:

* How many students arrived in the US in April?
* Which airline brought the most students in April?
* Which are the top cities of arrival in the USA?
* Where do the students come from?
* What are the student profiles (age, country of birth, country indicators)?

In this notebook, the data is transformed and cleaned.  


### Describe and Gather Data

The [data dictionary](2_data_dictionnary.ipynb) provides information about the datasets and tables used.

#### Data Source

Data |File |Data Source
-|-|-|
I94 Immigration | immigration_data_sample.csv| [US National Tourism and Trade Office](https://travel.trade.gov/research/programs/i94/description.asp)
I94 Description Labels|I94_SAS_Labels_Descriptions.SAS |US National Tourism and Trade Office
Global Land Temperature|GlobalLandTemperaturesByCity.csv| [Berkeley Earth](http://berkeleyearth.org/)
Global Airports|airports-extended.csv| [OpenFlights.org and user contributions](https://www.kaggle.com/open-flights/airports-train-stations-and-ferry-terminals)
Airport codes |airport-codes_csv.csv| provided by Udacity
ISO country | wikipedia-iso-country-codes.csv|[Kaggle](https://www.kaggle.com/juanumusic/countries-iso-codes)
US Cities Demographics| us-cities-demographics.csv|provided by Udacity
Development indicators| WDIData.csv| [Kaggle](https://www.kaggle.com/xavier14/wdidata)
Education statistics (not used)| EdStatsData.csv| [World Bank, via Kaggle](https://www.kaggle.com/kostya23/worldbankedstatsunarchived)


#### I94 Immigration Data Description
Each line of immigration_data_sample.csv corresponds to one I-94 form record collected by U.S. immigration officers. It provides arrival and departure information about foreign visitors. More details are available on the [Visitor Arrivals Program (I-94 Form)](https://travel.trade.gov/research/programs/i94/description.asp) page.  

Dataset information: there is one file per month for 2016, stored in sas7bdat format. Each record is described by 28 variables.   
A short description is provided [here](2_data_dictionnary.ipynb).  
I keep these variables for this project ( _df_immigration_ ):
    
Column Name | Description | Example | Type
-|-|-|-|
**cicid**|     Unique ID of each record in the dataset | 4.08e+06 | float64
**i94yr**|     4-digit year  | 2016.0 | float64
**i94mon**|    Numeric month |  4.0 | float64
**i94cit**|    3-digit code of the visitor's country of citizenship (used here as country of birth) | 209.0 | float64
**i94res**|    3-digit code of the visitor's country of residence |209.0 | float64
**i94port**|   Port admitted through | HHW | object
**arrdate**|   Arrival date in the USA | 20566.0 | float64
**i94mode**|   Mode of transportation (1 = Air; 2 = Sea; 3 = Land; 9 = Not reported) | 1.0 | float64
**i94addr**|   State of arrival | HI | object
**i94bir**|    Age in years | 61.0 | float64
**i94visa**|   Visa code: 1 = Business, 2 = Pleasure, 3 = Student |2.0 | float64
**dtadfile**|  Date field in I94 files |20160422| int64
**gender**|    Gender|M| object
**visatype**|  Class of admission legally admitting the non-immigrant to temporarily stay in the U.S.|WT|object
**airline**|   Airline used to arrive in the U.S.|MU|object
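
The `arrdate` example in the table above (20566.0) is a day count rather than a calendar date. Assuming it follows the usual SAS convention of counting days since 1960-01-01 (the notebook does not state this explicitly), a minimal conversion sketch could look like the following. The helper name `sas_date_to_iso` is hypothetical.

```python
from datetime import date, timedelta

# Assumption: SAS numeric dates count days elapsed since 1960-01-01.
SAS_EPOCH = date(1960, 1, 1)

def sas_date_to_iso(sas_days):
    """Convert a SAS numeric date to an ISO 'YYYY-MM-DD' string.

    Returns None for missing values so that nulls survive the conversion.
    """
    if sas_days is None:
        return None
    return (SAS_EPOCH + timedelta(days=int(sas_days))).isoformat()

print(sas_date_to_iso(20566.0))  # 2016-04-22
```

The result matches the `dtadfile` example (20160422) from the same table, which supports the assumption for this sample row.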



Additional files are provided to describe this dataset in more detail.


#### I94 Description Labels Description
The I94_SAS_Labels_Description.SAS file explains the codes used in _data/18-83510-I94-Data-2016/i94_apr16_sub.sas7bdat_.  
I parse this file and save the result in 5 .csv files:

* i94visa data
* i94country and i94residence data
* i94port data
* i94mode data
* i94addr data

A short description is provided [here](2_data_dictionnary.ipynb).
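
As an illustration of the parsing step, here is a minimal sketch that extracts code/label pairs with a regular expression and serializes them as CSV. It assumes the label file contains lines of the form `'AL'='ALABAMA'` or `1 = 'Air'`. The sample text, the block names inside it and the helper names are hypothetical, and this is not the notebook's actual parser.

```python
import csv
import io
import re

# Matches lines such as  'AL'='ALABAMA'  or  1 = 'Air'  (assumed format).
LABEL_RE = re.compile(r"^\s*'?([^'=\s]+)'?\s*=\s*'([^']*)'")

def parse_labels(sas_text):
    """Return the (code, label) pairs found in a block of label definitions."""
    pairs = []
    for line in sas_text.splitlines():
        match = LABEL_RE.match(line)
        if match:
            pairs.append((match.group(1), match.group(2).strip()))
    return pairs

def labels_to_csv(pairs, header=("code", "label")):
    """Serialize the pairs as CSV text, ready to be written to a file."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(pairs)
    return buffer.getvalue()

# Hypothetical excerpt, only meant to mimic the shape of the label file.
sample = """
value i94addrl
    'AL'='ALABAMA'
    'AK'='ALASKA'
;
value i94model
    1 = 'Air'
    2 = 'Sea'
;
"""

pairs = parse_labels(sample)
print(labels_to_csv(pairs))
```

In practice each block of the label file would be parsed separately so that every code list ends up in its own .csv file.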

####  Global Land Temperature Data  Description
The Berkeley Earth Surface Temperature Study provides climate information. Each line is one temperature record for a given city and date, for cities around the world; the dates in the sample rows fall on the first day of each month.     
Dataset information: GlobalLandTemperaturesByCity.csv has 7 variables. A short description is provided [here](2_data_dictionnary.ipynb). I keep these variables for this project ( _df_temperature_ ):

Column Name | Description | Example | Type
-|-|-|-|
**dt**|Date format YYYY-MM-DD| 1743-11-01| object
**AverageTemperature**|Average temperature of the city on date dt|6.07|float64
**City**| City name| Århus| object
**Country**| Country name | Denmark | object

#### Global Airports Data
This is a database of airports, train stations, and ferry terminals around the world. Some of the data come from public sources and some of it comes from OpenFlights.org user contributions.      
Dataset information: A short description is provided [here](2_data_dictionnary.ipynb). I name and keep these variables ( _df_global_airports_ ):

Column Name | Description | Example | Type
-|-|-|-|
**airport_ID**|Id in the table|1| Int
**airport_name**|Name of airport|Nadzab Airport|Object
**airport_city**|Main city served by airport|Nadzab|Object
**airport_country**|Country or territory where airport is located|Papua New Guinea|Object
**airport_iata**|3-letter IATA code|LAE|Object


#### Airports Data Description
The airport code refers to the IATA airport code, a 3-letter code that uniquely identifies an airport. It is also used in passenger reservations, ticketing and baggage handling.     
Dataset information: airport-codes_csv.csv provides information about airports and has 12 variables. A short description is provided [here](2_data_dictionnary.ipynb). I keep these variables for this project ( _df_airport_code_ ):

Column Name | Description | Example | Type
-|-|-|-|
**ident**| Unique airport identifier code| 00AK| object 
**type**| Type of airport | small_airport |object
**name**| Name of the airport | Lowell Field | object
**iso_country**| ISO code of the airport's country |US| object
**iso_region**| ISO code of the airport's region | US-KS|object
**municipality**| Name of the city where the airport is located | Anchor Point|object
**iata_code**| IATA code of the airport| | object

#### Iso country
This is a database of the different codes used to identify a country.        
Dataset information: A short description is provided [here](2_data_dictionnary.ipynb). This table gives the country codes used to identify each country and contains 4 variables. I keep these variables for this project ( _df_iso_country_ ):

Column Name | Description | Example | Type
-|-|-|-|
**Country_name**|Country Name in English|Wallis and Futuna|Object
**Alpha2_code**|2-letter code of the country|WF|Object
**Alpha3_code**|3-letter code of the country|WLF|Object
**Numeric_code**|ISO 3166-1 numeric code|876|Int

#### US cities Demographics
This dataset contains information about the demographics of all US cities and comes from the US Census Bureau.     
Dataset information: A short description is provided [here](2_data_dictionnary.ipynb). 
This dataset contains 12 variables and provides basic information about the US state population. 
I keep these variables for this project ( _df_demograph_ ):

Column Name | Description | Example | Type
-|-|-|-|
**City**|Name of the city|Silver Spring|Object
**State**|US state of the city|Maryland|Object
**Median Age**|The median of the age of the population|33.8|Float64
**Male Population**|Number of the male population|40601.0|Float64
**Female Population**|Number of the female population|41862.0|Float64
**Total Population**|Number of the total population|82463|Float64
**Foreign-born**|Number of residents of the city who were born outside the United States|30908.0|Float64
**State Code**|Code of the state of the city|MD|Object
**Race**|Race class|Hispanic or Latino|Object
**Count**|Number of individual of each race|25924|Int64

#### World Development Indicators
The primary World Bank collection of development indicators, compiled from officially-recognized international sources. It presents the most current and accurate global development data available, and includes national, regional and global estimates.   
Dataset information: This dataset contains 64 variables in an economic context, most of which are yearly columns (1960 to 2018).
A short description is provided [here](2_data_dictionnary.ipynb).
I keep these variables for this project ( _df_indicator_dev_ ):

Column Name | Description | Example | Type
-|-|-|-|
**Country Name**|Name of the country|Arab World|Object|
**Country Code**|3 letters code of country|ARB|Object
**Indicator Name**|indicators of economic development|2005 PPP conversion factor, GDP (LCU per inter...|Object
**Indicator Code**|letters indicator code|PA.NUS.PPP.05|Object
**1960 ...2018**|one column per year since 1960|2018|Float64

#### Education statistics Data
* Edit: Not used  
The primary World Bank collection of development indicators, compiled from officially-recognized international sources. It presents the most current and accurate global development data available, and includes national, regional and global estimates.    
Dataset information: This dataset contains 64 variables in an education context, most of which are yearly columns (1970 to 2100).
A short description is provided [here](2_data_dictionnary.ipynb).
I keep these variables for this project ( _df_Educ_data_ ):

Column Name | Description | Example | Type
-|-|-|-|
**Country Name**|Name of the country|Arab World|Object|
**Country Code**|3 letters code of country|ARB|Object
**Indicator Name**|indicators of education development|Adjusted net enrolment rate, lower secondary, ...|Object
**Indicator Code**|letters indicator code|UIS.NERA.2|Object
**1970 ...2100**|one column per year since 1970|2018|Float64

# Step 2: Explore and Assess the Data
## Explore the Data 

#### Data Source

The [data dictionary](2_data_dictionnary.ipynb) provides information about the datasets and tables used. [This notebook](1_Exploration_python.ipynb) performs a first exploration with Python and explains the datasets and which variables I kept.  
Edit: I decided not to use Education-statistics, because I found the indicators I needed in the development indicators dataset.

Dataset |File |Data Source|Dataframe Name
-|-|-|-|
I94 Immigration | immigration_data_sample.csv| [US National Tourism and Trade Office](https://travel.trade.gov/research/programs/i94/description.asp)| df_immigration
I94 Description Labels|I94_SAS_Labels_Descriptions.SAS |US National Tourism and Trade Office|
Global Land Temperature|GlobalLandTemperaturesByCity.csv| [Berkeley Earth](http://berkeleyearth.org/)|df_temperature
Global Airports|airports-extended.csv| [OpenFlights.org and user contributions](https://www.kaggle.com/open-flights/airports-train-stations-and-ferry-terminals)|df_global_airports
Airport codes |airport-codes_csv.csv| provided by Udacity|df_airport_code
ISO country | wikipedia-iso-country-codes.csv|[Wikipedia](https://gist.github.com/radcliff/f09c0f88344a7fcef373)|df_iso_country
US Cities Demographics| us-cities-demographics.csv|provided by Udacity|df_demograph
Development indicators| WDIData.csv| [World Bank](https://www.kaggle.com/xavier14/wdidata)|df_indicator_dev
Education statistics (not used)| EdStatsData.csv| [World Bank, via Kaggle](https://www.kaggle.com/kostya23/worldbankedstatsunarchived)|df_Educ_data

##### I94 Immigration Data
* Source: https://travel.trade.gov/research/reports/historical/2016.html
    * The 'data/18-83510-I94-Data-2016' folder provides one file per month
        * Each record is described by 28 variables, and each file has about 3M rows
        * It provides arrival and departure information about foreign visitors
    * I94_SAS_Labels_Description.SAS provides the variable descriptions
    
##### Global Land Temperature Data
* Source: http://berkeleyearth.org/
    * 'GlobalLandTemperaturesByCity.csv' provides climate information
        * Each line is one temperature record for a given city and date, for cities around the world.
        * GlobalLandTemperaturesByCity.csv has 7 variables and 8599213 rows.
        
##### Global Airports Data
* Source: https://www.kaggle.com/open-flights/airports-train-stations-and-ferry-terminals
    * 'airports-extended.csv'. Some of the data comes from public sources and some of it comes from OpenFlights.org user contributions.
        * It provides information about airports, train stations and ferry terminals around the world.
        * There are 4 variables and 10668 rows in 'airports-extended.csv'.
        
##### Airport Codes Data
* Source: https://datahub.io/core/airport-codes#data
    * airport-codes_csv.csv. The airport code refers to the IATA airport code, a 3-letter code that uniquely identifies an airport.
        * airport-codes_csv.csv provides information about airports.
        * There are 55075 rows and 12 columns in airport-codes_csv.csv.
        
##### Iso country Data
* Source: https://gist.github.com/radcliff/f09c0f88344a7fcef373
    * 'wikipedia-iso-country-codes.csv'. This is a database of the different codes used to identify a country.
        * This table gives the country codes used to identify each country.
        * There are 4 variables and 247 rows.
        
##### US cities Demographics Data
* Source: https://data.census.gov/cedsci/. 
    * 'us-cities-demographics.csv'. This dataset contains information about the demographics of all US cities and comes from the US Census Bureau.
        * Provides basic information about the US state population
        * Contains 12 variables and 2892 rows
        
##### World Development Indicators Data
* Source: https://www.kaggle.com/xavier14/wdidata
    * 'WDIData.csv'. The primary World Bank collection of development indicators, compiled from officially-recognized international sources. 
        * It presents the most current and accurate global development data available, and includes national, regional and global estimates.
        * Contains 64 variables in an economic context, most of which are yearly columns (1960 to 2018), and 422137 rows.
               
##### i94addr Data
* Source: I94_SAS_Labels_Description.SAS
    * US States code defined in I94_SAS_Labels_Description.SAS
        * data 'i94addr.csv' provides State Id and State name  
        
##### i94cit_i94res Data
* Source: I94_SAS_Labels_Description.SAS
    * 'i94cit_i94res.csv' defines the 3-digit country codes
        * 'i94cit_i94res.csv' provides the country ID and the country name
        
##### i94mode Data
* Source: I94_SAS_Labels_Description.SAS
    * 'i94mode.csv' defines the mode of arrival in the US
        * 'i94mode.csv' provides the mode code and its name.
        
##### i94port Data
* Source: I94_SAS_Labels_Description.SAS
    * data 'i94port.csv'
        * data 'i94port.csv' provides Port Id, Port city and State Id.
        
##### i94visa Data
* Source: I94_SAS_Labels_Description.SAS
    * data 'i94visa.csv'
        * data 'i94visa.csv' provides the visa code and the visa category

### SETUP SPARK AND ENVIRONMENT

In [3]:
output_parquet = '../../output/'
path = '../../data/'

In [4]:
#!pwd

In [5]:
#!ls ../../data/

In [6]:
%whos DataFrame


No variables match your requested type.


In [7]:
# TODO: update with the latest version
def create_spark_session():
    """Create (or reuse) a Spark session with the spark-sas7bdat package loaded."""
    spark = SparkSession.builder \
                    .appName("Us_student_immigation") \
                    .config("spark.jars.packages","saurfang:spark-sas7bdat:3.0.0-s_2.12") \
                    .enableHiveSupport() \
                    .getOrCreate()
    return spark
spark = create_spark_session()
%autosave 60

Autosaving every 60 seconds


## 1_LOAD FILES

### I94 Immigration Data
#### Exploration

* Path = '../../data/18-83510-I94-Data-2016/i94_apr16_sub.sas7bdat'
* There are 3096313 rows and 29 columns in *i94_apr16_sub.sas7bdat*.
* Name of the dataFrame: df_immigration
* As shown in the [data exploration file](./0_dataset_information.ipynb), some variables are empty or sparsely populated (visapost, occup, entdepu, insnum)
* Variables dropped: depdate, count, occup, entdepa, entdepd, entdepu, matflag, biryear, insnum, dtadfile, visapost, fltno, admnum, dtaddto.
* Variables used:

Column Name | Description |
-|-|
**cicid**|     Unique ID of each record in the dataset 
**i94yr**|     4-digit year  
**i94mon**|    Numeric month 
**i94cit**|    3-digit code of the visitor's country of citizenship (used here as country of birth) 
**i94res**|    3-digit code of the visitor's country of residence
**i94port**|   Port admitted through 
**i94mode**|   Mode of transportation (1 = Air; 2 = Sea; 3 = Land; 9 = Not reported) 
**i94addr**|   State of arrival 
**i94bir**|    Age in years 
**i94visa**|   Visa Code - 1 = Business / 2 = Pleasure / 3 = Student
**gender**|    Gender
**airline**|   Airline used to arrive in U.S.
**admnum**|    Admission number, should be unique and not nullable 
**visatype**|  Class of admission legally admitting the non-immigrant to temporarily stay in U.S.
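
To show how the kept columns relate to the project's questions, the sketch below counts student arrivals per airline in plain Python. The records are made-up examples and `students_per_airline` is a hypothetical helper. In the notebook itself the equivalent filter would run on the Spark DataFrame.

```python
from collections import Counter

STUDENT_VISA = 3  # i94visa: 1 = Business, 2 = Pleasure, 3 = Student

def students_per_airline(records):
    """Count student arrivals (i94visa == 3) per airline code.

    Records without an airline code are ignored.
    """
    return Counter(
        rec["airline"]
        for rec in records
        if rec.get("i94visa") == STUDENT_VISA and rec.get("airline")
    )

# Made-up records that only carry the columns needed for this question.
records = [
    {"cicid": 1.0, "i94visa": 3.0, "airline": "MU"},
    {"cicid": 2.0, "i94visa": 2.0, "airline": "MU"},
    {"cicid": 3.0, "i94visa": 3.0, "airline": "AA"},
    {"cicid": 4.0, "i94visa": 3.0, "airline": "MU"},
    {"cicid": 5.0, "i94visa": 3.0, "airline": None},
]

counts = students_per_airline(records)
print(counts.most_common(1))  # [('MU', 2)]
```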

#### Read I94 data

In [8]:
def load_immigration(path, file):
    """Read an I94 sas7bdat file with spark-sas7bdat and return (DataFrame, row count)."""
    df = spark.read \
        .format('com.github.saurfang.sas.spark') \
        .option('header', 'true') \
        .load(path+file)
    # count() scans the whole file, so the result is computed once and returned
    nb_rows = df.count()
    print(f'*****         Loading {nb_rows} rows')
    print('*****         Display the Schema')
    df.printSchema()
    print('*****         Display few rows')
    df.show(3, truncate = False)
    return df, nb_rows

In [9]:
file = '18-83510-I94-Data-2016/i94_apr16_sub.sas7bdat'
# TODO: redo with S3 and all the files (get_path_sas_folder parquet file)
immigration, rows_immig = load_immigration(path, file)


*****         Loading 3096313 rows
*****         Display the Schema
root
 |-- cicid: double (nullable = true)
 |-- i94yr: double (nullable = true)
 |-- i94mon: double (nullable = true)
 |-- i94cit: double (nullable = true)
 |-- i94res: double (nullable = true)
 |-- i94port: string (nullable = true)
 |-- arrdate: double (nullable = true)
 |-- i94mode: double (nullable = true)
 |-- i94addr: string (nullable = true)
 |-- depdate: double (nullable = true)
 |-- i94bir: double (nullable = true)
 |-- i94visa: double (nullable = true)
 |-- count: double (nullable = true)
 |-- dtadfile: string (nullable = true)
 |-- visapost: string (nullable = true)
 |-- occup: string (nullable = true)
 |-- entdepa: string (nullable = true)
 |-- entdepd: string (nullable = true)
 |-- entdepu: string (nullable = true)
 |-- matflag: string (nullable = true)
 |-- biryear: double (nullable = true)
 |-- dtaddto: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- insnum: string (nullable = true)
 |-

### Global Land Temperature Data
#### Exploration
* Path = '../../data/GlobalLandTemperaturesByCity.csv'
* There are 8599212 rows and 7 columns in *GlobalLandTemperaturesByCity.csv*.
* Name of the dataFrame: df_temperature

* As shown in the [data exploration file](./0_dataset_information.ipynb), the first date is in 1743 and there is one row per date and per city. So we will aggregate this dataset and drop the 'AverageTemperatureUncertainty', 'Latitude' and 'Longitude' columns
* Variables used:

Column Name | Description 
-|-|
**dt**|Date format YYYY-MM-DD| 
**AverageTemperature**|Average temperature of the city on date dt|
**City**| City name| 
**Country**| Country name |
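
The aggregation mentioned above could, for example, be a mean temperature per (Country, City). The sketch below shows that logic in plain Python, skipping missing readings. The grouping key is an assumption, the helper name is hypothetical, and the third sample row is made up.

```python
from collections import defaultdict
from statistics import mean

def average_temperature_by_city(rows):
    """Average AverageTemperature per (Country, City), ignoring missing readings."""
    readings = defaultdict(list)
    for row in rows:
        if row["AverageTemperature"] is not None:
            readings[(row["Country"], row["City"])].append(row["AverageTemperature"])
    return {key: round(mean(values), 2) for key, values in readings.items()}

rows = [
    {"dt": "1743-11-01", "AverageTemperature": 6.068, "City": "Århus", "Country": "Denmark"},
    {"dt": "1743-12-01", "AverageTemperature": None, "City": "Århus", "Country": "Denmark"},
    # Made-up reading, only here to have two values to average
    {"dt": "1744-01-01", "AverageTemperature": 8.0, "City": "Århus", "Country": "Denmark"},
]

print(average_temperature_by_city(rows))  # {('Denmark', 'Århus'): 7.03}
```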

#### Read GlobalLandTemperaturesByCity

In [10]:
def load_temperature(path, file):
    df = spark.read \
        .format("csv") \
        .option('header', 'true') \
        .option('inferSchema', 'true') \
        .load(path+file)
    nb_rows = df.count()
    print(f'*****         Loading {nb_rows} rows')
    print(f'*****         Display the Schema')
    df.printSchema()
    print(f'*****         Display few rows')
    df.show(3, truncate = False)
    return df, nb_rows

In [11]:
file = 'GlobalLandTemperaturesByCity.csv'
temperature, rows_temp = load_temperature(path, file)

*****         Loading 8599212 rows
*****         Display the Schema
root
 |-- dt: string (nullable = true)
 |-- AverageTemperature: double (nullable = true)
 |-- AverageTemperatureUncertainty: double (nullable = true)
 |-- City: string (nullable = true)
 |-- Country: string (nullable = true)
 |-- Latitude: string (nullable = true)
 |-- Longitude: string (nullable = true)

*****         Display few rows
+----------+------------------+-----------------------------+-----+-------+--------+---------+
|dt        |AverageTemperature|AverageTemperatureUncertainty|City |Country|Latitude|Longitude|
+----------+------------------+-----------------------------+-----+-------+--------+---------+
|1743-11-01|6.068             |1.7369999999999999           |Århus|Denmark|57.05N  |10.33E   |
|1743-12-01|null              |null                         |Århus|Denmark|57.05N  |10.33E   |
|1744-01-01|null              |null                         |Århus|Denmark|57.05N  |10.33E   |
+----------+------------

### Airports Code Data
#### Exploration
* Path = '../../data/airport-codes_csv.csv'
* There are 55075 rows and 12 columns in *airport-codes_csv.csv*
* Name of the DataFrame : df_airport_code
* Some variables (continent, iata_code and local_code) are missing in more than 50% of the rows. I kept:

Column Name | Description 
-|-|
**ident**| Unique identifier Airport code|
**type**| Type of airport | 
**name**| Name of the airport | 
**continent**| Continent |
**iso_country**| ISO code of airport country |
**iso_region**| ISO code of the region airport | 
**municipality**| City name where the airport is located | 
**iata_code**| IATA code of the airport|
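
The share of missing values per column, which drives the choice above, can be computed as in this minimal sketch. It is plain Python over made-up rows, and the helper name `missing_ratio` is hypothetical. On the Spark DataFrame the same ratio would be obtained with a null count per column divided by the row count.

```python
def missing_ratio(rows, columns):
    """Return the fraction of rows where each column is None or an empty string."""
    total = len(rows)
    ratios = {}
    for column in columns:
        missing = sum(1 for row in rows if row.get(column) in (None, ""))
        ratios[column] = missing / total if total else 0.0
    return ratios

# Hypothetical rows, shaped like airport-codes_csv.csv after loading
sample_rows = [
    {"ident": "A1", "iata_code": None, "local_code": "A1"},
    {"ident": "A2", "iata_code": None, "local_code": None},
    {"ident": "A3", "iata_code": "XYZ", "local_code": None},
    {"ident": "A4", "iata_code": None, "local_code": None},
]

ratios = missing_ratio(sample_rows, ["ident", "iata_code", "local_code"])
mostly_missing = [name for name, ratio in ratios.items() if ratio > 0.5]
print(ratios)          # {'ident': 0.0, 'iata_code': 0.75, 'local_code': 0.75}
print(mostly_missing)  # ['iata_code', 'local_code']
```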

#### Read Airports Code

In [12]:
def load_airport_code(path, file):
    df = spark.read \
        .format("csv") \
        .option('header', 'true') \
        .option('inferSchema', 'true') \
        .load(path+file)
    nb_rows = df.count()
    print(f'*****         Loading {nb_rows} rows')
    print(f'*****         Display the Schema')
    df.printSchema()
    print(f'*****         Display few rows')
    df.show(3, truncate = False)
    return df, nb_rows

In [13]:
file = 'airport-codes_csv.csv'    
airport_code, rows_code = load_airport_code(path, file)

*****         Loading 55075 rows
*****         Display the Schema
root
 |-- ident: string (nullable = true)
 |-- type: string (nullable = true)
 |-- name: string (nullable = true)
 |-- elevation_ft: integer (nullable = true)
 |-- continent: string (nullable = true)
 |-- iso_country: string (nullable = true)
 |-- iso_region: string (nullable = true)
 |-- municipality: string (nullable = true)
 |-- gps_code: string (nullable = true)
 |-- iata_code: string (nullable = true)
 |-- local_code: string (nullable = true)
 |-- coordinates: string (nullable = true)

*****         Display few rows
+-----+-------------+--------------------+------------+---------+-----------+----------+------------+--------+---------+----------+----------------------------------+
|ident|type         |name                |elevation_ft|continent|iso_country|iso_region|municipality|gps_code|iata_code|local_code|coordinates                       |
+-----+-------------+--------------------+------------+---------+--------

### Global Airports Data
#### Exploration
* Path = '../../data/airports-extended.csv'
* There are 10668 rows and 13 columns in *airports-extended.csv*
* Name of the dataframe : df_global_airports
* There are no missing values. I kept:

Column Name | Description | Example | Type
-|-|-|-|
**airport_name**|Name of airport|Nadzab Airport|Object
**airport_city**|Main city served by airport|Nadzab|Object
**airport_country**|Country or territory where airport is located|Papua New Guinea|Object
**airport_iata**|3-letter IATA code|LAE|Object

#### Read airports-extended

In [14]:
global_airports_schema = T.StructType([
    T.StructField('airport_ID', T.IntegerType(), False),
    T.StructField('name', T.StringType(), False),
    T.StructField('city', T.StringType(), False),
    T.StructField('country', T.StringType(), False),
    T.StructField('iata', T.StringType(), False),
    T.StructField('icao', T.StringType(), False),
    T.StructField('latitude', T.StringType(), False),
    T.StructField('longitude', T.StringType(), False),
    T.StructField('altitude', T.IntegerType(), False),
    T.StructField('timezone', T.StringType(), False),
    T.StructField('dst', T.StringType(), False),
    T.StructField('tz_timezone', T.StringType(), False),
    T.StructField('type', T.StringType(), False),
    T.StructField('data_source', T.StringType(), False)
])

In [15]:
def load_global_airports(path, file):
    df = spark.read \
        .format("csv") \
        .option('header', 'True') \
        .option('inferSchema', 'true') \
        .schema(global_airports_schema) \
        .load(path+file)
    nb_rows = df.count()
    print(f'*****         Loading {nb_rows} rows')
    print(f'*****         Display the Schema')
    df.printSchema()
    print(f'*****         Display few rows')
    df.show(3, truncate = False)
    return df, nb_rows

In [16]:
file = 'airports-extended.csv'
global_airports, rows_global = load_global_airports(path, file)

*****         Loading 10667 rows
*****         Display the Schema
root
 |-- airport_ID: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- city: string (nullable = true)
 |-- country: string (nullable = true)
 |-- iata: string (nullable = true)
 |-- icao: string (nullable = true)
 |-- latitude: string (nullable = true)
 |-- longitude: string (nullable = true)
 |-- altitude: integer (nullable = true)
 |-- timezone: string (nullable = true)
 |-- dst: string (nullable = true)
 |-- tz_timezone: string (nullable = true)
 |-- type: string (nullable = true)
 |-- data_source: string (nullable = true)

*****         Display few rows
+----------+----------------------------+-----------+----------------+----+----+------------------+------------------+--------+--------+---+--------------------+-------+-----------+
|airport_ID|name                        |city       |country         |iata|icao|latitude          |longitude         |altitude|timezone|dst|tz_timezone         |type   |d

### Iso Country Data
#### Exploration
* Path = '../../data/wikipedia-iso-country-codes.csv'
* There are 246 rows and 5 columns in *wikipedia-iso-country-codes.csv*
* Name of the dataframe: df_iso_country
* I remove the 'ISO 3166-2' column. There is only one missing value, which I choose to fill in manually. 

Column Name | Description 
-|-|
**Country_name**|Country Name in English|
**Alpha2_code**|2-letter code of the country|
**Alpha3_code**|3-letter code of the country|
**Numeric_code**|ISO 3166-1 numeric code|

#### Read wikipedia-iso-country-codes

In [17]:
iso_country_schema = T.StructType([
    T.StructField('English short name lower case', T.StringType(), False),
    T.StructField('Alpha-2 code', T.StringType(), False),
    T.StructField('Alpha-3 code', T.StringType(), False),
    T.StructField('Numeric code', T.StringType(), False),
    T.StructField('ISO_3166-2', T.StringType(), True),    
]) 

In [18]:
def load_iso_country(path, file):
    df = spark.read \
            .format("csv") \
            .option('header', 'true') \
            .option('inferSchema', 'true') \
            .schema(iso_country_schema) \
            .load(path+file)
    df = df.withColumnRenamed("English short name lower case", "Country")\
           .withColumnRenamed("Alpha-2 code", "Alpha_2")\
           .withColumnRenamed("Alpha-3 code", "Alpha_3")\
           .withColumnRenamed("Numeric code", "Num_code")
    
    nb_rows = df.count()
    print(f'*****         Loading {nb_rows} rows')
    print(f'*****         Display the Schema')
    df.printSchema()
    print(f'*****         Display few rows')
    df.show(3, truncate = False)
    return df, nb_rows

In [19]:
file = 'wikipedia-iso-country-codes.csv'
iso_country, rows_iso = load_iso_country(path, file)

*****         Loading 246 rows
*****         Display the Schema
root
 |-- Country: string (nullable = true)
 |-- Alpha_2: string (nullable = true)
 |-- Alpha_3: string (nullable = true)
 |-- Num_code: string (nullable = true)
 |-- ISO_3166-2: string (nullable = true)

*****         Display few rows
+--------+-------+-------+--------+-------------+
|Country |Alpha_2|Alpha_3|Num_code|ISO_3166-2   |
+--------+-------+-------+--------+-------------+
|Zimbabwe|ZW     |ZWE    |716     |ISO 3166-2:ZW|
|Zambia  |ZM     |ZMB    |894     |ISO 3166-2:ZM|
|Yemen   |YE     |YEM    |887     |ISO 3166-2:YE|
+--------+-------+-------+--------+-------------+
only showing top 3 rows



### US cities Demographics
#### Exploration
* Path = '../../data/us-cities-demographics.csv'
* There are 2891 rows and 12 columns in us-cities-demographics.csv
* Dataframe name : df_demograph
* Less than 1% of the values are missing in some variables. I drop 'Number of Veterans' and 'Average Household Size' and keep: 

Column Name | Description | 
-|-|
**City**|Name of the city|
**State**|US state of the city|
**Median Age**|The median of the age of the population|
**Male Population**|Number of the male population|
**Female Population**|Number of the female population|
**Total Population**|Number of the total population|
**Foreign-born**|Number of residents of the city who were born outside the United States|
**State Code**|Code of the state of the city|
**Race**|Race class|
**Count**|Number of individual of each race|

#### Read us-cities-demographics

In [20]:
demograph_schema = T.StructType([
    T.StructField('City', T.StringType(), False),
    T.StructField('State', T.StringType(), False),
    T.StructField('Median_Age', T.FloatType(), False),
    T.StructField('Male_Population', T.IntegerType(), False),
    T.StructField('Female_Population', T.IntegerType(), False),
    T.StructField('Total_Population', T.IntegerType(), False),
    T.StructField('Number_of_Veterans', T.IntegerType(), False),
    T.StructField('Foreign-born', T.IntegerType(), False),
    T.StructField('Average_Household_Size', T.FloatType(), False),
    T.StructField('State_Code', T.StringType(), False),
    T.StructField('Race', T.StringType(), False),
    T.StructField('Count', T.IntegerType(), False)
]) 

In [21]:
def load_demograph(path, file):
    df = spark.read \
        .format("csv") \
        .option('header', 'true') \
        .option('delimiter', ';') \
        .option('inferSchema', 'true') \
        .schema(demograph_schema) \
        .load(path+file)
    nb_rows = df.count()
    print(f'*****         Loading {nb_rows} rows')
    print(f'*****         Display the Schema')
    df.printSchema()
    print(f'*****         Display few rows')
    df.show(3, truncate = False)
    return df, nb_rows

In [22]:
file = 'us-cities-demographics.csv'
demograph, rows_demo = load_demograph(path, file)

*****         Loading 2891 rows
*****         Display the Schema
root
 |-- City: string (nullable = true)
 |-- State: string (nullable = true)
 |-- Median_Age: float (nullable = true)
 |-- Male_Population: integer (nullable = true)
 |-- Female_Population: integer (nullable = true)
 |-- Total_Population: integer (nullable = true)
 |-- Number_of_Veterans: integer (nullable = true)
 |-- Foreign-born: integer (nullable = true)
 |-- Average_Household_Size: float (nullable = true)
 |-- State_Code: string (nullable = true)
 |-- Race: string (nullable = true)
 |-- Count: integer (nullable = true)

*****         Display few rows
+-------------+-------------+----------+---------------+-----------------+----------------+------------------+------------+----------------------+----------+------------------+-----+
|City         |State        |Median_Age|Male_Population|Female_Population|Total_Population|Number_of_Veterans|Foreign-born|Average_Household_Size|State_Code|Race              |Count|
+-----

### WDIData.csv


#### Exploration 
* Path = '../../data/WDIData.csv'
* There are 422136 rows and 64 columns in *WDIData.csv*
* Dataframe name : df_indicator_dev
* This dataset contains 64 variables describing the economic context, most of which are one column per year (1960 to 2018). Much of the data is missing, between 40% and 91%. I only need the year 2015 to describe the economic context of each country and to aggregate per country. I kept:

Column Name | Description | 
-|-|
**Country Name**|Name of the country|
**Country Code**|3 letters code of country|
**Indicator Name**|Name of the economic development indicator, e.g. "2005 PPP conversion factor, GDP (LCU per international $)"|
**Indicator Code**|Letter code of the indicator|
**2015**|Value of the indicator for 2015 (the source file has one column per year since 1960)|

#### Read WDIData

In [23]:
def load_indicator_dev(path, file):
    df = spark.read \
        .format("csv") \
        .option('header', 'true') \
        .option('inferSchema', 'true') \
        .load(path+file) \
        .select("Country Name","Country Code", "Indicator Name", "Indicator Code", "2015" ) \
        .toDF("Country_Name","Country_Code", "Indicator_Name", "Indicator_Code", "2015")
    nb_rows = df.count()
    print(f'*****         Loading {nb_rows} rows')
    print(f'*****         Display the Schema')
    df.printSchema()
    print(f'*****         Display few rows')
    df.show(3, truncate = False)
    return df, nb_rows


In [24]:
file = 'WDIData.csv'
indicator_dev, rows_dev = load_indicator_dev(path, file)

*****         Loading 422136 rows
*****         Display the Schema
root
 |-- Country_Name: string (nullable = true)
 |-- Country_Code: string (nullable = true)
 |-- Indicator_Name: string (nullable = true)
 |-- Indicator_Code: string (nullable = true)
 |-- 2015: double (nullable = true)

*****         Display few rows
+------------+------------+-------------------------------------------------------------------------+-----------------+----------------+
|Country_Name|Country_Code|Indicator_Name                                                           |Indicator_Code   |2015            |
+------------+------------+-------------------------------------------------------------------------+-----------------+----------------+
|Arab World  |ARB         |2005 PPP conversion factor, GDP (LCU per international $)                |PA.NUS.PPP.05    |null            |
|Arab World  |ARB         |2005 PPP conversion factor, private consumption (LCU per international $)|PA.NUS.PRVT.PP.05|null         

### I94 SAS Labels Description

Here is some data extracted from '../../data/I94_SAS_Labels_Descriptions.SAS'.
To explain the codes used in the I94 immigration data, I create 5 files and read them here. The file was cleaned and parsed with the script parse_file.py, see below:

In [25]:
def parse_file(input_data, file, key):
    """
    Parse one label group (key) from the SAS labels file and write it to a parquet file.
    """
    output_parquet = '../../data/'
    path_file = input_data + file
    
    #file_parse = 'I94_SAS_Labels_Descriptions.SAS'
    with open(path_file, 'r') as f:
        file = f.read()
    sas_dict={}
    key_name = ''

    for line in file.split("\n"):
        line = re.sub(r"\s+", " ", line)
        if '/* I94' in line :         
            line = line.strip('/* ')
            key_name = line.split('-')[0].replace("&", "_").replace(" ", "").strip(" ").lower() 
            sas_dict[key_name] = []
        elif '=' in line and key_name != '' :
            #line_trans = re.sub("([A-Z]*?),(\s*?[A-Z]{2}\s)","\\1=\\2", line)
            #print(line_trans)
            sas_dict[key_name].append([item.strip(' ').strip(" ';") for item in line.split('=')])
        
    if key == "i94port":
        #pattern = r'[^()]*\s*\([^()]*\)'
        columns = ["Port_id", "Port_city", "State_id"]
        swap = sas_dict[key]          
        sas_dict[key] = []
        for x in swap:           
            if "," in x[1]:
                # Split "CITY, ST" on the last comma into city and state code
                city, state = x[1].rsplit(",", 1)
                # Append a concrete list (not a generator) so each row is materialized
                sas_dict[key].append([x[0], city, state.strip()])
    if key == "i94cit_i94res":
        columns = ["Country_id", "Country"]
        swap = sas_dict[key]
        for x in swap:
            #x[0] = int(x[0])
            if "mexico" in x[1]:
                x[1] = "mexico"        
    if key == "i94mode":
        columns = ["Mode_id", "Mode"]
        #swap = sas_dict[key]
        #for x in swap:
        #    x[0] = int(x[0])
    if key == "i94addr":
        columns = ["State_id", "State"]
        
    if key == "i94visa":
        columns = ["Code_visa", "Visa"]
        #swap = sas_dict[key]
        #for x in swap:
        #    x[0] = int(x[0])
            
    # Write the parquet file only when the key was found and has rows
    rows = sas_dict.get(key, [])
    if rows:
        df = pd.DataFrame(rows, columns=columns)
        df.sort_values(df.columns[0], inplace=True)
        df.to_parquet(f'{output_parquet}{key}.parquet')
    return len(rows)

In [26]:
# Parse I94_SAS_Labels_Descriptions.SAS and save one parquet file per label group in '../../data/'
path = '../../data/'
input_data = '../../data/'
file = 'I94_SAS_Labels_Descriptions.SAS'
print(path+file)
print("...Begin to create I94 Labels files")

# Create i94port, i94visa, i94addr, i94cit_i94res and i94mode parquet files
for key in ["i94port", "i94visa", "i94addr", "i94cit_i94res", "i94mode"]:
    nb = parse_file(input_data, file, key)
    print(f'There are {nb} rows in {key}.parquet')


../../data/I94_SAS_Labels_Descriptions.SAS
...Begin to create I94 Labels files
There are 583 rows in i94port.parquet
There are 3 rows in i94visa.parquet
There are 55 rows in i94addr.parquet
There are 289 rows in i94cit_i94res.parquet
There are 4 rows in i94mode.parquet


In [27]:
i94_mode = pd.read_parquet(path+'i94mode.parquet')
print(f'***** Dataframe i94_mode *****')
print("There are {} rows.".format(len(i94_mode)))
print(' ')

i94_ctry = pd.read_parquet(path+'i94cit_i94res.parquet')
print(f'***** Dataframe i94_ctry *****')
print("There are {} rows.".format(len(i94_ctry)))
print(' ')

i94_addr = pd.read_parquet(path+'i94addr.parquet')
print(f'***** Dataframe i94_addr *****')
print("There are {} rows.".format(len(i94_addr)))
print(' ')

i94_visa = pd.read_parquet(path+'i94visa.parquet')
print(f'***** Dataframe i94_visa *****')
print("There are {} rows.".format(len(i94_visa)))
print(' ')

i94_port = pd.read_parquet(path+'i94port.parquet')
print(f'***** Dataframe i94_port *****')
print("There are {} rows.".format(len(i94_port)))
print(' ')

***** Dataframe i94_mode *****
There are 4 rows.
 
***** Dataframe i94_ctry *****
There are 289 rows.
 
***** Dataframe i94_addr *****
There are 55 rows.
 
***** Dataframe i94_visa *****
There are 3 rows.
 
***** Dataframe i94_port *****
There are 583 rows.
 


In [28]:
%who_ls DataFrame

['airport_code',
 'demograph',
 'global_airports',
 'i94_addr',
 'i94_ctry',
 'i94_mode',
 'i94_port',
 'i94_visa',
 'immigration',
 'indicator_dev',
 'iso_country',
 'temperature']

# Step 2: Explore and Assess the Data

### I94 Immigration Data
* i94addr, missing 152592 values (US state code, 2 letters)
    * fill from Port_id using the dataframe 'i94port'
    * join on 'df_immigration.i94port == port_state_dic.Port_id', which has no missing values
    * null values are replaced by State_id
* int_col = ['cicid', 'i94yr', 'i94mon','i94cit', 'i94res', 'i94mode', 'i94bir', 'i94visa']
    * fill nulls with the default value from the dictionary and cast the int_col columns to Integer
* str_cols = ['i94addr', 'i94port', 'gender', 'airline', 'visatype']
    * fill nulls with the default value from the dictionary
* date_col = ['arrdate' (double, SAS format), 'dtadfile' (string YYYYMMDD)]
    * 'arrdate' is in SAS date format: a value represents the number of days between January 1, 1960 and another date.
    * cast the dates and fill the null values
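
The date casts and the `i94addr` fallback listed above can be sketched in plain Python. This is only an illustration, not the notebook's Spark code: the helper names `sas_to_date`, `parse_yyyymmdd` and `fill_state` are hypothetical, and the `default` arguments stand in for the default values of the dictionary mentioned above, whose content is not shown here.

```python
from datetime import date, datetime, timedelta

SAS_EPOCH = date(1960, 1, 1)  # SAS dates count days from January 1, 1960


def sas_to_date(sas_days, default=None):
    """Convert a SAS day count (such as the 'arrdate' double) to a date."""
    if sas_days is None:
        return default  # assumption: the caller supplies the default value
    return SAS_EPOCH + timedelta(days=int(sas_days))


def parse_yyyymmdd(text, default=None):
    """Parse a 'dtadfile'-style string (YYYYMMDD), falling back to default."""
    try:
        return datetime.strptime(text, "%Y%m%d").date()
    except (TypeError, ValueError):
        # None or a malformed string: use the default instead of failing
        return default


def fill_state(i94addr, i94port, port_to_state):
    """Keep i94addr when present, otherwise use the State_id of the port."""
    if i94addr:
        return i94addr
    return port_to_state.get(i94port)


# Hypothetical sample lookup, shaped like the Port_id -> State_id mapping
port_to_state = {"NYC": "NY"}

print(sas_to_date(20545.0))
print(parse_yyyymmdd("20160401"))
print(fill_state(None, "NYC", port_to_state))
```

In Spark the same logic would typically be expressed with column functions and a join on the port table rather than row-by-row helpers; the plain functions only make the rules explicit.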

### Global Land Temperature Data

* As we see in the [data exploration file](./0_dataset_information.ipynb), the first date is in 1743, and there is one row per day per town.
    * Aggregate the data
* Drop the "dt", "AverageTemperatureUncertainty", "Latitude" and "Longitude" columns

### Airports Code Data
* Missing values in Iata_code and Municipality across the whole table
    * I drop ["elevation_ft", "continent", "gps_code", "coordinates"]
    * I keep: ident, airport_type, airport_name, country_iso, city_name, iata_code, state_id
    * The missing values in iata_code disappear with the drop.
* I extract the State_id by splitting the local_code, and rename the columns.

### Global Airports Data
* There are some missing values.
* Data cleaning: I drop ["icao", "latitude", "longitude", "altitude", "timezone", "dst", "tz_timezone", "data_source"] and keep only the rows whose 'type' is airport

### Iso Country Data
* No missing values
* I drop 'ISO_3166-2' and rename columns

### US cities Demographics
* missing values
* dataclean
* df_demograph

### World Development Indicators Data
* df_indicator_dev

# Step 3 Extract_Transform_Load

### Conceptual Data Model

The model is based on a star schema, which makes it quick to find elements that are linked to each other. It consists of a large fact table and a circle of other tables, called "dimensions", that contain the descriptive elements of the fact.
The fact table contains the observable data (the facts) that we have on a subject and want to study along axes of analysis (the dimensions).  
The immigration dataset is the center of this project and allows us to explore foreign visitors. It will be the fact table. The dimension tables give us information about one aspect of these visitors: country, airport, economic indicators and US demography.
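
To make the fact/dimension relationship concrete, here is a minimal sketch using an in-memory SQLite database. The table names `fact_immigration` and `dim_country` and the sample rows are hypothetical; only the column names `cicid`, `i94res` and `i94bir` come from the immigration data described earlier. The project itself stores the tables as parquet files, so this only illustrates how a fact table is analysed along one dimension.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension: descriptive attributes of a country (hypothetical layout)
    CREATE TABLE dim_country (
        country_id INTEGER PRIMARY KEY,
        country    TEXT
    );
    -- Fact: one row per arrival, pointing to the dimension by key
    CREATE TABLE fact_immigration (
        cicid  INTEGER PRIMARY KEY,
        i94res INTEGER REFERENCES dim_country(country_id),
        i94bir INTEGER
    );
    INSERT INTO dim_country VALUES (1, 'Country A'), (2, 'Country B');
    INSERT INTO fact_immigration VALUES (10, 1, 21), (11, 1, 25), (12, 2, 30);
""")

# Analyse the facts along the country dimension: arrivals and average age
rows = conn.execute("""
    SELECT d.country, COUNT(*) AS visitors, AVG(f.i94bir) AS avg_age
    FROM fact_immigration AS f
    JOIN dim_country AS d ON d.country_id = f.i94res
    GROUP BY d.country
    ORDER BY d.country
""").fetchall()

for country, visitors, avg_age in rows:
    print(country, visitors, avg_age)
```

Each extra dimension (airport, economic indicators, US demography) would be joined to the fact table in the same way, through its own key.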

The ETL loads the different files, in their different formats, with Spark and then cleans them. The second stage builds the dimension tables and the fact table. The tables are then written to parquet files and checked.
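
The final check is not detailed here. As an assumption, a simple version could verify that each table is non-empty and that its key column has no missing values. The function `check_table` below is a hypothetical, plain-Python sketch of that idea, working on a table given as a list of dicts rather than a Spark dataframe.

```python
def check_table(name, rows, key):
    """Return a list of problems found in a table given as a list of dicts."""
    problems = []
    if not rows:
        problems.append(f"{name}: table is empty")
    # Count rows where the key column is absent or None
    missing = sum(1 for row in rows if row.get(key) is None)
    if missing:
        problems.append(f"{name}: {missing} row(s) without {key}")
    return problems


# Hypothetical sample: one valid row and one row with a missing key
sample = [{"cicid": 1}, {"cicid": None}]
for problem in check_table("fact_immigration", sample, "cicid"):
    print(problem)
```

Returning the problems as a list, instead of raising on the first one, lets the ETL report every failed table in a single run.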