# Step 3: Define the Data Model

### The data model is organized as a star schema
- The goal is to **trade storage space and consistency checks for speed**.
- The key is for queries to perform **as few joins as possible**, which are an expensive but unavoidable operation.
- For this reason, dimension schemas will contain all columns relevant to the dimension at hand, even if this **information is repeated** in other dimensions, rather than normalizing.

### Design principles that I followed
- *Pure star schema*, as opposed to a snowflake schema. There will be a lot of repetition, for the reason stated above; this is acceptable because storage is not a constraint here.
- I will use a dimension's natural key, if any, as its *primary key* if the key is derived from a single data source. This principle also applies to compound primary keys (for instance, in the ```route_dim``` table).
- All dimension columns whose values are codes or abbreviations will have a corresponding *descriptive column* (with the ```_desc``` suffix).
- *No measures in dimension tables*. So, I used the demographics and temperature tables only to derive static population, ethnic and weather categories.
- Include both numerical and descriptive columns for every categorical variable, with the number reflecting the desired order if it is an ordinal categorical variable. The whole point of a star schema is to make analysis as easy as possible, and this way the schema caters to both business users (who prefer descriptive values) and data scientists (who might need numerical variables for some visualizations and features).
    - The column naming convention I follow is ```<column name>``` for the description columns and ```<column name>_id``` for their numerical codes. The reason is to give business users an everyday name, whereas I would expect data scientists to deal with the less natural labels.
- Do not use *null* values as foreign key values in the fact tables, nor in dimension columns (see [Null Attributes in Dimensions](https://www.kimballgroup.com/data-warehouse-business-intelligence-resources/kimball-techniques/dimensional-modeling-techniques/null-dimension-attribute/)). It is OK to use them in fact table measure values (see [Nulls in Fact Tables](https://www.kimballgroup.com/data-warehouse-business-intelligence-resources/kimball-techniques/dimensional-modeling-techniques/fact-table-null/))
    - Any immigration row with a null in a dimension's natural key will survive the left anti join with that dimension table, even if it already exists in said table, because a null never compares equal to anything in a join condition.
    - During cleaning it is very important to ensure that values in columns that belong to any dimension's natural key are **never** null: fill them in, replace them with a null marker, or drop the rows, but never leave them as null.
- For columns that represent *ordinal categorical variables*, their ids' ordering should reflect it.
- Do not use ```monotonically_increasing_id``` to generate surrogate keys (SKs).
    - The ids will not be unique across runs: when left anti joining immigration with a dimension, ids from one month are reused in the next. This is probably because the id is based on the partition of the immigration data, which can only be on the left of the join and will often be the same, so new dimension ids are generated again starting from 0, offset by the immigration partition.
    - Use md5 of the natural key instead; the collision risk is low.
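A minimal sketch of the md5 approach, written in plain Python rather than Spark so that it runs anywhere. The function name `surrogate_key`, the `UNKNOWN` null marker and the `|` separator are assumptions made for illustration, not names taken from the project.

```python
import hashlib

# Hypothetical null marker: a null inside a natural key would never match in a join.
UNKNOWN = "UNKNOWN"


def surrogate_key(*natural_key):
    """Derive a deterministic surrogate key from the parts of a natural key."""
    # Replace nulls with the marker and trim, so that equal keys always hash equally.
    parts = [UNKNOWN if part is None else str(part).strip() for part in natural_key]
    # Join with a separator so that ("AB", "C") and ("A", "BC") hash differently.
    return hashlib.md5("|".join(parts).encode("utf-8")).hexdigest()


# The same route gets the same key in every monthly run, unlike a running counter.
april = surrogate_key("AA", "100", "NYC")
may = surrogate_key("AA", "100", "NYC")
```

In PySpark, the same key could presumably be computed with `md5` over `concat_ws` from `pyspark.sql.functions`, after replacing nulls with the marker.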

## Model description

### Abstract data model

I will start by presenting what I consider as the core entities that are relevant to my understanding of this model's domain:

![Conceptual Data Model](images/de-capstone-abstract-data-model.png)

**NOTE**: I preferred to draw a more abstract and simpler (fewer symbols) conceptual data model diagram instead of the more common E-R diagram, because in my opinion the latter includes detail that is unnecessary when one is already set on a star model. The intention is to highlight the main atoms and how they are related. The multiplicities represent business rules and will be useful later on when designing data validation.

### Core entities description

I will describe only those entities whose precise meaning might not be obvious.

#### Route
A particular airline and flight number combination that flies to a given immigration port city.
- it is possible for an airplane to call at multiple US airports, but under the same flight number (see [Can two flights of the same airlines have the same number?](https://www.quora.com/Can-two-flights-of-the-same-airlines-have-the-same-number)).

#### Flight
An airplane on a particular route that arrived at a given immigration port on a given day.

#### Foreign Visitor
A non-US citizen who arrived on a flight for non-immigration purposes, such as pleasure, business or studies.
- A flight may well have carried other passengers, such as US citizens, but we have no data about them, so they are out of scope.

### Star schema

![Star schema](images/de-capstone-star-schema.png)

#### Tables

##### flight_fact table

The only fact table in this model, it contains aggregate measures of a flight's foreign visitors.

The main measure, ```num_visitors```, is the number of foreign visitors in each group that results from grouping a flight's visitors by their visitor categories.

Each ```time_id``` and ```route_id``` combination refers to an individual flight, which may have multiple foreign ```visitor_id```s, one for each passenger category combination in that particular flight.

##### foreign_visitor_dim table
Dimension table that represents combinations of foreign visitor categories.

##### time_dim table

Dimension table that represents time with day granularity.

##### route_dim table

Dimension table that represents a route, with many columns that describe the destination.

## ETL

TODO: high level architecture diagram as in https://medium.com/deutsche-telekom-gurgaon/etl-data-pipeline-and-data-reporting-using-airflow-spark-livy-and-athena-for-oneapp-6d081a419adc
- explain the different Airflow tasks in a list as in the link above

![Data Flow diagram](images/data_flow.png)

Goal:
- a datalake with given star schema as parquet in S3
    - vs Redshift, less expensive, need to pay only for cheap object storage
    - pay for processing only on demand
- to do all ETL in Airflow and PySpark ...
- ... across multiple Airflow tasks for visibility ...
- ... on a single Spark Cluster ...
- ... while passing intermediate in-memory Spark DataFrames between Airflow tasks ...

Solution:

- Divide the pipeline into multiple *[SparkSubmitOperator](https://airflow.apache.org/docs/apache-airflow/1.10.12/_api/airflow/contrib/operators/spark_submit_operator/index.html)* tasks
    - this will allow for fine grained visibility in the Airflow UI
    - the drawback is that the data can't flow through main memory
    - ETL code must be pyspark 2.4, otherwise incompatible with Airflow v???
        - in a way, using Spark SQL might be superior to DataFrames, the former being more portable between Spark versions (e.g. ```trim```)
    - [Passing dependencies to python spark jobs](https://stackoverflow.com/questions/38066318/whats-the-difference-between-archives-files-py-files-in-pyspark-job-argum)

Do NOT use **[Apache Livy](https://livy.apache.org/)**.
- as in [ETL Data Pipeline and Data Reporting using Airflow, Spark, Livy and Athena for OneApp](https://medium.com/deutsche-telekom-gurgaon/etl-data-pipeline-and-data-reporting-using-airflow-spark-livy-and-athena-for-oneapp-6d081a419adc)

EMR Cluster
- use Boto3 library inside Airflow task to manage it

#### Spark jobs design
```
main
 |__ plugins
         |___ etl_spark_jobs
                         |___ etl_*.py
```
- Most Airflow tasks will end up submitting a job to a Spark cluster.
- Executed by ```ETLSparkOperator```.
- To be testable, they must not execute on import.
    - They all have a trampoline function at the end, named with a ```_trampoline``` suffix.
    - The trampoline function is executed by a ```__name__ == '__main__'``` when the job is submitted to Spark.
- Must be **idempotent**. This requirement is implicitly part of every job's specification, with no need to state it explicitly, and hence must be unit tested for every job.
- Jobs will be specified just like any function, with input types and output types
    - some of the inputs will be specified as Spark DataFrames, and their schema is part of their type, in an abstract way; the same applies to their outputs
    - one advantage of this approach is that general unit testing techniques can be applied, by first writing a specification as a comment in the job's source code, and then systematically synthesising the unit test from the specification (see [MIT Reading 6.005 — Software Construction, 6: Specification](http://web.mit.edu/6.005/www/sp16/classes/06-specifications/))
    - so, preconditions, postconditions and effects will be explicitly stated as part of each job's specification
    - when a precondition is either expensive or non-trivial for the client to check, the job will check it and throw an exception; otherwise the job's behaviour is undefined when the precondition does not hold.
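The job design above could look roughly like the following sketch. The job name `etl_clean_port` and its specification are hypothetical, and plain Python lists stand in for Spark DataFrames so that the example is self-contained.

```python
import sys


def etl_clean_port(rows):
    """Hypothetical job body.

    Precondition: rows is an iterable of (port_id, name) string pairs, no nulls.
    Postcondition: returns the trimmed pairs, sorted, without duplicates.
    Effects: none, so running it on its own output returns the same output.
    """
    return sorted({(port_id.strip(), name.strip()) for port_id, name in rows})


def etl_clean_port_trampoline(argv):
    """Entry point used only when the file is submitted as a script."""
    # A real job would create the Spark session and read its inputs here;
    # this sketch just cleans a hard-coded sample and prints the result.
    sample = [(" NYC", "New York "), ("NYC", "New York")]
    print(etl_clean_port(sample))


# Importing the module (for example from a unit test) runs nothing.
if __name__ == "__main__":
    etl_clean_port_trampoline(sys.argv[1:])
```

A unit test derived from the specification can then check both the postcondition and idempotence by applying the job to its own output.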
    

### Data sources

Storage
```
eda    laptop   datalake/eda (if a 100 data scientists, they would need accounts)
dev    laptop   datalake/dev
test   laptop   datalake/airflow
prod   S3       s3:...
```

```
SparkETL('conf_name') <--- use Airflow vars and conns

3 different environments + common (use python parseconf, files stored in plugins):
- common.conf (implicit, no need to pass, could override with airflow vars)
- dev.conf
- airflow.conf
- cloud.conf
```
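A sketch of the layered configuration, assuming that "parseconf" refers to Python's standard `configparser` module. The section and option names are hypothetical, and the two files are inlined as strings to keep the example self-contained.

```python
import configparser

# Stand-ins for common.conf and dev.conf; the keys are made up for illustration.
COMMON_CONF = """
[datalake]
root = datalake
format = parquet
"""

DEV_CONF = """
[datalake]
root = datalake/dev
"""


def load_conf(*sources):
    """Read the sources in order; later ones override earlier ones."""
    parser = configparser.ConfigParser()
    for source in sources:
        parser.read_string(source)
    return parser


conf = load_conf(COMMON_CONF, DEV_CONF)
root = conf["datalake"]["root"]    # overridden by the environment file
fmt = conf["datalake"]["format"]   # inherited from the common file
```

With real files, `ConfigParser.read` accepts a list of paths and applies the same override order.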


```
        root 

-eda    -raw    - curated           - staging    - check+load -->  - production
        sas     state_ref           immigration                    *_dim
        ref     port_ref            *_dim                          *_fact
                country_ref         incr_fact
                demographics_ref    
                airport_ref
                port_to_airport_ref
                temperature_ref
                
staging *_ref will be just in memory
- when loading to prod, first load dims, then facts
```

### Extract

The source data is extracted as is into the following staging tables:

- **immigration_staging**(Unnamed: 0, cicid, i94yr, i94mon, i94cit, i94res, i94port, arrdate, i94mode, i94addr, depdate, i94bir, i94visa, count, dtadfile, visapost, occup, entdepa, entdepd, entdepu, matflag, biryear, dtaddto, gender, insnum, airline, admnum, fltno, visatype)


- From immigration data dictionary:
    - **i94port_staging**(port_id, port_desc)
    - **i94country_staging**(country_id, country_desc)
    - **i94mode_staging**(mode_id, mode_desc)
    - **i94state_staging**(state_id, state_desc)
    - **i94visa_staging**(visa_id, visa_desc)


- **airports_staging**(ident, type, name, elevation_ft, continent, iso_country, iso_region, municipality, gps_code, iata_code, local_code, coordinates)


- **demographics_staging**(City, State, Median Age, Male Population, Female Population, Total Population, Number of Veterans, Foreign-born, Average Household Size, State Code, Race, Count)


- **temperatures_staging**(dt, AverageTemperature, AverageTemperatureUncertainty, State, Country)

### Transform

#### Cleaning

The goal is to create views from which the fact and dimension tables can be easily derived, with all data issues solved.

The way I understand cleaning data is quite broad: it consists of normalizing, dealing with missing and out of range values, formatting, eliminating duplicates, enriching, verifying and standardizing. The outcome is rectangular tidy data (columns represent variables and rows represent observations), with tables that have only the required columns with labels suitable for the domain at hand, and that can be easily and correctly joined with other clean tables for further processing.

In particular, standardizing means that all tables refer to the same entities by the same set of values, and verifying means asserting assumptions (such as uniqueness).

The resulting clean tables are:

- **immigration**(arrival_date, airline, flight_number, port, citizenship, residence, age, age_group, gender, visa, address_state, stay, stay_group)
- **state**(state_id, name, type_id, type)
- **port**(port_id, state_id, name)
- **country**(country_id, country)
- **airport**(airport_id, name, city, state_id, type, coordinates)
- **demographics**(state_id, city, asian, black, latino, native, white, ethnicity_id, ethnicity, population, size_id, size)
- **temperature**(state_id, climate_id, climate)

Principles:
- Trim every raw string; annoying bugs can creep in otherwise.
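As an illustration of the kind of bug meant here, an untrimmed code silently fails to match its reference row. A tiny sketch with made-up values:

```python
# Reference table keyed by port code; values are illustrative only.
port_ref = {"NYC": "New York", "LAX": "Los Angeles"}

raw_codes = ["NYC", " NYC", "LAX "]

# Without trimming, two of the three lookups miss even though the data looks right.
untrimmed_hits = [code in port_ref for code in raw_codes]

# Trimming first makes every lookup succeed.
trimmed_hits = [code.strip() in port_ref for code in raw_codes]
```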

#### Generate association tables:

The goal is to create, starting from clean tables, association tables that make it easy to join clean tables with each other.

- port_to_airport(port_id, airport_id)

#### Insert required rows into dimension tables

For each dimension:
1. Figure out which rows are missing, by performing an anti join between each dimension's natural keys and immigration.
2. For each dimension join those missing natural keys with reference tables to fill in all columns.
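The two steps can be sketched with plain Python sets and dictionaries standing in for the tables. The country codes and the `country_ref` contents are made up for illustration.

```python
# Natural keys already present in the dimension (made-up values).
country_dim_keys = {"FR", "JP"}

# Natural keys seen in this month's cleaned immigration data.
immigration_keys = {"FR", "BR", "DE"}

# Reference table used to fill in the remaining dimension columns.
country_ref = {"FR": "France", "JP": "Japan", "BR": "Brazil", "DE": "Germany"}

# Step 1: anti join, i.e. the keys in immigration that the dimension lacks.
missing_keys = immigration_keys - country_dim_keys

# Step 2: join the missing keys with the reference table to build the new rows.
new_rows = sorted((key, country_ref[key]) for key in missing_keys)
```

In PySpark, step 1 corresponds to a join with `how='left_anti'`.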

#### Process fact table

With the immigration table:
1. Join with dimension tables using their natural keys to get corresponding surrogate keys.
2. Group by all dimension natural keys to calculate aggregates.
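A small sketch of both steps, using dictionaries as stand-ins for the dimension tables. The surrogate key values (`r1`, `v1`, ...) and the reduced set of visitor categories are assumptions made to keep the example short.

```python
from collections import Counter

# Natural key -> surrogate key lookups, standing in for the dimension tables.
route_sk = {("AA", "100", "NYC"): "r1", ("BA", "200", "LAX"): "r2"}
visitor_sk = {("France", "Pleasure"): "v1", ("Japan", "Business"): "v2"}

# Cleaned immigration rows: (arrival_date, route natural key, visitor natural key).
immigration = [
    (20160401, ("AA", "100", "NYC"), ("France", "Pleasure")),
    (20160401, ("AA", "100", "NYC"), ("France", "Pleasure")),
    (20160401, ("BA", "200", "LAX"), ("Japan", "Business")),
]

# Step 1: swap each natural key for its surrogate key (the join).
keyed = [
    (date, route_sk[route], visitor_sk[visitor])
    for date, route, visitor in immigration
]

# Step 2: group by the keys and count the rows in each group (num_visitors).
fact_rows = sorted(key + (count,) for key, count in Counter(keyed).items())
```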

### Load

Append.

# Scratch area

Do the ETL in a Spark cluster in EMR
- use the cluster's HDFS for storing intermediate results, so that it is efficient to do the different steps as different Airflow tasks.
    - but maybe can just store intermediate results in S3, hopefully data size will be much smaller after the first step, do it this way in the first pass for simplicity
    - remember to clean up at the end of the pipeline
    - let Spark take care of partitioning

### Extract

#### Sources
- immigration monthly SAS
- airports
- state temperatures
- demographics
- i94 data dictionary
- airline.dat

### Transform

##### immigration monthly SAS into flight_fact

want ```flight_fact(arrdate, (airline,fltno,port), citizenship, residence, address, passenger_num)```

    - filter:
        - keep only arrivals by Air
    - drop all except for:
        - arrdate, airline, fltno, i94cit, i94res, i94port, i94addr
    - decode:
        - i94port, i94addr, airline, fltno
    - rename and cast:
        - arrdate as int
        - concat(airline,fltno,port) as string   # in a first pass, later maybe md5 % maxint
        - i94cit as citizenship int
        - i94res as residence int
        - i94addr as address int
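A rough plain-Python sketch of the filter, drop and rename steps above. It assumes that `i94mode` code 1 means arrival by air, uses made-up sample values, and leaves the decoding of `i94addr` out.

```python
# Raw rows as dicts with the original SAS column names; values are made up.
raw = [
    {"arrdate": 20545.0, "i94mode": 1.0, "airline": "AA", "fltno": "00100",
     "i94port": "NYC", "i94cit": 111.0, "i94res": 111.0, "i94addr": "NY"},
    {"arrdate": 20545.0, "i94mode": 2.0, "airline": None, "fltno": None,
     "i94port": "MIA", "i94cit": 209.0, "i94res": 209.0, "i94addr": "FL"},
]

AIR = 1  # assumption: i94mode code for arrivals by air


def to_flight_row(row):
    """Keep only the wanted columns, renamed and cast."""
    return {
        "arrdate": int(row["arrdate"]),
        # First-pass route key: plain concatenation of airline, fltno and port.
        "route": "".join([row["airline"], row["fltno"], row["i94port"]]),
        "citizenship": int(row["i94cit"]),
        "residence": int(row["i94res"]),
        "address": row["i94addr"],  # decoding omitted in this sketch
    }


# Filter first, so that rows without airline data never reach the concatenation.
flights = [to_flight_row(row) for row in raw if row["i94mode"] == AIR]
```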
        
##### airports

want airports_ref(state, city, coordinates)

    - filter:
        - don't filter by iso_country, this way might keep Guam, etc
    - transform:
        - iso_region into state
        - municipality into city
    - drop all except for:
        - state, city, coordinates
    - group by:
        - state, city and just choose any one coordinate

        
##### state temperatures

want temperatures_ref(state, temperature)

Questions: 
- how to handle missing data (i.e. 2016)?
    - try to find more recent data source or update this one
    - predict with ML
    - just impute latest for same month
- seems that some ports are not in US states, try to join those too

    - filter:
        - as above, keep all countries around just in case
        - keep only latest year
    - rename:
        - AverageTemperature to temperature decimal(3,1)
        - State to state string
    - drop:
        - keep state, temperature

##### demographics

Questions:
- how to handle missing years?
    - in a first pass, treat it as static (with SCD 0 or 1)
- in a first pass, keep just total population

##### i94 data dictionary
    - create country data frame
    
##### airline.dat
    - create airline data frame

#### Dimensions

- country_dim

#### Facts
- from immigration data frame:
    - i94cit, i94res, i94

### Load

- In the cloud
    - Prefer a Data Lake instead of Data Warehouse
    - The reason is that it is not yet clear how frequently this data will be analysed, so provisioning, say, a Redshift cluster could be overkill: there is no need to pay for dedicated computation resources and database-managed storage.
    - For this reason, blob storage would be preferable
    - This way analysts can use any tool they want (Pandas, Databricks, spark cluster or standalone, Athena, viz tools, etc)
    - If there is high demand, might consider Redshift later on
    - There is no need to stick with Spark, given that data volume is not that high
    
Save everything in **parquet**, because it is structured and distributed

- use fastparquet, pyarrow gives error
    - index=False
    - book recommends gzip compression, but start by using default snappy
    - can go into more detail with regards to partitioning for big data case in step 5

- flight_fact:
    - flight_fact.parquet
    - partition_cols=[year, month] #, md5(airline,fltno,port)]
    - in a first pass partition just on (year, month)
- country_dim:
    - country.parquet
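The sketch below only illustrates the directory layout that partitioning on (year, month) is expected to produce, namely Hive-style `column=value` directories; it does not write any parquet. The helper `partition_dir` and the sample rows are hypothetical.

```python
from collections import defaultdict

# Fact rows with the partition columns present; values are made up.
rows = [
    {"year": 2016, "month": 4, "num_visitors": 12},
    {"year": 2016, "month": 4, "num_visitors": 3},
    {"year": 2016, "month": 5, "num_visitors": 7},
]


def partition_dir(root, row, partition_cols):
    """Build the Hive-style directory a row would land in."""
    parts = [f"{col}={row[col]}" for col in partition_cols]
    return "/".join([root] + parts)


# Group the rows by their target directory, one directory per (year, month).
by_dir = defaultdict(list)
for row in rows:
    by_dir[partition_dir("flight_fact.parquet", row, ["year", "month"])].append(row)
```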

## Eating my dog food

Perform some analysis using my schema
- opportunity to show off my pandas and plotting

[Nulls in Fact Tables](https://www.kimballgroup.com/data-warehouse-business-intelligence-resources/kimball-techniques/dimensional-modeling-techniques/fact-table-null/)

> Null-valued measurements behave gracefully in fact tables. The aggregate functions (SUM, COUNT, MIN, MAX, and AVG) all do the “right thing” with null facts. However, nulls must be avoided in the fact table’s foreign keys because these nulls would automatically cause a referential integrity violation. Rather than a null foreign key, the associated dimension table must have a default row (and surrogate key) representing the unknown or not applicable condition.

[Null Attributes in Dimensions](https://www.kimballgroup.com/data-warehouse-business-intelligence-resources/kimball-techniques/dimensional-modeling-techniques/null-dimension-attribute/)
> Null-valued dimension attributes result when a given dimension row has not been fully populated, or when there are attributes that are not applicable to all the dimension’s rows. In both cases, we recommend substituting a descriptive string, such as Unknown or Not Applicable in place of the null value. Nulls in dimension attributes should be avoided because different databases handle grouping and constraining on nulls inconsistently.

### SparkSubmitOperator notes

https://medium.com/codex/executing-spark-jobs-with-apache-airflow-3596717bbbe3

- [Install Airflow Spark provider package and PySpark](https://airflow.apache.org/docs/apache-airflow-providers-apache-spark/stable/index.html)
```
conda install -c conda-forge pyspark
conda install -c conda-forge apache-airflow-providers-apache-spark FAILED
pip install apache-airflow-providers-apache-spark
```
- you must register the master Spark connection in the Airflow administrative panel.
- conda installs pyspark 2.4; latest, 3.2.1, breaks Airflow
- as such, it requires java8

### Livy notes

- [I have tried **Livy** but since it **doesn't support spark version 3**](https://stackoverflow.com/questions/71207654/how-can-we-connect-to-remote-spark-cluster-via-jupyterhub)

https://livy.apache.org/get-started/
- It is strongly recommended to configure Spark to submit applications in YARN cluster mode. That makes sure that user sessions have their resources properly accounted for in the YARN cluster, and that the host running the Livy server doesn’t become overloaded when multiple user sessions are running.

https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-livy.html
- Can configure an EMR cluster to include a Livy server.

https://airflow.apache.org/docs/apache-airflow-providers-apache-livy/stable/_api/airflow/providers/apache/livy/operators/livy/index.html
- Airflow Livy provider

- DAG example: 
https://github.com/panovvv/airflow-livy-operators/blob/master/airflow_home/dags/03_batch_example.py
- Spark job for example above:
https://github.com/panovvv/airflow-livy-operators/blob/master/batches/join_2_files.py
- [Is it possible to configure Apache Livy to run with Spark Standalone?](https://stackoverflow.com/questions/40876586/is-it-possible-to-configure-apache-livy-to-run-with-spark-standalone)

## Validation

https://www.precisely.com/blog/data-quality/data-validation-vs-data-verification

Data validation:
- Purpose: check whether data falls within the acceptable range of values
- Usually performed: when data is created or updated
- Example: checking whether a user-entered ZIP code can be found

Data verification:
- Purpose: check data to ensure it’s accurate and consistent
- Usually performed: when data is migrated or merged
- Example: checking that all ZIP codes in a dataset are in ZIP+4 format

Goals: to ensure that data meets requirements

- check not empty
- check count range
- check columns not nulls/nan
- check no dups
- check column ranges
- check integrity (references to other tables)


Implementation:
- start by creating a class in a python file (ETLValidation) that is called from notebooks
- custom plugins in Airflow parametrized with Spark SQL templates
- custom Airflow hooks to create parametrized Spark SQL views and execute queries
- [raise ValueError](https://stackoverflow.com/questions/54593320/validationerror-or-typeerror-valueerror-exceptions) when validation fails
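A minimal sketch of a few of the checks listed above, raising `ValueError` on failure. It works on lists of dicts rather than Spark views, and the function names and sample tables are assumptions, not the project's `ETLValidation` class.

```python
def check_not_empty(rows, table):
    if not rows:
        raise ValueError(f"{table}: table is empty")


def check_no_nulls(rows, column, table):
    if any(row[column] is None for row in rows):
        raise ValueError(f"{table}.{column}: null values found")


def check_no_duplicates(rows, key_columns, table):
    keys = [tuple(row[col] for col in key_columns) for row in rows]
    if len(keys) != len(set(keys)):
        raise ValueError(f"{table}: duplicate keys on {key_columns}")


def check_integrity(rows, column, referenced_keys, table):
    # Every value in the column must exist among the referenced keys.
    orphans = {row[column] for row in rows} - set(referenced_keys)
    if orphans:
        raise ValueError(f"{table}.{column}: unknown references {sorted(orphans)}")


route_dim = [{"route_id": "r1"}, {"route_id": "r2"}]
flight_fact = [
    {"route_id": "r1", "num_visitors": 2},
    {"route_id": "r3", "num_visitors": 1},  # r3 is missing from route_dim
]

check_not_empty(flight_fact, "flight_fact")
check_no_nulls(flight_fact, "route_id", "flight_fact")
check_no_duplicates(route_dim, ["route_id"], "route_dim")
try:
    check_integrity(
        flight_fact, "route_id", [row["route_id"] for row in route_dim], "flight_fact"
    )
    integrity_error = None
except ValueError as error:
    integrity_error = str(error)
```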

TODO:
- draw class diagram of ETLCheck operators, and a diagram of how check jobs are submitted to Spark

## Unit tests

[Airflow Best Practices](https://airflow.apache.org/docs/apache-airflow/stable/best-practices.html)
- You can write unit tests for both your tasks and your DAG.
- tasks must be **idempotent**
    - should this be a data test or a unit test? I think a unit test, given that it must be true of any particular input.


## TODO

- deal with missing data in every dataset
- do something interesting with the data
- deal with nulls in dimension columns
- dimension table partitioning
- split port to airport matching in 2 steps, start with exact matching
- save intermediate tables, like matching scores, for analysis
- use persist where appropriate
- validation
- unit tests
- airflow @daily?
- watch ds100 viz
- consider applying freq. itemsets: passenger in some flight to some state
- integrate airlines.dat
- find opportunities for avoiding code duplication
- error handling for bad data
- deal with warnings in ETL notebooks when saving
- airflow: pass and configure args (like dirs) properly
- add logging; reduce Spark info
- clean i94 address field, dict says it is often invalid
- port name can be matched to airport or city (or just do both at once)
- viz with nice maps
- integrate monthly temperature
- explore using routedat "is_direct" with residence
- explore matching flight_number to routedat
- must I always use explicit schemas for reading and writing?
- airflow unit (integration?) tests
