# Step 3: Define the Data Model

- The system will be used to perform **ad-hoc flight analytics**.
- The data model will be organized as a **star schema**.
- The goal is to **trade storage space and consistency checks for speed**.
- The key is for queries to perform **as few joins as possible**, since joins are expensive and cannot be avoided entirely.
- For this reason, dimension schemas will contain all columns relevant to the dimension at hand, even if this **information is repeated** in other dimensions, rather than normalizing.
- For [**rapidly changing dimensions**](https://web.stanford.edu/class/cs345/slides/Lecture5.ppt) (RCDs), such as ```passenger_dim``` (because of the temperature of the US address state), one option would be to create a mini-dimension for the state temperatures. Since that is an optimization to prevent dimension row explosion, I won't do it for the time being. The same could apply to the state temperature in ```flight_dim```. Avoiding too many optimizations is wise given that this database is meant for early-stage ad-hoc analysis.

![Star schema](de-capstone-star-schema.png)

[Nulls in Fact Tables](https://www.kimballgroup.com/data-warehouse-business-intelligence-resources/kimball-techniques/dimensional-modeling-techniques/fact-table-null/)

> Null-valued measurements behave gracefully in fact tables. The aggregate functions (SUM, COUNT, MIN, MAX, and AVG) all do the “right thing” with null facts. However, nulls must be avoided in the fact table’s foreign keys because these nulls would automatically cause a referential integrity violation. Rather than a null foreign key, the associated dimension table must have a default row (and surrogate key) representing the unknown or not applicable condition.

[Null Attributes in Dimensions](https://www.kimballgroup.com/data-warehouse-business-intelligence-resources/kimball-techniques/dimensional-modeling-techniques/null-dimension-attribute/)
> Null-valued dimension attributes result when a given dimension row has not been fully populated, or when there are attributes that are not applicable to all the dimension’s rows. In both cases, we recommend substituting a descriptive string, such as Unknown or Not Applicable in place of the null value. Nulls in dimension attributes should be avoided because different databases handle grouping and constraining on nulls inconsistently.
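
The two recommendations above can be sketched in a few lines. This is a minimal, dependency-free illustration rather than the pipeline's actual code: the surrogate key `0` for the default row, the helper names, and the sample rows are all assumptions made for the example.

```python
UNKNOWN_KEY = 0  # hypothetical surrogate key reserved for the default row


def add_default_row(dim, attributes, placeholder="Unknown"):
    """Return a copy of the dimension with a default row under UNKNOWN_KEY."""
    with_default = dict(dim)
    with_default[UNKNOWN_KEY] = {attr: placeholder for attr in attributes}
    return with_default


def fill_null_attributes(dim, placeholder="Unknown"):
    """Replace null dimension attributes with a descriptive string."""
    return {
        key: {attr: placeholder if value is None else value
              for attr, value in row.items()}
        for key, row in dim.items()
    }


def fill_null_foreign_keys(fact_rows, fk_columns):
    """Point null foreign keys at the default row; null measures are left alone."""
    cleaned = []
    for row in fact_rows:
        fixed = dict(row)
        for col in fk_columns:
            if fixed[col] is None:
                fixed[col] = UNKNOWN_KEY
        cleaned.append(fixed)
    return cleaned


# Made-up sample data, only to exercise the helpers.
passenger_dim = {1: {"citizenship": "Mexico", "residence": "Mexico", "usaddr": None}}
facts = [
    {"arrdate": 20545, "flight": 1, "passenger": None, "passnum": 2},
    {"arrdate": 20545, "flight": 1, "passenger": 1, "passnum": None},
]

passenger_dim = fill_null_attributes(
    add_default_row(passenger_dim, ["citizenship", "residence", "usaddr"])
)
facts = fill_null_foreign_keys(facts, ["flight", "passenger"])
```

After this step every foreign key in `facts` resolves to a dimension row, while the null `passnum` measure is kept so that aggregates can skip it.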

**flight_fact**(arrdate, flight, passenger, passnum)
- passnum is the measure
- decided to create a separate passenger_dim rather than keeping the passenger attributes in the fact table

**time_dim**((year,month), year, month, day, monthnm)

- candidate columns: is weekend, is holiday in US, etc
- how about "is holiday at src country"?
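
As a rough sketch of how the time attributes could be derived, the function below assumes that `arrdate` is a SAS date value (a count of days since 1960-01-01), which seems likely for a SAS source but is not confirmed here. The function name and the `is_weekend` column name are illustrative only.

```python
from datetime import date, timedelta

SAS_EPOCH = date(1960, 1, 1)  # SAS date values count days from this date

MONTH_NAMES = (
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
)


def time_dim_row(arrdate):
    """Derive time_dim attributes from an arrival date (assumed SAS date value)."""
    d = SAS_EPOCH + timedelta(days=int(arrdate))
    return {
        "year": d.year,
        "month": d.month,
        "day": d.day,
        "monthnm": MONTH_NAMES[d.month - 1],
        "is_weekend": d.weekday() >= 5,  # Monday is 0, so 5 and 6 are Sat/Sun
    }


row = time_dim_row(20545)  # corresponds to 2016-04-01 under the SAS assumption
```

Holiday flags would need an external calendar, so they are left out of the sketch.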

**flight_dim**((airline,fltno,dstport), airline, fltno, dstport, dstairportnum, airlinenm, (more airline info), dstcity, dststate, dststatenm, dstcoord, dstpop, dststatetemp)

- with the given data sources, it is not always possible to identify the exact airport
- it might be possible for cities that have only one international airport
- but I won't attempt this for the time being
- another possibility would be a junk dimension with all airports for a given port, with dstport as its pk; don't do this for the time being
- however, a temporary compromise would be to add a column with the number of international airports in the city (dstairportnum)

**passenger_dim**((citizenship, residence, usaddr, year, month), citizenship, residence, usaddr, year, month, addrtemp, ... more passenger categories, e.g. age, gender, visa, etc)
- represents passenger category combinations rather than individuals
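
To make the "category combinations" idea concrete, here is a minimal sketch that splits per-arrival rows into a `passenger_dim` keyed by the combination and a `flight_fact` whose `passnum` counts arrivals per combination. The function name and the input row layout are assumptions for illustration; the real pipeline would do this in Spark.

```python
from collections import Counter

PASSENGER_KEY = ("citizenship", "residence", "usaddr", "year", "month")


def split_fact_and_dim(arrivals):
    """Group individual arrivals into category combinations and count them."""
    passenger_dim = {}
    passnum = Counter()
    for row in arrivals:
        key = tuple(row[col] for col in PASSENGER_KEY)
        # One dimension row per distinct combination, not per individual.
        passenger_dim.setdefault(key, dict(zip(PASSENGER_KEY, key)))
        passnum[(row["arrdate"], row["flight"], key)] += 1
    flight_fact = [
        {"arrdate": arrdate, "flight": flight, "passenger": key, "passnum": count}
        for (arrdate, flight, key), count in passnum.items()
    ]
    return flight_fact, passenger_dim


# Made-up arrivals: two passengers share a category, one differs.
arrivals = [
    {"arrdate": 20545, "flight": "AA|100|NYC", "citizenship": 582,
     "residence": 582, "usaddr": "NY", "year": 2016, "month": 4},
    {"arrdate": 20545, "flight": "AA|100|NYC", "citizenship": 582,
     "residence": 582, "usaddr": "NY", "year": 2016, "month": 4},
    {"arrdate": 20545, "flight": "AA|100|NYC", "citizenship": 209,
     "residence": 209, "usaddr": "CA", "year": 2016, "month": 4},
]
flight_fact, passenger_dim = split_fact_and_dim(arrivals)
```

Using the combination itself as the key keeps the sketch short; a surrogate integer key could replace it later.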

## ETL

Do the ETL in a Spark cluster on EMR
- use the cluster's HDFS for storing intermediate results, so that the different steps can run efficiently as separate Airflow tasks.
    - but intermediate results could simply be stored in S3, since the data should be much smaller after the first step; do it this way in the first pass for simplicity
    - remember to clean up at the end of the pipeline
    - let Spark take care of partitioning

### Extract

#### Sources
- immigration monthly SAS
- airports
- state temperatures
- demographics
- i94 data dictionary
- airline.dat

### Transform

##### immigration monthly SAS into flight_fact

want ```flight_fact(arrdate, (airline,fltno,port), citizenship, residence, address, passenger_num)```

    - filter:
        - keep only arrivals by Air
    - drop all except for:
        - arrdate, airline, fltno, i94cit, i94res, i94port, i94addr
    - decode:
        - i94port, i94addr, airline, fltno
    - rename and cast:
        - arrdate as int
        - concat(airline,fltno,port) as string   # in a first pass, later maybe md5 % maxint
        - i94cit as citizenship int
        - i94res as residence int
        - i94addr as address int
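
The steps above can be sketched on plain dictionaries as follows. Several details are assumptions rather than facts from these notes: the `i94mode` column and its code `1` for air arrivals, the `|` separator in the flight key (added so that concatenated values cannot collide), and the helper names. Decoding and the final count into passenger_num are left out.

```python
import hashlib

AIR = 1  # assumed value of the (assumed) i94mode column for arrivals by air


def flight_key(airline, fltno, port):
    """First-pass string key: concat(airline, fltno, port) with a separator."""
    return f"{airline}|{fltno}|{port}"


def hashed_flight_key(airline, fltno, port, max_int=2**31 - 1):
    """Possible later refinement: md5 of the string key, modulo max_int."""
    digest = hashlib.md5(flight_key(airline, fltno, port).encode("utf-8"))
    return int(digest.hexdigest(), 16) % max_int


def to_fact_rows(raw_rows):
    """Filter to air arrivals, keep the needed columns, rename and cast."""
    for row in raw_rows:
        if row.get("i94mode") != AIR:
            continue
        yield {
            "arrdate": int(row["arrdate"]),
            "flight": flight_key(row["airline"], row["fltno"], row["i94port"]),
            "citizenship": int(row["i94cit"]),
            "residence": int(row["i94res"]),
            "address": row["i94addr"],  # passed through; decoding not shown
        }


# Made-up raw rows resembling the SAS extract.
raw = [
    {"i94mode": 1.0, "arrdate": 20545.0, "airline": "AA", "fltno": "100",
     "i94port": "NYC", "i94cit": 582.0, "i94res": 582.0, "i94addr": "NY"},
    {"i94mode": 2.0, "arrdate": 20545.0, "airline": None, "fltno": None,
     "i94port": "MIA", "i94cit": 209.0, "i94res": 209.0, "i94addr": "FL"},
]
fact_rows = list(to_fact_rows(raw))
```

In Spark the same logic would map to a `filter`, a `select` with casts, and a `concat_ws` for the key.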
        
##### airports

want airports_ref(state, city, coordinates)

    - filter:
        - don't filter by iso_country, so that territories such as Guam are kept
    - transform:
        - iso_region into state
        - municipality into city
    - drop all except for:
        - state, city, coordinates
    - group by:
        - state, city and just choose any one coordinate
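
A minimal sketch of this transform, assuming that `iso_region` looks like `US-CA` (country code, hyphen, region code) and that rows are plain dictionaries. The function name is illustrative, and "any one coordinate" is implemented as "the first one seen".

```python
def airports_ref(airports):
    """Reduce airport rows to one (state, city, coordinates) row per city."""
    ref = {}
    for airport in airports:
        region = airport["iso_region"]
        # Assumed format "US-CA": keep the part after the first hyphen.
        state = region.split("-", 1)[1] if "-" in region else region
        city = airport["municipality"]
        # setdefault keeps the first coordinate seen for each (state, city).
        ref.setdefault((state, city), airport["coordinates"])
    return [
        {"state": state, "city": city, "coordinates": coords}
        for (state, city), coords in ref.items()
    ]


# Made-up rows; the coordinate strings are placeholders, not real values.
airports = [
    {"iso_region": "US-CA", "municipality": "Los Angeles", "coordinates": "a, b"},
    {"iso_region": "US-CA", "municipality": "Los Angeles", "coordinates": "c, d"},
    {"iso_region": "US-TX", "municipality": "Austin", "coordinates": "e, f"},
]
ref_rows = airports_ref(airports)
```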

        
##### state temperatures

want temperatures_ref(state, temperature)

Questions: 
- how to handle missing data (i.e. 2016)?
    - try to find more recent data source or update this one
    - predict with ML
    - just impute latest for same month
- seems that some ports are not in US states, try to join those too

    - filter:
        - as above, keep all countries around just in case
        - keep only latest year
    - rename:
        - AverageTemperature to temperature decimal(3,1)
        - State to state string
    - drop all except for:
        - state, temperature
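
One way to read "keep only latest year" together with the "impute latest for same month" option is to keep, per state and calendar month, the most recent non-null reading. The sketch below does that on plain dictionaries. It assumes a `dt` column holding ISO dates such as `2013-04-01`, which is not named in these notes, and it adds a month to the key so that a 2016 arrival can look up the latest value for its month.

```python
def latest_temperature_by_month(rows):
    """Map (state, month) to the most recent non-null average temperature."""
    latest = {}
    for row in rows:
        if row["AverageTemperature"] is None:
            continue  # skip missing readings so an older year can fill in
        year, month = int(row["dt"][:4]), int(row["dt"][5:7])
        key = (row["State"], month)
        if key not in latest or year > latest[key][0]:
            # One decimal place, in the spirit of decimal(3,1).
            latest[key] = (year, round(float(row["AverageTemperature"]), 1))
    return {key: temperature for key, (_, temperature) in latest.items()}


# Made-up readings, only to exercise the function.
readings = [
    {"dt": "2012-04-01", "State": "Texas", "AverageTemperature": 19.0},
    {"dt": "2013-04-01", "State": "Texas", "AverageTemperature": 20.14},
    {"dt": "2012-05-01", "State": "Texas", "AverageTemperature": 24.0},
    {"dt": "2013-05-01", "State": "Texas", "AverageTemperature": None},
]
temperatures_ref = latest_temperature_by_month(readings)
```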

##### demographics

Questions:
- how to handle missing years?
    - in a first pass, treat it as static (with SCD 0 or 1)
- in a first pass, keep just total population

##### i94 data dictionary
- create a country data frame
    
##### airline.dat
- create an airline data frame

#### Dimensions

- country_dim

#### Facts
- from immigration data frame:
    - i94cit, i94res, i94

### Load

- In the cloud
    - Prefer a Data Lake instead of Data Warehouse
    - The reason is that it is still unclear how frequently this data will be analysed, so provisioning, say, a Redshift cluster could be overkill: there is no need to pay for dedicated compute resources and database-managed storage.
    - For this reason, blob storage would be preferable
    - This way analysts can use any tool they want (Pandas, Databricks, a Spark cluster or standalone Spark, Athena, viz tools, etc)
    - If there is high demand, might consider Redshift later on
    - There is no need to stick with Spark, given that data volume is not that high
    
Save everything in **parquet**, because it is a columnar format that stores the schema with the data and can be split into partitioned files

- use fastparquet, pyarrow gives error
    - index=False
    - the book recommends gzip compression, but start by using the default snappy
    - can go into more detail with regards to partitioning for big data case in step 5

- flight_fact:
    - flight_fact.parquet
    - partition_cols=[year, month] #, md5(airline,fltno,port)]
    - in a first pass partition just on (year, month)
- country_dim:
    - country.parquet
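
To illustrate what partitioning on (year, month) means on disk, the dependency-free sketch below groups rows under Hive-style `column=value` directories, which is the layout partitioned parquet datasets generally use. The function name is an assumption; in pandas the equivalent step would be a `DataFrame.to_parquet` call with the `engine`, `index`, and `partition_cols` arguments mentioned above.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def partition_rows(rows, root="flight_fact.parquet", partition_cols=("year", "month")):
    """Group rows by Hive-style partition directory (e.g. .../year=2016/month=4)."""
    partitions = defaultdict(list)
    for row in rows:
        subdir = PurePosixPath(root, *(f"{col}={row[col]}" for col in partition_cols))
        # Partition values live in the path, so they are dropped from the payload.
        payload = {k: v for k, v in row.items() if k not in partition_cols}
        partitions[str(subdir)].append(payload)
    return dict(partitions)


# Made-up fact rows.
rows = [
    {"year": 2016, "month": 4, "flight": "AA|100|NYC", "passnum": 2},
    {"year": 2016, "month": 4, "flight": "DL|200|ATL", "passnum": 1},
    {"year": 2016, "month": 5, "flight": "AA|100|NYC", "passnum": 3},
]
layout = partition_rows(rows)
```

A query restricted to one month then only needs to read one directory, which is the main reason to partition the fact table this way.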

## Eating my own dog food

Perform some analysis using my schema
- opportunity to show off my pandas and plotting
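
As a hint of the kind of ad-hoc question the schema is meant to answer with a single join, the sketch below totals `passnum` by one attribute of one dimension. It uses plain dictionaries instead of pandas to stay dependency-free, and the function name, keys, and sample rows are made up for the example.

```python
from collections import defaultdict


def passengers_by(fact_rows, dim, fk, attribute):
    """Sum passnum grouped by a dimension attribute, using one fact-to-dim lookup."""
    totals = defaultdict(int)
    for row in fact_rows:
        if row["passnum"] is None:
            continue  # null measures are skipped, as SUM would do
        totals[dim[row[fk]][attribute]] += row["passnum"]
    return dict(totals)


# Made-up star-schema fragments.
flight_dim = {
    "AA|100|NYC": {"airline": "AA", "dststate": "NY"},
    "DL|200|ATL": {"airline": "DL", "dststate": "GA"},
    "UA|300|NYC": {"airline": "UA", "dststate": "NY"},
}
flight_fact = [
    {"flight": "AA|100|NYC", "passnum": 2},
    {"flight": "DL|200|ATL", "passnum": 1},
    {"flight": "UA|300|NYC", "passnum": None},
    {"flight": "UA|300|NYC", "passnum": 4},
]
by_state = passengers_by(flight_fact, flight_dim, "flight", "dststate")
```

With pandas the same question would be a `merge` followed by a `groupby(...).sum()`, ready for plotting.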