### Step 4: Run Pipelines to Model the Data

#### Outline
Pipeline:
- Only include counties and states that appear in both the Covid-19 and health data sets
- Use Spark, relying mostly on DataFrames and occasionally on temporary SQL views for convenience
- The end result is a set of Parquet files
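
As a rough illustration of the first bullet, the filter amounts to keeping only keys that occur in both sources. The sketch below uses plain Python lists instead of Spark DataFrames so it runs without a cluster; the sample rows, the column names and the `common_keys` helper are all hypothetical.

```python
# Hypothetical sample rows; the real pipeline reads these from the source files.
covid_rows = [
    {"fips": "01001", "state": "AL", "cases": 10},
    {"fips": "02013", "state": "AK", "cases": 3},
    {"fips": "99999", "state": "ZZ", "cases": 1},  # missing from the health data
]
health_rows = [
    {"fips": "01001", "state": "AL", "population": 50000},
    {"fips": "02013", "state": "AK", "population": 3000},
    {"fips": "06037", "state": "CA", "population": 900000},  # missing from the Covid data
]


def common_keys(left, right, key):
    """Return the set of key values that occur in both row lists."""
    return {row[key] for row in left} & {row[key] for row in right}


# Keep only the counties that are present in both data sets.
shared_fips = common_keys(covid_rows, health_rows, "fips")
covid_filtered = [row for row in covid_rows if row["fips"] in shared_fips]
health_filtered = [row for row in health_rows if row["fips"] in shared_fips]
```

On DataFrames the same effect would typically come from an inner join on the shared key.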

State facts:
- Start from the Covid-19 data; use the state abbreviation as the link to the state dimension and the timestamp as the link to the date dimension
- Calculate aggregate Covid-19 cases and deaths (both total and delta) for each day by grouping the county facts table by state
- Partition by state, then by month
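
The aggregation can be sketched in plain Python as follows. This is only a sketch under two assumptions that the outline does not confirm: the county facts hold cumulative counts, and a state's delta on its first day equals its total. The tuple layout and the output column names are hypothetical.

```python
from collections import defaultdict

# Hypothetical county facts rows: (date, state, fips, cumulative cases, cumulative deaths).
county_facts = [
    ("2020-03-01", "AL", "01001", 1, 0),
    ("2020-03-01", "AL", "01003", 2, 0),
    ("2020-03-02", "AL", "01001", 4, 1),
    ("2020-03-02", "AL", "01003", 2, 0),
    ("2020-03-01", "AK", "02013", 5, 0),
]


def build_state_facts(rows):
    """Aggregate county rows into one row per state and day."""
    # Sum the county values for every (state, date) pair.
    totals = defaultdict(lambda: [0, 0])
    for day, state, _fips, cases, deaths in rows:
        totals[(state, day)][0] += cases
        totals[(state, day)][1] += deaths

    facts = []
    previous = {}  # last seen totals per state, used to derive the deltas
    for state, day in sorted(totals):  # ISO dates sort chronologically as strings
        cases, deaths = totals[(state, day)]
        prev_cases, prev_deaths = previous.get(state, (0, 0))
        facts.append({
            "state": state,              # first partition column
            "month": int(day[5:7]),      # second partition column
            "date": day,
            "cases_total": cases,
            "cases_delta": cases - prev_cases,
            "deaths_total": deaths,
            "deaths_delta": deaths - prev_deaths,
        })
        previous[state] = (cases, deaths)
    return facts


state_facts = build_state_facts(county_facts)
```

With the sample rows, Alabama ends up with 3 cases on the first day and 6 on the second, giving a delta of 3 for the second day.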

#### 4.1 Create the data model
Build the data pipelines to create the data model. I'd break this into two steps:
* Create the parquet files for the database based on the input data
* Load the parquet files into the database

##### Setup
I need Spark for this because I want to use some of its functionality, such as creating temporary SQL views of my DataFrames.

#### 4.2 Data Quality Checks
Explain the data quality checks you'll perform to ensure the pipeline ran as expected. These could include:
 * Integrity constraints on the relational database (e.g. unique key, data type, etc.)
 * Unit tests for the scripts to ensure they are doing the right thing
 * Source/Count checks to ensure completeness

Author's notes:  
Integrity constraints seem somewhat redundant since I define the schema for the Parquet files myself, but I still check them to be safe.

##### Date dimension table
Checks:
* Unique key for datetime
* Columns are integers
* Should have 366 entries (2020 was a leap year)
* No days outside 2020
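
A minimal sketch of these four checks in plain Python, using only the standard library. The row layout and column names (`date`, `year`, `month`, `day`, `weekday`) are assumptions for illustration, not the actual schema.

```python
import calendar
from datetime import date, timedelta


def build_date_rows(year):
    """Build one row per calendar day; the column names are hypothetical."""
    rows = []
    day = date(year, 1, 1)
    while day.year == year:
        rows.append({
            "date": day,
            "year": day.year,
            "month": day.month,
            "day": day.day,
            "weekday": day.isoweekday(),
        })
        day += timedelta(days=1)
    return rows


def check_date_dimension(rows, year=2020):
    """Return a list of failed checks; an empty list means everything passed."""
    failures = []
    keys = [row["date"] for row in rows]
    if len(set(keys)) != len(keys):
        failures.append("duplicate date keys")
    expected = 366 if calendar.isleap(year) else 365
    if len(rows) != expected:
        failures.append(f"expected {expected} rows, found {len(rows)}")
    if any(key.year != year for key in keys):
        failures.append(f"dates outside {year}")
    int_columns = ("year", "month", "day", "weekday")
    if not all(isinstance(row[c], int) for row in rows for c in int_columns):
        failures.append("non-integer column value")
    return failures
```

Returning a list of failures rather than raising on the first one makes it easier to see every problem in a single pipeline run.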

##### State dimension table
Checks:
* Unique key for state abbreviation
* State key and name are strings, population is an integer, all other columns are decimals
* 50 entries
* No null entries except in `air_pollution` and `household_overcrowding`
* No negative numbers in numerics
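
The count, null and sign checks could look roughly like this in plain Python. The nullable column names come from the list above; the sample rows in the usage note and the `check_state_rows` helper are hypothetical.

```python
from decimal import Decimal

# Columns that are allowed to contain nulls, taken from the check list.
NULLABLE = {"air_pollution", "household_overcrowding"}


def check_state_rows(rows, expected_count=50):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if len(rows) != expected_count:
        problems.append(f"expected {expected_count} rows, found {len(rows)}")
    for row in rows:
        for column, value in row.items():
            if value is None:
                if column not in NULLABLE:
                    problems.append(f"{row['state']}: null in {column}")
            elif isinstance(value, (int, float, Decimal)) and value < 0:
                problems.append(f"{row['state']}: negative {column}")
    return problems
```

For example, `check_state_rows([{"state": "AK", "name": "Alaska", "population": -1, "air_pollution": None}], expected_count=1)` reports only the negative population, because a null in `air_pollution` is permitted.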

##### County dimension table
Checks:
* Unique key for FIPS
* State foreign key exists in state dimension table
* County FIPS and name are strings, population is an integer, all other columns are decimals
* No null entries except in `uninsured`, `physicians`, `unemployment`, `air_pollution`, `household_overcrowding`, `residential_segregation` and `rural`
* No negative numbers in numerics
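
The foreign key check is a natural fit for SQL. In the sketch below an in-memory SQLite database stands in for Spark temporary views so that the example runs on its own; the table names `dim_state` and `dim_county`, their columns and the sample rows are hypothetical. The query itself is standard SQL, so it should carry over to a temporary view with little or no change.

```python
import sqlite3

# In-memory SQLite stands in for Spark temporary views (an assumption of this sketch).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_state (state TEXT, name TEXT);
    CREATE TABLE dim_county (fips TEXT, name TEXT, state TEXT);
    INSERT INTO dim_state VALUES ('AL', 'Alabama'), ('AK', 'Alaska');
    INSERT INTO dim_county VALUES
        ('01001', 'Autauga', 'AL'),
        ('02013', 'Aleutians East', 'AK'),
        ('99001', 'Nowhere', 'ZZ');  -- deliberately broken foreign key
""")

# Anti-join: counties whose state has no match in the state dimension.
ORPHAN_COUNTIES = """
    SELECT c.fips
    FROM dim_county AS c
    LEFT JOIN dim_state AS s ON c.state = s.state
    WHERE s.state IS NULL
"""
orphans = [row[0] for row in conn.execute(ORPHAN_COUNTIES)]
```

An empty `orphans` list means every county points at an existing state. Counties with a null state are also reported, because a null never matches in the join.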

##### State facts table
Checks:
* Unique composite keys (date + state)
* Foreign keys exist
* Values are non-negative integers
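
A small sketch of the composite key check, again in plain Python with hypothetical column names. It counts each (date, state) pair and reports the pairs that occur more than once.

```python
from collections import Counter


def duplicate_keys(rows, key_columns):
    """Return the sorted composite keys that appear in more than one row."""
    counts = Counter(tuple(row[c] for c in key_columns) for row in rows)
    return sorted(key for key, n in counts.items() if n > 1)


# Hypothetical state facts rows; the second and third share the same key.
state_fact_rows = [
    {"date": "2020-03-01", "state": "AK", "cases_total": 5},
    {"date": "2020-03-01", "state": "AL", "cases_total": 3},
    {"date": "2020-03-01", "state": "AL", "cases_total": 3},
    {"date": "2020-03-02", "state": "AL", "cases_total": 6},
]

duplicates = duplicate_keys(state_fact_rows, ("date", "state"))
```

The same helper works for the county facts table by passing the county key instead of the state key.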

##### County facts table
Checks:
* Unique composite keys (date + county)
* Foreign keys exist
* Covid values are integers, weather values are decimals
* All numeric values are non-negative
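
The type check could be sketched as a comparison of each value against an expected Python type. The column names (`cases`, `deaths`, `precipitation`) are hypothetical stand-ins for the real Covid and weather columns, and nulls are skipped here on the assumption that they are handled by a separate check.

```python
from decimal import Decimal

# Hypothetical column names; the real ones come from the county facts schema.
EXPECTED_TYPES = {"cases": int, "deaths": int, "precipitation": Decimal}


def type_violations(rows, expected_types):
    """Return (row index, column, actual type name) for every mistyped value."""
    violations = []
    for index, row in enumerate(rows):
        for column, expected in expected_types.items():
            value = row.get(column)
            # Nulls are skipped; they are assumed to be covered elsewhere.
            if value is not None and not isinstance(value, expected):
                violations.append((index, column, type(value).__name__))
    return violations
```

For instance, a row with `cases` stored as `2.0` is reported as `(index, "cases", "float")`.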