#### 5. Data Quality Checks
Explain the data quality checks you'll perform to ensure the pipeline ran as expected. These could include:
 * Integrity constraints on the relational database (e.g. unique key, data type, etc.)
 * Source/Count checks to ensure completeness

In [1]:
from setup import create_spark_session

spark = create_spark_session()

In [16]:
import pandas as pd

from pyspark.sql.functions import isnan, when, count, col

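# Import only the loaders used in this section instead of a wildcard import
from etl import (
    load_time_dimension_table,
    load_county_dimension_table,
    load_state_dimension_table,
    load_county_facts_table,
    load_state_facts_table,
)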

##### Time dimension table
Checks:
* Check count
* Unique key for timestamp
* Columns are integers
* No null entries

But first, let's load our data:

In [10]:
time_df = load_time_dimension_table(spark)

Data existence check:

In [8]:
time_df.show()

+----------+---+----+----+-------+-----+
| timestamp|day|week|year|weekday|month|
+----------+---+----+----+-------+-----+
|1593561600|  1|  27|2020|      3|    7|
|1593648000|  2|  27|2020|      4|    7|
|1593734400|  3|  27|2020|      5|    7|
|1593820800|  4|  27|2020|      6|    7|
|1593907200|  5|  27|2020|      7|    7|
|1593993600|  6|  28|2020|      1|    7|
|1594080000|  7|  28|2020|      2|    7|
|1594166400|  8|  28|2020|      3|    7|
|1594252800|  9|  28|2020|      4|    7|
|1594339200| 10|  28|2020|      5|    7|
|1594425600| 11|  28|2020|      6|    7|
|1594512000| 12|  28|2020|      7|    7|
|1594598400| 13|  29|2020|      1|    7|
|1594684800| 14|  29|2020|      2|    7|
|1594771200| 15|  29|2020|      3|    7|
|1594857600| 16|  29|2020|      4|    7|
|1594944000| 17|  29|2020|      5|    7|
|1595030400| 18|  29|2020|      6|    7|
|1595116800| 19|  29|2020|      7|    7|
|1595203200| 20|  30|2020|      1|    7|
+----------+---+----+----+-------+-----+
only showing top 20 rows
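
Beyond checking that rows exist, the calendar fields can be spot-checked against the timestamp they were derived from. The sketch below recomputes them with Python's standard library. It assumes the timestamps are Unix seconds in UTC and that week and weekday follow the ISO calendar (Monday = 1); `expected_time_row` is a hypothetical helper, not part of the `etl` module.

```python
from datetime import datetime, timezone


def expected_time_row(timestamp):
    """Recompute the calendar fields for a Unix timestamp (seconds, UTC)."""
    dt = datetime.fromtimestamp(timestamp, tz=timezone.utc)
    iso = dt.isocalendar()  # (ISO year, ISO week, ISO weekday)
    return {
        "timestamp": timestamp,
        "day": dt.day,
        "week": iso[1],     # assumption: ISO week number
        "year": dt.year,
        "weekday": iso[2],  # assumption: Monday = 1 ... Sunday = 7
        "month": dt.month,
    }


# Recompute the fields for the first timestamp shown in the output above
print(expected_time_row(1593561600))
```

Comparing a handful of recomputed rows against the table is a cheap way to catch time zone or week-numbering mistakes.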

Duplicate key check (if the row count is unchanged after dropDuplicates, the key has no duplicates):

In [4]:
time_df.count()

403

In [5]:
time_df.dropDuplicates(['timestamp']).count()

403
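
Since the same count comparison is repeated for every table, it could be wrapped in a small helper that fails loudly instead of relying on a visual comparison. This is a minimal sketch: `assert_unique_key` is a hypothetical name, and it only relies on the `count()` and `dropDuplicates()` methods already used above.

```python
def assert_unique_key(df, key_cols):
    """Raise if `key_cols` do not uniquely identify the rows of `df`.

    `df` is expected to be a PySpark DataFrame; only `count()` and
    `dropDuplicates()` are called on it.
    """
    total = df.count()
    distinct = df.dropDuplicates(key_cols).count()
    if total != distinct:
        raise ValueError(
            f"{total - distinct} duplicate row(s) for key {key_cols}"
        )
    return total


# Usage in the notebook (not executed here):
# assert_unique_key(time_df, ["timestamp"])
```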

Data type check:

In [11]:
time_df.printSchema()

root
 |-- timestamp: long (nullable = true)
 |-- day: integer (nullable = true)
 |-- week: integer (nullable = true)
 |-- year: integer (nullable = true)
 |-- weekday: integer (nullable = true)
 |-- month: integer (nullable = true)



Null checks:

In [17]:
time_df.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in time_df.columns]).show()

+---------+---+----+----+-------+-----+
|timestamp|day|week|year|weekday|month|
+---------+---+----+----+-------+-----+
|        0|  0|   0|   0|      0|    0|
+---------+---+----+----+-------+-----+



##### County dimension table
Checks:
* Check count
* Unique key for FIPS
* Data types: County FIPS and name string, population integer, rest decimal
* No null entries apart from in uninsured, physicians, unemployment, air_pollution, household_overcrowding, residential_segregation and rural
* No negative numbers in numerics
* State foreign key exists in state dimension table

Load data:

In [9]:
county_dim_df = load_county_dimension_table(spark)

Existence check:

In [12]:
county_dim_df.limit(5).show()

+-----+-----------+-----------+------------+----------+------------+------------+-------+-------------------+------------------+------------+-----------+------------+-------------+----------------+----------------------+---------------+-----------------------+---------------+------------+--------+------------------+-----+
| fips|county_name|   latitude|   longitude|population| poor_health|     smokers|obesity|physical_inactivity|excessive_drinking|   uninsured| physicians|unemployment|air_pollution|housing_problems|household_overcrowding|food_insecurity|residential_segregation|over_sixtyfives|       rural|    area|population_density|state|
+-----+-----------+-----------+------------+----------+------------+------------+-------+-------------------+------------------+------------+-----------+------------+-------------+----------------+----------------------+---------------+-----------------------+---------------+------------+--------+------------------+-----+
|48001|   Anderson|31.815347

Unique key check:

In [13]:
county_dim_df.count()

3140

In [14]:
county_dim_df.dropDuplicates(['fips']).count()

3140

Data type check:

In [15]:
county_dim_df.printSchema()

root
 |-- fips: integer (nullable = true)
 |-- county_name: string (nullable = true)
 |-- latitude: double (nullable = true)
 |-- longitude: double (nullable = true)
 |-- population: integer (nullable = true)
 |-- poor_health: double (nullable = true)
 |-- smokers: double (nullable = true)
 |-- obesity: double (nullable = true)
 |-- physical_inactivity: double (nullable = true)
 |-- excessive_drinking: double (nullable = true)
 |-- uninsured: double (nullable = true)
 |-- physicians: double (nullable = true)
 |-- unemployment: double (nullable = true)
 |-- air_pollution: double (nullable = true)
 |-- housing_problems: double (nullable = true)
 |-- household_overcrowding: double (nullable = true)
 |-- food_insecurity: double (nullable = true)
 |-- residential_segregation: double (nullable = true)
 |-- over_sixtyfives: double (nullable = true)
 |-- rural: double (nullable = true)
 |-- area: double (nullable = true)
 |-- population_density: double (nullable = true)
 |-- state: string (nul
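
Reading a printed schema by eye makes it easy to miss a mismatch. For instance, the schema above reports fips as an integer, while the checklist expects a string. A possible way to make the comparison explicit is sketched below. It assumes the input has the shape of PySpark's `DataFrame.dtypes` (a list of `(column, type)` tuples using short type names such as `int`, `bigint`, `double`); `schema_mismatches` and the expected mapping are hypothetical.

```python
def schema_mismatches(actual_dtypes, expected):
    """Compare (column, type) pairs against an expected mapping.

    Returns {column: (expected type, actual type)} for every difference.
    A column missing from `actual_dtypes` is reported with actual type None.
    """
    actual = dict(actual_dtypes)
    return {
        name: (want, actual.get(name))
        for name, want in expected.items()
        if actual.get(name) != want
    }


# Hypothetical expectation taken from the checklist: FIPS as a string.
expected_county_types = {
    "fips": "string",
    "county_name": "string",
    "population": "int",
}

# In the notebook this would be `county_dim_df.dtypes`; a literal stands in here.
sample_dtypes = [("fips", "int"), ("county_name", "string"), ("population", "int")]
print(schema_mismatches(sample_dtypes, expected_county_types))
```

An empty result means the table matches the expectation; anything else names the column and both types.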

Null checks:

In [18]:
county_dim_df.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in county_dim_df.columns]).show()

+----+-----------+--------+---------+----------+-----------+-------+-------+-------------------+------------------+---------+----------+------------+-------------+----------------+----------------------+---------------+-----------------------+---------------+-----+----+------------------+-----+
|fips|county_name|latitude|longitude|population|poor_health|smokers|obesity|physical_inactivity|excessive_drinking|uninsured|physicians|unemployment|air_pollution|housing_problems|household_overcrowding|food_insecurity|residential_segregation|over_sixtyfives|rural|area|population_density|state|
+----+-----------+--------+---------+----------+-----------+-------+-------+-------------------+------------------+---------+----------+------------+-------------+----------------+----------------------+---------------+-----------------------+---------------+-----+----+------------------+-----+
|   0|          0|       0|        0|         0|          0|      0|      0|                  0|                

The null values appear only in the columns where we expected them, which is by design.
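
The comparison against the list of allowed columns can also be automated. The sketch below assumes the one-row result of the null-count query has been turned into a dictionary (for example with `first().asDict()`); `unexpected_null_columns` is a hypothetical helper, and the example counts are illustrative rather than taken from the real table.

```python
def unexpected_null_columns(null_counts, allowed):
    """Return {column: null count} for columns with nulls outside `allowed`."""
    return {
        name: n
        for name, n in null_counts.items()
        if n > 0 and name not in allowed
    }


# Columns that the checklist above allows to contain nulls.
ALLOWED_NULL_COLUMNS = {
    "uninsured", "physicians", "unemployment", "air_pollution",
    "household_overcrowding", "residential_segregation", "rural",
}

# Illustrative counts only, not the real output of the query above.
example_counts = {"fips": 0, "uninsured": 1, "smokers": 2}
print(unexpected_null_columns(example_counts, ALLOWED_NULL_COLUMNS))
```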

Check for negative values in numeric columns:

In [20]:
county_dim_df.select([count(when(col(c).cast("float").isNotNull() & (col(c) < 0), c)).alias(c) for c in county_dim_df.columns]).show()

+----+-----------+--------+---------+----------+-----------+-------+-------+-------------------+------------------+---------+----------+------------+-------------+----------------+----------------------+---------------+-----------------------+---------------+-----+----+------------------+-----+
|fips|county_name|latitude|longitude|population|poor_health|smokers|obesity|physical_inactivity|excessive_drinking|uninsured|physicians|unemployment|air_pollution|housing_problems|household_overcrowding|food_insecurity|residential_segregation|over_sixtyfives|rural|area|population_density|state|
+----+-----------+--------+---------+----------+-----------+-------+-------+-------------------+------------------+---------+----------+------------+-------------+----------------+----------------------+---------------+-----------------------+---------------+-----+----+------------------+-----+
|   0|          0|       0|     3140|         0|          0|      0|      0|                  0|                

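The only numeric column with negative values is longitude, where all 3140 rows are negative. This is expected, since negative longitudes simply denote locations west of the prime meridian, so longitude should not really be subject to this check.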
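
One way to keep coordinates out of the non-negativity check is to build the list of columns from the schema and skip those that may legitimately be negative. This is a sketch under the assumption that the input looks like PySpark's `DataFrame.dtypes`; `columns_to_check` and `NUMERIC_TYPES` are hypothetical names.

```python
NUMERIC_TYPES = {"tinyint", "smallint", "int", "bigint", "float", "double"}


def columns_to_check(dtypes, allow_negative=()):
    """Pick the numeric columns that are expected to be non-negative."""
    skip = set(allow_negative)
    return [
        name
        for name, type_name in dtypes
        if (type_name in NUMERIC_TYPES or type_name.startswith("decimal"))
        and name not in skip
    ]


# In the notebook this would be `county_dim_df.dtypes`; a literal stands in here.
sample_dtypes = [
    ("fips", "int"), ("county_name", "string"),
    ("latitude", "double"), ("longitude", "double"), ("population", "int"),
]
print(columns_to_check(sample_dtypes, allow_negative=["latitude", "longitude"]))
```

The resulting list could replace `county_dim_df.columns` in the comprehension used for the check.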

##### State dimension table
Checks:
* Existence and count
* Unique key for state name
* Data types: State key and name string, population integer, rest decimal
* No null entries apart from in air_pollution and household_overcrowding
* No negative numbers in numerics

In [24]:
state_dim_df = load_state_dimension_table(spark)

Existence:

In [25]:
state_dim_df.limit(5).show()

+----------+------------+----------+------------+------------+-------+-------------------+------------------+------------+-----------+------------+-------------+----------------+----------------------+---------------+-----------------------+---------------+------------+------------------+------------------+
|     state|abbreviation|population| poor_health|     smokers|obesity|physical_inactivity|excessive_drinking|   uninsured| physicians|unemployment|air_pollution|housing_problems|household_overcrowding|food_insecurity|residential_segregation|over_sixtyfives|       rural|              area|population_density|
+----------+------------+----------+------------+------------+-------+-------------------+------------------+------------+-----------+------------+-------------+----------------+----------------------+---------------+-----------------------+---------------+------------+------------------+------------------+
|   Alabama|          AL|   4887871|0.2202870285|0.2092735311|  0.355|   

Duplicates:

In [26]:
state_dim_df.count()

51

In [27]:
state_dim_df.dropDuplicates(['state']).count()

51

Data types:

In [28]:
state_dim_df.printSchema()

root
 |-- state: string (nullable = true)
 |-- abbreviation: string (nullable = true)
 |-- population: integer (nullable = true)
 |-- poor_health: double (nullable = true)
 |-- smokers: double (nullable = true)
 |-- obesity: double (nullable = true)
 |-- physical_inactivity: double (nullable = true)
 |-- excessive_drinking: double (nullable = true)
 |-- uninsured: double (nullable = true)
 |-- physicians: double (nullable = true)
 |-- unemployment: double (nullable = true)
 |-- air_pollution: double (nullable = true)
 |-- housing_problems: double (nullable = true)
 |-- household_overcrowding: double (nullable = true)
 |-- food_insecurity: double (nullable = true)
 |-- residential_segregation: double (nullable = true)
 |-- over_sixtyfives: double (nullable = true)
 |-- rural: double (nullable = true)
 |-- area: double (nullable = true)
 |-- population_density: double (nullable = true)



Null entries:

In [29]:
state_dim_df.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in state_dim_df.columns]).show()

+-----+------------+----------+-----------+-------+-------+-------------------+------------------+---------+----------+------------+-------------+----------------+----------------------+---------------+-----------------------+---------------+-----+----+------------------+
|state|abbreviation|population|poor_health|smokers|obesity|physical_inactivity|excessive_drinking|uninsured|physicians|unemployment|air_pollution|housing_problems|household_overcrowding|food_insecurity|residential_segregation|over_sixtyfives|rural|area|population_density|
+-----+------------+----------+-----------+-------+-------+-------------------+------------------+---------+----------+------------+-------------+----------------+----------------------+---------------+-----------------------+---------------+-----+----+------------------+
|    0|           0|         0|          0|      0|      0|                  0|                 0|        0|         0|           0|            2|               0|                  

The only null entries are in the expected columns.

Non-negative numbers in numeric columns:

In [30]:
state_dim_df.select([count(when(col(c).cast("float").isNotNull() & (col(c) < 0), c)).alias(c) for c in state_dim_df.columns]).show()

+-----+------------+----------+-----------+-------+-------+-------------------+------------------+---------+----------+------------+-------------+----------------+----------------------+---------------+-----------------------+---------------+-----+----+------------------+
|state|abbreviation|population|poor_health|smokers|obesity|physical_inactivity|excessive_drinking|uninsured|physicians|unemployment|air_pollution|housing_problems|household_overcrowding|food_insecurity|residential_segregation|over_sixtyfives|rural|area|population_density|
+-----+------------+----------+-----------+-------+-------+-------------------+------------------+---------+----------+------------+-------------+----------------+----------------------+---------------+-----------------------+---------------+-----+----+------------------+
|    0|           0|         0|          0|      0|      0|                  0|                 0|        0|         0|           0|            0|               0|                  

Foreign key existence between county and state dimension tables:

In [32]:
combined_df = county_dim_df.join(state_dim_df, on=["state"], how="inner").select(["county_name", "state", "abbreviation"])
combined_df.limit(5).show()

+-----------+-----+------------+
|county_name|state|abbreviation|
+-----------+-----+------------+
|   Anderson|Texas|          TX|
|    Andrews|Texas|          TX|
|   Angelina|Texas|          TX|
|    Aransas|Texas|          TX|
|     Archer|Texas|          TX|
+-----------+-----+------------+



In [33]:
combined_df.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in ['abbreviation']]).show()

+------------+
|abbreviation|
+------------+
|           0|
+------------+



In [34]:
combined_df.count()

3140

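The join is successful: the inner join returns all 3140 county rows, so every county's state exists in the state dimension table, and the joined abbreviation column contains no null values.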
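
Comparing row counts shows whether any rows were lost, but not which keys are missing. A small sketch for listing them is shown below. It assumes the distinct key values have been collected to the driver, which is reasonable for a few thousand counties and a few dozen states; `orphan_keys` is a hypothetical helper and the sample values are illustrative. For larger tables, Spark's `left_anti` join type is an alternative that keeps the comparison distributed.

```python
def orphan_keys(child_keys, parent_keys):
    """Return the sorted distinct child keys that have no match in the parent."""
    return sorted(set(child_keys) - set(parent_keys))


# In the notebook the key lists could be collected like this (not executed here):
#   [row["state"] for row in county_dim_df.select("state").distinct().collect()]
# Illustrative values only:
print(orphan_keys(["Texas", "Ohio", "Texas", "Unknown"], ["Ohio", "Texas"]))
```

An empty list corresponds to the result found above: no county refers to a state that is missing from the state dimension table.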

##### County facts table
Checks:
* Existence and count
* Unique composite keys (date + county)
* Covid values are integer, weather values decimal
* No null values outside the weather columns
* Numerics all non-negative
* Foreign keys exist

In [35]:
county_facts_df = load_county_facts_table(spark)

Existence:

In [36]:
county_facts_df.limit(5).show()

+----+----------+----------------+----------------+-----------------+-----------------+--------+--------+-----------+----+----------+
|fips|     state|covid_case_total|covid_case_delta|covid_death_total|covid_death_delta|min_temp|max_temp|cloud_cover|wind| timestamp|
+----+----------+----------------+----------------+-----------------+-----------------+--------+--------+-----------+----+----------+
|1055|   Alabama|            6005|             100|               64|                0|    1.18|   15.19|       17.0|2.93|1606176000|
|1097|   Alabama|           19446|             140|              360|                2|    3.75|   20.04|       17.0|3.29|1606176000|
|5111|  Arkansas|            1602|              12|               39|                0|    2.77|   13.97|       20.0|1.41|1606176000|
|6079|California|            5885|              74|               35|                0|    6.09|   15.35|       12.0|2.19|1606176000|
|8059|  Colorado|           18369|             368|           

Duplicates:

In [37]:
county_facts_df.count()

1266226

In [38]:
county_facts_df.dropDuplicates(['fips', 'timestamp']).count()

1266226

Data types:

In [39]:
county_facts_df.printSchema()

root
 |-- fips: long (nullable = true)
 |-- state: string (nullable = true)
 |-- covid_case_total: long (nullable = true)
 |-- covid_case_delta: long (nullable = true)
 |-- covid_death_total: long (nullable = true)
 |-- covid_death_delta: long (nullable = true)
 |-- min_temp: double (nullable = true)
 |-- max_temp: double (nullable = true)
 |-- cloud_cover: double (nullable = true)
 |-- wind: double (nullable = true)
 |-- timestamp: integer (nullable = true)
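
The schemas printed in this section show that the key columns do not have the same type in every table: fips is an integer in the county dimension table and a long here, while timestamp is a long in the time dimension table and an integer here. Joins on integer types of different widths generally still work in Spark, but it may be worth making the difference visible before running the foreign key checks. The sketch below uses a hypothetical helper, `key_type_mismatches`, and writes the types in the short form returned by `DataFrame.dtypes` (`int` for integer, `bigint` for long).

```python
def key_type_mismatches(left_dtypes, right_dtypes, keys):
    """Return {key: (left type, right type)} where the two tables disagree."""
    left, right = dict(left_dtypes), dict(right_dtypes)
    return {
        key: (left.get(key), right.get(key))
        for key in keys
        if left.get(key) != right.get(key)
    }


# Key column types as reported by the printSchema() outputs in this section.
county_facts_keys = [("fips", "bigint"), ("timestamp", "int")]
county_dim_keys = [("fips", "int")]
time_dim_keys = [("timestamp", "bigint")]

print(key_type_mismatches(county_facts_keys, county_dim_keys, ["fips"]))
print(key_type_mismatches(county_facts_keys, time_dim_keys, ["timestamp"]))
```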



Null entries:

In [40]:
county_facts_df.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in county_facts_df.columns]).show()

+----+-----+----------------+----------------+-----------------+-----------------+--------+--------+-----------+------+---------+
|fips|state|covid_case_total|covid_case_delta|covid_death_total|covid_death_delta|min_temp|max_temp|cloud_cover|  wind|timestamp|
+----+-----+----------------+----------------+-----------------+-----------------+--------+--------+-----------+------+---------+
|   0|    0|               0|               0|                0|                0|  204568|  204568|     204568|204568|        0|
+----+-----+----------------+----------------+-----------------+-----------------+--------+--------+-----------+------+---------+



Each weather column contains 204,568 null values, one for every row where we don't have weather data. This is expected, but it needs to be accounted for in future queries.
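
To put the null counts in proportion, they can be expressed as a share of the total row count. The sketch below uses the counts printed above; `null_share` is a hypothetical helper.

```python
def null_share(null_counts, total_rows):
    """Return {column: percentage of null rows} for columns with any nulls."""
    if total_rows <= 0:
        raise ValueError("total_rows must be positive")
    return {
        name: round(100.0 * n / total_rows, 1)
        for name, n in null_counts.items()
        if n > 0
    }


# Counts copied from the outputs above (null counts and total row count).
weather_nulls = {
    "min_temp": 204568, "max_temp": 204568,
    "cloud_cover": 204568, "wind": 204568, "timestamp": 0,
}
print(null_share(weather_nulls, 1266226))
```

With these numbers, roughly 16% of the fact rows have no weather data.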

Non-negative numerics:

In [41]:
county_facts_df.select([count(when(col(c).cast("float").isNotNull() & (col(c) < 0), c)).alias(c) for c in county_facts_df.columns]).show()

+----+-----+----------------+----------------+-----------------+-----------------+--------+--------+-----------+----+---------+
|fips|state|covid_case_total|covid_case_delta|covid_death_total|covid_death_delta|min_temp|max_temp|cloud_cover|wind|timestamp|
+----+-----+----------------+----------------+-----------------+-----------------+--------+--------+-----------+----+---------+
|   0|    0|               0|           15355|                0|             2759|  180475|   29290|         39|  39|        0|
+----+-----+----------------+----------------+-----------------+-----------------+--------+--------+-----------+----+---------+



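It's unsurprising that the temperature columns contain negative numbers. The 39 negative values in each of cloud_cover and wind are less obviously valid and may be worth a closer look.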

We also see negative numbers in the delta columns (15,355 for cases and 2,759 for deaths), which is to be expected; we know from previous steps that some of the source data isn't quite reliable.
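
To see where negative deltas come from, it can help to look at the totals for a single county in date order. The sketch below assumes that the delta columns hold the day-over-day change in the corresponding cumulative total, which this section does not state explicitly. `negative_delta_days` is a hypothetical helper and the sample totals are made up.

```python
def negative_delta_days(totals):
    """Return (index, delta) for each position where a cumulative total drops."""
    return [
        (i, curr - prev)
        for i, (prev, curr) in enumerate(zip(totals, totals[1:]), start=1)
        if curr < prev
    ]


# Made-up cumulative case totals for one county, ordered by date.
print(negative_delta_days([10, 12, 11, 15]))
```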

##### State facts table
Checks:
* Existence and count
* Unique composite keys (date + state)
* Values are integer and non-negative
* Foreign keys exist

In [42]:
state_facts_df = load_state_facts_table(spark)

Existence:

In [43]:
state_facts_df.limit(5).show()

+--------------------+----------------+----------------+-----------------+-----------------+----------+
|               state|covid_case_total|covid_case_delta|covid_death_total|covid_death_delta| timestamp|
+--------------------+----------------+----------------+-----------------+-----------------+----------+
|      North Carolina|               0|               0|                0|                0|1580342400|
|District of Columbia|               0|               0|                0|                0|1580342400|
|          New Mexico|               0|               0|                0|                0|1583712000|
|               Maine|               0|               0|                0|                0|1583712000|
|District of Columbia|               0|               0|                0|                0|1583712000|
+--------------------+----------------+----------------+-----------------+-----------------+----------+



Duplicates:

In [44]:
state_facts_df.count()

20553

In [46]:
state_facts_df.dropDuplicates(['state', 'timestamp']).count()

20553

Data types:

In [47]:
state_facts_df.printSchema()

root
 |-- state: string (nullable = true)
 |-- covid_case_total: long (nullable = true)
 |-- covid_case_delta: long (nullable = true)
 |-- covid_death_total: long (nullable = true)
 |-- covid_death_delta: long (nullable = true)
 |-- timestamp: integer (nullable = true)



Null entries:

In [48]:
state_facts_df.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in state_facts_df.columns]).show()

+-----+----------------+----------------+-----------------+-----------------+---------+
|state|covid_case_total|covid_case_delta|covid_death_total|covid_death_delta|timestamp|
+-----+----------------+----------------+-----------------+-----------------+---------+
|    0|               0|               0|                0|                0|        0|
+-----+----------------+----------------+-----------------+-----------------+---------+



Non-negative numerics:

In [49]:
state_facts_df.select([count(when(col(c).cast("float").isNotNull() & (col(c) < 0), c)).alias(c) for c in state_facts_df.columns]).show()

+-----+----------------+----------------+-----------------+-----------------+---------+
|state|covid_case_total|covid_case_delta|covid_death_total|covid_death_delta|timestamp|
+-----+----------------+----------------+-----------------+-----------------+---------+
|    0|               0|              31|                0|               75|        0|
+-----+----------------+----------------+-----------------+-----------------+---------+



We see negative numbers in the delta columns (31 for cases and 75 for deaths), which is to be expected; we know from previous steps that some of the source data isn't quite reliable.