#### Step 6: Example queries

What use is a data engineering project if we never ask any questions of the data? Below are some example queries, run against the local Parquet files since I don't want to pay for S3 read costs.

##### Setup, imports and database loads

In [1]:
from setup import create_spark_session

spark = create_spark_session()

In [2]:
from pyspark.sql import functions as F
from pyspark.sql.types import *
from pyspark.sql.window import Window
from pyspark.sql import SQLContext

from sql.exampleQueries import *

In [3]:
from etl import load_all_tables

time_dim_df, county_dim_df, state_dim_df, county_facts_df, state_facts_df = load_all_tables(spark)

time_dim_df.createOrReplaceTempView("dim_time")
county_dim_df.createOrReplaceTempView("dim_county")
state_dim_df.createOrReplaceTempView("dim_state")
county_facts_df.createOrReplaceTempView("fact_county")
state_facts_df.createOrReplaceTempView("fact_state")

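# SQLContext expects a SparkContext as its first argument; it is deprecated
# since Spark 3.0, where spark.sql(...) does the same job.
sqlContext = SQLContext(spark.sparkContext, spark)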

Started loading database
Started loading time dimension table
Finished loading time dimension table
Started loading county dimension table
Finished loading county dimension table
Started loading state dimension table
Finished loading state dimension table
Started loading county facts table
Finished loading county facts table
Started loading state facts table
Finished loading state facts table
Finished loading database


##### Queries
Let's start simply. Which counties have the highest total case counts?

In [5]:
sqlContext.sql("""
    SELECT fc.fips, dc.county_name, max(fc.covid_case_total) as covid_case_total
        FROM fact_county fc
        LEFT JOIN dim_county dc
        ON fc.fips == dc.fips
        GROUP BY fc.fips, dc.county_name
        ORDER BY covid_case_total DESC
        LIMIT 10
""").show()

+-----+--------------+----------------+
| fips|   county_name|covid_case_total|
+-----+--------------+----------------+
| 6037|   Los Angeles|         1190894|
| 4013|      Maricopa|          509683|
|17031|          Cook|          473944|
|12086|    Miami-Dade|          409216|
|48201|        Harris|          348848|
| 6065|     Riverside|          289450|
| 6071|San Bernardino|          286291|
|48113|        Dallas|          280404|
| 6059|        Orange|          261022|
| 6073|     San Diego|          259641|
+-----+--------------+----------------+



Which states have the most cases in total?

In [6]:
sqlContext.sql("""
    SELECT fs.state, max(fs.covid_case_total) as covid_case_total
        FROM fact_state fs
        GROUP BY fs.state
        ORDER BY covid_case_total DESC
        LIMIT 10
""").show()

+--------------+----------------+
|         state|covid_case_total|
+--------------+----------------+
|    California|         3563578|
|         Texas|         2649363|
|       Florida|         1900218|
|      New York|         1635820|
|      Illinois|         1185363|
|       Georgia|          971325|
|          Ohio|          966154|
|  Pennsylvania|          931531|
|North Carolina|          858547|
|       Arizona|          815705|
+--------------+----------------+



Both of these are a bit misleading, though. Counties and states with larger populations are bound to have higher totals, so let's run these queries again, this time normalised by the county/state population.

In [13]:
sqlContext.sql("""
    WITH county_norm AS (
        SELECT fc.fips, fc.covid_case_total / dc.population as norm_case_total
        FROM fact_county fc
        LEFT JOIN dim_county dc
        ON fc.fips == dc.fips
        GROUP BY fc.fips, fc.covid_case_total, dc.population
    )
    SELECT fc.fips, dc.county_name, dc.state, max(cn.norm_case_total) as norm_case_total
        FROM fact_county fc
        LEFT JOIN dim_county dc
        ON fc.fips == dc.fips
        LEFT JOIN county_norm cn
        ON fc.fips == cn.fips
        GROUP BY fc.fips, dc.county_name, dc.state
        ORDER BY norm_case_total DESC
        LIMIT 10
""").show()

+-----+-------------+-------------------+
| fips|  county_name|    norm_case_total|
+-----+-------------+-------------------+
| 8025|      Crowley|0.33817002389894163|
|13053|Chattahoochee| 0.2878135529764133|
| 8011|         Bent| 0.2512750765045903|
|46041|        Dewey|0.23899051490514905|
|19021|  Buena Vista|0.23784844520479018|
| 5079|      Lincoln| 0.2358215646715983|
|47095|         Lake| 0.2264201862096883|
|20137|       Norton|0.22117863720073666|
|47169|    Trousdale| 0.2167635306937886|
|46009|    Bon Homme| 0.2154727793696275|
+-----+-------------+-------------------+



In [17]:
sqlContext.sql("""
    SELECT fs.state, fs.covid_death_total, ds.population
    FROM fact_state fs
    LEFT JOIN dim_state ds
    ON fs.state == ds.state
    GROUP BY fs.state, fs.covid_death_total, ds.population
    LIMIT 10
""").show()

+--------------------+-----------------+----------+
|               state|covid_death_total|population|
+--------------------+-----------------+----------+
|District of Columbia|              578|      null|
|       Massachusetts|             8288|      null|
|             Indiana|              116|      null|
|        North Dakota|             1352|      null|
|        South Dakota|               65|      null|
|        Rhode Island|             2220|      null|
|        South Dakota|             1110|      null|
|        Pennsylvania|             9005|      null|
|        Pennsylvania|             4417|      null|
|        North Dakota|              246|      null|
+--------------------+-----------------+----------+



In [16]:
sqlContext.sql("""
    SELECT fs.state, fs.covid_death_total / ds.population as norm_death_total
    FROM fact_state fs
    LEFT JOIN dim_state ds
    ON fs.state == ds.state
    GROUP BY fs.state, fs.covid_death_total, ds.population
    LIMIT 10
""").show()

+--------------------+----------------+
|               state|norm_death_total|
+--------------------+----------------+
|District of Columbia|            null|
|       Massachusetts|            null|
|             Indiana|            null|
|        North Dakota|            null|
|        South Dakota|            null|
|        Rhode Island|            null|
|        South Dakota|            null|
|        Pennsylvania|            null|
|        Pennsylvania|            null|
|        North Dakota|            null|
+--------------------+----------------+



In [15]:
sqlContext.sql("""
    WITH state_norm AS (
        SELECT fs.state, fs.covid_death_total / ds.population as norm_death_total
        FROM fact_state fs
        LEFT JOIN dim_state ds
        ON fs.state == ds.state
        GROUP BY fs.state, fs.covid_death_total, ds.population
    )
    SELECT fs.state, max(sn.norm_death_total) as norm_death_total
        FROM fact_state fs
        LEFT JOIN state_norm sn
        ON fs.state == sn.state
        GROUP BY fs.state
        ORDER BY norm_death_total DESC
        LIMIT 10
""").show()

+------------+----------------+
|       state|norm_death_total|
+------------+----------------+
|        Utah|            null|
|      Hawaii|            null|
|   Minnesota|            null|
|        Ohio|            null|
|    Arkansas|            null|
|      Oregon|            null|
|       Texas|            null|
|North Dakota|            null|
|Pennsylvania|            null|
| Connecticut|            null|
+------------+----------------+
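
Every state above comes back with a null `population`, which is why the normalised death totals are null as well. Either the join keys don't line up or the population column itself is empty; which of the two applies isn't established here. Below is a minimal sketch of how the first possibility could be checked, using an in-memory SQLite database as a self-contained stand-in for the Spark temp views. The toy rows and the abbreviation mismatch are invented purely for illustration.

```python
import sqlite3


def unmatched_states(conn):
    """Return fact_state keys that have no partner row in dim_state."""
    rows = conn.execute("""
        SELECT DISTINCT fs.state
        FROM fact_state fs
        LEFT JOIN dim_state ds
        ON fs.state = ds.state
        WHERE ds.state IS NULL
        ORDER BY fs.state
    """).fetchall()
    return [row[0] for row in rows]


# Toy data (hypothetical): the dimension table abbreviates two state names,
# so the matching fact rows cannot find a partner in the join.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_state (state TEXT, covid_death_total INTEGER)")
conn.execute("CREATE TABLE dim_state (state TEXT, population INTEGER)")
conn.executemany(
    "INSERT INTO fact_state VALUES (?, ?)",
    [("Utah", 10), ("Utah", 12), ("Ohio", 30), ("Texas", 50)],
)
conn.executemany(
    "INSERT INTO dim_state VALUES (?, ?)",
    [("UT", 1000), ("OH", 2000), ("Texas", 3000)],
)

print(unmatched_states(conn))
```

In the notebook, the same `SELECT` could be passed to `sqlContext.sql(...)`. An empty result would point at the population column rather than the join keys.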



Now for something a bit trickier: which counties have the largest temperature swings during the day?

In [None]:
sqlContext.sql("""
    -- Average the daily swing per county; the raw difference can't be selected under GROUP BY
    SELECT fc.fips, dc.county_name, avg(fc.max_temp - fc.min_temp) as temp_change
        FROM fact_county fc
        LEFT JOIN dim_county dc
        ON fc.fips == dc.fips
        GROUP BY fc.fips, dc.county_name
        ORDER BY temp_change DESC
        LIMIT 10
""").show()

I'd like to know how strongly Covid case rates correlate with population density. My guess is that higher population density results in higher infection rates. To check this, we'll need the total case count for each county, normalised by population, and then compare it across percentiles of population density. For example, we could split the counties into ten equal-sized buckets (deciles) by population density and calculate the average normalised case rate in each.
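
The decile idea could be sketched with the `NTILE` window function, which also exists in Spark SQL. To keep the example self-contained it runs against an in-memory SQLite database rather than the real tables. The `population_density` column name is an assumption (it is not confirmed by the schema shown here), and the toy rows are made up.

```python
import sqlite3

# `population_density` is a hypothetical column name. The `* 1.0` only guards
# against SQLite's integer division and should not be needed in Spark SQL.
DECILE_QUERY = """
    WITH county_rate AS (
        SELECT dc.fips,
               dc.population_density,
               MAX(fc.covid_case_total) * 1.0 / dc.population AS norm_case_total
        FROM fact_county fc
        JOIN dim_county dc
        ON fc.fips = dc.fips
        GROUP BY dc.fips, dc.population_density, dc.population
    ),
    bucketed AS (
        SELECT NTILE(10) OVER (ORDER BY population_density) AS density_decile,
               norm_case_total
        FROM county_rate
    )
    SELECT density_decile, AVG(norm_case_total) AS avg_norm_case_total
    FROM bucketed
    GROUP BY density_decile
    ORDER BY density_decile
"""


def density_decile_rates(conn):
    """Return (decile, average normalised case rate) rows, lowest density first."""
    return conn.execute(DECILE_QUERY).fetchall()


# Toy data (hypothetical): 20 counties of 1,000 people each, with density and
# final case count both growing with the county number.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dim_county (fips INTEGER, population INTEGER, population_density REAL)"
)
conn.execute("CREATE TABLE fact_county (fips INTEGER, covid_case_total INTEGER)")
for i in range(1, 21):
    conn.execute("INSERT INTO dim_county VALUES (?, ?, ?)", (i, 1000, i * 10.0))
    # Two cumulative readings per county; MAX() keeps the larger one.
    conn.executemany(
        "INSERT INTO fact_county VALUES (?, ?)", [(i, i * 5), (i, i * 10)]
    )

for decile, rate in density_decile_rates(conn):
    print(decile, round(rate, 4))
```

If the guess about density holds, the average rate should climb from the first decile to the tenth; a flat or falling trend would argue against it.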

I'm also curious about how strongly a county's average temperature correlates with case rates. I'd assume that warmer counties fare worse.
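
One way to put a number on "how strongly" is the Pearson correlation coefficient between each county's average temperature and its normalised case rate. The sketch below computes it by hand in plain Python on invented values; Spark SQL also ships a built-in `corr` aggregate that could do the same over the real tables. The per-county average temperature is assumed to have been computed beforehand.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Toy values (hypothetical): one entry per county.
avg_temps = [8.0, 12.0, 16.0, 20.0]
norm_case_rates = [0.04, 0.02, 0.08, 0.06]

print(pearson(avg_temps, norm_case_rates))
```

A value near +1 would support the "warmer counties fare worse" assumption, a value near 0 would suggest no linear relationship, and a negative value would contradict it.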

A harder question to answer is how much the weather on any given day affects Covid case rates a few days down the line.  
Since the reporting won't be extremely accurate, we should limit ourselves to evaluating strings of consistently good/bad weather for a few days, and then look at the new case increase a week afterwards.

First, we need to define what makes a day "good" or "bad". I'd define a "good" day as one with a low chance of precipitation and low cloud cover, and a "bad" day as one that is high on both. We could include temperature as well, but then we'd need a rolling average to judge how any given day compares, and rolling averages don't play well with the idea of a run of good/bad days: because the days are consecutive, the last day in a sequence would need, for example, a higher temperature than the preceding days, even though those were already classed as "good".
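
The good/bad definition could be sketched as a small classifier. The thresholds (0.2 and 0.6) and the assumption that both inputs are fractions between 0 and 1 are placeholders chosen for illustration, not values taken from the data set.

```python
from typing import Optional


def classify_day(
    precip_chance: float,
    cloud_cover: float,
    low: float = 0.2,   # hypothetical cut-off for "low"
    high: float = 0.6,  # hypothetical cut-off for "high"
) -> Optional[str]:
    """Label a day "good", "bad", or None when it is neither clearly one nor the other."""
    if precip_chance <= low and cloud_cover <= low:
        return "good"
    if precip_chance >= high and cloud_cover >= high:
        return "bad"
    # Mixed days (e.g. cloudy but dry) stay unclassified.
    return None


print(classify_day(0.1, 0.15))  # clear and dry
print(classify_day(0.8, 0.9))   # wet and overcast
print(classify_day(0.1, 0.9))   # dry but overcast
```

Leaving mixed days unlabelled keeps them out of both groups, so they break up a run of good or bad days instead of diluting it.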

Next, we need to identify sufficiently long sequences of days with similar weather.  
Then we can determine the cases a week from each day, and track the delta.
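
These two steps could look roughly like this for a single county, assuming one label and one cumulative case total per consecutive day. The minimum streak length of 3 and the reading of "delta" as the growth in cumulative cases over the following 7 days are assumptions, and the numbers are toy values.

```python
from itertools import groupby


def find_streaks(labels, min_length=3):
    """Return (label, first_day, last_day) for runs of identical, non-None labels."""
    streaks = []
    index = 0
    for label, run in groupby(labels):
        length = len(list(run))
        if label is not None and length >= min_length:
            streaks.append((label, index, index + length - 1))
        index += length
    return streaks


def week_ahead_deltas(cumulative_cases, streaks, lag=7):
    """For each day inside a streak, the case growth over the following `lag` days."""
    deltas = []
    for label, start, end in streaks:
        for day in range(start, end + 1):
            # Skip days whose look-ahead would run past the end of the data.
            if day + lag < len(cumulative_cases):
                growth = cumulative_cases[day + lag] - cumulative_cases[day]
                deltas.append((label, day, growth))
    return deltas


# Toy data (hypothetical): 14 consecutive days for one county.
labels = ["good", "good", "good", None, "bad", "bad", "bad", "bad",
          "good", None, None, None, None, None]
cases = [100, 102, 104, 106, 108, 110, 112, 114, 120, 126, 132, 138, 144, 150]

streaks = find_streaks(labels)
print(streaks)
print(week_ahead_deltas(cases, streaks))
```

The single "good" day at index 8 is dropped because it is shorter than the minimum streak length, and the last "bad" day is skipped because there is no reading a week after it.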

Finally, we can average the delta for good/bad weather days and see if we can find any difference.
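
The final averaging step is a plain group-by on the weather label. A minimal sketch, assuming the earlier steps have produced `(label, delta)` pairs; the numbers below are invented, not results:

```python
from collections import defaultdict


def average_delta_by_label(deltas):
    """Average the week-ahead case delta separately for each weather label."""
    totals = defaultdict(lambda: [0.0, 0])  # label -> [running sum, count]
    for label, delta in deltas:
        totals[label][0] += delta
        totals[label][1] += 1
    return {label: total / count for label, (total, count) in totals.items()}


# Toy input (hypothetical): one (label, delta) pair per streak day.
deltas = [("good", 14), ("good", 18), ("good", 22),
          ("bad", 30), ("bad", 34), ("bad", 38)]

averages = average_delta_by_label(deltas)
print(averages)
print(averages["bad"] - averages["good"])
```

With real data, the deltas would probably need to be normalised by county population before averaging, for the same reason the raw case totals above were misleading.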