#### Step 6: Example queries

What use is a data engineering project if we never actually ask any questions of the data? Below are some example queries (run against the local Parquet files, since I don't want to pay for S3 read costs).

##### Setup, imports and database loads

In [1]:
from setup import create_spark_session

spark = create_spark_session()

In [2]:
from pyspark.sql import functions as F
from pyspark.sql.types import *
from pyspark.sql.window import Window
from pyspark.sql import SQLContext

from sql.exampleQueries import *

In [3]:
from etl import load_all_tables

time_dim_df, county_dim_df, state_dim_df, county_facts_df, state_facts_df = load_all_tables(spark)

time_dim_df.createOrReplaceTempView("dim_time")
county_dim_df.createOrReplaceTempView("dim_county")
state_dim_df.createOrReplaceTempView("dim_state")
county_facts_df.createOrReplaceTempView("fact_county")
state_facts_df.createOrReplaceTempView("fact_state")

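# SQLContext expects a SparkContext first; the session is passed explicitly as the second argument
sqlContext = SQLContext(spark.sparkContext, spark)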

Started loading database
Started loading time dimension table
Finished loading time dimension table
Started loading county dimension table
Finished loading county dimension table
Started loading state dimension table
Finished loading state dimension table
Started loading county facts table
Finished loading county facts table
Started loading state facts table
Finished loading state facts table
Finished loading database


##### Queries
Let's start simply. Which counties have the highest total case counts?

In [5]:
sqlContext.sql("""
    SELECT fc.fips, fc.timestamp, dc.county_name, max(fc.covid_case_total) as covid_case_total
        FROM fact_county fc
        LEFT JOIN dim_county dc
        ON fc.fips == dc.fips
        GROUP BY fc.fips, fc.timestamp, dc.county_name
        ORDER BY covid_case_total DESC
        LIMIT 10
""").show()

+-----+-------------+-----------+----------------+
| fips|    timestamp|county_name|covid_case_total|
+-----+-------------+-----------+----------------+
|13183|      Georgia|       Long|      1614384000|
|29201|     Missouri|      Scott|      1614384000|
|39129|         Ohio|   Pickaway|      1614384000|
|48189|        Texas|       Hale|      1614384000|
|40037|     Oklahoma|      Creek|      1614384000|
|54069|West Virginia|       Ohio|      1614384000|
|42041| Pennsylvania| Cumberland|      1614384000|
|41039|       Oregon|       Lane|      1614384000|
|48155|        Texas|      Foard|      1614384000|
|17119|     Illinois|    Madison|      1614384000|
+-----+-------------+-----------+----------------+

Something is off here: every `covid_case_total` is 1614384000, which reads as a Unix timestamp (midnight UTC on 27 February 2021) rather than a case count, and the `timestamp` column holds state names. That suggests the columns of the county facts table are misaligned somewhere upstream. Checking the column maximum directly gives the same value:



In [6]:
county_facts_df.agg({'covid_case_total': 'max'}).show()

+---------------------+
|max(covid_case_total)|
+---------------------+
|           1614384000|
+---------------------+



Which states have the most cases in total?

In [None]:
# Group by state only: grouping by timestamp as well would rank (state, day) rows instead of states
sqlContext.sql("""
    SELECT fs.state, max(fs.covid_case_total) as covid_case_total
        FROM fact_state fs
        GROUP BY fs.state
        ORDER BY covid_case_total DESC
        LIMIT 10
""").show()

Now for something a bit more tricky: Which counties have the largest temperature swings during the day?
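
The question above could be answered with a query along these lines. This is only a sketch: `min_temp` and `max_temp` are hypothetical column names (the actual weather columns in `fact_county` aren't shown here), the rows are made-up toy data, and an in-memory SQLite table stands in for the Spark view so the example runs on its own. The `SELECT` itself should carry over to `sqlContext.sql` once the real column names are substituted.

```python
import sqlite3

# Toy stand-in for the fact_county view.
# min_temp and max_temp are HYPOTHETICAL column names, and the rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE fact_county (fips TEXT, timestamp INTEGER, min_temp REAL, max_temp REAL)"
)
conn.executemany(
    "INSERT INTO fact_county VALUES (?, ?, ?, ?)",
    [
        ("00001", 1, 10.0, 20.0),
        ("00001", 2, 12.0, 18.0),
        ("00002", 1, 5.0, 25.0),
        ("00002", 2, 7.0, 29.0),
    ],
)

# Average the daily (max - min) spread per county and rank the counties by it.
SWING_QUERY = """
    SELECT fips, AVG(max_temp - min_temp) AS avg_daily_swing
        FROM fact_county
        GROUP BY fips
        ORDER BY avg_daily_swing DESC
        LIMIT 10
"""

swings = conn.execute(SWING_QUERY).fetchall()
for fips, avg_daily_swing in swings:
    print(fips, avg_daily_swing)
```

Averaging the daily spread, rather than taking the single largest one, keeps one freak day from dominating the ranking; swapping `AVG` for `MAX` would answer the "largest single swing" variant instead.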

I'd like to know how strongly Covid case rates correlate with population density. My guess is that high population density results in higher infection rates. To calculate the correlation, we'll need to get the total case count for each county and normalise it by the county's population, then fit a model to those two features.
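
As a rough sketch of that calculation, the snippet below normalises case totals by population and computes a Pearson correlation against density in plain Python. The rows are invented toy data, and the tuple layout (total cases, population, population density) is an assumption about which columns would be pulled from the county tables. Pearson's r is just one simple choice of "model" here; in Spark, `DataFrame.stat.corr` should give the same number without collecting the data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(var_x * var_y)


# Invented toy rows: (total_cases, population, population_density).
counties = [
    (500, 10_000, 50.0),
    (1_500, 20_000, 150.0),
    (4_000, 40_000, 400.0),
    (900, 30_000, 80.0),
]

# Normalise the case totals so large counties don't dominate purely by size.
cases_per_capita = [cases / population for cases, population, _ in counties]
densities = [density for _, _, density in counties]

r = pearson(densities, cases_per_capita)
print(f"correlation between density and cases per capita: {r:.3f}")
```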

I'm also curious about how strongly a county's average temperature correlates with case rates. I'd assume that warmer counties fare worse.

A harder question to answer is how much the weather on any given day affects Covid case rates a few days down the line.  
Since the reporting won't be very precise, we should limit ourselves to runs of consistently good or bad weather lasting a few days, and then look at the increase in new cases a week afterwards.

First, we need to classify what makes a day "good" or "bad". I'd define a "good" day as one with a low chance of precipitation and low cloud cover, and a "bad" day as one that is high on both. We could include temperature as well, but we'd need a rolling average to judge how any given day compares. That's a problem, because rolling averages don't sit well with the idea of a run of good/bad days: the days are consecutive, so the last day in a run would need e.g. a higher temperature than the preceding days, even though those were already classed as "good".
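
A minimal sketch of that classification might look like the following. The thresholds are assumptions picked for illustration (the right cut-offs would depend on how the weather columns are actually distributed), both inputs are assumed to be fractions between 0 and 1, and the third label, "mixed", is my own addition for days that are neither clearly good nor clearly bad.

```python
# ASSUMED thresholds on a 0..1 scale; tune these against the real data.
LOW_THRESHOLD = 0.2
HIGH_THRESHOLD = 0.6


def classify_day(precip_probability, cloud_cover):
    """Label a day "good", "bad" or "mixed" from precipitation chance and cloud cover."""
    if precip_probability <= LOW_THRESHOLD and cloud_cover <= LOW_THRESHOLD:
        return "good"
    if precip_probability >= HIGH_THRESHOLD and cloud_cover >= HIGH_THRESHOLD:
        return "bad"
    # Anything in between, or high on only one measure, counts as neither.
    return "mixed"


print(classify_day(0.1, 0.1))  # clear and dry
print(classify_day(0.8, 0.9))  # wet and overcast
print(classify_day(0.1, 0.9))  # dry but overcast
```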

Next, we need to identify sufficiently long sequences of days with similar weather.  
Then we can look up the case count a week after each day and track the delta.

Finally, we can average the delta for good/bad weather days and see if we can find any difference.
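
The last three steps can be sketched in plain Python for a single county. Everything here is illustrative: the minimum run length of three days, the seven-day lag, the "mixed" label for days that are neither good nor bad, and the toy data are all assumptions, and the delta is taken as the cumulative case total a week later minus the total on the day itself.

```python
from itertools import groupby
from statistics import mean

MIN_RUN = 3  # ASSUMED minimum number of consecutive days that counts as a run
LAG = 7      # days between a weather day and the case total it is compared with


def weather_runs(labels, min_run=MIN_RUN):
    """Yield (label, start, end) for each run of "good"/"bad" days; end is exclusive."""
    index = 0
    for label, group in groupby(labels):
        length = len(list(group))
        if label in ("good", "bad") and length >= min_run:
            yield label, index, index + length
        index += length


def average_lagged_delta(labels, total_cases, min_run=MIN_RUN, lag=LAG):
    """Average case increase `lag` days after each day that sits in a qualifying run."""
    deltas = {"good": [], "bad": []}
    for label, start, end in weather_runs(labels, min_run):
        for day in range(start, end):
            if day + lag < len(total_cases):  # skip days whose lagged value is missing
                deltas[label].append(total_cases[day + lag] - total_cases[day])
    return {label: mean(values) for label, values in deltas.items() if values}


# Invented toy data: one label and one cumulative case total per day.
labels = ["good"] * 3 + ["mixed"] + ["bad"] * 3 + ["mixed"] * 7
total_cases = [0, 1, 2, 3, 4, 5, 6, 8, 10, 12, 15, 18, 21, 24]

runs = list(weather_runs(labels))
averages = average_lagged_delta(labels, total_cases)
print(runs)
print(averages)
```

In Spark, the same logic would more likely be expressed with window functions (a `lag`/`lead` over a window partitioned by county), but the plain version makes it easier to check the run detection on a handful of rows first.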