#### Step 6: Example queries

What use is a data engineering project without actually asking some questions about our data? Below are some example queries (run against the local parquet files since I don't want to pay for S3 read costs).

##### Setup, imports and database loads

In [1]:
from setup import create_spark_session

spark = create_spark_session()

In [2]:
from pyspark.sql import functions as F
from pyspark.sql.types import *
from pyspark.sql.window import Window
from pyspark.sql import SQLContext

from sql.exampleQueries import *

In [3]:
from etl import load_all_tables

time_dim_df, county_dim_df, state_dim_df, county_facts_df, state_facts_df = load_all_tables(spark)

time_dim_df.createOrReplaceTempView("dim_time")
county_dim_df.createOrReplaceTempView("dim_county")
state_dim_df.createOrReplaceTempView("dim_state")
county_facts_df.createOrReplaceTempView("fact_county")
state_facts_df.createOrReplaceTempView("fact_state")

sqlContext = SQLContext(spark.sparkContext, spark)

Started loading database
Started loading time dimension table
Finished loading time dimension table
Started loading county dimension table
Finished loading county dimension table
Started loading state dimension table
Finished loading state dimension table
Started loading county facts table
Finished loading county facts table
Started loading state facts table
Finished loading state facts table
Finished loading database


##### Highest case count by county
Let's start simply. Which counties have the highest total case counts?

In [None]:
sqlContext.sql("""
    SELECT fc.fips, dc.county_name, max(fc.covid_case_total) as covid_case_total
        FROM fact_county fc
        LEFT JOIN dim_county dc
        ON fc.fips == dc.fips
        GROUP BY fc.fips, dc.county_name
        ORDER BY covid_case_total DESC
        LIMIT 10
""").show()

##### Highest deaths by state
Which states have the most deaths in total?

In [None]:
sqlContext.sql("""
    SELECT fs.state, max(fs.covid_death_total) as covid_death_total
        FROM fact_state fs
        GROUP BY fs.state
        ORDER BY covid_death_total DESC
        LIMIT 10
""").show()

Both of these are a bit misleading, though. Counties and states with larger populations are bound to have higher totals, so let's run these queries again but normalise them to the county/state population.

##### Highest cases by counties and highest deaths by state, relative to population

In [None]:
sqlContext.sql("""
    WITH county_norm AS (
        SELECT fc.fips, fc.covid_case_total / dc.population as norm_case_total
        FROM fact_county fc
        LEFT JOIN dim_county dc
        ON fc.fips == dc.fips
        GROUP BY fc.fips, fc.covid_case_total, dc.population
    )
    SELECT fc.fips, dc.county_name, dc.state, max(cn.norm_case_total) as norm_case_total
        FROM fact_county fc
        LEFT JOIN dim_county dc
        ON fc.fips == dc.fips
        LEFT JOIN county_norm cn
        ON fc.fips == cn.fips
        GROUP BY fc.fips, dc.county_name, dc.state
        ORDER BY norm_case_total DESC
        LIMIT 10
""").show()

In [4]:
sqlContext.sql("""
    WITH state_norm AS (
        SELECT fs.state, fs.covid_death_total / ds.population as norm_death_total
        FROM fact_state fs
        LEFT JOIN dim_state ds
        ON fs.state == ds.state
        GROUP BY fs.state, fs.covid_death_total, ds.population
    )
    SELECT fs.state, max(sn.norm_death_total) as norm_death_total
        FROM fact_state fs
        LEFT JOIN state_norm sn
        ON fs.state == sn.state
        GROUP BY fs.state
        ORDER BY norm_death_total DESC
        LIMIT 10
""").show()

+-------------+--------------------+
|        state|    norm_death_total|
+-------------+--------------------+
|   New Jersey| 0.00260851409661762|
|     New York|0.002414721897611473|
|Massachusetts|0.002326376900875365|
| Rhode Island|0.002315298657448348|
|  Mississippi|0.002233026288033...|
|      Arizona|0.002226406601775...|
| South Dakota|0.002137752412905...|
|  Connecticut|0.002130622378532552|
|    Louisiana|0.002057305849941...|
|      Alabama|0.002031559343526046|
+-------------+--------------------+



##### Largest temperature swings during a day
Now for something a bit more tricky: Which counties have the largest temperature swings during the day?

In [6]:
sqlContext.sql("""
    SELECT fc.fips, dc.county_name, dc.state, fc.max_temp - fc.min_temp as temp_change
        FROM fact_county fc
        LEFT JOIN dim_county dc
        ON fc.fips == dc.fips
        GROUP BY fc.fips, dc.county_name, dc.state, fc.max_temp, fc.min_temp
        ORDER BY temp_change DESC
        LIMIT 10
""").show()

+-----+-----------+----------+------------------+
| fips|county_name|     state|       temp_change|
+-----+-----------+----------+------------------+
| 8031|     Denver|  Colorado|             32.97|
| 8014| Broomfield|  Colorado|32.940000000000005|
| 8001|      Adams|  Colorado|             32.75|
| 8005|   Arapahoe|  Colorado|              32.4|
| 8087|     Morgan|  Colorado|             32.25|
|40025|   Cimarron|  Oklahoma|             32.19|
|48431|   Sterling|     Texas|             32.04|
| 8039|     Elbert|  Colorado|31.980000000000004|
|56021|    Laramie|   Wyoming|             31.98|
|35059|      Union|New Mexico|31.799999999999997|
+-----+-----------+----------+------------------+



Looks like Colorado is seeing quite a few large temperature swings! It might be interesting to group this by state, or to refine the query, since it currently reports only the single most extreme day per county rather than the average daily swing.
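
As a rough sketch of that refinement, the query below averages the daily swing per county instead of picking out single extreme days. It runs against an in-memory SQLite database as a stand-in for Spark, with invented readings; only the `fact_county` column names are taken from the queries in this notebook, and the `timestamp` values are assumed to be date strings.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE fact_county (fips INTEGER, timestamp TEXT, min_temp REAL, max_temp REAL)"
)
# Invented readings: county 1 has one extreme day, county 2 has steady large swings
conn.executemany(
    "INSERT INTO fact_county VALUES (?, ?, ?, ?)",
    [
        (1, "2020-06-01", 10.0, 40.0),
        (1, "2020-06-02", 20.0, 25.0),
        (1, "2020-06-03", 20.0, 25.0),
        (2, "2020-06-01", 10.0, 30.0),
        (2, "2020-06-02", 10.0, 30.0),
        (2, "2020-06-03", 10.0, 30.0),
    ],
)

# Average and maximum daily swing per county, ranked by the average
swings = conn.execute("""
    SELECT fips,
           avg(max_temp - min_temp) AS avg_temp_change,
           max(max_temp - min_temp) AS max_temp_change
    FROM fact_county
    GROUP BY fips
    ORDER BY avg_temp_change DESC
""").fetchall()

for fips, avg_change, max_change in swings:
    print(fips, round(avg_change, 2), max_change)
```

In this toy data county 1 has the largest single-day swing, but county 2 ranks first once swings are averaged, which is the distinction the original query hides.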

##### Correlation between Covid-19 case rate and population density
I'd like to know how strongly Covid case rates correlate with population density. My guess is that high population density results in higher infection rates. To check this, we'll need to get the total case count for each county and normalise it for the total population, then compare across percentiles of population density in the data set. We can split the whole dataset into buckets of roughly 10% by population density and calculate the average normalised case rate for each.

In [24]:
sqlContext.sql("""
    WITH percentiles AS (
        SELECT dc.fips, PERCENT_RANK() OVER(
            ORDER BY dc.population_density ASC
        ) AS percent_rank
        FROM dim_county dc
    ),
    percent_buckets AS (
        SELECT p.fips, ROUND(p.percent_rank, 1) AS bucket
        FROM percentiles p
    ),
    max_cases AS (
        SELECT fc.fips, max(fc.covid_case_total / dc.population_density) as normalised_covid_cases
        FROM fact_county fc
        LEFT JOIN dim_county dc
        ON fc.fips == dc.fips
        GROUP BY fc.fips
    )
    SELECT pb.bucket, avg(mc.normalised_covid_cases) as average_normalised_cases, count(*) as count
        FROM  percent_buckets pb
        LEFT JOIN max_cases mc
        ON pb.fips == mc.fips
        GROUP BY pb.bucket
        ORDER BY pb.bucket DESC
        LIMIT 10
""").show()

+------+------------------------+-----+
|bucket|average_normalised_cases|count|
+------+------------------------+-----+
|   1.0|      34.962723593940574|  157|
|   0.9|       57.32998591205464|  314|
|   0.8|       53.69791338446903|  314|
|   0.7|       68.21512735366217|  314|
|   0.6|      55.558876271476386|  314|
|   0.5|       64.02359784609429|  314|
|   0.4|       67.70472271546836|  314|
|   0.3|       76.53304705111098|  314|
|   0.2|      118.82131288623027|  314|
|   0.1|      169.38397129450536|  314|
+------+------------------------+-----+



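This didn't turn out how I expected, which is exciting! The densest bucket (1.0) has a much lower value than the next one down, the middle buckets are fairly stable, and the least dense buckets shown (0.1 and 0.2) are far higher than the rest.

I'm not 100% confident that my query is correct, though, and two details call for caution. First, `max_cases` divides the case total by `population_density` rather than by `population`, so the metric is not the per-capita rate I set out to compute, and it naturally shrinks as density grows. Second, `ROUND(percent_rank, 1)` produces eleven buckets, not ten: the 1.0 bucket holds only 157 counties (about the top 5%), and `LIMIT 10` cuts off the 0.0 bucket entirely. I'd love to dig in more to evaluate the factors that explain this result, maybe in a future project.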
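
The bucket sizes in the table can be reproduced without Spark. The sketch below applies half-up rounding (which is what Spark's `ROUND` does) to `PERCENT_RANK` values for a hypothetical 3,140 tie-free rows. That row count is an assumption, chosen only because it reproduces the 157/314 pattern above. For comparison it also computes an `NTILE(10)`-style split.

```python
from collections import Counter

N_ROWS = 3140  # hypothetical row count, assumed tie-free


def round_bucket(k: int, n: int) -> float:
    """Label for zero-based rank k: percent_rank = k / (n - 1), rounded half-up to one decimal."""
    # floor(10 * k / (n - 1) + 0.5), done in integer arithmetic to avoid float edge cases
    tenths = (20 * k + (n - 1)) // (2 * (n - 1))
    return tenths / 10


def ntile_bucket(k: int, n: int, tiles: int = 10) -> int:
    """1-based decile; matches NTILE(tiles) when n divides evenly by tiles."""
    return k * tiles // n + 1


round_counts = Counter(round_bucket(k, N_ROWS) for k in range(N_ROWS))
ntile_counts = Counter(ntile_bucket(k, N_ROWS) for k in range(N_ROWS))

for label in sorted(round_counts, reverse=True):
    print(label, round_counts[label])
print("ntile sizes:", sorted(set(ntile_counts.values())))
```

Under these assumptions the rounding approach yields eleven labels with half-sized end buckets, while the decile split yields ten equal groups, so swapping `ROUND(p.percent_rank, 1)` for `NTILE(10)` would be one way to get true 10% buckets.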

##### Correlation between Covid-19 case rate and max temperature
I'm also curious about how strongly a county's maximum temperature (its hottest recorded day) correlates with case rates. I'd assume that warmer counties fare worse (since people are more likely to be out and about).

In [22]:
sqlContext.sql("""
    WITH hottest_days AS (
        SELECT fc.fips, max(fc.max_temp) as max_temp
        FROM fact_county fc
        GROUP BY fc.fips
    ),
    percentiles AS (
        SELECT hd.fips, PERCENT_RANK() OVER(
            ORDER BY hd.max_temp ASC
        ) AS percent_rank
        FROM hottest_days hd
        GROUP BY hd.fips, hd.max_temp
    ),
    percent_buckets AS (
        SELECT p.fips, ROUND(p.percent_rank, 1) AS bucket
        FROM percentiles p
    ),
    max_cases AS (
        SELECT fc.fips, max(fc.covid_case_total / dc.population_density) as normalised_covid_cases
        FROM fact_county fc
        LEFT JOIN dim_county dc
        ON fc.fips == dc.fips
        GROUP BY fc.fips
    )
    SELECT pb.bucket, avg(mc.normalised_covid_cases) as average_normalised_cases, count(*) as count
        FROM percent_buckets pb
        LEFT JOIN max_cases mc
        ON pb.fips == mc.fips
        GROUP BY pb.bucket
        ORDER BY pb.bucket DESC
        LIMIT 10
""").show()

+------+------------------------+-----+
|bucket|average_normalised_cases|count|
+------+------------------------+-----+
|   1.0|      184.71254309892313|  154|
|   0.9|       147.8762124137458|  301|
|   0.8|      128.33340261998075|  309|
|   0.7|       66.35012975534268|  333|
|   0.6|       70.22352478849133|   85|
|   0.5|      63.806747476272875|  546|
|   0.4|       73.35582869607516|  112|
|   0.3|       55.01663784026408|  516|
|   0.2|      59.749469896702976|   81|
|   0.1|        57.4032677262522|  536|
+------+------------------------+-----+



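Another interesting result. It appears that yes, the hottest counties fare worse: the three hottest buckets (0.8 to 1.0) average roughly two to three times the value of the others, which stay fairly flat between about 55 and 73. The coldest counties shown do not stand out, so this table doesn't support the idea that both extremes are hit harder. Keep in mind that the metric is again cases divided by population density, and that the 0.0 bucket is cut off by `LIMIT 10`.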
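
Both bucketed queries so far compute `covid_case_total / population_density` in the `max_cases` CTE. If density means people per unit area, that quotient equals the per-capita rate multiplied by the county's area, so large, sparse counties score high regardless of what share of residents was infected. A minimal sketch with two made-up counties (every number here is hypothetical) shows the effect:

```python
from dataclasses import dataclass


@dataclass
class County:
    name: str
    cases: int
    population: int
    area: float  # hypothetical units, e.g. square miles

    @property
    def density(self) -> float:
        return self.population / self.area

    @property
    def cases_per_capita(self) -> float:
        return self.cases / self.population

    @property
    def cases_per_density(self) -> float:
        # The quantity the max_cases CTE computes
        return self.cases / self.density


# Two invented counties with the same per-capita rate but very different areas
dense = County("Dense", cases=50_000, population=1_000_000, area=100.0)
sparse = County("Sparse", cases=500, population=10_000, area=2_000.0)

for county in (dense, sparse):
    print(county.name, county.cases_per_capita, county.cases_per_density)
```

The two counties have identical per-capita rates, yet the sparse one scores twenty times higher on the density-normalised metric, purely because its area is twenty times larger. Dividing by `dc.population` instead would give the per-capita figure.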

##### Correlation between deaths-per-infection ratio and percentage of elderly people in population
Elderly people are supposed to be more likely to die from Covid-19. We don't have patient data, but we do have information about how many over-65s live in a county. Let's look at the death rate relative to the case rate by percentiles of elderly people in each county, to see if the data bears this statement out as well.

In [26]:
sqlContext.sql("""
    WITH percentiles AS (
        SELECT dc.fips, PERCENT_RANK() OVER(
            ORDER BY dc.over_sixtyfives ASC
        ) AS percent_rank
        FROM dim_county dc
    ),
    percent_buckets AS (
        SELECT p.fips, ROUND(p.percent_rank, 1) AS bucket
        FROM percentiles p
    ),
    max_timestamps AS (
        SELECT fc.fips, max(fc.timestamp) as max_timestamp
        FROM fact_county fc
        GROUP BY fc.fips
    ),
    ratio AS (
        SELECT fc.fips, max(fc.covid_death_total / fc.covid_case_total) as death_per_case_ratio
        FROM fact_county fc
        LEFT JOIN max_timestamps mt
        ON fc.fips == mt.fips
        GROUP BY fc.fips
    )
    SELECT pb.bucket, avg(r.death_per_case_ratio) as death_per_case_ratio, count(*) as count
        FROM percent_buckets pb
        LEFT JOIN ratio r
        ON pb.fips == r.fips
        GROUP BY pb.bucket
        ORDER BY pb.bucket DESC
        LIMIT 10
""").show()

+------+--------------------+-----+
|bucket|death_per_case_ratio|count|
+------+--------------------+-----+
|   1.0| 0.10234139384339469|  157|
|   0.9| 0.07611350244709403|  314|
|   0.8| 0.08860998654504763|  314|
|   0.7| 0.08230655909068994|  314|
|   0.6|   0.105019219292181|  314|
|   0.5| 0.12325206203839445|  314|
|   0.4| 0.09558496924563675|  314|
|   0.3| 0.10577683141649179|  314|
|   0.2| 0.09502856552517643|  314|
|   0.1| 0.10371444871632118|  314|
+------+--------------------+-----+



Every result is interesting! This time, it looks like there isn't that much variation between the buckets, although the middle bucket (0.5) has the highest ratio of deaths per case. One caveat: `max_timestamps` is joined only on `fips`, so the `ratio` CTE takes the maximum ratio over all days rather than the ratio on the latest date, which can only overstate the latest figure. If we were properly analysing this dataset to fit a model, I'd discard this feature since there doesn't seem to be a strong correlation.
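
To illustrate how the `ratio` CTE could be restricted to the latest snapshot, here is a minimal sketch that joins on both `fips` and the maximum timestamp. It runs against an in-memory SQLite database as a stand-in for Spark, with invented cumulative totals and string timestamps; only the table and column names come from the query above. The `1.0 *` factor is needed only because SQLite, unlike Spark, performs integer division on integer columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE fact_county (fips INTEGER, timestamp TEXT, "
    "covid_case_total INTEGER, covid_death_total INTEGER)"
)
# Invented cumulative totals for one county: the ratio is high early on, then settles
conn.executemany(
    "INSERT INTO fact_county VALUES (?, ?, ?, ?)",
    [
        (1001, "2020-03-20", 2, 1),
        (1001, "2020-06-20", 400, 10),
        (1001, "2020-12-20", 5000, 100),
    ],
)

# Ratio on the most recent day only: join on fips AND the latest timestamp
LATEST_RATIO = """
    WITH max_timestamps AS (
        SELECT fips, max(timestamp) AS max_timestamp
        FROM fact_county
        GROUP BY fips
    )
    SELECT fc.fips,
           1.0 * fc.covid_death_total / fc.covid_case_total AS death_per_case_ratio
    FROM fact_county fc
    JOIN max_timestamps mt
      ON fc.fips = mt.fips AND fc.timestamp = mt.max_timestamp
"""

# Equivalent of the notebook's query: maximum ratio over every day
MAX_OVER_ALL_DAYS = """
    SELECT fips,
           max(1.0 * covid_death_total / covid_case_total) AS death_per_case_ratio
    FROM fact_county
    GROUP BY fips
"""

latest = conn.execute(LATEST_RATIO).fetchall()
max_all = conn.execute(MAX_OVER_ALL_DAYS).fetchall()
print("latest snapshot:", latest)
print("max over all days:", max_all)
```

With these toy rows the maximum over all days is dominated by the first few cases, while the latest snapshot gives a much lower ratio, so the two approaches can rank counties very differently.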

##### Future investigations
I can think of many other interesting investigations we can start with this dataset:
* Does relatively good/bad weather have any impact on case counts?
* If we knew when stay-at-home orders went into effect in a given county, could we detect its effect in the data?
* Which health and socioeconomic factors play a part in infection rates? Which in death rates?
* What do the worst-hit counties in all states have in common?

...and many more.