### Covid Cases By State

https://www.cdc.gov/covid-data-tracker/#cases

How can we find states that have similar numbers of COVID cases (total or per capita)? 
For example, let's write a SQL query to find all states that have similar infection numbers. 

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('data/US_MAP_DATA.csv')

In [3]:
df

Unnamed: 0,abbr,fips,jurisdiction,Total Cases,Total Death,Death_100k,CasesInLast7Days,RatePer100000
0,AK,2,Alaska,2729,22,3.0,688,370
1,AL,1,Alabama,82530,1493,31.0,12104,1688
2,AR,5,Arkansas,40181,428,14.0,5516,1333
3,AS,60,American Samoa,0,0,,0,0
4,AZ,4,Arizona,165934,3408,48.0,17251,2314
5,CA,6,California,466550,8518,22.0,65781,1179
6,CO,8,Colorado,45314,1807,32.0,4255,796
7,CT,9,Connecticut,49077,4423,124.0,981,1374
8,DC,11,District of Columbia,11945,583,83.0,518,1700
9,DE,10,Delaware,14602,581,60.0,810,1510


In [4]:
from pandasql import sqldf
pysqldf = lambda q: sqldf(q, globals())

You may notice that "Total Cases" has a space in it. Spaces aren't allowed in unquoted SQL identifiers, such as column names. A quick fix here is to put backticks around the name: `Total Cases`. Note that these are backticks, not quotes! (You'll generally find the backtick key in the upper-left corner of your keyboard.)
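
To see the difference between backticks and quotes concretely, here is a minimal sketch using Python's built-in `sqlite3` module (pandasql runs its queries on SQLite by default). The tiny two-row table is only for illustration; its values are copied from the output above.

```python
import sqlite3

# In-memory SQLite database with a column whose name contains a space.
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE df (abbr TEXT, "Total Cases" INTEGER)')
conn.executemany("INSERT INTO df VALUES (?, ?)", [("AK", 2729), ("AL", 82530)])

# Backticks and double quotes both quote an identifier in SQLite.
with_backticks = conn.execute(
    "SELECT abbr, `Total Cases` FROM df ORDER BY abbr"
).fetchall()
with_double_quotes = conn.execute(
    'SELECT abbr, "Total Cases" FROM df ORDER BY abbr'
).fetchall()

# Single quotes create a string literal, so every row gets the text itself.
with_single_quotes = conn.execute(
    "SELECT abbr, 'Total Cases' FROM df ORDER BY abbr"
).fetchall()

print(with_backticks)      # the column's values
print(with_double_quotes)  # the same values
print(with_single_quotes)  # the literal text 'Total Cases' repeated per row
```

Single quotes are the easy mistake here: the query still runs, but it returns the literal string rather than the column.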

In [5]:
pysqldf("""
SELECT 
    abbr, `Total Cases`
FROM 
    df
""")

Unnamed: 0,abbr,Total Cases
0,AK,2729
1,AL,82530
2,AR,40181
3,AS,0
4,AZ,165934
5,CA,466550
6,CO,45314
7,CT,49077
8,DC,11945
9,DE,14602


If you'd rather not deal with this repeatedly, you can rename the column in SQL or pandas - we'll use pandas here.

In [6]:
df.rename(columns={'Total Cases': 'Total_Cases'}, inplace=True)
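
For comparison, the SQL route would be a column alias with `AS`. This is a small sketch with the built-in `sqlite3` module on a two-row stand-in table, not the notebook's `df`; note that an alias only renames the column in that one query's result, while the pandas rename changes the DataFrame itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE df (abbr TEXT, "Total Cases" INTEGER)')
conn.executemany("INSERT INTO df VALUES (?, ?)", [("AK", 2729), ("AL", 82530)])

# AS gives the quoted column a friendlier name in the result set.
cursor = conn.execute(
    "SELECT abbr, `Total Cases` AS Total_Cases FROM df ORDER BY abbr"
)
column_names = [col[0] for col in cursor.description]
rows = cursor.fetchall()

print(column_names)
print(rows)
```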

Let's start by finding states that have total cases within 10000 of each other. 

In [7]:
pysqldf("""
SELECT
    a.abbr as abbr_a, 
    b.abbr as abbr_b, 
    a.Total_Cases as total_a,
    b.Total_Cases as total_b
FROM   
    df a
JOIN 
    df b 
ON
    ABS(total_a - total_b) < 10000
AND
    abbr_a <> abbr_b
AND
    abbr_a > abbr_b
""")

Unnamed: 0,abbr_a,abbr_b,total_a,total_b
0,AS,AK,0,2729
1,CO,AR,45314,40181
2,CT,AR,49077,40181
3,CT,CO,49077,45314
4,DC,AK,11945,2729
5,DE,DC,14602,11945
6,FSM,AK,0,2729
7,FSM,AS,0,0
8,GA,AZ,175052,165934
9,GU,AK,351,2729


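Which states have a total case count within 1,000 of each other? As a starting point, the query below joins the table to itself with no join condition and keeps only the rows where the left side is California, so we can see every pairing we would then need to filter.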

In [8]:
pysqldf("""
SELECT 
    *
FROM 
    df a
JOIN
    df b
WHERE 
    a.abbr = 'CA'
""")

Unnamed: 0,abbr,fips,jurisdiction,Total_Cases,Total Death,Death_100k,CasesInLast7Days,RatePer100000,abbr.1,fips.1,jurisdiction.1,Total_Cases.1,Total Death.1,Death_100k.1,CasesInLast7Days.1,RatePer100000.1
0,CA,6,California,466550,8518,22.0,65781,1179,AK,2,Alaska,2729,22,3.0,688,370
1,CA,6,California,466550,8518,22.0,65781,1179,AL,1,Alabama,82530,1493,31.0,12104,1688
2,CA,6,California,466550,8518,22.0,65781,1179,AR,5,Arkansas,40181,428,14.0,5516,1333
3,CA,6,California,466550,8518,22.0,65781,1179,AS,60,American Samoa,0,0,,0,0
4,CA,6,California,466550,8518,22.0,65781,1179,AZ,4,Arizona,165934,3408,48.0,17251,2314
5,CA,6,California,466550,8518,22.0,65781,1179,CA,6,California,466550,8518,22.0,65781,1179
6,CA,6,California,466550,8518,22.0,65781,1179,CO,8,Colorado,45314,1807,32.0,4255,796
7,CA,6,California,466550,8518,22.0,65781,1179,CT,9,Connecticut,49077,4423,124.0,981,1374
8,CA,6,California,466550,8518,22.0,65781,1179,DC,11,District of Columbia,11945,583,83.0,518,1700
9,CA,6,California,466550,8518,22.0,65781,1179,DE,10,Delaware,14602,581,60.0,810,1510


#### Exercise

Modify the SQL above to find which states don't have a total case count within 1,000 of any other state.

#### Using a Cross Product

This question came up during the most recent SQL workshop: what happens when you don't specify a JOIN condition?
Are there times when you'd want the entire cross product of two tables?

Yes, there are times when you want to take the entire cross product of two tables, to get every possible pairing of rows from two (or more) different tables. Keep in mind that this can produce extremely large result sets quickly: two tables with 1,000 rows each yield 1,000,000 pairs. If you're dealing with large tables (or multiple tables), you may want to reconsider your query and try to find a way to reduce the result set.

We can either use a JOIN with no ON condition or use a CROSS JOIN. I prefer to use the CROSS JOIN syntax, as it makes it clear in the query that I really do intend to take the full cross product, and haven't just forgotten to include an ON condition in my SQL query. Forgetting to include an ON clause in a JOIN is a fairly common SQL bug (actually, when my queries are taking an unexpectedly long time to run, this is the first thing I check). 
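
As a quick check that the two spellings agree, here is a sketch with the built-in `sqlite3` module and a three-row stand-in table. It assumes SQLite, which accepts a `JOIN` with no `ON` clause; some other databases reject that form outright.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE df (abbr TEXT)")
conn.executemany("INSERT INTO df VALUES (?)", [("AK",), ("AL",), ("AR",)])

# A JOIN with no ON condition pairs every row with every row...
implicit_count = conn.execute("SELECT COUNT(*) FROM df a JOIN df b").fetchone()[0]
# ...exactly as an explicit CROSS JOIN does.
explicit_count = conn.execute("SELECT COUNT(*) FROM df a CROSS JOIN df b").fetchone()[0]

print(implicit_count, explicit_count)  # 3 rows x 3 rows in both cases
```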

To illustrate, let's write a SQL query that finds, for each state, the state with the closest number of cases.

In [32]:
# first, find the closest match for California only
pysqldf("""
SELECT
    a.abbr AS abbr_a,
    b.abbr AS abbr_b,
    MIN(ABS(a.total_cases - b.total_cases)) AS min_diff
FROM
    df a
CROSS JOIN
    df b
WHERE
    abbr_a = 'CA'
AND 
    abbr_b != 'CA'
GROUP BY
    abbr_a
""")

Unnamed: 0,abbr_a,abbr_b,min_diff
0,CA,FL,29683


In [27]:
# for all states
pysqldf("""
SELECT
    a.abbr AS abbr_a,
    b.abbr AS abbr_b,
    MIN(ABS(a.total_cases - b.total_cases)) AS min_diff
FROM
    df a
CROSS JOIN
    df b
WHERE 
    abbr_a <> abbr_b
GROUP BY
    abbr_a
""")

Unnamed: 0,abbr_a,abbr_b,min_diff
0,AK,WY,140
1,AL,SC,1579
2,AR,UT,844
3,AS,FSM,0
4,AZ,IL,9039
5,CA,FL,29683
6,CO,NV,163
7,CT,CO,3763
8,DC,DE,2657
9,DE,PR,1459


#### Exercise - Non-symmetric pairs

Some of these pairs repeat - for example, AK's closest match is WY at 140, and WY's closest match is AK. 

Which pairs only work in one direction? For example, Texas's closest match is Florida, but Florida's closest match is California.
