Documentation: https://spark.apache.org/docs/latest/api/python/index.html

### Explore the COVID data

In [0]:
# Check out pre-loaded dataset
display(dbutils.fs.ls('dbfs:/databricks-datasets/COVID/covid-19-data/'))

path,name,size
dbfs:/databricks-datasets/COVID/covid-19-data/.git/,.git/,0
dbfs:/databricks-datasets/COVID/covid-19-data/.github/,.github/,0
dbfs:/databricks-datasets/COVID/covid-19-data/.gitignore,.gitignore,10
dbfs:/databricks-datasets/COVID/covid-19-data/LICENSE,LICENSE,1289
dbfs:/databricks-datasets/COVID/covid-19-data/NEW-YORK-DEATHS-METHODOLOGY.md,NEW-YORK-DEATHS-METHODOLOGY.md,2771
dbfs:/databricks-datasets/COVID/covid-19-data/NYT-readme.md,NYT-readme.md,1748
dbfs:/databricks-datasets/COVID/covid-19-data/PROBABLE-CASES-NOTE.md,PROBABLE-CASES-NOTE.md,3162
dbfs:/databricks-datasets/COVID/covid-19-data/README.md,README.md,22959
dbfs:/databricks-datasets/COVID/covid-19-data/colleges/,colleges/,0
dbfs:/databricks-datasets/COVID/covid-19-data/excess-deaths/,excess-deaths/,0


In [0]:
spark.read.text('dbfs:/databricks-datasets/COVID/covid-19-data/README.md').display()

value
# Coronavirus (Covid-19) Data in the United States
"**NEW:** As the [us-counties.csv](us-counties.csv) file has grown too large to open in Excel, we're providing a new [us-counties-recent.csv](us-counties-recent.csv) file that contains only the most recent 30 days of data for each county. It is otherwise identical. Both files will continue to be updated."
"**Change:** As of Feb. 10, 2021, we are changing how we report data for a few low-population Alaska geographies to better align with how the state reports data. Data for Bristol Bay Borough and Lake and Peninsula Borough are combined in a new area called ""Bristol Bay plus Lake and Peninsula"", and data for Yakutat City and Borough and Hoonah-Angoon Census Area are combined as ""Yakutat plus Hoonah-Angoon"". Many cases now assigned to those new geographies were previously reported as Unknown. The entire timeseries will be revised to use these new geographies."
**NEW:** We are publishing the data behind our [survey of mask usage](https://www.nytimes.com/interactive/2020/07/17/upshot/coronavirus-face-mask-map.html) in the United States in order to provide researchers a way to understand the role of mask wearing in the course of the pandemic. See the data and documentation in the [mask-use/](mask-use/) directory.
**NEW:** We are publishing the data behind our [excess deaths tracker](https://www.nytimes.com/interactive/2020/04/21/world/coronavirus-missing-deaths.html) in order to provide researchers and the public with a better record of the true toll of the pandemic. This data is compiled from official national and municipal data for 24 countries. See the data and documentation in the [excess-deaths/](excess-deaths/) directory.
---
[ [U.S. Data](us.csv) ([Raw CSV](https://raw.githubusercontent.com/nytimes/covid-19-data/master/us.csv)) | [U.S. State-Level Data](us-states.csv) ([Raw CSV](https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv)) | [U.S. County-Level Data](us-counties.csv) ([Raw CSV](https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv)) ]
"The New York Times is releasing a series of data files with cumulative counts of coronavirus cases in the United States, at the state and county level, over time. We are compiling this time series data from state and local governments and health departments in an attempt to provide a complete record of the ongoing outbreak."
"Since late January, The Times has tracked cases of coronavirus in real time as they were identified after testing. Because of the widespread shortage of testing, however, the data is necessarily limited in the picture it presents of the outbreak."
"We have used this data to power our [maps](https://www.nytimes.com/interactive/2020/us/coronavirus-us-cases.html) and [reporting](https://www.nytimes.com/coronavirus) tracking the outbreak, and it is now being made available to the public in response to requests from researchers, scientists and government officials who would like access to the data to better understand the outbreak."


Open `us-states.csv` and explore the schema

In [0]:
states = (spark.read.format('csv')
            .option("header", "true")
            .option("InferSchema", "true")
            .load('dbfs:/databricks-datasets/COVID/covid-19-data/us-states.csv'))
states.display()

date,state,fips,cases,deaths
2020-01-21,Washington,53,1,0
2020-01-22,Washington,53,1,0
2020-01-23,Washington,53,1,0
2020-01-24,Illinois,17,1,0
2020-01-24,Washington,53,1,0
2020-01-25,California,6,1,0
2020-01-25,Illinois,17,1,0
2020-01-25,Washington,53,1,0
2020-01-26,Arizona,4,1,0
2020-01-26,California,6,2,0


In [0]:
states.printSchema()

Explore the `us-counties.csv` and answer the following questions:
1. What's the time span of the data (first and last date)?
2. Aggregate the table by state:
  - Which state has the most confirmed cases and confirmed deaths?
  - Make a plot.

In [0]:
counties = (spark.read.format('csv')
            .option("header", "true")
            .option("InferSchema", "true")
            .load('dbfs:/databricks-datasets/COVID/covid-19-data/live/us-counties.csv'))
counties.display()

date,county,state,fips,cases,deaths,confirmed_cases,confirmed_deaths,probable_cases,probable_deaths
2021-03-12,Autauga,Alabama,1001.0,6409,95,5523.0,85.0,886.0,10.0
2021-03-12,Baldwin,Alabama,1003.0,20072,294,14228.0,220.0,5844.0,74.0
2021-03-12,Barbour,Alabama,1005.0,2175,52,1217.0,35.0,958.0,17.0
2021-03-12,Bibb,Alabama,1007.0,2475,58,2009.0,34.0,466.0,24.0
2021-03-12,Blount,Alabama,1009.0,6282,129,4835.0,109.0,1447.0,20.0
2021-03-12,Bullock,Alabama,1011.0,1183,39,1057.0,29.0,126.0,10.0
2021-03-12,Butler,Alabama,1013.0,2037,66,1858.0,60.0,179.0,6.0
2021-03-12,Calhoun,Alabama,1015.0,14034,299,10551.0,240.0,3483.0,59.0
2021-03-12,Chambers,Alabama,1017.0,3439,112,1711.0,72.0,1728.0,40.0
2021-03-12,Cherokee,Alabama,1019.0,1787,42,1152.0,32.0,635.0,10.0


In [0]:
counties.printSchema()

In [0]:
display(counties.describe())

summary,date,county,state,fips,cases,deaths,confirmed_cases,confirmed_deaths,probable_cases,probable_deaths
count,3246,3246,3246,3216.0,3246.0,3168.0,2386.0,1743.0,2055.0,1163.0
mean,,,,31489.679415422885,9048.8280961183,167.93339646464648,8426.308885163453,210.66150315547907,1116.05401459854,34.621668099742045
stddev,,,,16357.537531280195,35617.30549705999,799.772491913278,36042.96051818512,961.2880907957092,3842.007239647724,181.3321984872807
min,2021-03-12,Abbeville,Alabama,1001.0,0.0,0.0,0.0,0.0,0.0,0.0
max,2021-03-12,Ziebach,Wyoming,78030.0,1208672.0,30116.0,1208024.0,25085.0,115045.0,5031.0
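
The `count` row of `describe()` reports non-null values per column, so the gap to the 3246 total rows is the number of missing values in that column. A small plain-Python sketch of that arithmetic, using the counts copied from the output above (the variable names are illustrative, not part of the notebook):

```python
# Non-null counts copied from the `count` row of describe() above.
total_rows = 3246
non_null = {
    "fips": 3216,
    "cases": 3246,
    "deaths": 3168,
    "confirmed_cases": 2386,
    "confirmed_deaths": 1743,
    "probable_cases": 2055,
    "probable_deaths": 1163,
}

# Missing values per column = total rows minus non-null count.
missing = {name: total_rows - count for name, count in non_null.items()}

for name, n in missing.items():
    print(f"{name}: {n} missing ({n / total_rows:.1%})")
```

The `confirmed_*` and `probable_*` columns are the sparsest, which is worth keeping in mind when aggregating them per state.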


In [0]:
len(counties.columns)

In [0]:
# Convert `date` from string to date type
# import the whole module pyspark.sql.functions as F and then call individual functions (F.function)
import pyspark.sql.functions as F

counties = counties.withColumn('date', (F.to_date(F.col('date'), 'yyyy-MM-dd')))
#counties.display()

In [0]:
# Convert `date` from string to date type
# import individual functions from pyspark.sql.functions as you go
from pyspark.sql.functions import to_date, col

counties = counties.withColumn('date', (to_date(col('date'), 'yyyy-MM-dd')))
counties.display()

date,county,state,fips,cases,deaths,confirmed_cases,confirmed_deaths,probable_cases,probable_deaths
2021-03-12,Autauga,Alabama,1001.0,6409,95,5523.0,85.0,886.0,10.0
2021-03-12,Baldwin,Alabama,1003.0,20072,294,14228.0,220.0,5844.0,74.0
2021-03-12,Barbour,Alabama,1005.0,2175,52,1217.0,35.0,958.0,17.0
2021-03-12,Bibb,Alabama,1007.0,2475,58,2009.0,34.0,466.0,24.0
2021-03-12,Blount,Alabama,1009.0,6282,129,4835.0,109.0,1447.0,20.0
2021-03-12,Bullock,Alabama,1011.0,1183,39,1057.0,29.0,126.0,10.0
2021-03-12,Butler,Alabama,1013.0,2037,66,1858.0,60.0,179.0,6.0
2021-03-12,Calhoun,Alabama,1015.0,14034,299,10551.0,240.0,3483.0,59.0
2021-03-12,Chambers,Alabama,1017.0,3439,112,1711.0,72.0,1728.0,40.0
2021-03-12,Cherokee,Alabama,1019.0,1787,42,1152.0,32.0,635.0,10.0


In [0]:
import pyspark.sql.functions as F

counties = counties.withColumn('date', (F.to_date(F.col('date'), 'yyyy-MM-dd')))
#counties.display()

In [0]:
# First date
from pyspark.sql.functions import min

min_date = counties.select(min(col("date")))
min_date.display()

min(date)
2021-03-12


In [0]:
min_date = counties.select(min("date"))
min_date.display()

min(date)
2021-03-12


In [0]:
# Last date
from pyspark.sql.functions import max

max_date = counties.select(max("date"))
max_date.display()

max(date)
2021-03-12
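
The minimum and maximum dates are identical, which suggests that `live/us-counties.csv` is a single-day snapshot rather than a time series. A plain-Python sketch of the span calculation, using the two values shown above (the variable names are illustrative):

```python
from datetime import date

# Minimum and maximum `date` values as shown in the outputs above.
first_day = date.fromisoformat("2021-03-12")
last_day = date.fromisoformat("2021-03-12")

# Number of calendar days covered, counting both endpoints.
span_days = (last_day - first_day).days + 1
print(first_day, last_day, span_days)
```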


In [0]:
# Aggregate confirmed cases and confirmed deaths per state
from pyspark.sql.functions import sum

df = (counties
       .groupby('state')
       # Alias the sums so that later cells can refer to them by name
       .agg(sum('confirmed_cases').alias('confirmed_cases_total'),
            sum('confirmed_deaths').alias('confirmed_deaths_total'))
       # Alternative: rename afterwards with
       # .withColumnRenamed('old_column_name', 'new_column_name')
      )
df.display()

state,confirmed_cases_total,confirmed_deaths_total
Utah,85210.0,655.0
Hawaii,,
Minnesota,469579.0,
Ohio,,
Northern Mariana Islands,146.0,2.0
Arkansas,256864.0,4352.0
Oregon,10738.0,138.0
Texas,2370468.0,29319.0
North Dakota,95186.0,
Pennsylvania,834207.0,8012.0


In [0]:
# Which state has the most confirmed cases?
#df.orderBy('confirmed_cases_total', ascending=False).select('state').first()
df.orderBy('confirmed_cases_total', ascending=False).first()[0]
# display(df.orderBy('confirmed_cases_total', ascending=False))

In [0]:
# Which state has the most confirmed deaths?
df.orderBy('confirmed_deaths_total', ascending=False).select('state').first()

In [0]:
# Do we have the data for all the states?
from pyspark.sql.functions import col

(df
  .select("state")
  .filter(col("confirmed_deaths_total").isNull())
  #.filter(col("state").isNotNull())
  #.distinct()
  # .count()
  .display()
)

state
Hawaii
Minnesota
Ohio
North Dakota
Vermont
Virginia
Wyoming
Iowa
Massachusetts
Mississippi
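
A null total here does not mean the state is absent from the file. Spark's `sum` aggregate skips nulls and returns null only when a group has no non-null value at all, so these are states where no county row carries a `confirmed_deaths` value. A plain-Python sketch of that rule, with hypothetical rows and `None` standing in for null:

```python
from collections import defaultdict


def sum_ignoring_nulls(values):
    """Sum the non-None values; return None when every value is None."""
    present = [v for v in values if v is not None]
    return sum(present) if present else None


# Hypothetical rows: (state, confirmed_deaths); None stands in for a null.
rows = [
    ("State A", 85.0),
    ("State A", None),
    ("State B", None),
    ("State B", None),
]

# Group the values by state, then apply the null-skipping sum to each group.
by_state = defaultdict(list)
for state, deaths in rows:
    by_state[state].append(deaths)

totals = {state: sum_ignoring_nulls(vals) for state, vals in by_state.items()}
print(totals)
```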


In [0]:
# How many counties are in each state?
(counties
  .select("county", "state")
  .where(col("county").isNotNull())
  .groupBy("state")
  .count()
  .orderBy("count", ascending=False)
  .display()
)

state,count
Texas,254
Georgia,160
Virginia,133
Kentucky,120
Missouri,117
Kansas,106
Illinois,103
North Carolina,100
Iowa,100
Tennessee,96


Get familiar with the mask use study by reading the README.md

In [0]:
spark.read.text('dbfs:/databricks-datasets/COVID/covid-19-data/mask-use/README.md').display()

value
# Mask-Wearing Survey Data
The New York Times is releasing estimates of [mask usage](https://www.nytimes.com/interactive/2020/07/17/upshot/coronavirus-face-mask-map.html) by county in the United States.
"This data comes from a large number of interviews conducted online by the global data and survey firm Dynata at the request of The New York Times. The firm asked a question about mask use to obtain 250,000 survey responses between July 2 and July 14, enough data to provide estimates more detailed than the state level. (Several states have imposed new mask requirements since the completion of these interviews.)"
"Specifically, each participant was asked: _How often do you wear a mask in public when you expect to be within six feet of another person?_"
"This survey was conducted a single time, and at this point we have no plans to update the data or conduct the survey again."
## Data
Data on the estimated prevalence of mask-wearing in counties in the United States can be found in the **[mask-use-by-county.csv](mask-use-by-county.csv)** file. ([Raw CSV](https://raw.githubusercontent.com/nytimes/covid-19-data/master/mask-use/mask-use-by-county.csv))
```
"COUNTYFP,NEVER,RARELY,SOMETIMES,FREQUENTLY,ALWAYS"
"01001,0.053,0.074,0.134,0.295,0.444"


In [0]:
masks = (spark.read.format('csv')
            .option("header", "true")
            .option("InferSchema", "true")
            .load('dbfs:/databricks-datasets/COVID/covid-19-data/mask-use/mask-use-by-county.csv'))
masks.display()

COUNTYFP,NEVER,RARELY,SOMETIMES,FREQUENTLY,ALWAYS
1001,0.053,0.074,0.134,0.295,0.444
1003,0.083,0.059,0.098,0.323,0.436
1005,0.067,0.121,0.12,0.201,0.491
1007,0.02,0.034,0.096,0.278,0.572
1009,0.053,0.114,0.18,0.194,0.459
1011,0.031,0.04,0.144,0.286,0.5
1013,0.102,0.053,0.257,0.137,0.451
1015,0.152,0.108,0.13,0.167,0.442
1017,0.117,0.037,0.15,0.136,0.56
1019,0.135,0.027,0.161,0.158,0.52
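
The README sample shows the first county code as `01001`, while the loaded table shows `1001`, which suggests that schema inference read `COUNTYFP` as a number and dropped the leading zero. That is harmless for a numeric join, but if the five-character code is needed, it can be restored by zero-padding. A plain-Python sketch (the helper name `fips_to_code` is hypothetical):

```python
def fips_to_code(fips):
    """Format a numeric county FIPS value as a five-character, zero-padded string."""
    return str(int(fips)).zfill(5)


print(fips_to_code(1001))    # 01001
print(fips_to_code(1001.0))  # 01001
print(fips_to_code(78030))   # 78030
```

On a Spark string column, `pyspark.sql.functions.lpad` performs the same kind of padding.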


In [0]:
# Make two groups of frequency of wearing masks: almost_never (NEVER+RARELY) and almost_always (FREQUENTLY+ALWAYS): masks_groups
masks_groups = (masks
                 .withColumn('almost_never', masks.NEVER + masks.RARELY)
                 .withColumn('almost_always', masks.FREQUENTLY + masks.ALWAYS)
                 .drop('NEVER', 'RARELY', 'SOMETIMES', 'FREQUENTLY', 'ALWAYS')
                )
masks_groups.display()

COUNTYFP,almost_never,almost_always
1001,0.127,0.739
1003,0.142,0.759
1005,0.188,0.692
1007,0.054,0.85
1009,0.167,0.653
1011,0.071,0.786
1013,0.155,0.5880000000000001
1015,0.26,0.609
1017,0.154,0.6960000000000001
1019,0.162,0.678
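
Values such as `0.5880000000000001` in the output above are binary floating-point artifacts of adding the two shares, not extra precision in the survey data. A plain-Python sketch of the effect and of rounding back to the three decimals used in the source file:

```python
# FREQUENTLY and ALWAYS for county 1013, copied from the masks table above.
frequently = 0.137
always = 0.451

almost_always = frequently + always
print(almost_always)            # may show a trailing floating-point digit
print(round(almost_always, 3))  # rounded to the three decimals of the source data
```

In the DataFrame itself, `pyspark.sql.functions.round` can be applied to the new columns in the same way.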


Questions:
1. Join the tables `masks_groups` and `counties`.
2. Do you find a correlation between wearing a mask and number of cases/deaths?
3. Plot

In [0]:
# Join masks_groups and counties
mask_use = (counties
            .join(masks_groups, counties.fips == masks_groups.COUNTYFP)
            #.drop('COUNTYFP')
           )
mask_use.display()

date,county,state,fips,cases,deaths,confirmed_cases,confirmed_deaths,probable_cases,probable_deaths,COUNTYFP,almost_never,almost_always
2021-03-12,Autauga,Alabama,1001,6409,95,5523.0,85.0,886.0,10.0,1001,0.127,0.739
2021-03-12,Baldwin,Alabama,1003,20072,294,14228.0,220.0,5844.0,74.0,1003,0.142,0.759
2021-03-12,Barbour,Alabama,1005,2175,52,1217.0,35.0,958.0,17.0,1005,0.188,0.692
2021-03-12,Bibb,Alabama,1007,2475,58,2009.0,34.0,466.0,24.0,1007,0.054,0.85
2021-03-12,Blount,Alabama,1009,6282,129,4835.0,109.0,1447.0,20.0,1009,0.167,0.653
2021-03-12,Bullock,Alabama,1011,1183,39,1057.0,29.0,126.0,10.0,1011,0.071,0.786
2021-03-12,Butler,Alabama,1013,2037,66,1858.0,60.0,179.0,6.0,1013,0.155,0.5880000000000001
2021-03-12,Calhoun,Alabama,1015,14034,299,10551.0,240.0,3483.0,59.0,1015,0.26,0.609
2021-03-12,Chambers,Alabama,1017,3439,112,1711.0,72.0,1728.0,40.0,1017,0.154,0.6960000000000001
2021-03-12,Cherokee,Alabama,1019,1787,42,1152.0,32.0,635.0,10.0,1019,0.162,0.678


In [0]:
# What happened during the join?
# It's good practice to verify the row counts before and after
print('counties:', counties.count(), ', masks_groups:', masks_groups.count(), ', mask_use:', mask_use.count())
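
`join` performs an inner join by default, so rows whose key is null or has no match on the other side are dropped. Since `describe()` reported only 3216 non-null `fips` values out of 3246 rows, the joined table should come out smaller than `counties`. A plain-Python sketch of the mechanism, with hypothetical miniature tables:

```python
# Hypothetical miniature versions of the two tables.
# None stands in for a null `fips` value.
counties_rows = [
    {"county": "Autauga", "fips": 1001},
    {"county": "Baldwin", "fips": 1003},
    {"county": "Unknown", "fips": None},     # null key: cannot match anything
    {"county": "Elsewhere", "fips": 99999},  # key absent from the mask table
]
masks_rows = {1001: 0.739, 1003: 0.759, 1005: 0.692}  # COUNTYFP -> almost_always

# Inner join: keep only rows whose key appears on both sides.
joined = [
    {**row, "almost_always": masks_rows[row["fips"]]}
    for row in counties_rows
    if row["fips"] is not None and row["fips"] in masks_rows
]

print(len(counties_rows), len(masks_rows), len(joined))
```

Comparing the three counts, as the cell above does, shows how many rows were lost on each side.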

In [0]:
# Keep data for only one state
masks_arkansas = (mask_use
                  .filter(mask_use.state == "Arkansas")
                 )
masks_arkansas.display()

date,county,state,fips,cases,deaths,confirmed_cases,confirmed_deaths,probable_cases,probable_deaths,COUNTYFP,almost_never,almost_always
2021-03-12,Arkansas,Arkansas,5001,2006,33,1320,25,686,8,5001,0.092,0.773
2021-03-12,Ashley,Arkansas,5003,1861,32,1475,24,386,8,5003,0.126,0.712
2021-03-12,Baxter,Arkansas,5005,2968,98,2085,66,883,32,5005,0.254,0.638
2021-03-12,Benton,Arkansas,5007,27695,391,20974,288,6721,103,5007,0.115,0.803
2021-03-12,Boone,Arkansas,5009,3686,77,2779,68,907,9,5009,0.274,0.605
2021-03-12,Bradley,Arkansas,5011,1329,31,979,16,350,15,5011,0.196,0.629
2021-03-12,Calhoun,Arkansas,5013,404,3,304,2,100,1,5013,0.277,0.615
2021-03-12,Carroll,Arkansas,5015,2731,43,2317,39,414,4,5015,0.179,0.706
2021-03-12,Chicot,Arkansas,5017,1600,38,1458,36,142,2,5017,0.178,0.628
2021-03-12,Clark,Arkansas,5019,2061,40,1689,33,372,7,5019,0.1709999999999999,0.749


In [0]:
# How would you visualize it? 
masks_arkansas_select = (masks_arkansas
                         .select('county', 'confirmed_cases', 'confirmed_deaths', 'almost_never', 'almost_always')
                        )
masks_arkansas_select.display()

county,confirmed_cases,confirmed_deaths,almost_never,almost_always
Arkansas,1320,25,0.092,0.773
Ashley,1475,24,0.126,0.712
Baxter,2085,66,0.254,0.638
Benton,20974,288,0.115,0.803
Boone,2779,68,0.274,0.605
Bradley,979,16,0.196,0.629
Calhoun,304,2,0.277,0.615
Carroll,2317,39,0.179,0.706
Chicot,1458,36,0.178,0.628
Clark,1689,33,0.1709999999999999,0.749
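
Before plotting, a correlation coefficient gives a first impression. The sketch below computes Pearson's r between `almost_never` and `confirmed_cases` for the ten Arkansas rows displayed above. It is an illustration only: ten rows are far too few for a conclusion, and raw case counts are not adjusted for county population. The helper `pearson` is written out here for clarity and is not part of the notebook.

```python
from math import sqrt

# (almost_never, confirmed_cases) for the ten Arkansas counties displayed above.
rows = [
    (0.092, 1320), (0.126, 1475), (0.254, 2085), (0.115, 20974), (0.274, 2779),
    (0.196, 979), (0.277, 304), (0.179, 2317), (0.178, 1458), (0.1709999999999999, 1689),
]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


shares = [row[0] for row in rows]
cases = [row[1] for row in rows]
coef = pearson(shares, cases)
print(f"Pearson r (almost_never vs confirmed_cases, 10 rows): {coef:.3f}")
```

On the full DataFrame, `masks_arkansas.stat.corr('almost_never', 'confirmed_cases')` should return the same statistic without collecting the rows.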


In [0]:
# Save as a Parquet file
mask_use.write.parquet("dbfs:/output/mask_use1.parquet")

In [0]:
states.rdd.getNumPartitions()

In [0]:
# Check where it is and what it looks like
display(dbutils.fs.ls('dbfs:/output/mask_use1.parquet'))

In [0]:
# An example of partitioned dataset
display(dbutils.fs.ls('dbfs:/databricks-datasets/amazon/data20K'))

In [0]:
# Parquet files carry their own schema, so the CSV options header/inferSchema are not needed
amazon = (spark.read.format('parquet')
            .load('dbfs:/databricks-datasets/amazon/data20K'))
amazon.display()

In [0]:
counties.rdd.getNumPartitions()

Redo at least one exercise in SQL. (First you need to register the DataFrames as temporary views.)

In [0]:
# Register the DataFrames as temporary views so they can be queried with SQL
counties.createOrReplaceTempView("counties")
mask_use.createOrReplaceTempView("mask_use")
masks_groups.createOrReplaceTempView("masks_groups")

In [0]:
%sql

-- Verify that the table was created
SELECT *
FROM mask_use

In [0]:
%sql

-- Select data for only one state
SELECT *
FROM mask_use
WHERE state = 'Arkansas'

In [0]:
%sql

-- Join counties and masks_groups

SELECT county, state, confirmed_cases, confirmed_deaths, almost_never, almost_always
FROM counties
INNER JOIN masks_groups ON counties.fips=masks_groups.COUNTYFP
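
The per-state aggregation can be redone in SQL in the same way. The sketch below uses Python's built-in `sqlite3` module only as a stand-in so that it runs without a cluster. That substitution is an assumption of this example. In the notebook, the same query text could go into a `%sql` cell against the `counties` view. The four rows are copied from the outputs above (two counties per state), so the totals are partial sums, not state totals.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the Spark temporary view.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE counties (county TEXT, state TEXT, "
    "confirmed_cases REAL, confirmed_deaths REAL)"
)
# Four rows copied from the outputs above, not the full table.
conn.executemany(
    "INSERT INTO counties VALUES (?, ?, ?, ?)",
    [
        ("Autauga", "Alabama", 5523.0, 85.0),
        ("Baldwin", "Alabama", 14228.0, 220.0),
        ("Arkansas", "Arkansas", 1320.0, 25.0),
        ("Ashley", "Arkansas", 1475.0, 24.0),
    ],
)

# Aggregate per state and sort by the total number of confirmed cases.
query = """
SELECT state,
       SUM(confirmed_cases)  AS confirmed_cases_total,
       SUM(confirmed_deaths) AS confirmed_deaths_total
FROM counties
GROUP BY state
ORDER BY confirmed_cases_total DESC
"""
result = conn.execute(query).fetchall()
print(result)
conn.close()
```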