Documentation: https://spark.apache.org/docs/latest/api/python/index.html

https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql.html

### Explore the COVID data

In [0]:
# Check out pre-loaded dataset
display(dbutils.fs.ls('dbfs:/databricks-datasets/COVID/covid-19-data/'))

path,name,size
dbfs:/databricks-datasets/COVID/covid-19-data/.git/,.git/,0
dbfs:/databricks-datasets/COVID/covid-19-data/.github/,.github/,0
dbfs:/databricks-datasets/COVID/covid-19-data/.gitignore,.gitignore,10
dbfs:/databricks-datasets/COVID/covid-19-data/LICENSE,LICENSE,1289
dbfs:/databricks-datasets/COVID/covid-19-data/NEW-YORK-DEATHS-METHODOLOGY.md,NEW-YORK-DEATHS-METHODOLOGY.md,2771
dbfs:/databricks-datasets/COVID/covid-19-data/NYT-readme.md,NYT-readme.md,1748
dbfs:/databricks-datasets/COVID/covid-19-data/PROBABLE-CASES-NOTE.md,PROBABLE-CASES-NOTE.md,3162
dbfs:/databricks-datasets/COVID/covid-19-data/README.md,README.md,19391
dbfs:/databricks-datasets/COVID/covid-19-data/excess-deaths/,excess-deaths/,0
dbfs:/databricks-datasets/COVID/covid-19-data/live/,live/,0


In [0]:
spark.read.text('dbfs:/databricks-datasets/COVID/covid-19-data/README.md').display()

value
# Coronavirus (Covid-19) Data in the United States
**NEW:** We are publishing the data behind our [survey of mask usage](https://www.nytimes.com/interactive/2020/07/17/upshot/coronavirus-face-mask-map.html) in the United States in order to provide researchers a way to understand the role of mask wearing in the course of the pandemic. See the data and documentation in the [mask-use/](mask-use/) directory.
**NEW:** We are publishing the data behind our [excess deaths tracker](https://www.nytimes.com/interactive/2020/04/21/world/coronavirus-missing-deaths.html) in order to provide researchers and the public with a better record of the true toll of the pandemic. This data is compiled from official national and municipal data for 24 countries. See the data and documentation in the [excess-deaths/](excess-deaths/) directory.
---
[ [U.S. Data](us.csv) ([Raw CSV](https://raw.githubusercontent.com/nytimes/covid-19-data/master/us.csv)) | [U.S. State-Level Data](us-states.csv) ([Raw CSV](https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv)) | [U.S. County-Level Data](us-counties.csv) ([Raw CSV](https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv)) ]
"The New York Times is releasing a series of data files with cumulative counts of coronavirus cases in the United States, at the state and county level, over time. We are compiling this time series data from state and local governments and health departments in an attempt to provide a complete record of the ongoing outbreak."
"Since late January, The Times has tracked cases of coronavirus in real time as they were identified after testing. Because of the widespread shortage of testing, however, the data is necessarily limited in the picture it presents of the outbreak."
"We have used this data to power our [maps](https://www.nytimes.com/interactive/2020/us/coronavirus-us-cases.html) and [reporting](https://www.nytimes.com/coronavirus) tracking the outbreak, and it is now being made available to the public in response to requests from researchers, scientists and government officials who would like access to the data to better understand the outbreak."
"The data begins with the first reported coronavirus case in Washington State on Jan. 21, 2020. We will publish regular updates to the data in this repository."
## Live and Historical Data


Open `us-states.csv` and explore the schema

In [0]:
states = (spark.read.format('csv')
            .option("header", "true")
            .option("inferSchema", "true")
            .load('dbfs:/databricks-datasets/COVID/covid-19-data/us-states.csv'))
states.display()

date,state,fips,cases,deaths
2020-01-21,Washington,53,1,0
2020-01-22,Washington,53,1,0
2020-01-23,Washington,53,1,0
2020-01-24,Illinois,17,1,0
2020-01-24,Washington,53,1,0
2020-01-25,California,6,1,0
2020-01-25,Illinois,17,1,0
2020-01-25,Washington,53,1,0
2020-01-26,Arizona,4,1,0
2020-01-26,California,6,2,0


In [0]:
states.printSchema()

Explore `us-counties.csv` and answer the following questions:
1. What is the time span of the data (first and last date)?
2. Aggregate the table by state:
  - Which state has the most confirmed cases and confirmed deaths?
  - Make a plot.
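
Before running these questions on the cluster, it can help to know what answer to expect on a handful of rows. The sketch below is plain Python (no Spark), applied to the first five rows of the `live/us-counties.csv` preview shown in this notebook. It is only a sanity check of the logic, not the intended solution: in the notebook the same steps would normally be written with `groupBy("state").agg(...)` and the `min`, `max` and `sum` functions from `pyspark.sql.functions`. The `or 0` guard for empty cells is an assumption about the full file. Note that every preview row carries the same date, which suggests the live file is a one-day snapshot, so summing over counties gives a per-state total.

```python
import csv
import io
from collections import defaultdict

# First five rows of the live/us-counties.csv preview shown in this notebook.
SAMPLE = """date,county,state,fips,cases,deaths,confirmed_cases,confirmed_deaths,probable_cases,probable_deaths
2020-09-28,Autauga,Alabama,1001.0,1785,25,1601.0,24.0,184.0,1.0
2020-09-28,Baldwin,Alabama,1003.0,5588,50,5086.0,46.0,502.0,4.0
2020-09-28,Barbour,Alabama,1005.0,886,7,668.0,7.0,218.0,0.0
2020-09-28,Bibb,Alabama,1007.0,657,10,623.0,6.0,34.0,4.0
2020-09-28,Blount,Alabama,1009.0,1618,15,1256.0,15.0,362.0,0.0
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))

# ISO-8601 dates (YYYY-MM-DD) sort chronologically even as plain strings.
first_day = min(r["date"] for r in rows)
last_day = max(r["date"] for r in rows)

# Sum confirmed cases and deaths per state and count the counties seen.
totals = defaultdict(
    lambda: {"confirmed_cases": 0.0, "confirmed_deaths": 0.0, "counties": 0}
)
for r in rows:
    t = totals[r["state"]]
    # Assumption: treat an empty cell as 0 so that float() does not fail.
    t["confirmed_cases"] += float(r["confirmed_cases"] or 0)
    t["confirmed_deaths"] += float(r["confirmed_deaths"] or 0)
    t["counties"] += 1

# State with the largest confirmed-case total.
top_state = max(totals, key=lambda s: totals[s]["confirmed_cases"])

print("time span:", first_day, "to", last_day)
print("top state:", top_state, dict(totals[top_state]))
```

On the sample, only Alabama appears, so it is trivially the top state; the value of the check is that the Spark aggregation should reproduce the same per-state sums for these five counties.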

In [0]:
counties = (spark.read.format('csv')
            .option("header", "true")
            .option("inferSchema", "true")
            .load('dbfs:/databricks-datasets/COVID/covid-19-data/live/us-counties.csv'))
counties.display()

date,county,state,fips,cases,deaths,confirmed_cases,confirmed_deaths,probable_cases,probable_deaths
2020-09-28,Autauga,Alabama,1001.0,1785,25,1601.0,24.0,184.0,1.0
2020-09-28,Baldwin,Alabama,1003.0,5588,50,5086.0,46.0,502.0,4.0
2020-09-28,Barbour,Alabama,1005.0,886,7,668.0,7.0,218.0,0.0
2020-09-28,Bibb,Alabama,1007.0,657,10,623.0,6.0,34.0,4.0
2020-09-28,Blount,Alabama,1009.0,1618,15,1256.0,15.0,362.0,0.0
2020-09-28,Bullock,Alabama,1011.0,607,14,582.0,13.0,25.0,1.0
2020-09-28,Butler,Alabama,1013.0,914,39,878.0,38.0,36.0,1.0
2020-09-28,Calhoun,Alabama,1015.0,3548,44,3183.0,36.0,365.0,8.0
2020-09-28,Chambers,Alabama,1017.0,1172,42,898.0,40.0,274.0,2.0
2020-09-28,Cherokee,Alabama,1019.0,614,13,464.0,12.0,150.0,1.0


In [0]:
counties.printSchema()

In [0]:
# Convert `date` from string to date type

In [0]:
# First day in the dataset

In [0]:
# Last day in the dataset 

In [0]:
# Aggregate confirmed cases and confirmed deaths per state

In [0]:
# Which state has the max confirmed cases?

In [0]:
# Which state has the max confirmed deaths?

In [0]:
# Do we have data for all the counties?

In [0]:
# How many counties are in each state?

Get familiar with the mask use study by reading the README.md

In [0]:
spark.read.text('dbfs:/databricks-datasets/COVID/covid-19-data/mask-use/README.md').display()

In [0]:
# Create the DataFrame `masks` by reading dbfs:/databricks-datasets/COVID/covid-19-data/mask-use/mask-use-by-county.csv

In [0]:
# Combine the mask-wearing frequencies into two groups, almost_never (NEVER + RARELY) and almost_always (FREQUENTLY + ALWAYS), and store the result as masks_groups
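
The grouping itself is just two additions per county. The sketch below shows it in plain Python on two made-up rows. Only the category names NEVER, RARELY, FREQUENTLY and ALWAYS come from the exercise text; the `COUNTYFP` key, the `SOMETIMES` column and all numeric values are illustrative assumptions, so check the real column names with `masks.printSchema()` first. In the notebook, the same result would typically be produced with `withColumn` and column arithmetic.

```python
# Hypothetical rows: COUNTYFP, SOMETIMES and every number here are illustrative
# assumptions, not values read from mask-use-by-county.csv.
masks = [
    {"COUNTYFP": "01001", "NEVER": 0.05, "RARELY": 0.07,
     "SOMETIMES": 0.13, "FREQUENTLY": 0.30, "ALWAYS": 0.45},
    {"COUNTYFP": "01003", "NEVER": 0.10, "RARELY": 0.15,
     "SOMETIMES": 0.25, "FREQUENTLY": 0.20, "ALWAYS": 0.30},
]


def group_frequencies(row):
    """Collapse the answer categories into two groups; SOMETIMES is left out."""
    return {
        "COUNTYFP": row["COUNTYFP"],
        # Rounding hides floating-point noise from adding the shares.
        "almost_never": round(row["NEVER"] + row["RARELY"], 3),
        "almost_always": round(row["FREQUENTLY"] + row["ALWAYS"], 3),
    }


masks_groups = [group_frequencies(r) for r in masks]

for g in masks_groups:
    print(g)
```

Because the middle category is dropped, `almost_never` and `almost_always` do not have to add up to 1 for a county.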


Questions:
1. Join the tables `masks_groups` and `counties`.
2. Do you find a correlation between wearing a mask and the number of cases/deaths?
3. Plot the result.

In [0]:
# Join masks_groups and counties: mask_use


In [0]:
# What happened during the join?
# It's good practice to verify the result (count the rows of counties, masks_groups and mask_use)
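
Why the row counts matter can be seen on a tiny example. The sketch below performs an inner join in plain Python on hypothetical miniature tables. The two named counties reuse the `fips` and `confirmed_cases` values from the `us-counties.csv` preview in this notebook, where `fips` was inferred as a floating-point number (`1001.0`). The `Unknown` row with a missing `fips`, the `COUNTYFP` key and the mask values are assumptions for illustration. In the notebook the join would be a DataFrame `join(...)` followed by `count()` on each table.

```python
# Hypothetical miniature inputs. Only the Autauga and Baldwin values are taken
# from the preview; everything else is an illustrative assumption.
counties = [
    {"fips": 1001.0, "county": "Autauga", "confirmed_cases": 1601.0},
    {"fips": 1003.0, "county": "Baldwin", "confirmed_cases": 5086.0},
    {"fips": None, "county": "Unknown", "confirmed_cases": 10.0},
]
masks_groups = [
    {"COUNTYFP": 1001, "almost_always": 0.75},
    {"COUNTYFP": 1003, "almost_always": 0.70},
    {"COUNTYFP": 1005, "almost_always": 0.60},
]

# Index the mask table by county code for a simple inner join.
masks_by_code = {m["COUNTYFP"]: m for m in masks_groups}

mask_use = []
for c in counties:
    if c["fips"] is None:
        continue  # rows without a county code can never match
    code = int(c["fips"])  # 1001.0 -> 1001 so both keys have the same type
    if code in masks_by_code:
        mask_use.append({**c, **masks_by_code[code]})

# Work out which rows were lost on each side of the join.
matched_codes = {row["COUNTYFP"] for row in mask_use}
dropped_counties = [
    c["county"] for c in counties
    if c["fips"] is None or int(c["fips"]) not in masks_by_code
]
unused_mask_codes = sorted(set(masks_by_code) - matched_codes)

print("counties:", len(counties),
      "masks_groups:", len(masks_groups),
      "mask_use:", len(mask_use))
print("dropped counties:", dropped_counties)
print("unused mask codes:", unused_mask_codes)
```

An inner join silently drops rows from both sides, so comparing the three counts is the quickest way to notice missing keys or mismatched key types.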

In [0]:
# Keep data for only one state


In [0]:
# How would you visualize it? 


In [0]:
# Save mask_use as a Parquet file


Redo at least one exercise in SQL. (First you need to register the data as a table.)
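
As an illustration of what the SQL version of the per-state aggregation might look like, the sketch below runs a `GROUP BY` query against an in-memory SQLite database. SQLite is only a stand-in so that the query can be tried without a cluster; it is not what the notebook uses. In Databricks you would first register the DataFrame, for example with `counties.createOrReplaceTempView("counties")`, and then run a similar query in a `%sql` cell. The table holds three rows copied from the `us-counties.csv` preview; the result column aliases are made up for this example.

```python
import sqlite3

# In-memory SQLite database as a stand-in for a registered Spark table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE counties ("
    "county TEXT, state TEXT, confirmed_cases REAL, confirmed_deaths REAL)"
)

# Three rows copied from the us-counties.csv preview in this notebook.
conn.executemany(
    "INSERT INTO counties VALUES (?, ?, ?, ?)",
    [
        ("Autauga", "Alabama", 1601.0, 24.0),
        ("Baldwin", "Alabama", 5086.0, 46.0),
        ("Barbour", "Alabama", 668.0, 7.0),
    ],
)

# Aggregate per state and sort so that the state with most cases comes first.
query = """
SELECT state,
       SUM(confirmed_cases)  AS total_confirmed_cases,
       SUM(confirmed_deaths) AS total_confirmed_deaths,
       COUNT(*)              AS n_counties
FROM counties
GROUP BY state
ORDER BY total_confirmed_cases DESC
"""
result = conn.execute(query).fetchall()
conn.close()

for row in result:
    print(row)
```

The query text uses only standard `SELECT`, `SUM`, `COUNT`, `GROUP BY` and `ORDER BY`, so it should need little or no change when moved to Spark SQL.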

In [0]:
%sql
