# Introduction to Data Analysis for Aspiring Data Scientists
## Data Analysis with `pandas`

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) In this lesson you:<br>
 - Motivate why to use `pandas`
 - Introduce `pandas` and its history
 - Import the COVID-19 dataset
   * `pd.read_csv()`
 - Summarize the data
   * `head`, `tail`, `shape`
   * `sum`, `min`, `count`, `mean`, `std`
   * `describe`
 - Slice and munge data
   * Slicing, `loc`, `iloc`
   * `value_counts`
   * `drop`
   * `sort_values`
   * Filtering
 - Group data and perform aggregate functions
   * `groupby`
 - Work with missing data and duplicates
   * `isnull`
   * `unique`, `drop_duplicates`
   * `fillna`
 - Visualization
   * Histograms
   * Scatterplots
   * Lineplots
 
 Check out [this cheat sheet](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf) for help.  Also see [the `pandas` docs.](https://pandas.pydata.org/docs/)

### Motivate why to use `pandas`

Let's start big picture...<br><br>

* Humans are tool-using animals
* Computers are one of the most powerful tools we've created
* If you write code, you can unlock the full power of these tools

Ok, cool. But why `pandas`?<br><br>

* More and more, data is leading decision making
* Excel is great but what if...
  - You want to automate your analysis so it re-runs on new data each day?
  - You want to build a code base to share with your colleagues
  - You want more robust analyses to feed a business decision
  - You want to do machine learning
* `pandas` is one of the core libraries used by data analysts and data scientists in Python

Enter `pandas`...

### Introduce `pandas` and its history

`pandas` is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.

Highlights:

- Built in 2008, open sourced in 2009
- A fast and efficient **DataFrame object** for data manipulation with integrated indexing;
- Tools for **reading and writing data** between in-memory data structures and different formats: CSV and text files, Microsoft Excel, SQL databases, and the fast HDF5 format;
- Intelligent **data alignment** and integrated handling of **missing data**: gain automatic label-based alignment in computations and easily manipulate messy data into an orderly form;
- Flexible **reshaping and pivoting** of data sets;
- Intelligent label-based **slicing, fancy indexing, and subsetting** of large data sets;
- Columns can be inserted and deleted from data structures for **size mutability**;
- Aggregating or transforming data with a powerful **group by** engine allowing split-apply-combine operations on data sets;
- High performance **merging and joining** of data sets;
- Hierarchical axis **indexing** provides an intuitive way of working with high-dimensional data in a lower-dimensional data structure;
- **Time series**-functionality: date range generation and frequency conversion, moving window statistics, date shifting and lagging. Even create domain-specific time offsets and join time series without losing data;
- Highly **optimized** for performance, with critical code paths written in Cython or C.
- Python with pandas is in use in a wide variety of **academic and commercial domains**, including Finance, Neuroscience, Economics, Statistics, Advertising, Web Analytics, and more.


[Check out the book](https://www.amazon.com/gp/product/1491957662/ref=as_li_qf_asin_il_tl?ie=UTF8&tag=quantpytho-20&creative=9325&linkCode=as2&creativeASIN=1491957662&linkId=ea8de4253cce96046e8ab0383ac71b33)

### Import the COVID-19 dataset

Use `%sh ls` to search the folder structure

In [7]:
%sh ls /dbfs/databricks-datasets/COVID/

In [8]:
%sh ls /dbfs/databricks-datasets/COVID/CSSEGISandData/csse_covid_19_data/csse_covid_19_daily_reports

Use `%sh head` to see the first few lines of the CSV file

In [10]:
%sh head /dbfs/databricks-datasets/COVID/CSSEGISandData/csse_covid_19_data/csse_covid_19_daily_reports/04-11-2020.csv

Import `pandas`.  Alias it as `pd`

In [12]:
import pandas as pd

Read the CSV file.  This creates a `DataFrame`

In [14]:
pd.read_csv("/dbfs/databricks-datasets/COVID/CSSEGISandData/csse_covid_19_data/csse_covid_19_daily_reports/04-11-2020.csv")

Unnamed: 0,FIPS,Admin2,Province_State,Country_Region,Last_Update,Lat,Long_,Confirmed,Deaths,Recovered,Active,Combined_Key
0,45001.0,Abbeville,South Carolina,US,2020-04-11 22:45:33,34.223334,-82.461707,9,0,0,0,"Abbeville, South Carolina, US"
1,22001.0,Acadia,Louisiana,US,2020-04-11 22:45:33,30.295065,-92.414197,98,4,0,0,"Acadia, Louisiana, US"
2,51001.0,Accomack,Virginia,US,2020-04-11 22:45:33,37.767072,-75.632346,15,0,0,0,"Accomack, Virginia, US"
3,16001.0,Ada,Idaho,US,2020-04-11 22:45:33,43.452658,-116.241552,513,6,0,0,"Ada, Idaho, US"
4,19001.0,Adair,Iowa,US,2020-04-11 22:45:33,41.330756,-94.471059,1,0,0,0,"Adair, Iowa, US"
5,21001.0,Adair,Kentucky,US,2020-04-11 22:45:33,37.104598,-85.281297,6,0,0,0,"Adair, Kentucky, US"
6,29001.0,Adair,Missouri,US,2020-04-11 22:45:33,40.190586,-92.600782,11,0,0,0,"Adair, Missouri, US"
7,40001.0,Adair,Oklahoma,US,2020-04-11 22:45:33,35.884942,-94.658593,27,2,0,0,"Adair, Oklahoma, US"
8,8001.0,Adams,Colorado,US,2020-04-11 22:45:33,39.874321,-104.336258,543,23,0,0,"Adams, Colorado, US"
9,16003.0,Adams,Idaho,US,2020-04-11 22:45:33,44.893336,-116.454525,1,0,0,0,"Adams, Idaho, US"


Now let's combine the lines of code and save the `DataFrame` to a variable so we can reuse it

In [16]:
import pandas as pd

df = pd.read_csv("/dbfs/databricks-datasets/COVID/CSSEGISandData/csse_covid_19_data/csse_covid_19_daily_reports/04-11-2020.csv")

df

Unnamed: 0,FIPS,Admin2,Province_State,Country_Region,Last_Update,Lat,Long_,Confirmed,Deaths,Recovered,Active,Combined_Key
0,45001.0,Abbeville,South Carolina,US,2020-04-11 22:45:33,34.223334,-82.461707,9,0,0,0,"Abbeville, South Carolina, US"
1,22001.0,Acadia,Louisiana,US,2020-04-11 22:45:33,30.295065,-92.414197,98,4,0,0,"Acadia, Louisiana, US"
2,51001.0,Accomack,Virginia,US,2020-04-11 22:45:33,37.767072,-75.632346,15,0,0,0,"Accomack, Virginia, US"
3,16001.0,Ada,Idaho,US,2020-04-11 22:45:33,43.452658,-116.241552,513,6,0,0,"Ada, Idaho, US"
4,19001.0,Adair,Iowa,US,2020-04-11 22:45:33,41.330756,-94.471059,1,0,0,0,"Adair, Iowa, US"
5,21001.0,Adair,Kentucky,US,2020-04-11 22:45:33,37.104598,-85.281297,6,0,0,0,"Adair, Kentucky, US"
6,29001.0,Adair,Missouri,US,2020-04-11 22:45:33,40.190586,-92.600782,11,0,0,0,"Adair, Missouri, US"
7,40001.0,Adair,Oklahoma,US,2020-04-11 22:45:33,35.884942,-94.658593,27,2,0,0,"Adair, Oklahoma, US"
8,8001.0,Adams,Colorado,US,2020-04-11 22:45:33,39.874321,-104.336258,543,23,0,0,"Adams, Colorado, US"
9,16003.0,Adams,Idaho,US,2020-04-11 22:45:33,44.893336,-116.454525,1,0,0,0,"Adams, Idaho, US"
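
The `/dbfs/...` path only exists inside a Databricks workspace. As a minimal sketch for trying `pd.read_csv()` elsewhere, the snippet below reads a few rows copied from the output above out of an in-memory string. The names `sample_csv` and `sample_df` are illustrative and not part of the lesson's dataset.

```python
import io

import pandas as pd

# A small subset of rows and columns copied from the daily report shown above
sample_csv = """FIPS,Admin2,Province_State,Country_Region,Confirmed,Deaths
45001,Abbeville,South Carolina,US,9,0
22001,Acadia,Louisiana,US,98,4
16001,Ada,Idaho,US,513,6
"""

# read_csv accepts a file path or any file-like object
sample_df = pd.read_csv(io.StringIO(sample_csv))

print(sample_df.shape)   # (rows, columns)
print(sample_df.dtypes)  # column types inferred from the text
```

Swapping the `io.StringIO(...)` argument for a path to a local CSV file should behave the same way.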


### Summarize the data

First, let's talk tab completion

In [19]:
# df. # Uncomment this and press 'tab' with your cursor after the "."

Need help?

In [21]:
help(df.head) # Pass the method itself; df.head() would ask for help on the returned DataFrame

Take a peek at the first and last few rows of the data

In [23]:
df.head()

Unnamed: 0,FIPS,Admin2,Province_State,Country_Region,Last_Update,Lat,Long_,Confirmed,Deaths,Recovered,Active,Combined_Key
0,45001.0,Abbeville,South Carolina,US,2020-04-11 22:45:33,34.223334,-82.461707,9,0,0,0,"Abbeville, South Carolina, US"
1,22001.0,Acadia,Louisiana,US,2020-04-11 22:45:33,30.295065,-92.414197,98,4,0,0,"Acadia, Louisiana, US"
2,51001.0,Accomack,Virginia,US,2020-04-11 22:45:33,37.767072,-75.632346,15,0,0,0,"Accomack, Virginia, US"
3,16001.0,Ada,Idaho,US,2020-04-11 22:45:33,43.452658,-116.241552,513,6,0,0,"Ada, Idaho, US"
4,19001.0,Adair,Iowa,US,2020-04-11 22:45:33,41.330756,-94.471059,1,0,0,0,"Adair, Iowa, US"


In [24]:
df.tail(2)

Unnamed: 0,FIPS,Admin2,Province_State,Country_Region,Last_Update,Lat,Long_,Confirmed,Deaths,Recovered,Active,Combined_Key
2964,,,,Zimbabwe,2020-04-11 22:45:13,-19.015438,29.154857,14,3,0,11,Zimbabwe
2965,,unassigned,Utah,US,2020-04-11 22:45:33,,,0,5,0,0,"unassigned, Utah, US"


How many records are in the dataset?

In [26]:
df.shape
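
As a small illustration of what `shape` returns, here is a sketch on a toy frame (`toy_df` is a hypothetical stand-in for `df`, built from a few values seen above). `shape` is an attribute holding a `(rows, columns)` tuple, so it is written without parentheses.

```python
import pandas as pd

# Hypothetical toy frame: 3 rows, 2 columns
toy_df = pd.DataFrame({
    "Country_Region": ["US", "US", "US"],
    "Confirmed": [9, 98, 513],
})

n_rows, n_cols = toy_df.shape  # unpack the (rows, columns) tuple
print(n_rows, n_cols)
print(len(toy_df))  # len() gives the row count only
```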

Summarize the data

In [28]:
# df.sum()
# df.min()
# df.max()
# df.count()
df.mean(numeric_only=True) # numeric_only skips text columns, which recent pandas versions refuse to average
# df.std()
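
The summary methods work on a whole `DataFrame` (one result per column) or on a single column (one number). A minimal sketch on a toy frame, where `toy_df` is a hypothetical stand-in for `df`:

```python
import pandas as pd

toy_df = pd.DataFrame({
    "Country_Region": ["US", "US", "US"],
    "Confirmed": [9, 98, 513],
    "Deaths": [0, 4, 6],
})

# Restrict the statistic to numeric columns so the text column is skipped
means = toy_df.mean(numeric_only=True)
print(means)

# A single column is a Series, so its statistic is one number
print(toy_df["Confirmed"].max())
print(toy_df["Deaths"].sum())
```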

These summary stats are aggregated for you...

In [30]:
df.describe()

Unnamed: 0,FIPS,Lat,Long_,Confirmed,Deaths,Recovered,Active
count,2686.0,2907.0,2907.0,2966.0,2966.0,2966.0,2966.0
mean,31172.744602,36.581616,-80.630198,597.273769,36.582266,135.573163,259.664531
std,17195.765629,9.996576,40.637602,6258.790749,593.397411,2303.045843,3663.293511
min,66.0,-51.7963,-159.856183,0.0,0.0,0.0,0.0
25%,18085.5,33.730899,-95.619116,4.0,0.0,0.0,0.0
50%,29144.0,37.829989,-87.356027,13.0,0.0,0.0,0.0
75%,46066.5,41.474923,-80.789848,64.0,2.0,0.0,0.0
max,99999.0,71.7069,178.065,163027.0,19468.0,64264.0,100269.0


### Slice and munge data

Grab just the confirmed cases

In [33]:
df['Confirmed']

Grab the country and confirmed cases.

In [35]:
df.columns

In [36]:
df[['Country_Region', 'Confirmed']]

Unnamed: 0,Country_Region,Confirmed
0,US,9
1,US,98
2,US,15
3,US,513
4,US,1
5,US,6
6,US,11
7,US,27
8,US,543
9,US,1


Create a new column `Date`

In [38]:
import datetime

df["Date"] = datetime.date(2020, 4, 11)

In [39]:
df["Date"].head()

Slice the DataFrame to get the first 10 rows. `loc` slices by label and includes the end label, so `:9` returns rows 0 through 9

In [41]:
df.loc[:9, ['Country_Region', 'Confirmed']]
# df.loc[0:9, ['Country_Region', 'Confirmed']] # Same thing

Unnamed: 0,Country_Region,Confirmed
0,US,9
1,US,98
2,US,15
3,US,513
4,US,1
5,US,6
6,US,11
7,US,27
8,US,543
9,US,1


Return just the value in the first row and first column

In [43]:
df.iloc[0, 0]
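
The difference between `loc` and `iloc` is easy to miss when the index happens to be 0, 1, 2, and so on. The sketch below uses a toy frame (`toy_df`, hypothetical) whose labels deliberately differ from its positions: `loc` selects by label and includes the end of a slice, while `iloc` selects by integer position and excludes it.

```python
import pandas as pd

toy_df = pd.DataFrame(
    {
        "Country_Region": ["US", "US", "Italy", "Spain"],
        "Confirmed": [9, 98, 152271, 163027],
    },
    index=[10, 11, 12, 13],  # labels deliberately differ from positions
)

by_label = toy_df.loc[10:12, ["Confirmed"]]  # labels 10, 11 and 12 (end included)
by_position = toy_df.iloc[0:2, [1]]          # positions 0 and 1 (end excluded)

print(len(by_label))      # 3
print(len(by_position))   # 2
print(toy_df.iloc[0, 0])  # value in the first row, first column
```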

How many regions do we have per country?

In [45]:
df["Country_Region"].value_counts()
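
`value_counts` counts how often each distinct value appears and sorts the result from most to least frequent. A minimal sketch on a hypothetical Series named `regions`:

```python
import pandas as pd

regions = pd.Series(
    ["US", "US", "Italy", "US", "Spain", "Italy"],
    name="Country_Region",
)

counts = regions.value_counts()  # most frequent value first
print(counts)

shares = regions.value_counts(normalize=True)  # fractions instead of counts
print(shares["US"])
```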

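What's FIPS? It is a US county identifier (a Federal Information Processing Standards code). We won't use it in this lesson, so drop the column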

In [47]:
df = df.drop("FIPS", axis=1)

Sort by confirmed cases

In [49]:
df.sort_values("Confirmed", ascending=False)

Unnamed: 0,Admin2,Province_State,Country_Region,Last_Update,Lat,Long_,Confirmed,Deaths,Recovered,Active,Combined_Key,Date
2937,,,Spain,2020-04-11 22:45:13,40.463667,-3.749220,163027,16606,59109,87312,Spain,2020-04-11
2865,,,Italy,2020-04-11 22:45:13,41.871940,12.567380,152271,19468,32534,100269,Italy,2020-04-11
2842,,,France,2020-04-11 22:45:13,46.227600,2.213700,129654,13832,26391,89431,",,France",2020-04-11
2846,,,Germany,2020-04-11 22:45:13,51.165691,10.451526,124908,2736,57400,64772,Germany,2020-04-11
1734,New York City,New York,US,2020-04-11 22:45:33,40.767273,-73.971526,98308,6367,0,0,"New York City, New York, US",2020-04-11
2955,,,United Kingdom,2020-04-11 22:45:13,55.378100,-3.436000,78991,9875,344,68772,United Kingdom,2020-04-11
2861,,,Iran,2020-04-11 22:45:13,32.427908,53.688046,70029,4357,41947,23725,Iran,2020-04-11
2731,,Hubei,China,2020-04-11 01:38:00,30.975600,112.270700,67803,3219,64264,320,"Hubei, China",2020-04-11
2951,,,Turkey,2020-04-11 22:45:13,38.963700,35.243300,52167,1101,2965,48101,Turkey,2020-04-11
2799,,,Belgium,2020-04-11 22:45:13,50.833300,4.469936,28018,3346,5986,18686,Belgium,2020-04-11


Let's just look at what's going on in the US

In [51]:
df[df["Country_Region"] == "US"]

Unnamed: 0,Admin2,Province_State,Country_Region,Last_Update,Lat,Long_,Confirmed,Deaths,Recovered,Active,Combined_Key,Date
0,Abbeville,South Carolina,US,2020-04-11 22:45:33,34.223334,-82.461707,9,0,0,0,"Abbeville, South Carolina, US",2020-04-11
1,Acadia,Louisiana,US,2020-04-11 22:45:33,30.295065,-92.414197,98,4,0,0,"Acadia, Louisiana, US",2020-04-11
2,Accomack,Virginia,US,2020-04-11 22:45:33,37.767072,-75.632346,15,0,0,0,"Accomack, Virginia, US",2020-04-11
3,Ada,Idaho,US,2020-04-11 22:45:33,43.452658,-116.241552,513,6,0,0,"Ada, Idaho, US",2020-04-11
4,Adair,Iowa,US,2020-04-11 22:45:33,41.330756,-94.471059,1,0,0,0,"Adair, Iowa, US",2020-04-11
5,Adair,Kentucky,US,2020-04-11 22:45:33,37.104598,-85.281297,6,0,0,0,"Adair, Kentucky, US",2020-04-11
6,Adair,Missouri,US,2020-04-11 22:45:33,40.190586,-92.600782,11,0,0,0,"Adair, Missouri, US",2020-04-11
7,Adair,Oklahoma,US,2020-04-11 22:45:33,35.884942,-94.658593,27,2,0,0,"Adair, Oklahoma, US",2020-04-11
8,Adams,Colorado,US,2020-04-11 22:45:33,39.874321,-104.336258,543,23,0,0,"Adams, Colorado, US",2020-04-11
9,Adams,Idaho,US,2020-04-11 22:45:33,44.893336,-116.454525,1,0,0,0,"Adams, Idaho, US",2020-04-11


Now let's look at what's going on in my county

In [53]:
df[(df["Country_Region"] == "US") & (df["Province_State"] == "California") & (df["Admin2"] == "San Francisco")]

Unnamed: 0,Admin2,Province_State,Country_Region,Last_Update,Lat,Long_,Confirmed,Deaths,Recovered,Active,Combined_Key,Date
2108,San Francisco,California,US,2020-04-11 22:45:33,37.752151,-122.438567,857,13,0,0,"San Francisco, California, US",2020-04-11
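
Filtering works by building a boolean mask: each comparison returns one `True`/`False` per row, and indexing with the mask keeps the `True` rows. The sketch below spells this out on a toy frame (`toy_df`, with rows taken from the output above); the variable names are illustrative.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "Country_Region": ["US", "US", "US", "Italy"],
    "Province_State": ["Idaho", "Idaho", "Iowa", None],
    "Admin2": ["Ada", "Adams", "Adair", None],
    "Confirmed": [513, 1, 1, 152271],
})

# Each comparison yields a boolean Series with one value per row
is_us = toy_df["Country_Region"] == "US"
is_idaho = toy_df["Province_State"] == "Idaho"

# & is element-wise AND, | is element-wise OR, ~ negates.
# When comparisons are written inline, wrap each one in parentheses,
# because & binds more tightly than ==.
idaho_df = toy_df[is_us & is_idaho]
not_us_df = toy_df[~is_us]

print(idaho_df)
print(not_us_df)
```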


### Group data and perform aggregate functions

Which country has the greatest number of confirmed cases?

In [56]:
df.groupby("Country_Region")

Group and sum the data. **Note that an aggregate function returns a single (scalar) value for each group.**

In [58]:
df.groupby("Country_Region")["Confirmed"].sum().sort_values(ascending=False)

Which US states have the most cases?

In [60]:
df[df['Country_Region'] == "US"].groupby("Province_State")["Confirmed"].sum().sort_values(ascending=False)
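
The pattern here is split-apply-combine: `groupby` splits the rows by key, the aggregate function is applied to each group, and the results are combined into one value per key. A minimal sketch on a toy frame (`toy_df`, hypothetical, with values taken from the outputs above):

```python
import pandas as pd

toy_df = pd.DataFrame({
    "Country_Region": ["US", "US", "US", "Italy", "Spain"],
    "Confirmed": [9, 98, 513, 152271, 163027],
    "Deaths": [0, 4, 6, 19468, 16606],
})

# One sum per country, largest first
confirmed_by_country = (
    toy_df.groupby("Country_Region")["Confirmed"]
    .sum()
    .sort_values(ascending=False)
)
print(confirmed_by_country)

# agg() applies several aggregate functions in one pass
summary = toy_df.groupby("Country_Region")["Confirmed"].agg(["sum", "count", "max"])
print(summary)
```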

### Work with missing data and duplicates

Do we have null values?

In [63]:
df.isnull().tail()

Unnamed: 0,Admin2,Province_State,Country_Region,Last_Update,Lat,Long_,Confirmed,Deaths,Recovered,Active,Combined_Key,Date
2961,True,True,False,False,False,False,False,False,False,False,False,False
2962,True,True,False,False,False,False,False,False,False,False,False,False
2963,True,True,False,False,False,False,False,False,False,False,False,False
2964,True,True,False,False,False,False,False,False,False,False,False,False
2965,False,False,False,False,True,True,False,False,False,False,False,False


In [64]:
df.isnull().sum()
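
`isnull()` returns a frame of booleans, and summing it counts the `True` values in each column. The sketch below shows this on a toy frame (`toy_df`, built from the last rows shown above). It also uses `dropna`, which this lesson does not cover, as one possible way to discard rows with a missing value.

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame({
    "Admin2": [None, None, "unassigned"],
    "Country_Region": ["Zambia", "Zimbabwe", "US"],
    "Lat": [-13.1339, -19.0154, np.nan],
})

null_mask = toy_df.isnull()         # True wherever a value is missing
nulls_per_column = null_mask.sum()  # True counts as 1 when summed
print(nulls_per_column)

# Keep only the rows that have a latitude
rows_with_lat = toy_df.dropna(subset=["Lat"])
print(rows_with_lat)
```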

How many unique countries?

In [66]:
df['Country_Region'].unique().shape

Another way to list the unique countries (this returns the values themselves rather than their count).

In [68]:
df['Country_Region'].drop_duplicates()
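
To make the difference between these calls concrete, here is a small sketch on a hypothetical Series named `regions`. `unique()` returns an array of distinct values, `drop_duplicates()` returns a Series that keeps the original index labels, and `nunique()` (not used above) gives the count directly.

```python
import pandas as pd

regions = pd.Series(
    ["US", "US", "Italy", "US", "Spain", "Italy"],
    name="Country_Region",
)

unique_values = regions.unique()          # distinct values, in order of appearance
deduplicated = regions.drop_duplicates()  # Series keeping the first occurrence of each value
n_unique = regions.nunique()              # number of distinct values

print(unique_values)
print(deduplicated)
print(n_unique)
```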

Fill the null values with a placeholder

In [69]:
df.fillna("NO DATA AVAILABLE").tail(3)

Unnamed: 0,Admin2,Province_State,Country_Region,Last_Update,Lat,Long_,Confirmed,Deaths,Recovered,Active,Combined_Key,Date
2963,NO DATA AVAILABLE,NO DATA AVAILABLE,Zambia,2020-04-11 22:45:13,-13.1339,27.8493,40,2,28,10,Zambia,2020-04-11
2964,NO DATA AVAILABLE,NO DATA AVAILABLE,Zimbabwe,2020-04-11 22:45:13,-19.0154,29.1549,14,3,0,11,Zimbabwe,2020-04-11
2965,unassigned,Utah,US,2020-04-11 22:45:33,NO DATA AVAILABLE,NO DATA AVAILABLE,0,5,0,0,"unassigned, Utah, US",2020-04-11
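
Filling every column with the same string also puts text into numeric columns such as `Lat`, which generally stops them from being treated as numbers. One alternative, sketched below on a toy frame (`toy_df`, hypothetical), is to pass `fillna` a dict so that only the named columns are filled.

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame({
    "Admin2": [None, None, "unassigned"],
    "Lat": [-13.1339, -19.0154, np.nan],
})

# Fill only the text column; the missing latitude stays NaN and Lat stays numeric
filled_df = toy_df.fillna({"Admin2": "NO DATA AVAILABLE"})

print(filled_df)
print(filled_df.dtypes)
```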


### Visualization
   * Histograms
   * Scatterplots
   * Lineplots

In [71]:
import matplotlib.pyplot as plt

%matplotlib inline

In [72]:
us_subset_df = df[df["Country_Region"] == "US"]

What is the _distribution_ of deaths by US states and territories?

In [74]:
us_subset_df.groupby("Province_State")["Deaths"].sum().hist()

In [75]:
us_subset_df.groupby("Province_State")["Deaths"].sum().hist(bins=30)

How do confirmed cases compare to deaths?

In [77]:
us_subset_df.plot.scatter(x="Confirmed", y="Deaths")

In [78]:
us_subset_df[us_subset_df["Deaths"] < 1000].plot.scatter(x="Confirmed", y="Deaths")

Import the data for all available days

In [80]:
import glob

path = "/dbfs/databricks-datasets/COVID/CSSEGISandData/csse_covid_19_data/csse_covid_19_daily_reports"
all_files = glob.glob(path + "/*.csv")

dfs = []

for filename in all_files:
  temp_df = pd.read_csv(filename)
  temp_df.columns = [c.replace("/", "_") for c in temp_df.columns]
  temp_df.columns = [c.replace(" ", "_") for c in temp_df.columns]
  
  month, day, year = filename.split("/")[-1].replace(".csv", "").split("-")
  d = datetime.date(int(year), int(month), int(day))
  temp_df["Date"] = d

  dfs.append(temp_df)
  
all_days_df = pd.concat(dfs, axis=0, ignore_index=True, sort=False)
all_days_df = all_days_df.drop(["Latitude", "Longitude", "Lat", "Long_", "FIPS", "Combined_Key", "Last_Update"], axis=1)
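
The loop above does three things: normalizes column names, derives a date from each file name, and stacks the frames with `pd.concat`. The sketch below reproduces those steps on two tiny in-memory frames so it runs without the dataset. The frames, the `reports/...` paths, and the helper `date_from_filename` are all hypothetical, and the older `Province/State` column naming is an assumption inferred from the `replace("/", "_")` call.

```python
import datetime

import pandas as pd

# Two hypothetical daily reports whose column names follow different schemes
early_df = pd.DataFrame({
    "Province/State": ["Anhui"],
    "Country/Region": ["Mainland China"],
    "Confirmed": [1.0],
})
late_df = pd.DataFrame({
    "Admin2": ["Ada"],
    "Province_State": ["Idaho"],
    "Country_Region": ["US"],
    "Confirmed": [513],
})


def date_from_filename(filename):
    """Parse a date from a path ending in MM-DD-YYYY.csv."""
    stem = filename.split("/")[-1].replace(".csv", "")
    return datetime.datetime.strptime(stem, "%m-%d-%Y").date()


frames = []
for filename, frame in [("reports/01-22-2020.csv", early_df),
                        ("reports/04-11-2020.csv", late_df)]:
    frame = frame.copy()
    # Make the column names consistent across files
    frame.columns = [c.replace("/", "_").replace(" ", "_") for c in frame.columns]
    frame["Date"] = date_from_filename(filename)
    frames.append(frame)

# Columns missing from one frame are filled with NaN in the combined result
combined_df = pd.concat(frames, axis=0, ignore_index=True, sort=False)
print(combined_df)
```

Because the first frame has no `Admin2` column, its row gets `NaN` there, which is why the early rows of `all_days_df` show empty `Admin2` and `Active` values.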

In [81]:
all_days_df.head()

Unnamed: 0,Province_State,Country_Region,Confirmed,Deaths,Recovered,Date,Admin2,Active
0,Anhui,Mainland China,1.0,,,2020-01-22,,
1,Beijing,Mainland China,14.0,,,2020-01-22,,
2,Chongqing,Mainland China,6.0,,,2020-01-22,,
3,Fujian,Mainland China,1.0,,,2020-01-22,,
4,Gansu,Mainland China,,,,2020-01-22,,


How has the disease spread over time?

In [83]:
all_days_df.groupby("Date")["Confirmed"].sum().plot(title="Confirmed Cases over Time", rot=45)

Break this down by types of cases

In [85]:
all_days_df.groupby("Date")[["Confirmed", "Deaths", "Recovered"]].sum().plot(title="Confirmed, Deaths, Recovered over Time", rot=45)

What is the growth in my county?

In [87]:
(all_days_df[(all_days_df["Country_Region"] == "US") & (all_days_df["Province_State"] == "California") & (all_days_df["Admin2"] == "San Francisco")]
  .groupby("Date")[["Confirmed", "Deaths", "Recovered"]]
  .sum()
  .plot(title="Confirmed, Deaths, Recovered over Time", rot=45))

Wrap this up in a function and run it yourself!

In [89]:
def plotMyCountry(Country_Region, Province_State, Admin2):
  (all_days_df[(all_days_df["Country_Region"] == Country_Region) & (all_days_df["Province_State"] == Province_State) & (all_days_df["Admin2"] == Admin2)]
    .groupby("Date")[["Confirmed", "Deaths", "Recovered"]]
    .sum()
    .plot(title="Confirmed, Deaths, Recovered over Time", rot=45))
  
plotMyCountry("US", "New York", "New York City")
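
One possible variation, offered as a sketch rather than the lesson's own approach, is to separate the filtering and aggregation from the plotting so the numbers can be inspected before they are drawn. The function `county_totals` and the frame `toy_days_df` are hypothetical, and the 2020-04-10 values are made up purely to exercise the function.

```python
import datetime

import pandas as pd


def county_totals(frame, country_region, province_state, admin2):
    """Return per-date sums of the case columns for one county."""
    mask = (
        (frame["Country_Region"] == country_region)
        & (frame["Province_State"] == province_state)
        & (frame["Admin2"] == admin2)
    )
    return frame[mask].groupby("Date")[["Confirmed", "Deaths"]].sum()


# Toy rows; the 2020-04-10 values are invented for illustration
toy_days_df = pd.DataFrame({
    "Country_Region": ["US", "US", "US"],
    "Province_State": ["California", "California", "Idaho"],
    "Admin2": ["San Francisco", "San Francisco", "Ada"],
    "Date": [
        datetime.date(2020, 4, 10),
        datetime.date(2020, 4, 11),
        datetime.date(2020, 4, 11),
    ],
    "Confirmed": [800, 857, 513],
    "Deaths": [12, 13, 6],
})

sf_totals = county_totals(toy_days_df, "US", "California", "San Francisco")
print(sf_totals)
# sf_totals.plot(title="Confirmed and Deaths over Time", rot=45)  # requires matplotlib
```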