<h1 style="text-align:center">CS 212 - Introduction to Programming for Analysts - Spring 2021</h1>
<h1 style="text-align:center">Pandas Case Study - COVID-19 Data Visualization - 35 Points</h1>

# Objectives
Upon completion of this programming exercise, students will:
- Import datasets
- Manipulate data into a 'tidy' format
- Create useful data visualizations

# Description
In this case study, you will provide timely, useful feedback to global leaders regarding the spread of COVID-19. Every country's leadership is trying to decide national policy on quarantine, social distancing, wearing face masks, and potential national shutdown. Utilizing daily updated timeseries data from Johns Hopkins Center for Systems Science and Engineering's GitHub [site](https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_time_series), you will create useful visualizations for the number of confirmed COVID-19 cases and deaths similar to this [study](https://91-divoc.com/pages/covid-visualization/).

# 1. Setup

In [84]:
# Import packages for data manipulation and visualization
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

### Downloading Data
After looking at the [study](https://91-divoc.com/pages/covid-visualization/) above, I found that the data comes from the Johns Hopkins Center for Systems Science and Engineering (CSSE) Geographic Information System (GIS) and Data GitHub page. Below are two URLs for the raw data hosted in the Johns Hopkins GitHub repository (repo). You will use these URLs to read in the data using `pandas.read_csv`.
1. `https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv`
2. `https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv`
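
As a minimal sketch of how `pandas.read_csv` behaves, the example below reads a small in-memory CSV instead of the URLs above. The two sample rows and the name `sample` are made up for illustration; they only imitate the wide layout of the Johns Hopkins files.

```python
import io

import pandas as pd

# Hypothetical two-row sample imitating the wide layout of the JHU files
sample_csv = io.StringIO(
    "Province/State,Country/Region,Lat,Long,1/22/20,1/23/20\n"
    ",Afghanistan,33.9,67.7,0,0\n"
    "Queensland,Australia,-27.5,153.0,0,1\n"
)

# read_csv accepts a file path, a URL string, or a file-like object
sample = pd.read_csv(sample_csv)

# An empty field (Afghanistan has no province) is read as a missing value
print(sample.head())
```

Passing one of the URL strings above in place of `sample_csv` should work the same way, provided the machine has network access.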

### Load and Inspect Data
**Q1.1 (2 Points)** [Read](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) the two COVID-19 global .csv files using the URLs above into DataFrames named `cases` and `deaths`, respectively. Additionally, read the `population.csv` file into a DataFrame named `population`. Remember, the file must be in the same directory as this Jupyter Notebook or you must specify the entire file path. Inspect the first five rows of `cases`.

In [85]:
# read both URLs


Next, explore the `cases`, `deaths` and `population` DataFrames to understand the types of data and some summary statistics for the columns.

In [86]:
# explore cases (.info() and .describe())


In [87]:
# explore deaths


In [88]:
# explore population


### This data is UNTIDY! 
It appears that there is a lot of missing data in the first column of the `cases` and `deaths` DataFrames and that the data is in wide form. My first thought went to the Reshaping lesson: we should take all of the date columns and melt them into one column called 'date'.

# 2. Manipulating our Data Into Tidy Data

Now that we have the data loaded, my Logic Design for manipulating the data to contain exactly what we want for visualization is:
1. Change the name of the Country/Region column to 'country'
2. Remove the province/state column and lat and long columns and assign to new data frames `country_cases` and `country_deaths`
3. Group by country
4. Join the 'population' column to `country_cases` and `country_deaths`
5. Melt the data to have the columns country, population, date, cases
6. Convert the new 'date' column type to `datetime`
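
The six steps above can be sketched on a toy dataset. Everything in this example is hypothetical (the countries, the numbers, and the names `wide`, `pop`, and `tidy`), and it chains the steps in one expression purely for illustration; the exercises below work through them one at a time on the real data.

```python
import pandas as pd

# Hypothetical wide-form sample: one row per province, one column per date
wide = pd.DataFrame({
    "Province/State": ["A1", "A2", None],
    "Country/Region": ["Aland", "Aland", "Borduria"],
    "Lat": [1.0, 2.0, 3.0],
    "Long": [4.0, 5.0, 6.0],
    "1/22/20": [1, 2, 5],
    "1/23/20": [3, 4, 6],
})

# Hypothetical population lookup, indexed by country
pop = pd.Series({"Aland": 1000, "Borduria": 500}, name="population")
pop = pop.rename_axis("country")

tidy = (
    wide.rename(columns={"Country/Region": "country"})    # step 1
        .drop(columns=["Province/State", "Lat", "Long"])  # step 2
        .groupby("country").sum()                         # step 3
        .join(pop)                                        # step 4
        .reset_index()                                    # keep country as a column
        .melt(id_vars=["country", "population"],
              var_name="date", value_name="cases")        # step 5
)

# step 6: parse the month/day/two-digit-year labels into datetimes
tidy["date"] = pd.to_datetime(tidy["date"], format="%m/%d/%y")

print(tidy)
```

With 2 countries and 2 dates, the tidy result has 2 * 2 = 4 rows and 4 columns, which is the same shape check used later in this notebook.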

**Q2.1 (2 Points)** Step 1 of my Data Manipulation Logic Design - Change the name of the Country/Region column to 'country' and reassign to the DataFrames `cases` and `deaths`.

In [89]:
# change column name


**Q2.2 (2 Points)** Step 2 of my Data Manipulation Logic Design - Remove the province/state column and the lat and long columns and assign the result to new DataFrames `country_cases` and `country_deaths`. Sometimes we don't want to permanently delete data, so we are assigning the result to new DataFrames. However, in case of a mistake, you can always quickly rerun all cells above this one by clicking Cell, then 'Run All Above'.

In [90]:
# remove province/state, lat and long columns


I always like to double check that my group by is working correctly, so I am going to output the sum of the number of cases in Australia on 3/28/21 and then check that same data point after grouping by country. This number should be 29276.

In [91]:
# output sum of Australia on 3/28/2021 to baseline the sum aggregation by country (should be 29276)
# country_cases['3/28/21'][country_cases.country == 'Australia'].sum()

**Q2.3 (2 Points)** Step 3 of my Data Manipulation Logic Design - Group by country (sum all of the cases/deaths/population in each country's provinces) and assign to the DataFrames `country_cases`, `country_deaths`, and `country_population`. The first 5 lines of your groupby result should look similar to this: <img src="groupby_country_cases.PNG">

In [92]:
# groupby country for cases, deaths and population


Like I mentioned before, I like to double check that my groupby worked correctly. First, I am going to make sure that the length of our new DataFrame `country_cases` is the same as the number of unique countries in our original DataFrame `cases`.

In [93]:
# Check to see if the number of countries in our groupby object is the same as the number of unique countries in our raw data
# len(country_cases) == cases.country.nunique()

Second, I want to check whether there are any countries in the `country_population` dataset that are **not** in `country_cases`. Run the next cell; your output should be a blank DataFrame. Then uncomment (ctrl + /) the second line of code and rerun the cell; the result should also be a blank DataFrame.

In [94]:
# countries from population.csv not in cases
# country_population[~country_population.index.isin(country_cases.index)]

# countries in cases not in population.csv
# country_cases[~country_cases.index.isin(country_population.index)]

Lastly, I want to make sure that the number of cases in Australia on 3/28/21 in our new DataFrame `country_cases` matches the sum of all the provinces in Australia we calculated above (29276).

In [95]:
# output Australia on 3/28/2021 from grouped country cases
# country_cases['3/28/21'].loc['Australia']

**Q2.4 (2 Points)** Step 4 of my Data Manipulation Logic Design - Join the `population` column from the `country_population` DataFrame to the two DataFrames grouped by country (`country_cases` and `country_deaths`).

In [96]:
# join population series to both cases and deaths DataFrames


We should verify that the new `population` column in our `country_cases` DataFrame matches the `population` column in `population.csv`.

In [97]:
# country_cases['population'].head()

**Q2.5 (2 Points)** Step 5 of my Data Manipulation Logic Design - Reset the index of `country_cases` and melt it, keeping the columns `country` and `population` as identifiers. Name the new column that holds the former column labels `date`, and name the new column that holds the values from the multiple date columns `cases`. Assign this new, melted DataFrame to `cases_tidy`. Repeat this for `country_deaths` (the value column should be named `deaths`, **not** `cases`) and assign the result to `deaths_tidy`. It is important to reset the index of `country_cases` and `country_deaths` before melting because we want to preserve the country information.

In [98]:
# melt to make tidy data


Verify that the melt worked properly by checking the shape of the two new DataFrames. The `cases_tidy` and `deaths_tidy` DataFrames should contain 4 columns, and common sense says the number of rows = (# of days) * (# of countries). Uncomment (ctrl + /) the second line and verify the results are the same. Your `cases_tidy` DataFrame should look similar to the DataFrame below. <img src="melt_cases.PNG">

In [99]:
# cases_tidy
# deaths_tidy

**Q2.6 (2 Points)** Step 6 of my Data Manipulation Logic Design - Change the type of object for the new `date` column in both the `cases_tidy` and `deaths_tidy` DataFrames to be a datetime object. HINT: Google is your friend!

In [100]:
# check the data types in each column (.dtypes)


In [101]:
# change date column datatype from object to datetime


In [102]:
# verify the data type was changed


## Filtering and Sorting our Tidy Data

**Q2.7 (6 Points)** In order for our plot to not be crowded, we should filter the top 10 countries. We will be filtering the top 10 multiple times, so write a function that filters the top 10 countries' rows by the highest number of confirmed cases (or deaths). Use your `filter_top_10` function to filter the `cases_tidy` DataFrame by the column `cases` with no restriction on the population size and assign the resulting DataFrame to `top_10_cases`. Repeat this to filter the top 10 countries in `deaths_tidy` by the column `deaths` and assign the resulting DataFrame to `top_10_deaths`. 
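
The core idea (find the group names with the largest peak value, then keep only their rows) can be sketched on toy data. This is only an illustration under assumed names (`scores`, `top_names`, `top_rows`); it keeps the top 2 rather than the top 10 and does not handle the population threshold your function needs.

```python
import pandas as pd

# Hypothetical tidy sample: two observations for each of four countries
scores = pd.DataFrame({
    "country": ["A", "A", "B", "B", "C", "C", "D", "D"],
    "cases":   [1, 10, 2, 50, 3, 5, 4, 30],
})

# Peak value per country, largest first, then keep the first two names
top_names = (
    scores.groupby("country")["cases"].max()
          .sort_values(ascending=False)
          .head(2)
          .index
)

# Keep every row (all dates) belonging to those countries
top_rows = scores[scores["country"].isin(top_names)]
print(top_rows)
```

Here the peaks are A = 10, B = 50, C = 5, and D = 30, so the two names kept are B and D, and `top_rows` holds all four of their rows.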

In [103]:
def filter_top_10(df, col='cases', pop_threshold=0):
    '''
    filters a dataframe to contain only the top 10 countries that meet a population threshold for a given column (attribute)
    
    Parameters
    ----------
    df: DataFrame, the DataFrame to filter
    col: string, the column used in determining the top 10
    pop_threshold: integer, a minimum threshold each observation must satisfy to qualify to be in the top 10
    
    Returns
    -------
    DataFrame: DataFrame of the top 10 countries
    '''
    # find the names of the top 10 countries
        # filter to countries that meet population threshold
        # groupby country and take max of col column
        # sort in descending order
        # filter first 10

    # filter the DataFrame based on the top 10 names
    
    # return the top 10 DataFrame


In [104]:
# use your function for cases and deaths



Now that we have the top 10 countries, verify `top_10_cases` and its shape. You can do the same with `top_10_deaths`. Once again we expect the number of rows = (10 countries) * (# of dates). Your `top_10_cases` should look similar to this (the dates will be different and the countries might be a little different): <img src="top_10_cases_tail.PNG">

In [105]:
# verify top_10 data is ready to plot
# top_10_cases.tail(10)

# 3. Visualizing COVID-19 Top 10 Data
Create line plots of confirmed cases and deaths for the top 10 countries. You have multiple options here to plot data. You can use matplotlib directly, or you can choose to use one of the matplotlib wrappers we have used in the past ([Seaborn](https://seaborn.pydata.org/generated/seaborn.lineplot.html), [Pandas Plot](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.html)). Remember to use [good data visualization techniques](https://www.gooddata.com/blog/7-tips-good-data-visualizations)!

### Confirmed COVID-19 Cases

**Q3.1 (1 Point)** Plot the top 10 countries' confirmed COVID-19 cases.

In [106]:
# line plot of cases

# plt.show()

### Confirmed COVID-19 Related Deaths

**Q3.2 (1 Point)** Plot the top 10 countries' confirmed COVID-19 deaths.

In [107]:
# line plot of deaths

# plt.show()

# 4. Visualizing COVID-19 Data by Percentage of Population
Although these two plots say a lot about how COVID-19 is spreading, it may be interesting to look at the cases and deaths as a percentage of population to tell a different part of the story.

### Making Percentage of Population
**Q4.1 (2 Points)** In the `cases_tidy` DataFrame, make a new column called `cases_pct` for the percentage of cases by population. Repeat this by adding a `deaths_pct` column to the `deaths_tidy` DataFrame.

In [108]:
# new column for % of population (cases and deaths)


**Q4.2 (2 Points)** Filter the top 10 countries' rows by the highest percentage (by population) of confirmed cases in our DataFrame and assign it to `top_10_cases_pct`. Repeat this for the top 10 countries with the highest percentage (by population) of deaths and assign it to `top_10_deaths_pct`.

In [109]:
# use your function for cases_pct and deaths_pct



In [110]:
# verify top_10 table
top_10_cases_pct.tail(10)

It appears that very small countries (and a cruise ship!!) are taking over the top 10 list due to their extremely small population size. 

**Q4.2 (2 Points)** Filter our `cases_tidy` and `deaths_tidy` data again, but this time require a population of more than 100,000 people.

In [111]:
# use your function for cases_pct and deaths_pct, require population > 100000


### Confirmed COVID-19 Cases, Percentage of Population
**Q4.3 (2 Points)** Plot the top 10 countries' confirmed COVID-19 cases by percentage of population. Change the values on the Y-axis to make more sense (0%, 1%, 2%, etc.). Use [`plt.yticks()`](https://matplotlib.org/3.3.3/api/_as_gen/matplotlib.pyplot.yticks.html) to help format the Y-axis.
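
A minimal sketch of percent-style tick labels with `plt.yticks()` follows. It assumes the plotted values are stored as fractions of the population (0.02 meaning 2%); if your column already holds values on a 0 to 100 scale, the tick positions would need to change accordingly. The data and the names `fractions`, `tick_positions`, and `tick_labels` are made up for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np

# Assumption: values are fractions of population (0.02 means 2%)
fractions = [0.000, 0.004, 0.011, 0.019, 0.030]
plt.plot(range(len(fractions)), fractions)

# Place a tick at every whole percent and label it as a percentage
tick_positions = np.arange(0, 0.031, 0.01)
tick_labels = [f"{p:.0%}" for p in tick_positions]
plt.yticks(tick_positions, tick_labels)

plt.ylabel("Cases (% of population)")
# plt.show()
```

The first argument to `plt.yticks()` sets where the ticks go and the second sets the text drawn at each one, so the underlying data are not modified.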

In [112]:
# line plot of cases_pct

# plt.show()

### Confirmed COVID-19 Related Deaths, Percentage of Population
**Q4.4 (1 Point)** Plot the top 10 countries' confirmed COVID-19 deaths by percentage of population. Change the values on the Y-axis to make more sense (0.00%, 0.05%, 0.10%, etc.). Use [`plt.yticks()`](https://matplotlib.org/3.3.3/api/_as_gen/matplotlib.pyplot.yticks.html) to help format the Y-axis.

In [113]:
# line plot of deaths_pct

# plt.show()

# 5. Visualizing COVID-19 Data Log-Scale
We don't notice this too much with the graphs based on proportion of population, but in our first graphs containing the number of people it can be hard to compare the magnitude of difference between the top 10 countries because the United States has **such** a large number. We can barely see the bottom 7 countries! Although that linear scale shows the real human impact -- a growth twice the size is twice the number of real people infected -- a logarithmic scale might be better to illustrate the differences of the magnitude of growth between countries, but less of the human impact. 

When there are large differences in values on a chart, the graph might be uninterpretable, and a log scale can help us see all of the countries we are interested in.
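
To see why a log scale helps, here is a small sketch using `numpy.log` on made-up counts that each grow by a factor of 100. The names `counts` and `log_counts` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical counts spanning several orders of magnitude
counts = pd.Series([100, 10_000, 1_000_000], name="cases")

# Natural log: equal ratios in the raw data become equal distances
log_counts = np.log(counts)

# Both steps are a 100-fold increase, so both differences equal ln(100)
print(log_counts.diff())
```

One caveat worth keeping in mind: `np.log(0)` evaluates to negative infinity (with a runtime warning), so dates on which a country has zero recorded cases will not produce a finite log value.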

**Q5.1 (2 Points)** Create a new column in the `top_10_cases` DataFrame called `log_cases` and assign to it the natural log of the number of cases in `top_10_cases`. Similarly, create a new column in the `top_10_deaths` DataFrame called `log_deaths` that contains the natural log of the number of deaths in `top_10_deaths`.

In [114]:
# create new column for log cases and log deaths


**Q5.2 (1 Point)** Plot the top 10 countries' confirmed COVID-19 cases on a logarithmic scale.

In [115]:
# line plot of log_cases

# plt.show()

**Q5.3 (1 Point)** Plot the top 10 countries' confirmed COVID-19 deaths on a logarithmic scale.

In [116]:
# line plot of log_deaths

# plt.show()