# Introduction

This notebook draws a choropleth "heat map" of confirmed COVID cases per country, based on the COVID-19 dataset. The dataset
records case counts over time. <br>
It relies on a pre-made country map JSON to make the Folium overlay, and on a "better_names" CSV, which I wrote so that the
country-renaming rules wouldn't clutter the code.

To Do: <br>
Figure out why France displays as the wrong color.

# Init

Imports the required libraries:

In [9]:
import numpy as np
import pandas as pd
import matplotlib as mp
import folium as fol
import pycountry as pc
from urllib.request import urlopen
from json import load

Defines the GitHub URLs of the data files (for Colab). Uncomment the four load lines to read the files directly from GitHub.

In [10]:
COVID_DATA_URL = "https://raw.githubusercontent.com/WBArno/PDA_Project/master/Dat/covid_19_data.csv"
BETTER_NAMES_URL = "https://raw.githubusercontent.com/WBArno/PDA_Project/master/Dat/better_names.csv"
COUNTRIES_URL = "https://raw.githubusercontent.com/WBArno/PDA_Project/master/Dat/countries.json"
MERGED_COUNTRIES_URL = "https://raw.githubusercontent.com/WBArno/PDA_Project/master/Dat/merged_countries.json"

# df = pd.read_csv(COVID_DATA_URL)
# bn = pd.read_csv(BETTER_NAMES_URL)
# ct = load(urlopen(COUNTRIES_URL))
# mc = load(urlopen(MERGED_COUNTRIES_URL))

Loads the data files from relative paths (for local execution from a copy of the GitHub repository).

In [11]:
df = pd.read_csv("../Dat/covid_19_data.csv")
bn = pd.read_csv("../Dat/better_names.csv")
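with open("../Dat/countries.json") as f:
    ct = load(f) # Country polygons for the choropleth
with open("../Dat/merged_countries.json") as f:
    mc = load(f) # Country polygons for the tooltip overlay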
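The Colab and local variants above differ only in where the bytes come from. As a minimal sketch, a single helper could accept either a URL or a path; `load_json` is a hypothetical name that does not appear in the notebook.

```python
import json
from urllib.request import urlopen


def load_json(source):
    """Load JSON from an http(s) URL or from a local file path (hypothetical helper)."""
    if source.startswith(("http://", "https://")):
        # Remote file: urlopen returns a binary file-like object that json.load can read.
        with urlopen(source) as response:
            return json.load(response)
    # Local file: the with-block closes the handle after parsing.
    with open(source, encoding="utf-8") as handle:
        return json.load(handle)
```

With this in place, `ct = load_json(COUNTRIES_URL)` and `ct = load_json("../Dat/countries.json")` would go through the same code path.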

# Project Tasks

## Stage 1

### 1.3 - Loading a Dataframe

Loading was taken care of in "Init", so this section is commented out.

In [12]:
# import pandas as pd
# CSV_DATA = "https://raw.githubusercontent.com/WBArno/PDA_Project/master/Dat/covid_19_data.csv"
# df = pd.read_csv(CSV_DATA)

### 1.4 - Manipulation with Workflows

#### 1.4a - Datatypes

Displays the datatypes in the dataset.

In [13]:
print("Data Types:\n", df.dtypes)

Data Types:
 SNo                  int64
ObservationDate     object
Province/State      object
Country/Region      object
Last Update         object
Confirmed          float64
Deaths             float64
Recovered          float64
dtype: object
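The listing shows "ObservationDate" stored as `object`, meaning plain strings. The sketch below uses a hypothetical three-row sample (not taken from the dataset) to show why that matters: strings in MM/DD/YYYY form sort alphabetically, not chronologically, until they are parsed.

```python
import pandas as pd

# Hypothetical sample in the dataset's MM/DD/YYYY format.
sample = pd.DataFrame({"ObservationDate": ["12/31/2020", "01/01/2021", "01/22/2020"]})

# As strings, the dates sort character by character, which is not chronological.
as_text = sample["ObservationDate"].sort_values().tolist()

# Parsed with an explicit format, they sort by actual date.
parsed = pd.to_datetime(sample["ObservationDate"], format="%m/%d/%Y")
as_dates = parsed.sort_values().dt.strftime("%m/%d/%Y").tolist()

print(as_text)
print(as_dates)
```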


#### 1.4b - Top

Displays the top of the dataframe with .head()

In [14]:
print("Top of Set:\n", df.head())

Top of Set:
    SNo ObservationDate Province/State  Country/Region      Last Update  \
0    1      01/22/2020          Anhui  Mainland China  1/22/2020 17:00   
1    2      01/22/2020        Beijing  Mainland China  1/22/2020 17:00   
2    3      01/22/2020      Chongqing  Mainland China  1/22/2020 17:00   
3    4      01/22/2020         Fujian  Mainland China  1/22/2020 17:00   
4    5      01/22/2020          Gansu  Mainland China  1/22/2020 17:00   

   Confirmed  Deaths  Recovered  
0        1.0     0.0        0.0  
1       14.0     0.0        0.0  
2        6.0     0.0        0.0  
3        1.0     0.0        0.0  
4        0.0     0.0        0.0  


#### 1.4c - Summary

Creates an automatically-generated statistical summary of the dataframe.

In [15]:
print("Summary:\n", df.describe())

Summary:
                  SNo     Confirmed         Deaths     Recovered
count  236017.000000  2.360170e+05  236017.000000  2.360170e+05
mean   118009.000000  5.715800e+04    1487.719368  3.393027e+04
std     68132.383579  1.834751e+05    4770.414639  1.474800e+05
min         1.000000 -3.028440e+05    -178.000000 -8.544050e+05
25%     59005.000000  7.270000e+02       9.000000  1.000000e+01
50%    118009.000000  6.695000e+03     127.000000  1.224000e+03
75%    177013.000000  3.349900e+04     880.000000  1.263900e+04
max    236017.000000  3.664050e+06  108208.000000  6.399531e+06


## Stage 2

### 2.4 - Adding/Editing Columns

Combines "Province/State" and "Country/Region" into a new column ("Location") in the form "Province, Country".

In [16]:
p2 = df.copy() # Creates a new dataframe for this section (plain assignment would only alias df)
p2["Location"] = p2["Province/State"] + ", " + p2["Country/Region"]
p2

Unnamed: 0,SNo,ObservationDate,Province/State,Country/Region,Last Update,Confirmed,Deaths,Recovered,Location
0,1,01/22/2020,Anhui,Mainland China,1/22/2020 17:00,1.0,0.0,0.0,"Anhui, Mainland China"
1,2,01/22/2020,Beijing,Mainland China,1/22/2020 17:00,14.0,0.0,0.0,"Beijing, Mainland China"
2,3,01/22/2020,Chongqing,Mainland China,1/22/2020 17:00,6.0,0.0,0.0,"Chongqing, Mainland China"
3,4,01/22/2020,Fujian,Mainland China,1/22/2020 17:00,1.0,0.0,0.0,"Fujian, Mainland China"
4,5,01/22/2020,Gansu,Mainland China,1/22/2020 17:00,0.0,0.0,0.0,"Gansu, Mainland China"
...,...,...,...,...,...,...,...,...,...
236012,236013,02/27/2021,Zaporizhia Oblast,Ukraine,2021-02-28 05:22:20,69504.0,1132.0,65049.0,"Zaporizhia Oblast, Ukraine"
236013,236014,02/27/2021,Zeeland,Netherlands,2021-02-28 05:22:20,16480.0,178.0,0.0,"Zeeland, Netherlands"
236014,236015,02/27/2021,Zhejiang,Mainland China,2021-02-28 05:22:20,1321.0,1.0,1314.0,"Zhejiang, Mainland China"
236015,236016,02/27/2021,Zhytomyr Oblast,Ukraine,2021-02-28 05:22:20,50582.0,834.0,44309.0,"Zhytomyr Oblast, Ukraine"
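One thing to keep in mind with this kind of concatenation is that a missing "Province/State" makes the whole "Location" missing. The sketch below uses a two-row toy frame; the fallback to the country name is only one possible alternative and is not what the notebook does.

```python
import pandas as pd

# Toy frame: one row with a province, one without.
toy = pd.DataFrame({
    "Province/State": ["Anhui", None],
    "Country/Region": ["Mainland China", "France"],
})

# Plain concatenation: the missing province propagates, so the result is NaN.
plain = toy["Province/State"] + ", " + toy["Country/Region"]

# Assumed alternative: fall back to the country alone where the result is missing.
filled = plain.fillna(toy["Country/Region"])

print(plain.tolist())
print(filled.tolist())
```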


### 2.7a - Creating a Subset

Uses .groupby().agg() to narrow the table down to the columns that the later steps need. <br>
The dataset is narrowed here, before the other steps, to speed them up.

In [17]:
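# as_index=False keeps the group keys as columns so that the next stage (set_index) is possible.
# "Recovered" is the aggregated column, so it is left out of the group keys.
p2 = p2.groupby(["SNo", "ObservationDate", "Location", "Confirmed", "Deaths"],
                as_index=False).agg({"Recovered":"sum"})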
p2

Unnamed: 0,ObservationDate,Location,Confirmed,Deaths,Recovered
0,01/01/2021,"Abruzzo, Italy",35723.0,1218.0,23132.0
1,01/01/2021,"Acre, Brazil",41689.0,796.0,33670.0
2,01/01/2021,"Adygea Republic, Russia",11103.0,92.0,9085.0
3,01/01/2021,"Aguascalientes, Mexico",17021.0,1371.0,0.0
4,01/01/2021,"Aichi, Japan",16764.0,213.0,13596.0
...,...,...,...,...,...
173966,12/31/2020,"Zaporizhia Oblast, Ukraine",54088.0,594.0,20530.0
173967,12/31/2020,"Zeeland, Netherlands",10462.0,127.0,0.0
173968,12/31/2020,"Zhejiang, Mainland China",1306.0,1.0,1293.0
173969,12/31/2020,"Zhytomyr Oblast, Ukraine",39202.0,655.0,31998.0
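By default, groupby() silently drops rows whose group key is NaN. That would be consistent with the grouped table above being shorter than the 236,017-row original, although that explanation is an assumption. The toy frame below is hypothetical and only illustrates the default against `dropna=False`.

```python
import pandas as pd

# Toy frame: two rows have no Location.
toy = pd.DataFrame({
    "Location": ["Anhui, Mainland China", None, "Anhui, Mainland China", None],
    "Recovered": [1.0, 5.0, 2.0, 7.0],
})

# Default behaviour: rows with a NaN key are left out of every group.
default_groups = toy.groupby("Location")["Recovered"].sum()

# dropna=False keeps them as a group of their own.
all_groups = toy.groupby("Location", dropna=False)["Recovered"].sum()

print(len(default_groups), len(all_groups))
```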


### 2.6 - set_index()

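Sets the index to "SNo", which is the original row number in the dataset. This only works if "SNo" is still a column after the grouping in 2.7a; when it is not, set_index() raises the KeyError shown below.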

In [18]:
p2.set_index(["SNo"], inplace=True)
p2

KeyError: "None of ['SNo'] are in the columns"

### 2.5 - A Filtering Operation

Filters the dataset so that only entries with a "Recovered" value greater than the mean will be included. <br>
This reduces the dataset by a factor of ten.

In [None]:
p2 = p2[p2["Recovered"] > p2["Recovered"].mean()].copy() # .copy() avoids SettingWithCopyWarning in later cells
p2
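A mean-based filter removes most rows when the data is strongly right-skewed, as the summary in 1.4c suggests (the mean of "Recovered" lies above its 75th percentile). The sketch below shows the mechanism on five hypothetical values.

```python
import pandas as pd

# Hypothetical skewed sample: one large value pulls the mean up.
toy = pd.DataFrame({"Recovered": [0.0, 10.0, 20.0, 30.0, 940.0]})

threshold = toy["Recovered"].mean()   # Mean of the five values
mask = toy["Recovered"] > threshold   # Boolean Series, one entry per row
filtered = toy[mask]                  # Keeps only the rows where the mask is True

print(threshold, len(filtered))
```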

### 2.7b - to_csv()

Saves the filtered dataframe to "test_output.csv" in the working folder.

In [None]:
p2.to_csv("test_output.csv")


## Stage 3

### Part 1 - NaN

#### 1.1 - isna()

In [None]:
p3 = df.copy() # Works on a copy so that fillna() below does not modify df
p3.isna().sum()

There are 62,045 missing values in the table, all of them in the Province/State column.

#### 1.3 - dropna()

Drops every row that contains a null value. <br>
Displays a count of all remaining NaN values to show that none are left.

In [None]:
trunc = p3.dropna()
trunc.isna().sum().sum() # Total count of NaN values

#### 1.2 - fillna()

Instead of dropping the null values, this replaces them with "Undefined". <br>
Prints a row with a null value before/after the modification to show the change.

In [None]:
print(p3["Province/State"][35])
p3.fillna("Undefined", inplace=True)
print(p3["Province/State"][35])
p3.isna().sum().sum() # Total count of NaN values.
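The three NaN tools used in this part can be compared side by side on a small frame. This is a sketch on hypothetical data. Passing a dictionary to fillna() is an assumed variation that limits the replacement to one column, so the string "Undefined" cannot end up in a numeric column.

```python
import pandas as pd

# Toy frame with one missing province.
toy = pd.DataFrame({
    "Province/State": ["Anhui", None, "Beijing"],
    "Confirmed": [1.0, 2.0, 14.0],
})

missing_per_column = toy.isna().sum()                 # NaN count for each column
dropped = toy.dropna()                                # Removes the row with the missing province
filled = toy.fillna({"Province/State": "Undefined"})  # Keeps the row and labels the gap

print(missing_per_column["Province/State"], len(dropped), len(filled))
```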

### Part 2 - Plotting

#### 2.1 - Histograms

Creates one histogram for each of the three count columns:<br>
1) Confirmed Cases:

In [None]:
# Uses the filtered dataframe from Stage 2 to slightly reduce the migraines caused by the number of data points.
p2_confirmed_hist = p2["Confirmed"].hist(bins="auto", log=True)

2) Recovered Cases:

In [None]:
p2_recovered_hist = p2["Recovered"].hist(bins="auto", log=True)

3) Deaths:

In [None]:
p2_deaths_hist = p2["Deaths"].hist(bins="auto", log=True)

#### 2.2 - Plots Within Plots

All three plots, but together now!

In [None]:
p2_grouped_hist = p2.hist(column=["Confirmed", "Recovered", "Deaths"],
                          bins="auto", layout=(3, 1), figsize=(10, 10), log=True)

#### 2.3 - Single-Line Graph

This is an attempt at graphing recovered cases over time. <br>
*It turns out that 20,000 data points don't make for a good line graph.*

In [None]:
p2["ObservationDate"] = pd.to_datetime(p2["ObservationDate"])
p2.sort_values(by="ObservationDate", ascending=True, inplace=True)
p2.plot(x="ObservationDate", y="Recovered", figsize=(10,5))
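One way to get a readable line is to collapse the rows to a single value per date before plotting. The sketch below does this on a hypothetical four-row frame. Applied to p2, the sums would cover only the rows that survived the Stage 2 filter, so this illustrates the technique and is not a corrected figure.

```python
import pandas as pd

# Toy frame: two locations reported on each of two days.
toy = pd.DataFrame({
    "ObservationDate": ["01/22/2020", "01/22/2020", "01/23/2020", "01/23/2020"],
    "Recovered": [0.0, 1.0, 2.0, 3.0],
})
toy["ObservationDate"] = pd.to_datetime(toy["ObservationDate"], format="%m/%d/%Y")

# One total per date instead of one point per location and date.
daily = toy.groupby("ObservationDate")["Recovered"].sum()

# daily.plot(figsize=(10, 5)) would then draw a single point per day.
print(daily.tolist())
```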

#### 2.4 - Multi-Line Graph

Now with fewer data points! <br>
Creates a helper function to cut down on repeated code in this section.


In [None]:
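def aggravate(tmp_series, target):
    # tmp_series is a GroupBy over (country, province). Takes the last (most recent) value of
    # `target` for each province, then sums the provinces of each country.
    return tmp_series.aggregate({target:"last"}).groupby(["Country/Region"], as_index=False,
                                                         dropna=False).aggregate({target:"sum"})[target]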

Creates a temporary GroupBy object (grouped but not yet aggregated) for use with the "aggravate" function.

In [None]:
tmp_ser = df.groupby(["Country/Region", "Province/State"], as_index=False, dropna=False)

Creates a new dataframe (lg) which is aggregated by country, both to make the data more meaningful and to reduce the number of data points.


In [None]:
lg = pd.DataFrame()
lg["Country"] = (tmp_ser.aggregate({"Confirmed":"last"}).groupby(["Country/Region"], as_index=False,dropna=False).
                 aggregate({"Confirmed":"sum"}))["Country/Region"]
lg["Confirmed"] = aggravate(tmp_ser, "Confirmed")
lg["Recovered"] = aggravate(tmp_ser, "Recovered")
lg["Deaths"] = aggravate(tmp_ser, "Deaths")
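As a possible simplification, the three count columns could be aggregated in one pass instead of one call per column. This is a sketch on hypothetical data; `latest` and `lg_toy` are illustrative names, and it assumes the rows are already in chronological order so that "last" means "most recent".

```python
import pandas as pd

# Toy frame: cumulative counts, with France reported without a province.
toy = pd.DataFrame({
    "Country/Region": ["Mainland China", "Mainland China", "Mainland China", "France", "France"],
    "Province/State": ["Anhui", "Anhui", "Beijing", None, None],
    "Confirmed": [1.0, 9.0, 14.0, 2.0, 5.0],
    "Recovered": [0.0, 4.0, 3.0, 0.0, 1.0],
    "Deaths": [0.0, 1.0, 0.0, 0.0, 1.0],
})
columns = ["Confirmed", "Recovered", "Deaths"]

# Step 1: most recent row per (country, province); dropna=False keeps the province-less rows.
latest = (toy.groupby(["Country/Region", "Province/State"], dropna=False)[columns]
             .last()
             .reset_index())

# Step 2: add up the provinces of each country, all three columns at once.
lg_toy = (latest.groupby("Country/Region", as_index=False)[columns]
                .sum()
                .rename(columns={"Country/Region": "Country"}))

print(lg_toy)
```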

Creates a line graph from the newly aggregated dataframe, using the number of confirmed cases as the x-axis.

In [None]:
lg.sort_values(by="Confirmed", ascending=True, inplace=True)
lg.plot('Confirmed', legend=True, logx=True, logy=True)
mp.pyplot.annotate('|---?---|', (10**7, 10**7))
mp.pyplot.annotate('|----------?----------|', (0.5, 100))

#### 2.5 - Bar Graph

Now with beautiful labels!

In [None]:
lg_bar = lg
# "Country" stays a regular column because plot.bar() reads it as the x-axis.
lg_bar.plot.bar("Country", "Confirmed", logy=True)

mp.pyplot.savefig("modern_art.png")

# -- Run --


Defines a function that rewrites poorly named countries into names that PyCountry can recognize.

In [None]:
def sanitize_csv(original, new):
    # "nil" (or a missing value) means that the matched text is simply removed.
    replacement = "" if new == "nil" or pd.isna(new) else new
    df["Country"] = df["Country"].str.replace(original, replacement, regex=True)
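Because the replacement runs with `regex=True`, each "original" entry is read as a regular expression and not as literal text. The sketch below uses hypothetical names (not taken from "better_names") to show the difference and how `re.escape` restores literal matching.

```python
import re

import pandas as pd

# Hypothetical names; the second one only resembles the first.
names = pd.Series(["St. Martin", "StX Martin"])

# regex=True: "." matches any character, so both names are rewritten.
loose = names.str.replace("St. Martin", "Saint Martin", regex=True)

# regex=False: the pattern is literal text, so only the exact name is rewritten.
literal = names.str.replace("St. Martin", "Saint Martin", regex=False)

# re.escape gives the same literal behaviour while keeping regex=True.
escaped = names.str.replace(re.escape("St. Martin"), "Saint Martin", regex=True)

print(loose.tolist())
print(literal.tolist())
```

Patterns that contain characters such as "(" or "*" would need the same escaping, unless the pattern is meant as a regular expression.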

Prepares the table for use by dropping unneeded columns and renaming an annoying one.

In [None]:
df.drop(["SNo", "ObservationDate", "Recovered", "Last Update", "Deaths"], axis=1, inplace=True)
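df.rename(columns={"Country/Region": "Country"}, inplace=True)

"Sanitizes" the country names so that PyCountry will recognize them; the grouping steps that follow then collapse the rows of each country together.

In [None]:
# Positions 1 and 2 of each row tuple hold the original pattern and its replacement (position 0 is the index).
for row in bn.itertuples():
    sanitize_csv(row[1], row[2])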

Groups by country and state and keeps the last (most recent) value for each state. The entries are cumulative, so summing
every row would result in an absurd number of cases.

In [None]:
df = df.groupby(["Country", "Province/State"], as_index=False, dropna=False).aggregate({"Confirmed":"last"})

Groups the table again by country, finding the sum of all of the states.

In [None]:
df = df.groupby(["Country"], as_index=False, dropna=False).aggregate({"Confirmed":"sum"})
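The effect of the two grouping steps can be checked on a small cumulative sample. This is a sketch with hypothetical numbers, and it relies on the rows being in chronological order, as "last" does in the notebook.

```python
import pandas as pd

# Hypothetical cumulative counts: two provinces, each reported on two days.
toy = pd.DataFrame({
    "Country": ["Mainland China"] * 4,
    "Province/State": ["Anhui", "Beijing", "Anhui", "Beijing"],
    "Confirmed": [1.0, 14.0, 9.0, 22.0],
})

# Summing every row adds day 1 and day 2 together, which double counts.
naive_total = toy["Confirmed"].sum()

# Latest value per province first, then the sum over provinces.
latest = toy.groupby(["Country", "Province/State"], as_index=False).agg({"Confirmed": "last"})
country_total = latest.groupby("Country", as_index=False).agg({"Confirmed": "sum"})

print(naive_total, country_total["Confirmed"].tolist())
```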

Uses PyCountry to find the three-letter (ISO 3166-1 alpha-3) code for each country for use with Folium.

In [None]:
# Takes the first fuzzy match for each name and keeps its alpha-3 code.
df["Country"] = df["Country"].map(lambda name: pc.countries.search_fuzzy(name)[0].alpha_3)
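search_fuzzy() raises a LookupError when nothing matches, and its first result is not guaranteed to be the intended country. The sketch below collects unmatched names so they can be inspected. `to_alpha3`, `fake_search`, `Match`, and `KNOWN` are hypothetical; `fake_search` stands in for `pc.countries.search_fuzzy` so that the example runs without PyCountry.

```python
from collections import namedtuple

# Hypothetical stand-in for the objects returned by pc.countries.search_fuzzy.
Match = namedtuple("Match", ["name", "alpha_3"])
KNOWN = {"france": Match("France", "FRA"), "japan": Match("Japan", "JPN")}


def fake_search(name):
    """Mimic search_fuzzy: return a list of matches or raise LookupError."""
    try:
        return [KNOWN[name.lower()]]
    except KeyError:
        raise LookupError(name) from None


def to_alpha3(names, search):
    """Return ({name: code}, [unmatched names]) using the first result of `search`."""
    codes, unmatched = {}, []
    for name in names:
        try:
            codes[name] = search(name)[0].alpha_3
        except LookupError:
            unmatched.append(name)
    return codes, unmatched


codes, unmatched = to_alpha3(["France", "Japan", "Atlantis"], fake_search)
print(codes, unmatched)
```

In the notebook, `pc.countries.search_fuzzy` would be passed as `search`, and printing the resulting name-to-code pairs would show which match was picked for each country.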


Takes the natural log of every value in order to make a more meaningful map. <br>
Without this step, only three countries would be colored anything other than yellow.

In [None]:
df["Confirmed"] = np.log(df["Confirmed"]) # Vectorized; zero becomes -inf and negative counts become NaN.
df.dropna(inplace=True) # Necessary to remove broken entries created by the above process.
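The summary in 1.4c shows negative minimum values, so it helps to know what the log does with zeros and negatives. The sketch below uses three hypothetical rows with placeholder country codes. Filtering with `np.isfinite` is an assumed alternative to dropna(), which removes NaN but leaves -inf in place.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: a normal count, a zero, and a negative count.
toy = pd.DataFrame({"Country": ["AAA", "BBB", "CCC"], "Confirmed": [1000.0, 0.0, -5.0]})

# Silence the expected divide-by-zero and invalid-value warnings for this demonstration.
with np.errstate(divide="ignore", invalid="ignore"):
    toy["Confirmed"] = np.log(toy["Confirmed"])

cleaned = toy.dropna()                       # Removes the NaN row only; -inf survives
finite = toy[np.isfinite(toy["Confirmed"])]  # Keeps only ordinary finite values

print(len(cleaned), len(finite))
```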

Creates the Folium map.

In [None]:
outbreak_map = fol.Map(location=[0, 0], zoom_start=0)

fol.Choropleth(
    name = "COVID Cases",
    geo_data = ct, # Polygonal data to draw the country map.
    data = df, # COVID case data
    columns = ["Country", "Confirmed"], # Column to match with the key, count-based column.
    key_on = "feature.id", # Establishes the key of the country JSON.
    fill_color = "YlOrRd", # Color scheme
    fill_opacity = 0.75,
    line_opacity = 0.25,
    nan_fill_opacity = 0,
    legend_name = "Confirmed Cases",
    highlight = True,
).add_to(outbreak_map)
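The choropleth matches the first entry of `columns` against the value named by `key_on`, so a code that appears on only one side is never colored by the data. The sketch below is a sanity check on a hypothetical two-feature GeoJSON and a two-row frame. It assumes that each feature carries its code in "id", as `key_on = "feature.id"` implies, and "XXX" is a placeholder code.

```python
import pandas as pd

# Hypothetical miniature of the country JSON.
geo = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "id": "FRA", "properties": {"name": "France"}, "geometry": None},
        {"type": "Feature", "id": "JPN", "properties": {"name": "Japan"}, "geometry": None},
    ],
}
# Hypothetical case data after the alpha-3 conversion and the log step.
data = pd.DataFrame({"Country": ["FRA", "XXX"], "Confirmed": [15.1, 2.3]})

feature_ids = {feature["id"] for feature in geo["features"]}
data_codes = set(data["Country"])

codes_without_shape = sorted(data_codes - feature_ids)  # Data rows with no polygon to color
shapes_without_data = sorted(feature_ids - data_codes)  # Polygons that receive no value

print(codes_without_shape, shapes_without_data)
```

Running the same two set differences on `ct` and `df` might help narrow down mismatches such as the France issue listed under To Do, although the cause of that issue is not established here.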

Creates an overlay for the above folium map which displays the confirmed count.

In [None]:
OverlayData = fol.features.GeoJson(
    mc,
    style_function = lambda x: {'fillColor': '#ffffff', 'color':'#000000', 'fillOpacity': 0.2, 'weight': 0.2},
    highlight_function = lambda x: {'fillColor': '#000000', 'color':'#000000', 'fillOpacity': 0.50, 'weight': 0.1},
    control = False,
    tooltip = fol.features.GeoJsonTooltip(
        fields = ["name", "Confirmed"],
        aliases = ["Country: ", "Confirmed Cases: "],
        style = "background-color: white; color: #333333; font-family: arial; font-size: 12px; padding: 10px;"
    )
)

outbreak_map.add_child(OverlayData)
outbreak_map.keep_in_front(OverlayData)
fol.LayerControl().add_to(outbreak_map)

In [None]:
# Displays the map
outbreak_map