
# <center> Data Preparation </center>
<HR>

<div style='text-align: justify'>
In this notebook, we prepare the raw COVID-19 data for our case study. We want to run the ANOVA statistical test on our hypothesis, namely that the difference in the number of COVID-19 cases across regions is significant.
To do so, we aggregate the data into four regions based on continents and keep 20 countries from each region for the case study. The data was collected on 9th May 2020 from Johns Hopkins University.
</div>
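
To make the goal of the preparation concrete, here is a minimal sketch of the F statistic that an ANOVA compares across groups. It assumes a one-way design and uses made-up toy numbers, not the Johns Hopkins data; the actual test is carried out in the case_study notebook and may be set up differently.

```python
# Toy values per region (made-up numbers, not the JHU data).
groups = {
    "AFRO": [1.0, 2.0, 3.0],
    "EURO": [2.0, 3.0, 4.0],
    "OCEA": [5.0, 6.0, 7.0],
}

all_values = [v for values in groups.values() for v in values]
grand_mean = sum(all_values) / len(all_values)

# Between-group sum of squares: distance of each group mean from the grand mean.
ss_between = sum(
    len(values) * (sum(values) / len(values) - grand_mean) ** 2
    for values in groups.values()
)
# Within-group sum of squares: spread of the values around their own group mean.
ss_within = sum(
    (v - sum(values) / len(values)) ** 2
    for values in groups.values()
    for v in values
)

df_between = len(groups) - 1                # number of groups minus one
df_within = len(all_values) - len(groups)   # observations minus number of groups
f_statistic = (ss_between / df_between) / (ss_within / df_within)
print(f_statistic)
```

A large F value means the group means differ by more than the spread inside the groups would suggest, which is why the data below is organised as one row per country with a region label.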

Data Source: [Covid19 data Johns Hopkins University](https://github.com/CSSEGISandData/COVID-19)


See the [case_study notebook](case_study.ipynb) for the statistical test.

In [1]:
import pandas as pd
import pycountry_convert as pc

**Load the data into memory**

In [2]:
df = pd.read_csv('data/covid_data_05-09-2020.csv')

In [3]:
df.head()

| | FIPS | Admin2 | Province_State | Country_Region | Last_Update | Lat | Long_ | Confirmed | Deaths | Recovered | Active | Combined_Key | Incident_Rate | Case_Fatality_Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | | | | Afghanistan | 2021-03-02 05:23:30 | 33.93911 | 67.709953 | 55733 | 2444 | 49344 | 3945.0 | Afghanistan | 143.168187 | 4.385194 |
| 1 | | | | Albania | 2021-03-02 05:23:30 | 41.1533 | 20.1683 | 107931 | 1816 | 70413 | 35702.0 | Albania | 3750.469108 | 1.682556 |
| 2 | | | | Algeria | 2021-03-02 05:23:30 | 28.0339 | 1.6596 | 113255 | 2987 | 78234 | 32034.0 | Algeria | 258.272078 | 2.637411 |
| 3 | | | | Andorra | 2021-03-02 05:23:30 | 42.5063 | 1.5218 | 10889 | 110 | 10475 | 304.0 | Andorra | 14093.056364 | 1.010194 |
| 4 | | | | Angola | 2021-03-02 05:23:30 | -11.2027 | 17.8739 | 20854 | 508 | 19400 | 946.0 | Angola | 63.451074 | 2.435984 |


**Aggregate the data by Country_Region**

Sum the counts per `Country_Region` and keep only the columns required for our study: confirmed cases, deaths, recovered cases and active cases.

In [4]:
df = df.groupby('Country_Region')[['Confirmed', 
                                   'Deaths', 'Recovered', 'Active']].sum()
df = df.reset_index()
# Rename column Country_Region to Country
df = df.rename(columns={'Country_Region': 'Country'})
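
The effect of this aggregation can be seen on a toy frame. The sketch below assumes that some countries appear on several rows (one per `Province_State`); the numbers are made up for illustration and only two of the count columns are shown.

```python
import pandas as pd

# Toy rows (made-up numbers): one country split over two provinces,
# another country reported on a single row.
raw = pd.DataFrame({
    "Province_State": ["Hubei", "Beijing", None],
    "Country_Region": ["China", "China", "Nepal"],
    "Confirmed": [100, 20, 7],
    "Deaths": [5, 1, 0],
})

# Summing within each country collapses the province rows into one row.
per_country = (
    raw.groupby("Country_Region")[["Confirmed", "Deaths"]]
    .sum()
    .reset_index()
    .rename(columns={"Country_Region": "Country"})
)
print(per_country)
```

After the aggregation each country occupies exactly one row, which is the unit of observation used in the rest of the notebook.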

In [5]:
df.head()

| | Country | Confirmed | Deaths | Recovered | Active |
|---|---|---|---|---|---|
| 0 | Afghanistan | 55733 | 2444 | 49344 | 3945.0 |
| 1 | Albania | 107931 | 1816 | 70413 | 35702.0 |
| 2 | Algeria | 113255 | 2987 | 78234 | 32034.0 |
| 3 | Andorra | 10889 | 110 | 10475 | 304.0 |
| 4 | Angola | 20854 | 508 | 19400 | 946.0 |


**Missing values in the data**

In [6]:
df.isnull().sum()

Country      0
Confirmed    0
Deaths       0
Recovered    0
Active       0
dtype: int64

**Categorize the countries into continents**

In [7]:
def country_to_continent(country_name: str) -> str:
    """Return the continent name of a country, or pd.NA if the name is not recognised."""
    try:
        country_alpha2 = pc.country_name_to_country_alpha2(country_name)
        continent_code = pc.country_alpha2_to_continent_code(country_alpha2)
        continent_name = pc.convert_continent_code_to_continent_name(continent_code)
    except Exception:
        # The lookup failed, so treat country_name as invalid
        continent_name = pd.NA
    return continent_name

In [8]:
# Rename 'US' to 'United States' in the Country column; otherwise the name is considered invalid
df.loc[df.Country=='US', 'Country'] = 'United States'
# Find the continent name of each country
df['Continent'] = df.Country.map(country_to_continent)
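
Because unrecognised names silently become `pd.NA`, it can help to list them after the mapping. The sketch below shows the idea on a toy frame; `KNOWN` and `toy_country_to_continent` are hypothetical stand-ins for the `pycountry_convert` lookup so that the example runs without that package, and the country names are illustrative only.

```python
import pandas as pd

# Hypothetical stand-in for the pycountry_convert lookup:
# only the names listed here "resolve" to a continent.
KNOWN = {"United States": "North America", "Nepal": "Asia", "Angola": "Africa"}

def toy_country_to_continent(name):
    """Return the continent for a known name, otherwise pd.NA."""
    return KNOWN.get(name, pd.NA)

toy = pd.DataFrame(
    {"Country": ["United States", "Nepal", "Angola", "Diamond Princess"]}
)
toy["Continent"] = toy.Country.map(toy_country_to_continent)

# Names that could not be mapped end up with a missing continent.
unmapped = toy.loc[toy.Continent.isna(), "Country"].tolist()
print(unmapped)
```

Running the same `isna()` filter on the real `df` would show which entries need a manual rename, in the same way `'US'` is handled above.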

**Categorize the continents into the following four regions for our case study:**
    
    AFRO: Africa
    
    AMR: North America and South America
    
    EURO: Europe
    
    OCEA: Asia and Oceania

In [9]:
df['Region'] = df.Continent.map({'Africa': 'AFRO', 
                                 'North America': 'AMR',
                                 'South America': 'AMR',
                                 'Europe': "EURO",
                                 'Asia': 'OCEA',
                                 'Oceania': 'OCEA'})
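
One property of `Series.map` with a dictionary is worth keeping in mind here: any continent value that is not a key of the dictionary, including a missing value, becomes `NaN` in the result. The toy frame below (illustrative rows, with `region_of` as a name chosen for this sketch) shows that behaviour and a quick way to count the affected rows.

```python
import pandas as pd

# Same continent-to-region labels as in the cell above.
region_of = {"Africa": "AFRO", "North America": "AMR", "South America": "AMR",
             "Europe": "EURO", "Asia": "OCEA", "Oceania": "OCEA"}

toy = pd.DataFrame({
    "Country": ["Angola", "Brazil", "Nepal", "Unknown place"],
    "Continent": ["Africa", "South America", "Asia", pd.NA],
})
toy["Region"] = toy.Continent.map(region_of)

# Rows whose continent is missing (or not in the dictionary) get no region.
n_without_region = toy.Region.isna().sum()
print(n_without_region)

# value_counts() skips missing values by default, so only labelled rows are counted.
counts = toy.Region.value_counts()
print(counts)
```

Rows without a region cannot belong to any group, so they play no part in a per-region selection.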

**Select the 20 most affected countries (by confirmed cases) from each region for the study**

In [10]:
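# Drop countries without a region, sort by region and descending case count,
# then keep the first 20 rows of each region.
df = df.dropna(subset=['Region']) \
    .sort_values(['Region', 'Confirmed'], ascending=[True, False]) \
    .reset_index(drop=True) \
    .groupby('Region') \
    .head(20)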
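
A top-n-per-group selection is easy to verify on a small example. The sketch below uses a toy frame with made-up counts and `n = 2` instead of 20; it also includes a region with fewer than `n` countries and a row without a region, to show how those cases behave.

```python
import pandas as pd

toy = pd.DataFrame({
    "Country": ["a", "b", "c", "d", "e"],
    "Region": ["AFRO", "AFRO", "AFRO", "EURO", None],
    "Confirmed": [10, 30, 20, 5, 99],
})

n = 2
top_n = (
    toy.dropna(subset=["Region"])            # rows without a region are excluded
    .sort_values(["Region", "Confirmed"], ascending=[True, False])
    .groupby("Region")
    .head(n)                                 # first n rows of each region
    .reset_index(drop=True)
)
print(top_n)

# A region with fewer than n countries simply contributes all of its rows.
sizes = top_n.groupby("Region").size()
print(sizes)
```

Checking `sizes` on the real data is a cheap way to confirm that every region contributes at most 20 countries, and to spot a region that contributes fewer.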

In [11]:
df.tail()

| | Country | Confirmed | Deaths | Recovered | Active | Continent | Region |
|---|---|---|---|---|---|---|---|
| 144 | Nepal | 274216 | 2777 | 270471 | 968.0 | Asia | OCEA |
| 145 | Georgia | 270918 | 3520 | 265523 | 1875.0 | Asia | OCEA |
| 146 | Kazakhstan | 263396 | 3168 | 240302 | 19926.0 | Asia | OCEA |
| 147 | Azerbaijan | 234662 | 3223 | 228839 | 2600.0 | Asia | OCEA |
| 148 | Kuwait | 192031 | 1085 | 180155 | 10791.0 | Asia | OCEA |


**Save the preprocessed data**

In [12]:
df.to_csv('data/covid19_preprocessed_data.csv', index=False)