# Pandas and Covid-19

This notebook is an example of data analysis and manipulation with Pandas.

Enjoy it!

In [0]:
import numpy as np
import pandas as pd

## The Data

To get some data, I am going to download it from the Data Repository by Johns Hopkins CSSE

https://github.com/CSSEGISandData/COVID-19

I first remove the folder where I am going to store the data so I can re-execute these statements without any problems ...

In [0]:
!rm -rf ./COVID-19

The dataset is available on GitHub, so I use the `git` command to get it

In [3]:
!git clone https://github.com/CSSEGISandData/COVID-19.git

Cloning into 'COVID-19'...
remote: Enumerating objects: 21, done.
remote: Counting objects: 100% (21/21), done.
remote: Compressing objects: 100% (12/12), done.
remote: Total 20973 (delta 10), reused 16 (delta 9), pack-reused 20952
Receiving objects: 100% (20973/20973), 87.54 MiB | 32.88 MiB/s, done.
Resolving deltas: 100% (11130/11130), done.


## Exploring the data


In [4]:
!ls -lt ./COVID-19/csse_covid_19_data/csse_covid_19_daily_reports | head

total 8920
-rw-r--r-- 1 root root 315929 Apr 19 05:59 04-18-2020.csv
-rw-r--r-- 1 root root      0 Apr 19 05:59 README.md
-rw-r--r-- 1 root root 314848 Apr 19 05:59 04-17-2020.csv
-rw-r--r-- 1 root root 314226 Apr 19 05:59 04-16-2020.csv
-rw-r--r-- 1 root root 309742 Apr 19 05:59 04-13-2020.csv
-rw-r--r-- 1 root root 311068 Apr 19 05:59 04-14-2020.csv
-rw-r--r-- 1 root root 312551 Apr 19 05:59 04-15-2020.csv
-rw-r--r-- 1 root root 305548 Apr 19 05:59 04-12-2020.csv
-rw-r--r-- 1 root root 303921 Apr 19 05:59 04-11-2020.csv


Yes!!!  
We have data files ...

Perfect. Let's explore the first dataset generated ...

In [0]:
first = pd.read_csv("./COVID-19/csse_covid_19_data/csse_covid_19_daily_reports/01-22-2020.csv")

In [0]:
first.head()

Unnamed: 0,Province/State,Country/Region,Last Update,Confirmed,Deaths,Recovered
0,Anhui,Mainland China,1/22/2020 17:00,1.0,,
1,Beijing,Mainland China,1/22/2020 17:00,14.0,,
2,Chongqing,Mainland China,1/22/2020 17:00,6.0,,
3,Fujian,Mainland China,1/22/2020 17:00,1.0,,
4,Gansu,Mainland China,1/22/2020 17:00,,,


And one of the last ones ...

In [0]:
last = pd.read_csv("./COVID-19/csse_covid_19_data/csse_covid_19_daily_reports/04-18-2020.csv")

In [7]:
last.head()

Unnamed: 0,FIPS,Admin2,Province_State,Country_Region,Last_Update,Lat,Long_,Confirmed,Deaths,Recovered,Active,Combined_Key
0,45001.0,Abbeville,South Carolina,US,2020-04-18 22:32:47,34.223334,-82.461707,15,0,0,15,"Abbeville, South Carolina, US"
1,22001.0,Acadia,Louisiana,US,2020-04-18 22:32:47,30.295065,-92.414197,110,7,0,103,"Acadia, Louisiana, US"
2,51001.0,Accomack,Virginia,US,2020-04-18 22:32:47,37.767072,-75.632346,33,0,0,33,"Accomack, Virginia, US"
3,16001.0,Ada,Idaho,US,2020-04-18 22:32:47,43.452658,-116.241552,593,9,0,584,"Ada, Idaho, US"
4,19001.0,Adair,Iowa,US,2020-04-18 22:32:47,41.330756,-94.471059,1,0,0,1,"Adair, Iowa, US"


Can I concatenate both datasets?

In [8]:
pd.concat((first, last), axis = 0)

Unnamed: 0,Province/State,Country/Region,Last Update,Confirmed,Deaths,Recovered,FIPS,Admin2,Province_State,Country_Region,Last_Update,Lat,Long_,Active,Combined_Key
0,Anhui,Mainland China,1/22/2020 17:00,1.0,,,,,,,,,,,
1,Beijing,Mainland China,1/22/2020 17:00,14.0,,,,,,,,,,,
2,Chongqing,Mainland China,1/22/2020 17:00,6.0,,,,,,,,,,,
3,Fujian,Mainland China,1/22/2020 17:00,1.0,,,,,,,,,,,
4,Gansu,Mainland China,1/22/2020 17:00,,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3048,,,,418.0,2.0,69.0,,,,West Bank and Gaza,2020-04-18 22:32:28,31.952200,35.233200,347.0,West Bank and Gaza
3049,,,,6.0,0.0,0.0,,,,Western Sahara,2020-04-18 22:32:28,24.215500,-12.885800,6.0,Western Sahara
3050,,,,1.0,0.0,0.0,,,,Yemen,2020-04-18 22:32:28,15.552727,48.516388,1.0,Yemen
3051,,,,57.0,2.0,33.0,,,,Zambia,2020-04-18 22:32:28,-13.133897,27.849332,22.0,Zambia


Oops!!! The column names don't match :-(
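To see why the concatenation went wrong, here is a small sketch with two made-up frames (the values are illustrative, not read from the repository). `pd.concat` aligns on column names, so columns that differ only in spelling end up side by side and are padded with NaN; renaming them to a common name first gives a single column.

```python
import pandas as pd

# Two tiny, made-up frames that mimic the old and the new column naming
old = pd.DataFrame({"Country/Region": ["Mainland China"], "Confirmed": [14]})
new = pd.DataFrame({"Country_Region": ["US"], "Confirmed": [15]})

# Concatenating as-is keeps both spellings and pads the gaps with NaN
mismatched = pd.concat((old, new), axis=0)

# Renaming to a common name first yields one clean column
unified = pd.concat(
    (old.rename(columns={"Country/Region": "Country"}),
     new.rename(columns={"Country_Region": "Country"})),
    axis=0,
    ignore_index=True,
)

print(mismatched.columns.tolist())
print(unified.columns.tolist())
```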

# Loading the data into Pandas and cleaning it

In [0]:
import glob
import os

files = glob.glob("./COVID-19/csse_covid_19_data/csse_covid_19_daily_reports/*.csv")
files.sort(key=os.path.getmtime)

We are going to:
- Create a blank Dataset to store all the data
- Load every dataset unifying the column names so we can concatenate it without any problem.
- Remove extra blank spaces from the country field
- Enrich the information with the date of the data in the correct type

In [0]:
data = pd.DataFrame()
for file in files:  
  df = pd.read_csv(file).rename(columns = {'Province/State' : 'State', 
                        "Country/Region" : 'Country',
                        'Province_State' : 'State', 
                        "Country_Region" : 'Country',
                        'Last Update' : 'Last_Update',
                        'Confirmed' : 'ConfirmedAcum',
                        'Deaths' : 'DeathsAcum',
                        'Recovered' : 'RecoveredAcum'})
  df = df.assign(Date = pd.to_datetime(file[-14:-4], format = '%m-%d-%Y'),
                 Country = df.Country.str.strip())
  data = pd.concat((data, df), axis = 0)
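The loop above calls `pd.concat` once per file, which copies the growing dataset each time. A possible variation, sketched here under assumptions, collects the frames in a list and concatenates once at the end. The two "files" are hypothetical in-memory strings with made-up rows, and the file name is parsed with `os.path` instead of a fixed slice.

```python
import io
import os
import pandas as pd

# Hypothetical in-memory stand-ins for two daily report files (made-up rows)
fake_files = {
    "reports/01-22-2020.csv": "Province/State,Country/Region,Confirmed\nAnhui, Mainland China,1\n",
    "reports/04-18-2020.csv": "Province_State,Country_Region,Confirmed\nIdaho,US,593\n",
}

# Map both naming schemes to one set of column names
column_names = {
    "Province/State": "State", "Province_State": "State",
    "Country/Region": "Country", "Country_Region": "Country",
    "Confirmed": "ConfirmedAcum",
}

frames = []
for path, content in fake_files.items():
    df = pd.read_csv(io.StringIO(content)).rename(columns=column_names)
    # Take the date from the file name, e.g. "01-22-2020.csv" -> 2020-01-22
    stem = os.path.splitext(os.path.basename(path))[0]
    df = df.assign(Date=pd.to_datetime(stem, format="%m-%d-%Y"),
                   Country=df.Country.str.strip())
    frames.append(df)

# One concat at the end instead of one per file
sketch = pd.concat(frames, axis=0, ignore_index=True)
print(sketch)
```

With the real files, `fake_files.items()` would be replaced by reading each path in `files` with `pd.read_csv(path)`.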


I noticed that the country names were a little messy.   
Let's fix it ...

In [0]:
data['Country'] = data.Country.replace({'Bahamas, The' : 'Bahamas',
                         'Congo (Brazzaville)' : 'Congo',
                         'Congo (Kinshasa)' : 'Congo',
                         "Cote d'Ivoire" : "Cote d'Ivoire",
                         "Curacao" : "Curaçao",
                         'Czech Republic' : 'Czech Republic (Czechia)',
                         'Czechia' : 'Czech Republic (Czechia)',
                         'Faroe Islands' : 'Faeroe Islands',
                         'Macau' : 'Macao',
                         'Mainland China' : 'China',
                         'Palestine' : 'State of Palestine',
                         'Reunion' : 'Réunion',
                         'Saint Kitts and Nevis' : 'Saint Kitts & Nevis',
                         'Sao Tome and Principe' : 'Sao Tome & Principe',
                         'US' : 'United States',
                         'Gambia, The' : 'Gambia',
                         'Hong Kong SAR' : 'Hong Kong',
                         'Korea, South' : 'South Korea',
                         'Macao SAR' : 'Macao',
                         'Taiwan*' : 'Taiwan',
                         'Viet Nam' : 'Vietnam',
                         'West Bank and Gaza' : 'State of Palestine'
                         })

I'm going to fill in the null values of the 'State' and 'Admin2' fields so that I can later group the data correctly

In [0]:
data = data.fillna({'State' : 'NA', 'Admin2' : 'NA'})
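The reason for the placeholder: by default `groupby` silently drops rows whose key is NaN. The sketch below shows this with made-up numbers. The `dropna=False` alternative is an assumption on my side that pandas 1.1 or later is installed.

```python
import numpy as np
import pandas as pd

# Made-up rows: two country-level rows without a State, one with a State
toy = pd.DataFrame({
    "Country": ["Spain", "Spain", "United States"],
    "State": [np.nan, np.nan, "Idaho"],
    "ConfirmedAcum": [10, 5, 7],
})

# Default behaviour: rows with a NaN State disappear from the result
dropped = toy.groupby(["Country", "State"])["ConfirmedAcum"].sum()

# Filling the key with a placeholder keeps them
kept = toy.fillna({"State": "NA"}).groupby(["Country", "State"])["ConfirmedAcum"].sum()

# Alternative (pandas >= 1.1): treat NaN keys as their own group
kept_nan = toy.groupby(["Country", "State"], dropna=False)["ConfirmedAcum"].sum()

print(len(dropped), len(kept), len(kept_nan))
```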

Finally, I am going to keep only the columns that interest me

In [0]:
data = data[['Date', 'Country', 'State', 'Admin2', 'ConfirmedAcum', 'DeathsAcum', 'RecoveredAcum']]

Let's verify the structure of the dataset ...

In [14]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 92516 entries, 0 to 3052
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   Date           92516 non-null  datetime64[ns]
 1   Country        92516 non-null  object        
 2   State          92516 non-null  object        
 3   Admin2         92516 non-null  object        
 4   ConfirmedAcum  92497 non-null  float64       
 5   DeathsAcum     92075 non-null  float64       
 6   RecoveredAcum  92128 non-null  float64       
dtypes: datetime64[ns](1), float64(3), object(3)
memory usage: 5.6+ MB


Wait a sec, I think it would be interesting to have a column with the active cases. Let's create it ...

In [0]:
data['ActiveAcum'] = data.ConfirmedAcum  - data.DeathsAcum - data.RecoveredAcum

In [16]:
data.query("Country == 'Spain'").sort_values('Date', ascending = False).head()

Unnamed: 0,Date,Country,State,Admin2,ConfirmedAcum,DeathsAcum,RecoveredAcum,ActiveAcum
3025,2020-04-18,Spain,,,191726.0,20043.0,74797.0,96886.0
3017,2020-04-17,Spain,,,190839.0,20002.0,74797.0,96040.0
3013,2020-04-16,Spain,,,184948.0,19315.0,74797.0,90836.0
2998,2020-04-15,Spain,,,177644.0,18708.0,70853.0,88083.0
2985,2020-04-14,Spain,,,172541.0,18056.0,67504.0,86981.0


Perfect :-)

Now, I am going to group and summarize the data because I want to be sure that there is only one row per Date, Country, State and Admin2

In [0]:
data = data.groupby(["Date", "Country", "State", "Admin2"]).agg("sum").reset_index()

In [18]:
data.head()

Unnamed: 0,Date,Country,State,Admin2,ConfirmedAcum,DeathsAcum,RecoveredAcum,ActiveAcum
0,2020-01-22,China,Anhui,,1.0,0.0,0.0,0.0
1,2020-01-22,China,Beijing,,14.0,0.0,0.0,0.0
2,2020-01-22,China,Chongqing,,6.0,0.0,0.0,0.0
3,2020-01-22,China,Fujian,,1.0,0.0,0.0,0.0
4,2020-01-22,China,Gansu,,0.0,0.0,0.0,0.0
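One detail in the output above: `ActiveAcum` shows 0.0 even where `ConfirmedAcum` is 1.0. The likely cause is that the early files have no deaths or recovered figures, so `ActiveAcum` was NaN, and `sum` turns an all-NaN group into 0. A minimal sketch with made-up values, where `min_count=1` keeps the NaN instead:

```python
import numpy as np
import pandas as pd

# Made-up rows with a column that is entirely missing
toy = pd.DataFrame({
    "Country": ["China", "China"],
    "State": ["Anhui", "Beijing"],
    "ConfirmedAcum": [1.0, 14.0],
    "DeathsAcum": [np.nan, np.nan],
})
keys = ["Country", "State"]

# Default sum: an all-NaN group collapses to 0.0
as_zero = toy.groupby(keys).sum().reset_index()

# min_count=1 requires at least one non-NaN value, otherwise the result stays NaN
as_nan = toy.groupby(keys).sum(min_count=1).reset_index()

print(as_zero)
print(as_nan)
```

Whether 0 or NaN is the better choice depends on what comes next; for the rankings below, 0 is convenient, but it hides the fact that the value was never reported.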


# Daily Cases
I am going to enrich the data by creating new columns with the daily cases.  

First I create new columns with the cases from the previous day

In [0]:
data = data.sort_values(['State', 'Country', 'Date']).\
            assign(ConfirmedPrevious = data.groupby(['Admin2', 'State', 'Country']).shift(1)["ConfirmedAcum"],
                   DeathsPrevious = data.groupby(['Admin2', 'State', 'Country']).shift(1)["DeathsAcum"],
                   RecoveredPrevious = data.groupby(['Admin2', 'State', 'Country']).shift(1)["RecoveredAcum"],
                   ActivePrevious = data.groupby(['Admin2', 'State', 'Country']).shift(1)["ActiveAcum"],
            ).\
            fillna({ 'ConfirmedPrevious' : 0, 'DeathsPrevious' : 0, 'RecoveredPrevious' : 0 })

In [20]:
data.head()

Unnamed: 0,Date,Country,State,Admin2,ConfirmedAcum,DeathsAcum,RecoveredAcum,ActiveAcum,ConfirmedPrevious,DeathsPrevious,RecoveredPrevious,ActivePrevious
2599,2020-02-28,Canada,"Montreal, QC",,1.0,0.0,0.0,1.0,0.0,0.0,0.0,
2713,2020-02-29,Canada,"Montreal, QC",,1.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0
2834,2020-03-01,Canada,"Montreal, QC",,1.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0
2961,2020-03-02,Canada,"Montreal, QC",,1.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0
3103,2020-03-03,Canada,"Montreal, QC",,1.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0


After that, I create the daily fields by subtracting the previous day's cumulative cases from the current cumulative cases

In [0]:
data = data.assign(Confirmed = data.ConfirmedAcum -  data.ConfirmedPrevious,
            Deaths = data.DeathsAcum - data.DeathsPrevious,
            Recovered = data.RecoveredAcum - data.RecoveredPrevious,
            Active = data.ActiveAcum - data.ActivePrevious
            )
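The shift-and-subtract steps above can also be written with `groupby(...).diff()`. This is only a sketch on made-up numbers, not the code used in this notebook; it groups by country alone to stay short, and it fills the first day with the cumulative value, which mirrors filling the previous value with 0.

```python
import pandas as pd

# Made-up cumulative figures for two countries over two days
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2020-04-16", "2020-04-17", "2020-04-16", "2020-04-17"]),
    "Country": ["Spain", "Spain", "Italy", "Italy"],
    "ConfirmedAcum": [100.0, 130.0, 50.0, 55.0],
})

toy = toy.sort_values(["Country", "Date"])

# diff() within each country gives the day-over-day change;
# the first day has no previous value, so fall back to the cumulative figure
toy["Confirmed"] = (
    toy.groupby("Country")["ConfirmedAcum"].diff().fillna(toy["ConfirmedAcum"])
)
print(toy)
```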

I no longer need the fields I used to make the calculation so I can drop them

In [0]:
data = data.drop(['ConfirmedPrevious', 'DeathsPrevious', 'RecoveredPrevious', 'ActivePrevious'], axis = 1)

Does the data look good?

In [23]:
data.query("Country == 'Spain'")

Unnamed: 0,Date,Country,State,Admin2,ConfirmedAcum,DeathsAcum,RecoveredAcum,ActiveAcum,Confirmed,Deaths,Recovered,Active
545,2020-02-01,Spain,,,1.0,0.0,0.0,1.0,1.0,0.0,0.0,
612,2020-02-02,Spain,,,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
679,2020-02-03,Spain,,,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
749,2020-02-04,Spain,,,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
819,2020-02-05,Spain,,,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...
77526,2020-04-14,Spain,,,172541.0,18056.0,67504.0,86981.0,2442.0,300.0,2777.0,-635.0
80538,2020-04-15,Spain,,,177644.0,18708.0,70853.0,88083.0,5103.0,652.0,3349.0,1102.0
83563,2020-04-16,Spain,,,184948.0,19315.0,74797.0,90836.0,7304.0,607.0,3944.0,2753.0
86603,2020-04-17,Spain,,,190839.0,20002.0,74797.0,96040.0,5891.0,687.0,0.0,5204.0


## Data By Country




So far, we have data by 3 geographical levels: Country, State and a lower level called Admin2

The problem is that not all countries have this level of information, so I will create a new dataset only with the country level data

In [24]:
data_by_country = data.groupby(["Date", "Country"]).agg("sum").reset_index()
data_by_country = data_by_country.sort_values(['Country', 'Date'])
data_by_country.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8744 entries, 848 to 3045
Data columns (total 10 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   Date           8744 non-null   datetime64[ns]
 1   Country        8744 non-null   object        
 2   ConfirmedAcum  8744 non-null   float64       
 3   DeathsAcum     8744 non-null   float64       
 4   RecoveredAcum  8744 non-null   float64       
 5   ActiveAcum     8744 non-null   float64       
 6   Confirmed      8744 non-null   float64       
 7   Deaths         8744 non-null   float64       
 8   Recovered      8744 non-null   float64       
 9   Active         8744 non-null   float64       
dtypes: datetime64[ns](1), float64(8), object(1)
memory usage: 751.4+ KB


In [25]:
data_by_country.head()

Unnamed: 0,Date,Country,ConfirmedAcum,DeathsAcum,RecoveredAcum,ActiveAcum,Confirmed,Deaths,Recovered,Active
848,2020-02-24,Afghanistan,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0
886,2020-02-25,Afghanistan,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
928,2020-02-26,Afghanistan,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
977,2020-02-27,Afghanistan,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
1030,2020-02-28,Afghanistan,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0


In [26]:
data_by_country[data_by_country.Country == 'United States']

Unnamed: 0,Date,Country,ConfirmedAcum,DeathsAcum,RecoveredAcum,ActiveAcum,Confirmed,Deaths,Recovered,Active
7,2020-01-22,United States,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
22,2020-01-23,United States,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
33,2020-01-24,United States,2.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
47,2020-01-25,United States,2.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
62,2020-01-26,United States,5.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...
7999,2020-04-14,United States,607670.0,25831.0,47763.0,534076.0,27062.0,2303.0,4281.0,19813.0
8183,2020-04-15,United States,636350.0,28325.0,52096.0,555929.0,28671.0,2494.0,4333.0,21844.0
8367,2020-04-16,United States,667801.0,32916.0,54703.0,580182.0,31449.0,4591.0,2607.0,24251.0
8551,2020-04-17,United States,699706.0,36773.0,58545.0,604388.0,31976.0,3857.0,3842.0,24277.0


## Cases per million inhabitants



We are going to enrich the information with the number of cases per million inhabitants, so we need population data by country.

A small internet search leads me to a page that has population data for 2020:

https://www.worldometers.info/world-population/population-by-country/

It seems that this information is protected against automatic download, so I have no choice but to download it manually and upload the data to a GitHub repository:

https://github.com/dvillaj/world-population/



I load the data into Pandas, clean it up and keep only the population field


In [0]:
population = pd.read_excel("https://github.com/dvillaj/world-population/blob/master/data/world-popultation-2020.xlsx?raw=true", sheet_name="Data")

In [28]:
population.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 235 entries, 0 to 234
Data columns (total 11 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Country            235 non-null    object 
 1   Population (2020)  235 non-null    int64  
 2   Yearly Change      235 non-null    float64
 3   Net Change         235 non-null    int64  
 4   Density (P/Km²)    235 non-null    float64
 5   Land Area (Km²)    235 non-null    int64  
 6   Migrants (net)     201 non-null    float64
 7   Fertility Rate     201 non-null    float64
 8   Average Age        201 non-null    float64
 9   Urban Pop %        222 non-null    float64
 10  World Share        235 non-null    float64
dtypes: float64(7), int64(3), object(1)
memory usage: 20.3+ KB


In [0]:
population = population.rename(columns = {
    'Population (2020)' : 'Population',
    'Yearly Change' : 'Yearly_Change',
    'Net Change' : 'Net_Change',
    'Density (P/Km²)' : 'Density',
    'Land Area (Km²)' : 'Land_Area',
    'Migrants (net)' : 'Migrants',
    'Fertility Rate' : 'Fertility',
    'Average Age' : 'Mean_Age',
    'Urban Pop %' : 'Urban_Pop',
    'World Share' : 'World_Share'
})

In [0]:
population = population[['Country', 'Population']]

In [43]:
population.head()

Unnamed: 0,Country,Population
0,Afghanistan,38928346
1,Albania,2877797
2,Algeria,43851044
3,American Samoa,55191
4,Andorra,77265


Now I join the population dataset with the country data so that every row carries the population of its country

In [0]:
data_by_country = data_by_country.merge(population, how = 'left', on = 'Country')
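A left join keeps every country row, but countries whose name has no exact match in the population table get NaN. A hedged sketch of how to list them with `indicator=True`; the "Somewhere Else" row and its case count are made up for illustration.

```python
import pandas as pd

# Toy case data; "Somewhere Else" is a hypothetical name with no population entry
cases = pd.DataFrame({"Country": ["Afghanistan", "Albania", "Somewhere Else"],
                      "ConfirmedAcum": [933.0, 548.0, 10.0]})
pop = pd.DataFrame({"Country": ["Afghanistan", "Albania"],
                    "Population": [38928346, 2877797]})

# indicator=True adds a _merge column telling where each row was found
merged = cases.merge(pop, how="left", on="Country", indicator=True)

# Rows found only on the left side have no population
unmatched = merged.loc[merged["_merge"] == "left_only", "Country"].unique().tolist()
print(unmatched)
```

Any name that shows up in such a list is a candidate for the `replace` dictionary used earlier to clean the country names.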

In [45]:
data_by_country.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8744 entries, 0 to 8743
Data columns (total 21 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   Date               8744 non-null   datetime64[ns]
 1   Country            8744 non-null   object        
 2   ConfirmedAcum      8744 non-null   float64       
 3   DeathsAcum         8744 non-null   float64       
 4   RecoveredAcum      8744 non-null   float64       
 5   ActiveAcum         8744 non-null   float64       
 6   Confirmed          8744 non-null   float64       
 7   Deaths             8744 non-null   float64       
 8   Recovered          8744 non-null   float64       
 9   Active             8744 non-null   float64       
 10  Population (2020)  8427 non-null   float64       
 11  Yearly Change      8427 non-null   float64       
 12  Net Change         8427 non-null   float64       
 13  Density (P/Km²)    8427 non-null   float64       
 14  Land Are

In [46]:
data_by_country.head()

Unnamed: 0,Date,Country,ConfirmedAcum,DeathsAcum,RecoveredAcum,ActiveAcum,Confirmed,Deaths,Recovered,Active,Population (2020),Yearly Change,Net Change,Density (P/Km²),Land Area (Km²),Migrants (net),Fertility Rate,Average Age,Urban Pop %,World Share,Population
0,2020-02-24,Afghanistan,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,38928346.0,2.33,886592.0,60.0,652860.0,-62920.0,4.6,18.0,25.0,0.5,38928346.0
1,2020-02-25,Afghanistan,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,38928346.0,2.33,886592.0,60.0,652860.0,-62920.0,4.6,18.0,25.0,0.5,38928346.0
2,2020-02-26,Afghanistan,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,38928346.0,2.33,886592.0,60.0,652860.0,-62920.0,4.6,18.0,25.0,0.5,38928346.0
3,2020-02-27,Afghanistan,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,38928346.0,2.33,886592.0,60.0,652860.0,-62920.0,4.6,18.0,25.0,0.5,38928346.0
4,2020-02-28,Afghanistan,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,38928346.0,2.33,886592.0,60.0,652860.0,-62920.0,4.6,18.0,25.0,0.5,38928346.0


And finally I calculate the number of cases per million inhabitants

In [47]:
data_by_country = data_by_country.assign(ConfirmedAcum_Millon = data_by_country.ConfirmedAcum / data_by_country.Population * 1000000)
data_by_country.head()

Unnamed: 0,Date,Country,ConfirmedAcum,DeathsAcum,RecoveredAcum,ActiveAcum,Confirmed,Deaths,Recovered,Active,Population (2020),Yearly Change,Net Change,Density (P/Km²),Land Area (Km²),Migrants (net),Fertility Rate,Average Age,Urban Pop %,World Share,Population,ConfirmedAcum_Millon
0,2020-02-24,Afghanistan,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,38928346.0,2.33,886592.0,60.0,652860.0,-62920.0,4.6,18.0,25.0,0.5,38928346.0,0.025688
1,2020-02-25,Afghanistan,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,38928346.0,2.33,886592.0,60.0,652860.0,-62920.0,4.6,18.0,25.0,0.5,38928346.0,0.025688
2,2020-02-26,Afghanistan,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,38928346.0,2.33,886592.0,60.0,652860.0,-62920.0,4.6,18.0,25.0,0.5,38928346.0,0.025688
3,2020-02-27,Afghanistan,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,38928346.0,2.33,886592.0,60.0,652860.0,-62920.0,4.6,18.0,25.0,0.5,38928346.0,0.025688
4,2020-02-28,Afghanistan,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,38928346.0,2.33,886592.0,60.0,652860.0,-62920.0,4.6,18.0,25.0,0.5,38928346.0,0.025688
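As a quick sanity check of the formula, the first row can be reproduced by hand: 1 confirmed case over 38,928,346 inhabitants. The second row in this sketch is hypothetical and only shows that a missing population propagates as NaN.

```python
import numpy as np
import pandas as pd

# First row mirrors the Afghanistan figures shown above; second row is made up
toy = pd.DataFrame({
    "Country": ["Afghanistan", "Somewhere Else"],
    "ConfirmedAcum": [1.0, 10.0],
    "Population": [38928346, np.nan],
})

toy["ConfirmedAcum_Millon"] = toy.ConfirmedAcum / toy.Population * 1_000_000

# Round to the six decimals pandas displays by default
per_million = round(float(toy.ConfirmedAcum_Millon.iloc[0]), 6)
print(per_million)
print(toy.ConfirmedAcum_Millon.isna().tolist())
```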


## Rankings

I'm going to create a dataset of last day's cases. 

The goal is to get a set of rankings that tell me the countries with the most cases

So I need a variable that contains the last date of the dataset

In [48]:
last_day = data_by_country.Date.max()
last_day

Timestamp('2020-04-18 00:00:00')

Now I can filter the data by this date

In [49]:
last_day_data = data_by_country[data_by_country.Date == last_day]
last_day_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 184 entries, 54 to 8736
Data columns (total 22 columns):
 #   Column                Non-Null Count  Dtype         
---  ------                --------------  -----         
 0   Date                  184 non-null    datetime64[ns]
 1   Country               184 non-null    object        
 2   ConfirmedAcum         184 non-null    float64       
 3   DeathsAcum            184 non-null    float64       
 4   RecoveredAcum         184 non-null    float64       
 5   ActiveAcum            184 non-null    float64       
 6   Confirmed             184 non-null    float64       
 7   Deaths                184 non-null    float64       
 8   Recovered             184 non-null    float64       
 9   Active                184 non-null    float64       
 10  Population (2020)     178 non-null    float64       
 11  Yearly Change         178 non-null    float64       
 12  Net Change            178 non-null    float64       
 13  Density (P/Km²)   

In [50]:
last_day_data

Unnamed: 0,Date,Country,ConfirmedAcum,DeathsAcum,RecoveredAcum,ActiveAcum,Confirmed,Deaths,Recovered,Active,Population (2020),Yearly Change,Net Change,Density (P/Km²),Land Area (Km²),Migrants (net),Fertility Rate,Average Age,Urban Pop %,World Share,Population,ConfirmedAcum_Millon
54,2020-04-18,Afghanistan,933.0,30.0,112.0,791.0,27.0,0.0,13.0,14.0,38928346.0,2.33,886592.0,60.0,652860.0,-62920.0,4.6,18.0,25.0,0.50,38928346.0,23.967111
95,2020-04-18,Albania,548.0,26.0,302.0,220.0,9.0,0.0,19.0,-10.0,2877797.0,-0.11,-3120.0,105.0,27400.0,-14000.0,1.6,36.0,63.0,0.04,2877797.0,190.423438
149,2020-04-18,Algeria,2534.0,367.0,894.0,1273.0,116.0,3.0,48.0,65.0,43851044.0,1.85,797990.0,18.0,2381740.0,-10000.0,3.1,29.0,73.0,0.56,43851044.0,57.786538
197,2020-04-18,Andorra,704.0,35.0,205.0,464.0,8.0,0.0,14.0,-6.0,77265.0,0.16,123.0,164.0,470.0,,,,88.0,0.00,77265.0,9111.499385
227,2020-04-18,Angola,24.0,2.0,6.0,16.0,5.0,0.0,1.0,4.0,32866272.0,3.27,1040977.0,26.0,1246700.0,6413.0,5.6,17.0,67.0,0.42,32866272.0,0.730232
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8651,2020-04-18,Vietnam,268.0,0.0,201.0,67.0,0.0,0.0,3.0,-3.0,97338579.0,0.91,876473.0,314.0,310070.0,-80000.0,2.1,32.0,38.0,1.25,97338579.0,2.753276
8665,2020-04-18,Western Sahara,6.0,0.0,0.0,6.0,0.0,0.0,0.0,0.0,597339.0,2.55,14876.0,2.0,266000.0,5582.0,2.4,28.0,87.0,0.01,597339.0,10.044548
8674,2020-04-18,Yemen,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,29825964.0,2.28,664042.0,56.0,527970.0,-30000.0,3.8,20.0,38.0,0.38,29825964.0,0.033528
8706,2020-04-18,Zambia,57.0,2.0,33.0,22.0,5.0,0.0,3.0,2.0,18383955.0,2.93,522925.0,25.0,743390.0,-8000.0,4.7,18.0,45.0,0.24,18383955.0,3.100530


Let's assign new columns with the most interesting rankings ...

In [51]:
last_day_data = last_day_data.assign(
    Rank_ConfirmedAcum = last_day_data.ConfirmedAcum.rank(),
    Rank_Confirmed = last_day_data.Confirmed.rank(),
    Rank_ActiveAcum = last_day_data.ActiveAcum.rank(),
    Rank_Active = last_day_data.Active.rank(),
    Rank_ConfirmedAcum_Millon = last_day_data.ConfirmedAcum_Millon.rank()
)
last_day_data

Unnamed: 0,Date,Country,ConfirmedAcum,DeathsAcum,RecoveredAcum,ActiveAcum,Confirmed,Deaths,Recovered,Active,Population (2020),Yearly Change,Net Change,Density (P/Km²),Land Area (Km²),Migrants (net),Fertility Rate,Average Age,Urban Pop %,World Share,Population,ConfirmedAcum_Millon,Rank_ConfirmedAcum,Rank_Confirmed,Rank_ActiveAcum,Rank_Active,Rank_ConfirmedAcum_Millon
54,2020-04-18,Afghanistan,933.0,30.0,112.0,791.0,27.0,0.0,13.0,14.0,38928346.0,2.33,886592.0,60.0,652860.0,-62920.0,4.6,18.0,25.0,0.50,38928346.0,23.967111,106.0,108.5,112.0,117.0,56.0
95,2020-04-18,Albania,548.0,26.0,302.0,220.0,9.0,0.0,19.0,-10.0,2877797.0,-0.11,-3120.0,105.0,27400.0,-14000.0,1.6,36.0,63.0,0.04,2877797.0,190.423438,92.0,82.5,82.0,20.0,109.0
149,2020-04-18,Algeria,2534.0,367.0,894.0,1273.0,116.0,3.0,48.0,65.0,43851044.0,1.85,797990.0,18.0,2381740.0,-10000.0,3.1,29.0,73.0,0.56,43851044.0,57.786538,130.0,142.0,124.0,143.5,79.0
197,2020-04-18,Andorra,704.0,35.0,205.0,464.0,8.0,0.0,14.0,-6.0,77265.0,0.16,123.0,164.0,470.0,,,,88.0,0.00,77265.0,9111.499385,97.0,80.0,96.0,24.5,176.0
227,2020-04-18,Angola,24.0,2.0,6.0,16.0,5.0,0.0,1.0,4.0,32866272.0,3.27,1040977.0,26.0,1246700.0,6413.0,5.6,17.0,67.0,0.42,32866272.0,0.730232,30.0,70.5,26.0,103.0,4.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8651,2020-04-18,Vietnam,268.0,0.0,201.0,67.0,0.0,0.0,3.0,-3.0,97338579.0,0.91,876473.0,314.0,310070.0,-80000.0,2.1,32.0,38.0,1.25,97338579.0,2.753276,74.0,26.0,57.0,32.0,21.0
8665,2020-04-18,Western Sahara,6.0,0.0,0.0,6.0,0.0,0.0,0.0,0.0,597339.0,2.55,14876.0,2.0,266000.0,5582.0,2.4,28.0,87.0,0.01,597339.0,10.044548,6.0,26.0,11.5,66.0,37.0
8674,2020-04-18,Yemen,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,29825964.0,2.28,664042.0,56.0,527970.0,-30000.0,3.8,20.0,38.0,0.38,29825964.0,0.033528,1.0,26.0,1.5,66.0,1.0
8706,2020-04-18,Zambia,57.0,2.0,33.0,22.0,5.0,0.0,3.0,2.0,18383955.0,2.93,522925.0,25.0,743390.0,-8000.0,4.7,18.0,45.0,0.24,18383955.0,3.100530,46.0,70.5,35.0,97.0,23.0
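Two things are worth knowing about `rank()` as used above: it ranks in ascending order (the largest value gets the highest rank), and tied values share the average of their positions, which is where half ranks such as 108.5 come from. A small sketch with made-up countries; `nlargest` is shown as one possible shortcut when only the top N is needed.

```python
import pandas as pd

# Made-up values with a tie between B and C
toy = pd.DataFrame({"Country": ["A", "B", "C", "D"],
                    "Confirmed": [0.0, 5.0, 5.0, 9.0]})

# Default rank: ascending, ties share the average of their positions
toy["Rank_Confirmed"] = toy.Confirmed.rank()

# Top-N without an intermediate rank column
top2 = toy.nlargest(2, "Confirmed")["Country"].tolist()

print(toy)
print(top2)
```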


Which countries have the most confirmed cases?

In [52]:
last_day_data.sort_values('Rank_ConfirmedAcum', ascending = False)[['Country', 'ConfirmedAcum']].reset_index(drop = True).head(10)

Unnamed: 0,Country,ConfirmedAcum
0,United States,732197.0
1,Spain,191726.0
2,Italy,175925.0
3,France,149149.0
4,Germany,143342.0
5,United Kingdom,115314.0
6,China,83787.0
7,Turkey,82329.0
8,Iran,80868.0
9,Belgium,37183.0


Which countries had the most new confirmed cases on the last day?

In [53]:
last_day_data.sort_values('Rank_Confirmed', ascending = False)[['Country', 'Confirmed']].reset_index(drop = True).head(10)

Unnamed: 0,Country,Confirmed
0,United States,32491.0
1,United Kingdom,5545.0
2,Russia,4785.0
3,Turkey,3783.0
4,Italy,3491.0
5,Brazil,2976.0
6,Germany,1945.0
7,Canada,1542.0
8,Iran,1374.0
9,India,1370.0


Which countries have the most active cases?

In [54]:
last_day_data.sort_values('Rank_ActiveAcum', ascending = False)[['Country', 'ActiveAcum']].reset_index(drop = True).head(10)

Unnamed: 0,Country,ActiveAcum
0,United States,628693.0
1,Italy,107771.0
2,United Kingdom,99402.0
3,Spain,96886.0
4,France,93217.0
5,Turkey,69986.0
6,Germany,53483.0
7,Russia,33423.0
8,Netherlands,27836.0
9,Belgium,23382.0


Which countries had the largest increase in active cases on the last day?

In [55]:
last_day_data.sort_values('Rank_Active', ascending = False)[['Country', 'Active']].reset_index(drop = True).head(10)

Unnamed: 0,Country,Active
0,United States,24305.0
1,United Kingdom,4634.0
2,Russia,4278.0
3,Brazil,2763.0
4,Turkey,1840.0
5,Canada,1078.0
6,Netherlands,1003.0
7,India,913.0
8,Singapore,910.0
9,Saudi Arabia,847.0


Which countries have the most confirmed cases per million inhabitants?

In [56]:
last_day_data.sort_values('Rank_ConfirmedAcum_Millon', ascending = False)[['Country', 'ConfirmedAcum_Millon']].reset_index(drop = True).head(20)

Unnamed: 0,Country,ConfirmedAcum_Millon
0,San Marino,13409.566473
1,Holy See,9987.515605
2,Andorra,9111.499385
3,Luxembourg,5650.358319
4,Iceland,5157.614955
5,Spain,4100.671807
6,Belgium,3208.301081
7,Switzerland,3166.400566
8,Ireland,2988.788903
9,Italy,2909.68718


## Saving the clean dataset

Finally I will create an Excel file with the information per country once it is clean and in perfect condition to apply some machine learning algorithms ...

In [0]:
data_by_country.to_excel("All_data.xlsx", index = False)