# Pandas and Covid-19

This notebook is an example of data analysis and manipulation with Pandas.

Enjoy it!

In [0]:
import numpy as np
import pandas as pd

## The Data

To get some data, I am going to download it from the Data Repository by Johns Hopkins CSSE:

https://github.com/CSSEGISandData/COVID-19

I first remove the folder where I am going to store the data, so I can re-execute these cells without any problems ...

In [0]:
!rm -rf ./COVID-19

The dataset is available on GitHub, so I use the `git` command to get it

In [72]:
!git clone https://github.com/CSSEGISandData/COVID-19.git

Cloning into 'COVID-19'...
remote: Enumerating objects: 14, done.
remote: Counting objects: 100% (14/14), done.
remote: Compressing objects: 100% (9/9), done.
remote: Total 20736 (delta 5), reused 10 (delta 5), pack-reused 20722
Receiving objects: 100% (20736/20736), 84.99 MiB | 12.33 MiB/s, done.
Resolving deltas: 100% (10977/10977), done.


## Exploring the data


In [73]:
ls -lt ./COVID-19/csse_covid_19_data/csse_covid_19_daily_reports | head

total 8608
-rw-r--r-- 1 root root 314848 Apr 18 05:40 04-17-2020.csv
-rw-r--r-- 1 root root      0 Apr 18 05:40 README.md
-rw-r--r-- 1 root root 312551 Apr 18 05:40 04-15-2020.csv
-rw-r--r-- 1 root root 314226 Apr 18 05:40 04-16-2020.csv
-rw-r--r-- 1 root root 305548 Apr 18 05:40 04-12-2020.csv
-rw-r--r-- 1 root root 309742 Apr 18 05:40 04-13-2020.csv
-rw-r--r-- 1 root root 311068 Apr 18 05:40 04-14-2020.csv
-rw-r--r-- 1 root root 303921 Apr 18 05:40 04-11-2020.csv
-rw-r--r-- 1 root root 301216 Apr 18 05:40 04-10-2020.csv


Yes!!!  
We have data files ...

Perfect. Let's explore the first dataset generated ...

In [0]:
first = pd.read_csv("./COVID-19/csse_covid_19_data/csse_covid_19_daily_reports/01-22-2020.csv")

In [125]:
first.head()

Unnamed: 0,Province/State,Country/Region,Last Update,Confirmed,Deaths,Recovered
0,Anhui,Mainland China,1/22/2020 17:00,1.0,,
1,Beijing,Mainland China,1/22/2020 17:00,14.0,,
2,Chongqing,Mainland China,1/22/2020 17:00,6.0,,
3,Fujian,Mainland China,1/22/2020 17:00,1.0,,
4,Gansu,Mainland China,1/22/2020 17:00,,,


And one of the last ones ...

In [0]:
last = pd.read_csv("./COVID-19/csse_covid_19_data/csse_covid_19_daily_reports/04-17-2020.csv")

In [127]:
last.head()

Unnamed: 0,FIPS,Admin2,Province_State,Country_Region,Last_Update,Lat,Long_,Confirmed,Deaths,Recovered,Active,Combined_Key
0,45001.0,Abbeville,South Carolina,US,2020-04-17 23:30:52,34.223334,-82.461707,10,0,0,10,"Abbeville, South Carolina, US"
1,22001.0,Acadia,Louisiana,US,2020-04-17 23:30:52,30.295065,-92.414197,110,6,0,104,"Acadia, Louisiana, US"
2,51001.0,Accomack,Virginia,US,2020-04-17 23:30:52,37.767072,-75.632346,28,0,0,28,"Accomack, Virginia, US"
3,16001.0,Ada,Idaho,US,2020-04-17 23:30:52,43.452658,-116.241552,576,9,0,567,"Ada, Idaho, US"
4,19001.0,Adair,Iowa,US,2020-04-17 23:30:52,41.330756,-94.471059,1,0,0,1,"Adair, Iowa, US"


Can I concatenate both datasets?

In [128]:
pd.concat((first, last), axis = 0)

Unnamed: 0,Province/State,Country/Region,Last Update,Confirmed,Deaths,Recovered,FIPS,Admin2,Province_State,Country_Region,Last_Update,Lat,Long_,Active,Combined_Key
0,Anhui,Mainland China,1/22/2020 17:00,1.0,,,,,,,,,,,
1,Beijing,Mainland China,1/22/2020 17:00,14.0,,,,,,,,,,,
2,Chongqing,Mainland China,1/22/2020 17:00,6.0,,,,,,,,,,,
3,Fujian,Mainland China,1/22/2020 17:00,1.0,,,,,,,,,,,
4,Gansu,Mainland China,1/22/2020 17:00,,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3040,,,,402.0,2.0,69.0,,,,West Bank and Gaza,2020-04-17 23:30:32,31.952200,35.233200,331.0,West Bank and Gaza
3041,,,,6.0,0.0,0.0,,,,Western Sahara,2020-04-17 23:30:32,24.215500,-12.885800,6.0,Western Sahara
3042,,,,1.0,0.0,0.0,,,,Yemen,2020-04-17 23:30:32,15.552727,48.516388,1.0,Yemen
3043,,,,52.0,2.0,30.0,,,,Zambia,2020-04-17 23:30:32,-13.133897,27.849332,20.0,Zambia


Oops! The column names don't match :-(
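
A quick way to see exactly which names differ is to compare the column sets of both DataFrames. The sketch below uses two empty, hand-made frames (`old_format` and `new_format` are illustrative stand-ins, not the real files) that carry the same headers as the two reports shown above.

```python
import pandas as pd

# Empty stand-ins for the January and April report layouts (illustrative only)
old_format = pd.DataFrame(columns=["Province/State", "Country/Region", "Last Update",
                                   "Confirmed", "Deaths", "Recovered"])
new_format = pd.DataFrame(columns=["FIPS", "Admin2", "Province_State", "Country_Region",
                                   "Last_Update", "Lat", "Long_", "Confirmed", "Deaths",
                                   "Recovered", "Active", "Combined_Key"])

# Columns present in both layouts
shared = sorted(set(old_format.columns) & set(new_format.columns))
# Columns that exist in only one of the layouts
only_old = sorted(set(old_format.columns) - set(new_format.columns))
only_new = sorted(set(new_format.columns) - set(old_format.columns))

print("shared:  ", shared)
print("only old:", only_old)
print("only new:", only_new)
```

Only the three count columns are shared, which is why the concatenation above ends up with so many empty cells.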

## Loading the data into Pandas and cleaning it

In [0]:
import glob
import os

files = glob.glob("./COVID-19/csse_covid_19_data/csse_covid_19_daily_reports/*.csv")
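# Sort chronologically by the date encoded in each file name (MM-DD-YYYY.csv);
# after a fresh clone the modification times are all the same, so they cannot be used for ordering
files.sort(key=lambda f: pd.to_datetime(os.path.basename(f)[:-4], format='%m-%d-%Y'))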

We are going to:
- Create an empty DataFrame to store all the data
- Load every file, unifying the column names so the datasets can be concatenated without any problem
- Remove extra blank spaces from the country field
- Add the report date, taken from the file name, as a proper datetime column

In [0]:
data = pd.DataFrame()
for file in files:
  # Map both the old and the new header names to a single set of column names
  df = pd.read_csv(file).rename(columns = {'Province/State' : 'State',
                        "Country/Region" : 'Country',
                        'Province_State' : 'State',
                        "Country_Region" : 'Country',
                        'Last Update' : 'Last_Update',
                        'Confirmed' : 'ConfirmedAcum',
                        'Deaths' : 'DeathsAcum',
                        'Recovered' : 'RecoveredAcum'})
  # The last 14 characters of the path are MM-DD-YYYY.csv, so [-14:-4] is the report date
  df = df.assign(Date = pd.to_datetime(file[-14:-4], format = '%m-%d-%Y'),
                 Country = df.Country.str.strip())
  data = pd.concat((data, df), axis = 0)
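
Growing a DataFrame inside a loop copies the accumulated data on every iteration. As an alternative sketch, the frames can be collected in a list and concatenated once at the end. The example below is self-contained: `fake_files` holds two made-up in-memory CSVs that stand in for the daily reports, and only a few of the columns are included.

```python
import io
import pandas as pd

# Two in-memory CSVs standing in for daily report files (made-up values)
fake_files = {
    "01-22-2020.csv": "Province/State,Country/Region,Confirmed\nAnhui,Mainland China,1\n",
    "04-17-2020.csv": "Province_State,Country_Region,Confirmed\nIdaho, US,576\n",
}

renames = {"Province/State": "State", "Province_State": "State",
           "Country/Region": "Country", "Country_Region": "Country"}

frames = []
for name, content in fake_files.items():
    df = pd.read_csv(io.StringIO(content)).rename(columns=renames)
    # The first 10 characters of the file name hold the report date
    df = df.assign(Date=pd.to_datetime(name[:10], format="%m-%d-%Y"),
                   Country=df.Country.str.strip())
    frames.append(df)

# One concat at the end; ignore_index gives a clean 0..n-1 index
combined = pd.concat(frames, ignore_index=True)
print(combined)
```

Passing `ignore_index=True` also avoids the repeated index labels that appear when each file's own 0..n index is kept.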


I noticed that the country names were a little messy.   
Let's fix it ...

In [0]:
data['Country'] = data.Country.replace({'Bahamas, The' : 'Bahamas',
                         'Congo (Brazzaville)' : 'Congo',
                         'Congo (Kinshasa)' : 'Congo',
                         "Cote d'Ivoire" : "Cote d'Ivoire",
                         "Curacao" : "Curaçao",
                         'Czech Republic' : 'Czech Republic (Czechia)',
                         'Czechia' : 'Czech Republic (Czechia)',
                         'Faroe Islands' : 'Faeroe Islands',
                         'Macau' : 'Macao',
                         'Mainland China' : 'China',
                         'Palestine' : 'State of Palestine',
                         'Reunion' : 'Réunion',
                         'Saint Kitts and Nevis' : 'Saint Kitts & Nevis',
                         'Sao Tome and Principe' : 'Sao Tome & Principe',
                         'US' : 'United States',
                         'Gambia, The' : 'Gambia',
                         'Hong Kong SAR' : 'Hong Kong',
                         'Korea, South' : 'South Korea',
                         'Macao SAR' : 'Macao',
                         'Taiwan*' : 'Taiwan',
                         'Viet Nam' : 'Vietnam',
                         'West Bank and Gaza' : 'State of Palestine'
                         })

I'm going to fill in the null values of the 'State' and 'Admin2' fields so that I can later group the data correctly (by default, `groupby` drops rows whose grouping keys are null)

In [0]:
data = data.fillna({'State' : 'NA', 'Admin2' : 'NA'})
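
To illustrate why the placeholder matters, here is a minimal sketch with a made-up three-row frame (`toy` and its values are invented for the example). With the default settings, pandas leaves out any group whose key is NaN.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Country": ["Spain", "United States", "United States"],
    "State": [np.nan, "Idaho", "Idaho"],
    "ConfirmedAcum": [10.0, 3.0, 4.0],
})

# Default behaviour: the Spain row (State is NaN) is silently dropped
dropped = toy.groupby(["Country", "State"]).sum().reset_index()

# After filling the key with a placeholder, every row is kept
kept = toy.fillna({"State": "NA"}).groupby(["Country", "State"]).sum().reset_index()

print(len(dropped), len(kept))
```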

Finally, I keep only the columns that interest me

In [0]:
data = data[['Date', 'Country', 'State', 'Admin2', 'ConfirmedAcum', 'DeathsAcum', 'RecoveredAcum']]

Let's verify the structure of the dataset ...

In [134]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 89463 entries, 0 to 3044
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   Date           89463 non-null  datetime64[ns]
 1   Country        89463 non-null  object        
 2   State          89463 non-null  object        
 3   Admin2         89463 non-null  object        
 4   ConfirmedAcum  89444 non-null  float64       
 5   DeathsAcum     89022 non-null  float64       
 6   RecoveredAcum  89075 non-null  float64       
dtypes: datetime64[ns](1), float64(3), object(3)
memory usage: 5.5+ MB


Wait a sec, I think it would be interesting to have a column with the active cases. Let's create it ...

In [0]:
data['ActiveAcum'] = data.ConfirmedAcum  - data.DeathsAcum - data.RecoveredAcum

In [136]:
data.query("Country == 'Spain'").sort_values('Date', ascending = False).head()

Unnamed: 0,Date,Country,State,Admin2,ConfirmedAcum,DeathsAcum,RecoveredAcum,ActiveAcum
3017,2020-04-17,Spain,,,190839.0,20002.0,74797.0,96040.0
3013,2020-04-16,Spain,,,184948.0,19315.0,74797.0,90836.0
2998,2020-04-15,Spain,,,177644.0,18708.0,70853.0,88083.0
2985,2020-04-14,Spain,,,172541.0,18056.0,67504.0,86981.0
2973,2020-04-13,Spain,,,170099.0,17756.0,64727.0,87616.0


Perfect :-)

Now, I am going to group and summarize the data because I want to be sure that there is only one row per Date, Country, State and Admin2

In [0]:
data = data.groupby(["Date", "Country", "State", "Admin2"]).agg("sum").reset_index()

In [138]:
data.head()

Unnamed: 0,Date,Country,State,Admin2,ConfirmedAcum,DeathsAcum,RecoveredAcum,ActiveAcum
0,2020-01-22,China,Anhui,,1.0,0.0,0.0,0.0
1,2020-01-22,China,Beijing,,14.0,0.0,0.0,0.0
2,2020-01-22,China,Chongqing,,6.0,0.0,0.0,0.0
3,2020-01-22,China,Fujian,,1.0,0.0,0.0,0.0
4,2020-01-22,China,Gansu,,0.0,0.0,0.0,0.0
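
The `ActiveAcum` of 0.0 next to one confirmed case in Anhui appears to be a side effect of missing values: the first report has no deaths or recoveries for that row, the subtraction returns NaN, and `sum` skips NaN when grouping. The sketch below reproduces this on a one-row toy frame and shows one possible alternative (an assumption, not what the notebook does): treating missing counts as zero before subtracting.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "State": ["Anhui"],
    "ConfirmedAcum": [1.0],
    "DeathsAcum": [np.nan],
    "RecoveredAcum": [np.nan],
})

# NaN propagates through the subtraction ...
active_raw = toy.ConfirmedAcum - toy.DeathsAcum - toy.RecoveredAcum
# ... and sum() skips NaN, so the grouped total becomes 0.0
active_grouped = toy.assign(ActiveAcum=active_raw).groupby("State")["ActiveAcum"].sum()

# Hypothetical alternative: treat missing counts as zero before subtracting
counts = toy[["ConfirmedAcum", "DeathsAcum", "RecoveredAcum"]].fillna(0)
active_filled = counts.ConfirmedAcum - counts.DeathsAcum - counts.RecoveredAcum

print(active_grouped["Anhui"], active_filled.iloc[0])
```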


## Daily Cases
I am going to enrich the data by creating new columns with the daily cases.  

First I create new columns with the cases from the previous day

In [0]:
# shift(1) within each location returns the previous day's cumulative values.
# Note: ActivePrevious is not filled with 0 below, so Active stays NaN on each location's first day.
data = data.sort_values(['State', 'Country', 'Date']).\
            assign(ConfirmedPrevious = data.groupby(['Admin2', 'State', 'Country']).shift(1)["ConfirmedAcum"],
                   DeathsPrevious = data.groupby(['Admin2', 'State', 'Country']).shift(1)["DeathsAcum"],
                   RecoveredPrevious = data.groupby(['Admin2', 'State', 'Country']).shift(1)["RecoveredAcum"],
                   ActivePrevious = data.groupby(['Admin2', 'State', 'Country']).shift(1)["ActiveAcum"],
            ).\
            fillna({ 'ConfirmedPrevious' : 0, 'DeathsPrevious' : 0, 'RecoveredPrevious' : 0 })

In [142]:
data.head()

Unnamed: 0,Date,Country,State,Admin2,ConfirmedAcum,DeathsAcum,RecoveredAcum,ActiveAcum,ConfirmedPrevious,DeathsPrevious,RecoveredPrevious,ActivePrevious
2599,2020-02-28,Canada,"Montreal, QC",,1.0,0.0,0.0,1.0,0.0,0.0,0.0,
2713,2020-02-29,Canada,"Montreal, QC",,1.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0
2834,2020-03-01,Canada,"Montreal, QC",,1.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0
2961,2020-03-02,Canada,"Montreal, QC",,1.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0
3103,2020-03-03,Canada,"Montreal, QC",,1.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0


After that, I compute the daily fields by subtracting the previous day's cumulative cases from the current cumulative cases

In [0]:
data = data.assign(Confirmed = data.ConfirmedAcum -  data.ConfirmedPrevious,
            Deaths = data.DeathsAcum - data.DeathsPrevious,
            Recovered = data.RecoveredAcum - data.RecoveredPrevious,
            Active = data.ActiveAcum - data.ActivePrevious
            )
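
The same "current minus previous" calculation can also be sketched with `groupby(...).diff()`, which avoids the helper columns. This is an alternative illustration on a made-up frame (`toy` and its numbers are invented), not the approach used in this notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    "Date": pd.to_datetime(["2020-02-01", "2020-02-02", "2020-02-03"] * 2),
    "Country": ["Spain"] * 3 + ["Italy"] * 3,
    "ConfirmedAcum": [1.0, 1.0, 4.0, 2.0, 5.0, 9.0],
})

toy = toy.sort_values(["Country", "Date"])
# diff() within each country is the cumulative value minus the previous day's value
daily = toy.groupby("Country")["ConfirmedAcum"].diff()
# The first day of each country has no previous row; use the cumulative value itself
toy["Confirmed"] = daily.fillna(toy["ConfirmedAcum"])

print(toy)
```

Filling the first day with the cumulative value is equivalent to assuming zero cases on the day before the first report.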

I no longer need the fields I used to make the calculation so I can drop them

In [0]:
data = data.drop(['ConfirmedPrevious', 'DeathsPrevious', 'RecoveredPrevious', 'ActivePrevious'], axis = 1)

Does the data look good?

In [145]:
data.query("Country == 'Spain'")

Unnamed: 0,Date,Country,State,Admin2,ConfirmedAcum,DeathsAcum,RecoveredAcum,ActiveAcum,Confirmed,Deaths,Recovered,Active
545,2020-02-01,Spain,,,1.0,0.0,0.0,1.0,1.0,0.0,0.0,
612,2020-02-02,Spain,,,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
679,2020-02-03,Spain,,,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
749,2020-02-04,Spain,,,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
819,2020-02-05,Spain,,,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...
74526,2020-04-13,Spain,,,170099.0,17756.0,64727.0,87616.0,3268.0,547.0,2336.0,385.0
77526,2020-04-14,Spain,,,172541.0,18056.0,67504.0,86981.0,2442.0,300.0,2777.0,-635.0
80538,2020-04-15,Spain,,,177644.0,18708.0,70853.0,88083.0,5103.0,652.0,3349.0,1102.0
83563,2020-04-16,Spain,,,184948.0,19315.0,74797.0,90836.0,7304.0,607.0,3944.0,2753.0


## Data By Country




So far, we have data at three geographical levels: Country, State and a lower level called Admin2

The problem is that not all countries have this level of information, so I will create a new dataset only with the country level data

In [216]:
data_by_country = data.groupby(["Date", "Country"]).agg("sum").reset_index()
data_by_country = data_by_country.sort_values(['Country', 'Date'])
data_by_country.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8560 entries, 848 to 3045
Data columns (total 10 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   Date           8560 non-null   datetime64[ns]
 1   Country        8560 non-null   object        
 2   ConfirmedAcum  8560 non-null   float64       
 3   DeathsAcum     8560 non-null   float64       
 4   RecoveredAcum  8560 non-null   float64       
 5   ActiveAcum     8560 non-null   float64       
 6   Confirmed      8560 non-null   float64       
 7   Deaths         8560 non-null   float64       
 8   Recovered      8560 non-null   float64       
 9   Active         8560 non-null   float64       
dtypes: datetime64[ns](1), float64(8), object(1)
memory usage: 735.6+ KB


In [217]:
data_by_country.head()

Unnamed: 0,Date,Country,ConfirmedAcum,DeathsAcum,RecoveredAcum,ActiveAcum,Confirmed,Deaths,Recovered,Active
848,2020-02-24,Afghanistan,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0
886,2020-02-25,Afghanistan,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
928,2020-02-26,Afghanistan,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
977,2020-02-27,Afghanistan,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
1030,2020-02-28,Afghanistan,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0


In [218]:
data_by_country[data_by_country.Country == 'United States']

Unnamed: 0,Date,Country,ConfirmedAcum,DeathsAcum,RecoveredAcum,ActiveAcum,Confirmed,Deaths,Recovered,Active
7,2020-01-22,United States,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
22,2020-01-23,United States,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
33,2020-01-24,United States,2.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
47,2020-01-25,United States,2.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
62,2020-01-26,United States,5.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...
7815,2020-04-13,United States,580619.0,23528.0,43482.0,513609.0,25310.0,1514.0,10494.0,12919.0
7999,2020-04-14,United States,607670.0,25831.0,47763.0,534076.0,27062.0,2303.0,4281.0,19813.0
8183,2020-04-15,United States,636350.0,28325.0,52096.0,555929.0,28671.0,2494.0,4333.0,21844.0
8367,2020-04-16,United States,667801.0,32916.0,54703.0,580182.0,31449.0,4591.0,2607.0,24251.0


## Cases per million inhabitants



We are going to enrich the information with the number of cases per million inhabitants, so we need population data by country.

A small internet search leads me to a page that has population data for 2020:

https://www.worldometers.info/world-population/population-by-country/

It seems that this page is protected against automated downloads, so I have no choice but to download the data manually and upload it to a GitHub repository:

https://github.com/dvillaj/world-population/



I load the data into Pandas, clean it up and keep only the population field


In [0]:
population = pd.read_excel("https://github.com/dvillaj/world-population/blob/master/data/world-popultation-2020.xlsx?raw=true", sheet_name="Data")

In [235]:
population.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 235 entries, 0 to 234
Data columns (total 11 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Country            235 non-null    object 
 1   Population (2020)  235 non-null    int64  
 2   Yearly Change      235 non-null    float64
 3   Net Change         235 non-null    int64  
 4   Density (P/Km²)    235 non-null    float64
 5   Land Area (Km²)    235 non-null    int64  
 6   Migrants (net)     201 non-null    float64
 7   Fertility Rate     201 non-null    float64
 8   Average Age        201 non-null    float64
 9   Urban Pop %        222 non-null    float64
 10  World Share        235 non-null    float64
dtypes: float64(7), int64(3), object(1)
memory usage: 20.3+ KB


In [0]:
population = population.rename(columns = {
    'Population (2020)' : 'Population',
    'Yearly Change' : 'Yearly_Change',
    'Net Change' : 'Net_Change',
    'Density (P/Km²)' : 'Density',
    'Land Area (Km²)' : 'Land_Area',
    'Migrants (net)' : 'Migrants',
    'Fertility Rate' : 'Fertility',
    'Average Age' : 'Mean_Age',
    'Urban Pop %' : 'Urban_Pop',
    'World Share' : 'World_Share'
})

In [0]:
population = population[['Country', 'Population']]

In [223]:
population.head()

Unnamed: 0,Country,Population
0,Afghanistan,38928346
1,Albania,2877797
2,Algeria,43851044
3,American Samoa,55191
4,Andorra,77265


Now I join the population Dataset with the country data to have the population in this dataset

In [0]:
data_by_country = data_by_country.merge(population, how = 'left', on = 'Country')

In [225]:
data_by_country.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8560 entries, 0 to 8559
Data columns (total 11 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   Date           8560 non-null   datetime64[ns]
 1   Country        8560 non-null   object        
 2   ConfirmedAcum  8560 non-null   float64       
 3   DeathsAcum     8560 non-null   float64       
 4   RecoveredAcum  8560 non-null   float64       
 5   ActiveAcum     8560 non-null   float64       
 6   Confirmed      8560 non-null   float64       
 7   Deaths         8560 non-null   float64       
 8   Recovered      8560 non-null   float64       
 9   Active         8560 non-null   float64       
 10  Population     8249 non-null   float64       
dtypes: datetime64[ns](1), float64(9), object(1)
memory usage: 802.5+ KB
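
`Population` has fewer non-null values than the other columns, so some country names did not find a match in the population table. One way to list them is the `indicator` option of `merge`. The sketch below uses invented frames (`cases`, `people`, and the name "Atlantis" are placeholders for the example).

```python
import pandas as pd

cases = pd.DataFrame({"Country": ["Spain", "Atlantis", "Spain"],
                      "ConfirmedAcum": [1.0, 2.0, 3.0]})
people = pd.DataFrame({"Country": ["Spain", "Italy"],
                       "Population": [100, 200]})  # made-up numbers

# indicator=True adds a _merge column: 'both' or 'left_only' for a left join
merged = cases.merge(people, how="left", on="Country", indicator=True)
unmatched = sorted(merged.loc[merged["_merge"] == "left_only", "Country"].unique())

print(unmatched)
```

The resulting names are candidates for extra entries in the country-name replacement dictionary used earlier.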


In [226]:
data_by_country.head()

Unnamed: 0,Date,Country,ConfirmedAcum,DeathsAcum,RecoveredAcum,ActiveAcum,Confirmed,Deaths,Recovered,Active,Population
0,2020-02-24,Afghanistan,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,38928346.0
1,2020-02-25,Afghanistan,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,38928346.0
2,2020-02-26,Afghanistan,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,38928346.0
3,2020-02-27,Afghanistan,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,38928346.0
4,2020-02-28,Afghanistan,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,38928346.0


And finally I calculate the number of cases per million inhabitants

In [237]:
data_by_country = data_by_country.assign(ConfirmedAcum_Millon = data_by_country.ConfirmedAcum / data_by_country.Population * 1000000)
data_by_country.head()

Unnamed: 0,Date,Country,ConfirmedAcum,DeathsAcum,RecoveredAcum,ActiveAcum,Confirmed,Deaths,Recovered,Active,Population,ConfirmedAcum_Millon
0,2020-02-24,Afghanistan,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,38928346.0,0.025688
1,2020-02-25,Afghanistan,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,38928346.0,0.025688
2,2020-02-26,Afghanistan,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,38928346.0,0.025688
3,2020-02-27,Afghanistan,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,38928346.0,0.025688
4,2020-02-28,Afghanistan,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,38928346.0,0.025688
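
As a quick sanity check of the formula, the first row can be recomputed by hand. The helper `per_million` below is a hypothetical name introduced only for this sketch.

```python
def per_million(cases: float, population: float) -> float:
    """Return the number of cases per one million inhabitants."""
    return cases / population * 1_000_000

# Afghanistan on 2020-02-24: 1 confirmed case, population 38,928,346
rate = per_million(1, 38_928_346)
print(round(rate, 6))
```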


## Rankings

I'm going to create a dataset of last day's cases. 

The goal is to get a set of rankings that tell me the countries with the most cases

So I need a variable that contains the last date of the dataset

In [239]:
last_day = data_by_country.Date.max()
last_day

Timestamp('2020-04-17 00:00:00')

Now I can filter the data by this date

In [229]:
last_day_data = data_by_country[data_by_country.Date == last_day]
last_day_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 184 entries, 53 to 8552
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype         
---  ------                --------------  -----         
 0   Date                  184 non-null    datetime64[ns]
 1   Country               184 non-null    object        
 2   ConfirmedAcum         184 non-null    float64       
 3   DeathsAcum            184 non-null    float64       
 4   RecoveredAcum         184 non-null    float64       
 5   ActiveAcum            184 non-null    float64       
 6   Confirmed             184 non-null    float64       
 7   Deaths                184 non-null    float64       
 8   Recovered             184 non-null    float64       
 9   Active                184 non-null    float64       
 10  Population            178 non-null    float64       
 11  ConfirmedAcum_Millon  178 non-null    float64       
dtypes: datetime64[ns](1), float64(10), object(1)
memory usage: 18.7+ KB


In [230]:
last_day_data

Unnamed: 0,Date,Country,ConfirmedAcum,DeathsAcum,RecoveredAcum,ActiveAcum,Confirmed,Deaths,Recovered,Active,Population,ConfirmedAcum_Millon
53,2020-04-17,Afghanistan,906.0,30.0,99.0,777.0,66.0,0.0,45.0,21.0,38928346.0,23.273529
93,2020-04-17,Albania,539.0,26.0,283.0,230.0,21.0,0.0,6.0,15.0,2877797.0,187.296046
146,2020-04-17,Algeria,2418.0,364.0,846.0,1208.0,150.0,16.0,63.0,71.0,43851044.0,55.141219
193,2020-04-17,Andorra,696.0,35.0,191.0,470.0,23.0,2.0,22.0,-1.0,77265.0,9007.959619
222,2020-04-17,Angola,19.0,2.0,5.0,12.0,0.0,0.0,0.0,0.0,32866272.0,0.578100
...,...,...,...,...,...,...,...,...,...,...,...,...
8471,2020-04-17,Vietnam,268.0,0.0,198.0,70.0,0.0,0.0,21.0,-21.0,97338579.0,2.753276
8484,2020-04-17,Western Sahara,6.0,0.0,0.0,6.0,0.0,0.0,0.0,0.0,597339.0,10.044548
8492,2020-04-17,Yemen,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,29825964.0,0.033528
8523,2020-04-17,Zambia,52.0,2.0,30.0,20.0,4.0,0.0,0.0,4.0,18383955.0,2.828553


Let's assign new columns with the most interesting rankings ...

In [231]:
last_day_data = last_day_data.assign(
    Rank_ConfirmedAcum = last_day_data.ConfirmedAcum.rank(),
    Rank_Confirmed = last_day_data.Confirmed.rank(),
    Rank_ActiveAcum = last_day_data.ActiveAcum.rank(),
    Rank_Active = last_day_data.Active.rank(),
    Rank_ConfirmedAcum_Millon = last_day_data.ConfirmedAcum_Millon.rank()
)
last_day_data

Unnamed: 0,Date,Country,ConfirmedAcum,DeathsAcum,RecoveredAcum,ActiveAcum,Confirmed,Deaths,Recovered,Active,Population,ConfirmedAcum_Millon,Rank_ConfirmedAcum,Rank_Confirmed,Rank_ActiveAcum,Rank_Active,Rank_ConfirmedAcum_Millon
53,2020-04-17,Afghanistan,906.0,30.0,99.0,777.0,66.0,0.0,45.0,21.0,38928346.0,23.273529,106.0,127.0,109.0,122.0,57.0
93,2020-04-17,Albania,539.0,26.0,283.0,230.0,21.0,0.0,6.0,15.0,2877797.0,187.296046,92.0,90.5,82.0,117.0,109.0
146,2020-04-17,Algeria,2418.0,364.0,846.0,1208.0,150.0,16.0,63.0,71.0,43851044.0,55.141219,130.0,140.0,124.0,143.0,79.0
193,2020-04-17,Andorra,696.0,35.0,191.0,470.0,23.0,2.0,22.0,-1.0,77265.0,9007.959619,100.0,95.0,95.0,39.0,176.0
222,2020-04-17,Angola,19.0,2.0,5.0,12.0,0.0,0.0,0.0,0.0,32866272.0,0.578100,28.5,26.5,22.5,66.0,4.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8471,2020-04-17,Vietnam,268.0,0.0,198.0,70.0,0.0,0.0,21.0,-21.0,97338579.0,2.753276,74.0,26.5,57.0,21.5,21.0
8484,2020-04-17,Western Sahara,6.0,0.0,0.0,6.0,0.0,0.0,0.0,0.0,597339.0,10.044548,6.0,26.5,11.5,66.0,40.0
8492,2020-04-17,Yemen,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,29825964.0,0.033528,1.0,26.5,1.0,66.0,1.0
8523,2020-04-17,Zambia,52.0,2.0,30.0,20.0,4.0,0.0,0.0,4.0,18383955.0,2.828553,45.0,64.5,34.0,98.5,22.0
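
Note that `rank()` ranks in ascending order by default, so the largest value gets the highest rank, and ties receive the average of their positions (hence values such as 26.5). A small sketch with a made-up Series shows the default next to an alternative where rank 1 is the largest value; the names `default_rank` and `top_rank` are only for this example.

```python
import pandas as pd

cases = pd.Series([10.0, 30.0, 30.0, 5.0], index=["A", "B", "C", "D"])

# Default: ascending order, ties share the average of their positions
default_rank = cases.rank()
# Alternative: 1 = largest value, ties share the best (lowest) position
top_rank = cases.rank(ascending=False, method="min")

print(pd.DataFrame({"cases": cases, "default": default_rank, "top": top_rank}))
```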


Which countries have the most confirmed cases?

In [197]:
last_day_data.sort_values('Rank_ConfirmedAcum', ascending = False)[['Country', 'ConfirmedAcum']].reset_index(drop = True).head(10)

Unnamed: 0,Country,ConfirmedAcum
0,United States,699706.0
1,Spain,190839.0
2,Italy,172434.0
3,France,149130.0
4,Germany,141397.0
5,United Kingdom,109769.0
6,China,83760.0
7,Iran,79494.0
8,Turkey,78546.0
9,Belgium,36138.0
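
For a simple "top N" listing like this one, `nlargest` is a possible shortcut that does not need a rank column. A minimal sketch on invented data (`toy` is not part of the notebook):

```python
import pandas as pd

toy = pd.DataFrame({"Country": ["A", "B", "C", "D"],
                    "ConfirmedAcum": [5.0, 50.0, 20.0, 1.0]})

# Take the 2 rows with the largest ConfirmedAcum, already sorted in descending order
top = toy.nlargest(2, "ConfirmedAcum")[["Country", "ConfirmedAcum"]].reset_index(drop=True)
print(top)
```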


Which countries had the most new confirmed cases on the last day?

In [198]:
last_day_data.sort_values('Rank_Confirmed', ascending = False)[['Country', 'Confirmed']].reset_index(drop = True).head(10)

Unnamed: 0,Country,Confirmed
0,United States,31976.0
1,Spain,5891.0
2,United Kingdom,5624.0
3,Turkey,4353.0
4,Russia,4070.0
5,Germany,3699.0
6,Italy,3493.0
7,Brazil,3257.0
8,France,2039.0
9,Canada,2005.0


Which countries have the most active cases?

In [200]:
last_day_data.sort_values('Rank_ActiveAcum', ascending = False)[['Country', 'ActiveAcum']].reset_index(drop = True).head(10)

Unnamed: 0,Country,ActiveAcum
0,United States,604388.0
1,Italy,106962.0
2,Spain,96040.0
3,France,95421.0
4,United Kingdom,94768.0
5,Turkey,68146.0
6,Germany,53931.0
7,Russia,29145.0
8,Netherlands,26833.0
9,Belgium,23014.0


Which countries had the largest increase in active cases on the last day?

In [201]:
last_day_data.sort_values('Rank_Active', ascending = False)[['Country', 'Active']].reset_index(drop = True).head(10)

Unnamed: 0,Country,Active
0,United States,24277.0
1,Spain,5204.0
2,United Kingdom,4757.0
3,Russia,3743.0
4,Brazil,3040.0
5,Turkey,2685.0
6,Japan,1115.0
7,Netherlands,1088.0
8,Canada,1061.0
9,Saudi Arabia,699.0


Which countries have the most confirmed cases per million inhabitants?

In [233]:
last_day_data.sort_values('Rank_ConfirmedAcum_Millon', ascending = False)[['Country', 'ConfirmedAcum_Millon']].reset_index(drop = True).head(10)

Unnamed: 0,Country,ConfirmedAcum_Millon
0,San Marino,12820.13498
1,Holy See,9987.515605
2,Andorra,9007.959619
3,Luxembourg,5559.300806
4,Iceland,5140.032176
5,Spain,4081.700484
6,Switzerland,3128.732832
7,Belgium,3118.134214
8,Italy,2851.948269
9,Ireland,2831.228409


## Saving the clean dataset

Finally, I will create an Excel file with the per-country information, now that it is clean and ready for applying some machine learning algorithms ...

In [0]:
data_by_country.to_excel("All_data.xlsx", index = False)