# Pandas and Covid-19

This notebook is an example of data analysis and manipulation with Pandas.

Enjoy it!

In [0]:
import numpy as np
import pandas as pd

## The Data

To get some data, I am going to download it from the Data Repository by Johns Hopkins CSSE:

https://github.com/CSSEGISandData/COVID-19

I first remove the folder where I am going to store the data, so I can re-execute these statements without any problems ...

In [0]:
!rm -rf ./COVID-19

The dataset is available on GitHub, so I use the `git` command to get it

In [3]:
!git clone https://github.com/CSSEGISandData/COVID-19.git

Cloning into 'COVID-19'...
remote: Enumerating objects: 7, done.
remote: Counting objects: 100% (7/7), done.
remote: Compressing objects: 100% (7/7), done.
remote: Total 21226 (delta 0), reused 2 (delta 0), pack-reused 21219
Receiving objects: 100% (21226/21226), 89.05 MiB | 32.87 MiB/s, done.
Resolving deltas: 100% (11301/11301), done.


## Exploring the data


In [4]:
!ls -lt ./COVID-19/csse_covid_19_data/csse_covid_19_daily_reports | head

total 9232
-rw-r--r-- 1 root root 317954 Apr 20 06:22 04-19-2020.csv
-rw-r--r-- 1 root root      0 Apr 20 06:22 README.md
-rw-r--r-- 1 root root 315926 Apr 20 06:22 04-18-2020.csv
-rw-r--r-- 1 root root 314848 Apr 20 06:22 04-17-2020.csv
-rw-r--r-- 1 root root 314226 Apr 20 06:22 04-16-2020.csv
-rw-r--r-- 1 root root 309742 Apr 20 06:22 04-13-2020.csv
-rw-r--r-- 1 root root 311068 Apr 20 06:22 04-14-2020.csv
-rw-r--r-- 1 root root 312551 Apr 20 06:22 04-15-2020.csv
-rw-r--r-- 1 root root 305548 Apr 20 06:22 04-12-2020.csv


Yes!!!  
We have data files ...

Perfect. Let's explore the first dataset generated ...

In [0]:
first = pd.read_csv("./COVID-19/csse_covid_19_data/csse_covid_19_daily_reports/01-22-2020.csv")

In [6]:
first.head()

Unnamed: 0,Province/State,Country/Region,Last Update,Confirmed,Deaths,Recovered
0,Anhui,Mainland China,1/22/2020 17:00,1.0,,
1,Beijing,Mainland China,1/22/2020 17:00,14.0,,
2,Chongqing,Mainland China,1/22/2020 17:00,6.0,,
3,Fujian,Mainland China,1/22/2020 17:00,1.0,,
4,Gansu,Mainland China,1/22/2020 17:00,,,


And one of the last ones ...

In [0]:
last = pd.read_csv("./COVID-19/csse_covid_19_data/csse_covid_19_daily_reports/04-18-2020.csv")

In [8]:
last.head()

Unnamed: 0,FIPS,Admin2,Province_State,Country_Region,Last_Update,Lat,Long_,Confirmed,Deaths,Recovered,Active,Combined_Key
0,45001.0,Abbeville,South Carolina,US,2020-04-18 22:32:47,34.223334,-82.461707,15,0,0,15,"Abbeville, South Carolina, US"
1,22001.0,Acadia,Louisiana,US,2020-04-18 22:32:47,30.295065,-92.414197,110,7,0,103,"Acadia, Louisiana, US"
2,51001.0,Accomack,Virginia,US,2020-04-18 22:32:47,37.767072,-75.632346,33,0,0,33,"Accomack, Virginia, US"
3,16001.0,Ada,Idaho,US,2020-04-18 22:32:47,43.452658,-116.241552,593,9,0,584,"Ada, Idaho, US"
4,19001.0,Adair,Iowa,US,2020-04-18 22:32:47,41.330756,-94.471059,1,0,0,1,"Adair, Iowa, US"


Can I concatenate both datasets?

In [9]:
last.query("Country_Region == 'Spain'")

Unnamed: 0,FIPS,Admin2,Province_State,Country_Region,Last_Update,Lat,Long_,Confirmed,Deaths,Recovered,Active,Combined_Key
3025,,,,Spain,2020-04-18 22:32:28,40.463667,-3.74922,191726,20043,74797,96886,Spain


In [10]:
pd.concat((first, last), axis = 0)

Unnamed: 0,Province/State,Country/Region,Last Update,Confirmed,Deaths,Recovered,FIPS,Admin2,Province_State,Country_Region,Last_Update,Lat,Long_,Active,Combined_Key
0,Anhui,Mainland China,1/22/2020 17:00,1.0,,,,,,,,,,,
1,Beijing,Mainland China,1/22/2020 17:00,14.0,,,,,,,,,,,
2,Chongqing,Mainland China,1/22/2020 17:00,6.0,,,,,,,,,,,
3,Fujian,Mainland China,1/22/2020 17:00,1.0,,,,,,,,,,,
4,Gansu,Mainland China,1/22/2020 17:00,,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3048,,,,418.0,2.0,69.0,,,,West Bank and Gaza,2020-04-18 22:32:28,31.952200,35.233200,347.0,West Bank and Gaza
3049,,,,6.0,0.0,0.0,,,,Western Sahara,2020-04-18 22:32:28,24.215500,-12.885800,6.0,Western Sahara
3050,,,,1.0,0.0,0.0,,,,Yemen,2020-04-18 22:32:28,15.552727,48.516388,1.0,Yemen
3051,,,,57.0,2.0,33.0,,,,Zambia,2020-04-18 22:32:28,-13.133897,27.849332,22.0,Zambia


Oops!!! The column names don't match :-(
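A quick way to see exactly which names differ is to compare the two column sets. This is only a minimal sketch: `early` and `late` are tiny hypothetical stand-ins for `first` and `last`, with just a few of the real columns.

```python
import pandas as pd

# Hypothetical stand-ins for the early and late daily-report layouts.
early = pd.DataFrame({"Province/State": ["Anhui"],
                      "Country/Region": ["Mainland China"],
                      "Confirmed": [1.0]})
late = pd.DataFrame({"Province_State": ["South Carolina"],
                     "Country_Region": ["US"],
                     "Confirmed": [15]})

# Set operations on the column indexes show what is shared and what is not.
shared = sorted(set(early.columns) & set(late.columns))
only_early = sorted(set(early.columns) - set(late.columns))
only_late = sorted(set(late.columns) - set(early.columns))

print("shared:", shared)
print("only in the early file:", only_early)
print("only in the late file:", only_late)
```

Running the same three lines on the real `first` and `last` DataFrames should list every column that needs to be renamed before concatenating.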

# Loading the data into Pandas and cleaning it

In [0]:
import glob
import os

files = glob.glob("./COVID-19/csse_covid_19_data/csse_covid_19_daily_reports/*.csv")
files.sort(key=os.path.getmtime)

We are going to:
- Create an empty DataFrame to store all the data
- Load every daily file, unifying the column names so we can concatenate them without any problem
- Remove extra blank spaces from the country field
- Add the date of each report (taken from the file name) as a proper datetime column

In [0]:
data = pd.DataFrame()
for file in files:  
  df = pd.read_csv(file).rename(columns = {'Province/State' : 'State', 
                        "Country/Region" : 'Country',
                        'Province_State' : 'State', 
                        "Country_Region" : 'Country',
                        'Last Update' : 'Last_Update',
                        'Confirmed' : 'ConfirmedAcum',
                        'Deaths' : 'DeathsAcum',
                        'Recovered' : 'RecoveredAcum'})
  df = df.assign(Date = pd.to_datetime(file[-14:-4], format = '%m-%d-%Y'),
                 Country = df.Country.str.strip())
  data = pd.concat((data, df), axis = 0)
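As an alternative sketch of the same loading steps, the frames can be collected in a list and concatenated once at the end, which avoids copying the growing DataFrame on every iteration. The file names and contents in `fake_files` are hypothetical in-memory stand-ins, and `ignore_index=True` is my own choice here (the loop above keeps the original per-file index).

```python
import io
import os
import pandas as pd

# Hypothetical in-memory stand-ins for two daily report files.
fake_files = {
    "reports/01-22-2020.csv": "Province/State,Country/Region,Confirmed\nAnhui, Mainland China,1\n",
    "reports/04-18-2020.csv": "Province_State,Country_Region,Confirmed\nMadrid,Spain ,10\n",
}

column_names = {
    "Province/State": "State", "Country/Region": "Country",
    "Province_State": "State", "Country_Region": "Country",
    "Confirmed": "ConfirmedAcum",
}

frames = []
for path, text in fake_files.items():
    df = pd.read_csv(io.StringIO(text)).rename(columns=column_names)
    # Take the date from the file name itself instead of slicing fixed positions.
    stem = os.path.splitext(os.path.basename(path))[0]
    df = df.assign(Date=pd.to_datetime(stem, format="%m-%d-%Y"),
                   Country=df["Country"].str.strip())
    frames.append(df)

# Concatenate once, after every file has been read.
combined = pd.concat(frames, ignore_index=True)
print(combined)
```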


I noticed that the country names were a little messy.   
Let's fix it ...

In [0]:
data['Country'] = data.Country.replace({'Bahamas, The' : 'Bahamas',
                         'Congo (Brazzaville)' : 'Congo',
                         'Congo (Kinshasa)' : 'Congo',
                         "Cote d'Ivoire" : "Cote d'Ivoire",
                         "Curacao" : "Curaçao",
                         'Czech Republic' : 'Czech Republic (Czechia)',
                         'Czechia' : 'Czech Republic (Czechia)',
                         'Faroe Islands' : 'Faeroe Islands',
                         'Macau' : 'Macao',
                         'Mainland China' : 'China',
                         'Palestine' : 'State of Palestine',
                         'Reunion' : 'Réunion',
                         'Saint Kitts and Nevis' : 'Saint Kitts & Nevis',
                         'Sao Tome and Principe' : 'Sao Tome & Principe',
                         'US' : 'United States',
                         'Gambia, The' : 'Gambia',
                         'Hong Kong SAR' : 'Hong Kong',
                         'Korea, South' : 'South Korea',
                         'Macao SAR' : 'Macao',
                         'Taiwan*' : 'Taiwan',
                         'Viet Nam' : 'Vietnam',
                         'West Bank and Gaza' : 'State of Palestine'
                         })

I'm going to fill in the null values of the 'State' and 'Admin2' fields so that I can later group the data correctly

In [0]:
data = data.fillna({'State' : 'NA', 'Admin2' : 'NA'})
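Why does this matter? By default, `groupby` silently drops rows whose grouping key is missing. The toy frame below (made-up rows, not the real dataset) is a small sketch of that behaviour and of how the placeholder string avoids it.

```python
import numpy as np
import pandas as pd

# Made-up rows: one country without a State, one with it.
toy = pd.DataFrame({
    "Country": ["Spain", "Canada"],
    "State": [np.nan, "Ontario"],
    "ConfirmedAcum": [100.0, 5.0],
})

# With the default settings, the Spain row (State is NaN) vanishes from the result.
dropped = toy.groupby(["Country", "State"]).sum().reset_index()

# After filling the key with a placeholder string, both rows survive.
kept = toy.fillna({"State": "NA"}).groupby(["Country", "State"]).sum().reset_index()

print(len(dropped), len(kept))
```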

Finally, I am going to keep only the columns that interest me

In [0]:
data = data[['Date', 'Country', 'State', 'Admin2', 'ConfirmedAcum', 'DeathsAcum', 'RecoveredAcum']]

Let's verify the structure of the dataset ...

In [16]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 95588 entries, 0 to 3071
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   Date           95588 non-null  datetime64[ns]
 1   Country        95588 non-null  object        
 2   State          95588 non-null  object        
 3   Admin2         95588 non-null  object        
 4   ConfirmedAcum  95569 non-null  float64       
 5   DeathsAcum     95147 non-null  float64       
 6   RecoveredAcum  95200 non-null  float64       
dtypes: datetime64[ns](1), float64(3), object(3)
memory usage: 5.8+ MB


Wait a sec, I think it would be interesting to have a column with the active cases. Let's create it ...

In [0]:
data['ActiveAcum'] = data.ConfirmedAcum  - data.DeathsAcum - data.RecoveredAcum
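One thing to keep in mind: the earliest daily reports leave `Deaths` and `Recovered` blank, and any subtraction involving a missing value is itself missing. The sketch below uses made-up rows to show the effect. Treating a blank as zero is my own assumption about what the blank means, not something the data source states.

```python
import numpy as np
import pandas as pd

# Made-up rows shaped like the early reports: the first one has blank counts.
toy = pd.DataFrame({
    "ConfirmedAcum": [1.0, 14.0],
    "DeathsAcum": [np.nan, 0.0],
    "RecoveredAcum": [np.nan, 2.0],
})

# Plain subtraction propagates NaN, so the first row has no usable result.
naive = toy.ConfirmedAcum - toy.DeathsAcum - toy.RecoveredAcum

# Assumption: a blank count means zero. Filling first keeps the first row usable.
filled = toy.fillna(0)
active = filled.ConfirmedAcum - filled.DeathsAcum - filled.RecoveredAcum

print(naive.tolist(), active.tolist())
```

If the missing values are left in place, a later `sum` aggregation skips them, which would explain active counts of 0 on days where confirmed cases are above 0.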

In [18]:
data.query("Country == 'Spain'").sort_values('Date', ascending = False).head()

Unnamed: 0,Date,Country,State,Admin2,ConfirmedAcum,DeathsAcum,RecoveredAcum,ActiveAcum
3044,2020-04-19,Spain,,,198674.0,20453.0,77357.0,100864.0
3025,2020-04-18,Spain,,,191726.0,20043.0,74797.0,96886.0
3017,2020-04-17,Spain,,,190839.0,20002.0,74797.0,96040.0
3013,2020-04-16,Spain,,,184948.0,19315.0,74797.0,90836.0
2998,2020-04-15,Spain,,,177644.0,18708.0,70853.0,88083.0


Perfect :-)

Now, I am going to group and summarize the data because I want to be sure that there is only one row per Date, Country, State and Admin2

In [0]:
data = data.groupby(["Date", "Country", "State", "Admin2"]).agg("sum").reset_index()

In [20]:
data.head()

Unnamed: 0,Date,Country,State,Admin2,ConfirmedAcum,DeathsAcum,RecoveredAcum,ActiveAcum
0,2020-01-22,China,Anhui,,1.0,0.0,0.0,0.0
1,2020-01-22,China,Beijing,,14.0,0.0,0.0,0.0
2,2020-01-22,China,Chongqing,,6.0,0.0,0.0,0.0
3,2020-01-22,China,Fujian,,1.0,0.0,0.0,0.0
4,2020-01-22,China,Gansu,,0.0,0.0,0.0,0.0


# Daily Cases
I am going to enrich the data by creating new columns with the daily cases.  

First I create new columns with the cases from the previous day

In [0]:
data = data.sort_values(['State', 'Country', 'Date']).\
            assign(ConfirmedPrevious = data.groupby(['Admin2', 'State', 'Country']).shift(1)["ConfirmedAcum"],
                   DeathsPrevious = data.groupby(['Admin2', 'State', 'Country']).shift(1)["DeathsAcum"],
                   RecoveredPrevious = data.groupby(['Admin2', 'State', 'Country']).shift(1)["RecoveredAcum"],
                   ActivePrevious = data.groupby(['Admin2', 'State', 'Country']).shift(1)["ActiveAcum"],
            ).\
            fillna({ 'ConfirmedPrevious' : 0, 'DeathsPrevious' : 0, 'RecoveredPrevious' : 0 })

In [22]:
data.head()

Unnamed: 0,Date,Country,State,Admin2,ConfirmedAcum,DeathsAcum,RecoveredAcum,ActiveAcum,ConfirmedPrevious,DeathsPrevious,RecoveredPrevious,ActivePrevious
2599,2020-02-28,Canada,"Montreal, QC",,1.0,0.0,0.0,1.0,0.0,0.0,0.0,
2713,2020-02-29,Canada,"Montreal, QC",,1.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0
2834,2020-03-01,Canada,"Montreal, QC",,1.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0
2961,2020-03-02,Canada,"Montreal, QC",,1.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0
3103,2020-03-03,Canada,"Montreal, QC",,1.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0


After that, I am going to create the new fields by subtracting the previous day's cumulative cases from the current day's cumulative cases

In [0]:
data = data.assign(Confirmed = data.ConfirmedAcum -  data.ConfirmedPrevious,
            Deaths = data.DeathsAcum - data.DeathsPrevious,
            Recovered = data.RecoveredAcum - data.RecoveredPrevious,
            Active = data.ActiveAcum - data.ActivePrevious
            )
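The shift-then-subtract pattern can also be written with `groupby(...).diff()`. Here is a minimal sketch on a toy frame: the Spain figures come from the tables in this notebook, while the "Example" country and its numbers are made up. The first row of each group has no previous day, so it falls back to the cumulative value, which is equivalent to filling the "previous" column with 0.

```python
import pandas as pd

toy = pd.DataFrame({
    "Date": pd.to_datetime(["2020-04-17", "2020-04-18", "2020-04-17", "2020-04-18"]),
    "Country": ["Spain", "Spain", "Example", "Example"],  # "Example" is hypothetical
    "ConfirmedAcum": [190839.0, 191726.0, 10.0, 15.0],
})

# Sort so that, within each country, rows are in chronological order.
toy = toy.sort_values(["Country", "Date"])

# Difference to the previous row of the same country.
daily = toy.groupby("Country")["ConfirmedAcum"].diff()

# No previous day available: use the cumulative value itself.
toy["Confirmed"] = daily.fillna(toy["ConfirmedAcum"])
print(toy)
```

With the full dataset the grouping key would be `['Admin2', 'State', 'Country']`, as in the cells above.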

I no longer need the fields I used to make the calculation so I can drop them

In [0]:
data = data.drop(['ConfirmedPrevious', 'DeathsPrevious', 'RecoveredPrevious', 'ActivePrevious'], axis = 1)

Does the data look good?

In [25]:
data.query("Country == 'Spain'")

Unnamed: 0,Date,Country,State,Admin2,ConfirmedAcum,DeathsAcum,RecoveredAcum,ActiveAcum,Confirmed,Deaths,Recovered,Active
545,2020-02-01,Spain,,,1.0,0.0,0.0,1.0,1.0,0.0,0.0,
612,2020-02-02,Spain,,,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
679,2020-02-03,Spain,,,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
749,2020-02-04,Spain,,,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
819,2020-02-05,Spain,,,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...
80538,2020-04-15,Spain,,,177644.0,18708.0,70853.0,88083.0,5103.0,652.0,3349.0,1102.0
83563,2020-04-16,Spain,,,184948.0,19315.0,74797.0,90836.0,7304.0,607.0,3944.0,2753.0
86603,2020-04-17,Spain,,,190839.0,20002.0,74797.0,96040.0,5891.0,687.0,0.0,5204.0
89647,2020-04-18,Spain,,,191726.0,20043.0,74797.0,96886.0,887.0,41.0,0.0,846.0


## Data By Country




So far, we have data by 3 geographical levels: Country, State and a lower level called Admin2

The problem is that not all countries have this level of information, so I will create a new dataset only with the country level data

In [26]:
data_by_country = data.groupby(["Date", "Country"]).agg("sum").reset_index()
data_by_country = data_by_country.sort_values(['Country', 'Date'])
data_by_country.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8928 entries, 848 to 3045
Data columns (total 10 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   Date           8928 non-null   datetime64[ns]
 1   Country        8928 non-null   object        
 2   ConfirmedAcum  8928 non-null   float64       
 3   DeathsAcum     8928 non-null   float64       
 4   RecoveredAcum  8928 non-null   float64       
 5   ActiveAcum     8928 non-null   float64       
 6   Confirmed      8928 non-null   float64       
 7   Deaths         8928 non-null   float64       
 8   Recovered      8928 non-null   float64       
 9   Active         8928 non-null   float64       
dtypes: datetime64[ns](1), float64(8), object(1)
memory usage: 767.2+ KB


In [27]:
data_by_country.head()

Unnamed: 0,Date,Country,ConfirmedAcum,DeathsAcum,RecoveredAcum,ActiveAcum,Confirmed,Deaths,Recovered,Active
848,2020-02-24,Afghanistan,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0
886,2020-02-25,Afghanistan,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
928,2020-02-26,Afghanistan,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
977,2020-02-27,Afghanistan,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
1030,2020-02-28,Afghanistan,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0


In [28]:
data_by_country[data_by_country.Country == 'United States']

Unnamed: 0,Date,Country,ConfirmedAcum,DeathsAcum,RecoveredAcum,ActiveAcum,Confirmed,Deaths,Recovered,Active
7,2020-01-22,United States,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
22,2020-01-23,United States,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
33,2020-01-24,United States,2.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
47,2020-01-25,United States,2.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
62,2020-01-26,United States,5.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...
8183,2020-04-15,United States,636350.0,28325.0,52096.0,555929.0,28671.0,2494.0,4333.0,21844.0
8367,2020-04-16,United States,667801.0,32916.0,54703.0,580182.0,31449.0,4591.0,2607.0,24251.0
8551,2020-04-17,United States,699706.0,36773.0,58545.0,604388.0,31976.0,3857.0,3842.0,24277.0
8735,2020-04-18,United States,732197.0,38664.0,64840.0,628693.0,32491.0,1891.0,6295.0,24305.0


## Cases per million inhabitants



We are going to enrich the information with the number of cases per million inhabitants, so we need population data by country.

A small internet search leads me to a page that has population data for 2020:

https://www.worldometers.info/world-population/population-by-country/

It seems that this page is protected against automatic downloads, so I have no choice but to download the data manually and upload it to a GitHub repository:

https://github.com/dvillaj/world-population/



I load the data into Pandas, clean it up, and keep only the population field


In [0]:
population = pd.read_excel("https://github.com/dvillaj/world-population/blob/master/data/world-popultation-2020.xlsx?raw=true", sheet_name="Data")

In [30]:
population.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 235 entries, 0 to 234
Data columns (total 11 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Country            235 non-null    object 
 1   Population (2020)  235 non-null    int64  
 2   Yearly Change      235 non-null    float64
 3   Net Change         235 non-null    int64  
 4   Density (P/Km²)    235 non-null    float64
 5   Land Area (Km²)    235 non-null    int64  
 6   Migrants (net)     201 non-null    float64
 7   Fertility Rate     201 non-null    float64
 8   Average Age        201 non-null    float64
 9   Urban Pop %        222 non-null    float64
 10  World Share        235 non-null    float64
dtypes: float64(7), int64(3), object(1)
memory usage: 20.3+ KB


In [0]:
population = population.rename(columns = {
    'Population (2020)' : 'Population',
    'Yearly Change' : 'Yearly_Change',
    'Net Change' : 'Net_Change',
    'Density (P/Km²)' : 'Density',
    'Land Area (Km²)' : 'Land_Area',
    'Migrants (net)' : 'Migrants',
    'Fertility Rate' : 'Fertility',
    'Average Age' : 'Mean_Age',
    'Urban Pop %' : 'Urban_Pop',
    'World Share' : 'World_Share'
})

In [0]:
population = population[['Country', 'Population']]

In [33]:
population.head()

Unnamed: 0,Country,Population
0,Afghanistan,38928346
1,Albania,2877797
2,Algeria,43851044
3,American Samoa,55191
4,Andorra,77265


Now I join the population dataset with the country data, so that every row also carries the population of its country

In [0]:
data_by_country = data_by_country.merge(population, how = 'left', on = 'Country')

In [35]:
data_by_country.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8928 entries, 0 to 8927
Data columns (total 11 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   Date           8928 non-null   datetime64[ns]
 1   Country        8928 non-null   object        
 2   ConfirmedAcum  8928 non-null   float64       
 3   DeathsAcum     8928 non-null   float64       
 4   RecoveredAcum  8928 non-null   float64       
 5   ActiveAcum     8928 non-null   float64       
 6   Confirmed      8928 non-null   float64       
 7   Deaths         8928 non-null   float64       
 8   Recovered      8928 non-null   float64       
 9   Active         8928 non-null   float64       
 10  Population     8605 non-null   float64       
dtypes: datetime64[ns](1), float64(9), object(1)
memory usage: 837.0+ KB
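The `info()` output above reports 8605 non-null `Population` values for 8928 rows, so some country names found no match in the population table. A hedged sketch of how to list them uses the `indicator` option of `merge`. The frames below are toy stand-ins: the Afghanistan and Andorra figures come from this notebook, and "Atlantis" is a made-up unmatched name.

```python
import pandas as pd

cases = pd.DataFrame({"Country": ["Afghanistan", "Andorra", "Atlantis"],
                      "ConfirmedAcum": [996.0, 713.0, 1.0]})
pop = pd.DataFrame({"Country": ["Afghanistan", "Andorra"],
                    "Population": [38928346, 77265]})

# indicator=True adds a "_merge" column telling where each row was found.
merged = cases.merge(pop, how="left", on="Country", indicator=True)

# "left_only" rows exist in the case data but not in the population table.
unmatched = merged.loc[merged["_merge"] == "left_only", "Country"].unique().tolist()
print(unmatched)
```

Any names this reveals in the real data could then be added to the country `replace` dictionary used earlier.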


In [36]:
data_by_country.head()

Unnamed: 0,Date,Country,ConfirmedAcum,DeathsAcum,RecoveredAcum,ActiveAcum,Confirmed,Deaths,Recovered,Active,Population
0,2020-02-24,Afghanistan,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,38928346.0
1,2020-02-25,Afghanistan,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,38928346.0
2,2020-02-26,Afghanistan,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,38928346.0
3,2020-02-27,Afghanistan,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,38928346.0
4,2020-02-28,Afghanistan,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,38928346.0


And finally I calculate the number of cases per million inhabitants

In [37]:
data_by_country = data_by_country.assign(ConfirmedAcum_Millon = data_by_country.ConfirmedAcum / data_by_country.Population * 1000000)
data_by_country.head()

Unnamed: 0,Date,Country,ConfirmedAcum,DeathsAcum,RecoveredAcum,ActiveAcum,Confirmed,Deaths,Recovered,Active,Population,ConfirmedAcum_Millon
0,2020-02-24,Afghanistan,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,38928346.0,0.025688
1,2020-02-25,Afghanistan,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,38928346.0,0.025688
2,2020-02-26,Afghanistan,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,38928346.0,0.025688
3,2020-02-27,Afghanistan,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,38928346.0,0.025688
4,2020-02-28,Afghanistan,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,38928346.0,0.025688
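As a sanity check of the formula, cases per million is simply `cases / population * 1,000,000`. The sketch below recomputes two values that appear in this notebook (Afghanistan with 1 case, Andorra with 713 cases) on a tiny hand-built frame.

```python
import pandas as pd

# Figures copied from the tables in this notebook.
toy = pd.DataFrame({
    "Country": ["Afghanistan", "Andorra"],
    "ConfirmedAcum": [1.0, 713.0],
    "Population": [38928346, 77265],
})

# Same column name and formula as in the cell above.
toy["ConfirmedAcum_Millon"] = toy.ConfirmedAcum / toy.Population * 1_000_000
print(toy.round(6))
```

A small country like Andorra ends up with a far higher rate than a large one, which is exactly why the per-million view is worth having.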


## Rankings

I'm going to create a dataset of last day's cases. 

The goal is to get a set of rankings that tell me the countries with the most cases

So I need a variable that contains the last date of the dataset

In [38]:
last_day = data_by_country.Date.max()
last_day

Timestamp('2020-04-19 00:00:00')

Now I can filter the data by this date

In [39]:
last_day_data = data_by_country[data_by_country.Date == last_day]
last_day_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 184 entries, 55 to 8920
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype         
---  ------                --------------  -----         
 0   Date                  184 non-null    datetime64[ns]
 1   Country               184 non-null    object        
 2   ConfirmedAcum         184 non-null    float64       
 3   DeathsAcum            184 non-null    float64       
 4   RecoveredAcum         184 non-null    float64       
 5   ActiveAcum            184 non-null    float64       
 6   Confirmed             184 non-null    float64       
 7   Deaths                184 non-null    float64       
 8   Recovered             184 non-null    float64       
 9   Active                184 non-null    float64       
 10  Population            178 non-null    float64       
 11  ConfirmedAcum_Millon  178 non-null    float64       
dtypes: datetime64[ns](1), float64(10), object(1)
memory usage: 18.7+ KB


In [40]:
last_day_data

Unnamed: 0,Date,Country,ConfirmedAcum,DeathsAcum,RecoveredAcum,ActiveAcum,Confirmed,Deaths,Recovered,Active,Population,ConfirmedAcum_Millon
55,2020-04-19,Afghanistan,996.0,33.0,131.0,832.0,63.0,3.0,19.0,41.0,38928346.0,25.585469
97,2020-04-19,Albania,562.0,26.0,314.0,222.0,14.0,0.0,12.0,2.0,2877797.0,195.288271
152,2020-04-19,Algeria,2629.0,375.0,1047.0,1207.0,95.0,8.0,153.0,-66.0,43851044.0,59.952963
201,2020-04-19,Andorra,713.0,36.0,235.0,442.0,9.0,1.0,30.0,-22.0,77265.0,9227.981622
232,2020-04-19,Angola,24.0,2.0,6.0,16.0,0.0,0.0,0.0,0.0,32866272.0,0.730232
...,...,...,...,...,...,...,...,...,...,...,...,...
8831,2020-04-19,Vietnam,268.0,0.0,202.0,66.0,0.0,0.0,1.0,-1.0,97338579.0,2.753276
8846,2020-04-19,Western Sahara,6.0,0.0,0.0,6.0,0.0,0.0,0.0,0.0,597339.0,10.044548
8856,2020-04-19,Yemen,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,29825964.0,0.033528
8889,2020-04-19,Zambia,61.0,3.0,33.0,25.0,4.0,1.0,0.0,3.0,18383955.0,3.318111


Let's assign new columns with the most interesting rankings ...

In [41]:
last_day_data = last_day_data.assign(
    Rank_ConfirmedAcum = last_day_data.ConfirmedAcum.rank(),
    Rank_Confirmed = last_day_data.Confirmed.rank(),
    Rank_ActiveAcum = last_day_data.ActiveAcum.rank(),
    Rank_Active = last_day_data.Active.rank(),
    Rank_ConfirmedAcum_Millon = last_day_data.ConfirmedAcum_Millon.rank()
)
last_day_data

Unnamed: 0,Date,Country,ConfirmedAcum,DeathsAcum,RecoveredAcum,ActiveAcum,Confirmed,Deaths,Recovered,Active,Population,ConfirmedAcum_Millon,Rank_ConfirmedAcum,Rank_Confirmed,Rank_ActiveAcum,Rank_Active,Rank_ConfirmedAcum_Millon
55,2020-04-19,Afghanistan,996.0,33.0,131.0,832.0,63.0,3.0,19.0,41.0,38928346.0,25.585469,105.0,125.5,113.0,134.0,56.0
97,2020-04-19,Albania,562.0,26.0,314.0,222.0,14.0,0.0,12.0,2.0,2877797.0,195.288271,90.0,92.0,80.0,89.0,108.0
152,2020-04-19,Algeria,2629.0,375.0,1047.0,1207.0,95.0,8.0,153.0,-66.0,43851044.0,59.952963,130.0,136.0,123.0,12.0,80.0
201,2020-04-19,Andorra,713.0,36.0,235.0,442.0,9.0,1.0,30.0,-22.0,77265.0,9227.981622,98.0,82.5,93.0,18.0,176.0
232,2020-04-19,Angola,24.0,2.0,6.0,16.0,0.0,0.0,0.0,0.0,32866272.0,0.730232,30.0,24.0,25.5,61.0,4.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8831,2020-04-19,Vietnam,268.0,0.0,202.0,66.0,0.0,0.0,1.0,-1.0,97338579.0,2.753276,72.0,24.0,56.0,38.5,19.0
8846,2020-04-19,Western Sahara,6.0,0.0,0.0,6.0,0.0,0.0,0.0,0.0,597339.0,10.044548,6.0,24.0,11.0,61.0,36.0
8856,2020-04-19,Yemen,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,29825964.0,0.033528,1.0,24.0,2.0,61.0,1.0
8889,2020-04-19,Zambia,61.0,3.0,33.0,25.0,4.0,1.0,0.0,3.0,18383955.0,3.318111,46.5,69.0,36.5,92.5,23.0
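Note how `rank()` behaves with its defaults: it ranks in ascending order (the smallest value gets rank 1) and gives tied values the average of their positions, which is where ranks such as 125.5 come from. The sketch below shows this on a made-up Series, together with a variant where rank 1 means "most cases". That variant is only an alternative, not what the cell above does.

```python
import pandas as pd

# Made-up values with one tie.
confirmed = pd.Series([10.0, 30.0, 30.0, 5.0], index=["A", "B", "C", "D"])

# Defaults: ascending order, ties share the average rank.
default_rank = confirmed.rank()

# Alternative: largest value gets rank 1, ties share the lowest rank.
top_rank = confirmed.rank(ascending=False, method="min")

print(default_rank.tolist())
print(top_rank.tolist())
```

Because the ranks here are ascending, the queries that follow sort them in descending order to put the largest values first.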


Which countries have the most confirmed cases?

In [42]:
last_day_data.sort_values('Rank_ConfirmedAcum', ascending = False)[['Country', 'ConfirmedAcum']].reset_index(drop = True).head(10)

Unnamed: 0,Country,ConfirmedAcum
0,United States,759086.0
1,Spain,198674.0
2,Italy,178972.0
3,France,154097.0
4,Germany,145184.0
5,United Kingdom,121172.0
6,Turkey,86306.0
7,China,83805.0
8,Iran,82211.0
9,Russia,42853.0
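For a simple "top N" listing, a rank column is not strictly required: `nlargest` gives the same ordering directly. A minimal sketch on a toy frame, using figures copied from the tables in this notebook:

```python
import pandas as pd

toy = pd.DataFrame({
    "Country": ["Yemen", "Spain", "United States", "Italy"],
    "ConfirmedAcum": [1.0, 198674.0, 759086.0, 178972.0],
})

# Keep the 3 rows with the largest cumulative confirmed cases, largest first.
top3 = toy.nlargest(3, "ConfirmedAcum")[["Country", "ConfirmedAcum"]].reset_index(drop=True)
print(top3)
```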


Which countries had the most new confirmed cases on the last day?

In [43]:
last_day_data.sort_values('Rank_Confirmed', ascending = False)[['Country', 'Confirmed']].reset_index(drop = True).head(10)

Unnamed: 0,Country,Confirmed
0,United States,27159.0
1,Spain,6948.0
2,Russia,6060.0
3,United Kingdom,5858.0
4,France,4948.0
5,Turkey,3977.0
6,Italy,3047.0
7,Brazil,1996.0
8,India,1893.0
9,Germany,1842.0


Which countries have the most active cases?

In [44]:
last_day_data.sort_values('Rank_ActiveAcum', ascending = False)[['Country', 'ActiveAcum']].reset_index(drop = True).head(10)

Unnamed: 0,Country,ActiveAcum
0,United States,648088.0
1,Italy,108257.0
2,United Kingdom,104641.0
3,Spain,100864.0
4,France,97170.0
5,Turkey,72313.0
6,Germany,52598.0
7,Russia,39201.0
8,Netherlands,28819.0
9,Belgium,24056.0


Which countries had the largest increase in active cases on the last day?

In [51]:
last_day_data.sort_values('Rank_Active', ascending = False)[['Country', 'Active']].reset_index(drop = True).head(10)

Unnamed: 0,Country,Active
0,United States,19397.0
1,Russia,5778.0
2,United Kingdom,5239.0
3,Spain,3978.0
4,France,3953.0
5,Turkey,2327.0
6,India,1464.0
7,Peru,1029.0
8,Saudi Arabia,1014.0
9,Netherlands,983.0


Which countries have the most confirmed cases per million inhabitants?

In [52]:
last_day_data.sort_values('Rank_ConfirmedAcum_Millon', ascending = False)[['Country', 'ConfirmedAcum_Millon']].reset_index(drop = True).head(20)

Unnamed: 0,Country,ConfirmedAcum_Millon
0,San Marino,13586.395921
1,Holy See,9987.515605
2,Andorra,9227.981622
3,Luxembourg,5671.125822
4,Iceland,5189.850048
5,Spain,4249.27694
6,Belgium,3321.592083
7,Switzerland,3205.223752
8,Ireland,3088.631221
9,Italy,2960.082615


## Saving the clean dataset

Finally, I will create an Excel file with the information per country, now that it is clean and ready for applying some machine learning algorithms ...

In [0]:
data_by_country.to_excel("All_data.xlsx", index = False)

## Convert to PDF

In [54]:
!apt-get install texlive texlive-xetex texlive-latex-extra pandoc
!pip install pypandoc

Reading package lists... Done
Building dependency tree       
Reading state information... Done
pandoc is already the newest version (1.19.2.4~dfsg-1build4).
pandoc set to manually installed.
The following additional packages will be installed:
  fonts-droid-fallback fonts-lato fonts-lmodern fonts-noto-mono fonts-texgyre
  javascript-common libcupsfilters1 libcupsimage2 libgs9 libgs9-common
  libijs-0.35 libjbig2dec0 libjs-jquery libkpathsea6 libpotrace0 libptexenc1
  libruby2.5 libsynctex1 libtexlua52 libtexluajit2 libzzip-0-13 lmodern
  poppler-data preview-latex-style rake ruby ruby-did-you-mean ruby-minitest
  ruby-net-telnet ruby-power-assert ruby-test-unit ruby2.5
  rubygems-integration t1utils tex-common tex-gyre texlive-base
  texlive-binaries texlive-fonts-recommended texlive-latex-base
  texlive-latex-recommended texlive-pictures texlive-plain-generic tipa
Suggested packages:
  fonts-noto apache2 | lighttpd | httpd poppler-utils ghostscript
  fonts-japanese-mincho | fonts-ipa

In [55]:
from google.colab import drive
drive.mount('/content/drive')

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at /content/drive


In [68]:
ls -la "./drive/My Drive/Colab Notebooks/"

total 1442
-rw------- 1 root root  80931 Apr  8 05:33  Covib19.ipynb
-rw------- 1 root root 133795 Apr 20 07:14  Covid-19.ipynb
-rw------- 1 root root  68277 Apr 14 04:04 'Module 1.1 - Vector and Matrices.ipynb'
-rw------- 1 root root 251701 Apr 14 04:05 'Module 1.2 - Main Probability Distributions.ipynb'
-rw------- 1 root root  56997 Apr 18 04:06  NumPy.ipynb
-rw------- 1 root root 555477 Apr 19 05:56  Pandas.ipynb
-rw------- 1 root root  73495 Apr 13 05:17 'Pivot Table.ipynb'
-rw------- 1 root root 252815 Apr 19 05:01 'Python for Data Analysis - Exercises.ipynb'
-rw------- 1 root root    333 Apr 13 07:18  Untitled0.ipynb


In [0]:
!cp "./drive/My Drive/Colab Notebooks/Covid-19.ipynb" .

In [72]:
!jupyter nbconvert --to PDF "Covid-19.ipynb"

[NbConvertApp] Converting notebook Covid-19.ipynb to PDF
[NbConvertApp] Writing 105522 bytes to ./notebook.tex
[NbConvertApp] Building PDF
[NbConvertApp] Running xelatex 3 times: [u'xelatex', u'./notebook.tex', '-quiet']
[NbConvertApp] Running bibtex 1 time: [u'bibtex', u'./notebook']
[NbConvertApp] PDF successfully created
[NbConvertApp] Writing 74970 bytes to Covid-19.pdf
