# Pandas and Covid-19

This notebook is an example of data analysis and manipulation with **Pandas** and has been created in [Google Colab](https://colab.research.google.com).

Enjoy it!

In [1]:
import numpy as np
import pandas as pd

## Data

To get some data, I am going to download it from the COVID-19 Data Repository by Johns Hopkins CSSE

https://github.com/CSSEGISandData/COVID-19

I like to remove the folder where I am going to save the data so I can re-run these cells without any problems ...

In [2]:
!rm -rf ./COVID-19

The dataset is available on GitHub, so I use the `git` command to get it

In [3]:
!git clone https://github.com/CSSEGISandData/COVID-19.git

Cloning into 'COVID-19'...
remote: Enumerating objects: 622950, done.
remote: Counting objects: 100% (50/50), done.
remote: Compressing objects: 100% (32/32), done.
remote: Total 622950 (delta 23), reused 35 (delta 18), pack-reused 622900
Receiving objects: 100% (622950/622950), 7.13 GiB | 23.85 MiB/s, done.
Resolving deltas: 100% (542114/542114), done.
Checking out files: 100% (2322/2322), done.


More than **7 GiB** of data ...

## Exploring the data


### Exploring the files

The first step is to explore the data files. I will use shell commands to save the file names into a file that I will later load into Pandas

In [7]:
!ls -lt ./COVID-19/csse_covid_19_data/csse_covid_19_daily_reports  | head

total 544852
-rw-r--r-- 1 root root 556039 Jan 13 05:21 12-31-2022.csv
-rw-r--r-- 1 root root      0 Jan 13 05:21 README.md
-rw-r--r-- 1 root root 559258 Jan 13 05:21 12-31-2021.csv
-rw-r--r-- 1 root root 570042 Jan 13 05:21 12-31-2020.csv
-rw-r--r-- 1 root root 556049 Jan 13 05:21 12-30-2022.csv
-rw-r--r-- 1 root root 559186 Jan 13 05:21 12-30-2021.csv
-rw-r--r-- 1 root root 570027 Jan 13 05:21 12-30-2020.csv
-rw-r--r-- 1 root root 556070 Jan 13 05:21 12-29-2022.csv
-rw-r--r-- 1 root root 559184 Jan 13 05:21 12-29-2021.csv


In [8]:
! ls -l ./COVID-19/csse_covid_19_data/csse_covid_19_daily_reports > files.txt

In [9]:
! cat files.txt | head

total 544852
-rw-r--r-- 1 root root 570035 Jan 13 05:21 01-01-2021.csv
-rw-r--r-- 1 root root 559258 Jan 13 05:21 01-01-2022.csv
-rw-r--r-- 1 root root 556053 Jan 13 05:21 01-01-2023.csv
-rw-r--r-- 1 root root 570265 Jan 13 05:21 01-02-2021.csv
-rw-r--r-- 1 root root 559080 Jan 13 05:21 01-02-2022.csv
-rw-r--r-- 1 root root 556050 Jan 13 05:21 01-02-2023.csv
-rw-r--r-- 1 root root 570438 Jan 13 05:21 01-03-2021.csv
-rw-r--r-- 1 root root 559288 Jan 13 05:21 01-03-2022.csv
-rw-r--r-- 1 root root 556004 Jan 13 05:21 01-03-2023.csv


I will load this file into a Pandas dataframe and clean it

In [10]:
df_data_files = pd.read_fwf("files.txt", header = None)

In [11]:
df_data_files.head(10)

Unnamed: 0,0,1,2,3,4,5,6,7
0,total 544852,,,,,,,
1,-rw-r--r-- 1,root,root,570035.0,Jan,13.0,05:21,01-01-2021.csv
2,-rw-r--r-- 1,root,root,559258.0,Jan,13.0,05:21,01-01-2022.csv
3,-rw-r--r-- 1,root,root,556053.0,Jan,13.0,05:21,01-01-2023.csv
4,-rw-r--r-- 1,root,root,570265.0,Jan,13.0,05:21,01-02-2021.csv
5,-rw-r--r-- 1,root,root,559080.0,Jan,13.0,05:21,01-02-2022.csv
6,-rw-r--r-- 1,root,root,556050.0,Jan,13.0,05:21,01-02-2023.csv
7,-rw-r--r-- 1,root,root,570438.0,Jan,13.0,05:21,01-03-2021.csv
8,-rw-r--r-- 1,root,root,559288.0,Jan,13.0,05:21,01-03-2022.csv
9,-rw-r--r-- 1,root,root,556004.0,Jan,13.0,05:21,01-03-2023.csv


In [12]:
df_files = (df_data_files
  # We only want to keep the 'filename' column
  .rename(columns= { 7: 'filename'}) 
  .filter(['filename'])
   # Rows are filtered
  .query("filename != 'README.md' and filename.notnull()", engine = 'python') 
   # New date field from the name of the file
  .assign(date = lambda dataset: pd.to_datetime(dataset.filename.str[0:10], format = "%m-%d-%Y"),
          month = lambda dataset: dataset.date.dt.month,
          year = lambda dataset: dataset.date.dt.year)
   )
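
The same kind of table can be built without going through `ls` and `read_fwf`. This is a minimal sketch, assuming a plain Python list of file names. The names below are made up and `df_names` is a hypothetical variable; in the notebook the list could come from `os.listdir`.

```python
import pandas as pd

# Made-up file names that follow the MM-DD-YYYY.csv pattern of the daily reports
names = ["12-31-2021.csv", "README.md", "01-22-2020.csv", "01-12-2023.csv"]

df_names = (
    pd.DataFrame({"filename": names})
    # Keep only the CSV reports
    .loc[lambda d: d.filename.str.endswith(".csv")]
    # Parse the date encoded in the first 10 characters of the name
    .assign(
        date=lambda d: pd.to_datetime(d.filename.str[:10], format="%m-%d-%Y"),
        year=lambda d: d.date.dt.year,
    )
    .sort_values("date")
    .reset_index(drop=True)
)

print(df_names)
```

Sorting by the parsed `date` column puts the files in chronological order, which the alphabetical order of the names does not.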

In [13]:
df_files.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1087 entries, 1 to 1087
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype         
---  ------    --------------  -----         
 0   filename  1087 non-null   object        
 1   date      1087 non-null   datetime64[ns]
 2   month     1087 non-null   int64         
 3   year      1087 non-null   int64         
dtypes: datetime64[ns](1), int64(2), object(1)
memory usage: 42.5+ KB


In [14]:
df_files

Unnamed: 0,filename,date,month,year
1,01-01-2021.csv,2021-01-01,1,2021
2,01-01-2022.csv,2022-01-01,1,2022
3,01-01-2023.csv,2023-01-01,1,2023
4,01-02-2021.csv,2021-01-02,1,2021
5,01-02-2022.csv,2022-01-02,1,2022
...,...,...,...,...
1083,12-30-2021.csv,2021-12-30,12,2021
1084,12-30-2022.csv,2022-12-30,12,2022
1085,12-31-2020.csv,2020-12-31,12,2020
1086,12-31-2021.csv,2021-12-31,12,2021


In [15]:
df_files.groupby("year").agg(total_files = ("filename", "count"))

Unnamed: 0_level_0,total_files
year,Unnamed: 1_level_1
2020,345
2021,365
2022,365
2023,12


Now I can find the dates of the first and the last data files

In [16]:
df_files.nsmallest(5, 'date')

Unnamed: 0,filename,date,month,year
55,01-22-2020.csv,2020-01-22,1,2020
58,01-23-2020.csv,2020-01-23,1,2020
61,01-24-2020.csv,2020-01-24,1,2020
64,01-25-2020.csv,2020-01-25,1,2020
67,01-26-2020.csv,2020-01-26,1,2020


In [17]:
df_files.nlargest(5, 'date')

Unnamed: 0,filename,date,month,year
36,01-12-2023.csv,2023-01-12,1,2023
33,01-11-2023.csv,2023-01-11,1,2023
30,01-10-2023.csv,2023-01-10,1,2023
27,01-09-2023.csv,2023-01-09,1,2023
24,01-08-2023.csv,2023-01-08,1,2023


### Exploring the data

Perfect. Let's explore the first dataset generated ...

In [18]:
first = pd.read_csv("./COVID-19/csse_covid_19_data/csse_covid_19_daily_reports/01-22-2020.csv")

In [19]:
first.head()

Unnamed: 0,Province/State,Country/Region,Last Update,Confirmed,Deaths,Recovered
0,Anhui,Mainland China,1/22/2020 17:00,1.0,,
1,Beijing,Mainland China,1/22/2020 17:00,14.0,,
2,Chongqing,Mainland China,1/22/2020 17:00,6.0,,
3,Fujian,Mainland China,1/22/2020 17:00,1.0,,
4,Gansu,Mainland China,1/22/2020 17:00,,,


In [20]:
first.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 76 entries, 0 to 75
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Province/State  64 non-null     object 
 1   Country/Region  76 non-null     object 
 2   Last Update     76 non-null     object 
 3   Confirmed       66 non-null     float64
 4   Deaths          39 non-null     float64
 5   Recovered       39 non-null     float64
dtypes: float64(3), object(3)
memory usage: 3.7+ KB


And the last one ...

In [22]:
last = pd.read_csv("./COVID-19/csse_covid_19_data/csse_covid_19_daily_reports/01-12-2023.csv")

In [23]:
last.head()

Unnamed: 0,FIPS,Admin2,Province_State,Country_Region,Last_Update,Lat,Long_,Confirmed,Deaths,Recovered,Active,Combined_Key,Incident_Rate,Case_Fatality_Ratio
0,,,,Afghanistan,2023-01-13 04:20:58,33.93911,67.709953,207900,7854,,,Afghanistan,534.058207,3.777778
1,,,,Albania,2023-01-13 04:20:58,41.1533,20.1683,334018,3596,,,Albania,11606.713462,1.076589
2,,,,Algeria,2023-01-13 04:20:58,28.0339,1.6596,271277,6881,,,Algeria,618.632948,2.536522
3,,,,Andorra,2023-01-13 04:20:58,42.5063,1.5218,47781,165,,,Andorra,61840.419336,0.345326
4,,,,Angola,2023-01-13 04:20:58,-11.2027,17.8739,105095,1930,,,Angola,319.765542,1.836434


In [24]:
last.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4016 entries, 0 to 4015
Data columns (total 14 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   FIPS                 3268 non-null   float64
 1   Admin2               3272 non-null   object 
 2   Province_State       3837 non-null   object 
 3   Country_Region       4016 non-null   object 
 4   Last_Update          4016 non-null   object 
 5   Lat                  3925 non-null   float64
 6   Long_                3925 non-null   float64
 7   Confirmed            4016 non-null   int64  
 8   Deaths               4016 non-null   int64  
 9   Recovered            0 non-null      float64
 10  Active               0 non-null      float64
 11  Combined_Key         4016 non-null   object 
 12  Incident_Rate        3922 non-null   float64
 13  Case_Fatality_Ratio  3974 non-null   float64
dtypes: float64(7), int64(2), object(5)
memory usage: 439.4+ KB


Can I concatenate both datasets?

In [25]:
pd.concat((first, last))

Unnamed: 0,Province/State,Country/Region,Last Update,Confirmed,Deaths,Recovered,FIPS,Admin2,Province_State,Country_Region,Last_Update,Lat,Long_,Active,Combined_Key,Incident_Rate,Case_Fatality_Ratio
0,Anhui,Mainland China,1/22/2020 17:00,1.0,,,,,,,,,,,,,
1,Beijing,Mainland China,1/22/2020 17:00,14.0,,,,,,,,,,,,,
2,Chongqing,Mainland China,1/22/2020 17:00,6.0,,,,,,,,,,,,,
3,Fujian,Mainland China,1/22/2020 17:00,1.0,,,,,,,,,,,,,
4,Gansu,Mainland China,1/22/2020 17:00,,,,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4011,,,,703228.0,5708.0,,,,,West Bank and Gaza,2023-01-13 04:20:58,31.952200,35.233200,,West Bank and Gaza,13784.956961,0.811686
4012,,,,535.0,0.0,,,,,Winter Olympics 2022,2023-01-13 04:20:58,39.904200,116.407400,,Winter Olympics 2022,,0.000000
4013,,,,11945.0,2159.0,,,,,Yemen,2023-01-13 04:20:58,15.552727,48.516388,,Yemen,40.048994,18.074508
4014,,,,336340.0,4034.0,,,,,Zambia,2023-01-13 04:20:58,-13.133897,27.849332,,Zambia,1829.530053,1.199382


Oops!!! The column names don't match :-(
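
One way to see exactly which columns differ is to compare the two column indexes. This is a small sketch with two empty frames that only carry the headers printed by `first.info()` and `last.info()`; `old` and `new` are hypothetical names.

```python
import pandas as pd

# Headers of the first and the last daily report, as listed by info() above
old = pd.DataFrame(columns=["Province/State", "Country/Region", "Last Update",
                            "Confirmed", "Deaths", "Recovered"])
new = pd.DataFrame(columns=["FIPS", "Admin2", "Province_State", "Country_Region",
                            "Last_Update", "Lat", "Long_", "Confirmed", "Deaths",
                            "Recovered", "Active", "Combined_Key",
                            "Incident_Rate", "Case_Fatality_Ratio"])

shared = old.columns.intersection(new.columns)    # present in both files
only_old = old.columns.difference(new.columns)    # only in the old layout
only_new = new.columns.difference(old.columns)    # only in the new layout

print("shared:  ", list(shared))
print("only old:", list(only_old))
print("only new:", list(only_new))
```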

## Loading the data into Pandas and cleaning it

In [26]:
import glob
import os

files = glob.glob("./COVID-19/csse_covid_19_data/csse_covid_19_daily_reports/*.csv")
files.sort(key=os.path.getmtime)

We are going to:
- Create a blank Dataset to store all the data
- Load every dataset, unifying the column names so we can concatenate them without any problem.
- Remove extra blank spaces from the country field
- Enrich the information with the date of the data in the correct type

In [27]:
df_all_data = pd.DataFrame()
for file in files:  
    df_file = (pd.read_csv(file)
                  # Columns are different in some files
                  .rename(columns = {'Province/State' : 'State', 
                        'Province_State' : 'State', 
                        "Country/Region" : 'Country',
                        "Country_Region" : 'Country',
                        'Last Update' : 'Last_Update',
                        'Confirmed' : 'ConfirmedAcum',
                        'Deaths' : 'DeathsAcum',
                        'Recovered' : 'RecoveredAcum'})
                  # A new field with the date of data is created
                  # Country field is cleaned
                  .assign(
                    Date = pd.to_datetime(file[-14:-4], format = '%m-%d-%Y'),
                    Country = lambda dataset: dataset.Country.str.strip()
                  )
                )
    df_all_data = pd.concat([df_all_data, df_file])
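
Calling `pd.concat` inside the loop builds a new, larger frame on every iteration. A common alternative is to collect the frames in a list and concatenate once at the end. The sketch below illustrates that pattern with two tiny made-up reports held in memory instead of the real files; `reports`, `renames` and `combined` are hypothetical names.

```python
import io
import pandas as pd

# Two made-up "daily reports", one per header style (file name -> CSV text)
reports = {
    "01-22-2020.csv": "Province/State,Country/Region,Confirmed\nHubei, Mainland China,444\n",
    "01-12-2023.csv": "Province_State,Country_Region,Confirmed\nMadrid,Spain,100\n",
}

renames = {
    "Province/State": "State", "Province_State": "State",
    "Country/Region": "Country", "Country_Region": "Country",
    "Confirmed": "ConfirmedAcum",
}

frames = []
for name, text in reports.items():
    frame = (
        pd.read_csv(io.StringIO(text))
        .rename(columns=renames)
        .assign(
            # The date comes from the file name, as in the loop above
            Date=pd.to_datetime(name[:10], format="%m-%d-%Y"),
            Country=lambda d: d.Country.str.strip(),
        )
    )
    frames.append(frame)

# A single concatenation at the end
combined = pd.concat(frames, ignore_index=True)
print(combined)
```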


Let's create a new variable in case it is necessary to repeat the analysis

In [38]:
df_data = df_all_data.copy()

In [56]:
df_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 103343 entries, 0 to 103342
Data columns (total 6 columns):
 #   Column         Non-Null Count   Dtype         
---  ------         --------------   -----         
 0   Date           103343 non-null  datetime64[ns]
 1   Country        103343 non-null  object        
 2   ConfirmedAcum  103343 non-null  float64       
 3   DeathsAcum     103343 non-null  float64       
 4   RecoveredAcum  103343 non-null  float64       
 5   ActiveAcum     103343 non-null  float64       
dtypes: datetime64[ns](1), float64(4), object(1)
memory usage: 4.7+ MB


In [57]:
df_data.head(10)

Unnamed: 0,Date,Country,ConfirmedAcum,DeathsAcum,RecoveredAcum,ActiveAcum
0,2020-01-22,Antarctica,0.0,0.0,0.0,0.0
1,2020-01-22,China,547.0,17.0,28.0,399.0
2,2020-01-22,Hong Kong,0.0,0.0,0.0,0.0
3,2020-01-22,Japan,2.0,0.0,0.0,0.0
4,2020-01-22,Kiribati,0.0,0.0,0.0,0.0
5,2020-01-22,"Korea, North",0.0,0.0,0.0,0.0
6,2020-01-22,Macao,1.0,0.0,0.0,0.0
7,2020-01-22,Malaysia,0.0,0.0,0.0,0.0
8,2020-01-22,Nauru,0.0,0.0,0.0,0.0
9,2020-01-22,New Zealand,0.0,0.0,0.0,0.0


First, I am going to filter the columns that are interesting to me

In [58]:
df_data = df_data.filter(['Date', 'Country', 'ConfirmedAcum', 'DeathsAcum', 'RecoveredAcum'])

In [59]:
df_data.head()

Unnamed: 0,Date,Country,ConfirmedAcum,DeathsAcum,RecoveredAcum
0,2020-01-22,Antarctica,0.0,0.0,0.0
1,2020-01-22,China,547.0,17.0,28.0
2,2020-01-22,Hong Kong,0.0,0.0,0.0
3,2020-01-22,Japan,2.0,0.0,0.0
4,2020-01-22,Kiribati,0.0,0.0,0.0


I realized that there were countries that were called by different names. 

Let's fix it ...

In [60]:
df_data = df_data.assign(
    Country = lambda dataset: dataset.Country.replace({'Bahamas, The' : 'Bahamas',
                         'Congo (Brazzaville)' : 'Congo',
                         'Congo (Kinshasa)' : 'Congo',
                         "Cote d'Ivoire" : "Cote d'Ivoire",
                         "Curacao" : "Curaçao",
                         'Czech Republic' : 'Czech Republic (Czechia)',
                         'Czechia' : 'Czech Republic (Czechia)',
                         'Faroe Islands' : 'Faeroe Islands',
                         'Macau' : 'Macao',
                         'Mainland China' : 'China',
                         'Palestine' : 'State of Palestine',
                         'Reunion' : 'Réunion',
                         'Saint Kitts and Nevis' : 'Saint Kitts & Nevis',
                         'Sao Tome and Principe' : 'Sao Tome & Principe',
                         'US' : 'United States',
                         'Gambia, The' : 'Gambia',
                         'Hong Kong SAR' : 'Hong Kong',
                         'Korea, South' : 'South Korea',
                         'Macao SAR' : 'Macao',
                         'Taiwan*' : 'Taiwan',
                         'Viet Nam' : 'Vietnam',
                         'West Bank and Gaza' : 'State of Palestine',
                         'occupied Palestinian territory' : 'State of Palestine'
                         })
)    

At some point the data on recovered persons stopped being published, so I am going to keep only the rows up to the last date that contains it

In [61]:
df_data = df_data.query("Date <= '2021-08-04'")
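
The cut-off date can also be derived from the data rather than typed by hand. This is a sketch on a made-up frame, assuming that days without recovered data show up as zero or missing totals (an assumption; the real files may behave differently).

```python
import pandas as pd

# Made-up data: recovered figures stop after 2021-08-04
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2021-08-03", "2021-08-04", "2021-08-05", "2021-08-06"]),
    "RecoveredAcum": [10.0, 12.0, 0.0, None],
})

# Total recovered per day (missing values count as 0 in the sum)
daily_total = toy.groupby("Date")["RecoveredAcum"].sum()

# Last day on which any recovered case was reported
last_with_data = daily_total[daily_total > 0].index.max()

trimmed = toy[toy.Date <= last_with_data]
print(last_with_data.date(), len(trimmed))
```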

Let's verify the structure of the dataset ...

In [62]:
df_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 103343 entries, 0 to 103342
Data columns (total 5 columns):
 #   Column         Non-Null Count   Dtype         
---  ------         --------------   -----         
 0   Date           103343 non-null  datetime64[ns]
 1   Country        103343 non-null  object        
 2   ConfirmedAcum  103343 non-null  float64       
 3   DeathsAcum     103343 non-null  float64       
 4   RecoveredAcum  103343 non-null  float64       
dtypes: datetime64[ns](1), float64(3), object(1)
memory usage: 4.7+ MB


More than **100,000** rows!

In [63]:
(df_data
    .query("Country == 'Spain'") 
    .sort_values('Date', ascending = False) 
    .head()
)

Unnamed: 0,Date,Country,ConfirmedAcum,DeathsAcum,RecoveredAcum
103309,2021-08-04,Spain,4545184.0,81844.0,150376.0
103109,2021-08-03,Spain,4523310.0,81773.0,150376.0
102909,2021-08-02,Spain,4502983.0,81643.0,150376.0
102709,2021-08-01,Spain,4447044.0,81486.0,150376.0
102509,2021-07-31,Spain,4447044.0,81486.0,150376.0


Wait a sec, I think it would be interesting to have a column with the active cases. Let's create it ...

In [64]:
df_data = df_data.assign(
    ActiveAcum = lambda dataset: dataset.ConfirmedAcum  - dataset.DeathsAcum - dataset.RecoveredAcum
)

In [65]:
df_data.head()

Unnamed: 0,Date,Country,ConfirmedAcum,DeathsAcum,RecoveredAcum,ActiveAcum
0,2020-01-22,Antarctica,0.0,0.0,0.0,0.0
1,2020-01-22,China,547.0,17.0,28.0,502.0
2,2020-01-22,Hong Kong,0.0,0.0,0.0,0.0
3,2020-01-22,Japan,2.0,0.0,0.0,2.0
4,2020-01-22,Kiribati,0.0,0.0,0.0,0.0


In [66]:
(df_data
    .query("Country == 'Spain'")
    .sort_values('Date', ascending = False)
    .head()
)

Unnamed: 0,Date,Country,ConfirmedAcum,DeathsAcum,RecoveredAcum,ActiveAcum
103309,2021-08-04,Spain,4545184.0,81844.0,150376.0,4312964.0
103109,2021-08-03,Spain,4523310.0,81773.0,150376.0,4291161.0
102909,2021-08-02,Spain,4502983.0,81643.0,150376.0,4270964.0
102709,2021-08-01,Spain,4447044.0,81486.0,150376.0,4215182.0
102509,2021-07-31,Spain,4447044.0,81486.0,150376.0,4215182.0


Perfect :-)

Now, I am going to group and summarize the data because I want to be sure that there is only one row per Date and Country

In [67]:
df_data = (df_data
            .groupby(["Date", "Country"], as_index = False )
            .agg("sum")
            )
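
To confirm that the grouping really leaves one row per Date and Country, `DataFrame.duplicated` can be applied to the key columns. This is a sketch on made-up rows in which two rows share the same key:

```python
import pandas as pd

# Made-up rows: "Congo" appears twice on the same day
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2021-08-04", "2021-08-04", "2021-08-04"]),
    "Country": ["Congo", "Congo", "Spain"],
    "ConfirmedAcum": [10.0, 5.0, 7.0],
})
keys = ["Date", "Country"]

dupes_before = toy.duplicated(keys).any()

grouped = toy.groupby(keys, as_index=False).agg("sum")

dupes_after = grouped.duplicated(keys).any()
print(dupes_before, dupes_after)
```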

In [68]:
(df_data
    .query("Country == 'Spain'")
    .sort_values('Date', ascending = False)
    .head()
)

Unnamed: 0,Date,Country,ConfirmedAcum,DeathsAcum,RecoveredAcum,ActiveAcum
103309,2021-08-04,Spain,4545184.0,81844.0,150376.0,4312964.0
103109,2021-08-03,Spain,4523310.0,81773.0,150376.0,4291161.0
102909,2021-08-02,Spain,4502983.0,81643.0,150376.0,4270964.0
102709,2021-08-01,Spain,4447044.0,81486.0,150376.0,4215182.0
102509,2021-07-31,Spain,4447044.0,81486.0,150376.0,4215182.0


##  Daily Cases
I am going to enrich the data by creating new columns with the daily cases.  

In [69]:
(df_data
      .query("Country == 'Spain'") 
      .sort_values(['Country', 'Date'], ascending = False) 
      .filter(['Date', 'Country','ConfirmedAcum']) 
      .head(5)
)

Unnamed: 0,Date,Country,ConfirmedAcum
103309,2021-08-04,Spain,4545184.0
103109,2021-08-03,Spain,4523310.0
102909,2021-08-02,Spain,4502983.0
102709,2021-08-01,Spain,4447044.0
102509,2021-07-31,Spain,4447044.0


To obtain the daily cases, the previous day's cumulative cases must be subtracted from the current day's cumulative cases.


First I create new columns with the cases from the previous day

In [70]:
df_data = (df_data 
      # Sort the data to prepare it for the following step
      .sort_values(['Country', 'Date']) 
      # Create new columns with the previous value of each Country
      .assign(ConfirmedPrevious = lambda dataset: dataset.groupby(['Country']).shift(1)["ConfirmedAcum"],
              DeathsPrevious = lambda dataset: dataset.groupby(['Country']).shift(1)["DeathsAcum"],
              RecoveredPrevious = lambda dataset: dataset.groupby(['Country']).shift(1)["RecoveredAcum"],
              ActivePrevious = lambda dataset: dataset.groupby(['Country']).shift(1)["ActiveAcum"],
      ) 
      # Replace nulls by 0 (the first day of each country has no previous value)
      .fillna({ 'ConfirmedPrevious' : 0, 'DeathsPrevious' : 0, 'RecoveredPrevious' : 0, 'ActivePrevious' : 0 })
 )

In [71]:
(df_data 
      .query("Country == 'Spain'") 
      .sort_values(['Country', 'Date'], ascending = False) 
      .filter(['Date', 'Country','ConfirmedAcum', 'ConfirmedPrevious']) 
      .head(5)
)

Unnamed: 0,Date,Country,ConfirmedAcum,ConfirmedPrevious
103309,2021-08-04,Spain,4545184.0,4523310.0
103109,2021-08-03,Spain,4523310.0,4502983.0
102909,2021-08-02,Spain,4502983.0,4447044.0
102709,2021-08-01,Spain,4447044.0,4447044.0
102509,2021-07-31,Spain,4447044.0,4447044.0


After that, I am going to create the new fields by subtracting the previous cumulative cases from the current cumulative cases

In [72]:
df_data = df_data.assign(
      Confirmed = lambda dataset: dataset.ConfirmedAcum -  dataset.ConfirmedPrevious,
      Deaths = lambda dataset: dataset.DeathsAcum - dataset.DeathsPrevious,
      Recovered = lambda dataset: dataset.RecoveredAcum - dataset.RecoveredPrevious,
      Active = lambda dataset: dataset.ActiveAcum - dataset.ActivePrevious
    )
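
The shift-and-subtract steps can also be written with `groupby(...).diff()`. This is a sketch on made-up numbers, not the notebook's data. Filling the first day of each country with its own cumulative value is equivalent to treating the previous value as 0.

```python
import pandas as pd

# Made-up cumulative counts for two hypothetical countries
toy = pd.DataFrame({
    "Country": ["A", "A", "A", "B", "B"],
    "Date": pd.to_datetime(["2021-08-01", "2021-08-02", "2021-08-03",
                            "2021-08-01", "2021-08-02"]),
    "ConfirmedAcum": [1.0, 4.0, 4.0, 10.0, 15.0],
}).sort_values(["Country", "Date"])

# Difference with the previous row of the same country;
# the first row of each country is NaN and is filled with its cumulative value
toy["Confirmed"] = (
    toy.groupby("Country")["ConfirmedAcum"].diff()
       .fillna(toy["ConfirmedAcum"])
)

print(toy)
```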

In [73]:
(df_data 
      .query("Country == 'Spain'") 
      .sort_values(['Country', 'Date'], ascending = False) 
      .filter(['Date', 'Country','ConfirmedAcum', 'ConfirmedPrevious', 'Confirmed']) 
      .head(5)
 )

Unnamed: 0,Date,Country,ConfirmedAcum,ConfirmedPrevious,Confirmed
103309,2021-08-04,Spain,4545184.0,4523310.0,21874.0
103109,2021-08-03,Spain,4523310.0,4502983.0,20327.0
102909,2021-08-02,Spain,4502983.0,4447044.0,55939.0
102709,2021-08-01,Spain,4447044.0,4447044.0,0.0
102509,2021-07-31,Spain,4447044.0,4447044.0,0.0


I no longer need the fields I used to make the calculation so I can drop them

In [74]:
df_data = df_data.drop(columns = ['ConfirmedPrevious', 'DeathsPrevious', 'RecoveredPrevious', 'ActivePrevious'])

In [76]:
(df_data
      .query("Country == 'Spain'") 
      .sort_values(['Country', 'Date'], ascending = False)
      .head(5)
 )

Unnamed: 0,Date,Country,ConfirmedAcum,DeathsAcum,RecoveredAcum,ActiveAcum,Confirmed,Deaths,Recovered,Active
103309,2021-08-04,Spain,4545184.0,81844.0,150376.0,4312964.0,21874.0,71.0,0.0,21803.0
103109,2021-08-03,Spain,4523310.0,81773.0,150376.0,4291161.0,20327.0,130.0,0.0,20197.0
102909,2021-08-02,Spain,4502983.0,81643.0,150376.0,4270964.0,55939.0,157.0,0.0,55782.0
102709,2021-08-01,Spain,4447044.0,81486.0,150376.0,4215182.0,0.0,0.0,0.0,0.0
102509,2021-07-31,Spain,4447044.0,81486.0,150376.0,4215182.0,0.0,0.0,0.0,0.0


The data looks good!

## Cases per million inhabitants



We are going to enrich the information with the number of cases per million inhabitants, so we need population data by country.

A small internet search leads me to a page that has population data for 2020:

https://www.worldometers.info/world-population/population-by-country/

It seems that the page is protected against automated downloads, so I have no choice but to download the data manually and upload it to a GitHub repository:

https://github.com/dvillaj/world-population/



I load the data into Pandas, clean it up and keep only the population field


In [77]:
df_population = pd.read_excel("https://github.com/dvillaj/world-population/blob/master/data/world-popultation-2020.xlsx?raw=true", 
                              sheet_name="Data")

In [78]:
df_population.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 235 entries, 0 to 234
Data columns (total 11 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Country            235 non-null    object 
 1   Population (2020)  235 non-null    int64  
 2   Yearly Change      235 non-null    float64
 3   Net Change         235 non-null    int64  
 4   Density (P/Km²)    235 non-null    float64
 5   Land Area (Km²)    235 non-null    int64  
 6   Migrants (net)     201 non-null    float64
 7   Fertility Rate     201 non-null    float64
 8   Average Age        201 non-null    float64
 9   Urban Pop %        222 non-null    float64
 10  World Share        235 non-null    float64
dtypes: float64(7), int64(3), object(1)
memory usage: 20.3+ KB


In [79]:
df_population = (df_population
                    .rename(columns = {
                        'Population (2020)' : 'Population',
                        'Yearly Change' : 'Yearly_Change',
                        'Net Change' : 'Net_Change',
                        'Density (P/Km²)' : 'Density',
                        'Land Area (Km²)' : 'Land_Area',
                        'Migrants (net)' : 'Migrants',
                        'Fertility Rate' : 'Fertility',
                        'Average Age' : 'Mean_Age',
                        'Urban Pop %' : 'Urban_Pop',
                        'World Share' : 'World_Share'
                    })
                )

In [82]:
df_population = df_population.filter(['Country', 'Population'])

In [83]:
df_population.head()

Unnamed: 0,Country,Population
0,Afghanistan,38928346
1,Albania,2877797
2,Algeria,43851044
3,American Samoa,55191
4,Andorra,77265


Now I join the population Dataset with the country data to have the population in this dataset

In [84]:
df_data = df_data.merge(df_population, how = 'left', on = 'Country')
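
With a left join, countries whose names do not appear in the population table end up with a missing `Population`. The `indicator` parameter of `merge` makes them easy to list. This is a sketch with made-up case counts and a fictional country name; the two population figures are the ones shown elsewhere in this notebook.

```python
import pandas as pd

# Made-up case data; "Atlantis" is a fictional name with no population entry
cases = pd.DataFrame({
    "Country": ["Spain", "Andorra", "Atlantis"],
    "ConfirmedAcum": [3.0, 2.0, 1.0],
})
population = pd.DataFrame({
    "Country": ["Spain", "Andorra"],
    "Population": [46754778, 77265],
})

# indicator=True adds a "_merge" column telling where each row was found
merged = cases.merge(population, how="left", on="Country", indicator=True)

unmatched = merged.loc[merged["_merge"] == "left_only", "Country"].tolist()
print(unmatched)
```

The resulting list shows which names would need to be added to the country name mapping.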

In [101]:
(df_data
      .query("Country == 'Spain'") 
      .sort_values(['Country', 'Date'], ascending = False) 
      .head(5)
 )

Unnamed: 0,Date,Country,ConfirmedAcum,DeathsAcum,RecoveredAcum,ActiveAcum,Confirmed,Deaths,Recovered,Active,Population
86186,2021-08-04,Spain,4545184.0,81844.0,150376.0,4312964.0,21874.0,71.0,0.0,21803.0,46754778.0
86185,2021-08-03,Spain,4523310.0,81773.0,150376.0,4291161.0,20327.0,130.0,0.0,20197.0,46754778.0
86184,2021-08-02,Spain,4502983.0,81643.0,150376.0,4270964.0,55939.0,157.0,0.0,55782.0,46754778.0
86183,2021-08-01,Spain,4447044.0,81486.0,150376.0,4215182.0,0.0,0.0,0.0,0.0,46754778.0
86182,2021-07-31,Spain,4447044.0,81486.0,150376.0,4215182.0,0.0,0.0,0.0,0.0,46754778.0


And finally I calculate the number of cases per million inhabitants

In [102]:
df_data = (df_data 
      .assign(ConfirmedAcum_Millon = 
              lambda dataset: (dataset.ConfirmedAcum / dataset.Population * 1000000).round(0))
      )
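
As a quick check of the formula, here is the same calculation on a single row, using the figures for Spain on 2021-08-04 that appear in the output above (`toy` and `per_million` are hypothetical names):

```python
import pandas as pd

# One row with the Spain figures shown above
toy = pd.DataFrame({
    "Country": ["Spain"],
    "ConfirmedAcum": [4545184.0],
    "Population": [46754778.0],
})

# cases per million = cumulative cases / population * 1,000,000
per_million = (toy.ConfirmedAcum / toy.Population * 1_000_000).round(0)
print(per_million.iloc[0])
```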

In [103]:
(df_data 
      .query("Country == 'Spain'") 
      .sort_values(['Country', 'Date'], ascending = False) 
      .filter(['Date', 'Country', 'ConfirmedAcum', 'Population', 'ConfirmedAcum_Millon']) 
      .head(5)
 )

Unnamed: 0,Date,Country,ConfirmedAcum,Population,ConfirmedAcum_Millon
86186,2021-08-04,Spain,4545184.0,46754778.0,97213.0
86185,2021-08-03,Spain,4523310.0,46754778.0,96745.0
86184,2021-08-02,Spain,4502983.0,46754778.0,96311.0
86183,2021-08-01,Spain,4447044.0,46754778.0,95114.0
86182,2021-07-31,Spain,4447044.0,46754778.0,95114.0


## Last day cases

I'm going to create a dataset of last day's cases. 

The goal is to get a set of rankings that tell me the countries with the most cases

First let's find out the most recent day in our data

In [104]:
df_data.Date.sort_values(ascending = False).head(1)

103342   2021-08-04
Name: Date, dtype: datetime64[ns]

In [105]:
last_day_data = df_data.query("Date == '2021-08-04'")
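
The date does not have to be typed by hand. This is a sketch, on a made-up frame, that takes the most recent date from the data and filters with it (`latest` and `last_day` are hypothetical names):

```python
import pandas as pd

# Made-up rows covering two days
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2021-08-03", "2021-08-04", "2021-08-04"]),
    "Country": ["Spain", "Spain", "France"],
    "ConfirmedAcum": [1.0, 2.0, 3.0],
})

latest = toy.Date.max()

# query() can reference local Python variables with the @ prefix
last_day = toy.query("Date == @latest")
print(last_day)
```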

In [106]:
(last_day_data 
      .query("Country == 'Spain'") 
      .sort_values(['Country', 'Date'], ascending = False) 
      .filter(['Date', 'Country', 'ConfirmedAcum', 'Population', 'ConfirmedAcum_Millon']) 
      .head(5)
 )

Unnamed: 0,Date,Country,ConfirmedAcum,Population,ConfirmedAcum_Millon
86186,2021-08-04,Spain,4545184.0,46754778.0,97213.0


In [107]:
last_day_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 200 entries, 527 to 103342
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype         
---  ------                --------------  -----         
 0   Date                  200 non-null    datetime64[ns]
 1   Country               200 non-null    object        
 2   ConfirmedAcum         200 non-null    float64       
 3   DeathsAcum            200 non-null    float64       
 4   RecoveredAcum         200 non-null    float64       
 5   ActiveAcum            200 non-null    float64       
 6   Confirmed             200 non-null    float64       
 7   Deaths                200 non-null    float64       
 8   Recovered             200 non-null    float64       
 9   Active                200 non-null    float64       
 10  Population            190 non-null    float64       
 11  ConfirmedAcum_Millon  190 non-null    float64       
dtypes: datetime64[ns](1), float64(10), object(1)
memory usage: 20.3+ KB


Which countries have the most confirmed cases?

In [108]:
(last_day_data 
  .sort_values('ConfirmedAcum', ascending = False) 
  .filter(['Country', 'ConfirmedAcum']) 
  .reset_index(drop = True) 
  .head(10)
 )

Unnamed: 0,Country,ConfirmedAcum
0,United States,35458071.0
1,India,31812114.0
2,Brazil,20034407.0
3,Russia,6274006.0
4,France,6272466.0
5,United Kingdom,5980830.0
6,Turkey,5822487.0
7,Argentina,4975616.0
8,Colombia,4815063.0
9,Spain,4545184.0


Which countries have the most active cases?

In [97]:
(last_day_data 
  .sort_values('ActiveAcum', ascending = False) 
  .filter(['Country', 'ActiveAcum']) 
  .reset_index(drop = True) 
  .head(10)
 )

Unnamed: 0,Country,ActiveAcum
0,United States,34846777.0
1,United Kingdom,5798928.0
2,France,5745110.0
3,Spain,4312964.0
4,Netherlands,1858347.0
5,Brazil,1703235.0
6,Belgium,1107676.0
7,Sweden,1088172.0
8,Serbia,716389.0
9,Thailand,640009.0


Which countries have the most confirmed cases per million inhabitants?

In [109]:
(last_day_data
  .sort_values('ConfirmedAcum_Millon', ascending = False) 
  .filter(['Country', 'ConfirmedAcum_Millon']) 
  .reset_index(drop = True) 
  .head(10)
)

Unnamed: 0,Country,ConfirmedAcum_Millon
0,Andorra,191510.0
1,Seychelles,188577.0
2,Montenegro,163454.0
3,Bahrain,158451.0
4,Czech Republic (Czechia),156334.0
5,San Marino,152604.0
6,Maldives,144105.0
7,Slovakia,142265.0
8,Slovenia,124883.0
9,Luxembourg,118445.0


What is the total number of cases?

In [110]:
(last_day_data
    .agg(
        TotalConfirmed = ("ConfirmedAcum", "sum"),
        TotalRecovered = ("RecoveredAcum", "sum"),
        TotalDeaths = ("DeathsAcum", "sum"),
    ) 
    .style.format("{:,.0f}")
)

Unnamed: 0,ConfirmedAcum,RecoveredAcum,DeathsAcum
TotalConfirmed,200757318.0,,
TotalRecovered,,130899061.0,
TotalDeaths,,,4282889.0
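
Named aggregation on a whole DataFrame returns one row per result name and one column per source column, which is why most cells in the table above are empty. A more compact option is to select the columns and sum them into a Series. This is a sketch with made-up numbers; `totals` is a hypothetical name.

```python
import pandas as pd

# Made-up cumulative figures for two hypothetical countries
toy = pd.DataFrame({
    "ConfirmedAcum": [10.0, 20.0],
    "RecoveredAcum": [4.0, 6.0],
    "DeathsAcum": [1.0, 2.0],
})

totals = (
    toy[["ConfirmedAcum", "RecoveredAcum", "DeathsAcum"]]
    .sum()
    # Rename the index labels of the resulting Series
    .rename({
        "ConfirmedAcum": "TotalConfirmed",
        "RecoveredAcum": "TotalRecovered",
        "DeathsAcum": "TotalDeaths",
    })
)

# Format with thousands separators for display
print(totals.map("{:,.0f}".format))
```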


## Exporting the clean dataset to Excel

Finally I will create an Excel file with the information per country once it is clean and in perfect condition to apply some machine learning algorithms ...

In [112]:
df_data.to_excel("Covid19-Clean-Data.xlsx", index = False)