# Data Cleaning & Validation - Country Indicator Data

**Author:** Alan Meeson <alan.meeson@capgemini.com>

**Date:** 2023-02-06

This notebook captures assumptions about the data, and validation of those assumptions.
This can serve as a template for the Cleaning and Validation stage of the ETL process for the country data.

Key findings are:
- There is a large quantity of missing data in most columns
- Some columns are sampled at varying frequencies, e.g. excess mortality can be monthly, weekly, or other, varying by location.
- The location column does not only contain countries, but also continents, world, and socio-economic categories.

The analysis here is less thorough than would be ideal, because time was limited.

In [None]:
import os
import re
import datetime
import numpy as np
import pandas as pd
import pandera as pa
import matplotlib.pyplot as plt

In [None]:
pd.set_option('display.max_columns', None)

In [None]:
data_dir = '../data'
country_filename = os.path.join(data_dir, 'raw', 'Data2_Large set of country indicators.xlsx')
cleaned_country_filename = os.path.join(data_dir, 'cleaned', 'country_indicators.parquet')

## Explore the Country Data
Key data available:
- Deaths & excess mortality
- Reproduction rate
- Hospitalisation
- Tests & positive rate
- Vaccinations
- stringency_index (of lockdowns?)
- population statistics (density, age, wealth, handwashing, etc.)


In [None]:
countries_df = pd.read_excel(
    country_filename, 
    parse_dates=['date']
)

### Initial Look and Datatypes

In [None]:
countries_df.head()

In [None]:
countries_df.dtypes

### Explore the columns

#### Continents & Locations

First, let's have a look at what different continents we have data for.

We get the expected set (we don't really expect to get COVID variant data from Antarctica), plus some rows where the continent is `nan`.

In [None]:
countries_df.continent.unique()

So, which locations have `nan` as a continent?
- It turns out that some locations are not countries, but rather aggregations over the continents and the globe.
- Others are income-level groupings.

In [None]:
countries_df.loc[countries_df['continent'].isna(), 'location'].unique()
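
If downstream analysis needs country-level rows only, one option is to split on the missing continent. The following is a minimal sketch on a toy frame (the rows and values in `toy_df` are made up for illustration), and it assumes that every null continent marks an aggregate row, which the output above suggests but does not guarantee.

```python
import numpy as np
import pandas as pd

# Toy stand-in for countries_df; rows and values are illustrative only.
toy_df = pd.DataFrame({
    'continent': ['Europe', 'Asia', np.nan, np.nan],
    'location': ['France', 'Japan', 'World', 'High income'],
    'population': [67_000_000, 125_000_000, 8_000_000_000, 1_200_000_000],
})

# Assumption: a null continent means the row is an aggregate
# (continent, world, or income group) rather than a country.
is_aggregate = toy_df['continent'].isna()

country_rows = toy_df.loc[~is_aggregate].reset_index(drop=True)
aggregate_rows = toy_df.loc[is_aggregate].reset_index(drop=True)

print(country_rows['location'].tolist())
print(aggregate_rows['location'].tolist())
```

Keeping the aggregates in a separate frame, rather than dropping them, leaves them available for sanity checks against the per-country totals.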

#### Test Units

This is the only other non-numeric column, so let's have a look. It seems to indicate what units the test-related columns are given in.

Questions to consider:
- Do the units make any difference here?
- Is there a systematic pattern by location or time as to what units are used?

In [None]:
countries_df.tests_units.unique()
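
One way to start on the location question is to count how many distinct units each location reports. This is a sketch on a toy frame; the unit labels `unit_a`, `unit_b` and `unit_c` are placeholders, not the values actually present in `tests_units`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for countries_df; locations and unit labels are placeholders.
toy_df = pd.DataFrame({
    'location': ['A', 'A', 'A', 'B', 'B', 'C'],
    'tests_units': ['unit_a', 'unit_a', np.nan, 'unit_b', 'unit_c', np.nan],
})

# Number of distinct non-null units per location (nunique ignores NaN).
units_per_location = toy_df.groupby('location')['tests_units'].nunique()

# Locations that switch units at some point.
mixed_unit_locations = units_per_location[units_per_location > 1].index.tolist()

# Row counts per (location, unit); rows with a null unit are left out.
unit_counts = pd.crosstab(toy_df['location'], toy_df['tests_units'])

print(units_per_location)
print(mixed_unit_locations)
print(unit_counts)
```

Applied to `countries_df`, an empty `mixed_unit_locations` would suggest the unit is a fixed property of each location; a non-empty list would point to changes over time worth inspecting by date.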

#### General description of columns

In [None]:
countries_df.describe()

#### Percentage of null values

In [None]:
100 * countries_df.isna().mean()
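
Some of the missing values come from columns that are only reported at intervals, such as excess mortality. A rough way to check the reporting frequency per location is to look at the gap between consecutive non-null observations. This is a sketch on a toy frame with made-up dates and values; the column names match the dataset, everything else is illustrative.

```python
import numpy as np
import pandas as pd

# Toy stand-in: location A reports weekly, location B roughly monthly.
toy_df = pd.DataFrame({
    'location': ['A'] * 4 + ['B'] * 4,
    'date': pd.to_datetime([
        '2021-01-03', '2021-01-10', '2021-01-17', '2021-01-24',
        '2021-01-31', '2021-02-28', '2021-03-31', '2021-04-30',
    ]),
    'excess_mortality': [1.0, 2.0, 3.0, np.nan, 4.0, 5.0, 6.0, 7.0],
})

# Keep only rows where the indicator was actually reported.
observed = (
    toy_df.dropna(subset=['excess_mortality'])
          .sort_values(['location', 'date'])
)

# Days between consecutive observations within each location.
gap_days = observed.groupby('location')['date'].diff().dt.days

# Typical reporting interval per location.
median_gap_days = gap_days.groupby(observed['location']).median()

print(median_gap_days)
```

A median gap near 7 suggests weekly reporting and one near 30 suggests monthly; locations that fit neither would need a closer look.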

## Clean and Transform

In [None]:
countries_df = pd.read_excel(
    country_filename, 
    dtype={
        'continent': str, # Continent the location is on. May be null if location is itself a continent, or a socioeconomic class, or just the whole world
        'location': str, # Mostly countries, not all of which have evolution data; sometimes a continent, or a socioeconomic class, or just the whole world
        'date': str,
        'total_deaths_per_million': float,
        'new_deaths_per_million': float,
        'new_deaths_smoothed_per_million': float,
        'reproduction_rate': float, # the R value of the virus
        'hosp_patients': 'Int64', 
        'hosp_patients_per_million': float,
        'total_tests': 'Int64', 
        'new_tests': 'Int64', 
        'total_tests_per_thousand': float,
        'new_tests_per_thousand': float, 
        'new_tests_smoothed': 'Int64',
        'new_tests_smoothed_per_thousand': float, 
        'positive_rate': float, 
        'tests_per_case': float,
        'tests_units': str, 
        'total_vaccinations': 'Int64', 
        'people_vaccinated': 'Int64',
        'people_fully_vaccinated': 'Int64', 
        'total_boosters': 'Int64', 
        'new_vaccinations': 'Int64',
        'new_vaccinations_smoothed': 'Int64', 
        'total_vaccinations_per_hundred': float,
        'people_vaccinated_per_hundred': float, 
        'people_fully_vaccinated_per_hundred': float,
        'new_vaccinations_smoothed_per_million': 'Int64',
        'new_people_vaccinated_smoothed': 'Int64',
        'new_people_vaccinated_smoothed_per_hundred': float, 
        'stringency_index': float, # stringency of lockdown 
        'population_density': float, 
        'median_age': float, 
        'aged_65_older': float, 
        'aged_70_older': float,
        'gdp_per_capita': float, 
        'extreme_poverty': float, 
        'cardiovasc_death_rate': float,
        'diabetes_prevalence': float,  
        'female_smokers': float, 
        'male_smokers': float,
        'handwashing_facilities': float, 
        'hospital_beds_per_thousand': float,
        'life_expectancy': float, 
        'human_development_index': float, 
        'population': 'Int64',
        'excess_mortality_cumulative_absolute': float, 
        'excess_mortality_cumulative': float,
        'excess_mortality': float, 
        'excess_mortality_cumulative_per_million': float
    },
    parse_dates=['date']
)

In [None]:
schema = pa.DataFrameSchema({
    'continent': pa.Column(str, checks=pa.Check.str_matches(r'^[A-Za-z, ]+$'), nullable=True), # Continent the location is on. May be null if location is itself a continent, or a socioeconomic class, or just the whole world
    'location': pa.Column(str, checks=pa.Check.str_matches(r"^[A-Za-z,\-\(\) ']+$")), # Mostly countries, not all of which have evolution data; sometimes a continent, or a socioeconomic class, or just the whole world
    'date': pa.Column('datetime64[ns]'),
    'total_deaths_per_million': pa.Column(float, checks=pa.Check.ge(0), nullable=True),
    'new_deaths_per_million': pa.Column(float, checks=pa.Check.ge(0), nullable=True),
    'new_deaths_smoothed_per_million': pa.Column(float, checks=pa.Check.ge(0), nullable=True),
    
    # We assume COVID's effective reproduction number (Re) stays below measles' R0, so reproduction_rate is capped at 20.
    # Oddly, some values for reproduction rate are below 0, going as low as -0.08.
    # I suspect this is due to limitations of the R estimation method around 0, so the lower bound is set to -0.1.
    'reproduction_rate': pa.Column(float, checks=[pa.Check.gt(-0.1), pa.Check.lt(20)], nullable=True), 
    'hosp_patients': pa.Column('Int64', checks=pa.Check.ge(0), nullable=True),
    'hosp_patients_per_million': pa.Column(float, checks=pa.Check.ge(0), nullable=True),
    'total_tests': pa.Column('Int64', checks=pa.Check.ge(0), nullable=True), 
    'new_tests': pa.Column('Int64', checks=pa.Check.ge(0), nullable=True), 
    'total_tests_per_thousand': pa.Column(float, checks=pa.Check.ge(0), nullable=True),
    'new_tests_per_thousand': pa.Column(float, checks=pa.Check.ge(0), nullable=True), 
    'new_tests_smoothed': pa.Column('Int64', checks=pa.Check.ge(0), nullable=True),
    'new_tests_smoothed_per_thousand': pa.Column(float, checks=pa.Check.ge(0), nullable=True), 
    'positive_rate': pa.Column(float, checks=[pa.Check.ge(0), pa.Check.le(1)], nullable=True), 
    'tests_per_case': pa.Column(float, checks=pa.Check.ge(0), nullable=True), #???
    'tests_units': pa.Column(str, nullable=True), 
    'total_vaccinations': pa.Column('Int64', checks=pa.Check.ge(0), nullable=True), 
    'people_vaccinated': pa.Column('Int64', checks=pa.Check.ge(0), nullable=True),
    'people_fully_vaccinated': pa.Column('Int64', checks=pa.Check.ge(0), nullable=True), 
    'total_boosters': pa.Column('Int64', checks=pa.Check.ge(0), nullable=True), 
    'new_vaccinations': pa.Column('Int64', checks=pa.Check.ge(0), nullable=True),
    'new_vaccinations_smoothed': pa.Column('Int64', checks=pa.Check.ge(0), nullable=True), 
    'total_vaccinations_per_hundred': pa.Column(float, checks=pa.Check.ge(0), nullable=True),
    'people_vaccinated_per_hundred': pa.Column(float, checks=pa.Check.ge(0), nullable=True), 
    'people_fully_vaccinated_per_hundred': pa.Column(float, checks=pa.Check.ge(0), nullable=True),
    'new_vaccinations_smoothed_per_million': pa.Column('Int64', checks=pa.Check.ge(0), nullable=True),
    'new_people_vaccinated_smoothed': pa.Column('Int64', checks=pa.Check.ge(0), nullable=True),
    'new_people_vaccinated_smoothed_per_hundred': pa.Column(float, checks=pa.Check.ge(0), nullable=True), 
    'stringency_index': pa.Column(float, checks=[pa.Check.ge(0), pa.Check.le(100)], nullable=True), # stringency of lockdown 
    'population_density': pa.Column(float, checks=[pa.Check.ge(0), pa.Check.le(21000)], nullable=True), # Population Density per sq Kilometer; Highest recorded is just shy of 21000.
    'median_age': pa.Column(float, checks=[pa.Check.ge(0), pa.Check.le(120)], nullable=True), 
    'aged_65_older': pa.Column(float, checks=[pa.Check.ge(0), pa.Check.le(100)], nullable=True), # Percentage of population aged 65 or over
    'aged_70_older': pa.Column(float, checks=[pa.Check.ge(0), pa.Check.le(100)], nullable=True), # Percentage of population aged 70 or over
    'gdp_per_capita': pa.Column(float, checks=pa.Check.ge(0), nullable=True), # Gross Domestic Product per Capita; not sure which source, or even which currency. Should be USD, but numbers suggest GBP.
    'extreme_poverty': pa.Column(float, checks=[pa.Check.ge(0), pa.Check.le(100)], nullable=True), # Percentage of population living in extreme poverty.  Seems to be by WorldBank Definition, but not sure.
    'cardiovasc_death_rate': pa.Column(float, checks=[pa.Check.ge(0), pa.Check.le(100000)], nullable=True), # Probably cardiovascular deaths per 100,000, judging by comparison to known external stats
    'diabetes_prevalence': pa.Column(float, checks=[pa.Check.ge(0), pa.Check.le(100)], nullable=True), # Prevalence of diabetes; not sure what scale, as UK is approx 4.28 here, and has about 4.8 million people with diabetes.
    'female_smokers': pa.Column(float, checks=[pa.Check.ge(0), pa.Check.le(100)], nullable=True), # Based on world stats this is the percentage of women who smoke.  Numbers for UK don't match official stats though.
    'male_smokers': pa.Column(float, checks=[pa.Check.ge(0), pa.Check.le(100)], nullable=True), # Based on world stats this is the percentage of men who smoke.  Numbers for UK don't match official stats though.
    'handwashing_facilities': pa.Column(float, checks=[pa.Check.ge(0), pa.Check.le(100)], nullable=True), # Percentage of people with access to hand washing facilities
    'hospital_beds_per_thousand': pa.Column(float, checks=[pa.Check.ge(0), pa.Check.le(1000)], nullable=True),
    'life_expectancy': pa.Column(float, checks=[pa.Check.ge(0), pa.Check.le(150)], nullable=True), 
    'human_development_index': pa.Column(float, checks=[pa.Check.ge(0), pa.Check.le(1)], nullable=True), 
    'population': pa.Column('Int64', checks=[pa.Check.ge(0), pa.Check.le(pow(10,10))], nullable=True), # Let's upper bound human population at 10 billion. That should be good until at least 2050.
    'excess_mortality_cumulative_absolute': pa.Column(float, nullable=True), 
    'excess_mortality_cumulative': pa.Column(float, nullable=True),
    'excess_mortality': pa.Column(float, nullable=True), 
    'excess_mortality_cumulative_per_million': pa.Column(float, nullable=True)
})

In [None]:
# This takes a while.
schema.validate(countries_df)

In [None]:
# Create the output directory if it does not yet exist
os.makedirs(os.path.dirname(cleaned_country_filename), exist_ok=True)
    
countries_df.to_parquet(cleaned_country_filename)