# Data Quality - Variant Evolution Data Set

**Author:** Alan Meeson <alan.meeson@capgemini.com>

**Date:** 2023-02-06

**Last Updated:** 2023-02-13

This notebook explores the data quality of the variant evolution data set.
It captures assumptions about the data and validates those assumptions.
This can serve as a template for the Cleaning and Validation stage of the ETL process for the evolution data.

Key findings are:
- The rows for South Africa are corrupted: the `location` entry is missing its closing double quote (`"`).
- Some `perc_sequences` entries are slightly below 0, at -0.01, typically for the variants 'other' and 'non_who'.
- Some further `perc_sequences` entries are off by 0.01, in addition to the ones noted above.
- There is some duplication in counts between the variants 'other' and 'non_who'; if both are included in the sum, the totals for a location/day do not add up correctly.
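
The first two findings lend themselves to simple automated checks. The following is a minimal sketch on invented toy rows, not the checks used later in this notebook. The helper names `find_out_of_range_perc` and `find_unbalanced_quotes` are hypothetical, and two things are assumed rather than confirmed by the data: that `perc_sequences` is a percentage bounded above by 100, and that the South Africa corruption shows up as an unbalanced double quote inside the `location` string.

```python
import pandas as pd

# Toy rows standing in for the evolution data; all values are invented for illustration.
toy_df = pd.DataFrame({
    'location': ['Angola', 'Angola', '"South Africa', 'Angola'],
    'variant': ['Alpha', 'other', 'Alpha', 'non_who'],
    'date': pd.to_datetime(['2021-01-04'] * 4),
    'num_sequences': [3, 1, 5, 1],
    'perc_sequences': [75.0, -0.01, 100.0, 25.0],
})


def find_out_of_range_perc(df, lower=0.0, upper=100.0):
    """Return rows whose perc_sequences falls outside [lower, upper].

    The upper bound of 100 is an assumption about the column's scale.
    """
    out_of_range = (df['perc_sequences'] < lower) | (df['perc_sequences'] > upper)
    return df[out_of_range]


def find_unbalanced_quotes(df, column='location'):
    """Return rows where the column holds an odd number of double quotes."""
    quote_counts = df[column].str.count('"')
    return df[quote_counts % 2 == 1]


bad_perc = find_out_of_range_perc(toy_df)
bad_location = find_unbalanced_quotes(toy_df)

print(bad_perc[['location', 'variant', 'perc_sequences']])
print(bad_location[['location', 'variant']])
```

Returning the offending rows, rather than a boolean, keeps the output usable both for inspection in the notebook and for a later cleaning step that drops or repairs them.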


In [None]:
import os 
import sys
import pandas as pd
import numpy as np
import geopandas as gpd
import matplotlib.pyplot as plt

In [None]:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

In [None]:
data_dir = '../data/cleaned'
evolution_filename = os.path.join(data_dir, 'covid_variants_evolution.parquet')

## Load and prepare data

In [None]:
evolution_df = pd.read_parquet(evolution_filename)
evolution_df.head()

## Explore data

### Date

#### How frequently do we get samples?

This plot groups the data by location and variant to show how much time passes between recorded samples.

Findings: The sampling interval is not consistent. It is typically weekly, fortnightly or monthly.

In [None]:
# For each location/variant pair, collect the distinct gaps between consecutive sample dates
date_df = evolution_df[['location','variant','date']]
date_df = date_df.set_index(['location','variant'])
date_df = date_df.groupby(['location','variant']).apply(lambda group: group.date.sort_values().diff().unique())

In [None]:
# Flatten the per-group gaps into whole days, skipping the NaT produced by the first diff in each group
delays = [gap.days for gaps in date_df for gap in pd.to_timedelta(gaps) if not pd.isna(gap)]

keys = sorted(set(delays))

plt.figure(figsize=(6,2))
plt.hist(delays, bins=keys)
plt.xticks(range(0, max(keys), 7), rotation=90)
#plt.suptitle('Distribution of time between Variant Sequence Samples', fontsize=18)
plt.title('Distribution of time between Variant Sequence Samples')
#plt.title('Predominately weekly, fortnightly or monthly with some variations or longer delays', fontsize=10)
plt.ylabel('Count')
plt.xlabel('Time between samples (days)')
plt.show()
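
As a cross-check on the histogram, the same gaps can be tabulated directly. This is a minimal sketch on invented toy dates, assuming only the `location`, `variant` and `date` columns. Unlike the cell above, which keeps each distinct gap once per location/variant group, this version counts every gap, so the two will not give identical counts.

```python
import pandas as pd

# Toy sample dates; the values are invented for illustration.
toy_df = pd.DataFrame({
    'location': ['A', 'A', 'A', 'B', 'B'],
    'variant': ['Alpha'] * 5,
    'date': pd.to_datetime([
        '2021-01-04', '2021-01-11', '2021-01-25',  # location A: gaps of 7 and 14 days
        '2021-01-04', '2021-02-01',                # location B: gap of 28 days
    ]),
})

# Sort so that diff() compares each sample with the previous one in its group.
sorted_df = toy_df.sort_values(['location', 'variant', 'date'])

# Gap in whole days between consecutive samples; the first row of each group has no gap.
gap_days = (
    sorted_df.groupby(['location', 'variant'])['date']
    .diff()
    .dt.days
    .dropna()
    .astype(int)
)

# Number of times each gap length occurs, ordered by gap length.
gap_counts = gap_days.value_counts().sort_index()
print(gap_counts)
```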

### Global prevalence of variants over time

In [None]:
evolution_df[['date', 'num_sequences']].groupby('date').sum().plot()
plt.title('Prevalence of any variant over time globally')
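
The Key findings note that counts are duplicated between 'other' and 'non_who', so a sum over all variants may overstate the total. The following is a minimal sketch of a total that leaves one of the two out. The helper `total_sequences_by_date` is hypothetical, the toy values are invented, and the choice to exclude 'non_who' rather than 'other' is an assumption that should be confirmed against the data before use.

```python
import pandas as pd

# Toy counts in which 'other' and 'non_who' repeat the same sequences; values are invented.
toy_df = pd.DataFrame({
    'variant': ['Alpha', 'other', 'non_who', 'Alpha', 'other', 'non_who'],
    'date': pd.to_datetime(['2021-01-04'] * 3 + ['2021-01-11'] * 3),
    'num_sequences': [10, 2, 2, 20, 5, 5],
})


def total_sequences_by_date(df, exclude_variants=('non_who',)):
    """Sum num_sequences per date, leaving out the listed variants.

    Which variant to exclude is an assumption; pass a different tuple to change it.
    """
    kept = df[~df['variant'].isin(exclude_variants)]
    return kept.groupby('date')['num_sequences'].sum()


naive_total = toy_df.groupby('date')['num_sequences'].sum()
adjusted_total = total_sequences_by_date(toy_df)

print(naive_total)
print(adjusted_total)
```

Comparing `naive_total` with `adjusted_total` gives a quick sense of how much the duplication inflates the totals on each date.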

In [None]:
plot_df = evolution_df[['variant', 'date', 'num_sequences']].groupby(['variant', 'date']).sum()
variants = evolution_df.variant.unique()

fig = plt.figure()
for variant in variants:
    plt.plot(plot_df.loc[variant])

plt.xticks(rotation=90)
fig.legend(variants, loc=7, bbox_to_anchor=(1.2, 0.5))
plt.title('Covid-19 variant prevalence over time, globally')
plt.ylabel('Num Sequences')
plt.xlabel('Date')
plt.show()

In [None]:
plot_df = evolution_df[['location', 'variant', 'date', 'num_sequences']].groupby(['location', 'variant', 'date']).sum()
locations = evolution_df.location.unique()

for location in locations[:3]:
    
    fig = plt.figure()
    
    variants = plot_df.loc[location].loc[plot_df.loc[location, 'num_sequences'] > 0].index.unique(level='variant')
    for variant in variants:
        plt.plot(plot_df.loc[location, variant])

    plt.xticks(rotation=90)
    fig.legend(variants, loc=7, bbox_to_anchor=(1.2, 0.5))
    plt.title('Covid-19 variant prevalence over time, %s' % location)
    plt.ylabel('Num Sequences')
    plt.xlabel('Date')
    plt.show()
