# Analysis - Questions 1 and 2, simple version

**Author:** Alan Meeson <alan.meeson@capgemini.com>

**Date:** 2023-02-11

This notebook covers a simple-ish analysis of covid variant evolution data to address questions 1 and 2 from the instructions:
- Question 1: "For each continent, which are the five countries that are statistically hit earliest by new variants?"
- Question 2: "For these countries which are the five countries on and the five off the respective continent, that serves as predictors for incoming variants?"

The approach here can be described at a high level as:
1. Apply the Schulze ranking approach to identify the five-ish countries on each continent which are hit earliest.
2. For each country *C* in the sets of top 5 countries as defined above:
    a. calculate a new top 5 ranking over the other countries in the same continent, but filter the data to only consider variants where country *C* did not have the winning rank
    b. calculate a new top 5 ranking over all countries *not* in the same continent, also filtering to exclude variants where country *C* would have the winning rank.
    
The general idea is to rank using only the scenarios and variants where it is plausible for another country to have transmitted the variant to country *C*.
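
As a rough illustration of the Schulze ranking used in step 1, the sketch below combines per-variant ranks into a single ordering. It is not the code in `early_hit_ranking`. The function name `schulze_rank_sketch`, the toy data, and the handling of missing values (a variant only counts for a pair of locations when both recorded it) are assumptions made for this example.

```python
import numpy as np
import pandas as pd


def schulze_rank_sketch(rank_matrix: pd.DataFrame) -> pd.Series:
    """Hypothetical helper: rows are variants, columns are locations, lower rank = hit earlier."""
    ranks = rank_matrix.to_numpy(dtype=float)
    n = ranks.shape[1]

    # d[i, j]: number of variants in which location i was hit strictly earlier than location j.
    # Comparisons with NaN are False, so a variant missing in either location is ignored.
    d = (ranks[:, :, None] < ranks[:, None, :]).sum(axis=0)

    # p[i, j]: strength of the strongest path from i to j (widest-path Floyd-Warshall).
    p = np.where(d > d.T, d, 0)
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if i != j and i != k and j != k:
                    p[i, j] = max(p[i, j], min(p[i, k], p[k, j]))

    # Location i beats location j when its strongest path to j is stronger than the reverse.
    wins = (p > p.T).sum(axis=1)
    combined = pd.Series(wins, index=rank_matrix.columns, name='combined_rank')
    return combined.rank(ascending=False, method='min').astype(int)


# Toy per-variant ranks (illustrative values only, not real data).
toy_ranks = pd.DataFrame(
    {'A': [1, 1, 2, np.nan], 'B': [2, 3, 1, 1], 'C': [3, 2, 3, 2]},
    index=['Alpha', 'Beta', 'Gamma', 'Delta'],
)
combined_rank = schulze_rank_sketch(toy_ranks)
print(combined_rank)
```

In this sketch tied locations share a rank, so a `<= 5` cut-off may select more than five locations.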

This notebook requires the following data:
- country_continent_mapping.parquet, as produced by notebook 005a-data-cleaning-country-continent-mapping.ipynb
- covid_variants_evolution.parquet, as produced by notebook 004-data-cleaning-evolution.ipynb

This notebook produces the following data:
- location_variant_outbreak.parquet, an additional file showing the per-variant ranking of countries in order of how early they caught each variant.
- location_outbreak.parquet, the answer to question 1: a ranking of the countries in terms of how early they tend to catch variants.
- predictors.parquet, the answer to question 2: the sets of predictors, on and off continent, for each of the top five earliest locations in each continent.

In [None]:
import os
import sys
import pandas as pd
import numpy as np
import early_hit_ranking as ehr

In [None]:
# Data generated by notebooks 004-data-cleaning-evolution and 005a-data-cleaning-country-continent-mapping
data_dir = '../data'
country_filename = os.path.join(data_dir, 'cleaned', 'country_indicators.parquet')
evolution_filename = os.path.join(data_dir, 'cleaned', 'covid_variants_evolution.parquet')
location_variant_outbreak_rank_filename = os.path.join(data_dir, 'analysis', 'location_variant_outbreak.parquet')
location_outbreak_rank_filename = os.path.join(data_dir, 'analysis', 'location_outbreak.parquet')
predictors_filename = os.path.join(data_dir, 'analysis', 'predictors.parquet')

## Load and prepare data

In [None]:
countries_df = pd.read_parquet(country_filename)
country_continent_mapping = countries_df[['continent', 'location']].drop_duplicates()
evolution_df = pd.read_parquet(evolution_filename)
combined_df = evolution_df.merge(country_continent_mapping, on='location', how='left')

## Calculate the three key tables

### Calculate the per-country variant outbreak table

Including columns for:
- outbreak_date: the date of the first occurrence of a variant in a given location
- outbreak_rank_continent: the rank (i.e. order) in which the countries in the continent had their first detected cases of each specific variant
- outbreak_rank_global: the rank (i.e. order) in which the countries globally had their first detected cases of each specific variant
- schulze_onset_rank_continent: Schulze-method combination of the per-variant ranks into a single _how early does this country get hit by new variants_ score, ranked within the continent
- schulze_onset_rank_global: Schulze-method combination of the per-variant ranks into a single _how early does this country get hit by new variants_ score, ranked globally (currently commented out in the code)
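
The first-outbreak date and the two per-variant ranks described above can be illustrated on a toy table. The values below are made up for illustration only; the column names follow the ones used in this notebook, and `toy_df` and `first_seen` are names introduced just for this sketch.

```python
import pandas as pd

# Toy sequencing counts (illustrative values only, not real data).
toy_df = pd.DataFrame({
    'continent': ['Europe', 'Europe', 'Europe', 'Asia', 'Asia'],
    'location': ['France', 'France', 'Spain', 'Japan', 'Japan'],
    'variant': ['Alpha'] * 5,
    'date': pd.to_datetime(
        ['2021-01-04', '2021-01-18', '2021-01-18', '2021-01-04', '2021-02-01']
    ),
    'num_sequences': [0, 3, 5, 2, 7],
})

# First date with at least one sequence, per location and variant.
first_seen = (
    toy_df.loc[toy_df['num_sequences'] > 0]
    .groupby(['continent', 'location', 'variant'])['date']
    .min()
    .rename('outbreak_date')
    .reset_index()
)

# Rank the first-outbreak dates within each continent and globally.
# method='min' gives tied dates the same (lowest) rank.
first_seen['outbreak_rank_continent'] = (
    first_seen.groupby(['continent', 'variant'])['outbreak_date'].rank(method='min')
)
first_seen['outbreak_rank_global'] = (
    first_seen.groupby('variant')['outbreak_date'].rank(method='min')
)
print(first_seen)
```

France's row with zero sequences is ignored, so France and Spain share the same first-outbreak date and therefore the same rank.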

In [None]:
# calculate the first outbreak dates
variant_outbreaks_df = combined_df.loc[
    combined_df['num_sequences'] > 0, 
    ['continent', 'location','variant', 'date']
].groupby(
    ['continent', 'location','variant']
).min().rename(columns={'date': 'outbreak_date'})

In [None]:
# Create the LocationVariantOnset table
location_variant_outbreak_df = variant_outbreaks_df
location_variant_outbreak_df['outbreak_rank_continent'] = location_variant_outbreak_df['outbreak_date'].groupby(['continent', 'variant']).rank(method='min')
location_variant_outbreak_df['outbreak_rank_global'] = location_variant_outbreak_df['outbreak_date'].groupby(['variant']).rank(method='min')

In [None]:
location_variant_outbreak_df.head()

In [None]:
# If the output path does not yet exist, create it
os.makedirs(os.path.dirname(location_variant_outbreak_rank_filename), exist_ok=True)
    
location_variant_outbreak_df.to_parquet(location_variant_outbreak_rank_filename)

### Calculate the combined outbreak rankings

In [None]:
location_variant_outbreak_df = location_variant_outbreak_df.reset_index()
per_continent_schulze_outbreak_rank = location_variant_outbreak_df.groupby('continent').apply(
    lambda x: ehr.calculate_combined_rank(
        x.pivot(index=['variant'], columns=['location'], values='outbreak_rank_global').drop('non_who')
    )
)
location_outbreak_df = per_continent_schulze_outbreak_rank.reset_index('continent', name='schulze_onset_rank_continent')
location_outbreak_df['earliest_5_in_continent'] = location_outbreak_df['schulze_onset_rank_continent'] <= 5
location_outbreak_df.head()

In [None]:
# This bit would add the global ranking, but we don't want to confuse things, so we just reset the index ready to save
# global_outbreak_rank_matrix = location_variant_outbreak_df.pivot(index=['variant'], columns=['location'], values='outbreak_rank_global').drop('non_who')
# location_outbreak_df['schulze_onset_rank_global'] = ehr.calculate_combined_rank(global_outbreak_rank_matrix)
# location_outbreak_df['earliest_5_in_world'] = location_outbreak_df['schulze_onset_rank_global'] <= 5

location_outbreak_df = location_outbreak_df.reset_index()
location_outbreak_df.head()

In [None]:
os.makedirs(os.path.dirname(location_outbreak_rank_filename), exist_ok=True)
    
location_outbreak_df.to_parquet(location_outbreak_rank_filename)

### Calculate the Predictor locations

Note: with thanks to Jalal Alizadeh for his assistance.

In [None]:
# First, identify the top five earliest outbreak locations within each continent
top_five_per_continent = location_outbreak_df.loc[location_outbreak_df['earliest_5_in_continent'], ['continent', 'location']]
top_five_per_continent.head()

In [None]:
def calculate_predictors(global_rank_matrix_df: pd.DataFrame, target_country: str, target_continent: str) -> pd.DataFrame:
    """Identifies the five on-continent and the five off-continent predictors for incoming variants.
    
    A predictor is defined in this context as a country which could have transmitted a variant to the target country.
    That is, it must have caught at least one variant before the target country.
    Where there are multiple possible predictors they are ranked by their Schulze onset ranking as calculated in the
    set of onsets which are a) before the target country, and b) calculated excluding the target country (and its continent for off-continent predictors).
    
    Args:
        global_rank_matrix_df - each row is a location, each column a variant, values are outbreak_rank_global
            Note: rows must be indexed by continent and location
            Note: non_who variants must have been dropped
        target_country - the name of the country we are identifying predictors for
        target_continent - the name of the continent in which we find the target_country
        
    Returns:
        A pandas dataframe containing only the five on and five off continent predictor countries.
            predictor_location - the location of the predictor
            predictor_continent - the continent of the predictor
            predictor_rank - the rank of the predictor, either within the continent if `on_same_continent` is true, or off continent if it is false.
            on_same_continent - true if the predictor is on the same continent as the target_country.
            is_in_same_continent_top_5_predictors - true if it is one of the five predictors from the same continent
            is_in_off_continent_top_5_predictors - true if it is one of the five predictors from other continents
    """
    target_country_series = global_rank_matrix_df.loc[(target_continent, target_country), :]

    on_continent_matrix = global_rank_matrix_df.loc[target_continent,:].drop(target_country, axis=0)
    on_continent_matrix[on_continent_matrix > target_country_series] = np.nan
    on_continent_rank = ehr.calculate_combined_rank(on_continent_matrix.transpose()).reset_index(name='predictor_rank')
    on_continent_rank['continent'] = target_continent
    on_continent_rank['on_same_continent'] = True
    on_continent_rank['is_in_same_continent_top_5_predictors'] = on_continent_rank['predictor_rank'] <= 5
    on_continent_rank['is_in_off_continent_top_5_predictors'] = False

    off_continent_matrix = global_rank_matrix_df.drop(target_continent, axis=0)
    off_continent_matrix[off_continent_matrix > target_country_series] = np.nan
    off_continent_rank = ehr.calculate_combined_rank(off_continent_matrix.transpose()).reset_index(name='predictor_rank')
    off_continent_rank['on_same_continent'] = False
    off_continent_rank['is_in_same_continent_top_5_predictors'] = False
    off_continent_rank['is_in_off_continent_top_5_predictors'] = off_continent_rank['predictor_rank'] <= 5

    predictor_rank_df = pd.concat([on_continent_rank, off_continent_rank], axis=0)
    predictor_rank_df = predictor_rank_df.loc[predictor_rank_df['is_in_same_continent_top_5_predictors'] | predictor_rank_df['is_in_off_continent_top_5_predictors']]
    
    predictor_rank_df = predictor_rank_df.rename({
        'location': 'predictor_location',
        'continent': 'predictor_continent'
    }, axis=1 )
    
    return predictor_rank_df    
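
The filtering step inside `calculate_predictors` can be illustrated on a toy rank matrix. The values below are invented for illustration, and `toy_long`, `toy_matrix`, `target` and `others` are names introduced only for this sketch.

```python
import numpy as np
import pandas as pd

# Toy long-format table of global outbreak ranks (illustrative values only).
toy_long = pd.DataFrame({
    'continent': ['Europe', 'Europe', 'Europe', 'Europe', 'Asia', 'Asia'],
    'location': ['France', 'France', 'Spain', 'Spain', 'Japan', 'Japan'],
    'variant': ['Alpha', 'Beta'] * 3,
    'outbreak_rank_global': [2.0, 1.0, 1.0, 3.0, 3.0, 2.0],
})

# Rows indexed by (continent, location), one column per variant.
toy_matrix = toy_long.pivot(
    index=['continent', 'location'], columns='variant', values='outbreak_rank_global'
)

# Ranks of the target country, indexed by variant.
target = toy_matrix.loc[('Europe', 'France'), :]

# Drop the target, then blank out every entry where the other country
# was hit later than the target for that variant.
others = toy_matrix.drop(('Europe', 'France'), axis=0)
others[others > target] = np.nan
print(others)
```

Only Spain's Alpha entry survives, because Spain recorded Alpha before France. Since the comparison is `>`, entries that tie with the target's rank are kept rather than blanked.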

Note: the next step takes a while to run (about 3 minutes).

In [None]:
global_rank_matrix_df = location_variant_outbreak_df.pivot(index=['continent', 'location'], columns=['variant'], values='outbreak_rank_global').drop('non_who', axis=1)

results = []
for index, row in top_five_per_continent.iterrows():
    predictors_df = calculate_predictors(global_rank_matrix_df, row['location'], row['continent'])
    predictors_df['continent'] = row['continent']
    predictors_df['location'] = row['location']
    results.append(predictors_df)

results_df = pd.concat(results, axis=0)

In [None]:
results_df.head()

In [None]:
os.makedirs(os.path.dirname(predictors_filename), exist_ok=True)
    
results_df.to_parquet(predictors_filename)