# Baseline contact rates

The notebook below walks through a few strategies for parsing contact matrices from the POLYMOD study (Mossong et al. 2008) and the POLYMOD projections (Prem et al. 2017). Most of the explanatory text is under the POLYMOD projections heading, because that is the contact matrix modified for use in the Austin Granular Model.

## Imports and helpers

In [1]:
import pandas as pd
import xarray as xr
from copy import deepcopy
from numpy.testing import assert_array_equal

In [2]:
def split_age_group(long_df, split_grp, newgrp1, newgrp2):
    
    # only works if the age columns are "age1" and "age2"
    assert 'age1' in long_df.columns
    assert 'age2' in long_df.columns

    # separate age groups
    split_grp_df = long_df[(long_df['age1'] == split_grp) | (long_df['age2'] == split_grp)]
    other_groups = long_df[(long_df['age1'] != split_grp) & (long_df['age2'] != split_grp)]
    
    # copy dataframes
    group_1 = deepcopy(split_grp_df)
    group_2 = deepcopy(split_grp_df)

    # reassign age group 1
    group_1['age1'] = [newgrp1 if i == split_grp else i for i in group_1['age1']]
    group_1['age2'] = [newgrp1 if i == split_grp else i for i in group_1['age2']]

    # reassign age group 2
    group_2['age1'] = [newgrp2 if i == split_grp else i for i in group_2['age1']]
    group_2['age2'] = [newgrp2 if i == split_grp else i for i in group_2['age2']]

    # add contacts between group 1 and group 2
    group_1_self = deepcopy(group_1[(group_1['age1'] == newgrp1) & (group_1['age2'] == newgrp1)])
    group_1_self['age2'] = newgrp2

    # add contacts between group 2 and group 1
    group_2_self = deepcopy(group_2[(group_2['age1'] == newgrp2) & (group_2['age2'] == newgrp2)])
    group_2_self['age2'] = newgrp1
    
    # join everything
    regrouped = pd.concat([other_groups, group_1, group_2, group_1_self, group_2_self])
    
    # sanity checks
    age1_counts = regrouped['age1'].value_counts().unique()
    age2_counts = regrouped['age2'].value_counts().unique()
    assert len(age1_counts) == 1
    assert len(age2_counts) == 1
    assert_array_equal(age1_counts, age2_counts)
    
    return regrouped

## POLYMOD

Citation: Mossong J, Hens N, Jit M, Beutels P, Auranen K, Mikolajczyk R, et al. (2008) Social Contacts and Mixing Patterns Relevant to the Spread of Infectious Diseases. PLoS Med 5(3): e74. https://doi.org/10.1371/journal.pmed.0050074

Tables in `epimodels/notebooks/AustinGranularModel/BaselineContacts/Mosson2008PolymodSupplement.xlsx` are copied from Supporting Information Table S5, https://doi.org/10.1371/journal.pmed.0050074.st005 (this link downloads the tables in Microsoft Word `.doc` format).

Tables represent "all contacts" (the Supporting Information also has data on "physical contacts", which are not copied here). The row names indicate the "age of contact" and the column names indicate the "age of participant", where participants were the subjects who kept diaries as part of this study.
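
As a minimal sketch of how this row/column orientation carries over to the long format used later in the notebook, the toy example below melts a made-up 2x2 table. The values are invented for illustration; only the column names (`Unnamed: 0`, `age1`, `age2`, `daily_per_capita_contacts`) follow the notebook.

```python
import pandas as pd

# Toy 2x2 table laid out like one POLYMOD sheet (values are made up):
# rows = age of contact, columns = age of participant.
wide = pd.DataFrame(
    {
        'Unnamed: 0': ['00-04', '05-09'],  # row labels as read from Excel
        '00-04': [1.0, 0.5],
        '05-09': [0.25, 2.0],
    }
)

# Column headers (participant ages) become 'age1';
# the row-label column (contact ages) becomes 'age2'.
long = wide.melt(
    id_vars='Unnamed: 0', var_name='age1', value_name='daily_per_capita_contacts'
).rename(columns={'Unnamed: 0': 'age2'})

# Daily contacts that a 05-09 participant has with people aged 00-04
value = long.loc[
    (long['age1'] == '05-09') & (long['age2'] == '00-04'),
    'daily_per_capita_contacts',
].item()
print(value)
```

In this layout `age1` always refers to the participant and `age2` to the contact.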

There is one sheet for each of the countries surveyed.

In [3]:
polymod = pd.read_excel('/Users/kpierce/epimodels/notebooks/AustinGranularModel/BaselineContacts/Mosson2008PolymodSupplement.xlsx',
                       sheet_name=None, header=0)

In [4]:
polymod['BelgiumAll']

Unnamed: 0.1,Unnamed: 0,00-04,05-09,10-14,15-19,20-24,25-29,30-34,35-39,40-44,45-49,50-54,55-59,60-64,65-69,70+
0,00-04,1.36,0.66,0.43,0.28,0.1,0.44,0.8,0.74,0.18,0.22,0.39,0.46,0.34,0.32,0.11
1,05-09,0.74,3.28,0.78,0.68,0.58,0.09,0.68,0.69,0.21,0.06,0.39,0.23,0.07,0.59,0.15
2,10-14,0.42,0.76,5.6,0.81,0.46,0.07,0.34,1.13,0.89,0.26,0.31,0.13,0.1,0.18,0.3
3,15-19,0.14,0.25,1.34,6.39,2.02,0.35,0.55,0.44,0.92,0.76,0.33,0.27,0.07,0.09,0.33
4,20-24,0.34,0.17,0.7,1.67,4.4,0.91,0.84,0.46,0.55,0.8,0.93,0.71,0.27,0.23,0.07
5,25-29,1.08,0.61,0.28,0.72,1.77,2.28,1.16,1.15,0.89,1.08,1.3,1.13,0.54,0.27,0.3
6,30-34,1.46,1.29,0.57,0.37,1.29,1.37,2.07,1.46,0.58,1.06,0.74,1.77,1.2,0.41,0.3
7,35-39,0.77,1.2,1.15,0.73,0.56,1.09,2.18,1.67,1.29,1.46,0.83,1.27,0.76,0.91,0.52
8,40-44,0.38,0.82,1.15,1.15,0.96,0.58,1.57,1.64,1.42,1.0,1.09,1.37,1.07,0.73,0.56
9,45-49,0.26,0.41,0.9,1.27,1.17,1.05,0.77,0.82,1.42,1.98,1.13,1.06,0.83,0.27,0.93


In [5]:
polymod_tables = []
for key, value in polymod.items():
    long = value.melt(id_vars='Unnamed: 0', var_name='age1', value_name='daily_per_capita_contacts')
    long = long.rename(columns={'Unnamed: 0': 'age2'})
    long['country'] = key
    polymod_tables.append(long)

In [6]:
polymod_long = pd.concat(polymod_tables)

In [7]:
polymod_long.head()

Unnamed: 0,age2,age1,daily_per_capita_contacts,country
0,00-04,00-04,1.36,BelgiumAll
1,05-09,00-04,0.74,BelgiumAll
2,10-14,00-04,0.42,BelgiumAll
3,15-19,00-04,0.14,BelgiumAll
4,20-24,00-04,0.34,BelgiumAll


In [8]:
polymod_long_regrouped = split_age_group(long_df=polymod_long, split_grp='15-19', newgrp1='15-17', newgrp2='18-19')


With the data groupings adjusted, we can convert the full dataset into an `xarray` and save as a `zarr` file.

In [9]:
polymod_long = polymod_long.set_index(['age1', 'age2', 'country'])
polymod_xr = polymod_long.to_xarray()
polymod_xr.to_zarr('/Users/kpierce/epimodels/notebooks/AustinGranularModel/BaselineContacts/polymod_mossong2008.zarr')

<xarray.backends.zarr.ZarrStore at 0x13a9a8f90>

In [10]:
polymod_long_regrouped = polymod_long_regrouped.set_index(['age1', 'age2', 'country'])
polymod_long_regrouped_xr = polymod_long_regrouped.to_xarray()
polymod_long_regrouped_xr.to_zarr('/Users/kpierce/epimodels/notebooks/AustinGranularModel/BaselineContacts/polymod_mossong2008_regrouped.zarr')


<xarray.backends.zarr.ZarrStore at 0x13a9a8f20>

We can also do some aggregations across dimensions of the `xarray` (demonstration only, not saved).

In [11]:
polymod_mean_xr = polymod_xr.mean(dim='country')

In [12]:
polymod_mean_xr
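
The conversion and aggregation steps can be sketched on toy data as follows. The two "countries" and all values are made up, and `DataFrame.to_xarray` assumes the `xarray` package is installed.

```python
import pandas as pd

# Made-up long table with two countries and two contact age groups
long = pd.DataFrame({
    'age1': ['00-04', '00-04', '00-04', '00-04'],
    'age2': ['00-04', '05-09', '00-04', '05-09'],
    'country': ['A', 'A', 'B', 'B'],
    'daily_per_capita_contacts': [1.0, 2.0, 3.0, 4.0],
})

# Each index level becomes a dimension of the resulting Dataset
ds = long.set_index(['age1', 'age2', 'country']).to_xarray()

# Averaging over 'country' leaves an (age1, age2) matrix
mean_ds = ds.mean(dim='country')
v = float(mean_ds['daily_per_capita_contacts'].sel(age1='00-04', age2='05-09'))
print(v)
```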

## POLYMOD projections

Citation: Prem K, Cook AR, Jit M (2017) Projecting social contact matrices in 152 countries using contact surveys and demographic data. PLoS Comput Biol 13(9): e1005697. https://doi.org/10.1371/journal.pcbi.1005697

Prem et al. (2017) extend the POLYMOD study with statistical demographic models to estimate contact matrices for 152 countries (using the same age grouping).

Tables in `epimodels/notebooks/AustinGranularModel/BaselineContacts/contact_matrices_152_countries` are copied from Supporting Information S1 Dataset, https://doi.org/10.1371/journal.pcbi.1005697.s002 (this link downloads the directory in `.zip` format). Files with the suffix `_1` contain sheets for countries "Albania" through "Morocco" and those with the suffix `_2` contain countries "Mozambique" through "Zimbabwe" (alphabetically, in English).
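
Because the split is alphabetical, the workbook holding a given country's sheet can be picked with a string comparison. The helper below is hypothetical (it is not part of the notebook). It assumes the `MUestimates_all_locations_{suffix}.xlsx` naming pattern seen in the next cell and that country names are spelled as on the sheet tabs.

```python
def workbook_for(country):
    """Hypothetical helper: name of the workbook expected to hold `country`."""
    # Sheets up to and including "Morocco" are in file _1, the rest in file _2
    suffix = 1 if country <= 'Morocco' else 2
    return f'MUestimates_all_locations_{suffix}.xlsx'


print(workbook_for('United States of America'))
```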

In [15]:
polymod_usa = pd.read_excel(
    '/Users/kpierce/epimodels/notebooks/AustinGranularModel/BaselineContacts/contact_matrices_152_countries/MUestimates_all_locations_2.xlsx',
    sheet_name="United States of America", header=None)


  warn("Workbook contains no default style, apply openpyxl's default")


In [16]:
polymod_usa

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
0,2.598237,1.101286,0.499396,0.315998,0.411961,0.715457,1.057365,0.988814,0.497488,0.322391,0.336978,0.26598,0.173599,0.135957,0.073886,0.038853
1,0.989686,5.386372,1.224101,0.347044,0.193902,0.509275,0.892744,1.068942,0.846197,0.347779,0.241652,0.197511,0.170506,0.12128,0.051215,0.039004
2,0.304842,1.888934,8.284524,0.973481,0.347356,0.301111,0.537831,0.869836,1.047914,0.564866,0.319458,0.16629,0.104552,0.098895,0.063186,0.052481
3,0.173684,0.432369,3.06756,11.106139,1.599837,0.783971,0.636821,0.892186,1.12534,1.009137,0.533794,0.23687,0.09507,0.067864,0.033226,0.021371
4,0.262619,0.21439,0.329929,2.645996,4.257321,1.742612,1.146007,1.040654,0.915094,1.079188,0.738022,0.412893,0.121239,0.052891,0.04983,0.040059
5,0.579676,0.34356,0.209946,0.882349,2.024331,3.414656,1.729206,1.333241,1.128603,0.952613,0.905713,0.498146,0.162274,0.055466,0.02694,0.018639
6,0.685994,0.925715,0.70581,0.511232,0.993349,1.651464,2.724437,1.697023,1.31631,1.026176,0.798953,0.542883,0.215146,0.088143,0.040122,0.037413
7,0.673234,1.087263,0.907901,0.762107,0.712048,1.280754,1.57267,2.780799,1.83112,1.187005,0.862649,0.468816,0.253787,0.147221,0.074796,0.029849
8,0.344231,0.743115,1.012538,1.172505,0.867901,1.128282,1.445997,1.662298,2.523069,1.465606,1.041925,0.393184,0.213727,0.11748,0.074063,0.032296
9,0.369636,0.561619,0.752136,1.656516,0.894872,0.95449,1.148589,1.325174,1.421437,2.031986,1.079285,0.497942,0.183696,0.092167,0.074005,0.066027


The Prem et al. projections present 16 age groups (versus 15 in the POLYMOD study); figures in the manuscript indicate the last age group is a discrete 75-79 years of age (versus an open-ended 70+ in the POLYMOD tables above). Though not explicitly stated, the structure of the figures corresponding to these data suggests that columns indicate the age group of the "participant" (in POLYMOD terms) and rows indicate the age group of the "contact".

In [23]:
age_groups = [
    '00-04', '05-09', '10-14', '15-19', '20-24', '25-29', '30-34', '35-39',
    '40-44', '45-49', '50-54', '55-59', '60-64', '65-69', '70-74', '75-79'
]
polymod_usa.columns = age_groups
polymod_usa['age2'] = age_groups

In [24]:
polymod_usa.head()

Unnamed: 0,00-04,05-09,10-14,15-19,20-24,25-29,30-34,35-39,40-44,45-49,50-54,55-59,60-64,65-69,70-74,75-79,age2
0,2.598237,1.101286,0.499396,0.315998,0.411961,0.715457,1.057365,0.988814,0.497488,0.322391,0.336978,0.26598,0.173599,0.135957,0.073886,0.038853,00-04
1,0.989686,5.386372,1.224101,0.347044,0.193902,0.509275,0.892744,1.068942,0.846197,0.347779,0.241652,0.197511,0.170506,0.12128,0.051215,0.039004,05-09
2,0.304842,1.888934,8.284524,0.973481,0.347356,0.301111,0.537831,0.869836,1.047914,0.564866,0.319458,0.16629,0.104552,0.098895,0.063186,0.052481,10-14
3,0.173684,0.432369,3.06756,11.106139,1.599837,0.783971,0.636821,0.892186,1.12534,1.009137,0.533794,0.23687,0.09507,0.067864,0.033226,0.021371,15-19
4,0.262619,0.21439,0.329929,2.645996,4.257321,1.742612,1.146007,1.040654,0.915094,1.079188,0.738022,0.412893,0.121239,0.052891,0.04983,0.040059,20-24


In [25]:
usa_long = polymod_usa.melt(id_vars='age2', var_name='age1', value_name='daily_per_capita_contacts')

In [26]:
usa_long.head()

Unnamed: 0,age2,age1,daily_per_capita_contacts
0,00-04,00-04,2.598237
1,05-09,00-04,0.989686
2,10-14,00-04,0.304842
3,15-19,00-04,0.173684
4,20-24,00-04,0.262619


The US Census Bureau age groups are slightly different for ages 15-24. US high school students are typically 18 years old or younger, so the POLYMOD 15-19 year and 20-24 year baseline contacts need to be adjusted to match US Census Bureau ranges and school age range expectations.

We'll make the following new age groups:

- 0-4
- 5-9
- 10-14
- 15-17
- 18-49
- 50-64
- 65+

To accomplish this, we first need to split the 15-19 year age group into 15-17 and 18-19 year age groups.

- assume that 15-17 year per capita contacts (across all contact age groups) are the same as 15-19 year per capita contacts.
- assume that 18-19 year per capita contacts (across all contact age groups) are the same as 15-19 year per capita contacts.
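
Under these two assumptions, splitting amounts to duplicating the 15-19 row and column under the two new labels. The sketch below shows the idea on a made-up 2x2 wide matrix. It illustrates the assumption and is not the `split_age_group` helper itself, which works on the long format.

```python
import pandas as pd

# Toy contact matrix (made-up values): rows = contact age, columns = participant age
groups = ['10-14', '15-19']
wide = pd.DataFrame([[5.0, 1.0], [2.0, 6.0]], index=groups, columns=groups)

# Map each new label back to the original group it inherits its rates from
new_groups = ['10-14', '15-17', '18-19']
source = {'10-14': '10-14', '15-17': '15-19', '18-19': '15-19'}
lookup = [source[g] for g in new_groups]

# Selecting '15-19' twice duplicates its row and column; then relabel
split = wide.loc[lookup, lookup]
split.index = new_groups
split.columns = new_groups
print(split)
```

The 15-17 and 18-19 entries, including the contacts between the two new groups, all carry the original 15-19 values.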

In [27]:
usa_regrouped = split_age_group(long_df=usa_long, split_grp='15-19', newgrp1='15-17', newgrp2='18-19')


In [29]:
usa_regrouped.tail()

Unnamed: 0,age2,age1,daily_per_capita_contacts
211,18-19,65-69,0.067864
227,18-19,70-74,0.033226
243,18-19,75-79,0.021371
51,18-19,15-17,11.106139
51,15-17,18-19,11.106139


To re-aggregate, take the population-weighted average of the daily per-capita contacts. A single person in age group *i* will have $\frac{\sum_{j} x_{j} N_{j}}{\sum_{j} N_{j}}$ contacts across all age sub-groups *j*, where $x_{j}$ is the daily per capita contact rate for age sub-group *j* and $N_{j}$ is the population size of age sub-group *j*.
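
As a small worked example of this formula, with made-up rates and population sizes for two sub-groups:

```python
import numpy as np

# Hypothetical sub-groups j being merged into one group (made-up numbers)
x = np.array([2.0, 4.0])      # daily per-capita contact rate x_j
N = np.array([300.0, 100.0])  # population size N_j

# sum_j(x_j * N_j) / sum_j(N_j)
weighted = (x * N).sum() / N.sum()

# np.average with weights evaluates the same expression
same = np.isclose(weighted, np.average(x, weights=N))
print(weighted, same)
```

Here the larger sub-group pulls the result (2.5) below the unweighted mean of the two rates (3.0).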

US total population data from National Population by Characteristics: 2010-2019; table download link https://www2.census.gov/programs-surveys/popest/tables/2010-2019/national/asrh/nc-est2019-agesex.xlsx

Citation: US Census Bureau, “National Population by Characteristics: 2010-2019,” Census.gov. https://www.census.gov/data/tables/time-series/demo/popest/2010s-national-detail.html (accessed Mar. 15, 2022).

In [55]:
us_pop_total = pd.read_excel(
    '/Users/kpierce/epimodels/notebooks/AustinGranularModel/BaselineContacts/nc-est2019-syasexn.xlsx',
    skiprows=[0, 1, 2], nrows=1
)
us_pop_total = us_pop_total.rename(columns={'Unnamed: 0': 'age'})

In [56]:
us_pop_total

Unnamed: 0,age,Census,Estimates Base,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019
0,Total\nPopulation,308745538,308758105,309321666,311556874,313830990,315993715,318301008,320635163,322941311,324985539,326687501,328239523


In [83]:
us_total_pop_2018 = us_pop_total[2018].values.item()

In [48]:
us_pop_age = pd.read_excel(
    '/Users/kpierce/epimodels/notebooks/AustinGranularModel/BaselineContacts/nc-est2019-syasexn.xlsx',
    skiprows=[0, 1, 2, 4], nrows=101
)
us_pop_age = us_pop_age.rename(columns={'Unnamed: 0': 'age'})

In [68]:
us_pop_age['age'] = [i.split('.')[1] for i in us_pop_age['age']]
us_pop_age['age'] = [i.split('+')[0] for i in us_pop_age['age']]

In [70]:
us_pop_age['age'] = us_pop_age['age'].astype(int)

In [71]:
us_pop_age.tail()

Unnamed: 0,age,Census,Estimates Base,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019
96,96,95223,95288,97259,101295,105060,108285,120427,122315,136011,147449,151823,157463
97,97,68138,68168,68966,73267,76840,79369,82948,92078,94732,104068,113716,116969
98,98,45900,45938,47086,50654,54192,56508,59546,61585,69464,71571,77943,86150
99,99,32266,32289,32214,33604,36514,38797,41277,43276,45030,50969,53184,57124
100,100,53364,53412,54437,57513,61035,64898,70685,75449,81199,85663,93038,100322


The chain below bins each single-year age independently, so the row order of the dataframe does not matter; what matters is that the thresholds are tested in ascending order:

In [72]:
age_groups = []
for i in us_pop_age['age']:
    if i < 5:
        age_groups.append('00-04')
    elif i < 10:
        age_groups.append('05-09')
    elif i < 15:
        age_groups.append('10-14')
    elif i < 18:
        age_groups.append('15-17')
    elif i < 50:
        age_groups.append('18-49')
    elif i < 65:
        age_groups.append('50-64')
    else:
        age_groups.append('65+')
us_pop_age['age_group'] = age_groups
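
An equivalent binning can be sketched with `pd.cut`, which makes the bin edges explicit. The ages below are a hand-picked sample at the group boundaries, not the census table.

```python
import numpy as np
import pandas as pd

# Sample ages at the edges of each group
ages = pd.Series([0, 4, 5, 14, 15, 17, 18, 49, 50, 64, 65, 100])

# Left-closed bins: [0, 5), [5, 10), [10, 15), [15, 18), [18, 50), [50, 65), [65, inf)
edges = [0, 5, 10, 15, 18, 50, 65, np.inf]
labels = ['00-04', '05-09', '10-14', '15-17', '18-49', '50-64', '65+']
age_group = pd.cut(ages, bins=edges, right=False, labels=labels).astype(str)
print(age_group.tolist())
```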

In [76]:
us_pop_age_grouped = us_pop_age.groupby('age_group')[2018].sum().reset_index()

In [85]:
us_pop_age_grouped['percent_of_total'] = us_pop_age_grouped[2018] / us_total_pop_2018

In [86]:
us_pop_age_grouped

Unnamed: 0,age_group,2018,percent_of_total
0,00-04,19762962,0.060495
1,05-09,20188285,0.061797
2,10-14,20868629,0.063879
3,15-17,12499269,0.038261
4,18-49,137915587,0.422164
5,50-64,63083430,0.1931
6,65+,52369339,0.160304


In [87]:
us_pop_age_grouped['percent_of_total'].sum()

1.0

Now we can take the population-weighted average of the POLYMOD projections for the USA.

In [30]:
revised_groups = {
    '00-04': '00-04',
    '05-09': '05-09',
    '10-14': '10-14',
    '15-17': '15-17',
    '18-19': '18-49',
    '20-24': '18-49',
    '25-29': '18-49',
    '30-34': '18-49',
    '35-39': '18-49',
    '40-44': '18-49',
    '45-49': '18-49',
    '50-54': '50-64',
    '55-59': '50-64',
    '60-64': '50-64',
    '65-69': '65+',
    '70-74': '65+',
    '75-79': '65+'
}

In [89]:
usa_regrouped['age1_group'] = [revised_groups[i] for i in usa_regrouped['age1']]
usa_regrouped['age2_group'] = [revised_groups[i] for i in usa_regrouped['age2']]

In [91]:
usa_regrouped_weighted = pd.merge(
    usa_regrouped, us_pop_age_grouped, left_on='age2_group', right_on='age_group', how='left'
)

In [93]:
usa_regrouped_weighted['numerator'] = usa_regrouped_weighted['daily_per_capita_contacts'] * usa_regrouped_weighted['percent_of_total']


In [98]:
usa_weighted = usa_regrouped_weighted.groupby(
    ['age1_group', 'age2_group']
).sum(
    numeric_only=True  # sum numeric columns only; the string age labels are dropped
).reset_index()

In [100]:
usa_weighted['weighted_daily_per_capita_contacts'] = usa_weighted['numerator'] / usa_weighted['percent_of_total']

In [102]:
usa_weighted.head(20)

Unnamed: 0,age1_group,age2_group,daily_per_capita_contacts,2018,percent_of_total,numerator,weighted_daily_per_capita_contacts
0,00-04,00-04,2.598237,19762962,0.060495,0.15718,2.598237
1,00-04,05-09,0.989686,20188285,0.061797,0.06116,0.989686
2,00-04,10-14,0.304842,20868629,0.063879,0.019473,0.304842
3,00-04,15-17,0.173684,12499269,0.038261,0.006645,0.173684
4,00-04,18-49,3.089074,965409109,2.955146,1.304095,0.441296
5,00-04,50-64,1.291459,189250290,0.579301,0.249381,0.430486
6,00-04,65+,0.591866,157108017,0.480912,0.094879,0.197289
7,05-09,00-04,1.101286,19762962,0.060495,0.066622,1.101286
8,05-09,05-09,5.386372,20188285,0.061797,0.332861,5.386372
9,05-09,10-14,1.888934,20868629,0.063879,0.120664,1.888934
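
In the table above, `percent_of_total` for the 00-04 / 18-49 row is about seven times the 18-49 share (7 × 0.422164), because the merge attaches the same group-level share to each of the seven sub-groups. With identical weights inside a group, the weighted average equals the simple mean of the sub-group rates (3.089074 / 7 ≈ 0.441296). If weighting by sub-group population is wanted instead, one possible approach is to merge on the fine-grained `age2` labels. The sketch below uses made-up rates and a hypothetical `sub_pop` table.

```python
import pandas as pd

# Made-up long table: one participant group, two contact sub-groups
# that both map into the same coarse group
contacts = pd.DataFrame({
    'age1_group': ['00-04', '00-04'],
    'age2': ['18-19', '20-24'],
    'age2_group': ['18-49', '18-49'],
    'daily_per_capita_contacts': [2.0, 4.0],
})

# Hypothetical sub-group populations, keyed on the fine-grained label
sub_pop = pd.DataFrame({'age2': ['18-19', '20-24'], 'population': [300.0, 100.0]})

merged = contacts.merge(sub_pop, on='age2', how='left')
merged['numerator'] = merged['daily_per_capita_contacts'] * merged['population']

keys = ['age1_group', 'age2_group']
sums = merged.groupby(keys)[['numerator', 'population']].sum()
sums['weighted'] = sums['numerator'] / sums['population']
# Simple mean for comparison (what equal weights would give)
sums['simple_mean'] = merged.groupby(keys)['daily_per_capita_contacts'].mean()
print(sums[['weighted', 'simple_mean']])
```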


In [103]:
usa_weighted_xr = usa_weighted[['age1_group', 'age2_group', 'weighted_daily_per_capita_contacts']].set_index(['age1_group', 'age2_group']).to_xarray()

In [105]:
usa_weighted_xr

In [106]:
usa_weighted_xr.to_zarr('/Users/kpierce/epimodels/notebooks/AustinGranularModel/BaselineContacts/usa_baseline_contacts.zarr')


<xarray.backends.zarr.ZarrStore at 0x13c1fedd0>