# Data loading and parsing

In this notebook we load the data from a specific simulation and parse it for use in the subsequent analysis notebooks.

To parse the data correctly, we need to know the simulation ID from the database, the ID of the Zenodo repository where the data has been deposited, and the time zone, start time, and end time of the simulation.

We use the [pickle module](https://docs.python.org/3/library/pickle.html) to serialize the objects we create in this notebook from the parsed data and save them in binary format to easily load them back into the analysis notebooks.
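As a minimal sketch of this save-and-reload pattern, the cell below round-trips a toy dictionary through a pickle file. The dictionary contents and the file name `example.pickle` are hypothetical stand-ins for the objects created later in this notebook.

```python
import pickle
import tempfile
from os import path

# Toy object standing in for the parsed data (hypothetical values)
parsed = {"sim_id": 165, "nodes": [0, 1, 2], "edges": [(0, 1), (1, 2)]}

with tempfile.TemporaryDirectory() as tmp_dir:
    fn = path.join(tmp_dir, "example.pickle")

    # Serialize the object in binary format
    with open(fn, "wb") as f:
        pickle.dump(parsed, f)

    # Load it back, as an analysis notebook would
    with open(fn, "rb") as f:
        restored = pickle.load(f)

print(restored == parsed)  # True
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.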

In [1]:
import os
from os import path
from datetime import datetime
import pickle
import pytz
import requests

import pandas as pd
import numpy as np

import networkx as nx

# Utility code is saved in separate python files to shorten the notebook
from utils.parsing import get_contact_list, get_infection_list, get_node_state, hour_rounder, create_contact_network, remove_nodes_with_less_edges
from utils.parsing import get_score_events, get_info_count

## Input files

The input files from the simulation should be stored in the data subfolder. Once the OO-WKU23 Zenodo repository is publicly available, the cell after the one below can be used to download the files directly from Zenodo.

In [2]:
data_folder = "./data"
output_folder = "./output"
if not path.exists(data_folder):
    os.makedirs(data_folder)
if not path.exists(output_folder):
    os.makedirs(output_folder)

# Simulation ID in the database
sim_id = 165

# ID of the data on Zenodo
zenodo_id = '10674401'

# Print warning messages to the console when parsing data
print_data_warnings = False

# Discard transmissions when the infected node was already infected before
discard_reinfections = True

# Default contact time for transmissions that are missing an associated contact event
def_contact_time = 10

# Time delta for plots: set in minutes, then converted to seconds
time_step_min = 30
time_delta_sec = 60 * time_step_min

# Time zone of the simulation, starting and ending time
sim_tz = "Asia/Shanghai"
time0 = 'Nov 20 2023 9:00AM'
time1 = 'Dec 4 2023 12:00PM'

# https://howchoo.com/g/ywi5m2vkodk/working-with-datetime-objects-and-timezones-in-python
# https://itnext.io/working-with-timezone-and-python-using-pytz-library-4931e61e5152
timezone = pytz.timezone(sim_tz)
obs_date0 = timezone.localize(datetime.strptime(time0, '%b %d %Y %I:%M%p'))
obs_date1 = timezone.localize(datetime.strptime(time1, '%b %d %Y %I:%M%p'))
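The localization step above can be checked in isolation. The sketch below parses the same start time string, attaches the time zone with `localize`, and converts to a Unix timestamp and back; the variable names are illustrative only. The pytz documentation notes that attaching a zone with `replace(tzinfo=...)` can yield an unexpected historical offset for many zones, which is why `localize` is used here.

```python
from datetime import datetime
import pytz

tz = pytz.timezone("Asia/Shanghai")

# Naive datetime parsed from the same format used for time0 and time1
naive = datetime.strptime('Nov 20 2023 9:00AM', '%b %d %Y %I:%M%p')

# localize() attaches the zone with the UTC offset valid on that date
aware = tz.localize(naive)
print(aware.isoformat())  # 2023-11-20T09:00:00+08:00

offset_hours = aware.utcoffset().total_seconds() / 3600

# Converting to a Unix timestamp and back recovers the same instant
ts = aware.timestamp()
roundtrip = datetime.fromtimestamp(ts, tz=tz)
print(roundtrip == aware)  # True
```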

In [3]:
# Won't work until the record is made public
download_files_from_zenodo = False

data_files = ['participants.csv', 'histories.csv', 'survey.csv', 'sequences.csv', 'mutations.csv']
zenodo_url = 'https://zenodo.org/record/' + zenodo_id + '/files/'

for fn in data_files:
    full_src_path = zenodo_url + fn
    dest_path = path.join(data_folder, fn)    
    if path.isfile(dest_path):
        print('Found data file', dest_path)
    elif download_files_from_zenodo:
        print('Downloading', full_src_path, 'to', dest_path, '...')        
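        response = requests.get(full_src_path)
        # Stop on HTTP errors (e.g. 404) instead of saving an error page as data
        response.raise_for_status()
        with open(dest_path, 'wb') as f:
            f.write(response.content)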
        print(' Done.')
    else:
        print('WARNING: Data file', dest_path, 'is missing!')

Found data file ./data/participants.csv
Found data file ./data/histories.csv
Found data file ./data/sequences.csv
Found data file ./data/mutations.csv


In [4]:
# Load participants and histories

all_users = pd.read_csv(path.join(data_folder, "participants.csv"), low_memory=False) 
all_events = pd.read_csv(path.join(data_folder, "histories.csv"), low_memory=False)

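# Copy the filtered rows so that modifying columns does not trigger pandas' SettingWithCopyWarning
users = all_users[all_users["sim_id"] == sim_id].copy()
users['random_id'] = users['random_id'].astype(str).str.zfill(4)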

# Save the users to a pickle file
with open(path.join(data_folder, 'users.pickle'), 'wb') as f:
    pickle.dump(users, f)

# Copy the filtered rows before filling missing values and adding columns
events = all_events[all_events["sim_id"] == sim_id].copy()
events = events.fillna({'contact_length': 0, 'peer_id': -1})
events["event_start"] = events["time"] - events["contact_length"]/1000
events["event_start"] = events["event_start"].astype(int)

p2pToSim = pd.Series(users.sim_id.values, index=users.p2p_id).to_dict()
p2pToId = pd.Series(users.id.values, index=users.p2p_id).to_dict()
idTop2p = pd.Series(users.p2p_id.values, index=users.id).to_dict()
        
user_index = {}
index_user = {}
idx = 0
for kid in idTop2p:
    user_index[kid] = idx
    index_user[idx] = kid
    idx += 1

# Round min and max times to the hour
min_time = min(events['time'])
max_time = max(events['time'])
first_date = hour_rounder(datetime.fromtimestamp(min_time, tz=timezone))
last_date = hour_rounder(datetime.fromtimestamp(max_time, tz=timezone))
min_time = datetime.timestamp(first_date)
max_time = datetime.timestamp(last_date)

print("First event:", first_date)
print("Last event :", last_date)

if time0 and time1:
    print("Start time:", datetime.strptime(time0, '%b %d %Y %I:%M%p'))
    print("End time:", datetime.strptime(time1, '%b %d %Y %I:%M%p'))

print(first_date.tzinfo)

# These should return the same value
print(len(users))
print(len(idTop2p))    
print(len(p2pToId))
print(len(user_index))

First event: 2023-11-19 15:00:00+08:00
Last event : 2023-12-05 16:00:00+08:00
Start time: 2023-11-20 09:00:00
End time: 2023-12-04 12:00:00
Asia/Shanghai
794
794
794
794


At this point, we have parsed the source simulation data and can extract any information we need from it. For example, in the cell below, we get the final states of all nodes, the list of infections (all infector-infectee pairs), and the list of all contacts during the entire simulation:

In [5]:
# Get list of infections and contacts, needed to construct the network graph
state = get_node_state(user_index, events, None, print_data_warnings)
infections = get_infection_list(user_index, events, discard_reinfections, time_delta_sec)
contacts = get_contact_list(user_index, events, infections, def_contact_time, print_data_warnings)

With the contacts and state information, we can build the network graph using the [networkx package](https://networkx.org/). The first step is to construct the full network where we remove isolated nodes:

In [6]:
min_total_contact_time = 5  # at least this total time (in minutes) over the two weeks to be defined as in contact
min_total_contact_count = 1 # nodes must have at least this number of edges with other nodes to be kept

G = create_contact_network(user_index, contacts, state, "final_health_state", min_total_contact_time)

print(len(G.nodes()), len(G.edges()))

removed = remove_nodes_with_less_edges(G, min_total_contact_count)

isolates = list(nx.isolates(G))
G.remove_nodes_from(isolates)

mask = users.index.isin(removed + isolates)
remids = pd.DataFrame(users[mask]['random_id'].tolist(), columns=['User ID'])
remids.to_csv(path.join(data_folder, 'removed-nodes.csv'), index=False)

print('Removed', len(remids), 'nodes without enough connections')
print('There are', len(G.nodes()), 'remaining nodes with', len(G.edges()), 'edges between them')

794 1802
Removed 320 nodes without enough connections
There are 474 remaining nodes with 1802 edges between them
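The filtering above can be illustrated on a toy graph. This is only a sketch: it assumes that an edge is kept when the total contact time reaches the threshold, which may differ from what `create_contact_network` actually does, and the contact values are hypothetical.

```python
import networkx as nx

G_toy = nx.Graph()
G_toy.add_nodes_from(range(6))

# (node, node, total contact time in minutes): hypothetical values
toy_contacts = [(0, 1, 12), (1, 2, 7), (2, 3, 3), (4, 5, 1)]
min_time = 5

# Assumption: keep only pairs whose total contact time reaches the threshold
G_toy.add_weighted_edges_from((a, b, w) for a, b, w in toy_contacts if w >= min_time)

# Nodes left without any edge are isolates and get removed
isolated = list(nx.isolates(G_toy))
G_toy.remove_nodes_from(isolated)

print(sorted(isolated))       # [3, 4, 5]
print(sorted(G_toy.nodes()))  # [0, 1, 2]
```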


In [7]:
# Save the full network into a pickle file for later use
with open(path.join(data_folder, 'full-network.pickle'), 'wb') as f:
    pickle.dump(G, f)

We also save the directed graph containing all the transmission trees from the simulation:

In [8]:
# Construct a new graph using only the transmission (infection) data
T = nx.DiGraph(infections)

with open(path.join(data_folder, 'transmission-tree.pickle'), 'wb') as f:
    pickle.dump(T, f)

## Getting the largest connected subgraph

We will conduct the network analyses on the largest connected subgraph in the network, which we can find using the code in the following cell. We don't save the subgraph yet because we will add some properties to the nodes later on.

In [9]:
# If the Graph has more than one component, this will return False:
print("Network is connected", nx.is_connected(G))

components = nx.connected_components(G)

subgraphs = [G.subgraph(c) for c in components]
for sg in subgraphs:
    print(len(sg.nodes()), len(sg.edges()))

# Calculate the largest connected component subgraph:
G = sorted(subgraphs, key=lambda x: len(x))[-1]

degrees = [degree for node, degree in G.degree()]

Network is connected False
472 1801
2 1
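An equivalent way to select the largest component, shown here on a toy graph as a sketch, is to take the maximum of the components by size instead of sorting all subgraphs:

```python
import networkx as nx

# Toy graph with two components: a triangle and a single edge
G_toy = nx.Graph([(0, 1), (1, 2), (2, 0), (3, 4)])

# connected_components yields one set of nodes per component
largest_nodes = max(nx.connected_components(G_toy), key=len)

# subgraph() returns a view that shares node attributes with the original graph;
# .copy() gives an independent graph
largest = G_toy.subgraph(largest_nodes).copy()

print(sorted(largest.nodes()))    # [0, 1, 2]
print(largest.number_of_edges())  # 3
```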


## Animation of the spread on the network

If we save the states of all nodes during the simulation at a given interval, we can then use those states to color the nodes in an animation that is generated in the network properties notebook.

In [10]:
# Generate the state of all nodes in G for each frame of the animation

if obs_date0 and obs_date1:
    tmin = datetime.timestamp(obs_date0)
    tmax = datetime.timestamp(obs_date1)
else:
    tmin = min_time
    tmax = max_time
    
t = tmin
frame = 0
all_state = []
tstate = None
print('Calculating the state of each frame...')
while t <= tmax:
    t0 = t
    t += time_delta_sec
    td = datetime.fromtimestamp(t, tz=timezone)
    print(frame, end=' ')
    
    # We want to include contact and infection events that either started or ended between t0 and t
    condition = ((t0 < events['event_start']) & (events['event_start'] <= t)) | ((t0 < events['time']) & (events['time'] <= t))
    tevents = events[condition]
    tstate = get_node_state(user_index, tevents, tstate)

    fstate = [tstate[idx] for idx in list(G.nodes())]
    all_state.append(fstate)
    frame += 1
print('\nDone')

num_frames = len(all_state)
print(f'Calculated states for {num_frames} frames')

Calculating the state of each frame...
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267
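The window condition used in the loop above selects events that either start or end inside the half-open interval (t0, t]. A small sketch with hypothetical event times (in seconds) shows which rows it keeps:

```python
import pandas as pd

# Hypothetical events with start and end times in seconds
toy_events = pd.DataFrame({
    "event_start": [100, 1700, 1900, 3500, 5000],
    "time":        [200, 1900, 2100, 3700, 5400],
})

t0, t1 = 1800, 3600  # one 30-minute window

# Keep events that started or ended in (t0, t1]
in_window = (((t0 < toy_events["event_start"]) & (toy_events["event_start"] <= t1)) |
             ((t0 < toy_events["time"]) & (toy_events["time"] <= t1)))
selected = toy_events[in_window]

print(selected.index.tolist())  # [1, 2, 3]
```

Row 1 is kept because it ends inside the window, row 3 because it starts inside it, and row 2 because it does both.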

In [11]:
# Save the network states to a file
with open(path.join(data_folder, 'all-network-states.pickle'), 'wb') as f:
    pickle.dump(all_state, f)

## Adding behavioral properties to the network

We collected behavioral data during the OO simulation at WKU:

* Responses to an initial survey on perceptions about quarantine (adapted from [this paper](https://bmcpublichealth.biomedcentral.com/articles/10.1186/1471-2458-9-470) published in 2009 following the H1N1 pandemic)
* Daily decisions on whether to "quarantine" (their avatar) or "wear" a (virtual) mask.
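The daily quarantine decisions can be summarized per user as a ratio of "quarantine" choices over all quarantine decisions. The sketch below computes that ratio with a pandas `groupby` on hypothetical events; it assumes the decisions are stored in an `inf` column with the values `quarantine` and `noQuarantine`, as in the histories file, and leaves the ratio undefined (NaN) for users who made no decision.

```python
import numpy as np
import pandas as pd

# Hypothetical events: user 3 never made a quarantine decision
toy_events = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 3, 3],
    "inf": ["quarantine", "noQuarantine", "quarantine",
            "noQuarantine", "other", "other"],
})

decisions = toy_events[toy_events["inf"].isin(["quarantine", "noQuarantine"])]

# Count decisions per user, making sure every user and both columns are present
counts = (decisions.groupby(["user_id", "inf"]).size()
          .unstack(fill_value=0)
          .reindex(index=[1, 2, 3], columns=["quarantine", "noQuarantine"], fill_value=0))

total = counts["quarantine"] + counts["noQuarantine"]

# Users with no decisions get NaN instead of a division by zero
ratio = counts["quarantine"] / total.where(total > 0)
print(ratio)
```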

In [12]:
user_survey_qpref = pd.read_csv(path.join(data_folder, "survey-quarantine-perceptions.csv"))
user_survey_demo = pd.read_csv(path.join(data_folder, "survey-demographics.csv"), dtype = {'OOID': str})

In [13]:
# Remove entries with invalid/missing IDs
user_survey_qpref = user_survey_qpref[user_survey_qpref['user_id'].isin(users['random_id'])]
user_survey_demo = user_survey_demo[user_survey_demo['OOID'].isin(users['random_id'])]

In [14]:
question1 = "Public health officials should have the power to order people into quarantine during COVID-19 outbreaks"
question2 = "If someone is given a quarantine order by a public health official, they should follow it no matter what else is going on in their life at work or home"
question3 = "If I go into quarantine, my family, friends, and community will be protected from getting COVID-19"

qpref_vars = ['q1_response', 'q2_response', 'q3_response']
demo_vars = ['Gender', 'Major', 'Year', 'Mobile operating system']
action_vars = ['quarantine_yes', 'quarantine_no', 'quarantine_ratio', 'wear_mask', 'num_contacts']
attribs = demo_vars + action_vars + qpref_vars

In [15]:
qy_values = []
qn_values = []
qr_values = []
wm_values = []
q1_values = []
q2_values = []
q3_values = []
gender_values = []
major_values = []
year_values = []
mos_values = []

qy_values_all = []
qn_values_all = []
qr_values_all = []
wm_values_all = []
q1_values_all = []
q2_values_all = []
q3_values_all = []
gender_values_all = []
major_values_all = []
year_values_all = []
mos_values_all = []

qy_dict = {}
qn_dict = {}
qr_dict = {}
wm_dict = {}
q1_dict = {}
q2_dict = {}
q3_dict = {}
gender_dict = {}
major_dict = {}
year_dict = {}
mos_dict = {}

for idx in user_index.values():    
    uid = users['id'][idx]
    rid = users['random_id'][idx]
    
    user_events = events[events['user_id'] == uid]
    qy_ev = user_events[user_events['inf'] == 'quarantine']
    qn_ev = user_events[user_events['inf'] == 'noQuarantine']
    wm_ev = user_events[user_events['modifier'] == 'Wearing Mask']

    qy_num = len(qy_ev)
    qn_num = len(qn_ev)
    wm_num = len(wm_ev)

    if 0 < qy_num + qn_num:
        qr_val = qy_num / (qy_num + qn_num)
    else:    
        qr_val = np.nan
    
    q1_res = np.nan
    q2_res = np.nan
    q3_res = np.nan
    survey_responses = user_survey_qpref[user_survey_qpref['user_id'] == rid]
    if len(survey_responses) == 1:
        q1_res = survey_responses['question1'].values[0]
        q2_res = survey_responses['question2'].values[0]
        q3_res = survey_responses['question3'].values[0]

    gender_res = np.nan
    major_res = np.nan
    year_res = np.nan
    mos_res = np.nan
    demo_responses = user_survey_demo[user_survey_demo['OOID'] == rid]
    if len(demo_responses) == 1:
        gender_res = demo_responses['Gender'].values[0]
        major_res = demo_responses['Major'].values[0]
        year_res = demo_responses['Year'].values[0]
        mos_res = demo_responses['Mobile operating system'].values[0]
    
    qy_values_all.append(qy_num)
    qn_values_all.append(qn_num)
    qr_values_all.append(qr_val)
    wm_values_all.append(wm_num)
    q1_values_all.append(q1_res)
    q2_values_all.append(q2_res)
    q3_values_all.append(q3_res)
    gender_values_all.append(gender_res)
    major_values_all.append(major_res)
    year_values_all.append(year_res)
    mos_values_all.append(mos_res)

    if idx in G.nodes():
        qy_values.append(qy_num)
        qn_values.append(qn_num)
        qr_values.append(qr_val)
        wm_values.append(wm_num)
        q1_values.append(q1_res)
        q2_values.append(q2_res)
        q3_values.append(q3_res)
        gender_values.append(gender_res)
        major_values.append(major_res)
        year_values.append(year_res)
        mos_values.append(mos_res)

        qy_dict[idx] = qy_num
        qn_dict[idx] = qn_num
        qr_dict[idx] = qr_val
        wm_dict[idx] = wm_num
        q1_dict[idx] = q1_res
        q2_dict[idx] = q2_res
        q3_dict[idx] = q3_res
        gender_dict[idx] = gender_res
        major_dict[idx] = major_res
        year_dict[idx] = year_res
        mos_dict[idx] = mos_res

nc_dict = dict(G.degree())

user_prefs = pd.DataFrame({'quarantine_yes': qy_values, 
                           'quarantine_no': qn_values, 
                           'quarantine_ratio': qr_values, 
                           'wear_mask': wm_values, 
                           'q1_response': q1_values, 
                           'q2_response': q2_values, 
                           'q3_response': q3_values,
                           'gender': gender_values,
                           'major': major_values,
                           'year': year_values,
                           'mobile_os': mos_values,                           
                           'num_contacts': degrees})

user_prefs_all = pd.DataFrame({'quarantine_yes': qy_values_all, 
                               'quarantine_no': qn_values_all, 
                               'quarantine_ratio': qr_values_all, 
                               'wear_mask': wm_values_all,
                               'q1_response': q1_values_all, 
                               'q2_response': q2_values_all, 
                               'q3_response': q3_values_all,
                               'gender': gender_values_all,
                               'major': major_values_all,
                               'year': year_values_all,
                               'mobile_os': mos_values_all})

nx.set_node_attributes(G, qy_dict, 'quarantine_yes')
nx.set_node_attributes(G, qn_dict, 'quarantine_no')
nx.set_node_attributes(G, qr_dict, 'quarantine_ratio')
nx.set_node_attributes(G, wm_dict, 'wear_mask')
nx.set_node_attributes(G, q1_dict, 'q1_response')
nx.set_node_attributes(G, q2_dict, 'q2_response')
nx.set_node_attributes(G, q3_dict, 'q3_response')
nx.set_node_attributes(G, gender_dict, 'gender')
nx.set_node_attributes(G, major_dict, 'major')
nx.set_node_attributes(G, year_dict, 'year')
nx.set_node_attributes(G, mos_dict, 'mobile_os')
nx.set_node_attributes(G, nc_dict, 'num_contacts')

In [16]:
with open(path.join(data_folder, 'user_prefs.pickle'), 'wb') as f:
    pickle.dump(user_prefs, f)

with open(path.join(data_folder, 'user_prefs_all.pickle'), 'wb') as f:
    pickle.dump(user_prefs_all, f)

with open(path.join(data_folder, 'network-largest_conn_comp.pickle'), 'wb') as f:
    pickle.dump(G, f)

## Daily contact matrices and states

Finally, we generate and save the adjacency matrices from the contact network generated for each day of the simulation (only for those nodes present in the full network), as well as the daily states of each node. This information will be used in the analysis notebooks for tensor factorization and risk prediction.

We also count the "score" events from each day to plot daily behaviors such as quarantining and masking.
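As a sketch of the daily adjacency tensor, the example below stacks one weighted adjacency matrix per day into a 3D NumPy array. The node IDs and contact weights are hypothetical; passing a fixed `nodelist` keeps the row and column order identical across days.

```python
import networkx as nx
import numpy as np

nodes = [10, 20, 30]  # fixed node order shared by all days

# One dict per day, mapping a node pair to its contact weight (hypothetical values)
daily_contacts = [
    {(10, 20): 15.0},
    {(20, 30): 5.0, (10, 30): 2.5},
]

toy_adj = np.zeros((len(daily_contacts), len(nodes), len(nodes)))
for day, contacts in enumerate(daily_contacts):
    g = nx.Graph()
    g.add_nodes_from(nodes)  # include nodes without contacts on that day
    g.add_weighted_edges_from((a, b, w) for (a, b), w in contacts.items())
    toy_adj[day] = nx.to_numpy_array(g, nodelist=nodes, weight="weight")

print(toy_adj.shape)     # (2, 3, 3)
print(toy_adj[1, 0, 2])  # 2.5
```

Sizing the first dimension from the number of days, rather than hard-coding it, avoids an index error if the observation period changes.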

In [17]:
tmin = datetime.timestamp(obs_date0)
tmax = datetime.timestamp(obs_date1)

# Time delta for plots in seconds, each frame is 1 day
daily_delta_sec = 60 * (60 * 24)

In [18]:
# Generate the contact network, transmissions, and node states for each day of the simulation

t = tmin
frame = 0

tstate = None
print('Calculating the network for each day of the sim...')
nodes0 = list(G.nodes()) # We only look at the nodes we already selected before (which have enough interactions over the entire period of the sim)
daily_adj = np.zeros((15, len(nodes0), len(nodes0)))
daily_inf = []
daily_states = []
while t <= tmax:
    t0 = t
    t += daily_delta_sec
    td = datetime.fromtimestamp(t, tz=timezone)
    print('Frame', frame+1, datetime.fromtimestamp(t0, tz=timezone).strftime('%Y-%m-%d %H:%M'), 'to', td.strftime('%Y-%m-%d %H:%M'))
    
    # We want to include contact and infection events that either started or ended between t0 and t
    condition = ((t0 < events['event_start']) & (events['event_start'] <= t)) | ((t0 < events['time']) & (events['time'] <= t))
    tevents = events[condition]
    tstate = get_node_state(user_index, tevents, tstate, False)
    tinf = get_infection_list(user_index, tevents, discard_reinfections, time_delta_sec)
    tcontacts = get_contact_list(user_index, tevents, tinf, def_contact_time, False)
    
    tg = nx.Graph()
    tg.add_nodes_from(nodes0)
    tedges = []
    tweights = []
    if 0 < len(tcontacts):
        for p in tcontacts:
            n0 = p[0]
            n1 = p[1]
            w = tcontacts[p]            
            if n0 in nodes0 and n1 in nodes0 and 0 < w:
                tedges += [(n0, n1)]
                tweights += [w]


    daily_states.append([tstate[idx] for idx in list(tg.nodes())])    
    daily_inf.append(tinf)
    
    tg.add_weighted_edges_from([(tedges[i][0], tedges[i][1], tweights[i]) for i in range(len(tedges))])
    adjm = nx.adjacency_matrix(tg).todense()
    daily_adj[frame, :, :] = adjm
    
    frame += 1
print('Done')

np.save(path.join(data_folder, 'daily-contact-matrices.npy'), daily_adj)
print(f'Saved {frame} adjacency matrices to a Pickle file.')

with open(path.join(data_folder, 'daily-transmissions.npy'), 'wb') as f:
    pickle.dump(daily_inf, f)
print(f'Saved {frame} transmissions lists to a Pickle file.')    

with open(path.join(data_folder, 'daily-node-states.pickle'), 'wb') as f:
    pickle.dump(daily_states, f)
print(f'Saved {frame} states to a Pickle file.')

Calculating the network for each day of the sim...
Frame 1 2023-11-20 09:00 to 2023-11-21 09:00
Frame 2 2023-11-21 09:00 to 2023-11-22 09:00
Frame 3 2023-11-22 09:00 to 2023-11-23 09:00
Frame 4 2023-11-23 09:00 to 2023-11-24 09:00
Frame 5 2023-11-24 09:00 to 2023-11-25 09:00
Frame 6 2023-11-25 09:00 to 2023-11-26 09:00
Frame 7 2023-11-26 09:00 to 2023-11-27 09:00
Frame 8 2023-11-27 09:00 to 2023-11-28 09:00
Frame 9 2023-11-28 09:00 to 2023-11-29 09:00
Frame 10 2023-11-29 09:00 to 2023-11-30 09:00
Frame 11 2023-11-30 09:00 to 2023-12-01 09:00
Frame 12 2023-12-01 09:00 to 2023-12-02 09:00
Frame 13 2023-12-02 09:00 to 2023-12-03 09:00
Frame 14 2023-12-03 09:00 to 2023-12-04 09:00
Frame 15 2023-12-04 09:00 to 2023-12-05 09:00
Done
Saved 15 adjacency matrices to a Pickle file.
Saved 15 transmissions lists to a Pickle file.
Saved 15 states to a Pickle file.


In [19]:
dates = []
masking = []
medication = []
quarantine_no = []
quarantine_yes = []

t = tmin
frame = 0
print('Calculating score-producing events for each day of the sim...')
while t <= tmax:
    t0 = t
    t += daily_delta_sec
    td = datetime.fromtimestamp(t, tz=timezone)
    
    # We want to include events that occurred between t0 and t
    condition = (t0 < events['time']) & (events['time'] <= t)

    tevents = events[condition]    
    score_events = get_score_events(tevents)

    date = datetime.fromtimestamp(t0, tz=timezone).strftime('%Y-%m-%d')
    mask = get_info_count(score_events, "shopMask")
    med = get_info_count(score_events, "shopMedication")
    qno = get_info_count(score_events, "noQuarantine")
    qyes = get_info_count(score_events, "quarantine")

    print(date, mask, med, qno, qyes) 

    dates.append(date)
    masking.append(mask)
    medication.append(med)
    quarantine_no.append(qno)
    quarantine_yes.append(qyes)
    
    frame += 1
print('Done')

daily_behaviors = pd.DataFrame({'date': dates, 
                                'masking': masking, 
                                'medication': medication, 
                                'quarantine_no': quarantine_no, 
                                'quarantine_yes': quarantine_yes})

with open(path.join(data_folder, 'daily-behaviors.pickle'), 'wb') as f:
    pickle.dump(daily_behaviors, f)

Calculating score-producing events for each day of the sim...
2023-11-20 204 0 31 0
2023-11-21 100 0 225 70
2023-11-22 52 0 184 52
2023-11-23 42 0 205 60
2023-11-24 27 1 160 37
2023-11-25 22 3 108 15
2023-11-26 7 6 80 18
2023-11-27 23 3 200 45
2023-11-28 11 4 140 28
2023-11-29 11 10 144 24
2023-11-30 4 4 155 28
2023-12-01 1 6 116 15
2023-12-02 4 3 76 8
2023-12-03 0 0 57 7
2023-12-04 3 3 103 15
Done
