## Part I: Read Data and Pre-Process

The data source is a URL query against the PH Department of Health dataset (shared on FB by X), which returns JSON. The JSON is then converted to a pandas DataFrame. Stackoverflow code reference found here: 

In [46]:
import numpy as np
import pandas as pd

#Read JSON from src url
src_url = 'https://services5.arcgis.com/mnYJ21GiFTR97WFg/arcgis/rest/services/PH_masterlist/FeatureServer/0/query?f=json&where=1%3D1&returnGeometry=true&spatialRel=esriSpatialRelIntersects&outFields=%2A&outSR=102100&cacheHint=true&fbclid=IwAR38QbvClnSnwMEw23qlVwTBcgl_fqsFQaYNM1XiCThVVus0sBjclLyp8F0'

from urllib.request import urlopen
import json
raw = json.loads(urlopen(src_url).read())
#pd.json_normalize replaces the deprecated pandas.io.json.json_normalize (pandas >= 1.0)
df = pd.json_normalize(raw['features'])

#rename columns for easier navigation
df.rename(columns = lambda x: x.split('attributes.')[-1], inplace = True)

Column 'travel_hx' contains, in paragraph form, the people each case interacted with and the countries visited. A regex extracts the list of contacts (codes of the form PHX), and two copies are stored: the exact contact code [PHX] and the number only [X].

In [68]:
#include contact tracing to (sort of) map spread
import re
def get_contacts(s):
    #return every case code of the form PH<digits>, e.g. ['PH12', 'PH34']
    try:
        return re.findall(r'\bPH\d+\b', s)
    except TypeError:
        #missing travel history (None/NaN): no contacts
        return []

df['contacts'] = df['travel_hx'].apply(get_contacts)

def get_contacts_num(l):
    #strip the 'PH' prefix, keeping the case number as a string
    return [i.split('PH')[-1] for i in l]

df['contacts_num'] = df['contacts'].apply(get_contacts_num)
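
To illustrate what the two columns hold, here is a minimal standalone sketch of the same extraction. The helper name `extract_contacts` and the sample sentence are made up for illustration and are not taken from the DOH data; the pattern assumes case codes are always `PH` followed by digits.

```python
import re

def extract_contacts(text):
    """Return (codes, numbers) for every PH<digits> case code found in text."""
    if not isinstance(text, str):   # missing values arrive as None/NaN
        return [], []
    codes = re.findall(r'\bPH\d+\b', text)
    numbers = [code[2:] for code in codes]  # drop the 'PH' prefix
    return codes, numbers

# Hypothetical travel history text, not taken from the DOH data
sample = "Travelled to Japan; exposed to PH12 and PH34 at a family gathering"
codes, numbers = extract_contacts(sample)
print(codes)    # ['PH12', 'PH34']
print(numbers)  # ['12', '34']
print(extract_contacts(None))  # ([], [])
```

Requiring digits after `PH` keeps words such as "PHILIPPINES" from being picked up as case codes, which would otherwise break the later conversion to `int`.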

## Part II: Connected Components Analysis

Given the contacts obtained in Part I: a.) how many inter-connected groups are there, b.) how large are they, and c.) what percentage of transmissions do they cover?

In [119]:
#Create an undirected graph of all connections
#assumes FID matches the number in each case's PH code
contact_map = {}
def add_edge(s, d, curr_dict):
    #append d to the adjacency list of s, creating the list if needed
    curr_dict.setdefault(s, []).append(d)

for row in df.itertuples():
    current = int(row.FID)
    #register every case, without overwriting edges added by earlier rows
    contact_map.setdefault(current, [])
    for c in set(row.contacts_num):
        add_edge(current, int(c), contact_map)
        add_edge(int(c), current, contact_map)

#remove duplicate neighbors
for key in contact_map:
    contact_map[key] = list(set(contact_map[key]))
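
As a point of comparison, the same undirected adjacency map can be sketched with `collections.defaultdict(set)`, which removes the need for a separate de-duplication pass. The function name `build_contact_map` and the toy input are hypothetical; this is one possible formulation, not the notebook's own code.

```python
from collections import defaultdict

def build_contact_map(cases):
    """Build an undirected adjacency map from {case_id: [contact_id, ...]}."""
    graph = defaultdict(set)
    for case_id, contact_ids in cases.items():
        graph[case_id]              # touching the key registers isolated cases
        for contact_id in contact_ids:
            graph[case_id].add(contact_id)   # edge case -> contact
            graph[contact_id].add(case_id)   # and the reverse edge
    return {node: sorted(neighbors) for node, neighbors in graph.items()}

# Hypothetical toy input: case 3 names case 1 twice, case 2 names nobody
toy_cases = {1: [], 2: [], 3: [1, 1]}
toy_map = build_contact_map(toy_cases)
print(toy_map)  # {1: [3], 2: [], 3: [1]}
```

Because sets ignore repeated additions, a contact mentioned several times in one travel history still yields a single edge.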
        

Use Depth First Search (DFS) to a.) make a map of all connected cases and b.) count the size and number of connected components

In [144]:
cnt = len(df)
#lists are indexed by case number 1..cnt; index 0 is unused
visited = [False] * (cnt + 1)
visited_pre = [False] * (cnt + 1)
parent = [-1] * (cnt + 1)
def dfs (v, g):
    #dfs from node v in graph (dict) g
    visited[v] = True
    for node in g[v]:
        if(not visited[node]):
            parent[node] = v
            dfs(node, g)
            
cnt_cc = 0
comp_list = {}
for i in range(1, cnt+1):
    if(not visited[i]):
        cnt_cc+=1
        dfs(i, contact_map)
        comp_list[cnt_cc-1] = []
        
        for j in range(1, cnt+1):
            if(not visited_pre[j] and visited[j]):
                visited_pre[j] = visited[j]
                comp_list[cnt_cc-1].append(j)
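
The recursive search above can also be written with an explicit stack, which avoids Python's recursion limit on very large components and collects each component directly. The sketch below is a generic iterative DFS on a toy graph; `connected_components` and `toy_graph` are illustrative names, not part of the notebook.

```python
def connected_components(graph):
    """Return a list of components, each a sorted list of node ids."""
    seen = set()
    components = []
    for start in sorted(graph):
        if start in seen:
            continue
        seen.add(start)
        stack = [start]          # explicit stack replaces recursion
        component = []
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbor in graph.get(node, []):
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        components.append(sorted(component))
    return components

# Hypothetical toy graph: {1, 2, 5} and {4, 6} are linked, 3 is isolated
toy_graph = {1: [2], 2: [1, 5], 3: [], 4: [6], 5: [2], 6: [4]}
components = connected_components(toy_graph)
print(components)       # [[1, 2, 5], [3], [4, 6]]
print(len(components))  # 3
```

Each node is pushed at most once, so the run time stays linear in the number of cases plus contact links.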

Finding: The largest connected component has 8 people: [12, 34, 35, 42, 43, 84, 86, 204]. Additionally, there are 274 unique components. Subtracting one case per component leaves about 11% of all cases, which is the share attributed here to transmission within connected components (2 or more people).

In [169]:
comp_size = [0 for i in comp_list.keys()]
for key in comp_list.keys():
    comp_size[key] = len(comp_list[key])

print("Number of People in the Top 5 largest networks:")
comp_size.sort()
comp_size[-5:]

Number of People in the Top 5 largest networks:


[3, 4, 4, 6, 8]

In [165]:
#Dictionary that maps all connected cases
comp_list

{0: [1, 2],
 1: [3],
 2: [4],
 3: [5, 6, 38],
 4: [7],
 5: [8],
 6: [9, 27, 28, 29, 30, 31],
 7: [10],
 8: [11],
 9: [12, 34, 35, 42, 43, 84, 86, 204],
 10: [13],
 11: [14],
 12: [15],
 13: [16, 17],
 14: [18, 19],
 15: [20],
 16: [21, 65, 66, 67],
 17: [22],
 18: [23],
 19: [24],
 20: [25],
 21: [26],
 22: [32, 90],
 23: [33],
 24: [36],
 25: [37],
 26: [39],
 27: [40],
 28: [41, 44, 87, 112],
 29: [45],
 30: [46],
 31: [47],
 32: [48],
 33: [49, 52, 183],
 34: [50],
 35: [51, 134, 135],
 36: [53],
 37: [54],
 38: [55],
 39: [56],
 40: [57],
 41: [58],
 42: [59],
 43: [60],
 44: [61],
 45: [62],
 46: [63],
 47: [64],
 48: [68],
 49: [69],
 50: [70],
 51: [71],
 52: [72],
 53: [73],
 54: [74],
 55: [75],
 56: [76],
 57: [77],
 58: [78],
 59: [79],
 60: [80],
 61: [81],
 62: [82],
 63: [83],
 64: [85],
 65: [88],
 66: [89],
 67: [91],
 68: [92],
 69: [93],
 70: [94],
 71: [95],
 72: [96],
 73: [97],
 74: [98],
 75: [99],
 76: [100],
 77: [101],
 78: [102],
 79: [103],
 80: [104],
 81: [

In [172]:
print("Number of unique networks: ", cnt_cc)

Number of unique networks:  274


In [177]:
print(f"Percentage of Transmissions in networks of 2+: {round(100*(1-cnt_cc/cnt),2)}%")

Percentage of Transmissions in networks of 2+: 10.75%
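
The figure above is computed as `1 - cnt_cc/cnt`, which counts every case beyond the first in each component. A related but different measure is the share of cases that belong to any component of size 2 or more. The sketch below computes both on a made-up list of components so the difference is visible; `component_shares` and `toy_components` are hypothetical and the numbers are not from the DOH data.

```python
def component_shares(components):
    """Return (linked_pct, clustered_pct) for a list of components.

    linked_pct    : cases minus one per component, i.e. 100 * (1 - n_components / n_cases)
    clustered_pct : cases that sit in a component of size 2 or more
    """
    total = sum(len(c) for c in components)
    linked = total - len(components)
    clustered = sum(len(c) for c in components if len(c) >= 2)
    return 100 * linked / total, 100 * clustered / total

# Hypothetical components: 7 cases in 4 components
toy_components = [[1, 2, 5], [3], [4, 6], [7]]
linked_pct, clustered_pct = component_shares(toy_components)
print(round(linked_pct, 2))     # 42.86
print(round(clustered_pct, 2))  # 71.43
```

The first number is always the smaller of the two, because it leaves out one case from every multi-person component.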


## Part III: Conclusion

Preliminary conclusion: this is a possible methodology to a.) identify and isolate large spreading groups and b.) identify potential areas which are under-tested.

Conclusions will be updated as the source data gets updated.