## Part I: Read Data and Pre-Process

The data source is a URL query against the PH Department of Health dataset (shared on Facebook by X), which returns JSON. The JSON is then converted to a pandas DataFrame. Stackoverflow code reference found here: 

In [46]:
import numpy as np
import pandas as pd

#Read JSON from src url
src_url = 'https://services5.arcgis.com/mnYJ21GiFTR97WFg/arcgis/rest/services/PH_masterlist/FeatureServer/0/query?f=json&where=1%3D1&returnGeometry=true&spatialRel=esriSpatialRelIntersects&outFields=%2A&outSR=102100&cacheHint=true&fbclid=IwAR38QbvClnSnwMEw23qlVwTBcgl_fqsFQaYNM1XiCThVVus0sBjclLyp8F0'

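from urllib.request import urlopen
import json

#download the response and flatten the list of feature records into a DataFrame
#pd.json_normalize replaces the deprecated pandas.io.json.json_normalize (pandas >= 1.0)
raw = json.loads(urlopen(src_url).read())
df = pd.json_normalize(raw['features'])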

#rename columns for easier navigation
df.rename(columns = lambda x: x.split('attributes.')[-1], inplace = True)
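
The query string in `src_url` is percent-encoded, which makes it hard to see what is being requested. As a small sketch, the standard library can decode it. The URL below is the same endpoint with the `fbclid` parameter left out, on the assumption that it is only a Facebook click-tracking token and not part of the query itself.

```python
from urllib.parse import urlsplit, parse_qs

# Same endpoint and parameters as src_url, minus fbclid (assumed to be tracking only)
url = ('https://services5.arcgis.com/mnYJ21GiFTR97WFg/arcgis/rest/services/'
       'PH_masterlist/FeatureServer/0/query?f=json&where=1%3D1'
       '&returnGeometry=true&spatialRel=esriSpatialRelIntersects'
       '&outFields=%2A&outSR=102100&cacheHint=true')

# parse_qs decodes percent-escapes and returns each value as a list
params = {k: v[0] for k, v in parse_qs(urlsplit(url).query).items()}

for key, value in params.items():
    print(f'{key} = {value}')
```

Decoded, `where` is `1=1` (a condition that is always true, so no rows are filtered out) and `outFields` is `*` (all fields).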

Column `travel_hx` describes, in paragraph form, the people each case interacted with and the countries visited. A regex extracts the contact codes (of the form PHX, where X is a case number), and both the exact contact code [PHX] and the number alone [X] are stored.

In [68]:
#include contact tracing to (sort of) map spread
import re

def get_contacts(s):
    #return all case codes of the form PH<digits> found in s
    try:
        return re.findall(r'\bPH\d+', s)
    except TypeError:
        #missing travel history (NaN/None): no contacts
        return []

df['contacts'] = df['travel_hx'].apply(get_contacts)

def get_contacts_num(l):
    #strip the 'PH' prefix, keeping only the case number
    return [i.split('PH')[-1] for i in l]

df['contacts_num'] = df['contacts'].apply(get_contacts_num)
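
To make the extraction step concrete, here is a minimal standalone sketch run on a few sample strings. The sample entries and the helper name `extract_contacts` are invented for illustration and are not taken from the dataset. The pattern here is `\bPH\d+`, which only accepts digits after the prefix, so an upper-case word such as PHILIPPINES is not mistaken for a case code.

```python
import re

def extract_contacts(text):
    """Return case codes such as 'PH12' found in free text; [] if text is not a string."""
    if not isinstance(text, str):
        return []
    return re.findall(r'\bPH\d+', text)

# Hypothetical travel-history entries, invented for illustration
samples = [
    'Travel history to Japan; exposure to PH12 and PH34',
    'No known exposure',
    None,
]

for s in samples:
    codes = extract_contacts(s)
    numbers = [c[2:] for c in codes]  # strip the 'PH' prefix
    print(codes, numbers)
```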

## Part II: Connected Components Analysis

Given the contacts obtained in Part I, the questions are: a.) how many inter-connected groups are there, b.) how large are they, and c.) what percentage of transmissions do they cover?

In [119]:
#Create an undirected graph of all connections
contact_map = {}
def add_edge(s, d, curr_dict):
    #append d to the adjacency list of s, creating the list if needed
    curr_dict.setdefault(s, []).append(d)

for row in df.itertuples():
    current = int(getattr(row, 'FID'))
    contact_list = getattr(row, 'contacts_num')
    #make sure every case is a node, without overwriting edges added earlier
    contact_map.setdefault(current, [])
    for c in set(contact_list):
        add_edge(current, int(c), contact_map)
        add_edge(int(c), current, contact_map)

#remove duplicate neighbours
for key in contact_map.keys():
    contact_map[key] = list(set(contact_map[key]))
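
The same adjacency structure can be sketched more compactly with `collections.defaultdict` and sets, which handle missing keys and duplicate edges automatically. The `records` list below is a made-up stand-in for the (`FID`, `contacts_num`) pairs in `df`.

```python
from collections import defaultdict

# Hypothetical (case_id, contact_ids) records, invented for illustration
records = [(1, []), (2, [1]), (3, [1, 2]), (4, []), (5, [4])]

graph = defaultdict(set)
for case_id, contacts in records:
    graph[case_id]  # touching the key registers cases that have no contacts
    for other in contacts:
        graph[case_id].add(other)   # edge case -> contact
        graph[other].add(case_id)   # and back, since the graph is undirected

adjacency = {node: sorted(neighbours) for node, neighbours in graph.items()}
print(adjacency)
```

Cases 1, 2 and 3 end up mutually linked, and cases 4 and 5 form a separate pair.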
        

In [127]:
cnt = len(df)
#index 0 is unused; cases are assumed to be numbered 1..cnt
visited = [False] * (cnt + 1)
parent = [-1] * (cnt + 1)

def dfs(v, g):
    #dfs from node v in graph (dict) g; returns the number of nodes visited
    visited[v] = True
    size = 1
    for node in g[v]:
        if not visited[node]:
            parent[node] = v
            size += dfs(node, g)
    return size

cnt_cc = 0
#comp_list[i-1] is the size of the component first reached from case i (0 otherwise)
comp_list = [0] * cnt
for i in range(1, cnt + 1):
    if not visited[i]:
        cnt_cc += 1
        comp_list[i-1] = dfs(i, contact_map)
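
As a cross-check on the traversal, here is a sketch of a variant that returns component sizes directly and uses an explicit stack rather than recursion, so a long chain of contacts cannot hit Python's recursion limit. The function name `component_sizes` and the toy graph are illustrative assumptions, not part of the notebook.

```python
def component_sizes(graph):
    """Return the size of each connected component of an undirected graph.

    graph maps each node to an iterable of its neighbours.
    """
    seen = set()
    sizes = []
    for start in graph:
        if start in seen:
            continue
        # Explicit stack instead of recursion
        stack = [start]
        seen.add(start)
        size = 0
        while stack:
            node = stack.pop()
            size += 1
            for neighbour in graph.get(node, ()):
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        sizes.append(size)
    return sizes

# Hypothetical toy graph: one triangle, one pair, one isolated case
toy = {1: [2, 3], 2: [1, 3], 3: [1, 2], 4: [5], 5: [4], 6: []}
sizes = component_sizes(toy)
print(len(sizes), sorted(sizes, reverse=True))
print(1 - len(sizes) / len(toy))
```

On the toy graph this gives three components of sizes 3, 2 and 1. The last line is the share of cases that join a component already containing another case, which is the quantity questions a.) to c.) ask about.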

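We find 274 connected components among 307 cases, a ratio of about 0.89. If every case were isolated there would be 307 components, so 307 − 274 = 33 cases, about 11% of all cases, are linked to another case in the network (within-network transmissions).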

In [129]:
print(cnt_cc)

274


In [130]:
print(cnt_cc/cnt)

0.8925081433224755


## Part III: Conclusion

Preliminary conclusion: this analysis is a possible methodology to a.) identify and isolate large spreading groups and b.) identify potential areas which are under-tested.
<br><br>
Conclusions will be updated as the source data is updated.