_HDS5210 - Programming for Health Data Scientists_

# Week 9 - Data Structures - JSON

JSON stands for JavaScript Object Notation.  JSON is a very common format that web-based applications use to communicate information to each other, and it mirrors the way JavaScript code running in web browsers represents dynamic data (even though webpage content itself is written in HTML, a markup language related to XML).
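
As a minimal sketch of what the notation looks like in practice, the cell below parses a small JSON document with Python's `json` module and prints the Python type that each JSON value becomes. The record is made up for illustration.

```python
import json

# A made-up JSON document (illustrative only, not from a real system)
text = '{"drug": "Aspirin", "amount": 100, "active": true, "notes": null, "routes": ["oral", "rectal"]}'

record = json.loads(text)

# JSON object -> dict, string -> str, number -> int or float,
# true/false -> True/False, null -> None, array -> list
for key, value in record.items():
    print(f"{key:8s} {type(value).__name__}")
```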

In this part of the lecture, we'll be working on reading / processing / writing JSON.


## The json module

https://docs.python.org/3/library/json.html

In [None]:
import json

In [None]:
help(json)

In [None]:
dosages = [
    dict( drug="Aspirin", amount=100, mass_unit="mg", time_unit="hr"),
    dict( drug="Digoxin", amount=50,  mass_unit="mg", time_unit="hr")
]

In [None]:
dosages

In [None]:
print(json.dumps(dosages))

In [None]:
type(dosages)

In [None]:
dosages

In [None]:
json.dumps(dosages, indent=4)

In [None]:
json.dumps(dosages, indent=4)[0:10]

In [None]:
with open('dosages.json','w') as dosefile:
    json.dump(dosages, dosefile, indent=4)

In [None]:
%%bash
cat dosages.json

## Reading JSON into Python

In [None]:
with open('dosages.json') as dosefile:
    d = json.load(dosefile)
d

## JSON in Healthcare

A newer part of the HL7 family of standards is FHIR (pronounced "fire"), short for Fast Healthcare Interoperability Resources.  I've downloaded and stored a sample FHIR document in `/data/patient-example-a.json`.

https://www.hl7.org/fhir/patient-example-a.json.html

In [None]:
with open('/data/patient-example-a.json') as patfile:
    pat = json.load(patfile)

In [None]:
type(pat)

In [None]:
pat.keys()

In [None]:
pat['gender']

In [None]:
pat['name']

In [None]:
pat['name'][0]

In [None]:
pat['name'][0]['family']

In [None]:
pat['name'][0]['family'][0]

In [None]:
pat['name'][0]['family'][0] + ' ' + pat['name'][0]['given'][0]

In [None]:
# 'family' is given as a list here so it matches how the existing entry is indexed
pat['name'].append({'use':'alias', 'family':['Mickey'], 'given':['Mouse']})

In [None]:
pat['name']

In [None]:
pat['name'][1]['family'][0] + ' ' + pat['name'][1]['given'][0]

In [None]:
pat['problem'] = ['annoying', 'yellow']

In [None]:
pat.keys()

In [None]:
print(json.dumps(pat, indent=2))
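
The indexing above assumes that `family` is a list of strings. Depending on the FHIR version, `family` may instead be a single string, in which case `[0]` returns only the first character. Below is a hedged sketch of a helper that accepts either shape. `format_name` is a hypothetical helper (not part of any FHIR library), and the patient fragment is made up rather than read from the sample file.

```python
def format_name(name_entry):
    """Return 'Family Given' for one entry of a Patient 'name' list."""
    family = name_entry.get("family", "")
    # Accept either a single string or a list of strings
    if isinstance(family, list):
        family = " ".join(family)
    given = " ".join(name_entry.get("given", []))
    return f"{family} {given}".strip()

# Made-up patient fragment for illustration (not the sample file)
patient = {
    "resourceType": "Patient",
    "name": [
        {"use": "official", "family": ["Donald"], "given": ["Duck"]},
        {"use": "alias", "family": "Mickey", "given": ["Mouse"]},
    ],
}

names = [format_name(n) for n in patient["name"]]
print(names)
```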

## Another HL7 FHIR Example

https://www.hl7.org/fhir/patient-example-f001-pieter.json.html

`/data/patient-example-f001-pieter.json`


In [None]:
with open('/data/patient-example-f001-pieter.json') as patfile:
    pat = json.load(patfile)

print(json.dumps(pat, indent=4))

In [None]:
print(pat['name'])

# Load from string

In [None]:
s = '{ "one": 1, "two": 2}'

In [None]:
s

In [None]:
s_obj = json.loads(s)

In [None]:
s_obj

In [None]:
type(s_obj)
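
Strings that are not valid JSON make `json.loads` raise `json.JSONDecodeError`. A minimal sketch of catching that error follows; `parse_or_none` is a hypothetical helper name chosen for this example.

```python
import json

def parse_or_none(text):
    """Hypothetical helper: return the parsed JSON, or None if the text is invalid."""
    try:
        return json.loads(text)
    except json.JSONDecodeError as err:
        # The exception records where parsing failed
        print(f"Invalid JSON at line {err.lineno}, column {err.colno}: {err.msg}")
        return None

good = parse_or_none('{ "one": 1, "two": 2}')
bad = parse_or_none("{ 'one': 1 }")   # single quotes are not valid JSON
```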

Load JSON using Pandas
===

Remember that a Pandas data frame is a rectangular, two-dimensional table, while JSON is hierarchical.  Pandas can still load JSON data when it can work out how to unwrap the hierarchy into rows and columns.

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_json.html
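
When records contain nested objects, one way to unwrap them is `pandas.json_normalize`, which turns nested keys into dotted column names. This is a small sketch on made-up records, not the course data file.

```python
import pandas as pd

# Made-up nested records for illustration
records = [
    {"drug": "Aspirin", "dose": {"amount": 100, "unit": "mg"}},
    {"drug": "Digoxin", "dose": {"amount": 50, "unit": "mg"}},
]

# Nested keys become dotted column names such as 'dose.amount'
flat = pd.json_normalize(records)
print(flat.columns.tolist())
```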

In [None]:
import pandas as pd

df = pd.read_json('/data/dosages.json')

In [None]:
df

In [None]:
df.loc[df['drug'] == 'Aspirin']

Sometimes the JSON isn't "pretty"
===

For these examples, we're pulling air quality data from the CDC website. 
https://ephtracking.cdc.gov/DataExplorer/#/


CDC has a web API that makes it possible to download data programmatically.  Here is an example link.  The API documentation explains what the various parts of this URL mean.

https://ephtracking.cdc.gov/apigateway/api/v1/getCoreHolder/296/2/1/29/2014,2013/0/0

The CDC website returns data using JSON.



In [None]:
import pandas as pd

data = pd.read_json('https://ephtracking.cdc.gov/apigateway/api/v1/getCoreHolder/296/2/1/29/2014,2013/0/0')

Reading web data using `requests`
===

Since the JSON that we're reading from the web has a bunch of extra metadata wrapped around the records themselves, we'll have to load it into Pandas a different way.  Our steps will be:

1. Load data from the web using `requests`
2. Convert the JSON text we get back into a Python dictionary
3. Use `pandas.DataFrame.from_dict` to convert that to a `DataFrame`
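
These steps can be sketched offline, with a made-up string standing in for the text that a `requests` response would carry. The `modeledTableResult`, `geo`, and `year` names follow the CDC response used in this notebook; the `someMetadata` and `value` fields and all of the values are assumptions for illustration only.

```python
import json
import pandas as pd

# Step 1 stand-in: made-up response text (a real call would use requests.get)
response_text = '''
{
  "someMetadata": "made-up wrapper field",
  "modeledTableResult": [
    {"geo": "St. Louis", "year": "2014", "value": "1.0"},
    {"geo": "St. Louis", "year": "2013", "value": "2.0"},
    {"geo": "Boone", "year": "2014", "value": "3.0"}
  ]
}
'''

# Step 2: JSON text -> Python dictionary
payload = json.loads(response_text)

# Step 3: pull out just the records and build a DataFrame
records = payload["modeledTableResult"]
sketch_df = pd.DataFrame.from_dict(records)
print(sketch_df)
```

A `requests` response object also offers a `.json()` method that performs the parsing in step 2 directly.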

In [None]:
import requests
import json

web_data = requests.get('https://ephtracking.cdc.gov/apigateway/api/v1/getCoreHolder/296/2/1/29/2014,2013/0/0')

In [None]:
data = json.loads(web_data.text)

In [None]:
for key, val in data.items():
    print("{:30s} {:-5d}".format(key, len(val)))

In [None]:
df = pd.DataFrame.from_dict(data['modeledTableResult'])
df

In [None]:
df.loc[df['geo']=='St. Louis'] 

In [None]:
stl2014 = df.loc[(df['geo']=='St. Louis') & (df['year']=='2014')] 
print(json.dumps(stl2014.to_dict(),indent=4))