_HDS5210 - Programming for Health Data Scientists_

# Week 9 - Data Structures - JSON

JSON stands for JavaScript Object Notation.  It is a very common way for web-based applications to communicate information to each other, and it is the native way that web browsers manage dynamic data internally (even though webpage content itself is written in HTML, a markup language related to XML).

In this part of the lecture, we'll be working on reading / processing / writing JSON.
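
As a quick orientation, here is a minimal sketch of how the `json` module maps JSON types onto Python types. The sample document is made up for illustration and is not part of the course data.

```python
import json

# A made-up JSON document (hypothetical, for illustration only) that uses
# every JSON type: object, array, string, number, boolean, and null.
text = '{"name": "Aspirin", "amount": 100, "ratio": 0.5, "active": true, "notes": null, "tags": ["a", "b"]}'

parsed = json.loads(text)

# Print each key with the Python type the json module produced for it.
for key, value in parsed.items():
    print("{:8s} {}".format(key, type(value).__name__))
```

JSON objects become dictionaries, arrays become lists, `true`/`false` become `True`/`False`, and `null` becomes `None`.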


In [1]:
%%bash
head /data/aco_year1.csv

"ACO Name (LBN or DBA, if applicable) ",States Where Beneficiaries Reside ,Agreement Start Date,Track,Participate in Advance Payment Model ,Total Assigned Beneficiaries,Total Benchmark Expenditures,Total Expenditures,Total Benchmark Expenditures Minus Total Assigned Beneficiary Expenditures,Total Benchmark Expenditures Minus Total Assigned Beneficiary Expenditures as % of Total Benchmark,"Generated Savings/Losses1,2","Earned Shared Savings Payments/Owe Losses3,4",Successfully Reported Quality5,ACO-1,ACO-2,ACO-3,ACO-4,ACO-5,ACO-6,ACO-7,ACO-8^,ACO-9^,ACO-10^,ACO-11,ACO-12,ACO-13,ACO-14,ACO-15,ACO-16,ACO-17,ACO-18,ACO-19,ACO-20,ACO-21,DM Comp-osite,ACO-22,ACO-23,ACO-24,ACO-25,ACO-26,ACO-27^,ACO-28,ACO-29,ACO-30,ACO-31,CAD Comp-osite,ACO-32,ACO-33
"A.M. Beajow, M.D. Internal Medicine Associates ACO, P.C",Nevada,01/01/2013,Track1 ,No ,5921,$70912015,$67555873,$3356142,4.7%,$3356142,$1644510,Yes,75.6,93.09,92.18,82.91,58.06,76.36,71.33,14.88,0.67,1.14,75,72.5,1.24,25.83,22.4,31.19,64.08,0,39

## The json module

https://docs.python.org/3/library/json.html

In [2]:
import json

In [3]:
help(json)

Help on package json:

NAME
    json

MODULE REFERENCE
    https://docs.python.org/3.5/library/json.html
    
    The following documentation is automatically generated from the Python
    source files.  It may be incomplete, incorrect or include features that
    are considered implementation detail and may vary between Python
    implementations.  When in doubt, consult the module reference at the
    location listed above.

DESCRIPTION
    JSON (JavaScript Object Notation) <http://json.org> is a subset of
    JavaScript syntax (ECMA-262 3rd edition) used as a lightweight data
    interchange format.
    
    :mod:`json` exposes an API familiar to users of the standard library
    :mod:`marshal` and :mod:`pickle` modules.  It is derived from a
    version of the externally maintained simplejson library.
    
    Encoding basic Python object hierarchies::
    
        >>> import json
        >>> json.dumps(['foo', {'bar': ('baz', None, 1.0, 2)}])
        '["foo", {"bar": ["baz", null,

In [4]:
dosages = [
    dict( drug="Aspirin", amount=100, mass_unit="mg", time_unit="hr"),
    dict( drug="Digoxin", amount=50,  mass_unit="mg", time_unit="hr")
]

In [5]:
dosages

[{'amount': 100, 'drug': 'Aspirin', 'mass_unit': 'mg', 'time_unit': 'hr'},
 {'amount': 50, 'drug': 'Digoxin', 'mass_unit': 'mg', 'time_unit': 'hr'}]

In [6]:
print(json.dumps(dosages))

[{"time_unit": "hr", "amount": 100, "mass_unit": "mg", "drug": "Aspirin"}, {"time_unit": "hr", "amount": 50, "mass_unit": "mg", "drug": "Digoxin"}]


In [7]:
type(dosages)

list

In [8]:
dosages

[{'amount': 100, 'drug': 'Aspirin', 'mass_unit': 'mg', 'time_unit': 'hr'},
 {'amount': 50, 'drug': 'Digoxin', 'mass_unit': 'mg', 'time_unit': 'hr'}]

In [12]:
print(json.dumps(dosages, indent=4))

[
    {
        "time_unit": "hr",
        "amount": 100,
        "mass_unit": "mg",
        "drug": "Aspirin"
    },
    {
        "time_unit": "hr",
        "amount": 50,
        "mass_unit": "mg",
        "drug": "Digoxin"
    }
]


In [13]:
json.dumps(dosages, indent=4)[0:10]

'[\n    {\n  '

In [14]:
with open('dosages.json','w') as dosefile:
    json.dump(dosages, dosefile, indent=4)
    #dosefile.write(json.dumps(dosages, indent=4))

In [15]:
%%bash
cat dosages.json

[
    {
        "time_unit": "hr",
        "amount": 100,
        "mass_unit": "mg",
        "drug": "Aspirin"
    },
    {
        "time_unit": "hr",
        "amount": 50,
        "mass_unit": "mg",
        "drug": "Digoxin"
    }
]
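
The key order in the output above differs from the order in which the dictionaries were written, because older Python versions did not preserve dictionary insertion order. If a stable, diff-friendly file matters, `json.dumps` and `json.dump` accept `sort_keys=True`. A minimal sketch reusing the same `dosages` values:

```python
import json

dosages = [
    dict(drug="Aspirin", amount=100, mass_unit="mg", time_unit="hr"),
    dict(drug="Digoxin", amount=50, mass_unit="mg", time_unit="hr"),
]

# sort_keys=True writes each object's keys in alphabetical order, so the
# output is the same no matter how the dictionaries were built.
stable = json.dumps(dosages, indent=4, sort_keys=True)
print(stable)

# Round trip: parsing the text gives back an equal Python structure.
roundtrip = json.loads(stable)
print(roundtrip == dosages)
```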

## Reading JSON into Python

In [16]:
with open('dosages.json') as dosefile:
    d = json.load(dosefile)
d

[{'amount': 100, 'drug': 'Aspirin', 'mass_unit': 'mg', 'time_unit': 'hr'},
 {'amount': 50, 'drug': 'Digoxin', 'mass_unit': 'mg', 'time_unit': 'hr'}]

In [29]:
type(d)

list

In [26]:
ages = """{"E18734": 32,"E98904": 12,"E98743": 87}"""

In [27]:
ages

'{"E18734": 32,"E98904": 12,"E98743": 87}'

In [28]:
ages_dict = json.loads(ages) 

In [21]:
ages_dict

{'E18734': 32, 'E98743': 87, 'E98904': 12}

In [22]:
type(ages_dict)

dict

In [30]:
data = "87"
val = json.loads(data)
val

87

In [33]:
f = """
{"amount": 100}
{"amount": 50}
"""

In [34]:
json.loads(f)

JSONDecodeError: Extra data: line 3 column 1 (char 17)
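
The error arises because `json.loads` expects exactly one JSON document, and the string above holds two. When each line is its own document (a layout often called JSON Lines), one option is to parse line by line. A minimal sketch, assuming every non-blank line is a complete JSON value:

```python
import json

f = """
{"amount": 100}
{"amount": 50}
"""

# Parse each non-blank line as a separate JSON document.
records = [json.loads(line) for line in f.splitlines() if line.strip()]
print(records)
```

The result is an ordinary Python list with one dictionary per line.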

## JSON in Healthcare

A newer HL7 standard is FHIR (Fast Healthcare Interoperability Resources, pronounced "fire"), which can represent clinical records as JSON.  I've downloaded and stored a sample FHIR document in `/data/patient-example-a.json`

https://www.hl7.org/fhir/patient-example-a.json.html

In [35]:
with open('/data/patient-example-a.json') as patfile:
    pat = json.load(patfile)

In [36]:
type(pat)

dict

In [37]:
pat.keys()

dict_keys(['gender', 'text', 'link', 'id', 'resourceType', 'name', 'managingOrganization', 'identifier', 'photo', 'active', 'contact'])

In [38]:
pat['gender']

'male'

In [39]:
pat['name']

[{'family': 'Donald', 'given': ['Duck'], 'use': 'official'}]

In [41]:
pat['name'][0]

{'family': 'Donald', 'given': ['Duck'], 'use': 'official'}

In [42]:
pat['name'][0]['family'] 

'Donald'

In [45]:
pat['name'][0]['family'][0]


'D'

In [46]:
pat['name'][0]['family'] + ' ' + pat['name'][0]['given'][0]

'Donald Duck'

In [47]:
pat['name'].append({'use':'alias', 'family':'Mickey', 'given':['Mouse']}) 

In [48]:
pat['name']

[{'family': 'Donald', 'given': ['Duck'], 'use': 'official'},
 {'family': 'Mickey', 'given': ['Mouse'], 'use': 'alias'}]

In [49]:
pat['problem']

KeyError: 'problem'
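
Indexing with a missing key raises `KeyError`. When a field may or may not be present, `dict.get` returns a default instead of raising. A minimal sketch on a small stand-in dictionary (a hypothetical, trimmed copy of the patient above, not the full file):

```python
# Hypothetical, trimmed stand-in for the patient dictionary loaded above.
pat = {
    "gender": "male",
    "name": [{"family": "Donald", "given": ["Duck"], "use": "official"}],
}

# .get() returns None (or the default you supply) instead of raising KeyError.
print(pat.get("problem"))
print(pat.get("problem", []))

# The `in` operator tests for a key without reading its value.
if "problem" not in pat:
    print("no problem list recorded")
```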

In [50]:
pat['problem'] = ['annoying', 'yellow']

In [51]:
pat.keys()

dict_keys(['link', 'id', 'managingOrganization', 'contact', 'photo', 'problem', 'gender', 'text', 'resourceType', 'name', 'identifier', 'active'])

In [52]:
print(json.dumps(pat, indent=2))

{
  "link": [
    {
      "type": "seealso",
      "other": {
        "reference": "Patient/pat2"
      }
    }
  ],
  "id": "pat1",
  "managingOrganization": {
    "display": "ACME Healthcare, Inc",
    "reference": "Organization/1"
  },
  "contact": [
    {
      "relationship": [
        {
          "coding": [
            {
              "code": "E",
              "system": "http://hl7.org/fhir/v2/0131"
            }
          ]
        }
      ],
      "organization": {
        "display": "Walt Disney Corporation",
        "reference": "Organization/1"
      }
    }
  ],
  "photo": [
    {
      "data": "R0lGODlhEwARAPcAAAAAAAAA/+9aAO+1AP/WAP/eAP/eCP/eEP/eGP/nAP/nCP/nEP/nIf/nKf/nUv/nWv/vAP/vCP/vEP/vGP/vIf/vKf/vMf/vOf/vWv/vY//va//vjP/3c//3lP/3nP//tf//vf/////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////

## Another HL7 FHIR Example

https://www.hl7.org/fhir/patient-example-f001-pieter.json.html

`/data/patient-example-f001-pieter.json`


In [53]:
with open('/data/patient-example-f001-pieter.json') as patfile:
    pat = json.load(patfile)

print(json.dumps(pat, indent=4))

{
    "communication": [
        {
            "preferred": true,
            "language": {
                "coding": [
                    {
                        "display": "Dutch",
                        "code": "nl",
                        "system": "urn:ietf:bcp:47"
                    }
                ],
                "text": "Nederlands"
            }
        }
    ],
    "multipleBirthBoolean": true,
    "maritalStatus": {
        "coding": [
            {
                "display": "Married",
                "code": "M",
                "system": "http://hl7.org/fhir/v3/MaritalStatus"
            }
        ],
        "text": "Getrouwd"
    },
    "id": "f001",
    "deceasedBoolean": false,
    "managingOrganization": {
        "display": "Burgers University Medical Centre",
        "reference": "Organization/f001"
    },
    "contact": [
        {
            "name": {
                "family": "Abels",
                "use": "usual",
                "given": [
        

In [None]:
print(pat['name'])

## Load from string

In [None]:
s = '{ "one": 1, "two": 2}'

In [None]:
s

In [None]:
s_obj = json.loads(s)

In [None]:
s_obj

In [None]:
type(s_obj)

Load JSON using Pandas
===

Remember that a Pandas data frame is a rectangular 2-dimensional matrix.  JSON is hierarchical.  Yet, Pandas will try to load JSON data if it can figure out a way to unwrap the structure from a hierarchy into a matrix.

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_json.html

In [54]:
%%bash
cat /data/dosages.json

[
    {
        "time_unit": "hr",
        "amount": 100,
        "mass_unit": "mg",
        "drug": "Aspirin"
    },
    {
        "time_unit": "hr",
        "amount": 50,
        "mass_unit": "mg",
        "drug": "Digoxin"
    }
]

In [55]:
import pandas as pd

df = pd.read_json('/data/dosages.json')

In [56]:
df

Unnamed: 0,amount,drug,mass_unit,time_unit
0,100,Aspirin,mg,hr
1,50,Digoxin,mg,hr


In [57]:
fhir = pd.read_json('/data/patient-example-f001-pieter.json')

ValueError: arrays must all be same length
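
The FHIR document fails to load because its nested lists and objects do not line up into equal-length columns. One way around this is to flatten the hierarchy into a single level of dotted keys before building a table; recent pandas versions also provide `pandas.json_normalize` for this purpose. The helper below, `flatten`, is a hypothetical illustration written for this note, and the sample record is a trimmed stand-in rather than the real file:

```python
def flatten(value, prefix=""):
    """Flatten nested dicts/lists into one dict with dotted keys."""
    flat = {}
    if isinstance(value, dict):
        for key, child in value.items():
            flat.update(flatten(child, prefix + key + "."))
    elif isinstance(value, list):
        for index, child in enumerate(value):
            flat.update(flatten(child, prefix + str(index) + "."))
    else:
        # Scalar leaf: drop the trailing dot from the accumulated prefix.
        flat[prefix[:-1]] = value
    return flat

# Hypothetical, trimmed stand-in for a FHIR patient record.
record = {
    "id": "f001",
    "deceasedBoolean": False,
    "maritalStatus": {"coding": [{"code": "M", "display": "Married"}]},
}

row = flatten(record)
for key, val in row.items():
    print(key, "=", val)
```

Each flattened record is a single-level dictionary, which is the shape a rectangular table needs.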

In [58]:
df.loc[df['drug'] == 'Aspirin'] 

Unnamed: 0,amount,drug,mass_unit,time_unit
0,100,Aspirin,mg,hr


Sometimes the JSON isn't "pretty"
===

For these examples, we're pulling air quality data from the CDC website. 
https://ephtracking.cdc.gov/DataExplorer/#/


CDC has a web API that makes downloading data programmable.  Here is an example link.  Reviewing the API documentation, we can see what the various parts of this URL mean.



https://ephtracking.cdc.gov/apigateway/api/v1/getCoreHolder/296/2/1/29/2014,2013/0/0

The CDC website returns data using JSON.



In [59]:
import pandas as pd

data = pd.read_json('https://ephtracking.cdc.gov/apigateway/api/v1/getCoreHolder/296/2/1/29/2014,2013/0/0')

ValueError: arrays must all be same length

Reading web data using `requests`
===

Since the JSON that we're reading from the web has a bunch of extra metadata wrapped around the records themselves, we'll have to load that to Pandas a different way.  Our steps will be:

1. Load data from web using `requests`
2. Convert the JSON text we get back into a Python dictionary
3. Use `pandas.DataFrame.from_dict` to convert that to a `DataFrame`
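
The middle step can be sketched offline. The payload below is a hypothetical, heavily trimmed stand-in for the CDC response (only two keys and one record), used so the example runs without network access; the real response has many more fields.

```python
import json

# Hypothetical, trimmed stand-in for the text returned by the web request.
response_text = """
{
  "tableResult": [],
  "modeledTableResult": [
    {"geo": "Adair", "year": "2013", "dataValue": "9.1"}
  ]
}
"""

# Step 2: convert the JSON text into a Python dictionary.
payload = json.loads(response_text)

# The records of interest sit under one key; the rest is metadata.
records = payload["modeledTableResult"]
print(type(payload).__name__, len(records), records[0]["geo"])
```

A list of dictionaries like `records` is the shape that step 3 hands to pandas.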

In [60]:
import requests
import json

web_data = requests.get('https://ephtracking.cdc.gov/apigateway/api/v1/getCoreHolder/296/2/1/29/2014,2013/0/0')

In [70]:
dir(web_data)

['__attrs__',
 '__bool__',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__iter__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__nonzero__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setstate__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_content',
 '_content_consumed',
 'apparent_encoding',
 'close',
 'connection',
 'content',
 'cookies',
 'elapsed',
 'encoding',
 'headers',
 'history',
 'is_permanent_redirect',
 'is_redirect',
 'iter_content',
 'iter_lines',
 'json',
 'links',
 'ok',
 'raise_for_status',
 'raw',
 'reason',
 'request',
 'status_code',
 'text',
 'url']

In [61]:
data = json.loads(web_data.text)

In [62]:
data

{'benchmarkInformation': {'active': None,
  'benchmarkFullName': 'National  Ambient Air Quality Standard',
  'benchmarkId': 1,
  'benchmarkName': ' Ambient Air Quality Standard',
  'benchmarkShortName': ' Ambient Air Quality Standard',
  'geographicDisplay': 'National',
  'geographicTypeDisplay': 'National Benchmark',
  'hasChart': False,
  'hasMap': True,
  'hasTable': True,
  'id': 9,
  'measureGeographicTypeId': 2,
  'measureId': 296,
  'multipleSelectionAction': 'Continue',
  'multipleSelectionActionId': 1,
  'units': 'µg/m³'},
 'benchmarkResult': [{'calculationType': None,
   'dataValue': '12.0',
   'displayValue': '12.0',
   'geo': None,
   'geoAbbreviation': None,
   'geoId': None,
   'geographicTypeId': 2,
   'groupById': '1',
   'id': '31',
   'parentGeo': None,
   'parentGeoAbbreviation': None,
   'parentGeoId': None,
   'year': '2013'},
  {'calculationType': None,
   'dataValue': '12.0',
   'displayValue': '12.0',
   'geo': None,
   'geoAbbreviation': None,
   'geoId': None,

In [63]:
for key, val in data.items():
    print("{:30s} {:-5d}".format(key, len(val)))

biomonitoringTableResult           0
categoryTableResult                0
healthImpactTableResult            0
climateChangeTableResult           0
measureInformationDTO              7
lookupList                         0
fullPublicAPIUrl                 281
heatEpisodesTableResult            0
publicAPIServerUrl                31
tableResultWithCI                  0
tableResultWithMonth               0
tableResultWithQuarter             0
dailyEstimatesTableResult          0
tableResult                        0
benchmarkInformation              16
devDisabilitiesTableResult         0
benchmarkResult                    2
legendResult                       0
tableReturnType                   18
tableResultClass                  18
modeledTableResult               230
publicAPIUrl                     250
pwsTableResult                     0
measureStratificationLevel         5


In [66]:
df = pd.DataFrame(data['modeledTableResult'])

In [67]:
df

Unnamed: 0,calculationType,dataValue,displayValue,geo,geoAbbreviation,geoId,geographicTypeId,groupById,id,modeledFlag,noDataBreakGroup,noDataId,parentGeo,parentGeoAbbreviation,parentGeoId,rollover,stabilityFlag,suppressionFlag,title,year
0,Average,9.1,9.1,Adair,29001,29001,2,1,252343,1,30,-1,Missouri,MO,29,[Average: 9.1 (Modeled)],1,0,"Adair, MO",2013
1,Average,9.5,9.5,Adair,29001,29001,2,1,252344,1,30,-1,Missouri,MO,29,[Average: 9.5 (Modeled)],1,0,"Adair, MO",2014
2,Average,9.3,9.3,Andrew,29003,29003,2,1,252357,1,30,-1,Missouri,MO,29,[Average: 9.3 (Modeled)],1,0,"Andrew, MO",2013
3,Average,9.9,9.9,Andrew,29003,29003,2,1,252358,1,30,-1,Missouri,MO,29,[Average: 9.9 (Modeled)],1,0,"Andrew, MO",2014
4,Average,8.9,8.9,Atchison,29005,29005,2,1,252371,1,30,-1,Missouri,MO,29,[Average: 8.9 (Modeled)],1,0,"Atchison, MO",2013
5,Average,9.2,9.2,Atchison,29005,29005,2,1,252372,1,30,-1,Missouri,MO,29,[Average: 9.2 (Modeled)],1,0,"Atchison, MO",2014
6,Average,9.7,9.7,Audrain,29007,29007,2,1,252385,1,30,-1,Missouri,MO,29,[Average: 9.7 (Modeled)],1,0,"Audrain, MO",2013
7,Average,10.0,10.0,Audrain,29007,29007,2,1,252386,1,30,-1,Missouri,MO,29,[Average: 10.0 (Modeled)],1,0,"Audrain, MO",2014
8,Average,9.5,9.5,Barry,29009,29009,2,1,252399,1,30,-1,Missouri,MO,29,[Average: 9.5 (Modeled)],1,0,"Barry, MO",2013
9,Average,9.8,9.8,Barry,29009,29009,2,1,252400,1,30,-1,Missouri,MO,29,[Average: 9.8 (Modeled)],1,0,"Barry, MO",2014


In [64]:
df = pd.DataFrame.from_dict(data['modeledTableResult'])
df

Unnamed: 0,calculationType,dataValue,displayValue,geo,geoAbbreviation,geoId,geographicTypeId,groupById,id,modeledFlag,noDataBreakGroup,noDataId,parentGeo,parentGeoAbbreviation,parentGeoId,rollover,stabilityFlag,suppressionFlag,title,year
0,Average,9.1,9.1,Adair,29001,29001,2,1,252343,1,30,-1,Missouri,MO,29,[Average: 9.1 (Modeled)],1,0,"Adair, MO",2013
1,Average,9.5,9.5,Adair,29001,29001,2,1,252344,1,30,-1,Missouri,MO,29,[Average: 9.5 (Modeled)],1,0,"Adair, MO",2014
2,Average,9.3,9.3,Andrew,29003,29003,2,1,252357,1,30,-1,Missouri,MO,29,[Average: 9.3 (Modeled)],1,0,"Andrew, MO",2013
3,Average,9.9,9.9,Andrew,29003,29003,2,1,252358,1,30,-1,Missouri,MO,29,[Average: 9.9 (Modeled)],1,0,"Andrew, MO",2014
4,Average,8.9,8.9,Atchison,29005,29005,2,1,252371,1,30,-1,Missouri,MO,29,[Average: 8.9 (Modeled)],1,0,"Atchison, MO",2013
5,Average,9.2,9.2,Atchison,29005,29005,2,1,252372,1,30,-1,Missouri,MO,29,[Average: 9.2 (Modeled)],1,0,"Atchison, MO",2014
6,Average,9.7,9.7,Audrain,29007,29007,2,1,252385,1,30,-1,Missouri,MO,29,[Average: 9.7 (Modeled)],1,0,"Audrain, MO",2013
7,Average,10.0,10.0,Audrain,29007,29007,2,1,252386,1,30,-1,Missouri,MO,29,[Average: 10.0 (Modeled)],1,0,"Audrain, MO",2014
8,Average,9.5,9.5,Barry,29009,29009,2,1,252399,1,30,-1,Missouri,MO,29,[Average: 9.5 (Modeled)],1,0,"Barry, MO",2013
9,Average,9.8,9.8,Barry,29009,29009,2,1,252400,1,30,-1,Missouri,MO,29,[Average: 9.8 (Modeled)],1,0,"Barry, MO",2014


In [68]:
df.loc[df['geo']=='St. Louis'] 

Unnamed: 0,calculationType,dataValue,displayValue,geo,geoAbbreviation,geoId,geographicTypeId,groupById,id,modeledFlag,noDataBreakGroup,noDataId,parentGeo,parentGeoAbbreviation,parentGeoId,rollover,stabilityFlag,suppressionFlag,title,year
200,Average,12.0,12.0,St. Louis,29189,29189,2,1,253673,1,30,-1,Missouri,MO,29,[Average: 12.0 (Modeled)],1,0,"St. Louis, MO",2013
201,Average,12.1,12.1,St. Louis,29189,29189,2,1,253674,1,30,-1,Missouri,MO,29,[Average: 12.1 (Modeled)],1,0,"St. Louis, MO",2014


In [69]:
stl2014 = df.loc[(df['geo']=='St. Louis') & (df['year']=='2014')] 
print(json.dumps(stl2014.to_dict(),indent=4))

{
    "title": {
        "201": "St. Louis, MO"
    },
    "calculationType": {
        "201": "Average"
    },
    "geographicTypeId": {
        "201": 2
    },
    "id": {
        "201": "253674"
    },
    "year": {
        "201": "2014"
    },
    "parentGeoId": {
        "201": "29"
    },
    "displayValue": {
        "201": "12.1"
    },
    "parentGeo": {
        "201": "Missouri"
    },
    "rollover": {
        "201": [
            "Average: 12.1 (Modeled)"
        ]
    },
    "dataValue": {
        "201": "12.1"
    },
    "groupById": {
        "201": "1"
    },
    "suppressionFlag": {
        "201": "0"
    },
    "geoAbbreviation": {
        "201": "29189"
    },
    "geoId": {
        "201": "29189"
    },
    "geo": {
        "201": "St. Louis"
    },
    "noDataBreakGroup": {
        "201": 30
    },
    "modeledFlag": {
        "201": "1"
    },
    "parentGeoAbbreviation": {
        "201": "MO"
    },
    "noDataId": {
        "201": -1
    },
    "stabilityFlag": 

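The default `to_dict()` output is column-oriented: each column maps row labels to values, which is why every field is wrapped in a `"201"` object. pandas can emit one dictionary per row instead via `to_dict(orient='records')`. The plain-Python sketch below shows the same reshaping on a hypothetical three-column excerpt, so it runs without pandas:

```python
import json

# Hypothetical excerpt of the column-oriented dictionary printed above.
by_column = {
    "geo": {"201": "St. Louis"},
    "year": {"201": "2014"},
    "dataValue": {"201": "12.1"},
}

# Collect the row labels, then build one dictionary per row.
row_labels = sorted({label for column in by_column.values() for label in column})
records = [
    {name: column[label] for name, column in by_column.items()}
    for label in row_labels
]
print(json.dumps(records, indent=4))
```
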
Base64 Decode Photo
===



In [1]:
import json

with open('/data/patient-example-a.json') as patfile:
    pat = json.load(patfile)

In [10]:
imgtext = pat['photo'][0]['data']
imgtext

'R0lGODlhEwARAPcAAAAAAAAA/+9aAO+1AP/WAP/eAP/eCP/eEP/eGP/nAP/nCP/nEP/nIf/nKf/nUv/nWv/vAP/vCP/vEP/vGP/vIf/vKf/vMf/vOf/vWv/vY//va//vjP/3c//3lP/3nP//tf//vf/////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////

In [6]:
import base64
img = base64.b64decode(imgtext)

In [9]:
with open('photo.gif', 'wb') as imgfile:
    imgfile.write(img)

![title](photo.gif)
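
Base64 exists so that binary data such as an image can travel inside a text format like JSON. A minimal sketch of the round trip, using only the first eight characters of the photo string above (which decode to the GIF file signature); the dictionary layout mirrors the `photo` field but is otherwise a made-up example:

```python
import base64
import json

# The first eight Base64 characters of the photo data decode to six bytes.
header = base64.b64decode("R0lGODlh")
print(header)  # the GIF file signature

# Going the other way: bytes -> Base64 text -> JSON -> bytes again.
encoded = base64.b64encode(header).decode("ascii")
doc = json.dumps({"photo": [{"data": encoded}]})
restored = base64.b64decode(json.loads(doc)["photo"][0]["data"])
print(restored == header)
```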
