# JSON files

## Import statements

In [2]:
import json
import pandas as pd
import pathlib
import pprint

In [3]:
json

<module 'json' from 'C:\\Users\\Bea\\anaconda3\\lib\\json\\__init__.py'>

## Setup

The files for this class are stored in the `data/json` directory:

In [5]:
directory = pathlib.Path("data") / "json"

## Preamble: JSON

### From Python to JSON

In [45]:
python_object = {
    "integer": 1,
    "float": 2.3,
    "string": "Madrid",
    "list": ["NYC", "SF", "DC"],
    "boolean": True,
    "none": None,
}

In [46]:
python_object

{'integer': 1,
 'float': 2.3,
 'string': 'Madrid',
 'list': ['NYC', 'SF', 'DC'],
 'boolean': True,
 'none': None}

In [47]:
type(python_object)

dict

In [48]:
json_object = json.dumps(python_object)

In [49]:
print(json_object)

{"integer": 1, "float": 2.3, "string": "Madrid", "list": ["NYC", "SF", "DC"], "boolean": true, "none": null}


In [50]:
type(json_object)

str

In [51]:
python_object["string"]

'Madrid'

In [52]:
# Raises a TypeError: json_object is a JSON-formatted string (str), not a dict,
# so it cannot be indexed by key until it is parsed back into a Python object:
json_object["string"]

TypeError: string indices must be integers

### From JSON to Python

In [None]:
python_object_bis = json.loads(json_object)

In [None]:
python_object_bis

In [None]:
python_object == python_object_bis
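
The comparison above holds for this object because every value has a direct JSON equivalent. A round trip is not lossless in general, though. As a minimal sketch (the `lossy` dictionary is a made-up example, not part of the class data), tuples come back as lists and non-string keys come back as strings:

```python
import json

# Hypothetical object chosen to show values that do not survive a round trip
lossy = {"point": (1, 2), 3: "three"}

# Serialize to a JSON string, then parse it back into Python
round_trip = json.loads(json.dumps(lossy))

print(round_trip)           # {'point': [1, 2], '3': 'three'}
print(round_trip == lossy)  # False
```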

### Displaying Python

In [None]:
python_object

In [None]:
pprint.pprint(python_object)

In [None]:
pprint.pprint(python_object, sort_dicts=False)

In [None]:
def print_python(python_object):
    pprint.pprint(python_object, sort_dicts=False)

In [None]:
print_python(python_object)

### Displaying JSON

In [None]:
print(json_object)

In [None]:
print(json.dumps(json.loads(json_object), indent=4))

In [None]:
def print_json(json_object):
    print(json.dumps(json.loads(json_object), indent=4))

In [None]:
print_json(json_object)

### From a dataframe to JSON

In [None]:
filename = "demo-0.csv"
file = directory / filename

In [None]:
df = pd.read_csv(file)

**Question:** How can this dataframe be represented in JSON?

In [None]:
df

...

...

...

...

...

...

...

...

...

...

In [None]:
df.to_json()

In [None]:
print_json(df.to_json())

In [None]:
print_json(df.to_json(orient="columns"))

In [None]:
print_json(df.to_json(orient="index"))

In [None]:
print_json(df.to_json(orient="records"))

In [None]:
print_json(df.to_json(orient="values"))

In [None]:
print_json(df.to_json(orient="split"))

In [None]:
print_json(df.to_json(orient="table"))
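
To compare the `orient` options side by side, here is a small sketch on a made-up two-row dataframe (`df_small` is an assumption standing in for `demo-0.csv`, whose contents are not shown here). Each JSON string is parsed back into Python so that the layouts are easy to inspect; `orient="table"` is left out only because its output also embeds a schema and is much longer.

```python
import json

import pandas as pd

# Hypothetical two-row dataframe standing in for demo-0.csv
df_small = pd.DataFrame({"city": ["Madrid", "Paris"], "rank": [1, 2]})

# Parse each JSON string back into Python to compare the layouts
layouts = {
    orient: json.loads(df_small.to_json(orient=orient))
    for orient in ["columns", "index", "records", "values", "split"]
}

for orient, layout in layouts.items():
    print(f"{orient:>8}: {layout}")
```

In this sketch, `records` yields a list with one object per row, while `columns` and `index` yield nested objects keyed by column name or by row label.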

### From JSON to a dataframe

**Question:** How can this JSON object be imported into a dataframe?

In [None]:
print_json(json_object)

...

...

...

...

...

...

...

...

...

...
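
One possible answer, sketched under the assumption that a single JSON object should become a single row: parse the string with `json.loads()` and hand the resulting dictionary to `pd.json_normalize()`. The JSON string is retyped here so that the block runs on its own; `df_one_row` is a name introduced for this illustration.

```python
import json

import pandas as pd

# Same JSON string as the one produced by json.dumps() in the preamble
json_object = (
    '{"integer": 1, "float": 2.3, "string": "Madrid", '
    '"list": ["NYC", "SF", "DC"], "boolean": true, "none": null}'
)

# Step 1: parse the JSON string into a Python dictionary
python_object = json.loads(json_object)

# Step 2: a single dictionary becomes a single row; the list stays in one cell
df_one_row = pd.json_normalize(python_object)

print(df_one_row.shape)  # (1, 6)
```

Other shapes are possible (for example, one row per element of `list`); which one is right depends on what a row is supposed to represent.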

## Demo 1: Import a simple JSON file

### Import the JSON file into a dataframe

In [None]:
filename = "demo-1.json"
file = directory / filename

Import the JSON file into a dataframe:

In [None]:
df = pd.read_json(file)

In [None]:
df

## Exercise 1

The dataset for this exercise is the `pu2018_schema.json` file taken from the [Survey of Income and Program Participation (SIPP)](https://www.census.gov/programs-surveys/sipp/about.html) of the US Census Bureau.

* https://www2.census.gov/programs-surveys/sipp/data/datasets/2018/sipp_python_input_example.py
* https://www2.census.gov/programs-surveys/sipp/data/datasets/2018/pu2018_schema.json
* https://www2.census.gov/programs-surveys/sipp/data/datasets/2018/pu2018_csv.zip

In [None]:
filename = "exercise-1.json"
file = directory / filename

Import the JSON file into a dataframe named `df`:

In [None]:
df = pd.read_json(file)

Check the first 5 rows of the dataframe:

In [None]:
df.head()

## Demo 2: Import a JSON file requiring normalization

### Import the JSON file into a dataframe

In [None]:
filename = "demo-2a.json"
file = directory / filename

In [None]:
! cat {file}

In [None]:
df = pd.read_json(file)

In [None]:
df

In [None]:
with open(file) as f:
    python_object = json.load(f)

In [None]:
print_python(python_object)

In [None]:
type(python_object)

In [None]:
df = pd.json_normalize(python_object)

In [None]:
df

In [None]:
filename = "demo-2b.json"
file = directory / filename

In [None]:
! cat {file}

In [None]:
df = pd.read_json(file)

In [None]:
df

In [None]:
with open(file) as f:
    python_object = json.load(f)

In [None]:
print_python(python_object)

In [None]:
type(python_object)

In [None]:
df = pd.json_normalize(python_object)

In [None]:
df

In [None]:
filename = "demo-2c.json"
file = directory / filename

In [None]:
! cat {file}

In [None]:
df = pd.read_json(file)

In [None]:
df

In [None]:
with open(file) as f:
    python_object = json.load(f)

In [None]:
print_python(python_object)

In [None]:
type(python_object)

In [None]:
df = pd.json_normalize(python_object)

In [None]:
df

In [None]:
df_teachers = pd.json_normalize(python_object, record_path="Teachers")

In [None]:
df_teachers

In [None]:
df_teachers = pd.json_normalize(python_object, record_path="Teachers", meta=["Name"])

In [None]:
df_teachers
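
The effect of `record_path` and `meta` is easier to see on data that is fully visible. The sketch below uses a made-up list of schools (the `schools` object and its `Teacher` and `Subject` fields are assumptions; only the `Name` and `Teachers` keys mirror the demo file):

```python
import pandas as pd

# Hypothetical data: a list of schools, each with a nested list of teachers
schools = [
    {
        "Name": "School A",
        "Teachers": [
            {"Teacher": "Ana", "Subject": "Math"},
            {"Teacher": "Luis", "Subject": "Art"},
        ],
    },
    {"Name": "School B", "Teachers": [{"Teacher": "Eva", "Subject": "Physics"}]},
]

# record_path: one output row per element of each "Teachers" list
# meta: parent-level fields copied onto every row that came from that parent
df_teachers = pd.json_normalize(schools, record_path="Teachers", meta=["Name"])

print(df_teachers)
```

Each teacher becomes one row, and the school name is repeated on every row that belongs to that school. If a `meta` field had the same name as a field inside the records, pandas would raise an error asking for a distinguishing prefix (`meta_prefix` or `record_prefix`).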

**Typical steps to read a JSON file:**
* Try to import the JSON file without any normalization
```python
pd.read_json(file)
```
* Import the JSON file into a Python object
```python
with open(file) as f:
    python_object = json.load(f)
```
* Explore the Python object (type, keys, length) to identify the list that holds the data rows
* Pass that list to `pd.json_normalize()` to create a dataframe, adding `record_path` and `meta` when the rows are nested
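
The exploration step can be partly automated. The helper below is a hypothetical sketch (`find_record_lists` is not a pandas or standard-library function): it walks a nested Python object and reports the path to every list of dictionaries, which are the usual candidates for data rows. The `example` object is invented for the illustration.

```python
def find_record_lists(obj, path=()):
    """Yield (path, length) for every list of dictionaries nested in obj."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            yield from find_record_lists(value, path + (key,))
    elif isinstance(obj, list) and obj:
        if all(isinstance(item, dict) for item in obj):
            yield path, len(obj)
        # Only the first element is inspected, to keep the walk cheap
        yield from find_record_lists(obj[0], path + (0,))


# Hypothetical nested object with one list of records
example = {
    "ttl": 5,
    "data": {"stations": [{"station_id": "72"}, {"station_id": "79"}]},
}

candidates = list(find_record_lists(example))
print(candidates)  # [(('data', 'stations'), 2)]
```

The reported path tells which keys to chain, here `example["data"]["stations"]`, before calling `pd.json_normalize()`.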

In [None]:
python_object = {
    "metadata": {"Creator": "JC", "Date": "2 June 2021"},
    "license": "Creative Commons",
    "data": [
        {"Country": "Spain", "Capital": "Madrid"},
        {"Country": "France", "Capital": "Paris"},
        {"Country": "Germany", "Capital": "Berlin"},
    ],
}

In [None]:
type(python_object)

In [None]:
python_object.keys()

In [None]:
type(python_object["data"])

In [None]:
len(python_object["data"])

In [None]:
python_object

In [None]:
python_object["data"]

In [None]:
python_object["data"][0]
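
Once the list of rows has been located under the `"data"` key, the last step can be sketched in two ways. The object below is a trimmed copy of the one above (only the `Creator` entry of `metadata` is kept), and `df_rows` and `df_with_meta` are names introduced for this illustration.

```python
import pandas as pd

# Trimmed copy of the object explored above
python_object = {
    "metadata": {"Creator": "JC"},
    "license": "Creative Commons",
    "data": [
        {"Country": "Spain", "Capital": "Madrid"},
        {"Country": "France", "Capital": "Paris"},
        {"Country": "Germany", "Capital": "Berlin"},
    ],
}

# Option 1: pass the list of rows directly
df_rows = pd.json_normalize(python_object["data"])

# Option 2: start from the top level and keep some of the surrounding fields;
# a nested field is addressed with a list of keys
df_with_meta = pd.json_normalize(
    python_object,
    record_path="data",
    meta=["license", ["metadata", "Creator"]],
)

print(df_with_meta.columns.tolist())
# ['Country', 'Capital', 'license', 'metadata.Creator']
```

Option 1 is enough when only the rows matter; Option 2 repeats the chosen top-level fields on every row.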

## Exercise 2

The dataset for this exercise is a subset of the `complete.json` file (limited to the year 2020) of the [New York Philharmonic list of concerts](https://github.com/nyphilarchive/PerformanceHistory).

In [53]:
filename = "exercise-2.json"
file = directory / filename

Try to import the JSON file into a dataframe named `df` using the simplest approach:

In [54]:
df = pd.read_json(file)

Check the first 5 rows of the dataframe:

In [55]:
df.head()

Unnamed: 0,programs
0,{'id': '5ebddf26-57c1-4bda-ad6e-ed4be67d1033-0...
1,{'id': '14d63c8b-4cbb-4910-a0dc-b5d40cd43d70-0...
2,{'id': 'cfb75df5-e9b7-4494-8538-8312e2520a36-0...
3,{'id': '7b3da945-828b-4f45-9bb5-3d0dc5f83b4b-0...
4,{'id': '5080686d-665c-40ca-9504-27ac66929e52-0...


Turn the JSON file into a Python object:

In [56]:
with open(file) as f:
    python_object = json.load(f)

Search for and **identify a list of programs** inside this Python object.  
Check the type of this Python object:

In [57]:
type(python_object)

dict

If it is a dictionary, check its keys; if it is a list, check its length:

In [59]:
python_object.keys()

dict_keys(['programs'])

Check the type of an element inside (either using a key identified above if it is a dictionary, or using an index if it is a list):

In [61]:
type(python_object["programs"])

list

Once a list has been found, check the first element of the list:

In [62]:
python_object["programs"][0]

{'id': '5ebddf26-57c1-4bda-ad6e-ed4be67d1033-0.1',
 'programID': '14353',
 'orchestra': 'New York Philharmonic',
 'season': '2019-20',
 'concerts': [{'eventType': 'Non-Subscription',
   'Location': 'Manhattan, NY',
   'Venue': 'David Geffen Hall',
   'Date': '2019-09-11T04:00:00Z',
   'Time': '7:30PM'},
  {'eventType': 'Non-Subscription',
   'Location': 'Manhattan, NY',
   'Venue': 'David Geffen Hall',
   'Date': '2019-09-12T04:00:00Z',
   'Time': '7:30PM'}],
 'works': [{'ID': '12860*',
   'composerName': 'Williams,  John',
   'workTitle': 'CLOSE ENOUNTERS OF THE THIRD KIND - LIVE TO FILM',
   'conductorName': 'Kaufman, Richard',
   'soloists': [{'soloistName': 'Musica Sacra',
     'soloistInstrument': 'Chorus',
     'soloistRoles': 'S'}]},
  {'ID': '0*', 'interval': 'Intermission', 'soloists': []}]}

Once a list has been found, import it into a dataframe named `df`:

In [63]:
df = pd.json_normalize(python_object["programs"])

Check the first 5 rows of the dataframe:

In [64]:
df.head()

Unnamed: 0,id,programID,orchestra,season,concerts,works
0,5ebddf26-57c1-4bda-ad6e-ed4be67d1033-0.1,14353,New York Philharmonic,2019-20,"[{'eventType': 'Non-Subscription', 'Location':...","[{'ID': '12860*', 'composerName': 'Williams, ..."
1,14d63c8b-4cbb-4910-a0dc-b5d40cd43d70-0.1,14354,New York Philharmonic,2019-20,"[{'eventType': 'Non-Subscription', 'Location':...","[{'ID': '12888*', 'composerName': 'Herrmann, ..."
2,cfb75df5-e9b7-4494-8538-8312e2520a36-0.1,14318,New York Philharmonic,2019-20,"[{'eventType': 'Subscription Season', 'Locatio...","[{'ID': '3965*', 'composerName': 'Anthem,', 'w..."
3,7b3da945-828b-4f45-9bb5-3d0dc5f83b4b-0.1,14539,New York Philharmonic,2019-20,"[{'eventType': 'Subscription Season', 'Locatio...","[{'ID': '12856*', 'composerName': 'Glass, Phi..."
4,5080686d-665c-40ca-9504-27ac66929e52-0.1,14319,New York Philharmonic,2019-20,"[{'eventType': 'Subscription Season', 'Locatio...","[{'ID': '12995*5', 'composerName': 'Schoenberg..."


The `df` dataframe looks good, but it contains several columns whose elements are lists (e.g. the `concerts` column).  
Import the different elements from the `works` column into another dataframe named `df_works`:

In [65]:
df_works = pd.json_normalize(python_object["programs"], record_path="works")

Check the **last 5 rows** of the dataframe:

In [66]:
df_works.tail()

Unnamed: 0,ID,composerName,workTitle,conductorName,soloists,interval,movement,movement._,movement.em
359,0*,,,,[],Intermission,,,
360,1490*,"Ravel, Maurice",SHEHERAZADE,"Langree [Langrée], Louis","[{'soloistName': 'Leonard, Isabel', 'soloistIn...",,,,
361,53102*,"Scriabin, Alexander","SYMPHONY NO. 4, ""LE POEME D'EXTASE,"" OP. 54","Langree [Langrée], Louis",[],,,,
362,12880*,"LeFrak, Karen",SLEEPOVER AT THE MUSEUM,"Bahl, Ankush Kumar","[{'soloistName': 'Bernstein, Jamie', 'soloistI...",,,,
363,51779*,"Saint-Saens [Saint-Saëns], Camille",CARNIVAL OF THE ANIMALS,"Bahl, Ankush Kumar","[{'soloistName': 'Huebner, Eric', 'soloistInst...",,,,


The `df_works` dataframe looks good, but it would be nicer to keep some information from the original `df` dataframe.  
Import the different elements from the `works` column again, making sure to preserve the `programID` and `orchestra` columns from the original data:

In [67]:
df_works = pd.json_normalize(python_object["programs"], record_path="works", meta=["programID","orchestra"])

Check the **last 5 rows** of the dataframe:

In [68]:
df_works.tail()

Unnamed: 0,ID,composerName,workTitle,conductorName,soloists,interval,movement,movement._,movement.em,programID,orchestra
359,0*,,,,[],Intermission,,,,14380,New York Philharmonic
360,1490*,"Ravel, Maurice",SHEHERAZADE,"Langree [Langrée], Louis","[{'soloistName': 'Leonard, Isabel', 'soloistIn...",,,,,14380,New York Philharmonic
361,53102*,"Scriabin, Alexander","SYMPHONY NO. 4, ""LE POEME D'EXTASE,"" OP. 54","Langree [Langrée], Louis",[],,,,,14380,New York Philharmonic
362,12880*,"LeFrak, Karen",SLEEPOVER AT THE MUSEUM,"Bahl, Ankush Kumar","[{'soloistName': 'Bernstein, Jamie', 'soloistI...",,,,,14379,New York Philharmonic
363,51779*,"Saint-Saens [Saint-Saëns], Camille",CARNIVAL OF THE ANIMALS,"Bahl, Ankush Kumar","[{'soloistName': 'Huebner, Eric', 'soloistInst...",,,,,14379,New York Philharmonic


<div class="alert alert-info">

<b>Note:</b> Which conductor directed Maurice Ravel's Sheherazade at the New York Philharmonic?

</div>
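
A question like this can be answered by filtering `df_works`. The sketch below rebuilds three rows from the `df_works.tail()` output above into a small dataframe (`df_sample`, a name introduced here) so that it runs on its own. Substring matching is used instead of exact equality because the raw names can contain irregular spacing (for example `'Williams,  John'` in the first program).

```python
import pandas as pd

# Three rows copied from the df_works.tail() output (subset of columns)
df_sample = pd.DataFrame(
    {
        "composerName": [
            "Ravel, Maurice",
            "LeFrak, Karen",
            "Saint-Saens [Saint-Saëns], Camille",
        ],
        "workTitle": [
            "SHEHERAZADE",
            "SLEEPOVER AT THE MUSEUM",
            "CARNIVAL OF THE ANIMALS",
        ],
        "conductorName": [
            "Langree [Langrée], Louis",
            "Bahl, Ankush Kumar",
            "Bahl, Ankush Kumar",
        ],
    }
)

# na=False treats missing values (e.g. intermission rows) as non-matches
is_ravel = df_sample["composerName"].str.contains("Ravel", na=False)
is_sheherazade = df_sample["workTitle"].str.contains("SHEHERAZADE", na=False)

conductors = df_sample.loc[is_ravel & is_sheherazade, "conductorName"].unique()
print(conductors)
```

Running the same filter on the full `df_works` dataframe should work the same way, assuming the column names shown in the output above.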

## Exercise 3

The dataset for this exercise is the `station_information.json` file of the [NYC CitiBike system](https://www.citibikenyc.com/), available at https://gbfs.citibikenyc.com/gbfs/en/station_information.json.

In [6]:
filename = "exercise-3.json"
file = directory / filename

Try to import the JSON file into a dataframe named `df` using the simplest approach:

In [7]:
df = pd.read_json(file)

Check the first 5 rows of the dataframe:

In [8]:
df.head()

Unnamed: 0,data,last_updated,ttl
stations,"[{'lon': -73.99392888, 'capacity': 55, 'statio...",1615296230,5


Turn the JSON file into a Python object:

In [9]:
with open(file) as f:
    python_object = json.load(f)

Search for and **identify a list of stations** inside this Python object.  
Check the type of this Python object:

In [10]:
type(python_object)

dict

If it is a dictionary, check its keys; if it is a list, check its length:

In [12]:
python_object.keys()

dict_keys(['data', 'last_updated', 'ttl'])

Check the type of an element inside (either using a key identified above if it is a dictionary, or using an index if it is a list):

In [14]:
type(python_object["data"])

dict

Repeat this process until you identify a **list of stations**, then import that list into a dataframe named `df_stations`:

In [28]:
df_stations = pd.json_normalize(python_object["data"]["stations"])

In [29]:
df_stations.head()

Unnamed: 0,lon,capacity,station_id,eightd_has_key_dispenser,has_kiosk,lat,legacy_id,rental_methods,region_id,short_name,station_type,external_id,name,electric_bike_surcharge_waiver,eightd_station_services,rental_uris.ios,rental_uris.android
0,-73.993929,55,72,False,True,40.767272,72,"[KEY, CREDITCARD]",71,6926.01,classic,66db237e-0aca-11e7-82f6-3863bb44ef7c,W 52 St & 11 Ave,False,[],https://bkn.lft.to/lastmile_qr_scan,https://bkn.lft.to/lastmile_qr_scan
1,-74.006667,33,79,False,True,40.719116,79,"[KEY, CREDITCARD]",71,5430.08,classic,66db269c-0aca-11e7-82f6-3863bb44ef7c,Franklin St & W Broadway,False,[],https://bkn.lft.to/lastmile_qr_scan,https://bkn.lft.to/lastmile_qr_scan
2,-74.000165,27,82,False,True,40.711174,82,"[KEY, CREDITCARD]",71,5167.06,classic,66db277a-0aca-11e7-82f6-3863bb44ef7c,St James Pl & Pearl St,False,[],https://bkn.lft.to/lastmile_qr_scan,https://bkn.lft.to/lastmile_qr_scan
3,-73.976323,62,83,False,True,40.683826,83,"[KEY, CREDITCARD]",71,4354.07,classic,66db281e-0aca-11e7-82f6-3863bb44ef7c,Atlantic Ave & Fort Greene Pl,False,[],https://bkn.lft.to/lastmile_qr_scan,https://bkn.lft.to/lastmile_qr_scan
4,-74.001497,50,116,False,True,40.741776,116,"[KEY, CREDITCARD]",71,6148.02,classic,66db28b5-0aca-11e7-82f6-3863bb44ef7c,W 17 St & 8 Ave,False,[],https://bkn.lft.to/lastmile_qr_scan,https://bkn.lft.to/lastmile_qr_scan


Once a list has been found, check the first element of the list:

In [31]:
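# Index the list itself; df_stations[0] would look for a column labeled 0 and raise a KeyError
python_object["data"]["stations"][0]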

The list of stations was already imported into the `df_stations` dataframe above. Check the first 5 rows of that dataframe again:

In [20]:
df_stations.head()

Unnamed: 0,lon,capacity,station_id,eightd_has_key_dispenser,has_kiosk,lat,legacy_id,rental_methods,region_id,short_name,station_type,external_id,name,electric_bike_surcharge_waiver,eightd_station_services,rental_uris.ios,rental_uris.android
0,-73.993929,55,72,False,True,40.767272,72,"[KEY, CREDITCARD]",71,6926.01,classic,66db237e-0aca-11e7-82f6-3863bb44ef7c,W 52 St & 11 Ave,False,[],https://bkn.lft.to/lastmile_qr_scan,https://bkn.lft.to/lastmile_qr_scan
1,-74.006667,33,79,False,True,40.719116,79,"[KEY, CREDITCARD]",71,5430.08,classic,66db269c-0aca-11e7-82f6-3863bb44ef7c,Franklin St & W Broadway,False,[],https://bkn.lft.to/lastmile_qr_scan,https://bkn.lft.to/lastmile_qr_scan
2,-74.000165,27,82,False,True,40.711174,82,"[KEY, CREDITCARD]",71,5167.06,classic,66db277a-0aca-11e7-82f6-3863bb44ef7c,St James Pl & Pearl St,False,[],https://bkn.lft.to/lastmile_qr_scan,https://bkn.lft.to/lastmile_qr_scan
3,-73.976323,62,83,False,True,40.683826,83,"[KEY, CREDITCARD]",71,4354.07,classic,66db281e-0aca-11e7-82f6-3863bb44ef7c,Atlantic Ave & Fort Greene Pl,False,[],https://bkn.lft.to/lastmile_qr_scan,https://bkn.lft.to/lastmile_qr_scan
4,-74.001497,50,116,False,True,40.741776,116,"[KEY, CREDITCARD]",71,6148.02,classic,66db28b5-0aca-11e7-82f6-3863bb44ef7c,W 17 St & 8 Ave,False,[],https://bkn.lft.to/lastmile_qr_scan,https://bkn.lft.to/lastmile_qr_scan


The `df_stations` dataframe looks good, but it contains a column (`rental_methods`) whose elements are lists.  
Import the different elements from the `rental_methods` column into another dataframe named `df_rental`:

In [21]:
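# json_normalize expects the original list of dictionaries, not a dataframe column;
# record_path tells it which nested list to expand into rows
df_rental = pd.json_normalize(python_object["data"]["stations"], record_path="rental_methods")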

Check the first 5 rows of the dataframe:

The `df_rental` dataframe looks good, but it would be nicer to keep some information from the original `df_stations` dataframe.  
Import the different elements from the `rental_methods` column again, making sure to preserve the `station_id` column from the original data:

Check the first 5 rows of the dataframe:

Find out how many stations accept credit cards as payment method, and how many accept a key:
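
A possible approach to the last two steps, sketched on made-up data so that it runs without the exercise file: the `stations` list below is hypothetical (its third entry is invented so that the two counts differ), and `rental_method` is a column name chosen here. With the real data, `python_object["data"]["stations"]` would take the place of `stations`.

```python
import pandas as pd

# Hypothetical stations; the third one is invented to make the counts differ
stations = [
    {"station_id": "72", "rental_methods": ["KEY", "CREDITCARD"]},
    {"station_id": "79", "rental_methods": ["KEY", "CREDITCARD"]},
    {"station_id": "demo", "rental_methods": ["KEY"]},
]

# One row per (station, rental method) pair; station_id is carried along as meta
df_rental = pd.json_normalize(
    stations, record_path="rental_methods", meta=["station_id"]
)

# The records are plain strings, so their column has no descriptive name yet
df_rental.columns = ["rental_method", "station_id"]

# Number of distinct stations offering each method
stations_per_method = df_rental.groupby("rental_method")["station_id"].nunique()
print(stations_per_method)
```

Counting distinct `station_id` values rather than rows guards against a station listing the same method twice.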