# JSON in Python

This notebook demonstrates reading JSON data from files into lists, dictionaries, and dataframes in Python.

Notice the Python version running on the cluster.

In [1]:
import sys
sys.version

'3.6.7 (default, Oct 22 2018, 11:32:17) \n[GCC 8.2.0]'

## Contents 
1. Setup
2. List example
3. Dictionary example
4. Mixed examples
5. More realistic examples

## 1. Setup

In [2]:
%pip install jsonlines

Collecting jsonlines
  Downloading https://files.pythonhosted.org/packages/4f/9a/ab96291470e305504aa4b7a2e0ec132e930da89eb3ca7a82fbe03167c131/jsonlines-1.2.0-py2.py3-none-any.whl
Installing collected packages: jsonlines
Successfully installed jsonlines-1.2.0


Load the `pandas` Python library.

In [3]:
import pandas as pd
pd.__version__

'0.24.2'

This notebook reads data from several of the JSON files in the `file-samples` repository, which is cloned and listed in the next two cells.

In [4]:
%%sh
git clone https://github.com/datalab-datasets/file-samples.git

Cloning into 'file-samples'...


In [5]:
%ls /content/file-samples/*.json

/content/file-samples/dict_of_lists.json
/content/file-samples/each_line.json
/content/file-samples/enron.json
/content/file-samples/list_of_dicts.json
/content/file-samples/one_dictionary.json
/content/file-samples/one_list.json
/content/file-samples/one_list_with_metadata.json
/content/file-samples/simple_dict.json
/content/file-samples/simple_list.json
/content/file-samples/stocks.json
/content/file-samples/world_bank.json
/content/file-samples/zips.json


The contents of individual files can be displayed with the `head` shell command. 

For example, the following command displays the first three lines of the `one_list.json` file. Notice the parameter `-n 3`.

The `%%sh` cell magic at the beginning of the code cell indicates that the cell contains shell commands.

In [6]:
%%sh 
head -n 3 /content/file-samples/one_list.json

[{ 	
	"id": "ae4d9000001", 
	"product_name": "sildenafil citrate", 


Files are often read into Python using the built-in `open` function. (An example appears in the next code cell.)
- The first argument to this function is the path of the file. 
- The second argument is the mode in which to open the file. The mode `'r'` indicates the file will be read. 
- The function returns a file object that can be used to read (in this case) from the indicated file.
- See the documentation https://docs.python.org/3.5/library/functions.html#open for details.

The `with` statement (used in the next code cell) works with a _context manager_, an object that sets something up when the indented block starts and cleans it up when the block ends. The file object returned by `open` is a context manager.

In the next code cell (and in many cells after it), the `with` statement opens the file and stores the _file object_, returned by the `open` function, in the `infile` variable. All indented commands that follow the `with` line can use this variable to read the contents of the file. When the indentation stops, the file is closed: the `infile` variable still exists, but it is no longer possible to read from the file through it.

The basic reason to use `with` and `open` together like this is that the file is closed automatically after the indented commands are run, even if one of them raises an error.
There are other reasons to use them, but they matter less (to know) if you __always__ use this pattern to open files for reading or writing.
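
The automatic closing described above can be checked directly. The following is a minimal sketch: it writes a small JSON file to a temporary directory (the file name `example.json` and its contents are made up for illustration), reads it back inside a `with` block, and inspects the file object's `closed` attribute inside and after the block.

```python
import json
import os
import tempfile

# Create a throwaway JSON file; the name and contents are illustrative only.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "example.json")
with open(path, "w") as outfile:
    json.dump([10, 20, 30], outfile)

with open(path, "r") as infile:
    inside_closed = infile.closed   # False: the file is open inside the block
    values = json.load(infile)

after_closed = infile.closed        # True: the file was closed when the block ended

print(values, inside_closed, after_closed)
```

Note that `infile` is still a valid variable after the block; only the underlying file has been closed.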

The following code cell:
1. opens the `simple_list.json` file for reading
2. loads the contents of the file into a list (using the `load` function from the `json` module)
3. stores this list in `list_from_file` 
4. displays the list (as it is the last command of the cell)

In [7]:
import json
with open('/content/file-samples/simple_list.json', 'r') as infile:
  list_from_file = json.load(infile)

list_from_file

[10, 20, 30]

Notice the resulting list (above) corresponds exactly to the characters from the file (below).

In [8]:
%%sh 
head -n 3 /content/file-samples/simple_list.json

[10, 20, 30]


The remaining sections of this notebook demonstrate the reading of JSON text (from files) into Python lists, dictionaries, and tables.

## 2. List example

The `json` library provides Python functions to read and write JSON text. 

The `loads` function:
- expects its argument to be a string of JSON data and 
- returns the Python object (usually a list or dictionary) that corresponds to the JSON data

In [9]:
import json
my_json = json.loads('["hello","goodbye",777,null]')
my_json

['hello', 'goodbye', 777, None]

The `null` JSON value is often used to indicate missing data. Notice that this value corresponds to the `None` value in Python.
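
The same kind of correspondence holds for the other JSON value types. The following sketch parses an inline JSON string (made up for illustration, not one of the sample files) and reports the Python type that `loads` produces for each value.

```python
import json

# An inline JSON object with one value of each basic JSON type (illustrative only).
text = '{"name": "hello", "count": 777, "ratio": 1.5, "ok": true, "missing": null, "items": [1, 2]}'
parsed = json.loads(text)

# Pair each key with the name of the Python type of its parsed value.
types = {key: type(value).__name__ for key, value in parsed.items()}
print(types)
```

JSON strings become `str`, numbers become `int` or `float`, `true`/`false` become `bool`, `null` becomes `None`, and arrays become `list`.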

The JSON text for the next example comes from the file `simple_list.json`, which is displayed below. 

This is the same example as above in the Setup section. The following command displays the contents of this file (which only contains one line).

In [10]:
%%sh 
head -n 3 /content/file-samples/simple_list.json

[10, 20, 30]


The following code reads the contents of the file `simple_list.json` into a Python list.

In [11]:
import json
with open("/content/file-samples/simple_list.json", 'r') as infile:
  list_from_file = json.load(infile)
  
list_from_file

[10, 20, 30]

The following code demonstrates that the `load` function does in fact return a list.

In [12]:
type(list_from_file), list_from_file[-1]

(list, 30)

## 2. Dictionary example

The JSON text for this example comes from the file `simple_dict.json`, which is displayed below.

In [13]:
%%sh 
cat /content/file-samples/simple_dict.json

{"a":1, "b":2, "c":3}


The following code reads the contents of the file `simple_dict.json` into a Python dictionary.

In [14]:
import json
with open("/content/file-samples/simple_dict.json", 'r') as infile:
  dict_from_file = json.load(infile)

dict_from_file

{'a': 1, 'b': 2, 'c': 3}

In [15]:
dict_from_file.get('b')

2
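
The `get` method used above returns the value stored under a key. As a small sketch (with the JSON text of `simple_dict.json` written inline, and a hypothetical missing key `'z'`), the following shows how `get` differs from square-bracket lookup when the key is absent.

```python
import json

# Same JSON text as simple_dict.json, written inline for this sketch.
d = json.loads('{"a":1, "b":2, "c":3}')

present = d.get('b')        # value stored under 'b'
absent = d.get('z')         # None, because 'z' is not a key
fallback = d.get('z', 0)    # the second argument is returned when the key is missing

try:
    d['z']                  # square-bracket lookup raises KeyError for a missing key
    raised = False
except KeyError:
    raised = True

print(present, absent, fallback, raised)
```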

## 3. Mixed examples

This section contains two examples, each in its own subsection:
1. Dictionary of lists
1. List of dictionaries

### 3.1 Dictionary of lists

In [16]:
%%sh 
cat /content/file-samples/dict_of_lists.json

{"a":[1,2,3], "b":[4,5,6], "c":[7,8,9]}


The following code reads the contents of the file `dict_of_lists.json` into a Python dictionary.

In [17]:
with open("/content/file-samples/dict_of_lists.json", 'r') as infile:
  dict_from_file = json.load(infile)

dict_from_file

{'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]}

In [18]:
type(dict_from_file)

dict

In [19]:
dict_from_file.get('b')

[4, 5, 6]

In [20]:
dict_from_file.get('b')[2]

6

### 3.2 List of dictionaries

In [21]:
%%sh 
cat /content/file-samples/list_of_dicts.json

[{"a":1, "b":2, "c":3}, {"d":4, "e":5, "f":6}]


The following code reads the contents of the file `list_of_dicts.json` into a Python list with items that are dictionaries.

In [22]:
with open("/content/file-samples/list_of_dicts.json", 'r') as infile:
  list_from_file = json.load(infile)

list_from_file

[{'a': 1, 'b': 2, 'c': 3}, {'d': 4, 'e': 5, 'f': 6}]

In [23]:
type(list_from_file), list_from_file[1]

(list, {'d': 4, 'e': 5, 'f': 6})

In [24]:
list_from_file[1].get('d')

4

## 4. More realistic examples

The three subsections below demonstrate three ways in which tables are stored in JSON formatted files. 
These three ways are distinguished by the format of JSON data stored in the file to be read. 
The JSON data is: 
1. A list of dictionaries
1. A dictionary that contains a list of dictionaries
1. Each line of the file is a dictionary

In each of the three cases above, the dictionaries correspond to records of the table stored in the file.
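
As a rough sketch of how the three layouts relate, the following code stores the same two records (made up for illustration) in each layout as an in-memory string and then recovers the list of records from each. The key name `"data"` mirrors the `one_dictionary.json` sample file; other files may use a different key.

```python
import json

# Two made-up records, used only for illustration.
records = [{"id": 1, "name": "x"}, {"id": 2, "name": "y"}]

# 1. A list of dictionaries
as_list = json.dumps(records)

# 2. A dictionary that contains a list of dictionaries (under the key "data")
as_dict = json.dumps({"metadata": {"note": "example"}, "data": records})

# 3. Each line is a dictionary
as_lines = "\n".join(json.dumps(record) for record in records)

# Recover the records from each layout.
from_list = json.loads(as_list)
from_dict = json.loads(as_dict)["data"]
from_lines = [json.loads(line) for line in as_lines.splitlines() if line.strip()]

print(from_list == from_dict == from_lines == records)
```

In the first two layouts the whole text is parsed with a single call, while in the third each line is parsed separately.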

### 4.1 Dataset from a single list of JSON records

In [25]:
%%sh 
head -n 30 /content/file-samples/one_list.json

[{ 	
	"id": "ae4d9000001", 
	"product_name": "sildenafil citrate", 
	"supplier": "Wisozk Inc", 
	"quantity": 261, 
	"unit_cost": "$10.47" 
},
{
	"id": "ae4d9000002", 
	"product_name": "Mountain Juniperus ashei", 
	"supplier": "Keebler-Hilpert", 
	"quantity": 292, 
	"unit_cost": "$8.74" 
},
{
	"id": "ae4d9000003", 
	"product_name": "Dextromathorphan HBr", 
	"supplier": "Schmitt-Weissnat", 
	"quantity": 211, 
	"unit_cost": "$20.53" 
}]


The following code reads the contents of the file `one_list.json` into a Python list of dictionaries.

In [26]:
import json
with open('/content/file-samples/one_list.json', 'r') as infile:
  dict_from_file = json.load(infile)
  
dict_from_file

[{'id': 'ae4d9000001',
  'product_name': 'sildenafil citrate',
  'quantity': 261,
  'supplier': 'Wisozk Inc',
  'unit_cost': '$10.47'},
 {'id': 'ae4d9000002',
  'product_name': 'Mountain Juniperus ashei',
  'quantity': 292,
  'supplier': 'Keebler-Hilpert',
  'unit_cost': '$8.74'},
 {'id': 'ae4d9000003',
  'product_name': 'Dextromathorphan HBr',
  'quantity': 211,
  'supplier': 'Schmitt-Weissnat',
  'unit_cost': '$20.53'}]

The pandas `DataFrame` constructor can take as input a list of dictionaries. 

The keys of the dictionaries are interpreted as column names.
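
To make the idea of keys becoming column names concrete, here is a plain-Python sketch that turns a list of dictionaries into one list per key. It only illustrates the idea and is not how pandas implements the constructor; the two records are abbreviated from the sample file.

```python
import json

# Two abbreviated records based on one_list.json.
rows = json.loads('[{"id": "ae4d9000001", "quantity": 261}, {"id": "ae4d9000002", "quantity": 292}]')

# Collect the column names in order of first appearance.
columns = []
for row in rows:
    for key in row:
        if key not in columns:
            columns.append(key)

# Build one list of values per column; a key missing from a record becomes None.
table = {column: [row.get(column) for row in rows] for column in columns}

print(columns)
print(table)
```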

In [27]:
import pandas as pd 
pd.DataFrame(data=dict_from_file)

| | id | product_name | quantity | supplier | unit_cost |
|---|---|---|---|---|---|
| 0 | ae4d9000001 | sildenafil citrate | 261 | Wisozk Inc | $10.47 |
| 1 | ae4d9000002 | Mountain Juniperus ashei | 292 | Keebler-Hilpert | $8.74 |
| 2 | ae4d9000003 | Dextromathorphan HBr | 211 | Schmitt-Weissnat | $20.53 |


### 4.2 Dataset from single dictionary

In [28]:
%%sh 
head -n 9 /content/file-samples/one_dictionary.json

{"metadata": {"author": "David Oury", "date": "18 Sep 2017"}, 
 "data": 
	[{ 	
		"id": "ae4d9000001", 
		"product_name": "sildenafil citrate", 
		"supplier": "Wisozk Inc", 
		"quantity": 261, 
		"unit_cost": "$10.47" 
	},


The following code reads the contents of the file `one_dictionary.json` into a Python dictionary.

Notice that the last command converts the list of dictionaries stored in `my_dict_from_file['data']` into a pandas dataframe.

In [29]:
import json
import pandas as pd
with open('/content/file-samples/one_dictionary.json', 'r') as infile:
  my_dict_from_file = json.load(infile)

# The records are stored under the 'data' key as a list of dictionaries.
pd.DataFrame(data=my_dict_from_file['data'])

| | id | product_name | quantity | supplier | unit_cost |
|---|---|---|---|---|---|
| 0 | ae4d9000001 | sildenafil citrate | 261 | Wisozk Inc | $10.47 |
| 1 | ae4d9000002 | Mountain Juniperus ashei | 292 | Keebler-Hilpert | $8.74 |
| 2 | ae4d9000003 | Dextromathorphan HBr | 211 | Schmitt-Weissnat | $20.53 |
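
The dictionary read from `one_dictionary.json` holds two entries: `metadata` and `data`. The sketch below uses an inline string that mimics the start of that file (only the first record is included, and the closing brackets are added so the text is valid JSON) to show how the two entries can be separated.

```python
import json

# Inline text modeled on the first lines of one_dictionary.json (first record only).
text = '''
{"metadata": {"author": "David Oury", "date": "18 Sep 2017"},
 "data": [
   {"id": "ae4d9000001", "product_name": "sildenafil citrate",
    "supplier": "Wisozk Inc", "quantity": 261, "unit_cost": "$10.47"}
 ]}
'''

doc = json.loads(text)
metadata = doc["metadata"]   # a dictionary describing the dataset
records = doc["data"]        # a list of dictionaries, one per record

print(sorted(doc.keys()))
print(metadata["author"])
print(type(records).__name__, len(records))
```

Only the `data` entry is passed to the `DataFrame` constructor; the `metadata` entry stays available as an ordinary dictionary.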


### 4.3 Dataset from one JSON record per line (JSONL format)

This is a very brief introduction to the _line delimited_ JSON format. For details see
- http://ndjson.org/
- http://jsonlines.org/
- [Wikipedia](https://en.wikipedia.org/wiki/JSON_streaming)

The acronyms JSONL, NDJSON, and LDJSON are equivalent terms. They refer to a format where each line of a file contains a single JSON value, here a dictionary that describes one record of data.
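
Because each line is a complete JSON value on its own, a file in this format can also be read with only the standard `json` module, one line at a time. The following is a minimal sketch that uses an in-memory `io.StringIO` object in place of a real file; the two lines are abbreviated versions of records from `each_line.json`.

```python
import io
import json

# Two abbreviated lines in the style of each_line.json, held in memory.
jsonl_text = (
    '{ "id": "ae4d9000001", "quantity": 261 }\n'
    '{ "id": "ae4d9000002", "quantity": 292 }\n'
)

records = []
with io.StringIO(jsonl_text) as infile:
    for line in infile:
        line = line.strip()
        if line:                          # skip blank lines
            records.append(json.loads(line))

print(records)
```

Calling `json.load` on the whole file would fail here, since the file as a whole is not a single JSON value.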

In [30]:
%%sh 
head -n 7 /content/file-samples/each_line.json

{ "id": "ae4d9000001", "product_name": "sildenafil citrate", "supplier": "Wisozk Inc", "quantity": 261, "unit_cost": "$10.47" }
{ "id": "ae4d9000002", "product_name": "Mountain Juniperus ashei", "supplier": "Keebler-Hilpert", "quantity": 292, "unit_cost": "$8.74" }
{ "id": "ae4d9000003", "product_name": "Dextromathorphan HBr", "supplier": "Schmitt-Weissnat", "quantity": 211, "unit_cost": "$20.53" }


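This next example introduces the `jsonlines` Python package.

The `open` function in this package opens a JSONL file and returns a reader object. Iterating over the reader yields one dictionary per line, so passing the reader to `list` produces a list of dictionaries.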

In [31]:
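import pandas as pd
import jsonlines
# jsonlines.open returns a reader; list() collects the dictionary from each line.
with jsonlines.open("/content/file-samples/each_line.json") as reader:
  list_of_dictionaries = list(reader)
list_of_dictionaries 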

[{'id': 'ae4d9000001',
  'product_name': 'sildenafil citrate',
  'quantity': 261,
  'supplier': 'Wisozk Inc',
  'unit_cost': '$10.47'},
 {'id': 'ae4d9000002',
  'product_name': 'Mountain Juniperus ashei',
  'quantity': 292,
  'supplier': 'Keebler-Hilpert',
  'unit_cost': '$8.74'},
 {'id': 'ae4d9000003',
  'product_name': 'Dextromathorphan HBr',
  'quantity': 211,
  'supplier': 'Schmitt-Weissnat',
  'unit_cost': '$20.53'}]

In [34]:
pdf_from_list_of_dictionaries = pd.DataFrame(data=list_of_dictionaries)
pdf_from_list_of_dictionaries

| | id | product_name | quantity | supplier | unit_cost |
|---|---|---|---|---|---|
| 0 | ae4d9000001 | sildenafil citrate | 261 | Wisozk Inc | $10.47 |
| 1 | ae4d9000002 | Mountain Juniperus ashei | 292 | Keebler-Hilpert | $8.74 |
| 2 | ae4d9000003 | Dextromathorphan HBr | 211 | Schmitt-Weissnat | $20.53 |


In [36]:
pdf_from_list_of_dictionaries.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 5 columns):
id              3 non-null object
product_name    3 non-null object
quantity        3 non-null int64
supplier        3 non-null object
unit_cost       3 non-null object
dtypes: int64(1), object(4)
memory usage: 200.0+ bytes


__The End__