# Tools for JSON: jq

`jq` is a command-line tool that lets you write small programs to work with JSON data very efficiently. ("Efficiently" here is meant in the computational sense: you can work with large quantities of data without worrying too much about running out of memory or other system resources.)

In the specialized jargon used in `jq`'s documentation, you create "filters" to work with "streams" of JSON data.

Some common tasks that can be accomplished with jq include:

* reformatting JSON to make it easier (for humans) to read
* profiling data: figuring out what keys a set of JSON records contains
* plucking out specific values (if you know the key you're looking for)
* re-shaping JSON data for different purposes
* and more … (see the [manual](http://stedolan.github.io/jq/manual/))

This notebook introduces the basic operations you can perform with `jq`.

We'll use a sample data set [provided by the Tate Gallery](https://github.com/tategallery/collection) to demonstrate. 

### Preliminaries

**Don't worry too much about understanding what's going on in this next section.** It's just about moving data around so we have a reasonable but not overwhelming amount of JSON to work with.

In [None]:
import tarfile
import re
import os
from itertools import count

# You have a copy of this file in your `data` directory. Tate provides the data in a single TAR (tape archive) file
DATA_PATH = '../data/tate-collection-1.2.tar.gz'
DATA_FOBJ = tarfile.open(DATA_PATH)

# We can use Python's tools for working with tar files to inspect the data package
# For instance by listing the files it contains without unpacking it
FILES = DATA_FOBJ.getmembers()
len(FILES)

There's a lot of data here! Let's look at how it's organized (the first 1000 files should give the basic flavor) …

In [None]:
for f in FILES[:1000]:
    print(f)

For the moment, we're interested in the JSON files under `collection-1.2/artists`. To start with a manageable subset, let's take just the artists whose names begin with the letter "a".

In [None]:
# We're only going to unpack the part of the tar archive with a-artist JSON files
pattern = re.compile(r"artists/a/.*\.json")

def is_a_artist(member):
    """Return True if this tar member is a JSON file under artists/a/."""
    return pattern.search(member.name) is not None

# Select members by name rather than by position in the archive, so that
# every match (including the last one) is extracted
a_members = [member for member in FILES if is_a_artist(member)]
DATA_FOBJ.extractall(path='../data/tate-collection', members=a_members)
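
If you'd like to see the selective-extraction idea in isolation, here is a minimal, self-contained sketch. It builds a tiny archive in memory, so the member names and the temporary directory are invented for illustration and are not part of the Tate data.

```python
import io
import os
import re
import tarfile
import tempfile

# Hypothetical member names, standing in for the real archive's contents.
names = [
    "collection/artists/a/abbey-edwin-1.json",
    "collection/artists/a/abbott-lemuel-2.json",
    "collection/artists/b/bacon-francis-3.json",
]

# Write a small gzipped tar archive into an in-memory buffer.
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
    for name in names:
        payload = b"{}"
        info = tarfile.TarInfo(name=name)
        info.size = len(payload)
        archive.addfile(info, io.BytesIO(payload))
buffer.seek(0)

pattern = re.compile(r"artists/a/.*\.json")

with tarfile.open(fileobj=buffer, mode="r:gz") as archive:
    # Pick members by name, so nothing depends on their order in the archive.
    wanted = [m for m in archive.getmembers() if pattern.search(m.name)]
    target = tempfile.mkdtemp()
    archive.extractall(path=target, members=wanted)

extracted = sorted(os.listdir(os.path.join(target, "collection", "artists", "a")))
print(extracted)
```

Only the two "a" files end up on disk; the "b" directory is never created.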

Now, we have a directory of JSON files …

In [None]:
directory_list = os.listdir('../data/tate-collection/collection-1.2/artists/a')
for f in directory_list:
    print(f)

In [None]:
len(directory_list)

Much more manageable. Now let's look at `jq` …

### What `jq` can do
NB: From here on, the code is shell script rather than Python …

#### Human readability
`jq` can format (or "pretty print") JSON so that it's easier (for humans) to read. We can compare by printing out the contents of one file, first in compact format, then pretty-printed:
```
{"activePlaceCount":0,"birth":{"place":{"name":"Polska","placeName":"Polska","placeType":"nation"},"time":{"startYear":1930}},"birthYear":1930,"date":"born 1930","fc":"Magdalena Abakanowicz","gender":"Female","id":10093,"mda":"Abakanowicz, Magdalena","movements":[],"startLetter":"A","totalWorks":4,"url":"http://www.tate.org.uk/art/artists/magdalena-abakanowicz-10093"}
```

In [None]:
!cat ../data/tate-collection/collection-1.2/artists/a/abakanowicz-magdalena-10093.json | jq .

Note that this version breaks the content across multiple lines, making it easier to see where objects (denoted by pairs of curly brackets) begin and end. Also, keys and values are printed in different colors, and different types of values (strings vs. numeric literals) are printed in different colors to help distinguish them.\*

\* YMMV somewhat depending on the color settings of your terminal program

Remember that everything in `jq` is a filter, so, as the documentation explains, `jq .` is just the simplest possible filter you could write: "This is a filter that takes its input and produces it unchanged as output."
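
For comparison, here is roughly what the identity filter's pretty-printing looks like in Python. This is only a sketch: the record is a shortened copy of the sample above, and the two-space indent is an assumption chosen to resemble `jq`'s usual output.

```python
import json

# A trimmed copy of the compact sample record shown earlier.
compact = (
    '{"birth":{"place":{"name":"Polska"}},"birthYear":1930,'
    '"fc":"Magdalena Abakanowicz","movements":[]}'
)

record = json.loads(compact)

# Re-serialize the same data with line breaks and indentation.
pretty = json.dumps(record, indent=2, ensure_ascii=False)
print(pretty)
```

The data is unchanged; only the whitespace differs, which is exactly what "produces it unchanged as output" means for the `.` filter.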

#### Profiling Data

Remember from our [introduction to JSON](1-json-intro.ipynb) that you can think of the format as a combination of two types of structures. In Python these were called "lists" and "dictionaries." In JavaScript, these same structures are called "arrays" and "objects", respectively.

`jq` gives us tools for getting data out of these two kinds of structures. First, objects.

The top-level structure in each of our files is an object. Note the enclosing `{}`. `jq` gives us a built-in function for getting a list ("array") of all the keys in an object:

In [None]:
!cat ../data/tate-collection/collection-1.2/artists/a/abakanowicz-magdalena-10093.json | jq 'keys'
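
In Python terms, the `keys` filter is close to sorting a dictionary's keys (`jq` returns them in sorted order, too). A small sketch, using a shortened copy of the sample record:

```python
import json

record = json.loads(
    '{"id":10093,"fc":"Magdalena Abakanowicz","gender":"Female","birthYear":1930}'
)

# Iterating over a dict yields its keys; sorted() puts them in order.
keys = sorted(record)
print(keys)
```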

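Now, if we want to see the value associated with a particular key, we can ask for it by name. The `{gender}` form returns a new object containing just that key and its value (to get the bare value, use `.gender` instead):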

In [None]:
!cat ../data/tate-collection/collection-1.2/artists/a/abakanowicz-magdalena-10093.json | jq '{gender}'

We can get a subset of keys at the same time:

In [None]:
!cat ../data/tate-collection/collection-1.2/artists/a/abakanowicz-magdalena-10093.json | jq '{fc, date, totalWorks}'
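
A rough Python analogue of this object-construction shorthand is a dictionary comprehension over the keys you want. The helper name `pick` is made up for this sketch; the record is a shortened copy of the sample.

```python
record = {
    "fc": "Magdalena Abakanowicz",
    "date": "born 1930",
    "gender": "Female",
    "totalWorks": 4,
    "id": 10093,
}

def pick(obj, *names):
    """Build a new dict with only the named keys (None if a key is absent)."""
    return {name: obj.get(name) for name in names}

one_key = pick(record, "gender")
subset = pick(record, "fc", "date", "totalWorks")
bare_value = record["gender"]

print(one_key)
print(subset)
print(bare_value)
```

Note the difference between `one_key` (still a dictionary) and `bare_value` (just the string).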

#### Working with multiple objects

This is where `jq`'s power starts to shine. For example:

In [None]:
# Concatenate all 138 files together
!cat ../data/tate-collection/collection-1.2/artists/a/*.json | jq -s '[.[]]' > ../data/tate-collection/all_a_artists.json

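The `-s` flag stands for "slurp". It tells `jq` to read every JSON value in the input (here, one object per file) into a single array before running the filter. Inside the filter, `.[]` emits each element of that array and the enclosing square brackets collect the output back into one array, so the result is the same slurped array (the filter `.` would work equally well here). Finally, `>` writes the result to a new file.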
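
To make the idea of slurping concrete, here is a hedged Python sketch that reads a stream of concatenated JSON values into one list. The function name `slurp` and the sample stream are invented for illustration; this is not how `jq` is implemented.

```python
import json

def slurp(text):
    """Parse every JSON value in a concatenated stream into a single list."""
    decoder = json.JSONDecoder()
    values = []
    pos = 0
    while True:
        # Skip any whitespace between documents.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        # raw_decode returns the parsed value and the index where it ended.
        value, pos = decoder.raw_decode(text, pos)
        values.append(value)
    return values

# What `cat *.json` might produce: several objects, one after another.
stream = '{"fc": "A"}\n{"fc": "B"}{"fc": "C"}\n'
artists = slurp(stream)
print(artists)
```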

In [None]:
# One big array
!cat ../data/tate-collection/all_a_artists.json | jq 'length'

In [None]:
!cat ../data/tate-collection/all_a_artists.json | jq .

`.[]` returns all the elements of an array, just as they are.

Alternatively, we can specify indices to get a subset or "slice":

In [None]:
!cat ../data/tate-collection/all_a_artists.json | jq '.[95:105]'

We can confirm we just have 10 records with the `length` function we saw above: 

In [None]:
!cat ../data/tate-collection/all_a_artists.json | jq '.[95:105]' | jq 'length'
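
`jq`'s slices follow the same convention as Python's: the start index is included and the end index is not. A tiny sketch with stand-in data (the list of 138 placeholder records is an assumption based on the file count mentioned above):

```python
# Placeholder records, one per file; only the position matters here.
records = [{"position": n} for n in range(138)]

subset = records[95:105]  # elements 95 through 104
print(len(subset))
```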

#### Plucking particular values

Now we can combine our tools for manipulating arrays and objects into a sequence of little programs to pull out all the names and get just a list of those values:

In [None]:
!cat ../data/tate-collection/all_a_artists.json | jq '[.[] | {fc}[]]'

In [None]:
# Our 10 sample records
!cat ../data/tate-collection/all_a_artists.json | jq '[.[] | {fc}[]]' | jq '.[95:105]'
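
The same "iterate, then pull out one value" pattern is a list comprehension in Python. The second record below is invented for illustration:

```python
artists = [
    {"fc": "Magdalena Abakanowicz", "birthYear": 1930},
    {"fc": "Example Artist", "birthYear": 1850},  # made-up record
]

# For each object in the array, keep only the value stored under "fc".
names = [artist.get("fc") for artist in artists]
print(names)
```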

#### More advanced data profiling

There are functions for finding minimum and maximum values in a set. Let's use them on birth years:

In [None]:
!cat ../data/tate-collection/all_a_artists.json | jq 'min_by(.birthYear)'

Here the "earliest" record is `Anonymous`, which has no `birthYear` key at all: `jq` evaluates the missing key as `null`, and `null` sorts before every number.

In [None]:
!cat ../data/tate-collection/all_a_artists.json | jq 'max_by(.birthYear)'
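
The `Anonymous` result is easier to understand with a small model of the ordering. The sketch below imitates, in Python, the rule that a missing value sorts before any number, and then shows one way to ignore such records. The key function and the "Example Artist" record are assumptions made for illustration.

```python
artists = [
    {"fc": "Anonymous"},  # no birthYear key
    {"fc": "Magdalena Abakanowicz", "birthYear": 1930},
    {"fc": "Example Artist", "birthYear": 1850},  # made-up record
]

def missing_first(artist):
    """Sort key that places records without a birth year before all others."""
    year = artist.get("birthYear")
    return (year is not None, year if year is not None else 0)

# Imitates min_by/max_by when missing values count as the smallest.
earliest_with_missing = min(artists, key=missing_first)
latest = max(artists, key=missing_first)

# Alternative: drop records without a birth year before comparing.
known = [a for a in artists if a.get("birthYear") is not None]
earliest_known = min(known, key=lambda a: a["birthYear"])

print(earliest_with_missing["fc"], earliest_known["fc"], latest["fc"])
```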

We could use what we know so far to quickly find all the artists (with "a" names) who were born in or after 1900, by adding a call to the built-in `select` function:

In [None]:
!cat ../data/tate-collection/all_a_artists.json | jq '[.[] | select(.birthYear >= 1900)]'

To make it easier to see the results, let's simplify these objects to include just name and birth year:

In [None]:
!cat ../data/tate-collection/all_a_artists.json | jq '[.[] | select(.birthYear >= 1900)]' | jq '[.[] | {fc, birthYear}]'

And one more refinement: sorting by birth year (using another built-in function) to make it easier to inspect the list.

In [None]:
!cat ../data/tate-collection/all_a_artists.json | \
jq '[.[] | select(.birthYear >= 1900)]' | jq '[.[] | {fc, birthYear}]' | jq 'sort_by(.birthYear)'
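
For readers more at home in Python, the three steps (select, simplify, sort) can be sketched as follows. The function name and the sample records other than Abakanowicz are invented; the explicit `None` check stands in for the fact that `jq`'s comparison treats a missing year as smaller than any number.

```python
def born_from(artists, year):
    """Select artists born in or after `year`, keep two keys, sort by year."""
    selected = [
        a for a in artists
        if a.get("birthYear") is not None and a["birthYear"] >= year
    ]
    simplified = [{"fc": a.get("fc"), "birthYear": a["birthYear"]} for a in selected]
    return sorted(simplified, key=lambda a: a["birthYear"])

artists = [
    {"fc": "Magdalena Abakanowicz", "birthYear": 1930, "gender": "Female"},
    {"fc": "Anonymous"},                              # no birthYear key
    {"fc": "Example Artist", "birthYear": 1850},      # made-up record
    {"fc": "Another Example", "birthYear": 1900},     # made-up record
]

result = born_from(artists, 1900)
print(result)
```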

#### Creating new structures

Finally we can use `jq` not only to read data structures but to reshape data from one structure to another:

In [None]:
!cat ../data/tate-collection/all_a_artists.json | \
jq '{c19_births: [.[] | select(.birthYear <= 1900) | {fc} |.[]], \
c20_births: [.[] | select(.birthYear > 1900) | {fc} | .[]]}'

We know from the example of "Anonymous" above that this solution is not perfect. How could we make it better?
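
One possible direction, sketched in Python rather than `jq`, is to give records without a birth year a group of their own. Treat this as an assumption-laden illustration rather than the answer: the `unknown_births` key, the function name, and the sample records other than Abakanowicz are all invented, and a year of exactly 1900 is counted once, with the earlier group.

```python
def group_by_century(artists):
    """Reshape a list of artist records into named lists of artist names."""
    groups = {"c19_births": [], "c20_births": [], "unknown_births": []}
    for artist in artists:
        year = artist.get("birthYear")
        name = artist.get("fc")
        if year is None:
            groups["unknown_births"].append(name)  # no year recorded
        elif year <= 1900:
            groups["c19_births"].append(name)
        else:
            groups["c20_births"].append(name)
    return groups

artists = [
    {"fc": "Magdalena Abakanowicz", "birthYear": 1930},
    {"fc": "Anonymous"},                           # no birthYear key
    {"fc": "Example Artist", "birthYear": 1850},   # made-up record
    {"fc": "Another Example", "birthYear": 1900},  # made-up record
]

groups = group_by_century(artists)
print(groups)
```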

Note that I can use `jq` as a series of separate filter programs chained together:
```
jq '[.[] | select(.birthYear >= 1900)]' | jq '[.[] | {fc, birthYear}]' | jq 'sort_by(.birthYear)'
```
Or, I can combine several steps into a single filter program that uses `jq`'s own pipe operator internally:
```
jq '{c19_births: [.[] | select(.birthYear <= 1900) | {fc} | .[]]}'
```
I used the second method so that I could make the result of each inner filter the value of a key in a new object I was building:
```
{
    c19_births: […], 
    c20_births: […]
}
```