# RAILS JSON lesson

2018-07-26

For today's lesson, I thought that we could revisit some of the JSON materials that we covered in our in-person session. We briefly looked at the tools associated with it, but didn't fully explore some of the strengths and common approaches for working with it.

Let's remind ourselves of our data and how JSON data structures operate.

## How is JSON data organized?

Actual JSON files are often driven by schemas, much like the XML files we write. The core rules allow a lot of flexibility, but a schema imposes more rules and defines what each label means.

Generally speaking, this data is structured in attribute/value pairs. If you're used to working with dictionaries in Python, then this structure will work pretty much the same way. JSON has its own data types, including collection types.

There is a single root structure that all the data is stored within. For our purposes that is a pair of curly braces, shown empty here.

`{}`

This container will hold attribute/value pairs (key/value pairs, to borrow the Python lingo). This is formatted almost exactly like the dictionaries you know. Attribute labels are stored as strings (JSON requires this), followed by a colon `:`, followed by the value for that key. The value's data type is open to however you need or want to store the value, and you can even nest more JSON objects or arrays (which operate like Python lists).
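As a small illustration of that mapping, here is a sketch that parses a made-up JSON string with Python's standard `json` module. The record contents (`pages`, `in_print`, `tags`, `publisher`) are invented for this example and are not part of the workshop data.

```python
import json

# A made-up JSON document: one root object holding attribute/value pairs.
raw = (
    '{"title": "Hello Friends", "pages": 212, "in_print": true, '
    '"tags": ["greeting", "fiction"], "publisher": {"name": "Example Press"}}'
)

record = json.loads(raw)

# Each JSON type arrives as its closest Python equivalent.
for key, value in record.items():
    print(key, type(value).__name__)
```

Strings, numbers, and booleans come back as `str`, `int`, and `bool`, while JSON arrays and nested objects come back as `list` and `dict`.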

## How are entities stored?

Generally, though not because the format requires it, these structures are very deeply nested. There are two patterns that you'll see.

### The first pattern: the record entry

Usually living at the deepest level of the structure, this is a pattern that is usually used to capture a single record's data.  Let's take a book.

``` json
{ "title": "Hello Friends",
  "pub_date": "2018-04-23",
  "id":  "034039",
  "authors": "Human One (writer); Human Two (editor); Human Three (editor)"
}
```

This is a pretty small record, but it gets the point across. This structure holds four attribute pairs: title, pub_date, id, and authors. The values are all strings. There's nothing in this structure that we couldn't easily transform into a 2D structure. There is a one to one relationship: each attribute has one and only one value.

Let's add some more data to explore the power of this structure.  


``` json
{ "title": "Hello Friends",
  "pub_date": "2018-04-23",
  "id":  "034039",
  "author":  ["Human One (writer)", "Human Two (editor)", "Human Three (editor)"]
}
```

I've replaced the delimited authors string with this author attribute pair. From a technical standpoint, there is still a one to one relationship. There is a single attribute (author) and one value attached to it. Looking at this value will tell you that I'm either wrong or lying to you. Again, from a technical standpoint there's a single object stored within the value. However, this object is a data collection object. Where we might look at this as a list within Python, it is in fact called an array in JSON. But the point is about the same. It allows you to satisfy the need to have a single object stored in that value, but have many values inside it.

We've been able to store three values within this collection, keeping them collected in one object and neatly in place, but the values are still separate enough for me to cleanly extract them individually. I don't have to presume a delimiter in the text, because the data points are literally separate from each other.

This sort of relationship can be stored within a relational database via two tables:  one table that holds the book data that includes the book id key, and another authors table that has pairs of book IDs and author names or other information.  There may even be three tables, that same book table, a table containing solely the book ids and author ids, and then another with author ID numbers and all the biographical data about that author. 

But in our JSON structure we can see that it's all just stored right there, and is pretty readable even as raw data. Here we can ask for the authors directly without having to bash apart any text. Because they're stored in something like a list, we can also iterate over them as normal.
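To make that concrete, here is a minimal sketch with the record above written out as a Python dictionary. The variable names `delimited` and `split_names` are just for this illustration.

```python
# The book record with the author array, written as a Python dictionary.
book = {
    "title": "Hello Friends",
    "pub_date": "2018-04-23",
    "id": "034039",
    "author": ["Human One (writer)", "Human Two (editor)", "Human Three (editor)"],
}

# With the delimited string we would have to split the text ourselves,
# and we would have to know that "; " is the delimiter.
delimited = "Human One (writer); Human Two (editor); Human Three (editor)"
split_names = delimited.split("; ")

# With the array, each person is already a separate element.
for person in book["author"]:
    print(person)
```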

Let's expand this even more.

``` json
{ "title": "Hello Friends",
  "pub_date": "2018-04-23",
  "id":  "034039",
  "author":  {"writer": "Human One", "editor": ["Human Two", "Human Three"]}
}
```

Here we can see the depth building up. Earlier we took the people's names, all condensed into one string, and split them out into individual elements of an array. Now we've abstracted out the roles into their own attributes, with the attribute labels naming the role each person held. So we've finally refined the granularity of our strings to keep each name separate from the other people and each role separate from the name.

While this is beautiful and reasonably human readable, running a query on this will require much more than just asking for the author list. That might be valuable to you, but it may be more trouble than it's worth. All three of these levels of data granularity are just fine for JSON as a structure, so much of this is up to you unless you're working with a specific schema.

You'll need to carefully balance the level of detail you want with how obnoxious it will be to process it.
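As a rough sketch of that extra processing, the helper below (a hypothetical function named `all_people`, not something from the workshop code) collects every person from the role-keyed structure. It has to cope with a role holding either a single string or an array.

```python
book = {
    "title": "Hello Friends",
    "author": {"writer": "Human One", "editor": ["Human Two", "Human Three"]},
}

def all_people(author):
    """Flatten a role -> name(s) mapping into (name, role) pairs."""
    people = []
    for role, names in author.items():
        # A role may hold a single string or an array of strings.
        if isinstance(names, str):
            names = [names]
        for name in names:
            people.append((name, role))
    return people

print(all_people(book["author"]))
```

With the plain array, the same question was a single lookup. Here it takes a loop and a type check.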

### The second pattern:  a record of records

The true power of JSON is that you can easily smash many records into one big structure, and even include some metadata about that chunk of data.

Many times within APIs you'll be doing large queries that deliver many records as one result. So instead of having to respond to 1,000 single-record queries, the API can give you 100 records in each of 10 queries.

We saw this in the data that we worked with in our first workshop session.  

Remember that all JSON structures will have one outside wrapper of `{}`.  The first level of the hierarchy here is:

`data` and `meta`

`data` contains a JSON array (the thing with the square brackets). Each element of that array is an object (the thing with the curly braces) that represents an individual record.

`meta` contains a JSON object that has attribute pairs with administrative metadata about the data files being received.  Here you'll find timestamps about the request, how many items are in the data payload, etc.  This is a really common pattern to find in these kinds of API calls and other data payload files.
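Here is a made-up, much smaller payload in the same `data`/`meta` shape, as a sketch only. The record ids and titles are invented. The key names `total`, `total_pages`, and `page` match the ones that show up in the workshop data, and the records sit in an array, as they do there.

```python
import json

# A tiny invented payload that follows the data/meta envelope pattern.
payload_text = """
{
  "data": [
    {"id": "record-1", "attributes": {"title": "First"}},
    {"id": "record-2", "attributes": {"title": "Second"}}
  ],
  "meta": {"total": 2, "total_pages": 1, "page": 1}
}
"""

payload = json.loads(payload_text)
print(len(payload["data"]), "records on page", payload["meta"]["page"])
```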

Let's start exploring this in Python.

# JSON in Python

We first need to import the `json` module from the standard Python library.

In [1]:
import json

The main reading function to use is `json.load(fileio)`.  You first need to open this file path up as a file io object in read mode.  Then you can pass that open object to this function.  You will be given a Python dictionary representing the JSON data.  You can also use the `json.loads()` function to have it parse a JSON data structure saved as a string.
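The difference between the two functions can be sketched like this. `io.StringIO` stands in for an open file so that the example does not depend on any file on disk, and the sample text is invented.

```python
import io
import json

text = '{"page": 1, "total_pages": 88}'

# json.loads parses a string that is already in memory.
from_string = json.loads(text)

# json.load reads from an open file-like object; StringIO stands in for a file.
from_file = json.load(io.StringIO(text))

print(from_string == from_file)
```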

In [23]:
with open('results/result_page_1.json', 'r') as file_in:
    data = json.load(file_in)
    
print(type(data))

<class 'dict'>


Now we can operate on the data using normal dictionary data syntax. Let's look at the keys first.

In [24]:
print(data.keys())

dict_keys(['data', 'meta'])


We know that this structure is very deep, but this `.keys()` method doesn't work recursively.  It will only give you the keys for the top level dictionary.  
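If you do want to see the keys at every level, one possible approach is a small recursive generator. This is a sketch: `walk_keys` is a hypothetical helper, and `sample` is a made-up structure that imitates the shape of our data.

```python
def walk_keys(node, depth=0):
    """Yield (depth, key) for every key in nested dicts and lists."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield depth, key
            yield from walk_keys(value, depth + 1)
    elif isinstance(node, list):
        for item in node:
            # Lists have no keys of their own, so stay at the same depth.
            yield from walk_keys(item, depth)

sample = {"data": [{"id": "a", "attributes": {"doi": "x"}}], "meta": {"page": 1}}

for depth, key in walk_keys(sample):
    print("  " * depth + key)
```

On a real results page this prints a lot of lines, since every record repeats the same keys, so it is most useful on a single record.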

A warning here! Python dictionaries do not allow keys to be repeated. All keys must be unique. The JSON specification does suggest that you don't repeat keys, but you may be working with a JSON data file that violates this rule. The Python parser will work with these files, but it will drop all but the last value seen for any repeated key. This acts a lot like what happens when you build a dictionary with an accumulator and include duplicate keys. Because the `dict[key] = value` syntax works for both key/value creation and value overwrite, the last value seen for a repeated key is the one that remains.
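A quick sketch of this behavior, using an invented two-key string. The second half shows one possible way to catch repeats, through the `object_pairs_hook` parameter of `json.loads`. The `reject_duplicates` function is a hypothetical helper written for this example.

```python
import json

text = '{"id": "first", "id": "second"}'

# The default parser silently keeps only the last value for a repeated key.
print(json.loads(text))

def reject_duplicates(pairs):
    """Build a dict from (key, value) pairs, refusing repeated keys."""
    result = {}
    for key, value in pairs:
        if key in result:
            raise ValueError(f"duplicate key: {key!r}")
        result[key] = value
    return result

try:
    json.loads(text, object_pairs_hook=reject_duplicates)
except ValueError as err:
    print(err)
```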

Let's dig into the `meta` chunk because that's smaller. Remember our dictionary syntax here.

In [25]:
print(data['meta'])

{'resource-types': [{'id': 'dataset', 'title': 'Dataset', 'count': 729}, {'id': 'image', 'title': 'Image', 'count': 554}, {'id': 'text', 'title': 'Text', 'count': 532}, {'id': 'collection', 'title': 'Collection', 'count': 164}, {'id': 'audiovisual', 'title': 'Audiovisual', 'count': 26}, {'id': 'other', 'title': 'Other', 'count': 24}, {'id': 'software', 'title': 'Software', 'count': 17}, {'id': 'event', 'title': 'Event', 'count': 1}, {'id': 'film', 'title': 'Film', 'count': 1}, {'id': 'model', 'title': 'Model', 'count': 1}, {'id': 'physical-object', 'title': 'Physical object', 'count': 1}], 'years': [{'id': '2018', 'title': '2018', 'count': 175}, {'id': '2017', 'title': '2017', 'count': 485}, {'id': '2016', 'title': '2016', 'count': 313}, {'id': '2015', 'title': '2015', 'count': 267}, {'id': '2014', 'title': '2014', 'count': 104}, {'id': '2013', 'title': '2013', 'count': 100}, {'id': '2012', 'title': '2012', 'count': 82}, {'id': '2011', 'title': '2011', 'count': 94}, {'id': '2010', 'tit

You'll note that what was delivered back to me was a dictionary, so I can drill into the keys for this one as well.

In [26]:
print(data['meta'].keys())

dict_keys(['resource-types', 'years', 'registered', 'data_centers', 'schema-versions', 'total', 'total_pages', 'page'])


There's more stuff here that looks like actual data points. So we can start extracting things! We'll need to talk a bit about the mind's eye perspective of stacking up these extractions. Note that these results are numbers but don't have units. We will still be dependent on documentation and a well known exemplar to put things together and fully understand the meaning of things.

In [27]:
print(data['meta']['total_pages'])

88


In [28]:
print(data['meta']['page'])

1


In [29]:
print(data['meta'].keys())

print(data['meta']['total_pages'])

print(data['meta']['page'])

dict_keys(['resource-types', 'years', 'registered', 'data_centers', 'schema-versions', 'total', 'total_pages', 'page'])
88
1




In [30]:
print(data['meta']['total'])

2186


At this point we know enough to pull things together and make a nice little program that will loop through these pages and download however many there are. We may or may not really want to unpack this program.

We can go back to using the `glob` module for grabbing all the json files.

In [31]:
import glob

files = glob.glob('results/*.json')

print(files)

['results/result_page_1.json', 'results/result_page_10.json', 'results/result_page_11.json', 'results/result_page_12.json', 'results/result_page_13.json', 'results/result_page_14.json', 'results/result_page_15.json', 'results/result_page_16.json', 'results/result_page_17.json', 'results/result_page_18.json', 'results/result_page_19.json', 'results/result_page_2.json', 'results/result_page_20.json', 'results/result_page_21.json', 'results/result_page_22.json', 'results/result_page_23.json', 'results/result_page_24.json', 'results/result_page_25.json', 'results/result_page_26.json', 'results/result_page_27.json', 'results/result_page_28.json', 'results/result_page_29.json', 'results/result_page_3.json', 'results/result_page_30.json', 'results/result_page_31.json', 'results/result_page_32.json', 'results/result_page_33.json', 'results/result_page_34.json', 'results/result_page_35.json', 'results/result_page_36.json', 'results/result_page_37.json', 'results/result_page_38.json', 'results/r
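Notice that the listing puts `result_page_10.json` ahead of `result_page_2.json`. The order coming back from `glob` is not something to rely on. It doesn't matter for what we do next, but if you ever need the pages in numeric order, a sketch like the following would work. It assumes every file name follows the `result_page_N.json` pattern seen above, and `page_number` is a hypothetical helper.

```python
import re

# A few names in the order they appeared in the listing.
files = [
    'results/result_page_1.json',
    'results/result_page_10.json',
    'results/result_page_2.json',
]

def page_number(path):
    """Pull the page number out of a name like result_page_12.json."""
    match = re.search(r'result_page_(\d+)\.json$', path)
    return int(match.group(1))

ordered = sorted(files, key=page_number)
print(ordered)
```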

So what if we wanted to reduce this data down a bit and make a data file that only collected the DOI and the attributes about that DOI?

First off, what's our tree structure here?

Do we want a flat file of objects without ids?  That's just a list.  

Do we want each record grouped by the data center?  That might be interesting, but we wouldn't be able to easily query for an arbitrary DOI number.

Do we want to have the DOI be the unique attribute and the value be the original object with all the descriptive metadata?  That sounds like a cool first plan.
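The three options can be sketched side by side with two made-up records. The DOIs, center names, and titles are invented. Only the `doi` and `data-center-id` key names come from the real data.

```python
# Two made-up records, trimmed down to the fields we care about.
records = [
    {"doi": "10.1234/aaa", "data-center-id": "center.one", "title": "First"},
    {"doi": "10.1234/bbb", "data-center-id": "center.two", "title": "Second"},
]

# Option 1: a flat list. Finding one DOI means scanning every element.
flat = records

# Option 2: grouped by data center. Good for per-center questions.
by_center = {}
for rec in records:
    by_center.setdefault(rec["data-center-id"], []).append(rec)

# Option 3: keyed by DOI. Any record is one lookup away.
by_doi = {rec["doi"]: rec for rec in records}

print(by_doi["10.1234/bbb"]["title"])
```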

But first we need to concatenate all our records together in a flat list.  We can do this by using the `+` operator on the accumulator list rather than append.

In [43]:
records = []

# let's go back to our original small program and add the accumulator to it

for file in files:
    with open(file, 'r') as fin:
        data = json.load(fin)
    records += data['data']

You may be asking why I didn't use append here.  The structure coming out of `data['data']` is a list, so had I used `.append()` I would have ended up with a list of lists.

There's a nice little trick to lists, where if you use `+` on them they will concatenate all the elements into a single list.  We can see this in action.

In [44]:
print([1,2,3] + ['a', 'b', 'c'])

[1, 2, 3, 'a', 'b', 'c']
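Here is a short sketch of the difference, using two made-up "pages" of records.

```python
page_one = [{"id": "a"}, {"id": "b"}]
page_two = [{"id": "c"}]

nested = []
flat = []
for page in (page_one, page_two):
    nested.append(page)  # adds each page as a single element
    flat += page         # adds each record in the page

print(len(nested), len(flat))
```

The `nested` list ends up with one element per page, while `flat` ends up with one element per record. Calling `flat.extend(page)` would do the same job as `flat += page`.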


Now when we print the accumulator list we can see that everything is saved in one flat structure.

In [47]:
print(records[:10]) # just the first 10 for space

[{'id': 'https://doi.org/10.4225/72/570d1ead85f29', 'type': 'works', 'attributes': {'doi': '10.4225/72/570d1ead85f29', 'identifier': 'https://doi.org/10.4225/72/570d1ead85f29', 'url': None, 'author': [{'literal': 'Don Daniels'}], 'title': 'Snake', 'container-title': 'PARADISEC', 'description': None, 'resource-type-subtype': None, 'data-center-id': 'ands.centre72', 'member-id': 'ands', 'resource-type-id': None, 'version': None, 'license': None, 'schema-version': '3', 'results': [], 'related-identifiers': [], 'published': '2016', 'registered': '2016-04-12T16:13:38Z', 'checked': None, 'updated': '2016-04-12T16:13:38Z', 'media': None, 'xml': 'PD94bWwgdmVyc2lvbj0iMS4wIj8+CjxyZXNvdXJjZSB4bWxucz0iaHR0cDovL2RhdGFjaXRlLm9yZy9zY2hlbWEva2VybmVsLTMiIHhtbG5zOnhzaT0iaHR0cDovL3d3dy53My5vcmcvMjAwMS9YTUxTY2hlbWEtaW5zdGFuY2UiIHhzaTpzY2hlbWFMb2NhdGlvbj0iaHR0cDovL2RhdGFjaXRlLm9yZy9zY2hlbWEva2VybmVsLTMgaHR0cDovL3NjaGVtYS5kYXRhY2l0ZS5vcmcvbWV0YS9rZXJuZWwtMy9tZXRhZGF0YS54c2QiPjxpZGVudGlmaWVyIGlkZW50aWZpZXJUe

In [49]:
print(len(records))

1000


Now we can loop through these results and start crafting our new data.

One way to approach getting started with this is to focus on extracting out what you want first, then playing with how it all appears, and finally assemble the new data structure and add it all in.  We need to first get at the data that we want before we can figure out how we might want to reassemble things.

In [52]:
for record in records:
    # let's extract all the data out of this dictionary
    descripitivecluster = record['attributes']
    doi = descripitivecluster['doi']

Now we have all our unique ids and the values that we want.  We can quickly spin this into a new structure.  Let's add a dictionary accumulator.

In [53]:
recordsjson = {}

for record in records:
    # let's extract all the data out of this dictionary
    descripitivecluster = record['attributes']
    doi = descripitivecluster['doi']
    recordsjson[doi] = descripitivecluster

In [54]:
print(recordsjson.keys())

dict_keys(['10.4225/72/570d1ead85f29', '10.17863/cam.18284', '10.7282/t3gb23fq', '10.7282/t3wq036b', '10.7282/t36d5t6m', '10.7299/x7g15xzk', '10.7299/x78c9tfq', '10.7299/x7kp80bz', '10.7299/x70r9mkb', '10.7299/x7k35t3r', '10.7299/x72v2fk8', '10.7299/x76h4gw8', '10.5061/dryad.868/2', '10.3932/ethz-a-000200296', '10.3932/ethz-a-000207736', '10.3932/ethz-a-000249278', '10.3932/ethz-a-000171364', '10.4225/72/56f55a20c7481', '10.5281/zenodo.846553', '10.5281/zenodo.846552', '10.14456/cmvj.2017.18', '10.5281/zenodo.1089204', '10.5281/zenodo.1089203', '10.5281/zenodo.1082907', '10.5281/zenodo.1082906', '10.4225/72/570a82b93dc7c', '10.6084/m9.figshare.900948.v1', '10.13140/rg.2.1.5165.1685', '10.14457/oer.img.2015.924', '10.5061/dryad.gs7j4.2', '10.5061/dryad.h020k', '10.5061/dryad.4v937', '10.5061/dryad.c36k6', '10.5061/dryad.k6t0s.2', '10.5061/dryad.6ph26', '10.14279/depositonce-5046', '10.21236/ada594656', '10.6084/m9.figshare.3806646.v1', '10.6084/m9.figshare.3806646', '10.15121/1148775', 

Now that we have this dictionary, we can dump it out to a file with the `json.dump()` function. This will serialize our data structure into valid JSON and write it out to our file, which is open for writing. Note that it takes two arguments: the data and the open file object.

In [57]:
with open('doidata.json','w') as fout:
    json.dump(recordsjson, fout)

One thing that we can play with here is how the output looks. By default the text gets dumped out as one long line. This saves some file space, but is really difficult to read through. We can ask for indentation with the `indent` argument: pass a string to have that text used for each level of indent, or pass a number to indent by that many spaces.

In [58]:
with open('doidata.json','w') as fout:
    json.dump(recordsjson, fout, indent = 2)
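To compare the choices without writing any files, this sketch uses `json.dumps()`, which returns the text as a string instead of writing it out. The `sample` dictionary is made up.

```python
import json

sample = {"doi": "10.1234/aaa", "author": ["Human One"]}

compact = json.dumps(sample)              # one long line
spaced = json.dumps(sample, indent=2)     # two spaces per level
tabbed = json.dumps(sample, indent="\t")  # a string is used as the indent text

print(compact)
print(spaced)
```

All three strings parse back to the same dictionary. Only the whitespace differs.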

Worth reviewing here are the differences in syntax between passing arguments without a label (and thus by position) versus passing an argument a value by name, which we sometimes need to do (like we are doing here with `indent`).
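A small sketch of that difference, again with `json.dumps()` and a made-up dictionary. In Python 3 the formatting options of `json.dumps()` are keyword-only, so trying to pass the indent by position raises an error.

```python
import json

sample = {"page": 1}

# The object to serialize is passed by position; indent is passed by name.
print(json.dumps(sample, indent=2))

# Passing the indent by position fails, because json.dumps only accepts
# its formatting options as keyword arguments.
try:
    json.dumps(sample, 2)
except TypeError as err:
    print("TypeError:", err)
```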

## Final thoughts about this sort of data structure

With tabular data stored within a relational database, we have the advantage that all values retain their labels (unless you do something silly, which is totally possible). So when you extract a row it comes with an ID and each cell value retains the column label. But this is a much more rigid system. You can adapt certain values with a set delimiter structure within them, but when you have data with looser structure (such as optional keywords), it gets harder to query and process. Not impossible, but harder.

When we consider how we generally hold tabular data in memory (outside of pandas dataframes), we normally think about having a list of lists. These sublists contain individual non-collection elements (strings are fine even though they are sequences, but no other lists or dictionaries). This is our normal conception of two dimensional data. This has the benefit of being extremely flexible because you would be allowed to have additional levels of data collections. While you may be restricting yourself to only two dimensions, that's a design choice you are making rather than a limitation of the data structure itself. What we lose, however, is the context behind that data. While you may be able to infer the labels and identities of values coming out of a query, these things are not natively transmitted along with the data. You cannot ask for things by name directly within the extraction syntax of the collection.

JSON combines these two strengths into a single structure, which removes both limitations. Unless you do something silly and purposefully don't encode the context of the data in the structure, it'll be there by name for the asking. Not only can you have one to many relationships easily and directly, you can also ask for things directly by name, and the data coming back is often labeled.
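As a closing sketch, here is the book record from earlier held both ways. The column order in `rows` is an assumption made up for this example.

```python
# The same book as a row in a list of lists: position carries the meaning.
rows = [["034039", "Hello Friends", "2018-04-23"]]
title_by_position = rows[0][1]  # we must remember that column 1 is the title

# The same book as a JSON-style structure: names travel with the values.
book = {
    "id": "034039",
    "title": "Hello Friends",
    "pub_date": "2018-04-23",
    "author": ["Human One", "Human Two", "Human Three"],
}
title_by_name = book["title"]

print(title_by_position == title_by_name)
```

Both lookups return the same title, but only the second one says what it is asking for, and only the second structure has room for the one to many author list.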

 








## scrapyard

Let's look at the idea of a date here for a second.  We normally store these as a string with a standard delimiter.  This meets our needs for being human readable and easily manipulated using core string methods.

So in a way, this usage is hacking a one to many relationship into tabular data. While a single data point is being saved, that data point can be easily processed into many sub-values. This is more formally supported in common database systems by storing a date as a Date data type (which goes by many names, depending on the system). One data object is being stored, but you can ask many questions about it. Might it be nicer to store all the elements of a date as separate fields within the date node? This does add more complication, because you'll need to reconstruct the date into a more expected format for output. However, the data has already been split apart.

This means that we usually have a design choice here:

* you can have data neatly granular but you'll need to do more work to reconstruct it for more traditional outputs
* you can have all the data kept together in a nicely formatted structure, but you'll need to add more processing in to break it apart for granular queries
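Both sides of that trade-off can be sketched with the publication date from the book record. The `pub_parts` layout with `year`, `month`, and `day` fields is a hypothetical design, not something taken from a schema.

```python
from datetime import date

# Kept together: readable as-is, split apart only when a query needs it.
pub_date = "2018-04-23"
year, month, day = (int(part) for part in pub_date.split("-"))

# Kept granular: ready for queries, reassembled only for output.
pub_parts = {"year": 2018, "month": 4, "day": 23}
rebuilt = date(pub_parts["year"], pub_parts["month"], pub_parts["day"]).isoformat()

print(year, rebuilt)
```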

There is no one right answer to this question. You will either be working with a system handed to you, have other design standards dictated to you by your system designers, or you'll need to make a call on where you want your data fussing to be. You can choose to fuss with it at the processing point for queries or you can fuss with it to make nicely readable reports and output. Only you will know what you're up for dealing with. Sometimes the fussy data is so small and unimportant that you rarely have to deal with it. Other times, having some fussy data already split apart for neatly granular storage is incredibly valuable.
