# RAILS JSON lesson

2018-07-26

For today's lesson, I thought we could revisit some of the JSON material that we covered in our in-person session. We briefly looked at the tools associated with it, but didn't fully explore its strengths or the common approaches for working with it.

Let's remind ourselves of our data and how JSON data structures operate.

## How is JSON data organized?

In practice, JSON files are often driven by schemas, much like the XML files we write. The core rules of the format allow a lot of flexibility, but a schema imposes more rules and defines what each label means.

Generally speaking, this data is structured in attribute/value pairs. If you're used to working with dictionaries in Python, then this structure will work pretty much the same. JSON respects data types, including collection types.

There is a single root structure that all data is stored within.

`{}`

This data structure will contain attribute/value pairs (key/value pairs, to borrow the Python lingo). It is formatted almost exactly like a Python dictionary. Attribute labels must be strings wrapped in double quotes, followed by a colon `:`, followed by the value for that key. The value's data type is open to however you need or want to store it: a string, a number, a boolean, `null`, or even another JSON object or an array (which operates like a Python list).
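To make the dictionary comparison concrete, here is a minimal sketch using Python's standard `json` module. The record and its field names (`pages`, `in_print`, `publisher`, and so on) are made up for illustration; they are not from our course data.

```python
import json

# Hypothetical record: every field name here is invented for illustration.
raw = '''
{
  "title": "Hello world",
  "pages": 212,
  "in_print": true,
  "subtitle": null,
  "authors": ["A. Writer", "B. Editor"],
  "publisher": {"name": "Example Press", "city": "Urbana"}
}
'''

# json.loads turns JSON text into ordinary Python objects.
book = json.loads(raw)

print(type(book).__name__)                  # the root object becomes a dict
print(book["title"])                        # JSON string -> str
print(book["authors"][0])                   # JSON array -> list
print(book["publisher"]["city"])            # nested object -> nested dict
print(book["in_print"], book["subtitle"])   # true/null -> True/None
```

Notice that once the text is loaded, you are just working with dictionaries and lists, so everything you already know about them applies.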

Let's look at the idea of a date for a second. We normally store a date as a string with a standard delimiter. This meets our need for a value that is human-readable and easily manipulated using core string methods.

So in a way, this usage hacks a one-to-many relationship into tabular data. While a single data point is being saved, that data point can easily be processed into many sub-values. Common database systems support this more formally by storing a date as a Date data type (which goes by many names, depending on the system). One data object is being stored, but you can ask many questions about it. Might it be nicer to store all the elements of a date as separate fields within the date node? This does add complication, because you'll need to reconstruct the date into a more expected format for output. However, the data has already been split apart for you.

This means that we usually have a design choice here:

* you can have data neatly granular but you'll need to do more work to reconstruct it for more traditional outputs
* you can have all the data kept together in a nicely formatted structure, but you'll need to add more processing in to break it apart for granular queries
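Here is a small sketch of both choices side by side. The nested `year`/`month`/`day` field names are my own assumption for illustration; a real system may label them differently.

```python
import json

# Choice 1: the date kept together as one formatted string.
together = json.loads('{"pub_date": "2018-01-23"}')

# Choice 2: the same date split into separate fields (hypothetical field names).
granular = json.loads('{"pub_date": {"year": 2018, "month": 1, "day": 23}}')

# A granular question on the string form: we have to break it apart first.
year, month, day = (int(part) for part in together["pub_date"].split("-"))
print(year, month, day)

# A traditional output from the granular form: we have to rebuild it,
# remembering to zero-pad the month and day.
d = granular["pub_date"]
rebuilt = f"{d['year']:04d}-{d['month']:02d}-{d['day']:02d}"
print(rebuilt)
```

Either way the fussing happens somewhere. The only question is whether it happens at query time or at output time.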

There is no one right answer to this question. You will either be working with a system handed to you, have design standards dictated to you by your system designers, or need to make a call on where you want your data fussing to happen. You can choose to fuss with it at the processing point for queries, or you can fuss with it to make nicely readable reports and output. Only you will know what you're up for dealing with. Sometimes the fussy data is so small and unimportant that you rarely have to deal with it. Other times, having some fussy data already split apart for neatly granular storage is incredibly valuable.

Let's look at some examples here.  We'll do dates and keyword tags.

## Deep dive: dates

There are so many date formats out there. Rather than discuss which ones are better or worse, let's focus on the ISO standard date format of `yyyy-mm-dd`. Having the year come first, then the month, then the day (with a four-digit year and a zero-padded two-digit month and day) means that normal text sorting and comparison will put dates stored in this style in chronological order. All years are grouped, inside of that all months are grouped, and inside of that all the days are grouped. This is done purely on the string sorting values, and it will work without a problem so long as all the date values are correctly formatted.

Going the other way around, with `dd-mm-yyyy`, means that all `01` days will be grouped first, then the months, and then the years. You may actually want this sort of query at some point, but storing dates like this means that you'll have to do a bunch of work to make them sort according to normal date rules.
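A quick sketch of the difference, using a handful of made-up dates. The last step shows one possible way to do the extra work for `dd-mm-yyyy` strings, by parsing each one with `datetime.strptime` before sorting.

```python
from datetime import datetime

# The same four (made-up) dates in both formats.
iso_dates = ["2018-01-23", "2012-11-02", "2018-01-05", "2015-06-30"]
dmy_dates = ["23-01-2018", "02-11-2012", "05-01-2018", "30-06-2015"]

# Plain string sorting puts ISO dates in chronological order.
sorted_iso = sorted(iso_dates)
print(sorted_iso)
# ['2012-11-02', '2015-06-30', '2018-01-05', '2018-01-23']

# Plain string sorting groups dd-mm-yyyy by day number, not by time.
sorted_dmy = sorted(dmy_dates)
print(sorted_dmy)
# ['02-11-2012', '05-01-2018', '23-01-2018', '30-06-2015']

# The extra work: parse each string into a real date to use as the sort key.
fixed_dmy = sorted(dmy_dates, key=lambda s: datetime.strptime(s, "%d-%m-%Y"))
print(fixed_dmy)
# ['02-11-2012', '30-06-2015', '05-01-2018', '23-01-2018']
```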

So if all you want to do with your dates is sort them, storing them as ISO-formatted strings is perfectly sufficient. We might store them like this in JSON.

``` json
{
  "book1": {
    "pub_date": "2018-01-23",
    "title": "Hello world"
  },
  "book2": {
    "pub_date": "2012-11-02",
    "title": "Python Stuff"
  }
}
```
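As a minimal sketch of putting that to use, we can load these two records with Python's `json` module and sort them by `pub_date` as plain strings. Note that the `json` module only accepts double-quoted strings, so the text below is written that way.

```python
import json

# The two book records from above, written as valid JSON text.
raw = '''
{"book1": {"pub_date": "2018-01-23", "title": "Hello world"},
 "book2": {"pub_date": "2012-11-02", "title": "Python Stuff"}}
'''

books = json.loads(raw)

# Sort the records on the date string alone; no date parsing is needed
# because yyyy-mm-dd strings already sort chronologically.
by_date = sorted(books.values(), key=lambda book: book["pub_date"])
titles = [book["title"] for book in by_date]
print(titles)
# ['Python Stuff', 'Hello world']
```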

With tabular data stored in a relational database, we have the advantage that all values retain their labels (unless you do something silly, which is totally possible). So when you extract a row, it comes with an ID, and each cell value retains its column label. But this is a much more rigid system. You can adapt certain values by giving them a set delimiter structure, but when you have data with a looser structure (such as optional keywords), it gets harder to query and process. Not impossible, but harder.

When we consider how we generally hold tabular data in memory (outside of pandas dataframes), we normally think of a list of lists. Each sublist contains individual, non-collection elements (strings are sequences, but we make an exception for them; no other lists or dictionaries). This is our normal conception of two-dimensional data. The structure is extremely flexible, because you are allowed to add further levels of data collections. If you restrict yourself to only two dimensions, that's a design choice you are making rather than a limitation of the data structure itself. What we lose, however, is the context behind the data. While you may be able to infer the labels and identities of values coming out of a query, these things are not natively transmitted along with the data. You cannot ask for things by name directly within the extraction syntax of the collection.
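To see what losing the context looks like, here is a small sketch with our two books as a list of lists. The `columns` list is an assumption I'm adding for illustration: it is the knowledge you would otherwise have to carry around in your head.

```python
# Two-dimensional data as a list of lists: values only, no labels.
rows = [
    ["Hello world", "2018-01-23"],
    ["Python Stuff", "2012-11-02"],
]

# To get a publication date you have to remember that it lives at position 1.
first_date = rows[0][1]
print(first_date)

# The labels live outside the data (hypothetical column names).
columns = ["title", "pub_date"]

# Zipping the labels onto each row gives us values we can ask for by name.
labeled = [dict(zip(columns, row)) for row in rows]
print(labeled[0]["pub_date"])
```

Nothing stops the list of lists from working, but the meaning of position `1` is not stored anywhere in the data itself.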

JSON combines these two strengths into a single structure, removing both limitations. Unless you do something silly and purposely leave the context of the data out of the structure, it'll be there by name for the asking. Not only can you represent one-to-many relationships easily and directly, you can also ask for things by name, and the data coming back usually keeps its labels.
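Here is a rough sketch of both strengths at once, using optional keyword tags. The `keywords` field, the tag values, and the third book are all hypothetical additions for illustration.

```python
import json

# Hypothetical data: "keywords" is an optional one-to-many field,
# and "book3" is invented to show a record that leaves it out.
raw = '''
{
  "book1": {"title": "Hello world", "pub_date": "2018-01-23",
            "keywords": ["python", "beginners"]},
  "book2": {"title": "Python Stuff", "pub_date": "2012-11-02",
            "keywords": ["python"]},
  "book3": {"title": "Untagged", "pub_date": "2015-06-30"}
}
'''

books = json.loads(raw)

# Ask for things by name, and tolerate records that have no keywords at all.
python_books = [key for key, book in books.items()
                if "python" in book.get("keywords", [])]
print(python_books)
# ['book1', 'book2']

# What comes back is still labeled: each record carries its own field names.
print(sorted(books["book1"]))
# ['keywords', 'pub_date', 'title']
```

Using `.get("keywords", [])` is just one way to handle an optional field. It treats a missing list the same as an empty one.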