# Nested Data

Nested data structures, also known as hierarchical or tree data structures, are any data
structures that can recursively contain more data. Imagine fitting backpacks within
backpacks. One of the more common activities in data analysis is converting nested data
into tabular data.

## Serialization

Serialization is the process of turning a nested data structure into a flat text
representation.

## Parsing

Also known as deserialization or unmarshalling. Parsing is the reverse process of turning
flat text into a nested data structure. Regular expressions, in the strict formal sense,
cannot describe arbitrarily deep nesting, so they cannot build nested or hierarchical data
structures on their own. Nor can they reliably express searches that depend on depth, such
as finding a tag nested five levels deep.

Overall, parsing is any algorithm for converting flat text into a tree-like or recursive
data structure.
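
As a rough illustration of a depth-based search, the sketch below first parses a small JSON string and then walks the resulting tree recursively. The function name `values_at_depth` and the sample text are made up for this example.

```python
import json

def values_at_depth(node, depth):
    """Return every value found exactly `depth` levels below `node`."""
    if depth == 0:
        return [node]
    if isinstance(node, dict):
        children = list(node.values())
    elif isinstance(node, list):
        children = node
    else:
        children = []  # scalars have no children
    found = []
    for child in children:
        found.extend(values_at_depth(child, depth - 1))
    return found

# Hypothetical sample text, nested five levels deep
text = '{"a": {"b": {"c": [1, 2, {"d": "deep"}]}}}'
tree = json.loads(text)
print(values_at_depth(tree, 3))
print(values_at_depth(tree, 5))
```

The search only becomes simple once the text has been parsed, because the depth is then part of the data structure rather than something to be counted in the text.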

## Indirection

Instead of embedding child data within the parent data structure, pointers are used. The
parent data contains pointers to the child data. In the backpack analogy, instead of having
backpacks full of backpacks, each backpack contains notes saying which backpacks to look
in next.
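
One way to picture this is a flat table of backpacks in which each entry lists only the names of its children. This is a minimal sketch; the backpack names and the helper `embed` are invented for illustration, and dictionary keys stand in for pointers.

```python
# Embedded form: children live inside their parent.
embedded = {"name": "big", "children": [
    {"name": "medium", "children": [
        {"name": "small", "children": []}]},
    {"name": "pouch", "children": []}]}

# Indirect form: every backpack sits in one flat table, and each parent
# holds only the keys (the "notes") of its children.
backpacks = {
    "big": ["medium", "pouch"],
    "medium": ["small"],
    "small": [],
    "pouch": [],
}

def embed(name, table):
    """Follow the notes in `table` to rebuild the embedded form."""
    return {"name": name, "children": [embed(c, table) for c in table[name]]}

print(embed("big", backpacks) == embedded)
```

Both forms describe the same tree; the indirect form is already flat, which is part of why it maps well onto tables.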

## Breadth First

In breadth-first searches, enumeration, and serialization, we step through the highest
level of siblings first, then through each level below it in turn. This is analogous to
database normalization, where we first create a table of all the parents, then a table of
all the immediate children, then a table of all the grandchildren, and so on.
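
A breadth-first enumeration can be sketched with a queue. The tree below and the `name`/`children` keys are assumptions chosen for this example, not a format used elsewhere in this notebook.

```python
from collections import deque

# Hypothetical tree: each node has a name and a list of child nodes
tree = {"name": "root", "children": [
    {"name": "a", "children": [
        {"name": "a1", "children": []},
        {"name": "a2", "children": []}]},
    {"name": "b", "children": [
        {"name": "b1", "children": []}]}]}

def breadth_first(root):
    """Yield (depth, name), visiting every node of one level before the next."""
    queue = deque([(0, root)])
    while queue:
        depth, node = queue.popleft()
        yield depth, node["name"]
        # Children wait at the back of the queue until this level is done
        for child in node["children"]:
            queue.append((depth + 1, child))

print(list(breadth_first(tree)))
```

Grouping the results by depth gives one list per level, which corresponds to the parent, child, and grandchild tables described above.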

## Depth First

In depth-first searches, enumeration, and serialization, we go down the rabbit hole from
the root of the tree, listing the first parent, then their first child, then their first
grandchild, until we can go no further. We then step back until we find a level with more
sibling nodes. Most of the modern text serializations, like JSON and XML, are depth-first
representations, with the degree of textual nesting representing the tree depth.
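
A depth-first enumeration is naturally recursive. The sketch below uses a made-up tree with `name` and `children` keys and prints each node indented by its depth, which mirrors how nested text formats lay out a tree.

```python
# Hypothetical tree: each node has a name and a list of child nodes
tree = {"name": "root", "children": [
    {"name": "a", "children": [
        {"name": "a1", "children": []},
        {"name": "a2", "children": []}]},
    {"name": "b", "children": [
        {"name": "b1", "children": []}]}]}

def depth_first(node, depth=0):
    """Yield (depth, name), finishing each subtree before the next sibling."""
    yield depth, node["name"]
    for child in node["children"]:
        yield from depth_first(child, depth + 1)

# Indentation shows the depth of each node
for depth, name in depth_first(tree):
    print("    " * depth + name)
```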

## Examples of Nested Data

* Nested Python lists, tuples, and dictionaries
* File systems
* [Document Object Model](https://developer.mozilla.org/en-US/docs/Web/API/Document_Object_Model)
of a webpage

## Examples of Serialized Data

* [JSON](https://www.json.org/json-en.html)
* [HTML](https://html.spec.whatwg.org/multipage/)
* [XML](https://www.w3.org/TR/xml11/)
* [Any computer language](https://en.wikipedia.org/wiki/Context-free_grammar)

## Branching

Hierarchical data structures are formed by branching in one direction, from parent to
child. The branches are also called edges, and a sequence of connected edges is called a
path.

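## Leaves

Each end of a branch is called a node or vertex. A node reached by a branch from its
parent is a child, and a node with no children of its own is called a leaf.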
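
To make the vocabulary concrete, the sketch below lists the edges and leaves of a small tree. The tree, its `name`/`children` keys, and the helper `edges_and_leaves` are assumptions made for this example.

```python
# Hypothetical tree: each node has a name and a list of child nodes
tree = {"name": "root", "children": [
    {"name": "a", "children": [
        {"name": "a1", "children": []},
        {"name": "a2", "children": []}]},
    {"name": "b", "children": [
        {"name": "b1", "children": []}]}]}

def edges_and_leaves(root):
    """Return (edges, leaves): parent-child name pairs and childless node names."""
    edges, leaves = [], []
    stack = [root]
    while stack:
        node = stack.pop()
        if not node["children"]:
            leaves.append(node["name"])
        for child in node["children"]:
            edges.append((node["name"], child["name"]))
            stack.append(child)
    return edges, leaves

edges, leaves = edges_and_leaves(tree)
print(sorted(edges))
print(sorted(leaves))
```

A tree with six nodes has five edges, since every node except the root is reached by exactly one branch.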

In [None]:
import requests
import json
import pandas as pd
from bs4 import BeautifulSoup as soup

In [None]:
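GITURL = "https://api.github.com/events"
with requests.get(GITURL) as r:
    eventsjson = r.json()  # parsed by requests into Python lists and dictionaries
    eventstext = r.text    # the raw serialized JSON text
print(type(eventsjson))
print(len(eventsjson))
print(eventsjson[0])
# Parsing the raw text ourselves should give the same structure
eventsload = json.loads(eventstext)
print(eventsload[0])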

Let's practice some navigation and compare the results.

In [None]:
print(json.dumps(eventsjson[0]["actor"], indent = 4))
print(eventsjson[0]["actor"])
print(eventsload[0]["actor"])
print(type(eventsjson))
print(type(eventsload))
print(len(eventsjson))
print(len(eventsload))


## Converting Nested Data to Tables

Below is a small example of typical code we would run to convert a list of nested data
structures to a list of rows, and then finally to a single table. To do so we have to pick
the data we want from the nested data structure and say in which column to place it. We
will implement this using the
[Pandas from records](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.from_records.html)
method and list comprehensions.

In [None]:
def mapping(item):
    """ In general this dictionary is the only part that you should change. The keys are the
    columns of the dataframe, and each value is the single value for that column in the
    row. Just use regular dictionary indexing to extract the value you want:
    * `item` one nested item from the list.
    * returns `record`, the item flattened into a record dictionary."""
    return {
        # Event column, value is the event name
        "event": item["type"],

        # Repository column, value is the repository name.
        "repository": item["repo"]["name"],

        # Created column, value is the time the event was created
        "created": pd.to_datetime(item["created_at"])
    }

def tabulate(items, mapping):
    """ Turn a list of items into a dataframe using the supplied record mapping function:
    * `items` list of nested items.
    * `mapping` function that flattens one item into a record dictionary.
    * returns `df`, a dataframe of records."""
    return pd.DataFrame.from_records([ mapping(i) for i in items ])

Test the tabulator.

In [None]:
df = tabulate(eventsjson, mapping)
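
The same pattern can be tried without a network call. The sketch below uses hand-made sample events and a hypothetical `sample_mapping` function; the `login` field and all the sample values are assumptions for illustration. It also shows one possible way to tolerate a missing key with `dict.get()`.

```python
# Hand-made events that imitate the shape of the nested items (values invented)
sample_events = [
    {"type": "PushEvent",
     "repo": {"name": "someone/project"},
     "actor": {"login": "someone"}},
    {"type": "WatchEvent",
     "repo": {"name": "other/library"}},  # no "actor" key at all
]

def sample_mapping(item):
    """Flatten one nested event into a record dictionary."""
    return {
        "event": item["type"],
        "repository": item["repo"]["name"],
        # .get() returns a default instead of raising KeyError
        "actor": item.get("actor", {}).get("login"),
    }

rows = [sample_mapping(e) for e in sample_events]
for row in rows:
    print(row)
```

Each row is a plain dictionary, so a list like `rows` is the kind of input that `pd.DataFrame.from_records` accepts.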

Now let's go from a Python tree-structure of dictionaries and lists to flat JSON.

In [None]:
originaldata = [
    "Calgary",
    45,
    {
        "a": 3,
        4: "b",
        "a method": "def mymethod(a,b): return a+b"
    },
    "Edmonton"
]
print(originaldata)
serializedjson = json.dumps(originaldata, indent = 4)
print(serializedjson)
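
One detail worth checking is that serialization does not always round-trip exactly. JSON object keys are always strings, and JSON has no tuple type, so the short sketch below (with invented sample data) compares a structure with its round-tripped copy.

```python
import json

# Invented sample with an integer key and a tuple value
original = {"a": 3, 4: "b", "pair": (1, 2)}

roundtrip = json.loads(json.dumps(original))
print(roundtrip)

# The integer key came back as a string, and the tuple as a list
print(roundtrip == original)
print("4" in roundtrip, 4 in roundtrip)
print(type(roundtrip["pair"]))
```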

And the reverse process of starting with some JSON and parsing it into a tree-structure.

In [None]:
jsontext = """[
    45,
    43,
    -90,
    -80,
    [
        { "a": "b", "c": "d" },
        {
            "q": "u",
            "z": [ "y", "x" ]
        }
    ]
]"""
print(jsontext)
loadtext = json.loads(jsontext)
print(type(jsontext))
print(type(loadtext))


A little more practice navigating the tree.

In [None]:
print(loadtext[-1][-2])
print(json.dumps(loadtext, indent = 2))
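
Chained indexing can also be written as a path, a list of keys and positions to follow from the root. The helper `follow` below is a hypothetical convenience, not part of the `json` module, and the JSON text is a shortened version of the example above.

```python
import json

tree = json.loads("""[
    45,
    [
        { "a": "b", "c": "d" },
        { "q": "u", "z": [ "y", "x" ] }
    ]
]""")

def follow(node, path):
    """Walk down `node`, applying one key or index per step in `path`."""
    for step in path:
        node = node[step]
    return node

# Same result as tree[-1][1]["z"][0]
print(follow(tree, [-1, 1, "z", 0]))
print(follow(tree, [-1, -2]))
```

An empty path returns the root itself, and a step that does not exist raises the usual `KeyError` or `IndexError`.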

Some library documentation for serializing and parsing:

* [Python JSON](https://docs.python.org/3/library/json.html)
* [Python HTML](https://docs.python.org/3/library/html.parser.html)
* [Python XML](https://docs.python.org/3/library/xml.html)
* [Beautiful Soup](https://beautiful-soup-4.readthedocs.io/en/latest/)

We will demonstrate using Beautiful Soup to create a dataframe from HTML5 text. Our goal
is to find and use the data on the
[Wikipedia page listing gravitational waves](https://en.wikipedia.org/wiki/List_of_gravitational_wave_observations)
to create a dataframe.

In [None]:
WIKIURL = "https://en.wikipedia.org/wiki/List_of_gravitational_wave_observations"
with requests.get(WIKIURL) as r:
    pagetext = r.text

Confirm that the page text is just plain text.

In [None]:
print(pagetext[0:1024])

Parse the flat text into a tree-like or recursive data structure. This way we can navigate
the data programmatically.

In [None]:
# Name the parser explicitly so the result does not depend on what is installed
pagesoup = soup(pagetext, "html.parser")

Check that the parsed structure is an object, and has a method that returns the formatted
text. Then print the formatted text and compare it to the original source.

In [None]:
print(type(pagesoup))
print(type(pagesoup.prettify()))
print(pagesoup.prettify())

We can explore the soup object model using dot notation, which accesses the first tag of
the named type. Each tag contains more tags.

In [None]:
print(type(pagesoup.html))
print(type(pagesoup.html.head))
print(type(pagesoup.html.head.title))
print(pagesoup.html.head)

The `find_all()` method selects all the tags of a particular type, from the current
location deeper into the nested tags.

In [None]:
print(len(pagesoup.html.head.find_all("script")))
print(len(pagesoup.html.head.contents))

In conjunction with the `string` property and a list comprehension we can, for example,
get the contents of all the script tags in the `head`. Note that `string` is `None` for a
tag that does not contain exactly one piece of text.

In [None]:
jsscripts = [ t.string for t in pagesoup.html.head.find_all("script") ]
print(type(jsscripts[1]))

How about the first table in the document?

In [None]:
print(pagesoup.table.prettify())

Now retrieve all the tables in the document.

In [53]:
# List every table in the document
candidates = pagesoup.find_all("table")
print(len(candidates))

17


Filter that list for just those tables with captions, and verify the fourth table in the
list is the one we want.

In [71]:
# Extract the captions that are available
captionedtables = [ c for c in candidates if c.caption is not None]
print(captionedtables[3].caption)

<caption>Marginal event detections
</caption>


Now let's grab that specific table, and visually verify.

In [None]:
marginaltable = captionedtables[3]
print(type(marginaltable))
print(marginaltable.prettify())

<class 'bs4.element.Tag'>


Next we grab all the rows from the table.

In [79]:
marginalrows = marginaltable.find_all("tr")

Check that our data starts at the third row, and note that the newline strings between the
table cells are children of the row too, alongside the cells themselves.

In [97]:
print(marginalrows[2].contents)

['\n', <td bgcolor="#fff0f0">151205</td>, '\n', <td>2015-12-05 19:55:25</td>, '\n', <td>2019-10-11</td>, '\n', <td><span class="nowrap"><span data-sort-value="7003300000000000000♠"></span>3000<span style="margin-left:0.3em;"><span style="display:inline-block;margin-bottom:-0.3em;vertical-align:-0.4em;line-height:1.2em;font-size:85%;text-align:right;">+2400<br/>−1600</span></span></span></td>, '\n', <td>H,L</td>, '\n', <td>0.61</td>, '\n', <td><span class="nowrap"><span data-sort-value="6999140000000000000♠"></span>0.14<span style="margin-left:0.3em;"><span style="display:inline-block;margin-bottom:-0.3em;vertical-align:-0.4em;line-height:1.2em;font-size:85%;text-align:right;">+0.40<br/>−0.38</span></span></span></td>, '\n', <td bgcolor="888888"><div class="center" style="width:auto; margin-left:auto; margin-right:auto;"><span style="color:white;">BH</span></div></td>, '\n', <td><span class="nowrap"><span data-sort-value="7001670000000000000♠"></span>67<span style="margin-left:0.3em;"><

Now build a list of records, where each record is a dictionary. The keys are the column
names, and the values are the column values.

In [None]:
# Odd indices skip the '\n' strings that sit between the table cells
recordset = [
    { "event": r.contents[1].string, "at": r.contents[3].string }
    for r in marginalrows[2:]
]
print(recordset)
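
An alternative that avoids counting past the newline strings is to ask each row for its `td` tags directly. The sketch below runs on a small hand-written table, simplified from the row shown above, and assumes Beautiful Soup is installed. It also shows that `string` is `None` for a cell holding several nested tags, while `get_text()` joins all of the text inside the cell.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for one table on the page (hand-written sample)
html = """<table>
<tr><th>Event</th><th>Distance</th></tr>
<tr>
<td>151205</td>
<td><span>3000</span><span>+2400</span></td>
</tr>
</table>"""

table = BeautifulSoup(html, "html.parser").table
row = table.find_all("tr")[1]

# contents mixes the <td> tags with the '\n' strings between them
print(row.contents)

# find_all("td") returns only the cells
cells = row.find_all("td")
print([c.string for c in cells])
print([c.get_text(" ", strip=True) for c in cells])
```

Whether to index into `contents` or to use `find_all("td")` is a judgment call; the second is less sensitive to whitespace in the page source.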

Finally, this list of records can be directly converted to a Pandas dataframe.

In [96]:
marginaldf = pd.DataFrame.from_records(recordset)