In [1]:
%%javascript
$.getScript('http://asimjalis.github.io/ipyn-ext/js/ipyn-present.js')

<h1 id="tocheading">Data model for Big Data</h1>
<div id="toc"></div>

![The master dataset in the Lambda Architecture serves as the source of truth for your Big Data system. Errors at the serving and speed layers can be corrected, but corruption of the master dataset is irreparable](https://s3-us-west-2.amazonaws.com/dsci6007/assets/fig2-1.png)

* Learn the key properties of data
* See how these properties are maintained in the fact-based model
* Examine the advantages of the fact-based model for the master dataset
* Express a fact-based model using graph schemas

The properties of data
-------------------------------------------------
* *Information* is the general collection of knowledge relevant to your Big Data system. It’s synonymous with the colloquial usage of the word data.
* *Data* refers to the information that can’t be derived from anything else. Data serves as the axioms from which everything else derives.
* *Queries* are questions you ask of your data. For example, you query your financial transaction history to determine your current bank account balance.
* *Views* are information that has been derived from your base data. They are built to assist with answering specific types of queries.
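
The distinction between data, views, and queries can be sketched in a few lines of Python. This is only an illustration: the transaction records, the amounts (in cents), and the function names are hypothetical and not part of the original material.

```python
from collections import defaultdict

# Raw data: individual transactions. Nothing here is derived from anything else.
# Accounts and amounts (in cents) are invented for illustration.
transactions = [
    {"account": "alice", "amount": 10000},
    {"account": "alice", "amount": -3000},
    {"account": "bob", "amount": 5000},
]

def build_balance_view(transactions):
    """Derive a view (account -> balance) from the raw transactions."""
    balances = defaultdict(int)
    for t in transactions:
        balances[t["account"]] += t["amount"]
    return dict(balances)

def query_balance(view, account):
    """Answer a query by reading the precomputed view."""
    return view.get(account, 0)

balance_view = build_balance_view(transactions)
print(query_balance(balance_view, "alice"))  # 7000
```

If the view is lost or computed incorrectly, it can be rebuilt from the transactions; the reverse is not possible.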

![Three possible options for storing friendship information for FaceSpace. Each option can be derived from the one to its left, but it’s a one-way process.](https://s3-us-west-2.amazonaws.com/dsci6007/assets/fig2-2.png)

![The relationships between data, views, and queries](https://s3-us-west-2.amazonaws.com/dsci6007/assets/fig2-3.png)

![Classifying information as data or a view depends on your perspective. To FaceSpace, Tom’s birthday is a view because it’s derived from the user’s birthdate. But the birthday is considered data to a third-party advertiser.](https://s3-us-west-2.amazonaws.com/dsci6007/assets/fig2-4.png)

### Data is raw

| Company | Symbol | Previous | Open  | High   | Low    | Close  | Net   |
|:-------:|:------:|:-------:|:------:|:------:|:------:|:------:|:-----:|
| Google  | GOOG   |  564.68 | 567.70 | 573.99 | 566.02 | 569.30 | +4.62 |
| Apple   | AAPL   |  572.02 | 575.00 | 576.74 | 571.92 | 574.50 | +2.48 |
| Amazon  | AMZN   |  225.61 | 225.01 | 227.50 | 223.30 | 225.62 | +0.01 |

A summary of one day of trading for Google, Apple, and Amazon stocks: previous close, opening, high, low, close, and net change.

![Relative stock price changes of Google, Apple, and Amazon on June 27, 2012, compared to closing prices on June 26 (www.google.com/finance). Short-term analysis isn’t supported by daily records but can be performed by storing data at finer time resolutions.](https://s3-us-west-2.amazonaws.com/dsci6007/assets/fig2-6.png)
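
As a rough sketch of why finer-grained data is rawer, the daily summary can be treated as a view derived from individual price ticks. The tick times and the function name below are hypothetical; the prices were chosen so that the result reproduces the GOOG row of the table above.

```python
# Hypothetical intraday ticks for one symbol: (time of day, price).
ticks = [
    ("09:30", 567.70),
    ("11:15", 573.99),
    ("13:40", 566.02),
    ("16:00", 569.30),
]
previous_close = 564.68

def daily_summary(ticks, previous_close):
    """Derive the daily open/high/low/close/net view from raw ticks."""
    # Zero-padded HH:MM strings sort chronologically.
    prices = [price for _, price in sorted(ticks)]
    return {
        "open": prices[0],
        "high": max(prices),
        "low": min(prices),
        "close": prices[-1],
        "net": round(prices[-1] - previous_close, 2),
    }

summary = daily_summary(ticks, previous_close)
print(summary["net"])  # 4.62
```

The summary can always be recomputed from the ticks, but the ticks cannot be recovered from the summary.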

#### Unstructured data is rawer than normalized data
![Semantic normalization of unstructured location responses to city, state, and country. A simple algorithm will normalize “North Beach” to NULL if it doesn’t recognize it as a San Francisco neighborhood.](https://s3-us-west-2.amazonaws.com/dsci6007/assets/fig2-7.png)

As a rule of thumb, if your algorithm for extracting the data is simple and accurate, like extracting an age from an HTML page, you should store the results of that algorithm. If the algorithm is subject to change, due to improvements or broadening the requirements, store the unstructured form of the data.
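
The following sketch illustrates the second case. It assumes a toy lookup-table normalizer (`KNOWN_LOCATIONS` and `normalize_location` are hypothetical names) and shows that keeping the raw strings lets old responses be renormalized after the algorithm improves.

```python
# Hypothetical lookup table; a real normalizer would be far larger and would evolve.
KNOWN_LOCATIONS = {
    "san francisco, ca": ("San Francisco", "CA", "USA"),
    "sf": ("San Francisco", "CA", "USA"),
    "new york, ny": ("New York", "NY", "USA"),
}

def normalize_location(raw):
    """Return (city, state, country), or None if the text is not recognized."""
    return KNOWN_LOCATIONS.get(raw.strip().lower())

# The unstructured responses are what gets stored in the master dataset.
raw_responses = ["San Francisco, CA", "SF", "North Beach"]
before = [normalize_location(r) for r in raw_responses]
print(before[2])  # None: the neighborhood is not recognized yet

# Later the algorithm improves. Because the raw strings were kept,
# the old responses can simply be normalized again.
KNOWN_LOCATIONS["north beach"] = ("San Francisco", "CA", "USA")
after = [normalize_location(r) for r in raw_responses]
print(after[2])  # ('San Francisco', 'CA', 'USA')
```

Had only the normalized values been stored, the "North Beach" response would have been reduced to NULL permanently.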
#### More information doesn't necessarily mean rawer data

### Data is immutable
* Human-fault tolerance
* Simplicity

| id  | name    | age | gender | employer  | location          |
|:---:|:-------:|:---:|:------:|:---------:|:------------------|
| 1   | Alice   | 25  | female | Apple     | Atlanta, GA       |
| 2   | Bob     | 36  | male   | SAS       | Chicago, IL       |
| 3   | Tom     | 28  | male   | Google    | San Francisco, CA |
| 4   | Charlie | 25  | male   | Microsoft | Washington, DC    |
| ... | ...     | ... | ...    | ...       | ...               |

A mutable schema for FaceSpace user information. When details change (say, Tom moves from San Francisco to Los Angeles), the previous value in his row is overwritten and lost.

![An equivalent immutable schema for FaceSpace user information. Each field is tracked in a separate table, and each row has a timestamp for when it’s known to be true. (Gender and employer data are omitted for space, but are stored similarly.)](https://s3-us-west-2.amazonaws.com/dsci6007/assets/fig2-9.png)

![Instead of updating preexisting records, an immutable schema uses new records to represent changed information. An immutable schema thus can store multiple records for the same user. (Other tables omitted because they remain unchanged.)](https://s3-us-west-2.amazonaws.com/dsci6007/assets/fig2-10.png)
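
A minimal sketch of the immutable approach, using Tom's two location records as they appear in this section. The in-memory list and the helper names (`record_location`, `location_as_of`) are assumptions made for illustration only.

```python
from datetime import datetime

# Append-only store of location records: (user_id, location, timestamp).
location_records = []

def record_location(user_id, location, timestamp):
    """Add a new record; existing records are never modified."""
    location_records.append((user_id, location, timestamp))

def location_as_of(user_id, when):
    """Return the latest location known for the user at time `when`."""
    candidates = [
        (ts, loc) for uid, loc, ts in location_records
        if uid == user_id and ts <= when
    ]
    return max(candidates)[1] if candidates else None

record_location(3, "San Francisco, CA", datetime(2012, 4, 4, 18, 31, 24))
record_location(3, "Los Angeles, CA", datetime(2012, 6, 17, 20, 9, 48))

print(location_as_of(3, datetime(2012, 5, 1)))  # San Francisco, CA
print(location_as_of(3, datetime(2012, 7, 1)))  # Los Angeles, CA
```

Both records remain in the store, so the earlier location is still available to any query that asks about an earlier point in time.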

### Data is eternally true
*e.g.* 
> The United States consisted of thirteen states on July 4, 1776.  

Special cases:
* Garbage collection
* Regulations

The fact-based model for representing data
-------------------------------------------------
### Example facts and their properties
![All of the raw data concerning Tom is deconstructed into timestamped, atomic units we call facts.](https://s3-us-west-2.amazonaws.com/dsci6007/assets/fig2-11.png)

```
struct PageView:
  DateTime timestamp
  String url
  String ip_address
```
To distinguish different pageviews, you can add a `nonce` to your schema—a 64-bit number randomly generated for each pageview:
```
struct PageView:
  DateTime timestamp
  String url
  String ip_address
  Long nonce
```
**The nonce, combined with the other fields, uniquely identifies a particular pageview.**
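
A Python sketch of the same idea, under the assumption that the nonce is drawn from a cryptographic random source and shifted into the signed 64-bit range to match `Long`. The class, the helper `new_nonce`, and the sample URL and IP address are illustrative only.

```python
import secrets
from dataclasses import dataclass, field
from datetime import datetime

def new_nonce():
    """Random 64-bit value, shifted into the signed range of a 64-bit long."""
    return secrets.randbits(64) - 2**63

@dataclass(frozen=True)
class PageView:
    timestamp: datetime
    url: str
    ip_address: str
    nonce: int = field(default_factory=new_nonce)

ts = datetime(2012, 4, 4, 18, 31, 24)
first = PageView(ts, "http://example.com/", "192.0.2.1")
# A genuinely separate pageview with identical fields gets its own nonce.
second = PageView(ts, "http://example.com/", "192.0.2.1")
# A redelivered copy of the first pageview carries the same nonce.
retry = PageView(ts, "http://example.com/", "192.0.2.1", nonce=first.nonce)

# Frozen dataclasses are hashable, so a set removes exact duplicates.
unique = {first, second, retry}
print(len(unique))  # 2, barring a 1-in-2**64 nonce collision
```

Without the nonce, `first` and `second` would be indistinguishable from a duplicate, and one real pageview would be lost during deduplication.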

##### Duplicates aren’t as rare as you might think

To quickly recap, the fact-based model  
* Stores your raw data as atomic facts
* Keeps the facts immutable and eternally true by using timestamps
* Ensures each fact is identifiable so that query processing can identify duplicates

### Benefits of the fact-based model
* Is queryable at any time in its history
* Tolerates human errors
* Handles partial information
* Has the advantages of both normalized and denormalized forms

Human errors can be corrected by deleting the erroneous facts. Queries then fall back to the most recent remaining fact, so the record reverts to its earlier state.

| user id | location            | timestamp               |
| ------- | ------------------- | ----------------------- |
| 1       | Atlanta, GA         | 2012/03/29 08:12:24     |
| 2       | Chicago, IL         | 2012/04/12 14:47:51     |
| 3       | San Francisco, CA   | 2012/04/04 18:31:24     |
| 4       | Washington, DC      | 2012/04/09 11:52:30     |
| ~~3~~   | ~~Los Angeles, CA~~ | ~~2012/06/17 20:09:48~~ |

To correct for human errors, simply remove the incorrect facts. This process automatically resets to an earlier state by “uncovering” any relevant previous facts.
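
The "uncovering" behavior can be sketched with the rows of the table above. The list-of-tuples representation and the function name `current_location` are assumptions for illustration.

```python
# Location facts from the table: (user id, location, timestamp).
facts = [
    (1, "Atlanta, GA", "2012/03/29 08:12:24"),
    (2, "Chicago, IL", "2012/04/12 14:47:51"),
    (3, "San Francisco, CA", "2012/04/04 18:31:24"),
    (4, "Washington, DC", "2012/04/09 11:52:30"),
    (3, "Los Angeles, CA", "2012/06/17 20:09:48"),  # entered by mistake
]

def current_location(facts, user_id):
    """Return the location from the user's most recent fact."""
    # Timestamps in this fixed, zero-padded format sort correctly as strings.
    matching = [(ts, loc) for uid, loc, ts in facts if uid == user_id]
    return max(matching)[1] if matching else None

print(current_location(facts, 3))  # Los Angeles, CA

# Correct the mistake by removing the bad fact; no other fact is touched.
bad_fact = (3, "Los Angeles, CA", "2012/06/17 20:09:48")
corrected = [f for f in facts if f != bad_fact]

print(current_location(corrected, 3))  # San Francisco, CA
```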

![The Lambda Architecture has the benefits of both normalization and denormalization by separating objectives at different layers.](https://s3-us-west-2.amazonaws.com/dsci6007/assets/fig2-15.png)

Graph schemas
-------------------------------------------------
### Elements of a graph schema
![Visualizing the relationship between FaceSpace facts](https://s3-us-west-2.amazonaws.com/dsci6007/assets/fig2-16.png)
* Nodes are the entities in the system.
* Edges are relationships between nodes.
* Properties are information about entities.


### The need for an enforceable schema
Suppose you chose to represent Tom’s age using JSON:
```json
{"id": 3, "field":"age", "value":28, "timestamp":1333589484}
```
There’s no way to ensure that all subsequent facts will follow the same format.
```json
{"name": "Alice", "field":"age", "value":25, "timestamp":"2012/03/29 08:12:24"}
{"id":2, "field":"age", "value":36}
```
Both of these examples are valid JSON, but they have inconsistent formats or missing data.
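
To make the problem concrete, here is a hand-rolled check of the three JSON facts above. It is only a sketch of what an enforceable schema provides: `REQUIRED_FIELDS` and `fact_errors` are hypothetical names, and a serialization framework would perform this kind of checking for you.

```python
import json

# The format implied by the first example fact: field name -> expected type.
REQUIRED_FIELDS = {"id": int, "field": str, "value": int, "timestamp": int}

def fact_errors(line):
    """Return a list of the ways a JSON fact deviates from the expected format."""
    fact = json.loads(line)
    errors = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in fact:
            errors.append(f"missing {name}")
        elif not isinstance(fact[name], expected):
            errors.append(f"{name} is not {expected.__name__}")
    for name in sorted(fact.keys() - REQUIRED_FIELDS.keys()):
        errors.append(f"unexpected {name}")
    return errors

lines = [
    '{"id": 3, "field":"age", "value":28, "timestamp":1333589484}',
    '{"name": "Alice", "field":"age", "value":25, "timestamp":"2012/03/29 08:12:24"}',
    '{"id":2, "field":"age", "value":36}',
]
for line in lines:
    print(fact_errors(line))
# []
# ['missing id', 'timestamp is not int', 'unexpected name']
# ['missing timestamp']
```

All three lines parse as JSON without complaint; only the explicit check reveals that two of them do not match the intended format.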

Why a serialization framework?
-------------------------------------------------
### Apache Avro

#### Primitive Types
The set of primitive type names is:  
* `null`: no value
* `boolean`: a binary value
* `int`: 32-bit signed integer
* `long`: 64-bit signed integer
* `float`: single precision (32-bit) IEEE 754 floating-point number
* `double`: double precision (64-bit) IEEE 754 floating-point number
* `bytes`: sequence of 8-bit unsigned bytes
* `string`: Unicode character sequence

Primitive types have no specified attributes.

Primitive type names are also defined type names. Thus, for example, the schema "string" is equivalent to:

    {"type": "string"}

#### Complex Types
Avro supports six kinds of complex types: `records`, `enums`, `arrays`, `maps`, `unions` and `fixed`.

See http://avro.apache.org/docs/current/spec.html for more details.

In [2]:
import json
import avro.schema

### Nodes

In [3]:
PersonID = [{
        "namespace": "analytics.avro",
        "type": "record",
        "name": "PersonID1",
        "fields": [
            {
                "name": "cookie",
                "type": "string"
            }
        ]
    },
    {
        "namespace": "analytics.avro",
        "type": "record",
        "name": "PersonID2",
        "fields": [
            {
                "name": "user_id",
                "type": "long"
            }
        ]
    }]

In [4]:
PageID = [{
        "namespace": "analytics.avro",
        "type": "record",
        "name": "PageID",
        "fields": [
            {
                "name": "url",
                "type": "string"
            }
        ]
    }]

In [5]:
Nodes = PersonID + PageID

### Edges

In [6]:
EquivEdge = {
        "namespace": "analytics.avro",
        "type": "record",
        "name": "EquivEdge",
        "fields": [
            {
                "name": "id1",
                "type": [
                    "PersonID1",
                    "PersonID2"
                ]
            },
            {
                "name": "id2",
                "type": [
                    "PersonID1",
                    "PersonID2"
                ]
            }
        ]
    }

In [7]:
PageViewEdge = {
        "namespace": "analytics.avro",
        "type": "record",
        "name": "PageViewEdge",
        "fields": [
            {
                "name": "person",
                "type": [
                    "PersonID1",
                    "PersonID2"
                ]
            },
            {
                "name": "page",
                "type": "PageID"
            },
            {
                "name": "nonce",
                "type": "long"
            }
        ]
    }

In [8]:
Edges = [EquivEdge, PageViewEdge]

### Properties

#### Page Properties

In [9]:
PageProperties = [{
        "namespace": "analytics.avro",
        "type": "record",
        "name": "PagePropertyValue",
        "fields": [
            {
                "name": "page_views",
                "type": "int"
            }
        ]
    }, 
    {
        "namespace": "analytics.avro",
        "type": "record",
        "name": "PageProperty",
        "fields": [
            {
                "name": "id",
                "type": "PageID"
            },
            {
                "name": "property",
                "type": "PagePropertyValue"
            }
        ]
    }]

Alternatively, `PagePropertyValue` can be defined inline at the point where it is first used:

In [10]:
PageProperties = [{
        "namespace": "analytics.avro",
        "type": "record",
        "name": "PageProperty",
        "fields": [
            {
                "name": "id",
                "type": "PageID"
            },
            {
                "name": "property",
                "type": {
                    "type": "record",
                    "name": "PagePropertyValue",
                    "fields": [
                        {
                            "name": "page_views",
                            "type": "int"
                        }
                    ]
                }
            }
        ]
    }]

#### Person Properties

In [11]:
PersonProperties = [
    {
        "namespace": "analytics.avro",
        "type": "record",
        "name": "Location",
        "fields": [
            {"name": "city", "type": ["string", "null"]},
            {"name": "state", "type": ["string", "null"]},
            {"name": "country", "type": [ "string","null"]}
        ]
    },
    {
        "namespace": "analytics.avro",
        "type": "enum",
        "name": "GenderType",
        "symbols": ["MALE", "FEMALE"]
    },
    {
        "namespace": "analytics.avro",
        "type": "record",
        "name": "PersonProperty",
        "fields": [
            {
                "name": "id",
                "type": [
                    "PersonID1",
                    "PersonID2"
                ]
            },
            {
                "name": "property",
                "type": [
                    {
                        "type": "record",
                        "name": "PersonPropertyValue1",
                        "fields": [{"name": "full_name", "type": "string"}]
                    },
                    {
                        "type": "record",
                        "name": "PersonPropertyValue2",
                        "fields": [{"name": "gender", "type": "GenderType"}]
                    },
                    {
                        "type": "record",
                        "name": "PersonPropertyValue3",
                        "fields": [{"name": "location", "type": "Location"}]
                    }
                ]
            }
        ]
    }]

### Tying everything together into data objects

In [12]:
Data = [
    {
        "namespace": "analytics.avro",
        "type": "record",
        "name": "Pedigree",
        "fields": [{"name": "true_as_of_secs", "type": "int"}]
    },
    {
        "namespace": "analytics.avro",
        "type": "record",
        "name": "Data",
        "fields": [
            {
                "name": "pedigree",
                "type": "Pedigree"
            },
            {
                "name": "dataunit",
                "type": [
                    {
                        "type": "record",
                        "name": "DataUnit1",
                        "fields": [{"name": "person_property", "type": "PersonProperty"}]
                    },
                    {
                        "type": "record",
                        "name": "DataUnit2",
                        "fields": [{"name": "page_property", "type": "PageProperty"}]
                    },
                    {
                        "type": "record",
                        "name": "DataUnit3",
                        "fields": [{"name": "equiv", "type": "EquivEdge"}]
                    },
                    {
                        "type": "record",
                        "name": "DataUnit4",
                        "fields": [{"name": "page_view", "type": "PageViewEdge"}]
                    }
                ]
            }
        ]
    }
]

In [13]:
# Order matters: each named type must be defined before a later schema references it.
schema = avro.schema.parse(json.dumps(Nodes + Edges + PageProperties + PersonProperties + Data))

Limitations of serialization frameworks
-------------------------------------------------
In order to enforce more rigorous business logic:  
* Wrap your generated code in additional code that checks the additional properties you care about, like ages being non-negative. 
* Check the extra properties at the very beginning of your batch-processing workflow.
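
A minimal sketch of the second approach, assuming facts arrive as plain dictionaries with `field` and `value` keys as in the JSON example earlier. The function names and the sample facts are hypothetical; the non-negative age rule is the one mentioned above.

```python
def check_age(age):
    """Return an error message, or None if the age is acceptable."""
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(age, bool) or not isinstance(age, int):
        return "age must be an integer"
    if age < 0:
        return "age must be non-negative"
    return None

def partition_facts(facts):
    """Split facts into those that pass the business rules and those that do not."""
    valid, rejected = [], []
    for fact in facts:
        error = check_age(fact["value"]) if fact.get("field") == "age" else None
        if error is None:
            valid.append(fact)
        else:
            rejected.append((fact, error))
    return valid, rejected

# Invented sample facts; the second passes a type-only schema check but is nonsense.
incoming = [
    {"id": 3, "field": "age", "value": 28},
    {"id": 5, "field": "age", "value": -4},
]
valid, rejected = partition_facts(incoming)
print(len(valid), len(rejected))  # 1 1
print(rejected[0][1])             # age must be non-negative
```

Running this as the first step of the batch workflow keeps invalid facts out of every downstream view while leaving them available for inspection.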

---

Lab
-------------------------------------------------
### Evolving your schema  
1. Work through the Avro [getting started guide](http://avro.apache.org/docs/current/gettingstartedpython.html).
2. Reproduce the fact-based graph schema in Gliffy.
3. Add "Age" as a user property.
4. Add links between web pages as edges.
5. Modify the Avro schema to allow this new property and edge.