# Designing Data-Intensive Applications Notes
# Chapter 2: Data Models and Query Languages

> The limits of my language mean the limits of my world.
—Ludwig Wittgenstein

* Clean data models hide the complexity of the layers below them
* There are many kinds of data models, each suited to particular application needs
    * Each has its own pros / cons
    
## Relational Model vs. Document Model
* SQL:
    * Best known data model today
    * Data is organized into ***relations***
        * AKA tables
    * Relations are composed of ***tuples***
        * AKA rows
* History of the Relational Model
    * Proposed by Edgar Codd in 1970
    * Originally used for business data processing
        * Now used across many kinds of applications:
            * ecommerce, games, SaaS productivity applications, etc.
    * Primary Alternatives:
        * Network Model
        * Hierarchical Model
        * Object Databases

### NoSQL
* NoSQL: *Not Only* SQL
* Reasons for NoSQL:
    * Need for greater scalability for large datasets or high write throughput
    * Preference for free / open source software
    * Specialized query operations 
    * Ability for more dynamic and expressive data models
* Polyglot Persistence:
    * Using relational databases alongside a variety of nonrelational datastores, each where it fits best
* Different applications have different requirements!
---

### Object-Relational Mismatch
* If data is stored in relational tables, there has to be a translation layer between the objects in the application code and the database model of tables, rows, and columns!
    * Impedance mismatch:
        * Term borrowed from electronics: every circuit has a certain impedance (resistance to alternating current) on its inputs and outputs. Power transfer across a connection is maximized when the output impedance of one circuit matches the input impedance of the next; otherwise the two are "mismatched".

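* Object-relational mapping (ORM) frameworks reduce the translation boilerplate, but can't completely hide the differences between the two models
* Example: 
    * Express a user profile with first_name, last_name, career, education fields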
    * Traditional SQL: 

    ```
    Table 1: users
    user_id | first_name | last_name

    Table 2: career
    id | user_id | job_title | organization

    Table 3: education
    id | user_id | school_name | start | end
    ```

        * Separate tables, joined back to the user through foreign key references
        * Can be messy if you need a multi-way join to rebuild one profile
    * Advanced SQL: 
        * Later versions of the SQL standard allow structured data types and XML data to be stored within a single row
        * Supported to varying degrees by databases such as MySQL and PostgreSQL
    * JSON / XML:
    ```json
    {
        "user_id": 251,
        "first_name": "Bill",
        "last_name": "Gates",
        "positions": [
            {
                "job_title": "co-founder",
                "organization": "Microsoft"
            },
            {
                "job_title": "co-chair",
                "organization": "Bill & Melinda Gates Foundation"
            }
        ],
        "education": [
            {
                "school_name": "Harvard University",
                "start": 1973,
                "end": 1975
            },
            {
                "school_name": "Lakeside School, Seattle",
                "start": null,
                "end": null
            }
        ]
    }
    ```
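        * With native structured / XML / JSON column types, the database can typically index and query inside the stored value
        * Alternatively, encode the data as JSON or XML in a plain text column and let the application interpret its structure and content
            * With this approach you typically can't use the database to query for values inside the encoded column
        * The document form is widely thought to reduce the impedance mismatch between the application code and the storage layer
        * Seen as simpler due to having better *locality* than the multi-table layout: the whole profile lives in one place and can be fetched with one query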
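
To make the locality point concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The table layout, the `start_year` / `end_year` column names, and the helper `load_profile` are illustrative assumptions, not taken from the book. It shows that rebuilding one profile from normalized tables takes three queries, whereas the document form is a single self-contained value.

```python
import json
import sqlite3

# In-memory database with a hypothetical normalized schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (user_id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT);
    CREATE TABLE positions (id INTEGER PRIMARY KEY, user_id INTEGER REFERENCES users(user_id),
                            job_title TEXT, organization TEXT);
    CREATE TABLE education (id INTEGER PRIMARY KEY, user_id INTEGER REFERENCES users(user_id),
                            school_name TEXT, start_year INTEGER, end_year INTEGER);
    INSERT INTO users VALUES (251, 'Bill', 'Gates');
    INSERT INTO positions VALUES (1, 251, 'co-founder', 'Microsoft');
    INSERT INTO positions VALUES (2, 251, 'co-chair', 'Bill & Melinda Gates Foundation');
    INSERT INTO education VALUES (1, 251, 'Harvard University', 1973, 1975);
    INSERT INTO education VALUES (2, 251, 'Lakeside School, Seattle', NULL, NULL);
""")


def load_profile(conn, user_id):
    """Rebuild one profile from three tables (one query per table)."""
    first, last = conn.execute(
        "SELECT first_name, last_name FROM users WHERE user_id = ?", (user_id,)
    ).fetchone()
    positions = [
        {"job_title": title, "organization": org}
        for title, org in conn.execute(
            "SELECT job_title, organization FROM positions "
            "WHERE user_id = ? ORDER BY id", (user_id,))
    ]
    education = [
        {"school_name": school, "start": start, "end": end}
        for school, start, end in conn.execute(
            "SELECT school_name, start_year, end_year FROM education "
            "WHERE user_id = ? ORDER BY id", (user_id,))
    ]
    return {"user_id": user_id, "first_name": first, "last_name": last,
            "positions": positions, "education": education}


profile = load_profile(conn, 251)
# Document form: the same data serialized as one JSON string.
document = json.dumps(profile)
print(document)
```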
---
        
### Many-to-One and Many-to-Many Relationships

* Why store an ID instead of a plain-text string?
    * Avoids ambiguity
    * Consistent style and spelling
    * Easy to update
    * Duplication prevention
    
* Normalization: removing duplicated data by storing each fact in one place and referencing it by ID
    * Requires many-to-one relationships, which are resolved with joins
    * Support for joins in document databases is often weak
* As an application grows, data tends to become more interconnected as features are added.
    * Ex: Rather than storing an organization or school as a plain string, store a reference to an entity with its own record
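
A minimal sketch of why IDs make updates easy, using plain Python dicts. The `regions` lookup table, the `region_id` field, and the sample users are made up for illustration. Because each user stores only the ID, renaming the region is a single write and nothing is duplicated.

```python
# Hypothetical lookup table: one canonical record per region.
regions = {91: "Greater Seattle Area"}

# Users store the region ID, not the display string.
users = [
    {"user_id": 251, "first_name": "Bill", "region_id": 91},
    {"user_id": 252, "first_name": "Ada", "region_id": 91},
]


def region_name(user):
    """Resolve the many-to-one reference at read time."""
    return regions[user["region_id"]]


# Renaming the region is one write; every referencing user sees the change.
regions[91] = "Seattle Metropolitan Area"
names = [region_name(u) for u in users]
print(names)
```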
---

### Are Document Databases Repeating History?
* How can you best represent many-to-many relationships?

* Hierarchical Model
    * IBM's Information Management System
        * Highly popular database
        * Popularized the Hierarchical Model
    * Represents all data as a tree of records nested within records
        * Each record has exactly ***one*** parent
        * Highly similar to the JSON model used by document databases
        * Worked well for one-to-many relationships
        * Worked poorly for many-to-many relationships and did not support joins
    * Developers had to do one of the following:
        * Denormalize/duplicate the data
        * Resolve the records manually
* Network Model (now defunct)
    * Created by the Conference on Data Systems Languages (CODASYL)
    * Created as a generalization of the Hierarchical Model
    * ***One*** record can have ***many*** parents
        * This allows many-to-one and many-to-many relationships to be modeled!
    * Needs to utilize an access path:
        * The path from a root record to the target record along a chain of links
    * Links between records were pointers rather than foreign keys
        * Can be thought of as linked list traversal
    * Because a record could have multiple parents and be reached by several paths, application code had to keep track of all of the various relationships.
        * As such, querying and updating were complicated and inflexible
* Relational Model (most popular)
    * Primarily known through SQL
    * Relation: Collection of Tuples
        * Relation -> Table
        * Tuples -> Rows
    * Utilizes a query optimizer:
        * Automatically decides which parts of the query to execute in what order and with what indexes. 
        * The optimizer chooses the access path automatically, so the developer doesn't have to hand-code one for each query. 

* Document Databases:
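    * Also allow nested records to be stored within their parent record
        * Similar to the hierarchical model, rather than the separate-table relational style
    * Not fundamentally different from relational databases when representing many-to-one and many-to-many relationships
        * Both refer to the related item by a unique identifier
            * Document databases: document reference
            * Relational databases: foreign key
        * Identifiers are resolved at read time using a join or follow-up queries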
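
A rough sketch of what resolving document references with follow-up queries can look like when the database offers no join. Plain dicts stand in for document collections here; the ID format (`org:1`, `user:251`) and the function `load_user_with_organizations` are assumptions made for illustration. The point is that the join moves into application code: one lookup for the user, then one more per reference.

```python
# Hypothetical in-memory "collections", keyed by document ID.
organizations = {
    "org:1": {"name": "Microsoft"},
    "org:2": {"name": "Bill & Melinda Gates Foundation"},
}
users = {
    "user:251": {
        "first_name": "Bill",
        "positions": [
            {"job_title": "co-founder", "organization_id": "org:1"},
            {"job_title": "co-chair", "organization_id": "org:2"},
        ],
    }
}


def load_user_with_organizations(user_id):
    """Emulate a join in application code by following each reference."""
    user = users[user_id]  # first lookup: the user document
    resolved = []
    for position in user["positions"]:
        # Follow-up lookup: resolve the document reference to its target.
        org = organizations[position["organization_id"]]
        resolved.append(
            {"job_title": position["job_title"], "organization": org["name"]}
        )
    return {"first_name": user["first_name"], "positions": resolved}


result = load_user_with_organizations("user:251")
print(result)
```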
---

### Relational Versus Document Databases Today

* The kinds of relationships between data items affect which data model to use

* Document Data Model:
    * Pros:
        * Schema Flexibility
        * Locality-Enhanced Performance
        * Reduced impedance mismatch
    * Cons:
        * Poor support for joins
        * Cannot refer directly to a nested item within a document
    * Best used if the application data has a document-like structure
        * Tree of one-to-many relationships, typically loaded all at once
        * Shredding (the relational alternative):
            * Splitting a document-like structure into multiple tables
            * Can lead to cumbersome schemas and unnecessarily complicated application code
    * Avoid for many-to-many relationships!
        * Emulating joins in application code leads to more complex code and worse performance
* Relational Data Model
    * Pros:
        * Better support for joins
        * Handles many-to-one and many-to-many relationships well

* Highly interconnected data:
    * Best: Graph Models
    * Mid: Relational Data Model
    * Worst: Document Model
---

### Schema Flexibility in Document Models:
* Document databases are sometimes called schemaless
    * Misleading
    * There is an implicit schema but it's not enforced
    * Schema-on-read: 
        * Structure of the data is implicit and only interpreted on read
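        * Similar to dynamic (runtime) type checking, as in Python
        ```python
        # `user` is a document loaded as a dict (or None).
        # Older documents only have `name`, so derive first_name on read.
        if user and user.get("name") and not user.get("first_name"):
            user["first_name"] = user["name"].split(" ")[0]
        ```
        * Great for heterogeneous data whose structure can change at any time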
    * Schema-on-write:
        * Structure is explicit and the database ensures data conformity
        * Similar to static (compile-time) type checking
        ```sql
        ALTER TABLE users ADD COLUMN first_name text;
        UPDATE users SET first_name = split_part(name, ' ', 1);  -- PostgreSQL
        ```
        * Great for homogeneous data where every record is expected to have the same structure
---
### Data Locality for Queries:
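* A document is usually stored as a single continuous string (encoded as JSON, XML, or a binary variant)
* Storage Locality Performance Advantage:
    * Only applies if the app needs a large part / all of the document at the same time
    * With data split across multiple tables, retrieving it all requires multiple index lookups
* Cons:
    * On read, the database typically has to load the entire document, even if only a small part is needed
    * Updates usually involve rewriting the whole document
        * Only modifications that don't change the encoded size can easily be performed in place
* Overall, these performance constraints significantly narrow the situations in which document databases are useful
    * Keep documents fairly small and avoid writes that increase document size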
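
A toy model of the in-place update constraint, assuming the document is stored as one contiguous JSON-encoded byte string. The helpers `encode` and `fits_in_place` are hypothetical, and real storage engines handle this in more sophisticated ways. The sketch only shows that a same-length change could overwrite the existing bytes, while a longer value would force the whole document to be rewritten.

```python
import json


def encode(doc):
    """Serialize a document to the bytes that would be stored contiguously."""
    return json.dumps(doc, sort_keys=True).encode("utf-8")


def fits_in_place(old_doc, new_doc):
    """True if the new encoding is exactly as long as the old one,
    so it could overwrite the same bytes without moving anything."""
    return len(encode(old_doc)) == len(encode(new_doc))


doc = {"user_id": 251, "first_name": "Bill", "last_name": "Gates"}
same_size = dict(doc, first_name="Will")   # 4 characters replaced by 4
larger = dict(doc, first_name="William")   # 4 characters replaced by 7

print(fits_in_place(doc, same_size))  # True
print(fits_in_place(doc, larger))     # False
```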
---

### Convergence of Document and Relational Databases
* Most relational database systems have long supported XML
    * Local modifications to XML documents
    * Ability to query / index inside XML documents
* Most relational database systems also support JSON

---
## Query Languages for Data
* Declarative Query
    * ex: SQL
    * You specify the pattern of the data you want
        * What conditions the results must meet
        * How you want the data transformed (e.g. sorted, grouped, aggregated)
    * You do NOT tell the computer HOW to achieve this goal
        * Up to the query optimizer to decide which indexes / join methods to use and in what order to execute the query
    * Seen as easier to work with, since implementation details stay hidden under the hood
        * The database can introduce performance improvements without requiring any changes to queries!
    * Lends itself to parallel execution, since only the pattern of the results is specified, not the algorithm
* Imperative Query
    * ex: IMS/CODASYL
    * Tells the computer to perform certain operations in a certain order
        * Line by Line
    * Many commonly used programming languages are imperative     
    * Very HARD to parallelize, since the instructions must run in a particular order
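
The contrast can be sketched side by side in Python. The `animals` data and both function names are made up for illustration, and `sqlite3` stands in for any SQL database. The imperative version spells out the loop step by step; the declarative version only states the condition and leaves the execution strategy to the database.

```python
import sqlite3

# Hypothetical sample data: (name, family) pairs.
animals = [
    ("Great white shark", "Sharks"),
    ("Blue whale", "Whales"),
    ("Hammerhead shark", "Sharks"),
]


def get_sharks_imperative(rows):
    """Imperative: spell out the loop, the test, and the order of steps."""
    sharks = []
    for name, family in rows:
        if family == "Sharks":
            sharks.append(name)
    return sharks


def get_sharks_declarative(rows):
    """Declarative: state the condition; the database decides how to run it."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE animals (name TEXT, family TEXT)")
    conn.executemany("INSERT INTO animals VALUES (?, ?)", rows)
    result = conn.execute(
        "SELECT name FROM animals WHERE family = 'Sharks' ORDER BY name"
    )
    return [name for (name,) in result]


print(get_sharks_imperative(animals))
print(get_sharks_declarative(animals))
```

Adding an index on `family` later would not require touching `get_sharks_declarative`, whereas speeding up the imperative version means rewriting the loop itself.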

### Declarative Queries
* Declarative Queries are not just for databases!
    * CSS and XSL are also declarative languages, used to select the elements a style applies to
* Declarative: 
    ```css
    li.selected > p {
        background-color: blue;
    }
    ```
* Imperative 
    * Imagine a JavaScript block that uses the DOM API to get all `li` elements, checks each one for the `selected` class, and then sets the background color on each of its `p` children.
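
The imperative approach described above can be sketched in Python, with `xml.etree.ElementTree` standing in for the browser DOM API (a substitution made only to keep the example self-contained and runnable). The markup is a made-up fragment. Compare how much bookkeeping the loop needs against the one-rule CSS selector.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: one selected list item and one unselected.
markup = """
<ul>
  <li class="selected"><p>Sharks</p></li>
  <li><p>Whales</p></li>
</ul>
""".strip()

root = ET.fromstring(markup)

# Imperative equivalent of: li.selected > p { background-color: blue; }
for li in root.iter("li"):
    classes = (li.get("class") or "").split()
    if "selected" in classes:
        for child in li:  # direct children only, like the `>` combinator
            if child.tag == "p":
                child.set("style", "background-color: blue")

styled = [p.text for p in root.iter("p") if p.get("style")]
print(styled)
```

Note that this loop never removes the style if the `selected` class later moves to another item, which is one more thing the declarative rule handles without extra code.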

