<img src="https://github.com/christopherhuntley/DATA6510/blob/master/img/Dolan.png?raw=true" width="180px" align="right">

# **DATA 6510**
# **Lesson 5: Table Design and Normalization** 
_The science of bulletproofing your tables._

## **Learning Objectives**
### **Theory / Be able to explain ...**
- Tradeoffs that every designer makes
- Table normalization and normal forms
- The Entity Attribute Value database model

### **Skills / Know how to ...**
- Break a large table into smaller *normalized* tables
- Use relational notation to describe table schema
- Detect when a choice of keys will potentially corrupt data

--------
## **LESSON 5 HIGHLIGHTS**

#@title Run this cell if video does not appear
%%html
<div style="max-width:1000px">
  <div style="position: relative;padding-bottom: 56.25%;height: 0;">
    <iframe style="position: absolute;top: 0;left: 0;width: 100%;height: 100%;" rel="0" modestbranding="1"  src="https://www.youtube.com/embed/l1laoDAoR5Q" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
  </div>
</div>

---
## **BIG PICTURE: Why programmers suck at database design**
In some ways, programming is the most arrogant profession of all. Software is inherently malleable in ways that nothing else can possibly be. Like science fiction writers, programmers can alter the laws of physics to suit whatever the needs are at the time. However, unlike fiction writers, programmers can go a step further by building *and running* the universes they design. Software is not a simulation or a movie, it is the very embodiment of whatever the programmer wants it to be. That is real power, just in a very narrow universe. 

These days most programmers learn their craft by building *apps* of one sort or another. An app is all about process, handling whatever actions the user chooses. If there is data involved then it rarely exists beyond a single use of the app, perhaps as cache or maybe a message to be sent to a server somewhere in the cloud. Outside of this scope, the app programmer generally does not really care. It's all beyond their control anyway. 

It is in the cloud that the persistent part of the software exists. If data is to be stored and shared among many users then it is a *DataOps* programmer who will design and build any necessary middleware and data repositories. For these programmers the world is less about the dynamics of the app and more about the permanent structures needed to keep it running. 

There are some programmers, who like to call themselves *full stack developers*, that do both the frontend app and the backend server development. However, if you dig even a little bit into their knowledge base, you will likely find that they are 80% frontend and 20% backend. They know just enough about the backend to keep the apps running but don't really like doing it very much. Instead, they are always looking for shortcuts so they can spend more time on making the visible part of the app that much nicer. 

The same kind of milieu is common in data science, where the sexy frontend stuff that everybody sees is the models and the visuals. Like the apps developers, data scientists see data management as a chore. To them everything is just better if each project has a massive dataset (table) that they can build models from. If there are any bugs in the data then they will just program around them. Why not? The tools make it easy to do so. 

Where does this leave us? In a world where fewer and fewer programmers *really* understand database design. There just isn't enough to get excited about when one can get so much instant gratification from a UI tweak or running a fancy new machine learning algorithm. Honestly, who can blame them? Nobody is going to pat them on the back for getting the backend right but everybody will exclaim in excitement when an analytical model unearths a previously unknown insight and then distills it down to *just the right story*. 

That said, always be on the lookout for data errors that can't be programmed around. Sometimes they make the difference between being right and dead wrong. 

In this lesson we will learn about table design, starting with the tradeoffs a designer invariably has to make before moving on to the normatively *correct* techniques of normalization and table decomposition. We will conclude with a discussion of when to throw out correctness in favor of convenience, speed, or analyst preferences. 

 ---
## **Design Tradeoffs**
### **The Eternal Dilemma**

Design is about making decisions. If we make the right decisions, then the right systems get built and everybody is blissfully happy. We might not get the credit but people are happy nonetheless. If we make the wrong decisions then everybody is upset *at us*. 

So what is the right way to design a system? Well, if that were answerable in a paragraph, then it wouldn't be design. We have to consider what is being asked of the system, what solutions are available, and what we can afford. In other words, it comes down to tradeoffs and priorities. 

We will now take a look at a few eternal data design priorities, in what should be increasing importance for most applications. However, your mileage may vary depending on the needs at the time. 

### **Minimizing Space**
In the old days before big data, storage was often the most expensive part of a computer system. Programmers would do just about any amount of programming to avoid buying new storage hardware. They would literally count characters to minimize the number of bytes a given file required on disk. 

To this end, they came up with some tricks that often shaved off kilobytes without having to resort to file compression. A few examples:
- Repeating fields, where each line of a file only recorded what was different from the line above.
- Cryptic codes in place of long strings of text. Often they were hardwired into the programs, working like magic incantations when used by people in the know. 
- Overloading fields so that multiple facts could be stored in one field. 

You can see this same kind of thinking today in the messages passed between the front end app and a server. However, network bandwidth is becoming so plentiful that even this last bastion of space efficiency is just not important to worry about. 

Space is cheap and getting cheaper. 

### **Maximizing Calculation Speed**
Along the same lines as with space, raw speed has historically been prioritized over correctness. Long ago it was because computers were so slow. These days it is because we ask so much more of our computer systems. If we can shave 5% of the computing time off a given operation that will be performed billions of times, then it is well worth it to do so.

Relevant techniques for raw speed include:
- Precomputing whatever can be done in advance, even when it swells storage with redundant data.
- Approximating results whenever 100% fidelity is not strictly necessary.
- Locating data closer to each user, even when it means some data will be out of sync with others.

Of course, computers are getting faster and faster. However, expect the emphasis on raw speed to continue, as demands for speed will likely increase faster than we can build bigger and faster hardware.  

### **Maximizing Coherency**
Coherency is the ability to make sense of the data. Do all the facts fit together to tell coherent stories? Is each fact expressed in the best possible way? 

Generally, data coherency has been the domain of data modelers, who are more concerned with the stories than the data itself:
- What are the entities being tracked?
- What data is collected about each one? 
- How do the entities relate to each other? 

These sorts of questions never get old. They are focused on the same things as the app developer and the data scientists. 

We will touch on some of these questions in this lesson, then devote the bulk of Lesson 6 to entity relationship modeling. 

### **Minimizing Risk of Data Corruption**
Data integrity is an essential quality that never gets old. It is literally seeking to put the truth (and only the truth) into our databases. It is getting harder and harder to achieve, however. 

Big data is ugly data. It often comes in corrupted, forcing the database system to clean it up before it can be stored. If the system is going to do that then it needs to have a goal, a definition of what *correct* and *clean* are. If data can't be fixed then the system should reject it rather than accept a lie as the truth. 

It is this last design priority that is at the heart of table design and normalization. If we design our tables so that they follow a few (not-so-easy) rules, then we can avoid the vast majority of data corruption errors or, as we will call them, **data anomalies**. 

---
## **Relational Notation**

In order to design tables we need concise language to describe them. We have already seen ER diagrams (and will again in Lesson 6), but often we don't want or need diagramming software, especially when we're just getting started and table names, columns, etc. may change. For that we use **relational notation**.

In this lesson we will adopt the following convention:  
`Table_Name(`**`primary, key, columns`**, `non, key, columns,` <u>`foreign, key, columns`</u>`)`

- The table name uses `Initial_Caps`.
- Columns are listed `(`inside parentheses`)` immediately after the table name.
- Primary key columns **`are in bold`**; on a whiteboard we might use an alternate color instead.
- Non-key columns are in `regular text`.
- Foreign key columns are <u>`underlined`</u>.

With this notation in place we can design dozens of tables at a time without worrying too much about details that can be worked out later. 

In the Movies Tonight case (which we'll start later in this lesson) we will be designing the following tables, starting with a messy spreadsheet:
- `Artist(`**`artistID`**, `name)`
- `Movie(`**`movieID`**, `title,rating)`
- `Theater(`**`theaterID`**, `name, location, phone)`
- `Credit(`**`creditID`**, `ccode`, <u>`movieID`</u>,<u>`artistID`</u>`)`
- `Show(`**`showID`**, `showtime`, <u>`movieID`</u>,<u>`theaterID`</u>`)`
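
As a rough sketch of how this notation maps onto SQL, three of the tables listed above could be written as SQLite table definitions like the ones below. The column data types are assumptions made for illustration; relational notation deliberately says nothing about them.

```python
import sqlite3

# Column types are assumptions; the relational notation does not specify them.
ddl = """
CREATE TABLE Movie (
    movieID INTEGER PRIMARY KEY,                      -- bold in the notation
    title   TEXT,
    rating  TEXT
);
CREATE TABLE Theater (
    theaterID INTEGER PRIMARY KEY,
    name      TEXT,
    location  TEXT,
    phone     TEXT
);
CREATE TABLE Show (
    showID    INTEGER PRIMARY KEY,
    showtime  TEXT,
    movieID   INTEGER REFERENCES Movie(movieID),      -- underlined in the notation
    theaterID INTEGER REFERENCES Theater(theaterID)   -- underlined in the notation
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(ddl)

# List the tables that were created.
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
print(tables)
```

The one-line notation is much quicker to revise than the DDL, which is why it is handy while table and column names are still in flux.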




---
## **Normalization**

**Normalization** is a process of breaking large, messy tables into smaller, more coherent ones.

>The term "normalization" actually has a bit of political history behind it. At about the time that researchers were formalizing the rules of normalization in the early 1970s, the US was *normalizing relations* with China. The two countries were going to have to adopt a few conventions in order to work together. Why not apply the same term to making tables cohabitate nicely within a database?

The normalization process has four goals:
- Each table represents a single subject.
- No data item will be unnecessarily stored in more than one table.
- All non-prime (not PK) attributes in a table are dependent on the PK (and only the PK).
- Each table is devoid of insertion, update, and deletion **anomalies**.

These rules are actually a bit stricter than the **coherent relation** rules we learned in Lesson 4. In other words, normalization always produces coherent relations but we don't always need to *fully* normalize in order to have coherent relations. Thus, the normalization process defines **degrees of normalization** called **normal forms**:
- 1NF
- 2NF
- 3NF
- ...

Each normal form builds on the ones before it, applying ever stricter conditions that have to be satisfied.

We will need a little bit more math before we get started with the normal forms. 

### **Functional Dependencies**
A **functional dependency** within a table is when a group of columns can be used to *look up* or *derive* the values of another group of columns. 

We write out dependencies as mappings like the ones we used in Lesson 4:  
**determinants $\rightarrow$ dependents**

Given the values of the **determinant** columns on the left, we can deduce the value of the **dependent** columns on the right. Or, more concisely, the determinants *determine* the dependents.

The most obvious functional dependency derives from the row number within a table. Given the row number we can simply look up the values in the columns. 

Let's say we have a table like this:  
`Student(`**`studentID`**,`name, dorm, room, fee)`

Then we can pretty easily deduce a few dependencies like
- `studentID` $\rightarrow$  `name`
- `studentID` $\rightarrow$  `(dorm, room, fee)`
- `studentID` $\rightarrow$ `(name, dorm, room, fee)`

These are of course redundant, with the first two dependencies implied by the third. 

The following rules can be used to simplify a set of dependencies to just the *non-redundant* ones we need for normalization: 

- **If A $\rightarrow$ (B, C), then A $\rightarrow$ B and A $\rightarrow$ C.**   
  *This is the **decomposition** rule.*
- **If A $\rightarrow$ B and A $\rightarrow$ C, then A $\rightarrow$ (B, C).**   
  *This is the **union** rule.*
- **If (A,B) $\rightarrow$ C, then we cannot assume that A $\rightarrow$ C or that B $\rightarrow$ C.**  
  *This is less of a rule than a warning.*
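
The warning in the last rule can be seen with a small checker. This is a minimal sketch: the `holds` function and the sample rows are invented for illustration. Keep in mind that sample data can only *refute* a dependency; it takes knowledge of the business rules to confirm one.

```python
def holds(rows, determinant, dependent):
    """Return True if determinant -> dependent is consistent with every row."""
    seen = {}
    for row in rows:
        key = tuple(row[c] for c in determinant)
        value = tuple(row[c] for c in dependent)
        # The same determinant values must always map to the same dependents.
        if seen.setdefault(key, value) != value:
            return False
    return True

# Hypothetical sample rows, invented for illustration.
rows = [
    {"A": 1, "B": "x", "C": 10},
    {"A": 1, "B": "y", "C": 20},
    {"A": 2, "B": "x", "C": 30},
]

ab_to_c = holds(rows, ["A", "B"], ["C"])  # (A,B) -> C
a_to_c = holds(rows, ["A"], ["C"])        # A -> C
b_to_c = holds(rows, ["B"], ["C"])        # B -> C
print(ab_to_c, a_to_c, b_to_c)
```

In this sample the composite (A,B) determines C, yet neither A nor B does so alone.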

So, if after simplifying all the dependencies to just the essential ones, we find only one dependency, and its determinant is the primary key, then the table is fully normalized. 

> We say a table is in **Domain Key Normal Form (DKNF)** when all functional dependencies are on the primary key and only the primary key. There is no way to normalize beyond that. 

### **What's an Anomaly?**
We learned about anomalies as part of Data Integrity in Lesson 4. Anomalies are violations of referential integrity constraints caused by adding, updating, or deleting data. 
- **insertion anomaly**: adding a new row to a table causes a foreign key to become ambiguous
- **update anomaly**: editing data in a row causes a foreign key to become invalid or ambiguous
- **deletion anomaly**: deleting a row triggers a referential integrity violation

Generally, most anomalies reduce down to one or more of the following bugs:
- a fact is defined in more than one place (and can become inconsistent)
- a key reference has a typo so that it doesn't match the intended primary key
- a row has been deleted, making any references to it invalid
- a function or other calculation makes an invalid assumption about the ordering of the rows or columns of the table. 

If it is possible for any of these things to happen then you need to reconsider your system design. Note that normalization only addresses the first three bugs. If the fourth bug is present then look for a more capable app programmer. That's a rookie mistake that can be fixed with a `SELECT` query. 
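
To make the deletion case concrete, here is a minimal sketch using cut-down versions of the `Movie` and `Show` tables from the relational notation section. The sample row values are invented. Note that SQLite only enforces foreign keys when `PRAGMA foreign_keys` is switched on for the connection.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforcement is off by default in SQLite
conn.executescript("""
CREATE TABLE Movie (movieID INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE Show (
    showID   INTEGER PRIMARY KEY,
    showtime TEXT,
    movieID  INTEGER REFERENCES Movie(movieID)
);
-- Invented sample rows.
INSERT INTO Movie VALUES (1, 'Movie X');
INSERT INTO Show  VALUES (1, '7:00 PM', 1);
""")

try:
    # Deleting the movie would leave the show pointing at nothing.
    conn.execute("DELETE FROM Movie WHERE movieID = 1")
    outcome = "deleted"
except sqlite3.IntegrityError as err:
    outcome = "rejected"
    print("Deletion rejected:", err)
```

With enforcement on, the database refuses the delete rather than accepting a dangling reference.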


### **1st Normal Form (1NF)**

A table in first normal form has
- a primary key column (with no duplicates)
- no multivalued columns (lists of values)
- no repeating groups (of row values)

The second and third bullets need a bit of explanation. 

The following table has a multi-valued column:

| name | email addresses |
|------| --------------|
| Barb Ackue | backue@acmesales.com, barb.ackue@gmail.com |
| Buck Kinnear | bkinnear@acmesales.com, buckkinnear2315@hotmail.com |

**As a general rule if we are tempted to use a plural name for a column then it is likely multivalued.** Presence of a multivalued column is *normalized* away by creating a separate table, with one row per item on the value list:
- `Contact(`**`contact_id`**, `name)`
- `Contact_Email(`**`contact_email_id`**, `email, usage,`<u>`contact_id`</u>`)`

Notice that we used a foreign key to link the two tables together. Foreign keys are always on the "many" side of a relationship. Also, we added a `usage` column so we know how the email address is to be used (work, home, spam, etc.). 
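
The split described above might look like this in plain Python. It is only a sketch: the surrogate ids are simply assigned by counting, and `usage` is left empty because the original table does not record it.

```python
# The multivalued table from above: one (name, "email addresses") pair per row.
raw = [
    ("Barb Ackue", "backue@acmesales.com, barb.ackue@gmail.com"),
    ("Buck Kinnear", "bkinnear@acmesales.com, buckkinnear2315@hotmail.com"),
]

contacts = []        # rows of Contact(contact_id, name)
contact_emails = []  # rows of Contact_Email(contact_email_id, email, usage, contact_id)

for contact_id, (name, emails) in enumerate(raw, start=1):
    contacts.append((contact_id, name))
    for email in emails.split(","):
        # usage is unknown in the source data, so it is stored as None
        contact_emails.append(
            (len(contact_emails) + 1, email.strip(), None, contact_id))

print(contacts)
print(contact_emails)
```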

The table below has a repeating group of rows:

| name | email |
|------| --------------|
| Barb Ackue |	backue@acmesales.com |
|             | barb.ackue@gmail.com |
| Buck Kinnear | bkinnear@acmesales.com |
|               | buckkinnear2315@hotmail.com |

Here the assumption is that the names carry over (repeat) from one row to the next *unless* overwritten by a new name. This does get around the multivalued column but it also makes it so that we can't sort the rows by name without messing up the meaning. Also, what is the primary key of this table? It's pretty dicey all around. 

**If we have to know the value of the row (column) immediately before the current row (column) then we have a repeating group.**
The simplest solution to a repeating group bug is to fill in the blanks and then create a proper primary key. The data will have lots of redundancies but at least it will be 1NF. A *better* solution is the same as for the multivalued column bug; split the data into smaller tables with cross-references: 
- `Contact(`**`contact_id`**, `name)`
- `Contact_Email(`**`contact_email_id`**, `email, usage,`<u>`contact_id`</u>`)`
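
The simpler "fill in the blanks" fix can be sketched in a few lines of Python. The `row_id` surrogate key is an assumption added here so that every row has a proper primary key.

```python
# The repeating-group table from above; blank names carry over from the row before.
rows = [
    ("Barb Ackue", "backue@acmesales.com"),
    ("", "barb.ackue@gmail.com"),
    ("Buck Kinnear", "bkinnear@acmesales.com"),
    ("", "buckkinnear2315@hotmail.com"),
]

filled = []
current_name = None
for row_id, (name, email) in enumerate(rows, start=1):
    if name:                      # a non-blank name starts a new group
        current_name = name
    filled.append((row_id, current_name, email))

print(filled)
```

Once the blanks are filled, the rows can be sorted in any order without changing their meaning.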

### **2nd Normal Form (2NF)**
A table is in second normal form if:
- it is already in 1NF
- there are no dependencies on just part of the primary key

This rule only really comes into play if the table has a **composite primary key** (with multiple columns). If any of the non-key columns can be determined with a subset of the primary key columns then we have a 2NF violation. 

**Let (A,B) be a composite key for a table. If B $\rightarrow$ C for some other column(s) C then the table violates 2NF.** The fix is to create a new table for the B $\rightarrow$ C dependency, with B as the primary key: 
- `Table1(`**`A,B`**,`D,E,F)`
- `Table2(`**`B`**,`C)`

Note that we removed column C from `Table1` because it can be looked up from `Table2`
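
Here is one way the 2NF fix could play out in SQLite. The `Enrollment` table, its columns, and its rows are all hypothetical: `(student_id, course_id)` plays the role of (A,B), `course_title` is C, and `grade` is D.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical table: the key is (student_id, course_id),
-- but course_id alone determines course_title.
CREATE TABLE Enrollment (
    student_id INTEGER, course_id TEXT, grade TEXT, course_title TEXT,
    PRIMARY KEY (student_id, course_id)
);
INSERT INTO Enrollment VALUES
    (1, 'C101', 'A', 'Course One'),
    (2, 'C101', 'B', 'Course One'),
    (1, 'C102', 'A', 'Course Two');

-- Table2(B, C): one row per course.
CREATE TABLE Course (course_id TEXT PRIMARY KEY, course_title TEXT);
INSERT INTO Course SELECT DISTINCT course_id, course_title FROM Enrollment;

-- Table1(A, B, D): the original table minus the partially dependent column.
CREATE TABLE Enrollment_2NF (
    student_id INTEGER,
    course_id  TEXT REFERENCES Course(course_id),
    grade      TEXT,
    PRIMARY KEY (student_id, course_id)
);
INSERT INTO Enrollment_2NF SELECT student_id, course_id, grade FROM Enrollment;
""")

course_count = conn.execute("SELECT COUNT(*) FROM Course").fetchone()[0]
rejoined = conn.execute("""
    SELECT COUNT(*) FROM Enrollment_2NF AS e
    JOIN Course AS c ON c.course_id = e.course_id
""").fetchone()[0]
print(course_count, rejoined)
```

Each course title is now stored once, and a join recovers every original row.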

**Heads up:** It is possible to have dependencies *within* a composite key. The fix is the same. 

### **3rd Normal Form (3NF)**
A table is in third normal form if
- it is in 2NF
- there are no dependencies among the non-key columns

**This form comes about because of *transitive* dependencies like A $\rightarrow$ B $\rightarrow$ C. Here A is the primary key and B and C are non-key columns.** The fix is to create a new table for the B $\rightarrow$ C dependency:
- `Table1(`**`A`**,`B,D,E,F)`
- `Table2(`**`B`**,`C)`

This looks very similar to the 2NF fix except that B is not a primary key column. 
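
As a sketch of the 3NF fix, take the `Student` table from the functional dependencies discussion and assume, purely for illustration, that the fee is set per dorm. That gives the transitive dependency `studentID` → `dorm` → `fee`. The sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Assumption for illustration: the fee depends only on the dorm.
CREATE TABLE Student (
    studentID INTEGER PRIMARY KEY, name TEXT, dorm TEXT, room TEXT, fee INTEGER
);
INSERT INTO Student VALUES
    (1, 'Ann', 'East', '101', 3000),
    (2, 'Bob', 'East', '102', 3000),
    (3, 'Cy',  'West', '201', 3500);

-- Table2(B, C): one row per dorm.
CREATE TABLE Dorm (dorm TEXT PRIMARY KEY, fee INTEGER);
INSERT INTO Dorm SELECT DISTINCT dorm, fee FROM Student;

-- Table1(A, B, ...): fee is dropped because it can be looked up via dorm.
CREATE TABLE Student_3NF (
    studentID INTEGER PRIMARY KEY, name TEXT,
    dorm TEXT REFERENCES Dorm(dorm), room TEXT
);
INSERT INTO Student_3NF SELECT studentID, name, dorm, room FROM Student;

-- A fee change is now a single-row update.
UPDATE Dorm SET fee = 3200 WHERE dorm = 'East';
""")

fees = conn.execute("""
    SELECT s.name, d.fee
    FROM Student_3NF AS s JOIN Dorm AS d ON d.dorm = s.dorm
    ORDER BY s.studentID
""").fetchall()
print(fees)
```

Because the fee lives in exactly one row per dorm, two students in the same dorm can no longer end up with different fees.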

> **Heads Up**: Derived columns (from calculations on other columns) are yet another kind of dependency where *inputs* $\rightarrow$ *outputs*. In this case the best solution is to eliminate the derived columns entirely. No data is lost by doing so. 

### **Boyce Codd Normal Form (BCNF)**

A table is in BCNF if:
- it is 3NF
- every determinant (on the left of the arrows) is a candidate key

Almost always a table in 3NF is also BCNF. However, let's say that we have a composite primary key (A,B) and that there exists a non-key column C such that C $\rightarrow$ B. Then while (A,B) may look like a good primary key, (A,C) is better. 

**BCNF is only in play if there are multiple composite candidate keys *and the wrong one is used as the primary key*.** The fix is to switch to the new primary key. In some cases that may require splitting the table into two or more smaller tables, but we'll leave that for another time. 

### **4th Normal Form (4NF)**
Fourth normal form deals with multivalued dependencies, where a column narrows the scope of another column to a consistent subset. The notation is a little odd:  
**determinant $\twoheadrightarrow$ dependent**

The purpose is to pick up hidden repeating fields in the data. Going back to the email address example, consider the following table:

|eid | name | email | usage | 
|---|------| --------------|---|
| 1 | Barb Ackue |	backue@acmesales.com | work |
| 1 | Barb Ackue | barb.ackue@gmail.com | home |
| 2 | Buck Kinnear | bkinnear@acmesales.com | work |
| 2 | Buck Kinnear | buckkinnear2315@hotmail.com | home |

Each time we refer to Barb Ackue we are also referring to her two email addresses. That is a multivalued dependency:

**`eid` $\twoheadrightarrow$ `email`**

The solution is to break this into two tables, one for the name and the other for emails:
- `Employee(`**`eid`**,`name)`
- `Contact(`**`contact_id`**,`email,usage`,<u>`eid`</u>`)`

### **Even Higher Normal Forms**

In our quest to be even more sure that we will never trigger an anomaly we can go to 5NF, and even 6NF. 

The last normal form is DKNF, which we introduced earlier in this section. Unfortunately, there is no set way to redesign a table to always be DKNF. Instead we just have to guess and then check that 
> "... all functional dependencies are on the primary key and only the primary key." 

**Fortunately, however, about 99.99% of the time a table in 3NF is also in DKNF.** Leave the remaining edge cases for the hardcore data engineers to figure out. (Though, we will see an example of 4NF normalization in a bit.)




---
## **Movies Tonight: A Case Study**

This is the start of a 4 part case that will run through Lesson 8. 

### **A very old-school web app**

Movies Tonight was an ancient web app built as a tech demo in the days before broadband, CSS, web services, ReST APIs, JSON, and all the other technologies we now take for granted. It was designed to show what a rich user interface could look like once we had all of those things. 

> For those who may be wondering ... Yes, your instructor built the app over a weekend before a Tuesday morning class. And yes, the design is truly hideous. 

![Movies Tonight UI](https://github.com/christopherhuntley/DATA6510/raw/master/img/L5_Movies_Tonight_UI.png)

**Through some sort of Internet miracle, [the app still works](http://christopherhuntley.github.io/movies-tonight).** It provides information about every movie shown in Riverside, California, on Thanksgiving 1996. The code is ancient $–$ JavaScript was just 2 years old at the time $–$ and won’t work in some modern browsers. It should work fine in Chrome and Firefox, however. Try it out to get a feel for the basic flow. 

While the web design was only barely passable, there is a gem hidden in the source code: all the data in a compressed format and parser functions used to extract the data into usable data records (about movies, theaters, and shows). The idea was that the JavaScript would be generated by a webserver each time the page was loaded. Then the user would continue on without ever needing to refresh the page. Everything on the screen was *generated* in JavaScript, which was a truly radical idea at the time but is how most web pages are designed today. 
> **Note for web geeks:** [XMLHttpRequest](https://en.wikipedia.org/wiki/XMLHttpRequest) did not exist yet; it first appeared in browsers around 1999 and was not drafted as a W3C standard until 2006. This is the truly old school way to do one page web apps.

![Movies Tonight Source](https://github.com/christopherhuntley/DATA6510/raw/master/img/L5_Movies_Tonight_Source.png)







### **The Data, in Three Formats**
The data is downloadable as an [MS Excel file](https://github.com/christopherhuntley/ba510-movies-tonight/blob/master/movies.xls). (You may want to do that before moving on.)

![Movies Tonight Data in Excel](https://github.com/christopherhuntley/ba510-movies-tonight/raw/master/img/img1.png)  

The file has the same data in three tabs/sheets:  
- **Format 1** is a classic 1960s era mainframe data layout, designed to minimize the number of characters used in the file without using any numerical ids (which can be hard to debug by hand). Each record is on a line and is one of three types (M, S, or T). Depending on the record type, the last field may be repeated if there are multiple values, making the # of fields variable (even when the record type doesn’t change).

![MT format 1](https://github.com/christopherhuntley/DATA6510/raw/master/img/L5_Movies_Tonight_format1.png)


- **Format 2** is a slightly different arrangement, again with three record types. This time the repeated fields are split into separate records. To conserve characters in the file, fields are left blank if the values are the same in the record above it.

![MT format 2](https://github.com/christopherhuntley/DATA6510/raw/master/img/L5_Movies_Tonight_format2.png)

- **Format 3** combines all three record types into a single record, at the cost of being extremely verbose and redundant. Notice how many rows the sheet has! Each record represents a single **movie credit** within a single **movie showing** at a given time at a **single theater**. (Read that three times to be sure you understand before going on.)  

![MT format 3](https://github.com/christopherhuntley/DATA6510/raw/master/img/L5_Movies_Tonight_format3.png)

> **Take a moment to think about which of the three formats you might want to use if you were sending 10 billion rows of data over the internet (or a few thousand rows over 1996 dial-up internet). Then think about which one you would want to use to train a machine learning model.** 

### **Normalizing to 1NF**

Of the three formats, the only one that can be made 1NF as a single table  is Format 3. The other two are the very definition of repeating fields. 

So, let's start with the following table:  
`Dataset(`**`tname, mtitle, showtime, ccode, cname`**, `location, phone, rating)`

where the keys were discovered after inspecting the data for repeating patterns (i.e., composite keys).  


### **Normalizing to 2NF and 3NF**

2NF and 3NF require us to *normalize out* any functional dependencies that are not on the full primary key. We find two:
- `tname` $\rightarrow$ (`location`, `phone`)
- `mtitle` $\rightarrow$ `rating`

The fix is to break the data into three tables:
- `Dataset(`**`tname, mtitle, showtime, ccode, cname`**`)`
- `Theaters(`**`tname`**, `location, phone)`
- `Movies(`**`mtitle`**, `rating)`

Note that we removed the dependents (columns on the right sides of the arrows) from the `Dataset` table. That leaves the `Dataset` table with a massive primary key that includes every remaining column. 

That just can't be right. We're going to have to keep looking. 
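
The two dependencies used above can be checked against the data with a `GROUP BY` query. This is a sketch only: the `violations` helper is hypothetical, and the handful of rows below are invented stand-ins in the shape of Format 3, not the real Movies Tonight data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Dataset (tname, mtitle, showtime, ccode, cname, location, phone, rating);
-- Invented sample rows in the shape of Format 3.
INSERT INTO Dataset VALUES
    ('Cinema A', 'Movie X', '7:00 PM', 'A', 'Actor One',    'Main St', '555-0100', 'PG'),
    ('Cinema A', 'Movie X', '9:00 PM', 'A', 'Actor One',    'Main St', '555-0100', 'PG'),
    ('Cinema B', 'Movie X', '8:00 PM', 'A', 'Actor One',    'Oak Ave', '555-0199', 'PG'),
    ('Cinema B', 'Movie Y', '8:30 PM', 'D', 'Director Two', 'Oak Ave', '555-0199', 'R');
""")

def violations(determinant, dependent):
    """Count determinant values that map to more than one dependent value."""
    sql = f"""
        SELECT COUNT(*) FROM (
            SELECT {determinant}
            FROM (SELECT DISTINCT {determinant}, {dependent} FROM Dataset)
            GROUP BY {determinant}
            HAVING COUNT(*) > 1
        )
    """
    return conn.execute(sql).fetchone()[0]

theater_check = violations("tname", "location, phone")  # tname -> (location, phone)
movie_check = violations("mtitle", "rating")            # mtitle -> rating
not_a_dependency = violations("tname", "mtitle")        # tname -> mtitle
print(theater_check, movie_check, not_a_dependency)
```

A count of zero means the sample never contradicts the dependency. A nonzero count, as for `tname` → `mtitle`, rules it out.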

### **Normalizing to BCNF and 4NF**
Since the `Dataset` table has only one candidate key (its composite primary key), **we can ignore BCNF.**

However, we do find a multivalued dependency to work with. Each movie, no matter how many times it is shown, always has the same movie credits. In other words:  
**`mtitle` $\twoheadrightarrow$ (`ccode`,`cname`)**

We can use this to create yet another new table, called `Credits`, leaving us with four tables:
- `Dataset(`**`tname, mtitle, showtime`**`)`
- `Theaters(`**`tname`**, `location, phone)`
- `Movies(`**`mtitle`**, `rating)`
- `Credits`(**<u>`mtitle`</u>,`ccode, cname`**`)`

Note that `mtitle` is a foreign key in the `Credits` table. It's also of course a primary key field. (Yes, that's totally possible.)  

### **Final Cleanup**

The `Dataset` table doesn't seem right. It's too generic. If we think hard about what each `Dataset` row represents, we will arrive at the concept of a *show* or perhaps a *showing*. Also, we will realize that the `mtitle` and `tname` columns are actually foreign key references to movies and theaters. This suggests a couple of changes to the table definition:
- `Shows(`**<u>`tname`</u>,<u>`mtitle`</u>, `showtime`**`)`
- `Theaters(`**`tname`**, `location, phone)`
- `Movies(`**`mtitle`**, `rating)`
- `Credits`(**<u>`mtitle`</u>,`ccode, cname`**`)`

**Finally, let's adopt best practice and use surrogate keys for all of the tables. This leaves us with:**
- `Shows(`**`showid`**`, showtime,`<u>`theaterid`</u>`, `<u>`movieid`</u>`)`
- `Theaters(`**`theaterid`**`, tname, location, phone)`
- `Movies(`**`movieid`**`, mtitle, rating)`
- `Credits`(**`creditid`**`, ccode, cname, `<u>`movieid`</u>`)`
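
A quick way to sanity-check the surrogate-key design is to build it and join the four tables back into the wide layout we started from. The sketch below assumes column types and uses a few invented rows rather than the real Movies Tonight data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Theaters (theaterid INTEGER PRIMARY KEY, tname TEXT, location TEXT, phone TEXT);
CREATE TABLE Movies   (movieid INTEGER PRIMARY KEY, mtitle TEXT, rating TEXT);
CREATE TABLE Shows (
    showid INTEGER PRIMARY KEY, showtime TEXT,
    theaterid INTEGER REFERENCES Theaters(theaterid),
    movieid   INTEGER REFERENCES Movies(movieid)
);
CREATE TABLE Credits (
    creditid INTEGER PRIMARY KEY, ccode TEXT, cname TEXT,
    movieid  INTEGER REFERENCES Movies(movieid)
);

-- Invented sample rows.
INSERT INTO Theaters VALUES (1, 'Cinema A', 'Main St', '555-0100');
INSERT INTO Movies   VALUES (1, 'Movie X', 'PG');
INSERT INTO Shows    VALUES (1, '7:00 PM', 1, 1), (2, '9:00 PM', 1, 1);
INSERT INTO Credits  VALUES (1, 'A', 'Actor One', 1), (2, 'D', 'Director Two', 1);
""")

# Rebuild the wide, redundant layout: one row per credit per show.
wide = conn.execute("""
    SELECT t.tname, m.mtitle, s.showtime, c.ccode, c.cname,
           t.location, t.phone, m.rating
    FROM Shows AS s
    JOIN Theaters AS t ON t.theaterid = s.theaterid
    JOIN Movies   AS m ON m.movieid   = s.movieid
    JOIN Credits  AS c ON c.movieid   = m.movieid
    ORDER BY s.showid, c.creditid
""").fetchall()
print(len(wide))
```

Two shows of a movie with two credits yield four wide rows, which is exactly the redundancy that normalization removed from storage.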

We will come back to this design in Lesson 6, where we will find that we should add *one more* table.

















---
## **PRO TIPS: How to handle SQLite's quirks**

SQLite implements just enough of the SQL language specification to be useful. Given that it is intended to be embedded into tiny devices with almost no storage, it is remarkable that it does as much as it does. 

In some cases we just have to live with these inherent limitations. In others we can try workarounds that are a bit *different* but can be manageable in a pinch. 

### **No DBMS**
SQLite is just not designed to be a *server*. Instead, it is a software library that can be used to work with database files or in memory. 

Since there is no server to enforce any access rules, SQLite relies on the operating system to provide basic data security. If a file is supposed to be read-only then make it read-only. If some users should be able to access it but not others, then restrict file access in your operating system. 

Besides the lack of built-in access controls, SQLite is also inherently poor at concurrent access, where multiple users or apps try to access the database at the same time. The issue is that SQLite locks the whole database file and offers nothing finer grained than that. So, if one process is writing to any table in the database, then any other process that also wants to write is locked out entirely until the first one is done. By contrast, a full-featured DBMS would allow much finer grained locking mechanisms at the table, row, or even the (row, column) levels. 
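
The locking behavior can be observed with two connections to the same file. This is a minimal sketch that assumes SQLite's default rollback-journal mode; the file name and table are invented. The second connection uses `timeout=0` so that it fails immediately instead of waiting for the lock.

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.db")  # throwaway database file

writer = sqlite3.connect(path, isolation_level=None)  # manage transactions by hand
writer.execute("CREATE TABLE t (x INTEGER)")
writer.execute("BEGIN IMMEDIATE")                     # take the write lock
writer.execute("INSERT INTO t VALUES (1)")

other = sqlite3.connect(path, timeout=0)              # do not wait for the lock
try:
    other.execute("INSERT INTO t VALUES (2)")
    result = "written"
except sqlite3.OperationalError as err:
    result = "locked"
    print("Second writer blocked:", err)

writer.execute("COMMIT")                              # release the lock
other.close()
writer.close()
```

Even though both inserts touch different rows, the second writer is shut out until the first transaction ends.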

### **Nonstandard Connection Strings**
Since there is no actual DBMS, no usernames, no passwords, all that a SQLite connection string can specify is where to find the database file. The format is   

`sqlite:///filepath`

where `filepath` is a string with folders and subfolders, etc. leading to the actual database file. How the `filepath` string is formatted depends on the operating system. The safest thing to do is to require that the file reside in the same folder as the code (notebook) that uses it. Also, note the third `/` after the `:`. That's required for some reason. 

> In Lesson 1 we had to create a symbolic link to specify the file path for SQLite. That was due to a combination of quirks. First, SQLite cannot handle file paths with spaces in the names. Second, any file path in Google Drive always has spaces in it. Yuck.  

Things get a little weirder when we use an in-memory (file-less) database:  

`sqlite://`

Here there is no file path and there is also no third `/`. Maybe that's what the third `/` is for? So that SQLite knows to look for a file? 
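
One plausible reading of the slashes is that these strings follow the usual URL layout of `scheme://host/path`. With no server there is no host, so the third `/` marks where the path begins. Python's standard URL parser shows the pieces; `movies.db` is a hypothetical file name.

```python
from urllib.parse import urlsplit

file_url = urlsplit("sqlite:///movies.db")   # hypothetical database file
memory_url = urlsplit("sqlite://")           # in-memory database

# netloc is where a server name would go; path is where the file goes.
for parts in (file_url, memory_url):
    print("host:", repr(parts.netloc), "path:", repr(parts.path))
```

Both strings have an empty host, and only the file-based one has a path.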

### **`SELECT` and `rowid`**
SQLite follows a convention we also see in *pandas*. It automatically generates a unique row index called `rowid`. If you want a surrogate key, you then have to alias the `rowid` in the table definition by declaring the column as `INTEGER PRIMARY KEY`. It won't create the alias if you phrase it any other way. 

If, for some reason, you want to use the hidden `rowid` index column in a `SELECT` query then you specify it separately from the other columns:

```sql
SELECT rowid, * 
FROM ...
```
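
A small sketch of both cases, run from Python. The table names `with_alias` and `no_alias` are made up for the demonstration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE with_alias (id INTEGER PRIMARY KEY, name TEXT);  -- id aliases rowid
CREATE TABLE no_alias   (name TEXT);                          -- rowid stays hidden
INSERT INTO with_alias (name) VALUES ('Barb Ackue'), ('Buck Kinnear');
INSERT INTO no_alias   (name) VALUES ('Barb Ackue'), ('Buck Kinnear');
""")

# In the first table, rowid and id are the same column under two names.
aliased = conn.execute(
    "SELECT rowid, id, name FROM with_alias ORDER BY rowid").fetchall()

# In the second table, * alone would omit rowid, so we ask for it explicitly.
hidden = conn.execute(
    "SELECT rowid, * FROM no_alias ORDER BY rowid").fetchall()

print(aliased)
print(hidden)
```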


### **Limited Data Types**
To keep things as simple as possible, SQLite tables support only five data types for columns:
- `TEXT` for strings of characters (any length)
- `INTEGER` for numbers without a decimal
- `REAL` for numbers with a decimal
- `BLOB` for binary data 
- `NULL` for nothing at all

Most of the time that works out just fine. However, there is a reason that more expansive SQL implementations include dozens of data types. With more types, the database can automatically handle data formatting and translation for you behind the scenes. 

Consider, for example, the SQL-standard `DECIMAL` data type, which is used for things like monetary values with a fixed decimal place and precision. In SQLite, `$2.35` could be stored in one of three ways:
- the `REAL` number `2.35`. But then what is `0.5 * 2.35`? It's `1.175` of course. (I wonder what they do with those half pennies?) 
- the `INTEGER` number `235`. But then we have to remember to divide by 100 before we can use the number for anything. 
- the `TEXT` "$2.35" which at least allows us to indicate the currency as dollars. However, we have to extract the "2.35" and convert it to `REAL` or `INTEGER` before doing any calculations. 
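
The tradeoff between the first two options can be tried directly. This sketch compares `REAL` arithmetic with whole-cent `INTEGER` arithmetic; the amounts are arbitrary examples.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# REAL values are binary floating point, so some sums are not exact.
real_equal = conn.execute("SELECT 0.10 + 0.20 = 0.30").fetchone()[0]

# INTEGER cents are exact; divide by 100 only when displaying.
cents_equal = conn.execute("SELECT 10 + 20 = 30").fetchone()[0]
display = conn.execute("SELECT printf('$%.2f', 235 / 100.0)").fetchone()[0]

print(real_equal, cents_equal, display)
```

SQLite reports comparisons as `1` for true and `0` for false, so the first result shows that ten cents plus twenty cents is not exactly thirty cents when stored as `REAL`.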

### **Dates and Times as `TEXT`**
This quirk is a follow on to the previous one. In order to make them readable by humans, dates and times in CSV files are generally stored as text like "2021-1-26" or "10:39:00". Most DBMSes would automatically convert such strings to specialized data types needed to do date arithmetic and sorting. SQLite simply doesn't do that. Instead it provides a pretty sparse set of [date and time functions](https://sqlite.org/lang_datefunc.html) with even sparser documentation.

> We will see this date and time quirk up close in Lesson 8 with the Movies Tonight database, where the show times are in a "nonstandard" text format. (SQLite does not know about 'AM' or 'PM'.)
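
Two small experiments illustrate the quirk. The date functions do work when the text is in zero-padded ISO form (`YYYY-MM-DD`), but a `TEXT` column is otherwise sorted character by character. The `events` table and its values are invented for the demonstration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Date arithmetic on zero-padded ISO text.
next_day = conn.execute("SELECT date('2021-01-26', '+1 day')").fetchone()[0]

# Unpadded dates are compared as plain strings, so October sorts before September.
conn.execute("CREATE TABLE events (d TEXT)")
conn.executemany("INSERT INTO events VALUES (?)",
                 [("2021-9-15",), ("2021-10-02",)])
as_text = [row[0] for row in conn.execute("SELECT d FROM events ORDER BY d")]

print(next_day)
print(as_text)
```

Storing dates as zero-padded ISO text keeps both sorting and the built-in date functions usable.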







---
## **SQL AND BEYOND: EAV Models**

One easy way to always keep your data normalized is to use a single table with four columns:
- An **entity type**
- An **entity** identifier
- An **attribute** identifier
- A **value** to store

We can, in fact, normalize any set of tables down to this one design, which we call EAV. 

For example, in an EAV database the relational database tables

**<center>Staff</center>**

|eid | name |
|---|------|
 1 | Barb Ackue 
 2 | Buck Kinnear


**<center>Contacts</center>**

| contactid |eid | email | usage | 
|---|------| --------------|---|
1 | 1 | backue@acmesales.com | work 
2 | 1 | barb.ackue@gmail.com | home 
3 | 2 | bkinnear@acmesales.com | work 
4 | 2 | buckkinnear2315@hotmail.com | home 

would be stored as 

| type | id | attribute | value |
| ---|------| --------------|-----|
 Staff | 1 | name | Barb Ackue
 Staff | 2 | name | Buck Kinnear
 Contact | 1 | eid | 1
 Contact | 1 | email | backue@acmesales.com
 Contact | 1 | usage | work
 ... | ... | ... | ...

 > **Heads up:** EAV is a special case of the key-value store model used by some NoSQL databases, where a composite like (type, id, attribute) is used for the keys. 
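
The conversion shown in the tables above can be sketched in plain Python. The `to_eav` and `assemble` helpers are hypothetical names, not part of any library.

```python
# The two relational tables from above, as lists of dicts.
staff = [
    {"eid": 1, "name": "Barb Ackue"},
    {"eid": 2, "name": "Buck Kinnear"},
]
contacts = [
    {"contactid": 1, "eid": 1, "email": "backue@acmesales.com", "usage": "work"},
    {"contactid": 2, "eid": 1, "email": "barb.ackue@gmail.com", "usage": "home"},
    {"contactid": 3, "eid": 2, "email": "bkinnear@acmesales.com", "usage": "work"},
    {"contactid": 4, "eid": 2, "email": "buckkinnear2315@hotmail.com", "usage": "home"},
]

def to_eav(entity_type, rows, id_column):
    """Turn each non-id column of each row into one (type, id, attribute, value) row."""
    return [
        (entity_type, row[id_column], attribute, value)
        for row in rows
        for attribute, value in row.items()
        if attribute != id_column
    ]

eav = to_eav("Staff", staff, "eid") + to_eav("Contact", contacts, "contactid")

def assemble(entity_type, entity_id):
    """Rebuild one entity by collecting all of its attribute rows."""
    return {a: v for (t, i, a, v) in eav if t == entity_type and i == entity_id}

print(len(eav))
print(assemble("Contact", 1))
```

Six relational rows become fourteen EAV rows, and reading back a single contact means gathering three of them.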

EAV databases have a few advantages:
- The single table design never changes.
- There is flexibility to handle new kinds of information easily; just add a new entity type (schema) with a few attributes and a data type for the values.
- If there are lots of attributes with missing data then there is no need to store the NULL values.

However, there are also some disadvantages:
- The table will have an enormous number of rows; to assemble the facts about even one entity might require dozens of rows.
- It is really hard to enforce referential integrity rules because foreign keys are just data like anything else.
- The schemas are so flexible that it is hard to know exactly what facts are knowable about a given entity without actually assembling it.

The most well known application for EAV is electronic medical records, where the database has to store test results that may have any number of attributes. In such a situation, having total flexibility to store whatever data is available in whatever format is needed is very useful. However, not many applications need so much flexibility. 




 







  

 








---
## **Congratulations! You've made it to the end of Lesson 5.**

We will continue our coverage of database design in Lesson 6, which takes a more visual approach, using ER diagrams instead of normalization rules to model and analyze database requirements.



## **On your way out ... Be sure to save your work**.
In Google Drive, drag this notebook file into your `DATA6510` folder so you can find it next time.