<img src="./img/Dolan.png" width="180px" align="right">

# **DATA 6510**
# **Lesson 5: Entity Relationship Modeling** 
_A Visual Approach to Database Design._

## **Learning Objectives**
### **Theory / Be able to explain ...**
- Entity Relationship notation and interpretation
- Key distinctions between entities and attributes
- The various kinds of relationships and how they affect database design
- How Entity-Attribute-Value databases can be useful for data without a fixed database schema

### **Skills / Know how to ...**
- Draw Entity Relationship Diagrams
- Assess the validity of entity types
- Identify potential integrity issues before committing to table designs

--------

## **BIG PICTURE: Data modeling is not about data**

**Data modeling is something we do *before* we have data.** It's the design activity that makes it possible to collect data for a given purpose. So, data modeling is about what we will do with the data and what we can expect to find as we collect it. The data itself ends up being a collection of artifacts that gets us from data collection to data usage. 


So, whenever we start design of a new database (i.e., the receptacle for our data) the most important questions are:
- What exists now?
- What is the data about?
- What is the data going to be used for?
- How long does the data need to be kept?
- Who is going to use the data? 


Only after we get answers to these questions can we begin to think about the tables, columns, keys, constraints, etc. For example, let's consider the "What exists now?" question. What exists now may include any or all of the following: 
- a paper process used to collect dead tree data and produce reports
- people whose jobs are to work with the current data sources every day, perhaps with an organizational hierarchy that determines data access permissions
- online systems that need data in a specific format or produce data in a (not necessarily the same) format
- end users that don't currently know they need the data 
- a laundry list of complaints about the current system (or lack of one)
- ...

It goes on and on. 

The purpose of data modeling is to capture as much detail about the data requirements as possible before we make any technology decisions we might regret later. If, for example, we find that some data needs to be separate from other data because of different sources or user access permissions, then we need to take that into account when designing our tables. Similarly, if some data is permanent and other data is only useful for a couple of weeks, then perhaps we will want to keep them separate as well. 

In this lesson we will use Entity Relationship Modeling to capture data requirements in an intuitive way that can be explained to clients and power users. It is like writing a draft of the user manual for the data before anything else. 

---
## **Entity Relationship Diagrams**

Entity Relationship models have been around almost as long as relational databases. Peter Chen first developed the theory in the mid-1970s, a time when relational databases were still pretty exotic, to solve a serious conceptual problem: people understood how to read and write files, but following all the rules needed to build relational databases was a challenge.  


ER diagrams (ERDs) are designed to be just specific enough to be useful without introducing any unnecessary details that are best left for later. Consider this ERD from Lesson 1:

![Cleaners ERD](./img/L6_erd1.png)

- **Each box represents an entity class or type.** Any given type can model any number of instances. So, for example, the `Customer` entity (type) can represent millions of customers (instances).

- **The attributes are listed inside the box.** Traditionally, primary key attributes (identifiers like `customer_id`) are listed at the top and the foreign key attributes at the bottom, with non-key attributes in between. If an attribute is both a primary key and a foreign key, then group it with the primary key attributes.

> Note: in the earliest stages of the requirements process, we may omit the attributes entirely, focusing on just the entities and relationships.

- **Relationships are shown with lines connecting the boxes.** The notation at each end of a connecting line represents how many entity instances there are at that end. The relationship shown in the diagram indicates that:
  - each `Customer` instance can generate zero or more `Invoice` instances
  - each `Invoice` instance is generated by one and only one `Customer`

With this much detail we can say:
- what tables we will need in the database
- what columns are on each table
- what foreign keys are needed and what tables they refer to

What we don't care about (yet) includes 
- the data types of the attributes
- how the keys are generated and managed
- how many rows of data are to be in each table
- how the tables will be populated with data
- how data will be updated or deleted from the tables

These last few things are matters of logical design. Since they are usually left to a database administrator (not the analyst), for now we are interested only in the general database structure, not its implementation in SQL. 
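
To make the `Customer`/`Invoice` reading concrete, here is a minimal sketch of the tables that ERD implies, using Python's built-in `sqlite3` module and an in-memory database. Only `customer_id` comes from the diagram discussion above; `invoice_id`, `name`, the column types, and the sample rows are assumptions made for illustration, and a real design would settle them during logical design.

```python
import sqlite3

# In-memory database so the sketch leaves nothing behind.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked

conn.executescript("""
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT                   -- hypothetical non-key attribute
);
CREATE TABLE invoices (
    invoice_id  INTEGER PRIMARY KEY,   -- hypothetical identifier
    customer_id INTEGER NOT NULL       -- "one and only one" customer
                REFERENCES customers (customer_id)
);
""")

# Invented sample rows.
conn.execute("INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace')")
conn.execute("INSERT INTO invoices VALUES (100, 1), (101, 1)")

# Each customer has zero or more invoices: Ada has two, Grace has none.
invoice_counts = conn.execute("""
    SELECT c.name, COUNT(i.invoice_id)
    FROM customers AS c LEFT JOIN invoices AS i USING (customer_id)
    GROUP BY c.customer_id
    ORDER BY c.customer_id
""").fetchall()

# An invoice that points at a nonexistent customer is rejected.
try:
    conn.execute("INSERT INTO invoices VALUES (102, 99)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

print(invoice_counts, orphan_rejected)
```

The foreign key sits on the `invoices` table because that is the "many" end of the line in the diagram.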

## **ERDs as Conceptual Storytelling**

The key insight of entity relationship modeling $-$ the one that makes it so useful for database design $-$ is that **databases exist to tell stories**. They describe things that people care about enough to record for later. Entity-relationship diagramming is a visual language for telling database stories before we actually have data.

In the visual language of ERDs there are just two kinds of sentences:
- Description: entity A **is described by** attributes X, Y, and Z 
- Action: entity A **acts upon** entity B

In these stories, the **entities and attributes are the descriptive nouns**, while the **relationships are the action verbs**. 

We will take these one at a time and then discuss some of the more interesting special cases. 

## **Entities and Attributes**

In principle, any particular thing could be considered an entity or an attribute. Often it's hard to tell which is which, at least at first, before we know the context for our database stories. 

The difference between an entity and an attribute is that entities always have a unique identity within the **domain model** (i.e., context). Recall that in Lesson 4 we defined the relational model using set mappings from **domains** to **codomains**: 

![Relation mappings from Lesson 4](./img/L4_Relations.png)


**The items in the domain set are the entities represented by a table. The mappings to the codomains are the attributes.** While entities are by definition unique, we could map any given **domain entity** to **attribute values** in any number of codomains. Like the entities, the attribute values themselves are unique but the **attribute mappings** have no such constraint. We could map several entities to the same attribute value or even map an entity to itself (i.e., where codomain = domain) several times if we like. It is in this sense that **entities are unique** while **attribute mappings** are not.

To determine if a given item is an entity, we look for **identifier attributes** that make sense in the story context. For example, does a bank need to track individual dollar bills (via serial numbers) or just the number of dollars in a person's account? If the former then dollars are entities. If the latter, then dollar amounts are just values used to describe something else (a customer's account, a financial transaction, etc.).

> **Tip:** Identifier attributes should be immutable, never changing once they have been set. If one is ever tempted to modify an identifier then it is not really an identifier. So, for example, while we informally use people's names as identifiers, names can change and thus make poor identifiers. We are better off defining a new attribute (an id number, for example) and letting the name be just another non-key attribute, perhaps indexed for speedy lookups but not a candidate key. 
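
The tip can be sketched in a few lines with Python's built-in `sqlite3` module. The table and column names (`persons`, `accounts`, `person_id`) and the sample values are hypothetical, chosen only to show an id number serving as the identifier while the name stays an ordinary, indexed attribute.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE persons (
    person_id INTEGER PRIMARY KEY,   -- immutable identifier
    name      TEXT NOT NULL          -- ordinary attribute, free to change
);
CREATE INDEX idx_persons_name ON persons (name);  -- speedy lookups, not a key

CREATE TABLE accounts (              -- hypothetical table that refers to a person
    account_id INTEGER PRIMARY KEY,
    person_id  INTEGER NOT NULL REFERENCES persons (person_id)
);
""")
conn.execute("INSERT INTO persons VALUES (1, 'Toby Smith')")
conn.execute("INSERT INTO accounts VALUES (10, 1)")

# A name change touches one row; no reference has to be rewritten.
conn.execute("UPDATE persons SET name = 'Toby Jones' WHERE person_id = 1")

owner = conn.execute("""
    SELECT p.name
    FROM accounts AS a JOIN persons AS p USING (person_id)
    WHERE a.account_id = 10
""").fetchone()[0]
print(owner)
```

Had `name` been the key, the rename would have had to be propagated to every table that refers to the person.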

Besides the need for unique identifiers, there is another sense in which entities must be unique. **Each entity is always one thing and only one thing.** If an entity is composed of multiple things, then it is actually an **associative entity** that connects other things together. Each of these connected things is an entity, as is the association itself. 

Let's consider, for example, a marriage between two people. Is that one thing (a marriage) or three things (marriage plus two spouses)? In early versions of the database story it may be just a marriage license with a date, a few signatures, and an id number. However, as the story gets fleshed out, one may find the need to represent the people getting married. While it is tempting to just add the details about these people to the marriage license itself, this makes the idea of a marriage license much more complex than it has to be. It also complicates questions like "Has Toby been married before?" or "Is Toby *currently married* to somebody else?" Instead of looking up Toby (and finding licenses), we would have to scan through licenses to find any with Toby's name. A much better way to do it is to **register each person** separately and then **reference** them on the marriage license. As long as the licenses are **indexed** by person $-$ easily done if each person has a unique identifier $-$ we could find any of Toby's previous marriages on file with one lookup.

Once we have the entities worked out, the attributes are usually pretty easy to suss out. All we need to know in an ERD is what they are named, and whether any of them are identifiers (primary keys) or references (foreign keys). Data types and other constraints (e.g., allowed combinations of attribute values) can wait until logical design (in SQL). 


---
## **Relationships**

Relationships are how the entities in our stories interact. They provide the action. How that action plays out depends on the natures of the entities and their interactions. 

Continuing our marriage example, here are three different scenarios, each as a different ERD:

![Three Marriage Proposals](./img/L6_Marriage_ERD.png)

Scenario A represents the case where people are just names on a marriage license. Each person is listed as either `spouse1` or `spouse2`. This is a very static story with no action. It is almost as if nobody will ever read the story after it is written.


Scenario B represents a person-centric case, where all that matters is who is currently married to whom. (It's a rare unary mapping of an entity domain onto itself.) Notice the phrasing of the text on the relationship. It describes the action (or more like a condition, depending on your views of marriage) of being married to another person. It is assumed that if person X is married to person Y then the relationship would be noted on both entities. If a person is unmarried, then the `spouseID` would be left blank.

Scenario C separates out the marriage license from the people getting married. In this case we can add more attributes to the `Marriage License` and `Person` entities to indicate things like birthdates, divorce dates, etc. Generally, this is likely how your state records marriages. It is up to them to enforce the "can't have more than one marriage at a time" law. 


> Note that Scenario C allows for marriage to be between more than 2 people. Why? Because standard ERD graphic language doesn't distinguish between "more than 1" and "many". That marriage is between two people is a constraint that would be handled in logical design, to be enforced using SQL controls. 
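
As a rough illustration of what "enforced using SQL controls" might look like for Scenario C, the sketch below builds the many-to-many structure in an in-memory SQLite database and adds a trigger that caps each license at two spouses. All table, column, index, and trigger names are assumptions, as are the sample rows; a trigger is only one of several ways such a rule could be enforced.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE persons (
    person_id INTEGER PRIMARY KEY,
    name      TEXT NOT NULL
);
CREATE TABLE marriage_licenses (
    license_id   INTEGER PRIMARY KEY,
    license_date TEXT
);
-- The many-to-many relationship between persons and licenses.
CREATE TABLE license_spouses (
    license_id INTEGER NOT NULL REFERENCES marriage_licenses (license_id),
    person_id  INTEGER NOT NULL REFERENCES persons (person_id),
    PRIMARY KEY (license_id, person_id)
);
CREATE INDEX idx_spouse_person ON license_spouses (person_id);

-- The ERD only says "many"; this trigger narrows it to "at most two".
CREATE TRIGGER max_two_spouses
BEFORE INSERT ON license_spouses
WHEN (SELECT COUNT(*) FROM license_spouses
      WHERE license_id = NEW.license_id) >= 2
BEGIN
    SELECT RAISE(ABORT, 'a license can name at most two spouses');
END;
""")

# Invented sample rows.
conn.executemany("INSERT INTO persons VALUES (?, ?)",
                 [(1, "Toby"), (2, "Sam"), (3, "Alex")])
conn.execute("INSERT INTO marriage_licenses VALUES (500, '1996-11-28')")
conn.executemany("INSERT INTO license_spouses VALUES (?, ?)",
                 [(500, 1), (500, 2)])

# A third spouse on the same license is refused by the trigger.
try:
    conn.execute("INSERT INTO license_spouses VALUES (500, 3)")
    third_spouse_rejected = False
except sqlite3.DatabaseError:
    third_spouse_rejected = True

# "Has Toby been married before?" becomes a lookup by person id.
toby_licenses = conn.execute(
    "SELECT license_id FROM license_spouses WHERE person_id = 1"
).fetchall()
print(third_spouse_rejected, toby_licenses)
```

The "no more than one marriage at a time" rule would need further logic (for example, checking divorce dates), which this sketch leaves out.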

### **Degrees**

The degree of a relationship is **how many entity *types*** are being connected.

By far the most common relationship is between two entity types. That is called a **binary relationship**, represented by scenario C in the marriage example. In the many-to-many relationship depicted, there can be just about any number of entities on either end. However, there are only two entity types involved. 

A **unary relationship** is one where the associations are among entities with the same entity type. A unary relationship was shown in scenario B. One can conceive of a unary relationship between rail cars on a train. Each car has 0 or 1 cars immediately before it and 0 or 1 cars after it. However, every rail car is a car.
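
A unary relationship such as Scenario B typically becomes a foreign key that points back at its own table. The sketch below assumes a `persons` table with a nullable `spouseID` column (the column name comes from Scenario B; everything else, including the sample rows, is made up for illustration).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("""
CREATE TABLE persons (
    person_id INTEGER PRIMARY KEY,
    name      TEXT NOT NULL,
    spouseID  INTEGER REFERENCES persons (person_id)  -- NULL when unmarried
)
""")
conn.executemany("INSERT INTO persons (person_id, name) VALUES (?, ?)",
                 [(1, "Toby"), (2, "Sam"), (3, "Alex")])

# The relationship is noted on both entities, as Scenario B assumes.
conn.execute("UPDATE persons SET spouseID = 2 WHERE person_id = 1")
conn.execute("UPDATE persons SET spouseID = 1 WHERE person_id = 2")

# A self-join reads the relationship: the same table plays both roles.
couples = conn.execute("""
    SELECT p.name, s.name
    FROM persons AS p LEFT JOIN persons AS s ON s.person_id = p.spouseID
    ORDER BY p.person_id
""").fetchall()
print(couples)
```

Nothing in this sketch forces the two `spouseID` values to agree with each other; keeping them in sync is an integrity concern that would need to be handled separately.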

> We can have higher degree relationships (ternary, quaternary, etc.) as well. These are not very common and can be confusing to interpret. Often in such cases the designer will create an **associative entity** that represents the relationship itself. We only mention higher degree relationships in passing because you may see one in the wild someday. 

### **Maximum Cardinalities**

A cardinality is a count of items in a set. (In fact, the counting numbers 1, 2, 3, etc. are sometimes called the cardinal numbers.) 

In database design the placements of foreign keys (i.e., which tables) are determined by the maximum cardinalities of the relationships. In our ERD notation, there are three possibilities:

![Maximum Cardinalities](./img/L6_Max_Cardinalities.png)

**In the one-to-one case, we will need foreign keys on one or both ends of the relationship.** Given an entity on one side, we will need to be able to look up the entity on the other end. 
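
One common way to hold a relationship to a maximum of one on each end is a foreign key that is also declared `UNIQUE`. The sketch below uses hypothetical `employees` and `parking_spaces` tables with invented rows; it places the key on one end only, which is a design choice rather than a requirement.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE employees (
    employee_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);
CREATE TABLE parking_spaces (
    space_id    INTEGER PRIMARY KEY,
    -- UNIQUE caps the relationship at one space per employee;
    -- leaving the column nullable lets a space sit unassigned.
    employee_id INTEGER UNIQUE REFERENCES employees (employee_id)
);
""")
conn.execute("INSERT INTO employees VALUES (1, 'Ada')")
conn.execute("INSERT INTO parking_spaces VALUES (10, 1), (11, NULL), (12, NULL)")

# Giving the same employee a second space violates the UNIQUE constraint.
try:
    conn.execute("UPDATE parking_spaces SET employee_id = 1 WHERE space_id = 11")
    second_space_rejected = False
except sqlite3.IntegrityError:
    second_space_rejected = True

# Looking up the entity on the other end is a single-row query.
space_for_ada = conn.execute(
    "SELECT space_id FROM parking_spaces WHERE employee_id = 1"
).fetchone()[0]
print(second_space_rejected, space_for_ada)
```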

**In the one-to-many case we will always place the foreign key on the many end of the relationship.** If we were to put it on the one end then we would need to refer to a list of entities on the other end, which is not allowed by the relational model. Instead, we add a foreign key attribute to the entity on the many end (the crow's foot) to indicate *which one* of the entity instances in the other type it is referring to. 

**In the many-to-many case we will eventually need to create an associative entity with foreign keys to the tables at either end.** This in effect converts the many-to-many relationship into two one-to-many relationships with a new entity type. In our ERDs that's fine, but in SQL we will need to create a new table for the cross-references. 


> The ERDs shown here use "crow's foot" notation, with a triangular shape for the "many" ends and a single cross hatch for the "one" ends. Other notations, including the original Chen models from the 1970s and the UML class diagrams used by software engineers, permit more specificity ("maximum of 2") but are a bit more verbose. As we shall see, there is no need for so much specificity when designing tables and columns.  

### **Minimum Cardinalities**

Minimum cardinalities indicate whether a relationship is optional or mandatory on each end. 

![Minimum Cardinalities](./img/L6_Min_Cardinalities.png)

The minimum cardinalities (either 0 or 1) are shown to the inside of the maximum cardinalities. These have implications for the placement of foreign keys and the order in which entities are created (i.e., some entities may have to precede others to maintain referential integrity). We will explore both in the Special Cases section below. 

---
## **Special Cases**

### **Strong Entities and Weak Entities**

An entity type is said to be **strong** if we can create entities without having to consult another type. Strong entities do not have any mandatory foreign keys to worry about. 

A weak entity is one that has a mandatory foreign key dependency. In other words, the entity cannot exist without being preceded by the entity on the other end of the relationship. 

We can identify weak entities by looking for either
- the presence of a foreign key that cannot be null
- an entity on the *other side* of the relationship that is mandatory (i.e., has a minimum cardinality of 1)

Returning to the ERDs we used for minimum cardinalities, we see two examples of weak entities. 

![Minimum Cardinalities](./img/L6_Min_Cardinalities.png)

- In the **mandatory-mandatory** case, both A and B are weak, depending on each other, a situation we call **twinning**. In this case neither entity can precede the other. In order to comply with referential integrity constraints, entity B cannot exist without a corresponding A entity, which in turn requires a B entity. For entity A it is a little less strict, in that the B entity may already exist at the time we create entity A. However, if it doesn't then we are faced with creating two entities (an A and a B) at the same time. 

- In the **mandatory-optional** case, mandatory entity A is the **parent** of the **child** entity B. To comply with the integrity rules, we just have to make sure that the parent precedes the child. If, however, the parent subsequently gets deleted, then the child either has to be deleted too or reassigned a new parent before deleting the current one.

- In the **optional-optional** case, both A and B can be considered strong because they do not have any required dependencies. 

> Note that it is possible to have twinning and parent-child relationships in many-to-many relationships. However, the terms used here would apply to the one-to-many relationships in the derived associative entity. 

Why do we care about all this? Because it tells us in what order we need to _load data_ into the database, with strong entities (tables without FKs) coming before their weaker dependents (tables with FKs to already loaded tables). Further, in some cases we can't load anything at all without temporarily suspending referential integrity rules using transaction control.  
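
For the twinning case, one way to "suspend" the integrity check is to declare the foreign keys as deferred, so that they are verified at `COMMIT` rather than at each `INSERT`. The sketch below assumes two hypothetical twinned tables `a` and `b` in an in-memory SQLite database; other database systems offer similar but not identical mechanisms.

```python
import sqlite3

# isolation_level=None puts the connection in autocommit mode so that
# BEGIN, COMMIT, and ROLLBACK can be issued explicitly.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE a (
    a_id INTEGER PRIMARY KEY,
    b_id INTEGER NOT NULL
         REFERENCES b (b_id) DEFERRABLE INITIALLY DEFERRED
);
CREATE TABLE b (
    b_id INTEGER PRIMARY KEY,
    a_id INTEGER NOT NULL
         REFERENCES a (a_id) DEFERRABLE INITIALLY DEFERRED
);
""")

# Neither twin can go first under immediate checking, so both rows are
# inserted inside one transaction and the check waits until COMMIT.
conn.execute("BEGIN")
conn.execute("INSERT INTO a VALUES (1, 1)")   # b row 1 does not exist yet
conn.execute("INSERT INTO b VALUES (1, 1)")
conn.execute("COMMIT")

# A lone twin is still caught, just later: the COMMIT itself fails.
conn.execute("BEGIN")
conn.execute("INSERT INTO a VALUES (2, 2)")
try:
    conn.execute("COMMIT")
    lone_twin_rejected = False
except sqlite3.IntegrityError:
    conn.execute("ROLLBACK")
    lone_twin_rejected = True

rows_in_a = conn.execute("SELECT COUNT(*) FROM a").fetchone()[0]
rows_in_b = conn.execute("SELECT COUNT(*) FROM b").fetchone()[0]
print(lone_twin_rejected, rows_in_a, rows_in_b)
```

In the simpler parent-child case no deferral is needed: loading the parent table before the child table is enough.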

### **IS-A Relationships: Subtypes and Supertypes**

A subtype is a special case of its supertype. If entity type B is a subtype of A, then B has an **is-a** relationship with A: each B entity is a type A as well. A and B will have overlapping attributes, with type B **inheriting** all of type A's attributes. There are several ways to represent this on an ERD, but the cleanest is to just create a parent-child relationship, with the parent as the supertype and the child as the subtype. 
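
The parent-child approach to subtypes can be sketched as a child table whose primary key is also a foreign key to the parent. The `persons` (supertype) and `employees` (subtype) tables below, along with their columns and rows, are hypothetical examples rather than part of any schema in this lesson.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE persons (            -- supertype
    person_id INTEGER PRIMARY KEY,
    name      TEXT NOT NULL
);
CREATE TABLE employees (          -- subtype: every employee is-a person
    person_id INTEGER PRIMARY KEY REFERENCES persons (person_id),
    hire_date TEXT                -- attribute specific to the subtype
);
""")
conn.execute("INSERT INTO persons VALUES (1, 'Toby'), (2, 'Sam')")
conn.execute("INSERT INTO employees VALUES (1, '1996-11-28')")

# "Inherited" attributes are reached by joining back to the supertype.
employee_rows = conn.execute("""
    SELECT p.person_id, p.name, e.hire_date
    FROM employees AS e JOIN persons AS p USING (person_id)
""").fetchall()

# A subtype row cannot exist without its supertype row.
try:
    conn.execute("INSERT INTO employees VALUES (99, '1997-01-01')")
    orphan_subtype_rejected = False
except sqlite3.IntegrityError:
    orphan_subtype_rejected = True
print(employee_rows, orphan_subtype_rejected)
```

Sharing the key means each supertype row has at most one matching subtype row, which is what keeps the relationship an is-a rather than an ordinary one-to-many.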


---
## **PRO TIPS: Converting ERDs to Relations**

It is pretty easy to convert between ERDs and database table schema. The process is fairly formulaic and can even be automated. For example, MySQL Workbench can draw an ERD given any set of related tables in a MySQL database. Going the other way, from an ERD to tables, is only a little more complex. Here is a usual process, which works for just about anything except special cases like type inheritance. 

**Step 1. Convert many-to-many relationships to *associative* entities.**  

![Associative Entity](./img/L6_Associative_Entity.png)



An associative entity represents the connections implied by a many-to-many relationship. (It is common practice to name the associative entity after the entities it connects.) Also note the directions of the one-to-many relationships. The associative entity is always on the many side of each relationship from which it derives. In the logical design stage, this associative entity will become a so-called **cross table** (short for "cross reference table") with one foreign key per connected entity. 

**Step 2. Write out each entity type as a table schema using CREATE TABLE statements.** 
- Table names are drawn from the entity class names, though pluralized by common convention 
- Attributes are the columns
- Keys are as indicated on the ERD
- Data types are determined in part by what the database allows

 




We will pick up from here in the next lesson, where we will build a database from scratch. 

## **Case: Movies Tonight, Part 1**

### **An Ancient App**

Movies Tonight was an ancient web app built as a tech demo in the days before broadband, CSS, web services, REST APIs, JSON, and all the other technologies we now take for granted. It was designed to show what a rich user interface could look like once we had all of those things.


It provides information about every movie shown in Riverside, California, on Thanksgiving 1996. It has dropdown menus to select the theater, movie, title, and time. Below the dropdown menus it shows links to all movie showings that meet the given selections. Selecting nothing is equivalent to selecting all. 

> For those who may be wondering ... Yes, your instructor built the app over a weekend before a Tuesday morning class. And yes, the visual design is truly hideous.

![Movies Tonight UI](https://github.com/christopherhuntley/DATA6510/raw/master/img/L5_Movies_Tonight_UI.png)

> **Through some sort of Internet miracle, the app still works in most browsers.** The code is ancient $-$ JavaScript was only about a year old at the time $-$ and won’t work in some modern browsers. 

While the web design was only barely passable, there is a gem hidden in the source code: all the data in a compressed format, along with the parser functions used to extract it into usable data records (about movies, theaters, and shows). The idea was that the JavaScript would be generated by a web server each time the page was loaded. Then the user would continue on without ever needing to refresh the page. Everything on the screen was *generated* in JavaScript, which was a truly radical idea at the time but is how most web pages are designed today.
> **Note for web geeks:** [XMLHttpRequest](https://en.wikipedia.org/wiki/XMLHttpRequest) did not exist yet; it first shipped (as an ActiveX control) in Internet Explorer 5 in 1999. This is the truly old school way to do single-page web apps.



![Movies Tonight Source](./img/L5_Movies_Tonight_Source.png)

### **Data Model**

After normalization, the database has five tables, as illustrated in the following ERD.

![Movies Tonight ERD](./img/L6_MoviesTonight_v2.png)

There are three strong entities:
- **Theater**, which has the name, location (address), and phone number of the place where the movie is showing
- **Movie**, which has the title and rating for the film being shown
- **Artist**, which has the name and biography of each actor or director in the movies



There are two associative (weak) entities:
- **Show**, which provides the time (`showtime`) when the given movie (`movieID`) is being shown at a given theater (`theaterID`)
- **Credit**, which connects each artist (`artistID`) to each movie (`movieID`) as they appear in the movie credits before/after the movie; the `ccode` (named `credit_code` in the queries below) indicates whether the artist is listed as an _actor_ (A) or a _director_ (D); some people appear as both, with two lines of credit, for a given movie. 
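
To connect the entity list to table schemas, here is a hedged sketch of how the five entities might be written out, built in a separate in-memory database so that it does not touch the real `MoviesTonight.db`. The names `movieID`, `artistID`, `theaterID`, `showtime`, `credit_code`, `title`, and `name` come from the text and queries in this lesson; the remaining column names, all data types, the choice of composite primary keys, and the placeholder rows are assumptions and may differ from the actual database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
-- Strong entities: no mandatory foreign keys.
CREATE TABLE theaters (theaterID INTEGER PRIMARY KEY, name TEXT, location TEXT, phone TEXT);
CREATE TABLE movies   (movieID   INTEGER PRIMARY KEY, title TEXT, rating TEXT);
CREATE TABLE artists  (artistID  INTEGER PRIMARY KEY, name TEXT, biography TEXT);

-- Associative (weak) entities: every foreign key is mandatory.
CREATE TABLE shows (
    theaterID INTEGER NOT NULL REFERENCES theaters (theaterID),
    movieID   INTEGER NOT NULL REFERENCES movies (movieID),
    showtime  TEXT    NOT NULL,
    PRIMARY KEY (theaterID, movieID, showtime)
);
CREATE TABLE credits (
    artistID    INTEGER NOT NULL REFERENCES artists (artistID),
    movieID     INTEGER NOT NULL REFERENCES movies (movieID),
    credit_code TEXT    NOT NULL CHECK (credit_code IN ('A', 'D')),
    -- Including credit_code in the key allows two lines of credit
    -- (actor and director) for the same artist in the same movie.
    PRIMARY KEY (artistID, movieID, credit_code)
);
""")

# Placeholder rows, not taken from the real data set.
conn.execute("INSERT INTO movies VALUES (1, 'Placeholder Movie', 'PG')")
conn.execute("INSERT INTO artists VALUES (1, 'Placeholder Artist', NULL)")
conn.execute("INSERT INTO credits VALUES (1, 1, 'A'), (1, 1, 'D')")

# The same credit line cannot be recorded twice.
try:
    conn.execute("INSERT INTO credits VALUES (1, 1, 'A')")
    duplicate_credit_rejected = False
except sqlite3.IntegrityError:
    duplicate_credit_rejected = True

# Artists credited as both actor and director in the same movie.
both_roles = conn.execute("""
    SELECT a.name, m.title
    FROM credits AS c1
        JOIN credits AS c2 ON c2.artistID = c1.artistID AND c2.movieID = c1.movieID
        JOIN artists AS a ON a.artistID = c1.artistID
        JOIN movies  AS m ON m.movieID  = c1.movieID
    WHERE c1.credit_code = 'A' AND c2.credit_code = 'D'
""").fetchall()
print(duplicate_credit_rejected, both_roles)
```

The strong tables are created and loaded first, and the two associative tables follow, which mirrors the load order discussed under strong and weak entities.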

### **The Live Database**

Run each of the cells below, one at a time, to see the database in action. Study the queries to be sure you understand exactly how and why they work.

In [None]:
# Load %%sql magic
!pip install jupysql
%load_ext sql
%config SqlMagic.displaylimit = None

In [None]:
%sql sqlite:///data/MoviesTonight/MoviesTonight.db

#### **What movie titles were showing?**

In [None]:
%%sql
SELECT DISTINCT title
FROM movies;

#### **What is the earliest time when we can see a show?**

In [None]:
%%sql
SELECT MIN(showtime)
FROM shows

#### **In what movies did Eli Wallach appear?**

In [None]:
%%sql
SELECT DISTINCT movieID, title
FROM credits
      JOIN movies USING (movieID)
      JOIN artists USING (artistID)
WHERE name = 'Eli Wallach'

#### **Who were Eli Wallach's costars (note: actors only) in movies released in 1996?** Be sure to include movie titles.

In [None]:
%%sql
SELECT DISTINCT a2.name, m.title
FROM artists AS a1
  JOIN credits AS c1 USING (artistID)
  JOIN movies AS m USING (movieID)
  JOIN credits AS c2 USING (movieID)
  JOIN artists AS a2 ON (a2.artistID = c2.artistID)
WHERE a1.name = 'Eli Wallach' AND c2.credit_code = 'A' AND a2.name <> 'Eli Wallach'

#### **Which artists were both actor and director in the same movie?**

In [None]:
%%sql
SELECT artists.name, movies.title
FROM artists
        JOIN credits AS c1 USING (artistID)
        JOIN movies USING (movieID)
        JOIN credits AS c2 ON (c1.artistID=c2.artistID AND c1.movieID=c2.movieID)
WHERE c1.credit_code = 'A' AND c2.credit_code = 'D'

#### **How many artists were there in the above query?**

In [None]:
%%sql
WITH multiple_credits AS (
  SELECT artists.name,movies.title
  FROM credits
        JOIN artists USING (artistID)
        JOIN movies USING (movieID)
  GROUP BY artistID,movieID,artists.name,movies.title
  HAVING count(*)>1
)
SELECT COUNT(*) AS `actors who are also directors`
FROM multiple_credits

---
## **Congratulations! You've made it to the end of Lesson 5.**

We have covered the essential theory of Entity Relationship Modeling. It is surprisingly concise! However, there is also an art to ER modeling that is hard to develop without practice.
