<a href="https://colab.research.google.com/github/brendanpshea/database_sql/blob/main/Database_07_Advanced_Modeling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Advanced Modeling - Unfurling the Magic of Normalization and Subclass-Superclass Relationships
### Brendan Shea, PhD

Welcome to the next stage of our journey into the magical realm of database design and modeling. You've journeyed through the basics of database fundamentals and elementary modeling. Now, we take you further into the world of advanced modeling techniques, shining a light on the intricacies of normalization and subclass-superclass relationships.

In this chapter, we will draw upon the enchanting world of J.K. Rowling's Hogwarts School of Witchcraft and Wizardry to guide our exploration. Using a comprehensive example of Hogwarts' enrollment data, initially stored as a simple spreadsheet, we will illustrate how it can be methodically transformed into a robust, efficient database model.

The process of normalization, developed by Edgar F. Codd, will be introduced as our essential tool for removing redundancies and improving the structural integrity of our database. We'll discover the importance of achieving various 'normal forms', helping to prevent the calamitous effects of data anomalies and inconsistencies.

From the Great Hall to the classrooms of Hogwarts, we will encounter the ubiquitous presence of relationships in our data. Building upon your understanding of elementary relationships, we'll delve into the concept of subclass-superclass relationships, the very cornerstone of object-oriented database models. Drawing from the diverse personalities within Hogwarts, we'll see how these relationships play a pivotal role in creating a flexible and expressive database.

Understanding how to model these complex relationships correctly and implement appropriate constraints will be vital. They are the keys to ensuring the accuracy, consistency, and relevancy of the data housed within our database.

Throughout the chapter, we'll also touch upon concepts such as multi-valued attributes, transitive dependencies, one-to-many and many-to-many relationships, and more. We will consider the role of denormalization and indexing in database design, essential topics in the advanced design of databases.

By the end of this chapter, you'll not only have a comprehensive understanding of these advanced modeling techniques but also see their application in real-world contexts, making you equipped to tackle complex database design problems. So, let's journey together into the fascinating world of normalization and subclass-superclass relationships, weaving together the enchanting stories of Hogwarts with the practical magic of database design.

## Chapter Case Study: Hogwarts Enrollment Database

In this chapter, we're going to explore the fascinating world of Hogwarts School of Witchcraft and Wizardry through the lens of a database designer. We will utilize Hogwarts' enrollment data, initially captured in a spreadsheet, to understand the intricacies of advanced database modeling.

The initial spreadsheet used for enrollment contains the following ten attributes:

1.  Student Name
2.  Date of Birth
3.  Gender
4.  Guardian Name
5.  Guardian Contact Information
6.  Home Address
7.  House (Gryffindor, Hufflepuff, Ravenclaw, or Slytherin)
8.  Year Level (Year 1 through Year 7)
9.  Courses Enrolled
10. Extra-Curricular Clubs

A sample of this spreadsheet is as follows:




| Student Name | Date of Birth | Gender | Guardian Name | Guardian Contact Information | Home Address | House | Year Level | Courses Enrolled | Extra-Curricular Clubs |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Harry Potter | 31/07/1980 | Male | Sirius Black | <SiriusBlack@mail.com> | 4 Privet Drive, Little Whinging, Surrey | Gryffindor | Year 7 | Defence Against the Dark Arts, Potions, Herbology | Quidditch, Dumbledore's Army |
| Hermione Granger | 19/09/1979 | Female | Mr. and Mrs. Granger | <GrangerFamily@mail.com> | 17 Mill Lane, London | Gryffindor | Year 7 | Arithmancy, Potions, Transfiguration | Dumbledore's Army |
| Draco Malfoy | 05/06/1980 | Male | Lucius Malfoy | <LuciusMalfoy@mail.com> | Malfoy Manor, Wiltshire | Slytherin | Year 7 | Potions, Defence Against the Dark Arts, Transfiguration | Slytherin Quidditch Team |
| Luna Lovegood | 13/02/1981 | Female | Xenophilius Lovegood | <XenoLovegood@mail.com> | Lovegood House, near Ottery St. Catchpole, Devon | Ravenclaw | Year 6 | Divination, Herbology, Astronomy | Ravenclaw Quidditch Team, Dumbledore's Army |
| Cedric Diggory | 10/10/1977 | Male | Amos Diggory | <AmosDiggory@mail.com> | Diggory House, near Ottery St. Catchpole, Devon | Hufflepuff | Year 7 | Herbology, Charms, History of Magic | Hufflepuff Quidditch Team |

This sample data will provide a baseline for us to identify the existing issues with the spreadsheet format and demonstrate how we can transform it into a more efficient and structured database model.

### Problems With a "Flat File" Enrollment Spreadsheet

This data structure (known as a "flat file"), although straightforward and simple to use, has several inherent issues from a database design perspective:

1. *No obvious primary keys:* No single attribute uniquely identifies a record in the spreadsheet. For instance, 'Student Name' may seem a likely candidate, but there's always a possibility of having two students with the same name.

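2. *Redundant data:* The same values are typed out again and again. Every student taking Potions repeats the course name in 'Courses Enrolled', and every member of Dumbledore's Army repeats the club name in 'Extra-Curricular Clubs'. A single misspelling or missed update can lead to anomalies and inconsistencies.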

3. *Multi-valued attributes:* Some attributes, like 'Courses Enrolled' and 'Extra-Curricular Clubs', can have multiple values for a single student, which is problematic in a database context.

4. *Lack of structure for complex data:* The current format does not adequately handle complex relationships. For example, how do we handle course assignments for different teachers, or track a student's year level as it changes over time?

In this chapter, we'll take this unnormalized spreadsheet and transform it into a well-structured, efficient database model, capable of supporting Hogwarts in managing its magical student body effectively. We will apply normalization techniques and identify subclass-superclass relationships, addressing each issue present in the initial spreadsheet. By the end, the spreadsheet's transformation into a database will reveal the powerful magic of advanced database design techniques. Let's step into this enchanting journey!

## Why Normalize?
Normalizing data is like organizing a big, messy closet. Imagine you have a closet where you've thrown in clothes, shoes, hats, scarves, books, and even some snacks. Now, if you want to find your blue sweater or your favorite book, it's going to be quite a challenge, right? You'll have to sift through everything else to get what you need. Now imagine if you took some time to organize this closet, putting clothes in one area, books on a shelf, snacks in a corner, and shoes neatly lined up on the floor. It will be much easier to find what you're looking for, and there's less chance of pulling out a snack when you're looking for a hat!

This is pretty much what normalization does to data in a database. When data is not normalized, information can be scattered across various tables or can be repeated in many places. This can lead to a lot of problems:

- *Redundancy:* Just like our messy closet example, unorganized data can lead to unnecessary duplication. This not only takes up more storage space but can also lead to inconsistencies. Imagine if Hogwarts kept the house points in each student's record. If they update it for one student and forget to do it for others, it can lead to confusion.

- *Update Anomalies:* When data is repeated, updating it can become a nightmare. Imagine a professor changes their office location. If their office location is stored in every course they teach, that's a lot of places to make one change! With normalization, we would store professor information in one table, and any changes only need to be made in one place.

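- *Insertion and Deletion Anomalies:* Without normalization, adding or removing data can be tricky. What if a new student joins Hogwarts, but they haven't been assigned a house yet? Where would we store their information if we only had a table for each house? With a separate "Students" table, we can add the student immediately and update their house assignment later. Deletion has the mirror-image problem: if a course is recorded only in the rows of the students enrolled in it, removing the last enrolled student also erases every trace of the course.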

Normalization, through its different forms (1NF, 2NF, 3NF, etc.), helps to solve these problems by dividing data into logical units where each piece of information is stored in one place.
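
To make the update anomaly concrete, here is a minimal sketch using Python's built-in `sqlite3` module (chosen only so the example runs anywhere). The table names `CourseFlat`, `Professors`, and `Courses`, the course list, and the office names are all invented for illustration and are not part of the chapter's schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Unnormalized: the professor's office is repeated on every course row.
conn.execute(
    "CREATE TABLE CourseFlat (Course TEXT PRIMARY KEY, Professor TEXT, Office TEXT)"
)
conn.executemany(
    "INSERT INTO CourseFlat VALUES (?, ?, ?)",
    [
        ("Potions", "Severus Snape", "Dungeon 1"),
        ("Advanced Potions", "Severus Snape", "Dungeon 1"),
    ],
)

# Someone updates only one of the two rows: the data now disagrees with itself.
conn.execute("UPDATE CourseFlat SET Office = 'Tower 3' WHERE Course = 'Potions'")
flat_offices = {
    office
    for (office,) in conn.execute(
        "SELECT Office FROM CourseFlat WHERE Professor = 'Severus Snape'"
    )
}

# Normalized: the office is stored once, in a Professors table.
conn.execute("CREATE TABLE Professors (Professor TEXT PRIMARY KEY, Office TEXT)")
conn.execute(
    "CREATE TABLE Courses (Course TEXT PRIMARY KEY, "
    "Professor TEXT REFERENCES Professors(Professor))"
)
conn.execute("INSERT INTO Professors VALUES ('Severus Snape', 'Dungeon 1')")
conn.executemany(
    "INSERT INTO Courses VALUES (?, ?)",
    [("Potions", "Severus Snape"), ("Advanced Potions", "Severus Snape")],
)

# One UPDATE changes the office for every course the professor teaches.
conn.execute("UPDATE Professors SET Office = 'Tower 3' WHERE Professor = 'Severus Snape'")
normalized_offices = {
    office
    for (office,) in conn.execute(
        "SELECT p.Office FROM Courses c JOIN Professors p ON p.Professor = c.Professor"
    )
}

print(flat_offices)        # two different offices for the same professor
print(normalized_offices)  # a single, consistent office
```

In the flat table the professor ends up with two offices at once; in the normalized version there is only one place for the fact to live, so it cannot contradict itself.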

## Normalization Applied to the Hogwarts Enrollment Spreadsheet

*Normalization* is the process of structuring a relational database in accordance with a series of so-called normal forms to reduce data redundancy and improve data integrity. Let's examine how we can apply normalization to our Hogwarts Enrollment Spreadsheet, starting with the first normal form (1NF), then progressing to the second normal form (2NF), and finally, the third normal form (3NF).

### First Normal Form (1NF):

The first normal form dictates that each column of a table must contain only **atomic (indivisible) values**, and there must be a primary key.

To conform our spreadsheet to 1NF, we first need to eliminate **multi-valued attributes** (that is, attributes that contain multiple values, such as a list of multiple courses). We'll split "Courses Enrolled" and "Extra-Curricular Clubs" into separate rows. Additionally, we'll introduce a unique Student ID for each student, as it's possible to have two students with the same name (and so, student name should NOT be used as a primary key).

For example, our 1NF tables might look like this:

*Students*

| Student ID | Student Name | Date of Birth | Gender | Guardian Name | Guardian Contact Information | Home Address | House | Year Level |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1001 | Harry Potter | 31/07/1980 | Male | Sirius Black | <SiriusBlack@mail.com> | 4 Privet Drive, Little Whinging, Surrey | Gryffindor | Year 7 |

*Courses*

| Student ID | Course |
| --- | --- |
| 1001 | Defence Against the Dark Arts |
| 1001 | Potions |
| 1001 | Herbology |

*Clubs*

| Student ID | Club |
| --- | --- |
| 1001 | Quidditch |
| 1001 | Dumbledore's Army |
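
The split into 1NF tables can be sketched in a few lines of plain Python. This is only an illustration under two assumptions: the multi-valued cells are comma-separated strings in which no single value contains a comma, and Student IDs are handed out sequentially starting at 1001. The helper name `split_cell` is made up for this example.

```python
# One row of the original spreadsheet, with multi-valued cells stored as
# comma-separated strings (values copied from the sample data).
flat_rows = [
    {
        "Student Name": "Harry Potter",
        "Courses Enrolled": "Defence Against the Dark Arts, Potions, Herbology",
        "Extra-Curricular Clubs": "Quidditch, Dumbledore's Army",
    },
]


def split_cell(cell):
    """Turn 'a, b, c' into ['a', 'b', 'c'], ignoring empty entries."""
    return [part.strip() for part in cell.split(",") if part.strip()]


students, courses, clubs = [], [], []
for student_id, row in enumerate(flat_rows, start=1001):
    # One row per student, identified by the new surrogate key.
    students.append((student_id, row["Student Name"]))
    # One row per (student, course) and per (student, club) pair.
    courses += [(student_id, c) for c in split_cell(row["Courses Enrolled"])]
    clubs += [(student_id, c) for c in split_cell(row["Extra-Curricular Clubs"])]

for pair in courses + clubs:
    print(pair)
```

Each printed pair corresponds to one row of the *Courses* or *Clubs* table: every cell now holds a single atomic value.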

### Second Normal Form (2NF)

The second normal form requires that all non-key attributes for each table are fully dependent on the primary key. In particular, it means that if we have a composite key (that is, a key that is made up of two or more attributes), all the other attributes should depend on BOTH parts of this key. In our current database, this isn't a problem: Students has a simple, single-column primary key (this is generally a good idea!), and although Courses and Clubs are keyed on a pair of columns, they contain no non-key attributes that could depend on only part of that key.

Let's consider a different scenario where we have a table that doesn't conform to the 2NF. Suppose we have a table called "Course_Enrollment" where we store which professor teaches which course to which student. The primary key for this table is a composite key, composed of "Student ID" and "Course ID". We also store the professor's name who teaches that course.

**Course_Enrollment (Before 2NF)**
| Key | Attribute | Data Type |
| --- | --- | --- |
| PK | Student ID | Integer |
| PK | Course ID | Integer |
|  | Professor Name | String |

In this example, "Professor Name" depends only on the "Course ID", not on both parts of the composite key. This is a violation of the 2NF.  To correct this and bring the table to 2NF, we would need to split this table into two:

1.  Course_Enrollment - which stores which student takes which course. Here the primary key is a composite key of "Student ID" and "Course ID".

2.  Course_Professor - which stores which professor teaches which course. Here the primary key is "Course ID".

**Course_Enrollment (After 2NF)**

| Key | Attribute | Data Type |
| --- | --- | --- |
| PK, FK | Student ID | Integer |
| PK, FK | Course ID | Integer |

**Course_Professor (After 2NF)**

| Key | Attribute | Data Type |
| --- | --- | --- |
| PK, FK | Course ID | Integer |
|  | Professor Name | String |

Now, in both tables, all non-key attributes are fully dependent on their primary keys. We have successfully eliminated the partial dependency, and our tables now conform to the 2NF.

While our original data did not have this issue (no table combined a composite key with non-key attributes), it is important to understand this concept, as more complex databases often require the use of composite keys.
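
As a rough check that the 2NF split loses nothing, the sketch below builds the two tables in an in-memory SQLite database and joins them back together. The course IDs (10, 20), the second student ID (1002), and the professor assignments are invented sample data, and the sketch leaves out the separate Courses and Students tables that the FK markers above imply.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Professor Name depends only on Course ID, so it lives here, once per course.
CREATE TABLE Course_Professor (
    CourseID INTEGER PRIMARY KEY,
    ProfessorName TEXT NOT NULL
);

-- The enrollment table keeps only the composite key.
CREATE TABLE Course_Enrollment (
    StudentID INTEGER,
    CourseID INTEGER REFERENCES Course_Professor(CourseID),
    PRIMARY KEY (StudentID, CourseID)
);

INSERT INTO Course_Professor VALUES (10, 'Severus Snape'), (20, 'Pomona Sprout');
INSERT INTO Course_Enrollment VALUES (1001, 10), (1001, 20), (1002, 10);
""")

# A join reconstructs the original "before 2NF" view on demand.
rows = conn.execute("""
    SELECT e.StudentID, e.CourseID, p.ProfessorName
    FROM Course_Enrollment AS e
    JOIN Course_Professor AS p ON p.CourseID = e.CourseID
    ORDER BY e.StudentID, e.CourseID
""").fetchall()

for row in rows:
    print(row)
```

The professor's name appears three times in the joined result but is stored only twice (once per course), which is exactly the redundancy the decomposition removes.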

### Third Normal Form (3NF):

The third normal form dictates that all non-key attributes must depend only on the primary key, not on other non-key attributes (no **transitive dependencies**).

Here, a transitive dependency is a type of database dependency where a non-key attribute depends on another non-key attribute, which in turn depends on the primary key of the table. In simpler terms, if there are three attributes A, B, and C, a transitive dependency is a condition where A → B and B → C, which indirectly creates a dependency of A → C (C depends on A through B). When A is the key and B is a non-key attribute, this is a violation of the third normal form (3NF).
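
One way to get a feel for functional dependencies is to test them against sample rows. The helper below, `holds`, is a hypothetical function written for this illustration; the student rows (IDs 1001 to 1003, each with a house and the head of that house) are invented sample data. Keep in mind that sample data can only *refute* a dependency: passing the check does not prove the rule holds for all future data.

```python
def holds(rows, lhs, rhs):
    """Return True if the functional dependency lhs -> rhs holds in rows."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        value = tuple(row[a] for a in rhs)
        # The same left-hand side must always map to the same right-hand side.
        if seen.setdefault(key, value) != value:
            return False
    return True


rows = [
    {"StudentID": 1001, "House": "Gryffindor", "HouseHead": "Minerva McGonagall"},
    {"StudentID": 1002, "House": "Gryffindor", "HouseHead": "Minerva McGonagall"},
    {"StudentID": 1003, "House": "Slytherin", "HouseHead": "Severus Snape"},
]

print(holds(rows, ["StudentID"], ["House"]))      # True:  A -> B
print(holds(rows, ["House"], ["HouseHead"]))      # True:  B -> C
print(holds(rows, ["StudentID"], ["HouseHead"]))  # True:  A -> C, via B
print(holds(rows, ["House"], ["StudentID"]))      # False: a house has many students
```

Here `HouseHead` depends on the key `StudentID` only by way of the non-key attribute `House`, which is the pattern 3NF asks us to remove.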

For our Hogwarts data, we do not have transitive dependencies in our current tables. However, if we decided to add additional information like "House Head" or "House Points" related to the house each student belongs to, we would then have a transitive dependency (Student ID → House → House Head / House Points). To satisfy 3NF, we would then create a separate table for "Houses".

*Houses*

| House | House Head | House Points |
| --- | --- | --- |
| Gryffindor | Minerva McGonagall | 6720 |

Now our Hogwarts enrollment data is normalized to the third normal form. Through normalization, we have minimized data redundancy, and the database structure can now handle complex queries more efficiently. As Hogwarts' enrollment continues to grow, the benefits of normalization will be increasingly apparent, proving essential for effectively managing student data.

## "The Key, the Whole Key, and Nothing But the Key, So Help Me Codd"

A common mnemonic for remembering Codd's three normal forms is **"The Key, The Whole Key, and Nothing But The Key, So Help Me Codd."** Here's the basic idea:

1.  **"The Key"** This corresponds to the First Normal Form (1NF). In 1NF, each column of a table must contain only atomic (indivisible) values, and there must be a primary key. This means that there should be no duplicate rows, and each attribute (column) must hold a single value rather than a list of values. Thus, the "Key" part of the mnemonic refers to the uniqueness and singularity of the data in each row and column, which is defined by the primary key.

2.  **"The Whole Key"** This corresponds to the Second Normal Form (2NF). 2NF dictates that every non-key attribute (field) must be functionally dependent on the whole key, meaning that all information in a table should pertain to the concept that the primary key represents. In particular, if the primary key is composite (made up of multiple fields), no non-key attribute should depend on only part of the composite key.

3.  **"Nothing But The Key"** This corresponds to the Third Normal Form (3NF). The phrase "Nothing but the Key" refers to the rule that no non-key attribute should depend on other non-key attributes, only on the primary key. This is designed to eliminate transitive dependencies (where A depends on B, and B depends on C, leading to an indirect dependency of A on C).

This mnemonic covers the first three normal forms, which are typically sufficient for most database normalization tasks. It's important to note, however, that there are further forms of normalization, including the Boyce-Codd Normal Form (BCNF), Fourth Normal Form (4NF), and Fifth Normal Form (5NF), that deal with more specific and less common issues.


However, there's a point where we may decide to "stop" normalizing. Higher levels of normalization (beyond 3NF) can lead to a very large number of small tables. This can make the database harder to understand and queries more complex. In some cases, it might even affect performance.

## Enrollment Data After Normalization
After the data has been normalized, we are left with something like the following. (Note: There are multiple ways of successfully "normalizing" the data, and there might well be *better* ways of doing this!) Here is what the diagram might look like in modified Crow's Foot notation (later, we'll look at an example of a Chen diagram):




![Hogwarts Enrollment ER Diagram](https://github.com/brendanpshea/database_sql/raw/main/images/hogworts_enroll.png)



Here is the SQL code:

In [1]:
!pip install SQLAlchemy==1.3.24 -q # Needed to avoid problems with more recent versions in Colab

# For this section, we need to use PostgreSQL,
# which supports INHERITANCE

!apt install postgresql postgresql-contrib &>log
!service postgresql start
!sudo -u postgres psql -c "CREATE USER root WITH SUPERUSER"

%load_ext sql
%sql postgresql+psycopg2://@/postgres


In [None]:
%%sql

-- Create the Houses table first, as it's referenced by other tables
CREATE TABLE Houses (
    House VARCHAR(50) PRIMARY KEY,
    HouseHead VARCHAR(100),
    HousePoints INT CHECK (HousePoints >= 0)  -- Constraint to ensure points can't be negative
);

-- Create the Students table, referencing the Houses table
CREATE TABLE Students (
    StudentID INT PRIMARY KEY,
    StudentName VARCHAR(100) NOT NULL,
    DateOfBirth DATE NOT NULL,
    Gender VARCHAR(10) CHECK (Gender IN ('Male', 'Female', 'Other')),  -- Constraint for valid genders
    GuardianName VARCHAR(100),
    GuardianContactInformation VARCHAR(100),
    HomeAddress VARCHAR(200),
    House VARCHAR(50),
    YearLevel INT CHECK (YearLevel BETWEEN 1 AND 7),  -- Hogwarts has 7 year levels
    FOREIGN KEY (House) REFERENCES Houses(House)  -- References House in Houses table
);

-- Create the Courses table, referencing the Students table
CREATE TABLE Courses (
    StudentID INT,
    Course VARCHAR(100),
    PRIMARY KEY (StudentID, Course),  -- Composite primary key
    FOREIGN KEY (StudentID) REFERENCES Students(StudentID)  -- References StudentID in Students table
);

-- Create the Clubs table, referencing the Students table
CREATE TABLE Clubs (
    StudentID INT,
    Club VARCHAR(100),
    PRIMARY KEY (StudentID, Club),  -- Composite primary key
    FOREIGN KEY (StudentID) REFERENCES Students(StudentID)  -- References StudentID in Students table
);


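
To see the constraints in this schema at work without a database server, here is a hedged sketch that recreates a trimmed-down version of the tables in an in-memory SQLite database via Python's `sqlite3` module. SQLite stands in for PostgreSQL here purely for convenience, several columns are left out to keep the example short, and the rejected rows ("Test Student", year 9, the house "Durmstrang") are invented test values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked
conn.executescript("""
CREATE TABLE Houses (
    House VARCHAR(50) PRIMARY KEY,
    HouseHead VARCHAR(100),
    HousePoints INT CHECK (HousePoints >= 0)
);
CREATE TABLE Students (
    StudentID INT PRIMARY KEY,
    StudentName VARCHAR(100) NOT NULL,
    House VARCHAR(50),
    YearLevel INT CHECK (YearLevel BETWEEN 1 AND 7),
    FOREIGN KEY (House) REFERENCES Houses(House)
);
CREATE TABLE Courses (
    StudentID INT,
    Course VARCHAR(100),
    PRIMARY KEY (StudentID, Course),
    FOREIGN KEY (StudentID) REFERENCES Students(StudentID)
);
""")

conn.execute("INSERT INTO Houses VALUES ('Gryffindor', 'Minerva McGonagall', 6720)")
conn.execute("INSERT INTO Students VALUES (1001, 'Harry Potter', 'Gryffindor', 7)")
conn.executemany(
    "INSERT INTO Courses VALUES (?, ?)", [(1001, "Potions"), (1001, "Herbology")]
)

# Joins bring the separated facts back together when we need them.
report = conn.execute("""
    SELECT s.StudentName, h.HouseHead, c.Course
    FROM Students s
    JOIN Houses h ON h.House = s.House
    JOIN Courses c ON c.StudentID = s.StudentID
    ORDER BY c.Course
""").fetchall()


def rejected(sql):
    """Return True if the database refuses the statement."""
    try:
        conn.execute(sql)
    except sqlite3.IntegrityError:
        return True
    return False


# YearLevel 9 violates the CHECK constraint; 'Durmstrang' is not in Houses.
bad_year = rejected("INSERT INTO Students VALUES (1002, 'Test Student', 'Gryffindor', 9)")
bad_house = rejected("INSERT INTO Students VALUES (1003, 'Test Student', 'Durmstrang', 1)")

print(report)
print(bad_year, bad_house)
```

Both bad inserts are refused by the database itself, so the rules do not depend on every application remembering to check them.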

## Exercise: Normalizing the Hogwarts Library Inventory

### The Scenario

Imagine that the Hogwarts Library maintains a spreadsheet to keep track of their book inventory. Each row represents a book, and there are the following columns:

1.  Book Name
2.  Book Author
3.  Publish Year
4.  Categories (as books can belong to more than one category, e.g., "Herbology, Potion-Making")
5.  Availability (whether the book is currently available or checked out)
6.  Current Holder (the person who has the book if it's checked out)
7.  Borrow Dates (the dates that the book has been borrowed)

Here is an example of what five rows might look like:

| Book Name | Book Author | Publish Year | Categories | Availability | Current Holder | Borrow Dates |
| --- | --- | --- | --- | --- | --- | --- |
| Herbology at Home | Phyllida Spore | 1983 | Herbology, Home Studies | Checked Out | Neville Longbottom | 12/02/2023, 20/06/2023 |
| Hogwarts: A History | Bathilda Bagshot | 1950 | History, Magic | Available | - | 22/01/2023 |
| Quidditch Through the Ages | Kennilworthy Whisp | 1952 | Sports, Magic | Checked Out | Harry Potter | 15/01/2023, 07/03/2023, 01/06/2023 |
| Magical Water Plants of the Mediterranean | Hadrian Whittle | 1971 | Herbology, Travel | Available | - | - |
| Magical Hieroglyphs and Logograms | Bathilda Bagshot | 1964 | Runes, Magic | Available | - | 28/02/2023 |

Notice that there's no clear primary key, and the Categories and Borrow Dates columns have multi-valued attributes.

* * * * *

#### Your Task

1.  First Normal Form (1NF): Convert this data to the first normal form (1NF) by eliminating multi-valued attributes and introducing a unique identifier for each book.

2.  Second Normal Form (2NF): Then convert your 1NF tables to the second normal form (2NF). Are there any columns that are not fully functionally dependent on the primary key? How would you resolve this?

3.  Third Normal Form (3NF): Finally, ensure that your 2NF tables are in the third normal form (3NF). Are there any columns whose values depend on other non-key columns? If so, how would you resolve this?

Take some time to work through each of these steps, sketching out your tables and thinking about how to handle the multi-valued and dependent attributes at each stage.

## Answer: Normalizing the Hogwarts Library

1. (1NF)




2. (2NF)




3. (3NF)


4. Create the Tables in SQL (Below)

In [None]:
%%sql
--Create the Hogwarts Library Tables in SQL

## Subtypes and Supertypes - A Practical Dive
Having mastered normalization, it's time to explore another crucial aspect of advanced database modeling - Subtypes and Supertypes. This concept is at the heart of modeling more complex real-world relationships in our databases.

In the simplest terms, a Supertype is a generic entity type that has a relationship with one or more Subtypes, which are more specific entity types. Essentially, we have a hierarchy where the Supertype sits at the top, and Subtypes represent categories or types of the Supertype. There are common attributes in the Supertype, and each Subtype may contain additional attributes specific to it. This relationship often comes into play in object-oriented programming but is just as critical in structuring our databases effectively.

To illustrate this, let's consider the case of Hogwarts School of Witchcraft and Wizardry again. Imagine the IT staff at Hogwarts. They have a daunting task - they need to manage accounts for different types of users: students, faculty, and administrators. Each of these user types has common attributes (like Name, Password, Email), but they also have unique attributes.

For instance, a student has a Year Level and a House, faculty members have Subjects Taught, and administrators have a Job Role. Without Subtype and Supertype relationships, we would have redundancy in our database, and changes would be cumbersome.

By the end of this section, you'll have a solid understanding of Subtypes and Supertypes and be equipped to structure databases even more effectively, whether you're managing a school, a bookstore, or a magical institution like Hogwarts!

## What are Supertypes and Subtypes?
Supertypes and Subtypes are foundational concepts in both object-oriented programming and database design. They help us model real-world scenarios more accurately and efficiently by capturing the hierarchical relationships between different entities.

A **supertype** is a generic entity type that has a relationship with one or more **subtypes**, which are more specific categories of the Supertype. In other words, a Supertype is a broad classification, while Subtypes are its more specialized versions.

To better understand this concept, let's look at some examples from our Hogwarts case study:

1. *Person as a Supertype:* At Hogwarts, many entities could be classified broadly as "Persons". This category could have attributes like Name, Date of Birth, and Address. However, there are different kinds of "Persons" at Hogwarts, such as Students, Professors, and Administrators. These more specific categories are the Subtypes, each having unique attributes in addition to those of the Person Supertype. For example, Students might have additional attributes like House, Year Level, and Courses, while Professors have Subjects Taught, and Administrators have Job Role.

2. *Course as a Supertype:* A "Course" is another generic entity at Hogwarts with attributes such as Course Name and Course Description. Subtypes might be Mandatory Courses and Optional Courses. Mandatory Courses may have an additional attribute like Year Level (indicating which year level the course is compulsory for), and Optional Courses might have Pre-requisite Courses as an additional attribute.

3. *Room as a Supertype:* Hogwarts has many rooms with generic attributes like Room Number and Floor. However, the rooms serve different purposes, leading to Subtypes like Classrooms, Offices, and Dormitories. A Classroom might have additional attributes like Course Taught, an Office might be associated with a specific Faculty Member, and a Dormitory could have Capacity and Associated House as additional attributes.

In each example, the Supertype has attributes common to all Subtypes, and each Subtype has attributes specific to it. This approach avoids unnecessary redundancy and offers a more structured and efficient way to model our data.

## Scenarios for Sub- and Supertypes
Subtype and supertype relationships are commonly used in scenarios where entities have certain attributes or relationships in common but also have unique attributes or relationships. This type of modeling helps us keep our database design DRY (Don't Repeat Yourself), a principle aimed at reducing redundancy.

These relationships are particularly useful when:

1.  *Entities have several attributes in common, but not all:* In our Hogwarts case, "Student", "Professor", and "Administrator" could all be subtypes of the supertype "Person". All persons might have attributes like Name, Date of Birth, and Contact Information in common, but each subtype may also have unique attributes: a student might have 'Year Level' and 'Guardian Name', a professor might have 'Department' and 'Courses Taught', and an administrator might have 'Administrative Role' and 'Office Location'.

2.  *Entities participate in common relationships, but also have unique ones:* For example, all types of persons might be related to a 'Houses' entity (if we assume professors and administrators also have house affiliations), but a student might be related to 'Courses' and 'Clubs' entities, while an administrator might be related to a 'Meetings' entity.

3.  *We want to enforce exclusivity across subtypes:* For instance, we may want to ensure that each 'Person' is a 'Student', 'Professor', or 'Administrator', but not more than one of these. In other words, we want to avoid a situation where someone is recorded as both a student and a professor.

4.  *We want to easily handle and maintain common behavior or business rules across entities:* If we know that every person at Hogwarts, regardless of their role, must have a unique ID and an associated house, it's easier to enforce this at the 'Person' level, rather than individually for 'Student', 'Professor', and 'Administrator'.

By modeling these scenarios using supertype/subtype relationships, we can create a more organized and efficient database design that is easier to understand, maintain, and evolve over time.

## Example: Hogwarts IT
Let's consider the situation where Hogwarts IT is managing accounts for three types of users: students, faculty, and administrators. In this case, we can identify a supertype "User" with subtypes "Student", "Faculty", and "Administrator". The "User" entity would hold attributes common to all subtypes, while each subtype would hold attributes unique to that user type.

To start, we define a "User" table:

*User*

| UserID (PK) | Username | Password | User_Type |
| --- | --- | --- | --- |
| 1001 | hpotter | secret123 | Student |
| 2001 | mmcgonagall | catpassword | Faculty |
| 3001 | adumbledore | elderwand | Administrator |

This table contains a primary key (UserID), as well as common attributes (Username, Password), and a discriminator column (User_Type) which indicates which subtype each record belongs to.

Then, we create separate tables for each of the subtypes:

*Student*

| UserID (PK, FK) | YearLevel | GuardianName |
| --- | --- | --- |
| 1001 | 7 | Sirius Black |

*Faculty*

| UserID (PK, FK) | Department | CoursesTaught |
| --- | --- | --- |
| 2001 | Transfiguration | Transfiguration 101, Advanced Transfiguration |

*Administrator*

| UserID (PK, FK) | AdminRole | OfficeLocation |
| --- | --- | --- |
| 3001 | Headmaster | Headmaster's Office |

Each of these subtype tables uses the UserID from the "User" table as both its primary key and a foreign key. This way, we establish a relationship between the supertype and each of its subtypes. (CoursesTaught is shown as a comma-separated list only for brevity; as we saw with 1NF, a fully normalized design would move it into its own table.)

Now, Hogwarts IT can easily manage accounts for all users, keeping common attributes together in one place and maintaining the unique attributes separately in a structured and organized manner. This approach not only enhances clarity and usability, but also facilitates efficient management and scalability of the database system as the school grows.
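
The supertype/subtype tables above can be sketched with Python's `sqlite3` module as follows. This is one possible layout, not the only one: the Password column and the Faculty table are omitted for brevity, the `UNIQUE` and `CHECK` constraints are additions made for this illustration, and UserID 9999 is an invented value. Note also that `User` is accepted as a table name by SQLite but is a reserved word in PostgreSQL, where it would need quoting or renaming.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked
conn.executescript("""
-- Supertype: attributes shared by every kind of user.
CREATE TABLE User (
    UserID INTEGER PRIMARY KEY,
    Username TEXT NOT NULL UNIQUE,
    User_Type TEXT NOT NULL
        CHECK (User_Type IN ('Student', 'Faculty', 'Administrator'))
);

-- Subtypes: the primary key doubles as a foreign key to the supertype.
CREATE TABLE Student (
    UserID INTEGER PRIMARY KEY REFERENCES User(UserID),
    YearLevel INTEGER,
    GuardianName TEXT
);
CREATE TABLE Administrator (
    UserID INTEGER PRIMARY KEY REFERENCES User(UserID),
    AdminRole TEXT,
    OfficeLocation TEXT
);

INSERT INTO User VALUES (1001, 'hpotter', 'Student'),
                        (3001, 'adumbledore', 'Administrator');
INSERT INTO Student VALUES (1001, 7, 'Sirius Black');
INSERT INTO Administrator VALUES (3001, 'Headmaster', 'Headmaster''s Office');
""")

# Joining supertype and subtype gives the full picture of a student account.
students = conn.execute("""
    SELECT u.Username, s.YearLevel, s.GuardianName
    FROM User u
    JOIN Student s ON s.UserID = u.UserID
""").fetchall()

# A subtype row without a matching supertype row is refused.
try:
    conn.execute("INSERT INTO Student VALUES (9999, 1, 'Nobody')")
    orphan_allowed = True
except sqlite3.IntegrityError:
    orphan_allowed = False

print(students)
print(orphan_allowed)
```

Because each subtype's primary key is also a foreign key, a Student row can exist only if the corresponding User row does, and each user can have at most one row per subtype table.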

## Understanding Overlapping and Non-Overlapping, Mandatory and Optional Subtype Relationships

Subtype relationships can be further classified based on their degree of exclusivity (overlapping vs non-overlapping) and their obligation (mandatory vs optional). Understanding these classifications is important for accurately modeling real-world scenarios and enforcing appropriate constraints in our database.  Let's examine these classifications in more detail, using Hogwarts-related examples:

### Overlapping vs Non-Overlapping:

**Overlapping subtypes** occur when an instance of the supertype can belong to two (or more) subtypes simultaneously. For example, let's consider "Role" as the supertype and "Student", "Club Member", and "Quidditch Player" as subtypes. A student at Hogwarts could be a member of a club and also play Quidditch. So, these subtypes are overlapping.
  
**Non-Overlapping (or disjoint) subtypes** occur when an instance of the supertype can belong to only one subtype. For instance, consider the supertype "Staff" and the subtypes "Professor" and "Administrator". A staff member can either be a professor or an administrator, but not both. Thus, these are non-overlapping subtypes.


### Mandatory vs Optional:

**Mandatory subtypes** require an instance of the supertype to be a member of at least one subtype. For example, suppose we have a supertype "Person" with subtypes "Student", "Faculty", and "Administrator". If we assume that every person at Hogwarts has a specific role, then it is mandatory for each "Person" to be a part of one of the subtypes.

**Optional subtypes** allow an instance of the supertype to not be a part of any subtype. For example, let's consider "Club Member" as a supertype, with "Dumbledore's Army Member" and "Slug Club Member" as subtypes. In this case, not all club members belong to these specific clubs, so these are optional subtypes.

It's important to choose the appropriate type of relationship according to the specific characteristics and constraints of the entities being modeled. This will ensure that the data model accurately represents the real-world scenario and enforces the correct business rules.

Here is a table illustrating the four combinations of subtype relationships:

|  | Overlapping | Non-Overlapping |
| --- | --- | --- |
| Mandatory | An instance of the supertype must be a member of at least one subtype, and it can belong to multiple subtypes. <br> For example, a "Magical Being" (supertype) at Hogwarts must be at least one of "Student", "Faculty", or "Staff" (subtypes), <br> and could fit into both the "Student" and "Staff" categories if they are a student who also works part-time at the school. <br> | An instance of the supertype must be a member of exactly one subtype. <br> For example, a "Staff Member" (supertype) at Hogwarts must be a "Professor", an "Administrator", or a "Groundskeeper" (subtypes), <br> but cannot hold more than one of these roles. |
| Optional | An instance of the supertype can belong to multiple subtypes or none at all.  <br> For example, a "Club Member" (supertype) at Hogwarts can be part of "Dumbledore's Army", <br> "Slug Club", both, or neither (subtypes).| An instance of the supertype can belong to one subtype or none at all.<br> For example, a "Person" (supertype) at Hogwarts can be a "Student", "Faculty", "Staff", <br> or none of these (subtypes) if they are a visitor or guest. |

Please note that these examples are not strictly real-world cases but simplified scenarios to illustrate the concepts. Real-world data models can often be more complex and may involve additional considerations and constraints.
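To make the four combinations concrete, here is a small sketch that looks at sample membership data and reports which cell of the table it falls into. The function name `classify_subtyping` and the numeric IDs are hypothetical. Note that it only describes what the data currently shows: data with no overlaps does not prove that the business rule forbids them.

```python
def classify_subtyping(supertype_ids, subtype_members):
    """Return (obligation, exclusivity) as exhibited by the given data."""
    counts = {sid: 0 for sid in supertype_ids}
    for members in subtype_members.values():
        for sid in members:
            counts[sid] += 1  # raises KeyError if a subtype row has no supertype row
    obligation = "mandatory" if all(n >= 1 for n in counts.values()) else "optional"
    exclusivity = "overlapping" if any(n > 1 for n in counts.values()) else "non-overlapping"
    return obligation, exclusivity


# Club members 1-4: member 2 is in both clubs, member 4 is in neither.
clubs = classify_subtyping(
    {1, 2, 3, 4},
    {"DumbledoresArmy": {1, 2}, "SlugClub": {2, 3}},
)

# Staff 10-12: everyone holds exactly one role.
staff = classify_subtyping(
    {10, 11, 12},
    {"Professor": {10}, "Administrator": {11}, "Groundskeeper": {12}},
)

print(clubs)  # ('optional', 'overlapping')
print(staff)  # ('mandatory', 'non-overlapping')
```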

## Visualizing the 'is-a' Relationship

The **'is-a' relationship** is a fundamental concept in object-oriented programming and database modeling. It is used to establish a hierarchy between classes or types, where a subtype is a specialized version of a supertype. Essentially, an 'is-a' relationship means that a subtype 'is a kind of' its supertype.

In the context of our Hogwarts case study, we could say a "Student" is a "Person," a "Professor" is a "Person," and an "Administrator" is a "Person." Here, "Person" is the supertype, and "Student," "Professor," and "Administrator" are its subtypes. This means that anything that is true of the supertype (like having a name and date of birth) is also true of its subtypes. It's a way of demonstrating inheritance – subtypes inherit characteristics from their supertype.

This 'is-a' relationship helps to create a structured, logical model of data that can handle complex queries, ensure data consistency, and streamline database operations.

To visualize these relationships, we use an extended version of Chen's notation:

1. The supertype entity is connected to the top point of an inverted triangle with a straight line, and each subtype entity is connected to one of the bottom points of the triangle. The triangle is labeled 'is-a' and signifies that the subtype 'is a kind of' the supertype.

2. For optional vs mandatory relationships, we can indicate these using dotted (optional) or solid (mandatory) lines, as we do for other sorts of relationships in Chen's notation.

3. Overlapping vs non-overlapping relationships can be indicated by placing 'o' (for overlapping) or 'd' (for disjoint or non-overlapping) inside the is-a triangle.

Remember, this triangle notation and the other enhancements are not part of Chen's original notation but are widely accepted extensions used to articulate more complex hierarchical relationships. By combining these symbols and the 'is-a' concept, we can create a comprehensive visual representation of our database structure.

![Hogwarts Subtype diagram](https://github.com/brendanpshea/database_sql/raw/main/images/Hogwarts_Subtype.svg)

## Implementing Subtypes In Modern Databases
The relational model focuses on tables (entities) and the relationships between them, and it does not natively support inheritance or type hierarchies. This is because the relational model was designed to be straightforward and performant, and complex features like inheritance can introduce ambiguity and performance overhead.

However, many modern SQL-based databases have introduced features that allow some level of **object-oriented** behavior, including inheritance and subtype relationships. This includes popular databases such as PostgreSQL and Oracle, which support table inheritance or object-oriented types, respectively.

It's crucial to understand that these features are extensions of the original SQL standard and the relational model, which means they are NOT supported by all SQL databases and will generally behave differently between databases. Use them with caution, keeping the specific requirements of your application and database system in mind.

In PostgreSQL, we can implement inheritance using the `INHERITS` clause when creating a table. Here's how we might model our Hogwarts IT case with PostgreSQL:

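```sql
%%sql

-- Supertype: attributes shared by every person
CREATE TABLE Person (
    PersonID SERIAL PRIMARY KEY,
    Name TEXT,
    DateOfBirth DATE
);

-- Subtypes: each inherits PersonID, Name, and DateOfBirth from Person
-- and adds its own column.
-- Note: in PostgreSQL, primary key and unique constraints are not
-- inherited by child tables.
CREATE TABLE Student (
    YearLevel INTEGER
) INHERITS (Person);

CREATE TABLE Faculty (
    Department TEXT
) INHERITS (Person);

CREATE TABLE Administrator (
    AdminRole TEXT
) INHERITS (Person);
```

```
 * postgresql+psycopg2://@/postgres
Done.
Done.
Done.
Done.
[]
```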
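Because `INHERITS` is specific to PostgreSQL, it can help to see a portable alternative. One widely used pattern gives each subtype its own table whose primary key is also a foreign key to the supertype. The sketch below illustrates that pattern under stated assumptions: it runs on SQLite through Python's `sqlite3` module, stores dates as text, and uses made-up sample values (including the year level).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves foreign keys off by default

conn.executescript("""
CREATE TABLE Person (
    PersonID    INTEGER PRIMARY KEY,
    Name        TEXT NOT NULL,
    DateOfBirth TEXT
);
-- Each subtype row shares its key with exactly one Person row.
CREATE TABLE Student (
    PersonID  INTEGER PRIMARY KEY REFERENCES Person (PersonID) ON DELETE CASCADE,
    YearLevel INTEGER
);
CREATE TABLE Faculty (
    PersonID   INTEGER PRIMARY KEY REFERENCES Person (PersonID) ON DELETE CASCADE,
    Department TEXT
);

INSERT INTO Person VALUES (1, 'Luna Lovegood', NULL), (2, 'Filius Flitwick', NULL);
INSERT INTO Student VALUES (1, 4);        -- sample year level
INSERT INTO Faculty VALUES (2, 'Charms');
""")

# "Inherited" attributes are recovered with a join instead of INHERITS.
students = conn.execute("""
    SELECT p.Name, s.YearLevel
    FROM Student AS s
    JOIN Person  AS p ON p.PersonID = s.PersonID
""").fetchall()

print(students)  # [('Luna Lovegood', 4)]
```

The trade-off is an extra join on every read, in exchange for SQL that behaves the same way across most relational databases.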

## Putting It All Together: Building a Comprehensive Hogwarts Database Model
As we've seen in our journey through Hogwarts' data needs, various aspects of the school's operation can be effectively modeled using different database submodels. We've explored submodels for Hogwarts Enrollment and IT Services, but the complexity of the Hogwarts universe would likely necessitate several more. These could include areas like Course Management, Library Services, Potion Inventory, Magical Creature Care, Quidditch Match Records, Staff Payroll, and the like. Each of these areas would have its own submodel capturing the unique entities, attributes, and relationships relevant to that domain.

When it comes to creating a comprehensive data model that encompasses the entirety of Hogwarts' operations, it's like piecing together a vast, intricate puzzle. Here are some steps and considerations involved in this process:

### 1\. Identify Key Entities Across Submodels:

At the heart of each submodel lie its defining entities and relationships. These form the core around which the submodel revolves. For example, in our Hogwarts Enrollment submodel, key entities included 'Students', 'Houses', 'Courses', and 'Clubs'. Similarly, the IT Services submodel had 'User', 'Student', 'Faculty', and 'Administrator' as central entities.

The first step in our comprehensive modeling journey is to recognize these key entities from each submodel. Further, we need to spot entities that are common to multiple submodels. For instance, 'Students' appear prominently in both our Enrollment and IT Services submodels. They might also feature in other submodels like Course Management (as enrollees), Library Services (as borrowers), and Quidditch Match Records (as players or spectators).

This step is critical as these common entities often serve as the link between different submodels, facilitating their integration into a larger, unified model.
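Spotting shared entities can be as simple as counting how many submodels mention each name. Here is a rough sketch, assuming entity names have already been normalized to the singular. The "Library Services" entry is a hypothetical submodel added only to make the example more interesting.

```python
from collections import Counter

submodels = {
    "Enrollment": {"Student", "House", "Course", "Club"},
    "IT Services": {"User", "Student", "Faculty", "Administrator"},
    "Library Services": {"Student", "Faculty", "Book"},  # hypothetical submodel
}

# Count the number of submodels in which each entity name appears.
usage = Counter(entity for entities in submodels.values() for entity in entities)

# Entities used by more than one submodel are candidates for integration points.
shared = sorted(entity for entity, n in usage.items() if n > 1)

print(shared)            # ['Faculty', 'Student']
print(usage["Student"])  # 3
```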

### 2\. Reconcile and Merge Common Entities:

Once common entities have been identified, the next task is to reconcile and merge them into a single, unified representation that can span across all submodels. This involves aligning attributes, constraints, and data types.

Consider 'Students' as an example. In the Enrollment submodel, we might have attributes like 'Student Name', 'Date of Birth', 'Guardian Name', 'House', etc. In the IT Services submodel, 'User Type' (with value 'Student'), 'Username', and 'Password' might be pertinent attributes for 'Students'. When merging these entities, we need to create a unified 'Student' entity that carries all these attributes.

Moreover, the data types and constraints of these attributes should align. If 'Student Name' is a string in one submodel and an integer in another, such inconsistencies need to be resolved.
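The reconciliation step can be sketched as a merge that keeps every attribute and reports any attribute whose data type disagrees between submodels. The helper name `merge_attributes` and the attribute dictionaries below are illustrative assumptions, with the 'Student Name' mismatch taken from the example above.

```python
def merge_attributes(*definitions):
    """Union attribute definitions and collect data type conflicts."""
    merged, conflicts = {}, {}
    for definition in definitions:
        for name, data_type in definition.items():
            if name in merged and merged[name] != data_type:
                # Keep the first type seen, but record every competing type.
                conflicts.setdefault(name, {merged[name]}).add(data_type)
            else:
                merged[name] = data_type
    return merged, conflicts


enrollment_student = {
    "StudentName": "TEXT",
    "DateOfBirth": "DATE",
    "GuardianName": "TEXT",
    "House": "TEXT",
}
it_student = {
    "StudentName": "INTEGER",  # the inconsistency that needs resolving
    "Username": "TEXT",
    "Password": "TEXT",
}

merged, conflicts = merge_attributes(enrollment_student, it_student)

print(sorted(merged))
# ['DateOfBirth', 'GuardianName', 'House', 'Password', 'StudentName', 'Username']
print(sorted(conflicts["StudentName"]))  # ['INTEGER', 'TEXT']
```

A conflict report like this does not decide which type is right. That remains a modeling decision for the people who know the data.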

### 3\. Link Related Entities:

In a large institution like Hogwarts, entities from distinct submodels can often be interrelated. It's vital to identify these connections and represent them accurately in our comprehensive data model.  

For instance, 'Professor' might be an entity in the Staff Payroll submodel, containing attributes like 'Salary', 'Pay Grade', etc. The same 'Professor' might also feature in the Course Management submodel, teaching various 'Courses'. Here, 'Professor' and 'Course' would be linked entities, and we would need to capture this relationship in our model.

Such connections across submodels add a layer of realism to our model, illustrating the multifaceted roles entities can play in different contexts within the same system.
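As a rough illustration of such a cross-submodel link, the sketch below connects a payroll-style `Professor` table to a `Course` table with a foreign key. It assumes, purely for simplicity, that each course has exactly one professor. The pay grade value and course titles are invented sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves foreign keys off by default

conn.executescript("""
-- From the Staff Payroll submodel
CREATE TABLE Professor (
    ProfessorID INTEGER PRIMARY KEY,
    Name        TEXT NOT NULL,
    PayGrade    TEXT
);
-- From the Course Management submodel; the foreign key is the link
CREATE TABLE Course (
    CourseID    INTEGER PRIMARY KEY,
    Title       TEXT NOT NULL,
    ProfessorID INTEGER NOT NULL REFERENCES Professor (ProfessorID)
);

INSERT INTO Professor VALUES (1, 'Pomona Sprout', 'P3');
INSERT INTO Course VALUES (10, 'Herbology, Year 1', 1), (11, 'Herbology, Year 2', 1);
""")

# One query now spans both submodels: payroll data plus teaching load.
teaching_load = conn.execute("""
    SELECT p.Name, p.PayGrade, COUNT(c.CourseID)
    FROM Professor AS p
    LEFT JOIN Course AS c ON c.ProfessorID = p.ProfessorID
    GROUP BY p.ProfessorID
""").fetchall()

print(teaching_load)  # [('Pomona Sprout', 'P3', 2)]
```

If several professors could share a course, the foreign key would be replaced by a separate linking table.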

### 4\. Implement Subtype/Supertype Structures If Needed:

In our quest to build an inclusive data model, we might encounter entities that can be further subdivided based on certain characteristics. In such cases, implementing subtype/supertype relationships can provide a more granular and accurate depiction of the data.

For instance, in our IT Services submodel, 'Users' was a generic entity, a supertype that could be any individual with access to Hogwarts' IT system. However, we recognized that the different types of users - 'Students', 'Faculty', and 'Administrators' - each had distinct attributes that were not shared by the others.

By implementing a subtype/supertype structure, we were able to create specific entities for each user type, each inheriting common attributes from the 'User' supertype, while also holding their unique attributes. This structure helped capture the complexities of the IT Services data more effectively and accurately.

### 5\. Review and Refine:

The construction of an integrated model marks a significant milestone, but our work is far from over. Now, we must review this preliminary model to ensure its accuracy, efficiency, and comprehensiveness.

During this review, we look out for redundancies and inconsistencies that may have slipped in during the merging process. For instance, if the 'Student' entity from the Enrollment submodel and the 'Student' subtype from the IT Services submodel contain duplicate attributes, such redundancy needs to be eliminated.

We also check for missing elements - entities, attributes, or relationships - that might have been overlooked or newly recognized during the integration process. For example, we might realize that our model lacks an entity for 'School Supplies' when incorporating the 'Student Shopping' submodel, prompting us to add this entity and its corresponding relationships.

This review stage might necessitate several revisions, which could include adding new attributes or entities, modifying relationships, or restructuring parts of the model to better reflect Hogwarts' operations. It's a cycle of constant refinement, where we iteratively improve the model until it aligns seamlessly with the reality of Hogwarts' multifaceted operations.

Building a comprehensive data model for a system as complex as Hogwarts is an ambitious endeavor. It might encapsulate hundreds of entities and relationships, reflecting every facet of the school's operations from Quidditch Match Records to Potion Ingredient Inventory. The secret lies in methodically deconstructing this vast system into manageable submodels and then carefully integrating these submodels into a coherent, logical, and accurate representation of the entire Hogwarts data universe.


## Exercise: Creating a Data Model for the Ministry of Magic

The Ministry of Magic, the governing body of the wizarding world in Britain, is a complex organization with many departments and functions. One of its primary roles is to enforce the law and maintain security in the wizarding world. For this exercise, let's focus on creating a data model that captures some of the Ministry's key operations.

Consider the following entities and their attributes:

1.  Department (DepartmentID, DepartmentName, Head, Budget)
2.  Employee (EmployeeID, Name, DateOfBirth, Position, Salary, DepartmentID)
3.  MagicalLaw (LawID, LawTitle, LawDescription, DepartmentID)
4.  Violation (ViolationID, ViolationDescription, LawID)
5.  Case (CaseID, EmployeeID, ViolationID, DateOpened, DateClosed, Status)

And the following business rules:

-   The Ministry has several Departments, each with a unique ID, name, head, and budget.
-   Each Department employs many Employees, each with a unique ID, name, date of birth, position, salary, and an associated department.
-   Each Department enforces several MagicalLaws, each with a unique ID, title, and description.
-   Each MagicalLaw can be linked to several Violations, each with a unique ID and description.
-   An Employee (who can be a 'Clerk', 'Auror', or 'Administrator') manages each Case, which tracks a specific Violation. A Case has a unique ID, an open date, a close date (if applicable), and a status.

Now, follow these steps to create a data model:

### Part 1: Identify Entities and Attributes

Start by listing each entity and its attributes. Note which attributes are unique (potential primary keys) and which attributes could be used to link entities (potential foreign keys).

### Part 2: Define Relationships

Define the relationships between entities. Use the business rules to guide you. For instance, a Department employs many Employees, indicating a one-to-many relationship.
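As a starting point only, the one-to-many relationship mentioned above might be sketched like this. Several attributes from the entity list are deliberately left out for you to add, the remaining entities are left as part of the exercise, and the sample rows are placeholders.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves foreign keys off by default

conn.executescript("""
CREATE TABLE Department (
    DepartmentID   INTEGER PRIMARY KEY,
    DepartmentName TEXT NOT NULL UNIQUE
);
-- The "many" side carries the foreign key to the "one" side.
CREATE TABLE Employee (
    EmployeeID   INTEGER PRIMARY KEY,
    Name         TEXT NOT NULL,
    Position     TEXT CHECK (Position IN ('Clerk', 'Auror', 'Administrator')),
    DepartmentID INTEGER NOT NULL REFERENCES Department (DepartmentID)
);

INSERT INTO Department VALUES (1, 'Department of Magical Law Enforcement');
INSERT INTO Employee VALUES
    (1, 'Sample Auror', 'Auror', 1),
    (2, 'Sample Clerk', 'Clerk', 1);
""")

headcount = conn.execute(
    "SELECT COUNT(*) FROM Employee WHERE DepartmentID = 1"
).fetchone()[0]

print(headcount)  # 2
```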

### Part 3: Implement Subtype/Supertype Structures

The 'Employee' entity could be further divided into subtypes based on the 'Position' attribute. Treat 'Employee' as the supertype and create subtypes for 'Clerk', 'Auror', and 'Administrator'. Each subtype would inherit the common attributes from 'Employee' while holding its own unique attributes.

### Part 4: Reconcile and Merge Common Entities

Examine all entities for potential overlap or redundancy. Ensure that each entity serves a unique purpose in the model.

### Part 5: Construct the ER Diagram

Use a tool like diagrams.net to create an ER diagram based on your data model. Make sure to accurately represent entities, attributes, relationships, and subtype/supertype structures. Your diagram should visually convey all the information outlined in the business rules.

As you complete each step, keep in mind that data modeling is an iterative process. Review and revise your model as needed to ensure it accurately captures the Ministry of Magic's operations. Good luck!

**Please include your final ER diagram below.**