<a href="https://colab.research.google.com/github/brendanpshea/database_sql/blob/main/Database_07_Advanced_Modeling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Advanced Modeling - Unfurling the Magic of Normalization and Subclass-Superclass Relationships
### Brendan Shea, PhD

Welcome to the next stage of our journey into the magical realm of database design and modeling. You've journeyed through the basics of database fundamentals and elementary modeling. Now, we take you further into the world of advanced modeling techniques, shining a light on the intricacies of normalization and subclass-superclass relationships.

In this chapter, we will draw upon the enchanting world of J.K. Rowling's Hogwarts School of Witchcraft and Wizardry to guide our exploration. Using a comprehensive example of Hogwarts' enrollment data, initially stored as a simple spreadsheet, we will illustrate how it can be methodically transformed into a robust, efficient database model.

The process of normalization, developed by Edgar F. Codd, will be introduced as our essential tool for removing redundancies and improving the structural integrity of our database. We'll discover the importance of achieving various 'normal forms', helping to prevent the calamitous effects of data anomalies and inconsistencies.

From the Great Hall to the classrooms of Hogwarts, we will encounter the ubiquitous presence of relationships in our data. Building upon your understanding of elementary relationships, we'll delve into the concept of subclass-superclass relationships, the very cornerstone of object-oriented database models. Drawing from the diverse personalities within Hogwarts, we'll see how these relationships play a pivotal role in creating a flexible and expressive database.

Understanding how to model these complex relationships correctly and implement appropriate constraints will be vital. They are the keys to ensuring the accuracy, consistency, and relevancy of the data housed within our database.

Throughout the chapter, we'll also touch upon concepts such as multi-valued attributes, transitive dependencies, one-to-many and many-to-many relationships, and more. We will consider the role of denormalization and indexing in database design, essential topics in the advanced design of databases.

By the end of this chapter, you'll not only have a comprehensive understanding of these advanced modeling techniques but also see their application in real-world contexts, making you equipped to tackle complex database design problems. So, let's journey together into the fascinating world of normalization and subclass-superclass relationships, weaving together the enchanting stories of Hogwarts with the practical magic of database design.

## Chapter Case Study: Hogwarts Enrollment Database

In this chapter, we're going to explore the fascinating world of Hogwarts School of Witchcraft and Wizardry through the lens of a database designer. We will utilize Hogwarts' enrollment data, initially captured in a spreadsheet, to understand the intricacies of advanced database modeling.

The initial spreadsheet used for enrollment contains the following ten attributes:

1.  Student Name
2.  Date of Birth
3.  Gender
4.  Guardian Name
5.  Guardian Contact Information
6.  Home Address
7.  House (Gryffindor, Hufflepuff, Ravenclaw, or Slytherin)
8.  Year Level (Year 1 through Year 7)
9.  Courses Enrolled
10. Extra-Curricular Clubs

A sample of this spreadsheet is as follows:




| Student Name | Date of Birth | Gender | Guardian Name | Guardian Contact Information | Home Address | House | Year Level | Courses Enrolled | Extra-Curricular Clubs |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Harry Potter | 31/07/1980 | Male | Sirius Black | <SiriusBlack@mail.com> | 4 Privet Drive, Little Whinging, Surrey | Gryffindor | Year 7 | Defence Against the Dark Arts, Potions, Herbology | Quidditch, Dumbledore's Army |
| Hermione Granger | 19/09/1979 | Female | Mr. and Mrs. Granger | <GrangerFamily@mail.com> | 17 Mill Lane, London | Gryffindor | Year 7 | Arithmancy, Potions, Transfiguration | Dumbledore's Army |
| Draco Malfoy | 05/06/1980 | Male | Lucius Malfoy | <LuciusMalfoy@mail.com> | Malfoy Manor, Wiltshire | Slytherin | Year 7 | Potions, Defence Against the Dark Arts, Transfiguration | Slytherin Quidditch Team |
| Luna Lovegood | 13/02/1981 | Female | Xenophilius Lovegood | <XenoLovegood@mail.com> | Lovegood House, near Ottery St. Catchpole, Devon | Ravenclaw | Year 6 | Divination, Herbology, Astronomy | Ravenclaw Quidditch Team, Dumbledore's Army |
| Cedric Diggory | 10/10/1977 | Male | Amos Diggory | <AmosDiggory@mail.com> | Diggory House, near Ottery St. Catchpole, Devon | Hufflepuff | Year 7 | Herbology, Charms, History of Magic | Hufflepuff Quidditch Team |

This sample data will provide a baseline for us to identify the existing issues with the spreadsheet format and demonstrate how we can transform it into a more efficient and structured database model.

### Problems With a "Flat File" Enrollment Spreadsheet

This data structure (known as a "flat file"), although straightforward and simple to use, has several inherent issues from a database design perspective:

1. *No obvious primary keys:* No single attribute uniquely identifies a record in the spreadsheet. For instance, 'Student Name' may seem a likely candidate, but there's always a possibility of having two students with the same name.

2. *Redundant data:* The spreadsheet repeats the same values across many rows. House names, course names, and club names are typed out again for every student, so a single misspelling or partial update can lead to anomalies and inconsistencies.

3. *Multi-valued attributes:* Some attributes, like 'Courses Enrolled' and 'Extra-Curricular Clubs', can have multiple values for a single student, which is problematic in a database context.

4. *Lack of structure for complex data:* The current format does not adequately handle complex relationships. For example, how do we handle course assignments for different teachers, or track a student's year level as it changes over time?

In this chapter, we'll take this unnormalized spreadsheet and transform it into a well-structured, efficient database model, capable of supporting Hogwarts in managing its magical student body effectively. We will apply normalization techniques and identify subclass-superclass relationships, addressing each issue present in the initial spreadsheet. By the end, the spreadsheet's transformation into a database will reveal the powerful magic of advanced database design techniques. Let's step into this enchanting journey!

## Why Normalize?
Normalizing data is like organizing a big, messy closet. Imagine you have a closet where you've thrown in clothes, shoes, hats, scarves, books, and even some snacks. Now, if you want to find your blue sweater or your favorite book, it's going to be quite a challenge, right? You'll have to sift through everything else to get what you need. Now imagine if you took some time to organize this closet, putting clothes in one area, books on a shelf, snacks in a corner, and shoes neatly lined up on the floor. It will be much easier to find what you're looking for, and there's less chance of pulling out a snack when you're looking for a hat!

This is pretty much what normalization does to data in a database. When data is not normalized, information can be scattered across various tables or can be repeated in many places. This can lead to a lot of problems:

- *Redundancy:* Just like our messy closet example, unorganized data can lead to unnecessary duplication. This not only takes up more storage space but can also lead to inconsistencies. Imagine if Hogwarts kept the house points in each student's record. If they update it for one student and forget to do it for others, it can lead to confusion.

- *Update Anomalies:* When data is repeated, updating it can become a nightmare. Imagine a professor changes their office location. If their office location is stored in every course they teach, that's a lot of places to make one change! With normalization, we would store professor information in one table, and any changes only need to be made in one place.

- *Insertion and Deletion Anomalies*: Without normalization, adding or removing data can be tricky. What if a new student joins Hogwarts, but they haven't been assigned a house yet? Where would we store their information if we only had a table for each house? With a separate "Students" table, we can add the student immediately and update their house assignment later.

Normalization, through its different forms (1NF, 2NF, 3NF, etc.), helps to solve these problems by dividing data into logical units where each piece of information is stored in one place.
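The update anomaly described above can be reproduced in a few lines. The following is a minimal sketch using Python's built-in `sqlite3` module and an in-memory database. The table names (`CoursesFlat`, `Professors`) and the office values are invented purely for illustration and are not part of the Hogwarts enrollment data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Unnormalized design (hypothetical): the professor's office is repeated on every course row.
conn.execute("CREATE TABLE CoursesFlat (Course TEXT PRIMARY KEY, Professor TEXT, Office TEXT)")
conn.executemany(
    "INSERT INTO CoursesFlat VALUES (?, ?, ?)",
    [
        ("Potions", "Severus Snape", "Dungeon 1"),
        ("Defence Against the Dark Arts", "Severus Snape", "Dungeon 1"),
    ],
)

# Someone updates the office on one course row and forgets the other.
conn.execute("UPDATE CoursesFlat SET Office = 'Tower 3' WHERE Course = 'Potions'")
flat_offices = conn.execute(
    "SELECT DISTINCT Office FROM CoursesFlat WHERE Professor = 'Severus Snape' ORDER BY Office"
).fetchall()
print(flat_offices)  # the same professor now appears to have two different offices

# Normalized design (hypothetical): the office is stored once, so one UPDATE is enough.
conn.execute("CREATE TABLE Professors (Professor TEXT PRIMARY KEY, Office TEXT)")
conn.execute("INSERT INTO Professors VALUES ('Severus Snape', 'Dungeon 1')")
conn.execute("UPDATE Professors SET Office = 'Tower 3' WHERE Professor = 'Severus Snape'")
normalized_offices = conn.execute(
    "SELECT Office FROM Professors WHERE Professor = 'Severus Snape'"
).fetchall()
print(normalized_offices)
```

In the flat table, the database has no way to know which office is correct. In the normalized table, there is only one place for the fact to live, so it cannot disagree with itself.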

## Normalization Applied to the Hogwarts Enrollment Spreadsheet

*Normalization* is the process of structuring a relational database in accordance with a series of so-called normal forms to reduce data redundancy and improve data integrity. Let's examine how we can apply normalization to our Hogwarts Enrollment Spreadsheet, starting with the first normal form (1NF), then progressing to the second normal form (2NF), and finally, the third normal form (3NF).

### First Normal Form (1NF):

The first normal form dictates that each column of a table must contain only **atomic (indivisible) values**, and there must be a primary key.

To conform our spreadsheet to 1NF, we first need to eliminate **multi-valued attributes** (that is, attributes that contain multiple values, such as a list of multiple courses). We'll split "Courses Enrolled" and "Extra-Curricular Clubs" into separate rows. Additionally, we'll introduce a unique Student ID for each student, as it's possible to have two students with the same name (and so, student name should NOT be used as a primary key).

For example, our 1NF tables might look like this:

*Students*

| Student ID | Student Name | Date of Birth | Gender | Guardian Name | Guardian Contact Information | Home Address | House | Year Level |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1001 | Harry Potter | 31/07/1980 | Male | Sirius Black | <SiriusBlack@mail.com> | 4 Privet Drive, Little Whinging, Surrey | Gryffindor | Year 7 |

*Courses*

| Student ID | Course |
| --- | --- |
| 1001 | Defence Against the Dark Arts |
| 1001 | Potions |
| 1001 | Herbology |

*Clubs*

| Student ID | Club |
| --- | --- |
| 1001 | Quidditch |
| 1001 | Dumbledore's Army |
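To make the "one value per cell" idea concrete, here is a small Python sketch of how a comma-separated "Courses Enrolled" cell could be split into one row per course. The helper name `split_multivalued` and the Student ID 1002 for Hermione are assumptions made for this example; only the ID 1001 appears in the tables above.

```python
# Hypothetical flat-file rows: (student_id, student_name, comma-separated courses)
flat_rows = [
    (1001, "Harry Potter", "Defence Against the Dark Arts, Potions, Herbology"),
    (1002, "Hermione Granger", "Arithmancy, Potions, Transfiguration"),
]


def split_multivalued(rows):
    """Turn each comma-separated course list into one (student_id, course) row."""
    result = []
    for student_id, _name, courses in rows:
        for course in courses.split(","):
            # strip() removes the space left behind after each comma
            result.append((student_id, course.strip()))
    return result


course_rows = split_multivalued(flat_rows)
for row in course_rows:
    print(row)
```

Each resulting tuple holds exactly one atomic course value, which is the shape of the *Courses* table shown above. The same approach would work for the clubs column.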

### Second Normal Form (2NF)

The second normal form requires that all non-key attributes for each table are fully dependent on the primary key. In particular, it means that if we have a composite key (that is, a key that is made up of two or more attributes), all the other attributes should depend on BOTH parts of this key. In our current tables, this isn't a problem: Students has a simple primary key, and although Courses and Clubs are identified by the combination of Student ID and Course (or Club), they contain no non-key attributes that could depend on only one part of that key.

Let's consider a different scenario where we have a table that doesn't conform to the 2NF. Suppose we have a table called "Course_Enrollment" where we store which professor teaches which course to which student. The primary key for this table is a composite key, composed of "Student ID" and "Course ID". We also store the professor's name who teaches that course.

**Course_Enrollment (Before 2NF)**
| Key | Attribute | Data Type |
| --- | --- | --- |
| PK | Student ID | Integer |
| PK | Course ID | Integer |
|  | Professor Name | String |

In this example, "Professor Name" depends only on the "Course ID", not on both parts of the composite key. This is a violation of the 2NF.  To correct this and bring the table to 2NF, we would need to split this table into two:

1.  Course_Enrollment - which stores which student takes which course. Here the primary key is a composite key of "Student ID" and "Course ID".

2.  Course_Professor - which stores which professor teaches which course. Here the primary key is "Course ID".

**Course_Enrollment (After 2NF)**

| Key | Attribute | Data Type |
| --- | --- | --- |
| PK, FK | Student ID | Integer |
| PK, FK | Course ID | Integer |

**Course_Professor (After 2NF)**

| Key | Attribute | Data Type |
| --- | --- | --- |
| PK, FK | Course ID | Integer |
|  | Professor Name | String |

Now, in both tables, all non-key attributes are fully dependent on their primary keys. We have successfully eliminated the partial dependency, and our tables now conform to the 2NF.

While our current tables do not have this issue (the tables that use composite keys have no non-key attributes), it is important to understand this concept, as more complex databases often require the use of composite keys.
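The 2NF split above can be sketched with Python's built-in `sqlite3` module. This is only an illustration under some assumptions: the column names are written without spaces, the professor names are placeholders, the sample IDs are invented, and the separate Courses table that Course ID would normally reference is left out to keep the example short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Professor Name depends only on Course ID, so it gets its own table.
CREATE TABLE Course_Professor (
    CourseID INTEGER PRIMARY KEY,
    ProfessorName TEXT NOT NULL
);

-- The enrollment table keeps only the composite key.
CREATE TABLE Course_Enrollment (
    StudentID INTEGER,
    CourseID INTEGER,
    PRIMARY KEY (StudentID, CourseID),
    FOREIGN KEY (CourseID) REFERENCES Course_Professor(CourseID)
);

-- Placeholder sample data
INSERT INTO Course_Professor VALUES (1, 'Professor A'), (2, 'Professor B');
INSERT INTO Course_Enrollment VALUES (1001, 1), (1001, 2), (1002, 1);
""")

# A join recovers the original three-column view without storing the name twice.
rows = conn.execute("""
    SELECT e.StudentID, e.CourseID, p.ProfessorName
    FROM Course_Enrollment AS e
    JOIN Course_Professor AS p ON p.CourseID = e.CourseID
    ORDER BY e.StudentID, e.CourseID
""").fetchall()
for row in rows:
    print(row)
```

Nothing is lost by the split: the join reproduces every row of the "before" table, but each professor's name is now stored once per course rather than once per enrolled student.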

### Third Normal Form (3NF):

The third normal form dictates that all non-key attributes must depend only on the primary key, not on other non-key attributes (no **transitive dependencies**).

Here, a transitive dependency is a type of database dependency where a non-key attribute depends on another non-key attribute, which depends on the primary key of the table. In simpler terms, if there are three attributes A, B, and C, a transitive dependency is a condition where A → B and B → C, which indirectly creates a dependency of A → C (C depends on A through B). This is a violation of the third normal form (3NF).

For our Hogwarts data, we do not have transitive dependencies in our current tables. However, if we decided to add additional information like "House Head" or "House Points" related to the house each student belongs to, we would then have a transitive dependency (Student ID → House → House Head / House Points). To satisfy 3NF, we would then create a separate table for "Houses".

*Houses*

| House | House Head | House Points |
| --- | --- | --- |
| Gryffindor | Minerva McGonagall | 6720 |

Now our Hogwarts enrollment data is normalized to the third normal form. Through normalization, we have minimized data redundancy, and the database structure can now handle complex queries more efficiently. As Hogwarts' enrollment continues to grow, the benefits of normalization will be increasingly apparent, proving essential for effectively managing student data.
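As a rough illustration of why the separate Houses table helps, the sketch below (Python's built-in `sqlite3`, in-memory database) stores House Head and House Points once per house and looks them up with a join. The Students table is trimmed to three columns, and the second student's ID (1002) and the ten-point award are assumptions for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Houses (
    House TEXT PRIMARY KEY,
    HouseHead TEXT,
    HousePoints INTEGER
);

-- Trimmed-down Students table: House is a foreign key, not a place to store house facts.
CREATE TABLE Students (
    StudentID INTEGER PRIMARY KEY,
    StudentName TEXT NOT NULL,
    House TEXT REFERENCES Houses(House)
);

INSERT INTO Houses VALUES ('Gryffindor', 'Minerva McGonagall', 6720);
INSERT INTO Students VALUES
    (1001, 'Harry Potter', 'Gryffindor'),
    (1002, 'Hermione Granger', 'Gryffindor');
""")

# Awarding points touches exactly one row, however many students are in the house.
conn.execute("UPDATE Houses SET HousePoints = HousePoints + 10 WHERE House = 'Gryffindor'")

rows = conn.execute("""
    SELECT s.StudentName, h.HouseHead, h.HousePoints
    FROM Students AS s
    JOIN Houses AS h ON h.House = s.House
    ORDER BY s.StudentID
""").fetchall()
for row in rows:
    print(row)
```

Both students see the same updated total because the value lives in a single row of Houses, which is exactly the point of removing the transitive dependency.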

## "The Key, the Whole Key, and Nothing But the Key, So Help Me Codd"

A common mnemonic for remembering Codd's three normal forms is **"The Key, The Whole Key, and Nothing But The Key, So Help Me Codd."** Here's the basic idea:

1.  **"The Key"** This corresponds to the First Normal Form (1NF). In 1NF, each column of a table must contain only atomic (indivisible) values, and there must be a primary key. This means that there should be no duplicate rows, and each attribute (column) must hold a single value rather than a list of values. Thus, the "Key" part of the mnemonic refers to the uniqueness and singularity of the data in each row and column, which is defined by the primary key.

2.  **"The Whole Key"** This corresponds to the Second Normal Form (2NF). 2NF dictates that every non-key attribute (field) must be functionally dependent on the whole key, meaning that all information in a table should pertain to the concept that the primary key represents. In particular, if the primary key is composite (made up of multiple fields), no non-key attribute should depend on only part of the composite key.

3.  **"Nothing But The Key"** This corresponds to the Third Normal Form (3NF). The phrase "Nothing but the Key" refers to the rule that no non-key attribute should depend on other non-key attributes, only on the primary key. This is designed to eliminate transitive dependencies (where A depends on B, and B depends on C, leading to an indirect dependency of A on C).

This mnemonic covers the first three normal forms, which are typically sufficient for most database normalization tasks. It's important to note, however, that there are further forms of normalization, including the Boyce-Codd Normal Form (BCNF), Fourth Normal Form (4NF), and Fifth Normal Form (5NF), that deal with more specific and less common issues.


However, there's a point where we may decide to "stop" normalizing. Higher levels of normalization (beyond 3NF) can lead to a very large number of small tables. This can make the database harder to understand and queries more complex. In some cases, it might even affect performance.

## Hogwarts Data After Normalization
After the data has been normalized, we are left with something like the following. (Note: There are multiple ways of successfully "normalizing" the data, and there might well be *better* ways of doing this!).




![Hogwarts Enrollment ER Diagram](https://github.com/brendanpshea/database_sql/raw/main/images/hogworts_enroll.png)



Here is the SQL code:

In [1]:
# First, we create a database and connect to it
!pip install SQLAlchemy==1.3.24 -q # Needed to avoid problems with more recent versions in Colab

%load_ext sql
%sql sqlite:///hogwarts.db



In [2]:
%%sql

-- Create the Houses table first, as it's referenced by other tables
CREATE TABLE Houses (
    House VARCHAR(50) PRIMARY KEY,
    HouseHead VARCHAR(100),
    HousePoints INT CHECK (HousePoints >= 0)  -- Constraint to ensure points can't be negative
);

-- Create the Students table, referencing the Houses table
CREATE TABLE Students (
    StudentID INT PRIMARY KEY,
    StudentName VARCHAR(100) NOT NULL,
    DateOfBirth DATE NOT NULL,
    Gender VARCHAR(10) CHECK (Gender IN ('Male', 'Female', 'Other')),  -- Constraint for valid genders
    GuardianName VARCHAR(100),
    GuardianContactInformation VARCHAR(100),
    HomeAddress VARCHAR(200),
    House VARCHAR(50),
    YearLevel INT CHECK (YearLevel BETWEEN 1 AND 7),  -- Hogwarts has 7 year levels
    FOREIGN KEY (House) REFERENCES Houses(House)  -- References House in Houses table
);

-- Create the Courses table, referencing the Students table
CREATE TABLE Courses (
    StudentID INT,
    Course VARCHAR(100),
    PRIMARY KEY (StudentID, Course),  -- Composite primary key
    FOREIGN KEY (StudentID) REFERENCES Students(StudentID)  -- References StudentID in Students table
);

-- Create the Clubs table, referencing the Students table
CREATE TABLE Clubs (
    StudentID INT,
    Club VARCHAR(100),
    PRIMARY KEY (StudentID, Club),  -- Composite primary key
    FOREIGN KEY (StudentID) REFERENCES Students(StudentID)  -- References StudentID in Students table
);


 * sqlite:///hogwarts.db
Done.
Done.
Done.
Done.


[]
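One detail worth knowing when trying out constraints like these: SQLite does not enforce foreign keys by default, so a `FOREIGN KEY` clause only takes effect once enforcement is switched on for the connection. The sketch below uses Python's built-in `sqlite3` module with an in-memory database (rather than the `%%sql` magic and `hogwarts.db`) and a trimmed-down copy of the Houses and Students tables. The helper `try_insert`, the "Test Student" rows, and the house name "Durmstrang" are made up for the demonstration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves foreign key enforcement off by default

conn.executescript("""
CREATE TABLE Houses (
    House VARCHAR(50) PRIMARY KEY,
    HouseHead VARCHAR(100),
    HousePoints INT CHECK (HousePoints >= 0)
);

-- Trimmed copy of Students: only the columns needed to exercise the constraints
CREATE TABLE Students (
    StudentID INT PRIMARY KEY,
    StudentName VARCHAR(100) NOT NULL,
    House VARCHAR(50),
    YearLevel INT CHECK (YearLevel BETWEEN 1 AND 7),
    FOREIGN KEY (House) REFERENCES Houses(House)
);
""")

conn.execute("INSERT INTO Houses VALUES ('Gryffindor', 'Minerva McGonagall', 6720)")
conn.execute("INSERT INTO Students VALUES (1001, 'Harry Potter', 'Gryffindor', 7)")


def try_insert(sql):
    """Run an INSERT and report whether a constraint rejected it."""
    try:
        conn.execute(sql)
        return "accepted"
    except sqlite3.IntegrityError as exc:
        return f"rejected: {exc}"


# House that does not exist in Houses: blocked by the foreign key
bad_house = try_insert("INSERT INTO Students VALUES (1002, 'Test Student', 'Durmstrang', 3)")
# Year level outside 1..7: blocked by the CHECK constraint
bad_year = try_insert("INSERT INTO Students VALUES (1003, 'Test Student', 'Gryffindor', 8)")

print(bad_house)
print(bad_year)
student_count = conn.execute("SELECT COUNT(*) FROM Students").fetchone()[0]
print(student_count)
```

Only the valid row survives. If the `PRAGMA` line is removed, the insert with the unknown house would typically be accepted, which is an easy way to see that the check is not happening automatically.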

## Exercise: Normalizing the Hogwarts Library Inventory

### The Scenario

Imagine that the Hogwarts Library maintains a spreadsheet to keep track of their book inventory. Each row represents a book, and there are the following columns:

1.  Book Name
2.  Book Author
3.  Publish Year
4.  Categories (as books can belong to more than one category, e.g., "Herbology, Potion-Making")
5.  Availability (whether the book is currently available or checked out)
6.  Current Holder (the person who has the book if it's checked out)
7.  Borrow Dates (the dates that the book has been borrowed)

Here is an example of what five rows might look like:

| Book Name | Book Author | Publish Year | Categories | Availability | Current Holder | Borrow Dates |
| --- | --- | --- | --- | --- | --- | --- |
| Herbology at Home | Phyllida Spore | 1983 | Herbology, Home Studies | Checked Out | Neville Longbottom | 12/02/2023, 20/06/2023 |
| Hogwarts: A History | Bathilda Bagshot | 1950 | History, Magic | Available | - | 22/01/2023 |
| Quidditch Through the Ages | Kennilworthy Whisp | 1952 | Sports, Magic | Checked Out | Harry Potter | 15/01/2023, 07/03/2023, 01/06/2023 |
| Magical Water Plants of the Mediterranean | Hadrian Whittle | 1971 | Herbology, Travel | Available | - | - |
| Magical Hieroglyphs and Logograms | Bathilda Bagshot | 1964 | Runes, Magic | Available | - | 28/02/2023 |

Notice that there's no clear primary key, and the Categories and Borrow Dates columns have multi-valued attributes.

* * * * *

#### Your Task

1.  First Normal Form (1NF): Convert this data to the first normal form (1NF) by eliminating multi-valued attributes and introducing a unique identifier for each book.

2.  Second Normal Form (2NF): Then convert your 1NF tables to the second normal form (2NF). Are there any columns that are not fully functionally dependent on the primary key? How would you resolve this?

3.  Third Normal Form (3NF): Finally, ensure that your 2NF tables are in the third normal form (3NF). Are there any columns whose values depend on other non-key columns? If so, how would you resolve this?

Take some time to work through each of these steps, sketching out your tables and thinking about how to handle the multi-valued and dependent attributes at each stage.

## Answer: Normalizing the Hogwarts Library

1. (1NF)




2. (2NF)




3. (3NF)


4. Create the Tables in SQL (Below)

In [None]:
%%sql
--Create the Hogwarts Library Tables in SQL

## Subtypes and Supertypes - A Practical Dive
Having mastered normalization, it's time to explore another crucial aspect of advanced database modeling - Subtypes and Supertypes. This concept is at the heart of modeling more complex real-world relationships in our databases.

In the simplest terms, a Supertype is a generic entity type that has a relationship with one or more Subtypes, which are more specific entity types. Essentially, we have a hierarchy where the Supertype sits at the top, and Subtypes represent categories or types of the Supertype. There are common attributes in the Supertype, and each Subtype may contain additional attributes specific to it. This relationship often comes into play in object-oriented programming but is just as critical in structuring our databases effectively.

To illustrate this, let's consider the case of Hogwarts School of Witchcraft and Wizardry again. Imagine the IT staff at Hogwarts. They have a daunting task - they need to manage accounts for different types of users: students, faculty, and administrators. Each of these user types has common attributes (like Name, Password, Email), but they also have unique attributes.

For instance, a student has a Year Level and a House, faculty members have Subjects Taught, and administrators have a Job Role. Without Subtype and Supertype relationships, we would have redundancy in our database, and changes would be cumbersome.

By the end of this section, you'll have a solid understanding of Subtypes and Supertypes and be equipped to structure databases even more effectively, whether you're managing a school, a bookstore, or a magical institution like Hogwarts!

## What are Supertypes and Subtypes?
Supertypes and Subtypes are foundational concepts in both object-oriented programming and database design. They help us model real-world scenarios more accurately and efficiently by capturing the hierarchical relationships between different entities.

A **supertype** is a generic entity type that has a relationship with one or more **subtypes**, which are more specific categories of the Supertype. In other words, a Supertype is a broad classification, while Subtypes are its more specialized versions.

To better understand this concept, let's look at some examples from our Hogwarts case study:

1. *Person as a Supertype:* At Hogwarts, many entities could be classified broadly as "Persons". This category could have attributes like Name, Date of Birth, and Address. However, there are different kinds of "Persons" at Hogwarts, such as Students, Professors, and Administrators. These more specific categories are the Subtypes, each having unique attributes in addition to those of the Person Supertype. For example, Students might have additional attributes like House, Year Level, and Courses, while Professors have Subjects Taught, and Administrators have Job Role.

2. *Course as a Supertype:* A "Course" is another generic entity at Hogwarts with attributes such as Course Name and Course Description. Subtypes might be Mandatory Courses and Optional Courses. Mandatory Courses may have an additional attribute like Year Level (indicating which year level the course is compulsory for), and Optional Courses might have Pre-requisite Courses as an additional attribute.

3. *Room as a Supertype:* Hogwarts has many rooms with generic attributes like Room Number and Floor. However, the rooms serve different purposes, leading to Subtypes like Classrooms, Offices, and Dormitories. A Classroom might have additional attributes like Course Taught, an Office might be associated with a specific Faculty Member, and a Dormitory could have Capacity and Associated House as additional attributes.

In each example, the Supertype has attributes common to all Subtypes, and each Subtype has attributes specific to it. This approach avoids unnecessary redundancy and offers a more structured and efficient way to model our data.
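One common way to carry this idea into relational tables is to give the Supertype its own table and have each Subtype table reuse the Supertype's primary key as both its primary key and a foreign key. The sketch below shows that pattern for the Person example using Python's built-in `sqlite3` module. It is only one of several possible mappings, and the table names, the `PersonID` column, and the sample rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforce the subtype-to-supertype links

conn.executescript("""
-- Supertype: attributes shared by every kind of person
CREATE TABLE Persons (
    PersonID INTEGER PRIMARY KEY,
    Name TEXT NOT NULL,
    DateOfBirth TEXT
);

-- Subtype: shares the supertype's key and adds student-only attributes
CREATE TABLE Students (
    PersonID INTEGER PRIMARY KEY REFERENCES Persons(PersonID),
    House TEXT,
    YearLevel INTEGER CHECK (YearLevel BETWEEN 1 AND 7)
);

-- Subtype: administrator-only attributes
CREATE TABLE Administrators (
    PersonID INTEGER PRIMARY KEY REFERENCES Persons(PersonID),
    JobRole TEXT
);

-- Illustrative sample rows
INSERT INTO Persons VALUES (1, 'Harry Potter', '1980-07-31'), (2, 'Argus Filch', NULL);
INSERT INTO Students VALUES (1, 'Gryffindor', 7);
INSERT INTO Administrators VALUES (2, 'Caretaker');
""")

# Joining a subtype to the supertype gives the full picture of a student.
student_rows = conn.execute("""
    SELECT p.Name, s.House, s.YearLevel
    FROM Persons AS p
    JOIN Students AS s ON s.PersonID = p.PersonID
""").fetchall()
print(student_rows)

# The shared attributes for everyone are available from the supertype alone.
all_names = [name for (name,) in conn.execute("SELECT Name FROM Persons ORDER BY PersonID")]
print(all_names)
```

The shared columns live only in Persons, so adding a new common attribute means changing one table, while each Subtype table stays free of columns that do not apply to it.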