## Introduction to relations and relational algebra
CS 236 <br>
Fall 2025

Michael A. Goodrich and Eric G. Mercer<br>
Brigham Young University <br>
March 2023, Updated Oct 2024 and Sept 2025 <br><br>
This tutorial includes some input from ChatGPT-5 in response to prompts on (a) how to introduce parameterized tests in pytest and (b) how to start thinking about test coverage using input partitioning
***

## 0. Overview

This notebook will:
- Review both the **mathematical definition** of a relation and the definition used in **relational databases**
- Demonstrate implementations of relations and relational operators in Python
- Show how to check whether two relations are equal using **dunder** methods
- Review the difference between **copying** data vs **referencing** it in Python
- Describe how to construct and use **parameterized unit tests**
- Discuss **test coverage** via **input partitioning**


### 0.1 Before you Begin

- Follow the steps in the **Developer Setup** instructions in the Project 3 `README`.
- Tell VS Code to use the Python interpreter you created in your virtual environment.

---

## 1. What is a Relation

The textbook defines a relation as a set of mathematical tuples:

* $R\subseteq A\times B$
* $R=\{(a,1),(b,2),(c,3),(d,4)\}$

In relational databases, the same relation is represented as a table (with a header) instead of as a set:

| `char` | `int` | 
| :-: | :-: | 
| $a$ | $1$ |
| $b$ | $2$ | 
| $c$ | $3$ | 
| $d$ | $4$ | 

The table __header__ contains the **attributes** associated with each element of the tuple. You can think of the attributes specified in the header as assigning names to each column. For example, the first column in the relation above contains elements from the set `char` and the second column contains elements from the set `int`. The header contains information about the sets from which the Cartesian product is formed, $R\subseteq$ `char` $\times$ `int`.

Below the header are rows containing __tuples__. The tuples are the elements of the __set__ that defines the relation, $R=\{(a,1),(b,2),(c,3),(d,4)\}$.
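
As a rough illustration of the statement that $R$ is a subset of a Cartesian product, the sketch below builds the product with `itertools.product` and checks the subset relationship. The finite sets `chars` and `ints` are hypothetical stand-ins for the sets `char` and `int`; they are assumptions made only for this example.

```python
from itertools import product

# Hypothetical finite stand-ins for the sets `char` and `int`
chars = {"a", "b", "c", "d"}
ints = {1, 2, 3, 4}

# The Cartesian product contains every possible pairing (4 * 4 = 16 tuples)
cartesian = set(product(chars, ints))

# The relation R is one particular subset of that product
R = {("a", 1), ("b", 2), ("c", 3), ("d", 4)}

print(len(cartesian))   # 16
print(R <= cartesian)   # True: every tuple in R is also in the product
```

The `<=` operator on Python sets tests "is a subset of", which mirrors the $\subseteq$ symbol.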

The order of elements in a set doesn't matter:

* $\{(a,1),(b,2),(c,3),(d,4)\} = \{(d,4),(b,2),(a,1),(c,3)\}$

Similarly, the order of the rows in the table doesn't matter (except for the header) because the rows represent the set of tuples. For example, the following table represents the same relation as the table above.

| `char` | `int` | 
| :-: | :-: | 
| $d$ | $4$ | 
| $a$ | $1$ |
| $b$ | $2$ | 
| $c$ | $3$ | 
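
Python's built-in `set` behaves the same way, which is why it is a natural container for the rows of a relation. The following minimal sketch (the variable names are chosen just for this example) compares the rows of the two tables above:

```python
# Rows of the first table and rows of the second table, written in different orders
rows_first_table = {("a", 1), ("b", 2), ("c", 3), ("d", 4)}
rows_second_table = {("d", 4), ("a", 1), ("b", 2), ("c", 3)}

# Sets ignore order, so the two collections are equal
print(rows_first_table == rows_second_table)          # True

# Lists are ordered, so the same rows in a different order are not equal
print([("a", 1), ("b", 2)] == [("b", 2), ("a", 1)])   # False

# Sets also discard duplicates, just like mathematical sets
print(len({("a", 1), ("a", 1)}))                      # 1
```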


***

## 2. Encoding a Relation as a Python Class

Let's construct a class that represents a relation.  The class members need to encode the parts of the table. Thus, there needs to be a class member for the header and another class member for the set of tuples. To help with the discussion, we'll also include the name of the relation in the code below, but you won't see this in the starter code for Project 3. Thus, the relation class will include:
- The relation name as a `str`. 
- The relation header will be defined as a list of strings: `list[str]`
- Each tuple in the relation will be defined as a new type called `RelationTuple`
  - The `RelationTuple` will be built using Python's default `tuple` type.  
  - The class created below allows both integers and strings to be the elements of a tuple, so look for the vertical bar that says the tuple types can be either an int or a str yielding `RelationTuple = tuple[str | int]`
  - We won't specify how many elements will be in the `RelationTuple`, so we'll use Python's `...` to indicate that a tuple can have any number of elements, yielding `RelationTuple = tuple[str | int, ...]`
- A helper `__str__` method that returns a string in the correct format. This method will use a table formatting tool in Python called `tabulate`, which should have been installed when you executed `pip install --editable ".[dev]"`.

In [1]:
from tabulate import tabulate  # requires the tabulate module

RelationTuple = tuple[str | int, ...]
        
class Relation:
    def __init__(self, relation_name: str, 
                 relation_header: list[str], 
                 set_of_tuples: set[RelationTuple]) -> None:
        self.name:str = relation_name # I threw in a string for the relation name to help with the discussion
        self.header: list[str] = relation_header
        self.set_of_tuples: set[RelationTuple] = set_of_tuples

    def __str__(self) -> str:
        value: str = f"relation name = {self.name}\n" + tabulate(iter(self.set_of_tuples), self.header, tablefmt="fancy_grid")
        return value
        
    
R: Relation = Relation(relation_name = "R",relation_header = ('char','int'), set_of_tuples = {('a',1),('b',2),('c',3),('d',4)})
print(R)

Q: Relation = Relation('Q',('C','D'),set())  # set() is an empty set; {} would be an empty dict
print(Q)

# The following line would cause the __str__ method to crash:
# W: Relation = Relation('W',('A'), {1, 2, 3})
# The problem is that tabulate expects each row to be an iterable (such as a tuple), and a bare int is not iterable.
# Each row must be a tuple. The trailing comma is explained in Section 3.2.
W: Relation = Relation('W',('A',), {(1,), (2,), (3,)})
print(W)


relation name = R
╒════════╤═══════╕
│ char   │   int │
╞════════╪═══════╡
│ d      │     4 │
├────────┼───────┤
│ b      │     2 │
├────────┼───────┤
│ a      │     1 │
├────────┼───────┤
│ c      │     3 │
╘════════╧═══════╛
relation name = Q
╒═════╤═════╕
│ C   │ D   │
╞═════╪═════╡
╘═════╧═════╛
relation name = W
╒═════╕
│   A │
╞═════╡
│   1 │
├─────┤
│   2 │
├─────┤
│   3 │
╘═════╛


The top row in each of the tables above consists of the relation header. The remaining rows represent the set of tuples contained in the relation. Each line contains a unique tuple. The order of the tuples in the set of tuples doesn't matter since sets are not ordered.

---

#### 2.1 The `__eq__` Dunder Method

A **dunder method** is any method in Python that begins and ends with **d**ouble **under**scores. There are several dunder methods defined each time a class is defined. Sometimes, we need to change the default definition of the dunder methods that are automatically created so that they behave like we want. It will be necessary to do this for the dunder method `__eq__` because the default implementation does not correctly check whether two relations are equal. 

Consider whether you think the following statement that checks whether two relations are equal should be true or false.

In [2]:
R2: Relation = Relation(relation_name = "R",relation_header = ('char','int'), set_of_tuples = {('a',1),('b',2),('c',3),('d',4)})
print(R)
print(R2)
print(f"The default __eq__ method says the statement R==R2 is {R==R2}")

relation name = R
╒════════╤═══════╕
│ char   │   int │
╞════════╪═══════╡
│ d      │     4 │
├────────┼───────┤
│ b      │     2 │
├────────┼───────┤
│ a      │     1 │
├────────┼───────┤
│ c      │     3 │
╘════════╧═══════╛
relation name = R
╒════════╤═══════╕
│ char   │   int │
╞════════╪═══════╡
│ d      │     4 │
├────────┼───────┤
│ b      │     2 │
├────────┼───────┤
│ a      │     1 │
├────────┼───────┤
│ c      │     3 │
╘════════╧═══════╛
The default __eq__ method says the statement R==R2 is False


Mathematically, $R=R2$ since (a) the headers are equal and (b) the two relations have the same set of tuples. The default implementation of `__eq__` says that `R` does not equal `R2` because it **doesn't check what's inside the relations**; instead, it **checks whether `R` and `R2` are two names for the same object**. When we print the `id` of each instance, we see that the objects have different identities (in CPython, the `id` is the object's memory address). Since the ids don't match, the default `__eq__` method says the two relations are different.

In [3]:
print(id(R))
print(id(R2))

281472726344400
281472714996192
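
The identity-based behavior is not specific to `Relation`. As a small sketch, the hypothetical class `Box` below (not part of the starter code) does not define `__eq__`, so `==` falls back to comparing object identity:

```python
class Box:
    """A hypothetical class that does not define __eq__."""

    def __init__(self, value: int) -> None:
        self.value = value


first = Box(1)
second = Box(1)   # same contents, but a different object
alias = first     # a second name for the same object

print(first == second)  # False: default __eq__ compares identity, not contents
print(first == alias)   # True: both names refer to one object
print(first is alias)   # True: `is` always compares identity
```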


Thus, we need to define an equality operator that checks the contents of the relation (both header and set of tuples) rather than just the id's of the two relations. A good discussion of how to do this can be found here: 

https://stackoverflow.com/questions/1227121/compare-object-instances-for-equality-by-their-attributes

Let's redefine the class to define the `__eq__` function.

In [4]:
from typing import Any

class Relation:
    def __init__(self, relation_name: str, relation_header: list[str], set_of_tuples: set[RelationTuple]) -> None:
        self.name:str = relation_name # I threw in a string for the relation name for fun
        self.header: list[str] = relation_header
        self.set_of_tuples: set[RelationTuple] = set_of_tuples

    def __str__(self) -> str:
        value: str = f"relation name = {self.name}\n" + tabulate(iter(self.set_of_tuples), self.header, tablefmt="fancy_grid")
        return value

    def __eq__(self, other: Any) -> bool:
        if not isinstance(other, Relation):
            print("you are trying to compare a relation to something that is not a relation")
            return False
        return (
            self.header == other.header
            and self.set_of_tuples == other.set_of_tuples
        )

    
R: Relation = Relation(relation_name = "R",relation_header = ('char','int'), set_of_tuples = {('a',1),('b',2),('c',3),('d',4)})
R2: Relation = Relation(relation_name = "R2",relation_header = ('char','int'), set_of_tuples = {('b',2),('a',1),('c',3),('d',4)})
print(f"The revised __eq__ method says the statement R==R2 is {R==R2}")

The revised __eq__ method says the statement R==R2 is True


Defining the `__eq__` function within the class allows us to use the == operator to check whether the contents of the two relations are the same. Notice a few things about the `__eq__` method:

- Two arguments are passed to the `__eq__` method: `self` and `other`. 
  - `self` refers to the instance of the relation class
  - `other` refers to whatever the relation can be compared to. The code above uses the `Any` type to say that we can compare anything to the relation.
- The first line of the method tests whether the `other` object passed in is a `Relation`. If it's not, the method returns `False`.
- The `__eq__` method returns the boolean expression that tests whether both headers match as well as whether the set of tuples match.
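
The two comparisons inside `__eq__` behave differently with respect to order and container type. The sketch below uses plain Python containers (the variable names are chosen only for this example) to show what each comparison is sensitive to:

```python
header_a = ["char", "int"]
header_b = ["int", "char"]
rows_a = {("a", 1), ("b", 2)}
rows_b = {("b", 2), ("a", 1)}

# Sets ignore order, so the row comparison succeeds
print(rows_a == rows_b)                        # True

# Lists are ordered, so swapping the attributes gives a different header
print(header_a == header_b)                    # False

# A tuple never equals a list, even with the same elements in the same order
print(("char", "int") == ["char", "int"])      # False
```

The last line is worth keeping in mind: a relation whose header was passed in as a tuple will not compare equal to one whose header was passed in as a list unless the constructor converts both to the same type.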

***


#### 2.2 Copy vs Reference (Aliasing)

In Python, variables **hold references** to objects. Assigning one variable to another **does not make a new copy**. Both names then refer to the **same object** (aliasing), so mutating (i.e., changing) the object through one name is visible through the other name. We can demonstrate the shared reference using Python strings.


In [5]:
string1: str = "Happy"
string2 = string1

print("Demonstrating the difference between copying and referencing")
print(f"string1 = {string1}")
print(f"string2 = {string2}")
print(f"The id for string 1 is {id(string1)}")
print(f"The id for string 2 is {id(string2)}")
print(f"Copying string1 to string2 means string1 == string2 is {string1 == string2}")


Demonstrating the difference between copying and referencing
string1 = Happy
string2 = Happy
The id for string 1 is 281472714997312
The id for string 2 is 281472714997312
Copying string1 to string2 means string1 == string2 is True


It is easy to miss the fact that `string2 = string1` just assigns the reference because of an important property of strings: they are immutable in Python. Look at what happens when we append more text to `string2`.

In [6]:

# The following line creates a new string object by concatenating " Birthday" to string2
string2 += " Birthday"  
print(f"After modifying string2, string1 = {string1}")
print(f"After modifying string2, string2 = {string2}")
print(f"The id for string 1 is {id(string1)}")
print(f"The id for string 2 is {id(string2)}")
print(f"Modifying string2 means string1 == string2 is {string1 == string2} because modifying string2 created a new string object")


After modifying string2, string1 = Happy
After modifying string2, string2 = Happy Birthday
The id for string 1 is 281472714997312
The id for string 2 is 281472715137200
Modifying string2 means string1 == string2 is False because modifying string2 created a new string object


Since strings are immutable, `string2 += " Birthday"` cannot change the original string. Instead, it creates a new string object and rebinds the name `string2` to it, leaving `string1` untouched.

Let's see what happens when we copy a relation. We'll create a copy and print the `id` of both the original and the copy.

In [7]:
R: Relation = Relation(relation_name = "R",relation_header = ('char','int'), set_of_tuples = {('a',1),('b',2),('c',3),('d',4)})
Q: Relation = R

print(id(R))
print(id(Q))

281472726349008
281472726349008


Notice that the two objects have the same `id`. The problem is that plain assignment creates a **reference** to the original object rather than a copy. This means that if we modify the "copy" we also modify the original. We can see this by overwriting the header of the copy and then printing the header of both the original and the copy.

In [8]:
print(f'Header of R before modification = {R.header}')
print(f'Header of Q before modification = {Q.header}')

Q.header = ['bart', 'lisa']

print(f'Header of R after modification = {R.header}')
print(f'Header of Q after modification = {Q.header}')


Header of R before modification = ('char', 'int')
Header of Q before modification = ('char', 'int')
Header of R after modification = ['bart', 'lisa']
Header of Q after modification = ['bart', 'lisa']


Notice how both the copy and the original are modified. This is because the assignment created a second reference to the original relation, so when we modified the object through that reference we also modified the original.

Python's `deepcopy` creates a new instance of the object, including new copies of the objects it contains.

In [9]:
from copy import deepcopy

R: Relation = Relation(relation_name = "R",relation_header = ('char','int'), set_of_tuples = {('a',1),('b',2),('c',3),('d',4)})
Q: Relation = deepcopy(R)

print(id(R))
print(id(Q))

print(f'Header of R before modification = {R.header}')
print(f'Header of Q before modification = {Q.header}')

Q.header = ['bart', 'lisa']

print(f'Header of R after modification = {R.header}')
print(f'Header of Q after modification = {Q.header}')

281472767155024
281472715161728
Header of R before modification = ('char', 'int')
Header of Q before modification = ('char', 'int')
Header of R after modification = ('char', 'int')
Header of Q after modification = ['bart', 'lisa']


When we copy relations in Project 3, we intend to modify the new copy. Thus, we'll want to do a deep copy.

**Rules of thumb**
- If you plan to **mutate**, decide whether that mutation should be **shared** (reference) or **isolated** (copy).
- For **immutable** types (e.g., `int`, `str`, `tuple` of immutables), aliasing is harmless because the object itself cannot be mutated.
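
To make the rules of thumb concrete, here is a minimal sketch that contrasts an alias, a shallow copy, and a deep copy. The dictionary `original` is a hypothetical stand-in for an object that holds a mutable header and a set of rows; it is not the `Relation` class itself.

```python
from copy import copy, deepcopy

# Hypothetical stand-in for an object holding a mutable header and a set of rows
original = {"header": ["char", "int"], "rows": {("a", 1)}}

alias = original            # same object under a second name
shallow = copy(original)    # new dict, but it shares the inner list and set
deep = deepcopy(original)   # new dict with its own inner list and set

# Mutate the header list through the alias
alias["header"].append("x")

print(original["header"])   # ['char', 'int', 'x']  (shared with alias)
print(shallow["header"])    # ['char', 'int', 'x']  (inner list is still shared)
print(deep["header"])       # ['char', 'int']       (fully isolated)
```

A shallow copy isolates only the outermost container, which is why `deepcopy` is the safer choice when the copy will be mutated.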


---


## 3. Notes on the Starter Code

There are a few things in the starter code that have been confusing to students in previous semesters. 

#### 3.1  Creating New Initialized Lists

**Note that some of the information shared here was culled from Copilot in response to a prompt on differences in initializing member variables in the starter code.**

Starter code is provided for Project 3. It differs subtly from the definition of the class above. In the code above, assigning `relation_header` to the instance attribute `self.header` is done by

`self.header: list[str] = relation_header`

but in the starter code the assignment is 

`self.header = list(relation_header)`

What's the difference? Let's experiment with some variables outside of the definition of the Relation class. 

In [10]:
relation_header: list[str] = ['A', 'B', 'C']
header_v1: list[str] = relation_header
header_v2 = list(relation_header)

print(f"The original list has id = {id(relation_header)}")
print(f"The assignment operator in version 1 has id = {id(header_v1)}")
print(f"The assignment operator in version 2 has id = {id(header_v2)}")

The original list has id = 281472714952384
The assignment operator in version 1 has id = 281472714952384
The assignment operator in version 2 has id = 281472715012288


Notice that the original list and the first version of the header have the same id, but the original list and the second version of the header have different ids. Stated simply,
 - the assignment `header_v1: list[str] = relation_header` copies a *reference* (i.e., both names point to the same list object)
 - the assignment `header_v2 = list(relation_header)` creates a brand new list object

You can tell that the second assignment creates a brand new list object because it uses the same call syntax that we use whenever we create a new object in Python. The call `list(stuff)` creates a new instance of a list object, where `stuff` initializes the list object. We can confirm that a new list object is created by asking Python about the built-in `list` class, as follows:

In [11]:
help(list)

Help on class list in module builtins:

class list(object)
 |  list(iterable=(), /)
 |
 |  Built-in mutable sequence.
 |
 |  If no argument is given, the constructor creates a new empty list.
 |  The argument must be an iterable if specified.
 |
 |  Methods defined here:
 |
 |  __add__(self, value, /)
 |      Return self+value.
 |
 |  __contains__(self, key, /)
 |      Return bool(key in self).
 |
 |  __delitem__(self, key, /)
 |      Delete self[key].
 |
 |  __eq__(self, value, /)
 |      Return self==value.
 |
 |  __ge__(self, value, /)
 |      Return self>=value.
 |
 |  __getattribute__(self, name, /)
 |      Return getattr(self, name).
 |
 |  __getitem__(self, index, /)
 |      Return self[index].
 |
 |  __gt__(self, value, /)
 |      Return self>value.
 |
 |  __iadd__(self, value, /)
 |      Implement self+=value.
 |
 |  __imul__(self, value, /)
 |      Implement self*=value.
 |
 |  __init__(self, /, *args, **kwargs)
 |      Initialize self.  See help(type(self)) for accurate sign
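
The practical consequence of the two kinds of assignment shows up as soon as the original list is mutated. The sketch below reuses the variable names from the experiment above:

```python
relation_header = ["A", "B", "C"]
header_v1 = relation_header         # alias: a second name for the same list
header_v2 = list(relation_header)   # copy: a brand new list with the same items

# Mutate the original list after both assignments
relation_header.append("D")

print(header_v1)   # ['A', 'B', 'C', 'D']  (the alias sees the change)
print(header_v2)   # ['A', 'B', 'C']       (the copy does not)

# list(...) also converts other iterables, such as a tuple, into a list
print(list(("char", "int")))   # ['char', 'int']
```

Copying in the constructor means that later changes to the caller's list cannot silently change the header stored inside the object.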

---

#### 3.2 Creating relations with one-ples or empty relations ####

A tuple can have only a single element in it, what Goodrich jokingly calls a "one-ple" in class. There is a tricky problem that comes up when we try to create a tuple with only a single element. Consider the following. We want to create a relation 

| `char` | 
| :-: | 
| a | 

but when we call the constructor for the `Relation` class we get a different relation


In [12]:

# The trailing commas are missing here, so ('char') and ('a') are plain strings rather than one-element tuples.
R: Relation = Relation('R',('char'),{('a')})
print(R)


relation name = R
╒═════╕
│ c   │
╞═════╡
│ a   │
╘═════╛


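Notice how only the "c" from the attribute name "char" is printed out. Without a trailing comma, `('char')` is just the string `'char'` in parentheses, so the header is treated as a sequence of characters rather than as a single attribute name. This can be a difficult bug to diagnose.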

It gets worse if we try to create a relation with no tuples in the set. We intend to create the relation

| `char` | 
| :-: | 

but calling the constructor with an empty set of tuples gives

In [13]:
Q: Relation = Relation('Q',('char'),set()) # Without a trailing comma, ('char') is a plain string, so each character becomes its own column
print(Q)

relation name = Q
╒═════╤═════╤═════╤═════╕
│ c   │ h   │ a   │ r   │
╞═════╪═════╪═════╪═════╡
╘═════╧═════╧═════╧═════╛


That's not what we intended at all. We can get around this by adding a comma after the single element of the tuple (after both the 'char' and the 'a')...

In [14]:
P: Relation = Relation('P',('char',),{('a',)}) # To create a tuple with only one item, you have to add a comma after the item; otherwise Python will not treat it as a tuple.
print(P)


relation name = P
╒════════╕
│ char   │
╞════════╡
│ a      │
╘════════╛


... and by adding a single comma after the header when we create a relation with the set of tuples empty

In [15]:
Q: Relation = Relation('Q',('char',),set()) # The trailing comma makes ('char',) a one-element tuple, so the header has a single column
print(Q)


relation name = Q
╒════════╕
│ char   │
╞════════╡
╘════════╛


A good LLM prompt to understand this issue better is

     Why do I have to add a comma after a tuple with only one element in Python? 
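
You can also ask Python directly. This short sketch checks the type of each literal used in the examples above:

```python
# Parentheses alone do not make a tuple; the comma does
print(type(("char")))    # <class 'str'>
print(type(("char",)))   # <class 'tuple'>

# Iterating over a string yields its characters, which explains the c/h/a/r columns
print(list("char"))      # ['c', 'h', 'a', 'r']

# Empty braces create a dict, not a set; use set() for an empty set
print(type({}))          # <class 'dict'>
print(type(set()))       # <class 'set'>
```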

--- 

## 4. Adding relational algebra operators to the Relation class

We are now in a position to start filling in the rest of the Relation class.

I want to be able to apply the _relational operators_ to any relation or pair of relations. Using good object-oriented programming style, I'll add the relational operators as methods of the class.

Let's begin by creating our own exception class, a la the starter code.


In [16]:
class IncompatibleOperandError(Exception):
    def __init__(self, msg: str) -> None:
        super().__init__(msg)
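
Here is a minimal sketch of how such an exception might be raised and caught. The helper `require_same_header` is hypothetical and exists only for this illustration; it is not part of the starter code.

```python
class IncompatibleOperandError(Exception):
    def __init__(self, msg: str) -> None:
        super().__init__(msg)


def require_same_header(left: list[str], right: list[str]) -> None:
    """Hypothetical helper: raise if the two headers differ."""
    if left != right:
        raise IncompatibleOperandError(f"headers differ: {left} vs {right}")


caught_message = ""
try:
    require_same_header(["char", "int"], ["char", "str"])
except IncompatibleOperandError as err:
    # str(err) returns the message passed to the constructor
    caught_message = str(err)

print(caught_message)   # headers differ: ['char', 'int'] vs ['char', 'str']
```

Using a dedicated exception type lets callers (and tests) distinguish "these operands cannot be combined" from unrelated errors such as a `TypeError`.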

#### 4.1 Union

Consider the relation R defined as before
| char | int | 
| :-: | :-: | 
| $a$ | $1$ |
| $b$ | $2$ | 
| $c$ | $3$ | 
| $d$ | $4$ | 

and consider a new relation Q defined as
| char | int | 
| :-: | :-: | 
| $f$ | $3$ |

The union $R\cup Q$ is possible since the headers match. "Doing the math" on this union yields
| char | int | 
| :-: | :-: | 
| $a$ | $1$ |
| $b$ | $2$ | 
| $c$ | $3$ | 
| $d$ | $4$ | 
| $f$ | $3$ |  

Let's check in the code.

In [17]:
class Relation:
    def __init__(self, relation_name: str, relation_header: list[str], set_of_tuples: set[RelationTuple]) -> None:
        self.name:str = relation_name # I threw in a string for the relation name for fun
        self.header: list[str] = relation_header
        self.set_of_tuples: set[RelationTuple] = set_of_tuples

    def __str__(self) -> str:
        value: str = f"relation name = {self.name}\n" + tabulate(iter(self.set_of_tuples), self.header, tablefmt="fancy_grid")
        return value

    def __eq__(self, other: Any) -> bool:
        if not isinstance(other, Relation):
            print("you are trying to compare a relation to something that is not a relation")
            return False
        return (
            self.header == other.header
            and self.set_of_tuples == other.set_of_tuples
        )
    
    ########################
    # Relational Operators #
    ########################
    def union(self,other) -> 'Relation':
        if not isinstance(other, Relation):
            raise IncompatibleOperandError(f"Tried to union a relation with a {type(other)}") # don't attempt to union with something not a relation
        # First, check the precondition to see if the headers are the same
        if self.header != other.header:
            raise IncompatibleOperandError("Tried to union two relations with different headers")
        
        # Second, create a new relation whose set of tuples is the union of the two sets of tuples
        name: str = self.name + "\u222A" + other.name
        header: list[str] = self.header
        set_of_tuples = self.set_of_tuples
        set_of_tuples = set_of_tuples.union(other.set_of_tuples)    # This is the union operator defined for set objects
        return Relation(name,header,set_of_tuples) # Create a new relation
        


In [18]:
    
R = Relation(relation_name = 'R',relation_header = ('char','int'), set_of_tuples = {('a',1),('b',2),('c',3),('d',4)})
Q = Relation('Q',('char','int'),{('f',3),}) # The trailing comma after ('f',3) is optional: {('f',3)} is already a set containing exactly one tuple

P: Relation = R.union(Q)
print(P) # I can see that the relation name is a union when I print it out

relation name = R∪Q
╒════════╤═══════╕
│ char   │   int │
╞════════╪═══════╡
│ d      │     4 │
├────────┼───────┤
│ c      │     3 │
├────────┼───────┤
│ f      │     3 │
├────────┼───────┤
│ b      │     2 │
├────────┼───────┤
│ a      │     1 │
╘════════╧═══════╛
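
The `union` method above relies on the union operation that Python defines for `set` objects. As a quick sketch with plain sets (variable names chosen just for this example), note that `set.union` returns a new set and leaves both operands unchanged:

```python
left = {("a", 1), ("b", 2)}
right = {("b", 2), ("f", 3)}

# set.union returns a new set; the | operator does the same thing
combined = left.union(right)

print(len(combined))                    # 3: the shared tuple ('b', 2) appears once
print(combined == (left | right))       # True
print(left == {("a", 1), ("b", 2)})     # True: left was not modified
```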


#### 4.2 Error Handling Unit Tests

Let's write unit tests that check two types of errors:
- The error caused by trying to take the union of a relation and something that is not a relation (i.e., a string)
- The error caused by trying to take the union of two relations that have different headers

I gave ChatGPT-5 the specifics of the two errors that I wanted to check and had it generate the tests:

In [19]:
import ipytest
import pytest

ipytest.autoconfig()


In [20]:
%%ipytest -qq
def test_union_raises_when_other_is_not_relation() -> None:
    # Step 1: Instantiate a valid Relation P
    P: Relation = Relation("P", ["bart", "lisa"], {("h", 1), ("m", 2)})

    #######################################
    ## Step 2: Specify invalid input      ##
    #######################################
    not_a_relation = "mystring"

    #################################################
    ## Step 3: Expected: raise IncompatibleOperandError
    #################################################
    # The message includes the type of the other operand (<class 'str'>)
    with pytest.raises(IncompatibleOperandError, match=r"union a relation with a <class 'str'>"):
        P.union(not_a_relation)


def test_union_raises_when_headers_differ() -> None:
    # Step 1: Instantiate two Relations with different headers
    P = Relation("P", ["bart", "lisa"], {("h", 1), ("m", 2)})
    W = Relation("W", ["bart", "maggie"], {("f", 3)})  # header differs in second column

    #################################################
    ## Step 2: Expected: raise IncompatibleOperandError
    #################################################
    with pytest.raises(IncompatibleOperandError, match=r"different headers"):
        P.union(W)

[32m.[0m[32m.[0m[32m                                                                                           [100%][0m


Both tests should pass by raising the correct errors.
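
To build intuition for what `pytest.raises(..., match=...)` checks, here is a rough plain-Python sketch of the idea. The helper `check_raises` is hypothetical and is not how `pytest` is implemented; the one detail taken from `pytest` is that the `match` pattern is tested against the string form of the exception with `re.search`.

```python
import re
from typing import Callable


def check_raises(func: Callable[[], None], expected_type: type[Exception], pattern: str) -> bool:
    """Hypothetical helper mimicking the idea behind pytest.raises(..., match=...)."""
    try:
        func()
    except expected_type as err:
        # The pattern only needs to match somewhere in the message
        return re.search(pattern, str(err)) is not None
    return False  # no exception was raised at all


def bad_union() -> None:
    raise ValueError("Tried to union two relations with different headers")


print(check_raises(bad_union, ValueError, r"different headers"))   # True
print(check_raises(bad_union, ValueError, r"not a relation"))      # False
print(check_raises(lambda: None, ValueError, r"anything"))         # False
```

In a real test, `pytest` reports a failure instead of returning `False`, but the two conditions are the same: the right exception type must be raised, and its message must match the pattern.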

#### 4.3 Parameterized Unit Tests

There are times when we want to write several tests for a specific function or class. It can become tedious to write a function for each test that we want to perform. A common way around this is to use parameterized tests. 

Instead of writing many tests that are nearly identical, we can write **one test** and feed it **many cases** using `@pytest.mark.parametrize`. This improves readability and coverage with less code. Notice the spelling of `parametrize`. The verb form of **parameter** is usually spelled **parameterize**, but the `pytest` decorator for parameterized tests uses the shorter spelling `parametrize`.

When we "do the math" we identify several cases that could cause problems. 
- Disjoint sets
- Overlapping sets
- Union with empty set
- Idempotence: $A \cup A = A$
- Commutativity: $A \cup B = B \cup A$

Let's set up `pytest` inside of the Jupyter notebook before looking at how we'd write parameterized tests.

In [21]:
import ipytest
import pytest

ipytest.autoconfig()

In [22]:
%%ipytest
@pytest.mark.parametrize(
    "hdr,rows1,name1,rows2,name2,expected_rows",
    [
        # Disjoint
        (["A","B"], {(1,"a"), (2,"b")}, "R1",
                     {(3,"c")},            "R2",
                     {(1,"a"), (2,"b"), (3,"c")}),

        # Overlap (duplicates should be removed by set semantics)
        (["A","B"], {(1,"a"), (2,"b")}, "R1",
                     {(2,"b"), (3,"c")}, "R2",
                     {(1,"a"), (2,"b"), (3,"c")}),

        # Left empty
        (["A","B"], set(),              "Empty",
                     {(1,"a")},         "R2",
                     {(1,"a")}),

        # Right empty
        (["A","B"], {(1,"a")},          "R1",
                     set(),             "Empty",
                     {(1,"a")}),

        # Idempotent: R ∪ R = R
        (["A","B"], {(10,"x"), (11,"y")}, "R",
                     {(10,"x"), (11,"y")}, "R_again",
                     {(10,"x"), (11,"y")}),
    ],
)
def test_union_basic_cases(hdr, rows1, name1, rows2, name2, expected_rows):
    r1 = Relation(name1, hdr, rows1)
    r2 = Relation(name2, hdr, rows2)

    out = r1.union(r2)

    # Header must match original header (by precondition)
    assert out.header == hdr
    # Row set must match expected
    assert out.set_of_tuples == expected_rows
    # Equality should work via __eq__
    assert out == Relation("EXPECTED", hdr, expected_rows)



[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m                                                                                        [100%][0m


[32m[32m[1m5 passed[0m[32m in 0.01s[0m[0m


#### 4.4 What’s Happening in the Parameterized Tests?

The `@pytest.mark.parametrize` **decorator** is a way to tell `pytest`:  
“Run this same test function many times, once for each row in the table of inputs.” You don't need to understand how **decorators** work, but they are a neat programming concept that allows you to modify the way functions work. A good LLM will be able to provide a tutorial about decorators.

Let’s unpack what’s going on in `test_union_basic_cases`:

```python
@pytest.mark.parametrize(
    "hdr,rows1,name1,rows2,name2,expected_rows",
    [
        (["A","B"], {(1,"a"), (2,"b")}, "R1",
                     {(3,"c")},          "R2",
                     {(1,"a"), (2,"b"), (3,"c")}),   # disjoint

        (["A","B"], {(1,"a"), (2,"b")}, "R1",
                     {(2,"b"), (3,"c")}, "R2",
                     {(1,"a"), (2,"b"), (3,"c")}),   # overlap

        ...
    ],
)
def test_union_basic_cases(hdr, rows1, name1, rows2, name2, expected_rows):
    ...
```

- The first argument to `parametrize` is a comma-separated string of variable names.  
  These become the parameters to the test function (`hdr, rows1, name1, ...`).

- The second argument is a **list of tuples**.  
  Each tuple is one set of inputs that fills in those variables.

- When pytest runs:
  - It substitutes the first tuple into the variables and runs the test once.  
  - Then it substitutes the second tuple and runs again.  
  - And so on for every row.  

This way, one test function actually checks **five different scenarios**:
1. Disjoint relations  
2. Overlapping relations (testing duplicate elimination)  
3. Left operand empty  
4. Right operand empty  
5. Idempotence ($R \cup R = R$)  


**Why Do This?**

Without parameterization, we would need to write **five almost-identical test functions**. That makes tests longer, harder to read, and more likely to miss a case.  

By using `parametrize`, we:
- Capture the essence of what changes across cases (the data, not the logic).  
- Keep the test logic short and clear.  
- Guarantee consistent checks across many scenarios.  


**Key takeaway:** Parameterized tests let you explore a whole *family* of inputs with one concise piece of code.
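
Conceptually, a parameterized test resembles a loop over a table of cases. The plain-Python sketch below illustrates that idea using ordinary sets; the names `cases` and `check_union` are hypothetical and chosen only for this example.

```python
# Each entry is (rows1, rows2, expected_rows), like one row of the parametrize table
cases = [
    ({(1, "a"), (2, "b")}, {(3, "c")}, {(1, "a"), (2, "b"), (3, "c")}),            # disjoint
    ({(1, "a"), (2, "b")}, {(2, "b"), (3, "c")}, {(1, "a"), (2, "b"), (3, "c")}),  # overlap
    (set(), {(1, "a")}, {(1, "a")}),                                               # left empty
]


def check_union(rows1: set, rows2: set, expected_rows: set) -> None:
    # The same check is applied to every case
    assert rows1.union(rows2) == expected_rows


for rows1, rows2, expected_rows in cases:
    check_union(rows1, rows2, expected_rows)

print(f"{len(cases)} cases passed")   # 3 cases passed
```

One practical difference is that this loop stops at the first failing case, whereas `pytest` runs and reports each parameterized case as its own test.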

#### 4.5 More Parameterized Tests for Union

Let's write a parameterized test that checks that $A \cup B = B \cup A$. In the test, we'll check relations with tuples of size 2, tuples with only one element, and relations that have identical tuples. 

In [23]:
%%ipytest
@pytest.mark.parametrize(
    "hdr,rows1,rows2",
    [
        (["A","B"], {(1,"a"), (2,"b")}, {(2,"b"), (3,"c")}),
        (["A"],      {(1,)},            {(1,), (2,)}),
        (["X","Y"], {(0,"p")},          {(0,"p")}),
    ],
)
def test_union_commutative(hdr, rows1, rows2):
    r1 = Relation("R1", hdr, rows1)
    r2 = Relation("R2", hdr, rows2)

    # Commutativity: r1 ∪ r2 == r2 ∪ r1
    out1 = r1.union(r2)
    out2 = r2.union(r1)
    assert out1 == out2

[32m.[0m[32m.[0m[32m.[0m[32m                                                                                          [100%][0m
[32m[32m[1m3 passed[0m[32m in 0.01s[0m[0m


Parameterized tests are a convenient way of checking the math on several test cases while only having to write a single testing function. The key idea is to use the `@pytest.mark.parametrize` decorator, which defines the different parameters to be tested by the testing function. Each row in the list enclosed by the square brackets `[ row1, row2 ]` represents a different set of testing values.

***


## The Project Relational Operator

We'll implement only part of this operator here, since you will implement the rest in Project 3. Specifically, this version of `project` can only project onto a single column.

Consider the relation $P = R\cup Q$ defined as before:

| `char` | `int` | 
| :-: | :-: | 
| $a$ | $1$ |
| $b$ | $2$ | 
| $c$ | $3$ | 
| $d$ | $4$ |
| $f$ | $3$ | 

We want to compute $\pi_{char}(P)$, which does two things. 
* First, it creates a new relation.
* Second, it populates the new relation with the `char` column.

The result is the relation $\pi_{char}(P)$:

| `char` | 
| :-: |
| $a$ |
| $b$ |
| $c$ |
| $d$ |
| $f$ |

***
Here's the code

In [24]:
class Relation:
    def __init__(self, relation_name: str, relation_header: list[str], set_of_tuples: set[RelationTuple]) -> None:
        self.name:str = relation_name # I threw in a string for the relation name for fun
        self.header: list[str] = relation_header
        self.set_of_tuples: set[RelationTuple] = set_of_tuples

    def __str__(self) -> str:
        value: str = f"relation name = {self.name}\n" + tabulate(iter(self.set_of_tuples), self.header, tablefmt="fancy_grid")
        return value

    def __eq__(self, other: Any) -> bool:
        if not isinstance(other, Relation):
            print("you are trying to compare a relation to something that is not a relation")
            return False
        return (
            self.header == other.header
            and self.set_of_tuples == other.set_of_tuples
        )
    
    ########################
    # Relational Operators #
    ########################
    def union(self,other) -> 'Relation':
        if not isinstance(other, Relation):
            raise IncompatibleOperandError(f"Tried to union a relation with a {type(other)}") # don't attempt to union with something not a relation
        # First, check the precondition to see if the headers are the same
        if self.header != other.header:
            raise IncompatibleOperandError("Tried to union two relations with different headers")
        
        # Second, create a new relation whose set of tuples is the union of the two sets of tuples
        name: str = self.name + "\u222A" + other.name
        header: list[str] = self.header
        set_of_tuples = self.set_of_tuples
        set_of_tuples = set_of_tuples.union(other.set_of_tuples)    # This is the union operator defined for set objects
        return Relation(name,header,set_of_tuples) # Create a new relation

    def project(self,column_header: str) -> 'Relation':
        # Only a portion of this function is implemented. 
        # Specifically, only the portion that projects onto a single column.
        # You have to implement the rest of this function in Project 3
        
        # First, check the precondition
        # The precondition for the project operator is that the column attribute
        # must exist in the set of attributes
        if column_header not in set(self.header):
            raise IncompatibleOperandError(f"Tried to project onto {column_header}, which isn't an attribute of the relation")
        
        # Second, create a new relation that is the output of the projection operator.
        # The relation needs a name, a header, and the set of tuples
        new_name = "\u03C0" + "_{" + column_header + "}(" + self.name + ")" # \u03C0 is the Unicode code point for the Greek letter pi, the projection symbol
        new_header = (column_header,) # Notice the comma after "column_header", which forces the header to be a tuple
        header_index = self.header.index(column_header) # Get the index of the header that matches the column you want
        new_tuples: set[RelationTuple] = {(rtuple[header_index],) for rtuple in self.set_of_tuples}
        new_relation = Relation(new_name,new_header,new_tuples) # Create the relation
        return new_relation 


#### 5.1 Doing the Math

Let's look at the output of the `project` operator using simple tests. For simplicity, we won't put these simple tests into `pytest` so that we can just print out the relation. 

In [25]:
R: Relation = Relation(relation_name = 'R',relation_header = ('char','int'), set_of_tuples = {('a',1),('b',2),('c',3),('d',4)})
Q: Relation = Relation('Q',('char','int'),{('f',3),}) # The trailing comma after the ('f',3) tuple is optional; {('f',3)} is already a set whose lone element is the tuple

P: Relation = R.union(Q)
M: Relation = P.project('char')
print(M)


relation name = π_{char}(R∪Q)
╒════════╕
│ char   │
╞════════╡
│ a      │
├────────┤
│ b      │
├────────┤
│ f      │
├────────┤
│ d      │
├────────┤
│ c      │
╘════════╛


Let's project onto the `int` column instead. Why does projecting onto the `char` column produce five tuples while projecting onto the `int` column produces only four? (The order of the rows in the printed tables is arbitrary because the tuples are stored in a set.)

In [26]:
P = R.union(Q)
M = P.project('int')
print(M)

relation name = π_{int}(R∪Q)
╒═══════╕
│   int │
╞═══════╡
│     2 │
├───────┤
│     3 │
├───────┤
│     4 │
├───────┤
│     1 │
╘═══════╛


The answer is that a relation stores its tuples in a set, and a set never contains repeated elements.

If projection kept repeated elements, projecting the relation $P$ defined as

| `char` | `int` | 
| :-: | :-: | 
| $a$ | $1$ |
| $b$ | $2$ | 
| $c$ | $3$ | 
| $d$ | $4$ |
| $f$ | $3$ | 

onto the `int` column would yield $\pi_{int}(P)$:

| `int` | 
| :-: | 
| $1$ |
| $2$ | 
| $3$ | 
| $4$ |
| $3$ |

but the tuple $(3)$ would then appear twice. Because a set keeps only one copy of each element, the correct answer is $\pi_{int}(P)$:

| `int` | 
| :-: | 
| $1$ |
| $2$ | 
| $3$ | 
| $4$ |


#### 5.2 Test Coverage via Input Partitioning

The unit tests in Project 1 and Project 2 were based on the principle that you should write a few positive tests and at least one negative test for each method or class. That principle can be strengthened using the idea of **test coverage**, which is that the tests that we write should cover a wide range of possible input values so that we are more confident that the thing we are testing works in a real situation.

One way to systematically design tests that produce high **test coverage** is to **partition the input space** into meaningful categories and then pick at least one test case from each partition. The `project` operator creates a new relation by removing columns from an existing relation.

For the `project` operator, we can identify categories of inputs that might cause problems. The categories we chose range from relations with few tuples and few columns to relations with more tuples and more columns.

**Partitions to cover:**
- empty relation with two columns projected onto one column
- nonempty with two columns projected onto one column, and the resulting relation doesn't have repeat elements
- nonempty with two columns projected onto one column, and the resulting relation would have repeat elements if not written correctly
- nonempty with three columns projected onto one column
- nonempty with three columns projected onto two columns and the two columns are side-by-side in the original relation
- nonempty with three columns projected onto two columns and the two columns are not by each other in the original relation
- projection that keeps all columns
- error case where there is an invalid column name to which we are trying to project (i.e., header mismatch)
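
One hypothetical way to arrive at a list like this is to name the dimensions along which the inputs vary, enumerate their combinations, and then prune the combinations that are meaningless or redundant. The sketch below does that with `itertools.product`. The dimension names and the `is_meaningful` filter are assumptions made for illustration and are not part of the project code. The partitions listed above can be read as a hand-picked selection along the same dimensions.

```python
from itertools import product

# Hypothetical dimensions along which inputs to `project` can vary
row_kinds = ["empty", "no repeats after projecting", "repeats after projecting"]
header_widths = [2, 3]
targets = [
    "one column",
    "two adjacent columns",
    "two non-adjacent columns",
    "all columns",
    "invalid column name",
]


def is_meaningful(row_kind: str, width: int, target: str) -> bool:
    # Two non-adjacent columns require at least three columns in the header.
    if target == "two non-adjacent columns" and width < 3:
        return False
    # In a two-column header, "two adjacent columns" is the same as "all columns".
    if target == "two adjacent columns" and width == 2:
        return False
    return True


candidates = [
    combo
    for combo in product(row_kinds, header_widths, targets)
    if is_meaningful(*combo)
]

for combo in candidates:
    print(combo)
print(len(candidates))  # 24
```

Writing a test for every combination is rarely necessary. The point of the enumeration is to make it harder to overlook a category by accident.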

---

**Empty Relation**

If we project an **empty relation** $E$ defined as

| `A` | `B` |
| :-: | :-: |

onto the column `A`, the result is $\pi_{A}(E)$:

| `A` |
| :-: |

Since $E$ has no rows, the projected relation should also be empty.

---

**Two Projected to One (no duplicates)**

If we project the relation $R$ defined as

| `A` | `B` |
| :-: | :-: |
| $x$ | $1$ |
| $y$ | $2$ |

onto the column `A`, the result is $\pi_{A}(R)$:

| `A` |
| :-: |
| $x$ |
| $y$ |

No duplicates appear in this projection, so the result is just the distinct values of `A`.

---

**Two Projected to One (possible duplicates)**

If projection kept repeated elements, projecting the relation $P$ defined as

| `A` | `B` |
| :-: | :-: |
| $x$ | $1$ |
| $x$ | $2$ |

onto the column `A` would yield $\pi_{A}(P)$:

| `A` |
| :-: |
| $x$ |
| $x$ |

But projection in relational algebra **removes duplicates**, so the correct result is:

| `A` |
| :-: |
| $x$ |

---

**Three Columns Projected to One Column**

If we project the relation $R$ defined as

| `A` | `B` | `C` |
| :-: | :-: | :-: |
| $x$ | $1$ | $\mathsf{true}$ |
| $y$ | $2$ | $\mathsf{false}$ |

onto the column `B`, the result is $\pi_{B}(R)$:

| `B` |
| :-: |
| $1$ |
| $2$ |

---

**Three Columns Projected to Two Adjacent Columns**

If we project the relation $R$ defined as

| `A` | `B` | `C` |
| :-: | :-: | :-: |
| $x$ | $1$ | $\mathsf{true}$ |
| $y$ | $2$ | $\mathsf{false}$ |
| $x$ | $1$ | $\mathsf{false}$ |

onto the **adjacent columns** `A` and `B`, the result is $\pi_{A,B}(R)$:

| `A` | `B` |
| :-: | :-: |
| $x$ | $1$ |
| $y$ | $2$ |

Notice that the third row collapses with the first under projection, since only the `A` and `B` columns are kept.

---

**Three Columns Projected to Two Non-Adjacent Columns**

If we project the relation $R$ defined as

| `A` | `B` | `C` |
| :-: | :-: | :-: |
| $x$ | $1$ | $\mathsf{true}$ |
| $y$ | $2$ | $\mathsf{false}$ |
| $x$ | $9$ | $\mathsf{true}$ |

onto the **non-adjacent columns** `A` and `C`, the result is $\pi_{A,C}(R)$:

| `A` | `C` |
| :-: | :-: |
| $x$ | $\mathsf{true}$ |
| $y$ | $\mathsf{false}$ |

Here the first and third rows collapse into one, since the projection ignores the `B` column and both give $(x,\mathsf{true})$.
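
The expected results in the two three-column examples above can be double-checked by hand with set comprehensions. This is a sketch that hard-codes the column positions for these two examples (the variable names are hypothetical). It is not the general multi-column `project` that Project 3 asks for.

```python
# Rows of the "adjacent columns" example; the header is (A, B, C)
adjacent_rows = {("x", 1, True), ("y", 2, False), ("x", 1, False)}
# Keep positions 0 and 1 (columns A and B)
proj_ab = {(row[0], row[1]) for row in adjacent_rows}
print(proj_ab == {("x", 1), ("y", 2)})  # True

# Rows of the "non-adjacent columns" example; the header is (A, B, C)
non_adjacent_rows = {("x", 1, True), ("y", 2, False), ("x", 9, True)}
# Keep positions 0 and 2 (columns A and C), skipping B
proj_ac = {(row[0], row[2]) for row in non_adjacent_rows}
print(proj_ac == {("x", True), ("y", False)})  # True
```

In both cases three input rows produce two output rows, because two of the shortened tuples are equal and the set keeps only one of them.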

---

**Projection that Keeps All Columns**

If we project the relation $R$ defined as

| `A` | `B` |
| :-: | :-: |
| $x$ | $1$ |
| $y$ | $2$ |

onto **all of its columns** (`A` and `B`), then $\pi_{A,B}(R) = R$:

| `A` | `B` |
| :-: | :-: |
| $x$ | $1$ |
| $y$ | $2$ |

Projecting onto the full set of attributes returns the original relation unchanged.

---

**Error Case: Invalid Column Name**

If we try to project the relation $R$ defined as

| `A` | `B` |
| :-: | :-: |
| $x$ | $1$ |

onto a column `Z` that does not appear in the header, the operation is undefined.  
In our implementation this raises an `IncompatibleOperandError`.


In [27]:
%%ipytest -qq

def test_project_invalid_column() -> None:
    # error case where invalid column name is used
    R = Relation("R", ["A","B"], {("x",1)})
    with pytest.raises(IncompatibleOperandError):
        R.project("Z")

def test_project_empty_relation() -> None:
    # empty relation with two columns projected onto one column
    R = Relation("Empty", ["A","B"], set())
    out = R.project("A")
    expected = Relation("π_{A}(Empty)", ("A",), set())
    assert out == expected


def test_project_nonempty_two_columns_no_duplicates() -> None:
    # nonempty with two columns projected onto one column, distinct values
    R = Relation("R", ["A","B"], {("x",1), ("y",2)})
    out = R.project("A")
    expected = Relation("π_{A}(R)", ("A",), {("x",), ("y",)})
    assert out == expected


def test_project_nonempty_two_columns_with_duplicates() -> None:
    # nonempty with two columns projected onto one column, duplicates removed
    R = Relation("R", ["A","B"], {("x",1), ("x",2)})
    out = R.project("A")
    expected = Relation("π_{A}(R)", ("A",), {("x",)})
    assert out == expected


def test_project_three_columns_one_column() -> None:
    # nonempty with three columns projected onto one column
    R = Relation("R", ["A","B","C"], {("x",1,True), ("y",2,False)})
    out = R.project("B")
    expected = Relation("π_{B}(R)", ("B",), {(1,), (2,)})
    assert out == expected


def test_project_three_cols_to_two_adjacent() -> None:
    # Header A, B, C — project onto adjacent columns A,B
    r = Relation(
        "R",
        ["A", "B", "C"],
        {
            ("x", 1, True),
            ("y", 2, False),
            ("x", 1, False),   # duplicate on (A,B) should be removed by projection
        },
    )

    out = r.project(("A", "B"))  # expecting future support for multi-column

    # Expected: header (A,B) and duplicates removed
    expected_header = ("A", "B")
    expected_rows = {("x", 1), ("y", 2)}

    assert out.header == expected_header
    assert out.set_of_tuples == expected_rows


def test_project_three_cols_to_two_non_adjacent() -> None:
    # Header A, B, C — project onto non-adjacent columns A,C
    r = Relation(
        "R",
        ["A", "B", "C"],
        {
            ("x", 1, True),
            ("y", 2, False),
            ("x", 9, True),    # same (A,C) as first row; should collapse to one
        },
    )

    out = r.project(("A", "C"))  # expecting future support for multi-column

    # Expected: header (A,C) and duplicates removed
    expected_header = ("A", "C")
    expected_rows = {("x", True), ("y", False)}

    assert out.header == expected_header
    assert out.set_of_tuples == expected_rows

[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[31mF[0m[31mF[0m[31m                                                                                      [100%][0m


[31m[1m_____________________________ test_project_three_cols_to_two_adjacent ______________________________[0m

    [0m[94mdef[39;49;00m[90m [39;49;00m[92mtest_project_three_cols_to_two_adjacent[39;49;00m() -> [94mNone[39;49;00m:[90m[39;49;00m
        [90m# Header A, B, C — project onto adjacent columns A,B[39;49;00m[90m[39;49;00m
        r = Relation([90m[39;49;00m
            [33m"[39;49;00m[33mR[39;49;00m[33m"[39;49;00m,[90m[39;49;00m
            [[33m"[39;49;00m[33mA[39;49;00m[33m"[39;49;00m, [33m"[39;49;00m[33mB[39;49;00m[33m"[39;49;00m, [33m"[39;49;00m[33mC[39;49;00m[33m"[39;49;00m],[90m[39;49;00m
            {[90m[39;49;00m
                ([33m"[39;49;00m[33mx[39;49;00m[33m"[39;49;00m, [94m1[39;49;00m, [94mTrue[39;49;00m),[90m[39;49;00m
                ([33m"[39;49;00m[33my[39;49;00m[33m"[39;49;00m, [94m2[39;49;00m, [94mFalse[39;49;00m),[90m[39;49;00m
                ([33m"[39;49;00m[33mx[39;49;00m[

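Why do two of the tests fail? Because the `project` operator implemented above only projects onto a single column, and the two failing tests try to project onto more than one column. Passing a tuple of column names such as `("A", "B")` fails the precondition check, since the tuple itself is not an attribute in the header, so `project` raises an `IncompatibleOperandError`.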
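
The failure can be reproduced without the `Relation` class. The precondition in `project` asks whether its argument is a member of `set(self.header)`. The sketch below (with a hypothetical `header` variable) shows that a single name passes this membership check while a tuple of names does not, which is the situation in the two multi-column tests.

```python
header = ["A", "B", "C"]

# A single attribute name is an element of the header set
print("A" in set(header))         # True

# A tuple of names is not equal to any single name in the set,
# so the membership test fails even though both names are valid
print(("A", "B") in set(header))  # False
```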

This is why **test coverage through input partitioning** is useful: it helps us think through the different **types of inputs** that can occur and write a test for each input type.

---