# PyDough Knowledge Graph Creation: From Relational Data to Graph-Based Metadata

## Introduction

This tutorial shows how to build a PyDough knowledge graph from relational data by converting a relational database schema into PyDough metadata. The example schema is written for SQLite.

This tutorial takes a hands-on approach, walking you through the entire workflow of creating a PyDough knowledge graph. Along the way, we will integrate key concepts from the PyDough metadata documentation to ensure a thorough understanding of how different components interact.  

The guide is structured as follows:
- Understanding the Data Schema
- Defining Collections and Properties
- Establishing Simple Relationships (Joins)
- Implementing Compound Relationships
- Validating the Graph Structure

The next cell loads the PyDough Jupyter extension and imports PyDough so that we can inspect the graph later in the tutorial.

```python
%load_ext pydough.jupyter_extensions

import pydough
```


## Understanding the Data Schema

To build a knowledge graph in PyDough, we start from the underlying relational database schema. This tutorial uses a university database that models departments, professors, students, courses, and study partnerships. The schema includes one-to-one, one-to-many, many-to-many, and self-referencing relationships, so it shows how each kind of relationship is represented in the graph.

### Relational Database Schema

Below is the SQL schema that we will use as the foundation for our PyDough knowledge graph:

```sql
-- One-to-Many (Department → Professors)
CREATE TABLE Departments (
    department_id INT PRIMARY KEY,
    department_name VARCHAR(100) NOT NULL,
    founded DATE
);

CREATE TABLE Professors (
    professor_id INT PRIMARY KEY,
    name VARCHAR(100) NOT NULL,
    department_id INT,
    is_active BOOLEAN,
    FOREIGN KEY (department_id) REFERENCES Departments(department_id)
);

-- One-to-One (Professor ↔ Office)
CREATE TABLE ProfessorOffices (
    professor_id INT PRIMARY KEY,
    office_number INT NOT NULL,
    building VARCHAR(100) NOT NULL,
    FOREIGN KEY (professor_id) REFERENCES Professors(professor_id)
);

-- Many-to-Many (Students ↔ Courses, Through Enrollments)
CREATE TABLE Students (
    student_id INT PRIMARY KEY,
    name VARCHAR(100) NOT NULL
);

CREATE TABLE Courses (
    course_id INT PRIMARY KEY,
    course_name VARCHAR(100) NOT NULL
);

CREATE TABLE Enrollments (
    student_id INT,
    course_id INT,
    PRIMARY KEY (student_id, course_id),
    FOREIGN KEY (student_id) REFERENCES Students(student_id),
    FOREIGN KEY (course_id) REFERENCES Courses(course_id)
);

-- Self-referencing Many-to-Many (Student study partnerships)
CREATE TABLE StudentStudyPartner (
    student_id INT,
    partner_id INT,
    PRIMARY KEY (student_id, partner_id),
    FOREIGN KEY (student_id) REFERENCES Students(student_id),
    FOREIGN KEY (partner_id) REFERENCES Students(student_id)
);
```

### Schema Overview and Explanation

This schema models the university system with the following entities and relationships:

- **One-to-Many:** Each department can have multiple professors. The `Professors` table contains a foreign key `department_id` referencing the `Departments` table.
- **One-to-One:** Each professor has at most one office, and each office row belongs to exactly one professor. The `ProfessorOffices` table uses `professor_id` as both its primary key and a foreign key, which enforces the one-to-one relationship.
- **Many-to-Many:** Students enroll in multiple courses, and courses have multiple students. The `Enrollments` table serves as a bridge table to connect `Students` and `Courses`.
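- **Self-referencing Many-to-Many:** Students can form study partnerships with other students. The `StudentStudyPartner` table references `Students` twice, once through `student_id` and once through `partner_id`. The composite primary key prevents duplicate pairs, but the schema as written does not stop a student from being paired with themselves; that would require an additional constraint such as `CHECK (student_id <> partner_id)`.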

This structured schema will serve as the basis for transforming relational data into a PyDough knowledge graph, which we will cover in the next sections.
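As a quick check that the schema is valid SQLite, the sketch below creates the tables in an in-memory database and lists every foreign key, since these are the links that later become relationships in the graph. It uses only Python's built-in `sqlite3` module; the helper name `list_foreign_keys` is illustrative and is not part of PyDough.

```python
import sqlite3

# The schema from this section, condensed (same tables, columns, and keys).
SCHEMA = """
CREATE TABLE Departments (
    department_id INT PRIMARY KEY,
    department_name VARCHAR(100) NOT NULL,
    founded DATE
);
CREATE TABLE Professors (
    professor_id INT PRIMARY KEY,
    name VARCHAR(100) NOT NULL,
    department_id INT,
    is_active BOOLEAN,
    FOREIGN KEY (department_id) REFERENCES Departments(department_id)
);
CREATE TABLE ProfessorOffices (
    professor_id INT PRIMARY KEY,
    office_number INT NOT NULL,
    building VARCHAR(100) NOT NULL,
    FOREIGN KEY (professor_id) REFERENCES Professors(professor_id)
);
CREATE TABLE Students (student_id INT PRIMARY KEY, name VARCHAR(100) NOT NULL);
CREATE TABLE Courses (course_id INT PRIMARY KEY, course_name VARCHAR(100) NOT NULL);
CREATE TABLE Enrollments (
    student_id INT,
    course_id INT,
    PRIMARY KEY (student_id, course_id),
    FOREIGN KEY (student_id) REFERENCES Students(student_id),
    FOREIGN KEY (course_id) REFERENCES Courses(course_id)
);
CREATE TABLE StudentStudyPartner (
    student_id INT,
    partner_id INT,
    PRIMARY KEY (student_id, partner_id),
    FOREIGN KEY (student_id) REFERENCES Students(student_id),
    FOREIGN KEY (partner_id) REFERENCES Students(student_id)
);
"""


def list_foreign_keys(conn):
    """Map each table to a sorted list of (column, referenced_table, referenced_column)."""
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    result = {}
    for table in tables:
        # Row layout: id, seq, table, from, to, on_update, on_delete, match
        rows = conn.execute(f'PRAGMA foreign_key_list("{table}")').fetchall()
        result[table] = sorted((row[3], row[2], row[4]) for row in rows)
    return result


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
foreign_keys = list_foreign_keys(conn)
for table, keys in foreign_keys.items():
    print(table, keys)
```

Tables with no foreign keys (`Departments`, `Students`, `Courses`) appear with an empty list, while the bridge tables each list two.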

## Defining Collections and Properties

### Understanding the Metadata Structure

In PyDough, knowledge graphs are defined using JSON metadata files. These files store structured data representing the entities, attributes, and relationships of a dataset. A metadata file consists of:
- **Graphs:** Logical groupings of related collections.
- **Collections:** Corresponding to SQL tables, collections represent entities.
- **Properties:** The attributes of a collection (its table columns) and its relationships to other collections.

The most basic structure for our example is shown below:

```json
{
    "University": {
        "Students": {...},
        "Professors": {...},
        "Courses": {...},
        "Departments": {...}
    }
}
```

This top level represents the graph and its collections. Because each graph is a named entry, the same metadata file can define multiple logical datasets.

### Defining Collections

Each SQL table corresponds to a collection in PyDough. Initially, we define empty collections with:

- **type**: Always set to `"simple_table"` for standard SQL tables.
- **table_path**: The schema-qualified name of the SQL table (here, tables live in SQLite's default `main` schema).
- **unique_properties**: The primary key(s) of the table. Each entry can be:
  - A single unique attribute: `"unique_properties": ["attribute"]`
  - A combination of attributes that is unique together, written as a nested list: `"unique_properties": [["attribute1", "attribute2"]]`
- **properties**: Initially left empty. This will be defined in the next step.

Here is how the metadata graph would look after defining the collections based on SQL tables:

```json
{
    "University": {
        "Students": {
            "type": "simple_table",
            "table_path": "main.students",
            "unique_properties": ["student_id"],
            "properties": {}
        },
        "Professors": {
            "type": "simple_table",
            "table_path": "main.professors",
            "unique_properties": ["professor_id"],
            "properties": {}
        },
        "Departments": {
            "type": "simple_table",
            "table_path": "main.departments",
            "unique_properties": ["department_id"],
            "properties": {}
        },
        "Courses": {
            "type": "simple_table",
            "table_path": "main.courses",
            "unique_properties": ["course_id"],
            "properties": {}
        },
        "Enrollments": {
            "type": "simple_table",
            "table_path": "main.enrollments",
            "unique_properties": [["student_id", "course_id"]],
            "properties": {}
        },
        "ProfessorOffices": {
            "type": "simple_table",
            "table_path": "main.professor_offices",
            "unique_properties": ["professor_id"],
            "properties": {}
        },
        "StudentStudyPartner": {
            "type": "simple_table",
            "table_path": "main.student_study_partner",
            "unique_properties": [["student_id", "partner_id"]],
            "properties": {}
        }
    }
}
```

At this stage, collections exist but lack properties.
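Since every empty collection has the same four fields, the skeleton can also be generated rather than typed by hand. The following is a minimal sketch using only the standard library; `empty_collection` and `TABLES` are hypothetical names introduced for illustration, not part of PyDough.

```python
import json


def empty_collection(table_path, unique_properties):
    """Build a collection entry with no properties yet (hypothetical helper)."""
    return {
        "type": "simple_table",
        "table_path": table_path,
        "unique_properties": unique_properties,
        "properties": {},
    }


# (collection name, table path, unique properties) as used in this tutorial.
# A nested list marks a composite key: the combination of columns is unique.
TABLES = [
    ("Students", "main.students", ["student_id"]),
    ("Professors", "main.professors", ["professor_id"]),
    ("Departments", "main.departments", ["department_id"]),
    ("Courses", "main.courses", ["course_id"]),
    ("Enrollments", "main.enrollments", [["student_id", "course_id"]]),
    ("ProfessorOffices", "main.professor_offices", ["professor_id"]),
    ("StudentStudyPartner", "main.student_study_partner", [["student_id", "partner_id"]]),
]

graph = {
    "University": {
        name: empty_collection(path, unique) for name, path, unique in TABLES
    }
}
print(json.dumps(graph, indent=4))
```

The printed JSON has the same shape as the metadata shown above and could be written to a file with `json.dump`.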

### Defining Properties

Each property in a collection corresponds to a specific characteristic of an entity. In PyDough, properties can be categorized into:

- **Attributes**: These represent SQL table columns, storing direct data about an entity.
- **Relationships**: These define connections between collections, linking entities through joins or compound relationships.

#### Attributes

At this stage, we focus on defining **attributes**, which map directly to SQL table columns. Each attribute includes:

- **type**: Always `"table_column"` for a SQL table column.
- **column_name**: The SQL column name.
- **data_type**: The data type.
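The three fields above can be derived mechanically from a column's name and SQL type. The sketch below assumes the type mapping used by this tutorial's examples (`INT` to `int32`, `VARCHAR` to `string`, `DATE` to `date`, `BOOLEAN` to `bool`); the mapping and the helper names `table_column` and `build_properties` are illustrative assumptions and cover only the types in the university schema.

```python
# SQL base type -> PyDough data_type, as used by the examples in this tutorial.
# Assumption for illustration: only the types from the university schema are covered.
SQL_TO_DATA_TYPE = {
    "INT": "int32",
    "VARCHAR": "string",
    "DATE": "date",
    "BOOLEAN": "bool",
}


def table_column(column_name, sql_type):
    """Build one table_column property from a column name and its SQL type."""
    base_type = sql_type.split("(")[0].strip().upper()  # "VARCHAR(100)" -> "VARCHAR"
    return {
        "type": "table_column",
        "column_name": column_name,
        "data_type": SQL_TO_DATA_TYPE[base_type],
    }


def build_properties(columns):
    """Build the properties dict for a list of (column_name, sql_type) pairs."""
    return {name: table_column(name, sql_type) for name, sql_type in columns}


professors = build_properties(
    [
        ("professor_id", "INT"),
        ("name", "VARCHAR(100)"),
        ("department_id", "INT"),
        ("is_active", "BOOLEAN"),
    ]
)
for name, prop in professors.items():
    print(name, prop)
```

An unmapped SQL type raises a `KeyError`, which is a deliberate choice here: it is safer to stop than to guess a data type.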

Now, we add properties to collections:


```json
{
    "University": {
        "Students": {
            "type": "simple_table",
            "table_path": "main.students",
            "unique_properties": ["student_id"],
            "properties": {
                "student_id": {"type": "table_column", "column_name": "student_id", "data_type": "int32"},
                "name": {"type": "table_column", "column_name": "name", "data_type": "string"}
            }
        },
        "Professors": {
            "type": "simple_table",
            "table_path": "main.professors",
            "unique_properties": ["professor_id"],
            "properties": {
                "professor_id": {"type": "table_column", "column_name": "professor_id", "data_type": "int32"},
                "name": {"type": "table_column", "column_name": "name", "data_type": "string"},
                "department_id": {"type": "table_column", "column_name": "department_id", "data_type": "int32"},
                "is_active": {"type": "table_column", "column_name": "is_active", "data_type": "bool"}
            }
        },
        "Departments": {
            "type": "simple_table",
            "table_path": "main.departments",
            "unique_properties": ["department_id"],
            "properties": {
                "department_id": {"type": "table_column", "column_name": "department_id", "data_type": "int32"},
                "department_name": {"type": "table_column", "column_name": "department_name", "data_type": "string"},
                "founded": {"type": "table_column", "column_name": "founded", "data_type": "date"}
            }
        },
        "Courses": {
            "type": "simple_table",
            "table_path": "main.courses",
            "unique_properties": ["course_id"],
            "properties": {
                "course_id": {"type": "table_column", "column_name": "course_id", "data_type": "int32"},
                "course_name": {"type": "table_column", "column_name": "course_name", "data_type": "string"}
            }
        },
        "Enrollments": {
            "type": "simple_table",
            "table_path": "main.enrollments",
            "unique_properties": [["student_id", "course_id"]],
            "properties": {
                "student_id": {"type": "table_column", "column_name": "student_id", "data_type": "int32"},
                "course_id": {"type": "table_column", "column_name": "course_id", "data_type": "int32"}
            }
        },
        "ProfessorOffices": {
            "type": "simple_table",
            "table_path": "main.professor_offices",
            "unique_properties": ["professor_id"],
            "properties": {
                "professor_id": {"type": "table_column", "column_name": "professor_id", "data_type": "int32"},
                "office_number": {"type": "table_column", "column_name": "office_number", "data_type": "int32"},
                "building": {"type": "table_column", "column_name": "building", "data_type": "string"}
            }
        },
        "StudentStudyPartner": {
            "type": "simple_table",
            "table_path": "main.student_study_partner",
            "unique_properties": [["student_id", "partner_id"]],
            "properties": {
                "student_id": {"type": "table_column", "column_name": "student_id", "data_type": "int32"},
                "partner_id": {"type": "table_column", "column_name": "partner_id", "data_type": "int32"}
            }
        }
    }
}
```

If we save this JSON to a `graph.json` file and load it, we can inspect the graph's current structure with PyDough's utilities. This check can be repeated at any point in the process.

```python
pydough.active_session.load_metadata_graph("../metadata/graph.json", "University")
graph = pydough.active_session.metadata
print(pydough.explain_structure(graph))
```

```text
Structure of PyDough graph: University

  Courses
  ├── course_id
  └── course_name

  Departments
  ├── department_id
  ├── department_name
  └── founded

  Enrollments
  ├── course_id
  └── student_id

  ProfessorOffices
  ├── building
  ├── office_number
  └── professor_id

  Professors
  ├── department_id
  ├── is_active
  ├── name
  └── professor_id

  StudentStudyPartner
  ├── partner_id
  └── student_id

  Students
  ├── name
  └── student_id
```
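The structure printout confirms that the file loads, but a typo in a key name can be easier to spot before loading. A minimal sketch of such a pre-check follows, written in plain Python and independent of PyDough; the function `check_graph` and the two rules it applies are assumptions for illustration, not PyDough's own validation.

```python
def check_graph(graph):
    """Return a list of problems found in a {graph_name: {collection: {...}}} dict."""
    problems = []
    for graph_name, collections in graph.items():
        for name, collection in collections.items():
            properties = collection.get("properties", {})
            for entry in collection.get("unique_properties", []):
                # An entry is either one property name or a list of names (composite key).
                names = entry if isinstance(entry, list) else [entry]
                for prop in names:
                    if prop not in properties:
                        problems.append(
                            f"{graph_name}.{name}: unique property {prop!r} is not defined"
                        )
            for prop_name, prop in properties.items():
                if prop.get("type") == "table_column":
                    for field in ("column_name", "data_type"):
                        if field not in prop:
                            problems.append(
                                f"{graph_name}.{name}.{prop_name}: missing {field!r}"
                            )
    return problems


# A deliberately incomplete collection: "course_id" is named as part of the
# composite key but is not defined as a property.
sample = {
    "University": {
        "Enrollments": {
            "type": "simple_table",
            "table_path": "main.enrollments",
            "unique_properties": [["student_id", "course_id"]],
            "properties": {
                "student_id": {
                    "type": "table_column",
                    "column_name": "student_id",
                    "data_type": "int32",
                },
            },
        }
    }
}
problems = check_graph(sample)
print(problems)
```

The same function could be applied to the result of `json.load` on the metadata file; an empty list means neither rule was violated.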


#### Simple Relationships

In PyDough, simple relationships are direct mappings between tables that reflect the foreign key constraints in relational databases. These relationships are used to define one-to-one and one-to-many connections between collections.

Each simple relationship is a **property** defined using:

- **type**: `"simple_join"`, indicating that this is a direct relationship.
- **other_collection_name**: The name of the related collection.
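- **singular**: A boolean indicating whether each record of this collection maps to at most one record of the other collection (`true`) or possibly to several (`false`).
- **no_collisions**: A boolean indicating whether each record of the other collection is matched by at most one record of this collection. It is `true` exactly when the reverse relationship is singular.
- **keys**: A mapping from each join column of this collection to the list of matching columns in the other collection.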
- **reverse_relationship_name**: The name given to the reverse relationship.
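To make the six fields concrete, here is a small sketch that assembles a `simple_join` property as a Python dictionary and prints it as JSON. The helper `simple_join` is hypothetical, and its check that every `keys` value is a non-empty list is an assumption based on the examples in this tutorial.

```python
import json


def simple_join(other_collection_name, singular, no_collisions, keys, reverse_relationship_name):
    """Build a simple_join property from the six fields listed above."""
    for column, other_columns in keys.items():
        if not isinstance(other_columns, list) or not other_columns:
            raise ValueError(f"keys[{column!r}] must be a non-empty list of column names")
    return {
        "type": "simple_join",
        "other_collection_name": other_collection_name,
        "singular": singular,
        "no_collisions": no_collisions,
        "keys": keys,
        "reverse_relationship_name": reverse_relationship_name,
    }


# Each office row points to one professor, and no two offices share a professor.
professor = simple_join(
    other_collection_name="Professors",
    singular=True,
    no_collisions=True,
    keys={"professor_id": ["professor_id"]},
    reverse_relationship_name="office",
)
print(json.dumps({"professor": professor}, indent=2))
```

The printed fragment is what would be placed inside the `properties` of the collection that owns the relationship.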

##### One-to-One Relationship:

A **one-to-one relationship** means that each record in a table corresponds to exactly one record in another table. In our schema, each professor has a unique office.

```sql
CREATE TABLE Professors (
    professor_id INT PRIMARY KEY,
    name VARCHAR(100) NOT NULL,
    department_id INT,
    is_active BOOLEAN,
    FOREIGN KEY (department_id) REFERENCES Departments(department_id)
);

CREATE TABLE ProfessorOffices (
    professor_id INT PRIMARY KEY,
    office_number INT NOT NULL,
    building VARCHAR(100) NOT NULL,
    FOREIGN KEY (professor_id) REFERENCES Professors(professor_id)
);
```

In PyDough, we represent this as a **simple relationship**. Here the relationship is defined in `ProfessorOffices`:


```json
{
    "Professors": {
        "type": "simple_table",
        "table_path": "main.professors",
        "unique_properties": ["professor_id"],
        "properties": {
            "professor_id": {"type": "table_column", "column_name": "professor_id", "data_type": "int32"},
            "name": {"type": "table_column", "column_name": "name", "data_type": "string"},
            "department_id": {"type": "table_column", "column_name": "department_id", "data_type": "int32"},
            "is_active": {"type": "table_column", "column_name": "is_active", "data_type": "bool"}
        }
    },
    "ProfessorOffices": {
        "type": "simple_table",
        "table_path": "main.professor_offices",
        "unique_properties": ["professor_id"],
        "properties": {
            "professor_id": {"type": "table_column", "column_name": "professor_id", "data_type": "int32"},
            "office_number": {"type": "table_column", "column_name": "office_number", "data_type": "int32"},
            "building": {"type": "table_column", "column_name": "building", "data_type": "string"},

            "professor": {
                "type": "simple_join",
                "other_collection_name": "Professors",
                "singular": true,
                "no_collisions": true,
                "keys": { "professor_id": ["professor_id"] },
                "reverse_relationship_name": "office"
            }
        }
    }
}
```

**Explanation:**
- The `ProfessorOffices` collection includes the `professor` relationship, linking each office to a single professor.
- The `Professors` collection does not contain an explicit relationship, as PyDough automatically infers it from `ProfessorOffices`.
- **singular: true** ensures that each office is associated with only one professor.
- **no_collisions: true** states that no two offices point to the same professor, so the reverse relationship `Professors.office` is also singular.

After adding the relationship to `graph.json`, we can reload the graph and check its structure:

```python
pydough.active_session.load_metadata_graph("../metadata/graph.json", "University")
graph = pydough.active_session.metadata
print(pydough.explain_structure(graph))
```

```text
Structure of PyDough graph: University

  Courses
  ├── course_id
  └── course_name

  Departments
  ├── department_id
  ├── department_name
  └── founded

  Enrollments
  ├── course_id
  └── student_id

  ProfessorOffices
  ├── building
  ├── office_number
  ├── professor_id
  └── professor [one member of Professors] (reverse of Professors.office)

  Professors
  ├── department_id
  ├── is_active
  ├── name
  ├── professor_id
  └── office [one member of ProfessorOffices] (reverse of ProfessorOffices.professor)

  StudentStudyPartner
  ├── partner_id
  └── student_id

  Students
  ├── name
  └── student_id
```


The output now shows a one-to-one relationship between each professor and their office.

Alternatively, we could define the relationship in the `Professors` collection instead of `ProfessorOffices`, and it would still work correctly. In this case, PyDough would infer the reverse relationship in `ProfessorOffices`. 
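To see what the alternative definition would look like, the sketch below derives the mirrored property from an existing `simple_join`: the collection and property names swap, the `keys` mapping is inverted, and `singular` and `no_collisions` trade places (because `no_collisions` describes the reverse direction). The function `mirror_simple_join` is a hypothetical helper for illustration, not something PyDough provides.

```python
import json


def mirror_simple_join(collection_name, property_name, prop):
    """Return (owner_collection, property_name, property) for the reverse definition."""
    reverse_keys = {}
    for column, other_columns in prop["keys"].items():
        for other_column in other_columns:
            reverse_keys.setdefault(other_column, []).append(column)
    reverse = {
        "type": "simple_join",
        "other_collection_name": collection_name,
        "singular": prop["no_collisions"],   # the reverse is singular when there are no collisions
        "no_collisions": prop["singular"],
        "keys": reverse_keys,
        "reverse_relationship_name": property_name,
    }
    return prop["other_collection_name"], prop["reverse_relationship_name"], reverse


# The relationship as defined in ProfessorOffices in this section.
professor = {
    "type": "simple_join",
    "other_collection_name": "Professors",
    "singular": True,
    "no_collisions": True,
    "keys": {"professor_id": ["professor_id"]},
    "reverse_relationship_name": "office",
}

owner, name, office = mirror_simple_join("ProfessorOffices", "professor", professor)
print(f"Define in {owner}:")
print(json.dumps({name: office}, indent=2))
```

For this one-to-one case both flags are `true`, so only the names change; for a relationship where the flags differ, the swap matters.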

##### One-to-Many Relationship:

A **one-to-many** relationship occurs when a single record in one table relates to multiple records in another. In SQL, this is commonly represented by a foreign key.

For example, in our SQL schema, each department has multiple professors, but each professor belongs to only one department:

```sql
CREATE TABLE Departments (
    department_id INT PRIMARY KEY,
    department_name VARCHAR(100) NOT NULL,
    founded DATE
);

CREATE TABLE Professors (
    professor_id INT PRIMARY KEY,
    name VARCHAR(100) NOT NULL,
    department_id INT,
    is_active BOOLEAN,
    FOREIGN KEY (department_id) REFERENCES Departments(department_id)
);
```

In PyDough, this is represented with a simple relationship, much like the one-to-one case but with different `singular` or `no_collisions` values:

```json
{
    "Departments": {
        "type": "simple_table",
        "table_path": "main.departments",
        "unique_properties": ["department_id"],
        "properties": {
            "department_id": {"type": "table_column", "column_name": "department_id", "data_type": "int32"},
            "department_name": {"type": "table_column", "column_name": "department_name", "data_type": "string"},
            "founded": {"type": "table_column", "column_name": "founded", "data_type": "date"}
        }
    },
    "Professors": {
        "type": "simple_table",
        "table_path": "main.professors",
        "unique_properties": ["professor_id"],
        "properties": {
            "professor_id": {"type": "table_column", "column_name": "professor_id", "data_type": "int32"},
            "name": {"type": "table_column", "column_name": "name", "data_type": "string"},
            "department_id": {"type": "table_column", "column_name": "department_id", "data_type": "int32"},
            "is_active": {"type": "table_column", "column_name": "is_active", "data_type": "bool"},

            "department": {
                "type": "simple_join",
                "other_collection_name": "Departments",
                "singular": true,
                "no_collisions": false,
                "keys": { "department_id": ["department_id"] },
                "reverse_relationship_name": "professors"
            }
        }
    }
}
```

**Explanation:**
- The `Professors` collection defines a `department` relationship, which connects each professor to a single `Departments` record.
- The relationship is not explicitly defined in `Departments` because it is automatically inferred from `Professors`.
- **singular: true** in `Professors` ensures that each professor belongs to only one department.
- **no_collisions: false** allows multiple professors to reference the same department.

Alternatively, we could define the relationship in the `Departments` collection instead of `Professors`. In this case, we would add a `professors` property to `Departments`, making the relationship plural. However, this means:

- **singular: false** – A department can have multiple professors.
- **no_collisions: true** – Each professor belongs to only one department, ensuring uniqueness from the reverse perspective.

Both approaches correctly define the relationship, but choosing which one to use depends on how we prefer to navigate the data in PyDough.
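The two flag combinations can be cross-checked against actual rows. The sketch below loads a trimmed version of the two tables into an in-memory SQLite database with made-up sample rows and counts, in each direction, how many records match more than one record on the other side. The function `infer_cardinality` is a hypothetical helper; it only reflects the rows currently in the tables, so the schema's keys remain the real guarantee.

```python
import sqlite3


def infer_cardinality(conn, table, column, other_table, other_column):
    """Infer (singular, no_collisions) for table.column -> other_table.other_column.

    Table and column names are interpolated into the SQL text, so they must
    come from trusted input.
    """
    # Rows of `table` that match more than one row of `other_table`.
    many_matches = conn.execute(
        f"SELECT COUNT(*) FROM {table} AS t WHERE "
        f"(SELECT COUNT(*) FROM {other_table} AS o WHERE o.{other_column} = t.{column}) > 1"
    ).fetchone()[0]
    # Rows of `other_table` that are matched by more than one row of `table`.
    collisions = conn.execute(
        f"SELECT COUNT(*) FROM {other_table} AS o WHERE "
        f"(SELECT COUNT(*) FROM {table} AS t WHERE t.{column} = o.{other_column}) > 1"
    ).fetchone()[0]
    return many_matches == 0, collisions == 0


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Departments (
        department_id INT PRIMARY KEY,
        department_name VARCHAR(100) NOT NULL
    );
    CREATE TABLE Professors (
        professor_id INT PRIMARY KEY,
        name VARCHAR(100) NOT NULL,
        department_id INT,
        FOREIGN KEY (department_id) REFERENCES Departments(department_id)
    );
    -- Made-up sample rows: two professors share department 1.
    INSERT INTO Departments VALUES (1, 'Mathematics'), (2, 'Physics');
    INSERT INTO Professors VALUES (10, 'Ada', 1), (11, 'Emmy', 1), (12, 'Lise', 2);
    """
)

forward = infer_cardinality(conn, "Professors", "department_id", "Departments", "department_id")
backward = infer_cardinality(conn, "Departments", "department_id", "Professors", "department_id")
print("Professors.department  (singular, no_collisions):", forward)   # (True, False)
print("Departments.professors (singular, no_collisions):", backward)  # (False, True)
```

The two results are mirror images of each other, which matches the flag values described for the two ways of defining this relationship.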

After adding the relationship to `graph.json`, we can reload the graph and check its structure:

```python
pydough.active_session.load_metadata_graph("../metadata/graph.json", "University")
graph = pydough.active_session.metadata
print(pydough.explain_structure(graph))
```

```text
Structure of PyDough graph: University

  Courses
  ├── course_id
  └── course_name

  Departments
  ├── department_id
  ├── department_name
  ├── founded
  └── professors [multiple Professors] (reverse of Professors.department)

  Enrollments
  ├── course_id
  └── student_id

  ProfessorOffices
  ├── building
  ├── office_number
  ├── professor_id
  └── professor [one member of Professors] (reverse of Professors.office)

  Professors
  ├── department_id
  ├── is_active
  ├── name
  ├── professor_id
  ├── department [one member of Departments] (reverse of Departments.professors)
  └── office [one member of ProfessorOffices] (reverse of ProfessorOffices.professor)

  StudentStudyPartner
  ├── partner_id
  └── student_id

  Students
  ├── name
  └── student_id
```
