# Diagramming

Schema diagrams are essential tools for understanding and designing DataJoint pipelines.
They provide a visual representation of tables and their dependencies, making complex workflows comprehensible at a glance.

As introduced in [Relational Workflows](../20-concepts/05-workflows.md), DataJoint schemas form **Directed Acyclic Graphs (DAGs)** where:

- **Nodes** represent tables (workflow steps)
- **Edges** represent foreign key dependencies
- **Direction** flows from parent (referenced) to child (referencing) tables

This DAG structure embodies a core principle of the Relational Workflow Model: **the schema is an executable specification**.
Tables at the top are independent entities; tables below depend on tables above them.
Reading the diagram top-to-bottom reveals the workflow execution order.
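
The top-to-bottom reading can be sketched with Python's standard `graphlib` module. The table names and dependencies below are hypothetical, and this only illustrates how a DAG yields an execution order; it is not a description of how DataJoint lays out its diagrams.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each table lists the parents it references.
dependencies = {
    "Subject": set(),
    "Session": {"Subject"},
    "Recording": {"Session"},
    "SpikeSorting": {"Recording"},
}

# static_order() yields every table only after all of its parents.
execution_order = list(TopologicalSorter(dependencies).static_order())
print(execution_order)
```

Because the graph has no cycles, such an order always exists; a cyclic dependency would make `static_order()` raise `graphlib.CycleError`.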

DataJoint's diagramming notation differs from traditional notations (Chen's ER, Crow's Foot, UML) in one critical way: **line styles encode semantic relationship types**, not just cardinality.
This makes the diagram immediately informative about how entities relate—whether they share identity, belong to each other, or merely reference each other.

## Quick Reference

| Line Style | Appearance | Relationship | Child's Primary Key | Cardinality |
|------------|------------|--------------|--------------------|--------------|
| **Thick Solid** | ━━━ | Extension | Parent PK only | One-to-one |
| **Thin Solid** | ─── | Containment | Parent PK + own field(s) | One-to-many |
| **Dashed** | ┄┄┄ | Reference | Own independent PK | One-to-many |

**Key Principle**: Solid lines mean the parent's identity becomes part of the child's identity.
Dashed lines mean the child maintains independent identity.

**Visual Indicators**:
- **Underlined table name**: Table introduces at least one primary key attribute of its own
- **Non-underlined name**: Primary key is inherited entirely from parent tables
- **Orange dots**: Renamed foreign keys (see [Renamed Foreign Keys](#renamed-foreign-keys-and-orange-dots))
- **Table colors**: Green (Manual), Blue (Imported), Red (Computed), Gray (Lookup)

## The Three Line Styles

Line styles convey the **semantic relationship** between parent and child tables.
The choice of line style is determined by where the foreign key appears in the child's definition.
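
As a rough illustration of this rule, the sketch below classifies the edge from a parent to a child by inspecting the child's definition string. The function `classify_line_style` is hypothetical and deliberately simplified (one attribute per line, a single `---` divider, no renamed foreign keys); it is not how DataJoint parses definitions.

```python
def classify_line_style(definition: str, parent: str) -> str:
    """Return 'thick', 'thin', or 'dashed' for the edge from `parent`.

    Simplified sketch: one attribute or foreign key per line, a single
    '---' divider, and no renamed (projected) foreign keys.
    """
    # Drop trailing comments and blank lines.
    lines = [ln.split("#")[0].strip() for ln in definition.strip().splitlines()]
    lines = [ln for ln in lines if ln]
    divider = next(
        (i for i, ln in enumerate(lines) if ln.startswith("---")), len(lines)
    )
    primary, secondary = lines[:divider], lines[divider + 1:]
    reference = f"-> {parent}"
    if reference in secondary:
        return "dashed"  # foreign key sits below the divider
    if reference in primary:
        # Thick only when the foreign key is the whole primary key.
        return "thick" if len(primary) == 1 else "thin"
    raise ValueError(f"no foreign key to {parent} found")


extension = """
-> Customer
---
theme : varchar(20)
"""
containment = """
-> Customer
account_number : int
---
balance : decimal(10,2)
"""
reference = """
account_number : int
---
-> Customer
balance : decimal(10,2)
"""
for label, text in [("extension", extension),
                    ("containment", containment),
                    ("reference", reference)]:
    print(label, classify_line_style(text, "Customer"))
```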

### Thick Solid Line: Extension (One-to-One)

The foreign key **is** the entire primary key of the child table.

**Semantics**: The child *extends* or *specializes* the parent.
They share the same identity—at most one child exists for each parent.

```python
@schema
class Customer(dj.Manual):
    definition = """
    customer_id : int
    ---
    name : varchar(50)
    """

@schema
class CustomerPreferences(dj.Manual):
    definition = """
    -> Customer          # This IS the entire primary key
    ---
    theme : varchar(20)
    """
```

**Use cases**: Workflow sequences (Order → Shipment → Delivery), optional extensions (Customer → CustomerPreferences), modular data splits.

### Thin Solid Line: Containment (One-to-Many)

The foreign key is **part of** (but not all of) the child's primary key.

**Semantics**: The child *belongs to* or *is contained within* the parent.
Multiple children can exist for each parent, each identified within the parent's context.

```python
@schema
class Customer(dj.Manual):
    definition = """
    customer_id : int
    ---
    name : varchar(50)
    """

@schema
class Account(dj.Manual):
    definition = """
    -> Customer              # Part of primary key
    account_number : int     # Additional PK component
    ---
    balance : decimal(10,2)
    """
```

**Use cases**: Hierarchies (Study → Subject → Session), ownership (Customer → Account), containment (Order → OrderItem).

### Dashed Line: Reference (One-to-Many)

The foreign key is a **secondary attribute** (below the `---` line).

**Semantics**: The child *references* or *associates with* the parent but maintains independent identity.
The parent is just one attribute describing the child.

```python
@schema
class Bank(dj.Manual):
    definition = """
    bank_id : int
    ---
    bank_name : varchar(100)
    """

@schema
class Account(dj.Manual):
    definition = """
    account_number : int     # Own independent PK
    ---
    -> Bank                  # Secondary attribute
    balance : decimal(10,2)
    """
```

**Use cases**: Loose associations (Manufacturer → Product), references that might change (Department → Employee), and any case where the child has an independent identity.

## Visual Examples

Let's see each line style in action with live diagrams.

```python
import datajoint as dj
dj.conn()
```

### Dashed Line Example

```python
schema_dashed = dj.Schema('diagram_dashed')

@schema_dashed
class Customer(dj.Manual):
    definition = """
    customer_id : int
    ---
    name : varchar(50)
    """

@schema_dashed
class Account(dj.Manual):
    definition = """
    account_number : int
    ---
    -> Customer
    balance : decimal(10,2)
    """

dj.Diagram(schema_dashed)
```

**Dashed line**: `Account` has its own independent identity (`account_number`).
The `customer_id` foreign key is secondary—it references `Customer` but doesn't define the account's identity.

### Thin Solid Line Example

```python
schema_thin = dj.Schema('diagram_thin')

@schema_thin
class Customer(dj.Manual):
    definition = """
    customer_id : int
    ---
    name : varchar(50)
    """

@schema_thin
class Account(dj.Manual):
    definition = """
    -> Customer
    account_number : int
    ---
    balance : decimal(10,2)
    """

dj.Diagram(schema_thin)
```

**Thin solid line**: `Account`'s primary key is `(customer_id, account_number)`.
Accounts *belong to* customers—Account #3 means "Account #3 of Customer X."

### Thick Solid Line Example

```python
schema_thick = dj.Schema('diagram_thick')

@schema_thick
class Customer(dj.Manual):
    definition = """
    customer_id : int
    ---
    name : varchar(50)
    """

@schema_thick
class Account(dj.Manual):
    definition = """
    -> Customer
    ---
    balance : decimal(10,2)
    """

dj.Diagram(schema_thick)
```

**Thick solid line**: `Account`'s primary key *is* `customer_id` (inherited from `Customer`).
Each customer can have at most one account—they share identity.
Note that `Account` is no longer underlined, indicating it's not an independent dimension.
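
The cardinality difference between thin and thick lines follows directly from primary key uniqueness. The sketch below uses Python's built-in `sqlite3` module, with plain SQL standing in for the tables DataJoint would declare; the table names `account_thin` and `account_thick` are hypothetical stand-ins for the two `Account` variants above.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT);
    -- Thin line: parent key plus account_number form the primary key.
    CREATE TABLE account_thin (
        customer_id INTEGER REFERENCES customer,
        account_number INTEGER,
        balance REAL,
        PRIMARY KEY (customer_id, account_number)
    );
    -- Thick line: the parent key alone is the primary key.
    CREATE TABLE account_thick (
        customer_id INTEGER PRIMARY KEY REFERENCES customer,
        balance REAL
    );
    INSERT INTO customer VALUES (1, 'Alice');
""")

# Containment: two accounts for the same customer are both accepted.
db.executemany("INSERT INTO account_thin VALUES (1, ?, 0)", [(1,), (2,)])

# Extension: a second row for the same customer violates the primary key.
db.execute("INSERT INTO account_thick VALUES (1, 0)")
try:
    db.execute("INSERT INTO account_thick VALUES (1, 50)")
    second_insert_ok = True
except sqlite3.IntegrityError:
    second_insert_ok = False

thin_count = db.execute("SELECT COUNT(*) FROM account_thin").fetchone()[0]
print(thin_count, second_insert_ok)
```

The thin-line table accepts both rows, while the thick-line table rejects the second one, which is the "at most one child per parent" rule in action.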

## Association Tables and Many-to-Many Relationships

Many-to-many relationships appear as tables with **converging foreign keys**—multiple thin solid lines pointing into a single table.

```python
schema_assoc = dj.Schema("projects")

@schema_assoc
class Employee(dj.Manual):
    definition = """
    employee_id : int
    ---
    employee_name : varchar(60)
    """

@schema_assoc
class Project(dj.Manual):
    definition = """
    project_code  : varchar(8)
    ---
    project_title : varchar(50)
    start_date : date
    end_date : date
    """

@schema_assoc
class Assignment(dj.Manual):
    definition = """
    -> Employee
    -> Project
    ---
    percent_effort : decimal(4,1) unsigned
    """

dj.Diagram(schema_assoc)
```

**Reading this diagram**:
- `Employee` and `Project` are independent entities (underlined, at top)
- `Assignment` has two thin solid lines converging into it
- Its primary key is `(employee_id, project_code)`—the combination of both parents
- This creates a many-to-many relationship: each employee can work on multiple projects, and each project can have multiple employees
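
To see why the composite key yields a many-to-many relationship, the sketch below models `Assignment` rows as a plain dictionary keyed by `(employee_id, project_code)`. The sample rows are made up for illustration, and no database is involved.

```python
from collections import defaultdict

# Hypothetical Assignment rows: composite key -> percent_effort.
# As with a primary key, each (employee, project) pair can appear only once.
assignments = {
    (1, "ALPHA"): 50.0,
    (1, "BETA"): 50.0,
    (2, "ALPHA"): 100.0,
}

projects_by_employee = defaultdict(set)
employees_by_project = defaultdict(set)
for employee_id, project_code in assignments:
    projects_by_employee[employee_id].add(project_code)
    employees_by_project[project_code].add(employee_id)

print(dict(projects_by_employee))
print(dict(employees_by_project))
```

Employee 1 appears on two projects and project `ALPHA` has two employees, yet no pair is repeated.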

## Renamed Foreign Keys and Orange Dots

DataJoint foreign keys always reference the parent's **primary key**.
Usually, the foreign key attribute keeps the same name as in the parent.
However, sometimes you need different names:

- **Multiple references to the same table** (e.g., presynaptic and postsynaptic neurons)
- **Semantic clarity** (e.g., `manager_id` instead of `employee_id`)
- **Avoiding name conflicts**

Use `.proj()` to rename foreign key attributes:

```python
schema_graph = dj.Schema('directed_graph')

@schema_graph
class Neuron(dj.Manual):
    definition = """
    neuron_id : int
    ---
    neuron_type : enum('excitatory', 'inhibitory')
    layer : int
    """

@schema_graph
class Synapse(dj.Manual):
    definition = """
    synapse_id : int
    ---
    -> Neuron.proj(presynaptic='neuron_id')
    -> Neuron.proj(postsynaptic='neuron_id')
    strength : float
    """

dj.Diagram(schema_graph)
```

**Orange dots** appear between `Neuron` and `Synapse`, indicating:
- A projection has renamed the foreign key attribute
- Two distinct foreign keys connect the same pair of tables
- In the `Synapse` table: `presynaptic` and `postsynaptic` both reference `Neuron.neuron_id`

In interactive Jupyter notebooks, hovering over orange dots reveals the projection expression.
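
The renaming is needed because a table cannot hold two columns with the same name. A rough SQL analogue of the `Synapse` table, written with Python's built-in `sqlite3` module, makes the two distinct foreign keys visible. This is an illustrative stand-in under that assumption, not the DDL that DataJoint actually generates.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("PRAGMA foreign_keys = ON")
db.executescript("""
    CREATE TABLE neuron (neuron_id INTEGER PRIMARY KEY);
    CREATE TABLE synapse (
        synapse_id INTEGER PRIMARY KEY,
        presynaptic INTEGER NOT NULL REFERENCES neuron (neuron_id),
        postsynaptic INTEGER NOT NULL REFERENCES neuron (neuron_id),
        strength REAL
    );
    INSERT INTO neuron VALUES (1), (2);
    INSERT INTO synapse VALUES (10, 1, 2, 0.8);
""")

# Each row of foreign_key_list is
# (id, seq, parent table, child column, parent column, ...).
foreign_keys = sorted(
    (row[3], row[2], row[4])
    for row in db.execute("PRAGMA foreign_key_list(synapse)")
)
print(foreign_keys)
```

Both child columns point at the same parent column, which is the situation the orange dots flag in the diagram.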

**Common patterns** using renamed foreign keys:
- **Neural networks**: Presynaptic and postsynaptic neurons
- **Organizational hierarchies**: Employee and manager (both reference `Employee`)
- **Transportation**: Origin and destination airports

## Real-World Example: Classic Sales Database

Let's examine a real database—the [MySQL tutorial sample database](https://www.mysqltutorial.org/getting-started-with-mysql/mysql-sample-database/):

```python
schema = dj.Schema("classic_sales")
schema.spawn_missing_classes()

dj.Diagram(schema)
```

**Reading this diagram**:
1. **Independent entities at top**: `Productline`, `Office`, `Customer` (underlined)
2. **Follow solid lines down**: Track how primary keys cascade through the hierarchy
3. **Identify association tables**: Look for converging lines (e.g., `Orderdetail` links `Order` and `Product`)
4. **Dashed lines**: Reference relationships that don't cascade identity

The vertical layout reveals the workflow: create products and customers first, then orders, then order details.

## What Diagrams Show and Don't Show

### Clearly Indicated

| Feature | How It's Shown |
|---------|---------------|
| Relationship type | Line style (thick/thin/dashed) |
| Dependency direction | Vertical layout (parents above children) |
| Independent entities | Underlined table names |
| Table tiers | Colors (Green/Blue/Red/Gray) |
| Many-to-many | Converging lines into association table |
| Renamed foreign keys | Orange dots |

### Not Visible

| Feature | Must Check |
|---------|------------|
| Nullable foreign keys | Table definition |
| Secondary unique constraints | Table definition |
| Attribute names and types | Hover or inspect definition |
| CHECK constraints | Table definition |

**Design principle**: DataJoint users generally avoid secondary unique constraints.
Making foreign keys part of the primary key (creating solid lines) provides visual clarity and enables direct joins across multiple levels.

## Diagram Operations

DataJoint provides operators to filter and combine diagrams:

```python
# Show entire schema
dj.Diagram(schema)

# Show specific tables
dj.Diagram(Table1) + dj.Diagram(Table2)

# Show table and N levels of upstream dependencies
dj.Diagram(Table) - N

# Show table and N levels of downstream dependents
dj.Diagram(Table) + N

# Combine operations
(dj.Diagram(Table1) - 2) + (dj.Diagram(Table2) + 1)
```
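
The effect of `+ N` and `- N` can be pictured as growing a selection along the dependency edges. The toy model below works on a hypothetical four-table chain stored as a plain dictionary; the helper `expand` is an assumption made for illustration and does not reproduce DataJoint's internal graph handling.

```python
# Hypothetical schema: each table maps to the set of its parents.
parents = {
    "Subject": set(),
    "Session": {"Subject"},
    "Recording": {"Session"},
    "Spikes": {"Recording"},
}


def expand(tables, levels, upstream):
    """Grow a set of table names by `levels` steps along dependency edges."""
    selected = set(tables)
    for _ in range(levels):
        if upstream:
            # Add the parents of every selected table.
            step = {p for t in selected for p in parents[t]}
        else:
            # Add every table that has a selected table as a parent.
            step = {c for c, ps in parents.items() if ps & selected}
        selected |= step
    return selected


# Roughly analogous to dj.Diagram(Recording) - 1
print(sorted(expand({"Recording"}, 1, upstream=True)))
# Roughly analogous to dj.Diagram(Recording) + 1
print(sorted(expand({"Recording"}, 1, upstream=False)))
```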

## Diagrams and Queries

The diagram structure directly informs query patterns.

**Solid line paths enable direct joins**:
```python
# If A → B → C are connected by solid lines:
A * C  # Valid—primary keys cascade through solid lines
```

**Dashed lines require intermediate tables**:
```python
# If A ---> B (dashed), B → C (solid):
# A's primary key is only a secondary attribute of B, so it never reaches C.
A * B * C  # Must include B to link A and C
```

This is why solid lines are preferred when appropriate—they simplify queries by allowing you to skip intermediate tables.
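
The difference can be reproduced with a toy natural join over lists of dictionaries. The function `natural_join` and the sample rows are hypothetical; this is a sketch of the matching-on-shared-attributes idea, not DataJoint's `*` operator.

```python
def natural_join(left, right):
    """Join two lists of dict rows on every attribute name they share."""
    joined = []
    for a in left:
        for b in right:
            shared = a.keys() & b.keys()
            if all(a[k] == b[k] for k in shared):
                joined.append({**a, **b})
    return joined


# Solid lines: study_id cascades from Study through Subject into Session,
# so Study and Session can be joined without Subject.
study = [{"study_id": 1, "title": "Vision"}, {"study_id": 2, "title": "Memory"}]
session = [
    {"study_id": 1, "subject_id": 7, "session": 1},
    {"study_id": 2, "subject_id": 3, "session": 1},
]
solid = natural_join(study, session)

# Dashed line: bank_id is only a secondary attribute of Account, so it does
# not reach Transaction; Bank and Transaction share no attribute.
bank = [{"bank_id": 1, "bank_name": "First"}, {"bank_id": 2, "bank_name": "Second"}]
account = [{"account_number": 100, "bank_id": 1}, {"account_number": 200, "bank_id": 2}]
transaction = [{"account_number": 100, "txn_id": 1}, {"account_number": 200, "txn_id": 1}]

skipping_account = natural_join(bank, transaction)  # unrelated rows get paired
via_account = natural_join(natural_join(bank, account), transaction)

print(len(solid), len(skipping_account), len(via_account))
```

With no shared attribute, the join that skips `account` pairs every bank with every transaction, whereas going through `account` keeps only the rows that actually belong together.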

## Comparison to Other Notations

DataJoint's notation differs significantly from traditional database diagramming:

| Feature | Chen's ER | Crow's Foot | DataJoint |
|---------|-----------|-------------|----------|
| **Cardinality** | Numbers near entities | Symbols at line ends | Line thickness/style |
| **Direction** | No inherent direction | No inherent direction | Always top-to-bottom (DAG) |
| **Cycles allowed** | Yes | Yes | No |
| **Entity vs. relationship** | Distinct (rect vs. diamond) | Not distinguished | Not distinguished |
| **Primary key cascade** | Not shown | Not shown | Solid lines show this |
| **Identity sharing** | Not indicated | Not indicated | Thick solid line |

**Why DataJoint differs**:

1. **DAG structure**: No cycles means schemas are readable as workflows (top-to-bottom execution order)
2. **Line style semantics**: Immediately reveals relationship type without reading labels
3. **Primary key cascade visibility**: Solid lines show which tables can be joined directly
4. **Unified entity treatment**: No artificial distinction between "entities" and "relationships"—associations are just tables with converging foreign keys

:::{seealso}
The [Relational Workflows](../20-concepts/05-workflows.md) chapter covers the three database paradigms in depth, including how DataJoint's workflow-centric approach compares to Codd's mathematical model and Chen's Entity-Relationship model.
:::

## Best Practices

### Reading Diagrams

1. **Start at the top**: Identify independent entities (underlined)
2. **Follow solid lines**: Trace primary key cascades downward
3. **Spot convergence patterns**: Multiple lines into a table indicate associations
4. **Check line thickness**: Thick = one-to-one, Thin = one-to-many containment
5. **Note dashed lines**: These don't cascade identity

### Designing with Diagrams

1. **Choose solid lines when**:
   - Building hierarchies (Study → Subject → Session)
   - Creating workflow sequences (Order → Shipment → Delivery)
   - You want direct joins across levels

2. **Choose dashed lines when**:
   - Child has independent identity from parent
   - Reference might change or is optional
   - You don't need primary key cascade

3. **Choose thick lines when**:
   - Extending entities with optional information
   - Modeling workflow steps (one output per input)
   - Creating one-to-one relationships

### Interactive Tips

- **Hover over tables** to see complete definitions (works in Jupyter and SVG exports)
- **Hover over orange dots** to see projection expressions
- **Use `+` and `-` operators** to focus on specific parts of large schemas

## Summary

DataJoint diagrams are more than documentation—they are **live views** of your schema that:

- Reveal workflow structure through top-to-bottom layout
- Show relationship semantics through line styles
- Guide query design through primary key cascade visibility
- Stay synchronized because they're generated from the actual schema

The key insight: in DataJoint, diagrams and implementation are unified.
There's no separate design document that can drift out of sync—the diagram **is** the schema.