---
title: Indexes
authors:
  - name: Dimitri Yatsenko
date: 2024-10-22
---

# Indexes: Accelerating Data Lookups

As tables grow to thousands or millions of records, query performance becomes critical. **Indexes** are data structures that enable fast lookups by specific attributes: instead of scanning every row, the database goes almost directly to the matching records.

Think of an index like the index at the back of a textbook: instead of reading every page to find a topic, you look it up in the index and jump directly to the relevant pages. Database indexes work the same way—they create organized lookup structures that point directly to matching records.
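
To make the analogy concrete, here is a minimal pure-Python sketch, not how a database engine actually stores data. It assumes a hypothetical in-memory "table" of `(mouse_id, tag_id)` rows and uses a dictionary as the lookup structure. Real database indexes are typically B-trees rather than hash maps, but the principle is the same: pay once to build the structure, then skip the scan.

```python
import random

# Hypothetical in-memory "table": a list of (mouse_id, tag_id) rows.
random.seed(0)
rows = [(mouse_id, random.randint(1, 10**9)) for mouse_id in range(1, 10_001)]

def scan_lookup(rows, tag_id):
    """Full scan: compare every row against the requested tag."""
    return [row for row in rows if row[1] == tag_id]

# The "index": maps each tag_id to the positions of the rows that carry it.
tag_index = {}
for position, (_, tag_id) in enumerate(rows):
    tag_index.setdefault(tag_id, []).append(position)

def index_lookup(rows, index, tag_id):
    """Indexed lookup: jump straight to the matching rows."""
    return [rows[position] for position in index.get(tag_id, [])]

target = rows[1234][1]
print(scan_lookup(rows, target) == index_lookup(rows, tag_index, target))
```

Both functions return the same rows; the difference is that `scan_lookup` does work proportional to the table size, while `index_lookup` does work proportional to the number of matches.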

```{admonition} Learning Objectives
:class: note

By the end of this chapter, you will:
- Understand how indexes accelerate database queries
- Recognize the three mechanisms that create indexes in DataJoint
- Declare explicit secondary indexes for frequently queried attributes
- Understand composite index ordering and its impact on queries
- Know when to use regular vs. unique indexes
```

## Prerequisites

This chapter assumes familiarity with:
- [Primary Keys](020-primary-key.md) — Understanding unique entity identification
- [Foreign Keys](030-foreign-keys.ipynb) — Understanding table relationships
- [Create Tables](015-table.ipynb) — Basic table declaration syntax

## How Indexes Are Created in DataJoint

In DataJoint, indexes are created through three mechanisms:

| Mechanism | Index Type | Purpose |
|-----------|------------|--------|
| **Primary key** | Unique index (automatic) | Fast lookups by entity identifier |
| **Foreign key** | Secondary index (automatic) | Fast joins and referential integrity checks |
| **Explicit declaration** | Secondary index (manual) | Fast lookups by frequently queried attributes |

The first two mechanisms are **automatic**—every table has a primary key index, and foreign keys create indexes unless a suitable one already exists. The third mechanism gives you control over additional indexes for your specific query patterns.

## Demonstrating Index Performance

Let's create a table with many entries and measure the performance difference between indexed and non-indexed lookups.

In [None]:
import datajoint as dj
import random

schema = dj.Schema('indexes')

Consider a mouse tracking scenario where each mouse has a lab-specific ID (primary key) and a separate tag ID issued by the animal facility:

In [None]:
@schema
class Mouse(dj.Manual):
    definition = """
    mouse_id : int  # lab-specific ID
    ---
    tag_id : int  # animal facility ID
    """

In [None]:
def populate_mice(table, n=200_000):
    """Insert random mouse records for testing."""
    table.insert(
        ((random.randint(1, 1_000_000_000), random.randint(1, 1_000_000_000)) 
         for i in range(n)), 
        skip_duplicates=True
    )

populate_mice(Mouse())

In [None]:
Mouse()

### Primary Key Lookup (Fast)

Searching by `mouse_id` uses the primary key index—this is extremely fast:

In [None]:
%%timeit -n6 -r3

# Fast: Uses the primary key index
(Mouse() & {'mouse_id': random.randint(0, 999_999)}).fetch()

### Non-Indexed Lookup (Slow)

Searching by `tag_id` requires scanning every row in the table—this is slow:

In [None]:
%%timeit -n6 -r3

# Slow: Requires a full table scan
(Mouse() & {'tag_id': random.randint(0, 999_999)}).fetch()

```{admonition} Performance Impact
:class: important

The indexed search is often on the order of **100x faster** than the full table scan for a table of this size; the exact ratio depends on hardware, caching, and table size. The gap widens as the table grows: for tables with millions of records, unindexed searches can take seconds or minutes, while indexed searches remain nearly instantaneous.
```
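
Timing is one way to see the difference; asking the database for its query plan is another. The sketch below uses Python's built-in `sqlite3` module as a stand-in so that it runs without a database server. This is an assumption made for illustration only: DataJoint talks to a MySQL-compatible server, where the analogous statement is `EXPLAIN`, and the plan output looks different. The helper name `query_plan` is hypothetical.

```python
import sqlite3

# In-memory SQLite database standing in for the real server (illustration only).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mouse (mouse_id INTEGER PRIMARY KEY, tag_id INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO mouse VALUES (?, ?)",
    ((i, 1_000_000 - i) for i in range(1, 1001)),
)

def query_plan(sql):
    """Return the planner's description of how it would execute `sql`."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql)
    return " | ".join(row[-1] for row in rows)  # last column holds the detail text

pk_plan = query_plan("SELECT * FROM mouse WHERE mouse_id = 42")
tag_plan = query_plan("SELECT * FROM mouse WHERE tag_id = 42")
print(pk_plan)   # a SEARCH step: the row is located through the primary key
print(tag_plan)  # a SCAN step: every row is visited
```

The exact wording of the plan varies between SQLite versions, but the primary key lookup is reported as a search and the `tag_id` lookup as a scan.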

## Declaring Secondary Indexes

To speed up searches on non-primary-key attributes, you can declare **secondary indexes** explicitly in the table definition.

### Syntax

Indexes are declared below the `---` line in the table definition:

```
index(attr1, ..., attrN)           # Regular index
unique index(attr1, ..., attrN)    # Unique index (enforces uniqueness)
```

### Example: Adding a Unique Index

Since each mouse should have a unique `tag_id`, we can add a unique index:

In [None]:
@schema
class Mouse2(dj.Manual):
    definition = """
    mouse_id : int  # lab-specific ID
    ---
    tag_id : int  # animal facility ID
    unique index(tag_id)
    """

In [None]:
populate_mice(Mouse2())

Now both lookups use an index and are comparably fast:

In [None]:
%%timeit -n6 -r3

# Fast: Uses the primary key index
(Mouse2() & {'mouse_id': random.randint(0, 999_999)}).fetch()

In [None]:
%%timeit -n6 -r3

# Fast: Uses the secondary index on tag_id
(Mouse2() & {'tag_id': random.randint(0, 999_999)}).fetch()

```{admonition} Regular vs. Unique Index
:class: tip

- **Regular index** `index(attr)`: Speeds up lookups but allows duplicate values
- **Unique index** `unique index(attr)`: Speeds up lookups AND enforces that all values must be distinct

Use `unique index` when the attribute should be unique (like facility tag IDs), and regular `index` when duplicates are allowed (like dates or categories).
```
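
The difference between the two index types shows up at insert time. The following sketch again uses the built-in `sqlite3` module as a stand-in for the database server (an assumption for illustration; MySQL reports duplicates with its own error). The `cage` column is a hypothetical attribute added only to show a regular index that tolerates duplicates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mouse (mouse_id INTEGER PRIMARY KEY, tag_id INTEGER, cage INTEGER)"
)
conn.execute("CREATE UNIQUE INDEX idx_tag ON mouse (tag_id)")  # like `unique index(tag_id)`
conn.execute("CREATE INDEX idx_cage ON mouse (cage)")          # like `index(cage)`

conn.execute("INSERT INTO mouse VALUES (1, 500, 7)")
conn.execute("INSERT INTO mouse VALUES (2, 501, 7)")  # duplicate cage: accepted

try:
    conn.execute("INSERT INTO mouse VALUES (3, 500, 8)")  # duplicate tag_id
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True  # the unique index refuses the second tag 500

row_count = conn.execute("SELECT COUNT(*) FROM mouse").fetchone()[0]
print(duplicate_rejected, row_count)
```

Two mice may share a cage, but the third insert fails because tag `500` is already taken, so the table ends up with two rows.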

## Composite Index Ordering

When a primary key (or index) contains multiple attributes, the **order of the attributes matters**. The index can be used efficiently only when the query constrains the leftmost attribute, optionally followed by the next attributes in order.

This is analogous to searching in a dictionary that orders words alphabetically:
- Searching by the **first letters** is easy (use the index)
- Searching by the **last letters** requires scanning every word

Let's demonstrate with a multi-attribute primary key:

In [None]:
@schema
class Rat(dj.Manual):
    definition = """
    lab_name : char(16)           # name of the lab
    rat_id : int unsigned         # lab-specific rat ID
    ---
    date_of_birth = null : date   # birth date (optional)
    """

In [None]:
def populate_rats(table):
    """Insert random rat records for testing."""
    lab_names = ("Cajal", "Kandel", "Moser", "Wiesel")
    dates = (None, "2024-10-01", "2024-10-02", "2024-10-03", "2024-10-04")
    for date_of_birth in dates:
        table.insert(
            ((random.choice(lab_names), random.randint(1, 1_000_000_000), date_of_birth) 
             for i in range(100_000)), 
            skip_duplicates=True
        )

populate_rats(Rat())

In [None]:
Rat()

The primary key creates an index on `(lab_name, rat_id)`. This means:

| Query Pattern | Uses Index? | Performance |
|---------------|-------------|-------------|
| `lab_name` only | Yes | Fast |
| `lab_name` + `rat_id` | Yes | Fast |
| `rat_id` only | No | Slow (full scan) |

In [None]:
%%timeit -n2 -r10

# Fast: Uses the primary key index (both attributes)
(Rat() & {'rat_id': 300, 'lab_name': 'Cajal'}).fetch()

In [None]:
%%timeit -n2 -r10

# Slow: rat_id is not first in the index, requires full table scan
(Rat() & {'rat_id': 300}).fetch()

```{admonition} Composite Index Rule
:class: warning

A composite index on `(A, B, C)` can efficiently search for:
- `A` alone
- `A` and `B` together  
- `A`, `B`, and `C` together

But it **cannot** efficiently search for:
- `B` alone
- `C` alone
- `B` and `C` together (without `A`)

If you frequently search by these patterns, add explicit indexes.
```
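
The rule follows from how a sorted structure works. As a minimal sketch (pure Python, not the engine's actual B-tree), a composite index can be modeled as a sorted list of `(lab_name, rat_id)` keys. The sample keys and function names are hypothetical. Binary search works whenever the leftmost attribute is known, because the matches sit next to each other; without it, matches are scattered and every key must be checked.

```python
from bisect import bisect_left

# A composite index modeled as (lab_name, rat_id) keys kept in sorted order.
index = sorted([
    ("Cajal", 12), ("Cajal", 300), ("Kandel", 7),
    ("Kandel", 300), ("Moser", 45), ("Wiesel", 300),
])

def find_by_lab(lab_name):
    """Leftmost attribute only: binary search finds one contiguous run."""
    lo = bisect_left(index, (lab_name,))
    hi = bisect_left(index, (lab_name, float("inf")))
    return index[lo:hi]

def find_by_lab_and_rat(lab_name, rat_id):
    """Full key: binary search lands on the exact entry, if present."""
    key = (lab_name, rat_id)
    pos = bisect_left(index, key)
    return [key] if pos < len(index) and index[pos] == key else []

def find_by_rat(rat_id):
    """Non-leftmost attribute: matches are scattered, so check every key."""
    return [key for key in index if key[1] == rat_id]

print(find_by_lab("Cajal"))
print(find_by_lab_and_rat("Kandel", 300))
print(find_by_rat(300))
```

`find_by_lab` and `find_by_lab_and_rat` touch only a handful of entries regardless of the list size, while `find_by_rat` always walks the whole list.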

### Adding Indexes for Common Query Patterns

If we frequently need to search by `rat_id` alone or by `date_of_birth`, we should add explicit indexes:

In [None]:
@schema
class Rat2(dj.Manual):
    definition = """
    lab_name : char(16)           # name of the lab
    rat_id : int unsigned         # lab-specific rat ID
    ---
    date_of_birth = null : date   # birth date (optional)

    index(rat_id)                 # enables fast lookup by rat_id alone
    index(date_of_birth)          # enables fast lookup by date
    """

In [None]:
populate_rats(Rat2())

In [None]:
%%timeit -n3 -r6

# Fast: Uses the secondary index on rat_id
(Rat2() & {'rat_id': 300}).fetch()

In [None]:
%%timeit -n2 -r2

# Fast: Uses the secondary index on date_of_birth
len(Rat2 & 'date_of_birth = "2024-10-02"')

## String Pattern Matching and Indexes

Indexes on string columns follow similar rules. Pattern searches with `LIKE` can only use an index when the **starting characters** are specified:

In [None]:
%%timeit -n2 -r2

# Fast: Exact match uses the index
len(Rat & 'lab_name="Cajal"')

In [None]:
%%timeit -n2 -r2

# Slow: Wildcard at start prevents index use
len(Rat & 'lab_name LIKE "%jal"')

```{admonition} String Pattern Matching
:class: tip

- `LIKE "Caj%"` — **Can use index** (known prefix)
- `LIKE "%jal"` — **Cannot use index** (unknown prefix, requires full scan)
- `LIKE "%aja%"` — **Cannot use index** (unknown prefix)

Design your queries to search by prefix when possible.
```
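
The same reasoning explains the prefix rule. In this minimal sketch, the index on a string column is modeled as a sorted list of names (the extra name `"Cajun"` and the function names are hypothetical). All strings sharing a prefix form one contiguous range in sorted order, so binary search can find them; strings sharing a suffix can be anywhere. Note that this sketch compares strings case-sensitively, whereas `LIKE` in the database follows the column's collation.

```python
from bisect import bisect_left

# A string index modeled as names kept in sorted order.
lab_names = sorted(["Cajal", "Cajun", "Kandel", "Moser", "Wiesel"])

def starts_with(prefix):
    """Like `LIKE "prefix%"`: matches form one contiguous sorted range."""
    lo = bisect_left(lab_names, prefix)
    # Appending the largest code point gives an upper bound for the prefix range.
    hi = bisect_left(lab_names, prefix + "\U0010ffff")
    return lab_names[lo:hi]

def ends_with(suffix):
    """Like `LIKE "%suffix"`: sort order does not help, so scan every name."""
    return [name for name in lab_names if name.endswith(suffix)]

print(starts_with("Caj"))
print(ends_with("el"))
```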

## Viewing Table Indexes

Use the `describe()` method to see all indexes defined on a table:

In [None]:
Rat2.describe();

## Equivalent SQL Syntax

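For reference, here is how the same indexes are declared directly in SQL. The statements below use MySQL syntax; index syntax is not part of the SQL standard and varies between database systems.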

**(DataJoint)**
```python
@schema
class Mouse(dj.Manual):
    definition = """
    mouse_id : int
    ---
    tag_id : int
    unique index(tag_id)
    """
```

**(Equivalent SQL)**
```sql
CREATE TABLE mouse (
    mouse_id INT NOT NULL,
    tag_id INT NOT NULL,
    PRIMARY KEY (mouse_id),
    UNIQUE INDEX (tag_id)
);
```

You can also add indexes to existing tables in SQL:
```sql
-- Add a regular index
CREATE INDEX idx_date ON rat (date_of_birth);

-- Add a unique index
CREATE UNIQUE INDEX idx_tag ON mouse (tag_id);

-- Remove an index
DROP INDEX idx_tag ON mouse;
```

## Quiz

```{admonition} Question
:class: note

How many indexes does the table `Rat2` have? What are they?
```

In [None]:
# Check the table definition to see all indexes
Rat2.describe();

```{admonition} Answer
:class: tip, dropdown

**Three indexes:**
1. Primary key index on `(lab_name, rat_id)` — automatic
2. Secondary index on `rat_id` — explicit
3. Secondary index on `date_of_birth` — explicit
```

## Summary

Indexes are essential for query performance in tables with many records:

1. **Primary keys** automatically create unique indexes for fast entity lookups
2. **Foreign keys** automatically create secondary indexes for fast joins
3. **Explicit indexes** can be added for frequently queried non-key attributes
4. **Composite index order matters**: a query benefits only when it constrains a leftmost prefix of the indexed attributes
5. **Unique indexes** enforce uniqueness in addition to speeding up lookups

```{admonition} When to Add Indexes
:class: tip

Add secondary indexes when:
- You frequently query by a non-key attribute
- Queries on large tables are slow
- You need to enforce uniqueness on a non-primary-key attribute

Don't over-index: Each index adds overhead to insert/update operations and uses storage space. Only index attributes that are actually queried frequently.
```

```{admonition} Next Steps
:class: note

Now that you understand how to optimize queries with indexes, explore:
- [Queries](../50-queries/005-queries.ipynb) — Writing efficient database queries
- [Pipeline Projects](090-pipeline-project.md) — Designing complete data pipelines
```

In [None]:
# To re-run the notebook from scratch, uncomment the next line to drop the schema
# schema.drop()