# Queries

This tutorial covers how to query data in DataJoint. You'll learn:

- **Restriction** (`&`, `-`) — Filtering rows
- **Projection** (`.proj()`) — Selecting and computing columns
- **Join** (`*`) — Combining tables
- **Aggregation** (`.aggr()`) — Grouping and summarizing
- **Fetching** — Retrieving data in various formats

DataJoint queries are **lazy**—they build SQL expressions that execute only when you fetch data.
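
To make "lazy" concrete, here is a minimal plain-Python sketch of the idea. It is an analogy, not DataJoint's implementation: the `LazyQuery` class, its `fetch()` method, and the `evaluations` counter are hypothetical names used only for illustration. Restricting returns a new object that records the condition, and no rows are scanned until `fetch()` is called.

```python
class LazyQuery:
    """Hypothetical stand-in for a query expression (not a DataJoint class)."""

    def __init__(self, rows, predicates=()):
        self._rows = rows
        self._predicates = tuple(predicates)
        self.evaluations = 0  # how many times this query actually scanned rows

    def __and__(self, predicate):
        # Restriction returns a NEW query; the original is left unchanged.
        return LazyQuery(self._rows, self._predicates + (predicate,))

    def fetch(self):
        # Only here is the data source touched.
        self.evaluations += 1
        return [r for r in self._rows if all(p(r) for p in self._predicates)]


rows = [
    {"subject_id": "M001", "sex": "M", "weight": 25.3},
    {"subject_id": "M002", "sex": "F", "weight": 22.1},
    {"subject_id": "M003", "sex": "M", "weight": 26.8},
]

base = LazyQuery(rows)
heavy_males = base & (lambda r: r["sex"] == "M") & (lambda r: r["weight"] > 26)

# Building the expression scanned nothing; fetch() does the work.
built_without_scanning = heavy_males.evaluations == 0
result = heavy_males.fetch()
print([r["subject_id"] for r in result])  # ['M003']
```

In DataJoint the recorded expression is SQL rather than Python callables, but the shape is the same: compose first, fetch last.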

In [1]:
import datajoint as dj
import numpy as np

schema = dj.Schema('tutorial_queries')

[2026-01-06 15:36:20,742][INFO]: DataJoint 2.0.0a15 connected to root@mysql:3306


In [2]:
# Define tables for this tutorial
@schema
class Subject(dj.Manual):
    definition = """
    subject_id : varchar(16)
    ---
    species : varchar(50)
    date_of_birth : date
    sex : enum('M', 'F', 'U')
    weight : float32             # grams
    """

@schema
class Experimenter(dj.Manual):
    definition = """
    experimenter_id : varchar(16)
    ---
    full_name : varchar(100)
    """

@schema
class Session(dj.Manual):
    definition = """
    -> Subject
    session_idx : uint16
    ---
    -> Experimenter
    session_date : date
    duration : float32           # minutes
    """

    class Trial(dj.Part):
        definition = """
        -> master
        trial_idx : uint16
        ---
        stimulus : varchar(50)
        response : varchar(50)
        correct : bool
        reaction_time : float32  # seconds
        """

In [3]:
# Insert sample data
import random
random.seed(42)

Experimenter.insert([
    {'experimenter_id': 'alice', 'full_name': 'Alice Smith'},
    {'experimenter_id': 'bob', 'full_name': 'Bob Jones'},
])

subjects = [
    {'subject_id': 'M001', 'species': 'Mus musculus', 'date_of_birth': '2026-01-15', 'sex': 'M', 'weight': 25.3},
    {'subject_id': 'M002', 'species': 'Mus musculus', 'date_of_birth': '2026-02-01', 'sex': 'F', 'weight': 22.1},
    {'subject_id': 'M003', 'species': 'Mus musculus', 'date_of_birth': '2026-02-15', 'sex': 'M', 'weight': 26.8},
    {'subject_id': 'R001', 'species': 'Rattus norvegicus', 'date_of_birth': '2024-01-01', 'sex': 'F', 'weight': 280.5},
]
Subject.insert(subjects)

# Insert sessions
sessions = [
    {'subject_id': 'M001', 'session_idx': 1, 'experimenter_id': 'alice', 'session_date': '2026-01-06', 'duration': 45.0},
    {'subject_id': 'M001', 'session_idx': 2, 'experimenter_id': 'alice', 'session_date': '2026-01-07', 'duration': 50.0},
    {'subject_id': 'M002', 'session_idx': 1, 'experimenter_id': 'bob', 'session_date': '2026-01-06', 'duration': 40.0},
    {'subject_id': 'M002', 'session_idx': 2, 'experimenter_id': 'bob', 'session_date': '2026-01-08', 'duration': 55.0},
    {'subject_id': 'M003', 'session_idx': 1, 'experimenter_id': 'alice', 'session_date': '2026-01-07', 'duration': 35.0},
]
Session.insert(sessions)

# Insert trials
trials = []
for s in sessions:
    for i in range(10):
        trials.append({
            'subject_id': s['subject_id'],
            'session_idx': s['session_idx'],
            'trial_idx': i + 1,
            'stimulus': random.choice(['left', 'right']),
            'response': random.choice(['left', 'right']),
            'correct': random.random() > 0.3,
            'reaction_time': random.uniform(0.2, 0.8)
        })
Session.Trial.insert(trials)

print(f"Subjects: {len(Subject())}, Sessions: {len(Session())}, Trials: {len(Session.Trial())}")

Subjects: 4, Sessions: 5, Trials: 50


## Restriction (`&` and `-`)

Restriction filters rows based on conditions. Use `&` to select matching rows, `-` to exclude them.

### String Conditions

String conditions are SQL expressions written with attribute names:

In [4]:
# Simple comparison
Subject & "weight > 25"

subject_id,species,date_of_birth,sex,weight  grams
M001,Mus musculus,2026-01-15,M,25.3
M003,Mus musculus,2026-02-15,M,26.8
R001,Rattus norvegicus,2024-01-01,F,280.5


In [5]:
# Date comparison
Session & "session_date > '2026-01-06'"

subject_id,session_idx,experimenter_id,session_date,duration  minutes
M001,2,alice,2026-01-07,50.0
M002,2,bob,2026-01-08,55.0
M003,1,alice,2026-01-07,35.0


In [6]:
# Multiple conditions with AND
Subject & "sex = 'M' AND weight > 25"

subject_id,species,date_of_birth,sex,weight  grams
M001,Mus musculus,2026-01-15,M,25.3
M003,Mus musculus,2026-02-15,M,26.8


### Dictionary Conditions

Dictionaries specify exact matches:

In [7]:
# Single attribute
Subject & {'sex': 'F'}

subject_id,species,date_of_birth,sex,weight  grams
M002,Mus musculus,2026-02-01,F,22.1
R001,Rattus norvegicus,2024-01-01,F,280.5


In [8]:
# Multiple attributes (AND)
Session & {'subject_id': 'M001', 'session_idx': 1}

subject_id,session_idx,experimenter_id,session_date,duration  minutes
M001,1,alice,2026-01-06,45.0


### Expression Conditions (Semijoin)

Restrict to rows with matching keys in another table:

In [9]:
# Subjects that have at least one session
Subject & Session

subject_id,species,date_of_birth,sex,weight  grams
M001,Mus musculus,2026-01-15,M,25.3
M002,Mus musculus,2026-02-01,F,22.1
M003,Mus musculus,2026-02-15,M,26.8


In [10]:
# Subjects without any sessions (R001 has no sessions)
Subject - Session

subject_id,species,date_of_birth,sex,weight  grams
R001,Rattus norvegicus,2024-01-01,F,280.5


### Collection Conditions (OR)

Lists create OR conditions:

In [11]:
# Either of these subjects
Subject & [{'subject_id': 'M001'}, {'subject_id': 'M002'}]

subject_id,species,date_of_birth,sex,weight  grams
M001,Mus musculus,2026-01-15,M,25.3
M002,Mus musculus,2026-02-01,F,22.1


### Chaining Restrictions

Sequential restrictions combine with AND:

In [12]:
# These are equivalent
result1 = Subject & "sex = 'M'" & "weight > 25"
result2 = (Subject & "sex = 'M'") & "weight > 25"

print(f"Result 1: {len(result1)} rows")
print(f"Result 2: {len(result2)} rows")

Result 1: 2 rows
Result 2: 2 rows
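
The dictionary, list, and table-based conditions above can be modeled on plain lists of dicts. The sketch below is a simplified analogy, not DataJoint's implementation; `matches`, `restrict`, `antirestrict`, and `ids` are hypothetical helpers, and `session_keys` stands in for the `Session` table reduced to the attribute it shares with `Subject`.

```python
subjects = [
    {"subject_id": "M001", "species": "Mus musculus", "sex": "M", "weight": 25.3},
    {"subject_id": "M002", "species": "Mus musculus", "sex": "F", "weight": 22.1},
    {"subject_id": "M003", "species": "Mus musculus", "sex": "M", "weight": 26.8},
    {"subject_id": "R001", "species": "Rattus norvegicus", "sex": "F", "weight": 280.5},
]
# Stand-in for Session: one entry per session, shared attribute only.
session_keys = [{"subject_id": s} for s in ("M001", "M001", "M002", "M002", "M003")]


def matches(row, cond):
    """Return True if `row` satisfies `cond` (a dict or a list of conditions)."""
    if isinstance(cond, dict):
        # Dictionary: every listed attribute must match exactly (AND).
        return all(row.get(k) == v for k, v in cond.items())
    # List: the row may satisfy any one of the conditions (OR).
    return any(matches(row, c) for c in cond)


def restrict(rows, cond):
    """Analogue of `rows & cond`."""
    return [r for r in rows if matches(r, cond)]


def antirestrict(rows, cond):
    """Analogue of `rows - cond`."""
    return [r for r in rows if not matches(r, cond)]


def ids(rows):
    return [r["subject_id"] for r in rows]


print(ids(restrict(subjects, {"sex": "F"})))         # ['M002', 'R001']
print(ids(restrict(subjects, [{"subject_id": "M001"}, {"subject_id": "M002"}])))
print(ids(restrict(subjects, session_keys)))         # subjects with a session
print(ids(antirestrict(subjects, session_keys)))     # ['R001']
```

Chaining two calls to `restrict` narrows the result twice, which is why sequential restrictions behave like AND.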


## Projection (`.proj()`)

Projection selects, renames, or computes attributes.

### Selecting Attributes

In [13]:
# Primary key only (no arguments)
Subject.proj()

subject_id
M001
M002
M003
R001


In [14]:
# Primary key + specific attributes
Subject.proj('species', 'sex')

subject_id,species,sex
M001,Mus musculus,M
M002,Mus musculus,F
M003,Mus musculus,M
R001,Rattus norvegicus,F


In [15]:
# All attributes (using ellipsis)
Subject.proj(...)

subject_id,species,date_of_birth,sex,weight  grams
M001,Mus musculus,2026-01-15,M,25.3
M002,Mus musculus,2026-02-01,F,22.1
M003,Mus musculus,2026-02-15,M,26.8
R001,Rattus norvegicus,2024-01-01,F,280.5


In [16]:
# All except specific attributes
Subject.proj(..., '-weight')

subject_id,species,date_of_birth,sex
M001,Mus musculus,2026-01-15,M
M002,Mus musculus,2026-02-01,F
M003,Mus musculus,2026-02-15,M
R001,Rattus norvegicus,2024-01-01,F


### Renaming Attributes

In [17]:
# Rename 'species' to 'animal_species'
Subject.proj(animal_species='species')

subject_id,animal_species
M001,Mus musculus
M002,Mus musculus
M003,Mus musculus
R001,Rattus norvegicus


### Computed Attributes

In [18]:
# Arithmetic computation
Subject.proj('species', weight_kg='weight / 1000')

subject_id,species,weight_kg  calculated attribute
M001,Mus musculus,0.0252999992370605
M002,Mus musculus,0.0221000003814697
M003,Mus musculus,0.0267999992370605
R001,Rattus norvegicus,0.2805


In [19]:
# Date functions
Session.proj('session_date', year='YEAR(session_date)', month='MONTH(session_date)')

subject_id,session_idx,session_date,year  calculated attribute,month  calculated attribute
M001,1,2026-01-06,2026,1
M001,2,2026-01-07,2026,1
M002,1,2026-01-06,2026,1
M002,2,2026-01-08,2026,1
M003,1,2026-01-07,2026,1
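
The three uses of projection (select, rename, compute) can be sketched on plain dicts. This is an illustration under stated assumptions, not DataJoint's implementation: `proj` here is a hypothetical function, `PRIMARY_KEY` is hard-coded, and computed attributes are given as Python callables, whereas the tutorial cells above pass SQL expression strings.

```python
PRIMARY_KEY = ("subject_id",)  # assumed key for this sketch

subjects = [
    {"subject_id": "M001", "species": "Mus musculus", "weight": 25.3},
    {"subject_id": "R001", "species": "Rattus norvegicus", "weight": 280.5},
]


def proj(rows, *keep, **named):
    """Keep the primary key plus `keep`; add renamed or computed attributes.

    In `named`, a string value renames an existing attribute and a callable
    computes a new one from the row.
    """
    out = []
    for r in rows:
        new = {k: r[k] for k in PRIMARY_KEY}      # primary key is always kept
        new.update({k: r[k] for k in keep})       # explicitly selected attributes
        for name, spec in named.items():
            new[name] = spec(r) if callable(spec) else r[spec]
        out.append(new)
    return out


key_only = proj(subjects)
renamed = proj(subjects, animal_species="species")
computed = proj(subjects, "species", weight_kg=lambda r: r["weight"] / 1000)

print(key_only)
print(renamed)
print(computed)
```

Note that the primary key survives every projection in this sketch, matching the outputs above where `subject_id` (and `session_idx` for `Session`) always appears.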


## Join (`*`)

Join combines rows from two tables that agree on their shared attributes, typically those linked by foreign keys.

In [20]:
# Join Subject and Session on subject_id
Subject * Session

subject_id,session_idx,experimenter_id,session_date,duration  minutes,species,date_of_birth,sex,weight  grams
M001,1,alice,2026-01-06,45.0,Mus musculus,2026-01-15,M,25.3
M001,2,alice,2026-01-07,50.0,Mus musculus,2026-01-15,M,25.3
M002,1,bob,2026-01-06,40.0,Mus musculus,2026-02-01,F,22.1
M002,2,bob,2026-01-08,55.0,Mus musculus,2026-02-01,F,22.1
M003,1,alice,2026-01-07,35.0,Mus musculus,2026-02-15,M,26.8


In [21]:
# Join then restrict
(Subject * Session) & "sex = 'M'"

subject_id,session_idx,experimenter_id,session_date,duration  minutes,species,date_of_birth,sex,weight  grams
M001,1,alice,2026-01-06,45.0,Mus musculus,2026-01-15,M,25.3
M001,2,alice,2026-01-07,50.0,Mus musculus,2026-01-15,M,25.3
M003,1,alice,2026-01-07,35.0,Mus musculus,2026-02-15,M,26.8


In [22]:
# Restrict then join (equivalent result)
(Subject & "sex = 'M'") * Session

subject_id,session_idx,experimenter_id,session_date,duration  minutes,species,date_of_birth,sex,weight  grams
M001,1,alice,2026-01-06,45.0,Mus musculus,2026-01-15,M,25.3
M001,2,alice,2026-01-07,50.0,Mus musculus,2026-01-15,M,25.3
M003,1,alice,2026-01-07,35.0,Mus musculus,2026-02-15,M,26.8


In [23]:
# Three-way join
(Subject * Session * Experimenter).proj('species', 'session_date', 'full_name')

subject_id,session_idx,session_date,species,full_name
M001,1,2026-01-06,Mus musculus,Alice Smith
M001,2,2026-01-07,Mus musculus,Alice Smith
M002,1,2026-01-06,Mus musculus,Bob Jones
M002,2,2026-01-08,Mus musculus,Bob Jones
M003,1,2026-01-07,Mus musculus,Alice Smith
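
Conceptually, the joins above pair up rows that agree on every attribute the two tables have in common. The following plain-Python sketch illustrates that matching rule on trimmed copies of the tutorial data. It is an analogy only: `natural_join` is a hypothetical helper, and it assumes that any two attributes with the same name are meant to match.

```python
subjects = [
    {"subject_id": "M001", "sex": "M"},
    {"subject_id": "M002", "sex": "F"},
    {"subject_id": "M003", "sex": "M"},
    {"subject_id": "R001", "sex": "F"},
]
sessions = [
    {"subject_id": "M001", "session_idx": 1, "experimenter_id": "alice"},
    {"subject_id": "M001", "session_idx": 2, "experimenter_id": "alice"},
    {"subject_id": "M002", "session_idx": 1, "experimenter_id": "bob"},
    {"subject_id": "M002", "session_idx": 2, "experimenter_id": "bob"},
    {"subject_id": "M003", "session_idx": 1, "experimenter_id": "alice"},
]
experimenters = [
    {"experimenter_id": "alice", "full_name": "Alice Smith"},
    {"experimenter_id": "bob", "full_name": "Bob Jones"},
]


def natural_join(left, right):
    """Pair rows that agree on every attribute name the two inputs share."""
    joined = []
    for l in left:
        for r in right:
            shared = l.keys() & r.keys()
            if all(l[k] == r[k] for k in shared):
                joined.append({**l, **r})
    return joined


subject_sessions = natural_join(subjects, sessions)
three_way = natural_join(subject_sessions, experimenters)

print(len(subject_sessions))          # 5
print(three_way[0]["full_name"])      # Alice Smith
```

R001 has no sessions, so it contributes no rows to the join, just as in the output of `Subject * Session` above.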


## Aggregation (`.aggr()`)

Aggregation groups rows and computes summary statistics.

In [24]:
# Count trials per session
Session.aggr(Session.Trial, n_trials='count(*)')

subject_id,session_idx,n_trials  calculated attribute
M001,1,10
M001,2,10
M002,1,10
M002,2,10
M003,1,10


In [25]:
# Multiple aggregates
Session.aggr(
    Session.Trial,
    n_trials='count(*)',
    n_correct='sum(correct)',
    avg_rt='avg(reaction_time)'
)

subject_id,session_idx,n_trials  calculated attribute,n_correct  calculated attribute,avg_rt  calculated attribute
M001,1,10,8,0.5068969577550888
M001,2,10,9,0.4596437990665435
M002,1,10,7,0.4584552228450775
M002,2,10,6,0.5038441717624664
M003,1,10,6,0.5117030084133148


In [26]:
# Count sessions per subject
Subject.aggr(Session, n_sessions='count(*)')

subject_id,n_sessions  calculated attribute
M001,2
M002,2
M003,1
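
Aggregation can be pictured as "group the rows of the second table by the key of the first, then reduce each group". Below is a minimal plain-Python sketch of that idea. The `aggr` function is a hypothetical helper, and the trial values are made up for illustration rather than taken from the randomly generated tutorial data.

```python
from collections import defaultdict

# Illustrative trials (values invented for this sketch).
trials = [
    {"subject_id": "M001", "session_idx": 1, "correct": True, "reaction_time": 0.5},
    {"subject_id": "M001", "session_idx": 1, "correct": False, "reaction_time": 0.25},
    {"subject_id": "M001", "session_idx": 1, "correct": True, "reaction_time": 0.75},
    {"subject_id": "M002", "session_idx": 1, "correct": True, "reaction_time": 0.25},
    {"subject_id": "M002", "session_idx": 1, "correct": False, "reaction_time": 0.75},
]


def aggr(group_keys, rows, **aggregates):
    """Group `rows` by `group_keys` and apply each aggregate to every group."""
    groups = defaultdict(list)
    for r in rows:
        groups[tuple(r[k] for k in group_keys)].append(r)
    return [
        {**dict(zip(group_keys, key)),
         **{name: fn(members) for name, fn in aggregates.items()}}
        for key, members in sorted(groups.items())
    ]


summary = aggr(
    ("subject_id", "session_idx"),
    trials,
    n_trials=len,                                            # like count(*)
    n_correct=lambda m: sum(r["correct"] for r in m),        # like sum(correct)
    avg_rt=lambda m: sum(r["reaction_time"] for r in m) / len(m),  # like avg(...)
)
for row in summary:
    print(row)
```

In this sketch a group with no rows never appears in the result, which mirrors the output above where R001 (no sessions) is absent from the per-subject counts.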


### Universal Set (`dj.U()`)

Use `dj.U()` for global aggregation or grouping by non-primary-key attributes:

In [27]:
# Global count (no grouping)
dj.U().aggr(Session, total_sessions='count(*)')

total_sessions  calculated attribute
5


In [28]:
# Group by experimenter (not in Session's primary key)
dj.U('experimenter_id').aggr(Session, n_sessions='count(*)')

experimenter_id,n_sessions  calculated attribute
alice,3
bob,2


In [29]:
# Unique values
dj.U('species') & Subject

species
Mus musculus
Rattus norvegicus
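
For readers who think in SQL, the three `dj.U()` queries above correspond roughly to a global `COUNT(*)`, a `GROUP BY` on a non-key column, and a `SELECT DISTINCT`. The sketch below runs those statements against an in-memory SQLite database using Python's standard `sqlite3` module. It is an analogy only: the simplified table layout is an assumption, and these are not the statements DataJoint itself generates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE session ("
    "subject_id TEXT, session_idx INTEGER, experimenter_id TEXT, "
    "PRIMARY KEY (subject_id, session_idx))"
)
conn.executemany(
    "INSERT INTO session VALUES (?, ?, ?)",
    [
        ("M001", 1, "alice"),
        ("M001", 2, "alice"),
        ("M002", 1, "bob"),
        ("M002", 2, "bob"),
        ("M003", 1, "alice"),
    ],
)
conn.execute("CREATE TABLE subject (subject_id TEXT PRIMARY KEY, species TEXT)")
conn.executemany(
    "INSERT INTO subject VALUES (?, ?)",
    [
        ("M001", "Mus musculus"),
        ("M002", "Mus musculus"),
        ("M003", "Mus musculus"),
        ("R001", "Rattus norvegicus"),
    ],
)

# Global count, no grouping.
total_sessions = conn.execute("SELECT COUNT(*) FROM session").fetchone()[0]

# Group by an attribute that is not part of the primary key.
per_experimenter = conn.execute(
    "SELECT experimenter_id, COUNT(*) FROM session "
    "GROUP BY experimenter_id ORDER BY experimenter_id"
).fetchall()

# Unique values of one attribute.
species = [
    row[0]
    for row in conn.execute("SELECT DISTINCT species FROM subject ORDER BY species")
]

print(total_sessions)     # 5
print(per_experimenter)   # [('alice', 3), ('bob', 2)]
print(species)            # ['Mus musculus', 'Rattus norvegicus']
conn.close()
```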


## Fetching Data

DataJoint 2.0 provides explicit methods for different output formats.

### `to_dicts()` — List of Dictionaries

In [30]:
# Get all rows as list of dicts
rows = Subject.to_dicts()
rows[:2]

[{'subject_id': 'M001',
  'species': 'Mus musculus',
  'date_of_birth': datetime.date(2026, 1, 15),
  'sex': 'M',
  'weight': 25.3},
 {'subject_id': 'M002',
  'species': 'Mus musculus',
  'date_of_birth': datetime.date(2026, 2, 1),
  'sex': 'F',
  'weight': 22.1}]

### `to_pandas()` — DataFrame

In [31]:
# Get as pandas DataFrame (primary key as index)
df = Subject.to_pandas()
df

subject_id,species,date_of_birth,sex,weight
M001,Mus musculus,2026-01-15,M,25.3
M002,Mus musculus,2026-02-01,F,22.1
M003,Mus musculus,2026-02-15,M,26.8
R001,Rattus norvegicus,2024-01-01,F,280.5


### `to_arrays()` — NumPy Arrays

In [32]:
# Structured array (all columns)
arr = Subject.to_arrays()
arr

array([('M001', 'Mus musculus', datetime.date(2026, 1, 15), 'M',  25.3),
       ('M002', 'Mus musculus', datetime.date(2026, 2, 1), 'F',  22.1),
       ('M003', 'Mus musculus', datetime.date(2026, 2, 15), 'M',  26.8),
       ('R001', 'Rattus norvegicus', datetime.date(2024, 1, 1), 'F', 280.5)],
      dtype=[('subject_id', 'O'), ('species', 'O'), ('date_of_birth', 'O'), ('sex', 'O'), ('weight', '<f8')])

In [33]:
# Specific columns as separate arrays
species, weights = Subject.to_arrays('species', 'weight')
print(f"Species: {species}")
print(f"Weights: {weights}")

Species: ['Mus musculus' 'Mus musculus' 'Mus musculus' 'Rattus norvegicus']
Weights: [ 25.3  22.1  26.8 280.5]


### `keys()` — Primary Keys

In [34]:
# Get primary keys for iteration
keys = Session.keys()
keys[:3]

[{'subject_id': 'M001', 'session_idx': 1},
 {'subject_id': 'M001', 'session_idx': 2},
 {'subject_id': 'M003', 'session_idx': 1}]

### `fetch1()` — Single Row

In [35]:
# Fetch one row (raises error if not exactly 1)
row = (Subject & {'subject_id': 'M001'}).fetch1()
row

{'subject_id': 'M001',
 'species': 'Mus musculus',
 'date_of_birth': datetime.date(2026, 1, 15),
 'sex': 'M',
 'weight': 25.3}

In [36]:
# Fetch specific attributes from one row
species, weight = (Subject & {'subject_id': 'M001'}).fetch1('species', 'weight')
print(f"{species}: {weight}g")

Mus musculus: 25.3g
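
The "exactly one row" contract of `fetch1()` is worth internalizing, because it turns a silently wrong result into an immediate error. The sketch below models that contract on a plain list. `fetch_exactly_one` is a hypothetical helper, and raising `ValueError` is a choice made for this sketch; it does not claim to reproduce the exception DataJoint raises.

```python
def fetch_exactly_one(rows):
    """Return the only row, or raise if there are zero or several rows."""
    if len(rows) != 1:
        raise ValueError(f"expected exactly 1 row, found {len(rows)}")
    return rows[0]


subjects = [
    {"subject_id": "M001", "sex": "M"},
    {"subject_id": "M002", "sex": "F"},
    {"subject_id": "M003", "sex": "M"},
]

# A restriction on the full primary key yields one row, so this succeeds.
one = fetch_exactly_one([r for r in subjects if r["subject_id"] == "M001"])
print(one)

# A restriction that matches several rows fails loudly instead of picking one.
try:
    fetch_exactly_one([r for r in subjects if r["sex"] == "M"])
    outcome = "returned"
except ValueError as err:
    outcome = str(err)
print(outcome)  # expected exactly 1 row, found 2
```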


### Ordering and Limiting

In [37]:
# Sort by weight descending, get top 2
Subject.to_dicts(order_by='weight DESC', limit=2)

[{'subject_id': 'R001',
  'species': 'Rattus norvegicus',
  'date_of_birth': datetime.date(2024, 1, 1),
  'sex': 'F',
  'weight': 280.5},
 {'subject_id': 'M003',
  'species': 'Mus musculus',
  'date_of_birth': datetime.date(2026, 2, 15),
  'sex': 'M',
  'weight': 26.8}]

In [38]:
# Sort by primary key
Subject.to_dicts(order_by='KEY')

[{'subject_id': 'M001',
  'species': 'Mus musculus',
  'date_of_birth': datetime.date(2026, 1, 15),
  'sex': 'M',
  'weight': 25.3},
 {'subject_id': 'M002',
  'species': 'Mus musculus',
  'date_of_birth': datetime.date(2026, 2, 1),
  'sex': 'F',
  'weight': 22.1},
 {'subject_id': 'M003',
  'species': 'Mus musculus',
  'date_of_birth': datetime.date(2026, 2, 15),
  'sex': 'M',
  'weight': 26.8},
 {'subject_id': 'R001',
  'species': 'Rattus norvegicus',
  'date_of_birth': datetime.date(2024, 1, 1),
  'sex': 'F',
  'weight': 280.5}]

### Lazy Iteration

Iterating directly over a table streams rows efficiently:

In [39]:
# Stream rows (single database cursor)
for row in Subject:
    print(f"{row['subject_id']}: {row['species']}")

M001: Mus musculus
M002: Mus musculus
M003: Mus musculus
R001: Rattus norvegicus


## Query Composition

Queries are composable and immutable. Build complex queries step by step:

In [40]:
# Build a complex query step by step
male_mice = Subject & "sex = 'M'" & "species LIKE '%musculus%'"
sessions_with_subject = male_mice * Session
alice_sessions = sessions_with_subject & {'experimenter_id': 'alice'}
result = alice_sessions.proj('session_date', 'duration', 'weight')

result

subject_id,session_idx,session_date,duration  minutes,weight  grams
M001,1,2026-01-06,45.0,25.3
M001,2,2026-01-07,50.0,25.3
M003,1,2026-01-07,35.0,26.8


In [41]:
# Or as a single expression
((Subject & "sex = 'M'" & "species LIKE '%musculus%'") 
 * Session 
 & {'experimenter_id': 'alice'}
).proj('session_date', 'duration', 'weight')

subject_id,session_idx,session_date,duration  minutes,weight  grams
M001,1,2026-01-06,45.0,25.3
M001,2,2026-01-07,50.0,25.3
M003,1,2026-01-07,35.0,26.8


## Operator Precedence

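Because DataJoint's operators are overloaded Python operators, Python's own precedence rules decide how an unparenthesized expression is grouped: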

1. `*` (join) — highest
2. `+`, `-` (union, anti-restriction)
3. `&` (restriction) — lowest

Use parentheses for clarity:

In [42]:
# Without parentheses: join happens first
# Subject * Session & condition  means  (Subject * Session) & condition

# With parentheses: explicit order
result1 = (Subject & "sex = 'M'") * Session      # Restrict Subject, then join
result2 = Subject * (Session & "duration > 40")  # Restrict Session, then join

print(f"Result 1: {len(result1)} rows")
print(f"Result 2: {len(result2)} rows")

Result 1: 3 rows
Result 2: 3 rows
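
To see how Python groups an unparenthesized expression, it helps to record the grouping explicitly. The sketch below uses a tiny hypothetical `Expr` class (unrelated to DataJoint) that overloads `*`, `-`, and `&` and prints the parenthesization Python actually applies.

```python
class Expr:
    """Hypothetical expression recorder; only the operator overloading matters."""

    def __init__(self, text):
        self.text = text

    def _combine(self, op, other):
        # Wrap every binary operation in parentheses to expose the grouping.
        return Expr(f"({self.text} {op} {other.text})")

    def __mul__(self, other):
        return self._combine("*", other)

    def __sub__(self, other):
        return self._combine("-", other)

    def __and__(self, other):
        return self._combine("&", other)

    def __repr__(self):
        return self.text


A, B, C = Expr("A"), Expr("B"), Expr("C")

print(A * B & C)  # ((A * B) & C)
print(A & B * C)  # (A & (B * C))
print(A - B & C)  # ((A - B) & C)
print(A & B - C)  # (A & (B - C))
```

The last line is the one most likely to surprise: `-` binds tighter than `&`, so a restriction followed by an anti-restriction needs parentheses around the restriction if it is meant to apply first.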


## Quick Reference

### Operators

| Operation | Syntax | Description |
|-----------|--------|-------------|
| Restrict | `A & cond` | Select matching rows |
| Anti-restrict | `A - cond` | Select non-matching rows |
| Project | `A.proj(...)` | Select/compute columns |
| Join | `A * B` | Combine tables |
| Aggregate | `A.aggr(B, ...)` | Group and summarize |
| Union | `A + B` | Combine entity sets |

### Fetch Methods

| Method | Returns | Use Case |
|--------|---------|----------|
| `to_dicts()` | `list[dict]` | JSON, iteration |
| `to_pandas()` | `DataFrame` | Data analysis |
| `to_arrays()` | `np.ndarray` | Numeric computation |
| `to_arrays('a', 'b')` | `tuple[array, ...]` | Specific columns |
| `keys()` | `list[dict]` | Primary keys |
| `fetch1()` | `dict` | Single row |

See the [Query Algebra Specification](../reference/specs/query-algebra.md) and [Fetch API](../reference/specs/fetch-api.md) for complete details.

## Next Steps

- [Computation](05-computation.ipynb) — Building computational pipelines

In [43]:
# Cleanup
schema.drop(force=True)