# Queries

This tutorial covers how to query data in DataJoint. You'll learn:

- **Restriction** (`&`, `-`) — Filtering rows
- **Projection** (`.proj()`) — Selecting and computing columns
- **Join** (`*`) — Combining tables
- **Aggregation** (`.aggr()`) — Grouping and summarizing
- **Fetching** — Retrieving data in various formats

DataJoint queries are **lazy**—they build SQL expressions that execute only when you fetch data.
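
To make the idea of laziness concrete, here is a minimal plain-Python sketch. It is not DataJoint's implementation: the `LazyQuery` class and its `render_sql` method are hypothetical names invented for illustration. The sketch only shows that combining conditions produces new objects, and that SQL text is produced only when explicitly requested.

```python
class LazyQuery:
    """Hypothetical stand-in for a query expression (not DataJoint's class)."""

    def __init__(self, table, conditions=()):
        self.table = table
        self.conditions = tuple(conditions)

    def __and__(self, condition):
        # Restriction returns a new object; no SQL runs at this point.
        return LazyQuery(self.table, self.conditions + (condition,))

    def render_sql(self):
        # Only here is the accumulated expression turned into SQL text.
        sql = f"SELECT * FROM {self.table}"
        if self.conditions:
            sql += " WHERE " + " AND ".join(f"({c})" for c in self.conditions)
        return sql


base = LazyQuery("subject")
query = base & "sex = 'M'" & "weight > 25"

print(query.render_sql())  # SELECT * FROM subject WHERE (sex = 'M') AND (weight > 25)
print(base.render_sql())   # SELECT * FROM subject
```

The original `base` object is unchanged after the restrictions, which is the same property that lets real query expressions be reused safely.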

In [None]:
import datajoint as dj
import numpy as np

schema = dj.Schema('tutorial_queries')

In [None]:
# Define tables for this tutorial
@schema
class Subject(dj.Manual):
    definition = """
    subject_id : varchar(16)
    ---
    species : varchar(50)
    date_of_birth : date
    sex : enum('M', 'F', 'U')
    weight : float32             # grams
    """

@schema
class Experimenter(dj.Manual):
    definition = """
    experimenter_id : varchar(16)
    ---
    full_name : varchar(100)
    """

@schema
class Session(dj.Manual):
    definition = """
    -> Subject
    session_idx : uint16
    ---
    -> Experimenter
    session_date : date
    duration : float32           # minutes
    """

    class Trial(dj.Part):
        definition = """
        -> master
        trial_idx : uint16
        ---
        stimulus : varchar(50)
        response : varchar(50)
        correct : bool
        reaction_time : float32  # seconds
        """

In [None]:
# Insert sample data
import random
random.seed(42)

Experimenter.insert([
    {'experimenter_id': 'alice', 'full_name': 'Alice Smith'},
    {'experimenter_id': 'bob', 'full_name': 'Bob Jones'},
])

subjects = [
    {'subject_id': 'M001', 'species': 'Mus musculus', 'date_of_birth': '2025-01-15', 'sex': 'M', 'weight': 25.3},
    {'subject_id': 'M002', 'species': 'Mus musculus', 'date_of_birth': '2025-02-01', 'sex': 'F', 'weight': 22.1},
    {'subject_id': 'M003', 'species': 'Mus musculus', 'date_of_birth': '2025-02-15', 'sex': 'M', 'weight': 26.8},
    {'subject_id': 'R001', 'species': 'Rattus norvegicus', 'date_of_birth': '2024-01-01', 'sex': 'F', 'weight': 280.5},
]
Subject.insert(subjects)

# Insert sessions
sessions = [
    {'subject_id': 'M001', 'session_idx': 1, 'experimenter_id': 'alice', 'session_date': '2026-01-06', 'duration': 45.0},
    {'subject_id': 'M001', 'session_idx': 2, 'experimenter_id': 'alice', 'session_date': '2026-01-07', 'duration': 50.0},
    {'subject_id': 'M002', 'session_idx': 1, 'experimenter_id': 'bob', 'session_date': '2026-01-06', 'duration': 40.0},
    {'subject_id': 'M002', 'session_idx': 2, 'experimenter_id': 'bob', 'session_date': '2026-01-08', 'duration': 55.0},
    {'subject_id': 'M003', 'session_idx': 1, 'experimenter_id': 'alice', 'session_date': '2026-01-07', 'duration': 35.0},
]
Session.insert(sessions)

# Insert trials
trials = []
for s in sessions:
    for i in range(10):
        trials.append({
            'subject_id': s['subject_id'],
            'session_idx': s['session_idx'],
            'trial_idx': i + 1,
            'stimulus': random.choice(['left', 'right']),
            'response': random.choice(['left', 'right']),
            'correct': random.random() > 0.3,
            'reaction_time': random.uniform(0.2, 0.8)
        })
Session.Trial.insert(trials)

print(f"Subjects: {len(Subject())}, Sessions: {len(Session())}, Trials: {len(Session.Trial())}")

## Restriction (`&` and `-`)

Restriction filters rows based on conditions. Use `&` to select matching rows, `-` to exclude them.

### String Conditions

A string condition is an SQL expression written in terms of the table's attribute names:

In [None]:
# Simple comparison
Subject & "weight > 25"

In [None]:
# Date comparison
Session & "session_date > '2026-01-06'"

In [None]:
# Multiple conditions with AND
Subject & "sex = 'M' AND weight > 25"
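
A string condition behaves much like the body of an SQL `WHERE` clause. The sketch below illustrates this with Python's built-in `sqlite3` module as a stand-in database; the table layout and the `ids` helper are assumptions made for this example, and the SQL that DataJoint actually generates may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE subject (subject_id TEXT PRIMARY KEY, sex TEXT, weight REAL)")
conn.executemany(
    "INSERT INTO subject VALUES (?, ?, ?)",
    [("M001", "M", 25.3), ("M002", "F", 22.1), ("M003", "M", 26.8), ("R001", "F", 280.5)],
)


def ids(where):
    # Hypothetical helper: the condition string is placed in the WHERE clause as-is.
    sql = f"SELECT subject_id FROM subject WHERE {where} ORDER BY subject_id"
    return [row[0] for row in conn.execute(sql)]


heavy = ids("weight > 25")
heavy_males = ids("sex = 'M' AND weight > 25")
print(heavy)        # ['M001', 'M003', 'R001']
print(heavy_males)  # ['M001', 'M003']
```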

### Dictionary Conditions

A dictionary matches rows whose attributes equal the given values. Multiple entries combine with AND:

In [None]:
# Single attribute
Subject & {'sex': 'F'}

In [None]:
# Multiple attributes (AND)
Session & {'subject_id': 'M001', 'session_idx': 1}

### Expression Conditions (Semijoin)

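Restricting by another query expression keeps the rows that have a matching entry in that expression (a semijoin); with `-`, it keeps the rows that have none: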

In [None]:
# Subjects that have at least one session
Subject & Session

In [None]:
# Subjects without any sessions (R001 has no sessions)
Subject - Session
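
The semijoin and its negation can be modeled on plain lists of dictionaries. This is only a sketch of the semantics: the `restrict_by` function is a hypothetical helper, it matches on every attribute name the two inputs share, and it ignores any additional checks DataJoint may apply.

```python
subjects = [{"subject_id": s} for s in ("M001", "M002", "M003", "R001")]
sessions = [
    {"subject_id": "M001", "session_idx": 1},
    {"subject_id": "M001", "session_idx": 2},
    {"subject_id": "M002", "session_idx": 1},
    {"subject_id": "M003", "session_idx": 1},
]


def restrict_by(left, right, keep_matching=True):
    """Keep rows of `left` that do (or do not) have a match in `right`."""
    # Assumes both lists are non-empty and rows share a uniform set of keys.
    shared = sorted(set(left[0]) & set(right[0]))
    right_keys = {tuple(row[a] for a in shared) for row in right}
    return [
        row for row in left
        if (tuple(row[a] for a in shared) in right_keys) == keep_matching
    ]


with_sessions = [r["subject_id"] for r in restrict_by(subjects, sessions)]
without_sessions = [
    r["subject_id"] for r in restrict_by(subjects, sessions, keep_matching=False)
]
print(with_sessions)     # ['M001', 'M002', 'M003']
print(without_sessions)  # ['R001']
```

Note that the result contains only attributes of the left operand; the right operand acts purely as a filter.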

### Collection Conditions (OR)

Lists create OR conditions:

In [None]:
# Either of these subjects
Subject & [{'subject_id': 'M001'}, {'subject_id': 'M002'}]
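
The AND semantics of dictionaries and the OR semantics of lists can be sketched in a few lines of plain Python. The `matches` and `restrict` functions below are hypothetical helpers for illustration only; they assume every key in a condition is an attribute of the row.

```python
subjects = [
    {"subject_id": "M001", "sex": "M"},
    {"subject_id": "M002", "sex": "F"},
    {"subject_id": "M003", "sex": "M"},
    {"subject_id": "R001", "sex": "F"},
]


def matches(row, condition):
    if isinstance(condition, dict):
        # Every key/value pair must match (AND).
        return all(row[name] == value for name, value in condition.items())
    # A list matches if any of its member conditions matches (OR).
    return any(matches(row, member) for member in condition)


def restrict(rows, condition):
    return [row for row in rows if matches(row, condition)]


females = restrict(subjects, {"sex": "F"})
either = restrict(subjects, [{"subject_id": "M001"}, {"subject_id": "M002"}])
print([r["subject_id"] for r in females])  # ['M002', 'R001']
print([r["subject_id"] for r in either])   # ['M001', 'M002']
```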

### Chaining Restrictions

Sequential restrictions combine with AND:

In [None]:
# These are equivalent
result1 = Subject & "sex = 'M'" & "weight > 25"
result2 = (Subject & "sex = 'M'") & "weight > 25"

print(f"Result 1: {len(result1)} rows")
print(f"Result 2: {len(result2)} rows")

## Projection (`.proj()`)

Projection selects, renames, or computes attributes. The primary key is always included in the result.

### Selecting Attributes

In [None]:
# Primary key only (no arguments)
Subject.proj()

In [None]:
# Primary key + specific attributes
Subject.proj('species', 'sex')

In [None]:
# All attributes (using ellipsis)
Subject.proj(...)

In [None]:
# All except specific attributes
Subject.proj(..., '-weight')

### Renaming Attributes

In [None]:
# Rename 'species' to 'animal_species'
Subject.proj(animal_species='species')

### Computed Attributes

In [None]:
# Arithmetic computation
Subject.proj('species', weight_kg='weight / 1000')

In [None]:
# Date functions
Session.proj('session_date', year='YEAR(session_date)', month='MONTH(session_date)')
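
As a rough mental model, projection can be sketched on plain dictionaries. The `proj` function below is a hypothetical helper, not DataJoint's method: it takes the primary key explicitly, and it uses Python callables for computed attributes where DataJoint uses SQL expression strings.

```python
subjects = [
    {"subject_id": "M001", "species": "Mus musculus", "sex": "M", "weight": 25.3},
    {"subject_id": "R001", "species": "Rattus norvegicus", "sex": "F", "weight": 280.5},
]


def proj(rows, primary_key, *keep, **named):
    """Return rows with the primary key, kept attributes, and named attributes."""
    result = []
    for row in rows:
        new_row = {name: row[name] for name in (*primary_key, *keep)}
        for new_name, source in named.items():
            # A string renames an existing attribute; a callable computes a value.
            new_row[new_name] = row[source] if isinstance(source, str) else source(row)
        result.append(new_row)
    return result


key_only = proj(subjects, ["subject_id"])
renamed = proj(subjects, ["subject_id"], animal_species="species")
computed = proj(subjects, ["subject_id"], "species", weight_kg=lambda r: r["weight"] / 1000)

print(key_only[0])          # {'subject_id': 'M001'}
print(renamed[0])           # {'subject_id': 'M001', 'animal_species': 'Mus musculus'}
print(sorted(computed[0]))  # ['species', 'subject_id', 'weight_kg']
```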

## Join (`*`)

Join combines rows from two tables wherever their shared attributes match, typically the attributes linked by a foreign key.

In [None]:
# Join Subject and Session on subject_id
Subject * Session

In [None]:
# Join then restrict
(Subject * Session) & "sex = 'M'"

In [None]:
# Restrict then join (equivalent result)
(Subject & "sex = 'M'") * Session

In [None]:
# Three-way join
(Subject * Session * Experimenter).proj('species', 'session_date', 'full_name')
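
The matching rule can be illustrated with a small natural join over lists of dictionaries. This is a sketch under simplifying assumptions: `natural_join` and `males` are hypothetical helpers, rows are matched on every attribute name they share, and any checks DataJoint performs on which attributes may be matched are not modeled.

```python
subjects = [
    {"subject_id": "M001", "sex": "M"},
    {"subject_id": "M002", "sex": "F"},
    {"subject_id": "M003", "sex": "M"},
    {"subject_id": "R001", "sex": "F"},
]
sessions = [
    {"subject_id": "M001", "session_idx": 1},
    {"subject_id": "M001", "session_idx": 2},
    {"subject_id": "M002", "session_idx": 1},
    {"subject_id": "M002", "session_idx": 2},
    {"subject_id": "M003", "session_idx": 1},
]


def natural_join(left, right):
    joined = []
    for l_row in left:
        for r_row in right:
            shared = set(l_row) & set(r_row)
            # Rows pair up only when all shared attributes agree.
            if all(l_row[a] == r_row[a] for a in shared):
                joined.append({**l_row, **r_row})
    return joined


def males(rows):
    return [row for row in rows if row["sex"] == "M"]


join_then_restrict = males(natural_join(subjects, sessions))
restrict_then_join = natural_join(males(subjects), sessions)

print(len(natural_join(subjects, sessions)))     # 5
print(len(join_then_restrict))                   # 3
print(join_then_restrict == restrict_then_join)  # True
```

The subject without sessions (`R001`) drops out of the join, and restricting before or after the join gives the same rows.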

## Aggregation (`.aggr()`)

Aggregation with `A.aggr(B, ...)` groups the rows of `B` by the primary key of `A` and computes summary statistics for each group.

In [None]:
# Count trials per session
Session.aggr(Session.Trial, n_trials='count(*)')

In [None]:
# Multiple aggregates
Session.aggr(
    Session.Trial,
    n_trials='count(*)',
    n_correct='sum(correct)',
    avg_rt='avg(reaction_time)'
)

In [None]:
# Count sessions per subject
Subject.aggr(Session, n_sessions='count(*)')
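
The grouping step can be sketched in plain Python. The `aggr` function below is a hypothetical helper, not DataJoint's method: it takes the grouping key explicitly, uses Python callables in place of SQL aggregate expressions, and simply skips parents with no matching children, which may not reflect how DataJoint treats such rows.

```python
from collections import defaultdict

sessions = [
    {"subject_id": "M001", "session_idx": 1},
    {"subject_id": "M001", "session_idx": 2},
]
trials = [
    {"subject_id": "M001", "session_idx": 1, "correct": True, "reaction_time": 0.5},
    {"subject_id": "M001", "session_idx": 1, "correct": False, "reaction_time": 0.25},
    {"subject_id": "M001", "session_idx": 2, "correct": True, "reaction_time": 0.75},
]


def aggr(parents, children, key, **aggregates):
    """Group `children` by the parent's key and summarize each group."""
    groups = defaultdict(list)
    for child in children:
        groups[tuple(child[k] for k in key)].append(child)
    summary = []
    for parent in parents:
        members = groups.get(tuple(parent[k] for k in key), [])
        if members:  # parents with no children are skipped in this sketch
            row = {k: parent[k] for k in key}
            row.update({name: func(members) for name, func in aggregates.items()})
            summary.append(row)
    return summary


result = aggr(
    sessions, trials, key=["subject_id", "session_idx"],
    n_trials=len,
    n_correct=lambda rows: sum(r["correct"] for r in rows),
    avg_rt=lambda rows: sum(r["reaction_time"] for r in rows) / len(rows),
)
for row in result:
    print(row)
```

Each output row carries the grouping key plus one value per named aggregate, mirroring the shape of the `aggr` results above.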

### Universal Set (`dj.U()`)

Use `dj.U()` for global aggregation or grouping by non-primary-key attributes:

In [None]:
# Global count (no grouping)
dj.U().aggr(Session, total_sessions='count(*)')

In [None]:
# Group by experimenter (not in Session's primary key)
dj.U('experimenter_id').aggr(Session, n_sessions='count(*)')

In [None]:
# Unique values
dj.U('species') & Subject
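
For comparison, the three uses of `dj.U()` correspond to familiar operations on plain Python data: a global count, a count grouped by an arbitrary attribute, and a set of distinct values. The sketch below uses only the standard library and a copy of the sample data; the variable names are chosen for this example.

```python
from collections import Counter

subjects = [
    {"subject_id": "M001", "species": "Mus musculus"},
    {"subject_id": "M002", "species": "Mus musculus"},
    {"subject_id": "M003", "species": "Mus musculus"},
    {"subject_id": "R001", "species": "Rattus norvegicus"},
]
sessions = [
    {"subject_id": "M001", "session_idx": 1, "experimenter_id": "alice"},
    {"subject_id": "M001", "session_idx": 2, "experimenter_id": "alice"},
    {"subject_id": "M002", "session_idx": 1, "experimenter_id": "bob"},
    {"subject_id": "M002", "session_idx": 2, "experimenter_id": "bob"},
    {"subject_id": "M003", "session_idx": 1, "experimenter_id": "alice"},
]

# Global aggregation: a single group containing every row.
total_sessions = len(sessions)

# Grouping by an attribute that is not part of the primary key.
sessions_per_experimenter = Counter(s["experimenter_id"] for s in sessions)

# Distinct values of one attribute.
unique_species = sorted({s["species"] for s in subjects})

print(total_sessions)                   # 5
print(dict(sessions_per_experimenter))  # {'alice': 3, 'bob': 2}
print(unique_species)                   # ['Mus musculus', 'Rattus norvegicus']
```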

## Fetching Data

DataJoint 2.0 provides explicit methods for different output formats.

### `to_dicts()` — List of Dictionaries

In [None]:
# Get all rows as list of dicts
rows = Subject.to_dicts()
rows[:2]

### `to_pandas()` — DataFrame

In [None]:
# Get as pandas DataFrame (primary key as index)
df = Subject.to_pandas()
df

### `to_arrays()` — NumPy Arrays

In [None]:
# Structured array (all columns)
arr = Subject.to_arrays()
arr

In [None]:
# Specific columns as separate arrays
species, weights = Subject.to_arrays('species', 'weight')
print(f"Species: {species}")
print(f"Weights: {weights}")

### `keys()` — Primary Keys

In [None]:
# Get primary keys for iteration
keys = Session.keys()
keys[:3]

### `fetch1()` — Single Row

In [None]:
# Fetch one row (raises error if not exactly 1)
row = (Subject & {'subject_id': 'M001'}).fetch1()
row

In [None]:
# Fetch specific attributes from one row
species, weight = (Subject & {'subject_id': 'M001'}).fetch1('species', 'weight')
print(f"{species}: {weight}g")
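
The "exactly one row" contract can be sketched as a small guard function. `fetch_one` is a hypothetical helper written for this example, and `ValueError` is a placeholder: the exception DataJoint raises is its own type.

```python
def fetch_one(rows):
    """Return the only row, or raise if there is not exactly one."""
    if len(rows) != 1:
        raise ValueError(f"expected exactly one row, found {len(rows)}")
    return rows[0]


subjects = [
    {"subject_id": "M001", "species": "Mus musculus"},
    {"subject_id": "M002", "species": "Mus musculus"},
]

row = fetch_one([s for s in subjects if s["subject_id"] == "M001"])
print(row["species"])  # Mus musculus

try:
    fetch_one(subjects)  # two rows, so this fails
except ValueError as error:
    message = str(error)
    print(message)  # expected exactly one row, found 2
```

Failing loudly here catches restrictions that are looser or tighter than intended, instead of silently returning an arbitrary row.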

### Ordering and Limiting

In [None]:
# Sort by weight descending, get top 2
Subject.to_dicts(order_by='weight DESC', limit=2)

In [None]:
# Sort by primary key
Subject.to_dicts(order_by='KEY')
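
On rows that are already in memory, the same ordering and limiting amounts to a sort followed by a slice. This plain-Python sketch only illustrates the semantics; with `order_by` and `limit` as used above, the database would normally do this work.

```python
subjects = [
    {"subject_id": "M002", "weight": 22.1},
    {"subject_id": "R001", "weight": 280.5},
    {"subject_id": "M001", "weight": 25.3},
    {"subject_id": "M003", "weight": 26.8},
]

# Equivalent in spirit to order_by='weight DESC', limit=2
heaviest_two = sorted(subjects, key=lambda r: r["weight"], reverse=True)[:2]

# Equivalent in spirit to order_by='KEY' for a single-attribute primary key
by_key = sorted(subjects, key=lambda r: r["subject_id"])

print([r["subject_id"] for r in heaviest_two])  # ['R001', 'M003']
print([r["subject_id"] for r in by_key])        # ['M001', 'M002', 'M003', 'R001']
```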

### Lazy Iteration

Iterating directly over a table streams rows one at a time through a single database cursor:

In [None]:
# Stream rows (single database cursor)
for row in Subject:
    print(f"{row['subject_id']}: {row['species']}")

## Query Composition

Queries are composable and immutable: each operator returns a new query expression and leaves its operands unchanged. Build complex queries step by step:

In [None]:
# Build a complex query step by step
male_mice = Subject & "sex = 'M'" & "species LIKE '%musculus%'"
sessions_with_subject = male_mice * Session
alice_sessions = sessions_with_subject & {'experimenter_id': 'alice'}
result = alice_sessions.proj('session_date', 'duration', 'weight')

result

In [None]:
# Or as a single expression
((Subject & "sex = 'M'" & "species LIKE '%musculus%'")
 * Session
 & {'experimenter_id': 'alice'}
).proj('session_date', 'duration', 'weight')

## Operator Precedence

DataJoint operators are ordinary Python operators, so Python's precedence rules apply:

1. `*` (join) — highest
2. `+`, `-` (union, anti-restriction)
3. `&` (restriction) — lowest

Use parentheses for clarity:

In [None]:
# Without parentheses: join happens first
# Subject * Session & condition  means  (Subject * Session) & condition

# With parentheses: explicit order
result1 = (Subject & "sex = 'M'") * Session   # Restrict Subject, then join
result2 = Subject * (Session & "duration > 40")  # Restrict Session, then join

print(f"Result 1: {len(result1)} rows")
print(f"Result 2: {len(result2)} rows")
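
The grouping that Python applies can be checked without a database. In the sketch below, `Expr` is a hypothetical class written for this example; it overloads the same operators and records, as text, how Python parenthesizes an expression.

```python
class Expr:
    """Hypothetical expression that records how Python groups operators."""

    def __init__(self, text):
        self.text = text

    def _combine(self, symbol, other):
        return Expr(f"({self.text} {symbol} {other.text})")

    def __mul__(self, other):
        return self._combine("*", other)

    def __sub__(self, other):
        return self._combine("-", other)

    def __and__(self, other):
        return self._combine("&", other)


A, B, C = Expr("A"), Expr("B"), Expr("C")

print((A * B & C).text)  # ((A * B) & C)
print((A & B * C).text)  # (A & (B * C))
print((A & B - C).text)  # (A & (B - C))
print((A - B & C).text)  # ((A - B) & C)
```

The third line is the case most likely to surprise: because `-` binds tighter than `&`, an unparenthesized `A & B - C` subtracts `C` from `B` before restricting `A`.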

## Quick Reference

### Operators

| Operation | Syntax | Description |
|-----------|--------|-------------|
| Restrict | `A & cond` | Select matching rows |
| Anti-restrict | `A - cond` | Select non-matching rows |
| Project | `A.proj(...)` | Select/compute columns |
| Join | `A * B` | Combine tables |
| Aggregate | `A.aggr(B, ...)` | Group and summarize |
| Union | `A + B` | Combine entity sets |

### Fetch Methods

| Method | Returns | Use Case |
|--------|---------|----------|
| `to_dicts()` | `list[dict]` | JSON, iteration |
| `to_pandas()` | `DataFrame` | Data analysis |
| `to_arrays()` | structured `np.ndarray` | Numeric computation |
| `to_arrays('a', 'b')` | `tuple[array, ...]` | Specific columns |
| `keys()` | `list[dict]` | Primary keys |
| `fetch1()` | `dict` | Single row |
| `fetch1('a', 'b')` | `tuple` of values | Specific attributes of a single row |

See the [Query Algebra Specification](../reference/specs/query-algebra.md) and [Fetch API](../reference/specs/fetch-api.md) for complete details.

## Next Steps

- [Computation](05-computation.ipynb) — Building computational pipelines

In [None]:
# Cleanup
schema.drop(force=True)