# Data Entry

This tutorial covers how to manipulate data in DataJoint tables. You'll learn:

- **Insert** — Adding rows to tables
- **Update** — Modifying existing rows (for corrections)
- **Delete** — Removing rows with cascading
- **Validation** — Checking data before insertion

DataJoint is designed around **insert** and **delete** as the primary operations. Updates are intentionally limited to surgical corrections.

In [None]:
import datajoint as dj
import numpy as np

schema = dj.Schema('tutorial_data_entry')

In [None]:
# Define tables for this tutorial
@schema
class Lab(dj.Manual):
    definition = """
    lab_id : varchar(16)
    ---
    lab_name : varchar(100)
    """

@schema
class Subject(dj.Manual):
    definition = """
    subject_id : varchar(16)
    ---
    -> Lab
    species : varchar(50)
    date_of_birth : date
    notes = '' : varchar(1000)
    """

@schema
class Session(dj.Manual):
    definition = """
    -> Subject
    session_idx : uint16
    ---
    session_date : date
    duration : float32            # minutes
    """

    class Trial(dj.Part):
        definition = """
        -> master
        trial_idx : uint16
        ---
        outcome : enum('hit', 'miss', 'false_alarm', 'correct_reject')
        reaction_time : float32   # seconds
        """

@schema
class ProcessedData(dj.Computed):
    definition = """
    -> Session
    ---
    hit_rate : float32
    """
    
    def make(self, key):
        outcomes = (Session.Trial & key).to_arrays('outcome')[0]
        hit_rate = sum(outcomes == 'hit') / len(outcomes) if len(outcomes) else 0.0
        self.insert1({**key, 'hit_rate': hit_rate})

## Insert Operations

### `insert1()` — Single Row

Use `insert1()` to add a single row as a dictionary:

In [None]:
# Insert a single row
Lab.insert1({'lab_id': 'tolias', 'lab_name': 'Tolias Lab'})

Subject.insert1({
    'subject_id': 'M001',
    'lab_id': 'tolias',
    'species': 'Mus musculus',
    'date_of_birth': '2026-01-15'
})

Subject()

### `insert()` — Multiple Rows

Use `insert()` to add multiple rows at once. This is more efficient than calling `insert1()` in a loop.

In [None]:
# Insert multiple rows as a list of dictionaries
Subject.insert([
    {'subject_id': 'M002', 'lab_id': 'tolias', 'species': 'Mus musculus', 'date_of_birth': '2026-02-01'},
    {'subject_id': 'M003', 'lab_id': 'tolias', 'species': 'Mus musculus', 'date_of_birth': '2026-02-15'},
])

Subject()

### Accepted Input Formats

`insert()` accepts several formats:

| Format | Example |
|--------|--------|
| List of dicts | `[{'id': 1, 'name': 'A'}, ...]` |
| pandas DataFrame | `pd.DataFrame({'id': [1, 2], 'name': ['A', 'B']})` |
| numpy structured array | `np.array([(1, 'A')], dtype=[('id', int), ('name', 'U10')])` |
| QueryExpression | `OtherTable.proj(...)` (INSERT...SELECT) |

In [None]:
# Insert from pandas DataFrame
import pandas as pd

df = pd.DataFrame({
    'subject_id': ['M004', 'M005'],
    'lab_id': ['tolias', 'tolias'],
    'species': ['Mus musculus', 'Mus musculus'],
    'date_of_birth': ['2026-03-01', '2026-03-15']
})

Subject.insert(df)
print(f"Total subjects: {len(Subject())}")
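
Rows often start life in a file rather than in Python literals. The sketch below uses only the standard library to turn CSV text into the list-of-dicts format from the table above. The CSV content and the IDs `M010` and `M011` are hypothetical, and the final `Subject.insert(rows)` call is left as a comment because it needs a live database.

```python
import csv
import io

# Hypothetical CSV content; in practice this would be read from a file.
csv_text = """subject_id,lab_id,species,date_of_birth
M010,tolias,Mus musculus,2026-06-01
M011,tolias,Mus musculus,2026-06-15
"""

# DictReader yields one mapping per data line, keyed by the header row.
rows = [dict(record) for record in csv.DictReader(io.StringIO(csv_text))]

print(len(rows), rows[0]["subject_id"])

# The list of dicts now has the shape that insert() accepts:
# Subject.insert(rows)
```

All values arrive as strings, which matches how dates are written elsewhere in this tutorial.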

### Handling Duplicates

By default, inserting a row with an existing primary key raises an error:

In [None]:
# This will raise an error - duplicate primary key
try:
    Subject.insert1({'subject_id': 'M001', 'lab_id': 'tolias', 
                     'species': 'Mus musculus', 'date_of_birth': '2026-01-15'})
except Exception as e:
    print(f"Error: {type(e).__name__}")
    print("Cannot insert duplicate primary key!")

Use `skip_duplicates=True` to silently skip rows with existing keys:

In [None]:
# Skip duplicates - existing row unchanged
Subject.insert1(
    {'subject_id': 'M001', 'lab_id': 'tolias', 'species': 'Mus musculus', 'date_of_birth': '2026-01-15'},
    skip_duplicates=True
)
print("Insert completed (duplicate skipped)")

Use `replace=True` to overwrite existing rows:

In [None]:
# Replace - overwrites existing row with new values
Subject.insert1(
    {'subject_id': 'M001', 'lab_id': 'tolias', 'species': 'Mus musculus', 
     'date_of_birth': '2026-01-15', 'notes': 'Updated via replace'},
    replace=True
)

(Subject & 'subject_id="M001"').fetch1()
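
The three behaviors (reject, skip, replace) can be tried without a DataJoint server. The following is only an analogy built on Python's `sqlite3` module: `INSERT OR IGNORE` and `INSERT OR REPLACE` stand in for `skip_duplicates=True` and `replace=True`. It is not the SQL that DataJoint issues, and the two-column `subject` table is a simplified assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE subject (subject_id TEXT PRIMARY KEY, notes TEXT DEFAULT '')"
)
conn.execute("INSERT INTO subject VALUES ('M001', 'original')")

select_notes = "SELECT notes FROM subject WHERE subject_id = 'M001'"

# Default behavior: a duplicate primary key is rejected.
try:
    conn.execute("INSERT INTO subject VALUES ('M001', 'second attempt')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# Analogue of skip_duplicates=True: the existing row is left untouched.
conn.execute("INSERT OR IGNORE INTO subject VALUES ('M001', 'skipped')")
after_skip = conn.execute(select_notes).fetchone()[0]

# Analogue of replace=True: the existing row is overwritten.
conn.execute("INSERT OR REPLACE INTO subject VALUES ('M001', 'replaced')")
after_replace = conn.execute(select_notes).fetchone()[0]

print(duplicate_rejected, after_skip, after_replace)
```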

### Extra Fields

By default, inserting a row with fields not in the table raises an error:

In [None]:
try:
    Subject.insert1({'subject_id': 'M006', 'lab_id': 'tolias', 
                     'species': 'Mus musculus', 'date_of_birth': '2026-04-01',
                     'unknown_field': 'some value'})  # Unknown field!
except Exception as e:
    print(f"Error: {type(e).__name__}")
    print("Field 'unknown_field' not in table!")

In [None]:
# Use ignore_extra_fields=True to silently ignore unknown fields
Subject.insert1(
    {'subject_id': 'M006', 'lab_id': 'tolias', 'species': 'Mus musculus',
     'date_of_birth': '2026-04-01', 'unknown_field': 'ignored'},
    ignore_extra_fields=True
)
print(f"Total subjects: {len(Subject())}")
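
Silently dropping fields can hide typos in attribute names, so it may help to look at what would be discarded first. This is a plain-Python sketch, not DataJoint's implementation. The `table_attributes` list is written out by hand here as an assumption; it mirrors the `Subject` definition above.

```python
# Attribute names of Subject, copied by hand from the table definition.
table_attributes = ["subject_id", "lab_id", "species", "date_of_birth", "notes"]

row = {
    "subject_id": "M006",
    "lab_id": "tolias",
    "species": "Mus musculus",
    "date_of_birth": "2026-04-01",
    "unknown_field": "ignored",
}

# Names that ignore_extra_fields=True would drop.
extra_fields = [name for name in row if name not in table_attributes]

# The row with only the attributes the table knows about.
clean_row = {name: value for name, value in row.items() if name in table_attributes}

print("Dropped:", extra_fields)
```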

### Inserting Part Tables

Part tables are accessed as attributes of their master class (for example, `Session.Trial`), and every part row includes the master's primary key. Insert a master row and its parts together to maintain compositional integrity:

In [None]:
# Insert a session
Session.insert1({
    'subject_id': 'M001',
    'session_idx': 1,
    'session_date': '2026-01-06',
    'duration': 45.5
})

# Insert trials for this session
Session.Trial.insert([
    {'subject_id': 'M001', 'session_idx': 1, 'trial_idx': 1, 'outcome': 'hit', 'reaction_time': 0.35},
    {'subject_id': 'M001', 'session_idx': 1, 'trial_idx': 2, 'outcome': 'miss', 'reaction_time': 0.82},
    {'subject_id': 'M001', 'session_idx': 1, 'trial_idx': 3, 'outcome': 'hit', 'reaction_time': 0.41},
    {'subject_id': 'M001', 'session_idx': 1, 'trial_idx': 4, 'outcome': 'false_alarm', 'reaction_time': 0.28},
    {'subject_id': 'M001', 'session_idx': 1, 'trial_idx': 5, 'outcome': 'hit', 'reaction_time': 0.39},
])

Session.Trial()
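
Every part row repeats the master's primary key, so building the rows from one key avoids retyping it and keeps the trial numbering consistent. A minimal sketch in plain Python; the `observations` list is illustrative data, and the resulting `trial_rows` could be passed to `Session.Trial.insert()`.

```python
# Primary key of the master row that the trials belong to.
session_key = {"subject_id": "M001", "session_idx": 1}

# Illustrative (outcome, reaction_time) pairs in trial order.
observations = [("hit", 0.35), ("miss", 0.82), ("hit", 0.41)]

# Merge the master key into each part row and number trials from 1.
trial_rows = [
    {**session_key, "trial_idx": idx, "outcome": outcome, "reaction_time": rt}
    for idx, (outcome, rt) in enumerate(observations, start=1)
]

for trial in trial_rows:
    print(trial)
```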

## Update Operations

DataJoint's only update method is `update1()`, which modifies one row at a time. This is intentional: updates bypass the normal workflow and should be used sparingly for **corrective operations**.

### When to Use Updates

**Appropriate uses:**
- Fixing data entry errors (typos, wrong values)
- Adding notes or metadata after the fact
- Administrative corrections

**Inappropriate uses** (use delete + insert + populate instead):
- Regular workflow operations
- Changes that should trigger recomputation

In [None]:
# Update a single row - must provide all primary key values
Subject.update1({'subject_id': 'M001', 'notes': 'Primary subject for behavioral study'})

(Subject & 'subject_id="M001"').fetch1()

In [None]:
# Update multiple attributes at once
Subject.update1({
    'subject_id': 'M002',
    'notes': 'Control group',
    'species': 'Mus musculus (C57BL/6)'  # More specific
})

(Subject & 'subject_id="M002"').fetch1()

### Update Requirements

1. **Complete primary key**: All PK attributes must be provided
2. **Exactly one match**: Must match exactly one existing row
3. **No restrictions**: Cannot call on a restricted table
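
The first requirement can be checked in plain Python before anything reaches the database. `check_update_row` is a hypothetical helper written for this tutorial, not part of DataJoint. It only inspects the dictionary and cannot tell whether the row actually exists.

```python
def check_update_row(row, primary_key):
    """Return the non-key attributes to update, or raise if the row is unusable."""
    missing = [name for name in primary_key if name not in row]
    if missing:
        raise KeyError(f"missing primary key attribute(s): {missing}")
    updates = {name: value for name, value in row.items() if name not in primary_key}
    if not updates:
        raise ValueError("only primary key attributes were given; nothing to update")
    return updates


# Complete key plus one attribute to change: accepted.
updates = check_update_row({"subject_id": "M001", "notes": "fixed typo"}, ["subject_id"])
print(updates)

# Missing key: rejected before any database call is made.
try:
    check_update_row({"notes": "Missing subject_id!"}, ["subject_id"])
    incomplete_key_rejected = False
except KeyError:
    incomplete_key_rejected = True
```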

In [None]:
# Error: incomplete primary key
try:
    Subject.update1({'notes': 'Missing subject_id!'})
except Exception as e:
    print(f"Error: {type(e).__name__}")
    print("Primary key must be complete")

In [None]:
# Error: cannot update restricted table
try:
    (Subject & 'subject_id="M001"').update1({'subject_id': 'M001', 'notes': 'test'})
except Exception as e:
    print(f"Error: {type(e).__name__}")
    print("Cannot update restricted table")

### Reset to Default

Setting an attribute to `None` resets it to its default value:

In [None]:
# Reset notes to default (empty string)
Subject.update1({'subject_id': 'M003', 'notes': None})

(Subject & 'subject_id="M003"').fetch1()

## Delete Operations

### Cascading Deletes

Deleting a row automatically cascades to all dependent tables. This maintains referential integrity across the pipeline.

In [None]:
# First, let's see what we have
print(f"Sessions: {len(Session())}")
print(f"Trials: {len(Session.Trial())}")

# Populate computed table
ProcessedData.populate()
print(f"ProcessedData: {len(ProcessedData())}")

In [None]:
# Delete a session - cascades to Trial and ProcessedData
(Session & {'subject_id': 'M001', 'session_idx': 1}).delete(safemode=False)

print("After delete:")
print(f"Sessions: {len(Session())}")
print(f"Trials: {len(Session.Trial())}")
print(f"ProcessedData: {len(ProcessedData())}")
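
To see the effect of a cascade without a DataJoint server, the sketch below uses SQLite's `ON DELETE CASCADE` as an analogy. DataJoint handles cascading itself, so treat this as an illustration of the outcome rather than of the mechanism. The two tables are simplified stand-ins for `Session` and `Session.Trial`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript("""
    CREATE TABLE session (
        subject_id TEXT,
        session_idx INTEGER,
        PRIMARY KEY (subject_id, session_idx)
    );
    CREATE TABLE trial (
        subject_id TEXT,
        session_idx INTEGER,
        trial_idx INTEGER,
        PRIMARY KEY (subject_id, session_idx, trial_idx),
        FOREIGN KEY (subject_id, session_idx)
            REFERENCES session (subject_id, session_idx) ON DELETE CASCADE
    );
    INSERT INTO session VALUES ('M001', 1), ('M002', 1);
    INSERT INTO trial VALUES ('M001', 1, 1), ('M001', 1, 2), ('M002', 1, 1);
""")

# Deleting one session also removes the trials that reference it.
conn.execute("DELETE FROM session WHERE subject_id = 'M001' AND session_idx = 1")
remaining_trials = conn.execute(
    "SELECT subject_id, trial_idx FROM trial ORDER BY subject_id, trial_idx"
).fetchall()

print(remaining_trials)
```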

### Safe Mode

By default, `delete()` prompts for confirmation in interactive mode and shows what will be deleted. Pass `safemode=False` in scripts to skip the prompt:

In [None]:
# Add more data for demonstration
Session.insert1({'subject_id': 'M002', 'session_idx': 1, 'session_date': '2026-01-07', 'duration': 30.0})
Session.Trial.insert([
    {'subject_id': 'M002', 'session_idx': 1, 'trial_idx': 1, 'outcome': 'hit', 'reaction_time': 0.40},
    {'subject_id': 'M002', 'session_idx': 1, 'trial_idx': 2, 'outcome': 'hit', 'reaction_time': 0.38},
])

# Delete with safemode=False (no confirmation prompt)
(Session & {'subject_id': 'M002', 'session_idx': 1}).delete(safemode=False)

### The Recomputation Pattern

When source data needs to change, the correct pattern is **delete → insert → populate**. This ensures all derived data remains consistent:

In [None]:
# Add a session with trials
Session.insert1({'subject_id': 'M003', 'session_idx': 1, 'session_date': '2026-01-08', 'duration': 40.0})
Session.Trial.insert([
    {'subject_id': 'M003', 'session_idx': 1, 'trial_idx': 1, 'outcome': 'hit', 'reaction_time': 0.35},
    {'subject_id': 'M003', 'session_idx': 1, 'trial_idx': 2, 'outcome': 'miss', 'reaction_time': 0.50},
])

# Compute results
ProcessedData.populate()
print("Before correction:", (ProcessedData & {'subject_id': 'M003', 'session_idx': 1}).fetch1())

In [None]:
# Suppose we discovered trial 2 was actually a 'hit' not 'miss'
# WRONG: Updating the trial would leave ProcessedData stale!
# Session.Trial.update1({...})  # DON'T DO THIS

# CORRECT: Delete, reinsert, recompute
key = {'subject_id': 'M003', 'session_idx': 1}

# 1. Delete cascades to ProcessedData
(Session & key).delete(safemode=False)

# 2. Reinsert with corrected data
Session.insert1({**key, 'session_date': '2026-01-08', 'duration': 40.0})
Session.Trial.insert([
    {**key, 'trial_idx': 1, 'outcome': 'hit', 'reaction_time': 0.35},
    {**key, 'trial_idx': 2, 'outcome': 'hit', 'reaction_time': 0.50},  # Corrected!
])

# 3. Recompute
ProcessedData.populate()
print("After correction:", (ProcessedData & key).fetch1())
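
The arithmetic behind the stale value can be checked without a database. The sketch below re-implements the hit-rate formula from `ProcessedData.make()` as a standalone function (`hit_rate` is a name chosen here for illustration) and applies it to the two trials before and after the correction.

```python
def hit_rate(outcomes):
    """Fraction of trials whose outcome is 'hit'; 0.0 when there are no trials."""
    return sum(outcome == "hit" for outcome in outcomes) / len(outcomes) if outcomes else 0.0


stored = hit_rate(["hit", "miss"])    # value computed from the original trials
corrected = hit_rate(["hit", "hit"])  # value after trial 2 is corrected to 'hit'

print(stored, corrected)
```

An in-place update of the trial would leave the first value stored in `ProcessedData` even though the source data now implies the second.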

## Validation

Use `validate()` to check data before insertion:

In [None]:
# Validate rows before inserting
rows_to_insert = [
    {'subject_id': 'M007', 'lab_id': 'tolias', 'species': 'Mus musculus', 'date_of_birth': '2026-05-01'},
    {'subject_id': 'M008', 'lab_id': 'tolias', 'species': 'Mus musculus', 'date_of_birth': '2026-05-15'},
]

result = Subject.validate(rows_to_insert)

if result:
    Subject.insert(rows_to_insert)
    print(f"Inserted {len(rows_to_insert)} rows")
else:
    print("Validation failed:")
    print(result.summary())

In [None]:
# Example of validation failure
bad_rows = [
    {'subject_id': 'M009', 'species': 'Mus musculus', 'date_of_birth': '2026-05-20'},  # Missing lab_id!
]

result = Subject.validate(bad_rows)

if not result:
    print("Validation failed!")
    for error in result.errors:
        print(f"  {error}")
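
For intuition about one kind of problem a pre-insert check can catch, here is a plain-Python sketch that reports rows lacking a required attribute. It is not how `validate()` is implemented and covers only missing fields. `find_missing_required` and the hand-written `required` list are assumptions made for this example.

```python
def find_missing_required(rows, required):
    """Return (row_index, missing_names) for each row lacking a required attribute."""
    problems = []
    for index, row in enumerate(rows):
        missing = [name for name in required if name not in row]
        if missing:
            problems.append((index, missing))
    return problems


# Attributes of Subject that have no default value.
required = ["subject_id", "lab_id", "species", "date_of_birth"]

rows = [
    {"subject_id": "M007", "lab_id": "tolias", "species": "Mus musculus",
     "date_of_birth": "2026-05-01"},
    {"subject_id": "M009", "species": "Mus musculus",
     "date_of_birth": "2026-05-20"},  # lab_id is missing
]

problems = find_missing_required(rows, required)
for index, missing in problems:
    print(f"row {index}: missing {missing}")
```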

## Transactions

Single operations are atomic. For multi-table operations that must succeed or fail together, use explicit transactions:

In [None]:
# Atomic transaction - all inserts succeed or none do
with dj.conn().transaction:
    Session.insert1({'subject_id': 'M007', 'session_idx': 1, 'session_date': '2026-01-10', 'duration': 35.0})
    Session.Trial.insert([
        {'subject_id': 'M007', 'session_idx': 1, 'trial_idx': 1, 'outcome': 'hit', 'reaction_time': 0.33},
        {'subject_id': 'M007', 'session_idx': 1, 'trial_idx': 2, 'outcome': 'miss', 'reaction_time': 0.45},
    ])

print(f"Session inserted with {len(Session.Trial & {'subject_id': 'M007'})} trials")
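
The all-or-nothing behavior is easiest to see when something fails partway through. The sketch below uses `sqlite3` as an analogy, with its `with conn:` block playing the role of `with dj.conn().transaction:`. The tables are simplified stand-ins, and the duplicate trial is a deliberate error.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE session (session_idx INTEGER PRIMARY KEY)")
conn.execute(
    "CREATE TABLE trial (session_idx INTEGER, trial_idx INTEGER, "
    "PRIMARY KEY (session_idx, trial_idx))"
)
conn.commit()

try:
    # The block commits on success and rolls back if an exception escapes.
    with conn:
        conn.execute("INSERT INTO session VALUES (1)")
        conn.execute("INSERT INTO trial VALUES (1, 1)")
        conn.execute("INSERT INTO trial VALUES (1, 1)")  # duplicate key: fails
except sqlite3.IntegrityError:
    pass

# Neither the session nor the first trial survived the failed transaction.
n_sessions = conn.execute("SELECT COUNT(*) FROM session").fetchone()[0]
n_trials = conn.execute("SELECT COUNT(*) FROM trial").fetchone()[0]
print(n_sessions, n_trials)
```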

## Best Practices

### 1. Prefer Insert/Delete Over Update

When source data changes, delete and reinsert rather than updating:

```python
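# Good: Delete and reinsert
(Session & key).delete()                # cascades to Session.Trial and ProcessedData
Session.insert1(corrected_session)
Session.Trial.insert(corrected_trials)
ProcessedData.populate()

# Avoid: Update that leaves derived data stale
Session.Trial.update1({**trial_key, 'outcome': 'hit'})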
```

### 2. Batch Inserts for Performance

```python
# Good: Single insert call
Subject.insert(all_rows)

# Slow: Loop of insert1 calls
for row in all_rows:
    Subject.insert1(row)  # Creates many transactions
```
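
When a list is too large to send in one call, a middle ground is to insert it in batches. This is a sketch under the assumption that a few large `insert()` calls are still much cheaper than one `insert1()` per row. `chunked` is a hypothetical helper and the batch size of 3 is chosen only to keep the example small.

```python
from itertools import islice


def chunked(rows, size):
    """Yield successive lists of at most `size` rows."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


all_rows = [{"subject_id": f"M{i:03d}"} for i in range(1, 8)]

batch_sizes = []
for batch in chunked(all_rows, 3):
    batch_sizes.append(len(batch))
    # Subject.insert(batch)  # one insert() call per batch

print(batch_sizes)
```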

### 3. Validate Before Insert

```python
result = Subject.validate(rows)
if not result:
    raise ValueError(result.summary())
Subject.insert(rows)
```

### 4. Use Transactions for Related Inserts

```python
with dj.conn().transaction:
    Session.insert1(session_data)
    Session.Trial.insert(trials)
```

### 5. Safe Deletion in Production

```python
# Interactive: Use safemode (default)
(Subject & condition).delete()

# Scripts: Disable only when tested
(Subject & condition).delete(safemode=False)
```

## Quick Reference

| Operation | Method | Use Case |
|-----------|--------|----------|
| Insert one | `insert1(row)` | Adding single entity |
| Insert many | `insert(rows)` | Bulk data loading |
| Update one | `update1(row)` | Surgical corrections only |
| Delete | `delete()` | Removing entities (cascades) |
| Delete quick | `delete_quick()` | Internal cleanup (no cascade) |
| Validate | `validate(rows)` | Pre-insert check |

See the [Data Manipulation Specification](../reference/specs/data-manipulation.md) for complete details.

## Next Steps

- [Queries](04-queries.ipynb) — Filtering, joining, and projecting data
- [Computation](05-computation.ipynb) — Building computational pipelines

In [None]:
# Cleanup
schema.drop(force=True)