# Parse, validate, and save the data

### Parsing and validation

In this notebook, we read the JSON files provided by ePSD2 into Pydantic models.
This provides a way to validate the data and ensure that it conforms to the expected format.

Specifically:
- The `Text` models are configured to forbid extra fields, so any key in the JSON
that the model does not declare raises a validation error instead of being silently dropped.
- Some fields have defaults. Their keys are present on only some of the objects
in the JSON, so the defaults fill in the missing values and every object ends up
with the same set of fields.
- Fields without defaults are required, so validation guarantees that they are present on every object.
- Finally, we also get type checking, plus any other validation we choose to add.

When interacting with JSON directly, you don't get any guarantees about the data
(which keys are present, what type each value has). The Pydantic models provide these guarantees.
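The guarantees listed above can be sketched with a small Pydantic (v2) model. This is a minimal illustration only: `ExampleText` and its fields `id_text`, `period`, and `line_count` are hypothetical names chosen for the example, not the project's actual `Text` models.

```python
from typing import Optional

from pydantic import BaseModel, ConfigDict, ValidationError


class ExampleText(BaseModel):
    """Hypothetical stand-in for a Text model (field names are assumptions)."""

    # Reject any key the model does not declare.
    model_config = ConfigDict(extra="forbid")

    id_text: str                       # no default: must be present on every object
    period: str = ""                   # default: filled in when the key is absent
    line_count: Optional[int] = None   # default plus type checking


# A sparse object validates, and the defaults fill in the missing keys.
ok = ExampleText.model_validate({"id_text": "P000001"})

# Each of these violates one guarantee: extra key, missing key, wrong type.
bad_inputs = [
    {"id_text": "P000002", "unexpected": 1},
    {"period": "Ur III"},
    {"id_text": "P000003", "line_count": "many"},
]

error_types = []
for raw in bad_inputs:
    try:
        ExampleText.model_validate(raw)
    except ValidationError as exc:
        # Record the machine-readable type of the first reported error.
        error_types.append(exc.errors()[0]["type"])

print(ok.model_dump())
print(error_types)
```

With raw dictionaries, all three bad inputs would pass through unnoticed until some later step tripped over them.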

### Collating and writing

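We then generate the transliteration for each text, load all of the models into DataFrames
(one per corpus), and write them to CSV files at "outputs/1_<corpus_name>.csv", where
`<corpus_name>` is the name of the corpus model class (for example, `CorpusAdminEd1and2`).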

These will be immediately inner joined into a single CSV in the next step,
but having separate CSVs for each corpus is useful for debugging and for
seeing all of the data that each provides (which varies based on corpus).
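As a rough sketch of why the per-corpus CSVs help with debugging, the snippet below writes two toy frames using the same `1_<name>.csv` naming and then compares which columns each file provides. The corpus names, column names, and values are invented for illustration, and a temporary directory stands in for the real `outputs/` folder.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy per-corpus frames with partly different columns (all values are made up).
frames = {
    "CorpusA": pd.DataFrame(
        [{"id": "P000001", "transliteration": "a", "period": "Ur III"}]
    ),
    "CorpusB": pd.DataFrame(
        [{"id": "Q000001", "transliteration": "b", "genre": "Literary"}]
    ),
}

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "outputs"
    out_dir.mkdir()

    # Write one CSV per corpus, indexed by text id.
    for name, df in frames.items():
        df.set_index("id").to_csv(out_dir / f"1_{name}.csv")

    # Read each file back and record its column set.
    # keep_default_na=False keeps empty cells as "" instead of turning them into NaN.
    columns = {
        path.stem: set(
            pd.read_csv(path, index_col="id", keep_default_na=False).columns
        )
        for path in sorted(out_dir.glob("1_*.csv"))
    }

shared = set.intersection(*columns.values())
only_in_some = set.union(*columns.values()) - shared

print("Shared by every corpus:", sorted(shared))
print("Present in only some corpora:", sorted(only_in_some))
```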

---

## Requirements

In [1]:
import json
import os
from pathlib import Path

import pandas as pd
from tqdm import tqdm

from src.models.corpus import CorpusEnum, CorpusType

## Parsing and validation

### Load the corpus and text metadata from JSON

In [2]:
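def load_corpus_from_json(corpus: CorpusEnum) -> CorpusType:
    """
    Load a Corpus from the ePSD2 data.

    Note: returns None (after printing an error) if the corpus directory
    or its catalogue.json cannot be found.
    """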
    print(f"\nLoading {corpus.value}...")

    project_root = Path.cwd().parents[1]
    corpus_data_path = project_root / "epsd2_data" / corpus.value

    if not corpus_data_path.exists():
        print(f"ERROR: {corpus_data_path} does not exist. Skipping...")
        return

    # Get the list of directories that contain a catalogue.json file
    # These are the directories that contain the JSON files for each tablet
    corpusjson_dirs = [
        root
        for root, _, filenames in os.walk(corpus_data_path)
        if "catalogue.json" in filenames
    ]
    if not corpusjson_dirs:
        print(
            f"ERROR: No catalogue.json files found in {corpus_data_path}. Skipping..."
        )
        return

    # Parse the catalogue.json file
    target_dir = corpusjson_dirs[0]
    catalogue_path = Path(target_dir) / "catalogue.json"
    with open(catalogue_path, "r", encoding="utf-8") as f:
        catalogue = json.load(f)

    # Use the catalogue to get the list of tablets
    tablets_path = str(Path(target_dir) / "corpusjson/")
    texts = [
        {"dir_path": tablets_path, **text_data}
        for text_data in catalogue["members"].values()
    ]

    model = corpus.model(texts=texts)
    print(f"✅ Loaded {corpus.value} ({len(model.texts)} texts)")
    return model


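# Keep only the corpora that loaded; load_corpus_from_json returns None on failure,
# which would otherwise break the later steps that iterate over corpus.texts
corpora = [
    model
    for corpus in CorpusEnum
    if (model := load_corpus_from_json(corpus)) is not None
]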


Loading admin_ed12...
✅ Loaded admin_ed12 (89 texts)

Loading admin_ed3a...
✅ Loaded admin_ed3a (840 texts)

Loading admin_ed3b...
✅ Loaded admin_ed3b (3477 texts)

Loading admin_oakk...
✅ Loaded admin_oakk (5472 texts)

Loading admin_lagash2...
✅ Loaded admin_lagash2 (769 texts)

Loading admin_ur3...
✅ Loaded admin_ur3 (80181 texts)

Loading early_lit...
✅ Loaded early_lit (43 texts)

Loading oldbab_lit...
✅ Loaded oldbab_lit (1254 texts)

Loading royal...
✅ Loaded royal (1928 texts)

Loading incantations...
✅ Loaded incantations (244 texts)

Loading liturgies...
✅ Loaded liturgies (96 texts)

Loading udughul...
✅ Loaded udughul (14 texts)

Loading varia...
✅ Loaded varia (3 texts)
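To make the catalogue step concrete, here is a minimal sketch of how a `members` mapping turns into the list of per-text dictionaries passed to the corpus model. The catalogue content below is a hypothetical, heavily trimmed stand-in (the field names `id_text`, `designation`, and `period` are assumptions), and the path is illustrative only.

```python
import json

# Hypothetical, trimmed catalogue.json content; real catalogues carry many more fields.
catalogue_json = """
{
  "members": {
    "P000001": {"id_text": "P000001", "designation": "Example A", "period": "Ur III"},
    "P000002": {"id_text": "P000002", "designation": "Example B"}
  }
}
"""

catalogue = json.loads(catalogue_json)
tablets_path = "epsd2_data/example_corpus/corpusjson"  # illustrative path

# Same pattern as in load_corpus_from_json: one dict per member,
# each tagged with the directory that holds its content file.
texts = [
    {"dir_path": tablets_path, **text_data}
    for text_data in catalogue["members"].values()
]

for text in texts:
    print(sorted(text))
```

Two things are worth noticing: `.values()` discards the member keys, so each member has to carry its own identifier, and the second member has no `period` key, which is the kind of gap that model defaults fill in.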


### Load the text content from JSON

The content of each text is stored separately from the catalogue metadata, in the per-tablet JSON files under `corpusjson/`.

In [3]:
def load_text_content(corpus: CorpusType) -> None:
    """
    Load the content of a text.
    """
    print(f"\nLoading text content for {type(corpus)}...")

    success = []
    failed = []

    for text in tqdm(corpus.texts):
        try:
            text.load_contents()
            success.append(text.file_id)
        except Exception:
            failed.append(text.file_id)

    print(f"✅ Successfully loaded {len(success)} texts")
    if len(failed) > 0:
        print(f"❌ Failed to load {len(failed)} texts:")
        print("\n".join(failed))


for corpus in corpora:
    load_text_content(corpus)


Loading text content for <class 'src.models.corpus.Corpus.CorpusAdminEd1and2'>...


100%|██████████| 89/89 [00:00<00:00, 1411.21it/s]


✅ Successfully loaded 89 texts

Loading text content for <class 'src.models.corpus.Corpus.CorpusAdminEd3a'>...


100%|██████████| 840/840 [00:01<00:00, 626.44it/s]


✅ Successfully loaded 840 texts

Loading text content for <class 'src.models.corpus.Corpus.CorpusAdminEd3b'>...


100%|██████████| 3477/3477 [00:06<00:00, 521.87it/s] 


✅ Successfully loaded 3477 texts

Loading text content for <class 'src.models.corpus.Corpus.CorpusAdminOldAkk'>...


100%|██████████| 5472/5472 [00:05<00:00, 947.03it/s] 


✅ Successfully loaded 5472 texts

Loading text content for <class 'src.models.corpus.Corpus.CorpusAdminLagash2'>...


100%|██████████| 769/769 [00:00<00:00, 1412.34it/s]


✅ Successfully loaded 769 texts

Loading text content for <class 'src.models.corpus.Corpus.CorpusAdminUr3'>...


100%|██████████| 80181/80181 [02:57<00:00, 452.71it/s] 


✅ Successfully loaded 80181 texts

Loading text content for <class 'src.models.corpus.Corpus.CorpusLiteraryEarly'>...


100%|██████████| 43/43 [00:00<00:00, 346.46it/s]


✅ Successfully loaded 43 texts

Loading text content for <class 'src.models.corpus.Corpus.CorpusLiteraryOldBab'>...


100%|██████████| 1254/1254 [00:03<00:00, 364.64it/s]


✅ Successfully loaded 1022 texts
❌ Failed to load 232 texts:
Q000353
Q000355
Q000370
Q000378
Q000402
Q000403
Q000405
Q000411
Q000428
Q000429
Q000437
Q000441
Q000442
Q000443
Q000445
Q000456
Q000457
Q000462
Q000464
Q000468
Q000470
Q000474
Q000475
Q000477
Q000486
Q000487
Q000489
Q000503
Q000504
Q000505
Q000535
Q000551
Q000552
Q000567
Q000568
Q000569
Q000570
Q000571
Q000572
Q000585
Q000587
Q000588
Q000589
Q000590
Q000591
Q000595
Q000596
Q000597
Q000598
Q000600
Q000601
Q000602
Q000603
Q000605
Q000606
Q000607
Q000608
Q000609
Q000610
Q000617
Q000620
Q000642
Q000645
Q000648
Q000650
Q000654
Q000657
Q000663
Q000665
Q000666
Q000667
Q000685
Q000693
Q000696
Q000698
Q000703
Q000704
Q000705
Q000706
Q000718
Q000724
Q000728
Q000729
Q000730
Q000731
Q000735
Q000740
Q000742
Q000744
Q000745
Q000753
Q000754
Q000755
Q000756
Q000757
Q000758
Q000759
Q000760
Q000761
Q000763
Q000764
Q000765
Q000767
Q000768
Q000769
Q000770
Q000771
Q000772
Q000773
Q000774
Q000780
Q000783
Q000787
Q000826
Q000827
Q000828
Q000829
Q00

100%|██████████| 1928/1928 [02:20<00:00, 13.75it/s] 


✅ Successfully loaded 1928 texts

Loading text content for <class 'src.models.corpus.Corpus.CorpusIncantations'>...


100%|██████████| 244/244 [00:00<00:00, 416.48it/s]


✅ Successfully loaded 244 texts

Loading text content for <class 'src.models.corpus.Corpus.CorpusLiturgies'>...


100%|██████████| 96/96 [00:00<00:00, 337.99it/s]


✅ Successfully loaded 96 texts

Loading text content for <class 'src.models.corpus.Corpus.CorpusUdughul'>...


100%|██████████| 14/14 [00:00<00:00, 91.72it/s]


✅ Successfully loaded 14 texts

Loading text content for <class 'src.models.corpus.Corpus.CorpusVaria'>...


100%|██████████| 3/3 [00:00<00:00, 369.24it/s]

✅ Successfully loaded 3 texts
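The output above lists which texts failed but not why. One way to investigate is to record the exception alongside each id, as in this sketch. It uses a generic `load_one` callable and a fake loader as stand-ins; neither is part of the project's code, and the real failure reasons may differ.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def load_all(
    ids: Iterable[str], load_one: Callable[[str], None]
) -> Tuple[List[str], Dict[str, str]]:
    """Try to load every id, keeping the failure reason for each one that fails."""
    succeeded: List[str] = []
    failed: Dict[str, str] = {}
    for item_id in ids:
        try:
            load_one(item_id)
            succeeded.append(item_id)
        except Exception as exc:
            # Keep the exception class and message so failures can be grouped later.
            failed[item_id] = f"{type(exc).__name__}: {exc}"
    return succeeded, failed


def fake_loader(item_id: str) -> None:
    """Hypothetical loader that fails for ids ending in '2'."""
    if item_id.endswith("2"):
        raise FileNotFoundError(f"{item_id}.json")


succeeded, failed = load_all(["Q000001", "Q000002", "Q000003"], fake_loader)

print(f"Loaded {len(succeeded)} texts")
for item_id, reason in failed.items():
    print(f"{item_id}: {reason}")
```

Grouping the recorded reasons would show whether a batch of failures shares a single cause, such as missing files, or has several.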





## Generate transliterations and save to CSV

In [11]:
def to_pd(corpus: CorpusType) -> pd.DataFrame:
    """
    Convert the corpus to a Pandas DataFrame.
    """
    texts = [
        {
            "id": text.file_id,
            "transliteration": text.transliteration(),
            **text.model_dump(exclude={"cdl"}),
        }
        for text in corpus.texts
    ]
    df = pd.DataFrame(texts).fillna("")
    df.set_index("id", inplace=True)
    return df


dfs = [(type(corpus).__name__, to_pd(corpus)) for corpus in corpora]

In [14]:
def write_to_csv(name: str, df: pd.DataFrame) -> None:
    """
    Write a corpus DataFrame to outputs/1_<name>.csv under the project root.
    """
    print(f"\nWriting {name} to CSV...")
    project_root = Path.cwd().parents[1]
    output_dir = project_root / "outputs"
    output_dir.mkdir(parents=True, exist_ok=True)
    output_path = output_dir / f"1_{name}.csv"
    df.to_csv(output_path)
    print(f"✅ Wrote {name} to {output_path}")


for name, df in dfs:
    write_to_csv(name, df)


Writing CorpusAdminEd1and2 to CSV...
✅ Wrote CorpusAdminEd1and2 to /Users/cole/dev/sumerian/SumTablets/outputs/1_CorpusAdminEd1and2.csv

Writing CorpusAdminEd3a to CSV...
✅ Wrote CorpusAdminEd3a to /Users/cole/dev/sumerian/SumTablets/outputs/1_CorpusAdminEd3a.csv

Writing CorpusAdminEd3b to CSV...
✅ Wrote CorpusAdminEd3b to /Users/cole/dev/sumerian/SumTablets/outputs/1_CorpusAdminEd3b.csv

Writing CorpusAdminOldAkk to CSV...
✅ Wrote CorpusAdminOldAkk to /Users/cole/dev/sumerian/SumTablets/outputs/1_CorpusAdminOldAkk.csv

Writing CorpusAdminLagash2 to CSV...
✅ Wrote CorpusAdminLagash2 to /Users/cole/dev/sumerian/SumTablets/outputs/1_CorpusAdminLagash2.csv

Writing CorpusAdminUr3 to CSV...
✅ Wrote CorpusAdminUr3 to /Users/cole/dev/sumerian/SumTablets/outputs/1_CorpusAdminUr3.csv

Writing CorpusLiteraryEarly to CSV...
✅ Wrote CorpusLiteraryEarly to /Users/cole/dev/sumerian/SumTablets/outputs/1_CorpusLiteraryEarly.csv

Writing CorpusLiteraryOldBab to CSV...
✅ Wrote CorpusLiteraryOldBab to