# Akkadian Data Exploration

This notebook explores the Akkadian text data provided by Shahar Spencer.

## Data Sources
- **eBL (Electronic Babylonian Library)**: ~28,000 fragments in `full_corpus_dir/`
- **Filtered JSON files**: Raw source data in `filtered_json_files/`


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
from collections import Counter

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
sns.set_style('whitegrid')


## 1. Setup Paths


In [2]:
# Define paths
data_dir = Path('../data/downloaded')
full_corpus_dir = data_dir / 'full_corpus_dir'

# Check if directory exists
print(f"Data directory exists: {data_dir.exists()}")
print(f"Corpus directory exists: {full_corpus_dir.exists()}")


Data directory exists: True
Corpus directory exists: True


## 2. Count Available Fragments


In [3]:
# Get all CSV files
csv_files = list(full_corpus_dir.glob('*.csv'))
print(f"Total number of fragment files: {len(csv_files)}")

# Show first 5 filenames
print("\nFirst 5 files:")
for f in csv_files[:5]:
    print(f"  {f.name}")


Total number of fragment files: 28194

First 5 files:
  EBL_1881,0727.111.csv
  EBL_K.11322.csv
  EBL_BM.134828.csv
  EBL_NBC.5298.csv
  EBL_K.18460.csv
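
`Path.glob` yields files in whatever order the filesystem reports them, so the "first 5" above may differ between machines. The sketch below shows one possible way to list the files in a stable order and to recover a fragment ID from each filename. It assumes every file is named `EBL_<fragment id>.csv`, as the five names above suggest; `fragment_id_from_path` and `list_fragment_files` are hypothetical helpers, not part of the dataset tooling.

```python
from pathlib import Path


def fragment_id_from_path(path: Path) -> str:
    """Return the fragment ID encoded in a name like 'EBL_K.11322.csv'.

    Assumption: files follow the 'EBL_<fragment id>.csv' pattern.
    """
    # .stem drops only the final '.csv', so dots inside the ID survive.
    return path.stem.removeprefix("EBL_")


def list_fragment_files(corpus_dir: Path) -> list[Path]:
    """List fragment CSVs in a stable, name-sorted order."""
    return sorted(corpus_dir.glob("*.csv"), key=lambda p: p.name)


names = ["EBL_K.11322.csv", "EBL_1881,0727.111.csv", "EBL_BM.134828.csv"]
for name in sorted(names):
    print(fragment_id_from_path(Path(name)))
```

`str.removeprefix` requires Python 3.9 or later.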


## 3. Load and Inspect a Single Fragment


In [4]:
# Load the first fragment as an example
sample_file = csv_files[0]
df_sample = pd.read_csv(sample_file, index_col=0)

print(f"Fragment: {sample_file.name}")
print(f"Shape: {df_sample.shape}")
print(f"\nColumns: {list(df_sample.columns)}")
print(f"\nFirst few rows:")
df_sample.head(10)


Fragment: EBL_1881,0727.111.csv
Shape: (10, 10)

Columns: ['fragment_id', 'fragment_line_num', 'index_in_line', 'word_language', 'value', 'clean_value', 'lemma', 'domain', 'place_discovery', 'place_composition']

First few rows:


| | fragment_id | fragment_line_num | index_in_line | word_language | value | clean_value | lemma | domain | place_discovery | place_composition |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1881,0727.111 | 3 | 1 | AKKADIAN | SAG.DU-su | SAG.DU-su | ['qaqqadu I'] | ['CANONICAL ➝ Technical ➝ Ritual texts'] | | |
| 1 | 1881,0727.111 | 4 | 2 | AKKADIAN | a | a | [] | ['CANONICAL ➝ Technical ➝ Ritual texts'] | | |
| 2 | 1881,0727.111 | 4 | 3 | AKKADIAN | šu₂ | šu₂ | ['-šu I'] | ['CANONICAL ➝ Technical ➝ Ritual texts'] | | |
| 3 | 1881,0727.111 | 5 | 1 | AKKADIAN | du | du | [] | ['CANONICAL ➝ Technical ➝ Ritual texts'] | | |
| 4 | 1881,0727.111 | 10 | 0 | AKKADIAN | an | an | [] | ['CANONICAL ➝ Technical ➝ Ritual texts'] | | |
| 5 | 1881,0727.111 | 10 | 1 | AKKADIAN | an | an | [] | ['CANONICAL ➝ Technical ➝ Ritual texts'] | | |
| 6 | 1881,0727.111 | 12 | 0 | AKKADIAN | EN₂ | EN₂ | ['šiptu I'] | ['CANONICAL ➝ Technical ➝ Ritual texts'] | | |
| 7 | 1881,0727.111 | 14 | 0 | AKKADIAN | u₃ | u₃ | ['u I'] | ['CANONICAL ➝ Technical ➝ Ritual texts'] | | |
| 8 | 1881,0727.111 | 16 | 0 | AKKADIAN | EN₂ | EN₂ | ['šiptu I'] | ['CANONICAL ➝ Technical ➝ Ritual texts'] | | |
| 9 | 1881,0727.111 | 18 | 0 | AKKADIAN | MIN | MIN | [] | ['CANONICAL ➝ Technical ➝ Ritual texts'] | | |


## 4. Understand Fragment Structure


In [5]:
# Check unique values in key columns
print("Fragment ID(s):", df_sample['fragment_id'].unique())
print(f"\nNumber of lines: {df_sample['fragment_line_num'].nunique()}")
print(f"Total words: {len(df_sample)}")
print(f"\nLanguages: {df_sample['word_language'].value_counts().to_dict()}")
print(f"\nDomain: {df_sample['domain'].unique()}")


Fragment ID(s): ['1881,0727.111']

Number of lines: 8
Total words: 10

Languages: {'AKKADIAN': 10}

Domain: ["['CANONICAL ➝ Technical ➝ Ritual texts']"]
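
The `Domain` output above prints the list inside quotes, which indicates that `domain` is read from the CSV as a string rather than as a Python list; `lemma` presumably behaves the same way. The sketch below shows one way to turn such cells back into lists, assuming they are valid Python literals. `parse_list_cell` is a hypothetical helper, and any cell that cannot be parsed is returned as an empty list.

```python
import ast

import pandas as pd


def parse_list_cell(cell) -> list:
    """Convert a stringified list such as "['qaqqadu I']" into a real list."""
    if not isinstance(cell, str):
        return []  # NaN or another non-string cell
    try:
        parsed = ast.literal_eval(cell)
    except (ValueError, SyntaxError):
        return []  # the text is not a Python literal
    return list(parsed) if isinstance(parsed, (list, tuple)) else []


# Cell values modelled on the lemma column of the sample fragment.
lemmas = pd.Series(["['qaqqadu I']", "[]", "['šiptu I']", None])
parsed = lemmas.apply(parse_list_cell)
print(parsed.tolist())
```

`ast.literal_eval` only evaluates literals, so it is a safer choice than `eval` for text read from files.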


## 5. Reconstruct Text from Fragment


In [6]:
# Reconstruct the fragment text line by line
def reconstruct_fragment_text(df, use_clean=True):
    """Reconstruct fragment text from dataframe."""
    value_col = 'clean_value' if use_clean else 'value'
    lines = []
    
    for line_num in sorted(df['fragment_line_num'].unique()):
        line_df = df[df['fragment_line_num'] == line_num].sort_values('index_in_line')
        # Skip missing values and cast to str so that join() cannot fail on NaN
        words = line_df[value_col].dropna().astype(str).tolist()
        lines.append(f"Line {line_num}: {' '.join(words)}")
    
    return lines

# Show the reconstructed text
text_lines = reconstruct_fragment_text(df_sample)
print("Reconstructed text:\n")
for line in text_lines:
    print(line)


Reconstructed text:

Line 3: SAG.DU-su
Line 4: a šu₂
Line 5: du
Line 10: an an
Line 12: EN₂
Line 14: u₃
Line 16: EN₂
Line 18: MIN
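
The loop above filters the DataFrame once per line number, which is fine for a 10-row fragment but repeats work on larger ones. As a sketch of an alternative, a single sort followed by `groupby` produces the same line-by-line text in one pass. `reconstruct_lines` is a hypothetical name, and the three demo rows are copied from the sample fragment.

```python
import pandas as pd


def reconstruct_lines(df: pd.DataFrame, value_col: str = "clean_value") -> dict:
    """Map each line number to its space-joined tokens."""
    # Sort once so tokens are already in reading order inside each group.
    ordered = df.sort_values(["fragment_line_num", "index_in_line"])
    joined = ordered.groupby("fragment_line_num")[value_col].agg(
        lambda tokens: " ".join(tokens.dropna().astype(str))
    )
    return joined.to_dict()


# Three rows copied from the sample fragment, deliberately out of order.
rows = pd.DataFrame(
    {
        "fragment_line_num": [4, 3, 4],
        "index_in_line": [3, 1, 2],
        "clean_value": ["šu₂", "SAG.DU-su", "a"],
    }
)
print(reconstruct_lines(rows))
```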


## 6. Load Multiple Fragments for Statistics


In [7]:
# Load first 100 fragments
sample_size = 100
dfs = []
for csv_file in csv_files[:sample_size]:
    df = pd.read_csv(csv_file, index_col=0)
    dfs.append(df)
df_combined = pd.concat(dfs, ignore_index=True)
print(f"Loaded {len(dfs)} fragments, Total shape: {df_combined.shape}")


Loaded 100 fragments, Total shape: (3759, 10)
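
Because `csv_files[:sample_size]` depends on the order returned by `glob`, the 100 fragments above are not a random sample of the corpus. The sketch below shows one possible way to draw a reproducible random sample instead. `load_fragment_sample` and its `seed` parameter are hypothetical, and the demo writes two throwaway CSV files to a temporary directory so that it runs without the real data.

```python
import random
import tempfile
from pathlib import Path

import pandas as pd


def load_fragment_sample(csv_paths, sample_size=100, seed=0):
    """Load a reproducible random sample of fragment CSVs into one DataFrame."""
    paths = sorted(csv_paths)  # fix the order so the seed picks the same files
    chosen = random.Random(seed).sample(paths, min(sample_size, len(paths)))
    frames = [
        # Read fragment_id as text so an all-digit ID would not become a number.
        pd.read_csv(path, index_col=0, dtype={"fragment_id": str})
        for path in chosen
    ]
    if not frames:
        return pd.DataFrame()
    return pd.concat(frames, ignore_index=True)


# Demo on two throwaway files; a real call would pass full_corpus_dir.glob('*.csv').
with tempfile.TemporaryDirectory() as tmp:
    for name, frag_id in [("EBL_A.1.csv", "A.1"), ("EBL_B.2.csv", "B.2")]:
        pd.DataFrame({"fragment_id": [frag_id], "clean_value": ["ina"]}).to_csv(
            Path(tmp) / name
        )
    demo = load_fragment_sample(Path(tmp).glob("*.csv"), sample_size=2)

print(demo.shape)
```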


## 7. Basic Statistics


In [8]:
print("=== DATA STATISTICS ===")
print(f"Total fragments: {df_combined['fragment_id'].nunique()}")
print(f"Total words: {len(df_combined)}")
print(f"\n=== LANGUAGE DISTRIBUTION ===")
print(df_combined['word_language'].value_counts())
print(f"\n=== TOP 10 DOMAINS ===")
print(df_combined['domain'].value_counts().head(10))


=== DATA STATISTICS ===
Total fragments: 100
Total words: 3759

=== LANGUAGE DISTRIBUTION ===
word_language
AKKADIAN    3502
SUMERIAN     234
EMESAL        23
Name: count, dtype: int64

=== TOP 10 DOMAINS ===
domain
[]                                                                                                   729
see genres.json file                                                                                 617
['CANONICAL ➝ Divination ➝ Celestial ➝ Enūma Anu Enlil ➝ Ištar (EAE 50–68)']                         520
['CANONICAL ➝ Technical ➝ Astronomy ➝ Astronomical Diaries']                                         285
[['ARCHIVAL', 'Letter', 'Extispicy Query']]                                                          188
[['CANONICAL', 'Magic']]                                                                             138
['CANONICAL ➝ Divination ➝ Celestial']                                                               124
['CANONICAL ➝ Literature ➝ Hymns ➝ Divine']      
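
The domain counts above show at least four encodings of the same field: an empty list, the placeholder `see genres.json file`, lists of arrow-joined strings, and nested lists of path components. The sketch below shows one way to bring these into a common shape, a list of tuples with one tuple per genre path. It covers only the formats visible in this output; `domain_paths` is a hypothetical helper, and the placeholder text is mapped to an empty list rather than resolved against `genres.json`.

```python
import ast

ARROW = "➝"  # separator used in the arrow-joined domain strings


def domain_paths(cell) -> list:
    """Return each genre in a `domain` cell as a tuple of path components."""
    if not isinstance(cell, str):
        return []
    try:
        entries = ast.literal_eval(cell)
    except (ValueError, SyntaxError):
        return []  # e.g. the placeholder "see genres.json file"
    if not isinstance(entries, list):
        return []
    paths = []
    for entry in entries:
        if isinstance(entry, str):
            parts = entry.split(ARROW)  # 'A ➝ B ➝ C' form
        elif isinstance(entry, (list, tuple)):
            parts = entry  # ['A', 'B', 'C'] form
        else:
            continue
        paths.append(tuple(str(part).strip() for part in parts))
    return paths


for raw in [
    "['CANONICAL ➝ Technical ➝ Astronomy ➝ Astronomical Diaries']",
    "[['ARCHIVAL', 'Letter', 'Extispicy Query']]",
    "[]",
    "see genres.json file",
]:
    print(domain_paths(raw))
```

With a common shape in place, the first element of each tuple could be counted to compare top-level categories across both encodings.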

## 8. Vocabulary Analysis


In [9]:
# Filter for Akkadian only
df_akkadian = df_combined[df_combined['word_language'] == 'AKKADIAN']
print(f"Akkadian words: {len(df_akkadian)}")
print(f"Unique clean values: {df_akkadian['clean_value'].nunique()}")
print(f"\nTop 20 most common words:")
for word, count in df_akkadian['clean_value'].value_counts().head(20).items():
    print(f"  {word}: {count}")


Akkadian words: 3502
Unique clean values: 1430

Top 20 most common words:
  ina: 180
  ana: 84
  ša₂: 76
  MIN: 64
  DIŠ: 55
  la: 53
  u: 51
  ŠA₃: 50
  KUR: 45
  LUGAL: 45
  NU: 43
  {d}MIN: 34
  BE: 31
  AN: 28
  E₂: 25
  IGI: 25
  a-na: 23
  LU₂: 18
  GE₆: 18
  GIM: 18


## 9. Next Steps

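The statistics above cover only 100 of the 28,194 fragment files, so they are a first look rather than corpus-wide figures. The exploration surfaced three issues to address before further analysis:

- The `domain` column mixes several encodings: empty lists, the placeholder `see genres.json file`, arrow-joined paths, and nested lists.
- List-valued columns such as `domain` are stored as strings and need parsing before they can be grouped or counted reliably.
- Counts on `clean_value` treat each spelling separately; for example, `ana` and `a-na` are counted as distinct entries although they look like spellings of the same word.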


