# Dataset Preparation Notebook

This notebook copies up to 9000 image files from a source directory into the training dataset folder and writes a `metadata.jsonl` file that pairs each image with its LaTeX formula, read from a text file.

## 1. Configure Paths

Set the source paths for the images and the formulas text file, and the destination dataset folder.

In [5]:
from pathlib import Path
import shutil
import json

# Configure these paths
SOURCE_IMAGES_DIR = Path(r"D:\datasets\CROHME\images")  # Directory containing the image files
FORMULAS_TEXT_FILE = Path(r"D:\datasets\CROHME\CROHME_math.txt")  # Text file with LaTeX formulas (one per line)
DESTINATION_DIR = Path(r"D:\code\math-content-recognition-\examples\train_texteller\dataset\train")  # Destination directory

# Number of files to copy
NUM_FILES = 9000

# Create destination directory if it doesn't exist
DESTINATION_DIR.mkdir(parents=True, exist_ok=True)

print(f"Source images: {SOURCE_IMAGES_DIR}")
print(f"Formulas file: {FORMULAS_TEXT_FILE}")
print(f"Destination: {DESTINATION_DIR}")
print(f"Files to copy: {NUM_FILES}")

Source images: D:\datasets\CROHME\images
Formulas file: D:\datasets\CROHME\CROHME_math.txt
Destination: D:\code\math-content-recognition-\examples\train_texteller\dataset\train
Files to copy: 9000


## 2. Load LaTeX Formulas

Read all LaTeX formulas from the text file. Each line holds one formula, and the zero-based line number is the index later used to match a formula to its image.

In [6]:
# Read all formulas from the text file
with open(FORMULAS_TEXT_FILE, 'r', encoding='utf-8') as f:
	formulas = [line for line in f]

print(f"Loaded {len(formulas)} formulas from {FORMULAS_TEXT_FILE}")
print(f"First 3 formulas:")
for i in range(min(3, len(formulas))):
	print(f"  Line {i}: {formulas[i][:80]}{'...' if len(formulas[i]) > 80 else ''}")

Loaded 10846 formulas from D:\datasets\CROHME\CROHME_math.txt
First 3 formulas:
  Line 0: y = A x + A ^ { 2 }

  Line 1: B _ { n } ( 1 - x ) = ( - 1 ) ^ { n } B _ { n } ( x )

  Line 2: 0 < x < 1
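
The blank lines in the output above suggest that each stored formula still ends with its newline character, since iterating over a file keeps line endings. A minimal sketch of a loader that strips them, assuming the same one-formula-per-line layout; `load_formulas` and the sample file are hypothetical and not part of the notebook:

```python
from pathlib import Path
import tempfile


def load_formulas(path):
    """Return one formula per line, without the trailing line ending."""
    with open(path, "r", encoding="utf-8") as f:
        # Strip only the line terminator. Empty lines are kept on purpose,
        # so positions in the list still match the original line numbers.
        return [line.rstrip("\r\n") for line in f]


# Small demonstration on a temporary file (hypothetical sample data)
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "formulas.txt"
    sample.write_text("y = A x + A ^ { 2 }\n0 < x < 1\n", encoding="utf-8")
    demo_formulas = load_formulas(sample)

print(demo_formulas)
```

Whether the training code tolerates a trailing newline in `latex_formula` is not stated here, so treat this as an optional cleanup rather than a required change.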



## 3. Copy Files and Create Metadata

Copy image files to the destination and create a metadata entry for each file. The metadata contains:
- `file_name`: The image filename
- `latex_formula`: The LaTeX formula (extracted based on the numeric index in the filename)
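
The filename-to-index mapping can be illustrated in isolation. The sketch below reuses the regular expression from the cell that follows; the helper name `index_from_filename` is an assumption introduced only for illustration:

```python
import re

# Same pattern as in the notebook: the digits directly before ".png"
INDEX_PATTERN = re.compile(r"(\d+)\.png$")


def index_from_filename(name):
    """Return the numeric index encoded in an image filename, or None."""
    match = INDEX_PATTERN.search(name)
    # int() discards the leading zeros used for padding
    return int(match.group(1)) if match else None


print(index_from_filename("0010015.png"))  # 10015
print(index_from_filename("notes.txt"))    # None
```

The resulting integer is used directly as a position in the `formulas` list, so the approach relies on the images being numbered in the same order as the lines of the formulas file.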

In [7]:
import re

# Get all image files from source directory
image_files = sorted(SOURCE_IMAGES_DIR.glob("*.png"))
print(f"Found {len(image_files)} PNG files in source directory")

# Process up to NUM_FILES
metadata_entries = []
copied_count = 0
skipped_count = 0

for img_path in image_files[:NUM_FILES]:
	# Extract number from filename (e.g., "0010015.png" -> 10015)
	match = re.search(r'(\d+)\.png$', img_path.name)
	if not match:
		print(f"Warning: Could not extract number from {img_path.name}, skipping...")
		skipped_count += 1
		continue

	index = int(match.group(1))

	# Check if index is within range of formulas
	if index >= len(formulas):
		print(f"Warning: Index {index} from {img_path.name} exceeds formula count {len(formulas)}, skipping...")
		skipped_count += 1
		continue

	# Copy file to destination
	dest_path = DESTINATION_DIR / img_path.name
	shutil.copy2(img_path, dest_path)

	# Create metadata entry
	metadata_entry = {
		"file_name": img_path.name,
		"latex_formula": formulas[index]
	}
	metadata_entries.append(metadata_entry)
	copied_count += 1

	# Progress update every 1000 files
	if copied_count % 1000 == 0:
		print(f"Processed {copied_count} files...")

print(f"\nCopied {copied_count} files successfully")
print(f"Skipped {skipped_count} files")
print(f"Created {len(metadata_entries)} metadata entries")

## 4. Save Metadata to JSONL

Save all metadata entries to a `metadata.jsonl` file (one JSON object per line).

In [8]:
# Save metadata to JSONL file
metadata_path = DESTINATION_DIR / "metadata.jsonl"

with open(metadata_path, 'w', encoding='utf-8') as f:
	for entry in metadata_entries:
		f.write(json.dumps(entry, ensure_ascii=False) + '\n')

print(f"Saved metadata to {metadata_path}")
print(f"Total entries: {len(metadata_entries)}")

Saved metadata to D:\code\math-content-recognition-\examples\train_texteller\dataset\train\metadata.jsonl
Total entries: 9000
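
To make the JSONL layout concrete, here is a small round-trip sketch that writes two entries to an in-memory buffer instead of a file and reads them back. The two sample entries are taken from the verification output in this notebook; the use of `io.StringIO` is an assumption made only to keep the example self-contained:

```python
import io
import json

entries = [
    {"file_name": "0000000.png", "latex_formula": "y = A x + A ^ { 2 }"},
    {"file_name": "0000004.png", "latex_formula": "A \\times B"},
]

# Write: one JSON object per line, each terminated by a newline
buffer = io.StringIO()
for entry in entries:
    buffer.write(json.dumps(entry, ensure_ascii=False) + "\n")

# Read: every line parses on its own, independent of the others
lines = buffer.getvalue().splitlines()
restored = [json.loads(line) for line in lines]

print(len(lines), restored == entries)
```

Note that `json.dumps` escapes each backslash, so a formula such as `A \times B` appears as `A \\times B` in the file and is restored to a single backslash by `json.loads`.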


## 5. Verify Dataset

Verify that the dataset was created correctly by checking file counts and sample entries.

In [9]:
# Count files in destination
dest_images = list(DESTINATION_DIR.glob("*.png"))
print(f"Total PNG files in destination: {len(dest_images)}")

# Read and verify metadata
with open(metadata_path, 'r', encoding='utf-8') as f:
    metadata_lines = f.readlines()

print(f"Total metadata entries: {len(metadata_lines)}")

# Show sample entries
print(f"\nSample metadata entries:")
for i in range(min(5, len(metadata_lines))):
    entry = json.loads(metadata_lines[i])
    print(f"\nEntry {i+1}:")
    print(f"  File: {entry['file_name']}")
    print(f"  Formula: {entry['latex_formula'][:100]}{'...' if len(entry['latex_formula']) > 100 else ''}")

# Verify all files in metadata exist
print("\nVerifying file existence...")
missing_files = []
for line in metadata_lines:
    entry = json.loads(line)
    file_path = DESTINATION_DIR / entry['file_name']
    if not file_path.exists():
        missing_files.append(entry['file_name'])

if missing_files:
    print(f"WARNING: {len(missing_files)} files referenced in metadata but not found:")
    for fname in missing_files[:10]:
        print(f"  - {fname}")
    if len(missing_files) > 10:
        print(f"  ... and {len(missing_files) - 10} more")
else:
    print("✓ All files referenced in metadata exist in the dataset folder")

print("\n✓ Dataset preparation complete!")

Total PNG files in destination: 9000
Total metadata entries: 9000

Sample metadata entries:

Entry 1:
  File: 0000000.png
  Formula: y = A x + A ^ { 2 }


Entry 2:
  File: 0000001.png
  Formula: B _ { n } ( 1 - x ) = ( - 1 ) ^ { n } B _ { n } ( x )


Entry 3:
  File: 0000002.png
  Formula: 0 < x < 1


Entry 4:
  File: 0000003.png
  Formula: A ( B + C ) = A B + A C


Entry 5:
  File: 0000004.png
  Formula: A \times B


Verifying file existence...
✓ All files referenced in metadata exist in the dataset folder

✓ Dataset preparation complete!
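
The check above only confirms that every file named in the metadata exists on disk. If the destination folder can contain PNG files left over from an earlier run, a comparison in both directions may be useful. A minimal sketch using set differences; `compare_names` and the sample filenames are hypothetical:

```python
def compare_names(metadata_names, disk_names):
    """Return (missing_on_disk, missing_in_metadata) as sorted lists."""
    metadata_set = set(metadata_names)
    disk_set = set(disk_names)
    return sorted(metadata_set - disk_set), sorted(disk_set - metadata_set)


# Hypothetical sample: one file lacks an image, one image lacks an entry
missing_on_disk, missing_in_metadata = compare_names(
    ["0000000.png", "0000001.png", "0000002.png"],
    ["0000000.png", "0000001.png", "0000003.png"],
)
print(missing_on_disk)      # ['0000002.png']
print(missing_in_metadata)  # ['0000003.png']
```

In the notebook, the first argument could be built from the `file_name` values in `metadata.jsonl` and the second from `p.name for p in DESTINATION_DIR.glob("*.png")`.
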
