PIP-Manuscripts-Processor is a modular preprocessing and analysis pipeline developed within the Peirce Interprets Peirce project.
It enables structured access to the digitised manuscripts of Charles S. Peirce and supports downstream tasks such as visual classification, diagram recognition, and semantic annotation.
- Extracts structured metadata from Harvard’s Houghton Library IIIF manifests
- Downloads and organises manuscript pages by Robin’s classification system
- Identifies and classifies manuscript pages into `text`, `diagram_mixed`, and `cover`
- Computes CLIP embeddings for all pages to support downstream ML tasks
- Generates derivative datasets (e.g. only diagram-rich pages) for layout detection
- Provides UMAP visualisation for interpretability and quality control
- Prepares outputs for semantic reinjection into IIIF using `oa:Annotation`
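
The reinjection step listed above can be illustrated with a minimal sketch. It assumes manifests follow the IIIF Presentation API 2.x layout (`sequences` containing `canvases`) and that a predicted page class is attached to its canvas as an `oa:Tag`. The function names, the example canvas URIs, and the choice of the `oa:tagging` motivation are illustrative assumptions, not the project's actual implementation.

```python
import json


def iter_canvases(manifest):
    """Yield (canvas_id, label) pairs from an IIIF Presentation 2.x manifest dict."""
    for sequence in manifest.get("sequences", []):
        for canvas in sequence.get("canvases", []):
            yield canvas["@id"], canvas.get("label", "")


def make_page_annotation(canvas_id, page_class):
    """Wrap a predicted page class as an oa:Annotation targeting one canvas."""
    return {
        "@context": "http://iiif.io/api/presentation/2/context.json",
        "@type": "oa:Annotation",
        "motivation": "oa:tagging",  # assumed motivation for class labels
        "resource": {"@type": "oa:Tag", "chars": page_class},
        "on": canvas_id,
    }


# Hypothetical, heavily trimmed manifest; real manifests carry many more fields.
manifest = {
    "@type": "sc:Manifest",
    "sequences": [
        {
            "canvases": [
                {"@id": "https://example.org/iiif/canvas/1", "label": "seq. 1"},
                {"@id": "https://example.org/iiif/canvas/2", "label": "seq. 2"},
            ]
        }
    ],
}

# Hypothetical classifier output, keyed by canvas URI.
predicted = {"https://example.org/iiif/canvas/2": "diagram_mixed"}

# Only canvases with a prediction receive an annotation.
annotations = [
    make_page_annotation(canvas_id, predicted[canvas_id])
    for canvas_id, _label in iter_canvases(manifest)
    if canvas_id in predicted
]

print(json.dumps(annotations, indent=2))
```

Keeping each annotation as a plain dictionary means the result can be serialised with `json` and collected into an annotation list without extra dependencies.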

Clone the repository and install the dependencies:

    git clone https://github.com/friendlynihilist/PIP-Manuscripts-Processor.git
    cd PIP-Manuscripts-Processor
    pip install -r requirements.txt

Example: extract CLIP embeddings for the full corpus:

    python src/features/generate_clip_embeddings_full.py

Run the classification pipeline on training/test sets:

    python src/classification/train_logistic_clip.py

Generate UMAP plots from CLIP vectors and Robin categories:

    python src/visualisation/umap_diagram_by_category.py

The repository is organised as follows:

- `data/raw/Manuscripts/`: original image files, organised by category and item ID
- `data/processed/`: metadata files, CSVs, embeddings, classification results
- `data/derived/`: generated subsets, e.g. layout-ready diagram pages
- `src/`: all scripts grouped by function (`features`, `classification`, `visualisation`, `layout`)

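
As a rough illustration of how the `data/raw/Manuscripts/` layout described above could be turned into a page index, the sketch below walks a directory tree and emits one record per image. It assumes a nesting of `<category>/<item_id>/<image file>`, which is one plausible reading of "organised by category and item ID". The category name `Logic`, the item ID `MS_0455`, the file name, and the list of image suffixes are all hypothetical.

```python
import csv
import sys
import tempfile
from pathlib import Path

# Assumed image formats; adjust to whatever the corpus actually contains.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".tif", ".tiff"}


def index_pages(root):
    """Return one record per image stored under root/<category>/<item_id>/."""
    root = Path(root)
    records = []
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix.lower() not in IMAGE_SUFFIXES:
            continue
        parts = path.relative_to(root).parts
        if len(parts) != 3:
            continue  # skip files that do not match the assumed layout
        category, item_id, filename = parts
        records.append(
            {"category": category, "item_id": item_id, "filename": filename}
        )
    return records


# Demonstration on a throwaway tree, so the sketch runs without the real corpus.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "data" / "raw" / "Manuscripts"
    page = root / "Logic" / "MS_0455" / "page_0001.jpg"  # hypothetical names
    page.parent.mkdir(parents=True)
    page.touch()
    records = index_pages(root)

# Write the index as CSV to standard output.
writer = csv.DictWriter(sys.stdout, fieldnames=["category", "item_id", "filename"])
writer.writeheader()
writer.writerows(records)
```

An index of this kind could be joined with embeddings or classification results by category and item ID, though the actual files in `data/processed/` may use a different schema.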
MIT License