
chaiboo/pathless-maps


pathless-maps

A reading of The Theosophist, a monthly journal founded by H. P. Blavatsky in 1879 and still published from the Theosophical Society's headquarters in Adyar, Chennai. The corpus spans 1879–1952 and 2007–2024, with a 55-year archive gap in between. It comprises 895 issues (948 monthly data points after bound-volume splitting), 72,990 text chunks, and about 35.6M words.

The title is from Krishnamurti's 1929 dissolution speech: "Truth is a pathless land."

The essay

viz/index.html is a long-form reading of the corpus, built as a static page. It moves through five editorial eras: Blavatsky (1879–1907), Olcott (1907–1908), Besant (1908–1933), Post-Besant (1934–1973), and Modern (2007–2024). Across them it tracks genre composition, tradition frequency, Mahatma vocabulary, geographic attention, concept drift, and stylometric signature. The page reads its data from viz/data/*.json.

Run locally:

cd viz
python -m http.server 8000
open http://localhost:8000

Method

Two passes over the chunks.

Classification. Gemini 2.5 Flash (Vertex AI batch) assigns each chunk a six-field label set: traditions (from an 11-category taxonomy), occult concepts, Mahatma references (presence and rhetorical function), locations, section type, primary topic. Schema and prompts in pipeline/prepare_batch.py. 72,988 of 72,990 chunks classified (99.997%); two fatal failures after retry. The essay uses the 71,254 chunks that survived downstream length and confidence filters.
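As a rough illustration of the label set and the coverage figures above, here is a minimal sketch. The `ChunkLabel` class and its field names are hypothetical, chosen to mirror the six fields named in the prose; the real schema is in pipeline/prepare_batch.py and may differ.

```python
from dataclasses import dataclass


@dataclass
class ChunkLabel:
    """Hypothetical six-field label; names are illustrative, not the repo's schema."""
    traditions: list          # drawn from the 11-category taxonomy
    occult_concepts: list
    mahatma_references: dict  # assumed shape: presence plus rhetorical function
    locations: list
    section_type: str
    primary_topic: str


def coverage_pct(classified: int, total: int) -> float:
    """Share of chunks that received a label, as a percentage."""
    return 100.0 * classified / total


TOTAL_CHUNKS = 72_990       # chunks in the corpus
CLASSIFIED = 72_988         # chunks labelled after retry
USED_IN_ESSAY = 71_254      # chunks surviving length and confidence filters

print(f"classified: {coverage_pct(CLASSIFIED, TOTAL_CHUNKS):.3f}%")
print(f"dropped by downstream filters: {CLASSIFIED - USED_IN_ESSAY}")
```

The counts are the ones reported above; only the arithmetic is shown, not the filters themselves.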

Embeddings. text-embedding-004 (Vertex AI batch) produces 768-dim vectors per chunk. build_embeddings.py fits UMAP and PCA projections, runs concept drift (decade-to-decade cosine distance from a per-term baseline on a strict-clean subset), builds sense clusters for 12 focal terms, and runs classification probing via logistic regression with chance baselines reported inline.
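The drift measure can be sketched in a few lines. This is not the code in build_embeddings.py: it assumes the per-term baseline is the centroid of the term's vectors in the earliest decade, and that each later decade is also summarized by its centroid. The function names and the toy 2-dimensional vectors are illustrative only (the real vectors are 768-dimensional).

```python
import math


def cosine_distance(u, v):
    """1 minus the cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def drift_from_baseline(vectors_by_decade):
    """Map each decade to its cosine distance from the baseline centroid.

    Assumption: the baseline is the earliest decade present for the term.
    """
    decades = sorted(vectors_by_decade)
    baseline = centroid(vectors_by_decade[decades[0]])
    return {
        decade: cosine_distance(centroid(vectors_by_decade[decade]), baseline)
        for decade in decades
    }


# Toy embeddings for one focal term, keyed by decade.
toy = {
    1880: [[1.0, 0.0], [1.0, 0.0]],
    1890: [[1.0, 1.0]],
    1900: [[0.0, 1.0]],
}
drift = drift_from_baseline(toy)
```

A distance of 0 means the decade's centroid points the same way as the baseline; larger values mean the term's typical context has moved further from it.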

The 18 JSON feeds in viz/data/ are the deliverable of these two passes: tradition, concept, Mahatma, and location frequencies per era; TF-IDF by era; co-occurrence networks; stylometric measures per issue; semantic clusters; concept drift trajectories; similarity outliers; UMAP projection.
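For inspecting the feeds outside the browser, a small loader along these lines may help. `load_feeds` is a hypothetical helper, not part of the repo, and it makes no assumption about individual file names or their internal structure.

```python
import json
from pathlib import Path


def load_feeds(data_dir):
    """Load every *.json file in data_dir into a dict keyed by file stem."""
    feeds = {}
    for path in sorted(Path(data_dir).glob("*.json")):
        with path.open(encoding="utf-8") as fh:
            feeds[path.stem] = json.load(fh)
    return feeds
```

Calling `load_feeds("viz/data")` from the repo root should return one entry per feed file.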

Repo layout

| Path | Purpose |
| --- | --- |
| viz/index.html | The essay. |
| viz/data/*.json | Analysis feeds consumed by the essay. |
| build_analysis.py | Builds classification-derived feeds from classifications_v2.jsonl. |
| build_embeddings.py | Builds embedding-derived feeds (UMAP, PCA, drift, clusters, probing). |
| split_bound_volumes.py | Redistributes multi-month bound-volume chunks across month slots. |
| submit_embeddings.py | Submits the corpus to Vertex AI text-embedding-004 in two batches across two GCP projects. |
| pipeline/ | Scrape, OCR, chunk, prepare batch, collect, retry. |
| DATA.md | Raw-data provenance and pipeline run order. |

Raw PDFs, chunk jsonl, batch requests, and embedding results live outside git (see .gitignore). DATA.md records where they sit on disk.
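The bound-volume step handled by split_bound_volumes.py can be pictured with a minimal sketch. It assumes chunks are kept in reading order and divided into contiguous, near-equal runs, one per month covered by the volume; the actual script may use a different rule, and `split_across_months` is a hypothetical name.

```python
def split_across_months(chunks, months):
    """Assign ordered chunks to month slots in contiguous, near-equal runs.

    Assumption: earlier months receive the extra chunk when the count
    does not divide evenly.
    """
    if not months:
        raise ValueError("at least one month slot is required")
    base, extra = divmod(len(chunks), len(months))
    slots = {}
    start = 0
    for i, month in enumerate(months):
        size = base + (1 if i < extra else 0)
        slots[month] = chunks[start:start + size]
        start += size
    return slots


# Seven chunks from a hypothetical three-month bound volume.
slots = split_across_months(list(range(7)), ["1885-01", "1885-02", "1885-03"])
```

Splitting of this kind is what turns 895 issues into 948 monthly data points: one bound volume contributes several month slots.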

Stack

  • Gemini 2.5 Flash (Vertex AI batch prediction)
  • text-embedding-004 (Vertex AI batch prediction)
  • scikit-learn (PCA, KMeans, logistic-regression probing); UMAP via a separate package such as umap-learn; numpy, pandas
  • D3 v7 for custom layouts; Leaflet for the geographic map

Credentials

Environment variables:

export GOOGLE_API_KEY=...
export WEBSHARE_API_KEY=...   # optional, for proxy fallback in pipeline/download_corpus.py

Pipeline run order is documented in DATA.md.
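A script can check these variables up front rather than failing mid-run. The sketch below is an assumption about how that might look, not the repo's code: `load_credentials` is a hypothetical helper that treats GOOGLE_API_KEY as required and WEBSHARE_API_KEY as optional, matching the comment above.

```python
import os


def load_credentials(env=None):
    """Read API keys from the environment (or a supplied mapping)."""
    env = os.environ if env is None else env
    google_key = env.get("GOOGLE_API_KEY")
    if not google_key:
        raise RuntimeError("GOOGLE_API_KEY is not set")
    return {
        "google_api_key": google_key,
        # Optional: only needed for the proxy fallback.
        "webshare_api_key": env.get("WEBSHARE_API_KEY") or None,
    }
```

Passing a plain dict as `env` makes the check easy to exercise without touching the real environment.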
