An end-to-end RoBERTa-based transformer model for determining whether a text snippet belongs to a specific book.
This project trains a binary classifier to predict if two text chunks come from the same book (positive) or different books (negative). It uses a RoBERTa-base transformer fine-tuned on pairs of text chunks from a collection of books.
## Setup

```bash
# Activate virtual environment
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt
```

## Data Preparation

The training data consists of book text files in the `training/` directory. Each book is split into overlapping chunks, and positive/negative pairs are created:
- Positive pairs: Two chunks from the same book
- Negative pairs: Two chunks from different books
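A minimal sketch of how such pairs can be built, assuming chunks are grouped by book (function and variable names here are illustrative, not the actual `prepare_data.py` API):

```python
import random

def make_pairs(chunks_by_book: dict[str, list[str]], n_pairs: int) -> list[tuple[str, str, int]]:
    """Build (chunk_a, chunk_b, label) tuples: label 1 = same book, 0 = different."""
    books = list(chunks_by_book)
    pairs = []
    for _ in range(n_pairs):
        if random.random() < 0.5:
            # Positive pair: two distinct chunks from one book
            book = random.choice(books)
            a, b = random.sample(chunks_by_book[book], 2)
            pairs.append((a, b, 1))
        else:
            # Negative pair: one chunk from each of two different books
            book_a, book_b = random.sample(books, 2)
            pairs.append((random.choice(chunks_by_book[book_a]),
                          random.choice(chunks_by_book[book_b]), 0))
    return pairs
```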
Run the preparation script:

```bash
python prepare_data.py
```

This will:
- Load all books from `training/`
- Split them into chunks
- Create train/val/test splits
- Generate positive and negative pairs
- Save processed data to `data/processed/`
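Overlapping chunking keeps adjacent context in neighboring chunks. A minimal sketch (chunk and overlap sizes are illustrative; the script's actual parameters may differ):

```python
def chunk_text(words: list[str], chunk_size: int = 256, overlap: int = 64) -> list[str]:
    """Split a list of words into overlapping chunks of roughly chunk_size words."""
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunk = words[start:start + chunk_size]
        if chunks and len(chunk) < chunk_size // 2:
            break  # drop a tiny trailing remainder
        chunks.append(" ".join(chunk))
    return chunks
```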
## Training

Train the model with default settings:

```bash
python train.py
```

Customize training:

```bash
python train.py \
--model roberta-base \
--output models/book_matcher \
--epochs 5 \
--batch-size 32 \
--lr 2e-5
```

For better style-similarity learning, use the contrastive approach:

```bash
python train_contrastive.py --model roberta-base --epochs 3 --batch-size 16
```

The contrastive model includes:
- Siamese architecture with twin encoders for style embedding
- Contrastive loss to learn fine-grained distinctions
- Style features: type-token ratio, sentence complexity, punctuation patterns, formality
- 512-token context with hierarchical pooling
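Conceptually, the twin-encoder setup looks like the sketch below. This illustrates the idea with standard Transformers and PyTorch calls (using plain mean pooling rather than the hierarchical pooling listed above); it is not the exact code in `train_contrastive.py`:

```python
import torch
import torch.nn.functional as F
from transformers import RobertaModel, RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
encoder = RobertaModel.from_pretrained("roberta-base")  # shared weights = "twin" encoders

def embed(texts: list[str]) -> torch.Tensor:
    """Mean-pool the last hidden states over non-padding tokens."""
    enc = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")
    hidden = encoder(**enc).last_hidden_state     # (batch, seq, dim)
    mask = enc["attention_mask"].unsqueeze(-1)    # (batch, seq, 1)
    return (hidden * mask).sum(1) / mask.sum(1)   # (batch, dim)

def contrastive_loss(z1: torch.Tensor, z2: torch.Tensor, label: torch.Tensor,
                     margin: float = 1.0) -> torch.Tensor:
    """Pull same-book pairs together; push different-book pairs at least `margin` apart."""
    dist = F.pairwise_distance(z1, z2)
    return (label * dist.pow(2) + (1 - label) * F.relu(margin - dist).pow(2)).mean()
```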
Customize:

```bash
python train_contrastive.py \
--model roberta-base \
--output models/book_matcher_contrastive \
--epochs 5 \
--batch-size 16 \
--lr 2e-5 \
--calibrate-for f1 \
--target-acc 0.85 --target-recall 0.80 \
--no-style-features   # Disable style features if desired
```

## Inference

Make predictions on new text pairs:

```bash
python inference.py \
--model models/book_matcher/final \
--text1 "First text chunk here..." \
--text2 "Second text chunk here..."Or use in Python:
from inference import BookMatcher
matcher = BookMatcher('models/book_matcher/final')
result = matcher.predict(text1, text2)
if result['same_book']:
    print(f"Texts belong to the same book (confidence: {result['confidence']:.2%})")
```

For contrastive models:

```bash
python inference_contrastive.py \
--model models/book_matcher_contrastive/final \
--text1 "First text chunk here..." \
--text2 "Second text chunk here..."
### Calibration-Only (no retraining)
Re-optimize temperature and decision threshold on the validation split without retraining:
```bash
python calibrate_contrastive.py \
--model models/book_matcher_contrastive/final \
--data data/processed \
--calibrate-for f1 \
--target-acc 0.85 --target-recall 0.80
```

This writes `calibration.json` next to the `final/` directory.
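Under the hood, this style of calibration typically fits a temperature on validation logits and then sweeps a decision threshold for the chosen metric. A minimal sketch of both steps (not the actual `calibrate_contrastive.py` internals):

```python
import numpy as np
from scipy.optimize import minimize_scalar
from sklearn.metrics import f1_score

def fit_temperature(logits: np.ndarray, labels: np.ndarray) -> float:
    """Find the temperature T minimizing the NLL of sigmoid(logits / T)."""
    def nll(t: float) -> float:
        p = np.clip(1.0 / (1.0 + np.exp(-logits / t)), 1e-7, 1 - 1e-7)
        return float(-np.mean(labels * np.log(p) + (1 - labels) * np.log(1 - p)))
    return minimize_scalar(nll, bounds=(0.05, 10.0), method="bounded").x

def best_threshold(probs: np.ndarray, labels: np.ndarray) -> float:
    """Sweep thresholds and keep the one with the best F1 on the validation split."""
    grid = np.linspace(0.05, 0.95, 91)
    return float(max(grid, key=lambda t: f1_score(labels, probs >= t)))
```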
## Run on Modal
Run data prep and training on Modal with persistent storage and optional GPU.
Prerequisites:
- Install CLI: `pip install modal`
- Authenticate: `modal token new`
Quick start (baseline model):
- Create a Volume for raw training texts and upload files:
- `modal volume create writing7-training`
- `modal volume put writing7-training training`
- Prepare datasets from the training Volume:
- `modal run modal_app.py::prepare` (reads from `/input/training` on the `writing7-training` Volume)
- Train on CPU:
- `modal run modal_app.py::train -- --model roberta-base --epochs 3`
- Train on GPU (recommended):
- `modal run modal_app.py::train_gpu -- --model roberta-base --epochs 3`
- Pipeline (prepare → CPU train):
- `modal run modal_app.py::pipeline -- --model roberta-base`
Contrastive model:
- CPU: `modal run modal_app.py::train_contrastive -- --model roberta-base --epochs 3`
- GPU: `modal run modal_app.py::train_contrastive_gpu -- --model roberta-base --epochs 3`
- Calibrate: `modal run modal_app.py::calibrate_contrastive -- --model-dir /vol/models/book_matcher_contrastive/final --calibrate-for f1 --target-acc 0.85 --target-recall 0.80`
Details:
- Raw input Volume: `writing7-training` mounted at `/input` (contains your `.txt` books)
- Artifact Volume: `writing7-artifacts` mounted at `/vol`
- Datasets: `/vol/data/processed`
- Models/logs: `/vol/models/...`
- HF cache: `/vol/hf`
- The image bakes this repo into `/workspace` and attempts to download the spaCy `en_core_web_sm` model; if it is unavailable, the code falls back to a simple sentence splitter.
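The fallback follows the usual try-spaCy-first pattern; a sketch of the idea (the repo's actual splitter may differ):

```python
import re

def split_sentences(text: str) -> list[str]:
    try:
        import spacy
        nlp = spacy.load("en_core_web_sm")
        return [s.text.strip() for s in nlp(text).sents]
    except (ImportError, OSError):
        # Fallback: naive split on sentence-ending punctuation
        return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
```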
Data flow:
- `prepare` reads raw `.txt` files from `/input`, writes processed datasets to `/vol/data/processed`.
- `train` and `train_*` read datasets from `/vol/data/processed` and write models to `/vol/models/...`.
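For orientation, a Modal function wired to these Volumes looks roughly like the following. This is a sketch of the pattern, not the verbatim contents of `modal_app.py`, and the `--input`/`--output` flags are illustrative:

```python
import modal

app = modal.App("writing7")
training_vol = modal.Volume.from_name("writing7-training")
artifacts_vol = modal.Volume.from_name("writing7-artifacts")
image = modal.Image.debian_slim().pip_install_from_requirements("requirements.txt")

@app.function(image=image, volumes={"/input": training_vol, "/vol": artifacts_vol})
def prepare():
    import subprocess
    # Read raw books from /input/training, write datasets to /vol/data/processed
    subprocess.run(
        ["python", "prepare_data.py",
         "--input", "/input/training", "--output", "/vol/data/processed"],
        check=True, cwd="/workspace",
    )
    artifacts_vol.commit()  # persist writes to the artifact Volume
```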
Deploy and call (optional):
- `modal deploy modal_app.py`
- Then:
- `modal call writing7.prepare --training-dir training`
- `modal call writing7.train --model roberta-base`
- `modal call writing7.train_gpu --model roberta-base`
- `modal call writing7.train_contrastive --model roberta-base`
- `modal call writing7.train_contrastive_gpu --model roberta-base`
GPU guidance:
- Use GPU for any substantial training run (T4 default). CPU is fine for smoke tests or tiny datasets.
## Model Architecture
### Standard Classifier
- **Base Model**: RoBERTa-base (12-layer transformer)
- **Task**: Binary sequence classification
- **Input**: Two text chunks (concatenated with special tokens)
- **Output**: Probability that chunks are from the same book
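With Hugging Face Transformers, that pair input is just the tokenizer's two-argument form; a minimal sketch:

```python
import torch
from transformers import RobertaForSequenceClassification, RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
model = RobertaForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

# Pair encoding inserts separators: <s> chunk1 </s></s> chunk2 </s>
enc = tokenizer("First chunk ...", "Second chunk ...",
                truncation=True, max_length=512, return_tensors="pt")
with torch.no_grad():
    probs = model(**enc).logits.softmax(-1)  # [p(different), p(same)]
```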
### Contrastive Model (Recommended)
- **Architecture**: Siamese twin encoders (shared RoBERTa weights)
- **Features**:
  - Embeddings: Mean-pooled RoBERTa hidden states
  - Style features: Type-token ratio, sentence length, punctuation frequency, formality
  - Concatenation: Embeddings + style features → classifier
- **Loss**: Classification loss + contrastive loss (10% weight)
- **Context**: 512 tokens per text
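The style features are standard stylometric measurements. For example, type-token ratio, average sentence length, and punctuation frequency can be computed as below (an illustration; the repo's exact feature set and formality measure may differ):

```python
import re

def style_features(text: str) -> list[float]:
    words = re.findall(r"[A-Za-z']+", text.lower())
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    type_token_ratio = len(set(words)) / max(len(words), 1)
    avg_sentence_len = len(words) / max(len(sentences), 1)
    punct_freq = sum(text.count(p) for p in ",;:!?()'\"") / max(len(text), 1)
    return [type_token_ratio, avg_sentence_len, punct_freq]
```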
## Evaluation

The model is evaluated on:

- **Accuracy**: Overall correctness
- **F1 Score**: Harmonic mean of precision and recall
- **AUC**: Area under the ROC curve
- **Precision/Recall**: Detailed classification metrics

Example prediction output:

```
Prediction: SAME BOOK
Confidence: 87.32%
Probability: 0.8732
```
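These metrics can be computed with scikit-learn from the test-set probabilities; a minimal sketch:

```python
from sklearn.metrics import (accuracy_score, precision_recall_fscore_support,
                             roc_auc_score)

def evaluate(labels, probs, threshold: float = 0.5) -> dict:
    preds = probs >= threshold
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average="binary")
    return {
        "accuracy": accuracy_score(labels, preds),
        "f1": f1,
        "auc": roc_auc_score(labels, probs),
        "precision": precision,
        "recall": recall,
    }
```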
## Publish Trained Models to Hugging Face
You already have the trained checkpoints; publish them so anyone can run the benchmark with your scorer.
What to publish (from this repo):
- Contrastive matcher (recommended metric): `models/book_matcher_contrastive/final` plus sibling files `calibration.json` and optional `style_calibration.json`.
- Cross‑encoder baseline (optional): `models/book_matcher/final`.
Two recommended model repos on the Hub (replace `your-org`):
- `your-org/writing7-book-matcher-contrastive-v1` → upload the entire folder `models/book_matcher_contrastive/` so that the repo root contains `final/`, `calibration.json`, and (optionally) `style_calibration.json`.
- `your-org/writing7-book-matcher-cross-encoder-v1` → upload the folder `models/book_matcher/` so that the repo root contains `final/`.
Why this layout: our inference expects `--model_dir <path>/final` and reads calibration from the parent directory. Keeping `final/` as a subfolder in the Hub repo preserves this layout after a snapshot download.
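In other words, a scorer pointed at a downloaded snapshot resolves its paths like this (a sketch):

```python
from pathlib import Path

model_dir = Path("external/writing7-book-matcher-contrastive-v1/final")
calibration = model_dir.parent / "calibration.json"              # sibling of final/
style_calibration = model_dir.parent / "style_calibration.json"  # optional
```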
### One‑time setup
1) `pip install huggingface_hub git-lfs`
2) `huggingface-cli login` (paste your User or Org write token)
### Push your models
Use the helper script:
```bash
python scripts/push_to_hf.py \
--repo-id your-org/writing7-book-matcher-contrastive-v1 \
--source-dir models/book_matcher_contrastive
python scripts/push_to_hf.py \
--repo-id your-org/writing7-book-matcher-cross-encoder-v1 \
--source-dir models/book_matcher
```
If your artifacts live in a Modal volume, first pull them locally, e.g.:

```bash
# Example: adjust volume name and paths as needed
modal volume get writing7-artifacts /vol/models/book_matcher_contrastive ./models/book_matcher_contrastive
modal volume get writing7-artifacts /vol/models/book_matcher ./models/book_matcher
```

### Download and run the benchmark

In any environment with the repo and a GPU:

```bash
pip install huggingface_hub
python scripts/download_hf_model.py \
--repo-id your-org/writing7-book-matcher-contrastive-v1 \
--dest ./external/writing7-book-matcher-contrastive-v1
# Run a small benchmark using the downloaded snapshot
python -m eval.benchmark_style \
--model openai:gpt-4o-mini \
--book_path eval/books/gatsby.txt \
--n_excerpts 2 \
--n_samples 1 \
--model_dir ./external/writing7-book-matcher-contrastive-v1/final
```

Notes:
- `eval/benchmark_style.py` uses the contrastive matcher locally and will automatically pick up `calibration.json` placed next to `final/`.
- If you saved a `style_calibration.json`, place it alongside `calibration.json` before pushing. The scorer will use it when present to map cosine similarity to [0, 1].
## Project Structure

```
writing7/
├── training/            # Raw book text files
├── data/
│   └── processed/       # Processed datasets
├── models/
│   └── book_matcher/    # Trained model
├── prepare_data.py      # Data preparation script
├── train.py             # Training script
├── inference.py         # Inference script
└── requirements.txt     # Dependencies
```