BP-RPC is a lightweight, training-free prompt compression project for accelerating language model inference. It reduces the number of input tokens before inference, aiming to lower prefill latency and total generation time while preserving enough context for reasonable output quality.
The default model is EleutherAI/pythia-70m from HuggingFace. The code is designed for MacBook Air-style environments: CPU first, optional Apple MPS detection, no CUDA, no vLLM, and no training.
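The device-selection logic can be sketched roughly as below; `pick_device` is an illustrative helper name, not necessarily the repository's API.

```python
# Minimal sketch of CPU-first device selection with optional Apple MPS
# detection (helper name is illustrative, not the repository API).
import torch

def pick_device(force_cpu: bool = False) -> str:
    """Detect Apple MPS if available; otherwise fall back to CPU. force_cpu overrides detection."""
    if not force_cpu and torch.backends.mps.is_available():
        return "mps"
    return "cpu"
```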
Example results from a lightweight CPU run show that BP-RPC preserves target-token PPL much better than First-K and Random at low keep ratios, while staying competitive with Last-K and TF-IDF.
The speed-quality trade-off plot is useful for report discussion, but benchmark numbers should be interpreted as example measurements because laptop CPU latency can be noisy.
See docs/results_summary.md for a compact table summary of the example run.
Note: the Random baseline is sensitive to a single seed. For formal runs, average it over multiple --random_seeds to avoid over-interpreting one lucky sample.
The project compares six prompt handling methods:
- Full Prompt: uses the original prompt without compression.
- First-K: keeps sentences from the beginning until the token budget is reached.
- Last-K: keeps sentences from the end until the token budget is reached.
- Random: randomly keeps sentences under the budget with a fixed seed.
- TF-IDF: builds a pseudo-query from the final prompt tokens and keeps sentences most relevant to it.
- BP-RPC: combines boundary preservation, pseudo-query relevance, and recency-aware scoring.
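As a concrete illustration of the budget-based baselines above, a First-K style selection can be sketched as follows; the function name and the `count_tokens` callable are assumptions for illustration, not the repository's API.

```python
# Illustrative sketch of the First-K baseline: keep sentences from the start
# of the prompt until the token budget is exhausted. `count_tokens` is an
# assumed callable (e.g. a tokenizer wrapper), not part of the repository API.
from typing import Callable, List

def first_k(sentences: List[str], budget: int,
            count_tokens: Callable[[str], int]) -> List[str]:
    kept, used = [], 0
    for sentence in sentences:
        cost = count_tokens(sentence)
        if used + cost > budget:
            break
        kept.append(sentence)
        used += cost
    return kept
```

Last-K and Random follow the same budget loop but walk the sentences from the end of the prompt or in a seeded random order.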
BP-RPC keeps the first `keep_head` sentences and the last `keep_tail` sentences, then scores the remaining sentences with:
score_i = alpha * relevance_i + beta * recency_i
where relevance_i is TF-IDF cosine similarity to the pseudo-query and recency_i is larger for sentences closer to the end of the prompt.
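Under these definitions, the scoring step can be sketched roughly as below; the function name, default weights, and the exact recency formula are assumptions for illustration rather than the repository's implementation.

```python
# Hedged sketch of BP-RPC scoring: keep_head / keep_tail sentences are always
# preserved, and the rest are ranked by alpha * relevance + beta * recency.
from typing import List

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def bp_rpc_scores(sentences: List[str], pseudo_query: str,
                  keep_head: int = 1, keep_tail: int = 1,
                  alpha: float = 0.7, beta: float = 0.3) -> np.ndarray:
    """Score sentences by relevance to the pseudo-query plus a recency bonus."""
    n = len(sentences)
    tfidf = TfidfVectorizer().fit_transform(sentences + [pseudo_query])
    relevance = cosine_similarity(tfidf[:n], tfidf[n]).ravel()  # TF-IDF cosine to pseudo-query
    recency = np.arange(n) / max(n - 1, 1)                      # larger for later sentences
    scores = alpha * relevance + beta * recency
    scores[:keep_head] = np.inf                                 # boundary preservation: always keep head...
    if keep_tail > 0:
        scores[n - keep_tail:] = np.inf                         # ...and tail sentences
    return scores
```

The highest-scoring sentences would then be kept, in their original order, until the token budget is reached.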
Install dependencies:

```bash
pip install -r requirements.txt
```

Perplexity is computed only on target tokens; the prompt portion of the labels is masked with -100.
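A minimal sketch of this target-only perplexity computation, assuming the standard HuggingFace causal LM interface (the helper name is illustrative):

```python
# Sketch of target-only perplexity: prompt label positions are set to -100 so
# the cross-entropy loss (and therefore PPL) only covers the target tokens.
import math
import torch

@torch.no_grad()
def target_ppl(model, prompt_ids: torch.Tensor, target_ids: torch.Tensor) -> float:
    input_ids = torch.cat([prompt_ids, target_ids]).unsqueeze(0)
    labels = input_ids.clone()
    labels[:, : prompt_ids.size(0)] = -100   # mask the prompt portion
    loss = model(input_ids=input_ids, labels=labels).loss
    return math.exp(loss.item())
```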
Run the perplexity evaluation:

```bash
python scripts/run_eval.py --max_samples 10 --prompt_len 1024 --target_len 128
```

For a faster first run on MacBook Air:

```bash
python scripts/run_eval.py --max_samples 5 --prompt_len 512 --target_len 64 --device cpu
```

For a more stable Random baseline:

```bash
python scripts/run_eval.py --max_samples 20 --prompt_len 512 --target_len 64 --device cpu --pair_mode sentence --random_seeds 1 2 3 4 5
```

`--pair_mode sentence` makes the target start at a sentence boundary, which is a better match for sentence-level prompt compression. Use `--pair_mode token` to reproduce fixed token slicing.
If HuggingFace downloads fail because of SSL or network interruptions, try a mirror or a local model directory:
```bash
# Option 1: use a mirror endpoint
export HF_ENDPOINT=https://hf-mirror.com
python scripts/run_eval.py --max_samples 5 --prompt_len 512 --target_len 64 --device cpu

# Option 2: download the model locally and run offline
huggingface-cli download EleutherAI/pythia-70m --local-dir models/pythia-70m
python scripts/run_eval.py --model_name models/pythia-70m --max_samples 5 --prompt_len 512 --target_len 64 --device cpu --local_files_only
```

Generation uses greedy decoding with `do_sample=False`.
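A minimal sketch of this greedy decoding setup with the standard transformers API (the prompt string is a placeholder):

```python
# Greedy decoding sketch: do_sample=False makes generation deterministic.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-70m")

inputs = tokenizer("Compressed prompt goes here.", return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=32, do_sample=False)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```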
Run the latency benchmark:

```bash
python scripts/run_benchmark.py --max_samples 5 --prompt_len 1024 --max_new_tokens 32
```

For a faster first run:

```bash
python scripts/run_benchmark.py --max_samples 3 --prompt_len 512 --max_new_tokens 16 --device cpu
```

After results/eval_results.csv or results/benchmark_results.csv exists, generate report-ready figures with:

```bash
python scripts/plot_results.py
```

Figures are saved to results/figures/ by default. The script creates PPL, loss, compressed-token, latency, throughput, speedup, and speedup-vs-PPL trade-off plots when the required CSV columns are available.
For PPL with large outliers, median plots are often more informative than mean plots:
```bash
python scripts/plot_results.py --agg median
```

To summarize results per sample (for example, which method wins on each sample):

```bash
python scripts/summarize_results.py
```

If PPL outliers make the plot hard to read, clip the y-axis data:

```bash
python scripts/plot_results.py --max_ppl 200
```

Output files:

- results/eval_results.csv: perplexity results with method, keep ratio, compressed prompt tokens, loss, and PPL.
- results/benchmark_results.csv: generation timing results with compressed prompt tokens, total time, time per output token, and throughput.
- results/figures/: generated PNG/PDF/SVG figures from the result CSV files.
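For quick inspection outside the plotting script, a hedged pandas sketch like this can compare mean and median PPL per method; the column names (`method`, `ppl`) are assumptions based on the description above.

```python
# Compare mean vs median PPL per method; medians are robust to a few outliers.
import pandas as pd

df = pd.read_csv("results/eval_results.csv")
print(df.groupby("method")["ppl"].agg(["mean", "median"]))
```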
Useful report tables and plots:
- PPL vs keep ratio: compare quality degradation as prompts become shorter.
- Compressed tokens vs latency: show how token count affects inference time.
- Speedup vs quality trade-off: compare latency reduction against PPL increase.
- Winner counts: use `scripts/summarize_results.py` to check which method wins per sample, since mean PPL can be dominated by a few outlier samples (see the sketch after this list).
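A hedged pandas sketch of such a per-sample winner count; the column names (`sample_id`, `method`, `ppl`) are assumptions, and `scripts/summarize_results.py` remains the supported way to produce this table.

```python
# Count how often each method achieves the lowest PPL on a sample.
import pandas as pd

df = pd.read_csv("results/eval_results.csv")
winners = df.loc[df.groupby("sample_id")["ppl"].idxmin(), "method"]
print(winners.value_counts())
```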
- MacBook Air users should start with `--max_samples 5`.
- If MPS has compatibility issues, force CPU with `--device cpu`.
- If HuggingFace access fails, set `HF_ENDPOINT=https://hf-mirror.com` or pass a local model path with `--model_name`.
- Dataset loading falls back to built-in English long texts if the HuggingFace download fails.
- PPL is computed only on target tokens, not on the prompt.
- This is an inference-only experiment; it does not train or fine-tune the model.

