Zelin He, Boran Han, Xiyuan Zhang, Shuai Zhang, Haotian Lin, Qi Zhu, Haoyang Fang,
Danielle C. Maddix, Abdul Fatir Ansari, Akash Chandrayan, Abhinav Pradhan, Bernie Wang, Matthew Reimherr
Official implementation of the SenTSR-Bench knowledge injection framework.
Inject in-domain knowledge from fine-tuned time-series specialists into frozen general reasoning LMs
for robust, context-aware diagnostic reasoning.
(a) A fine-tuned TSLM captures key time-series patterns but fails on diagnostic reasoning; (b) A general-purpose GRLM reasons well but overlooks domain-specific patterns; (c) Our knowledge injection steers the GRLM's reasoning with in-domain knowledge from the TSLM, producing the correct diagnosis.
General reasoning LMs (GRLMs) show strong reasoning but lack domain-specific time-series knowledge. Time-series specialist LMs (TSLMs) capture signal patterns but struggle with multi-step diagnostic reasoning. Our framework bridges this gap by injecting TSLM-generated insights directly into the GRLM's reasoning trace — no weight updates needed.
(a) Knowledge injection paradigm; (b) RL-honed thinking traces via RLVR enable effective injection without human supervision.
- New Paradigm — A framework that injects in-domain knowledge from a TSLM into a GRLM's reasoning process, steering reasoning with domain knowledge
- RL-Based Injection — Reinforcement learning with verifiable rewards (RLVR) elicits knowledge-rich thinking traces without manual supervision
- SenTSR-Bench — A new benchmark of 110 real-world multivariate sensor streams with 330 human-curated diagnostic questions
- Strong Results — Surpasses TSLMs by 9.1%–26.1% and GRLMs by 7.9%–22.4% across benchmarks
- Python 3.10+
- Conda for environment management
- AWS account with access to Claude models via Bedrock (for closed-source experiments)
- GPU for running self-hosted model servers (8xA100 recommended)
```bash
git clone https://github.com/ZLHe0/TSR_knowledge_injection.git
cd TSR_knowledge_injection
conda create -n tsr-env python=3.10
conda activate tsr-env
pip install -r requirements.txt

# (For Claude experiments) Configure AWS credentials
aws configure
```

SenTSR-Bench is a first-of-its-kind dataset of 110 multivariate sensor streams with 330 human-curated diagnostic questions, built from real-world industrial operations. Each time series contains 3 sensor channels (acceleration, velocity, temperature).
The benchmark evaluates a three-stage diagnostic reasoning chain:
| Stage | Task | Description |
|---|---|---|
| What Happened | Anomaly Characterization | Identify key time-series anomaly patterns |
| How Happened | Root-Cause Diagnosis | Determine the most likely causes |
| Suggested Fix | Action Recommendation | Propose corrective actions |
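For intuition, a single benchmark entry can be pictured as one 3-channel sensor stream paired with one question per reasoning stage. The sketch below is purely illustrative — the field names are hypothetical; the actual schema is documented in dataset/README.md:

```python
# Hypothetical sketch of one SenTSR-Bench entry. Field names are
# illustrative; see dataset/README.md for the real format.
entry = {
    "series_id": "pump_017",
    "channels": {                      # 3 sensor channels per stream
        "acceleration": [0.02, 0.03, 0.41, 0.39],
        "velocity":     [1.10, 1.12, 2.95, 2.90],
        "temperature":  [54.0, 54.2, 61.5, 62.0],
    },
    "questions": [
        {"stage": "what_happened",  "task": "anomaly_characterization"},
        {"stage": "how_happened",   "task": "root_cause_diagnosis"},
        {"stage": "suggested_fix",  "task": "action_recommendation"},
    ],
}

# One question per stage for each of the 110 streams gives 330 questions.
assert len(entry["questions"]) == 3
```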
Download the evaluation benchmark from HuggingFace:
```bash
# Install huggingface_hub if needed
pip install huggingface_hub

# Download dataset to ./dataset/
huggingface-cli download ZLHe0/SenTSR-Bench --repo-type dataset --local-dir ./dataset
```

See dataset/README.md for format specifications.
We additionally evaluate on two public benchmarks: TSEvol (Dataset A) and TS&Language (MCQ2 subset).
```bash
python dataset/preprocess_dataset.py \
    --dataset_a path/to/dataset_a.json \
    --mcq2_source path/to/MCQ_2_TS.jsonl \
    --output_dir ./dataset/processed \
    --mcq2_sample_size 100
```

The synthetic training data pipeline uses VLM-assisted code synthesis to bootstrap realistic simulators from 23 seed signals, producing 6,000 MCQ training entries.
| Stage | Script | Description |
|---|---|---|
| 1. Iterative Code Synthesis | `./scripts/run_iterative_generation.sh` | Claude generates Python simulators from real data |
| 2. Stochastic Diversification | `./scripts/run_stochastic_refinement.sh` | Convert to sampling-based generators |
| 3–4. Benchmark Generation | `./scripts/run_synthetic_benchmark.sh 100` | Generate synthetic time series + QA/MCQ |
Note: Stage 2 requires manual review to select the best stochastic model per sample before proceeding.
See dataset/synthetic/README.md for full pipeline documentation.
The framework injects TSLM-generated insights directly into the GRLM's reasoning trace. We provide end-to-end examples for multiple model combinations:
- GRLMs (General Reasoners): Claude 3.7 Sonnet, Qwen3-32B, DeepSeek-R1-Distill-Qwen-32B
- TSLMs (Time-Series Specialists): ChatTS-14B, Qwen2.5-VL-3B (SFT/RL fine-tuned)
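Mechanically, injection amounts to seeding the GRLM's thinking block with the specialist's observations before decoding continues. A minimal sketch of the message construction (prompt wording is illustrative; the actual prompts and server calls live in the workflow scripts under scripts/):

```python
def build_injection_messages(question, tslm_observations):
    """Sketch of assistant-prefill injection: the GRLM continues a
    partially written assistant turn whose think block already holds
    the TSLM's domain observations. Wording here is illustrative."""
    prefill = (
        "<think>\n"
        "Observations from the time-series specialist:\n"
        f"{tslm_observations}\n"
        "Now I will reason step by step from these observations.\n"
    )
    return [
        {"role": "user", "content": question},
        # With an OpenAI-compatible server, send this final assistant
        # message with continue_final_message-style prefill so decoding
        # resumes mid-turn instead of starting a fresh reply.
        {"role": "assistant", "content": prefill},
    ]

msgs = build_injection_messages(
    "Which fault best explains the sensor readings?",
    "Velocity RMS doubles at t=120 while temperature ramps steadily.",
)
```

Because the injected text sits inside the model's own reasoning trace rather than the user prompt, the GRLM treats the observations as thoughts it has already had, which is what steers the downstream diagnosis.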
Claude (via AWS Bedrock) serves as the general reasoner; ChatTS provides injected observations via an instructional proxy (`<thinking>` tags).
```bash
# 1. Start the ChatTS server
./src/chatts_utils/start_chatts_server.sh

# 2. Run standalone baselines
./scripts/run_chatts_inference.sh --dataset ./dataset/dataset_a_with_mcq2.json
./scripts/run_claude_inference.sh --dataset ./dataset/dataset_a_with_mcq2.json

# 3. Run knowledge injection (generates observations + injects into Claude)
./scripts/run_injection_workflow.sh --dataset ./dataset/dataset_a_with_mcq2.json

# 4. Stop the server
./src/chatts_utils/stop_chatts_server.sh
```

Qwen3-32B serves as the GRLM; Qwen2.5-VL-3B provides injected thoughts via `continue_final_message` assistant prefill.
```bash
# 1. Start both servers
./src/qwen_utils/start_qwen_vl_server.sh   # Qwen2.5-VL-3B on port 5003
./src/qwen3_utils/start_qwen3_server.sh    # Qwen3-32B on port 5001

# 2. Run standalone baseline
./scripts/run_qwen_inference.sh --dataset ./dataset/dataset_a_with_mcq2.json

# 3. Run knowledge injection
./scripts/run_qwen3_injection_workflow.sh --dataset ./dataset/dataset_a_with_mcq2.json

# 4. Stop servers
./src/qwen_utils/stop_qwen_vl_server.sh
./src/qwen3_utils/stop_qwen3_server.sh
```

DeepSeek-R1-Distill-Qwen-32B is an alternative open-source GRLM. It shares the same tokenizer and API as Qwen3, so the same injection script supports both via `--model_name`.
```bash
# 1. Start servers
./src/qwen_utils/start_qwen_vl_server.sh   # Qwen2.5-VL-3B on port 5003
./src/r1_utils/start_r1_server.sh          # DeepSeek-R1 on port 5002

# 2. Run knowledge injection
./scripts/run_r1_injection_workflow.sh --dataset ./dataset/dataset_a_with_mcq2.json

# 3. Stop servers
./src/qwen_utils/stop_qwen_vl_server.sh
./src/r1_utils/stop_r1_server.sh
```

The time-series specialist (TSLM) is initialized from the public Qwen2.5-VL-3B-Instruct checkpoint and post-trained in two stages:
- Supervised Fine-Tuning (SFT) using LLaMA-Factory
- Reinforcement Learning (GRPO) using VERL
The synthetic training data generated by dataset/synthetic/ can be used directly with these frameworks. See Appendix B of the paper for hyperparameter details.
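The "verifiable" part of RLVR can be as simple as exact-match grading of the model's final MCQ choice. A toy reward function in that spirit (the extraction regex and reward values are assumptions for illustration, not the repository's implementation):

```python
import re

def mcq_reward(completion: str, gold_choice: str) -> float:
    """Toy verifiable reward for MCQ RLVR: 1.0 when the extracted final
    answer letter matches the gold choice, else 0.0. The extraction
    heuristic below is illustrative only."""
    # Look for "answer is (B)" / "Answer: C"-style markers; keep the last one.
    matches = re.findall(
        r"answer\s*(?:is)?\s*:?\s*\(?([A-D])\)?", completion, re.IGNORECASE
    )
    if not matches:
        return 0.0  # no parseable answer -> zero reward
    return 1.0 if matches[-1].upper() == gold_choice.upper() else 0.0
```

Because the reward is computed mechanically from the gold label, GRPO can optimize the full thinking trace without any human preference annotation.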
Evaluate results with sampling for statistical robustness (mean ± std over 3 runs):
```bash
python evaluation/evaluate_with_sampling.py \
    --exp experiment_name \
    --dataset ./dataset/dataset_a_with_mcq2.json \
    --generated ./evaluation/results/experiment_name/generated_answer.json
```

The evaluation uses custom metrics based on the RAGAS framework. Results are saved to `evaluation/exp/<experiment_name>/`.
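Reported numbers aggregate per-run scores into mean ± standard deviation over the 3 sampled runs. A stdlib sketch of that aggregation (function and variable names are illustrative, not the evaluation script's API):

```python
import statistics

def summarize_runs(accuracies):
    """Aggregate per-run accuracies into a 'mean ± std' string,
    using the sample standard deviation over the (3) runs."""
    mean = statistics.mean(accuracies)
    std = statistics.stdev(accuracies)  # sample std (n - 1 denominator)
    return f"{mean:.1f} ± {std:.1f}"

print(summarize_runs([71.2, 69.8, 70.6]))  # -> 70.5 ± 0.7
```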
If you find this work useful, please cite:
```bibtex
@misc{he2026sentsrbenchthinkinginjectedknowledge,
      title={SenTSR-Bench: Thinking with Injected Knowledge for Time-Series Reasoning},
      author={Zelin He and Boran Han and Xiyuan Zhang and Shuai Zhang and Haotian Lin and Qi Zhu and Haoyang Fang and Danielle C. Maddix and Abdul Fatir Ansari and Akash Chandrayan and Abhinav Pradhan and Bernie Wang and Matthew Reimherr},
      year={2026},
      eprint={2602.19455},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2602.19455},
}
```

This project is inspired by and builds upon several excellent open-source projects.
This project is licensed under the Apache License 2.0.


