Implementation of the multimodal time-series forecasting framework from UniCast: A Unified Multimodal Prompting Framework for Time Series Forecasting
Sehyuk Park1, Soyeon Caren Han1,2, Eduard Hovy2
1 Pohang University of Science and Technology, 2 The University of Melbourne
1 percy212@postech.ac.kr, drcarenhan@postech.ac.kr; 2 caren.han@unimelb.edu.au, eduard.hovy@unimelb.edu.au
This project leverages two Time-Series Foundation Models (TSFMs): Timer and Chronos.
Each model requires a separate Python environment:
- Timer: python==3.10.16
- Chronos: python==3.11.11
Other dependencies can be installed from the corresponding requirements.txt file for each model.
Timer:
conda create -n timer python=3.10.16
conda activate timer
pip install -r requirements/timer_requirements.txt
Chronos:
conda create -n chronos python=3.11.11
conda activate chronos
pip install -r requirements/chronos_requirements.txt
We use a subset of the evaluation dataset from Chronos.
- All CSV files are stored in the csv/ folder.
- The dataset/ folder contains a create_dataset.py script for each dataset.
For converting time-series data into images, we follow the plotting approach used in ViTST.
To generate the datasets, simply run:
cd dataset
bash create_dataset.sh
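For reference, below is a minimal sketch of the line-plot conversion in the spirit of ViTST. The figure size, styling, and the function name series_to_image are illustrative assumptions and may differ from the settings used by the repo's create_dataset.py scripts.

```python
# Minimal sketch: render a univariate time series as a bare line-plot image,
# in the spirit of ViTST. Figure size, DPI, and styling are assumptions and
# may differ from the settings used by the repo's create_dataset.py scripts.
import matplotlib
matplotlib.use("Agg")  # headless rendering
import matplotlib.pyplot as plt
import numpy as np

def series_to_image(series: np.ndarray, out_path: str, size_px: int = 224) -> None:
    """Plot a 1-D series as a borderless line chart and save it as a square image."""
    dpi = 100
    fig, ax = plt.subplots(figsize=(size_px / dpi, size_px / dpi), dpi=dpi)
    ax.plot(np.arange(len(series)), series, linewidth=1.0)
    ax.axis("off")  # the image carries only the shape of the series
    fig.subplots_adjust(left=0, right=1, top=1, bottom=0)
    fig.savefig(out_path)
    plt.close(fig)

if __name__ == "__main__":
    t = np.linspace(0, 4 * np.pi, 512)
    series_to_image(np.sin(t) + 0.1 * np.random.randn(512), "example_context.png")
```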
UniCast utilizes:
- Time-Series Models: Timer, Chronos
- Vision Encoders: CLIP, BLIP
- Text Encoders: Qwen, LLaMA
Each model requires its corresponding pretrained configuration and weights.
For each model, a save_pretrained_model.py script is provided in its respective folder.
To download and save all pretrained models, simply run:
cd models
bash save_pretrained_model.sh
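As a rough illustration of what such a script does, the sketch below downloads and re-saves a CLIP checkpoint with Hugging Face transformers; the model ID and output directory are assumptions and may differ from the repo's scripts.

```python
# Sketch of downloading and re-saving a pretrained vision encoder locally with
# Hugging Face transformers. The checkpoint name and target directory are
# illustrative; substitute the IDs actually used by the repo's scripts.
from transformers import CLIPModel, CLIPProcessor

MODEL_ID = "openai/clip-vit-base-patch32"  # assumed CLIP checkpoint
SAVE_DIR = "pretrained/clip"               # assumed local path

model = CLIPModel.from_pretrained(MODEL_ID)
processor = CLIPProcessor.from_pretrained(MODEL_ID)

model.save_pretrained(SAVE_DIR)      # writes config.json + model weights
processor.save_pretrained(SAVE_DIR)  # writes the preprocessing config
```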
For each TSFM, separate shell scripts are provided for training and testing.
These scripts are configured to iterate over different combinations of vision encoders and text encoders.
- To train:
# For Timer
bash train_multi_modal_timer.sh
# For Chronos
bash train_multi_modal_chronos.sh
- To evaluate:
# For Timer
bash test_multi_modal_timer.sh
# For Chronos
bash test_multi_modal_chronos.sh
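For readers who prefer Python, a hedged equivalent of that encoder sweep is sketched below; the entry point (run.py) and the flag names are illustrative assumptions, not the repo's actual CLI.

```python
# Illustrative sweep over vision/text encoder combinations, mirroring what the
# shell scripts do. The entry point ("run.py") and flag names are assumptions;
# adapt them to the repo's actual training/evaluation scripts.
import itertools
import subprocess

VISION_ENCODERS = ["clip", "blip"]
TEXT_ENCODERS = ["qwen", "llama"]

for vision, text in itertools.product(VISION_ENCODERS, TEXT_ENCODERS):
    cmd = [
        "python", "run.py",        # hypothetical entry point
        "--tsfm", "timer",         # or "chronos"
        "--vision_encoder", vision,
        "--text_encoder", text,
    ]
    print("Launching:", " ".join(cmd))
    subprocess.run(cmd, check=True)
```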
Compared with six baseline models, UniCast achieves better performance in a parameter-efficient manner while keeping the TSFM backbone frozen.
Incorporating either visual or textual context improves performance over the time-series-only model, while combining both modalities consistently yields the best results.
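To make the prompting idea concrete, the sketch below shows one way frozen vision/text features could be projected into a few trainable soft-prompt vectors and prepended to the time-series tokens of a frozen backbone. It is a conceptual illustration under assumed dimensions and module names, not the actual UniCast implementation.

```python
# Conceptual sketch only (PyTorch): project frozen vision/text features into a
# few trainable soft-prompt vectors and prepend them to the time-series token
# embeddings consumed by a frozen TSFM backbone. Dimensions, module names, and
# the fusion scheme are assumptions, not UniCast's actual architecture.
import torch
import torch.nn as nn

class MultimodalPromptFusion(nn.Module):
    def __init__(self, ts_dim: int, vis_dim: int, txt_dim: int, n_prompts: int = 4):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, n_prompts * ts_dim)  # trainable
        self.txt_proj = nn.Linear(txt_dim, n_prompts * ts_dim)  # trainable
        self.n_prompts, self.ts_dim = n_prompts, ts_dim

    def forward(self, ts_tokens, vis_feat, txt_feat):
        # ts_tokens: (B, T, ts_dim) from the frozen TSFM embedding layer
        # vis_feat:  (B, vis_dim) pooled vision-encoder feature of the plot image
        # txt_feat:  (B, txt_dim) pooled text-encoder feature of the description
        B = ts_tokens.size(0)
        vis_prompts = self.vis_proj(vis_feat).view(B, self.n_prompts, self.ts_dim)
        txt_prompts = self.txt_proj(txt_feat).view(B, self.n_prompts, self.ts_dim)
        # Prepend prompts; the frozen backbone attends over [prompts | series].
        return torch.cat([vis_prompts, txt_prompts, ts_tokens], dim=1)

if __name__ == "__main__":
    fusion = MultimodalPromptFusion(ts_dim=512, vis_dim=768, txt_dim=1024)
    out = fusion(torch.randn(2, 96, 512), torch.randn(2, 768), torch.randn(2, 1024))
    print(out.shape)  # torch.Size([2, 104, 512]): 8 prompt tokens + 96 series tokens
```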
The figure compares four configurations: TSFM Zero-Shot, TSFM with Prompt Tuning, TSFM with Vision Encoder, and TSFM with both Vision and Text Encoders.
It shows that adding more modalities enables the model to capture patterns more effectively.
If you find our UniCast framework helpful, please consider citing our paper.
@misc{park2025unicastunifiedmultimodalprompting,
title={UniCast: A Unified Multimodal Prompting Framework for Time Series Forecasting},
author={Sehyuk Park and Soyeon Caren Han and Eduard Hovy},
year={2025},
eprint={2508.11954},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2508.11954},
}