## Preprocessing 
Typical columns:

- STUDY_ID (ID of the study)
- GENUS_SPECIES (or a similar species identifier)
- YEAR, MONTH, DAY (often incomplete; sometimes only the year is provided)
- LATITUDE, LONGITUDE
- ABUNDANCE (or BIOMASS, DENSITY, etc.; in this export the columns are named sum.allrawdata.ABUNDANCE and sum.allrawdata.BIOMASS)
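Before any cleaning, it can help to confirm that a loaded table actually exposes the columns listed above. The following is a minimal sketch on a small hand-made DataFrame rather than the real export; `REQUIRED_COLUMNS` and `missing_columns` are illustrative names introduced here, not part of BioTIME or pandas.

```python
import pandas as pd

# Columns this notebook relies on later (illustrative list; adjust to your export).
REQUIRED_COLUMNS = ["STUDY_ID", "GENUS_SPECIES", "YEAR", "MONTH", "DAY",
                    "LATITUDE", "LONGITUDE"]

def missing_columns(frame, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from the frame."""
    return [name for name in required if name not in frame.columns]

# Toy stand-in for the real query result (MONTH and DAY deliberately left out).
toy = pd.DataFrame({
    "STUDY_ID": [10],
    "GENUS_SPECIES": ["Acer rubrum"],
    "YEAR": [1984],
    "LATITUDE": [47.4],
    "LONGITUDE": [-95.12],
})

absent = missing_columns(toy)
print(absent)
```

An empty list means every expected column is present; anything else names the columns to rename or derive before continuing.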

In [1]:
import pandas as pd
df = pd.read_csv('data/BioTIMEQuery_24_06_2021.csv', low_memory=False)
df

Unnamed: 0.1,Unnamed: 0,STUDY_ID,DAY,MONTH,YEAR,SAMPLE_DESC,PLOT,ID_SPECIES,LATITUDE,LONGITUDE,sum.allrawdata.ABUNDANCE,sum.allrawdata.BIOMASS,GENUS,SPECIES,GENUS_SPECIES
0,1,10,,,1984,47.400000_-95.120000_12_Control_0_Medium,12,22,47.40000,-95.12000,1.0,0.0,Acer,rubrum,Acer rubrum
1,2,10,,,1984,47.400000_-95.120000_12_Control_0_Medium,12,23,47.40000,-95.12000,3.0,0.0,Acer,saccharum,Acer saccharum
2,3,10,,,1984,47.400000_-95.120000_12_Control_0_Medium,12,24,47.40000,-95.12000,1.0,0.0,Acer,spicatum,Acer spicatum
3,4,10,,,1984,47.400000_-95.120000_12_Control_0_Medium,12,607,47.40000,-95.12000,12.0,0.0,Corylus,cornuta,Corylus cornuta
4,5,10,,,1984,47.400000_-95.120000_12_Control_0_Small,12,1911,47.40000,-95.12000,1.0,0.0,Populus,pinnata,Populus pinnata
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8552244,26178100,548,,,2007,49.1014548954342_13.3200349605548_T3_56_2007,T3_56,49340,49.10146,13.32004,3.0,,Vaccinium,vitis.idaea,Vaccinium vitis.idaea
8552245,26179100,548,,,2009,49.1014548954342_13.3200349605548_T3_56_2009,T3_56,49340,49.10146,13.32004,4.0,,Vaccinium,vitis.idaea,Vaccinium vitis.idaea
8552246,26180100,548,,,2012,49.1014548954342_13.3200349605548_T3_56_2012,T3_56,49340,49.10146,13.32004,3.0,,Vaccinium,vitis.idaea,Vaccinium vitis.idaea
8552247,26181100,548,,,2007,49.097317976565_13.3173542074378_T3_51_2007,T3_51,40355,49.09732,13.31735,10.0,,Veronica,chamaedrys,Veronica chamaedrys


In [2]:
print(df.columns)


Index(['Unnamed: 0', 'STUDY_ID', 'DAY', 'MONTH', 'YEAR', 'SAMPLE_DESC', 'PLOT',
       'ID_SPECIES', 'LATITUDE', 'LONGITUDE', 'sum.allrawdata.ABUNDANCE',
       'sum.allrawdata.BIOMASS', 'GENUS', 'SPECIES', 'GENUS_SPECIES'],
      dtype='object')


In [3]:
import numpy as np
df = df.replace(['missing', 'NA', -9999], np.nan)
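To see what the replacement in the cell above does, here is a small sketch on made-up data (the `toy` frame and its values are assumptions for illustration). It applies the same `replace` call and then counts missing values per column, which is a quick way to confirm that the sentinel values were converted to NaN.

```python
import numpy as np
import pandas as pd

# Toy frame containing the three sentinel styles handled above (values are made up).
toy = pd.DataFrame({
    "PLOT": ["T3_56", "missing", "NA"],
    "ABUNDANCE": [3.0, -9999.0, 4.0],
})

# Same call as in the notebook cell: sentinels become NaN.
cleaned = toy.replace(["missing", "NA", -9999], np.nan)

# Count missing values per column to confirm the replacement took effect.
missing_counts = cleaned.isna().sum().to_dict()
print(missing_counts)
```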


## Converting Dates into Tokens
Since our time series data comes with date information (e.g., from BioTIME), we first need to encode the dates as tokens. One approach is to:

- Standardize date strings: convert dates to a standard format such as YYYY-MM-DD.
- Tokenize the components: optionally split the string into tokens (year, month, day), or simply use the full string.
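The two steps above can be sketched with the standard library alone. The helper names `standardize_date` and `split_date_token` are hypothetical, and defaulting a missing month or day to 1 is an assumption that mirrors the fill strategy used in the next cell.

```python
from datetime import date

def standardize_date(year, month=1, day=1):
    """Return an ISO 'YYYY-MM-DD' string; month and day default to 1 when unknown."""
    return date(int(year), int(month), int(day)).isoformat()

def split_date_token(token):
    """Split a 'YYYY-MM-DD' string into separate year, month, and day tokens."""
    return token.split("-")

# Only the year is known, so month and day fall back to 1.
full_token = standardize_date(1984)

# A complete date, split into its three component tokens.
parts = split_date_token(standardize_date(2007, 6, 15))

print(full_token, parts)
```

Using the full string keeps each date as one unit, while the split form lets a model share year, month, and day tokens across dates.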


In [4]:
import pandas as pd

# ----- Part 1: Preprocessing Dates and Creating Date Tokens -----

# Assume df is your DataFrame with separate YEAR, MONTH, and DAY columns.
# Convert these columns to numeric (coercing errors to NaN)
df[['DAY', 'MONTH', 'YEAR']] = df[['DAY', 'MONTH', 'YEAR']].apply(pd.to_numeric, errors='coerce')

# Fill missing DAY and MONTH with default value 1.
# Assign the result back instead of calling df[col].fillna(1, inplace=True):
# pandas warns that the inplace form acts on an intermediate copy and will
# stop working in pandas 3.0.
df['DAY'] = df['DAY'].fillna(1)
df['MONTH'] = df['MONTH'].fillna(1)

# Drop rows where YEAR is missing, since YEAR is essential.
# .copy() gives an independent DataFrame for the column assignments below.
df = df.dropna(subset=['YEAR']).copy()

# Create a standardized Date column using YEAR, MONTH, and DAY
df['Date'] = pd.to_datetime(df[['YEAR', 'MONTH', 'DAY']], errors='coerce')

# Drop any rows where the date conversion failed (i.e. NaT values)
df = df.dropna(subset=['Date'])

# Set the new Date column as the index
df = df.set_index('Date')

# Rename 'sum.allrawdata.ABUNDANCE' for easier access
df = df.rename(columns={'sum.allrawdata.ABUNDANCE': 'ABUNDANCE'})

# Use the index (Date) to create a string token in the format "YYYY-MM-DD"
df['DateToken'] = df.index.strftime('%Y-%m-%d')

# ----- (Optional) Save or inspect the DataFrame -----
# For example, print the first few rows to verify
print(df[['YEAR', 'MONTH', 'DAY', 'DateToken']].head())


            YEAR  MONTH  DAY   DateToken
Date                                    
1984-01-01  1984    1.0  1.0  1984-01-01
1984-01-01  1984    1.0  1.0  1984-01-01
1984-01-01  1984    1.0  1.0  1984-01-01
1984-01-01  1984    1.0  1.0  1984-01-01
1984-01-01  1984    1.0  1.0  1984-01-01


## Using a T5 Transformer for Time Series Prediction

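The Chronos-Bolt models (e.g., Chronos-Bolt-Mini or Chronos-Bolt-Small) are based on the T5 encoder-decoder architecture and treat time series forecasting as a sequence-to-sequence problem.
Note that the published checkpoints take numeric context values as input rather than text strings and do not ship a T5 text tokenizer, so the text-prompt format explored in this section is an experiment and not the models' native interface.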

Input formatting example. The input can be written as a single text prompt:

forecast: <series_name> <tokenized_date_1> <value_1> ... <tokenized_date_n> <value_n>
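A prompt in this format can be assembled directly from the preprocessed table. The sketch below assumes a frame with the `DateToken` and `ABUNDANCE` columns created earlier; the helper `build_prompt`, the toy values, and the series name are hypothetical. Abundance is summed per date so that each date token appears only once.

```python
import pandas as pd

def build_prompt(series_name, frame):
    """Format date/value pairs as 'forecast: <name> <date_1> <value_1> ...'."""
    # Sum abundance per date so each date token appears once, in time order.
    totals = frame.groupby("DateToken")["ABUNDANCE"].sum().sort_index()
    # The 'g' format drops trailing zeros, e.g. 13.0 becomes "13".
    pairs = [f"{token} {value:g}" for token, value in totals.items()]
    return f"forecast: {series_name} " + " ".join(pairs)

# Toy records (made-up values); two rows share the 2007 date.
toy = pd.DataFrame({
    "DateToken": ["2007-01-01", "2009-01-01", "2012-01-01", "2007-01-01"],
    "ABUNDANCE": [3.0, 4.0, 3.0, 10.0],
})

prompt = build_prompt("Study_548", toy)
print(prompt)
```

In practice the frame would first be filtered to one study and one species, since summing across species changes what the series represents.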

In [5]:
!pip install --upgrade transformers sentencepiece




In [6]:
from transformers import T5TokenizerFast, T5ForConditionalGeneration

# Specify the model name. If 'amazon/chronos-bolt-small' fails, try an alternative.
model_name = 'amazon/chronos-bolt-small'
try:
    tokenizer = T5TokenizerFast.from_pretrained(model_name)
    model = T5ForConditionalGeneration.from_pretrained(model_name)
except OSError as e:
    print("OSError while loading model:", e)
    print("If you have a local directory named 'amazon/chronos-bolt-small', please rename or remove it.")
    # Try an alternative model identifier (if available)
    alternative_model_name = 'amazon/chronos-bolt-mini'
    print(f"Trying alternative model: {alternative_model_name}")
    tokenizer = T5TokenizerFast.from_pretrained(alternative_model_name)
    model = T5ForConditionalGeneration.from_pretrained(alternative_model_name)

# Build an input string from a time series.
series_name = "Oneida_Lake_NY"
time_series_tokens = ["2020-01-31", "5.3", "2020-02-29", "5.6", "2020-03-31", "5.8"]
input_text = f"forecast: {series_name} " + " ".join(time_series_tokens)
print("Input text:", input_text)

# Tokenize the input and generate forecast
input_ids = tokenizer(input_text, return_tensors='pt').input_ids
outputs = model.generate(input_ids, max_length=50)
forecast = tokenizer.decode(outputs[0], skip_special_tokens=True)

print("Forecasted Output:", forecast)


OSError while loading model: Can't load tokenizer for 'amazon/chronos-bolt-small'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 'amazon/chronos-bolt-small' is the correct path to a directory containing all relevant files for a T5TokenizerFast tokenizer.
If you have a local directory named 'amazon/chronos-bolt-small', please rename or remove it.
Trying alternative model: amazon/chronos-bolt-mini


OSError: Can't load tokenizer for 'amazon/chronos-bolt-mini'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 'amazon/chronos-bolt-mini' is the correct path to a directory containing all relevant files for a T5TokenizerFast tokenizer.
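Both errors above are consistent with the Chronos-Bolt repositories not containing T5 tokenizer files, so `T5TokenizerFast.from_pretrained` has nothing to load. One possible alternative is sketched below. It assumes the `chronos-forecasting` package is installed (`pip install chronos-forecasting`) and that its `BaseChronosPipeline` interface matches the README of the amazon-science/chronos-forecasting repository; treat the call details as assumptions and check that README before relying on them. The helpers `abundance_context` and `forecast_quantiles` are hypothetical names. The model receives numeric values only, so the date tokens are not passed to it.

```python
def abundance_context(values):
    """Convert raw abundance values to a plain list of floats for the model context."""
    return [float(v) for v in values]

def forecast_quantiles(values, prediction_length=3,
                       model_name="amazon/chronos-bolt-small"):
    """Sketch: forecast with the chronos-forecasting package (assumed installed)."""
    # Imported lazily so abundance_context works without torch or chronos installed.
    import torch
    from chronos import BaseChronosPipeline

    pipeline = BaseChronosPipeline.from_pretrained(model_name, device_map="cpu")
    context = torch.tensor(abundance_context(values), dtype=torch.float32)
    # Returns (quantiles, mean); per the README, quantiles has shape
    # [batch, prediction_length, number of quantile levels].
    quantiles, mean = pipeline.predict_quantiles(
        context,
        prediction_length=prediction_length,
        quantile_levels=[0.1, 0.5, 0.9],
    )
    return quantiles, mean

# Toy abundance series (made-up values) in chronological order.
context_values = abundance_context([3, 4, 3, 10])
print(context_values)

# To run the forecast (downloads the model on first use):
# quantiles, mean = forecast_quantiles(context_values)
```

A series this short is only a placeholder; real use would pass a longer, regularly spaced abundance history for one study and species.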

## Links
- Chaotic Chrono Paper: arXiv:2409.15771
- Chronos-Bolt Model on Hugging Face: amazon/chronos-bolt-small
- Chaotic Synthetic Dataset: dysts_data GitHub
- Chronos GitHub Repository: amazon-science/chronos-forecasting
