## 1. Install Datasets Library

Installs Hugging Face’s `datasets` to access the FinQA dataset.


In [1]:
!pip install datasets


Collecting fsspec<=2024.12.0,>=2023.1.0 (from fsspec[http]<=2024.12.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.12.0-py3-none-any.whl.metadata (11 kB)
Downloading fsspec-2024.12.0-py3-none-any.whl (183 kB)
Installing collected packages: fsspec
  Attempting uninstall: fsspec
    Found existing installation: fsspec 2025.3.2
    Uninstalling fsspec-2025.3.2:
      Successfully uninstalled fsspec-2025.3.2
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
gcsfs 2024.10.0 requires fsspec==2024.10.0, but you have fsspec 2024.12.0 which is incompatible.
torch 2.5.1+cu124 requires nvidia-cublas-cu12==12.4.5.8; platform_system == "Linux" and platform_machine == "x86_64", but you have nvidia-cublas-cu12 12.8.4.1 which is 

## 2. Load the FinQA Dataset

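Loads the FinQA dataset from Hugging Face using `load_dataset`. This gives access to the training, validation, and test splits. `trust_remote_code=True` lets `datasets` run the loading script (`finqa.py`) that ships with the dataset repository.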


In [2]:
from datasets import load_dataset

dataset = load_dataset("ibm-research/finqa", trust_remote_code=True)



README.md:   0%|          | 0.00/1.00k [00:00<?, ?B/s]

finqa.py:   0%|          | 0.00/4.49k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/21.2M [00:00<?, ?B/s]

Generating train split: 0 examples [00:00, ? examples/s]

Generating validation split: 0 examples [00:00, ? examples/s]

Generating test split: 0 examples [00:00, ? examples/s]

## 3. Inspect the Dataset

Prints the dataset structure and previews one training sample to understand the available fields (e.g., `question`, `table`, `program_re`, `answer`).


In [4]:
# Check dataset structure
print(dataset)

# View a sample entry
sample = dataset['train'][0]
for key, value in sample.items():
    print(f"{key}: {value}\n")


DatasetDict({
    train: Dataset({
        features: ['id', 'pre_text', 'post_text', 'table', 'question', 'answer', 'final_result', 'program_re', 'gold_inds'],
        num_rows: 6251
    })
    validation: Dataset({
        features: ['id', 'pre_text', 'post_text', 'table', 'question', 'answer', 'final_result', 'program_re', 'gold_inds'],
        num_rows: 883
    })
    test: Dataset({
        features: ['id', 'pre_text', 'post_text', 'table', 'question', 'answer', 'final_result', 'program_re', 'gold_inds'],
        num_rows: 1147
    })
})
id: ADI/2009/page_49.pdf-1

pre_text: ['interest rate to a variable interest rate based on the three-month libor plus 2.05% ( 2.05 % ) ( 2.34% ( 2.34 % ) as of october 31 , 2009 ) .', 'if libor changes by 100 basis points , our annual interest expense would change by $ 3.8 million .', 'foreign currency exposure as more fully described in note 2i .', 'in the notes to consolidated financial statements contained in item 8 of this annual report on form
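Fields such as `pre_text` are long lists of sentences, so printing them in full floods the output. A minimal sketch of a truncated preview, using only the standard library; the `preview_sample` helper and the `demo` record are hypothetical stand-ins, not part of the notebook:

```python
import textwrap

def preview_sample(sample, width=80):
    """Return one 'key: value' line per field, with each value cut to `width` characters."""
    lines = []
    for key, value in sample.items():
        # shorten() collapses whitespace and truncates at a word boundary
        text = textwrap.shorten(str(value), width=width, placeholder=" ...")
        lines.append(f"{key}: {text}")
    return lines

# Hand-made record shaped like a FinQA sample (illustration only)
demo = {
    "id": "ADI/2009/page_49.pdf-1",
    "question": "what is the the interest expense in 2009?",
    "pre_text": ["if libor changes by 100 basis points , our annual "
                 "interest expense would change by $ 3.8 million ."] * 3,
}
for line in preview_sample(demo, width=60):
    print(line)
```

With the real data, `preview_sample(dataset['train'][0])` would take the place of the loop over `sample.items()`.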

## 4. View a Random Training Sample

Displays a random entry from the training set to explore the variety of questions and table structures in FinQA.


In [5]:
from random import randint

# View a random sample from the training set
i = randint(0, len(dataset['train']) - 1)
sample = dataset['train'][i]

for k, v in sample.items():
    print(f"\n{k.upper()}:\n{v}")



ID:
LMT/2013/page_49.pdf-3

PRE_TEXT:
['frequency ( aehf ) system , orion , global positioning satellite ( gps ) iii system , geostationary operational environmental satellite r-series ( goes-r ) , and mobile user objective system ( muos ) .', 'operating profit for our space systems business segment includes our share of earnings for our investment in united launch alliance ( ula ) , which provides expendable launch services to the u.s .', 'government .', 'space systems 2019 operating results included the following ( in millions ) : .']

POST_TEXT:
['2013 compared to 2012 space systems 2019 net sales for 2013 decreased $ 389 million , or 5% ( 5 % ) , compared to 2012 .', 'the decrease was primarily attributable to lower net sales of approximately $ 305 million for commercial satellite programs due to fewer deliveries ( zero delivered during 2013 compared to two for 2012 ) ; and about $ 290 million for the orion program due to lower volume .', 'the decreases were partially offset by hi
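Because `randint` is unseeded, a rerun shows a different sample, and the next section reuses the same `i`. A small sketch of a repeatable draw, assuming a fixed seed is acceptable for exploration; `pick_index` is a hypothetical helper:

```python
import random

def pick_index(n, seed=None):
    """Return a random index in [0, n); the same seed always gives the same index."""
    rng = random.Random(seed)  # private generator, leaves the global random state alone
    return rng.randrange(n)

# 6251 is the size of the training split reported above
i = pick_index(6251, seed=42)
print(i)
```

In the notebook this would be called as `pick_index(len(dataset['train']), seed=42)`.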

## 5. Convert Table Data to DataFrame

Transforms the raw table from the sample into a readable pandas DataFrame for easier viewing and analysis.


In [6]:
import pandas as pd

sample = dataset['train'][i]  # use same i from before
table = pd.DataFrame(sample['table'][1:], columns=sample['table'][0])
table


| | | 2013 | 2012 | 2011 |
|---|---|---|---|---|
| 0 | net sales | $ 7958 | $ 8347 | $ 8161 |
| 1 | operating profit | 1045 | 1083 | 1063 |
| 2 | operating margins | 13.1% ( 13.1 % ) | 13.0% ( 13.0 % ) | 13.0% ( 13.0 % ) |
| 3 | backlog at year-end | 20500 | 18100 | 16000 |


## 6. Clean and Format the Table

Loads the first training sample's table as a DataFrame, truncates column names longer than 30 characters, strips dollar signs and commas, and converts a column to a numeric type only when every value in it parses as a number.


In [7]:
import pandas as pd

# Get one sample
sample = dataset['train'][0]
raw_table = sample['table']

# Convert to DataFrame
df = pd.DataFrame(raw_table[1:], columns=raw_table[0])

# Optionally shorten long column names
df.columns = [col[:30] + '...' if len(col) > 30 else col for col in df.columns]

# Clean $ signs and commas, convert to numbers where possible
for col in df.columns:
    if df[col].dtype == 'object':
        df[col] = df[col].str.replace(r'[\$,]', '', regex=True).str.strip()
        try:
            # errors='ignore' is deprecated in recent pandas; catch the failure instead
            df[col] = pd.to_numeric(df[col])
        except (ValueError, TypeError):
            pass  # keep the column as strings if any value is non-numeric

# Display cleaned table
df.head()


| | | october 31 2009 | november 1 2008 |
|---|---|---|---|
| 0 | fair value of forward exchange contracts asset... | 6427 | -23158 ( 23158 ) |
| 1 | fair value of forward exchange contracts after... | 20132 | -9457 ( 9457 ) |
| 2 | fair value of forward exchange contracts after... | -6781 ( 6781 ) | -38294 ( 38294 ) |
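The output above still contains strings such as `-23158 ( 23158 )`, because FinQA cells often repeat the value in parentheses and so fail numeric conversion. A rough sketch of a per-cell parser that handles this pattern, assuming the parenthesized part is always a repeat of the leading value; `parse_cell` is a hypothetical helper and is meant for value columns, not label columns:

```python
import re

# First signed integer or decimal in the text, e.g. "-23158" or "13.1"
_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")

def parse_cell(cell):
    """Best-effort conversion of a table cell to float; returns None if no number is found.

    Percent signs are ignored, so "13.1% ( 13.1 % )" becomes 13.1, not 0.131.
    """
    text = cell.replace("$", "").replace(",", "")
    # Drop the parenthesized repeat of the value, if present
    text = re.sub(r"\(.*?\)", "", text)
    match = _NUMBER.search(text)
    return float(match.group()) if match else None

for raw in ["$ 7958", "-23158 ( 23158 )", "13.1% ( 13.1 % )", "net sales"]:
    print(repr(raw), "->", parse_cell(raw))
```

Applied with `df[col].map(parse_cell)` on the value columns, this would yield floats where the notebook's cleaning step leaves strings.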


## 7. Tokenize Question and Table for the Transformer

This block uses the `T5Tokenizer` to convert each FinQA sample (question + table) into a format the Transformer model can understand. It flattens the table into a single string, merges it with the question, and tokenizes the result into input IDs and attention masks.

**Purpose**: Prepares raw data for model input — a critical step before training or inference.


In [21]:
from transformers import T5Tokenizer

# Load pretrained tokenizer
tokenizer = T5Tokenizer.from_pretrained("t5-small")

# Sample question and table
question = sample["question"]
table_str = " ".join([" | ".join(row) for row in sample["table"]])

# Combine question and table for input
input_text = f"question: {question}  context: {table_str}"

# Tokenize input
inputs = tokenizer(input_text, return_tensors="pt", padding="max_length", truncation=True, max_length=512)

# View tokenized inputs
print(inputs)


{'input_ids': tensor([[  822,    10,   125,    19,     8,     8,  1046,  8225,    16,  2464,
            58,  2625,    10,  1820,     3,    32,    75,   235,  1152,  2664,
          2464,  1820,     3,  5326, 18247,   209,  2628,  2725,   701,    13,
          1039,  2509,  8201,  7000,    41,  6283,     3,    61,  1820,  1514,
          6687,  2555,  1820,  1514,     3,    18,  2773, 26556,    41,  1902,
         26556,     3,    61,  2725,   701,    13,  1039,  2509,  8201,   227,
             3,     9,  6389,    41,   335,     3,  1454,     3,    61,    73,
            89,     9,  1967,   179,  2426,    16,  2959,  7481,  2509,  1917,
          7000,    41,  6283,     3,    61,  1820,  1514,  2038,   357,  1820,
          1514,     3,    18,  4240,  3436,    41,     3,  4240,  3436,     3,
            61,  2725,   701,    13,  1039,  2509,  8201,   227,     3,     9,
          6389,    41,   335,     3,  1454,     3,    61, 15229,  2426,    16,
          2959,  7481,  2509,  1917,  
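The flattening and prompt-building steps can be isolated as plain string functions, which makes the exact model input easy to inspect before tokenizing. A minimal sketch that mirrors the joins used in the cell above; the function names and the small `table` are illustrative, not from the notebook:

```python
def flatten_table(table):
    """Join cells with ' | ' and rows with a single space."""
    return " ".join(" | ".join(row) for row in table)

def build_input(question, table):
    # Two spaces before "context:" mirror the prompt string used in the notebook
    return f"question: {question}  context: {flatten_table(table)}"

# Small hand-made table in FinQA's list-of-rows layout
table = [
    ["", "2013", "2012"],
    ["net sales", "$ 7958", "$ 8347"],
    ["operating profit", "1045", "1083"],
]
print(build_input("what was net sales in 2013?", table))
```

Note that rows are joined by a single space, so the row boundary between `2012` and `net sales` is not marked in the flattened string; a distinct row separator may be worth trying.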

## 8. Preview Multiple Training Samples

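Creates a DataFrame from the first five training samples, showing the question, answer, gold supporting facts (`gold_inds`), and a short preview of the table. The `rationale` column is empty because this dataset has no such field; the reasoning program is stored in `program_re`. Useful for quickly checking data structure and content diversity.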


In [22]:
import pandas as pd

# Convert a few training samples into a structured DataFrame
samples = [dataset['train'][i] for i in range(5)]

# Create DataFrame with question, answer, rationale, gold_inds, and a table preview
df_samples = pd.DataFrame([{
    "question": s["question"],
    "answer": s["answer"],
    "rationale": s.get("rationale", ""),  # no 'rationale' field in this dataset, so this is always ""; the program lives in 'program_re'
    "gold_inds": s["gold_inds"],
    "table_preview": " | ".join([" | ".join(row) for row in s["table"][:3]]) + " ..."
} for s in samples])

df_samples

| | question | answer | rationale | gold_inds | table_preview |
|---|---|---|---|---|---|
| 0 | what is the the interest expense in 2009? | 380 | | [if libor changes by 100 basis points , our an... | \| october 31 2009 \| november 1 2008 \| fair va... |
| 1 | during the 2012 year , did the equity awards i... | | | [the granted of number of shares ( in thousand... | \| number of shares ( in thousands ) \| weighte... |
| 2 | what was the total operating expenses in 2018 ... | 41932 | | [year the 2018 of gallons is 4447 ; the 2018 o... | year \| gallons \| average priceper gallon \| air... |
| 3 | what percentage of total cash and investments ... | 53% | | [( in millions ) the available-for-sale invest... | ( in millions ) \| dec 282013 \| dec 292012 \| av... |
| 4 | what is the growth rate in net revenue in 2008? | -3.2% | | [the 2007 net revenue of amount ( in millions ... | \| amount ( in millions ) \| 2007 net revenue \|... |


## 9. Preprocessing Function (`preprocess`)

Defines how each sample is tokenized:

- Merges question and table into a single string.
- Tokenizes the combined text for model input (`input_ids`, `attention_mask`).
- Separately tokenizes the answer as the target (`labels`).
- Returns all components as a dictionary of tensors.

**Purpose**: Prepares the input-output pair for the model — what to "read" and what to "predict".


In [23]:
def preprocess(sample):
    table_str = " ".join([" | ".join(row) for row in sample["table"]])
    input_text = f"question: {sample['question']}  context: {table_str}"

    model_inputs = tokenizer(
        input_text,
        padding="max_length",
        truncation=True,
        max_length=512,
        return_tensors="pt"
    )

    # Tokenize the answer as target
    labels = tokenizer(
        sample["answer"],
        padding="max_length",
        truncation=True,
        max_length=64,
        return_tensors="pt"
    )["input_ids"]

    model_inputs["labels"] = labels
    return {k: v.squeeze() for k, v in model_inputs.items()}
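As written, `preprocess` pads the target to 64 tokens and keeps the pad ids in `labels`, so padded positions count toward the loss. A common alternative with Hugging Face seq2seq models is to replace pad ids in the labels with `-100`, which the loss ignores. The sketch below shows only that replacement on a plain list; it is not what the notebook does, and the token ids in `padded` are made up for illustration:

```python
PAD_TOKEN_ID = 0      # T5's pad token id
IGNORE_INDEX = -100   # label value ignored by the cross-entropy loss

def mask_padding(label_ids, pad_token_id=PAD_TOKEN_ID):
    """Replace pad ids with IGNORE_INDEX so padded positions do not contribute to the loss."""
    return [IGNORE_INDEX if tok == pad_token_id else tok for tok in label_ids]

# Made-up label sequence: three real tokens followed by padding
padded = [204, 2079, 1, 0, 0, 0]
print(mask_padding(padded))  # [204, 2079, 1, -100, -100, -100]
```

Inside `preprocess`, the tensor equivalent would be `labels[labels == tokenizer.pad_token_id] = -100` before assigning to `model_inputs["labels"]`. Loss values obtained this way would not be comparable with those logged in this notebook.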


## 10. Apply Preprocessing to Training Set

Applies the `preprocess` function to each example in the training set, converting raw samples into tokenized inputs and labels for the model.


In [24]:
tokenized_dataset = dataset["train"].map(preprocess)


## 11. Wrap Tokenized Data in a PyTorch Dataset

Creates a custom `FinQADataset` class to wrap the tokenized Hugging Face dataset in a format compatible with PyTorch’s DataLoader.

**Purpose**: Enables efficient batching and shuffling during training.


In [25]:
import torch
from torch.utils.data import Dataset

class FinQADataset(Dataset):
    def __init__(self, hf_dataset):
        self.dataset = hf_dataset

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        return {
            "input_ids": torch.tensor(self.dataset[idx]["input_ids"]),
            "attention_mask": torch.tensor(self.dataset[idx]["attention_mask"]),
            "labels": torch.tensor(self.dataset[idx]["labels"]),
        }



## 12. Create DataLoader for Training

Wraps the `FinQADataset` in a PyTorch `DataLoader` for easy batching and shuffling during training.

**Batch size**: 4  
**Shuffling**: Enabled for randomness


In [26]:
from torch.utils.data import DataLoader

train_dataset = FinQADataset(tokenized_dataset)
train_loader = DataLoader(train_dataset, batch_size=4, shuffle=True)
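The number of batches per epoch follows directly from the split size and batch size. A short worked check, assuming the `DataLoader` default of keeping the last partial batch (`drop_last=False`); `steps_per_epoch` is a hypothetical helper:

```python
import math

def steps_per_epoch(num_examples, batch_size, drop_last=False):
    """Number of batches a DataLoader yields per epoch."""
    if drop_last:
        return num_examples // batch_size
    return math.ceil(num_examples / batch_size)

# 6251 training examples, batch size 4
print(steps_per_epoch(6251, 4))  # 1563
print(6251 % 4)                  # 3 examples in the final batch
```

The result, 1563, matches the step count shown in the training progress bars.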


## 13. Load the T5 Model

Loads the pretrained `T5ForConditionalGeneration` model, which includes both encoder and decoder components for sequence-to-sequence tasks like FinQA.

**Model used**: `t5-small`


In [27]:
from transformers import T5ForConditionalGeneration

# Load pretrained T5 model (encoder + decoder)
model = T5ForConditionalGeneration.from_pretrained("t5-small")


## 14. Train the T5 Model

Runs a simple training loop for 3 epochs using the `AdamW` optimizer.

- Moves each batch to the model’s device (the CPU here, since the model is not moved to a GPU before this cell)
- Computes loss, backpropagates, and updates weights
- Uses `tqdm` to show real-time training progress

**Purpose**: Fine-tunes the T5 model on the FinQA dataset.


In [None]:
import torch
from torch.optim import AdamW
from tqdm import tqdm

# Make sure model is in training mode
model.train()

# Define optimizer
optimizer = AdamW(model.parameters(), lr=5e-5)

# Set number of epochs
num_epochs = 3

# Training loop
for epoch in range(num_epochs):
    total_loss = 0
    loop = tqdm(train_loader, leave=True, desc=f"Epoch {epoch+1}")

    for batch in loop:
        # Move data to the same device as model
        input_ids = batch["input_ids"].to(model.device)
        attention_mask = batch["attention_mask"].to(model.device)
        labels = batch["labels"].to(model.device)

        # Forward pass
        outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss

        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        total_loss += loss.item()
        loop.set_postfix(loss=loss.item())

    print(f"Epoch {epoch+1} Average Loss: {total_loss / len(train_loader):.4f}")


Epoch 1: 100%|██████████| 1563/1563 [1:14:41<00:00,  2.87s/it, loss=0.165]


Epoch 1 Average Loss: 0.1961


Epoch 2: 100%|██████████| 1563/1563 [1:16:46<00:00,  2.95s/it, loss=0.126] 


Epoch 2 Average Loss: 0.1590


Epoch 3:  89%|████████▊ | 1385/1563 [1:08:32<08:41,  2.93s/it, loss=0.145] 

## 15. Set Device for Inference or Training

Detects if a GPU is available and sets the computation device accordingly (`cuda` or `cpu`).

**Purpose**: Ensures the model runs on the most efficient hardware available.


In [6]:
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


## 16. Load Fine-Tuned Model and Tokenizer

Loads the previously trained T5 model and tokenizer from the saved directory, then moves the model to the selected device.

**Path**: `/kaggle/input/finqa-model-after-learning`
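The notebook does not show how the directory above was written. A minimal sketch of one way to produce it, assuming the standard `save_pretrained` methods on Hugging Face models and tokenizers; the `save_checkpoint` helper and the directory name in the usage comment are hypothetical:

```python
from pathlib import Path

def save_checkpoint(model, tokenizer, output_dir):
    """Write model weights/config and tokenizer files into one directory.

    Both objects only need a save_pretrained(path) method, which Hugging Face
    models and tokenizers provide.
    """
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    model.save_pretrained(str(out))
    tokenizer.save_pretrained(str(out))
    return out

# Usage after training (directory name is an assumption):
# save_checkpoint(model, tokenizer, "finqa-t5-small")
```

Saving both objects to the same directory is what allows the two `from_pretrained` calls in the next cell to share a single path.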


In [7]:
model = T5ForConditionalGeneration.from_pretrained("/kaggle/input/finqa-model-after-learning")
tokenizer = T5Tokenizer.from_pretrained("/kaggle/input/finqa-model-after-learning")
model.to(device)

T5ForConditionalGeneration(
  (shared): Embedding(32128, 512)
  (encoder): T5Stack(
    (embed_tokens): Embedding(32128, 512)
    (block): ModuleList(
      (0): T5Block(
        (layer): ModuleList(
          (0): T5LayerSelfAttention(
            (SelfAttention): T5Attention(
              (q): Linear(in_features=512, out_features=512, bias=False)
              (k): Linear(in_features=512, out_features=512, bias=False)
              (v): Linear(in_features=512, out_features=512, bias=False)
              (o): Linear(in_features=512, out_features=512, bias=False)
              (relative_attention_bias): Embedding(32, 8)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): T5LayerFF(
            (DenseReluDense): T5DenseActDense(
              (wi): Linear(in_features=512, out_features=2048, bias=False)
              (wo): Linear(in_features=2048, out_features=512, bias=False)
              (dropout): Drop

## 17. Run Inference on a Sample Question

Formats a sample question and table into a single input, tokenizes it, runs it through the fine-tuned model, and decodes the predicted answer.




In [8]:
question = "What is the net income in 2020?"
table_str = "Year | Revenue | Net Income 2019 | $100M | $20M 2020 | $120M | $25M"

input_text = f"question: {question} context: {table_str}"
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs)
answer = tokenizer.decode(outputs[0], skip_special_tokens=True)

print("Answer:", answer)


Answer: $ 900 million
