[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/famma-bench/bench-script/blob/main/notebooks/FAMMA_3_generation.ipynb)

## Introduction

This notebook explains how to generate answers for the **FAMMA** benchmark.  
Before you start, complete the following prerequisites:

1. **Install the FAMMA benchmark scripts**  
2. **Set up access to a language model**, either through a web API or a locally hosted LLM  

---

## Install the benchmark scripts

Run the commands below to download and install the FAMMA utilities:


In [None]:
! rm -rf bench-script
! git clone https://github.com/famma-bench/bench-script.git
! pip install -r bench-script/requirements.txt
! pip install -U datasets
# Install the package in editable mode using the notebook's pip
# ! pip install -e ./bench-script/

Then, we can download the dataset by running the following script.

In [25]:
import sys

sys.path.append("./bench-script")

from famma_runner.utils.data_utils import download_data

# Hugging Face repository that hosts the dataset
hf_dir = "weaverbirdllm/famma"

# dataset split to download: release_basic or release_livepro
# if None, the whole dataset is downloaded
split = "release_livepro"

# local directory where the dataset is saved
save_dir = "./data"

# set to True to load the dataset from a local copy instead of Hugging Face (default: False)
from_local = False

success = download_data(
    hf_dir=hf_dir,
    split=split,
    save_dir=save_dir,
    from_local=from_local,
)



Saved release_livepro split to ./data/release_livepro.json

Dataset downloaded and saved to ./data
Images are saved in ./data/images_release_livepro


After downloading, the dataset will be saved in the local directory `./data` in json format.  
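To confirm the download before moving on, a quick look at the saved file can help. The sketch below is only an illustration: `summarize_json` is a hypothetical helper, not part of `famma_runner`, and it makes no assumption about the record schema beyond the file being valid JSON.

```python
import json
from pathlib import Path


def summarize_json(path):
    """Load a JSON file and report its top-level type and number of entries."""
    with Path(path).open("r", encoding="utf-8") as f:
        data = json.load(f)
    # Only lists and dicts have a meaningful entry count
    size = len(data) if isinstance(data, (list, dict)) else None
    return {"type": type(data).__name__, "size": size}


if __name__ == "__main__":
    target = Path("./data/release_livepro.json")
    if target.exists():
        print(summarize_json(target))
    else:
        print(f"{target} not found; run the download cell first")
```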

# Answer Generation

## Calling a Custom LLM

In most cases we can plug in any model supported by `easyllm_kit`. To call a proprietary endpoint instead (for example, a DeepSeek-R1 instance deployed on Alibaba Cloud), we only need to wrap it in a thin adapter that subclasses `easyllm_kit.models.base.LLM`. The snippet below shows a minimal implementation that registers a custom client under the handle `custom_llm` and forwards prompts to the remote API.

In [None]:
from easyllm_kit.models.base import LLM


# Define your custom model class
@LLM.register("custom_llm")
class MyReasoningModel(LLM):
    model_name = 'custom_llm'

    def __init__(self, config):
        # Imported here so that openai is only required when this adapter is used
        import openai

        # `config` carries the `model` and `generation` sections of the YAML file
        self.model_config = config['model_config']
        self.generation_config = config['generation_config']
        # OpenAI-compatible client pointed at the custom endpoint; the long
        # timeout (in seconds) leaves room for slow reasoning models
        self.client = openai.OpenAI(api_key=self.model_config.api_key,
                                    base_url=self.model_config.api_url,
                                    timeout=1800)

    def generate(self, prompt: str, **kwargs):
        # Send the prompt as a single user message
        response = self.client.chat.completions.create(
            model=self.model_config.model_full_name,
            max_tokens=self.generation_config.max_length,
            temperature=self.generation_config.temperature,
            top_p=self.generation_config.top_p,
            messages=[
                {"role": "user", "content": prompt}]
        )
        message = response.choices[0].message
        # Reasoning models may return a separate reasoning trace; default to "" otherwise
        reasoning_content = getattr(message, 'reasoning_content', None) or ""
        return {'content': message.content, 'reasoning_content': reasoning_content}
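The response-parsing step of the adapter can be exercised without any network call. The sketch below is an assumption-laden illustration: `extract_message` is a hypothetical helper that mirrors the logic in `generate`, and the `SimpleNamespace` objects only imitate the shape of a chat completion response; they are not real `openai` objects.

```python
from types import SimpleNamespace


def extract_message(response):
    """Return the answer text and the optional reasoning trace of a chat-completion-like object."""
    message = response.choices[0].message
    # Not every backend returns reasoning_content; fall back to an empty string
    reasoning = getattr(message, "reasoning_content", None) or ""
    return {"content": message.content, "reasoning_content": reasoning}


# Hand-built stand-ins for a response without and with a reasoning trace
plain = SimpleNamespace(
    choices=[SimpleNamespace(message=SimpleNamespace(content="42"))]
)
reasoned = SimpleNamespace(
    choices=[SimpleNamespace(message=SimpleNamespace(content="42", reasoning_content="6 * 7"))]
)

print(extract_message(plain))
print(extract_message(reasoned))
```

Both calls return a dictionary with the same two keys, so downstream code does not need to know whether the model is a reasoning model.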



As an illustrative example, we store the YAML content in the string `yaml_content` and write it to `config.yaml`, as shown below. Note that `runner_name` tells `famma_runner` which runner to use. We combine the data configuration and model configuration into a single YAML file. This file is used to initialize the runner, which handles the generation process based on the specified parameters.

For simplicity, we run over only one question, `english_1_1_r4` (the runner answers all of its subquestions), and use `qwen-vl-max` as the model.

In [29]:
yaml_content = """
runner_name: generation

data:
  data_dir: ./data/release_livepro.json
  question_id: english_1_1_r4  # suppose we generate answers only for question 1

model:
  model_name: custom_llm # register name of your custom model
  api_key: sk-xxxx  # put your api key here
  api_url: https://dashscope.aliyuncs.com/compatible-mode/v1
  model_full_name: qwen-vl-max # put the model name here
  use_ocr: false
  use_pot: false
  is_reasoning_model: false

generation:
  temperature: 0.0
  top_p: 0.9
  max_length: 1024
"""


# Save the content to a file named config.yaml
with open("config.yaml", "w") as file:
    file.write(yaml_content)
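Hardcoding an API key in a notebook makes it easy to leak by accident. One possible alternative, sketched below, reads the key from an environment variable and substitutes it into the `model` section. The variable name `FAMMA_API_KEY` and the helper `render_model_section` are hypothetical, and the template covers only a fragment of the full configuration.

```python
import os
from string import Template

# Fragment of the configuration; the remaining sections stay as in the cell above
MODEL_SECTION = Template("""\
model:
  model_name: custom_llm
  api_key: $api_key
  api_url: https://dashscope.aliyuncs.com/compatible-mode/v1
  model_full_name: qwen-vl-max
""")


def render_model_section(env_var="FAMMA_API_KEY"):
    """Fill the model section with an API key taken from the environment."""
    api_key = os.environ.get(env_var)
    if not api_key:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return MODEL_SECTION.substitute(api_key=api_key)


# Usage, once the variable is set in the shell or notebook environment:
# print(render_model_section())
```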

We can run the snippet below to verify that the configuration is working correctly.

In [30]:
from omegaconf import OmegaConf

config_dir = "config.yaml"
config = OmegaConf.load(config_dir)


# Build the LLM model
llm_config = {'model_config': config.get('model', None),
              'generation_config': config.get('generation', None), }
custom_model = LLM.build_from_config(llm_config)
output = custom_model.generate("What is the impact of rising tariffs on China?")
print(output)

{'content': "The impact of rising tariffs on China can be multifaceted, affecting various aspects of its economy and global trade relationships. Here are some key impacts:\n\n1. **Economic Growth**: Rising tariffs can slow down China's economic growth by reducing exports, which is a significant driver of the Chinese economy. Higher costs for Chinese goods in foreign markets can lead to decreased demand.\n\n2. **Trade Balance**: Tariffs can disrupt China's trade balance. As tariffs increase, the cost of Chinese exports rises, potentially leading to a decrease in exports and an increase in imports, thus worsening the trade deficit.\n\n3. **Inflation**: Higher tariffs can lead to increased prices for imported goods in China, contributing to inflationary pressures. This can affect both consumers and businesses that rely on imported materials and components.\n\n4. **Business Costs**: For Chinese companies, higher tariffs mean increased production costs if they rely on imported raw materials

Finally, we use the following script to run the full generation pipeline for FAMMA.

In [31]:
from omegaconf import OmegaConf
from famma_runner.runners import Runner

# Generate answers from the configured model and save the results to files.

config = OmegaConf.load('config.yaml')

runner = Runner.build_from_config(config)

runner.run()

2025-05-18 12:21:31, generation_runner [generation_runner.filter_dataset_by_question_id:94] INFO - Filtering dataset for language: english, main_question_id: 1
2025-05-18 12:21:31, generation_runner [generation_runner.filter_dataset_by_question_id:105] INFO - Found 5 questions matching english_1_1_r4
2025-05-18 12:21:31, generation_runner [generation_runner.filter_dataset_by_question_id:118] INFO - Total of 5 questions matched across all filters
2025-05-18 12:21:31, easyllm_kit [easyllm_kit.initialize_database:21] INFO - Initialized new database: qwen-vl-max_ans_release_livepro
2025-05-18 12:21:31, generation_runner [generation_runner.run:204] INFO - start generating answers for english -- main_question_id: 1
2025-05-18 12:22:00, easyllm_kit [easyllm_kit.write_to_database:35] INFO - Stored answer for record_idx english_1.
2025-05-18 12:22:00, generation_runner [generation_runner.run:240] INFO - Generation complete

## Using pre-wrapped models in `easyllm_kit`

`easyllm_kit` exposes several popular LLM endpoints through an OpenAI-style interface.  
To invoke one, simply supply:

- **`model_name`** – the provider key recognised by `easyllm_kit`
- **`model_full_name`** – the exact model identifier offered by that provider

| `model_name`        |  `model_full_name`  |
|---------------------|----------------------------------|
| `gpt4o`             | `o1`, `o1-mini`, `gpt-4o` |
| `claude_35_sonnet`  | `claude-3-5-sonnet-20240620` (default) |
| `gemini`            | Any Gemini model ID published by Google |

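A small lookup built from the table can catch typos in a configuration before any API call is made. This is only a local sketch: `LISTED_MODELS` and `is_listed` are hypothetical names, not part of `easyllm_kit`, and a pair missing from the table may still be accepted by the library.

```python
# Provider keys and model identifiers copied from the table above.
# None means the table does not restrict the identifier for that provider.
LISTED_MODELS = {
    "gpt4o": {"o1", "o1-mini", "gpt-4o"},
    "claude_35_sonnet": {"claude-3-5-sonnet-20240620"},
    "gemini": None,
}


def is_listed(model_name, model_full_name):
    """Return True if the (model_name, model_full_name) pair appears in the table."""
    if model_name not in LISTED_MODELS:
        return False
    allowed = LISTED_MODELS[model_name]
    if allowed is None:
        # Any non-empty identifier is accepted for this provider
        return bool(model_full_name)
    return model_full_name in allowed


print(is_listed("gpt4o", "gpt-4o"))
print(is_listed("gpt4o", "got-4o"))
```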

To reuse the above config, we simply modify `model_name`, `model_full_name`, and `api_key` to call gpt-4o using `easyllm_kit`.

In [35]:
yaml_content = """
runner_name: generation

data:
  data_dir: ./data/release_livepro.json
  question_id: english_1_1_r4  # suppose we generate answers only for question 1

model:
  model_name: gpt4o
  model_full_name: gpt-4o
  use_api: true
  api_key: sk-proj-xxx
  use_litellm_api: false  # set to false to skip the LiteLLM API
  use_ocr: false
  use_pot: false
  is_reasoning_model: false

generation:
  temperature: 0.0
  top_p: 0.9
  max_length: 1024
"""

# Save the content to a file named config_easyllm.yaml
with open("config_easyllm.yaml", "w") as file:
    file.write(yaml_content)
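Since the two configurations differ mainly in the `model` section, the second one can also be thought of as the first with a few keys overridden. The sketch below illustrates that idea on plain dictionaries; `override` is a hypothetical helper, and the dictionaries are abbreviated copies of the YAML above. Keys that the second config drops, such as `api_url`, would still need to be removed separately.

```python
import copy


def override(base, changes):
    """Return a copy of `base` with the nested keys from `changes` applied on top."""
    merged = copy.deepcopy(base)
    for key, value in changes.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Recurse so that sibling keys in the same section are preserved
            merged[key] = override(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


base = {
    "runner_name": "generation",
    "model": {"model_name": "custom_llm", "model_full_name": "qwen-vl-max", "api_key": "sk-xxxx"},
    "generation": {"temperature": 0.0, "top_p": 0.9, "max_length": 1024},
}
gpt4o = override(
    base,
    {"model": {"model_name": "gpt4o", "model_full_name": "gpt-4o", "api_key": "sk-proj-xxx"}},
)
print(gpt4o["model"])
```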


In [36]:
config = OmegaConf.load("config_easyllm.yaml")

runner = Runner.build_from_config(config)

runner.run()

2025-05-18 13:11:00, generation_runner [generation_runner.filter_dataset_by_question_id:94] INFO - Filtering dataset for language: english, main_question_id: 1
2025-05-18 13:11:00, generation_runner [generation_runner.filter_dataset_by_question_id:105] INFO - Found 5 questions matching english_1_1_r4
2025-05-18 13:11:00, generation_runner [generation_runner.filter_dataset_by_question_id:118] INFO - Total of 5 questions matched across all filters
2025-05-18 13:11:00, easyllm_kit [easyllm_kit.initialize_database:21] INFO - Initialized new database: gpt-4o_ans_release_livepro
2025-05-18 13:11:00, generation_runner [generation_runner.run:204] INFO - start generating answers for english -- main_question_id: 1
2025-05-18 13:11:09, easyllm_kit [easyllm_kit.write_to_database:35] INFO - Stored answer for record_idx english_1.
2025-05-18 13:11:09, generation_runner [generation_runner.run:240] INFO - Generation complete