# Data seminar

In today's seminar we will learn how to run and evaluate VLMs (Vision-Language Models) on benchmarks in Google Colab.

**Goal:**  
Learn the basic workflow of the lmms-eval tool: launch it, test a model on standard benchmarks, then build a small benchmark of our own and evaluate the model on it.

**Plan:**  
- Install lmms-eval and choose a model.
- Run the model on ready-made benchmarks.
- Build our own mini-benchmark and see how the model copes with it.
- Improve the distractors with an auxiliary model.
- Work out how to analyze the results.

**Important:** Don't forget to select a runtime with a GPU.
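
A quick way to confirm that the runtime really has a GPU is to check for the `nvidia-smi` tool. This is a minimal sketch using only the standard library; the helper name `gpu_available` is made up for illustration, and it assumes an NVIDIA GPU runtime such as the ones Colab offers.

```python
import shutil
import subprocess


def gpu_available() -> bool:
    """Return True if the `nvidia-smi` tool exists and exits successfully."""
    if shutil.which("nvidia-smi") is None:
        return False
    try:
        result = subprocess.run(["nvidia-smi"], capture_output=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


print("GPU runtime detected:", gpu_available())
```

If this prints `False`, switch the runtime type before installing anything, since changing it later resets the session.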

# Installing LMMs-Eval and its dependencies

In [None]:
!pip install git+https://github.com/huggingface/transformers
!pip install -U accelerate

!git clone https://github.com/EvolvingLMMs-Lab/lmms-eval.git

%cd lmms-eval
!pip install -e .
%cd ..

In [None]:
import os
os.environ['HF_TOKEN'] = "..." # Token from a freshly created account
# settings -> access tokens -> create
# needed to work with Hugging Face datasets

# Running the model on one or more benchmarks

We want to measure the model's quality on the mmmu benchmark.

All benchmark configs live in lmms-eval/lmms_eval/tasks \
All model implementations live in lmms-eval/lmms_eval/models

In [None]:
!python3 -m accelerate.commands.launch \
            --num_processes=1 \
            -m lmms_eval \
            --model qwen2_vl \
            --model_args "pretrained=Qwen/Qwen2-VL-2B-Instruct" \
            --tasks mmmu \
            --batch_size 1 \
            --log_samples \
            --log_samples_suffix test \
            --output_path ./logs/ \
            --limit 300

In [None]:
# Saved output of the command above

The following values were not passed to `accelerate launch` and had defaults used instead:
	`--num_machines` was set to a value of `1`
	`--mixed_precision` was set to a value of `'no'`
	`--dynamo_backend` was set to a value of `'no'`
2025-10-22 14:23:27.810962: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1761143007.844239    9479 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1761143007.854133    9479 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0000 00:00:1761143007.878359    9479 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1761143007.87840

This result matches the one reported on the HF model page: https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct

It is also worth looking into the `logs` folder (the `--output_path` passed above): it now contains detailed information about the model's run.
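
To find what was written there, it can help to list the files first. The sketch below only assumes that the run leaves JSON or JSONL files somewhere under `./logs`; the exact file names and folder layout depend on the lmms-eval version, and `list_log_files` is a hypothetical helper, not part of the tool.

```python
from pathlib import Path


def list_log_files(log_dir="./logs"):
    """Return all .json and .jsonl files under log_dir, sorted by path."""
    root = Path(log_dir)
    if not root.exists():
        return []
    return sorted(p for p in root.rglob("*") if p.suffix in {".json", ".jsonl"})


for path in list_log_files("./logs"):
    print(path, f"({path.stat().st_size} bytes)")
```

Opening one of the listed files with `json.load` (or line by line for `.jsonl`) is then enough to see which fields are available.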

## Let's pick a couple more benchmarks

In [None]:
!python3 -m accelerate.commands.launch \
            --num_processes=1 \
            -m lmms_eval \
            --model qwen2_vl \
            --model_args "pretrained=Qwen/Qwen2-VL-2B-Instruct" \
            --tasks mme,gqa \
            --batch_size 1 \
            --log_samples \
            --log_samples_suffix test \
            --output_path ./logs/ \
            --limit 300

The following values were not passed to `accelerate launch` and had defaults used instead:
	`--num_machines` was set to a value of `1`
	`--mixed_precision` was set to a value of `'no'`
	`--dynamo_backend` was set to a value of `'no'`
2025-10-22 21:38:17.097480: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1761169097.118004    5769 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1761169097.124282    5769 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0000 00:00:1761169097.140340    5769 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1761169097.14036

# Building our own benchmark

Now we will go through the whole process of creating a benchmark:
1) Collect the data
2) Publish it on HF (convenient)
3) Write a config for lmms-eval
4) Run the model

Let's start with data collection: we will scrape the Steam search pages that list games.

In [13]:
import requests
from bs4 import BeautifulSoup
from PIL import Image
from io import BytesIO
import time

def get_steam_search_page_html(page):
    url = f"https://store.steampowered.com/search/?filter=topsellers&page={page}"
    resp = requests.get(url, timeout=15)
    resp.raise_for_status()
    return resp.text

def parse_steam_search_page(html):
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for row in soup.select("a.search_result_row"):
        title_tag = row.select_one("span.title")
        logo_tag = row.select_one("div.search_capsule img")
        if title_tag and logo_tag and logo_tag.get("src"):
            title = title_tag.text.strip()
            img_url = logo_tag["src"]
            results.append({
                "title": title,
                "img_url": img_url
            })
    return results

def download_images(game_infos):
    games_with_images = []
    for info in game_infos:
        try:
            img_bytes = requests.get(info["img_url"], timeout=10).content
            img = Image.open(BytesIO(img_bytes)).convert("RGB")
            games_with_images.append({
                "title": info["title"],
                "image": img
            })
        except Exception as e:
            print(f"Error with {info['title']}: {e}")
    return games_with_images

def scrape_n_steam_pages(n):
    all_infos = []
    for page in range(1, n+1):
        print(f"Requesting page {page}...")
        html = get_steam_search_page_html(page)
        page_infos = parse_steam_search_page(html)
        print(f"  Games found: {len(page_infos)}")
        all_infos.extend(page_infos)
        time.sleep(1)  # pause between requests to avoid hammering the server
    print("Downloading images...")
    games_final = download_images(all_infos)
    return games_final

games = scrape_n_steam_pages(40)
print(f"Total games: {len(games)}")

Requesting page 1...
  Games found: 25
Requesting page 2...
  Games found: 25
Requesting page 3...
  Games found: 25
Requesting page 4...
  Games found: 25
Requesting page 5...
  Games found: 25
Requesting page 6...
  Games found: 25
Requesting page 7...
  Games found: 25
Requesting page 8...
  Games found: 25
Requesting page 9...
  Games found: 25
Requesting page 10...
  Games found: 25
Requesting page 11...
  Games found: 25
Requesting page 12...
  Games found: 25
Requesting page 13...
  Games found: 25
Requesting page 14...
  Games found: 25
Requesting page 15...
  Games found: 25
Requesting page 16...
  Games found: 25
Requesting page 17...
  Games found: 25
Requesting page 18...
  Games found: 25
Requesting page 19...
  Games found: 25
Requesting page 20...
  Games found: 25
Requesting page 21...
  Games found: 25
Requesting page 22...
  Games found: 25
Requesting page 23...
  Games found: 
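
The same game could in principle show up on more than one search page; that is an assumption, not something the output above shows. If it happens, duplicate titles would later end up as both the correct answer and a distractor. A minimal sketch of removing them, with a hypothetical helper `dedupe_by_title` that works on any list of dicts with a `title` key:

```python
def dedupe_by_title(items):
    """Keep the first entry for each title, preserving the original order."""
    seen = set()
    unique = []
    for item in items:
        key = item["title"].strip().casefold()
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique


sample = [
    {"title": "Counter-Strike 2"},
    {"title": "Dota 2"},
    {"title": "counter-strike 2 "},
]
print([g["title"] for g in dedupe_by_title(sample)])
# ['Counter-Strike 2', 'Dota 2']
```

Applied to the scraped list, this would be `games = dedupe_by_title(games)` before the dataset is created.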

### Now let's convert the data to the HF format and push it to a repo

In [None]:
from datasets import Dataset, Features, Value, Image as DatasetsImage

features = Features({
    "title": Value("string"),
    "image": DatasetsImage()
})

ds = Dataset.from_list(games, features=features)
print(ds[0])

ds.push_to_hub("tempqqq/steam-exact-match", private=False)

{'title': 'Counter-Strike 2', 'image': <PIL.PngImagePlugin.PngImageFile image mode=RGB size=231x87 at 0x78B5616E25D0>}


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ? shards/s]

Map:   0%|          | 0/250 [00:00<?, ? examples/s]

Creating parquet from Arrow format:   0%|          | 0/3 [00:00<?, ?ba/s]

Processing Files (0 / 0)      : |          |  0.00B /  0.00B            

New Data Upload               : |          |  0.00B /  0.00B            

                              :   6%|5         |  524kB / 9.20MB            

CommitInfo(commit_url='https://huggingface.co/datasets/tempqqq/steam-exact-match/commit/5505da8f20650cc4b8a6d280d0d17bd938609f31', commit_message='Upload dataset', commit_description='', oid='5505da8f20650cc4b8a6d280d0d17bd938609f31', pr_url=None, repo_url=RepoUrl('https://huggingface.co/datasets/tempqqq/steam-exact-match', endpoint='https://huggingface.co', repo_type='dataset', repo_id='tempqqq/steam-exact-match'), pr_revision=None, pr_num=None)

# Preparing an lmms-eval config for our benchmark

An lmms-eval task folder looks like this:

```
task_name
-- task_name.yaml
-- utils.py
```

task_name.yaml - the main task config \
utils.py - helper functions for computing the metrics

In [None]:
!mkdir -p /content/lmms-eval/lmms_eval/tasks/steam_exact_match/

In [None]:
%%writefile /content/lmms-eval/lmms_eval/tasks/steam_exact_match/steam_exact_match.yaml
dataset_path: tempqqq/steam-exact-match
task: steam_exact_match
test_split: train
output_type: generate_until
doc_to_visual: "image"
doc_to_text: "Write only the name of the game"
doc_to_target: "title"
generation_kwargs:
  max_new_tokens: 16
  temperature: 0
  top_p: 1.0
  num_beams: 1
  do_sample: false
metric_list:
  - metric: exact_match
    aggregation: mean
    higher_is_better: true
    ignore_case: true
    ignore_punctuation: true
metadata:
  - version: 0.0

Writing /content/lmms-eval/lmms_eval/tasks/steam_exact_match/steam_exact_match.yaml


In [None]:
!python3 -m accelerate.commands.launch \
            --num_processes=1 \
            -m lmms_eval \
            --model qwen2_vl \
            --model_args "pretrained=Qwen/Qwen2-VL-2B-Instruct" \
            --tasks steam_exact_match \
            --batch_size 1 \
            --log_samples \
            --log_samples_suffix test \
            --output_path ./logs/

The following values were not passed to `accelerate launch` and had defaults used instead:
	`--num_machines` was set to a value of `1`
	`--mixed_precision` was set to a value of `'no'`
	`--dynamo_backend` was set to a value of `'no'`
2025-10-22 21:21:37.129060: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1761168097.148608    1472 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1761168097.154542    1472 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0000 00:00:1761168097.169277    1472 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1761168097.16930

# Switching the task format to multiple choice

The exact-match approach above is overly strict: if the model writes the name even slightly differently, the answer is not counted. Let's try changing the format.
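
To make the strictness concrete, here is a small sketch that only approximates what `ignore_case` and `ignore_punctuation` do. It is not the lmms-eval implementation; `normalize` and `exact_match` are illustrative helpers written for this example.

```python
import string


def normalize(text):
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    no_punct = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(no_punct.split())


def exact_match(prediction, target):
    return normalize(prediction) == normalize(target)


print(exact_match("counter-strike 2!", "Counter-Strike 2"))  # True
print(exact_match("Counter Strike 2", "Counter-Strike 2"))   # False
print(exact_match("CS2", "Counter-Strike 2"))                # False
```

Even with case and punctuation ignored, a space instead of a hyphen or a common abbreviation already counts as a miss.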

We will pick the negatives (distractors, i.e. the wrong answer options) at random.

In [24]:
import random

def group_choices(games):
    options = []
    total = len(games)
    for idx, ans in enumerate(games):
        other_answers = [games[i]['title'] for i in range(total) if i != idx]
        sampled = random.sample(other_answers, min(3, len(other_answers)))
        group = [ans['title']] + sampled
        random.shuffle(group)
        options.append(group)
    return options


choices_all = group_choices(games)
print(choices_all[0], choices_all[1], sep='\n')
games = [
    {**game_info, 'choices': choices}
    for game_info, choices in zip(games, choices_all)
]

['Gang Beasts', 'F1® 25', 'Golden Lap', 'Battlefield™ 6']
['Sea of Thieves: 2025 Edition', 'Steam Deck', 'Hogwarts Legacy: Dark Arts Pack', 'Football Manager 26']


In [25]:
from datasets import Dataset, Features, Value, Image as DatasetsImage, Sequence

features = Features({
    "title": Value("string"),
    "choices": Sequence(Value("string")),
    "image": DatasetsImage()
})

ds = Dataset.from_list(games, features=features)
print(ds[0])

ds.push_to_hub("tempqqq/steam_multiple_choice", private=False)

{'title': 'Battlefield™ 6', 'image': <PIL.PngImagePlugin.PngImageFile image mode=RGB size=231x87 at 0x7A3F60C1C410>, 'choices': ['Gang Beasts', 'F1® 25', 'Golden Lap', 'Battlefield™ 6']}


Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ? shards/s]

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

Creating parquet from Arrow format:   0%|          | 0/10 [00:00<?, ?ba/s]

Processing Files (0 / 0)      : |          |  0.00B /  0.00B            

New Data Upload               : |          |  0.00B /  0.00B            

                              :  97%|#########6| 35.9MB / 37.0MB            

CommitInfo(commit_url='https://huggingface.co/datasets/tempqqq/steam_multiple_choice/commit/7a72ae145e3ca18b14f7d20eee3a0de7f1690bbb', commit_message='Upload dataset', commit_description='', oid='7a72ae145e3ca18b14f7d20eee3a0de7f1690bbb', pr_url=None, repo_url=RepoUrl('https://huggingface.co/datasets/tempqqq/steam_multiple_choice', endpoint='https://huggingface.co', repo_type='dataset', repo_id='tempqqq/steam_multiple_choice'), pr_revision=None, pr_num=None)

In [26]:
!mkdir -p /content/lmms-eval/lmms_eval/tasks/steam_multiple_choice

In [27]:
%%writefile /content/lmms-eval/lmms_eval/tasks/steam_multiple_choice/steam_multiple_choice.yaml
dataset_path: tempqqq/steam_multiple_choice
task: steam_multiple_choice
test_split: train
output_type: generate_until
doc_to_visual: "image"
doc_to_text: !function utils.prepare_input
process_results: !function utils.process_results
doc_to_target: "title"
generation_kwargs:
  max_new_tokens: 16
  temperature: 0
  top_p: 1.0
  num_beams: 1
  do_sample: false
metric_list:
  - metric: exact_match
    aggregation: mean
    higher_is_better: true
    ignore_case: true
    ignore_punctuation: true
metadata:
  - version: 0.0

Writing /content/lmms-eval/lmms_eval/tasks/steam_multiple_choice/steam_multiple_choice.yaml


This time we have to write utils.py to define how the answer options are turned into a prompt and how the final result is processed.

Note how the main config refers to functions from utils.py (the `!function utils.<name>` entries).

In [28]:
%%writefile /content/lmms-eval/lmms_eval/tasks/steam_multiple_choice/utils.py
import numpy as np

def prepare_input(doc):
    # Build a prompt that lists the four options as "A) ...", "B) ...", etc.
    letters = ['A', 'B', 'C', 'D']
    choices = [f"{letter}) {doc['choices'][i]}" for i, letter in enumerate(letters)]
    choices_str = '\n'.join(choices)
    prompt = "Choose the right name of the game. Write ONLY one letter A-D\n"
    return prompt + choices_str


def process_results(doc, result):
    letters = ['A', 'B', 'C', 'D']
    # First non-space character of the answer; empty string if the model returned nothing.
    predicted_letter = result[0].strip()[:1].upper()
    if predicted_letter in letters:
        choice_idx = letters.index(predicted_letter)
    else:
        # Unparseable answer: fall back to a random guess (this makes the score slightly noisy).
        choice_idx = np.random.randint(4)
    return {
        'exact_match': doc['title'] == doc['choices'][choice_idx]
    }

Writing /content/lmms-eval/lmms_eval/tasks/steam_multiple_choice/utils.py


In [29]:
!python3 -m accelerate.commands.launch \
            --num_processes=1 \
            -m lmms_eval \
            --model qwen2_vl \
            --model_args "pretrained=Qwen/Qwen2-VL-2B-Instruct" \
            --tasks steam_multiple_choice \
            --batch_size 1 \
            --log_samples \
            --log_samples_suffix test \
            --output_path ./logs/ \
            --verbosity=DEBUG

The following values were not passed to `accelerate launch` and had defaults used instead:
	`--num_machines` was set to a value of `1`
	`--mixed_precision` was set to a value of `'no'`
	`--dynamo_backend` was set to a value of `'no'`
2025-10-23 13:16:40 | INFO     | __main__:cli_evaluate:311 - Verbosity set to DEBUG
2025-10-23 13:16:41 | DEBUG    | lmms_eval.tasks:_get_task_and_group:458 - `group` and `group_alias` keys in tasks' configs will no longer be used in the next release of lmms-eval. `tag` will be used to allow to call a collection of tasks just like `group`. `group` will be removed in order to not cause confusion with the new ConfigurableGroup which will be the offical way to create groups with addition of group-wide configuations.
2025-10-23 13:16:41 | DEBUG    | lmms_eval.tasks:_get_task_and_group:484 - 

This doesn't look very informative...
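
A more telling view than the raw console output is the distribution of letters the model actually answers with. The sketch below works on a plain list of response strings; the list here is hypothetical, and in practice it would have to be extracted from the per-sample logs, whose format depends on the lmms-eval version.

```python
from collections import Counter


def letter_distribution(responses, letters="ABCD"):
    """Count the first non-space character of each response; anything else is 'invalid'."""
    counts = Counter()
    for response in responses:
        first = response.strip()[:1].upper()
        counts[first if first and first in letters else "invalid"] += 1
    return dict(counts)


# Hypothetical responses, not taken from the run above.
responses = ["A", " b) Dota 2", "D", "The game is ...", "", "D"]
print(letter_distribution(responses))
# {'A': 1, 'B': 1, 'D': 2, 'invalid': 2}
```

A strongly skewed distribution, or a large share of invalid answers, says more about the model's behaviour than a single accuracy number.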

# Let's try to improve the distractors

We suspect that randomly chosen answer options make the correct answer far too obvious.

Let's rebuild the options as follows:
1) Compute embeddings of all answers
2) For each correct answer, take the 3 nearest other answers as negatives
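
The two steps above can be sketched on toy vectors before bringing in a real embedding model. This is a simplified illustration with NumPy only: `nearest_distractors` is a hypothetical helper, and the 2-D vectors stand in for real sentence embeddings.

```python
import numpy as np


def nearest_distractors(embeddings, k=3):
    """For each row, return indices of the k most cosine-similar other rows."""
    emb = np.asarray(embeddings, dtype=float)
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)  # unit length
    sim = emb @ emb.T                                       # cosine similarity
    np.fill_diagonal(sim, -np.inf)                          # never pick the item itself
    return np.argsort(-sim, axis=1)[:, :k]


# Two pairs of similar toy vectors.
toy = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
print(nearest_distractors(toy, k=1).ravel())
# [1 0 3 2]
```

Each item gets its closest neighbour as a distractor, which is exactly what makes the options harder to tell apart.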

In [6]:
# !pip install -U sentence_transformers

In [17]:
from sentence_transformers import SentenceTransformer
import numpy as np
import torch


class DistractorBuilder:
    def __init__(self, model_name='ai-forever/FRIDA'):
        self.emb_model = SentenceTransformer(model_name)

    def _choice_2d(self, data, size, replace):
        # Sample `size` items independently from each row of `data`.
        res_data = [np.random.choice(line, size=size, replace=replace) for line in data]
        return np.array(res_data)

    def build_nearest(self, answers, limits_min=None, limits_max=None,
                      coefs_min=None, coefs_max=None, num_choices=4,
                      duplicates=False, groups=None, sentence_prefix='categorize_topic:'):
        groups = [0] * len(answers) if groups is None else groups
        groups = np.array(groups)
        unique_groups = set(groups)

        # A scalar (or None) setting is broadcast to every group.
        limits_min = limits_min if isinstance(limits_min, dict) else {g: limits_min for g in unique_groups}
        limits_max = limits_max if isinstance(limits_max, dict) else {g: limits_max for g in unique_groups}
        coefs_min = coefs_min if isinstance(coefs_min, dict) else {g: coefs_min for g in unique_groups}
        coefs_max = coefs_max if isinstance(coefs_max, dict) else {g: coefs_max for g in unique_groups}

        res_dict = {}
        for cur_group in unique_groups:
            limit_min, limit_max = limits_min[cur_group], limits_max[cur_group]
            coef_min, coef_max = coefs_min[cur_group], coefs_max[cur_group]
            cur_answers = np.array(answers)[groups == cur_group]
            cur_answers = [cur_answer.strip(" .,;:!?") for cur_answer in cur_answers]

            unique_choices, unique_idx = np.unique(
                cur_answers,
                return_index=True
            )

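            # Normalize so that the dot product below is cosine similarity
            # and every answer ranks itself first.
            choices_embs = self.emb_model.encode(
                [sentence_prefix + choice for choice in cur_answers],
                convert_to_tensor=True,
                normalize_embeddings=True
            )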

            sim_scores = choices_embs @ choices_embs[unique_idx].T
            indices = torch.argsort(sim_scores, dim=1, descending=True).cpu()

            min_idx = int(coef_min * len(unique_choices)) if coef_min is not None else 1
            min_idx = limit_min if (limit_min is not None and limit_min > min_idx) else min_idx
            max_idx = int(coef_max * len(unique_choices)) if coef_max is not None else len(answers)
            max_idx = limit_max if (limit_max is not None and limit_max < max_idx) else max_idx

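            # Rank 0 is expected to be the answer itself (highest similarity).
            answer_indices = indices[:, 0].unsqueeze(-1).numpy()
            # Sample the distractors from ranks [min_idx, max_idx).
            distractors_indices = self._choice_2d(indices[:, min_idx:max_idx], size=num_choices - 1, replace=False)
            # Distractors come first, so the correct answer is always the last option.
            choices = unique_choices[np.concatenate([distractors_indices, answer_indices], axis=1)].tolist()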

            res_dict[cur_group] = choices

        res_list = []
        for group in groups:
            res_list.append(res_dict[group].pop(0))
        return res_list

In [18]:
words = [
    "Рецепт", "Сковорода", "Приправа", "Выпечка", "Маринад",
    "Тренировка", "Мяч", "Стадион", "Победа", "Спортсмен",
    "Процессор", "Алгоритм", "Байт", "Монитор", "Клавиатура",
    "Мелодия", "Гитара", "Нота", "Оркестр", "Ритм"
]

distractor_builder = DistractorBuilder('ai-forever/FRIDA')
# Build the 3 nearest distractors (4 options in total) for each word
result = distractor_builder.build_nearest(words, num_choices=4, limits_max=4)
print(result)


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


[['Маринад', 'Выпечка', 'Приправа', 'Рецепт'], ['Приправа', 'Выпечка', 'Рецепт', 'Сковорода'], ['Выпечка', 'Рецепт', 'Маринад', 'Приправа'], ['Сковорода', 'Приправа', 'Рецепт', 'Выпечка'], ['Выпечка', 'Рецепт', 'Приправа', 'Маринад'], ['Мяч', 'Стадион', 'Спортсмен', 'Тренировка'], ['Тренировка', 'Спортсмен', 'Стадион', 'Мяч'], ['Победа', 'Спортсмен', 'Мяч', 'Стадион'], ['Стадион', 'Тренировка', 'Мяч', 'Победа'], ['Тренировка', 'Мяч', 'Стадион', 'Спортсмен'], ['Клавиатура', 'Алгоритм', 'Байт', 'Процессор'], ['Байт', 'Процессор', 'Клавиатура', 'Алгоритм'], ['Клавиатура', 'Алгоритм', 'Процессор', 'Байт'], ['Процессор', 'Клавиатура', 'Байт', 'Монитор'], ['Байт', 'Монитор', 'Процессор', 'Клавиатура'], ['Гитара', 'Ритм', 'Оркестр', 'Мелодия'], ['Ритм', 'Мелодия', 'Оркестр', 'Гитара'], ['Ритм', 'Мелодия', 'Клавиатура', 'Нота'], ['Мелодия', 'Ритм', 'Гитара', 'Оркестр'], ['Гитара', 'Оркестр', 'Мелодия', 'Ритм']]


In [19]:
titles_only = [game['title'] for game in games]
choices_only = distractor_builder.build_nearest(titles_only, num_choices=4, limits_max=4)
print('\n'.join([str(choices) for choices in choices_only][:5]))

games = [
    {**game_info, 'choices': choices}
    for game_info, choices in zip(games, choices_only)
]

['Counter-Strike 2', "Tom Clancy's Rainbow Six® Siege X", 'Call of Duty: United Offensive', 'Battlefield™ 6']
['EA SPORTS FC™ 25', 'eFootball™', 'EA SPORTS FC™ 26', 'Football Manager 26']
['GTFO', 'REMATCH', 'Shamania', 'Dispatch']
['Jurassic World Evolution 2', 'Jurassic World Evolution 3: Badlands Set', 'Jurassic World Evolution 2: Camp Cretaceous Dinosaur Pack', 'Jurassic World Evolution 3']
['Star Birds', 'V Rising', 'Victoria 3', 'RV There Yet']


This looks considerably better (harder). Let's see how the model performs.
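
One detail is worth checking before the upload: `build_nearest` appends the correct title after the distractors, so in every row printed above the correct answer is the last option. A model that tends to answer "D" could benefit from that. A minimal sketch of shuffling the options reproducibly, where `shuffle_choices` is a hypothetical helper and the rows are dummy data:

```python
import random


def shuffle_choices(rows, seed=0):
    """Return a copy of rows with each 'choices' list shuffled reproducibly."""
    rng = random.Random(seed)
    shuffled = []
    for row in rows:
        choices = list(row["choices"])  # copy, so the input is not modified
        rng.shuffle(choices)
        shuffled.append({**row, "choices": choices})
    return shuffled


# Dummy rows where the correct title is always last, as in the output above.
rows = [{"title": "Target", "choices": ["X", "Y", "Z", "Target"]} for _ in range(200)]
last_share = sum(r["choices"][-1] == r["title"] for r in shuffle_choices(rows)) / len(rows)
print(f"Correct answer is last in {last_share:.0%} of rows")
```

On the real data this would be `games = shuffle_choices(games)` before `Dataset.from_list`; since `process_results` compares titles rather than positions, nothing else needs to change.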

In [20]:
from datasets import Dataset, Features, Value, Image as DatasetsImage, Sequence

features = Features({
    "title": Value("string"),
    "choices": Sequence(Value("string")),
    "image": DatasetsImage()
})

ds = Dataset.from_list(games, features=features)
print(ds[0])

ds.push_to_hub("tempqqq/steam_multiple_choice_improved", private=False)

{'title': 'Battlefield™ 6', 'image': <PIL.PngImagePlugin.PngImageFile image mode=RGB size=231x87 at 0x7A3F60A7AA50>, 'choices': ['Counter-Strike 2', "Tom Clancy's Rainbow Six® Siege X", 'Call of Duty: United Offensive', 'Battlefield™ 6']}


Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ? shards/s]

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

Creating parquet from Arrow format:   0%|          | 0/10 [00:00<?, ?ba/s]

Processing Files (0 / 0)      : |          |  0.00B /  0.00B            

New Data Upload               : |          |  0.00B /  0.00B            

                              :   1%|          |  267kB / 37.0MB            

CommitInfo(commit_url='https://huggingface.co/datasets/tempqqq/steam_multiple_choice_improved/commit/ebccf46e8e61174278227f41fe37a5eea662e8e1', commit_message='Upload dataset', commit_description='', oid='ebccf46e8e61174278227f41fe37a5eea662e8e1', pr_url=None, repo_url=RepoUrl('https://huggingface.co/datasets/tempqqq/steam_multiple_choice_improved', endpoint='https://huggingface.co', repo_type='dataset', repo_id='tempqqq/steam_multiple_choice_improved'), pr_revision=None, pr_num=None)

In [5]:
!mkdir -p /content/lmms-eval/lmms_eval/tasks/steam_multiple_choice_improved/

In [21]:
%%writefile /content/lmms-eval/lmms_eval/tasks/steam_multiple_choice_improved/steam_multiple_choice_improved.yaml
dataset_path: tempqqq/steam_multiple_choice_improved
task: steam_multiple_choice_improved
test_split: train
output_type: generate_until
doc_to_visual: "image"
doc_to_text: !function utils.prepare_input
process_results: !function utils.process_results
doc_to_target: "title"
generation_kwargs:
  max_new_tokens: 16
  temperature: 0
  top_p: 1.0
  num_beams: 1
  do_sample: false
metric_list:
  - metric: exact_match
    aggregation: mean
    higher_is_better: true
    ignore_case: true
    ignore_punctuation: true
metadata:
  - version: 0.0

Overwriting /content/lmms-eval/lmms_eval/tasks/steam_multiple_choice_improved/steam_multiple_choice_improved.yaml


In [22]:
%%writefile /content/lmms-eval/lmms_eval/tasks/steam_multiple_choice_improved/utils.py
import numpy as np

def prepare_input(doc):
    # Build a prompt that lists the four options as "A) ...", "B) ...", etc.
    letters = ['A', 'B', 'C', 'D']
    choices = [f"{letter}) {doc['choices'][i]}" for i, letter in enumerate(letters)]
    choices_str = '\n'.join(choices)
    prompt = "Choose the right name of the game. Write ONLY one letter A-D\n"
    return prompt + choices_str


def process_results(doc, result):
    letters = ['A', 'B', 'C', 'D']
    # First non-space character of the answer; empty string if the model returned nothing.
    predicted_letter = result[0].strip()[:1].upper()
    if predicted_letter in letters:
        choice_idx = letters.index(predicted_letter)
    else:
        # Unparseable answer: fall back to a random guess (this makes the score slightly noisy).
        choice_idx = np.random.randint(4)
    return {
        'exact_match': doc['title'] == doc['choices'][choice_idx]
    }

Overwriting /content/lmms-eval/lmms_eval/tasks/steam_multiple_choice_improved/utils.py


In [23]:
!python3 -m accelerate.commands.launch \
            --num_processes=1 \
            -m lmms_eval \
            --model qwen2_vl \
            --model_args "pretrained=Qwen/Qwen2-VL-2B-Instruct" \
            --tasks steam_multiple_choice_improved \
            --batch_size 1 \
            --log_samples \
            --log_samples_suffix test \
            --output_path ./logs/ \
            --verbosity=DEBUG

The following values were not passed to `accelerate launch` and had defaults used instead:
	`--num_machines` was set to a value of `1`
	`--mixed_precision` was set to a value of `'no'`
	`--dynamo_backend` was set to a value of `'no'`
2025-10-23 13:01:31 | INFO     | __main__:cli_evaluate:311 - Verbosity set to DEBUG
2025-10-23 13:01:31 | DEBUG    | lmms_eval.tasks:_get_task_and_group:458 - `group` and `group_alias` keys in tasks' configs will no longer be used in the next release of lmms-eval. `tag` will be used to allow to call a collection of tasks just like `group`. `group` will be removed in order to not cause confusion with the new ConfigurableGroup which will be the offical way to create groups with addition of group-wide configuations.
2025-10-23 13:01:31 | DEBUG    | lmms_eval.tasks:_get_task_and_group:484 - 

### It got better, but seemingly only slightly. What could be the reason?
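
When comparing two scores like this, it helps to know how much of the gap could be sampling noise. Below is a rough sketch based on the binomial standard error; the counts are hypothetical and are NOT results from the runs above. It also treats the two runs as independent, which is only an approximation here because both benchmarks use the same images.

```python
import math


def accuracy_with_stderr(num_correct, num_total):
    """Return accuracy and its binomial standard error sqrt(p * (1 - p) / n)."""
    p = num_correct / num_total
    return p, math.sqrt(p * (1 - p) / num_total)


# Hypothetical counts, NOT results from the runs above.
acc_a, se_a = accuracy_with_stderr(900, 1000)
acc_b, se_b = accuracy_with_stderr(880, 1000)
diff_se = math.sqrt(se_a ** 2 + se_b ** 2)
print(f"A: {acc_a:.3f} ± {se_a:.3f}, B: {acc_b:.3f} ± {se_b:.3f}")
print(f"difference: {acc_a - acc_b:.3f}, std. error of difference: {diff_se:.3f}")
```

A difference that is not clearly larger than about two standard errors is hard to distinguish from noise, so it is worth checking this before looking for deeper explanations.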