<a href="https://colab.research.google.com/github/evlko/Sharable-Data/blob/main/SteamTrender_Summary_Model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
%%capture
!pip install datasets transformers

The dataset was collected and labeled by hand, using the Steam API to fetch the reviews. It contains 50 samples in total.

Fetch the data:

In [2]:
dataset_url = "https://raw.githubusercontent.com/Steam-Trender/Steam-Trender-ML/refs/heads/main/steam_summaries.json"
filename = dataset_url.split("/")[-1]

In [3]:
%%capture
!wget $dataset_url

The dataset follows this schema (*an illustrative example*):

```python
[
  {
    "game_id": 12345,
    "class": "positive",
    "text": "This game is absolutely amazing. Loved the gameplay and graphics!",
    "summary": "Great gameplay and visuals."
  },
  ...
]
```

The `game_id` and `class` fields are not needed for this task, so we drop them.
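
To make the schema concrete, here is a minimal plain-Python sketch of the same idea: check that each record has the two fields the summarizer needs and discard the rest. The sample record is the illustrative one from above, and `REQUIRED_FIELDS` and `keep_required` are hypothetical names introduced only for this example; the notebook itself does the equivalent with `remove_columns`.

```python
import json

# Fields the summarization task actually uses (assumption based on the schema above).
REQUIRED_FIELDS = {"text", "summary"}

# Illustrative record, not real data from the dataset.
sample_json = """
[
  {
    "game_id": 12345,
    "class": "positive",
    "text": "This game is absolutely amazing. Loved the gameplay and graphics!",
    "summary": "Great gameplay and visuals."
  }
]
"""


def keep_required(record):
    """Return a copy of the record with only the required fields."""
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        raise ValueError(f"record is missing fields: {sorted(missing)}")
    return {field: record[field] for field in sorted(REQUIRED_FIELDS)}


records = json.loads(sample_json)
cleaned = [keep_required(record) for record in records]
print(sorted(cleaned[0].keys()))
```

Failing loudly on a missing field is a deliberate choice here: with only 50 hand-labeled samples, a single malformed record is worth noticing early.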

In [4]:
from datasets import load_dataset

dataset = load_dataset("json", data_files=filename, split="train")

dataset = dataset.remove_columns(["game_id", "class"])

dataset = dataset.train_test_split(test_size=0.1)


We'll use a small model. First, let's prepare (tokenize) the data.

In [5]:
from transformers import AutoTokenizer

model_name = "t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)


In [6]:
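def preprocess(example):
    # Tokenize the review text (model input), padded/truncated to 512 tokens.
    inputs = tokenizer(example["text"], max_length=512, truncation=True, padding="max_length")
    # `text_target` replaces the deprecated `as_target_tokenizer()` context manager.
    targets = tokenizer(text_target=example["summary"], max_length=128, truncation=True, padding="max_length")
    # Note: padded label positions keep the pad token id, so they count toward the loss.
    inputs["labels"] = targets["input_ids"]
    return inputs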

In [7]:
tokenized = dataset.map(preprocess, batched=True, remove_columns=["text", "summary"])

Map:   0%|          | 0/45 [00:00<?, ? examples/s]



Map:   0%|          | 0/5 [00:00<?, ? examples/s]
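
Because the targets are padded to a fixed length, the `labels` contain many pad tokens. A common refinement, not used in this notebook, is to replace those positions with `-100`, the default `ignore_index` of PyTorch's cross-entropy loss, so that padding does not contribute to the loss. A minimal sketch, assuming T5's pad token id of `0`; `mask_padding` and the example token ids are hypothetical:

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def mask_padding(label_ids, pad_token_id):
    """Replace pad token ids with IGNORE_INDEX in a batch of label sequences."""
    return [
        [tok if tok != pad_token_id else IGNORE_INDEX for tok in seq]
        for seq in label_ids
    ]


# Made-up token ids: 1 is T5's end-of-sequence id, 0 is its pad id.
labels = [[363, 7, 1, 0, 0], [8, 1, 0, 0, 0]]
masked = mask_padding(labels, pad_token_id=0)
print(masked)
```

Applying this inside `preprocess` would change the reported loss values, since the loss would then be averaged over real summary tokens only.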

Load the model:

In [8]:
from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer

model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

args = Seq2SeqTrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",  # named `evaluation_strategy` in older transformers releases
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=10,
    weight_decay=0.01,
    save_total_limit=2,
    predict_with_generate=True,
    fp16=True,  # mixed precision; requires a GPU
    report_to="none"
)




Run the **fine-tuning**:

In [9]:
trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    tokenizer=tokenizer,
)

trainer.train()



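| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1 | No log | 9.698309 |
| 2 | No log | 6.982814 |
| 3 | No log | 5.429379 |
| 4 | No log | 4.441137 |
| 5 | No log | 3.538127 |
| 6 | No log | 2.86998 |
| 7 | No log | 2.345702 |
| 8 | No log | 1.963758 |
| 9 | No log | 1.719827 |
| 10 | No log | 1.622154 |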


TrainOutput(global_step=120, training_loss=5.8821970621744795, metrics={'train_runtime': 34.0219, 'train_samples_per_second': 13.227, 'train_steps_per_second': 3.527, 'total_flos': 60903810662400.0, 'train_loss': 5.8821970621744795, 'epoch': 10.0})

Overall, that is a decent loss for so few epochs and such a small dataset: the validation loss drops from 9.70 to 1.62 over 10 epochs.
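
The step count in `TrainOutput` can be checked by hand. The sketch below assumes a single device and no gradient accumulation (the defaults), and it suggests why the Training Loss column shows "No log": the default `logging_steps` of `TrainingArguments` is 500, which is likely never reached in a run this short.

```python
import math

n_train = 45                  # training examples (50 samples, 10% held out)
batch_size = 4                # per_device_train_batch_size
epochs = 10                   # num_train_epochs
default_logging_steps = 500   # TrainingArguments default for logging_steps

# The last batch of each epoch is partial, so round up.
steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * epochs

print(steps_per_epoch, total_steps)
print(total_steps < default_logging_steps)
```

The result matches `global_step=120` in the output above. Setting a smaller `logging_steps` (for example, one epoch's worth of steps) would make the training loss appear in the table.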

Time to push the model 😎

In [10]:
%%capture
!pip install huggingface_hub

Log in to Hugging Face:

In [11]:
!huggingface-cli login



To log in, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) Y
Token is valid (permission: write).
The token `SteamTrender` has been saved to /root/.cache/huggingface/stored_tokens
Cannot authenticate through git-credential as no helper is defined on your machine.
You might have to re-authe

Repository details:

In [12]:
repo_name = "steam-trender-summarizer-t5-small"
hf_username = "evlko"
repo_id = f"{hf_username}/{repo_name}"

Save the model:

In [13]:
model.save_pretrained(repo_name)
tokenizer.save_pretrained(repo_name)

('steam-trender-summarizer-t5-small/tokenizer_config.json',
 'steam-trender-summarizer-t5-small/special_tokens_map.json',
 'steam-trender-summarizer-t5-small/spiece.model',
 'steam-trender-summarizer-t5-small/added_tokens.json',
 'steam-trender-summarizer-t5-small/tokenizer.json')

Upload the model to HF:

In [14]:
from huggingface_hub import create_repo, upload_folder

create_repo(repo_id, exist_ok=True)  # creates the repo on HF if it does not exist yet

upload_folder(
    repo_id=repo_id,
    folder_path=repo_name,
    commit_message="[add] fine-tuned summarizer"
)


CommitInfo(commit_url='https://huggingface.co/evlko/steam-trender-summarizer-t5-small/commit/b3dcb792b6b673f2f50dd5cf0fc6e671ab5c7edb', commit_message='[add] fine-tuned summarizer', commit_description='', oid='b3dcb792b6b673f2f50dd5cf0fc6e671ab5c7edb', pr_url=None, repo_url=RepoUrl('https://huggingface.co/evlko/steam-trender-summarizer-t5-small', endpoint='https://huggingface.co', repo_type='model', repo_id='evlko/steam-trender-summarizer-t5-small'), pr_revision=None, pr_num=None)
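
Once the upload has finished, the checkpoint can be loaded back from the Hub by its `repo_id`. The following is a minimal sketch rather than part of the original notebook: it assumes the repository is accessible and PyTorch is installed, and the helper name `summarize` and the `max_new_tokens=64` limit are illustrative choices.

```python
def summarize(text, repo_id="evlko/steam-trender-summarizer-t5-small", max_new_tokens=64):
    """Generate a summary for one review with the fine-tuned checkpoint."""
    # Imported lazily so that defining the helper does not require the library.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    model = AutoModelForSeq2SeqLM.from_pretrained(repo_id)

    # Same input length limit as in training (512 tokens).
    inputs = tokenizer(text, max_length=512, truncation=True, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `summarize("...")` with a review text downloads the model on first use. For repeated calls it would be better to load the tokenizer and model once and reuse them.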