## Dataset

Load the small dataset from the paper.

In [1]:
from datasets import load_dataset

# load a dataset of (instruction, answer) pairs
raw_dataset = load_dataset("csujeong/FinancialStockTerms_Eng")

In [2]:
raw_dataset

DatasetDict({
    train: Dataset({
        features: ['QA_text'],
        num_rows: 135
    })
})

In [3]:
raw_dataset['train']['QA_text'][0]

'##Question: What is the futures?##Answer: A contract to buy or sell stocks at a certain price at a certain date in the future.'

In [4]:
len(raw_dataset['train'])

135

Create the training and validation parts in an 80/20 ratio (a naive sequential split without shuffling, used here only as an example).

In [5]:
from datasets import DatasetDict

# make the split
train_indices = range(int(0.8 * len(raw_dataset['train'])))
val_indices = range(int(0.8 * len(raw_dataset['train'])), len(raw_dataset['train']))

dataset_dict = {"train": raw_dataset["train"].select(train_indices),
                "test": raw_dataset["train"].select(val_indices)}

raw_datasets = DatasetDict(dataset_dict)
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['QA_text'],
        num_rows: 108
    })
    test: Dataset({
        features: ['QA_text'],
        num_rows: 27
    })
})
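
The sequential split above keeps the first 80% of rows for training, so any ordering in the source file leaks into the split. A minimal sketch of a seeded, shuffled alternative is given below; `shuffled_split_indices` is a hypothetical helper, and the resulting index lists could be passed to `.select(...)` in the same way as the ranges above. The `datasets` library also provides `Dataset.train_test_split`, which may be the simpler option in practice.

```python
import random

def shuffled_split_indices(n_rows, train_fraction=0.8, seed=42):
    """Return (train_indices, val_indices) after a seeded shuffle."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # deterministic for a fixed seed
    n_train = int(train_fraction * n_rows)
    return indices[:n_train], indices[n_train:]

# 135 rows, as in the dataset loaded above
train_indices, val_indices = shuffled_split_indices(135)
print(len(train_indices), len(val_indices))  # 108 27
```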

Load the PLM we want to fine-tune together with its tokenizer (the authors did not build their own tokenizer).

And the authors' PLM (their claims were untrue!)

In [6]:
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id_falcon = 'tiiuae/falcon-7b'
model_id = 'Qwen/Qwen2.5-7B'

tokenizer_falcon = AutoTokenizer.from_pretrained(model_id_falcon)
model_falcon = AutoModelForCausalLM.from_pretrained(model_id_falcon)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)


# check the special tokens and cap the maximum sequence length
# set pad_token_id equal to the eos_token_id if not set
if tokenizer.pad_token_id is None:
    tokenizer.pad_token_id = tokenizer.eos_token_id

# Set reasonable default for models without max length
if tokenizer.model_max_length > 100_000:
    tokenizer.model_max_length = 2048

if tokenizer_falcon.pad_token_id is None:
    tokenizer_falcon.pad_token_id = tokenizer_falcon.eos_token_id

if tokenizer_falcon.model_max_length > 100_000:
    tokenizer_falcon.model_max_length = 2048

The answer from falcon-7b, the model the authors fine-tuned. It can already answer the question.

In [7]:
question = 'What is Index?'

question_inputs = tokenizer_falcon(question, return_tensors='pt')
outputs = model_falcon.generate(**question_inputs)
res = tokenizer_falcon.decode(outputs[0], skip_special_tokens=True)
print(res)

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


what is Index?
Index is a measure of the performance of a stock market or a stock index


The answer from Qwen2.5-7B: it does not give the finance-domain answer yet.

In [8]:
question_inputs = tokenizer(question, return_tensors='pt')
outputs = model.generate(**question_inputs)
res = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(res)

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


what is Index? Index is a list of keywords or topics that are used to locate information in a document, database, or other collection of data. It is a tool that helps users quickly find the information they are looking for by providing a quick reference to the relevant sections or pages. Indexes are commonly found in books, journals, and websites, and they can be created manually or automatically using software.


Build the datasets for training.

Prompt templates can be applied here.

In [9]:
column_names = list(raw_datasets["train"].features)
column_names

['QA_text']

In [10]:
import random
from multiprocessing import cpu_count

def apply_chat_template(example, tokenizer):
    # each row looks like '##Question: ...##Answer: ...'
    question, answer = example['QA_text'].split('#Answer: ')
    question = question.replace('##Question: ', '')
    # strip the leftover '#' markers
    question = question.replace('#', '')
    answer = answer.replace('#', '')

    # deliberately leave the system prompt empty to rule out its influence
    messages = [{'role': 'system', 'content': ''},
                {'role': 'user', 'content': question},
                {'role': 'assistant', 'content': answer}]
    
    example["text"] = tokenizer.apply_chat_template(messages, tokenize=False)
        
    return example

column_names = list(raw_datasets["train"].features)
raw_datasets = raw_datasets.map(apply_chat_template,
                                num_proc=cpu_count(),
                                fn_kwargs={"tokenizer": tokenizer},
                                remove_columns=column_names,
                                desc="Applying chat template",)

# create the splits
train_dataset = raw_datasets["train"]
eval_dataset = raw_datasets["test"]

for index in random.sample(range(len(raw_datasets["train"])), 3):
    print(f"Sample {index} of the processed training set:\n\n{raw_datasets['train'][index]['text']}")


Sample 36 of the processed training set:

<|im_start|>system
<|im_end|>
<|im_start|>user
What is a Stockbroker?<|im_end|>
<|im_start|>assistant
A Stockbroker is a licensed professional who facilitates the buying and selling of stocks on behalf of clients.<|im_end|>

Sample 28 of the processed training set:

<|im_start|>system
<|im_end|>
<|im_start|>user
What is a Reverse Stock Split?<|im_end|>
<|im_start|>assistant
A Reverse Stock Split is the opposite of a regular stock split reducing the number of shares outstanding and increasing the stock price.<|im_end|>

Sample 90 of the processed training set:

<|im_start|>system
<|im_end|>
<|im_start|>user
What is Earnings Per Share (EPS)?<|im_end|>
<|im_start|>assistant
Earnings Per Share (EPS) is a measure of a company's profitability, calculated by dividing net income by the number of outstanding shares. It represents earnings on a per-share basis.<|im_end|>
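
The parsing step inside `apply_chat_template` can be checked in isolation. The sketch below uses a hypothetical helper `parse_qa` that splits on the full `##Answer: ` marker instead of removing every `#` character, under the assumption that each row follows the `##Question: ...##Answer: ...` layout seen in the first record.

```python
def parse_qa(qa_text):
    """Split '##Question: ...##Answer: ...' into (question, answer)."""
    question_part, answer = qa_text.split('##Answer: ', 1)
    question = question_part.replace('##Question: ', '', 1)
    return question.strip(), answer.strip()

# first record of the dataset, as printed earlier in the notebook
sample = ('##Question: What is the futures?##Answer: A contract to buy or sell '
          'stocks at a certain price at a certain date in the future.')
question, answer = parse_qa(sample)
print(question)
print(answer)
```

Unlike the version in the notebook, this variant leaves any `#` that appears inside an answer untouched.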



In fact, whenever we give the model a list of system, user, and assistant message dictionaries, its tokenizer converts the list into the prompt format the model expects.

In [11]:
messages = [{'role': 'system', 'content': 'you are a financial analyst'},
            {'role': 'user', 'content': 'how is the SPB Exchange doing?'},
            {'role': 'assistant', 'content': 'badly'}]

print(tokenizer.apply_chat_template(messages, tokenize=False))

<|im_start|>system
you are a financial analyst<|im_end|>
<|im_start|>user
how is the SPB Exchange doing?<|im_end|>
<|im_start|>assistant
badly<|im_end|>
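
To make the layout explicit, here is a pure-Python sketch that reproduces the format visible in the printed samples. `render_chatml` is a hypothetical helper written for illustration; the real tokenizer renders its own chat template, which may handle more cases (for example tool calls or a default system prompt).

```python
def render_chatml(messages):
    """Render messages in the ChatML-style layout shown in the samples above."""
    return ''.join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )

messages = [
    {'role': 'system', 'content': ''},
    {'role': 'user', 'content': 'What is a Stockbroker?'},
    {'role': 'assistant', 'content': 'A Stockbroker is a licensed professional.'},
]
rendered = render_chatml(messages)
print(rendered)
```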



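Quantize the model a little: the weights are loaded in 8-bit through bitsandbytes, and `torch_dtype="auto"` keeps the remaining tensors in the checkpoint's bfloat16 instead of float32.

We will not dwell on this for long.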

In [12]:
from transformers import BitsAndBytesConfig
import torch

quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,  # 8-bit weights; 4-bit options such as bnb_4bit_quant_type do not apply here
)


device_map = {"": torch.cuda.current_device()} if torch.cuda.is_available() else None

model_kwargs = dict(
    torch_dtype="auto",
    use_cache=False,
    device_map=device_map,
    quantization_config=quantization_config,
)
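
A rough back-of-the-envelope estimate shows why the precision matters. The sketch below counts only the bytes needed to store the weights; the parameter count is an approximate assumption for a 7B-class model, and activations, optimizer state, and quantization overhead are ignored.

```python
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "int8": 1}

def weight_memory_gib(n_params, dtype):
    """Memory needed to store n_params weights in the given dtype, in GiB."""
    return n_params * BYTES_PER_PARAM[dtype] / 1024**3

N_PARAMS = 7.6e9  # assumption: approximate size of a 7B-class model

for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_memory_gib(N_PARAMS, dtype):.1f} GiB")
```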

We will train with trl.

In [15]:
model

Qwen2ForCausalLM(
  (model): Qwen2Model(
    (embed_tokens): Embedding(152064, 3584)
    (layers): ModuleList(
      (0-27): 28 x Qwen2DecoderLayer(
        (self_attn): Qwen2SdpaAttention(
          (q_proj): Linear(in_features=3584, out_features=3584, bias=True)
          (k_proj): Linear(in_features=3584, out_features=512, bias=True)
          (v_proj): Linear(in_features=3584, out_features=512, bias=True)
          (o_proj): Linear(in_features=3584, out_features=3584, bias=False)
          (rotary_emb): Qwen2RotaryEmbedding()
        )
        (mlp): Qwen2MLP(
          (gate_proj): Linear(in_features=3584, out_features=18944, bias=False)
          (up_proj): Linear(in_features=3584, out_features=18944, bias=False)
          (down_proj): Linear(in_features=18944, out_features=3584, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): Qwen2RMSNorm((3584,), eps=1e-06)
        (post_attention_layernorm): Qwen2RMSNorm((3584,), eps=1e-06)
      )
    )
    (norm):

In [38]:
from trl import SFTTrainer
from peft import LoraConfig
from transformers import TrainingArguments

# directory for saving checkpoints
output_dir = 'data/qwen-7b-finance'

# based on config
training_args = TrainingArguments(
    fp16=True,  # fp16 mixed-precision training alongside the quantized base model
    do_eval=True,
    evaluation_strategy="steps",      
    eval_steps=10,
    gradient_accumulation_steps=10,
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False},
    learning_rate=2.0e-05,
    log_level="info",
    logging_steps=5,
    logging_strategy="steps",
    lr_scheduler_type="cosine",
    max_steps=-1,
    num_train_epochs=5,
    output_dir=output_dir,
    overwrite_output_dir=True,
    per_device_eval_batch_size=8,
    per_device_train_batch_size=4,
    load_best_model_at_end=True,
    # save_strategy = "no",
    seed=42,
)

# apply PEFT (LoRA) for causal language modeling
# list the layers the adapters are attached to
peft_config = LoraConfig(
        r=256,
        lora_alpha=128,
        lora_dropout=0.05,
        bias="none",
        task_type="CAUSAL_LM",
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "up_proj", "down_proj"],
)

trainer = SFTTrainer(
        model=model_id,
        model_init_kwargs=model_kwargs,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        dataset_text_field="text",
        tokenizer=tokenizer,
        # packing=True,
        peft_config=peft_config,
        max_seq_length=tokenizer.model_max_length,
    )

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).

Deprecated positional argument(s) used in SFTTrainer, please use the SFTConfig to set these arguments instead.
PyTorch: setting up devices
loading configuration file config.json from cache at /home/bezgin.aleksey3/.cache/huggingface/hub/models--Qwen--Qwen2.5-7B/snapshots/d149729398750b98c0af14eb82c78cfe92750796/config.json
Model config Qwen2Config {
  "_name_or_path": "Qwen/Qwen2.5-7B",
  "architectures": [
    "Qwen2ForCausalLM"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 151643,
  "eos_token_id": 151643,
  "hidden_act": "silu",
  "hidden_size": 3584,
  "initializer_range": 0.02,
  "intermediate_size": 18944,
  "max_position_embeddings": 131072,
  "max_window_layers": 28,
  "model_type"

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

All model checkpoint weights were used when initializing Qwen2ForCausalLM.

All the weights of Qwen2ForCausalLM were initialized from the model checkpoint at Qwen/Qwen2.5-7B.
If your task is similar to the task the model of the checkpoint was trained on, you can already use Qwen2ForCausalLM for predictions without further training.
loading configuration file generation_config.json from cache at /home/bezgin.aleksey3/.cache/huggingface/hub/models--Qwen--Qwen2.5-7B/snapshots/d149729398750b98c0af14eb82c78cfe92750796/generation_config.json
Generate config GenerationConfig {
  "bos_token_id": 151643,
  "eos_token_id": 151643,
  "max_new_tokens": 2048
}

PyTorch: setting up devices


Map:   0%|          | 0/108 [00:00<?, ? examples/s]

Map:   0%|          | 0/27 [00:00<?, ? examples/s]

Using auto half precision backend
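
The size of the LoRA adapter follows directly from the configuration. Each adapted `Linear(in, out)` layer gets two matrices, `A` of shape `r x in` and `B` of shape `out x r`, so it adds `r * (in + out)` trainable parameters. The sketch below plugs in the layer shapes from the printed model and `r=256`; the result agrees with the number of trainable parameters reported in the training log.

```python
# (in_features, out_features) of the targeted Linear layers, from the printed model
TARGET_SHAPES = {
    "q_proj": (3584, 3584),
    "k_proj": (3584, 512),
    "v_proj": (3584, 512),
    "o_proj": (3584, 3584),
    "up_proj": (3584, 18944),
    "down_proj": (18944, 3584),
}
RANK = 256        # r in LoraConfig
NUM_LAYERS = 28   # decoder layers in the printed model

def lora_params(in_features, out_features, rank):
    # A is (rank x in_features), B is (out_features x rank)
    return rank * (in_features + out_features)

per_layer = sum(lora_params(i, o, RANK) for i, o in TARGET_SHAPES.values())
total = per_layer * NUM_LAYERS
print(f"{total:,}")  # 484,442,112
```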


## Running SFT


The loss function is the token-level cross-entropy of next-token prediction, aggregated over the token positions.
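
A minimal sketch of that loss in plain Python follows. It assumes the per-token losses are averaged over positions; the function names are hypothetical, and the trainer's actual implementation works on batched tensors and skips masked positions.

```python
import math

def token_cross_entropy(logits, target_id):
    """Negative log-probability of target_id under softmax(logits)."""
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_sum_exp - logits[target_id]

def sequence_loss(all_logits, target_ids):
    """Mean cross-entropy over the positions of one sequence."""
    losses = [token_cross_entropy(l, t) for l, t in zip(all_logits, target_ids)]
    return sum(losses) / len(losses)

# with uniform logits the loss equals log(vocab_size)
vocab_size = 4
uniform_logits = [[0.0] * vocab_size for _ in range(3)]
loss = sequence_loss(uniform_logits, [0, 1, 2])
print(loss, math.log(vocab_size))
```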

In [39]:
train_result = trainer.train()

***** Running training *****
  Num examples = 108
  Num Epochs = 5
  Instantaneous batch size per device = 4
  Total train batch size (w. parallel, distributed & accumulation) = 40
  Gradient Accumulation steps = 10
  Total optimization steps = 10
  Number of trainable parameters = 484,442,112


| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 10   | 3.8431        | 4.660173        |



***** Running Evaluation *****
  Num examples = 27
  Batch size = 8
Saving model checkpoint to data/qwen-7b-finance/checkpoint-10
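
The figure of 10 optimization steps in the log can be reproduced by hand. The sketch below assumes a single device and that leftover micro-batches at the end of an epoch are floored away when dividing by the accumulation factor; this assumption matches the logged value. With `eval_steps=10`, evaluation therefore runs only once, at the final step.

```python
import math

def total_optimization_steps(num_examples, per_device_batch, grad_accum, epochs):
    """Number of optimizer updates for a single-device run."""
    micro_batches_per_epoch = math.ceil(num_examples / per_device_batch)
    updates_per_epoch = max(micro_batches_per_epoch // grad_accum, 1)
    return updates_per_epoch * epochs

steps = total_optimization_steps(num_examples=108, per_device_batch=4,
                                 grad_accum=10, epochs=5)
print(steps)  # 10
```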

Save the model locally

In [40]:
trainer.save_model(output_dir)  # target directory

Saving model checkpoint to data/qwen-7b-finance

tokenizer config file saved in d

## Loading and inference

A model trained with PEFT also has to be loaded correctly: load the base model first, then attach the saved adapter on top of it.

In [41]:
from peft import PeftModel

adapter_model_name = 'data/qwen-7b-finance'
model_id = 'Qwen/Qwen2.5-7B'

base_model = AutoModelForCausalLM.from_pretrained(model_id)
sft_model = PeftModel.from_pretrained(base_model, adapter_model_name)



In [42]:
sft_model = sft_model.merge_and_unload()
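
Conceptually, merging folds the low-rank update into the base weight so that no adapter is needed at inference time: `W_merged = W + (alpha / r) * B @ A`. The toy sketch below shows this on small lists; the helper names are hypothetical, the standard `alpha / r` scaling is assumed, and the real PEFT implementation operates on tensors layer by layer.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Return W + (alpha / rank) * B @ A."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]

# toy example: 2x2 weight with a rank-1 adapter
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]      # rank x in_features
B = [[1.0], [3.0]]    # out_features x rank
# alpha=0.5, rank=1 gives the same scale (0.5) as alpha=128, r=256 in the notebook
merged = merge_lora(W, A, B, alpha=0.5, rank=1)
print(merged)  # [[1.5, 1.0], [1.5, 4.0]]
```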

In [43]:
def get_answer(inputs, tokenizer, model, gen_params=None):
    # avoid a mutable default argument; generate up to 256 new tokens by default
    if gen_params is None:
        gen_params = {'max_new_tokens': 256}

    inputs_tokenized = tokenizer(inputs, return_tensors='pt').to(model.device)

    generated_ids = model.generate(**inputs_tokenized, **gen_params)

    # drop the prompt tokens so that only the newly generated part is decoded
    generated_ids = [
        output_ids[len(input_ids):]
        for input_ids, output_ids in zip(inputs_tokenized.input_ids, generated_ids)
    ]

    return tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
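
For decoder-only models, `generate` returns the prompt tokens followed by the continuation, which is why `get_answer` slices each output by the length of its input. The same slicing on plain lists, with made-up token ids and a hypothetical helper name:

```python
def strip_prompt(input_ids_batch, output_ids_batch):
    """Keep only the tokens generated after each prompt."""
    return [out[len(inp):] for inp, out in zip(input_ids_batch, output_ids_batch)]

prompt_ids = [[101, 7, 8]]            # made-up ids for the prompt
full_ids = [[101, 7, 8, 42, 43, 2]]   # prompt followed by the continuation
print(strip_prompt(prompt_ids, full_ids))  # [[42, 43, 2]]
```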

In [51]:
print(get_answer('What is Index?', tokenizer, base_model))

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


 Index is a list of all the words in a document or a collection of documents. It is used to quickly locate the information that is required. Indexing is the process of creating an index. Indexing is done by a computer program called an indexer. Indexing is done on a collection of documents. Indexing is done on a collection of documents. Indexing is done on a collection of documents. Indexing is done on a collection of documents. Indexing is done on a collection of documents. Indexing is done on a collection of documents. Indexing is done on a collection of documents. Indexing is done on a collection of documents. Indexing is done on a collection of documents. Indexing is done on a collection of documents. Indexing is done on a collection of documents. Indexing is done on a collection of documents. Indexing is done on a collection of documents. Indexing is done on a collection of documents. Indexing is done on a collection of documents. Indexing is done on a collection of documents. Ind

In [44]:
print(get_answer('What is Index?', tokenizer, sft_model))

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


 Index is a statistical measure of the changes in the prices of a selected group of goods or services over time, which is used to track the overall level of prices in an economy. It is typically expressed as a percentage change from a base period, and is used to measure inflation, deflation, and economic growth. Indexes can be constructed for a variety of purposes, including tracking the performance of individual sectors or industries, measuring the impact of changes in government policy, and comparing the economic performance of different countries or regions.


A couple more tests

In [81]:
print(get_answer('What is a Portfolio?', tokenizer, base_model, gen_params={'max_new_tokens': 64}))

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


 A portfolio is a collection of financial assets such as stocks, bonds, mutual funds, and other securities. It is a way for investors to diversify their investments and manage risk by spreading their money across different types of assets. The goal of a portfolio is to achieve a balance between risk and return, and to maximize long


In [79]:
print(get_answer('What is a Portfolio?', tokenizer, sft_model, gen_params={'max_new_tokens': 64}))

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


 A portfolio is a collection of financial assets such as stocks, bonds, mutual funds, and cash equivalents. It is a way for investors to diversify their investments and manage risk. The goal of a portfolio is to achieve a balance between risk and return, and to meet the investor's financial goals.


In [93]:
print(get_answer('What is a Sector?', tokenizer, base_model, gen_params={'max_new_tokens': 64}))

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


 - Definition & Formula

An error occurred trying to load this video.

Try refreshing the page, or contact customer support.

Coming up next: What is a Circle? - Definition, Area & Properties

You're on a roll. Keep up the good work!

Take Quiz Watch Next Lesson
Your next lesson will play


In [94]:
print(get_answer('What is a Sector?', tokenizer, sft_model, gen_params={'max_new_tokens': 64}))

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


 A sector is a portion of a circle. It is bounded by two radii and an arc. The area of a sector is proportional to the area of the circle. The area of a sector is given by the formula: A = (θ/360)πr^2, where θ is the


In [114]:
print(get_answer('What are Options?', tokenizer, base_model, gen_params={'max_new_tokens': 128}))

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


 Options are contracts that give the holder the right, but not the obligation, to buy or sell an underlying asset at a specified price (the strike price) on or before a certain date (the expiration date). There are two main types of options: calls and puts.

- A call option gives the holder the right to buy the underlying asset at the strike price.
- A put option gives the holder the right to sell the underlying asset at the strike price.

Options can be used for various purposes, such as hedging against potential losses, speculating on price movements, or generating income through option writing. They are traded on exchanges and have


In [115]:
print(get_answer('What are Options?', tokenizer, sft_model, gen_params={'max_new_tokens': 128}))

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


 Options are contracts that give the holder the right, but not the obligation, to buy or sell an underlying asset at a specified price (the strike price) on or before a certain date (the expiration date). There are two main types of options: calls and puts.

- A call option gives the holder the right to buy the underlying asset at the strike price.
- A put option gives the holder the right to sell the underlying asset at the strike price.

Options can be used for various purposes, such as hedging against potential losses, speculating on price movements, or generating income through option writing. They are traded on exchanges and have


What if we had simply improved the system prompt and asked the model to answer within the finance domain?

The results might have been even better. Try it!
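
One way to try this is to build the inference prompt in the same chat format that was used for training, with a domain-specific system message and an open assistant turn. The sketch below does this in plain Python; the helper name and the system prompt text are hypothetical. With the real tokenizer, `tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)` should produce an equivalent string, which can then be passed to `get_answer`.

```python
def build_inference_prompt(system_prompt, question):
    """Render a system and user turn and open the assistant turn."""
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": question},
    ]
    rendered = "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )
    # open the assistant turn so that the model continues with its answer
    return rendered + "<|im_start|>assistant\n"

# hypothetical system prompt, for illustration only
FINANCE_SYSTEM_PROMPT = "You are a financial analyst. Answer in terms of stock markets."

prompt = build_inference_prompt(FINANCE_SYSTEM_PROMPT, "What is a Sector?")
print(prompt)
```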