# [s12] Language Models for Code


In recent years, transformers and large language models (LLMs) have revolutionized many areas of natural language processing. One of the most promising applications is **code generation**. Models trained on large collections of source code can now complete code, solve algorithmic problems, document programs automatically, and even act as developer assistants.



## [1] CausalLM



In this notebook we walk through training a **causal language model** from scratch on a small example.

The full tutorial is available on the [Hugging Face course page](https://huggingface.co/learn/llm-course/chapter7/6).

An example code-generation model can be found on the [CodeParrot DS page](https://huggingface.co/huggingface-course/codeparrot-ds?text=plt.imshow%28).
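
As a rough illustration of the causal objective (a sketch, not the course's code): every prefix of a token sequence is used to predict the token that follows it. The helper name `next_token_pairs` and the token ids are invented for this example.

```python
def next_token_pairs(token_ids):
    """Return (context, target) pairs: each prefix predicts the next token."""
    return [(token_ids[:i], token_ids[i]) for i in range(1, len(token_ids))]


# Hypothetical token ids standing in for a tokenized code snippet
pairs = next_token_pairs([10, 11, 12, 13])
for context, target in pairs:
    print(context, "->", target)
```

A real model computes all of these predictions in one forward pass by masking attention to future positions, rather than building the pairs explicitly.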



### [1.1] Code generation with a pipeline

In [1]:
import torch
from transformers import pipeline

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
pipe = pipeline(
    "text-generation", model="huggingface-course/codeparrot-ds", device=device
)

Device set to use cuda


In [2]:
# the generated result is random on every run and often not very good
txt = """\
# create some data
x = np.random.randn(100)
y = np.random.randn(100)

# create scatter plot with x, y
"""
print(pipe(txt, num_return_sequences=1)[0]["generated_text"])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


# create some data
x = np.random.randn(100)
y = np.random.randn(100)

# create scatter plot with x, y
# this uses the min and max y (in y) values


In [3]:
# the generated result is random on every run and often not very good
txt = """\
# create some data
x = np.random.randn(100)
y = np.random.randn(100)

# create dataframe from x and y
"""
print(pipe(txt, num_return_sequences=1)[0]["generated_text"])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


# create some data
x = np.random.randn(100)
y = np.random.randn(100)

# create dataframe from x and y
x2 = np.random.randn(100)
y2


In [4]:
# the generated result is random on every run and often not very good
txt = """\
# dataframe with profession, income and name
df = pd.DataFrame({'profession': x, 'income':y, 'name': z})

# calculate the mean income per profession
"""
print(pipe(txt, num_return_sequences=1)[0]["generated_text"])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


# dataframe with profession, income and name
df = pd.DataFrame({'profession': x, 'income':y, 'name': z})

# calculate the mean income per profession
df_mean = df


In [5]:
# the generated result is random on every run and often not very good
txt = """
# import random forest regressor from scikit-learn
from sklearn.ensemble import RandomForestRegressor

# fit random forest model with 300 estimators on X, y:
"""
print(pipe(txt, num_return_sequences=1)[0]["generated_text"])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.



# import random forest regressor from scikit-learn
from sklearn.ensemble import RandomForestRegressor

# fit random forest model with 300 estimators on X, y:
est = make_estimators(X.shape[0], max_features


### [1.2] Finetune

### Installing the libraries

To run this notebook, install the **Transformers**, **Datasets**, and **Evaluate** libraries.


In [6]:
# %pip install -q datasets evaluate transformers[sentencepiece]
# %pip install -q accelerate
# To run the training on TPU, you will need to uncomment the following line:
# !pip install cloud-tpu-client==0.10 torch==1.9.0 https://storage.googleapis.com/tpu-pytorch/wheels/torch_xla-1.9-cp37-cp37m-linux_x86_64.whl
# !apt install git-lfs


You need to configure **Git** by setting your **email** and user **name** in the next cell.

This is required to work with repositories correctly and to commit changes under your name.

In [7]:
# !git config --global user.email "you@example.com"
# !git config --global user.name "Your Name"

You also need to log in to the Hugging Face Hub. Run the following cell and enter your credentials.

In [8]:
from huggingface_hub import notebook_login

notebook_login()


In [9]:
from huggingface_hub import whoami

whoami()

{'type': 'user',
 'id': '652ce6600ab8936887cc45c0',
 'name': 'lyutovad',
 'fullname': 'Daria Lyutova',
 'email': 'gromovad@mail.ru',
 'emailVerified': True,
 'canPay': False,
 'periodEnd': None,
 'isPro': False,
 'avatarUrl': 'https://cdn-avatars.huggingface.co/v1/production/uploads/noauth/OHH9ydOztT-uSHlKc_wBa.jpeg',
 'orgs': [{'type': 'org',
   'id': '6676d66e552363255b04301f',
   'name': 'CDA-RFTA',
   'fullname': 'Center for Data Analysis, Russian Foreign Trade Academy',
   'email': 'cda@vavt.ru',
   'canPay': False,
   'periodEnd': None,
   'avatarUrl': 'https://cdn-avatars.huggingface.co/v1/production/uploads/652ce6600ab8936887cc45c0/pgwk3YkE2g0R6fU_nDaeo.png',
   'roleInOrg': 'admin',
   'isEnterprise': False}],
 'auth': {'type': 'access_token',
  'accessToken': {'displayName': 'new',
   'role': 'write',
   'createdAt': '2025-04-28T19:46:21.674Z'}}}

In [10]:
# def any_keyword_in_string(string, keywords):
#     for keyword in keywords:
#         if keyword in string:
#             return True
#     return False

In [11]:
# filters = ["pandas", "sklearn", "matplotlib", "seaborn"]
# example_1 = "import numpy as np"
# example_2 = "import pandas as pd"

# print(
#     any_keyword_in_string(example_1, filters), any_keyword_in_string(example_2, filters)
# )

In [12]:
# from collections import defaultdict
# from tqdm import tqdm
# from datasets import Dataset


# def filter_streaming_dataset(dataset, filters):
#     filtered_dict = defaultdict(list)
#     total = 0
#     for sample in tqdm(iter(dataset)):
#         total += 1
#         if any_keyword_in_string(sample["content"], filters):
#             for k, v in sample.items():
#                 filtered_dict[k].append(v)
#     print(f"{len(filtered_dict['content'])/total:.2%} of data after filtering.")
#     return Dataset.from_dict(filtered_dict)

In [13]:
# # # This cell will take a very long time to execute, so you should skip it and go to
# # # the next one!
# from datasets import load_dataset

# split = "train"  # "valid"
# filters = ["pandas", "sklearn", "matplotlib", "seaborn"]

# data = load_dataset(f"transformersbook/codeparrot-{split}", split=split, streaming=True)
# filtered_data = filter_streaming_dataset(data, filters)

In [14]:
from datasets import load_dataset, DatasetDict

ds_train = load_dataset("huggingface-course/codeparrot-ds-train", split="train")
ds_valid = load_dataset("huggingface-course/codeparrot-ds-valid", split="validation")

raw_datasets = DatasetDict(
    {
        "train": ds_train,  # .shuffle().select(range(50000)),
        "valid": ds_valid,  # .shuffle().select(range(500))
    }
)

raw_datasets

DatasetDict({
    train: Dataset({
        features: ['repo_name', 'path', 'copies', 'size', 'content', 'license'],
        num_rows: 606720
    })
    valid: Dataset({
        features: ['repo_name', 'path', 'copies', 'size', 'content', 'license'],
        num_rows: 3322
    })
})

In [15]:
for key in raw_datasets["train"][0]:
    print(f"{key.upper()}: {raw_datasets['train'][0][key][:200]}")

REPO_NAME: kmike/scikit-learn
PATH: sklearn/utils/__init__.py
COPIES: 3
SIZE: 10094
CONTENT: """
The :mod:`sklearn.utils` module includes various utilites.
"""

from collections import Sequence

import numpy as np
from scipy.sparse import issparse

from .murmurhash import murm
LICENSE: bsd-3-clause


In [16]:
from collections import Counter

license_counter = Counter(example["license"] for example in raw_datasets["train"])
print(license_counter)

Counter({'bsd-3-clause': 268686, 'mit': 126918, 'apache-2.0': 69051, 'gpl-3.0': 66611, 'gpl-2.0': 30624, 'agpl-3.0': 14172, 'bsd-2-clause': 10681, 'lgpl-3.0': 6621, 'lgpl-2.1': 4681, 'unlicense': 3492, 'cc0-1.0': 2435, 'mpl-2.0': 954, 'isc': 865, 'artistic-2.0': 550, 'epl-1.0': 379})


In [17]:
repo_counter = Counter(example["repo_name"] for example in raw_datasets["train"])
print(repo_counter.most_common(10))

[('mbayon/TFG-MachineLearning', 1375), ('mne-tools/mne-tools.github.io', 1270), ('ryfeus/lambda-packs', 1265), ('RPGOne/Skynet', 1098), ('datapythonista/pandas', 1092), ('lthurlow/Network-Grapher', 1048), ('gfyoung/pandas', 1036), ('jreback/pandas', 1000), ('rs2/pandas', 923), ('yavalvas/yav_com', 904)]


In [18]:
from transformers import AutoTokenizer

context_length = 128
tokenizer = AutoTokenizer.from_pretrained("huggingface-course/code-search-net-tokenizer")

outputs = tokenizer(
    raw_datasets["train"][:2]["content"],
    truncation=True,
    max_length=context_length,
    return_overflowing_tokens=True,
    return_length=True,
)

print(f"Input IDs length: {len(outputs['input_ids'])}")
print(f"Input chunk lengths: {(outputs['length'])}")
print(f"Chunk mapping: {outputs['overflow_to_sample_mapping']}")

Input IDs length: 34
Input chunk lengths: [128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 117, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 41]
Chunk mapping: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]


In [19]:
def tokenize(element):
    outputs = tokenizer(
        element["content"],
        truncation=True,
        max_length=context_length,
        return_overflowing_tokens=True,
        return_length=True,
    )
    input_batch = []
    for length, input_ids in zip(outputs["length"], outputs["input_ids"]):
        if length == context_length:
            input_batch.append(input_ids)
    return {"input_ids": input_batch}


tokenized_datasets = raw_datasets.map(
    tokenize, batched=True, remove_columns=raw_datasets["train"].column_names
)
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['input_ids'],
        num_rows: 16702061
    })
    valid: Dataset({
        features: ['input_ids'],
        num_rows: 93164
    })
})
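
The chunking done by `tokenize` can be pictured with plain lists. This is a simplified sketch, assuming the tokenizer adds no special tokens and no stride is used, so overflowing tokens form consecutive windows; the helper names are hypothetical.

```python
def chunk_token_ids(token_ids, context_length):
    """Split ids into consecutive windows; only the last one can be shorter."""
    return [
        token_ids[start : start + context_length]
        for start in range(0, len(token_ids), context_length)
    ]


def keep_full_chunks(chunks, context_length):
    """Mirror the `length == context_length` filter used in `tokenize`."""
    return [chunk for chunk in chunks if len(chunk) == context_length]


# Toy document of 10 "tokens" with a context length of 4
chunks = chunk_token_ids(list(range(10)), context_length=4)
full_chunks = keep_full_chunks(chunks, context_length=4)
print([len(c) for c in chunks])       # [4, 4, 2]
print([len(c) for c in full_chunks])  # [4, 4]
```

Dropping the short tail chunk loses a little data per file but means every training example has exactly `context_length` tokens, so no padding is needed.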

In [20]:
from transformers import AutoTokenizer, GPT2LMHeadModel, AutoConfig

config = AutoConfig.from_pretrained(
    "gpt2",
    vocab_size=len(tokenizer),
    n_ctx=context_length,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
)
# config.loss_type = "ForCausalLMLoss"

In [21]:
model = GPT2LMHeadModel(config)
model_size = sum(t.numel() for t in model.parameters())
print(f"GPT-2 size: {model_size/1000**2:.1f}M parameters")

GPT-2 size: 124.2M parameters


In [22]:
from transformers import DataCollatorForLanguageModeling

tokenizer.pad_token = tokenizer.eos_token
data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

In [23]:
out = data_collator([tokenized_datasets["train"][i] for i in range(5)])
for key in out:
    print(f"{key} shape: {out[key].shape}")

input_ids shape: torch.Size([5, 128])
attention_mask shape: torch.Size([5, 128])
labels shape: torch.Size([5, 128])
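
With `mlm=False` the collator builds `labels` as a copy of `input_ids` in which padding positions are replaced by `-100`, the index that PyTorch's cross-entropy ignores. The sketch below imitates that step on plain lists; the function name and ids are made up for illustration.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss


def causal_lm_labels(input_ids, pad_token_id):
    """Copy input ids to labels, masking padding positions."""
    return [IGNORE_INDEX if tok == pad_token_id else tok for tok in input_ids]


# Hypothetical ids; 0 plays the role of the pad token
labels = causal_lm_labels([5, 8, 13, 0, 0], pad_token_id=0)
print(labels)  # [5, 8, 13, -100, -100]
```

The one-position shift between inputs and targets happens inside the model, not in the collator. Because the pad token is set to the EOS token in this notebook, EOS tokens inside a sequence would be masked in the labels as well.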


In [24]:
small_train_dataset = tokenized_datasets["train"].select(range(5000))
small_valid_dataset = tokenized_datasets["valid"].select(range(500))

In [25]:
from transformers import Trainer, TrainingArguments

args = TrainingArguments(
    output_dir="codeparrot-ds",
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    evaluation_strategy="steps",
    eval_steps=5_000,
    logging_steps=5_000,
    gradient_accumulation_steps=8,
    num_train_epochs=5,
    # num_train_epochs=0.1,
    weight_decay=0.1,
    warmup_steps=1_000,
    lr_scheduler_type="cosine",
    learning_rate=5e-4,
    save_steps=5_000,
    fp16=True,
    # push_to_hub=True,
)

trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    args=args,
    data_collator=data_collator,
    # train_dataset=tokenized_datasets["train"],
    # eval_dataset=tokenized_datasets["valid"],
    train_dataset=small_train_dataset,
    eval_dataset=small_valid_dataset,
)

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

In [26]:
trainer.train()

wandb: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
wandb: Currently logged in as: lyutovad (lyutova) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin


`loss_type=None` was set in the config but it is unrecognised.Using the default loss: `ForCausalLMLoss`.


Step,Training Loss,Validation Loss


TrainOutput(global_step=95, training_loss=8.462875848067434, metrics={'train_runtime': 1299.4863, 'train_samples_per_second': 19.238, 'train_steps_per_second': 0.073, 'total_flos': 1557300510720000.0, 'train_loss': 8.462875848067434, 'epoch': 4.764331210191083})
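
The step count in this output can be reproduced with a little arithmetic. The sketch assumes a single device and that the number of optimizer updates per epoch is the number of batches floor-divided by the accumulation factor, which agrees with the recorded `global_step=95`.

```python
import math

num_samples = 5_000            # size of small_train_dataset
per_device_batch_size = 32
gradient_accumulation_steps = 8
num_train_epochs = 5
warmup_steps = 1_000
logging_steps = 5_000
peak_lr = 5e-4

batches_per_epoch = math.ceil(num_samples / per_device_batch_size)     # 157
updates_per_epoch = batches_per_epoch // gradient_accumulation_steps   # 19
total_updates = num_train_epochs * updates_per_epoch                   # 95

# Warmup is linear, so this is the learning rate reached by the final update
lr_at_end = peak_lr * total_updates / warmup_steps

print(total_updates, total_updates < logging_steps, lr_at_end)
```

Two things follow for this small run. Since 95 updates is far below `logging_steps` and `eval_steps`, no row is ever written to the loss table. And since warmup spans 1,000 updates, the learning rate only reaches roughly a tenth of its peak, which is one plausible reason the training loss stays high.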

In [27]:
trainer.push_to_hub()


CommitInfo(commit_url='https://huggingface.co/lyutovad/codeparrot-ds/commit/4f9e4dc3ce42c48ee0278a6772424b01c5d307e6', commit_message='End of training', commit_description='', oid='4f9e4dc3ce42c48ee0278a6772424b01c5d307e6', pr_url=None, repo_url=RepoUrl('https://huggingface.co/lyutovad/codeparrot-ds', endpoint='https://huggingface.co', repo_type='model', repo_id='lyutovad/codeparrot-ds'), pr_revision=None, pr_num=None)

In [28]:
import torch
from transformers import pipeline

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
pipe = pipeline("text-generation", model="lyutovad/codeparrot-ds", device=device)

Device set to use cuda


In [29]:
txt = """\
# create some data
x = np.random.randn(100)
y = np.random.randn(100)

# create scatter plot with x, y
"""
print(pipe(txt, num_return_sequences=1)[0]["generated_text"])

# create some data
x = np.random.randn(100)
y = np.random.randn(100)

# create scatter plot with x, y
#
from_equal(X,..index.1


In [30]:
txt = """\
# create some data
x = np.random.randn(100)
y = np.random.randn(100)

# create dataframe from x and y
"""
print(pipe(txt, num_return_sequences=1)[0]["generated_text"])

# create some data
x = np.random.randn(100)
y = np.random.randn(100)

# create dataframe from x and y
#
#

#
#
#

from_


In [31]:
txt = """
# import random forest regressor from scikit-learn
from sklearn.ensemble import RandomForestRegressor

# fit random forest model with 300 estimators on X, y:
"""
print(pipe(txt, num_return_sequences=1)[0]["generated_text"])


# import random forest regressor from scikit-learn
from sklearn.ensemble import RandomForestRegressor

# fit random forest model with 300 estimators on X, y:
#
import_n_]


#
import, y


### [1.3] (optional) Training with 🤗 Accelerate

In [32]:
keytoken_ids = []
for keyword in [
    "plt",
    "pd",
    "sk",
    "fit",
    "predict",
    " plt",
    " pd",
    " sk",
    " fit",
    " predict",
    "testtest",
]:
    ids = tokenizer([keyword]).input_ids[0]
    if len(ids) == 1:
        keytoken_ids.append(ids[0])
    else:
        print(f"Keyword is not a single token: {keyword}")

Keyword is not a single token: testtest


In [33]:
from torch.nn import CrossEntropyLoss
import torch


def keytoken_weighted_loss(inputs, logits, keytoken_ids, alpha=1.0):
    # Shift so that tokens < n predict n
    shift_labels = inputs[..., 1:].contiguous()
    shift_logits = logits[..., :-1, :].contiguous()
    # Calculate per-token loss
    loss_fct = CrossEntropyLoss(reduction="none")
    loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))
    # Resize and average loss per sample
    loss_per_sample = loss.view(shift_logits.size(0), shift_logits.size(1)).mean(axis=1)
    # Calculate and scale weighting
    weights = torch.stack([(inputs == kt).float() for kt in keytoken_ids]).sum(
        axis=[0, 2]
    )
    weights = alpha * (1.0 + weights)
    # Calculate weighted average
    weighted_loss = (loss_per_sample * weights).mean()
    return weighted_loss
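
The weighting inside `keytoken_weighted_loss` comes down to counting keyword tokens per sample. Below is a list-based sketch of that step alone, assuming the ids in `keytoken_ids` are unique; the ids themselves are hypothetical.

```python
def keytoken_weights(batch_ids, keytoken_ids, alpha=1.0):
    """Per-sample weight: alpha * (1 + number of keyword-token occurrences)."""
    keys = set(keytoken_ids)
    return [alpha * (1.0 + sum(tok in keys for tok in sample)) for sample in batch_ids]


# Hypothetical ids: 7 and 9 stand for keyword tokens such as "plt" or "pd"
weights = keytoken_weights(
    [[1, 7, 7, 3], [2, 4, 6, 8], [9, 7, 1, 1]], keytoken_ids=[7, 9]
)
print(weights)  # [3.0, 1.0, 3.0]
```

Samples without any keyword keep weight `alpha`, so they still contribute to the loss; samples that use the targeted libraries simply count for more.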

In [34]:
from torch.utils.data.dataloader import DataLoader

tokenized_datasets.set_format("torch")
train_dataloader = DataLoader(tokenized_datasets["train"], batch_size=32, shuffle=True)
eval_dataloader = DataLoader(tokenized_datasets["valid"], batch_size=32)

In [35]:
weight_decay = 0.1


def get_grouped_params(model, no_decay=["bias", "LayerNorm.weight"]):
    params_with_wd, params_without_wd = [], []
    for n, p in model.named_parameters():
        if any(nd in n for nd in no_decay):
            params_without_wd.append(p)
        else:
            params_with_wd.append(p)
    return [
        {"params": params_with_wd, "weight_decay": weight_decay},
        {"params": params_without_wd, "weight_decay": 0.0},
    ]
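
The grouping relies on substring matching against parameter names, so the patterns have to fit the model's naming scheme. In GPT-2 the layer-norm weights are named `ln_1`, `ln_2`, and `ln_f`, not `LayerNorm`, so the default pattern only excludes biases. The sketch below shows the effect on a few names; the helper and the alternative pattern list are illustrative assumptions, not part of the course code.

```python
def split_by_weight_decay(param_names, no_decay):
    """Return (decayed, not_decayed) names using substring matching."""
    decayed, not_decayed = [], []
    for name in param_names:
        target = not_decayed if any(nd in name for nd in no_decay) else decayed
        target.append(name)
    return decayed, not_decayed


# Parameter names in the style of GPT2LMHeadModel.named_parameters()
names = [
    "transformer.h.0.attn.c_attn.weight",
    "transformer.h.0.attn.c_attn.bias",
    "transformer.h.0.ln_1.weight",
    "transformer.ln_f.weight",
]

decayed, not_decayed = split_by_weight_decay(names, ["bias", "LayerNorm.weight"])
print(decayed)  # layer-norm weights still receive weight decay

gpt2_patterns = ["bias", "ln_1.weight", "ln_2.weight", "ln_f.weight"]
decayed_gpt2, _ = split_by_weight_decay(names, gpt2_patterns)
print(decayed_gpt2)  # only the attention projection weight remains
```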

In [49]:
import math


def evaluate():
    model.eval()
    total_loss = 0.0
    total_batches = 0

    for batch in eval_dataloader:
        with torch.no_grad():
            batch = {
                k: v.to(model.device)
                for k, v in batch.items()
                if isinstance(v, torch.Tensor)
            }
            outputs = model(batch["input_ids"], labels=batch["input_ids"])
            loss = outputs.loss

        # Gather the loss from all devices; the 0-dim loss is reshaped to 1-D
        # so the gathered tensor always has an axis to count along
        gathered_loss = accelerator.gather(loss.reshape(1))
        total_loss += gathered_loss.sum().item()
        total_batches += gathered_loss.shape[0]

    avg_loss = total_loss / total_batches
    try:
        # math.exp raises OverflowError for very large losses
        perplexity = math.exp(avg_loss)
    except OverflowError:
        perplexity = float("inf")

    return avg_loss, perplexity
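
Perplexity is the exponential of the mean per-token cross-entropy. A model that spreads probability uniformly over a vocabulary of size V has perplexity V, which gives a useful reference point. The standalone sketch below checks this with made-up numbers; the vocabulary size is hypothetical.

```python
import math


def perplexity(token_losses):
    """exp of the mean per-token cross-entropy (natural log)."""
    mean_loss = sum(token_losses) / len(token_losses)
    try:
        return math.exp(mean_loss)
    except OverflowError:  # math.exp raises for very large inputs
        return float("inf")


uniform_vocab = 50_000  # hypothetical vocabulary size
ppl = perplexity([math.log(uniform_vocab)] * 4)
print(round(ppl))  # 50000
```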

In [37]:
model = GPT2LMHeadModel(config)

In [38]:
from torch.optim import AdamW

optimizer = AdamW(get_grouped_params(model), lr=5e-4)

In [39]:
from accelerate import Accelerator

accelerator = Accelerator(mixed_precision="fp16")

model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader, eval_dataloader
)

In [40]:
from transformers import get_scheduler

num_train_epochs = 1
num_update_steps_per_epoch = len(train_dataloader)
num_training_steps = num_train_epochs * num_update_steps_per_epoch

lr_scheduler = get_scheduler(
    name="linear",
    optimizer=optimizer,
    num_warmup_steps=1_000,
    num_training_steps=num_training_steps,
)
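
The `"linear"` schedule scales the base learning rate up linearly during warmup and then down linearly to zero. Here is a small sketch of that multiplier; the total of 10,000 steps is a made-up value chosen for readability.

```python
def linear_schedule_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given scheduler step."""
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


base_lr = 5e-4
for step in (0, 500, 1_000, 5_500, 10_000):
    factor = linear_schedule_factor(
        step, num_warmup_steps=1_000, num_training_steps=10_000
    )
    print(step, base_lr * factor)
```

The schedule is indexed by calls to `lr_scheduler.step()`. If the scheduler advances only once per accumulated update while `num_training_steps` counts batches, the decay phase covers only a fraction of its range.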

In [41]:
from huggingface_hub import Repository, get_full_repo_name

model_name = "codeparrot-ds-accelerate"
repo_name = get_full_repo_name(model_name)
repo_name

'lyutovad/codeparrot-ds-accelerate'

In [42]:
# from huggingface_hub import create_repo

# repo_name = "codeparrot-ds-accelerate"
# create_repo(repo_name, exist_ok=True)

In [43]:
output_dir = "codeparrot-ds-accelerate"
repo = Repository(output_dir, clone_from=repo_name)

For more details, please read https://huggingface.co/docs/huggingface_hub/concepts/git_vs_http.

In [50]:
evaluate()

ValueError: Unable to avoid copy while creating an array as requested.
If using `np.array(obj, copy=False)` replace it with `np.asarray(obj)` to allow a copy when needed (no behavior change in NumPy 1.x).
For more details, see https://numpy.org/devdocs/numpy_2_0_migration_guide.html#adapting-to-changes-in-the-copy-keyword.

In [None]:
from tqdm.notebook import tqdm

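gradient_accumulation_steps = 8
eval_steps = 5_000
# Samples consumed per dataloader step across all processes (32 = DataLoader batch_size)
samples_per_step = accelerator.state.num_processes * 32


def get_lr():
    # Current learning rate of the first parameter group
    return optimizer.param_groups[0]["lr"]
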

model.train()
completed_steps = 0
for epoch in range(num_train_epochs):
    for step, batch in tqdm(
        enumerate(train_dataloader, start=1), total=num_training_steps
    ):
        logits = model(batch["input_ids"]).logits
        loss = keytoken_weighted_loss(batch["input_ids"], logits, keytoken_ids)
        if step % 100 == 0:
            accelerator.print(
                {
                    "lr": get_lr(),
                    "samples": step * samples_per_step,
                    "steps": completed_steps,
                    "loss/train": loss.item(),
                }
            )
        loss = loss / gradient_accumulation_steps
        accelerator.backward(loss)
        if step % gradient_accumulation_steps == 0:
            accelerator.clip_grad_norm_(model.parameters(), 1.0)
            optimizer.step()
            lr_scheduler.step()
            optimizer.zero_grad()
            completed_steps += 1
        if (step % (eval_steps * gradient_accumulation_steps)) == 0:
            eval_loss, perplexity = evaluate()
            accelerator.print({"loss/eval": eval_loss, "perplexity": perplexity})
            model.train()
            accelerator.wait_for_everyone()
            unwrapped_model = accelerator.unwrap_model(model)
            unwrapped_model.save_pretrained(output_dir, save_function=accelerator.save)
            if accelerator.is_main_process:
                tokenizer.save_pretrained(output_dir)
                repo.push_to_hub(
                    commit_message=f"Training in progress step {step}", blocking=False
                )

  0%|          | 0/521940 [00:00<?, ?it/s]

ValueError: Unable to avoid copy while creating an array as requested.
If using `np.array(obj, copy=False)` replace it with `np.asarray(obj)` to allow a copy when needed (no behavior change in NumPy 1.x).
For more details, see https://numpy.org/devdocs/numpy_2_0_migration_guide.html#adapting-to-changes-in-the-copy-keyword.

## [2] MaskedLM
- [unixcoder-base](https://huggingface.co/microsoft/unixcoder-base)

**UniXcoder** is a unified model from Microsoft for working with source code and text.  
It can:
- Encode code and text into vectors,
- Generate code,
- Perform code search and classification tasks.


## Two operating modes of UniXcoder

| Mode | What it does | Use cases |
|:---|:---|:---|
| **Encoder-only Mode** | Encodes code or text into a vector representation | Code search, classification, feature extraction |
| **Autoregressive Mode** | Generates a continuation of code or text | Function completion, code generation, autocompletion |


## How the UniXcoder modes work

| Encoder-only Mode | Autoregressive Mode |
|:------------------|:--------------------|
| **Input**: code or text | **Input**: the beginning of code or text |
| Pass through the **encoder (Transformer)** | Pass through the **same Transformer with a causal attention mask** |
| ➔ Obtain a **vector representation** (embedding) | ➔ **Generate a continuation** of text or code token by token |
| Use the embedding for **search**, **classification**, **feature extraction** | Use the continuation for **autocompletion** or **function generation** |

## Example tasks

| Encoder-only Mode | Autoregressive Mode |
|:---|:---|
| Code search by description ("find an array sorting function") | Generating array sorting code |
| Code classification (e.g. language identification) | Completing a partially written function |
| Building code embeddings for other models | Automatically completing code comments |
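
One way to picture the difference between the two modes is the attention mask each one uses: bidirectional for encoding, lower triangular for left-to-right generation. The sketch below builds both masks for a three-token input. It illustrates the general idea only and is not UniXcoder's implementation.

```python
def attention_mask(seq_len, causal):
    """mask[i][j] == 1 means position i may attend to position j."""
    return [
        [1 if (not causal or j <= i) else 0 for j in range(seq_len)]
        for i in range(seq_len)
    ]


encoder_mask = attention_mask(3, causal=False)  # bidirectional: all ones
decoder_mask = attention_mask(3, causal=True)   # lower triangular

for row in decoder_mask:
    print(row)
```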



### [2.1] Model loading

In [None]:
!wget https://raw.githubusercontent.com/microsoft/CodeBERT/master/UniXcoder/unixcoder.py



--2025-04-28 23:49:41--  https://raw.githubusercontent.com/microsoft/CodeBERT/master/UniXcoder/unixcoder.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.108.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10352 (10K) [text/plain]
Saving to: ‘unixcoder.py’


2025-04-28 23:49:41 (2.67 MB/s) - ‘unixcoder.py’ saved [10352/10352]



In [None]:
import torch
from unixcoder import UniXcoder

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = UniXcoder("microsoft/unixcoder-base")
model.to(device)


UniXcoder(
  (model): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(51416, 768, padding_idx=1)
      (position_embeddings): Embedding(1026, 768, padding_idx=1)
      (token_type_embeddings): Embedding(10, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0-11): 12 x RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): Layer


### [2.2] Encoder-only Mode

In [None]:
# Encode maximum function
func = "def f(a,b): if a>b: return a else return b"
tokens_ids = model.tokenize([func],max_length=512,mode="<encoder-only>")
source_ids = torch.tensor(tokens_ids).to(device)
tokens_embeddings,max_func_embedding = model(source_ids)

# Encode minimum function
func = "def f(a,b): if a<b: return a else return b"
tokens_ids = model.tokenize([func],max_length=512,mode="<encoder-only>")
source_ids = torch.tensor(tokens_ids).to(device)
tokens_embeddings,min_func_embedding = model(source_ids)

# Encode NL
nl = "return maximum value"
tokens_ids = model.tokenize([nl],max_length=512,mode="<encoder-only>")
source_ids = torch.tensor(tokens_ids).to(device)
tokens_embeddings,nl_embedding = model(source_ids)


In [None]:
# Normalize embedding
norm_max_func_embedding = torch.nn.functional.normalize(max_func_embedding, p=2, dim=1)
norm_min_func_embedding = torch.nn.functional.normalize(min_func_embedding, p=2, dim=1)
norm_nl_embedding = torch.nn.functional.normalize(nl_embedding, p=2, dim=1)

max_func_nl_similarity = torch.einsum("ac,bc->ab",norm_max_func_embedding,norm_nl_embedding)
min_func_nl_similarity = torch.einsum("ac,bc->ab",norm_min_func_embedding,norm_nl_embedding)

print(max_func_nl_similarity)
print(min_func_nl_similarity)

tensor([[0.3002]], device='cuda:0', grad_fn=<ViewBackward0>)
tensor([[0.1881]], device='cuda:0', grad_fn=<ViewBackward0>)
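
The `normalize` plus `einsum` combination above computes cosine similarity between embeddings. A dependency-free sketch of the same quantity, using made-up three-dimensional vectors instead of model outputs:

```python
import math


def cosine_similarity(u, v):
    """Dot product of the two vectors after L2 normalisation."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical 3-dimensional embeddings
query = [1.0, 0.0, 1.0]
close = [2.0, 0.0, 2.0]  # same direction as the query, different length
far = [0.0, 3.0, 0.0]    # orthogonal to the query
print(cosine_similarity(query, close))
print(cosine_similarity(query, far))
```

Because the vectors are normalised, only direction matters. That is why the maximum function scores higher than the minimum function against "return maximum value", even though the two code strings differ by a single character.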


### [2.3] Decoder-only Mode

In [None]:
context = """
def f(data,file_path):
    # write json data into file_path in python language
"""
tokens_ids = model.tokenize([context],max_length=512,mode="<decoder-only>")
source_ids = torch.tensor(tokens_ids).to(device)
prediction_ids = model.generate(source_ids, decoder_only=True, beam_size=3, max_length=128)
predictions = model.decode(prediction_ids)
print(context+predictions[0][0])


def f(data,file_path):
    # write json data into file_path in python language
    data = json.dumps(data)
    with open(file_path, 'w') as f:
        f.write(data)


### [2.4] Encoder-Decoder Mode

In [None]:
# Function Name Prediction
context = """
def <mask0>(data,file_path):
    data = json.dumps(data)
    with open(file_path, 'w') as f:
        f.write(data)
"""
tokens_ids = model.tokenize([context],max_length=512,mode="<encoder-decoder>")
source_ids = torch.tensor(tokens_ids).to(device)
prediction_ids = model.generate(source_ids, decoder_only=False, beam_size=3, max_length=128)
predictions = model.decode(prediction_ids)
print([x.replace("<mask0>","").strip() for x in predictions[0]])

['write', 'write_json', 'write_file']


In [None]:
# API Recommendation
context = """
def write_json(data,file_path):
    data = <mask0>(data)
    with open(file_path, 'w') as f:
        f.write(data)
"""
tokens_ids = model.tokenize([context],max_length=512,mode="<encoder-decoder>")
source_ids = torch.tensor(tokens_ids).to(device)
prediction_ids = model.generate(source_ids, decoder_only=False, beam_size=3, max_length=128)
predictions = model.decode(prediction_ids)
print([x.replace("<mask0>","").strip() for x in predictions[0]])

['json.dumps', 'json.loads', 'json_encode']


In [None]:
# Code Summarization
context = """
# Write <mask0>
def write_json(data,file_path):
    data = json.dumps(data)
    with open(file_path, 'w') as f:
        f.write(data)
"""
tokens_ids = model.tokenize([context],max_length=512,mode="<encoder-decoder>")
source_ids = torch.tensor(tokens_ids).to(device)
prediction_ids = model.generate(source_ids, decoder_only=False, beam_size=3, max_length=128)
predictions = model.decode(prediction_ids)
print([x.replace("<mask0>","").strip() for x in predictions[0]])


['JSON to file', 'json to file', 'JSON file']


## [3] LLM
- [Qwen2.5-Coder-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Coder-0.5B-Instruct)

### [3.1] Model loading

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-Coder-0.5B-Instruct"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

Sliding Window Attention is enabled but not implemented for `sdpa`; unexpected results may be encountered.

### [3.2] Prompt Mode

In [None]:
prompt = "write a quick sort algorithm without recursion."
messages = [
    {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)

Sure! Here's a simple implementation of a quick sort algorithm without using recursion:

```python
def quick_sort(arr):
    if len(arr) <= 1:
        return arr
    else:
        pivot = arr[len(arr) // 2]
        left = [x for x in arr if x < pivot]
        middle = [x for x in arr if x == pivot]
        right = [x for x in arr if x > pivot]
        return quick_sort(left) + middle + quick_sort(right)

# Example usage:
arr = [3, 6, 8, 10, 1, 2, 5, 7]
sorted_arr = quick_sort(arr)
print(sorted_arr)
```

### Explanation:
- The function `quick_sort` takes an array `arr` as input.
- If the array has 0 or 1 element, it returns the array as is.
- Otherwise, it selects a pivot element from the array (the middle element in this case).
- It then divides the array into three parts: elements less than the pivot, elements equal to the pivot, and elements greater than the pivot.
- The function recursively sorts the left and right parts of the array and concatenates the sorted left and right parts w
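
The response mixes prose with a fenced code block, so the code has to be separated before it can be run. A minimal sketch of such a step is below; `extract_code_blocks` is a hypothetical helper, and it assumes the model wraps code in Markdown fences tagged with the language name, as in the output above.

```python
import re

FENCE = "`" * 3  # built programmatically so this example does not nest literal fences


def extract_code_blocks(response: str, lang: str = "python") -> list[str]:
    """Return the contents of all fenced blocks tagged with `lang`."""
    pattern = re.compile(
        rf"{FENCE}{re.escape(lang)}[ \t]*\n(.*?){FENCE}",
        re.DOTALL,  # let `.` match newlines inside the block
    )
    return [block.strip() for block in pattern.findall(response)]


demo = f"Sure!\n{FENCE}python\nprint('hi')\n{FENCE}\nDone."
print(extract_code_blocks(demo))
```

A block whose closing fence was cut off by `max_new_tokens` will not match, so an empty result may simply mean that the response was truncated.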

In [None]:
def quick_sort(arr):
    if len(arr) <= 1:
        return arr
    else:
        pivot = arr[len(arr) // 2]
        left = [x for x in arr if x < pivot]
        middle = [x for x in arr if x == pivot]
        right = [x for x in arr if x > pivot]
        return quick_sort(left) + middle + quick_sort(right)

# Example usage:
arr = [-5, 3, 6, 8, 10, 1, 2, 1, 7, 15]
sorted_arr = quick_sort(arr)
print(sorted_arr)

[-5, 1, 1, 2, 3, 6, 7, 8, 10, 15]
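
The generated function sorts correctly, but it still calls itself, so it does not satisfy the "without recursion" part of the prompt. For comparison, here is a minimal sketch of a non-recursive variant that keeps pending sub-ranges on an explicit stack. It was written by hand for illustration, not produced by the model, and the Lomuto partition scheme is one possible choice among several.

```python
import random


def quick_sort_iterative(arr):
    """Sort a copy of `arr` with quick sort, using an explicit stack instead of recursion."""
    items = list(arr)
    stack = [(0, len(items) - 1)]  # pending (low, high) ranges, both inclusive
    while stack:
        low, high = stack.pop()
        if low >= high:
            continue  # ranges with 0 or 1 element are already sorted
        # Lomuto partition: the last element of the range is the pivot
        pivot = items[high]
        i = low
        for j in range(low, high):
            if items[j] < pivot:
                items[i], items[j] = items[j], items[i]
                i += 1
        items[i], items[high] = items[high], items[i]
        # Instead of two recursive calls, push both halves onto the stack
        stack.append((low, i - 1))
        stack.append((i + 1, high))
    return items


# Compare against the built-in sort on random inputs
for _ in range(100):
    data = [random.randint(-50, 50) for _ in range(random.randint(0, 20))]
    assert quick_sort_iterative(data) == sorted(data)

print(quick_sort_iterative([-5, 3, 6, 8, 10, 1, 2, 1, 7, 15]))
```

The same randomized comparison against `sorted` can be used to check any implementation the model generates.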


In [None]:
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512,
    num_return_sequences=3,  # generate 3 different responses at once
    do_sample=True,  # enable stochastic generation (sampling)
    top_p=0.95,  # nucleus sampling for diversity
    temperature=0.7  # controls how random ("creative") the output is
)
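
To make the roles of `temperature` and `top_p` more concrete, the sketch below applies both to a toy list of logits in pure Python. It only illustrates the idea; the actual implementation in `transformers` works on tensors and differs in details, and the function name `top_p_distribution` is hypothetical.

```python
import math


def top_p_distribution(logits, temperature=0.7, top_p=0.95):
    """Return {token_index: probability} after temperature scaling and nucleus filtering."""
    # Temperature < 1 sharpens the distribution, > 1 flattens it
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract the max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the smallest set of most probable tokens whose cumulative mass reaches top_p
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Renormalize over the kept tokens; sampling then draws from this distribution
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


print(top_p_distribution([5.0, 1.0, 0.0], top_p=0.9))
print(top_p_distribution([1.0, 0.9, 0.8], top_p=0.9))
```

When one token dominates, the nucleus shrinks to that single token; when the logits are close, several tokens stay in play, which is where the variety between the three returned sequences comes from.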

In [None]:
generated_ids = [
    output_ids[len(input_ids) :]
    for input_ids, output_ids in zip(
        model_inputs.input_ids.repeat_interleave(3, dim=0), generated_ids
    )
]

responses = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)

for i, resp in enumerate(responses):
    print(f"Variant {i+1}:\n{resp}\n{'-'*50}")

Variant 1:
Sure! Here's a Python implementation of the quick sort algorithm without using recursion:

```python
def quick_sort(arr):
    if len(arr) <= 1:
        return arr
    else:
        pivot = arr[len(arr) // 2]
        left = [x for x in arr if x < pivot]
        middle = [x for x in arr if x == pivot]
        right = [x for x in arr if x > pivot]
        return quick_sort(left) + middle + quick_sort(right)

# Example usage:
arr = [3, 6, 8, 10, 1, 2, 4, 7]
sorted_arr = quick_sort(arr)
print(sorted_arr)  # Output: [1, 2, 3, 4, 5, 6, 7, 8]
```

### Explanation:
- **Base Case**: If the length of the array is less than or equal to 1, it is already sorted, so we return the array as is.
- **Pivot Selection**: We choose the middle element of the array as the pivot.
- **Partitioning**: We create three lists: `left`, `middle`, and `right`. One list contains elements less than the pivot, one contains elements equal to the pivot, and the other contains elements greater than the pivot.
- *

In [None]:
new_prompt = 'Write a Python implementation of quick sort without recursion, by manually using a stack to simulate recursion.'

messages = [
    {
        "role": "system",
        "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant.",
    },
    {"role": "user", "content": new_prompt},  # use the refined prompt defined above
]
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512,
    num_return_sequences=3,
    do_sample=True,
    top_p=0.95,
    temperature=0.7
)

generated_ids = [
    output_ids[len(input_ids) :]
    for input_ids, output_ids in zip(
        model_inputs.input_ids.repeat_interleave(3, dim=0), generated_ids
    )
]

responses = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)

for i, resp in enumerate(responses):
    print(f"Variant {i+1}:\n{resp}\n{'-'*50}")

Variant 1:
Quick Sort is a simple yet efficient sorting algorithm. It works by selecting a pivot element from the array and partitioning the other elements into two sub-arrays, one with elements less than the pivot and another with elements greater than the pivot. The pivot is chosen to be the middle element of the array.

Here's a Python implementation of Quick Sort without using recursion:

```python
def quick_sort(arr):
    if len(arr) <= 1:
        return arr
    
    # Choose a pivot element
    pivot = arr[len(arr) // 2]
    
    # Partition the array into two sub-arrays
    less_than_pivot = [x for x in arr if x < pivot]
    greater_than_pivot = [x for x in arr if x > pivot]
    
    # Recursively sort the two sub-arrays
    return quick_sort(less_than_pivot) + [pivot] + quick_sort(greater_than_pivot)

# Example usage:
arr = [64, 34, 25, 12, 22, 11, 90]
sorted_arr = quick_sort(arr)
print("Sorted array:", sorted_arr)
```

### Explanation:
- **Base Case**: If the array has 0 or 1 