# Task 1

LLMs for text generation
Goal: Get familiar with using language models for text generation.

Task description:
Write code that generates text with a pretrained language model from Hugging Face, such as GPT-2 or another suitable model. Generate text on a given topic, for example "Weather forecast", "Fitness tips", or "A story from the future".

Key steps:

1. Install and configure the Hugging Face Transformers library.
2. Load a pretrained model (for example, GPT-2).
3. Write a function that generates text from a text prefix.
4. Test the model with different parameters (for example, text length, temperature, top-k).
5. Compare the results for different starting prefixes.

The baseline is GPT-2, but it is better to try Llama or Mistral in a size you can run in Colab (most likely no larger than 8B).

In [27]:
class TextGenerator:
    def __init__(self, model, tokenizer, device='cpu'):
        # Keep the model on the same device as its inputs
        self.model = model.to(device)
        self.model.eval()
        self.tokenizer = tokenizer
        self.device = device

    def generate(
        self,
        prompt,
        max_length=100,
        temperature=1.0,
        top_k=50,
        top_p=0.95,
        num_return_sequences=1
    ):
        # Tokenize the prompt; this also produces the attention mask
        inputs = self.tokenizer(prompt, return_tensors='pt').to(self.device)
        outputs = self.model.generate(
            **inputs,
            max_length=max_length,  # total length in tokens, prompt included
            temperature=temperature,
            top_k=top_k,
            top_p=top_p,
            num_return_sequences=num_return_sequences,
            do_sample=True,
            # GPT-2 has no pad token, so reuse the end-of-sequence token
            pad_token_id=self.tokenizer.eos_token_id
        )

        # Decode every returned sequence back into text
        return [
            self.tokenizer.decode(
                output,
                skip_special_tokens=True
            ).strip() for output in outputs
        ]

| **Parameter**           | **Description**                                                        | **Example**                       |
|-------------------------|------------------------------------------------------------------------|-----------------------------------|
| `prompt`                | Starting text for generation.                                          | A string (for example, `"Hello AI"`) |
| `max_length`            | Maximum total length in tokens, prompt included.                       | `50–200`                          |
| `temperature`           | Controls how random the choice of the next token is.                   | `0.1–1.5`                         |
| `top_k`                 | Restricts sampling to the **k most probable tokens**.                  | `50` (for diversity)              |
| `top_p`                 | Restricts sampling to the smallest set of tokens whose **cumulative probability** reaches `top_p`. | `0.9` (nucleus sampling) |
| `num_return_sequences`  | Number of texts generated per `prompt`.                                | `1–5`                             |
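
To make `top_k` and `top_p` more concrete, here is a minimal sketch of both filters applied to a toy next-token distribution. The names `normalize`, `top_k_top_p_filter`, and `toy_probs` are hypothetical and exist only for this illustration; the real filtering in Transformers works on logits tensors and may differ in detail.

```python
def normalize(pairs):
    """Rescale (token, prob) pairs so the probabilities sum to 1."""
    total = sum(p for _, p in pairs)
    return [(token, p / total) for token, p in pairs]


def top_k_top_p_filter(probs, top_k=0, top_p=1.0):
    """Return a renormalized {token: prob} dict after top-k and nucleus filtering."""
    # Sort tokens from most to least probable.
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)

    # top-k: keep only the k most probable tokens (0 disables the filter).
    if top_k > 0:
        ranked = ranked[:top_k]
    ranked = normalize(ranked)

    # top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break

    return dict(normalize(kept))


# Toy distribution over the next token (made-up numbers).
toy_probs = {"sunny": 0.5, "rainy": 0.25, "cloudy": 0.125, "windy": 0.075, "purple": 0.05}

print(top_k_top_p_filter(toy_probs, top_k=3))    # only the three most probable tokens survive
print(top_k_top_p_filter(toy_probs, top_p=0.75)) # only the tokens needed to cover 75% survive
```

In both calls the unlikely tail ("windy", "purple") is cut before sampling, which is why these filters tend to reduce incoherent continuations.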

In [28]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch

text_generator = TextGenerator(
    model=GPT2LMHeadModel.from_pretrained('gpt2'), 
    tokenizer=GPT2Tokenizer.from_pretrained('gpt2'),
    device=torch.device('cuda' if torch.cuda.is_available() else 'cpu')
)

In [38]:
prompts = [
    "The weather today",
    "ChatGPT is",
    "The world in 2050"
]

In [39]:
from colorama import Fore, Style

def generate_and_print_texts(
        prompts,
        max_length=100, 
        temperature=1.0, 
        top_k=50, 
        top_p=0.95, 
        num_return_sequences=1
    ):
    for prompt in prompts:
        print(Fore.CYAN + Style.BRIGHT + f"\n--- Prompt ---" + Style.RESET_ALL)
        print(Fore.YELLOW + f"{prompt}" + Style.RESET_ALL)

        texts = text_generator.generate(
            prompt=prompt,
            max_length=max_length,
            temperature=temperature,
            top_k=top_k,
            top_p=top_p,
            num_return_sequences=num_return_sequences
        )
        
        print(Fore.CYAN + Style.BRIGHT + "\nGenerated Texts:" + Style.RESET_ALL)
        for idx, text in enumerate(texts, 1):
            print(Fore.GREEN + f"[{idx}] " + Style.RESET_ALL + f"{text}")

In [40]:
generate_and_print_texts(prompts)


--- Prompt ---
The weather today

Generated Texts:
[1] The weather today could be a little warmer than normal.

"If you were going back around, it would be about 7 degrees F in the middle of day tomorrow, but it will be slightly lower then yesterday because the wind is going to be colder," said Dr Chris Waddell of West Yorkshire's University of Wrexham.

He added: "People are thinking 'What the heck is going on?'"

Earlier today it was reported a train leaving Glasgow was halted because passengers

--- Prompt ---
ChatGPT is

Generated Texts:
[1] ChatGPT is in the works at Mozilla with a plan to put together its own website. In order to create that, Mozilla is in the process of developing its own "Firefox Extension Service."

--- Prompt ---
The world in 2050

Generated Texts:
[1] The world in 2050, around 300% more people live in countries with the highest

In [41]:
# Raise the temperature
generate_and_print_texts(prompts=prompts, temperature=2.0, num_return_sequences=2)


--- Prompt ---
The weather today

Generated Texts:
[1] The weather today wasn't too much nicer this afternoon so don't be shy – be sure!

The local bike path along to and along on one of two bike paths has the nice chance to become 'Wendoc'. To walk over both trails start at St Peter to find their starting point. Walking over St Peter also makes great use of 'Bicycycle Lanes (LMB). I didn't spend much time in Cattlin for this ride today and decided to use it for
[2] The weather today turned quite bad today for some of our crew here at Base Brest, where in winter we'd try to hold off rain so weather isn't bad too, though that probably wont hurt much either here but it has put a strain on some cool conditions. The morning and end of each day's runs will end there as expected!
 All weekend this past evening this is so pretty and my whole life has seemed normal and perfect as you walk through our gardens. It can always be

--- 

In [42]:
# Now try a low temperature
generate_and_print_texts(prompts=prompts, temperature=0.2, num_return_sequences=3)


--- Prompt ---
The weather today

Generated Texts:
[1] The weather today is very good, but I'm not sure if I'll be able to get to the airport tomorrow. I'm not sure if I'll be able to get to the airport tomorrow.

I'm not sure if I'll be able to get to the airport tomorrow. I'm not sure if I'll be able to get to the airport tomorrow.

I'm not sure if I'll be able to get to the airport tomorrow.

I'm not sure if I
[2] The weather today was very good. The weather was good. The weather was good. The weather was good. The weather was good. The weather was good. The weather was good. The weather was good. The weather was good. The weather was good. The weather was good. The weather was good. The weather was good. The weather was good. The weather was good. The weather was good. The weather was good. The weather was good. The weather was good. The weather was
[3] The weather today was very good. I was able to get a good view of t

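As the outputs show, a higher temperature makes the model produce more varied text, although at `temperature=2.0` the text becomes noticeably less coherent. At `temperature=0.2` the model falls into repetitive loops instead.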
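
The effect of temperature can be illustrated without a model: the logits are divided by the temperature before the softmax, so low values sharpen the distribution and high values flatten it. Below is a minimal sketch on made-up logits; `softmax_with_temperature` and `toy_logits` are hypothetical names used only for this illustration.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be strictly positive")
    scaled = [x / temperature for x in logits]
    # Subtract the maximum for numerical stability; the result is unchanged.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits for three candidate tokens.
toy_logits = [2.0, 1.0, 0.0]

for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(toy_logits, t)
    print(f"temperature={t}:", [round(p, 3) for p in probs])
```

At `0.2` almost all probability mass sits on the top token, which is consistent with the repetitive outputs above; at `2.0` the three tokens are much closer to each other, so unlikely continuations get sampled more often.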

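# Summary:

Based on these runs, several parameter combinations are worth singling out:

1. **Strict, deterministic output (greedy decoding, reproducible but prone to repetition)**:
```python
do_sample=False  # temperature has no effect without sampling; when sampling, it must be strictly positive
```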

2. **Creative and varied text:**
```python
do_sample=True, temperature=0.8, top_p=0.9
```

3. **A balance between quality and diversity:**
```python
do_sample=True, temperature=0.7, top_k=50
```

# Task 2

Text classification with BERT
Goal: Understand how to use pretrained BERT-type models for text classification.

Task description:
Implement a text classification model based on a pretrained BERT model. Use the IMDb Reviews dataset (https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews) to classify reviews as positive or negative, evaluate the model, and draw conclusions. You may either fine-tune the model or use it zero-shot; what matters is reaching the required quality.

Bonus points (instead of Task 2): choose a Russian-language model and try to classify a request by whether a photo should be sent, for example "скинь фото" ("send a photo") = 1, "как дела?" ("how are you?") = 0. This is a zero-shot task, so you need to find a suitable model for Russian. Add example calls to the notebook.

In [1]:
import torch
import pandas as pd
from torch.utils.data import DataLoader, Dataset
from transformers import BertTokenizer, BertForSequenceClassification
from transformers import get_scheduler
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from torch.optim import AdamW

In [2]:
data = pd.read_csv('IMDB Dataset.csv')
data.head()

|   | review | sentiment |
|---|--------|-----------|
| 0 | One of the other reviewers has mentioned that ... | positive |
| 1 | A wonderful little production. <br /><br />The... | positive |
| 2 | I thought this was a wonderful way to spend ti... | positive |
| 3 | Basically there's a family where a little boy ... | negative |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive |


In [3]:
# Convert the labels to integers
label_encoder = LabelEncoder()
data["sentiment"] = label_encoder.fit_transform(data["sentiment"])  # negative -> 0, positive -> 1 (alphabetical order)

In [4]:
# Split into training and test sets
train_texts, test_texts, train_labels, test_labels = train_test_split(
    data["review"].tolist(), data["sentiment"].tolist(), test_size=0.2, random_state=42
)

In [5]:
# Hyperparameters
MODEL_NAME = "bert-base-uncased"
BATCH_SIZE = 16
EPOCHS = 3
LEARNING_RATE = 5e-5

In [6]:
# Initialize the tokenizer and the model
tokenizer = BertTokenizer.from_pretrained(MODEL_NAME)
model = BertForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [7]:
# Custom Dataset class
class IMDBDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_length=512):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = self.texts[idx]
        label = self.labels[idx]
        encoding = self.tokenizer(
            text,
            truncation=True,
            padding="max_length",
            max_length=self.max_length,
            return_tensors="pt",
        )
        return {
            "input_ids": encoding["input_ids"].squeeze(0),
            "attention_mask": encoding["attention_mask"].squeeze(0),
            "labels": torch.tensor(label, dtype=torch.long),
        }

In [8]:
# Build the datasets that the DataLoaders will wrap
train_dataset = IMDBDataset(train_texts, train_labels, tokenizer)
test_dataset = IMDBDataset(test_texts, test_labels, tokenizer)

In [9]:
train_dataloader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
test_dataloader = DataLoader(test_dataset, batch_size=BATCH_SIZE)

In [10]:
# Optimizer and learning-rate scheduler
optimizer = AdamW(model.parameters(), lr=LEARNING_RATE)
num_training_steps = EPOCHS * len(train_dataloader)
scheduler = get_scheduler("linear", optimizer=optimizer, num_warmup_steps=0, num_training_steps=num_training_steps)
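
To show what the `"linear"` schedule with `num_warmup_steps=0` does to the learning rate, here is a minimal pure-Python sketch of the same shape: a linear warmup (skipped here) followed by a linear decay to zero. The function `linear_schedule_lr` is a hypothetical stand-in written for illustration, not the library's implementation, and the step count assumes the 2500 batches per epoch seen in the training log.

```python
def linear_schedule_lr(step, base_lr, num_training_steps, num_warmup_steps=0):
    """Learning rate at a given step: linear warmup, then linear decay to zero."""
    if step < num_warmup_steps:
        # Warmup phase: grow linearly from 0 to base_lr.
        return base_lr * step / max(1, num_warmup_steps)
    # Decay phase: shrink linearly from base_lr to 0 at the final step.
    remaining = num_training_steps - step
    decay_span = max(1, num_training_steps - num_warmup_steps)
    return base_lr * max(0.0, remaining / decay_span)


# Assumed values: 3 epochs of 2500 batches each, base learning rate 5e-5.
TOTAL_STEPS = 3 * 2500

for step in (0, 2500, 5000, 7500):
    print(step, linear_schedule_lr(step, 5e-5, TOTAL_STEPS))
```

With no warmup the very first update already uses the full `5e-5`, and each epoch ends at a lower rate than it started with.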

In [11]:
# Move the model to the GPU if one is available
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12,

In [12]:
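# Freeze the embeddings and the first 6 encoder layers;
# only the upper 6 layers, the pooler, and the classifier head are updated
for param in model.bert.embeddings.parameters():
    param.requires_grad = False
for layer in model.bert.encoder.layer[:6]:
    for param in layer.parameters():
        param.requires_grad = False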

In [13]:
torch.cuda.empty_cache()

In [14]:
from tqdm import tqdm

# Train the model
model.train()
for epoch in range(EPOCHS):
    total_loss = 0.0
    progress_bar = tqdm(train_dataloader, desc=f"Epoch {epoch + 1}/{EPOCHS}")
    for step, batch in enumerate(progress_bar, start=1):
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        total_loss += loss.item()
        loss.backward()
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
        # Running mean over the batches seen so far, not over the whole epoch
        progress_bar.set_postfix({"Loss": total_loss / step})
    print(f"Epoch {epoch + 1}, Loss: {total_loss / len(train_dataloader):.4f}")

Epoch 1/3: 100%|██████████| 2500/2500 [11:55<00:00,  3.50it/s, Loss=0.218]
Epoch 1, Loss: 0.2177
Epoch 2/3: 100%|██████████| 2500/2500 [11:56<00:00,  3.49it/s, Loss=0.12]
Epoch 2, Loss: 0.1201
Epoch 3/3: 100%|██████████| 2500/2500 [11:56<00:00,  3.49it/s, Loss=0.0535]
Epoch 3, Loss: 0.0535

In [15]:
# Evaluate the model on the test set
model.eval()
correct = 0
with torch.no_grad():
    for batch in test_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        predictions = torch.argmax(outputs.logits, dim=-1)
        # Count the correctly classified reviews in this batch
        correct += (predictions == batch["labels"]).sum().item()

accuracy = correct / len(test_dataset)
print(f"Test Accuracy: {accuracy:.4f}")

Test Accuracy: 0.9437
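
Accuracy alone does not show whether the errors fall mostly on positive or on negative reviews. A possible complement, assuming the predictions and labels from the evaluation loop are collected into plain Python lists, is to compute precision, recall, and F1 from the confusion counts. The function `binary_classification_report` and the toy lists below are hypothetical and are not results from this notebook.

```python
def binary_classification_report(labels, predictions, positive=1):
    """Compute accuracy, precision, recall, and F1 for a binary task."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == positive and p == positive)
    fp = sum(1 for y, p in pairs if y != positive and p == positive)
    fn = sum(1 for y, p in pairs if y == positive and p != positive)
    tn = sum(1 for y, p in pairs if y != positive and p != positive)

    # Guard against empty denominators.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy data for illustration only (1 = positive, 0 = negative).
toy_labels = [1, 1, 1, 0, 0, 0, 1, 0]
toy_preds = [1, 1, 0, 0, 0, 1, 1, 0]

print(binary_classification_report(toy_labels, toy_preds))
```

To use it here, one would extend the two lists with `batch["labels"].tolist()` and `predictions.tolist()` inside the evaluation loop.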


In [16]:
# Save the fine-tuned model
model.save_pretrained("./bert_imdb")
tokenizer.save_pretrained("./bert_imdb")

('./bert_imdb\\tokenizer_config.json',
 './bert_imdb\\special_tokens_map.json',
 './bert_imdb\\vocab.txt',
 './bert_imdb\\added_tokens.json')

In [17]:
# Use the fine-tuned model on arbitrary text
def predict_sentiment(text, model, tokenizer, device):
    model.eval()
    # A single text needs no padding; truncate to BERT's 512-token limit
    encoding = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    encoding = {k: v.to(device) for k, v in encoding.items()}
    with torch.no_grad():
        outputs = model(**encoding)
        prediction = torch.argmax(outputs.logits, dim=-1).item()
    # Label ids follow LabelEncoder's alphabetical order: 0 = negative, 1 = positive
    sentiment = "positive" if prediction == 1 else "negative"
    return sentiment

In [18]:
# Example call on a positive review
example_text = "This movie was absolutely fantastic! The plot was engaging and the characters were well-developed."
loaded_model = BertForSequenceClassification.from_pretrained("./bert_imdb").to(device)
loaded_tokenizer = BertTokenizer.from_pretrained("./bert_imdb")
result = predict_sentiment(example_text, loaded_model, loaded_tokenizer, device)
print(f"Predicted Sentiment: {result}")

Predicted Sentiment: positive


In [19]:
# Example call on a negative review (reuses loaded_model and loaded_tokenizer from the previous cell)
example_text = "I had high expectations, but this film was a letdown. The pacing was uneven, and the plot felt overly complicated and dragged out."
result = predict_sentiment(example_text, loaded_model, loaded_tokenizer, device)
print(f"Predicted Sentiment: {result}")

Predicted Sentiment: negative
