# Task 1

LLM for text generation

Goal: get familiar with using language models to generate text.

Task description:
Write code that generates text with a pretrained language model from Hugging Face, for example GPT-2 or another suitable model. Generate text on a given topic, for example "Weather forecast", "Fitness tips", or "A story from the future".

Key steps:

1. Install and configure the Hugging Face Transformers library.
2. Load a pretrained model (for example, GPT-2).
3. Write a function that generates text from a text prefix.
4. Test the model with different parameters (for example, text length, temperature, top-k).
5. Compare the results for different starting prefixes.

GPT-2 is the baseline, but it is worth trying LLaMA or Mistral in a size that runs in Colab (most likely no larger than 8B parameters).
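
The parameter check in the key steps can be organized as a small grid of settings. This is a minimal sketch: `build_param_grid` is a hypothetical helper (not part of Transformers), and the values are illustrative starting points rather than recommended settings.

```python
from itertools import product


def build_param_grid(max_lengths, temperatures, top_ks):
    """Return one dict of generation settings per combination of values."""
    return [
        {"max_length": max_length, "temperature": temperature, "top_k": top_k}
        for max_length, temperature, top_k in product(max_lengths, temperatures, top_ks)
    ]


# Illustrative values only; adjust them to the model and the time budget.
grid = build_param_grid(
    max_lengths=[50, 100],
    temperatures=[0.2, 1.0, 2.0],
    top_ks=[10, 50],
)

# Show the first few combinations.
for params in grid[:3]:
    print(params)
```

Each dict can be unpacked with `**params` into whatever generation function is written for this task, so that every prompt is run against every combination.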

In [27]:
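class TextGenerator:
    """Thin wrapper around a causal language model and its tokenizer."""

    def __init__(self, model, tokenizer, device='cpu'):
        # Move the model to the same device as the inputs; otherwise
        # generation fails on GPU with a device-mismatch error.
        self.model = model.to(device)
        self.tokenizer = tokenizer
        self.device = device

    def generate(
        self,
        prompt,
        max_length=100,
        temperature=1.0,
        top_k=50,
        top_p=0.95,
        num_return_sequences=1
    ):
        # Tokenize the prompt and place the tensors on the model's device.
        inputs = self.tokenizer(prompt, return_tensors='pt').to(self.device)

        outputs = self.model.generate(
            **inputs,  # input_ids and attention_mask
            max_length=max_length,  # total length in tokens, prompt included
            temperature=temperature,
            top_k=top_k,
            top_p=top_p,
            num_return_sequences=num_return_sequences,
            do_sample=True,
            # GPT-2 has no pad token, so reuse the end-of-sequence token.
            pad_token_id=self.tokenizer.eos_token_id
        )

        return [
            self.tokenizer.decode(output, skip_special_tokens=True).strip()
            for output in outputs
        ]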

| **Parameter**           | **Description**                                                                                   | **Typical values**                   |
|-------------------------|---------------------------------------------------------------------------------------------------|--------------------------------------|
| `prompt`                | Initial text that the model continues.                                                            | A string (for example, `"Hello AI"`) |
| `max_length`            | Maximum total length in tokens, including the prompt.                                             | `50–200`                             |
| `temperature`           | Controls how random the choice of the next token is.                                              | `0.1–1.5`                            |
| `top_k`                 | Restricts sampling to the **k most probable tokens**.                                             | `50` (for diversity)                 |
| `top_p`                 | Restricts sampling to the smallest set of tokens whose **cumulative probability** reaches `top_p`. | `0.9` (nucleus sampling)             |
| `num_return_sequences`  | Number of texts generated per `prompt`.                                                           | `1–5`                                |
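
To make the effect of `temperature`, `top_k`, and `top_p` concrete, here is a conceptual sketch in plain Python on a made-up four-token vocabulary. It illustrates the ideas only and is not the Transformers implementation; the function names and the logits are assumptions chosen for the example.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities; temperature < 1 sharpens, > 1 flattens."""
    scaled = [logit / temperature for logit in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize their probabilities."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    kept_mass = sum(probs[i] for i in keep)
    return {i: probs[i] / kept_mass for i in keep}


def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    kept_mass = sum(probs[i] for i in keep)
    return {i: probs[i] / kept_mass for i in keep}


logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for a four-token vocabulary

# The same logits become sharper or flatter depending on the temperature.
for temperature in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(prob, 3) for prob in probs])

# Filters map token index -> renormalized probability of the surviving tokens.
probs = softmax_with_temperature(logits)
print(top_k_filter(probs, k=2))
print(top_p_filter(probs, p=0.9))
```

Sampling then draws the next token only from the tokens that survive the filters, which is why small `top_k` or `top_p` values make the output more conservative.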

In [28]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch

text_generator = TextGenerator(
    model=GPT2LMHeadModel.from_pretrained('gpt2'), 
    tokenizer=GPT2Tokenizer.from_pretrained('gpt2'),
    device=torch.device('cuda' if torch.cuda.is_available() else 'cpu')
)

In [38]:
prompts = [
    "The weather today",
    "ChatGPT is",
    "The world in 2050"
]

In [39]:
from colorama import Fore, Style

def generate_and_print_texts(
        prompts,
        max_length=100, 
        temperature=1.0, 
        top_k=50, 
        top_p=0.95, 
        num_return_sequences=1
    ):
    for prompt in prompts:
        print(Fore.CYAN + Style.BRIGHT + "\n--- Prompt ---" + Style.RESET_ALL)
        print(Fore.YELLOW + prompt + Style.RESET_ALL)

        texts = text_generator.generate(
            prompt=prompt,
            max_length=max_length,
            temperature=temperature,
            top_k=top_k,
            top_p=top_p,
            num_return_sequences=num_return_sequences
        )
        
        print(Fore.CYAN + Style.BRIGHT + "\nGenerated Texts:" + Style.RESET_ALL)
        for idx, text in enumerate(texts, 1):
            print(Fore.GREEN + f"[{idx}] " + Style.RESET_ALL + f"{text}")

In [40]:
generate_and_print_texts(prompts)


[36m[1m
--- Prompt ---[0m
[33mThe weather today[0m
[36m[1m
Generated Texts:[0m
[32m[1] [0mThe weather today could be a little warmer than normal.

"If you were going back around, it would be about 7 degrees F in the middle of day tomorrow, but it will be slightly lower then yesterday because the wind is going to be colder," said Dr Chris Waddell of West Yorkshire's University of Wrexham.

He added: "People are thinking 'What the heck is going on?'"

Earlier today it was reported a train leaving Glasgow was halted because passengers
[36m[1m
--- Prompt ---[0m
[33mChatGPT is[0m
[36m[1m
Generated Texts:[0m
[32m[1] [0mChatGPT is in the works at Mozilla with a plan to put together its own website. In order to create that, Mozilla is in the process of developing its own "Firefox Extension Service."
[36m[1m
--- Prompt ---[0m
[33mThe world in 2050[0m
[36m[1m
Generated Texts:[0m
[32m[1] [0mThe world in 2050, around 300% more people live in countries with the highest

In [41]:
# Raise the temperature
generate_and_print_texts(prompts=prompts, temperature=2.0, num_return_sequences=2)

[36m[1m
--- Prompt ---[0m
[33mThe weather today[0m
[36m[1m
Generated Texts:[0m
[32m[1] [0mThe weather today wasn't too much nicer this afternoon so don't be shy – be sure!

The local bike path along to and along on one of two bike paths has the nice chance to become 'Wendoc'. To walk over both trails start at St Peter to find their starting point. Walking over St Peter also makes great use of 'Bicycycle Lanes (LMB). I didn't spend much time in Cattlin for this ride today and decided to use it for
[32m[2] [0mThe weather today turned quite bad today for some of our crew here at Base Brest, where in winter we'd try to hold off rain so weather isn't bad too, though that probably wont hurt much either here but it has put a strain on some cool conditions. The morning and end of each day's runs will end there as expected!
 All weekend this past evening this is so pretty and my whole life has seemed normal and perfect as you walk through our gardens. It can always be
[36m[1m
--- 

In [42]:
# Now try a low temperature
generate_and_print_texts(prompts=prompts, temperature=0.2, num_return_sequences=3)

[36m[1m
--- Prompt ---[0m
[33mThe weather today[0m
[36m[1m
Generated Texts:[0m
[32m[1] [0mThe weather today is very good, but I'm not sure if I'll be able to get to the airport tomorrow. I'm not sure if I'll be able to get to the airport tomorrow.

I'm not sure if I'll be able to get to the airport tomorrow. I'm not sure if I'll be able to get to the airport tomorrow.

I'm not sure if I'll be able to get to the airport tomorrow.

I'm not sure if I
[32m[2] [0mThe weather today was very good. The weather was good. The weather was good. The weather was good. The weather was good. The weather was good. The weather was good. The weather was good. The weather was good. The weather was good. The weather was good. The weather was good. The weather was good. The weather was good. The weather was good. The weather was good. The weather was good. The weather was good. The weather was good. The weather was
[32m[3] [0mThe weather today was very good. I was able to get a good view of t

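As the outputs show, a higher temperature makes the model produce more varied text, although at 2.0 the text starts to lose coherence. At a low temperature (0.2), GPT-2 quickly falls into repetitive loops.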
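
The difference in repetitiveness can also be put into a number. The sketch below computes the share of unique word bigrams in a text, a simple diversity measure often called distinct-n. The helper `distinct_n` is written for this illustration, and the two strings are short excerpts from the outputs above.

```python
def distinct_n(text, n=2):
    """Share of unique word n-grams in text: 1.0 means no n-gram repeats."""
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0  # text is shorter than n words
    return len(set(ngrams)) / len(ngrams)


# Short excerpts from the outputs above (temperature 0.2 and 2.0).
low_temperature = "The weather was good. The weather was good. The weather was good."
high_temperature = (
    "The weather today turned quite bad today for some of our crew here at Base Brest"
)

print(round(distinct_n(low_temperature), 2))
print(round(distinct_n(high_temperature), 2))
```

A lower score means more repetition, so the looping low-temperature sample scores well below the high-temperature one. The measure says nothing about coherence, which still has to be judged by reading the text.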

# Summary

Several useful parameter combinations stand out:

1. **Strict, deterministic output (greedy decoding):**
```python
do_sample=False  # temperature is not used when sampling is disabled
```

2. **Creative and varied text:**
```python
do_sample=True, temperature=0.8, top_p=0.9
```

3. **Balance between quality and diversity:**
```python
do_sample=True, temperature=0.7, top_k=50
```
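
These combinations can be kept as named presets so that experiments stay reproducible. This is a minimal sketch: the preset names and the helper `generation_kwargs` are hypothetical. The deterministic preset leaves out `temperature`, which has no effect when `do_sample=False`.

```python
# Hypothetical preset names; the values mirror the three combinations above.
GENERATION_PRESETS = {
    "deterministic": {"do_sample": False},
    "creative": {"do_sample": True, "temperature": 0.8, "top_p": 0.9},
    "balanced": {"do_sample": True, "temperature": 0.7, "top_k": 50},
}


def generation_kwargs(preset, max_length=100):
    """Return keyword arguments for a generate call using the named preset."""
    if preset not in GENERATION_PRESETS:
        raise ValueError(f"Unknown preset: {preset!r}")
    return {"max_length": max_length, **GENERATION_PRESETS[preset]}


print(generation_kwargs("creative"))
```

The resulting dict can be unpacked into the model's own `generate` call, for example `model.generate(input_ids, **generation_kwargs("balanced"))`.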

# Task 2

Text classification with BERT

Goal: understand how to use pretrained BERT-type models for text classification.

Task description:
Implement a text classification model based on a pretrained BERT model. Use the IMDb Reviews dataset (https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews) to classify reviews as positive or negative, evaluate the model, and draw conclusions. You may fine-tune the model or use it zero-shot, whichever you prefer; what matters is reaching the required quality.
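
For the evaluation step, the usual binary-classification metrics can be computed directly from the true and predicted labels. The sketch below is one possible way to do it in plain Python; the function name is hypothetical and the labels are made up for illustration, not results from any model.

```python
def binary_classification_report(y_true, y_pred):
    """Accuracy, precision, recall and F1 for labels where 1 = positive review."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up labels for illustration; not results from any model.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
print(binary_classification_report(y_true, y_pred))
```

The IMDb dataset is balanced, so accuracy is a reasonable headline number, while precision and recall show which kind of error dominates.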

Extra credit (instead of Task 2): choose a Russian-language model and classify whether a request asks for a photo to be sent, for example "скинь фото" ("send a photo") = 1, "как дела?" ("how are you?") = 0. This is a zero-shot task: find a suitable model for Russian and add example calls to the notebook.
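
One possible shape for the zero-shot variant is sketched below, built on the Transformers `zero-shot-classification` pipeline. Everything specific here is an assumption made for illustration: the model name `cointegrated/rubert-base-cased-nli-threeway` (any NLI model that supports Russian could be substituted), the two Russian label phrasings, the hypothesis template, and the 0.5 threshold.

```python
# Assumed label phrasings; they strongly affect zero-shot quality and need tuning.
PHOTO_LABEL = "просьба прислать фото"  # "a request to send a photo"
OTHER_LABEL = "обычное сообщение"      # "an ordinary message"


def build_classifier(model_name="cointegrated/rubert-base-cased-nli-threeway"):
    """Create a zero-shot pipeline; the default model name is an assumption."""
    # Imported lazily: creating the pipeline downloads the model on first use.
    from transformers import pipeline

    return pipeline("zero-shot-classification", model=model_name)


def needs_photo(text, classifier, threshold=0.5):
    """Return 1 if the photo label scores at or above threshold, else 0."""
    result = classifier(
        text,
        candidate_labels=[PHOTO_LABEL, OTHER_LABEL],
        hypothesis_template="Это {}.",  # "This is {}."
    )
    # The pipeline returns labels sorted by score, so look the score up by name.
    scores = dict(zip(result["labels"], result["scores"]))
    return int(scores[PHOTO_LABEL] >= threshold)
```

An example call would be `needs_photo("скинь фото", build_classifier())`. Whether the result matches the expected label depends on the chosen model and label wording, so both should be checked on a handful of hand-labelled requests.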