<a href="https://colab.research.google.com/github/daryaami/NLP-ITMO-Course/blob/Task2/Task2_NLP_course2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Задание 2. Курс "Анализ естественного языка методами машинного обучения"

Для анализа был выбран датасет ["News Articles Classification Dataset for NLP & ML"](https://www.kaggle.com/datasets/banuprakashv/news-articles-classification-dataset-for-nlp-and-ml) с платформы Kaggle.
Этот набор данных предлагает обширную коллекцию новостных статей, охватывающих различные области, включая бизнес, технологии, спорт, образование и развлечения.

Для этого датасета мы будем решать задачу суммаризации новостей, используя метрики BLEU и ROUGE для оценки качества модели

## Импорт библиотек и датасета

In [1]:
import os
import nltk
import pandas as pd
from sklearn.model_selection import train_test_split
import torch
from nltk.translate.bleu_score import sentence_bleu
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [2]:
paths = []

folder_path = '/content'
files_and_folders = os.listdir(folder_path)
paths = [file for file in files_and_folders if os.path.isfile(os.path.join(folder_path, file)) and file.endswith('.csv')]

print(paths)

['education_data.csv', 'sports_data.csv', 'technology_data.csv', 'business_data.csv', 'entertainment_data.csv']


In [3]:
data = pd.DataFrame(columns=['headlines', 'description', 'content', 'url', 'category'])

for path in paths:
    data = pd.concat([data, pd.read_csv(path)])

assert (len(paths) != 0), 'Data is empty'

data = data[['content', 'description']].reset_index(drop=True)
data

Unnamed: 0,content,description
0,The Common University Entrance Test Postgradua...,CUET PG 2024: UGC said that the list of partic...
1,Less than a year after the Oxford University s...,"On April 10, 2023 TCS had announced that it ha..."
2,Student enrollments in Computer Engineering ha...,AISHE Report 2021-22: The enrollment in STEM (...
3,The New Delhi Municipal Council (NDMC) is invi...,The coaching partner will also provide up-to-d...
4,Bachelor of Arts (BA) courses had the highest ...,AISHE Report 2021-22: For programmes including...
...,...,...
9995,"Katrina Kaif’s father-in-law, Sham Kaushal, is...",Katrina Kaif played the role of Zoya in Tiger ...
9996,Months after the release of the Prabhas and Kr...,Manoj Muntashir also defended the Adipurush di...
9997,Farah Khan Kunder made her debut as a choreogr...,Farah Khan started her career as a choreograph...
9998,Salman Khan and Katrina Kaif starrer Tiger 3 w...,Tiger 3 box office collection Day 6 early esti...


## Предобработка

In [4]:
data['content'] = data['content'].apply(lambda x: x.replace('\n', ' '))
data['description'] = data['description'].apply(lambda x: x.replace('\n', ' '))

In [5]:
data = data.rename(columns={'description': 'summary', 'content': 'article'})

## Fine-tuning

In [6]:
# !pip uninstall -y transformers accelerate > None
# !pip install transformers accelerate > None

In [7]:
!pip install transformers datasets torch > None

In [78]:
!pip install rouge-score > None

Collecting rouge-score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... [?25l[?25hdone
Building wheels for collected packages: rouge-score
  Building wheel for rouge-score (setup.py) ... [?25l[?25hdone
  Created wheel for rouge-score: filename=rouge_score-0.1.2-py3-none-any.whl size=24933 sha256=1916f9a60e2ce65f6bde850c6ede72b14fb3e7ffafedea3aefb81e5525a197f9
  Stored in directory: /root/.cache/pip/wheels/5f/dd/89/461065a73be61a532ff8599a28e9beef17985c9e9c31e541b4
Successfully built rouge-score
Installing collected packages: rouge-score
Successfully installed rouge-score-0.1.2


In [6]:
import torch
import time
from datasets import Dataset
from sklearn.model_selection import train_test_split
from rouge_score import rouge_scorer
from transformers import BartTokenizer, BartForConditionalGeneration, Trainer, TrainingArguments, DataCollatorForSeq2Seq

Определим процессор, на котором будем работать с моделью

In [7]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

Для задачи суммаризации текстов я выбрала модель Shobhank-iiitdwd/BERT_summary. https://huggingface.co/Shobhank-iiitdwd/BERT_summary

Она хорошо подходит для суммаризации текста и была также обучена на новостном датасете.

In [8]:
from nltk.translate.bleu_score import SmoothingFunction

In [9]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("Shobhank-iiitdwd/BERT_summary")
model = AutoModelForSeq2SeqLM.from_pretrained("Shobhank-iiitdwd/BERT_summary").to(device)

Разделим датасет

In [10]:
data_sm = data.sample(500, random_state=42)

train_df, test_df = train_test_split(data_sm,
                                     test_size=0.05,
                                     random_state=42)

train_dataset = Dataset.from_pandas(train_df)

Оценим качество модели до обучения

In [11]:
from tqdm import tqdm

# Функция для вычисления BLEU
def compute_bleu(references, predictions):
    sf = SmoothingFunction()
    scores = []
    for ref, pred in tqdm(zip(references, predictions), total=len(references)):
        ref_tokens = nltk.word_tokenize(ref)
        pred_tokens = nltk.word_tokenize(pred)
        score = sentence_bleu([ref_tokens], pred_tokens, smoothing_function=sf.method2)
        scores.append(score)
    return sum(scores) / len(scores)


def compute_overall_rouge(references, predictions):
    # Инициализация объекта ROUGE-скорера
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)

    # Вычисление ROUGE для каждой пары reference-prediction
    rouge1_scores = []
    rouge2_scores = []
    rougeL_scores = []

    for reference, prediction in zip(references, predictions):
        scores = scorer.score(reference, prediction)
        rouge1_scores.append(scores['rouge1'].fmeasure)
        rouge2_scores.append(scores['rouge2'].fmeasure)
        rougeL_scores.append(scores['rougeL'].fmeasure)

    # Вычисление среднего значения ROUGE для каждой метрики
    avg_rouge1 = sum(rouge1_scores) / len(rouge1_scores)
    avg_rouge2 = sum(rouge2_scores) / len(rouge2_scores)
    avg_rougeL = sum(rougeL_scores) / len(rougeL_scores)

    # Вычисление общей метрики ROUGE
    overall_rouge = (avg_rouge1 + avg_rouge2 + avg_rougeL) / 3

    return overall_rouge


# Генерация предсказаний
def generate_summary(model, tokenizer, texts):
    inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True, max_length=512).to(device)
    summaries = model.generate(inputs["input_ids"], max_length=64, num_beams=4, early_stopping=True)
    return [tokenizer.decode(s, skip_special_tokens=True) for s in summaries]

In [12]:
# Из тестовой выборки выделим сами статьи и их обощенное описание
references = test_df['summary'].tolist()
articles = test_df['article'].tolist()

In [13]:
predictions_before = generate_summary(model, tokenizer, articles)

bleu_before = compute_bleu(references, predictions_before)
print(f"\nBLEU до fine-tuning: {bleu_before}")

rouge_before = compute_overall_rouge(references, predictions_before)
print(f"ROUGE до fine-tuning: {rouge_before}")

100%|██████████| 25/25 [00:00<00:00, 789.11it/s]

BLEU до fine-tuning: 0.09023959526396962
ROUGE до fine-tuning: 0.2690220063386979





Пеперь дообучим модель на своих данных

In [14]:
# Определим функцию для обработки данных перед подачей в модель
def preprocess_data(examples):
    inputs = [doc for doc in examples['article']]
    model_inputs = tokenizer(inputs, max_length=512, padding='max_length', truncation=True)

    # Настраиваем суммаризацию
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(examples['summary'], max_length=64, padding='max_length', truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

In [15]:
# Применяем токенизацию к нашему тренировочному датасету
tokenized_dataset = train_dataset.map(preprocess_data, batched=True)

# Разделим датасет также на тренировочный и валидационный
split = tokenized_dataset.train_test_split(test_size=0.2)
train_dataset = split['train']
val_dataset = split['test']

Map:   0%|          | 0/475 [00:00<?, ? examples/s]



In [16]:
training_args = TrainingArguments(
    output_dir='./fine_tunned',      # директория для сохранения модели
    num_train_epochs=4,             # количество эпох
    per_device_train_batch_size=2,   # размер батча на устройстве
    per_device_eval_batch_size=2,    # размер батча для оценки
    gradient_accumulation_steps=4,   # Накопление градиентов для большего эффективного размера батча
    warmup_steps=500,                # количество шагов для разогрева
    weight_decay=0.01,               # коэффициент регуляризации
    logging_dir='./logs',            # директория для логов
    logging_steps=1,
    eval_strategy="epoch",           # Оценка модели после каждой эпохи
    save_total_limit=2,              # Сохранение только двух последних моделей
)

# Data collator для динамического паддинга
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)


trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator
)

Запустим процесс дообучения модели

In [17]:
# Запуск обучения
start_time = time.time()

trainer.train()

end_time = time.time()
training_time = end_time - start_time
print(f"Время обучения: {round(training_time, 2)} секунд")


model.save_pretrained('./fine_tuned_model_BERT')
tokenizer.save_pretrained('./fine_tuned_model_BERT')

We strongly recommend passing in an `attention_mask` since your input_ids may be padded. See https://huggingface.co/docs/transformers/troubleshooting#incorrect-output-when-padding-tokens-arent-masked.


Epoch,Training Loss,Validation Loss
0,3.2223,4.283321
2,1.4551,1.393764
3,1.3225,1.227318


Non-default generation parameters: {'max_length': 142, 'min_length': 56, 'early_stopping': True, 'num_beams': 4, 'length_penalty': 2.0, 'no_repeat_ngram_size': 3}


Время обучения: 288.45 секунд


('./fine_tuned_model_BERT/tokenizer_config.json',
 './fine_tuned_model_BERT/special_tokens_map.json',
 './fine_tuned_model_BERT/vocab.txt',
 './fine_tuned_model_BERT/added_tokens.json',
 './fine_tuned_model_BERT/tokenizer.json')

Теперь загрузим обученную модель и оценим на тренировочных данных

In [18]:
del model
del tokenizer

In [19]:
torch.cuda.empty_cache()

In [20]:
tuned_tokenizer = AutoTokenizer.from_pretrained('./fine_tuned_model_BERT')
tuned_model = AutoModelForSeq2SeqLM.from_pretrained('./fine_tuned_model_BERT').to(device)

In [21]:
predictions_after = generate_summary(tuned_model, tuned_tokenizer, articles)

bleu_after = compute_bleu(references, predictions_after)
print(f"\nBLEU после fine-tuning: {bleu_after}")

rouge_after = compute_overall_rouge(references, predictions_after)
print(f"ROUGE после fine-tuning: {rouge_after}")

100%|██████████| 25/25 [00:00<00:00, 950.77it/s]


BLEU после fine-tuning: 0.11681717825647554
ROUGE после fine-tuning: 0.33647722933641866





Как видно, метрики показывают улучшение результатов на тестовых данных

In [26]:
comparison_df = pd.DataFrame({'original': references, 'before': predictions_before, 'after': predictions_after})

In [28]:
comparison_df.sample(5)

Unnamed: 0,original,before,after
8,"As regards maturity, RBI said the minimum teno...",reserve bank of india issued guidelines for le...,"in february, the central bank had come out wit..."
16,The James Webb Space Telescope has captured ne...,james webb's nircam and miri ( mid - infrared ...,the james webb space telescope has looked at t...
0,Tired of the Bing Wallpapers app not refreshin...,bing gallery has a nifty trick that lets you s...,bing gallery has one of the most beautiful col...
23,While the Indian Space Research Organisation’s...,isro's istrac centre and mox will play a cruci...,isro's istrac centre and mox will see scientis...
11,Raveena Tandon is mother of daughter Rasha and...,"raveena tandon is mother to four children, dau...","raveena tandon is mother of four children, dau..."
