
# Indonesian Text Summarization

---

This notebook fine-tunes a [Bart](https://huggingface.co/transformers/model_doc/bart.html) model to summarize articles. Our group uses the Indonesian subset of the [GEM Benchmark](https://huggingface.co/datasets/GEM/xlsum) XL-Sum dataset together with the [sshleifer/distilbart-xsum-12-3](https://huggingface.co/sshleifer/distilbart-xsum-12-3) Bart checkpoint.

## Setup

---

Install the libraries used in this notebook.

Our group uses the following libraries:

Transformers: loads pretrained language models such as BERT and BART, along with their tokenizers.

Datasets: downloads and loads datasets.

SentencePiece: a subword tokenizer library that converts text into tokens.

Rouge_Score: computes ROUGE scores, which measure the overlap between a generated summary and a reference summary.

Wandb: tracks the progress of model training.

They are installed with the following commands:

In [1]:
! pip install transformers
! pip install datasets
! pip install sentencepiece
! pip install rouge_score
! pip install wandb
! pip install accelerate -U


In [2]:
# import torch
import numpy as np
import datasets

from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    Seq2SeqTrainingArguments,
    Seq2SeqTrainer,
    DataCollatorForSeq2Seq,
)

from tabulate import tabulate
import nltk
from datetime import datetime

In [3]:
WANDB_INTEGRATION = True
if WANDB_INTEGRATION:
    import wandb

    wandb.login()  # IMPORTANT: log in to wandb.ai first, otherwise the Trainer cannot start training with W&B reporting enabled

wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc


## Set language

---

Initialize the `language` variable, which is used in several later cells.

In [4]:
language = "indonesian"

## Model and tokenizer

---

Download model and tokenizer.

In [6]:
model_name = "sshleifer/distilbart-xsum-12-3"

model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# tokenization
encoder_max_length = 512 #256
decoder_max_length = 128 #64


## Data

---

### Download

Next, we load the GEM Benchmark XL-Sum dataset. We take the first 38,242 examples of the training split; each example is an Indonesian-language article paired with a reference summary.

In [7]:
data = datasets.load_dataset("GEM/xlsum", name=language, split="train[:38242]")


### Prepare

**Format the data and split it into a training set and a validation set, which is used to evaluate the model later.**

In [8]:
# Keep the article as the source text and its reference summary as the target
def flatten(example):
    return {
        "text": example["text"],
        "summary": example["target"],
    }

# Rebuild the text and summary columns batch by batch (the contents are unchanged)
def list2samples(example):
    texts = []
    summaries = []
    for sample in zip(example["text"], example["summary"]):
        text = sample[0]
        summary = sample[1]
        texts.append(text)
        summaries.append(summary)
    return {"text": texts, "summary": summaries}

# Restructure the data and drop the columns that are not used
dataset = data.map(flatten, remove_columns=["gem_id", "url", "title", "references"])
# Apply list2samples to the dataset in batches
dataset = dataset.map(list2samples, batched=True)

# Split into training and validation sets (90% / 10%)
train_data_txt, validation_data_txt = dataset.train_test_split(test_size=0.1).values()
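
The sizes of the two splits can be checked by hand. The sketch below is only an illustration: `split_sizes` is a hypothetical helper, and it assumes the fractional `test_size` is rounded up to a whole number of examples, with the remainder going to training.

```python
import math


def split_sizes(n_samples: int, test_size: float) -> tuple[int, int]:
    """Return (n_train, n_test) for a fractional test_size.

    Assumption: the test share is rounded up and the rest is used for training.
    """
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test


n_train, n_test = split_sizes(38242, 0.1)
print(n_train, n_test)  # 34417 3825
```

Because `train_test_split` is called without a seed here, the sizes are stable across runs but the examples that land in each split are not.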


**Preprocess and tokenize**

In [9]:
# Tokenize a batch of source texts and target summaries
def batch_tokenize_preprocess(batch, tokenizer, max_source_length, max_target_length):
    source, target = batch["text"], batch["target"]
    source_tokenized = tokenizer(
        source, padding="max_length", truncation=True, max_length=max_source_length
    )
    target_tokenized = tokenizer(
        target, padding="max_length", truncation=True, max_length=max_target_length
    )
    # Copy the tokenized source text into the batch dictionary.
    batch = {k: v for k, v in source_tokenized.items()}
    # Ignore padding in the loss
    batch["labels"] = [
        [-100 if token == tokenizer.pad_token_id else token for token in l]
        for l in target_tokenized["input_ids"]
    ]
    return batch

train_data = train_data_txt.map(
    lambda batch: batch_tokenize_preprocess(
        batch, tokenizer, encoder_max_length, decoder_max_length
    ),
    batched=True,
    remove_columns=train_data_txt.column_names,
)

validation_data = validation_data_txt.map(
    lambda batch: batch_tokenize_preprocess(
        batch, tokenizer, encoder_max_length, decoder_max_length
    ),
    batched=True,
    remove_columns=validation_data_txt.column_names,
)
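
To make the label masking in `batch_tokenize_preprocess` concrete, here is a minimal pure-Python sketch on a toy sequence. The names `pad_and_mask`, `PAD_ID`, and the token ids are hypothetical; the pad id of 1 is an assumption that should be checked against `tokenizer.pad_token_id`. The truncation here is a plain slice, which is simpler than what the real tokenizer does with special tokens.

```python
PAD_ID = 1  # assumption: BART-style pad token id; verify with tokenizer.pad_token_id
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss ignores by default


def pad_and_mask(token_ids, max_length, pad_id=PAD_ID):
    """Pad or truncate to max_length, then hide padding positions from the loss."""
    truncated = token_ids[:max_length]
    padded = truncated + [pad_id] * (max_length - len(truncated))
    labels = [IGNORE_INDEX if token == pad_id else token for token in padded]
    return padded, labels


padded, labels = pad_and_mask([0, 713, 16, 2], max_length=6)
print(padded)  # [0, 713, 16, 2, 1, 1]
print(labels)  # [0, 713, 16, 2, -100, -100]
```

Without this replacement, the model would also be trained to predict padding tokens, which would dominate the loss for short summaries.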


## Training

---

### Metrics

In [10]:
# Borrowed from https://github.com/huggingface/transformers/blob/master/examples/seq2seq/run_summarization.py

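# Download the NLTK Punkt sentence tokenizer, used below to split summaries into sentences
nltk.download("punkt", quiet=True)

# ROUGE metric from Hugging Face, used to evaluate the model
# (datasets.load_metric is deprecated in favor of the separate `evaluate` library)
metric = datasets.load_metric("rouge")

# Clean up the decoded predictions and labels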
def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [label.strip() for label in labels]

    # rougeLSum expects newline after each sentence
    preds = ["\n".join(nltk.sent_tokenize(pred)) for pred in preds]
    labels = ["\n".join(nltk.sent_tokenize(label)) for label in labels]

    return preds, labels

# Compute the ROUGE scores and the mean generated length
def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Some simple post-processing
    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

    result = metric.compute(
        predictions=decoded_preds, references=decoded_labels, use_stemmer=True
    )
    # Extract a few results from ROUGE
    result = {key: value.mid.fmeasure * 100 for key, value in result.items()}

    prediction_lens = [
        np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds
    ]
    result["gen_len"] = np.mean(prediction_lens)
    result = {k: round(v, 4) for k, v in result.items()}
    return result
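
For intuition about what the ROUGE-1 F-measure reported by `compute_metrics` measures, the following is a simplified sketch. It is not the `rouge_score` implementation: `rouge1_f` is a hypothetical helper that tokenizes on whitespace only, whereas the real library normalizes text differently and can apply stemming, and the notebook reports an aggregated score over the whole validation set.

```python
from collections import Counter


def rouge1_f(prediction: str, reference: str) -> float:
    """Unigram-overlap F-measure between a prediction and a reference."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Clipped overlap: each unigram counts at most as often as it appears in both texts
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


score = rouge1_f(
    "longsor menimpa kamp karyawan",
    "longsor di bengkulu menimpa kamp karyawan",
)
print(round(score * 100, 2))  # 80.0
```

Here all four predicted words appear in the six-word reference, so precision is 1.0 and recall is 4/6, giving an F-measure of 0.8.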


### Training arguments

The training arguments are the configuration that controls how the seq2seq model is trained in this notebook.

In [11]:
! pip install accelerate -U
! pip install transformers[torch]



In this section, we set the parameters needed to train the seq2seq model.

1. Seq2SeqTrainingArguments: holds the arguments that configure the training run, such as the number of epochs, the batch size, and the learning rate.
2. DataCollatorForSeq2Seq: assembles dataset examples into batches suitable for a seq2seq task.
3. Seq2SeqTrainer: the trainer for the seq2seq model. It takes the model, the training arguments, the data collator, the training dataset, the evaluation dataset, the tokenizer, and the metric computation function.

With this configuration in place, we are ready to train the seq2seq model.

In [12]:
training_args = Seq2SeqTrainingArguments(
    output_dir="results",
    num_train_epochs=1,
    do_train=True,
    do_eval=True,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    learning_rate=6e-05,
    warmup_steps=500,
    weight_decay=0.1,
    label_smoothing_factor=0.1,
    predict_with_generate=True,
    logging_dir="logs",
    logging_steps=50,
    save_total_limit=3,
)

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_data,
    eval_dataset=validation_data,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
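
The arguments above imply a step count and a learning-rate curve that can be worked out by hand. The sketch below is an illustration under stated assumptions: 34,417 training examples (38,242 minus the 10% validation split), a single device, no gradient accumulation, and a linear schedule with warmup. `total_steps` and `linear_warmup_lr` are hypothetical helpers, not Transformers APIs.

```python
import math


def total_steps(n_examples: int, batch_size: int, epochs: int = 1) -> int:
    """Optimizer steps for a run, assuming one device and no gradient accumulation."""
    return math.ceil(n_examples / batch_size) * epochs


def linear_warmup_lr(step: int, base_lr: float, warmup_steps: int, num_training_steps: int) -> float:
    """Learning rate that rises linearly during warmup, then decays linearly to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, num_training_steps - step)
    return base_lr * remaining / max(1, num_training_steps - warmup_steps)


steps = total_steps(34417, batch_size=2)
print(steps)  # 17209

# Learning rate halfway through warmup, at the end of warmup, and at the last step
for step in (250, 500, steps):
    print(step, linear_warmup_lr(step, 6e-05, 500, steps))
```

With `warmup_steps=500`, the learning rate reaches its peak of 6e-05 after roughly 3% of the run and then decreases for the remaining steps.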

### Train

Wandb integration

In [13]:
if WANDB_INTEGRATION:
    wandb_run = wandb.init(
        project="bart_GEM_xlsum",
        config={
            "per_device_train_batch_size": training_args.per_device_train_batch_size,
            "learning_rate": training_args.learning_rate,
            "dataset": "GEM/xlsum " + language,
        },
    )

    now = datetime.now()
    current_time = now.strftime("%H%M%S")
    wandb_run.name = "run_" + language + "_" + current_time

wandb: Currently logged in as: septiotriwahyudi. Use `wandb login --relogin` to force relogin


Train model

In [14]:
trainer.train()

You're using a BartTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Step | Training Loss |
| --- | --- |
| 50 | 6.6212 |
| 100 | 5.5736 |
| 150 | 5.2127 |
| 200 | 5.2165 |
| 250 | 5.0346 |
| 300 | 4.9047 |
| 350 | 4.8252 |
| 400 | 4.678 |
| 450 | 4.806 |
| 500 | 4.7605 |


TrainOutput(global_step=17209, training_loss=3.5907774439832822, metrics={'train_runtime': 9283.7651, 'train_samples_per_second': 3.707, 'train_steps_per_second': 1.854, 'total_flos': 2.13095579123712e+16, 'train_loss': 3.5907774439832822, 'epoch': 1.0})

Evaluation after fine-tuning

In [15]:
trainer.evaluate()

{'eval_loss': 3.221283435821533,
 'eval_rouge1': 28.5046,
 'eval_rouge2': 11.3959,
 'eval_rougeL': 23.144,
 'eval_rougeLsum': 23.3079,
 'eval_gen_len': 45.4021,
 'eval_runtime': 1540.2269,
 'eval_samples_per_second': 2.483,
 'eval_steps_per_second': 1.242,
 'epoch': 1.0}

In [16]:
if WANDB_INTEGRATION:
    wandb_run.finish()



## Evaluation

---

**Generate summaries from the fine-tuned model and compare them with those generated from the original, pre-trained one.**

In [17]:
def generate_summary(test_samples, model):
    inputs = tokenizer(
        test_samples["text"],
        padding="max_length",
        truncation=True,
        max_length=encoder_max_length,
        return_tensors="pt",
    )
    input_ids = inputs.input_ids.to(model.device)
    attention_mask = inputs.attention_mask.to(model.device)
    outputs = model.generate(input_ids, attention_mask=attention_mask)
    output_str = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    return outputs, output_str


model_before_tuning = AutoModelForSeq2SeqLM.from_pretrained(model_name)

test_samples = validation_data_txt.select(range(16))

summaries_before_tuning = generate_summary(test_samples, model_before_tuning)[1]
summaries_after_tuning = generate_summary(test_samples, model)[1]

In [18]:
print(
    tabulate(
        zip(
            range(len(summaries_after_tuning)),
            summaries_after_tuning,
            summaries_before_tuning,
        ),
        headers=["Id", "Summary after", "Summary before"],
    )
)
print("\nTarget summaries:\n")
print(
    tabulate(list(enumerate(test_samples["summary"])), headers=["Id", "Target summary"])
)
print("\nSource documents:\n")
print(tabulate(list(enumerate(test_samples["text"])), headers=["Id", "Text"]))

  Id  Summary after                                                                                                                                                    Summary before
----  ---------------------------------------------------------------------------------------------------------------------------------------------------------------  ---------------------------------------------------------------------------------------------------------------------------------
   0  Sebuah longsor di Kabupaten Lebong Provinsi Bengkulu menimpa kamp karyawan sehingga menimbulkan korban jiwa.                                                     Longsor menimpa kamp karyawan sehingga menimbulkan korban jiwa, a juru bicara badan Nasional Bencana BNPB.
   1  Presiden Joko Widodo menegaskan bahwa moda transportasi umum dengan menggunakan aplikasi internet 'Gojek' yang dibutuhkan masyarakat.                            Keputusan pelarangan ojek online telah menimbulkan penolakan di masyarakat.
   2

## Save the trained model

---

In [19]:
trainer.save_model("summary_model_bart_4")

In [20]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [21]:
import shutil

PATH_MODEL = '/content/summary_model_bart_4'
DRIVE_DESTINATION = '/content/drive/My Drive/Model_DL'

shutil.move(PATH_MODEL, DRIVE_DESTINATION)

'/content/drive/My Drive/Model_DL/summary_model_bart_4'