# **Summarization**
Text summarization is the task of condensing a long document into a short summary. It can be extractive (selecting the most relevant passages from the document verbatim) or abstractive (generating new text that captures the most relevant information).
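
To make the distinction concrete, here is a minimal sketch of an extractive approach. It is a toy word-frequency heuristic, not the method used in this notebook, and the function name `extractive_summary` and the sample text are illustrative assumptions. Every sentence it returns is copied verbatim from the input, which is exactly what an abstractive model is not restricted to.

```python
import re
from collections import Counter

def extractive_summary(text, num_sentences=1):
    """Toy extractive summarizer: keep the sentences with the highest average word frequency."""
    # Naive sentence split on ., ! or ? followed by whitespace
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]
    freq = Counter(re.findall(r'\w+', text.lower()))

    def score(sentence):
        tokens = re.findall(r'\w+', sentence.lower())
        return sum(freq[t] for t in tokens) / max(len(tokens), 1)

    # Rank sentences by score, then restore their original order
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:num_sentences])
    return ' '.join(sentences[i] for i in chosen)

# Hypothetical review text, for illustration only
text = "The plot is slow. The plot is slow and the characters are flat. I liked the cover."
summary = extractive_summary(text)
print(summary)
```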

We will fine-tune an mT5 model with TensorFlow on the Multilingual Amazon Reviews Corpus to create a bilingual (English and Spanish) abstractive summarizer that generates a review title from the review body.

### **1. Install and Import Required Libraries**

In [None]:
!pip install datasets transformers[sentencepiece] evaluate rouge_score nltk

In [None]:
import tensorflow as tf
import numpy as np
import evaluate
import nltk

from transformers import AutoTokenizer, TFAutoModelForSeq2SeqLM, DataCollatorForSeq2Seq, create_optimizer, pipeline
from datasets import load_dataset, concatenate_datasets, DatasetDict

nltk.download('punkt')

### **2. Load Data**

In [None]:
english_dataset = load_dataset('amazon_reviews_multi', 'en')
spanish_dataset = load_dataset('amazon_reviews_multi', 'es')

In [None]:
english_dataset

DatasetDict({
    train: Dataset({
        features: ['review_id', 'product_id', 'reviewer_id', 'stars', 'review_body', 'review_title', 'language', 'product_category'],
        num_rows: 200000
    })
    validation: Dataset({
        features: ['review_id', 'product_id', 'reviewer_id', 'stars', 'review_body', 'review_title', 'language', 'product_category'],
        num_rows: 5000
    })
    test: Dataset({
        features: ['review_id', 'product_id', 'reviewer_id', 'stars', 'review_body', 'review_title', 'language', 'product_category'],
        num_rows: 5000
    })
})

In [None]:
spanish_dataset

DatasetDict({
    train: Dataset({
        features: ['review_id', 'product_id', 'reviewer_id', 'stars', 'review_body', 'review_title', 'language', 'product_category'],
        num_rows: 200000
    })
    validation: Dataset({
        features: ['review_id', 'product_id', 'reviewer_id', 'stars', 'review_body', 'review_title', 'language', 'product_category'],
        num_rows: 5000
    })
    test: Dataset({
        features: ['review_id', 'product_id', 'reviewer_id', 'stars', 'review_body', 'review_title', 'language', 'product_category'],
        num_rows: 5000
    })
})

### **3. Preprocess Data**

In [None]:
english_dataset.set_format('pandas')
english_df = english_dataset['train'][:]
english_df['product_category'].value_counts()[:20]

home                      17679
apparel                   15951
wireless                  15717
other                     13418
beauty                    12091
drugstore                 11730
kitchen                   10382
toy                        8745
sports                     8277
automotive                 7506
lawn_and_garden            7327
home_improvement           7136
pet_products               7082
digital_ebook_purchase     6749
pc                         6401
electronics                6186
office_product             5521
shoes                      5197
grocery                    4730
book                       3756
Name: product_category, dtype: int64

In [None]:
spanish_dataset.set_format('pandas')
spanish_df = spanish_dataset['train'][:]
spanish_df['product_category'].value_counts()[:20]

home                        26962
wireless                    25886
toy                         13647
sports                      13189
pc                          11191
home_improvement            10879
electronics                 10385
beauty                       7337
automotive                   7143
kitchen                      6695
apparel                      5737
drugstore                    5513
book                         5264
furniture                    5229
baby_product                 4881
office_product               4771
lawn_and_garden              4237
other                        3937
pet_products                 3713
personal_care_appliances     3573
Name: product_category, dtype: int64

In [None]:
def filter_books(example):
  # Keep only print and digital book reviews
  return example['product_category'] in ('book', 'digital_ebook_purchase')

english_dataset.reset_format()
spanish_dataset.reset_format()

english_books = english_dataset.filter(filter_books)
spanish_books = spanish_dataset.filter(filter_books)

In [None]:
# Concatenating English and Spanish datasets
books_dataset = DatasetDict()

for split in english_books.keys():
  books_dataset[split] = concatenate_datasets([english_books[split], spanish_books[split]])
  books_dataset[split] = books_dataset[split].shuffle(seed=44)

In [None]:
# Filtering out examples with very short titles
books_dataset = books_dataset.filter(lambda x: len(x['review_title'].split()) > 2)
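
The title filter keeps an example only when its title has at least three whitespace-separated words. The sketch below applies the same criterion to a few hypothetical rows in plain Python; the helper name `has_long_title` and the `rows` list are assumptions for illustration.

```python
# Hypothetical rows mimicking the 'review_title' field of the dataset
rows = [
    {'review_title': 'Great'},
    {'review_title': 'Muy bueno'},
    {'review_title': 'Un poco decepcionante'},
    {'review_title': 'Hearts and Tears...So Amazing!'},
]

def has_long_title(example, min_words=3):
    # Same criterion as len(title.split()) > 2: split on whitespace and count
    return len(example['review_title'].split()) >= min_words

kept = [row['review_title'] for row in rows if has_long_title(row)]
print(kept)
```

Note that the split is on whitespace only, so `Tears...So` counts as a single word.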

In [None]:
model_checkpoint = 'google/mt5-small'

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = TFAutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

In [None]:
max_input_length = 512
max_target_length = 30

def preprocess_function(examples):
  tokenized_inputs = tokenizer(examples['review_body'], max_length=max_input_length, truncation=True)
  tokenized_labels = tokenizer(examples['review_title'], max_length=max_target_length, truncation=True)

  tokenized_inputs['labels'] = tokenized_labels['input_ids']
  return tokenized_inputs

In [None]:
tokenized_dataset = books_dataset.map(preprocess_function, batched=True)

In [None]:
tokenized_dataset

DatasetDict({
    train: Dataset({
        features: ['review_id', 'product_id', 'reviewer_id', 'stars', 'review_body', 'review_title', 'language', 'product_category', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 9672
    })
    validation: Dataset({
        features: ['review_id', 'product_id', 'reviewer_id', 'stars', 'review_body', 'review_title', 'language', 'product_category', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 238
    })
    test: Dataset({
        features: ['review_id', 'product_id', 'reviewer_id', 'stars', 'review_body', 'review_title', 'language', 'product_category', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 245
    })
})

In [None]:
batch_size = 8
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model, return_tensors='tf')

tf_train_dataset = model.prepare_tf_dataset(
    tokenized_dataset['train'],
    collate_fn=data_collator,
    shuffle=True,
    batch_size=batch_size
)

tf_validation_dataset = model.prepare_tf_dataset(
    tokenized_dataset['validation'],
    collate_fn=data_collator,
    shuffle=False,
    batch_size=batch_size
)
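
The data collator pads each batch dynamically, to the longest sequence in that batch rather than to a global maximum. The sketch below imitates that idea in plain Python. It is a simplification under stated assumptions: the token ids are made up, the pad id is assumed to be 0, and labels are padded with -100 (the default `label_pad_token_id` of `DataCollatorForSeq2Seq`) so that the loss ignores those positions. The real collator also handles tensors and decoder inputs, which are omitted here.

```python
PAD_TOKEN_ID = 0     # assumption: pad id of the tokenizer
LABEL_PAD_ID = -100  # positions with this label are ignored by the loss

def pad_batch(features):
    """Pad input_ids and labels to the longest sequence in this batch only."""
    max_in = max(len(f['input_ids']) for f in features)
    max_lab = max(len(f['labels']) for f in features)
    batch = {'input_ids': [], 'attention_mask': [], 'labels': []}
    for f in features:
        n_in, n_lab = len(f['input_ids']), len(f['labels'])
        batch['input_ids'].append(f['input_ids'] + [PAD_TOKEN_ID] * (max_in - n_in))
        batch['attention_mask'].append([1] * n_in + [0] * (max_in - n_in))
        batch['labels'].append(f['labels'] + [LABEL_PAD_ID] * (max_lab - n_lab))
    return batch

# Hypothetical token ids, for illustration only
features = [
    {'input_ids': [259, 1042, 1], 'labels': [714, 1]},
    {'input_ids': [259, 1], 'labels': [305, 512, 1]},
]
batch = pad_batch(features)
print(batch)
```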

### **4. Create a Baseline**

In [None]:
metric = evaluate.load('rouge')

def three_sentence_summary(text):
  return '\n'.join(nltk.tokenize.sent_tokenize(text)[:3])

def evaluate_baseline(dataset, metric):
  summaries = [three_sentence_summary(text) for text in dataset['review_body']]
  return metric.compute(predictions=summaries, references=dataset['review_title'])

In [None]:
score = evaluate_baseline(books_dataset['validation'], metric)
score

{'rouge1': 0.1680260170708547,
 'rouge2': 0.088155998756527,
 'rougeL': 0.1557126261248912,
 'rougeLsum': 0.1599222144354075}
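
To read these numbers, it helps to see what ROUGE measures. The sketch below is a simplified, pure-Python version of ROUGE-1 (unigram overlap) and ROUGE-L (longest common subsequence), both reported as F1. It is an approximation: the `rouge_score` package used by `evaluate` applies its own tokenization and optional stemming, so its values can differ. The function names and example strings are illustrative assumptions.

```python
from collections import Counter

def f1(matches, pred_len, ref_len):
    if matches == 0:
        return 0.0
    precision, recall = matches / pred_len, matches / ref_len
    return 2 * precision * recall / (precision + recall)

def rouge1_f1(prediction, reference):
    pred, ref = prediction.lower().split(), reference.lower().split()
    # Clipped unigram overlap: each word counts at most as often as it appears in both
    overlap = sum((Counter(pred) & Counter(ref)).values())
    return f1(overlap, len(pred), len(ref))

def lcs_length(a, b):
    # Dynamic-programming table for the longest common subsequence
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, tok_a in enumerate(a, start=1):
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]

def rougeL_f1(prediction, reference):
    pred, ref = prediction.lower().split(), reference.lower().split()
    return f1(lcs_length(pred, ref), len(pred), len(ref))

# Same words, different order: ROUGE-1 is perfect, ROUGE-L is not
print(rouge1_f1("book this loved", "loved this book"))
print(rougeL_f1("book this loved", "loved this book"))
# A longer prediction lowers precision even when recall is perfect
print(rouge1_f1("I loved this book so much", "loved this book"))
```

The last example hints at why a three-sentence baseline likely scores low against short titles: most of its words do not appear in the reference, so precision drops.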

### **5. Fine-tune the Model**

In [None]:
num_epochs = 5
num_train_steps = len(tf_train_dataset) * num_epochs

optimizer, schedule = create_optimizer(
    init_lr=5.6e-5,
    num_train_steps = num_train_steps,
    num_warmup_steps = 0,
    weight_decay_rate=0.01
)

model.compile(optimizer=optimizer, metrics=['accuracy'])
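
With no warmup steps, the schedule is expected to decay the learning rate from `init_lr` towards zero over the whole run. The sketch below reproduces that shape in plain Python, assuming a linear decay to 0 (which I believe is the default behaviour of `create_optimizer`, though this is an assumption rather than something the notebook states). The step count uses the 9672 training examples and batch size of 8 shown earlier; `linear_decay_lr` is a hypothetical helper.

```python
def linear_decay_lr(step, init_lr=5.6e-5, num_train_steps=6045, end_lr=0.0):
    """Learning rate after `step` updates under a linear decay (assumed schedule shape)."""
    step = min(step, num_train_steps)
    return (init_lr - end_lr) * (1 - step / num_train_steps) + end_lr

steps_per_epoch = 9672 // 8            # training examples / batch size
num_train_steps = steps_per_epoch * 5  # times the number of epochs

print(num_train_steps)
for step in (0, num_train_steps // 2, num_train_steps):
    print(step, linear_decay_lr(step, num_train_steps=num_train_steps))
```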

In [None]:
# Training in mixed-precision float16
tf.keras.mixed_precision.set_global_policy('mixed_float16')

history = model.fit(tf_train_dataset, validation_data=tf_validation_dataset, epochs=num_epochs, verbose=1)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


### **6. Compute Metrics after Fine-tuning the Model**

In [None]:
generation_data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model, pad_to_multiple_of=320, return_tensors='tf')

tf_generate_dataset = model.prepare_tf_dataset(
    tokenized_dataset['validation'],
    collate_fn=generation_data_collator,
    shuffle=False,
    batch_size=batch_size,
    drop_remainder=True
)

@tf.function(jit_compile=True)
def generate_with_xla(batch):
  return model.generate(input_ids=batch['input_ids'], attention_mask=batch['attention_mask'], max_new_tokens=32)

all_preds = list()
all_labels = list()

for batch, labels in tf_generate_dataset:
  predictions = generate_with_xla(batch)
  decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)

  labels = labels.numpy()
  labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
  decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

  decoded_preds = ['\n'.join(nltk.tokenize.sent_tokenize(pred.strip())) for pred in decoded_preds]
  decoded_labels = ['\n'.join(nltk.tokenize.sent_tokenize(label.strip())) for label in decoded_labels]
  all_preds.extend(decoded_preds)
  all_labels.extend(decoded_labels)
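
Two details of the loop above can be sketched in plain Python. First, padding inputs to a multiple of 320 limits the number of distinct input shapes, which matters because XLA compiles a new graph for each new shape. Second, the -100 label padding must be mapped back to a real token id before decoding. The helper names are hypothetical, and the pad id of 0 is an assumption about the tokenizer.

```python
import math

PAD_TOKEN_ID = 0  # assumption: pad id of the tokenizer

def padded_length(seq_len, multiple=320):
    """Smallest multiple of `multiple` that is >= seq_len."""
    return math.ceil(seq_len / multiple) * multiple

def restore_pad_ids(labels, pad_token_id=PAD_TOKEN_ID):
    """Replace the -100 loss mask with the pad id so the labels can be decoded."""
    return [[tok if tok != -100 else pad_token_id for tok in row] for row in labels]

# With inputs truncated to 512 tokens, only two padded shapes can occur
shapes = sorted({padded_length(n) for n in range(1, 513)})
print(shapes)

# Hypothetical label ids, for illustration only
print(restore_pad_ids([[714, 1, -100], [305, 512, 1]]))
```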

In [None]:
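# Score the predictions accumulated over all batches; decoded_preds/decoded_labels hold only the final batch
score = metric.compute(predictions=all_preds, references=all_labels, use_stemmer=True)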
score

{'rouge1': 0.0638888888888889,
 'rouge2': 0.0,
 'rougeL': 0.0638888888888889,
 'rougeLsum': 0.0638888888888889}

### **7. Predict using the Fine-tuned Model**

In [None]:
summarizer = pipeline('summarization', model=model, tokenizer=tokenizer)

In [None]:
def display_summary(idx):
  review = books_dataset['test'][idx]['review_body']
  title = books_dataset['test'][idx]['review_title']

  prediction = summarizer(review)

  print(f"Review: {review}")
  print(f"Title: {title}")
  print(f"Summary: {prediction[0]['summary_text']}")

In [None]:
display_summary(10)

Review: HAPPY DANCE, HAPPY DANCE, HAPPY DANCE!!! This story is a rollercoaster ride of emotions. Here you have tragic events that ripped my heart out, then you have a swoon worthy romance that was sexy, sweet, and just made me believe in love, on to the nosey spiteful small town drama, heartbreak, and ending it all with a warm sweet feeling that I just didn't want to be over. Wyatt and Hannah are two of the most relatable characters that I have experienced this year. Out of all the books that I have read, these two rank pretty high in my memorable couples list. I just want more and more, I was so sad when I got to the end because I just wanted to keep their story going. This specific video review will be included in the October 2018 wrap-up. For other video book reviews check out my YouTube Channel: Steph's Rom Book Talk.
Title: Hearts and Tears...So Amazing!
Summary: A emotional story


In [None]:
display_summary(15)

Review: Tras leer las numerosas críticas buenas sobre esta novela me animé a comprarla, y que largo se me ha hecho. La trama da giros sin sentido, sin explicar nada y dando todo por supuesto. Me ha dejado sin ganas de más.
Title: Un poco decepcionante
Summary: A poco de más
