<a href="https://colab.research.google.com/github/francescodisalvo05/polito-deep-nlp/blob/main/Labs/Lab_05_Automatic_Text_Summarization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Deep Natural Language Processing @ PoliTO**

---


**Teaching Assistant:** Moreno La Quatra

**Practice 5:** Automatic Text Summarization

## Extractive Text Summarization

Extractive methods select content (typically whole sentences) from the original data and copy it into the summary without modifying it in any way.

![](https://images.deepai.org/machine-learning-models/8f66b1eb608e4eb681b2ec0c0631385c/summarization.jpg)

For this part of the practice we will use the BBC News Summary dataset available on [Kaggle](https://www.kaggle.com/pariza/bbc-news-summary).

In [1]:
%%capture
! wget https://github.com/MorenoLaQuatra/DeepNLP/raw/main/practices/P5/bbc_news.zip
! unzip bbc_news.zip

### **Question 1: split data collection**

Read the data collection and split it into train/test/eval. The data are organized in classes (e.g., business, sport, tech...); be sure to select 10% of the data for testing **for each class**.

**Note 1:** Reading some files can raise a `UnicodeDecodeError`; feel free to ignore the offending bytes (`errors` parameter of `open`).

**Note 2:** You can fix the encoding after reading a file by using the [ftfy](https://pypi.org/project/ftfy/) library.
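
To make Note 1 concrete, here is a minimal sketch of how the `errors` parameter of `open` behaves. The byte string is a made-up example (not taken from the BBC files), and the temporary file only stands in for a real article.

```python
import os
import tempfile

# Hypothetical file content: plain ASCII mixed with two bytes (0xE9, 0xA3)
# that are not valid UTF-8 in this position.
raw_bytes = b"caf\xe9 \xa3100"

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(raw_bytes)
    tmp_path = tmp.name

try:
    # errors="ignore" silently drops the bytes that cannot be decoded
    with open(tmp_path, "r", encoding="utf-8", errors="ignore") as f:
        ignored = f.read()
    # errors="replace" keeps a visible U+FFFD marker for each bad byte
    with open(tmp_path, "r", encoding="utf-8", errors="replace") as f:
        replaced = f.read()
finally:
    os.remove(tmp_path)

print(repr(ignored))
print(repr(replaced))
```

With `errors="ignore"` the damaged characters disappear without a trace, so `errors="replace"` can be a useful alternative while inspecting which files are affected.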

In [None]:
!pip install ftfy

In [None]:
# Your code here

In [4]:
import os 
import pandas as pd
import ftfy

from collections import Counter
from sklearn.model_selection import train_test_split

In [12]:
classes = ['business', 'entertainment', 'politics', 'sport', 'tech']
path = '/content/BBC News Summary/'


def read_clean(file_path):
  # read a file, drop the empty rows, strip the final \n and fix the encoding
  with open(file_path, 'r', encoding='utf-8', errors='ignore') as f:
    phrases = [ftfy.fix_text(line.replace("\n", "")) for line in f if line != "\n"]
  return ' '.join(phrases)


dataset = {}

for c in classes:

  dataset[c] = {}

  # the file name is the same for the article and for its summary
  for article in os.listdir(os.path.join(path, "News Articles", c)):

    # article : 001.txt (id = 001)
    article_id = article.split(".")[0]

    dataset[c][article_id] = {
      "text": read_clean(os.path.join(path, "News Articles", c, article)),
      "summary": read_clean(os.path.join(path, "Summaries", c, article)),
    }

In [14]:
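# taken inspiration from instructor's solution

train_ds, eval_ds, test_ds = [], [], []

for c in classes:

  # get the keys for each class
  keys_per_class = list(dataset[c].keys())

  # 80% train, then the remaining 20% is halved: 10% test and 10% eval
  # (fixed seed for reproducibility; `eval_keys` avoids shadowing the built-in `eval`)
  train_keys, eval_test_keys = train_test_split(keys_per_class, test_size=0.2, random_state=42)
  test_keys, eval_keys = train_test_split(eval_test_keys, test_size=0.5, random_state=42)

  train_ds.extend(dataset[c][k] for k in train_keys)
  eval_ds.extend(dataset[c][k] for k in eval_keys)
  test_ds.extend(dataset[c][k] for k in test_keys)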
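
As a cross-check on the per-class proportions, the same 80/10/10 idea can be sketched without scikit-learn. The function name `split_per_class`, its parameters and the toy keys are hypothetical and only serve the illustration.

```python
import random


def split_per_class(keys_by_class, test_frac=0.1, eval_frac=0.1, seed=42):
    """Split the keys of every class separately, so each class
    contributes the same fraction of items to test and eval."""
    rng = random.Random(seed)
    splits = {"train": [], "eval": [], "test": []}
    for label, keys in keys_by_class.items():
        keys = sorted(keys)          # deterministic starting order
        rng.shuffle(keys)
        n_test = round(len(keys) * test_frac)
        n_eval = round(len(keys) * eval_frac)
        splits["test"] += [(label, k) for k in keys[:n_test]]
        splits["eval"] += [(label, k) for k in keys[n_test:n_test + n_eval]]
        splits["train"] += [(label, k) for k in keys[n_test + n_eval:]]
    return splits


# toy data: 20 "sport" articles and 10 "tech" articles
toy_keys = {
    "sport": [f"{i:03d}" for i in range(1, 21)],
    "tech": [f"{i:03d}" for i in range(1, 11)],
}
toy_splits = split_per_class(toy_keys)
print({name: len(items) for name, items in toy_splits.items()})
```

Because the split is done class by class, a small class cannot be left out of the test set by chance, which is what the question asks for.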

### **Question 2: Unsupervised Text Summarization (TextRank)**

[TextRank](https://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf) is an unsupervised text summarization approach that relies on graph modelling. Implement a `TextrankSummarizer` class that exposes the `summarize(sentences, N)` method to get the `N` most relevant sentences from a list (`sentences`).

The main steps are reported here:

1. Each sentence is a node in an undirected graph.
2. Each pair of sentences is connected by an edge whose weight is computed according to the number of common words (see Note 1).
3. PageRank is used to compute a relevance score for each node in the graph (i.e., for each sentence in the list).
4. The `summarize` function returns the summary obtained by concatenating the `N` most relevant sentences (according to the score computed at step 3).
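
Step 3 can be illustrated with a small, library-free sketch of weighted PageRank by power iteration. This is only meant to show the idea; it is not the `networkx` implementation, and the function name, the default values and the toy weight matrix are assumptions made for the example.

```python
def weighted_pagerank(weights, damping=0.85, iterations=100, tol=1e-8):
    """Power iteration on a symmetric weight matrix (list of lists).
    weights[i][j] is the weight of the edge between sentences i and j."""
    n = len(weights)
    if n == 0:
        return []
    total_weight = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        # nodes without edges spread their score uniformly over all nodes
        dangling = sum(scores[j] for j in range(n) if total_weight[j] == 0)
        new_scores = []
        for i in range(n):
            # each neighbour j passes on its score in proportion to the edge weight
            incoming = sum(
                scores[j] * weights[j][i] / total_weight[j]
                for j in range(n)
                if total_weight[j] > 0
            )
            new_scores.append((1 - damping) / n + damping * (incoming + dangling / n))
        change = sum(abs(a - b) for a, b in zip(new_scores, scores))
        scores = new_scores
        if change < tol:
            break
    return scores


# toy graph with 3 sentences: 0-1 share two words, 1-2 share one word
toy_weights = [
    [0, 2, 0],
    [2, 0, 1],
    [0, 1, 0],
]
scores = weighted_pagerank(toy_weights)
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
print(ranking)
```

In the toy graph the middle sentence is connected to both of the others, so it collects the highest score; a summarizer would pick the first `N` entries of `ranking`.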

**Note 1:** An example of the similarity function that can be used to compute graph weights is reported below.

In [None]:
import math

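def compute_similarity(tokens_sent_1, tokens_sent_2):

    # an empty sentence shares no words with anything (and log10(0) is undefined)
    if not tokens_sent_1 or not tokens_sent_2:
        return 0

    # number of distinct words that appear in both sentences
    n_common_words = len(set(tokens_sent_1) & set(tokens_sent_2))

    log_s1 = math.log10(len(tokens_sent_1))
    log_s2 = math.log10(len(tokens_sent_2))

    # both sentences have a single token: avoid the division by zero
    if log_s1 + log_s2 == 0:
        return 0

    return n_common_words / (log_s1 + log_s2)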
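
The function above follows the sentence similarity proposed in the TextRank paper. Written as a formula (assuming base-10 logarithms as in the code, with $S_i$ denoting the token list of sentence $i$ and $|S_i|$ its length), it reads:

```latex
\mathrm{sim}(S_i, S_j) =
  \frac{\left|\{\, w : w \in S_i \wedge w \in S_j \,\}\right|}
       {\log_{10}|S_i| + \log_{10}|S_j|}

% worked example: two sentences of 10 tokens each, sharing 3 distinct words
\mathrm{sim}(S_1, S_2) = \frac{3}{\log_{10} 10 + \log_{10} 10} = \frac{3}{1 + 1} = 1.5
```

The logarithms in the denominator act as a normalization, so that long sentences are not favoured just because they contain more words.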

In [None]:
# Your code here

In [None]:
import networkx as nx

class TextrankSummarizer:

    def summarize(self, sentences, N=2):

        # build a fresh undirected graph at every call, so that nodes and
        # edges of previously summarized documents do not leak into this one
        graph = nx.Graph()
        graph.add_nodes_from(range(len(sentences)))

        # compute_similarity works on token lists, not on raw strings
        # (naive tokenization: lowercase + whitespace split)
        tokens = [s.lower().split() for s in sentences]

        for i in range(len(sentences)):
            for j in range(i + 1, len(sentences)):  # skip self-loops (i == j)

                # add weighted edges to the graph
                weight = compute_similarity(tokens[i], tokens[j])
                if weight > 0:
                    graph.add_edge(i, j, weight=weight)

        # page rank: one relevance score per sentence index
        scores = nx.pagerank(graph, max_iter=100)

        # sort the results in descending order and take the top N
        top_N = sorted(scores, key=scores.get, reverse=True)[:N]

        return [sentences[i] for i in top_N]

### **Question 3: Unsupervised Text Summarization (TextRank + TF-IDF)**

Implement a `TextrankTFIDFSummarizer` class that exposes the `summarize(sentences, N)` method to get the `N` most relevant sentences from a list (`sentences`).

Implement the class similarly to Q2. This version uses a different similarity function to weigh edges connecting sentences. It uses [TF-IDF vectorization](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) and [cosine similarity](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html) to compute sentence-to-sentence similarity.

- Compute TF-IDF vectors for each sentence
- Compute edges' weights using the cosine similarity between TF-IDF vector representations.
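
The two steps above can be sketched in plain Python. This is a simplified illustration, not scikit-learn's `TfidfVectorizer`: it uses a whitespace tokenizer and the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1`, and the function names and toy sentences are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(sentences):
    """One sparse TF-IDF vector (dict word -> weight) per sentence."""
    docs = [s.lower().split() for s in sentences]
    n = len(docs)
    # document frequency: in how many sentences each word appears
    df = Counter(word for doc in docs for word in set(doc))
    idf = {word: math.log((1 + n) / (1 + count)) + 1 for word, count in df.items()}
    return [{word: tf * idf[word] for word, tf in Counter(doc).items()} for doc in docs]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(weight * weight for weight in u.values()))
    norm_v = math.sqrt(sum(weight * weight for weight in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


toy_sentences = ["the cat sat", "the cat ran", "dogs bark loudly"]
vectors = tfidf_vectors(toy_sentences)
sim_01 = cosine(vectors[0], vectors[1])  # two shared words
sim_02 = cosine(vectors[0], vectors[2])  # no shared words
print(round(sim_01, 3), sim_02)
```

Fitting the IDF on all the sentences of a document at once matters: words that occur in most sentences receive a lower weight, so edges are driven by the more distinctive terms.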

In [None]:
# Your code here

In [None]:
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [None]:
def compute_similarity_tfidf(sentences):

  # fit TF-IDF once on all the sentences of the document: every vector shares
  # the same vocabulary and the IDF term is computed over the whole document
  tfidf = TfidfVectorizer().fit_transform(sentences)

  # (n_sentences, n_sentences) matrix of pairwise cosine similarities
  return cosine_similarity(tfidf)

In [None]:
class TextrankTFIDFSummarizer:

    def summarize(self, sentences, N=2):

        if not sentences:
            return []

        similarity = compute_similarity_tfidf(sentences)

        # build a fresh undirected graph at every call (no state shared
        # between documents); nodes are sentence indices
        graph = nx.Graph()
        graph.add_nodes_from(range(len(sentences)))

        for i in range(len(sentences)):
            for j in range(i + 1, len(sentences)):

                # edge weights must be plain scalars, not arrays
                weight = float(similarity[i, j])
                if weight > 0:
                    graph.add_edge(i, j, weight=weight)

        # page rank
        scores = nx.pagerank(graph, max_iter=100)

        # sort the results in descending order and take the top N
        top_N = sorted(scores, key=scores.get, reverse=True)[:N]

        return [sentences[i] for i in top_N]

### **Question 4: Unsupervised Text Summarization (Pretrained BERT)**

Both TextRank (Q2) and its TF-IDF variant (Q3, similar to LexRank) rely on lexical, word-overlap scores to compute sentence similarity.
Use the Sentence-Transformers library to encode sentences into semantic-aware vectors and compute semantic similarity to interconnect sentences (e.g., use the cosine similarity of BERT encodings). Implement a `BERTSummarizer` class similarly to Q2 and Q3.

Note 1: use the `sentence-transformers` library to obtain sentence embeddings (https://www.sbert.net/).

In [None]:
# Your code here

In [None]:
!pip install -U sentence-transformers

In [None]:
import networkx as nx
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer

class BERTSummarizer:

    def __init__(self, model_name="stsb-mpnet-base-v2"):
        self.bert = SentenceTransformer(model_name)

    def compute_similarity(self, sentences):

        # encode all the sentences in one batch: (n_sentences, embedding_dim)
        embeddings = self.bert.encode(sentences)

        # pairwise cosine similarities: (n_sentences, n_sentences)
        return cosine_similarity(embeddings)

    def summarize(self, sentences, N=2):

        if not sentences:
            return []

        similarity = self.compute_similarity(sentences)

        # build a fresh undirected graph at every call (no state shared
        # between documents); nodes are sentence indices
        graph = nx.Graph()
        graph.add_nodes_from(range(len(sentences)))

        for i in range(len(sentences)):
            for j in range(i + 1, len(sentences)):

                # edge weights must be non-negative scalars, so negative
                # similarities are dropped
                weight = float(similarity[i, j])
                if weight > 0:
                    graph.add_edge(i, j, weight=weight)

        # page rank
        scores = nx.pagerank(graph, max_iter=100)

        # sort the results in descending order and take the top N
        top_N = sorted(scores, key=scores.get, reverse=True)[:N]

        return [sentences[i] for i in top_N]

### **Question 5: ROUGE-based evaluation**

Using only the **test set** obtained in Q1, compare the performance of the three summarizers implemented in Q2, Q3 and Q4.

Report their results in terms of average precision, recall and F1-score for the ROUGE-2 metric. Set the number of extracted sentences to 4 for all summarizers.

**Which method obtains the best scores?**

Note 1: You can use the Python implementation of ROUGE available [here](https://pypi.org/project/rouge/).
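
To clarify what is being measured, here is a minimal pure-Python sketch of ROUGE-2 (bigram overlap) and of the averaging over a test set. It is an illustration only: the `rouge` package may differ in tokenization and counting details, and the function names and toy sentences are hypothetical.

```python
from collections import Counter


def bigram_counts(text):
    """Multiset of consecutive word pairs (naive whitespace tokenization)."""
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))


def rouge_2(hypothesis, reference):
    """ROUGE-2 precision, recall and F1 between a generated summary
    (hypothesis) and the ground-truth summary (reference)."""
    hyp, ref = bigram_counts(hypothesis), bigram_counts(reference)
    overlap = sum((hyp & ref).values())  # bigrams found in both, clipped counts
    p = overlap / sum(hyp.values()) if hyp else 0.0
    r = overlap / sum(ref.values()) if ref else 0.0
    f = 2 * p * r / (p + r) if p + r > 0 else 0.0
    return {"p": p, "r": r, "f": f}


def average_rouge_2(pairs):
    """Average the three scores over (hypothesis, reference) pairs."""
    scores = [rouge_2(hyp, ref) for hyp, ref in pairs]
    return {key: sum(s[key] for s in scores) / len(scores) for key in ("p", "r", "f")}


toy_scores = rouge_2("the cat sat on the mat", "the cat lay on the mat")
toy_average = average_rouge_2([
    ("the cat sat on the mat", "the cat lay on the mat"),
    ("hello world", "goodbye moon"),
])
print(toy_scores, toy_average)
```

Precision penalizes summaries that contain bigrams missing from the reference, recall penalizes summaries that miss reference bigrams, and F1 balances the two.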

In [None]:
! pip install rouge

Collecting rouge
  Downloading rouge-1.0.1-py3-none-any.whl (13 kB)
Installing collected packages: rouge
Successfully installed rouge-1.0.1


In [None]:
# Your code here

In [None]:
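textrank = TextrankSummarizer()
# summarize one article at a time: `summarize` expects the sentences of a single
# document, so each test article is split into sentences (naive split on ". ")
summary_text_rank = [textrank.summarize(v["text"].split(". "), 4) for v in test_ds]

In [None]:
textrank_tfidf = TextrankTFIDFSummarizer()
summary_tr_tfidf = [textrank_tfidf.summarize(v["text"].split(". "), 4) for v in test_ds]

# -- "row, column, and data arrays must be 1-D" typically appears when an edge
# -- weight is a 2-D array (e.g. the raw output of cosine_similarity), not a scalar

In [None]:
textrank_bert = BERTSummarizer()
summary_bert = [textrank_bert.summarize(v["text"].split(". "), 4) for v in test_ds]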

## Abstractive Text Summarization

Abstractive methods build an internal semantic representation of the original content, and then use this representation to create a summary that is closer to what a human might express. Abstraction may transform the extracted content by paraphrasing sections of the source document, to condense a text more strongly than extraction. Such transformation, however, is computationally much more challenging than extraction.

![https://techcommunity.microsoft.com/t5/image/serverpage/image-id/180981i9EA877DDFF97D50D?v=v2](https://techcommunity.microsoft.com/t5/image/serverpage/image-id/180981i9EA877DDFF97D50D?v=v2)

This part of the practice also uses the BBC News Summary dataset available on [Kaggle](https://www.kaggle.com/pariza/bbc-news-summary).

### **Question 6: BART (pretrained) seq2seq model**

Exploit [BART](https://huggingface.co/facebook/bart-large-cnn) pretrained on the CNN/Daily Mail dataset to summarize the articles in the BBC test set. Compute the obtained scores in terms of average precision, recall and F1-score for the ROUGE-2 metric.

Note 1: for the generated summaries, set the maximum length to 100 and the minimum length to your preferred value.

Note 2: **to speed up computation**, you can use the distilled version of the BART model (e.g., `sshleifer/distilbart-cnn-12-6` available [here](https://huggingface.co/sshleifer/distilbart-cnn-12-6))

Note 3: You can use the [summarization pipeline](https://huggingface.co/transformers/main_classes/pipelines.html#transformers.SummarizationPipeline). Explicitly set truncation to True to avoid index errors (e.g. `summarizer(..., truncation=True)`)

Note 4: Explicitly set the device to use GPU acceleration (the Colab runtime should also be set to GPU) while creating the pipeline object (e.g., `pipeline(..., device=0)`)

In [None]:
# Your code here

In [None]:
!pip install transformers

In [36]:
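import torch
from transformers import pipeline
from rouge import Rouge

# distilled BART; use the GPU when available (Note 4), otherwise fall back to CPU
device = 0 if torch.cuda.is_available() else -1
summarizer = pipeline("summarization",
                      model="sshleifer/distilbart-cnn-12-6",
                      tokenizer="sshleifer/distilbart-cnn-12-6",
                      device=device)

rouge = Rouge()

precision, recall, f1_score = [], [], []

for v in test_ds:

  summary = v["summary"]

  # Note 1: maximum length 100; the minimum length (30) is an arbitrary choice
  pred = summarizer(v["text"], max_length=100, min_length=30,
                    do_sample=False, truncation=True)
  pred = pred[0]["summary_text"]

  # hypothesis (generated summary) first, reference (ground truth) second
  scores = rouge.get_scores(pred, summary)

  precision.append(scores[0]["rouge-2"]["p"])
  recall.append(scores[0]["rouge-2"]["r"])
  f1_score.append(scores[0]["rouge-2"]["f"])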

In [36]:
# average ROUGE-2 precision, recall and F1-score over the test set
[sum(m) / len(m) for m in (precision, recall, f1_score)]

### **Question 7 (bonus): Finetuning seq2seq model**

Exploit the BBC dataset to fine-tune a BART-based model. Create a fine-tuning procedure that uses the article text as the model input and the ground-truth summary as the target output.

Exploit the [Datasets framework](https://huggingface.co/docs/datasets/) and the [Trainer API](https://huggingface.co/transformers/training.html#fine-tuning-in-pytorch-with-the-trainer-api) for training and evaluating the model.

As before, evaluate the model using ROUGE-2 precision, recall and F1-score. Here you may want to use the [metrics Python library](https://huggingface.co/metrics) to set the [`compute_metrics`](https://huggingface.co/transformers/main_classes/trainer.html#id1) parameter of the Trainer.

In [None]:
# Your code here