# Abstract

This notebook implements the TextRank algorithm to summarise the detoxified text obtained from the notebook on Text Detoxification with BERT. TextRank is an unsupervised, extractive technique for summarising text. It applies Google's PageRank algorithm to a graph whose nodes are the sentences of the text. The sentences are connected through a Markov chain in which the transition probability from a source sentence to a destination sentence is derived from their similarity score. The sentences with the highest importance scores are then selected to construct the summary of the input text. We test the method on articles from the CNN Daily Mail dataset, use GloVe embeddings to vectorize the sentences, and use the ROUGE score as the metric to evaluate the summarisation.

# Imports and Installations

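We use numpy and pandas to read and store the CNN Daily Mail dataset, the nltk library for text preprocessing, the sklearn library to calculate the cosine similarity between two vectorized sentences, and networkx to run PageRank. The transformers and torch imports are needed for the BERT detoxification snippet later in the notebook.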

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
from torch.nn import functional as func
import torch
import pickle

import nltk
from nltk.tokenize import sent_tokenize
from nltk.corpus import stopwords
nltk.download('stopwords')
nltk.download('punkt') # one time execution
stop_words = stopwords.words('english')

import re
import os

from sklearn.metrics.pairwise import cosine_similarity
import networkx as nx

!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip glove*.zip

[nltk_data] Downloading package stopwords to /usr/share/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /usr/share/nltk_data...
[nltk_data]   Package punkt is already up-to-date!

# Importing the CNN Daily Mail Dataset

The CNN Daily Mail dataset is a popular summarisation dataset that pairs each news article with reference highlights, which makes it suitable for evaluating both extractive and abstractive summarisation. In this notebook, we use the TextRank algorithm, an extractive method, to summarise the articles.

In [2]:
TRAIN_CSV = '../input/newspaper-text-summarization-cnn-dailymail/cnn_dailymail/train.csv'

In [3]:
train_df = pd.read_csv(TRAIN_CSV)
train_df.head()

| | id | article | highlights |
|---|---|---|---|
| 0 | 0001d1afc246a7964130f43ae940af6bc6c57f01 | By . Associated Press . PUBLISHED: . 14:11 EST... | Bishop John Folda, of North Dakota, is taking ... |
| 1 | 0002095e55fcbd3a2f366d9bf92a95433dc305ef | (CNN) -- Ralph Mata was an internal affairs li... | Criminal complaint: Cop used his role to help ... |
| 2 | 00027e965c8264c35cc1bc55556db388da82b07f | A drunk driver who killed a young woman in a h... | Craig Eccleston-Todd, 27, had drunk at least t... |
| 3 | 0002c17436637c4fe1837c935c04de47adb18e9a | (CNN) -- With a breezy sweep of his pen Presid... | Nina dos Santos says Europe must be ready to a... |
| 4 | 0003ad6ef0c37534f80b55b4235108024b407f0b | Fleetwood are the only team still to have a 10... | Fleetwood top of League One after 2-0 win at S... |


# Importing the TF-IDF Vectorizer

In [4]:
vectorizer = pickle.load(open("/kaggle/input/tfidf-vectorizer/vectorizer.pickle", "rb"))

https://scikit-learn.org/stable/modules/model_persistence.html#security-maintainability-limitations


# Importing the BoW Toxicity Classifier

In [5]:
BoWClf = torch.load('/kaggle/input/toxicityclassifier/BoWClf.pt')



# Importing Word-Toxicity Scores

In [6]:
wordtoxicities_df = pd.read_csv('/kaggle/input/wordtoxicityscores/wordtoxicities.csv', index_col = 0)
wordtoxicities_df.head()

| | word | toxicity |
|---|---|---|
| 0 | aaaaaaaaaah | 0.155084 |
| 1 | ab | -0.08955 |
| 2 | aba | -0.160136 |
| 3 | abandon | -0.060261 |
| 4 | abbasidumayyad | -0.050418 |


# Importing the Logits of the fine-tuned BERT Model

In [7]:
token_logits = torch.load('/kaggle/input/detoxifier/BERTLogits.pt')
token_logits

tensor([[[-2.4611e+00, -8.5963e+00, -8.7396e+00,  ..., -9.5399e+00,
          -1.1249e+01, -5.7524e+00],
         [-4.1351e+00, -7.5703e+00, -7.4469e+00,  ..., -6.2272e+00,
          -7.3544e+00, -8.3536e+00],
         [-1.4242e+00, -5.3903e+00, -5.3435e+00,  ..., -3.8646e+00,
          -5.8753e+00, -6.3117e+00],
         ...,
         [ 1.8846e+01,  5.3556e+00,  4.9811e+00,  ...,  1.7870e+00,
           1.6240e+00,  8.8052e-01],
         [ 1.9111e+01,  5.5498e+00,  5.3283e+00,  ...,  2.7407e+00,
           2.2640e+00,  1.0450e+00],
         [ 1.8342e+01,  4.5627e+00,  4.3856e+00,  ...,  1.3594e+00,
           1.4697e+00,  7.5490e-03]],

        [[-4.4690e+00, -9.7690e+00, -9.9032e+00,  ..., -8.8216e+00,
          -1.1228e+01, -5.3673e+00],
         [-8.0420e-01, -6.5122e+00, -6.1049e+00,  ..., -7.9757e+00,
          -7.6022e+00, -4.5425e+00],
         [-2.1502e+00, -7.5105e+00, -7.3783e+00,  ..., -7.8517e+00,
          -7.7302e+00, -4.9595e+00],
         ...,
         [ 1.9179e+01,  5

# Word Embeddings

We use the 100-dimensional GloVe word embeddings to vectorize the sentences. To test our method, we take the second article (index 1) of the training set.

In [8]:
text = train_df['article'][1]

def extract_word_vectors():
    # Map each word to its 100-dimensional GloVe vector.
    word_embeddings = {}
    # Each line holds a word followed by its space-separated coefficients.
    with open('glove.6B.100d.txt', encoding='utf-8') as f:
        for line in f:
            values = line.split()
            word = values[0]
            coefs = np.asarray(values[1:], dtype='float32')
            word_embeddings[word] = coefs
    return word_embeddings

embeddings = extract_word_vectors()
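
As a rough illustration of the file format parsed above, the sketch below reads two GloVe-style lines from an in-memory string instead of `glove.6B.100d.txt`. The helper name `parse_glove`, the sample words, and the 3-dimensional vectors are made up for this example; the real file has 100 coefficients per word.

```python
import io
import numpy as np

def parse_glove(handle):
    """Read 'word v1 v2 ... vd' lines into a {word: vector} dictionary."""
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = np.asarray(parts[1:], dtype="float32")
    return vectors

# Made-up 3-dimensional vectors, for illustration only.
sample = io.StringIO("police 0.1 0.2 0.3\nofficer 0.2 0.1 0.4\n")
toy_embeddings = parse_glove(sample)

# Words missing from the vocabulary fall back to a zero vector,
# which mirrors the embeddings.get(w, np.zeros(...)) pattern used later.
missing = toy_embeddings.get("unseen", np.zeros(3, dtype="float32"))
```

Falling back to a zero vector keeps out-of-vocabulary words from raising an error, at the cost of pulling the sentence average towards zero.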

# Functions for Preprocessing and Summarising the Text

## Preprocessing

- Splitting the text into sentences
- Removing punctuation and converting to lower case
- Removing stop words

## Summarising

- Splitting each cleaned sentence into words, looking up the GloVe embedding of each word, and averaging the embeddings to obtain one vector per sentence.
- Calculating the Markov chain matrix, whose entries are the cosine similarities between pairs of sentence vectors.
- Finding the sentence importance rankings using the PageRank algorithm.
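
The first two steps can be sketched on toy data as follows. This is a minimal sketch rather than the notebook's implementation: the function names, the 2-dimensional embeddings, and the vectorised cosine computation are assumptions made for illustration (the notebook itself calls sklearn's `cosine_similarity` pair by pair and divides by the word count plus 0.001).

```python
import numpy as np

def sentence_vector(sentence, embeddings, dim):
    """Average the word vectors of a cleaned sentence."""
    words = sentence.split()
    if not words:
        return np.zeros(dim)
    return sum(embeddings.get(w, np.zeros(dim)) for w in words) / len(words)

def similarity_matrix(vectors):
    """Cosine similarity between every pair of sentence vectors."""
    mat = np.vstack(vectors)
    norms = np.linalg.norm(mat, axis=1, keepdims=True)
    unit = mat / np.where(norms > 0, norms, 1.0)  # avoid division by zero
    sim = unit @ unit.T
    np.fill_diagonal(sim, 0.0)  # a sentence does not vote for itself
    return sim

# Made-up 2-dimensional embeddings, for illustration only.
toy_embeddings = {
    "police": np.array([1.0, 0.0]),
    "officer": np.array([0.9, 0.1]),
    "weather": np.array([0.0, 1.0]),
}
clean = ["police officer", "officer", "weather"]
vectors = [sentence_vector(s, toy_embeddings, dim=2) for s in clean]
sim = similarity_matrix(vectors)
```

In this toy case the first two sentences end up far more similar to each other than either is to the third, so they reinforce each other in the ranking step.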


Lastly, we pick the top-ranked 66% (roughly two thirds) of the sentences and join them, in rank order, to form the summary.
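
Because the number of selected sentences is truncated to an integer, very short inputs keep fewer sentences than two thirds might suggest. The sketch below tabulates the count for small inputs and also shows one possible variation, which is not what this notebook does: returning the selected sentences in their original document order. Both helper names are hypothetical.

```python
def summary_size(num_sentences, ratio=0.66):
    """Number of sentences kept when the count is truncated to an integer."""
    return int(ratio * num_sentences)

def pick_in_document_order(sentences, scores, ratio=0.66):
    """Keep the top-ranked sentences but emit them in their original order."""
    k = summary_size(len(sentences), ratio)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]

# A single-sentence input yields an empty summary, two or three sentences yield one.
sizes = {n: summary_size(n) for n in range(1, 7)}

# Made-up scores: rank order is D, E, B, whereas document order is B, D, E.
toy_sentences = ["A.", "B.", "C.", "D.", "E."]
toy_scores = {0: 0.10, 1: 0.20, 2: 0.15, 3: 0.30, 4: 0.25}
picked = pick_in_document_order(toy_sentences, toy_scores)
```

Document order tends to read more naturally, while rank order puts the most central sentence first; which is preferable depends on how the summary is used.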

In [9]:
def split_into_sentences(text):
    return sent_tokenize(text)

def remove_punctuations(sentences):
    return pd.Series(sentences).str.replace("[^a-zA-Z]", " ", regex=True)

def convert2lower(sentences):
    return [s.lower() for s in sentences]

def remove_stopwords(sen):
    return " ".join([i for i in sen.split() if i not in stop_words])

def vectorize_sentences(clean_sentences, embeddings):
    sentence_vectors = []
    for i in clean_sentences:
        if len(i) != 0:
            v = sum([embeddings.get(w, np.zeros((100,))) for w in i.split()])/(len(i.split())+0.001)
        else:
            v = np.zeros((100,))
        sentence_vectors.append(v)
    return sentence_vectors

def calc_similarity_mat(sentences, sentence_vectors):
    sim_mat = np.zeros([len(sentences), len(sentences)])
    for i in range(len(sentences)):
        for j in range(len(sentences)):
            if i != j:
                sim_mat[i][j] = cosine_similarity(sentence_vectors[i].reshape(1,100), 
                                                  sentence_vectors[j].reshape(1,100))[0,0]
    return sim_mat
                
def find_sentence_rankings(sim_mat):
    nx_graph = nx.from_numpy_array(sim_mat)
    return nx.pagerank(nx_graph)

def summarize(text):
    sentences = split_into_sentences(text)
    clean_sentences = [remove_stopwords(sent) for sent in convert2lower(remove_punctuations(sentences))]
    
    sentence_vectors = vectorize_sentences(clean_sentences, embeddings)
    
    sim_mat = calc_similarity_mat(sentences, sentence_vectors)
    sentence_scores = find_sentence_rankings(sim_mat)
    ranked_sentences = sorted(((sentence_scores[i],s) for i,s in enumerate(sentences)), reverse=True)
    
    return ' '.join(sent for score, sent in ranked_sentences[:int(0.66 * len(ranked_sentences))])
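
The ranking step in `find_sentence_rankings` is delegated to `nx.pagerank`. To show what that computation amounts to, here is a minimal power-iteration sketch in plain numpy. It is an approximation for illustration, not networkx's implementation; the function name `pagerank_scores`, the tolerance, and the toy matrix are assumptions, and the damping factor of 0.85 matches the networkx default.

```python
import numpy as np

def pagerank_scores(sim_mat, damping=0.85, tol=1e-8, max_iter=200):
    """Power-iteration PageRank over a non-negative similarity matrix."""
    sim = np.asarray(sim_mat, dtype=float)
    n = sim.shape[0]
    row_sums = sim.sum(axis=1, keepdims=True)
    # Row-normalise so each row holds transition probabilities; a sentence
    # with no similar neighbours jumps uniformly to any sentence.
    safe_sums = np.where(row_sums > 0, row_sums, 1.0)
    transition = np.where(row_sums > 0, sim / safe_sums, 1.0 / n)
    scores = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        updated = (1 - damping) / n + damping * (scores @ transition)
        if np.abs(updated - scores).sum() < tol:
            return updated
        scores = updated
    return scores

# Made-up similarities: sentence 0 is close to both other sentences.
toy_sim = np.array([
    [0.0, 0.9, 0.8],
    [0.9, 0.0, 0.1],
    [0.8, 0.1, 0.0],
])
scores = pagerank_scores(toy_sim)
ranking = np.argsort(-scores)  # sentence indices, most important first
```

The scores form a probability distribution over sentences, and the sentence that is most similar to the rest of the text receives the largest share.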

In [10]:
summary = summarize(text)


# Evaluating the Summarisation

We evaluate the summary with the ROUGE (Recall-Oriented Understudy for Gisting Evaluation) score. The candidate is the summary generated from the input article, and the reference is the corresponding entry in the `highlights` column of the CNN Daily Mail dataset.
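
To make the metric concrete, the sketch below computes a unigram (ROUGE-1 style) recall, precision and F1 by hand. It is a simplified illustration under the assumption of whitespace tokenisation and clipped word counts; the `rouge` package used in this notebook may tokenise and count differently, so its numbers need not match this helper exactly. The function name `rouge_1` and the example sentences are made up.

```python
from collections import Counter

def rouge_1(candidate, reference):
    """Unigram overlap between a candidate and a reference text."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps, per word, the smaller of the two counts.
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"r": recall, "p": precision, "f": f1}

toy = rouge_1("the cat sat on the mat", "the cat lay on the mat")
```

Recall measures how much of the reference is covered by the candidate, while precision measures how much of the candidate is supported by the reference.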

In [11]:
!pip install rouge

Collecting rouge
  Downloading rouge-1.0.1-py3-none-any.whl (13 kB)
Installing collected packages: rouge
Successfully installed rouge-1.0.1

In [12]:
from rouge import Rouge

def calculate_rouge_score(cand, ref):
    # Score a candidate summary against its reference highlights.
    return Rouge().get_scores(cand, ref)

# The candidate is the TextRank summary of article 1; the reference is its highlights.
refs = train_df['highlights'][1]
scores = calculate_rouge_score(summary, refs)

print("@@CANDIDATE@@", summary)
print("@@REFERENCE@@", refs)

@@CANDIDATE@@ A criminal complaint unsealed in U.S. District Court in New Jersey Tuesday accuses Mata, also known as "The Milk Man," of using his role as a police officer to help the drug trafficking organization in exchange for money and gifts, including a Rolex watch. Outside the office, authorities allege that the 45-year-old longtime officer worked with a drug trafficking organization to help plan a murder plot and get guns. "Ultimately, the (organization) decided not to move forward with the murder plot, but Mata still received a payment for setting up the meetings," federal prosecutors said in a statement. Mata has worked for the Miami-Dade Police Department since 1992, including directing investigations in Miami Gardens and working as a lieutenant in the K-9 unit at Miami International Airport, according to the complaint. In one instance, the complaint alleges, Mata arranged to pay two assassins to kill rival drug dealers. It was not immediately clear whether Mata has an attorne

In [13]:
print("Scores for unigram comparison")
print("RECALL:", scores[0]['rouge-1']['r'])
print("PRECISION:", scores[0]['rouge-1']['p'])
print("F1-SCORE", scores[0]['rouge-1']['f'])
print()

print("Scores for bigram comparison")
print("RECALL:", scores[0]['rouge-2']['r'])
print("PRECISION:", scores[0]['rouge-2']['p'])
print("F1-SCORE", scores[0]['rouge-2']['f'])
print()

print("Scores for Longest Common Subsequence (LCS) Comparison")
print("RECALL:", scores[0]['rouge-l']['r'])
print("PRECISION:", scores[0]['rouge-l']['p'])
print("F1-SCORE", scores[0]['rouge-l']['f'])

Scores for unigram comparison
RECALL: 0.6666666666666666
PRECISION: 0.12571428571428572
F1-SCORE 0.21153845886880548

Scores for bigram comparison
RECALL: 0.3235294117647059
PRECISION: 0.04044117647058824
F1-SCORE 0.07189542286129272

Scores for Longest Common Subsequence (LCS) Comparison
RECALL: 0.6666666666666666
PRECISION: 0.12571428571428572
F1-SCORE 0.21153845886880548
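
The recall is much higher than the precision, which is plausibly because the summary keeps about two thirds of the article while the reference highlights are short. As a quick arithmetic check, the reported F1 scores can be recovered from the reported precision and recall as their harmonic mean. The helper name `f1_from` is hypothetical.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision and recall values copied from the output above.
rouge1_f1 = f1_from(0.12571428571428572, 0.6666666666666666)
rouge2_f1 = f1_from(0.04044117647058824, 0.3235294117647059)
```

Both values agree with the printed F1 scores to within a small rounding difference.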


# References

- https://www.analyticsvidhya.com/blog/2018/11/introduction-text-summarization-textrank-python/
- https://medium.com/analytics-vidhya/sentence-extraction-using-textrank-algorithm-7f5c8fd568cd
- https://www.tensorflow.org/datasets/catalog/cnn_dailymail
- https://www.kaggle.com/datasets/gowrishankarp/newspaper-text-summarization-cnn-dailymail
- https://towardsdatascience.com/introduction-to-text-summarization-with-rouge-scores-84140c64b471

# Human Evaluation of Text Detoxification and Summarization

## Inputs

In [14]:
test1 = "Why in the fucking world is chidamabaram wasting time in the CONG party meeting ?? Discuss and decide shit what the party need to do in order to sustain in india... or just wind-up and stay at home or join with BJP.."
test2 = "Neither of you guys has made any contribution to this Italian history article other than to shove your unhistorical unconstructive modern POV in my face. This is the reason why so many people get pissed off about the pedantry and idiocy and triviality of Wikipedia. Jesus. Get a fucking life."

## BERT Snippet

In [15]:
max_tokens = 64

# Map each word to its toxicity score in a single pass over the two columns.
wordtoxicities_dict = dict(zip(wordtoxicities_df['word'].tolist(), wordtoxicities_df['toxicity'].tolist()))
    
max_word_toxicity = np.max(wordtoxicities_df.toxicity.tolist())
min_word_toxicity = np.min(wordtoxicities_df.toxicity.tolist())
avg_word_toxicity = np.mean(wordtoxicities_df.toxicity.tolist())

def give_mask(s): 
    #assigning toxicity scores to tokens known from BoW Toxicity Classifier
    inp = tokenizer(s, return_tensors = 'pt', max_length = max_tokens, truncation = True, padding = 'max_length') 
    
    tokens = []
    for token in inp.input_ids[0]:
        tokens.append(''.join(tokenizer.decode(token).split()))
    
    token_scores = []
    
    for token in tokens:
        if token in wordtoxicities_dict:
            token_scores.append(wordtoxicities_dict[token])
        else:
            token_scores.append(avg_word_toxicity)        
    
    #finding words to be masked using an adaptive threshold
    min_threshold = 0.2
    
    s_array = np.array([s]) 
    s_series = pd.Series(s_array)
    s_vectorized = vectorizer.transform(s_series).todense()
    
    if(BoWClf.predict(s_vectorized)[0]):
        scores = np.zeros(max_tokens)
        scores[:len(token_scores)] = np.array(token_scores)
        scores = torch.from_numpy(scores)

        threshold = max(min_threshold, max(token_scores) / 2)

        mask = (scores > threshold) * (inp.input_ids[0] != 101) * (inp.input_ids[0] != 102) * (inp.input_ids[0] != 0)
    else:
        mask = False * (inp.input_ids[0] != 101) * (inp.input_ids[0] != 102) * (inp.input_ids[0] != 0)
    
    #applying mask (103 is the [MASK] token id; 101, 102 and 0 above are [CLS], [SEP] and [PAD])
    selection = torch.flatten(mask.nonzero()).tolist()
    inp.input_ids[0, selection] = 103
    
    return inp.input_ids[0]
        
def detoxify(s):
    s_masked = give_mask(s) #masking toxic comment
    
    #retrieving detoxified comment
    
    s_detoxified = []
    for token in s_masked:
        s_detoxified.append(''.join(tokenizer.decode(token).split()))

    mask_token_indices = torch.where(s_masked == tokenizer.mask_token_id)[0]

    softy = func.softmax(token_logits, dim = -1)
    mask_token_logits = softy[0, mask_token_indices, :]
    
    #rescoring on the basis of toxicities
    for i in range(len(mask_token_indices.tolist())):
        mask_tokens =  torch.sort(mask_token_logits, dim = 1, stable = True).indices[i].tolist()
        logits_tokens = torch.sort(mask_token_logits, dim = 1, stable = True).values[i].tolist()

        better_mask_tokens = []
        better_mask_token_logits = []
        for j in range(len(logits_tokens)):
            better_mask_tokens.append(mask_tokens[j])

            token = ''.join(tokenizer.decode(mask_tokens[j]).split())
            if token in wordtoxicities_dict:
                better_mask_token_logits.append(logits_tokens[j] / (max_word_toxicity - min_word_toxicity + wordtoxicities_dict[token]))
            else:
                better_mask_token_logits.append(logits_tokens[j] / (max_word_toxicity - min_word_toxicity + avg_word_toxicity))

        better_mask_tokens.sort(key = dict(zip(better_mask_tokens, better_mask_token_logits)).get)
        best_mask_token = better_mask_tokens[-1: ]

        s_detoxified[mask_token_indices[i]] = tokenizer.decode(best_mask_token)
        
    return ' '.join(word for word in s_detoxified[1:s_detoxified.index('[SEP]')]).replace(" ##", "")
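
The two numerical ideas in the cell above, the adaptive masking threshold in `give_mask` and the toxicity-based rescoring in `detoxify`, can be isolated in a small numpy sketch. The function names and all numbers below are made up for illustration, and the sketch leaves out the tokenizer, the BoW classifier and the BERT logits entirely.

```python
import numpy as np

def adaptive_mask(token_scores, is_special, min_threshold=0.2):
    """Mask tokens scoring above max(min_threshold, half the top score)."""
    scores = np.asarray(token_scores, dtype=float)
    threshold = max(min_threshold, scores.max() / 2)
    # Special tokens such as [CLS], [SEP] and [PAD] are never masked.
    return (scores > threshold) & ~np.asarray(is_special, dtype=bool)

def rescore(prob, toxicity, min_tox, max_tox):
    """Divide a replacement's probability by a toxicity-dependent term."""
    return prob / (max_tox - min_tox + toxicity)

# Made-up per-token toxicity scores for a six-token input.
toy_scores = [0.0, 0.05, 0.9, 0.5, 0.1, 0.0]
toy_special = [True, False, False, False, False, True]
mask = adaptive_mask(toy_scores, toy_special)

# Two replacement candidates with equal probability but different toxicity.
benign = rescore(prob=0.30, toxicity=-0.1, min_tox=-0.2, max_tox=0.9)
toxic = rescore(prob=0.30, toxicity=0.8, min_tox=-0.2, max_tox=0.9)
```

Tying the threshold to the highest score in the sentence means only the tokens that stand out get masked, and dividing by the toxicity term pushes the choice of replacement towards less toxic words when probabilities are comparable.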

## Outputs

In [16]:
summarized_text1 = summarize(test1)

detoxified_text1 = detoxify(test1)

summarized_detoxified_text1 = summarize(detoxified_text1)

print(test1)
print()
print(summarized_text1)
print()
print(summarized_detoxified_text1)


Why in the fucking world is chidamabaram wasting time in the CONG party meeting ?? Discuss and decide shit what the party need to do in order to sustain in india... or just wind-up and stay at home or join with BJP..

Why in the fucking world is chidamabaram wasting time in the CONG party meeting ??

discuss and decide to what the party need to do in order to sustain in india . or just wind - up and stay at home or join with bjp . why in the your world is chidamabaram wasting time in the cong party meeting ? ?



In [17]:
summarized_text2 = summarize(test2)

detoxified_text2 = detoxify(test2)

summarized_detoxified_text2 = summarize(detoxified_text2)

print(test2)
print()
print(summarized_text2)
print()
print(summarized_detoxified_text2)


Neither of you guys has made any contribution to this Italian history article other than to shove your unhistorical unconstructive modern POV in my face. This is the reason why so many people get pissed off about the pedantry and idiocy and triviality of Wikipedia. Jesus. Get a fucking life.

Get a fucking life. This is the reason why so many people get pissed off about the pedantry and idiocy and triviality of Wikipedia.

neither of you guys has made any contribution to this italian history article other than to somehow your unhistorical unconstructive modern pov in my believe .



### Note that, in the summarized texts post detoxification, the toxic words 'fucking' and 'shit' no longer appear!
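
This observation can also be checked programmatically. The sketch below is a simple whole-word search applied to the two summarized detoxified outputs printed above; the helper name `contains_any` is hypothetical and the strings are copied from the cell outputs.

```python
import re

def contains_any(text, words):
    """True if any of the given words occurs as a whole word in the text."""
    pattern = r"\b(" + "|".join(re.escape(w) for w in words) + r")\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None

flagged = ["fucking", "shit"]

# Summarized detoxified outputs copied from the cells above.
out1 = ("discuss and decide to what the party need to do in order to sustain in india . "
        "or just wind - up and stay at home or join with bjp . "
        "why in the your world is chidamabaram wasting time in the cong party meeting ? ?")
out2 = ("neither of you guys has made any contribution to this italian history article "
        "other than to somehow your unhistorical unconstructive modern pov in my believe .")

still_toxic = [contains_any(out, flagged) for out in (out1, out2)]
```

A word-list check like this only covers the words it is given, so it complements rather than replaces the human evaluation.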