We use the Microblog Opinion Summarisation (MOS) corpus described in https://arxiv.org/pdf/2208.04083.pdf .

In [113]:
!pip install datasets
!pip install sentence-transformers
!pip install transformers
!pip install -r rouge/requirements.txt
!pip install rouge-score

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

In [106]:
import json
import os
import numpy
import re
import shutil

from transformers import BartForConditionalGeneration, AutoModel, AutoTokenizer, DataCollatorForSeq2Seq 
from transformers import DistilBertForSequenceClassification, Trainer, TrainingArguments
from sentence_transformers import SentenceTransformer, util
from nltk.tokenize import word_tokenize
from rouge_score import rouge_scorer

from datasets import load_dataset


Most sequence-to-sequence models have an input limit of 512 or 1024 tokens, so a large cluster of tweets may not fit in full.
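
To illustrate the effect of such a limit, here is a minimal sketch that keeps tweets, in order, until a token budget is exhausted. The helper names `pack_to_budget` and `whitespace_count` are hypothetical, and whitespace splitting is only a stand-in for the model's own subword tokenizer, which would normally produce more tokens per tweet.

```python
from typing import Callable, List


def pack_to_budget(texts: List[str], budget: int,
                   count_tokens: Callable[[str], int]) -> List[str]:
    """Keep texts, in order, until adding the next one would exceed `budget` tokens."""
    kept, used = [], 0
    for text in texts:
        cost = count_tokens(text)
        if used + cost > budget:
            break
        kept.append(text)
        used += cost
    return kept


def whitespace_count(text: str) -> int:
    # Stand-in tokenizer (assumption): a real model uses its own subword tokenizer.
    return len(text.split())


tweets = ["oil pressure problem on track", "car stopped", "what a shame for the team"]
print(pack_to_budget(tweets, budget=8, count_tokens=whitespace_count))
```

Because tweets past the budget are dropped, the order in which they are presented to the model decides which ones survive.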

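We order the tweets within a cluster by relevance.
We use the sentence-transformers library from Hugging Face to embed the tweets in a vector space and then take each tweet's summed cosine distance to all tweets in the cluster as a proxy for its relevance: the smaller the distance (equivalently, the higher the mean cosine similarity), the more relevant the tweet.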

In [14]:
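def cluster_distance(tweet_emb, embeddings=None):
    # Sum of cosine distances between one tweet and every tweet in the cluster.
    # If no embeddings are passed, fall back to the notebook-level `tweet_embeddings`.
    if embeddings is None:
        embeddings = tweet_embeddings
    distances = [1 - float(util.pytorch_cos_sim(tweet_emb, other)) for other in embeddings]
    return sum(distances)

def cluster_distance2(tweet_emb, embeddings=None):
    # Same idea, using Euclidean distance instead of cosine distance.
    if embeddings is None:
        embeddings = tweet_embeddings
    distances = [numpy.linalg.norm(tweet_emb - other) for other in embeddings]
    return sum(distances)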
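
The relevance ranking can be checked by hand on toy vectors. The following is a minimal pure-Python sketch, not the notebook's implementation: `cosine_similarity` and `rank_by_mean_similarity` are hypothetical helpers, and the two-dimensional vectors stand in for real sentence embeddings. Ranking by highest mean cosine similarity gives the same order as ranking by lowest summed cosine distance.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_by_mean_similarity(embeddings):
    """Return indices sorted from most to least central in the cluster."""
    n = len(embeddings)
    mean_sims = [
        sum(cosine_similarity(embeddings[i], embeddings[j]) for j in range(n)) / n
        for i in range(n)
    ]
    return sorted(range(n), key=lambda i: mean_sims[i], reverse=True)


toy_embeddings = [
    [1.0, 0.0],   # points along the x-axis
    [0.9, 0.1],   # close to the first vector, tilted slightly towards the third
    [0.0, 1.0],   # outlier, orthogonal to the first vector
]
print(rank_by_mean_similarity(toy_embeddings))
```

The second vector ranks first because it is the only one with a non-zero similarity to both of the others, and the orthogonal outlier ranks last.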


In our preprocessing, we remove emojis and URL placeholders, and replace newline, carriage-return and tab characters with spaces.

In [80]:
#Preprocess tweets
def preprocess_(tweets):
    emoji_pattern = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        "]+", flags=re.UNICODE)
    
    tweets = [emoji_pattern.sub(r'', tweet) for tweet in tweets]
    tweets = [tweet.replace('\n', ' ') for tweet in tweets]
    tweets = [tweet.replace('\r', ' ') for tweet in tweets]
    tweets = [tweet.replace('\t', ' ') for tweet in tweets]
    tweets = [tweet.replace('URL_LINK','') for tweet in tweets]
   
    return tweets

tweet = 'Hello World!😄 \nWhat a great day to learn! Check this URL_LINK'
print(tweet)
print(preprocess_([tweet]))    

Hello World!😄 
What a great day to learn! Check this URL_LINK
['Hello World!  What a great day to learn! Check this ']


In [84]:

k = 7
model_st = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

def sort_relevance(cluster):
    # cluster_distance reads `tweet_embeddings` at notebook level, so expose it there.
    global tweet_embeddings
    pp_tweets = preprocess_(cluster['tweets'])
    tweet_embeddings = model_st.encode(pp_tweets)

    tweets = []
    for i, tweet in enumerate(pp_tweets):
        tweets.append({'tweet': tweet, 'score': cluster_distance(tweet_embeddings[i])})

    # Smallest total distance first, i.e. the most central tweet comes first.
    tweets = sorted(tweets, key=lambda x: x['score'])
    return [x['tweet'] for x in tweets]

with open('/Users/ibilal/Desktop/Research/Opinion_Text_Summarisation/testing_corpus/opi/2014-07-04-09_oil_0') as f:
    cluster = json.load(f)
print('The most relevant {} tweets in the cluster are:'.format(k), sort_relevance(cluster)[:k])    




The most relevant 7 tweets in the cluster are: ['.@Susie_Wolff has stopped on the track at Silverstone. She is reporting a problem with oil pressure  #bbcf1', ",@Susie_Wolff stopped on track after four laps with oil pressure problem. Here's Claire Williams on Susie ", "Susie #Wolff's car is stopped on the side of the track in the last sector, reporting lost oil pressure...#F1 #BritishGP", 'MT “@bbcf1: @Susie_Wolff stopped on the track at Silverstone. She is reporting problem with oil pressure  #bbcf1” :-(', "Gutted for @Susie_Wolff - car stopped out on track with oil pressure failure. 22 mins in she won't be able to run #F1 ", 'Just as I tweet its great to see out on track then @Susie_Wolff has oil pressure problems :(', "Susie Wolff has stopped on track. Reporting that it's an oil pressure issue. Looks like the end of her session with 1:08.36 to go. #f1"]


# Summary Generation

We use BART, a Transformer-based sequence-to-sequence model, for the summarisation task (paper: https://arxiv.org/pdf/1910.13461.pdf).
The model produces competitive results across many tasks and domains, and it is especially effective when fine-tuned.
We follow the implementation in the Hugging Face Transformers library.


In [82]:
# The final summary is the concatenation of its components:
# main story, majority opinion (if any) and minority opinions (if any).
def summary_concat(cluster):
    parts = [cluster['main_story']]
    for key in ('majority_opinion', 'minority_opinions'):
        if key in cluster:
            parts.append(cluster[key])
    # A summary is opinionated if it has at least one opinion component.
    opinionated = 1 if len(parts) > 1 else 0
    return ' '.join(parts), opinionated


partition_path = '/Users/ibilal/Desktop/Research/Code/mos_workspace/Partition_10'
len_opi, n_opi = 0, 0
len_nopi, n_nopi = 0, 0
for file in os.listdir(partition_path):
    file_path = os.path.join(partition_path, file)
    if not os.path.isfile(file_path):
        continue
    with open(file_path, 'r') as f:
        cluster = json.load(f)
    summary, opinionated = summary_concat(cluster)
    n_tokens = len(word_tokenize(summary))
    if opinionated:
        len_opi += n_tokens
        n_opi += 1
    else:
        len_nopi += n_tokens
        n_nopi += 1

# Average over the clusters actually read in the loop above.
avg_len_opi = len_opi / n_opi
avg_len_nopi = len_nopi / n_nopi

print('Opinionated summaries are {:2} tokens long on avg'.format(avg_len_opi))
print('Non-Opinionated summaries are {:2} tokens long on avg'.format(avg_len_nopi))




Opinionated summaries are 37.8859649122807 tokens long on avg
Non-Opinionated summaries are 18.41 tokens long on avg


In [87]:
model_name = 'facebook/bart-large-cnn'
model = BartForConditionalGeneration.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)


def bart_zero(cluster):
    original_text = ' '.join(sort_relevance(cluster))
    # truncation=True makes the 1024-token input limit explicit.
    inputs = tokenizer.encode(original_text, return_tensors="pt", max_length=1024, truncation=True)
    outputs = model.generate(inputs, max_length=46, min_length=30, length_penalty=2.0, num_beams=4, early_stopping=True, no_repeat_ngram_size=4)

    summary = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return summary
print(bart_zero(cluster))


Susie Wolff's Williams has come to a stop after four laps. Williams say the stoppage is not #Wolff's fault, she is reporting oil pressure problems. Massa has just hit the wall
31


# Model Fine-tuning

BART is fine-tuned for 5 epochs on the 10% partition of the MOS corpus.

In [None]:
def preprocess_function(examples):
    inputs = examples["text"]
    model_inputs = tokenizer(inputs, max_length=1024, truncation=True)

    labels = tokenizer(text_target=examples["summary"], max_length=128, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs


training_args = TrainingArguments(
    output_dir='/Users/ibilal/Desktop/Research/Code/mos_workspace/BART_model',          # output directory
    num_train_epochs=5,              # total number of training epochs
    per_device_train_batch_size=8,  # batch size per device during training
    per_device_eval_batch_size=8,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='/Users/ibilal/Desktop/Research/Code/mos_workspace/BART_model/logs',            # directory for storing logs
    logging_steps=10,
)


datafiles ='/Users/ibilal/Desktop/Research/Code/mos_workspace/partition10.json'
dataset = load_dataset('json', data_files=datafiles, split='train')
tokenized_dataset = dataset.map(preprocess_function,batched=False)
print(tokenized_dataset[0])
print(len(tokenized_dataset))

data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)

trainer = Trainer(
    model=model,                         # the instantiated Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=tokenized_dataset,         # training dataset
    tokenizer = tokenizer,
    data_collator=data_collator,
)

trainer.train()
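
The training file is read with `load_dataset('json', ...)` and `preprocess_function` expects each record to have a `text` and a `summary` field. How `partition10.json` was actually built is not shown, so the following is only a sketch under assumptions: `build_record` is a hypothetical helper, the tweets are joined in their given order, and the cluster keys mirror those used by `summary_concat`.

```python
import json


def build_record(cluster):
    """Turn one annotated cluster into a {'text', 'summary'} training record."""
    parts = [cluster["main_story"]]
    for key in ("majority_opinion", "minority_opinions"):
        if key in cluster:
            parts.append(cluster[key])
    return {"text": " ".join(cluster["tweets"]), "summary": " ".join(parts)}


# Toy cluster (made up for illustration, not taken from the corpus).
toy_cluster = {
    "tweets": ["Car stopped on track.", "Such a shame for the driver!"],
    "main_story": "A car stopped on track.",
    "majority_opinion": "Fans are disappointed.",
}
record = build_record(toy_cluster)
# One JSON object per line is a layout that the 'json' dataset loader can read.
print(json.dumps(record))
```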

# Summary Evaluation

Generated summaries are evaluated along two dimensions:

<b>Word-overlap</b>: ROUGE (Lin, 2004, https://aclanthology.org/W04-1013.pdf) is the most widely used metric in summarisation tasks. It calculates the n-gram overlap between a system-generated (candidate) summary and a reference summary. In our experiment, we report the F_1 score (the harmonic mean of precision and recall) for ROUGE-1 (unigrams), ROUGE-2 (bigrams) and ROUGE-L (longest common subsequence). By design, the metric is unable to detect semantic similarity (<i>happy kid</i> and <i>joyful child</i> are treated as disjoint and are penalised).

<b>Semantic Similarity</b>: BERTScore (Zhang et al., 2020, https://arxiv.org/pdf/1904.09675.pdf) is used in text generation tasks to measure the semantic similarity between a pair of texts. It relies on BERT contextual embeddings and greedy token matching, and it has been shown to correlate well with human judgement. It achieved competitive results in machine translation and image captioning thanks to its robustness to paraphrasing.
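
To make the word-overlap limitation concrete, here is a simplified ROUGE-1 F_1 sketch. It is not the `rouge_score` implementation: the hypothetical `rouge1_f1` helper lowercases and splits on whitespace, with no stemming or further tokenisation.

```python
from collections import Counter


def rouge1_f1(reference: str, candidate: str) -> float:
    """Unigram-overlap F1 between a reference and a candidate summary."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Clipped matches: each word counts at most as often as it occurs in both texts.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


paraphrase_score = rouge1_f1("happy kid", "joyful child")
partial_score = rouge1_f1("the car stopped on track", "the car stopped")
print(paraphrase_score, partial_score)
```

The paraphrase pair shares no unigrams and scores zero, while the partial copy is rewarded for its three matching words. A similarity computed on contextual embeddings is meant to close this gap.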

In [None]:
# ROUGE Evaluation

scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2','rougeL'], use_stemmer=True)
testing_set_path = '/Users/ibilal/Desktop/Research/Code/mos_workspace/testing_set'

for partition in ['nopi_covid','nopi_elections','opi_covid','opi_elections']:
    rouge1_score = {'medoid': 0, 'bart_zero':0, 'bart_ft':0}
    rouge2_score = {'medoid': 0, 'bart_zero':0, 'bart_ft':0}
    rougel_score = {'medoid': 0, 'bart_zero':0, 'bart_ft':0}
    bertscore_score = {'medoid': 0, 'bart_zero':0, 'bart_ft':0} 
    
    for file in os.listdir(os.path.join(testing_set_path,partition)):
        with open(os.path.join(testing_set_path,partition,file),'r') as f:
            cluster = json.load(f)
        gold_standard = summary_concat(cluster)[0]
        
        #Generate the summary candidates
        medoid_summary = sort_relevance(cluster)[0]
        bart_zero_summary = bart_zero(cluster)
        summaries = {'medoid':medoid_summary, 'bart_zero':bart_zero_summary}
        #For each summary type, compute its scores vs the human-written summary
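        for key in summaries.keys():
            # score(target, prediction): the human-written summary is the target.
            scores = scorer.score(gold_standard, summaries[key])

            # Accumulate the F1 (fmeasure) of each ROUGE variant.
            rouge1_score[key] += scores['rouge1'].fmeasure
            rouge2_score[key] += scores['rouge2'].fmeasure
            rougel_score[key] += scores['rougeL'].fmeasure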
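
The loop above accumulates per-cluster scores as running sums. To report a figure per partition, the sums still need to be divided by the number of evaluated clusters. A minimal sketch of that step follows, where `average_scores` is a hypothetical helper and the totals are toy numbers rather than results from the corpus.

```python
def average_scores(totals, n_items):
    """Divide each accumulated score by the number of evaluated clusters."""
    if n_items == 0:
        # Nothing was evaluated, so avoid dividing by zero.
        return {key: 0.0 for key in totals}
    return {key: total / n_items for key, total in totals.items()}


rouge1_totals = {"medoid": 1.2, "bart_zero": 1.8}  # toy sums over 4 clusters
rouge1_avg = average_scores(rouge1_totals, n_items=4)
print(rouge1_avg)
```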

        

