# Constructing models for the Task of Microblog Opinion Summarisation
----------

We will use the Microblog Opinion Summarisation (MOS) corpus described in https://arxiv.org/pdf/2208.04083.pdf .

# Table of contents

>[Constructing models for the Task of Microblog Opinion Summarisation](#scrollTo=PGGvuJqQkotL)

>>[Table of contents](#scrollTo=U_Dzawo3k2YQ)

>>[Prepare your environment](#scrollTo=dKE_mAtIk9u8)

>>[Construction of the gold standard summary](#scrollTo=TKXz8EbKOc9g)

>>[Summarisation Model and Fine-tuning](#scrollTo=JYEsFM58Oc9i)

>>[Generated Summary Evaluation](#scrollTo=RBlNZjywOc9k)

# Prepare your environment

In [None]:
!pip install datasets
!pip install sentence-transformers
!pip install transformers
!pip install -r rouge/requirements.txt
!pip install rouge-score
!pip install bert_score
!pip install tabulate

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

In [None]:
import json
import os
import numpy
import re
import shutil
import tabulate

from transformers import BartForConditionalGeneration, AutoModel, AutoTokenizer, DataCollatorForSeq2Seq 
from transformers import DistilBertForSequenceClassification, Trainer, TrainingArguments
from sentence_transformers import SentenceTransformer, util  # util provides pytorch_cos_sim, used below
from nltk.tokenize import word_tokenize
from rouge_score import rouge_scorer
from bert_score import score
from datasets import load_dataset


In [None]:
dir_path = '/mos_workspace/'

Most sequence-to-sequence models have an input limit of 512 or 1024 tokens, so the tweets of a large cluster may not all fit into the model input.

We therefore order the tweets within a cluster by relevance.
We use the sentence-transformers library from Hugging Face to embed the tweets in a vector space and then use cosine similarity as a proxy for a tweet's relevance within the cluster. This can be done in two ways:
- <i>Medoid</i>: For each tweet, compute the mean distance to all other tweets in the vector space. The tweet that minimises this distance is the medoid. As an alternative to cosine distance, one can use Euclidean distance.
- <i>Centroid</i>: Compute the centroid of the cluster as the mean of the tweet embeddings (the centroid usually does not correspond to an actual tweet). Then take the tweet closest to the centroid as its approximation.
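
The two selection strategies described above can be sketched on toy vectors. This is a minimal illustration using plain NumPy rather than the sentence-transformers utilities used in the notebook; the function names and the 2-D "embeddings" are made up for the example.

```python
import numpy as np

def cosine_distance_matrix(embeddings):
    """Pairwise cosine distances (1 - cosine similarity) between rows."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return 1.0 - unit @ unit.T

def medoid_index(embeddings):
    """Index of the row with the smallest mean cosine distance to all rows."""
    return int(np.argmin(cosine_distance_matrix(embeddings).mean(axis=1)))

def centroid_index(embeddings):
    """Index of the row closest (in cosine distance) to the mean embedding."""
    centroid = embeddings.mean(axis=0)
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    similarities = unit @ (centroid / np.linalg.norm(centroid))
    return int(np.argmax(similarities))

# Toy 2-D "embeddings": three vectors pointing in similar directions and one outlier.
toy = np.array([[1.0, 0.0], [0.9, 0.1], [0.8, 0.3], [0.0, 1.0]])
print('medoid index:', medoid_index(toy))
print('centroid approximation index:', centroid_index(toy))
```

On well-behaved clusters the two strategies often pick the same tweet; they can differ when a few outliers pull the centroid away from the bulk of the cluster.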

In [None]:
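# NOTE: cluster_distance and cluster_distance2 read `tweet_embeddings` (the embeddings of all
# tweets in the current cluster) from the module scope, so it must be defined before they are called.
# `util` is the sentence_transformers.util module.
def cluster_distance(tweet_emb):
    # Sum of cosine distances from one tweet to every tweet in the cluster
    distances = [1 - util.pytorch_cos_sim(tweet_emb, tweet_emb2).item() for tweet_emb2 in tweet_embeddings]
    return sum(distances)

def cluster_distance2(tweet_emb):
    # Euclidean alternative to cluster_distance
    distances = [numpy.linalg.norm(tweet_emb - tweet_emb2) for tweet_emb2 in tweet_embeddings]
    return sum(distances)

def centroid_distance(tweet_emb, centroid):
    # Cosine distance from one tweet to the cluster centroid
    return 1 - util.pytorch_cos_sim(tweet_emb, centroid).item()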



In our preprocessing, we remove emojis and URL placeholders and replace newline, carriage-return and tab characters with spaces.

In [None]:
#Preprocess tweets
def preprocess_(tweets):
    emoji_pattern = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        "]+", flags=re.UNICODE)
    
    tweets = [emoji_pattern.sub(r'', tweet) for tweet in tweets]
    tweets = [tweet.replace('\n', ' ') for tweet in tweets]
    tweets = [tweet.replace('\r', ' ') for tweet in tweets]
    tweets = [tweet.replace('\t', ' ') for tweet in tweets]
    tweets = [tweet.replace('URL_LINK','') for tweet in tweets]
   
    return tweets

tweet = 'Hello World!😄 \nWhat a great day to learn! Check this URL_LINK'
print(tweet)
print(preprocess_([tweet]))    

Hello World!😄 
What a great day to learn! Check this URL_LINK
['Hello World!  What a great day to learn! Check this ']


In [None]:

k = 7
model_st = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

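def sort_relevance(cluster):
    # cluster_distance reads the embeddings from a module-level variable,
    # so expose them globally before scoring
    global tweet_embeddings
    tweets = []
    pp_tweets = preprocess_(cluster['tweets'])
    tweet_embeddings = model_st.encode(pp_tweets)
    # Centroid of the cluster; only needed for the centroid-based alternative (centroid_distance)
    centroid = sum(tweet_embeddings)/len(tweet_embeddings)

    for i, tweet in enumerate(pp_tweets):
        # Medoid-style score: summed cosine distance to every tweet in the cluster
        tweets.append({'tweet':tweet, 'score': float(cluster_distance(tweet_embeddings[i]))})

    # Smallest total distance first, i.e. the most central (relevant) tweet comes first
    tweets = sorted(tweets, key=lambda x: x['score'])
    return [x['tweet'] for x in tweets]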

with open(dir_path + '/testing_set/opi_elections/2014-08-25-22_homeless_0') as f:
    cluster = json.load(f)

print('The most relevant {} tweets in the cluster are:'.format(k), sort_relevance(cluster)[:k])   








The most relevant 7 tweets in the cluster are: ['@DisruptedSkies cos apparently a homeless man  presented Miley with an award', "Also, if Miley hadn't won, dragging along the homeless bloke would have been a waste of time. I'm calling shenanigans. #VMA2014", "@elzmoyz she didn't accept her award for video of the year and instead go a homeless young man to accept it on behalf of homeless youths", '@Mushie_May instead of accepting the award herself her date was a homeless man she helped and he accepted the award in behalf of her charity', 'Miley was less Wrecking Ball and more bawling wreck when she had homeless Jesse accept her VMA award. Watch: ', "@chl0ehampson Yeah a homeless person collects her award for her doesn't he! And she cries", 'Nice of #vma2014 miley cyrus to bring along a publicity stunt... i mean homeless person.']
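
Because the model input is limited, the relevance ordering decides which tweets survive truncation. The sketch below makes that explicit by keeping tweets in order until a token budget is used up. It is only an illustration: `fit_to_budget` is a hypothetical helper, and whitespace splitting stands in for a real tokenizer, whose counts will differ. The notebook itself relies on the tokenizer's `max_length` rather than on an explicit step like this.

```python
def fit_to_budget(ordered_tweets, max_tokens, count_tokens=lambda t: len(t.split())):
    """Keep tweets in relevance order until the token budget is exhausted."""
    kept, used = [], 0
    for tweet in ordered_tweets:
        n_tokens = count_tokens(tweet)
        if used + n_tokens > max_tokens:
            break  # the next tweet would not fit entirely
        kept.append(tweet)
        used += n_tokens
    return kept

# Tweets are assumed to be sorted from most to least relevant.
example = ['a b c', 'd e', 'f g h i']
print(fit_to_budget(example, max_tokens=5))
```

Dropping whole tweets avoids feeding the model a tweet that is cut off mid-sentence, at the cost of leaving part of the budget unused.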


# Construction of the gold standard summary

Our chosen summary structure diverges from current summarisation approaches that reconstruct the “most popular opinion” (Bražinskas et al., 2020; Angelidis et al., 2021). Instead, we aim to showcase a spectrum of diverse opinions regarding the same event as this is a rich resource for practitioners. Thus, the summary template comprises
three components: 

* **Main Story**: serves to succinctly present the focus of the cluster (often an event)
* **Majority Opinion**: the opinion expressed by most posts in the cluster regarding the main story
* **Minority Opinion(s)**: opinions expressed by few posts in the cluster regarding the main story. 

For the purpose of our project, the gold standard summary is the concatenation of all components: main story, majority opinion (if any) and minority opinions (if any). 

We define a cluster with many opinions as *opinionated* and a cluster with mostly factual information (and few or no opinions) as *non-opinionated*. 


In [None]:

def summary_concat(cluster):
    # The main story is always present; the opinion components are optional
    parts = [cluster['main_story']]
    for key in ('majority_opinion', 'minority_opinions'):
        if key in cluster:
            parts.append(cluster[key])
    # A cluster counts as opinionated if it has at least one opinion component
    opinionated = 1 if len(parts) > 1 else 0
    return ' '.join(parts), opinionated


avg_len_opi = 0
avg_len_nopi = 0
n_opi = 0   # number of opinionated clusters read
n_nopi = 0  # number of non-opinionated clusters read
list_clusters = []
partition_path = os.path.join(dir_path, 'Partition_10')
for file in os.listdir(partition_path):
    # Skip sub-directories and hidden files
    if file.startswith('.') or not os.path.isfile(os.path.join(partition_path, file)):
        continue
    with open(os.path.join(partition_path, file), 'r') as f:
        cluster = json.load(f)

    summary, opinionated = summary_concat(cluster)
    list_clusters.append({'text': ' '.join(sort_relevance(cluster)), 'summary': summary})

    if opinionated == 0:
        avg_len_nopi += len(word_tokenize(summary))
        n_nopi += 1
    else:
        avg_len_opi += len(word_tokenize(summary))
        n_opi += 1

# Average over the clusters actually read, so the totals and the counts always match
avg_len_opi = avg_len_opi / max(n_opi, 1)
avg_len_nopi = avg_len_nopi / max(n_nopi, 1)

with open(dir_path + 'partition10_relevance.json','w') as g:
    for cluster in list_clusters:
        g.write(json.dumps(cluster)+'\n')


print('Opinionated summaries are {:2} tokens long on avg'.format(avg_len_opi))
print('Non-Opinionated summaries are {:2} tokens long on avg'.format(avg_len_nopi))




Opinionated summaries are 37.8859649122807 tokens long on avg
Non-Opinionated summaries are 18.41 tokens long on avg


# Summarisation Model and Fine-tuning

We use BART, a Transformer-based model, for the summarisation task. The paper can be found at https://arxiv.org/pdf/1910.13461.pdf .
The model produces competitive results across many tasks and domains and is particularly well suited to fine-tuning.
We follow the implementation in the Hugging Face Transformers library and use both the off-the-shelf version and a version fine-tuned for our task.

BART is fine-tuned for 5 epochs on the 10% partition of the original MOS corpus. The data is stored in the file *partition10_relevance.json*. The final model will be saved in the folder ./mos_workspace/BART_model.
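
The training file written earlier holds one JSON object per line, each with a `text` field (the relevance-ordered tweets) and a `summary` field (the gold standard). The sketch below reproduces that layout in memory and reads it back; the two records are placeholders, not corpus data.

```python
import io
import json

# Placeholder records with the same two fields as partition10_relevance.json
records = [
    {'text': 'first tweet second tweet', 'summary': 'A placeholder summary.'},
    {'text': 'third tweet', 'summary': 'Another placeholder summary.'},
]

# Write one JSON object per line (JSON Lines)
buffer = io.StringIO()
for record in records:
    buffer.write(json.dumps(record) + '\n')

# Read the lines back, skipping empty ones
buffer.seek(0)
loaded = [json.loads(line) for line in buffer if line.strip()]
print(loaded[0]['summary'])
```

This one-object-per-line layout is what lets the file be loaded with the `json` builder of the datasets library in the fine-tuning cell.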

Note that we use the <b>Trainer</b> class for ease of use. 'Trainer is a simple but feature-complete training and eval loop for PyTorch, optimized for 🤗 Transformers.' More details can be found at https://huggingface.co/docs/transformers/main_classes/trainer .



In [None]:
#Load the off-the-shelf BART model

model_name = 'facebook/bart-large-cnn'
model = BartForConditionalGeneration.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
#Set the length limits for each summary type: opinionated and non-opinionated clusters.
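# Each entry is [min_length, max_length] in tokens; opinionated gold summaries are longer on average
length_limits = {'opi': [26, 40], 'nopi': [12, 24]}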

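def bart_zero(cluster, max_length, min_length):
    original_text = ' '.join(sort_relevance(cluster))
    # Truncate explicitly so that long clusters do not exceed BART's 1024-token input limit
    inputs = tokenizer.encode(original_text, return_tensors="pt", max_length=1024, truncation=True)
    outputs = model.generate(inputs, max_length=max_length, min_length=min_length, length_penalty=2.0, num_beams=4, early_stopping=True, no_repeat_ngram_size=4)

    summary = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return summary

# Reload the example (opinionated) cluster, since `cluster` may have been overwritten by earlier cells
with open(dir_path + '/testing_set/opi_elections/2014-08-25-22_homeless_0') as f:
    cluster = json.load(f)
min_length, max_length = length_limits['opi']
print(bart_zero(cluster, max_length=max_length, min_length=min_length))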


In [None]:
def preprocess_function(examples):
    inputs = examples["text"]
    model_inputs = tokenizer(inputs, max_length=1024, truncation=True)

    labels = tokenizer(text_target=examples["summary"], max_length=128, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

Fine-tune the summarisation model on our dataset. Running this cell is optional: a fine-tuned version of the model can be found [here](https://drive.google.com/drive/folders/1cUrIDM4C6vVM1Rha0UJ9Pg6Ds33cUg33?usp=sharing). Download the entire folder the link points to and put it in your workspace folder.

In [None]:
training_args = TrainingArguments(
    output_dir= dir_path + 'BART_model',          # output directory
    num_train_epochs=5,              # total number of training epochs
    per_device_train_batch_size=8,  # batch size per device during training
    per_device_eval_batch_size=8,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir= dir_path + 'BART_model/logs',            # directory for storing logs
    logging_steps=10,
)


datafiles =dir_path + 'partition10_relevance.json'
dataset = load_dataset('json', data_files=datafiles, split='train')
tokenized_dataset = dataset.map(preprocess_function,batched=False)

data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)

trainer = Trainer(
    model=model,                         # the instantiated Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=tokenized_dataset,         # training dataset
    tokenizer = tokenizer,
    data_collator=data_collator,
)

trainer.train()
trainer.save_state()
trainer.save_model()
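
# A quick sanity check on the schedule above: compare the total number of optimizer steps
# with `warmup_steps`. This sketch assumes a single device and no gradient accumulation,
# and the figure of 214 training examples is a hypothetical placeholder; use
# len(tokenized_dataset) for the real value.

```python
import math

def total_optimizer_steps(n_examples, batch_size, epochs):
    """Optimizer steps for a single device without gradient accumulation."""
    steps_per_epoch = math.ceil(n_examples / batch_size)
    return steps_per_epoch * epochs

n_examples = 214  # hypothetical dataset size, for illustration only
total_steps = total_optimizer_steps(n_examples, batch_size=8, epochs=5)
warmup_steps = 500  # value used in TrainingArguments above

print('total steps:', total_steps)
if total_steps < warmup_steps:
    # With a warmup schedule, the learning rate would still be ramping up when training ends
    print('warmup is longer than the whole training run')
```

# If the check fires for the real dataset size, a smaller `warmup_steps` may be worth trying.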

In [None]:
model_name_ft = dir_path +'BART_model'
model_ft = BartForConditionalGeneration.from_pretrained(model_name_ft)
tokenizer_ft = AutoTokenizer.from_pretrained(model_name_ft)


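def bart_ft(cluster, max_length, min_length):
    original_text = ' '.join(sort_relevance(cluster))
    # Truncate explicitly so that long clusters do not exceed BART's 1024-token input limit
    inputs = tokenizer_ft.encode(original_text, return_tensors="pt", max_length=1024, truncation=True)
    outputs = model_ft.generate(inputs, max_length=max_length, min_length=min_length, length_penalty=2.0, num_beams=4, early_stopping=True, no_repeat_ngram_size=4)

    summary = tokenizer_ft.decode(outputs[0], skip_special_tokens=True)
    return summary

# Reload the example (opinionated) cluster, since `cluster` may have been overwritten by earlier cells
with open(dir_path + '/testing_set/opi_elections/2014-08-25-22_homeless_0') as f:
    cluster = json.load(f)
min_length, max_length = length_limits['opi']
print(bart_ft(cluster, max_length=max_length, min_length=min_length))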


Miley Cyrus asked a homeless man to accept Video of the Year at the VMA Awards on behalf of her charity The majority think it was a good idea for Miley to allow the homeless man to


# Generated Summary Evaluation

The generated summaries are evaluated along two dimensions:

1. <b>Word overlap</b>: ROUGE (Lin, 2004, https://aclanthology.org/W04-1013.pdf) is the most widely used metric in summarisation tasks. It calculates the n-gram overlap between the system (candidate) summary and the reference summary. In our experiment, we report the F_1 score (the harmonic mean of precision and recall) for ROUGE-1 (unigrams), ROUGE-2 (bigrams) and ROUGE-L (longest common subsequence). By design, the metric is unable to detect semantic similarity (<i>happy kid</i> and <i>joyful child</i> share no words and are therefore scored as a complete mismatch).

2. <b>Semantic similarity</b>: BERTScore (Zhang et al., 2020, https://arxiv.org/pdf/1904.09675.pdf) is used in text generation tasks to measure the semantic similarity between a pair of texts. It relies on BERT contextual embeddings and greedy token matching, and it has been shown to correlate well with human judgement. It achieved competitive results in machine translation and image captioning thanks to its robustness to paraphrasing. Note that its scores tend to fall within a narrow range, roughly [0.70, 1.00].
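
To make the word-overlap limitation concrete, here is a simplified ROUGE-N F_1 computed by hand. It is a sketch only: it lowercases and splits on whitespace, with no stemming, so its values will not match the rouge-score package exactly, and `rouge_n_f1` is a name chosen for this example.

```python
from collections import Counter

def rouge_n_f1(reference, candidate, n=1):
    """Simplified ROUGE-N F1: clipped n-gram overlap between two texts."""
    def ngrams(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    ref, cand = ngrams(reference), ngrams(candidate)
    overlap = sum((ref & cand).values())  # counts clipped to the smaller side
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# Paraphrases with no shared words get no credit at all
print(rouge_n_f1('happy kid', 'joyful child'))
# Partial word overlap is rewarded, for unigrams and for bigrams
print(rouge_n_f1('the happy kid', 'the happy child', n=1))
print(rouge_n_f1('the happy kid', 'the happy child', n=2))
```

An embedding-based metric such as BERTScore is meant to close this gap, since it compares token representations rather than surface forms.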

In [None]:
# ROUGE & BERTScore Evaluation

scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
testing_set_path = dir_path + 'testing_set'
models = ['medoid', 'bart_zero', 'bart_ft']

for partition in ['nopi_covid', 'nopi_elections', 'opi_covid', 'opi_elections']:
    rouge1_score = {key: 0.0 for key in models}
    rouge2_score = {key: 0.0 for key in models}
    rougel_score = {key: 0.0 for key in models}
    bertscore_score = {key: 0.0 for key in models}
    n_clusters = 0  # clusters actually scored (the original cell assumed 25 per partition)

    for file in os.listdir(os.path.join(testing_set_path, partition)):
        if file[0] == '.':
            continue  # skip hidden files
        with open(os.path.join(testing_set_path, partition, file), 'r') as f:
            cluster = json.load(f)
        n_clusters += 1

        gold_standard = summary_concat(cluster)[0]

        # Generate the summary candidates
        medoid_summary = sort_relevance(cluster)[0]

        min_length, max_length = length_limits[partition.split('_')[0]]
        # Pass the limits by keyword: both functions take max_length before min_length
        bart_zero_summary = bart_zero(cluster, max_length=max_length, min_length=min_length)
        bart_ft_summary = bart_ft(cluster, max_length=max_length, min_length=min_length)
        summaries = {'medoid': medoid_summary, 'bart_zero': bart_zero_summary, 'bart_ft': bart_ft_summary}
        print(summaries)

        # For each summary type, compute its scores vs the human-written summary
        for key in models:
            # RougeScorer.score takes (target, prediction)
            scores = scorer.score(gold_standard, summaries[key])
            rouge1_score[key] += scores['rouge1'].fmeasure
            rouge2_score[key] += scores['rouge2'].fmeasure
            rougel_score[key] += scores['rougeL'].fmeasure

            # bert_score.score takes the candidates first, then the references
            P, R, F1 = score([summaries[key]], [gold_standard], lang='en', verbose=False)
            bertscore_score[key] += F1.item()

    print('Results for partition: ', partition)
    n = max(n_clusters, 1)  # avoid division by zero for an empty partition
    data = [['Model', 'ROUGE-1', 'ROUGE-2', 'ROUGE-L', 'BERTScore']]
    for key in models:
        data.append([key, rouge1_score[key]/n, rouge2_score[key]/n, rougel_score[key]/n, bertscore_score[key]/n])
    table = tabulate.tabulate(data, headers='firstrow')
    print(table)
            
