## Model Initialization and Summary Generation

In this section of the project, we use the BART (Bidirectional and Auto-Regressive Transformers) model developed by Facebook AI. BART is designed for tasks that involve both understanding and generating natural language, which makes it well suited to summarization. It combines a bidirectional encoder (as in BERT) for understanding context with an auto-regressive decoder (as in GPT) for generating text, allowing it to model the dependencies and nuances in language. We apply it to generating accurate and coherent summaries of Albanian parliamentary speeches, in line with the project's goal of making political discourse more accessible and easier to understand.


## Data Loading and Preprocessing

The notebook begins by loading the necessary datasets:
- `Sampled_Alb_Speeches.xlsx` contains a subset of speeches that need summarization.
- `Summarized_Albanian_Speeches.xlsx` contains pre-generated summaries for a comparative study.

These datasets are merged based on the 'id' column to align each speech with its corresponding summary, facilitating a direct comparison and training process.

In [1]:
import pandas as pd

unlabeled_data_alb = pd.read_excel('Sampled_Alb_Speeches.xlsx')
sum_eng = pd.read_excel('Summarized_Albanian_Speeches.xlsx')

# Merge the datasets on the 'id' column
merged_data = pd.merge(unlabeled_data_alb[['id', 'text']], sum_eng[['id', 'summarization']], on='id')
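
The default inner merge silently drops any speech that has no matching summary, and it duplicates rows if an `id` repeats. A minimal sketch of how this could be checked, using toy frames in place of the Excel files (the toy rows and the names `speeches`, `summaries`, `check` and `unmatched` are illustrative and not part of the notebook):

```python
import pandas as pd

# Toy stand-ins for the two Excel files (values are made up for illustration)
speeches = pd.DataFrame({"id": ["a_1", "a_2", "a_3"], "text": ["t1", "t2", "t3"]})
summaries = pd.DataFrame({"id": ["a_1", "a_3"], "summarization": ["s1", "s3"]})

# validate="one_to_one" raises a MergeError if an id repeats on either side;
# indicator=True adds a _merge column recording where each row came from
check = pd.merge(speeches, summaries, on="id", how="left",
                 validate="one_to_one", indicator=True)

# Speeches that the default inner merge would drop
unmatched = check.loc[check["_merge"] == "left_only", "id"].tolist()
print("Speeches without a summary:", unmatched)
```

If `unmatched` is empty and no error is raised, the inner merge used in the notebook keeps every speech exactly once.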


In [9]:
data = pd.read_excel('labeled_data.xlsx')

In [10]:
data.head()

Unnamed: 0,id,text,summarization
0,2011-11-11_22,A jeni dakord ju shefat e grupeve parlamentare...,"Me 72 vota “pro”, 3 “kundër”, 2 “abstenime”, K..."
1,2014-05-07_34,"Unë e di që ka deputetë opozitarë, ka deputetë...","Kryeministri Thaçi i tha “po” Daqidit, atje. P..."
2,2021-10-19_188,"Faleminderit, kryetar! Komisioni për të Drejta...","Komisioni për të drejtat e njeriut, barazi gji..."
3,2008-11-06_240,"I nderuar deputet, unë të kuptoj se kemi nganj...",Të pranishëm janë 67 deputetë. 53 votuan “pro”...
4,2008-11-20_173,Ju faleminderit! Mendoj se deri më tani Kuvend...,"Kuvendi me 63 vota “pro”, asnjë kundër, 1 abst..."


In [11]:
from datasets import Dataset

# Convert the labeled DataFrame into a Hugging Face Dataset
dataset = Dataset.from_pandas(data)

## Model Setup and Summary Generation

For the summarization task, we utilize the `BART` model, renowned for its efficacy in sequence-to-sequence tasks:
- **Model and Tokenizer Initialization:** The BART model and its tokenizer are initialized. The tokenizer prepares the text data for processing by the model, handling tasks such as splitting text into tokens, generating tokens suitable for model input, and setting the maximum length for sequence truncation.
- **Summary Generation Function:** We define a custom function `generate_summary_test` to encapsulate the entire summarization process. This function manages text input, invoking the model to generate summaries, and then decoding the output to human-readable text.
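
The body of `generate_summary_test` is not shown in this notebook. The following is a hypothetical sketch of what such a function might look like, assuming a Hugging Face seq2seq model and tokenizer are passed in; the name `generate_summary_sketch` and its parameters are illustrative. The defaults `max_length=142` and `num_beams=4` mirror the generation parameters reported for this checkpoint when it is saved later in the notebook.

```python
def generate_summary_sketch(text, model, tokenizer,
                            max_input_tokens=1024, max_summary_tokens=142):
    """Hypothetical stand-in for generate_summary_test (not the author's code)."""
    # Tokenize the speech and truncate it to the encoder's input limit
    inputs = tokenizer(text, max_length=max_input_tokens,
                       truncation=True, return_tensors="pt")
    # Decode with beam search; num_beams=4 is an assumed setting
    output_ids = model.generate(
        inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_length=max_summary_tokens,
        num_beams=4,
    )
    # Convert the first generated sequence back to readable text
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

# Usage, once a model and tokenizer have been loaded:
# summary = generate_summary_sketch(speech_text, model, tokenizer)
```

Passing the model and tokenizer as arguments keeps the function independent of notebook globals, so the same code can be reused with the base checkpoint or the fine-tuned one.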

## Training

The processed dataset is split into training and testing subsets, providing a foundation for both training the model and evaluating its performance. We detail the setup for model training using Hugging Face’s `transformers` and `datasets` libraries, which include:
- **Tokenization:** Text data is converted into a format suitable for the model, ensuring that input lengths are managed and that the data fits model requirements.
- **Training Arguments Setup:** Parameters for training the BART model are specified, including learning rate, batch size, and the number of epochs.
- **Training Execution:** Utilizing `Seq2SeqTrainer`, the model undergoes training where it learns to generate summaries that are both concise and relevant to the input speeches.

In [12]:
# Splitting the dataset into 80% train and 20% validation
dataset = dataset.train_test_split(test_size=0.2)

In [13]:
dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'text', 'summarization'],
        num_rows: 440
    })
    test: Dataset({
        features: ['id', 'text', 'summarization'],
        num_rows: 110
    })
})

In [14]:
# Renaming columns if necessary
dataset = dataset.rename_column("summarization", "summary")


In [14]:
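from transformers import AutoModelForSeq2SeqLM

# Defined here because this cell runs before the tokenizer cell below
model_checkpoint = "facebook/bart-large-cnn"

model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)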


In [15]:

from transformers import AutoTokenizer

model_checkpoint = "facebook/bart-large-cnn"  # BART fine-tuned on CNN/DailyMail

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

def tokenize_function(examples):
    # Encode speeches, truncated and padded to BART's 1024-token input limit
    model_inputs = tokenizer(examples['text'], max_length=1024, truncation=True, padding="max_length")
    # Encode reference summaries as labels; text_target replaces the deprecated as_target_tokenizer()
    labels = tokenizer(text_target=examples['summary'], max_length=500, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_datasets = dataset.map(tokenize_function, batched=True)


Map:   0%|          | 0/440 [00:00<?, ? examples/s]

Map:   0%|          | 0/110 [00:00<?, ? examples/s]

In [16]:
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['id', 'text', 'summary', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 440
    })
    test: Dataset({
        features: ['id', 'text', 'summary', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 110
    })
})
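
The `labels` column is tokenized without padding, so label sequences differ in length from row to row. At batch time, `DataCollatorForSeq2Seq` pads them with `-100` by default, a value the cross-entropy loss ignores, so padding does not contribute to the training loss. The idea can be sketched in plain Python (the `pad_labels` helper and the token ids are made up for illustration; this is not the collator's actual code):

```python
def pad_labels(label_batch, pad_value=-100):
    """Right-pad each label sequence to the longest one in the batch."""
    longest = max(len(labels) for labels in label_batch)
    return [labels + [pad_value] * (longest - len(labels)) for labels in label_batch]

# Two made-up label sequences of different lengths
batch = [[0, 713, 16, 2], [0, 894, 2]]
padded = pad_labels(batch)
print(padded)  # [[0, 713, 16, 2], [0, 894, 2, -100]]
```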

In [17]:
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments, DataCollatorForSeq2Seq

# Training configuration: evaluate once per epoch and keep at most three checkpoints
training_args = Seq2SeqTrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=3,
)

# Data collator that dynamically pads the batches
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

# Initialize the Seq2SeqTrainer
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['test'],
    tokenizer=tokenizer,
    data_collator=data_collator
)

# Now you can train your model
trainer.train()



Epoch,Training Loss,Validation Loss
1,No log,0.64937
2,No log,0.630013
3,No log,0.627302


TrainOutput(global_step=330, training_loss=0.6130429816968513, metrics={'train_runtime': 3140.0419, 'train_samples_per_second': 0.42, 'train_steps_per_second': 0.105, 'total_flos': 2860578074787840.0, 'train_loss': 0.6130429816968513, 'epoch': 3.0})
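
The `No log` entries in the Training Loss column follow from the step count. Assuming a single device and no gradient accumulation, the arithmetic below reproduces `global_step=330`, which is below the library's default `logging_steps` of 500, so no training loss is logged before training ends:

```python
import math

train_rows = 440      # training split size from the DatasetDict output
batch_size = 4        # per_device_train_batch_size
epochs = 3            # num_train_epochs
logging_steps = 500   # library default; not set in the notebook

steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)   # 110 330
print(total_steps < logging_steps)    # True
```

Setting a smaller value such as `logging_steps=50`, or `logging_strategy="epoch"`, in `Seq2SeqTrainingArguments` would make the training loss appear in the table.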

In [19]:
model.save_pretrained('./results')
tokenizer.save_pretrained('./results')

Non-default generation parameters: {'max_length': 142, 'min_length': 56, 'early_stopping': True, 'num_beams': 4, 'length_penalty': 2.0, 'no_repeat_ngram_size': 3, 'forced_bos_token_id': 0, 'forced_eos_token_id': 2}


('./results\\tokenizer_config.json',
 './results\\special_tokens_map.json',
 './results\\vocab.json',
 './results\\merges.txt',
 './results\\added_tokens.json',
 './results\\tokenizer.json')

In [17]:
from transformers import pipeline, AutoTokenizer


# Load the model
tokenizer = AutoTokenizer.from_pretrained("./results")
summarizer = pipeline("summarization", model="./results", tokenizer=tokenizer)

# Example text
text_to_summarize = "A jeni dakord ju shefat e grupeve parlamentare për këtë? Dakord! Atëherë, shkojmë me procedurën e votimit. Regjia, le të përgatitet t’i votojmë rekomandimet e propozuara nga Grupi Parlamentar i Lidhjes Demokratike lidhur me Raportin e Progresit për Kosovën, për vitin 2011. Votojmë tash. Me 72 vota “për’, 3 “kundër” ’, 2 “abstenime”, Kuvendi miratoi rekomandimet e propozuara nga Grupi Parlamentar i Lidhjes Demokratike. Para se të vazhdojmë me rendin e ditës, në konsultë me kryetarët e grupeve parlamentare, pika 13 e rendit të ditës, për faktin se lënda është në Gjykatën Kushtetuese, që Kuvendi mos të bëjë interferim në këtë lëndë, shtyhet ose hiqet nga rendi i ditës për seancën e sotme. Komisioni, a e do fjalën? Jo. Atëherë, kërkohet një deklarim i seancës. Kush është për këtë propozim, me ngritje dore? Faleminderit! A ka kundër? A ka abstenim? Faleminderit! Me shumicë të votave, pika e fundit e rendit të ditës hiqet nga shqyrtimi. Në radhë kemi pikën dytë të rendit të ditës: 2. Koha për pyetjet parlamentare Në pajtim me nenin 45, pika 1 të Rregullores së Kuvendit, koha për pyetjet e deputetëve për Qeverinë është e kufizuar në 60 minuta. Deputetët në një mbledhje mund t’i parashtrojnë më së shumti dy pyetje parlamentare. Deputeti Shaip Muja nuk është këtu. Deputetja Aurora Bakalli, pyetje për ministrin Ferid Agani. Deputete, e ke fjalën!"

# Generate summary
summary = summarizer(text_to_summarize, max_length=500, min_length=50, do_sample=False)
print(summary[0]['summary_text'])


Kuvendi miratoi rekomandimet e propozuara nga Grupi Parlamentar i Lidhjes Demokratike lidhur me Raportin e Progresit për Kosovën, të vitit 2011. Me 72 vota “pro”, 3 “kundër” ’, 2 “abstenime” me shumicë të votave, pika e fundit e rendit të ditës hiqet nga shqyrtimi. Deputetët në një mbledhje mund të parashtrojnë më së shumti dy pyetje parlamentare.


## Summarization Evaluation

To assess the quality of the generated summaries, we use the ROUGE metric, which measures the n-gram overlap between the generated summaries and the reference summaries. In addition, cosine similarity between TF-IDF vectors of the generated and reference summaries gives a term-weighted measure of how closely their vocabulary matches, which serves as a rough proxy for how well the model captures the main points of the speeches.

This notebook not only walks through the practical steps of fine-tuning a summarization model but also provides a framework for evaluating its applicability to legislative proceedings.
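
To make the ROUGE-N numbers easier to interpret, here is a minimal count-based sketch of precision, recall and F1 over n-grams. It is an illustration only: the `rouge` package used below may tokenize and count n-grams differently, so its scores will not necessarily match this function. The helper names and the two example sentences are made up.

```python
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate, reference, n=1):
    """Precision, recall and F1 of n-gram overlap (whitespace tokenization)."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((cand & ref).values())  # matches, clipped to the smaller count
    p = overlap / max(sum(cand.values()), 1)
    r = overlap / max(sum(ref.values()), 1)
    f = 2 * p * r / (p + r) if p + r else 0.0
    return {"p": p, "r": r, "f": f}

candidate = "kuvendi miratoi rekomandimet e propozuara"
reference = "kuvendi miratoi rekomandimet me 72 vota"
print(rouge_n(candidate, reference, n=1))  # 3 of 5 candidate unigrams match
print(rouge_n(candidate, reference, n=2))  # 2 of 4 candidate bigrams match
```

Precision is the share of generated n-grams that appear in the reference, recall is the share of reference n-grams that the summary recovers, and F1 balances the two.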

In [1]:
!pip install rouge

Collecting rouge
  Downloading rouge-1.0.1-py3-none-any.whl.metadata (4.1 kB)
Downloading rouge-1.0.1-py3-none-any.whl (13 kB)
Installing collected packages: rouge
Successfully installed rouge-1.0.1


In [35]:
import pandas as pd

# Load the Excel file
data_truth = pd.read_excel('labeled_data.xlsx')

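# The data has columns 'id', 'text' (speech) and 'summarization' (reference summary)
# Select 50 random samples; note that they are drawn from the full labeled set,
# so some may overlap with the rows used for training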
sampled_data_truth = data_truth.sample(n=50, random_state=42)


In [36]:
sampled_data_truth.shape

(50, 3)

In [37]:
# Generate summaries
sampled_data_truth['GeneratedSummary'] = sampled_data_truth['text'].apply(
    lambda x: summarizer(x, max_length=500, min_length=50, do_sample=False)[0]['summary_text']
)


Your max_length is set to 500, but your input_length is only 491. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=245)
Your max_length is set to 500, but your input_length is only 471. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=235)
Your max_length is set to 500, but your input_length is only 483. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=241)
Your max_length is set to 500, but your input_length is only 433. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=2
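
The warnings above appear because a fixed `max_length=500` exceeds the token length of some inputs. One possible way to avoid them is to derive `max_length` per speech, using the same halving heuristic that the warning text suggests. The helper below is a hypothetical sketch; `pick_max_length`, `cap` and the halving rule are assumptions, and the input token count would come from something like `len(tokenizer(text)["input_ids"])`.

```python
def pick_max_length(input_tokens, cap=500, min_length=50):
    """Half the input length, capped at `cap` and kept above `min_length`."""
    half = input_tokens // 2
    return max(min_length + 1, min(cap, half))

# Input lengths taken from the warnings, plus one long input
for n in (491, 471, 483, 1024):
    print(n, pick_max_length(n))
```

The result could then be passed to the pipeline call as `max_length=pick_max_length(n)` in place of the constant.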

In [21]:
sampled_data_truth

Unnamed: 0,id,text,summarization,GeneratedSummary
195,2020-10-16_221,"Faleminderit! Të nderuar qytetarë, Të nderuar ...",Kosova ka një projektligj që do të kërkonte që...,Deputeti i Kosovës: Ky projektligj ka ardhur e...
79,2021-11-18_849,"Administrata, më informoni sa janë të pranishë...",Deputetët nuk po i përmbahen asaj që kërkohet ...,"76 deputetë i kemi të pranishëm, do të votojmë..."
480,2009-05-14_119,"Për informatën, për pozicionin e njeriut përgj...",Deputeti i Kosovës: “Besoj se po krijohet një ...,"“Shumë hapa që po ndodhin në këtë Kuvend, bëjn..."
109,2024-02-29_167,"Faleminderit, kryetar! Faleminderit deputet B...",MRI ka qenë një listë pritjeje 1 deri në 2 vje...,“Besoj edhe për qytetarët e Gjakovës instalimi...
522,2011-11-17_139,"Faleminderit, kryetar! Kjo Qeveri edhe në këtë...",Qeveria e Kosovës na e ka bërë të qartë se nuk...,Kryeministri: Kjo Qeveri e ka bërë të qartë se...
532,2018-04-27_133,"Faleminderit, zoti kryetar! Edhe unë pajtohem ...",Jam dakord që shumë deputetë kanë shkuar në pë...,“Nuk dua absolutisht të pajtohem me deputetët ...
84,2021-12-13_56,"Faleminderit, i nderuari kryetar i Kuvendit! I...","Ministria e Kulturës, Rinisë dhe Sportit ka bë...","Ministria e Kulturës, Rinisë dhe Sportit ka bë..."
368,2013-06-20_66,"Faleminderit, kryetar! Përshëndetje për minist...",Koalicioni për Kosovën e Re mbështet rekomandi...,Koalicioni për Kosovën e Re e mbështet rekoman...
132,2023-02-23_139,"Meqenëse, pyetja e deputetit është për pikën e...","Deputetët votuan pro, asnjë kundër dhe asnjë a...",Kuvendi ratifikoi marrëveshjen për njohjen e k...
364,2006-04-06_19,Ju faleminderit. Votojmë për këtë pikë të rend...,Shqyrtimi i dytë i projektligjit për inspektor...,Shqyrtimi i dytë i Projektligjit për inspektor...


In [38]:
from rouge import Rouge

rouge = Rouge()
scores = rouge.get_scores(
    sampled_data_truth['GeneratedSummary'].tolist(), 
    sampled_data_truth['summarization'].tolist(), 
    avg=True
)

print("ROUGE scores:", scores)

ROUGE scores: {'rouge-1': {'r': 0.49484378224486014, 'p': 0.4627724775496203, 'f': 0.47319500685953125}, 'rouge-2': {'r': 0.26604558518588844, 'p': 0.26183449173648266, 'f': 0.26002665494944616}, 'rouge-l': {'r': 0.4622348700844644, 'p': 0.4333370530793607, 'f': 0.442537774292043}}


In [39]:
generated_summaries = sampled_data_truth['GeneratedSummary'].tolist()
reference_summaries = sampled_data_truth['summarization'].tolist()


In [40]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize a TF-IDF Vectorizer
vectorizer = TfidfVectorizer()

# Concatenate both lists for vectorization
all_summaries = generated_summaries + reference_summaries

# Vectorize the summaries
tfidf_matrix = vectorizer.fit_transform(all_summaries)


In [41]:
from sklearn.metrics.pairwise import cosine_similarity

# Calculate cosine similarity
# Assuming the first half are generated and the second half are references
similarity_matrix = cosine_similarity(tfidf_matrix[:len(generated_summaries)], tfidf_matrix[len(generated_summaries):])

# Diagonal elements give the similarity scores between corresponding summaries
cosine_scores = similarity_matrix.diagonal()

print("Cosine Similarity Scores:", cosine_scores)

Cosine Similarity Scores: [0.43118996 0.29003732 0.45690902 0.30608323 0.51787621 0.49353997
 0.76014178 0.71481353 0.32968777 0.67598782 0.64319924 0.47972934
 0.35331836 0.57893594 0.33159021 0.45276003 0.63578242 0.39375678
 0.51048602 0.50148325 0.33725724 0.16648575 0.48276992 0.65399287
 0.34562904 0.38985048 0.50105118 0.55696018 0.50757495 0.55234078
 0.72352205 0.42689647 0.44009377 0.33873136 0.26108976 0.48449624
 0.63829434 0.43500718 0.72184326 0.56975248 0.23311903 0.68047117
 0.39291119 0.56394442 0.52926342 0.38040513 0.43174378 0.42675032
 0.31614873 0.30777409]
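
The cell above builds the full 50×50 similarity matrix and keeps only its diagonal, since only the score between each generated summary and its own reference is needed. The sketch below shows that pairwise computation directly. To stay dependency-free it uses raw term counts rather than TF-IDF weights, so its values will not match the notebook's scores; the `cosine` helper and the example sentences are illustrative.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between the term-count vectors of two texts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

generated = ["kuvendi miratoi ligjin", "deputetët votuan pro"]
references = ["kuvendi miratoi rekomandimet", "ministri u përgjigj"]

# One score per (generated, reference) pair, equivalent to the matrix diagonal
paired = [cosine(g, r) for g, r in zip(generated, references)]
print(paired)
```

A score of 0 means the two texts share no terms, and a score of 1 means their term vectors point in the same direction.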


In [42]:
import numpy as np

np.mean(cosine_scores)

0.47306957578878167

### Conclusion

The fine-tuned model has demonstrated promising results in summarizing Albanian parliamentary speeches. The evaluation of the model's performance yielded the following scores:

- **Average Cosine Similarity:** 0.4731
- **ROUGE Scores (F1, as printed in the evaluation output above):**
  - **ROUGE-1:** 0.4732 (measures the overlap of unigrams between the generated and reference summaries)
  - **ROUGE-2:** 0.2600 (measures the overlap of bigrams)
  - **ROUGE-L:** 0.4425 (measures the longest common subsequence, which is useful for evaluating sentence-level structure similarity)

These metrics indicate that the model is reasonably effective at capturing the gist and essential details of the speeches, at least in terms of lexical overlap with the reference summaries. There is still room for improvement, particularly in reproducing longer phrases and finer-grained relationships, as the lower ROUGE-2 score suggests. Future work could explore more advanced fine-tuning techniques or additional pre-processing steps to further improve summarization quality.
