## Comparative Analysis of Fine-Tuning vs. Multishot Prompting Techniques for Summarizing Albanian Parliamentary Speeches



## Introduction to the Project

This project investigates the application of Large Language Models (LLMs) in summarizing parliamentary speeches from the Kosovo Parliament, with a focus on speeches in Albanian. The primary goal is to identify the most efficient method for handling this task, considering both effectiveness and cost.

### Objectives:

1. **Fine-Tuning a Language Model:**
   We will fine-tune a pre-trained language model specifically for the task of summarization. This method, while potentially more accurate, is resource-intensive and costly. We aim to evaluate its performance in generating accurate summaries directly from Albanian texts.

2. **Exploring Prompt Engineering Techniques:**
   In contrast to fine-tuning, prompt engineering requires fewer resources. We will test different techniques to determine their effectiveness compared to fine-tuning:
   - **Zero-Shot Learning:** The model summarizes from an instruction alone, with no example summaries in the prompt.
   - **One-Shot Learning:** The prompt includes a single example of a speech and its summary to guide the model's response.
   - **Few-Shot (Multi-Shot) Learning:** The prompt includes several example summaries, potentially improving the model's summarization without any change to its weights.
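
The three prompting variants differ only in how many worked examples precede the speech to be summarized. The sketch below illustrates this under stated assumptions: the instruction wording, the `Speech:`/`Summary:` layout, and the helper name `build_prompt` are hypothetical and not taken from the project.

```python
from typing import Sequence, Tuple

# Hypothetical instruction text; the project's actual wording is not specified.
INSTRUCTION = "Summarize the following parliamentary speech in two or three sentences."

def build_prompt(speech: str, examples: Sequence[Tuple[str, str]] = ()) -> str:
    """Build a prompt with zero, one, or several (speech, summary) examples."""
    parts = [INSTRUCTION, ""]
    for example_speech, example_summary in examples:
        parts += [f"Speech: {example_speech}", f"Summary: {example_summary}", ""]
    # The prompt ends with an open "Summary:" for the model to complete.
    parts += [f"Speech: {speech}", "Summary:"]
    return "\n".join(parts)

# Placeholder texts, not taken from the dataset.
examples = [
    ("<example speech 1>", "<example summary 1>"),
    ("<example speech 2>", "<example summary 2>"),
]

zero_shot = build_prompt("<new speech>")               # no examples
one_shot = build_prompt("<new speech>", examples[:1])  # one example
few_shot = build_prompt("<new speech>", examples)      # several examples
```

Because the examples live in the prompt, longer few-shot prompts consume more of the model's input window, which is part of the cost trade-off being compared.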

### Comparison Objective:

The central aim is to compare these methodologies to discern which provides the best balance between cost and performance. This comparison will help establish whether the investment in fine-tuning is justified or if prompt engineering can achieve comparable results with less expenditure.

By exploring these methods, this project contributes to the computational linguistics field by demonstrating how advanced language models can efficiently process and summarize less-represented languages.



![Diagram](images/diagram.jpg "Kosovo Assembly Session")

## Data Preprocessing

### Creating Labeled Data

The primary challenge in our project is the absence of labeled data suitable for training a summarization model directly. As our dataset, the "Kosovo-Parliament-Transcriptions," is primarily unlabeled, our initial task involves generating this crucial labeled dataset.

#### Dataset Overview

The "Kosovo-Parliament-Transcriptions" dataset comprises transcripts of speeches delivered by members of the Kosovo Assembly during parliamentary sessions from 2001 onward. This extensive dataset serves as a foundation for research in natural language processing and political discourse analysis.

**Data Source:**
The transcripts were sourced from the official website of the [Kosovo Assembly](https://kuvendikosoves.org/), capturing both historical and recent parliamentary activities. The raw data were initially in PDF format and were converted to text using OCR technology. The text was subsequently cleaned to correct punctuation and spelling errors. However, users should be aware of potential residual errors due to the complexities of PDF-to-text conversion.

**Data Preparation:**
The dataset includes multiple languages, reflecting the multilingual nature of the Kosovo Assembly proceedings. To facilitate the processing for this project, additional steps will be taken:

- Conduct further quality assurance to rectify any remaining inconsistencies.
- Incorporate metadata such as the language of the speech and the political party of the speaker.

**Dataset Structure:**
- `text`: The transcript of the speech.
- `speaker`: The name of the speaker.
- `date`: The date of the speech.
- `id`: A unique identifier for each speech.
- `num_tokens`: The number of tokens in the speech.
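
As a rough illustration of this structure, one record could be modelled as shown below. This is only a sketch: the class name `Speech`, the helper `parse_row`, and the assumption that dates are ISO-formatted strings are not part of the dataset itself, and the text and speaker values are placeholders.

```python
import datetime
from dataclasses import dataclass

@dataclass
class Speech:
    text: str            # transcript of the speech
    speaker: str         # name of the speaker
    date: datetime.date  # date of the speech
    id: str              # unique identifier
    num_tokens: int      # token count shipped with the dataset

def parse_row(row: dict) -> Speech:
    """Convert one raw row (a plain dict) into a typed Speech record."""
    return Speech(
        text=str(row["text"]),
        speaker=str(row["speaker"]),
        # Assumes an ISO date string such as "2011-11-11".
        date=datetime.date.fromisoformat(str(row["date"])),
        id=str(row["id"]),
        num_tokens=int(row["num_tokens"]),
    )

row = {
    "text": "<speech text>",
    "speaker": "<speaker name>",
    "date": "2011-11-11",
    "id": "2011-11-11_22",
    "num_tokens": 575,
}
speech = parse_row(row)
```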


### Library Installation
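We need to install the `transformers` library to access pre-trained models and tokenizers for our NLP tasks. Later cells also import `pandas`, `langdetect`, and `nltk`, and pandas needs `openpyxl` to read and write `.xlsx` files, so we install those as well.


In [2]:
!pip install transformers pandas openpyxl langdetect nltk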



### Data Loading
Load the dataset that contains the transcriptions of the parliamentary speeches. We will filter these speeches to focus only on those in Albanian for our summarization task.


In [3]:
import pandas as pd
from transformers import BartTokenizer

# Load the tokenizer
model_name = 'facebook/bart-large-cnn'
tokenizer = BartTokenizer.from_pretrained(model_name)

# Load the dataset
data = pd.read_excel('Data/Kosovo-Parliament-Transcriptions.xlsx')


### Language Detection and Filtering
Since the dataset contains speeches in multiple languages, we detect the language of each speech and keep only the Albanian ones. This is crucial because our summarization model specifically targets Albanian-language text.


In [4]:
from langdetect import detect, DetectorFactory
import pandas as pd

# Ensure consistent results
DetectorFactory.seed = 0

In [5]:
# Function to safely detect language
def safe_detect(text):
    try:
        return detect(text)
    except Exception:
        # Detection can fail on empty or very short text; unlike a bare
        # `except:`, this does not swallow KeyboardInterrupt or SystemExit
        return "Error"


In [6]:
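def count_tokens(text):
    # Count tokens with the BART tokenizer loaded earlier; these counts can
    # differ from the dataset's own `num_tokens` column
    return len(tokenizer.tokenize(text))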

### Data Cleaning
Remove any rows with missing values in the 'text' or 'speaker' columns to ensure our dataset is clean before proceeding with further analysis.


In [8]:
# Print the initial state of null values
print("Before dropping NaN values:")
print(data.isnull().sum())

# Drop rows where the 'text' column is NaN
data = data.dropna(subset=['text'])

# Also drop rows with a missing 'speaker' so every retained speech has an attributed speaker
data = data.dropna(subset=['speaker'])

print("\nAfter dropping NaN values:")
print(data.isnull().sum())

data.to_excel('cleaned_data.xlsx', index=False)

Before dropping NaN values:
text                 112
speaker              835
date                   0
id                     0
num_tokens             0
Detected_Language      0
dtype: int64

After dropping NaN values:
text                 0
speaker              0
date                 0
id                   0
num_tokens           0
Detected_Language    0
dtype: int64


### Token Count and Speech Filtering
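Filter speeches by token count so that they are suitable for summarization: speeches that are too short offer little to summarize, and speeches that are too long cannot be handled efficiently by our model. We keep speeches with more than 200 and at most 1,024 tokens. The cell below detects each speech's language and counts its tokens; the thresholds are applied in the sampling step that follows.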


In [9]:
# Detect language and count tokens
data['Detected_Language'] = data['text'].apply(safe_detect)
data['token_count'] = data['text'].apply(count_tokens)

### Data Sampling
Randomly sample a subset of the filtered speeches to create a manageable dataset for model training and evaluation. Due to limited computing power, we sample 1,000 speeches from the dataset.


In [10]:
# Filter for only Albanian speeches that meet token requirements
filtered_speeches = data[(data['Detected_Language'] == 'sq') & 
                         (data['token_count'] > 200) & 
                         (data['token_count'] <= 1024)]

# Drop rows where text is NaN or speaker is missing, if necessary
filtered_speeches = filtered_speeches.dropna(subset=['text', 'speaker'])

# Randomly sample 1000 speeches from the filtered speeches
sampled_speeches = filtered_speeches.sample(n=1000, random_state=1)

# Save the sampled speeches to a new Excel file
sampled_speeches[['text', 'id']].to_excel('Sampled_Alb_Speeches.xlsx', index=False)
print("Data is sampled and the result is saved.")

Data is sampled and the result is saved.
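
The filter-then-sample logic can also be sketched without pandas. The version below is a simplified illustration, not the notebook's implementation: `filter_and_sample` is a hypothetical helper, the `lang` key stands in for `Detected_Language`, and a whitespace split stands in for the BART tokenizer. It caps the sample size at the number of eligible speeches, whereas `DataFrame.sample(n=1000)` raises an error when fewer than 1,000 rows pass the filter.

```python
import random
from typing import Callable, Dict, List

def filter_and_sample(
    speeches: List[Dict],
    count_tokens: Callable[[str], int],
    min_tokens: int = 200,
    max_tokens: int = 1024,
    n: int = 1000,
    seed: int = 1,
) -> List[Dict]:
    """Keep Albanian speeches inside the token window, then sample up to n."""
    kept = [
        s for s in speeches
        if s["lang"] == "sq" and min_tokens < count_tokens(s["text"]) <= max_tokens
    ]
    rng = random.Random(seed)  # fixed seed for a reproducible sample
    # Cap n so that small inputs do not raise an error.
    return rng.sample(kept, min(n, len(kept)))

# Toy data; the repeated word is a placeholder for real speech text.
toy = [
    {"id": "a", "lang": "sq", "text": "fjala " * 300},
    {"id": "b", "lang": "sq", "text": "fjala " * 50},    # too short
    {"id": "c", "lang": "sr", "text": "word " * 300},    # not Albanian
    {"id": "d", "lang": "sq", "text": "fjala " * 2000},  # too long
]
sample = filter_and_sample(toy, lambda t: len(t.split()), n=2)
print([s["id"] for s in sample])  # prints ['a']
```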


In [11]:
sampled_speeches.head()

| | text | speaker | date | id | num_tokens | Detected_Language | token_count |
|---|---|---|---|---|---|---|---|
| 93774 | A jeni dakord ju shefat e grupeve parlamentare... | KRYESUESI-JA | 2011-11-11 | 2011-11-11_22 | 575 | sq | 657 |
| 68250 | Unë e di që ka deputetë opozitarë, ka deputetë... | ALBIN KURTI | 2014-05-07 | 2014-05-07_34 | 501 | sq | 563 |
| 20032 | Faleminderit, kryetar! Komisioni për të Drejta... | FJOLLA UJKANI | 2021-10-19 | 2021-10-19_188 | 497 | sq | 576 |
| 128476 | Ju faleminderit z. kryetar. Sikur të kishte qe... | ARDIAN GJINI | 2006-06-29 | 2006-06-29_94 | 228 | sq | 258 |
| 117464 | I nderuar deputet, unë të kuptoj se kemi nganj... | KRYESUESI-JA | 2008-11-06 | 2008-11-06_240 | 701 | sq | 801 |


## Translation

After creating a sampled dataset, our next step involves translating the Albanian speeches into English. This translation is crucial for several reasons:

- **Model Compatibility:** Most pre-trained models, especially those in the domain of summarization, are optimized for English. By translating our dataset into English, we can leverage these advanced models more effectively.

- **Quality of Summarization:** Accurate summarization depends significantly on the quality of the input data. English, being a widely supported language in NLP tools, ensures that we have access to robust tools and models that can generate reliable summaries.

- **Project Requirement:** Our goal of creating a labeled dataset depends on high-quality reference summaries. Translating the speeches into English lets us generate them with state-of-the-art summarization models, which are predominantly trained on English datasets.

This translation step is integral to preparing our data for the subsequent summarization phase, where we will feed the translated speeches into a pre-trained summarization model.


In [35]:
english_speaches = pd.read_excel('Data/English_Translated_Speaches.xlsx')
english_speaches.head()

| | text | id |
|---|---|---|
| 0 | Yes, now I have a clearer situation. I thank ... | 2013-12-19_64 |
| 1 | Deputy Suzan Novobërdali is speaking on behal... | 2012-08-31_153 |
| 2 | The minister is speaking once again. | 2020-11-12_71 |
| 3 | Mr President! We also support this bill. | 2013-07-25_138 |
| 4 | On behalf of the SLS Parliamentary Group, doe... | 2012-03-29_37 |


In [50]:
# Load the datasets

df_google = pd.read_excel('Data/English_Translated_Speaches.xlsx')
df_huggingface = pd.read_excel('OLD/Filtered_Translated_Sampled_Alb_Speeches.xlsx')

df_google.columns = df_google.columns.str.strip()
df_huggingface.columns = df_huggingface.columns.str.strip()

# Display the column names of each dataframe
print("Google Translate Dataset Columns:")
print(df_google.columns)

print("Hugging Face Model Dataset Columns:")
print(df_huggingface.columns)


Google Translate Dataset Columns:
Index(['text', 'id'], dtype='object')
Hugging Face Model Dataset Columns:
Index(['text', 'speaker', 'date', 'id', 'num_tokens', 'Detected_Language',
       'Translated_Speech', 'Token_Count'],
      dtype='object')


In [58]:
# Merge the dataframes on 'id' to keep only the rows present in both datasets
df_merged = pd.merge(df_google, df_huggingface, on='id', suffixes=('_google', '_huggingface'))
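
Since `pd.merge` performs an inner join by default, any speech whose `id` appears in only one of the two files is silently dropped. A quick check of the overlap, sketched here with plain sets, can show how many rows are lost. The helper name `id_overlap` is hypothetical, and the way the example ids are split between the two lists is a toy arrangement, not the real contents of the files.

```python
def id_overlap(ids_left, ids_right):
    """Report which ids an inner join keeps and which it drops."""
    left, right = set(ids_left), set(ids_right)
    return {
        "both": sorted(left & right),        # kept by the inner join
        "left_only": sorted(left - right),   # dropped: missing on the right
        "right_only": sorted(right - left),  # dropped: missing on the left
    }

# Toy id lists for illustration only.
report = id_overlap(
    ["2013-12-19_64", "2012-08-31_153", "2020-11-12_71"],
    ["2012-08-31_153", "2020-11-12_71", "2011-11-11_22"],
)
print(len(report["both"]), "ids survive the merge")
```

With the real data, the inputs would be `df_google['id']` and `df_huggingface['id']`.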



In [52]:
# Select a few speeches for simplicity
df_sample = df_merged.head(6)

# Create a new DataFrame for comparison
comparison_df = pd.DataFrame({
    'ID': df_sample['id'],
    'Original': df_sample['text_huggingface'],  # Albanian source text; 'text_google' holds the Google translation, not the original
    'Google_Translate': df_sample['text_google'],
    'Hugging_Face': df_sample['Translated_Speech']
})

# Save this DataFrame to a new Excel file for manual ground truth addition
output_file_path = r'C:\Users\nderi\OneDrive\Desktop\anakondabina\sample_speeches_for_manual_translation.xlsx'
comparison_df.to_excel(output_file_path, index=False)

print(f"Sample speeches dataset saved to {output_file_path}")


Sample speeches dataset saved to C:\Users\nderi\OneDrive\Desktop\anakondabina\sample_speeches_for_manual_translation.xlsx


In [57]:
# Load the updated dataset with manual translations.
# Note: the file was saved to an absolute path earlier; this relative path
# assumes the manually edited copy sits in the working directory.
df_ground_truth = pd.read_excel('sample_speeches_for_manual_translation.xlsx')

# Strip stray whitespace from the column names
df_ground_truth.columns = df_ground_truth.columns.str.strip()

# Calculate BLEU scores to compare translations with ground truth
from nltk.translate.bleu_score import sentence_bleu

def calculate_bleu(reference, candidate):
    # Whitespace tokenization; sentence_bleu expects a list of reference token lists
    reference_tokens = reference.split()
    candidate_tokens = candidate.split()
    return sentence_bleu([reference_tokens], candidate_tokens)

# Calculate BLEU scores for each row. The 'Original' column is used as the
# reference, so it must hold the manual English translation at this point.
df_ground_truth['BLEU_Google'] = df_ground_truth.apply(lambda row: calculate_bleu(row['Original'], row['Google_Translate']), axis=1)
df_ground_truth['BLEU_Hugging_Face'] = df_ground_truth.apply(lambda row: calculate_bleu(row['Original'], row['Hugging_Face']), axis=1)

# Calculate average BLEU scores
average_bleu_google = df_ground_truth['BLEU_Google'].mean()
average_bleu_hugging_face = df_ground_truth['BLEU_Hugging_Face'].mean()

print(f"Average BLEU score for Google Translate: {average_bleu_google:.2f}")
print(f"Average BLEU score for Hugging Face model: {average_bleu_hugging_face:.2f}")


Average BLEU score for Google Translate: 0.45
Average BLEU score for Hugging Face model: 0.30


### Evaluation of Translations

Translation quality affects every later step of the project, so we compared the two translation systems using the **BLEU score metric** (on a scale from 0 to 1, higher is better). The scores were computed against manual reference translations for a small sample of six speeches. The results of our analysis are as follows:

- **Google Translate**: Average BLEU score of **0.45**
- **Helsinki-NLP opus-mt (Hugging Face model)**: Average BLEU score of **0.30**

Based on these scores, **Google Translate outperformed** the Helsinki-NLP opus-mt model on this sample. Additionally, a **manual review of the translations** confirmed that Google Translate provided higher-quality and more accurate translations. Given these findings, we decided to **proceed with Google Translate** for the subsequent steps of our project.
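
To make the metric concrete, the following is a minimal from-scratch sketch of sentence-level BLEU with uniform weights over 1- to 4-grams and no smoothing. It is an illustration only, not the NLTK implementation used in this notebook: `ngrams` and `simple_bleu` are hypothetical names, the candidate sentence is invented, and values may differ from `sentence_bleu` in edge cases such as missing higher-order matches.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def simple_bleu(reference: str, candidate: str, max_n: int = 4) -> float:
    ref, cand = reference.split(), candidate.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if overlap == 0 or total == 0:
            return 0.0  # no smoothing: one empty n-gram order zeroes the score
        log_precisions.append(math.log(overlap / total))
    # The brevity penalty lowers the score of candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    # Geometric mean of the n-gram precisions, scaled by the brevity penalty.
    return bp * math.exp(sum(log_precisions) / max_n)

reference = "the minister is speaking once again"
perfect = simple_bleu(reference, reference)                        # identical text
partial = simple_bleu(reference, "the minister is speaking again")  # one word missing
none = simple_bleu(reference, "we support this bill")               # no shared words
```

An identical translation scores 1, a translation sharing no words with the reference scores 0, and partial matches fall in between. Scores for short sentences are sensitive to single missing n-grams, which is one reason to read averages over a handful of speeches with caution.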
