## Summarization Model Setup and Execution

In this section, we set up the summarization of parliamentary speeches using the BART model from Hugging Face's `transformers` library. We use the `facebook/bart-large-cnn` checkpoint to generate concise summaries of long texts, which suits legislative debates where the main points need to be distilled efficiently.

### Importing Libraries

We start by importing the necessary Python libraries: the BART tokenizer and model from `transformers`, `tqdm` for progress bars, and `pandas` for data handling.

#### Model Reference

The summarization task uses BART, which stands for Bidirectional and Auto-Regressive Transformers. BART is a sequence-to-sequence model designed for natural language generation, translation, and comprehension tasks. For the detailed methodology, refer to the original paper:

Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mohamed, A., Levy, O., ... & Zettlemoyer, L. (2019). BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. arXiv preprint arXiv:1910.13461. Available at [http://arxiv.org/abs/1910.13461](http://arxiv.org/abs/1910.13461).

We use this pre-trained model with the aim of producing coherent, factually faithful summaries of the Albanian parliamentary speeches, working from their English translations.


In [20]:
from transformers import BartForConditionalGeneration, BartTokenizer

model_name = 'facebook/bart-large-cnn'
tokenizer = BartTokenizer.from_pretrained(model_name)
model = BartForConditionalGeneration.from_pretrained(model_name)

In [21]:
from tqdm import tqdm  # Instead of from tqdm.auto import tqdm
import pandas as pd

### Testing the Model on a Single Example

In [22]:
def generate_summary_test(text):
    inputs = tokenizer([text], max_length=1024, return_tensors='pt', truncation=True)
    summary_ids = model.generate(inputs['input_ids'], max_length=500, min_length=50, num_beams=4, early_stopping=True)
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return summary

text = "Thank you, Mr. Mayor. The minister cleared it up, partly. The dilemma still remains in the government's decision, and this decision must be made. The most serious decision he will make is whether the law will be implemented as the sponsor has proposed, which should begin to be implemented since the last quarter of this year. So it's got to be the mixers. An estimate has been made there for the last quarter of 2007. It means, this has to be decided today, even in principle. Let the Parliamentary Commission not be allowed that decision but, in principle, to make that decision, not be accepted by the Government's decision. Accept the proposal from the Ministry. Let's start implementation after the adoption, only on that condition. If we want to be serious in the Assembly, we have to approve that proposal. This decision will automatically not be issued because it is proposed that implementation of the bill begin in 2011, according to the Government's proposal, the prime minister. It means to proceed for approval in the Kosovo Assembly. If we don't approve of this decision, then we can go in principle and be serious. If the 2011 clause remains, let another government approve."
print(generate_summary_test(text))


The most serious decision he will make is whether the law will be implemented as the sponsor has proposed. An estimate has been made there for the last quarter of 2007. It means, this has to be decided today, even in principle. If we want to be serious in the Assembly, we have to approve that proposal.
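The tokenizer call above uses `max_length=1024` with `truncation=True`, so any tokens beyond the first 1024 are dropped before generation and the tail of a very long speech never reaches the model. One possible workaround, which is not part of the original notebook, is to split the token ids into overlapping windows and summarize each window separately. The sketch below shows only the windowing step in plain Python; `split_into_windows`, `window`, and `overlap` are hypothetical names chosen for illustration, and the ids are stand-ins for real tokenizer output.

```python
def split_into_windows(token_ids, window=1024, overlap=64):
    """Split a list of token ids into overlapping windows of at most `window` ids.

    Hypothetical helper, not from the original notebook.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")

    step = window - overlap  # how far the window start advances each time
    windows = []
    start = 0
    while start < len(token_ids):
        windows.append(token_ids[start:start + window])
        if start + window >= len(token_ids):
            break  # this window already reaches the end of the input
        start += step
    return windows


# Stand-in ids; in practice they would come from the tokenizer.
demo_ids = list(range(2500))
demo_windows = split_into_windows(demo_ids, window=1024, overlap=64)
print([len(w) for w in demo_windows])
```

In practice the window size would likely need to be slightly smaller than the model limit to leave room for the special tokens the tokenizer adds, and the per-window summaries would still have to be combined in some way.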


### Summarization Function

The function `generate_summary` encodes a text into model-readable input ids, generates a summary, and decodes it back into readable text. It catches any exception raised along the way and returns an empty string, so the pipeline can handle unexpected inputs gracefully.

In [28]:
def generate_summary(text):
    try:
        # Encode the text into input ids and truncate if necessary without adding any prefix
        inputs = tokenizer.encode(text, return_tensors='pt', max_length=1024, truncation=True)
        # Generate summary ids with constraints
        summary_ids = model.generate(inputs, max_length=500, min_length=50, num_beams=4, early_stopping=True)
        # Decode the generated ids to text
        summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
        return summary
    except Exception as e:
        print(f"Error summarizing text: {e}")
        return ""
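Because `generate_summary` returns an empty string on failure, a failed row looks the same in the output as a row whose summary is genuinely empty. A minimal sketch of one way to record which rows failed is shown below. It is not the notebook's method: `summarize_all` and `fake_summarizer` are hypothetical, and the stand-in summarizer replaces the BART model so the example runs without downloading it. The `NaN` entry mimics how pandas represents an empty Excel cell.

```python
def summarize_all(texts, summarizer):
    """Apply `summarizer` to each text, recording failures instead of hiding them.

    Returns (summaries, failures), where failures maps row index -> error message.
    Hypothetical helper, not from the original notebook.
    """
    summaries = []
    failures = {}
    for index, text in enumerate(texts):
        try:
            summaries.append(summarizer(text))
        except Exception as exc:  # same broad catch as generate_summary
            summaries.append("")
            failures[index] = f"{type(exc).__name__}: {exc}"
    return summaries, failures


def fake_summarizer(text):
    """Stand-in for the model: keeps the first sentence, rejects non-strings."""
    if not isinstance(text, str):
        raise TypeError("expected a string")
    return text.split(".")[0].strip() + "."


demo_summaries, demo_failures = summarize_all(
    ["First point. Second point.", float("nan"), "Only one sentence."],
    fake_summarizer,
)
print(demo_summaries)
print(demo_failures)
```

Keeping the failures in a dictionary makes it possible to inspect or retry only the affected rows after a long run.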

In [34]:
data = pd.read_excel('Data/English_Translated_Speeches.xlsx')
# For a single column
print(data['text'].dtype)

# For multiple specific columns
print(data[['text']].dtypes)  # Notice the double brackets for DataFrame slice


object
text    object
dtype: object


### Data Loading and Processing
We load our dataset containing English-translated speeches. After ensuring columns are correctly named (stripping any extra whitespace), we apply the summarization function across our text data.

In [42]:
def summarize_dataset(file_path):
    # Load the dataset
    df = pd.read_excel(file_path)

    # Rename the columns to remove any leading or trailing spaces
    df.columns = df.columns.str.strip()

    print("Columns in DataFrame after renaming:", df.columns)

    # Check for the 'id' column before the long summarization run, so a missing
    # column is reported immediately instead of after hours of computation
    if 'id' not in df.columns:
        return "Warning: 'id' column not found. Make sure your Excel file has an 'id' column."

    # Initialize progress bar
    tqdm.pandas(desc="Summarizing speeches")

    # Apply the summary generation function to the 'text' column
    df['Summarized_Speech'] = df['text'].progress_apply(generate_summary)

    # Save the ids and summaries to a new Excel file
    output_df = df[['id', 'Summarized_Speech']]
    output_df.to_excel('Summarized_English_Speeches.xlsx', index=False)
    return "Summarization complete and saved."

# Path to the Excel file
file_path = 'Data/English_Translated_Speeches.xlsx'
print(summarize_dataset(file_path))


Columns in DataFrame after renaming: Index(['text', 'id'], dtype='object')


Summarizing speeches: 100%|██████████| 1000/1000 [2:54:38<00:00, 10.48s/it] 

Summarization complete and saved.





The call at the end of the cell above runs the summarization pipeline on the specified file and prints the completion status.
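The progress bar above reports 2:54:38 for 1000 speeches. As a quick check of the per-speech figure, and as a rough way to project runtimes for other dataset sizes, the arithmetic can be sketched as follows. The helper name `seconds_per_item` and the 5000-speech projection are illustrative assumptions, and the projection only holds for speeches of similar length on the same hardware.

```python
def seconds_per_item(elapsed, items):
    """Convert an 'H:MM:SS' elapsed time into average seconds per item."""
    hours, minutes, seconds = (int(part) for part in elapsed.split(":"))
    total_seconds = hours * 3600 + minutes * 60 + seconds
    return total_seconds / items


# Figures taken from the progress bar output above.
rate = seconds_per_item("2:54:38", 1000)
print(f"{rate:.2f} s per speech")

# Hypothetical projection for a larger dataset of 5000 speeches.
projected_hours = rate * 5000 / 3600
print(f"about {projected_hours:.1f} hours for 5000 speeches")
```

The computed rate agrees with the 10.48 s/it shown by `tqdm`.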