<h1 align="center"><b>Text Summarization with BART: Building a Dialogue Summarizer using Hugging Face Transformers</b></h1>

## Fine-tuning a Large Language Model for Dialogue Summarization


## **Task Description**

The objective of this project is Text Summarization, a core task in Natural Language Processing (NLP). According to the Transformers library, summarization refers to the process of generating a concise version of a document or conversation while preserving its essential information and meaning.

In this project, we focus specifically on dialogue summarization, which aims to condense multi-turn conversations into short, informative summaries. To achieve this, we will use the SAMSum dataset, a benchmark corpus designed for this purpose. The dataset consists of three CSV files for training, validation, and testing, each containing: a unique ID, a dialogue (a multi-speaker chat), and a corresponding summary written by humans.

The SAMSum dataset is particularly suitable for this project because it contains informal chat-like dialogues, making it ideal for evaluating models’ ability to handle conversational language.

## Model

For this task, we will fine-tune BART (Bidirectional and Auto-Regressive Transformers), introduced by Lewis et al. (2019). BART is a sequence-to-sequence model that combines a bidirectional encoder (like BERT) and an autoregressive decoder (like GPT). It is trained as a denoising autoencoder, learning to reconstruct text that has been intentionally corrupted.
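
To make the denoising objective more concrete, the sketch below corrupts a sentence by replacing a span of tokens with a single mask token, which is the idea behind the text-infilling noise used to pre-train BART. It is only a toy illustration: the helper `mask_span`, the whitespace tokens, and the fixed span position are assumptions made for this example, whereas real pre-training samples the spans randomly and works on subword tokens.

```python
def mask_span(tokens, start, length, mask_token="<mask>"):
    """Replace tokens[start:start + length] with a single mask token."""
    return tokens[:start] + [mask_token] + tokens[start + length:]

# Hypothetical sentence, split on whitespace for simplicity
original = "Amanda baked cookies and will bring Jerry some tomorrow".split()
corrupted = mask_span(original, start=2, length=3)

# The encoder reads the corrupted text; the decoder learns to rebuild the original
print("Encoder input :", " ".join(corrupted))
print("Decoder target:", " ".join(original))
```

The model never sees how many tokens were hidden behind the mask, so it has to learn both what is missing and how much is missing.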

We will use the facebook/bart-large-xsum checkpoint, a version of BART already fine-tuned for summarization on the XSum dataset of news articles. Fine-tuning it on the SAMSum dataset allows the model to adapt from summarizing formal news articles to conversational text, generating fluent and contextually relevant summaries.

## Evaluation Metrics

The model’s performance will be evaluated using the ROUGE metric (Recall-Oriented Understudy for Gisting Evaluation), which compares the overlap between generated and reference summaries.
Main variants include:

* ROUGE-1: unigram overlap
* ROUGE-2: bigram overlap
* ROUGE-L: longest common subsequence

ROUGE scores range from 0 to 1; in this notebook they are multiplied by 100 and reported on a 0 to 100 scale, with higher values indicating better alignment between generated and human summaries.
Complementary human evaluation may also be used to assess factual accuracy, fluency, and informativeness — dimensions that automated metrics cannot fully capture.
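
As a rough illustration of what these variants measure, here is a minimal pure-Python sketch of ROUGE-1, ROUGE-2, and ROUGE-L as F1 scores. It assumes lowercased whitespace tokenization and skips the stemming used later in the notebook, so its numbers will not match the `evaluate` implementation exactly. The function names and the example sentences are made up for this sketch.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def f1(overlap, cand_total, ref_total):
    """Harmonic mean of precision and recall, 0.0 when there is no overlap."""
    if overlap == 0:
        return 0.0
    precision, recall = overlap / cand_total, overlap / ref_total
    return 2 * precision * recall / (precision + recall)

def rouge_n(candidate, reference, n=1):
    cand = Counter(ngrams(candidate.lower().split(), n))
    ref = Counter(ngrams(reference.lower().split(), n))
    overlap = sum((cand & ref).values())  # clipped n-gram matches
    return f1(overlap, sum(cand.values()), sum(ref.values()))

def lcs_length(a, b):
    """Length of the longest common subsequence (dynamic programming)."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l(candidate, reference):
    c, r = candidate.lower().split(), reference.lower().split()
    return f1(lcs_length(c, r), len(c), len(r))

generated = "amanda baked cookies for jerry"
reference = "amanda baked cookies and will bring jerry some"
for name, score in [("ROUGE-1", rouge_n(generated, reference, 1)),
                    ("ROUGE-2", rouge_n(generated, reference, 2)),
                    ("ROUGE-L", rouge_l(generated, reference))]:
    print(f"{name}: {100 * score:.2f}")
```

ROUGE-2 is lower than ROUGE-1 here because matching two consecutive words is a stricter requirement than matching single words.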


In [None]:
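import pickle

def save_workspace(path="/content/drive/MyDrive/workspace.pkl"):
    # globals() also holds modules and notebook internals that cannot be pickled, so only picklable user variables are kept
    workspace = {}
    for name, value in list(globals().items()):
        if name.startswith("_"):
            continue
        try:
            pickle.dumps(value)
        except Exception:
            continue
        workspace[name] = value
    with open(path, "wb") as f:
        pickle.dump(workspace, f)
    print("Workspace saved successfully!")

def load_workspace(path="/content/drive/MyDrive/workspace.pkl"):
    with open(path, "rb") as f:
        globals().update(pickle.load(f))
    print("Workspace restored successfully!")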


In [None]:
from google.colab import drive
drive.mount('/content/drive')

### **Prerequisites and Libraries**



In [None]:
#!nvidia-smi # Checking GPU

In [None]:
#!pip install transformers # Installing the transformers library (https://huggingface.co/docs/transformers/index)

In [None]:
#!pip install datasets # Installing the datasets library (https://huggingface.co/docs/datasets/index)

In [None]:
!pip install evaluate # Installing the evaluate library (https://huggingface.co/docs/evaluate/main/en/index)

In [None]:
#!pip install rouge-score # Installing rouge-score library (https://pypi.org/project/rouge-score/)

In [None]:
#!pip install py7zr # Installing library to save zip archives (https://pypi.org/project/py7zr/)

In [None]:
# Importing Libraries

# Data Handling
import pandas as pd
import numpy as np
from datasets import Dataset
#from evaluate import load_metric
import shutil

# Data Visualization
import plotly.express as px
import plotly.graph_objs as go
import plotly.subplots as sp
from plotly.subplots import make_subplots
import plotly.figure_factory as ff
import plotly.io as pio
from IPython.display import display
from plotly.offline import init_notebook_mode
init_notebook_mode(connected=True)

# Statistics & Mathematics
import scipy.stats as stats
import statsmodels.api as sm
from scipy.stats import shapiro, skew, anderson, kstest, gaussian_kde,spearmanr
import math

# Hiding warnings
import warnings
warnings.filterwarnings("ignore")

In [None]:
# Transformers
from transformers import BartTokenizer, BartForConditionalGeneration      # BART Tokenizer and architecture
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments         # These will help us to fine-tune our model
from transformers import pipeline                                         # Pipeline
from transformers import DataCollatorForSeq2Seq                           # DataCollator to batch the data
import torch                                                              # PyTorch
import evaluate                                                           # Hugging Face's library for model evaluation


# Other NLP libraries
from textblob import TextBlob                                             # This is going to help us fix spelling mistakes in texts
from sklearn.feature_extraction.text import TfidfVectorizer               # This is going to help identify the most common terms in the corpus
import re                                                                 # This library allows us to clean text data
import nltk                                                               # Natural Language Toolkit
nltk.download('punkt')                                                    # This divides a text into a list of sentences

In [None]:
# Configuring Pandas to exhibit larger columns
'''
This is going to allow us to fully read the dialogues and their summary
'''
pd.set_option('display.max_colwidth', 1000)

In [None]:
# Configuring notebook
seed = 42
#paper_color =
#bg_color =
colormap = 'cividis'
template = 'plotly_dark'

In [None]:
# Checking if GPU is available
if torch.cuda.is_available():
    print("GPU is available. \nUsing GPU")
    device = torch.device('cuda')
else:
    print("GPU is not available. \nUsing CPU")
    device = torch.device('cpu')

In [None]:
def display_feature_list(features, feature_type):

    '''
    This function displays the features within each list for each type of data
    '''

    print(f"\n{feature_type} Features: ")
    print(', '.join(features) if features else 'None')

def describe_df(df):
    """
    This function prints some basic info on the dataset and
    sets global variables for feature lists.
    """

    global categorical_features, continuous_features, binary_features
    categorical_features = [col for col in df.columns if df[col].dtype == 'object']
    binary_features = [col for col in df.columns if df[col].nunique() <= 2 and df[col].dtype != 'object']
    continuous_features = [col for col in df.columns if df[col].dtype != 'object' and col not in binary_features]

    print(f"\n{type(df).__name__} shape: {df.shape}")
    print(f"\n{df.shape[0]:,.0f} samples")
    print(f"\n{df.shape[1]:,.0f} attributes")
    print(f'\nMissing Data: \n{df.isnull().sum()}')
    print(f'\nDuplicates: {df.duplicated().sum()}')
    print(f'\nData Types: \n{df.dtypes}')

    #negative_valued_features = [col for col in df.columns if (df[col] < 0).any()]
    #print(f'\nFeatures with Negative Values: {", ".join(negative_valued_features) if negative_valued_features else "None"}')

    display_feature_list(categorical_features, 'Categorical')
    display_feature_list(continuous_features, 'Continuous')
    display_feature_list(binary_features, 'Binary')

    print(f'\n{type(df).__name__} Head: \n')
    display(df.head(5))
    print(f'\n{type(df).__name__} Tail: \n')
    display(df.tail(5))

In [None]:
def histogram_boxplot(df,hist_color, box_color, height, width, legend, name):
    '''
    This function plots a Histogram and a Box Plot side by side

    Parameters:
    hist_color = The color of the histogram
    box_color = The color of the boxplots
    height and width = Image size
    legend = Either to display legend or not
    '''

    features = df.select_dtypes(include = [np.number]).columns.tolist()

    for feat in features:
        try:
            fig = make_subplots(
                rows=1,
                cols=2,
                subplot_titles=["Box Plot", "Histogram"],
                horizontal_spacing=0.2
            )

            density = gaussian_kde(df[feat])
            x_vals = np.linspace(min(df[feat]), max(df[feat]), 200)
            density_vals = density(x_vals)

            fig.add_trace(go.Scatter(x=x_vals, y = density_vals, mode = 'lines',
                                     fill = 'tozeroy', name="Density", line_color=hist_color), row=1, col=2)
            fig.add_trace(go.Box(y=df[feat], name="Box Plot", boxmean=True, line_color=box_color), row=1, col=1)

            fig.update_layout(title={'text': f'<b>{name} Word Count<br><sup><i>&nbsp;&nbsp;&nbsp;&nbsp;{feat}</i></sup></b>',
                                     'x': .025, 'xanchor': 'left'},
                             margin=dict(t=100),
                             showlegend=legend,
                             template = template,
                             #plot_bgcolor=bg_color,paper_bgcolor=paper_color,
                             height=height, width=width
                            )

            fig.update_yaxes(title_text=f"<b>Words</b>", row=1, col=1, showgrid=False)
            fig.update_xaxes(title_text="", row=1, col=1, showgrid=False)

            fig.update_yaxes(title_text="<b>Frequency</b>", row=1, col=2,showgrid=False)
            fig.update_xaxes(title_text=f"<b>Words</b>", row=1, col=2, showgrid=False)

            fig.show()
            print('\n')
        except Exception as e:
            print(f"An error occurred: {e}")

In [None]:
def plot_correlation(df, title, subtitle, height, width, font_size):
    '''
    This function is responsible for plotting a correlation map among features in the dataset.

    Parameters:
    height = Define height
    width = Define width
    font_size = Define the font size for the annotations
    '''
    corr = np.round(df.corr(numeric_only = True), 2)
    mask = np.triu(np.ones_like(corr, dtype = bool))
    c_mask = np.where(~mask, corr, 100)

    c = []
    for i in c_mask.tolist()[1:]:
        c.append([x for x in i if x != 100])



    fig = ff.create_annotated_heatmap(z=c[::-1],
                                      x=corr.index.tolist()[:-1],
                                      y=corr.columns.tolist()[1:][::-1],
                                      colorscale = colormap)

    fig.update_layout(title = {'text': f"<b>{title} Heatmap<br><sup>&nbsp;&nbsp;&nbsp;&nbsp;<i>{subtitle}</i></sup></b>",
                                'x': .025, 'xanchor': 'left', 'y': .95},
                    margin = dict(t=210, l = 110),
                    yaxis = dict(autorange = 'reversed', showgrid = False),
                    xaxis = dict(showgrid = False),
                    template = template,
                    #plot_bgcolor=bg_color,paper_bgcolor=paper_color,
                    height = height, width = width)


    fig.add_trace(go.Heatmap(z = c[::-1],
                             colorscale = colormap,
                             showscale = True,
                             visible = False))
    fig.data[1].visible = True

    for i in range(len(fig.layout.annotations)):
        fig.layout.annotations[i].font.size = font_size

    fig.show()


In [None]:
def compute_tfidf(df_column, ngram_range=(1,1), max_features=15):
    vectorizer = TfidfVectorizer(max_features=max_features, stop_words='english', ngram_range=ngram_range)
    x = vectorizer.fit_transform(df_column.fillna(''))
    df_tfidfvect = pd.DataFrame(x.toarray(), columns=vectorizer.get_feature_names_out())
    return df_tfidfvect


# **Exploring the Dataset**

In [None]:
# Loading data
train = pd.read_csv("/content/drive/MyDrive/datasets/samsum/samsum-train.csv")
test = pd.read_csv("/content/drive/MyDrive/datasets/samsum/samsum-test.csv")
val = pd.read_csv("/content/drive/MyDrive/datasets/samsum/samsum-validation.csv")

## **Train Dataset**

In [None]:
# Extracting info on the training Dataframe
describe_df(train)

We have a total of 14,732 dialogue-summary pairs in the training dataset.  
It appears that one of the dialogues is missing, so we should investigate this entry further to understand why it is empty.


In [None]:
miss_dialogue = train['dialogue'].isnull()
filtered_train = train[miss_dialogue]
filtered_train

We have identified a single missing dialogue in the training dataset, corresponding to `id = 13828807`.


In [None]:
train = train.dropna() #removing null values

In [None]:
# Removing 'Id' from categorical features list
categorical_features.remove('id')

We can now analyze the length of the dialogues and summaries by counting their words. This helps us understand how the texts are structured.

In [None]:
#Ensure plots are displayed correctly in the notebook
pio.renderers.default = "colab"

In [None]:
df_text_lenght = pd.DataFrame() # Creating an empty dataframe
for feat in categorical_features: # Iterating through features --> Dialogue & Summary
    df_text_lenght[feat] = train[feat].apply(lambda x: len(str(x).split())) #  Counting words for each feature

# Plotting histogram-boxplot
histogram_boxplot(df_text_lenght, hist_color='#00FFFF',
                  box_color='#FFD700', height=600, width=1000, legend =True,
                 name='Train Dataset')

### Graph Results (Text Lengths)

Dialogues have an average length of 94 words.

Some dialogues are very long, exceeding 300 words, which are considered outliers.

Summaries are naturally shorter, averaging around 20 words, though some can also be quite long.

These observations show that dialogues are generally detailed, while summaries remain concise.

Visual analysis (histogram + boxplot) helps to identify the distribution of text lengths and spot these outliers easily.
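
The box plot flags outliers with a fence based on the interquartile range. The sketch below reproduces that idea with the standard library on a handful of made-up texts. The helper names and the sample lengths are assumptions for this example, and the quartile method of `statistics.quantiles` may differ slightly from the one Plotly uses.

```python
import statistics

def word_count(text):
    """Same word-counting rule as in the notebook: split on whitespace."""
    return len(str(text).split())

def upper_fence(counts):
    """Upper box-plot fence: Q3 + 1.5 * IQR."""
    q1, _, q3 = statistics.quantiles(counts, n=4)
    return q3 + 1.5 * (q3 - q1)

# Hypothetical texts with 10, 12, ..., 20 words, plus one very long text
texts = ["word " * k for k in (10, 12, 14, 16, 18, 20, 300)]
counts = [word_count(t) for t in texts]

fence = upper_fence(counts)
outliers = [c for c in counts if c > fence]
print(f"Mean length: {statistics.mean(counts):.1f} words")
print(f"Upper fence: {fence} words, outliers: {outliers}")
```

A single very long text pulls the mean up noticeably, which is why the box plot and the median are useful companions to the average.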

### TF-IDF (Term Frequency – Inverse Document Frequency) and n-grams Analysis

We use TfidfVectorizer to extract the most important terms from dialogues and summaries.

Each column in the TF-IDF DataFrame represents a frequent word or n-gram, and each row represents a dialogue or summary.

The values in the table are TF-IDF scores, which measure the relevance of a word in a text relative to all other texts:

A word that is frequent in a specific text but rare in others will have a high score.

The ngram_range parameter controls which n-grams are considered:

* Unigrams: single words
* Bigrams: sequences of 2 words
* Trigrams: sequences of 3 words

stop_words='english' filters out very common and uninformative words like "and", "of", etc.

A heatmap of term correlations shows which words frequently appear together in dialogues.

Example: if "we" often occurs with "will", the heatmap will show a strong correlation.
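
To show how such scores come about, here is a small pure-Python sketch of TF-IDF with L2 normalization. It uses the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1`, which to my understanding is the default in scikit-learn, but it leaves out stop-word filtering and `max_features`. The function names and the three example texts are invented for illustration.

```python
import math
from collections import Counter

def terms(text, n=1):
    """Lowercased whitespace tokens joined into n-grams (n=1 gives unigrams)."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tfidf(docs, n=1):
    doc_terms = [terms(d, n) for d in docs]
    n_docs = len(doc_terms)
    # Document frequency: in how many texts does each term appear?
    df = Counter(t for ts in doc_terms for t in set(ts))
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}
    rows = []
    for ts in doc_terms:
        weights = {t: count * idf[t] for t, count in Counter(ts).items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({t: w / norm for t, w in weights.items()})  # L2-normalized row
    return rows

docs = ["see you at the party", "see you tomorrow", "party tomorrow night"]
scores = tfidf(docs)
for term in ("the", "see"):
    print(f"{term!r} in first text: {scores[0][term]:.3f}")
```

In the first text, "the" scores higher than "see" because it occurs in only one of the three texts, while "see" occurs in two. Calling `tfidf(docs, n=2)` scores bigrams instead of single words.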

In [None]:
vectorizer = TfidfVectorizer(max_features = 15,stop_words = 'english') # Top 15 terms
x = vectorizer.fit_transform(train['dialogue'])
df_tfidfvect = pd.DataFrame(x.toarray(), columns=vectorizer.get_feature_names_out())
plot_correlation(df_tfidfvect, 'Unigrams', 'Train - Dialogue', 800, 800, 12)

The heatmap shows that most correlations between terms are weak — neither strongly positive nor negative.
The strongest positive correlation appears between the words "don" and "know" (0.12). This makes sense because the vectorizer's default tokenizer splits words at the apostrophe and discards single-character tokens, so "don't" is reduced to "don" (the "t" is dropped), explaining why we see "don" instead.
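
This behaviour can be checked with the regular expression that scikit-learn documents as the default `token_pattern` of `TfidfVectorizer`. The `tokenize` helper and the example sentence are just for this sketch.

```python
import re

# Default token pattern documented for scikit-learn's text vectorizers:
# a token is a run of two or more word characters
TOKEN_PATTERN = r"(?u)\b\w\w+\b"

def tokenize(text):
    return re.findall(TOKEN_PATTERN, text.lower())

print(tokenize("I don't know, maybe at 5"))
```

The apostrophe acts as a separator, and the one-character pieces "i", "t", and "5" do not satisfy the two-character minimum, so they disappear before any stop-word filtering takes place.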

We can also notice a slightly negative correlation between "yes" and "yeah". This might be because speakers usually use one or the other, but not both, in a single dialogue, or it could reflect a conversational habit where "yeah" tends to replace "yes".

These small patterns highlight how term usage can vary in dialogues, even if overall correlations remain low.

Next, we’ll apply the same TF-IDF and correlation analysis to the summaries.

In [None]:
vectorizer = TfidfVectorizer(max_features=15 ,stop_words ='english')
x =vectorizer.fit_transform(train['summary'].fillna(''))
df_tfidfvect = pd.DataFrame(x.toarray(), columns =vectorizer.get_feature_names_out())
plot_correlation(df_tfidfvect, 'Unigrams', 'Train - Summary', 800, 800, 12)

The correlations between terms in summaries are slightly stronger than those in dialogues, although still weak overall.
This indicates that summaries express key information more directly, which aligns with their purpose.

We find positive correlations between pairs like "going" and "meet", "come" and "party", or "buy" and "wants", which naturally tend to appear together.
On the other hand, negatively correlated pairs such as "going" and "wants", or "going" and "got", rarely appear in the same text, which also makes sense contextually.
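
The values in the heatmap are Pearson correlations between term columns. The sketch below computes that coefficient by hand for three columns of hypothetical TF-IDF scores over four summaries. The numbers are invented to mimic the pattern described above and are not taken from the dataset.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical TF-IDF scores of three terms across four summaries
going = [0.8, 0.0, 0.7, 0.0]
meet = [0.6, 0.0, 0.5, 0.1]
wants = [0.0, 0.9, 0.0, 0.6]

print(f"going vs meet : {pearson(going, meet):+.2f}")
print(f"going vs wants: {pearson(going, wants):+.2f}")
```

Terms that score high in the same texts correlate positively, while terms that rarely share a text correlate negatively.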

Next, we’ll move on to analyzing bigrams in both dialogues and summaries.

In [None]:
vectorizer = TfidfVectorizer(max_features = 15,stop_words = 'english',ngram_range=(2,2)) # Top 15 terms
x = vectorizer.fit_transform(train['dialogue'])
df_tfidfvect = pd.DataFrame(x.toarray(), columns=vectorizer.get_feature_names_out())
plot_correlation(df_tfidfvect, 'Bigrams', 'Train - Dialogue', 800, 800, 12)

Once again, the correlations are not particularly strong. However, we can still observe some reasonable pairs that often appear together, such as "good idea" and "sound like".

In [None]:
vectorizer = TfidfVectorizer(max_features = 15,stop_words = 'english',ngram_range=(2,2)) # Top 15 terms
x = vectorizer.fit_transform(train['summary'])
df_tfidfvect = pd.DataFrame(x.toarray(), columns=vectorizer.get_feature_names_out())
plot_correlation(df_tfidfvect, 'Bigrams', 'Train - Summary', 800, 800, 12)

We find only one noticeable correlation, between the bigrams "wants buy" and "buy new". Other terms show little to no correlation at all.

Interestingly, summaries tend to include time-related expressions, such as mentions of minutes, which are not as common in dialogues.
It might be worth exploring this further by checking the summaries that contain the bigram "15 minutes" to understand how time information is represented.

In [None]:
#Filtering dataset to see those containing the term '15 minutes' in the summary
filtered_train = train[train['summary'].str.contains('15 minutes',case =False, na=False)]

filtered_train.head()

The last row gives us an idea of why we see so many terms related to minutes in summaries but not in dialogues: in dialogues, people may write "15 min" or other forms such as "15m", whereas the summaries use a standardized description, which naturally makes it more prominent than the other ways of expressing time.
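
One way to see how scattered these forms are is to match them with a single regular expression. The sketch below is an illustration only and is not part of the notebook's pipeline. The pattern, the `normalize_minutes` helper, and the example messages are assumptions, and a bare "m" could also mean meters in other contexts.

```python
import re

# Matches a number followed by "minutes", "minute", "mins", "min" or "m"
MINUTES = re.compile(r"\b(\d+)\s*(?:minutes?|mins?|m)\b", re.IGNORECASE)

def normalize_minutes(text):
    """Rewrite every matched variant as '<number> minutes'."""
    return MINUTES.sub(lambda m: f"{m.group(1)} minutes", text)

for message in ["be there in 15m", "running 15 min late", "see you in 15 minutes"]:
    print(normalize_minutes(message))
```

All three messages end up with the same "15 minutes" wording, which is the form the human-written summaries tend to use.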

Let's now visualize the trigrams.

In [None]:
vectorizer = TfidfVectorizer(max_features = 15,stop_words = 'english',ngram_range=(3,3)) # Top 15 terms
x = vectorizer.fit_transform(train['dialogue'])
df_tfidfvect = pd.DataFrame(x.toarray(), columns=vectorizer.get_feature_names_out())
plot_correlation(df_tfidfvect, 'Trigrams', 'Train - Dialogue', 800, 800, 12)

In [None]:
vectorizer = TfidfVectorizer(max_features = 15,stop_words = 'english',ngram_range=(3,3)) # Top 15 terms
x = vectorizer.fit_transform(train['summary'])
df_tfidfvect = pd.DataFrame(x.toarray(), columns=vectorizer.get_feature_names_out())
plot_correlation(df_tfidfvect, 'Trigrams', 'Train - Summary', 800, 800, 12)

Once more, the correlations between trigrams are weak overall. Still, we can identify some pairs that make sense to appear together within the corpus.

Next, I will perform the same analysis on the test and validation datasets, since we can expect similar patterns to those observed in the training set.

## **Test Dataset**

In [None]:
# Extracting info on the test Dataframe
describe_df(test)

In [None]:
# Removing 'Id' from categorical features list
categorical_features.remove('id')

In [None]:
df_text_lenght = pd.DataFrame()
for feat in categorical_features:
    df_text_lenght[feat] = test[feat].apply(lambda x: len(str(x).split()))

histogram_boxplot(df_text_lenght, hist_color='#00FFFF',
                  box_color='#FFD700', height=600, width=1000, legend =True,
                 name='Test Dataset')

In [None]:
vectorizer = TfidfVectorizer(max_features = 15,stop_words = 'english') # Top 15 terms
x = vectorizer.fit_transform(test['dialogue'].fillna(''))
df_tfidfvect = pd.DataFrame(x.toarray(), columns=vectorizer.get_feature_names_out())
plot_correlation(df_tfidfvect, 'Unigrams', 'Test - Dialogue', 800, 800, 12)

In [None]:
vectorizer = TfidfVectorizer(max_features = 15,stop_words = 'english') # Top 15 terms
x = vectorizer.fit_transform(test['summary'].fillna(''))
df_tfidfvect = pd.DataFrame(x.toarray(), columns=vectorizer.get_feature_names_out())
plot_correlation(df_tfidfvect, 'Unigrams', 'Test - Summary', 800, 800, 12)

In [None]:
vectorizer = TfidfVectorizer(max_features = 15,stop_words = 'english',ngram_range = (2,2)) # Top 15 terms
x = vectorizer.fit_transform(test['dialogue'].fillna(''))
df_tfidfvect = pd.DataFrame(x.toarray(), columns=vectorizer.get_feature_names_out())
plot_correlation(df_tfidfvect, 'Bigrams', 'Test - Dialogue', 800, 800, 12)

In [None]:
vectorizer = TfidfVectorizer(max_features = 15,stop_words = 'english',ngram_range = (2,2)) # Top 15 terms
x = vectorizer.fit_transform(test['summary'].fillna(''))
df_tfidfvect = pd.DataFrame(x.toarray(), columns=vectorizer.get_feature_names_out())
plot_correlation(df_tfidfvect, 'Bigrams', 'Test - Summary', 800, 800, 12)

In [None]:
vectorizer = TfidfVectorizer(max_features = 15,stop_words = 'english',ngram_range = (3,3)) # Top 15 terms
x = vectorizer.fit_transform(test['dialogue'].fillna(''))
df_tfidfvect = pd.DataFrame(x.toarray(), columns=vectorizer.get_feature_names_out())
plot_correlation(df_tfidfvect, 'Trigrams', 'Test - Dialogue', 800, 800, 12)

In [None]:
vectorizer = TfidfVectorizer(max_features = 15,stop_words = 'english',ngram_range = (3,3)) # Top 15 terms
x = vectorizer.fit_transform(test['summary'].fillna(''))
df_tfidfvect = pd.DataFrame(x.toarray(), columns=vectorizer.get_feature_names_out())
plot_correlation(df_tfidfvect, 'Trigrams', 'Test - Summary', 800, 800, 12)

## **Validation Dataset**

In [None]:
# Extracting info on the val dataset
describe_df(val)

In [None]:
# Removing 'Id' from categorical features list
categorical_features.remove('id')

In [None]:
df_text_lenght = pd.DataFrame()
for feat in categorical_features:
    df_text_lenght[feat] = val[feat].apply(lambda x: len(str(x).split()))

histogram_boxplot(df_text_lenght, hist_color='#00FFFF',
                  box_color='#FFD700', height=600, width=1000, legend =True,
                 name='Validation Dataset')

In [None]:
vectorizer = TfidfVectorizer(max_features = 15,stop_words = 'english') # Top 15 terms
x = vectorizer.fit_transform(val['dialogue'].fillna(''))
df_tfidfvect = pd.DataFrame(x.toarray(), columns=vectorizer.get_feature_names_out())
plot_correlation(df_tfidfvect, 'Unigrams', 'Validation - Dialogue', 800, 800, 12)

In [None]:
vectorizer = TfidfVectorizer(max_features = 15,stop_words = 'english') # Top 15 terms
x = vectorizer.fit_transform(val['summary'].fillna(''))
df_tfidfvect = pd.DataFrame(x.toarray(), columns=vectorizer.get_feature_names_out())
plot_correlation(df_tfidfvect, 'Unigrams', 'Validation - Summary', 800, 800, 12)

In [None]:
vectorizer = TfidfVectorizer(max_features = 15,stop_words = 'english',ngram_range = (2,2)) # Top 15 terms
x = vectorizer.fit_transform(val['dialogue'].fillna(''))
df_tfidfvect = pd.DataFrame(x.toarray(), columns=vectorizer.get_feature_names_out())
plot_correlation(df_tfidfvect, 'Bigrams', 'Validation - Dialogue', 800, 800, 12)

In [None]:
vectorizer = TfidfVectorizer(max_features = 15,stop_words = 'english',ngram_range = (2,2)) # Top 15 terms
x = vectorizer.fit_transform(val['summary'].fillna(''))
df_tfidfvect = pd.DataFrame(x.toarray(), columns=vectorizer.get_feature_names_out())
plot_correlation(df_tfidfvect, 'Bigrams', 'Validation - Summary', 800, 800, 12)

In [None]:
vectorizer = TfidfVectorizer(max_features = 15,stop_words = 'english',ngram_range = (3,3)) # Top 15 terms
x = vectorizer.fit_transform(val['dialogue'].fillna(''))
df_tfidfvect = pd.DataFrame(x.toarray(), columns=vectorizer.get_feature_names_out())
plot_correlation(df_tfidfvect, 'Trigrams', 'Validation - Dialogue', 800, 800, 12)

In [None]:
vectorizer = TfidfVectorizer(max_features = 15,stop_words = 'english',ngram_range = (3,3)) # Top 15 terms
x = vectorizer.fit_transform(val['summary'].fillna(''))
df_tfidfvect = pd.DataFrame(x.toarray(), columns=vectorizer.get_feature_names_out())
plot_correlation(df_tfidfvect, 'Trigrams', 'Validation - Summary', 800, 800, 12)

### Overall Observations

Across all three datasets, the patterns remain consistent. As expected, summaries are shorter than dialogues, and many term pairs that logically belong together show stronger correlations.

The n-gram heatmaps further confirm that this dataset is made up of chat or dialogue texts, as we find many expressions commonly used in everyday conversations.

# **Preprocessing Data**

One of the main benefits of using pre-trained models like BART is their robustness: they typically require minimal data preprocessing.

During the exploratory analysis, I noticed that some dialogues contain special tags, such as `<file_photo>`.
Let’s take a closer look at a few examples to better understand these cases.

In [None]:
print(train['dialogue'].iloc[14727])

I am going to use the "clean_tags" function defined below to remove these tags from the texts, so we can make them cleaner.

In [None]:
def clean_tags(text):

  clean = re.compile(r'<.*?>') # compiling the tag pattern
  clean = re.sub(clean, '', text) # Replacing tags with an empty string

  # Removing turns that are left empty once their tag is gone (e.g. "Name: ")
  clean = '\n'.join([line for line in clean.split('\n') if not re.match(r'.*:\s*$', line)])

  return clean

In [None]:
test1 = clean_tags(train['dialogue'].iloc[14727])
print(test1)
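
To make both cleaning steps easy to follow, here is a self-contained sketch that applies the same two regular expressions to a short, made-up dialogue. The name `strip_tags` and the sample text are hypothetical and only mirror what `clean_tags` does.

```python
import re

TAG = re.compile(r"<.*?>")          # anything between angle brackets
EMPTY_TURN = re.compile(r".*:\s*$")  # a speaker name followed by nothing

def strip_tags(dialogue):
    without_tags = TAG.sub("", dialogue)
    # Drop turns that contained only a tag and are now empty
    kept = [line for line in without_tags.split("\n") if not EMPTY_TURN.match(line)]
    return "\n".join(kept)

sample = "Hannah: <file_photo>\nAmanda: Nice! Where was that taken?\nHannah: At the lake."
print(strip_tags(sample))
```

The first turn disappears entirely because, once the tag is removed, only the speaker name is left.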


#### Cleaning the Entire Dataset

We can see that the tags have been successfully removed from the texts. Next, I will define the "clean_df" function, which will apply the "clean_tags" function to all entries in our datasets.

In [None]:
# Define function to clean every text in the Dataset.

def clean_df (df, cols):
  for col in cols:
    df[col]=df[col].fillna('').apply(clean_tags)
  return df

In [None]:
#Cleaning texts in all datasets
train = clean_df(train,['dialogue', 'summary'])
test = clean_df(test,['dialogue', 'summary'])
val = clean_df(val,['dialogue', 'summary'])


In [None]:
train.tail(3)

### Preparing the Data for Fine-Tuning
The tags have been successfully removed from the texts. Performing this kind of data cleaning is important to eliminate noise elements that do not add meaningful context and could negatively impact model performance.


Next, I will carry out a few preprocessing steps to prepare our data for use with a pre-trained model and for fine-tuning.


First, we’ll use the Datasets library to convert our Pandas DataFrames into Dataset objects.
This will make the data fully compatible with the Hugging Face ecosystem, allowing for efficient processing and training.

In [None]:
# Transforming dataframes into datasets
train_ds = Dataset.from_pandas(train)
test_ds = Dataset.from_pandas(test)
val_ds = Dataset.from_pandas(val)

# Visualizing results
print(train_ds)
print('\n' * 2)
print(test_ds)
print('\n' * 2)
print(val_ds)

In [None]:
train_ds[0] # visualizing the first row

This allows us to view the original ID, the dialogue, and the reference summary. The column `__index_level_0__` does not add any useful information and will be removed later.
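
The extra column most likely comes from the earlier `dropna()` call: dropping a row leaves a gap in the DataFrame index, and as far as I know `Dataset.from_pandas` stores any index that is not a default range as `__index_level_0__`. The pandas-only sketch below shows the index change on a tiny made-up frame. The data and variable names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical miniature of the training data, with one missing dialogue
df = pd.DataFrame({
    "id": ["a", "b", "c"],
    "dialogue": ["A: hi", None, "B: bye"],
    "summary": ["A greets.", "Empty.", "B leaves."],
})

cleaned = df.dropna()                    # index is now [0, 2], no longer a plain range
reset = cleaned.reset_index(drop=True)   # back to a default 0..n-1 range

print(type(cleaned.index).__name__, list(cleaned.index))
print(type(reset.index).__name__, list(reset.index))
```

Resetting the index before the conversion, or passing `preserve_index=False` to `Dataset.from_pandas`, should avoid the extra column. Removing it afterwards, as this notebook does, works just as well.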


Once the Pandas DataFrames have been successfully converted to Datasets, we can proceed to the modeling process.

# **Modeling**

In [None]:
# Loading summarization pipeline with the bart-large-xsum model
summarizer = pipeline('summarization', model = 'facebook/bart-large-xsum')

We can see that the model is capable of generating a much shorter text that captures the most relevant information from the input: a successful summarization.

However, this checkpoint was fine-tuned on XSum, a dataset of BBC news articles, not on dialogue data.
To improve performance on dialogues, we will fine-tune it using the SAMSum dataset.

Next, we will load the BartTokenizer and BartForConditionalGeneration using the facebook/bart-large-xsum checkpoint.

In [None]:
checkpoint = 'facebook/bart-large-xsum' # Model
tokenizer = BartTokenizer.from_pretrained(checkpoint) # Loading Tokenizer

In [None]:
model = BartForConditionalGeneration.from_pretrained(checkpoint) # Loading Model

In [None]:
print(model) # Visualizing model's architecture

#### Understanding the BART Model Architecture

The model consists of an encoder and a decoder, with Linear layers and activation functions that use GELU instead of the more common ReLU.

The output layer, lm_head, is also noteworthy. With a vocabulary size of 50,264 (out_features=50264), it confirms that this model is suitable for text generation tasks, including summarization and even translation.

Next, we need to preprocess our datasets and use the BartTokenizer so that the data is properly formatted for the BART model.

In [None]:
def preprocess_function(examples):
    inputs = [doc for doc in examples["dialogue"]]
    model_inputs = tokenizer(inputs, max_length=1024, truncation=True)

    # Tokenizing the targets (text_target replaces the deprecated as_target_tokenizer context manager)
    labels = tokenizer(text_target=examples["summary"], max_length=128, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

In [None]:
# Applying preprocess_function to the datasets
tokenized_train = train_ds.map(preprocess_function, batched=True,
                               remove_columns=['id', 'dialogue', 'summary', '__index_level_0__']) # Removing features

tokenized_test = test_ds.map(preprocess_function, batched=True,
                               remove_columns=['id', 'dialogue', 'summary']) # Removing features

tokenized_val = val_ds.map(preprocess_function, batched=True,
                               remove_columns=['id', 'dialogue', 'summary']) # Removing features

# Printing results
print('\n' * 3)
print('Preprocessed Training Dataset:\n')
print(tokenized_train)
print('\n' * 2)
print('Preprocessed Test Dataset:\n')
print(tokenized_test)
print('\n' * 2)
print('Preprocessed Validation Dataset:\n')
print(tokenized_val)

After tokenization, our datasets now contain numerical versions of the dialogues and summaries, which the model can understand. Let's print one example to see how the text was converted into token IDs and verify that the preprocessing worked correctly.

In [None]:
sample = tokenized_train[0]

print("input_ids:")
print(sample['input_ids'])
print("\n")
print("attention_mask:")
print(sample['attention_mask'])
print("\n")
print("labels:")
print(sample['labels'])
print("\n")

* input_ids : These are numerical token IDs that represent the dialogue text. Each token corresponds to a word or subword that the BART model can understand. For example, the number 5219 might represent the word “hello” in BART’s vocabulary. Every word or subword in the dialogue has a unique token ID.

* attention_mask : This mask tells the model which tokens to pay attention to and which ones to ignore. It is especially useful when padding is applied to make all sequences the same length. Padding tokens don’t carry meaning, so the attention mask prevents the model from considering them. In this sample, all mask values are 1, meaning all tokens are valid and none are padding.

* labels: These are token IDs representing the target summaries. During training, the model will learn to generate these tokens as outputs based on the corresponding dialogue inputs.
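
The toy encoder below mimics the shape of these fields without loading the real tokenizer. Everything in it is an assumption made for illustration: the eight-entry vocabulary, the whitespace splitting, and the `encode` helper. The real BartTokenizer works on subwords with a vocabulary of roughly fifty thousand entries.

```python
# Hypothetical miniature vocabulary; the special-token ids follow BART's layout
TOY_VOCAB = {"<s>": 0, "<pad>": 1, "</s>": 2, "<unk>": 3,
             "hello": 4, "how": 5, "are": 6, "you": 7}

def encode(text, max_length=8):
    """Map words to ids, truncate, and wrap with start/end tokens."""
    ids = [TOY_VOCAB.get(tok, TOY_VOCAB["<unk>"]) for tok in text.lower().split()]
    ids = [TOY_VOCAB["<s>"]] + ids[:max_length - 2] + [TOY_VOCAB["</s>"]]
    # No padding is applied here, so every position is a real token
    return {"input_ids": ids, "attention_mask": [1] * len(ids)}

sample_encoding = encode("Hello how are you today")
print(sample_encoding["input_ids"])
print(sample_encoding["attention_mask"])
```

The unknown word "today" maps to the `<unk>` id, and the mask contains only ones because nothing has been padded yet.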

We will now use "DataCollatorForSeq2Seq" to group our data into batches. This object automatically handles important preprocessing tasks such as padding, ensuring that all sequences in a batch have the same length. It plays a key role during fine-tuning by preparing inputs and labels in a format that the model can efficiently process.
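
The following sketch imitates the padding step with plain Python lists. It assumes a pad token id of 1 and pads labels with -100, the default label padding value of `DataCollatorForSeq2Seq`, so that padded label positions are ignored by the loss. The `collate` function and the two example features are made up, and the real collator returns tensors and can also prepare decoder inputs.

```python
PAD_ID = 1           # assumed pad token id
LABEL_PAD_ID = -100  # positions with this label are ignored by the loss

def pad_to(seq, length, value):
    return seq + [value] * (length - len(seq))

def collate(features):
    """Pad every field to the longest sequence in this batch only."""
    max_in = max(len(f["input_ids"]) for f in features)
    max_lab = max(len(f["labels"]) for f in features)
    return {
        "input_ids": [pad_to(f["input_ids"], max_in, PAD_ID) for f in features],
        "attention_mask": [pad_to(f["attention_mask"], max_in, 0) for f in features],
        "labels": [pad_to(f["labels"], max_lab, LABEL_PAD_ID) for f in features],
    }

batch = collate([
    {"input_ids": [0, 4, 5, 6, 2], "attention_mask": [1, 1, 1, 1, 1], "labels": [0, 7, 2]},
    {"input_ids": [0, 4, 2], "attention_mask": [1, 1, 1], "labels": [0, 5, 6, 7, 2]},
])
for key, rows in batch.items():
    print(key, rows)
```

Because padding is applied per batch, short batches stay short instead of being padded to the global maximum length.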

In [None]:
# Instantiating Data Collator
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)
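As a rough picture of what the collator does for each batch, the sketch below pads a small batch by hand. It is a simplified stand-in, not the library's implementation: the function `collate` is hypothetical and the pad ID of 1 is an assumption. The value -100 for label padding follows the collator's default, so padded label positions are skipped by the loss.

```python
import numpy as np

PAD_ID = 1           # assumption: pad token ID of a BART-style tokenizer
LABEL_PAD_ID = -100  # label positions with this value are ignored by the loss


def collate(features, pad_id=PAD_ID, label_pad_id=LABEL_PAD_ID):
    """Pad every sequence to the longest one in this batch (dynamic padding)."""
    max_in = max(len(f["input_ids"]) for f in features)
    max_lab = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_in = max_in - len(f["input_ids"])
        n_lab = max_lab - len(f["labels"])
        batch["input_ids"].append(f["input_ids"] + [pad_id] * n_in)
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * n_in)
        batch["labels"].append(f["labels"] + [label_pad_id] * n_lab)
    return {k: np.array(v) for k, v in batch.items()}


# Two made-up examples of different lengths
features = [
    {"input_ids": [0, 11, 12, 13, 2], "labels": [0, 21, 2]},
    {"input_ids": [0, 14, 2], "labels": [0, 22, 23, 2]},
]
batch = collate(features)
print(batch["input_ids"].shape)  # (2, 5)
print(batch["labels"])
```

Because padding is decided per batch, short dialogues are not padded out to the longest dialogue in the whole dataset.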

Next, I am going to load the ROUGE metric and define a function to evaluate the model.

In [None]:
!pip install rouge_score

In [None]:
import evaluate

In [None]:
rouge_metric = evaluate.load('rouge') # loading ROUGE score

In [None]:
def compute_metrics(eval_pred):
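    predictions, labels = eval_pred  # Obtaining predictions and true labels

    # Generated sequences can be padded with -100 when batches are gathered; swap it for the pad token before decoding
    predictions = np.where(predictions != -100, predictions, tokenizer.pad_token_id)

    # Decoding predictions
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)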

    # Obtaining the true labels tokens, while eliminating any possible masked token (i.e., label = -100)
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Rouge expects a newline after each sentence
    decoded_preds = ["\n".join(nltk.sent_tokenize(pred.strip())) for pred in decoded_preds]
    decoded_labels = ["\n".join(nltk.sent_tokenize(label.strip())) for label in decoded_labels]


    # Computing rouge score
    result = rouge_metric.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
    # Converting the scores from fractions to percentages
    result = {key: value * 100 for key, value in result.items()}

    # Add mean-generated length
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)

    return {k: round(v, 4) for k, v in result.items()}

The "compute_metrics" function compares the summaries generated by the model with the reference (human-written) summaries using ROUGE scores. It decodes the token IDs back into text and then measures how similar the model's output is to the actual summaries. The higher the scores, the better the model's summarization performance.
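To build some intuition for what ROUGE measures, here is a simplified, hand-rolled version of ROUGE-1 (unigram overlap). It is only a sketch: the `rouge1` helper is hypothetical, it splits on whitespace, and it skips the stemming and tokenization rules that the `rouge_score` package applies, so its numbers will not match the library exactly.

```python
from collections import Counter


def rouge1(prediction, reference):
    """Simplified ROUGE-1: unigram overlap between prediction and reference."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Counter intersection keeps, for each word, the smaller of the two counts
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    precision = overlap / len(pred_tokens) if pred_tokens else 0.0
    recall = overlap / len(ref_tokens) if ref_tokens else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


scores = rouge1(
    "alex and sam are watching tv",            # made-up model output
    "alex and sam are watching millionaires",  # made-up reference
)
print({k: round(v * 100, 2) for k, v in scores.items()})
# {'precision': 83.33, 'recall': 83.33, 'f1': 83.33}
```

Five of the six words match in each direction, so precision, recall, and F1 all come out at 5/6. ROUGE-2 applies the same idea to pairs of consecutive words, and ROUGE-L uses the longest common subsequence.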

### Seq2SeqTrainingArguments
This class defines how your model will be trained. It's like the configuration file of the training process.

Think of it as the place where you set all your training hyperparameters, file paths, and training options.

In [None]:
!pip install -U transformers

In [None]:
training_args = Seq2SeqTrainingArguments(
    output_dir = "/content/drive/MyDrive/bart_samsum",
    eval_strategy = "epoch",
    save_strategy = 'epoch',
    load_best_model_at_end = True,
    metric_for_best_model = 'eval_loss',
    seed = seed,
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=2,
    weight_decay=0.01,
    save_total_limit=2,
    num_train_epochs=4,
    predict_with_generate=True,
    fp16=True,
    report_to="none"
)

### Seq2SeqTrainer

This class actually runs the training. It takes your model, data, tokenizer, metrics, and the training arguments, and orchestrates everything.

In [None]:
# Defining Trainer
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)


In [None]:
# Install NLTK if needed
!pip install nltk

# Download the required resources
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')

In [None]:
save_workspace()

In [None]:
trainer.train() # training model

In [None]:
trainer.train(resume_from_checkpoint=True) # run only if training was interrupted: resumes from the latest checkpoint in output_dir

## Model Fine-Tuning Results

We successfully completed the fine-tuning process after **4 epochs**.  
Since we specified `load_best_model_at_end = True` with `metric_for_best_model = 'eval_loss'` in the training arguments, the `Trainer` automatically reloaded the checkpoint with the **lowest evaluation loss**, measured on the dataset passed as `eval_dataset` (here, `tokenized_test`).
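The selection rule behind this is simple and can be sketched in a few lines. The loss values below are hypothetical and are not taken from this run, and `pick_best` is an illustrative helper, not a `Trainer` method.

```python
# Hypothetical per-epoch evaluation losses (illustrative values, NOT from this run)
eval_history = [
    {"epoch": 1, "eval_loss": 1.52},
    {"epoch": 2, "eval_loss": 1.44},
    {"epoch": 3, "eval_loss": 1.41},
    {"epoch": 4, "eval_loss": 1.43},
]


def pick_best(history, metric="eval_loss", greater_is_better=False):
    """Return the record with the best value of the chosen metric."""
    chooser = max if greater_is_better else min
    return chooser(history, key=lambda record: record[metric])


best = pick_best(eval_history)
print(best["epoch"])  # 3
```

For a loss, lower is better, so the minimum wins. If a ROUGE score were used as the selection metric instead, `greater_is_better` would need to be `True`.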

After training, the model achieved an average **training loss of 1.01**.  
This suggests that the model fit the **SAMSum** training data reasonably well. Training loss alone does not measure generalization, so it should be read together with the evaluation metrics reported later.

During the process, the training logs also reported:
- **Global steps:** 7368  
- **Training runtime:** ~4476 seconds (around 1 hour and 14 minutes)  
- **Training samples per second:** 13.16  
- **Training steps per second:** 1.65  

These figures describe training speed. Convergence is better judged from the loss values and the ROUGE scores.
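As a sanity check, the global step count can be reproduced from the training arguments. This sketch assumes the SAMSum training split has 14,732 dialogues and that training ran on a single GPU; neither is stated in the logs above. The exact rounding of the last partial accumulation step can also differ between library versions.

```python
import math

train_examples = 14_732  # assumption: size of the SAMSum training split
num_devices = 1          # assumption: a single GPU
per_device_batch = 4     # per_device_train_batch_size
grad_accum = 2           # gradient_accumulation_steps
epochs = 4               # num_train_epochs
runtime_s = 4476         # training runtime reported in the logs

# Examples consumed per optimizer update
effective_batch = per_device_batch * grad_accum * num_devices
steps_per_epoch = math.ceil(train_examples / effective_batch)
total_steps = steps_per_epoch * epochs
steps_per_second = total_steps / runtime_s

print(effective_batch)             # 8
print(total_steps)                 # 7368
print(round(steps_per_second, 2))  # 1.65
```

Under these assumptions, the result matches the 7368 global steps and 1.65 steps per second reported above.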

Although the exact validation loss and ROUGE metrics were not displayed in this run, the final training loss suggests results comparable to those seen in similar experiments, where **ROUGE-1, ROUGE-2, and ROUGE-L** scores typically improve noticeably after the second and third epochs.

The **Gen Len** (average generated summary length) metric is also useful for interpreting the model’s behavior.  
Ideally, we aim for summaries that are concise yet informative. Models with shorter but context-rich summaries often yield better overall performance, which aligns well with what we observed qualitatively in the generated outputs.

Finally, we saved the fine-tuned model to Google Drive to ensure it remains accessible for future inference and evaluation tasks.



# Evaluating and Saving Model
After training and testing the model, we can evaluate its performance on the validation dataset using the `evaluate` method.

In [None]:
# Evaluating model performance on the tokenized validation dataset
validation = trainer.evaluate(eval_dataset = tokenized_val)
print(validation) # Printing results

## Evaluation Results and Model Saving

We have now evaluated our fine-tuned model on the validation dataset using the `evaluate()` method from the Transformers library.  

Below are the obtained metrics:

| Metric | Score |
|:--|:--:|
| **Validation Loss** | 1.4076 |
| **ROUGE-1** | 53.51 |
| **ROUGE-2** | 28.62 |
| **ROUGE-L** | 44.15 |
| **ROUGE-Lsum** | 49.14 |
| **Average Generated Length** | 29.89 |

These results are consistent with what we observed during training.  
We can see that the model achieved **even higher ROUGE scores** compared to the test phase, indicating that it learned to generate summaries that closely match the human references.  

Regarding the **Gen Len** metric, we can notice that the model produces **short, concise summaries** while still retaining essential information, a desirable behavior for summarization tasks.

Since our results are satisfactory, we can now save the fine-tuned model and tokenizer for future use.  
To ensure portability, we’ll also compress the saved model directory into a `.zip` file.



In [None]:
# Saving model to a custom directory
directory = "/content/drive/MyDrive/bart_finetuned_samsum_bcool"
trainer.save_model(directory)

# Saving model tokenizer
tokenizer.save_pretrained(directory)

In [None]:
# Saving model in .zip format
import shutil
shutil.make_archive(directory, 'zip', directory)
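To see what `shutil.make_archive` produces without touching Google Drive, the sketch below zips a throwaway folder. The folder name `toy_model` and its single placeholder file are hypothetical stand-ins for the saved model directory.

```python
import os
import shutil
import tempfile
import zipfile

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical stand-in for the saved model directory
    model_dir = os.path.join(tmp, "toy_model")
    os.makedirs(model_dir)
    with open(os.path.join(model_dir, "config.json"), "w") as f:
        f.write("{}")  # placeholder content

    # Arguments: archive path without extension, format, folder to compress
    archive_path = shutil.make_archive(model_dir, "zip", model_dir)

    # Listing the archive contents confirms the files are stored at its root
    with zipfile.ZipFile(archive_path) as zf:
        names = zf.namelist()

print(os.path.basename(archive_path))  # toy_model.zip
print("config.json" in names)          # True
```

Passing the same path as both the base name and the root directory places the `.zip` file next to the folder, with the folder's files at the top level of the archive.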

Let's load the model with the summarization pipeline and generate a few summaries for human evaluation, checking whether each model-generated summary accurately reflects its dialogue.

In [None]:
# Loading the fine-tuned model and tokenizer from the Hugging Face Hub

model_name = "bcool315/bcool-bart-finetuned-dialogue"

tokenizer = BartTokenizer.from_pretrained(model_name)
model = BartForConditionalGeneration.from_pretrained(model_name)

In [None]:
# Loading summarization pipeline and model
summarizer_bcool = pipeline('summarization', model='bcool315/bcool-bart-finetuned-dialogue')

In [None]:
# Inspecting an example from the validation dataset
val_ds[35]

In [None]:
text = "John: doing anything special?\r\nAlex: watching 'Millionaires' on tvn\r\nSam: me too! He has a chance to win a million!\r\nJohn: ok, fingers crossed then! :)"
summary = "Alex and Sam are watching Millionaires."
generated_summary = summarizer_bcool(text)

In [None]:
print('Original Dialogue:\n')
print(text)
print('\n' * 2)
print('Reference Summary:\n')
print(summary)
print('\n' * 2)
print('Model-generated Summary:\n')
print(generated_summary)
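Printing `generated_summary` directly shows the raw pipeline output rather than just the text. The sketch below assumes the pipeline returns a list of dictionaries with a `summary_text` key, which is the usual shape for the summarization pipeline. The `mock_output` value is a stand-in copied from the reference summary, not actual model output, and `extract_summaries` is a hypothetical helper.

```python
# Stand-in for a pipeline result (text copied from the reference summary,
# NOT real model output)
mock_output = [{"summary_text": "Alex and Sam are watching Millionaires."}]


def extract_summaries(pipeline_output):
    """Pull the plain summary strings out of the pipeline's list of dicts."""
    return [item["summary_text"].strip() for item in pipeline_output]


print(extract_summaries(mock_output)[0])
# Alex and Sam are watching Millionaires.
```

With the real pipeline, `extract_summaries(summarizer_bcool(text))[0]` would give the summary as a plain string under the same assumption.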

In [None]:
val_ds[22]


In [None]:
text = "Madison: Hello Lawrence are you through with the article?\r\nLawrence: Not yet sir. \r\nLawrence: But i will be in a few.\r\nMadison: Okay. But make it quick.\r\nMadison: The piece is needed by today\r\nLawrence: Sure thing\r\nLawrence: I will get back to you once i am through."
summary = "Lawrence will finish writing the article soon."
generated_summary = summarizer_bcool(text)

print('Original Dialogue:\n')
print(text)
print('\n' * 2)
print('Reference Summary:\n')
print(summary)
print('\n' * 2)
print('Model-generated Summary:\n')
print(generated_summary)

In [None]:
val_ds[4]

In [None]:
text = "Robert: Hey give me the address of this music shop you mentioned before\r\nRobert: I have to buy guitar cable\r\nFred: Catch it on google maps\r\nRobert: thx m8\r\nFred: ur welcome"
summary = "Robert wants Fred to send him the address of the music shop as he needs to buy guitar cable."
generated_summary = summarizer_bcool(text)

In [None]:
print('Original Dialogue:\n')
print(text)
print('\n' * 2)
print('Reference Summary:\n')
print(summary)
print('\n' * 2)
print('Model-generated Summary:\n')
print(generated_summary)

## Another example: creating new dialogues for evaluation

For this dialogue, I have decided to include some abbreviations, such as "idk" (I don't know) and "r u" (are you), to observe how the model would interpret them.

In [None]:
text = "John: Hey! I've been thinking about getting a PlayStation 5. Do you think it is worth it? \r\nDan: Idk man. R u sure ur going to have enough free time to play it? \r\nJohn: Yeah, that's why I'm not sure if I should buy one or not. I've been working so much lately idk if I'm gonna be able to play it as much as I'd like."
generated_summary = summarizer_bcool(text)
print('Original Dialogue:\n')
print(text)
print('\n' * 2)
print('Model-generated Summary:\n')
print(generated_summary)


## Conclusion and Deployment

In this project, I explored how large pre-trained language models can be adapted to perform dialogue summarization tasks effectively. Using the BART architecture, I fine-tuned the model on the SAMSum dataset, which contains real conversational data, to build a summarizer capable of condensing informal dialogues into clear and coherent summaries.

Throughout this work, I leveraged the Hugging Face Transformers, Datasets, and Evaluate libraries in combination with PyTorch, illustrating how transfer learning allows us to repurpose existing large language models for specialized NLP tasks with limited computational resources.

The resulting model, "bcool315/bcool-bart-finetuned-dialogue", demonstrates strong summarization capabilities, even when handling informal English dialogues containing abbreviations and conversational expressions, which highlights the robustness of the fine-tuning process.

Finally, the model was successfully uploaded and deployed on Hugging Face, making it freely accessible for anyone wishing to use it, test it, or further fine-tune it on similar dialogue summarization tasks.
This work shows the practicality and power of modern transfer learning methods for adapting foundation models to domain-specific NLP applications.