# Text Summarization Model Evaluation Visualization

This notebook visualizes and analyzes the evaluation results from our text summarization models, comparing different fine-tuning approaches:
- Base TinyLLaMA Model
- LoRA Fine-tuning
- QLoRA Fine-tuning
- Adapter Fine-tuning
- Prompt-tuning

We'll analyze:
1. ROUGE Scores (ROUGE-1, ROUGE-2, ROUGE-L)
2. BLEU Scores
3. Training Efficiency (Time and Memory Usage)
4. Performance Trade-offs

In [9]:
# Import required libraries
# (plotly must be installed for the interactive charts; fig.write_image also needs kaleido)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import json
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Set style for better visualizations
# (matplotlib 3.6 renamed the 'seaborn' style to 'seaborn-v0_8')
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

# Configure plot settings
plt.rcParams['figure.figsize'] = [12, 8]
plt.rcParams['font.size'] = 12

ModuleNotFoundError: No module named 'plotly'
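
The import cell stops at the first missing package, which leaves `go`, `px`, and `make_subplots` undefined for every later cell. A minimal sketch of a pre-flight check follows; the helper name `missing_modules` and the package list are assumptions for illustration, not part of the original notebook.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be located for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Top-level packages the import cell relies on.
required = ["pandas", "numpy", "matplotlib", "seaborn", "plotly"]

missing = missing_modules(required)
if missing:
    print("Missing packages:", ", ".join(missing))
    print("Install with: pip install " + " ".join(missing))
else:
    print("All plotting dependencies are available.")
```

Running this before the imports turns a mid-cell traceback into a single actionable message.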

In [4]:
# Load evaluation results
df = pd.read_csv('../report/tables/evaluation_results.csv')

# Load detailed results
with open('../report/tables/evaluation_results_detailed.json', 'r') as f:
    detailed_results = json.load(f)

# Display basic information
print("Basic Dataset Information:")
print("-" * 30)
print(df.info())
print("\nFirst few rows:")
df.head()

Basic Dataset Information:
------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 11 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Method                  5 non-null      object 
 1   ROUGE-1                 5 non-null      float64
 2   ROUGE-2                 5 non-null      int64  
 3   ROUGE-L                 5 non-null      int64  
 4   BLEU                    5 non-null      float64
 5   Trainable Params        4 non-null      object 
 6   Trainable Params (num)  4 non-null      float64
 7   Training Time (s)       4 non-null      float64
 8   Training Time (min)     4 non-null      float64
 9   Inference Time (s)      5 non-null      float64
 10  VRAM (GB)               4 non-null      float64
dtypes: float64(7), int64(2), object(2)
memory usage: 572.0+ bytes
None

First few rows:


| | Method | ROUGE-1 | ROUGE-2 | ROUGE-L | BLEU | Trainable Params | Trainable Params (num) | Training Time (s) | Training Time (min) | Inference Time (s) | VRAM (GB) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Base Model | 3.009 | 100 | 250 | 2.24 | NaN | NaN | NaN | NaN | 0.5 | NaN |
| 1 | QLoRA | 3.912 | 190 | 320 | 2.912 | 21,000,000 (1.90%) | 21000000.0 | 12600.0 | 210.0 | 0.55 | 5.1 |
| 2 | LoRA | 3.85 | 180 | 310 | 2.81 | 21,000,000 (1.90%) | 21000000.0 | 10800.0 | 180.0 | 0.55 | 7.2 |
| 3 | Adapter (IA3) | 3.845 | 179 | 309 | 2.8 | 1,500,000 (0.14%) | 1500000.0 | 12000.0 | 200.0 | 0.52 | 8.4 |
| 4 | Prompt-tuning | 3.5 | 150 | 280 | 2.5 | 500,000 (0.05%) | 500000.0 | 7200.0 | 120.0 | 0.51 | 4.8 |
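
Plotting code looks columns up by name, so a name that is not in the CSV raises a `KeyError`. The sketch below checks a list of wanted columns against the names reported by `df.info()`; the helper `missing_columns` and the `needed_for_plots` list are hypothetical and only illustrate the idea.

```python
# Column names copied from the df.info() output above.
available_columns = [
    "Method", "ROUGE-1", "ROUGE-2", "ROUGE-L", "BLEU",
    "Trainable Params", "Trainable Params (num)",
    "Training Time (s)", "Training Time (min)",
    "Inference Time (s)", "VRAM (GB)",
]


def missing_columns(available, needed):
    """Return the needed column names that are absent, in the order requested."""
    present = set(available)
    return [name for name in needed if name not in present]


# Hypothetical list of columns a plotting cell might ask for.
needed_for_plots = ["Method", "ROUGE-L", "Training Time (min)", "VRAM (GB)", "Model"]

print(missing_columns(available_columns, needed_for_plots))  # prints ['Model']
```

In the notebook itself, `missing_columns(df.columns, needed_for_plots)` would run the same check against the loaded DataFrame.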


In [5]:
# Create ROUGE scores comparison plot
fig = go.Figure()

metrics = ['ROUGE-1', 'ROUGE-2', 'ROUGE-L']
models = df['Method'].tolist()  # the CSV names this column 'Method'

for metric in metrics:
    fig.add_trace(go.Bar(
        name=metric,
        x=models,
        y=df[metric],
        text=df[metric].round(3),
        textposition='auto',
    ))

fig.update_layout(
    title='ROUGE Scores Comparison Across Models',
    xaxis_title='Model',
    yaxis_title='Score',
    barmode='group',
    height=600,
    showlegend=True
)

fig.show()

# Save the plot
fig.write_image("../report/figures/rouge_scores_comparison.png")

NameError: name 'go' is not defined
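
In the results table, ROUGE-1 ranges from about 3 to 4 while ROUGE-2 and ROUGE-L range from 100 to 320, so ROUGE-1 bars would be hard to read on a shared y-axis. One option, shown here as a sketch and not as the notebook's own method, is to express each score as a ratio to the base model before plotting. The function name `relative_to_baseline` is an assumption.

```python
# ROUGE scores copied from the results table above.
scores = {
    "Base Model":    {"ROUGE-1": 3.009, "ROUGE-2": 100, "ROUGE-L": 250},
    "QLoRA":         {"ROUGE-1": 3.912, "ROUGE-2": 190, "ROUGE-L": 320},
    "LoRA":          {"ROUGE-1": 3.85,  "ROUGE-2": 180, "ROUGE-L": 310},
    "Adapter (IA3)": {"ROUGE-1": 3.845, "ROUGE-2": 179, "ROUGE-L": 309},
    "Prompt-tuning": {"ROUGE-1": 3.5,   "ROUGE-2": 150, "ROUGE-L": 280},
}


def relative_to_baseline(scores, baseline="Base Model"):
    """Divide every metric by the baseline's value for that metric."""
    base = scores[baseline]
    return {
        method: {metric: value / base[metric] for metric, value in metrics.items()}
        for method, metrics in scores.items()
    }


ratios = relative_to_baseline(scores)
for method, metrics in ratios.items():
    print(method, {metric: round(value, 3) for metric, value in metrics.items()})
```

A ratio of 1.0 marks the base model, so every bar above 1.0 reads directly as an improvement regardless of the metric's original scale.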

In [6]:
# Create BLEU score comparison
fig = px.bar(df, x='Method', y='BLEU',
             title='BLEU Score Comparison',
             text=df['BLEU'].round(3))

fig.update_layout(
    xaxis_title='Model',
    yaxis_title='BLEU Score',
    height=500
)

fig.show()

# Save the plot
fig.write_image("../report/figures/bleu_scores_comparison.png")

NameError: name 'px' is not defined

In [7]:
# Create training efficiency comparison
fig = make_subplots(rows=1, cols=2,
                    subplot_titles=('Training Time', 'GPU Memory Usage'))

# Training Time (the CSV stores it in seconds and minutes; Base Model has no value)
fig.add_trace(
    go.Bar(x=df['Method'], y=df['Training Time (min)'],
           text=df['Training Time (min)'].round(1),
           textposition='auto',
           name='Training Time (min)'),
    row=1, col=1
)

# GPU Memory Usage (stored in the 'VRAM (GB)' column)
fig.add_trace(
    go.Bar(x=df['Method'], y=df['VRAM (GB)'],
           text=df['VRAM (GB)'].round(1),
           textposition='auto',
           name='VRAM (GB)'),
    row=1, col=2
)

fig.update_layout(height=500, title_text="Training Efficiency Metrics",
                 showlegend=True)

fig.show()

# Save the plot
fig.write_image("../report/figures/training_efficiency.png")

NameError: name 'make_subplots' is not defined

In [8]:
# Create performance trade-off visualization
# Column names follow the CSV: 'VRAM (GB)', 'Training Time (min)', 'Method'
fig = px.scatter(df, x='VRAM (GB)', y='Training Time (min)',
                 size=[max(0.1, x) for x in df['ROUGE-L']], # Avoid zero size
                 color='Method',
                 text='Method',
                 title='Performance Trade-off Analysis',
                 labels={'VRAM (GB)': 'GPU Memory Usage (GB)',
                        'Training Time (min)': 'Training Time (min)'})

fig.update_traces(textposition='top center')
fig.update_layout(height=600)

fig.show()

# Save the plot
fig.write_image("../report/figures/performance_tradeoff.png")

NameError: name 'px' is not defined

# Summary of Findings

Based on the evaluation results:

1. **ROUGE Scores**:
   - QLoRA shows the highest ROUGE-1, ROUGE-2, and ROUGE-L scores (3.912, 190, and 320)
   - All fine-tuning methods improve on the base model (ROUGE-1 of 3.5 to 3.912 versus 3.009)
   - LoRA and Adapter (IA3) perform almost identically (ROUGE-1 of 3.85 and 3.845)

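2. **BLEU Scores**:
   - QLoRA achieves the best BLEU score (2.912)
   - The ranking matches the ROUGE ranking: QLoRA, LoRA, Adapter (IA3), Prompt-tuning, Base Model
   - Prompt-tuning shows the smallest improvement over the base model (2.5 versus 2.24)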

3. **Training Efficiency**:
   - Prompt-tuning is the fastest to train (120 min) and uses the least VRAM (4.8 GB)
   - QLoRA pairs the best scores with the second-lowest VRAM use (5.1 GB), but it is the slowest to train (210 min)
   - Adapter (IA3) needs the most VRAM (8.4 GB) while scoring on par with LoRA

4. **Overall Recommendations**:
   - QLoRA is the recommended approach when summary quality is the priority
   - Prompt-tuning is suitable for resource-constrained scenarios
   - Every fine-tuning method tested improves on the base model
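
The trade-offs summarized above can be put into numbers with a short calculation. The sketch below computes each method's ROUGE-1 gain over the base model, then divides that gain by training hours and by VRAM. The two ratio metrics and the helper `tradeoff_rows` are illustrative assumptions, not measures used in the original evaluation.

```python
# (method, ROUGE-1, training time in minutes, VRAM in GB), copied from the results table.
results = [
    ("QLoRA", 3.912, 210.0, 5.1),
    ("LoRA", 3.85, 180.0, 7.2),
    ("Adapter (IA3)", 3.845, 200.0, 8.4),
    ("Prompt-tuning", 3.5, 120.0, 4.8),
]
BASE_ROUGE1 = 3.009  # Base Model ROUGE-1 from the same table


def tradeoff_rows(results, base_score):
    """Compute the percentage gain over the base score and two efficiency ratios."""
    rows = []
    for method, score, minutes, vram_gb in results:
        gain_pct = 100 * (score - base_score) / base_score
        rows.append({
            "method": method,
            "gain_pct": gain_pct,
            "gain_per_hour": gain_pct / (minutes / 60),  # illustrative ratio
            "gain_per_gb": gain_pct / vram_gb,           # illustrative ratio
        })
    return rows


rows = sorted(tradeoff_rows(results, BASE_ROUGE1),
              key=lambda row: row["gain_pct"], reverse=True)
for row in rows:
    print(f"{row['method']:<14} gain={row['gain_pct']:5.1f}%  "
          f"per hour={row['gain_per_hour']:5.2f}  per GB={row['gain_per_gb']:5.2f}")
```

Sorting by a different key, such as `gain_per_gb`, shows how the ranking shifts when memory rather than raw score is the constraint.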