I'm using GPT-4-mini to generate reference summaries (ground_truth). 

This is an important nuance. Essentially, I'll be measuring how closely Qwen-1.7B mimics GPT-4-mini's style and content, rather than evaluating its absolute summarization capability.

This isn't necessarily bad; it's just something to keep in mind when interpreting the results.

---

#### Lexical Metric: **ROUGE**

This metric measures the overlap of n-grams (word sequences) between generated and reference texts.

- ROUGE-1: Unigram overlap (single words). Indicates how well key terms are preserved.
- ROUGE-2: Bigram overlap (word pairs). Evaluates retention of short phrases.
- ROUGE-L: Based on the Longest Common Subsequence (LCS). Assesses structural similarity of sentences.
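
To make the three variants concrete, here is a minimal from-scratch sketch of how they can be computed for a single prediction/reference pair. It assumes plain lowercase whitespace tokenization and reports F1 only; the `rouge_score` package used below tokenizes differently and can apply stemming, so its numbers will not match this sketch exactly. The example sentences and all function names are made up for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def f1(matches, pred_total, ref_total):
    """Harmonic mean of precision (matches/pred_total) and recall (matches/ref_total)."""
    if matches == 0 or pred_total == 0 or ref_total == 0:
        return 0.0
    precision = matches / pred_total
    recall = matches / ref_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(prediction, reference, n):
    """ROUGE-N F1: clipped n-gram overlap between two strings."""
    pred = ngrams(prediction.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((pred & ref).values())  # Counter intersection clips repeated n-grams
    return f1(overlap, sum(pred.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(prediction, reference):
    """ROUGE-L F1: based on the longest common subsequence of tokens."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    return f1(lcs_length(pred, ref), len(pred), len(ref))


# Hypothetical example pair, not taken from the dataset
prediction = "apple shares fell after the price target cut"
reference = "apple shares dropped after morgan stanley cut its price target"

print(f"ROUGE-1: {rouge_n(prediction, reference, 1):.3f}")
print(f"ROUGE-2: {rouge_n(prediction, reference, 2):.3f}")
print(f"ROUGE-L: {rouge_l(prediction, reference):.3f}")
```

Note how ROUGE-2 drops sharply compared with ROUGE-1: most key words are shared, but only the pairs "apple shares" and "price target" appear in both texts in the same order.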

In [1]:
!pip install evaluate rouge_score transformers torch

Collecting evaluate
  Downloading evaluate-0.4.4-py3-none-any.whl.metadata (9.5 kB)
Collecting rouge_score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting absl-py (from rouge_score)
  Downloading absl_py-2.3.0-py3-none-any.whl.metadata (2.4 kB)
Downloading evaluate-0.4.4-py3-none-any.whl (84 kB)
Downloading absl_py-2.3.0-py3-none-any.whl (135 kB)
Building wheels for collected packages: rouge_score
  Building wheel for rouge_score (pyproject.toml) ... done
  Created wheel for rouge_score: filename=rouge_score-0.1.2-py3-none-any.whl size=24987 sha256=955968762dd9753e3deaafce8da72d783fbf4fb1553441073a6f42b0e024de2a
  Stored in directory: /Users/danildorofeev/Library/Caches/pip/wheels/1e/19/43/8a442dc83660ca25e163e1bd1f89919284ab0d0c1475475148
Successfully built rouge_score
Installing collected packages: a

In [4]:
import evaluate
import pandas as pd

# Load dataset
data = pd.read_csv('/Users/danildorofeev/Desktop/financial-news-summarizer/data/dataset/ready_dataset.csv')

# Extract text lists
ground_truths = data["ground_truth"].tolist()
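# Strip the empty reasoning block that Qwen emits; regex=False makes this a literal match
predictions_zero_shot = data["prediction_zero_shot"].str.replace("<think>\n\n</think>\n\n", "", regex=False).tolist()
predictions_few_shot = data["prediction_few_shot"].str.replace("<think>\n\n</think>\n\n", "", regex=False).tolist()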

In [6]:
# Load metric
rouge_metric = evaluate.load('rouge')

# Calculate for zero-shot
results_zero_shot_rouge = rouge_metric.compute(
    predictions=predictions_zero_shot,
    references=ground_truths
)

# Calculate for few-shot
results_few_shot_rouge = rouge_metric.compute(
    predictions=predictions_few_shot,
    references=ground_truths
)

print("--- ROUGE Scores ---")
print("\nZero-shot:")

for key, value in results_zero_shot_rouge.items():
    print(f"{key}: {value*100:.2f}") # values multiplied by 100 for readability

print("\nFew-shot:")
for key, value in results_few_shot_rouge.items():
    print(f"{key}: {value*100:.2f}")

--- ROUGE Scores ---

Zero-shot:
rouge1: 53.54
rouge2: 26.65
rougeL: 34.81
rougeLsum: 50.28

Few-shot:
rouge1: 53.00
rouge2: 27.51
rougeL: 34.30
rougeLsum: 49.85


**Summary:** ROUGE metrics suggest that few-shot learning did not yield significant or unambiguous improvements in summary quality. The results for both approaches are nearly identical, with minor fluctuations (roughly 0.4 to 0.9 points) likely falling within the margin of error for a 249-example sample.


**ROUGE-1 (unigram overlap)**
Zero-shot: 53.54
Few-shot: 53.00
Interpretation: The zero-shot version marginally outperforms in reproducing individual keywords from reference summaries. The 0.5-point difference is negligible.

**ROUGE-2 (bigram overlap)**
Zero-shot: 26.65
Few-shot: 27.51
Interpretation: Few-shot shows a slight improvement here (0.86 points), which is the most notable finding. This suggests the examples helped the model generate more accurate short phrases (e.g., "rocket launch" vs. disjointed mentions of "launch" and "rocket"). While modest, this is a positive signal.

**ROUGE-L (sentence structure similarity)**
Zero-shot: 34.81
Few-shot: 34.30
Interpretation: Zero-shot marginally better preserves reference-like sentence structure. Again, the difference is trivial.
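
The "margin of error" claim in the summary can be checked rather than assumed. One common option is a paired bootstrap over per-example scores: resample the examples with replacement many times and see whether the confidence interval for the mean difference contains zero. The sketch below is a minimal version of that idea. The function name and the toy score lists are hypothetical; per-example ROUGE scores would have to be computed first (as far as I know, the `evaluate` ROUGE metric can return them via `use_aggregator=False`, but treat that as an assumption to verify).

```python
import random


def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(scores_a - scores_b) over paired examples.

    Returns (observed_mean_difference, ci_lower, ci_upper).
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("scores must be paired, one per example")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    rng = random.Random(seed)

    resampled_means = []
    for _ in range(n_resamples):
        sample = rng.choices(diffs, k=n)  # resample examples with replacement
        resampled_means.append(sum(sample) / n)
    resampled_means.sort()

    lower = resampled_means[int((alpha / 2) * n_resamples)]
    upper = resampled_means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(diffs) / n, lower, upper


# Toy per-example scores, purely illustrative (NOT results from this notebook)
toy_few_shot = [0.31, 0.25, 0.40, 0.22, 0.35, 0.28, 0.19, 0.33]
toy_zero_shot = [0.29, 0.27, 0.38, 0.24, 0.30, 0.29, 0.21, 0.30]

mean_diff, ci_low, ci_high = paired_bootstrap_ci(toy_few_shot, toy_zero_shot)
print(f"mean difference: {mean_diff:.4f}, 95% CI: [{ci_low:.4f}, {ci_high:.4f}]")
```

If the interval straddles zero, the data do not support claiming that one prompting strategy is better than the other on that metric.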

#### Semantic Metric: **BERTScore**

While ROUGE fails to capture synonyms (e.g., "launch" vs. "start"), BERTScore addresses this by comparing token embeddings from generated and reference texts. It evaluates semantic similarity.
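
The core matching step can be sketched in a few lines. Each token in one text is greedily paired with its most similar token in the other text (by cosine similarity), and the similarities are averaged into precision, recall, and F1. The 2-D vectors below are hand-made toy embeddings chosen so that "launch" and "start" point in nearly the same direction; real BERTScore uses contextual embeddings from a transformer model and has extra options (such as IDF weighting and baseline rescaling) that this sketch leaves out.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as sequences of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def greedy_match_scores(pred_vectors, ref_vectors):
    """BERTScore-style precision, recall, and F1 from per-token embedding vectors."""
    # Precision: each predicted token is matched to its most similar reference token
    precision = sum(
        max(cosine(p, r) for r in ref_vectors) for p in pred_vectors
    ) / len(pred_vectors)
    # Recall: each reference token is matched to its most similar predicted token
    recall = sum(
        max(cosine(r, p) for p in pred_vectors) for r in ref_vectors
    ) / len(ref_vectors)
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical toy embeddings: "launch" and "start" are deliberately close
toy_embeddings = {
    "rocket": (0.1, 1.0),
    "launch": (1.0, 0.1),
    "start": (0.9, 0.2),
}

prediction_tokens = ["rocket", "start"]
reference_tokens = ["rocket", "launch"]

p, r, f = greedy_match_scores(
    [toy_embeddings[t] for t in prediction_tokens],
    [toy_embeddings[t] for t in reference_tokens],
)
print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

An exact-match metric would count only "rocket" as shared here, while the embedding-based score stays close to 1 because "start" is matched to "launch".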

In [5]:
!pip install evaluate bert_score sentence_transformers

Collecting bert_score
  Downloading bert_score-0.3.13-py3-none-any.whl.metadata (15 kB)
Collecting sentence_transformers
  Using cached sentence_transformers-4.1.0-py3-none-any.whl.metadata (13 kB)
Collecting scikit-learn (from sentence_transformers)
  Using cached scikit_learn-1.7.0-cp311-cp311-macosx_12_0_arm64.whl.metadata (31 kB)
Collecting scipy (from sentence_transformers)
  Downloading scipy-1.16.0-cp311-cp311-macosx_14_0_arm64.whl.metadata (61 kB)
Collecting threadpoolctl>=3.1.0 (from scikit-learn->sentence_transformers)
  Using cached threadpoolctl-3.6.0-py3-none-any.whl.metadata (13 kB)
Downloading bert_score-0.3.13-py3-none-any.whl (61 kB)
Using cached sentence_transformers-4.1.0-py3-none-any.whl (345 kB)
Using cached scikit_learn-1.7.0-cp311-cp311-macosx_12_0_arm64.whl (10.7 MB)
Downloading scipy-1.16.0-cp311-cp311-macosx_14_0_arm64.whl (20.8 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m20.8/20.8 MB[0m [31m3.2 MB/s[0m eta [36m0:00:00[0m00:01[0m00:

In [7]:
import evaluate

# Load metric
bertscore_metric = evaluate.load("bertscore")

# Calculate for zero-shot
results_zero_shot_bert = bertscore_metric.compute(
    predictions=predictions_zero_shot,
    references=ground_truths,
    lang="en",
    device="mps"
)

# Calculate for few-shot
results_few_shot_bert = bertscore_metric.compute(
    predictions=predictions_few_shot,
    references=ground_truths,
    lang="en",
    device="mps"
)

# BERTScore returns per-example precision, recall, and F1; F1 is the value of most interest here
avg_f1_zero_shot = sum(results_zero_shot_bert['f1']) / len(results_zero_shot_bert['f1'])
avg_f1_few_shot = sum(results_few_shot_bert['f1']) / len(results_few_shot_bert['f1'])

print("\n--- BERTScore (Average F1) ---")
print(f"Zero-shot: {avg_f1_zero_shot*100:.2f}")
print(f"Few-shot:  {avg_f1_few_shot*100:.2f}")

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



--- BERTScore (Average F1) ---
Zero-shot: 90.02
Few-shot:  89.69


A 0.33-point BERTScore difference (90.02 vs. 89.69) on a 249-example evaluation set is extremely small and likely not statistically significant.

**METEOR (Metric for Evaluation of Translation with Explicit Ordering)** aligns unigrams between the LLM output and the reference text, then computes precision and recall over the matched unigrams. Matching is not limited to exact words: METEOR also matches stems and leverages external linguistic databases like WordNet to match synonyms. The final score is a weighted harmonic mean of precision and recall (with recall weighted more heavily), reduced by a penalty when the matched words appear in a different order than in the reference.
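
For reference, the scoring can be written out as follows. This is the commonly cited formulation, where $m$ is the number of matched unigrams, $|h|$ and $|r|$ are the lengths of the hypothesis and reference, and $c$ is the number of contiguous matched chunks. The parameter values $\alpha = 0.9$, $\beta = 3$, $\gamma = 0.5$ are, to my knowledge, the defaults in NLTK's implementation; treat them as an assumption and check the version you use.

$$
\begin{aligned}
P &= \frac{m}{|h|}, \qquad R = \frac{m}{|r|} \\
F_{\text{mean}} &= \frac{P \cdot R}{\alpha P + (1 - \alpha) R} \\
\text{Penalty} &= \gamma \left( \frac{c}{m} \right)^{\beta} \\
\text{METEOR} &= F_{\text{mean}} \cdot (1 - \text{Penalty})
\end{aligned}
$$

With $\alpha = 0.9$ the harmonic mean leans heavily toward recall, and the penalty grows as the matches become more fragmented (more chunks for the same number of matched words).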

In [13]:
from nltk.translate.meteor_score import meteor_score
from tqdm.notebook import tqdm

zero_shot_scores = []
few_shot_scores = []

for pred, ref in tqdm(zip(predictions_zero_shot, ground_truths), total=len(predictions_zero_shot)):
    meteor_score_zero = meteor_score([ref], [pred])
    zero_shot_scores.append(meteor_score_zero)

for pred, ref in tqdm(zip(predictions_few_shot, ground_truths), total=len(predictions_few_shot)):
    meteor_score_few = meteor_score([ref], [pred])
    few_shot_scores.append(meteor_score_few)

print("Zero-shot:", sum(zero_shot_scores) / len(zero_shot_scores))
print("Few-shot:", sum(few_shot_scores) / len(few_shot_scores))


  0%|          | 0/249 [00:00<?, ?it/s]

TypeError: "reference" expects pre-tokenized reference (Iterable[str]): Headline: Morgan Stanley Cuts Apple Price Target Amid Demand Concerns

Core Essence: Morgan Stanley lowered its price target for Apple (AAPL) to $236 from $253, citing weak smartphone demand in China, marking the third price cut for the company this week.

Key Points:
- The Event: Morgan Stanley announced a price target reduction for Apple, reflecting concerns over a slowing smartphone market in China.
- Financial Metrics: Morgan Stanley's new price target for Apple is $236, down from $253. Apple's shares dropped over 2% following this announcement.
- Market Reaction: Apple shares fell more than 2% amid the news of the price cut and overall market trends.
- Key Quote or Context: Morgan Stanley noted that rising average selling prices and improved smartphone quality are lengthening replacement cycles, negatively impacting demand for new devices.
- Outlook/Next Steps: Morgan Stanley indicated that revenues from wearables and services could mitigate the negative impact of declining iPhone demand.

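The METEOR calculation failed: as the error above indicates, `meteor_score` expects pre-tokenized input (lists of tokens), while the code passed raw strings. Because of tight deadlines, this was left unfixed. As a next step, we could try computing a QAG Score or G-Eval.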
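
For anyone picking this up, the `TypeError` above points at the likely fix: tokenize both texts before calling `meteor_score`. The sketch below is untested against this dataset and makes a few assumptions: plain whitespace tokenization (a proper tokenizer would handle punctuation better), NLTK's `meteor_score(references, hypothesis)` signature taking token lists, and the WordNet data being downloadable. The helper names `tokenize`, `average_meteor`, and `average_meteor_nltk` are made up for illustration.

```python
def tokenize(text):
    """Whitespace tokenization; a simple stand-in for a proper tokenizer."""
    return text.split()


def average_meteor(predictions, references, score_fn):
    """Mean of score_fn([reference_tokens], prediction_tokens) over paired texts.

    score_fn is passed in so the pairing and tokenization logic can be checked
    without NLTK installed.
    """
    scores = [
        score_fn([tokenize(ref)], tokenize(pred))
        for pred, ref in zip(predictions, references)
    ]
    return sum(scores) / len(scores)


def average_meteor_nltk(predictions, references):
    """Average METEOR using NLTK (requires nltk and its WordNet data)."""
    import nltk
    from nltk.translate.meteor_score import meteor_score

    nltk.download("wordnet", quiet=True)  # synonym database used for matching
    return average_meteor(predictions, references, meteor_score)
```

With the lists loaded earlier, usage would look like `average_meteor_nltk(predictions_zero_shot, ground_truths)` and the same call for `predictions_few_shot`.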