# Tennis Match Report Generation

**Goal:** Generate short, factual tennis match reports from structured match statistics.

We compare three models:

1. **T5 baseline** – pre-trained `t5-small` used out-of-the-box.
2. **Fine-tuned T5** – `t5-small` fine-tuned on 40 human-written reports.
3. **GPT-4.1-mini** – API model used as a strong reference.

We evaluate all models on the same 10 test matches using:
- ROUGE
- Sentence-level cosine similarity
- An LLM-as-a-judge evaluation (accuracy, completeness, consistency, clarity, overall).

## 1. Data and task definition

We start from a small custom dataset with **50 tennis matches**.  
For each match we have:
- structured statistics (aces, double faults, first-serve %, break points, etc.)
- a short human-written report.

We split the data into **40 training** and **10 test** matches.
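
The notebook loads pre-split files and does not show how the split was made. As a rough sketch only, one repeatable way to get a 40/10 split is a seeded shuffle of the row indices. The helper name `split_indices` and the seed value are hypothetical and not taken from the original pipeline.

```python
import random

def split_indices(n_items, n_test, seed=42):
    """Return (train_idx, test_idx) from a seeded shuffle of range(n_items)."""
    if not 0 < n_test < n_items:
        raise ValueError("n_test must be between 1 and n_items - 1")
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # seeded, so the split is repeatable
    test_idx = sorted(indices[:n_test])
    train_idx = sorted(indices[n_test:])
    return train_idx, test_idx

# Assumed sizes: 50 matches in total, 10 held out for testing.
train_idx, test_idx = split_indices(n_items=50, n_test=10)
print(len(train_idx), len(test_idx))  # 40 10
```

With a pandas DataFrame, the two index lists could then be passed to `df_full.iloc[...]` to build the train and test frames.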

In [7]:
import pandas as pd

# Full cleaned dataset (50 matches)
df_full = pd.read_csv("data/tennis_matches_for_human_reports_50_clean.tsv", sep="\t")
print("Full dataset shape:", df_full.shape)

df_full.head(1)

Full dataset shape: (50, 24)


|  | tourney_name | round | surface | winner_name | loser_name | score | winner_rank | loser_rank | w_ace | w_df | ... | w_bpFaced | l_ace | l_df | l_1stIn | l_1stWon | l_2ndWon | l_bpSaved | l_bpFaced | input_text | target_text |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Australian Open | R16 | Hard | Novak Djokovic | Adrian Mannarino | 6-0 6-0 6-3 | 1.0 | 19.0 | 17.0 | 5.0 | ... | 3.0 | 1.0 | 0.0 | 34.0 | 17.0 | 8.0 | 4.0 | 11.0 | You are a sports journalist. Write a short, fa... | Novak Djokovic dominated Adrian Mannarino 6-0,... |


Below we load the pre-split training and test sets (40 train, 10 test).
Each row already contains:
- `input_text`: prompt + match statistics
- `target_text`: human-written reference report

In [8]:
train_df = pd.read_csv("data/train.tsv", sep="\t")
test_df  = pd.read_csv("data/test.tsv",  sep="\t")

print("Train samples:", len(train_df))
print("Test samples:",  len(test_df))

test_df[["input_text", "target_text"]].head(2)

Train samples: 40
Test samples: 10


|  | input_text | target_text |
|---|---|---|
| 0 | You are a sports journalist. Write a short, fa... | The fifth seed and 2021 titlist continued his ... |
| 1 | You are a sports journalist. Write a short, fa... | The Italian reached the last eight at a major ... |
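
The `input_text` column already ships with the data, so the code that assembled it is not part of this notebook. The sketch below shows one plausible way to build such a prompt from a row of statistics, following the layout of the example prompt printed later in the notebook. The function names are hypothetical, and the winner-side column names `w_1stIn`, `w_1stWon`, `w_2ndWon` and `w_bpSaved` are assumed by symmetry with the `l_` columns, since they are hidden behind the `...` in the table preview.

```python
PROMPT_HEADER = (
    "You are a sports journalist. Write a short, factual tennis match report "
    "in MAXIMUM 3 sentences.\n"
    "Use ONLY the information from the stats below. "
    "Do NOT invent any extra facts or numbers."
)

def format_player_stats(row, prefix):
    """Format one player's serve stats; prefix is 'w_' (winner) or 'l_' (loser)."""
    return "\n".join([
        f"- Aces: {row[prefix + 'ace']}",
        f"- Double faults: {row[prefix + 'df']}",
        f"- 1st serve in: {row[prefix + '1stIn']}",
        f"- 1st serve points won: {row[prefix + '1stWon']}",
        f"- 2nd serve points won: {row[prefix + '2ndWon']}",
        f"- Break points saved: {row[prefix + 'bpSaved']} out of {row[prefix + 'bpFaced']}",
    ])

def build_input_text(row):
    """Assemble the prompt from a dict-like row (a dict or a pandas Series)."""
    parts = [
        PROMPT_HEADER,
        "",
        f"Tournament: {row['tourney_name']}",
        f"Round: {row['round']}",
        f"Surface: {row['surface']}",
        "",
        f"Winner: {row['winner_name']} (rank {row['winner_rank']})",
        f"Loser: {row['loser_name']} (rank {row['loser_rank']})",
        f"Score: {row['score']}",
        "",
        "Winner stats:",
        format_player_stats(row, "w_"),
        "",
        "Loser stats:",
        format_player_stats(row, "l_"),
    ]
    return "\n".join(parts)

# Values copied from the Wimbledon example prompt shown in this notebook.
example_row = {
    "tourney_name": "Wimbledon", "round": "R16", "surface": "Grass",
    "winner_name": "Lorenzo Musetti", "winner_rank": 25.0,
    "loser_name": "Giovanni Mpetshi Perricard", "loser_rank": 58.0,
    "score": "4-6 6-3 6-3 6-2",
    "w_ace": 3.0, "w_df": 0.0, "w_1stIn": 63.0, "w_1stWon": 50.0,
    "w_2ndWon": 21.0, "w_bpSaved": 0.0, "w_bpFaced": 1.0,
    "l_ace": 10.0, "l_df": 8.0, "l_1stIn": 70.0, "l_1stWon": 47.0,
    "l_2ndWon": 20.0, "l_bpSaved": 10.0, "l_bpFaced": 15.0,
}
prompt = build_input_text(example_row)
print(prompt)
```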


## 2. Baseline model: T5-small

As a simple baseline we use the pre-trained `t5-small` model **without any fine-tuning** on our data.
We feed the `input_text` (prompt + stats) and let the model generate a report.

In [10]:
import torch
from transformers import T5ForConditionalGeneration, T5TokenizerFast



In [11]:
# Load baseline T5 model (unchanged t5-small)
baseline_model_name = "t5-small"
baseline_tokenizer = T5TokenizerFast.from_pretrained(baseline_model_name)
baseline_model = T5ForConditionalGeneration.from_pretrained(baseline_model_name)

baseline_model.eval()

def generate_t5_baseline_report(text, max_input_len=256, max_output_len=180):
    """Generate a report using the untuned T5-small baseline."""
    inputs = baseline_tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
        max_length=max_input_len,
    )

    with torch.no_grad():
        output_ids = baseline_model.generate(
            **inputs,
            max_length=max_output_len,
            num_beams=4,
            early_stopping=True,
            no_repeat_ngram_size=2,
        )

    return baseline_tokenizer.decode(output_ids[0], skip_special_tokens=True)

# Generate baseline reports for all test samples
baseline_outputs = [generate_t5_baseline_report(t) for t in test_df["input_text"]]
test_df["t5_baseline_output"] = baseline_outputs

test_df[["target_text", "t5_baseline_output"]].head(3)

|  | target_text | t5_baseline_output |
|---|---|---|
| 0 | The fifth seed and 2021 titlist continued his ... | : 36.0 - 2nd serve points won: 14.0. Break poi... |
| 1 | The Italian reached the last eight at a major ... | : 3.0 - Double faults: 0.0, 1st serve in: 50.0... |
| 2 | Lorenzo Musetti delivered arguably the Grand S... | : 63.0 - 1st serve points won: 29.0- Break poi... |


### Example match: baseline output

Below we show one full example (match 1) with:
- input prompt + stats
- human reference report
- baseline T5 report

In [13]:
example_idx = 1  # row index into test_df; set to 0 or 2 to show another match

row = test_df.iloc[example_idx]

print("=== INPUT TEXT ===")
print(row["input_text"], "\n")
print("=== TARGET TEXT ===")
print(row["target_text"], "\n")
print("=== T5 BASELINE OUTPUT ===")
print(row["t5_baseline_output"])

=== INPUT TEXT ===
You are a sports journalist. Write a short, factual tennis match report in MAXIMUM 3 sentences.
Use ONLY the information from the stats below. Do NOT invent any extra facts or numbers.

Tournament: Wimbledon
Round: R16
Surface: Grass

Winner: Lorenzo Musetti (rank 25.0)
Loser: Giovanni Mpetshi Perricard (rank 58.0)
Score: 4-6 6-3 6-3 6-2

Winner stats:
- Aces: 3.0
- Double faults: 0.0
- 1st serve in: 63.0
- 1st serve points won: 50.0
- 2nd serve points won: 21.0
- Break points saved: 0.0 out of 1.0

Loser stats:
- Aces: 10.0
- Double faults: 8.0
- 1st serve in: 70.0
- 1st serve points won: 47.0
- 2nd serve points won: 20.0
- Break points saved: 10.0 out of 15.0 

=== TARGET TEXT ===
The Italian reached the last eight at a major for the first time on Monday at Wimbledon, where he ended the run of French lucky loser Giovanni Mpetshi Perricard with a 4-6, 6-3, 6-3, 6-2 victory.
Mpetshi Perricard was celebrating his 21st birthday and entered the match high in confidence.

## 3. Fine-tuned T5 model

We fine-tuned `t5-small` on the 40 training matches (input_text → target_text).
The training code lives in a separate notebook (`t5_training.ipynb`).
Here we simply load the saved model and generate reports on the 10 test matches.

In [14]:
# Load fine-tuned T5 model from disk
ft_model_path = "./tiny_t5_tennis_report_model_clean"

ft_tokenizer = T5TokenizerFast.from_pretrained(ft_model_path)
ft_model     = T5ForConditionalGeneration.from_pretrained(ft_model_path)
ft_model.eval()

def generate_t5_finetuned_report(text, max_input_len=256, max_output_len=180):
    """Generate a report using the fine-tuned T5 model."""
    inputs = ft_tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
        max_length=max_input_len,
    )

    with torch.no_grad():
        output_ids = ft_model.generate(
            **inputs,
            max_length=max_output_len,
            num_beams=4,
            early_stopping=True,
            no_repeat_ngram_size=2,
        )

    return ft_tokenizer.decode(output_ids[0], skip_special_tokens=True)

# Generate fine-tuned outputs for all test samples
ft_outputs = [generate_t5_finetuned_report(t) for t in test_df["input_text"]]
test_df["t5_finetuned_output"] = ft_outputs

test_df[["target_text", "t5_baseline_output", "t5_finetuned_output"]].head(3)

|  | target_text | t5_baseline_output | t5_finetuned_output |
|---|---|---|---|
| 0 | The fifth seed and 2021 titlist continued his ... | : 36.0 - 2nd serve points won: 14.0. Break poi... | Daniil Medvedev beat Nuno Borges 6-0, 6-1, 6-3... |
| 1 | The Italian reached the last eight at a major ... | : 3.0 - Double faults: 0.0, 1st serve in: 50.0... | Lorenzo Musetti beat Giovanni Mpetshi Perricar... |
| 2 | Lorenzo Musetti delivered arguably the Grand S... | : 63.0 - 1st serve points won: 29.0- Break poi... | Lorenzo Musetti beat Taylor Fritz 3-6, 7-6(5),... |


### Example match: fine-tuned T5 vs baseline

We now compare the same match for baseline T5 and fine-tuned T5.

In [15]:
row = test_df.iloc[example_idx]

print("=== TARGET TEXT ===")
print(row["target_text"], "\n")

print("=== T5 BASELINE OUTPUT ===")
print(row["t5_baseline_output"], "\n")

print("=== T5 FINETUNED OUTPUT ===")
print(row["t5_finetuned_output"])

=== TARGET TEXT ===
The Italian reached the last eight at a major for the first time on Monday at Wimbledon, where he ended the run of French lucky loser Giovanni Mpetshi Perricard with a 4-6, 6-3, 6-3, 6-2 victory.
Mpetshi Perricard was celebrating his 21st birthday and entered the match high in confidence. The big-serving Frenchman defeated Sebastian Korda, Yoshihito Nishioka and Emil Ruusuvuori en route to his first fourth-round appearance at a major, hitting 105 aces across his first three matches.
The Lyon champion was unable to fire at his best level against Musetti in the pair’s first Lexus ATP Head2Head meeting. The 25th seed broke Mpetshi Perricard’s serve five times and was the more consistent in the baseline exchanges, committing just eight unforced errors compared to 42 from his opponent. 

=== T5 BASELINE OUTPUT ===
: 3.0 - Double faults: 0.0, 1st serve in: 50.0  2nd serve points won: 21.0. Break points saved: 10.0 out of 15.0 Loser stats : 1. 

=== T5 FINETUNED OUTPUT ===

## 4. GPT-4.1-mini as a strong reference model

As the third system, we query the GPT-4.1-mini API with the same `input_text` as the prompt.
We treat these reports as a strong reference system.

In [17]:
from openai import OpenAI
import time

client = OpenAI()

def generate_gpt41mini(text):
    """Call GPT-4.1-mini with the full input_text prompt."""
    resp = client.responses.create(
        model="gpt-4.1-mini",
        input=text,
    )
    return resp.output[0].content[0].text

gpt41_outputs = []
for prompt in test_df["input_text"]:
    out = generate_gpt41mini(prompt)
    gpt41_outputs.append(out)
    time.sleep(0.3)  # small pause for rate limiting

test_df["gpt41_output"] = gpt41_outputs

test_df[["target_text", "gpt41_output"]].head(1)

|  | target_text | gpt41_output |
|---|---|---|
| 0 | The fifth seed and 2021 titlist continued his ... | Daniil Medvedev defeated Nuno Borges 6-0, 6-1,... |
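
The loop above relies on a fixed 0.3 s pause to avoid rate limits. If a request still fails, the whole cell stops. A minimal sketch of a retry wrapper with exponential backoff is shown below; `call_with_retries` and its parameters are hypothetical names, and the demo uses a stand-in function instead of the real API so that it runs offline. In practice one would likely catch only the SDK's rate-limit or connection errors rather than every `Exception`.

```python
import time

def call_with_retries(fn, *args, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(*args); after a failure wait base_delay * 2**attempt, then retry."""
    for attempt in range(max_attempts):
        try:
            return fn(*args)
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)

# Stand-in for the API call: fails twice, then succeeds.
calls = {"n": 0}

def flaky_generate(prompt):
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("simulated rate limit")
    return f"report for: {prompt}"

# Record the requested delays instead of actually sleeping.
delays = []
result = call_with_retries(flaky_generate, "match stats", sleep=delays.append)
print(result)  # report for: match stats
print(delays)  # [1.0, 2.0]
```

In the notebook this could wrap the existing call as `call_with_retries(generate_gpt41mini, prompt)`.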


### Example match: all three models

We now show the same test match for all three models.

In [18]:
row = test_df.iloc[example_idx]

print("=== TARGET TEXT ===")
print(row["target_text"], "\n")

print("=== T5 BASELINE OUTPUT ===")
print(row["t5_baseline_output"], "\n")

print("=== T5 FINETUNED OUTPUT ===")
print(row["t5_finetuned_output"], "\n")

print("=== GPT-4.1-MINI OUTPUT ===")
print(row["gpt41_output"])

=== TARGET TEXT ===
The Italian reached the last eight at a major for the first time on Monday at Wimbledon, where he ended the run of French lucky loser Giovanni Mpetshi Perricard with a 4-6, 6-3, 6-3, 6-2 victory.
Mpetshi Perricard was celebrating his 21st birthday and entered the match high in confidence. The big-serving Frenchman defeated Sebastian Korda, Yoshihito Nishioka and Emil Ruusuvuori en route to his first fourth-round appearance at a major, hitting 105 aces across his first three matches.
The Lyon champion was unable to fire at his best level against Musetti in the pair’s first Lexus ATP Head2Head meeting. The 25th seed broke Mpetshi Perricard’s serve five times and was the more consistent in the baseline exchanges, committing just eight unforced errors compared to 42 from his opponent. 

=== T5 BASELINE OUTPUT ===
: 3.0 - Double faults: 0.0, 1st serve in: 50.0  2nd serve points won: 21.0. Break points saved: 10.0 out of 15.0 Loser stats : 1. 

=== T5 FINETUNED OUTPUT ===

## 5. Save combined test file with all model outputs

For further analysis and for the report, we save a single TSV file containing:
- basic match metadata,
- the prompt (`input_text`),
- the reference report,
- outputs from all three models.

In [20]:
# Select only the columns we really want in the "all models" file
cols_to_keep = [
    "tourney_name", "round", "surface",
    "winner_name", "loser_name", "score",
    "input_text", "target_text",
    "t5_baseline_output",
    "t5_finetuned_output",
    "gpt41_output",
]

df_all = test_df[cols_to_keep].copy()
df_all.to_csv("data/test_with_all_models.tsv", sep="\t", index=False)

print("Saved to data/test_with_all_models.tsv")
df_all.head(1)

Saved to data/test_with_all_models.tsv


|  | tourney_name | round | surface | winner_name | loser_name | score | input_text | target_text | t5_baseline_output | t5_finetuned_output | gpt41_output |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Us Open | R16 | Hard | Daniil Medvedev | Nuno Borges | 6-0 6-1 6-3 | You are a sports journalist. Write a short, fa... | The fifth seed and 2021 titlist continued his ... | : 36.0 - 2nd serve points won: 14.0. Break poi... | Daniil Medvedev beat Nuno Borges 6-0, 6-1, 6-3... | Daniil Medvedev defeated Nuno Borges 6-0, 6-1,... |


## 6. Automatic evaluation

We evaluate all three models on the 10 test matches using:

- **ROUGE:** overlap between generated and reference n-grams.
- **Cosine similarity:** sentence embeddings from `all-MiniLM-L6-v2`.

In [21]:
from evaluate import load

rouge = load("rouge")

refs      = test_df["target_text"].tolist()
base_out  = test_df["t5_baseline_output"].tolist()
ft_out    = test_df["t5_finetuned_output"].tolist()
gpt_out   = test_df["gpt41_output"].tolist()

rouge_base = rouge.compute(predictions=base_out, references=refs, use_stemmer=True)
rouge_ft   = rouge.compute(predictions=ft_out,   references=refs, use_stemmer=True)
rouge_gpt  = rouge.compute(predictions=gpt_out,  references=refs, use_stemmer=True)

summary_rouge = pd.DataFrame({
    "metric": ["rouge1", "rouge2", "rougeL", "rougeLsum"],
    "T5_baseline": [round(rouge_base[m], 3) for m in ["rouge1","rouge2","rougeL","rougeLsum"]],
    "T5_finetuned": [round(rouge_ft[m], 3)   for m in ["rouge1","rouge2","rougeL","rougeLsum"]],
    "GPT_4_1_mini": [round(rouge_gpt[m], 3)  for m in ["rouge1","rouge2","rougeL","rougeLsum"]],
})
summary_rouge



|  | metric | T5_baseline | T5_finetuned | GPT_4_1_mini |
|---|---|---|---|---|
| 0 | rouge1 | 0.125 | 0.393 | 0.33 |
| 1 | rouge2 | 0.051 | 0.205 | 0.151 |
| 2 | rougeL | 0.099 | 0.26 | 0.231 |
| 3 | rougeLsum | 0.109 | 0.297 | 0.26 |
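
To make the ROUGE-N numbers easier to read, here is a simplified worked example of n-gram overlap. It is only a sketch: it lowercases and splits on whitespace, with no stemming, so it will not reproduce the scores of the `evaluate` implementation used above. The function names and the two toy sentences are illustrative.

```python
from collections import Counter

def ngrams(tokens, n):
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(reference, candidate, n=1):
    """Simplified ROUGE-N: precision, recall and F1 over clipped n-gram matches."""
    ref = ngrams(reference.lower().split(), n)
    cand = ngrams(candidate.lower().split(), n)
    overlap = sum((ref & cand).values())  # each n-gram counts at most min(ref, cand) times
    if overlap == 0:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}

reference = "musetti beat mpetshi perricard in four sets"
candidate = "musetti beat mpetshi perricard"

r1 = rouge_n(reference, candidate, n=1)
r2 = rouge_n(reference, candidate, n=2)
print({k: round(v, 3) for k, v in r1.items()})  # {'precision': 1.0, 'recall': 0.571, 'f1': 0.727}
print({k: round(v, 3) for k, v in r2.items()})  # {'precision': 1.0, 'recall': 0.5, 'f1': 0.667}
```

A short candidate that only repeats words from the reference gets perfect precision but low recall, which is one reason short outputs can still score moderately against long references.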


In [22]:
from sentence_transformers import SentenceTransformer, util

model_st = SentenceTransformer("all-MiniLM-L6-v2")

def cosine_scores(refs, preds):
    emb_a = model_st.encode(refs, convert_to_tensor=True)
    emb_b = model_st.encode(preds, convert_to_tensor=True)
    return util.cos_sim(emb_a, emb_b).diagonal().cpu().tolist()

test_df["cosine_baseline"] = cosine_scores(refs, base_out)
test_df["cosine_finetuned"] = cosine_scores(refs, ft_out)
test_df["cosine_gpt"] = cosine_scores(refs, gpt_out)

summary_cosine = test_df[["cosine_baseline", "cosine_finetuned", "cosine_gpt"]].mean().round(3)
summary_cosine

cosine_baseline     0.338
cosine_finetuned    0.783
cosine_gpt          0.777
dtype: float64
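
In `cosine_scores`, `util.cos_sim` returns the full matrix of similarities between every reference and every prediction, and `.diagonal()` keeps only the entries where reference `i` is paired with prediction `i`. The sketch below computes those paired scores directly, using toy 2-dimensional vectors in place of real sentence embeddings. The function names are illustrative.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def paired_cosine(ref_vecs, pred_vecs):
    """Similarity of reference i with prediction i (the matrix diagonal)."""
    if len(ref_vecs) != len(pred_vecs):
        raise ValueError("need one prediction per reference")
    return [cosine(r, p) for r, p in zip(ref_vecs, pred_vecs)]

# Toy stand-ins for embeddings (real MiniLM embeddings are much longer).
ref_vecs = [[1.0, 0.0], [0.0, 2.0]]
pred_vecs = [[3.0, 0.0], [1.0, 1.0]]

scores = paired_cosine(ref_vecs, pred_vecs)
print([round(s, 3) for s in scores])  # [1.0, 0.707]
```

The first pair points in the same direction, so its similarity is 1.0 regardless of vector length. The second pair is 45 degrees apart, giving about 0.707.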

## 7. LLM-as-a-judge evaluation

Finally, we ask GPT-4.1-mini to score each generated report on:

- **accuracy** (facts correct, no hallucinations)
- **completeness** (covers main aspects of the match)
- **consistency** (internally coherent)
- **clarity** (easy to read and understand)
- **overall** (overall quality of the report)

Each score is on a 1–10 scale.

In [31]:
from openai import OpenAI
import json
import re

client = OpenAI()

def llm_score(reference_text, candidate_text, model_name="gpt-4.1-mini"):
    """
    Evaluates a candidate match report using an LLM.
    Score categories: accuracy, completeness, consistency, clarity, overall (1–10).
    """

    prompt = f"""
You are an expert tennis journalist and evaluation judge.
Evaluate the candidate match report.

REFERENCE REPORT:
\"\"\"{reference_text}\"\"\"

CANDIDATE REPORT:
\"\"\"{candidate_text}\"\"\"

Rate the candidate report from 1–10 in the following categories:
- accuracy
- completeness
- consistency
- clarity
- overall

Respond **ONLY** with a Python dictionary like this:
{{"accuracy": x, "completeness": y, "consistency": z, "clarity": c, "overall": o}}
    """

    # Call OpenAI
    resp = client.responses.create(
        model=model_name,
        input=prompt
    )

    raw_text = resp.output[0].content[0].text

    # Try to extract the dictionary from the LLM reply
    match = re.search(r"\{.*\}", raw_text, re.DOTALL)
    if not match:
        raise ValueError("LLM returned no dictionary! Raw output:\n" + raw_text)

    dict_text = match.group(0)

    # Convert it into a real Python dict
    try:
        scores = json.loads(dict_text)
    except json.JSONDecodeError:
        raise ValueError("Could not parse JSON. Text was:\n" + dict_text)

    return scores
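
The judge reply is free text, so parsing is the fragile step: the model may wrap the dictionary in extra prose, use single quotes (valid as a Python literal but not as JSON), omit a category, or return a value outside 1–10. The sketch below shows one way the parsing part of `llm_score` could be hardened. `parse_judge_reply` and `EXPECTED_KEYS` are hypothetical names, and the fallback to `ast.literal_eval` is an assumption about how such replies might look, not something observed in the original runs.

```python
import ast
import json
import re

EXPECTED_KEYS = ("accuracy", "completeness", "consistency", "clarity", "overall")

def parse_judge_reply(raw_text):
    """Extract and validate the score dictionary from a judge reply."""
    match = re.search(r"\{.*\}", raw_text, re.DOTALL)
    if not match:
        raise ValueError("No dictionary found in reply: " + raw_text)
    dict_text = match.group(0)

    try:
        scores = json.loads(dict_text)
    except json.JSONDecodeError:
        # Single-quoted keys are a valid Python literal but not valid JSON.
        try:
            scores = ast.literal_eval(dict_text)
        except (ValueError, SyntaxError) as err:
            raise ValueError("Could not parse scores: " + dict_text) from err

    if not isinstance(scores, dict):
        raise ValueError("Parsed value is not a dictionary: " + dict_text)

    missing = [key for key in EXPECTED_KEYS if key not in scores]
    if missing:
        raise ValueError(f"Missing categories: {missing}")

    for key in EXPECTED_KEYS:
        value = scores[key]
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if not is_number or not 1 <= value <= 10:
            raise ValueError(f"Score for {key!r} is not a number in 1-10: {value!r}")

    # Keep only the expected categories, in a fixed order.
    return {key: scores[key] for key in EXPECTED_KEYS}

reply = (
    "Sure, here are the scores:\n"
    "{'accuracy': 8, 'completeness': 6, 'consistency': 9, 'clarity': 8, 'overall': 7}"
)
print(parse_judge_reply(reply))
# {'accuracy': 8, 'completeness': 6, 'consistency': 9, 'clarity': 8, 'overall': 7}
```

Validating the keys up front keeps a malformed reply from silently producing `NaN` columns when the scores are later averaged.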

In [38]:
import pandas as pd

def evaluate_model_with_llm(df, pred_column, n_samples=None, model_name="gpt-4.1-mini"):
    """
    Evaluates one prediction column against target_text using the LLM judge.
    
    df: DataFrame with columns 'target_text' and pred_column
    pred_column: name of the column with model outputs
    n_samples: if not None, evaluate only that many random samples
    """
    if n_samples is not None:
        sub = df.sample(n_samples, random_state=42).reset_index(drop=True)
    else:
        sub = df.reset_index(drop=True)
    
    all_scores = []
    for ref, cand in zip(sub["target_text"], sub[pred_column]):
        scores = llm_score(ref, cand, model_name=model_name)
        all_scores.append(scores)
    
    scores_df = pd.DataFrame(all_scores)
    
    return scores_df

In [39]:
# If your test DataFrame has a different name or path, adjust it here
df_test = pd.read_csv("data/test_with_all_models.tsv", sep="\t")

scores_base = evaluate_model_with_llm(df_test, "t5_baseline_output", n_samples=None)
scores_ft   = evaluate_model_with_llm(df_test, "t5_finetuned_output", n_samples=None)
scores_gpt  = evaluate_model_with_llm(df_test, "gpt41_output",       n_samples=None)

In [40]:
summary_llm = pd.DataFrame({
    "metric": ["accuracy", "completeness", "consistency", "clarity", "overall"],
    "T5_baseline": scores_base.mean().values,
    "T5_finetuned": scores_ft.mean().values,
    "GPT_4.1_mini": scores_gpt.mean().values,
})

summary_llm = summary_llm.round(2)
summary_llm

|  | metric | T5_baseline | T5_finetuned | GPT_4.1_mini |
|---|---|---|---|---|
| 0 | accuracy | 1.9 | 3.8 | 8.5 |
| 1 | completeness | 1.2 | 2.5 | 6.0 |
| 2 | consistency | 1.6 | 3.7 | 9.0 |
| 3 | clarity | 1.2 | 5.2 | 8.2 |
| 4 | overall | 1.3 | 3.6 | 7.3 |
