<a href="https://colab.research.google.com/github/Vicky-YTZ/25MSGAI/blob/main/FastRead.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [6]:
!pip install transformers datasets rouge-score accelerate

Collecting rouge-score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: rouge-score
  Building wheel for rouge-score (setup.py) ... done
  Created wheel for rouge-score: filename=rouge_score-0.1.2-py3-none-any.whl size=24934 sha256=3635b7dcb801069ea107da4d4f037521f9468d5fa3ba1f7e1c6d17696c03ba2d
  Stored in directory: /root/.cache/pip/wheels/85/9d/af/01feefbe7d55ef5468796f0c68225b6788e85d9d0a281e7a70
Successfully built rouge-score
Installing collected packages: rouge-score
Successfully installed rouge-score-0.1.2


## Environment & Setup: Baseline Model

#### 1. Load the Baseline Model (FP32)

In [7]:
from transformers import BartTokenizer, BartForConditionalGeneration
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
print("Using device:", device)

# Initialize the tokenizer and model
tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn").to(device)

print("Model loaded (FP32 baseline).")

Using device: cuda



Model loaded (FP32 baseline).


#### 2. Load Dataset

In [None]:
from datasets import load_dataset

# Load the CNN/DailyMail dataset (config 3.0.0)
dataset = load_dataset("abisee/cnn_dailymail", "3.0.0")

train_data = dataset["train"]
val_data   = dataset["validation"]
test_data  = dataset["test"]

train_data.to_csv("train.csv")
val_data.to_csv("val.csv")
test_data.to_csv("test.csv")

print("Dataset loaded:")
print("Train:", len(train_data))
print("Validation:", len(val_data))
print("Test:", len(test_data))


#### 3. ⭐⭐⭐ Summarization Pipeline (Key Focus): Single Request

In [None]:
MAX_SUMMARY_LEN = 128
NUM_BEAMS = 4

def summarize(text, max_len=MAX_SUMMARY_LEN, num_beams=NUM_BEAMS):
    """
    Single-request summarization using model.generate.
    Returns a decoded string summary.
    """
    inputs = tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
        max_length=1024
    ).to(device)
    with torch.no_grad():
        out_ids = model.generate(
            inputs["input_ids"],
            attention_mask=inputs.get("attention_mask", None),
            num_beams=num_beams,
            max_length=max_len,
            early_stopping=True
        )
    return tokenizer.decode(out_ids[0], skip_special_tokens=True)

#### 4. Build Measurement Tools (Latency / Throughput / ROUGE)

##### 4.1 ROUGE Scorer

In [None]:
from rouge_score import rouge_scorer

rouge = rouge_scorer.RougeScorer(['rouge1','rouge2','rougeL'], use_stemmer=True)

def eval_rouge(pred, ref):
    score = rouge.score(ref, pred)
    return {k: v.fmeasure for k, v in score.items()}
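
To make the reported F-measure concrete, here is a simplified, standalone illustration of what ROUGE-1 measures: clipped unigram overlap between reference and prediction, combined into precision, recall, and F1. It is a sketch, not the `rouge_score` implementation: it lowercases and splits on whitespace only, with no stemming, so its numbers will not match `eval_rouge` exactly. The name `rouge1_f` is made up for this example.

```python
from collections import Counter

def rouge1_f(reference: str, prediction: str) -> float:
    """Unigram-overlap F-measure on lowercased whitespace tokens (no stemming)."""
    ref_counts = Counter(reference.lower().split())
    pred_counts = Counter(prediction.lower().split())
    # Clipped overlap: a token counts at most as often as it appears in both texts.
    overlap = sum((ref_counts & pred_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

score = rouge1_f("the cat sat on the mat", "the cat sat")
print(round(score, 3))  # 0.667 (precision 3/3, recall 3/6)
```

As in `rouge.score(ref, pred)` above, the reference comes first and the prediction second. Swapping them exchanges precision and recall but leaves the F-measure unchanged.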


##### 4.2 Latency

In [None]:
import time

def measure_latency(text):
    # perf_counter is a monotonic, high-resolution clock intended for timing.
    start = time.perf_counter()
    _ = summarize(text)
    return time.perf_counter() - start
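
A single mean can hide slow outliers, and the first request often pays one-off costs. The sketch below is one possible way to add an untimed warm-up and report median and 95th-percentile latency alongside the mean. The helpers `timed_runs` and `latency_summary` are hypothetical names, not part of the notebook, and the demo uses a stand-in function so it runs without the model. In the notebook, `fn` would be `summarize` and `inputs` a plain Python list of article strings.

```python
import statistics
import time

def timed_runs(fn, inputs, warmup=1):
    """Call fn on each input and return per-call wall-clock times in seconds."""
    # Warm-up calls are not timed; they absorb one-off start-up costs
    # that would otherwise inflate the first measurement.
    for x in inputs[:warmup]:
        fn(x)
    times = []
    for x in inputs:
        start = time.perf_counter()
        fn(x)
        times.append(time.perf_counter() - start)
    return times

def latency_summary(times):
    """Mean, median, and nearest-rank 95th percentile of a list of timings."""
    ordered = sorted(times)
    # Nearest-rank percentile: rank = ceil(0.95 * n), done in integer arithmetic.
    rank = (95 * len(ordered) + 99) // 100
    return {
        "mean": statistics.mean(ordered),
        "p50": statistics.median(ordered),
        "p95": ordered[rank - 1],
    }

# Stand-in workload; replace the lambda with `summarize` in the notebook.
demo_times = timed_runs(lambda s: s.upper(), ["a", "bb", "ccc"])
print(latency_summary(demo_times))
```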

##### 4.3 Throughput

In [None]:
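def measure_throughput(batch_texts):
    # Requests are processed one at a time, so this is sequential throughput (req/sec).
    start = time.perf_counter()
    for text in batch_texts:
        summarize(text)
    total_time = time.perf_counter() - start
    return len(batch_texts) / total_time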

#### Test: Prepare a Small Test Set

In [None]:
SAMPLE_NUM = 50
def get_samples(ds, n=SAMPLE_NUM):
    n = min(n, len(ds))
    return ds.select(range(n))

In [None]:
small_test = get_samples(test_data, SAMPLE_NUM)
print(f"[INFO] Using {len(small_test)} samples for baseline evaluation.")

#### 🌟🌟🌟 5. Build Baseline

In [None]:
# Run the baseline model over a dataset and collect (prediction, reference) pairs
def run_baseline(dataset):
    results = []
    for item in dataset:
        article = item["article"]
        ref = item["highlights"]
        pred = summarize(article)
        results.append((pred, ref))
    return results

In [None]:
baseline_results = run_baseline(small_test)
print("Baseline inference finished.")

In [None]:
def evaluate_rouge_dataset(results):
    rouge1_scores, rouge2_scores, rougeL_scores = [], [], []
    for pred, ref in results:
        score = eval_rouge(pred, ref)
        rouge1_scores.append(score["rouge1"])
        rouge2_scores.append(score["rouge2"])
        rougeL_scores.append(score["rougeL"])

    return {
        "ROUGE-1": sum(rouge1_scores)/len(rouge1_scores),
        "ROUGE-2": sum(rouge2_scores)/len(rouge2_scores),
        "ROUGE-L": sum(rougeL_scores)/len(rougeL_scores),
    }

baseline_rouge = evaluate_rouge_dataset(baseline_results)
baseline_rouge

In [None]:
import numpy as np

latencies = [measure_latency(item["article"]) for item in small_test]
baseline_latency = np.mean(latencies)
baseline_latency

In [None]:
texts = [item["article"] for item in small_test]
baseline_throughput = measure_throughput(texts)
baseline_throughput

#### 6. Build Baseline Report

6.1 Record baseline metrics

Average latency (sec/request)

Throughput (req/sec)

ROUGE-1/2/L baseline

Generate a baseline report (as the reference point for later optimization)

In [None]:
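# Re-measure the baseline metrics and collect them into a single report.
# (Only the last expression in a cell is displayed, so the values are gathered in one dict.)
latencies = [measure_latency(item["article"]) for item in small_test]
baseline_latency = np.mean(latencies)

texts = [item["article"] for item in small_test]
baseline_throughput = measure_throughput(texts)

baseline_rouge = evaluate_rouge_dataset(baseline_results)

baseline_report = {
    "latency_sec_per_request": float(baseline_latency),
    "throughput_req_per_sec": baseline_throughput,
    **baseline_rouge,
}
baseline_report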


6.2 Record model performance across input lengths

Article length vs. latency curve
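
One possible way to prepare the data for such a curve is to group per-article latencies into input-length buckets and average within each bucket. The helper `latency_by_length` and the bucket size of 256 are assumptions for illustration. In the notebook, lengths could be token counts such as `len(tokenizer(article)["input_ids"])`, paired with the `latencies` list. Since `summarize` truncates inputs to 1024 tokens, latency would be expected to flatten beyond that length.

```python
from collections import defaultdict

def latency_by_length(lengths, latencies, bucket_size=256):
    """Average latency per input-length bucket.

    Returns a list of (bucket_start, mean_latency, count), sorted by bucket_start.
    """
    buckets = defaultdict(list)
    for length, latency in zip(lengths, latencies):
        # Map each length to the lower edge of its bucket, e.g. 300 -> 256.
        start = (length // bucket_size) * bucket_size
        buckets[start].append(latency)
    return [(start, sum(v) / len(v), len(v)) for start, v in sorted(buckets.items())]

# Made-up numbers purely to show the output shape; these are not measurements.
demo = latency_by_length([100, 200, 300, 900], [1.0, 3.0, 2.0, 4.0])
print(demo)  # [(0, 2.0, 2), (256, 2.0, 1), (768, 4.0, 1)]
```

The resulting `(bucket_start, mean_latency)` pairs can be passed directly to any plotting library as x and y values.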

## Optimization Phase

Run PTQ (Post-Training Quantization)

Test four precisions:

FP32 (baseline)

FP16

INT8

INT4

Dynamic batching: measure latency / throughput / ROUGE at batch size = 1, 4, 8, 16, and compare the results with quantization and batching combined.
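
The batch-size comparison could be organized with a small harness like the one below. It is a minimal sketch under stated assumptions: `summarize_batch` is a hypothetical callable taking a list of texts and returning a list of summaries (for BART it would typically tokenize with `padding=True`, call `model.generate`, and decode with `tokenizer.batch_decode`), and `chunk` and `sweep_batch_sizes` are illustrative names. It uses fixed-size batches over a static list; grouping requests as they arrive over time is not shown. The same harness could be rerun once per precision setting to compare quantization and batching together.

```python
import time

def chunk(items, batch_size):
    """Split a list into consecutive batches of at most batch_size items."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

def sweep_batch_sizes(summarize_batch, texts, batch_sizes=(1, 4, 8, 16)):
    """Time summarize_batch over texts for each batch size.

    summarize_batch is a hypothetical callable: list[str] -> list[str].
    Returns {batch_size: {"sec_per_batch", "req_per_sec", "outputs"}}.
    """
    if not texts:
        raise ValueError("texts must not be empty")
    report = {}
    for bs in batch_sizes:
        batches = chunk(texts, bs)
        outputs = []
        start = time.perf_counter()
        for batch in batches:
            outputs.extend(summarize_batch(batch))
        elapsed = time.perf_counter() - start
        report[bs] = {
            "sec_per_batch": elapsed / len(batches),
            "req_per_sec": len(texts) / elapsed if elapsed > 0 else float("inf"),
            # Keep predictions so ROUGE can be scored for each setting afterwards.
            "outputs": outputs,
        }
    return report

# Stand-in batch function so the sketch runs without a model.
demo = sweep_batch_sizes(lambda batch: [t[:5] for t in batch], ["alpha beta"] * 10)
print({bs: len(r["outputs"]) for bs, r in demo.items()})  # {1: 10, 4: 10, 8: 10, 16: 10}
```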