
# Time to slice and dice

Install the Transformers, Datasets, and Evaluate libraries to run this notebook.

In [24]:
!pip install datasets evaluate transformers[sentencepiece]



In [32]:
!wget "https://archive.ics.uci.edu/ml/machine-learning-databases/00462/drugsCom_raw.zip"
!unzip drugsCom_raw.zip

--2025-03-31 08:31:08--  https://archive.ics.uci.edu/ml/machine-learning-databases/00462/drugsCom_raw.zip
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified
Saving to: ‘drugsCom_raw.zip.2’

drugsCom_raw.zip.2      [        <=>         ]  41.00M  1.47MB/s    in 36s     

2025-03-31 08:31:45 (1.15 MB/s) - ‘drugsCom_raw.zip.2’ saved [42989872]

Archive:  drugsCom_raw.zip
replace drugsComTest_raw.tsv? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: drugsComTest_raw.tsv    
replace drugsComTrain_raw.tsv? [y]es, [n]o, [A]ll, [N]one, [r]ename: 

In [34]:
from datasets import load_dataset

data_files = {"train": "drugsComTrain_raw.tsv", "test": "drugsComTest_raw.tsv"}
# \t is the tab character in Python
drug_dataset = load_dataset("csv", data_files=data_files, delimiter="\t")

In [35]:
drug_sample = drug_dataset["train"].shuffle(seed=42).select(range(1000))
# Print the first few examples of the dataset
drug_sample[:3]

{'Unnamed: 0': [87571, 178045, 80482],
 'drugName': ['Naproxen', 'Duloxetine', 'Mobic'],
 'condition': ['Gout, Acute', 'ibromyalgia', 'Inflammatory Conditions'],
 'review': ['"like the previous person mention, I&#039;m a strong believer of aleve, it works faster for my gout than the prescription meds I take. No more going to the doctor for refills.....Aleve works!"',
  '"I have taken Cymbalta for about a year and a half for fibromyalgia pain. It is great\r\nas a pain reducer and an anti-depressant, however, the side effects outweighed \r\nany benefit I got from it. I had trouble with restlessness, being tired constantly,\r\ndizziness, dry mouth, numbness and tingling in my feet, and horrible sweating. I am\r\nbeing weaned off of it now. Went from 60 mg to 30mg and now to 15 mg. I will be\r\noff completely in about a week. The fibro pain is coming back, but I would rather deal with it than the side effects."',
  '"I have been taking Mobic for over a year with no side effects other than 

In [36]:
for split in drug_dataset.keys():
    assert len(drug_dataset[split]) == len(drug_dataset[split].unique("Unnamed: 0"))

In [37]:
drug_dataset = drug_dataset.rename_column(
    original_column_name="Unnamed: 0", new_column_name="patient_id"
)
drug_dataset

DatasetDict({
    train: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount'],
        num_rows: 161297
    })
    test: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount'],
        num_rows: 53766
    })
})

In [38]:
def lowercase_condition(example):
    return {"condition": example["condition"].lower()}


drug_dataset.map(lowercase_condition)

Map:   0%|          | 0/161297 [00:00<?, ? examples/s]

AttributeError: 'NoneType' object has no attribute 'lower'
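
The `AttributeError` above indicates that at least one `condition` value is `None`, which has no `.lower()` method. As a minimal sketch on hypothetical toy rows (plain Python dicts, not the real dataset), dropping those rows first makes the lowercasing step safe:

```python
# Toy rows standing in for dataset examples (hypothetical values).
rows = [
    {"condition": "ADHD"},
    {"condition": None},  # a missing value like the one that triggers the error
    {"condition": "Birth Control"},
]


def lowercase_condition(example):
    return {"condition": example["condition"].lower()}


# Keep only rows whose condition is present, then lowercase the rest.
kept = [row for row in rows if row["condition"] is not None]
lowered = [lowercase_condition(row) for row in kept]
print(lowered)
```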

In [42]:
def filter_nones(x):
    return x["condition"] is not None

In [41]:
(lambda x: x * x)(3)

9

In [40]:
(lambda base, height: 0.5 * base * height)(4, 8)

16.0

In [43]:
drug_dataset = drug_dataset.filter(lambda x: x["condition"] is not None)

In [44]:
drug_dataset = drug_dataset.map(lowercase_condition)
# Check that lowercasing worked
drug_dataset["train"]["condition"][:3]

['left ventricular dysfunction', 'adhd', 'birth control']

In [45]:
def compute_review_length(example):
    return {"review_length": len(example["review"].split())}

In [46]:
drug_dataset = drug_dataset.map(compute_review_length)
# Inspect the first training example
drug_dataset["train"][0]

{'patient_id': 206461,
 'drugName': 'Valsartan',
 'condition': 'left ventricular dysfunction',
 'review': '"It has no side effect, I take it in combination of Bystolic 5 Mg and Fish Oil"',
 'rating': 9.0,
 'date': 'May 20, 2012',
 'usefulCount': 27,
 'review_length': 17}

In [47]:
drug_dataset["train"].sort("review_length")[:3]

{'patient_id': [111469, 13653, 53602],
 'drugName': ['Ledipasvir / sofosbuvir',
  'Amphetamine / dextroamphetamine',
  'Alesse'],
 'condition': ['hepatitis c', 'adhd', 'birth control'],
 'review': ['"Headache"', '"Great"', '"Awesome"'],
 'rating': [10.0, 10.0, 10.0],
 'date': ['February 3, 2015', 'October 20, 2009', 'November 23, 2015'],
 'usefulCount': [41, 3, 0],
 'review_length': [1, 1, 1]}

In [48]:
drug_dataset = drug_dataset.filter(lambda x: x["review_length"] > 30)
print(drug_dataset.num_rows)

{'train': 138514, 'test': 46108}


In [49]:
import html

text = "I&#039;m a transformer called BERT"
html.unescape(text)

"I'm a transformer called BERT"

In [50]:
drug_dataset = drug_dataset.map(lambda x: {"review": html.unescape(x["review"])})

In [51]:
new_drug_dataset = drug_dataset.map(
    lambda x: {"review": [html.unescape(o) for o in x["review"]]}, batched=True
)
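
With `batched=True`, the function passed to `map()` receives a dictionary whose values are lists (one entry per example in the batch) and should return lists as well. A minimal sketch of that contract, using a hand-built batch with hypothetical review strings rather than the real dataset:

```python
import html

# A hand-built batch in the shape that map(..., batched=True) passes in:
# each column name maps to a list of values.
batch = {"review": ["I&#039;m fine", "5 &gt; 3"]}


def unescape_batch(examples):
    # Process the whole list at once and return a list of the same length.
    return {"review": [html.unescape(text) for text in examples["review"]]}


cleaned = unescape_batch(batch)
print(cleaned["review"])
```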

In [52]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")


def tokenize_function(examples):
    return tokenizer(examples["review"], truncation=True)

In [53]:
%time tokenized_dataset = drug_dataset.map(tokenize_function, batched=True)

Map:   0%|          | 0/46108 [00:00<?, ? examples/s]

CPU times: user 30.2 s, sys: 200 ms, total: 30.4 s
Wall time: 19.5 s


In [54]:
slow_tokenizer = AutoTokenizer.from_pretrained("bert-base-cased", use_fast=False)


def slow_tokenize_function(examples):
    return slow_tokenizer(examples["review"], truncation=True)


tokenized_dataset = drug_dataset.map(slow_tokenize_function, batched=True, num_proc=8)

Map (num_proc=8):   0%|          | 0/138514 [00:00<?, ? examples/s]

Map (num_proc=8):   0%|          | 0/46108 [00:00<?, ? examples/s]

In [56]:
def tokenize_and_split(examples):
    return tokenizer(
        examples["review"],
        truncation=True,
        max_length=128,
        return_overflowing_tokens=True,
    )

In [57]:
result = tokenize_and_split(drug_dataset["train"][0])
[len(inp) for inp in result["input_ids"]]

[128, 49]

In [58]:
# The original columns are not removed here, so the column lengths no longer match
tokenized_dataset = drug_dataset.map(tokenize_and_split, batched=True)

Map:   0%|          | 0/138514 [00:00<?, ? examples/s]

ArrowInvalid: Column 8 named input_ids expected length 1000 but got length 1463

In [59]:
tokenized_dataset = drug_dataset.map(
    tokenize_and_split, batched=True, remove_columns=drug_dataset["train"].column_names
)

Map:   0%|          | 0/138514 [00:00<?, ? examples/s]

Map:   0%|          | 0/46108 [00:00<?, ? examples/s]

In [60]:
len(tokenized_dataset["train"]), len(drug_dataset["train"])

(206772, 138514)

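The next version uses `overflow_to_sample_mapping` to expand the original columns so that they match the number of tokenized samples. This resolves the length mismatch while keeping all of the original data, and it is more flexible than the earlier `remove_columns` approach because the original columns remain available afterwards.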

In [62]:
def tokenize_and_split(examples):
    result = tokenizer(
        examples["review"],
        truncation=True,
        max_length=128,
        return_overflowing_tokens=True,
    )
    # Extract mapping between new and old indices
    sample_map = result.pop("overflow_to_sample_mapping")
    for key, values in examples.items():
        result[key] = [values[i] for i in sample_map]
    return result
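
To see what the loop over `sample_map` does, here is a minimal sketch with hand-written values. The `sample_map` list below is a hypothetical stand-in for `overflow_to_sample_mapping`, describing two input reviews where the first was split into two chunks:

```python
# Hypothetical batch with two original examples.
examples = {"patient_id": [101, 202], "rating": [9.0, 4.0]}

# Stand-in for overflow_to_sample_mapping: chunk i came from example sample_map[i].
sample_map = [0, 0, 1]

# Repeat each original value once per chunk so every column has 3 entries.
expanded = {key: [values[i] for i in sample_map] for key, values in examples.items()}
print(expanded)
```

Every column now has the same length as the list of chunks, which is the condition `map()` needs in order to accept the batch.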

In [63]:
tokenized_dataset = drug_dataset.map(tokenize_and_split, batched=True)
tokenized_dataset

Map:   0%|          | 0/138514 [00:00<?, ? examples/s]

Map:   0%|          | 0/46108 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 206772
    })
    test: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 68876
    })
})

## From `Dataset`s to `DataFrame`s and back


---

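### Pros and cons of each

| Feature | `Dataset` (Hugging Face Datasets) | `DataFrame` (Pandas) |
|---|---|---|
| **Memory efficiency** | High (columnar storage backed by Apache Arrow, with zero-copy reads) | Moderate (NumPy-backed and held entirely in memory, so usage is higher) |
| **Processing speed** | Fast batched operations (`batched=True`), with optional multiprocessing (`num_proc`) | Fast for individual operations, but slower on large data |
| **Data scale** | Suited to very large data (streaming and lazy loading) | Suited to small and medium data (limited by RAM) |
| **Operation types** | Strong at batched mapping, filtering, tokenization and other ML preprocessing | Strong at statistics, aggregation, visualization and other analysis |
| **Learning curve** | Requires familiarity with the Hugging Face API | Widely used and easy to pick up |
| **ML integration** | Works directly with Transformers and similar frameworks | Needs an extra conversion step (e.g. `to_numpy()`) |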
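#### Pros and cons of `Dataset`
- **Pros**:
  1. **Efficient memory use**: built on Apache Arrow; columnar storage and zero-copy reads keep memory usage low on large data.
  2. **Fast batched processing**: `map` and `filter` support batched operation, and combined with a fast tokenizer they can process large volumes of data in parallel.
  3. **ML friendly**: produces the format models expect (such as `input_ids`) and integrates deeply with the Hugging Face ecosystem.
  4. **Large-data support**: streaming and on-disk caching suit very large datasets.
  5. **Cross-framework compatibility**: can be converted to PyTorch, TensorFlow and other formats.
- **Cons**:
  1. Weaker statistics and analysis features; there is no equivalent of Pandas `groupby` or `value_counts`.
  2. The API is less intuitive for tasks outside machine learning.

#### Pros and cons of `DataFrame`
- **Pros**:
  1. **Flexible operations**: a rich API (such as `groupby` and `merge`) suited to complex data manipulation.
  2. **Data exploration**: `head()`, `describe()` and similar methods make quick inspection easy.
  3. **Ecosystem support**: integrates well with visualization tools, Excel and SQL.
  4. **Small-data friendly**: performance is sufficient and usage is simple on small and medium datasets.
- **Cons**:
  1. Lower memory efficiency; large datasets can exhaust memory.
  2. Batched processing is slower than with `Dataset`, especially for tasks such as tokenization.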

---

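### Why convert between them?
1. **Task-driven needs**:
   - **Data exploration**: use a `DataFrame` for statistics and distribution checks (such as counting how often each `"condition"` occurs), because Pandas provides efficient tools for this.
   - **Model training**: use a `Dataset` for batched preprocessing (such as tokenization and mapping), because it integrates better with machine learning frameworks and is more efficient.
   - Converting lets you use the most suitable tool at each stage.

2. **Limits of each tool**:
   - `Dataset` is good at batched operations and preparing data for machine learning, but lacks advanced statistics. Computing a frequency distribution, for example, needs custom logic, whereas Pandas `value_counts()` does it directly.
   - `DataFrame` is good for analysis, but its support for large data and model input formats is weaker than that of `Dataset`. Tokenizing long texts and generating `input_ids`, for example, is more efficient in a `Dataset`.

3. **Workflow optimization**:
   - In data analysis it is common to clean and explore data with Pandas and then convert it to a `Dataset` for model training. For example, this section first uses a `DataFrame` to compute frequencies and then converts the result to a `Dataset` for saving or further processing.
   - `Dataset.set_format("pandas")` and `Dataset.from_pandas()` provide flexible ways to switch; `set_format` changes only the output format and leaves the underlying data unchanged.

4. **Fit for specific scenarios**:
   - **Pandas**: convert to a `DataFrame` when you need visualization, statistics or complex feature engineering.
   - **Dataset**: convert back to a `Dataset` when you need efficient tokenization, saving to disk or uploading to the Hub.
   - For example, this section converts a `Dataset` to a `DataFrame` to compute frequencies, then converts back to a `Dataset` for later use.

5. **Familiarity and ecosystem**:
   - Pandas is the standard tool for data analysis and most users already know it. `Dataset` is a newer tool designed by Hugging Face for NLP and machine learning, and converting between them combines the strengths of both.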

---

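### Summary
- **The pros and cons of `Dataset` and `DataFrame`** make them suited to machine learning preprocessing and to data analysis, respectively.
- **The reason to convert** lies in the task at hand and the strengths of each tool: explore and analyze data with a `DataFrame`, then process it efficiently and train models with a `Dataset`. Converting freely between the two streamlines the workflow and draws on the strengths of both.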
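
As a rough illustration of the custom logic a frequency count needs without Pandas, the sketch below uses `collections.Counter` on a small hypothetical list of conditions. It is only meant to show the idea, not to replace `value_counts()`:

```python
from collections import Counter

# Hypothetical column values; in practice this would be a dataset column.
conditions = [
    "birth control",
    "depression",
    "birth control",
    "acne",
    "depression",
    "birth control",
]

# Count occurrences and list them from most to least frequent.
condition_counts = Counter(conditions).most_common()
print(condition_counts)
```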

In [67]:
drug_dataset.set_format("pandas")

In [68]:
drug_dataset["train"][:3]

| | patient_id | drugName | condition | review | rating | date | usefulCount | review_length |
|---|---|---|---|---|---|---|---|---|
| 0 | 95260 | Guanfacine | adhd | "My son is halfway through his fourth week of ... | 8.0 | April 27, 2010 | 192 | 141 |
| 1 | 92703 | Lybrel | birth control | "I used to take another oral contraceptive, wh... | 5.0 | December 14, 2009 | 17 | 134 |
| 2 | 138000 | Ortho Evra | birth control | "This is my first time using any form of birth... | 8.0 | November 3, 2015 | 10 | 89 |


In [69]:
train_df = drug_dataset["train"][:]

In [70]:
frequencies = (
    train_df["condition"]
    .value_counts()
    # Name the index and the counts explicitly so the column names are the
    # same on pandas 1.x and 2.x
    .rename_axis("condition")
    .reset_index(name="frequency")
)
frequencies.head()

| | condition | frequency |
|---|---|---|
| 0 | birth control | 27655 |
| 1 | depression | 8023 |
| 2 | acne | 5209 |
| 3 | anxiety | 4991 |
| 4 | pain | 4744 |


In [71]:
from datasets import Dataset

freq_dataset = Dataset.from_pandas(frequencies)
freq_dataset

Dataset({
    features: ['condition', 'frequency'],
    num_rows: 819
})

In [72]:
drug_dataset.reset_format()

In [73]:
drug_dataset_clean = drug_dataset["train"].train_test_split(train_size=0.8, seed=42)
# Rename the default "test" split to "validation"
drug_dataset_clean["validation"] = drug_dataset_clean.pop("test")
# Add the "test" set to our `DatasetDict`
drug_dataset_clean["test"] = drug_dataset["test"]
drug_dataset_clean

DatasetDict({
    train: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length'],
        num_rows: 110811
    })
    validation: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length'],
        num_rows: 27703
    })
    test: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length'],
        num_rows: 46108
    })
})

In [74]:
drug_dataset_clean.save_to_disk("drug-reviews")

Saving the dataset (0/1 shards):   0%|          | 0/110811 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/27703 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/46108 [00:00<?, ? examples/s]

In [75]:
from datasets import load_from_disk

drug_dataset_reloaded = load_from_disk("drug-reviews")
drug_dataset_reloaded

DatasetDict({
    train: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length'],
        num_rows: 110811
    })
    validation: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length'],
        num_rows: 27703
    })
    test: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length'],
        num_rows: 46108
    })
})

In [76]:
for split, dataset in drug_dataset_clean.items():
    dataset.to_json(f"drug-reviews-{split}.jsonl")

Creating json from Arrow format:   0%|          | 0/111 [00:00<?, ?ba/s]

Creating json from Arrow format:   0%|          | 0/28 [00:00<?, ?ba/s]

Creating json from Arrow format:   0%|          | 0/47 [00:00<?, ?ba/s]

In [77]:
!head -n 1 drug-reviews-train.jsonl

{"patient_id":89879,"drugName":"Cyclosporine","condition":"keratoconjunctivitis sicca","review":"\"I have used Restasis for about a year now and have seen almost no progress.  For most of my life I've had red and bothersome eyes. After trying various eye drops, my doctor recommended Restasis.  He said it typically takes 3 to 6 months for it to really kick in but it never did kick in.  When I put the drops in it burns my eyes for the first 30 - 40 minutes.  I've talked with my doctor about this and he said it is normal but should go away after some time, but it hasn't. Every year around spring time my eyes get terrible irritated  and this year has been the same (maybe even worse than other years) even though I've been using Restasis for a year now. The only difference I notice was for the first couple weeks, but now I'm ready to move on.\"","rating":2.0,"date":"April 20, 2013","usefulCount":69,"review_length":147}


In [78]:
data_files = {
    "train": "drug-reviews-train.jsonl",
    "validation": "drug-reviews-validation.jsonl",
    "test": "drug-reviews-test.jsonl",
}
drug_dataset_reloaded = load_dataset("json", data_files=data_files)

Generating train split: 0 examples [00:00, ? examples/s]

Generating validation split: 0 examples [00:00, ? examples/s]

Generating test split: 0 examples [00:00, ? examples/s]

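Task 1: Train a classifier to predict the patient's condition
Goal: predict the corresponding `condition` (the patient's condition) from the drug review (the `review` column). This is a multi-class classification task because `condition` has many categories (for example "birth control", "depression", and so on).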

In [80]:
pip install transformers datasets scikit-learn



In [81]:
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Load the dataset
data_files = {
    "train": "drug-reviews-train.jsonl",
    "validation": "drug-reviews-validation.jsonl",
    "test": "drug-reviews-test.jsonl",
}
drug_dataset_reloaded = load_dataset("json", data_files=data_files)

# Build the label mapping from the training split only, and add an "unknown" class
unique_conditions = drug_dataset_reloaded["train"].unique("condition")
label2id = {condition: idx for idx, condition in enumerate(unique_conditions)}
id2label = {idx: condition for condition, idx in label2id.items()}
label2id["unknown"] = len(unique_conditions)
id2label[len(unique_conditions)] = "unknown"

# Preprocessing function: map each condition to its label id
def preprocess_data(examples):
    examples["labels"] = [label2id.get(cond, label2id["unknown"]) for cond in examples["condition"]]
    return examples

dataset = drug_dataset_reloaded.map(preprocess_data, batched=True)

# Load the tokenizer and the model
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=len(label2id), id2label=id2label, label2id=label2id
)

# Tokenize
def tokenize_function(examples):
    return tokenizer(examples["review"], padding="max_length", truncation=True, max_length=128)

tokenized_dataset = dataset.map(tokenize_function, batched=True)

# Training arguments
# (recent Transformers releases rename `evaluation_strategy` to `eval_strategy`)
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
)

# Evaluation metrics
def compute_metrics(pred):
    labels = pred.label_ids
    preds = np.argmax(pred.predictions, axis=1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average="weighted")
    acc = accuracy_score(labels, preds)
    return {"accuracy": acc, "f1": f1, "precision": precision, "recall": recall}

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

# Train and evaluate
trainer.train()
test_results = trainer.evaluate(tokenized_dataset["test"])
print(test_results)

Map:   0%|          | 0/27703 [00:00<?, ? examples/s]

KeyError: 'cerebrovascular insufficiency'
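
A `KeyError` such as the one above is what plain dictionary indexing raises when a label is missing from the mapping, for example a condition that occurs in the validation split but not in the training split. The sketch below, with a hypothetical two-label mapping, contrasts indexing with the `.get()` fallback to an `unknown` class:

```python
# Hypothetical mapping built from "training" labels only.
label2id = {"adhd": 0, "birth control": 1}
label2id["unknown"] = len(label2id)  # extra class for unseen labels

unseen = "cerebrovascular insufficiency"

# Plain indexing fails for a label that is not in the mapping.
try:
    label2id[unseen]
    raised = False
except KeyError:
    raised = True

# .get() with a default maps unseen labels to the "unknown" id instead.
fallback_id = label2id.get(unseen, label2id["unknown"])
print(raised, fallback_id)
```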

In [82]:
from transformers import pipeline

# Load the summarization pipeline
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

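# Take the first 5 reviews from the training split as examples
# (slice the rows first so the whole column is not loaded)
reviews = drug_dataset_reloaded["train"][:5]["review"]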

# Generate the summaries
summaries = []
for review in reviews:
    # Set maximum and minimum lengths to keep the summaries concise
    summary = summarizer(review, max_length=50, min_length=10, do_sample=False)
    summaries.append(summary[0]["summary_text"])

# Print the original reviews and their summaries
for i, (review, summary) in enumerate(zip(reviews, summaries)):
    print(f"Review {i+1}: {review}")
    print(f"Summary {i+1}: {summary}")
    print("-" * 50)

config.json:   0%|          | 0.00/1.58k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.63G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/363 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

Device set to use cpu


Review 1: "I have used Restasis for about a year now and have seen almost no progress.  For most of my life I've had red and bothersome eyes. After trying various eye drops, my doctor recommended Restasis.  He said it typically takes 3 to 6 months for it to really kick in but it never did kick in.  When I put the drops in it burns my eyes for the first 30 - 40 minutes.  I've talked with my doctor about this and he said it is normal but should go away after some time, but it hasn't. Every year around spring time my eyes get terrible irritated  and this year has been the same (maybe even worse than other years) even though I've been using Restasis for a year now. The only difference I notice was for the first couple weeks, but now I'm ready to move on."
Summary 1: "I have used Restasis for about a year now and have seen almost no progress" "The only difference I notice was for the first couple weeks, but now I'm ready to move on"
--------------------------------------------------
Review 