## Text Classification Dataset: emotion

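The first step when practicing text classification is to get some labeled text data. Such a dataset generally needs two columns:
- The first column is the text (one sentence).
- The second column is the label: whether the sentence expresses positive or negative sentiment, which can be encoded as `0` and `1`. It can also cover several emotions, such as anger, disgust, fear, joy, sadness, and surprise.
  > The `emotion` dataset on HuggingFace has 6 emotion classes: sadness, joy, love, anger, fear, and surprise.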

In [1]:
import torch
import datasets
import huggingface_hub

### 1. Loading a Dataset from HuggingFace

#### 1.1 Loading the dataset with load_dataset

In [2]:
# The first call downloads the data from the Hub
# Later calls reuse the local cache, usually under ~/.cache/huggingface/modules/datasets_modules/datasets/emotion
ds = datasets.load_dataset("emotion")
ds

Using the latest cached version of the module from /Users/alex.zhou/.cache/huggingface/modules/datasets_modules/datasets/emotion/cca5efe2dfeb58c1d098e0f9eeb200e9927d889b5a03c67097275dfb5fe463bd (last modified on Fri May 31 14:39:28 2024) since it couldn't be found locally at emotion, or remotely on the Hugging Face Hub.


DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 16000
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 2000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 2000
    })
})

In [3]:
# Inspect the keys and the type of the dataset object
ds.keys(), type(ds)

(dict_keys(['train', 'validation', 'test']), datasets.dataset_dict.DatasetDict)

In [4]:
print(ds["train"])
print(type(ds["test"]))

Dataset({
    features: ['text', 'label'],
    num_rows: 16000
})
<class 'datasets.arrow_dataset.Dataset'>


> A `DatasetDict` object is similar to a Python dictionary: each key maps to a separate split (a `Dataset`).

#### 1.2 Basic dataset operations

**Get one split of the dataset (by key):**

In [5]:
# Training split
train_ds = ds["train"]
train_ds

Dataset({
    features: ['text', 'label'],
    num_rows: 16000
})

In [6]:
# Number of rows in the split
len(train_ds)

16000

In [7]:
# Type of the split
type(train_ds)

datasets.arrow_dataset.Dataset

**View the columns/keys of the data:**

In [8]:
train_ds.column_names

['text', 'label']

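The `features` attribute of a `Dataset` object shows which data types are used under the hood.
> The datasets library is built on `Apache Arrow`, which defines a typed columnar format that uses memory more efficiently than native Python objects.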

In [9]:
train_ds.features

{'text': Value(dtype='string', id=None),
 'label': ClassLabel(names=['sadness', 'joy', 'love', 'anger', 'fear', 'surprise'], id=None)}

**Accessing the data:**

In [10]:
train_ds[0]

{'text': 'i didnt feel humiliated', 'label': 0}

In [11]:
train_ds[:2]

{'text': ['i didnt feel humiliated',
  'i can go from feeling so hopeless to so damned hopeful just from being around someone who cares and is awake'],
 'label': [0, 0]}
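
Note the difference between the two results: a single index returns one row as a dict, while a slice returns a dict of columns. If row-wise records are needed, the columns can be zipped back together. A minimal sketch in plain Python, using the slice output above copied as a literal:

```python
# The slice result from above, copied as a plain dict of columns
batch = {
    "text": [
        "i didnt feel humiliated",
        "i can go from feeling so hopeless to so damned hopeful just from being around someone who cares and is awake",
    ],
    "label": [0, 0],
}

# Re-assemble one dict per row by zipping the columns together
rows = [dict(zip(batch.keys(), values)) for values in zip(*batch.values())]
print(rows[0])
```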

In [12]:
train_ds.__class__.__mro__

(datasets.arrow_dataset.Dataset,
 datasets.arrow_dataset.DatasetInfoMixin,
 datasets.search.IndexableMixin,
 datasets.arrow_dataset.TensorflowDatasetMixin,
 object)

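#### 1.3 Converting the output format of a DatasetDict

> The `set_format()` method of a `DatasetDict` changes the output format of the dataset, and you can switch to another format at any time.   
> Available formats: `[None, 'numpy', 'torch', 'tensorflow', 'pandas', 'arrow', 'jax']`.
> 
> To restore the default format, call the `reset_format()` method of the `DatasetDict`.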


##### 1. Converting the data to a pandas DataFrame

In [13]:
# Before converting, look at the format of a single record
ds["train"][0]

{'text': 'i didnt feel humiliated', 'label': 0}

In [14]:
type(ds["train"][0])

dict

In [15]:
ds.set_format(type="pandas")
# Without `columns`, all columns are formatted by default
# ds.set_format(type="pandas", columns=["text", "label"])

In [16]:
ds["train"][0]

| | text | label |
|---|---|---|
| 0 | i didnt feel humiliated | 0 |


In [17]:
# Check the type
type(ds["train"][0])

pandas.core.frame.DataFrame

> Before setting the format, each record is a `dict`; after setting it, each record is a `DataFrame`.

In [18]:
# Earlier we ran: train_ds = ds["train"], so train_ds picks up the new format as well
train_ds[:5]

| | text | label |
|---|---|---|
| 0 | i didnt feel humiliated | 0 |
| 1 | i can go from feeling so hopeless to so damned... | 0 |
| 2 | im grabbing a minute to post i feel greedy wrong | 3 |
| 3 | i am ever feeling nostalgic about the fireplac... | 2 |
| 4 | i am feeling grouchy | 3 |


> As we can see, `set_format()` does not change the underlying data storage (the Arrow table). We can restore the default format with `reset_format()` or switch to yet another format.

##### 2. Switching to PyTorch tensors

> Earlier we set the data format to `pandas`; now we set it to `torch` (PyTorch tensors).

In [19]:
type(ds["train"][0])

pandas.core.frame.DataFrame

In [20]:
ds.set_format(type="torch")

In [21]:
# Look at the first five rows of the training split again
ds["train"][:5]

{'text': ['i didnt feel humiliated',
  'i can go from feeling so hopeless to so damned hopeful just from being around someone who cares and is awake',
  'im grabbing a minute to post i feel greedy wrong',
  'i am ever feeling nostalgic about the fireplace i will know that it is still on the property',
  'i am feeling grouchy'],
 'label': tensor([0, 0, 3, 2, 3])}

In [22]:
type(ds["train"][0]["label"])

torch.Tensor

##### 3. Resetting the data format

In [23]:
ds.reset_format()

#### 1.4 Adding a column to the dataset

In [24]:
type(train_ds['label'])

list

We switch the data format back to `pandas` before adding the column.

In [25]:
ds.set_format("pandas")

In [26]:
type(train_ds['label'])

pandas.core.series.Series

In [27]:
train_ds

Dataset({
    features: ['text', 'label'],
    num_rows: 16000
})

In [28]:
train_ds.features

{'text': Value(dtype='string', id=None),
 'label': ClassLabel(names=['sadness', 'joy', 'love', 'anger', 'fear', 'surprise'], id=None)}

In [29]:
train_ds.features["label"]

ClassLabel(names=['sadness', 'joy', 'love', 'anger', 'fear', 'surprise'], id=None)

In [30]:
type(train_ds.features["label"])

datasets.features.features.ClassLabel

In [31]:
# ClassLabel has an `int2str` method that converts an integer id to its label name
train_ds.features["label"].int2str(2)

'love'

In [32]:
# train_dataframe = train_ds[:]
train_df = train_ds[:]

In [33]:
train_df.head()

| | text | label |
|---|---|---|
| 0 | i didnt feel humiliated | 0 |
| 1 | i can go from feeling so hopeless to so damned... | 0 |
| 2 | im grabbing a minute to post i feel greedy wrong | 3 |
| 3 | i am ever feeling nostalgic about the fireplac... | 2 |
| 4 | i am feeling grouchy | 3 |


Now we use the `int2str` method of ClassLabel to add a `label_name` column to the data.

In [34]:
train_df["label_name"] = train_df["label"].apply(lambda l: train_ds.features["label"].int2str(l))

In [35]:
train_df.head()

| | text | label | label_name |
|---|---|---|---|
| 0 | i didnt feel humiliated | 0 | sadness |
| 1 | i can go from feeling so hopeless to so damned... | 0 | sadness |
| 2 | im grabbing a minute to post i feel greedy wrong | 3 | anger |
| 3 | i am ever feeling nostalgic about the fireplac... | 2 | love |
| 4 | i am feeling grouchy | 3 | anger |


**The new column has now been added.**

In [36]:
train_df.columns

Index(['text', 'label', 'label_name'], dtype='object')
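
An equivalent way to build the same column is to look the ids up in a plain dict with `Series.map`, which avoids calling a lambda per row. This is an alternative sketch, not what the notebook ran; the small `df` below is a stand-in for `train_df` with three rows copied from the output above.

```python
import pandas as pd

# Label names in id order, copied from the ClassLabel output above
names = ["sadness", "joy", "love", "anger", "fear", "surprise"]

# Toy frame standing in for train_df
df = pd.DataFrame({
    "text": [
        "i didnt feel humiliated",
        "im grabbing a minute to post i feel greedy wrong",
        "i am feeling grouchy",
    ],
    "label": [0, 3, 3],
})

# Series.map with a dict performs the id -> name lookup
df["label_name"] = df["label"].map(dict(enumerate(names)))
print(df)
```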

### 2. Tokenizing the text and extracting embeddings

> We use the pretrained model `bert-base-uncased` to tokenize the text and extract embeddings.

In [37]:
from transformers import BertModel, BertTokenizer

In [38]:
model_name = "bert-base-uncased"

In [39]:
# First restore the default data format
ds.reset_format()

#### 2.1 Tokenization

In [40]:
train_df.head()

| | text | label | label_name |
|---|---|---|---|
| 0 | i didnt feel humiliated | 0 | sadness |
| 1 | i can go from feeling so hopeless to so damned... | 0 | sadness |
| 2 | im grabbing a minute to post i feel greedy wrong | 3 | anger |
| 3 | i am ever feeling nostalgic about the fireplac... | 2 | love |
| 4 | i am feeling grouchy | 3 | anger |


First, instantiate the tokenizer:

In [41]:
tokenizer = BertTokenizer.from_pretrained(model_name)
tokenizer

BertTokenizer(name_or_path='bert-base-uncased', vocab_size=30522, model_max_length=512, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}

In [42]:
tokenizer.tokenize("I love python and transforms.")

['i', 'love', 'python', 'and', 'transforms', '.']

In [43]:
tokenizer.tokenize(train_df["text"][0])

['i', 'didn', '##t', 'feel', 'humiliated']

In [44]:
tokenizer(train_df["text"][0], max_length=15, padding="max_length")

{'input_ids': [101, 1045, 2134, 2102, 2514, 26608, 102, 0, 0, 0, 0, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]}
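
The relationship between `input_ids` and `attention_mask` in this output can be reproduced by hand. The helper below, `pad_to_length`, is a hypothetical name written for this illustration and is not part of transformers; its truncation is deliberately naive, whereas the real tokenizer keeps the closing `[SEP]` when it truncates.

```python
def pad_to_length(input_ids, max_length, pad_id=0):
    """Right-pad (or naively truncate) token ids and build the matching mask."""
    ids = input_ids[:max_length]                    # naive truncation if too long
    n_pad = max_length - len(ids)
    attention_mask = [1] * len(ids) + [0] * n_pad   # 1 = real token, 0 = padding
    return ids + [pad_id] * n_pad, attention_mask

# Token ids of "i didnt feel humiliated", including [CLS] (101) and [SEP] (102)
ids = [101, 1045, 2134, 2102, 2514, 26608, 102]
padded, mask = pad_to_length(ids, max_length=15)
print(padded)
print(mask)
```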

In [45]:
tokenizer.convert_ids_to_tokens([101, 1045, 2134, 2102, 2514, 26608, 102, 0, 0, 0])

['[CLS]',
 'i',
 'didn',
 '##t',
 'feel',
 'humiliated',
 '[SEP]',
 '[PAD]',
 '[PAD]',
 '[PAD]']
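
The `##` pieces come from WordPiece tokenization, which splits a word that is not in the vocabulary into the longest known prefix followed by continuation pieces. The toy function below sketches that greedy longest-match idea. It assumes a tiny hand-made vocabulary rather than the real `bert-base-uncased` one, and it leaves out the normalization and punctuation handling of the actual tokenizer.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one lowercase word into subwords."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink from the right
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matched: the whole word becomes unknown
        pieces.append(piece)
        start = end
    return pieces

# Tiny hypothetical vocabulary, NOT the real bert-base-uncased vocab
toy_vocab = {"i", "didn", "##t", "feel", "humiliated", "gr", "##ou", "##chy"}
tokens = [p for w in "i didnt feel grouchy".split() for p in wordpiece(w, toy_vocab)]
print(tokens)
```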

**Now we add a `tokens` column:**

In [46]:
train_df["tokens"] = train_df["text"].apply(lambda x: tokenizer.convert_ids_to_tokens(tokenizer(x).input_ids))

In [47]:
train_df.head()

| | text | label | label_name | tokens |
|---|---|---|---|---|
| 0 | i didnt feel humiliated | 0 | sadness | [[CLS], i, didn, ##t, feel, humiliated, [SEP]] |
| 1 | i can go from feeling so hopeless to so damned... | 0 | sadness | [[CLS], i, can, go, from, feeling, so, hopeles... |
| 2 | im grabbing a minute to post i feel greedy wrong | 3 | anger | [[CLS], im, grabbing, a, minute, to, post, i, ... |
| 3 | i am ever feeling nostalgic about the fireplac... | 2 | love | [[CLS], i, am, ever, feeling, nos, ##tal, ##gi... |
| 4 | i am feeling grouchy | 3 | anger | [[CLS], i, am, feeling, gr, ##ou, ##chy, [SEP]] |


Calling `tokenizer()` returns the fields `input_ids`, `token_type_ids`, and `attention_mask`. We want to add these fields to our data as well.

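#### 2.2 Adding the tokenizer output fields directly to the dataset

The `map()` method of a `datasets.dataset_dict.DatasetDict` object processes one sample at a time by default. Setting `batched=True` makes it process the data batch by batch; `batch_size=None` applies the mapped function to the whole dataset as a single batch.    
The data can also be processed with multiple processes: `num_proc=3`.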
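
To make the batched calling convention concrete, here is a plain-Python sketch of what a batched map does conceptually: the handler receives a dict of lists (one list per column) for each batch, and the columns it returns are merged into a new result. `batched_map` and `count_words` are hypothetical names for this illustration only; the library's real implementation works on Arrow tables and is considerably more involved.

```python
def batched_map(columns, fn, batch_size=1000):
    """Apply fn to successive slices of a dict-of-lists and merge the new columns."""
    n_rows = len(next(iter(columns.values())))
    new_columns = {}
    for start in range(0, n_rows, batch_size):
        # fn receives a dict of lists, one list per column, for this batch
        batch = {name: values[start:start + batch_size] for name, values in columns.items()}
        for name, values in fn(batch).items():
            new_columns.setdefault(name, []).extend(values)
    # Return a new mapping; the input columns are left untouched
    return {**columns, **new_columns}

# Hypothetical handler: adds a column with the number of words per sentence
def count_words(batch):
    return {"n_words": [len(text.split()) for text in batch["text"]]}

data = {"text": ["i feel great", "i am sad", "so happy today ok"], "label": [1, 0, 1]}
result = batched_map(data, count_words, batch_size=2)
print(result["n_words"])
```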

In [48]:
def tokenize_handler(batch):
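    # padding=True pads every sample with [PAD] (id 0) to the longest sample in the batch;
    # with max_length=N, padding="max_length", samples are padded to exactly N tokens
    # truncation=True truncates samples to the model's maximum context size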
    return tokenizer(batch["text"], padding=True, truncation=True)

In [49]:
# map() returns a new dataset with the input_ids, token_type_ids, and attention_mask fields added
ds_encode = ds.map(tokenize_handler, batched=True, batch_size=1000, num_proc=3)

Map (num_proc=3):   0%|          | 0/16000 [00:00<?, ? examples/s]

Map (num_proc=3):   0%|          | 0/2000 [00:00<?, ? examples/s]

Map (num_proc=3):   0%|          | 0/2000 [00:00<?, ? examples/s]

In [50]:
ds_encode

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 16000
    })
    validation: Dataset({
        features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 2000
    })
    test: Dataset({
        features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 2000
    })
})

In [51]:
tokenizer.model_input_names

['input_ids', 'token_type_ids', 'attention_mask']

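After running `map()`, the 3 columns returned by `tokenizer` (`input_ids`, `token_type_ids`, `attention_mask`) have been added to the resulting dataset `ds_encode`, while the original `ds` is unchanged: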

In [52]:
ds

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 16000
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 2000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 2000
    })
})

In [53]:
ds["train"].column_names, ds_encode["train"].column_names

(['text', 'label'],
 ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'])

#### 2.3 Extracting embeddings

In [54]:
model = BertModel.from_pretrained(model_name)

In [55]:
# Number of layers in the model's encoder
len(model.encoder.layer)

12

**Step 1:** Get the feature vectors for a single sentence

In [56]:
token_inputs = tokenizer(["I love python and transformer."], return_tensors="pt")
token_inputs

{'input_ids': tensor([[  101,  1045,  2293, 18750,  1998, 10938,  2121,  1012,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1]])}

In [57]:
outputs = model(**token_inputs)
type(outputs)

transformers.modeling_outputs.BaseModelOutputWithPoolingAndCrossAttentions

In [58]:
outputs.keys()

odict_keys(['last_hidden_state', 'pooler_output'])

In [59]:
last_hidden_state, pooler_output = outputs['last_hidden_state'], outputs['pooler_output']

In [60]:
type(last_hidden_state)

torch.Tensor

In [61]:
tokens = tokenizer.convert_ids_to_tokens(token_inputs["input_ids"][0])
print(tokens)
print(len(tokens))

['[CLS]', 'i', 'love', 'python', 'and', 'transform', '##er', '.', '[SEP]']
9


In [62]:
last_hidden_state.shape

torch.Size([1, 9, 768])

`last_hidden_state`, the hidden state of the last layer, has shape `[batch_size, n_tokens, hidden_dim]`.     
We have one sentence, `I love python and transformer.`, and its `tokens` have length 9 (`['[CLS]', 'i', 'love', 'python', 'and', 'transform', '##er', '.', '[SEP]']`).

`768` is the dimension of the model's hidden state.

In [63]:
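# The [CLS] vector represents the whole sentence: for text classification and sentence-level sentiment analysis, this is the feature vector that is used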
last_hidden_state[0, 0].shape

torch.Size([768])
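
For a batch of several sentences the same indexing applies along the token axis. The sketch below uses a random tensor as a stand-in for `last_hidden_state` (so no model is needed) to show how `[:, 0]` selects the `[CLS]` vector of every sentence at once:

```python
import torch

torch.manual_seed(0)
# Stand-in for model(**inputs).last_hidden_state: 2 sentences, 9 tokens, 768 dims
last_hidden_state = torch.randn(2, 9, 768)

# Index 0 on the token axis picks the [CLS] position of every sentence
cls_vectors = last_hidden_state[:, 0]
print(cls_vectors.shape)
```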

**Step 2:** Get the feature vectors for the data in batches

In [64]:
ds

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 16000
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 2000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 2000
    })
})

In [65]:
# Set the format of ds to torch
ds.set_format("torch")

In [66]:
def get_text_hidden_states(batch, device="cpu"):
    # To use an accelerator, pass the device explicitly and move the model there first, e.g.:
    # device = "cuda" if torch.cuda.is_available() else ("mps" if torch.backends.mps.is_available() else "cpu")

    # Tokenize the raw text directly; pad to the longest sample in the batch
    inputs = tokenizer(batch["text"], truncation=True, padding=True, return_tensors="pt")
    # Move the input tensors to the same device as the model
    inputs = {k: v.to(device) for k, v in inputs.items()}

    # Inference only, so gradients are not needed
    with torch.no_grad():
        last_hidden_state = model(**inputs).last_hidden_state

    # Return the [CLS] vector (token position 0) of every sample, moved back to the CPU
    return {"hidden_state": last_hidden_state[:, 0].cpu()}

In [67]:
# Use the map method of ds again; the default batch_size is 1000, here we use 500
%time ds_embeddings = ds.map(get_text_hidden_states, batched=True, batch_size=500)

# ds_encode.set_format("torch")
# %time ds_embeddings = ds_encode.map(get_text_hidden_states2, batched=True, batch_size=1000)

Map:   0%|          | 0/16000 [00:00<?, ? examples/s]

Map:   0%|          | 0/2000 [00:00<?, ? examples/s]

Map:   0%|          | 0/2000 [00:00<?, ? examples/s]

CPU times: user 20min 31s, sys: 9min 36s, total: 30min 8s
Wall time: 3min 33s


In [68]:
device = "cuda" if torch.cuda.is_available() else ("mps" if torch.backends.mps.is_available() else "cpu")

device = "cpu"
model.to(device)
device

'cpu'

In [69]:
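# %time ds_embeddings = ds.map(get_text_hidden_states, batched=True)
# Extracting embeddings on the GPU still needs work and currently raises an error:
# when a batch is fetched, the 2-D tensor data comes back as a list of tensors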

> CPU times: user 19min 22s, sys: 8min 39s, total: 28min 2s    
Wall time: 3min 34s
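
One plausible reading of the GPU note in the cell above is that pre-tokenized sequences of different lengths cannot be stacked automatically and therefore come back as a list of 1-D tensors. Whether this is the exact cause of the error in this notebook is an assumption. Under that assumption, the sketch below pads such a list into a single 2-D tensor with `pad_sequence` before moving it to the device; the token ids are copied from earlier outputs.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Variable-length id sequences, as they might come back as a list of tensors
input_ids = [torch.tensor([101, 1045, 102]), torch.tensor([101, 1045, 2514, 26608, 102])]

# Pad to the longest sequence and stack into one [batch, seq_len] tensor (0 = [PAD])
batch_ids = pad_sequence(input_ids, batch_first=True, padding_value=0)
attention_mask = (batch_ids != 0).long()

# Fall back to the CPU when no GPU is available
device = "cuda" if torch.cuda.is_available() else "cpu"
batch_ids, attention_mask = batch_ids.to(device), attention_mask.to(device)
print(batch_ids.shape)
```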

In [70]:
ds_embeddings

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'hidden_state'],
        num_rows: 16000
    })
    validation: Dataset({
        features: ['text', 'label', 'hidden_state'],
        num_rows: 2000
    })
    test: Dataset({
        features: ['text', 'label', 'hidden_state'],
        num_rows: 2000
    })
})

In [71]:
ds_embeddings["train"][0]["hidden_state"].shape

torch.Size([768])

**Now `hidden_state` can be used for text classification.**
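
As a rough sketch of what that could look like, the code below trains a single linear layer on top of fixed 768-dimensional feature vectors. It is not part of the original notebook: the features and labels are random stand-ins for the `hidden_state` and `label` columns, and the optimizer settings are arbitrary choices for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)
# Synthetic stand-ins: 64 "[CLS] vectors" of size 768 and 6 emotion classes
features = torch.randn(64, 768)
labels = torch.randint(0, 6, (64,))

classifier = nn.Linear(768, 6)  # a single linear layer on top of frozen features
optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

losses = []
for _ in range(50):
    optimizer.zero_grad()
    loss = loss_fn(classifier(features), labels)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

With the real data, `features` would come from the `hidden_state` column of `ds_embeddings["train"]`, and accuracy would be measured on the validation split.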