[toc]

# PyTorch torchtext

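## torchtext Overview

The torchtext preprocessing pipeline:

1. Define the Field: declare how the data should be processed.
2. Define the Dataset: build the dataset. At this point each example is a word list produced by the preprocessing declared in the Field.
3. Build the vocab: create the vocabulary and, optionally, attach word embeddings.
4. Construct the Iterator, which feeds the model one batch at a time during training.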
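
The four steps can be mimicked in plain Python to see what each one contributes. The sketch below does not use torchtext at all; the helper names (`preprocess`, `build_vocab`, `batches`) and the toy sentences are made up for illustration.

```python
from collections import Counter

# Toy corpus: (text, label) pairs standing in for the rows of a CSV file.
rows = [("A good movie", 1), ("Bad plot and bad acting", 0), ("Good acting", 1)]

# Step 1: the "Field" declares how raw text is processed.
def preprocess(text):
    return text.lower().split()

# Step 2: the "Dataset" holds one preprocessed word list per example.
examples = [(preprocess(text), label) for text, label in rows]

# Step 3: the vocab maps every token to an integer index.
def build_vocab(token_lists, specials=("<unk>", "<pad>")):
    counts = Counter(tok for toks in token_lists for tok in toks)
    # Most frequent tokens first, ties broken alphabetically.
    ordered = sorted(counts, key=lambda tok: (-counts[tok], tok))
    itos = list(specials) + ordered
    stoi = {tok: i for i, tok in enumerate(itos)}
    return itos, stoi

itos, stoi = build_vocab(tokens for tokens, _ in examples)

# Step 4: the "Iterator" yields padded, numericalized batches.
def batches(examples, batch_size):
    for start in range(0, len(examples), batch_size):
        chunk = examples[start:start + batch_size]
        max_len = max(len(tokens) for tokens, _ in chunk)
        ids = [[stoi.get(t, stoi["<unk>"]) for t in tokens]
               + [stoi["<pad>"]] * (max_len - len(tokens))
               for tokens, _ in chunk]
        labels = [label for _, label in chunk]
        yield ids, labels

for ids, labels in batches(examples, batch_size=2):
    print(ids, labels)
```

Each batch is padded only to the length of its own longest example, which is the behaviour the iterators described in these notes build on.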

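### Field object

A Field object specifies how a particular field (column) of the data should be processed.

### Dataset

A Dataset defines the data source.

### Iterator

An iterator returns the processed data in the form the model needs. There are three main kinds: Iterator, BucketIterator, and BPTTIterator.

*   Iterator: the standard iterator.
*   BucketIterator: unlike the standard iterator, it batches examples of similar length together. In text processing, every batch usually has to be padded to the length of its longest sequence, so when example lengths vary widely, BucketIterator reduces the amount of padding. Separately, the `fix_length` argument of Field can truncate or pad every example to a fixed length.
*   BPTTIterator: an iterator based on BPTT (backpropagation through time), typically used for language models.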
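
To see why grouping by length saves padding, the following plain-Python sketch counts pad tokens for the same set of sentence lengths batched in two ways. The lengths are randomly generated, `count_padding` is a made-up helper, and sorting the whole list is a simplification of whatever grouping strategy the library actually applies.

```python
import random

def count_padding(lengths, batch_size):
    """Total pad tokens needed when consecutive groups form batches."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        total += sum(max(batch) - n for n in batch)
    return total

rng = random.Random(0)
lengths = [rng.randint(5, 60) for _ in range(1000)]  # made-up sentence lengths

# Standard iterator: batches follow the (random) dataset order.
plain = count_padding(lengths, batch_size=32)

# Bucketing idea: put examples of similar length in the same batch.
bucketed = count_padding(sorted(lengths), batch_size=32)

print("padding without bucketing:", plain)
print("padding with bucketing:   ", bucketed)
```

The wider the spread of lengths in the data, the larger the gap between the two numbers.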

## Field

- sequential – Whether the datatype represents sequential data. If False, no tokenization is applied. Default: True.

- use_vocab – Whether to use a Vocab object. If False, the data in this field should already be numerical. Default: True.

- init_token – A token that will be prepended to every example using this field, or None for no initial token. Default: None.

- eos_token – A token that will be appended to every example using this field, or None for no end-of-sentence token. Default: None.

- fix_length – A fixed length that all examples using this field will be padded to, or None for flexible sequence lengths. Default: None.

- dtype – The torch.dtype class that represents a batch of examples of this kind of data. Default: torch.long.

- preprocessing – The Pipeline that will be applied to examples using this field after tokenizing but before numericalizing. Many Datasets replace this attribute with a custom preprocessor. Default: None.

- postprocessing – A Pipeline that will be applied to examples using this field after numericalizing but before the numbers are turned into a Tensor. The pipeline function takes the batch as a list, and the field’s Vocab. Default: None.

- lower – Whether to lowercase the text in this field. Default: False.

- tokenize – The function used to tokenize strings using this field into sequential examples. If “spacy”, the SpaCy tokenizer is used. If a non-serializable function is passed as an argument, the field will not be able to be serialized. Default: string.split.

- tokenizer_language – The language of the tokenizer to be constructed. Various languages currently supported only in SpaCy.

- include_lengths – Whether to return a tuple of a padded minibatch and a list containing the lengths of each examples, or just a padded minibatch. Default: False.

- batch_first – Whether to produce tensors with the batch dimension first. Default: False.

- pad_token – The string token used as padding. Default: `"<pad>"`.

- unk_token – The string token used to represent OOV words. Default: `"<unk>"`.

- pad_first – Do the padding of the sequence at the beginning. Default: False.

- truncate_first – Do the truncating of the sequence at the beginning. Default: False.

- stop_words – Tokens to discard during the preprocessing step. Default: None.

- is_target – Whether this field is a target variable. Affects iteration over batches. Default: False.

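For example, the two fields used in the rest of these notes:

    import torch
    from torchtext import data  # legacy API; in torchtext 0.9-0.11 it is torchtext.legacy.data
    # Text column: tokenize, lowercase, and pad or truncate to 200 tokens.
    TEXT = data.Field(sequential=True, lower=True, fix_length=200)
    # Label column: already numeric, so no tokenization and no vocab.
    LABEL = data.Field(sequential=False, use_vocab=False)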
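
A few of the options above are easier to picture on a single sentence. The function below is a plain-Python sketch written from the parameter descriptions, not torchtext code: `process` is a hypothetical helper, it handles one example at a time (the library pads per batch), and counting the init/eos tokens toward `fix_length` is an assumption of this sketch.

```python
def process(text, lower=False, tokenize=str.split, init_token=None,
            eos_token=None, fix_length=None, pad_token="<pad>",
            pad_first=False, truncate_first=False):
    """Tokenize one string, then truncate and pad it to fix_length tokens."""
    if lower:
        text = text.lower()
    tokens = tokenize(text)
    if fix_length is not None:
        # Assumption of this sketch: init/eos tokens count toward fix_length.
        room = fix_length - sum(tok is not None for tok in (init_token, eos_token))
        room = max(room, 0)
        if truncate_first:
            tokens = tokens[max(0, len(tokens) - room):]  # keep the tail
        else:
            tokens = tokens[:room]                        # keep the head
    if init_token is not None:
        tokens = [init_token] + tokens
    if eos_token is not None:
        tokens = tokens + [eos_token]
    if fix_length is None:
        return tokens
    padding = [pad_token] * max(0, fix_length - len(tokens))
    return padding + tokens if pad_first else tokens + padding

print(process("The cat sat", lower=True, fix_length=5))
print(process("a b c d e f", fix_length=4, truncate_first=True))
print(process("a b", fix_length=4, pad_first=True))
```

With `fix_length=None` the function returns the tokens unpadded, mirroring the "flexible sequence lengths" default.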

## Dataset

### TabularDataset

    # We do not need the 'PhraseId' and 'SentenceId' columns, so their fields are None.
    # If the file has a header row (here 'Phrase', 'Sentiment', ...), set
    # skip_header=True; otherwise the header is treated as a data row.
    train, val = data.TabularDataset.splits(
        path='.', train='train.csv', validation='val.csv', format='csv', skip_header=True,
        fields=[('PhraseId', None), ('SentenceId', None), ('Phrase', TEXT), ('Sentiment', LABEL)])

    test = data.TabularDataset('test.tsv', format='tsv', skip_header=True,
        fields=[('PhraseId', None), ('SentenceId', None), ('Phrase', TEXT)])


Inspect a single example:

    print(train[5])
    print(train[5].__dict__.keys())
    print(train[5].Phrase, train[5].Sentiment)


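The output shows the sixth example (index 5), which is an Example object. An Example bundles all the attributes of one row. The sentence has already been tokenized, but the tokens have not yet been converted to numbers.

That is because we have not built the vocab yet; we do that in the next step.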
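
The way a `fields` list turns rows into Example-like objects can be sketched without torchtext. In the snippet below the CSV contents are made up, `load` is a hypothetical helper, and plain functions stand in for Field objects; only the column layout follows the example above.

```python
import csv
import io
from types import SimpleNamespace

# Made-up file contents with the same column layout as the example above.
raw = (
    "PhraseId,SentenceId,Phrase,Sentiment\n"
    "1,1,A series of escapades,1\n"
    "2,1,A series,2\n"
)

# (attribute name, processing function); None means "ignore this column".
fields = [
    ("PhraseId", None),
    ("SentenceId", None),
    ("Phrase", lambda s: s.lower().split()),
    ("Sentiment", int),
]

def load(text, fields, skip_header=True):
    reader = csv.reader(io.StringIO(text))
    if skip_header:
        next(reader)  # otherwise the column names become the first example
    examples = []
    for row in reader:
        example = SimpleNamespace()
        for (name, func), value in zip(fields, row):
            if func is not None:
                setattr(example, name, func(value))
        examples.append(example)
    return examples

train_rows = load(raw, fields)
print(train_rows[0].__dict__.keys())
print(train_rows[0].Phrase, train_rows[0].Sentiment)
```

As in the notebook output, each object carries tokenized text under the attribute names given in `fields`, and the ignored columns are simply absent.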

torchtext can map tokens to numbers, but it has to be told the full set of tokens it should cover. The following line does that:

    TEXT.build_vocab(train)

Once the vocab is built, tokens and indices can be converted in both directions:

    print(TEXT.vocab.itos[1510])    # index -> token
    print(TEXT.vocab.stoi['bore'])  # token -> index
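
The two lookup tables are simple structures. Below is a minimal plain-Python sketch of how such tables could be built, with special tokens first and the remaining tokens ordered by frequency. `build_vocab`, `lookup`, the `min_freq` cutoff and the toy corpus are all illustrative assumptions, not the library's implementation.

```python
from collections import Counter

def build_vocab(token_lists, specials=("<unk>", "<pad>"), min_freq=1):
    counts = Counter(tok for toks in token_lists for tok in toks)
    # Keep tokens seen at least min_freq times, most frequent first.
    words = sorted((t for t in counts if counts[t] >= min_freq),
                   key=lambda t: (-counts[t], t))
    itos = list(specials) + words              # index -> string
    stoi = {t: i for i, t in enumerate(itos)}  # string -> index
    return itos, stoi

corpus = [["the", "movie", "was", "a", "bore"],
          ["the", "plot", "was", "thin"]]
itos, stoi = build_vocab(corpus)

def lookup(token):
    # Tokens outside the vocab fall back to the index of "<unk>".
    return stoi.get(token, stoi["<unk>"])

print(itos)
print(lookup("bore"), lookup("unseen-word"))
```

Raising `min_freq` shrinks the vocab, and every token that drops out is then mapped to `<unk>`.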

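When the vocab is built, `<unk>` gets index 0 by default and the other tokens are added after it. A label field that goes through a vocab therefore numbers its classes from 1, which is why some code contains an index-alignment step like this one:

    # Here the fields are named 'text' and 'label'; data_iter is any torchtext iterator.
    for batch in data_iter:
        feature, target = batch.text, batch.label
        feature.t_(), target.sub_(1)  # transpose to batch first; shift labels to start at 0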
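
The effect of subtracting 1 from the labels can be checked on a tiny example. The label names and the three-entry label vocab below are made up; the only point is that a leading `<unk>` entry pushes the real classes to indices 1 and up.

```python
# A label "vocab" built like a word vocab: "<unk>" occupies index 0.
label_itos = ["<unk>", "negative", "positive"]
label_stoi = {name: i for i, name in enumerate(label_itos)}

raw_labels = ["positive", "negative", "positive"]
numericalized = [label_stoi[name] for name in raw_labels]  # values in 1..2

# Shift so that class ids start at 0, as class-index losses expect.
aligned = [i - 1 for i in numericalized]

print(numericalized)  # [2, 1, 2]
print(aligned)        # [1, 0, 1]
```

A field declared with `use_vocab=False`, like `LABEL` in these notes, keeps the numbers from the file as they are, so no shift is involved there.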


### Custom Dataset

    import random

    import numpy as np
    import pandas as pd
    from torchtext import data
    from tqdm import tqdm


    class MyDataset(data.Dataset):
        def __init__(self, csv_path, text_field, label_field, test=False, aug=False, **kwargs):
            csv_data = pd.read_csv(csv_path)

            # How each column is processed; the "id" column is ignored.
            fields = [("id", None), ("text", text_field), ("label", label_field)]

            examples = []
            if test:
                # Test set: do not load labels.
                for text in tqdm(csv_data['text']):
                    examples.append(data.Example.fromlist([None, text, None], fields))
            else:
                for text, label in tqdm(zip(csv_data['text'], csv_data['label'])):
                    # Data augmentation: randomly drop words or shuffle them.
                    if aug:
                        rate = random.random()
                        if rate > 0.5:
                            text = self.dropout(text)
                        else:
                            text = self.shuffle(text)
                    examples.append(data.Example.fromlist([None, text, label], fields))

            # Preprocessing is done; call the parent constructor to get a standard Dataset.
            # super(MyDataset, self).__init__(examples, fields, **kwargs)
            super(MyDataset, self).__init__(examples, fields)

        def shuffle(self, text):
            # Put the words of the sequence in random order.
            text = np.random.permutation(text.strip().split())
            return ' '.join(text)

        def dropout(self, text, p=0.5):
            # Randomly blank out some of the words.
            text = text.strip().split()
            len_ = len(text)
            indexs = np.random.choice(len_, int(len_ * p))
            for i in indexs:
                text[i] = ''
            return ' '.join(text)
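
The two augmentations can also be tried on their own. The standalone functions below are a variant, not a copy, of the class methods: they use the standard `random` module with an explicit seed, and `drop_words` samples positions without replacement and removes the words outright, whereas the method above samples with replacement and leaves empty strings behind. The function names and the sample sentence are made up.

```python
import random

def shuffle_words(text, rng):
    """Return the words of text in random order."""
    words = text.strip().split()
    rng.shuffle(words)
    return " ".join(words)

def drop_words(text, rng, p=0.5):
    """Remove int(len * p) randomly chosen words from text."""
    words = text.strip().split()
    n_drop = int(len(words) * p)
    # Sample positions without replacement so exactly n_drop words go.
    dropped = set(rng.sample(range(len(words)), n_drop))
    return " ".join(w for i, w in enumerate(words) if i not in dropped)

rng = random.Random(42)
sentence = "the quick brown fox jumps over the lazy dog"
print(shuffle_words(sentence, rng))
print(drop_words(sentence, rng, p=0.3))
```

Passing the random generator in explicitly makes an augmented dataset reproducible from a seed.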

## Iterator

    DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

    train_iter = data.BucketIterator(train, batch_size=128, sort_key=lambda x: len(x.Phrase),
                                     shuffle=True, device=DEVICE)

    val_iter = data.BucketIterator(val, batch_size=128, sort_key=lambda x: len(x.Phrase),
                                   shuffle=True, device=DEVICE)

    # For test_iter, sort must be False; otherwise torchtext changes the order of the examples.
    test_iter = data.Iterator(dataset=test, batch_size=128, train=False,
                              sort=False, device=DEVICE)


The attribute names on each batch are the names we set in `fields` earlier:

    for batch in train_iter:
        text = batch.Phrase  # named `text` so the torchtext `data` module is not shadowed
        label = batch.Sentiment
        print(batch.Phrase.shape)
        print(batch.Phrase)
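
Because `batch_first` defaults to False, the printed shape has the sequence dimension first and the batch dimension second. The nested-list sketch below shows the two layouts and how a transpose switches between them; the token ids, the pad id of 1 and the helper names are made up for illustration.

```python
def pad_batch(token_id_lists, pad_id=1):
    """Pad every example to the length of the longest one."""
    max_len = max(len(ids) for ids in token_id_lists)
    return [ids + [pad_id] * (max_len - len(ids)) for ids in token_id_lists]

def transpose(matrix):
    return [list(column) for column in zip(*matrix)]

batch = [[5, 8, 2], [7, 3], [4, 9, 6, 2]]      # three made-up examples

batch_major = pad_batch(batch)                 # shape: (batch, seq_len)
time_major = transpose(batch_major)            # shape: (seq_len, batch)

print(len(batch_major), len(batch_major[0]))   # 3 4
print(len(time_major), len(time_major[0]))     # 4 3
print(time_major[0])                           # first token of every example
```

In the sequence-first layout, row `t` holds token `t` of every example in the batch, which is the layout recurrent layers in PyTorch expect by default.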

### API

#### vocab

#### numericalize

`Field.numericalize([['eward', 'elric']])` converts a batch of tokenized examples into a tensor of vocabulary indices, not a one-hot encoding.
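
Numericalizing means looking up each token's integer index; a one-hot vector would be a further step on top of that. The plain-Python sketch below shows both side by side, using a made-up four-entry vocab and hypothetical helper functions rather than the torchtext method itself.

```python
stoi = {"<unk>": 0, "<pad>": 1, "eward": 2, "elric": 3}  # made-up vocab

def numericalize(batch, stoi):
    """Map each token of each example to its vocabulary index."""
    return [[stoi.get(tok, stoi["<unk>"]) for tok in example] for example in batch]

def one_hot(index, size):
    """Vector of zeros with a single 1 at position index."""
    return [1 if i == index else 0 for i in range(size)]

ids = numericalize([["eward", "elric"]], stoi)
print(ids)                                      # [[2, 3]]
print([one_hot(i, len(stoi)) for i in ids[0]])  # [[0, 0, 1, 0], [0, 0, 0, 1]]
```

Index tensors are what an embedding layer consumes, so the one-hot form is rarely materialized in practice.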

# References
1. [TorchText用法示例及完整代码_nlpuser的博客-CSDN博客](https://blog.csdn.net/nlpuser/article/details/88067167)
2. [Torchtext使用教程 文本数据处理 - 林震宇 - 博客园](https://www.cnblogs.com/linzhenyu/p/13277552.html)
3. [Torchtext使用教程_NLP Tutorial-CSDN博客](https://blog.csdn.net/JWoswin/article/details/92821752)