Reference: https://github.com/bentrevett/pytorch-seq2seq/

# Data Preparation

## Datasets

We use the en-zh subset of the OPUS-100 dataset.

In [None]:
from datasets import load_dataset

ds = load_dataset("Helsinki-NLP/opus-100", "en-zh")

train_data = [x['translation'] for x in ds["train"]]
valid_data = [x['translation'] for x in ds["validation"]]
test_data = [x['translation'] for x in ds["test"]]

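If the connection to the Hugging Face Hub fails, try switching to a mirror.

Set the `HF_ENDPOINT` environment variable to a Hugging Face mirror (the first command is for Linux/macOS shells, the second for Windows PowerShell):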

export HF_ENDPOINT=https://hf-mirror.com

$env:HF_ENDPOINT = "https://hf-mirror.com"
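Inside a notebook it can be more convenient to set the variable from Python. The sketch below assumes that `HF_ENDPOINT` is read when `huggingface_hub` is imported, so it should run before `datasets` is imported; if `datasets` is already loaded, restarting the kernel first is the safe option.

```python
import os

# Set the mirror before importing `datasets` / `huggingface_hub`;
# the endpoint is assumed to be read at import time.
os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"

print(os.environ["HF_ENDPOINT"])
```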

Check that the dataset was downloaded and loaded successfully.

In [None]:
print(ds)
print(train_data[0])

## Tokenizer

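Next, we use spaCy for tokenization, that is, splitting each sentence into individual words and phrases so they can be processed and used for training.

Before tokenizing, we need to download the spaCy language models.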

In [None]:
!python -m spacy download zh_core_web_sm

!python -m spacy download en_core_web_sm

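Alternatively, install the models with pip from the GitHub release links below; installing a locally downloaded wheel with pip also works. Make sure you install into the environment the notebook is using.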

pip install https://github.com/explosion/spacy-models/releases/download/zh_core_web_sm-3.7.0/zh_core_web_sm-3.7.0-py3-none-any.whl

pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.0/en_core_web_sm-3.7.0-py3-none-any.whl

Load the models.

In [None]:
import spacy

en_nlp = spacy.load("en_core_web_sm")
zh_nlp = spacy.load("zh_core_web_sm")

Test the loaded models.

In [None]:
test_text1 = "This is amazing!"
test_text2 = "这好棒啊"

test_token1 = [token.text for token in en_nlp.tokenizer(test_text1)]
test_token2 = [token.text for token in zh_nlp.tokenizer(test_text2)]
print(test_token1)
print(test_token2)

Next, we define a tokenization function and apply it to every example in the dataset.

In [None]:
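def tokenize_en_zh(example, en_nlp, zh_nlp, max_length, lower, sos_token, eos_token):
    # Each OPUS-100 example stores the sentence pair under the "translation" key.
    pair = example["translation"]
    # Tokenize and truncate to at most max_length tokens.
    en_tokens = [token.text for token in en_nlp.tokenizer(pair["en"])][:max_length]
    zh_tokens = [token.text for token in zh_nlp.tokenizer(pair["zh"])][:max_length]
    if lower:
        # Lowercasing only affects the English side.
        en_tokens = [token.lower() for token in en_tokens]
    # Mark the start and end of each sequence.
    en_tokens = [sos_token] + en_tokens + [eos_token]
    zh_tokens = [sos_token] + zh_tokens + [eos_token]
    return {"en_tokens": en_tokens, "zh_tokens": zh_tokens}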

max_length = 1000
lower = True
sos_token = "<sos>"
eos_token = "<eos>"

fn_kwargs = {
    "en_nlp": en_nlp,
    "zh_nlp": zh_nlp,
    "max_length": max_length,
    "lower": lower,
    "sos_token": sos_token,
    "eos_token": eos_token,
}

train_data = ds["train"].map(tokenize_en_zh, fn_kwargs=fn_kwargs)
valid_data = ds["validation"].map(tokenize_en_zh, fn_kwargs=fn_kwargs)
test_data = ds["test"].map(tokenize_en_zh, fn_kwargs=fn_kwargs)
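To see what the tokenization step produces without loading spaCy, here is a minimal sketch that replaces the spaCy tokenizers with plain whitespace splitting. The function name `toy_tokenize` and the pre-segmented example sentence are made up for illustration; only the truncate, lowercase, and wrap logic mirrors the function above.

```python
def toy_tokenize(example, max_length, lower, sos_token, eos_token):
    # Whitespace splitting stands in for en_nlp.tokenizer / zh_nlp.tokenizer.
    pair = example["translation"]
    en_tokens = pair["en"].split()[:max_length]
    zh_tokens = pair["zh"].split()[:max_length]
    if lower:
        en_tokens = [t.lower() for t in en_tokens]
    return {
        "en_tokens": [sos_token] + en_tokens + [eos_token],
        "zh_tokens": [sos_token] + zh_tokens + [eos_token],
    }

# Hypothetical example in the same nested layout as the OPUS-100 records.
example = {"translation": {"en": "This is amazing !", "zh": "这 好棒 啊"}}
result = toy_tokenize(example, max_length=3, lower=True,
                      sos_token="<sos>", eos_token="<eos>")
print(result["en_tokens"])  # ['<sos>', 'this', 'is', 'amazing', '<eos>']
print(result["zh_tokens"])  # ['<sos>', '这', '好棒', '啊', '<eos>']
```

Because truncation happens before the `<sos>` and `<eos>` tokens are added, each sequence contains at most `max_length + 2` tokens.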

Inspect the tokenization result.

In [None]:
train_data[0]

## Vocabularies

Next, we build the vocabularies, which map each token to a corresponding integer index.

In [None]:
import torchtext.vocab

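min_freq = 1  # tokens that occur fewer times than this are not indexed
# Special tokens
sos_token = "<sos>"
eos_token = "<eos>"
unk_token = "<unk>"
pad_token = "<pad>"

# Use a list (not a set) so the special tokens get a fixed index order.
special_tokens = [
    unk_token,
    pad_token,
    sos_token,
    eos_token,
]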

en_vocab = torchtext.vocab.build_vocab_from_iterator(
    train_data["en_tokens"],
    min_freq=min_freq,
    specials=special_tokens,
)

zh_vocab = torchtext.vocab.build_vocab_from_iterator(
    train_data["zh_tokens"],
    min_freq=min_freq,
    specials=special_tokens,
)

# Return the <unk> index for tokens that are not in the vocabulary
en_vocab.set_default_index(en_vocab[unk_token])
zh_vocab.set_default_index(zh_vocab[unk_token])
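The idea behind the vocabulary can be sketched in plain Python. This is not torchtext's implementation, only a rough approximation under stated assumptions: special tokens are placed first, the remaining tokens are ordered by descending frequency, and ties are broken alphabetically (the tie-breaking rule is an assumption). The name `build_toy_vocab` is hypothetical.

```python
from collections import Counter

def build_toy_vocab(token_lists, min_freq, specials):
    # Count how often each token occurs across all sentences.
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    # Specials come first so their indices are fixed and known.
    itos = list(specials)
    # Remaining tokens: most frequent first, ties broken alphabetically.
    for tok, freq in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if freq >= min_freq and tok not in specials:
            itos.append(tok)
    stoi = {tok: i for i, tok in enumerate(itos)}
    return itos, stoi

sentences = [
    ["<sos>", "the", "cat", "<eos>"],
    ["<sos>", "the", "dog", "<eos>"],
]
specials = ["<unk>", "<pad>", "<sos>", "<eos>"]
itos, stoi = build_toy_vocab(sentences, min_freq=2, specials=specials)
print(itos)  # ['<unk>', '<pad>', '<sos>', '<eos>', 'the']
```

With `min_freq=2`, "cat" and "dog" occur only once and are dropped, so they would later map to `<unk>`.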

Inspect the resulting vocabularies.

In [None]:
print(en_vocab.get_itos()[:10])
print(zh_vocab.get_itos()[:10])

Next, we define a function that numericalizes the dataset, converting each token into its vocabulary index.

In [None]:
def numericalize_en_zh(example, en_vocab, zh_vocab):
    en_ids = en_vocab.lookup_indices(example["en_tokens"])
    zh_ids = zh_vocab.lookup_indices(example["zh_tokens"])
    return {"en_ids": en_ids, "zh_ids": zh_ids}

fn_kwargs = {"en_vocab": en_vocab, "zh_vocab": zh_vocab}
train_data = train_data.map(numericalize_en_zh, fn_kwargs=fn_kwargs)
valid_data = valid_data.map(numericalize_en_zh, fn_kwargs=fn_kwargs)
test_data = test_data.map(numericalize_en_zh, fn_kwargs=fn_kwargs)
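Numericalization is essentially a dictionary lookup with a fallback for unknown tokens. The following is a minimal pure-Python sketch of that behaviour; the `stoi` mapping and the helper name `lookup_indices_sketch` are illustrative stand-ins, not the torchtext API.

```python
def lookup_indices_sketch(tokens, stoi, unk_token="<unk>"):
    # Unknown tokens fall back to the index of the <unk> token,
    # which is the effect set_default_index is used for above.
    unk_index = stoi[unk_token]
    return [stoi.get(tok, unk_index) for tok in tokens]

# Hypothetical token-to-index mapping.
stoi = {"<unk>": 0, "<pad>": 1, "<sos>": 2, "<eos>": 3, "the": 4, "cat": 5}
ids = lookup_indices_sketch(["<sos>", "the", "zebra", "<eos>"], stoi)
print(ids)  # [2, 4, 0, 3]
```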

Inspect the numericalized result.

In [None]:
train_data[0]

Optionally, save the vocabularies `en_vocab` and `zh_vocab` and the processed datasets `train_data`, `valid_data`, and `test_data` to files for later use.

In [12]:
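import os
import json

folder = "./data/opus100_en-zh_preprocessed"

def dump_data(folder: str, **kwargs):
    # Create the output folder if it does not exist yet.
    os.makedirs(folder, exist_ok=True)

    for key, value in kwargs.items():
        # ensure_ascii=False keeps the Chinese tokens readable in the files.
        with open(os.path.join(folder, key + ".json"), "w", encoding="utf-8") as f:
            json.dump(value, f, ensure_ascii=False)

# Vocab and Dataset objects are not JSON-serializable, so save the
# vocabularies as token lists and the datasets as dicts of columns.
dump_data(
    folder,
    en_vocab=en_vocab.get_itos(),
    zh_vocab=zh_vocab.get_itos(),
    train_data=train_data.to_dict(),
    valid_data=valid_data.to_dict(),
    test_data=test_data.to_dict(),
)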
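`json.dump` only handles built-in types such as lists, dicts, strings, and numbers, so objects have to be converted to plain data before saving. The sketch below shows a round trip with small hypothetical stand-in values (`en_itos`, `train_columns`) in a temporary folder, and how a token-to-index mapping can be rebuilt from a saved token list.

```python
import json
import os
import tempfile

# Plain-Python stand-ins (hypothetical values) for a vocabulary and one split.
en_itos = ["<unk>", "<pad>", "<sos>", "<eos>", "the"]
train_columns = {"en_ids": [[2, 4, 3]], "zh_tokens": [["<sos>", "这", "<eos>"]]}

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "train_data.json")
    # ensure_ascii=False keeps Chinese characters readable in the file.
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"itos": en_itos, "data": train_columns}, f, ensure_ascii=False)
    with open(path, "r", encoding="utf-8") as f:
        restored = json.load(f)

# The token-to-index mapping can be rebuilt from the saved token list.
en_stoi = {tok: i for i, tok in enumerate(restored["itos"])}
print(en_stoi["the"])  # 4
```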

Load the saved files.

In [13]:
import os
import json

folder = "./data/opus100_en-zh_preprocessed"
files = ['en_vocab.json', 'zh_vocab.json', 'train_data.json', 'valid_data.json', 'test_data.json']

def load_data(folder: str, files: list):
    data = {}
    for file in files:
        with open(os.path.join(folder, file), 'r', encoding='utf-8') as f:
            # Use the file name without its extension as the key.
            data[os.path.splitext(file)[0]] = json.load(f)
    return data

_data = load_data(folder, files)
# Everything loaded here is plain JSON data (lists/dicts),
# not Vocab or Dataset objects.
en_vocab = _data['en_vocab']
zh_vocab = _data['zh_vocab']
train_data = _data['train_data']
valid_data = _data['valid_data']
test_data = _data['test_data']

{'en_vocab': [123], 'zh_vocab': [123], 'train_data': [123], 'valid_data': [123], 'test_data': [123]}


## DataLoader