# On Multilingual Sentence Transformers

tl;dr: [`sentence-transformers/paraphrase-multilingual-mpnet-base-v2`](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2), a multilingual model for mapping sentences and paragraphs to a 768-dimensional space, deserves your consideration.

These notes follow the Pinecone tutorial [Tomayto, Tomahto, Transformer: Multilingual Sentence Transformers](https://www.pinecone.io/learn/series/nlp/multilingual-transformers/).

In [1]:
from datasets import load_dataset

ted = load_dataset('ted_multi', split='train')
ted

Dataset({
    features: ['translations', 'talk_name'],
    num_rows: 258098
})

In [2]:
print(ted[0]['translations']['language'])

['ar', 'bg', 'de', 'el', 'en', 'es', 'eu', 'fa', 'fr', 'fr-ca', 'he', 'hr', 'hu', 'it', 'ja', 'ko', 'nb', 'nl', 'pl', 'pt', 'pt-br', 'ro', 'ru', 'sq', 'tr', 'vi', 'zh-cn', 'zh-tw']


In [3]:
idx = ted[0]['translations']['language'].index('en')
idx

4

In [4]:
# use the index to get the corresponding translation
source = ted[0]['translations']['translation'][idx]
source

'Amongst all the troubling deficits we struggle with today — we think of financial and economic primarily — the ones that concern me most is the deficit of political dialogue — our ability to address modern conflicts as they are , to go to the source of what they &apos;re all about and to understand the key players and to deal with them .'

In [5]:
# use that info to create all (source, translation) pairs
pairs = []
for i, translation in enumerate(ted[0]['translations']['translation']):
    # we don't want to use the source language (English) as a translation
    if i != idx:
        pairs.append((source, translation))

# let's see what we have
pairs[0]

('Amongst all the troubling deficits we struggle with today — we think of financial and economic primarily — the ones that concern me most is the deficit of political dialogue — our ability to address modern conflicts as they are , to go to the source of what they &apos;re all about and to understand the key players and to deal with them .',
 'من ضمن جميع المثبطات المقلقة التي نعاني منها اليوم نفكر في المقام الاول في الامور المالية والاقتصادية واكثر ما يهمني بشكل اكثر هو عجز الحوار السياسي — قدرتنا على فهم الصراعات الحديثة على ماهي عليه , بالذهاب الى اصلها الفعلي وعلى فهم اللاعبين الرئيسيين وعلى التعامل معهم')

In [6]:
from sentence_transformers import InputExample
from tqdm.auto import tqdm  # so we see progress bar

# initialize list of languages to keep
lang_list = ['ja']
# create dict to store our pairs
train_samples = {f'en-{lang}': [] for lang in lang_list}

# now build our training samples list
for row in tqdm(ted):
    # get source (English)
    idx = row['translations']['language'].index('en')
    source = row['translations']['translation'][idx].strip()
    # loop through translations
    for i, lang in enumerate(row['translations']['language']):
        # check if lang is in lang list
        if lang in lang_list:
            translation = row['translations']['translation'][i].strip()
            train_samples[f'en-{lang}'].append(
                source+'\t'+translation
            )

  0%|          | 0/258098 [00:00<?, ?it/s]

In [7]:
# how many pairs for each language?
for lang_pair in train_samples.keys():
    print(f'{lang_pair}: {len(train_samples[lang_pair])}')

en-ja: 204090


In [8]:
import gzip
import os

if not os.path.exists('./data'):
    os.mkdir('./data')

# save to file, sentence transformers reader will expect tsv.gz file
for lang_pair in train_samples.keys():
    with gzip.open(f'./data/ted-train-{lang_pair}.tsv.gz', 'wt', encoding='utf-8') as f:
        f.write('\n'.join(train_samples[lang_pair]))

In [9]:
!ls data

ted-train-en-ja.tsv.gz
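
As a quick sanity check on the file format, the sketch below writes two tab-separated pairs to a gzip file the same way as the cell above, then reads them back one pair per line. The sample pairs and the temporary path are placeholders, not rows from the TED data.

```python
import gzip
import os
import tempfile

# hypothetical sample pairs standing in for train_samples['en-ja']
sample_pairs = [
    'we will include several languages\tいくつかの言語を含める',
    'A girl is brushing her hair.\t少女が髪をとかしている。',
]

# write to a temporary .tsv.gz, one "source<TAB>translation" pair per line
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'ted-sample-en-ja.tsv.gz')
with gzip.open(path, 'wt', encoding='utf-8') as f:
    f.write('\n'.join(sample_pairs))

# read it back and split each line on the tab
with gzip.open(path, 'rt', encoding='utf-8') as f:
    read_back = [line.rstrip('\n').split('\t') for line in f]

for source, translation in read_back:
    print(source, '->', translation)
```

Note that a sentence containing a tab or a newline would break this one-pair-per-line layout, so it may be worth checking for those before writing the real file.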


----

### Selecting a tokenizer for the Student

In [10]:
from transformers import BertTokenizer

bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

In [11]:
sentences = [
    'we will include several languages',
    'いくつかの言語を含める'
]

for text in sentences:
    print(bert_tokenizer.tokenize(text))

['we', 'will', 'include', 'several', 'languages']
['い', '##く', '##つ', '##か', '##の', '[UNK]', '語', 'を', '[UNK]', 'め', '##る']


In [12]:
from transformers import XLMRobertaTokenizer

xlmr_tokenizer = XLMRobertaTokenizer.from_pretrained('xlm-roberta-base')

for text in sentences:
    print(xlmr_tokenizer.tokenize(text))

['▁we', '▁will', '▁include', '▁several', '▁language', 's']
['▁', 'いくつか', 'の', '言語', 'を含め', 'る']
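
One way to put a number on the difference is to count how many tokens each tokenizer had to replace with its unknown token. The sketch below works on the two Japanese token lists printed above. The helper `unk_rate` is a hypothetical name, and `'<unk>'` is assumed to be the XLM-R unknown token.

```python
# token lists copied from the two tokenizer outputs above
bert_tokens = ['い', '##く', '##つ', '##か', '##の', '[UNK]', '語', 'を', '[UNK]', 'め', '##る']
xlmr_tokens = ['▁', 'いくつか', 'の', '言語', 'を含め', 'る']


def unk_rate(tokens, unk_token):
    """Fraction of tokens equal to the tokenizer's unknown token."""
    return tokens.count(unk_token) / len(tokens)


bert_rate = unk_rate(bert_tokens, '[UNK]')
xlmr_rate = unk_rate(xlmr_tokens, '<unk>')  # assumed unknown token for XLM-R

print(f'BERT:  {len(bert_tokens)} tokens, {bert_rate:.0%} unknown')
print(f'XLM-R: {len(xlmr_tokens)} tokens, {xlmr_rate:.0%} unknown')
```

Every `[UNK]` is information the model never sees, which is the main reason to prefer the XLM-R vocabulary for the student.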


### Student model

In [13]:
from sentence_transformers import models, SentenceTransformer

xlmr = models.Transformer('xlm-roberta-base')
pooler = models.Pooling(
    xlmr.get_word_embedding_dimension(),
    pooling_mode_mean_tokens=True
)

student = SentenceTransformer(modules=[xlmr, pooler])
student

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: XLMRobertaModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)
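
The `Pooling` module with `pooling_mode_mean_tokens=True` turns the per-token vectors into one sentence vector by averaging them. A minimal pure-Python sketch of that idea, assuming padding positions are excluded through the attention mask; the toy vectors have 2 dimensions instead of 768 and `mean_pool` is an illustrative helper, not the library's implementation:

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose mask is 1, ignoring padding."""
    dim = len(token_embeddings[0])
    totals = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for j, value in enumerate(vector):
                totals[j] += value
    # guard against a row that is all padding
    return [total / max(count, 1) for total in totals]


# toy input: 3 real tokens followed by 1 padding token
token_embeddings = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [9.0, 9.0]]
attention_mask = [1, 1, 1, 0]

sentence_embedding = mean_pool(token_embeddings, attention_mask)
print(sentence_embedding)  # the padding vector [9.0, 9.0] does not contribute
```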

### Teacher model

In [14]:
teacher = SentenceTransformer('all-mpnet-base-v2')
teacher

SentenceTransformer(
  (0): Transformer({'max_seq_length': 384, 'do_lower_case': False}) with Transformer model: MPNetModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
  (2): Normalize()
)

... but this teacher ends with a `Normalize()` module. For distillation we want the _raw pooled embeddings_ as the target, not L2-normalized ones, so we switch to a teacher without that module...

Please see...
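
The `Normalize()` module in the teacher printed above rescales each embedding to unit L2 length, so the magnitude of the vector is lost. A minimal sketch of that effect with toy 2-dimensional vectors (not real model outputs):

```python
import math


def l2_normalize(vector):
    """Scale a vector so that its Euclidean length is 1."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector]


a = [3.0, 4.0]
b = [6.0, 8.0]  # same direction as a, twice the length

na = l2_normalize(a)
nb = l2_normalize(b)

print(na, nb)  # both vectors collapse onto the same unit vector
```

A student trained with an MSE objective against such targets would be asked to reproduce unit-length vectors, which is a different target from the raw pooled output.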

In [15]:
teacher = SentenceTransformer('paraphrase-distilroberta-base-v2')
teacher

SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: RobertaModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)

### Fine-tuning

* `sentence_transformers.ParallelSentencesDataset`

In [16]:
from sentence_transformers import ParallelSentencesDataset

data = ParallelSentencesDataset(
    student_model=student, 
    teacher_model=teacher, 
    batch_size=32, 
    use_embedding_cache=True
)

In [17]:
max_sentences_per_language = 500000
train_max_sentence_length = 250 # max num of characters per sentence

train_files = [f for f in os.listdir('./data') if 'train' in f]
for f in train_files:
    print(f)
    data.load_data('./data/'+f, max_sentences=max_sentences_per_language, max_sentence_length=train_max_sentence_length)

ted-train-en-ja.tsv.gz
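
A rough sketch of what the `max_sentence_length` filter amounts to, assuming a pair is dropped as soon as either side is longer than the limit in characters. `keep_pair` is a hypothetical helper written for illustration, not part of `sentence_transformers`.

```python
def keep_pair(line, max_sentence_length=250):
    """Return [source, translation], or None if either side is too long."""
    sentences = line.strip().split('\t')
    if max(len(sentence) for sentence in sentences) > max_sentence_length:
        return None
    return sentences


# toy lines in the same "source<TAB>translation" layout as the training file
lines = [
    'short sentence\t短い文',
    'x' * 300 + '\t長い原文の翻訳',  # source side exceeds 250 characters
]

kept = [pair for pair in (keep_pair(line) for line in lines) if pair is not None]
print(len(kept), 'of', len(lines), 'pairs kept')
```

The limit is counted in characters, not tokens, so it is only a coarse guard against very long inputs.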


In [18]:
from torch.utils.data import DataLoader

loader = DataLoader(data, shuffle=True, batch_size=32)

In [19]:
from sentence_transformers import losses

loss = losses.MSELoss(model=student)
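
The idea behind this objective, as we understand the setup: the teacher embeds the English sentence once, and the student is pushed to produce that same vector for both the English sentence and its translation. A toy sketch with made-up 3-dimensional embeddings (real ones have 768 dimensions):

```python
def mse(u, v):
    """Mean squared error between two vectors of equal length."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)


# hypothetical values, chosen only to illustrate the objective
teacher_en = [0.2, -0.1, 0.4]  # teacher(English sentence): the fixed target
student_en = [0.1, 0.0, 0.5]   # student(English sentence)
student_ja = [0.6, -0.3, 0.0]  # student(Japanese translation)

loss_en = mse(student_en, teacher_en)
loss_ja = mse(student_ja, teacher_en)

print(f'en loss: {loss_en:.4f}')
print(f'ja loss: {loss_ja:.4f}')
```

Driving both losses toward zero places the English sentence and its Japanese translation at the same point in the teacher's vector space, which is what makes cross-lingual similarity work afterwards.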

In [20]:
from sentence_transformers import evaluation
import numpy as np

epochs = 1
warmup_steps = int(len(loader) * epochs * 0.1)

student.fit(
    train_objectives=[(loader, loss)],
    epochs=epochs,
    warmup_steps=warmup_steps,
    output_path='./xlmr-ted',
    #optimizer_params={'lr': 2e-5, 'eps': 1e-6, 'correct_bias': False},
    optimizer_params={'lr': 2e-5, 'eps': 1e-6},
    save_best_model=True,
    show_progress_bar=True
)

Epoch:   0%|          | 0/1 [00:00<?, ?it/s]

Iteration:   0%|          | 0/11912 [00:00<?, ?it/s]
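
With 11912 iterations in one epoch, the 10% warmup above comes to 1191 steps. The sketch below shows the shape of a linear warmup followed by a linear decay, on the assumption that `fit` is left on its default `WarmupLinear` scheduler. `warmup_linear_lr` is an illustrative helper, not a library function.

```python
def warmup_linear_lr(step, base_lr, warmup_steps, total_steps):
    """Learning rate that rises linearly to base_lr, then decays linearly to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = total_steps - step
    return base_lr * max(0.0, remaining / max(1, total_steps - warmup_steps))


epochs = 1
total_steps = 11912 * epochs           # iterations reported by the progress bar
warmup_steps = int(total_steps * 0.1)  # same formula as in the cell above
base_lr = 2e-5

for step in (0, warmup_steps // 2, warmup_steps, total_steps):
    lr = warmup_linear_lr(step, base_lr, warmup_steps, total_steps)
    print(f'step {step:>5}: lr = {lr:.2e}')
```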



----

### Evaluation


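> The multilingual STS benchmark dataset (stsb_multi_mt) is published as a Hugging Face dataset. Japanese, however, is the one language left out. According to a GitHub issue, some lines in the train data come back empty when translated with DeepL, which causes an error. We only use the test data here, so the data itself is fine, but because the loading configuration excludes Japanese, we copy it locally and modify it.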

from [https://tech.yellowback.net/posts/sentence-transformers-japanese-models](https://tech.yellowback.net/posts/sentence-transformers-japanese-models)

In [21]:
en = load_dataset('stsb_multi_mt_ja', 'en', split='test')
en

Dataset({
    features: ['sentence1', 'sentence2', 'similarity_score'],
    num_rows: 1379
})

In [22]:
en[0]

{'sentence1': 'A girl is styling her hair.',
 'sentence2': 'A girl is brushing her hair.',
 'similarity_score': 2.5}

In [23]:
ja = load_dataset('stsb_multi_mt_ja', 'ja', split='test')
ja

Dataset({
    features: ['sentence1', 'sentence2', 'similarity_score'],
    num_rows: 1379
})

In [24]:
ja[0]

{'sentence1': '女の子が髪をスタイリングしています。',
 'sentence2': '少女が髪をとかしている。',
 'similarity_score': 2.5}

In [25]:
en = en.map(lambda x: {'similarity_score': x['similarity_score'] / 5.0})
ja = ja.map(lambda x: {'similarity_score': x['similarity_score'] / 5.0})

ja[0]

{'sentence1': '女の子が髪をスタイリングしています。',
 'sentence2': '少女が髪をとかしている。',
 'similarity_score': 0.5}

In [26]:
from sentence_transformers import InputExample

en_samples = []
ja_samples = []
en_ja_samples = []

for i in range(len(en)):
    en_samples.append(InputExample(
        texts=[en[i]['sentence1'], en[i]['sentence2']],
        label=en[i]['similarity_score']
    ))

    ja_samples.append(InputExample(
        texts=[ja[i]['sentence1'], ja[i]['sentence2']],
        label=ja[i]['similarity_score']
    ))

    en_ja_samples.append(InputExample(
        texts=[en[i]['sentence1'], ja[i]['sentence2']],
        label=en[i]['similarity_score']
    ))

In [27]:
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

ja_eval = EmbeddingSimilarityEvaluator.from_input_examples(
    ja_samples, write_csv=False
)

en_eval = EmbeddingSimilarityEvaluator.from_input_examples(
    en_samples, write_csv=False
)

en_ja_eval = EmbeddingSimilarityEvaluator.from_input_examples(
    en_ja_samples, write_csv=False
)
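
Each evaluator returns a single score per model. The sketch below shows the kind of computation behind such a score, assuming cosine similarity between the two sentence embeddings and Spearman rank correlation against the gold labels. Which similarity the evaluator reports as its main score may differ between library versions, so treat this as an illustration rather than the library's code. The embeddings and gold scores are toy values.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# toy 2-d embeddings for three sentence pairs, with gold scores in [0, 1]
embedding_pairs = [
    ([1.0, 0.0], [1.0, 0.1]),  # nearly the same direction
    ([1.0, 0.0], [1.0, 1.0]),  # 45 degrees apart
    ([1.0, 0.0], [0.0, 1.0]),  # orthogonal
]
gold = [0.9, 0.5, 0.1]

predicted = [cosine(u, v) for u, v in embedding_pairs]
score = spearman(predicted, gold)
print(f'spearman: {score:.4f}')
```

Because the score is rank-based, only the ordering of the predicted similarities matters, not their absolute values.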

In [28]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('./xlmr-ted')

en_eval(model)

0.7143563910719034

In [29]:
ja_eval(model)

0.6598930363883629

In [30]:
en_ja_eval(model)

0.6091859290134577

In [31]:
xlmr = models.Transformer('xlm-roberta-base')
pooler = models.Pooling(
    xlmr.get_word_embedding_dimension(),
    pooling_mode_mean_tokens=True
)

student = SentenceTransformer(modules=[xlmr, pooler])

In [32]:
en_eval(student)

0.47525931826733264

In [33]:
ja_eval(student)

0.4771557350064595

In [34]:
en_ja_eval(student)

0.20667255470707896

----

In [35]:
model = SentenceTransformer('paraphrase-multilingual-mpnet-base-v2')

In [36]:
en_eval(model)

0.8682218476677823

In [37]:
ja_eval(model)

0.793185431984425

In [38]:
en_ja_eval(model)

0.7765553636736806