### Train a SentencePiece model
Larger corpora generally call for a larger vocabulary size.

In [None]:
import sentencepiece as spm

# Each call writes <model_prefix>.model and <model_prefix>.vocab to the working directory.
spm.SentencePieceTrainer.train(input='train.en', model_prefix='en', vocab_size=8000)
spm.SentencePieceTrainer.train(input='train.fr', model_prefix='fr', vocab_size=9000)


To train with these models, point the OpenNMT YAML config file at the generated .model and .vocab files:


    src_subword_model: fr.model
    tgt_subword_model: en.model
    src_subword_vocab: fr.vocab
    tgt_subword_vocab: en.vocab

Then add the sentencepiece transform to each corpus that needs it:

    data:
        corpus_1:
            path_src: train.fr
            path_tgt: train.en
            transforms: [sentencepiece, filtertoolong]
        valid:
            path_src: val.fr
            path_tgt: val.en
            transforms: [sentencepiece]
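
The two fragments above belong in the same config file, and the source and target sides are easy to mix up (here French is the source and English the target). As a minimal sketch, the helper below assembles just that fragment from two language codes. The function name `subword_config` is hypothetical, and a complete OpenNMT config needs further settings that are not shown here.

```python
# Hypothetical helper: builds only the subword-related part of the config
# shown above. It is not a complete OpenNMT config.
def subword_config(src, tgt):
    lines = [
        f"src_subword_model: {src}.model",
        f"tgt_subword_model: {tgt}.model",
        f"src_subword_vocab: {src}.vocab",
        f"tgt_subword_vocab: {tgt}.vocab",
        "",
        "data:",
        "    corpus_1:",
        f"        path_src: train.{src}",
        f"        path_tgt: train.{tgt}",
        "        transforms: [sentencepiece, filtertoolong]",
        "    valid:",
        f"        path_src: val.{src}",
        f"        path_tgt: val.{tgt}",
        "        transforms: [sentencepiece]",
    ]
    return "\n".join(lines) + "\n"


fragment = subword_config("fr", "en")
print(fragment)
```

Swapping the two arguments gives the fragment for the opposite translation direction.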

### Converting SPM

It's possible to convert back and forth between plain text and SentencePiece tokens. In fact, OpenNMT produces its output in this tokenization, so you'll need to detokenize before calculating BLEU. For use with regular PyTorch, you can apply the already trained model directly, as follows (the examples are Danish):

In [None]:

# Load one sentence and convert it to SentencePiece tokens
with open("train.da", 'r', encoding='utf-8') as fp:
  sent = fp.readline()  # the with-block closes the file, so no fp.close() is needed
sp = spm.SentencePieceProcessor(model_file='da.model')
print("Input Sentence: "+sent)
tokd = sp.encode(sent,out_type=str)
print("SentencePiece Tokens: "+str(tokd))
output = " ".join(tokd)

# How to convert from SentencePiece format to "normal" text
# START HERE IF USING OpenNMT
print("Expected OpenNMT output: "+output)
split_out = output.split(" ")
print("List of re-split tokens"+str(split_out))
detokd = sp.decode(split_out).replace("▁", " ")  # replace any "▁" markers that survive decoding
print("Detokenized: "+detokd)



Input Sentence: På Det Blandede EØS-Udvalgs vegne

SentencePiece Tokens: ['▁På', '▁Det', '▁Blande', 'de', '▁EØS', '-', 'Udvalg', 's', '▁vegne']
Expected OpenNMT output: ▁På ▁Det ▁Blande de ▁EØS - Udvalg s ▁vegne
List of re-split tokens['▁På', '▁Det', '▁Blande', 'de', '▁EØS', '-', 'Udvalg', 's', '▁vegne']
Detokenized: På Det Blandede EØS-Udvalgs vegne
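
To detokenize a whole file of OpenNMT predictions rather than a single sentence, the same idea can be applied line by line. The sketch below is a minimal pure-Python version that assumes the pieces are separated by single spaces and that "▁" marks the start of a word. It does not load the model, so it may differ from `sp.decode` in edge cases. The function names and the input file name `predictions.txt` are hypothetical.

```python
def detokenize_line(line):
    """Join space-separated SentencePiece pieces back into plain text."""
    pieces = line.split()
    # Pieces are concatenated directly; "▁" marks where a space belongs.
    return "".join(pieces).replace("▁", " ").strip()


def detokenize_file(src_path, dst_path):
    """Detokenize every line of src_path and write the result to dst_path."""
    with open(src_path, "r", encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(detokenize_line(line) + "\n")


print(detokenize_line("▁På ▁Det ▁Blande de ▁EØS - Udvalg s ▁vegne"))
# prints: På Det Blandede EØS-Udvalgs vegne

# Usage (hypothetical input file name):
# detokenize_file("predictions.txt", "predictions.detokd")
```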


### BLEU

The updated BLEU script works on detokenized text, so if we've saved the detokenized output from above to "predictions.detokd", we can compute BLEU as follows:

In [None]:

!perl  OpenNMT-py/tools/multi-bleu-detok.perl gold_standard < predictions.detokd
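
The script compares the two files line by line, so a quick sanity check before scoring is to confirm that the reference and the predictions have the same number of lines. A minimal sketch follows; the helper names are hypothetical, and the file names are the ones used in the command above.

```python
from pathlib import Path


def count_lines(path):
    """Count the lines in a UTF-8 text file."""
    with open(path, "r", encoding="utf-8") as fp:
        return sum(1 for _ in fp)


def check_line_aligned(ref_path, hyp_path):
    """Return True when both files contain the same number of lines."""
    return count_lines(ref_path) == count_lines(hyp_path)


# Only run the check when both files are actually present.
if Path("gold_standard").exists() and Path("predictions.detokd").exists():
    print("Line-aligned:", check_line_aligned("gold_standard", "predictions.detokd"))
```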
