## Update sp model with protobuf

Find out whether we can update a trained SentencePiece model (`spm_test1k_uds.model`) by editing it directly with protobuf.

See https://github.com/google/sentencepiece/issues/121#issuecomment-400362011

#### Compile `.proto` spec

In [1]:
!protoc ./sentencepiece_model.proto --python_out='.'

The above should generate a `sentencepiece_model_pb2.py`:

In [2]:
!ls

protobuf-read-spm.ipynb    sentencepiece_model_pb2.py
sentencepiece_model.proto  spm_test1k_uds.model


#### Import the generated script

In [3]:
import sentencepiece_model_pb2

In [4]:
dir(sentencepiece_model_pb2)

['DESCRIPTOR',
 'ModelProto',
 'NormalizerSpec',
 'SelfTestData',
 'TrainerSpec',
 '_MODELPROTO',
 '_MODELPROTO_SENTENCEPIECE',
 '_MODELPROTO_SENTENCEPIECE_TYPE',
 '_NORMALIZERSPEC',
 '_SELFTESTDATA',
 '_SELFTESTDATA_SAMPLE',
 '_TRAINERSPEC',
 '_TRAINERSPEC_MODELTYPE',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__spec__',
 '_descriptor',
 '_message',
 '_reflection',
 '_sym_db',
 '_symbol_database']

#### Read the exported sp model from file

In [5]:
# Create a ModelProto message object
model = sentencepiece_model_pb2.ModelProto()

# Read from binary file
with open('spm_test1k_uds.model', 'rb') as fh:
    model.ParseFromString(fh.read())

In [8]:
dir(model)

['ByteSize',
 'Clear',
 'ClearExtension',
 'ClearField',
 'CopyFrom',
 'DESCRIPTOR',
 'DiscardUnknownFields',
 'Extensions',
 'FindInitializationErrors',
 'FromString',
 'HasExtension',
 'HasField',
 'IsInitialized',
 'ListFields',
 'MergeFrom',
 'MergeFromString',
 'ParseFromString',
 'RegisterExtension',
 'SentencePiece',
 'SerializePartialToString',
 'SerializeToString',
 'SetInParent',
 'UnknownFields',
 'WhichOneof',
 '_CheckCalledFromGeneratedFile',
 '_SetListener',
 '__class__',
 '__deepcopy__',
 '__delattr__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setstate__',
 '__sizeof__',
 '__slots__',
 '__str__',
 '__subclasshook__',
 '__unicode__',
 '_extensions_by_name',
 '_extensions_by_number',
 'normalizer_spec',
 'pieces',
 'self_test_data',
 

We're interested in the `pieces` field. This contains the tokens:

In [12]:
for piece in model.pieces[:15]:
    print(piece)

piece: "<unk>"
score: 0.0
type: UNKNOWN

piece: "<s>"
score: 0.0
type: CONTROL

piece: "</s>"
score: 0.0
type: CONTROL

piece: "\345\274\265\350\200\201\345\270\253"
score: 0.0
type: USER_DEFINED

piece: "\346\225\231\350\202\262\351\203\250"
score: 0.0
type: USER_DEFINED

piece: "\345\244\252\345\271\263\346\264\213"
score: 0.0
type: USER_DEFINED

piece: "\345\210\251\345\245\207\351\246\254"
score: 0.0
type: USER_DEFINED

piece: "\346\237\257\347\276\205\350\216\216"
score: 0.0
type: USER_DEFINED

piece: "\350\214\203\346\226\257\351\253\230"
score: 0.0
type: USER_DEFINED

piece: "\351\242\261\351\242\250"
score: 0.0
type: USER_DEFINED

piece: "\345\217\260\347\201\243"
score: 0.0
type: USER_DEFINED

piece: "\344\270\212\345\215\210"
score: 0.0
type: USER_DEFINED

piece: "\344\270\255\345\244\256\346\260\243\350\261\241\345\261\200"
score: 0.0
type: USER_DEFINED

piece: ","
score: -2.557969093322754

piece: "\342\226\201"
score: -3.8373160362243652
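The backslash sequences in the printed pieces are just the UTF-8 bytes of each piece written as octal escapes. Here is a minimal sketch for decoding them by hand; `decode_octal_escapes` is a hypothetical helper name, not part of protobuf or SentencePiece, and it assumes the input contains only ASCII characters and three-digit octal escapes.

```python
import re


def decode_octal_escapes(escaped: str) -> str:
    """Decode text-format octal escapes (e.g. r'\345\274\265') into a str."""
    # Replace every \ooo with the single byte it denotes; other ASCII
    # characters (such as ',') pass through unchanged.
    raw = re.sub(
        rb"\\([0-7]{3})",
        lambda m: bytes([int(m.group(1), 8)]),
        escaped.encode("ascii"),
    )
    # The resulting byte string is ordinary UTF-8.
    return raw.decode("utf-8")


print(decode_octal_escapes(r"\345\274\265\350\200\201\345\270\253"))  # 張老師
```

Reading the `.piece` attribute directly returns an already decoded Python string, so this is only useful for checking the printed output by eye.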



This matches the start of the `.vocab` file, which lists the special tokens first, then our user-defined symbols, then the learned pieces with their scores:

```
<unk>	0
<s>	0
</s>	0
張老師	0
教育部	0
太平洋	0
利奇馬	0
柯羅莎	0
范斯高	0
颱風	0
台灣	0
上午	0
中央氣象局	0
,	-2.55797
▁	-3.83732
...
```

In [17]:
model.pieces[3]

piece: "\345\274\265\350\200\201\345\270\253"
score: 0.0
type: USER_DEFINED

In [18]:
model.pieces[3].piece

'張老師'

In [19]:
model.pieces[3].score

0.0

In [20]:
model.pieces[3].type

4

#### Add new user-defined pieces

Add these two terms: `您好` and `逼近`.

In [21]:
new_piece = model.pieces.add()

In [22]:
new_piece



A freshly added piece prints as empty because none of its fields are set yet. From the `.proto` file, we know that the `piece` field must not be empty:

```proto
optional string piece = 1;   // piece must not be empty.
optional float  score = 2;
optional Type   type = 3 [ default =  NORMAL ];
```

and that we have to set the piece type to `USER_DEFINED` explicitly, because it defaults to `NORMAL`:

```proto
enum Type {
    NORMAL       = 1;  // normal symbol
    UNKNOWN      = 2;  // unknown symbol. only <unk> for now.
    CONTROL      = 3;  // control symbols. </s>, <s>, <2ja> etc.
    USER_DEFINED = 4;  // user defined symbols.
                       // Typical usage of USER_DEFINED symbol
                       // is placeholder.
    UNUSED       = 5;  // this piece is not used.
};
```
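To avoid a bare `4` in the code, the enum values can be mirrored on the Python side. The sketch below is a stand-in written by hand from the `.proto` snippet above; `PieceType` is a hypothetical name. The generated module should also expose the same constant as `sentencepiece_model_pb2.ModelProto.SentencePiece.USER_DEFINED`, which is how protobuf's Python code generator normally surfaces nested enums, but that was not checked in this notebook.

```python
from enum import IntEnum


class PieceType(IntEnum):
    """Hand-written mirror of the Type enum in sentencepiece_model.proto."""
    NORMAL = 1        # normal symbol
    UNKNOWN = 2       # unknown symbol, only <unk> for now
    CONTROL = 3       # control symbols such as <s> and </s>
    USER_DEFINED = 4  # user-defined symbols
    UNUSED = 5        # this piece is not used


print(PieceType(4).name)              # USER_DEFINED
print(int(PieceType.USER_DEFINED))    # 4
```

Because `IntEnum` members compare equal to plain integers, `PieceType.USER_DEFINED` can be assigned to or compared with the `type` field directly.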

In [23]:
new_piece.piece = '您好'
new_piece.score = 0
new_piece.type = 4  # 4 == USER_DEFINED in the Type enum

In [24]:
new_piece.IsInitialized()

True

In [25]:
new_piece2 = model.pieces.add()
new_piece2.piece = '逼近'
new_piece2.score = 0
new_piece2.type = 4
new_piece2.IsInitialized()

True
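The two cells above repeat the same four steps, so they can be folded into a small helper. This is only a sketch: `append_user_defined` is a hypothetical name, and it relies on nothing beyond what the cells above already use (`model.pieces.add()` plus the `piece`, `score` and `type` fields).

```python
USER_DEFINED = 4  # value of Type.USER_DEFINED in sentencepiece_model.proto


def append_user_defined(model, terms):
    """Append each term to model.pieces as a USER_DEFINED piece with score 0."""
    added = []
    for term in terms:
        # The .proto comment says a piece must not be empty.
        if not term:
            raise ValueError("piece must not be empty")
        new_piece = model.pieces.add()
        new_piece.piece = term
        new_piece.score = 0.0
        new_piece.type = USER_DEFINED
        added.append(new_piece)
    return added
```

With the model loaded earlier, `append_user_defined(model, ['您好', '逼近'])` would perform the same additions as cells 21 to 25.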

#### Check result

Verify that the model now includes the two new pieces:

In [30]:
for piece in model.pieces:
    if piece.type == 4:
        print(piece.piece)

張老師
教育部
太平洋
利奇馬
柯羅莎
范斯高
颱風
台灣
上午
中央氣象局
您好
逼近


#### New pieces are appended to the end

In [34]:
model.pieces[-1].piece

'逼近'

In [35]:
model.pieces[-2].piece

'您好'

#### Save the updated model

In [38]:
with open('spm_test1k_uds_updated.model', 'wb') as fh:
    fh.write(model.SerializeToString())

---

### Check whether the updated model can segment the added terms

Old model:

In [40]:
import sentencepiece as spm
sp = spm.SentencePieceProcessor()
sp.Load('./spm_test1k_uds.model')

text1 = '張老師您好，要給教育部的文件已經寄給您了'
text2 = '太平洋上利奇馬、柯羅莎、范斯高「三颱共舞」！颱風利奇馬更逼近台灣'

Look for '您好':

In [41]:
sp.EncodeAsPieces(text1)

['▁',
 '張老師',
 '您',
 '好',
 ',',
 '要',
 '給',
 '教育部',
 '的',
 '文',
 '件',
 '已',
 '經',
 '寄給您',
 '了']

Look for '逼近':

In [42]:
sp.EncodeAsPieces(text2)

['▁',
 '太平洋',
 '上',
 '利奇馬',
 '、',
 '柯羅莎',
 '、',
 '范斯高',
 '「',
 '三',
 '颱',
 '共',
 '舞',
 '」',
 '!',
 '颱風',
 '利奇馬',
 '更',
 '逼',
 '近',
 '台灣']

Updated model:

In [1]:
import sentencepiece as spm
sp = spm.SentencePieceProcessor()
sp.Load('./spm_test1k_uds_updated.model')

text1 = '張老師您好，要給教育部的文件已經寄給您了'
text2 = '太平洋上利奇馬、柯羅莎、范斯高「三颱共舞」！颱風利奇馬更逼近台灣'

Look for '您好':

In [2]:
sp.EncodeAsPieces(text1)

['▁', '張老師', '您好', ',', '要', '給', '教育部', '的', '文', '件', '已', '經', '寄給您', '了']

In [3]:
sp.EncodeAsIds(text1)

[14, 3, 1000, 13, 93, 0, 4, 16, 994, 603, 802, 85, 0, 61]
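Rather than comparing the two piece lists by eye, the difference can be pulled out with `difflib` from the standard library. The lists below are copied from the old-model and updated-model outputs for `text1`; `changed_spans` is a hypothetical helper name.

```python
from difflib import SequenceMatcher

# Segmentation of text1 by the old model and by the updated model.
old = ['▁', '張老師', '您', '好', ',', '要', '給', '教育部',
       '的', '文', '件', '已', '經', '寄給您', '了']
new = ['▁', '張老師', '您好', ',', '要', '給', '教育部',
       '的', '文', '件', '已', '經', '寄給您', '了']


def changed_spans(before, after):
    """Return (old_pieces, new_pieces) pairs wherever two segmentations differ."""
    matcher = SequenceMatcher(a=before, b=after, autojunk=False)
    return [
        (before[i1:i2], after[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != 'equal'
    ]


print(changed_spans(old, new))  # [(['您', '好'], ['您好'])]
```

The only change is that `您` and `好` are now emitted as the single piece `您好`; everything else is segmented as before.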

Look for '逼近':

In [45]:
sp.EncodeAsPieces(text2)

['▁',
 '太平洋',
 '上',
 '利奇馬',
 '、',
 '柯羅莎',
 '、',
 '范斯高',
 '「',
 '三',
 '颱',
 '共',
 '舞',
 '」',
 '!',
 '颱風',
 '利奇馬',
 '更',
 '逼近',
 '台灣']

In [4]:
sp.EncodeAsIds(text2)

[14, 5, 41, 6, 28, 7, 28, 8, 31, 873, 0, 463, 0, 52, 387, 9, 6, 807, 1001, 10]

### Conclusion

Yes, SentencePiece models *can* be updated after the fact by appending new user-defined symbols via protobuf.

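Pieces added this way are appended after the existing vocabulary, so their IDs continue from the original vocabulary size: here the two new pieces get IDs `1000` and `1001`.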
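The IDs in the outputs above line up with positions in `model.pieces` (`張老師` is `pieces[3]` and encodes to ID `3`), so the ID of any piece can be read off the protobuf side without loading the model in sentencepiece. A minimal sketch, with `piece_to_id` as a hypothetical helper name:

```python
def piece_to_id(model):
    """Map each piece string to its index in model.pieces.

    Assumption: the index in the pieces list is the ID that
    EncodeAsIds reports, as the outputs in this notebook suggest.
    """
    return {p.piece: index for index, p in enumerate(model.pieces)}
```

For the updated model, `piece_to_id(model)['您好']` would be expected to return `1000`, matching the `EncodeAsIds` output.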

---

### What if

What if we try to add 2 identical tokens, or add a new token that already exists in the model?

Only one way to find out!!

#### Load proto model

In [1]:
import sentencepiece_model_pb2

model = sentencepiece_model_pb2.ModelProto()

with open('spm_test1k_uds.model', 'rb') as fh:
    model.ParseFromString(fh.read())

#### Add a redundant piece

In [5]:
new_piece = model.pieces.add()

In [10]:
# '太平洋' already exists
new_piece.piece = '太平洋'

In [9]:
new_piece.type = 4

In [8]:
new_piece.score

0.0

In [12]:
# verify the new piece has been appended
model.pieces[-1].piece

'太平洋'

#### Save updated model

In [13]:
with open('spm_test1k_repeat.model', 'wb') as fh:
    fh.write(model.SerializeToString())

#### Load the model in sentencepiece

In [14]:
import sentencepiece as spm
sp = spm.SentencePieceProcessor()
sp.Load('./spm_test1k_repeat.model')

text1 = '太平洋上利奇馬、柯羅莎、范斯高「三颱共舞」！颱風利奇馬更逼近台灣'

RuntimeError: Internal: 太平洋 is already defined.

Ah-ha, looks like **a redundant piece results in a runtime error** when you try to load the sp model:

`RuntimeError: Internal: 太平洋 is already defined.`
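Since the error only shows up at load time, it may be worth checking for duplicates before writing the file. The sketch below uses a hypothetical helper, `duplicate_pieces`, and only reads the `piece` field of each entry in `model.pieces`.

```python
from collections import Counter


def duplicate_pieces(model):
    """Return the piece strings that occur more than once in model.pieces."""
    counts = Counter(p.piece for p in model.pieces)
    return sorted(text for text, n in counts.items() if n > 1)
```

Run against the model from this experiment just before saving, `duplicate_pieces(model)` would be expected to return `['太平洋']`; an empty list means no piece is defined twice.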