Skip to content

Commit

Permalink
[ML] adds new mpnet tokenization for nlp models (#82234)
Browse files Browse the repository at this point in the history
This commit adds support for MPNet based models.

MPNet models differ from BERT style models in that:

 - Special tokens are different
 - Input to the model doesn't require token positions.

To configure an MPNet tokenizer for your pytorch MPNet based model:

```
"tokenization": {
  "mpnet": {...}
}
```
The options provided to `mpnet` are the same as the previously supported `bert` configuration.
  • Loading branch information
benwtrent committed Jan 5, 2022
1 parent d9e3f1b commit 9dc8aea
Show file tree
Hide file tree
Showing 27 changed files with 1,285 additions and 135 deletions.
20 changes: 20 additions & 0 deletions docs/reference/ml/ml-shared.asciidoc
Original file line number Diff line number Diff line change
Expand Up @@ -929,6 +929,13 @@ end::inference-config-classification-prediction-field-type[]

tag::inference-config-nlp-tokenization[]
Indicates the tokenization to perform and the desired settings.
The default tokenization configuration is `bert`. Valid tokenization
values are
+
--
* `bert`: Use for BERT-style models
* `mpnet`: Use for MPNet-style models
--
end::inference-config-nlp-tokenization[]

tag::inference-config-nlp-tokenization-bert[]
Expand Down Expand Up @@ -970,6 +977,19 @@ Specifies the maximum number of tokens allowed to be output by the tokenizer.
The default for BERT-style tokenization is `512`.
end::inference-config-nlp-tokenization-bert-max-sequence-length[]

tag::inference-config-nlp-tokenization-mpnet[]
MPNet-style tokenization is to be performed with the enclosed settings.
end::inference-config-nlp-tokenization-mpnet[]

tag::inference-config-nlp-tokenization-mpnet-with-special-tokens[]
Tokenize with special tokens. The tokens typically included in MPNet-style tokenization are:
+
--
* `<s>`: The first token of the sequence being classified.
* `</s>`: Indicates sequence separation.
--
end::inference-config-nlp-tokenization-mpnet-with-special-tokens[]

tag::inference-config-nlp-vocabulary[]
The configuration for retreiving the vocabulary of the model. The vocabulary is
then used at inference time. This information is usually provided automatically
Expand Down
138 changes: 138 additions & 0 deletions docs/reference/ml/trained-models/apis/get-trained-models.asciidoc
Original file line number Diff line number Diff line change
Expand Up @@ -202,6 +202,29 @@ include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenizati
(Optional, boolean)
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenization-bert-with-special-tokens]
========
`mpnet`::::
(Optional, object)
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenization-mpnet]
+
.Properties of mpnet
[%collapsible%open]
========
`do_lower_case`::::
(Optional, boolean)
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenization-bert-do-lower-case]
`max_sequence_length`::::
(Optional, integer)
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenization-bert-max-sequence-length]
`truncate`::::
(Optional, string)
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenization-bert-truncate]
`with_special_tokens`::::
(Optional, boolean)
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenization-mpnet-with-special-tokens]
========
=======
`vocabulary`::::
(Optional, object)
Expand Down Expand Up @@ -260,6 +283,29 @@ include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenizati
(Optional, boolean)
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenization-bert-with-special-tokens]
========
`mpnet`::::
(Optional, object)
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenization-mpnet]
+
.Properties of mpnet
[%collapsible%open]
========
`do_lower_case`::::
(Optional, boolean)
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenization-bert-do-lower-case]
`max_sequence_length`::::
(Optional, integer)
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenization-bert-max-sequence-length]
`truncate`::::
(Optional, string)
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenization-bert-truncate]
`with_special_tokens`::::
(Optional, boolean)
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenization-mpnet-with-special-tokens]
========
=======
`vocabulary`::::
(Optional, object)
Expand Down Expand Up @@ -311,6 +357,29 @@ include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenizati
(Optional, boolean)
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenization-bert-with-special-tokens]
========
`mpnet`::::
(Optional, object)
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenization-mpnet]
+
.Properties of mpnet
[%collapsible%open]
========
`do_lower_case`::::
(Optional, boolean)
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenization-bert-do-lower-case]
`max_sequence_length`::::
(Optional, integer)
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenization-bert-max-sequence-length]
`truncate`::::
(Optional, string)
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenization-bert-truncate]
`with_special_tokens`::::
(Optional, boolean)
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenization-mpnet-with-special-tokens]
========
=======
`vocabulary`::::
(Optional, object)
Expand Down Expand Up @@ -385,6 +454,29 @@ include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenizati
(Optional, boolean)
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenization-bert-with-special-tokens]
========
`mpnet`::::
(Optional, object)
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenization-mpnet]
+
.Properties of mpnet
[%collapsible%open]
========
`do_lower_case`::::
(Optional, boolean)
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenization-bert-do-lower-case]
`max_sequence_length`::::
(Optional, integer)
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenization-bert-max-sequence-length]
`truncate`::::
(Optional, string)
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenization-bert-truncate]
`with_special_tokens`::::
(Optional, boolean)
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenization-mpnet-with-special-tokens]
========
=======
`vocabulary`::::
Expand Down Expand Up @@ -436,6 +528,29 @@ include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenizati
(Optional, boolean)
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenization-bert-with-special-tokens]
========
`mpnet`::::
(Optional, object)
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenization-mpnet]
+
.Properties of mpnet
[%collapsible%open]
========
`do_lower_case`::::
(Optional, boolean)
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenization-bert-do-lower-case]
`max_sequence_length`::::
(Optional, integer)
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenization-bert-max-sequence-length]
`truncate`::::
(Optional, string)
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenization-bert-truncate]
`with_special_tokens`::::
(Optional, boolean)
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenization-mpnet-with-special-tokens]
========
=======
`vocabulary`::::
(Optional, object)
Expand Down Expand Up @@ -502,6 +617,29 @@ include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenizati
(Optional, boolean)
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenization-bert-with-special-tokens]
========
`mpnet`::::
(Optional, object)
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenization-mpnet]
+
.Properties of mpnet
[%collapsible%open]
========
`do_lower_case`::::
(Optional, boolean)
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenization-bert-do-lower-case]
`max_sequence_length`::::
(Optional, integer)
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenization-bert-max-sequence-length]
`truncate`::::
(Optional, string)
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenization-bert-truncate]
`with_special_tokens`::::
(Optional, boolean)
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenization-mpnet-with-special-tokens]
========
=======
`vocabulary`::::
(Optional, object)
Expand Down
138 changes: 138 additions & 0 deletions docs/reference/ml/trained-models/apis/put-trained-models.asciidoc
Original file line number Diff line number Diff line change
Expand Up @@ -458,6 +458,29 @@ include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenizati
(Optional, boolean)
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenization-bert-with-special-tokens]
=======
`mpnet`::::
(Optional, object)
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenization-mpnet]
+
.Properties of mpnet
[%collapsible%open]
=======
`do_lower_case`::::
(Optional, boolean)
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenization-bert-do-lower-case]

`max_sequence_length`::::
(Optional, integer)
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenization-bert-max-sequence-length]

`truncate`::::
(Optional, string)
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenization-bert-truncate]

`with_special_tokens`::::
(Optional, boolean)
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenization-mpnet-with-special-tokens]
=======
======
=====
Expand Down Expand Up @@ -504,6 +527,29 @@ include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenizati
(Optional, boolean)
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenization-bert-with-special-tokens]
=======
`mpnet`::::
(Optional, object)
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenization-mpnet]
+
.Properties of mpnet
[%collapsible%open]
=======
`do_lower_case`::::
(Optional, boolean)
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenization-bert-do-lower-case]

`max_sequence_length`::::
(Optional, integer)
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenization-bert-max-sequence-length]

`truncate`::::
(Optional, string)
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenization-bert-truncate]

`with_special_tokens`::::
(Optional, boolean)
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenization-mpnet-with-special-tokens]
=======
======
=====
Expand Down Expand Up @@ -544,6 +590,29 @@ include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenizati
(Optional, boolean)
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenization-bert-with-special-tokens]
=======
`mpnet`::::
(Optional, object)
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenization-mpnet]
+
.Properties of mpnet
[%collapsible%open]
=======
`do_lower_case`::::
(Optional, boolean)
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenization-bert-do-lower-case]

`max_sequence_length`::::
(Optional, integer)
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenization-bert-max-sequence-length]

`truncate`::::
(Optional, string)
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenization-bert-truncate]

`with_special_tokens`::::
(Optional, boolean)
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenization-mpnet-with-special-tokens]
=======
======
=====
Expand Down Expand Up @@ -607,6 +676,29 @@ include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenizati
(Optional, boolean)
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenization-bert-with-special-tokens]
=======
`mpnet`::::
(Optional, object)
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenization-mpnet]
+
.Properties of mpnet
[%collapsible%open]
=======
`do_lower_case`::::
(Optional, boolean)
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenization-bert-do-lower-case]

`max_sequence_length`::::
(Optional, integer)
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenization-bert-max-sequence-length]

`truncate`::::
(Optional, string)
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenization-bert-truncate]

`with_special_tokens`::::
(Optional, boolean)
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenization-mpnet-with-special-tokens]
=======
======
=====
`text_embedding`:::
Expand Down Expand Up @@ -646,6 +738,29 @@ include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenizati
(Optional, boolean)
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenization-bert-with-special-tokens]
=======
`mpnet`::::
(Optional, object)
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenization-mpnet]
+
.Properties of mpnet
[%collapsible%open]
=======
`do_lower_case`::::
(Optional, boolean)
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenization-bert-do-lower-case]

`max_sequence_length`::::
(Optional, integer)
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenization-bert-max-sequence-length]

`truncate`::::
(Optional, string)
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenization-bert-truncate]

`with_special_tokens`::::
(Optional, boolean)
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenization-mpnet-with-special-tokens]
=======
======
=====
`zero_shot_classification`:::
Expand Down Expand Up @@ -701,6 +816,29 @@ include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenizati
(Optional, boolean)
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenization-bert-with-special-tokens]
=======
`mpnet`::::
(Optional, object)
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenization-mpnet]
+
.Properties of mpnet
[%collapsible%open]
=======
`do_lower_case`::::
(Optional, boolean)
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenization-bert-do-lower-case]

`max_sequence_length`::::
(Optional, integer)
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenization-bert-max-sequence-length]

`truncate`::::
(Optional, string)
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenization-bert-truncate]

`with_special_tokens`::::
(Optional, boolean)
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-tokenization-mpnet-with-special-tokens]
=======
======
=====
====
Expand Down

0 comments on commit 9dc8aea

Please sign in to comment.