[ML] add support for xlm_roberta tokenized models #94089

benwtrent · 2023-02-23T16:43:23Z

Many multilingual and newer models use a tokenization scheme similar to SentencePiece. This PR adds support for one of those tokenization schemes, XLMRoBERTa.

The main changes are:

- Support for the `xlm_roberta` tokenization configuration
- Adding `scores` to the stored vocabulary document; `scores` must be the same size as the vocabulary
- Adding a new flat text file to resources that holds the spm char normalizer
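
For readers wondering why the vocabulary needs per-token scores: SentencePiece-style unigram tokenizers generally pick the segmentation whose pieces have the highest total score. The sketch below illustrates that idea only and is not the code in this PR. The class name `UnigramSegmentationSketch`, the `segment` method, and the sample scores are all hypothetical. It also skips normalization, the whitespace marker, and unknown-token handling that a real tokenizer needs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; not the tokenizer implemented in this PR.
final class UnigramSegmentationSketch {

    // Returns the split of `text` into vocabulary pieces with the highest total score.
    // Works on UTF-16 char offsets, which is enough for the ASCII example below.
    static List<String> segment(String text, Map<String, Double> scoredVocab) {
        int n = text.length();
        double[] best = new double[n + 1]; // best[i]: best total score for text[0, i)
        int[] prev = new int[n + 1];       // prev[i]: start offset of the last piece ending at i
        Arrays.fill(best, Double.NEGATIVE_INFINITY);
        Arrays.fill(prev, -1);
        best[0] = 0.0;

        for (int end = 1; end <= n; end++) {
            for (int start = 0; start < end; start++) {
                if (best[start] == Double.NEGATIVE_INFINITY) {
                    continue; // the prefix text[0, start) cannot be segmented
                }
                Double score = scoredVocab.get(text.substring(start, end));
                if (score == null) {
                    continue; // not a vocabulary piece
                }
                double candidate = best[start] + score;
                if (candidate > best[end]) {
                    best[end] = candidate;
                    prev[end] = start;
                }
            }
        }
        if (n > 0 && prev[n] == -1) {
            throw new IllegalArgumentException("text cannot be segmented with this vocabulary");
        }

        // Walk the back-pointers from the end of the text, then restore left-to-right order.
        List<String> pieces = new ArrayList<>();
        for (int end = n; end > 0; end = prev[end]) {
            pieces.add(text.substring(prev[end], end));
        }
        Collections.reverse(pieces);
        return pieces;
    }

    public static void main(String[] args) {
        // Made-up scores (log-probability-like values, higher is better).
        Map<String, Double> vocab = Map.of("un", -3.0, "related", -4.0, "unrelated", -9.0);
        System.out.println(segment("unrelated", vocab)); // prints [un, related]
    }
}
```

With these made-up scores, "un" + "related" totals -7.0 and beats the single piece "unrelated" at -9.0. Without one score per vocabulary entry this choice cannot be made, which is why the two lists must be the same size.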

github-actions · 2023-02-23T16:43:38Z

Documentation preview:

✨ Changed pages

elasticsearchmachine · 2023-02-23T16:43:48Z

Pinging @elastic/ml-core (Team:ML)

elasticsearchmachine · 2023-02-23T16:44:19Z

Hi @benwtrent, I've created a changelog YAML for you.

benwtrent · 2023-02-23T19:42:22Z

@elasticmachine update branch

benwtrent · 2023-05-12T13:39:09Z

@elasticmachine update branch

davidkyle · 2023-05-25T11:23:54Z

@elasticmachine update branch

benwtrent · 2023-06-06T17:39:56Z

@elasticmachine update branch

szabosteve

Thanks for writing the docs for these changes! LGTM!
I've added xlm_roberta to the list of tokenization values via f0748b8, hope you don't mind.

davidkyle

LGTM

Can you look at the transport version number, please?

davidkyle · 2023-06-13T08:03:39Z

...re/src/main/java/org/elasticsearch/xpack/core/ml/action/PutTrainedModelVocabularyAction.java

        }

        @Override
        public ActionRequestValidationException validate() {
            ActionRequestValidationException validationException = null;
            if (vocabulary.isEmpty()) {
                validationException = addValidationError("[vocabulary] must not be empty", validationException);
+            } else {
+                if (scores.isEmpty() == false && scores.size() != vocabulary.size()) {


Suggested change

-                if (scores.isEmpty() == false && scores.size() != vocabulary.size()) {
+                if (scores.size() != vocabulary.size()) {
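
To make the difference between the two conditions concrete, here is a minimal stand-alone sketch. It is not the actual `PutTrainedModelVocabularyAction` code: the class and method names are hypothetical, and the `[scores]` error message is an assumed wording. The first variant treats an empty `scores` list as "no scores supplied" and accepts it. The second also rejects an empty `scores` list whenever the vocabulary is non-empty.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone sketch; names and the [scores] message are assumptions.
final class VocabularyValidationSketch {

    // Condition as written in the diff: scores are optional, but if present
    // they must line up one-to-one with the vocabulary.
    static List<String> validateLenient(List<String> vocabulary, List<Double> scores) {
        List<String> errors = new ArrayList<>();
        if (vocabulary.isEmpty()) {
            errors.add("[vocabulary] must not be empty");
        } else if (scores.isEmpty() == false && scores.size() != vocabulary.size()) {
            errors.add("[scores] must have same length as [vocabulary]");
        }
        return errors;
    }

    // Condition from the suggested change: any size mismatch is an error,
    // including an empty scores list with a non-empty vocabulary.
    static List<String> validateStrict(List<String> vocabulary, List<Double> scores) {
        List<String> errors = new ArrayList<>();
        if (vocabulary.isEmpty()) {
            errors.add("[vocabulary] must not be empty");
        } else if (scores.size() != vocabulary.size()) {
            errors.add("[scores] must have same length as [vocabulary]");
        }
        return errors;
    }

    public static void main(String[] args) {
        List<String> vocabulary = List.of("hello", "world");
        List<Double> noScores = List.of();

        // Vocabulary without scores: accepted by the lenient check only.
        System.out.println(validateLenient(vocabulary, noScores).isEmpty()); // prints true
        System.out.println(validateStrict(vocabulary, noScores).isEmpty());  // prints false

        // A partial scores list is rejected by both variants.
        System.out.println(validateLenient(vocabulary, List.of(-1.0)).isEmpty()); // prints false
        System.out.println(validateStrict(vocabulary, List.of(-1.0)).isEmpty());  // prints false
    }
}
```

The two variants differ only when `scores` is empty, so the choice comes down to whether a vocabulary uploaded without scores should still pass validation.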

benwtrent · 2023-06-13T11:55:55Z

@elasticmachine update branch

[ML] add support for xml_roberta tokenized models

ade9414

benwtrent added >feature :ml Machine learning v8.8.0 labels Feb 23, 2023

elasticsearchmachine added the Team:ML Meta label for the ML team label Feb 23, 2023

minor code change

0cf52fa

Update docs/changelog/94089.yaml

99dd2d7

benwtrent mentioned this pull request Feb 23, 2023

[ML] add ability to upload xlm-roberta tokenized models elastic/eland#518

Merged

benwtrent added 5 commits February 23, 2023 11:50

formatting

fcd9b75

Merge branch 'feature/ml-add-xlm-roberta-support' of github.com:benwt…

feb4034

…rent/elasticsearch into feature/ml-add-xlm-roberta-support

compilation fix

c4b97fc

muting bwc tests

705e0e0

fixing docs

4c6a4aa

elasticmachine and others added 4 commits February 24, 2023 06:12

Merge branch 'main' into feature/ml-add-xlm-roberta-support

46d2c98

fixing serialization

f8fb8be

fixing tests and adding test

49d70fc

Merge branch 'feature/ml-add-xlm-roberta-support' of github.com:benwt…

bb7a51c

…rent/elasticsearch into feature/ml-add-xlm-roberta-support

gmarouli added v8.9.0 and removed v8.8.0 labels Apr 26, 2023

elasticmachine and others added 2 commits May 12, 2023 23:39

Merge branch 'main' into feature/ml-add-xlm-roberta-support

7bd0543

merging main fixes

dced7c1

davidkyle added the cloud-deploy Publish cloud docker image for Cloud-First-Testing label May 25, 2023

Merge branch 'main' into feature/ml-add-xlm-roberta-support

1464f2a

Merge branch 'main' into feature/ml-add-xlm-roberta-support

1e703a3

benwtrent requested review from davidkyle and szabosteve June 6, 2023 18:25

[DOCS] Adds xlm_roberta to the list of tokenization values.

f0748b8

szabosteve approved these changes Jun 7, 2023

szabosteve and others added 3 commits June 7, 2023 16:22

[DOCS] Marks XLMRoBERTa as experimental.

f07e061

Merge branch 'main' into feature/ml-add-xlm-roberta-support

a2b1fdd

fixing compilation

086f8f7

davidkyle approved these changes Jun 13, 2023

benwtrent added 2 commits June 13, 2023 07:30

fixing tests

efdc687

fixing transport version

809dc38

Merge branch 'main' into feature/ml-add-xlm-roberta-support

e457945

benwtrent added the auto-merge Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) label Jun 13, 2023

elasticsearchmachine merged commit 14ca8fe into elastic:main Jun 13, 2023
13 checks passed

benwtrent deleted the feature/ml-add-xlm-roberta-support branch June 13, 2023 12:41

benwtrent added a commit to elastic/eland that referenced this pull request Jun 14, 2023

[ML] add ability to upload xlm-roberta tokenized models (#518)

8b327f6

This allows XLMRoberta models to be uploaded to Elasticsearch. blocked by: elastic/elasticsearch#94089

picandocodigo pushed a commit to elastic/eland that referenced this pull request Jul 11, 2023

[ML] add ability to upload xlm-roberta tokenized models (#518)

b066581

This allows XLMRoberta models to be uploaded to Elasticsearch. blocked by: elastic/elasticsearch#94089
