[ML] add support for xlm_roberta tokenized models #94089
Pinging @elastic/ml-core (Team:ML)
Hi @benwtrent, I've created a changelog YAML for you.
…rent/elasticsearch into feature/ml-add-xlm-roberta-support
@elasticmachine update branch
Thanks for writing the docs for these changes! LGTM!
I've added `xlm_roberta` to the list of tokenization values via f0748b8, hope you don't mind.
LGTM
Can you look at the transport version number, please?
...re/src/main/java/org/elasticsearch/xpack/core/ml/action/PutTrainedModelVocabularyAction.java
    }

    @Override
    public ActionRequestValidationException validate() {
        ActionRequestValidationException validationException = null;
        if (vocabulary.isEmpty()) {
            validationException = addValidationError("[vocabulary] must not be empty", validationException);
        } else {
            if (scores.isEmpty() == false && scores.size() != vocabulary.size()) {
-            if (scores.isEmpty() == false && scores.size() != vocabulary.size()) {
+            if (scores.size() != vocabulary.size()) {
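
For readers following the review thread, here is a minimal, self-contained sketch of the check under discussion. It is not the code in `PutTrainedModelVocabularyAction`: the class name `VocabularyValidationSketch`, the list-of-strings return type, and the wording of the `[scores]` message are assumptions made for illustration. It keeps the original condition, so an empty `scores` list is accepted; the suggested change would reject that case as well.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone helper for illustration only;
// it does not use Elasticsearch's ActionRequestValidationException.
final class VocabularyValidationSketch {

    private VocabularyValidationSketch() {}

    /**
     * Collects validation errors for a vocabulary upload.
     * An empty scores list is allowed (assumed to mean "no scores supplied");
     * otherwise there must be exactly one score per vocabulary entry.
     */
    static List<String> validate(List<String> vocabulary, List<Double> scores) {
        List<String> errors = new ArrayList<>();
        if (vocabulary.isEmpty()) {
            errors.add("[vocabulary] must not be empty");
        } else if (scores.isEmpty() == false && scores.size() != vocabulary.size()) {
            // Message text is an assumption, not taken from the PR.
            errors.add("[scores] must have the same length as [vocabulary]");
        }
        return errors;
    }
}
```

With the simplified condition from the suggestion, the `scores.isEmpty() == false` guard would be dropped, which makes `scores` mandatory whenever a vocabulary is supplied.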
Many multi-lingual and newer models use a tokenization scheme similar to sentence-piece. This PR adds support for one of those tokenization schemes, XLMRoBERTa.

The main changes are:
- Support for `xlm_roberta` tokenization configuration
- Adding `scores` to the vocabulary document stored, requiring that scores be the same size as the vocabulary
- Adding a new flat text file to resources that is the spm char normalizer.
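
To illustrate why each vocabulary entry needs a matching score, the sketch below segments a string by picking the split with the highest total score, in the style of a unigram model. It is a simplified illustration and not the tokenizer added in this PR: the class and method names are hypothetical, and it skips normalization, whitespace markers, and unknown-token handling.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; not the Elasticsearch XLM-RoBERTa tokenizer.
final class UnigramSegmentationSketch {

    private UnigramSegmentationSketch() {}

    /** Returns the segmentation of text whose token scores sum to the maximum. */
    static List<String> segment(String text, List<String> vocabulary, List<Double> scores) {
        if (vocabulary.size() != scores.size()) {
            throw new IllegalArgumentException("scores and vocabulary must be the same size");
        }
        // scores.get(i) belongs to vocabulary.get(i), hence the size requirement.
        Map<String, Double> scoreByToken = new HashMap<>();
        int maxTokenLength = 0;
        for (int i = 0; i < vocabulary.size(); i++) {
            scoreByToken.put(vocabulary.get(i), scores.get(i));
            maxTokenLength = Math.max(maxTokenLength, vocabulary.get(i).length());
        }

        int n = text.length();
        double[] best = new double[n + 1]; // best[i]: top score for text[0, i)
        int[] prev = new int[n + 1];       // prev[i]: start index of the last token ending at i
        Arrays.fill(best, Double.NEGATIVE_INFINITY);
        Arrays.fill(prev, -1);
        best[0] = 0.0;

        for (int end = 1; end <= n; end++) {
            for (int start = Math.max(0, end - maxTokenLength); start < end; start++) {
                if (best[start] == Double.NEGATIVE_INFINITY) {
                    continue; // prefix text[0, start) cannot be segmented
                }
                Double score = scoreByToken.get(text.substring(start, end));
                if (score != null && best[start] + score > best[end]) {
                    best[end] = best[start] + score;
                    prev[end] = start;
                }
            }
        }
        if (n > 0 && prev[n] == -1) {
            throw new IllegalArgumentException("text cannot be segmented with this vocabulary");
        }

        // Walk the back-pointers from the end, then restore left-to-right order.
        List<String> tokens = new ArrayList<>();
        for (int end = n; end > 0; end = prev[end]) {
            tokens.add(text.substring(prev[end], end));
        }
        Collections.reverse(tokens);
        return tokens;
    }
}
```

For example, with the vocabulary `["a", "b", "ab", "c"]`, the input `"abc"` splits as `["ab", "c"]` when `"ab"` scores higher than `"a"` and `"b"` combined, and as `["a", "b", "c"]` otherwise.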
This allows XLMRoberta models to be uploaded to Elasticsearch. blocked by: elastic/elasticsearch#94089