
Change to the new deberta model #735

Merged
merged 4 commits into from
Jul 28, 2023

Conversation

kwalcock
Member

No description provided.

@kwalcock
Member Author

This shows how to change from the roberta model to the deberta model. However, using it will run into #21.

@MihaiSurdeanu
Contributor

Hey @kwalcock, any updates on this? Thank you!

@kwalcock
Member Author

I'm still digging deep into the Python code. It is going to be difficult. The most recent changes are in the kwalcock/types branch FWIW.

@MihaiSurdeanu
Contributor

That message is a warning, and it is possible that the tokens look similar. Did you compare the tokens produced by the Python and Rust tokenizers?
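One way to do that comparison is a small diff helper that reports the first position at which two token sequences disagree. This is a generic sketch, not code from the PR; `python_tokens` and `rust_tokens` are placeholders for whatever each tokenizer actually produced:

```python
def first_mismatch(a, b):
    """Return the index of the first differing token, or None if the
    sequences are identical (including equal length)."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    if len(a) != len(b):
        # One sequence is a prefix of the other; the shorter length is
        # the first position where they differ.
        return min(len(a), len(b))
    return None

# Illustrative only: suppose the Rust tokenizer drops a leading meta-symbol.
python_tokens = ["▁Hello", "▁world", "!"]
rust_tokens = ["Hello", "▁world", "!"]
assert first_mismatch(python_tokens, rust_tokens) == 0
assert first_mismatch(python_tokens, python_tokens) is None
```

Printing a few tokens around the reported index usually makes it obvious whether the warning is cosmetic or a real divergence.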

@MihaiSurdeanu
Contributor

In any case, if not all tokenizers are supported, it's not the end of the world. We just need to know, so we don't use the deberta model.

@kwalcock
Member Author

All of the Python output matches regardless of whether use_fast is True or False. I'm assuming that use_fast=False makes the pure-Python implementation run instead of the Rust one, but I have not yet seen exactly how that works. So far everything has also matched the Rust code called from Scala, as long as the tokenizer is one of those available directly from Rust.
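The check described above amounts to a loop over tokenizer names, comparing the use_fast=True and use_fast=False outputs. In the real setting, a `load_tokenizer` function would wrap HuggingFace's `AutoTokenizer.from_pretrained(name, use_fast=...)`; here it is a hypothetical stub (a whitespace splitter) so the sketch runs standalone:

```python
def load_tokenizer(name, use_fast):
    # Hypothetical stand-in for AutoTokenizer.from_pretrained(name, use_fast=use_fast).
    # Both the fast (Rust-backed) and slow (pure-Python) paths are modeled by the
    # same whitespace splitter, since for a supported tokenizer the two
    # implementations are expected to produce identical tokens.
    return lambda text: text.split()

def outputs_match(name, text):
    fast = load_tokenizer(name, use_fast=True)
    slow = load_tokenizer(name, use_fast=False)
    return fast(text) == slow(text)

names = ["bert-base-cased", "roberta-base", "xlm-roberta-base"]
assert all(outputs_match(n, "Change to the new deberta model") for n in names)
```

With the real AutoTokenizer plugged in, any name for which `outputs_match` is False would flag a tokenizer whose slow path does extra work that the Rust path cannot reproduce.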

@kwalcock
Member Author

The top tokenizers here (and maybe more) work from Scala via Rust without a detour through Python. The trick will be to get the ones that need some kind of Python assistance to work. Aren't tokenizers paired with models, so that this limits the models that can be used?

class SentencesTest extends Test {
  // See also test_clu_tokenizer.py.
  val tokenizerNames = Seq(
    "bert-base-cased",
    "distilbert-base-cased",
    "roberta-base",
    "xlm-roberta-base" // ,
    // All of these latter ones will not just fail, but cause a
    // fatal runtime error and end the testing completely.
    // "google/bert_uncased_L-4_H-512_A-8",
    // "google/electra-small-discriminator",
    // "microsoft/deberta-v3-base"
  )
}

@MihaiSurdeanu
Contributor

Yes, limiting the tokenizers will limit the models we have access to. In particular, deberta is an important one for processors.
I wonder if we can do the same trick we did with Breeze, that is, replicate the Python parts directly in Scala. Is that complicated?

@MihaiSurdeanu
Contributor

Nice!! Ok to merge?

@kwalcock
Member Author

Yes. Just finished testing.

@kwalcock kwalcock merged commit 3432394 into balaur Jul 28, 2023
@kwalcock kwalcock deleted the kwalcock/balaur branch July 28, 2023 06:32
@MihaiSurdeanu
Contributor

This works great!
