Embedding failed with large-nli-stsb model but works for base model #117

Closed
predoctech opened this issue May 20, 2020 · 9 comments · Fixed by #119
Comments

@predoctech

I'm trying to make use of the pre-trained models in sentence-transformers. It works for the base models (tried bert-base and roberta-base) but fails for the large models (roberta, and I think bert as well) with the following error at the time of embedding the corpus:

[Screenshot: error traceback during corpus embedding]

It seems like the embeddings are too large to write to the Elasticsearch index?

@tanaysoni
Contributor

Hi @predoctech, this might be due to the indexing call to Elasticsearch timing out. The timeout is increased from the default of 10 seconds to 30 seconds in #119. Can you pull the latest master and try again to see if it resolves the issue?
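(For reference, a minimal sketch of what raising the client timeout looks like with elasticsearch-py; the exact place where haystack configures its client may differ between versions, so treat the names below as illustrative.)

```python
from elasticsearch import Elasticsearch

# Raise the request timeout on the Elasticsearch client used for indexing.
# 10 seconds is the library default; #119 raises the value haystack passes to 30.
client = Elasticsearch(
    hosts=[{"host": "localhost", "port": 9200}],
    timeout=30,  # seconds; increase further if bulk-indexing embeddings is slow
)
```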

@tanaysoni tanaysoni reopened this May 22, 2020
@predoctech
Author

Hi @tanaysoni, when I tested the changes today the timeout error still persisted. Is a 30-second timeframe still not enough, or is there something else going on with the indexing call?

[Screenshot from 2020-05-25: timeout error traceback]

@tanaysoni
Contributor

Hi @predoctech, how large are the documents (number of characters or bytes) you're indexing? Can you check whether the indexing works for a single document instead of a batch?

@predoctech
Author

@tanaysoni, my documents are FAQ pairs, 970 rows in a single document. That isn't too large a document, is it? The base models worked; only the large model timed out.

@tanaysoni
Contributor

Hi @predoctech, yes, you're right, it is possible that 30 seconds is not enough. Can you try again with a very large timeout (e.g., 600) here?

@predoctech
Author

predoctech commented May 26, 2020

Hi @tanaysoni, I tried your suggestion and now the error is not with the timeout but with the embedding itself. Specifically, I noticed the following error message:
"BulkIndexError: ('500 document(s) failed to index.', [{'create': {'_index': 'document2', '_type': '_doc', '_id': 'riFKUHIB1cG6QRRa1aQo', 'status': 400, 'error': {'type': 'mapper_parsing_exception', 'reason': 'failed to parse', 'caused_by': {'type': 'illegal_argument_exception', 'reason': 'Field [question_emb] of type [dense_vector] of doc [riFKUHIB1cG6QRRa1aQo] has exceeded the number of dimensions [768] defined in mapping'}..."
I'm not sure why my data now exceeds the vector dimension defined for the DocumentStore?
N.B. I'm trying to load roberta-large-nli-stsb-mean-tokens

[Screenshot from 2020-05-26: BulkIndexError traceback]

@tholor
Member

tholor commented May 26, 2020

The larger models like roberta-large-nli-stsb-mean-tokens create embeddings of dimensionality 1024. It should work fine if you set embedding_dim=1024 when initializing your DocumentStore.
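(For reference, the output dimensionality of a sentence-transformers model can be checked directly; a quick sketch:)

```python
from sentence_transformers import SentenceTransformer

# The large NLI/STSb models emit 1024-dimensional sentence embeddings,
# while the bert-base/roberta-base variants emit 768-dimensional ones.
model = SentenceTransformer("roberta-large-nli-stsb-mean-tokens")
print(model.get_sentence_embedding_dimension())  # -> 1024
```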

@predoctech
Author

Thanks @tholor. That's right, setting the dimensionality to 1024 for the large models does work.
@tanaysoni, with a longer timeout the indexing issue is now resolved too. I used timeout=300, which seems to be sufficient for all the large models I have tried.
I trust that change will be merged into the code base? Thanks both.
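(A minimal sketch of the resulting setup, assuming a haystack version where both embedding_dim and the Elasticsearch timeout are exposed on the ElasticsearchDocumentStore constructor; the import path and parameter names may differ between versions, and at the time of this thread the timeout was set on the underlying client instead.)

```python
from haystack.database.elasticsearch import ElasticsearchDocumentStore

# 1024-dim mapping for roberta-large-nli-stsb-mean-tokens plus a generous
# request timeout for bulk-indexing the embeddings of all 970 FAQ rows.
document_store = ElasticsearchDocumentStore(
    host="localhost",
    index="document",
    embedding_field="question_emb",  # field name from the mapping error above
    embedding_dim=1024,              # must match the model's output dimensionality
    timeout=300,                     # seconds; reported as sufficient here
)
```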

@tholor
Member

tholor commented Jun 3, 2020

Fixed in #130

@tholor tholor closed this as completed Jun 3, 2020