Embedding failed with large-nli-stsb model but works for base model #117

Closed
predoctech opened this issue May 20, 2020 · 9 comments · Fixed by #119
Comments

@predoctech

I'm trying to make use of the pre-trained models in sentence-transformers. It works for the base models (tried bert-base and roberta-base) but fails for the large models (roberta, and I think bert as well) with the following error at the time of embedding the corpus:

[Screenshot: error traceback during corpus embedding]

It seems like the embeddings are too large to write to the Elasticsearch index?

@tanaysoni
Contributor

Hi @predoctech, this might be due to the indexing call to Elasticsearch timing out. The timeout is increased from the default of 10 seconds to 30 seconds in #119. Can you pull the latest master and try again to see if it resolves the issue?
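(For reference, a minimal sketch of what raising the client timeout looks like with elasticsearch-py; the exact place where haystack configures its client may differ between versions, so treat the names below as illustrative.)

```python
from elasticsearch import Elasticsearch

# Raise the request timeout on the Elasticsearch client used for indexing.
# 10 seconds is the library default; #119 raises the value haystack passes to 30.
client = Elasticsearch(
    hosts=[{"host": "localhost", "port": 9200}],
    timeout=30,  # seconds; increase further if bulk-indexing embeddings is slow
)
```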

@tanaysoni tanaysoni reopened this May 22, 2020
@predoctech
Author

Hi @tanaysoni, when I tested the changes today the timeout error still persisted. Is a 30-second timeframe still not enough, or is there something else going on with the indexing call?

[Screenshot from 2020-05-25: timeout error traceback]

@tanaysoni
Contributor

Hi @predoctech, how large are the documents (number of characters or bytes) you're indexing? Can you check whether the indexing works for a single document instead of a batch?

@predoctech
Author

@tanaysoni, my documents are FAQ pairs, 970 rows in a single document. That isn't too large a document, is it? The base models worked; only the large model timed out.

@tanaysoni
Contributor

Hi @predoctech, yes, you're right, it is possible that 30 seconds is not enough. Can you try again with a very large timeout (e.g., 600) here?

@predoctech
Author

predoctech commented May 26, 2020

Hi @tanaysoni, I tried your suggestion and now the error is not with the timeout but with the embedding itself. Specifically, I noticed the following error message:
"BulkIndexError: ('500 document(s) failed to index.', [{'create': {'_index': 'document2', '_type': '_doc', '_id': 'riFKUHIB1cG6QRRa1aQo', 'status': 400, 'error': {'type': 'mapper_parsing_exception', 'reason': 'failed to parse', 'caused_by': {'type': 'illegal_argument_exception', 'reason': 'Field [question_emb] of type [dense_vector] of doc [riFKUHIB1cG6QRRa1aQo] has exceeded the number of dimensions [768] defined in mapping'}..."
I'm not sure why my data now exceeds the vector dimension defined for the DocumentStore?
N.B. I'm trying to load roberta-large-nli-stsb-mean-tokens

[Screenshot from 2020-05-26: BulkIndexError traceback]

@tholor
Member

tholor commented May 26, 2020

The larger models like roberta-large-nli-stsb-mean-tokens create embeddings of dimensionality 1024. It should work fine if you set embedding_dim=1024 when initializing your DocumentStore.
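(For reference, the output dimensionality of a sentence-transformers model can be checked directly; a quick sketch:)

```python
from sentence_transformers import SentenceTransformer

# The large NLI/STSb models emit 1024-dimensional sentence embeddings,
# while the bert-base/roberta-base variants emit 768-dimensional ones.
model = SentenceTransformer("roberta-large-nli-stsb-mean-tokens")
print(model.get_sentence_embedding_dimension())  # -> 1024
```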

@predoctech
Author

Thanks @tholor. That's right, setting the dimensionality to 1024 for the large models does work.
@tanaysoni, with a longer timeout the indexing issue is now resolved too. I used timeout=300, which seems to be sufficient for all the large models I have tried.
I trust that change will be merged into the code base? Thanks both.
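(A minimal sketch of the resulting setup, assuming a haystack version where both embedding_dim and the Elasticsearch timeout are exposed on the ElasticsearchDocumentStore constructor; the import path and parameter names may differ between versions, and at the time of this thread the timeout was set on the underlying client instead.)

```python
from haystack.database.elasticsearch import ElasticsearchDocumentStore

# 1024-dim mapping for roberta-large-nli-stsb-mean-tokens plus a generous
# request timeout for bulk-indexing the embeddings of all 970 FAQ rows.
document_store = ElasticsearchDocumentStore(
    host="localhost",
    index="document",
    embedding_field="question_emb",  # field name from the mapping error above
    embedding_dim=1024,              # must match the model's output dimensionality
    timeout=300,                     # seconds; reported as sufficient here
)
```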

@tholor
Member

tholor commented Jun 3, 2020

Fixed in #130

@tholor tholor closed this as completed Jun 3, 2020