Exception: fit() needs to called before retrieve() #1637

SaffronWolf · 2021-10-22T11:55:20Z

Describe the bug

version: '0.9'

components:    # define all the building-blocks for Pipeline
  - name: DocumentStore
    type: InMemoryDocumentStore
  - name: Retriever
    type: TfidfRetriever
    params:
      document_store: DocumentStore    # params can reference other components defined in the YAML
      top_k: 5
  - name: Reader       # custom-name for the component; helpful for visualization & debugging
    type: FARMReader    # Haystack Class name for the component
    params:
      model_name_or_path: ahotrod/albert_xxlargev1_squad2_512
  - name: TextFileConverter
    type: TextConverter
  - name: PDFFileConverter
    type: PDFToTextConverter
  - name: Preprocessor
    type: PreProcessor
    params:
      clean_empty_lines: True
      clean_whitespace: True
      clean_header_footer: True
      split_by: "word"
      split_length: 100
      split_respect_sentence_boundary: True
      split_overlap: 3
  - name: FileTypeClassifier
    type: FileTypeClassifier

pipelines:
  - name: query    # a sample extractive-qa Pipeline
    type: Query
    nodes:
      - name: Retriever
        inputs: [Query]
      - name: Reader
        inputs: [Retriever]
  - name: indexing
    type: Indexing
    nodes:
      - name: FileTypeClassifier
        inputs: [File]
      - name: TextFileConverter
        inputs: [FileTypeClassifier.output_1]
      - name: PDFFileConverter
        inputs: [FileTypeClassifier.output_2]
      - name: Preprocessor
        inputs: [PDFFileConverter, TextFileConverter]
      - name: Retriever
        inputs: [Preprocessor]
      - name: DocumentStore
        inputs: [Retriever]

I am getting "Exception: fit() needs to called before retrieve()" after running a query through either the Streamlit UI or the REST API.

Error message
Exception: fit() needs to called before retrieve()

To Reproduce
Start RestAPI
gunicorn rest_api.application:app -b 0.0.0.0:8000 -k uvicorn.workers.UvicornWorker -t 300
UI
streamlit run webapp.py

Run the query
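
For anyone reproducing this without the Streamlit UI, the query step can be sketched as a direct call to the REST API. This is only a sketch under assumptions: the `/query` route, the payload layout, and the helper names `build_query_request` and `run_query` are not confirmed anywhere in this report. The host and port come from the gunicorn command above, and the query text and `top_k` values are the ones that show up in the server log further down.

```python
import json
import urllib.request
from typing import Any, Dict

API_URL = "http://localhost:8000"  # assumption: the gunicorn bind address above


def build_query_request(
    query: str, top_k: int = 3, base_url: str = API_URL
) -> urllib.request.Request:
    """Build (but do not send) the query request. Route and payload are assumed."""
    payload: Dict[str, Any] = {
        "query": query,
        "params": {"Retriever": {"top_k": top_k}, "Reader": {"top_k": top_k}},
    }
    return urllib.request.Request(
        url=f"{base_url}/query",  # assumed route
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(request: urllib.request.Request, timeout: float = 300.0) -> Dict[str, Any]:
    """Send the request and decode the JSON body of the response."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Calling `run_query(build_query_request("Who is the father of Arya Stark?"))` against the running server should surface the failure as a `urllib.error.HTTPError` if the server answers with an error status.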

System:

OS: WSL
GPU/CPU: CPU
Haystack version (commit or version number): 0.10.0
DocumentStore: InMemoryDocumentStore
Reader: FARMReader
Retriever: TfidfRetriever

ZanSara · 2021-10-22T12:38:55Z

I can confirm that I am able to reproduce this, but at first glance I don't see anything wrong with your setup. I'm going to find out what's going on here and let you know 👍

ZanSara · 2021-10-22T13:38:23Z

Hey @SaffronWolf, could you share the entire logs of your API server, from boot? I'm looking for a line similar to "Fit method called with empty document store", or anything else that might help us understand what is going on.

The issue is interesting because it seems that TfidfRetriever needs the documents to be present in the document store at the moment it is created, which is impractical if you're loading a pipeline from YAML with an InMemoryDocumentStore.
For me, what "fixed" the issue was using an ElasticsearchDocumentStore that already contained a few documents. I am going to make the necessary changes so that this pipeline works with InMemoryDocumentStore too.

SaffronWolf · 2021-10-22T13:56:30Z

Hi @ZanSara, thanks for looking into it.

> Have you verified that you have documents in your document store? This error is definitely misleading, but can arise when your document store is simply empty (BTW I will address this in a separate issue).

I am not sure how to add documents to InMemoryDocumentStore. Can you please point me to the documentation on it? I am using the REST API available in the haystack repository.

> could you share the entire logs of your API server, from boot?

[2021-10-22 17:37:45 +0530] [6870] [INFO] Starting gunicorn 20.1.0
[2021-10-22 17:37:45 +0530] [6870] [INFO] Listening at: http://0.0.0.0:8000 (6870)
[2021-10-22 17:37:45 +0530] [6870] [INFO] Using worker: uvicorn.workers.UvicornWorker
[2021-10-22 17:37:45 +0530] [6889] [INFO] Booting worker with pid: 6889
/mnt/d/Haystack/env/lib/python3.8/site-packages/ray/autoscaler/_private/cli_logger.py:57: FutureWarning: Not all Ray CLI dependencies were found. In Ray 1.4+, the Ray CLI, autoscaler, and dashboard will only be usable via `pip install 'ray[default]'`. Please update your install command.
  warnings.warn(
pdftotext version 4.03 [www.xpdfreader.com]
Copyright 1996-2021 Glyph & Cog, LLC
Fit method called with empty document store
Fit method called with empty document store
Some weights of the model checkpoint at ahotrod/albert_xxlargev1_squad2_512 were not used when initializing AlbertModel: ['qa_outputs.bias', 'qa_outputs.weight']
- This IS expected if you are initializing AlbertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing AlbertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
ML Logging is turned off. No parameters, metrics or artifacts will be logged to MLFlow.
[2021-10-22 17:39:13 +0530] [6889] [INFO] Started server process [6889]
[2021-10-22 17:39:13 +0530] [6889] [INFO] Waiting for application startup.
[2021-10-22 17:39:13 +0530] [6889] [INFO] Application startup complete.
[2021-10-22 17:40:04 +0530] [6889] [ERROR] Exception in ASGI application
Traceback (most recent call last):
  File "/mnt/d/Haystack/haystack/haystack/pipeline.py", line 337, in run
    node_output, stream_id = self.graph.nodes[node_id]["component"]._dispatch_run(**node_input)
  File "/mnt/d/Haystack/haystack/haystack/schema.py", line 735, in _dispatch_run
    output, stream = self.run(**run_inputs, **run_params)
  File "/mnt/d/Haystack/haystack/haystack/retriever/base.py", line 192, in run
    output, stream = run_query_timed(query=query, filters=filters, top_k=top_k, index=index)
  File "/mnt/d/Haystack/haystack/haystack/retriever/base.py", line 43, in wrapper
    ret = fn(*args, **kwargs)
  File "/mnt/d/Haystack/haystack/haystack/retriever/base.py", line 208, in run_query
    documents = self.retrieve(query=query, filters=filters, top_k=top_k, index=index)
  File "/mnt/d/Haystack/haystack/haystack/retriever/sparse.py", line 182, in retrieve
    raise Exception("fit() needs to called before retrieve()")
Exception: fit() needs to called before retrieve()

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/mnt/d/Haystack/env/lib/python3.8/site-packages/uvicorn/protocols/http/httptools_impl.py", line 375, in run_asgi
    result = await app(self.scope, self.receive, self.send)
  File "/mnt/d/Haystack/env/lib/python3.8/site-packages/uvicorn/middleware/proxy_headers.py", line 75, in __call__
    return await self.app(scope, receive, send)
  File "/mnt/d/Haystack/env/lib/python3.8/site-packages/fastapi/applications.py", line 208, in __call__
    await super().__call__(scope, receive, send)
  File "/mnt/d/Haystack/env/lib/python3.8/site-packages/starlette/applications.py", line 112, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/mnt/d/Haystack/env/lib/python3.8/site-packages/starlette/middleware/errors.py", line 181, in __call__
    raise exc
  File "/mnt/d/Haystack/env/lib/python3.8/site-packages/starlette/middleware/errors.py", line 159, in __call__
    await self.app(scope, receive, _send)
  File "/mnt/d/Haystack/env/lib/python3.8/site-packages/starlette/middleware/cors.py", line 84, in __call__
    await self.app(scope, receive, send)
  File "/mnt/d/Haystack/env/lib/python3.8/site-packages/starlette/exceptions.py", line 82, in __call__
    raise exc
  File "/mnt/d/Haystack/env/lib/python3.8/site-packages/starlette/exceptions.py", line 71, in __call__
    await self.app(scope, receive, sender)
  File "/mnt/d/Haystack/env/lib/python3.8/site-packages/starlette/routing.py", line 656, in __call__
    await route.handle(scope, receive, send)
  File "/mnt/d/Haystack/env/lib/python3.8/site-packages/starlette/routing.py", line 259, in handle
    await self.app(scope, receive, send)
  File "/mnt/d/Haystack/env/lib/python3.8/site-packages/starlette/routing.py", line 61, in app
    response = await func(request)
  File "/mnt/d/Haystack/env/lib/python3.8/site-packages/fastapi/routing.py", line 226, in app
    raw_response = await run_endpoint_function(
  File "/mnt/d/Haystack/env/lib/python3.8/site-packages/fastapi/routing.py", line 161, in run_endpoint_function
    return await run_in_threadpool(dependant.call, **values)
  File "/mnt/d/Haystack/env/lib/python3.8/site-packages/starlette/concurrency.py", line 39, in run_in_threadpool
    return await anyio.to_thread.run_sync(func, *args)
  File "/mnt/d/Haystack/env/lib/python3.8/site-packages/anyio/to_thread.py", line 28, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(func, *args, cancellable=cancellable,
  File "/mnt/d/Haystack/env/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 805, in run_sync_in_worker_thread
    return await future
  File "/mnt/d/Haystack/env/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 743, in run
    result = func(*args)
  File "/mnt/d/Haystack/haystack/rest_api/controller/search.py", line 48, in query
    result = _process_request(PIPELINE, request)
  File "/mnt/d/Haystack/haystack/rest_api/controller/search.py", line 66, in _process_request
    result = pipeline.run(query=request.query, params=params)
  File "/mnt/d/Haystack/haystack/haystack/pipeline.py", line 340, in run
    raise Exception(f"Exception while running node `{node_id}` with input `{node_input}`: {e}, full stack trace: {tb}")
Exception: Exception while running node `Retriever` with input `{'root_node': 'Query', 'params': {'filters': None, 'Retriever': {'top_k': 3, 'filters': {}}, 'Reader': {'top_k': 3}}, 'query': 'Who is the father of Arya Stark?', 'node_id': 'Retriever'}`: fit() needs to called before retrieve(), full stack trace: Traceback (most recent call last):
  File "/mnt/d/Haystack/haystack/haystack/pipeline.py", line 337, in run
    node_output, stream_id = self.graph.nodes[node_id]["component"]._dispatch_run(**node_input)
  File "/mnt/d/Haystack/haystack/haystack/schema.py", line 735, in _dispatch_run
    output, stream = self.run(**run_inputs, **run_params)
  File "/mnt/d/Haystack/haystack/haystack/retriever/base.py", line 192, in run
    output, stream = run_query_timed(query=query, filters=filters, top_k=top_k, index=index)
  File "/mnt/d/Haystack/haystack/haystack/retriever/base.py", line 43, in wrapper
    ret = fn(*args, **kwargs)
  File "/mnt/d/Haystack/haystack/haystack/retriever/base.py", line 208, in run_query
    documents = self.retrieve(query=query, filters=filters, top_k=top_k, index=index)
  File "/mnt/d/Haystack/haystack/haystack/retriever/sparse.py", line 182, in retrieve
    raise Exception("fit() needs to called before retrieve()")
Exception: fit() needs to called before retrieve()

ZanSara · 2021-10-22T14:22:39Z

Unfortunately it won't solve this specific issue, but just FYI, the description of the REST API is here: https://haystack.deepset.ai/guides/rest-api. It explains all available endpoints in detail, including the file upload one.

Thanks for the stacktrace!

SaffronWolf · 2021-10-22T14:26:44Z

Thanks. Would using a different retriever, or Elasticsearch as the DocumentStore, solve the issue?

ZanSara · 2021-10-22T14:51:51Z

I got it working by swapping InMemoryDocumentStore for ElasticsearchDocumentStore and making sure that some documents were present in the store at startup (for example, you can start the server, upload some docs and then restart it). Any document store that can persist data should work too.
If you want the pipeline to never crash even when the document store is empty, though, it is better to change the retriever. In that case you can keep InMemoryDocumentStore.

aakar-007 · 2021-10-22T15:35:32Z

Hi @ZanSara Thanks for your response.
I tried using InMemoryDocumentStore with DensePassageRetriever. It does not throw the same error (we could see the embeddings being created, so we know that the document store is not empty), but the output is blank.

ZanSara · 2021-10-22T16:00:14Z

Hi @aakar-007, interesting. Could you share a code snippet to reproduce the issue?

aakar-007 · 2021-10-22T16:07:27Z

Hi @ZanSara, the code is the same as the one @SaffronWolf posted. I just swapped TfidfRetriever for DensePassageRetriever.

ZanSara · 2021-10-25T13:38:56Z

Hello @aakar-007 and @SaffronWolf, I'm having a hard time reproducing your issue here. Could you try using the POST /documents/get_by_filters endpoint to check the actual content of your document store? And if the document store is not empty, can you tell me a bit more about the files you're using to populate it, or in general about what it contains?
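
A minimal sketch of that check, using only the Python standard library. The endpoint path is the one named above; the base URL is assumed from the gunicorn command in the original report, and the `{"filters": {}}` body, the JSON-list response, and the helper names `build_get_by_filters_request` and `count_documents` are assumptions for illustration rather than a documented contract.

```python
import json
import urllib.request
from typing import Any, Dict, Optional

API_URL = "http://localhost:8000"  # assumption: the gunicorn bind address in the report


def build_get_by_filters_request(
    filters: Optional[Dict[str, Any]] = None, base_url: str = API_URL
) -> urllib.request.Request:
    """Build (but do not send) a POST /documents/get_by_filters request."""
    # Assumed payload shape: an empty filter dict should match every document.
    body = json.dumps({"filters": filters or {}}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/documents/get_by_filters",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def count_documents(request: urllib.request.Request, timeout: float = 30.0) -> int:
    """Send the request and count the returned items (assumes a JSON list)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return len(json.load(response))
```

If `count_documents(build_get_by_filters_request())` comes back as 0 on the running server, the store is empty and the retriever has nothing to search.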

julian-risch · 2021-10-27T14:55:01Z

I had a look at where the exception is raised:

haystack/haystack/nodes/retriever/sparse.py

Line 177 in 13510aa

raise Exception("fit() needs to called before retrieve()")

The problem is that the TfidfRetriever uses a dataframe df to store paragraphs, plus the term frequencies and inverse document frequencies that need to be calculated in the fit() method based on the documents stored in the document store. This calculation needs to be done before any document retrieval step can be executed. To this end, fit() is called in the __init__() method of the TfidfRetriever here:

haystack/haystack/nodes/retriever/sparse.py

Line 134 in 13510aa

self.fit()

However, if there aren't any documents yet, the dataframe df remains empty, no scores are calculated and any retrieval step fails with the reported exception.

A quick fix is to call self.fit() when self.df is None, just before the check that throws the exception:

haystack/haystack/nodes/retriever/sparse.py

Line 176 in 13510aa

if self.df is None:

I created a PR for that: #1665
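
To illustrate the idea outside of Haystack, here is a toy sketch of the "fit lazily before retrieving" pattern. It is not the code in sparse.py or in the PR: `ToyStore` and `ToyTfidfRetriever` are hypothetical stand-ins, the scoring is a simplified TF-IDF sum, and only the exception message is taken from this thread.

```python
import math
from collections import Counter
from typing import Dict, List, Optional


class ToyStore:
    """Hypothetical in-memory store, standing in for InMemoryDocumentStore."""

    def __init__(self) -> None:
        self.docs: List[str] = []

    def write_documents(self, docs: List[str]) -> None:
        self.docs.extend(docs)


class ToyTfidfRetriever:
    """Toy retriever for illustration only, not Haystack's implementation."""

    def __init__(self, store: ToyStore) -> None:
        self.store = store
        self.df: Optional[List[Dict[str, float]]] = None  # per-document term weights
        self.fit()  # eager fit, which finds nothing if the store is still empty

    def fit(self) -> None:
        docs = self.store.docs
        if not docs:
            self.df = None  # nothing to score yet
            return
        tokenized = [doc.lower().split() for doc in docs]
        doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
        n_docs = len(docs)
        self.df = [
            {
                term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
                for term, count in Counter(tokens).items()
            }
            for tokens in tokenized
        ]

    def retrieve(self, query: str, top_k: int = 5) -> List[str]:
        if self.df is None:
            self.fit()  # lazy re-fit: documents may have arrived after __init__
        if self.df is None:
            raise Exception("fit() needs to called before retrieve()")
        terms = query.lower().split()
        scores = [sum(weights.get(t, 0.0) for t in terms) for weights in self.df]
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        return [self.store.docs[i] for i in ranked[:top_k] if scores[i] > 0]
```

With this pattern, a retriever created on an empty store starts working once documents are written, while a store that is still empty at query time raises the same exception as before. Note that the sketch only re-fits while `df` is None, so documents added after a successful fit would still need an explicit `fit()` call.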

ZanSara mentioned this issue Oct 22, 2021

Feature Request: Add index parameter to TFiDF retriever #1634

Closed

ZanSara added the journey:first steps, topic:pipeline, and type:bug labels Oct 22, 2021

ZanSara self-assigned this Oct 22, 2021

julian-risch self-assigned this Oct 27, 2021

julian-risch mentioned this issue Oct 27, 2021

ensure tf-idf matrix calculation before retrieval #1665

Merged

julian-risch closed this as completed in #1665 Oct 28, 2021

julian-risch mentioned this issue Oct 28, 2021

disable file upload for InMemoryDocStore #1677

Merged
