Exception: fit() needs to called before retrieve() #1637

Closed
SaffronWolf opened this issue Oct 22, 2021 · 11 comments · Fixed by #1665
Assignees
Labels
topic:pipeline type:bug Something isn't working

Comments

@SaffronWolf

Describe the bug

version: '0.9'

components:    # define all the building-blocks for Pipeline
  - name: DocumentStore
    type: InMemoryDocumentStore
  - name: Retriever
    type: TfidfRetriever
    params:
      document_store: DocumentStore    # params can reference other components defined in the YAML
      top_k: 5
  - name: Reader       # custom-name for the component; helpful for visualization & debugging
    type: FARMReader    # Haystack Class name for the component
    params:
      model_name_or_path: ahotrod/albert_xxlargev1_squad2_512
  - name: TextFileConverter
    type: TextConverter
  - name: PDFFileConverter
    type: PDFToTextConverter
  - name: Preprocessor
    type: PreProcessor
    params:
      clean_empty_lines: True
      clean_whitespace: True
      clean_header_footer: True
      split_by: "word"
      split_length: 100
      split_respect_sentence_boundary: True
      split_overlap: 3
  - name: FileTypeClassifier
    type: FileTypeClassifier

pipelines:
  - name: query    # a sample extractive-qa Pipeline
    type: Query
    nodes:
      - name: Retriever
        inputs: [Query]
      - name: Reader
        inputs: [Retriever]
  - name: indexing
    type: Indexing
    nodes:
      - name: FileTypeClassifier
        inputs: [File]
      - name: TextFileConverter
        inputs: [FileTypeClassifier.output_1]
      - name: PDFFileConverter
        inputs: [FileTypeClassifier.output_2]
      - name: Preprocessor
        inputs: [PDFFileConverter, TextFileConverter]
      - name: Retriever
        inputs: [Preprocessor]
      - name: DocumentStore
        inputs: [Retriever]

I am getting the exception "fit() needs to called before retrieve()" after running a query through both the Streamlit UI and the REST API.

Error message
Exception: fit() needs to called before retrieve()

To Reproduce
Start RestAPI
gunicorn rest_api.application:app -b 0.0.0.0:8000 -k uvicorn.workers.UvicornWorker -t 300
UI
streamlit run webapp.py

Run the query
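
For readers who want to trigger the failing query without the Streamlit UI, here is a minimal sketch that builds the HTTP request by hand. The `/query` path and the shape of the JSON body are assumptions inferred from the traceback (which shows a `query` string and per-node `params`), not confirmed API documentation; `build_query_request` and `send` are hypothetical helper names. The port comes from the gunicorn command above.

```python
import json
import urllib.request


def build_query_request(question, base_url="http://localhost:8000", top_k=3):
    """Build (but do not send) a POST request for the query endpoint.

    Assumption: the endpoint is `/query` and accepts `query` plus per-node `params`.
    """
    payload = {
        "query": question,
        "params": {"Retriever": {"top_k": top_k}, "Reader": {"top_k": top_k}},
    }
    return urllib.request.Request(
        url=f"{base_url}/query",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=300):
    """Send the request and decode the JSON response; raises HTTPError on a 5xx reply."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())
```

Calling `send(build_query_request("Who is the father of Arya Stark?"))` against a freshly started server with an empty document store should surface the error as a server-side failure rather than a result.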

System:

  • OS: WSL
  • GPU/CPU: CPU
  • Haystack version (commit or version number): 0.10.0
  • DocumentStore: InMemoryDocumentStore
  • Reader: FARMReader
  • Retriever: TfidfRetriever
@ZanSara
Contributor

ZanSara commented Oct 22, 2021

I confirm I can reproduce this, but at first glance I don't see anything wrong in your setup. I'm going to find out what's going on here and let you know 👍

@ZanSara
Contributor

ZanSara commented Oct 22, 2021

Hey @SaffronWolf, could you share the entire logs of your API server, from boot? I'm looking for a line similar to Fit method called with empty document store, or anything else that might help us understand what is going on.

The issue is interesting because it seems that TfidfRetriever needs the documents to be present in the document store as soon as the retriever is created, which is impractical if you're loading a pipeline from YAML with an InMemoryDocumentStore.
For me, what "fixed" the issue was using an ElasticsearchDocumentStore that already contained a few documents. I am going to make the necessary changes so that this pipeline works with InMemoryDocumentStore too.

@SaffronWolf
Author

Hi @ZanSara, thanks for looking into it.

> Have you verified that you have documents in your document store? This error is definitely misleading, but can arise when your document store is simply empty (BTW I will address this in a separate issue).

I am not sure how to add documents to InMemoryDocumentStore. Can you please point me to the documentation for it? I am using the REST API available in the Haystack repository.

> could you share the entire logs of your API server, from boot?

[2021-10-22 17:37:45 +0530] [6870] [INFO] Starting gunicorn 20.1.0
[2021-10-22 17:37:45 +0530] [6870] [INFO] Listening at: http://0.0.0.0:8000 (6870)
[2021-10-22 17:37:45 +0530] [6870] [INFO] Using worker: uvicorn.workers.UvicornWorker
[2021-10-22 17:37:45 +0530] [6889] [INFO] Booting worker with pid: 6889
/mnt/d/Haystack/env/lib/python3.8/site-packages/ray/autoscaler/_private/cli_logger.py:57: FutureWarning: Not all Ray CLI dependencies were found. In Ray 1.4+, the Ray CLI, autoscaler, and dashboard will only be usable via `pip install 'ray[default]'`. Please update your install command.
  warnings.warn(
pdftotext version 4.03 [www.xpdfreader.com]
Copyright 1996-2021 Glyph & Cog, LLC
Fit method called with empty document store
Fit method called with empty document store
Some weights of the model checkpoint at ahotrod/albert_xxlargev1_squad2_512 were not used when initializing AlbertModel: ['qa_outputs.bias', 'qa_outputs.weight']
- This IS expected if you are initializing AlbertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing AlbertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
ML Logging is turned off. No parameters, metrics or artifacts will be logged to MLFlow.
[2021-10-22 17:39:13 +0530] [6889] [INFO] Started server process [6889]
[2021-10-22 17:39:13 +0530] [6889] [INFO] Waiting for application startup.
[2021-10-22 17:39:13 +0530] [6889] [INFO] Application startup complete.
[2021-10-22 17:40:04 +0530] [6889] [ERROR] Exception in ASGI application
Traceback (most recent call last):
  File "/mnt/d/Haystack/haystack/haystack/pipeline.py", line 337, in run
    node_output, stream_id = self.graph.nodes[node_id]["component"]._dispatch_run(**node_input)
  File "/mnt/d/Haystack/haystack/haystack/schema.py", line 735, in _dispatch_run
    output, stream = self.run(**run_inputs, **run_params)
  File "/mnt/d/Haystack/haystack/haystack/retriever/base.py", line 192, in run
    output, stream = run_query_timed(query=query, filters=filters, top_k=top_k, index=index)
  File "/mnt/d/Haystack/haystack/haystack/retriever/base.py", line 43, in wrapper
    ret = fn(*args, **kwargs)
  File "/mnt/d/Haystack/haystack/haystack/retriever/base.py", line 208, in run_query
    documents = self.retrieve(query=query, filters=filters, top_k=top_k, index=index)
  File "/mnt/d/Haystack/haystack/haystack/retriever/sparse.py", line 182, in retrieve
    raise Exception("fit() needs to called before retrieve()")
Exception: fit() needs to called before retrieve()

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/mnt/d/Haystack/env/lib/python3.8/site-packages/uvicorn/protocols/http/httptools_impl.py", line 375, in run_asgi
    result = await app(self.scope, self.receive, self.send)
  File "/mnt/d/Haystack/env/lib/python3.8/site-packages/uvicorn/middleware/proxy_headers.py", line 75, in __call__
    return await self.app(scope, receive, send)
  File "/mnt/d/Haystack/env/lib/python3.8/site-packages/fastapi/applications.py", line 208, in __call__
    await super().__call__(scope, receive, send)
  File "/mnt/d/Haystack/env/lib/python3.8/site-packages/starlette/applications.py", line 112, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/mnt/d/Haystack/env/lib/python3.8/site-packages/starlette/middleware/errors.py", line 181, in __call__
    raise exc
  File "/mnt/d/Haystack/env/lib/python3.8/site-packages/starlette/middleware/errors.py", line 159, in __call__
    await self.app(scope, receive, _send)
  File "/mnt/d/Haystack/env/lib/python3.8/site-packages/starlette/middleware/cors.py", line 84, in __call__
    await self.app(scope, receive, send)
  File "/mnt/d/Haystack/env/lib/python3.8/site-packages/starlette/exceptions.py", line 82, in __call__
    raise exc
  File "/mnt/d/Haystack/env/lib/python3.8/site-packages/starlette/exceptions.py", line 71, in __call__
    await self.app(scope, receive, sender)
  File "/mnt/d/Haystack/env/lib/python3.8/site-packages/starlette/routing.py", line 656, in __call__
    await route.handle(scope, receive, send)
  File "/mnt/d/Haystack/env/lib/python3.8/site-packages/starlette/routing.py", line 259, in handle
    await self.app(scope, receive, send)
  File "/mnt/d/Haystack/env/lib/python3.8/site-packages/starlette/routing.py", line 61, in app
    response = await func(request)
  File "/mnt/d/Haystack/env/lib/python3.8/site-packages/fastapi/routing.py", line 226, in app
    raw_response = await run_endpoint_function(
  File "/mnt/d/Haystack/env/lib/python3.8/site-packages/fastapi/routing.py", line 161, in run_endpoint_function
    return await run_in_threadpool(dependant.call, **values)
  File "/mnt/d/Haystack/env/lib/python3.8/site-packages/starlette/concurrency.py", line 39, in run_in_threadpool
    return await anyio.to_thread.run_sync(func, *args)
  File "/mnt/d/Haystack/env/lib/python3.8/site-packages/anyio/to_thread.py", line 28, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(func, *args, cancellable=cancellable,
  File "/mnt/d/Haystack/env/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 805, in run_sync_in_worker_thread
    return await future
  File "/mnt/d/Haystack/env/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 743, in run
    result = func(*args)
  File "/mnt/d/Haystack/haystack/rest_api/controller/search.py", line 48, in query
    result = _process_request(PIPELINE, request)
  File "/mnt/d/Haystack/haystack/rest_api/controller/search.py", line 66, in _process_request
    result = pipeline.run(query=request.query, params=params)
  File "/mnt/d/Haystack/haystack/haystack/pipeline.py", line 340, in run
    raise Exception(f"Exception while running node `{node_id}` with input `{node_input}`: {e}, full stack trace: {tb}")
Exception: Exception while running node `Retriever` with input `{'root_node': 'Query', 'params': {'filters': None, 'Retriever': {'top_k': 3, 'filters': {}}, 'Reader': {'top_k': 3}}, 'query': 'Who is the father of Arya Stark?', 'node_id': 'Retriever'}`: fit() needs to called before retrieve(), full stack trace: Traceback (most recent call last):
  File "/mnt/d/Haystack/haystack/haystack/pipeline.py", line 337, in run
    node_output, stream_id = self.graph.nodes[node_id]["component"]._dispatch_run(**node_input)
  File "/mnt/d/Haystack/haystack/haystack/schema.py", line 735, in _dispatch_run
    output, stream = self.run(**run_inputs, **run_params)
  File "/mnt/d/Haystack/haystack/haystack/retriever/base.py", line 192, in run
    output, stream = run_query_timed(query=query, filters=filters, top_k=top_k, index=index)
  File "/mnt/d/Haystack/haystack/haystack/retriever/base.py", line 43, in wrapper
    ret = fn(*args, **kwargs)
  File "/mnt/d/Haystack/haystack/haystack/retriever/base.py", line 208, in run_query
    documents = self.retrieve(query=query, filters=filters, top_k=top_k, index=index)
  File "/mnt/d/Haystack/haystack/haystack/retriever/sparse.py", line 182, in retrieve
    raise Exception("fit() needs to called before retrieve()")
Exception: fit() needs to called before retrieve()

@ZanSara
Contributor

ZanSara commented Oct 22, 2021

Unfortunately it won't solve this specific issue, but just FYI, the description of the REST API is here: https://haystack.deepset.ai/guides/rest-api. It explains in detail all available endpoints, including the file upload one.

Thanks for the stacktrace!

@SaffronWolf
Author

Thanks. Would using a different retriever, or Elasticsearch as the DocumentStore, solve the issue?

@ZanSara
Contributor

ZanSara commented Oct 22, 2021

I got it working by swapping InMemoryDocumentStore for ElasticsearchDocumentStore and making sure that some documents were present in the store at startup (for example, you can start the server, upload some docs and then restart it). Any document store that can persist data should work too.
If you want the pipeline to never crash even when the document store is empty, though, it is better to change the retriever. In that case you can keep InMemoryDocumentStore.

@aakar-007

aakar-007 commented Oct 22, 2021

Hi @ZanSara Thanks for your response.
I tried using InMemoryDocumentStore with DensePassageRetriever. It does not throw the same error (we could see the embeddings being created, so we know that the document store is not empty), but the output is blank.

@ZanSara
Contributor

ZanSara commented Oct 22, 2021

Hi @aakar-007, interesting. Could you share a code snippet to reproduce the issue?

ZanSara self-assigned this Oct 22, 2021
@aakar-007

aakar-007 commented Oct 22, 2021

Hi @ZanSara, the code is the same as the one @SaffronWolf has posted. I just swapped TfidfRetriever for DensePassageRetriever.

@ZanSara
Contributor

ZanSara commented Oct 25, 2021

Hello @aakar-007 and @SaffronWolf, I'm having a hard time reproducing your issue here. Could you try using the POST /documents/get_by_filters endpoint to make sure of the actual content of your document store? And if the document store is not empty, can you tell me something more about the files you're using to populate it, or in general about what it contains?
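
A minimal sketch of the check suggested above, using only the standard library. The endpoint path is the one named in this comment; the request body shape (`{"filters": {...}}`) and the assumption that the response is a JSON list of documents are guesses, and `build_get_by_filters_request` and `count_documents` are hypothetical helper names. The default port matches the gunicorn command in the issue description.

```python
import json
import urllib.request


def build_get_by_filters_request(base_url="http://localhost:8000", filters=None):
    """Build a POST request for /documents/get_by_filters.

    Assumption: an empty `filters` dict means "return every document".
    """
    body = json.dumps({"filters": filters or {}}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/documents/get_by_filters",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def count_documents(request, timeout=30):
    """Send the request and count the returned documents (assumes a JSON list)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return len(json.loads(response.read()))
```

If `count_documents(build_get_by_filters_request())` returns 0 on a running server, the retriever has nothing to work with, regardless of which retriever class is configured.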

@julian-risch
Member

I had a look at where the exception is raised:

raise Exception("fit() needs to called before retrieve()")

The problem is that the TfidfRetriever uses a dataframe df to store paragraphs together with the term frequencies and inverse document frequencies, which are calculated in the fit() method from the documents in the document store. This calculation must be done before any retrieval step can be executed. To this end, fit() is called in the __init__() method of the TfidfRetriever here:


However, if there aren't any documents yet, the dataframe df remains empty, no scores are calculated and any retrieval step fails with the reported exception.

A quick fix is to run self.fit() when self.df is None, just before the check that throws the exception:

if self.df is None:

I created a PR for that: #1665
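
To illustrate the quick fix described in this comment, here is a self-contained toy sketch. It is not Haystack's code and not necessarily what the PR does: `ToyTfidfRetriever` is a hypothetical class, a plain Python list stands in for the document store, `self.vectors` plays the role of `self.df`, and the scoring is a simplified TF-IDF sum.

```python
import math
from collections import Counter


class ToyTfidfRetriever:
    """Toy stand-in for a TF-IDF retriever; not Haystack's implementation."""

    def __init__(self, documents):
        # `documents` is a shared, mutable list playing the role of the document store.
        self.documents = documents
        self.vectors = None  # plays the role of `self.df` in the discussion above
        self.fit()  # eager fit at construction, as described in the thread

    def fit(self):
        if not self.documents:
            # Nothing to index yet: stay unfitted.
            self.vectors = None
            return
        tokenized = [doc.lower().split() for doc in self.documents]
        doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
        n_docs = len(tokenized)
        idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
        self.vectors = [
            {t: count * idf[t] for t, count in Counter(tokens).items()}
            for tokens in tokenized
        ]

    def retrieve(self, query, top_k=3):
        if self.vectors is None:
            # Lazy refit: documents may have been added after construction.
            self.fit()
        if self.vectors is None:
            raise Exception("fit() needs to called before retrieve()")
        terms = query.lower().split()
        scores = [sum(vec.get(t, 0.0) for t in terms) for vec in self.vectors]
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        return [self.documents[i] for i in ranked[:top_k] if scores[i] > 0]
```

With this change, a retriever created on an empty store starts working once documents are added, without a restart. Note the limit of the sketch: it only refits while it is still unfitted, so documents added after a successful fit are not picked up until fit() is called again.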
