Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for indexing pipelines #816

Merged
merged 7 commits into from
Feb 16, 2021
Merged

Add support for indexing pipelines #816

merged 7 commits into from
Feb 16, 2021

Conversation

tanaysoni
Copy link
Contributor

@tanaysoni tanaysoni commented Feb 9, 2021

The Pipelines were introduced to help build directed graphs from Haystack components to process a query.

This PR extends the Pipelines for building indexing workflows using components like file converters, preprocessors, retrievers, and document stores.

Proposed changes

  1. In the YAML configuration of Pipeline, a new field type is added. This can be used in the current version to specify Query & Indexing pipelines. In the future, additional types, for example, RayQuery can be added that builds a query pipeline with the Ray framework. This new type fields allows components like the Retriever to be "polymorphic". For instance, in the case of Query, it embeds a query, while, in an Indexing pipeline, it creates embeddings for the given list of documents.

  2. There can be different root nodes based on the pipeline. For example, a Query pipeline has a Query root node, while an Indexing Pipeline has a File root node. For cases where a user might want to use a text blob instead of a file, the REST API module can convert the text blob to a .txt file before passing it on to the indexing pipeline.

  3. The components should allow run-time parameters, e.g., remove_numeric_tables to be also configurable during component initialization. This enables defining the default values as part of the Pipeline configuration that can later be overridden by individual requests.

Sample Indexing Pipeline

version: '0.7'

components:    
  - name: MyDPRRetriever
    type: DensePassageRetriever
    params:
      document_store: MyDocumentStore  
      query_embedding_model: facebook/dpr-question_encoder-single-nq-base
      passage_embedding_model: facebook/dpr-ctx_encoder-single-nq-base
  - name: MyDocumentStore
    type: MilvusDocumentStore
    params:
        similarity: dot_product
  - name: MyPDFToTextConverter
    type: PDFToTextConverter
    params:
      remove_numeric_tables: false
      valid_languages: ["de", "en"]
  - name: MyPreProcessor
    type: PreProcessor
    params:
      clean_whitespace: true

pipelines:
  - name: my_indexing_pipeline
    type: Indexing
    nodes:
      - name: MyPDFToTextConverter
        inputs: [File]
      - name: MyPreProcessor
        inputs: [MyPDFToTextConverter]
      - name: MyDPRRetriever
        inputs: [MyPreProcessor]
      - name: MyDocumentStore
        inputs: [MyDPRRetriever]

@tanaysoni tanaysoni force-pushed the indexing-pipeline branch 4 times, most recently from 521f907 to deceb82 Compare February 10, 2021 13:09
Copy link
Member

@tholor tholor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the design. Differentiation into pipeline types makes sense. Just not sure about the repetition of all instance attributes in the respective methods. This can create quite a mess in the long run...

def clean(self, document: dict) -> dict:
def process(
self,
document: dict,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see why we want the option to overwrite instance attributes here, but it will create quite a mess if we apply this pattern to all potential pipeline nodes (retriever, reader ...). Maybe we can find another, cleaner workaround 🤔

@tanaysoni tanaysoni changed the title WIP: Add support for indexing pipelines Add support for indexing pipelines Feb 16, 2021
Copy link
Member

@tholor tholor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@tanaysoni tanaysoni merged commit 07907f9 into master Feb 16, 2021
@tanaysoni tanaysoni deleted the indexing-pipeline branch February 16, 2021 15:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants