Refactoring of the haystack package (#1624)
* Files moved, imports all broken

* Fix most imports and docstrings

* Fix the paths to the modules in the API docs

* Add latest docstring and tutorial changes

* Add a few pipelines that were lost in the imports

* Fix a bunch of mypy warnings

* Add latest docstring and tutorial changes

* Create a file_classifier module

* Add docs for file_classifier

* Fixed most circular imports, now the REST API can start

* Add latest docstring and tutorial changes

* Tackling more mypy issues

* Reintroduce  from FARM and fix last mypy issues hopefully

* Re-enable old-style imports

* Fix some more imports from the top-level package in an attempt to sort out circular imports

* Fix some imports in tests to new-style to prevent failed class equalities from breaking tests

* Change document_store into document_stores

* Update imports in tutorials

* Add latest docstring and tutorial changes

* Probably fixes summarizer tests

* Improve the old-style imports, allowing module imports (should work)

* Try to fix the docs

* Remove dedicated KnowledgeGraph page from autodocs

* Remove dedicated GraphRetriever page from autodocs

* Fix generate_docstrings.sh with an updated list of yaml files to look for

* Fix some more modules in the docs

* Fix the document stores docs too

* Fix a small issue on Tutorial14

* Add latest docstring and tutorial changes

* Add deprecation warning to old-style imports

* Remove stray folder and import Dict into dense.py

* Change import path for MLFlowLogger

* Add old loggers path to the import path aliases

* Fix debug output of convert_ipynb.py

* Fix circular import on BaseRetriever

* Missed one merge block

* Re-run tutorial 5

* Fix imports in tutorial 5

* Re-enable squad_to_dpr CLI from the root package and move get_batches_from_generator into document_stores.base

* Add latest docstring and tutorial changes

* Fix typo in utils __init__

* Fix a few more imports

* Fix benchmarks too

* New-style imports in test_knowledge_graph

* Rollback setup.py

* Rollback squad_to_dpr too

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
ZanSara and github-actions[bot] committed Oct 25, 2021
1 parent 51acf77 commit 13510aa
Showing 226 changed files with 22,603 additions and 7,677 deletions.
232 changes: 232 additions & 0 deletions docs/_src/api/api/answer_generator.md
@@ -0,0 +1,232 @@
<a name="base"></a>
# Module base

<a name="base.BaseGenerator"></a>
## BaseGenerator Objects

```python
class BaseGenerator(BaseComponent)
```

Abstract class for Generators

<a name="base.BaseGenerator.predict"></a>
#### predict

```python
| @abstractmethod
| predict(query: str, documents: List[Document], top_k: Optional[int]) -> Dict
```

Abstract method to generate answers.

**Arguments**:

- `query`: Query
- `documents`: Related documents (e.g. coming from a retriever) that the answer shall be conditioned on.
- `top_k`: Number of returned answers

**Returns**:

Generated answers plus additional info in a dict
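
To make the contract concrete, here is a minimal sketch of a custom generator implementing `predict()`. It is illustrative only, not part of the original docstrings: `EchoGenerator` is a hypothetical name, the import paths are assumed from this commit's refactored layout, and the pipeline plumbing that `BaseComponent` normally adds is omitted.

```python
from typing import Dict, List, Optional

from haystack.nodes import BaseGenerator  # import path assumed post-refactor
from haystack.schema import Document


class EchoGenerator(BaseGenerator):
    """Toy generator: 'answers' with the first sentence of each retrieved document."""

    def predict(self, query: str, documents: List[Document], top_k: Optional[int] = None) -> Dict:
        top_k = top_k or 1
        answers = []
        for doc in documents[:top_k]:
            # A real generator would condition a language model on the query and
            # documents; here we simply truncate the document text.
            answers.append({"query": query, "answer": doc.content.split(".")[0], "meta": {"doc_ids": [doc.id]}})
        return {"query": query, "answers": answers}
```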

<a name="transformers"></a>
# Module transformers

<a name="transformers.RAGenerator"></a>
## RAGenerator Objects

```python
class RAGenerator(BaseGenerator)
```

Implementation of Facebook's Retrieval-Augmented Generator (https://arxiv.org/abs/2005.11401) based on
HuggingFace's transformers (https://huggingface.co/transformers/model_doc/rag.html).

Instead of "finding" the answer within a document, these models **generate** the answer.
In that sense, RAG follows a similar approach to GPT-3, but it comes with two huge advantages
for real-world applications:
a) it has a manageable model size
b) the answer generation is conditioned on retrieved documents,
i.e. the model can easily adjust to domain documents even after training has finished
(in contrast: GPT-3 relies on the web data seen during training)

**Example**

```python
| query = "who got the first nobel prize in physics?"
|
| # Retrieve related documents from retriever
| retrieved_docs = retriever.retrieve(query=query)
|
| # Now generate answer from query and retrieved documents
| generator.predict(
| query=query,
| documents=retrieved_docs,
| top_k=1
| )
|
| # Answer
|
| {'query': 'who got the first nobel prize in physics',
| 'answers':
| [{'query': 'who got the first nobel prize in physics',
| 'answer': ' albert einstein',
| 'meta': { 'doc_ids': [...],
| 'doc_scores': [80.42758 ...],
| 'doc_probabilities': [40.71379089355469, ...
| 'content': ['Albert Einstein was a ...]
| 'titles': ['"Albert Einstein"', ...]
| }}]}
```

<a name="transformers.RAGenerator.__init__"></a>
#### \_\_init\_\_

```python
| __init__(model_name_or_path: str = "facebook/rag-token-nq", model_version: Optional[str] = None, retriever: Optional[DensePassageRetriever] = None, generator_type: RAGeneratorType = RAGeneratorType.TOKEN, top_k: int = 2, max_length: int = 200, min_length: int = 2, num_beams: int = 2, embed_title: bool = True, prefix: Optional[str] = None, use_gpu: bool = True)
```

Load a RAG model from Transformers along with passage_embedding_model.
See https://huggingface.co/transformers/model_doc/rag.html for more details

**Arguments**:

- `model_name_or_path`: Directory of a saved model or the name of a public model e.g.
'facebook/rag-token-nq', 'facebook/rag-sequence-nq'.
See https://huggingface.co/models for full list of available models.
- `model_version`: The version of model to use from the HuggingFace model hub. Can be tag name, branch name, or commit hash.
- `retriever`: `DensePassageRetriever` used to embedded passages for the docs passed to `predict()`. This is optional and is only needed if the docs you pass don't already contain embeddings in `Document.embedding`.
- `generator_type`: Which RAG generator implementation to use: RAG-TOKEN or RAG-SEQUENCE
- `top_k`: Number of independently generated texts to return
- `max_length`: Maximum length of generated text
- `min_length`: Minimum length of generated text
- `num_beams`: Number of beams for beam search. 1 means no beam search.
- `embed_title`: Whether to embed the title of the passage while generating its embedding
- `prefix`: The prefix used by the generator's tokenizer.
- `use_gpu`: Whether to use GPU (if available)
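
As a usage sketch (not from the original docstring): construction and a full query round trip might look as follows, assuming `document_store` is a placeholder for an already initialized DocumentStore and the refactored `haystack.nodes` import path.

```python
from haystack.nodes import DensePassageRetriever, RAGenerator

# `document_store` is a placeholder for an already initialized DocumentStore.
retriever = DensePassageRetriever(document_store=document_store)
generator = RAGenerator(
    model_name_or_path="facebook/rag-token-nq",
    retriever=retriever,  # embeds passages that lack precomputed embeddings
    top_k=1,
)

query = "who got the first nobel prize in physics?"
docs = retriever.retrieve(query=query)
result = generator.predict(query=query, documents=docs, top_k=1)
```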

<a name="transformers.RAGenerator.predict"></a>
#### predict

```python
| predict(query: str, documents: List[Document], top_k: Optional[int] = None) -> Dict
```

Generate the answer to the input query. The generation will be conditioned on the supplied documents.
These documents can, for example, be retrieved via the Retriever.

**Arguments**:

- `query`: Query
- `documents`: Related documents (e.g. coming from a retriever) that the answer shall be conditioned on.
- `top_k`: Number of returned answers

**Returns**:

Generated answers plus additional info in a dict like this:

```python
| {'query': 'who got the first nobel prize in physics',
| 'answers':
| [{'query': 'who got the first nobel prize in physics',
| 'answer': ' albert einstein',
| 'meta': { 'doc_ids': [...],
| 'doc_scores': [80.42758 ...],
| 'doc_probabilities': [40.71379089355469, ...
| 'content': ['Albert Einstein was a ...]
| 'titles': ['"Albert Einstein"', ...]
| }}]}
```

<a name="transformers.Seq2SeqGenerator"></a>
## Seq2SeqGenerator Objects

```python
class Seq2SeqGenerator(BaseGenerator)
```

A generic sequence-to-sequence generator based on HuggingFace's transformers.

Text generation is supported by so-called auto-regressive language models like GPT2,
XLNet, XLM, Bart, T5 and others. In fact, any HuggingFace language model that extends
GenerationMixin can be used by Seq2SeqGenerator.

Moreover, as language models prepare model input in their specific encoding, each model
specified with the model_name_or_path parameter in this Seq2SeqGenerator should have an
accompanying model input converter that takes care of prefixes, separator tokens, etc.
By default, we provide model input converters for a few well-known seq2seq language models (e.g. ELI5).
It is the responsibility of the Seq2SeqGenerator user to ensure an appropriate model input converter
is either already registered or specified on a per-model basis in the Seq2SeqGenerator constructor.

For more details on custom model input converters, refer to _BartEli5Converter.


See https://huggingface.co/transformers/main_classes/model.html?transformers.generation_utils.GenerationMixin#transformers.generation_utils.GenerationMixin
as well as https://huggingface.co/blog/how-to-generate

For a list of all text-generation models see https://huggingface.co/models?pipeline_tag=text-generation
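
As an illustration (not part of the original docstring), a custom input converter is just a callable with the signature documented below for `input_converter`. This sketch assumes a T5-style "question: ... context: ..." prompt; the actual prefix and separators depend on your model, and `QuestionContextConverter` is a hypothetical name.

```python
from typing import List, Optional

from transformers import BatchEncoding, PreTrainedTokenizer

from haystack.schema import Document


class QuestionContextConverter:
    """Hypothetical converter: joins the query and document contents into a
    single T5-style prompt before tokenization."""

    def __call__(self, tokenizer: PreTrainedTokenizer, query: str,
                 documents: List[Document], top_k: Optional[int] = None) -> BatchEncoding:
        # Concatenate the retrieved document texts into one context string.
        context = " ".join(doc.content for doc in documents)
        return tokenizer([f"question: {query} context: {context}"],
                         truncation=True, max_length=1024, return_tensors="pt")
```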

**Example**

```python
| query = "Why is Dothraki language important?"
|
| # Retrieve related documents from retriever
| retrieved_docs = retriever.retrieve(query=query)
|
| # Now generate answer from query and retrieved documents
| generator.predict(
| query=query,
| documents=retrieved_docs,
| top_k=1
| )
|
| # Answer
|
| {'answers': [" The Dothraki language is a constructed fictional language. It's important because George R.R. Martin wrote it."],
| 'query': 'Why is Dothraki language important?'}
|
```

<a name="transformers.Seq2SeqGenerator.__init__"></a>
#### \_\_init\_\_

```python
| __init__(model_name_or_path: str, input_converter: Optional[Callable] = None, top_k: int = 1, max_length: int = 200, min_length: int = 2, num_beams: int = 8, use_gpu: bool = True)
```

**Arguments**:

- `model_name_or_path`: a HF model name for an auto-regressive language model like GPT2, XLNet, XLM, Bart, T5, etc.
- `input_converter`: an optional Callable to prepare model input for the underlying language model
                     specified in the model_name_or_path parameter. The required __call__ method signature for
                     the Callable is:
                     __call__(tokenizer: PreTrainedTokenizer, query: str, documents: List[Document],
                     top_k: Optional[int] = None) -> BatchEncoding:
- `top_k`: Number of independently generated texts to return
- `max_length`: Maximum length of generated text
- `min_length`: Minimum length of generated text
- `num_beams`: Number of beams for beam search. 1 means no beam search.
- `use_gpu`: Whether to use GPU (if available)
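
A hedged construction sketch: "yjernite/bart_eli5" is assumed here to be one of the ELI5-style models for which a default input converter is already registered; substitute your own model, and pass `input_converter` explicitly, as needed.

```python
from haystack.nodes import Seq2SeqGenerator

# "yjernite/bart_eli5" is assumed to have a default input converter registered;
# for other models, pass `input_converter=...` explicitly.
generator = Seq2SeqGenerator(
    model_name_or_path="yjernite/bart_eli5",
    top_k=1,
    max_length=200,
    min_length=2,
    num_beams=8,
)
```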

<a name="transformers.Seq2SeqGenerator.predict"></a>
#### predict

```python
| predict(query: str, documents: List[Document], top_k: Optional[int] = None) -> Dict
```

Generate the answer to the input query. The generation will be conditioned on the supplied documents.
These documents can be retrieved via the Retriever or supplied directly via the predict method.

**Arguments**:

- `query`: Query
- `documents`: Related documents (e.g. coming from a retriever) that the answer shall be conditioned on.
- `top_k`: Number of returned answers

**Returns**:

Generated answers

106 changes: 32 additions & 74 deletions docs/_src/api/api/crawler.md
@@ -1,95 +1,53 @@
-<a name="crawler"></a>
-# Module crawler
+<a name="entity"></a>
+# Module entity
 
-<a name="crawler.Crawler"></a>
-## Crawler Objects
+<a name="entity.EntityExtractor"></a>
+## EntityExtractor Objects
 
 ```python
-class Crawler(BaseComponent)
+class EntityExtractor(BaseComponent)
 ```
 
-Crawl texts from a website so that we can use them later in Haystack as a corpus for search / question answering etc.
-
-**Example:**
-```python
-| from haystack.connector import Crawler
-|
-| crawler = Crawler(output_dir="crawled_files")
-| # crawl Haystack docs, i.e. all pages that include haystack.deepset.ai/overview/
-| docs = crawler.crawl(urls=["https://haystack.deepset.ai/overview/get-started"],
-|                      filter_urls= ["haystack\.deepset\.ai\/docs\/"])
-```
+This node is used to extract entities out of documents.
+The most common use case for this would be as a named entity extractor.
+The default model used is dslim/bert-base-NER.
+This node can be placed in a querying pipeline to perform entity extraction on retrieved documents only,
+or it can be placed in an indexing pipeline so that all documents in the document store have extracted entities.
+The entities extracted by this Node will populate Document.entities
 
-<a name="crawler.Crawler.__init__"></a>
-#### \_\_init\_\_
+<a name="entity.EntityExtractor.run"></a>
+#### run
 
 ```python
-| __init__(output_dir: str, urls: Optional[List[str]] = None, crawler_depth: int = 1, filter_urls: Optional[List] = None, overwrite_existing_files=True)
+| run(documents: Optional[Union[List[Document], List[dict]]] = None) -> Tuple[Dict, str]
 ```
 
-Init object with basic params for crawling (can be overwritten later).
-
-**Arguments**:
-
-- `output_dir`: Path for the directory to store files
-- `urls`: List of http(s) address(es) (can also be supplied later when calling crawl())
-- `crawler_depth`: How many sublinks to follow from the initial list of URLs. Current options:
-  0: Only initial list of urls
-  1: Follow links found on the initial URLs (but no further)
-- `filter_urls`: Optional list of regular expressions that the crawled URLs must comply with.
-  All URLs not matching at least one of the regular expressions will be dropped.
-- `overwrite_existing_files`: Whether to overwrite existing files in output_dir with new content
+This is the method called when this node is used in a pipeline
 
-<a name="crawler.Crawler.crawl"></a>
-#### crawl
+<a name="entity.EntityExtractor.extract"></a>
+#### extract
 
 ```python
-| crawl(output_dir: Union[str, Path, None] = None, urls: Optional[List[str]] = None, crawler_depth: Optional[int] = None, filter_urls: Optional[List] = None, overwrite_existing_files: Optional[bool] = None) -> List[Path]
+| extract(text)
 ```
 
-Craw URL(s), extract the text from the HTML, create a Haystack Document object out of it and save it (one JSON
-file per URL, including text and basic meta data).
-You can optionally specify via `filter_urls` to only crawl URLs that match a certain pattern.
-All parameters are optional here and only meant to overwrite instance attributes at runtime.
-If no parameters are provided to this method, the instance attributes that were passed during __init__ will be used.
-
-**Arguments**:
-
-- `output_dir`: Path for the directory to store files
-- `urls`: List of http addresses or single http address
-- `crawler_depth`: How many sublinks to follow from the initial list of URLs. Current options:
-  0: Only initial list of urls
-  1: Follow links found on the initial URLs (but no further)
-- `filter_urls`: Optional list of regular expressions that the crawled URLs must comply with.
-  All URLs not matching at least one of the regular expressions will be dropped.
-- `overwrite_existing_files`: Whether to overwrite existing files in output_dir with new content
-
-**Returns**:
-
-List of paths where the crawled webpages got stored
+This function can be called to perform entity extraction when using the node in isolation.
 
-<a name="crawler.Crawler.run"></a>
-#### run
+<a name="entity.simplify_ner_for_qa"></a>
+#### simplify\_ner\_for\_qa
 
 ```python
-| run(output_dir: Union[str, Path, None] = None, urls: Optional[List[str]] = None, crawler_depth: Optional[int] = None, filter_urls: Optional[List] = None, overwrite_existing_files: Optional[bool] = None, return_documents: Optional[bool] = False) -> Tuple[Dict, str]
+simplify_ner_for_qa(output)
 ```
 
-Method to be executed when the Crawler is used as a Node within a Haystack pipeline.
-
-**Arguments**:
-
-- `output_dir`: Path for the directory to store files
-- `urls`: List of http addresses or single http address
-- `crawler_depth`: How many sublinks to follow from the initial list of URLs. Current options:
-  0: Only initial list of urls
-  1: Follow links found on the initial URLs (but no further)
-- `filter_urls`: Optional list of regular expressions that the crawled URLs must comply with.
-  All URLs not matching at least one of the regular expressions will be dropped.
-- `overwrite_existing_files`: Whether to overwrite existing files in output_dir with new content
-- `return_documents`: Return json files content
-
 **Returns**:
 
-Tuple({"paths": List of filepaths, ...}, Name of output edge)
+Returns a simplified version of the output dictionary
+with the following structure:
+[
+    {
+        answer: { ... }
+        entities: [ { ... }, {} ]
+    }
+]
+The entities included are only the ones that overlap with
+the answer itself.