Update Documentation (#976)
* Add api pages

* Add latest docstring and tutorial changes

* First sweep of usage docs

* Add link to conversion script

* Add import statements

* Add summarization page

* Add web crawler documentation

* Add confidence scores usage

* Add crawler api docs

* Regenerate api docs

* Update summarizer and translator api

* Add indentation (pydoc-markdown 3.10.1)

* Comment out metadata

* Remove Finder deprecation message

* Remove Finder in FAQ

* Update tutorial link

* Incorporate reviewer feedback

* Regen api docs

* Add type annotations

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
brandenchan and github-actions[bot] committed Apr 22, 2021
1 parent b1e8ebf commit 9626c0d
Showing 33 changed files with 924 additions and 151 deletions.
95 changes: 95 additions & 0 deletions docs/_src/api/api/crawler.md
@@ -0,0 +1,95 @@
<a name="crawler"></a>
# Module crawler

<a name="crawler.Crawler"></a>
## Crawler Objects

```python
class Crawler(BaseComponent)
```

Crawl texts from a website so that they can later be used in Haystack as a corpus for search, question answering, and similar tasks.

**Example:**
```python
| from haystack.connector import Crawler
|
| crawler = Crawler()
| # crawl Haystack docs, i.e. all pages that include haystack.deepset.ai/docs/
| docs = crawler.crawl(urls=["https://haystack.deepset.ai/docs/latest/get_startedmd"],
| output_dir="crawled_files",
| filter_urls= ["haystack\.deepset\.ai\/docs\/"])
```

<a name="crawler.Crawler.__init__"></a>
#### \_\_init\_\_

```python
| __init__(output_dir: str, urls: Optional[List[str]] = None, crawler_depth: int = 1, filter_urls: Optional[List] = None, overwrite_existing_files=True)
```

Initialize the object with basic parameters for crawling (these can be overridden later).

**Arguments**:

- `output_dir`: Path for the directory to store files
- `urls`: List of http(s) address(es) (can also be supplied later when calling crawl())
- `crawler_depth`: How many sublinks to follow from the initial list of URLs. Current options:
0: Only initial list of urls
1: Follow links found on the initial URLs (but no further)
- `filter_urls`: Optional list of regular expressions that the crawled URLs must comply with.
All URLs not matching at least one of the regular expressions will be dropped.
- `overwrite_existing_files`: Whether to overwrite existing files in output_dir with new content

<a name="crawler.Crawler.crawl"></a>
#### crawl

```python
| crawl(output_dir: Union[str, Path, None] = None, urls: Optional[List[str]] = None, crawler_depth: Optional[int] = None, filter_urls: Optional[List] = None, overwrite_existing_files: Optional[bool] = None) -> List[Path]
```

Crawl URL(s), extract the text from the HTML, create a Haystack Document object out of it, and save it (one JSON
file per URL, including text and basic metadata).
You can optionally specify via `filter_urls` to only crawl URLs that match a certain pattern.
All parameters are optional here and only meant to overwrite instance attributes at runtime.
If no parameters are provided to this method, the instance attributes that were passed during __init__ will be used.

**Arguments**:

- `output_dir`: Path for the directory to store files
- `urls`: List of http addresses or single http address
- `crawler_depth`: How many sublinks to follow from the initial list of URLs. Current options:
0: Only initial list of urls
1: Follow links found on the initial URLs (but no further)
- `filter_urls`: Optional list of regular expressions that the crawled URLs must comply with.
All URLs not matching at least one of the regular expressions will be dropped.
- `overwrite_existing_files`: Whether to overwrite existing files in output_dir with new content

**Returns**:

List of paths where the crawled web pages were stored
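
**Example (sketch):** consuming the return value, assuming each output JSON file contains `text` and `meta` fields as suggested by the description above (the exact schema is an assumption, not documented here):

```python
import json

from haystack.connector import Crawler

crawler = Crawler(output_dir="crawled_files")
paths = crawler.crawl(urls=["https://haystack.deepset.ai/docs/latest/get_startedmd"],
                      filter_urls=[r"haystack\.deepset\.ai\/docs\/"])

for path in paths:
    with open(path) as f:
        page = json.load(f)
    # "text" and "meta" keys are assumed from the description above
    print(page.get("meta"), len(page.get("text", "")))
```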

<a name="crawler.Crawler.run"></a>
#### run

```python
| run(output_dir: Union[str, Path, None] = None, urls: Optional[List[str]] = None, crawler_depth: Optional[int] = None, filter_urls: Optional[List] = None, overwrite_existing_files: Optional[bool] = None, **kwargs) -> Tuple[Dict, str]
```

Method to be executed when the Crawler is used as a Node within a Haystack pipeline.

**Arguments**:

- `output_dir`: Path for the directory to store files
- `urls`: List of http addresses or single http address
- `crawler_depth`: How many sublinks to follow from the initial list of URLs. Current options:
0: Only initial list of urls
1: Follow links found on the initial URLs (but no further)
- `filter_urls`: Optional list of regular expressions that the crawled URLs must comply with.
All URLs not matching at least one of the regular expressions will be dropped.
- `overwrite_existing_files`: Whether to overwrite existing files in output_dir with new content

**Returns**:

Tuple({"paths": List of filepaths, ...}, Name of output edge)
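
**Example (sketch):** calling `run()` directly and unpacking its documented return value:

```python
from haystack.connector import Crawler

crawler = Crawler(output_dir="crawled_files")

# run() returns ({"paths": [...], ...}, name_of_output_edge)
output, edge_name = crawler.run(urls=["https://haystack.deepset.ai/docs/latest/get_startedmd"])
print(output["paths"], edge_name)
```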

3 changes: 1 addition & 2 deletions docs/_src/api/api/document_store.md
@@ -657,7 +657,7 @@ Fetch documents by specifying a list of text vector id strings

- `vector_ids`: List of vector_id strings.
- `index`: Name of the index to get the documents from. If None, the
DocumentStore's default index (self.index) will be used.
DocumentStore's default index (self.index) will be used.
- `batch_size`: When working with large number of documents, batching can help reduce memory footprint.

<a name="sql.SQLDocumentStore.get_all_documents_generator"></a>
@@ -829,7 +829,6 @@ the vector embeddings are indexed in a FAISS Index.
- "HNSW": Graph-based heuristic. If not further specified,
we use the following config:
HNSW64, efConstruction=80 and efSearch=20

- "IVFx,Flat": Inverted Index. Replace x with the number of centroids aka nlist.
Rule of thumb: nlist = 10 * sqrt (num_docs) is a good starting point.
For more details see:
92 changes: 92 additions & 0 deletions docs/_src/api/api/evaluation.md
@@ -0,0 +1,92 @@
<a name="eval"></a>
# Module eval

<a name="eval.EvalRetriever"></a>
## EvalRetriever Objects

```python
class EvalRetriever()
```

This is a pipeline node that should be placed after a Retriever in order to assess its performance. Performance
metrics are stored in this class and updated as each sample passes through it. To view the results of the evaluation,
call EvalRetriever.print(). Note that results from this node may differ from those obtained by calling Retriever.eval(),
since the latter is a closed-domain evaluation. Have a look at our evaluation tutorial for more info about
open vs. closed-domain eval (https://haystack.deepset.ai/docs/latest/tutorial5md).

<a name="eval.EvalRetriever.__init__"></a>
#### \_\_init\_\_

```python
| __init__(debug: bool = False, open_domain: bool = True)
```

**Arguments**:

- `open_domain`: When True, a document is considered correctly retrieved so long as the answer string can be found within it.
When False, correct retrieval is evaluated based on document_id.
- `debug`: When True, a record of each sample and its evaluation will be stored in EvalRetriever.log
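
**Example (sketch):** a minimal construction of the node, assuming the module path `haystack.eval` implied by the build config on this page; what exactly gets stored in `EvalRetriever.log` is not documented here:

```python
from haystack.eval import EvalRetriever

# open_domain=True: a retrieved document counts as correct
# as soon as it contains the answer string
eval_retriever = EvalRetriever(open_domain=True, debug=True)

# with debug=True, a record of each evaluated sample accumulates here
print(eval_retriever.log)
```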

<a name="eval.EvalRetriever.run"></a>
#### run

```python
| run(documents, labels: dict, **kwargs)
```

Run this node on one sample and its labels

<a name="eval.EvalRetriever.print"></a>
#### print

```python
| print()
```

Print the evaluation results

<a name="eval.EvalReader"></a>
## EvalReader Objects

```python
class EvalReader()
```

This is a pipeline node that should be placed after a Reader in order to assess the performance of the Reader
individually or to assess the extractive QA performance of the whole pipeline. Performance metrics are stored in
this class and updated as each sample passes through it. To view the results of the evaluation, call EvalReader.print().
Note that results from this node may differ from those obtained by calling Reader.eval(),
since the latter is a closed-domain evaluation. Have a look at our evaluation tutorial for more info about
open vs. closed-domain eval (https://haystack.deepset.ai/docs/latest/tutorial5md).

<a name="eval.EvalReader.__init__"></a>
#### \_\_init\_\_

```python
| __init__(skip_incorrect_retrieval: bool = True, open_domain: bool = True, debug: bool = False)
```

**Arguments**:

- `skip_incorrect_retrieval`: When set to True, this eval will ignore the cases where the retriever returned no correct documents
- `open_domain`: When True, extracted answers are evaluated purely on string similarity rather than the position of the extracted answer
- `debug`: When True, a record of each sample and its evaluation will be stored in EvalReader.log

<a name="eval.EvalReader.run"></a>
#### run

```python
| run(labels, answers, **kwargs)
```

Run this node on one sample and its labels

<a name="eval.EvalReader.print"></a>
#### print

```python
| print(mode)
```

Print the evaluation results
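
**Example (sketch):** wiring both eval nodes into a pipeline, following the pattern of the evaluation tutorial linked above. The `retriever`, `reader`, and `labeled_samples` names are placeholders, and the exact `Pipeline.run()` keyword arguments as well as the `mode` value are assumptions:

```python
from haystack.eval import EvalReader, EvalRetriever
from haystack.pipeline import Pipeline

eval_retriever = EvalRetriever(open_domain=True)
eval_reader = EvalReader(skip_incorrect_retrieval=True, open_domain=True)

# retriever and reader are assumed to be initialized elsewhere
p = Pipeline()
p.add_node(component=retriever, name="Retriever", inputs=["Query"])
p.add_node(component=eval_retriever, name="EvalRetriever", inputs=["Retriever"])
p.add_node(component=reader, name="Reader", inputs=["EvalRetriever"])
p.add_node(component=eval_reader, name="EvalReader", inputs=["Reader"])

# metrics accumulate inside the eval nodes as samples pass through
for question, labels in labeled_samples:  # placeholder iterable of (str, dict)
    p.run(query=question, labels=labels)

eval_retriever.print()
eval_reader.print(mode="reader")  # mode value is an assumption
```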

9 changes: 8 additions & 1 deletion docs/_src/api/api/generate_docstrings.sh
@@ -5,7 +5,14 @@
pydoc-markdown pydoc-markdown-document-store.yml
pydoc-markdown pydoc-markdown-file-converters.yml
pydoc-markdown pydoc-markdown-preprocessor.yml
pydoc-markdown pydoc-markdown-crawler.yml
pydoc-markdown pydoc-markdown-reader.yml
pydoc-markdown pydoc-markdown-generator.yml
pydoc-markdown pydoc-markdown-retriever.yml
pydoc-markdown pydoc-markdown-pipelines.yml
pydoc-markdown pydoc-markdown-summarizer.yml
pydoc-markdown pydoc-markdown-translator.yml
pydoc-markdown pydoc-markdown-knowledge-graph.yml
pydoc-markdown pydoc-markdown-graph-retriever.yml
pydoc-markdown pydoc-markdown-evaluation.yml

25 changes: 25 additions & 0 deletions docs/_src/api/api/graph_retriever.md
@@ -0,0 +1,25 @@
<a name="base"></a>
# Module base

<a name="text_to_sparql"></a>
# Module text\_to\_sparql

<a name="text_to_sparql.Text2SparqlRetriever"></a>
## Text2SparqlRetriever Objects

```python
class Text2SparqlRetriever(BaseGraphRetriever)
```

Graph retriever that uses a pre-trained BART model to translate natural language questions into SPARQL queries.
The generated SPARQL query is executed on a knowledge graph.

<a name="text_to_sparql.Text2SparqlRetriever.format_result"></a>
#### format\_result

```python
| format_result(result)
```

Generate formatted dictionary output with text answer and additional info
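
**Example (sketch):** only `format_result()` is documented on this page, so the constructor arguments and the retrieval call below are assumptions:

```python
from haystack.graph_retriever import Text2SparqlRetriever
from haystack.knowledge_graph import GraphDBKnowledgeGraph

# argument names are illustrative assumptions
kg = GraphDBKnowledgeGraph()
retriever = Text2SparqlRetriever(knowledge_graph=kg,
                                 model_name_or_path="my-text2sparql-bart")  # placeholder model

raw = retriever.retrieve(query="Who directed the film?")  # retrieve() signature is an assumption
print(retriever.format_result(raw))
```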

15 changes: 15 additions & 0 deletions docs/_src/api/api/knowledge_graph.md
@@ -0,0 +1,15 @@
<a name="base"></a>
# Module base

<a name="graphdb"></a>
# Module graphdb

<a name="graphdb.GraphDBKnowledgeGraph"></a>
## GraphDBKnowledgeGraph Objects

```python
class GraphDBKnowledgeGraph(BaseKnowledgeGraph)
```

Knowledge graph store that runs on a GraphDB instance
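
**Example (sketch):** connecting to a locally running GraphDB instance; the constructor parameters are assumptions (7200 is GraphDB's default port), since this page only documents the class itself:

```python
from haystack.knowledge_graph import GraphDBKnowledgeGraph

# host/port/index parameters are illustrative assumptions
kg = GraphDBKnowledgeGraph(host="localhost", port=7200, index="my_knowledge_graph")
```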

18 changes: 18 additions & 0 deletions docs/_src/api/api/pydoc-markdown-crawler.yml
@@ -0,0 +1,18 @@
loaders:
  - type: python
    search_path: [../../../../haystack/connector]
    modules: ['crawler']
    ignore_when_discovered: ['__init__']
processor:
  - type: filter
    expression: not name.startswith('_') and default()
  - documented_only: true
  - do_not_filter_modules: false
  - skip_empty_modules: true
renderer:
  type: markdown
  descriptive_class_title: true
  descriptive_module_title: true
  add_method_class_prefix: false
  add_member_class_prefix: false
  filename: crawler.md
18 changes: 18 additions & 0 deletions docs/_src/api/api/pydoc-markdown-evaluation.yml
@@ -0,0 +1,18 @@
loaders:
  - type: python
    search_path: [../../../../haystack]
    modules: ['eval']
    ignore_when_discovered: ['__init__']
processor:
  - type: filter
    expression: not name.startswith('_') and default()
  - documented_only: true
  - do_not_filter_modules: false
  - skip_empty_modules: true
renderer:
  type: markdown
  descriptive_class_title: true
  descriptive_module_title: true
  add_method_class_prefix: false
  add_member_class_prefix: false
  filename: evaluation.md
18 changes: 18 additions & 0 deletions docs/_src/api/api/pydoc-markdown-graph-retriever.yml
@@ -0,0 +1,18 @@
loaders:
  - type: python
    search_path: [../../../../haystack/graph_retriever]
    modules: ['base', 'text_to_sparql']
    ignore_when_discovered: ['__init__']
processor:
  - type: filter
    expression: not name.startswith('_') and default()
  - documented_only: true
  - do_not_filter_modules: false
  - skip_empty_modules: true
renderer:
  type: markdown
  descriptive_class_title: true
  descriptive_module_title: true
  add_method_class_prefix: false
  add_member_class_prefix: false
  filename: graph_retriever.md
18 changes: 18 additions & 0 deletions docs/_src/api/api/pydoc-markdown-knowledge-graph.yml
@@ -0,0 +1,18 @@
loaders:
  - type: python
    search_path: [../../../../haystack/knowledge_graph]
    modules: ['base', 'graphdb']
    ignore_when_discovered: ['__init__']
processor:
  - type: filter
    expression: not name.startswith('_') and default()
  - documented_only: true
  - do_not_filter_modules: false
  - skip_empty_modules: true
renderer:
  type: markdown
  descriptive_class_title: true
  descriptive_module_title: true
  add_method_class_prefix: false
  add_member_class_prefix: false
  filename: knowledge_graph.md
18 changes: 18 additions & 0 deletions docs/_src/api/api/pydoc-markdown-summarizer.yml
@@ -0,0 +1,18 @@
loaders:
  - type: python
    search_path: [../../../../haystack/summarizer]
    modules: ['base', 'transformers']
    ignore_when_discovered: ['__init__']
processor:
  - type: filter
    expression: not name.startswith('_') and default()
  - documented_only: true
  - do_not_filter_modules: false
  - skip_empty_modules: true
renderer:
  type: markdown
  descriptive_class_title: true
  descriptive_module_title: true
  add_method_class_prefix: false
  add_member_class_prefix: false
  filename: summarizer.md
18 changes: 18 additions & 0 deletions docs/_src/api/api/pydoc-markdown-translator.yml
@@ -0,0 +1,18 @@
loaders:
  - type: python
    search_path: [../../../../haystack/translator]
    modules: ['base', 'transformers']
    ignore_when_discovered: ['__init__']
processor:
  - type: filter
    expression: not name.startswith('_') and default()
  - documented_only: true
  - do_not_filter_modules: false
  - skip_empty_modules: true
renderer:
  type: markdown
  descriptive_class_title: true
  descriptive_module_title: true
  add_method_class_prefix: false
  add_member_class_prefix: false
  filename: translator.md