Update Documentation (#976)
* Add api pages

* Add latest docstring and tutorial changes

* First sweep of usage docs

* Add link to conversion script

* Add import statements

* Add summarization page

* Add web crawler documentation

* Add confidence scores usage

* Add crawler api docs

* Regenerate api docs

* Update summarizer and translator api

* Add indentation (pydoc-markdown 3.10.1)

* Comment out metadata

* Remove Finder deprecation message

* Remove Finder in FAQ

* Update tutorial link

* Incorporate reviewer feedback

* Regen api docs

* Add type annotations

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
brandenchan and github-actions[bot] committed Apr 22, 2021
1 parent b1e8ebf commit 9626c0d
Showing 33 changed files with 924 additions and 151 deletions.
95 changes: 95 additions & 0 deletions docs/_src/api/api/crawler.md
@@ -0,0 +1,95 @@
<a name="crawler"></a>
# Module crawler

<a name="crawler.Crawler"></a>
## Crawler Objects

```python
class Crawler(BaseComponent)
```

Crawl texts from a website so that they can later be used in Haystack as a corpus for search, question answering, and similar tasks.

**Example:**
```python
| from haystack.connector import Crawler
|
| crawler = Crawler()
| # crawl Haystack docs, i.e. all pages that include haystack.deepset.ai/docs/
| docs = crawler.crawl(urls=["https://haystack.deepset.ai/docs/latest/get_startedmd"],
| output_dir="crawled_files",
| filter_urls= ["haystack\.deepset\.ai\/docs\/"])
```

<a name="crawler.Crawler.__init__"></a>
#### \_\_init\_\_

```python
| __init__(output_dir: str, urls: Optional[List[str]] = None, crawler_depth: int = 1, filter_urls: Optional[List] = None, overwrite_existing_files=True)
```

Initialize the object with basic parameters for crawling (these can be overridden later).

**Arguments**:

- `output_dir`: Path for the directory to store files
- `urls`: List of http(s) address(es) (can also be supplied later when calling crawl())
- `crawler_depth`: How many sublinks to follow from the initial list of URLs. Current options:
0: Only initial list of urls
1: Follow links found on the initial URLs (but no further)
- `filter_urls`: Optional list of regular expressions that the crawled URLs must comply with.
All URLs not matching at least one of the regular expressions will be dropped.
- `overwrite_existing_files`: Whether to overwrite existing files in output_dir with new content

<a name="crawler.Crawler.crawl"></a>
#### crawl

```python
| crawl(output_dir: Union[str, Path, None] = None, urls: Optional[List[str]] = None, crawler_depth: Optional[int] = None, filter_urls: Optional[List] = None, overwrite_existing_files: Optional[bool] = None) -> List[Path]
```

Crawl URL(s), extract the text from the HTML, create a Haystack Document object out of it, and save it (one JSON
file per URL, including text and basic metadata).
You can optionally specify via `filter_urls` to only crawl URLs that match a certain pattern.
All parameters are optional here and only meant to overwrite instance attributes at runtime.
If no parameters are provided to this method, the instance attributes that were passed during __init__ will be used.

**Arguments**:

- `output_dir`: Path for the directory to store files
- `urls`: List of http addresses or single http address
- `crawler_depth`: How many sublinks to follow from the initial list of URLs. Current options:
0: Only initial list of urls
1: Follow links found on the initial URLs (but no further)
- `filter_urls`: Optional list of regular expressions that the crawled URLs must comply with.
All URLs not matching at least one of the regular expressions will be dropped.
- `overwrite_existing_files`: Whether to overwrite existing files in output_dir with new content

**Returns**:

List of paths where the crawled web pages were stored
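
**Example (sketch):** consuming the return value, assuming each output JSON file contains `text` and `meta` fields as suggested by the description above (the exact schema is an assumption, not documented here):

```python
import json

from haystack.connector import Crawler

crawler = Crawler(output_dir="crawled_files")
paths = crawler.crawl(urls=["https://haystack.deepset.ai/docs/latest/get_startedmd"],
                      filter_urls=[r"haystack\.deepset\.ai\/docs\/"])

for path in paths:
    with open(path) as f:
        page = json.load(f)
    # "text" and "meta" keys are assumed from the description above
    print(page.get("meta"), len(page.get("text", "")))
```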

<a name="crawler.Crawler.run"></a>
#### run

```python
| run(output_dir: Union[str, Path, None] = None, urls: Optional[List[str]] = None, crawler_depth: Optional[int] = None, filter_urls: Optional[List] = None, overwrite_existing_files: Optional[bool] = None, **kwargs) -> Tuple[Dict, str]
```

Method to be executed when the Crawler is used as a Node within a Haystack pipeline.

**Arguments**:

- `output_dir`: Path for the directory to store files
- `urls`: List of http addresses or single http address
- `crawler_depth`: How many sublinks to follow from the initial list of URLs. Current options:
0: Only initial list of urls
1: Follow links found on the initial URLs (but no further)
- `filter_urls`: Optional list of regular expressions that the crawled URLs must comply with.
All URLs not matching at least one of the regular expressions will be dropped.
- `overwrite_existing_files`: Whether to overwrite existing files in output_dir with new content

**Returns**:

Tuple({"paths": List of filepaths, ...}, Name of output edge)
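
**Example (sketch):** calling `run()` directly and unpacking its documented return value:

```python
from haystack.connector import Crawler

crawler = Crawler(output_dir="crawled_files")

# run() returns ({"paths": [...], ...}, name_of_output_edge)
output, edge_name = crawler.run(urls=["https://haystack.deepset.ai/docs/latest/get_startedmd"])
print(output["paths"], edge_name)
```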

3 changes: 1 addition & 2 deletions docs/_src/api/api/document_store.md
@@ -657,7 +657,7 @@ Fetch documents by specifying a list of text vector id strings

- `vector_ids`: List of vector_id strings.
- `index`: Name of the index to get the documents from. If None, the
DocumentStore's default index (self.index) will be used.
DocumentStore's default index (self.index) will be used.
- `batch_size`: When working with large number of documents, batching can help reduce memory footprint.

<a name="sql.SQLDocumentStore.get_all_documents_generator"></a>
@@ -829,7 +829,6 @@ the vector embeddings are indexed in a FAISS Index.
- "HNSW": Graph-based heuristic. If not further specified,
we use the following config:
HNSW64, efConstruction=80 and efSearch=20

- "IVFx,Flat": Inverted Index. Replace x with the number of centroids aka nlist.
Rule of thumb: nlist = 10 * sqrt (num_docs) is a good starting point.
For more details see:
92 changes: 92 additions & 0 deletions docs/_src/api/api/evaluation.md
@@ -0,0 +1,92 @@
<a name="eval"></a>
# Module eval

<a name="eval.EvalRetriever"></a>
## EvalRetriever Objects

```python
class EvalRetriever()
```

This is a pipeline node that should be placed after a Retriever in order to assess its performance. Performance
metrics are stored in this class and updated as each sample passes through it. To view the results of the evaluation,
call EvalRetriever.print(). Note that results from this node may differ from those obtained by calling Retriever.eval(),
since the latter is a closed-domain evaluation. Have a look at our evaluation tutorial for more info about
open vs. closed-domain eval (https://haystack.deepset.ai/docs/latest/tutorial5md).

<a name="eval.EvalRetriever.__init__"></a>
#### \_\_init\_\_

```python
| __init__(debug: bool = False, open_domain: bool = True)
```

**Arguments**:

- `open_domain`: When True, a document is considered correctly retrieved so long as the answer string can be found within it.
When False, correct retrieval is evaluated based on document_id.
- `debug`: When True, a record of each sample and its evaluation will be stored in EvalRetriever.log
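
**Example (sketch):** a minimal construction of the node, assuming the module path `haystack.eval` implied by the build config on this page; what exactly gets stored in `EvalRetriever.log` is not documented here:

```python
from haystack.eval import EvalRetriever

# open_domain=True: a retrieved document counts as correct
# as soon as it contains the answer string
eval_retriever = EvalRetriever(open_domain=True, debug=True)

# with debug=True, a record of each evaluated sample accumulates here
print(eval_retriever.log)
```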

<a name="eval.EvalRetriever.run"></a>
#### run

```python
| run(documents, labels: dict, **kwargs)
```

Run this node on one sample and its labels

<a name="eval.EvalRetriever.print"></a>
#### print

```python
| print()
```

Print the evaluation results

<a name="eval.EvalReader"></a>
## EvalReader Objects

```python
class EvalReader()
```

This is a pipeline node that should be placed after a Reader in order to assess the performance of the Reader
individually or to assess the extractive QA performance of the whole pipeline. Performance metrics are stored in
this class and updated as each sample passes through it. To view the results of the evaluation, call EvalReader.print().
Note that results from this node may differ from those obtained by calling Reader.eval(),
since the latter is a closed-domain evaluation. Have a look at our evaluation tutorial for more info about
open vs. closed-domain eval (https://haystack.deepset.ai/docs/latest/tutorial5md).

<a name="eval.EvalReader.__init__"></a>
#### \_\_init\_\_

```python
| __init__(skip_incorrect_retrieval: bool = True, open_domain: bool = True, debug: bool = False)
```

**Arguments**:

- `skip_incorrect_retrieval`: When set to True, this eval will ignore the cases where the retriever returned no correct documents
- `open_domain`: When True, extracted answers are evaluated purely on string similarity rather than the position of the extracted answer
- `debug`: When True, a record of each sample and its evaluation will be stored in EvalReader.log

<a name="eval.EvalReader.run"></a>
#### run

```python
| run(labels, answers, **kwargs)
```

Run this node on one sample and its labels

<a name="eval.EvalReader.print"></a>
#### print

```python
| print(mode)
```

Print the evaluation results
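
**Example (sketch):** wiring both eval nodes into a pipeline, following the pattern of the evaluation tutorial linked above. The `retriever`, `reader`, and `labeled_samples` names are placeholders, and the exact `Pipeline.run()` keyword arguments as well as the `mode` value are assumptions:

```python
from haystack.eval import EvalReader, EvalRetriever
from haystack.pipeline import Pipeline

eval_retriever = EvalRetriever(open_domain=True)
eval_reader = EvalReader(skip_incorrect_retrieval=True, open_domain=True)

# retriever and reader are assumed to be initialized elsewhere
p = Pipeline()
p.add_node(component=retriever, name="Retriever", inputs=["Query"])
p.add_node(component=eval_retriever, name="EvalRetriever", inputs=["Retriever"])
p.add_node(component=reader, name="Reader", inputs=["EvalRetriever"])
p.add_node(component=eval_reader, name="EvalReader", inputs=["Reader"])

# metrics accumulate inside the eval nodes as samples pass through
for question, labels in labeled_samples:  # placeholder iterable of (str, dict)
    p.run(query=question, labels=labels)

eval_retriever.print()
eval_reader.print(mode="reader")  # mode value is an assumption
```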

9 changes: 8 additions & 1 deletion docs/_src/api/api/generate_docstrings.sh
@@ -5,7 +5,14 @@
pydoc-markdown pydoc-markdown-document-store.yml
pydoc-markdown pydoc-markdown-file-converters.yml
pydoc-markdown pydoc-markdown-preprocessor.yml
pydoc-markdown pydoc-markdown-crawler.yml
pydoc-markdown pydoc-markdown-reader.yml
pydoc-markdown pydoc-markdown-generator.yml
pydoc-markdown pydoc-markdown-retriever.yml
pydoc-markdown pydoc-markdown-pipelines.yml
pydoc-markdown pydoc-markdown-summarizer.yml
pydoc-markdown pydoc-markdown-translator.yml
pydoc-markdown pydoc-markdown-knowledge-graph.yml
pydoc-markdown pydoc-markdown-graph-retriever.yml
pydoc-markdown pydoc-markdown-evaluation.yml

25 changes: 25 additions & 0 deletions docs/_src/api/api/graph_retriever.md
@@ -0,0 +1,25 @@
<a name="base"></a>
# Module base

<a name="text_to_sparql"></a>
# Module text\_to\_sparql

<a name="text_to_sparql.Text2SparqlRetriever"></a>
## Text2SparqlRetriever Objects

```python
class Text2SparqlRetriever(BaseGraphRetriever)
```

Graph retriever that uses a pre-trained BART model to translate natural language questions into SPARQL queries.
The generated SPARQL query is executed on a knowledge graph.

<a name="text_to_sparql.Text2SparqlRetriever.format_result"></a>
#### format\_result

```python
| format_result(result)
```

Generate formatted dictionary output with text answer and additional info
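
**Example (sketch):** only `format_result()` is documented on this page, so the constructor arguments and the retrieval call below are assumptions:

```python
from haystack.graph_retriever import Text2SparqlRetriever
from haystack.knowledge_graph import GraphDBKnowledgeGraph

# argument names are illustrative assumptions
kg = GraphDBKnowledgeGraph()
retriever = Text2SparqlRetriever(knowledge_graph=kg,
                                 model_name_or_path="my-text2sparql-bart")  # placeholder model

raw = retriever.retrieve(query="Who directed the film?")  # retrieve() signature is an assumption
print(retriever.format_result(raw))
```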

15 changes: 15 additions & 0 deletions docs/_src/api/api/knowledge_graph.md
@@ -0,0 +1,15 @@
<a name="base"></a>
# Module base

<a name="graphdb"></a>
# Module graphdb

<a name="graphdb.GraphDBKnowledgeGraph"></a>
## GraphDBKnowledgeGraph Objects

```python
class GraphDBKnowledgeGraph(BaseKnowledgeGraph)
```

Knowledge graph store that runs on a GraphDB instance
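
**Example (sketch):** connecting to a locally running GraphDB instance; the constructor parameters are assumptions (7200 is GraphDB's default port), since this page only documents the class itself:

```python
from haystack.knowledge_graph import GraphDBKnowledgeGraph

# host/port/index parameters are illustrative assumptions
kg = GraphDBKnowledgeGraph(host="localhost", port=7200, index="my_knowledge_graph")
```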

18 changes: 18 additions & 0 deletions docs/_src/api/api/pydoc-markdown-crawler.yml
@@ -0,0 +1,18 @@
loaders:
  - type: python
    search_path: [../../../../haystack/connector]
    modules: ['crawler']
    ignore_when_discovered: ['__init__']
processor:
  - type: filter
    expression: not name.startswith('_') and default()
  - documented_only: true
  - do_not_filter_modules: false
  - skip_empty_modules: true
renderer:
  type: markdown
  descriptive_class_title: true
  descriptive_module_title: true
  add_method_class_prefix: false
  add_member_class_prefix: false
  filename: crawler.md
18 changes: 18 additions & 0 deletions docs/_src/api/api/pydoc-markdown-evaluation.yml
@@ -0,0 +1,18 @@
loaders:
  - type: python
    search_path: [../../../../haystack]
    modules: ['eval']
    ignore_when_discovered: ['__init__']
processor:
  - type: filter
    expression: not name.startswith('_') and default()
  - documented_only: true
  - do_not_filter_modules: false
  - skip_empty_modules: true
renderer:
  type: markdown
  descriptive_class_title: true
  descriptive_module_title: true
  add_method_class_prefix: false
  add_member_class_prefix: false
  filename: evaluation.md
18 changes: 18 additions & 0 deletions docs/_src/api/api/pydoc-markdown-graph-retriever.yml
@@ -0,0 +1,18 @@
loaders:
  - type: python
    search_path: [../../../../haystack/graph_retriever]
    modules: ['base', 'text_to_sparql']
    ignore_when_discovered: ['__init__']
processor:
  - type: filter
    expression: not name.startswith('_') and default()
  - documented_only: true
  - do_not_filter_modules: false
  - skip_empty_modules: true
renderer:
  type: markdown
  descriptive_class_title: true
  descriptive_module_title: true
  add_method_class_prefix: false
  add_member_class_prefix: false
  filename: graph_retriever.md
18 changes: 18 additions & 0 deletions docs/_src/api/api/pydoc-markdown-knowledge-graph.yml
@@ -0,0 +1,18 @@
loaders:
  - type: python
    search_path: [../../../../haystack/knowledge_graph]
    modules: ['base', 'graphdb']
    ignore_when_discovered: ['__init__']
processor:
  - type: filter
    expression: not name.startswith('_') and default()
  - documented_only: true
  - do_not_filter_modules: false
  - skip_empty_modules: true
renderer:
  type: markdown
  descriptive_class_title: true
  descriptive_module_title: true
  add_method_class_prefix: false
  add_member_class_prefix: false
  filename: knowledge_graph.md
18 changes: 18 additions & 0 deletions docs/_src/api/api/pydoc-markdown-summarizer.yml
@@ -0,0 +1,18 @@
loaders:
  - type: python
    search_path: [../../../../haystack/summarizer]
    modules: ['base', 'transformers']
    ignore_when_discovered: ['__init__']
processor:
  - type: filter
    expression: not name.startswith('_') and default()
  - documented_only: true
  - do_not_filter_modules: false
  - skip_empty_modules: true
renderer:
  type: markdown
  descriptive_class_title: true
  descriptive_module_title: true
  add_method_class_prefix: false
  add_member_class_prefix: false
  filename: summarizer.md
18 changes: 18 additions & 0 deletions docs/_src/api/api/pydoc-markdown-translator.yml
@@ -0,0 +1,18 @@
loaders:
  - type: python
    search_path: [../../../../haystack/translator]
    modules: ['base', 'transformers']
    ignore_when_discovered: ['__init__']
processor:
  - type: filter
    expression: not name.startswith('_') and default()
  - documented_only: true
  - do_not_filter_modules: false
  - skip_empty_modules: true
renderer:
  type: markdown
  descriptive_class_title: true
  descriptive_module_title: true
  add_method_class_prefix: false
  add_member_class_prefix: false
  filename: translator.md