Upgrade pydoc-markdown #2117

Merged: 60 commits merged into master from upgrade-pydoc-markdown on Feb 4, 2022
Commits (60)
The file changes shown further below are from 2 of these commits.
fea2198
Upgrade pydoc-markdown and fix the YAMLs to work with it
ZanSara Feb 2, 2022
935ef42
Add latest docstring and tutorial changes
github-actions[bot] Feb 2, 2022
efcf56a
Pin pydoc-markdown to major version
ZanSara Feb 3, 2022
5f9bbb9
Add quotes
ZanSara Feb 3, 2022
3316ffa
Restore proper arguments rendering
ZanSara Feb 3, 2022
f922cb5
Add latest docstring and tutorial changes
github-actions[bot] Feb 3, 2022
c97148a
Reintroduce crossref too
ZanSara Feb 3, 2022
bcb39e4
Merge branch 'upgrade-pydoc-markdown' of github.com:deepset-ai/haysta…
ZanSara Feb 3, 2022
8db6e46
Generalize pydoc-markdown workflow
ZanSara Feb 3, 2022
bb7247f
Fix some wrongly formatted return docstrings
ZanSara Feb 3, 2022
308d7ae
Fix some wrongly formatted parameter names
ZanSara Feb 3, 2022
cfaaf04
Add latest docstring and tutorial changes
github-actions[bot] Feb 3, 2022
687b855
Trigger a new workflow
ZanSara Feb 3, 2022
2946e09
Remane workflow
ZanSara Feb 3, 2022
d3373d2
Merge branch 'upgrade-pydoc-markdown' of github.com:deepset-ai/haysta…
ZanSara Feb 3, 2022
f5dbbc0
Merge branch 'master' into upgrade-pydoc-markdown
ZanSara Feb 3, 2022
149ed30
Make a single Action to perform all tasks that require committing int…
ZanSara Feb 3, 2022
8489b00
Add one file to recreate docs/_scr/api/api
ZanSara Feb 3, 2022
302cb40
Update Documentation
github-actions[bot] Feb 3, 2022
b8d2740
Merge the code updates and the docs in the Linux CI to prevent the bo…
ZanSara Feb 3, 2022
7bfcce3
Merge branch 'upgrade-pydoc-markdown' of github.com:deepset-ai/haysta…
ZanSara Feb 3, 2022
4ab8516
Installing Jupyter deps for Black
ZanSara Feb 3, 2022
1e61ea9
Add some comments to understand CI failure
ZanSara Feb 3, 2022
cb5d282
Try disabling fetch-depth
ZanSara Feb 3, 2022
7bcc925
Remove redundant trigger
ZanSara Feb 3, 2022
495825a
-> Update Documentation
github-actions[bot] Feb 3, 2022
4046e45
Build cache before running generation tasks
ZanSara Feb 4, 2022
60f96a4
Merge branch 'upgrade-pydoc-markdown' of github.com:deepset-ai/haysta…
ZanSara Feb 4, 2022
52f3ed5
Merge branch 'master' into upgrade-pydoc-markdown
ZanSara Feb 4, 2022
4c8e0d5
Fix dependency among tasks
ZanSara Feb 4, 2022
85776ea
Update Documentation
github-actions[bot] Feb 4, 2022
11c7b92
Final tweaks to Linux CI
ZanSara Feb 4, 2022
d9e2121
Merge branch 'upgrade-pydoc-markdown' of github.com:deepset-ai/haysta…
ZanSara Feb 4, 2022
bbc555e
Change trigger for Linux CI
ZanSara Feb 4, 2022
0dd863b
Change trigger for Linux CI
ZanSara Feb 4, 2022
3e5280b
Testing trigger
ZanSara Feb 4, 2022
2cfaa27
Change trigger for Linux CI
ZanSara Feb 4, 2022
f13c3e0
Add check not to run the code generation on master
ZanSara Feb 4, 2022
84ac0b6
Fix typo in Linux CI
ZanSara Feb 4, 2022
9ec55f7
Simplify push action
ZanSara Feb 4, 2022
c13e503
Make cache stick a bit longer and try pushing to head_ref
ZanSara Feb 4, 2022
5bdaefb
Typo in cache key
ZanSara Feb 4, 2022
adaa52e
Move commit & push together
ZanSara Feb 4, 2022
6acae9e
Simplify push command
ZanSara Feb 4, 2022
d751869
Persist credentials
ZanSara Feb 4, 2022
ef54e0e
Add more test deps in setup.cfg and remove from GH Action workflow
ZanSara Feb 4, 2022
3587d28
Trying to set the ref explicitly
ZanSara Feb 4, 2022
88021c9
Set an explicit requirement that was not properly enforced between ty…
ZanSara Feb 4, 2022
5d65f06
Add comment to setup.cfg
ZanSara Feb 4, 2022
47b282a
remove forced upgrades on pip install
ZanSara Feb 4, 2022
8961873
Remove constraint on PyYAML, probably unnecessary
ZanSara Feb 4, 2022
c4ccf7d
Re-enable persist credentials
ZanSara Feb 4, 2022
f2dc50c
Merge branch 'master' into upgrade-pydoc-markdown
ZanSara Feb 4, 2022
122bb6b
Update Documentation & Code Style
github-actions[bot] Feb 4, 2022
a279fff
Last test
ZanSara Feb 4, 2022
9adc149
Last test
ZanSara Feb 4, 2022
7fc7914
Test when bot should not trigger
ZanSara Feb 4, 2022
0010dd0
Update Documentation & Code Style
github-actions[bot] Feb 4, 2022
aba8eb5
Remove comment
ZanSara Feb 4, 2022
f6ea921
Merge branch 'upgrade-pydoc-markdown' of github.com:deepset-ai/haysta…
ZanSara Feb 4, 2022
2 changes: 1 addition & 1 deletion .github/workflows/update_docsstrings_tutorials.yml
@@ -28,7 +28,7 @@ jobs:
- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install pydoc-markdown==3.11.0
pip install pydoc-markdown
pip install mkdocs
pip install jupytercontrib
pip install watchdog==1.0.2
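The step above only installs pydoc-markdown; rendering the API pages in the diffs that follow is driven by the YAML configuration files the first commit refers to ("fix the YAMLs to work with it"). For orientation, a pydoc-markdown 4.x configuration generally has the shape sketched below. This is an illustrative assumption rather than the repository's actual config: the search path, module name, and option values are invented, and only the overall loaders/processors/renderer layout follows pydoc-markdown's documented schema.

```yaml
# Illustrative sketch of a pydoc-markdown 4.x config; the search_path, module
# name, and option values are assumptions, not the repository's real file.
loaders:
  - type: python
    search_path: [../../../../haystack/nodes/connector]
    modules: [crawler]
processors:
  - type: filter
  - type: smart
  - type: crossref   # cross-reference resolution, cf. the "Reintroduce crossref too" commit
renderer:
  type: markdown
  descriptive_class_title: false
  filename: crawler.md
```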
66 changes: 21 additions & 45 deletions docs/_src/api/api/crawler.md
@@ -1,7 +1,9 @@
<a name="crawler"></a>
<a id="crawler"></a>

# Module crawler

<a name="crawler.Crawler"></a>
<a id="crawler.Crawler"></a>

## Crawler

```python
@@ -20,31 +22,12 @@ Crawl texts from a website so that we can use them later in Haystack as a corpus
| filter_urls= ["haystack\.deepset\.ai\/overview\/"])
```

<a name="crawler.Crawler.__init__"></a>
#### \_\_init\_\_

```python
| __init__(output_dir: str, urls: Optional[List[str]] = None, crawler_depth: int = 1, filter_urls: Optional[List] = None, overwrite_existing_files=True)
```

Init object with basic params for crawling (can be overwritten later).
<a id="crawler.Crawler.crawl"></a>

**Arguments**:

- `output_dir`: Path for the directory to store files
- `urls`: List of http(s) address(es) (can also be supplied later when calling crawl())
- `crawler_depth`: How many sublinks to follow from the initial list of URLs. Current options:
0: Only initial list of urls
1: Follow links found on the initial URLs (but no further)
- `filter_urls`: Optional list of regular expressions that the crawled URLs must comply with.
All URLs not matching at least one of the regular expressions will be dropped.
- `overwrite_existing_files`: Whether to overwrite existing files in output_dir with new content

<a name="crawler.Crawler.crawl"></a>
#### crawl

```python
| crawl(output_dir: Union[str, Path, None] = None, urls: Optional[List[str]] = None, crawler_depth: Optional[int] = None, filter_urls: Optional[List] = None, overwrite_existing_files: Optional[bool] = None) -> List[Path]
def crawl(output_dir: Union[str, Path, None] = None, urls: Optional[List[str]] = None, crawler_depth: Optional[int] = None, filter_urls: Optional[List] = None, overwrite_existing_files: Optional[bool] = None) -> List[Path]
```

Craw URL(s), extract the text from the HTML, create a Haystack Document object out of it and save it (one JSON
Expand All @@ -53,43 +36,36 @@ You can optionally specify via `filter_urls` to only crawl URLs that match a cer
All parameters are optional here and only meant to overwrite instance attributes at runtime.
If no parameters are provided to this method, the instance attributes that were passed during __init__ will be used.

**Arguments**:

- `output_dir`: Path for the directory to store files
- `urls`: List of http addresses or single http address
- `crawler_depth`: How many sublinks to follow from the initial list of URLs. Current options:
:param output_dir: Path for the directory to store files
:param urls: List of http addresses or single http address
:param crawler_depth: How many sublinks to follow from the initial list of URLs. Current options:
0: Only initial list of urls
1: Follow links found on the initial URLs (but no further)
- `filter_urls`: Optional list of regular expressions that the crawled URLs must comply with.
:param filter_urls: Optional list of regular expressions that the crawled URLs must comply with.
All URLs not matching at least one of the regular expressions will be dropped.
- `overwrite_existing_files`: Whether to overwrite existing files in output_dir with new content
:param overwrite_existing_files: Whether to overwrite existing files in output_dir with new content

**Returns**:
:return: List of paths where the crawled webpages got stored

List of paths where the crawled webpages got stored
<a id="crawler.Crawler.run"></a>

<a name="crawler.Crawler.run"></a>
#### run

```python
| run(output_dir: Union[str, Path, None] = None, urls: Optional[List[str]] = None, crawler_depth: Optional[int] = None, filter_urls: Optional[List] = None, overwrite_existing_files: Optional[bool] = None, return_documents: Optional[bool] = False) -> Tuple[Dict, str]
def run(output_dir: Union[str, Path, None] = None, urls: Optional[List[str]] = None, crawler_depth: Optional[int] = None, filter_urls: Optional[List] = None, overwrite_existing_files: Optional[bool] = None, return_documents: Optional[bool] = False) -> Tuple[Dict, str]
```

Method to be executed when the Crawler is used as a Node within a Haystack pipeline.

**Arguments**:

- `output_dir`: Path for the directory to store files
- `urls`: List of http addresses or single http address
- `crawler_depth`: How many sublinks to follow from the initial list of URLs. Current options:
:param output_dir: Path for the directory to store files
:param urls: List of http addresses or single http address
:param crawler_depth: How many sublinks to follow from the initial list of URLs. Current options:
0: Only initial list of urls
1: Follow links found on the initial URLs (but no further)
- `filter_urls`: Optional list of regular expressions that the crawled URLs must comply with.
:param filter_urls: Optional list of regular expressions that the crawled URLs must comply with.
All URLs not matching at least one of the regular expressions will be dropped.
- `overwrite_existing_files`: Whether to overwrite existing files in output_dir with new content
- `return_documents`: Return json files content

**Returns**:
:param overwrite_existing_files: Whether to overwrite existing files in output_dir with new content
:param return_documents: Return json files content

Tuple({"paths": List of filepaths, ...}, Name of output edge)
:return: Tuple({"paths": List of filepaths, ...}, Name of output edge)
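
Taken together, the crawl() and run() signatures above can be exercised as in the following minimal sketch. The import path `from haystack.nodes import Crawler` and the example URL are assumptions for Haystack 1.x and are not taken from this diff; the parameter names mirror the docstrings above.

```python
# Minimal usage sketch; the import path and the URL are assumptions for Haystack 1.x.
from haystack.nodes import Crawler

crawler = Crawler(output_dir="crawled_files")  # one JSON file per crawled page
paths = crawler.crawl(
    urls=["https://haystack.deepset.ai/overview/get-started"],  # illustrative URL
    crawler_depth=1,  # follow links found on the initial URLs, but no further
    filter_urls=[r"haystack\.deepset\.ai\/overview\/"],
)
print(paths)  # list of Path objects pointing at the stored JSON files
```

When the node runs inside a pipeline, run() accepts the same parameters and returns the Tuple({"paths": List of filepaths, ...}, Name of output edge) described above.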

65 changes: 15 additions & 50 deletions docs/_src/api/api/document_classifier.md
@@ -1,26 +1,31 @@
<a name="base"></a>
<a id="base"></a>

# Module base

<a name="base.BaseDocumentClassifier"></a>
<a id="base.BaseDocumentClassifier"></a>

## BaseDocumentClassifier

```python
class BaseDocumentClassifier(BaseComponent)
```

<a name="base.BaseDocumentClassifier.timing"></a>
<a id="base.BaseDocumentClassifier.timing"></a>

#### timing

```python
| timing(fn, attr_name)
def timing(fn, attr_name)
```

Wrapper method used to time functions.

<a name="transformers"></a>
<a id="transformers"></a>

# Module transformers

<a name="transformers.TransformersDocumentClassifier"></a>
<a id="transformers.TransformersDocumentClassifier"></a>

## TransformersDocumentClassifier

```python
@@ -74,57 +79,17 @@ With this document_classifier, you can directly get predictions via predict()
| p.run(file_paths=file_paths)
```

<a name="transformers.TransformersDocumentClassifier.__init__"></a>
#### \_\_init\_\_
<a id="transformers.TransformersDocumentClassifier.predict"></a>

```python
| __init__(model_name_or_path: str = "bhadresh-savani/distilbert-base-uncased-emotion", model_version: Optional[str] = None, tokenizer: Optional[str] = None, use_gpu: bool = True, return_all_scores: bool = False, task: str = 'text-classification', labels: Optional[List[str]] = None, batch_size: int = -1, classification_field: str = None)
```

Load a text classification model from Transformers.
Available models for the task of text-classification include:
- ``'bhadresh-savani/distilbert-base-uncased-emotion'``
- ``'Hate-speech-CNERG/dehatebert-mono-english'``

Available models for the task of zero-shot-classification include:
- ``'valhalla/distilbart-mnli-12-3'``
- ``'cross-encoder/nli-distilroberta-base'``

See https://huggingface.co/models for full list of available models.
Filter for text classification models: https://huggingface.co/models?pipeline_tag=text-classification&sort=downloads
Filter for zero-shot classification models (NLI): https://huggingface.co/models?pipeline_tag=zero-shot-classification&sort=downloads&search=nli

**Arguments**:

- `model_name_or_path`: Directory of a saved model or the name of a public model e.g. 'bhadresh-savani/distilbert-base-uncased-emotion'.
See https://huggingface.co/models for full list of available models.
- `model_version`: The version of model to use from the HuggingFace model hub. Can be tag name, branch name, or commit hash.
- `tokenizer`: Name of the tokenizer (usually the same as model)
- `use_gpu`: Whether to use GPU (if available).
- `return_all_scores`: Whether to return all prediction scores or just the one of the predicted class. Only used for task 'text-classification'.
- `task`: 'text-classification' or 'zero-shot-classification'
- `labels`: Only used for task 'zero-shot-classification'. List of string defining class labels, e.g.,
["positive", "negative"] otherwise None. Given a LABEL, the sequence fed to the model is "<cls> sequence to
classify <sep> This example is LABEL . <sep>" and the model predicts whether that sequence is a contradiction
or an entailment.
- `batch_size`: batch size to be processed at once
- `classification_field`: Name of Document's meta field to be used for classification. If left unset, Document.content is used by default.

<a name="transformers.TransformersDocumentClassifier.predict"></a>
#### predict

```python
| predict(documents: List[Document]) -> List[Document]
def predict(documents: List[Document]) -> List[Document]
```

Returns documents containing classification result in meta field.
Documents are updated in place.

**Arguments**:

- `documents`: List of Document to classify

**Returns**:

List of Document enriched with meta information
:param documents: List of Document to classify
:return: List of Document enriched with meta information
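
To make the predict() contract above concrete, here is a minimal sketch. The import paths (`haystack.Document`, `haystack.nodes.TransformersDocumentClassifier`) are assumptions for Haystack 1.x, and the example text is invented; the default model name and the in-place update behavior come from the docstrings above.

```python
# Minimal sketch, assuming Haystack 1.x import paths; example content is invented.
from haystack import Document
from haystack.nodes import TransformersDocumentClassifier

classifier = TransformersDocumentClassifier(
    model_name_or_path="bhadresh-savani/distilbert-base-uncased-emotion",
    use_gpu=False,
)
docs = [Document(content="The new documentation workflow is a joy to use.")]
docs = classifier.predict(documents=docs)  # documents are updated in place
print(docs[0].meta)  # the classification result is stored in the document's meta field
```

Because predict() updates the documents in place, iterating over the returned list or the original input list is equivalent.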