Regenerate API and Tutorial md files #1480

Merged (9 commits) on Sep 21, 2021
199 changes: 199 additions & 0 deletions docs/_src/api/api/classifier.md
@@ -0,0 +1,199 @@
<a name="base"></a>
# Module base

<a name="base.BaseClassifier"></a>
## BaseClassifier Objects

```python
class BaseClassifier(BaseComponent)
```

<a name="base.BaseClassifier.timing"></a>
#### timing

```python
| timing(fn, attr_name)
```

Wrapper method used to time functions.
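
The docstring above is terse; as a loose illustration (an assumption, not the Haystack implementation), a wrapper of this shape could accumulate the elapsed time of each call on the owning object:

```python
import time
from functools import wraps

class TimedComponent:
    """Toy stand-in for a component; purely illustrative."""

    def timing(self, fn, attr_name):
        """Wrap fn, time each call, and accumulate the elapsed seconds on self under attr_name."""
        @wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            result = fn(*args, **kwargs)
            setattr(self, attr_name, getattr(self, attr_name, 0.0) + time.perf_counter() - start)
            return result
        return wrapper

component = TimedComponent()
timed_upper = component.timing(lambda text: text.upper(), "query_time")
timed_upper("hello")
print(component.query_time)  # cumulative seconds spent inside the wrapped function
```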

<a name="farm"></a>
# Module farm

<a name="farm.FARMClassifier"></a>
## FARMClassifier Objects

```python
class FARMClassifier(BaseClassifier)
```

This node classifies documents and adds the output from the classification step to each document's metadata.
The meta field of the document is a dictionary with the following format:
'meta': {'name': '450_Baelor.txt', 'classification': {'label': 'neutral', 'probability': 0.9997646, ...} }

With a FARMClassifier, you can:
- directly get predictions via predict()
- fine-tune the model on text classification training data via train()

Usage example:
...
retriever = ElasticsearchRetriever(document_store=document_store)
classifier = FARMClassifier(model_name_or_path="deepset/bert-base-german-cased-sentiment-Germeval17")
p = Pipeline()
p.add_node(component=retriever, name="Retriever", inputs=["Query"])
p.add_node(component=classifier, name="Classifier", inputs=["Retriever"])

res = p.run(
query="Who is the father of Arya Stark?",
params={"Retriever": {"top_k": 10}, "Classifier": {"top_k": 5}}
)

print(res["documents"][0].to_dict()["meta"]["classification"]["label"])
# Note that print_documents() does not output the content of the classification field in the metadata:
# document_dicts = [doc.to_dict() for doc in res["documents"]]
# res["documents"] = document_dicts
# print_documents(res, max_text_len=100)


<a name="farm.FARMClassifier.__init__"></a>
#### \_\_init\_\_

```python
| __init__(model_name_or_path: Union[str, Path], model_version: Optional[str] = None, batch_size: int = 50, use_gpu: bool = True, top_k: int = 10, num_processes: Optional[int] = None, max_seq_len: int = 256, progress_bar: bool = True)
```

**Arguments**:

- `model_name_or_path`: Directory of a saved model or the name of a public model, e.g. 'deepset/bert-base-german-cased-sentiment-Germeval17'.
See https://huggingface.co/models for a full list of available models.
- `model_version`: The version of model to use from the HuggingFace model hub. Can be tag name, branch name, or commit hash.
- `batch_size`: Number of samples the model receives in one batch for inference.
Memory consumption is much lower in inference mode. Recommendation: Increase the batch size
to a value so only a single batch is used.
- `use_gpu`: Whether to use GPU (if available)
- `top_k`: The maximum number of documents to return
- `num_processes`: The number of processes for `multiprocessing.Pool`. Set to value of 0 to disable
multiprocessing. Set to None to let Inferencer determine optimum number. If you
want to debug the Language Model, you might need to disable multiprocessing!
- `max_seq_len`: Max sequence length of one input text for the model
- `progress_bar`: Whether to show a tqdm progress bar or not.
Can be helpful to disable in production deployments to keep the logs clean.

<a name="farm.FARMClassifier.train"></a>
#### train

```python
| train(data_dir: str, train_filename: str, label_list: List[str], delimiter: str, metric: str, dev_filename: Optional[str] = None, test_filename: Optional[str] = None, use_gpu: Optional[bool] = None, batch_size: int = 10, n_epochs: int = 2, learning_rate: float = 1e-5, max_seq_len: Optional[int] = None, warmup_proportion: float = 0.2, dev_split: float = 0, evaluate_every: int = 300, save_dir: Optional[str] = None, num_processes: Optional[int] = None, use_amp: str = None)
```

Fine-tune a model on a TextClassification dataset.
The dataset needs to be in tabular format (CSV, TSV, etc.), with columns called "label" and "text" in no specific order.
Options:

- Take a plain language model (e.g. `bert-base-cased`) and train it for TextClassification
- Take a TextClassification model and fine-tune it for your domain

**Arguments**:

- `data_dir`: Path to directory containing your training data
- `label_list`: list of labels in the training dataset, e.g., ["0", "1"]
- `delimiter`: delimiter that separates columns in the training dataset, e.g., "\t"
- `metric`: evaluation metric to be used while training, e.g., "f1_macro"
- `train_filename`: Filename of training data
- `dev_filename`: Filename of dev / eval data
- `test_filename`: Filename of test data
- `dev_split`: Instead of specifying a dev_filename, you can also specify a ratio (e.g. 0.1) here
that gets split off from training data for eval.
- `use_gpu`: Whether to use GPU (if available)
- `batch_size`: Number of samples the model receives in one batch for training
- `n_epochs`: Number of iterations on the whole training data set
- `learning_rate`: Learning rate of the optimizer
- `max_seq_len`: Maximum text length (in tokens). Everything longer gets cut down.
- `warmup_proportion`: Proportion of training steps until the maximum learning rate is reached.
Until that point, the learning rate increases linearly; after it, the learning rate decreases linearly again.
Options for different schedules are available in FARM.
- `evaluate_every`: Evaluate the model every X steps on the hold-out eval dataset
- `save_dir`: Path to store the final model
- `num_processes`: The number of processes for `multiprocessing.Pool` during preprocessing.
Set to value of 1 to disable multiprocessing. When set to 1, you cannot split away a dev set from train set.
Set to None to use all CPU cores minus one.
- `use_amp`: Optimization level of NVIDIA's automatic mixed precision (AMP). The higher the level, the faster the model.
Available options:
None (Don't use AMP)
"O0" (Normal FP32 training)
"O1" (Mixed Precision => Recommended)
"O2" (Almost FP16)
"O3" (Pure FP16).
See details on: https://nvidia.github.io/apex/amp.html

**Returns**:

None
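
For illustration, a hedged sketch of a fine-tuning call using the parameters documented above; the data directory, filenames, and labels are made-up placeholders:

```python
classifier = FARMClassifier(model_name_or_path="bert-base-cased")
classifier.train(
    data_dir="data/sentiment",            # hypothetical directory containing the TSV files
    train_filename="train.tsv",           # hypothetical file with "text" and "label" columns
    dev_filename="dev.tsv",
    label_list=["negative", "neutral", "positive"],
    delimiter="\t",
    metric="f1_macro",
    batch_size=10,
    n_epochs=2,
    save_dir="saved_models/farm_sentiment_classifier",
)
```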

<a name="farm.FARMClassifier.update_parameters"></a>
#### update\_parameters

```python
| update_parameters(max_seq_len: Optional[int] = None)
```

Hot update parameters of a loaded FARMClassifier. It may not be safe when processing concurrent requests.

<a name="farm.FARMClassifier.save"></a>
#### save

```python
| save(directory: Path)
```

Saves the FARMClassifier model so that it can be reused at a later point in time.

**Arguments**:

- `directory`: Directory where the FARMClassifier model should be saved

<a name="farm.FARMClassifier.predict_batch"></a>
#### predict\_batch

```python
| predict_batch(query_doc_list: List[dict], top_k: int = None, batch_size: int = None)
```

Use the loaded FARMClassifier model to classify each query's supplied list of Documents, for a list of queries.

Returns a list of dictionaries, each containing a query and its list of Documents sorted by descending similarity to the query.

**Arguments**:

- `query_doc_list`: List of dictionaries containing queries with their retrieved documents
- `top_k`: The maximum number of answers to return for each query
- `batch_size`: Number of samples the model receives in one batch for inference

**Returns**:

List of dictionaries containing query and list of Document with class probabilities in meta field
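
A hedged sketch of a batched call. This page does not spell out the dictionary keys expected in `query_doc_list`, so the "queries"/"docs" layout below is an assumption, as is reusing the `retriever` from the usage example above to build the inputs:

```python
queries = ["Who is the father of Arya Stark?", "Who is the mother of Jon Snow?"]

# Assumed input layout: one dict per query, pairing it with its retrieved documents.
query_doc_list = [
    {"queries": query, "docs": retriever.retrieve(query, top_k=10)} for query in queries
]

results = classifier.predict_batch(query_doc_list=query_doc_list, top_k=5, batch_size=16)
for entry in results:
    print(entry)  # each entry pairs a query with its classified, re-sorted documents
```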

<a name="farm.FARMClassifier.predict"></a>
#### predict

```python
| predict(query: str, documents: List[Document], top_k: Optional[int] = None) -> List[Document]
```

Use the loaded classification model to classify the supplied list of Documents.

Returns the list of Documents enriched with class label and probability, stored in Document.meta["classification"].

**Arguments**:

- `query`: Query string (is not used at the moment)
- `documents`: List of Document to be classified
- `top_k`: The maximum number of documents to return

**Returns**:

List of Document with class probabilities in meta field
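
Continuing the pipeline example from the top of this module, a short sketch of a direct predict() call on the retrieved documents:

```python
# Classify the documents returned by the retriever in the earlier usage example.
retrieved_docs = res["documents"]
classified_docs = classifier.predict(
    query="Who is the father of Arya Stark?",  # the query is currently not used by the classifier
    documents=retrieved_docs,
    top_k=5,
)
for doc in classified_docs:
    print(doc.meta["classification"]["label"], doc.meta["classification"]["probability"])
```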

30 changes: 21 additions & 9 deletions docs/_src/api/api/document_store.md
@@ -985,7 +985,7 @@ the vector embeddings are indexed in a FAISS Index.
#### \_\_init\_\_

```python
| __init__(sql_url: str = "sqlite:///", vector_dim: int = 768, faiss_index_factory_str: str = "Flat", faiss_index: Optional["faiss.swigfaiss.Index"] = None, return_embedding: bool = False, index: str = "document", similarity: str = "dot_product", embedding_field: str = "embedding", progress_bar: bool = True, duplicate_documents: str = 'overwrite', **kwargs, ,)
| __init__(sql_url: str = "sqlite:///faiss_document_store.db", vector_dim: int = 768, faiss_index_factory_str: str = "Flat", faiss_index: Optional["faiss.swigfaiss.Index"] = None, return_embedding: bool = False, index: str = "document", similarity: str = "dot_product", embedding_field: str = "embedding", progress_bar: bool = True, duplicate_documents: str = 'overwrite', **kwargs, ,)
```

**Arguments**:
@@ -1012,8 +1012,11 @@ the vector embeddings are indexed in a FAISS Index.
or one with docs that you used in Haystack before and want to load again.
- `return_embedding`: To return document embedding
- `index`: Name of index in document store to use.
- `similarity`: The similarity function used to compare document vectors. 'dot_product' is the default sine it is
more performant with DPR embeddings. 'cosine' is recommended if you are using a Sentence BERT model.
- `similarity`: The similarity function used to compare document vectors. 'dot_product' is the default since it is
more performant with DPR embeddings. 'cosine' is recommended if you are using a Sentence-Transformer model.
In both cases, the returned values in Document.score are normalized to be in range [0,1]:
For `dot_product`: expit(np.asarray(raw_score / 100))
For `cosine`: (raw_score + 1) / 2
(see the short normalization sketch after this argument list)
- `embedding_field`: Name of field containing an embedding vector.
- `progress_bar`: Whether to show a tqdm progress bar or not.
Can be helpful to disable in production deployments to keep the logs clean.
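
A small sketch of the normalization described for the `similarity` argument above, assuming `expit` in the formula refers to the logistic function from `scipy.special`:

```python
import numpy as np
from scipy.special import expit

def normalized_score(raw_score: float, similarity: str = "dot_product") -> float:
    """Map a raw similarity score into [0, 1] as described above."""
    if similarity == "dot_product":
        return float(expit(np.asarray(raw_score / 100)))
    # cosine similarity lies in [-1, 1]
    return (raw_score + 1) / 2

print(normalized_score(120.0, "dot_product"))  # ~0.77
print(normalized_score(0.5, "cosine"))         # 0.75
```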
@@ -1174,14 +1177,19 @@ Find the document that is most similar to the provided `query_emb` by using a ve
#### save

```python
| save(file_path: Union[str, Path])
| save(index_path: Union[str, Path], config_path: Optional[Union[str, Path]] = None)
```

Save FAISS Index to the specified file.

**Arguments**:

- `file_path`: Path to save to.
- `index_path`: Path to save the FAISS index to.
- `config_path`: Path to save the initial configuration parameters to.
Defaults to the same path as `index_path`, except for the extension (.json).
This file contains all the parameters passed to FAISSDocumentStore()
at creation time (for example the SQL path, vector_dim, etc), and will be
used by the `load` method to restore the index with the appropriate configuration.

**Returns**:

@@ -1192,7 +1200,7 @@ None

```python
| @classmethod
| load(cls, faiss_file_path: Union[str, Path], sql_url: str, index: str)
| load(cls, index_path: Union[str, Path], config_path: Optional[Union[str, Path]] = None)
```

Load a saved FAISS index from a file and connect to the SQL database.
@@ -1201,14 +1209,18 @@ Note: In order to have a correct mapping from FAISS to SQL,

**Arguments**:

- `faiss_file_path`: Stored FAISS index file. Can be created via calling `save()`
- `index_path`: Stored FAISS index file. Can be created via calling `save()`
- `config_path`: Stored FAISS initial configuration parameters.
Can be created via calling `save()`
- `sql_url`: Connection string to the SQL database that contains your docs and metadata.
Overrides the value defined in the `faiss_init_params_path` file, if present
- `index`: Index name to load the FAISS index as. It must match the index name used for
when creating the FAISS index.
when creating the FAISS index. Overrides the value defined in the
`faiss_init_params_path` file, if present

**Returns**:


the DocumentStore
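
A minimal round-trip sketch under the new `save`/`load` signatures; the file paths are illustrative and the import path is an assumption about this Haystack version's module layout:

```python
from haystack.document_store.faiss import FAISSDocumentStore  # assumed import path

document_store = FAISSDocumentStore(sql_url="sqlite:///faiss_document_store.db")
# ... write documents and update embeddings here ...

# Saves the index and, by default, a config JSON next to it (my_index.json).
document_store.save(index_path="my_index.faiss")

# Restores the index together with the configuration captured at save time.
reloaded_store = FAISSDocumentStore.load(index_path="my_index.faiss")
```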

<a name="milvus"></a>
# Module milvus
3 changes: 3 additions & 0 deletions docs/_src/api/api/generate_docstrings.sh
@@ -16,3 +16,6 @@ pydoc-markdown pydoc-markdown-knowledge-graph.yml
pydoc-markdown pydoc-markdown-graph-retriever.yml
pydoc-markdown pydoc-markdown-evaluation.yml
pydoc-markdown pydoc-markdown-ranker.yml
pydoc-markdown pydoc-markdown-question-generator.yml
pydoc-markdown pydoc-markdown-classifier.yml

11 changes: 11 additions & 0 deletions docs/_src/api/api/pipelines.md
@@ -824,6 +824,17 @@ Create an instance of Component.

Ray calls this method which is then re-directed to the corresponding component's run().

<a name="pipeline.Docs2Answers"></a>
## Docs2Answers Objects

```python
class Docs2Answers(BaseComponent)
```

This node converts retrieved documents into the predicted answers format.
It is useful when you are calling a Retriever-only pipeline via the REST API,
because it ensures that the output is in a compatible format.
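
A hedged sketch of attaching such a node to a retriever-only pipeline; the no-argument constructor and the import path are assumptions based on the module name above, and `retriever` stands for any initialized retriever:

```python
from haystack import Pipeline
from haystack.pipeline import Docs2Answers  # assumed import path (module "pipeline" above)

p = Pipeline()
p.add_node(component=retriever, name="Retriever", inputs=["Query"])
p.add_node(component=Docs2Answers(), name="Docs2Answers", inputs=["Retriever"])

res = p.run(query="Who is the father of Arya Stark?")
# res should now expose answer-shaped output built from the retrieved documents,
# which REST API clients can consume like the output of a reader pipeline.
```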

<a name="pipeline.MostSimilarDocumentsPipeline"></a>
## MostSimilarDocumentsPipeline Objects

18 changes: 18 additions & 0 deletions docs/_src/api/api/pydoc-markdown-classifier.yml
@@ -0,0 +1,18 @@
loaders:
  - type: python
    search_path: [../../../../haystack/classifier]
    modules: ['base', 'farm']
    ignore_when_discovered: ['__init__']
processor:
  - type: filter
    expression: not name.startswith('_') and default()
  - documented_only: true
  - do_not_filter_modules: false
  - skip_empty_modules: true
renderer:
  type: markdown
  descriptive_class_title: true
  descriptive_module_title: true
  add_method_class_prefix: false
  add_member_class_prefix: false
  filename: classifier.md
18 changes: 18 additions & 0 deletions docs/_src/api/api/pydoc-markdown-question-generator.yml
@@ -0,0 +1,18 @@
loaders:
  - type: python
    search_path: [../../../../haystack/question_generator]
    modules: ['question_generator']
    ignore_when_discovered: ['__init__']
processor:
  - type: filter
    expression: not name.startswith('_') and default()
  - documented_only: true
  - do_not_filter_modules: false
  - skip_empty_modules: true
renderer:
  type: markdown
  descriptive_class_title: true
  descriptive_module_title: true
  add_method_class_prefix: false
  add_member_class_prefix: false
  filename: question_generator.md
30 changes: 30 additions & 0 deletions docs/_src/api/api/question_generator.md
@@ -0,0 +1,30 @@
<a name="question_generator"></a>
# Module question\_generator

<a name="question_generator.QuestionGenerator"></a>
## QuestionGenerator Objects

```python
class QuestionGenerator(BaseComponent)
```

The Question Generator takes only a document as input and outputs questions that it thinks can be
answered by this document. In our current implementation, input texts are split into chunks of 50 words
with a 10-word overlap. This is because the default model `valhalla/t5-base-e2e-qg` seems to generate only
about 3 questions per passage regardless of length. Our approach prioritizes the creation of more questions
over processing efficiency (T5 is able to digest much more than 50 words at once). The returned questions
generally come in an order dictated by the order of their answers, i.e. early questions in the list generally
come from earlier in the document.
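
As an illustration of the splitting behaviour described above (not the library's actual implementation), a word-window splitter with a 50-word length and 10-word overlap could look like this:

```python
def split_into_word_windows(text: str, split_length: int = 50, split_overlap: int = 10) -> list:
    """Split text into overlapping word windows, e.g. 50-word chunks with a 10-word overlap."""
    words = text.split()
    step = split_length - split_overlap
    return [
        " ".join(words[start:start + split_length])
        for start in range(0, max(len(words) - split_overlap, 1), step)
    ]

chunks = split_into_word_windows("word " * 120)  # 120 dummy words -> 3 overlapping chunks
print(len(chunks))
```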

<a name="question_generator.QuestionGenerator.__init__"></a>
#### \_\_init\_\_

```python
| __init__(model_name_or_path="valhalla/t5-base-e2e-qg", model_version=None, num_beams=4, max_length=256, no_repeat_ngram_size=3, length_penalty=1.5, early_stopping=True, split_length=50, split_overlap=10, prompt="generate questions:")
```

Uses the valhalla/t5-base-e2e-qg model by default. This class supports any question generation model that is
implemented as a Seq2SeqLM in HuggingFace Transformers. Note that this style of question generation (where the only input
is a document) is sometimes referred to as end-to-end question generation. Answer-supervised question
generation is not currently supported.
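
A short usage sketch; the `generate()` method name and the import path are assumptions based on the module layout above:

```python
from haystack.question_generator import QuestionGenerator  # assumed import path

question_generator = QuestionGenerator()  # loads valhalla/t5-base-e2e-qg by default
text = (
    "Arya Stark is the youngest daughter of Eddard and Catelyn Stark "
    "and the younger sister of Sansa Stark."
)
questions = question_generator.generate(text)  # assumed method; returns generated questions
print(questions)
```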
