Regenerate API and Tutorial md files #1480

Merged (9 commits) on Sep 21, 2021
199 changes: 199 additions & 0 deletions docs/_src/api/api/classifier.md
@@ -0,0 +1,199 @@
<a name="base"></a>
# Module base

<a name="base.BaseClassifier"></a>
## BaseClassifier Objects

```python
class BaseClassifier(BaseComponent)
```

<a name="base.BaseClassifier.timing"></a>
#### timing

```python
| timing(fn, attr_name)
```

Wrapper method used to time functions.
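
The docstring above is terse; as a loose illustration (an assumption, not the Haystack implementation), a wrapper of this shape could accumulate the elapsed time of each call on the owning object:

```python
import time
from functools import wraps

class TimedComponent:
    """Toy stand-in for a component; purely illustrative."""

    def timing(self, fn, attr_name):
        """Wrap fn, time each call, and accumulate the elapsed seconds on self under attr_name."""
        @wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            result = fn(*args, **kwargs)
            setattr(self, attr_name, getattr(self, attr_name, 0.0) + time.perf_counter() - start)
            return result
        return wrapper

component = TimedComponent()
timed_upper = component.timing(lambda text: text.upper(), "query_time")
timed_upper("hello")
print(component.query_time)  # cumulative seconds spent inside the wrapped function
```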

<a name="farm"></a>
# Module farm

<a name="farm.FARMClassifier"></a>
## FARMClassifier Objects

```python
class FARMClassifier(BaseClassifier)
```

This node classifies documents and adds the output from the classification step to each document's metadata.
The meta field of the document is a dictionary with the following format:
'meta': {'name': '450_Baelor.txt', 'classification': {'label': 'neutral', 'probability': 0.9997646, ...} }

With a FARMClassifier, you can:
- directly get predictions via predict()
- fine-tune the model on text classification training data via train()

Usage example:
...
retriever = ElasticsearchRetriever(document_store=document_store)
classifier = FARMClassifier(model_name_or_path="deepset/bert-base-german-cased-sentiment-Germeval17")
p = Pipeline()
p.add_node(component=retriever, name="Retriever", inputs=["Query"])
p.add_node(component=classifier, name="Classifier", inputs=["Retriever"])

res = p.run(
query="Who is the father of Arya Stark?",
params={"Retriever": {"top_k": 10}, "Classifier": {"top_k": 5}}
)

print(res["documents"][0].to_dict()["meta"]["classification"]["label"])
# Note that print_documents() does not output the content of the classification field in the metadata:
# document_dicts = [doc.to_dict() for doc in res["documents"]]
# res["documents"] = document_dicts
# print_documents(res, max_text_len=100)


<a name="farm.FARMClassifier.__init__"></a>
#### \_\_init\_\_

```python
| __init__(model_name_or_path: Union[str, Path], model_version: Optional[str] = None, batch_size: int = 50, use_gpu: bool = True, top_k: int = 10, num_processes: Optional[int] = None, max_seq_len: int = 256, progress_bar: bool = True)
```

**Arguments**:

- `model_name_or_path`: Directory of a saved model or the name of a public model, e.g. 'deepset/bert-base-german-cased-sentiment-Germeval17'.
See https://huggingface.co/models for a full list of available models.
- `model_version`: The version of model to use from the HuggingFace model hub. Can be tag name, branch name, or commit hash.
- `batch_size`: Number of samples the model receives in one batch for inference.
Memory consumption is much lower in inference mode. Recommendation: Increase the batch size
to a value so only a single batch is used.
- `use_gpu`: Whether to use GPU (if available)
- `top_k`: The maximum number of documents to return
- `num_processes`: The number of processes for `multiprocessing.Pool`. Set to value of 0 to disable
multiprocessing. Set to None to let Inferencer determine optimum number. If you
want to debug the Language Model, you might need to disable multiprocessing!
- `max_seq_len`: Max sequence length of one input text for the model
- `progress_bar`: Whether to show a tqdm progress bar or not.
Can be helpful to disable in production deployments to keep the logs clean.

<a name="farm.FARMClassifier.train"></a>
#### train

```python
| train(data_dir: str, train_filename: str, label_list: List[str], delimiter: str, metric: str, dev_filename: Optional[str] = None, test_filename: Optional[str] = None, use_gpu: Optional[bool] = None, batch_size: int = 10, n_epochs: int = 2, learning_rate: float = 1e-5, max_seq_len: Optional[int] = None, warmup_proportion: float = 0.2, dev_split: float = 0, evaluate_every: int = 300, save_dir: Optional[str] = None, num_processes: Optional[int] = None, use_amp: str = None)
```

Fine-tune a model on a TextClassification dataset.
The dataset needs to be in tabular format (CSV, TSV, etc.), with columns called "label" and "text" in no specific order.
Options:

- Take a plain language model (e.g. `bert-base-cased`) and train it for TextClassification
- Take a TextClassification model and fine-tune it for your domain

**Arguments**:

- `data_dir`: Path to directory containing your training data
- `label_list`: list of labels in the training dataset, e.g., ["0", "1"]
- `delimiter`: delimiter that separates columns in the training dataset, e.g., "\t"
- `metric`: evaluation metric to be used while training, e.g., "f1_macro"
- `train_filename`: Filename of training data
- `dev_filename`: Filename of dev / eval data
- `test_filename`: Filename of test data
- `dev_split`: Instead of specifying a dev_filename, you can also specify a ratio (e.g. 0.1) here
that gets split off from training data for eval.
- `use_gpu`: Whether to use GPU (if available)
- `batch_size`: Number of samples the model receives in one batch for training
- `n_epochs`: Number of iterations on the whole training data set
- `learning_rate`: Learning rate of the optimizer
- `max_seq_len`: Maximum text length (in tokens). Everything longer gets cut down.
- `warmup_proportion`: Proportion of training steps until the maximum learning rate is reached.
Until that point, the learning rate increases linearly; after it, the learning rate decreases linearly again.
Options for different schedules are available in FARM.
- `evaluate_every`: Evaluate the model every X steps on the hold-out eval dataset
- `save_dir`: Path to store the final model
- `num_processes`: The number of processes for `multiprocessing.Pool` during preprocessing.
Set to value of 1 to disable multiprocessing. When set to 1, you cannot split away a dev set from train set.
Set to None to use all CPU cores minus one.
- `use_amp`: Optimization level of NVIDIA's automatic mixed precision (AMP). The higher the level, the faster the model.
Available options:
None (Don't use AMP)
"O0" (Normal FP32 training)
"O1" (Mixed Precision => Recommended)
"O2" (Almost FP16)
"O3" (Pure FP16).
See details on: https://nvidia.github.io/apex/amp.html

**Returns**:

None
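
For illustration, a hedged sketch of a fine-tuning call using the parameters documented above; the data directory, filenames, and labels are made-up placeholders:

```python
classifier = FARMClassifier(model_name_or_path="bert-base-cased")
classifier.train(
    data_dir="data/sentiment",            # hypothetical directory containing the TSV files
    train_filename="train.tsv",           # hypothetical file with "text" and "label" columns
    dev_filename="dev.tsv",
    label_list=["negative", "neutral", "positive"],
    delimiter="\t",
    metric="f1_macro",
    batch_size=10,
    n_epochs=2,
    save_dir="saved_models/farm_sentiment_classifier",
)
```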

<a name="farm.FARMClassifier.update_parameters"></a>
#### update\_parameters

```python
| update_parameters(max_seq_len: Optional[int] = None)
```

Hot update parameters of a loaded FARMClassifier. It may not be safe when processing concurrent requests.

<a name="farm.FARMClassifier.save"></a>
#### save

```python
| save(directory: Path)
```

Saves the FARMClassifier model so that it can be reused at a later point in time.

**Arguments**:

- `directory`: Directory where the FARMClassifier model should be saved

<a name="farm.FARMClassifier.predict_batch"></a>
#### predict\_batch

```python
| predict_batch(query_doc_list: List[dict], top_k: int = None, batch_size: int = None)
```

Use the loaded FARMClassifier model to classify each query's supplied list of Documents, for a list of queries.

Returns a list of dictionaries, each containing a query and its list of Documents sorted by descending similarity to the query.

**Arguments**:

- `query_doc_list`: List of dictionaries containing queries with their retrieved documents
- `top_k`: The maximum number of answers to return for each query
- `batch_size`: Number of samples the model receives in one batch for inference

**Returns**:

List of dictionaries containing query and list of Document with class probabilities in meta field
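
A hedged sketch of a batched call. This page does not spell out the dictionary keys expected in `query_doc_list`, so the "queries"/"docs" layout below is an assumption, as is reusing the `retriever` from the usage example above to build the inputs:

```python
queries = ["Who is the father of Arya Stark?", "Who is the mother of Jon Snow?"]

# Assumed input layout: one dict per query, pairing it with its retrieved documents.
query_doc_list = [
    {"queries": query, "docs": retriever.retrieve(query, top_k=10)} for query in queries
]

results = classifier.predict_batch(query_doc_list=query_doc_list, top_k=5, batch_size=16)
for entry in results:
    print(entry)  # each entry pairs a query with its classified, re-sorted documents
```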

<a name="farm.FARMClassifier.predict"></a>
#### predict

```python
| predict(query: str, documents: List[Document], top_k: Optional[int] = None) -> List[Document]
```

Use the loaded classification model to classify the supplied list of Documents.

Returns the list of Documents enriched with class label and probability, stored in Document.meta["classification"].

**Arguments**:

- `query`: Query string (is not used at the moment)
- `documents`: List of Document to be classified
- `top_k`: The maximum number of documents to return

**Returns**:

List of Document with class probabilities in meta field
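
Continuing the pipeline example from the top of this module, a short sketch of a direct predict() call on the retrieved documents:

```python
# Classify the documents returned by the retriever in the earlier usage example.
retrieved_docs = res["documents"]
classified_docs = classifier.predict(
    query="Who is the father of Arya Stark?",  # the query is currently not used by the classifier
    documents=retrieved_docs,
    top_k=5,
)
for doc in classified_docs:
    print(doc.meta["classification"]["label"], doc.meta["classification"]["probability"])
```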

30 changes: 21 additions & 9 deletions docs/_src/api/api/document_store.md
@@ -985,7 +985,7 @@ the vector embeddings are indexed in a FAISS Index.
#### \_\_init\_\_

```python
| __init__(sql_url: str = "sqlite:///", vector_dim: int = 768, faiss_index_factory_str: str = "Flat", faiss_index: Optional["faiss.swigfaiss.Index"] = None, return_embedding: bool = False, index: str = "document", similarity: str = "dot_product", embedding_field: str = "embedding", progress_bar: bool = True, duplicate_documents: str = 'overwrite', **kwargs, ,)
| __init__(sql_url: str = "sqlite:///faiss_document_store.db", vector_dim: int = 768, faiss_index_factory_str: str = "Flat", faiss_index: Optional["faiss.swigfaiss.Index"] = None, return_embedding: bool = False, index: str = "document", similarity: str = "dot_product", embedding_field: str = "embedding", progress_bar: bool = True, duplicate_documents: str = 'overwrite', **kwargs, ,)
```

**Arguments**:
@@ -1012,8 +1012,11 @@ the vector embeddings are indexed in a FAISS Index.
or one with docs that you used in Haystack before and want to load again.
- `return_embedding`: To return document embedding
- `index`: Name of index in document store to use.
- `similarity`: The similarity function used to compare document vectors. 'dot_product' is the default sine it is
more performant with DPR embeddings. 'cosine' is recommended if you are using a Sentence BERT model.
- `similarity`: The similarity function used to compare document vectors. 'dot_product' is the default since it is
more performant with DPR embeddings. 'cosine' is recommended if you are using a Sentence-Transformer model.
In both cases, the returned values in Document.score are normalized to be in range [0,1]:
For `dot_product`: expit(np.asarray(raw_score / 100))
For `cosine`: (raw_score + 1) / 2
(see the short normalization sketch after this argument list)
- `embedding_field`: Name of field containing an embedding vector.
- `progress_bar`: Whether to show a tqdm progress bar or not.
Can be helpful to disable in production deployments to keep the logs clean.
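
A small sketch of the normalization described for the `similarity` argument above, assuming `expit` in the formula refers to the logistic function from `scipy.special`:

```python
import numpy as np
from scipy.special import expit

def normalized_score(raw_score: float, similarity: str = "dot_product") -> float:
    """Map a raw similarity score into [0, 1] as described above."""
    if similarity == "dot_product":
        return float(expit(np.asarray(raw_score / 100)))
    # cosine similarity lies in [-1, 1]
    return (raw_score + 1) / 2

print(normalized_score(120.0, "dot_product"))  # ~0.77
print(normalized_score(0.5, "cosine"))         # 0.75
```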
@@ -1174,14 +1177,19 @@ Find the document that is most similar to the provided `query_emb` by using a ve
#### save

```python
| save(file_path: Union[str, Path])
| save(index_path: Union[str, Path], config_path: Optional[Union[str, Path]] = None)
```

Save FAISS Index to the specified file.

**Arguments**:

- `file_path`: Path to save to.
- `index_path`: Path to save the FAISS index to.
- `config_path`: Path to save the initial configuration parameters to.
Defaults to the same path as `index_path`, except for the extension (.json).
This file contains all the parameters passed to FAISSDocumentStore()
at creation time (for example the SQL path, vector_dim, etc), and will be
used by the `load` method to restore the index with the appropriate configuration.

**Returns**:

@@ -1192,7 +1200,7 @@ None

```python
| @classmethod
| load(cls, faiss_file_path: Union[str, Path], sql_url: str, index: str)
| load(cls, index_path: Union[str, Path], config_path: Optional[Union[str, Path]] = None)
```

Load a saved FAISS index from a file and connect to the SQL database.
@@ -1201,14 +1209,18 @@ Note: In order to have a correct mapping from FAISS to SQL,

**Arguments**:

- `faiss_file_path`: Stored FAISS index file. Can be created via calling `save()`
- `index_path`: Stored FAISS index file. Can be created via calling `save()`
- `config_path`: Stored FAISS initial configuration parameters.
Can be created via calling `save()`
- `sql_url`: Connection string to the SQL database that contains your docs and metadata.
Overrides the value defined in the `faiss_init_params_path` file, if present
- `index`: Index name to load the FAISS index as. It must match the index name used for
when creating the FAISS index.
when creating the FAISS index. Overrides the value defined in the
`faiss_init_params_path` file, if present

**Returns**:


the DocumentStore
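
A minimal round-trip sketch under the new `save`/`load` signatures; the file paths are illustrative and the import path is an assumption about this Haystack version's module layout:

```python
from haystack.document_store.faiss import FAISSDocumentStore  # assumed import path

document_store = FAISSDocumentStore(sql_url="sqlite:///faiss_document_store.db")
# ... write documents and update embeddings here ...

# Saves the index and, by default, a config JSON next to it (my_index.json).
document_store.save(index_path="my_index.faiss")

# Restores the index together with the configuration captured at save time.
reloaded_store = FAISSDocumentStore.load(index_path="my_index.faiss")
```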

<a name="milvus"></a>
# Module milvus
3 changes: 3 additions & 0 deletions docs/_src/api/api/generate_docstrings.sh
@@ -16,3 +16,6 @@ pydoc-markdown pydoc-markdown-knowledge-graph.yml
pydoc-markdown pydoc-markdown-graph-retriever.yml
pydoc-markdown pydoc-markdown-evaluation.yml
pydoc-markdown pydoc-markdown-ranker.yml
pydoc-markdown pydoc-markdown-question-generator.yml
pydoc-markdown pydoc-markdown-classifier.yml

11 changes: 11 additions & 0 deletions docs/_src/api/api/pipelines.md
@@ -824,6 +824,17 @@ Create an instance of Component.

Ray calls this method which is then re-directed to the corresponding component's run().

<a name="pipeline.Docs2Answers"></a>
## Docs2Answers Objects

```python
class Docs2Answers(BaseComponent)
```

This node converts retrieved documents into the predicted answers format.
It is useful when you are calling a Retriever-only pipeline via the REST API,
because it ensures that the output is in a compatible format.
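
A hedged sketch of attaching such a node to a retriever-only pipeline; the no-argument constructor and the import path are assumptions based on the module name above, and `retriever` stands for any initialized retriever:

```python
from haystack import Pipeline
from haystack.pipeline import Docs2Answers  # assumed import path (module "pipeline" above)

p = Pipeline()
p.add_node(component=retriever, name="Retriever", inputs=["Query"])
p.add_node(component=Docs2Answers(), name="Docs2Answers", inputs=["Retriever"])

res = p.run(query="Who is the father of Arya Stark?")
# res should now expose answer-shaped output built from the retrieved documents,
# which REST API clients can consume like the output of a reader pipeline.
```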

<a name="pipeline.MostSimilarDocumentsPipeline"></a>
## MostSimilarDocumentsPipeline Objects

18 changes: 18 additions & 0 deletions docs/_src/api/api/pydoc-markdown-classifier.yml
@@ -0,0 +1,18 @@
loaders:
  - type: python
    search_path: [../../../../haystack/classifier]
    modules: ['base', 'farm']
    ignore_when_discovered: ['__init__']
processor:
  - type: filter
    expression: not name.startswith('_') and default()
  - documented_only: true
  - do_not_filter_modules: false
  - skip_empty_modules: true
renderer:
  type: markdown
  descriptive_class_title: true
  descriptive_module_title: true
  add_method_class_prefix: false
  add_member_class_prefix: false
  filename: classifier.md
18 changes: 18 additions & 0 deletions docs/_src/api/api/pydoc-markdown-question-generator.yml
@@ -0,0 +1,18 @@
loaders:
  - type: python
    search_path: [../../../../haystack/question_generator]
    modules: ['question_generator']
    ignore_when_discovered: ['__init__']
processor:
  - type: filter
    expression: not name.startswith('_') and default()
  - documented_only: true
  - do_not_filter_modules: false
  - skip_empty_modules: true
renderer:
  type: markdown
  descriptive_class_title: true
  descriptive_module_title: true
  add_method_class_prefix: false
  add_member_class_prefix: false
  filename: question_generator.md
30 changes: 30 additions & 0 deletions docs/_src/api/api/question_generator.md
@@ -0,0 +1,30 @@
<a name="question_generator"></a>
# Module question\_generator

<a name="question_generator.QuestionGenerator"></a>
## QuestionGenerator Objects

```python
class QuestionGenerator(BaseComponent)
```

The Question Generator takes only a document as input and outputs questions that it thinks can be
answered by this document. In our current implementation, input texts are split into chunks of 50 words
with a 10-word overlap. This is because the default model `valhalla/t5-base-e2e-qg` seems to generate only
about 3 questions per passage regardless of length. Our approach prioritizes the creation of more questions
over processing efficiency (T5 is able to digest much more than 50 words at once). The returned questions
generally come in an order dictated by the order of their answers, i.e. early questions in the list generally
come from earlier in the document.
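
As an illustration of the splitting behaviour described above (not the library's actual implementation), a word-window splitter with a 50-word length and 10-word overlap could look like this:

```python
def split_into_word_windows(text: str, split_length: int = 50, split_overlap: int = 10) -> list:
    """Split text into overlapping word windows, e.g. 50-word chunks with a 10-word overlap."""
    words = text.split()
    step = split_length - split_overlap
    return [
        " ".join(words[start:start + split_length])
        for start in range(0, max(len(words) - split_overlap, 1), step)
    ]

chunks = split_into_word_windows("word " * 120)  # 120 dummy words -> 3 overlapping chunks
print(len(chunks))
```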

<a name="question_generator.QuestionGenerator.__init__"></a>
#### \_\_init\_\_

```python
| __init__(model_name_or_path="valhalla/t5-base-e2e-qg", model_version=None, num_beams=4, max_length=256, no_repeat_ngram_size=3, length_penalty=1.5, early_stopping=True, split_length=50, split_overlap=10, prompt="generate questions:")
```

Uses the valhalla/t5-base-e2e-qg model by default. This class supports any question generation model that is
implemented as a Seq2SeqLM in HuggingFace Transformers. Note that this style of question generation (where the only input
is a document) is sometimes referred to as end-to-end question generation. Answer-supervised question
generation is not currently supported.
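
A short usage sketch; the `generate()` method name and the import path are assumptions based on the module layout above:

```python
from haystack.question_generator import QuestionGenerator  # assumed import path

question_generator = QuestionGenerator()  # loads valhalla/t5-base-e2e-qg by default
text = (
    "Arya Stark is the youngest daughter of Eddard and Catelyn Stark "
    "and the younger sister of Sansa Stark."
)
questions = question_generator.generate(text)  # assumed method; returns generated questions
print(questions)
```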
