From 11a2f12db07fbb471027df69dfae4f8730fc7225 Mon Sep 17 00:00:00 2001
From: Benjamin Trent
Date: Thu, 28 Apr 2022 08:09:41 -0400
Subject: [PATCH] [ML] show expected model outputs for each nlp task type
 (#2112)

* [ML] show expected model outputs for each nlp task type

Co-authored-by: Lisa Cawley
Co-authored-by: lcawl
(cherry picked from commit ff07d2594911d773a30e20d1ce6225d08e32e9ed)
---
 .../en/stack/ml/nlp/ml-nlp-model-ref.asciidoc | 119 +++++++++++++++++-
 1 file changed, 118 insertions(+), 1 deletion(-)

diff --git a/docs/en/stack/ml/nlp/ml-nlp-model-ref.asciidoc b/docs/en/stack/ml/nlp/ml-nlp-model-ref.asciidoc
index 42ada6e9c..c6d245f38 100644
--- a/docs/en/stack/ml/nlp/ml-nlp-model-ref.asciidoc
+++ b/docs/en/stack/ml/nlp/ml-nlp-model-ref.asciidoc
@@ -45,7 +45,6 @@ refer to <>.
 * https://huggingface.co/elastic/distilbert-base-cased-finetuned-conll03-english
 * https://huggingface.co/philschmid/distilroberta-base-ner-conll2003
 
-
 [discrete]
 [[ml-nlp-model-ref-text-embedding]]
 == Third party text embedding models
@@ -97,3 +96,121 @@ Using `DPREncoderWrapper`:
 * https://huggingface.co/valhalla/distilbart-mnli-12-6
 * https://huggingface.co/cross-encoder/nli-distilroberta-base
 * https://huggingface.co/cross-encoder/nli-roberta-base
+
+[discrete]
+== Expected model output
+
+Models used for each NLP task type must output tensors of a specific format to be used in the Elasticsearch NLP pipelines.
+
+Here are the expected outputs for each task type.
+
+[discrete]
+=== Fill mask expected model output
+
+Fill mask is a specific kind of token classification; it is the base training task of many transformer models.
+
+For the Elastic Stack's fill mask NLP task to understand the model output, it must have a specific format. It needs to
+be a float tensor with `shape(<number of sequences>, <number of tokens>, <vocab size>)`.
+
+Here is an example with a single sequence `"The capital of [MASK] is Paris"` and with the vocabulary
+`["The", "capital", "of", "is", "Paris", "France", "[MASK]"]`.
+
+The model should output:
+
+[source]
+----
+[
+  [
+    [ 0, 0, 0, 0, 0, 0, 0 ],                  // The
+    [ 0, 0, 0, 0, 0, 0, 0 ],                  // capital
+    [ 0, 0, 0, 0, 0, 0, 0 ],                  // of
+    [ 0.01, 0.01, 0.3, 0.01, 0.2, 1.2, 0.1 ], // [MASK]
+    [ 0, 0, 0, 0, 0, 0, 0 ],                  // is
+    [ 0, 0, 0, 0, 0, 0, 0 ]                   // Paris
+  ]
+]
+----
+
+The predicted value here for `[MASK]` is `"France"`, with a score of 1.2.
+
+[discrete]
+=== Named entity recognition expected model output
+
+Named entity recognition is a specific token classification task. Each token in the sequence is scored against
+a specific set of classification labels. For the Elastic Stack, we use Inside-Outside-Beginning (IOB) tagging. Additionally,
+only the following classification labels are supported: "O", "B_MISC", "I_MISC", "B_PER", "I_PER", "B_ORG", "I_ORG", "B_LOC", "I_LOC".
+
+The `"O"` entity label indicates that the current token is outside any entity.
+`"I"` indicates that the token is inside an entity.
+`"B"` indicates the beginning of an entity.
+`"MISC"` is a miscellaneous entity.
+`"LOC"` is a location.
+`"PER"` is a person.
+`"ORG"` is an organization.
+
+The response format must be a float tensor with `shape(<number of sequences>, <number of tokens>, <number of classification labels>)`.
+
+Here is an example with a single sequence `"Waldo is in Paris"`:
+
+[source]
+----
+[
+  [
+//  "O", "B_MISC", "I_MISC", "B_PER", "I_PER", "B_ORG", "I_ORG", "B_LOC", "I_LOC"
+    [ 0,  0,       0,        0.4,     0.5,     0,       0.1,     0,       0   ], // Waldo
+    [ 1,  0,       0,        0,       0,       0,       0,       0,       0   ], // is
+    [ 1,  0,       0,        0,       0,       0,       0,       0,       0   ], // in
+    [ 0,  0,       0,        0,       0,       0,       0,       0,       1.0 ]  // Paris
+  ]
+]
+----
+
+[discrete]
+=== Text embedding expected model output
+
+Text embedding allows for semantic embedding of text for dense information retrieval.
+The output of the model must be the embedding itself, with no additional pooling applied.
+
+Eland does this wrapping for the aforementioned models. If you supply your own model, it must output the embedding for
+each inferred sequence.
+
+[discrete]
+=== Text classification expected model output
+
+With text classification (for example, in tasks like sentiment analysis), the entire sequence is classified. The output of
+the model must be a float tensor with `shape(<number of sequences>, <number of classification labels>)`.
+
+Here is an example with two sequences for a binary classification model with the classes "happy" and "sad":
+
+[source]
+----
+[
+  [
+//  happy, sad
+    [ 0,    1 ], // first sequence
+    [ 1,    0 ]  // second sequence
+  ]
+]
+----
+
+[discrete]
+=== Zero-shot text classification expected model output
+
+Zero-shot text classification allows text to be classified against arbitrary labels that were not necessarily part of the original
+training. Each sequence is combined with each label given some hypothesis template. The model then scores each of these
+combinations according to `[entailment, neutral, contradiction]`. The output of the model must be a float tensor
+with `shape(<number of sequences>, <number of labels>, 3)`.
+
+Here is an example with a single sequence classified against 4 labels:
+
+[source]
+----
+[
+  [
+//  entailment, neutral, contradiction
+    [ 0.5,      0.1,     0.4 ], // first label
+    [ 0,        0,       1   ], // second label
+    [ 1,        0,       0   ], // third label
+    [ 0.7,      0.2,     0.1 ]  // fourth label
+  ]
+]
+----
\ No newline at end of file
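
As a local sanity check before importing, it can help to confirm that a candidate model really emits the fill mask shape described above. The following sketch is an illustration only, not part of the patch: it assumes the `torch` and `transformers` packages are installed, and `bert-base-uncased` is a stand-in for any Hugging Face fill-mask model.

[source,python]
----
# A minimal sketch, assuming `torch` and `transformers` are installed;
# `bert-base-uncased` stands in for any Hugging Face fill-mask model.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

inputs = tokenizer("The capital of [MASK] is Paris.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Expected contract: a float tensor with one score per token per vocabulary
# entry, i.e. shape(<number of sequences>, <number of tokens>, <vocab size>).
assert logits.dtype == torch.float32
assert logits.shape == (1, inputs["input_ids"].shape[1], model.config.vocab_size)

# The highest-scoring vocabulary entry at the [MASK] position is the prediction.
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero().item()
print(tokenizer.decode([logits[0, mask_pos].argmax().item()]))  # "france"
----

In practice, Eland's import tooling handles tracing and uploading the model; a check like this only verifies the raw output contract locally.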