docs: extend tutorial14 about query classification (#3013)

* first draft for tutorial extension
* forgotten markdown
* improved tutorial
* Apply suggestions from code review Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>
* add markdown
* first draft for tutorial extension
* forgotten markdown
* improved tutorial
* Apply suggestions from code review Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>
* add markdown
* little corrections
* little corrections and add py tutorial
* Update tutorials/Tutorial14_Query_Classifier.ipynb Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>
* Update tutorials/Tutorial14_Query_Classifier.ipynb Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>
* Update tutorials/Tutorial14_Query_Classifier.ipynb Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>
* Update tutorials/Tutorial14_Query_Classifier.ipynb Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>
* update tutorial webpage
* fix typo

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>
Co-authored-by: Thomas Stadelmann <thomas.stadelmann@deepset.ai>
deepset-ai · Aug 12, 2022 · 4f261a4
1 parent 5b06658
commit 4f261a4

Showing 3 changed files with 430 additions and 114 deletions.
diff --git a/docs/_src/tutorials/tutorials/14.md b/docs/_src/tutorials/tutorials/14.md
@@ -56,7 +56,7 @@ Next we make sure the latest version of Haystack is installed:
 !pip install pygraphviz
 ```
 
-## Logging
+### Logging
 
 We configure how logging messages should be displayed and which log level should be used before importing Haystack.
 Example log message:
@@ -83,7 +83,7 @@ from haystack.nodes import SklearnQueryClassifier
 keyword_classifier = SklearnQueryClassifier()
 ```
 
-Now let's feed some queries into this query classifier. We'll test with one keyword query, one interrogative query, and one statement query. Notice that we don't use any punctuation, such as question marks; this illustrates that the classifier doesn't need punctuation in order to make the right decision.
+Now let's feed some queries into this query classifier. We'll test with one keyword query, one interrogative query, and one statement query. Note that we don't need to use any punctuation, such as question marks, for the query classifier to make the right decision.
 
 
 ```python
@@ -94,7 +94,7 @@ queries = [
 ]
 ```
 
-We can see below what our classifier does with these queries: "Arya Stark father" is rightly determined to be a keyword query and is sent to branch 2, while both the interrogative query "Who was the father of Arya Stark" and the statement query "Lord Eddard was the father of Arya Stark" are correctly labeled as non-keyword queries, and are thus shipped off to branch 1.
+Below, you can see what the classifier does with these queries: it correctly determines that "Arya Stark father" is a keyword query and sends it to branch 2. It also correctly classifies both the interrogative query "Who was the father of Arya Stark" and the statement query "Lord Eddard was the father of Arya Stark" as non-keyword queries, and sends them to branch 1.
 
 
 ```python
@@ -111,7 +111,7 @@ for query in queries:
 pd.DataFrame.from_dict(k_vs_qs_results)
 ```
 
-Next we will illustrate a **question vs. statement** `SklearnQueryClassifier`. We define our classifier below; notice that this time we have to explicitly specify the model and vectorizer, since the default for an `SklearnQueryClassifier` (and a `TransformersQueryClassifier`) is keyword vs. question/statement classification.
+Next, we will illustrate a **question vs. statement** `SklearnQueryClassifier`. We define our classifier below. Note that this time we have to explicitly specify the model and vectorizer since the default for a `SklearnQueryClassifier` (and a `TransformersQueryClassifier`) is keyword vs. question/statement classification.
 
 
 ```python
@@ -146,7 +146,7 @@ for query in queries:
 pd.DataFrame.from_dict(q_vs_s_results)
 ```
 
-And as we see, the question "Who was the father of Arya Stark" is sent to branch 1, while the statement "Lord Eddard was the father of Arya Stark" is sent to branch 2, so we can have our pipeline treat statements and questions differently.
+And as we see, the question "Who was the father of Arya Stark" is sent to branch 1, while the statement "Lord Eddard was the father of Arya Stark" is sent to branch 2. This means we can have our pipeline treat statements and questions differently.
 
 ### Using Query Classifiers in a Pipeline
 
@@ -237,7 +237,7 @@ sklearn_keyword_classifier.add_node(component=reader, name="QAReader", inputs=["
 sklearn_keyword_classifier.draw("sklearn_keyword_classifier.png")
 ```
 
-Below we can see some results from this choice in branching structure: the keyword query "arya stark father" and the question query "Who is the father of Arya Stark?" generate noticeably different results, a distinction that is likely due to the use of different retrievers for keyword vs. question/statement queries.
+Below, we can see how this choice of branching structure affects the results: the keyword query "arya stark father" and the question query "Who is the father of Arya Stark?" generate noticeably different results, a distinction that is likely due to the use of different retrievers for keyword vs. question/statement queries.
 
 
 ```python
@@ -320,7 +320,7 @@ transformer_question_classifier.add_node(component=reader, name="QAReader", inpu
 transformer_question_classifier.draw("transformer_question_classifier.png")
 ```
 
-And below we see the results of this pipeline: with a question query like "Who is the father of Arya Stark?" we get back answers returned by a reader, but with a statement query like "Arya Stark was the daughter of a Lord" we just get back documents returned by a retriever.
+And here are the results of this pipeline: with a question query like "Who is the father of Arya Stark?", we obtain answers from a reader, and with a statement query like "Arya Stark was the daughter of a Lord", we just obtain documents from a retriever.
 
 
 ```python
@@ -339,6 +339,109 @@ print(f"\n\n{equal_line}\nSTATEMENT QUERY RESULTS\n{equal_line}")
 print_documents(res_2)
 ```
 
+### Other use cases for Query Classifiers: custom classification models and zero-shot classification
+
+`TransformersQueryClassifier` is very flexible and also supports other options for classifying queries.
+For example, we may be interested in detecting the sentiment of a query or classifying its topic. We can do this by loading a custom classification model from the Hugging Face Hub or by using zero-shot classification.
+
+#### Custom classification model vs zero-shot classification
+- Traditional text classification models are trained to predict one of a few "hard-coded" classes and require a dedicated training dataset. In the Hugging Face Hub, you can find many pre-trained models, maybe even related to your domain of interest.
+- Zero-shot classification is very versatile: by choosing a suitable base transformer, you can classify the text without any training dataset. You just have to provide the candidate categories.
+
+#### Using custom classification models
+We can use a public model, available in the Hugging Face Hub. For example, if we want to classify the sentiment of the queries, we can choose an appropriate model, such as https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment.
+
+*In this case, the `labels` parameter must contain a list with the exact model labels.
+The first label we provide corresponds to output_1, the second label to output_2, and so on.*
+
+
+```python
+from haystack.nodes import TransformersQueryClassifier
+
+# Remember to compile a list with the exact model labels
+# The first label you provide corresponds to output_1, the second label to output_2, and so on.
+labels = ["LABEL_0", "LABEL_1", "LABEL_2"]
+
+sentiment_query_classifier = TransformersQueryClassifier(
+    model_name_or_path="cardiffnlp/twitter-roberta-base-sentiment",
+    use_gpu=True,
+    task="text-classification",
+    labels=labels,
+)
+```
+
+
+```python
+queries = [
+    "What's the answer?",  # neutral query
+    "Would you be so lovely to tell me the answer?",  # positive query
+    "Can you give me the damn right answer for once??",  # negative query
+]
+```
+
+
+```python
+import pandas as pd
+
+sent_results = {"Query": [], "Output Branch": [], "Class": []}
+
+for query in queries:
+    result = sentiment_query_classifier.run(query=query)
+    sent_results["Query"].append(query)
+    sent_results["Output Branch"].append(result[1])
+    if result[1] == "output_1":
+        sent_results["Class"].append("negative")
+    elif result[1] == "output_2":
+        sent_results["Class"].append("neutral")
+    elif result[1] == "output_3":
+        sent_results["Class"].append("positive")
+
+pd.DataFrame.from_dict(sent_results)
+```
+
+#### Using zero-shot classification
+You can also perform zero-shot classification by providing a suitable base transformer model and **choosing** the classes the model should predict.
+For example, we may be interested in whether the user query is related to music or cinema.
+
+*In this case, the `labels` parameter is a list containing the candidate classes.*
+
+
+```python
+# In zero-shot-classification, you can choose the labels
+labels = ["music", "cinema"]
+
+query_classifier = TransformersQueryClassifier(
+    model_name_or_path="typeform/distilbert-base-uncased-mnli",
+    use_gpu=True,
+    task="zero-shot-classification",
+    labels=labels,
+)
+```
+
+
+```python
+queries = [
+    "In which films does John Travolta appear?",  # query about cinema
+    "What is the Rolling Stones first album?",  # query about music
+    "Who was Sergio Leone?",  # query about cinema
+]
+```
+
+
+```python
+import pandas as pd
+
+query_classification_results = {"Query": [], "Output Branch": [], "Class": []}
+
+for query in queries:
+    result = query_classifier.run(query=query)
+    query_classification_results["Query"].append(query)
+    query_classification_results["Output Branch"].append(result[1])
+    query_classification_results["Class"].append("music" if result[1] == "output_1" else "cinema")
+
+pd.DataFrame.from_dict(query_classification_results)
+```
+
 ## About us
 
 This [Haystack](https://github.com/deepset-ai/haystack/) notebook was made with love by [deepset](https://deepset.ai/) in Berlin, Germany
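
The tutorial text added in this commit states that the first entry in `labels` corresponds to `output_1`, the second to `output_2`, and so on. A minimal sketch of that convention in plain Python, independent of Haystack, might look like the following. The helper names `branch_for_label` and `class_for_branch` and the `SENTIMENT_NAMES` dictionary are hypothetical and not part of the Haystack API; the negative/neutral/positive assignment simply mirrors the `if`/`elif` chain in the diff above.

```python
from typing import Dict, List

# Labels in the order they would be passed to the classifier (taken from the diff).
labels: List[str] = ["LABEL_0", "LABEL_1", "LABEL_2"]

# Assumption: human-readable names, mirroring the if/elif chain in the tutorial.
SENTIMENT_NAMES: Dict[str, str] = {
    "LABEL_0": "negative",
    "LABEL_1": "neutral",
    "LABEL_2": "positive",
}


def branch_for_label(labels: List[str], predicted_label: str) -> str:
    """Return the output branch name for a predicted label (1-based position)."""
    return f"output_{labels.index(predicted_label) + 1}"


def class_for_branch(labels: List[str], branch: str, names: Dict[str, str]) -> str:
    """Translate a branch name such as 'output_2' back to a readable class name."""
    position = int(branch.split("_")[1]) - 1  # "output_2" -> index 1
    return names[labels[position]]


if __name__ == "__main__":
    for label in labels:
        branch = branch_for_label(labels, label)
        print(label, branch, class_for_branch(labels, branch, SENTIMENT_NAMES))
```

A dictionary lookup like this could stand in for the `if`/`elif` chain in the tutorial loop, which keeps the class names next to the label list they depend on.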