Commit
docs: extend tutorial14 about query classification (#3013)
* first draft for tutorial extension

* forgotten markdown

* improved tutorial

* Apply suggestions from code review

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* add markdown

* first draft for tutorial extension

* forgotten markdown

* improved tutorial

* Apply suggestions from code review

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* add markdown

* little corrections

* little corrections and add py tutorial

* Update tutorials/Tutorial14_Query_Classifier.ipynb

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* Update tutorials/Tutorial14_Query_Classifier.ipynb

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* Update tutorials/Tutorial14_Query_Classifier.ipynb

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* Update tutorials/Tutorial14_Query_Classifier.ipynb

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* update tutorial webpage

* fix typo

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>
Co-authored-by: Thomas Stadelmann <thomas.stadelmann@deepset.ai>
3 people committed Aug 12, 2022
1 parent 5b06658 commit 4f261a4
Showing 3 changed files with 430 additions and 114 deletions.
117 changes: 110 additions & 7 deletions docs/_src/tutorials/tutorials/14.md
@@ -56,7 +56,7 @@ Next we make sure the latest version of Haystack is installed:
!pip install pygraphviz
```

## Logging
### Logging

We configure how logging messages should be displayed and which log level should be used before importing Haystack.
Example log message:
@@ -83,7 +83,7 @@ from haystack.nodes import SklearnQueryClassifier
keyword_classifier = SklearnQueryClassifier()
```

Now let's feed some queries into this query classifier. We'll test with one keyword query, one interrogative query, and one statement query. Notice that we don't use any punctuation, such as question marks; this illustrates that the classifier doesn't need punctuation in order to make the right decision.
Now let's feed some queries into this query classifier. We'll test with one keyword query, one interrogative query, and one statement query. Note that we don't need to use any punctuation, such as question marks, for the query classifier to make the right decision.


```python
@@ -94,7 +94,7 @@ queries = [
]
```

We can see below what our classifier does with these queries: "Arya Stark father" is rightly determined to be a keyword query and is sent to branch 2, while both the interrogative query "Who was the father of Arya Stark" and the statement query "Lord Eddard was the father of Arya Stark" are correctly labeled as non-keyword queries, and are thus shipped off to branch 1.
Below, you can see what the classifier does with these queries: it correctly determines that "Arya Stark father" is a keyword query and sends it to branch 2. It also correctly classifies both the interrogative query "Who was the father of Arya Stark" and the statement query "Lord Eddard was the father of Arya Stark" as non-keyword queries, and sends them to branch 1.


```python
@@ -111,7 +111,7 @@ for query in queries:
pd.DataFrame.from_dict(k_vs_qs_results)
```

Next we will illustrate a **question vs. statement** `SklearnQueryClassifier`. We define our classifier below; notice that this time we have to explicitly specify the model and vectorizer, since the default for an `SklearnQueryClassifier` (and a `TransformersQueryClassifier`) is keyword vs. question/statement classification.
Next, we will illustrate a **question vs. statement** `SklearnQueryClassifier`. We define our classifier below. Note that this time we have to explicitly specify the model and vectorizer since the default for a `SklearnQueryClassifier` (and a `TransformersQueryClassifier`) is keyword vs. question/statement classification.


```python
@@ -146,7 +146,7 @@ for query in queries:
pd.DataFrame.from_dict(q_vs_s_results)
```

And as we see, the question "Who was the father of Arya Stark" is sent to branch 1, while the statement "Lord Eddard was the father of Arya Stark" is sent to branch 2, so we can have our pipeline treat statements and questions differently.
And as we see, the question "Who was the father of Arya Stark" is sent to branch 1, while the statement "Lord Eddard was the father of Arya Stark" is sent to branch 2. This means we can have our pipeline treat statements and questions differently.

### Using Query Classifiers in a Pipeline

@@ -237,7 +237,7 @@ sklearn_keyword_classifier.add_node(component=reader, name="QAReader", inputs=["
sklearn_keyword_classifier.draw("sklearn_keyword_classifier.png")
```

Below we can see some results from this choice in branching structure: the keyword query "arya stark father" and the question query "Who is the father of Arya Stark?" generate noticeably different results, a distinction that is likely due to the use of different retrievers for keyword vs. question/statement queries.
Below, we can see how this choice affects the branching structure: the keyword query "arya stark father" and the question query "Who is the father of Arya Stark?" generate noticeably different results, a distinction that is likely due to the use of different retrievers for keyword vs. question/statement queries.


```python
@@ -320,7 +320,7 @@ transformer_question_classifier.add_node(component=reader, name="QAReader", inpu
transformer_question_classifier.draw("transformer_question_classifier.png")
```

And below we see the results of this pipeline: with a question query like "Who is the father of Arya Stark?" we get back answers returned by a reader, but with a statement query like "Arya Stark was the daughter of a Lord" we just get back documents returned by a retriever.
And here are the results of this pipeline: with a question query like "Who is the father of Arya Stark?", we obtain answers from a reader, and with a statement query like "Arya Stark was the daughter of a Lord", we just obtain documents from a retriever.


```python
@@ -339,6 +339,109 @@ print(f"\n\n{equal_line}\nSTATEMENT QUERY RESULTS\n{equal_line}")
print_documents(res_2)
```

### Other use cases for Query Classifiers: custom classification models and zero-shot classification

`TransformersQueryClassifier` is very flexible and also supports other options for classifying queries.
For example, we may be interested in detecting the sentiment of a query or classifying its topic. We can do this by loading a custom classification model from the Hugging Face Hub or by using zero-shot classification.

#### Custom classification model vs zero-shot classification
- Traditional text classification models are trained to predict one of a few "hard-coded" classes and require a dedicated training dataset. In the Hugging Face Hub, you can find many pre-trained models, maybe even related to your domain of interest.
- Zero-shot classification is very versatile: by choosing a suitable base transformer, you can classify the text without any training dataset. You just have to provide the candidate categories.

#### Using custom classification models
We can use a public model available in the Hugging Face Hub. For example, to classify the sentiment of the queries, we can choose an appropriate model, such as https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment.

*In this case, the `labels` parameter must contain a list with the exact model labels.
The first label we provide corresponds to output_1, the second label to output_2, and so on.*


```python
from haystack.nodes import TransformersQueryClassifier

# Remember to compile a list with the exact model labels
# The first label you provide corresponds to output_1, the second label to output_2, and so on.
labels = ["LABEL_0", "LABEL_1", "LABEL_2"]

sentiment_query_classifier = TransformersQueryClassifier(
    model_name_or_path="cardiffnlp/twitter-roberta-base-sentiment",
    use_gpu=True,
    task="text-classification",
    labels=labels,
)
```


```python
queries = [
    "What's the answer?",  # neutral query
    "Would you be so lovely to tell me the answer?",  # positive query
    "Can you give me the damn right answer for once??",  # negative query
]
```


```python
import pandas as pd

sent_results = {"Query": [], "Output Branch": [], "Class": []}

for query in queries:
    result = sentiment_query_classifier.run(query=query)
    sent_results["Query"].append(query)
    sent_results["Output Branch"].append(result[1])
    # The position of each label in `labels` decides the output branch
    if result[1] == "output_1":
        sent_results["Class"].append("negative")
    elif result[1] == "output_2":
        sent_results["Class"].append("neutral")
    elif result[1] == "output_3":
        sent_results["Class"].append("positive")

pd.DataFrame.from_dict(sent_results)
```
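
The `if`/`elif` chain above encodes the rule that the n-th entry of `labels` is routed to `output_n`. As a minimal sketch, the same bookkeeping can be kept in a lookup table. The helper name `build_edge_lookup` and the `class_names` list are illustrative and not part of Haystack; the pairing of the three labels with negative, neutral, and positive is taken from the loop above, and the example result is a hand-written stand-in for what `run()` returns.

```python
def build_edge_lookup(labels, class_names):
    """Map each output edge name to a readable class name.

    Assumes the rule stated in this tutorial: the first label is routed
    to output_1, the second to output_2, and so on.
    """
    if len(labels) != len(class_names):
        raise ValueError("labels and class_names must have the same length")
    return {f"output_{i}": name for i, name in enumerate(class_names, start=1)}


labels = ["LABEL_0", "LABEL_1", "LABEL_2"]
class_names = ["negative", "neutral", "positive"]  # same order as `labels`
edge_to_class = build_edge_lookup(labels, class_names)

# Hand-written stand-in for a classifier result: the second element is
# the edge name, which is what `result[1]` reads in the loop above.
example_result = ({}, "output_3")
print(edge_to_class[example_result[1]])
```

With such a table, the loop body could append `edge_to_class[result[1]]` instead of branching, which keeps the class names next to the `labels` list they depend on.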

#### Using zero-shot classification
You can also perform zero-shot classification by providing a suitable base transformer model and **choosing** the classes the model should predict.
For example, we may be interested in whether the user query is related to music or cinema.

*In this case, the `labels` parameter is a list containing the candidate classes.*


```python
# In zero-shot-classification, you can choose the labels
labels = ["music", "cinema"]

query_classifier = TransformersQueryClassifier(
    model_name_or_path="typeform/distilbert-base-uncased-mnli",
    use_gpu=True,
    task="zero-shot-classification",
    labels=labels,
)
```


```python
queries = [
    "In which films does John Travolta appear?",  # query about cinema
    "What is the Rolling Stones first album?",  # query about music
    "Who was Sergio Leone?",  # query about cinema
]
```


```python
import pandas as pd

query_classification_results = {"Query": [], "Output Branch": [], "Class": []}

for query in queries:
    result = query_classifier.run(query=query)
    query_classification_results["Query"].append(query)
    query_classification_results["Output Branch"].append(result[1])
    query_classification_results["Class"].append("music" if result[1] == "output_1" else "cinema")

pd.DataFrame.from_dict(query_classification_results)
```
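
To make the link between candidate labels and output branches concrete, here is a rough sketch of how a zero-shot prediction could be turned into an edge name. It assumes a prediction shaped like the output of the Hugging Face zero-shot classification pipeline (a dictionary with `sequence`, `labels`, and `scores`). The function `edge_from_zero_shot` is hypothetical, the scores are made up for illustration, and Haystack's internal logic may differ.

```python
def edge_from_zero_shot(prediction, labels):
    """Return the output edge for the highest-scoring candidate label.

    `labels` is the list passed to the classifier; its order decides
    which edge each candidate label is routed to.
    """
    best_label, _ = max(
        zip(prediction["labels"], prediction["scores"]),
        key=lambda pair: pair[1],
    )
    return f"output_{labels.index(best_label) + 1}"


labels = ["music", "cinema"]

# Hypothetical prediction with made-up scores, not real model output.
prediction = {
    "sequence": "Who was Sergio Leone?",
    "labels": ["cinema", "music"],
    "scores": [0.9, 0.1],
}

edge = edge_from_zero_shot(prediction, labels)
print(edge)  # output_2, because "cinema" is the second entry in `labels`
```

Because the edge depends only on the position of the winning label in `labels`, reordering that list changes which branch each class is sent to.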

## About us

This [Haystack](https://github.com/deepset-ai/haystack/) notebook was made with love by [deepset](https://deepset.ai/) in Berlin, Germany