
Add QuestionGenerator #1267

Merged
merged 10 commits into master from question_generation on Jul 26, 2021
Conversation

brandenchan
Contributor

@brandenchan brandenchan commented Jul 9, 2021

This PR creates a QuestionGenerator class that takes a document as input and generates questions as output. Since the default models only seem to generate about 3 questions per passage, the input text is split into 50-word chunks with a 10-word overlap between consecutive chunks. This is somewhat inefficient, since T5's maximum sequence length is significantly longer than 50 words, but it allows us to handle variable-length documents.
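The windowing described above can be sketched roughly like this (a hypothetical helper written for illustration; the PR's actual splitting code may differ):

```python
def split_with_overlap(text, chunk_size=50, overlap=10):
    """Split `text` into chunks of `chunk_size` words, where consecutive
    chunks share `overlap` words. Illustrative sketch, not the PR's code."""
    words = text.split()
    stride = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), stride):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            # The last chunk already reaches the end of the text.
            break
    return chunks
```

With the defaults, a 100-word document yields three chunks, and the last 10 words of each chunk reappear at the start of the next, so questions about sentences that straddle a chunk boundary are not lost.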

As a bonus, this also allows the DocumentStore to be used as an iterator

TODO:

  • Documentation
  • Design implementation into pipelines

For future:

  • Consider how to speed up QG
  • Consider a QuestGen implementation

@lalitpagaria
Contributor

Oh nice addition.

BTW, https://github.com/ramsrigouthamg/Questgen.ai could also be integrated, though in a separate PR, if you think it would be beneficial.

cc: @ramsrigouthamg

@brandenchan brandenchan self-assigned this Jul 9, 2021
Member

@julian-risch julian-risch left a comment


I left some minor comments. The code looks good to me. However, before merging, please add some test cases for the new QuestionGenerator node. You could reuse the tutorial code for that.

if len(self.ids_iterator) == 0:
    raise StopIteration
else:
    curr_id = self.ids_iterator[0]
Member

@julian-risch julian-risch Jul 12, 2021


curr_id = self.ids_iterator.pop(0) could be used here so that self.ids_iterator = self.ids_iterator[1:] is not necessary. Just a thought, not a request for a change.
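The suggested simplification could look like this in a minimal, self-contained sketch (the class name and surrounding context are illustrative stand-ins, not the actual BaseDocumentStore code):

```python
class DocIdIterator:
    """Minimal sketch of the reviewer's suggestion: pop(0) both returns and
    removes the current id, so no separate slice reassignment is needed."""

    def __init__(self, ids):
        self.ids_iterator = list(ids)

    def __iter__(self):
        return self

    def __next__(self):
        if len(self.ids_iterator) == 0:
            raise StopIteration
        return self.ids_iterator.pop(0)
```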

@brandenchan
Contributor Author


Yup good idea!

@@ -591,6 +592,70 @@ def run(self, **kwargs):
return output


class QuestionGenerationPipeline(BaseStandardPipeline):
Member


I understand that having standard pipelines, such as QuestionGenerationPipeline, RetrieverQuestionGenerationPipeline, and QuestionAnswerGenerationPipeline, makes it easier to run a pipeline because there is no need to define the individual pipeline components. However, with more and more standard pipelines, how do users know which standard pipeline fits their needs? We already have DocumentSearchPipeline, ExtractiveQAPipeline, GenerativeQAPipeline, FAQPipeline, SearchSummarizationPipeline, and TranslationWrapperPipeline. Maybe we need an overview of all these pipelines in the documentation, and some guidelines on when to define a new standard pipeline?

@@ -0,0 +1,53 @@
from transformers import pipeline
Member


All other tutorials are numbered so let's rename this file to Tutorial13_Question_Generation.py.

outgoing_edges = 1

def __init__(self,
             model_name_or_path="valhalla/t5-base-e2e-qg",
Contributor


It would be nice to tell users what types of model are supported (e.g. T5, GPT), or to state that only text-generation models can be used here.

Member


Agreed. Maybe this link could help here as a comment: https://huggingface.co/models?filter=question-generation
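For illustration, a minimal sketch of calling such a checkpoint with transformers. The "generate questions: " prefix and "<sep>"-separated output follow the valhalla/t5-base-e2e-qg model card; other checkpoints may use different conventions, and the helper names here are made up:

```python
def parse_questions(raw):
    """Split a '<sep>'-joined generation into individual questions."""
    return [q.strip() for q in raw.split("<sep>") if q.strip()]

def generate_questions(text):
    """Hypothetical helper: run an end-to-end QG checkpoint via transformers.
    Import is local so the parsing helper stays usable without the model."""
    from transformers import pipeline  # needs transformers + a model download
    qg = pipeline("text2text-generation", model="valhalla/t5-base-e2e-qg")
    raw = qg("generate questions: " + text)[0]["generated_text"]
    return parse_questions(raw)
```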

@@ -65,6 +66,20 @@ def get_all_documents(
"""
pass

def __iter__(self):
    if not self.ids_iterator:
        self.ids_iterator = [x.id for x in self.get_all_documents()]
Member


I understand this is a generic solution in the BaseDocumentStore. Maybe it makes sense to implement a get_all_document_ids() method in the individual document stores (although it's more effort)? Or at least keep this in mind for future improvements? Otherwise this call will load a lot of data.
In ElasticSearchDocumentStore, you could actually make use of the existing generator.

@brandenchan
Contributor Author


Yes, I agree there are more efficient ways of implementing the DocumentStore as an iterator. I think that is outside the scope of this PR, but we should think about reimplementing it, especially if this becomes an idiomatic way of accessing docs in the DocumentStore.
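A generator-backed iterator along the lines discussed could look like this minimal sketch (`_scan_documents` is a hypothetical stand-in for a store-specific generator, e.g. one paging through Elasticsearch with the scroll API, so ids are never materialised up front):

```python
class GeneratorBackedStore:
    """Illustrative sketch only, not the actual document store code."""

    def __init__(self, docs):
        self._docs = docs

    def _scan_documents(self):
        # A real store would fetch documents lazily, page by page.
        yield from self._docs

    def __iter__(self):
        # A fresh generator per call keeps the store re-iterable.
        return self._scan_documents()
```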

@review-notebook-app
Check out this pull request on ReviewNB to see visual diffs and provide feedback on Jupyter Notebooks.

@brandenchan brandenchan merged commit 937247d into master Jul 26, 2021
@brandenchan brandenchan deleted the question_generation branch July 26, 2021 15:20
@cvgoudar
Contributor

cvgoudar commented Aug 15, 2021

@brandenchan: For Tutorial 13 to be complete, could the Haystack and Elasticsearch setup be added as part of the tutorial? I tried adding the following to the notebook, and the entire notebook then worked end to end.

!pip install grpcio-tools==1.34.1
!pip install git+https://github.com/deepset-ai/haystack.git
# In Colab / No Docker environments: Start Elasticsearch from source
! wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.9.2-linux-x86_64.tar.gz -q
! tar -xzf elasticsearch-7.9.2-linux-x86_64.tar.gz
! chown -R daemon:daemon elasticsearch-7.9.2

import os
from subprocess import Popen, PIPE, STDOUT
es_server = Popen(['elasticsearch-7.9.2/bin/elasticsearch'],
                   stdout=PIPE, stderr=STDOUT,
                   preexec_fn=lambda: os.setuid(1)  # as daemon
                  )
# wait until ES has started
! sleep 30

@tholor
Member

tholor commented Aug 15, 2021

Hey @cvgoudar, that's right. The current ES init uses Docker and therefore doesn't work on Colab. Do you want to create a PR?
