
Add QuestionGenerator #1267

Merged
merged 10 commits into master from question_generation on Jul 26, 2021
Conversation

brandenchan
Contributor

@brandenchan brandenchan commented Jul 9, 2021

This PR creates a QuestionGenerator class that takes a document as input and generates questions as output. Since the default models only seem to generate about 3 questions per passage, the input text is split into 50-word chunks with a 10-word overlap between consecutive chunks. This is somewhat inefficient, since T5's maximum sequence length is significantly longer than 50 words, but it allows us to handle variable-length documents.
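The windowing described above can be sketched roughly like this (a hypothetical helper written for illustration; the PR's actual splitting code may differ):

```python
def split_with_overlap(text, chunk_size=50, overlap=10):
    """Split `text` into chunks of `chunk_size` words, where consecutive
    chunks share `overlap` words. Illustrative sketch, not the PR's code."""
    words = text.split()
    stride = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), stride):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            # The last chunk already reaches the end of the text.
            break
    return chunks
```

With the defaults, a 100-word document yields three chunks, and the last 10 words of each chunk reappear at the start of the next, so questions about sentences that straddle a chunk boundary are not lost.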

As a bonus, this also allows the DocumentStore to be used as an iterator

TODO:

  • Documentation
  • Design implementation into pipelines

For future:

  • Consider how to speed up QG
  • Consider a QuestGen implementation

@lalitpagaria
Contributor

Oh nice addition.

BTW, https://github.com/ramsrigouthamg/Questgen.ai could also be integrated, though in a separate PR, if you think it would be beneficial.

cc: @ramsrigouthamg

@brandenchan brandenchan self-assigned this Jul 9, 2021
Member

@julian-risch julian-risch left a comment


I left some minor comments. The code looks good to me. However, before merging, please add some test cases for the new QuestionGenerator node. You could reuse the tutorial code for that.

if len(self.ids_iterator) == 0:
    raise StopIteration
else:
    curr_id = self.ids_iterator[0]
Member

@julian-risch julian-risch Jul 12, 2021


curr_id = self.ids_iterator.pop(0) could be used here so that self.ids_iterator = self.ids_iterator[1:] is not necessary. Just a thought, not a request for a change.
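The suggested simplification could look like this in a minimal, self-contained sketch (the class name and surrounding context are illustrative stand-ins, not the actual BaseDocumentStore code):

```python
class DocIdIterator:
    """Minimal sketch of the reviewer's suggestion: pop(0) both returns and
    removes the current id, so no separate slice reassignment is needed."""

    def __init__(self, ids):
        self.ids_iterator = list(ids)

    def __iter__(self):
        return self

    def __next__(self):
        if len(self.ids_iterator) == 0:
            raise StopIteration
        return self.ids_iterator.pop(0)
```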

@brandenchan
Contributor Author


Yup good idea!

@@ -591,6 +592,70 @@ def run(self, **kwargs):
return output


class QuestionGenerationPipeline(BaseStandardPipeline):
Member


I understand that having standard pipelines, such as QuestionGenerationPipeline, RetrieverQuestionGenerationPipeline, and QuestionAnswerGenerationPipeline, makes it easier to run a pipeline because there is no need to define the individual pipeline components. However, with more and more standard pipelines, how do users know which standard pipeline fits their needs? We already have DocumentSearchPipeline, ExtractiveQAPipeline, GenerativeQAPipeline, FAQPipeline, SearchSummarizationPipeline, and TranslationWrapperPipeline. Maybe we need an overview of all these pipelines in the documentation, and some guidelines on when to define a new standard pipeline?

@@ -0,0 +1,53 @@
from transformers import pipeline
Member


All other tutorials are numbered so let's rename this file to Tutorial13_Question_Generation.py.

outgoing_edges = 1

def __init__(self,
             model_name_or_path="valhalla/t5-base-e2e-qg",
Contributor


It would be nice to tell users what types of model are supported (e.g. T5, GPT), or to state that only text-generation models can be used here.

Member


Agreed. Maybe this link could help here as a comment: https://huggingface.co/models?filter=question-generation
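For illustration, a minimal sketch of calling such a checkpoint with transformers. The "generate questions: " prefix and "<sep>"-separated output follow the valhalla/t5-base-e2e-qg model card; other checkpoints may use different conventions, and the helper names here are made up:

```python
def parse_questions(raw):
    """Split a '<sep>'-joined generation into individual questions."""
    return [q.strip() for q in raw.split("<sep>") if q.strip()]

def generate_questions(text):
    """Hypothetical helper: run an end-to-end QG checkpoint via transformers.
    Import is local so the parsing helper stays usable without the model."""
    from transformers import pipeline  # needs transformers + a model download
    qg = pipeline("text2text-generation", model="valhalla/t5-base-e2e-qg")
    raw = qg("generate questions: " + text)[0]["generated_text"]
    return parse_questions(raw)
```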

@@ -65,6 +66,20 @@ def get_all_documents(
"""
pass

def __iter__(self):
    if not self.ids_iterator:
        self.ids_iterator = [x.id for x in self.get_all_documents()]
Member


I understand this is a generic solution in the BaseDocumentStore. Maybe it makes sense to implement a get_all_document_ids() method in the individual document stores (although it's more effort)? Or at least keep this in mind for future improvements? Otherwise this call will load a lot of data.
In ElasticSearchDocumentStore, you could actually make use of the existing generator.

@brandenchan
Contributor Author


Yes, I agree there are more efficient ways of implementing the DocumentStore as an iterator. I think that is outside the scope of this PR, but we should think about reimplementing it, especially if this becomes an idiomatic way of accessing docs in the DocumentStore.
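A generator-backed iterator along the lines discussed could look like this minimal sketch (`_scan_documents` is a hypothetical stand-in for a store-specific generator, e.g. one paging through Elasticsearch with the scroll API, so ids are never materialised up front):

```python
class GeneratorBackedStore:
    """Illustrative sketch only, not the actual document store code."""

    def __init__(self, docs):
        self._docs = docs

    def _scan_documents(self):
        # A real store would fetch documents lazily, page by page.
        yield from self._docs

    def __iter__(self):
        # A fresh generator per call keeps the store re-iterable.
        return self._scan_documents()
```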

@review-notebook-app
Check out this pull request on ReviewNB to see visual diffs and provide feedback on Jupyter Notebooks.

@brandenchan brandenchan merged commit 937247d into master Jul 26, 2021
@brandenchan brandenchan deleted the question_generation branch July 26, 2021 15:20
@cvgoudar
Contributor

cvgoudar commented Aug 15, 2021

@brandenchan: For Tutorial 13 to be complete, could the Haystack and Elasticsearch setup be added as part of the tutorial? I tried adding the following to the notebook, and the entire notebook then worked end to end.

!pip install grpcio-tools==1.34.1
!pip install git+https://github.com/deepset-ai/haystack.git
# In Colab / No Docker environments: Start Elasticsearch from source
! wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.9.2-linux-x86_64.tar.gz -q
! tar -xzf elasticsearch-7.9.2-linux-x86_64.tar.gz
! chown -R daemon:daemon elasticsearch-7.9.2

import os
from subprocess import Popen, PIPE, STDOUT
es_server = Popen(['elasticsearch-7.9.2/bin/elasticsearch'],
                   stdout=PIPE, stderr=STDOUT,
                   preexec_fn=lambda: os.setuid(1)  # as daemon
                  )
# wait until ES has started
! sleep 30

@tholor
Member

tholor commented Aug 15, 2021

Hey @cvgoudar, that's right. The current ES init uses Docker and therefore doesn't work on Colab. Do you want to create a PR?
