Add QuestionGenerator #1267
Conversation
Oh, nice addition. BTW, this one could also be integrated: https://github.com/ramsrigouthamg/Questgen.ai, though in a separate PR if you think it would be beneficial. cc: @ramsrigouthamg
I left some minor comments. The code looks good to me. However, before merging, please add some test cases for the new QuestionGenerator node. You could reuse the tutorial code for that.
if len(self.ids_iterator) == 0:
    raise StopIteration
else:
    curr_id = self.ids_iterator[0]
curr_id = self.ids_iterator.pop(0) could be used here so that self.ids_iterator = self.ids_iterator[1:] is not necessary. Just a thought, not a request for a change.
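To make the suggestion concrete, here is a minimal sketch of an iterator using pop(0) that way. The class and its name are illustrative stand-ins for the BaseDocumentStore iterator logic, not the PR's actual code:

```python
class IdIterator:
    """Toy stand-in for the document-store id iterator (illustrative only)."""

    def __init__(self, ids):
        self.ids_iterator = list(ids)

    def __iter__(self):
        return self

    def __next__(self):
        if len(self.ids_iterator) == 0:
            raise StopIteration
        # pop(0) both reads and removes the first id, so a separate
        # `self.ids_iterator = self.ids_iterator[1:]` step is not needed.
        curr_id = self.ids_iterator.pop(0)
        return curr_id


print(list(IdIterator(["a", "b", "c"])))  # -> ['a', 'b', 'c']
```

Note that pop(0) is O(n) on a list; a collections.deque with popleft() would avoid that, but for id lists of moderate size the difference is negligible.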
Yup good idea!
@@ -591,6 +592,70 @@ def run(self, **kwargs):
        return output


class QuestionGenerationPipeline(BaseStandardPipeline):
I understand that having standard pipelines, such as QuestionGenerationPipeline, RetrieverQuestionGenerationPipeline, and QuestionAnswerGenerationPipeline, makes it easier to run a pipeline because there is no need to define the individual pipeline components. However, with more and more standard pipelines, how do users know which standard pipeline fits their needs? We already have DocumentSearchPipeline, ExtractiveQAPipeline, GenerativeQAPipeline, FAQPipeline, SearchSummarizationPipeline, and TranslationWrapperPipeline. Maybe we need an overview of all these pipelines in the documentation? And some guidelines on when to define a new standard pipeline?
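To illustrate the trade-off being discussed: a standard pipeline is essentially a thin wrapper that pre-wires the nodes a user would otherwise connect by hand. The following is a toy sketch of that pattern, not Haystack's actual classes or API:

```python
class Pipeline:
    """Toy generic pipeline: runs components in the order they were added."""

    def __init__(self):
        self.nodes = []

    def add_node(self, component, name):
        self.nodes.append((name, component))

    def run(self, data):
        for _, component in self.nodes:
            data = component(data)
        return data


class QuestionGenerationPipeline:
    """Toy 'standard pipeline': the wiring is fixed in __init__,
    so users only supply the component, not the graph."""

    def __init__(self, question_generator):
        self.pipeline = Pipeline()
        self.pipeline.add_node(component=question_generator, name="QuestionGenerator")

    def run(self, documents):
        return self.pipeline.run(documents)


# Each new wrapper saves a few lines of wiring, but also adds one more
# name users must choose between -- hence the call for documentation.
qg = lambda docs: [f"What is {d}?" for d in docs]
print(QuestionGenerationPipeline(qg).run(["Berlin"]))  # -> ['What is Berlin?']
```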
tutorials/question_generation.py
Outdated
@@ -0,0 +1,53 @@
from transformers import pipeline
All other tutorials are numbered, so let's rename this file to Tutorial13_Question_Generation.py.
outgoing_edges = 1

def __init__(self,
             model_name_or_path="valhalla/t5-base-e2e-qg",
It would be nice to suggest to users which types of models are supported, e.g. T5 or GPT, or to state that only text-generation models can be used here.
Agreed. Maybe this link could help here as a comment: https://huggingface.co/models?filter=question-generation
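One way to surface both suggestions in the code would be a docstring hint on the constructor. A sketch only: the parameter name and default come from the diff above, the docstring wording is just a proposal:

```python
class QuestionGenerator:
    def __init__(self, model_name_or_path="valhalla/t5-base-e2e-qg"):
        """
        :param model_name_or_path: Name or path of a text-generation model
            (e.g. a T5 variant fine-tuned for question generation). See
            https://huggingface.co/models?filter=question-generation
            for suitable candidates.
        """
        self.model_name_or_path = model_name_or_path
```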
@@ -65,6 +66,20 @@ def get_all_documents(
        """
        pass

    def __iter__(self):
        if not self.ids_iterator:
            self.ids_iterator = [x.id for x in self.get_all_documents()]
I understand this is a generic solution in the BaseDocumentStore. Maybe it makes sense to implement a get_all_document_ids() method in the individual document stores (although it's more effort)? Or at least keep this in mind for future improvements? Otherwise this call will load a lot of data.
In ElasticSearchDocumentStore, you could actually make use of the existing generator:
yield document
Yes, I agree there are more efficient ways of implementing the DocumentStore as an iterator. I think that is outside the scope of this PR, but we should think about reimplementing it, especially if this becomes an idiomatic way of accessing docs in the DocumentStore.
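A generator-based __iter__ along those lines might look like the sketch below. The store class and its generator method are hypothetical stand-ins; the real ElasticSearchDocumentStore could instead yield from its existing scroll-based generator, as suggested above:

```python
class InMemoryStore:
    """Hypothetical document store, used only to illustrate the pattern."""

    def __init__(self, documents):
        self._documents = documents

    def get_all_documents_generator(self):
        # In ElasticSearchDocumentStore this could wrap the existing
        # scroll-based generator instead of materialising all documents.
        for document in self._documents:
            yield document

    def __iter__(self):
        # Lazily yields documents, avoiding loading every id up front.
        return self.get_all_documents_generator()


store = InMemoryStore(["doc1", "doc2"])
print([d for d in store])  # -> ['doc1', 'doc2']
```

Because __iter__ returns a fresh generator on each call, the store can be iterated multiple times, unlike the stateful ids_iterator approach.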
@brandenchan: For tutorial 13 to be complete, could the Haystack and Elasticsearch setup be added as part of the tutorial? I tried in the notebook just by adding the following, and the entire notebook worked end to end.
Hey @cvgoudar, that's right. The current ES init uses Docker and is therefore not working on Colab. Do you want to create a PR?
This PR creates a QuestionGenerator class that takes a document as input and generates questions as output. Since the default models used only seem to generate about 3 questions per passage, input text is split into 50-word chunks with a 10-word overlap. This is somewhat inefficient since T5's max sequence length is significantly longer, but it allows us to handle variable-length documents.
As a bonus, this also allows the DocumentStore to be used as an iterator
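The 50-word chunks with 10-word overlap described above can be sketched as a simple sliding window over the document's words. This is an illustrative helper, not the PR's exact splitting code:

```python
def split_into_chunks(text, chunk_len=50, overlap=10):
    """Split `text` into chunks of `chunk_len` words, where
    consecutive chunks share `overlap` words at their boundary."""
    words = text.split()
    stride = chunk_len - overlap  # step between chunk starts
    chunks = []
    for start in range(0, len(words), stride):
        chunks.append(" ".join(words[start:start + chunk_len]))
        if start + chunk_len >= len(words):
            break  # last chunk already covers the end of the text
    return chunks


print(split_into_chunks("one two three four five", chunk_len=3, overlap=1))
# -> ['one two three', 'three four five']
```

The overlap means a sentence cut at a chunk boundary reappears (partially) in the next chunk, so the model still sees enough context to generate questions about it, at the cost of some duplicated computation.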
TODO:
For future: