# Question Generation using Haystack

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/13_Question_generation.ipynb)

This is a bare-bones tutorial showing what is possible with the QuestionGenerator node and its pipelines, which automatically
generate questions that the question generation model thinks a given document can answer.

### Prepare environment

#### Colab: Enable the GPU runtime
Make sure you enable the GPU runtime to experience decent speed in this tutorial.  
**Runtime -> Change Runtime type -> Hardware accelerator -> GPU**

<img src="https://raw.githubusercontent.com/deepset-ai/haystack/main/docs/img/colab_gpu_runtime.jpg">

In [1]:
# Install the latest release of Haystack in your own environment
#! pip install farm-haystack

# Install the latest main of Haystack
!pip install --upgrade pip
!pip install git+https://github.com/deepset-ai/haystack.git#egg=farm-haystack[colab]


## Logging

We configure how logging messages should be displayed and which log level should be used before importing Haystack.
Example log message:
INFO - haystack.utils.preprocessing -  Converting data/tutorial1/218_Olenna_Tyrell.txt
The default log level in basicConfig is WARNING, so the explicit parameter is not necessary, but it makes the level easy to change:

In [1]:
import logging


logging.basicConfig(format="%(levelname)s - %(name)s -  %(message)s", level=logging.WARNING)
logging.getLogger("haystack").setLevel(logging.INFO)
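The effect of these two calls can be checked with the standard library alone. The sketch below repeats the configuration and then inspects a child logger of `haystack`; the second logger name, `some_other_library`, is a hypothetical placeholder for any unrelated package.

```python
import logging

logging.basicConfig(format="%(levelname)s - %(name)s -  %(message)s", level=logging.WARNING)
logging.getLogger("haystack").setLevel(logging.INFO)

# Loggers below "haystack" inherit its INFO level through the dotted-name hierarchy.
haystack_child = logging.getLogger("haystack.utils.preprocessing")
# An unrelated logger (hypothetical name) falls back to the root logger's level.
other = logging.getLogger("some_other_library")

child_level_is_info = haystack_child.getEffectiveLevel() == logging.INFO
print("haystack child logs INFO:", child_level_is_info)
print("other logger level:", logging.getLevelName(other.getEffectiveLevel()))
```

Because the level is set on the `haystack` logger rather than on the root logger, INFO messages from other libraries stay hidden unless the root level is lowered as well.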

In [2]:
# Imports needed to run this notebook

from pprint import pprint
from tqdm import tqdm
from haystack.nodes import QuestionGenerator, BM25Retriever, FARMReader
from haystack.document_stores import ElasticsearchDocumentStore
from haystack.pipelines import (
    QuestionGenerationPipeline,
    RetrieverQuestionGenerationPipeline,
    QuestionAnswerGenerationPipeline,
)
from haystack.utils import launch_es, print_questions

Let's start an Elasticsearch instance with one of the options below:

In [None]:
# Option 1: Start Elasticsearch service via Docker
launch_es()



In [3]:
# Option 2: In Colab / No Docker environments: Start Elasticsearch from source
! wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.9.2-linux-x86_64.tar.gz -q
! tar -xzf elasticsearch-7.9.2-linux-x86_64.tar.gz
! chown -R daemon:daemon elasticsearch-7.9.2

import os
from subprocess import Popen, PIPE, STDOUT

es_server = Popen(
    ["elasticsearch-7.9.2/bin/elasticsearch"], stdout=PIPE, stderr=STDOUT, preexec_fn=lambda: os.setuid(1)  # as daemon
)
# wait until ES has started
! sleep 30
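A fixed 30-second sleep can be too short on a slow machine and wasteful on a fast one. As an alternative, here is a minimal polling sketch. It assumes Elasticsearch listens on its default HTTP port, 9200, on localhost; `es_is_up` and `wait_until_ready` are hypothetical helper names, not part of Haystack.

```python
import http.client
import time
import urllib.request


def es_is_up(url="http://localhost:9200"):
    """Return True if the (assumed) Elasticsearch HTTP endpoint answers with 200."""
    try:
        with urllib.request.urlopen(url, timeout=2) as response:
            return response.status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, or a malformed reply: not ready yet.
        return False


def wait_until_ready(probe, timeout_s=120.0, interval_s=2.0):
    """Call probe() repeatedly until it returns True or timeout_s has elapsed."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Demo with a fake probe that succeeds on the third attempt, so no server is needed.
attempts = iter([False, False, True])
ready = wait_until_ready(lambda: next(attempts), timeout_s=5.0, interval_s=0.01)
print(ready)
```

In the notebook, calling `wait_until_ready(es_is_up)` would take the place of `! sleep 30`.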

Let's load the data and initialize some core components:

In [4]:
import pandas as pd
df=pd.read_csv("out.csv")

In [5]:
df

| Unnamed: 0.1 | Unnamed: 0 | key | sentence | qusetions |
| --- | --- | --- | --- | --- |
| 0 | 0 | software engineering | ['Notable definitions of software engineering include: "The systematic appl... | What are the two universities mandated by IEEE to develop the Software Engin... |
| 1 | 1 | software solutions | ['Software engineers apply engineering principles and knowledge of programmi... | How long has a graduate software engineer had experience? |
| 2 | 2 | computer science | ['Notable definitions of software engineering include: "The systematic appl... | When did the IEEE Computer Society publish the SWEBOK? |
| 3 | 3 | development | ['Notable definitions of software engineering include: "The systematic appl... | What is the name of the ISO/IEC JTC 1/SC 7 subcommit? |
| 4 | 4 | design | ['Notable definitions of software engineering include: "The systematic appl... | What is the ISO standard for the Software Engineering Body of Knowledge? |
| 5 | 5 | object-oriented programming | ['If you’re considering this as a career, here are some skills you should fo... | What is the SWEBOK? |
| 6 | 6 | certifications | ['You can decide to advance toward a role as a senior software engineer, or ... | What is the ISO standard for the Software Engineering Body of Knowledge? |
| 7 | 7 | testing | ['Notable definitions of software engineering include: "The systematic appl... | What is the IEEE's Guide to the Software Engineering Body of Knowledge? |
| 8 | 8 | requirements | ['Tasks might include: Developing applications for iOS, Android, Windows, o... | What is the IEEE Computer Society's Technical Report 1979:2005? |
| 9 | 9 | knowledge | ['Notable definitions of software engineering include: "The systematic appl... | What is the definition of software engineering? |


In [6]:

# Wrap every entry of the "sentence" column in a dict with a "content" key,
# the document format that write_documents() accepts
dic = [{"content": sentence} for sentence in df["sentence"]]


In [7]:
for i in dic:
  print(i)

{'content': '[\'Notable definitions of software engineering include:  "The systematic application of scientific and technological knowledge, methods, and experience to the design, implementation, testing, and documentation of software"—The Bureau of Labor Statistics—IEEE Systems and software engineering – Vocabulary[17] "The application of a systematic, disciplined, quantifiable approach to the development, operation, and maintenance of software"—IEEE Standard Glossary of Software Engineering Terminology[18] "an engineering discipline that is concerned with all aspects of software production"—Ian Sommerville[19] "the establishment and use of sound engineering principles in order to economically obtain software that is reliable and works efficiently on real machines"—Fritz Bauer[20] "a branch of computer science that deals with the design, implementation, and maintenance of complex computer programs"—Merriam-Webster[21] "\\\'software engineering\\\' encompasses not just the act of writi
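The printed documents show that each `sentence` cell is the string form of a Python list (it starts with `['`), so brackets and quote characters end up in the indexed text and in the answer contexts. Below is a minimal sketch of turning such a cell into plain text before indexing. It assumes the cells are valid Python list literals; `sentences_to_text` is a hypothetical helper name.

```python
import ast


def sentences_to_text(raw):
    """Turn a stringified list such as "['a', 'b']" into the plain text "a b"."""
    try:
        parsed = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return raw  # not a Python literal: keep the cell unchanged
    if isinstance(parsed, (list, tuple)):
        return " ".join(str(item).strip() for item in parsed)
    return raw  # some other literal (number, dict, ...): keep the cell unchanged


example = "['First sentence.', 'Second sentence.']"
print(sentences_to_text(example))
```

In the notebook this could be applied while building the document dicts, for example `{"content": sentences_to_text(sentence)}`.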

In [8]:
# Use rows 1, 5 and 6 of the "sentence" column as sample documents
text1 = df["sentence"].values[1]
text2 = df["sentence"].values[5]
text3 = df["sentence"].values[6]
# To index every row instead, use: docs = dic
docs = [{"content": text1}, {"content": text2}, {"content": text3}]

# Initialize document store and write in the documents
document_store = ElasticsearchDocumentStore()
document_store.write_documents(docs)

# Initialize Question Generator
question_generator = QuestionGenerator()

INFO:haystack.telemetry:Haystack sends anonymous usage data to understand the actual usage and steer dev efforts towards features that are most meaningful to users. You can opt-out at anytime by calling disable_telemetry() or by manually setting the environment variable HAYSTACK_TELEMETRY_ENABLED as described for different operating systems on the documentation page. More information at https://haystack.deepset.ai/guides/telemetry
INFO:haystack.modeling.utils:Using devices: CUDA:0 - Number of GPUs: 1


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
Using sep_token, but it is not set yet.


## Question Generation Pipeline

The most basic version of a question generation pipeline takes a document as input and outputs generated questions
that the document can answer.

In [9]:
question_generation_pipeline = QuestionGenerationPipeline(question_generator)
for idx, document in enumerate(document_store):

    print(f"\n * Generating questions for document {idx}: {document.content[:100]}...\n")
    result = question_generation_pipeline.run(documents=[document])
    print_questions(result)


 * Generating questions for document 0: ['Software engineers apply engineering principles and knowledge of programming languages to build so...


Generated questions:
 - What do software engineers use to build software solutions for end users?

 * Generating questions for document 1: ['If you’re considering this as a career, here are some skills you should focus on building:  Coding...


Generated questions:
 - Python, Java, C++, or Scala are examples of what?
 - Database architecture Agile and Scrum project management Operating systems Cloud computing Version control Design testing What should you focus on building?
 - What type of computing is Cloud computing?
 - What can you do by earning a certification?

 * Generating questions for document 2: ['You can decide to advance toward a role as a senior software engineer, or you can continue gaining...


Generated questions:
 - What does the Software Engineering Institute offer on security, process improvement and software architecture?

## Retriever Question Generation Pipeline

This pipeline takes a query as input. It retrieves relevant documents and then generates questions based on these.

In [10]:
retriever = BM25Retriever(document_store=document_store)
rqg_pipeline = RetrieverQuestionGenerationPipeline(retriever, question_generator)

print("\n * Generating questions for documents matching the query 'Software engineers'\n")
result = rqg_pipeline.run(query="Software engineers")
print_questions(result)


 * Generating questions for documents matching the query 'Software engineers'


Generated questions:
 - What do software engineers use to build software solutions for end users?
 - What does the Software Engineering Institute offer on security, process improvement and software architecture?
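To give a feel for how the retrieval step picks documents for a query, here is a toy BM25 scorer in pure Python. It is only a sketch of the ranking formula: it splits on whitespace, uses the common defaults `k1=1.2` and `b=0.75`, and is not how Elasticsearch or `BM25Retriever` is implemented internally. The function name `bm25_scores` and the sample documents are made up for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document against the query with a simplified BM25 formula."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            # Number of documents that contain the term at least once.
            doc_freq = sum(1 for other in tokenized if term in other)
            idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
            tf = counts[term]
            # Term frequency saturates with k1 and is normalized by document length via b.
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(tokens) / avg_len))
        scores.append(score)
    return scores


sample_docs = [
    "software engineers build software solutions",
    "certifications validate skills",
    "cloud computing and operating systems",
]
scores = bm25_scores("software engineers", sample_docs)
best_index = max(range(len(scores)), key=scores.__getitem__)
print(best_index, scores)
```

Documents that share no term with the query score zero, which is why a keyword retriever returns only the documents that mention the query terms.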


## Question Answer Generation Pipeline

This pipeline takes a document as input, generates questions on it, and attempts to answer these questions using
a Reader model.

In [11]:

reader = FARMReader("deepset/roberta-base-squad2")
qag_pipeline = QuestionAnswerGenerationPipeline(question_generator, reader)


INFO:haystack.modeling.utils:Using devices: CUDA:0 - Number of GPUs: 1
INFO:haystack.modeling.utils:Using devices: CUDA:0 - Number of GPUs: 1


INFO:haystack.modeling.model.language_model: * LOADING MODEL: 'deepset/roberta-base-squad2' (Roberta)


INFO:haystack.modeling.model.language_model:Auto-detected model language: english
INFO:haystack.modeling.model.language_model:Loaded 'deepset/roberta-base-squad2' (Roberta model) from model hub.


INFO:haystack.modeling.utils:Using devices: CUDA:0 - Number of GPUs: 1
INFO:haystack.modeling.infer:Got ya 2 parallel workers to do inference ...
INFO:haystack.modeling.infer: 0     0  
INFO:haystack.modeling.infer:/w\   /w\ 
INFO:haystack.modeling.infer:/'\   / \ 
0it [00:00, ?it/s]


 * Generating questions and answers for document 0: ['Software engineers apply engineering principles and knowledge of programming languages to build so...




Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 15.07 Batches/s]
1it [00:01,  1.11s/it]


Generated pairs:
 - Q: What do software engineers use to build software solutions for end users?
      A: engineering principles and knowledge of programming languages

 * Generating questions and answers for document 1: ['If you’re considering this as a career, here are some skills you should focus on building:  Coding...




Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  9.96 Batches/s]
2it [00:04,  2.26s/it]


Generated pairs:
 - Q: Python, Java, C++, or Scala are examples of what?
      A: Coding languages
 - Q: Database architecture Agile and Scrum project management Operating systems Cloud computing Version control Design testing What should you focus on building?
      A: debugging
 - Q: What type of computing is Cloud computing?
      A: Operating systems
 - Q: What can you do by earning a certification?
      A: build new skills and validate those skills to potential employers

 * Generating questions and answers for document 2: ['You can decide to advance toward a role as a senior software engineer, or you can continue gaining...




Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 13.92 Batches/s]
3it [00:05,  1.85s/it]


Generated pairs:
 - Q: What does the Software Engineering Institute offer on security, process improvement and software architecture?
      A: certifications





In [88]:
questions = []
answers = []
for idx, document in enumerate(tqdm(document_store)):

    print(f"\n * Generating questions and answers for document {idx}: {document.content[:100]}...\n")
    result = qag_pipeline.run(documents=[document])
    print(result)
    # result keys: 'queries', 'answers', 'no_ans_gaps', 'documents', 'root_node', 'params', 'node_id'
    questions.append(result["queries"])  # list of generated questions for this document
    answers.append(result["answers"])  # one list of Answer objects per question



0it [00:00, ?it/s]


 * Generating questions and answers for document 0: ['Software engineers apply engineering principles and knowledge of programming languages to build so...




Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 37.75 Batches/s]
1it [00:00,  1.71it/s]

{'queries': ['What do software engineers use to build software solutions for end users?'], 'answers': [[<Answer {'answer': 'engineering principles and knowledge of programming languages', 'type': 'extractive', 'score': 0.8626583814620972, 'context': "['Software engineers apply engineering principles and knowledge of programming languages to build software solutions for end users.']", 'offsets_in_document': [{'start': 27, 'end': 88}], 'offsets_in_context': [{'start': 27, 'end': 88}], 'document_id': '64c48b999da41ff4afbbecac82da924', 'meta': {}}>]], 'no_ans_gaps': [7.165762782096863], 'documents': [[<Document: {'content': "['Software engineers apply engineering principles and knowledge of programming languages to build software solutions for end users.']", 'content_type': 'text', 'score': 0.5312093733737563, 'meta': {}, 'embedding': None, 'id': '64c48b999da41ff4afbbecac82da924'}>]], 'root_node': 'Query', 'params': {}, 'node_id': 'Reader'}

 * Generating questions and answers for document


Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 16.04 Batches/s]
2it [00:01,  1.11it/s]

{'queries': ['Python, Java, C++, or Scala are examples of what?', 'Database architecture Agile and Scrum project management Operating systems Cloud computing Version control Design testing What should you focus on building?', 'What type of computing is Cloud computing?', 'What can you do by earning a certification?'], 'answers': [[<Answer {'answer': 'Coding languages', 'type': 'extractive', 'score': 0.9609700441360474, 'context': 's as a career, here are some skills you should focus on building:  Coding languages like Python, Java, C, C++, or Scala  Object-oriented programming  ', 'offsets_in_document': [{'start': 94, 'end': 110}], 'offsets_in_context': [{'start': 67, 'end': 83}], 'document_id': 'c428d82dbe0c74ef62b5d14b4b12424a', 'meta': {}}>], [<Answer {'answer': 'debugging', 'type': 'extractive', 'score': 0.3788924217224121, 'context': 'perating systems  Cloud computing  Version control  Design testing and debugging  Attention to detail By earning a certification, you can build new 


Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 44.73 Batches/s]
3it [00:02,  1.36it/s]

{'queries': ['What does the Software Engineering Institute offer on security, process improvement and software architecture?'], 'answers': [[<Answer {'answer': 'certifications', 'type': 'extractive', 'score': 0.9642552733421326, 'context': "r.', '[57]  Certification The Software Engineering Institute offers certifications on specific topics like security, process improvement and software ", 'offsets_in_document': [{'start': 257, 'end': 271}], 'offsets_in_context': [{'start': 68, 'end': 82}], 'document_id': '2924e44e47762e8f3deb4f0bec8e82f7', 'meta': {}}>]], 'no_ans_gaps': [8.693629264831543], 'documents': [[<Document: {'content': "['You can decide to advance toward a role as a senior software engineer, or you can continue gaining certifications and experience to advance to roles like project manager or systems manager.', '[57]  Certification The Software Engineering Institute offers certifications on specific topics like security, process improvement and software architecture.']", 'con
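The printed results show that answer scores vary a lot: "Coding languages" scores about 0.96, while "debugging" scores about 0.38 and does not really answer its question. One possible way to keep only the more confident pairs is sketched below. The 0.5 threshold is an arbitrary assumption, `confident_pairs` is a hypothetical helper, and `SimpleNamespace` objects stand in for Haystack `Answer` objects, of which only the `answer` and `score` attributes are used.

```python
from types import SimpleNamespace


def confident_pairs(result, min_score=0.5):
    """Return (question, answer, score) for each question whose best answer reaches min_score."""
    pairs = []
    for question, answer_list in zip(result["queries"], result["answers"]):
        if not answer_list:
            continue  # the reader returned nothing for this question
        best = max(answer_list, key=lambda ans: ans.score)
        if best.score >= min_score:
            pairs.append((question, best.answer, best.score))
    return pairs


# Stand-in for one pipeline result, using two answers and scores from the output above.
sample_result = {
    "queries": [
        "Python, Java, C++, or Scala are examples of what?",
        "Database architecture Agile and Scrum project management Operating systems "
        "Cloud computing Version control Design testing What should you focus on building?",
    ],
    "answers": [
        [SimpleNamespace(answer="Coding languages", score=0.9609700441360474)],
        [SimpleNamespace(answer="debugging", score=0.3788924217224121)],
    ],
}
kept = confident_pairs(sample_result)
print(kept)
```

A score threshold is a blunt filter, so it is worth inspecting a sample of the dropped pairs before settling on a value.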




In [89]:
questions

[['What do software engineers use to build software solutions for end users?'],
 ['Python, Java, C++, or Scala are examples of what?',
  'Database architecture Agile and Scrum project management Operating systems Cloud computing Version control Design testing What should you focus on building?',
  'What type of computing is Cloud computing?',
  'What can you do by earning a certification?'],
 ['What does the Software Engineering Institute offer on security, process improvement and software architecture?']]

In [90]:
answers

[[[<Answer {'answer': 'engineering principles and knowledge of programming languages', 'type': 'extractive', 'score': 0.8626583814620972, 'context': "['Software engineers apply engineering principles and knowledge of programming languages to build software solutions for end users.']", 'offsets_in_document': [{'start': 27, 'end': 88}], 'offsets_in_context': [{'start': 27, 'end': 88}], 'document_id': '64c48b999da41ff4afbbecac82da924', 'meta': {}}>]],
 [[<Answer {'answer': 'Coding languages', 'type': 'extractive', 'score': 0.9609700441360474, 'context': 's as a career, here are some skills you should focus on building:  Coding languages like Python, Java, C, C++, or Scala  Object-oriented programming  ', 'offsets_in_document': [{'start': 94, 'end': 110}], 'offsets_in_context': [{'start': 67, 'end': 83}], 'document_id': 'c428d82dbe0c74ef62b5d14b4b12424a', 'meta': {}}>],
  [<Answer {'answer': 'debugging', 'type': 'extractive', 'score': 0.3788924217224121, 'context': 'perating systems  Cloud

In [59]:
df_haystak=df.copy()

In [67]:
# Keep only rows 1, 5 and 6, the three documents that were written to the document store
df_haystak=df_haystak.drop([0,2,3,4,7,8,9], axis=0).reset_index(drop=True)

In [80]:
df_haystak=df_haystak.drop(["Unnamed: 0"],axis=1)

In [82]:
df_haystak=df_haystak.drop(["qusetions"],axis=1)

In [91]:
df_haystak["qusetions"]=questions
df_haystak["answers"]=answers
df_haystak

| Unnamed: 0 | key | sentence | qusetions | answers |
| --- | --- | --- | --- | --- |
| 0 | software solutions | ['Software engineers apply engineering principles and knowledge of programmi... | [What do software engineers use to build software solutions for end users?] | [[<Answer: answer='engineering principles and knowledge of programming langu... |
| 1 | object-oriented programming | ['If you’re considering this as a career, here are some skills you should fo... | [Python, Java, C++, or Scala are examples of what?, Database architecture Ag... | [[<Answer: answer='Coding languages', score=0.9609700441360474, context='s a... |
| 2 | certifications | ['You can decide to advance toward a role as a senior software engineer, or ... | [What does the Software Engineering Institute offer on security, process imp... | [[<Answer: answer='certifications', score=0.9642552733421326, context='r.', ... |


In [92]:
df_haystak.to_csv("/content/df_haystak.csv")  
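Writing nested lists and `Answer` objects to CSV stores only their string representations, which are awkward to parse back. A possible alternative, sketched here under the assumption that `answers` is nested as document, then question, then answer (as in the output above), is to flatten everything into one row per question-answer pair first. `flatten_pairs` is a hypothetical helper, and `SimpleNamespace` again stands in for `Answer`.

```python
import csv
import io
from types import SimpleNamespace


def flatten_pairs(keys, questions, answers):
    """Return one dict per (question, answer) pair across all documents."""
    rows = []
    for key, doc_questions, doc_answers in zip(keys, questions, answers):
        for question, answer_list in zip(doc_questions, doc_answers):
            for ans in answer_list:
                rows.append({"key": key, "question": question, "answer": ans.answer, "score": ans.score})
    return rows


# Stand-in data; in the notebook these would be df_haystak["key"], questions and answers.
sample_keys = ["software solutions"]
sample_questions = [["What do software engineers use to build software solutions for end users?"]]
sample_answers = [[[SimpleNamespace(
    answer="engineering principles and knowledge of programming languages",
    score=0.8626583814620972,
)]]]

rows = flatten_pairs(sample_keys, sample_questions, sample_answers)

buffer = io.StringIO()  # swap for open("pairs.csv", "w", newline="") to write a real file
writer = csv.DictWriter(buffer, fieldnames=["key", "question", "answer", "score"])
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

The flat layout keeps each answer text and score in its own column, so the file can be reloaded with `pd.read_csv` without any string parsing.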

## About us

This [Haystack](https://github.com/deepset-ai/haystack/) notebook was made with love by [deepset](https://deepset.ai/) in Berlin, Germany.

We bring NLP to the industry via open source!
Our focus: Industry specific language models & large scale QA systems.

Some of our other work:
- [German BERT](https://deepset.ai/german-bert)
- [GermanQuAD and GermanDPR](https://deepset.ai/germanquad)
- [FARM](https://github.com/deepset-ai/FARM)

Get in touch:
[Twitter](https://twitter.com/deepset_ai) | [LinkedIn](https://www.linkedin.com/company/deepset-ai/) | [Discord](https://haystack.deepset.ai/community/join) | [GitHub Discussions](https://github.com/deepset-ai/haystack/discussions) | [Website](https://deepset.ai)

By the way: [we're hiring!](https://www.deepset.ai/jobs)