<center><img src="https://github.com/insaid2018/Term-1/blob/master/Images/INSAID_Full%20Logo.png?raw=true" width="360" height="160" /></center>

# <center>**Question-Answer Generation from a Given PDF**</center>


## **Table of Contents**

1. [**Problem Statement**](#Section1)<br>
2. [**Installing & Importing Libraries**](#Section2)<br>
 - 2.1 [**Installing Libraries**](#Section21)<br>
 - 2.2 [**Importing Libraries**](#Section23)<br>
3. [**Data Acquisition and Description**](#Section3)<br>
4. [**Model Fitting**](#Section3)<br>
5. [**Exception Handling**](#Section3)<br>
6. [**Conclusion**](#Section3)<br>
7. [**Applications**](#Section3)<br>



<a id=section2></a>

---
# **1. Problem Statement**
---

- The goal of this project is to develop an **automatic question-answer generation** system.

- The notebook should take a **PDF book as input** and return the **best possible question-answer pairs** related to that book.


<center><img src="https://ak.picdn.net/shutterstock/videos/3031567/thumb/4.jpg" height= 400 width=1000 ></center>

### **Scenario**

- You have been hired as a **freelance data scientist** for an **Indian ed-tech startup**.

- The company faces major issues when **generating questions and answers from a textbook**.

- The main challenge is that this is a **very time-consuming process** that needs a **large amount of manpower**.

- Hence, they want to automate the entire process so that no manual processing is needed.


<a id=section2></a>

---
# **2. Installing & Importing Libraries**
---

<a name = Section21></a>
### **2.1 Installing Libraries**

In [None]:
!pip install PyPDF2                                                                     # Installing PyPDF 2
!pip install -U transformers==3.0.0                                                     # Installing Transformers
!python -m nltk.downloader punkt                                                        # Installing NLTK
!git clone https://github.com/patil-suraj/question_generation.git                       # Cloning QnA GIT for calling custom-made pipelines

Requirement already up-to-date: transformers==3.0.0 in /usr/local/lib/python3.7/dist-packages (3.0.0)
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
Cloning into 'question_generation'...
remote: Enumerating objects: 265, done.
remote: Counting objects: 100% (6/6), done.
remote: Compressing objects: 100% (6/6), done.
remote: Total 265 (delta 1), reused 2 (delta 0), pack-reused 259
Receiving objects: 100% (265/265), 298.28 KiB | 8.52 MiB/s, done.
Resolving deltas: 100% (141/141), done.


<a name = Section23></a>
### **2.2 Importing Libraries**

In [None]:
import PyPDF2                                                               # Importing PYPDF 2 for reading PDF files
import pandas as pd                                                         # Importing pandas    
import numpy as np                                                          # Importing Numpy
import matplotlib.pyplot as plt                                             # Importing Pyplot
%matplotlib inline                                                          
import warnings                                                             # Calling warnings
warnings.filterwarnings("ignore")                               
from datetime import datetime                                               # Importing Datetime
import nltk                                                                 # Importing NLTK for text processing    
from pipelines import pipeline                                              # Importing pipeline from the cloned question_generation repo (run from inside that directory)

<a id=section4></a>

---
# **3. Data Acquisition and Description**
---


In [None]:
# Reading the file using PyPDF2
file = PyPDF2.PdfFileReader("/content/SQL.pdf")
print(file.documentInfo)

{'/Producer': 'doPDF Ver 7.2 Build 376 (Windows 7 Home Basic Edition (SP 1) - Version: 6.1.7601 (x64))', '/CreationDate': "D:20140908132224+05'30'"}


In [None]:
# Total number of  pages in the pdf
pagenumbers = file.getNumPages()
pagenumbers

7

In [None]:
# Extract the text of each page and append it to the list `content` (one string per page)
content = []
for i in range(0,pagenumbers):
    content.append(file.getPage(i).extractText())

**Observation:**

- The PDF has a total of **7 pages**.

- This page count includes the **cover page, the index and the acknowledgements.**
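
Because pages such as the cover or the index may contribute very little usable text, it can help to check how much text each page actually yielded before passing it to the model. The following is a minimal sketch, not part of the original notebook: `page_text_summary` and `sample_pages` are hypothetical names, and `sample_pages` stands in for the `content` list built above.

```python
def page_text_summary(pages):
    """Return (page_number, character_count, word_count) for each page's text."""
    summary = []
    for number, text in enumerate(pages, start=1):
        cleaned = " ".join(text.split())  # collapse newlines and repeated spaces
        summary.append((number, len(cleaned), len(cleaned.split())))
    return summary


# Hypothetical stand-in for the `content` list extracted with PyPDF2
sample_pages = ["Cover\n", "Service Oriented  Architecture\nis a design pattern.", ""]
for row in page_text_summary(sample_pages):
    print(row)
```

Pages with a word count near zero are likely candidates for the errors discussed later in the notebook.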





<a id=section4></a>

---
# **4. Model Fitting**
---


- In this section we will use <b><a href="https://github.com/patil-suraj/question_generation">this pre-trained model</a></b> to generate the questions and answers.

- We have extracted the data from the **SQL book using PyPDF2**; it is now stored in **content**.

- **content** is our **text corpus**, which we will use to generate the **questions and answers**.

In [None]:
%cd question_generation                                                 # Mapping the QNA model to the current environment

/content/question_generation


In [None]:
nlp = pipeline("question-generation")                                   # Fitting into QNA model


In [None]:
# Test with Demo Text
text = "Shakespeare was born and raised in Stratford-upon-Avon, Warwickshire. At the age of 18, he married Anne Hathaway, with whom he had three children: Susanna and twins Hamnet and Judith. Sometime between 1585 and 1592, he began a successful career in London as an actor, writer, and part-owner of a playing company called the Lord Chamberlain's Men, later known as the King's Men. At age 49 (around 1613), he appears to have retired to Stratford, where he died three years later. Few records of Shakespeare's private life survive; this has stimulated considerable speculation about such matters as his physical appearance, his sexuality, his religious beliefs, and whether the works attributed to him were written by others"
nlp(text)

[{'answer': 'Warwickshire', 'question': 'Where was Shakespeare born?'},
 {'answer': '18',
  'question': 'At what age did Shakespeare marry Anne Hathaway?'},
 {'answer': "the King's Men",
  'question': "What was the Lord Chamberlain's Men later known as?"},
 {'answer': 'Stratford', 'question': 'Where did Shakespeare die?'},
 {'answer': 'his physical appearance, his sexuality, his religious beliefs, and whether the works attributed to him were written by others',
  'question': "What matters did Shakespeare's private life have?"}]

In [None]:
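# Remove duplicate question-answer pairs from `l`, one list of dicts returned by nlp (`l` is assigned in the Section 5 loop)
[dict(t) for t in {tuple(d.items()) for d in l}]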

[{'answer': 'Service Oriented Architecture',
  'question': 'What is one of the key software architecture design patterns with which software application functionality is provided as a service to another application?'}]

**Observation:**
- When the model was not able to generate any answer, it raised a **ValueError**.

- This is mainly because the data provided to the model has **only 7 pages**.

- Hence, the text corpus is **short**.

- In addition, whenever the model encountered a **blank string**, it raised the same error.

- To avoid these issues, we will handle **AssertionError and ValueError with exception handling**.
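
The idea in the observations above can be sketched as a small wrapper that skips blank strings and turns the two exception types into an empty result. This is only an illustration under stated assumptions: `safe_generate` and `fake_generate` are hypothetical names, and `fake_generate` is a stand-in for `nlp` so that the example runs without downloading the model.

```python
def safe_generate(generate, text):
    """Call `generate(text)` and return [] instead of raising on unusable input."""
    if not text or not text.strip():
        return []  # blank strings are skipped before they reach the model
    try:
        return generate(text)
    except (ValueError, AssertionError):
        return []  # the model could not produce a question-answer pair


def fake_generate(text):
    """Hypothetical stand-in for `nlp`; raises on very short input."""
    if len(text.split()) < 3:
        raise ValueError("no answer could be extracted")
    return [{"answer": text.split()[0], "question": "What is the first word?"}]


print(safe_generate(fake_generate, "   "))
print(safe_generate(fake_generate, "Too short"))
print(safe_generate(fake_generate, "Teamcenter provides services"))
```

In the notebook itself, `nlp` would be passed in place of `fake_generate`.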





<a id=section4></a>

---
# **5. Exception Handling**
---


- In this step we will use **exception handling** to deal with the **ValueError** and the **AssertionError**.

- Then we will take each individual response and append it to the **superfinal** list.

In [None]:
# Collect the question-answer pairs for every page
superfinal = []

# Process the content one page at a time
for x in range(0, len(content)):
  c = content[x].splitlines()

  # Strip leading and trailing whitespace from each line
  new_c = []
  for text in c:
    new_c.append(text.strip())

  # Remove empty lines, rejoin the page, and split it on full stops
  while "" in new_c:
    new_c.remove("")
  v = " ".join(new_c)
  v = v.split(".")

  # Apply the model to each piece of text; skip pieces that raise an error
  ans = []
  for text in v:
    try:
      ans.append(nlp(text))
    except (ValueError, AssertionError):
      pass

  # Remove repeated question-answer pairs within each result using a set
  finalans = []
  for i in range(len(ans)):
    l = ans[i]
    finalans.append([dict(t) for t in {tuple(d.items()) for d in l}])
  superfinal.append(finalans)
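
The set-based de-duplication in the loop above does not guarantee that the pairs keep their original order, because sets are unordered. If the order matters, one possible alternative is to track the pairs already seen. This is a sketch only; `dedupe_pairs` is a hypothetical helper, not part of the notebook or of the `question_generation` repository.

```python
def dedupe_pairs(pairs):
    """Drop repeated question-answer dicts while keeping first-seen order."""
    seen = set()
    unique = []
    for pair in pairs:
        key = (pair["question"], pair["answer"])  # hashable identity of a pair
        if key not in seen:
            seen.add(key)
            unique.append(pair)
    return unique


# Hypothetical model output containing one repeated pair
sample = [
    {"answer": "SOA", "question": "What does the product use?"},
    {"answer": "Teamcenter", "question": "Which software is described?"},
    {"answer": "SOA", "question": "What does the product use?"},
]
print(dedupe_pairs(sample))
```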

In [None]:
superfinal  # list of list of list 

[[[{'answer': 'Service Oriented Architecture',
    'question': 'What is the name of the page 1 of Teamcenter White Paper?'},
   {'answer': 'Teamcenter White Paper',
    'question': 'Where is the Service Oriented Architecture located?'}],
  [{'answer': 'Service Oriented Architecture',
    'question': 'What is one of the key software architecture design patterns with which software application functionality is provided as a service to another application?'}],
  [{'answer': 'investment and maintenance',
    'question': 'What are the benefits of using functionalities as services?'}],
  [{'answer': 'SOA',
    'question': 'What does D igital product de sign and development use?'}],
  [{'answer': 'Teamcenter Business Logic Server',
    'question': 'What server does Teamcenter software provide an open, high performance, coarse - grained interface to?'}],
  [{'answer': 'create customized, task - specific programs',
    'question': 'What does Teamcenter enable you to do to meet your business nee

In [None]:
flat = [item for sublist in superfinal for item in sublist] # Flatten the page level: list of lists of pairs

In [None]:
flat_list = [item for sublist in flat for item in sublist] # Flatten again: a single list of pairs
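
The two flattening steps above can also be written with `itertools.chain.from_iterable` from the standard library. The sketch below assumes the same nesting as `superfinal` (pages, then per-text results, then pairs); `flatten_pairs` and `nested` are hypothetical names used for illustration.

```python
from itertools import chain


def flatten_pairs(nested):
    """Flatten pages -> per-text results -> pairs into a single list of pairs."""
    per_text = chain.from_iterable(nested)      # remove the page level
    return list(chain.from_iterable(per_text))  # remove the per-text level


# Hypothetical data with the same nesting as `superfinal`
nested = [
    [[{"answer": "SOA", "question": "Q1"}], []],
    [],
    [[{"answer": "Teamcenter", "question": "Q2"}, {"answer": "services", "question": "Q3"}]],
]
print(flatten_pairs(nested))
```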

In [None]:
flat_list # Final List of Answer:Question pair

[{'answer': 'Service Oriented Architecture',
  'question': 'What is the name of the page 1 of Teamcenter White Paper?'},
 {'answer': 'Teamcenter White Paper',
  'question': 'Where is the Service Oriented Architecture located?'},
 {'answer': 'Service Oriented Architecture',
  'question': 'What is one of the key software architecture design patterns with which software application functionality is provided as a service to another application?'},
 {'answer': 'investment and maintenance',
  'question': 'What are the benefits of using functionalities as services?'},
 {'answer': 'SOA',
  'question': 'What does D igital product de sign and development use?'},
 {'answer': 'Teamcenter Business Logic Server',
  'question': 'What server does Teamcenter software provide an open, high performance, coarse - grained interface to?'},
 {'answer': 'create customized, task - specific programs',
  'question': 'What does Teamcenter enable you to do to meet your business needs?'},
 {'answer': 'Teamcenter',


<a id=section4></a>

---
# **6. Conclusion**
---


- In this project we successfully built a question-answer generation system with the help of **transformers** and a **pre-trained, custom-made pipeline**.

- We tested this model on a **7-page PDF book on SQL**.

- We then checked whether it performs well on a test data point.

- After that, we saw that the model raised **ValueError and AssertionError exceptions** when the input did not meet its requirements.

- These exceptions were handled with **try/except blocks for ValueError and AssertionError**.

- Although the model generates good-quality **questions and answers**, it is limited when working with large files.

- As the file size increases, so do the **computational time and space complexities**, which can become a major challenge.
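
One way to keep memory use bounded on larger files is to produce the pairs lazily, page by page, instead of building the full nested list first. The sketch below is an assumption-laden illustration rather than the notebook's method: `iter_pairs` and `echo_generate` are hypothetical names, and `echo_generate` stands in for `nlp`. It does not reduce the time the model spends on each piece of text.

```python
def iter_pairs(pages, generate):
    """Yield (page_number, pair) tuples one at a time instead of building a list."""
    for page_number, page in enumerate(pages, start=1):
        # Same cleaning as in Section 5: strip lines, drop empty ones, rejoin
        text = " ".join(line.strip() for line in page.splitlines() if line.strip())
        for piece in text.split("."):
            if not piece.strip():
                continue  # nothing to send to the model
            try:
                pairs = generate(piece)
            except (ValueError, AssertionError):
                continue  # skip text the model cannot handle
            for pair in pairs:
                yield page_number, pair


def echo_generate(text):
    """Hypothetical stand-in for `nlp` that echoes the text as the answer."""
    return [{"question": "q", "answer": text.strip()}]


for page_number, pair in iter_pairs(["A b c. D e f.", "", "G h i"], echo_generate):
    print(page_number, pair)
```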

<a id=section4></a>

---
# **7. Applications**
---

- This project can be used to generate **questions and answers** for training chatbots built with tools such as **Rasa NLU**.

- Schools and colleges can use this project to build **question banks and test papers**.

- If tuned further and trained properly on large datasets, this project can be used for **Open-domain Question Answering (ODQA)**.
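
For the question-bank use case, the generated pairs could be exported to CSV with the standard library. This is a minimal sketch under stated assumptions: `pairs_to_csv` is a hypothetical helper, the two-column layout is only one possible choice, and the sample pair is illustrative rather than taken from the model output.

```python
import csv
import io


def pairs_to_csv(pairs):
    """Serialise question-answer dicts to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["question", "answer"], lineterminator="\n")
    writer.writeheader()
    for pair in pairs:
        writer.writerow({"question": pair["question"], "answer": pair["answer"]})
    return buffer.getvalue()


# Hypothetical pair in the same shape as the entries of `flat_list`
print(pairs_to_csv([{"answer": "SOA", "question": "What is used?"}]))
```

The returned string can then be written to a file or loaded into a spreadsheet.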