In this notebook we download a PDF file, convert it to plain text, and do some "pre-cleaning": removing the parts of the document that carry no meaning and keeping only the content that is valuable for our future generator.
The outcome of the code below is pre-processed but still raw data.


The "output_string" variable has the "StringIO" type, while "extracted_text" (the result of "output_string.getvalue()") is a regular Python string. StringIO is a class in Python's io module that provides an in-memory file-like object: you can read from it or write to it as if it were a file. This is useful when an API, such as the pdfminer TextConverter used below, expects a file-like object to write to.

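As a small illustration of the difference between the buffer and its contents, here is a minimal sketch that uses only the standard library. The names `buffer`, `text` and `lines` are made up for this example and do not appear elsewhere in the notebook.

```python
from io import StringIO

# Write to an in-memory buffer exactly as one would write to a file
buffer = StringIO()
buffer.write("first line\n")
buffer.write("second line\n")

# getvalue() returns everything written so far as a regular str
text = buffer.getvalue()
print(type(buffer).__name__)
print(type(text).__name__)

# The buffer can also be read back like a file after rewinding it
buffer.seek(0)
lines = buffer.readlines()

# Closing releases the buffer; the str obtained earlier stays usable
buffer.close()
```

After `close()` the buffer itself can no longer be read, but `text` is an independent string and is unaffected.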

In [69]:
# import of libraries
from io import StringIO # in-memory text buffer; the PDF converter writes into it and extracted_text is read from it as a plain str
import requests
import re  # provides reg. exp. support
import math

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser
import fitz
import nltk
from nltk.corpus import stopwords
import spacy
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
import sentencepiece
from transformers import T5ForConditionalGeneration, T5Tokenizer, BertForQuestionAnswering, BertTokenizer

In [60]:
#!pip install -U pip setuptools wheel
#!pip install -U spacy
#!python -m spacy download en_core_web_sm
#!pip install PyMuPDF # this is fitz


In [73]:
# downloading pdf to '/data/' folder
url = 'https://astqb.org/assets/documents/ISTQB_CTFL_Syllabus-v4.0.pdf'
r = requests.get(url, allow_redirects=True)
open('data/ISTQB_CTFL_Syllabus-v4.0.pdf', 'wb').write(r.content)

1113747

r"Page \d{4,74} of 74"

In [74]:
#converting pdf to text and saving into .txt file initial version
output_string = StringIO()
output_file_path = 'data/ISTQB_CTFL_Syllabus-v4.0.txt'
with open('data/ISTQB_CTFL_Syllabus-v4.0.pdf', 'rb') as in_file, open(output_file_path, 'w', encoding='utf-8') as out_file:
    parser = PDFParser(in_file)
    doc = PDFDocument(parser)
    rsrcmgr = PDFResourceManager()
    device = TextConverter(rsrcmgr, output_string, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for page in PDFPage.create_pages(doc):
        interpreter.process_page(page)
    # Getting the extracted text from StringIO, it means the entire text extracted from the PDF is stored as a single string in memory.
    extracted_text = output_string.getvalue()
    # Writing the extracted text to the output file
    out_file.write(extracted_text)

# Closing the stream
output_string.close()


# Printing message to indicate that the text has been saved to the file
print(f"Extracted text for 'The Certified Tester Foundation Level in Software Testing' saved to '{output_file_path}'")

Extracted text for 'The Certified Tester Foundation Level in Software Testing' saved to 'data/ISTQB_CTFL_Syllabus-v4.0.txt'


In [None]:
# let us check the size of the extracted text (the string read from StringIO), just out of curiosity
size_bytes = len(extracted_text.encode('utf-8'))
print('The length of the string in bytes: ' + str(size_bytes))

# function's code is taken from stackoverflow ---
def convert_size(size_bytes):
   if size_bytes == 0:
       return "0B"
   size_name = ("B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB")
   i = int(math.floor(math.log(size_bytes, 1024)))
   p = math.pow(1024, i)
   s = round(size_bytes / p, 2)
   return "%s %s" % (s, size_name[i])
# ---
print("File size, document contains 70+ pages: ", convert_size(size_bytes))

In [None]:
# Looking up for the text to remove everything before it
target_text = "1.1. What is Testing?"

# Finding the position of the target text in the extracted text
start_position = extracted_text.find(target_text)

# Checking if the target text was found, just in case
if start_position != -1:
    # Removing everything before the target text
    extracted_text = extracted_text[start_position:]


# let us save the content to a .txt file with the suffix '_v01' for further debugging and human evaluation

output_file_path = 'data/ISTQB_CTFL_Syllabus-v4.0_v01.txt'
with open(output_file_path, 'w', encoding='utf-8') as out_file:
    # Writing the extracted text to the output file
    out_file.write(extracted_text)

# Closing the stream
output_string.close()

# Printing message to indicate that the text has been saved to the file
print(f"Extracted text for 'The Certified Tester Foundation Level in Software Testing' pre processed version 0.1 saved to '{output_file_path}'")



In [None]:
# removing empty lines
# splitlines(True) keeps the line endings; the condition line.strip() is true only when the line
# contains at least one non-whitespace character, so blank lines are dropped from the result.

extracted_text = "".join([line for line in extracted_text.strip().splitlines(True) if line.strip()])

output_file_path = 'data/ISTQB_CTFL_Syllabus-v4.0_v02.txt'
with open(output_file_path, 'w', encoding='utf-8') as out_file:
    # Writing the extracted text to the output file
    out_file.write(extracted_text)

# Closing the stream
output_string.close()

# Printing message to indicate that the text has been saved to the file
print(f"Extracted text for 'The Certified Tester Foundation Level in Software Testing' pre processed version 0.2 saved to '{output_file_path}'")

In [None]:
# Removing text from 'Page 56 of 74' till the end of the text

# Looking up for the text to remove everything after it
target_text = "Page 56 of 74"

# Finding the position of the target text in the extracted text
end_position = extracted_text.find(target_text)

# Checking if the target text was found, just in case
if end_position != -1:
    # Removing the target text and everything after it
    extracted_text = extracted_text[:end_position]


# let us save the content to a .txt file with the suffix '_v03' for further debugging and human evaluation

output_file_path = 'data/ISTQB_CTFL_Syllabus-v4.0_v03.txt'
with open(output_file_path, 'w', encoding='utf-8') as out_file:
    # Writing the extracted text to the output file
    out_file.write(extracted_text)

# Closing the stream
output_string.close()

# Printing message to indicate that the text has been saved to the file
print(f"Extracted text for 'The Certified Tester Foundation Level in Software Testing' pre processed version 0.3 saved to '{output_file_path}'")


In [None]:
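# Stop words list: document-specific header and footer tokens
stop_words = ["v4.0", "Page", "74", "18", "15", "of", "2023-04-21", "©", "Certified Tester", "Foundation", "Level", "International Software Testing Qualifications Board"]

# Split the extracted_text into words
words = extracted_text.split()

# Filter out words that are in the stop words list.
# The comparison is case-sensitive: lower-casing the word would never match capitalized entries such as "Page".
# Multi-word entries cannot match here either, because each word is a single whitespace-free token.
filtered_words = [word for word in words if word not in stop_words]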

# Join the filtered words back into a text
extracted_text = " ".join(filtered_words)

# Print the cleaned text
#print(extracted_text)


output_file_path = 'data/ISTQB_CTFL_Syllabus-v4.0_v05.txt'
with open(output_file_path, 'w', encoding='utf-8') as out_file:
    # Writing the extracted text to the output file
    out_file.write(extracted_text)

# Closing the stream
output_string.close()

# Printing message to indicate that the text has been saved to the file
print(f"Extracted text for 'The Certified Tester Foundation Level in Software Testing' pre processed version 0.5 saved to '{output_file_path}'")
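
Because the text is split on whitespace, multi-word entries in a stop-word list can never match a single token. One possible alternative is to drop whole footer lines with a regular expression before the text is split and joined. The following is only a sketch under assumptions: it assumes the footer appears on its own line and contains "Page N of 74" (the form used in the truncation cell above), the sample text is invented, and the names `FOOTER_LINE` and `remove_footer_lines` are hypothetical.

```python
import re

# Assumed footer shape: any line that contains "Page <number> of 74"
FOOTER_LINE = re.compile(r"^.*\bPage \d{1,2} of 74\b.*$\n?", re.MULTILINE)

def remove_footer_lines(text):
    """Drop every line that contains a 'Page N of 74' marker."""
    return FOOTER_LINE.sub("", text)

# Invented sample; the real footer layout in the extracted text may differ
sample = (
    "Software testing assesses software quality.\n"
    "v4.0 Page 15 of 74 2023-04-21\n"
    "Testing is a set of activities.\n"
)
cleaned = remove_footer_lines(sample)
print(cleaned)
```

This only works while the line breaks are still present, so it would have to run before the words are joined back with single spaces.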

In [None]:
# convert all words in extracted_text to lower case (currently disabled)
#extracted_text = extracted_text.lower()

In [None]:

#punctuation
# Load the language model
#nlp = spacy.load("en_core_web_sm")

# Process the text with SpaCy
###doc = nlp(extracted_text)

# Create a list of tokens that are not punctuation
#filtered_tokens = [token.text for token in doc if not token.is_punct]

# Join the filtered tokens back into a text
#extracted_text = " ".join(filtered_tokens)

# Print the text without punctuation
#print(extracted_text)

#output_file_path = 'data/ISTQB_CTFL_Syllabus-v4.0_v04.txt'
#with open(output_file_path, 'w', encoding='utf-8') as out_file:
    # Writing the extracted text to the output file
#    out_file.write(extracted_text)

# Closing the stream
#output_string.close()

# Printing message to indicate that the text has been saved to the file
#print(f"Extracted text for 'The Certified Tester Foundation Level in Software Testing' pre processed version 0.1 saved to '{output_file_path}'")



In [90]:
# Stop-word list drafted with ChatGPT from a sample of the text above, then reviewed and customized by me
#stop_words = ["buxton", "a", "about", "above", "additional", "an", "and", "another", "are", "as", "be", "being", "by", "can", "common", "commonly", "do", "does", "each", "even", "for", "from", "has", "have", "in", "including", "is", "it", "its", "it's", "many", "may", "more", "most", "not", "of", "74", "often", "on", "or", "over", "such", "than", "that", "the", "there", "these", "this", "to", "under", "was", "we", "what", "when", "which", "who", "why", "will", "with", "within", "work", "you", "2023", "04", "21", "v4.0", "page", "2023-04-21", "©", "international", "qualifications", "board", "certified", "tester",  "foundation", "level", "FL-", "K2", "see", "section" , "didn't", "doesn't", "don't", "i.e.", "it's", "let's", "that's", "there's", "they're", "you're", "e.g."]
stop_words = ["15", "16", "17", "18", "19", "20", "21", "22", "23", "24", "25", "26", "27", "28", "29", "30", "31", "32", "33", "34", "35", "36",
              "37", "38","39", "40","41", "42","43", "44","45", "46", "47", "48", "49", "50", "51", "52", "53", "54", 
              "International Software Testing Qualifications Board Certified Tester Foundation Level"]
# 
# Your stop words list
#stop_words = ["v4.0", "page", "of", "2023-04-21", "©", "International Software Testing Qualifications Board",
 #             "Certified", "Tester", "Foundation", "Level"]
# 
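# Regular expression pattern for a footer that runs from "v4.0" to "Foundation Level".
# It is applied to single whitespace-free tokens below, so it cannot match a multi-word footer;
# the numeric stop words and the phrase replacement further down do the actual filtering.
pattern = re.compile(r"^v4\.0.*Foundation Level$", re.DOTALL)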
# 
# Split the extracted_text into words
words = re.split(r'\s+', extracted_text)
# 
# Filter out words that match the regular expression pattern or are in the stop words list
filtered_words = [word for word in words if not re.match(pattern, word) and word.lower() not in stop_words]

# 
# Join the filtered words back into a text
extracted_text = " ".join(filtered_words)

# Define the phrase you want to remove
phrase_to_remove = "International Software Testing Qualifications Board Certified Tester Foundation Level"

# Replace the phrase with an empty string
extracted_text = extracted_text.replace(phrase_to_remove, "")

# 
output_file_path = 'data/ISTQB_CTFL_Syllabus-v4.0_v05.txt'
with open(output_file_path, 'w', encoding='utf-8') as out_file:
    # Writing the extracted text to the output file
    out_file.write(extracted_text)
# 
# Closing the stream
output_string.close()
# 
# Printing message to indicate that the text has been saved to the file
print(f"Extracted text for 'The Certified Tester Foundation Level in Software Testing' pre processed version 0.5 saved to '{output_file_path}'")
# 
# 

Extracted text for 'The Certified Tester Foundation Level in Software Testing' pre processed version 0.5 saved to 'data/ISTQB_CTFL_Syllabus-v4.0_v05.txt'


In [93]:
pattern = r'[0-9]'

# Match all digits in the string and replace them with an empty string
extracted_text = re.sub(pattern, '', extracted_text)

output_file_path = 'data/ISTQB_CTFL_Syllabus-v4.0_v06.txt'
with open(output_file_path, 'w', encoding='utf-8') as out_file:
    # Writing the extracted text to the output file
    out_file.write(extracted_text)
# 
# Closing the stream
output_string.close()
# 
# Printing message to indicate that the text has been saved to the file
print(f"Extracted text for 'The Certified Tester Foundation Level in Software Testing' pre processed version 0.6 saved to '{output_file_path}'")
# 
# 



Extracted text for 'The Certified Tester Foundation Level in Software Testing' pre processed version 0.6 saved to 'data/ISTQB_CTFL_Syllabus-v4.0_v06.txt'
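
Removing every digit also removes section numbers such as "1.1.", which the section-splitting cell later in this notebook searches for. A gentler variant, sketched below as an assumption rather than as the method used here, removes only numbers that stand alone and keeps numbers that are attached to a dot. The names `STANDALONE_NUMBER` and `strip_standalone_numbers` are hypothetical.

```python
import re

# A run of digits that is neither preceded nor followed by a digit or a dot
STANDALONE_NUMBER = re.compile(r"(?<![\d.])\d+(?![\d.])")

def strip_standalone_numbers(text):
    """Remove free-standing numbers but keep dotted section numbers like '1.1.'."""
    without_numbers = STANDALONE_NUMBER.sub("", text)
    # Collapse the double spaces left behind by the removed numbers
    return re.sub(r"[ \t]{2,}", " ", without_numbers).strip()

sample = "1.1. What is Testing? 15 Software 4.2.1. Foo"
cleaned = strip_standalone_numbers(sample)
print(cleaned)
```

A number at the end of a sentence (for example "in 2023.") is followed by a dot and is therefore kept by this pattern, which may or may not be desirable.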


In [None]:



# Load the language model
nlp = spacy.load("en_core_web_sm")

# Your text
text = extracted_text

# Process the text with SpaCy
doc = nlp(text)

# Create a StringIO object to store the NER results (StringIO is imported directly from io above)
output_string = StringIO()

# Extract named entities and write them to the StringIO object
for ent in doc.ents:
    output_string.write(f"Entity: {ent.text}, Type: {ent.label_}\n")

# Get the NER results as a string
ner_results = output_string.getvalue()

output_file_path = 'data/NER.txt'
with open(output_file_path, 'w', encoding='utf-8') as out_file:
    # Writing the NER results to the output file
    out_file.write(ner_results)

# Closing the stream
output_string.close()

# Printing message to indicate that the text has been saved to the file
print(f"Extracted NER list for 'The Certified Tester Foundation Level in Software Testing' saved to '{output_file_path}'")



At this point the NER list has been saved to the data/ folder and edited manually; now let us load this file into the stop_list_stringio StringIO object.

In [None]:


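# Load the manually edited NER list into an in-memory StringIO buffer
# (assumption: the edited list was saved back to 'data/NER.txt')
with open('data/NER.txt', 'r', encoding='utf-8') as in_file:
    stop_list_stringio = StringIO(in_file.read())

# Check the content of stop_list_stringio
content = stop_list_stringio.getvalue()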
print(content)
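
The NER cell writes one "Entity: ..., Type: ..." line per entity, so the edited list can be turned into a set of stop phrases by parsing those lines. This is a minimal sketch; the function name `parse_ner_lines` and the sample entities are hypothetical, and it assumes the manual edit kept the line format unchanged.

```python
from io import StringIO

def parse_ner_lines(stream):
    """Return the set of entity texts found in 'Entity: ..., Type: ...' lines."""
    entities = set()
    for line in stream:
        line = line.strip()
        if not line.startswith("Entity: ") or ", Type: " not in line:
            continue  # skip blank lines and lines broken by manual editing
        # rpartition splits on the last ', Type: ' so commas inside the entity survive
        entity_part, _, _label = line.rpartition(", Type: ")
        entities.add(entity_part[len("Entity: "):])
    return entities

# Invented sample in the same format as the NER output file
sample = StringIO(
    "Entity: ISTQB, Type: ORG\n"
    "Entity: Agile, Scrum, Type: ORG\n"
    "\n"
)
stop_phrases = parse_ner_lines(sample)
print(sorted(stop_phrases))
```

The resulting phrases could then be removed from the text with `str.replace`, in the same way the long footer phrase is removed earlier in the notebook.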

In [None]:
# Markov chain text generator (currently disabled); enabling it also requires `import random`
# Sample text (replace with your extracted_text)
# Tokenize the text into words
#tokens = nltk.word_tokenize(extracted_text)

# Create a dictionary to store transition probabilities
#transition_probabilities = {}

# Build the transition probability matrix
#for i in range(len(tokens) - 1):
#    current_token = tokens[i]
#    next_token = tokens[i + 1]
    
#    if current_token in transition_probabilities:
#        transition_probabilities[current_token].append(next_token)
#    else:
#        transition_probabilities[current_token] = [next_token]

# Start with an initial word
#current_word = random.choice(tokens)

# Generate a sentence of a certain length
#generated_text = [current_word]
#sentence_length = 10

#for _ in range(sentence_length - 1):
#    if current_word in transition_probabilities:
#        next_word = random.choice(transition_probabilities[current_word])
#        generated_text.append(next_word)
#        current_word = next_word
#    else:
#        break

# Join the generated words into a sentence
#generated_sentence = " ".join(generated_text)
#print(generated_sentence)


In [None]:

# Defining a regular expression pattern to match section numbers such as "1.1." or "1.4.1."
# (the trailing dot is optional). Note: this only works if the digits are still present in the text.
section_pattern = r'\b\d+(?:\.\d+)+\.?\s+'

# Using re.finditer to find all section numbers and their starting positions; extracted_text is a plain string.
# The matches are collected in a list so that each one can be paired with the next one.
section_matches = list(re.finditer(section_pattern, extracted_text))

# Create a list to store sections
sections = []

# Iterate through section matches; each section runs up to the start of the next match
for index, match in enumerate(section_matches):
    start_pos = match.start()
    if index + 1 < len(section_matches):
        end_pos = section_matches[index + 1].start()
    else:
        end_pos = len(extracted_text)
    section_title = match.group().strip()
    section_content = extracted_text[start_pos:end_pos].strip()
    sections.append((section_title, section_content))

# Print every section number together with its content
if sections:
    for section in sections:
        print(section)
else:
    print("No sections found; something went wrong, take a look")


In [None]:
#!pip install torch

In [None]:
#!pip install tensorflow

In [None]:
# Load the pre-trained model and tokenizer
model_name = "bert-large-uncased-whole-word-masking-finetuned-squad"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name)

# Provide a passage and a question
passage = extracted_text
question = "Which of the following statements describe a valid test objective?"

#Which of the following statements describe a valid test objective?
#What does not work as expected?

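# Tokenize the passage and question; truncation=True cuts the input to the model's maximum length
# (512 tokens for this BERT model), so only the beginning of a long passage is actually searched
inputs = tokenizer(question, passage, return_tensors="pt", padding=True, truncation=True)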

# Get the answer from the model
start_scores, end_scores = model(**inputs, return_dict = False)
start_idx = torch.argmax(start_scores)
end_idx = torch.argmax(end_scores)

# Decode the answer from the tokenized output
answer_tokens = inputs["input_ids"][0][start_idx:end_idx + 1]
answer = tokenizer.decode(answer_tokens)

print("Answer:", answer)
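
Since the whole document is far longer than the model can read in one pass, one common workaround is to split the passage into overlapping chunks and ask the question against each chunk. The sketch below shows only the splitting step and uses the standard library. It counts words rather than model tokens, and the function name `split_into_chunks` and the sizes are arbitrary assumptions, not values taken from this notebook.

```python
def split_into_chunks(text, chunk_words=200, overlap_words=50):
    """Split text into overlapping windows of whitespace-separated words."""
    if overlap_words >= chunk_words:
        raise ValueError("overlap_words must be smaller than chunk_words")
    words = text.split()
    step = chunk_words - overlap_words
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_words]))
        if start + chunk_words >= len(words):
            break  # the last window already reaches the end of the text
    return chunks

# Tiny invented sample so the windows are easy to inspect
sample = " ".join(f"w{i}" for i in range(10))
chunks = split_into_chunks(sample, chunk_words=4, overlap_words=1)
for chunk in chunks:
    print(chunk)
```

The overlap reduces the chance that an answer is cut in half at a chunk boundary; each chunk could then be passed as `passage` to the question-answering code above.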


To process text and generate questions with answers, you can consider using pre-trained language models, such as GPT-3, GPT-4, BERT, T5, or similar models. Each of these models has its strengths and can be used for different aspects of question generation and answering:

1. **GPT-3 and GPT-4 (Generative Pre-trained Transformers):** These models are known for their ability to generate coherent and contextually relevant text. They can be fine-tuned for question generation and answer generation tasks. You can use them to generate questions based on a given text and answer those questions by extracting relevant information from the text. These models are more versatile for generating human-like text.

2. **BERT (Bidirectional Encoder Representations from Transformers):** BERT is a popular model for various NLP tasks, including question-answering. It can be fine-tuned for specific question-answering tasks and can provide highly accurate answers. BERT models are effective at understanding the context of a text passage and extracting information.

3. **T5 (Text-to-Text Transfer Transformer):** T5 is designed to convert all NLP tasks into a text-to-text format. You can fine-tune a T5 model to perform question generation and answer generation tasks in this format. It's known for its versatility and performance on various NLP tasks.

4. **Dedicated Question-Answering Models:** There are models fine-tuned specifically for question answering, for example on the Stanford Question Answering Dataset (SQuAD). These models are optimized for extracting answers from a text given specific questions.

When choosing a model, consider the specific requirements of your project, such as the complexity of questions, the size of your dataset, and the desired level of accuracy. You may need to fine-tune these models on your specific task and dataset to achieve the best results. Additionally, you can use NLP libraries like Hugging Face Transformers or the OpenAI API, which provide access to pre-trained models for question generation and answering.

In [None]:
# from transformers import AutoTokenizer, AutoModelForQuestionAnswering
# from io import StringIO

# # Load the pre-trained model and tokenizer
# model_name = "bert-large-uncased-whole-word-masking-finetuned-squad"
# tokenizer = AutoTokenizer.from_pretrained(model_name)
# model = AutoModelForQuestionAnswering.from_pretrained(model_name)

# # Create a StringIO object with your text
# text_io = StringIO()
# text_io.write("Your text goes here.")
# text_io.seek(0)  # Reset the StringIO object to the beginning

# # Read the text from the StringIO object and convert it to a regular string
# text = text_io.read()

# # Provide a question
# question = "What is the answer to my question?"

# # Tokenize the text and question
# inputs = tokenizer(question, text, return_tensors="pt", padding=True, truncation=True)

# # Get the answer from the model
# start_scores, end_scores = model(**inputs, return_dict=False)
# start_idx = torch.argmax(start_scores)
# end_idx = torch.argmax(end_scores)

# # Decode the answer from the tokenized output
# answer_tokens = inputs["input_ids"][0][start_idx:end_idx + 1]
# answer = tokenizer.decode(answer_tokens)

# print("Answer:", answer)
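
The question-answering cells pick the start and end positions with two independent argmax calls, which can produce an end position that lies before the start and therefore an empty answer. A possible refinement, sketched here on plain Python lists, is to search for the best pair with start <= end. The function name `best_span`, the length limit and the scores are invented for illustration.

```python
def best_span(start_scores, end_scores, max_answer_len=30):
    """Return (start, end) with start <= end that maximizes the summed scores."""
    best = None
    best_score = float("-inf")
    for start, start_score in enumerate(start_scores):
        # Only consider ends at or after the start, within the length limit
        last_end = min(start + max_answer_len, len(end_scores))
        for end in range(start, last_end):
            score = start_score + end_scores[end]
            if score > best_score:
                best_score = score
                best = (start, end)
    return best

# Invented scores: independent argmax would give start=2 and end=0
start_scores = [0.1, 0.2, 5.0, 0.3]
end_scores = [4.0, 0.1, 0.2, 3.0]
span = best_span(start_scores, end_scores)
print(span)
```

With the tensors from the cells above, the scores could be converted first with `start_scores[0].tolist()` and `end_scores[0].tolist()`.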


Seq2Seq (T5) for question generation, then BERT for question answering

In [None]:
#!pip install sentencepiece


In [None]:
# Step 1: Question Generation using Seq2Seq (T5)

# Load the pre-trained Seq2Seq model for question generation
question_generation_model = T5ForConditionalGeneration.from_pretrained("t5-small")
question_generation_tokenizer = T5Tokenizer.from_pretrained("t5-small")

# ISTQB document (replace with your actual content)
istqb_document = """
1.1. What is Testing? 

Software systems are an integral part of our daily life. Most people have had experience with software 
that did not work as expected. Software that does not work correctly can lead to many problems, 
including loss of money, time or business reputation, and, in extreme cases, even injury or death. 
Software testing assesses software quality and helps reducing the risk of software failure in operation. 

Software testing is a set of activities to discover defects and evaluate the quality of software artifacts. 
These artifacts, when being tested, are known as test objects. A common misconception about testing is 
that it only consists of executing tests (i.e., running the software and checking the test results). However, 
software testing also includes other activities and must be aligned with the software development lifecycle 
(see chapter 2). 

Another common misconception about testing is that testing focuses entirely on verifying the test object. 
Whilst testing involves verification, i.e., checking whether the system meets specified requirements, it also 
involves validation, which means checking whether the system meets users’ and other stakeholders’ 
needs in its operational environment. 
"""

# Generate questions from the ISTQB document.
# Note: max_length also truncates the input, so only the first max_length tokens of the document are used.
def generate_questions(document, max_length=64, num_questions=1):
    inputs = question_generation_tokenizer.encode("generate questions: " + document, return_tensors="pt", max_length=max_length, truncation=True)
    # Returning several sequences requires at least as many beams as sequences
    questions = question_generation_model.generate(inputs, max_length=max_length, num_beams=num_questions, num_return_sequences=num_questions)
    return [question_generation_tokenizer.decode(question, skip_special_tokens=True) for question in questions]

generated_questions = generate_questions(istqb_document)

# Print generated questions
for question in generated_questions:
    print("Question:", question)
