In this notebook we download a PDF file, convert it to plain text, and do some "pre-cleaning": we remove the parts of the document that carry no useful content and keep only the text that is valuable for our future generator.
The outcome of the code below is pre-processed but still raw data.


The "output_string" variable has the "StringIO" type. StringIO is a class in Python's io module that provides an in-memory, file-like object: text can be written to it and read back as if it were a file, without touching the disk. The pdfminer TextConverter writes the converted text into this buffer. Calling getvalue() on the buffer returns the accumulated text as an ordinary string (str), and that string is what the "extracted_text" variable holds.

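A minimal sketch of the difference between the buffer and the string it produces, using only the standard library. The names "buffer" and "text" are illustrative and do not appear in the notebook.

```python
from io import StringIO

buffer = StringIO()            # in-memory text buffer with a file-like API
buffer.write("first line\n")   # write() works as it would on an open text file
buffer.write("second line\n")

text = buffer.getvalue()       # plain str holding everything written so far
buffer.close()                 # frees the buffer; `text` stays usable

print(type(buffer).__name__)   # StringIO
print(type(text).__name__)     # str
```

Once getvalue() has been called, the buffer can be closed safely, because the returned string is an independent copy of its contents.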

In [165]:
# import of libraries
from io import StringIO  # in-memory text buffer that receives the converted PDF text
import requests
import re  # provides regular expression support
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser


In [166]:
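# downloading the PDF into the 'data/' folder (relative to the notebook)
url = 'https://astqb.org/assets/documents/ISTQB_CTFL_Syllabus-v4.0.pdf'
r = requests.get(url, allow_redirects=True)
r.raise_for_status()  # stop early if the download failed
with open('data/ISTQB_CTFL_Syllabus-v4.0.pdf', 'wb') as pdf_file:
    pdf_file.write(r.content)
len(r.content)  # number of bytes downloaded and written

1113747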

In [167]:
# converting the PDF to text and saving the initial version into a .txt file
output_string = StringIO()
output_file_path = 'data/ISTQB_CTFL_Syllabus-v4.0.txt'
with open('data/ISTQB_CTFL_Syllabus-v4.0.pdf', 'rb') as in_file, open(output_file_path, 'w', encoding='utf-8') as out_file:
    parser = PDFParser(in_file)
    doc = PDFDocument(parser)
    rsrcmgr = PDFResourceManager()
    device = TextConverter(rsrcmgr, output_string, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for page in PDFPage.create_pages(doc):
        interpreter.process_page(page)
    # Getting the extracted text from the StringIO buffer: the entire text of the PDF
    # is now held in memory as a single string (str).
    extracted_text = output_string.getvalue()
    # Writing the extracted text to the output file
    out_file.write(extracted_text)

# Closing the stream
output_string.close()

# Printing message to indicate that the text has been saved to the file
print(f"Extracted text for 'The Certified Tester Foundation Level in Software Testing' saved to '{output_file_path}'")

Extracted text for 'The Certified Tester Foundation Level in Software Testing' saved to 'data/ISTQB_CTFL_Syllabus-v4.0.txt'


In [168]:
# Text to look for; everything before it will be removed
target_text = "1.1. What is Testing?"

# Finding the position of the target text in the extracted text
start_position = extracted_text.find(target_text)

# Checking if the target text was found, just in case
if start_position != -1:
    # Removing everything before the target text
    extracted_text = extracted_text[start_position:]


# Saving the content to a .txt file with the suffix '_v01' for debugging and human evaluation

output_file_path = 'data/ISTQB_CTFL_Syllabus-v4.0_v01.txt'
with open(output_file_path, 'w', encoding='utf-8') as out_file:
    # Writing the extracted text to the output file
    out_file.write(extracted_text)

# Printing message to indicate that the text has been saved to the file
print(f"Extracted text for 'The Certified Tester Foundation Level in Software Testing' pre processed version 0.1 saved to '{output_file_path}'")



Extracted text for 'The Certified Tester Foundation Level in Software Testing' pre processed version 0.1 saved to 'data/ISTQB_CTFL_Syllabus-v4.0_v01.txt'
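The trimming step above can be wrapped in a small reusable function. This is a sketch under assumptions: the name "drop_before" and the sample text are made up for illustration and are not part of the notebook.

```python
def drop_before(text: str, marker: str) -> str:
    """Return the text starting at the first occurrence of marker.

    If the marker is absent, the text is returned unchanged.
    """
    position = text.find(marker)  # -1 when the marker is not present
    if position == -1:
        return text
    return text[position:]


# Made-up sample standing in for the extracted PDF text
sample = "Table of contents\nForeword\n1.1. What is Testing?\nSoftware systems are ..."
print(drop_before(sample, "1.1. What is Testing?"))
```

Note that str.find returns the first occurrence. If the marker also appeared earlier in a document (for example in a table of contents), the cut would happen there, so it is worth checking the saved file by eye.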


In [169]:
# removing empty lines
# The comprehension walks over the lines of the text; line.strip() is truthy only when the line
# contains at least one non-whitespace character, so only such lines are kept.

extracted_text = "".join([line for line in extracted_text.strip().splitlines(True) if line.strip()])

output_file_path = 'data/ISTQB_CTFL_Syllabus-v4.0_v02.txt'
with open(output_file_path, 'w', encoding='utf-8') as out_file:
    # Writing the extracted text to the output file
    out_file.write(extracted_text)

# Printing message to indicate that the text has been saved to the file
print(f"Extracted text for 'The Certified Tester Foundation Level in Software Testing' pre processed version 0.2 saved to '{output_file_path}'")

Extracted text for 'The Certified Tester Foundation Level in Software Testing' pre processed version 0.2 saved to 'data/ISTQB_CTFL_Syllabus-v4.0_v02.txt'
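To see what the blank-line filter does, here is a small self-contained sketch on a made-up string. The function name "remove_blank_lines" is an assumption for illustration; the logic mirrors the expression used in the cell above.

```python
def remove_blank_lines(text: str) -> str:
    """Drop lines that are empty or contain only whitespace."""
    # splitlines(keepends=True) keeps each line's own newline, so joining
    # with "" preserves the line breaks between the remaining lines.
    kept = [line for line in text.strip().splitlines(keepends=True) if line.strip()]
    return "".join(kept)


sample = "first\n\n   \nsecond\n\t\nthird"
print(repr(remove_blank_lines(sample)))  # 'first\nsecond\nthird'
```

Lines consisting only of spaces or tabs are removed as well, because strip() turns them into an empty string, which is falsy.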


In [171]:
# Removing text from 'Page 56 of 74' to the end of the text

# Text to look for; it and everything after it will be removed
target_text = "Page 56 of 74"

# Finding the position of the target text in the extracted text
end_position = extracted_text.find(target_text)

# Checking if the target text was found, just in case
if end_position != -1:
    # Keeping only the part before the target text
    extracted_text = extracted_text[:end_position]


# Saving the content to a .txt file with the suffix '_v03' for debugging and human evaluation

output_file_path = 'data/ISTQB_CTFL_Syllabus-v4.0_v03.txt'
with open(output_file_path, 'w', encoding='utf-8') as out_file:
    # Writing the extracted text to the output file
    out_file.write(extracted_text)

# Printing message to indicate that the text has been saved to the file
print(f"Extracted text for 'The Certified Tester Foundation Level in Software Testing' pre processed version 0.3 saved to '{output_file_path}'")


Extracted text for 'The Certified Tester Foundation Level in Software Testing' pre processed version 0.3 saved to 'data/ISTQB_CTFL_Syllabus-v4.0_v03.txt'
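Cutting off the tail of the text can also be written with str.partition, which avoids the explicit check for -1. This is a sketch, not the notebook's own code: the name "drop_from" and the sample text are hypothetical.

```python
def drop_from(text: str, marker: str) -> str:
    """Return the text that precedes the first occurrence of marker.

    If the marker is absent, the text is returned unchanged.
    """
    # partition() returns (text, "", "") when the marker is missing,
    # so the first element is the right answer in both cases.
    head, _separator, _tail = text.partition(marker)
    return head


# Made-up sample standing in for the extracted PDF text
sample = "body text\nPage 55 of 74\nmore body\nPage 56 of 74\nreferences ..."
print(repr(drop_from(sample, "Page 56 of 74")))  # 'body text\nPage 55 of 74\nmore body\n'
```

As with the slice "extracted_text[:end_position]" in the cell above, the marker itself is removed together with everything that follows it.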


In [174]:
# converting the whole extracted text (a str) to lower case
extracted_text = extracted_text.lower()

1.1. what is testing? 
software systems are an integral part of our daily life. most people have had experience with software 
that did not work as expected. software that does not work correctly can lead to many problems, 
including loss of money, time or business reputation, and, in extreme cases, even injury or death. 
software testing assesses software quality and helps reducing the risk of software failure in operation. 
software testing is a set of activities to discover defects and evaluate the quality of software artifacts. 
these artifacts, when being tested, are known as test objects. a common misconception about testing is 
that it only consists of executing tests (i.e., running the software and checking the test results). however, 
software testing also includes other activities and must be aligned with the software development lifecycle 
(see chapter 2). 
another common misconception about testing is that testing focuses entirely on verifying the test object. 
whilst testi