## Summarising multiple Science Fiction titles using NLP transformers

In this project we will take several science fiction titles from Project Gutenberg and create shortened summaries of them using natural language processing. When dealing with large quantities of text, shorter summaries make it quicker to survey the material. Keep in mind that summarization models tend to perform better on non-fiction documents than on fiction like this.

First, let's import the books as text data using glob.

In [1]:
# Import library
import glob

# The book files are contained in this folder
folder = "C:/Users/dalin/Dropbox/MachineLearning/SciFi/books/"

# List all the .txt files and sort them alphabetically
files = glob.glob(folder + "*.txt")
files.sort()
files

['C:/Users/dalin/Dropbox/MachineLearning/SciFi/books\\A Journey To the Centre of the Earth.txt',
 'C:/Users/dalin/Dropbox/MachineLearning/SciFi/books\\A Trip to Venus.txt',
 'C:/Users/dalin/Dropbox/MachineLearning/SciFi/books\\Armageddon.txt']
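The output above mixes forward slashes and backslashes because the folder string is concatenated with the file pattern. As a minimal sketch of an alternative, `pathlib` can list and sort the same kind of files and also gives the title without the extension via `.stem`. The helper name `list_books` and the temporary demo files are assumptions made for illustration; they are not part of the original notebook.

```python
import tempfile
from pathlib import Path

def list_books(folder):
    """Return the .txt files in `folder`, sorted by file name (hypothetical helper)."""
    return sorted(Path(folder).glob("*.txt"), key=lambda p: p.name)

# Demo on a throwaway folder so the sketch runs anywhere
with tempfile.TemporaryDirectory() as tmp:
    for name in ["b.txt", "a.txt", "notes.md"]:
        (Path(tmp) / name).write_text("demo", encoding="utf-8")
    found = [p.name for p in list_books(tmp)]   # file names only
    stems = [p.stem for p in list_books(tmp)]   # names without ".txt"

print(found)
print(stems)
```

Here `p.stem` plays the same role as `os.path.basename(n).replace(".txt", "")` later in the notebook.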

From the Hugging Face Transformers library, we will use the default summarization model, DistilBART-CNN-12-6. We create the summarizer with the Hugging Face pipeline as follows.

In [2]:
from transformers import pipeline

In [4]:
summarizer = pipeline("summarization")

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 (https://huggingface.co/sshleifer/distilbart-cnn-12-6)


Next, we will separate the texts from the titles of the books. We'll also replace every run of characters other than letters, digits, underscores, periods, question marks, and exclamation marks with a single space.

In [5]:
# Import libraries
import re, os

# Initialize the lists that will contain the texts and titles
txts = []
titles = []

for n in files:
    # Open each file; the context manager closes it after reading
    with open(n, encoding='utf-8-sig') as f:
        raw = f.read()
    # Replace each run of characters other than letters, digits, underscores,
    # periods, question marks, and exclamation marks with a single space
    data = re.sub('[^a-zA-Z0-9_.?!]+', ' ', raw)
    # Store the texts and titles of the books in two separate lists
    txts.append(data)
    titles.append(os.path.basename(n).replace(".txt", ""))

# Print the length, in characters, of each text
[len(t) for t in txts]

[492162, 292767, 172609]
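To see what the cleaning step does to a piece of text, here is a small sketch that applies the same regular expression to a made-up sentence. The function name `clean_text` and the sample string are assumptions for illustration only.

```python
import re

# Same character class as in the cell above: anything that is not a letter,
# digit, underscore, '.', '?' or '!' is collapsed into a single space.
PATTERN = re.compile(r'[^a-zA-Z0-9_.?!]+')

def clean_text(raw):
    """Replace runs of disallowed characters with one space (hypothetical helper)."""
    return PATTERN.sub(' ', raw)

sample = '"Wait!" he cried -- is it\nreally 1,000 miles? Yes.'
cleaned = clean_text(sample)
print(repr(cleaned))
# → ' Wait! he cried is it really 1 000 miles? Yes.'
```

Note that quotation marks, commas, dashes, and line breaks all become spaces, so a number like `1,000` ends up as `1 000`. That is acceptable here because only sentence-ending punctuation is needed downstream.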

In the next block, we'll create titles for the summaries to identify them once saved.

In [6]:
# Create titles for summary. Will be used at the end for creating text files.
summary_string = '_summary.txt'
summary_titles = [t + summary_string for t in titles]

In [7]:
summary_titles

['A Journey To the Centre of the Earth_summary.txt',
 'A Trip to Venus_summary.txt',
 'Armageddon_summary.txt']

Now we'll extract the main body of each book, excluding the introductory and licensing notes that Project Gutenberg adds.

In [8]:
# Extract the main body of the text file, excluding introductory and licensing notes, per Project Gutenberg's formatting.
content = []
for txt in txts:
    start = txt.find("START OF THIS PROJECT GUTENBERG") + len("START OF THIS PROJECT GUTENBERG")
    end = txt.find("END OF THIS PROJECT GUTENBERG")
    substring = txt[start:end]
    content.append(substring)

Let's check the length of each book once extracted.

In [9]:
[len(c) for c in content]

[473512, 274098, 153897]
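The extraction above relies on both markers being present. `str.find` returns `-1` when a marker is missing, in which case the slice would silently skip the first characters of the text and drop the last one. Below is a minimal sketch of a more defensive variant that falls back to the start or end of the text instead. The function name `extract_body` is an assumption, and files that word the marker differently would need different marker strings.

```python
START_MARKER = "START OF THIS PROJECT GUTENBERG"
END_MARKER = "END OF THIS PROJECT GUTENBERG"

def extract_body(txt, start_marker=START_MARKER, end_marker=END_MARKER):
    """Return the text between the two markers (hypothetical helper).

    If the start marker is missing, begin at the start of the text;
    if the end marker is missing, run to the end of the text.
    """
    start = txt.find(start_marker)
    start = 0 if start == -1 else start + len(start_marker)
    # Search for the end marker only after the start position
    end = txt.find(end_marker, start)
    if end == -1:
        end = len(txt)
    return txt[start:end]

sample = ("header START OF THIS PROJECT GUTENBERG EBOOK X body text "
          "END OF THIS PROJECT GUTENBERG EBOOK X license")
body = extract_body(sample)
print(repr(body))
# → ' EBOOK X body text '
```

As in the original cell, the remainder of the marker line (here ` EBOOK X`) stays at the start of the extracted body.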

Now, for each book, we'll mark the end of each sentence using periods, question marks, and exclamation marks.

In [12]:
eos_content = []
for c in content:
    c1 = c.replace('. ', '.<eos>')
    c2 = c1.replace('? ', '?<eos>')
    c3 = c2.replace('! ', '!<eos>')
    eos_content.append(c3)
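The three `replace` calls followed by a later split on `<eos>` can also be expressed as one regular-expression split. This is a sketch of that alternative, not the notebook's method; `split_sentences` is a hypothetical name.

```python
import re

def split_sentences(text):
    """Split after '.', '?' or '!' when followed by whitespace (hypothetical helper).

    The lookbehind keeps the punctuation attached to its sentence.
    """
    parts = re.split(r'(?<=[.?!])\s+', text)
    return [p for p in parts if p]

sentences = split_sentences("Is it far? Yes. Run! Now")
print(sentences)
# → ['Is it far?', 'Yes.', 'Run!', 'Now']
```

Like the `<eos>` approach, this splits on every period followed by a space, so abbreviations such as `Mr. Smith` are cut in two. For chunking purposes that is usually harmless.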

Now we'll split the content of each book into individual sentences, group the sentences into chunks of at most 500 words, and feed the chunks to our summarizer. The summarizer then generates a summary for each chunk, and joining these gives a summary for each book.

In [15]:
# Maximum number of words per chunk
max_chunk = 500
summaries = []
for e in eos_content:
    sentences = e.split('<eos>')
    current_chunk = 0
    chunks = []
    for sentence in sentences:
        if len(chunks) == current_chunk + 1:
            # Add the sentence to the current chunk if it still fits
            if len(chunks[current_chunk]) + len(sentence.split(' ')) <= max_chunk:
                chunks[current_chunk].extend(sentence.split(' '))
            else:
                # Otherwise start a new chunk with this sentence
                current_chunk += 1
                chunks.append(sentence.split(' '))
        else:
            # First sentence of the book: prints 0 once per book
            print(current_chunk)
            chunks.append(sentence.split(' '))

    # Turn each chunk from a list of words back into a string
    for chunk_id in range(len(chunks)):
        chunks[chunk_id] = ' '.join(chunks[chunk_id])

    # Summarize every chunk of this book
    res = summarizer(chunks, max_length=120, min_length=30, do_sample=False)
    # Join the chunk summaries together to make the whole summary
    text = ' '.join([summ['summary_text'] for summ in res])
    # Append the whole summary to the summary list
    summaries.append(text)
    

0


Your max_length is set to 120, but you input_length is only 33. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=16)


0
0
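The chunking logic can be tried in isolation, without loading the model. Below is a minimal sketch of the same idea as a standalone function: sentences are added to the current chunk until the word limit would be exceeded, then a new chunk is started. The name `chunk_sentences` and the tiny `max_words=5` demo are assumptions for illustration.

```python
def chunk_sentences(sentences, max_words=500):
    """Group sentences into chunks of at most `max_words` words (hypothetical helper).

    A single sentence longer than `max_words` still becomes its own chunk.
    """
    chunks = []
    current = []  # words of the chunk being built
    for sentence in sentences:
        words = sentence.split()
        if not words:
            continue  # skip empty sentences
        # Close the current chunk if this sentence would not fit
        if current and len(current) + len(words) > max_words:
            chunks.append(' '.join(current))
            current = []
        current.extend(words)
    if current:
        chunks.append(' '.join(current))
    return chunks

demo = chunk_sentences(
    ["one two three.", "four five.", "six seven eight nine."],
    max_words=5,
)
print(demo)
# → ['one two three. four five.', 'six seven eight nine.']
```

The limit is counted in words, which is only a rough proxy for the number of tokens the model sees. The warning in the output above comes from a chunk that is shorter than the requested `max_length`; such short chunks typically occur at the end of a book.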


Each book's summary will now be saved as a separate text file, which can then be opened whenever a summary is needed.

In [28]:
# Save the summaries in individual text files, using the file names created earlier.
for summary_title, summary in zip(summary_titles, summaries):
    with open(summary_title, 'w', encoding='utf-8') as f:
        f.write(summary)
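The files above are written to the current working directory. If a dedicated output folder is preferred, a sketch along the following lines could be used. The function name `save_summaries`, its `out_dir` parameter, and the demo values are assumptions, not part of the original notebook.

```python
import tempfile
from pathlib import Path

def save_summaries(titles, summaries, out_dir="."):
    """Write each summary to '<title>_summary.txt' in `out_dir` (hypothetical helper)."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)  # create the folder if needed
    paths = []
    for title, summary in zip(titles, summaries):
        path = out / f"{title}_summary.txt"
        path.write_text(summary, encoding="utf-8")
        paths.append(path)
    return paths

# Demo with a throwaway folder and a made-up summary
with tempfile.TemporaryDirectory() as tmp:
    written = save_summaries(["Armageddon"], ["A short summary."], out_dir=tmp)
    written_names = [p.name for p in written]
    round_trip = written[0].read_text(encoding="utf-8")

print(written_names)
```

Returning the written paths makes it easy to confirm afterwards which files were created.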