# Summarization without Transformers

Generating summaries without powerful pre-trained models and libraries like Transformers takes more manual work. Let's go through a simple extractive example that uses traditional NLP techniques and no pre-trained models: we score each sentence by the frequency of the words it contains and keep the highest-scoring sentences.

In [1]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize
from string import punctuation

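We start by downloading the NLTK data required for stopword removal and sentence tokenization. Recent NLTK releases may ask for the punkt_tab resource instead of punkt; if sent_tokenize raises a LookupError, download the resource named in the error message.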

In [2]:

# Download required NLTK data
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/ec2-user/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /home/ec2-user/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [3]:
# Example article
article = """
The Astonishing Hypothesis is a 1994 book by scientist Francis Crick about consciousness and neuroscience. In it, Crick promotes the idea that consciousness is produced by physical and chemical processes in the brain, and that neuroscience will eventually have a theory which can explain consciousness. The book surveys the history of research into consciousness and outlines several hypotheses about the neural correlates of various components and properties of consciousness.
"""

We tokenize the article into sentences using sent_tokenize.

In [7]:
# Tokenize sentences
sentences = sent_tokenize(article)
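
To make the result of this step concrete, here is a minimal sketch that uses a plain regular expression as a rough stand-in for sent_tokenize. The helper name `naive_sent_split` and the sample text are illustrative assumptions, not part of the notebook; the Punkt tokenizer behind sent_tokenize handles abbreviations and other edge cases that this pattern does not.

```python
import re

def naive_sent_split(text):
    """Split after '.', '!' or '?' followed by whitespace (rough stand-in for sent_tokenize)."""
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [part for part in parts if part]

# Hypothetical sample text for illustration
sample = "Crick wrote a book. It is about consciousness. The book appeared in 1994."
sample_sentences = naive_sent_split(sample)
for s in sample_sentences:
    print(s)
```

Either way, the output of this step is a list of sentence strings, which is the shape the later cells rely on.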

We then tokenize each sentence into words using word_tokenize, convert them to lowercase, and remove stopwords and punctuation.

In [8]:
# Tokenize words and remove stopwords/punctuation
# Build the lookup sets once instead of reloading the stopword list for every word
stop_words = set(stopwords.words('english'))
punctuation_marks = set(punctuation)
word_tokens = [word_tokenize(sentence.lower()) for sentence in sentences]
filtered_words = [
    [word for word in words if word not in stop_words and word not in punctuation_marks]
    for words in word_tokens
]
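
The effect of the filter is easiest to see on a single tokenized sentence. The sketch below assumes a tiny hand-written stopword set (`toy_stop_words`) and a hypothetical helper `filter_tokens`; NLTK's English stopword list is much longer, so real results will differ.

```python
from string import punctuation

# Hypothetical miniature stopword set; NLTK's English list is much longer
toy_stop_words = {"the", "is", "a", "by", "about", "and"}
punctuation_marks = set(punctuation)

def filter_tokens(tokens):
    """Keep tokens that are neither stopwords nor single punctuation characters."""
    return [t for t in tokens if t not in toy_stop_words and t not in punctuation_marks]

tokens = ["the", "book", "is", "about", "consciousness", "and", "neuroscience", "."]
print(filter_tokens(tokens))  # ['book', 'consciousness', 'neuroscience']
```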


We calculate the frequency of each word in the article.

In [9]:
# Calculate word frequencies
word_freq = {}
for words in filtered_words:
    for word in words:
        if word in word_freq:
            word_freq[word] += 1
        else:
            word_freq[word] = 1
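
As an alternative sketch, the same tally can be written with `collections.Counter` from the standard library. The token lists below are hypothetical stand-ins for filtered_words, chosen only to keep the example self-contained.

```python
from collections import Counter

# Hypothetical pre-filtered tokens, one list per sentence
toy_filtered_words = [
    ["book", "consciousness", "neuroscience"],
    ["consciousness", "brain", "neuroscience", "consciousness"],
]

# Counter tallies every token across all sentences in one pass
toy_word_freq = Counter(word for words in toy_filtered_words for word in words)
print(toy_word_freq.most_common(2))  # [('consciousness', 3), ('neuroscience', 2)]
```

A Counter behaves like a dictionary, so lookups such as `toy_word_freq["brain"]` work the same way as with the hand-built word_freq.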

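For each sentence, we calculate a score by summing the frequencies of the words it contains. Stopwords and punctuation add nothing to the score because they never entered word_freq, and longer sentences tend to score higher because they contribute more terms to the sum.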

In [10]:
# Calculate sentence scores
sentence_scores = {}
for i, sentence in enumerate(sentences):
    for word in word_tokenize(sentence.lower()):
        if word in word_freq:
            if i in sentence_scores:
                sentence_scores[i] += word_freq[word]
            else:
                sentence_scores[i] = word_freq[word]
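
The arithmetic behind the scores can be checked by hand on a small case. This sketch assumes made-up frequencies and pre-tokenized sentences, and the helper `score_sentence` is a hypothetical name introduced here for illustration.

```python
# Hypothetical word frequencies and tokenized sentences for illustration
toy_word_freq = {"consciousness": 3, "neuroscience": 2, "book": 1, "brain": 1}
toy_sentences = [
    ["the", "book", "is", "about", "consciousness", "and", "neuroscience"],
    ["consciousness", "is", "produced", "in", "the", "brain"],
]

def score_sentence(tokens, word_freq):
    """Sum the frequency of every token found in word_freq; other tokens add 0."""
    return sum(word_freq.get(token, 0) for token in tokens)

toy_scores = {i: score_sentence(tokens, toy_word_freq) for i, tokens in enumerate(toy_sentences)}
print(toy_scores)  # {0: 6, 1: 4}
```

The first sentence scores 1 + 3 + 2 = 6 and the second 3 + 1 = 4. One difference from the cell above: here a sentence with no counted words gets a score of 0, whereas the notebook's loop leaves such a sentence out of sentence_scores entirely.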



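We sort the sentences by score in descending order and select the top N (in this case, 3) as the summary. Because the example article contains only three sentences, N = 3 keeps all of them, and they appear in score order rather than in their original order.

Finally, we join the selected sentences into a single string and print it as the summary.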

In [11]:
# Select top N sentences as the summary
N = 3
summary_sentences = sorted(sentence_scores.items(), key=lambda x: x[1], reverse=True)[:N]
summary = ' '.join([sentences[idx] for idx, _ in summary_sentences])

print(f"Summary: {summary}")

Summary: In it, Crick promotes the idea that consciousness is produced by physical and chemical processes in the brain, and that neuroscience will eventually have a theory which can explain consciousness. The book surveys the history of research into consciousness and outlines several hypotheses about the neural correlates of various components and properties of consciousness. 
The Astonishing Hypothesis is a 1994 book by scientist Francis Crick about consciousness and neuroscience.
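
In the printed summary the sentences follow their scores, so the article's opening sentence comes last. If the original reading order matters, one option is to pick the top N indices by score and then sort those indices before joining. The sketch below is a variation on the notebook's selection step, not part of it; `top_n_in_order` and the toy data are hypothetical.

```python
def top_n_in_order(sentences, sentence_scores, n):
    """Pick the n highest-scoring sentence indices, then restore document order."""
    top = sorted(sentence_scores, key=sentence_scores.get, reverse=True)[:n]
    return [sentences[i] for i in sorted(top)]

# Hypothetical sentences and scores for illustration
toy_sentences = ["First sentence.", "Second sentence.", "Third sentence.", "Fourth sentence."]
toy_scores = {0: 5, 1: 9, 2: 2, 3: 7}
print(' '.join(top_n_in_order(toy_sentences, toy_scores, 2)))  # Second sentence. Fourth sentence.
```

With the notebook's variables, the equivalent call would be `top_n_in_order(sentences, sentence_scores, N)`.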
