# Frequency-based text summarization algorithm

## Steps:
- Preprocess the text (lowercase it, remove stop words and punctuation).
- Calculate the absolute frequency of each word.
- Calculate the weighted frequency of each word by dividing its count by the count of the most frequent word (results fall in the range 0-1, and the most frequent word scores exactly 1).
- Tokenize the original text into sentences.
- Calculate each sentence's score (the sum of the weighted frequencies of the words it contains).
- Rank the sentences by score.
- Generate the summary from the top-ranked sentences.
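
The steps above can be sketched end to end with only the standard library. This is a minimal illustration, not the NLTK-based implementation used in the rest of this notebook: the tiny `STOP_WORDS` set, the regex tokenizer, the regex sentence splitter, and the name `summarize_sketch` are all assumptions made for the example, and the input is assumed to contain at least one non-stop word.

```python
import heapq
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; NLTK's English list is much longer).
STOP_WORDS = {"is", "it", "the", "of", "and", "to", "in", "from", "have", "a"}

def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def summarize_sketch(text, n_sentences=2):
    # Steps 1-2: preprocess and count absolute word frequencies.
    words = [w for w in tokenize(text) if w not in STOP_WORDS]
    counts = Counter(words)
    # Step 3: weight each count relative to the most frequent word.
    top = max(counts.values())
    weights = {w: c / top for w, c in counts.items()}
    # Step 4: naive sentence split after '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # Step 5: score each sentence as the sum of its word weights.
    scores = {s: sum(weights.get(w, 0.0) for w in tokenize(s)) for s in sentences}
    # Steps 6-7: keep the highest-scoring sentences and join them.
    best = heapq.nlargest(n_sentences, scores, key=scores.get)
    return " ".join(best)

demo = "Cats chase mice. Dogs chase cats and mice. Birds sing."
print(summarize_sketch(demo, n_sentences=1))  # Dogs chase cats and mice.
```

In the demo, "cats", "chase" and "mice" each occur twice and get weight 1.0, so the second sentence scores 3.5 against 3.0 and 1.0 for the others.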

### Preprocessing text

In [1]:
import re
import nltk
import string

In [3]:
source_text = """Artificial intelligence is human like intelligence. 
                   It is the study of intelligent artificial agents. 
                   Science and engineering to produce intelligent machines. 
                   Solve problems and have intelligence. 
                   Related to intelligent behavior. 
                   Developing of reasoning machines. 
                   Learn from mistakes and successes. 
                   Artificial intelligence is related to reasoning in everyday situations."""

In [6]:
original_text = re.sub(r'\s+', ' ', source_text)
original_text

'Artificial intelligence is human like intelligence. It is the study of intelligent artificial agents. Science and engineering to produce intelligent machines. Solve problems and have intelligence. Related to intelligent behavior. Developing of reasoning machines. Learn from mistakes and successes. Artificial intelligence is related to reasoning in everyday situations.'

In [9]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/mykolafant/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [10]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/mykolafant/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [11]:
stopwords = nltk.corpus.stopwords.words('english')
print(stopwords)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [23]:
def preprocess(text):
    # Lowercase the text and split it into word tokens
    tokens = nltk.word_tokenize(text.lower())
    # Remove stop words and punctuation tokens
    tokens = [word for word in tokens if word not in stopwords and word not in string.punctuation]

    return ' '.join(tokens)

In [24]:
formatted_text = preprocess(original_text)
formatted_text

'artificial intelligence human like intelligence study intelligent artificial agents science engineering produce intelligent machines solve problems intelligence related intelligent behavior developing reasoning machines learn mistakes successes artificial intelligence related reasoning everyday situations'

### Calculate absolute word frequency

In [32]:
word_frequency = nltk.FreqDist(nltk.word_tokenize(formatted_text))
word_frequency

FreqDist({'intelligence': 4, 'artificial': 3, 'intelligent': 3, 'machines': 2, 'related': 2, 'reasoning': 2, 'human': 1, 'like': 1, 'study': 1, 'agents': 1, ...})

### Calculate weighted word frequency

In [33]:
highest_frequency = max(word_frequency.values())
highest_frequency

4

In [34]:
for word in word_frequency.keys():
    word_frequency[word] /= highest_frequency
word_frequency

FreqDist({'intelligence': 1.0, 'artificial': 0.75, 'intelligent': 0.75, 'machines': 0.5, 'related': 0.5, 'reasoning': 0.5, 'human': 0.25, 'like': 0.25, 'study': 0.25, 'agents': 0.25, ...})

### Sentence tokenization

In [36]:
sentence_list = nltk.sent_tokenize(original_text)
sentence_list

['Artificial intelligence is human like intelligence.',
 'It is the study of intelligent artificial agents.',
 'Science and engineering to produce intelligent machines.',
 'Solve problems and have intelligence.',
 'Related to intelligent behavior.',
 'Developing of reasoning machines.',
 'Learn from mistakes and successes.',
 'Artificial intelligence is related to reasoning in everyday situations.']

### Calculate sentence scores

In [38]:
scored_sentences = {}
for sentence in sentence_list:
    for word in nltk.word_tokenize(sentence.lower()):
        if sentence not in scored_sentences.keys():
            scored_sentences[sentence] = word_frequency[word]
        else:
            scored_sentences[sentence] += word_frequency[word]

scored_sentences

{'Artificial intelligence is human like intelligence.': 3.25,
 'It is the study of intelligent artificial agents.': 2.0,
 'Science and engineering to produce intelligent machines.': 2.0,
 'Solve problems and have intelligence.': 1.5,
 'Related to intelligent behavior.': 1.5,
 'Developing of reasoning machines.': 1.25,
 'Learn from mistakes and successes.': 0.75,
 'Artificial intelligence is related to reasoning in everyday situations.': 3.25}
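
As a check on the arithmetic, the score of the first sentence can be recomputed by hand from the weighted frequencies shown earlier. The token list below is written out manually as an assumption about how the sentence is tokenized; "is" and "." have no weight, so they contribute 0.

```python
# Weighted frequencies taken from the output of the weighting step.
weights = {"artificial": 0.75, "intelligence": 1.0, "human": 0.25, "like": 0.25}

# Tokens of 'Artificial intelligence is human like intelligence.' after lowercasing.
tokens = ["artificial", "intelligence", "is", "human", "like", "intelligence", "."]

# Words without a weight (stop words, punctuation) add nothing to the score.
score = sum(weights.get(token, 0.0) for token in tokens)
print(score)  # 3.25
```

Because "intelligence" appears twice, it is counted twice: 0.75 + 1.0 + 0.25 + 0.25 + 1.0 = 3.25, which matches the value in the output above.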

### Order the sentences

In [39]:
import heapq
best_sentences = heapq.nlargest(3, scored_sentences, key = scored_sentences.get)

best_sentences

['Artificial intelligence is human like intelligence.',
 'Artificial intelligence is related to reasoning in everyday situations.',
 'It is the study of intelligent artificial agents.']
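
Two sentences share the score 3.25 and two share 2.0, so the result depends on how ties are broken. The sketch below reuses the scores from the previous step to illustrate that `heapq.nlargest` behaves like a stable descending sort, so tied sentences keep their document order. The final re-sorting into document order is an optional variant added for illustration, not something the cell above does.

```python
import heapq

# Sentence scores copied from the scoring step (insertion order = document order).
scored_sentences = {
    'Artificial intelligence is human like intelligence.': 3.25,
    'It is the study of intelligent artificial agents.': 2.0,
    'Science and engineering to produce intelligent machines.': 2.0,
    'Solve problems and have intelligence.': 1.5,
    'Related to intelligent behavior.': 1.5,
    'Developing of reasoning machines.': 1.25,
    'Learn from mistakes and successes.': 0.75,
    'Artificial intelligence is related to reasoning in everyday situations.': 3.25,
}

best = heapq.nlargest(3, scored_sentences, key=scored_sentences.get)

# nlargest is documented as equivalent to this stable, reverse-ordered sort,
# so sentences with equal scores keep their document order.
assert best == sorted(scored_sentences, key=scored_sentences.get, reverse=True)[:3]

# Optional variant: put the selected sentences back into document order.
position = {sentence: i for i, sentence in enumerate(scored_sentences)}
in_document_order = sorted(best, key=position.get)
print(in_document_order)
```

Restoring document order can make the summary read more naturally, at the cost of no longer showing which sentence ranked highest.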

### Generate the summary

In [40]:
summary = ' '.join(best_sentences)
summary

'Artificial intelligence is human like intelligence. Artificial intelligence is related to reasoning in everyday situations. It is the study of intelligent artificial agents.'

## HTML visualization

In [41]:
from IPython.display import HTML, display

In [45]:
text = ''
display(HTML('<h2>Summary</h2>'))
for sentence in sentence_list:
  if sentence in best_sentences:
    # Highlight the sentences selected for the summary
    text += f' <mark><strong>{sentence}</strong></mark>'
  else:
    text += ' ' + sentence

display(HTML(f'<p style="font-size: 16px;">{text}</p>'))
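
The cell above interpolates the sentences into HTML as-is, which is fine for the sample text but could break the markup if a sentence contained `<`, `>` or `&`. A minimal sketch of a safer variant follows; the function name `highlight_html` is hypothetical, and it only builds the markup string, which could then be passed to `HTML(...)` for display.

```python
from html import escape

def highlight_html(sentences, selected):
    """Return an HTML paragraph with the selected sentences wrapped in <mark>."""
    chosen = set(selected)
    parts = []
    for sentence in sentences:
        safe = escape(sentence)  # neutralize <, > and & from the source text
        if sentence in chosen:
            parts.append(f"<mark><strong>{safe}</strong></mark>")
        else:
            parts.append(safe)
    return '<p style="font-size: 16px;">' + " ".join(parts) + "</p>"

markup = highlight_html(["A < B.", "C & D."], ["C & D."])
print(markup)
# <p style="font-size: 16px;">A &lt; B. <mark><strong>C &amp; D.</strong></mark></p>
```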

## Extracting texts from the Internet

In [46]:
!pip install goose3

In [47]:
from goose3 import Goose

In [48]:
g = Goose()
url = "https://en.wikipedia.org/wiki/Automatic_summarization"
article = g.extract(url)
article.title

'Automatic summarization - Wikipedia'

In [49]:
article.cleaned_text

'Automatic summarization is the process of shortening a set of data computationally, to create a subset (a summary) that represents the most important or relevant information within the original content. Artificial intelligence algorithms are commonly developed and employed to achieve this, specialized for different types of data.\n\nText summarization is usually implemented by natural language processing methods, designed to locate the most informative sentences in a given document.[1] On the other hand, visual content can be summarized using computer vision algorithms. Image summarization is the subject of ongoing research; existing approaches typically attempt to display the most representative images from a given image collection, or generate a video that only includes the most important content from the entire collection.[2][3][4] Video summarization algorithms identify and extract from the original video content the most important frames (key-frames), and/or the most important vi

In [50]:
formatted_article = preprocess(article.cleaned_text)
formatted_article

"automatic summarization process shortening set data computationally create subset summary represents important relevant information within original content artificial intelligence algorithms commonly developed employed achieve specialized different types data text summarization usually implemented natural language processing methods designed locate informative sentences given document 1 hand visual content summarized using computer vision algorithms image summarization subject ongoing research existing approaches typically attempt display representative images given image collection generate video includes important content entire collection 2 3 4 video summarization algorithms identify extract original video content important frames key-frames and/or important video segments key-shots normally temporally ordered fashion 5 6 7 8 video summaries simply retain carefully selected subset original video frames therefore identical output video synopsis algorithms new video frames synthesize

In [51]:
import heapq

def summarize(text, number_of_sentences, percentage = 0):
  original_text = text
  formatted_text = preprocess(original_text)

  # Weighted word frequency relative to the most frequent word
  word_frequency = nltk.FreqDist(nltk.word_tokenize(formatted_text))
  highest_frequency = max(word_frequency.values())
  for word in word_frequency.keys():
    word_frequency[word] = (word_frequency[word] / highest_frequency)
  sentence_list = nltk.sent_tokenize(original_text)

  score_sentences = {}
  for sentence in sentence_list:
    # Lowercase before the lookup: the keys of word_frequency are lowercase
    for word in nltk.word_tokenize(sentence.lower()):
      if word in word_frequency:
        score_sentences[sentence] = score_sentences.get(sentence, 0) + word_frequency[word]

  # Keep either a fraction of the sentences or a fixed number of them
  if percentage > 0:
    best_sentences = heapq.nlargest(int(len(sentence_list) * percentage), score_sentences, key=score_sentences.get)
  else:
    best_sentences = heapq.nlargest(number_of_sentences, score_sentences, key=score_sentences.get)

  return sentence_list, best_sentences, word_frequency, score_sentences
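
When `percentage` is used, `int(len(sentence_list) * percentage)` rounds down, so a short text with a small percentage can yield zero sentences. One possible guard is sketched below; `sentence_budget` is a hypothetical helper, not part of the function above, and clamping to at least one sentence is an assumption about the desired behavior.

```python
def sentence_budget(total_sentences, number_of_sentences, percentage=0.0):
    """Hypothetical helper: how many sentences the summary should keep."""
    if percentage > 0:
        wanted = int(total_sentences * percentage)  # rounds down, may be 0
    else:
        wanted = number_of_sentences
    # Clamp so the result is at least 1 and never exceeds the sentence count.
    return max(1, min(wanted, total_sentences))

print(sentence_budget(8, 3))        # 3
print(sentence_budget(8, 3, 0.5))   # 4
print(sentence_budget(4, 3, 0.1))   # 1
```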

In [56]:
sentence_list, best_sentences, word_frequency, score_sentences = summarize(article.cleaned_text, 20)

In [53]:
def visualize(title, sentence_list, best_sentences):
  from IPython.display import HTML, display
  text = ''

  display(HTML(f'<h1>Summary - {title}</h1>'))
  for sentence in sentence_list:
    if sentence in best_sentences:
      # Highlight the sentences selected for the summary
      text += f' <mark>{sentence}</mark>'
    else:
      text += ' ' + sentence
  display(HTML(f' {text} '))

In [57]:
visualize(article.title, sentence_list, best_sentences)