# TEXT SUMMARIZATION - COSINE ALGORITHM

**1. Preprocessing the texts**

In [17]:
# pip install matplotlib
# pip install wordcloud
# pip install spacy
# !python -m spacy download en_core_web_sm
# pip install networkx
# !pip install scipy

In [2]:
#nltk.download("punkt")  # for tokenization (recent NLTK releases may ask for "punkt_tab" instead)
#nltk.download("stopwords")  # for stopwords

In [41]:
import re # regular expressions
import nltk # natural language toolkit
import string # for string operations
import numpy as np # numerical python
import networkx as nx # networkx for graph operations
from nltk.cluster.util import cosine_distance
from IPython.display import HTML, display # for displaying HTML in Jupyter Notebook
from goose3 import Goose # for extracting text from web pages

In [4]:
original_text = """Artificial intelligence is human like intelligence machines.
                   It is the study of intelligent artificial agents.
                   Science and engineering to produce intelligent machines.
                   Solve problems and have intelligence.
                   Related to intelligent behavior machines.
                   Developing of reasoning machines.
                   Learn from mistakes and successes.
                   Artificial intelligence is related to reasoning in everyday situations."""
original_text = re.sub(r'\s+', ' ', original_text)  # remove extra spaces and newlines
original_text

'Artificial intelligence is human like intelligence machines. It is the study of intelligent artificial agents. Science and engineering to produce intelligent machines. Solve problems and have intelligence. Related to intelligent behavior machines. Developing of reasoning machines. Learn from mistakes and successes. Artificial intelligence is related to reasoning in everyday situations.'

In [5]:
stopwords = nltk.corpus.stopwords.words('english')  # get the list of stopwords in English
print(stopwords)
print(len(stopwords))  # number of stopwords (a bare expression mid-cell would not be shown)
print(string.punctuation)

['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'couldn', "couldn't", 'd', 'did', 'didn', "didn't", 'do', 'does', 'doesn', "doesn't", 'doing', 'don', "don't", 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn', "hadn't", 'has', 'hasn', "hasn't", 'have', 'haven', "haven't", 'having', 'he', "he'd", "he'll", 'her', 'here', 'hers', 'herself', "he's", 'him', 'himself', 'his', 'how', 'i', "i'd", 'if', "i'll", "i'm", 'in', 'into', 'is', 'isn', "isn't", 'it', "it'd", "it'll", "it's", 'its', 'itself', "i've", 'just', 'll', 'm', 'ma', 'me', 'mightn', "mightn't", 'more', 'most', 'mustn', "mustn't", 'my', 'myself', 'needn', "needn't", 'no', 'nor', 'not', 'now', 'o', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 're', 's', 'same', 'shan', "shan't", 'she

In [6]:
def preprocess(text):
    """Lowercase the text and drop stopword and punctuation tokens."""
    formatted_text = text.lower()
    # tokenize the text using the NLTK word tokenizer
    tokens = nltk.word_tokenize(formatted_text, language="english", preserve_line=False)
    # remove stopwords and punctuation from the text
    tokens = [word for word in tokens if word not in stopwords and word not in string.punctuation]
    return " ".join(tokens)  # join the tokens back into a string

**2. Function to calculate similarity between sentences**

- Link: https://en.wikipedia.org/wiki/Cosine_similarity
- Step by step calculations: https://janav.wordpress.com/2013/10/27/tf-idf-and-cosine-similarity/


In [7]:
original_sentences = [sentence for sentence in nltk.sent_tokenize(original_text)]
original_sentences

['Artificial intelligence is human like intelligence machines.',
 'It is the study of intelligent artificial agents.',
 'Science and engineering to produce intelligent machines.',
 'Solve problems and have intelligence.',
 'Related to intelligent behavior machines.',
 'Developing of reasoning machines.',
 'Learn from mistakes and successes.',
 'Artificial intelligence is related to reasoning in everyday situations.']

In [8]:
formatted_sentences = [preprocess(original_sentence) for original_sentence in original_sentences]
formatted_sentences

['artificial intelligence human like intelligence machines',
 'study intelligent artificial agents',
 'science engineering produce intelligent machines',
 'solve problems intelligence',
 'related intelligent behavior machines',
 'developing reasoning machines',
 'learn mistakes successes',
 'artificial intelligence related reasoning everyday situations']

In [9]:
def calculate_sentence_similarity(sent1, sent2):
    """Return the cosine similarity (0 to 1) between two preprocessed sentences."""
    words1 = nltk.word_tokenize(sent1)
    words2 = nltk.word_tokenize(sent2)
    if not words1 or not words2:  # an empty sentence gives a zero vector, so avoid dividing by zero
        return 0.0

    all_words = list(set(words1 + words2))  # vocabulary of both sentences, without duplicates

    vector1 = [0] * len(all_words)
    vector2 = [0] * len(all_words)

    # build the bag-of-words count vector for each sentence
    for word in words1:
        vector1[all_words.index(word)] += 1
    for word in words2:
        vector2[all_words.index(word)] += 1

    return 1 - cosine_distance(vector1, vector2)  # cosine similarity = 1 - cosine distance

In [10]:
calculate_sentence_similarity(formatted_sentences[0], formatted_sentences[4])

np.float64(0.17677669529663687)
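
The value above can be checked by hand. This is a worked sketch, assuming the bag-of-words vectors built by `calculate_sentence_similarity`: sentence 0 has the counts artificial=1, intelligence=2, human=1, like=1, machines=1, and sentence 4 has related=1, intelligent=1, behavior=1, machines=1. The only shared token is "machines" ("intelligent" and "intelligence" are different tokens because no stemming is applied), so the dot product is 1. The symbols u and v are introduced here for illustration only.

```latex
\cos\theta
  = \frac{\mathbf{u}\cdot\mathbf{v}}{\lVert\mathbf{u}\rVert\,\lVert\mathbf{v}\rVert}
  = \frac{1}{\sqrt{1^2+2^2+1^2+1^2+1^2}\;\sqrt{1^2+1^2+1^2+1^2}}
  = \frac{1}{\sqrt{8}\cdot 2}
  \approx 0.1768
```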

**3. Function to create the similarity matrix**

In [11]:
# The higher the value, the more similar the sentences are.
# The more words two sentences have in common, the higher the similarity score will be.

def calculate_similarity_matrix(sentences):
    similarity_matrix = np.zeros((len(sentences), len(sentences))) 
    for i in range(len(sentences)): 
        for j in range(len(sentences)): 
            if i == j: # same sentence
                continue
            similarity_matrix[i][j] = calculate_sentence_similarity(sentences[i], sentences[j])
    return similarity_matrix

In [12]:
calculate_similarity_matrix(formatted_sentences) # return the similarity matrix between all sentences

array([[0.        , 0.1767767 , 0.15811388, 0.40824829, 0.1767767 ,
        0.20412415, 0.        , 0.4330127 ],
       [0.1767767 , 0.        , 0.2236068 , 0.        , 0.25      ,
        0.        , 0.        , 0.20412415],
       [0.15811388, 0.2236068 , 0.        , 0.        , 0.4472136 ,
        0.25819889, 0.        , 0.        ],
       [0.40824829, 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.23570226],
       [0.1767767 , 0.25      , 0.4472136 , 0.        , 0.        ,
        0.28867513, 0.        , 0.20412415],
       [0.20412415, 0.        , 0.25819889, 0.        , 0.28867513,
        0.        , 0.        , 0.23570226],
       [0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        ],
       [0.4330127 , 0.20412415, 0.        , 0.23570226, 0.20412415,
        0.23570226, 0.        , 0.        ]])
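
The nested loop above calls the similarity function once per ordered pair, although the matrix is symmetric. As an alternative, here is a minimal vectorized sketch: it builds one count matrix over a shared vocabulary and gets all cosine similarities from a single matrix product. The name `similarity_matrix_vectorized` is hypothetical, and the sketch assumes the sentences are already preprocessed so that splitting on whitespace yields the same tokens as `nltk.word_tokenize`.

```python
import numpy as np

def similarity_matrix_vectorized(sentences):
    """Cosine similarity between all pairs of preprocessed sentences (hypothetical helper)."""
    # shared vocabulary across every sentence, one column per word
    vocabulary = sorted({word for s in sentences for word in s.split()})
    index = {word: i for i, word in enumerate(vocabulary)}

    # counts[r, c] = how often vocabulary word c occurs in sentence r
    counts = np.zeros((len(sentences), len(vocabulary)))
    for row, sentence in enumerate(sentences):
        for word in sentence.split():
            counts[row, index[word]] += 1

    # scale each row to unit length; empty sentences keep a zero row
    norms = np.linalg.norm(counts, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    unit = counts / norms

    similarity = unit @ unit.T        # dot products of unit vectors are cosines
    np.fill_diagonal(similarity, 0.0)  # ignore self-similarity, as in the loop version
    return similarity
```

For the eight `formatted_sentences` of this notebook, this should agree with the loop version up to floating-point rounding.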

**4. Function to summarize the text**

In [74]:
def summarize_text(text, num_sentences, percentage=0):
    """Rank sentences with PageRank; if percentage > 0 it overrides num_sentences."""
    original_sentences = nltk.sent_tokenize(text)
    formatted_sentences = [preprocess(original_sentence) for original_sentence in original_sentences]

    similarity_matrix = calculate_similarity_matrix(formatted_sentences)

    # create a weighted graph from the similarity matrix (nodes are sentence indices)
    similarity_graph = nx.from_numpy_array(similarity_matrix)

    scores = nx.pagerank(similarity_graph)  # rank the sentences using the PageRank algorithm
    ordered_scores = sorted(((scores[i], s) for i, s in enumerate(original_sentences)), reverse=True)

    if percentage > 0:
        num_sentences = int(len(formatted_sentences) * percentage)

    # slicing avoids an IndexError when num_sentences exceeds the number of sentences
    best_sentences = [sentence for _, sentence in ordered_scores[:num_sentences]]

    return original_sentences, best_sentences, ordered_scores

In [50]:
original_sentences, best_sentences, ordered_scores =  summarize_text(original_text, 3)

In [51]:
original_sentences

['Artificial intelligence is human like intelligence machines.',
 'It is the study of intelligent artificial agents.',
 'Science and engineering to produce intelligent machines.',
 'Solve problems and have intelligence.',
 'Related to intelligent behavior machines.',
 'Developing of reasoning machines.',
 'Learn from mistakes and successes.',
 'Artificial intelligence is related to reasoning in everyday situations.']

In [52]:
best_sentences

['Artificial intelligence is human like intelligence machines.',
 'Related to intelligent behavior machines.',
 'Artificial intelligence is related to reasoning in everyday situations.']
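
`best_sentences` is ordered by score, not by position in the text. Here the two orders happen to coincide, but for longer articles a summary often reads better in document order. A minimal sketch of that reordering, using a hypothetical helper `in_document_order` that is not part of the notebook:

```python
def in_document_order(best_sentences, original_sentences):
    """Return the selected sentences in the order they appear in the text (hypothetical helper)."""
    selected = set(best_sentences)  # set membership keeps the lookup cheap
    return [sentence for sentence in original_sentences if sentence in selected]
```

If the same sentence occurs more than once in the text, every occurrence is kept.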

In [53]:
ordered_scores

[(0.1905291122256424,
  'Artificial intelligence is human like intelligence machines.'),
 (0.1668464684213966, 'Related to intelligent behavior machines.'),
 (0.16235592006413635,
  'Artificial intelligence is related to reasoning in everyday situations.'),
 (0.1360932146861898,
  'Science and engineering to produce intelligent machines.'),
 (0.12441694110573492, 'Developing of reasoning machines.'),
 (0.11055888752338401, 'It is the study of intelligent artificial agents.'),
 (0.08822043499447205, 'Solve problems and have intelligence.'),
 (0.02097902097904386, 'Learn from mistakes and successes.')]
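
To see where these scores come from, here is a rough sketch of weighted PageRank by power iteration. It is an illustration under assumptions, not the networkx implementation: the function name `pagerank_power_iteration` is hypothetical, the damping factor 0.85 matches the networkx default, and a sentence with no edges (such as 'Learn from mistakes and successes.') is treated as a dangling node whose score is spread evenly over all sentences. Convergence details may differ slightly from `nx.pagerank`.

```python
import numpy as np

def pagerank_power_iteration(weights, damping=0.85, tol=1e-8, max_iter=100):
    """Approximate PageRank scores for a weighted similarity matrix (illustrative sketch)."""
    weights = np.asarray(weights, dtype=float)
    n = weights.shape[0]

    # turn each row into transition probabilities; rows without edges are "dangling"
    row_sums = weights.sum(axis=1)
    dangling = row_sums == 0
    transition = np.zeros_like(weights)
    transition[~dangling] = weights[~dangling] / row_sums[~dangling][:, None]

    scores = np.full(n, 1.0 / n)  # start from a uniform distribution
    for _ in range(max_iter):
        dangling_mass = scores[dangling].sum()
        # follow an edge with probability `damping`, otherwise jump to a random sentence
        new_scores = damping * (scores @ transition + dangling_mass / n) + (1 - damping) / n
        if np.abs(new_scores - scores).sum() < tol:
            return new_scores
        scores = new_scores
    return scores
```

A sentence scores highly when it is similar to many sentences that are themselves well connected, which is why the isolated sentence ends up last.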

In [55]:
def visualize(title, best_sentences, original_sentences):
    """
    Display the article title and highlight the best sentences in the summary.
    - title: str, the article title
    - best_sentences: list of str, the selected summary sentences
    - original_sentences: list of str, all sentences in the article (in order)
    """
    text = ""
    for sentence in original_sentences:
        if sentence in best_sentences:
            text += f"<mark>{sentence}</mark> "
        else:
            text += f"{sentence} "
    html = f"<h2>{title}</h2><p>{text}</p>"
    display(HTML(html))

In [56]:
visualize("Artificial Intelligence", best_sentences, original_sentences)

**5. Extracting texts from the internet**

In [66]:
g = Goose()
url = "https://en.wikipedia.org/wiki/Automatic_summarization"
article = g.extract(url=url)
article.cleaned_text

'Automatic summarization is the process of shortening a set of data computationally, to create a subset (a summary) that represents the most important or relevant information within the original content. Artificial intelligence (AI) algorithms are commonly developed and employed to achieve this, specialized for different types of data.\n\nText summarization is usually implemented by natural language processing methods, designed to locate the most informative sentences in a given document.[1] On the other hand, visual content can be summarized using computer vision algorithms. Image summarization is the subject of ongoing research; existing approaches typically attempt to display the most representative images from a given image collection, or generate a video that only includes the most important content from the entire collection.[2][3][4] Video summarization algorithms identify and extract from the original video content the most important frames (key-frames), and/or the most importa
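
The extracted text still contains citation markers such as `[1]` and paragraph breaks (`\n\n`), which can end up inside the highlighted sentences. One option is to clean the text before summarizing. This is a minimal sketch with a hypothetical helper `strip_citation_markers`; it assumes the markers are purely numeric and in square brackets, as in the output above.

```python
import re

def strip_citation_markers(text):
    """Remove numeric citation markers like [12] and collapse whitespace (hypothetical helper)."""
    text = re.sub(r"\[\d+\]", "", text)       # drop markers such as [1] or [16]
    return re.sub(r"\s+", " ", text).strip()  # collapse newlines and repeated spaces
```

It could be applied as `summarize_text(strip_citation_markers(article.cleaned_text), ...)`.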

In [75]:
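# num_sentences (120) is ignored here because percentage > 0; the summary keeps 20% of the sentences
original_sentences, best_sentences, ordered_scores = summarize_text(article.cleaned_text, 120, percentage=0.2)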

In [76]:
best_sentences

['The main difficulty in supervised extractive summarization is that the known summaries must be manually created by extracting sentences so the sentences in an original training document can be labeled as "in summary" or "not in summary".',
 '"Summarizing Conceptual Graphs for Automatic Summarization Task".',
 'Some unsupervised summarization approaches are based on finding a "centroid" sentence, which is the mean word vector of all the sentences in the document.',
 'For example, if we rank unigrams and find that "advanced", "natural", "language", and "processing" all get high ranks, then we would look at the original text and see that these words appear consecutively and create a final keyphrase using all four together.',
 'Instead of trying to learn explicit features that characterize keyphrases, the TextRank algorithm[16] exploits the structure of the text itself to determine keyphrases that appear "central" to the text in the same way that PageRank selects important Web pages.',
 

In [77]:
visualize(article.title, best_sentences, original_sentences)