# Text Summarisation with Extraction Workbook

## 1.0. Introduction

In today's digital age, news flows in an endless stream from various sources. We have great amount of news articles everyday. But, there are a small amount of useful information in the articles and it is hard to extract useful information manually. As a result, there are lots of news articles but, it is hard to read all of articles and find informative news manually. One of solutions on this problem is to summarize texts in the article.

<p align='center'>
    <img src="https://blog.fpt-software.com/hs-fs/hubfs/image-8.png?width=376&name=image-8.png" alt="Text Summarisation Visual" />
</p>

### 1.1. Problem Statement
Text summarisation automatically gives the reader a summary containing important sentences and relevant information about an article. This is highly useful because it shortens the time needed to capture the meaning and main events of an article. Broadly, there are 2 ways of performing text summarisation - abstractive and extractive. 

**Abstractive.** Abstractive methods analyse input texts and generate new texts that capture the essence of the original text. If trained correctly, they convey the same meaning as the original text, yet are more concise.

**Extractive.** Extractive methods, on the other, take out the important texts from the original text and joins them to form a summary. Hence, they do not generate any new texts.

In this assignment, we'll use the abstractive method to solve the following problem - **given a news article, can we return a succinct summary of the article?**

### 1.2. Extractive Text Summarisation
In the field of text summarisation, the techniques used can be broadly classified into two categories - extraction and abstraction. **Extraction** techniques take out the important sentences or phrases from the original text and joins them to form a summary. This involves a ranking algorithm to assign scores to sentences or phrases based on a certain relevance to the overall meaning of the document. 

This workbook will be used to develop 2 forms of extractive methods for text summarisation - 1) Weighted Frequency-Based Approach, and 2) Term Frequency-Inverse Document Frequency (TF-IDF)

## 2.0. Import Libraries


In [1]:
import nltk
import pandas as pd
import numpy as np
from collections import Counter
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize
nltk.download('punkt')
nltk.download('stopwords')
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from heapq import nlargest

[nltk_data] Downloading package punkt to /Users/bobbycxy/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/bobbycxy/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### 2.1. Create Helper Functions

In [2]:
def preprocess(filepath, substring1 = '<content>', substring2 = '</content>'):
    '''
    inputs:
        filepath: file path to article
        substring1: by default, it is "<content>"
        substring2: by default, it is "</content>"
    output:
        res: the top N ranked sentences
    '''
    with open(filepath, encoding='utf-8') as f:
        article = f.read()

    idx1 = article.index(substring1)
    idx2 = article.index(substring2)

    res = article[idx1 + (len(substring1) - 1) + 1:idx2]
    res = res.strip() 

    return res

## 3.0. Create the Text Summarisation Functions

In [3]:
## summarisation extraction based on weighted freqencies
def summarise_weight_freq(text, n = None, max_sentence_length = 25):
    '''
    inputs:
        text: body of words
        n: [int] number of sentences, [float and lesser than 1] percentage of sentences, [None] 15% of the sentences extracted
        max_sentence_length: keep sentence in the text that have sentence lengths equal or lesser to this
    output:
        summary: the top N ranked sentences
    '''

    sentences = sent_tokenize(text) # tokenize text into a list of sentences
    stop_words = set(stopwords.words('english')) 

    # In order to rank sentences by frequency, we need to have the word frequencies.
    words = [word.lower() for word in word_tokenize(text) if word.lower() not in stop_words and word.isalnum()]
    word_freq = Counter(words)

    # calculate the sentence scores via weighted word frequencies
    sentence_scores = {}
    for sentence in sentences:
        sentence_words = [word.lower() for word in word_tokenize(sentence) if word.lower() not in stop_words and word.isalnum()]
        sentence_score = sum([word_freq[word] for word in sentence_words])
        if len(sentence_words) <= max_sentence_length:
            sentence_scores[sentence] = sentence_score/len(sentence_words) # calculates the average of the sum of word frequencies per sentence

    # get the top n sentences
    if n == None:
        n = int(0.15 * len(sentences)) # rounds down to approximately 15% of the original sentence
    elif isinstance(n,float) and n <= 1:
        n = int(n * len(sentences))
    summary_sentences = sorted(sentence_scores, key=sentence_scores.get, reverse = True)[:n]
    summary = ' '.join(summary_sentences)

    return summary

## summarisation extraction based on tf-idf
def summarise_tfidf(text, n = None):
    '''
    inputs:
        text: body of words
        n: [int] number of sentences, [float and lesser than 1] percentage of sentences, [None] 15% of the sentences extracted
    output:
        summary: the top N ranked sentences
    '''

    sentences = sent_tokenize(text) # tokenize text into a list of sentences

    # prepare a TF-IDF matrix using sklearn library
    vectorizer = TfidfVectorizer(stop_words = 'english')
    tfidf_matrix = vectorizer.fit_transform(sentences)

    # calculate the cosine similarity of each sentence against the whole text
    sentence_scores = cosine_similarity(tfidf_matrix[-1], tfidf_matrix[:-1])[0]

    # get the top n sentences
    if n == None:
        n = int(0.15 * len(sentences)) # rounds down to approximately 15% of the original sentence
    elif isinstance(n,float) and n <= 1:
        n = int(n * len(sentences))
    summary_sentences = nlargest(n, range(len(sentence_scores)), key = sentence_scores.__getitem__)
    summary = ' '.join([sentences[i] for i in sorted(summary_sentences)])

    return summary

## 4.0. Testing out the functions

### 4.1. Weighted Frequency

In [4]:
n = 3

print('WEIGHTED FREQUENCY:')

for article in ['article1.txt','article2.txt','article3.txt']:
    print(article, '\n', sent_tokenize(summarise_weight_freq(preprocess(article), n)))

WEIGHTED FREQUENCY:
article1.txt 
 ['A second presale will be held for KrisFlyer members from 10am on Oct 30 to 9.59am on Oct 31.', 'They will then receive a unique access code from KrisFlyer via email on Oct 27.', 'UOB cardholders can enjoy a presale from 10am on Oct 27 till 9.59 am on Oct 29.']
article2.txt 
 ['Dozens of Palestinians have been killed in the West Bank in the latest flare-up of Israeli-Palestinian violence.', 'Oct 19 (Reuters) - Three Palestinians, including two teenagers, were killed by Israeli forces in separate incidents in the occupied West Bank early on Thursday, Palestinian official news agency WAFA said.', 'Israeli forces have carried out their fiercest bombardment of Gaza in response, killing more than 3,000 Palestinians and imposing a total siege on the blockaded enclave that Hamas controls, fuelling anger among Palestinians in the West Bank.']
article3.txt 
 ['He was driving along Sophia Road towards Upper Wilkie Road shortly before 11.30pm when he spotted a 

### 4.2. TF-IDF

In [5]:
n = 3

print('TFIDF:')

for article in ['article1.txt','article2.txt','article3.txt']:
    print(article, '\n', sent_tokenize(summarise_tfidf(preprocess(article), n)))

TFIDF:
article1.txt 
 ['UOB cardholders can enjoy a presale from 10am on Oct 27 till 9.59 am on Oct 29.', 'A second presale will be held for KrisFlyer members from 10am on Oct 30 to 9.59am on Oct 31.', 'They will then receive a unique access code from KrisFlyer via email on Oct 27.']
article2.txt 
 ['Oct 19 (Reuters) - Three Palestinians, including two teenagers, were killed by Israeli forces in separate incidents in the occupied West Bank early on Thursday, Palestinian official news agency WAFA said.', 'Dozens of Palestinians have been killed in the West Bank in the latest flare-up of Israeli-Palestinian violence.', 'Israel is preparing a ground assault in the Gaza Strip in response to a deadly attack by Palestinian militant group Hamas that killed at least 1,400 Israelis, mostly civilians, on Oct. 7.']
article3.txt 
 ['SINGAPORE – In an attempt to evade arrest, a doctor who drove a car after drinking beer tried to change seats with his passenger when he spotted a police roadblock.', 

## 5.0. Conclusion
Extractive text summarisation offers a higher accuracy, lower computation complexity and a better conservation of the information from the article compared to abstractive text summarisation. In the README.docx, I analyse the printed results of each method.

In a next iteration, we can explore using graph-based ranking algorithms like TextRank. Text rank works by constructing a graph where sentences are represented as nodes, and the edges represent the relationships between them. The ranking score is determined by iteratively updating the scores of the sentences based on 1) the similarity and 2) importance of their neighboring sentences. The highly ranked sentences are then used to generate a summary of the text.
