# ISSS609 Text Analytics and Applications
## IMDb Movie Review - Extractive Text Summarisation
### G1 - Group 4

<a id="table_of_contents"></a>
### Table of Contents

1. [Importing libraries](#import)
2. [Extractive summarisation steps](#outline)
3. [Importing raw data](#setup)
4. [Creating a frequency table](#freq)
5. [Calculate sentence scores](#ss)
6. [Calculate threshold](#threshold)
7. [Finetuning threshold](#finetune)
8. [Programmatic evaluation](#eval)

<a id="import"></a>
### 1. Importing libraries

In [1]:
import random
import pandas as pd
from text_analytics.config import DATA_PATH
from rouge import Rouge
import numpy as np
import numpy.typing as npt
from typing import List, Any
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize, sent_tokenize
from collections import defaultdict

# widen displayed text columns so that review previews stay readable
pd.set_option("display.max_colwidth", 150)
%matplotlib inline

<a id="outline"></a>
### 2. Extractive summarisation steps

Here is an outline of steps to build the extractive summariser 

- Select a raw article to preprocess
- Tokenise the article into words and reduce each word to its stem
- Build a frequency table of the stems, excluding stop words
- Split the article into sentences and score each sentence using the frequency table
- Apply the masking threshold to the sentence scores to output the summarised review

<a id="setup"></a>
### 3. Importing raw data

- A raw article is selected at this stage for preprocessing

In [2]:
movie_reviews = pd.read_parquet(DATA_PATH / "imdb_data.parquet")
# preview the dataframe
movie_reviews.head()

Unnamed: 0,review,sentiment
0,"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened ...",positive
1,"A wonderful little production. <br /><br />The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometim...",positive
2,"I thought this was a wonderful way to spend time on a too hot summer weekend, sitting in the air conditioned theater and watching a light-hearted ...",positive
3,Basically there's a family where a little boy (Jake) thinks there's a zombie in his closet & his parents are fighting all the time.<br /><br />Thi...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is a visually stunning film to watch. Mr. Mattei offers us a vivid portrait about human relations. Thi...",positive


In [3]:
article = movie_reviews.loc[5, "review"]
print(article)

Probably my all-time favorite movie, a story of selflessness, sacrifice and dedication to a noble cause, but it's not preachy or boring. It just never gets old, despite my having seen it some 15 or more times in the last 25 years. Paul Lukas' performance brings tears to my eyes, and Bette Davis, in one of her very few truly sympathetic roles, is a delight. The kids are, as grandma says, more like "dressed-up midgets" than children, but that only makes them more fun to watch. And the mother's slow awakening to what's happening in the world and under her own roof is believable and startling. If I had a dozen thumbs, they'd all be "up" for this movie.


<a id="freq"></a>
### 4. Creating a frequency table

- The first step in extractive summarisation is to determine the relative importance of each word within the overall context of the article
- The importance of every stem in an article is captured in a frequency table, which is then used to weight the sentences
- We tokenise and stem the article, remove stop words, and count the occurrences of each remaining stem
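
As a minimal sketch of this step, the example below builds a frequency table with the standard library only. It is not the notebook's function: the stop word list is a tiny hypothetical stand-in for NLTK's English list, a regular expression replaces `word_tokenize`, and no stemming is applied. Unlike the NLTK tokeniser, the regular expression drops punctuation-only tokens, so they never enter the table.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop word list (NLTK's English list is much longer).
STOP_WORDS = {"a", "an", "and", "the", "this", "is", "it", "of", "i", "was"}


def build_frequency_table(text: str) -> Counter:
    """Count lower-cased word tokens that are not stop words."""
    # The pattern keeps words and contractions and skips punctuation-only tokens.
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return Counter(token for token in tokens if token not in STOP_WORDS)


table = build_frequency_table(
    "I like this movie. The movie is a delight, and it never gets old."
)
print(table.most_common(3))  # [('movie', 2), ('like', 1), ('delight', 1)]
```

Filtering stop words before any stemming, as done here, avoids the case where a stemmed form no longer matches its entry in the stop word list.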

In [4]:
def create_dictionary_table(article: str, stemmer: Any = None) -> dict:
    """Count how often each stem occurs in the article, ignoring stop words."""
    frequency_table = defaultdict(int)

    stop_words = set(stopwords.words("english"))
    word_vector = word_tokenize(article)

    # fall back to the Porter stemmer when none is supplied
    if stemmer is None:
        stemmer = PorterStemmer()

    stemmed_word_vector = [stemmer.stem(word) for word in word_vector]
    for word in stemmed_word_vector:
        # stop words are filtered after stemming; punctuation tokens are not
        # filtered, so "," and "." are counted as well
        if word not in stop_words:
            frequency_table[word] += 1

    return frequency_table

In [5]:
frequency_table = create_dictionary_table(article)

In [6]:
frequencies = pd.DataFrame(
    {"frequencies": frequency_table}
    ).sort_values("frequencies", ascending=False)
# preview the frequencies
frequencies.head()

Unnamed: 0,frequencies
",",11
.,6
's,3
movi,2
'',2


In [7]:
frequencies

Unnamed: 0,frequencies
",",11
.,6
's,3
movi,2
'',2
...,...
happen,1
kid,1
last,1
like,1


<a id="ss"></a>
### 5. Calculate sentence scores

- The frequency scores are used to determine the importance of each sentence 
- For instance, `I like this movie` has a raw score of 3 if `like=1` and `movie=2`
- To ensure long sentences do not dominate shorter ones, we normalise each sentence score by dividing it by the number of non-stop words in the sentence
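
The worked example from the bullets can be sketched as follows. This is an illustrative variant rather than the notebook's function: it assumes a frequency table keyed by unstemmed words, matches whole words instead of substrings, and keys the result by sentence position (a hypothetical choice that avoids collisions between sentences with the same opening characters).

```python
from typing import Dict, List


def score_sentences(sentences: List[str], frequency_table: Dict[str, int]) -> Dict[int, float]:
    """Score each sentence by the mean frequency of the scored words it contains."""
    scores: Dict[int, float] = {}
    for position, sentence in enumerate(sentences):
        # crude whitespace tokenisation with surrounding punctuation stripped
        words = [word.strip(".,!?\"'") for word in sentence.lower().split()]
        # only words present in the frequency table contribute to the score
        matched = [word for word in words if word in frequency_table]
        if matched:  # skip sentences without scored words to avoid dividing by zero
            raw_score = sum(frequency_table[word] for word in matched)
            scores[position] = raw_score / len(matched)
    return scores


frequency_table = {"like": 1, "movie": 2}
sentences = ["I like this movie.", "Nothing relevant here."]
print(score_sentences(sentences, frequency_table))  # {0: 1.5}
```

The first sentence has a raw score of 3 (`like=1` plus `movie=2`) and two scored words, giving a normalised score of 1.5. The second sentence contains no scored words and is left out.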

In [8]:
sentences = sent_tokenize(article)

In [9]:
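def calculate_sentence_scores(sentences: List[str], frequency_table: dict) -> dict:
    # Every sentence is scored from the frequency table and keyed by its first 7 characters,
    # so sentences that share the same opening characters share one entry
    sentence_weights = defaultdict(int)

    for sentence in sentences:
        sentence_wordcount_without_stop_words = 0

        for word_weight in frequency_table:
            # NOTE: this line runs for every stem, whether or not it occurs in the sentence,
            # so each sentence accumulates the total of the whole frequency table
            sentence_weights[sentence[:7]] += frequency_table[word_weight]

            # substring match of the stem against the lower-cased sentence
            if word_weight in sentence.lower():
                sentence_wordcount_without_stop_words += 1

        # the accumulated total is divided by the number of matching stems,
        # so sentences with fewer matches receive higher scores
        sentence_weights[sentence[:7]] /= sentence_wordcount_without_stop_words

    return sentence_weights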

In [10]:
sentence_weights = calculate_sentence_scores(sentences, frequency_table)

In [11]:
sentence_weight_preview = pd.DataFrame(
    {"sentence_weights": sentence_weights}
    ).sort_values("sentence_weights", ascending=False)
sentence_weight_preview.head()

Unnamed: 0,sentence_weights
If I ha,10.625
And the,7.727273
It just,7.083333
The kid,6.538462
Probabl,6.071429


<a id="threshold"></a>
### 6. Calculate the threshold for a sentence to be counted as important

- We can adjust the threshold by multiplying the mean of the scores by an alpha value
- Alternatives, such as the median, can also be used to compute the threshold for inclusion
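
To illustrate the difference between the two statistics, here is a small sketch using the standard library. The helper `threshold_from` and the scores are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from statistics import mean, median


def threshold_from(scores, alpha=1.0, statistic=mean):
    """Scale a summary statistic of the sentence scores by alpha."""
    return statistic(scores) * alpha


# Hypothetical sentence scores with one unusually high value.
scores = [10.0, 4.0, 3.0, 2.0, 1.0]

print(threshold_from(scores))                    # 4.0 (mean)
print(threshold_from(scores, statistic=median))  # 3.0 (median)
print(threshold_from(scores, alpha=1.5))         # 6.0 (mean scaled by alpha)
```

With these scores the single high value pulls the mean up to 4.0, so only two sentences reach the mean-based threshold, whereas three reach the median-based threshold of 3.0.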

In [12]:
def calculate_threshold_score(sentence_weights: dict, alpha: float = 1.0) -> float:
    return np.mean(list(sentence_weights.values())) * alpha

In [13]:
print(f"Threshold weight for example article: {calculate_threshold_score(sentence_weights):.02f}")

Threshold weight for example article: 7.29


<a id="finetune"></a>
### 7. Finetuning the threshold 

- Different threshold values represent a trade-off between comprehension and length
- Lower thresholds result in longer summaries that retain more contextual markers
- The optimal threshold can either be determined manually or programmatically via a validation set
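
The trade-off can be seen on a toy example. The sentences, scores and the `summarise` helper below are hypothetical; the threshold is the mean score multiplied by alpha, as in the previous section.

```python
from statistics import mean

# Hypothetical (sentence, score) pairs in their original order.
scored = [
    ("First sentence.", 6.0),
    ("Second sentence.", 7.0),
    ("Third sentence.", 8.0),
    ("Fourth sentence.", 11.0),
]


def summarise(scored_sentences, alpha):
    """Keep the sentences whose score is at or above mean score * alpha."""
    threshold = mean([score for _, score in scored_sentences]) * alpha
    return " ".join(text for text, score in scored_sentences if score >= threshold)


for alpha in (0.75, 1.0, 1.25):
    print(f"alpha={alpha:.2f}: {summarise(scored, alpha)}")
```

The mean score is 8.0, so an alpha of 0.75 keeps all four sentences, 1.0 keeps the last two, and 1.25 keeps only the last one.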

In [14]:
def get_article_summary(sentences: npt.ArrayLike, sentence_weights: dict, threshold: float) -> str:
    article_summary = [sentence for sentence in sentences if sentence[:7] in sentence_weights and sentence_weights.get(sentence[:7]) >= threshold]
    return " ".join(article_summary)

In [15]:
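# np.linspace includes the endpoint explicitly, whereas np.arange with a float step
# only includes 1.25 here because of floating-point rounding
alpha_values = np.linspace(0.95, 1.25, 4)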
alpha_values

array([0.95, 1.05, 1.15, 1.25])

In [16]:
for alpha in alpha_values: 
    threshold_score = calculate_threshold_score(sentence_weights=sentence_weights, alpha=alpha) 
    final_result = get_article_summary(sentences=sentences, sentence_weights=sentence_weights, threshold=threshold_score) 

    print(f"At threshold of: {alpha:.02f}")
    print(f"Result: {final_result}")

At threshold of: 0.95
Result: It just never gets old, despite my having seen it some 15 or more times in the last 25 years. And the mother's slow awakening to what's happening in the world and under her own roof is believable and startling. If I had a dozen thumbs, they'd all be "up" for this movie.
At threshold of: 1.05
Result: And the mother's slow awakening to what's happening in the world and under her own roof is believable and startling. If I had a dozen thumbs, they'd all be "up" for this movie.
At threshold of: 1.15
Result: If I had a dozen thumbs, they'd all be "up" for this movie.
At threshold of: 1.25
Result: If I had a dozen thumbs, they'd all be "up" for this movie.


<a id="eval"></a>

### 8. Programmatic evaluation

- We wrap up all the previous steps into an `ExtractiveTextSummarizer` class 
- To evaluate the effectiveness of the summarisation at different thresholds, we have manually summarised 101 movie reviews 
- Our human-labelled summaries serve as references to estimate the algorithm's ability to pick out the important aspects of each review
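
For intuition about the metric used below, ROUGE-1 F1 is the harmonic mean of unigram precision and recall between a generated summary and its reference. The sketch below is a simplified, count-based version with whitespace tokenisation; the `rouge` package used in the notebook may tokenise and count overlaps differently, so its values need not match this function.

```python
from collections import Counter


def rouge_1_f1(hypothesis: str, reference: str) -> float:
    """Simplified ROUGE-1 F1 based on clipped unigram counts."""
    hypothesis_counts = Counter(hypothesis.lower().split())
    reference_counts = Counter(reference.lower().split())
    # the intersection of two Counters keeps the minimum count of each shared word
    overlap = sum((hypothesis_counts & reference_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hypothesis_counts.values())
    recall = overlap / sum(reference_counts.values())
    return 2 * precision * recall / (precision + recall)


score = rouge_1_f1("the movie was a delight", "the movie was great")
print(f"{score:.3f}")  # 0.667
```

Here three of the five hypothesis words appear in the reference (precision 0.6) and three of the four reference words are covered (recall 0.75), which gives an F1 of about 0.667.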

In [17]:
labelled_movie_reviews = pd.read_csv(DATA_PATH / "review_evaluation.csv", index_col=0).iloc[:,:2]

In [18]:
class ExtractiveTextSummarizer:
    def __init__(self, article: str, alpha: float = 1.0) -> None:
        self.article = article
        self.alpha = alpha         
        self.frequency_table = defaultdict(int)

    def _create_dictionary_table(self, stemmer: Any = None) -> dict:
        # start from an empty table so that repeated runs on the same instance
        # do not accumulate counts
        self.frequency_table = defaultdict(int)

        stop_words = set(stopwords.words("english"))
        word_vector = word_tokenize(self.article)

        # fall back to the Porter stemmer when none is supplied
        if stemmer is None:
            stemmer = PorterStemmer()

        stemmed_word_vector = [stemmer.stem(word) for word in word_vector]
        for word in stemmed_word_vector:
            # keep only stems that are not stop words
            if word not in stop_words:
                self.frequency_table[word] += 1

        return self.frequency_table


    def _calculate_sentence_scores(self, sentences: npt.ArrayLike) -> dict:   

        #algorithm for scoring a sentence by its words
        sentence_weights = defaultdict(int)

        for sentence in sentences:
            sentence_wordcount_without_stop_words = 0

            for word_weight in self.frequency_table:
                sentence_weights[sentence[:7]] += self.frequency_table[word_weight]

                if word_weight in sentence.lower():
                    sentence_wordcount_without_stop_words += 1

            sentence_weights[sentence[:7]] /= sentence_wordcount_without_stop_words

        return sentence_weights


    def _calculate_threshold_score(self, sentence_weight: dict) -> float:
        return np.mean(list(sentence_weight.values())) * self.alpha


    def _get_article_summary(self, sentences: npt.ArrayLike, sentence_weights: dict, threshold: float) -> str:
        article_summary = [sentence for sentence in sentences if sentence[:7] in sentence_weights and sentence_weights.get(sentence[:7]) >= threshold]

        return " ".join(article_summary)

    def run_article_summary(self):

        #creating a dictionary for the word frequency table
        _ = self._create_dictionary_table()

        #tokenizing the sentences
        sentences = sent_tokenize(self.article)

        #algorithm for scoring a sentence by its words
        sentence_scores = self._calculate_sentence_scores(sentences)

        # getting the threshold
        threshold = self._calculate_threshold_score(sentence_scores)

        #producing the summary
        article_summary = self._get_article_summary(sentences, sentence_scores, threshold)

        return article_summary

    def get_rouge_score(self, hypothesis_text: str, reference_text: str) -> List[dict]:
        # one dict per hypothesis/reference pair, keyed by ROUGE variant (e.g. "rouge-1")
        rouge = Rouge()
        scores = rouge.get_scores(hypothesis_text, reference_text)
        return scores

- We store the ROUGE-1 F1 score of each summary under its alpha value and report the mean over 10 randomly chosen articles
- Given that we consider a summary of roughly 20% of the original length acceptable, we select an alpha of 0.95, which retains 22.75% of the original length and gives the highest mean ROUGE-1 F1 (0.123) in the results below

In [19]:
# random seed for reproducibility
random.seed(2022)
random_subset = random.sample(range(101), 10)

In [20]:
result = defaultdict(list)
percentage_summarised = defaultdict(list)

for idx, article in enumerate(labelled_movie_reviews.loc[random_subset, "review"].values):
    ext = ExtractiveTextSummarizer(article=article)
    original_article_length = len(article.split())
    for alpha in alpha_values:
        ext.alpha = alpha
        article_summary = ext.run_article_summary()
        # NOTE: idx is the position within random_subset (0-9), not the row label of the
        # sampled review, so the reference is the summary stored under labels 0-9
        rouge_score = ext.get_rouge_score(
            hypothesis_text=article_summary,
            reference_text=labelled_movie_reviews.loc[idx, "Summary"])

        # read the F1 score by key rather than relying on the order of the dict values
        f1 = rouge_score[0]["rouge-1"]["f"]

        # fraction of the original word count retained in the summary
        percentage_summarised[alpha].append(len(article_summary.split()) / original_article_length)
        result[alpha].append(f1)
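
In the loop above, `idx` comes from `enumerate` and therefore runs from 0 to 9, while the reviews were selected by the labels in `random_subset`. A sketch of a label-aligned pairing is shown below, using a plain dictionary as a hypothetical stand-in for `labelled_movie_reviews`. With the DataFrame, the equivalent would be to iterate over `random_subset` directly and use the same label for both `.loc[label, "review"]` and `.loc[label, "Summary"]`.

```python
import random

# Hypothetical stand-in for the labelled data: label -> (review, reference summary).
labelled = {label: (f"review {label}", f"summary {label}") for label in range(101)}

random.seed(2022)
sampled_labels = random.sample(range(101), 10)

pairs = []
for label in sampled_labels:
    # the review and its reference summary are looked up with the same label
    review, reference = labelled[label]
    pairs.append((label, review, reference))

# every review is paired with the reference summary stored under its own label
for label, review, reference in pairs:
    print(label, "|", review, "|", reference)
```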

In [21]:
final_scores = zip(
    alpha_values, 
    map(np.mean, result.values()),
    map(np.mean, percentage_summarised.values())
    )

for alpha, score, percentage in final_scores: 
    print(f"Alpha value of {alpha:.02f}")
    print(f"Score: {score:.03f}")
    print(f"Percentage summarised: {percentage:.02%}")

Alpha value of 0.95
Score: 0.123
Percentage summarised: 22.75%
Alpha value of 1.05
Score: 0.119
Percentage summarised: 21.79%
Alpha value of 1.15
Score: 0.110
Percentage summarised: 16.39%
Alpha value of 1.25
Score: 0.083
Percentage summarised: 11.54%
