# ISSS609 Text Analytics and Applications
## IMDb Movie Review - Extractive Text Summarisation
### G1 - Group 4

<a id="table_of_contents"></a>
### Table of Contents

1. [Importing libraries](#import)
2. [Extractive Summarisation steps](#outline)
3. [Importing raw data](#setup)
4. [Creating a frequency table](#freq)
5. [Calculate sentence scores](#ss)
6. [Calculate threshold](#threshold)
7. [Finetuning threshold](#finetune)
8. [Programmatic evaluation](#eval)

<a id="import"></a>
### 1. Importing libraries

In [2]:
import random
import pandas as pd
from text_analytics.config import DATA_PATH
from rouge import Rouge
import numpy as np
import numpy.typing as npt
from typing import List, Any
from nltk.corpus import stopwords
import string
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize, sent_tokenize
from collections import defaultdict

# widen displayed columns so long reviews are easier to read in previews
pd.set_option("display.max_colwidth", 150)
%matplotlib inline

<a id="outline"></a>
### 2. Extractive summarisation steps

The extractive summariser is built in the following steps:

- Select a raw article to preprocess
- Tokenise the article into words and reduce each word to its stem
- Compute the occurrence frequency of each stem
- Split the article into sentences
- Score each sentence from the frequencies of the stems it contains
- Apply the masking threshold to output the summarised review

<a id="setup"></a>
### 3. Importing raw data

- A raw article is selected at this stage for preprocessing

In [3]:
movie_reviews = pd.read_parquet(DATA_PATH / "imdb_data.parquet")
# preview the dataframe
movie_reviews.head()

Unnamed: 0,review,sentiment
0,"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened ...",positive
1,"A wonderful little production. <br /><br />The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometim...",positive
2,"I thought this was a wonderful way to spend time on a too hot summer weekend, sitting in the air conditioned theater and watching a light-hearted ...",positive
3,Basically there's a family where a little boy (Jake) thinks there's a zombie in his closet & his parents are fighting all the time.<br /><br />Thi...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is a visually stunning film to watch. Mr. Mattei offers us a vivid portrait about human relations. Thi...",positive


In [4]:
movie_reviews.review = movie_reviews.review.replace(r"<.*?>"," ", regex=True)
movie_reviews.head(1)

Unnamed: 0,review,sentiment
0,"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened ...",positive


In [5]:
article = movie_reviews.loc[5, "review"]
print(article)

Probably my all-time favorite movie, a story of selflessness, sacrifice and dedication to a noble cause, but it's not preachy or boring. It just never gets old, despite my having seen it some 15 or more times in the last 25 years. Paul Lukas' performance brings tears to my eyes, and Bette Davis, in one of her very few truly sympathetic roles, is a delight. The kids are, as grandma says, more like "dressed-up midgets" than children, but that only makes them more fun to watch. And the mother's slow awakening to what's happening in the world and under her own roof is believable and startling. If I had a dozen thumbs, they'd all be "up" for this movie.


<a id="freq"></a>
### 4. Creating a frequency table

- The first step in extractive summarisation is to determine the relative importance of each word within the overall context of the article
- The importance of every stem in an article can be captured in a frequency table and weighted accordingly
- We remove stop words and calculate a frequency table for each article
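
As a minimal illustration of the idea, the sketch below builds a frequency table for a toy text using only the standard library. `TOY_STOP_WORDS`, `toy_stem` and `toy_frequency_table` are hypothetical names introduced here; the tiny stop-word list and the naive suffix stripper are assumptions standing in for NLTK's stop-word corpus and `PorterStemmer`. The sketch filters stop words before stemming, on the assumption that a stemmed form (for example `veri` for `very`) may no longer match the stop-word list.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (stand-in for NLTK's English list).
TOY_STOP_WORDS = {"a", "the", "this", "is", "i", "it", "and", "of", "very"}


def toy_stem(word: str) -> str:
    """Naive suffix stripper, a stand-in for a real stemmer such as PorterStemmer."""
    for suffix in ("ing", "es", "s", "e"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_frequency_table(text: str) -> Counter:
    """Count the stems of the non-stop-word tokens in `text`."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # Filter stop words on the surface form, then stem what is left.
    return Counter(toy_stem(tok) for tok in tokens if tok not in TOY_STOP_WORDS)


table = toy_frequency_table(
    "I like this movie. The movie is very moving, and I like movies."
)
print(table.most_common(3))
```

Here `movie` and `movies` collapse onto the same stem, so their counts are pooled, which is the reason for stemming before counting.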

In [6]:
def create_dictionary_table(article, stemmer = None):
    #removing stop words
    frequency_table = defaultdict(int)

    stop_words = set(stopwords.words("english"))
    punct = set(string.punctuation)
    
    check_set = stop_words.union(punct)
    word_vector = word_tokenize(article)

    # instantiate the stemmer 
    if stemmer is None: 
        stemmer = PorterStemmer()

    stemmed_word_vector = [stemmer.stem(word) for word in word_vector]
    for word in stemmed_word_vector:
        if word not in check_set and word.isalnum():
            frequency_table[word] += 1

    return frequency_table

In [7]:
frequency_table = create_dictionary_table(article)

In [8]:
frequencies = pd.DataFrame(
    {"frequencies": frequency_table}
    ).sort_values("frequencies", ascending=False)
# preview the frequencies
frequencies.head()

Unnamed: 0,frequencies
movi,2
nobl,1
old,1
one,1
onli,1


<a id="ss"></a>
### 5. Calculate sentence scores

- The frequency scores are used to determine the importance of each sentence 
- For instance, `I like this movie` will return a score of 3 if `like=1` and `movie=2`
- To ensure long sentences do not dominate shorter ones, we normalise each sentence score by dividing it by the number of non-stop-word stems found in the sentence
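
The scoring rule in the bullets above can be sketched at token level as follows. This is a minimal illustration under stated assumptions, not the notebook's `calculate_sentence_scores`: `score_sentences` and `toy_table` are hypothetical names, the table keys are assumed to be in the same form as the lower-cased tokens (no stemming), and scores are returned by sentence position rather than keyed by a sentence prefix.

```python
import re
from typing import Dict, List


def score_sentences(sentences: List[str], frequency_table: Dict[str, int]) -> List[float]:
    """Return one normalised score per sentence, in input order."""
    scores = []
    for sentence in sentences:
        tokens = re.findall(r"[a-z0-9]+", sentence.lower())
        # Only tokens present in the frequency table contribute to the score.
        matched = [tok for tok in tokens if tok in frequency_table]
        total = sum(frequency_table[tok] for tok in matched)
        # Normalise by the number of matched tokens; unmatched sentences score 0.
        scores.append(total / len(matched) if matched else 0.0)
    return scores


toy_table = {"like": 1, "movie": 2}
toy_scores = score_sentences(["I like this movie", "Nothing here"], toy_table)
print(toy_scores)
```

For `I like this movie` the raw score is 1 + 2 = 3, and dividing by the two matched tokens gives 1.5. Matching whole tokens rather than substrings avoids counting a short entry such as `one` inside a longer word such as `money`.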

In [9]:
def calculate_sentence_scores(sentences: npt.ArrayLike, frequency_table: dict) -> dict:   
    # Score every sentence from the stem frequencies; keys are the first 7 characters of each sentence
    sentence_weights = defaultdict(int)

    for sentence in sentences:
        sentence_wordcount_without_stop_words = 0

        for word_weight in frequency_table:
            # NOTE: this addition runs for every stem in the table, whether or not the stem
            # occurs in the sentence, so the numerator is the same for all sentences
            sentence_weights[sentence[:7]] += frequency_table[word_weight]

            # substring match of the stem against the lower-cased sentence
            if word_weight in sentence.lower():
                sentence_wordcount_without_stop_words += 1

        # normalise by the number of stems found in the sentence
        sentence_weights[sentence[:7]] /= sentence_wordcount_without_stop_words

    return sentence_weights

In [10]:
sentences = sent_tokenize(article)

In [11]:
sentence_weights = calculate_sentence_scores(sentences, frequency_table)

In [12]:
sentence_weight_preview = pd.DataFrame(
    {"sentence_weights": sentence_weights}
    ).sort_values("sentence_weights", ascending=False)
sentence_weight_preview.head()

Unnamed: 0,sentence_weights
If I ha,13.4
And the,7.444444
It just,6.7
Probabl,6.090909
The kid,6.090909


<a id="threshold"></a>
### 6. Calculate the threshold for a sentence to be counted as important

- We can adjust the threshold by multiplying the mean of the scores by an alpha value
- Alternatives, such as the median, can also be used to compute the threshold for inclusion
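
To make the mean-versus-median choice concrete, here is a small sketch using only the standard library. `threshold_score`, its `centre` parameter and `toy_scores` are hypothetical names, and the scores are made-up values chosen so that one outlier pulls the mean upwards.

```python
from statistics import mean, median
from typing import Callable, Sequence


def threshold_score(
    scores: Sequence[float],
    alpha: float = 1.0,
    centre: Callable[[Sequence[float]], float] = mean,
) -> float:
    """Scale a measure of central tendency of the scores by alpha."""
    return centre(scores) * alpha


toy_scores = [10.0, 4.0, 3.0, 3.0]  # illustrative values with one outlier
mean_threshold = threshold_score(toy_scores)
median_threshold = threshold_score(toy_scores, centre=median)
halved_threshold = threshold_score(toy_scores, alpha=0.5)
print(mean_threshold, median_threshold, halved_threshold)
```

With these toy values the mean-based threshold is 5.0 and the median-based one is 3.5, so the median would admit the sentence scoring 4.0 while the mean would not.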

In [13]:
def calculate_threshold_score(sentence_weights: dict, alpha: float = 1.0) -> float:
    return np.mean(list(sentence_weights.values())) * alpha

In [14]:
print(f"Threshold weight for example article: {calculate_threshold_score(sentence_weights):.02f}")

Threshold weight for example article: 7.55


<a id="finetune"></a>
### 7. Finetuning the threshold 

- Different threshold values represent a trade-off between comprehension and length
- Lower thresholds result in longer summaries, which retain more contextual markers
- The optimal threshold can be determined either manually or programmatically via a validation set
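
The trade-off can be seen on toy data: as alpha falls, the cutoff falls and more sentences survive. The sketch below is illustrative only; `select_sentences`, `toy_sentences` and `toy_scores` are hypothetical names, and the scores are made-up values rather than output from this notebook.

```python
from statistics import mean
from typing import List, Sequence


def select_sentences(sentences: Sequence[str], scores: Sequence[float], alpha: float) -> List[str]:
    """Keep the sentences whose score is at least alpha times the mean score."""
    cutoff = mean(scores) * alpha
    return [sentence for sentence, score in zip(sentences, scores) if score >= cutoff]


toy_sentences = ["First.", "Second.", "Third.", "Fourth."]
toy_scores = [10.0, 4.0, 3.0, 3.0]  # mean is 5.0

kept_counts = {}
for alpha in (0.5, 0.7, 1.0):
    kept = select_sentences(toy_sentences, toy_scores, alpha)
    kept_counts[alpha] = len(kept)
    print(f"alpha={alpha:.1f}: {len(kept)} sentence(s) kept")
```

On a validation set, the same loop would record an evaluation metric per alpha instead of a sentence count.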

In [15]:
def get_article_summary(sentences: npt.ArrayLike, sentence_weights: dict, threshold: float) -> str:
    article_summary = [sentence for sentence in sentences if sentence[:7] in sentence_weights and sentence_weights.get(sentence[:7]) >= threshold]
    return " ".join(article_summary)

In [16]:
alpha_values = np.arange(0.7, 1.1, 0.1)
alpha_values

array([0.7, 0.8, 0.9, 1. , 1.1])
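
The array above contains `1.1` even though `np.arange` nominally excludes its stop value; with a non-integer step the number of elements depends on floating-point rounding. When the endpoints and the number of values are known, `np.linspace` states the intent more directly. A short sketch, where `alpha_grid` is a hypothetical name:

```python
import numpy as np

# Five evenly spaced alpha values from 0.7 to 1.1, endpoint included by design.
alpha_grid = np.linspace(0.7, 1.1, num=5)
print(alpha_grid)
```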

In [91]:
for alpha in alpha_values: 
    threshold_score = calculate_threshold_score(sentence_weights=sentence_weights, alpha=alpha) 
    final_result = get_article_summary(sentences=sentences, sentence_weights=sentence_weights, threshold=threshold_score) 

    print(f"At threshold of: {alpha:.02f}")
    print(f"Result: {final_result}")

At threshold of: 0.70
Result: Probably my all-time favorite movie, a story of selflessness, sacrifice and dedication to a noble cause, but it's not preachy or boring. It just never gets old, despite my having seen it some 15 or more times in the last 25 years. Paul Lukas' performance brings tears to my eyes, and Bette Davis, in one of her very few truly sympathetic roles, is a delight. The kids are, as grandma says, more like "dressed-up midgets" than children, but that only makes them more fun to watch. And the mother's slow awakening to what's happening in the world and under her own roof is believable and startling. If I had a dozen thumbs, they'd all be "up" for this movie.
At threshold of: 0.80
Result: Probably my all-time favorite movie, a story of selflessness, sacrifice and dedication to a noble cause, but it's not preachy or boring. It just never gets old, despite my having seen it some 15 or more times in the last 25 years. The kids are, as grandma says, more like "dressed-up

<a id="eval"></a>

### 8. Programmatic evaluation

- We wrap up all the previous steps into an `ExtractiveTextSummarizer` class
- To evaluate the effectiveness of the summarisation, we have manually summarised 101 movie reviews
- Our human-labelled summaries serve as references to estimate the algorithm's ability to pick out important aspects of each review
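
The evaluation below reports ROUGE-1 recall, precision and F1, which measure unigram overlap between the generated summary (the hypothesis) and the human summary (the reference). As a rough guide to what these numbers mean, here is a simplified sketch using whitespace tokens and clipped counts. `rouge1_sketch` is a hypothetical name, and the `rouge` package may tokenise and count differently, so its scores need not match this sketch.

```python
from collections import Counter
from typing import Dict


def rouge1_sketch(hypothesis: str, reference: str) -> Dict[str, float]:
    """Unigram-overlap recall, precision and F1 on lower-cased whitespace tokens."""
    hyp_counts = Counter(hypothesis.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((hyp_counts & ref_counts).values())
    if overlap == 0:
        return {"r": 0.0, "p": 0.0, "f": 0.0}
    recall = overlap / sum(ref_counts.values())
    precision = overlap / sum(hyp_counts.values())
    f1 = 2 * precision * recall / (precision + recall)
    return {"r": recall, "p": precision, "f": f1}


example = rouge1_sketch("the movie was great fun", "the movie was great")
print(example)
```

In this example every reference token appears in the hypothesis, so recall is 1.0, while the extra word `fun` lowers precision to 0.8. Longer summaries tend to raise recall at the cost of precision in the same way.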

In [12]:
# fine tune based on average Rouge-1 F1 score
labelled_movie_reviews = pd.read_csv("../data/review_evaluation.csv", index_col=0).iloc[:,:2]  #csv contains review and human summary
labelled_movie_reviews.head()

Unnamed: 0,review,Summary
0,"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened ...","The first episode I saw struck me as so nasty it was surreal, I couldn't say I was ready for it. As I watched more I developed a taste for Oz, and..."
1,"A wonderful little production. The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomfor...","A wonderful little production. The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomfor..."
2,"I thought this was a wonderful way to spend time on a too hot summer weekend, sitting in the air conditioned theater and watching a light-hearted ...","Woody Allen is still fully in control of the style many of us have grown to love. The plot is simplistic, but the dialogue is witty and the charac..."
3,Basically there's a family where a little boy (Jake) thinks there's a zombie in his closet & his parents are fighting all the time.This movie is s...,This movie is slower than a soap opera. As a drama the movie is watchable. Parents are divorcing & arguing like in real life. And then we have Jak...
4,"Petter Mattei's ""Love in the Time of Money"" is a visually stunning film to watch. Mr. Mattei offers us a vivid portrait about human relations. Thi...","This is a movie that seems to be telling us what money, power and success do to people in the different situations we encounter. Director Petter M..."


In [13]:
from rouge import Rouge
import numpy as np
import numpy.typing as npt
from typing import List, Any
from nltk.corpus import stopwords
import string
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize, sent_tokenize
from collections import defaultdict

class ExtractiveTextSummarizer:
    def __init__(self, article: str, alpha: float = 1.0) -> None:
        self.article = article
        self.alpha = alpha         
        self.frequency_table = defaultdict(int)

    def _create_dictionary_table(self, stemmer: Any = None) -> dict:
   
        #removing stop words
        stop_words = set(stopwords.words("english"))
        punct = set(string.punctuation)
        check_set = stop_words.union(punct) # remove punctuation
        word_vector = word_tokenize(self.article)

        # instantiate the stemmer 
        if stemmer is None: 
            stemmer = PorterStemmer()

        stemmed_word_vector = [stemmer.stem(word) for word in word_vector]
        for word in stemmed_word_vector:
            if word not in check_set and word.isalnum():
                self.frequency_table[word] += 1

        return self.frequency_table


    def _calculate_sentence_scores(self, sentences: npt.ArrayLike) -> dict:   

        #algorithm for scoring a sentence by its words
        sentence_weights = defaultdict(int)

        for sentence in sentences:
            sentence_wordcount_without_stop_words = 0

            for word_weight in self.frequency_table:
                sentence_weights[sentence[:7]] += self.frequency_table[word_weight]

                if word_weight in sentence.lower():
                    sentence_wordcount_without_stop_words += 1

            if sentence_wordcount_without_stop_words: 
                sentence_weights[sentence[:7]] /= sentence_wordcount_without_stop_words
            else:
                sentence_weights[sentence[:7]] = 0

        return sentence_weights


    def _calculate_threshold_score(self, sentence_weight: dict) -> float:
        return np.mean(list(sentence_weight.values())) * self.alpha


    def _get_article_summary(self, sentences: npt.ArrayLike, sentence_weights: dict, threshold: float) -> str:
        article_summary = [sentence for sentence in sentences if sentence[:7] in sentence_weights and sentence_weights.get(sentence[:7]) >= threshold]

        return " ".join(article_summary)

    def run_article_summary(self):

        #creating a dictionary for the word frequency table
        _ = self._create_dictionary_table()

        #tokenizing the sentences
        sentences = sent_tokenize(self.article)

        #algorithm for scoring a sentence by its words
        sentence_scores = self._calculate_sentence_scores(sentences)

        # getting the threshold
        threshold = self._calculate_threshold_score(sentence_scores)

        #producing the summary
        article_summary = self._get_article_summary(sentences, sentence_scores, threshold)

        return article_summary

    def get_rouge_score(self, hypothesis_text: str, reference_text: str) -> List[dict]:
        rouge = Rouge()
        scores = rouge.get_scores(hypothesis_text, reference_text)
        return scores

The error below is intentional: with `alpha = 1.5` the threshold is too high for some reviews to produce any summary, so `Rouge` is given an empty hypothesis.

In [14]:
ext_review, ext_recall, ext_precision, ext_f1 = [], [], [], []
for review in labelled_movie_reviews['review']:
    extractive_summarizer = ExtractiveTextSummarizer(article=review, alpha=1.5)
    ext_review.append(extractive_summarizer.run_article_summary())
labelled_movie_reviews['ext_review'] = ext_review

for ref, ext_review in zip(labelled_movie_reviews['Summary'], labelled_movie_reviews['ext_review']):
    score = extractive_summarizer.get_rouge_score(hypothesis_text=ext_review, reference_text=ref)
    ext_recall.append(score[0]['rouge-1']['r'])
    ext_precision.append(score[0]['rouge-1']['p'])
    ext_f1.append(score[0]['rouge-1']['f'])
    
labelled_movie_reviews['ext_recall'] = ext_recall
labelled_movie_reviews['ext_precision'] = ext_precision
labelled_movie_reviews['ext_f1'] = ext_f1

ValueError: Hypothesis is empty.
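
One way to keep an evaluation loop running when a summary comes back empty is to guard the scoring call. The sketch below is an assumption about how this might be handled, not part of the original notebook: `safe_rouge1` and `fake_scorer` are hypothetical names, and scoring an empty summary as zero is a design choice that penalises over-strict thresholds rather than hiding them.

```python
from typing import Callable, Dict

Scores = Dict[str, float]


def safe_rouge1(hypothesis: str, reference: str,
                scorer: Callable[[str, str], Scores]) -> Scores:
    """Return zero scores for an empty summary instead of calling the scorer."""
    if not hypothesis.strip():
        return {"r": 0.0, "p": 0.0, "f": 0.0}
    return scorer(hypothesis, reference)


def fake_scorer(hypothesis: str, reference: str) -> Scores:
    """Stand-in for a real ROUGE scorer; raises on empty input like the error above."""
    if not hypothesis:
        raise ValueError("Hypothesis is empty.")
    return {"r": 1.0, "p": 1.0, "f": 1.0}


empty_result = safe_rouge1("", "a reference summary", fake_scorer)
normal_result = safe_rouge1("a summary", "a reference summary", fake_scorer)
print(empty_result, normal_result)
```

In this notebook the `scorer` argument could wrap the existing call, for example a function returning `Rouge().get_scores(h, r)[0]["rouge-1"]`.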

In [15]:
alpha_tune_results = pd.DataFrame(index=['Rouge-1 Recall score', 'Rouge-1 Precision score', 'Rouge-1 F1 score', 'avg length'])
# lower alpha will result in longer summary, we try to keep it short
for al in [0.7, 0.75, 0.8, 0.85, 0.9, 0.95, 1]:
    ext_review, ext_recall, ext_precision, ext_f1 = [], [], [], []
    for review in labelled_movie_reviews['review']:
        extractive_summarizer = ExtractiveTextSummarizer(article=review, alpha=al)
        ext_review.append(extractive_summarizer.run_article_summary())
    labelled_movie_reviews['ext_review'] = ext_review

    for ref, ext_review in zip(labelled_movie_reviews['Summary'], labelled_movie_reviews['ext_review']):
        score = extractive_summarizer.get_rouge_score(hypothesis_text=ext_review, reference_text=ref)
        ext_recall.append(score[0]['rouge-1']['r'])
        ext_precision.append(score[0]['rouge-1']['p'])
        ext_f1.append(score[0]['rouge-1']['f'])

    alpha_tune_results[al] = [np.mean(ext_recall), np.mean(ext_precision), np.mean(ext_f1), labelled_movie_reviews['ext_review'].apply(len).mean()]

alpha_tune_results

Unnamed: 0,0.70,0.75,0.80,0.85,0.90,0.95,1.00
Rouge-1 Recall score,0.500529,0.455571,0.40416,0.371538,0.33869,0.319733,0.295041
Rouge-1 Precision score,0.348047,0.353841,0.348464,0.342638,0.342938,0.344217,0.34993
Rouge-1 F1 score,0.369096,0.355662,0.331793,0.314705,0.298958,0.290326,0.276529
avg length,480.90099,432.257426,389.415842,355.80198,310.821782,282.386139,258.49505


In [16]:
ext_review, ext_recall, ext_precision, ext_f1 = [], [], [], []
for review in labelled_movie_reviews['review']:
    extractive_summarizer = ExtractiveTextSummarizer(article=review, alpha=0.7)
    ext_review.append(extractive_summarizer.run_article_summary())
labelled_movie_reviews['ext_review'] = ext_review

for ref, ext_review in zip(labelled_movie_reviews['Summary'], labelled_movie_reviews['ext_review']):
    score = extractive_summarizer.get_rouge_score(hypothesis_text=ext_review, reference_text=ref)
    ext_recall.append(score[0]['rouge-1']['r'])
    ext_precision.append(score[0]['rouge-1']['p'])
    ext_f1.append(score[0]['rouge-1']['f'])

labelled_movie_reviews['ext_recall'] = ext_recall
labelled_movie_reviews['ext_precision'] = ext_precision
labelled_movie_reviews['ext_f1'] = ext_f1

In [17]:
labelled_movie_reviews.head(3)

Unnamed: 0,review,Summary,ext_review,ext_recall,ext_precision,ext_f1
0,"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened ...","The first episode I saw struck me as so nasty it was surreal, I couldn't say I was ready for it. As I watched more I developed a taste for Oz, and...","One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened ...",0.444444,0.176991,0.253165
1,"A wonderful little production. The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomfor...","A wonderful little production. The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomfor...","A wonderful little production. The actors are extremely well chosen- Michael Sheen not only ""has got all the polari"" but he has all the voices dow...",0.575,0.638889,0.605263
2,"I thought this was a wonderful way to spend time on a too hot summer weekend, sitting in the air conditioned theater and watching a light-hearted ...","Woody Allen is still fully in control of the style many of us have grown to love. The plot is simplistic, but the dialogue is witty and the charac...","I thought this was a wonderful way to spend time on a too hot summer weekend, sitting in the air conditioned theater and watching a light-hearted ...",0.780488,0.390244,0.520325


In [18]:
# summary statistics of the per-review Rouge-1 scores
metric_columns = {"ext_recall": "Recall", "ext_precision": "Precision", "ext_f1": "F1"}
stat = (
    labelled_movie_reviews[list(metric_columns)]
    .agg(["mean", "max", "min", "median"])
    .rename(columns=metric_columns)
)
stat.index = stat.index.str.upper()
stat

Unnamed: 0,Recall,Precision,F1
MEAN,0.500529,0.348047,0.369096
MAX,1.0,1.0,0.916667
MIN,0.0,0.0,0.0
MEDIAN,0.509091,0.333333,0.356436


In [19]:
labelled_movie_reviews.to_csv("../data/ext_review_eval.csv", index=False)