<a href="https://colab.research.google.com/github/cfong32/key-sentence-extraction/blob/main/exp9_preprocess_tfidf_sbert.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Overview

In this notebook, we
1. Download the CNN/DM dataset and store it in a dataframe `df`
2. Break down each `article` into `sentences`
3. Compute the TF-IDF cosine similarity of every sentence to its source `article`
4. Compute the ROUGE score of every sentence against the `highlights`, the gold summary
5. Compute the SBERT cosine similarity of every sentence to its source `article` and to the `highlights`
6. Analyze results
    - Verify the correlation between TF-IDF cosine similarity and ROUGE
    - Evaluate the F1 score of "top-K%-sentence classification"
        - E.g., for an article of 20 sentences, "top-10%-sentence classification" means predicting the 2 most important key sentences.

# I. Install and Import

In [1]:
!pip install -q datasets rouge_score sentence-transformers


In [16]:
# import packages

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import tqdm
from datasets import load_dataset
from spacy.lang.en import English
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import precision_score, recall_score, f1_score
from rouge_score.rouge_scorer import RougeScorer
from sentence_transformers import SentenceTransformer
from itertools import cycle
from functools import partial
from textwrap import wrap
from IPython.display import HTML as html_print
pd.set_option('display.min_rows', 4)
tqdm.pandas()
tqdm = partial(tqdm, position=0, leave=True)
Ks = [1, 5, 10, 20, 40, 60, 80, 100]

# II. Computation

## Option 1: download pre-computed `df`


In [7]:
## for running !wget on a GPU instance, please uncomment the following two lines
# import locale
# locale.getpreferredencoding = lambda: "UTF-8"

In [21]:
# the computations under Option 2 below can take several minutes to run
# to save time, you may download the pre-computed df by running the following two lines
!wget -O exp9.230306.0024.dfpkl https://uoguelphca-my.sharepoint.com/:u:/g/personal/chungyan_uoguelph_ca/EetFsM17mc5Hje7JZAmUTJcBiABnzB0V_ZpvhVcXJq2SYA?download=1
df = pd.read_pickle('exp9.230306.0024.dfpkl')
df

--2023-03-06 05:26:23--  https://uoguelphca-my.sharepoint.com/:u:/g/personal/chungyan_uoguelph_ca/EetFsM17mc5Hje7JZAmUTJcBiABnzB0V_ZpvhVcXJq2SYA?download=1
Resolving uoguelphca-my.sharepoint.com (uoguelphca-my.sharepoint.com)... 52.104.56.41, 2a01:111:f402:f04e::41
Connecting to uoguelphca-my.sharepoint.com (uoguelphca-my.sharepoint.com)|52.104.56.41|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: /personal/chungyan_uoguelph_ca/Documents/Shared/exp9.230306.0024.dfpkl?ga=1 [following]
--2023-03-06 05:26:23--  https://uoguelphca-my.sharepoint.com/personal/chungyan_uoguelph_ca/Documents/Shared/exp9.230306.0024.dfpkl?ga=1
Reusing existing connection to uoguelphca-my.sharepoint.com:443.
HTTP request sent, awaiting response... 200 OK
Length: 183199103 (175M) [application/octet-stream]
Saving to: ‘exp9.230306.0024.dfpkl’


2023-03-06 05:26:34 (17.3 MB/s) - ‘exp9.230306.0024.dfpkl’ saved [183199103/183199103]



Unnamed: 0,article,highlights,id,sentences,TFIDF_sim,raw_ROUGE,R1,R2,RL,ROUGE_mean,...,TFIDF_SBERT_top60%_f1,SBERT_SBERT_top60%_f1,TFIDF_ROUGE_top80%_f1,SBERT_ROUGE_top80%_f1,TFIDF_SBERT_top80%_f1,SBERT_SBERT_top80%_f1,TFIDF_ROUGE_top100%_f1,SBERT_ROUGE_top100%_f1,TFIDF_SBERT_top100%_f1,SBERT_SBERT_top100%_f1
0,(CNN)The Palestinian Authority officially beca...,Membership gives the ICC jurisdiction over all...,f001ec5c4704938247d27a44948eebb37ae98d01,[(CNN)The Palestinian Authority officially bec...,"[0.394838255251771, 0.2184224416188325, 0.5160...","[{'rouge1': (0.41379310344827586, 0.3529411764...","[0.38095238095238093, 0.18867924528301885, 0.3...","[0.19672131147540986, 0.0, 0.1846153846153846,...","[0.28571428571428575, 0.1509433962264151, 0.32...","[0.2877959927140255, 0.11320754716981131, 0.30...",...,0.875000,0.937500,0.904762,0.857143,0.904762,1.000000,1.0,1.0,1.0,1.0
1,(CNN)Never mind cats having nine lives. A stra...,"Theia, a bully breed mix, was apparently hit b...",230c522854991d053fe98a718b1defa077a8efef,"[(CNN)Never mind cats having nine lives., A st...","[0.0910601549035593, 0.29696905043672805, 0.43...","[{'rouge1': (0.0, 0.0, 0.0), 'rouge2': (0.0, 0...","[0.0, 0.3720930232558139, 0.2191780821917808, ...","[0.0, 0.19047619047619047, 0.05633802816901408...","[0.0, 0.32558139534883723, 0.136986301369863, ...","[0.0, 0.29605020302694723, 0.1375008039102193,...",...,0.818182,0.636364,0.866667,0.866667,0.933333,0.866667,1.0,1.0,1.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11488,"Despite the hype surrounding its first watch, ...",Apple sold more than 61 million iPhones in the...,30ec5f280eee772a73d181bfc8514defd8026434,"[Despite the hype surrounding its first watch,...","[0.3847239488347531, 0.20526128662452667, 0.59...","[{'rouge1': (0.21739130434782608, 0.15625, 0.1...","[0.18181818181818182, 0.14285714285714288, 0.3...","[0.0, 0.037037037037037035, 0.2702702702702703...","[0.14545454545454545, 0.14285714285714288, 0.3...","[0.10909090909090909, 0.10758377425044093, 0.3...",...,0.785714,0.928571,0.810811,0.810811,0.972973,0.945946,1.0,1.0,1.0,1.0
11489,Angus Hawley's brother has spoken of his shock...,Angus Hawley's brother said his late sibling '...,b4a1738c4a0acdf3d189264a0927005aa5b856d6,[Angus Hawley's brother has spoken of his shoc...,"[0.6310895042459086, 0.3007128522327899, 0.187...","[{'rouge1': (0.6129032258064516, 0.21590909090...","[0.31932773109243695, 0.16949152542372878, 0.2...","[0.17094017094017092, 0.0, 0.03418803418803419...","[0.2352941176470588, 0.11864406779661019, 0.11...","[0.24185400655988887, 0.096045197740113, 0.117...",...,0.783784,0.729730,0.897959,0.877551,0.897959,0.857143,1.0,1.0,1.0,1.0


## Option 2: compute `df` from data

In [None]:
# load dataset into a dataframe

DATASET = 'cnn_dailymail'
CONFIG  = '3.0.0'
SUBSET  = 'test'

dataset = load_dataset(DATASET, CONFIG, split=SUBSET)
df = pd.DataFrame(dataset)
df

Downloading builder script:   0%|          | 0.00/8.33k [00:00<?, ?B/s]

Downloading metadata:   0%|          | 0.00/9.88k [00:00<?, ?B/s]

Downloading readme:   0%|          | 0.00/15.1k [00:00<?, ?B/s]

Downloading and preparing dataset cnn_dailymail/3.0.0 to /root/.cache/huggingface/datasets/cnn_dailymail/3.0.0/3.0.0/1b3c71476f6d152c31c1730e83ccb08bcf23e348233f4fcc11e182248e6bf7de...


Downloading data files:   0%|          | 0/5 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/159M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/376M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/12.3M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/661k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/572k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/287113 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/13368 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/11490 [00:00<?, ? examples/s]

Dataset cnn_dailymail downloaded and prepared to /root/.cache/huggingface/datasets/cnn_dailymail/3.0.0/3.0.0/1b3c71476f6d152c31c1730e83ccb08bcf23e348233f4fcc11e182248e6bf7de. Subsequent calls will reuse this data.


Unnamed: 0,article,highlights,id
0,(CNN)The Palestinian Authority officially beca...,Membership gives the ICC jurisdiction over all...,f001ec5c4704938247d27a44948eebb37ae98d01
1,(CNN)Never mind cats having nine lives. A stra...,"Theia, a bully breed mix, was apparently hit b...",230c522854991d053fe98a718b1defa077a8efef
...,...,...,...
11488,"Despite the hype surrounding its first watch, ...",Apple sold more than 61 million iPhones in the...,30ec5f280eee772a73d181bfc8514defd8026434
11489,Angus Hawley's brother has spoken of his shock...,Angus Hawley's brother said his late sibling '...,b4a1738c4a0acdf3d189264a0927005aa5b856d6


In [None]:
# split articles into sentences
# every entry of df['sentences'] will contain a list of strings

spacy_eng_nlp = English()
spacy_eng_nlp.add_pipe("sentencizer")

df['sentences'] = df.progress_apply(
    lambda x: (
        [str(s) for s in spacy_eng_nlp(x.article).sents]
    ),
    axis=1
)
df

100%|██████████| 11490/11490 [00:42<00:00, 272.59it/s]


Unnamed: 0,article,highlights,id,sentences,TFIDF_sim
0,(CNN)The Palestinian Authority officially beca...,Membership gives the ICC jurisdiction over all...,f001ec5c4704938247d27a44948eebb37ae98d01,[(CNN)The Palestinian Authority officially bec...,"[0.394838255251771, 0.2184224416188325, 0.5160..."
1,(CNN)Never mind cats having nine lives. A stra...,"Theia, a bully breed mix, was apparently hit b...",230c522854991d053fe98a718b1defa077a8efef,"[(CNN)Never mind cats having nine lives., A st...","[0.0910601549035593, 0.29696905043672805, 0.43..."
...,...,...,...,...,...
11488,"Despite the hype surrounding its first watch, ...",Apple sold more than 61 million iPhones in the...,30ec5f280eee772a73d181bfc8514defd8026434,"[Despite the hype surrounding its first watch,...","[0.3847239488347531, 0.20526128662452667, 0.59..."
11489,Angus Hawley's brother has spoken of his shock...,Angus Hawley's brother said his late sibling '...,b4a1738c4a0acdf3d189264a0927005aa5b856d6,[Angus Hawley's brother has spoken of his shoc...,"[0.6310895042459086, 0.3007128522327899, 0.187..."


In [None]:
# calculate TF-IDF (Term Frequency-Inverse Document Frequency), fitted on all articles
# then calculate the cosine similarity of each sentence to its source "article"
# every entry of df['TFIDF_sim'] will be an ndarray holding one cosine similarity per sentence

articles = df.article.tolist()
tfidf = TfidfVectorizer().fit(articles)

df['TFIDF_sim'] = df.progress_apply(
    lambda x: (
        cosine_similarity(
            tfidf.transform([x.article]),
            tfidf.transform(x.sentences)
        )[0]
    ),
    axis=1
)
df

100%|██████████| 11490/11490 [01:08<00:00, 166.74it/s]


Unnamed: 0,article,highlights,id,sentences,TFIDF_sim
0,(CNN)The Palestinian Authority officially beca...,Membership gives the ICC jurisdiction over all...,f001ec5c4704938247d27a44948eebb37ae98d01,[(CNN)The Palestinian Authority officially bec...,"[0.394838255251771, 0.2184224416188325, 0.5160..."
1,(CNN)Never mind cats having nine lives. A stra...,"Theia, a bully breed mix, was apparently hit b...",230c522854991d053fe98a718b1defa077a8efef,"[(CNN)Never mind cats having nine lives., A st...","[0.0910601549035593, 0.29696905043672805, 0.43..."
...,...,...,...,...,...
11488,"Despite the hype surrounding its first watch, ...",Apple sold more than 61 million iPhones in the...,30ec5f280eee772a73d181bfc8514defd8026434,"[Despite the hype surrounding its first watch,...","[0.3847239488347531, 0.20526128662452667, 0.59..."
11489,Angus Hawley's brother has spoken of his shock...,Angus Hawley's brother said his late sibling '...,b4a1738c4a0acdf3d189264a0927005aa5b856d6,[Angus Hawley's brother has spoken of his shoc...,"[0.6310895042459086, 0.3007128522327899, 0.187..."
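
To make the sentence-to-article similarity above concrete, here is a minimal pure-Python sketch of the same idea on a toy corpus. It assumes the `TfidfVectorizer` defaults (lowercasing, tokens of two or more word characters, smoothed IDF, raw term counts); the helper names `tokenize`, `fit_idf`, `tfidf_vector` and `cosine` and the toy texts are hypothetical, and this is not how scikit-learn is implemented internally.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Lowercase, then keep tokens of two or more word characters
    # (assumed to mirror TfidfVectorizer's default token pattern).
    return re.findall(r"\b\w\w+\b", text.lower())

def fit_idf(corpus):
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1, where df is the number
    # of documents that contain the term.
    n = len(corpus)
    df = Counter(term for doc in corpus for term in set(tokenize(doc)))
    return {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}

def tfidf_vector(text, idf):
    # Raw term count times IDF; terms unseen when fitting are ignored.
    counts = Counter(t for t in tokenize(text) if t in idf)
    return {t: c * idf[t] for t, c in counts.items()}

def cosine(u, v):
    # Cosine similarity of two sparse vectors stored as dicts.
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Toy data (hypothetical): two "articles"; the first is split into two sentences.
corpus = [
    "The cat sat on the mat. The dog barked loudly.",
    "Stocks fell sharply on Monday. Markets recovered later.",
]
sentences = ["The cat sat on the mat.", "The dog barked loudly."]

idf = fit_idf(corpus)                        # fitted on all articles
article_vec = tfidf_vector(corpus[0], idf)   # the source article
sims = [cosine(tfidf_vector(s, idf), article_vec) for s in sentences]
print(sims)
```

The first sentence shares more weighted terms with the article than the second, so it gets the higher similarity. Because cosine similarity is scale-invariant, the L2 normalization that `TfidfVectorizer` applies by default does not change the result.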


In [None]:
# calculate ROUGE score (Recall-Oriented Understudy for Gisting Evaluation)
# of each sentence against the "highlights"
# every entry of df['ROUGE_mean'] will be an ndarray, the average of the ROUGE-1, ROUGE-2 and ROUGE-L F-measures

rouge = RougeScorer(['rouge1', 'rouge2', 'rougeL'])

df['raw_ROUGE'] = df.progress_apply(
    lambda x: (
        [rouge.score(x.highlights, sentence)
         for sentence in x.sentences]
    ),
    axis=1
)
df['R1'] = df.raw_ROUGE.map(lambda xs: np.array([x['rouge1'].fmeasure for x in xs]))
df['R2'] = df.raw_ROUGE.map(lambda xs: np.array([x['rouge2'].fmeasure for x in xs]))
df['RL'] = df.raw_ROUGE.map(lambda xs: np.array([x['rougeL'].fmeasure for x in xs]))
df['ROUGE_mean'] = (df['R1'] + df['R2'] + df['RL']) / 3
df

100%|██████████| 11490/11490 [07:04<00:00, 27.07it/s]


Unnamed: 0,article,highlights,id,sentences,TFIDF_sim,raw_ROUGE,R1,R2,RL,ROUGE_mean
0,(CNN)The Palestinian Authority officially beca...,Membership gives the ICC jurisdiction over all...,f001ec5c4704938247d27a44948eebb37ae98d01,[(CNN)The Palestinian Authority officially bec...,"[0.394838255251771, 0.2184224416188325, 0.5160...","[{'rouge1': (0.41379310344827586, 0.3529411764...","[0.38095238095238093, 0.18867924528301885, 0.3...","[0.19672131147540986, 0.0, 0.1846153846153846,...","[0.28571428571428575, 0.1509433962264151, 0.32...","[0.2877959927140255, 0.11320754716981131, 0.30..."
1,(CNN)Never mind cats having nine lives. A stra...,"Theia, a bully breed mix, was apparently hit b...",230c522854991d053fe98a718b1defa077a8efef,"[(CNN)Never mind cats having nine lives., A st...","[0.0910601549035593, 0.29696905043672805, 0.43...","[{'rouge1': (0.0, 0.0, 0.0), 'rouge2': (0.0, 0...","[0.0, 0.3720930232558139, 0.2191780821917808, ...","[0.0, 0.19047619047619047, 0.05633802816901408...","[0.0, 0.32558139534883723, 0.136986301369863, ...","[0.0, 0.29605020302694723, 0.1375008039102193,..."
...,...,...,...,...,...,...,...,...,...,...
11488,"Despite the hype surrounding its first watch, ...",Apple sold more than 61 million iPhones in the...,30ec5f280eee772a73d181bfc8514defd8026434,"[Despite the hype surrounding its first watch,...","[0.3847239488347531, 0.20526128662452667, 0.59...","[{'rouge1': (0.21739130434782608, 0.15625, 0.1...","[0.18181818181818182, 0.14285714285714288, 0.3...","[0.0, 0.037037037037037035, 0.2702702702702703...","[0.14545454545454545, 0.14285714285714288, 0.3...","[0.10909090909090909, 0.10758377425044093, 0.3..."
11489,Angus Hawley's brother has spoken of his shock...,Angus Hawley's brother said his late sibling '...,b4a1738c4a0acdf3d189264a0927005aa5b856d6,[Angus Hawley's brother has spoken of his shoc...,"[0.6310895042459086, 0.3007128522327899, 0.187...","[{'rouge1': (0.6129032258064516, 0.21590909090...","[0.31932773109243695, 0.16949152542372878, 0.2...","[0.17094017094017092, 0.0, 0.03418803418803419...","[0.2352941176470588, 0.11864406779661019, 0.11...","[0.24185400655988887, 0.096045197740113, 0.117..."
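
For intuition about the numbers in `R1`, the following is a simplified sketch of a ROUGE-1 F-measure: clipped unigram overlap between a target and a prediction, combined into the harmonic mean of precision and recall. The function name `rouge1_f` and the example strings are hypothetical, tokenization here is a plain lowercase alphanumeric split without stemming, and the `rouge_score` package remains the reference for the real computation.

```python
import re
from collections import Counter

def rouge1_f(target, prediction):
    # Lowercased alphanumeric tokens; no stemming in this sketch.
    def tokens(text):
        return re.findall(r"[a-z0-9]+", text.lower())

    target_counts = Counter(tokens(target))
    pred_counts = Counter(tokens(prediction))

    # Clipped matches: each unigram counts at most as often as it
    # appears in both texts.
    overlap = sum((target_counts & pred_counts).values())
    if overlap == 0:
        return 0.0

    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(target_counts.values())
    return 2 * precision * recall / (precision + recall)

# Target has 6 tokens, prediction has 3, and 2 of them match:
# precision = 2/3, recall = 2/6, F-measure = 4/9.
score = rouge1_f("the cat sat on the mat", "the cat slept")
print(score)
```

As in the cell above, the gold summary is passed as the target and the candidate sentence as the prediction.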


In [8]:
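# encode sentences, article and highlights with SBERT (Sentence-BERT), then calculate
# the cosine similarity of each sentence to the "article" (s2a) and to the "highlights" (s2h)
# every entry of df['SBERT_s2a_sim'] and df['SBERT_s2h_sim'] will be an ndarray

sbert = SentenceTransformer('all-MiniLM-L6-v2')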

df['sbert_embeddings'] = df.progress_apply(
    lambda x: sbert.encode(x.sentences + [x.article, x.highlights]),
    axis=1
)

df['SBERT_s2a_sim'] = df.progress_apply(
    lambda x: (
        cosine_similarity(
            x.sbert_embeddings[[-2]],   # x.article encoded
            x.sbert_embeddings[:-2]     # x.sentences encoded
        )[0]
    ),
    axis=1
)

df['SBERT_s2h_sim'] = df.progress_apply(
    lambda x: (
        cosine_similarity(
            x.sbert_embeddings[[-1]],   # x.highlights encoded
            x.sbert_embeddings[:-2]     # x.sentences encoded
        )[0]
    ),
    axis=1
)

df = df.drop(columns='sbert_embeddings')

df

Unnamed: 0,article,highlights,id,sentences,TFIDF_sim,raw_ROUGE,R1,R2,RL,ROUGE_mean,SBERT_s2a_sim,SBERT_s2h_sim
0,(CNN)The Palestinian Authority officially beca...,Membership gives the ICC jurisdiction over all...,f001ec5c4704938247d27a44948eebb37ae98d01,[(CNN)The Palestinian Authority officially bec...,"[0.394838255251771, 0.2184224416188325, 0.5160...","[{'rouge1': (0.41379310344827586, 0.3529411764...","[0.38095238095238093, 0.18867924528301885, 0.3...","[0.19672131147540986, 0.0, 0.1846153846153846,...","[0.28571428571428575, 0.1509433962264151, 0.32...","[0.2877959927140255, 0.11320754716981131, 0.30...","[0.76704454, 0.42279387, 0.7966811, 0.63511753...","[0.6571314, 0.27385426, 0.67717206, 0.7175484,..."
1,(CNN)Never mind cats having nine lives. A stra...,"Theia, a bully breed mix, was apparently hit b...",230c522854991d053fe98a718b1defa077a8efef,"[(CNN)Never mind cats having nine lives., A st...","[0.0910601549035593, 0.29696905043672805, 0.43...","[{'rouge1': (0.0, 0.0, 0.0), 'rouge2': (0.0, 0...","[0.0, 0.3720930232558139, 0.2191780821917808, ...","[0.0, 0.19047619047619047, 0.05633802816901408...","[0.0, 0.32558139534883723, 0.136986301369863, ...","[0.0, 0.29605020302694723, 0.1375008039102193,...","[0.47714734, 0.59764886, 0.6244302, 0.580126, ...","[0.20704053, 0.47492748, 0.71959484, 0.5244162..."
...,...,...,...,...,...,...,...,...,...,...,...,...
11488,"Despite the hype surrounding its first watch, ...",Apple sold more than 61 million iPhones in the...,30ec5f280eee772a73d181bfc8514defd8026434,"[Despite the hype surrounding its first watch,...","[0.3847239488347531, 0.20526128662452667, 0.59...","[{'rouge1': (0.21739130434782608, 0.15625, 0.1...","[0.18181818181818182, 0.14285714285714288, 0.3...","[0.0, 0.037037037037037035, 0.2702702702702703...","[0.14545454545454545, 0.14285714285714288, 0.3...","[0.10909090909090909, 0.10758377425044093, 0.3...","[0.6963891, 0.510888, 0.8967398, 0.5488275, 0....","[0.659778, 0.3296187, 0.7385148, 0.39404175, 0..."
11489,Angus Hawley's brother has spoken of his shock...,Angus Hawley's brother said his late sibling '...,b4a1738c4a0acdf3d189264a0927005aa5b856d6,[Angus Hawley's brother has spoken of his shoc...,"[0.6310895042459086, 0.3007128522327899, 0.187...","[{'rouge1': (0.6129032258064516, 0.21590909090...","[0.31932773109243695, 0.16949152542372878, 0.2...","[0.17094017094017092, 0.0, 0.03418803418803419...","[0.2352941176470588, 0.11864406779661019, 0.11...","[0.24185400655988887, 0.096045197740113, 0.117...","[0.7640883, 0.57733625, 0.5972009, -0.00895412...","[0.7234891, 0.116157815, 0.43206477, -0.050523..."


In [15]:
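# rank sentences within each article
# ranks are normalized to [0, 1]: 0 for the lowest score, 1 for the highest

def cal_ranking(x):
    # max(..., 1) avoids dividing by zero for an article with a single sentence
    return np.argsort(np.argsort(x)) / max(len(x) - 1, 1)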

df['rank_by_TFIDF']     = df.TFIDF_sim    .map(cal_ranking)
df['rank_by_ROUGE']     = df.ROUGE_mean   .map(cal_ranking)
df['rank_by_SBERT_s2a'] = df.SBERT_s2a_sim.map(cal_ranking)
df['rank_by_SBERT_s2h'] = df.SBERT_s2h_sim.map(cal_ranking)
df

Unnamed: 0,article,highlights,id,sentences,TFIDF_sim,raw_ROUGE,R1,R2,RL,ROUGE_mean,SBERT_s2a_sim,SBERT_s2h_sim,rank_by_TFIDF,rank_by_ROUGE,rank_by_SBERT_s2a,rank_by_SBERT_s2h
0,(CNN)The Palestinian Authority officially beca...,Membership gives the ICC jurisdiction over all...,f001ec5c4704938247d27a44948eebb37ae98d01,[(CNN)The Palestinian Authority officially bec...,"[0.394838255251771, 0.2184224416188325, 0.5160...","[{'rouge1': (0.41379310344827586, 0.3529411764...","[0.38095238095238093, 0.18867924528301885, 0.3...","[0.19672131147540986, 0.0, 0.1846153846153846,...","[0.28571428571428575, 0.1509433962264151, 0.32...","[0.2877959927140255, 0.11320754716981131, 0.30...","[0.76704454, 0.42279387, 0.7966811, 0.63511753...","[0.6571314, 0.27385426, 0.67717206, 0.7175484,...","[0.6538461538461539, 0.38461538461538464, 1.0,...","[0.8846153846153846, 0.5384615384615384, 0.961...","[0.9615384615384616, 0.34615384615384615, 1.0,...","[0.8461538461538461, 0.23076923076923078, 0.92..."
1,(CNN)Never mind cats having nine lives. A stra...,"Theia, a bully breed mix, was apparently hit b...",230c522854991d053fe98a718b1defa077a8efef,"[(CNN)Never mind cats having nine lives., A st...","[0.0910601549035593, 0.29696905043672805, 0.43...","[{'rouge1': (0.0, 0.0, 0.0), 'rouge2': (0.0, 0...","[0.0, 0.3720930232558139, 0.2191780821917808, ...","[0.0, 0.19047619047619047, 0.05633802816901408...","[0.0, 0.32558139534883723, 0.136986301369863, ...","[0.0, 0.29605020302694723, 0.1375008039102193,...","[0.47714734, 0.59764886, 0.6244302, 0.580126, ...","[0.20704053, 0.47492748, 0.71959484, 0.5244162...","[0.0, 0.4444444444444444, 0.8333333333333334, ...","[0.0, 0.9444444444444444, 0.7777777777777778, ...","[0.7222222222222222, 0.8888888888888888, 0.944...","[0.05555555555555555, 0.6666666666666666, 1.0,..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11488,"Despite the hype surrounding its first watch, ...",Apple sold more than 61 million iPhones in the...,30ec5f280eee772a73d181bfc8514defd8026434,"[Despite the hype surrounding its first watch,...","[0.3847239488347531, 0.20526128662452667, 0.59...","[{'rouge1': (0.21739130434782608, 0.15625, 0.1...","[0.18181818181818182, 0.14285714285714288, 0.3...","[0.0, 0.037037037037037035, 0.2702702702702703...","[0.14545454545454545, 0.14285714285714288, 0.3...","[0.10909090909090909, 0.10758377425044093, 0.3...","[0.6963891, 0.510888, 0.8967398, 0.5488275, 0....","[0.659778, 0.3296187, 0.7385148, 0.39404175, 0...","[0.6956521739130435, 0.2391304347826087, 1.0, ...","[0.6521739130434783, 0.6304347826086957, 0.934...","[0.9347826086956522, 0.5, 1.0, 0.5869565217391...","[0.9130434782608695, 0.43478260869565216, 0.97..."
11489,Angus Hawley's brother has spoken of his shock...,Angus Hawley's brother said his late sibling '...,b4a1738c4a0acdf3d189264a0927005aa5b856d6,[Angus Hawley's brother has spoken of his shoc...,"[0.6310895042459086, 0.3007128522327899, 0.187...","[{'rouge1': (0.6129032258064516, 0.21590909090...","[0.31932773109243695, 0.16949152542372878, 0.2...","[0.17094017094017092, 0.0, 0.03418803418803419...","[0.2352941176470588, 0.11864406779661019, 0.11...","[0.24185400655988887, 0.096045197740113, 0.117...","[0.7640883, 0.57733625, 0.5972009, -0.00895412...","[0.7234891, 0.116157815, 0.43206477, -0.050523...","[0.9666666666666667, 0.5333333333333333, 0.35,...","[0.9166666666666666, 0.5833333333333334, 0.716...","[0.9833333333333333, 0.9, 0.9166666666666666, ...","[0.9833333333333333, 0.1, 0.7166666666666667, ..."
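
The double `argsort` in `cal_ranking` can be hard to read at first. The sketch below spells out the same ranking in plain Python: the inner sort gives the indices in ascending score order, and the loop turns that order into a rank per sentence. The name `cal_ranking_py` and the scores are hypothetical; ties are broken by position here, which may differ from NumPy's default sort, and the `max(..., 1)` guard keeps a one-sentence article from dividing by zero.

```python
def cal_ranking_py(scores):
    # Indices of the scores in ascending order (like the inner np.argsort).
    order = sorted(range(len(scores)), key=lambda i: scores[i])

    # Position of each sentence within that order (like the outer np.argsort):
    # 0 for the lowest score, len(scores) - 1 for the highest.
    ranks = [0] * len(scores)
    for position, index in enumerate(order):
        ranks[index] = position

    # Normalize to [0, 1].
    denom = max(len(scores) - 1, 1)
    return [r / denom for r in ranks]

ranking = cal_ranking_py([0.39, 0.22, 0.52, 0.10, 0.30])
print(ranking)  # → [0.75, 0.25, 1.0, 0.0, 0.5]
```

The sentence with the highest score (0.52) gets rank 1.0 and the lowest (0.10) gets rank 0.0, so a threshold on the rank selects a fixed fraction of sentences regardless of article length.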


In [17]:
# assume that we are to predict the top-K%-sentences using one ranking (e.g. rank_by_TFIDF),
# while another ranking (e.g. rank_by_ROUGE) is the ground truth.  E.g., for an article of
# 20 sentences, the "top-10%-sentences" are the 2 most important key sentences.
# df[f'{outp_prefix}_top{K}%_f1'] will store the F1 score of such classifications.

def cal_topKpctF1(true_rank_by, pred_rank_by, K):
    true = true_rank_by >= (1-K/100)
    pred = pred_rank_by >= (1-K/100)
    return f1_score(true, pred)

def append_topKpct_f1(df, K, true_rank_by, pred_rank_by, outp_prefix):
    df[f'{outp_prefix}_top{K}%_f1'] = df.progress_apply(
        lambda x: cal_topKpctF1(x[true_rank_by], x[pred_rank_by], K),
        axis=1
    )

for K in Ks:
    # predict by TFIDF, validate by ROUGE
    append_topKpct_f1(df, K,
                      true_rank_by='rank_by_ROUGE',
                      pred_rank_by='rank_by_TFIDF',
                      outp_prefix='TFIDF_ROUGE')
    
    # predict by SBERT, validate by ROUGE
    append_topKpct_f1(df, K,
                      true_rank_by='rank_by_ROUGE',
                      pred_rank_by='rank_by_SBERT_s2a',
                      outp_prefix='SBERT_ROUGE')
    
    # predict by TFIDF, validate by SBERT (sentence-to-highlights)
    append_topKpct_f1(df, K,
                      true_rank_by='rank_by_SBERT_s2h',
                      pred_rank_by='rank_by_TFIDF',
                      outp_prefix='TFIDF_SBERT')
    
    # predict by SBERT (sentence-to-article), validate by SBERT (sentence-to-highlights)
    append_topKpct_f1(df, K,
                      true_rank_by='rank_by_SBERT_s2h',
                      pred_rank_by='rank_by_SBERT_s2a',
                      outp_prefix='SBERT_SBERT')
    
df

100%|██████████| 11490/11490 [00:21<00:00, 524.49it/s]

Unnamed: 0,article,highlights,id,sentences,TFIDF_sim,raw_ROUGE,R1,R2,RL,ROUGE_mean,...,TFIDF_SBERT_top60%_f1,SBERT_SBERT_top60%_f1,TFIDF_ROUGE_top80%_f1,SBERT_ROUGE_top80%_f1,TFIDF_SBERT_top80%_f1,SBERT_SBERT_top80%_f1,TFIDF_ROUGE_top100%_f1,SBERT_ROUGE_top100%_f1,TFIDF_SBERT_top100%_f1,SBERT_SBERT_top100%_f1
0,(CNN)The Palestinian Authority officially beca...,Membership gives the ICC jurisdiction over all...,f001ec5c4704938247d27a44948eebb37ae98d01,[(CNN)The Palestinian Authority officially bec...,"[0.394838255251771, 0.2184224416188325, 0.5160...","[{'rouge1': (0.41379310344827586, 0.3529411764...","[0.38095238095238093, 0.18867924528301885, 0.3...","[0.19672131147540986, 0.0, 0.1846153846153846,...","[0.28571428571428575, 0.1509433962264151, 0.32...","[0.2877959927140255, 0.11320754716981131, 0.30...",...,0.875000,0.937500,0.904762,0.857143,0.904762,1.000000,1.0,1.0,1.0,1.0
1,(CNN)Never mind cats having nine lives. A stra...,"Theia, a bully breed mix, was apparently hit b...",230c522854991d053fe98a718b1defa077a8efef,"[(CNN)Never mind cats having nine lives., A st...","[0.0910601549035593, 0.29696905043672805, 0.43...","[{'rouge1': (0.0, 0.0, 0.0), 'rouge2': (0.0, 0...","[0.0, 0.3720930232558139, 0.2191780821917808, ...","[0.0, 0.19047619047619047, 0.05633802816901408...","[0.0, 0.32558139534883723, 0.136986301369863, ...","[0.0, 0.29605020302694723, 0.1375008039102193,...",...,0.818182,0.636364,0.866667,0.866667,0.933333,0.866667,1.0,1.0,1.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11488,"Despite the hype surrounding its first watch, ...",Apple sold more than 61 million iPhones in the...,30ec5f280eee772a73d181bfc8514defd8026434,"[Despite the hype surrounding its first watch,...","[0.3847239488347531, 0.20526128662452667, 0.59...","[{'rouge1': (0.21739130434782608, 0.15625, 0.1...","[0.18181818181818182, 0.14285714285714288, 0.3...","[0.0, 0.037037037037037035, 0.2702702702702703...","[0.14545454545454545, 0.14285714285714288, 0.3...","[0.10909090909090909, 0.10758377425044093, 0.3...",...,0.785714,0.928571,0.810811,0.810811,0.972973,0.945946,1.0,1.0,1.0,1.0
11489,Angus Hawley's brother has spoken of his shock...,Angus Hawley's brother said his late sibling '...,b4a1738c4a0acdf3d189264a0927005aa5b856d6,[Angus Hawley's brother has spoken of his shoc...,"[0.6310895042459086, 0.3007128522327899, 0.187...","[{'rouge1': (0.6129032258064516, 0.21590909090...","[0.31932773109243695, 0.16949152542372878, 0.2...","[0.17094017094017092, 0.0, 0.03418803418803419...","[0.2352941176470588, 0.11864406779661019, 0.11...","[0.24185400655988887, 0.096045197740113, 0.117...",...,0.783784,0.729730,0.897959,0.877551,0.897959,0.857143,1.0,1.0,1.0,1.0
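
As a worked example of the top-K% classification above, the sketch below applies the same `rank >= 1 - K/100` threshold to normalized ranks and computes F1 by hand instead of calling `f1_score`. The function name `top_k_pct_f1` and the toy ranks are hypothetical; returning 0.0 when there is no true positive is an assumption of this sketch.

```python
def top_k_pct_f1(true_rank, pred_rank, k):
    # A sentence is labeled "key" when its normalized rank reaches the threshold.
    threshold = 1 - k / 100
    true = [r >= threshold for r in true_rank]
    pred = [r >= threshold for r in pred_rank]

    tp = sum(t and p for t, p in zip(true, pred))        # key in both rankings
    fp = sum(p and not t for t, p in zip(true, pred))    # predicted key only
    fn = sum(t and not p for t, p in zip(true, pred))    # truly key only
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

# For 20 sentences the normalized ranks are 0/19, 1/19, ..., 19/19,
# and the top-10% threshold of 0.9 keeps only the two highest.
ranks_20 = [i / 19 for i in range(20)]
n_selected = sum(r >= 1 - 10 / 100 for r in ranks_20)
print(n_selected)  # → 2

# Five sentences, K = 40: the ground truth picks sentences 3 and 4,
# the prediction picks sentences 2 and 4, so TP = FP = FN = 1.
true_rank = [0.0, 0.25, 0.5, 0.75, 1.0]
pred_rank = [0.25, 0.0, 0.75, 0.5, 1.0]
f1 = top_k_pct_f1(true_rank, pred_rank, 40)
print(f1)  # → 0.5
```

With K = 100 the threshold is 0, every sentence is labeled as key by both rankings, and F1 is 1.0, which matches the `top100%_f1` columns in the output above.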


In [18]:
df.to_pickle('exp9.dfpkl')

In [19]:
# from google.colab import drive
# drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [20]:
# !cp exp9.dfpkl gdrive/MyDrive/Shared/