"Sentence Features for Extractive Summarization:
1. Surface Features: Based on structure of documents or sentences, including **sentence position** in the document, the **number of words in the sentence**, and the number of quoted words in the sentence.
2. Content Features: Integrated three well-known sentence features based on content-bearing words i.e., centroid words, **signature terms**, and **high frequency words**.
3. Event Features: An **event** is comprised of an event term and associated event elements.
4. Relevance features: Incorporated to exploit inter-sentence relationships. It is assumed that: (1) sentences related to important sentences are important; (2) sentences related to many other sentences are important. The first sentence in a document or a paragraph is important, and other sentences in a document are compared with the leading ones."

(We have chosen to use the features in bold.)

-- *Extractive Summarization Using Supervised and Semi-Supervised Learning*

http://anthology.aclweb.org/C/C08/C08-1124.pdf
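
As a rough illustration of the surface and high-frequency-word features listed above, the sketch below computes sentence position, word count, and the share of high-frequency words for one tokenized article. The function name `surface_and_frequency_features`, the `top_k` parameter, and the linear position score are assumptions made for illustration; they are not taken from the paper. Signature terms are left out because they need a background corpus.

```python
from collections import Counter


def surface_and_frequency_features(sentences, top_k=5):
    """Return one feature dict per sentence.

    `sentences` is a list of token lists (one article), in the same shape
    as the `clean_sents_art` column once it has been parsed.
    """
    # Document-level word counts; the top_k most common words are treated
    # as the "high frequency words" (an illustrative choice).
    counts = Counter(token for sent in sentences for token in sent)
    high_freq = {word for word, _ in counts.most_common(top_k)}

    n = len(sentences)
    features = []
    for i, sent in enumerate(sentences):
        features.append({
            # 1.0 for the first sentence, decreasing linearly (assumed scoring).
            "position": 1.0 - i / n,
            "num_words": len(sent),
            # Share of tokens that are high-frequency words; 0.0 for empty sentences.
            "high_freq_ratio": (
                sum(t in high_freq for t in sent) / len(sent) if sent else 0.0
            ),
        })
    return features


doc = [["pier", "sale", "blackpool"], ["pier", "owner"], []]
for row in surface_and_frequency_features(doc, top_k=1):
    print(row)
```

Each dictionary could become one row of a feature matrix for a sentence classifier.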

In [1]:
from collections import Counter
import pandas as pd
import spacy
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)

In [2]:
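# Load the processed and cleaned small portions of the dataframes.
# NOTE: the train split is read from the validation CSV (same path as val_one_percent below),
# so 'train' and 'validation' hold identical data in this run.
train_one_percent = pd.read_csv("C:/Users/ankar/Downloads/NLU/drive/processed-cleaned_validation.csv")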
val_one_percent = pd.read_csv("C:/Users/ankar/Downloads/NLU/drive/processed-cleaned_validation.csv")
test_one_percent = pd.read_csv("C:/Users/ankar/Downloads/NLU/drive/processed-cleaned_test.csv")

In [3]:
# Hold the three datasets.
datasets = {'train': train_one_percent, 'validation': val_one_percent, 'test': test_one_percent}

In [4]:
datasets['train'].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 134 entries, 0 to 133
Data columns (total 9 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   article             134 non-null    object
 1   highlights          134 non-null    object
 2   id                  134 non-null    object
 3   internet-free_art   134 non-null    object
 4   internet-free_high  134 non-null    object
 5   boiler-free_art     134 non-null    object
 6   boiler-free_high    134 non-null    object
 7   clean_sents_art     134 non-null    object
 8   clean_sents_high    134 non-null    object
dtypes: object(9)
memory usage: 9.6+ KB


In [5]:
datasets['validation'].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 134 entries, 0 to 133
Data columns (total 9 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   article             134 non-null    object
 1   highlights          134 non-null    object
 2   id                  134 non-null    object
 3   internet-free_art   134 non-null    object
 4   internet-free_high  134 non-null    object
 5   boiler-free_art     134 non-null    object
 6   boiler-free_high    134 non-null    object
 7   clean_sents_art     134 non-null    object
 8   clean_sents_high    134 non-null    object
dtypes: object(9)
memory usage: 9.6+ KB


In [6]:
datasets['test'].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 115 entries, 0 to 114
Data columns (total 9 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   article             115 non-null    object
 1   highlights          115 non-null    object
 2   id                  115 non-null    object
 3   internet-free_art   115 non-null    object
 4   internet-free_high  115 non-null    object
 5   boiler-free_art     115 non-null    object
 6   boiler-free_high    115 non-null    object
 7   clean_sents_art     115 non-null    object
 8   clean_sents_high    115 non-null    object
dtypes: object(9)
memory usage: 8.2+ KB


In [7]:
datasets['validation'].head(3)

Unnamed: 0,article,highlights,id,internet-free_art,internet-free_high,boiler-free_art,boiler-free_high,clean_sents_art,clean_sents_high
0,Since giving birth to her daughter Arabella fo...,Critics have made cruel taunts about Rebecca F...,911d354c3ae2eefdd0b3e2e4de98f35205da1fe4,Since giving birth to her daughter Arabella fo...,Critics have made cruel taunts about Rebecca F...,Since giving birth to her daughter Arabella fo...,Critics have made cruel taunts about Rebecca F...,"[['give', 'birth', 'daughter', 'arabella', 'mo...","[['critic', 'cruel', 'taunt', 'rebecca', 'ferg..."
1,Ireland clung on to claim a controversial five...,Ireland beat Zimbabwe by five runs after excit...,47aea873200747c753d8eea82b6fd8c622324104,Ireland clung on to claim a controversial five...,Ireland beat Zimbabwe by five runs after excit...,Ireland clung on to claim a controversial five...,Ireland beat Zimbabwe by five runs after excit...,"[['ireland', 'clung', 'claim', 'controversial'...","[['ireland', 'beat', 'zimbabwe', 'run', 'excit..."
2,Laurence Fishburne is a multimillionaire actor...,Hattie Crawford Fishburne says she will be evi...,59d85872c0649cc425ae34dfcf35e4a0044575b3,Laurence Fishburne is a multimillionaire actor...,Hattie Crawford Fishburne says she will be evi...,Laurence Fishburne is a multimillionaire actor...,Hattie Crawford Fishburne says she will be evi...,"[['laurence', 'fishburne', 'multimillionaire',...","[['hattie', 'crawford', 'fishburne', 'say', 'e..."


In [8]:
# For readability.
# pd.set_option('display.max_colwidth', 200)
pd.set_option('display.max_colwidth', None)
# pd.set_option('display.max_columns', None)

## Sentence Embeddings
For word embeddings, the original tokens are generally preferable to lemmatized ones. Pre-trained embeddings are typically learned from large corpora of text in its surface form, so lemmatizing can discard semantic information that the vectors encode. Note that the code below nonetheless embeds the lemmatized `clean_sents_*` columns.

Gensim provides easy access to pre-trained models such as Word2Vec, GloVe, and FastText, any of which can supply the word vectors.

When deciding whether to use the cleaner version of your articles and highlights (without boilerplate and internet clutter) or the original version for creating sentence embeddings, there are a few considerations to keep in mind:

1. Relevance and Noise: The cleaner versions of your articles and highlights are likely to be more relevant and less noisy. Removing elements like boilerplate text, HTML tags, and URLs helps focus on the actual content. For tasks like summarization, where the quality and relevance of the text are crucial, using cleaner data can lead to more accurate and meaningful embeddings.

2. Context Preservation: If the process of cleaning does not significantly alter or remove important contextual information from the articles and highlights, then using the cleaner versions is preferable. However, if the cleaning process might strip away context or alter the meaning of the text, the original version should be considered.

3. Computational Efficiency: The cleaner versions are likely more concise, which can make the processing for embedding generation more efficient. Less irrelevant data (like HTML tags) means quicker processing and potentially better embeddings.

4. Alignment with Task Objectives: Consider what aligns best with the objectives of your summarization task. If the goal is to generate summaries that are clean and free of such clutter, then using the cleaner versions for training makes sense.

Given these considerations, for most scenarios, using the cleaner version of the articles and highlights would be the better choice. It allows you to focus on the substantive content of the text, leading to more accurate and meaningful sentence embeddings. This is especially true for summarization tasks, where the quality of the content is more important than the original formatting or extraneous text.
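
One common way to turn word vectors into a sentence embedding is to average the vectors of the words the model knows. The sketch below shows that idea on a toy vocabulary. The three-dimensional `toy_vectors` and the helper `mean_pool` are hypothetical stand-ins for a pre-trained model, used only so the example runs without a download.

```python
import numpy as np

# Hypothetical 3-dimensional vectors standing in for a pre-trained model.
toy_vectors = {
    "pier": np.array([1.0, 0.0, 0.0]),
    "sale": np.array([0.0, 1.0, 0.0]),
    "owner": np.array([0.0, 0.0, 1.0]),
}


def mean_pool(tokens, vectors, dim=3):
    """Average the vectors of in-vocabulary tokens; return zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)


# Out-of-vocabulary tokens are skipped rather than counted as zeros.
print(mean_pool(["pier", "sale", "unknownword"], toy_vectors))
# A sentence with no known tokens falls back to the zero vector.
print(mean_pool(["unknownword"], toy_vectors))
```

Averaging ignores word order, which is a known limitation of this kind of sentence embedding.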

In [9]:
!pip install gensim



In [32]:
import gensim.downloader as api
import numpy as np

# Load pre-trained Word2Vec model from Gensim.
model = api.load("word2vec-google-news-300")

# For the non-aggressive sentence embeddings on 'boiler-free' text:
#nlp = spacy.load("en_core_web_sm", disable=["ner", "parser"])  # Disable for speed.
#nlp.add_pipe('sentencizer')  # Add sentencizer to the pipeline for extra help in sentence tokenization.


def get_sentence_embeddings(lemmatized_sentences):
    sentence_embeddings = []

    for sentence in lemmatized_sentences:
        # Filter words present in the Word2Vec model's vocabulary
        words = [word for word in sentence if word in model.key_to_index]

        if words:
            embeddings = [model[word] for word in words]
            mean_embedding = np.mean(embeddings, axis=0)
            sentence_embeddings.append(mean_embedding)
        else:
            sentence_embeddings.append(np.zeros(model.vector_size))
    # Returns a list of embedding vectors, one for each sentence.
    return sentence_embeddings

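import ast

# Convert the string representation of a list (as stored in the CSV) back into an actual list.
def string_to_list(text):
    try:
        return ast.literal_eval(text)
    except Exception:
        # Malformed or non-string input (e.g. a missing value): fall back to an empty list.
        return []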
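
The string conversion is needed because a CSV round trip stores each nested token list as its printed text. A minimal sketch of that behaviour follows. The name `parse_token_lists` is hypothetical, and the extra type check is an assumption added here, not part of the notebook's function.

```python
import ast


def parse_token_lists(text):
    """Parse the text form of a nested list; return [] for anything else."""
    try:
        value = ast.literal_eval(text)
    except (ValueError, SyntaxError):
        # Raised for malformed literals and for non-string input such as NaN.
        return []
    # Guard against valid literals that are not lists (e.g. "42").
    return value if isinstance(value, list) else []


serialized = "[['blackpool', 'pier'], ['scroll', 'video']]"
parsed = parse_token_lists(serialized)
print(type(parsed), len(parsed))
print(parse_token_lists("not a list ["))
```

`ast.literal_eval` only evaluates Python literals, so it is a safer choice than `eval` for data read from disk.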


In [44]:
# Example 3
txt = """[['blackpool', 'pier', 'face', 'uncertain', 'future', 'sale', '4.8million'], ['historic', 'attraction', 'pier', 'market', 'owner', 'cuerden', 'leisure', 'collective', 'asking', 'price', '12.6million'], ['blackpool', 'central', 'pier', 'home', 'famous', '33', 'metre', 'high', 'ferris', 'wheel', 'close', 'pier', 'blackpool', 'tower', 'blackpool', 'south', 'pier', 'llandudno', 'pier', 'wale', 'construct', '19th', 'century'], ['scroll', 'video'], ['grab', 'blackpool', 'central', 'pier', 'sale', '5million', 'blackpool', 'south', 'pier'], ['llandudno', '695', 'metre', 'long', 'pier', 'grade', 'list', 'mean', 'future', 'owner', 'require', 'list', 'building', 'consent', 'change', 'structure'], ['blackpool', 'central', 'pier', 'stand', '341', 'metre', 'long', 'blackpool', 'south', 'pier', '150', 'metre', 'long', 'benefit', 'list', 'status', 'despite', 'opening', '1864', '1892', 'respectively'], ['richard', 'baldwin', 'director', 'bilfinger', 'gva', 'sell', 'pier', 'cuerden', 'leisure', 'say', 'unlikely', 'new', 'owner', 'significant', 'change'], ['say', 'order', 'sustain', 'value', 'ensure', 'value', 'appreciate', 'change', 'dramatically', 'new', 'ownership'], ['opinion', 'value', 'exist', 'use', 'unlikely', 'change', 'blackpool', 'pier'], ['victorian', 'pier', 'construct', 'cast', 'iron', 'pile', 'steel', 'frame', 'wooden', 'decking', 'boast', 'amusement', 'arcade', 'ride', 'generate', 'collective', 'income', '1.6', 'million', 'year', 'cuerden', 'leisure', 'annual', 'concession', 'agreement'], ['grade', 'list', 'llandudno', 'pier', 'sale', 'altogether', 'buy', '12.6million'], ['st', 'john', 'stott', 'director', 'cuerden', 'leisure', 'own', 'eastbourne', 'pier', 'say', 'group', 'sell', 'pier', 'restructure', 'asset'], ['say', 'asset', 'jewel', 'crown', 'uk', 'coastline', 'delighted', 'offer', 'market', 'separate', 'lot', 'collectively'], ['mr', 'baldwin', 'add', 'pier', 'truly', 'iconic', 'structure', 'having', 'popular', 'visitor', 'attraction', 'uk', 'good', 
'know', 'resort', 'century'], ['pier', 'offer', 'sale', 'freehold', 'subject', 'concession', 'agreement', 'place'], ['profitable', 'attraction', 'confident', 'sale', 'attract', 'major', 'interest'], ['spokesman', 'national', 'pier', 'society', 'promote', 'preservation', 'continue', 'enjoyment', 'seaside', 'pier', 'uk', 'say', 'pier', 'good', 'order', 'trading', 'successfully', 'give', 'easter', 'doubt', 'quickly', 'find', 'buyer', 'buyer']]"""
txt = string_to_list(txt)
a = get_sentence_embeddings(txt)
print(len(a))

18


In [33]:
for name, df in datasets.items():
    df['art_sent_embeddings'] = df['clean_sents_art'].apply(string_to_list).apply(get_sentence_embeddings)
    df['high_sent_embeddings'] = df['clean_sents_high'].apply(string_to_list).apply(get_sentence_embeddings)
    datasets[name] = df

In [22]:
#for name, df in datasets.items():
 #   if 'high_sent_embeddings' in df.columns: df.drop('high_sent_embeddings', axis=1, inplace=True)

In [43]:
# Check columns + results. All ok.
#for name, df in datasets.items(): print(f"Columns in {name} dataset:", df.columns)
datasets['train']['clean_sents_art'].head(4)
#datasets['validation']['clean_sents_art'].head()
#datasets['test'].head(2)


## Cosine Similarity (labels)

In [36]:
from sklearn.metrics.pairwise import cosine_similarity

def calculate_similarity(article_embeddings, highlight_embeddings):
    # Calculate cosine similarity between each article sentence and each highlight sentence.
    similarity_scores = []
    for art_emb in article_embeddings:
        # Scores for each article sentence against all highlight sentences.
        scores = [cosine_similarity([art_emb], [high_emb])[0][0] for high_emb in highlight_embeddings]
        # Aggregate the scores by taking the maximum.
        similarity_scores.append(max(scores) if scores else 0)
    return similarity_scores


for name, df in datasets.items():
    # Calculate similarity scores for each row in the dataframe
    similarity_scores = []
    for index, row in df.iterrows():
        # Use the precomputed sentence embeddings
        article_embeddings = row['art_sent_embeddings']
        highlight_embeddings = row['high_sent_embeddings']

        # Calculate similarity scores
        scores = calculate_similarity(article_embeddings, highlight_embeddings)

        # Store the scores
        similarity_scores.append(scores)
    
    # Add the similarity scores as a new column to the dataframe
    df['similarity_scores'] = similarity_scores
    datasets[name] = df
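
The pairwise loop above calls `cosine_similarity` once per sentence pair. The same maximum-over-highlights score can be computed with one matrix product per article. The sketch below uses plain NumPy. The function name `max_cosine_per_sentence` is hypothetical, and treating zero vectors as having similarity 0 is an assumption chosen to match the fallback embedding used for sentences with no known words.

```python
import numpy as np


def max_cosine_per_sentence(article_embeddings, highlight_embeddings):
    """For each article sentence, return its highest cosine similarity
    to any highlight sentence (0.0 if there are no highlight sentences)."""
    if len(article_embeddings) == 0:
        return []
    if len(highlight_embeddings) == 0:
        return [0.0] * len(article_embeddings)

    A = np.asarray(article_embeddings, dtype=float)    # shape (n_art, dim)
    H = np.asarray(highlight_embeddings, dtype=float)  # shape (n_high, dim)

    # Row norms; a zero norm is replaced by 1 so zero vectors stay zero
    # after division and end up with similarity 0 instead of NaN.
    a_norm = np.linalg.norm(A, axis=1, keepdims=True)
    h_norm = np.linalg.norm(H, axis=1, keepdims=True)
    a_norm[a_norm == 0] = 1.0
    h_norm[h_norm == 0] = 1.0

    # All pairwise cosine similarities at once: shape (n_art, n_high).
    sims = (A / a_norm) @ (H / h_norm).T
    # Aggregate by taking the maximum over highlight sentences.
    return sims.max(axis=1).tolist()


article = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
highlights = [[1.0, 0.0], [1.0, 1.0]]
print(max_cosine_per_sentence(article, highlights))
```

For articles with only a few dozen sentences the speed difference is small, so this is mainly a readability and scaling choice.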


In [38]:
# Check columns + results. All ok.
for name, df in datasets.items(): print(f"Columns in {name} dataset:", df.columns)
datasets['train']['similarity_scores'].head()
#datasets['validation']['clean_sents_art'].head()
#datasets['test'].head(2)

Columns in train dataset: Index(['article', 'highlights', 'id', 'internet-free_art',
       'internet-free_high', 'boiler-free_art', 'boiler-free_high',
       'clean_sents_art', 'clean_sents_high', 'art_sent_embeddings',
       'high_sent_embeddings', 'similarity_scores'],
      dtype='object')
Columns in validation dataset: Index(['article', 'highlights', 'id', 'internet-free_art',
       'internet-free_high', 'boiler-free_art', 'boiler-free_high',
       'clean_sents_art', 'clean_sents_high', 'art_sent_embeddings',
       'high_sent_embeddings', 'similarity_scores'],
      dtype='object')
Columns in test dataset: Index(['article', 'highlights', 'id', 'internet-free_art',
       'internet-free_high', 'boiler-free_art', 'boiler-free_high',
       'clean_sents_art', 'clean_sents_high', 'art_sent_embeddings',
       'high_sent_embeddings', 'similarity_scores'],
      dtype='object')


0                                                                                                                                                      [0.78614044, 0.6281063, 0.7003851, 0.2905049, 0.8066679, 0.693566, 0.77165043, 0.7447659, 0.4227616, 0.6808056, 0.76678455, 0.64269066, 0.62133586, 0.6361592, 0.65129006]
1    [0.7712006, 0.77382934, 0.63337034, 0.6840277, 0.8246098, 0.7272917, 0.5598484, 0.64864475, 0.67015505, 0.616089, 0.77523786, 0.70468974, 0.66512036, 0.6574998, 0.72593665, 0.6758898, 0.6753683, 0.7291945, 0.5874759, 0.71150357, 0.7354912, 0.66362464, 0.6546197, 0.7050079, 0.68018496, 0.5729586, 0.69163215, 0.6215]
2                                                      [0.8799081, 0.66603494, 0.46760952, 0.5707332, 0.26893198, 0.5221286, 0.5097554, 0.7589295, 0.77468795, 0.54191643, 0.36442742, 0.51211226, 0.83913976, 0.73731077, 0.30618584, 0.7807745, 0.7302368, 0.60418886, 0.49855953, 0.4104001, 0.45465663, 0.6008837, 0.4152067]

In [46]:
datasets['train'].head(1)

Unnamed: 0,article,highlights,id,internet-free_art,internet-free_high,boiler-free_art,boiler-free_high,clean_sents_art,clean_sents_high,art_sent_embeddings,high_sent_embeddings,similarity_scores
0,"Since giving birth to her daughter Arabella four months ago, Rebecca Ferguson has become the victim of cruel taunts about her size 12 post-pregnancy body. But the singer has now hit back at her critics and vowed to resist the pressure to slim down. Speaking to Fabulous magazine, the 28-year-old former X Factor star confessed that recent comments aimed at her weight, and questions about when she'll 'drop the post-baby pounds', have left her more determined than ever to stand up for herself. Scroll down for video . Rebecca Ferguson gave birth to gorgeous baby Arabella just four months ago, but has faced constant questions, even from those close to her, about how and when she'll lose her 'baby weight' The singer says even though she's only a size 12 following her third pregnancy, there has been a lot of pressure on her as a celebrity mum to 'snap back into shape' within a few weeks (Left: Rebecca during her third trimester and right: Rebecca in a bikini last month . She said 'There's this culture where we celebrate people snapping back into shape a week after the birth, but I don't want to be one of those people - I just want to enjoy Arabella.' Rebecca, who has never named the father of Arabella publicly, also has two other children called Lillie, aged 10, and Karl, nine, with her former boyfriend and teenage sweetheart Karl Dures, 29. And the singer now admits that she regrets not having had this empowering new mindset after she had her other kids. Describing how she felt as a new mum in previous years, the Get Happy singer said, 'When I look back, I think I should have just enjoyed my babies and not worried about my weight or being skinny. 'I'm curvy, I've got thighs, but I'm not big. I eat healthily and I'm breastfeeding, so I know the weight will come off naturally.' 
Rebecca pictured with her two older children Lillie 10, (left) and Karl, nine, (right) Currently promoting her new album Lady Sings The Blues, it would be easy for Rebecca to feel the need to slim down for upcoming photo shoots and interviews. But while some celebrities are famed for getting their impossibly taut figures back just weeks after giving birth, Rebecca says she's not feeling that pressure this time around. Instead, she's standing up for new mums everywhere, by taking aim at the critics who try to publicly shame women into losing weight. The former X Factor contestant has refused to name and shame her weight critics, but revealed some have been a lot closer to home than she ever imagined (Pictured left: In concert at the St James Theatre, London, last month, and right: on This Morning last week) She said: 'Women's bodies are amazing, what our bodies can do is incredible so it's sad that we get distracted - all this stuff about being skinny, be this, be that, they're all distractions. 'They rob us of what we should be focusing on and that's sad.' Though the singer has refused to name and shame her own critics, she has revealed that they're a lot closer home than she ever imagined. 'I just want to enjoy Arabella,' said Rebecca, who previously worried about dropping post-baby pounds as quickly as possible (Rebecca pictured with her newborn) Rebecca also spoke of her toughest year yet during the interview, when her partner walked out after discovering she was pregnant with Arabella. She revealed that although it has been hard to live life as a single mother again, she is now moving forward and enjoying her life as a new mum. The star hasn't been left completely on her own though - her ex-partner Karl pops by regularly to help look after the children. And while Rebecca admits that many of their close friends and family would love to see them get back together, and she is open to the idea, she is enjoying being on her own for the time being. 
Rebecca flaunted her new curves on the red carpet at the Brit Awards last month and says she's proud of her womanly shape .",Critics have made cruel taunts about Rebecca Ferguson's post-baby body .\nBut singer is refusing to bow to pressure to 'snap back into shape'\nShe wants to enjoy being a new mother to four-month-old Arabella instead .,911d354c3ae2eefdd0b3e2e4de98f35205da1fe4,"Since giving birth to her daughter Arabella four months ago, Rebecca Ferguson has become the victim of cruel taunts about her size 12 post-pregnancy body. But the singer has now hit back at her critics and vowed to resist the pressure to slim down. Speaking to Fabulous magazine, the 28-year-old former X Factor star confessed that recent comments aimed at her weight, and questions about when she'll 'drop the post-baby pounds', have left her more determined than ever to stand up for herself. Scroll down for video . Rebecca Ferguson gave birth to gorgeous baby Arabella just four months ago, but has faced constant questions, even from those close to her, about how and when she'll lose her 'baby weight' The singer says even though she's only a size 12 following her third pregnancy, there has been a lot of pressure on her as a celebrity mum to 'snap back into shape' within a few weeks (Left: Rebecca during her third trimester and right: Rebecca in a bikini last month . She said 'There's this culture where we celebrate people snapping back into shape a week after the birth, but I don't want to be one of those people - I just want to enjoy Arabella.' Rebecca, who has never named the father of Arabella publicly, also has two other children called Lillie, aged 10, and Karl, nine, with her former boyfriend and teenage sweetheart Karl Dures, 29. And the singer now admits that she regrets not having had this empowering new mindset after she had her other kids. 
Describing how she felt as a new mum in previous years, the Get Happy singer said, 'When I look back, I think I should have just enjoyed my babies and not worried about my weight or being skinny. 'I'm curvy, I've got thighs, but I'm not big. I eat healthily and I'm breastfeeding, so I know the weight will come off naturally.' Rebecca pictured with her two older children Lillie 10, (left) and Karl, nine, (right) Currently promoting her new album Lady Sings The Blues, it would be easy for Rebecca to feel the need to slim down for upcoming photo shoots and interviews. But while some celebrities are famed for getting their impossibly taut figures back just weeks after giving birth, Rebecca says she's not feeling that pressure this time around. Instead, she's standing up for new mums everywhere, by taking aim at the critics who try to publicly shame women into losing weight. The former X Factor contestant has refused to name and shame her weight critics, but revealed some have been a lot closer to home than she ever imagined (Pictured left: In concert at the St James Theatre, London, last month, and right: on This Morning last week) She said: 'Women's bodies are amazing, what our bodies can do is incredible so it's sad that we get distracted - all this stuff about being skinny, be this, be that, they're all distractions. 'They rob us of what we should be focusing on and that's sad.' Though the singer has refused to name and shame her own critics, she has revealed that they're a lot closer home than she ever imagined. 'I just want to enjoy Arabella,' said Rebecca, who previously worried about dropping post-baby pounds as quickly as possible (Rebecca pictured with her newborn) Rebecca also spoke of her toughest year yet during the interview, when her partner walked out after discovering she was pregnant with Arabella. She revealed that although it has been hard to live life as a single mother again, she is now moving forward and enjoying her life as a new mum. 
The star hasn't been left completely on her own though - her ex-partner Karl pops by regularly to help look after the children. And while Rebecca admits that many of their close friends and family would love to see them get back together, and she is open to the idea, she is enjoying being on her own for the time being. Rebecca flaunted her new curves on the red carpet at the Brit Awards last month and says she's proud of her womanly shape .",Critics have made cruel taunts about Rebecca Ferguson's post-baby body .\nBut singer is refusing to bow to pressure to 'snap back into shape'\nShe wants to enjoy being a new mother to four-month-old Arabella instead .,"Since giving birth to her daughter Arabella four months ago, Rebecca Ferguson has become the victim of cruel taunts about her size 12 post-pregnancy body. But the singer has now hit back at her critics and vowed to resist the pressure to slim down. Speaking to Fabulous magazine, the 28-year-old former X Factor star confessed that recent comments aimed at her weight, and questions about when she'll 'drop the post-baby pounds', have left her more determined than ever to stand up for herself. Scroll down for video. Rebecca Ferguson gave birth to gorgeous baby Arabella just four months ago, but has faced constant questions, even from those close to her, about how and when she'll lose her 'baby weight' The singer says even though she's only a size 12 following her third pregnancy, there has been a lot of pressure on her as a celebrity mum to 'snap back into shape' within a few weeks and Karl, nine, Currently promoting her new album Lady Sings The Blues, it would be easy for Rebecca to feel the need to slim down for upcoming photo shoots and interviews. But while some celebrities are famed for getting their impossibly taut figures back just weeks after giving birth, Rebecca says she's not feeling that pressure this time around. 
Instead, she's standing up for new mums everywhere, by taking aim at the critics who try to publicly shame women into losing weight. The former X Factor contestant has refused to name and shame her weight critics, but revealed some have been a lot closer to home than she ever imagined She said: 'Women's bodies are amazing, what our bodies can do is incredible so it's sad that we get distracted - all this stuff about being skinny, be this, be that, they're all distractions. 'They rob us of what we should be focusing on and that's sad.' Though the singer has refused to name and shame her own critics, she has revealed that they're a lot closer home than she ever imagined. 'I just want to enjoy Arabella,' said Rebecca, who previously worried about dropping post-baby pounds as quickly as possible Rebecca also spoke of her toughest year yet during the interview, when her partner walked out after discovering she was pregnant with Arabella. She revealed that although it has been hard to live life as a single mother again, she is now moving forward and enjoying her life as a new mum. The star hasn't been left completely on her own though - her ex-partner Karl pops by regularly to help look after the children. And while Rebecca admits that many of their close friends and family would love to see them get back together, and she is open to the idea, she is enjoying being on her own for the time being. Rebecca flaunted her new curves on the red carpet at the Brit Awards last month and says she's proud of her womanly shape.",Critics have made cruel taunts about Rebecca Ferguson's post-baby body. 
\nBut singer is refusing to bow to pressure to 'snap back into shape'\nShe wants to enjoy being a new mother to four-month-old Arabella instead.,"[['give', 'birth', 'daughter', 'arabella', 'month', 'ago', 'rebecca', 'ferguson', 'victim', 'cruel', 'taunt', 'size', '12', 'post', 'pregnancy', 'body'], ['singer', 'hit', 'critic', 'vow', 'resist', 'pressure', 'slim'], ['speak', 'fabulousmagazine', '28', 'year', 'old', 'x', 'factor', 'star', 'confess', 'recent', 'comment', 'aim', 'weight', 'question', 'drop', 'post', 'baby', 'pound', 'leave', 'determined', 'stand'], ['scroll', 'video'], ['rebecca', 'ferguson', 'give', 'birth', 'gorgeous', 'baby', 'arabella', 'month', 'ago', 'face', 'constant', 'question', 'close', 'lose', 'baby', 'weight', 'singer', 'say', 'size', '12', 'follow', 'pregnancy', 'lot', 'pressure', 'celebrity', 'mum', 'snap', 'shape', 'week', 'karl', 'currently', 'promote', 'new', 'album', 'lady', 'sing', 'blue', 'easy', 'rebecca', 'feel', 'need', 'slim', 'upcoming', 'photo', 'shoot', 'interview'], ['celebrity', 'fame', 'get', 'impossibly', 'taut', 'figure', 'week', 'give', 'birth', 'rebecca', 'say', 'feel', 'pressure', 'time'], ['instead', 'stand', 'new', 'mum', 'take', 'aim', 'critic', 'try', 'publicly', 'shame', 'woman', 'lose', 'weight'], ['x', 'factor', 'contestant', 'refuse', 'shame', 'weight', 'critic', 'reveal', 'lot', 'close', 'home', 'imagine', 'say', 'woman', 'body', 'amazing', 'body', 'incredible', 'sad', 'distract', 'stuff', 'skinny', 'distraction'], ['rob', 'focus', 'sad'], ['singer', 'refuse', 'shame', 'critic', 'reveal', 'lot', 'close', 'home', 'imagine'], ['want', 'enjoy', 'arabella', 'say', 'rebecca', 'previously', 'worry', 'drop', 'post', 'baby', 'pound', 'quickly', 'possible', 'rebecca', 'speak', 'tough', 'year', 'interview', 'partner', 'walk', 'discover', 'pregnant', 'arabella'], ['reveal', 'hard', 'live', 'life', 'single', 'mother', 'move', 'forward', 'enjoy', 'life', 'new', 'mum'], ['star', 'leave', 'completely', 'ex', 'partner', 
'karl', 'pop', 'regularly', 'help', 'look', 'child'], ['rebecca', 'admit', 'close', 'friend', 'family', 'love', 'open', 'idea', 'enjoy', 'time'], ['rebecca', 'flaunt', 'new', 'curve', 'red', 'carpet', 'brit', 'award', 'month', 'say', 'proud', 'womanly', 'shape']]","[['critic', 'cruel', 'taunt', 'rebecca', 'ferguson', 'post', 'baby', 'body', 'singeris', 'refuse', 'bow', 'pressure', 'snap', 'shape', 'want', 'enjoy', 'new', 'mother', 'month', 'old', 'arabella', 'instead']]","[[0.017922537, 0.088857375, 0.009217398, 0.043491907, -0.0103263855, 0.007149833, 0.1287929, -0.13412039, 0.093296595, 0.048821587, 0.053374153, -0.19524275, -0.06142538, 0.09701974, -0.044695172, 0.03277588, 0.015359743, 0.09031459, -0.037815638, -0.06769562, 0.01461356, -0.0053754533, 0.057865687, -0.0177645, -0.0051879883, -0.108673096, -0.10060338, 0.04698835, 0.09553746, -0.045388084, -0.027062552, -0.029054914, -0.06769017, -0.045052666, -0.08132499, -0.007206508, 0.05002485, -0.03879002, -0.067866735, 0.11602347, 0.057932172, -0.030918667, 0.14704241, -0.020908901, 0.05255127, -0.04935891, 0.039123535, 0.029562814, -0.043683734, 0.10838536, -0.09870039, -0.065264024, -0.009399414, 0.036213465, 0.026615689, -0.049290247, -0.10481916, -0.038404193, 0.021859305, -0.049822126, 0.09209333, 0.0337786, -0.052356448, -0.056653704, 0.06489781, -0.058761597, 0.012028285, 0.039515905, -0.011230196, 0.18177141, 0.062356133, 0.07560321, -0.020528521, 0.0377982, -0.1809627, 0.063823156, 0.07800075, 0.01498849, 0.13281904, 0.18207224, 0.03679984, -0.037776403, -0.04738944, 0.026689801, -0.057146344, -0.023265293, 0.011265346, 0.07679967, 0.020529611, -0.00084795273, 0.011662074, -0.06236049, -0.120240346, -0.05671256, -0.05155727, 0.013584682, 0.07079424, 0.0134746, 0.09484863, 0.009783064, ...], [0.12604631, 0.10899135, -0.015572684, -0.025390625, -0.09277344, 0.112801686, -0.026506696, -0.0048828125, 0.13580322, 0.0695452, -0.066118516, -0.14034598, -0.010672433, 0.051932197, -0.12192208, 0.048649378, 
0.10016741, 0.031284876, -0.033602647, -0.15897043, 0.08485631, 0.094866075, 0.06694685, -0.06007603, 0.10124861, -0.067352295, -0.061880928, -0.08604213, 0.030064175, -0.04750279, 0.013636998, 0.03651646, -0.07819475, -0.00920323, 0.016113281, -0.008231027, 0.10311454, 0.11976842, 0.021030972, 0.07054901, 0.14072964, -0.15443638, 0.12622942, -0.12969099, 0.012442453, -0.0076904297, -0.099583216, 0.05257743, 0.037152972, 0.032070976, 0.0033830914, 0.03557696, -0.034240723, 0.07838658, 0.06047712, -0.011021205, -0.10519409, -0.034946986, -0.044241767, -0.072509766, -0.014439174, 0.04155622, -0.14707729, -0.015032087, 0.052350726, 0.04248047, 0.080740795, 0.17687075, 0.006238665, 0.18088423, 0.028381348, 0.003489903, 0.03679548, 0.17689732, -0.2947998, -0.024239676, -0.0075334823, 0.1496582, 0.06584822, 0.1267613, 0.052716937, -0.039864678, 0.07347761, -0.0579834, -0.041992188, 0.032053266, -0.14526367, 0.1992885, -0.031441826, -0.029492702, -0.009848459, -0.103881836, -0.11429269, 0.053292412, -0.08549281, 0.057774134, 0.04875837, 0.009102957, 0.02811105, -0.0828683, ...], [0.01283746, 0.06327017, 0.019660547, 0.0052819503, -0.0027104428, 0.053723786, 0.10111277, -0.07012297, 0.16889392, 0.07416133, -0.040231, -0.061618604, 0.00868466, 0.03543894, -0.06091148, 0.06874486, 0.02018015, 0.025310315, -0.0764642, -0.05495734, -0.020405017, 0.039597362, 0.021508468, 0.06854087, 0.013101678, -0.07316509, -0.08263518, 0.07897628, 0.050331518, -0.064571984, -0.014568128, -0.030495092, -0.035694726, -0.024513647, -0.045719348, 0.03499563, 0.033195093, 0.023415014, -0.003479004, 0.07046228, 0.05867085, -0.1002386, 0.12718281, -0.03658897, 0.019124985, 0.00020719829, -0.03845536, 0.03536104, -0.0070114136, 0.030012432, -0.011822751, -0.032123767, -0.065294765, -0.036878083, 0.08517496, -0.014870091, -0.06482576, -0.08178711, 0.004227488, -0.05713051, -0.037687603, 0.02157914, -0.078744985, 0.0059075607, 0.077019945, -0.04948345, -0.050315455, 0.110443115, -0.021522924, 
0.10442714, 0.0033119603, 0.03537389, 0.030035721, 0.07304302, -0.11598607, -0.02044999, 0.041548878, 0.06748159, 0.01731471, 0.061867163, 0.019014057, -0.07008523, 0.020962464, 0.010606465, 0.002833316, -0.08429678, -0.040885523, 0.096384145, 0.008101614, 0.013254266, 0.113181666, -0.04513791, -0.090958446, -0.040867053, 0.018708881, -0.054183155, 0.06913677, 0.056399696, 0.02459235, -0.039293792, ...], [0.047180176, 0.07145691, -0.110839844, 0.076416016, -0.06970215, 0.05517578, 0.05029297, -0.06384277, 0.1928711, -0.04673767, -0.1640625, -0.13337708, -0.055664062, 0.21533203, -0.1048584, -0.1159668, 0.09899902, -0.15234375, -0.080322266, -0.328125, 0.2319336, 0.22290039, -0.16479492, 0.055664062, 0.0087890625, -0.063201904, -0.045776367, 0.06604004, 0.19628906, -0.12817383, -0.18017578, 0.08999634, -0.19677734, -0.018554688, 0.068115234, -0.15722656, 0.06335449, 0.09716797, 0.0068359375, -0.028427124, -0.13952637, 0.060913086, 0.11987305, 0.09350586, 0.10449219, -0.03413391, -0.010253906, -0.0064697266, 0.049316406, -0.30993652, -0.20495605, -0.0887146, 0.12402344, -0.23413086, 0.026245117, 0.026245117, -0.0703125, 0.0078125, -0.06762695, 0.23486328, -0.018920898, -0.019042969, -0.15698242, 0.15917969, -0.0078125, 0.107177734, -0.10083008, -0.06613159, 0.23535156, 0.13879395, -0.22314453, 0.11584473, -0.09448242, -0.29492188, 0.09472656, -0.21807861, 0.083984375, 0.0262146, 0.14257812, -0.14624023, 0.18554688, -0.023345947, 0.09350586, 0.018554688, 0.19873047, -0.036376953, -0.05621338, 0.13095093, -0.022705078, 0.057617188, -0.024536133, 0.10864258, -0.0463562, 0.022460938, -0.0864563, 0.037109375, 0.15771484, 0.17626953, -0.030761719, 0.16503906, ...], [0.039684836, 0.06729256, -0.053898204, 0.03873149, -0.044664904, 0.0201957, 0.058009755, -0.118166834, 0.06679611, 0.05966412, -0.015930522, -0.09586074, -0.016464245, 0.061542165, -0.11852646, 0.071401425, 0.03186889, 0.10738408, -0.08969844, -0.09070483, -0.002985174, 0.012366902, 0.04387266, -0.022779638, 
-0.014861367, -0.005125566, -0.098241456, 0.032558344, 0.052254245, -0.08925412, -0.01891743, 0.060507342, -0.040277656, -0.026880438, -0.004809293, -0.039919075, 0.064325504, -0.015555642, -0.01551264, 0.06744801, 0.05373244, -0.10843051, 0.11111104, -0.009618586, 0.027786948, -0.0602469, -0.025629563, -0.0011090365, 0.0008003928, 0.07506284, -0.041660048, 0.021092849, -0.008661443, -0.0057359175, 0.04027696, -0.038531218, -0.043789648, -0.07698059, 0.03571597, -0.06547963, 0.016164953, -0.0029331555, -0.08783323, -0.020041727, 0.10111694, -0.02119862, -0.041695334, 0.06841209, 0.020557404, 0.13460818, 0.038749695, 0.060198005, 0.035307106, 0.09046381, -0.17819838, 0.031921387, 0.052096974, 0.038417555, 0.039336465, 0.14980525, -0.0066607217, -0.067871094, 0.022577459, -0.04270276, -0.052244708, -0.03395826, -0.06317485, 0.120586045, 0.038864136, 0.030246908, 0.025955893, -0.025715915, -0.13281423, -0.06238313, -0.069342874, -0.041044757, 0.091970615, 0.03652018, 0.06681425, -0.040570345, ...], [0.04740688, 0.047938757, -0.088304795, 0.070992604, -0.023553576, -0.025473459, 0.10048131, -0.119803295, 0.0834525, 0.041931152, 0.009892055, -0.09058925, 0.003805978, 0.08617292, -0.16967773, 0.120229445, 0.0765998, 0.15633175, 0.0063665933, -0.07183838, -0.08426339, -0.028320312, 0.09075056, -0.016207013, 0.02154105, -0.059539795, -0.062859125, 0.038992744, -0.0048086983, -0.031232562, 0.0026899066, -0.009136745, 0.0054626465, -0.00944301, 0.011731829, -0.08625139, 0.06136213, -0.012254988, 0.008553641, 0.09075928, 0.08571625, -0.119511195, 0.09513637, -0.041416712, 0.008492606, -0.036859784, -0.02190726, 0.08306013, 0.07925851, 0.085091725, -0.050861903, 0.035487585, -0.036769323, -0.024008615, 0.0038844517, -0.021580288, -0.0425742, -0.036220007, -0.02002825, -0.019989014, 0.053994317, 0.0039994377, -0.11669922, -0.12275914, 0.105163574, -0.010772705, -0.04485539, 0.09012277, -0.051130023, 0.0890067, 0.07465254, 0.08131627, 0.05204991, 0.0136762345, -0.14289856, 
-0.02368164, 0.085383825, 0.11342948, 0.11690848, 0.15574864, 0.053222656, 0.005244664, 0.04586356, 0.0013645717, -0.073719844, -0.02300099, -0.105577745, 0.05442592, 0.018506732, -0.030410767, 0.026042392, 0.054495674, -0.13698141, -0.07938058, -0.059273858, -0.030840192, 0.056607384, 0.07158988, 0.038868494, -0.06824602, ...], [0.09045997, 0.016113281, 0.071955755, 0.028141903, -0.010009766, 0.019981971, 0.08346792, -0.042151816, 0.10190054, 0.09190075, 0.0031973033, -0.09059965, 0.037222054, 0.045865133, -0.082834095, 0.1237793, -0.001558744, 0.048919678, 0.010663216, -0.11943172, 0.033221904, -0.007827759, 0.09428993, 0.02007587, 0.091911905, -0.026716966, -0.053131104, 0.060828574, 0.0606173, -0.086876504, -0.0072209286, 0.06309627, -0.095616855, -0.003483699, 0.063115045, -0.021756686, 0.07329853, 0.01954064, 0.017160269, 0.026103092, 0.12705642, -0.14099121, 0.13087815, -0.060462363, -0.026905904, -0.024914082, -0.04667781, 0.0019249549, -0.018998366, 0.055885904, -0.018700233, 0.07212654, -0.012897198, -0.012219942, 0.061251126, -0.06727013, -0.127925, -0.09820909, 0.067157455, -0.03438627, 0.016789364, 0.04705341, -0.08908316, -0.055354193, 0.04595008, -0.060114935, -0.046062764, 0.18307026, -0.0056668795, 0.09496601, -0.0028839111, 0.13388883, 0.056168776, 0.07558031, -0.16823167, -0.031869743, 0.06669264, 0.09782057, 0.026108962, 0.12536621, 0.034179688, -0.06834852, 0.03596614, -0.032677285, -0.03563514, -0.006793829, -0.011437049, 0.11642691, 0.009408804, 0.04604868, 0.049391527, -0.05455369, -0.09494253, -0.076532215, -0.007164588, -0.06759409, 0.03566331, 0.031414326, 0.03717745, -0.04143407, ...], [0.07826896, 0.0013865596, 0.018702632, 0.01154543, -0.015993865, 0.0970668, 0.12319814, -0.049406633, 0.13259777, 0.094955444, -0.009715205, -0.17824389, 0.0016267196, -0.019610861, -0.029119408, 0.15554942, 0.06118907, 0.09323319, -0.042174153, -0.107989766, 0.009144658, -0.01407955, 0.035540704, -0.020030146, 0.07609094, -0.0729861, -0.06319129, 
0.027694702, 0.063161105, -0.17606519, -0.015728494, 0.06531027, 0.06369948, 0.0029588782, 0.012151304, -0.070657276, 0.068163, 0.04155698, 0.009831968, 0.15785317, 0.07771633, -0.08847162, 0.13024372, -0.0047386833, -0.042541068, -0.053859543, -0.03181789, 0.040818587, 0.030732527, 0.04147737, -0.045367695, 0.028520335, -0.03852098, -0.051200535, 0.022166377, 0.009678053, -0.064267695, -0.074009106, 0.11563243, -0.044819046, 0.0062919287, 0.07590252, -0.061941728, -0.038182136, 0.06649648, -0.024968686, -0.06083878, 0.13541047, -0.070192754, 0.10221332, 0.035697605, 0.026656441, 0.06941687, 0.028605586, -0.12758471, 0.007986317, 0.049350906, 0.07279902, 0.08306088, 0.0812033, 0.037570454, -0.028766135, 0.06692505, -0.006358271, -0.10864523, -0.06261021, -0.057988707, 0.179515, 0.011007558, 0.0040283203, -0.002933668, -0.03674051, -0.11344843, 0.008574112, -0.024079695, -0.04095459, 0.07591181, 0.107262656, 0.058746338, -0.05325052, ...], [0.08178711, 0.15266927, -0.040690105, 0.044759113, -0.044921875, 0.01953125, 0.22786458, -0.0797526, 0.12095133, 0.1393636, -0.06656901, -0.028808594, -0.018554688, 0.17594402, 0.0113932295, 0.014851888, 0.032613117, 0.16430664, 0.09033203, -0.0978597, 0.09838867, -0.023925781, 0.1616211, 0.12591553, 0.17008464, 0.026692709, 0.13444011, -0.102742516, 0.13020833, -0.06656901, 0.08951823, 0.032796223, 0.12666829, -0.08062744, -0.12760417, -0.038085938, 0.0012613932, 0.06852213, 0.059224445, 0.16853905, 0.22949219, 0.19303386, 0.19742839, -0.048339844, 0.042663574, -0.05810547, -0.079589844, -0.023763021, -0.031656902, 0.14892578, -0.15657552, 0.1149292, 0.22701009, -0.11791992, 0.041503906, -0.10858154, -0.19669597, -0.06852213, 0.14497884, -0.24804688, -0.038085938, 0.10107422, -0.009033203, 0.14794922, 0.11897787, -0.08773804, -0.071777344, -0.019856771, 0.0035807292, 0.051920574, -0.12455241, 0.072021484, 0.071777344, 0.02335612, -0.18003337, -0.01369222, -0.075520836, 0.07910156, 0.095947266, 0.03889974, 0.05928548, 
-0.01928711, 0.10970052, -0.019266764, -0.016642252, -0.1015625, -0.15462239, 0.11368815, 0.008707683, -0.029622396, 0.097086586, -0.05891927, 0.07259115, 0.029052734, -0.096635185, -0.09342448, 0.07324219, 0.06640625, -0.15437825, -0.0863444, ...], [0.09579468, -0.01860894, 0.011990017, 0.03229904, -0.039571125, 0.08028158, 0.05306668, -0.0071614585, 0.17396376, 0.028157553, -0.05115424, -0.18188477, 0.0050794813, -0.016587999, -0.109544545, 0.1989068, 0.08640543, 0.07957628, -0.11944289, -0.066718206, 0.050482854, -0.029880099, 0.094319664, -0.07206896, 0.1311103, 0.019039579, -0.023176406, -0.010320027, 0.00943417, -0.14645046, -0.050428603, 0.11859809, -0.07663303, 0.07028537, 0.024271647, -0.059203573, 0.11771647, 0.039360896, -0.02073839, 0.1215015, 0.082845055, -0.14624701, 0.15991211, -0.06382921, 0.007075045, 0.019436307, -0.078708224, 0.0155164935, -0.018412272, 0.049682617, 0.03757053, 0.074924044, -0.08829752, -0.079440646, -0.033555772, -0.022379557, -0.12402429, -0.047661677, 0.121921115, -0.04695638, -0.02589247, 0.06317817, -0.060512967, -0.055031672, 0.014241536, -0.032416448, -0.021152072, 0.15987141, -0.032992892, 0.05933465, 0.046373155, 0.036975436, 0.080729164, 0.058295354, -0.18478733, -0.017306857, 0.02324083, 0.11198934, 0.039808486, 0.09844293, -0.018793741, -0.023254395, 0.06245592, 0.048733182, -0.09187826, -0.0158524, -0.14428711, 0.15179443, -0.0131293405, -0.04233127, 0.022162544, -0.061340332, -0.10538737, -0.0059407554, -0.06561279, -0.024142794, 0.09616428, 0.097357854, -0.02045695, -0.09372287, ...], [-0.0075511024, 0.024334863, -0.052262805, 0.08442761, -0.045312792, 0.048129126, 0.08350772, -0.04506429, 0.07057408, 0.03409831, -0.01016962, -0.117699035, -0.040677026, 0.03757404, -0.08386521, 0.06976318, 0.033894856, 0.09538923, -0.026614234, -0.07859512, -0.01739502, 0.031728107, 0.039815266, -0.0038735527, 0.0601516, -0.037990026, -0.04258712, 0.054510206, 0.024026053, -0.085812524, -0.009037291, 0.034935363, -0.03813244, 
-0.121515185, -0.040536065, -0.011662074, 0.04069737, -0.05179269, -0.001520066, 0.06558082, 0.05046881, -0.084631056, 0.11808413, -0.007552374, -0.07272484, -0.009862991, -0.048429944, 0.009989421, -0.033275787, 0.040849958, -0.023206076, 0.022106353, -0.01855832, 0.030201867, 0.034133546, -0.040329706, -0.05423991, -0.078444704, 0.06447347, -0.041253954, -0.009620303, 0.025881812, -0.10423352, 0.013684954, 0.03753081, -0.039361864, -0.058962867, 0.032074723, -0.00057547435, 0.12960379, 0.04001581, 0.08759998, 0.038388208, 0.02073742, -0.15569742, 0.004355294, 0.032017298, 0.10033816, 0.046314057, 0.12367467, -0.03599185, -0.0333826, 0.015816825, 0.032141548, 0.010637556, -0.07155646, -0.07508341, 0.110151015, 0.0049604005, 0.009812128, 0.07406925, 0.008585612, -0.12747046, -0.04280985, -0.04395694, -0.03660075, 0.06542969, 0.06748163, 0.022600446, -0.031382244, ...], [0.020677567, -0.029856363, -0.02535502, 0.090037026, 0.0060424805, 0.018941244, 0.097218834, -0.080434166, 0.088391624, 0.1182251, 0.03712972, -0.12063599, -0.096847534, 0.008433024, -0.073404945, 0.1047465, 0.07537333, 0.0999349, -0.00982666, -0.051503498, -0.036315918, 0.015638987, 0.0018107096, 0.065989174, 0.04404704, 0.009857178, -0.0155181885, 0.04534912, 0.0814209, -0.061971027, -0.0071614585, 0.048634846, -0.05666097, 0.03401947, -0.042007446, -0.02523295, 0.06814575, -0.061286926, -0.015920004, 0.03410848, 0.0830485, 0.0003000895, 0.019074759, 0.0015920004, 0.051645916, -0.07062022, -0.012430827, 0.051310223, 0.019154867, -0.025380453, 0.060727436, 0.06182353, 0.08443197, -0.006174723, 0.011647542, -0.04906082, -0.11556753, -0.060953777, 0.010172526, -0.009145101, 0.016581217, 0.025924683, -0.07267252, -0.013671875, 0.0574646, 0.02983602, -0.013900757, 0.13450754, 0.0058135986, 0.13680649, 0.04239909, 0.0011774699, 0.054822285, 0.041422527, -0.110331215, -0.005533854, 0.052856445, 0.02497355, 0.04530398, 0.122820534, -0.0048535666, 0.007575989, 0.034169514, -0.035339355, 0.028127035, 
-0.002339681, -0.19466145, 0.12820435, 0.019041697, 0.06627401, 0.044450123, -0.04972331, -0.057834942, -0.10823226, -0.0798645, -0.004725138, 0.11074829, 0.0068206787, 0.05978775, -0.05967204, ...], [0.032798074, 0.041048918, 0.0005423806, 0.09712913, -0.0810963, 0.12722224, 0.040518675, -0.11591833, 0.11996737, -0.055464312, 0.036421344, -0.081642844, -0.006658381, 0.016546076, -0.12799627, 0.08937419, -0.018579656, 0.09167203, -0.0065141157, -0.018615723, 0.056274414, 0.047840465, 0.08589311, 0.028597744, 0.04330999, -0.042491566, -0.05255127, 0.1295166, 0.017872203, -0.010453657, -0.031150125, 0.0044500176, -0.14505282, 0.0056707207, -0.032542836, -0.018038662, 0.13063742, 0.0038174717, -0.024869053, 0.060990766, 0.050197255, -0.14151278, 0.057190634, -0.028586648, 0.051411714, -0.042477693, -0.020341353, -0.052875865, 0.06776567, 0.12706132, -0.024114436, 0.046142578, -0.06914728, -0.007199374, -0.017214688, -0.05674813, -0.09883256, -0.0804485, -0.007795854, -0.065740414, 0.0041004526, 0.041230634, -0.073591754, -0.025235264, -0.042696867, -0.03971828, -0.013139204, 0.040760387, -0.04772533, 0.04170643, 0.062683105, 0.051723395, 0.014137962, 0.026863791, -0.20427912, 0.030059814, 0.0075794565, 0.09240723, 0.038934883, 0.010769931, 0.017772328, -0.054190896, 0.051902078, 0.0020002886, 0.018155186, -0.025088223, -0.047790527, 0.0766768, -0.003967285, -0.017345082, -5.271218e-05, -0.029589565, -0.09239613, -0.104068585, -0.011752042, -0.0026189631, 0.05372342, -0.0057955654, 0.053666547, -0.043581877, ...], [-0.020172883, -0.052545168, -0.015203858, 0.1753418, -0.06350708, 0.10249023, 0.0718811, -0.05300293, 0.078100584, 0.028527832, 0.0038085938, -0.1532898, -0.05469055, -0.06834106, -0.125, 0.10931244, 0.1324768, 0.08576965, 0.050679017, -0.08861084, -0.024438476, 0.06999512, 0.12927246, -0.019958496, 0.081121825, 0.003503418, -0.05319824, 0.14656067, 0.013580322, -0.14558105, -0.004473877, 0.033197023, -0.04560242, 0.010031128, 0.017541504, -0.012524414, 
0.12332153, -0.04960022, 0.009643555, 0.14991455, 0.09710584, -0.081545256, 0.09780884, -0.036849976, 0.027764892, -0.016403198, -0.08314209, -0.0126220705, -0.05645752, 0.05332031, 0.02434082, 0.066992186, 0.017285157, -0.014935303, 0.029125977, 0.061865233, -0.037744142, -0.0044921874, 0.044462584, -0.038513184, 0.01652832, -0.0052833557, -0.06010742, -0.0407959, 0.031018067, -0.07211914, -0.10352478, 0.004592943, -0.0018493652, 0.14857788, 0.067712404, 0.05621948, 0.097644046, 0.031939697, -0.12750244, -0.044311523, -0.02946167, -0.02823181, 0.07143555, 0.21347657, -0.046240233, -0.007269287, 0.10664062, -0.047213744, -0.07917786, -0.03376465, -0.14230958, 0.10897217, 0.023727417, 0.016308594, -0.020064544, 0.016601562, -0.05810547, -0.05336504, -0.055480957, -0.03416748, -0.014276123, 0.054971315, 0.0009094238, -0.10046387, ...], [-0.06657997, 0.037513144, -0.0017747146, 0.09429462, -0.018535908, -0.098989636, 0.076964155, -0.21717247, 0.030105077, 0.11995756, -0.013822115, -0.07259897, -0.06628418, 0.042546198, -0.11546913, 0.042499248, 0.05980301, 0.08101713, -0.07954759, -0.10684791, -0.03160682, 0.019900981, 0.014441857, -0.06299532, 0.098379284, -0.045174234, 0.02788837, 0.08114859, 0.07491361, -0.11592924, 0.008211576, 0.05650447, -0.0074110767, 0.013840895, 0.008713942, -0.06201348, 0.064265326, -0.029667782, -0.019310584, 0.001140888, 0.077847995, -0.05547509, 0.08950512, 0.0109769385, 0.00798152, -0.049081657, -0.055973932, 0.00037560097, 0.063791126, 0.08014855, -0.010272686, 0.062176045, 0.014113206, 0.011075533, 0.019592285, 0.07320463, -0.02587421, -0.12152334, 0.01713914, -0.033869818, -0.005418044, 0.005507249, -0.11326247, -0.07034302, 0.06389677, 0.025763879, -0.040517952, 0.1429995, 0.0086843055, 0.096717246, 0.08111103, 0.14075646, 0.033860426, 0.014162504, -0.10188176, -0.030254658, 0.12809753, 0.13221154, 0.03432993, 0.14425424, 0.006666917, -0.08085632, -0.049363356, 0.0051668608, -0.009380634, -0.0044215275, -0.0933744, 0.1362117, 
0.018066406, 0.033959024, 0.0063852165, 0.0056105396, -0.059941217, -0.067899264, -0.08157114, 0.00016432542, -0.011408879, 0.08373554, 0.053847093, -0.12286846, ...]]","[[0.042080306, 0.089006804, 0.0118286135, 0.026961898, -0.08128357, 0.029437255, 0.051831055, -0.10006104, 0.044599913, 0.06927643, 0.011138916, -0.16224365, -0.021102905, 0.017984008, -0.135318, 0.07156372, 0.022323608, 0.07876587, -0.0031412363, -0.10612182, 0.064978026, 0.019927215, 0.06533508, 0.018631745, 0.08149414, -0.079489134, -0.059297778, 0.06225281, 0.08000793, -0.07290325, -0.024841309, 0.05958557, -0.028121948, -0.04918213, 0.0019210816, 0.031189203, 0.071351625, 0.030630494, -0.042910766, 0.080020525, 0.09113846, -0.08638916, 0.17492676, -0.032287598, 0.03551636, -0.060545348, 0.010626221, 0.07480774, -0.03104248, 0.025986481, -0.07769165, 0.03956299, -0.013601685, 0.028857421, 0.033950806, 0.0024902343, -0.084954835, -0.062259674, 0.053242493, -0.04122162, -0.015740966, 0.054016113, -0.06611423, -0.014424896, 0.025710452, -0.023991395, -0.048968505, 0.09460223, -0.039788626, 0.16869049, 0.09414597, 0.09206619, 0.05215454, 0.010574341, -0.15458374, 0.006626892, 0.070812985, 0.04858017, 0.07299805, 0.13076782, 0.061447144, -0.027568627, 0.018660069, -0.006512451, 0.0020141602, 0.018969346, -0.046691895, 0.07552795, 0.023659516, 0.03544159, 0.015458679, -0.024658203, -0.119262695, -0.05435281, -0.07164917, 0.037423708, 0.077423096, 0.03715744, 0.05267334, -0.050857924, ...]]","[0.78614044, 0.6281063, 0.7003851, 0.2905049, 0.8066679, 0.693566, 0.77165043, 0.7447659, 0.4227616, 0.6808056, 0.76678455, 0.64269066, 0.62133586, 0.6361592, 0.65129006]"


In [47]:
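def apply_threshold(scores, threshold=0.6):
    """Label a sentence 1 if its similarity score reaches the threshold, else 0."""
    return [1 if score >= threshold else 0 for score in scores]

# Turn each article's list of similarity scores into a list of 0/1 labels.
for name, df in datasets.items():
    df['binary_labels'] = df['similarity_scores'].apply(apply_threshold)
    datasets[name] = df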


In [48]:
# Check columns + results. All ok.
for name, df in datasets.items(): print(f"Columns in {name} dataset:", df.columns)
datasets['train']['binary_labels'].head()
#datasets['validation'].head()
#datasets['test'].head(2)

Columns in train dataset: Index(['article', 'highlights', 'id', 'internet-free_art',
       'internet-free_high', 'boiler-free_art', 'boiler-free_high',
       'clean_sents_art', 'clean_sents_high', 'art_sent_embeddings',
       'high_sent_embeddings', 'similarity_scores', 'binary_labels'],
      dtype='object')
Columns in validation dataset: Index(['article', 'highlights', 'id', 'internet-free_art',
       'internet-free_high', 'boiler-free_art', 'boiler-free_high',
       'clean_sents_art', 'clean_sents_high', 'art_sent_embeddings',
       'high_sent_embeddings', 'similarity_scores', 'binary_labels'],
      dtype='object')
Columns in test dataset: Index(['article', 'highlights', 'id', 'internet-free_art',
       'internet-free_high', 'boiler-free_art', 'boiler-free_high',
       'clean_sents_art', 'clean_sents_high', 'art_sent_embeddings',
       'high_sent_embeddings', 'similarity_scores', 'binary_labels'],
      dtype='object')


0                                           [1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1]
1    [1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1]
2                   [1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0]
3                                  [1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1]
4                                  [1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1]
Name: binary_labels, dtype: object
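
With a threshold of 0.6, most sentences in the rows above are labelled 1, so it may be worth checking how balanced the labels are before training on them. A minimal sketch, assuming each entry of `binary_labels` is a list of 0/1 values as shown above; `positive_rate` is a hypothetical helper, not part of the notebook.

```python
def positive_rate(labels):
    """Fraction of sentences labelled 1 in one article (0.0 for an empty list)."""
    return sum(labels) / len(labels) if labels else 0.0

# First row of the output above: 13 of its 15 sentences pass the threshold.
example_labels = [1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1]
print(round(positive_rate(example_labels), 3))
```

On the notebook's frames this could be applied per article with `df['binary_labels'].apply(positive_rate)`.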

In [61]:
datasets['train']['clean_sents_art'].head(2)


## Surface Features: Sentence Position
"The sentences in the earlier parts of a document are more important than sentences in later parts."

-- p.3, *Extractive Summarization Using Supervised and Semi-Supervised Learning*
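
One simple way to turn this into a number is a linear decay over the sentence index, which is what `sentence_position_scores` in this notebook computes. This is the notebook's own choice and not necessarily the exact formula used in the paper. For a document with $n$ sentences indexed $i = 0, \dots, n-1$:

$$
\text{pos}(i) =
\begin{cases}
1 - \dfrac{i}{n-1}, & n > 1 \\
1, & n = 1
\end{cases}
$$

For example, with $n = 5$ the scores are $1, 0.75, 0.5, 0.25, 0$.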

In [None]:
nlp = spacy.load("en_core_web_sm")

def sentence_position_scores(lst):
    """Return one position score per sentence: 1.0 for the first, falling linearly to 0.0 for the last."""
    total_sentences = len(lst)
    # A single-sentence document gets a score of 1 (this also avoids dividing by zero).
    if total_sentences <= 1:
        return [1] * total_sentences
    return [1 - (i / (total_sentences - 1)) for i in range(total_sentences)]

# Example
example_lemmatized_sentences =  [['give', 'birth', 'daughter', 'arabella', 'month', 'ago', 'rebecca', 'ferguson', 'victim', 'cruel', 'taunt', 'size', '12', 'post', 'pregnancy', 'body'], ['singer', 'hit', 'critic', 'vow', 'resist', 'pressure', 'slim'], ['speak', 'fabulousmagazine', '28', 'year', 'old', 'x', 'factor', 'star', 'confess', 'recent', 'comment', 'aim', 'weight', 'question', 'drop', 'post', 'baby', 'pound', 'leave', 'determined', 'stand'], ['scroll', 'video'], ['rebecca', 'ferguson', 'give', 'birth', 'gorgeous', 'baby', 'arabella', 'month', 'ago', 'face', 'constant', 'question', 'close', 'lose', 'baby', 'weight', 'singer', 'say', 'size', '12', 'follow', 'pregnancy', 'lot', 'pressure', 'celebrity', 'mum', 'snap', 'shape', 'week', 'karl', 'currently', 'promote', 'new', 'album', 'lady', 'sing', 'blue', 'easy', 'rebecca', 'feel', 'need', 'slim', 'upcoming', 'photo', 'shoot', 'interview'], ['celebrity', 'fame', 'get', 'impossibly', 'taut', 'figure', 'week', 'give', 'birth', 'rebecca', 'say', 'feel', 'pressure', 'time'], ['instead', 'stand', 'new', 'mum', 'take', 'aim', 'critic', 'try', 'publicly', 'shame', 'woman', 'lose', 'weight'], ['x', 'factor', 'contestant', 'refuse', 'shame', 'weight', 'critic', 'reveal', 'lot', 'close', 'home', 'imagine', 'say', 'woman', 'body', 'amazing', 'body', 'incredible', 'sad', 'distract', 'stuff', 'skinny', 'distraction'], ['rob', 'focus', 'sad'], ['singer', 'refuse', 'shame', 'critic', 'reveal', 'lot', 'close', 'home', 'imagine'], ['want', 'enjoy', 'arabella', 'say', 'rebecca', 'previously', 'worry', 'drop', 'post', 'baby', 'pound', 'quickly', 'possible', 'rebecca', 'speak', 'tough', 'year', 'interview', 'partner', 'walk', 'discover', 'pregnant', 'arabella'], ['reveal', 'hard', 'live', 'life', 'single', 'mother', 'move', 'forward', 'enjoy', 'life', 'new', 'mum'], ['star', 'leave', 'completely', 'ex', 'partner', 'karl', 'pop', 'regularly', 'help', 'look', 'child'], ['rebecca', 'admit', 'close', 'friend', 'family', 'love', 'open', 
'idea', 'enjoy', 'time'], ['rebecca', 'flaunt', 'new', 'curve', 'red', 'carpet', 'brit', 'award', 'month', 'say', 'proud', 'womanly', 'shape']], [['ireland', 'clung', 'claim', 'controversial', 'run', 'win', 'zimbabwe', 'bid', 'reach', 'world', 'cup', 'quarter', 'final', 'track'], ['ed', 'joyce', 'day', 'international', 'century', 'help', 'ireland', 'post', '331', 'high', 'score', 'world', 'cup', 'zimbabwe', 'look', 'like', 'run', 'record', 'chase', 'debatable', 'john', 'mooney', 'catch'], ['replay', 'appear', 'mooney', 'step', 'rope', 'hold', 'remove', 'sean', 'williams', 'short', 'century', 'seemingly', 'control', 'pursuit'], ['alex', 'cusack', 'ireland', 'congratulate', 'team', 'mate', 'get', 'wicket', 'tawanda', 'mupariwa'], ['kevin', "o'brien", 'celebrate', 'sean', 'williams', 'controversially', 'catch', 'john', 'mooney'], ['ireland', 'celebrate', 'win', 'pulsate', 'world', 'cup', 'clash', 'zimbabwe', 'knock'], ['umpire', 'call', 'judge', 'catch', 'williams', 'remain', 'field', 'play', 'instead', 'opt', 'word', 'mooney', 'take', 'catch', 'inside', 'rope'], ['drama', 'follow', 'number', '10', 'tawanda', 'mupariwa', 'slap', '19', 'penultimate', 'deliver', 'kevin', "o'brien", 'leave', 'zimbabwe', 'need', 'seven', 'ball'], ['alex', 'cusack', 'hold', 'nerve', 'claim', 'final', 'wicket', 'get', 'regis', 'chakabva', 'drag', 'mupariwa', 'ski', 'catch', 'william', 'porterfield', 'gratefully', 'accept'], ['joyce', '112', 'earn', 'man', 'match', 'award', 'cusack', '32', 'invaluable', 'especially', 'remove', 'zimbabwe', 'skipper', 'brendon', 'taylor', '121'], ['ed', 'joyce', 'celebrate', 'make', 'day', 'international', 'century', 'help', 'ireland', '331', '8'], ['joyce', 'score', 'impressive', 'run', 'ireland', 'boost', 'chance', 'world', 'cup', 'qualification'], ['victory', 'ireland', 'second', 'member', 'nation', 'tournament', 'open', 'campaign', 'win', 'west', 'indie'], ['likely', 'need', 'pull', 'shock', 'final', 'pool', 'game', 'india', 'pakistan'], ['pakistan', 
'surprise', 'win', 'south', 'africa', 'early', 'day', 'ireland', 'favour', 'clash', 'adelaide', 'march', '15', 'loom', 'potential', 'decider'], ['john', 'mooney', 'celebrates', 'get', 'wicket', 'sikandar', 'raza', 'help', 'way', 'victory'], ['ireland', 'stand', 'national', 'anthem', '2015', 'icc', 'cricket', 'world', 'cup', 'zimbabwe'], ['joyce', 'fourth', 'ireland', 'player', 'score', 'world', 'cup', 'century', 'give', 'help', 'hand', 'sloppy', 'zimbabwe', 'fielding', 'display'], ['sussex', 'batsman', 'twice', 'drop', 'edge', 'ball', 'short', 'slip', 'imperious', 'hit', 'four', 'six'], ['join', 'wicket', 'partnership', '138', 'andy', 'balbirnie', 'run', 'away', 'follow', 'joyce', 'figure', 'run', 'late', 'scramble'], ['ireland', 'take', '108', 'final', '10', 'over', 'reach', 'high', 'odi', 'score', 'surpass', '329', 'seven', 'rack', 'famous', 'world', 'cup', 'win', 'england', 'bangalore', 'year', 'ago'], ['kevin', "o'brien", 'celebrate', 'take', 'controversial', 'wicket', 'zimbabwe', 'batsman', 'sean', 'williams'], ['williams', 'react', 'get', 'bellerive', 'oval', 'ground', 'despite', 'look', 'like', 'mooney', 'touch', 'foam'], ['zimbabwe', 'crash', '74', 'reply', 'ireland', 'total', 'control', 'taylor', 'williams', 'turn', 'momentum', 'match', '149', 'run', 'stand'], ['taylor', 'reach', 'century', '79', 'ball', 'ireland', 'attack', 'look', 'toothless', 'cusack', 'produce', 'slow', 'ball', 'fool', 'zimbabwe', 'skipper', 'spoon', 'catch', 'mid'], ['williams', 'keep', 'scoreboard', 'tick', 'reduce', 'task', 'manageable', '32', '20', 'ball', 'mooney', 'controversial', 'catch', 'claim'], ['critical', 'moment', 'mupariwa', 'ireland', 'sweat', 'hit', 'kevin', "o'brien", 'four', 'cusack', 'clean', 'tail', 'thrilling', 'finale'], ['irish', 'player', 'congratulate', 'final', 'victory', 'african', 'country']]
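# The literal above holds two documents (a tuple of two sentence lists), so score each one separately.
for doc_sentences in example_lemmatized_sentences:
    print(sentence_position_scores(doc_sentences))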

In [101]:
for name, df in datasets.items():
    df['sent_position'] = df['clean_sents_art'].apply(string_to_list).apply(sentence_position_scores)
    datasets[name] = df


In [65]:
#for name, df in datasets.items():
    #if 'sent_position' in df.columns: df.drop('sent_position', axis=1, inplace=True)

In [67]:
pd.set_option("display.max_columns", None) # show all cols
pd.set_option('display.max_colwidth', None) # show full width of showing cols
pd.set_option('display.max_rows', None)

## Surface Features: Sentence Length

"A sentence is important if the number of words (except stop words) in it is within a certain range."

-- p.3, *Extractive Summarization Using Supervised and Semi-Supervised Learning*

In [98]:
# Sentence length can provide insights into the complexity and information density of a sentence, which might correlate with its summary-worthiness.
# In general, both very short and very long sentences might be less likely to be included in a summary.

def sentence_length_on_clean(lemmatized_sentences):
    # Each sentence is a list of (stop-word-free) lemmas, so its length is the word count.
    return [len(sentence) for sentence in lemmatized_sentences]

# Example usage
example_lemmatized_sentences = [['give', 'birth', 'daughter', 'arabella', 'month', 'ago', 'rebecca', 'ferguson', 'victim', 'cruel', 'taunt', 'size', '12', 'post', 'pregnancy', 'body'], ['singer', 'hit', 'critic', 'vow', 'resist', 'pressure', 'slim']]
print(sentence_length_on_clean(example_lemmatized_sentences))


[16, 7]


In [99]:
for name, df in datasets.items():
    df['sent_length'] = df['clean_sents_art'].apply(string_to_list).apply(sentence_length_on_clean)
    datasets[name] = df
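
The quoted criterion is a range test, not a raw count. A minimal sketch of that binary feature, assuming the lengths are the stop-word-free counts computed above; the helper name `length_in_range` and the bounds `min_len=5` and `max_len=25` are illustrative assumptions, not values taken from the paper or from this notebook.

```python
def length_in_range(sentence_lengths, min_len=5, max_len=25):
    """Return 1 for each length inside [min_len, max_len], else 0.

    The default bounds are placeholders for illustration only.
    """
    return [1 if min_len <= n <= max_len else 0 for n in sentence_lengths]

# Lengths of the first four sentences of the example article.
example_lengths = [16, 7, 21, 2]
print(length_in_range(example_lengths))
```

Keeping the bounds as parameters makes it easy to tune them on the validation split instead of fixing them up front.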

In [102]:
# Check columns + results. All ok.
for name, df in datasets.items(): print(f"Columns in {name} dataset:", df.columns)
#datasets['train']['sent_length'].head()
#datasets['validation']['sent_length'].head()
#datasets['test'].head(2)

Columns in train dataset: Index(['article', 'highlights', 'id', 'internet-free_art',
       'internet-free_high', 'boiler-free_art', 'boiler-free_high',
       'clean_sents_art', 'clean_sents_high', 'art_sent_embeddings',
       'high_sent_embeddings', 'similarity_scores', 'binary_labels',
       'NE_count', 'NEs_per_sent', 'sent_length', 'sent_position'],
      dtype='object')
Columns in validation dataset: Index(['article', 'highlights', 'id', 'internet-free_art',
       'internet-free_high', 'boiler-free_art', 'boiler-free_high',
       'clean_sents_art', 'clean_sents_high', 'art_sent_embeddings',
       'high_sent_embeddings', 'similarity_scores', 'binary_labels',
       'NE_count', 'NEs_per_sent', 'sent_length', 'sent_position'],
      dtype='object')
Columns in test dataset: Index(['article', 'highlights', 'id', 'internet-free_art',
       'internet-free_high', 'boiler-free_art', 'boiler-free_high',
       'clean_sents_art', 'clean_sents_high', 'art_sent_embeddings',
       'high

In [78]:
# Print the first few entries of the 'clean_sents_art' column
for name, df in datasets.items():
    print(f"Dataset: {name}")
    print(df['clean_sents_art'].head(2))


Dataset: train
0                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        

In [83]:
#for name, df in datasets.items():
 #   if 'sent_length' in df.columns: df.drop('sent_length', axis=1, inplace=True)

## Content Features: Signature terms / Named Entities (counts)
Using the count of NEs in each sentence as a feature is a simple way to measure how informative the sentence is. Sentences that contain significant named entities (important people, organizations, locations, etc.) are more likely to be essential to the article's overall narrative or argument, which makes them strong candidates for inclusion in a summary. For example, a sentence with a higher number of NEs may be more informative or more central to the article's main theme, and therefore more likely to be summary-worthy.

In [89]:
# Counted from the 'boiler-free_art' column, before further preprocessing, as NE extraction works best on unprocessed text.
# NEs provide valuable insights into the key topics and important elements of the articles.

def count_named_entities_per_sentence(text):
    doc = nlp(text)

    ne_counts_per_sentence = []

    for sent in doc.sents:
        ne_counts_per_sentence.append(len(sent.ents))
    return ne_counts_per_sentence


for name, df in datasets.items():
    df['NE_count'] = df['boiler-free_art'].apply(count_named_entities_per_sentence)
    datasets[name] = df

In [104]:
# Check columns + results. All ok.
#for name, df in datasets.items(): print(f"Columns in {name} dataset:", df.columns)
datasets['train']['NE_count'].head()
#datasets['validation']['clean_sents_art'].head()
#datasets['test'].head(2)

0                                     [4, 0, 2, 0, 3, 7, 2, 0, 1, 0, 0, 0, 5, 0, 1, 1, 3]
1    [4, 9, 3, 3, 3, 3, 3, 8, 6, 8, 5, 3, 3, 4, 6, 2, 4, 5, 4, 6, 9, 3, 3, 7, 5, 4, 5, 2]
2                [4, 2, 0, 1, 1, 0, 2, 2, 3, 3, 4, 1, 2, 4, 2, 2, 5, 6, 3, 2, 3, 1, 1, 1]
3                               [2, 4, 7, 0, 0, 3, 4, 6, 2, 0, 0, 4, 3, 5, 1, 4, 0, 0, 3]
4                                  [3, 5, 2, 0, 3, 4, 3, 3, 0, 3, 4, 3, 2, 4, 4, 5, 4, 2]
Name: NE_count, dtype: object
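
The counts above include every entity type spaCy returns, so dates and numbers weigh as much as people or organizations. If only the entity types named in the text (people, organizations, locations) should count, one option is to filter by label. The sketch below works on plain `(text, label)` pairs, which could be collected as `(ent.text, ent.label_)` from each `sent.ents`; the helper name, the `KEEP_LABELS` selection, and the sample pairs are illustrative assumptions.

```python
# Entity labels to keep: an assumed selection from the label set of spaCy's English models.
KEEP_LABELS = {"PERSON", "ORG", "GPE", "LOC"}


def count_selected_entities(entities_per_sentence, keep_labels=KEEP_LABELS):
    """Count, per sentence, only the entities whose label is in keep_labels.

    entities_per_sentence: list of sentences, each a list of (text, label) pairs.
    """
    return [
        sum(1 for _text, label in sentence if label in keep_labels)
        for sentence in entities_per_sentence
    ]


# Hand-written sample pairs (not real model output).
sample = [
    [("Arabella", "PERSON"), ("four months ago", "DATE"), ("Rebecca Ferguson", "PERSON")],
    [],
    [("London", "GPE"), ("12", "CARDINAL")],
]
print(count_selected_entities(sample))
```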

In [94]:
# Extract the NEs themselves, for inspection.

def named_ents_per_sentence(text):
    doc = nlp(text)
    
    named_entities_per_sentence = []
    
    # Iterate over each sentence
    for sent in doc.sents:
        ents = [ent.text for ent in sent.ents]
        named_entities_per_sentence.append(ents)
    return named_entities_per_sentence

for name, df in datasets.items():
    df['NEs_per_sent'] = df['boiler-free_art'].apply(named_ents_per_sentence)
    datasets[name] = df


In [96]:
# Check columns + results. All ok.
for name, df in datasets.items(): print(f"Columns in {name} dataset:", df.columns)
datasets['train']['NEs_per_sent'].head(2)
#datasets['validation'].head(2)
#datasets['test'].head(2)

Columns in train dataset: Index(['article', 'highlights', 'id', 'internet-free_art',
       'internet-free_high', 'boiler-free_art', 'boiler-free_high',
       'clean_sents_art', 'clean_sents_high', 'art_sent_embeddings',
       'high_sent_embeddings', 'similarity_scores', 'binary_labels',
       'NE_count', 'NEs_per_sent'],
      dtype='object')
Columns in validation dataset: Index(['article', 'highlights', 'id', 'internet-free_art',
       'internet-free_high', 'boiler-free_art', 'boiler-free_high',
       'clean_sents_art', 'clean_sents_high', 'art_sent_embeddings',
       'high_sent_embeddings', 'similarity_scores', 'binary_labels',
       'NE_count', 'NEs_per_sent'],
      dtype='object')
Columns in test dataset: Index(['article', 'highlights', 'id', 'internet-free_art',
       'internet-free_high', 'boiler-free_art', 'boiler-free_high',
       'clean_sents_art', 'clean_sents_high', 'art_sent_embeddings',
       'high_sent_embeddings', 'similarity_scores', 'binary_labels',
       

0                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      [[Arabella, four months ago, Rebecca Ferguson, 12

## Content Features: High Frequency Words ???

## Event Features: Verbs (counts)
Verbs are integral to conveying actions and events within a text. In news articles, for instance, verbs play a crucial role in reporting what happened, when, and to whom. A higher frequency of verbs in a sentence might indicate that it is action-packed or event-rich, which could be a characteristic of summary-worthy content.

In [112]:
# Counted only from the 'boiler-free_art' column, before preprocessing, as verb (and, in general, part-of-speech) extraction works best on unprocessed text.

def count_verbs_per_sentence(text):
    doc = nlp(text)
    verb_counts = [sum(1 for token in sent if token.pos_ == 'VERB') for sent in doc.sents]
    return verb_counts

for name, df in datasets.items():
    df['VERB_count'] = df['boiler-free_art'].apply(count_verbs_per_sentence)
    datasets[name] = df

In [109]:
#for name, df in datasets.items():
 #   if 'VERB_count' in df.columns: df.drop('VERB_count', axis=1, inplace=True)

In [113]:
# Check columns + results. All ok.
for name, df in datasets.items(): print(f"Columns in {name} dataset:", df.columns)
datasets['train']['VERB_count'].head(2)
#datasets['validation'].head(6)
#datasets['test'].head(2)

Columns in train dataset: Index(['article', 'highlights', 'id', 'internet-free_art',
       'internet-free_high', 'boiler-free_art', 'boiler-free_high',
       'clean_sents_art', 'clean_sents_high', 'art_sent_embeddings',
       'high_sent_embeddings', 'similarity_scores', 'binary_labels',
       'NE_count', 'NEs_per_sent', 'sent_length', 'sent_position',
       'VERB_count'],
      dtype='object')
Columns in validation dataset: Index(['article', 'highlights', 'id', 'internet-free_art',
       'internet-free_high', 'boiler-free_art', 'boiler-free_high',
       'clean_sents_art', 'clean_sents_high', 'art_sent_embeddings',
       'high_sent_embeddings', 'similarity_scores', 'binary_labels',
       'NE_count', 'NEs_per_sent', 'sent_length', 'sent_position',
       'VERB_count'],
      dtype='object')
Columns in test dataset: Index(['article', 'highlights', 'id', 'internet-free_art',
       'internet-free_high', 'boiler-free_art', 'boiler-free_high',
       'clean_sents_art', 'clean_sents_

0                                     [2, 4, 6, 1, 3, 6, 5, 5, 5, 3, 2, 5, 8, 4, 3, 5, 2]
1    [3, 4, 5, 2, 2, 4, 6, 6, 6, 2, 3, 2, 1, 2, 1, 3, 1, 4, 3, 3, 4, 2, 3, 2, 5, 4, 3, 1]
Name: VERB_count, dtype: object

In [116]:
# Extract the VERBs themselves, for inspection.

def verbs_per_sentence(text):
    doc = nlp(text)
    verbs_by_sentence = [[token.text for token in sent if token.pos_ == 'VERB'] for sent in doc.sents]
    return verbs_by_sentence

for name, df in datasets.items():
    df['VERBS'] = df['boiler-free_art'].apply(verbs_per_sentence)
    datasets[name] = df


In [117]:
# Check columns + results. All ok.
for name, df in datasets.items(): print(f"Columns in {name} dataset:", df.columns)
datasets['train']['VERBS'].head(2)
#datasets['validation'].head(2)
#datasets['test'].head(2)

Columns in train dataset: Index(['article', 'highlights', 'id', 'internet-free_art',
       'internet-free_high', 'boiler-free_art', 'boiler-free_high',
       'clean_sents_art', 'clean_sents_high', 'art_sent_embeddings',
       'high_sent_embeddings', 'similarity_scores', 'binary_labels',
       'NE_count', 'NEs_per_sent', 'sent_length', 'sent_position',
       'VERB_count', 'VERBS'],
      dtype='object')
Columns in validation dataset: Index(['article', 'highlights', 'id', 'internet-free_art',
       'internet-free_high', 'boiler-free_art', 'boiler-free_high',
       'clean_sents_art', 'clean_sents_high', 'art_sent_embeddings',
       'high_sent_embeddings', 'similarity_scores', 'binary_labels',
       'NE_count', 'NEs_per_sent', 'sent_length', 'sent_position',
       'VERB_count', 'VERBS'],
      dtype='object')
Columns in test dataset: Index(['article', 'highlights', 'id', 'internet-free_art',
       'internet-free_high', 'boiler-free_art', 'boiler-free_high',
       'clean_sents_a

0                                                                                                                                                                                                                            [[giving, become], [hit, vowed, resist, slim], [Speaking, confessed, aimed, drop, left, stand], [Scroll], [gave, faced, lose], [says, following, snap, promoting, feel, slim], [famed, getting, giving, says, feeling], [standing, taking, try, shame, losing], [refused, name, shame, revealed, imagined], [said, get, distracted], [rob, focusing], [refused, name, shame, revealed, imagined], [want, enjoy, said, worried, dropping, spoke, walked, discovering], [revealed, live, moving, enjoying], [left, help, look], [admits, love, see, get, enjoying], [flaunted, says]]
1    [[claim, keep, reach], [helped, post, looked, running], [appeared, show, stepped, held, remove], [congratulated, getting], [celebrates, caught], [celebrate, winning, pulsating, knocking], [called, judge, remain

In [None]:
# Thematic words: the top n words with the highest frequency in the cleaned article text.
# Reportedly, eight of the top 10 words are contained in the summary, and the overlap drops sharply after the top 10 words.
# The thematic words are therefore set to the top 10 words in the thematic-word feature extraction,
# i.e. the 10 most frequent words in the cleaned text, in descending order of frequency (yes).
# TODO: find the top 10 verbs for this article.
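
The thematic-word idea described in the comments above could be sketched as follows, assuming each article is a list of sentences and each sentence a list of lemmatized, stop-word-free tokens (the shape that `sentence_length_on_clean` takes). The function names are hypothetical, and the toy article uses `n=2` instead of 10 so the effect is visible.

```python
from collections import Counter


def top_thematic_words(sentences, n=10):
    """Return the n most frequent words across all sentences of one article."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return [word for word, _count in counts.most_common(n)]


def thematic_word_counts(sentences, n=10):
    """Count, per sentence, how many tokens are among the article's top-n words."""
    thematic = set(top_thematic_words(sentences, n))
    return [sum(1 for word in sentence if word in thematic) for sentence in sentences]


# Toy article (made-up tokens).
article = [
    ["singer", "hit", "critic"],
    ["singer", "vow", "resist", "pressure"],
    ["critic", "singer", "critic"],
]
print(top_thematic_words(article, n=2))
print(thematic_word_counts(article, n=2))
```

The per-sentence counts have one entry per input sentence, so they line up with `sent_length` when both are computed from the same cleaned sentences.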


In [118]:
# check how many columns created so far. All ok.
for name, df in datasets.items():
    print(f"The number of columns in the {name} dataset is", len(df.columns), "and these are: ", df.columns)

The number of columns in the train dataset is 19 and these are:  Index(['article', 'highlights', 'id', 'internet-free_art',
       'internet-free_high', 'boiler-free_art', 'boiler-free_high',
       'clean_sents_art', 'clean_sents_high', 'art_sent_embeddings',
       'high_sent_embeddings', 'similarity_scores', 'binary_labels',
       'NE_count', 'NEs_per_sent', 'sent_length', 'sent_position',
       'VERB_count', 'VERBS'],
      dtype='object')
The number of columns in the validation dataset is 19 and these are:  Index(['article', 'highlights', 'id', 'internet-free_art',
       'internet-free_high', 'boiler-free_art', 'boiler-free_high',
       'clean_sents_art', 'clean_sents_high', 'art_sent_embeddings',
       'high_sent_embeddings', 'similarity_scores', 'binary_labels',
       'NE_count', 'NEs_per_sent', 'sent_length', 'sent_position',
       'VERB_count', 'VERBS'],
      dtype='object')
The number of columns in the test dataset is 19 and these are:  Index(['article', 'highlights',

## Create the X_train input 

In [121]:
import numpy as np  # needed for np.array / np.concatenate below

# Build, for each article, a list of per-sentence feature vectors.
# Each vector is the sentence embedding concatenated with sentence position, sentence length, named entity count, and verb count.

# Initialize an empty dictionary to store the X_train datasets for each dataframe.
X_train_datasets = {}

for name, df in datasets.items():
    X_train = []

    for index, row in df.iterrows():
        # Assuming each of these columns contains a list of features for each sentence.
        sentence_embeddings = row['art_sent_embeddings']
        position_scores = row['sent_position']
        sentence_lengths = row['sent_length']
        ne_counts = row['NE_count']
        verb_counts = row['VERB_count']

        if all(len(lst) == len(sentence_embeddings) for lst in [position_scores, sentence_lengths, ne_counts, verb_counts]):
            article_features = []
            for i in range(len(sentence_embeddings)):
                # Convert the list of additional features to a NumPy array.
                additional_features = np.array([position_scores[i], sentence_lengths[i], ne_counts[i], verb_counts[i]])
                # Concatenate using NumPy's concatenate function.
                sentence_feature = np.concatenate([sentence_embeddings[i], additional_features])
                article_features.append(sentence_feature)
            X_train.append(article_features)
        else:
            print(f"Feature length mismatch in row {index} of {name} dataset")

    # Store the X_train for this dataframe in the dictionary
    X_train_datasets[name] = X_train

# Now X_train_datasets contains the combined features for each dataframe in the datasets dictionary.

Feature length mismatch in row 0 of train dataset
Feature length mismatch in row 2 of train dataset
Feature length mismatch in row 3 of train dataset
Feature length mismatch in row 5 of train dataset
Feature length mismatch in row 7 of train dataset
Feature length mismatch in row 8 of train dataset
Feature length mismatch in row 9 of train dataset
Feature length mismatch in row 10 of train dataset
Feature length mismatch in row 11 of train dataset
Feature length mismatch in row 13 of train dataset
Feature length mismatch in row 14 of train dataset
Feature length mismatch in row 16 of train dataset
Feature length mismatch in row 18 of train dataset
Feature length mismatch in row 20 of train dataset
Feature length mismatch in row 21 of train dataset
Feature length mismatch in row 24 of train dataset
Feature length mismatch in row 25 of train dataset
Feature length mismatch in row 28 of train dataset
Feature length mismatch in row 29 of train dataset
Feature length mismatch in row 30 of t
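
Many rows fail the length check above. A likely cause, stated here as an assumption rather than a verified finding, is that `sent_length` is computed from `clean_sents_art` while `NE_count` and `VERB_count` rely on spaCy's own sentence split of `boiler-free_art`, so the two can disagree on the number of sentences. A small diagnostic such as the following (helper names and the sample row are hypothetical) shows which columns disagree for a given row:

```python
def feature_lengths(row, columns):
    """Return {column: number of per-sentence entries} for one article row."""
    return {col: len(row[col]) for col in columns}


def is_aligned(row, columns):
    """True when every listed column holds the same number of per-sentence entries."""
    return len(set(feature_lengths(row, columns).values())) == 1


# Made-up row in which the spaCy-derived counts cover two more sentences than the rest.
columns = ["art_sent_embeddings", "sent_position", "sent_length", "NE_count", "VERB_count"]
row = {
    "art_sent_embeddings": [[0.1], [0.2], [0.3]],
    "sent_position": [1, 2, 3],
    "sent_length": [16, 7, 9],
    "NE_count": [4, 0, 2, 0, 3],
    "VERB_count": [2, 4, 6, 1, 3],
}
print(feature_lengths(row, columns))
print(is_aligned(row, columns))
```

Because the helpers only index the row by column name, they should work equally on a plain dict or on a row yielded by `df.iterrows()`.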

## Create y_train labels

In [122]:
y_train_datasets = {}

for name, df in datasets.items():
    # Extract the 'binary_labels' column as the labels for training
    y_train = df['binary_labels'].tolist()
    y_train_datasets[name] = y_train
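
Because the feature loop skips articles whose lists disagree in length while `y_train` keeps every article, the two structures can fall out of step. One way to keep them paired, sketched under the assumption that features and labels are both lists of per-article lists indexed by the same article order, is to filter and flatten them together. `flatten_aligned` is a hypothetical helper, not part of the notebook.

```python
def flatten_aligned(features_per_article, labels_per_article):
    """Flatten parallel per-article lists into sentence-level X and y.

    Articles whose feature and label lists differ in length are dropped from both.
    Returns (X, y, skipped_article_indices).
    """
    X, y, skipped = [], [], []
    for i, (feats, labels) in enumerate(zip(features_per_article, labels_per_article)):
        if len(feats) != len(labels):
            skipped.append(i)
            continue
        X.extend(feats)
        y.extend(labels)
    return X, y, skipped


# Toy data: the second article has three feature vectors but only two labels.
features = [[[0.1, 1], [0.2, 2]], [[0.3, 1], [0.4, 2], [0.5, 3]], [[0.6, 1]]]
labels = [[1, 0], [1, 1], [0]]
X, y, skipped = flatten_aligned(features, labels)
print(len(X), y, skipped)
```

Returning the skipped indices makes it easy to see how much data is lost to segmentation mismatches before training.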

In [123]:
# Check the first few labels of each dataset.
for name, labels in y_train_datasets.items():
    print(f"Dataset: {name}")
    print(labels[:5])


Dataset: train
[[1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1], [1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0], [1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1], [1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1]]
Dataset: validation
[[1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1], [1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0], [1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1], [1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1]]
Dataset: test
[[1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1], [1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1,