Extractive Summarizer: An extractive summarizer based on a non-deep learning supervised model.
1. In particular, you have to train a non-deep learning classification model (e.g. logistic regression, SVM) or regression model (e.g. ElasticNet, SVR) that will be used for scoring the sentences of the input document. (0 or 1)
2. Then, based on the scores, you will create a summarizer that attempts to build a summary from the most informative, non-redundant **sentences**. It is up to you which machine learning model and features you use.
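
As a rough illustration of the scoring step, the sketch below fits a logistic regression on two per-sentence features and ranks new sentences by predicted probability. It assumes scikit-learn's `LogisticRegression`; the feature values and labels are invented toy data, not taken from the dataset, and the choice of model is only one of the options the task allows.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy features per sentence: [position score, sentence length in tokens].
# Values and labels are made up for illustration only.
X_train = np.array([
    [1.00, 18], [0.88, 22], [0.75, 9], [0.62, 14],
    [0.50, 7], [0.38, 11], [0.25, 6], [0.12, 8],
])
y_train = np.array([1, 1, 1, 0, 0, 0, 0, 0])  # 1 = good summary candidate

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

# predict_proba returns one row per sentence; column 1 is P(label == 1),
# which can serve as a continuous sentence score.
X_new = np.array([[0.95, 20], [0.10, 6]])
scores = clf.predict_proba(X_new)[:, 1]

# Sentence indices ordered from highest to lowest score
ranking = np.argsort(-scores)
```

Using the probability rather than the hard 0/1 prediction keeps a ranking available when more sentences are labelled 1 than the summary can hold.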


## Steps

0. Data exploration.
1. Preprocessing, which is also applied to the highlights: the 2 new columns, cleaned_article and clean_highlights, will be used for similarity calculation and feature extraction.
2. Feature extraction (sentence length, sentence position). Here we can also try NER, for the count of **named entities** in a sentence, and POS tagging, for the count of **verbs** (since verbs can denote events, this is also good practice for event extraction).
3. Sentence score calculation: similarity of each sentence of the article with each sentence in the highlights, based on their average word2vec embeddings.
A high score for a pair denotes that the sentence is a good candidate for the summary, so sentences scoring at or above a specific threshold are assigned label 1. Sentences scoring below the threshold are assigned label 0.
4. Sentence extraction: extract the first n sentences in the document that were given a value of 1 in the previous step (I chose the first 3).
5. Summary production: these 3 sentences should be written to a .txt file, with the article id on the left, e.g.
(f001ec5c4704938247d27a44948eebb37ae98d01), followed by the 3 extracted sentences as raw text, separated by full stops (no lists).
6. The content of the 'highlights' column should be compared to the content of the 'ml-summary' column, and the ROUGE-2 metrics computed in a new column that contains all 3 requested metrics in a dictionary.
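
Steps 4 to 6 are not implemented in this notebook yet. A minimal sketch of how they could fit together is shown below. It makes several assumptions: the 3 requested metrics are taken to be precision, recall and F1; the tab between the article id and the summary text is an arbitrary choice; `article-0001` is a made-up id; and the ROUGE-2 here is a simplified bigram-overlap version without stemming, so its numbers may differ from those of a dedicated ROUGE package.

```python
import re
from collections import Counter

def first_n_selected(sentences, labels, n=3):
    """Return the first n sentences whose label is 1, in document order."""
    picked = [s for s, lab in zip(sentences, labels) if lab == 1]
    return picked[:n]

def bigram_counts(text):
    """Lowercase, keep word characters only, and count adjacent token pairs."""
    tokens = re.findall(r"\w+", text.lower())
    return Counter(zip(tokens, tokens[1:]))

def rouge_2(reference, candidate):
    """Simplified ROUGE-2: clipped bigram overlap as precision, recall, F1."""
    ref, cand = bigram_counts(reference), bigram_counts(candidate)
    overlap = sum((ref & cand).values())  # Counter '&' keeps the minimum counts
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}

# Toy article: raw sentences and the 0/1 labels from the scoring step
sentences = ["the cat sat on the mat", "it was raining",
             "the dog barked loudly", "nobody came"]
labels = [1, 0, 1, 1]

summary = ". ".join(first_n_selected(sentences, labels, n=2))
output_line = f"article-0001\t{summary}"  # hypothetical id, tab-separated

scores = rouge_2("the cat sat on the mat", summary)
```

In the DataFrame setting, `rouge_2` could be applied row-wise to the 'highlights' and 'ml-summary' columns to fill the new metrics column.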

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [1]:
!pip install datasets contractions


In [4]:
import pandas as pd
import nltk
nltk.download('stopwords')
nltk.download('punkt')
import numpy as np
import re
import datasets
import contractions
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
import string
from nltk.stem import PorterStemmer
nltk.download('wordnet')
from gensim.models import Word2Vec
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from nltk.stem import WordNetLemmatizer
from tqdm import tqdm
from bs4 import BeautifulSoup
from gensim.models import KeyedVectors


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ankar\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ankar\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\ankar\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [5]:
dataset_name = "cnn_dailymail"
dataset_version = "3.0.0"

dataset = datasets.load_dataset(dataset_name, dataset_version)

print(dataset.keys())

Downloading data: 100%|██████████| 313M/313M [00:48<00:00, 6.45MB/s] 
Downloading data: 100%|██████████| 304M/304M [01:10<00:00, 4.33MB/s] 
Downloading data: 100%|██████████| 155M/155M [00:31<00:00, 4.84MB/s] 
Downloading data: 100%|██████████| 34.7M/34.7M [00:05<00:00, 6.80MB/s]
Downloading data: 100%|██████████| 30.0M/30.0M [00:04<00:00, 6.45MB/s]
Generating train split: 100%|██████████| 287113/287113 [00:03<00:00, 87653.73 examples/s] 
Generating validation split: 100%|██████████| 13368/13368 [00:00<00:00, 123981.09 examples/s]
Generating test split: 100%|██████████| 11490/11490 [00:00<00:00, 93165.49 examples/s]


dict_keys(['train', 'validation', 'test'])


In [6]:
# Convert the datasets to DataFrames, then to csv to be saved.
train_df = pd.DataFrame(dataset['train']) 
train_df.to_csv('train.csv')
val_df = pd.DataFrame(dataset['validation']) 
val_df.to_csv('val.csv')
test_df = pd.DataFrame(dataset['test'])
test_df.to_csv('test.csv')

In [7]:
import os
os.getcwd() # Path of the saved csv files.

'C:\\Users\\ankar\\Desktop\\summarization-models\\src\\push_task2'

In [None]:
df_train =pd.read_csv('/content/train.csv')
df_test=pd.read_csv('/content/test.csv')

In [None]:
df_train=df_train.head(1000)
df_test=df_test.head(300)


# Preprocessing steps
- remove start and end patterns
- remove parentheses (and everything inside them)
- expand contractions (won't = will not)
- remove stopwords
- remove 's (sister's = sister)
- sentence tokenization
- lowercasing
- word tokenization
- remove punctuation and symbols (e.g. $ 20 million = 20 million)
- remove non-ASCII tokens
- lemmatization

Another pattern spotted in the highlights column is "NEW: blablaaa", so the "NEW:" at the beginning of the highlights should be removed as well.

**What was not removed, and maybe should be, is the numbers!**

**Also, after stopword removal: removal of sentences with a length of <= 5 tokens!**
We can discuss this.
The other approach would be to keep only the articles that are 200 tokens long with a highlight length of around 50 tokens; this would leave us with around 30,000 training examples.

# More cleaning still to do
1. Drop the duplicate rows
2. Check for and remove null data
3. Remove https, links, URLs
4. Remove href:..
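
The outstanding cleaning items could be sketched roughly as follows. This is only an illustration on a toy DataFrame: the helper name `strip_links_and_markers` and the regular expressions are assumptions, not code from this project, and real articles may need broader patterns.

```python
import re
import pandas as pd

def strip_links_and_markers(text):
    """Remove URLs, leftover anchor tags and a leading 'NEW:' marker."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)   # URLs
    text = re.sub(r"<a\s+href[^>]*>|</a>", " ", text)     # <a href=...> remnants
    # 'NEW:' at the start of any line (highlights are newline-separated)
    text = re.sub(r"^[ \t]*NEW:[ \t]*", "", text, flags=re.MULTILINE)
    return re.sub(r"[ \t]+", " ", text).strip()

# Toy data: one duplicated row and one row with a missing article
toy = pd.DataFrame({
    "article": ["Read more at https://example.com/x today.", None,
                "Read more at https://example.com/x today."],
    "highlights": ["NEW: Officials confirm the report .", "Something .",
                   "NEW: Officials confirm the report ."],
})

# 1-2. Drop null rows, then exact duplicates
toy = (toy.dropna(subset=["article", "highlights"])
          .drop_duplicates()
          .reset_index(drop=True))

# 3-4. Strip links, href remnants and the 'NEW:' marker
toy["article"] = toy["article"].apply(strip_links_and_markers)
toy["highlights"] = toy["highlights"].apply(strip_links_and_markers)
```

Running the null and duplicate checks before tokenization avoids spending preprocessing time on rows that will be discarded anyway.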

In [None]:
def apply_lemmatization(tokens):
  lemmatizer = WordNetLemmatizer()
  return [lemmatizer.lemmatize(word) for word in tokens]

def apply_stemming(tokens):
  porter = PorterStemmer()
  return [porter.stem(word) for word in tokens]

def clean_text(x):
  puncts = [',', '.', '"', ':', ')', '(', '-', '!', '?', '|', ';', "'", '$', '&', '/', '[', ']', '>', '%', '=', '#',
              '*', '+', '\\', '•', '~', '@', '£',
              '·', '_', '{', '}', '©', '^', '®', '`','--', '<', '→', '°', '€', '™', '›', '♥', '←', '×', '§', '″', '′', 'Â',
              '█', '½', 'à', '…',
              '“', '★', '”', '–', '●', 'â', '►', '−', '¢', '²', '¬', '░', '¶', '↑', '±', '¿', '▾', '═', '¦', '║', '―',
              '¥', '▓', '—', '‹', '─',
              '▒', '：', '¼', '⊕', '▼', '▪', '†', '■', '’', '▀', '¨', '▄', '♫', '☆', 'é', '¯', '♦', '¤', '▲', 'è', '¸',
              '¾', 'Ã', '⋅', '‘', '∞',
              '∙', '）', '↓', '、', '│', '（', '»','«', '，', '♪', '╩', '╚', '³', '・', '╦', '╣', '╔', '╗', '▬', '❤', 'ï', 'Ø',
              '¹', '≤', '‡', '√', ]

  x = str(x)
  for punct in puncts:
    x = x.replace(punct, f' {punct} ')
  return x

def remove_non_ascii(tokens):
    # Drop tokens made up entirely of non-ASCII characters; tokens mixing ASCII and non-ASCII are kept
    return [word for word in tokens if re.match(r'^[^\x00-\x7F]+$', word) is None]

def remove_urls(text):
    # Remove URLs starting with http:// or https://
    text = re.sub(r'https?://\S+', '', text, flags=re.MULTILINE)
    return text

def remove_html_tags(text):
    # Remove HTML tags using BeautifulSoup
    soup = BeautifulSoup(text, 'html.parser')
    return soup.get_text()

def remove_usernames(text):
    # Remove @usernames
    text = re.sub(r'@\w+', '', text)
    return text

def clean_and_tokenize(article):
    # Remove text within parentheses and everything in it
    cleaned_article = re.sub(r'\([^)]*\)', '', article)

    # Split the text using '-- ' as the delimiter
    parts = re.split(r'-- ', cleaned_article, maxsplit=1)

    # Check if there was a match and reconstruct the text
    cleaned_article = parts[1] if len(parts) > 1 else cleaned_article

    # Remove 'E-mail to a friend' and anything that follows it
    cleaned_article = re.sub(r'E-mail to a friend.*', '', cleaned_article)
    # TODO: add the extra cleaning functions here (HTML, URL and username removal)
    #cleaned_article = re.sub(r'<a href', '', cleaned_article)
    #cleaned_article = re.sub(r'&amp;', '', cleaned_article)
    # Remove URLs
    #cleaned_article = remove_urls(cleaned_article)
    # Remove HTML tags
    #cleaned_article = remove_html_tags(cleaned_article)
    # Remove @usernames
    #cleaned_article = remove_usernames(cleaned_article)
    # Expand word contractions
    expanded_article = contractions.fix(cleaned_article)
    # Add a period after the closing quotation mark if there is a space and a capital letter
    text_with_period = re.sub(r'(")([ ])([A-Z])', r'\1.\2\3', expanded_article)
    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    word_tokens = word_tokenize(text_with_period)
    filtered_article = [word for word in word_tokens if word.lower() not in stop_words and word.lower() != "'s"]
    filtered_article = ' '.join(filtered_article)
    #Sentence tokenization
    sentences = sent_tokenize(filtered_article)
    # Lowercase the text
    sentences = [sentence.lower() for sentence in sentences]
    # Remove punctuation: pad symbols with spaces via clean_text, re-tokenize, then drop punctuation tokens
    tokenized_sentences_no_punct = [
        [word for word in word_tokenize(clean_text(sentence)) if word not in string.punctuation]
        for sentence in sentences]
    # Drop tokens made up entirely of non-ASCII characters
    tokenized_sentences_no_ascii = [
        remove_non_ascii(sentence)
        for sentence in tokenized_sentences_no_punct]
    # Apply lemmatization, keeping only sentences with more than 5 tokens
    tokenized_sentences_lemmatized = [
        apply_lemmatization(sentence)
        for sentence in tokenized_sentences_no_ascii
        if len(sentence) > 5
        ]

    # Remove empty lists
    tokenized_sentences_lemmatized = [sentence for sentence in tokenized_sentences_lemmatized if sentence]

    return tokenized_sentences_lemmatized

################### THESE 2 FUNCTIONS HAVE NOT YET BEEN USED IN THE IMPLEMENTATION; THEY ARE JUST AN IDEA FOR AN APPROACH ###################
def get_columns(dataframe, article, highlights):

  # Get only the columns we are interested in (copy so the original dataframe is not modified)
  dataset = dataframe[[article, highlights]].copy()

  # Apply the pre-processing function to the article and highlights columns
  dataset[article] = dataset[article].apply(clean_and_tokenize)
  dataset[highlights] = dataset[highlights].apply(clean_and_tokenize)

  print('\nText done pre-processing!')

  X = dataset[article]
  Y = dataset[highlights]

  return X, Y

def data_prepare(df_train, df_test,w2v_model, article, highlights):

  # Prepare the training dataset
  print('------ Preparing the training dataset... ------')
  X,y = get_columns(df_train, article, highlights)

  # Prepare the validation/testing dataset
  print('\n------ Preparing the validation/testing dataset... ------')
  x1,y1 = get_columns(df_test, article, highlights)


  w2vX_train, words_found, matrix_len = find_words_in_w2v(X,w2v_model)
  print('Percentage of words found in W2V: ', words_found/matrix_len)

  w2vX_test, words_found, matrix_len = find_words_in_w2v(x1,w2v_model)
  print('Percentage of words found in W2V: ', words_found/matrix_len)
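# NOTE: data_prepare expects find_words_in_w2v to return (embeddings, words_found, matrix_len);
# the version defined later in this notebook returns a DataFrame, so one of the two must be adapted before use.
# data_prepare(df_train, df_test, w2v_model='/content/drive/MyDrive/NLU/Task_2/w2v_model/w2v_summ.model', article='article', highlights='highlights')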

# Apply cleaning to the articles and highlights of both the test and train datasets

In [None]:
# Apply the function to each row in the 'article' column
df_train['cleaned_article'] = df_train['article'].apply(clean_and_tokenize)
df_test['cleaned_article'] = df_test['article'].apply(clean_and_tokenize)

In [None]:
df_train['clean_highlights']=df_train['highlights'].apply(clean_and_tokenize)
df_test['clean_highlights']=df_test['highlights'].apply(clean_and_tokenize)

Keep the first 50% of the tokenized sentences in the cleaned articles.

In [None]:
df_train['first_half_sentences'] = df_train['cleaned_article'].apply(lambda sentences: sentences[:len(sentences)//2])

In [None]:
df_test['first_half_sentences'] = df_test['cleaned_article'].apply(lambda sentences: sentences[:len(sentences)//2])

In [None]:
rows_to_display = [0,1,2]  # Replace with the row indices you want to check

for row_index in rows_to_display:
    print(f"Row {row_index}:")
    print(df_train['cleaned_article'].iloc[row_index])
    print("\n---\n")

In [None]:
# Set display options to show all rows and columns
#pd.set_option('display.max_rows', None)
#pd.set_option('display.max_columns', None)

#def compute_max_cosine_similarity(row, threshold=0.3):
#    articles = row['cleaned_article']
#    highlights = row['clean_highlights']

    # Flatten the lists of tokenized sentences into strings
#    articles_str = [' '.join(sent) for sent in articles]
#    highlights_str = [' '.join(sent) for sent in highlights]

    # Combine articles and highlights for fitting CountVectorizer
#    all_sentences = articles_str + highlights_str

    # Use CountVectorizer to convert sentences to document-term matrices
#    vectorizer = CountVectorizer()
#    all_matrix = vectorizer.fit_transform(all_sentences).toarray()

    # Split the matrices back into articles and highlights parts
#    articles_matrix = all_matrix[:len(articles)]
#    highlights_matrix = all_matrix[len(articles):]

    # Compute cosine similarity for each pair of sentences
#    similarity_matrix = cosine_similarity(articles_matrix, highlights_matrix)

    # Find the maximum similarity score for each sentence
#    max_similarity_scores = similarity_matrix.max(axis=1)
#    max_similarity_scores = [round(score, 3) for score in max_similarity_scores]
    # Assign labels based on the threshold
#    labels = [1 if score >= threshold else 0 for score in max_similarity_scores]
#    return labels

# Apply the compute_max_cosine_similarity function to each row
#df6['max_cosine_similarity'] = df6.apply(compute_max_cosine_similarity, axis=1)

#df6['labels'] = df6.apply(compute_max_cosine_similarity, axis=1)

Explode dataframe so that each sentence of each article is represented by a row

In [None]:
# Explode the DataFrame and reset the index
df_train_exploded = df_train.explode('first_half_sentences').reset_index(drop=True)
df_test_exploded = df_test.explode('first_half_sentences').reset_index(drop=True)

# Rename the 'first_half_sentences' column to 'sentences'
df_train_exploded = df_train_exploded.rename(columns={'first_half_sentences': 'sentences'})
df_test_exploded = df_test_exploded.rename(columns={'first_half_sentences': 'sentences'})

# Display the resulting DataFrame
print(df_train_exploded)
print(df_test_exploded)

In [None]:
current_article_id = None
article_index = 0

def create_article_test_sentence(row):
    global current_article_id, article_index

    # Check if the current article ID is different from the one in the row
    if row['id'] != current_article_id:
        current_article_id = row['id']
        article_index += 1  # Increment the article index for a new article

    return f"{article_index}-{row.name}"

# Assuming 'id' is the column containing the article IDs
df_test_exploded['article-sentence'] = df_test_exploded.apply(create_article_test_sentence, axis=1)

Fix the article-sentence indexing so that the sentence index restarts at 0 for each article

In [None]:
# Extract article and sentence indices
df_test_exploded[['article', 'sentence']] = df_test_exploded['article-sentence'].str.split('-', expand=True)

# Convert columns to numeric
df_test_exploded['article'] = pd.to_numeric(df_test_exploded['article'])
df_test_exploded['sentence'] = pd.to_numeric(df_test_exploded['sentence'])

# Calculate the correct sentence index for each article
df_test_exploded['new_sentence_index'] = df_test_exploded.groupby('article').cumcount()

# Create the new article-sentence column
df_test_exploded['new_article_sentence'] = df_test_exploded['article'].astype(str) + '-' + df_test_exploded['new_sentence_index'].astype(str)

# Drop unnecessary columns
df_test_exploded = df_test_exploded.drop(['article','article-sentence', 'sentence', 'new_sentence_index'], axis=1)
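
The same article-sentence identifier can likely be built in one pass, without the global counter and the string round trip. The sketch below is an alternative under the assumption that all rows of an article share the same 'id'; the toy ids are invented. It uses `pd.factorize` for the article number and `groupby(...).cumcount()` for the sentence number.

```python
import pandas as pd

# Toy exploded frame: one row per sentence, 'id' repeats within an article
toy = pd.DataFrame({
    "id": ["a1", "a1", "a1", "b2", "b2"],
    "sentences": [["s0"], ["s1"], ["s2"], ["t0"], ["t1"]],
})

# factorize numbers the ids 0, 1, 2, ... in order of first appearance;
# adding 1 matches the 1-based article index used in this notebook
article_no = pd.Series(pd.factorize(toy["id"])[0] + 1, index=toy.index)

# cumcount restarts at 0 inside each article
sentence_no = toy.groupby("id").cumcount()

toy["new_article_sentence"] = article_no.astype(str) + "-" + sentence_no.astype(str)
```

Unlike the counter-based version, this does not depend on the rows of an article being adjacent.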

In [None]:
df_test_exploded

Encode sentence position

In [None]:
# Add a new column 'position' based on the sentences' order within each article
df_train_exploded['position'] = df_train_exploded.groupby('id').cumcount()
df_test_exploded['position'] = df_test_exploded.groupby('id').cumcount()

# Calculate the total number of sentences for each article
train_article_lengths = df_train_exploded.groupby('id').size()
test_article_lengths = df_test_exploded.groupby('id').size()

# Gradually reduce the position value for each article
df_train_exploded['position'] = 1 - (df_train_exploded['position'] / train_article_lengths[df_train_exploded['id']].values)
df_test_exploded['position'] = 1 - (df_test_exploded['position'] / test_article_lengths[df_test_exploded['id']].values)


In [None]:
# If you want to round the values to a certain decimal place, you can use round()
df_train_exploded['position'] = round(df_train_exploded['position'], 2)
df_test_exploded['position'] = round(df_test_exploded['position'], 2)
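
To make the position feature concrete: each sentence gets 1 - index / article_length, so the first sentence scores 1.0 and later sentences decay towards 0. The toy example below reproduces that formula with `transform("size")` instead of indexing a separate lengths Series; the ids are invented.

```python
import pandas as pd

# Toy frame: article a1 has 4 sentences, article b2 has 2
toy = pd.DataFrame({"id": ["a1"] * 4 + ["b2"] * 2})

idx = toy.groupby("id").cumcount()                 # 0-based sentence index
n = toy.groupby("id")["id"].transform("size")      # sentences per article, aligned to rows

toy["position"] = (1 - idx / n).round(2)
# a1 -> 1.0, 0.75, 0.5, 0.25 ; b2 -> 1.0, 0.5
```

Note that the last sentence of an article never reaches 0 with this formula; it ends at 1 / article_length.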


In [None]:
df_test_exploded.position

Word2Vec model: load the trained model and use it to create average word2vec embeddings for each sentence.

In [None]:
w2v_model=Word2Vec.load('/content/drive/MyDrive/NLU/Task_2/w2v_model/w2v_summ.model')

In [None]:
def find_words_in_w2v(train_dataset, w2vmodel):
    # Resolve the input to a KeyedVectors object, whichever form it arrives in
    if isinstance(w2vmodel, str):
        # A path to vectors saved in the binary word2vec format
        keyed_vectors = KeyedVectors.load_word2vec_format(w2vmodel, binary=True)
    elif isinstance(w2vmodel, Word2Vec):
        keyed_vectors = w2vmodel.wv
    elif isinstance(w2vmodel, KeyedVectors):
        keyed_vectors = w2vmodel
    else:
        raise ValueError("Invalid Word2Vec model format")

    # Get the dimensionality of the embeddings
    embedding_size = keyed_vectors.vector_size

    mean_embeddings = []
    # Iterate over each row in the 'sentences' column
    for sentence in train_dataset['sentences']:
        # Rows that are not token lists (e.g. NaN after exploding an empty list)
        # and empty lists both get a zero vector
        if not isinstance(sentence, list) or not sentence:
            mean_embeddings.append(np.zeros(embedding_size))
            continue

        # Out-of-vocabulary words contribute a zero vector, so they pull the
        # mean towards zero instead of being skipped
        word_embeddings = [
            keyed_vectors[word] if word in keyed_vectors.key_to_index
            else np.zeros(embedding_size)
            for word in sentence
        ]
        mean_embeddings.append(np.mean(word_embeddings, axis=0))

    # Store the mean embeddings as a new column (one list of floats per row)
    train_dataset['mean_embeddings'] = np.array(mean_embeddings).tolist()

    return train_dataset

In [None]:
df_train_exploded= find_words_in_w2v(df_train_exploded, w2v_model)
df_test_exploded = find_words_in_w2v(df_test_exploded, w2v_model)

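**SOS!! The results in mean_embeddings are different from those in 'Cosine_Similarity.ipynb'. We have to check and compare the 2 files carefully to find out what is going wrong here, because in 'Cosine_Similarity.ipynb' the results are reasonable!** One place to look within this notebook: find_words_in_w2v averages in a zero vector for every out-of-vocabulary word, while calculate_avg_word2vec_list skips such words, so the two averages are not computed the same way.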
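
How out-of-vocabulary words are handled is one thing that can change the stored averages. Padding them with zero vectors, as find_words_in_w2v does, and skipping them, as calculate_avg_word2vec_list does, can be compared on toy data. The 2-dimensional vectors below are hypothetical stand-ins for the Word2Vec vocabulary. Whether this accounts for the difference from the other notebook is not known; it is only a candidate to rule out.

```python
import numpy as np

# Hypothetical 2-d "embeddings"; the real model has many more dimensions
toy_vectors = {"harry": np.array([1.0, 0.0]), "potter": np.array([0.0, 1.0])}
sentence = ["harry", "potter", "radcliffe", "gossip"]  # last two are OOV here

def mean_with_zero_padding(tokens, vectors, dim=2):
    """Average over all tokens, using a zero vector for unknown words."""
    rows = [vectors.get(t, np.zeros(dim)) for t in tokens]
    return np.mean(rows, axis=0)

def mean_skipping_oov(tokens, vectors, dim=2):
    """Average over known words only; zero vector if none are known."""
    rows = [vectors[t] for t in tokens if t in vectors]
    return np.mean(rows, axis=0) if rows else np.zeros(dim)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

padded = mean_with_zero_padding(sentence, toy_vectors)   # [0.25, 0.25]
skipped = mean_skipping_oov(sentence, toy_vectors)       # [0.5, 0.5]

# The padded mean is the skipped mean scaled by (known words / all words),
# so the stored values differ but the direction is the same.
highlight = np.array([1.0, 0.2])
same_cosine = np.isclose(cosine(padded, highlight), cosine(skipped, highlight))
```

Because zero padding only rescales the vector, it changes the numbers in mean_embeddings (and any model trained on them) but leaves cosine similarity to the highlights unchanged.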

Average w2v for each sentence in the list of highlights!

In [None]:
def calculate_avg_word2vec_list(sentence_list, model):
    avg_word2vec_list = []
    for sentence in sentence_list:
        # Filter out words that are not in the model's vocabulary
        valid_words = [word for word in sentence if word in model.wv.key_to_index]
        # Check if there are valid words before calculating the average
        if valid_words:
            # Use numpy's vstack to vertically stack the word vectors
            word_vectors = np.vstack([model.wv[word] for word in valid_words])

            # Calculate the mean along the first axis (axis=0)
            avg_word2vec = np.mean(word_vectors, axis=0)

            avg_word2vec_list.append(avg_word2vec)
        else:
            # If no valid words, append a zero vector
            avg_word2vec_list.append(np.zeros(model.vector_size))

    return avg_word2vec_list

In [None]:
df_train_exploded['avg_word2vec_highlights'] = df_train_exploded['clean_highlights'].apply(lambda x: calculate_avg_word2vec_list(x, model=w2v_model))
df_test_exploded['avg_word2vec_highlights'] = df_test_exploded['clean_highlights'].apply(lambda x: calculate_avg_word2vec_list(x, model=w2v_model))


In [None]:
def calculate_cosine_similarity(sentence, highlights_avg_word2vec_list):
    return [cosine_similarity([sentence], [highlight_sentence])[0][0] for highlight_sentence in highlights_avg_word2vec_list]


In [None]:
df_train_exploded['cosine_similarity'] = df_train_exploded.apply(lambda row: calculate_cosine_similarity(row['mean_embeddings'], row['avg_word2vec_highlights']), axis=1)
df_test_exploded['cosine_similarity'] = df_test_exploded.apply(lambda row: calculate_cosine_similarity(row['mean_embeddings'], row['avg_word2vec_highlights']), axis=1)


In [None]:
def get_max_cosine_similarity(cosine_similarity_list):
    if cosine_similarity_list:
        return max(cosine_similarity_list)
    else:
        return 0

In [None]:
df_train_exploded['max_cosine_similarity'] = df_train_exploded['cosine_similarity'].apply(get_max_cosine_similarity)
df_test_exploded['max_cosine_similarity'] = df_test_exploded['cosine_similarity'].apply(get_max_cosine_similarity)


In [None]:
df_train_exploded[['sentences', 'clean_highlights', 'max_cosine_similarity']]

In [None]:
def assign_value(cosine_similarity):
    return 1 if cosine_similarity >= 0.85 else 0

In [None]:
df_train_exploded['label'] = df_train_exploded['max_cosine_similarity'].apply(assign_value)
df_test_exploded['label'] = df_test_exploded['max_cosine_similarity'].apply(assign_value)
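
The labelling rule (maximum cosine similarity against any highlight sentence, then a 0.85 threshold) can also be computed for a whole article in one matrix product. The sketch below uses plain NumPy on invented 2-dimensional vectors; the function name `max_cosine_labels` is an assumption for illustration.

```python
import numpy as np

def max_cosine_labels(sentence_vecs, highlight_vecs, threshold=0.85):
    """Return (best similarity per sentence, 0/1 label per sentence)."""
    S = np.asarray(sentence_vecs, dtype=float)
    H = np.asarray(highlight_vecs, dtype=float)

    # L2-normalise each row; all-zero rows stay zero, giving similarity 0
    s_norm = np.linalg.norm(S, axis=1, keepdims=True)
    h_norm = np.linalg.norm(H, axis=1, keepdims=True)
    S = np.divide(S, s_norm, out=np.zeros_like(S), where=s_norm > 0)
    H = np.divide(H, h_norm, out=np.zeros_like(H), where=h_norm > 0)

    sims = S @ H.T              # shape: (n_sentences, n_highlights)
    best = sims.max(axis=1)     # closest highlight for each sentence
    return best, (best >= threshold).astype(int)

# Toy article with 3 sentences (the last one has no known words) and 2 highlights
sentence_vecs = [[1.0, 0.0], [1.0, 1.0], [0.0, 0.0]]
highlight_vecs = [[1.0, 0.1], [0.0, 1.0]]

best, labels = max_cosine_labels(sentence_vecs, highlight_vecs)
```

Exposing the threshold as a parameter makes it easier to check how the share of label-1 sentences changes as it moves away from 0.85.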


In [None]:
df_train_exploded[['sentences', 'clean_highlights', 'max_cosine_similarity', 'label']]

In [None]:
def calculate_sentence_length(sentence):
    # A tokenized sentence: count its tokens
    if isinstance(sentence, list):
        return len(sentence)

    # A raw string: split on whitespace and count the tokens
    if isinstance(sentence, str):
        return len(sentence.split())

    # Anything else (e.g. NaN from an exploded empty list) has no length
    return 0
# Apply the function to the DataFrame
df_train_exploded['sentence_length'] = df_train_exploded['sentences'].apply(calculate_sentence_length)
df_test_exploded['sentence_length'] = df_test_exploded['sentences'].apply(calculate_sentence_length)

# Display the resulting DataFrame with the new 'sentence_length' column
print(df_train_exploded[['sentences', 'clean_highlights', 'max_cosine_similarity', 'label', 'sentence_length']])
print(df_test_exploded[['sentences', 'clean_highlights', 'max_cosine_similarity', 'label', 'sentence_length']])


                                               sentences  \
0      [harry, potter, star, daniel, radcliffe, gain,...   
1      [daniel, radcliffe, harry, potter, harry, pott...   
2      [disappointment, gossip, columnist, around, wo...   
3      [plan, one, people, soon, turn, 18, suddenly, ...   
4      [thing, like, buying, thing, cost, 10, pound, ...   
...                                                  ...   
12042  [1915, made, singing, debut, chicago, symphony...   
12043  [world, war, gave, recital, benefited, red, cr...   
12044  [1918, began, nearly, year, stay, france, sing...   
12045  [experience, led, breakdown, loss, singing, vo...   
12046  [mental, floss, sing, national, anthem, sporti...   

                                        clean_highlights  \
0      [[harry, potter, star, daniel, radcliffe, get,...   
1      [[harry, potter, star, daniel, radcliffe, get,...   
2      [[harry, potter, star, daniel, radcliffe, get,...   
3      [[harry, potter, star, daniel, r

In [None]:

df_test_exploded


In [None]:
train_columns_to_keep = ['mean_embeddings', 'position','max_cosine_similarity', 'label', 'sentence_length']

test_columns_to_keep = ['mean_embeddings', 'position','max_cosine_similarity', 'label', 'sentence_length','new_article_sentence']

# Drop columns not in the list
df_filtered_train = df_train_exploded.drop(columns=df_train_exploded.columns.difference(train_columns_to_keep), axis=1)
df_filtered_test = df_test_exploded.drop(columns=df_test_exploded.columns.difference(test_columns_to_keep), axis=1)

# Display the resulting DataFrame with only the desired columns
print(df_filtered_train)
print(df_filtered_test)

       position                                    mean_embeddings  \
0          1.00  [0.07560794055461884, 0.09687140583992004, 0.0...   
1          0.88  [0.33138221502304077, 0.43603840470314026, 0.2...   
2          0.75  [0.03037995472550392, 0.05898478999733925, -0....   
3          0.62  [0.0943671315908432, 0.020216917619109154, 0.0...   
4          0.50  [-0.034966520965099335, 0.18290971219539642, 0...   
...         ...                                                ...   
12042      0.14  [0.25769147276878357, 0.03035671077668667, 0.0...   
12043      0.11  [0.0879194438457489, -0.016580261290073395, 0....   
12044      0.08  [0.2691250443458557, -0.052242983132600784, 0....   
12045      0.05  [0.25937116146087646, 0.0009072307148016989, 0...   
12046      0.03  [0.24269549548625946, 0.4173557460308075, -0.1...   

       max_cosine_similarity  label  sentence_length  
0                   0.936883      1               18  
1                   0.844604      0              

Note: to be able to continue with the summary generation from here, we can also keep the columns 'id', 'sentences' and 'highlights'.

In [None]:
df_filtered_train.to_csv('sample_trainset1000.csv')
# NOTE: despite 'trainset' in the file name, this file holds the 300-article test sample
df_filtered_test.to_csv('sample_trainset300.csv')


See the examples from the GitHub dataset:

In [None]:
#tr=pd.read_csv('/content/drive/MyDrive/NLU/Task_2/Training_Data_Extension_3.csv')
#tst=pd.read_csv('/content/drive/MyDrive/NLU/Task_2/Test_Data_Extension_3.csv')
#tr
#tst