# Problem:
## Create a timeline summarization list, that shows the major events of the IsraelHamas war that had occurred during the time-period mentioned in the Json file

### Steps followed Thought process
- Loading and Preprocessing the json data.
- Taking the input from the user like Israel-Hamas War and Generating the keywords
- Extract the date and the news article description
- Summarize the description using the llm model (here, used the facebook/bart-large-cnn for text summary generation)
- generate the vector embeddings for each article(i.e. description) and then form clusters of the similar articles(i.e. description) by means of some clustering algorithm (here, used DBSCAN)
- Using the keywords filter the articles based on the description and then get a central vector which respresent all the filtered articles via keywords, now based on the central vector find a cluster that it belongs to based on all the data and corseponding to the vector get all the articles
- get the date corresponding to these articles and that will your results
- further summarize the articels by forming the cluster and using title or description to generate new title or desciption for a enchaned summary like mentioned in the problem document "An example of a time timeline summarization list is as follows (excerpt from Wikipedia):". 

In [None]:
# please download the spacy model
#!python -m spacy download en_core_web_lg
# not required to run below commands just for reference
#!pip install spacy
#!pip install keybert
#!python -m spacy download en
#!pip install tf-keras

In [50]:
# code for the json data preprocessing and saving into new file after saving

import json
from datetime import datetime

def preprocess_article(article):
    # Extracting and converting dates if present
    if 'dateModified' in article:
        modified_date_str = article['dateModified']['$date']
        modified_date = datetime.fromisoformat(modified_date_str[:-1])
        article['dateModified'] = modified_date.strftime('%Y-%m-%d %H:%M:%S')

    if 'scrapedDate' in article:
        scraped_date_str = article['scrapedDate']['$date']
        scraped_date = datetime.fromisoformat(scraped_date_str[:-1])
        article['scrapedDate'] = scraped_date.strftime('%Y-%m-%d %H:%M:%S')

    # Trimming whitespace in text fields if present
    if 'articleBody' in article:
        article['articleBody'] = article['articleBody'].strip()

    if 'title' in article:
        article['title'] = article['title'].strip()

    if 'source' in article:
        article['source'] = article['source'].strip()
    
    return article

def preprocess_articles(articles):
    return [preprocess_article(article) for article in articles]

# Load data from file
with open('news.article.json', 'r', encoding='utf-8') as file:
    data = json.load(file)

# Preprocess the data
processed_articles = preprocess_articles(data)

# Write the processed articles to a new JSON file
output_filename = 'processed_articles.json'
with open(output_filename, 'w', encoding='utf-8') as output_file:
    json.dump(processed_articles, output_file, indent=2)

print(f"Processed articles have been written to '{output_filename}'.")


Processed articles have been written to 'processed_articles.json'.


In [None]:
import json
from datetime import datetime
import spacy
from transformers import pipeline
import nltk
from nltk.tokenize import sent_tokenize
from keybert import KeyBERT


# Ensure nltk resources are downloaded
nltk.download('punkt')

# Load SpaCy model
nlp = spacy.load("en_core_web_lg")

# Initialize KeyBERT model
kw_model = KeyBERT()

# Sample JSON data
json_data = '''[
  {
  "articleBody": "The UN secretary general, António Guterres, has condemned an explosion that left three UN military observers and a Lebanese interpreter wounded when a shell exploded near them while they were patrolling the southern Lebanese border.\n\nThe blast came as clashes between the Israeli military and Hezbollah militants escalated in recent weeks.\n\nBoth sides have been exchanging fire since war broke out between Israel and Hamas in Gaza.\n\nThree UN Truce Supervision Organization (Untso) “military observers and one Lebanese language assistant on a foot patrol along the Blue Line were injured when an explosion occurred near their location”, UN Interim Force in Lebanon (Unifil) spokesperson Andrea Tenenti said in a statement on Saturday.\n\nThe wounded were “evacuated for medical treatment”, Tenenti added.\n\nPeacekeepers from Unifil patrol the so-called Blue Line, the border demarcated by the UN in 2000 when Israeli troops pulled out of southern Lebanon.\n\nThe Untso supports the peacekeeping mission.\n\nNorway’s defence ministry said a Norwegian UN observer was “lightly injured” and had been admitted to hospital.\n\n“The circumstances surrounding the attack are unclear,” defence ministry spokesperson Hanne Olafsen told Norwegian news agency NTB.\n\nTenenti told AFP that the other two observers were from Australia and Chile, adding that all four wounded were in “stable” condition while Australia’s defence department said the Australian’s injuries were not life-threatening.\n\nLocal Lebanese media, citing security officials, said an Israeli drone strike targeted the observers in the southern village of Wadi Katmoun near the border town of Rmeich.\n\n\n\nBut the Israeli military posted on social media platform X: “Contrary to the reports, the IDF did not strike a @UNIFIL vehicle in the area of Rmeish this morning.”\n\nTenenti said Unifil had informed all warring parties of their patrols as usual and the observers’ vehicle was carrying clear UN markings. The three military observers were unarmed, he said.\n\n\n\nUnifil is “investigating the origin of the explosion” but it was difficult to put investigators on the ground immediately because of the ongoing exchange of fire, added Tenenti.\n\n“Safety and security of UN personnel must be guaranteed,” Tenenti said, urging “all actors to cease the current heavy exchanges of fire before more people are unnecessarily hurt.”\n\nA UN spokesperson Stephane Dujarric said António Guterres condemned the explosion and expressed “grave concern” at the daily exchanges of fire between armed groups in Lebanon and Israeli forces.\n\n“These hostile actions have not only disrupted the livelihoods of thousands of people, but they also pose a grave threat to the security and stability of Lebanon, Israel, and the region,” Dujarric said.\n\n\n\nGuterres urges all action to refrain from further violations of the 2006 cessation of hostilities “and to pursue a diplomatic solution to the crisis”, Dujarric said, adding that the UN chief stands ready to support such efforts.\n\n\n\nLebanese caretaker prime minister Najib Mikati also condemned the incident in a statement.\n\n\n\nUnifil was created to oversee the withdrawal of Israeli troops from southern Lebanon after Israel’s 1978 invasion.\n\nThe UN expanded its mission after the 2006 war between Israel and Hezbollah, allowing peacekeepers to deploy along the Israeli border to help the Lebanese military extend its authority into the country’s south for the first time in decades.\n\nWith Associated Press, Australian Associated Press and Agence France-Presse",
  "dateModified": {
    "$date": "2024-03-31T00:00:00.000Z"
  },
  "scrapedDate": {
    "$date": "2024-03-31T03:09:50.586Z"
  },
  "source": "https://www.theguardian.com/",
  "title": "United Nations secretary general condemns explosion that injured UN observers in southern Lebanon"
  },
  {
  "articleBody": "Aam Aadmi Party (AAP) leader Raghav Chadha recently found himself amid a political storm after British Labour Party MP Preet Kaur Gill posted on social media about their meeting in London last week to discuss “global health security”.\n\nOn Saturday, Union Minister Anurag Thakur targeted the AAP without naming Chadha. “What kind of government is it (in Punjab)? An MP from the state (Punjab) stands with those forces who speak against India and support terrorism. He poses for a picture with them wearing a smile on his face. The AAP did not react to the BJP’s criticism of Chadha’s UK visit and his meeting with the Labour MP.\n\nGreat to meet @raghav_chadha in Parliament. – He is a Rajya Sabha MP from Punjab, India. I look forward to a discussion around global health security and antimicrobial resistance. pic.twitter.com/9Gz6gc7TZi — Preet Kaur Gill MP (@PreetKGillMP) March 20, 2024\n\nAdvertisement\n\nGill, the Shadow Minister for Primary Care and Public Health, made history in 2017 when she became the first woman Sikh MP of the United Kingdom. She has also often drawn scrutiny from the Indian government due to her perceived support for Khalistan.\n\nThe 51-year-old chairs the All-Party Parliamentary Group (APPG) for British Sikhs and serves as Vice Chair for the APPG on International Freedom of Religion or Belief. In February, she alleged in the House of Commons that agents with ties to India were targeting Sikhs in the UK. Gill mentioned that some British Sikhs were even on a “hit list” of what she termed “transnational repression”, and questioned Security Minister Tom Tugendhat about the British government’s response.\n\nAdvertisement\n\nIn August 2020, Gill engaged in a public row on X (at the time, Twitter) with Conservative British MP Raminder Singh Ranger who posted that British Prime Minister Boris Johnson does not endorse Khalistan. Gill promptly challenged him citing the principle of self-determination enshrined in the Charter of the United Nations.\n\nThe eldest of seven siblings, Preet Gill was born to a bus driver father and seamstress mother at Edgbaston, Birmingham, in 1972. Her father Dalvir Singh Shergill who migrated from the village of Jamsher in Jalandhar to West Midlands in 1962 was known as much for his height — he was a towering 6’4” — as for his work as the president of the first gurdwara in the UK — the Guru Nanak Gurdwara in Smethwick — from 1984 to 2004.\n\nAdvertisement\n\nPreet Gill had her first taste of public life when she was elected college president of Bournville College, where she studied psychology and sociology. She moved to London to get an honours degree in sociology and social work from the University of East London and became a child services manager. After completing her post-graduation, Gill came to India where she spent some time working with street children in Delhi. She also lived in a kibbutz in Israel for some time.\n\nGill’s parliamentary tenure\n\nA very active parliamentarian, Gill is known for her work in public health and environment. She first incurred the ire of the Indian authorities when she called for the release of British national Jagtar Singh Johal, aka Jaggi Johal, who she claimed was being wrongly held and tortured by the police since 2017, the year she was elected MP from Edgbaston, Birmingham. It is one demand she has raised every now and then. Arrested while on a visit to Punjab in 2017, Johal is accused of being involved in targeted killings in Punjab and at present is in Delhi’s Tihar jail.\n\nPreet Gill is still known to maintain her ties with her extended family in Jamsher, where a celebration was held to commemorate her victory in 2017. Around that time she triggered a row when during a visit to Punjab after her election she expressed concern about the prevalence of drugs, a no-go topic for foreign delegates. She also seemed to endorse the year-long farmers’ protest in 2020-’21 against the now-repealed farm laws.\n\nAdvertisement\n\nIn December 2021, soon after a sacrilege and lynching incident at the Golden Temple in Amritsar, the Indian High Commission in Britain issued a statement slamming Gill when she first wrote and then deleted a post that referred to a “Hindu terrorist” behind the act. Later she posted, “Beadbi incidents are unacceptable but the lynching of another person is also unacceptable.”\n\nIn 2023, Gill triggered a row in the Sikh community when she defended faith leaders amid a debate on domestic abuse among Sikh women sparked by a report from Sikh Women’s Aid. The report surveyed 839 Sikh women in Britain, revealing that nearly two-thirds had experienced domestic abuse, including incidents involving faith leaders. Gill took to the WhatsApp group “Sikhs in Labour” to seek a written apology to the “guru ghars”.\n\nOf late, Gill has been at the receiving end of threats, which started coming last year when she got a threatening email. In January, she told BBC that people had been threatening to protest outside her home. “I am really worried in a way I have never been worried before,” she told the broadcaster.\n\nAdvertisement\n\nThe Labour leader is married to fellow social worker Sureash Arora, with whom she has two daughters aged 12 and 14.",
  "dateModified": {
    "$date": "2024-03-26T16:01:17.000Z"
  },
  "scrapedDate": {
    "$date": "2024-03-31T03:13:43.646Z"
  },
  "source": "https://indianexpress.com/",
  "title": "Raghav Chadha stirs a row as he meets this UK MP: Who is Preet Kaur Gill?"
  }
]'''

with open('processed_articles.json') as f:
    json_data = f.read()
# Given text to extract keywords
given_text = "Israel-Hamas war"

# Step 1: Extract keywords from the given text
keywords = kw_model.extract_keywords(given_text, keyphrase_ngram_range=(1, 2), stop_words=None)
keywords = [kw[0] for kw in keywords]

# Step 2: Parse the JSON data
data = json.loads(json_data, strict=False)

# Step 3: Extract dates and event descriptions
events = []
for item in data:
    if 'dateModified' in item:
        date_str = item.get('dateModified', {})
        if date_str:
            date = datetime.strptime(date_str, '%Y-%m-%d %H:%M:%S')
            description = item.get('articleBody', 'No Description')
            if len(description) < 20:
                continue
            events.append((date, description))
    elif 'scrapedDate' in item:
        date_str = item.get('scrapedDate', {})
        if date_str:
            date = datetime.strptime(date_str, '%Y-%m-%d %H:%M:%S')
            description = item.get('articleBody', 'No Description')
            if len(description) < 20:
                continue
            events.append((date, description))
print("step 3 done")
# Step 4: Summarize events using a pre-trained summarization model
summarizer = pipeline('summarization', model='facebook/bart-large-cnn')
print("step 4_1 done")


In [79]:
def summarize_text(text):
    #sentences = sent_tokenize(text)
    summarized = summarizer(text[:3500], do_sample=False)
    return summarized[0]['summary_text']

summarized_events = [(date, summarize_text(description)) for date, description in events]

print("step 4_2 done")


Your max_length is set to 142, but your input_length is only 81. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=40)
Your max_length is set to 142, but your input_length is only 87. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=43)
Your max_length is set to 142, but your input_length is only 41. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=20)
Your max_length is set to 142, but your input_length is only 43. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=21)
Your

step 4_2 done


In [82]:
for i,j in enumerate(summarized_events):
    print(len(summarized_events[i][1]))

360
403
339
350
382
329
311
366
238
343
310
294
414
351
395
266
324
298
278
342
319
414
322
402
323
309
336
266
274
248
346
438
399
341
295
370
281
288
423
299
375
358
366
330
388
335
396
307
285
367
378
285
399
287
316
451
345
310
343
354
334
331
434
288
497
303
341
433
326
297
401
334
335
417
289
388
359
302
375
263
356
247
347
319
281
381
276
338
289
321
341
351
393
310
287
282
267
320
303
348
348
328
315
332
293
309
225
272
279
292
333
245
328
345
405
242
361
342
278
295
268
296
343
341
364
321
397
271
335
334
265
309
212
341
392
373
336
324
376
326
312
434
328
385
346
330
344
286
383
345
286
405
323
242
299
339
297
302
344
308
332
296
297
308
415
302
311
266
280
356
284
348
360
283
260
388
265
456
318
277
340
355
299
291
270
392
246
403
317
281
234
279
371
386
341
334
279
423
354
339
294
370
284
317
276
377
313
393
261
361
394
421
303
405
274
266
287
374
374
386
322
391
281
354
292
302
351
398
314
337
291
267
316
267
384
316
317
335
382
318
285
419
312
327
334
406


In [84]:
print(summarized_events[:5])

[(datetime.datetime(2023, 12, 25, 8, 5, 20), 'The hype over artificial intelligence is starting to fade. Training such large language models bolsters personal data protection concerns. Freely using public web data to create AI-based products and services has also raised pressing copyright and data ownership issues. Some of the questions raised by those cases might be answered in courts as early as 2024.'), (datetime.datetime(2023, 12, 25, 0, 0), 'Usman Khawaja had “Freedom is a human right” and “All lives are equal’ written on his boots in the colours of the Palestinian flag. The Australian cricketer was also reprimanded by the ICC for sporting a black armband, which the batter said was for a personal bereavement. ICC spokesperson said “personal messages of this nature are not allowed as per Clause F of the Clothing and Equipment Regulations’'), (datetime.datetime(2023, 12, 25, 4, 39, 10), 'Palestinians say they feel "no joy" this Christmas as Israel bombed the besieged Palestinian ter

In [171]:

nlp = spacy.load('en_core_web_lg')
sent_vecs = {}
docs = []
for date, description in summarized_events:
    doc = nlp(description)
    docs.append(doc)
    sent_vecs.update({description: doc.vector})
sentences = list(sent_vecs.keys())
vectors = list(sent_vecs.values())


In [172]:
print(sentences[0], "\n", vectors[0])

The hype over artificial intelligence is starting to fade. Training such large language models bolsters personal data protection concerns. Freely using public web data to create AI-based products and services has also raised pressing copyright and data ownership issues. Some of the questions raised by those cases might be answered in courts as early as 2024. 
 [-1.3956993   0.13149853 -1.4270203   0.8394275   4.406913    0.4061782
  0.09317484  4.499222   -1.0841479  -1.9239994   6.662167    1.7905635
 -4.4193454   0.9573397   0.61504555  2.3979592   2.1753      0.75029546
 -3.2653315  -2.7306647   1.2743961  -0.7964604  -3.2359827  -1.0155948
 -1.0168236  -2.7793906  -1.3027642  -0.8391294  -1.0317029   0.6878147
  0.57955426 -0.11089796 -2.2075555  -1.3364551  -2.3851416  -1.1672047
 -0.9560001   1.1102964   0.8921238   0.16967647  1.3739581  -0.03930195
 -1.7274939   0.6975651  -2.503072    1.0653415   2.1759484  -2.7810497
 -0.57786113  1.011024   -0.53813064  2.531317    0.0847920

In [173]:
import numpy as np
from sklearn.cluster import DBSCAN
import pandas as pd
x = np.array(vectors)
n_classes = {}
for i in np.arange(0.001, 1, 0.002):
    dbscan = DBSCAN(eps=i, min_samples=2, metric='cosine').fit(x)
    n_classes.update({i: len(pd.Series(dbscan.labels_).value_counts())})
dbscan = DBSCAN(eps=0.08, min_samples=2, metric='cosine').fit(x)

In [174]:
dbscan

In [175]:
n_classes

{0.001: 1,
 0.003: 1,
 0.005: 1,
 0.007: 2,
 0.009000000000000001: 2,
 0.011: 3,
 0.013000000000000001: 3,
 0.015: 3,
 0.017: 3,
 0.019000000000000003: 5,
 0.021: 6,
 0.023: 6,
 0.025: 8,
 0.027000000000000003: 9,
 0.029: 9,
 0.031: 9,
 0.033: 15,
 0.035: 14,
 0.037000000000000005: 16,
 0.039: 16,
 0.041: 18,
 0.043000000000000003: 16,
 0.045: 13,
 0.047: 13,
 0.049: 10,
 0.051000000000000004: 10,
 0.053000000000000005: 10,
 0.055: 9,
 0.057: 9,
 0.059000000000000004: 8,
 0.061: 10,
 0.063: 10,
 0.065: 8,
 0.067: 6,
 0.069: 5,
 0.07100000000000001: 5,
 0.07300000000000001: 5,
 0.075: 6,
 0.077: 6,
 0.079: 5,
 0.081: 4,
 0.083: 4,
 0.085: 3,
 0.08700000000000001: 3,
 0.089: 3,
 0.091: 3,
 0.093: 3,
 0.095: 3,
 0.097: 3,
 0.099: 3,
 0.101: 3,
 0.10300000000000001: 3,
 0.10500000000000001: 3,
 0.107: 2,
 0.109: 2,
 0.111: 2,
 0.113: 2,
 0.115: 2,
 0.117: 2,
 0.11900000000000001: 2,
 0.121: 2,
 0.123: 2,
 0.125: 2,
 0.127: 2,
 0.129: 2,
 0.131: 2,
 0.133: 2,
 0.135: 2,
 0.137: 2,
 0.139: 2

In [186]:
results = pd.DataFrame({'description':dbscan.labels_, 'sent':sentences})
ex_results = results[results.description == 2].sent.tolist()


In [187]:
results.sent[2]

'Palestinians say they feel "no joy" this Christmas as Israel bombed the besieged Palestinian territory. Festivities effectively scrapped in the occupied West Bank city of Bethlehem, revered as the birthplace of Jesus Christ. Hamas militant group reported 50 strikes in central areas early on Monday, including in the Nuseirat refugee camp.'

In [188]:
ex_results[0]

"US-Canada military center 'tracks' Santa for 68th year. Website at www.noradsanta.org shows Santa Claus and his reindeer on their imagined worldwide delivery route. Santa tracker presented by North American Aerospace Defense Command dates to 1955. NORAD conducts aerospace and maritime operations, including monitoring for missile launches."

In [189]:
ex_results[1]



In [190]:
len(ex_results)

2

In [191]:
event_df = []
j  = 0
for data, description in summarized_events:
    for i in ex_results:
        if description is i:
            print(j, data, i)
            event_df.append((j, data, i))
            break
            
    j = j + 1
print(event_df)

180 2023-12-25 12:01:49 US-Canada military center 'tracks' Santa for 68th year. Website at www.noradsanta.org shows Santa Claus and his reindeer on their imagined worldwide delivery route. Santa tracker presented by North American Aerospace Defense Command dates to 1955. NORAD conducts aerospace and maritime operations, including monitoring for missile launches.


In [192]:
summarized_events[18]

(datetime.datetime(2023, 12, 25, 6, 7, 18),
 'At least 68 people were killed by an Israeli strike in central Gaza, health officials say. The number of Israeli soldiers killed in combat over the weekend rose to 15. The war has killed roughly 20,400 Palestinians and displaced almost all of the territory’s 2.3 million people.')

In [193]:
summarized_events[121]

(datetime.datetime(2023, 12, 24, 22, 10),
 'At least 70 people were killed in Gaza in one of the deadliest strikes of the war. The number of Israeli soldiers killed in combat over the weekend rose to 15. The war has devastated parts of Gaza, killed roughly 20,400 Palestinians and displaced almost all of the territory’s 2.3 million people.')

In [194]:
def Sort_Tuple(t): 
    return(sorted(t, key = lambda x: x[1]))
event_df = Sort_Tuple(event_df)
event_df

[(199,
  datetime.datetime(2023, 12, 25, 10, 6, 14),
 (180,
  datetime.datetime(2023, 12, 25, 12, 1, 49),
  "US-Canada military center 'tracks' Santa for 68th year. Website at www.noradsanta.org shows Santa Claus and his reindeer on their imagined worldwide delivery route. Santa tracker presented by North American Aerospace Defense Command dates to 1955. NORAD conducts aerospace and maritime operations, including monitoring for missile launches.")]

In [195]:
from sklearn.metrics import pairwise_distances_argmin_min
def get_mean_vector(sents):
    a = np.zeros(len(vectors[0]))
    for sent in sents:
        a = a + nlp(sent).vector/2
    return a/len(sents)
def get_central_vector(sents):
    vecs = []
    for sent in sents:
        doc = nlp(description)
        vecs.append(doc.vector)
    mean_vec = get_mean_vector(sents)
    index = pairwise_distances_argmin_min(np.array([mean_vec]), vecs)[0][0]
    return sents[index]

In [197]:
keywords

['hamas war', 'israel hamas', 'hamas', 'israel', 'war']

In [210]:
# Step 5: Filter relevant summarized events using extracted keywords
def filter_israel_hamas_war_events(events, keywords):
    relevant_events = []
    for event in events:
        if any(keyword.lower() in event[1].lower() for keyword in keywords):
            date, desc = event
            relevant_events.append(desc)
    return relevant_events

summarized_events_after_filter = filter_israel_hamas_war_events(summarized_events, keywords)

def filter_israel_hamas_war_events_index(events, keywords):
    relevant_events = []
    for i in range(len(events)):
        if any(keyword.lower() in events[i][1].lower() for keyword in keywords):
            relevant_events.append(i)
    return relevant_events

summarized_events_after_filter_index = filter_israel_hamas_war_events(summarized_events, keywords)


In [211]:
print(len(summarized_events_after_filter))

print(len(summarized_events))

print(len(summarized_events_after_filter_index))


169
246
169


In [214]:
summarized_events_after_filter[0]

'Palestinians say they feel "no joy" this Christmas as Israel bombed the besieged Palestinian territory. Festivities effectively scrapped in the occupied West Bank city of Bethlehem, revered as the birthplace of Jesus Christ. Hamas militant group reported 50 strikes in central areas early on Monday, including in the Nuseirat refugee camp.'

In [215]:
print("isreal hamas war")

single_desc_from_all = get_central_vector(summarized_events_after_filter)
print(single_desc_from_all)

isreal hamas war
Palestinians say they feel "no joy" this Christmas as Israel bombed the besieged Palestinian territory. Festivities effectively scrapped in the occupied West Bank city of Bethlehem, revered as the birthplace of Jesus Christ. Hamas militant group reported 50 strikes in central areas early on Monday, including in the Nuseirat refugee camp.


In [218]:
# find the cluster in which this single_desc_from_all desciption lies
# Number of clusters in labels, ignoring noise if present.
labels = dbscan.labels_
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
n_noise_ = list(labels).count(-1)
print(labels)
print(n_clusters_)
print(n_noise_)

[-1  0  0  0  0  0  0  0 -1  0 -1  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0 -1  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0 -1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0 -1  0 -1  0  0  0  0  0  0  0  0  0  0  0 -1 -1  0  0
  0  0 -1  0  0  0  0 -1  0  0  0  0  0  0  0  0  0  0  0  0  0 -1  0  0
  0  0 -1  0  0  0  0  0  0 -1  0 -1  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 -1  0  0  0  0 -1  0  0  0
  0  0  0 -1  0  1 -1  0  0  2  0 -1  0 -1  0  0  0  0  0 -1 -1  0  0  0
  0  0  0  2  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0 -1  0  0  1  0  0  0  0  0  0  0  0
  0]
3
24


In [237]:
cluster_assigned_to_single_desc_from_all = -1
for i in range(n_clusters_):
    if single_desc_from_all in results[results.description == i].sent.tolist():
        cluster_assigned_to_single_desc_from_all = i
print(cluster_assigned_to_single_desc_from_all)
result_desc_cluster_assigned_to_single_desc_from_all = results[results.description == cluster_assigned_to_single_desc_from_all].sent.tolist()
print(len(result_desc_cluster_assigned_to_single_desc_from_all))
print(result_desc_cluster_assigned_to_single_desc_from_all[:3])
single_desc_from_all_event_df = []
j  = 0
for date, description in summarized_events:
    for i in result_desc_cluster_assigned_to_single_desc_from_all:
        if description is i:
            single_desc_from_all_event_df.append((j, date, i))
            break
            
    j = j + 1
print(single_desc_from_all_event_df[0])
single_desc_from_all_event_df = Sort_Tuple(single_desc_from_all_event_df)
print(single_desc_from_all_event_df[0])

0
213
['Usman Khawaja had “Freedom is a human right” and “All lives are equal’ written on his boots in the colours of the Palestinian flag. The Australian cricketer was also reprimanded by the ICC for sporting a black armband, which the batter said was for a personal bereavement. ICC spokesperson said “personal messages of this nature are not allowed as per Clause F of the Clothing and Equipment Regulations’', 'Palestinians say they feel "no joy" this Christmas as Israel bombed the besieged Palestinian territory. Festivities effectively scrapped in the occupied West Bank city of Bethlehem, revered as the birthplace of Jesus Christ. Hamas militant group reported 50 strikes in central areas early on Monday, including in the Nuseirat refugee camp.', "The year 2023 proved to be a noteworthy period marked by several significant global events. From the wrestlers' protest to the breach of Parliament security, there were numerous major headlines. The year was particularly memorable in the cont

In [238]:
df_Data = pd.DataFrame(single_desc_from_all_event_df, columns=['index', 'date', 'description'])
df_Data.to_csv('TLS_news.csv')
df_Data.head()

Unnamed: 0,index,date,description
0,171,2014-09-10 08:37:44,Pope Francis laments war in Holy Land on solem...
1,168,2023-12-18 00:00:00,"Heavy rains have started in the Gaza Strip, ex..."
2,206,2023-12-18 00:00:00,"The oil major's move, announced on Monday (Dec..."
3,99,2023-12-18 07:15:39,Israeli forces launch fresh attacks throughout...
4,201,2023-12-19 07:27:00,The current Western governments have been givi...


In [240]:
df_Data.shape

(213, 3)

In [278]:
#[i for i in list(data[0]['title'])]
data = json.loads(json_data, strict=False)
#print(data[0]['title'])
titles_extraction = [i['title'] for i in data]
titles_extraction_filter = []
for i in range(len(titles_extraction)):
    if i in list(df_Data['index']):
        ind = list(df_Data['index']).index(i)
        
        titles_extraction_filter.append((i, df_Data.loc[ind,'date'], titles_extraction[i]))

print(len(titles_extraction_filter))
titles_extraction_filter = Sort_Tuple(titles_extraction_filter)
print(titles_extraction_filter[0])

213
(171, Timestamp('2014-09-10 08:37:44'), 'Israeli soldier killed in rocket attack from Lebanon')


In [281]:
from sklearn.cluster import KMeans
from transformers import pipeline
import numpy as np
import pandas as pd


# Reference date in Timestamp format
reference_date = titles_extraction_filter[0][1]

# Extract dates and articles
dates = np.array([[(date - reference_date).days] for index, date, _ in titles_extraction_filter])  # Convert to numerical representation

# Cluster the dates using K-means
n_clusters = 3  # Choose the number of clusters
kmeans = KMeans(n_clusters=n_clusters)
cluster_labels = kmeans.fit_predict(dates)




In [289]:
print(cluster_labels)

[1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]


In [288]:
# Summarize articles within each cluster
summarizer_dates = pipeline('summarization', model='facebook/bart-large-cnn')

cluster_summaries = [[] for _ in range(n_clusters)]
for cluster_label in range(n_clusters):
    cluster_articles = [title for i, (_, _, title) in enumerate(titles_extraction_filter) if cluster_labels[i] == cluster_label]
    cluster_summary = summarizer_dates(" ".join(cluster_articles)[:3500], do_sample=False)
    cluster_summaries[cluster_label] = cluster_summary

# Print the cluster summaries
for cluster_label, summary in enumerate(cluster_summaries):
    cluster_indices = [i for i, label in enumerate(cluster_labels) if label == cluster_label]
    start_date = titles_extraction_filter[cluster_indices[0]][1].strftime('%d %B %Y')
    end_date = titles_extraction_filter[cluster_indices[-1]][1].strftime('%d %B %Y')
    print(f"{start_date} - {end_date}: {summary}")

Your max_length is set to 142, but your input_length is only 10. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=5)


22 December 2023 - 25 December 2023: [{'summary_text': 'Israel increases strikes in central Gaza, killing scores Khawaja denied permission to have peace symbol on bat: Reports Pope Francis says Jesus’ message of peace is being drowned out by ‘futile logic of war’ Shipping giant Maersk prepares to resume operations in Red Sea. 50 elections globally amid wars cast doubt on 2024 economic outlook.'}]
10 September 2014 - 10 September 2014: [{'summary_text': 'Israeli soldier killed in rocket attack from Lebanon. Soldier was killed by a rocket fired from Lebanon, Israel says. Israeli military says it is investigating the attack and has launched an investigation into the source of the rocket attack. Israel says the rocket came from Lebanon and was fired from inside the country.'}]
18 December 2023 - 22 December 2023: [{'summary_text': "World celebrates Christmas in shadow of war. UN Security Council adopts resolution on aid to Gaza watered down by US pressure. Syrians cancel Christmas festivit