<a href="https://colab.research.google.com/github/abhinav-TB/text-summarization/blob/main/experimenting_text_clustering_using_SOM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Data Pre-Processing


In [45]:
#installing hugging face dataset library
!pip install datasets



In [46]:
# imports
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
from datasets import load_dataset

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [47]:
#Loading dataset
dataset = load_dataset('ccdv/cnn_dailymail', '3.0.0')

Reusing dataset cnn_dailymail (/root/.cache/huggingface/datasets/ccdv___cnn_dailymail/3.0.0/3.0.0/0107f7388b5c6fae455a5661bcd134fc22da53ea75852027040d8d1e997f101f)


  0%|          | 0/3 [00:00<?, ?it/s]

In [48]:
#splitting train and test sets
df_train = dataset['train']
df_test = dataset['test']['article']
print(len(df_train))
print(len(df_test))

287113
11490


In [49]:
#sample item in dataset
text = df_train[0]['article']
text

'It\'s official: U.S. President Barack Obama wants lawmakers to weigh in on whether to use military force in Syria. Obama sent a letter to the heads of the House and Senate on Saturday night, hours after announcing that he believes military action against Syrian targets is the right step to take over the alleged use of chemical weapons. The proposed legislation from Obama asks Congress to approve the use of military force "to deter, disrupt, prevent and degrade the potential for future uses of chemical weapons or other weapons of mass destruction." It\'s a step that is set to turn an international crisis into a fierce domestic political battle. There are key questions looming over the debate: What did U.N. weapons inspectors find in Syria? What happens if Congress votes no? And how will the Syrian government react? In a televised address from the White House Rose Garden earlier Saturday, the president said he would take his case to Congress, not because he has to -- but because he want

### Tokenization
In this step, the text is split into smaller units. We can use either sentence tokenization or word tokenization based on our problem statement

In [50]:
## Sentence Tokensing
from nltk.tokenize import sent_tokenize

text_tokens = sent_tokenize(text)
org_text = text_tokens.copy()
print(len(text_tokens))
text_tokens[0]

76


"It's official: U.S. President Barack Obama wants lawmakers to weigh in on whether to use military force in Syria."

### Puncuation Removal
In this step, all the punctuations from the text are removed. string library of Python contains some pre-defined list of punctuations such as ‘!”#$%&'()*+,-./:;?@[\]^_`{|}~’
}~’

In [51]:
# Removing pucuations
import string

def remove_punctuation(text):
    return "".join([i for  i in text if i not in string.punctuation])

for i in range(len(text_tokens)):
    text_tokens[i] = remove_punctuation(text_tokens[i])


text_tokens[0]

'Its official US President Barack Obama wants lawmakers to weigh in on whether to use military force in Syria'

### Stop word Removal
Stopwords are the commonly used words and are removed from the text as they do not add any value to the analysis. These words carry less or no meaning.

In [52]:
# removing stop words
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

new_text_tokens = []

for sentence in text_tokens:
    temp = " ".join([w for w in sentence.split() if not w.lower() in stop_words])
    new_text_tokens.append(temp)

text_tokens = new_text_tokens
text_tokens[0]

'official US President Barack Obama wants lawmakers weigh whether use military force Syria'

### Lemmatization
It stems the word but makes sure that it does not lose its meaning.  Lemmatization has a pre-defined dictionary that stores the context of words and checks the word in the dictionary while diminishing.


In [53]:
# Lemmatization
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

def lemmatize(text):
    lemm_text = [wordnet_lemmatizer.lemmatize(word) for word in text]
    return "".join(lemm_text)

for i in range(len(text_tokens)):
    text_tokens[i] = lemmatize(text_tokens[i])


## Creating Sentence vectors

In [54]:
!pip install sentence-transformers



In [55]:
from sentence_transformers import SentenceTransformer
sbert_model = SentenceTransformer('bert-base-nli-mean-tokens')

In [56]:

sentence_embeddings = sbert_model.encode(text_tokens)

print('Sample BERT embedding vector - length', len(sentence_embeddings[0]))
print('Sample BERT embedding vector', sentence_embeddings[0])

Sample BERT embedding vector - length 768
Sample BERT embedding vector [ 1.92461312e-01  4.73752707e-01 -3.91261071e-01  1.08960591e-01
  5.23253977e-01 -5.32883555e-02 -1.89605772e-01 -4.30334896e-01
  6.06365263e-01 -1.19935966e+00 -1.21604227e-01  6.89616203e-01
  1.08257778e-01  4.31654751e-01 -7.94367790e-01  5.93337417e-01
 -2.29677305e-01  5.29686585e-02  5.07118821e-01  1.71185672e-01
  1.17934681e-01  2.07671925e-01  4.29090798e-01 -4.22741920e-02
  9.46290791e-01  2.19076976e-01  1.42234460e-01  3.67688566e-01
 -4.28213477e-01  2.28824526e-01 -2.49624014e-01 -1.07838444e-01
 -9.14908171e-01 -1.62466526e-01 -1.18347473e-01  8.89864326e-01
  2.78020382e-01  1.85750410e-01  1.31621346e-01 -3.78553599e-01
 -9.91683960e-01 -9.08171892e-01 -4.31626678e-01  1.27131775e-01
 -8.25744569e-01 -7.28055120e-01  6.90432250e-01  7.17395902e-01
 -2.55336583e-01 -1.05166304e+00  2.07867369e-01  6.85787678e-01
  4.09891337e-01 -8.54549557e-02 -1.52421579e-01  1.03981420e-01
 -2.83232015e-02 -1

In [57]:
len(sentence_embeddings)

76

## Clustering using self_organising maps

In [58]:
!pip install sklearn-som



In [82]:
from sklearn_som.som import SOM
# m -> row len of the self organising map
# n -> col len of the self organising map
som = SOM(m=3, n=3, dim=768)

In [60]:
# training the self organising maps
# here the sentence embedding size is n * dimension of data
som.fit(sentence_embeddings , epochs = 10)


In [61]:
predictions = som.predict(sentence_embeddings)

In [62]:
print(len(predictions))
print(set(predictions))
print(predictions[0])

76
{0, 1, 2, 3, 4, 5, 6, 7, 8}
4


In [63]:
predictions

array([4, 7, 4, 2, 2, 3, 0, 7, 3, 3, 6, 7, 6, 0, 1, 8, 7, 2, 2, 5, 6, 8,
       0, 1, 5, 1, 7, 4, 1, 5, 2, 0, 1, 1, 7, 7, 0, 0, 4, 7, 6, 2, 3, 1,
       2, 0, 5, 2, 0, 6, 7, 5, 1, 8, 8, 6, 2, 6, 8, 5, 0, 3, 0, 5, 0, 7,
       2, 5, 4, 6, 2, 8, 8, 8, 0, 2])

## Analysing Predictions

In [79]:
# creating org_text -> prediction mapping
cluster = dict()
for i , sentence in enumerate(org_text):
    cluster[sentence] = predictions[i]


In [80]:
# sorting according to cluster
sorted_keys = sorted(cluster, key=cluster.get)  # [1, 3, 2]
sorted_cluster = dict()
for w in sorted_keys:
    sorted_cluster[w] = cluster[w]

In [81]:
for key , val in sorted_cluster.items():
    print(f"{key} : {val}")

And how will the Syrian government react? : 0
Syrian crisis: Latest developments . : 0
"It needs time to be able to analyze the information and the samples," Nesirky said. : 0
In a world with many dangers, this menace must be confronted." : 0
What will happen if they vote no? : 0
It's unclear. : 0
Reactions mixed to Obama's speech . : 0
"So we are quite concerned." : 0
What do Syria's neighbors think? : 0
Syria's government unfazed . : 0
Syria's prime minister appeared unfazed by the saber-rattling. : 0
No explanation was offered for the discrepancy. : 0
U.N. inspectors leave Syria . : 1
He noted that Ban has repeatedly said there is no alternative to a political solution to the crisis in Syria, and that "a military solution is not an option." : 1
Some U.S. lawmakers have called for immediate action while others warn of stepping into what could become a quagmire. : 1
Any military attack would not be open-ended or include U.S. ground forces, he said. : 1
Syria missile strike: What would