## Load data

In [1]:
import sqlite3 
import pandas as pd

sql = sqlite3.connect("wiki_articles_hw1_extended.db")

df = pd.read_sql_query("SELECT * from wiki_articles_hw1_extended", sql)

In [2]:
print(df.head())

                  title                                               text  \
0            Abuse case  From Wikipedia, the free encyclopedia\n\n\nAbu...   
1   Access-control list  From Wikipedia, the free encyclopedia\n\n\nLis...   
2    Antivirus software  From Wikipedia, the free encyclopedia\n\n\nCom...   
3  Application security  From Wikipedia, the free encyclopedia\n\n\nMea...   
4  Application firewall  From Wikipedia, the free encyclopedia\n\n\nLay...   

                   name                                                url  \
0            Abuse case           https://en.wikipedia.org/wiki/Abuse_case   
1   Access-control list  https://en.wikipedia.org/wiki/Access-control_list   
2    Antivirus software   https://en.wikipedia.org/wiki/Antivirus_software   
3  Application security  https://en.wikipedia.org/wiki/Application_secu...   
4  Application firewall  https://en.wikipedia.org/wiki/Application_fire...   

          datePublished          dateModified  \
0  2010-03-19

In [3]:
print(len(df))

85


In [4]:
print(df.columns)

Index(['title', 'text', 'name', 'url', 'datePublished', 'dateModified',
       'headline', 'nouns', 'adjectives', 'verbs', 'lemmas', 'nav', 'entities',
       'noun_chunks', 'no_tokens', 'no_sentences', 'no_noun_chunks'],
      dtype='object')


## Vectorize data

In [5]:
from sklearn.feature_extraction.text import TfidfVectorizer
from spacy.lang.en.stop_words import STOP_WORDS as stop_words

tfidf_vectorizer = TfidfVectorizer(stop_words=list(stop_words), min_df=10, sublinear_tf=True, use_idf=True)
tfidf_vectors = tfidf_vectorizer.fit_transform(df["nav"])




In [6]:
tfidf_feature_names = tfidf_vectorizer.get_feature_names_out()
print(tfidf_feature_names)

['ability' 'able' 'abuse' ... 'year' 'zombie' 'zone']


## Calculate (or load) word vectors

In [7]:
#!pip install "gensim>=4.0.0"
import numpy as np

In [8]:
import regex as re
from spacy.lang.en.stop_words import STOP_WORDS as stop_words
gensim_words = [[w for w in re.split(r'[\\|\\#]', doc.lower()) if w not in stop_words]
                    for doc in df["nav"]]

In [9]:
from gensim.models import Word2Vec
w2v = Word2Vec(gensim_words, min_count=5)
w2v.wv.save_word2vec_format("wiki_articles_hw1_extended.w2v")

In [10]:
from gensim.models import KeyedVectors
w2v = KeyedVectors.load_word2vec_format("wiki_articles_hw1_extended.w2v")

In [None]:
document_embeddings = []
for doc_idx in range(tfidf_vectors.shape[0]):
    doc_embedding = np.zeros(w2v.vector_size)
    total_weight = 0 
    non_zero_indices = tfidf_vectors[doc_idx].nonzero()[1]
    non_zero_values = tfidf_vectors[doc_idx].data

    for word_idx, tfidf_score in zip(non_zero_indices, non_zero_values):
        word = tfidf_feature_names[word_idx]
        
        if word in w2v:
            word_vector = w2v[word] * tfidf_score
            doc_embedding += word_vector  
            total_weight += tfidf_score  

    if total_weight > 0:
        doc_embedding /= total_weight

    document_embeddings.append(doc_embedding)

document_embeddings = np.array(document_embeddings)

print(document_embeddings)

[[-0.14145867  0.32231176 -0.17363445 ... -0.35801931  0.16587122
  -0.22775691]
 [-0.14308172  0.31667366 -0.17883091 ... -0.37549515  0.17352547
  -0.24800613]
 [-0.15063347  0.31556189 -0.18278075 ... -0.35310491  0.16726538
  -0.23173124]
 ...
 [-0.14500113  0.31567867 -0.17750292 ... -0.3576245   0.17119767
  -0.23905128]
 [-0.15443757  0.34816562 -0.19528794 ... -0.38837465  0.18887584
  -0.268179  ]
 [-0.14659697  0.32772475 -0.1865     ... -0.36739377  0.18036608
  -0.25042022]]
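
Written as a formula, the loop above computes a TF-IDF-weighted average of word vectors for each document. The notation here is introduced only for illustration and is not part of the notebook: $\mathbf{v}_w$ is the Word2Vec vector of term $w$, and $V_d$ is the set of terms that have a non-zero TF-IDF weight in document $d$ and also appear in the Word2Vec vocabulary.

$$
\mathbf{e}_d = \frac{\sum_{w \in V_d} \operatorname{tfidf}(w, d)\,\mathbf{v}_w}{\sum_{w \in V_d} \operatorname{tfidf}(w, d)}
$$

If $V_d$ is empty, the total weight is zero and the code leaves $\mathbf{e}_d$ as the zero vector.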


## Answer to Question 2

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

# w2v and document_embeddings are reused from the cells above.
if "scareware" in w2v:
    scareware_vector = w2v["scareware"]
else:
    raise ValueError("The word 'scareware' is not in the Word2Vec vocabulary")

# Compare the word vector with every document embedding (result shape: 1 x n_docs).
similarities = cosine_similarity(scareware_vector.reshape(1, -1), document_embeddings)

# Index of the document with the highest cosine similarity
most_similar_doc_index = np.argmax(similarities)

# Print the title of the most similar document
most_similar_doc_title = df.iloc[most_similar_doc_index]["title"]
print("The document most similar to 'scareware' is:", most_similar_doc_title)

The document most similar to 'scareware' is: Rogue security software


In [19]:
len(similarities[0])

85

Yes, the result is plausible. "Rogue security software" is closely related to "scareware," as both terms refer to malicious software designed to deceive users into thinking their computer is infected with viruses or other security threats. These programs often employ scare tactics (hence "scareware") to trick users into purchasing fake security solutions or providing sensitive information.

In essence, both "scareware" and "rogue security software" describe fraudulent applications that mislead users about system security to exploit them financially or compromise their data. The similarity between these terms makes it reasonable for the document on "Rogue security software" to have a high cosine similarity to the vector for "scareware."
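
A single `argmax` shows only the best match. To judge plausibility it can help to look at the next-best documents as well. The following is a minimal sketch of such a ranking, assuming a hypothetical helper `cosine_top_k` and toy 2-D vectors that are not taken from the corpus:

```python
import numpy as np

def cosine_top_k(query, matrix, k=3):
    """Return (row index, cosine similarity) pairs for the k rows closest to `query`."""
    query = np.asarray(query, dtype=float)
    matrix = np.asarray(matrix, dtype=float)
    # Cosine similarity = dot product divided by the product of the vector norms.
    sims = matrix @ query / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(query))
    # Sort descending and keep the k best rows.
    order = np.argsort(sims)[::-1][:k]
    return [(int(i), float(sims[i])) for i in order]

# Toy "document embeddings" and query (hypothetical values for illustration only).
toy_docs = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]])
toy_query = np.array([1.0, 0.1])

top = cosine_top_k(toy_query, toy_docs, k=2)
print(top)
```

In the notebook, the same helper could presumably be called as `cosine_top_k(scareware_vector, document_embeddings)`, with the returned indices mapped to titles through `df.iloc[i]["title"]`. The sketch does not guard against all-zero rows, which would cause a division by zero.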

## Answer to Question 3

In [14]:
from sklearn.cluster import KMeans
import pandas as pd

k = 10

kmeans = KMeans(n_clusters=k, random_state=42)
cluster_labels = kmeans.fit_predict(document_embeddings)

df["cluster"] = cluster_labels

clusters = {}
for cluster_num in range(k):
    cluster_titles = df[df["cluster"] == cluster_num]["title"].tolist()
    clusters[cluster_num] = cluster_titles
    print(f"Cluster {cluster_num + 1}:")
    for title in cluster_titles:
        print(" -", title)
    print("\n")


Cluster 1:
 - DREAD (risk assessment model)
 - Risk factor (computing)
 - Security controls


Cluster 2:
 - Automated threat
 - Buffer overflow
 - Code refactoring
 - Computer security software
 - Cross-site scripting
 - Defensive programming
 - Dynamic application security testing
 - Intrusion detection system
 - Keystroke logging
 - Obfuscation (software)
 - Principle of least privilege
 - Social engineering (security)
 - SQL injection
 - Web application firewall


Cluster 3:
 - Abuse case
 - Authentication
 - Coding best practices
 - Cyber threat hunting
 - Cyberattack
 - Defense in depth (computing)
 - Denial-of-service attack
 - Information security
 - IT risk
 - Misuse case
 - Penetration test
 - Software bug
 - Computer security
 - Software quality
 - Software testing
 - Systems development life cycle
 - Vulnerability (computer security)


Cluster 4:
 - Drive-by download
 - Security bug
 - Vulnerability scanner


Cluster 5:
 - Computer emergency response team
 - DevOps
 - Inform

In [16]:
mean_similarities = {}

for cluster_num in range(k):
    cluster_embeddings = document_embeddings[df["cluster"] == cluster_num]
    
    if len(cluster_embeddings) > 1:
        similarity_matrix = cosine_similarity(cluster_embeddings)
        
        upper_triangular_indices = np.triu_indices(len(cluster_embeddings), k=1)
        pairwise_similarities = similarity_matrix[upper_triangular_indices]
        
        mean_similarity = pairwise_similarities.mean()
    else:
        mean_similarity = 0  # placeholder only: a single-document cluster has no pairs to compare
    
    mean_similarities[cluster_num] = mean_similarity
    print(f"Cluster {cluster_num + 1} Mean Similarity: {mean_similarity:.4f}")

Cluster 1 Mean Similarity: 0.9996
Cluster 2 Mean Similarity: 0.9998
Cluster 3 Mean Similarity: 0.9997
Cluster 4 Mean Similarity: 0.9986
Cluster 5 Mean Similarity: 0.9996
Cluster 6 Mean Similarity: 0.9997
Cluster 7 Mean Similarity: 0.9998
Cluster 8 Mean Similarity: 0.9999
Cluster 9 Mean Similarity: 0.0000
Cluster 10 Mean Similarity: 0.9997


In [None]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

mean_similarities = {}

for cluster_num in range(k):
    cluster_embeddings = document_embeddings[df["cluster"] == cluster_num]
    cluster_titles = df[df["cluster"] == cluster_num]["title"].tolist()
    
    if len(cluster_embeddings) > 1:
        similarity_matrix = cosine_similarity(cluster_embeddings)
        
        upper_triangular_indices = np.triu_indices(len(cluster_embeddings), k=1)
        pairwise_similarities = similarity_matrix[upper_triangular_indices]
        
        mean_similarity = pairwise_similarities.mean()
    else:
        mean_similarity = np.nan
    
    mean_similarities[cluster_num] = {
        "mean_similarity": mean_similarity,
        "titles": cluster_titles,
        "interpretation": ""
    }
    
    if mean_similarity > 0.5:
        interpretation = (
            f"Cluster {cluster_num + 1} has a high mean similarity ({mean_similarity:.4f}), "
            f"suggesting that these documents share a strong topical similarity. This aligns with the "
            f"general topical field of cybersecurity and information security, indicating cohesive content "
            f"within the cluster. Documents in this cluster likely focus on similar aspects, such as malware "
            f"types or security threats."
        )
    elif np.isnan(mean_similarity):
        interpretation = (
            f"Cluster {cluster_num + 1} has only one document, so no pairwise similarity could be calculated. "
            f"The document in this cluster may cover a niche topic within the corpus."
        )
    else:
        interpretation = (
            f"Cluster {cluster_num + 1} has a lower mean similarity ({mean_similarity:.4f}), suggesting more "
            f"diverse topics within this cluster. This could mean that the documents in this cluster cover "
            f"broader or less cohesive topics within cybersecurity."
        )
    
    mean_similarities[cluster_num]["interpretation"] = interpretation

for cluster_num, info in mean_similarities.items():
    print(f"Cluster {cluster_num + 1} Mean Similarity: {info['mean_similarity']:.4f}")
    print("Titles in Cluster:")
    for title in info["titles"]:
        print(" -", title)
    print(info["interpretation"])
    print("\n")

most_similar_doc_index = np.argmax(cosine_similarity(scareware_vector.reshape(1, -1), document_embeddings))
most_similar_doc_title = df.iloc[most_similar_doc_index]["title"]

# Compare the cluster of the most similar document with the cluster of the "Scareware" article.
# (Testing membership in the list of all clustered titles would always succeed.)
most_similar_cluster = df.iloc[most_similar_doc_index]["cluster"]
scareware_clusters = df.loc[df["title"] == "Scareware", "cluster"].tolist()

if most_similar_cluster in scareware_clusters:
    print(f"The document most similar to 'scareware' is: '{most_similar_doc_title}', which is grouped within a cluster containing related topics.")
    print("This confirms the clustering result, as the document titles in the same cluster likely discuss similar cybersecurity threats.")
else:
    print(f"The document most similar to 'scareware' is: '{most_similar_doc_title}', but it is not strongly clustered with other related documents.")
    print("This might indicate that 'scareware' topics are distributed across multiple clusters, or the clustering could be improved.")


### Cluster 1 Mean Similarity: 0.9996
**Titles in Cluster:**
 - DREAD (risk assessment model)
 - Risk factor (computing)
 - Security controls

Cluster 1 has a high mean similarity (0.9996), suggesting that these documents share a strong topical similarity. This aligns with the general topical field of cybersecurity and information security, indicating cohesive content within the cluster. Documents in this cluster likely focus on similar aspects, such as malware types or security threats.

---

### Cluster 2 Mean Similarity: 0.9998
**Titles in Cluster:**
 - Automated threat
 - Buffer overflow
 - Code refactoring
 - Computer security software
 - Cross-site scripting
 - Defensive programming
 - Dynamic application security testing
 - Intrusion detection system
 - Keystroke logging
 - Obfuscation (software)
 - Principle of least privilege
 - Social engineering (security)
 - SQL injection
 - Web application firewall

Cluster 2 has a high mean similarity (0.9998), suggesting that these documents share a strong topical similarity. This aligns with the general topical field of cybersecurity and information security, indicating cohesive content within the cluster. Documents in this cluster likely focus on similar aspects, such as malware types or security threats.

---

### Cluster 3 Mean Similarity: 0.9997
**Titles in Cluster:**
 - Abuse case
 - Authentication
 - Coding best practices
 - Cyber threat hunting
 - Cyberattack
 - Defense in depth (computing)
 - Denial-of-service attack
 - Information security
 - IT risk
 - Misuse case
 - Penetration test
 - Software bug
 - Computer security
 - Software quality
 - Software testing
 - Systems development life cycle
 - Vulnerability (computer security)

Cluster 3 has a high mean similarity (0.9997), suggesting that these documents share a strong topical similarity. This aligns with the general topical field of cybersecurity and information security, indicating cohesive content within the cluster. Documents in this cluster likely focus on similar aspects, such as malware types or security threats.

---

### Cluster 4 Mean Similarity: 0.9986
**Titles in Cluster:**
 - Drive-by download
 - Security bug
 - Vulnerability scanner

Cluster 4 has a high mean similarity (0.9986), suggesting that these documents share a strong topical similarity. This aligns with the general topical field of cybersecurity and information security, indicating cohesive content within the cluster. Documents in this cluster likely focus on similar aspects, such as malware types or security threats.

---

### Cluster 5 Mean Similarity: 0.9996
**Titles in Cluster:**
 - Computer emergency response team
 - DevOps
 - Information security management
 - OWASP
 - Security engineering
 - Site reliability engineering
 - Software development process

Cluster 5 has a high mean similarity (0.9996), suggesting that these documents share a strong topical similarity. This aligns with the general topical field of cybersecurity and information security, indicating cohesive content within the cluster. Documents in this cluster likely focus on similar aspects, such as malware types or security threats.

---

### Cluster 6 Mean Similarity: 0.9997
**Titles in Cluster:**
 - Access-control list
 - Attack tree
 - Authorization
 - Countermeasure (computer)
 - Data security
 - Database security
 - Role-based access control
 - Security information and event management
 - Security testing
 - Security through obscurity
 - STRIDE model
 - Threat model

Cluster 6 has a high mean similarity (0.9997), suggesting that these documents share a strong topical similarity. This aligns with the general topical field of cybersecurity and information security, indicating cohesive content within the cluster. Documents in this cluster likely focus on similar aspects, such as malware types or security threats.

---

### Cluster 7 Mean Similarity: 0.9998
**Titles in Cluster:**
 - Application firewall
 - Browser security
 - Capability-based security
 - Defense strategy (computing)
 - Encryption software
 - Exploit (computer security)
 - Internet security
 - Privilege escalation
 - Sandbox (computer security)

Cluster 7 has a high mean similarity (0.9998), suggesting that these documents share a strong topical similarity. This aligns with the general topical field of cybersecurity and information security, indicating cohesive content within the cluster. Documents in this cluster likely focus on similar aspects, such as malware types or security threats.

---

### Cluster 8 Mean Similarity: 0.9999
**Titles in Cluster:**
 - Antivirus software
 - Computer virus
 - Computer worm
 - Cross-site request forgery
 - Malware
 - Mobile security
 - Ransomware
 - Rogue security software
 - Rootkit
 - Scareware
 - Spyware
 - Trojan horse (computing)

Cluster 8 has a high mean similarity (0.9999), suggesting that these documents share a strong topical similarity. This aligns with the general topical field of cybersecurity and information security, indicating cohesive content within the cluster. Documents in this cluster likely focus on similar aspects, such as malware types or security threats.

---

### Cluster 9 Mean Similarity: nan
**Titles in Cluster:**
 - Asset (computer security)

Cluster 9 has only one document, so no pairwise similarity could be calculated. The document in this cluster may cover a niche topic within the corpus.

---

### Cluster 10 Mean Similarity: 0.9997
**Titles in Cluster:**
 - Application security
 - Secure by design
 - Security-focused operating system
 - Shift-left testing
 - Software security assurance
 - Static application security testing
 - System testing

Cluster 10 has a high mean similarity (0.9997), suggesting that these documents share a strong topical similarity. This aligns with the general topical field of cybersecurity and information security, indicating cohesive content within the cluster. Documents in this cluster likely focus on similar aspects, such as malware types or security threats.

---
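
Every multi-document cluster above has a mean similarity of at least 0.9986, so the raw cosine values barely distinguish one cluster from another. One possible explanation, offered here as an assumption rather than a confirmed diagnosis, is that all document embeddings share a large common component. The sketch below uses synthetic data (not the notebook's embeddings) and a hypothetical helper `mean_pairwise_cosine` to show how subtracting the corpus mean before computing cosine similarity changes the picture in that situation:

```python
import numpy as np

def mean_pairwise_cosine(x):
    """Mean cosine similarity over all distinct row pairs of x."""
    unit = x / np.linalg.norm(x, axis=1, keepdims=True)
    sims = unit @ unit.T
    upper = np.triu_indices(len(x), k=1)  # each pair once, diagonal excluded
    return float(sims[upper].mean())

rng = np.random.default_rng(0)

# Synthetic embeddings: one large shared component plus small per-document noise.
shared_component = np.full(50, 1.0)
toy_embeddings = shared_component + 0.01 * rng.standard_normal((20, 50))

raw_mean = mean_pairwise_cosine(toy_embeddings)

# Remove the component that all rows have in common, then compare again.
centered = toy_embeddings - toy_embeddings.mean(axis=0)
centered_mean = mean_pairwise_cosine(centered)

print(f"raw: {raw_mean:.4f}, centered: {centered_mean:.4f}")
```

On the synthetic data the raw mean is close to 1 while the centered mean is close to 0. Whether centering would make the notebook's cluster similarities more informative would need to be checked on `document_embeddings` directly.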

### Conclusion

The document most similar to 'scareware' is **'Rogue security software'**. It falls in Cluster 8 together with 'Scareware', 'Malware', 'Ransomware', 'Spyware', and other malware articles, so the nearest-document result and the clustering agree: both place it among closely related cybersecurity threats.
