# **Clustering Textual Notes Using Sentence Transformers and KMeans**

## **Introduction**

In this project, we aim to group a set of textual notes based on their **semantic similarity** using **natural language processing (NLP)** techniques. By converting each sentence into a vector representation through a pre-trained transformer model, we can apply **KMeans clustering** to organize them into meaningful categories. This is useful in tasks such as topic discovery, content summarization, or automatic tagging.

---

## **Step-by-Step Code Explanation**

The following steps explain the techniques used in the project, including code comments for a deeper understanding of each part.

### 1. Installing Required Libraries

```python
!pip install sentence-transformers scikit-learn numpy pandas
```

This command installs the required libraries:

* `sentence-transformers`: For converting text into dense vector embeddings.
* `scikit-learn`: Contains the KMeans clustering algorithm.
* `numpy`: For numerical computations.
* `pandas`: For structured data handling and display.



---

### 2. Importing Libraries

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
import numpy as np
import pandas as pd
```

These imports cover embedding, clustering, numerical operations, and data handling:

* `SentenceTransformer`: Loads a pre-trained model and converts sentences into vectors.
* `KMeans`: The clustering algorithm from scikit-learn.
* `numpy`: For numerical operations.
* `pandas`: For working with data frames.



---

### 3. Defining Input Notes

```python
notes = [
    "Artificial intelligence mimics human behavior.",
    "Machine learning requires data for model training.",
    "Privacy is a major ethical concern in AI.",
    "Decision trees are a supervised learning method.",
    "Unsupervised learning finds patterns in data.",
    "Deep learning uses layered neural networks.",
    "GPT models process natural language.",
    "Ethical issues in AI are widely debated.",
    "Bias in AI can have societal impacts."
]
```

* A list of 9 short English sentences related to AI and machine learning.
* These will be the input to our model for semantic grouping.



---

### 4. Converting Notes to Sentence Embeddings

```python
model = SentenceTransformer("all-MiniLM-L6-v2")  # Load a small, efficient pre-trained transformer model
embeddings = model.encode(notes)                # Convert all sentences into vector representations
```

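* Each sentence is transformed into a 384-dimensional numeric vector (the output size of `all-MiniLM-L6-v2`) that captures its meaning.
* `embeddings` is a NumPy array with one row per note, so its shape here is `(9, 384)`.
* Sentences with similar meanings end up close together in this vector space, which is the foundation for semantic comparison between sentences.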
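To make "semantic comparison" concrete, here is a minimal sketch of pairwise cosine similarity, a common way to compare embedding vectors. The helper name `cosine_similarity_matrix` and the toy 3-dimensional vectors are assumptions made for illustration; they are not part of the original notebook, and real embeddings would have 384 dimensions.

```python
import numpy as np

def cosine_similarity_matrix(vectors: np.ndarray) -> np.ndarray:
    """Return the pairwise cosine similarity between the rows of `vectors`."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / norms   # scale every row to unit length
    return unit @ unit.T     # dot products of unit vectors are cosines

# Toy vectors standing in for real embeddings (illustrative only).
toy_embeddings = np.array([
    [1.0, 0.0, 0.0],
    [2.0, 0.0, 0.0],   # same direction as row 0, different length
    [0.0, 1.0, 0.0],   # orthogonal to row 0
])

sim = cosine_similarity_matrix(toy_embeddings)
print(np.round(sim, 2))
```

With the real data, the same helper could be called as `cosine_similarity_matrix(embeddings)` to see which notes the model considers most alike. Note that KMeans itself works with Euclidean distance rather than cosine similarity.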




---

### 5. Applying KMeans Clustering

```python
n_clusters = 3                                       # Set the number of clusters (groups) we want
kmeans = KMeans(n_clusters=n_clusters, random_state=42)  # Initialize KMeans with a fixed seed for reproducibility
labels = kmeans.fit_predict(embeddings)              # Assign each sentence to a cluster based on its vector
```

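* `KMeans` groups similar vectors by minimizing the sum of squared Euclidean distances between each vector and the center of its cluster.
* The result depends on how the cluster centers are initialized, which is why `random_state` is fixed.
* The `labels` array holds one cluster ID (0, 1, or 2) per note, in the same order as `notes`.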
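The idea behind KMeans can be sketched in a few lines of NumPy. The version below follows Lloyd's algorithm (alternate between assigning points to the nearest center and moving each center to the mean of its points). It is a simplified illustration, not scikit-learn's implementation, which by default uses k-means++ initialization among other refinements. The function name `lloyd_kmeans` and the toy 2-D points are assumptions made for this example.

```python
import numpy as np

def lloyd_kmeans(points, k, n_iter=10, seed=0):
    """Simplified KMeans via Lloyd's algorithm; returns (labels, centers)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct points chosen at random as initial centers.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: each center moves to the mean of its assigned points.
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:   # keep the old center if a cluster is empty
                centers[j] = members.mean(axis=0)
    # Final assignment against the updated centers.
    distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return distances.argmin(axis=1), centers

# Two obvious groups of 2-D points (illustrative only).
toy_points = np.array([
    [0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
    [10.0, 10.0], [10.0, 11.0], [11.0, 10.0],
])
toy_labels, toy_centers = lloyd_kmeans(toy_points, k=2)
print(toy_labels)
```

The first three points should share one label and the last three the other; which group receives label 0 depends on the random initialization.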


---

### 6. Displaying Clustered Notes

```python
df = pd.DataFrame({'Note': notes, 'Group': labels})  # Create a DataFrame pairing each note with its group
print("\n--- Grouped Notes ---")
for group_id in range(n_clusters):                   # Loop through each group
    print(f"\n Group {group_id+1}:")
    group_notes = df[df['Group'] == group_id]['Note'].tolist()  # Get notes belonging to the current group
    for note in group_notes:
        print(f" - {note}")
```

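* This section prints each group followed by the notes assigned to it.
* Cluster IDs are zero-based, so the code prints `group_id+1` to number the groups from 1.
* Reading the groups side by side is a quick, informal way to judge whether the clustering makes sense.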
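Visual inspection can be complemented with a numeric check. One common option is the silhouette score from scikit-learn, which ranges from -1 to 1 and is higher when clusters are compact and well separated. The sketch below compares several values of `k` on synthetic 2-D data; `toy_vectors`, `blob_centers`, `scores`, and `best_k` are names invented for this example, and the synthetic blobs are an assumption standing in for real embeddings.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for embeddings: three well-separated 2-D blobs (illustrative only).
rng = np.random.default_rng(0)
blob_centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
toy_vectors = np.vstack(
    [center + rng.normal(scale=0.5, size=(20, 2)) for center in blob_centers]
)

# Fit KMeans for several candidate cluster counts and score each result.
scores = {}
for k in range(2, 6):
    candidate_labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(toy_vectors)
    scores[k] = silhouette_score(toy_vectors, candidate_labels)

best_k = max(scores, key=scores.get)   # candidate with the highest silhouette score
print(best_k, {k: round(float(s), 3) for k, s in scores.items()})
```

To try this on the notes, `toy_vectors` would be replaced with `embeddings`. The silhouette score needs at least 2 clusters and fewer clusters than samples, so with nine notes `k` can range from 2 to 8. With so few samples the scores are only a rough guide.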



---

### 7. Automatic Group Title Suggestions

```python
print("\n--- Automatic Group Title Suggestions ---")
for group_id in range(n_clusters):                                      # Loop through each group
    group_indices = np.where(labels == group_id)[0]                     # Find the indices of notes in this group
    center_vector = kmeans.cluster_centers_[group_id]                   # Get the center vector of the current cluster
    group_vectors = embeddings[group_indices]                           # Get the vectors of notes in this group
    closest_idx = np.argmin(np.linalg.norm(group_vectors - center_vector, axis=1))  # Find the vector closest to the center
    title = notes[group_indices[closest_idx]]                           # Use the closest note as the group title
    print(f" Group {group_id+1} Title: {title}")
```

* The goal here is to **suggest a representative title** for each cluster.
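* A cluster center is the mean of the embeddings in that cluster, so it rarely coincides with an actual note. The sentence whose embedding is **closest to the cluster center** (by Euclidean distance) is used as a title.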
* This helps in understanding what each group is primarily about.
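The same distance calculation can rank every note in a cluster, not just pick the closest one, which may help when a single title is not descriptive enough. The helper `rank_by_centrality` and the toy data below are hypothetical and only illustrate the idea.

```python
import numpy as np

def rank_by_centrality(vectors: np.ndarray, center: np.ndarray) -> np.ndarray:
    """Return row indices of `vectors`, ordered from closest to farthest from `center`."""
    distances = np.linalg.norm(vectors - center, axis=1)   # Euclidean distance per row
    return np.argsort(distances)

# Toy cluster with three members (illustrative only).
toy_notes = ["note A", "note B", "note C"]
toy_vectors = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]])
toy_center = toy_vectors.mean(axis=0)   # the mean of the members, like a KMeans center

order = rank_by_centrality(toy_vectors, toy_center)
print([toy_notes[i] for i in order])    # most representative note first
```

In the notebook's loop, this could be applied as `rank_by_centrality(group_vectors, center_vector)`, with the first index playing the role of `closest_idx`.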




---

## **Conclusion**

This project demonstrates how to combine **pre-trained language models** and **unsupervised learning** to semantically cluster text. The SentenceTransformer model encodes each sentence as a numerical vector that reflects its meaning, and KMeans clustering then groups sentences whose vectors lie close together. Finally, the sentence nearest to each cluster center serves as an automatically suggested title for that group. With only nine notes, the result is best judged by reading the printed groups, and the number of clusters (3) was chosen by hand. The same pipeline can be extended to large-scale text organization tasks in areas like **document management**, **topic modeling**, and **AI content moderation**.