# 🎓 Clustering Wikipedia Articles with K-Means

This notebook demonstrates how to fetch text documents from Wikipedia, preprocess them, and then use **K-Means** to discover topic clusters. The key steps are:

**Setup:** Installing and importing the required libraries.

**Data Collection:** Fetching articles from Wikipedia.

**Text Preprocessing:** Cleaning the text data.

**Vectorization:** Converting text into numerical TF-IDF vectors.

**Clustering:** Applying the K-Means algorithm.

**Evaluation:** Assessing cluster quality with WCSS (inertia) and the silhouette score.

**Prediction:** Categorizing a new document using the trained model.

You may need to install the libraries below before running the notebook.

pip3 install wikipedia-api

pip3 install nltk

pip3 install scikit-learn numpy

In [None]:
# --- Step 1: Imports ---
import wikipediaapi
import nltk
import ssl
import re
import numpy as np
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

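# Workaround for SSL certificate errors raised by nltk.download() on some systems.
# Note: this disables HTTPS certificate verification for the whole Python process.
try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context

# Download the NLTK resources used for stop-word removal and lemmatization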
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

print("NLTK data downloaded successfully using the manual method.")

print("Libraries imported and NLTK data downloaded successfully!")

NLTK data downloaded successfully using the manual method.
Libraries imported and NLTK data downloaded successfully!


[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/dasunathukolage/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/dasunathukolage/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /Users/dasunathukolage/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


In [2]:
# --- Step 2: Fetch Wikipedia Articles ---

# List of articles to cluster. We've chosen topics in astronomy, biology, and computer science.
article_titles = [
    "Galaxy", "Black hole", "Supernova", # Astronomy
    "DNA", "Photosynthesis", "Evolution", # Biology
    "Machine learning", "Artificial intelligence", "Computer programming" # Computer Science
]

# Initialize the Wikipedia API
wiki_api = wikipediaapi.Wikipedia('MyClusteringProject/1.0', 'en')

documents = []
for title in article_titles:
    page = wiki_api.page(title)
    if page.exists():
        documents.append(page.text)
        print(f"Successfully fetched: {title}")
    else:
        print(f"Could not find page: {title}")

Successfully fetched: Galaxy
Successfully fetched: Black hole
Successfully fetched: Supernova
Successfully fetched: DNA
Successfully fetched: Photosynthesis
Successfully fetched: Evolution
Successfully fetched: Machine learning
Successfully fetched: Artificial intelligence
Successfully fetched: Computer programming


In [3]:
# --- Step 3: Preprocess the Text ---

stop_words = set(stopwords.words('english'))
print(f"Stop words:\n {stop_words}")
print(f"Number of stop words: {len(stop_words)}")

lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    # Lowercase
    text = text.lower()
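    # Remove punctuation and numbers (deleted outright, so "well-defined" becomes "welldefined")
    text = re.sub(r'[^a-z\s]', '', text)
    # Tokenize on whitespace
    words = text.split()
    # Remove stop words, then lemmatize each remaining word as a verb (pos='v');
    # words WordNet does not know as verbs, such as "algorithms", are left unchanged
    processed_words = [lemmatizer.lemmatize(word, pos='v') for word in words if word not in stop_words]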
    return ' '.join(processed_words)

processed_documents = [preprocess_text(doc) for doc in documents]
print("Text preprocessing complete.")

print(f"\nDocument 1 Before preprocessing:\n{documents[0][:500]}")  # Print first 500 characters of the first original document
print(f"\nDocument 1 After preprocessing:\n{processed_documents[0][:500]}")  # Print first 500 characters of the first processed document

Stop words:
 {'had', 'during', "hasn't", 'y', 'yourself', 'which', 'ma', 'against', 'won', 'the', "you've", 'now', "you'll", "couldn't", 'what', 'other', "it'll", 'we', 'below', 'ain', 'it', 'with', 'from', "we'll", 'aren', 'yourselves', 'not', 'myself', 'haven', 'should', 'am', 'mustn', 'if', 'he', 'them', 'she', 'up', "won't", 'whom', 'can', 'o', "mightn't", 'both', 'further', 'out', "wasn't", "she'd", 'did', 'how', 'do', 'into', 'on', 'ours', 'where', 'when', "shan't", 'you', 'shan', 'his', 'him', "he'll", 'hasn', 'itself', 'so', 'these', 'has', 'under', 'having', 'more', 'own', 'have', "isn't", "i've", 'their', "aren't", 'was', "hadn't", 'at', "they've", 'mightn', "they're", 'an', 'to', 'only', 'than', 'but', 'isn', 'about', 'nor', "she'll", 'doesn', 'this', 'off', 'me', 'i', 'no', 'just', 'didn', 'over', 'yours', 'doing', 'be', 'hers', 'before', 't', 'or', 'after', 'while', 'most', "should've", 'don', "you'd", "she's", "mustn't", 'wasn', "that'll", 'hadn', "doesn't", "he's", 'll',
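
The regular expression in `preprocess_text` deletes punctuation instead of replacing it with a space, so hyphenated words are fused into a single token. The sketch below compares the two behaviours. The helper `strip_non_letters` and its `replacement` parameter are hypothetical names introduced only for this illustration; they are not part of the notebook.

```python
import re

def strip_non_letters(text, replacement=""):
    """Lowercase the text and replace every character that is not a-z or whitespace."""
    return re.sub(r"[^a-z\s]", replacement, text.lower())

sample = "A well-defined, state-of-the-art model."

# Same behaviour as the notebook: punctuation is deleted, hyphenated words are fused
joined = strip_non_letters(sample).split()

# Alternative: punctuation becomes a space, so hyphenated parts stay separate tokens
separated = strip_non_letters(sample, " ").split()

print(joined)     # ['a', 'welldefined', 'stateoftheart', 'model']
print(separated)  # ['a', 'well', 'defined', 'state', 'of', 'the', 'art', 'model']
```

Which variant is preferable depends on the corpus. Fused tokens such as `welldefined` are unlikely to match anything in the learned vocabulary, while separated tokens can introduce extra stop words such as `of` and `the`.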

In [4]:
# --- Step 4: Convert Text to Vectors ---

# Initialize the TF-IDF Vectorizer
vectorizer = TfidfVectorizer(max_features=1000) # Keep only the 1000 most frequent terms in the corpus

# Create the TF-IDF matrix
tfidf_matrix = vectorizer.fit_transform(processed_documents)

print(tfidf_matrix)

print("TF-IDF matrix created successfully.")
print(f"Shape of the matrix: {tfidf_matrix.shape}")

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 5236 stored elements and shape (9, 1000)>
  Coords	Values
  (0, 371)	0.4036268253155702
  (0, 898)	0.011085748508054258
  (0, 864)	0.36822096344578337
  (0, 870)	0.031865275682808177
  (0, 769)	0.00625660127693623
  (0, 463)	0.034411307023149255
  (0, 375)	0.08021203859031145
  (0, 268)	0.037539607661617375
  (0, 213)	0.05005281021548984
  (0, 540)	0.045198205997058954
  (0, 101)	0.010026504823788931
  (0, 927)	0.013559461799117687
  (0, 395)	0.021243517121872117
  (0, 991)	0.0067797308995588435
  (0, 561)	0.17348872316195563
  (0, 757)	0.0036952495026847526
  (0, 979)	0.08577624081121084
  (0, 190)	0.024019121767450892
  (0, 842)	0.016748632904593976
  (0, 370)	0.6620896169650143
  (0, 81)	0.016338331583087777
  (0, 307)	0.01838062303097375
  (0, 562)	0.012933373259396634
  (0, 733)	0.007390499005369505
  (0, 833)	0.027118923598235374
  :	:
  (8, 780)	0.10684975235785324
  (8, 841)	0.19945287106799273
  (8, 840)	0.007123316

In [5]:
# --- Step 5: Run K-Means ---

k = 3
kmeans = KMeans(n_clusters=k, random_state=42, n_init=5)
## The line below differs only in n_init (10 restarts instead of 5);
## init='k-means++' is already the default, so passing it explicitly changes nothing
#kmeans = KMeans(n_clusters=k, init='k-means++', random_state=42, n_init=10)
kmeans.fit(tfidf_matrix)

# Get the cluster assignments for each document
labels = kmeans.labels_
print(labels)

[2 2 2 0 0 0 1 1 1]


In [6]:
# --- Step 6: Evaluate Model Performance ---
# --- 1. Calculate WCSS (Within-Cluster Sum of Squares) ---
# This is already calculated automatically when you fit the model!
# It is stored in the variable 'kmeans.inertia_'
wcss = kmeans.inertia_

# --- 2. Calculate Silhouette Score ---
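# This measures how well-separated the clusters are. It ranges from -1 to 1;
# values near 1 indicate compact, well-separated clusters, values near 0 overlapping ones.
# It takes the data (tfidf_matrix) and the labels the model assigned.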
sil_score = silhouette_score(tfidf_matrix, kmeans.labels_)

print("--- Model Evaluation Metrics ---")
print(f"WCSS (Inertia): {wcss:.4f}")
print(f"Silhouette Score: {sil_score:.4f}")



--- Model Evaluation Metrics ---
WCSS (Inertia): 4.4805
Silhouette Score: 0.1053
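
A single pair of metrics is easier to interpret when compared across several values of k. WCSS shrinks as k grows, so it cannot pick k on its own, whereas the silhouette score can be compared directly. The sketch below runs that comparison on a small made-up corpus (`toy_docs` is an assumption for this example); the same loop could be pointed at `tfidf_matrix` from the notebook.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

# Hypothetical corpus: three topics, two documents each
toy_docs = [
    "galaxy star orbit",
    "star galaxy supernova",
    "gene cell protein",
    "cell gene dna",
    "algorithm code software",
    "code algorithm data",
]
toy_matrix = TfidfVectorizer().fit_transform(toy_docs)

# Map each candidate k to (WCSS, silhouette score)
results = {}
for k in range(2, 5):
    model = KMeans(n_clusters=k, random_state=42, n_init=10).fit(toy_matrix)
    results[k] = (model.inertia_, silhouette_score(toy_matrix, model.labels_))
    print(f"k={k}  WCSS={results[k][0]:.3f}  silhouette={results[k][1]:.3f}")

# Pick the k with the highest silhouette score
best_k = max(results, key=lambda k: results[k][1])
print(f"Best k by silhouette: {best_k}")
```

Note that the silhouette score is only defined when the number of clusters is at least 2 and smaller than the number of documents, so with the nine Wikipedia articles k can range from 2 to 8.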


In [7]:
# --- Step 7: Putting the Model to Work - Predicting on New Documents ---
# Now for the exciting part! We can take our final "trained" model and 
# use it to instantly categorize a brand new, unseen document. 
# Let's see which topic cluster it belongs to!

# --- Define your new document ---
new_text = "An algorithm is a set of well-defined instructions designed to perform a specific task or solve a computational problem. In computer science, the study of algorithms is fundamental to creating efficient and scalable software. Data structures, such as arrays and hash tables, are used to organize data in a way that allows these algorithms to access and manipulate it effectively."

# --- Apply the SAME preprocessing ---
# We use the preprocess_text function we defined earlier
processed_new_text = preprocess_text(new_text)
print(f"Cleaned Text: {processed_new_text}")

# --- Use the FITTED vectorizer to transform the text ---
# IMPORTANT: Use .transform(), not .fit_transform()
# This ensures it uses the same vocabulary learned from the original documents.
new_tfidf_vector = vectorizer.transform([processed_new_text])

print(f"\nShape of the new vector: {new_tfidf_vector.shape}")

# --- Now you can predict its cluster ---
predicted_label = kmeans.predict(new_tfidf_vector)

print(f"\nThe new document belongs to cluster: {predicted_label[0]}")

Cleaned Text: algorithm set welldefined instructions design perform specific task solve computational problem computer science study algorithms fundamental create efficient scalable software data structure array hash table use organize data way allow algorithms access manipulate effectively

Shape of the new vector: (1, 1000)

The new document belongs to cluster: 1
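
`predict` always returns a cluster, even when the new text shares few or no words with the fitted vocabulary. A hedged way to judge how much to trust an assignment is to look at the distances to every centroid (via `KMeans.transform`) and at how many known terms the document contains. The sketch below uses a made-up corpus; `toy_docs` and the helper `describe_prediction` are assumptions introduced for this example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical corpus with two topics
toy_docs = [
    "galaxy star orbit star",
    "star supernova galaxy",
    "gene cell protein cell",
    "protein gene dna cell",
]
toy_vectorizer = TfidfVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_docs)
toy_kmeans = KMeans(n_clusters=2, random_state=42, n_init=10).fit(toy_matrix)

def describe_prediction(text):
    """Report the nearest cluster, all centroid distances, and the known-term count."""
    vec = toy_vectorizer.transform([text])   # reuse the fitted vocabulary
    distances = toy_kmeans.transform(vec)[0]  # distance to each centroid
    return {
        "cluster": int(np.argmin(distances)),
        "distances": [round(float(d), 3) for d in distances],
        "known_terms": int(vec.nnz),          # terms found in the vocabulary
    }

in_vocab = describe_prediction("a distant galaxy full of star clusters")
out_of_vocab = describe_prediction("medieval castle architecture")

print(in_vocab)
print(out_of_vocab)
```

When `known_terms` is 0 the document is an all-zero vector, and its cluster assignment reflects only which centroid happens to lie closest to the origin, not any real topical similarity.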
