# 🎓 Clustering Wikipedia Articles with K-Means

This notebook demonstrates how to fetch text documents from Wikipedia, preprocess them, and then use **K-Means** to discover topic clusters. The key steps are:

1. **Setup:** Installing and importing the required libraries.
2. **Data Collection:** Fetching articles from Wikipedia.
3. **Text Preprocessing:** Cleaning the text data.
4. **Vectorization:** Converting text into numerical TF-IDF vectors.
5. **Clustering:** Applying the K-Means algorithm.
6. **Analysis:** Inspecting the clusters to understand their topics.
7. **Prediction:** Categorizing a new document using the trained model.

You may need to install the libraries below.

pip3 install wikipedia-api

pip3 install nltk

pip3 install scikit-learn

In [7]:
# --- Step 1: Imports ---
import wikipediaapi
import nltk
import ssl
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

In [6]:
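# Optional workaround: run this cell only if nltk.download() fails with an SSL certificate error.
# It makes Python skip HTTPS certificate verification, so do not use it unless you need to.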
try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context

In [8]:
# Download the NLTK data used in preprocessing (stop word list and WordNet for lemmatization)
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

print("Libraries imported and NLTK data downloaded successfully!")

Libraries imported and NLTK data downloaded successfully!


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\GRB\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\GRB\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\GRB\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


In [9]:
# --- Step 2: Fetch Wikipedia Articles ---

# List of articles to cluster. We've chosen topics in astronomy, biology, and computer science.
article_titles = [
    "Galaxy", "Black hole", "Supernova", # Astronomy
    "DNA", "Photosynthesis", "Evolution", # Biology
    "Machine learning", "Artificial intelligence", "Computer programming" # Computer Science
]

# Initialize the Wikipedia API client (the first argument is the User-Agent string sent with each request)
wiki_api = wikipediaapi.Wikipedia(user_agent='MyClusteringProject/1.0', language='en')

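documents = []
fetched_titles = []  # Titles that were actually found, kept in step with `documents`
for title in article_titles:
    page = wiki_api.page(title)
    if page.exists():
        documents.append(page.text)
        fetched_titles.append(title)
        print(f"Successfully fetched: {title}")
    else:
        print(f"Could not find page: {title}")

# Keep titles aligned with documents so later cells can index both with the same position
article_titles = fetched_titles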

Successfully fetched: Galaxy
Successfully fetched: Black hole
Successfully fetched: Supernova
Successfully fetched: DNA
Successfully fetched: Photosynthesis
Successfully fetched: Evolution
Successfully fetched: Machine learning
Successfully fetched: Artificial intelligence
Successfully fetched: Computer programming


In [10]:
# --- Step 3: Preprocess the Text ---

# Preprocessing cleans and normalizes the raw text before analysis: lowercasing, stripping noise, removing stop words, and lemmatizing.
# Note: the stop words are stored in a set because membership tests on a set are faster than on a list.

stop_words = set(stopwords.words('english')) # Stop words (Examples: "the", "is", "in")
lemmatizer = WordNetLemmatizer() # Lemmatizer - reduces words to their dictionary form. lemmatize() treats every word as a noun by default, so plurals are reduced ("galaxies" -> "galaxy") while verb forms such as "running" are left unchanged.

def preprocess_text(text):
    # Lowercase
    text = text.lower()
    # Remove every character that is not a lowercase letter or whitespace: punctuation, digits, and also hyphens (so "well-defined" becomes "welldefined")
    text = re.sub(r'[^a-z\s]', '', text)
    # Tokenize: split on whitespace into a list of words
    words = text.split()
    # Remove stop words, lemmatize the remaining words, and join them back into a single string
    processed_words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]
    return ' '.join(processed_words)

processed_documents = [preprocess_text(doc) for doc in documents] # Preprocess each document and collect the results in a new list
print("Text preprocessing complete.")

Text preprocessing complete.
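
One side effect of the cleaning regex is easier to see in isolation: because it deletes every character that is not a lowercase letter or whitespace, hyphenated words are fused into one token (this is where the "welldefined" token in the Step 7 output comes from). The sketch below uses only `re`. The function `clean_with_spaces` is a hypothetical variant that is not part of this notebook; it substitutes a space instead of deleting, so the two halves stay separate tokens.

```python
import re

def clean_as_in_notebook(text):
    # Same pattern as preprocess_text: delete anything that is not a-z or whitespace
    return re.sub(r'[^a-z\s]', '', text.lower())

def clean_with_spaces(text):
    # Hypothetical variant: replace removed characters with a space,
    # then collapse runs of whitespace
    return ' '.join(re.sub(r'[^a-z\s]', ' ', text.lower()).split())

sample = "A well-defined, 3-step task."
print(clean_as_in_notebook(sample))  # a welldefined step task
print(clean_with_spaces(sample))     # a well defined step task
```

Which behavior is preferable depends on the corpus; the notebook's results were produced with the first version.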


In [11]:
# --- Step 4: Convert Text to Vectors ---

# Initialize the TF-IDF vectorizer, which converts the preprocessed documents into numerical feature vectors.
# TF (Term Frequency) measures how often a term occurs in a document: frequent terms get more weight in that document's vector.
# IDF (Inverse Document Frequency) measures how rare a term is across the whole collection: terms found in few documents get more weight than terms found in most of them.
vectorizer = TfidfVectorizer(max_features=1000) # Keep only the 1000 terms with the highest frequency across the corpus

# The result is a sparse matrix with 9 rows (one per document) and 1000 columns (one per term)

# Create the TF-IDF matrix
tfidf_matrix = vectorizer.fit_transform(processed_documents)

print(tfidf_matrix) # Printing a sparse matrix lists only the stored (non-zero) entries

print("TF-IDF matrix created successfully.")
print(f"Shape of the matrix: {tfidf_matrix.shape}")

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 5187 stored elements and shape (9, 1000)>
  Coords	Values
  (0, 380)	0.8694624356020488
  (0, 906)	0.01159671739286774
  (0, 871)	0.28333919626648235
  (0, 876)	0.025000517317630797
  (0, 777)	0.004908737339371123
  (0, 471)	0.026998055366541176
  (0, 383)	0.06293190367536866
  (0, 270)	0.029452424036226738
  (0, 214)	0.039269898714968984
  (0, 552)	0.035461125238748774
  (0, 107)	0.008760308894358658
  (0, 932)	0.010638337571624632
  (0, 402)	0.016667011545087197
  (0, 992)	0.005319168785812316
  (0, 571)	0.13611392761821212
  (0, 763)	0.002899179348216935
  (0, 982)	0.0710298940313149
  (0, 192)	0.005319168785812316
  (0, 853)	0.013140463341537987
  (0, 309)	0.009833109949276352
  (0, 572)	0.015945486415193142
  (0, 742)	0.004806957568994388
  (0, 843)	0.019503618881311826
  (0, 271)	0.05555670515029067
  (0, 512)	0.007247948370542337
  :	:
  (8, 4)	0.05115613811068583
  (8, 918)	0.03654009865048988
  (8, 635)	0.0248134786
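
To make the weights in the matrix less opaque, here is a minimal sketch that computes TF-IDF by hand on a made-up three-document corpus. It assumes the `TfidfVectorizer` defaults (raw term counts, smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row) and simplifies tokenization to a whitespace split. The names `tfidf_rows` and `toy_docs` are illustrative only.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return (vocabulary, rows); each row is an L2-normalized TF-IDF vector."""
    tokenized = [doc.split() for doc in docs]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(docs)
    # Document frequency: the number of documents each term appears in
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF (assumed to match TfidfVectorizer's default, smooth_idf=True)
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocabulary}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        raw = [counts[t] * idf[t] for t in vocabulary]
        # L2-normalize the row; fall back to 1.0 for an empty document
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocabulary, rows

toy_docs = ["star galaxy star", "gene cell gene", "star gene"]
vocabulary, rows = tfidf_rows(toy_docs)
print(vocabulary)
for doc, row in zip(toy_docs, rows):
    print(doc, [round(v, 3) for v in row])
```

In the first toy document, "galaxy" occurs once and "star" twice, yet a single occurrence of "galaxy" carries more weight than a single occurrence of "star" because "galaxy" appears in fewer documents.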

In [15]:
# --- Step 5: Run K-Means ---

k = 3  # Number of clusters (Astronomy, Biology, Computer Science)

# scikit-learn's KMeans uses k-means++ initialization by default: a smarter way of picking the starting centroids, followed by the standard K-Means iterations
kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
## The line below is functionally identical to the line above
# kmeans = KMeans(n_clusters=k, init='k-means++', random_state=42, n_init=10)
# n_init is the number of times the algorithm is run with different centroid seeds; the run with the lowest inertia (within-cluster sum of squared distances) is kept

kmeans.fit(tfidf_matrix)

# Get the cluster assignments for each document
labels = kmeans.labels_
print(labels) # Cluster index (0, 1, or 2) assigned to each document, i.e. the Zᵢ values; the numbering itself is arbitrary

[2 2 2 0 0 0 1 1 1]
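
For intuition about what `kmeans.fit` does after initialization, here is a minimal sketch of the basic K-Means loop (Lloyd's algorithm) in plain Python on made-up 2-D points. It is not scikit-learn's implementation: the starting centroids are passed in by hand rather than chosen with k-means++, and the names `lloyd_kmeans`, `squared_distance`, and `mean_point` are illustrative.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_point(points):
    return [sum(coords) / len(points) for coords in zip(*points)]

def lloyd_kmeans(points, initial_centroids, max_iter=100):
    centroids = [list(c) for c in initial_centroids]
    k = len(centroids)
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            # An empty cluster keeps its previous centroid
            new_centroids.append(mean_point(members) if members else centroids[c])
        if new_centroids == centroids:
            break  # Converged: no centroid moved
        centroids = new_centroids
    return labels, centroids

# Two well-separated toy groups; deliberately poor start (both centroids in the first group)
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels, centroids = lloyd_kmeans(points, initial_centroids=[points[0], points[1]])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Even with both starting centroids inside the same group, the alternating assignment and update steps recover the two groups here. On harder data a bad start can lead to a worse solution, which is the motivation for k-means++ and for repeating the run `n_init` times.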


In [16]:
# --- Step 6: Analyze the Results ---

# Group document titles by cluster
clusters = {i: [] for i in range(k)}
for i, label in enumerate(labels):
    clusters[label].append(article_titles[i])

# Get the top terms per cluster
order_centroids = kmeans.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names_out()

for i in range(k):
    print(f"--- Cluster {i} ---")
    print(f"Documents: {clusters[i]}")
    
    top_terms = [terms[ind] for ind in order_centroids[i, :10]]
    print(f"Top Keywords: {top_terms}\n")

--- Cluster 0 ---
Documents: ['DNA', 'Photosynthesis', 'Evolution']
Top Keywords: ['dna', 'organism', 'photosynthesis', 'specie', 'gene', 'plant', 'photosynthetic', 'evolution', 'carbon', 'cell']

--- Cluster 1 ---
Documents: ['Machine learning', 'Artificial intelligence', 'Computer programming']
Top Keywords: ['learning', 'ai', 'language', 'programming', 'machine', 'data', 'algorithm', 'intelligence', 'program', 'computer']

--- Cluster 2 ---
Documents: ['Galaxy', 'Black hole', 'Supernova']
Top Keywords: ['galaxy', 'star', 'supernova', 'hole', 'black', 'mass', 'type', 'collapse', 'milky', 'light']
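
The keyword lists come from ranking each centroid's coordinates: a centroid is the mean TF-IDF vector of its cluster, so its largest entries correspond to the terms that characterize that cluster. The sketch below shows the same ranking idea in plain Python, as a stand-in for the `argsort()[:, ::-1]` step; the vocabulary and centroid values are made up for illustration.

```python
def top_terms(centroid, terms, n=3):
    # Rank column indices by centroid weight, largest first
    ranked = sorted(range(len(centroid)), key=lambda i: centroid[i], reverse=True)
    return [terms[i] for i in ranked[:n]]

terms = ["cell", "galaxy", "gene", "star"]   # toy vocabulary
centroid = [0.05, 0.60, 0.02, 0.40]          # toy centroid, one weight per term
print(top_terms(centroid, terms, n=2))       # ['galaxy', 'star']
```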



In [18]:
# --- Step 7: Putting the Model to Work - Predicting on New Documents ---
# Now for the exciting part! We can take our final "trained" model and 
# use it to instantly categorize a brand new, unseen document. 
# Let's see which topic cluster it belongs to!

# --- Define your new document ---
new_text = "An algorithm is a set of well-defined instructions designed to perform a specific task or solve a computational problem. In computer science, the study of algorithms is fundamental to creating efficient and scalable software. Data structures, such as arrays and hash tables, are used to organize data in a way that allows these algorithms to access and manipulate it effectively."

# --- Apply the SAME preprocessing ---
# We use the preprocess_text function we defined earlier
processed_new_text = preprocess_text(new_text)
print(f"Cleaned Text: {processed_new_text}")

# --- Use the FITTED vectorizer to transform the text ---
# IMPORTANT: Use .transform(), not .fit_transform()
# This ensures it uses the same vocabulary learned from the original documents.
new_tfidf_vector = vectorizer.transform([processed_new_text])

print(f"\nShape of the new vector: {new_tfidf_vector.shape}")

# --- Now you can predict its cluster ---
predicted_label = kmeans.predict(new_tfidf_vector)

print(f"\nThe new document belongs to cluster: {predicted_label[0]}")

Cleaned Text: algorithm set welldefined instruction designed perform specific task solve computational problem computer science study algorithm fundamental creating efficient scalable software data structure array hash table used organize data way allows algorithm access manipulate effectively

Shape of the new vector: (1, 1000)

The new document belongs to cluster: 1
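
The reason for calling `.transform()` rather than `.fit_transform()` is that the vocabulary, and therefore the meaning of each column, is fixed at fit time. A new document is mapped onto those existing columns, and any word outside the vocabulary is simply dropped. A minimal sketch of that idea with raw counts (no IDF weighting or normalization), using a made-up vocabulary and a hypothetical helper `count_vector`:

```python
from collections import Counter

def count_vector(text, vocabulary):
    """Map text onto a fixed vocabulary; words outside it are ignored."""
    counts = Counter(text.split())
    return [counts[term] for term in vocabulary]

vocabulary = ["algorithm", "data", "galaxy", "gene"]  # toy vocabulary, fixed at "fit" time
new_doc = "algorithm data structure algorithm"
print(count_vector(new_doc, vocabulary))  # [2, 1, 0, 0]
```

The word "structure" contributes nothing because it has no column. By the same logic, a new document that shares few terms with the training articles yields a mostly empty vector, and its predicted cluster is less meaningful.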
