KMeans clusters related data. We tell it how many groups (clusters) we want, and it assigns each data point to one of them.
In this case, we will use Wikipedia articles and perform a naive text extraction.
To turn the documents into numeric vectors, we will use [tf-idf](https://en.wikipedia.org/wiki/Tf%E2%80%93idf).
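
To make the weighting concrete, here is a minimal sketch of the textbook tf-idf formula on a toy corpus. The three sentences and the helper name `tf_idf` are made up for illustration. scikit-learn's TfidfVectorizer uses a smoothed idf and L2-normalizes each row by default, so its numbers will differ from these.

```python
import math
from collections import Counter

# Toy corpus (hypothetical, for illustration only).
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]

tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term.
df = Counter(term for tokens in tokenized for term in set(tokens))

def tf_idf(tokens):
    """Return {term: tf * idf} for one tokenized document."""
    counts = Counter(tokens)
    total = len(tokens)
    return {
        term: (count / total) * math.log(n_docs / df[term])
        for term, count in counts.items()
    }

weights = [tf_idf(tokens) for tokens in tokenized]

# Print the terms of the first document, highest weight first.
for term, weight in sorted(weights[0].items(), key=lambda kv: -kv[1]):
    print(f"{term:>5} {weight:.3f}")
```

A term that appears in every document ("the") gets a weight of zero, while a term unique to one document ("mat") gets the highest weight. This is why tf-idf vectors emphasize the words that distinguish one article from another.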

In [85]:
from bs4 import BeautifulSoup
import requests
import re
from sklearn.feature_extraction.text import TfidfVectorizer

documents = []
articles = [
    'https://en.wikipedia.org/wiki/Tf%E2%80%93idf',
    'https://en.wikipedia.org/wiki/Nintendo',
    'https://en.wikipedia.org/wiki/Video_game',
    'https://en.wikipedia.org/wiki/War',
    'https://en.wikipedia.org/wiki/Love',
    'https://en.wikipedia.org/wiki/Stephen_Hawking',
    'https://en.wikipedia.org/wiki/Metal_Gear_Solid',
    'https://en.wikipedia.org/wiki/Neil_deGrasse_Tyson',
    'https://en.wikipedia.org/wiki/The_Cure',
    'https://en.wikipedia.org/wiki/Metallica',
    'https://en.wikipedia.org/wiki/Freddie_Mercury',
    'https://en.wikipedia.org/wiki/Hatred',
    'https://en.wikipedia.org/wiki/Peace',
]
article_titles = [
    'tf-idf',
    'Nintendo',
    'VideoGames',
    'War',
    'Love',
    'Stephen Hawking',
    'Metal Gear Solid',
    'Neil deGrasse Tyson',
    'The cure',
    'Metallica',
    'Freddy Mercury',
    'Hatred',
    'Peace',
]

In [86]:
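for article in articles:
    response = requests.get(article)
    # Name the parser explicitly so the result does not depend on which parsers are installed
    soup = BeautifulSoup(response.content, 'html.parser')
    # Keep only the text inside <p> tags
    paragraphs = soup.find_all('p')
    extracted_text = [paragraph.get_text().strip('\n') for paragraph in paragraphs]
    extracted_text = ' '.join(extracted_text)
    # Drop parenthesized asides and bracketed citation markers such as [1]
    documents.append(re.sub(r"[\(\[].*?[\)\]]", "", extracted_text))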
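
The regular expression in the loop above removes anything between an opening `(` or `[` and the nearest closing `)` or `]`. The sketch below applies the same pattern to two made-up strings to show what it does and where it falls short. The sample strings and the name `BRACKETED` are illustrative only.

```python
import re

# Same pattern as in the loop above, written as a raw string.
BRACKETED = re.compile(r"[\(\[].*?[\)\]]")

# Citation markers and parenthesized asides are removed.
sample = "tf-idf (term frequency) is a statistic[1] used in search."
cleaned = BRACKETED.sub("", sample)
print(repr(cleaned))

# Nested brackets are not handled: the match stops at the first closer it finds.
nested = BRACKETED.sub("", "a (b [c] d) e")
print(repr(nested))
```

The leftover double spaces are harmless here because the vectorizer tokenizes on word boundaries. The nested case leaves a stray fragment behind, which is one reason the extraction is described as naive.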

In [87]:
tfidf = TfidfVectorizer()
sparse_mat = tfidf.fit_transform(documents)

Now that we have the tf-idf matrix, we can proceed.
We will start by reducing the dimensionality of the data. The code uses TruncatedSVD, which is closely related to [PCA](https://en.wikipedia.org/wiki/Principal_component_analysis) but does not center the data, so it can work directly on the sparse tf-idf matrix. This keeps only the strongest components of the dataset.
After that we will use KMeans to create our clusters.

In [90]:
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
import pandas as pd

# TruncatedSVD, not PCA: it accepts the sparse tf-idf matrix directly
svd = TruncatedSVD(n_components=30)
kmeans = KMeans(n_clusters=6)
pipeline = make_pipeline(svd, kmeans)

In [95]:
pipeline.fit(sparse_mat)
labels = pipeline.predict(sparse_mat)
df = pd.DataFrame({'label': labels, 'title': article_titles})
print(df.sort_values('label').to_string(index=False))

label                title
    0             The cure
    0            Metallica
    1                  War
    1                 Love
    1                Peace
    2               Hatred
    3             Nintendo
    3           VideoGames
    3     Metal Gear Solid
    4               tf-idf
    5      Stephen Hawking
    5  Neil deGrasse Tyson
    5       Freddy Mercury


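If you examine the output, you can see that articles on similar topics mostly end up in the same cluster, although not perfectly: Freddie Mercury is grouped with the two scientists rather than with the bands. KMeans starts from a random initialization, so the label numbers, and sometimes the groupings, can change between runs. You can add more articles and experiment with n_components (TruncatedSVD) and n_clusters (KMeans) to see how the results change.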
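
One way to experiment is to wrap the pipeline in a small function and loop over a few settings. The sketch below is a self-contained illustration, not part of the original notebook: the six short sentences stand in for the scraped articles, and the helper name `cluster`, the fixed seed and the chosen parameter values are all assumptions. With the real data you would pass sparse_mat instead of the toy matrix.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Hypothetical stand-in corpus; replace with the scraped documents.
toy_documents = [
    "nintendo makes video game consoles and games",
    "a video game is played on a console or a computer",
    "metallica is a heavy metal band that plays loud music",
    "the cure is a rock band known for its music",
    "war is an armed conflict between states",
    "peace is the absence of war and conflict",
]

matrix = TfidfVectorizer().fit_transform(toy_documents)

def cluster(matrix, n_components, n_clusters, seed=0):
    """Fit TruncatedSVD + KMeans and return (labels, inertia)."""
    pipeline = make_pipeline(
        TruncatedSVD(n_components=n_components, random_state=seed),
        KMeans(n_clusters=n_clusters, n_init=10, random_state=seed),
    )
    labels = pipeline.fit_predict(matrix)
    # inertia_ is the sum of squared distances to the nearest cluster center
    return labels, pipeline[-1].inertia_

results = {}
for n_clusters in (2, 3, 4):
    labels, inertia = cluster(matrix, n_components=4, n_clusters=n_clusters)
    results[n_clusters] = (labels, inertia)
    print(n_clusters, round(inertia, 3), labels.tolist())
```

Fixing random_state makes each run repeatable, which helps when comparing settings. Inertia usually drops as n_clusters grows, so look for the point where adding a cluster stops helping much rather than for the smallest value.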