# Clustering News Articles

Previously, we performed data mining knowing what we were looking for. Our use of target classes allowed us to learn how our variables model those targets during the training phase. This type of learning, where we have targets to train against, is called **supervised learning**. Here, we consider what we do without those targets. This is **unsupervised learning** and is much more of an exploratory task. Rather than wanting to classify with our model, the goal in unsupervised learning is more about exploring the data to find insights.
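To make the distinction concrete, here is a minimal sketch on toy data (the points, the targets `y`, and the choice of `LogisticRegression` are assumptions for illustration, not part of this chapter's dataset). The supervised model is given targets; the clustering model is given only the data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

# Two well-separated groups of 2-D points (toy data, not from the chapter).
X = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
              [5.0, 5.1], [5.2, 5.0], [5.1, 5.2]])
y = np.array([0, 0, 0, 1, 1, 1])  # target classes, available only in the supervised case

# Supervised: the targets y guide training.
clf = LogisticRegression().fit(X, y)
predicted_classes = clf.predict(X)

# Unsupervised: no targets are passed; the model groups the points on its own.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
cluster_ids = km.labels_

print(predicted_classes)
print(cluster_ids)  # cluster numbering is arbitrary, so it may be swapped relative to y
```

The cluster numbers carry no meaning beyond "same group" or "different group", which is why cluster output is usually explored rather than scored against targets.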

We will look at clustering news articles to find trends and patterns in the data. We look at how we can extract data from different websites using a link aggregation website to show a variety of news stories.

The key concepts covered in this chapter include:
- Obtaining text from arbitrary websites
- Using the reddit API to collect interesting news stories
- Cluster analysis for unsupervised data mining
- Extracting topics from documents
- Online learning for updating a model without retraining it
- Cluster ensembling to combine different models

## Obtaining news articles

We will build a system that takes news articles and groups them so that the articles in each group share a similar topic. Our goal is to generate a few hundred fake news articles in Python (360 in the run shown here) and then cluster them to see any major themes or concepts that occur.

## Generate news articles

In [1]:
import faker
import random
from datetime import datetime, timedelta

In [2]:
fake = faker.Faker()

In [3]:
my_word_list = [
    'danish','cheesecake','sugar',
    'Lollipop','wafer','Gummies',
    'sesame','Jelly','beans',
    'pie','bar','Ice','oat' ]

random.seed(42)
sample_size = 3
num_of_iters = 6
samples = []
for _ in range(num_of_iters):
    random_sample = random.sample(my_word_list, sample_size)
    samples.append(random_sample)

print("Fixed random sample:", samples)


Fixed random sample: [['bar', 'cheesecake', 'danish'], ['Ice', 'wafer', 'Lollipop'], ['Lollipop', 'sugar', 'cheesecake'], ['bar', 'Ice', 'beans'], ['cheesecake', 'pie', 'sesame'], ['danish', 'oat', 'cheesecake']]


In [4]:
def generate_title():
    max_words = 5  # Maximum number of words
    num_words = random.randint(1, max_words)
    return ' '.join(fake.word() for _ in range(num_words))

def generate_author():
    return f"{fake.first_name()} {fake.last_name()}"

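def generate_content(sample, num_indicators=1):
    # 300 paragraphs of random filler text
    lines = [fake.text() for _ in range(300)]
    # Add sentences built only from the sample words; these mark the article's topic
    for _ in range(num_indicators):
        lines.append(fake.sentence(ext_word_list=sample))

    # Shuffle so the indicator sentences land at random positions
    random.shuffle(lines)

    # Separate the paragraphs with newlines
    text = '\n'.join(lines)

    return text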

In [7]:
def generate_article(i, samples, num_indicators=1):
    # Build one article per word sample; i is the round number (currently unused)
    docs = []
    for sample in samples:
        title = generate_title()
        author = generate_author()
        content = generate_content(sample, num_indicators)

        article = f"""
        Title: {title}
        Author: {author}

        {content}
        """

        docs.append(article)
    return docs

def return_docs(num_rounds=60, num_indicators=1):
    # Each round yields len(samples) articles, so the defaults give 60 * 6 = 360
    docs = []
    for i in range(num_rounds):
        docs += generate_article(i, samples, num_indicators)

    return docs

documents = return_docs()
print(len(documents))


360


In [8]:
print(documents[0][:400])


        Title: enjoy identify
        Author: John Carlson

        Such some room strategy low go. Wait during house site believe whether easy. Plant wife ok might trouble modern./nImportant color onto billion. Source check memory help firm. Must meet seek defense follow article right./nPretty company such six life. Bill maybe fact fire company space skin. Little why boy citizen plan color./nCon


In [9]:
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [None]:
from sklearn.pipeline import Pipeline

n_clusters = 6  # one cluster per word sample
pipeline = Pipeline([('feature_extraction', TfidfVectorizer(max_df=0.4)),
                     ('clusterer', KMeans(n_clusters=n_clusters))])
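The `max_df=0.4` setting tells `TfidfVectorizer` to ignore any term that appears in more than 40% of the documents, on the reasoning that such terms do little to tell documents apart. A small sketch on a toy corpus (the sentences are made up for illustration) shows the effect:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus: "the" occurs in every document, every other word in exactly one.
toy_docs = [
    "the market rises",
    "the team wins",
    "the storm hits",
    "the film opens",
    "the court rules",
]

# "the" has a document frequency of 100%, above the 40% cut-off, so it is dropped.
vec = TfidfVectorizer(max_df=0.4)
X_toy = vec.fit_transform(toy_docs)

vocabulary = sorted(vec.vocabulary_)
print(vocabulary)
print(X_toy.shape)  # (number of documents, number of retained terms)
```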


In [12]:
pipeline.fit(documents)
labels = pipeline.predict(documents)

from collections import Counter
c = Counter(labels)
for cluster_number in range(n_clusters):
    print("Cluster {} contains {} samples".format(cluster_number, c[cluster_number]))

Cluster 0 contains 57 samples
Cluster 1 contains 80 samples
Cluster 2 contains 62 samples
Cluster 3 contains 68 samples
Cluster 4 contains 42 samples
Cluster 5 contains 51 samples


In [13]:
c[0]

57

In [14]:
pipeline.named_steps['clusterer'].inertia_

272.54612026875157
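The `inertia_` attribute is the sum of squared distances from each sample to its nearest centroid, so lower values mean tighter clusters. As a rough check, the sketch below recomputes it by hand on toy data (the random points and the names `points` and `km_toy` are assumptions for illustration, not the article data):

```python
import numpy as np
from sklearn.cluster import KMeans

# Two toy blobs of 2-D points.
rng = np.random.RandomState(0)
points = np.vstack([rng.normal(0, 0.5, size=(20, 2)),
                    rng.normal(4, 0.5, size=(20, 2))])

km_toy = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

# Squared distance from every point to every centroid, then keep the nearest one.
diffs = points[:, np.newaxis, :] - km_toy.cluster_centers_[np.newaxis, :, :]
sq_dists = (diffs ** 2).sum(axis=2)
manual_inertia = sq_dists.min(axis=1).sum()

print(manual_inertia, km_toy.inertia_)  # the two values should agree up to rounding
```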

In [16]:
import numpy as np

# The TF-IDF matrix does not depend on n_clusters, so compute it once
X = TfidfVectorizer(max_df=0.4).fit_transform(documents)

inertia_scores = []
n_cluster_values = list(range(2, 20))
for n_clusters in n_cluster_values:
    cur_inertia_scores = []
    # Repeat each setting 30 times, since k-means depends on its random initialisation
    for i in range(30):
        km = KMeans(n_clusters=n_clusters).fit(X)
        cur_inertia_scores.append(km.inertia_)
    inertia_scores.append(cur_inertia_scores)
inertia_scores = np.array(inertia_scores)


In [None]:
%matplotlib inline
from matplotlib import pyplot as plt

inertia_means = np.mean(inertia_scores, axis=1)
inertia_std = np.std(inertia_scores, axis=1)  # standard deviation over the 30 runs

fig = plt.figure(figsize=(40,20))
plt.errorbar(n_cluster_values, inertia_means, inertia_std, color='green')
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')
plt.show()

In [None]:
n_clusters = 6

pipeline = Pipeline([('feature_extraction', TfidfVectorizer(max_df=0.4)),
                     ('clusterer', KMeans(n_clusters=n_clusters))
                     ])
pipeline.fit(documents)

In [None]:
labels = pipeline.predict(documents)

In [None]:
# Note: the more indicator sentences per article, the easier the clusters are to separate
# trivial_docs = return_docs(num_indicators=10)
# pipeline.fit(trivial_docs)
# trivial_labels = pipeline.predict(trivial_docs)

# from collections import Counter
# c = Counter(trivial_labels)
# for cluster_number in range(n_clusters):
#     print("Cluster {} contains {} samples".format(cluster_number, c[cluster_number]))

In [None]:

c = Counter(labels)

# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
terms = pipeline.named_steps['feature_extraction'].get_feature_names_out()

for cluster_number in range(n_clusters):
    print("Cluster {} contains {} samples".format(cluster_number, c[cluster_number]))
    print("  Most important terms")
    centroid = pipeline.named_steps['clusterer'].cluster_centers_[cluster_number]
    most_important = centroid.argsort()
    for i in range(5):
        term_index = most_important[-(i+1)]
        print("  {0}) {1} (score: {2:.4f})".format(i+1, terms[term_index], centroid[term_index]))
    print()

In [None]:
from sklearn.metrics import silhouette_score
X = pipeline.named_steps['feature_extraction'].transform(documents)
silhouette_score(X, labels)
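The silhouette score ranges from -1 to 1; values near 1 mean samples sit much closer to their own cluster than to any other, and values near or below 0 suggest overlapping or poorly assigned clusters. A small sketch on toy data (the points and both labelings are assumptions for illustration) contrasts a labeling that matches the structure with one that ignores it:

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Two tight blobs that are far apart.
blob_a = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [0.0, 0.2]])
blob_b = blob_a + 10.0
X_blobs = np.vstack([blob_a, blob_b])

good_labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])   # matches the two blobs
mixed_labels = np.array([0, 1, 0, 1, 0, 1, 0, 1])  # ignores the structure

good_score = silhouette_score(X_blobs, good_labels)
mixed_score = silhouette_score(X_blobs, mixed_labels)
print(good_score, mixed_score)
```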

In [None]:
len(terms)

In [None]:
# KMeans.transform maps each document to its distances from the n_clusters centroids
Y = pipeline.transform(documents)

In [None]:
km = KMeans(n_clusters=n_clusters)
labels = km.fit_predict(Y)

In [None]:
c = Counter(labels)
for cluster_number in range(n_clusters):
    print("Cluster {} contains {} samples".format(cluster_number, c[cluster_number]))

In [None]:
silhouette_score(Y, labels)

In [None]:
Y.shape

## Evidence Accumulation Clustering

In [None]:
from scipy.sparse import csr_matrix


def create_coassociation_matrix(labels):
    rows = []
    cols = []
    unique_labels = set(labels)
    for label in unique_labels:
        indices = np.where(labels == label)[0]
        for index1 in indices:
            for index2 in indices:
                rows.append(index1)
                cols.append(index2)
    data = np.ones((len(rows),))
    return csr_matrix((data, (rows, cols)), dtype='float')
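To see what the co-association matrix holds, the sketch below applies the same idea to five toy labels. The helper `coassociation` is a hypothetical restatement of `create_coassociation_matrix` so that the example runs on its own; entry (i, j) is 1 when samples i and j received the same label.

```python
import numpy as np
from scipy.sparse import csr_matrix

def coassociation(labels):
    """Entry (i, j) is 1 when samples i and j share a cluster label."""
    labels = np.asarray(labels)
    rows, cols = [], []
    for label in set(labels.tolist()):
        indices = np.where(labels == label)[0]
        # Every pair of samples inside the same cluster gets an entry.
        for i in indices:
            for j in indices:
                rows.append(i)
                cols.append(j)
    n = len(labels)
    return csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(n, n))

toy_labels = [0, 0, 1, 1, 0]
C_toy = coassociation(toy_labels)
print(C_toy.toarray())
```

Samples 0, 1 and 4 share a label, as do samples 2 and 3, so the matrix has 3 * 3 + 2 * 2 = 13 non-zero entries.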


In [None]:
C = create_coassociation_matrix(labels)

In [None]:
C

In [None]:
C.shape, C.shape[0] * C.shape[1]

In [None]:
len(C.nonzero()[0]) / (C.shape[0] * C.shape[1])

In [None]:
from scipy.sparse.csgraph import minimum_spanning_tree

In [None]:
mst = minimum_spanning_tree(C)

In [None]:
mst

In [None]:
pipeline = Pipeline([('feature_extraction', TfidfVectorizer(max_df=0.4)),
                     ('clusterer', KMeans(n_clusters=3))
                     ])
pipeline.fit(documents)
labels2 = pipeline.predict(documents)
C2 = create_coassociation_matrix(labels2)

In [None]:

# Average the two co-association matrices (despite its name, C_sum holds the mean)
C_sum = (C + C2) / 2
C_sum.todense()

In [None]:
mst = minimum_spanning_tree(-C_sum)
mst

In [None]:

# The weights were negated, so -1 means the pair was grouped together in every clustering;
# cut every edge weaker than that
mst.data[mst.data > -1] = 0
mst.eliminate_zeros()
mst

In [None]:
from scipy.sparse.csgraph import connected_components
number_of_clusters, labels = connected_components(mst)
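The steps above (negate the averaged matrix, build a minimum spanning tree, cut the weak edges, then read off the connected components) can be traced on a tiny hand-made example. The 4 x 4 matrix below is an assumption for illustration: it is what two hypothetical clusterings of four points, `[0, 0, 1, 1]` and `[0, 0, 0, 1]`, would average to.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components, minimum_spanning_tree

# Fraction of the two clusterings in which each pair of points was grouped together.
# Points 0 and 1 always co-occur; other pairs co-occur in half of the runs or never.
C_avg = csr_matrix(np.array([
    [1.0, 1.0, 0.5, 0.0],
    [1.0, 1.0, 0.5, 0.0],
    [0.5, 0.5, 1.0, 0.5],
    [0.0, 0.0, 0.5, 1.0],
]))

# Negate so that the strongest links become the cheapest edges of the spanning tree.
tree = minimum_spanning_tree(-C_avg)

# Drop every edge weaker than full agreement (weight above -1), then count the groups.
tree.data[tree.data > -1] = 0
tree.eliminate_zeros()
n_groups, group_labels = connected_components(tree)
print(n_groups, group_labels)
```

Only the link between points 0 and 1 survives the cut, so those two stay together and points 2 and 3 each form their own group.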

In [None]:
from sklearn.base import BaseEstimator, ClusterMixin

class EAC(BaseEstimator, ClusterMixin):
    def __init__(self, n_clusterings=10, cut_threshold=0.5, n_clusters_range=(3, 10)):
        self.n_clusterings = n_clusterings
        self.cut_threshold = cut_threshold
        self.n_clusters_range = n_clusters_range
    
    def fit(self, X, y=None):
        # Sum the co-association matrices of several k-means runs
        C = sum((create_coassociation_matrix(self._single_clustering(X))
                 for i in range(self.n_clusterings)))
        # Normalise the counts so that cut_threshold is a fraction of the clusterings;
        # on raw counts (all >= 1) a threshold of 0.5 would never cut an edge
        C = C / self.n_clusterings
        mst = minimum_spanning_tree(-C)
        mst.data[mst.data > -self.cut_threshold] = 0
        mst.eliminate_zeros()
        self.n_components, self.labels_ = connected_components(mst)
        return self
    
    def _single_clustering(self, X):
        # Each run uses a random number of clusters from n_clusters_range
        n_clusters = np.random.randint(*self.n_clusters_range)
        km = KMeans(n_clusters=n_clusters)
        return km.fit_predict(X)
    
    def fit_predict(self, X, y=None):
        self.fit(X)
        return self.labels_

In [None]:

pipeline = Pipeline([('feature_extraction', TfidfVectorizer(max_df=0.4)),
                     ('clusterer', EAC())])

In [None]:
pipeline.fit(documents)

In [None]:
labels = pipeline.named_steps['clusterer'].labels_

In [None]:
c = Counter(labels)
c

## Online Learning

In [None]:
from sklearn.cluster import MiniBatchKMeans

In [None]:
vec = TfidfVectorizer(max_df=0.4)

In [None]:
X = vec.fit_transform(documents)

In [None]:
mbkm = MiniBatchKMeans(random_state=14, n_clusters=3)
batch_size = 500

indices = np.arange(0, X.shape[0])
for iteration in range(100):
    # Draw a random mini-batch (with replacement) and update the centroids with it
    sample = np.random.choice(indices, size=batch_size, replace=True)
    mbkm.partial_fit(X[sample])

In [None]:
mbkm = MiniBatchKMeans(random_state=14, n_clusters=3)
batch_size = 10

for iteration in range(int(X.shape[0] / batch_size)):
    start = batch_size * iteration
    end = batch_size * (iteration + 1)
    mbkm.partial_fit(X[start:end])

In [None]:
labels_mbkm = mbkm.predict(X)
mbkm.inertia_

In [None]:
km = KMeans(random_state=14, n_clusters=3)
labels_km = km.fit_predict(X)
km.inertia_

In [None]:
from sklearn.metrics import adjusted_mutual_info_score, homogeneity_score
from sklearn.metrics import mutual_info_score, v_measure_score

In [None]:
v_measure_score(labels_mbkm, labels_km)
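Cluster numbers are arbitrary, so two clusterings cannot be compared by checking whether the labels are equal. The V-measure compares the groupings themselves. A short sketch with made-up label lists (all three lists are assumptions for illustration):

```python
from sklearn.metrics import v_measure_score

labels_first = [0, 0, 1, 1, 2, 2]
labels_renamed = [2, 2, 0, 0, 1, 1]   # same grouping, different cluster numbers
labels_merged = [0, 0, 0, 0, 1, 1]    # two of the groups merged into one

same_grouping = v_measure_score(labels_first, labels_renamed)
coarser_grouping = v_measure_score(labels_first, labels_merged)
print(same_grouping, coarser_grouping)
```

The renamed labeling scores a perfect 1.0 because the grouping is identical, while the merged labeling scores lower because it no longer separates all three groups.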

In [None]:
X.shape

In [None]:
labels_mbkm

In [None]:
from sklearn.feature_extraction.text import HashingVectorizer

In [None]:
class PartialFitPipeline(Pipeline):
    def partial_fit(self, X, y=None):
        Xt = X
        for name, transform in self.steps[:-1]:
            Xt = transform.transform(Xt)
        return self.steps[-1][1].partial_fit(Xt, y=y)

In [None]:
pipeline = PartialFitPipeline([('feature_extraction', HashingVectorizer()),
                             ('clusterer', MiniBatchKMeans(random_state=14, n_clusters=3))])
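`HashingVectorizer` suits this pipeline because it is stateless: it hashes tokens into a fixed number of columns and never needs to be fitted, so every batch produces features in the same space. A vocabulary-based vectorizer such as `TfidfVectorizer` would instead produce different columns depending on the batch it was fitted on. The sketch below illustrates this with two made-up batches; `n_features=2**10` is chosen only for illustration.

```python
from sklearn.feature_extraction.text import HashingVectorizer, TfidfVectorizer

batch_one = ["stocks fall as markets open", "team wins the final"]
batch_two = ["storm closes coastal roads", "new film breaks records this weekend"]

# TfidfVectorizer learns a vocabulary, so its columns depend on the fitted data.
tfidf_cols_one = TfidfVectorizer().fit_transform(batch_one).shape[1]
tfidf_cols_two = TfidfVectorizer().fit_transform(batch_two).shape[1]

# HashingVectorizer maps tokens to a fixed number of columns without any fitting.
hasher = HashingVectorizer(n_features=2**10)  # size chosen for illustration
hashed_one = hasher.transform(batch_one)
hashed_two = hasher.transform(batch_two)

print(tfidf_cols_one, tfidf_cols_two)      # differ from batch to batch
print(hashed_one.shape, hashed_two.shape)  # always the same number of columns
```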

In [None]:
batch_size = 10

for iteration in range(int(len(documents) / batch_size)):
    start = batch_size * iteration
    end = batch_size * (iteration + 1)
    pipeline.partial_fit(documents[start:end])

In [None]:
labels = pipeline.predict(documents)
labels