The data presented here are news articles collected from various news sources. Articles from each news source are grouped together and stored in the same file. Each line of a file is a separate news article with its corresponding timestamp. We are only interested in the raw news text. The dataset is available at https://archive.ics.uci.edu/dataset/438/health+news+in+twitter.

This is a real dataset that will be used to test the performance of the algorithm and to compare against the automatically generated datasets. The automatically generated data are produced at runtime, so it does not make much sense to visualize them since they change from one run to the next. Therefore, this is the only fixed dataset that we will use to test the algorithm.

Note that although the other data are automatically generated, we can still control their parameters: the sampling distribution, standard deviation, number of clusters, sparsity, and so on.
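
The generator itself is not shown in this notebook. Purely as an illustration, a generator exposing these knobs could look like the sketch below. The function name `make_sparse_clusters`, its parameters, and the choice of Gaussian noise around uniformly drawn centers are all assumptions, not the actual implementation.

```python
import numpy as np

def make_sparse_clusters(n_samples, n_features, n_clusters, std, sparsity, seed=0):
    """Hypothetical generator: Gaussian clusters with a fraction of entries zeroed out."""
    rng = np.random.default_rng(seed)
    # One random center per cluster.
    centers = rng.uniform(-10.0, 10.0, size=(n_clusters, n_features))
    # Assign each sample to a cluster uniformly at random.
    labels = rng.integers(0, n_clusters, size=n_samples)
    # Sample points around their cluster center with the requested spread.
    X_syn = centers[labels] + rng.normal(0.0, std, size=(n_samples, n_features))
    # Zero out roughly `sparsity` of the entries to control how sparse the matrix is.
    mask = rng.random(X_syn.shape) < sparsity
    X_syn[mask] = 0.0
    return X_syn, labels

X_syn, y_syn = make_sparse_clusters(n_samples=200, n_features=50, n_clusters=3, std=1.0, sparsity=0.9)
measured_sparsity = 1.0 - np.count_nonzero(X_syn) / X_syn.size
print(X_syn.shape, measured_sparsity)
```

Fixing `seed` makes a run repeatable, which is one way to compare results across runs even though the data are generated at runtime.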

In [1]:
%matplotlib notebook
import pandas as pd
from sklearn.decomposition import PCA
from matplotlib import pyplot as plt
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
import os
import functools
from dataclasses import dataclass
from typing import Any

Utility class for computing summary statistics of a data matrix.

In [2]:
@dataclass(repr=True)
class data_stats:

    stds_sum: float
    stds_mean: float
    stds_median: float
    std_max: float
    std_min: float
    dist_from_origin: float
    shape: Any
    sparsity: float

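    def __init__(self, data: np.ndarray):
        # Standard deviation of each row (axis=1), i.e. of each data point across its features.
        self.stds = data.std(axis=1)
        self.stds_sum = self.stds.sum()
        self.stds_mean = self.stds.mean()
        self.stds_median = np.median(self.stds)
        self.std_max = np.max(self.stds)
        self.std_min = np.min(self.stds)
        # Norm of the mean data point, i.e. distance of the centroid from the origin.
        self.dist_from_origin = np.linalg.norm(data.mean(axis=0))
        self.shape = data.shape
        # Fraction of entries that are exactly zero.
        self.sparsity = 1.0 - np.count_nonzero(data) / float(data.size)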
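
To make the axis conventions used by the class concrete, here is a small sketch on a made-up 2x3 matrix (the array `toy` is purely illustrative). It shows that `std(axis=1)` yields one value per row, `std(axis=0)` one value per column, and how the centroid norm and sparsity are obtained.

```python
import numpy as np

toy = np.array([[1.0, 0.0, 0.0],
                [0.0, 2.0, 0.0]])

row_stds = toy.std(axis=1)   # one value per row (data point): shape (2,)
col_stds = toy.std(axis=0)   # one value per column (feature): shape (3,)
centroid_norm = np.linalg.norm(toy.mean(axis=0))    # distance of the mean point from the origin
sparsity = 1.0 - np.count_nonzero(toy) / toy.size   # 4 of the 6 entries are zero

print(row_stds.shape, col_stds.shape, centroid_norm, sparsity)
```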


Load the data from `path`. There are 16 clusters in total, but we select only 5 because the array gets very large, and when many clusters pile up on each other the plot becomes hard to read. In the real implementation, clusters are loaded dynamically by passing in the number of clusters to load, so any value from 2 to 16 can still be tested.

Once loaded, we combine all the series into a single series for further processing.

In [3]:
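path = 'data/health+news+in+twitter/Health-Tweets'

dfs = []    # one Series of article texts per news source
lens = []   # number of articles kept per source
names = []  # source name, taken from the file name
k_clus = 5  # number of clusters (news sources) to load
cnt = 0
# Note: os.listdir returns entries in arbitrary order, so which files are picked may vary by platform.
for filename in os.listdir(path):
    f = os.path.join(path, filename)
    if os.path.isfile(f) and cnt < k_clus:
        # Fields are separated by '|'; keep only the last column (the text).
        # Note: read_csv treats the first line of each file as a header unless header=None is passed.
        dfs.append(pd.read_csv(f, sep="|", on_bad_lines='skip', encoding="latin1").iloc[:, -1])
        lens.append(dfs[-1].size)
        names.append(filename[:-4])  # drop the 4-character file extension
        cnt += 1

# Leading 0 so that the cumulative sum later gives the slice boundaries of each cluster.
lens = np.array([0] + lens)
aggregated_df = pd.concat(dfs)
aggregated_df, names, lens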

(0       GP workload harming care - BMA poll http://bbc...
 1       Short people's 'heart risk greater' http://bbc...
 2       New approach against HIV 'promising' http://bb...
 3       Coalition 'undermined NHS' - doctors http://bb...
 4       Review of case against NHS manager http://bbc....
                               ...                        
 1994    Researchers use video games to study how sleep...
 1995    Are energy drinks really that bad for you? htt...
 1996    Men suffering from #depression may also suffer...
 1997    #Thanksgiving science: Why #gratitude is good ...
 1998    Clinton Kellys fresh and #fruity take on #hol...
 Length: 16936, dtype: object,
 ['bbchealth', 'cbchealth', 'cnnhealth', 'everydayhealth', 'foxnewshealth'],
 array([   0, 3928, 3727, 4044, 3238, 1999]))
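
The dynamic loading mentioned above is not shown in this notebook. A minimal sketch of what such a helper might look like follows. The name `load_clusters`, the case-insensitive sorting of file names, and the assumption that the files have no header row are all hypothetical choices made for illustration.

```python
import os
import pandas as pd

def load_clusters(path, k):
    """Hypothetical helper: load the text column of the first k source files under `path`."""
    # Sort case-insensitively so the same files are selected on every platform.
    files = sorted(
        (f for f in os.listdir(path) if os.path.isfile(os.path.join(path, f))),
        key=str.lower,
    )[:k]
    texts, names = [], []
    for filename in files:
        # Assumption: no header row, '|'-separated fields, text in the last column.
        df = pd.read_csv(os.path.join(path, filename), sep="|", header=None,
                         on_bad_lines="skip", encoding="latin1")
        texts.append(df.iloc[:, -1])
        names.append(os.path.splitext(filename)[0])
    sizes = [len(t) for t in texts]
    return pd.concat(texts, ignore_index=True), names, sizes
```

Calling `load_clusters(path, 5)` would then return the combined series, the source names, and the per-source sizes in one step. Because of `header=None`, the sizes may differ slightly from the ones printed above.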

Since our raw data is text, we need to preprocess it into numerical values before we can apply ML algorithms to it. Here we use `TfidfVectorizer` to transform the text into numerical data. It builds a vocabulary from all articles and represents each article as a vector of TF-IDF weights: the frequency of each word in the article, scaled down for words that appear in many articles. We then produce the corresponding label for each data point to pair with the numerical data.

In [4]:
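# Note: stop_words expects the string 'english' to enable the built-in English stop-word list;
# the set {'english'} used here is not that list (the output below was produced with this setting).
vectorizer = TfidfVectorizer(stop_words={'english'})
# Dense array of shape (n_documents, vocabulary_size); this is memory-heavy for large corpora.
X = vectorizer.fit_transform(aggregated_df).toarray()
# Repeat each source name once per article from that source (lens[0] is the leading 0).
y = np.repeat(names, lens[1:])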
X, y, X.shape

(array([[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]]),
 array(['bbchealth', 'bbchealth', 'bbchealth', ..., 'foxnewshealth',
        'foxnewshealth', 'foxnewshealth'], dtype='<U14'),
 (16936, 30112))
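
To illustrate what the TF-IDF weights look like, here is a small pure-Python sketch on three made-up sentences. It follows the smoothed formula that scikit-learn documents as its default, `idf = ln((1 + n) / (1 + df)) + 1` followed by L2 normalization of each row, but the whitespace tokenizer and the example sentences are assumptions, so the numbers will not match `TfidfVectorizer` on real text exactly.

```python
import math
from collections import Counter

docs = ["flu vaccine news", "flu season news", "heart health"]

# Toy tokenizer (assumption): lowercase and split on whitespace.
tokenized = [d.lower().split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})
n_docs = len(docs)

# Document frequency: number of documents containing each word.
doc_freq = Counter(w for doc in tokenized for w in set(doc))
# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1.0 for w in vocab}

def tfidf_row(doc):
    counts = Counter(doc)
    row = [counts[w] * idf[w] for w in vocab]
    norm = math.sqrt(sum(v * v for v in row))
    return [v / norm for v in row]  # L2-normalize so every row has unit length

X_toy = [tfidf_row(doc) for doc in tokenized]
for word, weight in zip(vocab, X_toy[0]):
    print(f"{word:8s} {weight:.3f}")
```

In the first sentence, "vaccine" appears in only one document and therefore gets a larger weight than "flu" or "news", which appear in two. Words absent from the sentence get a weight of zero, which is why the real matrix `X` is so sparse.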

We will use PCA to visualize the data. The shape of X is (16936, 30112), and it is impossible to plot data with that many dimensions directly. The most we can show in a single plot is a 3D scatter plot, so we set the number of components to three, which keeps the three directions of largest variance. We then apply the dimensionality reduction to X.

In [5]:
pca = PCA(n_components=3)
reduc_X = pca.fit_transform(X)
reduc_X

array([[ 0.12508905,  0.01893124, -0.03730875],
       [ 0.13665667,  0.01262441, -0.05037594],
       [ 0.13682485,  0.01893929, -0.03562506],
       ...,
       [-0.00252938, -0.07758758, -0.03194731],
       [-0.07043023, -0.07821037,  0.00260349],
       [-0.02956656, -0.06640642, -0.00736219]])
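
As a sketch of what the projection does, the snippet below performs PCA by hand on random toy data using an SVD of the centered matrix, and reports how much of the total variance the first three components retain. The toy matrix `A` is an assumption; on the real data the same quantity is available from the fitted object as `pca.explained_variance_ratio_`.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data (assumption): 100 points in 10 dimensions with decreasing spread per axis.
A = rng.normal(size=(100, 10)) * np.linspace(3.0, 0.1, 10)

# PCA via SVD of the centered data.
A_centered = A - A.mean(axis=0)
U, S, Vt = np.linalg.svd(A_centered, full_matrices=False)

n_components = 3
projected = A_centered @ Vt[:n_components].T   # coordinates along the top-3 principal directions
explained_ratio = (S ** 2) / np.sum(S ** 2)    # share of the total variance per component

print(projected.shape, explained_ratio[:n_components].sum())
```

A low retained share would mean that the 3D plot hides most of the structure of the data, which is worth keeping in mind when reading the scatter plot.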

Here, we compute the prefix sum of the length array so that it can be used to index each cluster when plotting later on.

In [6]:
clusters_means = []
clusters = []
prefix = lens.cumsum()
prefix

array([    0,  3928,  7655, 11699, 14937, 16936])
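
The prefix-sum indexing can be seen on a small made-up example (the arrays `sizes` and `points` are illustrative only). Consecutive entries of the cumulative sum are the start and end rows of each group.

```python
import numpy as np

sizes = np.array([0, 2, 3, 1])          # leading 0, then the size of each group
bounds = sizes.cumsum()                 # -> [0, 2, 5, 6]
points = np.arange(12).reshape(6, 2)    # 6 rows stacked group after group

# Group i occupies rows bounds[i] .. bounds[i + 1] - 1.
groups = [points[bounds[i]:bounds[i + 1]] for i in range(len(sizes) - 1)]

# np.split with the interior boundaries gives the same partition.
same = np.split(points, bounds[1:-1])
print([g.shape[0] for g in groups])     # -> [2, 3, 1]
```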

Here we split the projected data into one array per cluster and compute the mean of each cluster in the original feature space.

In [7]:
for i in range(lens.size - 1):
    clusters.append(reduc_X[prefix[i]:prefix[i+1],:])
    clusters_means.append(X[prefix[i]:prefix[i+1],:].mean(axis=0))

clusters_means, clusters, len(clusters)

([array([0.00010171, 0.00187191, 0.        , ..., 0.        , 0.        ,
         0.        ]),
  array([0.        , 0.00224272, 0.        , ..., 0.        , 0.        ,
         0.        ]),
  array([0.00000000e+00, 1.33511908e-03, 8.77821625e-05, ...,
         5.28469375e-05, 8.44114572e-05, 0.00000000e+00]),
  array([1.78937949e-04, 2.70675201e-04, 0.00000000e+00, ...,
         0.00000000e+00, 0.00000000e+00, 8.45450557e-05]),
  array([0.        , 0.00182102, 0.        , ..., 0.        , 0.        ,
         0.        ])],
 [array([[ 0.12508905,  0.01893124, -0.03730875],
         [ 0.13665667,  0.01262441, -0.05037594],
         [ 0.13682485,  0.01893929, -0.03562506],
         ...,
         [ 0.1319689 ,  0.01252517, -0.04120252],
         [ 0.17345353,  0.02422875, -0.05242519],
         [ 0.13224365,  0.02043836, -0.04432558]]),
  array([[ 0.07465068,  0.01927082,  0.18906008],
         [ 0.00328963,  0.0070751 ,  0.16448705],
         [ 0.0053711 ,  0.00297966,  0.16362168],


We plot the projected data in a 3D scatter plot and label each cluster according to its news source. Notice that some clusters, like bbchealth, are easily separable from the others, while some, like everydayhealth and foxnewshealth, are harder to separate in 3 dimensions. However, it might be possible to find a projection in a higher-dimensional space that separates the clusters from each other.

![pca](pics/pca_1.png)

In [13]:
""" UNCOMMENT TO RUN """

"""
fig = plt.figure()
ax = fig.add_subplot(projection='3d')
colors = ["red", "blue", "green", "purple", "orange"]
for i in range(len(clusters)):
    temp = clusters[i]
    ax.scatter(temp[:, 0], temp[:, 1], temp[:, 2], c=colors[i], label=names[i])

ax.legend(loc="upper left")
"""


We can use t-SNE to visualize the data in a similar way. However, since t-SNE is computationally expensive for data with a large number of features, we first use PCA to project the data into a smaller subspace with 100 dimensions, and only then run t-SNE on the projected data to embed the points into a 3D space for visualization. We chose a perplexity of 40 because the dataset is large, and set `early_exaggeration` to 50 for the same reason.

In [9]:
""" UNCOMMENT TO RUN """

"""
from sklearn.manifold import TSNE
tsne = TSNE(n_components=3, perplexity=40, early_exaggeration=50)
pca = PCA(n_components=100)
first_step_reduction = pca.fit_transform(X)
reduc_X = tsne.fit_transform(first_step_reduction)
reduc_X
"""



array([[  7.1982684 , -28.995901  ,  14.533412  ],
       [-20.763655  , -16.066372  ,   7.5479674 ],
       [  1.7262423 , -20.999937  ,   1.3761098 ],
       ...,
       [-15.520175  ,   1.3319118 ,  12.025691  ],
       [-23.136036  ,  -3.1372528 ,  -4.219664  ],
       [ -0.12938255,   3.1262257 ,  22.0474    ]], dtype=float32)

The clusters here are clumped together, which is somewhat expected because the input matrix X is extremely sparse and very high-dimensional. PCA is therefore the better choice for visualizing this dataset.

![title](pics/tsne_1.png)

In [14]:
"""UNCOMMENT TO RUN"""

"""
clusters_means = []
clusters = []
prefix = lens.cumsum()
for i in range(lens.size - 1):
    clusters.append(reduc_X[prefix[i]:prefix[i+1],:])
    clusters_means.append(X[prefix[i]:prefix[i+1],:].mean(axis=0))

clusters_means, clusters, len(clusters)

fig = plt.figure()
ax = fig.add_subplot(projection='3d')
colors = ["red", "blue", "green", "purple", "orange"]
for i in range(len(clusters)):
    temp = clusters[i]
    ax.scatter(temp[:, 0], temp[:, 1], temp[:, 2], c=colors[i], label=names[i])

ax.legend(loc="upper left")
"""


Here we compute further statistics of the data. Standard deviations are computed per row (`axis=1`), that is, for each data point across its features, and their sum, mean, median, max, and min are given below. The mean of the standard deviations is very low (~0.00576), and the min and max are nearly identical to it, which suggests that the data points are closely packed around the center.

Although the matrix is very large, it consists mostly of zeroes: the sparsity is about 0.9995, so more than 99.9 percent of the entries are zero.

In conclusion, the data has a very high dimension but is very sparse (more than 99.9 percent zeroes), and the data points are tightly packed around the origin (0, 0, ..., 0). Nevertheless, the PCA plot shows that the clusters are at least partly separable.

In [11]:
stats = data_stats(X)
stats

data_stats(stds_sum=97.57934983373114, stds_mean=0.005761652682671891, stds_median=0.005761757307156004, std_max=0.005762574635993726, std_min=0.005756185691794238, dist_from_origin=0.12231959772827586, shape=(16936, 30112), sparsity=0.9995195703321675)
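
One possible reading of the nearly constant per-row standard deviation, offered as a sketch rather than as part of the original analysis: `TfidfVectorizer` L2-normalizes each row by default (`norm='l2'`), and for a unit-length row `x` with `d` entries the variance equals `1/d - mean(x)^2`, so the standard deviation cannot exceed `1/sqrt(d)`. With `d = 30112` that bound is about 0.00576, which is close to `std_max` above. The check below uses a randomly generated sparse row, which is an assumption and not the real data.

```python
import numpy as np

d = 30112                                # number of features, taken from the shape of X above
rng = np.random.default_rng(0)

# Hypothetical sparse row: 10 non-zero weights, scaled to unit L2 norm.
row = np.zeros(d)
row[rng.choice(d, size=10, replace=False)] = rng.random(10) + 0.1
row /= np.linalg.norm(row)

upper_bound = 1.0 / np.sqrt(d)           # std of a unit-norm row is at most 1/sqrt(d)
print(row.std(), upper_bound)
```

If this reading is right, the small and uniform standard deviations mostly reflect the normalization and the number of features rather than how far apart the articles are from each other.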