# Text Clustering Using Auto Encoders

Reference:
This work is inspired by Chengwei's blog post ["How to do Unsupervised Clustering with Keras"](https://github.com/Tony607/Keras_Deep_Clustering/blob/master/Keras-DEC.ipynb).

This work comprises the following tasks:
  - Use the [20 Newsgroups](https://scikit-learn.org/0.18/auto_examples/text/document_clustering.html) data and the K-means evaluation provided by scikit-learn
  - Use K-means as a baseline model for computing the clusters in the 20 Newsgroups data
  - Extend the K-means model with a Keras-based autoencoder to improve the prediction of the clusters
  - Visualize the clusters using t-SNE
  - Evaluate the models' performance metrics
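
For the evaluation task, the outputs later in this notebook print a `cluster_to_label_map`, an accuracy, and a homogeneity score. The project code that computes them is not shown here, so the following is only a minimal sketch under assumptions: the helper names `map_clusters_to_labels`, `accuracy`, and `homogeneity` are hypothetical, the mapping is assumed to be a simple majority vote, and the toy labels are made up. scikit-learn's `sklearn.metrics.homogeneity_score` computes the same homogeneity quantity.

```python
import math
from collections import Counter, defaultdict


def map_clusters_to_labels(cluster_ids, true_labels):
    """Assign each cluster the most common true label among its members."""
    members = defaultdict(list)
    for cluster, label in zip(cluster_ids, true_labels):
        members[cluster].append(label)
    return {cluster: Counter(labels).most_common(1)[0][0]
            for cluster, labels in members.items()}


def accuracy(cluster_ids, true_labels, cluster_to_label):
    """Fraction of documents whose mapped cluster label equals the true label."""
    hits = sum(cluster_to_label[c] == y for c, y in zip(cluster_ids, true_labels))
    return hits / len(true_labels)


def entropy(counts, total):
    """Shannon entropy (natural log) of a list of counts."""
    return -sum((n / total) * math.log(n / total) for n in counts if n > 0)


def homogeneity(cluster_ids, true_labels):
    """1 - H(labels | clusters) / H(labels); 1.0 means every cluster is pure."""
    total = len(true_labels)
    label_entropy = entropy(Counter(true_labels).values(), total)
    if label_entropy == 0.0:
        return 1.0
    members = defaultdict(list)
    for cluster, label in zip(cluster_ids, true_labels):
        members[cluster].append(label)
    conditional = sum(
        (len(labels) / total) * entropy(Counter(labels).values(), len(labels))
        for labels in members.values()
    )
    return 1.0 - conditional / label_entropy


# Made-up toy data: six documents, two true categories, two clusters.
true_labels = ["sci.space", "sci.space", "sci.space",
               "comp.graphics", "comp.graphics", "comp.graphics"]
cluster_ids = [0, 0, 1, 1, 1, 1]

cluster_to_label = map_clusters_to_labels(cluster_ids, true_labels)
acc = accuracy(cluster_ids, true_labels, cluster_to_label)
hom = homogeneity(cluster_ids, true_labels)
print(cluster_to_label, round(acc, 2), round(hom, 2))
```

A majority vote can assign the same label to two clusters. If a one-to-one mapping is required, a different assignment strategy would be needed.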

## Mount my working drive

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Install mpld3 (interactive matplotlib graphics), MulticoreTSNE, and TensorFlow

In [1]:
!pip install git+https://github.com/mpld3/mpld3@master#egg=mpld3
!pip install MulticoreTSNE-modified
!pip install tensorflow==1.14



## Imports

In [0]:
import mpld3
import sys
sys.path.insert(0,'/content/drive/My Drive/Colab Notebooks/TextClustering/ClusteringDocuments')

In [3]:
import RunDocClustering as rdc
import Constants as c

%matplotlib inline

Using TensorFlow backend.


## Run Baseline Model using K-means with no dimensionality reduction

In [5]:
def demoNoDimReductionKmeansModel():
  c.IS_KMEANS_MODEL = True
  c.IS_AUTOENCODER_MODEL = False
  c.IS_APPLY_LSA_DIM_REDUCTION = False
  return rdc.run()

fig = demoNoDimReductionKmeansModel()
mpld3.display(fig)

Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)


In isDataExist()..
Loading 20 newsgroups dataset for categories:
['alt.atheism', 'comp.graphics', 'talk.religion.misc', 'sci.space']
There are 3387 documents
There are 4 categories

X.shape post-TFIDF is: (3387, 10000)
Run time of this operation is: 13.31 seconds
n_samples: 3387, n_features: 10000

data_results[X].shape: (3387, 10000)
Starting to compute the K-means clustering algorithm..
Run time of this operation is: 0.08 seconds
Saving model to cache path: Cache_kmeans_model.pl
------------------------------------------------------------
Top terms per cluster:
cluster_to_label_map: {0: 'alt.atheism', 2: 'comp.graphics', 3: 'sci.space', 1: 'talk.religion.misc'}
Cluster [0 alt.atheism]: god com jesus sandvik people bible don christian article just
Cluster [2 comp.graphics]: space access nasa henry digex pat com toronto alaska gov
Cluster [3 sci.space]: graphics university image thanks com ac posting host nntp files
Cluster [1 talk.religion.misc]: keith sgi caltech com morality livesey

## Run Model using K-means with dimensionality reduction

In [6]:
def demoWithDimReductionKmeansModel():
  c.IS_KMEANS_MODEL = True
  c.IS_AUTOENCODER_MODEL = False
  c.IS_USE_TSNE = True
  c.IS_APPLY_LSA_DIM_REDUCTION = True
  return rdc.run()

fig = demoWithDimReductionKmeansModel()
mpld3.display(fig)

In isDataExist()..
Loading 20 newsgroups dataset for categories:
['alt.atheism', 'comp.graphics', 'talk.religion.misc', 'sci.space']
There are 3387 documents
There are 4 categories

X.shape post-TFIDF is: (3387, 10000)
Run time of this operation is: 0.90 seconds
n_samples: 3387, n_features: 10000

Starting to perform dimensionality reduction using LSA
Run time for LSA processing operation is: 0.13 seconds
X.shape post-SVD is: (3387, 4)
data_results[X].shape: (3387, 4)
Starting to compute the K-means clustering algorithm..
Run time of this operation is: 0.03 seconds
Saving model to cache path: Cache_kmeans_model_with_reduced_dim.pl
------------------------------------------------------------
Top terms per cluster:
cluster_to_label_map: {0: 'alt.atheism', 2: 'comp.graphics', 3: 'sci.space', 1: 'talk.religion.misc'}
Cluster [0 alt.atheism]: god com sandvik people don jesus article think sgi say
Cluster [2 comp.graphics]: graphics space image university nasa ac program uk posting like
Clus
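
The log above reports LSA shrinking the feature matrix from (3387, 10000) to (3387, 4) before K-means runs. How the project implements LSA is not shown, so this is a sketch under the assumption that it follows the scikit-learn document-clustering example linked in the introduction: `TruncatedSVD` followed by L2 normalization. The toy corpus is made up, and `n_components=2` is chosen for the toy data (the log above indicates 4 components).

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

# Made-up toy corpus standing in for the 20 Newsgroups posts.
docs = [
    "nasa launched a shuttle into orbit",
    "the shuttle orbit was set by nasa",
    "nasa plans another orbit mission",
    "render the image with a graphics library",
    "the graphics library draws each image",
    "image files need a graphics viewer",
]

X = TfidfVectorizer(stop_words="english").fit_transform(docs)

# LSA: project the sparse TF-IDF matrix onto a few latent dimensions,
# then rescale each row to unit length so K-means compares directions.
lsa = make_pipeline(
    TruncatedSVD(n_components=2, random_state=0),
    Normalizer(copy=False),
)
X_lsa = lsa.fit_transform(X)
print("X.shape post-SVD is:", X_lsa.shape)

# K-means now runs on the dense, low-dimensional representation.
km = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_ids = km.fit_predict(X_lsa)
print(cluster_ids)
```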

## Run Autoencoder clustering Model 

In [4]:
def demoWithAutoencoderClusteringModel():
  c.IS_KMEANS_MODEL = False
  c.IS_APPLY_LSA_DIM_REDUCTION = False
  c.IS_AUTOENCODER_MODEL = True
  return rdc.run()

fig = demoWithAutoencoderClusteringModel()
mpld3.display(fig)

In isDataExist()..
Pre-process data already exists and will be obtained from the cache path: Cache_pre_process_data.pl
Run time of this operation is: 0.00 seconds
data_results[X].shape: (3387, 10000)
X.shape: (3387, 10000)
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_1 (Dense)              (None, 5000)              50005000  
_________________________________________________________________
dense_2 (Dense)              (None, 2500)              12502500  
_________________________________________________________________
dense_3 (Dense)              (None, 1250)              3126250   
_________________________________________________________________
embedding (Dense)            (None, 2)                 2502      
_________________________________________________________________
dense_4 (Dense)              (None, 1250)              3750      
_____________________________



Epoch 3/5
Epoch 4/5
Epoch 5/5
Dimensionality reduction from (3387, 10000) to (3387, 2) using Autoencoder model 
Starting to compute the K-means clustering algorithm..
Model already exists and will be obtained from the cache path: Cache_kmeans_model.pl
Run time of this operation is: 0.00 seconds
------------------------------------------------------------
Top terms per cluster:
cluster_to_label_map: {0: 'alt.atheism', 2: 'comp.graphics', 3: 'sci.space', 1: 'talk.religion.misc'}
Cluster [0 alt.atheism]: god com jesus sandvik people bible don christian article just
Cluster [2 comp.graphics]: space access nasa henry digex pat com toronto alaska gov
Cluster [3 sci.space]: graphics university image thanks com ac posting host nntp files
Cluster [1 talk.religion.misc]: keith sgi caltech com morality livesey objective moral solntze wpd
------------------------------------------------------------
K-means prediction accuracy is: 0.36
K-means prediction homogeniety score is: 0.55
-----------------
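
The `Param #` column in the model summary above can be checked by hand: a fully connected (Dense) layer with a bias term has `inputs * units + units` trainable parameters. The sketch below recomputes the five rows that survive in the truncated summary. The input width of 10000 comes from the printed `X.shape`, and the helper name `dense_params` is hypothetical.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units


# (layer name, units) for the rows visible in the summary above.
layers = [
    ("dense_1", 5000),
    ("dense_2", 2500),
    ("dense_3", 1250),
    ("embedding", 2),
    ("dense_4", 1250),
]

n_inputs = 10000  # TF-IDF feature count from X.shape
param_counts = {}
for name, units in layers:
    param_counts[name] = dense_params(n_inputs, units)
    print(f"{name:<10} (None, {units})  {param_counts[name]}")
    n_inputs = units  # the next layer consumes this layer's output
```

The 2-unit `embedding` layer is the bottleneck, which is why the log reports a reduction from (3387, 10000) to (3387, 2).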

In [0]:
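import numpy as np

# Scratch check: indexing an array with every position in order returns a copy of it.
values = np.array([2, 3, 5, 6, 7, 3, 2, 1, 1, 1])
positions = np.arange(10)
reordered = values[positions]
reordered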

array([2, 3, 5, 6, 7, 3, 2, 1, 1, 1])