# Blog Post Clustering – Part 3: Autoencoder

After comparing different models and model combinations for clustering my blog posts (see <a href="https://github.com/domreichl/blog-post-clustering/blob/master/bp_clustering_part2.ipynb">Part 2</a>), I now want to try a more complex model, in particular, an autoencoder with three layers.

Autoencoders are neural networks used for unsupervised learning and a powerful tool for dealing with the curse of dimensionality. More precisely, an autoencoder consists of two parts: (1) an *encoder* with multiple layers to reduce dimensionality and (2) a *decoder* with multiple layers to reconstruct the input from the dimensionally-reduced data. By reconstructing its inputs, the network detects the most important features in the data as it learns the identity function under the constraint of reduced dimensionality (or added noise). Since clustering is a form of dimensionality reduction, autoencoders should be useful for categorizing my blog posts into four broad topics.
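
As a compact way of stating "learning the identity function under a constraint": writing the encoder as $f$ and the decoder as $g$ (symbols introduced here purely for illustration) and assuming squared error as the reconstruction loss, which matches the cost function in Section 3 up to a constant factor, training looks for

```latex
\min_{f,\,g} \; \frac{1}{n} \sum_{i=1}^{n} \left\lVert x_i - g\bigl(f(x_i)\bigr) \right\rVert^2,
\qquad f : \mathbb{R}^{d} \to \mathbb{R}^{m}, \quad g : \mathbb{R}^{m} \to \mathbb{R}^{d}, \quad m < d
```

Here $x_i$ is the tf-idf vector of post $i$, $d$ is the vocabulary size, and $m$ is the size of the bottleneck layer (4 in this notebook).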

Overview:
1. Modules & Data
2. Vectorization
3. Autoencoder
4. Evaluation
5. Conclusion

# 1. Modules & Data

Again, I import all modules, load my blog posts, filter them for published posts, and convert the HTML into plain text.

In [1]:
import pandas as pd
from bs4 import BeautifulSoup
from sklearn.feature_extraction.text import TfidfVectorizer
import tensorflow as tf
from sklearn.cluster import KMeans
from sklearn.decomposition import NMF, TruncatedSVD
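# note: calinski_harabaz_score is the older spelling; scikit-learn 0.20+ names it calinski_harabasz_score
from sklearn.metrics import silhouette_score, calinski_harabaz_score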

data = pd.read_csv('wp_posts.csv', sep=';')
data = data[(data['post_type'] == 'post') & (data['post_status'] == 'publish')]
data = data[['post_content']].reset_index(drop=True)

# strip HTML tags and lowercase each post (avoids chained assignment via .loc)
data['post_content'] = data['post_content'].apply(
    lambda html: BeautifulSoup(html, 'html.parser').get_text().lower())

data = data['post_content']

# 2. Vectorization

As before, I use tf-idf vectorization, but now with a lower bound on document frequency (min_df): with min_df=0.1, every word that appears in fewer than 10% of the posts is ignored. This shrinks the vocabulary and thereby the input layer, which allows the neural network to be trained much faster.

In [2]:
vectorizer = TfidfVectorizer(stop_words='english', min_df=0.1)
tfidf_matrix = vectorizer.fit_transform(data)
words = vectorizer.get_feature_names()
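
To make the min_df cut-off concrete, here is a small pure-Python sketch of the filtering rule. It is not scikit-learn's implementation; the toy corpus and the helper `filter_by_min_df` are made up for illustration, and tokenization is reduced to splitting on whitespace.

```python
from collections import Counter

# Toy corpus (made up for illustration); each string stands for one blog post.
docs = [
    "neural network training",
    "neural network clustering",
    "clustering blog posts",
    "blog posts about training",
    "rare autoencoder trick",
]

def filter_by_min_df(docs, min_df):
    """Keep words whose document frequency is at least min_df * len(docs)."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc.split()))  # count each word once per document
    threshold = min_df * len(docs)
    return sorted(word for word, df in doc_freq.items() if df >= threshold)

vocab_all = filter_by_min_df(docs, 0.0)  # no cut-off: full vocabulary
vocab_cut = filter_by_min_df(docs, 0.4)  # word must occur in at least 2 of 5 posts

print(len(vocab_all), "->", len(vocab_cut))
print(vocab_cut)
```

Words that occur in only one post are dropped, so the vocabulary (and with it the width of the autoencoder's input layer) gets smaller.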

# 3. Autoencoder

I use TensorFlow to build an autoencoder. First, I set two hyperparameters (learning rate and number of epochs) as well as the network parameters (numbers of nodes for three layers): the encoder compresses the input to half its size and then to four nodes, and the decoder mirrors these two steps. Then, after defining the graph input (X) and all weights and biases, initialized with normally-distributed random numbers, I build an encoder and a decoder, both with sigmoid activation functions for each layer, construct the model, and define the functions for loss and optimization: the mean squared error between input and reconstruction, minimized with RMSProp. Finally, I initialize the variables and launch the session before I run the training cycles, feeding one blog post per step. In the last two lines, I store the decoder's reconstruction of the full tf-idf matrix as the result and end the training session.

In [3]:
learning_rate = 0.01
training_epochs = 100

n_input = tfidf_matrix.shape[1]
n_hidden_1 = tfidf_matrix.shape[1] // 2
n_hidden_2 = 4

X = tf.placeholder("float", [None, n_input])

weights = {
    'encoder_h1': tf.Variable(tf.random_normal([n_input, n_hidden_1])),
    'encoder_h2': tf.Variable(tf.random_normal([n_hidden_1, n_hidden_2])),
    'decoder_h1': tf.Variable(tf.random_normal([n_hidden_2, n_hidden_1])),
    'decoder_h2': tf.Variable(tf.random_normal([n_hidden_1, n_input])),
}

biases = {
    'encoder_b1': tf.Variable(tf.random_normal([n_hidden_1])),
    'encoder_b2': tf.Variable(tf.random_normal([n_hidden_2])),
    'decoder_b1': tf.Variable(tf.random_normal([n_hidden_1])),
    'decoder_b2': tf.Variable(tf.random_normal([n_input])),
}

def encoder(x):
    layer_1 = tf.nn.sigmoid(tf.add(tf.matmul(x, weights['encoder_h1']), biases['encoder_b1']))
    layer_2 = tf.nn.sigmoid(tf.add(tf.matmul(layer_1, weights['encoder_h2']), biases['encoder_b2']))
    return layer_2

def decoder(x):
    layer_1 = tf.nn.sigmoid(tf.add(tf.matmul(x, weights['decoder_h1']),biases['decoder_b1']))
    layer_2 = tf.nn.sigmoid(tf.add(tf.matmul(layer_1, weights['decoder_h2']), biases['decoder_b2']))
    return layer_2

enc = encoder(X)
dec = decoder(enc)

cost = tf.reduce_mean(tf.pow(X - dec, 2))
optimizer = tf.train.RMSPropOptimizer(learning_rate).minimize(cost)

init = tf.global_variables_initializer()
sess = tf.InteractiveSession() # interactive for jupyter notebook
sess.run(init)

for epoch in range(training_epochs):
    for i in range(len(data)): # one batch per blog post
        _, c = sess.run([optimizer, cost], feed_dict={X: tfidf_matrix[i].toarray()})
    if epoch % 10 == 0: # display every tenth epoch
        print("Epoch:", '%03d' % epoch, "cost =", "{:.9f}".format(c))

autoenc_results = dec.eval(feed_dict={X: tfidf_matrix.toarray()})  # decoder output (reconstruction) for all posts
sess.close()

Epoch: 000 cost = 0.098398253
Epoch: 010 cost = 0.016734146
Epoch: 020 cost = 0.007022711
Epoch: 030 cost = 0.002175688
Epoch: 040 cost = 0.002173345
Epoch: 050 cost = 0.002157870
Epoch: 060 cost = 0.002237542
Epoch: 070 cost = 0.002191694
Epoch: 080 cost = 0.002183638
Epoch: 090 cost = 0.002218124


The loss decreased quite well (much better than with all the other variations and parameters I tried), but the printed cost plateaus at about 0.0022 from epoch 30 onward, so 100 epochs wouldn't have been necessary.
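
To make the shapes in the network explicit, here is a NumPy sketch of a single forward pass through the same architecture (two sigmoid encoder layers, two sigmoid decoder layers, mean squared error as cost). The toy sizes, the random stand-in input, and the seed are assumptions for illustration; nothing is trained here, and the printed numbers are unrelated to the results above.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily

n_posts, n_input = 5, 20         # toy sizes, not the real tf-idf dimensions
n_hidden_1 = n_input // 2        # same halving rule as in the notebook
n_hidden_2 = 4                   # bottleneck size used in the notebook

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Randomly initialized weights and biases, one pair per layer
shapes = [(n_input, n_hidden_1), (n_hidden_1, n_hidden_2),
          (n_hidden_2, n_hidden_1), (n_hidden_1, n_input)]
weights = [rng.normal(size=s) for s in shapes]
biases = [rng.normal(size=s[1]) for s in shapes]

def forward(x, layers):
    """Apply the given (weight, bias) layers with sigmoid activations."""
    for w, b in layers:
        x = sigmoid(x @ w + b)
    return x

X = rng.random((n_posts, n_input))        # stand-in for rows of the tf-idf matrix
layers = list(zip(weights, biases))
encoded = forward(X, layers[:2])          # shape (n_posts, n_hidden_2)
decoded = forward(encoded, layers[2:])    # shape (n_posts, n_input)
cost = np.mean((X - decoded) ** 2)        # mean squared reconstruction error

print(encoded.shape, decoded.shape, round(float(cost), 4))
```

In terms of the notebook, `encoded` corresponds to `enc` and `decoded` to `dec`. The results stored in `autoenc_results` come from `dec`, so they have the full vocabulary dimension rather than the four dimensions of the bottleneck.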

# 4. Evaluation

In this final step, I use some models we already know from Part 2 to build an evaluation table for their scores on three metrics, now including a column for autoencoder-based k-means (km_autoenc).

In [4]:
k = 4 # number of clusters

# NMF & LSA
nmf = NMF(k)
nmf_matrix = nmf.fit_transform(tfidf_matrix)
lsa = TruncatedSVD(k)
lsa_matrix = lsa.fit_transform(tfidf_matrix)

# k-means variations
km = KMeans(k).fit(tfidf_matrix)
km_nmf = KMeans(k).fit(nmf_matrix)
km_lsa = KMeans(k).fit(lsa_matrix)
km_autoenc = KMeans(k).fit(autoenc_results)

# evaluation table
evaluation = pd.DataFrame({'Model': ['km', 'km_nmf', 'km_lsa', 'km_autoenc']})
sc, wcss, chi = [], [], []

# calculate scores
for model in (km, km_nmf, km_lsa, km_autoenc):
    sc.append(silhouette_score(tfidf_matrix.toarray(), model.labels_))
    wcss.append(round(model.inertia_, 2))
    chi.append(round(calinski_harabaz_score(tfidf_matrix.toarray(), model.labels_), 2))

# fill in and display evaluation table
evaluation['Silhouette'] = sc
evaluation['WCSS'] = wcss
evaluation['Calinski-Harabasz'] = chi
evaluation.head()

|   | Model | Silhouette | WCSS | Calinski-Harabasz |
|---|---|---|---|---|
| 0 | km | 0.030429 | 262.18 | 7.69 |
| 1 | km_nmf | 0.032847 | 2.59 | 8.47 |
| 2 | km_lsa | 0.033674 | 10.73 | 8.48 |
| 3 | km_autoenc | 0.017647 | 3.13 | 6.07 |
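
When reading the WCSS column, keep in mind that `inertia_` is measured in the space each k-means model was fitted on (tf-idf, NMF, LSA, or autoencoder output), whereas Silhouette and Calinski-Harabasz are all computed on the tf-idf matrix. WCSS therefore depends on the scale and dimensionality of that space. The following NumPy sketch, with made-up points and a hypothetical helper `wcss`, shows the same cluster assignment yielding a very different WCSS after mere rescaling:

```python
import numpy as np

def wcss(points, labels):
    """Sum of squared distances from each point to its cluster's centroid."""
    total = 0.0
    for label in np.unique(labels):
        members = points[labels == label]
        centroid = members.mean(axis=0)
        total += ((members - centroid) ** 2).sum()
    return float(total)

# Made-up 2-D points forming two obvious clusters
points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
labels = np.array([0, 0, 1, 1])

wcss_original = wcss(points, labels)
wcss_rescaled = wcss(points * 0.1, labels)  # same clustering, smaller scale

print(wcss_original, wcss_rescaled)
```

This suggests comparing the models mainly on the Silhouette and Calinski-Harabasz columns, which share the same reference space.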


# 5. Conclusion

Unfortunately, the scores for the Autoencoder-KMeans combo aren't particularly good: it has the lowest Silhouette coefficient and the lowest Calinski-Harabasz index, even after I experimented with many different parameters for the neural network. In conclusion, it seems that sticking to NMF-based KMeans or even just NMF alone is the best choice for this data set.