# Introduction

***Deep Semantic Text Hashing with Weak Supervision***

<https://dl.acm.org/doi/pdf/10.1145/3209978.3210090>

**Authors:** *Suthee Chaidaroon, Travis Ebesu and Yi Fang*

**Notebook Author(s):** *Erman Yafay*, *erman.yafay@ceng.metu.edu.tr*, *Middle East Technical University*


In this notebook we re-implement the NbrReg model proposed in "Deep Semantic Text Hashing with Weak Supervision" by Chaidaroon et al.

Semantic hashing aims to generate hash codes for text documents such that the codes capture the semantic meaning of the documents. Hash codes are much more compact than the actual documents: a code usually has a length between 8 and 256 bits, whereas a document can contain thousands of words. It is therefore much cheaper to measure the similarity between documents using their codes than to compare the raw documents.
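
To make the efficiency argument concrete, here is a minimal sketch of comparing documents through their codes. The three 8-bit codes are made up for illustration and are not produced by NbrReg; the point is only that the comparison reduces to an XOR followed by a bit count.

```python
def hamming_distance(code_a: int, code_b: int) -> int:
    """Number of bit positions in which two hash codes differ."""
    return bin(code_a ^ code_b).count("1")


# Hypothetical 8-bit codes, written as bit strings for readability
doc_a = int("10110010", 2)
doc_b = int("10010011", 2)
doc_c = int("01001101", 2)

print(hamming_distance(doc_a, doc_b))  # 2 -> codes are close
print(hamming_distance(doc_a, doc_c))  # 8 -> codes differ in every bit
```

A smaller distance is read as higher semantic similarity between the underlying documents.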

Briefly, NbrReg uses a VAE to encode a document into a hash code and then, during the decoding stage, reconstructs the document as well as its *k* nearest neighbours. This way, the codes generated by NbrReg capture both the content of the document and its neighbourhood in the similarity space.

In [1]:
import scipy.io
import nbrreg
import torch

# Containing document vectors and their categories
doc_data_path = "ng20_chd_emb.mat"
data = scipy.io.loadmat(doc_data_path)

# Training documents and their categories
train_docs = data["train"]
train_cats = data["gnd_train"]
print(f"Train data shape: {train_docs.shape}")

# Cross-validation documents and their categories
cv_docs = data["cv"]
cv_cats = data["gnd_cv"]
print(f"CV data shape: {cv_docs.shape}")

# Test documents and their categories
test_docs = data["test"]
test_cats = data["gnd_test"]
print(f"Test data shape: {test_docs.shape}")

# Containing nearest 100 neighbours of each document
# ranked with BM25.
train_knn = data["train_knn"]
print(f"Train knn shape: {train_knn.shape}")

Train data shape: (9551, 13300)
CV data shape: (6301, 13300)
Test data shape: (6301, 13300)
Train knn shape: (9551, 100)


# Training
The objective function to be maximized is the evidence lower bound (ELBO), given by the right-hand side of the following inequality:

$$\log \int_s P(d \mid s)P(NN(d) \mid s)P(s)ds \ge \mathbb{E}_{Q(s \mid \cdot)} \left[ \log P(d \mid s)\right] + \mathbb{E}_{Q(s \mid \cdot)} \left[ \log P(NN(d) \mid s)\right] - D_{KL}\left(Q(s \mid \cdot) \parallel P(s) \right)$$

where $\mathbb{E}_{Q(s \mid \cdot)} \left[ \log P(d \mid s)\right]$ is the reconstruction term for the encoded document and $\mathbb{E}_{Q(s \mid \cdot)} \left[ \log P(NN(d) \mid s)\right]$ is the reconstruction term for the neighbourhood space. A detailed explanation of the objective function and how to calculate it can be found in the paper.
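
For reference, the KL term has a closed form under the usual VAE assumption of a diagonal Gaussian posterior and a standard normal prior. That the `nbrreg` module makes exactly this choice is an assumption here, since its internals are not shown in this notebook; $\mu_i$, $\sigma_i$ and the latent dimension $m$ are symbols introduced only for this illustration. With $Q(s \mid \cdot) = \mathcal{N}(\mu, \operatorname{diag}(\sigma^2))$ and $P(s) = \mathcal{N}(0, I)$:

$$D_{KL}\left(Q(s \mid \cdot) \parallel P(s) \right) = \frac{1}{2}\sum_{i=1}^{m}\left(\mu_i^2 + \sigma_i^2 - 1 - \log \sigma_i^2\right)$$

The term is zero exactly when $\mu = 0$ and $\sigma = 1$, so it penalises posteriors that drift away from the prior.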

## Hyperparameters

The NbrReg model has only a single parameter to tune, which is the neighbourhood size, i.e. $k$ in KNN. We previously tuned the learning rate and the latent dimension size, so this time we only experiment with $k$.

The code below tunes $k$. We train each model for 5 epochs and measure the average precision on the cross-validation split at each epoch. We then pick the best setting and train a model with it for 15 epochs for 32-bit hash codes.

In [17]:
import itertools

# Hyperparameters
best_lr = 1e-3
best_ls = 1000
neighbourhood_sizes = [20, 50, 100]

# We train and cross-validate with 32 bits.
# Later we'll train for other bit lengths
# using the best hyperparameters
train_bit_size = 32

best_model_32 = None
best_prec = 0.0
best_k = None
for k in neighbourhood_sizes:
    print(f"Training with k: {k}")
    # Train for 5 epochs; at each epoch report avg training loss
    # and avg precision over cv dataset. Returns the trained model
    # and the best precision obtained over epochs
    model, prec = nbrreg.train(train_docs, train_cats, train_knn[:, :k],
                               cv_docs, cv_cats, bitsize=train_bit_size,
                               epoch=5, lr=best_lr, latent_size=best_ls)
    # Keep the model with the highest cv precision (ties go to the later k)
    if prec >= best_prec:
        best_prec = prec
        best_model_32 = model
        best_k = k

print(f"Best k: {best_k}")

Training with k: 20
Epoch 1: Avg Loss: 6373.7698094667, Avg Prec: 0.09354388192350223
Epoch 2: Avg Loss: 6149.075785716869, Avg Prec: 0.16508808125694135
Epoch 3: Avg Loss: 6020.863838292338, Avg Prec: 0.23210442786858948
Epoch 4: Avg Loss: 5909.838765934561, Avg Prec: 0.2899492144104076
Epoch 5: Avg Loss: 5885.693073642106, Avg Prec: 0.3404983335978401
Training with k: 50
Epoch 1: Avg Loss: 12932.309364712972, Avg Prec: 0.10373908903348492
Epoch 2: Avg Loss: 12618.673670825025, Avg Prec: 0.17452467862243962
Epoch 3: Avg Loss: 12418.24634776308, Avg Prec: 0.2503142358355796
Epoch 4: Avg Loss: 12383.057273951212, Avg Prec: 0.30464688144738605
Epoch 5: Avg Loss: 12264.154381814194, Avg Prec: 0.3414965878431978
Training with k: 100
Epoch 1: Avg Loss: 21734.78200283679, Avg Prec: 0.11080622123472274
Epoch 2: Avg Loss: 21194.732360928832, Avg Prec: 0.17242659895254653
Epoch 3: Avg Loss: 20899.392868598672, Avg Prec: 0.24020155530868004
Epoch 4: Avg Loss: 20849.890802391095, Avg Prec: 0.2852

In [18]:
# Since k=20 and k=50 perform almost the same, we choose k=20
# for its lower dimensionality
best_k = 20

In [20]:
# Train a model with the best k for 15 epochs
best_model_32, best_prec = nbrreg.train(train_docs, train_cats, train_knn[:,:best_k],
                                        cv_docs, cv_cats, train_bit_size,
                                        epoch=15, lr=best_lr, latent_size=best_ls)
print(f"Best precision on validation: {best_prec}")

Epoch 1: Avg Loss: 6385.680036246301, Avg Prec: 0.07285034121567763
Epoch 2: Avg Loss: 6090.851244187824, Avg Prec: 0.14355340422155152
Epoch 3: Avg Loss: 6049.398451415388, Avg Prec: 0.23030312648785892
Epoch 4: Avg Loss: 5958.065565215617, Avg Prec: 0.29880336454530926
Epoch 5: Avg Loss: 5922.056618546187, Avg Prec: 0.36592445643548444
Epoch 6: Avg Loss: 5900.417233497231, Avg Prec: 0.3981748928741457
Epoch 7: Avg Loss: 5888.142336434635, Avg Prec: 0.41889700047611295
Epoch 8: Avg Loss: 5835.804232069567, Avg Prec: 0.4418378035232494
Epoch 9: Avg Loss: 5845.992989935893, Avg Prec: 0.4499460403110609
Epoch 10: Avg Loss: 5830.98040495628, Avg Prec: 0.4605475321377544
Epoch 11: Avg Loss: 5772.00184009951, Avg Prec: 0.46511029995238945
Epoch 12: Avg Loss: 5778.710141824453, Avg Prec: 0.4640358673226459
Epoch 13: Avg Loss: 5713.2044223530975, Avg Prec: 0.4645675289636553
Epoch 14: Avg Loss: 5698.943521976277, Avg Prec: 0.46788922393270804
Epoch 15: Avg Loss: 5745.446166892099, Avg Prec: 0

In [21]:
# Saving the model
best_model_path = f"ng20_chd_{train_bit_size}.pt"
print(f"Saving best model to {best_model_path}")
torch.save(best_model_32.state_dict(), best_model_path)

Saving best model to ng20_chd_32.pt


In [22]:
# Loading the model (for 32-bit)
best_model_32 = nbrreg.NbrReg(train_docs.shape[1], bit_size=train_bit_size, h_size=best_ls)
best_model_32.double()
best_model_32.load_state_dict(torch.load(best_model_path))

<All keys matched successfully>

# Testing

The model is evaluated by measuring the average precision over the top 100 retrieved documents. For each test document, the 100 most similar documents are retrieved from the training set using the Hamming distance between hash codes, and the precision is measured, i.e. the ratio of relevant documents to retrieved documents. We report the average of this precision over all test documents.
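
The evaluation procedure can be sketched on toy data as follows. This is not the code of `nbrreg.test`, whose internals are not shown here; the sketch assumes that a retrieved document counts as relevant when it shares the query's category label, and all codes and labels below are made up for illustration.

```python
def hamming(a, b):
    """Number of positions at which two equal-length bit sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def precision_at_k(query_code, query_cat, codes, cats, k=100):
    """Fraction of the k nearest codes that share the query's category."""
    # sorted() is stable, so ties in distance keep their original index order
    ranked = sorted(range(len(codes)),
                    key=lambda i: hamming(query_code, codes[i]))
    top = ranked[:k]
    return sum(cats[i] == query_cat for i in top) / len(top)


# Toy "training set": hypothetical 4-bit codes and category labels
toy_codes = [(0, 0, 0, 1), (0, 0, 1, 1), (1, 1, 1, 0), (1, 1, 0, 0)]
toy_cats = [0, 0, 1, 1]
# Toy "test set": (code, category) pairs
toy_queries = [((0, 0, 0, 0), 0), ((1, 1, 1, 1), 1)]

scores = [precision_at_k(code, cat, toy_codes, toy_cats, k=2)
          for code, cat in toy_queries]
print(scores)                     # [1.0, 0.5]
print(sum(scores) / len(scores))  # 0.75
```

With real data, `k` would be 100 and the final average would run over all test documents.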

In [9]:
# Cross-validation and test datasets are the same for the dataset that the original paper used
# Therefore, we obtain the same precision on test set
import numpy as np
if np.sum(cv_docs != test_docs):
    print("Test and cv are different")
else:
    print("Test and cv are the same")

Test and cv are the same


In [23]:
test_avg_prec = nbrreg.test(train_docs, train_cats, test_docs, test_cats, best_model_32)
print(f"Test average precision for 32 bit: {test_avg_prec}")

Test average precision for 32 bit: 0.46760038089192124


In [11]:
# We load and present the results for the remaining bit sizes
rem_bit_sizes = [8, 16, 64, 128]
for rbs in rem_bit_sizes:
    m = nbrreg.NbrReg(train_docs.shape[1], bit_size=rbs, h_size=best_ls)
    m.double()
    m.load_state_dict(torch.load(f"ng20_chd_{rbs}.pt"))
    avg_prec = nbrreg.test(train_docs, train_cats, test_docs, test_cats, m)
    print(f"Test average precision for {rbs} bit: {avg_prec}")

Test average precision for 8 bit: 0.33383272496428945
Test average precision for 16 bit: 0.4148119346135517
Test average precision for 64 bit: 0.4938597048087598
Test average precision for 128 bit: 0.50856213299476


## Results and Conclusion
The following table presents the average precision at 100 of our NbrReg implementation (NbrReg-Self) and compares it with the results of the original paper (NbrReg-Orig) on the 20Newsgroups dataset. The original results can be found in Table 1 of the paper, under the 20Newsgroups dataset, in the row named "NbrReg". We reproduce the original results closely: for 16 and 32 bits the results are almost the same. Our 8-bit result is slightly lower than the original (the original is about 3.7% better), whereas we obtain higher precision at larger bit lengths, with relative improvements of about 3.6% and 3.9% for 64 and 128 bits respectively. Overall, our results differ from the original by at most $\approx 3.9\%$.

| Model | 8 bits | 16 bits | 32 bits | 64 bits | 128 bits |
| ----- | ------ | ------- | ------- | ------- | -------- |
| NbrReg-Orig | 0.3463 | 0.4120 | 0.4644 | 0.4768 | 0.4893 |
| NbrReg-Self | 0.3338 | 0.4148 | 0.4676 | 0.4939 | 0.5086 |
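
The relative differences between the two rows can be checked directly from the table values. The sketch below measures each gap relative to the smaller of the two numbers, which is one reasonable convention and an assumption on our part.

```python
bits = [8, 16, 32, 64, 128]
orig = [0.3463, 0.4120, 0.4644, 0.4768, 0.4893]  # NbrReg-Orig row
ours = [0.3338, 0.4148, 0.4676, 0.4939, 0.5086]  # NbrReg-Self row


def relative_gap(a, b):
    """Absolute difference relative to the smaller value, in percent."""
    return abs(a - b) / min(a, b) * 100


gaps = {n: round(relative_gap(o, s), 1) for n, o, s in zip(bits, orig, ours)}
print(gaps)  # {8: 3.7, 16: 0.7, 32: 0.7, 64: 3.6, 128: 3.9}
```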

## Challenges
Our first challenge was to recreate and preprocess the 20Newsgroups dataset, as the paper does not provide any information on how the data was obtained. We obtained the raw documents with their categories from the website <http://ana.cachopo.org/datasets-for-single-label-text-categorization>. We use the version where stop words are removed and the Porter stemmer is applied. Please see the Python program (prepare_data.py) shipped with this notebook, which preprocesses the data and creates BM25 (<https://en.wikipedia.org/wiki/Okapi_BM25>) weighted documents as well as their 100 nearest neighbours. Since the dataset we created and the paper dataset differ substantially, we decided to show the validity of the model on the paper dataset and then show results using our dataset.
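
As a rough illustration of what BM25 weighting does, the sketch below applies the standard Okapi BM25 term weight to a toy corpus. It is not the code of prepare_data.py, which is not reproduced here; the function name `bm25_weights`, the parameter values `k1=1.2` and `b=0.75` (conventional defaults), and the non-negative IDF variant are all assumptions made for this example.

```python
import math
from collections import Counter


def bm25_weights(docs, k1=1.2, b=0.75):
    """Return one {term: BM25 weight} dict per tokenised document."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term
    df = Counter(term for d in docs for term in set(d))
    weighted = []
    for d in docs:
        tf = Counter(d)
        # Length normalisation: longer-than-average documents are damped
        norm = k1 * (1 - b + b * len(d) / avg_len)
        weighted.append({
            t: math.log((n_docs - df[t] + 0.5) / (df[t] + 0.5) + 1)
               * f * (k1 + 1) / (f + norm)
            for t, f in tf.items()
        })
    return weighted


# Toy corpus of already tokenised (hypothetical) documents
toy_docs = [["hash", "code", "hash"], ["text", "code"], ["semantic", "text", "hash"]]
weights = bm25_weights(toy_docs)
print(weights[0])
```

Within a document, a term that is repeated more often or that occurs in fewer documents receives a larger weight; nearest neighbours can then be ranked on these weighted vectors.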

Secondly, we noticed some inconsistencies in the paper dataset. The authors state that 80% of the data is reserved for training, which should be around $15K$ documents, whereas the dataset has only $9.5K$ training samples. The document vectors we created have $\approx 130K$ dimensions without any word removal or stemming. Using stop word removal and the Porter stemmer we reduced this to $\approx 70K$, but that is still nowhere near the dimension of the document vectors supplied by the authors, which is only $13300$ (see the data loading cell at the top). Additionally, the cross-validation and test datasets are exactly the same. Our results may therefore diverge from those reported in the paper, since the document vectors in our dataset are roughly five times larger in dimension ($70216$ versus $13300$).

In [2]:
# In this cell we show the preprocessed data created by us
self_doc_data_path = "ng20_self.mat"
self_data = scipy.io.loadmat(self_doc_data_path)

# Training documents and their categories
self_train_docs = self_data["train"]
self_train_cats = self_data["gnd_train"]
print(f"Our training shape: {self_train_docs.shape}")

# Cross-validation documents and their categories
self_cv_docs = self_data["cv"]
self_cv_cats = self_data["gnd_cv"]
print(f"Our CV shape: {self_cv_docs.shape}")

# Test documents and their categories
self_test_docs = self_data["test"]
self_test_cats = self_data["gnd_test"]
print(f"Our test shape: {self_test_docs.shape}")

# Containing nearest 100 neighbours of each document
# ranked with BM25.
self_train_knn = self_data["train_knn"]
print(f"Our train knn shape: {self_train_knn.shape}")

Our training shape: (15056, 70216)
Our CV shape: (1882, 70216)
Our test shape: (1882, 70216)
Our train knn shape: (15056, 100)
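
The shapes printed by the two data loading cells make the split ratios easy to check. The snippet below hard-codes those document counts; the helper name `train_fraction` is ours, and counting the paper's identical cv and test splits only once is a choice made for this comparison.

```python
def train_fraction(n_train, *n_held_out):
    """Share of documents used for training, given held-out split sizes."""
    return n_train / (n_train + sum(n_held_out))


# Our dataset: train, cv and test are distinct splits
ours = train_fraction(15056, 1882, 1882)
# Paper dataset: cv and test are identical, so count them once
paper = train_fraction(9551, 6301)

print(f"Our train fraction:   {ours:.1%}")   # 80.0%
print(f"Paper train fraction: {paper:.1%}")  # 60.3%
```

Our split matches the 80% described in the paper, whereas the supplied dataset does not.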


In [4]:
# We show that the results are nowhere near the original values
bit_sizes = [8, 16, 32, 64, 128]
best_ls = 1000
for bs in bit_sizes:
    self_model = nbrreg.NbrReg(self_train_docs.shape[1], bit_size=bs, h_size=best_ls)
    self_model.double()
    self_model.load_state_dict(torch.load(f"ng20_self_{bs}.pt"))
    self_avg_prec = nbrreg.test(self_train_docs, self_train_cats, self_test_docs, self_test_cats, self_model)
    print(f"Test average precision for {bs} bit on our dataset: {self_avg_prec}")

Test average precision for 8 bit on our dataset: 0.10008501594048907
Test average precision for 16 bit on our dataset: 0.11287991498406004
Test average precision for 32 bit on our dataset: 0.15859723698193395
Test average precision for 64 bit on our dataset: 0.2075557917109456
Test average precision for 128 bit on our dataset: 0.24996811902231653


## Additional Results on Our Dataset

The table below summarizes the overall results. NbrReg-Self-Data denotes the results we obtained for models trained on the dataset that we created. The model performs poorly on our dataset: the average precision for every bit size is nowhere near the original results reported.

| Model | 8 bits | 16 bits | 32 bits | 64 bits | 128 bits |
| ----- | ------ | ------- | ------- | ------- | -------- |
| NbrReg-Orig | 0.3463 | 0.4120 | 0.4644 | 0.4768 | 0.4893 |
| NbrReg-Self | 0.3338 | 0.4148 | 0.4676 | 0.4939 | 0.5086 |
| NbrReg-Self-Data | 0.1000 | 0.1129 | 0.1586 | 0.2076 | 0.2500 |
