# Clustering the whole dataset

In this step, load the saved autoencoder and the embeddings, reduce the dimensionality of the embeddings, and apply a clustering algorithm (e.g. k-means or HDBSCAN) to identify clusters.
Then save the cluster information (the indices of the points in each cluster) so it can be used for later analysis.

The goal here is to identify the broad topics that exist in our dataset.

In [1]:
import torch
from utils.autoencoders import LinearAutoEncoder

# load the SBERT embeddings, same as before
from utils.embeddings import get_sbert_embeddings
from pathlib import Path
DATA_DIR = Path('/data/blockchain-interoperability/blockchain-social-media/twitter-data')

embeddings = get_sbert_embeddings(
    snapshot_path = DATA_DIR/'snapshots',
    embeddings_path = DATA_DIR/'embeddings',
)



# and the model
autoenc = LinearAutoEncoder()
autoenc.load_state_dict(torch.load(DATA_DIR/'autoenc_10_epoch.pkl'))
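# freeze the weights: we only run inference here, so skipping gradient tracking saves memory and compute
# (setting `autoenc.requires_grad = False` only adds an attribute; `requires_grad_` updates every parameter)
autoenc.requires_grad_(False)
autoenc.eval()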
# autoenc = autoenc.cuda()

loading from cache


In [2]:
# example of using the encoder
# encoding everything in one call may run out of memory; if it does, encode in smaller batches and concatenate the results

reduced_embs = autoenc.encoder(embeddings)
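
If the single call above runs out of memory, the batching idea mentioned in the comment could look roughly like the sketch below. The helper name `encode_in_batches`, the batch size, and the toy encoder (384-dimensional input, 10-dimensional output) are assumptions made for illustration; they are not part of `utils.autoencoders`.

```python
import torch
from torch import nn


def encode_in_batches(encoder: nn.Module, data: torch.Tensor, batch_size: int = 100_000) -> torch.Tensor:
    """Run `encoder` over `data` in chunks and concatenate the outputs on the CPU."""
    encoder.eval()
    chunks = []
    with torch.no_grad():  # no autograd graph is built, so memory use stays bounded
        for batch in torch.split(data, batch_size):
            chunks.append(encoder(batch).cpu())
    return torch.cat(chunks)


# Toy stand-in for autoenc.encoder (assumed shapes: 384-dim input, 10-dim code)
toy_encoder = nn.Linear(384, 10)
toy_embeddings = torch.randn(1_000, 384)
toy_reduced = encode_in_batches(toy_encoder, toy_embeddings, batch_size=256)
```

With the real objects, the call would presumably be `encode_in_batches(autoenc.encoder, embeddings)`, with the batch size tuned to the available memory.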


In [3]:
reduced_embs.shape

torch.Size([14973497, 10])

And here the clustering starts...
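
As one possible starting point, here is a minimal sketch that uses scikit-learn's `MiniBatchKMeans`, which tends to scale better than full k-means to millions of points. The choice of algorithm, the number of clusters, and the helper name `cluster_indices` are all assumptions; HDBSCAN or another method may suit the data better. The toy blobs stand in for `reduced_embs.numpy()`.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans


def cluster_indices(points: np.ndarray, n_clusters: int, seed: int = 0) -> dict:
    """Fit mini-batch k-means and return {cluster label: array of row indices}."""
    km = MiniBatchKMeans(n_clusters=n_clusters, batch_size=4096, n_init=3, random_state=seed)
    labels = km.fit_predict(points)
    return {int(c): np.flatnonzero(labels == c) for c in np.unique(labels)}


# Toy stand-in for the reduced embeddings: three well-separated blobs in 10 dimensions
rng = np.random.default_rng(0)
centers = np.array([[-10.0] * 10, [0.0] * 10, [10.0] * 10])
toy_points = np.concatenate([c + rng.normal(scale=0.5, size=(200, 10)) for c in centers])

toy_clusters = cluster_indices(toy_points, n_clusters=3)
for label, idx in toy_clusters.items():
    print(f"cluster {label}: {len(idx)} points")
```

The resulting dictionary maps each cluster label to the row indices of its members, which is the form of cluster information described at the top of this notebook; it can be pickled for the analysis step.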

Once the clusters are chosen, inspect the contents of each cluster.

In [None]:
import pandas as pd

# load a pd.Series of texts that can be indexed with the indices stored for each cluster
whole_text = pd.read_pickle(DATA_DIR/'snapshots/whole_text.pkl')
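
To look inside a cluster, a small helper along these lines could print a random sample of its texts. This is a sketch under assumptions: `sample_cluster_texts` is a hypothetical helper, the toy series and cluster dictionary stand in for `whole_text` and the saved cluster indices, and the indices are assumed to be row positions (hence `.iloc`; use `.loc` if they are index labels instead).

```python
import numpy as np
import pandas as pd


def sample_cluster_texts(texts: pd.Series, indices: np.ndarray, n: int = 5, seed: int = 0) -> pd.Series:
    """Return up to `n` randomly chosen texts from the rows listed in `indices`."""
    members = texts.iloc[indices]  # positional lookup; assumes indices are row positions
    return members.sample(n=min(n, len(members)), random_state=seed)


# Toy stand-ins for whole_text and the saved cluster dictionary (both hypothetical)
toy_text = pd.Series(["btc to the moon", "eth gas fees", "new nft drop", "bridge hack", "defi yields"])
toy_clusters = {0: np.array([0, 1, 4]), 1: np.array([2, 3])}

for label, idx in toy_clusters.items():
    print(f"cluster {label}: {len(idx)} texts")
    print(sample_cluster_texts(toy_text, idx, n=2).to_string())
```

Reading a handful of random members per cluster is usually enough to judge whether a cluster corresponds to a recognisable topic.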