# Topic Modeling with KeyBERT and Llama 2

## Install dependencies

In [2]:
!pip install xformers datasets umap-learn hdbscan keybert accelerate bitsandbytes

Collecting xformers
  Downloading xformers-0.0.26.post1-cp310-cp310-manylinux2014_x86_64.whl (222.7 MB)
Collecting datasets
  Downloading datasets-2.19.1-py3-none-any.whl (542 kB)
Collecting umap-learn
  Downloading umap_learn-0.5.6-py3-none-any.whl (85 kB)
Collecting hdbscan
  Downloading hdbscan-0.8.33.tar.gz (5.2 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done

In [3]:
!pip list

Package                          Version
-------------------------------- ---------------------
absl-py                          1.4.0
accelerate                       0.30.1
aiohttp                          3.9.5
aiosignal                        1.3.1
alabaster                        0.7.16
albumentations                   1.3.1
altair                           4.2.2
annotated-types                  0.6.0
anyio                            3.7.1
appdirs                          1.4.4
argon2-cffi                      23.1.0
argon2-cffi-bindings             21.2.0
array_record                     0.5.1
arviz                            0.15.1
astropy                          5.3.4
astunparse                       1.6.3
async-timeout                    4.0.3
atpublic                         4.1.0
attrs                            23.2.0
audioread                        3.0.1
autograd                         1.6.2
Babel                            2.15.0
backcall                         0.2.0


## Importing Dependencies

In [4]:
# Basic Imports
import sys
import os
import numpy as np
import pandas as pd

# Connecting to GPU
import torch

# Loading dataset
from datasets import load_dataset

# Embeddings
from sentence_transformers import SentenceTransformer

# Dimensionality Reduction
from umap import UMAP

# Clustering
from hdbscan import HDBSCAN

# KeyBert
from keybert import KeyBERT

# Llama-2
from huggingface_hub import notebook_login
import transformers

# Quantization and Optimization
from torch import bfloat16

In [5]:
# Mount drive to access TopicModel (skip, if running locally)
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [6]:
# Add path to TopicModel for importing it
sys.path.append('/content/drive/MyDrive/path_to_folder_containing_TopicModel.py/')

from TopicModel import TopicModel

## Connecting to the GPU

In [7]:
# Check GPU availability and use it if found.
if torch.cuda.is_available():
    device = torch.device("cuda")
    print('Connecting to GPU - ', torch.cuda.get_device_name(0))
else:
    print('No GPU found. Using the CPU instead.')
    device = torch.device("cpu")

No GPU found. Using the CPU instead.


## Load Dataset

This example uses the train split of the `CShorten/ML-ArXiv-Papers` dataset (117,592 papers). Because of resource constraints, only a random sample of 5,000 abstracts is used.

In [8]:
print('Loading dataset...')

dataset = load_dataset("CShorten/ML-ArXiv-Papers")["train"]

# Features and size of dataset
print(dataset)

Loading dataset...


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Downloading readme:   0%|          | 0.00/986 [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/147M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/117592 [00:00<?, ? examples/s]

Dataset({
    features: ['Unnamed: 0.1', 'Unnamed: 0', 'title', 'abstract'],
    num_rows: 117592
})


In [9]:
# Keep 5k articles only (due to resource limitations)
dataset_5K = dataset.train_test_split(train_size=5000)['train']
print(dataset_5K)

Dataset({
    features: ['Unnamed: 0.1', 'Unnamed: 0', 'title', 'abstract'],
    num_rows: 5000
})


In [10]:
# Get abstracts and titles for modelling and visualization
abstracts = dataset_5K['abstract']
titles = dataset_5K['title']

In [11]:
print(abstracts[0])

  Graph Neural Networks (GNNs) are a class of deep learning-based methods for
processing graph domain information. GNNs have recently become a widely used
graph analysis method due to their superior ability to learn representations
for complex graph data. However, due to privacy concerns and regulation
restrictions, centralized GNNs can be difficult to apply to data-sensitive
scenarios. Federated learning (FL) is an emerging technology developed for
privacy-preserving settings when several parties need to train a shared global
model collaboratively. Although several research works have applied FL to train
GNNs (Federated GNNs), there is no research on their robustness to backdoor
attacks.
  This paper bridges this gap by conducting two types of backdoor attacks in
Federated GNNs: centralized backdoor attacks (CBA) and distributed backdoor
attacks (DBA). CBA is conducted by embedding the same global trigger during
training for every malicious party, while DBA is conducted by decomposing

## Create Sub Models

### Embedding Model

In [12]:
senttrans = SentenceTransformer("BAAI/bge-small-en")

### Dimensionality Reduction Model

In [13]:
umap = UMAP(n_neighbors=15, n_components=2, min_dist=0.02, metric='cosine')

### Clustering Model

In [14]:
hdbscan = HDBSCAN(min_cluster_size=10, metric='euclidean', cluster_selection_method='eom', prediction_data=True)

## Using KeyBERT

### KeyBERT model

In [15]:
kbert_model = {'KeyBert': KeyBERT()}

### Create TopicModel

In [16]:
keybert_topic = TopicModel(embedding_model=senttrans,
                           dimreduce_model=umap,
                           cluster_model=hdbscan,
                           keyword_model=kbert_model,
                           n_labels = 10)

### Train model on abstracts

In [17]:
keybert_topic.fit_transform(abstracts)

TopicModel - Creating embeddings...


Batches:   0%|          | 0/157 [00:00<?, ?it/s]

TopicModel - Embeddings created
TopicModel - Reducing embeddings...
TopicModel - Embeddings reduced...
TopicModel - Clustering reduced embeddings...
TopicModel - Clusters created
TopicModel - Generating labels...
TopicModel - Keywords generated
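
The `0/157` in the progress bar above is the number of embedding batches. A quick check, assuming `TopicModel` calls `SentenceTransformer.encode` with its default `batch_size` of 32 (the class internals are not shown here, so this is an assumption):

```python
import math

n_docs = 5000    # abstracts kept after subsampling
batch_size = 32  # default of SentenceTransformer.encode; assumed unchanged by TopicModel

# The last batch is only partially filled, hence the ceiling.
n_batches = math.ceil(n_docs / batch_size)
print(n_batches)
```

A larger batch size would reduce the number of batches at the cost of more memory per step.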


### Get cluster-wise topic info

In [18]:
keybert_topic.get_cluster_topic_info()

|  | Topic | Size | KeyBert | Representative Doc |
|---|---|---|---|---|
| 0 | -1 | 1125 | [embeddings, regularization, embedding, superv... | ( Recent self-supervised methods for image re... |
| 1 | 0 | 32 | [electroencephalogram, electroencephalograms, ... | ( In current clinical practice, electroenceph... |
| 2 | 1 | 22 | [tutors, learns, tuition, supervised, learning... | ( Learning through experience is time-consumi... |
| 3 | 2 | 46 | [activity, features, sensors, sensor, attentio... | ( Deep neural networks, including recurrent n... |
| 4 | 3 | 19 | [warehouse, warehouses, knapsack, inventory, r... | ( In this paper, we describe a novel solution... |
| ... | ... | ... | ... | ... |
| 82 | 81 | 35 | [classifiers, binary, classifier, quantization... | ( After being trained, classifiers must often... |
| 83 | 82 | 11 | [imagenet, adversarially, networks, neural, ar... | ( Deep learning has proven to be a highly eff... |
| 84 | 83 | 21 | [pruning, imagenet, optimize, neural, optimize... | ( Architecture optimization, which is a techn... |
| 85 | 84 | 21 | [spiking, neuron, neural, neurons, rnns, hawke... | ( The present paper provides exact mathematic... |


### Visualize embeddings and topics

In [19]:
# Scatter Plot
keybert_topic.visualize(short_text = titles, embeddings = keybert_topic.get_reduced_embeddings())

## Using Llama 2

### Set up Llama 2

#### Login to HuggingFace Hub

In [20]:
notebook_login()

#### Model name

In [19]:
model_name = 'meta-llama/Llama-2-7b-chat-hf'

#### Model Optimization

In [20]:
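# 4-bit NF4 quantization config (bitsandbytes); this is the quantization
# scheme used by QLoRA, but no LoRA adapters are added here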
lora_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)

#### Model Pipeline

In [21]:
# Tokenizer
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)

# Model
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code = True,
    quantization_config = lora_config,
    device_map = 'auto'
)

# Set model in evaluation mode
model.eval()

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 4096)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaSdpaAttention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=11008, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=11008, bias=False)
          (down_proj): Linear4bit(in_features=11008, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm()
        (post_attention_layernorm): LlamaRMSNorm()
      )
    )
    (norm): Lla
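
To see why 4-bit loading matters on a small GPU, the storage needed for the `Linear4bit` layers can be estimated from the shapes printed above. This is only back-of-the-envelope arithmetic: it ignores the embeddings, the norms, the output head, and the extra quantization constants that bitsandbytes stores, so real memory use will be somewhat higher.

```python
# Shapes copied from the printed model summary above.
HIDDEN = 4096
INTERMEDIATE = 11008
N_LAYERS = 32

attn_params = 4 * HIDDEN * HIDDEN       # q_proj, k_proj, v_proj, o_proj
mlp_params = 3 * HIDDEN * INTERMEDIATE  # gate_proj, up_proj, down_proj
linear4bit_params = N_LAYERS * (attn_params + mlp_params)


def gigabytes(n_params: int, bits_per_param: float) -> float:
    """Storage in GB (10^9 bytes) for n_params weights at the given precision."""
    return n_params * bits_per_param / 8 / 1e9


print(f"Linear4bit parameters: {linear4bit_params:,}")
print(f"16-bit storage: {gigabytes(linear4bit_params, 16):.2f} GB")
print(f" 4-bit storage: {gigabytes(linear4bit_params, 4):.2f} GB")
```

The 4x reduction for these layers is what lets the 7B chat model fit on a single consumer or free-tier GPU.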

In [22]:
# Create text generation pipeline
generator = transformers.pipeline(
    model = model,
    tokenizer = tokenizer,
    task='text-generation',
    temperature = 0.1,
    max_new_tokens = 500,
    repetition_penalty = 1.1
)

#### Set up the prompt

In [23]:
# System Prompt
sys_prompt = """
<s>[INST] <<SYS>>
You are a helpful and honest assistant for labeling topics.
<</SYS>>
"""

In [24]:
# Example Prompt
exp_prompt = """
I have a topic that contains the following text data:
- Traditional diets in most cultures were primarily plant-based with a little meat on top, but with the rise of industrial style meat production and factory farming, meat has become a staple food.
- Meat, but especially beef, is the worst food in terms of emissions.
- Eating meat doesn't make you a bad person, not eating meat doesn't make you a good one.

Based on the text data provided above, please create a short label of at most 10 words for this topic. Make sure to return only the label and nothing more.
[/INST] Environmental impacts of eating meat
"""

In [25]:
# Main Prompt
main_prompt = """
[INST]
I have a topic that contains the following text data:
[DOCUMENTS]

Based on the text data provided above, please create a short label of at most 10 words for this topic. Make sure to return only the label and nothing more.
[/INST]
"""

In [26]:
# Complete prompt
prompt = sys_prompt + exp_prompt + main_prompt
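
The `[DOCUMENTS]` placeholder in the main prompt has to be replaced with a cluster's representative texts before generation. How `TopicModel` does this is not shown in this notebook, so the sketch below is only one plausible approach; `fill_prompt`, its `max_chars` parameter, and the shortened template are hypothetical.

```python
# Hypothetical helper: TopicModel's real substitution logic is not shown here.
def fill_prompt(template: str, documents: list[str], max_chars: int = 300) -> str:
    """Replace the [DOCUMENTS] placeholder with a bulleted list of documents."""
    bullets = "\n".join(
        # Collapse internal whitespace and truncate long documents.
        f"- {' '.join(doc.split())[:max_chars]}"
        for doc in documents
    )
    return template.replace("[DOCUMENTS]", bullets)


# Shortened stand-in for the main prompt defined above.
template = """
[INST]
I have a topic that contains the following text data:
[DOCUMENTS]

Based on the text data provided above, please create a short label for this topic.
[/INST]
"""

docs = [
    "Graph Neural Networks (GNNs) are a class of deep learning-based methods.",
    "Federated learning is an emerging technology for privacy-preserving settings.",
]

filled = fill_prompt(template, docs)
print(filled)
```

Truncating each document keeps the filled prompt well inside the model's context window even when a cluster's representative abstracts are long.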

### Llama 2 Model

In [27]:
llama_model = {'Llama': [generator, prompt]}

### Create TopicModel

In [28]:
llama_topic = TopicModel(embedding_model=senttrans,
                         dimreduce_model=umap,
                         cluster_model=hdbscan,
                         keyword_model=llama_model)

### Train model on abstracts

In [29]:
llama_topic.fit_transform(abstracts)

TopicModel - Creating embeddings...


Batches:   0%|          | 0/157 [00:00<?, ?it/s]

TopicModel - Embeddings created
TopicModel - Reducing embeddings...
TopicModel - Embeddings reduced...
TopicModel - Clustering reduced embeddings...
TopicModel - Clusters created
TopicModel - Generating labels...


You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset


TopicModel - Keywords generated


### Get cluster-wise topic info

In [30]:
llama_topic.get_cluster_topic_info()

|  | Topic | Size | Llama | Representative Doc |
|---|---|---|---|---|
| 0 | -1 | 1361 | [Machine Learning] | ( We present a method for the classification ... |
| 1 | 0 | 35 | [Quantum Computing] | (Predicting the output of quantum circuits is ... |
| 2 | 1 | 27 | [Tomography Reconstruction] | ( Existing deep-learning based tomographic im... |
| 3 | 2 | 21 | [Causal Inference] | ( In this paper, we discuss structure learnin... |
| 4 | 3 | 23 | [Cancer progression analysis] | ( Recently, there has been a resurgence of in... |
| ... | ... | ... | ... | ... |
| 96 | 95 | 68 | [Dimensionality reduction] | ( Dimensionality reduction is a crucial step ... |
| 97 | 96 | 22 | [Uncertainty Quantification in ML] | ( Uncertainty quantification of machine learn... |
| 98 | 97 | 37 | [Machine Learning] | ( We consider multi-label prediction problems... |
| 99 | 98 | 23 | [Machine Learning] | ( Leveraging the wealth of unlabeled data pro... |
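
In both tables, topic `-1` is the label HDBSCAN gives to points it treats as noise rather than assigning them to a cluster. The share of unclustered abstracts in each run follows directly from the reported sizes. The two runs differ (85 versus 99 clusters) even though they share the same sub-models; one plausible reason is that UMAP is stochastic unless `random_state` is set, and each `fit_transform` call refits it.

```python
N_DOCS = 5000  # size of the subsample used above

# Size of topic -1 as reported by get_cluster_topic_info() in each run.
noise_sizes = {"KeyBert": 1125, "Llama": 1361}

noise_share = {run: size / N_DOCS for run, size in noise_sizes.items()}

for run, share in noise_share.items():
    print(f"{run}: {share:.1%} of abstracts left unclustered")
```

If the noise share is too high for a given use case, lowering `min_cluster_size` in HDBSCAN is one knob to try, at the cost of more and smaller topics.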


### Visualize embeddings and topics

In [31]:
# Scatter Plot
llama_topic.visualize(short_text = titles, embeddings = llama_topic.get_reduced_embeddings())