## Generate topic-aware keywords for test documents with an LLM

### Step 1: Generate global topics

To generate global topics for the document collection with an LLM, we follow the topic generation approach of [TopicGPT](https://github.com/chtmp223/topicGPT). The generated topics are saved under each dataset folder (e.g., `datasets/20News/topics.txt`).
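
As a rough illustration of how the saved topic file can be read, the sketch below pulls a topic name out of a line by taking the text between `[1]` and `(Count:`, the same two markers the topic selection cell passes to `extract_text_between_strings`. The helper `text_between` is a hypothetical stand-in, not the `walm` implementation, and the sample lines (including their counts and descriptions) are made up to match the assumed TopicGPT-style layout.

```python
def text_between(line, start, end):
    """Return the text between the first `start` and the next `end`, or None."""
    i = line.find(start)
    if i == -1:
        return None
    i += len(start)
    j = line.find(end, i)
    if j == -1:
        return None
    return line[i:j].strip()


# Hypothetical lines in the assumed layout: "[1] <name> (Count: <n>): <description>"
sample_lines = [
    "[1] Technology and Electronics (Count: 12): Mentions computers and devices.",
    "[1] Sports Statistics (Count: 7): Mentions scores and player records.",
]

topic_names = [text_between(line, "[1]", "(Count:") for line in sample_lines]
print(topic_names)
```

Returning `None` for a line without both markers makes malformed lines easy to filter out before the names are passed on.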

### Step 2: Topic selection

In [1]:
from walm import extract_text_between_strings, generate_topic_select, generate_topics_aware
from transformers import AutoModelForCausalLM, AutoTokenizer
import scipy.io as sio
import torch

# load documents
dataset = '20News'
data_dict = sio.loadmat('datasets/%s/data.mat' % dataset)
test_doc = data_dict['test_text'].tolist()
test_doc = [doc[0][0].strip() for doc in test_doc]

# take 10 documents as an example
test_doc = test_doc[0:10]

# load llm model
model_name = 'meta-llama/Meta-Llama-3-8B-Instruct'
# model_name = 'mistralai/Mistral-7B-Instruct-v0.3'
# model_name = 'microsoft/Phi-3-mini-128k-instruct'
# model_name = '01-ai/Yi-1.5-9B-Chat'
model = AutoModelForCausalLM.from_pretrained(model_name,
                                             trust_remote_code=True,
                                             torch_dtype=torch.float16
                                             ).cuda()
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

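# load global topics from this dataset
with open('datasets/%s/topics.txt' % dataset, 'r') as file:
    topics = file.readlines()
# drop the last line of the file, then keep only the topic name
# found between "[1]" and "(Count:" on each remaining line
topics = topics[0:-1]
topics = [extract_text_between_strings(item, "[1]", "(Count:")[0].strip() for item in topics]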

# run topic selection
save_path = 'step1_topic-select.txt'
doc_topics = generate_topic_select(model, tokenizer, topics, test_doc, save_path)

Loading checkpoint shards: 100%|██████████| 4/4 [00:02<00:00,  1.82it/s]


Running LLM Inference ...


  0%|          | 0/10 [00:00<?, ?it/s]The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
From v4.47 onwards, when a model cache is to be returned, `generate` will return a `Cache` instance instead by default (as opposed to the legacy tuple of tuples format). If you want to keep returning the legacy format, please set `return_legacy_cache=True`.
100%|██████████| 10/10 [00:09<00:00,  1.02it/s]


In [2]:
for item in doc_topics:
    print(item)

{'Topics': ['Technology and Electronics', 'Software Development', 'Economic and Employment Issues']}
{'Topics': ['Consumer Rights', 'Law and Justice', 'Ethics', 'Society']}
{'Topics': ['Sports Statistics', 'History']}
{'Topics': ['Sports Statistics']}
{'Topics': ['Technology and Electronics', 'Software Development', 'Internet Culture', 'Communication']}
{'Topics': ['Technology and Electronics', 'Software Development']}
{'Topics': ['Transportation', 'Consumer Rights', 'Society']}
{'Topics': ['Technology and Electronics', 'Repair and Maintenance', 'Software Development']}
{'Topics': ['Technology and Electronics', 'Consumer Rights']}
{'Topics': ['Society', 'Ethics', 'Religion']}
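
To get a quick overview of the selection result, the topics can be tallied across documents. The sketch below assumes each entry of `doc_topics` is a dict with a `'Topics'` list, as the printed output above suggests, and copies the ten entries from that output so it runs on its own.

```python
from collections import Counter

# The ten selections printed above, copied verbatim
doc_topics = [
    {'Topics': ['Technology and Electronics', 'Software Development', 'Economic and Employment Issues']},
    {'Topics': ['Consumer Rights', 'Law and Justice', 'Ethics', 'Society']},
    {'Topics': ['Sports Statistics', 'History']},
    {'Topics': ['Sports Statistics']},
    {'Topics': ['Technology and Electronics', 'Software Development', 'Internet Culture', 'Communication']},
    {'Topics': ['Technology and Electronics', 'Software Development']},
    {'Topics': ['Transportation', 'Consumer Rights', 'Society']},
    {'Topics': ['Technology and Electronics', 'Repair and Maintenance', 'Software Development']},
    {'Topics': ['Technology and Electronics', 'Consumer Rights']},
    {'Topics': ['Society', 'Ethics', 'Religion']},
]

# Count in how many documents each topic was selected
topic_counts = Counter(topic for entry in doc_topics for topic in entry['Topics'])

for topic, count in topic_counts.most_common():
    print(f"{count:2d}  {topic}")
```

A tally like this makes it easy to spot topics that dominate the sample or that are never selected.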


### Step 3: Topic-aware keyword generation

In [3]:
# load the topics selected for the test documents
with open('step1_topic-select.txt', 'r') as file:
    doc_topics = file.readlines()

# define save path and run generation
save_path = 'step2_topic-aware_kws.txt'
outputs = generate_topics_aware(model, tokenizer, doc_topics, test_doc, save_path)

 10%|█         | 1/10 [00:04<00:38,  4.24s/it]

--------
Error when parsing answer:
I apologize, but it seems like there's been a mistake. You didn't provide a document about 'Consumer Rights'. Instead, you mentioned a case about abortion rights.

If you'd like to provide the document about 'Consumer Rights', I'd be happy to help you with indexing words.
--------


100%|██████████| 10/10 [00:37<00:00,  3.75s/it]
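
The parsing error logged above shows that the model can decline to answer for an individual topic. One way to keep such a reply from affecting the rest of a document's keywords is to parse each answer separately, skip the ones that fail, and merge the remainder. The sketch below is only an illustration of that idea, not the logic inside `generate_topics_aware`: the `{'Keywords': [...]}` answer format, the helper names `parse_keywords` and `collect_keywords`, and the sample answers are all assumptions made for this example.

```python
import ast


def parse_keywords(answer):
    """Parse an answer assumed to contain a dict like {'Keywords': ['a', 'b']}.

    Raises ValueError when no such dict can be recovered.
    """
    start, end = answer.find("{"), answer.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no dict found in answer")
    try:
        parsed = ast.literal_eval(answer[start:end + 1])
    except (SyntaxError, ValueError) as exc:
        raise ValueError("answer is not a valid literal") from exc
    if not isinstance(parsed, dict) or "Keywords" not in parsed:
        raise ValueError("missing 'Keywords' key")
    return [str(keyword).strip() for keyword in parsed["Keywords"]]


def collect_keywords(answers_by_topic):
    """Merge keywords over topics, skipping answers that fail to parse."""
    merged, failed, seen = [], [], set()
    for topic, answer in answers_by_topic.items():
        try:
            keywords = parse_keywords(answer)
        except ValueError:
            failed.append(topic)  # remember the topic, keep going
            continue
        for keyword in keywords:
            if keyword.lower() not in seen:  # case-insensitive de-duplication
                seen.add(keyword.lower())
                merged.append(keyword)
    return merged, failed


# Hypothetical per-topic answers for one document; the refusal mirrors the log above
answers = {
    "Law and Justice": "{'Keywords': ['Law', 'Justice', 'Constitution']}",
    "Consumer Rights": "I apologize, but it seems like there's been a mistake.",
    "Ethics": "{'Keywords': ['Ethics', 'Morality', 'law']}",
}

merged, failed = collect_keywords(answers)
print(merged)
print(failed)
```

Recording the failed topics separately leaves room to retry them later or to report them, instead of silently losing them.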


In [4]:
for item in outputs:
    print(item)

Technology, Development, Sale, Software, Economic, Graphics, Harvard, Employment, Electronics, Issues, Price, Windows, Harvard Graphics
Ethical, Justice, Constitution, Societal, Morality, Abortion, Bioethics, Moral, Rights, Ethics, Law
History, Strike, Sports, Baseball, Mound, Zone, Statistics
Role, Plus/minus, Players, Context, Statistics
Async Solutions, Graphics Accelerator, Efficiency, Internet Culture, X11 Clients, Communication, Software Development, X11, Clients
Technology, UART, Serial Communication, Software Development, Electronics, Hardware Interrupts, Interrupts, Hardware, Windows
Train, Station, City, Protection, Dispute, Airport, Cities, Infrastructure, Complaint, Consumer, Rights, Society, Transportation
PCTools, Data Error, Error, IDE drive, IDE, Drive, Low Level Format, Maintenance, Data error, Disk recovery, IDE Drive, Repair, Low level format, Sector Marking, Disk Recovery
Technology, Sound, Sale, Protection, Purchase, Electronics, Super 8mm, Projector, Consumer, Rig