# Replace OOV (Out-Of-Vocabulary) words with an LLM

Keywords that an LLM generates for documents may contain out-of-vocabulary (OOV) words, that is, words missing from the vocabulary of the embedding model. This example shows how to replace such words with in-vocabulary alternatives using an LLM.

In [1]:
from walm import replace_oov
import gensim.downloader as api
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

[nltk_data] Downloading package wordnet to /home/xiaohao/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [2]:
# Load the keyword file: one comma-separated keyword list per document
kw_file = 'step2_topic-aware_kws.txt'
with open(kw_file, 'r') as file:
    words_list = file.readlines()

print('Original keywords for documents:')
for i, line in enumerate(words_list):
    print('document %s: %s' % (i, line.strip()))

Original keywords for documents:
document 0: Technology, Development, Sale, Software, Economic, Graphics, Harvard, Employment, Electronics, Issues, Price, Windows, Harvard Graphics
document 1: Ethical, Justice, Constitution, Societal, Morality, Abortion, Bioethics, Moral, Rights, Ethics, Law
document 2: History, Strike, Sports, Baseball, Mound, Zone, Statistics
document 3: Role, Plus/minus, Players, Context, Statistics
document 4: Async Solutions, Graphics Accelerator, Efficiency, Internet Culture, X11 Clients, Communication, Software Development, X11, Clients
document 5: Technology, UART, Serial Communication, Software Development, Electronics, Hardware Interrupts, Interrupts, Hardware, Windows
document 6: Train, Station, City, Protection, Dispute, Airport, Cities, Infrastructure, Complaint, Consumer, Rights, Society, Transportation
document 7: PCTools, Data Error, Error, IDE drive, IDE, Drive, Low Level Format, Maintenance, Data error, Disk recovery, IDE Drive, Repair, Low level form

In [3]:
# Use the GloVe vocabulary as the reference vocabulary; load the GloVe model through Gensim first.
print('Loading glove model ...')
embedding_model = api.load("glove-wiki-gigaword-50")
print('Loading done!')

Loading glove model ...
Loading done!
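
To make the notion of OOV concrete, the sketch below flags keywords that are missing from a vocabulary. It assumes a keyword counts as OOV when its lowercased form is not a vocabulary key. The `find_oov` helper and the small `toy_vocab` set are hypothetical stand-ins for illustration; with the Gensim model loaded above, `embedding_model.key_to_index` could serve as the vocabulary instead.

```python
def find_oov(keyword_line, vocab):
    """Return the keywords in a comma-separated line that are not in vocab."""
    keywords = [kw.strip().lower() for kw in keyword_line.split(',') if kw.strip()]
    return [kw for kw in keywords if kw not in vocab]


# Hypothetical toy vocabulary; a real run could pass embedding_model.key_to_index.
toy_vocab = {'technology', 'uart', 'electronics', 'hardware', 'interrupts', 'windows'}
line = 'Technology, UART, Serial Communication, Electronics, Hardware Interrupts'
oov = find_oov(line, toy_vocab)
print(oov)
```

Multi-word keywords such as `serial communication` are treated as single entries here, so they are flagged as OOV even when each individual word is in the vocabulary.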


In [4]:
# Load the LLM used to replace OOV words
model_name = 'meta-llama/Meta-Llama-3-8B-Instruct'
llm = AutoModelForCausalLM.from_pretrained(model_name,
                                           trust_remote_code=True,
                                           torch_dtype=torch.float16
                                           ).cuda()
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Reuse the EOS token for padding
tokenizer.pad_token = tokenizer.eos_token
# Replace keywords that are missing from the GloVe vocabulary
filtered_words = replace_oov(words_list, llm, tokenizer, embedding_model)

Loading checkpoint shards: 100%|██████████| 4/4 [00:02<00:00,  1.81it/s]
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
From v4.47 onwards, when a model cache is to be returned, `generate` will return a `Cache` instance instead by default (as opposed to the legacy tuple of tuples format). If you want to keep returning the legacy format, please set `return_legacy_cache=True`.


Fail to find replacement for word: graphics accelerator
Fail to find replacement for word: x11 clients
Fail to find replacement for word: serial communication
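
The internals of `replace_oov` are not shown in this notebook. As a rough illustration of the general idea only, the sketch below asks a text generator for a substitute and accepts it only if it is in the vocabulary, retrying a few times before giving up. The names `replace_oov_sketch`, `suggest`, `stub_suggest`, `canned`, and `max_tries` are all hypothetical, and the stub with canned answers stands in for a real LLM call. In the filtered output, keywords with no replacement (for example `graphics accelerator`) do not appear, so the sketch drops them as well.

```python
def replace_oov_sketch(keywords, vocab, suggest, max_tries=3):
    """Replace OOV keywords with in-vocabulary suggestions; drop those that fail.

    `suggest(word, attempt)` is a hypothetical callable standing in for an LLM query.
    """
    result = []
    for word in keywords:
        word = word.strip().lower()
        if word in vocab:
            result.append(word)
            continue
        for attempt in range(max_tries):
            candidate = suggest(word, attempt).strip().lower()
            if candidate in vocab:
                result.append(candidate)
                break
        else:
            # No in-vocabulary candidate was produced; skip this keyword.
            print('No replacement found for:', word)
    return result


# Stub generator with canned answers, used here in place of a real LLM.
canned = {
    'software development': ['coding practice', 'programming'],
    'x11 clients': ['x-clients', 'xclients', 'x11-apps'],
}


def stub_suggest(word, attempt):
    options = canned.get(word, [])
    return options[attempt] if attempt < len(options) else ''


toy_vocab = {'efficiency', 'communication', 'programming'}
replaced = replace_oov_sketch(
    ['Efficiency', 'Software Development', 'X11 Clients'], toy_vocab, stub_suggest
)
print(replaced)
```

Checking each candidate against the vocabulary before accepting it is what guarantees that the final keyword list contains no OOV words, whatever the generator returns.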


In [5]:
print('Keywords for documents after filtering:')
for i, words in enumerate(filtered_words):
    print('document %s: %s' % (i, ', '.join(words)))

Keywords for documents after filtering:
document 0: technology, development, sale, software, economic, graphics, harvard, employment, electronics, issues, price, windows, art
document 1: ethical, justice, constitution, societal, morality, abortion, bioethics, moral, rights, ethics, law
document 2: history, strike, sports, baseball, mound, zone, statistics
document 3: role, plus/minus, players, context, statistics
document 4: instant, efficiency, cyberspace, communication, programming, x11, clients
document 5: technology, uart, programming, electronics, interrupts, interrupts, hardware, windows
document 6: train, station, city, protection, dispute, airport, cities, infrastructure, complaint, consumer, rights, society, transportation
document 7: tools, mistake, error, mind, ide, drive, base, maintenance, mistake, retrieval, mind, repair, base, mark, retrieval
document 8: technology, sound, sale, protection, purchase, electronics, small, projector, consumer, rights
document 9: life, moral
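
Replacement can map several keywords to the same word, so a document's list may contain repeats (document 7 above lists `mistake`, `mind`, `base`, and `retrieval` twice). If duplicates are unwanted downstream, an order-preserving de-duplication pass is one option. This is an optional post-processing sketch, not part of `replace_oov`; `dedupe_keywords` is a hypothetical helper.

```python
def dedupe_keywords(docs):
    """Remove repeated keywords in each document while keeping first-seen order."""
    # dict.fromkeys keeps insertion order (Python 3.7+), so the first occurrence wins.
    return [list(dict.fromkeys(words)) for words in docs]


sample = [['technology', 'uart', 'programming', 'electronics',
           'interrupts', 'interrupts', 'hardware', 'windows']]
deduped = dedupe_keywords(sample)
print(', '.join(deduped[0]))
```

Applied to the notebook's result, the call would be `dedupe_keywords(filtered_words)`.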