Improving factual knowledge in Language Models

This repository contains the pre-training code for the EACL 2023 paper "Unsupervised Improvement of Factual Knowledge in Language Models".

Installation

Clone the repository

git clone https://github.com/intuit/wMLM.git
cd wMLM

Install the required libraries using pip within your virtual environment

pip install numpy scipy scikit-learn tqdm datasets torch transformers tensorboard nltk matplotlib Cython

Build cython script

python setup_cython.py build_ext --inplace
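For reference, a minimal Cython build script of this kind usually looks like the sketch below; the extension file name wlm_ops.pyx is a placeholder and not necessarily the one used by this repository's setup_cython.py.

# Minimal Cython build sketch (the .pyx file name is illustrative only).
from setuptools import setup
from Cython.Build import cythonize
import numpy

setup(
    ext_modules=cythonize("wlm_ops.pyx"),
    include_dirs=[numpy.get_include()],  # needed if the extension operates on NumPy arrays
)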

Dataset Download

We perform pre-training using the Wikipedia dataset available via the datasets library. The download alone requires about 37 GB of storage, and the default download location may not have that much space, so we use a custom download location called 'dataset_cache' within the project root directory.

mkdir dataset_cache
python get_dataset.py
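Conceptually, the download step does something like the sketch below; the Wikipedia dump configuration name ("20220301.en") is an assumption and may differ from what get_dataset.py actually uses.

# Rough sketch of the download step; the dump configuration name is an assumption.
from datasets import load_dataset

wiki = load_dataset(
    "wikipedia",
    "20220301.en",
    cache_dir="dataset_cache",  # custom cache location inside the project root
)
print(wiki["train"])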

Preprocessing

We need to create all the resources required for our proposed pre-training approach. The steps include building a custom tokenizer, preparing vocabulary files for informative words, preparing the PMI matrix, and computing token-specific knowledge values.

python build_tokenizer.py
python prepare_vocab.py
python build_word_matrix.py
python build_pmi_matrix.py	
python knowledge_value_compute.py
python normalize_knowledge_value.py
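For intuition, the PMI between two words is the log ratio of their joint co-occurrence probability to the product of their marginal probabilities. The self-contained sketch below illustrates this over document-level co-occurrence counts; it is not the repository's actual implementation in build_pmi_matrix.py, which runs over the full Wikipedia corpus with Cython acceleration.

# Illustrative document-level PMI computation; not the repository's implementation.
import math
from collections import Counter
from itertools import combinations

docs = [["paris", "is", "the", "capital", "of", "france"],
        ["berlin", "is", "the", "capital", "of", "germany"]]

# Count how many documents each word, and each unordered word pair, appears in.
word_counts = Counter(w for doc in docs for w in set(doc))
pair_counts = Counter(frozenset(p) for doc in docs for p in combinations(set(doc), 2))
n_docs = len(docs)

def pmi(w1, w2):
    # PMI(w1, w2) = log[ P(w1, w2) / (P(w1) * P(w2)) ]
    p_xy = pair_counts[frozenset((w1, w2))] / n_docs
    p_x, p_y = word_counts[w1] / n_docs, word_counts[w2] / n_docs
    return math.log(p_xy / (p_x * p_y)) if p_xy > 0 else float("-inf")

print(pmi("capital", "france"))  # 0.0 in this toy corpus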

Training

We performed pre-training on an AWS p3.8xlarge instance with 4 NVIDIA V100 GPUs. Training takes around 2 hours 30 minutes per epoch. You can launch distributed training using the following command.

python -m torch.distributed.launch --nproc_per_node 4 train_lm.py
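On newer PyTorch releases (1.10+), torch.distributed.launch is deprecated in favor of torchrun; the equivalent invocation would be the command below, assuming train_lm.py reads the local rank from the LOCAL_RANK environment variable rather than a --local_rank argument.

torchrun --nproc_per_node 4 train_lm.py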

Unit test

python -m unittest wlm_test.py
