
Knowledge_Card

Repository for Knowledge Card: Filling LLMs' Knowledge Gaps with Plug-in Specialized Language Models @ ICLR 2024, Oral.

Configuration

config.py specifies the configuration/hyperparameters for running Knowledge Card in its three modes. We provide four default settings:

  • ChatGPT, slightly slower: We employ ChatGPT (gpt-3.5-turbo) as the base LLM, use GPU 0 for both the relevance and pruning selectors, GPU 1 for the two models in the factuality selector, and GPU 2 for hosting the modular knowledge cards. Note that sharing GPUs 0 and 1 across models makes things a bit slower. 3 GPUs are required in total. Please fill in your OpenAI API key in line 44 of lm_utils.py.
  • ChatGPT, slightly faster: We employ ChatGPT (gpt-3.5-turbo) as the base LLM, use GPU 0 for the relevance selector, GPU 1 for the pruning selector, GPUs 2 and 3 for the two models in the factuality selector, and GPU 4 for hosting the modular knowledge cards. 5 GPUs are required in total. Please fill in your OpenAI API key in line 44 of lm_utils.py.
  • Open-source LLM, slightly slower: We employ an open-source LLM (default: Mistral-7B or LLaMA2-7B) as the base LLM on GPU 0, use GPU 1 for both the relevance and pruning selectors, GPU 2 for the two models in the factuality selector, and GPU 3 for hosting the modular knowledge cards. Note that sharing GPUs 1 and 2 across models makes things a bit slower. 4 GPUs are required in total.
  • Open-source LLM, slightly faster: We employ an open-source LLM (default: Mistral-7B or LLaMA2-7B) as the base LLM on GPU 0, use GPUs 1-4 for the three selectors (the factuality selector uses two models), and GPU 5 for the modular knowledge cards. 6 GPUs are required in total.

Other specifications/hyperparameters in config.py should be self-explanatory or come with comments.
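
For concreteness, here is a hypothetical sketch of what the GPU-related settings for the first preset could look like. The field names below are illustrative, not necessarily the exact names used in config.py:

```python
# Hypothetical sketch of the "ChatGPT, slightly slower" preset.
# Field names are illustrative; consult config.py for the real ones.
base_llm = "gpt-3.5-turbo"              # the base LLM
relevance_selector_device = "cuda:0"    # shared with the pruning selector
pruning_selector_device = "cuda:0"
factuality_selector_devices = ["cuda:1", "cuda:1"]  # two models, one GPU
knowledge_card_device = "cuda:2"        # hosts the modular knowledge cards
```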

Basic Usage

Any environment with a reasonable Hugging Face Transformers installation should be fine. If you really want to replicate the messy environment I used, run conda env create -f environment.yml.

data/sample.jsonl provides an example of the input/output format. Just organize your prompts in a JSONL file, one dict per line, with two fields, prompt and output, in each dict.
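
As a minimal sketch of preparing such a file (the prompts are made up, and I assume output can be left empty before generation; see data/sample.jsonl for the authoritative format):

```python
import json

# Hypothetical prompts; field semantics follow data/sample.jsonl.
rows = [
    {"prompt": "What is the capital of Australia?", "output": ""},
    {"prompt": "Which knowledge graph does ConceptNet encode?", "output": ""},
]
with open("data/my_input.jsonl", "w", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")
```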

bottom_up.py, top_down_auto.py, and top_down_explicit.py are the three modes of Knowledge Card. You can run them with:

python <mode>.py -i <path_to_input_file> -o <path_to_output_file>
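
For example, to run the bottom-up mode on the provided sample data (the output path is arbitrary):

python bottom_up.py -i data/sample.jsonl -o data/sample_out.jsonl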

Please note that it might be slow (downloading all knowledge card checkpoints, running multiple LMs on multiple GPUs, etc.), so you may want to run it on a cluster. There are some potential improvements for better parallelism and efficiency that I may or may not add in the future.

Modular Knowledge Cards

The pool of knowledge cards to leverage is specified in config.py: knowledge_card_paths is a list of strings, each a model checkpoint on HuggingFace (or a local path); knowledge_card_names is a list of strings, each naming a knowledge card. Any string describing the domain/information source/knowledge type should work: commonsense knowledge, Wikipedia, news articles, social media, etc.
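
For example (the values below are illustrative; the two field names are the ones documented above):

```python
# Illustrative values for the documented config.py fields.
knowledge_card_paths = [
    "bunsenfeng/knowledge-card-wikipedia",   # HuggingFace checkpoint
    "bunsenfeng/knowledge-card-pubmed",
    "cards/my_local_card",                   # a local checkpoint also works
]
knowledge_card_names = [
    "Wikipedia",
    "medical literature",
    "my custom domain",
]
```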

By default we employ the five knowledge cards specified in config.py. We also provide all 26 knowledge cards on HuggingFace (each model name below is a model ID under the bunsenfeng namespace):

| Model Name | Description |
| --- | --- |
| bunsenfeng/knowledge-card-yelp | Yelp reviews |
| bunsenfeng/knowledge-card-yago | YAGO knowledge graph |
| bunsenfeng/knowledge-card-wikipedia | Wikipedia |
| bunsenfeng/knowledge-card-wikipedia2 | Wikipedia, cont. |
| bunsenfeng/knowledge-card-wikidata | Wikidata knowledge graph |
| bunsenfeng/knowledge-card-twitter | tweets |
| bunsenfeng/knowledge-card-reddit | Reddit posts |
| bunsenfeng/knowledge-card-realnews1 | real news, part 1 |
| bunsenfeng/knowledge-card-realnews2 | real news, part 2 |
| bunsenfeng/knowledge-card-realnews3 | real news, part 3 |
| bunsenfeng/knowledge-card-realnews4 | real news, part 4 |
| bunsenfeng/knowledge-card-pubmed | medical literature |
| bunsenfeng/knowledge-card-opensubtitles | movie subtitles |
| bunsenfeng/knowledge-card-midterm | 2022 US midterm election news |
| bunsenfeng/knowledge-card-math | math text |
| bunsenfeng/knowledge-card-legal-contracts | legal contracts |
| bunsenfeng/knowledge-card-kgap | KGAP knowledge graph |
| bunsenfeng/knowledge-card-IMDB | IMDB movie reviews |
| bunsenfeng/knowledge-card-gutenberg | Gutenberg |
| bunsenfeng/knowledge-card-DDB | biomedical knowledge graph |
| bunsenfeng/knowledge-card-ConceptNet | commonsense knowledge graph |
| bunsenfeng/knowledge-card-bookcorpus | BookCorpus |
| bunsenfeng/knowledge-card-atomic | commonsense knowledge graph |
| bunsenfeng/knowledge-card-acl-papers | *ACL papers |
| bunsenfeng/knowledge-card-1btokens | 1B tokens |
| bunsenfeng/knowledge-card-politics | political news |

Note that these knowledge cards are based on OPT-1.3B. Please note that they are far from perfect: after all, they are only 1.3B-parameter models trained with our very limited compute resources. Any language generation model that supports inference on a single GPU should also work, so feel free to use your own models/selections as knowledge cards. If you are interested in contributing or suggesting model checkpoints as knowledge cards, please feel free to open an issue or a pull request.
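
To poke at a released card on its own, outside the full Knowledge Card pipeline, a minimal sketch with Hugging Face Transformers should work; the query and generation settings here are illustrative:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load an OPT-1.3B-based knowledge card and generate a knowledge passage.
card = "bunsenfeng/knowledge-card-wikipedia"
tokenizer = AutoTokenizer.from_pretrained(card)
model = AutoModelForCausalLM.from_pretrained(card).to("cuda:0")

inputs = tokenizer("The history of the printing press", return_tensors="pt").to("cuda:0")
out = model.generate(**inputs, max_new_tokens=64, do_sample=True, top_p=0.9)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```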

Training Your Own Knowledge Card

Any language model checkpoint trained with the causal language modeling objective should work as a knowledge card. We provide a general-purpose implementation in card_training.py: provide a text file (.txt) of corpora and train your own specialized knowledge card!

python card_training.py -m <model_checkpoint> -d <data_txt_path> -n <name_of_the_card>

The trained knowledge card will appear in cards/<name_of_the_card>.
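
For reference, specialized-card training is standard causal language modeling; the following is a minimal, self-contained sketch of that recipe (hyperparameters are illustrative and may differ from what card_training.py actually uses):

```python
# Minimal causal-LM fine-tuning sketch, approximating card_training.py.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "facebook/opt-1.3b"  # the base model used for the released cards
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Load a plain-text corpus, one passage per line.
dataset = load_dataset("text", data_files={"train": "my_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="cards/my_card",
                           num_train_epochs=1,
                           per_device_train_batch_size=4),
    train_dataset=tokenized,
    # mlm=False gives the causal (next-token) objective.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
trainer.save_model("cards/my_card")
```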

Evaluation Data

For MMLU, please refer to the official MMLU dataset release. The fake news detection and MidtermQA datasets are provided in eval_datasets with their respective readmes.

Citation

If you find our work interesting/helpful, please consider citing Knowledge Card:

@inproceedings{feng2023knowledge,
  title={Knowledge Card: Filling LLMs' Knowledge Gaps with Plug-in Specialized Language Models},
  author={Feng, Shangbin and Shi, Weijia and Bai, Yuyang and Balachandran, Vidhisha and He, Tianxing and Tsvetkov, Yulia},
  booktitle={The Twelfth International Conference on Learning Representations},
  year={2024}
}
