
Knowledge_Card

Repository for Knowledge Card: Filling LLMs' Knowledge Gaps with Plug-in Specialized Language Models @ ICLR 2024, Oral.

Configuration

config.py specifies the configuration/hyperparameters for running Knowledge Card in its three modes. We provide four default settings:

  • ChatGPT, slightly slower: We employ ChatGPT (gpt-3.5-turbo) as the base LLM, use GPU 0 for both the relevance and pruning selectors, GPU 1 for the two models in the factuality selector, and GPU 2 for hosting the modular knowledge cards. Note that sharing GPUs 0 and 1 across models makes things a bit slower. 3 GPUs are required in total. Please fill in your OpenAI API key in line 44 of lm_utils.py.
  • ChatGPT, slightly faster: We employ ChatGPT (gpt-3.5-turbo) as the base LLM, use GPU 0 for the relevance selector, GPU 1 for the pruning selector, GPUs 2 and 3 for the two models in the factuality selector, and GPU 4 for hosting the modular knowledge cards. 5 GPUs are required in total. Please fill in your OpenAI API key in line 44 of lm_utils.py.
  • Open-source LLM, slightly slower: We employ an open-source LLM (default: Mistral-7B or LLaMA2-7B) as the base LLM on GPU 0, use GPU 1 for both the relevance and pruning selectors, GPU 2 for the two models in the factuality selector, and GPU 3 for hosting the modular knowledge cards. Note that sharing GPUs 1 and 2 across models makes things a bit slower. 4 GPUs are required in total.
  • Open-source LLM, slightly faster: We employ an open-source LLM (default: Mistral-7B or LLaMA2-7B) as the base LLM on GPU 0, use GPUs 1-4 for the three selectors (the factuality selector uses two models), and GPU 5 for the modular knowledge cards. 6 GPUs are required in total.

Other specifications/hyperparameters in config.py should be self-explanatory or come with comments.
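
For concreteness, here is a hypothetical sketch of what the GPU-related settings for the first preset could look like. The field names below are illustrative, not necessarily the exact names used in config.py:

```python
# Hypothetical sketch of the "ChatGPT, slightly slower" preset.
# Field names are illustrative; consult config.py for the real ones.
base_llm = "gpt-3.5-turbo"              # the base LLM
relevance_selector_device = "cuda:0"    # shared with the pruning selector
pruning_selector_device = "cuda:0"
factuality_selector_devices = ["cuda:1", "cuda:1"]  # two models, one GPU
knowledge_card_device = "cuda:2"        # hosts the modular knowledge cards
```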

Basic Usage

Any environment with a reasonable Hugging Face Transformers installation should be fine. If you really want to replicate the messy environment I used, run conda env create -f environment.yml.

data/sample.jsonl provides an example of the input/output format. Just organize your prompts in a JSONL file, one dict per line, with two fields, prompt and output, in each dict.
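
As a minimal sketch of preparing such a file (the prompts are made up, and I assume output can be left empty before generation; see data/sample.jsonl for the authoritative format):

```python
import json

# Hypothetical prompts; field semantics follow data/sample.jsonl.
rows = [
    {"prompt": "What is the capital of Australia?", "output": ""},
    {"prompt": "Which knowledge graph does ConceptNet encode?", "output": ""},
]
with open("data/my_input.jsonl", "w", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")
```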

bottom_up.py, top_down_auto.py, and top_down_explicit.py are the three modes of Knowledge Card. You can run them with:

python <mode>.py -i <path_to_input_file> -o <path_to_output_file>
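
For example, to run the bottom-up mode on the provided sample data (the output path is arbitrary):

python bottom_up.py -i data/sample.jsonl -o data/sample_out.jsonl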

Please note that it might be slow (downloading all knowledge card checkpoints, running multiple LMs on multiple GPUs, etc.), so you may want to run it on a cluster. There are some potential improvements for better parallelism and efficiency that I may or may not add in the future.

Modular Knowledge Cards

The pool of knowledge cards to leverage is specified in config.py: knowledge_card_paths is a list of strings, each a model checkpoint on HuggingFace (or a local path); knowledge_card_names is a list of strings, each naming a knowledge card. Any string describing the domain/information source/knowledge type should work: commonsense knowledge, Wikipedia, news articles, social media, etc.
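
For example (the values below are illustrative; the two field names are the ones documented above):

```python
# Illustrative values for the documented config.py fields.
knowledge_card_paths = [
    "bunsenfeng/knowledge-card-wikipedia",   # HuggingFace checkpoint
    "bunsenfeng/knowledge-card-pubmed",
    "cards/my_local_card",                   # a local checkpoint also works
]
knowledge_card_names = [
    "Wikipedia",
    "medical literature",
    "my custom domain",
]
```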

By default we employ the five knowledge cards specified in config.py. We also provide all 26 knowledge cards on HuggingFace (each model name below is a model ID under the bunsenfeng namespace):

| Model Name | Description |
| --- | --- |
| bunsenfeng/knowledge-card-yelp | Yelp reviews |
| bunsenfeng/knowledge-card-yago | YAGO knowledge graph |
| bunsenfeng/knowledge-card-wikipedia | Wikipedia |
| bunsenfeng/knowledge-card-wikipedia2 | Wikipedia, cont. |
| bunsenfeng/knowledge-card-wikidata | Wikidata knowledge graph |
| bunsenfeng/knowledge-card-twitter | tweets |
| bunsenfeng/knowledge-card-reddit | Reddit posts |
| bunsenfeng/knowledge-card-realnews1 | real news, part 1 |
| bunsenfeng/knowledge-card-realnews2 | real news, part 2 |
| bunsenfeng/knowledge-card-realnews3 | real news, part 3 |
| bunsenfeng/knowledge-card-realnews4 | real news, part 4 |
| bunsenfeng/knowledge-card-pubmed | medical literature |
| bunsenfeng/knowledge-card-opensubtitles | movie subtitles |
| bunsenfeng/knowledge-card-midterm | 2022 US midterm election news |
| bunsenfeng/knowledge-card-math | math text |
| bunsenfeng/knowledge-card-legal-contracts | legal contracts |
| bunsenfeng/knowledge-card-kgap | KGAP knowledge graph |
| bunsenfeng/knowledge-card-IMDB | IMDB movie reviews |
| bunsenfeng/knowledge-card-gutenberg | Gutenberg |
| bunsenfeng/knowledge-card-DDB | biomedical knowledge graph |
| bunsenfeng/knowledge-card-ConceptNet | commonsense knowledge graph |
| bunsenfeng/knowledge-card-bookcorpus | BookCorpus |
| bunsenfeng/knowledge-card-atomic | commonsense knowledge graph |
| bunsenfeng/knowledge-card-acl-papers | *ACL papers |
| bunsenfeng/knowledge-card-1btokens | 1B tokens |
| bunsenfeng/knowledge-card-politics | political news |

Note that these knowledge cards are based on OPT-1.3B. Please note that they are far from perfect: after all, they are only 1.3B-parameter models trained with our very limited compute resources. Any language generation model that supports inference on a single GPU should also work, so feel free to use your own models/selections as knowledge cards. If you are interested in contributing or suggesting model checkpoints as knowledge cards, please feel free to open an issue or a pull request.
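
To poke at a released card on its own, outside the full Knowledge Card pipeline, a minimal sketch with Hugging Face Transformers should work; the query and generation settings here are illustrative:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load an OPT-1.3B-based knowledge card and generate a knowledge passage.
card = "bunsenfeng/knowledge-card-wikipedia"
tokenizer = AutoTokenizer.from_pretrained(card)
model = AutoModelForCausalLM.from_pretrained(card).to("cuda:0")

inputs = tokenizer("The history of the printing press", return_tensors="pt").to("cuda:0")
out = model.generate(**inputs, max_new_tokens=64, do_sample=True, top_p=0.9)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```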

Training Your Own Knowledge Card

Any language model checkpoint trained with the causal language modeling objective should work as a knowledge card. We provide a general-purpose implementation in card_training.py: provide a text file (.txt) of corpora and train your own specialized knowledge card!

python card_training.py -m <model_checkpoint> -d <data_txt_path> -n <name_of_the_card>

The trained knowledge card will appear in cards/<name_of_the_card>.
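
For reference, specialized-card training is standard causal language modeling; the following is a minimal, self-contained sketch of that recipe (hyperparameters are illustrative and may differ from what card_training.py actually uses):

```python
# Minimal causal-LM fine-tuning sketch, approximating card_training.py.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "facebook/opt-1.3b"  # the base model used for the released cards
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Load a plain-text corpus, one passage per line.
dataset = load_dataset("text", data_files={"train": "my_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="cards/my_card",
                           num_train_epochs=1,
                           per_device_train_batch_size=4),
    train_dataset=tokenized,
    # mlm=False gives the causal (next-token) objective.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
trainer.save_model("cards/my_card")
```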

Evaluation Data

For MMLU, please refer to the official MMLU dataset release. The fake news detection and MidtermQA datasets are provided in eval_datasets with their respective readmes.

Citation

If you find our work interesting/helpful, please consider citing Knowledge Card:

@inproceedings{feng2023knowledge,
  title={Knowledge Card: Filling LLMs' Knowledge Gaps with Plug-in Specialized Language Models},
  author={Feng, Shangbin and Shi, Weijia and Bai, Yuyang and Balachandran, Vidhisha and He, Tianxing and Tsvetkov, Yulia},
  booktitle={The Twelfth International Conference on Learning Representations},
  year={2024}
}
