# Lightweight Fact Completion Demo
This notebook contains the code to run the fact completion benchmark on a small batch of inputs. Fact completion probes whether a model assigns a higher probability to a factual statement than to its paired counterfactual statements. See the repo's [README](https://github.com/daniel-furman/Capstone) for compatible models and more information.
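
The comparison behind the benchmark can be illustrated without loading a model. The sketch below assumes the log-probabilities of each completion have already been computed. `fraction_beaten` and all numbers are hypothetical, chosen only for illustration, and are not the repository's scoring code.

```python
from typing import Sequence


def fraction_beaten(true_logprob: float, false_logprobs: Sequence[float]) -> float:
    """Return the share of counterfactuals scored below the factual completion."""
    if not false_logprobs:
        raise ValueError("at least one counterfactual is required")
    wins = sum(true_logprob > lp for lp in false_logprobs)
    return wins / len(false_logprobs)


# Hypothetical log-probabilities for "The 2020 Olympics were held in ..."
true_lp = -1.2                   # "Tokyo"
false_lps = [-3.4, -0.9, -5.0]   # "London", "Berlin", "Chicago"

print(fraction_beaten(true_lp, false_lps))
```

Here the factual completion outranks two of the three made-up counterfactual scores. Whether the pipeline aggregates per counterfactual or per fact is an assumption of this sketch.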

<a target="_blank" href="https://colab.research.google.com/github/daniel-furman/Capstone/blob/main/notebooks/fact_completion_notebooks/fact-completion-lightweight-demo.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>


## Dependencies

In [None]:
!git clone https://github.com/daniel-furman/Capstone.git
!pip install -r /content/Capstone/requirements.txt

## Imports

In [None]:
import os
import torch
import json
from datasets import load_dataset

In [None]:
if not torch.cuda.is_available():
    raise RuntimeError("No GPU detected. Change the runtime type to include a GPU.")

In [None]:
os.chdir("/content/Capstone/src/fact_completion_scripts")
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "2"

## Notebook usage

In [None]:
# import the main wrapper function for running cka

from run_from_config import main

config = {}

Below, you specify a language model in `config["models"]`. In subsequent cells, Hugging Face's 🤗 `transformers` package downloads the specified model with `load_in_8bit` set to `True`.

* See the [README](https://github.com/daniel-furman/Capstone#models-tested) for the full list of supported model families.
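
For reference, 8-bit loading with `transformers` can look like the sketch below. It is an illustration under assumptions, not the repository's loading code: `load_8bit_causal_lm` is a hypothetical helper, it covers only causal language models such as `gpt2-xl`, and it needs the `bitsandbytes` and `accelerate` packages plus a CUDA GPU. It passes the flag through `BitsAndBytesConfig`; older `transformers` releases also accept `load_in_8bit=True` directly in `from_pretrained`.

```python
def load_8bit_causal_lm(model_name: str):
    """Load a causal LM and its tokenizer with 8-bit weights (hypothetical helper)."""
    # Imported lazily so that defining the helper does not require the libraries.
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=BitsAndBytesConfig(load_in_8bit=True),
        device_map="auto",  # let accelerate place the layers on the available GPU(s)
    )
    return model, tokenizer
```

Calling `load_8bit_causal_lm("gpt2-xl")` downloads the weights on first use, so it is best run only on a GPU runtime.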

Example VRAM requirements for this notebook:
* `google/flan-ul2` requires slightly more than 32GB VRAM
* `EleutherAI/gpt-neox-20b` requires slightly more than 22GB VRAM
* `EleutherAI/gpt-j-6B` requires slightly more than 7GB VRAM
* `google/flan-t5-xl` requires slightly more than 5GB VRAM
* `roberta-large` requires slightly less than 1GB VRAM
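
A rough lower bound on these figures follows from weight storage alone: at 8 bits, each parameter takes one byte. The sketch below uses approximate, rounded parameter counts, which are assumptions added here rather than values from the repository. It ignores activations, CUDA overhead, and any layers kept at higher precision, so observed usage will be higher.

```python
# Approximate parameter counts (assumed, rounded) for three of the models above.
APPROX_PARAMS = {
    "EleutherAI/gpt-neox-20b": 20e9,
    "EleutherAI/gpt-j-6B": 6e9,
    "roberta-large": 355e6,
}


def weight_gb_8bit(n_params: float) -> float:
    """Weights-only memory in GB when each parameter is stored in one byte."""
    bytes_per_param = 1
    return n_params * bytes_per_param / 1e9


for name, n_params in APPROX_PARAMS.items():
    print(f"{name}: about {weight_gb_8bit(n_params):.1f} GB for weights alone")
```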

In [None]:
config["models"] = [
    "gpt2-xl",
]

Next, add new facts through `config["input_information"]`. Each entry has a `stem`, a `true` completion, and one or more `false` completions separated by ` <br> `.
  * For instance, the last entry below encodes "Lebron James is famous for playing the sport of {true: basketball; false: football}".

In [None]:
input_info = [
    {
        "stem": "The 2020 Olympics were held in",
        "true": "Tokyo",
        "false": "London <br> Berlin <br> Chicago",
    },
    {
        "stem": "Operation Overlord took place in",
        "true": "Normandy",
        "false": "Manila <br> Santiago <br> Baghdad",
    },
    {
        "stem": "Steve Jobs is the founder of",
        "true": "Apple",
        "false": "Microsoft <br> Google <br> Facebook",
    },
    {
        "stem": "Lebron James is famous for playing the sport of",
        "true": "basketball",
        "false": "football",
    },
]

with open("/content/input_info_json.json", "w") as f:
    f.write(json.dumps(input_info))
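
In the entries above, multiple counterfactuals share one `false` string, separated by ` <br> `. A small helper can build such entries from Python lists and check them before they are written to disk. `make_entry` and `split_false` are hypothetical names used for illustration; how the pipeline itself parses the field is not shown in this notebook.

```python
from typing import Dict, List

SEPARATOR = " <br> "  # separator used by the example entries in this notebook


def make_entry(stem: str, true: str, false: List[str]) -> Dict[str, str]:
    """Build one input record, joining the counterfactuals with the separator."""
    if not stem.strip() or not true.strip() or not false:
        raise ValueError("stem, true, and at least one false completion are required")
    if true in false:
        raise ValueError("the true completion must not appear among the false ones")
    return {"stem": stem, "true": true, "false": SEPARATOR.join(false)}


def split_false(entry: Dict[str, str]) -> List[str]:
    """Recover the list of counterfactuals from a record."""
    return [item.strip() for item in entry["false"].split("<br>")]


entry = make_entry(
    "The 2020 Olympics were held in", "Tokyo", ["London", "Berlin", "Chicago"]
)
print(entry["false"])
print(split_false(entry))
```
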

In [None]:
config["input_information"] = load_dataset(
    "json", data_files="/content/input_info_json.json"
)["train"]

Lastly, `config["verbosity"]` controls how much information is printed during the run.

In [None]:
config["verbosity"] = True

Now, we're ready to run the cka pipeline.

In [None]:
score_dicts, log_fpath = main(config)

In [None]:
print(score_dicts[0])

In [None]:
print(score_dicts[1])