In [1]:
import TruthTorchLM as ttlm
from transformers import AutoModelForCausalLM, AutoTokenizer
import os
os.environ["OPENAI_API_KEY"] = 'your_openai_key'
import torch

### What is TruthTorchLM?

TruthTorchLM is an open-source library that collects various state-of-the-art truth methods and offers a user-friendly interface for using and evaluating them.

TruthTorchLM is compatible with Huggingface and LiteLLM, enabling users to integrate truthfulness assessment into their workflows with minimal code changes.

In [2]:
#define a huggingface model or api-based model
model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct", torch_dtype=torch.bfloat16).to('cuda:0')
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct", use_fast=False)

api_model = "gpt-4o-mini"


Loading checkpoint shards: 100%|██████████| 4/4 [00:01<00:00,  3.93it/s]


### Simple Usage: Assessing Truthfulness of Short Generations
The first core feature of TruthTorchLM is generating a message together with a truth value, which indicates the truthfulness of the generated output. TruthTorchLM offers a wide range of **truth methods** for this assessment. Each truth method can use a different algorithmic approach and produce truth values in a different output range, so truth values from different methods cannot be compared directly. For a given truth method, however, a higher truth value means the output is more likely to be truthful. To make truth values comparable, we normalize them to a common range in the next section (Calibrating Truth Methods). Note that the normalized truth values in the output dictionary are not meaningful without calibration.





In [3]:
#define two truth methods; other methods can be found under src/TruthTorchLM/truth_methods
lars = ttlm.truth_methods.LARS(device='cuda:0')#https://arxiv.org/pdf/2406.11278
confidence = ttlm.truth_methods.Confidence()#average log probability of the generated message

truth_methods = [lars, confidence]
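
The Confidence method is described above as the average log probability of the generated message. The block below is a minimal sketch of that idea in plain Python, not TruthTorchLM's implementation. The function name `average_log_probability` and the per-token probabilities are hypothetical and chosen only for illustration. It also shows why such a score is never positive, so its range differs from that of a score bounded in [0, 1].

```python
import math

def average_log_probability(token_probs):
    """Mean of log(p) over the generated tokens; the result is always <= 0."""
    if not token_probs:
        raise ValueError("token_probs must not be empty")
    return sum(math.log(p) for p in token_probs) / len(token_probs)

# Hypothetical per-token probabilities for two generations.
confident = [0.99, 0.98, 0.995]
uncertain = [0.60, 0.35, 0.80]

# A generation whose tokens were sampled with high probability scores closer to 0.
print(average_log_probability(confident))
print(average_log_probability(uncertain))
```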

In [4]:
#define a chat history
chat = [{"role": "system", "content": 'You are a helpful assistant. Give short and precise answers.'},
        {"role": "user", "content": "What is the capital city of France?"},]

#generate a message with a truth value; this is a wrapper function for model.generate in Huggingface
output_hf_model = ttlm.generate_with_truth_value(model = model, tokenizer = tokenizer, messages = chat, truth_methods = truth_methods, max_new_tokens = 100, temperature = 0.7, pad_token_id=tokenizer.eos_token_id)

#generate a message with a truth value; this is a wrapper function for litellm.completion in LiteLLM
output_api_model = ttlm.generate_with_truth_value(model = api_model, messages = chat, truth_methods = truth_methods)

We detected that you are passing `past_key_values` as a tuple and this is deprecated and will be removed in v4.43. Please use an appropriate `Cache` class (https://huggingface.co/docs/transformers/v4.41.3/en/internal/generation_utils#transformers.Cache)


In [5]:
#print the output of HF model
print(output_hf_model)
#print the output of API model
print(output_api_model)

{'generated_text': 'The capital city of France is Paris.', 'normalized_truth_values': [0.7299441133258416, 0.4969400872781441], 'unnormalized_truth_values': [0.9943390488624573, -0.012239803691733818], 'method_specific_outputs': [{'truth_value': 0.9943390488624573, 'generated_text': 'The capital city of France is Paris.<|eot_id|>', 'normalized_truth_value': 0.7299441133258416}, {'truth_value': -0.012239803691733818, 'generated_text': 'The capital city of France is Paris.<|eot_id|>', 'normalized_truth_value': 0.4969400872781441}], 'all_ids': tensor([[128000, 128000, 128006,    882, 128007,    271,   2675,    527,    264,
          11190,  18328,     13,  21335,   2875,    323,  24473,  11503,     13,
           3639,    374,    279,   6864,   3363,    315,   9822,     30, 128009,
         128006,  78191, 128007,    271,    791,   6864,   3363,    315,   9822,
            374,  12366,     13, 128009]]), 'generated_tokens': tensor([   791,   6864,   3363,    315,   9822,    374,  12366,  

### Calibrating Truth Methods
Truth values from different methods are not comparable: they have different ranges and different meanings. It is therefore better to calibrate them to a common range, which can be done with the `calibrate_truth_method` function. Different calibration functions can be defined for different objectives. The default is sigmoid normalization: subtract a threshold from the truth value, divide by the standard deviation, and apply the sigmoid function. The threshold and standard deviation are computed from the data. In this example, we use Isotonic Regression instead, which maps the truth values to the [0, 1] range and makes them interpretable. With this calibration, a normalized truth value of 0.8 is expected to mean that the output is about 80% likely to be truthful.
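
The sigmoid normalization described above can be sketched in a few lines. This is an illustration of the formula only, not the library's code. The function name `sigmoid_normalize` and the defaults `threshold=0.0` and `std=1.0` are assumptions. With those defaults, the raw values printed in the uncalibrated output earlier in this notebook appear to map to the normalized values shown there.

```python
import math

def sigmoid_normalize(truth_value, threshold=0.0, std=1.0):
    """Map a raw truth value to (0, 1) via sigmoid((x - threshold) / std)."""
    if std <= 0:
        raise ValueError("std must be positive")
    z = (truth_value - threshold) / std
    # Numerically stable sigmoid: never exponentiate a large positive number.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

# Raw truth values copied from the uncalibrated output earlier in this notebook.
print(sigmoid_normalize(0.9943390488624573))     # LARS
print(sigmoid_normalize(-0.012239803691733818))  # Confidence
```

Without calibration the threshold and scale carry no information about correctness, which is why these uncalibrated normalized values should not be read as probabilities.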



In [6]:
#we need a supervised dataset to calibrate the truth methods. We use trivia_qa dataset for this example.
#we need a correctness evaluator to evaluate the truth methods. We use model_judge for this example. model_judge looks at the model's output and the ground truth and returns a correctness score.
model_judge = ttlm.evaluators.ModelJudge('gpt-4o-mini')
for truth_method in truth_methods:
    truth_method.set_normalizer(ttlm.normalizers.IsotonicRegression())
calibration_results = ttlm.calibrate_truth_method(dataset = 'trivia_qa', model = model, truth_methods = truth_methods, tokenizer = tokenizer, correctness_evaluator = model_judge, 
    size_of_data = 10,  return_method_details = True, seed = 0, max_new_tokens = 64, wandb_push_method_details = False, pad_token_id=tokenizer.eos_token_id)

Loading dataset from Huggingface Datasets, split: train fraction of data: 10


100%|██████████| 10/10 [00:00<00:00, 156503.88it/s]
100%|██████████| 10/10 [00:13<00:00,  1.35s/it]

Calibrated with the following parameters: {'increasing': True, 'out_of_bounds': 'clip', 'y_max': 1.0, 'y_min': 0.0}
Calibrated with the following parameters: {'increasing': True, 'out_of_bounds': 'clip', 'y_max': 1.0, 'y_min': 0.0}
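
To give some intuition for isotonic calibration, the block below sketches the pool-adjacent-violators idea in plain Python. It fits a non-decreasing step function from raw truth values to observed correctness. This is a simplified illustration under stated assumptions, not the normalizer used by TruthTorchLM: the names `pav_fit` and `pav_predict` and the toy data are hypothetical, tied scores are not pooled, and predictions are piecewise constant rather than interpolated.

```python
def pav_fit(scores, labels):
    """Fit a non-decreasing step function from scores to mean correctness."""
    if not scores or len(scores) != len(labels):
        raise ValueError("scores and labels must be non-empty and of equal length")
    # Each block holds [sum_of_labels, count, largest_score_in_block].
    blocks = []
    for score, label in sorted(zip(scores, labels)):
        blocks.append([float(label), 1, score])
        # Merge backwards while the previous block's mean exceeds the last block's mean.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count, upper = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = upper
    return [(upper, total / count) for total, count, upper in blocks]

def pav_predict(model, score):
    """Return the value of the first block covering the score; clip above the range."""
    for upper, value in model:
        if score <= upper:
            return value
    return model[-1][1]

# Toy calibration data: raw truth values and 0/1 correctness labels.
toy_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
toy_labels = [0, 1, 0, 0, 1, 1]
toy_model = pav_fit(toy_scores, toy_labels)
print(toy_model)
print(pav_predict(toy_model, 0.35))
```

The fitted values are fractions of correct answers within each pooled block, which is what lets a calibrated value be read as an approximate probability of correctness.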





### Evaluating Truth Methods
We can evaluate truth methods with the `evaluate_truth_method` function, choosing among evaluation metrics including AUROC, AUPRC, AUARC, Accuracy, F1, Precision, Recall, and PRR. TruthTorchLM offers a wide range of datasets for evaluating truth methods; in this example, we use the trivia_qa dataset. Note that calibration is recommended for threshold-based metrics such as F1, Recall, Precision, and Accuracy.
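
Of the metrics listed above, AUROC depends only on how a truth method ranks correct answers relative to incorrect ones, so it needs no threshold and is unaffected by calibration that preserves the ordering. The block below is a minimal sketch of that ranking definition, not TruthTorchLM's implementation. The function name `auroc` and the toy values are illustrative.

```python
def auroc(truth_values, correct):
    """Probability that a random correct answer outscores a random incorrect one.

    Ties between a correct and an incorrect answer count as half a win.
    """
    pos = [v for v, c in zip(truth_values, correct) if c]
    neg = [v for v, c in zip(truth_values, correct) if not c]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

# Toy truth values with 0/1 correctness labels.
print(auroc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]))  # every correct answer ranks higher
print(auroc([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0]))  # one correct answer ranks below an incorrect one
```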


In [7]:
results = ttlm.evaluate_truth_method(dataset = 'trivia_qa', model = model, truth_methods=truth_methods, 
    eval_metrics = ['auroc', 'prr'], tokenizer = tokenizer, size_of_data = 10, correctness_evaluator = model_judge, 
    return_method_details = True,  batch_generation = True, wandb_push_method_details = False,
    max_new_tokens = 64, do_sample = True, seed = 0, pad_token_id=tokenizer.eos_token_id)


Loading dataset from Huggingface Datasets, split: test fraction of data: 10


100%|██████████| 10/10 [00:00<00:00, 144631.17it/s]
100%|██████████| 10/10 [00:14<00:00,  1.47s/it]


In [8]:
for i in range(len(results['eval_list'])):
    print(results['output_dict']['truth_methods'][i],results['eval_list'][i])

LARS {'auroc': 1.0, 'prr': 1.0}
Confidence {'auroc': 0.8888888888888888, 'prr': 0.8762188574381626}


### Assessing Truthfulness in Long-Form Generation

Assigning a single truth value for a long text is neither practical nor useful. TruthTorchLM first decomposes the generated text into short, single-sentence claims and assigns truth values to these claims using claim check methods.


Most truth methods cannot be applied directly to assign a truth value to a single claim. To overcome this, TruthTorchLM provides several claim check methods, which take truth methods as a parameter and make them usable for decomposed claims. Note that some claim check methods are designed directly for this purpose and do not use truth methods.


Finally, the `long_form_generation_with_truth_value` function returns the generated text, the decomposed claims, and the truth values assigned to the claims (as well as all details of the process).


The long-form generation functionality of TruthTorchLM is collected under the `long_form_generation` module.
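
The returned truth values are nested by claim check method and by claim. The sketch below flattens that nesting into one row per claim. The layout is inferred from the indexing used in the printing cells of this notebook and is an assumption, not a documented contract. The helper `tabulate_claims`, the column labels, and the mock dictionary are hypothetical.

```python
def tabulate_claims(output, column_names):
    """Flatten per-claim truth values into one dictionary per claim.

    column_names[m] lists the labels for the values of claim check method m.
    """
    rows = []
    for i, claim in enumerate(output["claims"]):
        row = {"claim": claim}
        for m, labels in enumerate(column_names):
            values = output["unnormalized_truth_values"][m][i]
            # A claim check method that wraps several truth methods yields a list
            # per claim; one that does not use truth methods yields a single number.
            if not isinstance(values, (list, tuple)):
                values = [values]
            row.update(zip(labels, values))
        rows.append(row)
    return rows

# Mock output with the assumed layout; the numbers are illustrative.
mock_output = {
    "claims": [
        "The Eiffel Tower is located in Paris, France.",
        "The Eiffel Tower is made of iron.",
    ],
    "unnormalized_truth_values": [
        [[-0.01, 0.99], [-0.11, 0.92]],  # one [Confidence, LARS] pair per claim
        [1.00, 1.00],                    # one entailment score per claim
    ],
}

rows = tabulate_claims(mock_output, [["confidence", "lars"], ["ac_entailment"]])
for row in rows:
    print(row)
```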

In [9]:
import TruthTorchLM.long_form_generation as LFG
from transformers import DebertaForSequenceClassification, DebertaTokenizer

In [10]:
#define a decomposition method that breaks the long text into claims
decomposition_method = LFG.decomposition_methods.StructuredDecompositionAPI(model="gpt-4o-mini", decomposition_depth=2, split_by_paragraphs=False) #Utilize API models to decompose text
# decomposition_method = LFG.decomposition_methods.StructuredDecompositionLocal(model, tokenizer, decomposition_depth=1) #Utilize HF models to decompose text

In [11]:
#entailment model is used by some truth methods and claim check methods
model_for_entailment = DebertaForSequenceClassification.from_pretrained('microsoft/deberta-large-mnli').to('cuda:0')
tokenizer_for_entailment = DebertaTokenizer.from_pretrained('microsoft/deberta-large-mnli')

Some weights of the model checkpoint at microsoft/deberta-large-mnli were not used when initializing DebertaForSequenceClassification: ['config']
- This IS expected if you are initializing DebertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DebertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [12]:
#define truth methods 
confidence = ttlm.truth_methods.Confidence() #average log probability of the generated message
lars = ttlm.truth_methods.LARS(device='cuda:0') #https://arxiv.org/pdf/2406.11278

#define the claim check methods that apply truth methods
qa_generation = LFG.claim_check_methods.QuestionAnswerGeneration(model="gpt-4o-mini", tokenizer=None, num_questions=2, max_answer_trials=2,
                                                                     truth_methods=[confidence, lars], seed=0,
                                                                     entailment_model=model_for_entailment, entailment_tokenizer=tokenizer_for_entailment) #HF model and tokenizer can also be used, LM is used to generate question
#there are some claim check methods that are directly designed for this purpose, not utilizing truth methods
ac_entailment = LFG.claim_check_methods.AnswerClaimEntailment( model="gpt-4o-mini", tokenizer=None, 
                                                                      num_questions=3, num_answers_per_question=2, 
                                                                      entailment_model=model_for_entailment, entailment_tokenizer=tokenizer_for_entailment) #HF model and tokenizer can also be used, LM is used to generate question

In [13]:
#define a chat history
chat = [{"role": "system", "content": 'You are a helpful assistant. Give brief and precise answers.'},
        {"role": "user", "content": 'Give some information about Eiffel Tower.'}]

In [14]:
#generate a message with a truth value; this is a wrapper function for model.generate in Huggingface
output_hf_model = LFG.long_form_generation_with_truth_value(model=model, tokenizer=tokenizer, messages=chat, decomp_method=decomposition_method, 
                                          claim_check_methods=[qa_generation, ac_entailment], generation_seed=0)

Decomposing the generated text...
Applying claim check method  QuestionAnswerGeneration
Applying claim check method  AnswerClaimEntailment


In [15]:
print("Generated Text:\n", output_hf_model['generated_text'])

print("\nClaims:")
for i in range(len(output_hf_model['claims'])):
    print(output_hf_model['claims'][i]) 
    print(f"     Conf: {output_hf_model['unnormalized_truth_values'][0][i][0]:.2f}", 
          f"   LARS: {output_hf_model['unnormalized_truth_values'][0][i][1]:.2f}",
          f"   AS Ent: {output_hf_model['unnormalized_truth_values'][1][i]:.2f}")

Generated Text:
 I'd be happy to help!

Here's some brief information about the Eiffel Tower:

* Location: Paris, France
* Height: 324 meters (1,063 feet)
* Built: 1889 for the World's Fair
* Architect: Gustave Eiffel
* Materials: Iron
* Original purpose: Radio broadcasting tower
* Now: Iconic tourist attraction and symbol of Paris

Claims:
The Eiffel Tower is located in Paris, France.
     Conf: -0.01    LARS: 0.99    AS Ent: 1.00
The Eiffel Tower has a height of 324 meters.
     Conf: -0.02    LARS: 0.99    AS Ent: 1.00
The height of the Eiffel Tower is 1,063 feet.
     Conf: -0.02    LARS: 0.99    AS Ent: 1.00
The Eiffel Tower was built in 1889.
     Conf: -0.13    LARS: 0.99    AS Ent: 1.00
The Eiffel Tower was built for the World's Fair.
     Conf: -0.06    LARS: 0.99    AS Ent: 1.00
The architect of the Eiffel Tower is Gustave Eiffel.
     Conf: -0.03    LARS: 0.99    AS Ent: 1.00
The Eiffel Tower is made of iron.
     Conf: -0.11    LARS: 0.92    AS Ent: 1.00
The original purpos

In [16]:
#generate a message with a truth value; this is a wrapper function for litellm.completion in LiteLLM
output_api_model = LFG.long_form_generation_with_truth_value(model="gpt-4o-mini", messages=chat, decomp_method=decomposition_method, 
                                          claim_check_methods=[qa_generation, ac_entailment], generation_seed=0)


Decomposing the generated text...
Applying claim check method  QuestionAnswerGeneration
Applying claim check method  AnswerClaimEntailment


In [17]:
print("Generated Text:\n", output_api_model['generated_text'])

print("\nClaims:")
for i in range(len(output_api_model['claims'])):
    print(output_api_model['claims'][i]) 
    print(f"     Conf: {output_api_model['unnormalized_truth_values'][0][i][0]:.2f}", 
          f"   LARS: {output_api_model['unnormalized_truth_values'][0][i][1]:.2f}",
          f"   AS Ent: {output_api_model['unnormalized_truth_values'][1][i]:.2f}")

Generated Text:
 The Eiffel Tower is a wrought-iron lattice tower located in Paris, France. It was designed by engineer Gustave Eiffel and completed in 1889 for the Exposition Universelle (World's Fair) held to celebrate the 100th anniversary of the French Revolution. The tower stands approximately 1,083 feet (330 meters) tall, making it one of the tallest structures in the world at the time of its completion.

It has three levels accessible to visitors, with restaurants on the first and second levels, and an observation deck on the third level that offers panoramic views of Paris. The Eiffel Tower is an iconic symbol of France and attracts millions of tourists annually. It was initially criticized by some artists and intellectuals but has since become one of the most recognizable landmarks globally.

Claims:
The Eiffel Tower is a wrought-iron lattice tower located in Paris, France.
     Conf: -0.08    LARS: 0.99    AS Ent: 1.00
The Eiffel Tower was designed by engineer Gustave Eiffel.

### Evaluation of Truth Methods in Long-Form Generation

We can evaluate truth methods on long-form generation with the `evaluate_truth_method_long_form` function. To obtain the correctness of the claims, we follow SAFE (https://arxiv.org/pdf/2403.18802). SAFE performs a Google search for each claim and labels it as supported, unsupported, or irrelevant. TruthTorchLM offers different evaluation metrics including AUROC, AUPRC, AUARC, Accuracy, F1, Precision, Recall, and PRR.

In [18]:
#SAFE uses the Serper API for search
os.environ['SERPER_API_KEY'] = 'your_serper_api_key'#https://serper.dev/


In [19]:
#create a SAFE object that assigns labels to the claims
#for a faster run, you can decrease these parameters; the values below are the defaults in the original SAFE implementation
safe = LFG.ClaimEvaluator(rater='gpt-4o-mini', tokenizer = None, max_steps = 5, max_retries = 10, num_searches = 3) 

#Define metrics
sample_level_eval_metrics = ['f1'] #calculate metric over the claims of a question, then average across all the questions
dataset_level_eval_metrics = ['auroc', 'prr'] #calculate the metric across all claims 
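
The difference between the two aggregation levels named in the comments above can be sketched as follows. This is an illustration only and does not reproduce the library's code. The helpers `f1_score`, `sample_level`, and `dataset_level` and the toy predictions are hypothetical. F1 is used for both levels here purely to make the comparison visible.

```python
def f1_score(predictions, labels):
    """F1 for binary predictions; returns 0.0 when there are no true positives."""
    tp = sum(1 for p, y in zip(predictions, labels) if p and y)
    fp = sum(1 for p, y in zip(predictions, labels) if p and not y)
    fn = sum(1 for p, y in zip(predictions, labels) if not p and y)
    if tp == 0:
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)

def sample_level(metric, per_question):
    """Compute the metric over the claims of each question, then average."""
    scores = [metric(preds, labels) for preds, labels in per_question]
    return sum(scores) / len(scores)

def dataset_level(metric, per_question):
    """Pool the claims of all questions, then compute the metric once."""
    all_preds = [p for preds, _ in per_question for p in preds]
    all_labels = [y for _, labels in per_question for y in labels]
    return metric(all_preds, all_labels)

# Toy data: (thresholded predictions, supported/unsupported labels) per question.
toy_questions = [
    ([1, 1, 0], [1, 1, 1]),
    ([1, 0], [0, 1]),
]
print(sample_level(f1_score, toy_questions))
print(dataset_level(f1_score, toy_questions))
```

The two numbers differ because sample-level averaging weights every question equally, while dataset-level pooling weights every claim equally.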

In [21]:
results = LFG.evaluate_truth_method_long_form(dataset='longfact_objects', model='gpt-4o-mini', tokenizer=None,
                                sample_level_eval_metrics=sample_level_eval_metrics, dataset_level_eval_metrics=dataset_level_eval_metrics,
                                decomp_method=decomposition_method, claim_check_methods=[qa_generation],
                                claim_evaluator = safe, size_of_data=3,  previous_context=[{'role': 'system', 'content': 'You are a helpful assistant. Give brief and precise answers.'}], 
                                user_prompt="Question: {question_context}", seed=0,  return_method_details = False, return_calim_eval_details=False, wandb_run = None,  
                                add_generation_prompt = True, continue_final_message = False)

Loading dataset... Size of data: 3


  0%|          | 0/3 [00:00<?, ?it/s]

Decomposing the generated text...
Applying claim check method  QuestionAnswerGeneration
Checking for claim support by google search...


 33%|███▎      | 1/3 [04:03<08:07, 243.79s/it]

Time ellapsed for google search: 176.2192018032074
Decomposing the generated text...
Applying claim check method  QuestionAnswerGeneration
Checking for claim support by google search...


 67%|██████▋   | 2/3 [09:10<04:40, 280.98s/it]

Time ellapsed for google search: 247.78795313835144
Decomposing the generated text...
Applying claim check method  QuestionAnswerGeneration
Checking for claim support by google search...


100%|██████████| 3/3 [14:18<00:00, 286.23s/it]

Time ellapsed for google search: 227.9723973274231





In [22]:
# claim_check_methods_0_truth_method_0 : qa_generation + confidence
# claim_check_methods_0_truth_method_1 : qa_generation + LARS
results['dataset_level_eval_list']

{'claim_check_methods_0_truth_method_0': {'auroc': 0.6754385964912282,
  'prr': 0.039265960161460244},
 'claim_check_methods_0_truth_method_1': {'auroc': 0.7719298245614037,
  'prr': 0.6202041235870827}}

In [23]:
# claim_check_methods_0_truth_method_0 : qa_generation + confidence
# claim_check_methods_0_truth_method_1 : qa_generation + LARS
results['sample_level_eval_list']

{'claim_check_methods_0_truth_method_0': {'f1': {'values': [0.0, 0.0, 0.0],
   'mean': 0.0,
   'max': 0.0,
   'min': 0.0,
   'std': 0.0}},
 'claim_check_methods_0_truth_method_1': {'f1': {'values': [1.0,
    1.0,
    0.9230769230769231],
   'mean': 0.9743589743589745,
   'max': 1.0,
   'min': 0.9230769230769231,
   'std': 0.03626188621469472}}}
