In [None]:
import os
import torch
import TruthTorchLM as ttlm
from transformers import AutoModelForCausalLM, AutoTokenizer

os.environ["OPENAI_API_KEY"] = 'your_openai_key'  # replace with your own key

### What is TruthTorchLM?

TruthTorchLM is an open-source library that collects state-of-the-art hallucination detection methods and offers a user-friendly interface for applying and evaluating them.

In [13]:
#define a Hugging Face model or an API-based model
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf", torch_dtype=torch.bfloat16).to('cuda:0')
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf", use_fast=False)

api_model = "gpt-4o-mini"


Loading checkpoint shards: 100%|██████████| 2/2 [00:03<00:00,  1.81s/it]


### Simple Usage: Abstain or Not
The first core functionality of TruthTorchLM is generating a message together with a truth value. The truth value indicates how likely the model is to be hallucinating. Several methods are available for detecting hallucination; the library calls them **truth methods**. Each truth method can use a different algorithmic approach and produce truth values in a different output range. For a given truth method, a lower truth value means the model is more likely hallucinating, in which case it should abstain from giving that output.
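
The abstain decision described above can be sketched as a simple threshold check. This is a minimal illustration, not part of TruthTorchLM: the helper `should_abstain` and the cutoff of 0.5 are assumptions, and the dictionary only mimics the `normalized_truth_values` key that appears in the example output printed later in this notebook.

```python
# Hypothetical helper (not a TruthTorchLM function): abstain when the
# normalized truth value of the chosen method falls below a cutoff.
def should_abstain(output, method_index=0, threshold=0.5):
    """Return True when the chosen method's normalized truth value is below the threshold."""
    return output["normalized_truth_values"][method_index] < threshold


# Values copied from the example output in this notebook,
# in the order LARS, Confidence, SelfDetection.
example_output = {
    "generated_text": " The capital city of France is Paris.",
    "normalized_truth_values": [0.7300473799334672, 0.4973040655079717, 0.5],
}

for index, name in enumerate(["LARS", "Confidence", "SelfDetection"]):
    decision = "abstain" if should_abstain(example_output, index) else "answer"
    print(name, decision)
```

A fixed cutoff like this is only meaningful once the truth values are calibrated; before calibration, a sensible threshold would likely differ per method.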





In [3]:
#define 3 different truth methods; other methods can be found under src/TruthTorchLM/truth_methods
lars = ttlm.truth_methods.LARS() #https://arxiv.org/pdf/2406.11278
confidence = ttlm.truth_methods.Confidence() #average log probability of the generated message
self_detection = ttlm.truth_methods.SelfDetection(number_of_questions=5) #https://arxiv.org/pdf/2310.17918

truth_methods = [lars, confidence, self_detection]

Some weights of the model checkpoint at microsoft/deberta-large-mnli were not used when initializing DebertaForSequenceClassification: ['config']
- This IS expected if you are initializing DebertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DebertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [4]:
#define a chat history
chat = [{"role": "system", "content": 'You are a helpful assistant. Give short and precise answers.'},
        {"role": "user", "content": "What is the capital city of France?"},]

#generate a message with a truth value; this is a wrapper function for model.generate in Hugging Face
output_hf_model = ttlm.generate_with_truth_value(model = model, tokenizer = tokenizer, messages = chat, truth_methods = truth_methods, max_new_tokens = 100, temperature = 0.7)

#generate a message with a truth value; this is a wrapper function for litellm.completion in LiteLLM
output_api_model = ttlm.generate_with_truth_value(model = api_model, messages = chat, truth_methods = truth_methods)

We detected that you are passing `past_key_values` as a tuple and this is deprecated and will be removed in v4.43. Please use an appropriate `Cache` class (https://huggingface.co/docs/transformers/v4.41.3/en/internal/generation_utils#transformers.Cache)


In [8]:
#print the output of HF model
print(output_hf_model)
#print the output of API model
print(output_api_model)



{'generated_text': ' The capital city of France is Paris.', 'normalized_truth_values': [0.7300473799334672, 0.4973040655079717, 0.5], 'unnormalized_truth_values': [0.994862973690033, -0.010783842472449123, -0.0], 'method_specific_outputs': [{'truth_value': 0.994862973690033, 'generated_text': ' The capital city of France is Paris.</s>', 'normalized_truth_value': 0.7300473799334672}, {'truth_value': -0.010783842472449123, 'generated_text': ' The capital city of France is Paris.</s>', 'normalized_truth_value': 0.4973040655079717}, {'truth_value': -0.0, 'entropy': 0.0, 'consistency': 1.0, 'generated_questions': [" Sure! Here's a rephrased version of the question:\n\nWhat city serves as the seat of government and political power for France?", ' Of course! Here is a rephrased version of the question:\n\nWhat city serves as the political and administrative hub of France?', ' Sure! Here is a rephrased version of the question:\n\nWhat is the seat of government located in the country of France?

### Calibration of the truth methods
Truth values from different methods are not directly comparable: they have different ranges and different meanings. It is therefore better to calibrate the truth values to a common range, which can be done with the `calibrate_truth_method` function. Different calibration functions can be defined for different objectives. The default calibration is sigmoid normalization: we subtract the threshold from the truth value, divide by the standard deviation, and apply the sigmoid function. Both the threshold and the standard deviation are computed from the calibration data.
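
The sigmoid normalization described above can be written out in a few lines. This is a sketch of the formula only, not the library's implementation: the function name `sigmoid_normalize` is hypothetical, and the defaults `threshold=0.0` and `std=1.0` are an assumption chosen because they are consistent with the LARS and Confidence values in the earlier example output (roughly 0.730 and 0.497).

```python
import math


def sigmoid_normalize(truth_value, threshold=0.0, std=1.0):
    """Map a raw truth value into (0, 1) via sigmoid((truth_value - threshold) / std)."""
    return 1.0 / (1.0 + math.exp(-(truth_value - threshold) / std))


# Raw (unnormalized) truth values taken from the earlier example output.
print(sigmoid_normalize(0.994862973690033))      # LARS
print(sigmoid_normalize(-0.010783842472449123))  # Confidence

# Illustrative calibration parameters (made up): a raw value equal to the
# threshold always maps to the midpoint of the normalized range.
print(sigmoid_normalize(0.5, threshold=0.5, std=0.25))
```

After calibration, the threshold acts as the decision boundary: raw values above it are normalized to more than 0.5 and values below it to less than 0.5, regardless of the method's original range.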

In [None]:
#we need a supervised dataset to calibrate the truth methods; this example uses the trivia_qa dataset
#we also need a correctness evaluator; this example uses model_judge, which compares the model's output with the ground truth and returns a correctness score
model_judge = ttlm.evaluators.ModelJudge('gpt-4o-mini')
for truth_method in truth_methods:
    truth_method.set_normalizer(ttlm.normalizers.IsotonicRegression())
calibration_results = ttlm.calibrate_truth_method(dataset = 'trivia_qa', model = model, truth_methods = truth_methods, tokenizer = tokenizer, correctness_evaluator = model_judge, 
    size_of_data = 10,  return_method_details = True, seed = 0, max_new_tokens = 64, wandb_push_method_details = False)

Loading dataset from Huggingface Datasets, split: train fraction of data: 10


100%|██████████| 10/10 [00:00<00:00, 107546.26it/s]
100%|██████████| 10/10 [02:38<00:00, 15.85s/it]

Calibrated with the following parameters: threshold = 0.6613246202468872 std = 0.24197971376981708
Calibrated with the following parameters: threshold = -0.30496641806830665 std = 0.08716437970354843
Calibrated with the following parameters: threshold = -0.6730116670092565 std = 0.5121234893961678





### Evaluation of the truth methods
We can evaluate the truth methods with the `evaluate_truth_method` function. Available evaluation metrics include AUROC, AUPRC, AUARC, Accuracy, F1, Precision, Recall, and PRR.
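
To make one of these metrics concrete, here is a minimal pure-Python sketch of AUROC: the probability that a correct answer receives a higher truth value than an incorrect one. It is not the library's implementation; the function name `auroc` and the sample data are made up for illustration.

```python
def auroc(truth_values, correctness):
    """Fraction of (correct, incorrect) pairs ranked in the right order; ties count as 0.5."""
    positives = [v for v, c in zip(truth_values, correctness) if c == 1]
    negatives = [v for v, c in zip(truth_values, correctness) if c == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one correct and one incorrect sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up truth values and correctness labels (1 = correct, 0 = incorrect).
values = [0.9, 0.8, 0.35, 0.6, 0.2]
labels = [1, 1, 0, 1, 0]
print(auroc(values, labels))
```

Because AUROC depends only on the ranking of truth values, it can compare truth methods with different output ranges without calibrating them first.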


In [18]:
results = ttlm.evaluate_truth_method(dataset = 'trivia_qa', model = model, truth_methods=truth_methods, 
    eval_metrics = ['auroc', 'prr'], tokenizer = tokenizer, size_of_data = 10, correctness_evaluator = model_judge, 
    return_method_details = True,  batch_generation = True, wandb_push_method_details = False,
    max_new_tokens = 64, do_sample = True, seed = 0)


Loading dataset from Huggingface Datasets, split: test fraction of data: 10


100%|██████████| 10/10 [00:00<00:00, 119837.26it/s]
100%|██████████| 10/10 [02:05<00:00, 12.56s/it]


In [20]:
for i in range(len(results['eval_list'])):
    print(results['output_dict']['truth_methods'][i],results['eval_list'][i])

LARS {'auroc': 0.9047619047619049, 'prr': 0.8712251850518855}
Confidence {'auroc': 0.9047619047619049, 'prr': 0.8712251850518855}
SelfDetection {'auroc': 0.8571428571428571, 'prr': 0.7597854413467863}


### Long Form Generation

Assigning a single score to a long text is neither practical nor useful. Therefore, we first decompose the generated text into short, single-sentence statements. The goal is to assign a truth value to each statement.

Most truth methods cannot directly assign a truth value to a single statement. To overcome this, TruthTorchLM provides several statement check methods, which take truth methods as a parameter. Statement check methods are how truth methods become usable on decomposed statements.

In the end, the `long_form_generation_with_truth_value` function returns the generated text, the decomposed statements, and the truth values assigned to the statements (along with all details of the process).

The long-form generation functionality of TruthTorchLM is collected under the `long_form_generation` module.
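
As a rough guide to reading the returned dictionary, the sketch below pairs each statement with one of its truth values. It is only an illustration under assumptions: `mock_output` is hand-built with rounded values, its keys and nesting mirror the indexing used by the print loops later in this notebook, and `statement_table` is a hypothetical helper rather than a library function.

```python
# Hand-built stand-in for the dictionary returned by the long-form call.
mock_output = {
    "statements": [
        "Ryan Reynolds is a Canadian actor.",
        "Ryan Reynolds is a producer.",
    ],
    "unnormalized_truth_values": [
        # statement check 0 wraps truth methods: [Confidence, LARS] per statement
        [[-0.0006, 0.9929], [-0.2420, 0.7058]],
        # statement check 1 produces a single value per statement
        [1.0, 1.0],
    ],
}


def statement_table(output, check_index, method_index=None):
    """Pair each statement with one truth value from the chosen statement check method."""
    values = output["unnormalized_truth_values"][check_index]
    rows = []
    for statement, value in zip(output["statements"], values):
        if method_index is not None:
            value = value[method_index]  # select one truth method inside the check
        rows.append((statement, value))
    return rows


for statement, value in statement_table(mock_output, check_index=0, method_index=1):
    print(f"{value:.3f}  {statement}")
```

The first index selects the statement check method and the second the statement; a third index is needed only when the check method wraps several truth methods.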

In [2]:
import TruthTorchLM.long_form_generation as LFG
from transformers import DebertaForSequenceClassification, DebertaTokenizer

In [3]:
#define the decomposition method that breaks the long text into statements
decomposition_method = LFG.decomposition_methods.StructuredDecompositionAPI(model="gpt-4o-mini", decomposition_depth=1, instruction=ttlm.LFG_DECOMPOSITION_PROMPT) #Utilize API models to decompose text
# decomposition_method = LFG.decomposition_methods.StructuredDecompositionLocal(model, tokenizer, decomposition_depth=1, chat_template=DECOMPOSITION_PROMT) #Utilize HF models to decompose text

In [4]:
#entailment model is used by some truth methods and statement check methods
model_for_entailment = DebertaForSequenceClassification.from_pretrained('microsoft/deberta-large-mnli').to('cuda:0')
tokenizer_for_entailment = DebertaTokenizer.from_pretrained('microsoft/deberta-large-mnli')

Some weights of the model checkpoint at microsoft/deberta-large-mnli were not used when initializing DebertaForSequenceClassification: ['config']
- This IS expected if you are initializing DebertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DebertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [5]:
#define truth methods 
confidence = ttlm.truth_methods.Confidence() #average log probability of the generated message
lars = ttlm.truth_methods.LARS() #https://arxiv.org/pdf/2406.11278

#define the statement check methods that apply truth methods
qa_generation = LFG.statement_check_methods.QuestionAnswerGeneration(model="gpt-4o-mini", tokenizer=None, num_questions=2, max_answer_trials=2,
                                                                     truth_methods=[confidence, lars], seed=0,
                                                                     instruction=ttlm.LFG_QUESTION_GENERATION_PROMPT, 
                                                                     first_statement_instruction=ttlm.LFG_QUESTION_GENERATION_PROMPT,
                                                                     entailment_model=model_for_entailment, entailment_tokenizer=tokenizer_for_entailment) #HF model and tokenizer can also be used, LM is used to generate question
#some statement check methods are designed directly for this purpose and do not use truth methods
as_entailment = LFG.statement_check_methods.AnswerStatementEntailment( model="gpt-4o-mini", tokenizer=None, 
                                                                      num_questions=3, num_answers_per_question=2, 
                                                                      instruction=ttlm.LFG_QUESTION_GENERATION_PROMPT, 
                                                                      first_statement_instruction=ttlm.LFG_QUESTION_GENERATION_PROMPT,
                                                                      entailment_model=model_for_entailment, entailment_tokenizer=tokenizer_for_entailment) #HF model and tokenizer can also be used, LM is used to generate question

In [14]:
#define a chat history
chat = [{"role": "system", "content": 'You are a helpful assistant. Give brief and precise answers.'},
        {"role": "user", "content": 'Who is Ryan Reynolds?'}]

In [15]:
#generate a message with truth values; this is a wrapper function for model.generate in Hugging Face
output_hf_model = LFG.long_form_generation_with_truth_value(model=model, tokenizer=tokenizer, messages=chat, fact_decomp_method=decomposition_method, 
                                          stmt_check_methods=[qa_generation, as_entailment], generation_seed=0)

Decomposing the generated text...
Applying stement check method  QuestionAnswerGeneration


We detected that you are passing `past_key_values` as a tuple and this is deprecated and will be removed in v4.43. Please use an appropriate `Cache` class (https://huggingface.co/docs/transformers/v4.41.3/en/internal/generation_utils#transformers.Cache)


Applying stement check method  AnswerStatementEntailment


In [16]:
print("Generated Text:\n", output_hf_model['generated_text'])

print("\nStatements:")
for i in range(len(output_hf_model['statements'])):
    print(output_hf_model['statements'][i]) 
    print("     Conf: ", output_hf_model['unnormalized_truth_values'][0][i][0], 
          "   LARS: ", output_hf_model['unnormalized_truth_values'][0][i][1],
          "   AS Ent: ", output_hf_model['unnormalized_truth_values'][1][i])

Generated Text:
  Ryan Reynolds is a Canadian actor and producer. He is known for his charming wit and good looks, and has appeared in a wide range of films, including romantic comedies, action movies, and superhero films. Some of his most notable roles include:

* Green Lantern in the DC Extended Universe
* Deadpool in the X-Men franchise
* Michael Bryce in the action comedy "The Hitman's Bodyguard"
* Hal Jordan in the 2011 film "Green Lantern"
* Andrew Paxton in the romantic comedy "The Proposal"

Ryan Reynolds has received several awards and nominations for his performances, including a Golden Globe Award and a People's Choice Award. He is married to actress Blake Lively and the couple has two daughters together.

Statements:
Ryan Reynolds is a Canadian actor.
     Conf:  -0.0654496248273034    LARS:  0.9798025488853455    AS Ent:  1.0
Ryan Reynolds is a Canadian producer.
     Conf:  0.0    LARS:  0.0    AS Ent:  0.0
Ryan Reynolds is known for his charming wit.
     Conf:  -0.07014

In [17]:
#generate a message with truth values; this is a wrapper function for litellm.completion in LiteLLM
output_api_model = LFG.long_form_generation_with_truth_value(model="gpt-4o-mini", messages=chat, fact_decomp_method=decomposition_method, 
                                          stmt_check_methods=[qa_generation, as_entailment], generation_seed=0, seed=0)


Decomposing the generated text...
Applying stement check method  QuestionAnswerGeneration
Applying stement check method  AnswerStatementEntailment


In [18]:
print("Generated Text:\n", output_api_model['generated_text'])

print("\nStatements:")
for i in range(len(output_api_model['statements'])):
    print(output_api_model['statements'][i]) 
    print("     Conf: ", output_api_model['unnormalized_truth_values'][0][i][0], 
          "   LARS: ", output_api_model['unnormalized_truth_values'][0][i][1],
          "   AS Ent: ", output_api_model['unnormalized_truth_values'][1][i])

Generated Text:
 Ryan Reynolds is a Canadian actor, producer, and entrepreneur, known for his roles in films such as "Deadpool," "The Proposal," and "Free Guy." He is also recognized for his comedic talent and charismatic personality. In addition to his film career, Reynolds has been involved in various business ventures, including his ownership of the gin brand Aviation American Gin.

Statements:
Ryan Reynolds is a Canadian actor.
     Conf:  -0.0006410555669399999    LARS:  0.9929046034812927    AS Ent:  1.0
Ryan Reynolds is a producer.
     Conf:  -0.2420390216741639    LARS:  0.7057953476905823    AS Ent:  1.0
Ryan Reynolds is an entrepreneur.
     Conf:  0.0    LARS:  0.0    AS Ent:  0.6666666666666666
Ryan Reynolds is known for his roles in the film "Deadpool."
     Conf:  -0.23288472235094299    LARS:  0.9801365733146667    AS Ent:  1.0
Ryan Reynolds is known for his roles in the film "The Proposal."
     Conf:  -0.20384988830168754    LARS:  0.41287338733673096    AS Ent:  0.0


### Evaluation on Long Form Generation

We can evaluate truth methods on long-form generation with the `evaluate_truth_method_long_form` function. To obtain correctness labels for the statements, we follow SAFE (https://arxiv.org/pdf/2403.18802). SAFE performs a Google search for each statement and labels it as supported, unsupported, or irrelevant. Available evaluation metrics include AUROC, AUPRC, AUARC, Accuracy, F1, Precision, Recall, and PRR.

Note: Calibrating truth methods before running evaluation is recommended.

In [None]:
#SAFE uses Serper for its Google searches
import os
os.environ['SERPER_API_KEY'] = 'your_serper_api_key' #https://serper.dev/

In [7]:
#create safe object that assigns labels to the statements
safe = LFG.ClaimEvaluator(rater='gpt-4o-mini', tokenizer = None, max_steps = 2, max_retries = 2, num_searches = 2)

#Define metrics
sample_level_eval_metrics = ['f1'] #calculate metric over the statements of a question, then average across all the questions
dataset_level_eval_metrics = ['auroc', 'prr'] #calculate the metric across all statements 

In [8]:
results = LFG.evaluate_truth_method_long_form(dataset='longfact_objects', model='gpt-4o-mini', tokenizer=None,
                                sample_level_eval_metrics=sample_level_eval_metrics, dataset_level_eval_metrics=dataset_level_eval_metrics,
                                fact_decomp_method=decomposition_method, stmt_check_methods=[qa_generation],
                                claim_evaluator = safe, size_of_data=3,  previous_context=[{'role': 'system', 'content': 'You are a helpful assistant. Give precise answers.'}], 
                                user_prompt="Question: {question_context}", seed=41,  return_method_details = False, return_calim_eval_details=False, wandb_run = None,  
                                add_generation_prompt = True, continue_final_message = False)

Loading dataset... Size of data: 3


  0%|          | 0/3 [00:00<?, ?it/s]

Decomposing the generated text...
Applying stement check method  QuestionAnswerGeneration
Checking for claim support by google search...


 33%|███▎      | 1/3 [09:32<19:05, 572.54s/it]

Time ellapsed for google search: 322.5224087238312
Decomposing the generated text...
Applying stement check method  QuestionAnswerGeneration
Checking for claim support by google search...


 67%|██████▋   | 2/3 [12:47<05:50, 350.19s/it]

Time ellapsed for google search: 124.50002813339233
Decomposing the generated text...
Applying stement check method  QuestionAnswerGeneration
Checking for claim support by google search...


100%|██████████| 3/3 [20:04<00:00, 401.53s/it]

Time ellapsed for google search: 241.48697590827942





In [11]:
# stmt_check_methods_0_truth_method_0 : qa_generation + confidence
# stmt_check_methods_0_truth_method_1 : qa_generation + LARS
results['dataset_level_eval_list']

{'stmt_check_methods_0_truth_method_0': {'auroc': 0.477751756440281,
  'prr': -0.05373714329351997},
 'stmt_check_methods_0_truth_method_1': {'auroc': 0.5386416861826697,
  'prr': -0.02268101641869807}}

In [12]:
# stmt_check_methods_0_truth_method_0 : qa_generation + confidence
# stmt_check_methods_0_truth_method_1 : qa_generation + LARS
results['sample_level_eval_list']

{'stmt_check_methods_0_truth_method_0': {'f1': {'values': [0.0, 0.0, 0.0],
   'mean': 0.0,
   'max': 0.0,
   'min': 0.0,
   'std': 0.0}},
 'stmt_check_methods_0_truth_method_1': {'f1': {'values': [0.6363636363636364,
    0.88,
    0.8292682926829268],
   'mean': 0.7818773096821877,
   'max': 0.88,
   'min': 0.6363636363636364,
   'std': 0.10495744653213784}}}