# Supported External Tools

Supported tools: AISI Inspect AI, LLM Comparator, and LIT (Learning Interpretability Tool).

This tutorial does not aim to teach these tools themselves, but if you are already familiar with them, you can use them with FAID easily.


## AISI Inspect AI


In [None]:
# Run the inspect eval command
#!inspect eval demo-inspectai.py --model azureai/Phi-3-5-mini-instruct-xbafx
# To call the model with environment variables see this documentation: https://inspect.ai-safety-institute.org.uk/models.html

In [1]:
import sys
sys.path.append('../../')

In [None]:
from faid import logging as faidlog
faidlog.init()

In [2]:
# Copy the JSON file log location
aisi_log_path = "./logs/2024-10-03T17-39-58+01-00_winogrande_R6ZmSsDFpRfPx6ubJCd5d7.json"

![AISI Inspect AI Tool Screenshot](./docs/media/aisi_inspectai.png)
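
Copying the path by hand becomes error-prone once the log directory holds several runs. The following is a minimal sketch, assuming Inspect writes its JSON logs into `./logs` as in the cell above; `latest_log` is a hypothetical helper and not part of FAID or Inspect.

```python
from pathlib import Path
from typing import Optional


def latest_log(log_dir: str = "./logs", pattern: str = "*.json") -> Optional[str]:
    """Return the most recently modified file matching `pattern`, or None."""
    directory = Path(log_dir)
    if not directory.is_dir():
        return None
    # Sort by modification time so the newest run comes last
    candidates = sorted(directory.glob(pattern), key=lambda p: p.stat().st_mtime)
    return str(candidates[-1]) if candidates else None


# Hypothetical usage: returns None when ./logs is missing or empty
# aisi_log_path = latest_log()
```

Selecting by modification time is only one possible choice; if you run several evaluations in parallel, picking the file by task name may be safer.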

In [5]:
from faid.logging import pretty_aisi_summary
aisi_sum = pretty_aisi_summary(aisi_log_path)
aisi_sum

{'name': 'R6ZmSsDFpRfPx6ubJCd5d7',
 'description': '{\'name\': \'plan\', \'steps\': [{\'solver\': \'system_message\', \'params\': {\'template\': "The following are multiple choice questions, with answers on the best logical completion to replace [BLANK] by A or B.\\n\\nSentence: The phone of Donald is a lot better than Adam\'s because [BLANK] paid extra for his phone.\\nA) Donald\\nB) Adam\\nANSWER: A\\n\\nSentence: Dennis was buying more books while Donald was buying more video games because [BLANK] was more studious.\\nA) Dennis\\nB) Donald\\nANSWER: A\\n\\nSentence: Jessica sneezed more than Carrie was sneezing because there was more dust in the room of [BLANK] .\\nA) Jessica\\nB) Carrie\\nANSWER: A\\n\\nSentence: When it comes to travel, Eric likes to ride a bicycle, but William uses a car. This is due to [BLANK] being environmentally conscious.\\nA) Eric\\nB) William\\nANSWER: A\\n\\nSentence: The grip of the goalkeeper couldn\'t save the ball shot from entering the net. The [BLAN
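
The `description` field above holds the whole evaluation plan as one long string, which makes the summary hard to scan. A small sketch of a display helper follows; `preview` is a hypothetical name, and the example dictionary only imitates the two keys visible in the output above.

```python
def preview(summary: dict, max_len: int = 80) -> dict:
    """Return a copy of `summary` with long values truncated for display."""
    short = {}
    for key, value in summary.items():
        text = value if isinstance(value, str) else repr(value)
        short[key] = text if len(text) <= max_len else text[:max_len] + "..."
    return short


# Illustrative input; the real dictionary comes from pretty_aisi_summary(aisi_log_path)
example = {"name": "R6ZmSsDFpRfPx6ubJCd5d7", "description": "x" * 500}
print(preview(example))
```

The helper only affects what is printed; the full dictionary is still what you would pass on to the fairness log.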

In [8]:
# Now you can use this dictionary to add the results to your own fairness logs
fairness_context = faidlog.ExperimentContext(name="winogrande")
fairness_context.add_model_entry(key="aisi-results", entry=aisi_sum)

Added aisi-results to project metadata under ['model'] and log updated


## LLM Comparator

Now we will use LLM Comparator, and potentially other model comparison approaches, to generate a fairness report based on individual samples.

In [1]:
from faid.report.llm_comparator import LLMComparator

llm_comparator = LLMComparator()

In [2]:
# You can use .jsonl or .csv model outputs to generate comparison files compatible with LLM Comparator.
comparison_result = llm_comparator.create_comparison_json(
    'data/llm-comparison/example_llm1.jsonl', 'data/llm-comparison/example_llm2.jsonl', query_key="question", response_key="answer")
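
The call above reads one output file per model. As a rough sketch of what such input might look like, the block below assumes each line of a `.jsonl` file is a JSON object holding the prompt under `question` and the model output under `answer`, matching the `query_key` and `response_key` arguments. The file names, the sample rows, and the helpers `write_jsonl` and `read_jsonl` are made up for illustration.

```python
import json
import os
import tempfile

# Made-up responses from two models to the same question
rows_a = [{"question": "What is 2 + 2?", "answer": "4"}]
rows_b = [{"question": "What is 2 + 2?", "answer": "The answer is 4."}]


def write_jsonl(path: str, rows: list) -> None:
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")


def read_jsonl(path: str) -> list:
    """Read a JSONL file back into a list of dictionaries."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


out_dir = tempfile.mkdtemp()
path_a = os.path.join(out_dir, "llm1.jsonl")
path_b = os.path.join(out_dir, "llm2.jsonl")
write_jsonl(path_a, rows_a)
write_jsonl(path_b, rows_b)

# Assumption: responses are paired by question, so both files should list the same questions
questions_a = [r["question"] for r in read_jsonl(path_a)]
questions_b = [r["question"] for r in read_jsonl(path_b)]
print(questions_a == questions_b)
```

Checking that both files cover the same questions before building the comparison can catch mismatched exports early.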

In [3]:
file_path = llm_comparator.write(comparison_result, 'data/llm-comparison/example_comparison_result.json')

In [4]:
username = "asabuncuoglu13"
repository = "faid"
branch = "main"
online_path = f"https://raw.githubusercontent.com/{username}/{repository}/refs/heads/{branch}/{file_path}"
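
The raw URL only resolves if `file_path` is relative to the repository root and the file has been committed and pushed to the chosen branch. A minimal sketch of a helper that normalises the path before building the URL is shown below; `build_raw_url` is a hypothetical function, not part of FAID.

```python
import posixpath


def build_raw_url(username: str, repository: str, branch: str, file_path: str) -> str:
    """Build a raw.githubusercontent.com URL for a file tracked in the repository."""
    # Normalise "./data/x.json" or "data//x.json" to "data/x.json"
    relative = posixpath.normpath(file_path.replace("\\", "/")).lstrip("/")
    return (
        f"https://raw.githubusercontent.com/{username}/{repository}"
        f"/refs/heads/{branch}/{relative}"
    )


print(build_raw_url("asabuncuoglu13", "faid", "main",
                    "./data/llm-comparison/example_comparison_result.json"))
```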

In [5]:
llm_comparator.show_in_llm_comparator(online_path)

## LIT: Learning Interpretability Tool

LIT is a general-purpose interpretability tool for exploring model behaviour and decisions in an interactive environment. It is well suited to interpretability experiments, but it can become slow when you load bulk data and large models, so it is better to work with a data partition filtered to the needs of the experiment. You can track this process with FAID.

In [1]:
from faid.faidlog import faidlog

experiment_name = "financial-sentiment-analysis-finbert-fairness"
ctx = faidlog.ExperimentContext(name=experiment_name)
# Let's say we want to explore only false positive samples
sample_data = ctx.get_sample_data_entry()
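
To make the idea of a filtered partition concrete, here is a sketch of how false positive samples might be selected before they are logged. It treats a false positive as a record predicted `Positive` whose true label is something else. The `Prediction` field, the sample records, and the `false_positives` helper are assumptions for illustration; only `Summary` and `Sentiment` appear in this tutorial.

```python
def false_positives(records: list, positive_label: str = "Positive") -> list:
    """Keep records predicted as `positive_label` whose true label differs."""
    return [
        r for r in records
        if r["Prediction"] == positive_label and r["Sentiment"] != positive_label
    ]


# Made-up records; "Prediction" is a hypothetical column holding the model output
records = [
    {"Summary": "made-up headline A", "Sentiment": "Negative", "Prediction": "Positive"},
    {"Summary": "made-up headline B", "Sentiment": "Positive", "Prediction": "Positive"},
    {"Summary": "made-up headline C", "Sentiment": "Neutral", "Prediction": "Neutral"},
]
fps = false_positives(records)
print(len(fps))
```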

In [5]:
from lit_nlp.api import model as lit_model
from lit_nlp.api import dataset as lit_dataset
from lit_nlp.api import types as lit_types
from lit_nlp import notebook
from transformers import BertTokenizer, BertForSequenceClassification, BertConfig, pipeline
import pandas as pd
import torch

In [21]:
sample_df = pd.DataFrame(sample_data["fps"])
sample_df = sample_df.rename(columns={"Summary": "text", "Sentiment": "sentiment"})
sample_df.head()

Unnamed: 0,text,sentiment
0,consumer spending plunges 13.6 percent in Apri...,Negative
1,RBI governor announces measures to help econom...,Positive


You'll need to load the FinBERT model within a custom model class that LIT can recognize.

In [22]:
class FinBERTModel(lit_model.Model):
    """A wrapper for FinBERT to work with LIT."""

    def __init__(self):
        # Load the FinBERT model, tokenizer and config once at construction time
        self.model_name = "yiyanghkust/finbert-tone"
        self.model = self._load_model()
        self.tokenizer = self._load_tokenizer()
        self.config = BertConfig.from_pretrained(self.model_name)
        # Label names in the order of the model's output logits, taken from the
        # config so that the score vector and the vocab cannot drift apart
        self.labels = [self.config.id2label[i] for i in range(len(self.config.id2label))]

    def _load_model(self):
        return BertForSequenceClassification.from_pretrained(self.model_name, num_labels=3)

    def _load_tokenizer(self):
        return BertTokenizer.from_pretrained(self.model_name)

    def input_spec(self) -> lit_types.Spec:
        return {
            "text": lit_types.TextSegment()
        }

    def output_spec(self) -> lit_types.Spec:
        return {
            "score": lit_types.MulticlassPreds(vocab=self.labels, parent="label"),
            "label": lit_types.CategoryLabel(vocab=self.labels),
        }

    def predict(self, inputs):
        # A transformers `pipeline("text-classification", ...)` would also work;
        # here the model is called directly, one example at a time
        results = []
        for example in inputs:
            with torch.no_grad():
                # Truncate long texts to BERT's 512-token limit
                encoded_input = self.tokenizer(
                    example["text"], truncation=True, max_length=512, return_tensors="pt")
                output = self.model(**encoded_input)
                probs = torch.softmax(output["logits"], dim=1)[0]
                label = self.labels[torch.argmax(probs).item()]
            results.append({
                "score": probs.tolist(),
                "label": label
            })
        return results

# Instantiate the FinBERT model
model = FinBERTModel()
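
The `predict` method converts the model's logits into probabilities with softmax and reports the label with the highest probability. The following pure-Python sketch walks through that step on made-up logits. The `id2label` mapping shown here is an assumption and should be checked against the loaded model config, because the order of the score vector has to match the vocab passed to LIT.

```python
import math


def softmax(logits: list) -> list:
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Assumed label order; verify it against config.id2label of the loaded model
id2label = {0: "Neutral", 1: "Positive", 2: "Negative"}

logits = [0.2, 2.5, -1.0]  # made-up logits for one sentence
probs = softmax(logits)
label = id2label[probs.index(max(probs))]
print(label, [round(p, 3) for p in probs])
```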



In [24]:
class FinDataset(lit_dataset.Dataset):
  """Loader for the filtered financial sentiment samples in sample_df."""

  TEXT_COLUMN = "text"
  TARGET_COLUMN = "sentiment"

  def __init__(self):
    # One dictionary per row, keyed by column name
    self._examples = sample_df.to_dict(orient="records")

  def spec(self) -> lit_types.Spec:
    return {
      'text': lit_types.TextSegment(),
      'sentiment': lit_types.CategoryLabel(vocab=["Positive", "Neutral", "Negative"])
    }
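
Before handing the records to LIT, it can help to confirm that every row has text and a label from the declared vocab. A minimal sketch follows, assuming the records are dictionaries with `text` and `sentiment` keys as in the dataset spec; `invalid_records` and the sample rows are hypothetical.

```python
VOCAB = ["Positive", "Neutral", "Negative"]


def invalid_records(records: list, vocab: list = VOCAB) -> list:
    """Return records with empty text or a label outside `vocab`."""
    return [
        r for r in records
        if not str(r.get("text", "")).strip() or r.get("sentiment") not in vocab
    ]


# Made-up rows: the second one uses a label that is not in the vocab
rows = [
    {"text": "made-up headline A", "sentiment": "Negative"},
    {"text": "made-up headline B", "sentiment": "Mixed"},
]
print(invalid_records(rows))
```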

In [26]:
dataset = FinDataset()
model = FinBERTModel()

# Create the LIT widget
lit_widget = notebook.LitWidget(models={"model": model}, datasets={"data": dataset}, port=8892)
#lit_widget.render()


