In [1]:
!pip install aimon langchain langchain-community openai tiktoken --quiet

In [2]:
from aimon import Client

# Create an AIMon client

This creates the AIMon client used for all later operations in this notebook, both offline evaluation and continuous monitoring of production applications.

In [3]:
# Create the AIMon client. You need an API key, which you can retrieve from your user profile in the UI.
import os
am_api_key = os.getenv("AIMON_API_KEY")
aimon_client = Client(auth_header="Bearer {}".format(am_api_key))
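
If `AIMON_API_KEY` is not set, `os.getenv` returns `None` and the header above silently becomes the string `"Bearer None"`. A minimal sketch of a guard that fails early instead; `build_auth_header` is a hypothetical helper introduced here for illustration and is not part of the AIMon SDK.

```python
import os

def build_auth_header(api_key):
    """Return a Bearer auth header, failing fast when the key is missing."""
    if not api_key or not api_key.strip():
        raise ValueError("AIMON_API_KEY is not set; export it before creating the client.")
    return "Bearer {}".format(api_key.strip())

# Hypothetical usage: pass the result to Client(auth_header=...)
# auth_header = build_auth_header(os.getenv("AIMON_API_KEY"))
```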

# Generative AI Model

A model here is the generative model that powers your application.

In [4]:
# Pick from the existing model types in your company. These are created by you or other members of your organization.
# The AIMon client has a convenience function to easily retrieve this.
list_model_types = aimon_client.models.list()
print(list_model_types)

['GPT4', 'GPT-4', 'GPT-4o', 'Llama-2', 'Llama3', 'model_type', 'string']


In [5]:
# Using the AIMon client, create or get a model for a given model type. 
# This API will automatically create a new model if it does not exist.
my_model = aimon_client.models.create(
    name="my_gpt4_model_fine_tuned", 
    type="GPT-4", 
    description="This model is a GPT4 based model and is fine tuned on the awesome_finetuning dataset", 
    metadata={"model_s3_location":"s3://bucket/key"}
)

# LLM application

This step creates or retrieves an application that uses the model above. Applications are versioned: each version of an application is associated with a particular model. When you use a different model for the same application, AIMon automatically increments the application's version.

In [6]:
# Using the AIMon client, create or get an existing application
new_app = aimon_client.applications.create(
    name="my_application_sept_4_2024", 
    model_name=my_model.name, 
    stage="evaluation", 
    type="summarization", 
    metadata={"application_url": "https://acme.com/summarization"}
)

### Core LLM Application code

The example below uses LangChain to summarize documents with OpenAI.

In [7]:
openai_api_key = os.getenv("OPENAI_API_KEY")

In [8]:
# LangChain app example
from langchain.text_splitter import CharacterTextSplitter
from langchain.docstore.document import Document
from langchain.llms.openai import OpenAI
from langchain.chains.summarize import load_summarize_chain

def run_application(new_app, source_text, prompt=None, user_query=None, eval_run=None):
    # Split the source text
    text_splitter = CharacterTextSplitter()
    texts = text_splitter.split_text(source_text)
    
    # Create Document objects for the texts
    docs = [Document(page_content=t) for t in texts[:3]]
    
    # Initialize the OpenAI module, load and run the summarize chain
    llm = OpenAI(temperature=0, openai_api_key=openai_api_key)
    chain = load_summarize_chain(llm, chain_type="map_reduce")
    summary = chain.run(docs)

    payload = {
        "application_id": new_app.id,
        "version": new_app.version,
        "context_docs": [d.page_content for d in docs],
        "output": summary,
        "actual_request_timestamp": "09/04/2024, 13:32:16"
    }
    if prompt:
        payload["prompt"] = prompt
    if user_query:
        payload["user_query"] = user_query
    if eval_run is None and new_app.stage == 'evaluation':
        raise Exception("Evaluation and run ID missing for an application that is in evaluation mode.")
    if eval_run is not None:
        payload["evaluation_id"] = eval_run.evaluation_id
        payload["evaluation_run_id"] = eval_run.id
    # print(payload)
    # Analyze quality of the generated output using AIMon
    res = aimon_client.analyze.create(
        body=[payload]
    )
    print("Aimon response: {}\n".format(res))
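
The payload in `run_application` carries a hard-coded `actual_request_timestamp`. A minimal sketch of assembling the same dictionary with the current time instead; `build_payload` is a hypothetical helper, and the timestamp layout is only inferred from the example value in this notebook, so treat it as an assumption rather than a documented requirement.

```python
from datetime import datetime

def build_payload(application_id, version, context_docs, output,
                  prompt=None, user_query=None, now=None):
    """Assemble the analysis payload, stamping it with the request time."""
    now = now or datetime.now()
    payload = {
        "application_id": application_id,
        "version": version,
        "context_docs": list(context_docs),
        "output": output,
        # Assumed layout "MM/DD/YYYY, HH:MM:SS", copied from the notebook's example value.
        "actual_request_timestamp": now.strftime("%m/%d/%Y, %H:%M:%S"),
    }
    # Optional fields are only included when provided, as in run_application.
    if prompt:
        payload["prompt"] = prompt
    if user_query:
        payload["user_query"] = user_query
    return payload
```

Passing `now` explicitly keeps the helper deterministic in tests; in the application you would omit it.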

# Evaluation of the LLM Application

Before deploying the application to production, it is a good idea to test it end to end with either a curated golden dataset or a snapshot of production traffic. This section demonstrates how AIMon can help you run these tests.

### Evaluation Dataset

The dataset should be a CSV file with these columns:
 * "prompt": The prompt used for the LLM.
 * "user_query": The query specified by the user.
 * "context_docs": Context documents, either retrieved by a RAG pipeline or obtained through other methods. For tasks like summarization, the user can specify these documents directly.
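
As an illustration of the column layout above, here is a minimal sketch that writes such a CSV with the standard library. The row contents and the `write_dataset` helper are made up for this example. Each row stores a single context document as plain text, because this notebook does not specify how multiple documents should be encoded in one cell.

```python
import csv
import io

COLUMNS = ["prompt", "user_query", "context_docs"]

def write_dataset(rows, fileobj):
    """Write rows (dicts keyed by COLUMNS) as CSV with a header line."""
    writer = csv.DictWriter(fileobj, fieldnames=COLUMNS)
    writer.writeheader()
    for row in rows:
        writer.writerow({name: row[name] for name in COLUMNS})

# Illustrative data only; replace with your golden dataset or a traffic snapshot.
example_rows = [
    {
        "prompt": "Summarize the document.",
        "user_query": "What is the main point?",
        "context_docs": "LLM hallucinations are fluent, confident, but incorrect outputs.",
    }
]

buffer = io.StringIO()
write_dataset(example_rows, buffer)
csv_text = buffer.getvalue()
```

To produce a file on disk, open it with `open(path, "w", newline="")` and pass the handle in place of `buffer`.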

In [9]:
import json
# Create new datasets
file_path1 = "./test_evaluation_dataset_aug_2024_1.csv"
file_path2 = "./test_evaluation_dataset_aug_2024_2.csv"

dataset_data_1 = json.dumps({
    "name": "test_evaluation_dataset_1.csv",
    "description": "This is one custom dataset"
})

dataset_data_2 = json.dumps({
    "name": "test_evaluation_dataset_2.csv",
    "description": "This is another custom dataset"
})

with open(file_path1, 'rb') as file1:
    dataset1 = aimon_client.datasets.create(
        file=file1,
        json_data=dataset_data_1
    )

with open(file_path2, 'rb') as file2:
    dataset2 = aimon_client.datasets.create(
        file=file2,
        json_data=dataset_data_2
    )


In [10]:
dataset1 = aimon_client.datasets.list(name="test_evaluation_dataset_1.csv")
dataset2 = aimon_client.datasets.list(name="test_evaluation_dataset_2.csv")

### Dataset Collection

You can define a collection of evaluation datasets for ease of use. 

In [11]:
# Create a new dataset collection
dataset_collection = aimon_client.datasets.collection.create(
    name="my_first_dataset_collection_aug_9_2024_5_12", 
    dataset_ids=[dataset1.sha, dataset2.sha], 
    description="This is a collection of two datasets."
)

### Evaluation

An evaluation is associated with a dataset collection and an application.

In [12]:
# Using the AIMon client, create a new evaluation
evaluation = aimon_client.evaluations.create(
    name="offline_evaluation_fine_tuned_model", 
    application_id=new_app.id, 
    model_id=my_model.id, 
    dataset_collection_id=dataset_collection.id
)


### New Run
A "run" is an instance of an evaluation that you would like to track metrics against. You can have multiple runs of the same evaluation. This is typically done in a CI/CD context, where the same evaluation runs at regular intervals. Since LLMs are probabilistic in nature, they can produce different outputs for the same query and context, so it is a good idea to run evaluations regularly to understand how much the outputs of your LLMs vary. In addition, runs let you choose different metrics for each run.

Metrics are specified using the `metrics_config` parameter in the format shown below. The keys indicate the type of metric computed and the values are the specific algorithms used to compute those metrics. For most cases, we recommend using the `default` algorithm.

Tags allow you to specify metadata like the application commit SHA or other key-value pairs that you want to insert for analytics purposes.
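
A small sketch of building the `metrics_config` dictionary from a list of metric names, which catches misspelled keys before a run is created. `NOTEBOOK_METRICS` only lists the metric names used in this notebook and is not an exhaustive list of what AIMon supports; `make_metrics_config` is a hypothetical helper.

```python
# Metric names that appear in this notebook; extend as needed (assumption, not an official list).
NOTEBOOK_METRICS = ("hallucination", "toxicity", "conciseness", "completeness")

def make_metrics_config(names, algorithm="default", allowed=NOTEBOOK_METRICS):
    """Map each metric name to an algorithm, rejecting names outside `allowed`."""
    unknown = [name for name in names if name not in allowed]
    if unknown:
        raise ValueError("Unknown metric name(s): {}".format(", ".join(unknown)))
    return {name: algorithm for name in names}

metrics_config = make_metrics_config(["hallucination", "toxicity"])
```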

In [13]:
# Using the AIMon client, create a new evaluation run. 
eval_run = aimon_client.evaluations.run.create(
    evaluation_id=evaluation.id, 
    metrics_config={'hallucination': 'default', 'toxicity': 'default', 'conciseness': 'default', 'completeness': 'default'},
)

In [14]:
eval_run.evaluation_id

61

In [15]:
# Get all records from the datasets
dataset_collection_records = []
for dataset_id in dataset_collection.dataset_ids:
    dataset_records = aimon_client.datasets.records.list(sha=dataset_id)
    dataset_collection_records.extend(dataset_records)

# Run the application on all the records in the dataset collection. 
for record in dataset_collection_records:
    # Run the application code
    run_application(new_app, record['context_docs'], record['prompt'], record['user_query'], eval_run=eval_run)
    
# You can view metrics for your application on the UI: https://www.app.aimon.ai/llmapps


Aimon response: AnalyzeCreateResponse(message='Data successfully sent to AIMon.', status=200)

Aimon response: AnalyzeCreateResponse(message='Data successfully sent to AIMon.', status=200)

Aimon response: AnalyzeCreateResponse(message='Data successfully sent to AIMon.', status=200)

Aimon response: AnalyzeCreateResponse(message='Data successfully sent to AIMon.', status=200)

Aimon response: AnalyzeCreateResponse(message='Data successfully sent to AIMon.', status=200)

Aimon response: AnalyzeCreateResponse(message='Data successfully sent to AIMon.', status=200)

Aimon response: AnalyzeCreateResponse(message='Data successfully sent to AIMon.', status=200)

Aimon response: AnalyzeCreateResponse(message='Data successfully sent to AIMon.', status=200)



# Metrics

In [16]:
# Get metrics by application
app_metrics = aimon_client.applications.evaluations.metrics.retrieve(application_name=new_app.name)

# Get metrics by evaluation
eval_metrics = aimon_client.applications.evaluations.metrics.get_evaluation_metrics(evaluation_id=evaluation.id, application_name=new_app.name)

# #Get metrics by evaluation run
# eval_run_metrics = aimon_client.applications.evaluations.metrics.get_evaluation_run_metrics(evaluation_id=40, evaluation_run_id=268, application_name="my_first_dataset_collection_aug_9_2024_5_12")
# quality_metrics = eval_run_metrics.evaluations[0].quality_metrics

print(f"Quality Metrics: {eval_metrics.evaluations[0]}")

Quality Metrics: Evaluation(metric_name=None, timestamp=None, value=None, avg_completeness_score=0.8942857142857142, avg_conciseness_score=0.8360000000000001, avg_context_doc_length=363.14285714285717, avg_hallucination_score=0.04690142857142857, avg_instruction_adherence_score=0.0, avg_toxicity_scores={'identity_hate': 0.11006870120763779, 'insult': 0.30732529716832296, 'obscene': 0.28974676557949613, 'severe_toxic': 0.05291888649974551, 'threat': 0.12063558399677277, 'toxic': 0.11930480173655919})
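
In a CI/CD setting, averaged scores like the ones printed above can gate a deployment. The sketch below is one way to do that under stated assumptions: the thresholds are illustrative, a lower hallucination score is assumed to be better, and higher completeness and conciseness scores are assumed to be better. `failed_checks` is a hypothetical helper, and `SimpleNamespace` stands in for the returned `Evaluation` object, reusing its attribute names.

```python
from types import SimpleNamespace

def failed_checks(evaluation, max_hallucination=0.1,
                  min_completeness=0.8, min_conciseness=0.8):
    """Return the names of the metrics that miss their (illustrative) thresholds."""
    failures = []
    if evaluation.avg_hallucination_score > max_hallucination:
        failures.append("hallucination")
    if evaluation.avg_completeness_score < min_completeness:
        failures.append("completeness")
    if evaluation.avg_conciseness_score < min_conciseness:
        failures.append("conciseness")
    return failures

# Stand-in for eval_metrics.evaluations[0], using rounded values from the output above.
sample = SimpleNamespace(
    avg_hallucination_score=0.0469,
    avg_completeness_score=0.8943,
    avg_conciseness_score=0.836,
)
print(failed_checks(sample))  # prints []
```

A CI job could exit with a non-zero status whenever the returned list is non-empty.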


# Production

Once your evaluations have given you enough confidence in the application, you can deploy it to production. AIMon lets you continuously monitor the application in production for the configured metrics.

In [9]:
# Using the AIMon client, create or get an existing application
new_app_prod = aimon_client.applications.create(
    name="my_application_sept_4_2024", 
    model_name=my_model.name, 
    stage="production", 
    type="summarization", 
    metadata={"application_url": "https://acme.com/summarization"}
)

In [12]:
source_text = """
Large Language Models (LLMs) have become integral to automating and enhancing various business processes. 
However, a significant challenge these models face is the concept of \"hallucinations\" - outputs that, 
although fluent and confident, are factually incorrect or nonsensical. For enterprises relying on AI 
for decision-making, content creation, or customer service, these hallucinations can undermine credibility, 
spread misinformation, and disrupt operations. Recently, AirCanada lost a court case due to hallucinations 
in its chatbot [7]. Also, the 2024 Edelman Trust Barometer reported a drop in trust in AI companies from 
61% to 53% compared to 90% 8 years ago [8]. Recognizing the urgency of the issue, we have developed a 
state-of-the-art system designed for both offline and online detection of hallucinations, ensuring higher 
reliability and trustworthiness in LLM outputs.
"""

In [None]:
run_application(new_app_prod, source_text, prompt="LangChain based summarization of documents")

# Delete Application

In [20]:
del_resp = aimon_client.applications.delete(
    name="my_application_sept_4_2024", 
    version="0", 
    stage="evaluation"
)

In [21]:
del_resp.message

'Application my_application_sept_4_2024 deleted successfully.'