**Disclaimer**. This example uses the Evidently API as available in version 0.6.7 or lower. Please ensure you are using the correct version when running this notebook. For updated and new examples using the latest Evidently versions, visit our documentation. 

Evidently docs: https://docs.evidentlyai.com/

Join our Discord: https://discord.com/invite/xZjKRaNp8b

In [None]:
!pip install "evidently[llm]<=0.6.7"

In [None]:
import pandas as pd

In [None]:
from evidently import ColumnMapping
from evidently.test_suite import TestSuite
from evidently.descriptors import *
from evidently.tests import *

# Semantic similarity

## Mock data

Create a mock dataset with typical questions and reference answers:

In [None]:
data = [
    ["Hi there, how do I reset my password?", "To reset your password, click on 'Forgot Password' on the login page and follow the instructions sent to your registered email."],
    ["Where can I find my transaction history?", "You can view your transaction history by logging into your account and navigating to the 'Transaction History' section."],
    ["How do I add another user to my account?", "I am afraid it is not currently possible to add multiple users to the account."],
    ["Is it possible to link multiple bank accounts?", "Yes, you can link multiple bank accounts by going to 'Account Settings' and selecting 'Add Bank Account'."],
    ["Can I use your service for cryptocurrency transactions?", "Currently, our service does not support cryptocurrency transactions. Please check our website for updates regarding this feature."],
    ["Hi, can I get a detailed report of my monthly transactions?", "Yes, you can generate a detailed monthly report of your transactions by logging into your account, going to 'Transaction History', and selecting 'Generate Report'."],
    ["I am traveling to the US. Can I use the app there?", "Yes, you can use the app in the US just like you do at home. Ensure you have an internet connection."],
    ["How do I link my payment account to a new mobile number?", "To link a new mobile number, log in to your account, go to 'Account Settings', select 'Mobile Number', and follow the instructions to verify your new number."],
    ["Can I receive notifications for transactions in real-time?", "Yes, you can enable real-time notifications for transactions by going to 'Account Settings', then 'Notifications', and turning on 'Transaction Alerts'."],
    ["Hey, can I set up automatic transfers to my savings account?", "Yes, you can set up automatic transfers by going to 'Account Settings', selecting 'Automatic Transfers', and specifying the amount and frequency."],
]

columns = ["question", "target_response"]

typical_questions_golden = pd.DataFrame(data, columns=columns)

In [None]:
pd.set_option('display.max_colwidth', None)

Preview the dataset:

In [None]:
typical_questions_golden.head(10)

Let's imagine we generated a new set of answers (e.g., with a different prompt) to the same questions. We add a "new_response" column to the dataset. To generate the complete dataset:

In [None]:
data = [
    ["Hi there, how do I reset my password?", "To reset your password, click on 'Forgot Password' on the login page and follow the instructions sent to your registered email.", "To change your password, select 'Forgot Password' on the login screen and follow the steps sent to your registered email address."],
    ["Where can I find my transaction history?", "You can view your transaction history by logging into your account and navigating to the 'Transaction History' section.", "Log into your account and go to 'Transaction History' to see all your past transactions."],
    ["How do I add another user to my account?", "I am afraid it is not currently possible to add multiple users to the account.", "To add a secondary user, go to 'Account Settings', select 'Manage Users', and enter the details of the person you want to add. You can set permissions for their access."],
    ["Is it possible to link multiple bank accounts?", "Yes, you can link multiple bank accounts by going to 'Account Settings' and selecting 'Add Bank Account'.", "You can add multiple bank accounts by visiting 'Account Settings' and choosing 'Add Bank Account'."],
    ["Can I use your service for cryptocurrency transactions?", "Currently, our service does not support cryptocurrency transactions. Please check our website for updates regarding this feature.", "At the moment, cryptocurrency transactions are not supported by our service. Keep an eye on our website for future updates."],
    ["Hi, can I get a detailed report of my monthly transactions?", "Yes, you can generate a detailed monthly report of your transactions by logging into your account, going to 'Transaction History', and selecting 'Generate Report'.", "You can get a detailed monthly report by logging into your account, navigating to 'Transaction History', and clicking on 'Generate Report'."],
    ["I am traveling to the US. Can I use the app there?", "Yes, you can use the app in the US just like you do at home. Ensure you have an internet connection.", "The app will work in the US without any issues. Just make sure you have access to the internet."],
    ["How do I link my payment account to a new mobile number?", "To link a new mobile number, log in to your account, go to 'Account Settings', select 'Mobile Number', and follow the instructions to verify your new number.", "To add a new number, navigate to the 'Account Settings' section, select 'Mobile Number' and proceed with the steps to add and confirm the new number"],
    ["Can I receive notifications for transactions in real-time?", "Yes, you can enable real-time notifications for transactions by going to 'Account Settings', then 'Notifications', and turning on 'Transaction Alerts'.", "To see your transacation, log into your account and navigate to the 'Transaction History' section."],
    ["Hey, can I set up automatic transfers to my savings account?", "Yes, you can set up automatic transfers by going to 'Account Settings', selecting 'Automatic Transfers', and specifying the amount and frequency.", "You can arrange automatic transfers by going to 'Account Settings', choosing 'Automatic Transfers', and setting the desired amount and frequency."],
]

columns = ["question", "target_response", "new_response"]

typical_questions = pd.DataFrame(data, columns=columns)

Preview the dataset:

In [None]:
typical_questions.head(10)

## Example 1

Let's compare new generations for typical questions against the golden set of answers using Semantic Similarity. We expect all answers to be similar (with a similarity of 0.9 or higher).

In [None]:
test_suite = TestSuite(tests=[
    TestColumnValueMin(
        column_name=SemanticSimilarity(
        display_name="Response Similarity",
        with_column="target_response").
        on("new_response"),
        gte=0.9),
])

test_suite.run(reference_data=None,
               current_data=typical_questions)
test_suite

Explore the dataset with the added Semantic Similarity score:

In [None]:
test_suite.datasets().current

Select only the examples where Semantic Similarity is < 0.9:

In [None]:
df = pd.DataFrame(test_suite.datasets().current)

filtered_df = df[df['Response Similarity'] < 0.9]

filtered_df.head()
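
To build intuition for the score, here is a minimal sketch of cosine similarity, a measure commonly used to compare text embeddings. It is not Evidently's implementation: the `SemanticSimilarity` descriptor relies on an embedding model, while this sketch substitutes plain word counts so that it runs with the standard library only. The helpers `word_counts` and `cosine_similarity` and the sample strings are hypothetical.

```python
import math
from collections import Counter


def word_counts(text):
    """Toy stand-in for an embedding (assumption): lowercase word counts."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


golden = "I am afraid it is not currently possible to add multiple users"
same = "I am afraid it is not currently possible to add multiple users"
unrelated = "Visit settings and choose manage"

score_same = cosine_similarity(word_counts(golden), word_counts(same))
score_unrelated = cosine_similarity(word_counts(golden), word_counts(unrelated))
print(round(score_same, 3), round(score_unrelated, 3))
```

Word counts only reward exact word overlap, so a faithful paraphrase would score low here. An embedding model is meant to capture meaning instead, which is why the notebook uses a model-based descriptor.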

## Example 2

Let's consider another set of test cases: these are example questions where our chatbot should escalate to a human support agent.

In [None]:
data = [
    ["There is a transaction in my history that I did not make. What should I do?", "Hi there! Transferring you to a support agent to help.", "Hi there! Transferring you to a support agent to help."],
    ["My account was charged twice for the same transaction. How do I get a refund?", "Hello! I'll transfer you to a support agent who can assist with this issue.", "Hello! I'll transfer you to a support agent who can assist with this issue."],
    ["I think my account has been hacked. What should I do?", "Hi there! I'll connect you to a support agent immediately to secure your account.", "Hi there! I'll connect you to a support agent immediately to secure your account."],
    ["I have an ongoing subscription, why was my recent payment to a merchant declined?", "Hello! Let me transfer you to a support agent who can look into this for you.", "Your payment might have been declined due to insufficient funds, an expired card, or incorrect billing information. Please check your payment details and try again."],
    ["I received a notification of a large withdrawal that I didn't authorize. Can you help?", "Hi! I'll transfer you to a support agent who can help with this urgent matter.", "Hi! I'll transfer you to a support agent who can help with this urgent matter."],
    ["Why was my account suddenly locked?", "Hello! Let me connect you to a support agent to find out more and help you unlock your account.", "Hello! Let me connect you to a support agent to find out more and help you unlock your account."],
    ["I am having issues verifying my identity. What should I do?", "Hi! I'll transfer you to a support agent who can assist you with the verification process.", "Hi! I'll transfer you to a support agent who can assist you with the verification process."],
    ["My recent transfer is still pending after several days. Can you check it for me?", "Hi there! I'll connect you to a support agent who can investigate this issue.", "Hi there! I'll connect you to a support agent who can investigate this issue."],
    ["I received an email about suspicious activity on my account. Is it legitimate?", "Hello! Let me transfer you to a support agent to verify the email and secure your account.", "Hello! Let me transfer you to a support agent to verify the email and secure your account."],
    ["I need to dispute a charge on my statement. How do I proceed?", "Hi there! I'll connect you to a support agent who can help you with the dispute process.", "Hi there! I'll connect you to a support agent who can help you with the dispute process."],
]

columns = ["question", "target_response", "new_response"]

agent_questions = pd.DataFrame(data, columns=columns)

In [None]:
test_suite = TestSuite(tests=[
    TestColumnValueMin(
        column_name=SemanticSimilarity(
        display_name="Response Similarity",
        with_column="target_response").
        on("new_response"),
        gte=0.9),
])

test_suite.run(reference_data=None,
               current_data=agent_questions)
test_suite

Explore the dataset with the added Semantic Similarity score:

In [None]:
test_suite.datasets().current

Select only the examples where Semantic Similarity is < 0.9:

In [None]:
df = pd.DataFrame(test_suite.datasets().current)

filtered_df = df[df['Response Similarity'] < 0.9]

filtered_df.head()

# Test Classification Quality

Deciding whether to escalate to a human agent is often a classification problem: you classify incoming requests by scenario to decide which ones to answer automatically. Let's test it as a classification problem against a labeled dataset.

A new mock dataset:

In [None]:
data = [
    ["How do I reset my password?", "automated", "automated"],
    ["There is a transaction in my history that I did not make. What should I do?", "agent", "agent"],
    ["Can I change the email address associated with my account?", "automated", "automated"],
    ["My account was charged twice for the same transaction. How do I get a refund?", "agent", "agent"],
    ["What are the fees for international transactions?", "automated", "automated"],
    ["I think my account has been hacked. What should I do?", "automated", "automated"],
    ["How can I update my billing information?", "automated", "automated"],
    ["Why was my recent payment to a merchant declined?", "agent", "automated"], # Error
    ["Can I receive notifications for transactions in real-time?", "automated", "automated"],
    ["I received a notification of a large withdrawal that I didn't authorize. Can you help?", "agent", "agent"],
    ["How do I contact customer support?", "automated", "automated"],
    ["I have a strange transacation", "agent", "automated"], # Error
    ["Why was my account suddenly locked?", "agent", "agent"],
    ["How can I verify my account?", "automated", "automated"],
    ["I am having issues verifying my identity. What should I do?", "agent", "agent"],
    ["My recent transfer is still pending after several days. Can you check it for me?", "agent", "agent"],
    ["I received an email about suspicious activity on my account. Is it legitimate?", "agent", "agent"],
    ["How do I change my payment method?", "automated", "automated"],
    ["I need to dispute a charge on my statement. How do I proceed?", "agent", "agent"],
    ["What is 2-factor authentication and how do I enable it?", "automated", "automated"],
    ["Where can I seee my transactions?", "automated", "automated"],
    ["Can I change my phone number?", "automated", "automated"],
    ["I think someone is using my card", "agent", "agent"],
    ["Can I link multiple bank accounts?", "automated", "automated"],
    ["Can I add another bank account?", "automated", "automated"],
    ["Can I use your service for cryptocurrency transactions?", "automated", "automated"],
    ["How do I apply for a credit card?", "automated", "agent"],  # Error
    ["What should I do if I forgot my username?", "automated", "automated"],
    ["Can I add a new phone number?", "automated", "automated"],
    ["What is my tariff plan?", "automated", "automated"],
    ["Can I pay with your card online?", "automated", "automated"],
    ["Where can I see the payment fees?", "automated", "automated"],
    ["Can I switch my tarrif?", "automated", "automated"],
    ["Which countries do you support?", "automated", "automated"],
    ["I want to set up spending limit on my card", "automated", "automated"],
    ["What measures do you have in place for data security?", "automated", "automated"],
    ["Can I set spending limits?", "automated", "automated"],
    ["How do I add a secondary user to my account?", "automated", "automated"],
    ["Is it possible to get a detailed report of my monthly transactions?", "automated", "automated"],
    ["My account get locked, help", "agent", "agent"],
]

columns = ["question", "target", "prediction"]

classification_data = pd.DataFrame(data, columns=columns)

Preview:

In [None]:
classification_data.head(10)

Test accuracy:

In [None]:
test_suite = TestSuite(tests=[
    TestAccuracyScore(gte=0.9),
])

test_suite.run(reference_data=None,
               current_data=classification_data,
               column_mapping=ColumnMapping(pos_label='agent'))
test_suite

Test recall for "agent" class:

In [None]:
test_suite = TestSuite(tests=[
    TestRecallScore(gte=0.9),
])

test_suite.run(reference_data=None,
               current_data=classification_data,
               column_mapping=ColumnMapping(pos_label='agent'))
test_suite
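
As a rough illustration of what these two tests measure, the sketch below computes accuracy and recall for the positive class by hand. The helper names and the five-row toy sample are hypothetical and are not part of the Evidently API. Only the standard definitions are assumed.

```python
def accuracy(targets, predictions):
    """Share of rows where the prediction matches the target."""
    correct = sum(t == p for t, p in zip(targets, predictions))
    return correct / len(targets)


def recall(targets, predictions, pos_label):
    """Share of actual positives that were predicted as positive."""
    true_pos = sum(
        t == pos_label and p == pos_label
        for t, p in zip(targets, predictions)
    )
    actual_pos = sum(t == pos_label for t in targets)
    return true_pos / actual_pos if actual_pos else 0.0


# Hypothetical toy sample, not the notebook's dataset
targets = ["agent", "agent", "automated", "automated", "agent"]
predictions = ["agent", "automated", "automated", "agent", "agent"]

acc = accuracy(targets, predictions)
rec = recall(targets, predictions, pos_label="agent")
print(acc, round(rec, 3))
```

Recall for the "agent" class is the metric to watch if a missed escalation is more costly than an unnecessary one, since it only penalizes "agent" requests that were routed to the automated flow.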

# Test Text Length

Test that all new responses are between 50 and 180 characters long:

In [None]:
test_suite = TestSuite(tests=[
    TestValueRange(
        column_name=TextLength(
        display_name="Response Length").
        on("new_response"),
        left=50, right=180),
])

test_suite.run(reference_data=None,
               current_data=typical_questions)
test_suite

Test that all new responses are 150 characters or shorter:

In [None]:
test_suite = TestSuite(tests=[
    TestColumnValueMax(
        column_name=TextLength(
        display_name="Response Length").
        on("new_response"),
        lte=150),
])

test_suite.run(reference_data=None,
               current_data=typical_questions)
test_suite

View the dataset with the length column added:

In [None]:
test_suite.datasets().current

Select only the examples where the Text Length is > 150:

In [None]:
df = pd.DataFrame(test_suite.datasets().current)

filtered_df = df[df['Response Length'] > 150]

filtered_df.head()
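
The two length checks above can be sketched in plain Python as follows. This assumes that text length is counted in characters and that the bounds are inclusive. The sample responses are made up for the illustration.

```python
# Hypothetical responses of known length: 4, 120 and 200 characters
responses = ["Yes.", "b" * 120, "c" * 200]

lengths = [len(r) for r in responses]

# Range check: every length must fall within [50, 180] (inclusive, by assumption)
in_range = [50 <= n <= 180 for n in lengths]
range_test_passed = all(in_range)

# Max check: the longest response must not exceed 150 characters
max_test_passed = max(lengths) <= 150

# Indices of the responses that break the 150-character limit
too_long = [i for i, n in enumerate(lengths) if n > 150]
print(lengths, range_test_passed, max_test_passed, too_long)
```

The range check flags both responses that are too short and those that are too long, while the max check only looks at the single longest response.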

# Test Competitor Mentions

Let's consider another set of test cases: these are example questions that mention competitors. Our chatbot's responses should discuss only our own products and never mention competitors.

In [None]:
data = [
    ["Why should I switch from YetAnotherFinCo to OurFinCo?", "I can only consult about OurFinCo products. OurFinCo offers competitive fees, robust security measures, and excellent customer support.", "I can only consult about OurFinCo products. OurFinCo offers competitive fees, robust security measures, and excellent customer support."],
    ["How do your fees compare to AnotherFinCo?", "I can only provide information about OurFinCo products. OurFinCo has a transparent fee structure with cashbacks for active users. You can see your plan on the 'Fees' page in your account.", "I can only provide information about OurFinCo products. OurFinCo has transparent fee structures with additional cashbacks and offers various plans to suit different needs. Check the 'Fees' page in your account to see your plan."],
    ["Is OurFinCo's customer service better than YetAnotherFinCo?", "I can only speak about OurFinCo. OurFinCo prides itself on providing 24/7 customer support with a dedicated team ready to assist you.", "I can only speak about OurFinCo. OurFinCo prides itself on providing 24/7 customer support with a dedicated team ready to assist you."],
    ["What are the advantages of using OurFinCo over AnotherFinCo?", "I can only discuss OurFinCo products. OurFinCo offers a user-friendly platform, low transaction fees, and comprehensive security measures to protect your data.", "I can only discuss OurFinCo products. OurFinCo offers a user-friendly platform, low transaction fees, and comprehensive security measures to protect your data."],
    ["Does OurFinCo offer better rewards than YetAnotherFinCo?", "I can only provide details about OurFinCo products. OurFinCo offers a rewards program with cashback, discounts, and loyalty points for regular use.", "I can only provide details about OurFinCo products. OurFinCo offers a rewards program with cashback, discounts, and loyalty points for regular use."],
    ["Should I choose OurFinCo over AnotherFinCo?", "I can only advise on OurFinCo products. OurFinCo provides excellent service, secure transactions, and various financial tools to help manage your money.", "I can only advise on OurFinCo products. OurFinCo provides excellent service, secure transactions, and various financial tools to help manage your money. AnotherFinCo also has some good tools, but OurFinCo's are tailored to be more user-friendly."],
    ["Is the mobile app of OurFinCo better than YetAnotherFinCo's?", "I can only talk about OurFinCo products. OurFinCo's mobile app is highly rated for its intuitive design and wide range of features, including real-time notifications and easy money transfers.", "I can only talk about OurFinCo products. OurFinCo's mobile app is highly rated for its intuitive design and wide range of features, including real-time notifications and easy money transfers. YetAnotherFinCo's app has fewer features."],
    ["How do the security features of OurFinCo compare to AnotherFinCo?", "I can only discuss the security features of OurFinCo. OurFinCo employs advanced encryption, two-factor authentication, and regular security audits to ensure your data is safe.", "I can only discuss the security features of OurFinCo. OurFinCo employs advanced encryption, two-factor authentication, and regular security audits to ensure your data is safe."],
    ["What makes OurFinCo better than YetAnotherFinCo for international transactions?", "I can only provide information about OurFinCo. OurFinCo offers competitive exchange rates, low international transfer fees, and fast transaction processing times.", "I can only provide information about OurFinCo. OurFinCo offers competitive exchange rates, low international transfer fees, and fast transaction processing times."],
    ["Why should I trust OurFinCo over AnotherFinCo?", "I can only speak for OurFinCo. OurFinCo is committed to transparency, customer satisfaction, and security, ensuring that your financial needs are met with integrity and care.", "I can only speak for OurFinCo. OurFinCo is committed to transparency, customer satisfaction, and security, ensuring that your financial needs are met with integrity and care."]
]

columns = ["question", "target_response", "new_response"]

competitor_questions = pd.DataFrame(data, columns=columns)

Test that the number of responses that mention competitors is 0:

In [None]:
test_suite = TestSuite(tests=[
    TestCategoryCount(
        column_name=Contains(
            items=["AnotherFinCo", "YetAnotherFinCo"],
            display_name="Competitor Mentions").
        on("new_response"),
        category=True,
        eq=0),
])

test_suite.run(reference_data=None,
               current_data=competitor_questions)
test_suite

View the dataset with the added True/False column for competitor mentions:

In [None]:
test_suite.datasets().current

View only the answers that mention competitors:

In [None]:
df = pd.DataFrame(test_suite.datasets().current)

filtered_df = df[df['Competitor Mentions'] == True]

filtered_df.head()
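
Conceptually, this check boils down to flagging each response that contains any of the listed items and then counting the flags. The sketch below assumes a case-sensitive substring match. The helper `mentions_any` and the shortened sample responses are hypothetical.

```python
competitors = ["AnotherFinCo", "YetAnotherFinCo"]


def mentions_any(text, items):
    """True if the text contains at least one item (case-sensitive substring)."""
    return any(item in text for item in items)


# Shortened, made-up responses for illustration
responses = [
    "I can only speak about OurFinCo.",
    "AnotherFinCo also has some good tools.",
    "YetAnotherFinCo's app has fewer features.",
]

flags = [mentions_any(r, competitors) for r in responses]
mention_count = sum(flags)
test_passed = mention_count == 0
print(flags, mention_count, test_passed)
```

Note that "AnotherFinCo" is itself a substring of "YetAnotherFinCo", so with plain substring matching the first item alone would already catch both competitor names.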

# Text Toxicity

Let's join all three test datasets:

In [None]:
misc_questions = pd.concat([typical_questions, agent_questions, competitor_questions], ignore_index=True)

Test that no response is toxic, i.e. that the predicted probability of the "toxicity" class does not exceed 0.05 for any response:

In [None]:
test_suite = TestSuite(tests=[
    TestColumnValueMax(
        column_name=HuggingFaceToxicityModel(
        display_name="Toxicity").
        on("new_response"),
        lte=0.05),
])

test_suite.run(reference_data=None,
               current_data=misc_questions)
test_suite

In [None]:
test_suite.datasets().current

# Test Sentiment

Test if any responses have negative sentiment (< 0). We use a model that gives scores on a scale of -1 to 1 (where -1 is very negative, 0 is neutral, +1 is very positive).

In [None]:
test_suite = TestSuite(tests=[
    TestColumnValueMin(
        column_name=Sentiment(
        display_name="Sentiment").
        on("new_response"),
        gte=0),
])

test_suite.run(reference_data=None, current_data=misc_questions)
test_suite

View only examples with negative sentiment:

In [None]:
df = pd.DataFrame(test_suite.datasets().current)

filtered_df = df[df['Sentiment'] < 0]

filtered_df.head()

You can also formulate the test differently: test that the share of negative responses (those with a score outside the 0 to +1 range) is, say, no more than 10%.

In [None]:
test_suite = TestSuite(tests=[
    TestShareOfOutRangeValues(
        column_name=Sentiment(
        display_name="Sentiment").
        on("new_response"),
        left=0, right=1,
        lte=0.1),
])

test_suite.run(reference_data=None, current_data=misc_questions)
test_suite
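
To make the difference between the two formulations concrete, here is a small sketch with made-up sentiment scores. The helper `share_out_of_range` is hypothetical and only mirrors the idea behind the test, with the same bounds and the same `lte=0.1` threshold as the code above.

```python
def share_out_of_range(values, left, right):
    """Share of values that fall outside the inclusive range [left, right]."""
    outside = sum(not (left <= v <= right) for v in values)
    return outside / len(values)


# Ten hypothetical sentiment scores, one of them negative
scores = [0.6, 0.0, -0.3, 0.8, 0.4, 0.2, 0.9, 0.1, 0.5, 0.7]

# Strict formulation: the minimum score must be >= 0
strict_passed = min(scores) >= 0

# Relaxed formulation: at most 10% of scores may fall outside [0, 1]
share = share_out_of_range(scores, left=0, right=1)
relaxed_passed = share <= 0.1
print(strict_passed, share, relaxed_passed)
```

With one negative score out of ten, the strict check fails while the share-based check still passes, which is useful when occasional negative responses are acceptable.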

# Test Neutrality

Let's test that no response carries a non-neutral emotion, i.e. that the predicted probability of the "neutral" class is at least 0.5 for every response.

In [None]:
test_suite = TestSuite(tests=[
    TestColumnValueMin(
        column_name=HuggingFaceModel(
            model="SamLowe/roberta-base-go_emotions",
            params={"label": "neutral"},
            display_name="Neutrality",
        ).on("new_response"),
        gte=0.5
    ),
])

test_suite.run(reference_data=None, current_data=misc_questions)
test_suite

View the examples with predicted neutrality < 0.5:

In [None]:
df = pd.DataFrame(test_suite.datasets().current)

filtered_df = df[df['Neutrality'] < 0.5]

filtered_df.head()

# Combining tests

You can also combine multiple tests on the same dataset.

In [None]:
test_suite = TestSuite(tests=[
    TestShareOfOutRangeValues(column_name=Sentiment(display_name="Sentiment").on("new_response"), left=0, right=1, lte=0.1),
    TestShareOfOutRangeValues(column_name=HuggingFaceModel(model="SamLowe/roberta-base-go_emotions", params={"label": "neutral"}, display_name="Neutrality").on("new_response"), left=0.5, right=1, lte=0.1),
])

test_suite.run(reference_data=None, current_data=misc_questions)
test_suite

# Set auto test conditions

To be able to auto-generate conditions and compare distributions, split the data into two datasets: reference (golden responses) and current (new responses).

In [None]:
# Split the dataframe
ref = misc_questions[['question', 'target_response']].copy()
cur = misc_questions[['question', 'new_response']].copy()

# Rename columns to match the context
ref.rename(columns={'target_response': 'response'}, inplace=True)
cur.rename(columns={'new_response': 'response'}, inplace=True)

In [None]:
ref.head()

Auto-generate a text length range check (the expected range is derived from the reference data):

In [None]:
test_suite = TestSuite(tests=[
    TestValueRange(column_name=TextLength().on("response")),
])

test_suite.run(reference_data=ref, current_data=cur)
test_suite

There are many ways to formulate your tests. Auto-generate test conditions from reference (+/-10% from reference, +/- 2 standard deviations).

In [None]:
test_suite = TestSuite(tests=[
    TestColumnValueMean(column_name=TextLength().on("response")),
    TestColumnValueMax(column_name=TextLength().on("response")),
    TestColumnValueMin(column_name=TextLength().on("response")),
    TestColumnValueMedian(column_name=TextLength().on("response")),
    TestColumnValueStd(column_name=TextLength().on("response")),
    TestColumnQuantile(column_name=TextLength().on("response"), quantile=0.9),
    TestValueRange(column_name=TextLength().on("response")),
    TestMeanInNSigmas(column_name=TextLength().on("response")),
])

test_suite.run(reference_data=ref, current_data=cur)
test_suite
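
The two kinds of auto-generated conditions mentioned above (+/-10% from reference, +/- 2 standard deviations) can be sketched for the mean as follows. This is an illustration of the idea, not Evidently's code: the sample lengths are made up, and using the sample standard deviation is an assumption.

```python
import statistics

# Hypothetical text lengths for reference and current responses
ref_lengths = [100, 110, 90, 120, 80]
cur_lengths = [105, 115, 95, 130, 85]

ref_mean = statistics.mean(ref_lengths)
ref_std = statistics.stdev(ref_lengths)  # sample standard deviation (assumption)
cur_mean = statistics.mean(cur_lengths)

# Condition 1: current mean within +/-10% of the reference mean
within_10_percent = abs(cur_mean - ref_mean) <= 0.1 * abs(ref_mean)

# Condition 2: current mean within +/-2 reference standard deviations
within_2_sigmas = abs(cur_mean - ref_mean) <= 2 * ref_std

print(ref_mean, cur_mean, within_10_percent, within_2_sigmas)
```

The relative tolerance is easy to reason about, while the sigma-based one adapts to how much the reference values themselves vary.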

Compare the distributions of text length in the reference and current data using a statistical test:

In [None]:
test_suite = TestSuite(tests=[
    TestColumnDrift(column_name=TextLength().on("response")),
])

test_suite.run(reference_data=ref, current_data=cur)
test_suite
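
For intuition about distribution comparison, the sketch below computes the two-sample Kolmogorov-Smirnov statistic, one common way to compare two numerical samples. It is shown as an assumption for illustration only: which method Evidently applies depends on its own defaults, and the sketch does not compute a p-value. The helper `ks_statistic` and the sample values are hypothetical.

```python
from bisect import bisect_right


def ks_statistic(ref, cur):
    """Largest gap between the empirical CDFs of two samples."""
    ref_sorted = sorted(ref)
    cur_sorted = sorted(cur)
    max_gap = 0.0
    # The gap can only change at observed values, so check each of them
    for x in sorted(set(ref_sorted) | set(cur_sorted)):
        cdf_ref = bisect_right(ref_sorted, x) / len(ref_sorted)
        cdf_cur = bisect_right(cur_sorted, x) / len(cur_sorted)
        max_gap = max(max_gap, abs(cdf_ref - cdf_cur))
    return max_gap


# Hypothetical text lengths
reference = [80, 90, 100, 110, 120]

d_same = ks_statistic(reference, reference)
d_shifted = ks_statistic(reference, [300, 310, 320, 330, 340])
print(d_same, d_shifted)
```

A statistic near 0 means the two samples are distributed alike, and a value of 1 means they do not overlap at all. A full statistical test would turn this statistic into a p-value and compare it with a significance threshold.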