## General

A better approach to this problem would be to use LangSmith's evaluation tools, but without access to the bot itself, and without knowing which model generated these responses, that was not possible. The best option in this case was to delegate the evaluation to another language model and give it instructions on how each conversation should be scored.

## Imports

In [1]:
import pandas as pd
import openai
from dotenv import load_dotenv
load_dotenv()

True

## Main functionality

In [6]:
class EvaluationArtifact:
    _BOT_BEHAVIOUR_DESCRIPTION = """
    Evaluate the following Conversation, assuming the participants have had no prior conversation.

    The AI Bot aims to book the lead by suggesting available intro classes in groups of three and answering their questions.
    The bot's final goal is a successful booking. In the case of a positive reply, the bot proceeds to book the lead and sends them a success message.
    The bot has specific instructions that must be followed.
    The bot should respond with the message "One of our representatives will be in touch with you shortly to assist you further" whenever
    any of the following is detected during a conversation:
        ● The customer wants to bring someone else (e.g., a friend, daughter, etc.) as a guest (an opportunity to gain more customers).
        ● The customer asks about gift cards, payments, discounts, or costs, or shows interest in any finance-related topic.
        ● The customer wants to use 1Pass (onepass, One Pass) credits.
    The conversation is relatively good, and can receive a higher score, if the bot detects one of these patterns and returns exactly the string "One of our representatives will be in touch with you shortly to assist you further" and nothing else.
    The conversation is relatively bad, and can receive a lower score, if the user's case corresponds to one of these patterns but the bot does not detect it.

    Based on the Conversation, the output must be an integer from 1 to 10.
    """
    
    def __init__(self,
                 model: str = "gpt-4o-mini",
                 max_tokens: int = 5,
                 temperature: float = 0.0
                 ) -> None:
        self.model = model
        self.max_tokens = max_tokens
        self.temperature = temperature
        self.client = openai.OpenAI()
        
    def get_score(self, conversation: pd.DataFrame) -> int:
        """Return the judge model's score for one conversation, or -1 if the reply is not an integer."""
        # Rows with source == 1 are sent as the bot ("assistant"); all other rows as the lead ("user").
        history = [
            {"role": "assistant" if source == 1 else "user", "content": message}
            for source, message in zip(conversation.source, conversation.message)
        ]
        res = self.client.chat.completions.create(
            model=self.model,
            temperature=self.temperature,
            max_tokens=self.max_tokens,
            messages=[
                {"role": "system", "content": self._BOT_BEHAVIOUR_DESCRIPTION},
                *history,
                {"role": "system", "content": "Please evaluate the conversation and return only one integer."},
            ],
        )
        try:
            return int(res.choices[0].message.content)
        except (TypeError, ValueError):
            # The reply was empty or not a bare integer.
            return -1
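
`get_score` falls back to `-1` whenever the reply is not a bare integer, so a reply such as `Score: 8` is discarded. A minimal sketch of a more tolerant parser follows. `parse_score` is a hypothetical helper, not part of the original notebook, and it assumes that the first standalone number from 1 to 10 in the reply is the intended score.

```python
import re
from typing import Optional

# Hypothetical helper (not in the original notebook).
# The pattern matches a standalone "10" or a standalone single digit 1-9.
_SCORE_PATTERN = re.compile(r"\b(10|[1-9])\b")


def parse_score(reply: Optional[str]) -> int:
    """Return the first standalone integer from 1 to 10 in `reply`, or -1 if none is found."""
    if not reply:
        # Covers both a missing reply (None) and an empty string.
        return -1
    match = _SCORE_PATTERN.search(reply)
    return int(match.group(1)) if match else -1
```

Under that assumption, `get_score` could return `parse_score(res.choices[0].message.content)` in place of the `try`/`except` block. With the default `max_tokens` of 5 the reply is cut off after a few tokens, so the number still needs to appear early in it.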
    

## Evaluation

In [7]:
df = pd.read_excel('dataset1 (1).xlsx')

In [10]:
evart = EvaluationArtifact()
# Score each lead's conversation separately; groupby keeps each lead's rows in their original order.
records = [
    {'lead_id': lead_id, 'score': evart.get_score(group)}
    for lead_id, group in df.groupby('lead_id')
]

result_df = pd.DataFrame(records, columns=['lead_id', 'score'])
result_df.to_csv('results.csv', index=False)
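
Because `get_score` uses `-1` as a failure sentinel, averaging the `score` column directly would pull the mean down. The sketch below sets failed evaluations aside before summarizing. `summarize_scores` is a hypothetical helper, not part of the original notebook, and it works on a plain list of scores rather than on the DataFrame.

```python
from statistics import mean
from typing import Dict, List, Union


def summarize_scores(scores: List[int]) -> Dict[str, Union[int, float, None]]:
    """Summarize judge scores, treating -1 as a failed evaluation (hypothetical helper)."""
    # Keep only the scores that were parsed successfully.
    valid = [s for s in scores if s != -1]
    return {
        "evaluated": len(valid),
        "failed": len(scores) - len(valid),
        # No mean is reported when every evaluation failed or the list is empty.
        "mean_score": mean(valid) if valid else None,
    }
```

For the notebook's output, this could be called as `summarize_scores(result_df['score'].tolist())`.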