## Data Ingestion

In [1]:
# Import libraries and modules
import pandas as pd
import minsearch
from tqdm.auto import tqdm
from openai import OpenAI
from dotenv import load_dotenv
import os

load_dotenv()  # This loads the variables from .env into the environment

True

In [2]:
df=pd.read_csv('../App/data/Mental_Health_FAQ.csv')

df.isnull().sum()  # null
df.head()
df.Question_ID.unique() # unique IDs 98
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 98 entries, 0 to 97
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Question_ID  98 non-null     int64 
 1   Questions    98 non-null     object
 2   Answers      98 non-null     object
dtypes: int64(1), object(2)
memory usage: 2.4+ KB


In [3]:
documents = df.to_dict(orient='records')
import json
with open('../App/data/documents.json', 'w') as file:
    json.dump(documents, file)

In [4]:
print(len(documents))
documents[0]


98


{'Question_ID': 1590140,
 'Questions': 'What does it mean to have a mental illness?',
 'Answers': 'Mental illnesses are health conditions that disrupt a personâ€™s thoughts, emotions, relationships, and daily functioning. They are associated with distress and diminished capacity to engage in the ordinary activities of daily life.\nMental illnesses fall along a continuum of severity: some are fairly mild and only interfere with some aspects of life, such as certain phobias. On the other end of the spectrum lie serious mental illnesses, which result in major functional impairment and interference with daily life. These include such disorders as major depression, schizophrenia, and bipolar disorder, and may require that the person receives care in a hospital.\nIt is important to know that mental illnesses are medical conditions that have nothing to do with a personâ€™s character, intelligence, or willpower. Just as diabetes is a disorder of the pancreas, mental illness is a medical condit

### Minsearch

In [5]:
# I decided not to use keywords as I discovered that it was helping with hit rate but slightly lowering it

In [6]:
index = minsearch.Index(
    text_fields=['Questions', 'Answers'],
    keyword_fields=['Question_ID']
)

In [7]:
index

<minsearch.Index at 0x20cf0bac190>

In [8]:
index.fit(documents)

<minsearch.Index at 0x20cf0bac190>

In [9]:
index.text_fields

['Questions', 'Answers']

## RAG Flow

In [10]:
client = OpenAI()

In [11]:
query = 'What are the symptoms of mental illness?'

In [12]:
def search(query):
    boost = {}

    results = index.search(
        query=query,
        filter_dict={},
        boost_dict=boost,
        num_results=5
    )

    return results

In [13]:
search(query)

[{'Question_ID': 9434130,
  'Answers': 'Symptoms of mental health disorders vary depending on the type and severity of the condition. The following is a list of general symptoms that may suggest a mental health disorder, particularly when multiple symptoms are expressed at once.\nIn adults:\nConfused thinking\nLong-lasting sadness or irritability\nExtreme highs and lows in mood\nExcessive fear, worrying, or anxiety\nSocial withdrawal\nDramatic changes in eating or sleeping habits\nStrong feelings of anger\nDelusions or hallucinations (seeing or hearing things that are not really there)\nIncreasing inability to cope with daily problems and activities\nThoughts of suicide\nDenial of obvious problems\nMany unexplained physical problems\nAbuse of drugs and/or alcohol\n  In older children and pre-teens:\nAbuse of drugs and/or alcohol\nInability to cope with daily problems and activities\nChanges in sleeping and/or eating habits\nExcessive complaints of physical problems\nDefying authority, 

In [14]:
## LLM response
response = client.chat.completions.create(
    model='gpt-4o-mini',
    messages=[{"role": "user", "content": query}]
)

response.choices[0].message.content


"Mental illness encompasses a wide range of conditions that affect mood, thinking, and behavior. Symptoms can vary significantly depending on the specific disorder, but some common symptoms may include:\n\n1. **Emotional Symptoms:**\n   - Persistent sadness or low mood\n   - Increased anxiety or worry\n   - Feelings of hopelessness or helplessness\n   - Irritability or mood swings\n   - Emotional numbness or detachment\n\n2. **Cognitive Symptoms:**\n   - Difficulty concentrating or making decisions\n   - Memory problems\n   - Distorted thinking or irrational beliefs\n   - Feelings of confusion or disorientation\n\n3. **Behavioral Symptoms:**\n   - Withdrawal from social activities or relationships\n   - Changes in appetite or weight\n   - Changes in sleep patterns (insomnia or oversleeping)\n   - Increased use of substances (alcohol, drugs)\n   - Decline in work or academic performance\n\n4. **Physical Symptoms:**\n   - Unexplained physical symptoms (e.g., headaches, digestive issues)\

In [15]:
print(_)

Mental illness encompasses a wide range of conditions that affect mood, thinking, and behavior. Symptoms can vary significantly depending on the specific disorder, but some common symptoms may include:

1. **Emotional Symptoms:**
   - Persistent sadness or low mood
   - Increased anxiety or worry
   - Feelings of hopelessness or helplessness
   - Irritability or mood swings
   - Emotional numbness or detachment

2. **Cognitive Symptoms:**
   - Difficulty concentrating or making decisions
   - Memory problems
   - Distorted thinking or irrational beliefs
   - Feelings of confusion or disorientation

3. **Behavioral Symptoms:**
   - Withdrawal from social activities or relationships
   - Changes in appetite or weight
   - Changes in sleep patterns (insomnia or oversleeping)
   - Increased use of substances (alcohol, drugs)
   - Decline in work or academic performance

4. **Physical Symptoms:**
   - Unexplained physical symptoms (e.g., headaches, digestive issues)
   - Fatigue or low ener

In [16]:
def build_prompt(query, search_results):
    prompt_template = """
    You're a mental health psychiatrist. Answer the QUESTION based on the CONTEXT from our mental questions and answer database.
    Use only the facts from the CONTEXT when answering the QUESTION.
    
    QUESTION: {question}
    
    CONTEXT:
    {context}
    """.strip()
    
    entry_template = """
    ANSWER: {Answers}
    """.strip()
    context = ""
    
    for doc in search_results:
        context = context + entry_template.format(**doc) + "\n\n"

    prompt = prompt_template.format(question=query, context=context).strip()
    return prompt

In [17]:
query='what are the symptoms of mental illness?'
search_results=search(query)
build_prompt(query, search_results)

"You're a mental health psychiatrist. Answer the QUESTION based on the CONTEXT from our mental questions and answer database.\n    Use only the facts from the CONTEXT when answering the QUESTION.\n    \n    QUESTION: what are the symptoms of mental illness?\n    \n    CONTEXT:\n    ANSWER: Symptoms of mental health disorders vary depending on the type and severity of the condition. The following is a list of general symptoms that may suggest a mental health disorder, particularly when multiple symptoms are expressed at once.\nIn adults:\nConfused thinking\nLong-lasting sadness or irritability\nExtreme highs and lows in mood\nExcessive fear, worrying, or anxiety\nSocial withdrawal\nDramatic changes in eating or sleeping habits\nStrong feelings of anger\nDelusions or hallucinations (seeing or hearing things that are not really there)\nIncreasing inability to cope with daily problems and activities\nThoughts of suicide\nDenial of obvious problems\nMany unexplained physical problems\nAbuse

In [18]:
print(_)

You're a mental health psychiatrist. Answer the QUESTION based on the CONTEXT from our mental questions and answer database.
    Use only the facts from the CONTEXT when answering the QUESTION.
    
    QUESTION: what are the symptoms of mental illness?
    
    CONTEXT:
    ANSWER: Symptoms of mental health disorders vary depending on the type and severity of the condition. The following is a list of general symptoms that may suggest a mental health disorder, particularly when multiple symptoms are expressed at once.
In adults:
Confused thinking
Long-lasting sadness or irritability
Extreme highs and lows in mood
Excessive fear, worrying, or anxiety
Social withdrawal
Dramatic changes in eating or sleeping habits
Strong feelings of anger
Delusions or hallucinations (seeing or hearing things that are not really there)
Increasing inability to cope with daily problems and activities
Thoughts of suicide
Denial of obvious problems
Many unexplained physical problems
Abuse of drugs and/or alco

In [19]:
 models=['gpt-4o', 'gpt-4o-mini', 'gpt-3.5-turbo', 'gpt-3.5-turbo-0613']

In [20]:
def llm(prompt, model='gpt-4o-mini'):
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}]
    )
    
    return response.choices[0].message.content

In [21]:
query='What is the best way to deal with mental illness?'
def rag(query, model='gpt-4o-mini'):
    search_results = search(query)
    prompt = build_prompt(query, search_results)
    #print(prompt)
    answer = llm(prompt, model=model)
    return answer



In [23]:
print(rag(query, model=models[2]))

The best way to deal with mental illness is through early identification and treatment. It is important for the individual affected to be proactive and fully engaged in their own recovery process. There are a range of effective treatments available based on the nature of the illness, and with the right supports and tools, anyone can live well and work towards their goals despite any challenges. It is possible to live a fulfilled and productive life even when dealing with a mental illness, as many individuals who are diagnosed and treated respond well to treatment. Regular monitoring and management of the disorder can help individuals lead successful and meaningful lives.


In [24]:
print(_)

You're a mental health psychiatrist. Answer the QUESTION based on the CONTEXT from our mental questions and answer database.
    Use only the facts from the CONTEXT when answering the QUESTION.
    
    QUESTION: what are the symptoms of mental illness?
    
    CONTEXT:
    ANSWER: Symptoms of mental health disorders vary depending on the type and severity of the condition. The following is a list of general symptoms that may suggest a mental health disorder, particularly when multiple symptoms are expressed at once.
In adults:
Confused thinking
Long-lasting sadness or irritability
Extreme highs and lows in mood
Excessive fear, worrying, or anxiety
Social withdrawal
Dramatic changes in eating or sleeping habits
Strong feelings of anger
Delusions or hallucinations (seeing or hearing things that are not really there)
Increasing inability to cope with daily problems and activities
Thoughts of suicide
Denial of obvious problems
Many unexplained physical problems
Abuse of drugs and/or alco

In [None]:
import pandas as pd

### Retrieval evaluation

In [29]:

df_questions = pd.read_csv('../App/data/ground-truth-data.csv')

In [30]:
ground_truth=df_questions.to_dict(orient='records')
len(ground_truth)

485

In [28]:
len(ground_truth)
for q in ground_truth[0:6]:
    print(q['question'])
type(ground_truth)

What are the different levels of severity for mental illnesses?
Can you explain how mental illnesses impact daily functioning?
Are mental illnesses linked to a person's character or intelligence?
What treatments are available for mental illnesses?
How effective is treatment for individuals with mental illness?
What percentage of mental health conditions develop by age 24?


list

In [29]:
ground_truth[0]

{'id': 1590140,
 'question': 'What are the different levels of severity for mental illnesses?'}

In [30]:
def hit_rate(relevance_total):
    cnt = 0

    for line in relevance_total:
        if True in line:
            cnt = cnt + 1

    return cnt / len(relevance_total)

def mrr(relevance_total):
    total_score = 0.0

    for line in relevance_total:
        for rank in range(len(line)):
            if line[rank] == True:
                total_score = total_score + 1 / (rank + 1)

    return total_score / len(relevance_total)

In [31]:
def minsearch_search(query):
    boost = {}

    results = index.search(
        query=query,
        filter_dict={},
        boost_dict=boost,
        num_results=5
    )

    return results

In [32]:
def evaluate(ground_truth, search_function):
    relevance_total = []

    for q in tqdm(ground_truth):
        doc_id = q['id']
        results = search_function(q)
        relevance = [d['Question_ID'] == doc_id for d in results]
        relevance_total.append(relevance)

    return {
        'hit_rate': hit_rate(relevance_total),
        'mrr': mrr(relevance_total),
    }

In [33]:
from tqdm.auto import tqdm

In [34]:
evaluate(ground_truth, lambda q: minsearch_search(q['question']))

  0%|          | 0/485 [00:00<?, ?it/s]

{'hit_rate': 0.7257731958762886, 'mrr': 0.549828178694158}

In [35]:
## {'hit_rate': 0.8371134020618557, 'mrr': 0.564302078219604} at 10K
## {'hit_rate': 0.73, 'mrr': 0.55} at 5k

### Retrieval Evaluation with Elasticsearch Text

In [102]:
from elasticsearch import Elasticsearch

es_client = Elasticsearch('http://localhost:9200') 

index_settings = {
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0,
        "analysis": {
          "analyzer": {
            "default": {
                  "type": "standard",
                  "stopwords": "_english_"
        }
      }
    }
    },
    "mappings": {
        "properties": {
            "Questions": 
            
                {"type": "text"},
            "Answers": 
                {"type": "text"},
            "Question_ID": 
                {"type": "keyword"},
        }
    }
}

index_name = "mental-health-questions"

es_client.indices.delete(index=index_name, ignore_unavailable=True)
es_client.indices.create(index=index_name, body=index_settings)

ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'mental-health-questions'})

In [103]:
from tqdm.auto import tqdm

for doc in tqdm(documents):
    es_client.index(index=index_name, document=doc)

  0%|          | 0/98 [00:00<?, ?it/s]

In [189]:
def elastic_search(query):
    search_query = {
        "size": 5,
        "query": {
            "bool": {
                "must": {
                    "multi_match": {
                        "query": query,
                        "fields": ["Questions", "Answers"],
                        "type": "best_fields"
                    }
                },
               
            }
        }
    }

    response = es_client.search(index=index_name, body=search_query)
    
    result_docs = []
    
    for hit in response['hits']['hits']:
        result_docs.append(hit['_source'])
    
    return result_docs

In [190]:
# elastic_search(
#     query="What are the symptoms of mental illness?"
# )

In [191]:
evaluate(ground_truth, lambda q: elastic_search(q['question']))

  0%|          | 0/485 [00:00<?, ?it/s]

{'hit_rate': 0.8123711340206186, 'mrr': 0.6548109965635736}

In [176]:
# {'hit_rate': 0.865979381443299, 'mrr': 0.6268622156766485}   with 10K
# {'hit_rate': 0.78969, 'mrr': 0.616529}   with 5K @ 5K with questions ^2.5
# {'hit_rate': 0.8123711340206186, 'mrr': 0.6548109965635736} @ 5K with questions

### Finding the best parameters

In [140]:
df_validation = df_questions[:100]
df_test = df_questions[100:]

In [141]:
import random

def simple_optimize(param_ranges, objective_function, n_iterations=10):
    best_params = None
    best_score = float('-inf')  # Assuming we're minimizing. Use float('-inf') if maximizing.

    for _ in range(n_iterations):
        # Generate random parameters
        current_params = {}
        for param, (min_val, max_val) in param_ranges.items():
            if isinstance(min_val, int) and isinstance(max_val, int):
                current_params[param] = random.randint(min_val, max_val)
            else:
                current_params[param] = random.uniform(min_val, max_val)
        
        # Evaluate the objective function
        current_score = objective_function(current_params)
        
        # Update best if current is better
        if current_score > best_score:  # Change to > if maximizing
            best_score = current_score
            best_params = current_params
    
    return best_params, best_score

In [142]:
gt_val = df_validation.to_dict(orient='records')

In [143]:
def minsearch_search(query, boost=None):
    if boost is None:
        boost = {}

    results = index.search(
        query=query,
        filter_dict={},
        boost_dict=boost,
        num_results=5
    )

    return results

In [146]:
param_ranges = {
    'Questions':(0,3.0),
    'Answers': (0.0, 3.0),
    
    
}

def objective(boost_params):
    def search_function(q):
        return minsearch_search(q['question'], boost_params)

    results = evaluate(gt_val, search_function)
    return results['mrr']

In [147]:
simple_optimize(param_ranges, objective, n_iterations=20)

  0%|          | 0/100 [00:00<?, ?it/s]

  0%|          | 0/100 [00:00<?, ?it/s]

  0%|          | 0/100 [00:00<?, ?it/s]

  0%|          | 0/100 [00:00<?, ?it/s]

  0%|          | 0/100 [00:00<?, ?it/s]

  0%|          | 0/100 [00:00<?, ?it/s]

  0%|          | 0/100 [00:00<?, ?it/s]

  0%|          | 0/100 [00:00<?, ?it/s]

  0%|          | 0/100 [00:00<?, ?it/s]

  0%|          | 0/100 [00:00<?, ?it/s]

  0%|          | 0/100 [00:00<?, ?it/s]

  0%|          | 0/100 [00:00<?, ?it/s]

  0%|          | 0/100 [00:00<?, ?it/s]

  0%|          | 0/100 [00:00<?, ?it/s]

  0%|          | 0/100 [00:00<?, ?it/s]

  0%|          | 0/100 [00:00<?, ?it/s]

  0%|          | 0/100 [00:00<?, ?it/s]

  0%|          | 0/100 [00:00<?, ?it/s]

  0%|          | 0/100 [00:00<?, ?it/s]

  0%|          | 0/100 [00:00<?, ?it/s]

({'Questions': 0.20303988721391708, 'Answers': 1.5766410134276012},
 0.6184999999999999)

In [149]:
## {'Questions': 0.23047272444669298, 'Answers': 2.083714474915904}, 0.6286468253968254)  @10K
## {'Questions': 0.20303988721391708, 'Answers': 1.5766410134276012}, 0.6184999999999999) @5K

In [31]:
def minsearch_improved(query):
    boost = {
   
    'Questions': 0.20303988721391708,
    'Answers': 1.5766410134276012
    
    
    }

    results = index.search(
        query=query,
        filter_dict={},
        boost_dict=boost,
        num_results=10
    )

    return results

evaluate(ground_truth, lambda q: minsearch_improved(q['question']))

  0%|          | 0/485 [00:00<?, ?it/s]

{'hit_rate': 0.954639175257732, 'mrr': 0.7323204058255603}

In [151]:
# {'hit_rate': 0.9525773195876288, 'mrr': 0.7322197676321388} @10K
# {'hit_rate': 0.954639175257732, 'mrr': 0.7323204058255603} @5K

## RAG evaluation

In [25]:
# from groq import Groq

# client = Groq(
#     # This is the default and can be omitted
#     api_key=os.environ.get("GROQ_API_KEY"),
# )

In [26]:
# def llm(prompt, model='llama3-8b-8192'):
#     response = client.chat.completions.create(
#     messages=[
#         {
#             "role": "system",
#             "content": "you are a helpful assistant."
#         },
#         {
#             "role": "user",
#             "content": prompt,
#             }
#         ],
#         model=model,
#     )


    
    
#     return response.choices[0].message.content

In [31]:
prompt2_template = """
You are an expert evaluator for a RAG system.
Your task is to analyze the relevance of the generated answer to the given question.
Based on the relevance of the generated answer, you will classify it
as "NON_RELEVANT", "PARTLY_RELEVANT", or "RELEVANT".

Here is the data for evaluation:

Question: {question}
Generated Answer: {answer_llm}

Please analyze the content and context of the generated answer in relation to the question
and provide your evaluation in parsable JSON without using code blocks:

{{
  "Relevance": "NON_RELEVANT" | "PARTLY_RELEVANT" | "RELEVANT",
  "Explanation": "[Provide a brief explanation for your evaluation]"
}}

Only return the JSON object. Make sure it is valid JSON and includes all required commas and quotes.

""".strip()

In [32]:
len(ground_truth)

485

In [33]:
def rag(query, model='gpt-4o-mini'):
    search_results = search(query)
    prompt = build_prompt(query, search_results)
    #print(prompt)
    answer = llm(prompt, model=model)
    return answer


In [34]:
record = ground_truth[0]
record

{'id': 1590140,
 'question': 'What are the different levels of severity for mental illnesses?'}

In [36]:

record = ground_truth[0]
question = record['question']
answer_llm = rag(question)
print(answer_llm)

The context provided does not explicitly outline different levels of severity for mental illnesses. However, it suggests that mental health exists on a continuum, ranging from good health to illness or disability. It emphasizes that symptoms can vary in intensity and duration among individuals, influencing how mental illnesses affect their lives. There is an acknowledgment that people can experience episodes of poor mental health without having a serious illness and that some may have mental illnesses while still maintaining periods of good mental health.


In [37]:
prompt = prompt2_template.format(question=question, answer_llm=answer_llm)
print(prompt)

You are an expert evaluator for a RAG system.
Your task is to analyze the relevance of the generated answer to the given question.
Based on the relevance of the generated answer, you will classify it
as "NON_RELEVANT", "PARTLY_RELEVANT", or "RELEVANT".

Here is the data for evaluation:

Question: What are the different levels of severity for mental illnesses?
Generated Answer: The context provided does not explicitly outline different levels of severity for mental illnesses. However, it suggests that mental health exists on a continuum, ranging from good health to illness or disability. It emphasizes that symptoms can vary in intensity and duration among individuals, influencing how mental illnesses affect their lives. There is an acknowledgment that people can experience episodes of poor mental health without having a serious illness and that some may have mental illnesses while still maintaining periods of good mental health.

Please analyze the content and context of the generated a

In [38]:
rag(prompt)

'{\n  "Relevance": "RELEVANT",\n  "Explanation": "The generated answer provides a clear explanation of the continuum of mental illnesses, discussing both mild and serious conditions. It defines mental illnesses and describes their impact on daily life, which directly pertains to the question about levels of severity for mental illnesses."\n}'

In [39]:
import json

In [40]:
df_sample = df_questions.sample(n=200, random_state=1)

In [41]:
sample = df_sample.to_dict(orient='records')

In [42]:
evaluations = []

for record in tqdm(sample):
    question = record['question']
    answer_llm = rag(question,model='gpt-4o-mini') 

    prompt = prompt2_template.format(
        question=question,
        answer_llm=answer_llm
    )

    evaluation = llm(prompt)
    try:
        evaluation = json.loads(evaluation)
    
    except json.JSONDecodeError as e:
        print("Error decoding JSON:", e)
    

    evaluations.append((record, answer_llm, evaluation))

  0%|          | 0/200 [00:00<?, ?it/s]

In [43]:
evaluations[0]
type(evaluations[0][0])

dict

In [44]:
df_eval = pd.DataFrame(evaluations, columns=['record', 'answer', 'evaluation'])
df_eval.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   record      200 non-null    object
 1   answer      200 non-null    object
 2   evaluation  200 non-null    object
dtypes: object(3)
memory usage: 4.8+ KB


In [45]:
df_eval.evaluation[0]

{'Relevance': 'RELEVANT',
 'Explanation': 'The generated answer provides a comprehensive list of actionable steps and resources to find a suitable mental health professional for a child, directly addressing the question asked.'}

In [46]:
df_eval = pd.DataFrame(evaluations, columns=['record', 'answer', 'evaluation'])

df_eval['id'] = df_eval.record.apply(lambda d: d['id'])
df_eval['question'] = df_eval.record.apply(lambda d: d['question'])

df_eval['relevance'] = df_eval.evaluation.apply(lambda d: d['Relevance'])
df_eval['explanation'] = df_eval.evaluation.apply(lambda d: d['Explanation'])

del df_eval['record']
del df_eval['evaluation']

In [47]:
type(evaluations[0])

tuple

In [48]:
df_eval.relevance.value_counts()

relevance
RELEVANT           182
PARTLY_RELEVANT     12
NON_RELEVANT         6
Name: count, dtype: int64

In [49]:
df_eval.relevance.value_counts(normalize=True)

relevance
RELEVANT           0.91
PARTLY_RELEVANT    0.06
NON_RELEVANT       0.03
Name: proportion, dtype: float64

In [51]:
df_eval.to_csv('../App/data/rag-eval-gpt4o-mini.csv', index=False)

In [54]:
evaluations_gpt4o = []

for record in tqdm(sample):
    question = record['question']
    answer_llm = rag(question, model='gpt-4o') 

    prompt = prompt2_template.format(
        question=question,
        answer_llm=answer_llm
    )

    evaluation = llm(prompt)
    evaluation = json.loads(evaluation)
    
    evaluations_gpt4o.append((record, answer_llm, evaluation))

  0%|          | 0/200 [00:00<?, ?it/s]

In [55]:
df_eval = pd.DataFrame(evaluations_gpt4o, columns=['record', 'answer', 'evaluation'])

df_eval['id'] = df_eval.record.apply(lambda d: d['id'])
df_eval['question'] = df_eval.record.apply(lambda d: d['question'])

df_eval['relevance'] = df_eval.evaluation.apply(lambda d: d['Relevance'])
df_eval['explanation'] = df_eval.evaluation.apply(lambda d: d['Explanation'])

del df_eval['record']
del df_eval['evaluation']

In [56]:
df_eval.relevance.value_counts()

relevance
RELEVANT           190
PARTLY_RELEVANT      5
NON_RELEVANT         5
Name: count, dtype: int64

In [57]:
df_eval.relevance.value_counts(normalize=True)

relevance
RELEVANT           0.950
PARTLY_RELEVANT    0.025
NON_RELEVANT       0.025
Name: proportion, dtype: float64

In [58]:
df_eval.to_csv('../App/data/rag-eval-gpt4o .csv', index=False)