# Batch/ragas
Tests the batch and Ragas capabilities.

Uses this article as a model: https://towardsdatascience.com/visualize-your-rag-data-evaluate-your-retrieval-augmented-generation-system-with-ragas-fc2486308557

Ragas repository: https://github.com/explodinggradients/ragas/tree/main

In [16]:
import os

import chromadb
import pandas as pd
from dotenv import load_dotenv, find_dotenv
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from ragas.testset import TestsetGenerator

# Set environment variables with .env
load_dotenv(find_dotenv(), override=True)

True

## Connect to database

In [4]:
persistent_client = chromadb.PersistentClient(path=os.path.join(os.getenv('LOCAL_DB_PATH'), 'chromadb'))
query_model = OpenAIEmbeddings(model='text-embedding-ada-002', openai_api_key=os.getenv('OPENAI_API_KEY'))

# Connect to the vectorstore that holds full PDF pages (no chunking was applied)
vectorstore = Chroma(client=persistent_client,
                     collection_name='chromadb-openai-ams-full',
                     embedding_function=query_model)
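
The connection cell passes `os.getenv('LOCAL_DB_PATH')` straight into `os.path.join`, which raises a `TypeError` when the variable is unset. A minimal sketch of a guard follows; `resolve_db_path` is a hypothetical helper, not part of this notebook or of chromadb.

```python
import os


def resolve_db_path(env_var="LOCAL_DB_PATH", subdir="chromadb"):
    """Hypothetical helper: build the database path or fail with a clear message."""
    base = os.getenv(env_var)
    if not base:
        # An unset or empty variable would otherwise surface as a TypeError in os.path.join
        raise RuntimeError(f"Environment variable {env_var} is not set; check the .env file")
    return os.path.join(base, subdir)
```

The returned path could then be passed as `path=` to `chromadb.PersistentClient`.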


In [5]:
all_docs = vectorstore.get(include=["metadatas", "documents", "embeddings"])

In [6]:
lcdocs = [Document(page_content=doc, metadata=metadata) 
          for doc, metadata in zip(all_docs['documents'], all_docs['metadatas'])]

## Generate synthetic dataset

In [13]:
generator_model = "gpt-3.5-turbo-16k"
critic_model = "gpt-4"

generator_llm = ChatOpenAI(model=generator_model)
critic_llm = ChatOpenAI(model=critic_model)
embeddings = OpenAIEmbeddings()

generator = TestsetGenerator.from_langchain(
    generator_llm,
    critic_llm,
    embeddings
)
# Generate a 2-question test set from the first n_docs pages
n_docs = 100
testset = generator.generate_with_langchain_docs(lcdocs[:n_docs], test_size=2)

Filename and doc_id are the same for all nodes.


In [19]:
testset

TestDataset(test_data=[DataRow(question='What are the different material combinations tested for the gliding surfaces in the hinge system?', contexts=['350 Development :  In a preliminary development study , the principle of the hinge was  developed and a functional verification  was run with a modifiable demonstrator model , shown in Figure 6, to prove the concept functional ity. The  hinge for the demonstration model  is designed to have a broad range of adjustable functional parameters  as to accommodate different springs and featur ing an adaptable spring force and stroke  length to vary the  kick-off energy.  The demo nstrator consists of  a fairing dummy, down scaled from a large launch vehicle.  The dummy was constructed with adaptable mass distribution, in order to simulate possible design changes  on the fairing for the new separation system as well as asymmetric mass distribution to assess the  robustness of the system  by test . Only one hinge has been attached to the demons

In [15]:
questions_all = [
    {
        "question": qa.question,
        "ground_truth": qa.ground_truth,
        "question_by": generator_model,
    }
    for qa in testset.test_data
]

len(questions_all)

2

In [17]:
df_questions = pd.DataFrame(
    {
        "id": [f"Question {i}" for i, _ in enumerate(questions_all)],
        "question": [qa["question"] for qa in questions_all],
        "ground_truth": [qa["ground_truth"] for qa in questions_all],
        "question_by": [qa["question_by"] for qa in questions_all],
    }
)
# keep only the first question if questions are duplicated
df_questions = df_questions.drop_duplicates(subset=["question"])
df_questions

| | id | question | ground_truth | question_by |
|---|---|---|---|---|
| 0 | Question 0 | What are the different material combinations t... | Three different material combinations were tes... | rags_gpt35_40 |
| 1 | Question 1 | What is the connection between angular runout ... | | rags_gpt35_40 |
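
In the output above, Question 1 has no ground truth. One option, sketched here under the assumption that such rows should be left out before evaluation, is to filter them first. `has_ground_truth` and `sample_rows` are illustrative names, and the rows are placeholders shaped like the entries of `questions_all`.

```python
import math


def has_ground_truth(row):
    """Return True if the row carries a non-empty ground-truth answer."""
    value = row.get("ground_truth")
    if value is None:
        return False
    # A missing answer may also show up as NaN after a round trip through pandas
    if isinstance(value, float) and math.isnan(value):
        return False
    return bool(str(value).strip())


# Placeholder rows, not real generator output
sample_rows = [
    {"question": "Q with answer", "ground_truth": "Some answer"},
    {"question": "Q without answer", "ground_truth": ""},
    {"question": "Q with NaN answer", "ground_truth": float("nan")},
]

answerable = [row for row in sample_rows if has_ground_truth(row)]
print(len(answerable))  # prints 1
```

Applied to the notebook, the same predicate could be used as `[qa for qa in questions_all if has_ground_truth(qa)]`.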


In [None]:
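import hashlib
import json

def stable_hash_meta(metadata):
    # Assumed helper (not defined elsewhere in this notebook): derive a
    # deterministic ID from the SHA-1 of the key-sorted JSON metadata.
    return hashlib.sha1(json.dumps(metadata, sort_keys=True).encode()).hexdigest()

# Build a DataFrame with one row per stored page, including its embedding
all_docs = vectorstore.get(include=["metadatas", "documents", "embeddings"])
df_docs = pd.DataFrame(
    {
        "id": [stable_hash_meta(metadata) for metadata in all_docs["metadatas"]],
        "source": [metadata.get("source") for metadata in all_docs["metadatas"]],
        "page": [metadata.get("page", -1) for metadata in all_docs["metadatas"]],
        "document": all_docs["documents"],
        "embedding": all_docs["embeddings"],
    }
)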