# Test Set Generator

This Jupyter notebook demonstrates how to generate a test set from local documents using Ragas, LangChain document loaders, and Azure OpenAI. The workflow includes the following steps:

1. **Installation of Required Packages**:
    - The necessary packages `ragas`, `unstructured`, and `unstructured[pdf]` are installed.

2. **Loading Documents**:
    - Documents are loaded from a specified directory using `DirectoryLoader` from `langchain_community.document_loaders`.

3. **Setting Up Azure OpenAI**:
    - Environment variables are loaded from a `.env` file.
    - The Azure OpenAI API key and endpoint are set up.
    - Instances of `AzureChatOpenAI` and `AzureOpenAIEmbeddings` are created to serve as the LLM and the embedding model for test set generation.

4. **Generating Test Set**:
    - A `TestsetGenerator` is created using the previously set up Azure OpenAI instances.
    - The test set is generated from the loaded documents with specified distributions for different types of evolutions (`simple`, `reasoning`, `multi_context`).

5. **Saving and Viewing the Test Set**:
    - The generated test set is converted to a pandas DataFrame for inspection.
    - The test set is saved to disk in a specified directory.

The cells below walk through these steps in order.

In [None]:
%pip install ragas unstructured "unstructured[pdf]"

### Loading Documents

In this step, documents are loaded from the `../data/` directory using the `DirectoryLoader` from `langchain_community.document_loaders`. The loader is configured to use multithreading (`use_multithreading=True`), to skip files that fail to load instead of raising (`silent_errors=True`), and to load at most one file (`sample_size=1`). After loading, each document's `filename` metadata is set to the value of its `source` metadata.

In [2]:
from langchain_community.document_loaders import DirectoryLoader
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context


loader = DirectoryLoader(
    "../data/", use_multithreading=True, silent_errors=True, sample_size=1
)
documents = loader.load()

for document in documents:
    document.metadata["filename"] = document.metadata["source"]
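
Because `silent_errors=True` skips files that fail to load, the loader can return fewer documents than expected, or none at all. The following is a minimal sketch of a sanity check that could run after the loading cell. The helper name `check_documents` is hypothetical, and `SimpleNamespace` objects stand in for LangChain documents so the example runs on its own.

```python
from types import SimpleNamespace


def check_documents(documents):
    """Raise if nothing was loaded; return documents lacking 'filename' metadata."""
    if not documents:
        raise ValueError("no documents were loaded; check the directory path")
    return [doc for doc in documents if not doc.metadata.get("filename")]


# SimpleNamespace objects stand in for LangChain Document instances (assumption).
docs = [
    SimpleNamespace(metadata={"source": "a.pdf", "filename": "a.pdf"}),
    SimpleNamespace(metadata={"source": "b.pdf"}),
]
print(len(check_documents(docs)))  # 1
```

In the notebook itself, the same function could be called on `documents`, since it only relies on each item exposing a `metadata` dictionary.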

### Loading Models

In [3]:

import getpass
import os

from dotenv import load_dotenv
from langchain_openai import AzureChatOpenAI
from langchain_openai import AzureOpenAIEmbeddings


# Load environment variables from a .env file
load_dotenv('../.env')

# Prompt for the API key if neither the environment nor the .env file provides it;
# the Azure OpenAI clients below read AZURE_OPENAI_API_KEY from the environment
if "AZURE_OPENAI_API_KEY" not in os.environ:
    os.environ["AZURE_OPENAI_API_KEY"] = getpass.getpass(
        "Enter your AzureOpenAI API key: "
    )

# Read the endpoint and deployment names
api_endpoint = os.getenv('AZURE_OPENAI_ENDPOINT')
llm_deployment_name = os.getenv('AZURE_OPENAI_MODEL_NAME')
embedding_deployment_name = os.getenv('AZURE_OPENAI_EMBEDDING_MODEL')
api_version = '2024-02-15-preview'  # this might change in the future

# Chat model passed to the test set generator
llm = AzureChatOpenAI(
    azure_endpoint=api_endpoint,
    model=llm_deployment_name,
    azure_deployment=llm_deployment_name,
    api_version=api_version,
)

# Embedding model; the deployment name is passed explicitly
embeddings = AzureOpenAIEmbeddings(
    azure_endpoint=api_endpoint,
    azure_deployment=embedding_deployment_name,
    api_version=api_version,
)
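
The cell above depends on four environment variables. A missing one tends to surface only later as a less obvious error, so it can help to check them up front. The sketch below is one way to do that with the standard library only. The helper `missing_env_vars` is hypothetical, and the example uses an explicit mapping with a placeholder endpoint rather than real credentials.

```python
import os

# Variable names taken from the cell above.
REQUIRED_VARS = (
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_MODEL_NAME",
    "AZURE_OPENAI_EMBEDDING_MODEL",
)


def missing_env_vars(names=REQUIRED_VARS, env=None):
    """Return the names that are unset or empty in `env` (default: os.environ)."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


# Example with an explicit mapping instead of the real environment.
example_env = {
    "AZURE_OPENAI_ENDPOINT": "https://example.invalid/",  # placeholder value
    "AZURE_OPENAI_API_KEY": "",  # empty counts as missing
}
print(missing_env_vars(env=example_env))
# ['AZURE_OPENAI_API_KEY', 'AZURE_OPENAI_MODEL_NAME', 'AZURE_OPENAI_EMBEDDING_MODEL']
```

Calling `missing_env_vars()` with no arguments after `load_dotenv` would check the real environment; an empty list means all four variables are present.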

### Generating the Test Set

In this step, a `TestsetGenerator` is created from the Azure OpenAI instances set up above, with the same chat model acting as both generator and critic. A test set of size 30 is then generated from the loaded documents, with the evolution types weighted 50% `simple`, 25% `reasoning`, and 25% `multi_context`. Passing `raise_exceptions=False` asks the generator not to raise on errors during generation, and `with_debugging_logs=False` turns off debugging logs.

In [None]:
generator = TestsetGenerator.from_langchain(
    generator_llm=llm, critic_llm=llm, embeddings=embeddings
)

testset = generator.generate_with_langchain_docs(
    documents,
    test_size=30,
    raise_exceptions=False,
    with_debugging_logs=False,
    distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25},
)
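
The distribution weights above can be read as fractions of `test_size`. As a rough illustration, the sketch below checks that the weights sum to 1 and computes the expected number of questions per evolution type. It assumes the weights are meant as fractions of the whole, uses plain strings as stand-ins for the Ragas evolution objects, and says nothing about how Ragas itself rounds fractional counts. The helper `expected_counts` is hypothetical.

```python
import math


def expected_counts(distributions, test_size):
    """Validate evolution weights and return the expected question count per type."""
    if any(weight < 0 for weight in distributions.values()):
        raise ValueError("weights must be non-negative")
    total = sum(distributions.values())
    if not math.isclose(total, 1.0):
        raise ValueError(f"weights sum to {total}, expected 1.0")
    return {name: weight * test_size for name, weight in distributions.items()}


# String keys stand in for the ragas evolution objects used in the notebook.
weights = {"simple": 0.5, "reasoning": 0.25, "multi_context": 0.25}
print(expected_counts(weights, test_size=30))
# {'simple': 15.0, 'reasoning': 7.5, 'multi_context': 7.5}
```

With 30 questions, the `reasoning` and `multi_context` shares come out fractional (7.5 each), so the actual per-type counts in the generated test set may differ slightly from these figures.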

In [None]:
testset.to_pandas()

In [None]:
testset.to_dataset().save_to_disk("../data/testset")