## Extra Dependencies

In [2]:
! pip install chromadb

Collecting chromadb
  Obtaining dependency information for chromadb from https://files.pythonhosted.org/packages/43/cd/a875ed1f61365c9fdb46ee2de0cbea1735a9575ff718886f7eb218d4ef45/chromadb-0.5.12-py3-none-any.whl.metadata
  Downloading chromadb-0.5.12-py3-none-any.whl.metadata (6.8 kB)
Collecting build>=1.0.3 (from chromadb)
  Obtaining dependency information for build>=1.0.3 from https://files.pythonhosted.org/packages/84/c2/80633736cd183ee4a62107413def345f7e6e3c01563dbca1417363cf957e/build-1.2.2.post1-py3-none-any.whl.metadata
  Downloading build-1.2.2.post1-py3-none-any.whl.metadata (6.5 kB)
Collecting chroma-hnswlib==0.7.6 (from chromadb)
  Obtaining dependency information for chroma-hnswlib==0.7.6 from https://files.pythonhosted.org/packages/0d/19/aa6f2139f1ff7ad23a690ebf2a511b2594ab359915d7979f76f3213e46c4/chroma_hnswlib-0.7.6-cp311-cp311-macosx_11_0_arm64.whl.metadata
  Downloading chroma_hnswlib-0.7.6-cp311-cp311-macosx_11_0_arm64.whl.metadata (252 bytes)
Collecting posthog>=2.

## Imports

In [6]:
import os
import openai 
from openai import OpenAI
import pprint
import pandas as pd
import random

import chromadb
from chromadb.utils import embedding_functions

## Chunked data collection setup

In [14]:
collection_name = "db_collection"
default_embedding_function = embedding_functions.SentenceTransformerEmbeddingFunction(model_name="all-MiniLM-L6-v2")

chroma_client = chromadb.PersistentClient(path="./chromadb/")
# declare ChromaDB collection
collection = chroma_client.get_or_create_collection(
    name=collection_name,
    embedding_function=default_embedding_function
    )

result = collection.get()

print(f"Collection {collection_name} created successfully")
pprint.pprint(result)

Collection db_collection created successfully
{'data': None,
 'documents': [],
 'embeddings': None,
 'ids': [],
 'included': ['metadatas', 'documents'],
 'metadatas': [],
 'uris': None}


In [15]:
def load_md_from_dir(dir_path):
    """
    Loads Markdown (.md) files from the specified directory.

    Args:
        dir_path (str): Path to the directory containing .md files.

    Returns:
        List[dict]: A list of dictionaries with the text content of each .md file.
    """
    md_files = [
        os.path.join(dir_path, filename) 
        for filename in os.listdir(dir_path) 
        if filename.endswith(".md")
    ]
    
    documents = []
    for file_path in md_files:
        with open(file_path, "r", encoding="utf-8") as file:
            documents.append({"text": file.read()})
    
    return documents
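
`os.listdir` returns entries in arbitrary order, and the chunk IDs built later in this notebook depend on each file's position in the loaded list. With a persistent collection and `upsert`, a different order on a later run could therefore attach the same ID to a different chunk. A minimal sketch of a deterministic listing follows; `list_md_files_sorted` is a hypothetical helper, not part of the original notebook.

```python
import os
import tempfile


def list_md_files_sorted(dir_path):
    """Return paths of .md files in dir_path, sorted by name (hypothetical helper)."""
    return sorted(
        os.path.join(dir_path, name)
        for name in os.listdir(dir_path)
        if name.endswith(".md")
    )


# Demo in a throwaway directory so the example does not depend on local files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("b.md", "a.md", "notes.txt"):
        with open(os.path.join(tmp, name), "w", encoding="utf-8") as fh:
            fh.write("# " + name)
    md_names = [os.path.basename(path) for path in list_md_files_sorted(tmp)]

print(md_names)  # ['a.md', 'b.md']
```

Sorting by filename is only one possible choice; deriving IDs from the filename itself would survive files being added or removed as well.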

In [16]:
def split_text(text, chunk_size=100, chunk_overlap=20):
    """
    Splits the input text into overlapping chunks.

    Args:
        text (str): The text to split.
        chunk_size (int): The size of each chunk. Default is 100.
        chunk_overlap (int): The number of overlapping characters between chunks. Default is 20.

    Returns:
        List[str]: A list of text chunks.
    """
    chunks = []
    text_length = len(text)
    
    for start in range(0, text_length, chunk_size - chunk_overlap):
        end = min(start + chunk_size, text_length)
        chunks.append(text[start:end])
    
    return chunks
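
To make the overlap behaviour concrete, here is a small self-contained run on a ten-character string. The function body repeats the notebook's logic so the example runs on its own; `checked_step` is a hypothetical guard, added only to illustrate the edge cases.

```python
def split_text(text, chunk_size=100, chunk_overlap=20):
    # Same logic as the notebook's split_text, repeated here for a standalone run.
    chunks = []
    for start in range(0, len(text), chunk_size - chunk_overlap):
        chunks.append(text[start:min(start + chunk_size, len(text))])
    return chunks


demo_chunks = split_text("abcdefghij", chunk_size=4, chunk_overlap=1)
print(demo_chunks)  # ['abcd', 'defg', 'ghij', 'j']


def checked_step(chunk_size, chunk_overlap):
    """Hypothetical guard: return the stride, rejecting overlaps that break the loop."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    return chunk_size - chunk_overlap
```

Two things are worth noticing. The final chunk `'j'` is a short tail already contained in the previous chunk. And if `chunk_overlap` equals `chunk_size`, `range` raises because the step is zero, while a larger overlap gives a negative step and silently returns no chunks.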

In [17]:
directory_path = "./evidently_reference/"

# load documents from directory
md_files = load_md_from_dir(directory_path)

print(f" {len(md_files)} files loaded")

# Split text into chunks
chunked_files = [
    {
        'id': f"{file_id}-{chunk_id}",
        'text': chunk,
    }
    for file_id, file in enumerate(md_files)
    for chunk_id, chunk in enumerate(split_text(file["text"], chunk_size=500, chunk_overlap=50))
]

print(f"Split into {len(chunked_files)} chunks")

 4 files loaded
Split into 270 chunks


In [18]:
# Upsert each chunk into the ChromaDB collection;
# the collection's embedding function computes the embeddings
for chunk in chunked_files:
    collection.upsert(
        ids=chunk['id'],
        documents=chunk['text'],
    )

result = collection.get()

print(f"Collection {collection_name} has {len(result['ids'])} documents")

Collection db_collection has 270 documents
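
The loop above calls `upsert` once per chunk. Since `upsert` also accepts lists of IDs and documents, the chunks could be sent in batches instead, which may reduce per-call overhead. The sketch below uses a stand-in list rather than the real `chunked_files`; `batched` and the batch size are assumptions for illustration, and the actual ChromaDB call is left as a comment so the example runs without a database.

```python
def batched(items, batch_size):
    """Yield consecutive slices of items with at most batch_size elements (hypothetical helper)."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Stand-in for the notebook's chunked_files list.
demo_chunks = [{"id": f"0-{i}", "text": f"chunk {i}"} for i in range(5)]

batch_sizes = []
for batch in batched(demo_chunks, batch_size=2):
    ids = [chunk["id"] for chunk in batch]
    documents = [chunk["text"] for chunk in batch]
    # In the notebook this would become:
    # collection.upsert(ids=ids, documents=documents)
    batch_sizes.append(len(ids))

print(batch_sizes)  # [2, 2, 1]
```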


In [19]:
# Just in case we need to delete the collection
list_collections = chroma_client.list_collections()
print(list_collections)

#chroma_client.delete_collection(collection_name)
#list_collections = chroma_client.list_collections()
#print(list_collections)

[Collection(id=9d10a2e1-2a39-4ba6-863a-577069d1d2af, name=db_collection)]


## Dataset generation: chain of prompts

In [20]:
openai.api_key = os.environ["OPENAI_API_KEY"]
client = OpenAI(
    api_key=os.environ.get("OPENAI_API_KEY"),
)

### Naive question generation

In [21]:
# Fixed size for the random list
sample_size = 10

# Generate a random list with the fixed size from the existing list
random_chuncks = [item['text'] for item in random.sample(chunked_files, min(sample_size, len(chunked_files)))]
random_chuncks

['ity Metrics than included in the `DataQualityPreset`. \n\n# How to read the tables\n\n* **Name**: the name of the Metric.  \n* **Description**: plain text explanation. For Metrics, we also specify whether it applies to the whole dataset or individual columns.\n* **Parameters**: required and optional parameters for the Metric or Preset. We also specify the defaults that apply if you do not pass a custom parameter.\n\n**Metric visualizations**. Each Metric includes a default render. To see the visualizati',
 'r><br> | **Required**:<ul><li>`k`</li></ul>**Optional**:<ul><li>-</li></ul> |\n| **PopularityBias()** <br><br> Evaluates the popularity bias in recommendations by computing ARP (average recommendation popularity), Gini index, and coverage. <br><br>Requires a training dataset. | **Required**:<ul><li>`K`</li><li>`normalize_arp (default: False)` - whether to normalize ARP calculation by the most popular item in training</li></ul>**Optional**:<ul><li>-</li></ul> |\n| **ItemBiasMetric(

In [22]:
system_prompt = "You are an assistant who generates questions based on provided context"
number_of_questions = 10
user_prompt = """
Generate {N} conceptual questions that are based on the provided context and can be answered from the information in it.
Here is a context
<context>
    {context}
</context>

Remain faithful to the underlying context. 
Avoid providing any preamble!
Avoid providing any closing statement!
Please return only a list of comma-separated generated questions in string format.
"""

context = "\n\n".join(random_chuncks)

formated_user_prompt = user_prompt.format(context=context, N=number_of_questions)

In [23]:
response = client.chat.completions.create(
    model="gpt-4o",  # Updated to a valid model
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": formated_user_prompt}
    ],
    max_tokens=400,  # Limits the response length
    temperature=0.7,  # Controls randomness in the output
    n=1
)

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


In [24]:
generated_queries = response.choices[0].message.content.strip().split(",")

In [25]:
generated_queries

['"How is the \'Name\' of a Metric used in reading tables?',
 " What information does the 'Description' section provide in the context of Metrics?",
 " What parameters are considered for the 'PopularityBias()' Metric?",
 " How does the 'TestFPR()' function operate at the dataset level?",
 " What is the role of the 'TestGiniIndex(k=k)' in evaluating dataset bias?",
 " What visualization is provided by the 'RegressionErrorDistribution()'?",
 " How does the 'RegressionErrorNormality()' assess value normality?",
 " What is the primary focus of the 'DiversityMetric' in recommendation systems?",
 " How are missing values tested in the 'TestShareOfRowsWithMissingValues()'?",
 ' How does the \'TestNumberOfDuplicatedRows()\' function evaluate dataset integrity?"']
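
The output above keeps a stray `"` on the first and last items, and splitting on `,` would also cut any question that itself contains a comma. One possible alternative, sketched below, splits on question marks instead. `parse_questions` is a hypothetical helper and assumes every generated item ends with `?`; it also strips a leading quote character even when that quote belongs to the question.

```python
import re


def parse_questions(raw):
    """Split a model reply into questions, using '?' as the delimiter instead of ','."""
    # Each match is the shortest run of text that ends with a question mark.
    candidates = re.findall(r"[^?]+\?", raw)
    # Remove separators, brackets and quotes the model may wrap around each question.
    return [candidate.strip(" \n\t,\"'[]") for candidate in candidates]


raw_reply = '"How does X work, and what does it need?, What is `Y()` used for?"'
parsed = parse_questions(raw_reply)
print(parsed)
# ['How does X work, and what does it need?', 'What is `Y()` used for?']
```

Asking the model for one question per line, or for JSON, would be a more robust contract than parsing free text, but that changes the prompt rather than the parsing.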

### [PLEASE IGNORE THE WHOLE BLOCK] Get alternative questions

In [1]:
# Not used so far
seed_query = "How do I get Evidently data drift report for my data?"

In [26]:
#random seed question generation
system_prompt = "You are an assistant who generates questions based on provided context"
user_prompt = """
Generate a conceptual question that is based on the provided context and can be answered from the information in it.
Here is a context
<context>
    {context}
</context>

Remain faithful to the underlying context. 
Avoid providing any preamble!
Avoid providing any closing statement!
Please return only a question
"""

context = "\n\n".join(random_chuncks)

# This prompt has no {N} placeholder, so only the context is passed
formated_user_prompt = user_prompt.format(context=context)

response = client.chat.completions.create(
    model="gpt-4o",  # Updated to a valid model
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": formated_user_prompt}
    ],
    max_tokens=400,  # Limits the response length
    temperature=0.7,  # Controls randomness in the output
    n=1
)

In [27]:
generated_seed = response.choices[0].message.content.strip().split(",")

In [28]:
generated_seed

['How does the `PopularityBias()` metric evaluate recommendation systems',
 ' and what parameters are required to compute this metric?']

In [29]:
#do not forget to write a prompt for seed query generation
system_prompt = "You are a smart assistant who helps rephrase questions" 

number_of_reformulations = 5

seed_query = generated_seed

user_prompt = """Write {number_of_reformulations} alternative questions that closely resemble the question below.
The question: {seed_query}

Return a list of questions.
This should be only a list of string questions, separated by commas.
"""

formated_user_prompt = user_prompt.format(number_of_reformulations=number_of_reformulations, 
                                          seed_query=seed_query)

In [30]:
# Ask the OpenAI API to expand the seed question

response = client.chat.completions.create(
    model="gpt-4o",  # Updated to a valid model
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": formated_user_prompt}
    ],
    max_tokens=400,  # Limits the response length
    temperature=0.7,  # Controls randomness in the output
    n=1
)

In [31]:
completion_text = response.choices[0].message.content
print(f"Generated Completion:\n{completion_text}")

queries = completion_text.strip().split(",")
queries

Generated Completion:
['What parameters are needed to calculate the `PopularityBias()` metric in evaluating recommendation systems?', 'In what way does the `PopularityBias()` metric assess recommendation systems, and what are the necessary parameters?', 'Which parameters are essential for the `PopularityBias()` metric, and how does it evaluate recommendation systems?', 'How is the `PopularityBias()` metric used to evaluate recommendation systems, and what parameters does it need?', 'What is the role of the `PopularityBias()` metric in assessing recommendation systems, and which parameters are required for its computation?']


["['What parameters are needed to calculate the `PopularityBias()` metric in evaluating recommendation systems?'",
 " 'In what way does the `PopularityBias()` metric assess recommendation systems",
 " and what are the necessary parameters?'",
 " 'Which parameters are essential for the `PopularityBias()` metric",
 " and how does it evaluate recommendation systems?'",
 " 'How is the `PopularityBias()` metric used to evaluate recommendation systems",
 " and what parameters does it need?'",
 " 'What is the role of the `PopularityBias()` metric in assessing recommendation systems",
 " and which parameters are required for its computation?']"]

### Find relevant chunks

In [32]:
def query_collection(question, n_results = 3):
    """
    Queries the collection with a given question and returns the relevant text chunks.
    
    Args:
        question (str): The query or question text to search for.
        n_results (int): Number of results to retrieve. Default is 3.

    Returns:
        List[str]: A list of relevant text chunks.
    """
    # Perform the query
    results = collection.query(
        query_texts=question,
        n_results=n_results,
        # include=['embeddings', 'documents', 'distances']
    )

    # Extract relevant text chunks from the documents
    relevant_chunks = [
        chunk for document in results["documents"] for chunk in document
    ]
    
    return relevant_chunks
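
`collection.query` returns `results["documents"]` as a list of lists, with one inner list per query text, and the comprehension above flattens them into a single list. That is fine for a single question string, but when a list of strings is passed (as with `seed_query`, which is a list in this notebook), hits for different query texts end up mixed together. The sketch below uses a hand-written stand-in for the result dictionary; the chunk and question strings are made up for illustration.

```python
# Stand-in for the dictionary returned by collection.query for two query texts.
demo_results = {
    "documents": [
        ["chunk A1", "chunk A2"],  # hits for the first query text
        ["chunk B1", "chunk B2"],  # hits for the second query text
    ]
}
demo_questions = ["first question", "second question"]

# What query_collection does: one flat list, query boundaries are lost.
flattened = [chunk for hits in demo_results["documents"] for chunk in hits]

# Alternative: keep the hits grouped by the question that produced them.
per_question = dict(zip(demo_questions, demo_results["documents"]))

print(flattened)
print(per_question["second question"])
```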

In [33]:
query_collection(seed_query)

[' times item *i* was rated in the training set (popularity of item *i*)\n\n**Range**: 0 to infinity \n\n**Interpretation**: the higher the value, the more popular on average the recommendations are in top-K.  \n\n**Note**: This metric is not normalized and depends on the number of recommendations in the training set.\n\nFurther reading: [Abdollahpouri, H., Mansoury, M., Burke, R., Mobasher, B., & Malthouse, E. (2021). User-centered Evaluation of Popularity Bias in Recommender Systems](https://dl.acm.org/',
 'bdollahpouri, H., Mansoury, M., Burke, R., Mobasher, B., & Malthouse, E. (2021). User-centered Evaluation of Popularity Bias in Recommender Systems](https://dl.acm.org/doi/fullHtml/10.1145/3450613.3456821)\n\n# Recommendation table\n\n![](../.gitbook/assets/reports/metric_recsys_table-min.png)\n\n**Evidently Metric**: `RecCasesTable`\n\nThis visual Metric shows the list of recommendations for the specified user IDs (`user_ids: List`). If you do not pass the list of IDs, Evidently 

In [36]:
#relevant_chunks = [query_collection(query) for query in queries]
relevant_chunks = [query_collection(query) for query in generated_queries]

In [37]:
relevant_chunks

[['ity Metrics than included in the `DataQualityPreset`. \n\n# How to read the tables\n\n* **Name**: the name of the Metric.  \n* **Description**: plain text explanation. For Metrics, we also specify whether it applies to the whole dataset or individual columns.\n* **Parameters**: required and optional parameters for the Metric or Preset. We also specify the defaults that apply if you do not pass a custom parameter.\n\n**Metric visualizations**. Each Metric includes a default render. To see the visualizati',
  'igate the sections. \n\n# How to read the tables\n\n* **Name**: the name of the Test or Test preset.  \n* **Description**: plain text explanation. For Tests, we specify whether it applies to the whole dataset or individual columns.\n* **Parameters**: available configurations. \n  * Required parameters are necessary for calculations, e.g. a column name for a column-level test.\n  * Optional parameters modify how the underlying metric is calculated, e.g. which statistical test or 

### Baseline answer generation

In [38]:
# Ask the OpenAI API to answer a generated question using the relevant context

def generate_baseline_answer(query, relevant_chunks):
    system_prompt = "You are a helpful assistant that answers a given question directly without any preamble"

    user_prompt = """
    Your task is to answer the following query: 
    <query>
    {query}
    </query>
    
    You have access to the following documents which are meant to provide context as you answer the query:
    <documents>
    {context}
    </documents>
    
    Please remain faithful to the underlying context, and deviate from it only if you haven't found the answer in the provided context. 
    Avoid providing any preamble!
    Avoid providing any closing statement!
    Please return the answer only
    """
    
    context = "\n\n".join(relevant_chunks)
    formated_user_prompt = user_prompt.format(query=query, context=context)

    response = client.chat.completions.create(
        model="gpt-4o",  # Updated to a valid model
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": formated_user_prompt}
        ],
        max_tokens=400,  # Limits the response length
        temperature=0.7,  # Controls randomness in the output
        n=1
    )
    
    completion_text = response.choices[0].message.content
    return completion_text

In [40]:
# zip stops at the shorter list, matching the original min(len(...), len(...)) bound
baseline_answers = [
    generate_baseline_answer(query, chunks)
    for query, chunks in zip(generated_queries, relevant_chunks)
]

In [41]:
generated_dataset = pd.DataFrame({
    'Query': generated_queries,
    'Relevant chunks': relevant_chunks,
    'Baseline_answers': baseline_answers
})

In [42]:
generated_dataset

Unnamed: 0,Query,Relevant chunks,Baseline_answers
0,"""How is the 'Name' of a Metric used in reading...",[ity Metrics than included in the `DataQuality...,The 'Name' of a Metric is used to identify the...
1,What information does the 'Description' secti...,"[---\ndescription: List of Metrics, Descriptor...",The 'Description' section provides a plain tex...
2,What parameters are considered for the 'Popul...,[reports/metric_popularity_bias-min.png)\n\n**...,"ARP, Coverage, and Gini index are the paramete..."
3,How does the 'TestFPR()' function operate at ...,[th reference**: the test fails if the TNR is ...,The `TestFPR()` function operates at the datas...
4,What is the role of the 'TestGiniIndex(k=k)' ...,"[wer, the test fails.<br><br>**No reference**:...",The role of the 'TestGiniIndex(k=k)' in evalua...
5,What visualization is provided by the 'Regres...,[ter plot. | **Required:**<br>n/a<br><br>**Opt...,Visualizes the distribution of the model error...
6,How does the 'RegressionErrorNormality()' ass...,[rcentage error in a line plot. | **Required:*...,RegressionErrorNormality() assesses value norm...
7,What is the primary focus of the 'DiversityMe...,[\n**Note**: Only a single top relevant item i...,The primary focus of the 'DiversityMetric' in ...
8,How are missing values tested in the 'TestSha...,[r> **Optional**: <ul><li>`missing_values = []...,The 'TestShareOfRowsWithMissingValues()' tests...
9,How does the 'TestNumberOfDuplicatedRows()' f...,[*: the test fails if there is at least one em...,The 'TestNumberOfDuplicatedRows()' function ev...


In [43]:
pd.set_option("display.max_colwidth", None)

In [44]:
generated_dataset[["Query", "Baseline_answers"]]

| | Query | Baseline_answers |
|---|---|---|
| 0 | "How is the 'Name' of a Metric used in reading tables? | The 'Name' of a Metric is used to identify the specific Metric being referenced. |
| 1 | What information does the 'Description' section provide in the context of Metrics? | The 'Description' section provides a plain text explanation of the Metric, specifying whether it applies to the whole dataset or individual columns. |
| 2 | What parameters are considered for the 'PopularityBias()' Metric? | ARP, Coverage, and Gini index are the parameters considered for the 'PopularityBias()' Metric. |
| 3 | How does the 'TestFPR()' function operate at the dataset level? | The `TestFPR()` function operates at the dataset level by computing the False Positive Rate (FPR) and comparing it to a reference or against a defined condition. |
| 4 | What is the role of the 'TestGiniIndex(k=k)' in evaluating dataset bias? | The role of the 'TestGiniIndex(k=k)' in evaluating dataset bias is to compute the Gini Index at the top K recommendations and compare it to a reference or a defined condition. If the Gini Index at the top K is over 10% higher or lower than the reference, the test fails. This helps in assessing the fairness and distribution of recommendations, indicating potential bias if the Gini Index significantly deviates from the reference. |
| 5 | What visualization is provided by the 'RegressionErrorDistribution()'? | Visualizes the distribution of the model error in a histogram. |
| 6 | How does the 'RegressionErrorNormality()' assess value normality? | RegressionErrorNormality() assesses value normality by visualizing the quantile-quantile plot (Q-Q plot). |
| 7 | What is the primary focus of the 'DiversityMetric' in recommendation systems? | The primary focus of the 'DiversityMetric' in recommendation systems is to measure the average intra-list diversity at K, reflecting the variety of items within the same user's recommendation list, averaged by all users. |
| 8 | How are missing values tested in the 'TestShareOfRowsWithMissingValues()'? | The 'TestShareOfRowsWithMissingValues()' tests the share of rows that contain missing values against a reference or a defined condition. With reference, the test fails if the share of rows with missing values is over 10% higher than in the reference. Without reference, the test fails if the dataset contains any rows with missing values. |
| 9 | How does the 'TestNumberOfDuplicatedRows()' function evaluate dataset integrity?" | The 'TestNumberOfDuplicatedRows()' function evaluates dataset integrity by testing the number of duplicate rows against a reference or a defined condition. If a reference is provided, the test fails if the share of duplicate rows is over 10% higher or lower than in the reference. If no reference is provided, the test fails if there is at least one duplicate row. |
