## Extra Dependencies

In [2]:
! pip install chromadb

Collecting chromadb
  Obtaining dependency information for chromadb from https://files.pythonhosted.org/packages/43/cd/a875ed1f61365c9fdb46ee2de0cbea1735a9575ff718886f7eb218d4ef45/chromadb-0.5.12-py3-none-any.whl.metadata
  Downloading chromadb-0.5.12-py3-none-any.whl.metadata (6.8 kB)
Collecting build>=1.0.3 (from chromadb)
  Obtaining dependency information for build>=1.0.3 from https://files.pythonhosted.org/packages/84/c2/80633736cd183ee4a62107413def345f7e6e3c01563dbca1417363cf957e/build-1.2.2.post1-py3-none-any.whl.metadata
  Downloading build-1.2.2.post1-py3-none-any.whl.metadata (6.5 kB)
Collecting chroma-hnswlib==0.7.6 (from chromadb)
  Obtaining dependency information for chroma-hnswlib==0.7.6 from https://files.pythonhosted.org/packages/0d/19/aa6f2139f1ff7ad23a690ebf2a511b2594ab359915d7979f76f3213e46c4/chroma_hnswlib-0.7.6-cp311-cp311-macosx_11_0_arm64.whl.metadata
  Downloading chroma_hnswlib-0.7.6-cp311-cp311-macosx_11_0_arm64.whl.metadata (252 bytes)
Collecting posthog>=2.

## Imports

In [98]:
import os
import openai 
from openai import OpenAI
import pprint
import pandas as pd
import random

import chromadb
from chromadb.utils import embedding_functions

## Chunked data collection setup

In [57]:
collection_name = "db_collection"
default_embedding_function = embedding_functions.SentenceTransformerEmbeddingFunction(model_name="all-MiniLM-L6-v2")

chroma_client = chromadb.PersistentClient(path="./chromadb/")
# declare ChromaDB collection
collection = chroma_client.get_or_create_collection(
    name=collection_name,
    embedding_function=default_embedding_function
    )

result = collection.get()

print(f"Collection {collection_name} created successfully")
pprint.pprint(result)

Collection db_collection created successfully
{'data': None,
 'documents': [],
 'embeddings': None,
 'ids': [],
 'included': ['metadatas', 'documents'],
 'metadatas': [],
 'uris': None}


In [13]:
def load_md_from_dir(dir_path):
    """
    Loads Markdown (.md) files from the specified directory.

    Args:
        dir_path (str): Path to the directory containing .md files.

    Returns:
        List[dict]: A list of dictionaries with the text content of each .md file.
    """
    md_files = [
        os.path.join(dir_path, filename) 
        for filename in os.listdir(dir_path) 
        if filename.endswith(".md")
    ]
    
    documents = []
    for file_path in md_files:
        with open(file_path, "r", encoding="utf-8") as file:
            documents.append({"text": file.read()})
    
    return documents

In [18]:
def split_text(text, chunk_size=100, chunk_overlap=20):
    """
    Splits the input text into overlapping chunks.

    Args:
        text (str): The text to split.
        chunk_size (int): The size of each chunk. Default is 100.
        chunk_overlap (int): The number of overlapping characters between chunks. Default is 20.

    Returns:
        List[str]: A list of text chunks.
    """
    chunks = []
    text_length = len(text)
    
    for start in range(0, text_length, chunk_size - chunk_overlap):
        end = min(start + chunk_size, text_length)
        chunks.append(text[start:end])
    
    return chunks

In [58]:
directory_path = "./evidently_reference/"

# load documents from directory
md_files = load_md_from_dir(directory_path)

print(f" {len(md_files)} files loaded")

# Split text into chunks
chunked_files = [
    {
        'id': f"{file_id}-{chunk_id}",
        'text': chunk,
    }
    for file_id, file in enumerate(md_files)
    for chunk_id, chunk in enumerate(split_text(file["text"], chunk_size=500, chunk_overlap=50))
]

print(f"Split in to {len(chunked_files)} chunks")

 4 files loaded
Split in to 270 chunks


In [59]:
# insert documents with embeddings to collection ChromaDB
for chunk in chunked_files:
    collection.upsert(
            ids=chunk['id'],
            documents=chunk['text'],
    )

result = collection.get()

print(f"Collection {collection_name} has {len(result['ids'])} documents")

Collection db_collection has 270 documents


In [56]:
#Just incase we need to delete collection
#list_collections = chroma_client.list_collections()
#print(list_collections)

chroma_client.delete_collection(collection_name)
list_collections = chroma_client.list_collections()
print(list_collections)

[]


## Dataset Generation chain of promts

In [39]:
openai.api_key = os.environ["OPENAI_API_KEY"]
client = OpenAI(
    api_key=os.environ.get("OPENAI_API_KEY"),
)

### Get a seed query

In [None]:
seed_query = "How do I get Evidently data drift report for my data?"

In [131]:
# Fixed size for the random list
sample_size = 10

# Generate a random list with the fixed size from the existing list
random_chuncks = [item['text'] for item in random.sample(chunked_files, min(sample_size, len(chunked_files)))]
random_chuncks

['ods](../customization/options-for-statistical-tests.md).\n{% endhint %}\n\n## Text Data \n\n![](../.gitbook/assets/reports/metric_column_drift_text-min.png)\n\nText content drift using a **domain classifier**. Evidently trains a binary classification model to discriminate between data from reference and current distributions. \n\nThe default for **small data with <= 1000 observations** detects drift if the ROC AUC of the drift detection classifier > possible ROC AUC of the random classifier at a 95th per',
 'score and compares it to the reference or against a defined condition. | **Required**:<br>N/A<br><br> **Optional**:<ul><li>`threshold_probas`(default for classification = None; default for probabilistic classification = 0.5)</li><li>`k`</li></ul> **Test conditions**: <ul><li>*standard parameters*</li></ul>| Expects +/-20% or better than a dummy model.<br><br>**With reference**: if the F1 is over 20% higher or lower, the test fails.<br><br>**No reference**: if the F1 is lower than

In [111]:
system_prompt = "You are an assisstant who generates questions based on provided context"
number_of_questions = 10
user_prompt = """
Generate {N} conceptual questions based on the provided context and can be answered from the information in the provided context.
Here is a context
<context>
    {context}
</context>

Remain faithful to the underlying context. 
Avoid providing any preamble!
Avoid providing any closing statement!
Please return only a list of coma separated generated questions in string format.
"""

context = "\n\n".join(random_chuncks)

formated_user_prompt = user_prompt.format(context=context, N=number_of_questions)

In [112]:
response = client.chat.completions.create(
    model="gpt-4o",  # Updated to a valid model
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": formated_user_prompt}
    ],
    max_tokens=400,  # Limits the response length
    temperature=0.7,  # Controls randomness in the output
    n=1
)

In [117]:
generated_seed_queries = response.choices[0].message.content.strip().split(",")

In [118]:
generated_seed_queries

['"How does the TestShareOfColumnsWithMissingValues function determine if a dataset fails the test with reference?',
 ' What optional parameters can be included in the TestShareOfColumnsWithMissingValues function?',
 ' What is the purpose of the HuggingFaceModel function?',
 ' How does the HuggingFaceToxicityModel function detect hate speech?',
 ' What condition causes the TestNumberOfDuplicatedRows to fail without a reference?',
 ' What is measured by the TestShareOfDriftedColumns function?',
 ' What are the required and optional parameters for the ScoreDistribution function?',
 ' What is the role of the ColumnSummaryMetric in the DataQualityPreset?',
 ' How does the drift detection method choose the appropriate test for each column?',
 ' How is AP@K calculated in the context of relevant item positions?"']

### Get alternative questions

In [108]:
#do not forget to write a prompt for seed query generation
system_prompt = "You are a smart assistant who helps rephrase questions" 

number_of_reformulations = 5

seed_query = "How do I get Evidently data drift report for my data?"

user_prompt = """Write for me {number_of_reformulations} alternative questions quite similar to the question you got.
The question: {seed_query}

Return a list of questions.
This should be only a list of string questions, separated by comma
"""

formated_user_prompt = user_prompt.format(number_of_reformulations=number_of_reformulations, 
                                          seed_query = generated_seed_query)
                         #seed_query=seed_query)

In [109]:
# Make a request to the OpenAI to expand a seed question

response = client.chat.completions.create(
    model="gpt-4o",  # Updated to a valid model
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": formated_user_prompt}
    ],
    max_tokens=400,  # Limits the response length
    temperature=0.7,  # Controls randomness in the output
    n=1
)

In [110]:
completion_text = response.choices[0].message.content
print(f"Generated Completion:\n{completion_text}")

queries = completion_text.strip().split(",")
queries

Generated Completion:
What criteria does the `TestShareOfColumnsWithMissingValues()` function use to identify failure without a reference dataset?, How does the absence of a reference affect the `TestShareOfColumnsWithMissingValues()` function's failure detection?, In what way does the `TestShareOfColumnsWithMissingValues()` function assess failure without having a reference?, How is failure determined by the `TestShareOfColumnsWithMissingValues()` function when a reference is not given?, What is the method used by the `TestShareOfColumnsWithMissingValues()` function to evaluate failure without a reference dataset?


['What criteria does the `TestShareOfColumnsWithMissingValues()` function use to identify failure without a reference dataset?',
 " How does the absence of a reference affect the `TestShareOfColumnsWithMissingValues()` function's failure detection?",
 ' In what way does the `TestShareOfColumnsWithMissingValues()` function assess failure without having a reference?',
 ' How is failure determined by the `TestShareOfColumnsWithMissingValues()` function when a reference is not given?',
 ' What is the method used by the `TestShareOfColumnsWithMissingValues()` function to evaluate failure without a reference dataset?']

### Find relevant chuncks

In [54]:
def query_collection(question, n_results = 3):
    """
    Queries the collection with a given question and returns the relevant text chunks.
    
    Args:
        question (str): The query or question text to search for.
        n_results (int): Number of results to retrieve. Default is 3.

    Returns:
        List[str]: A list of relevant text chunks.
    """
    # Perform the query
    results = collection.query(
        query_texts=question,
        n_results=n_results,
        # include=['embeddings', 'documents', 'distances']
    )

    # Extract relevant text chunks from the documents
    relevant_chunks = [
        chunk for document in results["documents"] for chunk in document
    ]
    
    return relevant_chunks

In [71]:
query_collection(seed_query)

['how to detect drift in ML embeddings](https://www.evidentlyai.com/blog/embedding-drift-detection).  \n\nAdditional links:  \n\n* [How to interpret data and prediction drift together? ](https://evidentlyai.com/blog/data-and-prediction-drift)  \n\n* [Do I need to monitor data drift if I can measure the ML model quality?](https://evidentlyai.com/blog/ml-monitoring-do-i-need-data-drift)  \n\n* ["My data drifted. What\'s next?" How to handle ML model drift in production.](https://evidentlyai.com/blog/ml-monit',
 'arget). </li><li> Returns predicted probability for the “hate” label. </li><li> Scale: 0 to 1. </li></ul> | **Optional**: <ul><li>`toxic_label="hate"` (default)</li><li> `display_name`</li></ul> |\n\n# Data Drift\n\n**Defaults for Data Drift**. By default, all data drift metrics use the Evidently [drift detection logic](data-drift-algorithm.md) that selects a drift detection method based on feature type and volume. You always need a reference dataset.\n\nTo modify the logic or se

In [119]:
#relevant_chunks = [query_collection(query) for query in queries]
relevant_chunks = [query_collection(query) for query in generated_seed_queries]

In [120]:
relevant_chunks

[['r><br>**With reference**: the test fails if the number of columns with missing values is higher than in reference.  <br>**No reference**: the test fails if the dataset contains columns with missing values.|\n| **TestShareOfColumnsWithMissingValues()** | Dataset-level. <br><br> Tests the share of columns that contain missing values in the dataset against the reference or a defined condition.| **Required**:<br> N/A <br><br> **Optional**: <ul><li>`missing_values = [], replace = True/False` (default ',
  '**With reference**: the test fails if the share of rows with missing values is over 10% higher than in reference. <br><br>**No reference**: the test fails if the dataset contains rows with missing values.|\n| **TestNumberOfDifferentMissingValues()**| Dataset-level. <br><br> Tests the number of differently encoded missing values in the dataset against the reference or a defined condition. Detects 4 types of missing values by default and/or values from a user list. | **Required**:<br>N/A

### Baseline answer generation

In [90]:
# Make a request to the OpenAI to answer generated question with relevant context

def generate_baseline_answer(query, relevant_chunks):
    system_prompt = "You are a helpful assistant thet answer a given question directly withou any preamble"

    user_prompt = """
    Your task is to answer the following query: 
    <query>
    {query}
    </query>
    
    You have access to the following documents which are meant to provide context as you answer the query:
    <documents>
    {context}
    </documents>
    
    Please remain faithful to the underlying context, and deviate from it only if you haven't found the answer in the provided context. 
    Avoid providing any preamble!
    Avoid providing any closing statement!
    Please return the answer only
    """
    
    context = "\n\n".join(relevant_chunks)
    formated_user_prompt = user_prompt.format(query=query, context=context)

    response = client.chat.completions.create(
        model="gpt-4o",  # Updated to a valid model
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": formated_user_prompt}
        ],
        max_tokens=400,  # Limits the response length
        temperature=0.7,  # Controls randomness in the output
        n=1
    )
    
    completion_text = response.choices[0].message.content
    return completion_text

In [126]:
baseline_answers = [generate_baseline_answer(generated_seed_queries[i], relevant_chunks[i]) for i in range(min(len(generated_seed_queries), len(relevant_chunks)))]

In [127]:
generated_dataset = pd.DataFrame({
    'Query': generated_seed_queries,
    'Relevant chunks': relevant_chunks,
    'Baseline_answers': baseline_answers
})

In [128]:
generated_dataset

Unnamed: 0,Query,Relevant chunks,Baseline_answers
0,"""How does the TestShareOfColumnsWithMissingValues function determine if a dataset fails the test with reference?","[r><br>**With reference**: the test fails if the number of columns with missing values is higher than in reference. <br>**No reference**: the test fails if the dataset contains columns with missing values.|\n| **TestShareOfColumnsWithMissingValues()** | Dataset-level. <br><br> Tests the share of columns that contain missing values in the dataset against the reference or a defined condition.| **Required**:<br> N/A <br><br> **Optional**: <ul><li>`missing_values = [], replace = True/False` (default , **With reference**: the test fails if the share of rows with missing values is over 10% higher than in reference. <br><br>**No reference**: the test fails if the dataset contains rows with missing values.|\n| **TestNumberOfDifferentMissingValues()**| Dataset-level. <br><br> Tests the number of differently encoded missing values in the dataset against the reference or a defined condition. Detects 4 types of missing values by default and/or values from a user list. | **Required**:<br>N/A<br><br>**O, test fails if the dataset contains rows with missing values.|\n| **TestShareOfRowsWithMissingValues()** | Dataset-level. <br><br> Tests the share of rows that contain missing values against the reference or a defined condition. | **Required**:<br>N/A<br><br>**Optional**:<ul><li>`missing_values = [], replace = True/False` (default = default list)</li></ul>**Test conditions** <ul><li>*standard parameters*</li></ul>| Expects up to +10% or 0.<br><br>**With reference**: the test fails if the share of]",The TestShareOfColumnsWithMissingValues function determines that a dataset fails the test if the number of columns with missing values is higher than in the reference dataset.
1,What optional parameters can be included in the TestShareOfColumnsWithMissingValues function?,"[the Test's defaults. You can see them in the tables below. The listed Preset parameters apply to the relevant individual Tests inside the Preset.\n\n<details>\n \n<summary>NoTargetPerformance Test Preset</summary>\n\nPreset name: `NoTargetPerformanceTestPreset()`\n\n**Composition**: \n* `TestShareOfDriftedColumns()`\n* `TestColumnDrift(column_name=prediction)`\n* `TestColumnShareOfMissingValues()` for `all` or `сolumns` if provided\n* `TestShareOfOutRangeValues()` for all numerical or specified `columns`\n* , lumnsType()`\n* `TestColumnShareOfMissingValues()` for all or specified `columns`\n* `TestShareOfOutRangeValues()` for all numerical or specified `columns`\n* `TestShareOfOutListValues()` for all categorical or specified `columns`\n* `TestMeanInNSigmas()` for all numerical or specified `columns`\n\n**Optional parameters**: \n* `columns`\n\n</details>\n\n<details>\n \n<summary>Data Quality Test Preset</summary>\n\nPreset name: `DataQualityTestPreset()`\n\n**Composition**: \n* `TestColumnShareOfMissingValues()` fo, **: N/A |\n\n## Column Values\n\n| Test name | Description | Parameters | Default test conditions | \n|---|---|---|---|\n| **TestColumnValueMin**(column_name='num-column') | Column-level. <br><br> Tests the minimum value of a given numerical column against reference or a defined condition. | **Required**:<ul><li>`column_name`</li></ul> **Optional:** N/A <br><br> **Test conditions**: <ul><li>*standard parameters*</li></ul> | Expects not lower.<br><br>**With reference**: the test fails if the minimum ]",The optional parameters for the `TestShareOfColumnsWithMissingValues` function are `columns`.
2,What is the purpose of the HuggingFaceModel function?,"[tems by a chosen characteristic.\n\nThe visualization shows:\n* The distribution of items in the training set for the defined `column_name` (with duplicates dropped). This represents the item catalog by this dimension. \n* The distribution of the recommended items for the defined `column_name` in the current and reference (if available) datasets. \n\nThis visualization helps see the patterns in the model recommendations. In a simplified example, you might observe that the training data contains 3x com, ity Metrics than included in the `DataQualityPreset`. \n\n# How to read the tables\n\n* **Name**: the name of the Metric. \n* **Description**: plain text explanation. For Metrics, we also specify whether it applies to the whole dataset or individual columns.\n* **Parameters**: required and optional parameters for the Metric or Preset. We also specify the defaults that apply if you do not pass a custom parameter.\n\n**Metric visualizations**. Each Metric includes a default render. To see the visualizati, igate the sections. \n\n# How to read the tables\n\n* **Name**: the name of the Test or Test preset. \n* **Description**: plain text explanation. For Tests, we specify whether it applies to the whole dataset or individual columns.\n* **Parameters**: available configurations. \n * Required parameters are necessary for calculations, e.g. a column name for a column-level test.\n * Optional parameters modify how the underlying metric is calculated, e.g. which statistical test or correlation method is use]",The purpose of the HuggingFaceModel function is not specified in the provided documents.
3,How does the HuggingFaceToxicityModel function detect hate speech?,"[l>| **Required:**<br>n/a<br><br>**Optional:**<ul><li>`display_name`</li></ul> |\n| **HuggingFaceModel()** <br><br> Scores the text using the user-selected HuggingFace model.| See [docs](../customization/huggingface_descriptor.md) with some example models (classification by topic, emotion, etc.)|\n| **HuggingFaceToxicityModel()** <ul><li> Detects hate speech using [HuggingFace Model](https://huggingface.co/facebook/roberta-hate-speech-dynabench-r4-target). </li><li> Returns predicted probability fo, xts (containing critical or pessimistic tone). Returns a label (NEGATIVE or POSITIVE) or score.| See [docs](../customization/llm_as_a_judge.md) for parameters.|\n| **BiasLLMEval()** <br><br> Detects biased texts (containing prejudice for or against a person or group). Returns a label (BIAS or OK) or score.| See [docs](../customization/llm_as_a_judge.md) for parameters.|\n| **ToxicityLLMEval()** <br><br> Detects toxic texts (containing harmful, offensive, or derogatory language). Returns a label (T, arget). </li><li> Returns predicted probability for the “hate” label. </li><li> Scale: 0 to 1. </li></ul> | **Optional**: <ul><li>`toxic_label=""hate""` (default)</li><li> `display_name`</li></ul> |\n\n# Data Drift\n\n**Defaults for Data Drift**. By default, all data drift metrics use the Evidently [drift detection logic](data-drift-algorithm.md) that selects a drift detection method based on feature type and volume. You always need a reference dataset.\n\nTo modify the logic or select a different test,]","The HuggingFaceToxicityModel function detects hate speech using the HuggingFace model found at https://huggingface.co/facebook/roberta-hate-speech-dynabench-r4-target. It returns a predicted probability for the ""hate"" label, with a scale from 0 to 1."
4,What condition causes the TestNumberOfDuplicatedRows to fail without a reference?,"[*: the test fails if there is at least one empty column.|\n| **TestNumberOfDuplicatedRows()** | Dataset-level. <br><br> Tests the number of duplicate rows against reference or a defined condition. |**Required**:<br> N/A <br><br> **Optional**:<br> N/A <br><br>**Test conditions**: <ul><li>*standard parameters*</li></ul>| Expects +/- 10% or none.<br><br>**With reference**: the test fails if the share of duplicate rows is over 10% higher or lower than in the reference.<br><br>**No reference**: the te, **With reference**: the test fails if the share of rows with missing values is over 10% higher than in reference. <br><br>**No reference**: the test fails if the dataset contains rows with missing values.|\n| **TestNumberOfDifferentMissingValues()**| Dataset-level. <br><br> Tests the number of differently encoded missing values in the dataset against the reference or a defined condition. Detects 4 types of missing values by default and/or values from a user list. | **Required**:<br>N/A<br><br>**O, in the reference.<br><br>**No reference**: the test fails if there is at least one duplicate row. |\n| **TestNumberOfDuplicatedColumns()** | Dataset-level. <br><br> Tests the number of duplicate columns against reference or a defined condition. |**Required**:<br> N/A <br><br> **Optional**:<br> N/A <br><br>**Test conditions**: <ul><li>*standard parameters*</li></ul>| Expects =< or none.<br><br>**With reference**: the test fails if the number of duplicate columns is higher than in the reference.<b]",The test fails if there is at least one duplicate row.
5,What is measured by the TestShareOfDriftedColumns function?,"[lumnsType()`\n* `TestColumnShareOfMissingValues()` for all or specified `columns`\n* `TestShareOfOutRangeValues()` for all numerical or specified `columns`\n* `TestShareOfOutListValues()` for all categorical or specified `columns`\n* `TestMeanInNSigmas()` for all numerical or specified `columns`\n\n**Optional parameters**: \n* `columns`\n\n</details>\n\n<details>\n \n<summary>Data Quality Test Preset</summary>\n\nPreset name: `DataQualityTestPreset()`\n\n**Composition**: \n* `TestColumnShareOfMissingValues()` fo, sition**: \n* `TestColumnShareOfMissingValues()` for all or specified `columns`\n* `TestMostCommonValueShare()` for all or specified `columns`\n* `TestNumberOfConstantColumns()`\n* `TestNumberOfDuplicatedColumns()`\n* `TestNumberOfDuplicatedRows()`\n\n**Optional parameters**: \n* `columns`\n\n</details>\n\n<details>\n \n<summary>Data Drift Test Preset</summary>\n\nPreset name: `DataDriftTestPreset()`\n\n**Composition**: \n* `TestShareOfDriftedColumns()`\n* `TestColumnDrift()` for all or specified `columns`\n\n**Optio, 10%.<br><br>**With reference**: the test fails if the median value is different by more than 10%.<br><br>**No reference**: N/A |\n| **TestColumnValueStd**(column_name='num-column')<br>| Column-level. <br><br> Tests the standard deviation of a given numerical column against reference or a defined condition. | **Required**:<ul><li>`column_name`</li></ul> **Optional:**<br> N/A <br><br> **Test conditions**: <ul><li>*standard parameters*</li></ul> | Expects +/-10%.<br><br>**With reference**: the tes]",The `TestShareOfDriftedColumns` function measures the proportion of columns that have drifted between datasets.
6,What are the required and optional parameters for the ScoreDistribution function?,"[in the training dataset.<br><br>Requires a training dataset. | **Required**:<ul><li>`k`</li><li>`column_name`</li></ul>**Optional**:<ul><li>-</li></ul> |\n| **ScoreDistribution()** <br><br> Computes the predicted score entropy. Visualizes the distribution of the scores at `k` (and all scores, if available).<br><br>Applies only when the `recommendations_type` is a `score`. | **Required**:<ul><li>`k`</li></ul>**Optional**:<ul><li>-</li></ul> |\n| **RecCasesTable()** <br><br> Shows the list of recomm, Evidently Metric**: `ScoreDistribution`\n\nThis metric computes the predicted score entropy. It applies only when the `recommendations_type` is a score.\n\n**Implementation**:\n* Apply softmax transformation for top-K scores for all users.\n* Compute the KL divergence (relative entropy in [scipy](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.entropy.html)). \n\nThe visualization shows the distribution of the predicted scores at K (and all scores, if available). \n\n# Item Bias \n\n![](../, icationProbDistribution()`- if probabilistic classification\n* `ClassificationRocCurve()` - if probabilistic classification\n* `ClassificationPRCurve()` - if probabilistic classification\n* `ClassificationPRTable()` - if probabilistic classification\n* `ClassificationQualityByFeatureTable()` for all or specified `columns`</li></ul>\n\n**Optional parameters**:\n* `columns`\n* `probas_threshold`\n\n</details>\n\n<details>\n \n<summary>Text Overview Preset</summary>\n\n`TextOverviewPreset()` provides a summary fo]",**Required**: `k` \n**Optional**: None
7,What is the role of the ColumnSummaryMetric in the DataQualityPreset?,"[ity Metrics than included in the `DataQualityPreset`. \n\n# How to read the tables\n\n* **Name**: the name of the Metric. \n* **Description**: plain text explanation. For Metrics, we also specify whether it applies to the whole dataset or individual columns.\n* **Parameters**: required and optional parameters for the Metric or Preset. We also specify the defaults that apply if you do not pass a custom parameter.\n\n**Metric visualizations**. Each Metric includes a default render. To see the visualizati, %}\n\n# Metric Presets\n\n**Defaults**: Presets use the default parameters for each Metric. You can see them in the tables below. \n\n<details>\n\n<summary>Data Quality Preset</summary>\n\n`DataQualityPreset` captures column and dataset summaries. Input columns are required. Prediction and target are optional.\n\n**Composition**:\n* `DatasetSummaryMetric()`\n* `ColumnSummaryMetric()` for `all` or specified `сolumns`\n* `DatasetMissingValuesMetric()`\n\n**Optional parameters**:\n* `columns`\n\n</details>\n\n<details>\n, lumnsType()`\n* `TestColumnShareOfMissingValues()` for all or specified `columns`\n* `TestShareOfOutRangeValues()` for all numerical or specified `columns`\n* `TestShareOfOutListValues()` for all categorical or specified `columns`\n* `TestMeanInNSigmas()` for all numerical or specified `columns`\n\n**Optional parameters**: \n* `columns`\n\n</details>\n\n<details>\n \n<summary>Data Quality Test Preset</summary>\n\nPreset name: `DataQualityTestPreset()`\n\n**Composition**: \n* `TestColumnShareOfMissingValues()` fo]","The `ColumnSummaryMetric` in the `DataQualityPreset` is used to capture summaries for each column, either for all columns or specified columns."
8,How does the drift detection method choose the appropriate test for each column?,"[In some tests and metrics, Evidently uses the default Data Drift Detection algorithm. It helps detect the distribution drift in the individual features, prediction, or target. This page describes how the **default** algorithm works.\n\n# How it works\n\nEvidently compares the distributions of the values in a given column (or columns) of the two datasets. You should pass these datasets as **reference** and **current**. Evidently applies several statistical tests and drift detection methods to detect , ct a different test, you should set [data drift parameters](../customization/options-for-statistical-tests.md). \n\n| Test name | Description | Parameters | Default test conditions | \n|---|---|---|---|\n| **TestNumberOfDriftedColumns()** | Dataset-level. <br><br> Compares the distribution of each column in the current dataset to the reference and tests the number of drifting features against a defined condition.| **Required**:<br>N/A<br><br>**Optional:**<ul><li>`сolumns`</li><li>`stattest`(default=, tical tests and drift detection methods to detect if the distribution has changed significantly. It returns a ""drift detected"" or ""not detected"" result.\n\nThere is a default logic to choosing the appropriate drift test for each column. It is based on:\n\n* column type: categorical, numerical, text data or embeddings\n* the number of observations in the reference dataset\n* the number of unique values in the column (n\_unique)\n\n## Tabular Data \n\n![](../.gitbook/assets/reports/metric_data_drift_table_2]","The drift detection method chooses the appropriate test for each column based on the column type (categorical, numerical, text data, or embeddings), the number of observations in the reference dataset, and the number of unique values in the column."
9,"How is AP@K calculated in the context of relevant item positions?""","[at each relevant item position within the top K. To do that, we sum up precision at all values of K when the item is relevant (e.g., Precision @1, Precision@2..), and divide it by the total number of relevant items in K.\n\n$$\n\text{AP@K} = \frac{1}{N} \sum_{k=1}^{K} Precision(k) \times rel(k)\n$$\n\nWhere *N* is the total number of relevant items at K, and *rel(k)* is equal to 1 if the item is relevant, and is 0 otherwise.\n\nExample: if K = 10, and items in positions 1, 2, and 10 are relevant, the fo, 1 if any relevant item is included in K, or 0 otherwise.\n* **Compute average hit rate**. The average of this metric is calculated across all users or queries.\n\n**Range**: 0 to 1, where 1 indicates that each user / query gets at least one relevant recommendation / retrieval.\n\n**Interpretation**: A higher Hit Rate indicates that a higher share of users / queries have relevant items in their lists. \n\n**Note**: the Hit Rate will typically increase for higher values of K (since there is a higher cha, ems in positions 1, 2, and 10 are relevant, the formula will look as:\n\n$$\nAP@10 = \frac{Precision@1+Precision@2+Precision@10}{3}\n$$\n\n* **Compute Mean Average Precision (MAP) at K**. Average the results across all users (or queries) in the dataset.\n\n$$\n\text{MAP@K} = \frac{1}{U} \sum_{u=1}^{U} \text{AP@K}_u\n$$\n\nWhere *U* is the total number of users or queries in the dataset, and *AP* is the average precision for a given list.\n\n**Range**: 0 to 1.\n\n**Interpretation**: Higher MAP at K values indica]","AP@K is calculated by summing the precision at each position up to K where the item is relevant and dividing by the total number of relevant items within K. The formula is:\n\n$$\n\text{AP@K} = \frac{1}{N} \sum_{k=1}^{K} Precision(k) \times rel(k)\n$$\n\nwhere *N* is the total number of relevant items in K, and *rel(k)* is 1 if the item at position k is relevant, otherwise 0."


In [129]:
pd.set_option("display.max_colwidth", None)

In [130]:
generated_dataset[["Query", "Baseline_answers"]]

Unnamed: 0,Query,Baseline_answers
0,"""How does the TestShareOfColumnsWithMissingValues function determine if a dataset fails the test with reference?",The TestShareOfColumnsWithMissingValues function determines that a dataset fails the test if the number of columns with missing values is higher than in the reference dataset.
1,What optional parameters can be included in the TestShareOfColumnsWithMissingValues function?,The optional parameters for the `TestShareOfColumnsWithMissingValues` function are `columns`.
2,What is the purpose of the HuggingFaceModel function?,The purpose of the HuggingFaceModel function is not specified in the provided documents.
3,How does the HuggingFaceToxicityModel function detect hate speech?,"The HuggingFaceToxicityModel function detects hate speech using the HuggingFace model found at https://huggingface.co/facebook/roberta-hate-speech-dynabench-r4-target. It returns a predicted probability for the ""hate"" label, with a scale from 0 to 1."
4,What condition causes the TestNumberOfDuplicatedRows to fail without a reference?,The test fails if there is at least one duplicate row.
5,What is measured by the TestShareOfDriftedColumns function?,The `TestShareOfDriftedColumns` function measures the proportion of columns that have drifted between datasets.
6,What are the required and optional parameters for the ScoreDistribution function?,**Required**: `k` \n**Optional**: None
7,What is the role of the ColumnSummaryMetric in the DataQualityPreset?,"The `ColumnSummaryMetric` in the `DataQualityPreset` is used to capture summaries for each column, either for all columns or specified columns."
8,How does the drift detection method choose the appropriate test for each column?,"The drift detection method chooses the appropriate test for each column based on the column type (categorical, numerical, text data, or embeddings), the number of observations in the reference dataset, and the number of unique values in the column."
9,"How is AP@K calculated in the context of relevant item positions?""","AP@K is calculated by summing the precision at each position up to K where the item is relevant and dividing by the total number of relevant items within K. The formula is:\n\n$$\n\text{AP@K} = \frac{1}{N} \sum_{k=1}^{K} Precision(k) \times rel(k)\n$$\n\nwhere *N* is the total number of relevant items in K, and *rel(k)* is 1 if the item at position k is relevant, otherwise 0."
