# RAG - Data Drift

This notebook demonstrates a straightforward method for assessing data drift in a Large Language Model (LLM) application built on the Retrieval-Augmented Generation (RAG) architecture. The method has four steps:

1. Data Ingestion: Collect and store a representative sample of data that the LLM was trained on or has been exposed to during its operational phase. This data serves as the reference dataset against which we will compare incoming user input.
2. User Input: Whenever a user provides input to the LLM, the text is processed and embedded into a numerical representation.
3. Embedding Comparison: We compare the embedding of the user's input with the embeddings of the data in our reference dataset. This comparison can be performed using various techniques, such as cosine similarity or other distance metrics.
4. Thresholding: Set a threshold for acceptable similarity scores. If the similarity between the user input and the reference data falls below this threshold, it may indicate data drift.
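
Steps 3 and 4 can be sketched with plain NumPy. This is a minimal illustration, not the method used later in this notebook (which relies on Milvus and the L2 metric): the function names, the toy 3-dimensional vectors, and the `threshold=0.5` default are all assumptions chosen for the example.

```python
import numpy as np

def cosine_similarity_matrix(queries, reference):
    """Return a (n_queries, n_reference) matrix of cosine similarities."""
    q = np.asarray(queries, dtype=float)
    r = np.asarray(reference, dtype=float)
    # Normalise each row to unit length so the dot product equals the cosine.
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    r = r / np.linalg.norm(r, axis=1, keepdims=True)
    return q @ r.T

def flag_drift(queries, reference, threshold=0.5):
    """Flag each query whose best match in the reference set is below `threshold`.

    The threshold value is hypothetical and would need tuning on real data.
    """
    best = cosine_similarity_matrix(queries, reference).max(axis=1)
    return best < threshold

# Toy embeddings (the collections in this notebook use 768 dimensions).
reference = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
queries = [[0.9, 0.1, 0.0],   # close to the first reference vector
           [0.0, 0.0, 1.0]]   # orthogonal to every reference vector

print(flag_drift(queries, reference).tolist())  # → [False, True]
```

The second query has no similar reference vector, so it is the one flagged as a possible sign of drift.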

In this notebook, we use pymilvus to connect directly to the Milvus instance running in the Docker environment.

In [62]:
from pymilvus import connections, Collection
import numpy as np

The initial step is to establish a connection to the Milvus database.

In [3]:
connections.connect()

Now we can access the collection that contains the queries sent by users.

In [158]:
query_collection_name = "QueryCollection"

In [159]:
query_collection = Collection(query_collection_name)

In [160]:
print(query_collection)

<Collection>:
-------------
<name>: QueryCollection
<description>: 
<schema>: {'auto_id': True, 'description': '', 'fields': [{'name': 'text', 'description': '', 'type': <DataType.VARCHAR: 21>, 'params': {'max_length': 65535}}, {'name': 'pk', 'description': '', 'type': <DataType.INT64: 5>, 'is_primary': True, 'auto_id': True}, {'name': 'vector', 'description': '', 'type': <DataType.FLOAT_VECTOR: 101>, 'params': {'dim': 768}}]}



We can now retrieve the vector field associated with this collection.

In [161]:
res_query = query_collection.query(
    expr = "",
    limit=100,
    output_fields=["vector"]
)

In [162]:
number_of_queries = len(res_query)
number_of_queries

5

Before comparing the user queries with the ingested data, we extract the embedding vector from each query result.

In [163]:
queries = [query['vector'] for query in res_query]

## Compare the query to the ingested data

In the second part, our objective is to load the data that was ingested from multiple websites and check whether its embeddings are close to those of the users' queries. The collection where we stored the content for retrieval-augmented generation is called LangChainCollection.

In [164]:
langchain_collection_name = "LangChainCollection"

In [165]:
data_collection = Collection(langchain_collection_name)

In [166]:
print(data_collection)

<Collection>:
-------------
<name>: LangChainCollection
<description>: 
<schema>: {'auto_id': True, 'description': '', 'fields': [{'name': 'text', 'description': '', 'type': <DataType.VARCHAR: 21>, 'params': {'max_length': 65535}}, {'name': 'pk', 'description': '', 'type': <DataType.INT64: 5>, 'is_primary': True, 'auto_id': True}, {'name': 'vector', 'description': '', 'type': <DataType.FLOAT_VECTOR: 101>, 'params': {'dim': 768}}]}



Here we only need the number of data chunks present in this collection. This number is used later to calculate the percentage of chunks that were matched by at least one query.

In [272]:
res_data = data_collection.query(
    expr = "",
    limit=9999,
    output_fields=["vector"]
)

In [273]:
number_of_chunck_of_data = len(res_data)
number_of_chunck_of_data

379

## RAG Data Drift Detection

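Now we can use Milvus's similarity search to compare each user query against the ingested data chunks. We use the L2 metric, where a smaller distance indicates a closer match. The search parameters below restrict results to chunks whose distance falls between 0.8 and 1.0, so a query that returns no hits has no chunk in that range.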
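
As a side note on how an L2 threshold relates to the cosine similarity mentioned in the introduction: if the embeddings are unit-normalized (an assumption; this notebook does not state whether they are), then for a query vector $q$ and a chunk vector $d$:

$$
\lVert q - d \rVert_2^2 = \lVert q \rVert_2^2 + \lVert d \rVert_2^2 - 2\, q \cdot d = 2 - 2\cos(q, d)
$$

Under that assumption, a threshold on one measure translates directly into a threshold on the other. Whether the distance reported by the search is the Euclidean distance or its square should be checked against the Milvus documentation before converting thresholds.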

In [274]:
param = {
    # use `L2` as the metric to calculate the distance
    "metric_type": "L2",
    "params": {
        # search for vectors with a distance smaller than 1.0
        "radius": 1.0,
        # filter out vectors with a distance smaller than or equal to 0.8
        "range_filter" : 0.8
    }
}

In [275]:
res = data_collection.search(
    data=queries,
    anns_field="vector",
    limit=number_of_chunck_of_data,
    param=param
)

In [276]:
res

['[]', '[]', '[]', '[]', "['id: 447155909408064991, distance: 0.8176913261413574, entity: {}', 'id: 447155909408065433, distance: 0.8176913261413574, entity: {}', 'id: 447155909408065069, distance: 0.833382785320282, entity: {}', 'id: 447155909408065049, distance: 0.840793251991272, entity: {}', 'id: 447155909408065045, distance: 0.8410907983779907, entity: {}', 'id: 447155909408064921, distance: 0.8595541715621948, entity: {}', 'id: 447155909408065363, distance: 0.8595541715621948, entity: {}', 'id: 447155909408064927, distance: 0.8837847113609314, entity: {}', 'id: 447155909408065369, distance: 0.8837847113609314, entity: {}', 'id: 447155909408064723, distance: 0.8845637440681458, entity: {}']"]
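
To make the effect of the distance range concrete, here is a brute-force sketch in NumPy. It is an illustration under assumptions, not a description of how Milvus works internally: `l2_range_search` is a hypothetical helper, the toy vectors are 2-dimensional, and the distances it computes are plain Euclidean distances, which may not be on the same scale as the values Milvus reports.

```python
import numpy as np

def l2_range_search(queries, chunks, lower=0.8, upper=1.0):
    """For each query, return (chunk_index, distance) pairs with
    lower <= distance < upper, sorted by increasing distance."""
    q = np.asarray(queries, dtype=float)
    c = np.asarray(chunks, dtype=float)
    # Pairwise Euclidean distances, shape (n_queries, n_chunks).
    dist = np.linalg.norm(q[:, None, :] - c[None, :, :], axis=2)
    results = []
    for row in dist:
        # Keep only the chunks whose distance lies inside the range.
        idx = np.flatnonzero((row >= lower) & (row < upper))
        idx = idx[np.argsort(row[idx])]
        results.append([(int(i), float(row[i])) for i in idx])
    return results

chunks = [[0.0, 0.0], [0.9, 0.0], [3.0, 0.0]]
queries = [[0.0, 0.0],     # identical to chunk 0, near chunk 1
           [10.0, 10.0]]   # far from every chunk

hits = l2_range_search(queries, chunks)
print(hits)
```

In this sketch the first query matches only chunk 1: chunk 0 is an exact match but falls below the lower bound, and chunk 2 is too far away. The second query returns an empty list, which is the situation counted as a non-matching query in the metrics below.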

## Data Drift Metrics

We can observe interesting insights by examining the following metrics:

1. Non-Matching Queries: The percentage of queries that have no matching document in the database.
2. Matching Queries: The percentage of queries that have at least one matching document.
3. Database Utilization Percentage: The percentage of documents in the database that were matched by at least one query.

In [245]:
def not_matched_queries(res):
    """
    Calculate the percentage of queries with no matching document in the database.

    Parameters:
    res (list): A list containing query results. Each element in the list represents the result of a query,
                and an empty list indicates that no matching documents were found for that query.

    Returns:
    float: The percentage of queries with no matching document, as a value between 0 and 100.

    Example:
    >>> results = [[1, 2], [], [3, 4], [], [], [5, 6]]
    >>> not_matched_percentage = not_matched_queries(results)
    >>> print(not_matched_percentage)
    50.0
    """
    
    # Count results with no hits; len() works for plain lists and for search hit containers.
    return sum(1 for hits in res if len(hits) == 0) / len(res) * 100

In [246]:
not_matched_queries(res)

80.0

In [247]:
def matched_queries(res):
    """
    Calculate the percentage of queries with matching documents in the database.

    Parameters:
    res (list): A list containing query results. Each element in the list represents the result of a query,
                and an empty list indicates that no matching documents were found for that query.

    Returns:
    float: The percentage of queries with matching documents, as a value between 0 and 100.

    Example:
    >>> results = [[1, 2], [], [3, 4], [], [], [5, 6]]
    >>> matched_percentage = matched_queries(results)
    >>> print(matched_percentage)
    50.0
    """
    
    return 100 - not_matched_queries(res)

In [248]:
matched_queries(res)

20.0

In [277]:
def percentage_of_document_used(res, number_of_chunck_of_data):
    """
    Calculate the percentage of unique documents that were used in at least one query.

    Parameters:
    res (list): A nested list of query results. Each inner list contains objects with an 'id' attribute.
                These objects represent documents associated with query results.
    number_of_chunck_of_data (int): The total number of data chunks available in the dataset.

    Returns:
    float: The percentage of unique documents used in queries, as a value between 0 and 100.

    Example:
    >>> results = [[doc1, doc2], [doc2, doc3], [doc4], [doc5]]
    >>> num_chunks = 10
    >>> percentage_used = percentage_of_document_used(results, num_chunks)
    >>> print(percentage_used)
    50.0
    """
    # Extract unique 'id' values from the nested list and calculate the percentage used.
    flattened_res = [j.id for i in res for j in i if hasattr(j, 'id')]
    unique_ids = np.unique(flattened_res)

    return len(unique_ids) / number_of_chunck_of_data * 100

In [269]:
percentage_of_document_used(res, number_of_chunck_of_data)

13.418530351437699
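
The three metrics can also be gathered into a single summary. The sketch below is one possible way to do that, under stated assumptions: `drift_report` and its `max_unmatched_pct` alert threshold are hypothetical and not part of this notebook, and the toy results use stand-in objects that only expose an `id` attribute.

```python
from types import SimpleNamespace

def drift_report(res, total_chunks, max_unmatched_pct=50.0):
    """Summarise the three metrics and flag possible drift.

    max_unmatched_pct is a hypothetical alert threshold, not a value
    taken from this notebook.
    """
    if not res or total_chunks <= 0:
        raise ValueError("need at least one query result and a positive chunk count")
    # Percentage of queries whose result list is empty.
    unmatched = sum(1 for hits in res if len(hits) == 0) / len(res) * 100
    # Unique document ids that appear in at least one result.
    used_ids = {hit.id for hits in res for hit in hits}
    return {
        "not_matched_pct": unmatched,
        "matched_pct": 100 - unmatched,
        "documents_used_pct": len(used_ids) / total_chunks * 100,
        "possible_drift": unmatched > max_unmatched_pct,
    }

# Stand-in results: four queries without hits, one query with two hits.
toy_res = [[], [], [], [], [SimpleNamespace(id=1), SimpleNamespace(id=2)]]
report = drift_report(toy_res, total_chunks=10)
print(report)
```

A high share of non-matching queries combined with a low utilization percentage suggests that users are asking about topics the ingested data does not cover, which is the kind of drift this notebook is looking for.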