In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import sys
sys.path.append('/app')

In [3]:

# %%
from qdrant_client import QdrantClient
from datetime import datetime, timedelta
import pandas as pd

In [4]:

pd.set_option('display.max_colwidth', None)

# %% [markdown]
# ## Connect to Qdrant

# %%
def get_qdrant_client():
    """Get Qdrant client with default settings"""
    return QdrantClient(
        host="qdrant",  # Update if running in a different environment
        port=6333
    )
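
# %% [markdown]
# ## Optional: Read Connection Settings from the Environment
# A minimal sketch of one way to avoid editing the hard-coded host when the notebook
# runs elsewhere. The variable names `QDRANT_HOST` and `QDRANT_PORT` and the helper
# `resolve_qdrant_settings` are hypothetical and not part of the original notebook.

```python
import os


def resolve_qdrant_settings(default_host="qdrant", default_port=6333):
    """Return host/port for Qdrant, preferring environment variables when set."""
    # QDRANT_HOST and QDRANT_PORT are assumed names, chosen for illustration only
    host = os.environ.get("QDRANT_HOST", default_host)
    port = int(os.environ.get("QDRANT_PORT", default_port))
    return {"host": host, "port": port}


# Possible usage with the client defined above:
# client = QdrantClient(**resolve_qdrant_settings())
```

# The defaults match the values used in `get_qdrant_client`, so behaviour is unchanged
# when neither variable is set.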

# %% [markdown]
# ## Query Recent Chunks

# %%
def get_recent_chunks(collection_name="meeting_chunks", limit=100, hours_ago=24000*30):
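    """
    Retrieve up to `limit` chunks whose timestamp falls inside the look-back window.

    Note: `scroll` returns points in ID order, not by recency, and only the
    first page is read here, so sort the result by `timestamp` if order matters.

    Args:
        collection_name (str): Name of the Qdrant collection
        limit (int): Maximum number of chunks to retrieve
        hours_ago (int): Look-back period in hours. The default (24000*30 hours,
            roughly 82 years) effectively disables the lower bound.

    Returns:
        pd.DataFrame: One row per chunk, with payload fields and the vector dimension
    """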
    client = get_qdrant_client()
    
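    # Calculate the timestamp threshold as a timezone-aware UTC ISO 8601 string
    # (datetime.utcnow() is deprecated since Python 3.12 and returns a naive value)
    time_threshold = (pd.Timestamp.now(tz="UTC") - timedelta(hours=hours_ago)).isoformat()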
    
    # Scroll (a filtered listing, not a vector search) over points newer than the threshold
    search_result = client.scroll(
        collection_name=collection_name,
        scroll_filter={"must": [
            {"key": "timestamp", "range": {"gte": time_threshold}}
        ]},
        limit=limit,
        with_payload=True,
        with_vectors=True
    )[0]  # scroll returns (points, next_page_offset)
    
    # Extract relevant fields from payload
    chunks_data = []
    for point in search_result:
        payload = point.payload
        chunks_data.append({
            'id': point.id,
            'timestamp': pd.to_datetime(payload.get('timestamp')),  # Convert to datetime
            'meeting_id': payload.get('meeting_id'),
            'content': payload.get('content'),
            'contextualized_content': payload.get('contextualized_content'),
            'chunk_index': payload.get('chunk_index'),
            'topic': payload.get('topic'),
            'speaker': payload.get('speaker'),
            'speakers': payload.get('speakers'),
            'vector_dim': len(point.vector) if point.vector else None
        })
    
    # Convert to DataFrame
    df = pd.DataFrame(chunks_data)

    return df
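
# %% [markdown]
# ## Optional: Read Every Page
# `get_recent_chunks` keeps only the first element of the `(points, next_page_offset)`
# tuple, so at most `limit` points are ever read. The sketch below follows
# `next_page_offset` until it is `None`. The helper name `scroll_all` and its
# `page_size` / `max_points` parameters are assumptions for illustration; only
# `client.scroll` and its `offset` argument come from the Qdrant client itself.

```python
def scroll_all(client, collection_name, scroll_filter=None, page_size=100, max_points=None):
    """Collect points page by page until Qdrant reports no further offset."""
    points = []
    offset = None  # None asks for the first page
    while True:
        page, offset = client.scroll(
            collection_name=collection_name,
            scroll_filter=scroll_filter,
            limit=page_size,
            offset=offset,
            with_payload=True,
            with_vectors=False,  # skip vectors to keep the pages small
        )
        points.extend(page)
        # Stop early once the optional cap is reached
        if max_points is not None and len(points) >= max_points:
            return points[:max_points]
        # A next_page_offset of None means the last page was just read
        if offset is None:
            return points
```

# Possible usage: `scroll_all(get_qdrant_client(), "meeting_chunks", max_points=1000)`.
# Vectors are skipped here, so `vector_dim` would need `with_vectors=True` instead.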

In [5]:
get_recent_chunks()

,id,timestamp,meeting_id,content,contextualized_content,chunk_index,topic,speaker,speakers,vector_dim
0,032b931a-ce25-4c41-9225-3896f4b6723c,2024-06-17 11:07:08.001148+00:00,649abab5-c1ad-4066-83c7-cb0161aea310,Polar Wanderer: Да...,"The chunk occurs in a conversation among various mystical characters discussing the functionality of a system, with Polar Wanderer responding affirmatively to a previous statement, indicating engagement in the ongoing dialogue.",7,Acknowledgment,Polar Wanderer,"[Mystic Wizard, Polar Wanderer, Mystic Siren, Blazing Samurai, Mariner Comet]",1024
1,04066c2a-403a-43bf-8723-0eb08d2ba4ea,2024-06-17 11:51:54.943091+00:00,125a5015-5712-495d-84f9-1f54737e7ac9,"Polar Vulture: И релизнем первую версию, и уже можно разгуляться и сделать вообще шину между сервисами.","The document is a conversation between Mystic Wizard and Polar Vulture discussing technical issues related to meeting transcriptions and session management in a software system. The chunk reflects Polar Vulture's optimism about releasing the first version of their project and the potential for future enhancements, such as creating a service bus between services.",8,Future Development Plans,Polar Vulture,"[Polar Wanderer, Polar Vulture, Mystic Wizard, Harmonic Spirit]",1024
2,0424cb2e-44fd-4b56-baee-2efa70553ef5,2024-06-11 10:44:30.159113+00:00,48512bee-c060-4e74-8b05-ba19ea65ef9e,"Nova Alchemist: However, that said, rags have some limitations. When we do rag search, we are using some kind of semantic similarity and vectorizing everything in the rag. rack search we are using some kind of semantic similarity and vectorizing everything in the rack so here because we are vectorizing all the passages then we are losing all the connections and relationships that exist in the text because we are basically flattening the entire text and that exist in the text because we are basically flattening the entire text and represent it as a bunch of vectors. So we are not really encoding all of the semantic intent in the text. That's one of the problems with RAG in general or with any kind of vectorization. That's one of the problems with rag in general, or with any kind of vectorization. That's one of the problems with RAG in general, or with any kind of vectorization. So to put it in more plain English, this process of vectorization of text chunks So to put it in more plain English, this process of vectorization of text chunks is really just embedding, turning, let's say the whole sentence, one sentence in the chunk. is really just embedding, turning, let's say, the whole sentence, one sentence in the chunk. like one word in the sentence versus another word, not to mention their interrelationship is not, it's embedded in the embeddings, but not necessarily like a very tight, very, it doesn't pay, I don't know what's the right word to say. Explicitly, right? We are not explicitly describe the relationship between the words or the entities or the concept, right? The sentences. explicitly describe the relationship between the words or the entities or the concept, right? The sentences. So that's basically, so we're losing right here. 
We are not really including all of that information.","This chunk discusses the limitations of Retrieval Augmented Generation (RAG) systems, specifically focusing on the issues related to semantic similarity and vectorization, which result in the loss of connections and relationships within the text. It highlights how the process of embedding text into vectors fails to capture the explicit relationships between words, entities, and concepts, thereby compromising the semantic intent of the information.",5,Limitations of RAG Systems,Nova Alchemist,"[Terra Tsunami, Nova Alchemist, Eternal Swan, Amber Swan, Polar Paladin, Vigilant Prophet]",1024
3,05b47434-b5c9-4f24-82be-9cd3502e4d10,2024-06-17 11:51:54.943091+00:00,125a5015-5712-495d-84f9-1f54737e7ac9,"Polar Vulture: Хорошо, сейчас попробую. Насчет транскрипта вот здесь труднее. Почему? Если стриминг все еще не знает, если до стриминга все еще, точнее до engine все еще не дошло, что появилось новое. сессия ассистент в натуре будет оперировать предыдущей сессии я прям правда не знаю как Вот транскрипт, да, окей, я сейчас трейловскую задачку заведу, попробую описать. Вот, насчет этого подумаем. Насчет ассистента, может, что-нибудь. Да, вообще, блин, это все решится, когда мы...","The chunk is part of a conversation between Mystic Wizard and Polar Vulture discussing issues related to session management and transcription in a streaming context. Polar Vulture is addressing the challenges of ensuring that the assistant operates with the most current session data, highlighting potential delays in communication between the streaming service and the engine.",7,Transcription Challenges,Polar Vulture,"[Polar Wanderer, Polar Vulture, Mystic Wizard, Harmonic Spirit]",1024
4,087c9427-58ec-4079-b397-ed8a95cf8e3f,2024-06-17 11:51:54.943091+00:00,125a5015-5712-495d-84f9-1f54737e7ac9,"Mystic Wizard: Да, слушай, я предполагал, что как бы как там это работает, митинговые. И если получает бэк, то он возвращает последнюю сессию, да? Все так, абсолютно верно. А он возвращает не последнюю сессию, а, видимо, все-таки предпоследнюю сессию. Потому что, как ты видишь по скриншоту, и ассистент тоже получил контекст гораздо более ранний. Your first time stamp, you see. 13 число. Сегодня у нас 17. Я понял, кажется, о чем ты. Смотри, когда до... Как быстро ты пошел запрашивать... ассистента транскрипции. Ну, транскрипцию, я как бы запустил. Ну, то есть, через сколько секунд","The chunk is part of a conversation between Mystic Wizard and Polar Vulture discussing issues related to meeting sessions and transcription retrieval, specifically focusing on the timing and accuracy of session data being returned by the system.",1,Meeting Functionality,Mystic Wizard,"[Polar Wanderer, Polar Vulture, Mystic Wizard, Harmonic Spirit]",1024
...,...,...,...,...,...,...,...,...,...,...
62,e6b2b99f-523c-416b-b370-18a86324157d,2024-06-17 11:07:08.001148+00:00,649abab5-c1ad-4066-83c7-cb0161aea310,Blazing Samurai: Ну что?,"The chunk occurs in a conversation among various mystical characters discussing the functionality of a system, with Blazing Samurai expressing enthusiasm and prompting further discussion.",9,Inquiry,Blazing Samurai,"[Mystic Wizard, Polar Wanderer, Mystic Siren, Blazing Samurai, Mariner Comet]",1024
63,e6de3edc-2498-45a9-8324-b491602a902f,2024-06-17 14:44:39.887056+00:00,ee4e0b24-8c60-4788-9f88-b441b44cd0aa,Whispering Rogue: Have a good evening. Bye.,"The chunk is a closing remark from Whispering Rogue, following a discussion about a week that was different and the importance of communication within a team, indicating the end of a conversation.",3,Farewell wishes,Whispering Rogue,[Whispering Rogue],1024
64,e8512688-8657-47e5-93ed-6671fbd69164,2024-06-11 10:01:39.015022+00:00,947c4433-b43d-47db-8ac7-f18d29f91c01,"Polar Paladin: So instead of searching through every single chunk, when we are So instead of searching through every single chunk, when we are chunking the documents, Essentially, in this solution, what we can do is to include the document summary. So document summary is the same because each we summarize the document and we add that as part of the embedding part of the metadata with the embedding and then we save them into the vector database so when user ask a document to the user question and then we can go oh now that the answer is in this particular document now i can only go and fetch related chunks from that document and then generate the response so that's the very first solution for this problem however although this one Что происходит? Summary as a metadata to each embedding.","The discussion revolves around enhancing the efficiency of the Retrieval-Augmented Generation (RAG) system by addressing the challenges of chunking documents for better search retrieval. The focus is on incorporating document summaries as metadata in the embedding process, allowing for more targeted searches that can quickly identify relevant chunks when a user poses a question. This approach aims to streamline the retrieval process and improve the accuracy of responses generated by the system.",7,Including document summaries in chunking,Polar Paladin,"[Sunny Storm, Arcane Sphinx, Mystic Wizard, Enchanted Tempest, Nova Alchemist, Polar Wanderer, Lunar Sorcerer, Tempest Vulture, Polar Paladin, Wandering Sniper, Titanic Ninja, Thundering Vulture]",1024
65,f817cfd4-9485-4795-acaa-e60d2bd10f3e,2024-06-17 14:44:39.887056+00:00,ee4e0b24-8c60-4788-9f88-b441b44cd0aa,"Whispering Rogue: week that was different sorry it was me this is what happened when i don't speak too much yeah that's what's happening everybody breaks that is that's not scrum okay okay guys Thanks, and keep in touch.","The chunk is part of a conversation where ""Whispering Rogue"" reflects on a week that was unusual, apologizing for not communicating enough, and acknowledging the impact on team dynamics, while encouraging continued communication among team members.",2,Scrum methodology,Whispering Rogue,[Whispering Rogue],1024
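
# %% [markdown]
# ## Optional: Order the Result by Recency
# The rows above are ordered by point ID rather than by time, because `scroll`
# returns points in ID order by default. A minimal sketch of sorting the DataFrame
# afterwards follows; the helper name `newest_first` is hypothetical, and the demo
# frame contains made-up values, not data from the collection.

```python
import pandas as pd


def newest_first(df, n=None):
    """Return rows ordered by timestamp (newest first), then by chunk_index."""
    ordered = df.sort_values(
        ["timestamp", "chunk_index"], ascending=[False, True]
    ).reset_index(drop=True)
    return ordered if n is None else ordered.head(n)


# Tiny illustrative frame (made-up values)
demo = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2024-06-11T10:00:00Z", "2024-06-17T11:00:00Z", "2024-06-17T11:00:00Z"]
    ),
    "chunk_index": [5, 8, 7],
})
print(newest_first(demo, n=2))
```

# Possible usage: `newest_first(get_recent_chunks(), n=20)`. Sorting client-side only
# orders the rows that were fetched; it does not guarantee they are the newest in the collection.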
