## ChromaDB + Ollama

In [1]:
!pip install chromadb==0.5.23
!pip install ollama==0.4.7



In [2]:
import os
import sys
import pickle
import chromadb
import torch
from chromadb import Documents, EmbeddingFunction, Embeddings
from ollama import Client
import numpy as np

# Add "src" path to Python path
sys.path.append(os.path.abspath("../src"))

# Import the custom embedding helpers and the GraphSAGE model class
from embedding_utils import get_ollama_embedding
from sage_utils import get_new_sage_embedding
from graphsage import GraphSAGE


In [3]:
# Load graph
with open("../data/MultiHop_graph_w_sage100_embeddings.pkl", "rb") as f:
    G = pickle.load(f)

In [4]:
G.nodes["100_chunk_0"].keys()

dict_keys(['type', 'category', 'source', 'author', 'text', 'embedding', 'SAGE_embedding'])

In [5]:
# Instantiate client
chroma_client = chromadb.Client()

In [6]:
#chroma_client.delete_collection("semantic_rag")
#chroma_client.delete_collection("SAGE_rag")

In [7]:
# Create a collection
collection_semrag = chroma_client.create_collection(name="semantic_rag")

# Iterate over the nodes and filter those with type "chunk"
for node_id in G.nodes:
    node_data = G.nodes[node_id]
    if node_data.get("type") == "chunk":
        doc_text = node_data.get("text", "")
        doc_embedding = node_data.get("embedding", None)

        # Skip chunks without a precomputed embedding; Chroma expects a flat list (or array) of floats
        if doc_embedding is None:
            continue

        # Add the node to the collection by specifying IDs, documents, and embeddings
        collection_semrag.add(
            ids=[str(node_id)],         # The ID will be the node's ID (converted to a string)
            documents=[doc_text],         # The node's text
            embeddings=[doc_embedding]    # The precomputed embedding you already have
        )

In [8]:
# Create a collection
collection_SAGErag = chroma_client.create_collection(name="SAGE_rag")

# Iterate over the nodes and filter those with type "chunk"
for node_id in G.nodes:
    node_data = G.nodes[node_id]
    if node_data.get("type") == "chunk":
        doc_text = node_data.get("text", "")
        doc_embedding = node_data.get("SAGE_embedding", None)

        # Skip chunks without a precomputed embedding; Chroma expects a flat list (or array) of floats
        if doc_embedding is None:
            continue

        # Add the node to the collection by specifying IDs, documents, and embeddings
        collection_SAGErag.add(
            ids=[str(node_id)],         # The ID will be the node's ID (converted to a string)
            documents=[doc_text],         # The node's text
            embeddings=[doc_embedding]    # The precomputed embedding you already have
        )
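
The two loops above call `add` once per node. A minimal sketch of one way to validate the embeddings and collect all records first, so each collection can be filled with a single `add` call. It assumes node attributes behave like plain dicts; `collect_chunk_records` is a hypothetical helper, not part of ChromaDB.

```python
def collect_chunk_records(nodes, embedding_key):
    """Gather ids, texts, and embeddings for every node of type "chunk".

    `nodes` is any mapping of node_id -> attribute dict. Nodes whose
    embedding is missing, empty, or not a flat sequence of numbers are
    skipped and reported in the last return value.
    """
    ids, documents, embeddings, skipped = [], [], [], []
    for node_id, data in nodes.items():
        if data.get("type") != "chunk":
            continue
        vector = data.get(embedding_key)
        try:
            # float() fails on None entries and nested sequences
            vector = [float(x) for x in vector]
        except (TypeError, ValueError):
            vector = []
        if not vector:
            skipped.append(str(node_id))
            continue
        ids.append(str(node_id))
        documents.append(data.get("text", ""))
        embeddings.append(vector)
    return ids, documents, embeddings, skipped


# Toy data standing in for the graph's node attributes (values are made up)
toy_nodes = {
    "1_chunk_0": {"type": "chunk", "text": "a", "embedding": [1, 2]},
    "doc_1": {"type": "document"},
    "2_chunk_0": {"type": "chunk", "text": "b", "embedding": None},
}
ids, documents, embeddings, skipped = collect_chunk_records(toy_nodes, "embedding")
print(ids, skipped)
```

In the notebook, the returned lists could be passed to `collection_semrag.add(ids=ids, documents=documents, embeddings=embeddings)`, with `embedding_key="SAGE_embedding"` for the second collection. For very large graphs the lists may need to be split into smaller batches.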

In [9]:
# Define the model (it must have the same architecture as the saved model)
model = GraphSAGE(in_channels=1024,   # Input features (BGE-M3 embeddings)
    hidden_channels=512,  # First hidden layer (large, for maximum capacity)
    out_channels=256,   # Output embeddings (richer)
    num_layers=2)  # Replace with your own model class
model.load_state_dict(torch.load("../data/graphsage_100_model.pth"))
model.eval()  # Switch the model to evaluation mode

print("Model loaded successfully")


Model loaded successfully


In [10]:
# Set the device for inference (GPU if available, otherwise CPU)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Move the model to the device
model = model.to(device)

In [11]:
# Embed the query with BGE-M3 through Ollama, then map it into the GraphSAGE embedding space
query = "Who is the individual associated with the cryptocurrency industry facing a criminal trial on fraud and conspiracy charges, as reported by both The Verge and TechCrunch, and is accused by prosecutors of committing fraud for personal gain?"
query_bgem3 = get_ollama_embedding(query)["embedding"]
query_sage = get_new_sage_embedding(
    model=model,
    new_feature=torch.tensor(query_bgem3, dtype=torch.float32).reshape(1024),
    device=device,
)

In [12]:
query_sage = np.array(query_sage.cpu())

In [13]:
# Query the collection (semantic)
sem_results = collection_semrag.query(
    query_embeddings=[query_bgem3],
    n_results=5
)

print(sem_results)

{'ids': [['31_chunk_1', '374_chunk_1', '10_chunk_1', '529_chunk_1', '53_chunk_2']], 'embeddings': None, 'documents': [['On Nov. 8, in the crypto exchange’s final few days, he tried to pull out his cash and crypto. How much? Almost $100,000, he said. And was he ever able to? “Never,” he told prosecutors.\n\nShortly afterward, as the trial neared late afternoon, the government called Adam Yedidia to the stand. A fast-talking graduate of MIT, he and Bankman-Fried were close friends in college, he said. And after Bankman-Fried left Jane Street, the high-frequency trading firm where the former billionaire got his start in finance after MIT, he persuaded Yedidia to join him as a trader at Alameda and then as a developer at FTX.\n\nWhen Yedidia first took the stand, Danielle Sassoon, one of the lead prosecutors, said that the college friend of Bankman-Fried had legal immunity during his testimony. Why did he make such a deal with the government, she asked.\n\n“I was concerned that I had unwit

In [14]:
# Query the collection (SAGE)
SAGE_results = collection_SAGErag.query(
    query_embeddings=[query_sage],
    n_results=5
)

print(SAGE_results)

{'ids': [['116_chunk_1', '10_chunk_1', '374_chunk_1', '28_chunk_1', '195_chunk_1']], 'embeddings': None, 'documents': [['TechCrunch+\n\nTC+ subscribers get access to in-depth commentary, analysis and surveys — which you know if you’re already a subscriber. If you’re not, consider signing up. Here are a few highlights from this week:\n\nAnother superconductor disappointment: Tim writes that a new, supposedly room-temperature superconducting material isn’t what the scientific community hoped it would be. With the Nature-published paper detailing the material facing retraction, the odds of researchers discovering a room-temperature superconductor are looking even longer.\n\nKlarna inches toward an IPO: Mary Ann and Alex write that Swedish fintech Klarna is taking steps toward an eventual IPO. The company has initiated a process for a legal entity restructuring to set up a holding company in the U.K. as an important early step in its plans for an initial public offering, a Klarna spokesper
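
The two printed results share some chunk ids but not all. A small sketch for quantifying that overlap, using the ids exactly as printed above; `compare_retrievals` is a hypothetical helper name.

```python
def compare_retrievals(ids_a, ids_b):
    """Return the shared ids (in ids_a order) and the Jaccard overlap."""
    set_b = set(ids_b)
    shared = [i for i in ids_a if i in set_b]
    union = set(ids_a) | set_b
    jaccard = len(set(shared)) / len(union) if union else 0.0
    return shared, jaccard


# Ids copied from the two printed results above
sem_ids = ['31_chunk_1', '374_chunk_1', '10_chunk_1', '529_chunk_1', '53_chunk_2']
sage_ids = ['116_chunk_1', '10_chunk_1', '374_chunk_1', '28_chunk_1', '195_chunk_1']

shared, jaccard = compare_retrievals(sem_ids, sage_ids)
print(shared)   # ['374_chunk_1', '10_chunk_1']
print(jaccard)  # 0.25
```

In the notebook, the same call would take `sem_results["ids"][0]` and `SAGE_results["ids"][0]` instead of the hard-coded lists.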

In [15]:
## Custom embeddings with ChromaDB
#from chromadb import Documents, EmbeddingFunction, Embeddings
#
#class MyEmbeddingFunction(EmbeddingFunction):
#    def __call__(self, input: Documents) -> Embeddings:
#        # embed the documents somehow
#        return embeddings

In [16]:
## Querying collection with custom embeddings
#collection.query(
#    query_embeddings=[[11.1, 12.1, 13.1],[1.1, 2.3, 3.2], ...],
#    n_results=10,
#    where={"metadata_field": "is_equal_to_this"},
#    where_document={"$contains":"search_string"}
#)
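
The `where` filter in the template above only works if metadata was stored when the records were added, which the collections in this notebook do not do. A hedged sketch of how the node attributes listed earlier (`category`, `source`, `author`) might be turned into metadata; `node_metadata` is a hypothetical helper and the sample values are made up.

```python
# Chroma metadata values are expected to be scalars of these types
ALLOWED_TYPES = (str, int, float, bool)


def node_metadata(node_data, keys=("category", "source", "author")):
    """Pick filterable attributes from a node, keeping only scalar values."""
    return {
        key: node_data[key]
        for key in keys
        if isinstance(node_data.get(key), ALLOWED_TYPES)
    }


# Illustrative node; the real attribute values are not shown in this notebook
sample_node = {
    "type": "chunk",
    "category": "technology",
    "source": "TechCrunch",
    "author": None,
    "text": "example text",
    "embedding": [0.1, 0.2],
}
print(node_metadata(sample_node))
```

The result could be passed as `metadatas=[node_metadata(node_data)]` in the `add` call, after which a query such as `where={"source": "TechCrunch"}` becomes possible. An empty dict may be rejected by ChromaDB, so nodes with no usable attributes would need separate handling.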

In [17]:
client = Client(
  host='http://localhost:11434',
  #headers={'x-some-header': 'some-value'}
)
response = client.chat(model='hf.co/unsloth/QwQ-32B-GGUF:Q6_K', messages=[
  {
    'role': 'user',
    'content': f"""
You are an expert assistant whose task is to answer the user's question using the provided context. Use the context to support your answer and ensure that your response is clear, accurate, and concise.

Context:
{sem_results}

User Question:
{query}

Answer:
""",
  },
])

print(response['message']['content'])
# or access fields directly from the response object
# print(response.message.content)

<think>
Okay, let's tackle this question. The user is asking about an individual in the crypto industry facing a criminal trial for fraud and conspiracy, specifically mentioned by The Verge and TechCrunch, and accused by prosecutors of fraud for personal gain.

First, I need to look through the provided context to find relevant information. The context has several chunks. Let's go through each one.

The first chunk talks about someone named Bankman-Fried. It mentions his trial, prosecutors, and how he couldn't retrieve $100k. There's a reference to Adam Yedidia testifying against him. The context also mentions FTX's collapse, which aligns with the crypto industry. 

Another chunk discusses Binance and Zhao, but the question is about someone facing fraud charges, so maybe not Zhao since the focus here is on Bankman-Fried's trial details. 

The third chunk mentions Bankman-Fried's defense attorney and the aftermath of FTX's collapse. The fourth chunk is about Trump, which seems unrelated
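
The prompt above interpolates the whole result dictionary, so the model also sees keys such as `'ids'` and `'embeddings'`. A minimal sketch of an alternative that passes only the retrieved texts as numbered passages. `format_context` and the `max_chars` cap are illustrative choices, not part of ChromaDB or Ollama.

```python
def format_context(results, max_chars=1500):
    """Turn a Chroma-style query result into a numbered context string.

    Expects results["ids"] and results["documents"] to hold one inner list
    per query, as in the printed output earlier; only the first query is used.
    """
    ids = results["ids"][0]
    documents = results["documents"][0]
    blocks = []
    for rank, (chunk_id, text) in enumerate(zip(ids, documents), start=1):
        snippet = text.strip()[:max_chars]  # arbitrary cap per chunk
        blocks.append(f"[{rank}] ({chunk_id})\n{snippet}")
    return "\n\n".join(blocks)


# Stand-in for a real query result (texts are placeholders)
fake_results = {
    "ids": [["a", "b"]],
    "documents": [["first text", "second text"]],
    "embeddings": None,
}
print(format_context(fake_results))
```

In the prompt, `{format_context(sem_results)}` would then replace `{sem_results}`. Whether this improves the answers has not been measured here.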

In [18]:
client = Client(
  host='http://localhost:11434',
  #headers={'x-some-header': 'some-value'}
)
response = client.chat(model='hf.co/unsloth/QwQ-32B-GGUF:Q6_K', messages=[
  {
    'role': 'user',
    'content': f"""
You are an expert assistant whose task is to answer the user's question using the provided context. Use the context to support your answer and ensure that your response is clear, accurate, and concise.

Context:
{SAGE_results}

User Question:
{query}

Answer:
""",
  },
])

print(response['message']['content'])
# or access fields directly from the response object
# print(response.message.content)

<think>
Okay, let's tackle this question step by step. The user is asking about an individual in the cryptocurrency industry who's facing a criminal trial for fraud and conspiracy, as reported by The Verge and TechCrunch. The key points here are the individual's association with crypto, the specific charges (fraud and conspiracy), and the mention by both outlets.

First, I'll look through the provided context from TechCrunch. The user mentioned both The Verge and TechCrunch, but since the context here is only from TechCrunch, I'll focus on that. 

Looking at the documents in the context:

1. The first document talks about Klarna's IPO plans and the unicorn term, so probably not related.
2. The second document mentions FTX's collapse, BlockFi, Genesis, and refers to Bankman-Fried. It says he's represented by Mark Cohen, who also handled Ghislaine Maxwell's case. The trial is starting, and the focus is on the investors and creditors affected. This seems relevant because FTX was a major c
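
Both responses begin with a `<think>` block holding the model's reasoning. A small sketch for keeping only the final answer, assuming the model closes the block with `</think>` (the printed outputs above are cut off before that point, so this is an assumption); `strip_reasoning` is a hypothetical helper.

```python
import re

# Non-greedy match so several <think> blocks are removed independently
THINK_BLOCK = re.compile(r"<think>.*?</think>", flags=re.DOTALL)


def strip_reasoning(text):
    """Remove <think>...</think> blocks and return the remaining text."""
    return THINK_BLOCK.sub("", text).strip()


sample = "<think>\nsome reasoning steps\n</think>\n\nFinal answer."
print(strip_reasoning(sample))  # Final answer.
```

In the notebook this would be applied as `strip_reasoning(response['message']['content'])`. If the closing tag is missing, for example because generation was cut short, the text is returned unchanged apart from surrounding whitespace.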