In [None]:
from common.data_store.src.data_pipeline import DataPipeline

USER_AGENT environment variable not set, consider setting it to identify your requests.


In [6]:
pipeline = DataPipeline()

INFO:sentence_transformers.SentenceTransformer:Use pytorch device_name: mps
INFO:sentence_transformers.SentenceTransformer:Load pretrained SentenceTransformer: sentence-transformers/all-mpnet-base-v2
INFO:data_store.src.embeddings.embedding_service:Loaded Sentence Transformer model: sentence-transformers/all-mpnet-base-v2
INFO:data_store.src.embeddings.embedding_service:Embedding dimension: 768
INFO:data_store.src.vectorstore.chroma_store:Initializing ChromaDB client at path: ./data_layer_db
INFO:data_store.src.vectorstore.chroma_store:Getting or creating ChromaDB collection: my_documents
INFO:data_store.src.vectorstore.chroma_store:Collection 'my_documents' ready.
INFO:data_store.src.contextualizer.llm_service:Local LLM Contextualizer configured for URL: http://localhost:8000/api/llm/generate_response
INFO:data_store.src.core.data_pipeline:Data Pipeline initialized.


In [7]:
print(f"Current vector store count: {pipeline.get_vector_store_count()}")

Current vector store count: 94


In [8]:
pipeline.clear_vector_store()

INFO:data_store.src.vectorstore.chroma_store:Cleared 94 items from collection 'my_documents'.


In [9]:
added_count_pdf = pipeline.process_and_store(
    source="./data/",
    source_type="directory",
    contextualize=True
)

INFO:data_store.src.core.data_pipeline:Starting processing for source: ./data/ (type: directory, contextualize: True)
INFO:data_store.src.loaders.document_processor:Scanning directory: ./data/ (recursive=True)
INFO:data_store.src.loaders.document_processor:Processing file: ./data/machine_learning.md using UnstructuredMarkdownLoader
INFO:data_store.src.loaders.document_processor:Processing file: ./data/superhero.md using UnstructuredMarkdownLoader
INFO:data_store.src.loaders.document_processor:Processing file: ./data/quantum_physics.md using UnstructuredMarkdownLoader
INFO:data_store.src.loaders.document_processor:Processing file: ./data/quantum_computing.md using UnstructuredMarkdownLoader
INFO:data_store.src.loaders.document_processor:Loaded 4 documents initially.
INFO:data_store.src.core.data_pipeline:Loaded 4 documents from source.
INFO:data_store.src.loaders.document_processor:Splitting 4 documents into chunks (size=512, overlap=50)...
INFO:data_store.src.loaders.document_processor

Batches:   0%|          | 0/3 [00:00<?, ?it/s]

INFO:data_store.src.embeddings.embedding_service:Embeddings generated.
INFO:data_store.src.vectorstore.chroma_store:Successfully added 94 documents.
INFO:data_store.src.core.data_pipeline:Processing complete. Added 94 chunks to the vector store.
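
The log above reports splitting with `size=512` and `overlap=50`. As a rough illustration of what a fixed-size window with overlap does, here is a minimal sketch that assumes plain character-based windows. The helper `split_with_overlap` is hypothetical and is not the pipeline's splitter, which may well split on separators or tokens instead.

```python
def split_with_overlap(text: str, size: int = 512, overlap: int = 50) -> list[str]:
    """Hypothetical helper: cut text into windows of `size` characters,
    each starting `size - overlap` characters after the previous one."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


# Invented sample text, only to show the window lengths.
sample = "".join(chr(97 + i % 26) for i in range(1000))
chunks = split_with_overlap(sample)
print([len(c) for c in chunks])
# Consecutive chunks share `overlap` characters at their boundary.
print(chunks[0][-50:] == chunks[1][:50])
```

The overlap means a sentence that straddles a boundary appears whole in at least one chunk, at the cost of storing some text twice.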


In [10]:
# Despite the "pdf" naming, the source processed above is a directory of Markdown files.
print(f"Added {added_count_pdf} chunks from PDF.")

Added 94 chunks from PDF.


In [11]:
print(f"New vector store count: {pipeline.get_vector_store_count()}")

New vector store count: 94


In [12]:
query = "What is the main law of quantum mechanics?"
search_results = pipeline.retrieve(query, k=3)

INFO:data_store.src.core.data_pipeline:Retrieving top 3 results for query: 'What is the main law of quantum mechanics?...'
INFO:data_store.src.vectorstore.chroma_store:Performing search in collection 'my_documents' for query: 'What is the main law of quantum mechanics?...' (k=3)
INFO:data_store.src.embeddings.embedding_service:Generating embeddings for 1 documents...


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

INFO:data_store.src.embeddings.embedding_service:Embeddings generated.
INFO:data_store.src.vectorstore.chroma_store:Found 3 results.


In [13]:
def display_search_results(search_results, query):
    print(f"\nSearch results for: '{query}'")
    if search_results:
        for i, result in enumerate(search_results):
            print(f"\n--- Result {i+1} ---")
            print(f"Source: {result['metadata'].get('source', 'N/A')}")
            # Display original content if available in metadata, else the potentially augmented content
            content_to_display = result['metadata'].get('original_content', result['document'])
            print(f"Content: {content_to_display[:200]}...") # Display snippet
            print(f"Distance: {result['distance']:.4f}")
            # print(f"Metadata: {result['metadata']}") # Uncomment to see all metadata
    else:
        print("No relevant documents found.")

In [23]:
# Question answering with the LLM, based on the retrieved documents
import requests
import json

url = 'http://localhost:8000/api/llm/generate_response'
headers = {
    'accept': 'application/json',
    'Content-Type': 'application/json'
}
def prepare_docs_for_llm(search_results):
    docs = []
    for result in search_results:
        doc = {
            'content': result['document'],
            'id': result['metadata'].get('source', 'N/A')
        }
        docs.append(doc)
    return docs

def prepare_data_for_llm(query, search_results):
    data = {
        'prompt': query,
        'documents': prepare_docs_for_llm(search_results)
    }
    json_data = json.dumps(data)
    return json_data

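def get_llm_response(query, search_results):
    data = prepare_data_for_llm(query, search_results)
    # Fail fast on a hung server or an HTTP error instead of parsing an error body.
    # The 60-second timeout is an arbitrary choice; raise it for slow local models.
    response = requests.post(url, headers=headers, data=data, timeout=60)
    response.raise_for_status()
    return response.json()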

def print_llm_response(query, search_results):
    response = get_llm_response(query, search_results)
    print(f"\nQuestion: {query}")
    print(f"Answer: {response['response']}")
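
To see what the helpers above send to the endpoint, the sketch below builds the same payload shape from a hand-made result list, without any network call. The sample results are invented for illustration, and `build_payload` is a hypothetical stand-in for `prepare_data_for_llm`. The field names `prompt`, `documents`, `content` and `id` come from the code above; whether the server accepts or requires other fields cannot be told from this notebook.

```python
import json

# Invented sample in the shape the helpers read
# (keys 'document', 'metadata', 'distance'); not real pipeline output.
sample_results = [
    {
        "document": "K-Means Clustering: An iterative algorithm ...",
        "metadata": {"source": "./data/machine_learning.md"},
        "distance": 0.9884,
    },
    {"document": "A chunk with no source recorded.", "metadata": {}, "distance": 1.2},
]


def build_payload(query, search_results):
    """Mirror prepare_data_for_llm: one {'content', 'id'} dict per result."""
    documents = [
        {"content": r["document"], "id": r["metadata"].get("source", "N/A")}
        for r in search_results
    ]
    return {"prompt": query, "documents": documents}


payload = build_payload("Explain how the K-Means clustering algorithm works.", sample_results)
print(json.dumps(payload, indent=2))
```

Because `id` is filled from the `source` metadata, every chunk from the same file shares one `id`, which is presumably why the answers cite file paths rather than individual chunks.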

### With Context

In [13]:
display_search_results(search_results, query)


Search results for: 'What is the main law of quantum mechanics?'

--- Result 1 ---
Source: ./data/quantum_physics.md
Content: Core Principles Revisited with More Detail...
Distance: 0.6070

--- Result 2 ---
Source: ./data/quantum_physics.md
Content: A Deeper Dive into Quantum Physics

Quantum physics, at its core, is the study of the microscopic world – the realm of atoms and subatomic particles. Unlike classical physics, which describes the macr...
Distance: 0.8068

--- Result 3 ---
Source: ./data/quantum_physics.md
Content: Interpretations of Quantum Mechanics

Despite its immense success in predicting experimental results, the underlying interpretation of quantum mechanics remains a subject of debate. Several interpreta...
Distance: 0.8567
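
The `Distance` values above are easier to read against a reference scale. The sketch below assumes that the collection uses Chroma's default `l2` space (squared Euclidean distance) and that the embeddings are unit-normalized; neither is confirmed by the logs in this notebook. Under those assumptions, a distance d and the cosine similarity s satisfy d = 2 - 2s, so 0 means identical direction and 2 means opposite direction.

```python
import math


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def distance_to_cosine(distance):
    """Convert squared L2 distance to cosine similarity.
    Only valid under the assumptions stated above."""
    return 1 - distance / 2


# Toy 3-dimensional vectors standing in for 768-dimensional embeddings.
u = normalize([1.0, 2.0, 3.0])
v = normalize([3.0, 1.0, 0.5])

d = squared_l2(u, v)
s = cosine(u, v)
print(f"squared L2 = {d:.4f}, 2 - 2*cos = {2 - 2 * s:.4f}")
print(f"distance 0.6070 -> cosine {distance_to_cosine(0.6070):.4f}")
```

If the assumptions hold, the first hit's 0.6070 would correspond to a cosine similarity of roughly 0.70. If the collection was created with a different distance space, this conversion does not apply.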


In [14]:
query1 = "What event is generally considered the beginning of the Golden Age of superhero comics? Name two iconic superheroes that debuted in this era."
query2 = "Explain how the K-Means clustering algorithm works."
query3 = "Discuss at least three potential applications of quantum computing and how they could impact those fields."
query4 = "Describe the phenomenon of quantum entanglement and why Einstein referred to it as 'spooky action at a distance.' What are some potential applications of entanglement?"


search_results1 = pipeline.retrieve(query1, k=3)
search_results2 = pipeline.retrieve(query2, k=3)
search_results3 = pipeline.retrieve(query3, k=3)
search_results4 = pipeline.retrieve(query4, k=3)

INFO:data_store.src.core.data_pipeline:Retrieving top 3 results for query: 'What event is generally considered the beginning o...'
INFO:data_store.src.vectorstore.chroma_store:Performing search in collection 'my_documents' for query: 'What event is generally considered the beginning o...' (k=3)
INFO:data_store.src.embeddings.embedding_service:Generating embeddings for 1 documents...


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

INFO:data_store.src.embeddings.embedding_service:Embeddings generated.
INFO:data_store.src.vectorstore.chroma_store:Found 3 results.
INFO:data_store.src.core.data_pipeline:Retrieving top 3 results for query: 'Explain how the K-Means clustering algorithm works...'
INFO:data_store.src.vectorstore.chroma_store:Performing search in collection 'my_documents' for query: 'Explain how the K-Means clustering algorithm works...' (k=3)
INFO:data_store.src.embeddings.embedding_service:Generating embeddings for 1 documents...


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

INFO:data_store.src.embeddings.embedding_service:Embeddings generated.
INFO:data_store.src.vectorstore.chroma_store:Found 3 results.
INFO:data_store.src.core.data_pipeline:Retrieving top 3 results for query: 'Discuss at least three potential applications of q...'
INFO:data_store.src.vectorstore.chroma_store:Performing search in collection 'my_documents' for query: 'Discuss at least three potential applications of q...' (k=3)
INFO:data_store.src.embeddings.embedding_service:Generating embeddings for 1 documents...


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

INFO:data_store.src.embeddings.embedding_service:Embeddings generated.
INFO:data_store.src.vectorstore.chroma_store:Found 3 results.
INFO:data_store.src.core.data_pipeline:Retrieving top 3 results for query: 'Describe the phenomenon of quantum entanglement an...'
INFO:data_store.src.vectorstore.chroma_store:Performing search in collection 'my_documents' for query: 'Describe the phenomenon of quantum entanglement an...' (k=3)
INFO:data_store.src.embeddings.embedding_service:Generating embeddings for 1 documents...


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

INFO:data_store.src.embeddings.embedding_service:Embeddings generated.
INFO:data_store.src.vectorstore.chroma_store:Found 3 results.


In [15]:
display_search_results(search_results1, query1)


Search results for: 'What event is generally considered the beginning of the Golden Age of superhero comics? Name two iconic superheroes that debuted in this era.'

--- Result 1 ---
Source: ./data/superhero.md
Content: The Golden Age (Late 1930s - 1950s)

The Golden Age of superhero comics is generally considered to have begun with the debut of Superman in Action Comics #1 (1938) and Batman in Detective Comics #27 (...
Distance: 0.3081

--- Result 2 ---
Source: ./data/superhero.md
Content: Superhero Comics: A Pop Culture Phenomenon

Superhero comics are a significant and enduring part of global popular culture. Originating in the late 1930s, these illustrated narratives feature characte...
Distance: 0.4596

--- Result 3 ---
Source: ./data/superhero.md
Content: Themes: Predominantly focused on good versus evil, with heroes often fighting clearly defined villains and upholding patriotic ideals, especially during World War II.

Art Style: Often simpler and mor...
Distance: 0.5934


In [16]:
display_search_results(search_results2, query2)


Search results for: 'Explain how the K-Means clustering algorithm works.'

--- Result 1 ---
Source: ./data/machine_learning.md
Content: Use Cases: Image and video analysis, natural language understanding, speech synthesis, machine translation.

Unsupervised Learning Algorithms

K-Means Clustering: An iterative algorithm that partition...
Distance: 0.9884

--- Result 2 ---
Source: ./data/machine_learning.md
Content: Algorithm: The specific procedure or set of rules that the ML system uses to learn from the data and build the model. Different algorithms have different strengths and weaknesses and are suited for di...
Distance: 1.1372

--- Result 3 ---
Source: ./data/machine_learning.md
Content: Hierarchical Clustering: A family of clustering algorithms that builds a hierarchy of clusters, either by starting with each data point as a separate cluster and iteratively merging the closest cluste...
Distance: 1.1413


In [17]:
display_search_results(search_results3, query3)


Search results for: 'Discuss at least three potential applications of quantum computing and how they could impact those fields.'

--- Result 1 ---
Source: ./data/quantum_computing.md
Content: Quantum Computing: Harnessing the Power of the Quantum Realm

Quantum computing is an emerging field that leverages the principles of quantum mechanics to solve complex problems that are intractable f...
Distance: 0.4053

--- Result 2 ---
Source: ./data/quantum_computing.md
Content: Artificial Intelligence and Machine Learning: Accelerating machine learning algorithms and developing new quantum machine learning techniques.

Cryptography: Breaking current public-key encryption alg...
Distance: 0.4138

--- Result 3 ---
Source: ./data/quantum_physics.md
Content: The Ongoing Revolution and Future Directions

Quantum physics continues to be a vibrant and rapidly evolving field. Current research is focused on:

Quantum Computing: Building computers that exploit ...
Distance: 0.4789


In [18]:
display_search_results(search_results4, query4)


Search results for: 'Describe the phenomenon of quantum entanglement and why Einstein referred to it as 'spooky action at a distance.' What are some potential applications of entanglement?'

--- Result 1 ---
Source: ./data/quantum_physics.md
Content: Quantum Entanglement: This is perhaps one of the most counter-intuitive aspects of quantum mechanics. When two or more particles become entangled, their quantum states are linked in such a way that th...
Distance: 0.4726

--- Result 2 ---
Source: ./data/quantum_computing.md
Content: Entanglement: When two or more qubits become entangled, their quantum states are linked in such a way that they share the same fate, regardless of the distance between them. Measuring the state of one...
Distance: 0.6986

--- Result 3 ---
Source: ./data/quantum_physics.md
Content: called it, is a fundamental feature of quantum mechanics and has profound implications for quantum communication and quantum computing....
Distance: 0.7741


### Without Context

In [15]:
display_search_results(search_results, query)


Search results for: 'What is the main law of quantum mechanics?'

--- Result 1 ---
Source: ./data/quantum_physics.md
Content: A Deeper Dive into Quantum Physics

Quantum physics, at its core, is the study of the microscopic world – the realm of atoms and subatomic particles. Unlike classical physics, which describes the macr...
Distance: 0.8135

--- Result 2 ---
Source: ./data/quantum_physics.md
Content: called it, is a fundamental feature of quantum mechanics and has profound implications for quantum communication and quantum computing....
Distance: 0.8568

--- Result 3 ---
Source: ./data/quantum_physics.md
Content: Wave Mechanics (Schrödinger): Erwin Schrödinger developed a mathematical equation, the Schrödinger equation, which describes the time evolution of the wave function of a quantum system. The wave funct...
Distance: 0.8685


In [16]:
query1 = "What event is generally considered the beginning of the Golden Age of superhero comics? Name two iconic superheroes that debuted in this era."
query2 = "Explain how the K-Means clustering algorithm works."
query3 = "Discuss at least three potential applications of quantum computing and how they could impact those fields."
query4 = "Describe the phenomenon of quantum entanglement and why Einstein referred to it as 'spooky action at a distance.' What are some potential applications of entanglement?"


search_results1 = pipeline.retrieve(query1, k=3)
search_results2 = pipeline.retrieve(query2, k=3)
search_results3 = pipeline.retrieve(query3, k=3)
search_results4 = pipeline.retrieve(query4, k=3)

INFO:data_store.src.core.data_pipeline:Retrieving top 3 results for query: 'What event is generally considered the beginning o...'
INFO:data_store.src.vectorstore.chroma_store:Performing search in collection 'my_documents' for query: 'What event is generally considered the beginning o...' (k=3)
INFO:data_store.src.embeddings.embedding_service:Generating embeddings for 1 documents...


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

INFO:data_store.src.embeddings.embedding_service:Embeddings generated.
INFO:data_store.src.vectorstore.chroma_store:Found 3 results.
INFO:data_store.src.core.data_pipeline:Retrieving top 3 results for query: 'Explain how the K-Means clustering algorithm works...'
INFO:data_store.src.vectorstore.chroma_store:Performing search in collection 'my_documents' for query: 'Explain how the K-Means clustering algorithm works...' (k=3)
INFO:data_store.src.embeddings.embedding_service:Generating embeddings for 1 documents...


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

INFO:data_store.src.embeddings.embedding_service:Embeddings generated.
INFO:data_store.src.vectorstore.chroma_store:Found 3 results.
INFO:data_store.src.core.data_pipeline:Retrieving top 3 results for query: 'Discuss at least three potential applications of q...'
INFO:data_store.src.vectorstore.chroma_store:Performing search in collection 'my_documents' for query: 'Discuss at least three potential applications of q...' (k=3)
INFO:data_store.src.embeddings.embedding_service:Generating embeddings for 1 documents...


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

INFO:data_store.src.embeddings.embedding_service:Embeddings generated.
INFO:data_store.src.vectorstore.chroma_store:Found 3 results.
INFO:data_store.src.core.data_pipeline:Retrieving top 3 results for query: 'Describe the phenomenon of quantum entanglement an...'
INFO:data_store.src.vectorstore.chroma_store:Performing search in collection 'my_documents' for query: 'Describe the phenomenon of quantum entanglement an...' (k=3)
INFO:data_store.src.embeddings.embedding_service:Generating embeddings for 1 documents...


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

INFO:data_store.src.embeddings.embedding_service:Embeddings generated.
INFO:data_store.src.vectorstore.chroma_store:Found 3 results.


In [17]:
display_search_results(search_results1, query1)


Search results for: 'What event is generally considered the beginning of the Golden Age of superhero comics? Name two iconic superheroes that debuted in this era.'

--- Result 1 ---
Source: ./data/superhero.md
Content: The Golden Age (Late 1930s - 1950s)

The Golden Age of superhero comics is generally considered to have begun with the debut of Superman in Action Comics #1 (1938) and Batman in Detective Comics #27 (...
Distance: 0.3234

--- Result 2 ---
Source: ./data/superhero.md
Content: Superhero Comics: A Pop Culture Phenomenon

Superhero comics are a significant and enduring part of global popular culture. Originating in the late 1930s, these illustrated narratives feature characte...
Distance: 0.5750

--- Result 3 ---
Source: ./data/superhero.md
Content: Themes: Predominantly focused on good versus evil, with heroes often fighting clearly defined villains and upholding patriotic ideals, especially during World War II.

Art Style: Often simpler and mor...
Distance: 0.6959


In [18]:
display_search_results(search_results2, query2)


Search results for: 'Explain how the K-Means clustering algorithm works.'

--- Result 1 ---
Source: ./data/machine_learning.md
Content: Use Cases: Image and video analysis, natural language understanding, speech synthesis, machine translation.

Unsupervised Learning Algorithms

K-Means Clustering: An iterative algorithm that partition...
Distance: 0.9334

--- Result 2 ---
Source: ./data/machine_learning.md
Content: K-Nearest Neighbors (KNN): A simple yet effective algorithm for classification and regression that classifies a new data point based on the majority class (or average value) of its k nearest neighbors...
Distance: 1.1559

--- Result 3 ---
Source: ./data/machine_learning.md
Content: Unsupervised Learning: In unsupervised learning, the algorithm learns from unlabeled data, without any explicit output labels. The goal is to discover hidden patterns, structures, or relationships in ...
Distance: 1.1786


In [19]:
display_search_results(search_results3, query3)


Search results for: 'Discuss at least three potential applications of quantum computing and how they could impact those fields.'

--- Result 1 ---
Source: ./data/quantum_computing.md
Content: Quantum Computing: Harnessing the Power of the Quantum Realm

Quantum computing is an emerging field that leverages the principles of quantum mechanics to solve complex problems that are intractable f...
Distance: 0.4516

--- Result 2 ---
Source: ./data/quantum_physics.md
Content: The Ongoing Revolution and Future Directions

Quantum physics continues to be a vibrant and rapidly evolving field. Current research is focused on:

Quantum Computing: Building computers that exploit ...
Distance: 0.5732

--- Result 3 ---
Source: ./data/quantum_computing.md
Content: Artificial Intelligence and Machine Learning: Accelerating machine learning algorithms and developing new quantum machine learning techniques.

Cryptography: Breaking current public-key encryption alg...
Distance: 0.5793


In [20]:
display_search_results(search_results4, query4)


Search results for: 'Describe the phenomenon of quantum entanglement and why Einstein referred to it as 'spooky action at a distance.' What are some potential applications of entanglement?'

--- Result 1 ---
Source: ./data/quantum_physics.md
Content: Quantum Entanglement: This is perhaps one of the most counter-intuitive aspects of quantum mechanics. When two or more particles become entangled, their quantum states are linked in such a way that th...
Distance: 0.6522

--- Result 2 ---
Source: ./data/quantum_computing.md
Content: Entanglement: When two or more qubits become entangled, their quantum states are linked in such a way that they share the same fate, regardless of the distance between them. Measuring the state of one...
Distance: 0.7011

--- Result 3 ---
Source: ./data/quantum_physics.md
Content: Quantum Communication and Cryptography: Developing secure communication methods based on the principles of quantum mechanics, such as quantum key distribution, which offers theoretic
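
The two retrieval runs above can be compared side by side. The sketch below copies the top-1 `Distance` for each query from the printed outputs of the two sections. It assumes, as the headings suggest, that the first run used contextualized chunks and the second did not. With only five queries this is an illustration, not a benchmark.

```python
# Top-1 distances copied from the printed outputs above (lower is closer).
top1 = {
    "query":  {"with": 0.6070, "without": 0.8135},
    "query1": {"with": 0.3081, "without": 0.3234},
    "query2": {"with": 0.9884, "without": 0.9334},
    "query3": {"with": 0.4053, "without": 0.4516},
    "query4": {"with": 0.4726, "without": 0.6522},
}

for name, d in top1.items():
    delta = d["without"] - d["with"]
    closer = "with" if delta > 0 else "without"
    print(f"{name}: with={d['with']:.4f} without={d['without']:.4f} "
          f"delta={delta:+.4f} (closer: {closer})")

wins_with = sum(1 for d in top1.values() if d["with"] < d["without"])
print(f"Top-1 distance is smaller with context for {wins_with} of {len(top1)} queries.")
```

A smaller distance does not by itself mean a better answer. For `query`, the top-ranked chunk also differs between the two runs, so the distances there are measured against different chunks.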

### Answering with context retrieval

In [26]:
print_llm_response(query1, search_results1)


Question: What event is generally considered the beginning of the Golden Age of superhero comics? Name two iconic superheroes that debuted in this era.
Answer: The Golden Age of superhero comics is generally considered to have begun with the debut of Superman in Action Comics #1 (1938) and Batman in Detective Comics #27 (1939) [./data/superhero.md]. Two iconic superheroes that debuted during this era are Superman and Batman [./data/superhero.md].


In [27]:
print_llm_response(query2, search_results2)


Question: Explain how the K-Means clustering algorithm works.
Answer: K-Means Clustering is an iterative algorithm that divides a dataset into *k* distinct clusters based on the distance of data points to the centroids of the clusters [./data/machine_learning.md].


In [28]:
print_llm_response(query3, search_results3)


Question: Discuss at least three potential applications of quantum computing and how they could impact those fields.
Answer: Quantum computing has the potential to impact several fields:

1.  **Artificial Intelligence and Machine Learning:** Quantum computing can accelerate machine learning algorithms and develop new quantum machine learning techniques \[./data/quantum_computing.md].
2.  **Cryptography:** Quantum computers could break current public-key encryption algorithms (like RSA) and lead to the development of new, quantum-resistant cryptographic methods \[./data/quantum_computing.md].
3.  **Optimization Problems:** Quantum computers could find optimal solutions to complex logistical and scheduling problems \[./data/quantum_computing.md].


In [29]:
print_llm_response(query4, search_results4)


Question: Describe the phenomenon of quantum entanglement and why Einstein referred to it as 'spooky action at a distance.' What are some potential applications of entanglement?
Answer: Quantum entanglement links the quantum states of two or more particles, regardless of the distance between them. Measuring the state of one entangled particle instantaneously determines the state of the other(s) [./data/quantum_computing.md]. Einstein called this "spooky action at a distance" [./data/quantum_physics.md]. Entanglement has implications for quantum communication and quantum computing [./data/quantum_physics.md]. It allows quantum computers to perform correlated operations on multiple qubits, leading to exponential increases in computational power [./data/quantum_computing.md].


### Answering without context retrieval

In [27]:
print_llm_response(query1, search_results1)


Question: What event is generally considered the beginning of the Golden Age of superhero comics? Name two iconic superheroes that debuted in this era.
Answer: The Golden Age of superhero comics is generally considered to have begun with the debut of Superman in Action Comics #1 (1938) and Batman in Detective Comics #27 (1939) [./data/superhero.md]. Two iconic superheroes that debuted during this era are Superman and Batman [./data/superhero.md].


In [28]:
print_llm_response(query2, search_results2)


Question: Explain how the K-Means clustering algorithm works.
Answer: K-Means clustering is an iterative algorithm that divides a dataset into *k* distinct clusters based on the distance of data points to the cluster centroids [./data/machine_learning.md]. This algorithm falls under unsupervised learning, which means it learns from unlabeled data to discover hidden patterns [./data/machine_learning.md].


In [29]:
print_llm_response(query3, search_results3)


Question: Discuss at least three potential applications of quantum computing and how they could impact those fields.
Answer: Quantum computing has the potential to revolutionize several industries:

1.  **Artificial Intelligence and Machine Learning:** Quantum computing can accelerate machine learning algorithms and develop new quantum machine learning techniques [./data/quantum_computing.md].
2.  **Cryptography:** Quantum computers can break current public-key encryption algorithms and enable the development of new, quantum-resistant cryptographic methods [./data/quantum_computing.md].
3.  **Optimization Problems:** Quantum computing can find optimal solutions to complex logistical and scheduling problems [./data/quantum_computing.md].


In [30]:
print_llm_response(query4, search_results4)


Question: Describe the phenomenon of quantum entanglement and why Einstein referred to it as 'spooky action at a distance.' What are some potential applications of entanglement?
Answer: Quantum entanglement links the quantum states of two or more particles, regardless of the distance between them, so they share the same fate. Measuring the state of one entangled particle instantaneously determines the state of the other(s) [./data/quantum_physics.md, ./data/quantum_computing.md]. Einstein referred to this as "spooky action at a distance" [./data/quantum_physics.md]. Entanglement allows quantum computers to perform correlated operations on multiple qubits, potentially leading to exponential increases in computational power [./data/quantum_computing.md].
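
The answers above cite their sources in square brackets, sometimes with several paths in one bracket. The sketch below pulls those citations out so they can be counted or checked against the retrieved sources. It assumes every citation is a `./`-prefixed path or a comma-separated list of such paths, optionally with a backslash before the opening bracket as in one of the outputs. `extract_citations` is a hypothetical helper, not part of the pipeline.

```python
import re
from collections import Counter

# Matches "[./data/a.md]" and "[./data/a.md, ./data/b.md]"; the optional
# backslash before "[" covers the escaped form seen in one output above.
CITATION_RE = re.compile(r"\\?\[((?:\./[^\],]+)(?:,\s*\./[^\],]+)*)\]")


def extract_citations(answer: str) -> list[str]:
    """Return cited paths in order of appearance (hypothetical helper)."""
    paths = []
    for match in CITATION_RE.finditer(answer):
        paths.extend(p.strip() for p in match.group(1).split(","))
    return paths


# Excerpt of the last answer above.
answer = (
    "Measuring the state of one entangled particle instantaneously determines "
    "the state of the other(s) [./data/quantum_physics.md, ./data/quantum_computing.md]. "
    'Einstein referred to this as "spooky action at a distance" [./data/quantum_physics.md].'
)
cited = extract_citations(answer)
print(Counter(cited))
```

Comparing `set(cited)` with the `source` values in the corresponding `search_results` would flag any citation that points at a file that was not retrieved.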
