### **To run this notebook efficiently, make sure that:**
1. Your machine has a GPU.
2. On Colab, a GPU runtime is enabled (the setup cell below checks for CUDA).
3. [A Hugging Face token is set up in the notebook or environment](https://huggingface.co/settings/tokens).
4. The Hugging Face token is enabled for this notebook's scope.


## **Introduction to Vector Databases**


### What are Vector Databases?

A Vector Database is a specialized database designed to store, manage, and search **high-dimensional vectors** (also known as embeddings) efficiently. Unlike traditional databases that store structured data, vector databases are optimized for handling numerical representations of data points, where each point (e.g., text, image, audio) is transformed into a vector of numbers that captures its semantic meaning or characteristics.

#### Core Concepts:

*   **Vector Embeddings**: These are numerical representations of data (like text, images, or audio) in a high-dimensional space. Machine learning models (e.g., neural networks) convert complex data into these vectors, where semantically similar items are located closer together in the vector space.
*   **Similarity Search**: This is the primary function of a vector database. It involves finding vectors that are 'similar' to a given query vector. Common metrics for measuring similarity include:
    *   **Cosine Similarity**: Measures the cosine of the angle between two vectors, indicating their directional similarity. Often used for text embeddings.
    *   **Euclidean Distance**: Measures the straight-line distance between two vectors in a multi-dimensional space. Shorter distances imply greater similarity.
*   **Approximate Nearest Neighbor (ANN) Algorithms**: Since exact nearest neighbor search in high-dimensional spaces is computationally intensive and impractical for large datasets, vector databases employ ANN algorithms (e.g., HNSW, IVFPQ, ANNOY). These algorithms allow for very fast similarity searches by sacrificing a small amount of accuracy for significant speed improvements.
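To make the two metrics and the cost of exact search concrete, here is a minimal sketch in plain Python. The helper names (`cosine_similarity`, `euclidean_distance`, `exact_nearest`) and the 3-dimensional vectors are made up for illustration; real embeddings have hundreds of dimensions, and a vector database would replace the brute-force loop with an ANN index.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors (0.0 = identical)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def exact_nearest(query, vectors):
    """Brute-force search: score the query against every stored vector."""
    return max(range(len(vectors)), key=lambda i: cosine_similarity(query, vectors[i]))

# Toy "embeddings" (illustrative numbers only)
vectors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
query = [0.9, 0.1, 0.0]

best = exact_nearest(query, vectors)
print(f"Nearest stored vector: index {best}")
```

The brute-force loop touches every stored vector on every query, so its cost grows linearly with the collection size. That linear scan is what ANN indexes are designed to avoid.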

#### Essential for AI Applications:

Vector databases are crucial for modern AI applications because they enable operations that are foundational to understanding and retrieving information based on meaning rather than exact keywords. Key applications include:

*   **Semantic Search**: Allows users to search for content based on its meaning, even if the exact keywords are not present. For example, searching for "pictures of playful cats" and getting results for "kittens frolicking".
*   **Recommendation Systems**: By finding vectors similar to a user's preferences or previously liked items, vector databases can suggest relevant products, movies, or articles.
*   **Anomaly Detection**: Identifying data points (vectors) that are unusually distant from others, signaling potential fraud, errors, or rare events.
*   **Retrieval-Augmented Generation (RAG)**: This technique combines the power of large language models (LLMs) with external knowledge retrieval. A vector database stores a vast collection of documents as embeddings. When an LLM needs to answer a query, it first retrieves relevant contextual information from the vector database (based on semantic similarity) and then uses this information to generate a more accurate and informed response.

By efficiently managing and searching these high-dimensional representations, vector databases bridge the gap between raw data and the semantic understanding required for advanced AI functionalities.
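The retrieval step of RAG described above can be sketched in a few lines. This is only an illustration under stated assumptions: the `retrieve` and `build_prompt` helpers, the stored texts, and the 3-dimensional vectors are hypothetical stand-ins, and the final call to an LLM is left out.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def retrieve(query_vector, store, k=2):
    """Return the k stored texts whose vectors are most similar to the query."""
    ranked = sorted(store, key=lambda item: cosine(query_vector, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

def build_prompt(question, contexts):
    """Place the retrieved texts in front of the question for the LLM."""
    context_block = "\n".join(f"- {c}" for c in contexts)
    return f"Answer using only this context:\n{context_block}\n\nQuestion: {question}"

# (text, embedding) pairs; the vectors are made-up stand-ins for real embeddings
store = [
    ("Chroma is an open-source vector database.", [0.9, 0.1, 0.0]),
    ("Cats sleep most of the day.", [0.0, 0.2, 0.9]),
    ("HNSW is an ANN index.", [0.8, 0.3, 0.1]),
]
question = "Which vector databases are open source?"
question_vector = [1.0, 0.2, 0.0]  # stand-in for the embedded question

contexts = retrieve(question_vector, store, k=2)
prompt = build_prompt(question, contexts)
print(prompt)
```

In a real pipeline, the same embedding model would produce both the stored vectors and the question vector, and the vector database would perform the ranking.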

## **Vector vs. Traditional Databases**


### Vector Databases vs. Traditional Databases

Vector databases and traditional databases (relational and NoSQL) are designed for different purposes, leading to fundamental differences in their data structures, query types, and optimal use cases.

#### **Data Structure**

*   **Vector Databases**: Store data as high-dimensional vectors, which are numerical representations of objects (e.g., text, images, audio). These vectors capture the semantic meaning or features of the data. The underlying storage typically relies on specialized index structures (e.g., HNSW graphs or inverted-file indexes), often provided by libraries such as Annoy or FAISS, to enable efficient similarity searches.
*   **Traditional Relational Databases (SQL)**: Organize data into tables with predefined schemas. Data is stored in rows and columns, enforcing strict relationships between different tables through keys. Data types are typically scalar (numbers, strings, dates).
*   **Traditional NoSQL Databases**: Offer more flexible schema designs. They can store data in various formats like documents (JSON/BSON), key-value pairs, wide-column stores, or graphs. While flexible, they primarily handle structured or semi-structured data, not high-dimensional vectors as their native data type.

#### **Query Types**

*   **Vector Databases**: Primarily designed for **similarity search** (also known as nearest neighbor search). Users query by providing a vector, and the database returns the vectors that are semantically similar or closest to it.

### **Industry Leading Vector Databases**

- **Pinecone**:
  A fully managed, cloud-native vector database designed for real-time applications at scale. Pinecone simplifies the deployment and management of vector search infrastructure, offering high performance, low latency, and a developer-friendly API. It's often used for semantic search, recommendation systems, and anomaly detection.

- **Milvus**:
  An open-source vector database built for AI applications and similarity search. Milvus supports various vector indexes and provides efficient query performance for large-scale datasets. It can be deployed on-premise or in the cloud and is highly scalable. Its use cases include image recognition, video analysis, and drug discovery.

- **Weaviate**:
  An open-source vector database that combines vector search with a GraphQL-based API for semantic search and knowledge graph capabilities. Weaviate is schema-aware, allowing for hybrid queries (vector and scalar filters) and offering a module system for integrating with different machine learning models and data sources. It is suitable for semantic search, recommendation engines, and chatbot applications.

- **Qdrant**:
  An open-source vector similarity search engine and database, providing a production-ready service with a convenient API. Qdrant focuses on high-performance vector search with filtering capabilities, supporting various data types and deployments. It is known for its fast performance and suitability for applications like personalized recommendations, semantic search, and deduplication.

- **ChromaDB**:
  An open-source vector database designed for ease of use, making it simple to build LLM applications. Chroma is lightweight and offers a straightforward API for embedding and querying documents. It focuses on being accessible for developers and is often used for RAG (Retrieval Augmented Generation) and other generative AI use cases.

## **Setup Environment**

In [9]:
import torch
print(f"PyTorch version: {torch.__version__}")
print(f"Is CUDA available: {torch.cuda.is_available()}")

PyTorch version: 2.9.0+cu126
Is CUDA available: True


## **Import Text Data**

In [10]:
example_text_data = """
Vector databases are specialized databases designed to store, manage, and search high-dimensional vectors (also known as embeddings) efficiently. Unlike traditional databases that store structured data, vector databases are optimized for handling numerical representations of data points, where each point (e.g., text, image, audio) is transformed into a vector of numbers that captures its semantic meaning or characteristics.

Core Concepts of Vector Databases:
1.  **Vector Embeddings**: These are numerical representations of data (like text, images, or audio) in a high-dimensional space. Machine learning models (e.g., neural networks) convert complex data into these vectors, where semantically similar items are located closer together in the vector space.
2.  **Similarity Search**: This is the primary function of a vector database. It involves finding vectors that are 'similar' to a given query vector. Common metrics for measuring similarity include Cosine Similarity (measures directional similarity) and Euclidean Distance (measures straight-line distance).
3.  **Approximate Nearest Neighbor (ANN) Algorithms**: Since exact nearest neighbor search in high-dimensional spaces is computationally intensive for large datasets, vector databases employ ANN algorithms (e.g., HNSW, IVFPQ, ANNOY). These algorithms allow for very fast similarity searches by sacrificing a small amount of accuracy for significant speed improvements.

Essential for AI Applications:
Vector databases are crucial for modern AI applications because they enable operations foundational to understanding and retrieving information based on meaning rather than exact keywords. Key applications include:

*   **Semantic Search**: Allows users to search for content based on its meaning, even if exact keywords are not present. For example, searching for "pictures of playful cats" and getting results for "kittens frolicking".
*   **Recommendation Systems**: By finding vectors similar to a user's preferences or previously liked items, vector databases can suggest relevant products, movies, or articles.
*   **Anomaly Detection**: Identifying data points (vectors) that are unusually distant from others, signaling potential fraud, errors, or rare events.
*   **Retrieval-Augmented Generation (RAG)**: This technique combines large language models (LLMs) with external knowledge retrieval. A vector database stores a vast collection of documents as embeddings. When an LLM needs to answer a query, it first retrieves relevant contextual information from the vector database (based on semantic similarity) and then uses this information to generate a more accurate and informed response.

By efficiently managing and searching these high-dimensional representations, vector databases bridge the gap between raw data and the semantic understanding required for advanced AI functionalities.

Traditional vs. Vector Databases:
Traditional relational databases (SQL) organize data into tables with predefined schemas, optimized for exact match queries, joins, and complex analytical queries. NoSQL databases offer flexible schema designs for various data formats like documents or key-value pairs, primarily handling structured or semi-structured data. In contrast, vector databases specifically store high-dimensional vectors for similarity searches. While traditional databases excel at managing structured, transactional, and analytical data, vector databases are purpose-built for semantic similarity searches, which are critical for modern AI-driven applications.

Industry Leading Vector Databases:
-   **Pinecone**: A fully managed, cloud-native vector database designed for real-time applications at scale.
-   **Milvus**: An open-source vector database built for AI applications and similarity search, supporting various vector indexes.
-   **Weaviate**: An open-source vector database combining vector search with a GraphQL API for semantic search and knowledge graph capabilities.
-   **Qdrant**: An open-source vector similarity search engine and database, providing high-performance vector search with filtering capabilities.
-   **ChromaDB**: An open-source vector database designed for ease of use, making it simple to build LLM applications, often used for RAG.
"""

print(example_text_data[:500]) # Print first 500 characters to confirm data load
print(f"\nTotal length of example_text_data: {len(example_text_data)} characters")


Vector databases are specialized databases designed to store, manage, and search high-dimensional vectors (also known as embeddings) efficiently. Unlike traditional databases that store structured data, vector databases are optimized for handling numerical representations of data points, where each point (e.g., text, image, audio) is transformed into a vector of numbers that captures its semantic meaning or characteristics.

Core Concepts of Vector Databases:
1.  **Vector Embeddings**: These ar

Total length of example_text_data: 4261 characters


## **Text Chunking**

In [11]:
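# langchain-text-splitters provides the langchain_text_splitters module imported in the next cell
!pip install langchain-community langchain-text-splitters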



In [12]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Initialize the splitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separators=["\n\n", "\n", ".", " "] # Define separators for intelligent splitting
)

# Split the text into chunks
text_chunks = text_splitter.split_text(example_text_data)

# Print results
print(f"Total number of chunks: {len(text_chunks)}")
print("\nFirst two chunks:")
for i, chunk in enumerate(text_chunks[:2]):
    print(f"--- Chunk {i+1} ---")
    print(chunk)


Total number of chunks: 14

First two chunks:
--- Chunk 1 ---
Vector databases are specialized databases designed to store, manage, and search high-dimensional vectors (also known as embeddings) efficiently. Unlike traditional databases that store structured data, vector databases are optimized for handling numerical representations of data points, where each point (e.g., text, image, audio) is transformed into a vector of numbers that captures its semantic meaning or characteristics.
--- Chunk 2 ---
Core Concepts of Vector Databases:
1.  **Vector Embeddings**: These are numerical representations of data (like text, images, or audio) in a high-dimensional space. Machine learning models (e.g., neural networks) convert complex data into these vectors, where semantically similar items are located closer together in the vector space.


### Reasoning Behind Text Chunking

Text chunking is a vital preprocessing step before generating embeddings for several reasons:

*   **Input Limits of Embedding Models**: Most embedding models have a maximum token or character limit for their input. Large documents cannot be fed into these models directly. Chunking breaks down extensive texts into smaller, digestible segments that fit within these input constraints.

*   **Granularity of Information (`chunk_size`)**: The `chunk_size` parameter defines the maximum length of each segment. Choosing an appropriate `chunk_size` is critical as it determines the granularity of the information that an embedding will represent. If chunks are too small, they might lack sufficient context. If they are too large, they might exceed model limits or combine too many disparate ideas, diluting the semantic meaning of the embedding.

*   **Maintaining Context (`chunk_overlap`)**: The `chunk_overlap` parameter ensures that there is continuity and contextual flow between adjacent chunks. By having a small overlap, a chunk can include some of the preceding text, preventing the loss of meaning that might occur if a crucial sentence or idea is split exactly at a chunk boundary. This overlap helps to maintain semantic integrity when querying, as the retrieved chunks are more likely to contain complete thoughts or related concepts.

*   **Intelligent Splitting (`RecursiveCharacterTextSplitter`)**: The `RecursiveCharacterTextSplitter` from `langchain_text_splitters` avoids cutting text at arbitrary fixed intervals. It works through a list of `separators` (here `\n\n`, `\n`, `.`, ` `) in order: it first splits on the largest, most semantically meaningful delimiter (double newlines for paragraphs), and if a piece is still too large, it falls back to progressively smaller delimiters (single newlines, then periods, then spaces). This helps chunks end on natural breaks in the text, preserving as much coherent meaning as possible within each chunk.
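To see how `chunk_size` and `chunk_overlap` interact, the sketch below implements a deliberately simplified fixed-width splitter. It is not how `RecursiveCharacterTextSplitter` works internally (that class prefers separator boundaries); the `chunk_with_overlap` helper is a hypothetical name used only to show the sliding-window arithmetic.

```python
def chunk_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_with_overlap(sample, chunk_size=10, chunk_overlap=3)
for i, chunk in enumerate(chunks, start=1):
    print(f"Chunk {i}: {chunk}")
# Chunk 1: abcdefghij
# Chunk 2: hijklmnopq
# Chunk 3: opqrstuvwx
# Chunk 4: vwxyz
```

Each chunk begins with the last three characters of the previous one, which is the continuity that `chunk_overlap` provides. Splitting on separators, as the LangChain splitter does, additionally avoids cutting in the middle of a word or sentence.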

## **Generate Embeddings**


In [13]:
from transformers import AutoTokenizer, AutoModel
import torch

# Define the model name
model_name = 'sentence-transformers/all-MiniLM-L6-v2'

print(f"Using model: {model_name} for embedding generation.")

Using model: sentence-transformers/all-MiniLM-L6-v2 for embedding generation.


In [14]:
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

def get_embeddings(texts):
    # Tokenize the input texts
    encoded_input = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')

    # No device is set explicitly, so the model and inputs stay on the CPU, which is adequate for a model this small.
    # To run on a GPU, move both the model and the tokenized inputs to 'cuda' when it is available.
    with torch.no_grad():
        model_output = model(**encoded_input)

    # Perform mean pooling to get sentence embeddings: average the token vectors, ignoring padding.
    # Equivalent one-liner, with mask = attention_mask.unsqueeze(-1):
    #   (last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
    input_mask_expanded = encoded_input['attention_mask'].unsqueeze(-1).expand(model_output.last_hidden_state.size()).float()
    sum_embeddings = torch.sum(model_output.last_hidden_state * input_mask_expanded, 1)
    sum_mask = torch.clamp(input_mask_expanded.sum(1), min=1e-9)
    embeddings = sum_embeddings / sum_mask

    return embeddings.tolist()

print("Tokenizer and model loaded. 'get_embeddings' function defined.")

Tokenizer and model loaded. 'get_embeddings' function defined.
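The mean-pooling step inside `get_embeddings` can be hard to read in tensor form. Below is a plain-Python sketch of the same idea for a single sentence, assuming toy 2-dimensional token vectors; the `mean_pool` helper is a hypothetical name and is not part of `transformers`.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask is 1, ignoring padding positions."""
    dim = len(token_vectors[0])
    totals = [0.0] * dim
    for vector, keep in zip(token_vectors, attention_mask):
        if keep:
            for i, value in enumerate(vector):
                totals[i] += value
    # Guard against division by zero, mirroring torch.clamp(..., min=1e-9)
    count = max(sum(attention_mask), 1e-9)
    return [total / count for total in totals]

# Two real tokens followed by one padding token (illustrative numbers)
token_vectors = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
attention_mask = [1, 1, 0]

sentence_vector = mean_pool(token_vectors, attention_mask)
print(sentence_vector)  # [2.0, 3.0]
```

The padding vector `[100.0, 100.0]` does not affect the result, which is why the attention mask is multiplied in before summing.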


In [15]:
embeddings = get_embeddings(text_chunks)

print(f"Total number of embeddings generated: {len(embeddings)}")
if embeddings:
    print(f"Shape of the first embedding: {len(embeddings[0])}")

Total number of embeddings generated: 14
Shape of the first embedding: 384


## **Save to ChromaDB**


In [17]:
import chromadb

# Initialize a ChromaDB client (in-memory for this demo)
client = chromadb.Client()

# Create a new collection (or reuse it if it already exists)
collection_name = 'my_document_collection'
collection = client.get_or_create_collection(name=collection_name)

print(f"ChromaDB client initialized and collection '{collection_name}' created.")

ChromaDB client initialized and collection 'my_document_collection' created.


In [18]:
import uuid

# Generate a unique ID for each text chunk
ids = [str(uuid.uuid4()) for _ in range(len(text_chunks))]

# Add the text chunks, their embeddings, and the IDs to the ChromaDB collection
collection.add(
    embeddings=embeddings,
    documents=text_chunks,
    ids=ids
)

# Print a confirmation message and the resulting collection size
print(f"Added {len(text_chunks)} documents (chunks) to ChromaDB collection '{collection_name}'.")
print(f"Collection count: {collection.count()} documents.")

Added 14 documents (chunks) to ChromaDB collection 'my_document_collection'.
Collection count: 14 documents.
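One design point worth noting: `uuid.uuid4()` produces fresh random IDs on every run, so re-running the cell above against the same collection would add the same 14 chunks again under new IDs. A possible alternative, sketched here as an assumption rather than something the notebook does, is to derive each ID from the chunk's content. The `chunk_id` helper is hypothetical.

```python
import hashlib

def chunk_id(chunk: str) -> str:
    """Derive a stable ID from the chunk text: same text, same ID."""
    return hashlib.sha256(chunk.encode("utf-8")).hexdigest()

chunks = ["first chunk", "second chunk", "first chunk"]

# Drop exact duplicates while preserving order, since IDs must be unique
unique_chunks = list(dict.fromkeys(chunks))
ids = [chunk_id(c) for c in unique_chunks]

for c, i in zip(unique_chunks, ids):
    print(i[:12], c)
```

With stable IDs, the chunks could be written using the collection's `upsert` method instead of `add`, so that repeated runs overwrite existing entries rather than duplicating them.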


## Summary:

### Q&A
The task was to create a comprehensive tutorial on vector databases and implement a practical demonstration using ChromaDB. This was achieved by:
*   Explaining core concepts of vector databases, their differences from traditional databases, and listing industry leaders.
*   Implementing a practical demonstration: setting up the environment, importing text data, chunking the text, generating embeddings with `sentence-transformers/all-MiniLM-L6-v2` (used in place of the originally planned Qwen model), and storing the chunks and embeddings in ChromaDB.

### Data Analysis Key Findings
*   **Vector Database Concepts**: The tutorial defined vector databases as specialized systems for high-dimensional vectors (embeddings), emphasizing core concepts such as vector embeddings, similarity search (using metrics like Cosine Similarity and Euclidean Distance), and Approximate Nearest Neighbor (ANN) algorithms.
*   **AI Application Essentiality**: Vector databases were highlighted as crucial for AI applications like Semantic Search, Recommendation Systems, Anomaly Detection, and Retrieval-Augmented Generation (RAG).
*   **Vector vs. Traditional Databases**: Key distinctions were drawn based on:
    *   **Data Structure**: Vector databases store high-dimensional vectors for semantic meaning, while traditional relational databases use structured tables and NoSQL databases offer flexible schemas for structured/semi-structured data.
    *   **Query Types**: Vector databases are optimized for similarity search, whereas traditional databases excel at exact match queries (SQL) or various model-specific queries (NoSQL).
    *   **Use Cases**: Vector databases are suited for AI-driven semantic understanding, while traditional databases handle transactional data (SQL) or flexible, scalable data models (NoSQL).
*   **Industry Leaders**: Prominent vector databases like Pinecone, Milvus, Weaviate, Qdrant, and ChromaDB were described.
*   **Environment Setup**: Necessary libraries, including `chromadb`, `transformers`, `torch`, and `sentence-transformers`, were successfully installed.
*   **Text Data Preparation**: An example text dataset of 4261 characters explaining vector database concepts was loaded.
*   **Text Chunking**: The `example_text_data` was split into 14 manageable text chunks using `RecursiveCharacterTextSplitter` from `langchain_text_splitters` with a `chunk_size` of 500 characters and `chunk_overlap` of 50 characters, ensuring contextual continuity.
*   **Embedding Generation**: Embeddings were generated for all 14 text chunks using the `sentence-transformers/all-MiniLM-L6-v2` model. Each embedding had a dimension of 384.
*   **ChromaDB Storage**: An in-memory ChromaDB client was initialized, a collection named `my_document_collection` was created, and all 14 text chunks along with their generated embeddings and unique IDs were successfully added to this collection.

### Insights or Next Steps
*   The populated ChromaDB collection is now ready for semantic search queries, allowing for retrieval of semantically similar text chunks based on a query embedding, which is foundational for RAG applications.
*   Further exploration could involve implementing a semantic search functionality to query the ChromaDB collection and retrieve relevant chunks, demonstrating the practical utility of the stored embeddings.
