# Building Vector Indexes and OpenSearch Integration

## Overview

In this tutorial, we'll learn how to create a **Hybrid Search** system that combines semantic search and keyword search using:
- **OpenSearch** for vector database and keyword search
- **LlamaIndex** for document management and indexing
- **BGE-M3** embedding model for text-to-vector conversion

### What is Hybrid Search?
Hybrid search combines:
- **Keyword Search**: Exact term matching (good for precise queries)
- **Semantic Search**: Understanding context and meaning (good for conceptual queries)
- **Combined Results**: More comprehensive and accurate results

## Section 1: Installing Dependencies

First, we need to install all required packages for our hybrid search system.

In [None]:
# Install LlamaIndex and dependencies
!pip install llama-index -q
!pip install llama-index-embeddings-huggingface -q
!pip install llama-index-vector-stores-opensearch -q
!pip install requests -q
!pip install nest_asyncio -q

**Package Explanations:**
- `llama-index`: Main framework for building RAG applications
- `llama-index-embeddings-huggingface`: For using Hugging Face embedding models
- `llama-index-vector-stores-opensearch`: OpenSearch connector
- `requests`: For HTTP API calls
- `nest_asyncio`: Fixes event loop issues in Jupyter Notebook
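
Before moving on, it can help to confirm that the installs succeeded. The snippet below is a minimal sketch using only the standard library; the helper name `missing_modules` is hypothetical, and the module names are assumed to be the import names of the packages above.

```python
import importlib.util


def missing_modules(module_names):
    """Return the names from `module_names` that cannot be located for import."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


# Import names assumed to correspond to the packages installed above.
required = ["llama_index", "requests", "nest_asyncio", "torch"]
print("Missing:", missing_modules(required) or "none")
```

If anything is listed as missing, re-running the install cell (or restarting the kernel) is usually the first thing to try.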

## Section 2: Import Modules and Initial Setup

Import all necessary modules and configure the environment.

In [None]:
import os
import torch
import urllib.request
import pickle
import requests
import nest_asyncio
import json
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex, StorageContext
from llama_index.core.node_parser import MarkdownNodeParser
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.vector_stores.opensearch import OpensearchVectorStore, OpensearchVectorClient
from llama_index.core.vector_stores.types import VectorStoreQueryMode

# Apply nest_asyncio to avoid runtime errors
nest_asyncio.apply()

print("✅ All modules imported successfully!")

**Key Imports:**
- **Core LlamaIndex**: Document reading, indexing, and storage
- **Node Parser**: For splitting documents into manageable chunks
- **Embedding**: For converting text to vectors
- **Vector Store**: For OpenSearch integration
- **nest_asyncio**: Fixes async issues in Jupyter environments

## Section 3: Configuration Setup

⚠️ **Important**: Change `OPENSEARCH_INDEX` to a unique name of your own (e.g., `yourname_doc_index`)

In [None]:
# OpenSearch Configuration
OPENSEARCH_ENDPOINT = "http://34.101.178.186:9200"
OPENSEARCH_INDEX = "yourname_doc_index"  # ⚠️ CHANGE THIS TO YOUR NAME
TEXT_FIELD = "content"
EMBEDDING_FIELD = "embedding"

# Check if CUDA is available for GPU acceleration
# (kept as a plain string because HuggingFaceEmbedding's `device` argument is typed as a string)
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"🖥️ Using device: {device}")

# Display configuration
print(f"🔗 OpenSearch Endpoint: {OPENSEARCH_ENDPOINT}")
print(f"📊 Index Name: {OPENSEARCH_INDEX}")
print(f"📝 Text Field: {TEXT_FIELD}")
print(f"🔢 Embedding Field: {EMBEDDING_FIELD}")

**Configuration Details:**
- **Endpoint**: OpenSearch cluster URL
- **Index Name**: Must be unique for each user (lowercase)
- **Fields**: Specify where text content and embeddings are stored
- **Device**: GPU acceleration if available, otherwise CPU
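
Since every participant shares one cluster, a malformed index name is an easy way to get a confusing error later. The following is a rough sketch of a pre-flight check; the helper `is_valid_index_name` is hypothetical, and the rules it encodes are assumptions based on OpenSearch's documented index naming restrictions rather than an exhaustive validator.

```python
# Characters assumed to be disallowed in OpenSearch index names.
_FORBIDDEN_CHARS = set(' ,:"*+/\\|?#<>')


def is_valid_index_name(name: str) -> bool:
    """Return True if `name` looks like a legal OpenSearch index name."""
    if not name or len(name.encode("utf-8")) > 255:
        return False
    if name != name.lower():   # index names must be lowercase
        return False
    if name[0] in "_-":        # must not start with '_' or '-'
        return False
    return not any(ch in _FORBIDDEN_CHARS for ch in name)


print(is_valid_index_name("yourname_doc_index"))
print(is_valid_index_name("YourName Index"))
```

Calling `is_valid_index_name(OPENSEARCH_INDEX)` right after the configuration cell would flag uppercase letters or spaces before any request is sent.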

## Section 4: Creating Hybrid Search Pipeline

This creates a search pipeline that combines keyword and semantic search results.

In [None]:
def create_hybrid_search_pipeline():
    """Create a hybrid search pipeline in OpenSearch"""
    pipeline_url = f"{OPENSEARCH_ENDPOINT}/_search/pipeline/hybrid-search-pipeline"
    headers = {'Content-Type': 'application/json'}

    pipeline_config = {
        "description": "Pipeline for hybrid search",
        "phase_results_processors": [
            {
                "normalization-processor": {
                    "normalization": {
                        "technique": "min_max"  # Normalize scores to 0-1 range
                    },
                    "combination": {
                        "technique": "harmonic_mean",
                        "parameters": {
                            "weights": [0.3, 0.7]  # keyword: 30%, semantic: 70%
                        }
                    }
                }
            }
        ]
    }

    try:
        # A timeout keeps the notebook from hanging if the cluster is unreachable
        response = requests.put(pipeline_url, headers=headers, data=json.dumps(pipeline_config), timeout=30)
        if response.status_code in [200, 201]:
            print("✅ Hybrid search pipeline created successfully!")
            return True
        else:
            print(f"❌ Failed to create pipeline: {response.text}")
            return False
    except Exception as e:
        print(f"❌ Error creating pipeline: {e}")
        return False

# Create the pipeline
print("🔧 Setting up hybrid search pipeline...")
create_hybrid_search_pipeline()

**Hybrid Search Pipeline Components:**
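- **Min-Max Normalization**: Rescales each sub-query's scores to the 0-1 range so keyword and semantic scores are comparable
- **Harmonic Mean**: Combines the normalized keyword and semantic scores into a single score per document
- **Weights**: 0.3 for keyword search and 0.7 for semantic search, so semantic matches carry more weight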
- **Result**: Balanced search that leverages both approaches
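
To make the arithmetic behind these components concrete, here is a small sketch in plain Python. It is an illustration of min-max normalization followed by a weighted harmonic mean, not OpenSearch's actual implementation; the raw scores are made-up values, and skipping zero scores is an assumption made here to avoid division by zero.

```python
def min_max(scores):
    """Scale a list of scores to the 0-1 range."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [1.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def weighted_harmonic_mean(scores, weights):
    """Weighted harmonic mean: sum(w) / sum(w / s), skipping zero scores (assumption)."""
    pairs = [(s, w) for s, w in zip(scores, weights) if s > 0]
    if not pairs:
        return 0.0
    return sum(w for _, w in pairs) / sum(w / s for s, w in pairs)


# Hypothetical raw scores for three documents (illustrative values only).
keyword_raw = [12.0, 6.0, 3.0]     # e.g. BM25-style keyword scores
semantic_raw = [0.70, 0.90, 0.50]  # e.g. vector similarity scores

keyword_norm = min_max(keyword_raw)
semantic_norm = min_max(semantic_raw)

for doc_id, (kw, sem) in enumerate(zip(keyword_norm, semantic_norm), 1):
    combined = weighted_harmonic_mean([kw, sem], [0.3, 0.7])
    print(f"doc {doc_id}: keyword={kw:.2f} semantic={sem:.2f} combined={combined:.3f}")
```

Note that a harmonic mean is pulled toward the lower of the two scores, so a document generally needs to do reasonably well on both sub-queries to rank highly.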

## Section 5: Document Download and Preparation

Download sample documents for our search system.

In [None]:
def download_corpus():
    """Download sample markdown documents"""
    os.makedirs('./corpus_input', exist_ok=True)
    
    urls = [
        ("https://storage.googleapis.com/llm-course/md/1.md", "./corpus_input/1.md"),
        ("https://storage.googleapis.com/llm-course/md/2.md", "./corpus_input/2.md"),
        ("https://storage.googleapis.com/llm-course/md/44.md", "./corpus_input/44.md"),
        ("https://storage.googleapis.com/llm-course/md/5555.md", "./corpus_input/5555.md")
    ]
    
    downloaded_count = 0
    for url, path in urls:
        if not os.path.exists(path):
            print(f"📥 Downloading {url.split('/')[-1]}...")
            try:
                urllib.request.urlretrieve(url, path)
                downloaded_count += 1
            except Exception as e:
                print(f"❌ Failed to download {url}: {e}")
        else:
            print(f"✅ {path.split('/')[-1]} already exists")

    print(f"📚 Downloaded {downloaded_count} new files")
    # Count the files actually on disk, so failed downloads are not reported as available
    return sum(os.path.exists(path) for _, path in urls)

# Download documents
print("📥 Downloading sample documents...")
total_files = download_corpus()
print(f"📁 Total files available: {total_files}")

**Document Preparation:**
- Creates `./corpus_input` directory
- Downloads 4 sample Markdown files
- Checks for existing files to avoid re-downloading
- These documents will be our searchable knowledge base
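
A failed or interrupted download can leave an empty file behind, which would then be indexed as an empty document. As a quick sanity check, the sketch below lists the Markdown files and their sizes; the helper names `summarize_corpus` and `empty_files` are hypothetical and not part of the original notebook.

```python
from pathlib import Path


def summarize_corpus(directory: str = "./corpus_input") -> dict:
    """Map each .md file name in `directory` to its size in bytes."""
    root = Path(directory)
    if not root.is_dir():
        return {}
    return {path.name: path.stat().st_size for path in sorted(root.glob("*.md"))}


def empty_files(summary: dict) -> list:
    """Return the names of files whose size is zero."""
    return [name for name, size in summary.items() if size == 0]


summary = summarize_corpus()
print(summary)
print("Empty files:", empty_files(summary) or "none")
```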

## Section 6: Document Loading and Parsing

Load documents and split them into searchable chunks (nodes).

In [None]:
# Load Markdown documents from directory
reader = SimpleDirectoryReader(
    input_dir="./corpus_input",
    recursive=True,
    required_exts=[".md", ".markdown"]
)

print("📖 Loading documents...")
documents = reader.load_data()
print(f"✅ Loaded {len(documents)} documents successfully")

# Display document info
for i, doc in enumerate(documents):
    print(f"📄 Document {i+1}: {len(doc.text)} characters")

# Create parser for Markdown
print("\n🔧 Creating Markdown parser...")
md_parser = MarkdownNodeParser()
nodes = md_parser.get_nodes_from_documents(documents)
print(f"✅ Created {len(nodes)} nodes with MarkdownNodeParser")

# Display sample node
if nodes:
    print("\n📝 Sample node preview:")
    print(f"Text length: {len(nodes[0].text)} characters")
    print(f"Preview: {nodes[0].text[:200]}...")

**Document Processing Steps:**
1. **SimpleDirectoryReader**: Reads all `.md` and `.markdown` files from the directory, including subdirectories
2. **MarkdownNodeParser**: Splits documents into logical chunks based on Markdown structure
3. **Nodes**: Individual searchable units that maintain context
4. **Benefits**: Better search accuracy and relevance than whole documents
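
To give a feel for what "splitting on Markdown structure" means, here is a simplified sketch that starts a new chunk at every heading line. It is not `MarkdownNodeParser`'s actual algorithm (for instance, it does not treat `#` inside fenced code specially), and the function name `split_by_headers` is hypothetical.

```python
import re


def split_by_headers(markdown_text: str) -> list[str]:
    """Split Markdown into chunks, starting a new chunk at each heading line."""
    chunks, current = [], []
    for line in markdown_text.splitlines():
        # A heading is 1-6 '#' characters followed by whitespace.
        if re.match(r"^#{1,6}\s", line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [chunk for chunk in chunks if chunk]


sample = "# Title\nIntro text.\n\n## Section A\nDetails for A.\n\n## Section B\nDetails for B."
for chunk in split_by_headers(sample):
    print(repr(chunk))
```

Each chunk keeps its heading together with the text beneath it, which is why header-based nodes tend to stay self-contained when retrieved.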

## Section 7: Embedding Model Setup

Configure the BGE-M3 model for converting text to vectors.

In [None]:
# Setup embedding model
print("🤖 Setting up BGE-M3 embedding model...")
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-m3", device=device)
print("✅ BGE-M3 embedding model setup successful")

# Test embedding and get dimensions
print("\n🧪 Testing embedding model...")
test_text = "This is a test sentence for embedding."
embeddings = embed_model.get_text_embedding(test_text)
dim = len(embeddings)

print("✅ Embedding test successful")
print(f"📏 Embedding dimensions: {dim}")
print(f"🔢 Sample embedding values: {embeddings[:5]}...")  # Show first 5 values

# Test with different languages (BGE-M3 is multilingual)
test_texts = [
    "Hello world",
    "สวัสดีชาวโลก",  # Thai: "Hello, world"
    "机器学习"  # Chinese: "machine learning"
]

print("\n🌐 Testing multilingual capabilities:")
for text in test_texts:
    emb = embed_model.get_text_embedding(text)
    print(f"'{text}' → {len(emb)} dimensions ✅")

**BGE-M3 Model Features:**
- **Multilingual**: Supports 100+ languages including Thai
- **1024 Dimensions**: High-quality vector representations
- **Strong retrieval quality**: Reported to perform well on multilingual and cross-lingual retrieval benchmarks
- **Multi-functionality**: Supports dense retrieval, sparse retrieval, and multi-vector retrieval (this tutorial uses only the dense vectors)
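
Dense vectors become useful once they are compared with each other. The sketch below computes cosine similarity in plain Python, using toy 3-dimensional vectors as stand-ins for real 1024-dimensional embeddings; the helper `cosine_similarity` is written here for illustration and is not part of the notebook.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy vectors: identical directions score 1.0, orthogonal directions score 0.0.
print(cosine_similarity([1.0, 0.0, 0.0], [1.0, 0.0, 0.0]))  # 1.0
print(cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # 0.0
```

With the model loaded, the same function could be applied to `embed_model.get_text_embedding("Hello world")` and the embedding of its Thai translation to see how closely the two land in vector space.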

## Section 8: OpenSearch Vector Store Configuration

Connect to OpenSearch and configure the vector store.

In [None]:
# Setup OpensearchVectorClient
print("🔗 Setting up OpenSearch Vector Client...")
client = OpensearchVectorClient(
    endpoint=OPENSEARCH_ENDPOINT,
    index=OPENSEARCH_INDEX,
    dim=dim,
    embedding_field=EMBEDDING_FIELD,
    text_field=TEXT_FIELD,
    search_pipeline="hybrid-search-pipeline",
)
print(f"✅ OpensearchVectorClient setup successful for index '{OPENSEARCH_INDEX}'")

# Create vector store
print("\n📦 Creating vector store...")
vector_store = OpensearchVectorStore(client)
print("✅ Vector store created successfully")

# Create storage context
print("\n🗄️ Creating storage context...")
storage_context = StorageContext.from_defaults(vector_store=vector_store)
print("✅ Storage context created successfully")

# Display configuration summary
print("\n📋 Configuration Summary:")
print(f"   🔗 Endpoint: {OPENSEARCH_ENDPOINT}")
print(f"   📊 Index: {OPENSEARCH_INDEX}")
print(f"   📏 Vector Dimensions: {dim}")
print("   🔍 Search Pipeline: hybrid-search-pipeline")
print(f"   📝 Text Field: {TEXT_FIELD}")
print(f"   🔢 Embedding Field: {EMBEDDING_FIELD}")

**OpenSearch Vector Store Components:**
- **VectorClient**: Handles communication with OpenSearch
- **Vector Store**: LlamaIndex interface for vector operations
- **Storage Context**: Defines where and how to store vectors
- **Search Pipeline**: Uses our hybrid search configuration
- **Index Schema**: Automatically created based on our specifications
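
For orientation, the automatically created schema is a k-NN enabled index with one vector field and one text field. The sketch below builds a simplified request body of that general shape; the function `build_knn_index_body` is hypothetical, and the mapping the connector actually generates likely includes additional method and engine parameters not shown here.

```python
import json


def build_knn_index_body(dim: int, embedding_field: str = "embedding",
                         text_field: str = "content") -> dict:
    """Build a simplified k-NN index definition (an approximation, not the connector's exact output)."""
    return {
        "settings": {"index": {"knn": True}},  # enable k-NN search on this index
        "mappings": {
            "properties": {
                text_field: {"type": "text"},  # analyzed text for keyword search
                embedding_field: {"type": "knn_vector", "dimension": dim},  # dense vector
            }
        },
    }


print(json.dumps(build_knn_index_body(1024), indent=2))
```

The important point is that `dimension` must match the embedding model's output size, which is why the notebook measures `dim` from a test embedding instead of hard-coding it.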

## Section 9: Index Creation and Storage

Create the final searchable index and save it for future use.

In [None]:
# Create the index
print("🏗️ Creating VectorStoreIndex...")
print("   This process will:")
print("   1. Convert all text nodes to embeddings")
print("   2. Store vectors in OpenSearch")
print("   3. Create searchable index")
print("\n⏳ This may take a few minutes...")

index = VectorStoreIndex(
    nodes=nodes,
    storage_context=storage_context,
    embed_model=embed_model
)

print("✅ VectorStoreIndex created successfully!")

# Save index with pickle for future use
# Note: pickling can fail if the index holds objects that cannot be serialized
# (for example, live client connections); the data stays in OpenSearch either way.
index_filename = f"{OPENSEARCH_INDEX}.pkl"
print(f"\n💾 Saving index to {index_filename}...")

index_saved = False
try:
    with open(index_filename, 'wb') as f:
        pickle.dump(index, f)
    index_saved = True
    print(f"✅ Index saved to {index_filename} successfully")
except Exception as e:
    print(f"❌ Failed to save index: {e}")

# Display final statistics
print("\n📊 Final Statistics:")
print(f"   📚 Documents processed: {len(documents)}")
print(f"   🔧 Nodes created: {len(nodes)}")
print(f"   🔢 Vector dimensions: {dim}")
print(f"   📦 Index name: {OPENSEARCH_INDEX}")
print(f"   💾 Saved as: {index_filename if index_saved else 'not saved'}")

print("\n🎉 All processes completed successfully!")
print("🚀 Your hybrid search system is now ready to use!")

**Index Creation Process:**
1. **Embedding Generation**: Each text node is converted to a 1024-dim vector
2. **OpenSearch Storage**: Vectors and text are stored in OpenSearch
3. **Index Structure**: Creates searchable index with hybrid capabilities
4. **Pickle Storage**: Attempts to save the LlamaIndex object locally for reloading; the vectors and text themselves live in OpenSearch
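
Pickling works only when every object reachable from the index can be serialized, and objects that hold locks or open connections usually cannot. The sketch below shows a small check using only the standard library; `is_picklable` is a hypothetical helper, and `threading.Lock()` stands in for the kind of resource a live client may hold.

```python
import pickle
import threading


def is_picklable(obj) -> bool:
    """Return True if `obj` can be serialized with pickle."""
    try:
        pickle.dumps(obj)
        return True
    except Exception:
        return False


print(is_picklable({"index_name": "yourname_doc_index", "dim": 1024}))  # True
print(is_picklable(threading.Lock()))  # False
```

If the index turns out not to be picklable, one alternative (assuming the same `vector_store` and `embed_model` objects are recreated) is to reconnect to the existing data with `VectorStoreIndex.from_vector_store(vector_store, embed_model=embed_model)` instead of reloading a file.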

**What You Now Have:**
- ✅ Fully functional hybrid search system
- ✅ OpenSearch index with your documents
- ✅ Saved index file for future use (if pickling succeeded)
- ✅ Ready to run search queries!

## Next Steps: Testing Your Search System

Now you can test your hybrid search system without needing an LLM!

In [None]:
# ===== 🔍 Testing the search system =====
from llama_index.core import Settings

# Disable the LLM to avoid OpenAI API key issues
Settings.llm = None

# Quick test of the search system
print("\n🔍 Testing the search system...")

# Create retriever instead of query_engine
retriever = index.as_retriever(
    vector_store_query_mode=VectorStoreQueryMode.HYBRID,
    similarity_top_k=3
)

# Test queries
test_queries = [
    "What is machine learning?",
    "How does artificial intelligence work?",
    "Explain neural networks",
    "การเรียนรู้ของเครื่อง"  # Thai-language test ("machine learning")
]

for i, query in enumerate(test_queries, 1):
    print(f"\n🔍 Test Query {i}: {query}")
    try:
        # Use a separate name so the parsed `nodes` list from Section 6 is not overwritten
        results = retriever.retrieve(query)
        print(f"✅ Found {len(results)} relevant nodes:")
        for j, result in enumerate(results, 1):
            score = f"{result.score:.3f}" if result.score is not None else "n/a"
            print(f"  {j}. Score: {score}")
            print(f"     Content: {result.text[:150]}...")
    except Exception as e:
        print(f"❌ Error: {e}")

print("\n🎯 Search test finished. Check the output above for retrieved content or errors.")

# Compare semantic-only search with hybrid search
print("\n🔬 Testing different search modes...")

# Semantic search only (vector similarity, no keyword matching)
semantic_retriever = index.as_retriever(
    vector_store_query_mode=VectorStoreQueryMode.DEFAULT,
    similarity_top_k=2
)

print("\n🔍 Hybrid Search vs Semantic Search:")
# Note: OpenSearch hybrid search already combines keyword + semantic

test_query = "machine learning algorithms"
semantic_results = semantic_retriever.retrieve(test_query)
hybrid_results = retriever.retrieve(test_query)

print(f"Query: {test_query}")
print(f"Semantic only: {len(semantic_results)} results")
print(f"Hybrid search: {len(hybrid_results)} results")

print("\n✨ Hybrid search test completed!")
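
Result counts alone mostly reflect the `similarity_top_k` settings, so they say little about how the two modes differ. A more informative comparison is how much the two result sets overlap. The sketch below computes a Jaccard overlap on IDs; the helper `result_overlap` and the sample IDs are hypothetical.

```python
def result_overlap(ids_a, ids_b) -> float:
    """Jaccard overlap between two collections of result IDs (1.0 means identical sets)."""
    set_a, set_b = set(ids_a), set(ids_b)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


# Hypothetical node IDs standing in for real retriever output.
semantic_ids = ["node-1", "node-2"]
hybrid_ids = ["node-2", "node-3", "node-4"]
print(result_overlap(semantic_ids, hybrid_ids))  # 0.25
```

With real results, the ID lists could be built as `[r.node.node_id for r in semantic_results]` and `[r.node.node_id for r in hybrid_results]`; a low overlap suggests that keyword matching is surfacing passages the vector search alone would miss.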

## Conclusion

🎉 **Congratulations!** You have built a hybrid search system that combines semantic and keyword search.

### Key Achievements:
- ✅ **Semantic Search**: Understanding context and meaning
- ✅ **Keyword Search**: Exact term matching
- ✅ **Hybrid Results**: Best of both worlds
- ✅ **Multilingual Support**: Works with multiple languages
- ✅ **Scalable Architecture**: Built on OpenSearch, which can grow beyond this tutorial's sample corpus

### System Components:
1. **BGE-M3 Embeddings**: State-of-the-art text representations
2. **OpenSearch**: Powerful search and analytics engine
3. **LlamaIndex**: Seamless document management
4. **Hybrid Pipeline**: Intelligent result combination

### Future Enhancements:
- Add more documents to expand the knowledge base
- Implement question-answering capabilities
- Create a web interface for easier interaction
- Add document filtering and metadata search
- Implement user feedback and result ranking

Your hybrid search system is now ready to handle complex queries and provide accurate, contextual results! 🚀