# Introduction

In this guide, we build a semantic search engine that uses Couchbase as the backend database and [CrewAI](https://github.com/joaomdmoura/crewAI) for agent-based retrieval-augmented generation (RAG). CrewAI lets us create specialized agents that cooperate on different parts of the RAG workflow, from document retrieval to response generation. The tutorial is beginner-friendly and proceeds step by step toward a working semantic search system built from scratch.

# Setting the Stage: Installing Necessary Libraries

This step installs the packages the guide relies on: `datasets` for the sample data, `langchain-couchbase` and `langchain-openai` for the vector store, cache, embeddings, and chat model, `crewai` for the agents, `python-dotenv` for configuration, and `tqdm` for progress bars.

In [1]:
%pip install --quiet datasets langchain-couchbase langchain-openai crewai python-dotenv tqdm

Note: you may need to restart the kernel to use updated packages.


# Importing Libraries and Setting Up Logging

Import all necessary libraries and configure logging.

In [2]:
import json
import logging
import time
import sys
import os
from datetime import timedelta
from uuid import uuid4
from typing import Any, Optional
from dotenv import load_dotenv

from couchbase.auth import PasswordAuthenticator
from couchbase.cluster import Cluster
from couchbase.exceptions import InternalServerFailureException, QueryIndexAlreadyExistsException
from couchbase.management.search import SearchIndex
from couchbase.options import ClusterOptions
from datasets import load_dataset
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_core.documents import Document
from langchain_core.globals import set_llm_cache
from langchain_couchbase.cache import CouchbaseCache
from langchain_couchbase.vectorstores import CouchbaseVectorStore
from langchain.tools import Tool
from crewai import Agent, Task, Crew, Process
from tqdm import tqdm

# Setup logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s [%(levelname)s] %(message)s',
    datefmt='%Y-%m-%d %H:%M:%S'
)



# Loading Configuration Settings

Load configuration settings from environment variables, falling back to default values for the Couchbase settings. `OPENAI_API_KEY` has no default and must be set.

In [3]:
# Load environment variables
load_dotenv()

# Configuration
OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')
if not OPENAI_API_KEY:
    raise ValueError("OPENAI_API_KEY is not set")

CB_HOST = os.getenv('CB_HOST', 'couchbase://localhost')
CB_USERNAME = os.getenv('CB_USERNAME', 'Administrator')
CB_PASSWORD = os.getenv('CB_PASSWORD', 'password')
CB_BUCKET_NAME = os.getenv('CB_BUCKET_NAME', 'vector-search-testing')
INDEX_NAME = os.getenv('INDEX_NAME', 'vector_search_crew')
SCOPE_NAME = os.getenv('SCOPE_NAME', 'shared')
COLLECTION_NAME = os.getenv('COLLECTION_NAME', 'crew')
CACHE_COLLECTION = os.getenv('CACHE_COLLECTION', 'cache')

print("Configuration loaded successfully")

Configuration loaded successfully
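The explicit check on `OPENAI_API_KEY` can be generalized to any required setting. The following is a minimal sketch, not part of the original notebook: `require_env` is a hypothetical helper that fails fast when a variable is missing and otherwise falls back to a default. It accepts an optional `env` mapping so it can be tried without touching the real environment.

```python
import os
from typing import Mapping, Optional


def require_env(name: str, default: Optional[str] = None,
                env: Optional[Mapping[str, str]] = None) -> str:
    """Return a setting from the environment, or raise if it is unset or empty.

    Hypothetical helper for illustration only.
    """
    source = os.environ if env is None else env
    value = source.get(name, default)
    if not value:
        raise ValueError(f"{name} is not set")
    return value


# Example with an explicit mapping instead of the real environment
fake_env = {"CB_USERNAME": "Administrator"}
username = require_env("CB_USERNAME", env=fake_env)
bucket_name = require_env("CB_BUCKET_NAME", default="vector-search-testing", env=fake_env)
```

An empty string is treated like a missing value, which matches the `if not OPENAI_API_KEY` check in the cell above.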


# Connecting to Couchbase

Connect to Couchbase and set up the required collections.

In [4]:
# Connect to Couchbase
auth = PasswordAuthenticator(CB_USERNAME, CB_PASSWORD)
options = ClusterOptions(auth)
cluster = Cluster(CB_HOST, options)
cluster.wait_until_ready(timedelta(seconds=5))
print("Successfully connected to Couchbase")

# Setup collections
bucket = cluster.bucket(CB_BUCKET_NAME)
bucket_manager = bucket.collections()

# Setup main collection
collections = bucket_manager.get_all_scopes()
collection_exists = any(
    scope.name == SCOPE_NAME and COLLECTION_NAME in [col.name for col in scope.collections]
    for scope in collections
)

if not collection_exists:
    bucket_manager.create_collection(SCOPE_NAME, COLLECTION_NAME)
    print(f"Collection '{COLLECTION_NAME}' created")
else:
    print(f"Collection '{COLLECTION_NAME}' already exists")

# Create primary index
cluster.query(
    f"CREATE PRIMARY INDEX IF NOT EXISTS ON `{CB_BUCKET_NAME}`.`{SCOPE_NAME}`.`{COLLECTION_NAME}`"
).execute()
print(f"Primary index created for '{COLLECTION_NAME}'")

# Clear collection
cluster.query(
    f"DELETE FROM `{CB_BUCKET_NAME}`.`{SCOPE_NAME}`.`{COLLECTION_NAME}`"
).execute()
print(f"Collection '{COLLECTION_NAME}' cleared")

# Setup cache collection
collection_exists = any(
    scope.name == SCOPE_NAME and CACHE_COLLECTION in [col.name for col in scope.collections]
    for scope in collections
)

if not collection_exists:
    bucket_manager.create_collection(SCOPE_NAME, CACHE_COLLECTION)
    print(f"Collection '{CACHE_COLLECTION}' created")
else:
    print(f"Collection '{CACHE_COLLECTION}' already exists")

# Create primary index for cache
cluster.query(
    f"CREATE PRIMARY INDEX IF NOT EXISTS ON `{CB_BUCKET_NAME}`.`{SCOPE_NAME}`.`{CACHE_COLLECTION}`"
).execute()
print(f"Primary index created for '{CACHE_COLLECTION}'")

# Clear cache collection
cluster.query(
    f"DELETE FROM `{CB_BUCKET_NAME}`.`{SCOPE_NAME}`.`{CACHE_COLLECTION}`"
).execute()
print(f"Collection '{CACHE_COLLECTION}' cleared")

Successfully connected to Couchbase
Collection 'crew' already exists
Primary index created for 'crew'
Collection 'crew' cleared
Collection 'cache' already exists
Primary index created for 'cache'
Collection 'cache' cleared
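The cell above runs the same three steps (create if missing, add a primary index, clear) once for the main collection and once for the cache collection. As a sketch under that reading, the steps could be folded into one function. `ensure_collection` is a hypothetical name; it only uses the SDK calls that already appear in the cell and expects the connected `cluster` object as its first argument.

```python
def ensure_collection(cluster, bucket_name: str, scope_name: str,
                      collection_name: str) -> bool:
    """Create the collection if needed, add a primary index, and clear it.

    Returns True when the collection had to be created.
    Hypothetical helper for illustration only.
    """
    bucket_manager = cluster.bucket(bucket_name).collections()
    exists = any(
        scope.name == scope_name
        and collection_name in [col.name for col in scope.collections]
        for scope in bucket_manager.get_all_scopes()
    )
    if not exists:
        bucket_manager.create_collection(scope_name, collection_name)

    # Same statements as in the notebook cell, built from one keyspace string
    keyspace = f"`{bucket_name}`.`{scope_name}`.`{collection_name}`"
    cluster.query(f"CREATE PRIMARY INDEX IF NOT EXISTS ON {keyspace}").execute()
    cluster.query(f"DELETE FROM {keyspace}").execute()
    return not exists


# Intended use with the names defined earlier in this guide:
# for name in (COLLECTION_NAME, CACHE_COLLECTION):
#     ensure_collection(cluster, CB_BUCKET_NAME, SCOPE_NAME, name)
```

Unlike the original cell, this version re-reads the scope list on every call, so the existence check does not rely on a list fetched before an earlier collection was created.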


# Setting Up Vector Search Index

Load the index definition from `crew_index.json` and create the vector search index in Couchbase if it does not already exist.

In [5]:
# Load index definition
with open('crew_index.json', 'r') as file:
    index_definition = json.load(file)

# Setup vector search index
scope_index_manager = cluster.bucket(CB_BUCKET_NAME).scope(SCOPE_NAME).search_indexes()

# Check if index exists
existing_indexes = scope_index_manager.get_all_indexes()
index_exists = any(index.name == INDEX_NAME for index in existing_indexes)

if index_exists:
    print(f"Index '{INDEX_NAME}' already exists")
else:
    search_index = SearchIndex.from_json(index_definition)
    scope_index_manager.upsert_index(search_index)
    print(f"Index '{INDEX_NAME}' created")

Index 'vector_search_crew' already exists
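The vector dimension declared in the index has to match the size of the embeddings stored in the collection, and `text-embedding-ada-002` produces 1536-dimensional vectors. The sketch below shows one way to check a loaded definition. It assumes vector fields are declared with `"type": "vector"` and a `"dims"` key; the sample fragment is illustrative only, since the contents of `crew_index.json` are not shown in this guide, and `iter_vector_dims` is a hypothetical helper.

```python
from typing import Any, Iterator


def iter_vector_dims(node: Any) -> Iterator[int]:
    """Yield the `dims` value of every vector field in a nested definition."""
    if isinstance(node, dict):
        if node.get("type") == "vector" and "dims" in node:
            yield node["dims"]
        for value in node.values():
            yield from iter_vector_dims(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_vector_dims(item)


# Illustrative fragment only; this is NOT the real crew_index.json
sample_definition = {
    "name": "vector_search_crew",
    "params": {"mapping": {"types": {"shared.crew": {"properties": {
        "embedding": {"fields": [
            {"name": "embedding", "type": "vector", "dims": 1536}
        ]}
    }}}}},
}

EXPECTED_DIMS = 1536  # output size of text-embedding-ada-002
dims_found = list(iter_vector_dims(sample_definition))
mismatched = [d for d in dims_found if d != EXPECTED_DIMS]
```

With the notebook's own data, the call would be `list(iter_vector_dims(index_definition))`; a non-empty `mismatched` list would suggest the index and the embedding model disagree.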


# Setting Up OpenAI Components

Initialize OpenAI embeddings and language model.

In [6]:
# Initialize OpenAI components
embeddings = OpenAIEmbeddings(
    openai_api_key=OPENAI_API_KEY,
    model="text-embedding-ada-002"
)

llm = ChatOpenAI(
    openai_api_key=OPENAI_API_KEY,
    model="gpt-4o",
    temperature=0.7
)

print("OpenAI components initialized")

OpenAI components initialized


# Setting Up Vector Store and Cache

Initialize the vector store and cache components.

In [7]:
# Setup vector store
vector_store = CouchbaseVectorStore(
    cluster=cluster,
    bucket_name=CB_BUCKET_NAME,
    scope_name=SCOPE_NAME,
    collection_name=COLLECTION_NAME,
    embedding=embeddings,
    index_name=INDEX_NAME,
)
print("Vector store initialized")

# Setup cache
cache = CouchbaseCache(
    cluster=cluster,
    bucket_name=CB_BUCKET_NAME,
    scope_name=SCOPE_NAME,
    collection_name=CACHE_COLLECTION,
)
set_llm_cache(cache)
print("Cache initialized")

Vector store initialized
Cache initialized


# Loading Sample Data

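Load the first 1,000 questions from the TREC training split and add them to the vector store in batches of 50. Each batch is embedded with one request to the OpenAI embeddings API.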

In [8]:
# Load TREC dataset
trec = load_dataset('trec', split='train[:1000]')
print(f"Loaded {len(trec)} samples from TREC dataset")

# Add documents in batches
batch_size = 50
for i in tqdm(range(0, len(trec['text']), batch_size), desc="Loading data"):
    batch = trec['text'][i:i + batch_size]
    documents = [Document(page_content=text) for text in batch]
    uuids = [str(uuid4()) for _ in range(len(documents))]
    vector_store.add_documents(documents=documents, ids=uuids)

print("Sample data loaded into vector store")

Loaded 1000 samples from TREC dataset


Loading data:   0%|          | 0/20 [00:00<?, ?it/s]2024-12-09 19:16:29 [INFO] HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
Loading data:   5%|▌         | 1/20 [00:04<01:28,  4.65s/it]2024-12-09 19:16:34 [INFO] HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
Loading data:  10%|█         | 2/20 [00:08<01:17,  4.29s/it]2024-12-09 19:16:38 [INFO] HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
Loading data:  15%|█▌        | 3/20 [00:11<01:01,  3.64s/it]2024-12-09 19:16:41 [INFO] HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
Loading data:  20%|██        | 4/20 [00:14<00:51,  3.23s/it]2024-12-09 19:16:43 [INFO] HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
Loading data:  25%|██▌       | 5/20 [00:16<00:42,  2.85s/it]2024-12-09 19:16:45 [INFO] HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
Loading data:  30%|███       | 6/20 [00:18

Sample data loaded into vector store
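The batching logic in the loop above can be isolated into a small generator, which makes the batch count easy to reason about: 1,000 texts in batches of 50 give the 20 iterations shown by the progress bar. This is a sketch with placeholder strings rather than the TREC data, and `batched` is a hypothetical helper name.

```python
from typing import Iterator, List, Sequence
from uuid import uuid4


def batched(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


# Placeholder texts standing in for trec['text']
texts = [f"question {n}" for n in range(1000)]
batches = list(batched(texts, 50))

# One fresh ID per document, as in the notebook cell
ids_for_first_batch = [str(uuid4()) for _ in batches[0]]
```

In the notebook, each yielded batch would be wrapped in `Document` objects and passed to `vector_store.add_documents` together with its IDs.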





# Creating Vector Search Tool

Create a tool for performing vector searches.

In [9]:
# Create vector search tool
search_tool = Tool(
    name="vector_search",
    func=lambda query: "\n\n".join([
        f"Document {i+1}:\n{'-'*40}\n{doc.page_content}"
        for i, doc in enumerate(vector_store.similarity_search(
            query if isinstance(query, str) else str(query.get('query', '')),
            k=8,
            fetch_k=20
        ))
    ]),
    description="""Search for relevant documents using vector similarity.
    Input should be a simple text query string.
    Returns the text of the most relevant documents.
    Use this tool to find detailed information about topics."""
)

print("Vector search tool created")

Vector search tool created
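The lambda above accepts either a plain string or a dict with a `query` key. In the run logged later in this guide, the agent passes its tool input as a JSON-encoded string such as `"{\"query\": \"...\"}"`, which the lambda would forward to the search unchanged. The sketch below splits the lambda into two named functions and also unwraps that JSON form. Both `normalize_query` and `format_documents` are hypothetical helpers, not part of the original notebook.

```python
import json
from typing import Any, Iterable


def normalize_query(raw: Any) -> str:
    """Turn a tool input into a plain query string.

    Accepts a plain string, a dict with a 'query' key, or a JSON-encoded dict.
    """
    if isinstance(raw, dict):
        return str(raw.get("query", ""))
    text = str(raw).strip()
    if text.startswith("{"):
        try:
            parsed = json.loads(text)
        except json.JSONDecodeError:
            return text  # not valid JSON, use the text as-is
        if isinstance(parsed, dict):
            return str(parsed.get("query", text))
    return text


def format_documents(contents: Iterable[str]) -> str:
    """Render document texts in the same layout the lambda produces."""
    return "\n\n".join(
        f"Document {i + 1}:\n{'-' * 40}\n{content}"
        for i, content in enumerate(contents)
    )
```

With these in place, the tool's `func` could call `vector_store.similarity_search(normalize_query(query), k=8)` and pass each result's `page_content` to `format_documents`.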


# Creating CrewAI Agents

Create specialized agents for research and writing tasks.

In [10]:
# Custom response template
response_template = """
Analysis Results
===============
{%- if .Response %}
{{ .Response }}
{%- endif %}

Sources
=======
{%- for tool in .Tools %}
* {{ tool.name }}
{%- endfor %}

Metadata
========
* Confidence: {{ .Confidence }}
* Analysis Time: {{ .ExecutionTime }}
"""

# Create research agent
researcher = Agent(
    role='Research Expert',
    goal='Find and analyze the most relevant documents to answer user queries accurately',
    backstory="""You are an expert researcher with deep knowledge in information retrieval 
    and analysis. Your expertise lies in finding, evaluating, and synthesizing information 
    from various sources. You have a keen eye for detail and can identify key insights 
    from complex documents. You always verify information across multiple sources and 
    provide comprehensive, accurate analyses.""",
    tools=[search_tool],
    llm=llm,
    verbose=True,
    memory=True,
    allow_delegation=False,
    response_template=response_template
)

# Create writer agent
writer = Agent(
    role='Technical Writer',
    goal='Generate clear, accurate, and well-structured responses based on research findings',
    backstory="""You are a skilled technical writer with expertise in making complex 
    information accessible and engaging. You excel at organizing information logically, 
    explaining technical concepts clearly, and creating well-structured documents. You 
    ensure all information is properly cited, accurate, and presented in a user-friendly 
    manner. You have a talent for maintaining the reader's interest while conveying 
    detailed technical information.""",
    llm=llm,
    verbose=True,
    memory=True,
    allow_delegation=False,
    response_template=response_template
)

print("Agents created successfully")

Agents created successfully


# Testing the Search System

Test the system with some example queries.

In [11]:
def process_query(query, researcher, writer):
    print(f"\nQuery: {query}")
    print("-" * 80)
    
    # Create tasks
    research_task = Task(
        description=f"Research and analyze information relevant to: {query}",
        agent=researcher,
        expected_output="A detailed analysis with key findings and supporting evidence"
    )
    
    writing_task = Task(
        description="Create a comprehensive and well-structured response",
        agent=writer,
        expected_output="A clear, comprehensive response that answers the query",
        context=[research_task]
    )
    
    # Create and execute crew
    crew = Crew(
        agents=[researcher, writer],
        tasks=[research_task, writing_task],
        process=Process.sequential,
        verbose=True,
        cache=True,
        planning=True
    )
    
    try:
        start_time = time.time()
        result = crew.kickoff()
        elapsed_time = time.time() - start_time
        
        print(f"\nQuery completed in {elapsed_time:.2f} seconds")
        print("=" * 80)
        print("RESPONSE")
        print("=" * 80)
        print(result)
        
        if hasattr(result, 'tasks_output'):
            print("\n" + "=" * 80)
            print("DETAILED TASK OUTPUTS")
            print("=" * 80)
            for task_output in result.tasks_output:
                print(f"\nTask: {task_output.description[:100]}...")
                print("-" * 40)
                print(f"Output: {task_output.raw}")
                print("-" * 40)
    except Exception as e:
        print(f"Error executing crew: {str(e)}")
        logging.error(f"Crew execution failed: {str(e)}", exc_info=True)

In [12]:
query = "What caused the 1929 Great Depression?"
process_query(query, researcher, writer)

19:17:20 - LiteLLM:INFO: utils.py:2741 - 
LiteLLM completion() model= gpt-4o-mini; provider = openai
2024-12-09 19:17:20 [INFO] 
LiteLLM completion() model= gpt-4o-mini; provider = openai



Query: What caused the 1929 Great Depression?
--------------------------------------------------------------------------------

[2024-12-09 19:17:20][INFO]: Planning the crew execution


2024-12-09 19:17:28 [INFO] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
19:17:28 - LiteLLM:INFO: utils.py:890 - Wrapper: Completed Call, calling success_handler
2024-12-09 19:17:28 [INFO] Wrapper: Completed Call, calling success_handler
19:17:28 - LiteLLM:INFO: utils.py:2741 - 
LiteLLM completion() model= gpt-4o; provider = openai
2024-12-09 19:17:28 [INFO] 
LiteLLM completion() model= gpt-4o; provider = openai


# Agent: Research Expert
## Task: Research and analyze information relevant to: What caused the 1929 Great Depression? 1. Begin by formulating a clear and specific query to guide the research. Example query: 'causes of the 1929 Great Depression.' 2. Utilize the vector_search tool to input the query, ensuring that it captures the essence of the task by entering it in the query field accurately. 3. Execute the vector_search tool, which will initiate a search for relevant documents concerning the 1929 Great Depression. Review the returned list of documents and metadata for quality and relevance. 4. Identify and select the most pertinent documents that provide insights into the economic, social, and political factors that contributed to the Great Depression. 5. Analyze the contents of these documents for key findings, such as stock market speculation, bank failures, and decline in consumer spending. 6. Organize the information systematically, highl

2024-12-09 19:17:29 [INFO] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
19:17:29 - LiteLLM:INFO: utils.py:890 - Wrapper: Completed Call, calling success_handler
2024-12-09 19:17:29 [INFO] Wrapper: Completed Call, calling success_handler
2024-12-09 19:17:30 [INFO] HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
19:17:31 - LiteLLM:INFO: utils.py:2741 - 
LiteLLM completion() model= gpt-4o; provider = openai
2024-12-09 19:17:31 [INFO] 
LiteLLM completion() model= gpt-4o; provider = openai




# Agent: Research Expert
## Thought: I need to perform a search to find relevant documents about the causes of the 1929 Great Depression. This will involve using the vector_search tool with a clear and specific query.
## Using tool: vector_search
## Tool Input:
"{\"query\": \"causes of the 1929 Great Depression\"}"
## Tool Output:
Document 1:
----------------------------------------
Why did the world enter a global depression in 1929 ?

Document 2:
----------------------------------------
When was `` the Great Depression '' ?

Document 3:
----------------------------------------
What crop failure caused the Irish Famine ?

Document 4:
----------------------------------------
What were popular songs and types of songs in the 1920s ?

Document 5:
----------------------------------------
What historical event happened in Dogtown in 1899 ?

Document 6:
--------------------------------

2024-12-09 19:17:33 [INFO] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
19:17:33 - LiteLLM:INFO: utils.py:890 - Wrapper: Completed Call, calling success_handler
2024-12-09 19:17:33 [INFO] Wrapper: Completed Call, calling success_handler
2024-12-09 19:17:33 [INFO] HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
19:17:34 - LiteLLM:INFO: utils.py:2741 - 
LiteLLM completion() model= gpt-4o; provider = openai
2024-12-09 19:17:34 [INFO] 
LiteLLM completion() model= gpt-4o; provider = openai




# Agent: Research Expert
## Thought: The search results did not return any relevant documents specifically addressing the causes of the 1929 Great Depression. I need to refine my search query or attempt the search again to ensure we find detailed documents that answer the query accurately.
## Using tool: vector_search
## Tool Input:
"{\"query\": \"economic and political factors of the 1929 Great Depression\"}"
## Tool Output:
Document 1:
----------------------------------------
Why did the world enter a global depression in 1929 ?

Document 2:
----------------------------------------
When was `` the Great Depression '' ?

Document 3:
----------------------------------------
What do economists do ?

Document 4:
----------------------------------------
What are some of the significant historical events of the 1990s ?

Document 5:
----------------------------------------
What were po

2024-12-09 19:17:36 [INFO] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
19:17:36 - LiteLLM:INFO: utils.py:890 - Wrapper: Completed Call, calling success_handler
2024-12-09 19:17:36 [INFO] Wrapper: Completed Call, calling success_handler
2024-12-09 19:17:36 [INFO] HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
19:17:38 - LiteLLM:INFO: utils.py:2741 - 
LiteLLM completion() model= gpt-4o; provider = openai
2024-12-09 19:17:38 [INFO] 
LiteLLM completion() model= gpt-4o; provider = openai




# Agent: Research Expert
## Thought: The search results still do not provide relevant documents concerning the specific causes of the 1929 Great Depression. I'll attempt another search, focusing explicitly on the economic, social, and political factors that contributed to the Great Depression in 1929.
## Using tool: vector_search
## Tool Input:
"{\"query\": \"economic causes of 1929 Great Depression\"}"
## Tool Output:
Document 1:
----------------------------------------
Why did the world enter a global depression in 1929 ?

Document 2:
----------------------------------------
When was `` the Great Depression '' ?

Document 3:
----------------------------------------
What do economists do ?

Document 4:
----------------------------------------
What crop failure caused the Irish Famine ?

Document 5:
----------------------------------------
What were popular songs and types of song

2024-12-09 19:17:39 [INFO] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
19:17:39 - LiteLLM:INFO: utils.py:890 - Wrapper: Completed Call, calling success_handler
2024-12-09 19:17:39 [INFO] Wrapper: Completed Call, calling success_handler
2024-12-09 19:17:40 [INFO] HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
19:17:41 - LiteLLM:INFO: utils.py:2741 - 
LiteLLM completion() model= gpt-4o; provider = openai
2024-12-09 19:17:41 [INFO] 
LiteLLM completion() model= gpt-4o; provider = openai




# Agent: Research Expert
## Thought: The search results are still not providing the relevant information needed to address the causes of the 1929 Great Depression. I need to refine the search query further to capture more specific aspects of the Great Depression. I will try a more focused search query.
## Using tool: vector_search
## Tool Input:
"{\"query\": \"factors leading to the 1929 stock market crash and Great Depression\"}"
## Tool Output:
Document 1:
----------------------------------------
Why did the world enter a global depression in 1929 ?

Document 2:
----------------------------------------
When was `` the Great Depression '' ?

Document 3:
----------------------------------------
What are bear and bull markets ?

Document 4:
----------------------------------------
What animals do you find in the stock market ?

Document 5:
----------------------------------------
W

2024-12-09 19:17:47 [INFO] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
19:17:47 - LiteLLM:INFO: utils.py:890 - Wrapper: Completed Call, calling success_handler
2024-12-09 19:17:47 [INFO] Wrapper: Completed Call, calling success_handler
19:17:47 - LiteLLM:INFO: utils.py:2741 - 
LiteLLM completion() model= gpt-4o; provider = openai
2024-12-09 19:17:47 [INFO] 
LiteLLM completion() model= gpt-4o; provider = openai




# Agent: Research Expert
## Final Answer:
The 1929 Great Depression was caused by a combination of factors, including:

1. **Stock Market Crash of 1929**: The crash in October 1929 is often cited as the immediate trigger that contributed to the economic downturn. Speculation had driven up stock prices to unsustainable levels, leading to a massive sell-off when confidence dropped.

2. **Bank Failures**: During the 1920s, banks operated without regulations that are in place today, leading to risky banking practices. After the crash, many banks failed as they could not return depositors' money, leading to a loss of savings and further reduction in consumer spending.

3. **Reduction in Consumer Spending**: With the loss of wealth and savings, consumer spending plummeted, leading to a decrease in production and, consequently, a rise in unemployment.

4. **International Trade Decline**: The introduction of the Smoot-Hawley Tariff in 1930, which r

2024-12-09 19:18:01 [INFO] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
19:18:01 - LiteLLM:INFO: utils.py:890 - Wrapper: Completed Call, calling success_handler
2024-12-09 19:18:01 [INFO] Wrapper: Completed Call, calling success_handler




# Agent: Technical Writer
## Final Answer:
**Outline for the Response on the Causes of the 1929 Great Depression**  

1. **Introduction**  
   - Overview of the Great Depression  
   - Significance in American history  

2. **Key Causes**  
   - Stock Market Crash of 1929  
   - Bank Failures  
   - Reduction in Consumer Spending  
   - International Trade Decline  
   - Drought Conditions  

3. **Supporting Evidence**  
   - Statistics  
   - Quotes from historical figures  
   - Relevant examples  

4. **Conclusion**  
   - Summary of main findings  
   - Emphasis on interconnectedness of causes  
   - Long-term effects on the economy  

**Introduction**  
The Great Depression, which began in 1929, was a catastrophic economic downturn that had profound and lasting impacts on the United States and the global economy. It is often regarded as one of the most severe economic crises in American history, leading to widespread unemployment, pove