Haystack and Llamaindex fst and gsi examples #62

Conversation
Summary of Changes

Hello @shyam-cb, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request significantly expands the examples for building Retrieval Augmented Generation (RAG) systems by introducing new notebooks for both the Haystack and LlamaIndex frameworks. These examples illustrate how to integrate with Couchbase Capella as a vector store, showcasing both its Full Text Search (FTS) and Global Secondary Index (GSI) capabilities for high-performance semantic search. By incorporating OpenAI for embeddings and LLM generation, and using the BBC News dataset, the PR provides comprehensive, real-world demonstrations of RAG pipelines for current-event information retrieval.
Code Review
This pull request adds several new examples for Haystack and LlamaIndex with Couchbase FTS and GSI. The examples are comprehensive and cover the end-to-end RAG pipeline.
I've found a few critical issues in the notebooks that would prevent them from running correctly, mainly related to a syntax error, a variable name typo, and incorrect logic for creating search indexes. There are also some inconsistencies in the documentation (markdown files and comments) versus the code, such as mentioning the wrong datasets or services. I've also pointed out some opportunities for improvement, like using safer parsing methods and more optimal similarity metrics for embeddings.
Additionally, the directory lamaindex seems to have a typo and should probably be llamaindex.
Please review the specific comments for details and suggestions.
```diff
  "# Configure logging\n",
- "logger = logging.getLogger(__name__)\n",
- "logger.setLevel(logging.DEBUG)"
+ "logger = logging.getLogger(__name__)logger.setLevel(logging.DEBUG)\n"
```
This line contains a syntax error. The two Python statements logger = logging.getLogger(__name__) and logger.setLevel(logging.DEBUG) have been concatenated without a separator. Please add a semicolon between them to fix the syntax while keeping them on one line.
```diff
- "logger = logging.getLogger(__name__)logger.setLevel(logging.DEBUG)\n"
+ "logger = logging.getLogger(__name__); logger.setLevel(logging.DEBUG)\n"
```
```python
OPENAI_API_KEY = input("OpenAI API Key: ")

# Check if the variables are correctly loaded
if not all([CB_CONNECTION_STRING, CB_USERNAME, CB_PASSWORD, CB_BUCKET_NAME, CB_SCOPE_NAME, CB_COLLECTION_NAME, CB_INDEX_NAME, CB_OPENAI_API_KEY]):
```
There is a typo in the variable name CB_OPENAI_API_KEY. The variable defined earlier is OPENAI_API_KEY. This will cause a NameError when this cell is executed.
```python
if not all([CB_CONNECTION_STRING, CB_USERNAME, CB_PASSWORD, CB_BUCKET_NAME, CB_SCOPE_NAME, CB_COLLECTION_NAME, CB_INDEX_NAME, OPENAI_API_KEY]):
```
```python
    with open('fts_index.json', 'r') as search_file:
        search_index_definition = SearchIndex.from_json(json.load(search_file))
    scope_search_manager.upsert_index(search_index_definition)
    print(f"Search index '{search_index_name}' created successfully at scope level.")
```
The logic for creating the search index is flawed. Inside the except block, you are re-reading the original fts_index.json file, which overwrites the search_index_definition object that was modified with user inputs (e.g., index name, bucket name). As a result, upsert_index is called with the unmodified template, which will likely fail. You should remove the with open(...) block inside the except and call scope_search_manager.upsert_index(search_index_definition) directly, using the search_index_definition object that was already prepared.
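A minimal sketch of the suggested flow, using a simplified dict-based index definition and a hypothetical `scope_search_manager` with `get_index`/`upsert_index` methods (the actual notebook uses the Couchbase SDK's `SearchIndex` object): the template is read and customized once, and the fallback path reuses that customized object instead of re-reading the file.

```python
import json

def create_search_index(scope_search_manager, template_path, index_name, bucket_name):
    # Read the fts_index.json template ONCE and apply the user-provided names.
    with open(template_path) as f:
        search_index_definition = json.load(f)
    search_index_definition["name"] = index_name
    search_index_definition["sourceName"] = bucket_name

    try:
        scope_search_manager.get_index(index_name)
        print(f"Search index '{index_name}' already exists.")
    except Exception:
        # Reuse the already-customized definition here. Re-reading the
        # template inside this block would discard the name/bucket overrides
        # applied above, which is the bug flagged in this review comment.
        scope_search_manager.upsert_index(search_index_definition)
        print(f"Search index '{index_name}' created at scope level.")
    return search_index_definition
```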
```python
    try:
        docs_data.append({
            'id': str(row["id"]),
            'content': f"Title: {row['title']}\nGenres: {', '.join([genre['name'] for genre in eval(row['genres'])])}\nOverview: {row['overview']}",
```
Using eval() on data from an external source is a security risk, as it can execute arbitrary code. Since the 'genres' field appears to be a JSON string, you should use json.loads() for safe parsing. You will need to import the json module in this cell or a prior one.
```python
'content': f"Title: {row['title']}\nGenres: {', '.join([genre['name'] for genre in json.loads(row['genres'])])}\nOverview: {row['overview']}",
```
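For illustration, a self-contained comparison, assuming the genres field is a valid JSON string (double-quoted keys, as the review comment suggests); if the dataset actually stores Python-literal strings with single quotes, `ast.literal_eval` would be the safe alternative to `eval`:

```python
import json

# Example value as it might appear in the dataset's 'genres' column (assumed JSON).
genres_raw = '[{"id": 18, "name": "Drama"}, {"id": 80, "name": "Crime"}]'

# json.loads parses the string without executing any code, unlike eval().
genres = ", ".join(genre["name"] for genre in json.loads(genres_raw))
print(genres)  # Drama, Crime
```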
```python
    scope=SCOPE_NAME,
    collection=COLLECTION_NAME,
    search_type=QueryVectorSearchType.ANN,
    similarity=QueryVectorSearchSimilarity.L2
```
The similarity metric is set to L2 (Euclidean distance). OpenAI's text-embedding-3-large model produces normalized embeddings, for which COSINE or DOT_PRODUCT similarity is recommended and generally performs better. Consider changing the similarity metric for optimal performance.
```python
similarity=QueryVectorSearchSimilarity.COSINE
```
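As a quick sanity check (plain Python, no Couchbase involved): for unit-normalized embeddings such as those produced by OpenAI's models, cosine similarity reduces to a dot product, and squared L2 distance is monotonically related to it by ||a − b||² = 2 − 2·cos(a, b). The metrics therefore rank the same on normalized vectors, but COSINE or DOT_PRODUCT remains the conventional, recommended choice.

```python
import math

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

a = normalize([3.0, 4.0])  # -> [0.6, 0.8]
b = normalize([4.0, 3.0])  # -> [0.8, 0.6]

cosine = dot(a, b)                                    # cosine similarity of unit vectors
l2_squared = sum((x - y) ** 2 for x, y in zip(a, b))  # squared Euclidean distance

# For unit vectors: ||a - b||^2 == 2 - 2 * cos(a, b)
assert abs(l2_squared - (2 - 2 * cosine)) < 1e-9
print(cosine)  # 0.96
```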
```
"metadata": {},
"source": [
  "# Conclusion\n",
  "In this tutorial, we've built a Retrieval Augmented Generation (RAG) system using Couchbase Capella, Openai and LlamaIndex. We used the BBC News dataset, which contains real-time news articles, to demonstrate how RAG can be used to answer questions about current events and provide up-to-date information that extends beyond the LLM's training data.\n",
```
lamaindex/fts/frontmatter.md (Outdated)
```markdown
- Learn how to build a semantic search engine using Couchbase Capella AI Services.
- This tutorial demonstrates how to integrate Couchbase's vector search capabilities with the embeddings provided by Capella AI Services.
- You will understand how to perform Retrieval-Augmented Generation (RAG) using Llamaindex and Capella AI services.
```
The description mentions using 'Capella AI Services'. However, the corresponding notebook uses the OpenAI API directly. Please update the description to reflect that this tutorial uses OpenAI to avoid confusion.
```diff
- - Learn how to build a semantic search engine using Couchbase Capella AI Services.
- - This tutorial demonstrates how to integrate Couchbase's vector search capabilities with the embeddings provided by Capella AI Services.
- - You will understand how to perform Retrieval-Augmented Generation (RAG) using Llamaindex and Capella AI services.
+ - Learn how to build a semantic search engine using Couchbase Capella.
+ - This tutorial demonstrates how to integrate Couchbase's vector search capabilities with embeddings provided by OpenAI.
+ - You will understand how to perform Retrieval-Augmented Generation (RAG) using LlamaIndex and OpenAI.
```
lamaindex/fts/frontmatter.md (Outdated)
```markdown
- Openai
- Artificial Intelligence
- Llamaindex
```
```python
    news_dataset = load_dataset('RealTimeData/bbc_news_alltime', '2024-12', split="train")
    print(f"Loaded the BBC News dataset with {len(news_dataset)} rows")
except Exception as e:
    raise ValueError(f"Error loading TREC dataset: {str(e)}")
```

The error message refers to the TREC dataset, but the code loads the BBC News dataset. Please update the message for consistency.
lamaindex/gsi/frontmatter.md (Outdated)
```markdown
- Openai
- Artificial Intelligence
- Llamaindex
```