
Conversation

@shyam-cb
Contributor

No description provided.

@gemini-code-assist
Contributor

Summary of Changes

Hello @shyam-cb, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly expands the examples for building Retrieval Augmented Generation (RAG) systems by introducing new notebooks for both Haystack and LlamaIndex frameworks. These examples illustrate how to effectively integrate with Couchbase Capella as a vector store, showcasing both its Full Text Search (FTS) and Global Secondary Index (GSI) capabilities for high-performance semantic search. By incorporating OpenAI for embeddings and LLM generation, and utilizing the BBC News dataset, the PR provides comprehensive, real-world demonstrations of RAG pipelines for current event information retrieval.

Highlights

  • New Haystack RAG Examples: Introduces a new RAG pipeline example using the Haystack framework, Couchbase Capella with FTS, and OpenAI for embeddings and generation.
  • New LlamaIndex RAG Examples: Adds new RAG pipeline examples leveraging the LlamaIndex framework, Couchbase Capella with both FTS and GSI, and OpenAI.
  • Couchbase Vector Search Integration: Demonstrates both Full Text Search (FTS) and Global Secondary Index (GSI) vector search capabilities within Couchbase Capella for efficient semantic retrieval across the examples.
  • OpenAI Integration: Utilizes OpenAI's 'text-embedding-3-large' for document embeddings and 'gpt-4o' for large language model generation across all new examples.
  • BBC News Dataset: All new examples are built around the BBC News dataset to showcase RAG's ability to provide up-to-date information.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

  • Code Review (/gemini review): Performs a code review for the current pull request in its current state.
  • Pull Request Summary (/gemini summary): Provides a summary of the current pull request in its current state.
  • Comment (@gemini-code-assist): Responds in comments when explicitly tagged, both in pull request comments and review comments.
  • Help (/gemini help): Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.


@gemini-code-assist bot left a comment


Code Review

This pull request adds several new examples for Haystack and LlamaIndex with Couchbase FTS and GSI. The examples are comprehensive and cover the end-to-end RAG pipeline.

I've found a few critical issues in the notebooks that would prevent them from running correctly, mainly related to a syntax error, a variable name typo, and incorrect logic for creating search indexes. There are also some inconsistencies in the documentation (markdown files and comments) versus the code, such as mentioning the wrong datasets or services. I've also pointed out some opportunities for improvement, like using safer parsing methods and more optimal similarity metrics for embeddings.

Additionally, the directory lamaindex seems to have a typo and should probably be llamaindex.

Please review the specific comments for details and suggestions.

"# Configure logging\n",
"logger = logging.getLogger(__name__)\n",
"logger.setLevel(logging.DEBUG)"
"logger = logging.getLogger(__name__)logger.setLevel(logging.DEBUG)\n"

critical

This line contains a syntax error. The two Python statements logger = logging.getLogger(__name__) and logger.setLevel(logging.DEBUG) have been concatenated without a separator. Please add a semicolon between them to fix the syntax while keeping them on one line.

Suggested change
"logger = logging.getLogger(__name__)logger.setLevel(logging.DEBUG)\n"
"logger = logging.getLogger(__name__); logger.setLevel(logging.DEBUG)\n"
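As a quick sanity check on the suggestion, the two statements joined by a semicolon parse and run as one valid line of Python. A minimal standalone sketch (outside the notebook):

```python
import logging

# Two statements separated by a semicolon are valid Python on one line;
# this mirrors the corrected notebook cell.
logger = logging.getLogger(__name__); logger.setLevel(logging.DEBUG)

print(logger.level == logging.DEBUG)  # True
```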

"OPENAI_API_KEY = input(\"OpenAI API Key: \")\n",
"\n",
"# Check if the variables are correctly loaded\n",
"if not all([CB_CONNECTION_STRING, CB_USERNAME, CB_PASSWORD, CB_BUCKET_NAME, CB_SCOPE_NAME, CB_COLLECTION_NAME, CB_INDEX_NAME, CB_OPENAI_API_KEY]):\n",

critical

There is a typo in the variable name CB_OPENAI_API_KEY. The variable defined earlier is OPENAI_API_KEY. This will cause a NameError when this cell is executed.

if not all([CB_CONNECTION_STRING, CB_USERNAME, CB_PASSWORD, CB_BUCKET_NAME, CB_SCOPE_NAME, CB_COLLECTION_NAME, CB_INDEX_NAME, OPENAI_API_KEY]):
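To illustrate, a minimal sketch of the corrected guard with hypothetical placeholder values (only a subset of the notebook's variables; the real notebook reads them via input()): all() returns False as soon as any value is empty.

```python
# Hypothetical placeholder values standing in for the notebook's inputs.
CB_CONNECTION_STRING = "couchbases://cb.example.cloud.couchbase.com"
CB_USERNAME = "admin"
CB_PASSWORD = ""            # left blank to show the check firing
OPENAI_API_KEY = "sk-placeholder"

# Corrected name: OPENAI_API_KEY, not CB_OPENAI_API_KEY.
settings = [CB_CONNECTION_STRING, CB_USERNAME, CB_PASSWORD, OPENAI_API_KEY]
if not all(settings):
    print("One or more configuration values are missing")
```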

Comment on lines +235 to +238
" with open('fts_index.json', 'r') as search_file:\n",
" search_index_definition = SearchIndex.from_json(json.load(search_file))\n",
" scope_search_manager.upsert_index(search_index_definition)\n",
" print(f\"Search index '{search_index_name}' created successfully at scope level.\")"

critical

The logic for creating the search index is flawed. Inside the except block, you are re-reading the original fts_index.json file, which overwrites the search_index_definition object that was modified with user inputs (e.g., index name, bucket name). As a result, upsert_index is called with the unmodified template, which will likely fail. You should remove the with open(...) block inside the except and call scope_search_manager.upsert_index(search_index_definition) directly, using the search_index_definition object that was already prepared.
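To see why this matters, here is a minimal sketch with plain dicts standing in for the SearchIndex object (the template content and names are hypothetical): re-reading the template inside the except block discards the earlier customizations, while reusing the prepared definition keeps them.

```python
import json

# Hypothetical template, as it might appear in fts_index.json.
template_json = json.dumps({"name": "template-index", "sourceName": "template-bucket"})

# Loaded once and customized with user input earlier in the notebook.
search_index_definition = json.loads(template_json)
search_index_definition["name"] = "my-index"
search_index_definition["sourceName"] = "my-bucket"

def upsert_index(definition):
    # Stand-in for scope_search_manager.upsert_index(...).
    return definition["name"]

# Flawed path: re-reading the template loses the customizations.
assert upsert_index(json.loads(template_json)) == "template-index"

# Corrected path: reuse the already-prepared definition.
assert upsert_index(search_index_definition) == "my-index"
```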

" try:\n",
" docs_data.append({\n",
" 'id': str(row[\"id\"]),\n",
" 'content': f\"Title: {row['title']}\\nGenres: {', '.join([genre['name'] for genre in eval(row['genres'])])}\\nOverview: {row['overview']}\",\n",

high

Using eval() on data from an external source is a security risk, as it can execute arbitrary code. Since the 'genres' field appears to be a JSON string, you should use json.loads() for safe parsing. You will need to import the json module in this cell or a prior one.

            'content': f"Title: {row['title']}\nGenres: {', '.join([genre['name'] for genre in json.loads(row['genres'])])}\nOverview: {row['overview']}",
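For illustration, a self-contained sketch with a hypothetical row: json.loads parses the string without ever executing its contents. (If the field turns out to use Python-style single quotes rather than valid JSON, ast.literal_eval is the safe drop-in alternative to eval.)

```python
import json

# Hypothetical row; in the real notebook this comes from the dataset.
row = {
    "title": "Example Film",
    "genres": '[{"id": 18, "name": "Drama"}, {"id": 35, "name": "Comedy"}]',
    "overview": "A short synopsis.",
}

# Safe parsing: json.loads never executes the string's contents.
genres = ", ".join(genre["name"] for genre in json.loads(row["genres"]))
content = f"Title: {row['title']}\nGenres: {genres}\nOverview: {row['overview']}"
print(genres)  # Drama, Comedy
```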

" scope=SCOPE_NAME,\n",
" collection=COLLECTION_NAME,\n",
" search_type=QueryVectorSearchType.ANN,\n",
" similarity=QueryVectorSearchSimilarity.L2\n",

medium

The similarity metric is set to L2 (Euclidean distance). OpenAI's text-embedding-3-large model produces normalized embeddings, for which COSINE or DOT_PRODUCT similarity is recommended and generally performs better. Consider changing the similarity metric for optimal performance.

        similarity=QueryVectorSearchSimilarity.COSINE
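A short numeric check of how the metrics relate (pure-Python sketch, no external services): for unit-length vectors, cosine similarity equals the dot product, and squared L2 distance works out to 2 - 2*cos, so the metrics rank neighbors consistently; cosine or dot product is nonetheless the conventional setting for normalized embeddings.

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

a = normalize([1.0, 2.0, 2.0])
b = normalize([2.0, 1.0, 2.0])

dot = sum(x * y for x, y in zip(a, b))            # cosine similarity for unit vectors
l2_sq = sum((x - y) ** 2 for x, y in zip(a, b))   # squared Euclidean distance

# For unit vectors: ||a - b||^2 == 2 - 2 * cos(a, b)
print(abs(l2_sq - (2 - 2 * dot)) < 1e-12)  # True
```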

"metadata": {},
"source": [
"# Conclusion\n",
"In this tutorial, we've built a Retrieval Augmented Generation (RAG) system using Couchbase Capella, Openai and LlamaIndex. We used the BBC News dataset, which contains real-time news articles, to demonstrate how RAG can be used to answer questions about current events and provide up-to-date information that extends beyond the LLM's training data.\n",

medium

There is a typo in this line: 'Openai' should be 'OpenAI' for correct branding and consistency with the rest of the tutorial.

Comment on lines 7 to 9
- Learn how to build a semantic search engine using Couchbase Capella AI Services.
- This tutorial demonstrates how to integrate Couchbase's vector search capabilities with the embeddings provided by Capella AI Services.
- You will understand how to perform Retrieval-Augmented Generation (RAG) using Llamaindex and Capella AI services.

medium

The description mentions using 'Capella AI Services'. However, the corresponding notebook uses the OpenAI API directly. Please update the description to reflect that this tutorial uses OpenAI to avoid confusion.

Suggested change
- Learn how to build a semantic search engine using Couchbase Capella AI Services.
- This tutorial demonstrates how to integrate Couchbase's vector search capabilities with the embeddings provided by Capella AI Services.
- You will understand how to perform Retrieval-Augmented Generation (RAG) using Llamaindex and Capella AI services.
- Learn how to build a semantic search engine using Couchbase Capella.
- This tutorial demonstrates how to integrate Couchbase's vector search capabilities with embeddings provided by OpenAI.
- You will understand how to perform Retrieval-Augmented Generation (RAG) using LlamaIndex and OpenAI.

Comment on lines 15 to 17
- Openai
- Artificial Intelligence
- Llamaindex

medium

There are a couple of typos here: Openai should be OpenAI and Llamaindex should be LlamaIndex for correct branding and consistency.

Suggested change
- Openai
- Artificial Intelligence
- Llamaindex
- OpenAI
- Artificial Intelligence
- LlamaIndex

" news_dataset = load_dataset('RealTimeData/bbc_news_alltime', '2024-12', split=\"train\")\n",
" print(f\"Loaded the BBC News dataset with {len(news_dataset)} rows\")\n",
"except Exception as e:\n",
" raise ValueError(f\"Error loading TREC dataset: {str(e)}\")"

medium

The error message incorrectly refers to the 'TREC dataset', but the code is loading the 'RealTimeData/bbc_news_alltime' dataset. Please correct the error message to mention the correct dataset.

    raise ValueError(f"Error loading BBC News dataset: {str(e)}")


@shyam-cb shyam-cb merged commit 93c3933 into couchbase-examples:main Oct 30, 2025
0 of 2 checks passed