Haystack and Llamaindex fst and gsi examples #62

Conversation
Summary of Changes

Hello @shyam-cb, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request significantly expands the examples for building Retrieval Augmented Generation (RAG) systems by introducing new notebooks for both the Haystack and LlamaIndex frameworks. These examples illustrate how to integrate with Couchbase Capella as a vector store, showcasing both its Full Text Search (FTS) and Global Secondary Index (GSI) capabilities for high-performance semantic search. By incorporating OpenAI for embeddings and LLM generation, and using the BBC News dataset, the PR provides comprehensive, real-world demonstrations of RAG pipelines for current-event information retrieval.
Code Review
This pull request adds several new examples for Haystack and LlamaIndex with Couchbase FTS and GSI. The examples are comprehensive and cover the end-to-end RAG pipeline.
I've found a few critical issues in the notebooks that would prevent them from running correctly, mainly related to a syntax error, a variable name typo, and incorrect logic for creating search indexes. There are also some inconsistencies in the documentation (markdown files and comments) versus the code, such as mentioning the wrong datasets or services. I've also pointed out some opportunities for improvement, like using safer parsing methods and more optimal similarity metrics for embeddings.
Additionally, the directory lamaindex seems to have a typo and should probably be llamaindex.
Please review the specific comments for details and suggestions.
```diff
  "# Configure logging\n",
- "logger = logging.getLogger(__name__)\n",
- "logger.setLevel(logging.DEBUG)"
+ "logger = logging.getLogger(__name__)logger.setLevel(logging.DEBUG)\n"
```
This line contains a syntax error. The two Python statements logger = logging.getLogger(__name__) and logger.setLevel(logging.DEBUG) have been concatenated without a separator. Please add a semicolon between them to fix the syntax while keeping them on one line.
```diff
- "logger = logging.getLogger(__name__)logger.setLevel(logging.DEBUG)\n"
+ "logger = logging.getLogger(__name__); logger.setLevel(logging.DEBUG)\n"
```
```python
OPENAI_API_KEY = input("OpenAI API Key: ")

# Check if the variables are correctly loaded
if not all([CB_CONNECTION_STRING, CB_USERNAME, CB_PASSWORD, CB_BUCKET_NAME, CB_SCOPE_NAME, CB_COLLECTION_NAME, CB_INDEX_NAME, CB_OPENAI_API_KEY]):
```
There is a typo in the variable name CB_OPENAI_API_KEY. The variable defined earlier is OPENAI_API_KEY. This will cause a NameError when this cell is executed.
```python
if not all([CB_CONNECTION_STRING, CB_USERNAME, CB_PASSWORD, CB_BUCKET_NAME, CB_SCOPE_NAME, CB_COLLECTION_NAME, CB_INDEX_NAME, OPENAI_API_KEY]):
```
```python
    with open('fts_index.json', 'r') as search_file:
        search_index_definition = SearchIndex.from_json(json.load(search_file))
    scope_search_manager.upsert_index(search_index_definition)
    print(f"Search index '{search_index_name}' created successfully at scope level.")
```
The logic for creating the search index is flawed. Inside the except block, you are re-reading the original fts_index.json file, which overwrites the search_index_definition object that was modified with user inputs (e.g., index name, bucket name). As a result, upsert_index is called with the unmodified template, which will likely fail. You should remove the with open(...) block inside the except and call scope_search_manager.upsert_index(search_index_definition) directly, using the search_index_definition object that was already prepared.
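A minimal sketch of the suggested flow, using a simplified dict-based index definition and a hypothetical `scope_search_manager` with `get_index`/`upsert_index` methods (the actual notebook uses the Couchbase SDK's `SearchIndex` object): the template is read and customized once, and the fallback path reuses that customized object instead of re-reading the file.

```python
import json

def create_search_index(scope_search_manager, template_path, index_name, bucket_name):
    # Read the fts_index.json template ONCE and apply the user-provided names.
    with open(template_path) as f:
        search_index_definition = json.load(f)
    search_index_definition["name"] = index_name
    search_index_definition["sourceName"] = bucket_name

    try:
        scope_search_manager.get_index(index_name)
        print(f"Search index '{index_name}' already exists.")
    except Exception:
        # Reuse the already-customized definition here. Re-reading the
        # template inside this block would discard the name/bucket overrides
        # applied above, which is the bug flagged in this review comment.
        scope_search_manager.upsert_index(search_index_definition)
        print(f"Search index '{index_name}' created at scope level.")
    return search_index_definition
```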
```python
    try:
        docs_data.append({
            'id': str(row["id"]),
            'content': f"Title: {row['title']}\nGenres: {', '.join([genre['name'] for genre in eval(row['genres'])])}\nOverview: {row['overview']}",
```
Using eval() on data from an external source is a security risk, as it can execute arbitrary code. Since the 'genres' field appears to be a JSON string, you should use json.loads() for safe parsing. You will need to import the json module in this cell or a prior one.
```python
'content': f"Title: {row['title']}\nGenres: {', '.join([genre['name'] for genre in json.loads(row['genres'])])}\nOverview: {row['overview']}",
```
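For illustration, a self-contained comparison, assuming the genres field is a valid JSON string (double-quoted keys, as the review comment suggests); if the dataset actually stores Python-literal strings with single quotes, `ast.literal_eval` would be the safe alternative to `eval`:

```python
import json

# Example value as it might appear in the dataset's 'genres' column (assumed JSON).
genres_raw = '[{"id": 18, "name": "Drama"}, {"id": 80, "name": "Crime"}]'

# json.loads parses the string without executing any code, unlike eval().
genres = ", ".join(genre["name"] for genre in json.loads(genres_raw))
print(genres)  # Drama, Crime
```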
```python
    scope=SCOPE_NAME,
    collection=COLLECTION_NAME,
    search_type=QueryVectorSearchType.ANN,
    similarity=QueryVectorSearchSimilarity.L2
```
The similarity metric is set to L2 (Euclidean distance). OpenAI's text-embedding-3-large model produces normalized embeddings, for which COSINE or DOT_PRODUCT similarity is recommended and generally performs better. Consider changing the similarity metric for optimal performance.
```python
similarity=QueryVectorSearchSimilarity.COSINE
```
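As a quick sanity check (plain Python, no Couchbase involved): for unit-normalized embeddings such as those produced by OpenAI's models, cosine similarity reduces to a dot product, and squared L2 distance is monotonically related to it by ||a − b||² = 2 − 2·cos(a, b). The metrics therefore rank the same on normalized vectors, but COSINE or DOT_PRODUCT remains the conventional, recommended choice.

```python
import math

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

a = normalize([3.0, 4.0])  # -> [0.6, 0.8]
b = normalize([4.0, 3.0])  # -> [0.8, 0.6]

cosine = dot(a, b)                                    # cosine similarity of unit vectors
l2_squared = sum((x - y) ** 2 for x, y in zip(a, b))  # squared Euclidean distance

# For unit vectors: ||a - b||^2 == 2 - 2 * cos(a, b)
assert abs(l2_squared - (2 - 2 * cosine)) < 1e-9
print(cosine)  # 0.96
```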
```
"metadata": {},
"source": [
  "# Conclusion\n",
  "In this tutorial, we've built a Retrieval Augmented Generation (RAG) system using Couchbase Capella, Openai and LlamaIndex. We used the BBC News dataset, which contains real-time news articles, to demonstrate how RAG can be used to answer questions about current events and provide up-to-date information that extends beyond the LLM's training data.\n",
```
lamaindex/fts/frontmatter.md (Outdated)
```markdown
- Learn how to build a semantic search engine using Couchbase Capella AI Services.
- This tutorial demonstrates how to integrate Couchbase's vector search capabilities with the embeddings provided by Capella AI Services.
- You will understand how to perform Retrieval-Augmented Generation (RAG) using Llamaindex and Capella AI services.
```
The description mentions using 'Capella AI Services'. However, the corresponding notebook uses the OpenAI API directly. Please update the description to reflect that this tutorial uses OpenAI to avoid confusion.
```diff
- - Learn how to build a semantic search engine using Couchbase Capella AI Services.
- - This tutorial demonstrates how to integrate Couchbase's vector search capabilities with the embeddings provided by Capella AI Services.
- - You will understand how to perform Retrieval-Augmented Generation (RAG) using Llamaindex and Capella AI services.
+ - Learn how to build a semantic search engine using Couchbase Capella.
+ - This tutorial demonstrates how to integrate Couchbase's vector search capabilities with embeddings provided by OpenAI.
+ - You will understand how to perform Retrieval-Augmented Generation (RAG) using LlamaIndex and OpenAI.
```
lamaindex/fts/frontmatter.md (Outdated)
```markdown
- Openai
- Artificial Intelligence
- Llamaindex
```
```python
    news_dataset = load_dataset('RealTimeData/bbc_news_alltime', '2024-12', split="train")
    print(f"Loaded the BBC News dataset with {len(news_dataset)} rows")
except Exception as e:
    raise ValueError(f"Error loading TREC dataset: {str(e)}")
```

The error message refers to the TREC dataset, but the code loads the BBC News dataset. Please update the message for consistency.
lamaindex/gsi/frontmatter.md (Outdated)
```markdown
- Openai
- Artificial Intelligence
- Llamaindex
```