
Research: Improve the ranking of the sources #133

Closed
vatsrahul1001 opened this issue Nov 16, 2023 · 10 comments · Fixed by #247

vatsrahul1001 commented Nov 16, 2023

While testing Ask Astro today, we noticed some issues with the questions below:

  1. what is Astro SDK?
    The response for this was incorrect ("I'm sorry, but there's no such thing as Astro-SDK.") and the sources were irrelevant.
    Slack Thread

  2. What is the latest version of Astronomer Providers?
    The response was incorrect: according to Ask Astro the latest version is 1.14.0; however, the latest release is 1.18.2.
    Slack Thread

  3. Related docs incorrect for how to install the CLI on Linux?
    Slack Thread

  4. Ask Astro does not recognize version 2.7.3 and treats 2.2.2 as the latest
    Slack Thread

Based on the above, try the following, in order:

  1. tokenization as lowercase #145
  2. Cohere reranking - PR
  3. hybrid search - PR

Test iterations for each experiment by @vatsrahul1001.
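To illustrate what the first experiment changes, here is a minimal sketch of case-insensitive tokenization (the function name is hypothetical, not the actual #145 implementation): lowercasing before tokenizing makes "Astro-SDK", "astro sdk", and "Astro SDK" produce the same tokens for keyword matching.

```python
import re

def tokenize_lowercase(text: str) -> list[str]:
    # Lowercase first, then split on runs of letters/digits, so that
    # hyphenation and capitalization differences don't change the tokens.
    return re.findall(r"[a-z0-9]+", text.lower())

print(tokenize_lowercase("What is Astro-SDK?"))  # ['what', 'is', 'astro', 'sdk']
```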

--- Update 1/8: merging two other similar/duplicate issues into this one and closing them ---

#213
#80

@sunank200

Here is the LangSmith trace:

  1. For "what is the Astro SDK?", the following is the LangSmith trace. It doesn't find the correct docs in the retriever, but it should have picked [1], [2], [3], [4], or [5], which are already ingested.

But when we ask "What is the Astro Python SDK?", it gives correct answers with the right sources. Here is the LangSmith trace.

@sunank200

  • For the following question:
Hi all ,

      Need your help and guidance

      I am going to install airflow version 2.7.3 in new Ec2 instance with postgres

      EC2 - t3.x large (4 vCPUs, 16 GB RAM )

      What is the ubuntu and python version that will be compatible with this ?

      Anyone who installed 2.7.3 can share your thought

The LangSmith trace is here.

For "What is the latest Airflow version?" it gave the correct response, though. The LangChain trace is here.


mpgreg commented Nov 20, 2023

Initial analysis is at https://docs.google.com/document/d/17OBh5b9fQM3kq_n1fxbIL49b2fM0-ipPPD-Ju0_2nIo/edit

TL;DR:
The docSource property is vectorized during ingest and search. This skews search results containing 'astro' and 'astronomer' towards certain sources.
By itself the "source skew" is not a problem, but without hybrid search the vector search skews somewhat randomly toward those sources.
The formatting of Astro docs ingested from Markdown confuses the LLM when answering.

Recommendations:

  1. Remove docSource: it is not used and is inconsistently named. Or change the schema so that docSource is not vectorized.
  2. Implement hybrid search.
  3. Change the Astro doc ingest to extract from HTML sources.
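As a sketch of the last recommendation (HTML-based ingest with boilerplate stripped), here is a minimal extractor built on Python's stdlib html.parser; the tag list and class name are illustrative assumptions, not the actual ingest code:

```python
from html.parser import HTMLParser

# Elements whose text we treat as boilerplate (navbars, footers, etc.).
SKIP_TAGS = {"nav", "footer", "header", "script", "style"}

class ContentExtractor(HTMLParser):
    """Collect visible text while skipping boilerplate elements."""

    def __init__(self):
        super().__init__()
        self.depth = 0          # > 0 while inside a skipped element
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())

def extract_text(html: str) -> str:
    parser = ContentExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)
```

A real pipeline would likely also drop elements by CSS class, but the idea is the same: boilerplate never reaches the vector store.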


mpgreg commented Nov 20, 2023

I'm testing now with local docs, with skip=True for docSource in the schema:

                {
                    "name": "docSource",
                    "description": "Type of document ('learn', 'astro', 'airflow', 'stackoverflow', 'code_samples')",
                    "dataType": ["text"],
                    "moduleConfig": {
                        "text2vec-openai": {
                            "skip": True,  # boolean, not the string "True"
                            "vectorizePropertyName": False
                        }
                    }
                },

@sunank200

We tried hybrid search with Cohere reranking, and it degraded performance. Hence it is not a priority for the Nov 28 release.

@sunank200 sunank200 changed the title Ask Astro || Incorrect responses and sources Research: Improve the ranking of the sources Dec 1, 2023
@sunank200 sunank200 removed their assignment Dec 13, 2023
@phanikumv

David to look into this

@sunank200

David to have first results by EOW

@sunank200

@davidgxue any updates on this?


davidgxue commented Dec 26, 2023

My apologies, I forgot to update on GitHub. I synced with Steven on Friday and sent out a Google doc that contains approaches to experiment with, based on initial analysis and observations (https://docs.google.com/document/d/1j-Hr8dchwBWDxejAf1dvcGIA_Y6lVQ-W9VP0k7Zw9zE/edit?usp=sharing). I am on PTO this entire week, but I will update with more details when I get back.


davidgxue commented Jan 10, 2024

Current Progress Update

  • The enhancements listed under Retrieval items 1, 2, and 4 have been implemented as part of PR #247. That PR will resolve this issue for the current sprint.
  • The PR mentioned above has fixed the problematic responses to the questions in this original issue. See the evaluation report in [QA] Change MultiQuery Prompt, Add Hybrid Search (BM25 + Embedding), Cohere Reranker & LLM Chain Filter #253 (comment).
  • To address the issue of noisy data from Data Ingestion item 1, the most problematic document source containing excessive navbar, footer, and header content has been temporarily excluded from the ingestion process.
  • Exploration and development of the remaining items are ongoing.

Research Report

Next Steps and Approaches for Experimentation and Implementation

Retrieval Enhancements

  1. Hybrid / Sparse + Dense Vector Search Integration with Cohere Reranking

    • Implement a hybrid search combining BM25 and an embedding model with rank fusion scoring to narrow down results to the top 100-300 documents.
    • Rerank the shortlisted documents using Cohere to refine the selection to approximately 10 documents.
    • This approach aims to improve performance and reduce latency compared to the sequential use of BM25, embedding models, and rerankers.
  2. Language Model Prompt Rewording for Multi-Query Retrieval

    • Maintain the original user prompt as one of the queries to preserve the initial intent.
    • Optimize the rewording prompt for GPT-3.5 to ensure it is concise and summarizes the query without introducing extraneous content.
  3. Parent Document Retriever Implementation

    • Address the issue where relevant keywords or related terms appear in one section of a page, but the actual answers are located in a different section of the same document.
  4. Final Relevance Check with a Cost-Effective LLM

    • Utilize a less resource-intensive language model like GPT-3.5 (e.g., LLMChainFilter in LangChain) to assess the relevance of the retrieved documents before processing them with GPT-4 for final response generation.
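The rank-fusion step in item 1 can be sketched with reciprocal rank fusion, one common way to merge a BM25 ranking and an embedding-based ranking; Weaviate's actual hybrid scoring may use a different fusion formula, and the document ids here are placeholders:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document ids into one ranking.

    A document's fused score is the sum of 1 / (k + rank) over every
    list it appears in, so documents ranked well by both retrievers
    rise to the top; k dampens the influence of any single list.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_results = ["d1", "d2", "d3"]      # hypothetical BM25 ranking
dense_results = ["d3", "d1", "d4"]     # hypothetical embedding ranking
print(reciprocal_rank_fusion([bm25_results, dense_results]))
```

The fused shortlist (e.g. the top 100-300 documents) would then go to the Cohere reranker, and finally to the LLM relevance filter from item 4.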

Response Generation Improvements

  1. Prompt Engineering for Source Citation
    • Refine the system prompt that triggers GPT-4 for final response generation to require explicit citation of the source document used in the answer, as recommended by Julian. This should prevent the generation of responses without proper backing from source documents.
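The citation requirement above might look something like the following; this prompt text is a hypothetical illustration, not the production Ask Astro prompt:

```python
# Hypothetical system prompt enforcing explicit source citation.
CITATION_PROMPT = (
    "Answer the user's question using ONLY the documents provided below. "
    "After each claim, cite the supporting document as [source: <url>]. "
    "If none of the documents answer the question, say you don't know "
    "instead of guessing.\n\nDocuments:\n{documents}"
)

def build_prompt(documents: list[str]) -> str:
    # Join the retrieved document snippets into the prompt template.
    return CITATION_PROMPT.format(documents="\n".join(documents))
```

Requiring a citation per claim gives the model no well-formed way to emit an answer that is not backed by a retrieved document.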

Data Management and Optimization

Vector Database (Vector DB)
  1. Exploration of Alternative Embedding Models

    • Investigate the use of embedding models from Cohere as potential alternatives to OpenAI's text-embedding-ada-002 model.
  2. Vector DB Schema Modification

    • Adjust the Vector DB schema to exclude the vectorization of certain non-essential attributes, such as docSource.

Data Ingestion Process
  1. Data Cleaning and Exclusion of Irrelevant Content

    • Implement data cleaning during ingestion to remove non-essential content like navigation bars, footers, headers, and other irrelevant sections that may introduce keyword spam and reduce retrieval accuracy.
  2. Review and Refinement of Chunking Logic

    • Reassess the logic for document chunking to prevent the inclusion of headers or short, meaningless text segments.
  3. Summarization of Large Documents

    • Generate and insert summaries for excessively large documents that are split into numerous chunks, using a language model to aid in comprehension and retrieval.
  4. Topic Keyword Extraction and Metadata Storage

    • Perform topic keyword extraction on each document in the Vector DB, store the results as metadata, and enhance queries with user-prompt-derived keywords during Q&A sessions. This strategy requires significant effort and its effectiveness is yet to be determined.
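Item 4's keyword extraction could be prototyped with a simple frequency count before investing in anything heavier; this is a naive sketch, and the stopword list and length threshold are placeholder assumptions (a real pipeline might use TF-IDF or an LLM instead):

```python
import re
from collections import Counter

# Placeholder stopword list; a real one would be much larger.
STOPWORDS = {"the", "a", "an", "is", "to", "of", "and", "in", "for", "what", "how"}

def extract_keywords(text: str, top_n: int = 5) -> list[str]:
    """Return the top_n most frequent non-stopword tokens of a chunk,
    as candidate topic keywords to store in the document's metadata."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS and len(t) > 2)
    return [word for word, _ in counts.most_common(top_n)]
```

At query time, the same function applied to the user prompt would yield keywords to match against the stored metadata.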
