
Research: Improve the ranking of the sources #133

Closed
vatsrahul1001 opened this issue Nov 16, 2023 · 10 comments · Fixed by #247

vatsrahul1001 commented Nov 16, 2023

While testing Ask Astro today, we noticed some issues with the questions below:

  1. what is Astro SDK?
    The response for this was incorrect ("I'm sorry, but there's no such thing as Astro-SDK.") and the sources were irrelevant.
    Slack Thread

  2. What is the latest version of Astronomer Providers?
    The response was incorrect: according to Ask Astro the latest version is 1.14.0; however, the latest release is 1.18.2.
    Slack Thread

  3. Related docs incorrect for how to install the CLI on Linux?
    Slack Thread

  4. Ask Astro does not recognize version 2.7.3 and treats 2.2.2 as the latest
    Slack Thread

Based on the above, try the following, in order:

  1. tokenization as lowercase #145
  2. Cohere reranking - PR
  3. hybrid search - PR

Test iterations for each experiment by @vatsrahul1001.
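To illustrate what the first experiment changes, here is a minimal sketch of case-insensitive tokenization (the function name is hypothetical, not the actual #145 implementation): lowercasing before tokenizing makes "Astro-SDK", "astro sdk", and "Astro SDK" produce the same tokens for keyword matching.

```python
import re

def tokenize_lowercase(text: str) -> list[str]:
    # Lowercase first, then split on runs of letters/digits, so that
    # hyphenation and capitalization differences don't change the tokens.
    return re.findall(r"[a-z0-9]+", text.lower())

print(tokenize_lowercase("What is Astro-SDK?"))  # ['what', 'is', 'astro', 'sdk']
```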

--- Update 1/8: merging two other similar/duplicate issues into this one and closing them ---

#213
#80

@sunank200

Here is the LangSmith trace:

  1. For "what is the Astro SDK?", the following is the LangSmith trace. It doesn't find the correct docs in the retriever, but it should have picked [1], [2], [3], [4], or [5], which are already ingested.

But when we ask "What is the Astro Python SDK?", it gives correct answers with the right sources. Here is the LangSmith trace.

@sunank200

  • For the following question:
Hi all ,

      Need your help and guidance

      I am going to install airflow version 2.7.3 in new Ec2 instance with postgres

      EC2 - t3.x large (4 vCPUs, 16 GB RAM )

      What is the ubuntu and python version that will be compatible with this ?

      Anyone who installed 2.7.3 can share your thought

The LangSmith trace is here.

For "What is the latest Airflow version?" it gave the correct response, though. The LangChain trace is here.


mpgreg commented Nov 20, 2023

Initial analysis is at https://docs.google.com/document/d/17OBh5b9fQM3kq_n1fxbIL49b2fM0-ipPPD-Ju0_2nIo/edit

TL;DR:
The docSource property is vectorized during ingest and search. This skews search results containing 'astro' and 'astronomer' towards certain sources.
By itself the "source skew" is not a problem, but without hybrid search the vector search skews somewhat randomly toward those sources.
The formatting of Astro docs ingested from Markdown confuses the LLM when answering.

Recommendations:

  1. Remove docSource: it is not used and is inconsistently named. Or change the schema so that docSource is not vectorized.
  2. Implement hybrid search.
  3. Change the Astro doc ingest to extract from HTML sources.
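As a sketch of the last recommendation (HTML-based ingest with boilerplate stripped), here is a minimal extractor built on Python's stdlib html.parser; the tag list and class name are illustrative assumptions, not the actual ingest code:

```python
from html.parser import HTMLParser

# Elements whose text we treat as boilerplate (navbars, footers, etc.).
SKIP_TAGS = {"nav", "footer", "header", "script", "style"}

class ContentExtractor(HTMLParser):
    """Collect visible text while skipping boilerplate elements."""

    def __init__(self):
        super().__init__()
        self.depth = 0          # > 0 while inside a skipped element
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())

def extract_text(html: str) -> str:
    parser = ContentExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)
```

A real pipeline would likely also drop elements by CSS class, but the idea is the same: boilerplate never reaches the vector store.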


mpgreg commented Nov 20, 2023

I'm testing now with local docs, with skip=True for docSource in the schema:

                {
                    "name": "docSource",
                    "description": "Type of document ('learn', 'astro', 'airflow', 'stackoverflow', 'code_samples')",
                    "dataType": ["text"],
                    "moduleConfig": {
                        "text2vec-openai": {
                            "skip": True,  # boolean, not the string "True"
                            "vectorizePropertyName": False
                        }
                    }
                },

@sunank200

We tried hybrid search with Cohere reranking, and it degraded performance. Hence it is not a priority for the Nov 28 release.

@sunank200 sunank200 changed the title Ask Astro || Incorrect responses and sources Research: Improve the ranking of the sources Dec 1, 2023
@sunank200 sunank200 removed their assignment Dec 13, 2023
@phanikumv

David to look into this

@sunank200

David to have first results by EOW

@sunank200

@davidgxue any updates on this?


davidgxue commented Dec 26, 2023

My apologies, I forgot to update on GitHub. I synced with Steven on Friday and sent out a Google doc that contains approaches to experiment with, based on initial analysis and observations (https://docs.google.com/document/d/1j-Hr8dchwBWDxejAf1dvcGIA_Y6lVQ-W9VP0k7Zw9zE/edit?usp=sharing). I am on PTO this entire week, but I will update with more details when I get back.


davidgxue commented Jan 10, 2024

Current Progress Update

  • The enhancements listed under Retrieval items 1, 2, and 4 have been implemented as part of PR #247. That PR will resolve this issue for the current sprint.
  • The PR mentioned above has fixed the problematic responses to the questions in this original issue. See the evaluation report in [QA] Change MultiQuery Prompt, Add Hybrid Search (BM25 + Embedding), Cohere Reranker & LLM Chain Filter #253 (comment).
  • To address the issue of noisy data from Data Ingestion item 1, the most problematic document source containing excessive navbar, footer, and header content has been temporarily excluded from the ingestion process.
  • Exploration and development of the remaining items are ongoing.

Research Report

Next Steps and Approaches for Experimentation and Implementation

Retrieval Enhancements

  1. Hybrid / Sparse + Dense Vector Search Integration with Cohere Reranking

    • Implement a hybrid search combining BM25 and an embedding model with rank fusion scoring to narrow down results to the top 100-300 documents.
    • Rerank the shortlisted documents using Cohere to refine the selection to approximately 10 documents.
    • This approach aims to improve performance and reduce latency compared to the sequential use of BM25, embedding models, and rerankers.
  2. Language Model Prompt Rewording for Multi-Query Retrieval

    • Maintain the original user prompt as one of the queries to preserve the initial intent.
    • Optimize the rewording prompt for GPT-3.5 to ensure it is concise and summarizes the query without introducing extraneous content.
  3. Parent Document Retriever Implementation

    • Address the issue where relevant keywords or related terms appear in one section of a page, but the actual answers are located in a different section of the same document.
  4. Final Relevance Check with a Cost-Effective LLM

    • Utilize a less resource-intensive language model like GPT-3.5 (e.g., LLMChainFilter in LangChain) to assess the relevance of the retrieved documents before processing them with GPT-4 for final response generation.
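The rank-fusion step in item 1 can be sketched with reciprocal rank fusion, one common way to merge a BM25 ranking and an embedding-based ranking; Weaviate's actual hybrid scoring may use a different fusion formula, and the document ids here are placeholders:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document ids into one ranking.

    A document's fused score is the sum of 1 / (k + rank) over every
    list it appears in, so documents ranked well by both retrievers
    rise to the top; k dampens the influence of any single list.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_results = ["d1", "d2", "d3"]      # hypothetical BM25 ranking
dense_results = ["d3", "d1", "d4"]     # hypothetical embedding ranking
print(reciprocal_rank_fusion([bm25_results, dense_results]))
```

The fused shortlist (e.g. the top 100-300 documents) would then go to the Cohere reranker, and finally to the LLM relevance filter from item 4.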

Response Generation Improvements

  1. Prompt Engineering for Source Citation
    • Refine the system prompt that triggers GPT-4 for final response generation to require explicit citation of the source document used in the answer, as recommended by Julian. This should prevent the generation of responses without proper backing from source documents.
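The citation requirement above might look something like the following; this prompt text is a hypothetical illustration, not the production Ask Astro prompt:

```python
# Hypothetical system prompt enforcing explicit source citation.
CITATION_PROMPT = (
    "Answer the user's question using ONLY the documents provided below. "
    "After each claim, cite the supporting document as [source: <url>]. "
    "If none of the documents answer the question, say you don't know "
    "instead of guessing.\n\nDocuments:\n{documents}"
)

def build_prompt(documents: list[str]) -> str:
    # Join the retrieved document snippets into the prompt template.
    return CITATION_PROMPT.format(documents="\n".join(documents))
```

Requiring a citation per claim gives the model no well-formed way to emit an answer that is not backed by a retrieved document.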

Data Management and Optimization

Vector Database (Vector DB)
  1. Exploration of Alternative Embedding Models

    • Investigate the use of embedding models from Cohere as potential alternatives to OpenAI's text-embedding-ada-002 model.
  2. Vector DB Schema Modification

    • Adjust the Vector DB schema to exclude the vectorization of certain non-essential attributes, such as docSource.

Data Ingestion Process
  1. Data Cleaning and Exclusion of Irrelevant Content

    • Implement data cleaning during ingestion to remove non-essential content like navigation bars, footers, headers, and other irrelevant sections that may introduce keyword spam and reduce retrieval accuracy.
  2. Review and Refinement of Chunking Logic

    • Reassess the logic for document chunking to prevent the inclusion of headers or short, meaningless text segments.
  3. Summarization of Large Documents

    • Generate and insert summaries for excessively large documents that are split into numerous chunks, using a language model to aid in comprehension and retrieval.
  4. Topic Keyword Extraction and Metadata Storage

    • Perform topic keyword extraction on each document in the Vector DB, store the results as metadata, and enhance queries with user-prompt-derived keywords during Q&A sessions. This strategy requires significant effort and its effectiveness is yet to be determined.
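Item 4's keyword extraction could be prototyped with a simple frequency count before investing in anything heavier; this is a naive sketch, and the stopword list and length threshold are placeholder assumptions (a real pipeline might use TF-IDF or an LLM instead):

```python
import re
from collections import Counter

# Placeholder stopword list; a real one would be much larger.
STOPWORDS = {"the", "a", "an", "is", "to", "of", "and", "in", "for", "what", "how"}

def extract_keywords(text: str, top_n: int = 5) -> list[str]:
    """Return the top_n most frequent non-stopword tokens of a chunk,
    as candidate topic keywords to store in the document's metadata."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS and len(t) > 2)
    return [word for word, _ in counts.most_common(top_n)]
```

At query time, the same function applied to the user prompt would yield keywords to match against the stored metadata.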
