Pinecone: change dummy vector #6932

Merged: 2 commits merged into v1.x from pinecone-change-dummy-vector on Feb 7, 2024

Conversation

anakin87
Member

@anakin87 anakin87 commented Feb 7, 2024

Related Issues

Proposed Changes:

  • I changed the value of the dummy vector
  • I also refactored the code to create this dummy vector once at init time (to avoid duplication)
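
The refactor described above can be sketched roughly as follows. This is a minimal illustration, not the actual PineconeDocumentStore code: the class name `DummyVectorExample` and the method `placeholder_embeddings` are hypothetical, and the fill value -10.0 is assumed from the review comment on this PR.

```python
from typing import List


class DummyVectorExample:
    """Hypothetical stand-in for a document store; not the real Haystack class."""

    def __init__(self, embedding_dim: int = 768, dummy_value: float = -10.0):
        self.embedding_dim = embedding_dim
        # Build the placeholder once at init time so every call site reuses it.
        # The value -10.0 is an assumption based on the review comment.
        self.dummy_vector: List[float] = [dummy_value] * embedding_dim

    def placeholder_embeddings(self, document_chunk: list) -> List[List[float]]:
        # One placeholder per document. All entries reference the same list,
        # so callers must treat the result as read-only.
        return [self.dummy_vector] * len(document_chunk)


store = DummyVectorExample(embedding_dim=4)
embeddings = store.placeholder_embeddings(["doc-a", "doc-b", "doc-c"])
print(len(embeddings), embeddings[0])
```

Because every entry points at the same list object, this avoids allocating a new array per chunk, at the cost that mutating one entry would change all of them.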

How did you test it?

CI

Unfortunately, our CI tests rely on an outdated mock, which is why this problem never surfaced until a user reported it. (Things are much better in haystack-core-integrations.)

So I reproduced the issue and tested the change locally using this code:

from haystack.document_stores.pinecone import PineconeDocumentStore
from haystack.nodes import EmbeddingRetriever, PreProcessor
from haystack.utils import convert_files_to_docs, fetch_archive_from_http

# Download and unpack the sample documents
doc_dir = "data/tutorial8"
s3_url = "https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/preprocessing_tutorial8.zip"
fetch_archive_from_http(url=s3_url, output_dir=doc_dir)

all_docs = convert_files_to_docs(dir_path=doc_dir)

preprocessor = PreProcessor(
    clean_empty_lines=True,
    clean_whitespace=True,
    split_by="word",
    split_length=100,
    split_respect_sentence_boundary=True,
)

# Split the converted documents into chunks of roughly 100 words
docs_default = preprocessor.process(all_docs)

document_store = PineconeDocumentStore(api_key="FAKE-API-KEY", environment="gcp-starter", index="default")

print(document_store.get_document_count())

# The documents have no embeddings at this point
document_store.write_documents(docs_default)

retriever = EmbeddingRetriever(
    document_store=document_store, embedding_model="sentence-transformers/multi-qa-mpnet-base-dot-v1"
)

print(retriever.retrieve("What is masking in language models?"))

Notes for the reviewer

Checklist

embeddings_to_index = np.zeros((len(document_chunk), self.embedding_dim), dtype="float32")
# Convert embeddings to list objects
embeddings = [embed.tolist() if embed is not None else None for embed in embeddings_to_index]
embeddings = [self.dummy_vector] * len(document_chunk)
Member

@anakin87 This is the only diff (aside from -10.0) right? Everything else is the same, no?

Member Author

yes!

@vblagoje vblagoje self-requested a review February 7, 2024 16:37
Member

@vblagoje vblagoje left a comment

🚢

@anakin87 anakin87 merged commit b08deef into v1.x Feb 7, 2024
54 checks passed
@anakin87 anakin87 deleted the pinecone-change-dummy-vector branch February 7, 2024 16:39

2 participants