Cannot read embeddings from parquet files stored in S3 #177

Open
anisioti opened this issue Nov 14, 2023 · 0 comments

Hello everyone!

Thank you for this nice project and the features already developed. I am currently trying to build a large faiss index in a distributed way and found that the autofaiss library can help me achieve this.

I am working in a Glue notebook with pyspark and have my embeddings in a pyspark dataframe. Since I saw in other issues that calling the build_index function directly on a pyspark dataframe is not possible, I am storing the embeddings and the ids as parquet files in S3 (with compression disabled, because compression was changing the file extension, and I saw in a previously closed issue that the files must have a .parquet extension).
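
For reference, here is roughly how I write the embeddings out (a minimal sketch; embeddings_df stands in for my actual pyspark dataframe):

# embeddings_df stands in for my actual pyspark dataframe.
# Compression is disabled so the part files keep a plain .parquet extension.
(embeddings_df
    .select("author_name_id", "author_name_embeddings")
    .write
    .mode("overwrite")
    .option("compression", "none")
    .parquet("s3://gtm-core-eks-uat-euc1-s3-bucket-glo-hcp-linking/test/embedding_partitions/embeddings_pubmed"))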

Currently I am running:

from autofaiss import build_index

build_index(
    embeddings="s3://gtm-core-eks-uat-euc1-s3-bucket-glo-hcp-linking/test/embedding_partitions/embeddings_pubmed",
    index_path="s3://gtm-core-eks-uat-euc1-s3-bucket-glo-hcp-linking/test/embedding_partitions/knn.index",
    index_infos_path="s3://gtm-core-eks-uat-euc1-s3-bucket-glo-hcp-linking/test/embedding_partitions/index_infos.json",
    max_index_memory_usage="4G",
    file_format="parquet",
    distributed="pyspark",
    metric_type="l2",
    embedding_column_name="author_name_embeddings",
    id_columns=["author_name_id"],
    ids_path="s3://gtm-core-eks-uat-euc1-s3-bucket-glo-hcp-linking/test/embedding_partitions/",  # where to store the id/embedding mapping
    current_memory_available="4G",
    nb_indices_to_keep=10,
)

and I am getting the following error:

FileNotFoundError: gtm-core-eks-uat-euc1-s3-bucket-glo-hcp-linking/test/embedding_partitions/embeddings_pubmed/part-00000-f2ea6b6c-f41c-4d0d-979f-66347536b1d6-c000.parquet

This file, as well as the other partition files with the embeddings, does exist (see the screenshot from S3 below).
[screenshot: S3 console listing the .parquet part files under test/embedding_partitions/embeddings_pubmed]
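
As an extra sanity check, I also listed the prefix from the same notebook through fsspec/s3fs (a minimal sketch; I am assuming the default AWS credential chain, and my understanding is that autofaiss resolves s3:// paths through fsspec, which may be wrong), and the part files show up fine:

import fsspec

# List the embeddings prefix through s3fs, the filesystem layer that
# (to my understanding) autofaiss uses to resolve s3:// paths.
fs = fsspec.filesystem("s3")
print(fs.ls("s3://gtm-core-eks-uat-euc1-s3-bucket-glo-hcp-linking/test/embedding_partitions/embeddings_pubmed"))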

I am wondering what the issue could be here, since I have tried everything I can think of to make this work and, as a result, I cannot create the index.

P.S. I have tried reading the data back from S3 in the Glue notebook, and the results look correct (correct columns and data types), so I have also ruled out S3 access issues from the notebook.
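
Roughly what I ran for that check (spark is the SparkSession provided by the Glue notebook; the schema in the comment is what I expect given how I wrote the data):

# spark is the SparkSession provided by the Glue notebook.
df = spark.read.parquet("s3://gtm-core-eks-uat-euc1-s3-bucket-glo-hcp-linking/test/embedding_partitions/embeddings_pubmed")
df.printSchema()  # expecting author_name_id (string) and author_name_embeddings (array<float>)
df.show(3, truncate=False)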

Any help would be greatly appreciated.
Thank you,
Athina
