Cannot read embeddings from parquet files stored in S3 #177

Open
anisioti opened this issue Nov 14, 2023 · 0 comments

Hello everyone!

Thank you for this nice project and the features already developed. I am currently trying to build a large faiss index in a distributed way and found that the autofaiss library can help me achieve this.

I am working in a Glue notebook with pyspark and have my embeddings in a pyspark dataframe. Since I saw in other issues that calling the build_index function directly on a pyspark dataframe is not possible, I am storing the embeddings and the ids as parquet files in S3 (with compression disabled, because compression was changing the file extension, and I saw in a previously closed issue that the files must have a .parquet extension).
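
For reference, here is roughly how I write the embeddings out (a minimal sketch; embeddings_df stands in for my actual pyspark dataframe):

# embeddings_df stands in for my actual pyspark dataframe.
# Compression is disabled so the part files keep a plain .parquet extension.
(embeddings_df
    .select("author_name_id", "author_name_embeddings")
    .write
    .mode("overwrite")
    .option("compression", "none")
    .parquet("s3://gtm-core-eks-uat-euc1-s3-bucket-glo-hcp-linking/test/embedding_partitions/embeddings_pubmed"))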

Currently I am running:

from autofaiss import build_index

build_index(
    embeddings="s3://gtm-core-eks-uat-euc1-s3-bucket-glo-hcp-linking/test/embedding_partitions/embeddings_pubmed",
    index_path="s3://gtm-core-eks-uat-euc1-s3-bucket-glo-hcp-linking/test/embedding_partitions/knn.index",
    index_infos_path="s3://gtm-core-eks-uat-euc1-s3-bucket-glo-hcp-linking/test/embedding_partitions/index_infos.json",
    max_index_memory_usage="4G",
    file_format="parquet",
    distributed="pyspark",
    metric_type="l2",
    embedding_column_name="author_name_embeddings",
    id_columns=["author_name_id"],
    ids_path="s3://gtm-core-eks-uat-euc1-s3-bucket-glo-hcp-linking/test/embedding_partitions/",  # where to store the id/embedding mapping
    current_memory_available="4G",
    nb_indices_to_keep=10,
)

and I am getting the following error:

FileNotFoundError: gtm-core-eks-uat-euc1-s3-bucket-glo-hcp-linking/test/embedding_partitions/embeddings_pubmed/part-00000-f2ea6b6c-f41c-4d0d-979f-66347536b1d6-c000.parquet

This file, as well as the other partition files with the embeddings, does exist (see the screenshot from S3 below).
[screenshot: S3 console listing the .parquet part files under test/embedding_partitions/embeddings_pubmed]
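
As an extra sanity check, I also listed the prefix from the same notebook through fsspec/s3fs (a minimal sketch; I am assuming the default AWS credential chain, and my understanding is that autofaiss resolves s3:// paths through fsspec, which may be wrong), and the part files show up fine:

import fsspec

# List the embeddings prefix through s3fs, the filesystem layer that
# (to my understanding) autofaiss uses to resolve s3:// paths.
fs = fsspec.filesystem("s3")
print(fs.ls("s3://gtm-core-eks-uat-euc1-s3-bucket-glo-hcp-linking/test/embedding_partitions/embeddings_pubmed"))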

I am wondering what the issue could be here, since I have tried everything I can think of to make this work and, as a result, I cannot create the index.

P.S. I have tried reading the data back from S3 in the Glue notebook, and the results look correct (correct columns and data types), so I have also ruled out S3 access issues from the notebook.
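
Roughly what I ran for that check (spark is the SparkSession provided by the Glue notebook; the schema in the comment is what I expect given how I wrote the data):

# spark is the SparkSession provided by the Glue notebook.
df = spark.read.parquet("s3://gtm-core-eks-uat-euc1-s3-bucket-glo-hcp-linking/test/embedding_partitions/embeddings_pubmed")
df.printSchema()  # expecting author_name_id (string) and author_name_embeddings (array<float>)
df.show(3, truncate=False)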

Any help would be greatly appreciated.
Thank you,
Athina
