With advances in storage and communication technology, large-scale image retrieval has become a pressing problem. In practice, an image encoder converts each image into a high-dimensional vector, so retrieving images at scale reduces to indexing and searching high-dimensional vectors. This project focuses on the vector search itself (the embedding dimension in this project is 512) and does not cover image encoding. Students only need to retrieve, for each query vector we provide, its matches from the large-scale vector library.

The project files are organized as follows:
├── submissions
│ └── output.csv
├── test_b
│ ├── gallery_emb.npy
│ ├── labels_5000.pkl
│ └── query_emb.npy
├── test_a
│ ├── gallery_emb.npy
│ ├── labels_500.pkl
│ └── query_emb.npy
├── evaluation.py
├── run.sh
├── search.py
└── Readme.md
- test_a (500 queries; 500,000 gallery vectors) and test_b (5,000 queries; 5,000,000 gallery vectors) have the same file structure. We initially provide the smaller dataset test_a so that students can debug their search algorithm: query_emb.npy holds the query embeddings, gallery_emb.npy holds the embeddings to be searched, and labels_500.pkl holds, for each query embedding, the indexes of the 10 embeddings in gallery_emb.npy that belong to the same group. We also provide sample search code (search.py) and evaluation code (evaluation.py).
- You only need to submit your modified search.py to 12221073@zju.edu.cn. We will score the project on both query time and P@10.
test_a data download link: https://pan.baidu.com/s/1jKLpwpE1vVodaDTsq2WL7A?pwd=tgi7 (extraction code: tgi7)
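
Once test_a is downloaded, an exact brute-force search is a useful reference point before trying faster indexes. The sketch below is not the provided search.py; it is a minimal baseline under stated assumptions: the function names (`load_embeddings`, `l2_normalize`, `brute_force_top_k`) are hypothetical, and cosine similarity is assumed as the matching criterion, which this document does not specify.

```python
import os

import numpy as np


def load_embeddings(split_dir):
    """Load query and gallery embeddings from a split directory such as test_a."""
    query = np.load(os.path.join(split_dir, "query_emb.npy"))
    gallery = np.load(os.path.join(split_dir, "gallery_emb.npy"))
    return query, gallery


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so that a dot product equals cosine similarity."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)


def brute_force_top_k(query, gallery, k=10, batch_size=64):
    """Exact top-k search; returns an array of gallery indices, shape (num_queries, k).

    Assumption: similarity is cosine similarity (not confirmed by the project text).
    """
    q = l2_normalize(np.asarray(query, dtype=np.float32))
    g = l2_normalize(np.asarray(gallery, dtype=np.float32))
    results = np.empty((q.shape[0], k), dtype=np.int64)
    # Process queries in batches so the score matrix stays small in memory.
    for start in range(0, q.shape[0], batch_size):
        scores = q[start:start + batch_size] @ g.T  # shape (batch, num_gallery)
        # argpartition selects the k best candidates without a full sort.
        part = np.argpartition(-scores, k - 1, axis=1)[:, :k]
        part_scores = np.take_along_axis(scores, part, axis=1)
        # Sort only those k candidates, best first.
        order = np.argsort(-part_scores, axis=1)
        results[start:start + batch_size] = np.take_along_axis(part, order, axis=1)
    return results
```

A call such as `brute_force_top_k(*load_embeddings("test_a"))` would return 10 gallery indexes per query. This baseline scans every gallery vector for every query, so its accuracy is exact but its query time grows linearly with the gallery size; an approximate index would trade some P@10 for speed.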
Efficiency: We measure the average time per query; lower is better.
Effectiveness: For the top-10 results that the algorithm returns for each query, we compute the precision (P@10), i.e. the fraction of those 10 results that belong to the query's group; higher is better.
Final Rank: Submissions are ranked on both the efficiency and the effectiveness of the submitted search algorithm.
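
The two metrics above can be approximated locally as follows. This is only a sketch and not the provided evaluation.py: the helper names (`load_labels`, `precision_at_k`, `average_query_time`) are hypothetical, the label file is assumed to unpickle into a sequence with one list of 10 gallery indexes per query, and `search_fn` stands for any function with the assumed signature `search_fn(query, gallery, k)`. Whether the official timing includes index construction is not stated here.

```python
import pickle
import time

import numpy as np


def load_labels(path):
    """Load ground-truth labels, e.g. test_a/labels_500.pkl.

    Assumption: the pickle holds one sequence of gallery indexes per query.
    """
    with open(path, "rb") as f:
        return pickle.load(f)


def precision_at_k(predicted, ground_truth, k=10):
    """Mean P@k: share of each query's top-k predictions found in its ground truth."""
    hits = 0
    for pred_row, true_row in zip(predicted, ground_truth):
        true_set = {int(i) for i in true_row}
        hits += sum(1 for i in pred_row[:k] if int(i) in true_set)
    return hits / (len(predicted) * k)


def average_query_time(search_fn, query, gallery, k=10):
    """Run search_fn once over all queries; return (predictions, seconds per query)."""
    start = time.perf_counter()
    predicted = search_fn(query, gallery, k)
    elapsed = time.perf_counter() - start
    return predicted, elapsed / len(query)
```

Timing a single run is noisy, so repeating the measurement a few times and comparing the results is a sensible precaution before drawing conclusions about speed.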