2022 Data Mining Course Project

Problem Description

With advances in storage and communication technology, retrieval over large-scale image data has become a pressing problem. In practice, an image encoder converts each image into a high-dimensional vector, so large-scale image retrieval becomes a high-dimensional vector search problem. This project covers only the vector search (the embedding dimension is 512) and not the image encoding, so students only need to retrieve, for each query vector we provide, its matches in the large-scale vector library.
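
As a point of reference, the retrieval task can be sketched as an exhaustive top-k search in NumPy. This is only a baseline sketch, not the provided search.py: the function name `topk_search` is hypothetical, the data is random, and cosine similarity is an assumption because the project description does not state the similarity metric.

```python
import numpy as np


def topk_search(query, gallery, k=10):
    """Return indices of the k gallery vectors most similar to each query.

    Assumes cosine similarity; the assignment does not state the metric.
    """
    # L2-normalise so that the inner product equals cosine similarity.
    q = query / np.linalg.norm(query, axis=1, keepdims=True)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    scores = q @ g.T  # shape: (num_queries, num_gallery)

    # argpartition selects the k best candidates per row without a full sort.
    cand = np.argpartition(-scores, k - 1, axis=1)[:, :k]
    # Only those k candidates are then sorted by descending score.
    cand_scores = np.take_along_axis(scores, cand, axis=1)
    order = np.argsort(-cand_scores, axis=1)
    return np.take_along_axis(cand, order, axis=1)


# Synthetic data standing in for the real embeddings (dimension 512).
rng = np.random.default_rng(0)
gallery = rng.standard_normal((1000, 512)).astype(np.float32)
# Queries are noisy copies of gallery rows 0..4, so row i should rank first.
query = gallery[:5] + 0.01 * rng.standard_normal((5, 512)).astype(np.float32)
top10 = topk_search(query, gallery, k=10)
```

Exhaustive search scores every gallery vector for every query, so its cost grows linearly with gallery size. It is useful mainly as a correctness baseline against which faster approximate methods can be compared.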

Data and File Format

├── submissions
│   └── output.csv
├── test_b
│   ├── gallery_emb.npy
│   ├── labels_5000.pkl
│   └── query_emb.npy
├── test_a
│   ├── gallery_emb.npy
│   ├── labels_500.pkl
│   └── query_emb.npy
├── evaluation.py
├── run.sh
├── search.py
└── Readme.md

test_a (500 queries; 500,000 gallery vectors) and test_b (5,000 queries; 5,000,000 gallery vectors) have the same file structure. We initially provide the smaller dataset test_a so that students can debug their search algorithm, together with the sample search code search.py and the evaluation code evaluation.py. In each dataset, query_emb.npy holds the query embeddings, gallery_emb.npy holds the embeddings to be searched, and the labels file (labels_500.pkl in test_a) gives, for each query embedding, the 10 indexes in gallery_emb.npy that belong to the same group.
Submit only your modified search.py to 12221073@zju.edu.cn. We will score the project by jointly measuring query time and P@10.
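
The files of one dataset could be loaded roughly as follows. This is a sketch under assumptions: the helper `load_split` is hypothetical, and the exact structure of the pickled labels (assumed here to hold 10 gallery indexes per query) should be checked against the real file.

```python
import pickle
from pathlib import Path

import numpy as np


def load_split(data_dir, label_file):
    """Load query embeddings, gallery embeddings and labels from one dataset."""
    data_dir = Path(data_dir)
    query = np.load(data_dir / "query_emb.npy")
    # mmap_mode="r" maps the file instead of reading it all into memory,
    # which may help with the 5,000,000-vector gallery of test_b.
    gallery = np.load(data_dir / "gallery_emb.npy", mmap_mode="r")
    with open(data_dir / label_file, "rb") as f:
        labels = pickle.load(f)  # assumed: 10 gallery indexes per query
    return query, gallery, labels


# Usage with the directory layout shown above, if the data is present.
if Path("test_a").exists():
    query, gallery, labels = load_split("test_a", "labels_500.pkl")
    print(query.shape, gallery.shape, len(labels))
```

Memory mapping is a design choice rather than a requirement. For test_a the gallery can simply be loaded in full by dropping the `mmap_mode` argument.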

test_a data download link: https://pan.baidu.com/s/1jKLpwpE1vVodaDTsq2WL7A?pwd=tgi7 (extraction code: tgi7)

Evaluation Metrics

Efficiency: We measure the average time per query; the lower, the better.
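
The average time per query could be measured along these lines. This is only an illustration and not the provided evaluation.py: `average_query_time` and the stand-in search function `first_k` are hypothetical names, and the official timing procedure may differ.

```python
import time

import numpy as np


def average_query_time(search_fn, query, gallery, k=10):
    """Run search_fn once on all queries; return (results, seconds per query)."""
    start = time.perf_counter()
    results = search_fn(query, gallery, k)
    elapsed = time.perf_counter() - start
    return results, elapsed / len(query)


def first_k(query, gallery, k):
    # Hypothetical stand-in for a real search function: returns indices 0..k-1.
    return np.tile(np.arange(k), (len(query), 1))


query = np.zeros((20, 512), dtype=np.float32)
gallery = np.zeros((100, 512), dtype=np.float32)
results, seconds_per_query = average_query_time(first_k, query, gallery)
print(f"{seconds_per_query * 1e3:.4f} ms per query")
```

`time.perf_counter` is used because it is a monotonic, high-resolution clock intended for measuring short durations.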

Effectiveness: For the top-10 results the algorithm returns for each query, we calculate the precision (P@10); the higher, the better.
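
P@10 is the fraction of the 10 returned indexes that are in the query's ground-truth group, averaged over all queries. A minimal sketch, assuming both predictions and labels are given as one sequence of gallery indexes per query (the function name `precision_at_k` is hypothetical and this is not the provided evaluation.py):

```python
import numpy as np


def precision_at_k(predicted, relevant, k=10):
    """Mean fraction of the top-k predictions that appear in the ground truth."""
    per_query = [
        len(set(pred[:k]) & set(rel)) / k
        for pred, rel in zip(predicted, relevant)
    ]
    return float(np.mean(per_query))


# Toy example: the first query is fully correct, the second has 5 of 10 hits.
predicted = [list(range(0, 10)), list(range(10, 20))]
relevant = [list(range(0, 10)), [10, 11, 12, 13, 14, 90, 91, 92, 93, 94]]
score = precision_at_k(predicted, relevant)  # (10/10 + 5/10) / 2 = 0.75
```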

Final Rank: We rank submissions by both the efficiency and the effectiveness of the submitted search algorithm.
