
How to prevent duplicate files? #1412

Closed
SacDin opened this issue Aug 5, 2020 · 14 comments

Comments


SacDin commented Aug 5, 2020

How to prevent duplicate files?
Does seaweedfs store an md5 hash of the file that can later be used to prevent duplicates while uploading?

@chrislusf (Collaborator)

md5 is not fully implemented.

Files uploaded through s3 have md5.

I will leave this issue open until md5 for the filer is implemented.
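For reference, one way to read that md5 back over the S3 API is the ETag header, which for single-part uploads is conventionally the hex md5 of the object. A minimal sketch with boto3, assuming a SeaweedFS S3 gateway at 127.0.0.1:8333 and a bucket/key that already exist (endpoint, bucket, and key are example values only):

```python
import boto3

# Assumed setup: SeaweedFS S3 gateway on 127.0.0.1:8333, bucket "photos" (hypothetical).
s3 = boto3.client(
    "s3",
    endpoint_url="http://127.0.0.1:8333",
    aws_access_key_id="any",
    aws_secret_access_key="any",
)

# HEAD the object; for single-part uploads the ETag is normally the md5 of the content.
head = s3.head_object(Bucket="photos", Key="cat.jpg")
print(head["ETag"].strip('"'))
```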


SacDin commented Aug 5, 2020

Will this only be implemented in the filer?
I mean, will this be available only when accessing files via the filer, or through the direct API as well?


chrislusf commented Aug 5, 2020 via email


SacDin commented Aug 5, 2020

The filer has its own database for storing metadata, right?
So will this md5 hash be stored in the filer db or with the file itself?

@chrislusf (Collaborator)

md5 is stored in the filer db.

Any preference?


SacDin commented Aug 5, 2020

No, it is alright. I am not using the filer currently, but will start using it afterwards for md5.


SacDin commented Aug 7, 2020

Great, will try shortly. I have one side question: is it possible to retrieve file metadata without the filer?

@chrislusf (Collaborator)

You can check files on volume servers with HTTP HEAD requests.
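For example, given a file id already stored on a volume server, a HEAD request returns the stored metadata headers without downloading the content. The host and fid below are example values only:

```python
import requests

# Example values: a volume server at 127.0.0.1:8080 holding fid "3,01637037d6".
resp = requests.head("http://127.0.0.1:8080/3,01637037d6")
print(resp.status_code)                     # 200 if the file exists, 404 otherwise
print(resp.headers.get("Content-Length"))   # stored size in bytes
print(resp.headers.get("Last-Modified"))
```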


SacDin commented Aug 12, 2020

I am confused about how to use this to prevent duplicates. For example, I have an incoming upload and I want to prevent the same file from being stored again. How do I know which path to make the HEAD request against beforehand?

Is there any API through which I can look up / search for a file by MD5 hash?

@chrislusf (Collaborator)

No. You would need to maintain an MD5 => file mapping yourself.
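A minimal sketch of such a mapping, assuming a master at 127.0.0.1:9333 and a local sqlite table as the index (the addresses, table name, and helper names are all made up for illustration): hash the incoming file first, look the digest up, and only assign a fid and upload when it is new.

```python
import hashlib
import sqlite3
import requests

MASTER = "http://127.0.0.1:9333"   # assumed master address

def file_md5(path, chunk_size=1 << 20):
    """Stream the file so large uploads do not have to fit in memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def upload_unique(path, db):
    digest = file_md5(path)
    row = db.execute("SELECT fid FROM files WHERE md5 = ?", (digest,)).fetchone()
    if row:
        return row[0]                      # duplicate: reuse the existing fid
    # ask the master for a new file id and a volume server location
    assign = requests.get(MASTER + "/dir/assign").json()
    fid, volume = assign["fid"], assign["url"]
    with open(path, "rb") as f:
        requests.post(f"http://{volume}/{fid}", files={"file": f}).raise_for_status()
    db.execute("INSERT INTO files (md5, fid) VALUES (?, ?)", (digest, fid))
    db.commit()
    return fid

db = sqlite3.connect("md5_index.db")
db.execute("CREATE TABLE IF NOT EXISTS files (md5 TEXT PRIMARY KEY, fid TEXT)")
print(upload_unique("example.jpg", db))
```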


SacDin commented Aug 12, 2020

Alright, then I will upload using FUSE and compute the md5 hash by reading the file locally myself. Is there any way to access metadata (the md5 hash created by seaweed) while using the FUSE-based local fs?

If yes, I will not have to read the file into memory and generate the md5 myself.

@chrislusf (Collaborator)

Files written by FUSE do not have an md5 because writes can happen randomly anywhere in the file. It is not efficient to always re-calculate the md5 for every update.

@CoderYellow

Could you clarify? After reading through this, it sounds like the dedup scheme requires maintaining the mapping between file digests and files yourself. Could an API be added to check whether a file already exists by md5, sha, or a similar hash? After all, requirements like instant upload by hash are quite common.


9cat commented Oct 20, 2023

The best thing would be for seaweedfs to handle all deduplication itself: keep the file structure in its original place, but store only one copy of the content.
