
How to detect if a file was actually saved? #1

Closed
cannatag opened this issue Jul 1, 2015 · 8 comments

cannatag commented Jul 1, 2015

Hi,
how can I detect whether a file was actually saved by the copy() method? If two files with different names have the same content, the hash is the same, so the first file is actually saved, while the second gets the same path as the first and the save operation is not performed. In either case the copy() method returns the address id of the file. How can I detect this? This information would be helpful for knowing the ratio of duplicate files.

Thanks,
Giovanni

dgilland commented Jul 1, 2015

Currently, you'd have to pre-check whether the file exists before calling put:

obj = ...  # path or readable object
stream = hashfs.hashfs.Stream(obj)  # if you need an iterable wrapper for object
file_id = fs.computehash(stream)

if fs.exists(file_id):
    # Handle duplicate file
    filepath = fs.idpath(file_id)
    address = hashfs.HashAddress(file_id, fs.relpath(filepath), filepath)
else:
    # Handle unique file
    address = fs.put(stream)

This is somewhat suboptimal since the file hash is calculated twice if put is used in the else block. You could replace it with fs.copy, but then you'd only get the filepath back, so if you want the HashAddress in both cases, this may work better:

file_id = fs.computehash(stream)
filepath = fs.idpath(file_id)
address = hashfs.HashAddress(file_id, fs.relpath(filepath), filepath)

if fs.exists(file_id):
    pass  # Handle duplicate file
else:
    # Handle unique file
    fs.copy(stream, file_id)

However, neither of these is ideal since you have to utilize more of the internal implementation of HashFS. I'm open to suggestions to make this less painful.
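
For completeness, a self-contained version of the second approach might look like this (just a sketch; the HashFS root directory and input file below are placeholders):

import hashfs
from hashfs.hashfs import Stream

fs = hashfs.HashFS('my-hashfs-root')  # placeholder root directory

with open('some-file.bin', 'rb') as fileobj:  # placeholder input file
    stream = Stream(fileobj)  # iterable wrapper, as above
    file_id = fs.computehash(stream)
    filepath = fs.idpath(file_id)
    address = hashfs.HashAddress(file_id, fs.relpath(filepath), filepath)

    if fs.exists(file_id):
        # Handle duplicate file
        print('duplicate:', filepath)
    else:
        # Handle unique file
        fs.copy(stream, file_id)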

cannatag commented Jul 1, 2015

Hi Derrick, thanks for your response. Would it be possible to return a tuple from the copy() method? It could return the id and a boolean indicating whether the file was actually saved. Or, if you don't want to modify the copy() method, you could add another method (something like verify_copy()) that does the check:

    def verify_copy(self, stream, id, extension=None):
        """Copy the contents of `stream` onto disk with an optional file
        extension appended. The copy process uses a temporary file to store the
        initial contents and then moves this file to its final location.

        Returns a tuple of the file path and a boolean indicating whether the
        file was actually copied (False means a file with that id already
        existed on disk).
        """
        filepath = self.idpath(id, extension)

        # Only copy the file if it doesn't already exist.
        if not os.path.isfile(filepath):
            with tmpfile(stream, self.fmode) as fname:
                self.makepath(os.path.dirname(filepath))
                shutil.copy(fname, filepath)
            return filepath, True
        else:
            return filepath, False
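
A caller could then detect whether the save actually happened, e.g. (hypothetical usage of the proposed verify_copy(), with fs and stream set up as in your earlier snippets):

file_id = fs.computehash(stream)
filepath, saved = fs.verify_copy(stream, file_id)

if not saved:
    # The content was already stored under this hash, so nothing was written.
    print('duplicate:', filepath)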

dgilland commented Jul 1, 2015

Are you using copy directly? If copy is modified to return whether the file already exists, put would need to return that as well, since put is meant to be the public API while copy is a lower-level utility.

Changing the return of put would be a breaking change, so I need to consider this carefully even though it's still pre-v1.

Another option would be to modify HashAddress to contain an attribute that indicates whether the file existed previously, but that value would only have meaning when calling put. This would be a non-breaking change.

Seems like the options are:

  1. Modify return of put/copy to indicate whether the file existed previously.
  2. Create new methods for put/copy (perhaps verify_put/verify_copy) that return whether the file existed previously.
  3. Modify HashAddress to include an attribute that indicates previous file existence.

dgilland commented Jul 1, 2015

After some more thought, I am leaning towards option 3, modifying HashAddress to include whether the file was a duplicate during a put operation.
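
Roughly, that could look something like this (just a sketch, not the final implementation; it assumes HashAddress stays a namedtuple with the existing id/relpath/abspath fields and uses is_duplicate as a placeholder name):

from collections import namedtuple


class HashAddress(namedtuple('HashAddress', ['id', 'relpath', 'abspath', 'is_duplicate'])):
    """File address returned by put(), extended with a flag that says whether
    the content already existed on disk."""

    def __new__(cls, id, relpath, abspath, is_duplicate=False):
        # Default to False so existing three-argument call sites keep working,
        # which is what makes this a non-breaking change.
        return super(HashAddress, cls).__new__(
            cls, id, relpath, abspath, is_duplicate)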

cannatag commented Jul 2, 2015

Option 3 looks like the right solution to me because it should not break the interface contract. I will change my code to use the put() method instead of copy(). May I also suggest renaming the internal methods (those that are not supposed to be called directly by external code) to _method()? That is the standard way to indicate that a method is not public.

Thanks again for your HashFS project.
Giovanni

dgilland added a commit that referenced this issue Jul 2, 2015
…urn HashAddress with "is_duplicate=True" when file with same hash already exists on disk.

Refs #1.

dgilland commented Jul 2, 2015

This has been released as v0.5.0.

dgilland closed this as completed Jul 2, 2015

dgilland commented Jul 2, 2015

BTW, the new attribute for HashAddress is is_duplicate:

address = fs.put(stream)

if address.is_duplicate:
    pass  # Handle duplicate file condition
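
So, for the original use case of tracking the ratio of duplicate files, something along these lines should work (a sketch; the HashFS root directory and the file_paths iterable are placeholders):

import hashfs

fs = hashfs.HashFS('my-hashfs-root')  # placeholder root directory

total = 0
duplicates = 0

for path in file_paths:  # placeholder iterable of files to ingest
    address = fs.put(path)
    total += 1
    if address.is_duplicate:
        duplicates += 1

if total:
    print('duplicate ratio: {:.2%}'.format(duplicates / float(total)))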

cannatag commented Jul 3, 2015

Thanks. It works.

Giovanni
