Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Work around (temporarily) wrong getsize() output #5335

Merged
merged 3 commits into from Jan 19, 2018

Conversation

Projects
None yet
4 participants
@mvdbeek
Copy link
Member

mvdbeek commented Jan 18, 2018

This is an alternative to #5316.

This consists of 2 things: In the object store code we check the file size twice if it is 0, this seems to be sufficient to work around the behaviour I described in the commit message of b7fb583.

The other thing is that we can set metadata for empty files, we don't need to special-case this IMO. It is possible though that some datatypes won't be able to set metadata, but that's not worse than not loading metadata at all.

@galaxybot galaxybot added this to the 18.01 milestone Jan 18, 2018

@nsoranzo
Copy link
Member

nsoranzo left a comment

I like this PR more than the other!

We should probably also change the empty() method above to be return self.size(obj, **kwargs) == 0

if size == 0:
# May be legitimately 0, or there may be an issue with the FS / kernel, so we try again
time.sleep(0.01)
size = os.path.getsize(self.get_filename(obj, **kwargs))

This comment has been minimized.

Copy link
@nsoranzo

nsoranzo Jan 18, 2018

Member

We can make lines 371-375 a bit more general:

                filepath = self.get_filename(obj, **kwargs)
                for _ in range(0, 2):
                    size = os.path.getsize(filepath)
                    if size != 0:
                        break
                    # May be legitimately 0, or there may be an issue with the FS / kernel, so we try again
                    time.sleep(0.01)
@nsoranzo

This comment has been minimized.

Copy link
Member

nsoranzo commented Jan 18, 2018

Does the PR need to be rebased for the selenium tests to pass?

mvdbeek added some commits Jan 18, 2018

Work around (temporarily) wrong getsize() output
I'm not sure how/why this happens, but this can be reproduced by copying a file to a location in one process, while checking the size in another process:

Run this to copy a file indefinitely:

```
import os
import shutil

open('test').write("foo")
while True:
    shutil.copy('test', 'test.file')
    os.remove('test.file')
```

In the same directory run this

```
import os

count = 0
wrong_size = False
while True:
    try:
        size = os.path.getsize('test.file')
        if size == 0:
            wrong_size = True
            wrong_count = count
        elif wrong_size and os.path.getsize('test.file') != 0:
            # We have seen the wrong size being reported, but now it's correct again
            print("Got wrong file size at %dnd try, was fixed in %dnd try" % (wrong_count, count))
            break
    except Exception:
        pass
    count += 1
```

It'll (probably) fail like this:

```
Got wrong file size at 0nd try, was fixed in 1nd try
```
or more often:
```
Got wrong file size at 31nd try, was fixed in 32nd try
```

This seems to be fixed when running os.path.getsize() another time,
so we simply do that here.
Use a for loop when checking size in object store
and adjust `empty()` to use the new code. Thanks @nsoranzo.

@mvdbeek mvdbeek force-pushed the mvdbeek:no_check_data branch from 0f250c4 to 1614675 Jan 19, 2018

@mvdbeek

This comment has been minimized.

Copy link
Member Author

mvdbeek commented Jan 19, 2018

Look like that was the case.

@nsoranzo nsoranzo merged commit c8a7b7b into galaxyproject:dev Jan 19, 2018

6 checks passed

api test Build finished. 343 tests run, 4 skipped, 0 failed.
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
framework test Build finished. 168 tests run, 0 skipped, 0 failed.
Details
integration test Build finished. 67 tests run, 0 skipped, 0 failed.
Details
selenium test Build finished. 118 tests run, 2 skipped, 0 failed.
Details
toolshed test Build finished. 577 tests run, 0 skipped, 0 failed.
Details
@nsoranzo

This comment has been minimized.

Copy link
Member

nsoranzo commented Jan 19, 2018

Thanks @mvdbeek!

@natefoo

This comment has been minimized.

Copy link
Member

natefoo commented Jan 19, 2018

We do this with chown(2) over in the job handling code, which I swear I got from some official documentation somewhere (Linux NFS client I thought) but I can't find it now. Anyway, maybe we should unify these into some sort of utility function?

@mvdbeek

This comment has been minimized.

Copy link
Member Author

mvdbeek commented Jan 19, 2018

I saw this, but it seems this isn't sufficient in all cases (I've set retry_job_output_collection to 60 in galaxy.ini, and I know this prevents "Output not returned from cluster"). I'm accessing datasets over samba (which I believe exports the mounts from an Isilon storage system via NFS ...), so that may factor in here.

Do you think we should move the getsize hack into that section ?

@natefoo

This comment has been minimized.

Copy link
Member

natefoo commented Jan 19, 2018

I guess it depends on whether or not it's as effective for NFS than the chown() hack, which I think has solved the problem for most people using NFS. But maybe there'd be no harm in doing both?

@mvdbeek mvdbeek deleted the mvdbeek:no_check_data branch Jun 12, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.