Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speed-up GitRepo.get_content_info() (fixes gh-3300) #3301

Merged
merged 14 commits into from Apr 9, 2019
Merged

Conversation

@mih
Copy link
Member

@mih mih commented Apr 8, 2019

#3300

Benchmark info:

on present master:
% time datalad status
datalad status  10.37s user 1.51s system 100% cpu 11.851 total
% datalad rev-diff -f HEAD~1 -t HEAD
datalad rev-diff -f HEAD~1 -t HEAD  11.00s user 1.79s system 108% cpu 11.751 total

this PR:
% time datalad status
datalad status  7.83s user 1.17s system 105% cpu 8.495 total
(datalad3-dev) mih@meiner ~/forrest/collection/visloc (git)-[master] % datalad rev-diff -f HEAD~1 -t HEAD
datalad rev-diff -f HEAD~1 -t HEAD  9.87s user 1.61s system 100% cpu 11.443 total

Hence no slow-down or minor speed-up, despite much more testing in the diff case. In master symlinks were not properly evaluated at all, but instead any coincidental symlink on the filesystem with the same name was being used for testing. Now every single symlink is read from Git for the given ref. On slow FS this might even be faster than os.readlink in the worktree.

Test repo has 33k symlinks to be tested.

@mih mih changed the title ENH: Speed-up GitRepo.get_content_info() (fixes gh-3300) WIP: Speed-up GitRepo.get_content_info() (fixes gh-3300) Apr 8, 2019
@mih mih mentioned this pull request Apr 8, 2019
@codecov
Copy link

@codecov codecov bot commented Apr 8, 2019

Codecov Report

Merging #3301 into master will decrease coverage by 0.01%.
The diff coverage is 97.52%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #3301      +/-   ##
==========================================
- Coverage   91.03%   91.02%   -0.02%     
==========================================
  Files         263      263              
  Lines       34189    34221      +32     
==========================================
+ Hits        31125    31148      +23     
- Misses       3064     3073       +9
Impacted Files Coverage Δ
datalad/utils.py 87.27% <100%> (+0.04%) ⬆️
datalad/tests/test_utils.py 96.87% <100%> (+0.01%) ⬆️
datalad/support/annexrepo.py 87.31% <100%> (-0.66%) ⬇️
datalad/support/gitrepo.py 89.03% <96.96%> (+0.12%) ⬆️
datalad/cmd.py 96.06% <97.36%> (+0.35%) ⬆️
datalad/downloaders/http.py 82.53% <0%> (-2.78%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update cc3a9e6...2e32093. Read the comment docs.

@mih mih changed the title WIP: Speed-up GitRepo.get_content_info() (fixes gh-3300) Speed-up GitRepo.get_content_info() (fixes gh-3300) Apr 8, 2019
@mih mih requested a review from kyleam Apr 8, 2019
@yarikoptic
Copy link
Member

@yarikoptic yarikoptic commented Apr 8, 2019

Note to myself/other: this PR also has RFing of .support.annexrepo.BatchedAnnex into a more generic cmd.BatchedAnnex

@mih
Copy link
Member Author

@mih mih commented Apr 8, 2019

Note to myself/other: this PR also has RFing of .support.annexrepo.BatchedAnnex into a more generic cmd.BatchedAnnex

Note that this RF takes out the notion of "annex". However, travis seems unhappy with it (stalled tests). SO which one is the other PR that may have done this properly already?

@yarikoptic
Copy link
Member

@yarikoptic yarikoptic commented Apr 8, 2019

I don't think we had any PR for generic BatchCommand. We did rely though on some batched runs of git done by GitPython.

@mih
Copy link
Member Author

@mih mih commented Apr 8, 2019

Damn, I thought there is an easy way out.... :(

@yarikoptic
Copy link
Member

@yarikoptic yarikoptic commented Apr 8, 2019

well, may be there is - don't they stall for you locally? seems to do for me:

(git)hopa:~datalad/datalad[enh-statusperf]git
$> python -m nose -v -s datalad/support/tests/test_*repo.py 
datalad.support.tests.test_annexrepo.test_AnnexRepo_addurl_batched_and_set_metadata ... ok
datalad.support.tests.test_annexrepo.test_AnnexRepo_addurl_to_file_batched ... 

so it is just the matter to figure out what you have changed ;-)

@yarikoptic
Copy link
Member

@yarikoptic yarikoptic commented Apr 8, 2019

FWIW here is how it hangs

[DEBUG  ] Initiating a new process for BatchedCommand(cmd=['git', 'cat-file', '--batch'], output_proc=<function>, path=None) 
[Level 5] Command: ['git', 'cat-file', '--batch'] 
[Level 5] Sending 'HEAD:about.txt\n' to batched command BatchedCommand(cmd=['git', 'cat-file', '--batch'], output_proc=<function>, path=None) 
[Level 5] Done sending. 

@yarikoptic
Copy link
Member

@yarikoptic yarikoptic commented Apr 8, 2019

to be exact on

(Pdb) p entry
'HEAD:about.txt\n'
(Pdb) l
790  	            process = self._process  # _check_process might have restarted it
791  	            process.stdin.write(assure_bytes(entry) if PY2 else entry)
792  	            process.stdin.flush()
793  	            lgr.log(5, "Done sending.")
794  	            import pdb; pdb.set_trace()
795  ->	            still_alive, stderr = self._check_process(restart=False)
796  	            # TODO: we might want to handle still_alive, e.g. to allow for
797  	            #       a number of restarts/resends, but it should be per command
798  	            #       since for some we cannot just resend the same query. But if
799  	            #       it is just a "get"er - we could resend it few times
800  	            # We are expecting a single line output
(Pdb) n
> /home/yoh/proj/datalad/datalad/datalad/cmd.py(803)__call__()
-> if not process.stdout.closed else None
(Pdb) n

so something in how git .. --batch works probably differs from how annex does

@mih
Copy link
Member Author

@mih mih commented Apr 8, 2019

Thx much! I think I know what is happening.

Copy link
Contributor

@kyleam kyleam left a comment

I'll push a few comment/doc touch-ups after the current test run. Otherwise looks good to me (Edit: sadly Travis doesn't agree).

datalad/cmd.py Outdated Show resolved Hide resolved
datalad/cmd.py Outdated Show resolved Hide resolved
datalad/support/gitrepo.py Show resolved Hide resolved
@mih
Copy link
Member Author

@mih mih commented Apr 8, 2019

Please feel free to take this PR over. Tests now run and reveal a bunch of uncovered cases. If you have any idea what they are and how to fix it, please plow ahead!

@kyleam
Copy link
Contributor

@kyleam kyleam commented Apr 8, 2019

If you have any idea what they are

Aside from the encoding issues, the failing tests seem to come down to the faster readlink approach being stricter than the previous realpath approach. Our ls-files call is reporting that there is a symlink, but if it changed in the working tree (either deleted or a regular, unlocked file), readlink errors.

and how to fix it

Hmm, I'm not coming up with anything that wouldn't just put us close to where we started with realpath.

@mih
Copy link
Member Author

@mih mih commented Apr 8, 2019

Hmm, I'm not coming up with anything that wouldn't just put us close to where we started with realpath.

Thx for the assessment. Will think about it and see what can be done.

kyleam added 3 commits Apr 8, 2019
The new BatchedCommand, unlike BatchedAnnex, isn't specific for annex
commands.
BatchedCommand may be given any command, and the command may output
more than a single line for each input line.  For example, the number
of lines output by 'git cat-file --batch' depends on the content of
the objects passed in.
It's tempting to say text_type(pathobj) and hope for it to do the
right thing, but that will fail.

    python -c "from pathlib2 import Path; print(unicode(Path(u'β')))"
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
    UnicodeDecodeError: 'ascii' codec can't decode [...]

Even if a unicode string is passed in, pathlib2 stores it as bytes and
doesn't define a __unicode__ method.  Define a __unicode__ method for
our pathlib2 objects so that we can call text_type() on them and get
back unicode.
kyleam added 2 commits Apr 8, 2019
Otherwise this will try to encode the path, which can lead to a
UnicodeEncodeError.
Using readlink rather than realpath is faster (dataladgh-3300), but readlink
will fail when the symlink reported by ls-files is no longer in the
working tree.  This can happen if the annex file is removed or
unlocked.
@kyleam
Copy link
Contributor

@kyleam kyleam commented Apr 8, 2019

Hmm, I'm not coming up with anything that wouldn't just put us close to where we started with realpath.

Thx for the assessment. Will think about it and see what can be done.

I've adjusted the code to call readlink and, if it errors, fall back to realpath.

Copy link
Member

@yarikoptic yarikoptic left a comment

Try/except or a persistent over with later aguaranteed clean up?

datalad/support/gitrepo.py Outdated Show resolved Hide resolved
@mih
Copy link
Member Author

@mih mih commented Apr 9, 2019

I've adjusted the code to call readlink and, if it errors, fall back to realpath
Thx. This seems to be the optimal choice.

@mih
Copy link
Member Author

@mih mih commented Apr 9, 2019

@kyleam Addressed @yarikoptic advice. Test are running.

@mih
Copy link
Member Author

@mih mih commented Apr 9, 2019

All tests are happy. Thx much for your contributions!

@mih mih merged commit e52de4c into datalad:master Apr 9, 2019
5 checks passed
@mih mih deleted the enh-statusperf branch Apr 9, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

3 participants