Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BF+OPT: config - compare full stat records, for "gone" files too, do not bother with 2 sec rule #5276

Merged
merged 3 commits into from Jan 6, 2021

Conversation

yarikoptic
Copy link
Member

See individual commits for more details/reasoning.

If tests pass (I have not run them locally in full, only local interesting ones) -- I would consider myself done with it, although ideally at least the first commit (actual BF) should ideally be adopted for maint as well.

Closes #4644
most likely it could also "close" #4363 but I failed to reproduce it locally even after reverting #4365 workaround. Ideally someone might want to go through those explicit .reloads we have in the tests and kill them

Weird and not yet explained behavior on crippled fs does happen where we do
get mtime "stuck" in the past (saw even 30 seconds) on stating .git/config,
see datalad#4644 (comment) .

For current tests failures at hand, probably should have been sufficient to
just add collection of file sizes, and reacting on those changes as well.

But in some cases I see value to react even on atime change (e.g. file was
made non-readable), so why not to compare that as well?

To not bother subselecting which stat to worry about, I have decided to collect/compare
all stats.  I do not think performance impact would be noticeable

	In [9]: mtime = p.stat().st_mtime

	In [10]: stat = p.stat()

	In [11]: %timeit p.stat().st_mtime == mtime
	1.57 µs ± 5.92 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

	In [12]: %timeit p.stat() == stat
	1.45 µs ± 11.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

but we might react on more possibly valid cases when to reload a config.

Also, part of the change is not to go/compare only "existing" paths, but all known.
If path disappeared, and did exist before (we had non-None stat) -- we must reload as well.
Upon a quick test (added, and passes under crippled fs) and adding a
printout of two curstats:

	CURSTATS: {PosixPath('/home/yoh/.tmp/datalad-fs-pDjNG/datalad_temp_test_external_modification_5u13n85/.git/config'): os.stat_result(st_mode=33216, st_ino=460513, st_dev=1828, st_nlink=1, st_uid=47521, st_gid=0, st_size=152, st_atime=1609822800, st_mtime=1609875218, st_ctime=1609875219), PosixPath('/etc/gitconfig'): os.stat_result(st_mode=33188, st_ino=19531808, st_dev=66309, st_nlink=1, st_uid=0, st_gid=0, st_size=126, st_atime=1609866523, st_mtime=1608746390, st_ctime=1608746390), PosixPath('/home/yoh/.tmp/datalad-fs-pDjNG/datalad_temp_0q9iypp1/.gitconfig'): os.stat_result(st_mode=33216, st_ino=460477, st_dev=1828, st_nlink=1, st_uid=47521, st_gid=0, st_size=56, st_atime=1609822800, st_mtime=1609875218, st_ctime=1609875219)}
	CURSTATS: {PosixPath('/home/yoh/.tmp/datalad-fs-pDjNG/datalad_temp_test_external_modification_5u13n85/.git/config'): os.stat_result(st_mode=33216, st_ino=460514, st_dev=1828, st_nlink=1, st_uid=47521, st_gid=0, st_size=152, st_atime=1609822800, st_mtime=1609875218, st_ctime=1609875219), PosixPath('/etc/gitconfig'): os.stat_result(st_mode=33188, st_ino=19531808, st_dev=66309, st_nlink=1, st_uid=0, st_gid=0, st_size=126, st_atime=1609866523, st_mtime=1608746390, st_ctime=1608746390), PosixPath('/home/yoh/.tmp/datalad-fs-pDjNG/datalad_temp_0q9iypp1/.gitconfig'): os.stat_result(st_mode=33216, st_ino=460477, st_dev=1828, st_nlink=1, st_uid=47521, st_gid=0, st_size=56, st_atime=1609822800, st_mtime=1609875218, st_ctime=1609875219)}

you can see that st_inode is changed upon modification!  So git is likely not
modifying in place but creates a new file upon introducing changes!  So we do
not need to bother with comparing times for this matter really.  I have left
comparing of full stats nevertheless since it would provide the most robust
handling (may be on some file systems it would be in place modification?)
@yarikoptic yarikoptic added the merge-if-ok OP considers this work done, and requests review/merge label Jan 5, 2021
@yarikoptic
Copy link
Member Author

FTR: no notable impact on benchmarks besides possible improvement to usecases.study_forrest.time_make_studyforrest_mockup
     [d996ec35]       [bac8182c]
     <bm/merge-target>       <bm/pr>   
       1.27±0.02s       1.18±0.04s     0.93  api.supers.time_createadd
       1.26±0.02s       1.19±0.02s     0.95  api.supers.time_createadd_to_dataset
           failed           failed      n/a  api.supers.time_diff
           failed           failed      n/a  api.supers.time_diff_recursive
       6.06±0.05s       5.60±0.01s     0.92  api.supers.time_installr
          234±6ms          233±7ms     1.00  api.supers.time_ls
       2.16±0.02s       2.08±0.03s     0.96  api.supers.time_ls_recursive
       2.35±0.02s       2.32±0.03s     0.99  api.supers.time_ls_recursive_long_all
           failed           failed      n/a  api.supers.time_remove
       1.28±0.03s       1.26±0.02s     0.99  api.supers.time_status
       2.16±0.07s       2.24±0.01s     1.04  api.supers.time_status_recursive
          101±3ms          105±2ms     1.03  api.supers.time_subdatasets
         410±20ms          441±7ms     1.07  api.supers.time_subdatasets_recursive
         99.6±5ms          101±4ms     1.02  api.supers.time_subdatasets_recursive_first
        3.87±0.2s       3.92±0.02s     1.01  api.supers.time_uninstall
         648±20ms         641±20ms     0.99  api.testds.time_create_test_dataset1
        3.86±0.1s       3.80±0.02s     0.98  api.testds.time_create_test_dataset2x2
         847±30ms         874±10ms     1.03  core.startup.time_help_np
          160±4ms          153±3ms     0.95  core.startup.time_import
          559±7ms         548±10ms     0.98  core.startup.time_import_api
       3.20±0.1ms       3.06±0.1ms     0.96  core.witlessrunner.time_echo
       3.83±0.5ms       3.60±0.2ms     0.94  core.witlessrunner.time_echo_gitrunner
      3.58±0.08ms      3.57±0.06ms     1.00  core.witlessrunner.time_echo_gitrunner_fullcapture
          502±8ms          491±5ms     0.98  plugins.addurls.addurls1.time_addurls('*')
       2.09±0.02s       2.11±0.02s     1.01  plugins.addurls.addurls1.time_addurls(None)
         33.1±1ms         34.0±1ms     1.03  repo.gitrepo.time_get_content_info
       7.31±0.1ms       7.44±0.2ms     1.02  support.path.get_parent_paths.time_allsubmods_toplevel
       6.57±0.1ms      6.39±0.08ms     0.97  support.path.get_parent_paths.time_allsubmods_toplevel_only
         341±10ns         350±10ns     1.03  support.path.get_parent_paths.time_no_submods
      5.52±0.06ms      5.63±0.05ms     1.02  support.path.get_parent_paths.time_one_submod_subdir
       5.43±0.3ms       5.59±0.2ms     1.03  support.path.get_parent_paths.time_one_submod_toplevel
       33.1±0.05s       29.8±0.09s    ~0.90  usecases.study_forrest.time_make_studyforrest_mockup

@yarikoptic yarikoptic added the performance Improve performance of an existing feature label Jan 5, 2021
@codecov
Copy link

codecov bot commented Jan 5, 2021

Codecov Report

Merging #5276 (0ead554) into master (d996ec3) will increase coverage by 0.03%.
The diff coverage is 96.87%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #5276      +/-   ##
==========================================
+ Coverage   90.26%   90.30%   +0.03%     
==========================================
  Files         297      297              
  Lines       41632    41657      +25     
==========================================
+ Hits        37580    37617      +37     
+ Misses       4052     4040      -12     
Impacted Files Coverage Δ
datalad/config.py 96.54% <93.75%> (-0.19%) ⬇️
datalad/tests/test_config.py 100.00% <100.00%> (ø)
datalad/downloaders/base.py 81.05% <0.00%> (+0.35%) ⬆️
datalad/downloaders/tests/test_http.py 91.44% <0.00%> (+2.85%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d996ec3...0ead554. Read the comment docs.

Copy link
Contributor

@kyleam kyleam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think using st_atime is a good idea. .git/config's will change with pretty much any call to git.

@yarikoptic
Copy link
Member Author

I don't think using st_atime is a good idea. .git/config's will change with pretty much any call to git.

damn, you are right!
I am using relatime or noatime mount options typically, but it indeed makes little sense generally. stat's attributes are read-only so can't easily reset only that one. I guess we are doomed to try to have a selection as a tuple... will push a fix for that shortly. Thanks!

… file being changed

Since as @kyleam noted, st_atime is likely to nearly always (unless filesystem mount
is configured to not do that) when file is read by   git config call.

I have selected those since they all could potentially reflect having file
or access to file (atime) changed.
@yarikoptic
Copy link
Member Author

done @kyleam - if anything, got better according to benchmarks ;)

       before           after         ratio
     [d996ec35]       [bf0a9ffb]
     <bm/merge-target>       <bm/pr>   
       1.13±0.01s       1.09±0.02s     0.97  api.supers.time_createadd
       1.10±0.02s       1.05±0.02s     0.96  api.supers.time_createadd_to_dataset
           failed           failed      n/a  api.supers.time_diff
           failed           failed      n/a  api.supers.time_diff_recursive
       5.42±0.01s          5.17±0s     0.95  api.supers.time_installr
         209±10ms         214±10ms     1.03  api.supers.time_ls
       1.95±0.02s       1.93±0.03s     0.99  api.supers.time_ls_recursive
       2.11±0.02s       2.12±0.02s     1.01  api.supers.time_ls_recursive_long_all
           failed           failed      n/a  api.supers.time_remove
       1.16±0.02s       1.15±0.04s     1.00  api.supers.time_status
       2.08±0.05s       2.05±0.03s     0.98  api.supers.time_status_recursive
         92.9±5ms         92.6±2ms     1.00  api.supers.time_subdatasets
          415±7ms         402±10ms     0.97  api.supers.time_subdatasets_recursive
         98.5±3ms         91.6±2ms     0.93  api.supers.time_subdatasets_recursive_first
        3.77±0.1s       3.63±0.04s     0.96  api.supers.time_uninstall
         594±20ms          585±8ms     0.98  api.testds.time_create_test_dataset1
       3.61±0.04s       3.50±0.01s     0.97  api.testds.time_create_test_dataset2x2
         877±30ms         841±10ms     0.96  core.startup.time_help_np
          144±8ms          141±7ms     0.98  core.startup.time_import
          524±7ms          537±6ms     1.03  core.startup.time_import_api
      2.70±0.07ms       2.76±0.1ms     1.02  core.witlessrunner.time_echo
       3.16±0.1ms       3.21±0.1ms     1.02  core.witlessrunner.time_echo_gitrunner
       3.30±0.2ms       3.25±0.1ms     0.99  core.witlessrunner.time_echo_gitrunner_fullcapture
         465±20ms         474±10ms     1.02  plugins.addurls.addurls1.time_addurls('*')
       1.90±0.05s       1.90±0.04s     1.00  plugins.addurls.addurls1.time_addurls(None)
         31.6±1ms         29.5±2ms     0.94  repo.gitrepo.time_get_content_info
       7.47±0.5ms       6.77±0.4ms    ~0.91  support.path.get_parent_paths.time_allsubmods_toplevel
       6.43±0.4ms       6.25±0.3ms     0.97  support.path.get_parent_paths.time_allsubmods_toplevel_only
         349±10ns         329±30ns     0.94  support.path.get_parent_paths.time_no_submods
       5.69±0.3ms       5.31±0.4ms     0.93  support.path.get_parent_paths.time_one_submod_subdir
       5.86±0.2ms       5.02±0.4ms    ~0.86  support.path.get_parent_paths.time_one_submod_toplevel
        30.1±0.3s        27.5±0.2s     0.91  usecases.study_forrest.time_make_studyforrest_mockup

Copy link
Contributor

@kyleam kyleam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The update looks fine to me, thanks.

# Selection of os.stat_result fields we care to collect/compare to judge
# on either file has changed to warrant reload of configuration.
# We cannot just take a full record since st_atime could potentially
# change on every git config call.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's understating it a bit:

git init
stat --format=%x .git/config
sleep 2
git ls-files
stat --format=%x .git/config

# 2021-01-06 11:58:33.085193194 -0500
# 2021-01-06 11:58:35.093220679 -0500
Suggested change
# change on every git config call.
# change on every git call.

@@ -49,6 +51,11 @@
modification. This can be disable to make multiple sequential
modifications slightly more efficient.""".lstrip()

# Selection of os.stat_result fields we care to collect/compare to judge
# on either file has changed to warrant reload of configuration.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# on either file has changed to warrant reload of configuration.
# on whether file has changed to warrant reload of configuration.

@kyleam
Copy link
Contributor

kyleam commented Jan 6, 2021

FWIW I think we should avoid a quick-trigger merge here and give others time to review.

# on either file has changed to warrant reload of configuration.
# We cannot just take a full record since st_atime could potentially
# change on every git config call.
_stat_result = namedtuple('_stat_result', 'st_ino st_size st_ctime st_mtime')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[ just some thoughts; I'm okay with this as is ]

While I like namedtuples and they are generally cheap, I don't see any spot where we actually access the tuple and to my eyes defining this doesn't improve readability over using a plain tuple and having the above comment to _get_stats. (In fact, I think having that comment local in _get_stats would be an improvement.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed with removal of use of .st_mtime for time comparison there is more sense to just use a tuple. And I kinda both like and dislike namedtuples: indeed adds a bit of clutter and need to keep assignment order in sync with names. But I did find having a namedtuple (over a regular tuple) much better during debugging whenever I simply see what is that [n]th element instead of constantly looking up and doing visual matching for an indexed entry. so I would prefer to keep using namedtuple (did timing -- should be the same for our purposes).

Re comment placement, I have no strong feeling, but typically I prefer description on the purpose of the variable/type by its definition in such cases. But indeed it might be "too much of information" for this purpose -- I will just strip away 2nd half of the comment (on why not a full one).

# we have read files before
# check if any file we read from has changed
current_time = time()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was the last use of time in the module, so the import could be dropped.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, indeed!

@yarikoptic
Copy link
Member Author

yarikoptic commented Jan 6, 2021

Thanks @kyleam for rereview & approval. I have done that minor tune up in 4ff5951 and to minimize CI utilization, will merge into master locally without going through PR/CI testing round -- those who import time from datalad.config should suffer anyways! ;)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
merge-if-ok OP considers this work done, and requests review/merge performance Improve performance of an existing feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

test_local_url_with_fetch: fails on crippled fs
2 participants