
Use regular dicts instead of OrderedDicts for speedier operations #7174

Merged — 3 commits merged into datalad:master from bf-6566 on Nov 15, 2022

Conversation

@adswa (Member) commented on Nov 14, 2022

This is an attempt to fix #6566, which calls for replacing OrderedDicts with regular dicts to improve performance. There was only one instance, in status(), where OrderedDict-specific functionality was used; I used a patch cooked up by @yarikoptic in #6566 as an alternative.
Let's check whether the CI is happy with this change or whether it breaks something.
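
For context, an illustrative sketch (not the actual patch from #6566): since Python 3.7 plain dicts preserve insertion order, so OrderedDict is mainly needed for its extra API, e.g. move_to_end() or order-sensitive equality. The hypothetical helper move_key_to_end below shows how such behaviour can be emulated with a plain dict, alongside a rough construction-time comparison:

    from collections import OrderedDict
    from timeit import timeit

    def move_key_to_end(d: dict, key):
        """Emulate OrderedDict.move_to_end(key) with a plain dict (hypothetical helper)."""
        d[key] = d.pop(key)  # removing and re-inserting places the key last

    d = {"a": 1, "b": 2, "c": 3}
    move_key_to_end(d, "a")
    assert list(d) == ["b", "c", "a"]   # plain dicts keep insertion order (Python >= 3.7)
    assert d == OrderedDict(d)          # equality against a plain dict ignores order

    setup = "items = [(str(i), i) for i in range(1000)]"
    t_dict = timeit("dict(items)", setup=setup, number=10_000)
    t_odict = timeit("OrderedDict(items)",
                     setup="from collections import OrderedDict; " + setup,
                     number=10_000)
    print(f"dict:        {t_dict:.3f}s")
    print(f"OrderedDict: {t_odict:.3f}s")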

@adswa added the semver-performance label (Changes only improve performance, no impact on version) on Nov 14, 2022
@adswa linked an issue on Nov 14, 2022 that may be closed by this pull request

@adswa (Member, Author) commented on Nov 14, 2022

There is an Appveyor failure that I feel I have seen before, but I can't find it in recently opened PRs:

================================== FAILURES ===================================
____________________________ test_gracefull_death _____________________________
[gw0] win32 -- Python 3.9.13 C:\Python39-x64\python.exe
    def test_gracefull_death():
    
        def assert_provides_and_raises(pc, exception, target=None):
            """Helper to get all results before exception is raised"""
            results = []
            with assert_raises(exception):
                for r in pc:
                    results.append(r)
            # results should be sorted since we do not guarantee order
            results = sorted(results)
            if target is not None:
                assert_equal(results, target)
            return results
    
        def interrupted_producer():
            yield 1
            raise ValueError()
    
        def consumer(i):
            sleep(0.001)
            yield i
    
        assert_provides_and_raises(
            ProducerConsumer(interrupted_producer(), consumer, jobs=3), ValueError, [1])
    
        def faulty_consumer(i):
            sleep(0.001)
            if i == 1:
                raise ValueError()
            return i
    
        # so we do not get failed, but other parallel ones finish their job
        results = assert_provides_and_raises(
            ProducerConsumer(range(1000), faulty_consumer, jobs=5), ValueError)
        # and analysis of futures to raise an exception can take some time etc, so
        # we could get more, but for sure we should not get all 999 and not even a 100
        if info_log_level:
            assert_greater(100, len(results))
        assert_equal(results[:4], [0, 2, 3, 4])
    
        def producer():
            for i in range(10):
                sleep(0.0003)
                yield i
            raise ValueError()
        # by default we do not stop upon producer failing
        assert_provides_and_raises(
            ProducerConsumer(producer(), consumer, jobs=2), ValueError, list(range(10)))
        # if producer produces more than we can as quickly consume but then fails
        # ATM we do not proceed to consume other items, but fail when we finish
        # consuming until the time point when producer has failed
        # by default we do not stop upon producer failing
        results = assert_provides_and_raises(
            ProducerConsumer(producer(), consumer, reraise_immediately=True, jobs=2),
            ValueError)
        # we will get some results, seems around 4 and they should be "sequential"
        assert_equal(results, list(range(len(results))))
        try:
            assert_greater_equal(len(results), 2)
        except AssertionError:
            # Possible TODO: if tests below would start failing too, move xfail to the level
            # of the entire test
            pytest.xfail(f"Rarely but happens. Got only {len(results)} instead of at least 2")
    
        # This test relies too much on threads scheduling to not hog up on handling
        # consumers, but if it happens so - they might actually consume all results
        # before producer decides to finally raise an exception.  As such it remains
        # flaky and thus not ran, but could be useful to test locally while
        # changing that logic.
        #
        # if info_log_level and not (on_windows or on_osx):
        #     # consumers should not be able to consume all produced items.
        #     # production of 10 should take 3 unites, while consumers 10/2 (jobs)
        #     # 5 units, so some should not have a chance.
        #     assert_greater_equal(8, len(results))
    
        # Simulate situation close to what we have when outside code consumes
        # some yielded results and then "looses interest" (on_failure="error").
        # In this case we should still exit gracefully (no GeneratorExit warnings),
        # not over-produce, and also do not kill already running consumers
        consumed = []
        def inner():
            def consumer(i):
                sleep(0.01)
                consumed.append(i)
                return i
            pc = iter(ProducerConsumer(range(1000), consumer, jobs=2))
            yield next(pc)
            yield next(pc)
        # typically it should be [0, 1] but it does happen some times that
        # one other worker gets ahead and we get [0, 2]. As it is not per se the
        # purpose of this test to ensure absence of such race, we just allow for any
        # two from first 3 possible.
        assert len(set(inner()).intersection({0, 1, 2})) == 2
        consumed = sorted(consumed)
        assert_equal(consumed, list(range(len(consumed))))
>       assert_greater_equal(len(consumed), 4)  # we should wait for that 2nd batch to finish
..\datalad\support\tests\test_parallel.py:236: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
first = 3, second = 4, msg = None
    def assert_greater_equal(first, second, msg=None):
        if msg is None:
>           assert first >= second
E           assert 3 >= 4
..\datalad\tests\utils_pytest.py:125: AssertionError

I've restarted the failed run to see if it persists.

Edit: It passed in a rerun

@codecov (bot) commented on Nov 14, 2022

Codecov Report

Base: 88.56% // Head: 90.78% // Increases project coverage by +2.22% 🎉

Coverage data is based on head (87995e1) compared to base (5cdda33).
Patch coverage: 100.00% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #7174      +/-   ##
==========================================
+ Coverage   88.56%   90.78%   +2.22%     
==========================================
  Files         325      325              
  Lines       44144    44130      -14     
  Branches     5863     5863              
==========================================
+ Hits        39095    40064     +969     
+ Misses       5034     4051     -983     
  Partials       15       15              
Impacted Files Coverage Δ
datalad/core/distributed/push.py 91.18% <100.00%> (-0.04%) ⬇️
datalad/core/local/diff.py 90.29% <100.00%> (-0.10%) ⬇️
datalad/core/local/status.py 94.19% <100.00%> (+0.03%) ⬆️
datalad/customremotes/archives.py 90.25% <100.00%> (-0.05%) ⬇️
datalad/downloaders/credentials.py 84.11% <100.00%> (-0.08%) ⬇️
datalad/downloaders/providers.py 95.31% <100.00%> (-0.02%) ⬇️
datalad/interface/base.py 95.33% <100.00%> (-0.03%) ⬇️
datalad/local/wtf.py 77.59% <100.00%> (-0.08%) ⬇️
datalad/support/annexrepo.py 90.41% <100.00%> (-0.01%) ⬇️
datalad/support/gitrepo.py 92.70% <100.00%> (-0.01%) ⬇️
... and 13 more

☔ View full report at Codecov.

@yarikoptic added the CHANGELOG-missing label (When a PR's description does not contain a changelog item, yet) on Nov 15, 2022

@yarikoptic (Member) left a comment:

looks good, thank you!

@github-actions (bot) removed the CHANGELOG-missing label on Nov 15, 2022
@codeclimate (bot) commented on Nov 15, 2022

Code Climate has analyzed commit 87995e1 and detected 3 issues on this pull request.

Here's the issue category breakdown:

Category Count
Security 3

View more on Code Climate.

@adswa merged commit 328e28f into datalad:master on Nov 15, 2022
@adswa deleted the bf-6566 branch on November 15, 2022 at 16:23
@yarikoptic-gitmate (Collaborator) commented:

PR released in 0.18.0

Labels: semver-performance (Changes only improve performance, no impact on version)
Projects: None yet
Development: Successfully merging this pull request may close these issues: Use dict instead of OrderedDict
3 participants