Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix few tsan issues #16710

Merged
merged 23 commits into from
Aug 10, 2022
Merged

Fix few tsan issues #16710

merged 23 commits into from
Aug 10, 2022

Conversation

MBkkt
Copy link
Contributor

@MBkkt MBkkt commented Aug 4, 2022

Scope & Purpose

  • 💩 Bugfix
  • 🍕 New feature
  • 🔥 Performance improvement
  • 🔨 Refactoring/simplification

Checklist

  • Tests
    • Regression tests
    • C++ Unit tests
    • integration tests
    • resilience tests
  • 📖 CHANGELOG entry made
  • 📚 documentation written (release notes, API changes, ...)
  • Backports
    • Backport for 3.10: (Please link PR)
    • Backport for 3.9: (Please link PR)
    • Backport for 3.8: (Please link PR)

Related Information

(Please reference tickets / specification / other PRs etc)

  • Docs PR:
  • Enterprise PR:
  • GitHub issue / Jira ticket:
  • Design document:

@MBkkt MBkkt force-pushed the bugfix/tsan-issue-query branch 2 times, most recently from fe7bc07 to 1a23c28 Compare August 4, 2022 11:37
@gnusi gnusi added the TSan label Aug 4, 2022
@mpoeter mpoeter self-requested a review August 4, 2022 13:31
@MBkkt MBkkt requested a review from jsteemann August 4, 2022 17:33
threads.emplace_back([&driver, initialObserver, initialState, iters,
threadSeed = gen(), result = &results[thrIdx]] {
// the random device is thread_local
RandomGenerator::initialize(RandomGenerator::RandomType::MERSENNE);
Copy link
Contributor Author

@MBkkt MBkkt Aug 4, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RandomGenerator::initialize reset thread_local ptr to null and global enum type to mersenne

So this code no make sense, because

  1. we have data race on type and we can init it before threads creation.
  2. thread_local ptr initialized by null on thread creation, this thread created right here

@MBkkt MBkkt marked this pull request as ready for review August 4, 2022 18:38
@MBkkt MBkkt requested a review from a team as a code owner August 4, 2022 18:38
@@ -27,6 +27,7 @@
#include "Aql/AqlTransaction.h"
#include "Aql/ExecutionEngine.h"
#include "Aql/Timing.h"
#include "Aql/QueryCache.h"
Copy link
Contributor Author

@MBkkt MBkkt Aug 4, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some magic (don't understand how it works before), but I can't build without it (incomplete type), change only shared_ptr creation

@@ -62,6 +62,8 @@ enum class TraversalProfileLevel : uint8_t {
struct QueryOptions {
QueryOptions();
explicit QueryOptions(arangodb::velocypack::Slice);
QueryOptions(QueryOptions&&) noexcept = default;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It moved in some place but didn't have move :(

Comment on lines 41 to 47
# TODO Fix data race in arangodump
race:arangodump

# TODO Fix data race in arangorestore
race:arangorestore
Copy link
Contributor Author

@MBkkt MBkkt Aug 4, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A lot of races in random code

deadlock:consensus::Agent::setPersistedState

# TODO Fix data race in arangodbtests
race:DummyConnection::sendRequest
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's stupid race, difficult to fix because need to correct tests
Now them written bad

thread:CacheManagerFeature::start

# TODO Fix lock order inversion
deadlock:consensus::Agent::setPersistedState
Copy link
Contributor Author

@MBkkt MBkkt Aug 4, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be fixed by cluster team.
"Agent call State" but sometimes "State call Agent"

Comment on lines +31 to +41
# TODO Fix known thread leaks
thread:ClusterFeature::startHeartbeatThread
thread:CacheManagerFeature::start
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be fixed by cluster team
Also not critical them leaked after stop application

@MBkkt MBkkt force-pushed the bugfix/tsan-issue-query branch 3 times, most recently from 4df7985 to 7bc64a2 Compare August 6, 2022 12:03
Comment on lines -171 to -173
std::mutex m;
std::condition_variable cv;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is bug. What if timeout and thenFinal callback still not called

FutureStatus wait_until(
const std::chrono::time_point<Clock, Duration>& timeout_time) {
detail::waitImpl(*this, timeout_time);
return FutureStatus::Ready;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice)

Comment on lines -317 to -331
/// waits for the result, returns if it is not available
/// for the specified timeout duration. Future must be valid
template<class Rep, class Period>
FutureStatus wait_for(
const std::chrono::duration<Rep, Period>& timeout_duration) {
return wait_until(std::chrono::steady_clock::now() + timeout_duration);
}

/// waits for the result, returns if it is not available until
/// specified time point. Future must be valid
template<class Clock, class Duration>
FutureStatus wait_until(
const std::chrono::time_point<Clock, Duration>& timeout_time) {
detail::waitImpl(*this, timeout_time);
return FutureStatus::Ready;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

used only in itself tests

@@ -51,7 +51,7 @@ static void tryToConnectExpectFailure(f::EventLoopService& eventLoopService,
[&](f::Error, std::unique_ptr<f::Request>,
std::unique_ptr<f::Response>) {});

auto success = wg.wait_for(std::chrono::seconds(5));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

failed locally for vst

@MBkkt MBkkt requested a review from gnusi August 8, 2022 07:47
@MBkkt MBkkt changed the title Bugfix/tsan issue query Fix few tsan issues Aug 8, 2022
js/client/modules/@arangodb/testutils/test-utils.js Outdated Show resolved Hide resolved
arangod/Aql/Query.cpp Outdated Show resolved Hide resolved
arangod/Aql/ClusterQuery.cpp Outdated Show resolved Hide resolved
@MBkkt MBkkt requested a review from mpoeter August 8, 2022 12:15
@MBkkt MBkkt requested a review from mpoeter August 8, 2022 13:42
Comment on lines 75 to 78
try {
_traversers.clear();
} catch (...) {
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO this should be kept in the ClusterQuery dtor as before (then we can also keep the traversers private).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

arangod/Aql/Query.cpp Show resolved Hide resolved
tsan_arangodb_suppressions.txt Show resolved Hide resolved
@MBkkt
Copy link
Contributor Author

MBkkt commented Aug 9, 2022

@jsteemann jsteemann merged commit af11ab0 into devel Aug 10, 2022
@jsteemann jsteemann deleted the bugfix/tsan-issue-query branch August 10, 2022 09:59
MBkkt added a commit that referenced this pull request Aug 11, 2022
(cherry picked from commit af11ab0)
KVS85 added a commit that referenced this pull request Aug 11, 2022
* Fix few tsan issues (#16710)

(cherry picked from commit af11ab0)

* Fix data race on query vptr

* Update arangod/Aql/Query.h

Co-authored-by: Jan <jsteemann@users.noreply.github.com>

* Apply suggestions from code review

* Destroy query before destroy traverses

Co-authored-by: Jan <jsteemann@users.noreply.github.com>
Co-authored-by: Vadim Kondratyev <vadim@arangodb.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants