Dirty page checking and snapshot diffs #96

Merged · 16 commits · Jun 1, 2021
Conversation

@Shillaker (Collaborator) commented May 21, 2021

The aims of this PR are:

  • Return diffs from the workers back to the master after executing a batch of threads. This means tracking dirty pages on worker hosts when executing a batch of threads, then returning the resulting memory diff back to the master when finished.
  • Push diffs from the master host to workers when executing subsequent batches of threads (i.e. push the whole snapshot to the host when executing the first batch, then only push diffs after that). This means tracking dirty pages on the master between executing batches of threads.

Changes:

  • Add pushing diffs between hosts. This is either pushing diffs on their own (master -> worker), or pushing diffs along with a thread result (worker -> master). See notes below on this.
  • Add dirty page checking based on soft-dirty PTEs (see the sketch after this list).
  • Add a snapshot() method to executor subclasses so that Faabric can request a pointer to their memory (which it then uses to work out dirty pages).
  • Convert both snapshot push operations (full and diffs) to be synchronous, i.e. await a response, as the master host needs to know the snapshot data has been pushed and set up before sending function messages.
  • Add test fixtures where possible to reduce the verbosity of tests.
  • Make the logger an instance variable where possible to avoid calling getLogger all over the place.
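
For context on the soft-dirty PTE approach mentioned above, here is a minimal, self-contained sketch of the kernel interface it relies on: writing "4" to /proc/self/clear_refs clears the soft-dirty bits, after which bit 55 of each 64-bit entry in /proc/self/pagemap reports whether that page has been written. Function names here are illustrative, not the actual faabric::util API (which exposes getDirtyPages):

#include <cstdint>
#include <cstdio>
#include <stdexcept>
#include <unistd.h>
#include <vector>

// Clear the soft-dirty bits for this process; the kernel then re-marks a
// page as soft-dirty on its next write
void resetDirtyTracking()
{
    FILE* fd = fopen("/proc/self/clear_refs", "w");
    if (fd == nullptr) {
        throw std::runtime_error("Failed to open clear_refs");
    }

    char value[] = "4";
    if (fwrite(value, sizeof(char), 1, fd) != 1) {
        fclose(fd);
        throw std::runtime_error("Failed to write to clear_refs");
    }

    fclose(fd);
}

// Return one dirty flag per page in the given region; bit 55 of each
// 64-bit pagemap entry is the soft-dirty bit
std::vector<bool> readDirtyFlags(const uint8_t* ptr, size_t nPages)
{
    long pageSize = sysconf(_SC_PAGESIZE);

    FILE* fd = fopen("/proc/self/pagemap", "rb");
    if (fd == nullptr) {
        throw std::runtime_error("Failed to open pagemap");
    }

    // Pagemap holds one 64-bit entry per virtual page, indexed by page number
    uintptr_t firstPage = reinterpret_cast<uintptr_t>(ptr) / pageSize;
    fseek(fd, static_cast<long>(firstPage * sizeof(uint64_t)), SEEK_SET);

    std::vector<uint64_t> entries(nPages);
    if (fread(entries.data(), sizeof(uint64_t), nPages, fd) != nPages) {
        fclose(fd);
        throw std::runtime_error("Failed to read pagemap");
    }
    fclose(fd);

    std::vector<bool> flags(nPages, false);
    for (size_t i = 0; i < nPages; i++) {
        flags[i] = (entries[i] >> 55) & 1;
    }

    return flags;
}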

Because the diffs from child threads have to be applied before other threads can continue, we have to make sure that they are applied before the thread result is returned (as the thread result is the thing that other threads wait on). Therefore we have to piggyback the diffs on the existing thread result message.

I've ported the thread result message to use flatbuffers which is where all the other snapshot stuff lives. For now I've put it in with the existing SnapshotClient and SnapshotServer in an attempt to minimise the number of changes in this PR, but eventually we'll need to commit to either protobuf or flatbuffers. Flatbuffers seem potentially more efficient, but the awkwardness of passing them around and accessing the data within is a major downside.
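
As a concrete illustration of that awkwardness, a minimal sketch of the build/read round trip. The schema fields, the generated CreateSnapshotDiffPushRequest call, and the free variables (snapshotKey, diffData, buffer) are hypothetical stand-ins, not the actual faabric definitions:

// Building: everything is created through the builder, then finalised
flatbuffers::FlatBufferBuilder mb;
auto keyOffset = mb.CreateString(snapshotKey);
auto dataOffset = mb.CreateVector(diffData.data(), diffData.size());
auto requestOffset = CreateSnapshotDiffPushRequest(mb, keyOffset, dataOffset);
mb.Finish(requestOffset);
send(mb.GetBufferPointer(), mb.GetSize());

// Reading: fields come back as flatbuffers types that need converting
// before they can be passed around as normal C++ objects
const auto* r = flatbuffers::GetRoot<SnapshotDiffPushRequest>(buffer);
std::string key = r->key()->str();
const uint8_t* diffPtr = r->data()->Data();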

@Shillaker Shillaker self-assigned this May 21, 2021
@Shillaker Shillaker changed the title from "Pushing snapshot diffs for updating thread shared memory" to "Dirty page checking" May 24, 2021
@Shillaker Shillaker changed the title from "Dirty page checking" to "Dirty page checking and snapshot diffs" May 24, 2021
@@ -9,6 +9,8 @@ namespace faabric::scheduler {
// -----------------------------------
// Mocking
// -----------------------------------
std::mutex mockMutex;
@Shillaker (Collaborator, author) commented:
I was seeing race conditions on the mock methods when running in threaded tests. This mocking stuff is getting a little excessive to be held in the main source for the clients. A nicer way would be to use the polymorphism of messages to have a single buffer where we keep all the messages a client has sent, but this would require some nasty casting in tests that want to access those messages.
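
For reference, the locking added around mock recording amounts to something like the following (the container and recording helper are illustrative, not the exact faabric code):

#include <faabric/proto/faabric.pb.h>

#include <mutex>
#include <string>
#include <utility>
#include <vector>

std::mutex mockMutex;
static std::vector<std::pair<std::string, faabric::Message>> sentMessages;

void recordSentMessage(const std::string& host, const faabric::Message& msg)
{
    // Threaded tests hit this concurrently, hence the lock
    std::unique_lock<std::mutex> lock(mockMutex);
    sentMessages.emplace_back(host, msg);
}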

@Shillaker Shillaker marked this pull request as ready for review May 26, 2021 09:22
@Shillaker Shillaker marked this pull request as draft May 26, 2021 15:24
@Shillaker Shillaker marked this pull request as ready for review May 27, 2021 11:00
SnapshotClient& Scheduler::getSnapshotClient(const std::string& otherHost)
{
// Note, our keys here have to include the tid as the clients can only be
// used within the same thread
@Shillaker (Collaborator, author) commented:
@csegarragonz I don't know how this isn't failing for the function call clients, as the functionCallClients map is shared by all threads and keyed on the other hostname, so I would imagine the same client is getting used by different threads. However, it seems there aren't any assertions failing, so I guess that isn't true...?

@csegarragonz (Collaborator) commented:
It is not failing, as we do the assertions, but I'm indeed surprised that it is not. Is your approach thread-safe, though? I see that different threads would only ever access different keys, but I am wondering if this won't leave the underlying data structure in an undefined state.

I guess I assumed the scheduler was single-threaded. In MPI we use TLS for the clients, but it's painful to clean up (as it must be explicitly cleaned, from the same thread that opened it). Happy to discuss this offline.

@Shillaker (Collaborator, author) commented:
The scheduler is not single-threaded; it's a singleton object that gets called by lots of threads, so thread safety is still very important. The confusion may have arisen because the scheduler used to manage its own thread pool but now doesn't actually spawn any threads (the executors do that). That doesn't mean it's only ever run in a single thread though.

You're right on the map accesses; there should be a lock around this as the map itself isn't thread-safe.

#include <faabric/util/testing.h>

namespace tests {
class BaseTestFixture
@Shillaker (Collaborator, author) commented:
This BaseTestFixture is here to capture the stuff I found we were repeating in all the existing tests. We have to be careful not to make this a dumping ground, and only include the tidy-up that's actually going to get repeated elsewhere.

@csegarragonz (Collaborator) commented:
I personally doubt that this is useful/desirable. The main point of fixtures was to remove the dumping ground that cleanFaabric() had turned into. However, I feel we have just moved that burden to BaseTestFixture.

For instance, why would the transport tests need to call redis.flushAll()? I think there's value in knowing what you are modifying in your tests, and resetting just that.

@Shillaker (Collaborator, author) commented:
Yes, I did think that when creating this. The thing is, with Redis and the scheduler for example, the snippets of clean-up would have to be sprinkled around so many places that it makes sense to avoid the repetition.

I think the solution is a fixture per feature, with multiple inheritance in the tests, i.e. we have a RedisFixture, a StateFixture, a MockClientsFixture etc. I'll see if that works here (a sketch below)...
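
A rough sketch of what that could look like, assuming per-feature clean-up helpers along these lines exist (class and method names are illustrative, not the final faabric test API):

namespace tests {

class RedisTestFixture
{
  public:
    RedisTestFixture()
      : redis(faabric::redis::Redis::getQueue())
    {
        // Each fixture resets only the feature it owns
        redis.flushAll();
    }

  protected:
    faabric::redis::Redis& redis;
};

class SchedulerTestFixture
{
  public:
    SchedulerTestFixture() { faabric::scheduler::getScheduler().reset(); }
};

// Tests needing several features inherit the relevant fixtures, so each
// test resets only what it actually touches
class SchedulerRedisTestFixture
  : public RedisTestFixture
  , public SchedulerTestFixture
{};
}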

@csegarragonz (Collaborator) left a review:
Left some comments; overall LGTM, and the dirty page tracking looks great and neat. I know you started this before we agreed on the new PR+commit structure but, for the future, 1500 lines added across four commits is very hard to review in a per-commit fashion (which, by the way, you can do using n and p in the browser).

Wrt the PR size/topics, I feel that all the fixture changes had really nothing to do with dirty page checking, so they could definitely be in a different PR.

include/faabric/scheduler/SnapshotClient.h (outdated review comment, resolved)
snapshotDiffs.size(),
funcStr,
h);
SnapshotClient& c = getSnapshotClient(h);
@csegarragonz (Collaborator) commented:
This is quite stylistic and a matter of preference, but do you think we could one-line the get + push? This is how it's done with getFunctionCallClient, so we'd be consistent throughout.

@Shillaker (Collaborator, author) commented May 28, 2021:
This doesn't seem like a big deal to me. Although they are both functions that get clients, they're not actually related in any other way, so I'm not sure consistency matters. I guess we could come up with a rule that if a "get something" call is only used to invoke a single method on the next line, the "get something" should be inlined with that method call. There's probably not much consistency on that throughout the codebase though.

} else {
// Send the header first
sendHeader(faabric::scheduler::SnapshotCalls::PushSnapshot);

faabric::util::SystemConfig& conf = faabric::util::getSystemConfig();
@csegarragonz (Collaborator) commented:
Maybe mark this as const, as requesting a non-const reference seems counter-intuitive (we don't modify anything).

@Shillaker (Collaborator, author) commented May 28, 2021:
Yep, good point on the const, but remember that even without the const, requesting a reference avoids a copy, so I wouldn't say it's counter-intuitive.

mb.Finish(requestOffset);
uint8_t* msg = mb.GetBufferPointer();
int size = mb.GetSize();
send(msg, size);

// Await a response as this call must be synchronous
awaitResponse(SNAPSHOT_PORT + REPLY_PORT_OFFSET);
@csegarragonz (Collaborator) commented:
In the MessageEndpointServer we include the REPLY_PORT_OFFSET programmatically. I am wondering if we could do so as well in awaitResponse, as it's quite an implementation detail that we could hide from the interface.

I guess the original motivation not to do so was to be able to awaitResponse on an arbitrary port, but we use that nowhere (i.e. we never call it without the offset).

Feel free to ignore and I will do this in a separate PR.

@Shillaker (Collaborator, author) commented:
Yeah, I think doing it automatically inside the method makes sense (see the sketch below). The only downside is that it adds a bit of magic that may not be obvious to the caller, but I think it's worth it. Let's do it in another PR though, to try and keep a lid on the size of this one.
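
For clarity, the call-site change being discussed would be roughly:

// Now: each caller must know the reply port convention
awaitResponse(SNAPSHOT_PORT + REPLY_PORT_OFFSET);

// Proposed: the endpoint applies the offset internally, mirroring what
// MessageEndpointServer already does (signature assumed, not final)
awaitResponse();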


// Send the data
mb.Finish(requestOffset);
@csegarragonz (Collaborator) commented:
I think this is now repeated enough times to be a candidate for a macro. I suggest something like:

#define SEND_FLATBUFFER(header, mb, offset, waitResponse) \
    sendHeader(header); \
    mb.Finish(offset); \
    uint8_t* msg = mb.GetBufferPointer(); \
    int size = mb.GetSize(); \
    send(msg, size); \
    if (waitResponse) { \
        awaitResponse(SNAPSHOT_PORT + REPLY_PORT_OFFSET); \
    }

@Shillaker (Collaborator, author) commented:
Yep good spot.

const SnapshotDiffPushRequest* r =
flatbuffers::GetMutableRoot<SnapshotDiffPushRequest>(msg.udata());

faabric::util::getLogger()->info("Receiving {} diffs to snapshot {}",
@csegarragonz (Collaborator) commented:
Same for info here

char value[] = "4";
size_t nWritten = fwrite(value, sizeof(char), 1, fd);
if (nWritten != 1) {
throw std::runtime_error("Failed to write to clear_refs");
@csegarragonz (Collaborator) commented:
Will this happen in threads? If so, could we logger->error(...) and then throw? I'm afraid these exceptions will fly under the radar otherwise.

@Shillaker (Collaborator, author) commented May 28, 2021:

All the exceptions in here will generally only happen if there's a permissions error on the file itself (or if the kernel doesn't support soft-dirty PTEs, which is unlikely as they were added in 3.x), so they will only occur if we've misconfigured an environment. In that case, the exception will be triggered from both the main thread and child threads.

I've added a lot more logging now which should make things a little clearer.

std::vector<bool> dirtyFlags = faabric::util::getDirtyPages(data, nPages);

// Convert to snapshot diffs
// TODO - reduce number of diffs by merging adjacent dirty pages
@csegarragonz (Collaborator) commented:
Could we tackle the TODO in this PR? I feel it should not be very complex to implement. I was thinking of something like:

auto next1 = std::find(dirtyFlags.begin(), dirtyFlags.end(), true);
while (next1 != dirtyFlags.end()) {
    // Find the end of this dirty run; if the vector ends with dirty pages,
    // next0 == end() and the final chunk is handled like any other
    auto next0 = std::find(next1, dirtyFlags.end(), false);
    size_t pageIdx = std::distance(dirtyFlags.begin(), next1);
    size_t nPages = std::distance(next1, next0);
    uint32_t offset = pageIdx * faabric::util::HOST_PAGE_SIZE;
    diffs.emplace_back(
      offset, data + offset, nPages * faabric::util::HOST_PAGE_SIZE);
    next1 = std::find(next0, dirtyFlags.end(), true);
}

there's probably a nicer way, but just an idea.

@Shillaker (Collaborator, author) commented May 28, 2021:

I think this may be premature optimisation, which is why I left it as a TODO. I'm not sure how often this will be beneficial, and I'm reluctant to add more complexity. I'll revisit once I've done some benchmarking.

@Shillaker (Collaborator, author) commented:
> Left some comments; overall LGTM, and the dirty page tracking looks great and neat. I know you started this before we agreed on the new PR+commit structure but, for the future, 1500 lines added across four commits is very hard to review in a per-commit fashion (which, by the way, you can do using n and p in the browser).

Yes, let's not be too strict on this one. I think we can aim for the n/p reviewing in an ideal world, but not worry too much if we don't achieve it every time. In this case it was a choice between 50 commits with things like "fixing tests" repeated over and over, or one big one, neither of which is useful. I could potentially have rebased into something slightly more meaningful, but I think you have to start the work in that mindset for it to really work.

> Wrt the PR size/topics, I feel that all the fixture changes had really nothing to do with dirty page checking, so they could definitely be in a different PR.

Yes, again in an ideal world, but let's not let the great be the enemy of the good. "Nothing to do" is a little strong though; it avoided repeating code in the new tests I was writing.

@Shillaker (Collaborator, author) commented:
@csegarragonz the latest commit addresses the same issue we saw before in Faasm, where there's a race condition when getting an object from a map and initialising it if it isn't there already. Without the shared locking it's possible for another thread to come in while the object is being initialised and either (a) see a partially initialised object, or (b) leave the map itself in an inconsistent state.

I've added the fix around getting loggers, getting function call clients and getting snapshot clients in the scheduler.
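
The resulting pattern is a double-checked lookup under a shared mutex. A minimal sketch using the std types directly (faabric wraps these in its own lock helpers such as faabric::util::SharedLock, and the SnapshotClient constructor arguments here are assumed):

#include <shared_mutex>
#include <string>
#include <unordered_map>

std::shared_mutex snapshotClientsMx;
std::unordered_map<std::string, faabric::scheduler::SnapshotClient>
  snapshotClients;

faabric::scheduler::SnapshotClient& getSnapshotClient(const std::string& key)
{
    // Fast path: a shared lock is enough when the client already exists
    {
        std::shared_lock<std::shared_mutex> lock(snapshotClientsMx);
        auto it = snapshotClients.find(key);
        if (it != snapshotClients.end()) {
            return it->second;
        }
    }

    // Slow path: take an exclusive lock to construct, re-checking in case
    // another thread inserted while we were unlocked
    std::unique_lock<std::shared_mutex> lock(snapshotClientsMx);
    auto it = snapshotClients.find(key);
    if (it == snapshotClients.end()) {
        it = snapshotClients.try_emplace(key, key).first;
    }
    return it->second;
}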

@@ -1,7 +1,5 @@
#!/bin/bash

set -e

@Shillaker (Collaborator, author) commented:
To handle failures in the worker and still be able to print logs, we have to switch off set -e.

{
faabric::util::SharedLock lock(snapshotClientsMx);
return snapshotClients.at(key);
}
@Shillaker (Collaborator, author) commented:
These were the changes to getSnapshotClient and getFunctionCallClient that address the issues of cross-thread clients and race conditions.

{
faabric::util::SharedLock lock(loggerMx);
return loggers[name];
}
@Shillaker (Collaborator, author) commented:
Same race condition issue as mentioned previously. Given the proposed move to the spdlog macros, we don't need to worry too much about the performance hit this will have (which will probably be negligible anyway unless it's called in a very tight loop).

faabric::util::SnapshotData snapshot() override;

uint8_t* snapshotMemory = nullptr;
size_t snapshotSize = 0;
@Shillaker (Collaborator, author) commented:
Originally I hadn't set defaults on snapshotMemory and snapshotSize, which worked on my machine as the size happened to be zero, but in GitHub Actions it was getting set to a non-zero value, which was breaking everything. The take-away is that we must always set default values on class members. I looked for a clang-tidy rule to enforce this but couldn't find one.
