Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Break the precondition of handle_committed #50

Open
yuezato opened this issue Dec 17, 2018 · 2 comments
Open

Break the precondition of handle_committed #50

yuezato opened this issue Dec 17, 2018 · 2 comments
Labels
bug Something isn't working frugalos_mds

Comments

@yuezato
Copy link
Member

yuezato commented Dec 17, 2018

Description

We can break the following precondition of the handle_committed method:

track_assert_eq!(self.next_commit, commit, ErrorKind::InvalidInput);

Reproduce

Use these files: https://gist.github.com/yuezato/9c0af68320935b342d0b152811f58cfc

Why is the precondition broken

In this while-loop:
https://github.com/frugalos/frugalos/blob/master/frugalos_mds/src/node/node.rs#L729-L738
here we assume the two raft events [ Event::SnapshotLoaded, Event::Committed ] come in this order.

First, we deal the Event::SnapshotLoaded

E::SnapshotLoaded { new_head, snapshot } => {
info!(
self.logger,
"New snapshot is loaded: new_head={:?}, bytes={}",
new_head,
snapshot.len()
);
let logger = self.logger.clone();
let future = fibers_tasque::DefaultCpuTaskQueue.async_call(move || {
let machine = track!(codec::decode_machine(&snapshot))?;
let versions = machine.to_versions();
info!(logger, "Snapshot decoded: {} bytes", snapshot.len());
Ok((new_head, machine, versions))
});
self.decoding_snapshot = Some(future);
}

without updating self.next_commit.

Immediately after receiving Event::Committed, we reach this line:

track_assert_eq!(self.next_commit, commit, ErrorKind::InvalidInput);

Finally, the precondition is broken.

How Solve This

Once we encounter a SnapshotLoaded event,
we should wait to deal committed events that follows the loaded event among decoding the snapshot.

Indeed, in this part (Especially line 704)

match track!(self.decoding_snapshot.poll().map_err(Error::from))? {
Async::NotReady => return Ok(Async::NotReady),
Async::Ready(None) => {}
Async::Ready(Some(result)) => {
let (new_head, machine, versions) = track!(result)?;
info!(self.logger, "Snapshot decoded: new_head={:?}", new_head);
let delay = env::var("FRUGALOS_SNAPSHOT_REPAIR_DELAY")
.ok()
.and_then(|v| v.parse().ok())
.unwrap_or(10);
self.events.reserve_exact(machine.len());
self.events
.extend(versions.into_iter().map(|version| Event::Putted {
version,
put_content_timeout: Seconds(delay),
}));
self.next_commit = new_head.index;
self.machine = machine;
self.metrics.objects.set(self.machine.len() as f64);
self.decoding_snapshot = None;
}
}

we can correctly update self.next_commit and this maybe solve the present issue.

@yuezato yuezato added bug Something isn't working frugalos_mds labels Dec 17, 2018
@yuezato
Copy link
Member Author

yuezato commented Dec 28, 2018

Related Problem

If a frugalos_mds::node is down due to the above problem,
then the SegmentNode that has the frugalos_mds::node is also down: https://github.com/frugalos/frugalos/blob/master/frugalos_segment/src/service.rs#L192-L196

When delete a object on non-meta buckets, SegmentNode is needed:

Therefore, if a frugalos_mds::node is down, then we can never delete objects in the frugslos_mds::node. This leads to make unreachable objects on the device that handles the frugalos_mds::node.

Note: We can put/get an object to/from the frugslos_mds::node.

@yuezato
Copy link
Member Author

yuezato commented Jan 8, 2019

Although we deal with this problem by #51,
due to the impact of this issue, we leave this issue open.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working frugalos_mds
Projects
None yet
Development

No branches or pull requests

1 participant