KAFKA-5811: Add Kibosh integration for Trogdor and Ducktape #4195

Closed
wants to merge 1 commit into from

Contributor

@cmccabe cmccabe commented Nov 8, 2017

For ducktape: add Kibosh to the testing Dockerfile.
Create files_unreadable_fault_spec.py.

For trogdor: create FilesUnreadableFaultSpec.java.
Add a unit test of using the Kibosh service.

asfgit commented Nov 8, 2017

SUCCESS
8057 tests run, 5 skipped, 0 failed.
--none--

1 similar comment
asfgit commented Nov 8, 2017

SUCCESS
8057 tests run, 5 skipped, 0 failed.
--none--

asfgit commented Nov 8, 2017

FAILURE
7949 tests run, 5 skipped, 1 failed.
--none--

Contributor

@rajinisivaram rajinisivaram left a comment

The new system tests failed for me. I will check the results to see whether I missed a step, but I wanted to check whether they worked for you. Apart from that, I have left only some minor comments.

:param nodes: The nodes to put the Kibosh FS on. Kibosh allocates no
nodes of its own.
:param target: The target directory, which Kibosh exports a view of.
:param mirror: The mirror direcotry, where Kibosh injects faults.
Contributor

typo: directory

util.wait_until(lambda: self.kibosh_process_running(node), 20, backoff_sec=.1,
err_msg="Timed out waiting for kibosh to stop on %s" % node.account.hostname)
except TimeoutError:
# If the prcess won't terminate, use kill -9 to shut it down.
Contributor

typo in comment: process
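
For readers following the shutdown logic quoted above, here is a minimal local sketch of the "wait, then fall back to kill -9" pattern. It is not the code from this PR: `wait_until` here is a stand-in that returns a boolean rather than raising, `stop_process` is a hypothetical helper, and the polled condition is "the process has exited".

```python
import signal
import subprocess
import sys
import time


def wait_until(condition, timeout_sec, backoff_sec=0.1):
    """Poll condition() until it returns True; return False on timeout."""
    deadline = time.monotonic() + timeout_sec
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(backoff_sec)
    return condition()


def stop_process(proc, timeout_sec=20):
    """Send SIGTERM, wait for exit, and fall back to SIGKILL (kill -9)."""
    proc.send_signal(signal.SIGTERM)
    if wait_until(lambda: proc.poll() is not None, timeout_sec):
        return "terminated"
    # The process ignored SIGTERM within the timeout, so force it down.
    proc.kill()  # SIGKILL on POSIX
    proc.wait()
    return "killed"


if __name__ == "__main__":
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    print(stop_process(child, timeout_sec=5))
```

The return value only reports which path was taken; a real service would more likely log it.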

node = self.nodes[0]

def check(self, node):
self.logger.info("WATERMELON")
Contributor

Is the log entry a marker that something checks?

Contributor Author

Oops, this was supposed to be removed.

return fault_json == expected_json

self.kibosh.set_faults(node, [spec])
#time.sleep(30 * 60 * 1000)
Contributor

Was the sleep temporary?

Contributor Author

Yes, let me remove that

@@ -96,3 +96,32 @@ def verify_nodes_partitioned():
raise RuntimeError("Node 2 must be reachable from node 1.")
if not node_is_reachable(self.agent_nodes[2], self.agent_nodes[1]):
raise RuntimeError("Node 1 must be reachable from node 2.")

@cluster(num_nodes=4)
def test_files_unreadable_fault(self):
Contributor

This test seems to be testing a network partition rather than a disk error?

Contributor Author

Removed

import java.util.Objects;
import java.util.TreeMap;

public enum Kibosh {
Contributor

Why is this an enum?

Contributor Author

Contributor

Hmm... Maybe it should be called KiboshSingleton if it really needs to be an enum. Personally, it feels like an odd use of enum to me. Another opinion would help - @ijuma ?

Contributor Author

I changed it to a regular class


synchronized void removeFault(KiboshFaultSpec toRemove) throws IOException {
KiboshControlFile file = KiboshControlFile.read(controlPath);
List<KiboshFaultSpec> newFaults = new ArrayList<>();
Contributor

Not really new faults?

Contributor Author

Hmm, I can just call it "faults." That's probably clearer
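
As an illustration of the read-filter-write shape that `removeFault` appears to follow, here is a rough Python sketch. The `"faults"` key, the fault fields, and the `remove_fault` helper are assumptions made for the example and are not taken from the PR.

```python
import json


def remove_fault(control_json, to_remove):
    """Return control-file JSON with every fault equal to to_remove dropped."""
    control = json.loads(control_json)
    # "faults" holds the faults that remain, which is why the shorter
    # name reads better than "newFaults".
    faults = [f for f in control.get("faults", []) if f != to_remove]
    control["faults"] = faults
    return json.dumps(control)
```

Whether a missing fault should raise an error is left open here; the sketch simply returns the list unchanged in that case.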

asfgit commented Nov 14, 2017

FAILURE
4020 tests run, 1 skipped, 1 failed.
--none--

asfgit commented Nov 14, 2017

FAILURE
8068 tests run, 5 skipped, 2 failed.
--none--

asfgit commented Nov 14, 2017

FAILURE
7960 tests run, 5 skipped, 1 failed.
--none--


def pids(self, node):
return [pid for pid in node.account.ssh_capture("test -e '%s' && test -e /proc/$(cat '%s')" %
(self.pidfile_path, self.pidfile_path), allow_fail=True)]
Contributor

Is the support for multiple pids only for clean up? Since the code checks for a fixed control file, I was thinking we expect only one process at a time.

Contributor Author

You are right that we expect only one process here. The reason it's plural is that there's a pattern in the ducktape code of having a pids function for Service subclasses which returns an array of ints, or an empty array if there are no processes. For example, zookeeper.py, kafka.py, mirror_maker.py, streams.py, etc. all follow this pattern.
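
The "list of ints, or an empty list" convention described above can be sketched locally as follows. This is a standalone approximation rather than the ducktape service code: it reads a pidfile from the local filesystem instead of over SSH, and the function signature is hypothetical.

```python
import os


def pids(pidfile_path):
    """Return [pid] if the pidfile names a live process, otherwise []."""
    try:
        with open(pidfile_path) as f:
            pid = int(f.read().strip())
    except (OSError, ValueError):
        # Missing, unreadable, or malformed pidfile: no known process.
        return []
    if pid <= 0:
        return []
    try:
        os.kill(pid, 0)  # Signal 0 checks for existence without signalling.
    except ProcessLookupError:
        return []
    except PermissionError:
        # The process exists but belongs to another user.
        return [pid]
    return [pid]
```

Callers can then treat every service uniformly, for example `for pid in pids(path): ...`, without special-casing the "not running" state.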

asfgit commented Nov 15, 2017

FAILURE
8068 tests run, 5 skipped, 2 failed.
--none--

1 similar comment
asfgit commented Nov 15, 2017

FAILURE
8068 tests run, 5 skipped, 2 failed.
--none--

def message(self):
return {
"class": "org.apache.kafka.trogdor.fault.FilesUnreadableFaultSpec",
"startMs": self.start_ms,
Contributor

durationMs not specified?
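
To make the question concrete, a message that carries a duration might look like the sketch below. Only `class` and `startMs` appear in the quoted diff; the `durationMs` key is the name the reviewer uses, and the function itself is hypothetical. Any other fields the real spec needs are left out.

```python
def files_unreadable_fault_message(start_ms, duration_ms):
    """Build a fault-spec message with both a start time and a duration."""
    return {
        "class": "org.apache.kafka.trogdor.fault.FilesUnreadableFaultSpec",
        "startMs": start_ms,
        # Assumed key, following the reviewer's wording.
        "durationMs": duration_ms,
    }
```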

asfgit commented Nov 15, 2017

FAILURE
7960 tests run, 5 skipped, 1 failed.
--none--

Contributor

@rajinisivaram rajinisivaram left a comment

LGTM

@asfgit asfgit closed this in d9cbc6b Nov 16, 2017
asfgit commented Nov 16, 2017

SUCCESS
8072 tests run, 5 skipped, 0 failed.
--none--

asfgit commented Nov 16, 2017

FAILURE
7964 tests run, 5 skipped, 1 failed.
--none--

asfgit commented Nov 16, 2017

SUCCESS
8072 tests run, 5 skipped, 0 failed.
--none--

@cmccabe cmccabe deleted the KAFKA-5811 branch May 20, 2019 19:07
}

@JsonProperty
public int errorCode() {
@dkuritsyn dkuritsyn Jun 18, 2019

As far as I can see at https://github.com/confluentinc/kibosh/blob/b4288fa98d637f3765dc2abe848d18fce5afa897/fault.c#L41
the parameter name should be 'code', not 'errorCode'. Is this fault actually working for you?
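
If that reading of `fault.c` is correct, one way to picture the mismatch is a small key rename applied before the fault is written out. The sketch below is purely illustrative: `to_kibosh_fault` is a hypothetical helper, and the fault fields other than `errorCode` and `code` are made up for the example.

```python
def to_kibosh_fault(spec):
    """Copy a fault dict, renaming 'errorCode' to 'code' if it is present."""
    fault = dict(spec)  # Leave the caller's dict untouched.
    if "errorCode" in fault:
        fault["code"] = fault.pop("errorCode")
    return fault
```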
