New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
KAFKA-5811: Add Kibosh integration for Trogdor and Ducktape #4195
Conversation
SUCCESS |
1 similar comment
SUCCESS |
FAILURE |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new system tests failed for me. I will check the results to see if I have missed some step. But just wanted to check if they worked for you. Apart from that I have just left some minor comments.
:param nodes: The nodes to put the Kibosh FS on. Kibosh allocates no | ||
nodes of its own. | ||
:param target: The target directory, which Kibosh exports a view of. | ||
:param mirror: The mirror direcotry, where Kibosh injects faults. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo: directory
util.wait_until(lambda: self.kibosh_process_running(node), 20, backoff_sec=.1, | ||
err_msg="Timed out waiting for kibosh to stop on %s" % node.account.hostname) | ||
except TimeoutError: | ||
# If the prcess won't terminate, use kill -9 to shut it down. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
type in comment: process
node = self.nodes[0] | ||
|
||
def check(self, node): | ||
self.logger.info("WATERMELON") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the log entry as marker that something checks?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops, this was supposed to be removed.
return fault_json == expected_json | ||
|
||
self.kibosh.set_faults(node, [spec]) | ||
#time.sleep(30 * 60 * 1000) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was the sleep temporary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, let me remove that
@@ -96,3 +96,32 @@ def verify_nodes_partitioned(): | |||
raise RuntimeError("Node 2 must be reachable from node 1.") | |||
if not node_is_reachable(self.agent_nodes[2], self.agent_nodes[1]): | |||
raise RuntimeError("Node 1 must be reachable from node 2.") | |||
|
|||
@cluster(num_nodes=4) | |||
def test_files_unreadable_fault(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test seems to be testing network partition rathter than disk error?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed
import java.util.Objects; | ||
import java.util.TreeMap; | ||
|
||
public enum Kibosh { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this an enum?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a singleton implemented as a Java enum. http://keaplogik.blogspot.com/2013/12/the-java-enum-singleton-pattern.html
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm... Maybe it should be called KiboshSingleton
if it really needs to be an enum. Personally, it feels like an odd use of enum
to me. Another opinion would help - @ijuma ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed it to a regular class
|
||
synchronized void removeFault(KiboshFaultSpec toRemove) throws IOException { | ||
KiboshControlFile file = KiboshControlFile.read(controlPath); | ||
List<KiboshFaultSpec> newFaults = new ArrayList<>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not really new faults?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I can just call it "faults." That's probably clearer
FAILURE |
FAILURE |
FAILURE |
|
||
def pids(self, node): | ||
return [pid for pid in node.account.ssh_capture("test -e '%s' && test -e /proc/$(cat '%s')" % | ||
(self.pidfile_path, self.pidfile_path), allow_fail=True)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the support for multiple pids only for clean up? Since the code checks for a fixed control file, I was thinking we expect only one process at a time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right that we expect only one process here. The reason it's plural is that there's a pattern in the ducktape code of having a pids
function for Service subclasses which returns an array of ints, or an empty array if there are no processes. For example, zookeeper.py, kafka.py, mirror_maker.py, streams.py, etc. all follow this pattern.
FAILURE |
1 similar comment
FAILURE |
def message(self): | ||
return { | ||
"class": "org.apache.kafka.trogdor.fault.FilesUnreadableFaultSpec", | ||
"startMs": self.start_ms, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
durationMs
not specified?
FAILURE |
For ducktape: add Kibosh to the testing Dockerfile. Create files_unreadable_fault_spec.py. For trogdor: create FilesUnreadableFaultSpec.java. Add a unit test of using the Kibosh service.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
SUCCESS |
FAILURE |
SUCCESS |
} | ||
|
||
@JsonProperty | ||
public int errorCode() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I can see here https://github.com/confluentinc/kibosh/blob/b4288fa98d637f3765dc2abe848d18fce5afa897/fault.c#L41
the parameter name should be 'code', not 'errorCode'. Do you have this fault actually working?
For ducktape: add Kibosh to the testing Dockerfile.
Create files_unreadable_fault_spec.py.
For trogdor: create FilesUnreadableFaultSpec.java.
Add a unit test of using the Kibosh service.