Add an rt:admin/3 function that accepts a list of options as the third parameter. Currently the only valid option is return_exit_code. The rtdev, rtssh, and rt_cs_dev harnesses have been updated to support this option. If return_exit_code is specified, the return from a ?HARNESS:admin call is a pair with the exit code as the first member and the command output as the second member. Finally, the basic_command_line test has been changed to use return_exit_code to verify the changes.
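A minimal sketch of a test using the new option; the module name and the riak-admin arguments are illustrative, not part of the original change:

    -module(admin_exit_code_sketch).
    -export([confirm/0]).
    -include_lib("eunit/include/eunit.hrl").

    confirm() ->
        [Node | _] = rt:deploy_nodes(1),
        %% With return_exit_code, the result is {ExitCode, Output} rather
        %% than just the command output.
        {ExitCode, Output} = rt:admin(Node, ["status"], [return_exit_code]),
        ?assertEqual(0, ExitCode),
        lager:info("riak-admin status exited with ~p: ~s", [ExitCode, Output]),
        pass.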
Add testing of the handoff heartbeat change from the following pull request: basho/riak_core#560. Add an intercept module for the riak_core_handoff_sender module to introduce artificial delay on item visitation during a handoff fold. This delay along with the changes to the verify_handoff test induces test failure when run without the heartbeat change. The handoff_receive_timeout is exceeded, handoff stalls, and the test eventually fails due to timeout. The test succeeds when run with the heartbeat change.
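A sketch of what such an intercept could look like, following riak_test's intercept conventions; the intercepted function name/arity and the one-second delay are assumptions, not the exact values used:

    %% intercepts/riak_core_handoff_sender_intercepts.erl (sketch)
    -module(riak_core_handoff_sender_intercepts).
    -compile(export_all).
    -include("intercept.hrl").

    %% When the intercept is installed, the original module is renamed *_orig.
    -define(M, riak_core_handoff_sender_orig).

    %% Sleep before forwarding each visited item so the handoff fold takes
    %% long enough to exceed handoff_receive_timeout unless the sender
    %% heartbeats.
    delayed_visit_item_3(K, V, Acc) ->
        timer:sleep(1000),
        ?M:visit_item_orig(K, V, Acc).

    %% The test would install the intercept with something like:
    %% rt_intercept:add(Node, {riak_core_handoff_sender,
    %%                         [{{visit_item, 3}, delayed_visit_item_3}]}).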
… the list_keys attempt for the coverage testing. Conflicts: tests/overload.erl
robust to failures.
Change the ACL test case in the replication_ssl and replication2_ssl tests to use certificates generated within the tests instead of relying on certificates created outside the test that are prone to expire and cause spurious test failures. Also change the replication_ssl and replication2_ssl tests to avoid a cycle of standing up the test clusters and then immediately restarting them before any test cases execute. This should make test execution slightly faster for both test modules. This commit also makes the tests a bit more robust in checking cluster state when restarting nodes and removes an unnecessary five-second sleep call in the replication_ssl test.
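A rough sketch of the kind of restart check this implies, using existing rt helpers instead of a fixed sleep; the wrapper name and the service list are illustrative:

    restart_and_wait(Node) ->
        rt:stop_and_wait(Node),
        rt:start_and_wait(Node),
        rt:wait_for_service(Node, [riak_kv, riak_repl]),
        rt:wait_until_ring_converged([Node]).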
Change the overload test to exercise the strongly consistent code paths in addition to the eventually consistent paths during overload conditions.
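A rough sketch of how the strongly consistent paths could be brought into play, assuming the consensus subsystem is enabled in the cluster config and a consistent bucket type is created; the type name is illustrative and the overload-specific settings are omitted:

    build_sc_cluster() ->
        %% Enable riak_ensemble so strongly consistent requests are possible.
        Config = [{riak_core, [{enable_consensus, true}]}],
        [Node | _] = Nodes = rt:build_cluster(3, Config),
        rt:create_and_activate_bucket_type(Node, <<"strong">>,
                                           [{consistent, true}]),
        Nodes.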
Avoid a race condition in the replication test module when checking for site IP addresses in the replication status output. The test waits for a connection on the leader, but it only queries the replication status to check for the expected site IP addresses a single time. Change the test to wait and re-check the status output to give greater assurance that if the expected site IP addresses are not present it is due to legitimate failure and not a race condition in checking the replication status. This change affects the replication and replication_upgrade tests as well as any other tests that call the replication:replication function.
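A sketch of the wait-and-recheck pattern, assuming the status is fetched over rpc from riak_repl_console as in the replication tests; site_ips_present/2 is a hypothetical stand-in for the test's actual check of the status output:

    wait_for_site_ips(LeaderNode, ExpectedIPs) ->
        rt:wait_until(fun() ->
                              Status = rpc:call(LeaderNode, riak_repl_console,
                                                status, [quiet]),
                              site_ips_present(Status, ExpectedIPs)
                      end).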
Re-initiate fullsync after 100 failed checks for completion. The number of retries of the 'start fullsync and then check for completion' cycle is configurable via repl_util:start_and_wait_until_fullsync_complete/4 and defaults to 20 retries. This change is to avoid spurious test failures due to a rare condition where the rpc call to start fullsync fails to actually initiate the fullsync. A very similar change to the version of start_and_wait_until_fullsync_complete in the replication module, introduced in 0a36f99, has been effective at avoiding this condition for v2 replication tests.
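A sketch of the retry structure, not the actual repl_util implementation; wait_until_fullsync_complete/2 is a hypothetical stand-in for the completion polling:

    start_and_wait_fullsync(Node, Retries) when Retries > 0 ->
        %% Ask the leader to start fullsync, then poll for completion up to
        %% 100 times; if it never completes, start fullsync again.
        rpc:call(Node, riak_repl_console, fullsync, [["start"]]),
        case wait_until_fullsync_complete(Node, 100) of
            ok      -> ok;
            timeout -> start_and_wait_fullsync(Node, Retries - 1)
        end.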
Part of the condition checking done in the replication_object_reformat test is to validate the results of a fullsync using repl_util:validate_completed_fullsync/6. The way the function is called from the test expects fullsync to complete with zero error_exit or retry_exit conditions occurring, which requires the sink cluster to be in a steady state with all partitions available. The test failed to wait for such conditions and instead relied on performing a node downgrade asynchronously and waiting up to 60 seconds for a completion message before continuing. The test was continually failing after a node was downgraded to `previous` because partitions were reported as `down` on that node. To resolve the issue, the node downgrade is now done in the primary test process instead of in a separate spawned process. After the version downgrade is complete, the test waits for the riak_repl and riak_kv services, calls rt:wait_until_nodes_ready/1, calls rt:wait_until_no_pending_changes/1, and finally waits for the riak_repl2_fs_node_reserver named process to be registered on the downgraded node. This process is responsible for handling partition reservation requests and is key to determining that the new node can handle a fullsync without partition errors.
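A sketch of that post-downgrade wait sequence; the wrapper name is illustrative:

    downgrade_and_wait(Node) ->
        rt:upgrade(Node, previous),
        rt:wait_for_service(Node, [riak_repl, riak_kv]),
        rt:wait_until_nodes_ready([Node]),
        rt:wait_until_no_pending_changes([Node]),
        %% The partition reservation handler must be registered before the
        %% node can take part in fullsync without partition errors.
        rt:wait_until_registered(Node, riak_repl2_fs_node_reserver).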
Update the calls to rt:systest_read in repl_util and repl_aae_fullsync_util to treat identical siblings resulting from the use of DVV as a single value. These changes specifically address failures seen in the repl_aae_fullsync_custom_n and replication_object_reformat tests, but should be generally useful for replication tests that use the utility modules and have allow_mult set to true.
Add an optional parameter to rt:systest_read that provides the ability to compare siblings and treat those with identical values and metadata, excluding the dot, as a single value for the purposes of testing. This is to facilitate testing with DVV enabled, since there are several known cases where the use of DVV leads to siblings created internally by Riak. One example is a write during handoff that is forwarded by the vnode but also executed locally. The existing behavior of rt:systest_read is to fail with a badmatch when any siblings are encountered. This behavior is maintained as the default so that existing tests keep their current semantics without explicit change.
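A sketch of the sibling-collapsing idea, operating on {Metadata, Value} pairs as returned by riakc_obj:get_contents/1; the metadata key under which the dot is stored is an assumption:

    %% Siblings whose values and metadata are identical once the dot entry is
    %% removed collapse to a single value for assertion purposes.
    dedup_siblings(Siblings) ->
        lists:usort([{lists:sort(dict:to_list(dict:erase(<<"dot">>, MD))), V}
                     || {MD, V} <- Siblings]).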
Fix a problem with the cacertdir specification in the replication_ssl test. The code used to load cert files in v2 replication expects the path specified by the cacertdir key to be a directory only. With v3 replication the code used is flexible enough to allow either a directory or a file. Also correct a typo in the certfile path for the SSLConfig1 configuration.
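For illustration, a v2 replication SSL section of the sort involved, with cacertdir naming a directory and certfile/keyfile naming individual PEM files; the paths are made up:

    {riak_repl, [{ssl_enabled, true},
                 {certfile,  "/tmp/site1/cert.pem"},
                 {keyfile,   "/tmp/site1/key.pem"},
                 %% Must be a directory for v2 replication; v3 also accepts
                 %% a file here.
                 {cacertdir, "/tmp/site1/cacerts"}]}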
* Do not attempt to cancel fullsync if the initial attempt to start and wait for completion fails. The observed problem is not that fullsync starts and fails to complete in time, but that the initial call to start fullsync does not take effect, so the cancellation is unnecessary.
* Replace the call to repl_util:wait_for_connection/2 in the node upgrade process with a call to replication:wait_until_connection/1. This function is geared towards v2 replication and should speed up test execution.
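A sketch of the revised upgrade step; the wrapper name and version argument are illustrative:

    upgrade_and_reconnect(Node, LeaderNode) ->
        rt:upgrade(Node, current),
        rt:wait_for_service(Node, [riak_kv, riak_repl]),
        %% Wait for the v2 replication connection via the replication module
        %% helper rather than repl_util:wait_for_connection/2.
        replication:wait_until_connection(LeaderNode).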