vstart_runner for CephFS #513

Merged
merged 35 commits into from Oct 7, 2015

3 participants
@jcsp
Contributor

jcsp commented Jul 28, 2015

No description provided.

@ceph-jenkins


ceph-jenkins commented Jul 28, 2015

Refer to this link for build results (access rights to CI server needed):
http://jenkins.ceph.com//job/ceph-qa-suite-pull-requests/493/
Test FAILed.

@jcsp

Contributor Author

jcsp commented Jul 28, 2015

As introduced on mailing list: http://thread.gmane.org/gmane.comp.file-systems.ceph.devel/26151

Rather than split up the client handling into a part that operates as root, I've just added enough infrastructure to skip over the tests that require client trimming to work, and use new mds_root_ino_uid settings to avoid the need for root when operating within filesystem mounts.

@jcsp jcsp force-pushed the wip-vstart-runner branch from e20f16f to 59776f0 Jul 28, 2015

@ceph-jenkins


ceph-jenkins commented Jul 28, 2015

Refer to this link for build results (access rights to CI server needed):
http://jenkins.ceph.com//job/ceph-qa-suite-pull-requests/494/
Test PASSed.

@jcsp jcsp force-pushed the wip-vstart-runner branch from 59776f0 to a4c8db4 Aug 6, 2015

@ceph-jenkins


ceph-jenkins commented Aug 6, 2015

Refer to this link for build results (access rights to CI server needed):
http://jenkins.ceph.com//job/ceph-qa-suite-pull-requests/509/
Test PASSed.

        for data_pool in fs_data['data_pools']:
            self.mon_manager.raw_cluster_cmd('osd', 'pool', 'delete',
                                             data_pool, data_pool,
                                             '--yes-i-really-really-mean-it')

    def delete(self):

@gregsfortytwo

gregsfortytwo Aug 24, 2015

Member

I don't think anybody else invokes delete(); we should probably just delete it so it doesn't get outdated and then trip up an unaware user.

@jcsp

jcsp Aug 24, 2015

Author Contributor

agreed -- updated.

@jcsp jcsp force-pushed the wip-vstart-runner branch from a4c8db4 to b48bc83 Aug 24, 2015

@ceph-jenkins


ceph-jenkins commented Aug 24, 2015

Refer to this link for build results (access rights to CI server needed):
http://jenkins.ceph.com//job/ceph-qa-suite-pull-requests/559/
Test PASSed.

@gregsfortytwo

Member

gregsfortytwo commented Aug 24, 2015

I'm sure I'm just missing something but I'm not sure how we're actually supposed to run this. Invoke vstart and then invoke vstart_runner from inside the ceph src directory? :)

@jcsp

Contributor Author

jcsp commented Aug 24, 2015

Yeah, pretty much: the magic part is setting PYTHONPATH correctly, here's an example command line:

PYTHONPATH=~/git/teuthology/:~/git/ceph-qa-suite/ python ~/git/ceph-qa-suite/tasks/cephfs/vstart_runner.py tasks.cephfs.test_data_scan.TestDataScan.test_stashed_layout
@ceph-jenkins


ceph-jenkins commented Aug 24, 2015

Refer to this link for build results (access rights to CI server needed):
http://jenkins.ceph.com//job/ceph-qa-suite-pull-requests/561/
Test PASSed.

@jcsp

Contributor Author

jcsp commented Aug 24, 2015

I've just added some handy warning messages for when someone tries to run this without the right paths.
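A minimal sketch of what such a warning could look like, assuming Python 3. The helper names are hypothetical and this is not the code from the PR; the two module names are taken from the PYTHONPATH example in this thread.

```python
import importlib.util

# Top-level packages the runner needs. The names come from the PYTHONPATH
# example in this thread; the helpers below are hypothetical.
REQUIRED_MODULES = ["teuthology", "tasks"]


def missing_modules(names):
    """Return the top-level module names that cannot be found on sys.path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def warn_if_unimportable(names=REQUIRED_MODULES):
    """Print a PYTHONPATH hint per missing module; return True if none are missing."""
    missing = missing_modules(names)
    for name in missing:
        print("Cannot import '{0}': does PYTHONPATH include the teuthology "
              "and ceph-qa-suite checkouts?".format(name))
    return not missing
```

Checking with `find_spec` rather than a bare `import` lets the runner report every missing package at once instead of stopping at the first failure.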

if stdin and isinstance(stdin, basestring):
    # Hack: writing to stdin is not deadlock-safe, but it "always" works
    # as long as the input buffer is "small"
    subproc.stdin.write(stdin)

@gregsfortytwo

gregsfortytwo Aug 24, 2015

Member

Hmm, it looks like this just throws away the stdin param if it's not a string? I'm not sure we even expect it to be a string here, but maybe I've missed something, and surely we shouldn't silently ignore it.

@jcsp

jcsp Aug 24, 2015

Author Contributor

In teuthology.orchestra there is a whole load of pipe handling that just doesn't exist here, where callers can pass in run.PIPE. We can probably just assert that this is a string (it always is for the cases we care about in the cephfs tests)
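The "reject anything that is not a string" idea could be sketched roughly as follows, assuming Python 3 (where `str` replaces `basestring`). The function name `run_with_stdin` is hypothetical and this is not the code from the PR.

```python
import subprocess
import sys


def run_with_stdin(args, stdin=None):
    """Run args locally, feeding an optional string to the child's stdin."""
    if stdin is not None and not isinstance(stdin, str):
        # Pipes and file objects (such as teuthology's run.PIPE) are not
        # handled in this sketch, so fail loudly instead of dropping the input.
        raise TypeError("stdin must be a string, got {0}".format(type(stdin).__name__))
    proc = subprocess.Popen(
        args,
        stdin=subprocess.PIPE if stdin is not None else None,
        stdout=subprocess.PIPE,
        universal_newlines=True,
    )
    # communicate() writes the input and reads the output concurrently, which
    # avoids the deadlock risk of a bare proc.stdin.write().
    out, _ = proc.communicate(stdin)
    return proc.returncode, out


# Usage: a child Python process echoes its input back in upper case.
rc, out = run_with_stdin(
    [sys.executable, "-c", "import sys; sys.stdout.write(sys.stdin.read().upper())"],
    stdin="abc",
)
```

Raising on unexpected types keeps the failure visible at the call site, which addresses the concern about silently ignoring the parameter.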



# FIXME: twiddling vstart daemons is likely to be unreliable, we should probably just let vstart
# run RADOS and run the MDS daemons directly from the test runner

@gregsfortytwo

gregsfortytwo Aug 24, 2015

Member

Can you discuss this a bit more? In particular I notice that this is definitely manipulating daemons but doesn't update the pid files vstart uses — how does a tested cluster get shut down properly?

@jcsp

jcsp Aug 24, 2015

Author Contributor

Does vstart actually do anything with pid files? stop.sh is just a killall.

The spirit of this comment is that where we want services running for the duration of the test, it would be simpler to just have them as child processes of vstart_runner, so that we can do things like sending signals and then blocking on their actual termination, rather than polling to see things die etc.
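As an illustration of the "child processes of the runner" idea, here is a minimal sketch, assuming Python 3. A sleeping Python child stands in for a daemon, and `stop_child` is a hypothetical helper, not part of vstart_runner.

```python
import signal
import subprocess
import sys


def stop_child(proc, timeout=10):
    """Send SIGTERM to a child process and block until it has exited."""
    proc.send_signal(signal.SIGTERM)
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        # Escalate if the child ignores SIGTERM.
        proc.kill()
        return proc.wait()


# Stand-in for a daemon: a child that would otherwise sleep for a minute.
daemon = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
exit_status = stop_child(daemon)
```

Because the runner owns the `Popen` handle, it can wait on the actual exit rather than polling `ps` to see whether the process has gone away.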

@gregsfortytwo

gregsfortytwo Aug 24, 2015

Member

Hmm, it's pretty dense but I thought stop.sh did a little more than that. In particular though you can also use init-ceph stop with a vstart cluster. I've switched back and forth over time; init-ceph used to be much better if you were sharing a box with somebody.

@jcsp

jcsp Aug 24, 2015

Author Contributor

Oh, I didn't know about init-ceph with a vstart cluster. I think I'd defer doing nicer init handling until deciding either to go that way or to just start the processes directly from python-land.

if waited > timeout:
    raise MaxWhileTries("Timed out waiting for daemon {0}.{1}".format(self.daemon_type, self.daemon_id))
time.sleep(1)
waited += 1

@gregsfortytwo

gregsfortytwo Aug 24, 2015

Member

Should probably record a start time and actually look at the clock instead of relying on an increment... it's unlikely to be a serious problem since this is just a timeout, but increment-based counters get inaccurate surprisingly quickly on many systems.

@jcsp

jcsp Aug 24, 2015

Author Contributor

This is true, but there are a tonne of other places in the tests where we use this pattern, so I'm inclined to ignore it until it becomes an issue. One day a very bored person will convert all the timeouts to use the one helper function :-)
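For reference, a clock-based version of this loop could look roughly like the sketch below, assuming Python 3. The names `WaitTimeout` and `wait_until_true` are hypothetical and do not refer to the existing test helpers.

```python
import time


class WaitTimeout(Exception):
    """Raised when a condition does not become true in time (hypothetical name)."""


def wait_until_true(condition, timeout, period=1.0):
    """Poll condition() until it returns True or `timeout` seconds have elapsed."""
    # A monotonic deadline is unaffected by how long each poll or sleep really
    # takes, unlike a counter that is incremented once per iteration.
    deadline = time.monotonic() + timeout
    while not condition():
        if time.monotonic() >= deadline:
            raise WaitTimeout("Timed out after {0}s".format(timeout))
        time.sleep(period)
```

The counter-based loop drifts because each iteration costs `sleep(1)` plus the time spent evaluating the condition, so "waited" undercounts the real elapsed time.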

lines = ps_txt.split("\n")[1:]

for line in lines:
    if line.find("--name client.{0} ".format(self.client_id)) != -1:

@gregsfortytwo

gregsfortytwo Aug 24, 2015

Member

Does the default client command match this string somehow, or is that a personal invocation convention?

@jcsp

jcsp Aug 24, 2015

Author Contributor

The client is always started by LocalFuseMount, so it's invoked consistently with --name
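A minimal sketch of this matching, assuming Python 3. The function name, the `ps` column layout, and the sample command lines are all made up for illustration; only the `--name client.<id> ` pattern comes from the snippet under review.

```python
def find_client_pid(ps_txt, client_id):
    """Return the PID of the process started with --name client.<id>, or None."""
    # The trailing space stops client.1 from also matching client.10.
    needle = "--name client.{0} ".format(client_id)
    # Skip the header row, as the code under review does.
    for line in ps_txt.split("\n")[1:]:
        if needle in line:
            # Assumes the PID is the first whitespace-separated column.
            return int(line.split()[0])
    return None


# Made-up ps output for illustration only.
SAMPLE_PS = (
    "  PID TTY          TIME CMD\n"
    " 4242 ?        00:00:01 ceph-fuse -f --name client.0 /tmp/mnt.0\n"
    " 4243 ?        00:00:01 ceph-fuse -f --name client.10 /tmp/mnt.10\n"
)
```

The match only works because every client is launched the same way, which is the point made in the reply above.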

@ceph-jenkins


ceph-jenkins commented Aug 24, 2015

Refer to this link for build results (access rights to CI server needed):
http://jenkins.ceph.com//job/ceph-qa-suite-pull-requests/562/
Test PASSed.

@jcsp jcsp force-pushed the wip-vstart-runner branch from e1160e9 to 96d54fd Aug 24, 2015

@ceph-jenkins


ceph-jenkins commented Aug 24, 2015

Refer to this link for build results (access rights to CI server needed):
http://jenkins.ceph.com//job/ceph-qa-suite-pull-requests/563/
Test PASSed.

@jcsp jcsp force-pushed the wip-vstart-runner branch from 96d54fd to ad762f1 Aug 24, 2015

@ceph-jenkins


ceph-jenkins commented Aug 24, 2015

Refer to this link for build results (access rights to CI server needed):
http://jenkins.ceph.com//job/ceph-qa-suite-pull-requests/564/
Test PASSed.


def list_connections():
    self.client_remote.run(
        args=["sudo", "mount", "-t", "fusectl", "/sys/fs/fuse/connections", "/sys/fs/fuse/connections"],

@gregsfortytwo

gregsfortytwo Aug 24, 2015

Member

Doesn't this sudo get stripped out?

@jcsp

jcsp Aug 24, 2015

Author Contributor

Indeed, harmlessly so. removed.

@ceph-jenkins


ceph-jenkins commented Aug 24, 2015

Refer to this link for build results (access rights to CI server needed):
http://jenkins.ceph.com//job/ceph-qa-suite-pull-requests/566/
Test PASSed.

    raise NotImplementedError()

def get_pgs_per_fs_pool(self):
    # FIXME: assuming there are 3 OSDs

@gregsfortytwo

gregsfortytwo Aug 25, 2015

Member

Who uses 3 OSDs on a vstart cluster for the fs? 1, baby! :)
(Of course, this will still work for that, so...meh.)

    self._write_conf()

def clear_firewall(self):
    # FIXME: unimplemented

@gregsfortytwo

gregsfortytwo Aug 25, 2015

Member

This should probably assert if it gets activated.

@jcsp

jcsp Aug 25, 2015

Author Contributor

This is called during CephFSTestCase.tearDown, so pass is the desired behaviour. If anyone tries to run a test that actually sets any firewall rules they'll fail out when trying to run iptables.

# Monkeypatch get_package_version to avoid having to work out what kind of distro we're on
def _get_package_version(remote, pkg_name):
    # Used in cephfs tests to find fuse version. Your development workstation *does* have >=2.9, right?
    return "2.9"

@gregsfortytwo

gregsfortytwo Aug 25, 2015

Member

...eek. This needs to at least be something we can set via a config, please!

@jcsp

jcsp Aug 25, 2015

Author Contributor

I'd sooner wait until someone actually needs to run on a super-old fuse (might never happen). This is meant to just be a hack to get the filesystem tests running

@gregsfortytwo

gregsfortytwo via email Aug 25, 2015

Member

@jcsp

jcsp Aug 25, 2015

Author Contributor

Sounds like you're the right person to test and fix it! I'm all on linux these days.

if not is_named:
    victims.append((case, method))

log.info("Disabling {0} tests because they're marked as long running".format(len(victims)))

@gregsfortytwo

gregsfortytwo Aug 25, 2015

Member

This output string is missing a case.

@jcsp

jcsp Aug 25, 2015

Author Contributor

changed.


drop_test = False

if hasattr(fn, 'is_long_running') and getattr(fn, 'is_long_running') is True:

@gregsfortytwo

gregsfortytwo Aug 25, 2015

Member

Can we have a switch to do the long-running tests on vstart as well, without explicitly naming them?

@jcsp

jcsp Aug 25, 2015

Author Contributor

Maybe in another patch? Because this is aimed exclusively at developers I would expect people to write whatever changes they find they need.
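To make the discussion concrete, here is a minimal sketch of the marking and skipping logic, assuming Python 3. Only the `is_long_running` attribute and the decorator name come from this PR; `select_tests` and its `include_long_running` switch are hypothetical, the switch being the kind of option requested above rather than something the patch provides.

```python
def long_running(fn):
    """Mark a test method as long running via the is_long_running attribute."""
    fn.is_long_running = True
    return fn


def select_tests(methods, named=(), include_long_running=False):
    """Drop long-running tests unless named explicitly or the switch is set."""
    kept = []
    for fn in methods:
        is_long = getattr(fn, "is_long_running", False) is True
        if is_long and not include_long_running and fn.__name__ not in named:
            continue
        kept.append(fn)
    return kept


def test_quick():
    pass


@long_running
def test_slow():
    pass
```

With this shape, `select_tests([test_quick, test_slow])` keeps only the quick test, while naming `test_slow` or setting the switch keeps both.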

John Spray and others added some commits Jul 22, 2015

John Spray
tasks/cephfs: work around fuse weirdness
I am seeing a strange thing where it seems like sometimes
a ls of /sys/fs/fuse/connections is returning empty when
connections do exist.  It is pretty easy to make this
a non-issue by waiting for "more conns than we started with"
instead of "list of conns is different", so do that.

Signed-off-by: John Spray <john.spray@redhat.com>
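The difference between the two wait conditions in the commit message above can be sketched as follows, assuming Python 3. Both function names and the connection IDs are hypothetical, and this is not the code from the commit.

```python
def list_changed(conns_before, conns_now):
    """Old condition: any difference in the listing counts as a new mount."""
    return conns_now != conns_before


def more_conns_than_before(conns_before, conns_now):
    """New condition: only a larger listing counts as a new mount."""
    return len(conns_now) > len(conns_before)


# A spurious empty listing fools the old condition but not the new one.
before = ["38"]
spurious = []
after_mount = ["38", "39"]
```

Here `list_changed(before, spurious)` is true even though nothing was mounted, whereas `more_conns_than_before(before, spurious)` stays false until the new connection actually shows up.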
John Spray
tasks/cephfs: fix race in TestStrays
We weren't waiting for export dir to complete (the asok
just starts the process).  This wasn't noticeable when running
remotely due to latency between the test runner and the MDS,
but it shows up when running against a local vstart cluster.

Signed-off-by: John Spray <john.spray@redhat.com>
John Spray
tasks/cephfs: refine TestClientLimits.test_client_oldest_tid
* Instead of creating files in background, create
  them in foreground (simpler).
* Instead of creating max_requests*2 files, just create
  max_requests plus a little bit.
* Set max_requests to 1000 instead of 5000 to run a bit
  faster.

Signed-off-by: John Spray <john.spray@redhat.com>
John Spray
tasks/cephfs: make memstore dependency declarative
...instead of checking for it procedurally during
TestClusterFull.setUp

Signed-off-by: John Spray <john.spray@redhat.com>
John Spray
tasks/cephfs: split up TestClientRecovery
...into the part that requires a network-isolated
client and the part that doesn't.

This happens to also be the part that won't work with
vstart vs. the part that will.  teuthology yaml will
still pick up and run both parts.

Signed-off-by: John Spray <john.spray@redhat.com>
John Spray
tasks/cephfs: cluster_down before fs rm
In teuthology this isn't needed because we join the
mds child processes after killing them.  In vstart
we're killing them asynchronously, so be a bit more
careful to ensure they can't re-insert themselves
to the mdsmap between our calling fail and our calling
fs rm.

Signed-off-by: John Spray <john.spray@redhat.com>
John Spray
tasks/cephfs: add @long_running decorator
A means for test cases to mark particular methods
as long running, so that the vstart runner can skip
them when running for developers.

This is not a scientific thing, anything that takes
more than about 2 minutes due to lots of iteration
or sleeps.

Signed-off-by: John Spray <john.spray@redhat.com>
John Spray
tasks/cephfs: mark some tests as @long_running
Signed-off-by: John Spray <john.spray@redhat.com>
John Spray
tasks/cephfs: make FuseMount.teardown safer
(don't assume fuse_daemon exists)

Signed-off-by: John Spray <john.spray@redhat.com>
John Spray
tasks/cephfs: add vstart runner script
This is to allow running CephFSTestCase tests
against a vstart cluster, for much faster turnaround
during development than running teuthology against
built ceph packages.

Not everything will be runnable this way, but for
certain things like filesystem repair scenarios we
have everything we need within a vstart environment.

Signed-off-by: John Spray <john.spray@redhat.com>
John Spray
tasks/cephfs: add needs_trimming decorator
For tests to advertise that they need the client
to be able to trim its cache (i.e. currently that
means requiring run as root)

Signed-off-by: John Spray <john.spray@redhat.com>
John Spray
tasks/cephfs: mark some tests as @needs_trimming
So that we can drop these tests when not running
client as root.

Signed-off-by: John Spray <john.spray@redhat.com>
John Spray
tasks/cephfs: updates for cmake environ
Signed-off-by: John Spray <john.spray@redhat.com>
John Spray
tasks/cephfs: add instructions to vstart_runner
Signed-off-by: John Spray <john.spray@redhat.com>
John Spray
tasks/cephfs: stop if needed binaries are absent
Signed-off-by: John Spray <john.spray@redhat.com>
John Spray
tasks/cephfs: warn if vstart_runner can't import mods
Signed-off-by: John Spray <john.spray@redhat.com>
John Spray
tasks/cephfs: raise error on non-string stdins
Shouldn't be any from the fs tests that get run

Signed-off-by: John Spray <john.spray@redhat.com>
John Spray
tasks/cephfs: remove a redundant sudo
Signed-off-by: John Spray <john.spray@redhat.com>
John Spray
tasks/cephfs: add --interactive for vstart runner
Just like interactive-on-error in teuthology.

Signed-off-by: John Spray <john.spray@redhat.com>
John Spray
tasks/cephfs: fix FuseMount._asok_path
Signed-off-by: John Spray <john.spray@redhat.com>
John Spray
tasks/cephfs: move journal migration test
...into a CephFSTestCase.

Signed-off-by: John Spray <john.spray@redhat.com>
John Spray
tasks/cephfs: extend vstart_runner's ctx&run
Sufficiently to enable using workunits.

Signed-off-by: John Spray <john.spray@redhat.com>
John Spray
tasks/cephfs: move mds_scrub_checks
...into a CephFSTestCase.

Signed-off-by: John Spray <john.spray@redhat.com>
John Spray
tasks/cephfs: fix test_journal_migration
It was trying to get the output file from
a different remote than the one used to
run the journal tool.

Signed-off-by: John Spray <john.spray@redhat.com>
John Spray
tasks/cephfs: fix FuseMount bin path in vstart
FuseMount only uses the prefix for finding the 'ceph'
executable, which is in ./ for either cmake or
authtools, not ./src for cmake like other binaries.

Signed-off-by: John Spray <john.spray@redhat.com>

@jcsp jcsp force-pushed the wip-vstart-runner branch from 156c10c to 62247f2 Oct 2, 2015

@ceph-jenkins


ceph-jenkins commented Oct 2, 2015

Refer to this link for build results (access rights to CI server needed):
http://jenkins.ceph.com//job/ceph-qa-suite-pull-requests/710/
Test PASSed.

@gregsfortytwo gregsfortytwo merged commit 62247f2 into master Oct 7, 2015

1 check passed

default Merged build finished.
@gregsfortytwo

Member

gregsfortytwo commented Oct 7, 2015

I actually merged this to the infernalis branch and then merged that to master. Hurray!

@gregsfortytwo gregsfortytwo deleted the wip-vstart-runner branch Oct 7, 2015
