
Weekly Meeting 2016 10 13


Agenda

## Minutes

#ciao-project: Weekly Meeting 2016 10 13

Meeting started by markus__ at 16:00:44 UTC. The full logs are available at ciao-project/2016/ciao-project.2016-10-13-16.00.log.html .

Meeting summary

Meeting ended at 17:11:27 UTC.

Action Items

  • tcpepper enter a new bug to cover the fact that some unit tests need to be run as root.
  • mrcastel check to see whether 657 still occurs
  • tcpepper, markus, rbradford triage storage bugs
  • markus add untriaged storage-related bugs to next week's minutes.

Action Items, by person

  • mrcastel
    • mrcastel check to see whether 657 still occurs
  • rbradford
    • tcpepper, markus, rbradford triage storage bugs
  • tcpepper
    • tcpepper enter a new bug to cover the fact that some unit tests need to be run as root.
  • UNASSIGNED
    • markus add untriaged storage-related bugs to next week's minutes.

People Present (lines said)

  • markus__ (177)
  • kristenc (58)
  • tcpepper (57)
  • mrcastel (36)
  • rbradford (4)
  • ciaomtgbot (3)
  • mrkz (2)
  • jvillalo (1)
  • _erick0zcr (1)

Generated by [MeetBot](http://wiki.debian.org/MeetBot) 0.1.4

### Full IRC Log

16:00:44 <markus__> #startmeeting Weekly Meeting 2016 10 13
16:00:44 <ciaomtgbot> Meeting started Thu Oct 13 16:00:44 2016 UTC.  The chair is markus__. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:44 <ciaomtgbot> Useful Commands: #action #agreed #help #info #idea #link #topic.
16:00:44 <ciaomtgbot> The meeting name has been set to 'weekly_meeting_2016_10_13'
16:00:56 <markus__> #topic Roll Call
16:01:06 <mrkz> o/
16:01:09 <markus__> o/
16:01:09 <tcpepper> o/
16:01:23 <_erick0zcr> o/
16:02:01 <rbradford> o/
16:02:38 <jvillalo> o
16:03:05 <mrcastel> o/
16:03:13 <markus__> #topic Opens
16:03:20 <markus__> Does anyone have any opens?
16:04:10 * tcpepper no
16:04:14 <markus__> me neither
16:04:21 * mrkz neither
16:04:44 <markus__> Okay, let's move to the bugs
16:04:59 <markus__> #topic Bug Triage
16:05:13 <markus__> #link https://github.com/01org/ciao/issues
16:05:36 <markus__> #info there are 15 new bugs
16:05:40 <markus__> Let's start at the bottom
16:05:53 <markus__> #link https://github.com/01org/ciao/issues/641
16:06:08 <tcpepper> p high, for ease of dev / test / debug?
16:06:16 <tcpepper> should be easy
16:06:18 <markus__> I think I asked kristen to add this
16:06:28 <markus__> Right now you need to delete volumes one by one
16:06:40 <markus__> It's not blocking anything
16:06:50 <tcpepper> markus__: do you think we should tag the set of storage ones that are new as ... "storage" so we can focus in the next weeks on finishing it off to basic readiness?
16:06:55 <markus__> But I agree it should be easy
16:07:03 <markus__> Yep, good idea
16:07:14 <tcpepper> we've got a "storage drivers" label, not something less specific
16:07:23 <markus__> Shall I use this or create a new one.
16:07:25 <markus__> ?
16:07:36 <kristenc> markus__, do you think adding a "delete all" option for a tenant is safe though? seems like a scary thing to do outside of debugging.
16:07:49 <markus__> We have one for instances
16:07:56 <markus__> Is that any less scary?
16:08:00 <kristenc> yeah - but that's not your persistent data.
16:08:15 <markus__> That's true.
16:08:18 <tcpepper> maybe an add'l issue to track removal or "debug" flagging of those options so they're not normally built?
16:08:34 <markus__> Does the openstack cli have something like this?
16:08:48 <kristenc> I'd be ok with adding a flag that says something like --iknowwhatimdoing :)
16:09:00 <markus__> -f
16:09:29 <markus__> Anyway, it's low priority.  How about P3 and I'll update the bug to mention we need to check we really want to do this
16:10:36 <markus__> P3?
16:10:53 <tcpepper> sure
16:11:21 <markus__> Okay, next one
16:11:25 <markus__> #link https://github.com/01org/ciao/issues/642
16:11:44 <markus__> I think this is more important
16:12:25 <markus__> Although the context isn't clear.
16:12:44 <markus__> Is this to improve the error the user sees from ciao-cli?
16:13:10 <tcpepper> must be or...we don't have much else that would lead to something actionable and truly "helpful"
16:13:15 <mrcastel> kristenc, can volume creation fail due to a reason other than lack of capacity.. and do we have a way to free capacity in Ceph?
16:13:57 <kristenc> mrcastel, we capture the output of what we use to create the volume, which will be either rbd or qemu-img.
16:14:07 <kristenc> I don't know the ways that can fail.
16:14:08 <markus__> One other thing we have noticed is that when booting from volume, if the controller fails to create the volume, it tries to create the instance anyway.
16:14:31 <markus__> Okay, got it.
16:14:39 <markus__> We could actually do with that right now.
16:14:40 <tcpepper> we should copy/paste example failures into the github issue so it's more clear how the current output was lacking
16:14:43 <kristenc> mrcastel, one problem we have right now is that controller knows nothing of any volume except those that have been created by it.
16:15:03 <kristenc> so in debug, when we delete the datastore, and restart - we lose knowledge
16:15:16 <kristenc> we need to associate metadata with the volume.
16:15:20 <kristenc> such as tenant
16:15:23 <tcpepper> could we make a cleanup tool for controller, like launcher has?
16:15:32 <kristenc> this is why we use a database.
16:15:34 <tcpepper> it would delete the datastore and its known volumes?
16:15:42 <kristenc> I'd like to move that to be tag: value in ceph itself.
16:15:50 <kristenc> then you could query the cluster.
16:16:01 <markus__> I was thinking it would be nice to have one for the image service as well, a cleanup tool
16:16:15 <kristenc> tcpepper, it might be that we have one already - ceph has a lot of tools.
16:16:20 <tcpepper> that would obviate the desire for "delete all volumes" in the cli
16:16:22 <kristenc> and we already are dependent on them.
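
A rough sketch of kristenc's "tag: value in ceph itself" idea, storing the owning tenant as rbd image metadata so the cluster itself can be queried. This is illustrative only: the pool, image, and tenant names are made up and tagVolume is not an existing ciao function.

```go
package main

import (
	"fmt"
	"os/exec"
)

// tagVolume is a hypothetical helper; it shells out to the rbd CLI,
// which ciao already depends on for volume creation.
func tagVolume(pool, image, tenant string) error {
	// Equivalent to: rbd image-meta set <pool>/<image> tenant <tenant>
	out, err := exec.Command("rbd", "image-meta", "set",
		fmt.Sprintf("%s/%s", pool, image), "tenant", tenant).CombinedOutput()
	if err != nil {
		return fmt.Errorf("rbd image-meta set failed: %v: %s", err, out)
	}
	return nil
}

func main() {
	if err := tagVolume("ciao", "volume-0001", "tenant-uuid"); err != nil {
		fmt.Println(err)
	}
}
```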
16:17:09 <markus__> For this bug, I suggest a P2.  rbradford and I could have used this feature today.
16:17:17 <markus__> Any objections?
16:17:21 <tcpepper> wfm
16:17:51 <markus__> #info 642 has been labelled P2
16:18:07 <markus__> #link https://github.com/01org/ciao/issues/643
16:18:46 <tcpepper> seems P high to me
16:19:08 <markus__> Agreed.
16:19:18 <mrcastel> Agree this is high
16:19:30 <mrcastel> last time we had dangling volumes we ran out of disk space
16:19:30 <kristenc> we need to decide on the proper flow. the volume should continue to exist but switch to "available" state, right?
16:19:40 <tcpepper> probably easy? missing call to conditionally delete volume in the delete instance path
16:19:45 <markus__> kristenc: Yes, that's what we discussed over email
16:19:56 <kristenc> so the bug here isn't that the volume remains.
16:20:01 <kristenc> it's that the state didn't change.
16:20:02 <tcpepper> hmm
16:20:03 <tcpepper> ok
16:20:04 <markus__> It's that it remains in use
16:20:12 <markus__> And also we get an error
16:20:16 <kristenc> yes.
16:20:16 <markus__> on the terminal
16:20:27 <kristenc> let me update the bug - agree it's a high priority.
16:20:48 <tcpepper> so that's because of the specific use case or?  do we have in that use case list an ephemeral disk attached at boot?
16:21:16 <markus__> Well, in this case we're talking about a boot from volume instance.
16:21:36 <tcpepper> it's case #3 at https://github.com/01org/ciao/wiki/Storage ?
16:21:38 <markus__> I may not have tested what happens when you delete an instance which has normal volumes attached
16:22:03 <markus__> Yes, that's what I tested.  But the bug could affect 1 and 2 as well
16:22:27 <markus__> Anyway, P1 for this one I think.
16:23:20 <markus__> Okay, here's the next one
16:23:28 <markus__> Actually, first
16:23:42 <markus__> #info 643 has been labelled P1
16:23:46 <markus__> #link https://github.com/01org/ciao/issues/644
16:23:54 <markus__> This one is irritating.
16:24:07 <kristenc> markus__, one question - should we get the boot-from-volume PR merged or fix 643 with that as base?
16:24:14 <kristenc> I've done a lot of bug fixing in that branch.
16:24:41 <tcpepper> merge -> fix sounds better to me
16:24:41 <markus__> kristenc: 643 may not be related to boot from volume.
16:24:47 <markus__> tcpepper: Agree
16:24:52 <kristenc> (btw - as soon as I get singlevm working I can finish up that PR )
16:25:03 <markus__> I'd like to get boot from volume merged as soon as we can
16:25:15 <kristenc> ok - I'm working on this today and tomorrow.
16:25:49 <markus__> Okay, so in 644, after detaching a volume the output from
16:25:53 <markus__> ciao-cli volume list
16:25:57 <kristenc> markus__, I've noticed this issue with 644 as well.
16:25:57 <markus__> becomes corrupt
16:26:04 <markus__> I think the name is the only field affected.
16:26:09 <kristenc> not sure if it's a cli issue or a controller issue.
16:26:18 <kristenc> need to check the json cli receives.
16:26:37 <tcpepper> would make sense if the field was not sent and it unmarshalled to empty
16:26:40 <markus__> Okay.  Sort of looks like go routines might be involved in this one
16:26:57 <markus__> Sometimes I see the wrong name against the wrong volume
16:26:59 <kristenc> I seem to have marked it controller - I probably did this already.
16:27:16 <kristenc> I suspect that it has to do with how we update state for the volume.
16:27:26 <kristenc> the sql update probably blows stuff away.
16:27:26 <markus__> But it's not deterministic
16:27:43 <markus__> I get different results each time I run ciao-cli volume list
16:27:50 <markus__> So P1 or P2?
16:28:07 <kristenc> well - I updated the bug to indicate to start with the json. I guess that needs to be done as step 0.
16:28:46 <kristenc> annoying is P2 IMO.
16:28:50 <markus__> Okay.
16:29:19 <tcpepper> if it's a racy thing there could be a correctness implication
16:29:44 <tcpepper> but I guess worst case you lose the name and can't delete by name but can by uuid.  as opposed to deleting extra stuff accidentally.
16:29:58 <markus__> That's right
16:30:05 <markus__> If you know the UUID you can still delete the volume.
16:30:21 <markus__> And the UUID is correctly reported.
16:30:33 <markus__> #info 644 has been labelled a P2.
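
tcpepper's point above, that a field the controller never sent would simply unmarshal to its zero value, can be shown with a minimal Go sketch. The Volume struct here is illustrative, not ciao's actual payload type.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Volume is illustrative only, not ciao's real payload type.
type Volume struct {
	ID   string `json:"id"`
	Name string `json:"name"`
}

func main() {
	// "name" is absent from the JSON, so it unmarshals to the empty string.
	var v Volume
	if err := json.Unmarshal([]byte(`{"id":"67f1c3c9"}`), &v); err != nil {
		panic(err)
	}
	fmt.Printf("id=%q name=%q\n", v.ID, v.Name) // id="67f1c3c9" name=""
}
```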
16:30:44 <markus__> #link https://github.com/01org/ciao/issues/645
16:31:34 <markus__> Basically, the issue here is that live attaching of volumes is not supported for containers.
16:32:01 <markus__> But rather than reporting an error the volume being attached gets stuck in attaching for ever.
16:32:28 <markus__> So it's a negative test case that is not failing correctly.
16:32:28 <kristenc> I would make this a P1 too.
16:32:33 <markus__> Okay.
16:33:10 <markus__> #info 645 has been labelled P1
16:33:23 <markus__> #link https://github.com/01org/ciao/issues/648
16:33:33 <markus__> This one has been annoying rbradford and myself
16:33:41 <markus__> You can run the BAT tests in single VM
16:33:47 <markus__> But not the unit tests out of the box
16:33:58 <markus__> This should really work.
16:34:25 * tcpepper doesn't grok why they don't run
16:34:29 <markus__> Basically, I guess there are certs that aren't in the right places.
16:34:36 <tcpepper> hmmm
16:34:43 <markus__> Some of the unit tests have pre-requisites
16:34:46 <tcpepper> that's strange
16:34:51 <mrcastel> markus__, I do not understand... unit tests should run no matter what
16:34:58 <markus__> From travis
16:34:59 <markus__> # We need to create and install SSNTP certs for the SSNTP and controller tests
16:34:59 <markus__> before_script:
16:34:59 <markus__> - sudo mkdir -p /etc/pki/ciao/
16:35:14 <markus__> mrcastel: I agree but they don't
16:35:18 <tcpepper> if that's not happening in single vm, single vm's not implementing a "real" ciao cluster
16:35:36 <tcpepper> markus__: where's the "N/A" come from?
16:35:42 <markus__> But we shouldn't need a full cluster to run unit tests.
16:36:03 <markus__> tcpepper: It means the tests did not run, usually because something panicked or quit.
16:36:09 <tcpepper> ok
16:36:29 <markus__> We can fix this as part of our cleanup.
16:36:31 <tcpepper> so is the goal to decompose unit tests to actually be unit tests?
16:36:44 <tcpepper> or get cluster function tests (hiding as unit tests) running in single vm?
16:36:44 <markus__> tcpepper:  That would be my preference.
16:37:19 <markus__> It would also help newcomers to the project
16:37:24 <mrcastel> tcpepper, yes... cluster function tests should be runnable in some cut down version of single VM
16:37:34 <markus__> Who don't know they need to copy certs to certain places to get the tests to run
16:37:39 <mrcastel> markus__, that would mean we need to decompose the single VM sequence further
16:37:55 <mrcastel> for example setup the env but not launch all the agents
16:38:00 <markus__> Well, basically, I think you should be able to do
16:38:11 <markus__> go get github.com/01org/ciao
16:38:14 <markus__> go test -v ./...
16:38:19 <markus__> And everything should pass
16:38:24 <tcpepper> that makes sense
16:38:32 <mrcastel> ahhh I see
16:38:34 <markus__> But that doesn't work today
16:38:49 <markus__> Which might confuse people trying ciao for the first time.
16:38:51 <mrcastel> yes.. we can all fix our own components
16:38:56 <mrcastel> we should I mean
16:38:57 <tcpepper> definitely confusing
16:39:04 <markus__> And to be honest, it prevents me from running unit tests locally.
16:39:12 <markus__> I just use travis.
16:39:17 <tcpepper> wdouglas: iirc was talking about this
16:39:31 <markus__> and rbradford as well.
16:39:39 <markus__> So we are confusing our new project members.
16:39:41 <tcpepper> moving the synthetic (or real) cluster function testing to a different framework than the basic "go test" which is unit focused
16:40:17 <markus__> tcpepper: Yes, and maybe mocking out the cluster functionality in our existing tests, if possible.
16:40:31 <rbradford> fwiw, i spent 2 hours trying to get all the unit tests to run in singlevm and I couldn't
16:40:50 <rbradford> the cert setup is similar but different between the two
16:40:53 <kristenc> i use travis all the time now and don't bother testing locally for the most part.
16:41:10 <markus__> That's what I do too.  But the fact that
16:41:15 <rbradford> i wanted to be able to run the unit tests to look at the races
16:41:30 <tcpepper> i do a mix, but have set up my dev machine to match travis' config.  I break semi-regularly and look for commits on the travis.yml
16:41:31 <markus__> go get github.com/01org/ciao
16:41:46 <markus__> go test -v ./...
16:41:50 <markus__> doesn't work is bad
16:41:54 <mrcastel> the networking unit tests for example need priming... I wonder if that can be automated
16:42:25 <tcpepper> if we added helpers for this, called from component TestMain(), could travis and single VM call those helpers?
16:42:32 <markus__> In general are we agreed that this needs to be fixed one way or the other?
16:42:33 <tcpepper> then we'd only have one place implementing the desired setup
16:42:38 <tcpepper> yes agreed
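
tcpepper's TestMain() helper idea might look roughly like the following. setupTestCerts and cleanupTestCerts are hypothetical names, not existing ciao helpers, and the package name is illustrative; the point is that travis, Single VM and a plain "go test -v ./..." would all share one setup path.

```go
package controller_test // package name is illustrative

import (
	"fmt"
	"os"
	"testing"
)

// setupTestCerts and cleanupTestCerts are hypothetical shared helpers that
// would do the cert creation travis currently does in before_script.
func setupTestCerts() error { return nil }

func cleanupTestCerts() {}

// TestMain lets a single helper own the test setup and teardown.
func TestMain(m *testing.M) {
	if err := setupTestCerts(); err != nil {
		fmt.Fprintln(os.Stderr, "test setup failed:", err)
		os.Exit(1)
	}
	code := m.Run()
	cleanupTestCerts()
	os.Exit(code)
}
```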
16:42:52 <markus__> And in terms of priority?
16:42:54 <markus__> P2?
16:42:56 <tcpepper> we need to architect our test strategy more for unit -> functionality -> system
16:43:17 <tcpepper> I'd P1 at least documenting what one needs to do to get "go test -v ./..." to work
16:43:33 <tcpepper> and ensuring travis and single vm implement that documented set of expectations
16:43:38 <markus__> Okay.  I'd also like to see the requirement of running some tests as root removed.
16:43:53 <markus__> If we did this we'd never forget to add packages to the travis test-cases call
16:43:57 <mrcastel> markus__, tcpepper we should say testing should be sudo -E go test -v ./...
16:43:59 <markus__> As we could use a wildcard
16:44:10 <mrcastel> otherwise networking tests will not run
16:44:11 <markus__> Or we could do that I guess
16:44:34 <tcpepper> I do like that the travis has only some things sudo'd
16:44:58 <markus__> I'd prefer to have nothing sudo'ed but I wonder if it's possible
16:44:59 <tcpepper> marking the root required bits, versus lazily doing all as root is good practice
16:45:16 <tcpepper> long term it would be awesome to have nothing in ciao run as root
16:45:23 <markus__> Could we mock the parts of the tests that need root access or is this simply too hard.
16:45:35 <tcpepper> it's system set up again
16:45:43 <markus__> So maybe we could enter another bug for that.
16:45:47 <tcpepper> we can run as a user in a group that has permissions to do the things ciao needs
16:45:48 <markus__> i.e., removing root.
16:45:52 <mrcastel> or...
16:46:00 <mrcastel> we can run the tests in their namespace
16:46:02 <tcpepper> we've articulated it as a security goal in the past
16:46:07 <mrcastel> own namespace
16:46:08 <tcpepper> not sure if it's recorded properly
16:46:20 <mrcastel> not sure if all networking will run... but I think most of it will
16:47:01 <tcpepper> at least we find out what doesn't and can document what it is and why
16:47:02 <markus__> Okay, shall we make this bug
16:47:07 <tcpepper> yes plz :)
16:47:23 <markus__> sudo -E go test -v ./...
16:47:34 <markus__> and enter another bug to remove root?
16:47:54 * tcpepper prematurely said yes...so "yes please" again :)
16:48:03 <tcpepper> I'll open the root one if you want
16:48:03 <markus__> Okay
16:48:34 <markus__> #action tcpepper enter a new bug to cover the fact that some unit tests need to be run as root.
16:48:38 <markus__> thanks
16:48:41 <mrcastel> markus__, tcpepper I will not be able to handle that anytime soon for networking
16:48:44 <markus__> Okay and we decided P1
16:48:50 <markus__> mrcastel: no problems
16:49:05 <mrcastel> then I do not see why it is a P1
16:49:10 <kristenc> me either.
16:49:28 <kristenc> I was just thinking that it needed to be prioritized below all the other stuff we just signed up for in our next sprint.
16:49:39 <markus__> Okay, so P2?
16:49:55 <kristenc> fine with me.
16:50:08 <markus__> #info 648 has been labelled P2.
16:50:10 <tcpepper> I'd even go P3
16:50:20 <markus__> #info 648 has been labelled P3
16:50:50 <markus__> #link https://github.com/01org/ciao/issues/649
16:51:00 <markus__> Oh I entered this one
16:51:11 <markus__> It's needed to get Single VM working in travis
16:51:43 <markus__> I was thinking of using a fixed port for the fake identity service and passing it to the controller via the --single option
16:52:02 <kristenc> markus__, we use httptest util for the identity service I believe.
16:52:07 <kristenc> not sure if it can take a fixed port?
16:52:09 <mrcastel> markus__, kristen said this port is auto created
16:52:29 <mrcastel> kristenc, we tried this.. but I think that does not support a defined port right?
16:52:49 <kristenc> I don't think so.
16:52:54 <mrcastel> I remember us trying to use fixed port..
16:53:02 <markus__> Ah
16:53:06 <kristenc> you can try not using httptest.
16:53:20 <kristenc> that was for ease and expediency
16:54:05 <markus__> It looks to me like you can specify the port
16:54:08 <mrcastel> markus__, is there a good alternative?
16:54:26 <kristenc> markus__, great!
16:54:45 <markus__> Okay, no I think I'm wrong
16:54:51 <markus__> It says a system chosen port
16:55:09 <kristenc> bummer.
16:55:16 <markus__> Anyway, one way or another it should be fixable.
16:55:48 <kristenc> yes, certainly.
16:55:59 <markus__> What does httptest give us that the normal http server does not?
16:56:11 <markus__> Well, we can discuss later.
16:56:14 <kristenc> yeah.
16:56:32 <kristenc> i chose it because it was easy and this is how people do test servers in go.
16:56:37 <kristenc> doesn't mean it's the only way.
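
For reference, httptest can in fact be pinned to a fixed port by handing NewUnstartedServer your own listener. A minimal sketch follows; the port number and handler body are illustrative, not what ciao's fake identity service actually serves.

```go
package main

import (
	"fmt"
	"net"
	"net/http"
	"net/http/httptest"
)

func main() {
	handler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "fake identity response") // placeholder body
	})

	// Bind our own listener on a fixed port instead of the system-chosen one.
	l, err := net.Listen("tcp", "127.0.0.1:35357") // port is illustrative
	if err != nil {
		panic(err)
	}

	srv := httptest.NewUnstartedServer(handler)
	srv.Listener.Close() // drop the auto-allocated listener
	srv.Listener = l
	srv.Start()
	defer srv.Close()

	fmt.Println("fake identity service listening on", srv.URL)
}
```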
16:56:59 <markus__> What is the priority of the Single VM in travis bug
16:57:01 <markus__> ?
16:57:58 <kristenc> wasn't delivering singlevm in travis a pretty high priority?
16:58:03 <mrcastel> Ideally... P1. I think the 5s wait is less of an issue vs the non VT-x CNCI... we can add some sort of polling in the script...
16:58:12 <markus__> Yes, I can't find the bug
16:58:39 <kristenc> lets assume p1.
16:58:41 <mrcastel> Let me spend a day on the non VT-x CNCI next week...
16:58:45 <markus__> Okay
16:58:46 <mrcastel> and keep it a P1
16:59:21 <markus__> #info 649 has been labelled a P1
16:59:38 <markus__> So the hour is up.
16:59:44 <markus__> Should we carry on?
17:00:01 <markus__> Or leave the remaining bugs until next week
17:00:02 <kristenc> meetingbot will be here for 15 more minutes.
17:00:04 <mrcastel> One bug I want to discuss
17:00:08 <tcpepper> I'd like to see us get through more of the storage ones
17:00:20 <markus__> There are 9 bugs we didn't discuss
17:00:21 <mrcastel> markus__, is this a real issue https://github.com/01org/ciao/issues/657
17:00:31 <tcpepper> but markus__, rbradford and I can go through storage later / separate
17:00:31 * kristenc double checks when meetingbot will leave.
17:00:54 <markus__> mrcastel: I'm not sure.  I don't think it's a seg fault
17:01:08 <markus__> I've seen it but it doesn't stop ceph from working
17:01:16 <kristenc> tcpepper, I need to go through the storage stuff with you to make sure all progress was captured.
17:01:41 <mrcastel> yes... but sometimes I see 100s of these when I start single VM
17:01:43 * kristenc doesn't trust self to have updated issues properly
17:01:58 <markus__> mrcastel: Oh, okay, I haven't seen that
17:02:27 <markus__> It could be a race condition between when the container starts and when we execute the first ceph command
17:02:53 <markus__> I noticed today I needed to give the container 10 seconds to start up before I could mount a cephfs file system in single vm
17:03:35 <kristenc> meetingbot will leave at quarter after, so make sure to end meeting before then to get the logs.
17:04:00 <markus__> mrcastel: We need to investigate.  Does it work for you eventually even though you get loads of warnings
17:04:03 <markus__> ?
17:04:20 <mrcastel> I think when I get the stream it locks up
17:04:29 <mrcastel> if I get 1-5 it works
17:04:32 <markus__> mrcastel: So that's bad.
17:04:40 <markus__> Maybe we should make it a P2 then?
17:04:44 <mrcastel> But this was when I had dangling volumes
17:04:54 <markus__> Ah, that might have been the problem
17:04:58 <mrcastel> so maybe diskspace caused ceph create to fail triggering the chain
17:05:06 <markus__> Now you mention it I may have seen something similar.
17:05:50 <markus__> mrcastel: can you check to see if it's still a problem for you?
17:06:00 <mrcastel> Sure.. P3 then
17:06:04 <markus__> Yep
17:06:08 <mrcastel> I have not seen it in a while
17:06:43 <markus__> #action mrcastel check to see whether 657 still occurs
17:06:48 <markus__> #info label 657 as P3
17:07:15 <markus__> So shall we call it a day
17:07:22 <markus__> We only have 8 minutes left
17:07:48 <markus__> I will go through the storage bugs and label them
17:07:53 <kristenc> how will the other bugs get triaged?
17:08:03 <kristenc> we don't want to drop them.
17:08:13 <markus__> We'll do the storage ones separately
17:08:32 <kristenc> oh - are all the others storage?
17:08:44 <kristenc> ok.
17:09:02 <markus__> Most but not all.
17:09:37 <markus__> I'll record the non storage related bugs in the minutes
17:10:05 <markus__> #action tcpepper, markus, rbradford triage storage bugs
17:10:41 <markus__> #action markus add untriaged storage-related bugs to next week's minutes.
17:11:13 <markus__> Okay we better finish up.
17:11:17 <markus__> Don't want to miss the boat