Fix to Issue #57 has unintended consequences #66
Actually, Miles just convinced me that a better solution is to provide the individual dataSets as an optional 3rd (crossValidate) or 4th (kfolds) output. Then you can do: [dsAgg,ascts,dsFolds] = class.kfolds(ds,10); where dsFolds(1).userData, dsFolds(2).userData, etc. have the right things.
Then what should the userData for dsAgg look like?
I think it's the userData from the original ds? (the input)
So my workaround has been to index the very big userData @peterTorrione referenced originally. It gets really big because I am doing 10-fold xval as well as one-vs-all classification. So I have a double for loop that goes something like this:
I like Miles' suggestion to output the individual data sets from kfolds, and then I can use each data set's observationInfo() instead. We would still need to figure out how to fill observationInfo() in a one-vs-all scheme. Ultimately I think it would be easiest if my snippet would look more like:
Oh holy cow. prtClassBinaryToMary does the same thing! Wow. That's... interesting. I think I understand the use case. I don't think observationInfo works here, b/c it has to be nObservations x 1. And, I think we always thought a "pure" action shouldn't muck with userData b/c you (the user) might put stuff in userData, and you want it to come out the same way you put it in. I propose one of the following -
    [dsAgg,acts,dsOutPerFold] = ...
    for iClass = 1:nClasses
        for iFold = 1:nFold
            data_i_need = dsOutPerFold(iFold).userData(iClass).fieldName
        end
    end

Per here - #57 I am pretty sure we want to NOT use observationInfo here b/c observationInfo is "special" and gets handled properly, e.g., in ds.retainObservations(...), and has to be nObservations x 1. To avoid smooshing other people's userData, I would actually do something like

    function self = runAction(self,ds)
        ....
        ds.userData.prtClassFld = struct(all my info I want to store)
    end

If I was doing that in prtClassFld. In general this is probably safe

    function self = runAction(self,ds)
        ....
        ds.userData.(class(self)) = struct(all my info I want to store)
    end
Thoughts? Then you would basically do

    for iClass = 1:nClasses
        for iFold = 1:nFold
            data_i_need = dsOutPerFold(iFold).actionInfo.prtClassBinaryToMary(iClass).classifierName.fieldName
        end
    end

But I'm not 100% sure that's right either, and I need to go pick up my family...
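The `ds.userData.(class(self))` idea above relies on MATLAB's dynamic field names. A minimal sketch of that namespacing pattern follows, using a plain struct in place of a real data set. The variable `actionName`, the field `myNotes`, and the `nIterations` payload are made up for illustration and are not part of the PRT.

```matlab
% Plain struct standing in for a data set; this is NOT a real prtDataSet.
ds = struct();
ds.userData.myNotes = 'set by the user';   % content the user put in beforehand

% Hypothetical stand-in for class(self) inside an action.
actionName = 'prtClassFld';

% Dynamic field name: the action writes only under its own key,
% so whatever the user already stored is left alone.
ds.userData.(actionName) = struct('nIterations', 12);  % made-up payload

% Both the user's field and the action's field are now present.
disp(fieldnames(ds.userData));
```

Because each action writes under a key derived from its own class name, two different actions would not overwrite each other, although two instances of the same class still would.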
I'm on board with either of these. Option 1 satisfies my needs (as does my hack mentioned above), but if Option 2 allows for userData to be used as originally intended, that's fine too.
Checkout this revision: 85b6087
then run:
You will see that ds has a userData that is a [10 x 1] struct. This bit me in the ass when I was running some cross-validation experiments, because:
Will make a very large userData structure. Because in the first instance we make a ds with a 1x100 userData, and then we make a 1 x (100*100) userData, then (1 x (100*100*100)) etc. That's... no good.
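The multiplicative growth described above can be reproduced with a small plain-MATLAB simulation. It assumes, as the description suggests, that each cross-validation run stores one copy of the incoming userData per fold. It does not call the PRT, and it uses 10 folds instead of 100 to keep the arrays small.

```matlab
% Simulation only: assumes every run keeps one copy of the incoming
% userData per fold. No PRT functions are called.
nFolds = 10;                    % folds per run (the text above uses 100)
nRuns = 3;                      % times the same data set is re-run
userData = struct('info', 1);   % starts as a single 1 x 1 struct

for iRun = 1:nRuns
    % One copy of the current userData per fold: the element count
    % is multiplied by nFolds on every run.
    userData = repmat(userData, 1, nFolds);
    fprintf('run %d: userData is 1 x %d\n', iRun, numel(userData));
end
```

After `nRuns` runs the array holds `nFolds^nRuns` elements, which is why repeated cross-validation on the same data set gets out of hand so quickly.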
Note that, as far as I can tell, this does NOT happen in the parent commit: f441f1c
The problem is that userData was never intended to be used to hold information from different cross-validation folds - userData was always intended to be a single, global set of information for the dataSet, and mucking with it inside kfolds breaks that assumption.
The bug above only really matters if you run cross-validation on the same data set a bunch of times - it is admittedly not the worst problem in the world, but it can be weird. If userData is actually BIG this problem can be significant - I don't necessarily want 1000 copies of userData for my 1000 folds.
I don't know what the right answer is. Usually I'd suggest putting whatever information you want to store in userData on every fold in the ACTION (and using the second output of kfolds or crossValidate), but that's clearly sub-optimal and confusing.
We could ADD a new field to dataSets - kfoldsUserData, which is an nFolds x 1 array of userData? Then in crossValidate, we can change prtDataSetBase.crossValidateCombineFoldResults to do this:
Thoughts?
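To make the shape of the kfoldsUserData proposal concrete, here is a rough plain-MATLAB sketch of the data layout only. It is not the change to prtDataSetBase.crossValidateCombineFoldResults referred to above. The structs `dsIn` and `dsAgg` and the `foldIndex` field are hypothetical stand-ins, and kfoldsUserData is the proposed field, not an existing one.

```matlab
% Plain structs standing in for data sets; these are NOT prtDataSet objects.
dsIn = struct();
dsIn.userData = struct('experiment', 'run A');   % global, user-owned info

nFolds = 10;

% Hypothetical per-fold info, e.g. something recorded while testing a fold.
foldInfo = cell(nFolds, 1);
for iFold = 1:nFolds
    foldInfo{iFold} = struct('foldIndex', iFold);
end

% Aggregate result: userData passes through unchanged, and the per-fold
% info lives in the separate (proposed) kfoldsUserData field.
dsAgg = struct();
dsAgg.userData = dsIn.userData;
dsAgg.kfoldsUserData = vertcat(foldInfo{:});     % nFolds x 1 struct array
```

With this layout, re-running cross-validation on the output would overwrite kfoldsUserData with a fresh nFolds x 1 array instead of multiplying the size of userData.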