Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix to Issue #57 has unintended consequences #66

Open
peterTorrione opened this issue Mar 13, 2018 · 6 comments
Open

Fix to Issue #57 has unintended consequences #66

peterTorrione opened this issue Mar 13, 2018 · 6 comments

Comments

@peterTorrione
Copy link
Collaborator

peterTorrione commented Mar 13, 2018

Checkout this revision: 85b6087

then run:

ds = prtDataGenUnimodal;
class = prtClassFld;
ds = class.kfolds(ds,10);
disp(ds)

You will see ds has userData is a [10 x 1] struct. This bit me in the ass when I was running some cross-validation experiments, because:

ds = class.kfolds(ds,100);
ds = class.kfolds(ds,100);
ds = class.kfolds(ds,100);

Will make a very large userData structure. Because in the first instance we make a ds with a 1x100 userData, and then we make a 1 x (100100) userData, then (1 x (100100*100)) etc. That's... no good.

Note that, as far as I can tell, this does NOT happen in the parent commit: f441f1c

The problem is that userData was never intended to be used to hold information from different cross-validation folds - userData was always intended to be a single, global set of information for the dataSet, and mucking with it inside kfolds breaks that assumption.

The bug above only really matters if you run cross-validaton on the same data set a bunch of times - it is admittedly not the worst problem in the world, but it can be weird. If userData was actually BIG this problem can be significant - I don't necessarily want 1000 copies of userData for my 1000 folds.

I don't know what the right answer is. Usually I'd suggest putting whatever this information that you want to store in userData on every fold in the ACTION (and use the second output of kfolds or crossValidate) but that's clearly sub-optimal and confusing.

We could ADD a new field to dataSets - kfoldsUserData, which is an nFolds x 1 array of userData? Then in cross-Validate, we can change prtDataSetBase.crossValidateCombineFoldResults to do this:

    for i = 1:length(dsTestCell)
         dsOut.kfoldsUserData(i) = dsTestCell(i).userData;
    end

Thoughts?

@peterTorrione
Copy link
Collaborator Author

Actually, Miles just convinced me a better solution is to just provide the individual dataSets out as an optional 3 (crossValidate) or 4th (kfolds) output - then you can do:

   [dsAgg,ascts,dsFolds] = class.kfolds(ds,10);

where dsFolds(1).userData, dsFolds(2).userData, etc. have the right things.

@patrickkwang
Copy link
Collaborator

Then what should the userData for dsAgg look like?

@peterTorrione
Copy link
Collaborator Author

I think its the userData from the original ds? (the input)

@cratto
Copy link

cratto commented Mar 13, 2018

So my workaround has been to index the very big userData @peterTorrione referenced originally. It gets really big because I am doing 10-fold xval as well as one-vs-all classification. So I have a double for look that goes something like this:

for iClass = 1:nClasses
     for iFold = 1:nFold
        userDataIdx = iClass*(iFold-1)+iClass;
        data_i_need = dsOut.userData(userDataIdx).fieldName
      end
end

I like Miles' suggestion to output the individual data sets from kfolds, and then I can use each data set's observationInfo() instead. We would still need to figure out how to fill observationInfo() in a one-vs-all scheme.

Ultimately I think it would be easiest if my snippet would look more like:

for iClass = 1:nClasses
     for iFold = 1:nFold
        data_i_need = dsOutPerFold(iFold).observationInfo(iClass).fieldName
      end
end

@peterTorrione
Copy link
Collaborator Author

Oh holy cow. prtClassBinaryToMary does the same thing! Wow. That's... interesting.

I think I understand the use case. I don't think observationInfo works here, b/c it has to be nObservations x 1. And, I think we always thought a "pure" action shouldn't muck with userData b/c you (the user) might put stuff in userData, and you want it to come out the same way you put it in. I propose one of the following -

  1. The idea above, which basically results in:
[dsAgg,acts,dsOutPerFold] = ...
for iClass = 1:nClasses
     for iFold = 1:nFold
        data_i_need = dsOutPerFold(iFold).userData(iClass).fieldName
      end
end

Per here - #57 I am pretty sure we want to NOT use observationInfo here b/c observationInfo is "special" and gets handled properly, e.g., in ds.retainObservations(...), and has to be nObservations x 1.

To avoid smooshing other people's userData, I would actually do something like

   function self = runAction(self,ds)
        ....
        ds.userData.prtClassFld = struct(all my info I want to store)
   end

If I was doing that in prtClassFld. In general this is probably safe

   function self = runAction(self,ds)
        ....
        ds.userData.(class(self)) = struct(all my info I want to store)
   end
  1. We can also make a new field of prtDataSet* - call it "actionData". actionData is intended to be mucked with by actions and do all of the above with actionData rather than userData. We have to be careful that whenever we make multiple dataSets (e.g., in cross-validate, in prtClassBinaryToMary) we always concatenate actionData arrays if they are not empty I guess?

Thoughts?

THen you would do basically

for iClass = 1:nClasses
     for iFold = 1:nFold
         data_i_need = dsOutPerFold(iFold).actionInfo.prtClassBinaryToMary(iClass).classifierName.fieldName
      end
end

But I'm not 100% sure that's right either, and I need to go pick up my family...

@cratto
Copy link

cratto commented Mar 14, 2018

I'm on board with either of these. Option 1 satisfies my needs (as does my hack mentioned above), but if Option 2 allows for userData to be used as originally intended, that's fine too.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants