Fix to Issue #57 has unintended consequences #66

peterTorrione · 2018-03-13T15:51:52Z

Checkout this revision: 85b6087

then run:

ds = prtDataGenUnimodal;
class = prtClassFld;
ds = class.kfolds(ds,10);
disp(ds)

You will see ds has userData is a [10 x 1] struct. This bit me in the ass when I was running some cross-validation experiments, because:

ds = class.kfolds(ds,100);
ds = class.kfolds(ds,100);
ds = class.kfolds(ds,100);

Will make a very large userData structure. Because in the first instance we make a ds with a 1x100 userData, and then we make a 1 x (100100) userData, then (1 x (100100*100)) etc. That's... no good.

Note that, as far as I can tell, this does NOT happen in the parent commit: f441f1c

The problem is that userData was never intended to be used to hold information from different cross-validation folds - userData was always intended to be a single, global set of information for the dataSet, and mucking with it inside kfolds breaks that assumption.

The bug above only really matters if you run cross-validaton on the same data set a bunch of times - it is admittedly not the worst problem in the world, but it can be weird. If userData was actually BIG this problem can be significant - I don't necessarily want 1000 copies of userData for my 1000 folds.

I don't know what the right answer is. Usually I'd suggest putting whatever this information that you want to store in userData on every fold in the ACTION (and use the second output of kfolds or crossValidate) but that's clearly sub-optimal and confusing.

We could ADD a new field to dataSets - kfoldsUserData, which is an nFolds x 1 array of userData? Then in cross-Validate, we can change prtDataSetBase.crossValidateCombineFoldResults to do this:

    for i = 1:length(dsTestCell)
         dsOut.kfoldsUserData(i) = dsTestCell(i).userData;
    end

Thoughts?

The text was updated successfully, but these errors were encountered:

peterTorrione · 2018-03-13T16:43:22Z

Actually, Miles just convinced me a better solution is to just provide the individual dataSets out as an optional 3 (crossValidate) or 4th (kfolds) output - then you can do:

   [dsAgg,ascts,dsFolds] = class.kfolds(ds,10);

where dsFolds(1).userData, dsFolds(2).userData, etc. have the right things.

patrickkwang · 2018-03-13T18:51:10Z

Then what should the userData for dsAgg look like?

peterTorrione · 2018-03-13T18:59:39Z

I think its the userData from the original ds? (the input)

cratto · 2018-03-13T19:18:39Z

So my workaround has been to index the very big userData @peterTorrione referenced originally. It gets really big because I am doing 10-fold xval as well as one-vs-all classification. So I have a double for look that goes something like this:

for iClass = 1:nClasses
     for iFold = 1:nFold
        userDataIdx = iClass*(iFold-1)+iClass;
        data_i_need = dsOut.userData(userDataIdx).fieldName
      end
end

I like Miles' suggestion to output the individual data sets from kfolds, and then I can use each data set's observationInfo() instead. We would still need to figure out how to fill observationInfo() in a one-vs-all scheme.

Ultimately I think it would be easiest if my snippet would look more like:

for iClass = 1:nClasses
     for iFold = 1:nFold
        data_i_need = dsOutPerFold(iFold).observationInfo(iClass).fieldName
      end
end

peterTorrione · 2018-03-13T20:13:46Z

Oh holy cow. prtClassBinaryToMary does the same thing! Wow. That's... interesting.

I think I understand the use case. I don't think observationInfo works here, b/c it has to be nObservations x 1. And, I think we always thought a "pure" action shouldn't muck with userData b/c you (the user) might put stuff in userData, and you want it to come out the same way you put it in. I propose one of the following -

The idea above, which basically results in:

[dsAgg,acts,dsOutPerFold] = ...
for iClass = 1:nClasses
     for iFold = 1:nFold
        data_i_need = dsOutPerFold(iFold).userData(iClass).fieldName
      end
end

Per here - #57 I am pretty sure we want to NOT use observationInfo here b/c observationInfo is "special" and gets handled properly, e.g., in ds.retainObservations(...), and has to be nObservations x 1.

To avoid smooshing other people's userData, I would actually do something like

   function self = runAction(self,ds)
        ....
        ds.userData.prtClassFld = struct(all my info I want to store)
   end

If I was doing that in prtClassFld. In general this is probably safe

   function self = runAction(self,ds)
        ....
        ds.userData.(class(self)) = struct(all my info I want to store)
   end

We can also make a new field of prtDataSet* - call it "actionData". actionData is intended to be mucked with by actions and do all of the above with actionData rather than userData. We have to be careful that whenever we make multiple dataSets (e.g., in cross-validate, in prtClassBinaryToMary) we always concatenate actionData arrays if they are not empty I guess?

Thoughts?

THen you would do basically

for iClass = 1:nClasses
     for iFold = 1:nFold
         data_i_need = dsOutPerFold(iFold).actionInfo.prtClassBinaryToMary(iClass).classifierName.fieldName
      end
end

But I'm not 100% sure that's right either, and I need to go pick up my family...

cratto · 2018-03-14T20:13:34Z

I'm on board with either of these. Option 1 satisfies my needs (as does my hack mentioned above), but if Option 2 allows for userData to be used as originally intended, that's fine too.

patrickkwang mentioned this issue Mar 13, 2018

Pass userData to output data set in cross-validation or binaryToMaryOneVsAll #57

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix to Issue #57 has unintended consequences #66

Fix to Issue #57 has unintended consequences #66

peterTorrione commented Mar 13, 2018 •

edited

Loading

peterTorrione commented Mar 13, 2018

patrickkwang commented Mar 13, 2018

peterTorrione commented Mar 13, 2018

cratto commented Mar 13, 2018 •

edited

Loading

peterTorrione commented Mar 13, 2018

cratto commented Mar 14, 2018

Fix to Issue #57 has unintended consequences #66

Fix to Issue #57 has unintended consequences #66

Comments

peterTorrione commented Mar 13, 2018 • edited Loading

peterTorrione commented Mar 13, 2018

patrickkwang commented Mar 13, 2018

peterTorrione commented Mar 13, 2018

cratto commented Mar 13, 2018 • edited Loading

peterTorrione commented Mar 13, 2018

cratto commented Mar 14, 2018

peterTorrione commented Mar 13, 2018 •

edited

Loading

cratto commented Mar 13, 2018 •

edited

Loading