Add pre given feautes to CMIM. #3

alecuba16 · 2016-04-07T14:23:50Z

Hello!

As I told you on mloss, I would like to modify the CMIM to accept a set of predefined features by a input , so as a first approach I will work on this code from CMIM.c:

I think that the first loop on lines 80-94 is mandatory because you have to calculate first the max CMIM of whole the features (including pre-selected), is right?

Imagine that I have already received preSelectedFeatures as an array of the positions of the features presents on feature2D and a auxiliary function isApreSelectedFeature as a simply search and check if exists.

CMIM.c lines 104 ->129

for (i = 1; i < k; i++)
{
score = 0.0;
iMinus = i-1;
for (j = 0; j < noOfFeatures; j++)
{
while ((classMI[j] > score) && (lastUsedFeature[j] < i))
{
----------- my code ----------------------------------
if( isApreSelectedFeature(j,preSelectedFeatures)){
outputFeatures[i] = j;
}else{
---------- finish my code ----------------------------
/double calculateConditionalMutualInformation(double *firstVector, double *targetVector, double *conditionVector, int vectorLength);/
currentFeature = (int) outputFeatures[lastUsedFeature[j]];
conditionalInfo = calculateConditionalMutualInformation(feature2D[j],classColumn,feature2D[currentFeature],noOfSamples);
if (classMI[j] > conditionalInfo)
{
classMI[j] = conditionalInfo;
}/reset classMI/
}
/moved due to C indexing from 0 rather than 1/
lastUsedFeature[j] += 1;
}/while partial score greater than score & not reached last feature/
if (classMI[j] > score)
{
score = classMI[j];
outputFeatures[i] = j;
}/if partial score still greater than score/
}/for number of features/
}/*for the number of feature

What do you think?

Craigacp · 2016-04-07T16:55:01Z

You need to update the score as well, as otherwise the CMIM algorithm doesn't realise those features have been selected so doesn't use their mi values to help select new features (and you might not pop out of the while loop properly). I'd add a loop that added the features before the for(i=1) loop and updated the score function.

alecuba16 · 2016-04-11T12:04:56Z

Thank you for your fast reply, I continued today working on it.

True, I forgot to update the score.

So, changing this is enougth to make it possible?

I don't have to change to modify the mitoolbox entropy function and so on?

So it finally can looks like:

// Added pre-selected score updater and output features adder.
for( my pre-selected features ){
  if( classMI[j] > score ) {
    score = classMI[j];
  }
  outputFeatures[i] = j;
}
//Your original loop line 105
for (i=1 ; i <k;..)
   {
   for(j=0; ...)
      {
     **  if( j is not in my pre-selected features)**
           {
            while()
                  {
                  }
           }
       }
     }

thanks!

alecuba16 · 2016-04-11T13:05:11Z

Sorry for bother you again,

They told me if I can mod in the same way the condMi.

As it looks almost the same structure that CMIM but I think that is easier to mod:

Mod lines 94 to 109

And replace by:

j=0;
for (i = 0; i < MyPreselectedFeatures;i++)
  {    
    /*calculate mutual info
    **double calculateMutualInformation(double *firstVector, double *secondVector, int vectorLength);
    */
    classMI[i] = calculateMutualInformation(feature2D[i], classColumn, noOfSamples);

    if (classMI[i] > maxMI)
    {
      maxMI = classMI[i];
      maxMICounter = i;
    }/*if bigger than current maximum*/
   selectedFeatures[i] = 1;
   outputFeatures[j] = i;
   j++;
  }/*for noOfFeatures - filling classMI*/

Craigacp · 2016-04-11T13:28:26Z

Not quite. You're updating the scores with the class MI which is I(X;Y), but in CMIM you need to update it with min_{s \in preselected features} I(X;Y|X_s), and for CondMI you need to call mergeArrays repeatedly to merge all the preselected features into the condition vector, then start the loop at L119 with i=0 (skipping the loop at L94).

As I said on MLOSS unless you have a very large amount of data then CondMI will perform poorly with any reasonable number of selected features.

You won't have to modify any MIToolbox functions, these operations all happen at the feature selection algorithmic level, it doesn't use different information theoretic operations.

alecuba16 · 2016-04-11T13:48:22Z

Thank for your fast reply,

About CondMI, true, I forgot to add all the preselected in the condition vector, I will fix and let you know.

About CMIM , So as far as I understood I have to find the minimal Mutual information of my preselected, that is almost copy the code from 85 to 94 but with my preselected variables and look for the minimal MI, save the position and use it as the maxMiCounter, isn't it?

Craigacp · 2016-04-11T14:52:58Z

For CMIM you need to make the classMI array have the minimal value so classMI[j] = min_{s \in preselected_features} I(X_j;Y|X_s), and set the lastUsedFeature[j] = s. That ensures the invariant is setup properly for the loop at line 104 to work properly. You should also start that loop with i = |preselected_features| + 1.

alecuba16 · 2016-04-11T15:26:15Z

I think that I catch it.

For the loop I will have to just skip because preselected_features might come interleaved, I mean, maybe the selected features are 2, 4,7 from a total feature list of 1 to 9.

So I will have to skip inside the loop if I find that is already preselected.

I have already told him about the CondMi performance, but they measured different algorithms of your toolbox with 36 to 50 features and the best performing in terms of accuracy of his model is CondMi. But with low number of features the exhaustive method that they have implemented give better results (obvious) but is pretty slow when are selecting more than 10 features with the dataset that they are working.

Since they are using a different type of models and dynamically requieres different features, they asked me to modify your toolbox.

thanks for your help.

Craigacp · 2016-04-11T18:02:28Z

You don't need to skip a feature in the loop for CMIM (or CondMI) as I(X;Y|X) = 0 (for CMIM) and I(X;Y|{S,X}) = 0 (for CondMI). This means they won't be selected multiple times, though it will do slightly more computation.

If it's working for them and they are happy with the performance then by all means use CondMI.

alecuba16 · 2016-04-12T09:43:05Z

Thank you for your support, you almost gave me the code modified 👍

Can I fork your code on my github with my modifications?

Craigacp · 2016-04-12T12:32:00Z

Sure.

alecuba16 · 2016-06-29T12:25:02Z

Hello again!

I'm planning make a port of FEAST to R , as a .C import, there is is some work already done by someone?

Greetings!

Craigacp · 2016-06-29T13:41:19Z

I'm not aware of anyone writing an R wrapper. A couple of people from Drexel University wrote a Python wrapper, you might find that a helpful starting point. Be careful as the FEAST C API expects Fortran style matrices rather than C style ones (due to it's history as a Matlab library).

mbq · 2016-08-02T21:51:25Z

FYI: https://github.com/mbq/praznik

alecuba16 · 2016-08-03T07:53:23Z

I have implemented a general C interface of your library (callable from any other platform) and packaged into a R package and it is working well, I will upload to github later.

I think that I might have found a "bug" or limitation about your original matrix normalization into arrayoperations at the MItoolbox, that let me unable to use your library with my real-data datasets.

When the range that gets the values of the features are pretty big it reserves lots of memory because the vector normalize reserves the multiplication of the numer of states,

My computer has about 12GB free Ram and I cannot run any feature selection without get out of memory even with your original matlab library.

I was waiting to have enough time to provide a full detailed debug report of memory space but meanwhile I send you some details and a subset of the dataset that I'm using (real data) that generates that problem, as an example.

Message on matlab your original library :
The targets -> alarms.csv.zip
The features -> features.csv.zip

Finally, I would ask how much will change the feature selection result if I discretize and normalize previously the features with Zscore.

Craigacp · 2016-08-03T14:03:12Z

That's expected behaviour, you should remap the feature values so they are dense and increasing from 0. It won't change the output of FEAST as the information calculation doesn't actually care about the feature values, it only uses equality not the ordering. I can't get around the memory usage without writing a hashmap and using that to store the probability distributions, or doing linear probing for every single value in the data (which would be incredibly slow).

What else did you have to add to make it useful in C (apart from flipping the arrays so they are columnwise rather than rowwise)?

If you normalize the features in a different way then you'll get a slightly different answer. How different is very difficult to describe as I don't know what your current discretisation is, and figuring out the transform between the two is hard.

Craigacp · 2016-08-03T14:13:01Z

@mbq That R library looks good. I'm not sure exactly what the compilation path is for a C extension in R, is there a reason you have to break apart the header files like that? Would adding more flags to the header files help? Also, we should move this discussion to a different issue.

mbq · 2016-08-03T15:29:31Z

Not really; praznik started as a fast&dirty project to compare FEAST with my own fs methods, so I aimed to avoid touching your code (not to break anything), statically bundle in a single object (to avoid installing libraries on the system level) and to reduce compilation into a single make call (for convenience, in principle R can even handle an entire autotools build, provided that it will make a single so at the end).

For CRAN (official R library repository) compatibility, though, you cannot have any links to stuff like printf and exit, removing them would require appropriate flags. Also some more detailed documentation and some examples would be adviced.

And thumbs up for moving the discussion about an R port elsewhere.

alecuba16 · 2016-08-03T19:50:03Z

@Craigacp Well, the values are continuous ,so if I map to a range from 0 to N I will be bound by the number of samples if each sample is different, Once I return from holidays I will try to map the values and think a way to reduce the number of possible states.

If you want ,you can check my code here:

https://github.com/alecuba16/FEASTR

I have reworked your C-R interface to be as general as posible for others uses and added some additional functions to manage the row by column vectorization or vice versa.

Craigacp · 2016-08-03T22:07:08Z

@alecuba16 FEAST doesn't support continuous values. You need to discretise them first, which will account for the extreme memory usage on your test data. In our experiments we found 10 bins of equal width to work pretty well for most things, but there are plenty of more complicated strategies which might work better. Unfortunately the C & MATLAB interfaces expect doubles for historical reasons rather than the ints they should accept.

Craigacp mentioned this issue Aug 3, 2016

FEAST Portability #4

Closed

Craigacp self-assigned this Dec 6, 2016

alecuba16 mentioned this issue Apr 10, 2017

project status? alecuba16/FEASTR#1

Open

Craigacp closed this as completed May 29, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add pre given feautes to CMIM. #3

Add pre given feautes to CMIM. #3

alecuba16 commented Apr 7, 2016

Craigacp commented Apr 7, 2016

alecuba16 commented Apr 11, 2016

alecuba16 commented Apr 11, 2016

Craigacp commented Apr 11, 2016

alecuba16 commented Apr 11, 2016

Craigacp commented Apr 11, 2016

alecuba16 commented Apr 11, 2016

Craigacp commented Apr 11, 2016

alecuba16 commented Apr 12, 2016

Craigacp commented Apr 12, 2016

alecuba16 commented Jun 29, 2016

Craigacp commented Jun 29, 2016

mbq commented Aug 2, 2016

alecuba16 commented Aug 3, 2016

Craigacp commented Aug 3, 2016 •

edited

Loading

Craigacp commented Aug 3, 2016

mbq commented Aug 3, 2016

alecuba16 commented Aug 3, 2016 •

edited

Loading

Craigacp commented Aug 3, 2016

Add pre given feautes to CMIM. #3

Add pre given feautes to CMIM. #3

Comments

alecuba16 commented Apr 7, 2016

Craigacp commented Apr 7, 2016

alecuba16 commented Apr 11, 2016

alecuba16 commented Apr 11, 2016

Craigacp commented Apr 11, 2016

alecuba16 commented Apr 11, 2016

Craigacp commented Apr 11, 2016

alecuba16 commented Apr 11, 2016

Craigacp commented Apr 11, 2016

alecuba16 commented Apr 12, 2016

Craigacp commented Apr 12, 2016

alecuba16 commented Jun 29, 2016

Craigacp commented Jun 29, 2016

mbq commented Aug 2, 2016

alecuba16 commented Aug 3, 2016

Craigacp commented Aug 3, 2016 • edited Loading

Craigacp commented Aug 3, 2016

mbq commented Aug 3, 2016

alecuba16 commented Aug 3, 2016 • edited Loading

Craigacp commented Aug 3, 2016

Craigacp commented Aug 3, 2016 •

edited

Loading

alecuba16 commented Aug 3, 2016 •

edited

Loading