Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add pre given feautes to CMIM. #3

Closed
alecuba16 opened this issue Apr 7, 2016 · 19 comments
Closed

Add pre given feautes to CMIM. #3

alecuba16 opened this issue Apr 7, 2016 · 19 comments
Assignees

Comments

@alecuba16
Copy link

Hello!

As I told you on mloss, I would like to modify the CMIM to accept a set of predefined features by a input , so as a first approach I will work on this code from CMIM.c:

I think that the first loop on lines 80-94 is mandatory because you have to calculate first the max CMIM of whole the features (including pre-selected), is right?

Imagine that I have already received preSelectedFeatures as an array of the positions of the features presents on feature2D and a auxiliary function isApreSelectedFeature as a simply search and check if exists.

CMIM.c lines 104 ->129

for (i = 1; i < k; i++)
{
score = 0.0;
iMinus = i-1;
for (j = 0; j < noOfFeatures; j++)
{
while ((classMI[j] > score) && (lastUsedFeature[j] < i))
{
----------- my code ----------------------------------
if( isApreSelectedFeature(j,preSelectedFeatures)){
outputFeatures[i] = j;
}else{
---------- finish my code ----------------------------
/double calculateConditionalMutualInformation(double *firstVector, double *targetVector, double *conditionVector, int vectorLength);/
currentFeature = (int) outputFeatures[lastUsedFeature[j]];
conditionalInfo = calculateConditionalMutualInformation(feature2D[j],classColumn,feature2D[currentFeature],noOfSamples);
if (classMI[j] > conditionalInfo)
{
classMI[j] = conditionalInfo;
}/reset classMI/
}
/moved due to C indexing from 0 rather than 1/
lastUsedFeature[j] += 1;
}/while partial score greater than score & not reached last feature/
if (classMI[j] > score)
{
score = classMI[j];
outputFeatures[i] = j;
}/if partial score still greater than score/
}/for number of features/
}/*for the number of feature

What do you think?

@Craigacp
Copy link
Owner

Craigacp commented Apr 7, 2016

You need to update the score as well, as otherwise the CMIM algorithm doesn't realise those features have been selected so doesn't use their mi values to help select new features (and you might not pop out of the while loop properly). I'd add a loop that added the features before the for(i=1) loop and updated the score function.

@alecuba16
Copy link
Author

Thank you for your fast reply, I continued today working on it.

True, I forgot to update the score.

So, changing this is enougth to make it possible?

I don't have to change to modify the mitoolbox entropy function and so on?

So it finally can looks like:

// Added pre-selected score updater and output features adder.
for( my pre-selected features ){
  if( classMI[j] > score ) {
    score = classMI[j];
  }
  outputFeatures[i] = j;
}
//Your original loop line 105
for (i=1 ; i <k;..)
   {
   for(j=0; ...)
      {
     **  if( j is not in my pre-selected features)**
           {
            while()
                  {
                  }
           }
       }
     }

thanks!

@alecuba16
Copy link
Author

Sorry for bother you again,

They told me if I can mod in the same way the condMi.

As it looks almost the same structure that CMIM but I think that is easier to mod:

Mod lines 94 to 109

And replace by:

j=0;
for (i = 0; i < MyPreselectedFeatures;i++)
  {    
    /*calculate mutual info
    **double calculateMutualInformation(double *firstVector, double *secondVector, int vectorLength);
    */
    classMI[i] = calculateMutualInformation(feature2D[i], classColumn, noOfSamples);

    if (classMI[i] > maxMI)
    {
      maxMI = classMI[i];
      maxMICounter = i;
    }/*if bigger than current maximum*/
   selectedFeatures[i] = 1;
   outputFeatures[j] = i;
   j++;
  }/*for noOfFeatures - filling classMI*/

@Craigacp
Copy link
Owner

Not quite. You're updating the scores with the class MI which is I(X;Y), but in CMIM you need to update it with min_{s \in preselected features} I(X;Y|X_s), and for CondMI you need to call mergeArrays repeatedly to merge all the preselected features into the condition vector, then start the loop at L119 with i=0 (skipping the loop at L94).

As I said on MLOSS unless you have a very large amount of data then CondMI will perform poorly with any reasonable number of selected features.

You won't have to modify any MIToolbox functions, these operations all happen at the feature selection algorithmic level, it doesn't use different information theoretic operations.

@alecuba16
Copy link
Author

Thank for your fast reply,

About CondMI, true, I forgot to add all the preselected in the condition vector, I will fix and let you know.

About CMIM , So as far as I understood I have to find the minimal Mutual information of my preselected, that is almost copy the code from 85 to 94 but with my preselected variables and look for the minimal MI, save the position and use it as the maxMiCounter, isn't it?

@Craigacp
Copy link
Owner

For CMIM you need to make the classMI array have the minimal value so classMI[j] = min_{s \in preselected_features} I(X_j;Y|X_s), and set the lastUsedFeature[j] = s. That ensures the invariant is setup properly for the loop at line 104 to work properly. You should also start that loop with i = |preselected_features| + 1.

@alecuba16
Copy link
Author

I think that I catch it.

For the loop I will have to just skip because preselected_features might come interleaved, I mean, maybe the selected features are 2, 4,7 from a total feature list of 1 to 9.

So I will have to skip inside the loop if I find that is already preselected.

I have already told him about the CondMi performance, but they measured different algorithms of your toolbox with 36 to 50 features and the best performing in terms of accuracy of his model is CondMi. But with low number of features the exhaustive method that they have implemented give better results (obvious) but is pretty slow when are selecting more than 10 features with the dataset that they are working.

Since they are using a different type of models and dynamically requieres different features, they asked me to modify your toolbox.

thanks for your help.

@Craigacp
Copy link
Owner

You don't need to skip a feature in the loop for CMIM (or CondMI) as I(X;Y|X) = 0 (for CMIM) and I(X;Y|{S,X}) = 0 (for CondMI). This means they won't be selected multiple times, though it will do slightly more computation.

If it's working for them and they are happy with the performance then by all means use CondMI.

@alecuba16
Copy link
Author

Thank you for your support, you almost gave me the code modified 👍

Can I fork your code on my github with my modifications?

@Craigacp
Copy link
Owner

Sure.

@alecuba16
Copy link
Author

Hello again!

I'm planning make a port of FEAST to R , as a .C import, there is is some work already done by someone?

Greetings!

@Craigacp
Copy link
Owner

I'm not aware of anyone writing an R wrapper. A couple of people from Drexel University wrote a Python wrapper, you might find that a helpful starting point. Be careful as the FEAST C API expects Fortran style matrices rather than C style ones (due to it's history as a Matlab library).

@mbq
Copy link

mbq commented Aug 2, 2016

FYI: https://github.com/mbq/praznik

@alecuba16
Copy link
Author

I have implemented a general C interface of your library (callable from any other platform) and packaged into a R package and it is working well, I will upload to github later.

I think that I might have found a "bug" or limitation about your original matrix normalization into arrayoperations at the MItoolbox, that let me unable to use your library with my real-data datasets.

When the range that gets the values of the features are pretty big it reserves lots of memory because the vector normalize reserves the multiplication of the numer of states,

My computer has about 12GB free Ram and I cannot run any feature selection without get out of memory even with your original matlab library.

I was waiting to have enough time to provide a full detailed debug report of memory space but meanwhile I send you some details and a subset of the dataset that I'm using (real data) that generates that problem, as an example.

Message on matlab your original library : outofmemory
The targets -> alarms.csv.zip
The features -> features.csv.zip

Finally, I would ask how much will change the feature selection result if I discretize and normalize previously the features with Zscore.

@Craigacp
Copy link
Owner

Craigacp commented Aug 3, 2016

That's expected behaviour, you should remap the feature values so they are dense and increasing from 0. It won't change the output of FEAST as the information calculation doesn't actually care about the feature values, it only uses equality not the ordering. I can't get around the memory usage without writing a hashmap and using that to store the probability distributions, or doing linear probing for every single value in the data (which would be incredibly slow).

What else did you have to add to make it useful in C (apart from flipping the arrays so they are columnwise rather than rowwise)?

If you normalize the features in a different way then you'll get a slightly different answer. How different is very difficult to describe as I don't know what your current discretisation is, and figuring out the transform between the two is hard.

@Craigacp
Copy link
Owner

Craigacp commented Aug 3, 2016

@mbq That R library looks good. I'm not sure exactly what the compilation path is for a C extension in R, is there a reason you have to break apart the header files like that? Would adding more flags to the header files help? Also, we should move this discussion to a different issue.

@mbq
Copy link

mbq commented Aug 3, 2016

Not really; praznik started as a fast&dirty project to compare FEAST with my own fs methods, so I aimed to avoid touching your code (not to break anything), statically bundle in a single object (to avoid installing libraries on the system level) and to reduce compilation into a single make call (for convenience, in principle R can even handle an entire autotools build, provided that it will make a single so at the end).

For CRAN (official R library repository) compatibility, though, you cannot have any links to stuff like printf and exit, removing them would require appropriate flags. Also some more detailed documentation and some examples would be adviced.

And thumbs up for moving the discussion about an R port elsewhere.

@alecuba16
Copy link
Author

alecuba16 commented Aug 3, 2016

@Craigacp Well, the values are continuous ,so if I map to a range from 0 to N I will be bound by the number of samples if each sample is different, Once I return from holidays I will try to map the values and think a way to reduce the number of possible states.

If you want ,you can check my code here:

https://github.com/alecuba16/FEASTR

I have reworked your C-R interface to be as general as posible for others uses and added some additional functions to manage the row by column vectorization or vice versa.

@Craigacp
Copy link
Owner

Craigacp commented Aug 3, 2016

@alecuba16 FEAST doesn't support continuous values. You need to discretise them first, which will account for the extreme memory usage on your test data. In our experiments we found 10 bins of equal width to work pretty well for most things, but there are plenty of more complicated strategies which might work better. Unfortunately the C & MATLAB interfaces expect doubles for historical reasons rather than the ints they should accept.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants