TCEP dataset incoherent with 'official' version? #52

Open
ArnoVel opened this issue Oct 9, 2019 · 4 comments

ArnoVel commented Oct 9, 2019

Hi,
After I opened the issue about the labels all being set to 1, I checked the TCEP reference website to identify which pairs had been permuted.

I stumbled across something strange: in the original dataset, some of the variables are multivariate. This can be seen, for example, in pair 54 or pair 71.

However, in the current version of cdt, when checking these two pairs, one finds 1D variables.

> data.iloc[53]
A    [43.51, 41.33, 36.78, -8.82, 34.61, 40.11, 12....
B    [42.0, 75.0, 69.0, 42.0, 76.0, 72.0, 77.0, 81....
Name: pair54, dtype: object
> data.iloc[53]['A']
array([ 43.51,  41.33,  36.78,  -8.82,  34.61,  40.11,  12.52, -35.18,
        48.12,  40.24,  25.4 ,  26.19,  23.71,  13.09,  53.97,  50.83,
        17.25,   6.48,  27.44, -16.3 ,  43.86, -24.66, -15.78,   4.94,
        42.71,  12.35,  -3.38,  11.54,   3.86,  45.42,   4.36,  12.11,
        49.42, -33.45,  31.14, -11.7 ,  -4.25,  -4.33,   9.92,   5.33,
        45.8 ,  23.  ,  35.17,  50.08,  55.68,  11.58,  18.48,  -0.23,
        30.06,  13.7 ,   3.75,  15.33,  59.43,   9.  ,   6.92, -18.14,
        60.17,  48.86,   4.93, -17.54,   0.39,  13.44,  41.7 ,  52.52,
         5.54,  37.97,  12.05,  16.  ,  13.47,  14.62,   9.54,  11.86,
         6.8 ,  18.54,  14.08,  22.3 ,  47.5 ,  64.14,  28.63,  -6.19,
        35.71,  33.32,  53.34,  31.79,  41.9 ,  18.  ,  35.68,  31.94,
        51.18,  -1.28,  39.02,  37.51,  29.37,  42.87,  17.97,  56.95,
        33.89, -29.3 ,   6.31,  32.88,  54.69,  49.61,  22.18,  42.  ,
       -18.92, -13.99,   3.15,   4.17,  12.65,  35.9 ,  14.6 ,  18.07,
       -20.16,  19.42,  47.91,  42.46,  33.99, -25.97,  19.74, -22.57,
        27.71,  52.37,  12.1 , -22.28, -41.29,  12.15,  13.52,   9.06,
        59.91,  23.61,  33.68,  31.88,   8.99,  -9.47, -25.3 , -12.09,
        14.58,  52.22,  38.71,  18.45,  25.29,  47.01, -20.87,  44.45,
        55.76,  27.15,  14.  , -13.83,   0.34,  24.67,  14.7 ,  44.8 ,
         8.47,   1.29,  48.21,  46.05,  -9.43,   2.04, -25.75,  40.42,
         6.92,  13.2 ,  15.63,   5.82, -26.32,  59.33,  46.95,  33.52,
        38.57,  -6.17,  13.76,  -8.57,   6.12, -21.14,  10.66,  36.81,
        39.94,  37.95,   0.31,  50.44,  24.48,  51.5 ,  38.89,  18.34,
       -34.89,  41.32, -17.74,  10.5 ,  21.03,  15.36, -15.41, -17.82])

The same holds for pair 71. Is this a mistake, or just a shuffling of the data?
I made sure to set shuffle=False before checking the two pairs.
If the basic (non-shuffled) dataset is already shuffled, or has been pre-processed in some way to reduce dimensionality, can we have some explanation of how the two datasets relate to each other?
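
For anyone who wants to repeat this check over every pair rather than just two, here is a minimal sketch. It assumes each variable is stored as a sequence of samples, as in the output above. The helper names `sample_dim` and `multivariate_pairs` are hypothetical (they are not part of cdt), and the toy data is made up rather than taken from TCEP.

```python
def sample_dim(values):
    """Return the per-sample dimension of one variable.

    A scalar sample counts as dimension 1; a sample that is itself a
    sequence (list, tuple, array row) counts as its length.
    """
    first = values[0]
    return len(first) if hasattr(first, "__len__") else 1


def multivariate_pairs(pairs):
    """Return the names of pairs where either variable has dimension > 1.

    `pairs` maps a pair name to a dict with keys 'A' and 'B'.
    """
    return [
        name
        for name, row in pairs.items()
        if sample_dim(row["A"]) > 1 or sample_dim(row["B"]) > 1
    ]


# Toy stand-in data (made up for illustration, not real TCEP values).
toy = {
    "pair_x": {"A": [43.51, 41.33, 36.78], "B": [42.0, 75.0, 69.0]},
    "pair_y": {"A": [[1.0, 2.0], [3.0, 4.0]], "B": [0.5, 0.7]},
}
print(multivariate_pairs(toy))
```

With a DataFrame shaped like `data` above, `data.to_dict(orient="index")` should give a mapping of this form. An empty result would confirm that no multivariate variable is left in the packaged version.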

Any amount of information would help,
Thanks

diviyank added the Investigation label (Investigation of a possible bug) Oct 14, 2019
@diviyank
Collaborator

Hi,
This is concerning, I will look at how I got this version of the TCEP and come back to you.

Best,
Diviyan


ArnoVel commented Oct 28, 2019

Hi,
Would it be possible to have some kind of update?
Thanks!


ArnoVel commented Nov 12, 2019

Hi,
I checked the official website in detail. It is highly likely that the current CDT TCEP version is
simply the current TCEP (with 108 pairs) with the multivariate ones removed.
This seems likely because it leaves 99 pairs, which is the current length of the CDT TCEP.
However I did not take the time to check if the two match.
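
A rough sketch of how that match could be checked, for reference. It assumes both versions can be loaded into a mapping from pair name to an `(A, B)` tuple of samples, and that pair names agree between the two sources; neither assumption is verified here. The function names are hypothetical and the toy data is made up.

```python
import math


def is_univariate(values):
    """True when every sample is a scalar rather than a sequence."""
    return not any(hasattr(v, "__len__") for v in values)


def same_samples(xs, ys, tol=1e-9):
    """Element-wise comparison of two scalar sample sequences."""
    return len(xs) == len(ys) and all(
        math.isclose(x, y, abs_tol=tol) for x, y in zip(xs, ys)
    )


def mismatched_pairs(official, packaged):
    """Names of univariate official pairs that are missing from, or
    differ from, the packaged pairs."""
    bad = []
    for name, (a, b) in official.items():
        if not (is_univariate(a) and is_univariate(b)):
            continue  # multivariate pairs are expected to be absent
        if name not in packaged:
            bad.append(name)
            continue
        pa, pb = packaged[name]
        if not (same_samples(a, pa) and same_samples(b, pb)):
            bad.append(name)
    return bad


# Toy stand-in data (made up): p2 is multivariate, p3 differs in B.
official = {
    "p1": ([1.0, 2.0], [3.0, 4.0]),
    "p2": ([[1.0, 2.0], [3.0, 4.0]], [0.1, 0.2]),
    "p3": ([5.0], [6.0]),
}
packaged = {
    "p1": ([1.0, 2.0], [3.0, 4.0]),
    "p3": ([5.0], [6.5]),
}
print(mismatched_pairs(official, packaged))
```

If the hypothesis holds, the result on the real data would be an empty list, and the number of univariate official pairs would equal the 99 pairs in the CDT version (which implies 108 - 99 = 9 multivariate pairs).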
Regards,
A.V

@diviyank
Collaborator

Hi,
Sorry for the delay, I have been quite busy lately.
Thanks for looking into it; that does indeed seem to be the case. I just checked all the pairs and they match.
I will add a separate dataset containing all the pairs (including the multivariate ones), and keep the current one as it is, because most of the algorithms do not support multivariate variables.
Best regards,
Diviyan

diviyank added the enhancement label and removed the Investigation label (Investigation of a possible bug) Nov 21, 2019