
DETECT exploratory procedure nclusters argument #15

Closed
czopluoglu opened this issue Aug 30, 2020 · 7 comments

czopluoglu commented Aug 30, 2020

Hi again,

Sorry to bother you with so many questions. As I indicated before, I am trying to make sure this package does what the original DETECT program does.

There is an nclusters argument for the expl.detect() function. It seems that the user is expected to specify the number of clusters before running the analysis.

This is a bit confusing. Based on my understanding, the purpose of running the exploratory DETECT procedure is to explore and decide the number of clusters, so it is unexpected to ask the user to specify that before the analysis.

In the original DETECT program, you only specify the maximum number of clusters allowed (say 10). Then, when the exploratory DETECT runs, it searches the whole space for nclusters = 2, nclusters = 3, nclusters = 4, ..., nclusters = 10. In the end, it returns the solution with the maximum DETECT value. The user can then see how many clusters are supported and which items are assigned to each cluster in the optimized solution. It is like deciding the number of factors.
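The search described above can be sketched in a few lines. This is only an illustration of the maximize-over-k logic: `partition_items` and `detect_value` are hypothetical stand-ins, not the real procedure (the real program scores partitions with the DETECT index computed from conditional covariances).

```python
# Sketch of the exploratory search: try every candidate cluster count k
# up to a user-supplied maximum, score each resulting partition, and keep
# the partition with the largest score.

def partition_items(items, k):
    # Hypothetical stand-in: assign items to k clusters round-robin.
    # The real procedure searches over partitions for each k.
    return {item: i % k for i, item in enumerate(items)}

def detect_value(partition):
    # Hypothetical stand-in for the DETECT index of a partition.
    # This toy score peaks at 4 clusters, for illustration only.
    k = len(set(partition.values()))
    return -(k - 4) ** 2

def exploratory_search(items, max_clusters):
    best_k, best_partition, best_value = None, None, float("-inf")
    for k in range(2, max_clusters + 1):
        partition = partition_items(items, k)
        value = detect_value(partition)
        if value > best_value:
            best_k, best_partition, best_value = k, partition, value
    return best_k, best_partition, best_value

items = [f"item{i}" for i in range(1, 26)]
k, partition, value = exploratory_search(items, max_clusters=10)
print(k)  # 4: the k that maximizes the (toy) DETECT value
```

The key point is that the user supplies only the upper bound (here 10); the optimal number of clusters is an output of the search, not an input.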

The way sirt runs the exploratory DETECT seems different. Please let me know if I am missing anything, or I am interpreting inaccurately.

Thank you.

alexanderrobitzsch (Owner) commented Aug 30, 2020

nclusters is the maximum number of clusters in sirt::expl.detect. I think that this function follows the original DETECT approach. I suppose the description in the manual was confusing because nclusters was not explicitly described as the maximum number of clusters.

czopluoglu (Author) commented Aug 30, 2020

I don't think it works as intended.

It always reports the optimal cluster size as whatever the user specified for the nclusters argument.

I similarly ran the exploratory procedure on the TIMSS data in the package, specifying the maximum nclusters as 10. Here is the output from the original software.

timss_exploratory.txt

-------------------------------------------------------
                  DETECT SUMMARY OUTPUT
-------------------------------------------------------
 
                    Data File Name: B:\UM Teaching\EPS707\Spr19\Tutorials\DIMTEST-PolyDIMTEST\timssdata.dat                                                 

              Number of Items used:       25

           Number of Items dropped:        0

               Number of Examinees:      345

                 Minimum Number of 
                Examinees per Cell:        2

         Number of Vectors Mutated:        5

      Maximum Number of Dimensions:       10

                Randomization Seed:    99991

   Minimum percentage of examinees
         used after deleting cells
     having less than  2 examinees:    99.13

   Average percentage of examinees
         used after deleting cells
     having less than  2 examinees:    99.25
 
-------------------------------------------------------

 NUMBER OF DIMENSIONS THAT MAXIMIZE DETECT:  4
 
  Exploratory DETECT Statistics:

              Maximum DETECT value:   0.4365
                   IDN index value:   0.6933
                           Ratio r:   0.5840

 PARTITION WITH MAXIMUM DETECT VALUE:

    1    1    1    2    1    1    1    1    1    3
    2    2    3    4    3    4    2    2    2    2
    2    2    4    4    4

 CLUSTER MEMBERSHIPS:

   -----------CLUSTER  1-------------

    1    2    3    5    6    7    8    9

   -----------CLUSTER  2-------------

    4   11   12   17   18   19   20   21   22

   -----------CLUSTER  3-------------

   10   13   15

   -----------CLUSTER  4-------------

   14   16   23   24   25
 
   ----------------------------------
 
  Covariance Sign Pattern Matrix:
   d+++++++++--+----++-+----
   +d++-+++------++---++-+--
   ++d++++++-++-----+-++----
   +++d++++----+---+----+--+
   +-++d++-----++-+++-------
   +++++d++--+----------+-++
   ++++++d+-+-+----+---+-+++
   ++++-++d+---++---+-+-++--
   +-+----+d+-++++-+-++---++
   +-----+-+d++++++----+--+-
   --+--+---+d+++++--+-+++--
   --+---+-+++d++-+--++++++-
   +--++--+++++d-+-+-++-+-++
   ----+--+++++-d++++---++--
   -+------+++-++d++++-+--+-
   -+--+----+++-++d+--------
   ---++-+-+---++++d--++---+
   +-+-+--+-----++--d-+-+-+-
   +-------+-+++-+---d+--+--
   -++----++--++---+++d---+-
   +++---+--+++--+-+---d+++-
   ---+-+-+--++++---+--+d+++
   -+----++--++-+----+-++d-+
   -----++-++-++-+--+-+++-d+
   ---+-++-+---+---+----+++d
 
 No cross validation for this DETECT run

As you can see, it finds that the optimal number of clusters is 4 and reports the identified item clusters with a DETECT value of 0.4365.

When I run the following code in R, it gives me this.

data(data.timss)
dat <- data.timss$data
dat <- dat[, substring( colnames(dat),1,1)=="M" ]
iteminfo <- data.timss$item

expl.detect(data      = dat, 
            score     = rowSums(dat), 
            nclusters = 10,
            N.est     = 345,
            seed      = NULL,
            use_sum_score=TRUE)

The output is below. I just can't make sense of it. It reports the optimal cluster size as 10 (if you change nclusters to 5, it reports the optimal cluster size as 5). Also, the numbers it reports for the 4-cluster solution don't look the same.

Pairwise Estimation of Conditional Covariances
...........................................................
Nonparametric ICC estimation 
 5% 10% 15% 20% 25% 30% 35% 40% 45% 50% 
55% 60% 65% 70% 75% 80% 85% 90% 95% 
...........................................................
Nonparametric Estimation of conditional covariances 
 5% 10% 15% 20% 25% 30% 35% 40% 45% 50% 
55% 60% 65% 70% 75% 80% 85% 90% 95% 


DETECT (unweighted)

Optimal Cluster Size is  10  (Maximum of DETECT Index)

  N.Cluster N.items N.est N.val        size.cluster DETECT.est ASSI.est
1         2      25   345     0               10-15      0.287    0.020
2         3      25   345     0             10-10-5      1.210    0.353
3         4      25   345     0            6-4-10-5      1.648    0.513
4         5      25   345     0           6-4-7-5-3      2.109    0.653
5         6      25   345     0         6-4-4-5-3-3      2.360    0.733
6         7      25   345     0       6-1-4-3-5-3-3      2.425    0.753
7         8      25   345     0     6-1-4-3-5-3-2-1      2.473    0.767
8         9      25   345     0   4-1-4-2-3-5-3-2-1      2.592    0.820
9        10      25   345     0 4-1-3-2-3-5-1-3-2-1      2.645    0.840
  RATIO.est MADCOV100.est MCOV100.est
1     0.100          2.87      -2.862
2     0.422          2.87      -2.862
3     0.574          2.87      -2.862
4     0.735          2.87      -2.862
5     0.822          2.87      -2.862
6     0.845          2.87      -2.862
7     0.861          2.87      -2.862
8     0.903          2.87      -2.862
9     0.921          2.87      -2.862
$detect.unweighted
     DETECT.est  ASSI.est  RATIO.est MADCOV100.est MCOV100.est
Cl2   0.2865966 0.0200000 0.09984716      2.870353   -2.862047
Cl3   1.2102761 0.3533333 0.42164716      2.870353   -2.862047
Cl4   1.6475366 0.5133333 0.57398402      2.870353   -2.862047
Cl5   2.1085820 0.6533333 0.73460727      2.870353   -2.862047
Cl6   2.3602305 0.7333333 0.82227892      2.870353   -2.862047
Cl7   2.4250036 0.7533333 0.84484518      2.870353   -2.862047
Cl8   2.4725198 0.7666667 0.86139930      2.870353   -2.862047
Cl9   2.5919150 0.8200000 0.90299530      2.870353   -2.862047
Cl10  2.6449019 0.8400000 0.92145538      2.870353   -2.862047

$detect.weighted
     DETECT.est  ASSI.est  RATIO.est MADCOV100.est MCOV100.est
Cl2   0.2865966 0.0200000 0.09984716      2.870353   -2.862047
Cl3   1.2102761 0.3533333 0.42164716      2.870353   -2.862047
Cl4   1.6475366 0.5133333 0.57398402      2.870353   -2.862047
Cl5   2.1085820 0.6533333 0.73460727      2.870353   -2.862047
Cl6   2.3602305 0.7333333 0.82227892      2.870353   -2.862047
Cl7   2.4250036 0.7533333 0.84484518      2.870353   -2.862047
Cl8   2.4725198 0.7666667 0.86139930      2.870353   -2.862047
Cl9   2.5919150 0.8200000 0.90299530      2.870353   -2.862047
Cl10  2.6449019 0.8400000 0.92145538      2.870353   -2.862047

$clusterfit

Call:
stats::hclust(d = d, method = "ward.D")

Cluster method   : ward.D 
Number of objects: 25 


$itemcluster
       item cluster2 cluster3 cluster4 cluster5 cluster6 cluster7 cluster8
1   M031286        1        1        1        1        1        1        1
2   M031106        1        1        2        2        2        2        2
3   M031282        1        1        1        1        1        1        1
4   M031227        2        2        3        3        3        3        3
5   M031335        1        1        1        1        1        1        1
6   M031068        1        1        1        1        1        1        1
7   M031299        1        1        1        1        1        1        1
8   M031301        1        1        1        1        1        1        1
9   M031271        1        1        2        2        2        4        4
10  M031134        1        1        2        2        2        4        4
11  M031045        2        2        3        3        3        3        3
12  M041014        2        3        4        4        4        5        5
13  M041039        2        2        3        3        3        3        3
14  M041278        2        2        3        5        5        6        6
15  M041006        1        1        2        2        2        4        4
16  M041250        2        2        3        3        6        7        7
17  M041094        2        2        3        3        6        7        7
18  M041330        2        2        3        3        3        3        3
19 M041300A        2        3        4        4        4        5        5
20 M041300B        2        3        4        4        4        5        5
21 M041300C        2        3        4        4        4        5        5
22 M041300D        2        3        4        4        4        5        5
23  M041173        2        2        3        3        6        7        8
24  M041274        2        2        3        5        5        6        6
25  M041203        2        2        3        5        5        6        6
   cluster9 cluster10
1         1         1
2         2         2
3         1         1
4         3         3
5         4         4
6         1         1
7         4         4
8         1         1
9         5         5
10        5         5
11        3         3
12        6         6
13        3         7
14        7         8
15        5         5
16        8         9
17        8         9
18        3         3
19        6         6
20        6         6
21        6         6
22        6         6
23        9        10
24        7         8
25        7         8

czopluoglu (Author) commented Aug 30, 2020

Is the smooth argument valid for the exploratory analysis? When I tried smooth=FALSE, it said there is no such argument.

It may be skewing all these numbers.

alexanderrobitzsch (Owner) commented Sep 2, 2020

I will include the smooth argument in expl.detect in an upcoming update (it will take some days).

I agree that expl.detect probably does not produce the same estimates as the original DETECT procedure. The official publications about DETECT state that it uses a kind of genetic algorithm for the clustering, whereas I used the ward.D method of the stats::hclust function. Maybe I should allow the user to choose different values of method when applying hclust.
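For readers unfamiliar with that step, here is a minimal pure-Python sketch of Ward-style agglomerative merging via the Lance-Williams update, roughly the operation that stats::hclust(method="ward.D") performs on a dissimilarity matrix. This is a simplified toy, not the sirt implementation, and the input distances are assumed to be in the form the update expects.

```python
# Ward-style agglomerative clustering: repeatedly merge the two closest
# clusters, updating distances to the merged cluster with the
# Lance-Williams recurrence for Ward's method.

def ward_cluster(dist, n, k):
    """Merge n singleton clusters down to k clusters.

    dist: dict mapping frozenset({i, j}) -> pairwise dissimilarity.
    Returns a list of clusters, each a set of original indices.
    """
    clusters = {i: {i} for i in range(n)}
    size = {i: 1 for i in range(n)}
    d = dict(dist)
    while len(clusters) > k:
        # Find the closest pair of current clusters.
        a, b = min(
            ((a, b) for a in clusters for b in clusters if a < b),
            key=lambda p: d[frozenset(p)],
        )
        # Lance-Williams update (Ward): distances to the merged cluster.
        for c in clusters:
            if c in (a, b):
                continue
            na, nb, nc = size[a], size[b], size[c]
            d[frozenset((a, c))] = (
                (na + nc) * d[frozenset((a, c))]
                + (nb + nc) * d[frozenset((b, c))]
                - nc * d[frozenset((a, b))]
            ) / (na + nb + nc)
        clusters[a] |= clusters.pop(b)
        size[a] += size.pop(b)
    return list(clusters.values())

# Toy usage: two tight pairs on a line should form the two clusters.
pts = [0.0, 1.0, 10.0, 11.0]
dist = {frozenset((i, j)): (pts[i] - pts[j]) ** 2
        for i in range(4) for j in range(i + 1, 4)}
print(ward_cluster(dist, n=4, k=2))  # [{0, 1}, {2, 3}]
```

Cutting the merge sequence at each k = 2, ..., max yields the candidate partitions that a routine like expl.detect then scores with the DETECT index; a genetic search, by contrast, explores partitions that need not be nested in this way, which is one plausible source of the discrepancy.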

Moreover, there is work showing that the cross-validated DETECT index should almost always be preferred to the non-validated exploratory DETECT index. Hence, I was not too eager to replicate the findings of the DETECT software.

czopluoglu (Author) commented:
Thank you. Well, this is not really about cross-validated vs. non-validated. Even if you run the above example with the cross-validated procedure, the DETECT numbers from expl.detect() are not consistent with the original software. I am just trying to understand why, and whether there is any way to fix it. If not, that is also OK; we just need to be aware of it as researchers.

I really respect the work you do by developing the package and integrating the DETECT procedure into this package. I opened these issues and am asking these questions because my concern is that some people may do research on DETECT using the functions included in the sirt package and their findings will significantly deviate from findings already published in the literature.

Also a side note: no, the cross-validated DETECT index should not always be preferred to the non-validated DETECT index in an exploratory analysis. There is a trade-off between bias and variance (this is also based on published work).

Thank you again for your responsiveness.

alexanderrobitzsch (Owner) commented:
I will certainly add a note to the sirt manual that results will likely differ from the original software. I suppose the choice of the clustering algorithm could be crucial, but I will experiment a bit to gain further insights.

alexanderrobitzsch (Owner) commented Sep 5, 2020

I checked the results of expl.detect. It seems that the recently introduced change to use sum scores (use_sum_score) in conf.detect and ccov.np had side effects on expl.detect that produced implausible values. With the recent dev version sirt 3.10-70, results are much more plausible. Moreover, I included the arguments hclust_method (for the choice of clustering method) and estsample. With estsample, one can explicitly specify the cases used for the estimation sample. Probably, using only one random estimation sample for the cross-validated DETECT index introduces too much simulation uncertainty. It might be better to compute the cross-validated index based on repeated random sampling of the estimation sample and take the average DETECT index.
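The repeated-sampling idea at the end can be sketched as follows. `cv_detect_index` is a hypothetical stand-in for "fit the clustering on the estimation sample, evaluate the DETECT index on the validation sample"; only the resampling-and-averaging scheme is the point here.

```python
import random
import statistics

def cv_detect_index(est_sample, val_sample):
    # Hypothetical stand-in: any statistic computed from one split.
    return len(est_sample) / (len(est_sample) + len(val_sample))

def averaged_cv_detect(cases, n_est, n_rep, seed=1):
    # Draw n_rep random estimation/validation splits and average the
    # cross-validated index, instead of relying on a single split.
    rng = random.Random(seed)
    values = []
    for _ in range(n_rep):
        est = rng.sample(cases, n_est)            # random estimation sample
        est_set = set(est)
        val = [c for c in cases if c not in est_set]  # remaining cases validate
        values.append(cv_detect_index(est, val))
    return statistics.mean(values)

cases = list(range(345))
print(averaged_cv_detect(cases, n_est=172, n_rep=25))
```

Averaging over repeated splits reduces the simulation noise introduced by any one random choice of estimation sample, at the cost of n_rep times the computation.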
