This script simulates NLP scientific challenges with a variable number of participants, score ranges and epochs. I developed it to show how confident a participant can be in a final official rank. Of course, it is just a statistical simulation, and some of its assumptions may be too strict.

I generate N participants, each with a final F1 score, and order them to obtain a rank. I repeat this process EPOCH times and count how many times the simulated rank equals the expected rank. The outcome depends on the number of instances in the test set. You can easily see from the pictures that when the test set is small, the reliability of the rank is low. As expected, the official ranking becomes more reliable as the number of test instances increases.

This shows that the final ranking of a scientific challenge with a given number of participants is statistically significant only if the official benchmark test set is larger than a certain threshold.
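
The procedure described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the script's actual code: the function name `simulate_rank_agreement`, the evenly spaced true scores, and the noise model (each score treated as a proportion measured on `test_size` instances, with a normal approximation of the sampling error) are all hypothetical choices made for the example.

```python
import random


def simulate_rank_agreement(true_scores, test_size, trials, seed=0):
    """Return the fraction of trials whose observed ranking equals the true one."""
    rng = random.Random(seed)
    # Expected ranking: participant indices sorted by descending true score.
    expected = sorted(range(len(true_scores)), key=lambda i: -true_scores[i])
    matches = 0
    for _ in range(trials):
        observed = []
        for p in true_scores:
            # Assumption: the measured score behaves like a proportion over
            # test_size instances, so its noise is ~ Normal(0, sqrt(p(1-p)/n)).
            sd = (p * (1.0 - p) / test_size) ** 0.5
            observed.append(rng.gauss(p, sd))
        simulated = sorted(range(len(observed)), key=lambda i: -observed[i])
        if simulated == expected:
            matches += 1
    return matches / trials


# Hypothetical challenge: 5 participants, top score 0.748, each 0.02 apart.
scores = [0.748 - 0.02 * i for i in range(5)]
small = simulate_rank_agreement(scores, test_size=100, trials=2000)
large = simulate_rank_agreement(scores, test_size=20000, trials=2000)
print(small, large)
```

With these made-up scores, the agreement rate for the large test set should come out far higher than for the small one, which is the effect the graphs illustrate.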

##How to

###read the graph

The picture above shows the results for a challenge with 21 participants, an expected top score of 74.8% F1 and a standard deviation of 0.0005. Each simulation has been generated 5000 times. The Top N curve shows, depending on the size of the test set, how often the real top N participants corresponded to the simulated top N participants. In the graph above you can see that, for 2000 samples, the real full ranking corresponded to the simulated one only 50% of the time. For the same test set size, the real top 8 participants corresponded to the simulated top 8, in that very order, 85% of the time.
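
To make the Top N measure concrete, here is a minimal sketch of how such a value could be computed from a list of simulated rankings. The helper names `top_n_match` and `top_n_agreement` and the toy rankings are invented for illustration; the script itself may compute the curve differently.

```python
def top_n_match(expected_rank, simulated_rank, n):
    """True if the first n positions are identical, in the same order."""
    return expected_rank[:n] == simulated_rank[:n]


def top_n_agreement(expected_rank, simulated_ranks, n):
    """Fraction of simulated rankings whose top n equals the expected top n."""
    hits = sum(top_n_match(expected_rank, r, n) for r in simulated_ranks)
    return hits / len(simulated_ranks)


# Toy data: four simulated rankings of four hypothetical participants.
expected = ["A", "B", "C", "D"]
runs = [
    ["A", "B", "C", "D"],
    ["A", "B", "D", "C"],
    ["B", "A", "C", "D"],
    ["A", "B", "C", "D"],
]
top2 = top_n_agreement(expected, runs, 2)  # 3 of 4 runs match -> 0.75
full = top_n_agreement(expected, runs, 4)  # 2 of 4 runs match -> 0.5
print(top2, full)
```

The Top N value can never be lower than the full-ranking value, since a matching full ranking implies a matching top N.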

###simulate a challenge

To visualize other simulated rankings with different parameters, use:

$ python <minimum test set size>
                        <maximum test set size>
                        <step test set size>
                        <number of trials>
                        <number of participants>
                        <expected top score>
                        <standard deviation>
                        <% of CPU to use>

For example:

$ python 0 6000 25 5000 21 0.748 0.0005 0.5

will compute the graph above using half of the CPU cores you have on your machine.
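
As an illustration of how a CPU fraction such as `0.5` might be turned into a number of worker processes, consider the sketch below. The helper `workers_for_fraction` is hypothetical and the rounding rule is an assumption; the script may handle this argument differently.

```python
import multiprocessing


def workers_for_fraction(fraction, total_cores=None):
    """Map a CPU fraction in (0, 1] to a worker count of at least 1."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in the interval (0, 1]")
    if total_cores is None:
        # Number of cores reported by the operating system.
        total_cores = multiprocessing.cpu_count()
    # Assumption: round down, but never go below one worker.
    return max(1, int(total_cores * fraction))


print(workers_for_fraction(0.5, total_cores=8))  # half of 8 cores
```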


  • temp.mem

The file temp.mem contains all the pre-computed simulated rankings generated by

These are the parameters:

  • number of participants = [1,30]
  • trials = 10000
  • minimum test set size = 0 samples
  • step for test set size = 25 samples
  • maximum test set size = 6000 samples
  • expected maximum score per challenge = from 30% to 100%, step=5%
  • standard deviations = [0.0001, 0.0002, 0.0005, 0.001, 0.005]
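
To get a feel for the size of this parameter grid, the combinations can be enumerated as in the sketch below. It assumes that every range listed above is inclusive at both ends, and it says nothing about how temp.mem actually stores the results; the variable names are made up for the example.

```python
from itertools import product

# Assumption: all ranges are inclusive at both ends.
participants = range(1, 31)                        # 1 to 30 participants
top_scores = [s / 100 for s in range(30, 101, 5)]  # 0.30 to 1.00, step 0.05
std_devs = [0.0001, 0.0002, 0.0005, 0.001, 0.005]
test_sizes = range(0, 6001, 25)                    # 0 to 6000, step 25

# One configuration per (participants, top score, standard deviation) triple.
grid = list(product(participants, top_scores, std_devs))
print(len(grid))        # number of challenge configurations
print(len(test_sizes))  # number of test set sizes per configuration
```

Under these assumptions there are 2250 configurations, each evaluated at 241 test set sizes.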


$ python 0 5000 25 10000 30 0.95 0.0001


$ python 0 5000 25 10000 10 0.95 0.0001


$ python 0 5000 25 10000 10 0.95 0.0005



(GPL v2)

Copyright (c) 2012 Michele Filannino,

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see



Scientific Text Mining challenges simulator





