How to stop screening? #557
20 comments · 43 replies
-
It is recommended to decide on a stopping rule before starting the screening process. Your stopping rule can be time-based, data-driven, or a mix of the two.
Time-based strategy: you decide to stop after a fixed amount of time. This strategy can be useful when you have a limited amount of time to screen.
Data-driven strategy: you decide to stop after, for example, x consecutive irrelevant papers (this number can be found in the statistics panel); a sketch of this rule follows below. Whether you choose 50, 100, 250, 500, etc. depends on the size of the dataset and the goal of the user. Ask yourself: how important is it to find all the relevant papers?
Mixed strategy: stop after a fixed amount of time, unless you exceed the predetermined threshold of consecutive irrelevant papers before that time.
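A minimal sketch of the data-driven rule, assuming the screening decisions are available as a list of labels (1 = relevant, 0 = irrelevant, newest last). ASReview's statistics panel already reports this count, so the snippet only illustrates the logic; the function names are illustrative, not part of the ASReview API:

```python
def consecutive_irrelevant(labels):
    """Length of the current run of irrelevant labels at the end of the screening sequence."""
    count = 0
    for label in reversed(labels):
        if label == 1:  # the most recent relevant record ends the run
            break
        count += 1
    return count

def should_stop(labels, threshold=100):
    """Data-driven rule: stop once `threshold` consecutive irrelevant records are reached."""
    return consecutive_irrelevant(labels) >= threshold
```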
-
@MartijnUX, where can I find the statistics panel showing the 42 irrelevant records you were referring to?
-
For a scoping review I was wondering whether a rule like the following would be advisable or not: screening saturation (the point at which we stop screening) is defined in this scoping review as the moment ASReview returns a number of consecutive non-relevant papers equal to 1% of the total number of papers found, with a minimum of 25. When this point is reached we choose to end the screening. (For example, if 1,500 articles are found, screening ends after 25 consecutive irrelevant articles; if 3,000 articles are found, screening ends after 30 consecutive irrelevant articles.)
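Written out, the proposed threshold is simply max(25, 1% of the dataset size); a quick illustrative snippet (not part of ASReview):

```python
def saturation_threshold(n_total_records, fraction=0.01, minimum=25):
    """Consecutive irrelevant records required before stopping, per the rule above."""
    return max(minimum, round(fraction * n_total_records))

saturation_threshold(1500)  # 25
saturation_threshold(3000)  # 30
```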
-
Same as Jan. I don't see why a scoping review should be different from any other review.
That said, if we could generate a credible estimate of recall, then we could stop whenever that estimate reaches the recall that is sufficient for the particular project, AND we would be able to attach some measure of confidence to that estimate.
Work has been published on estimators of recall, and at least one team has implemented theirs for the purpose of providing quantitative stopping information (Howard et al. 2020, https://doi.org/10.1016/j.envint.2020.105623). They think the estimate they generate is valid and conservative, but I would love to have the opinion of the ASReview developers about that.
Note that Howard et al.'s estimator of recall is based on the statistical properties of the gap between successive positive records in an ordered queue, which is also the basis for the heuristic proposed by Jan.
-
In their implementation, we repeatedly notice the "clumping" you mention, where the positive records are not smoothly spread out throughout the queue. We often have "streaks" of positive records followed by "droughts".
The recall estimate gets recomputed every time the ranking gets recomputed (every 30 records or by user request). It seems as if the recall estimation just gets adjusted as the work passes through plateaus.
I personally notice the similarity between screening data and time-to-event data, such as survival data or reliability data. I have used time-to-event methods to compare the efficiency of different ranking models. Parametric time-to-event methods can generate an estimate of the endpoint of the process (100% events). I wonder if this has been described.
-
Thank you for the offer!!
I am on the US east coast, so wouldn't that be 5:00 am?
That is way too early for me, but the offer is extremely tempting.
Add me to the list and I'll see if I can manage to wake up.
-
Those are three great sources. Thank you.
Your suggestion of using a preliminary random sample to estimate the population frequency of positives is intuitively very attractive.
However, our work deals routinely with populations of 100,000-300,000 references, among which there may be only 1,000-3,000 positives. The size of the random sample that would give us a credible estimate of total recall that is also reasonably accurate may be prohibitive. Or it may be acceptable, especially if the margin of error we require is not too small.
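A rough back-of-the-envelope check of that worry, using the normal approximation to the binomial. The numbers (about 1% prevalence, a ±10% relative error on the estimated number of positives) are assumptions for illustration, not taken from the thread:

```python
from math import ceil

def sample_size_for_prevalence(p, relative_error, z=1.96):
    """Sample size n such that the 95% CI half-width on the prevalence is about relative_error * p."""
    half_width = relative_error * p
    return ceil(z**2 * p * (1 - p) / half_width**2)

print(sample_size_for_prevalence(p=0.01, relative_error=0.10))  # ~38,000 records
```

Even before a finite-population correction (which would lower the number somewhat for a 100,000-300,000 record population), a random sample of roughly 38,000 records would be needed, which supports the "may be prohibitive" concern.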
-
You can also check the stopping rules used by other researchers. We have a list of systematic reviews where ASReview was used as a screening tool and the authors have reported their stopping rule.
-
See also the discussion in #1115 about calculating the 'knee' criterion based on the output of the recall plot.
-
Hereby I would like to share my stopping rules for the two rounds of screening in my review protocol. The stopping rule of the first screening phase is three-fold:
(1) Screening in this phase will be stopped when at least 25% of the records have been screened. Van de Schoot et al. (2021) showed that 95% of the eligible studies will be found after screening between only 8% and 33% of the total number of records.
(2) Screening will be stopped only when all key papers have been marked as relevant.
(3) Based on the results of screening 25% of the records, a ‘knee method’ stopping criterion (Cormack & Grossman, 2016) will be determined. The knee method is a geometric stopping procedure based on the shape of the gain curve (i.e. recall versus effort). The recall plot generated by the software, which plots the number of identified relevant records against the number of viewed records, will be visually inspected after screening batches of 5% of the total number of records, to see whether a plateau (or ‘knee’) has been reached. When a plateau is visually identified, we will mathematically verify that the slope of the gain curve before the candidate knee is at least a predefined cutoff ratio (e.g. 6) times the slope after it; a sketch of this check follows below. When this point is reached we choose to end the active learning screening phase.
Screening phase 2: Deep learning
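A minimal sketch of that slope-ratio check, assuming only a cumulative count of relevant records per screened record is available. This illustrates the knee idea from Cormack & Grossman (2016), not ASReview's implementation, and the names are hypothetical:

```python
def knee_criterion_met(relevant_found, slope_ratio_cutoff=6.0):
    """True if the gain curve shows a knee sharp enough to stop.

    `relevant_found[i]` is the cumulative number of relevant records after
    screening record i + 1, e.g. [1, 1, 2, 3, 3, ...].
    """
    s = len(relevant_found)  # records screened so far
    if s < 2 or relevant_found[-1] == 0:
        return False
    best_ratio = 0.0
    # Treat every earlier point as a candidate knee and compare the slope of the
    # gain curve before that point with the slope after it.
    for knee in range(1, s):
        slope_before = relevant_found[knee - 1] / knee
        slope_after = (relevant_found[-1] - relevant_found[knee - 1]) / (s - knee)
        if slope_after == 0:  # a completely flat tail: infinitely sharp knee
            return True
        best_ratio = max(best_ratio, slope_before / slope_after)
    return best_ratio >= slope_ratio_cutoff
```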
-
Hello everyone, I hope you're doing well. I have recently conducted a simulation study on utilizing extracted features to develop a stopping criterion for screening. I would be extremely grateful for your expert opinions and any feedback you might have. Thank you in advance for taking the time to review my work.
-
See also the pre-print "The SAFE Procedure: A Practical Stopping Heuristic for Active Learning-Based Screening in Systematic Reviews and Meta-Analyses": https://psyarxiv.com/c93gq
-
Hi, what do you think about combining a stopping rule with the human error rate (10.76% [95% CI: 7.43% to 14.09%]) from doi: 10.1371/journal.pone.0227742?
All the above numbers are imaginary. What do you think? Emanuel
-
Stopping decision (approximate numbers to make it easier to follow): Hello, for a dataset of around 5,000 records, I am using ASReview for the inclusion screening and have a question about the stopping decision:
In the actual screening with ASReview, I have screened 33% (around 1,600 records) and labeled 15% of them as relevant; I haven't yet reached 50 irrelevant records in a row. Thank you
-
Hello everyone!
-
I am very tempted to use ASReview for a systematic review in transplant infectious disease. I am struggling a little with the 'stopping rule'! I have not completed deduplication yet, but there will be approximately 15k titles and abstracts. I know of approximately 5 relevant papers already.
From the above discussion/references, it seems there are two main approaches to deciding a stopping rule. First, you can screen a random sample of the total, then apply the proportion screening positive to the total sample size to estimate the total number of papers you are expecting to include. You can then stop screening when you have found e.g. 95% of this estimate. However, the uncertainty on the estimate presumably means you could end up targeting more included papers than truly exist?
The alternative approach is to stop when you achieve a run of x irrelevant papers. This makes more sense to me. Presumably you could stop when the upper limit of the binomial confidence interval on (1 / number screened since last positive), multiplied by the number of remaining papers to screen, is <5% of the number that have screened positive? That would be conservative, as the algorithm places the papers most likely to be relevant at the top of the pile. Are there other conservative stopping rules based on data/calculations that I might apply?
I think I am happy with finding 95% of relevant papers, as I can probably mop up the rest by emailing corresponding authors and going through the reference lists of included papers. Thanks!
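One way to write out the calculation described in that second approach, assuming (optimistically) that the unscreened remainder behaves like the recent run of consecutive irrelevant records; in practice the ranker front-loads likely-relevant records, which should make the bound conservative. The function names are illustrative, not part of ASReview:

```python
def estimated_missed_upper_bound(gap, n_remaining, alpha=0.05):
    """Upper (1 - alpha) bound on missed records after `gap` consecutive irrelevant labels.

    Uses the exact (Clopper-Pearson) upper bound for 0 successes in `gap` trials,
    p_upper = 1 - alpha**(1 / gap), which is roughly the 'rule of three' 3 / gap.
    """
    p_upper = 1 - alpha ** (1 / gap)
    return p_upper * n_remaining

def can_stop(gap, n_remaining, n_found, tolerated_fraction=0.05):
    """Stop when the plausible number of missed records is small relative to those found."""
    return estimated_missed_upper_bound(gap, n_remaining) < tolerated_fraction * n_found
```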
-
Phase 3 of the SAFE procedure:
-
Hi, I have a question about the SAFE procedure. PS: I have read the discussions above, but cannot follow everything :( hopefully I am not posting a question that has already been answered.
-
Thanks Josien. I did the double check... 22 out of 74. So that at least confirms my first approach. I just have to think about my design, I am afraid. Some background info: I am deliberately using a broad approach with a very extensive search query and expecting mostly qualitative studies in return. So it does make sense that quite a big chunk is relevant, at least for the first round of screening. I am expecting quite a big chunk to be dismissed during full-text screening. So I either have to deal with screening 4,000+ sources, or decide to apply a different stopping rule.
-
Hi all, and thank you to the developers of ASReview - it's great to have open source software for ML-assisted screening. We recently published a paper in Systematic Reviews discussing stopping rules. In this paper, we argue that good stopping rules are absolutely vital to using machine learning responsibly, but that good stopping rules are hardly ever used, and the guidance is very much short of where it needs to be. We recommend that:
1. We need to agree minimum criteria for stopping rules. In short, they should be based on good statistics, with clear, sensible, and transparent assumptions (just like the rest of the systematic review process!).
2. We need to evaluate rules robustly across a range of datasets and prioritisation algorithms. We want to demonstrate that rules are reliable, not just in general, but across a broad range of contexts. Reliable, in this case, means that a rule targeting 95% recall with 95% confidence should result in recall below 95% less than 5% of the time. Among reliable rules, we can select the most efficient.
3. The guidelines for systematic reviews need to be updated. Cochrane and Campbell guidance on how to conduct a review is short of information on how to use ML-prioritised screening, yet many people are already using it (thanks to great software like this!). The guidance needs to help people (authors, reviewers, and the general public) to understand the difference between stopping after 50 consecutive irrelevant records and using a criterion that is based on solid statistics.
4. Platforms (like ASReview) should provide better guidance on stopping rules, and make them easy to use within their platforms. As noted, many people are already using ML-prioritised screening, but anecdotally, the majority of people are doing things like stopping after X consecutive irrelevant documents, with no justification for X. Platforms need to do a much better job at helping people to use ML responsibly. ML is not magic, it is just a way of making predictions, and we need statistical methods to understand the risks of relying on those predictions. Platforms should therefore either give better guidance on how to use well-justified stopping criteria, or incorporate them into their software.
5. Papers should not make claims about work savings unless they incorporate a well-justified stopping method. There are a lot of papers that say our system/our algorithm achieves work savings of X%, without addressing the fact that these savings could only have been made if one had a perfect stopping method that predicted exactly when the right moment to stop was, without requiring any extra screening (we would need omniscience for this!). We have enough of these papers now; let's focus on getting stopping criteria right, as this is necessary for actual real-world work savings.
Looking forward to hearing thoughts on this. I would also like to share some resources on a stopping method I developed a few years ago, which I think has a solid statistical foundation, and would be my personal recommendation for anyone looking to solve this problem: https://mcallaghan.github.io/buscar-app/
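For readers who want a feel for what such a statistically grounded criterion looks like, here is a simplified sketch in the spirit of the approach behind buscar-app, not the exact published method or its implementation. It asks: if true recall were still below the target, how likely is it that the last n_window screened records contained only k_window relevant ones, treating that window as a random draw from the records remaining at its start? A good ranker makes the draw better than random, so the test should be conservative. All names and example numbers are illustrative:

```python
from math import floor
from scipy.stats import hypergeom

def stopping_p_value(n_total, n_screened, r_found, k_window, n_window, target_recall=0.95):
    """P-value against H0: current recall is still below `target_recall`."""
    # Under H0, at least this many relevant records remain unscreened right now.
    m_missing = floor(r_found / target_recall) + 1 - r_found
    remaining_at_window_start = n_total - n_screened + n_window
    relevant_at_window_start = m_missing + k_window
    # Probability of finding <= k_window relevant in n_window random draws under H0.
    return hypergeom.cdf(k_window, remaining_at_window_start,
                         relevant_at_window_start, n_window)

# Example: 5,000 records, 2,000 screened, 150 relevant found, none in the last 400 screened.
p = stopping_p_value(5000, 2000, 150, 0, 400)
print(p)  # if p < 0.05, the data are hard to reconcile with recall still being below 95%
```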
-
This first post is continuously updated based on the discussions in this thread
In the active learning cycle, the model incrementally improves its predictions on the remaining unlabeled records, but hopefully, all relevant records are identified as early in the process as possible. The reviewer decides to stop at some point during the process to conserve resources, or when all records have been labeled. In the latter case, no time was saved, and therefore the main question is when to stop: i.e. to determine the point at which the cost of labeling more papers by the reviewer is greater than the cost of the errors made by the current model (e.g., Cohen, 2011). Finding 100% of the relevant papers appears to be almost impossible, even for human annotators (Wang, Nayfeh, Tetzlaff, O’Blenis, & Murad, 2020). Therefore, we typically aim to find 95% of the inclusions. However, with an unlabeled dataset, you don’t know how many relevant papers are left to be found. So researchers might either stop too early and potentially miss many relevant papers, or stop too late, causing unnecessary further reading (Z. Yu, N. Kraft, & T. Menzies, 2018a).
There are potential stopping rules that could be implemented, for example estimating the number of potentially relevant papers or finding an inflection point (Cormack & Grossman, 2015, 2016; Kastner, Straus, McKibbon, & Goldsmith, 2009; Stelfox, Foster, Niven, Kirkpatrick, & Goldsmith, 2013; Ros, Bjarnason, & Runeson, 2017; Wallace et al., 2010, 2012; Webster & Kemp, 2013; Yu & Menzies, 2019).
Another option is to use heuristics (Bloodgood & Vijay-Shanker, 2014; Olsson & Tomanek, 2009; Vlachos, 2008), for example:
Time-based strategy: If you choose a time-based strategy, you decide to stop after a fixed amount of time. This strategy can be useful when you have a limited amount of time to screen.
Data-driven strategy: When using a data-driven strategy, you decide to stop after, for example, x consecutive irrelevant papers (this number can be found in the statistics panel). Whether you choose 50, 100, 250, 500, etc. depends on the size of the dataset and the goal of the user. You can ask yourself: how important is it to find all the relevant papers?
Mixed strategy: Another option is to stop after a fixed amount of time, unless you exceed the predetermined threshold of consecutive irrelevant papers before that time.
Below we discuss more options in detail. Join the discussion!!
Some useful references: