Report of Reviewer 1 #1

Open
jchiquet opened this issue Mar 30, 2022 · 0 comments
Labels
review Report written before acceptance, now public

jchiquet commented Mar 30, 2022

Associate Editor: Julien Chiquet
Reviewer 1: Christophe Botella (chose to lift his anonymity)

Reviewer 1: Reviewing history

  • Paper submitted August 25, 2021
  • Reviewer invited September 28, 2021
  • Review 1 received October 28, 2021
  • Paper revised January 4, 2022
  • Reviewer invited January 4, 2022
  • Review 2 received February 4, 2022
  • Paper conditionally accepted February 21, 2022

First Round

The study is clearly and nicely written, with an interesting perspective. The authors made a clear effort to illustrate sophisticated downstream analysis tasks based on deep-learning image classification. I perceived the main points of originality to be (i) the integrated and reproducible R pipeline, which should drive a wide and fast appropriation of such a pipeline; (ii) the “end-to-end automation”: except for coding, there is no human labor from data collection to scientific results (not advertised much, in my opinion, if this is really new); (iii) the aim is not general monitoring but the question of statistical associations between species, for which such a methodology has rarely been used. As developed hereafter, I think the authors might push the discussion of their methodology further, and it does not seem much more expensive to include some other classes to avoid biases and obtain richer results.

authors' answer: Thank you for your time and your constructive suggestions. See below for our answers and details on how we addressed your comments.

General comments

I understand sample size may be limiting, but the comparison of the classified dataset with the ground-truth dataset would be more interesting if the classified dataset contained only machine-classified images (not used in training) and no human-classified images. For instance, you could split the training/test data from Jura based on sites. Thus, you could compare the posterior occupancy estimates (classified vs. ground truth) on the Jura test + Ain sites only.

authors' answer: This is a neat idea, but as the referee figured out, sample size was definitely one of the reasons why we did not consider this option. Another reason was that we were specifically interested in testing the prediction that generalizing classification algorithms trained on one site (Jura in our case study) might have moderate to poor performance when used in another environment (Ain in our case study).
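
For illustration, a minimal sketch of the site-based split suggested here, assuming a hypothetical data frame `images` with columns `site` and `filepath` (these names are not from the manuscript):

```r
## Hypothetical site-based split of the Jura images: the classifier would be
## trained on images from 80% of the sites and tested on the remaining sites,
## so that no test image comes from a site seen during training.
set.seed(42)
sites <- unique(images$site)
train_sites <- sample(sites, size = round(0.8 * length(sites)))

train_images <- images[images$site %in% train_sites, ]
test_images  <- images[!images$site %in% train_sites, ]
```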

An important question is the sensitivity of the co-occurrence signal to confusions made by the image classification algorithm. This is not directly addressed here, nor even discussed. First, we could expect that high confusion between prey species would artificially increase the estimated level of co-occurrence, and bias co-occurrence with lynx. For instance, part of the surplus in estimated Pr(lynx present | roe deer present and chamois absent) and deficit in Pr(lynx present | roe deer absent and chamois present) in the classified dataset compared to ground truth (Figure 5) could be due to chamois often being classified as roe deer (Figure 2). Second, observed co-occurrence patterns between focal species may be driven by co-occurrence patterns with other classes hidden among the focal species. For example, roe deer has a precision of only 0.67 and in reality includes many foxes. A potential competition between lynx and fox for prey could decrease the estimated roe deer / lynx co-occurrence. I invite you to discuss this problem, which might substantially bias your ecological interpretations precisely because of a bias in automatic identification.

authors' answer: These are excellent points indeed, and we added the following paragraph in the Discussion section to account for this comment: “When it comes to the case study, our results should be discussed with regard to the sensitivity of co-occurrence estimates to errors in automatic species classification. In particular, we expected that confusions between the two prey species might artificially increase the estimated probability of co-occurrence with lynx. This was illustrated by $\Pr(\mbox{lynx present} | \mbox{roe deer present and chamois absent})$ (resp. $\Pr(\mbox{lynx present} | \mbox{roe deer absent and chamois present})$) being estimated higher (resp. lower) with the classified than the ground truth dataset (Figure 5). This pattern could be explained by chamois being often classified as (and confused with) roe deer (Figure 2).” As for the second point, we agree in theory, but there is no competition between lynx and fox for big prey (despite foxes preying on fawns on rare occasions). Last, we emphasize that this comment nicely relates to another comment by the other referee, who encouraged us to discuss ways to explicitly account for confusions in statistical inference.
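
Conditional occupancy probabilities of this kind can be obtained from a fitted multi-species occupancy model with predict() in unmarked; the sketch below is only illustrative, and `fit` and the species labels are placeholders rather than names taken from the manuscript's script:

```r
library(unmarked)

## `fit` is a placeholder for an occuMulti model fitted to lynx, roe deer and
## chamois detection histories; species labels must match those used when
## building the unmarkedFrameOccuMulti.

## Pr(lynx present | roe deer present, chamois absent), site by site
p_roe <- predict(fit, type = "state", species = "lynx",
                 cond = c("roedeer", "-chamois"))

## Pr(lynx present | roe deer absent, chamois present)
p_cha <- predict(fit, type = "state", species = "lynx",
                 cond = c("-roedeer", "chamois"))

head(p_roe)  # estimates with standard errors and confidence intervals
```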

I wonder why the authors didn’t include other classes in their occupancy model. I would find it very interesting to see estimates of associations between lynx and humans (regrouped with dogs and hunters), as we might expect a negative effect of frequent human presence on lynx due to disturbance, or even due to competition for prey resources in the case of hunting. Also, what about the association between foxes and lynx, which might compete, and between foxes and the prey? Further, as mentioned earlier, it would be better to integrate these species rather than let them bias your focal estimates. The precisions and recalls in Table 3 don’t prevent the inclusion of these classes compared to chamois or roe deer. So what is the reason? Estimation variance? A computational limit? A statement about it would be nice.

authors' answer: This is indeed a fair question. First, in our experience multi-species occupancy models are very much data-hungry, and it is only by using regularization methods (Clipp et al. 2021) that we can avoid occupancy probabilities being estimated at the boundary of the parameter space or with large uncertainty. Second, and this is true for any joint species distribution model, models quickly become very complex, with many parameters to estimate, when the number of species increases and co-occurrence is allowed between all species. Here, ecological expertise should be used to consider only meaningful species interactions and apply parsimony when parameterizing models. We added the following paragraph in the Method section to emphasize these limitations: “First, these models are data-hungry and regularization methods [@clipp2021] are needed to avoid occupancy probabilities being estimated at the boundary of the parameter space or with large uncertainty. Second, and this is true for any joint species distribution model, these models quickly become very complex with many parameters to be estimated when the number of species increases and co-occurrence is allowed between all species. Here, ecological expertise should be used to consider only meaningful species interactions and apply parsimony when parameterizing models.”

With that in mind, we did not include humans as a species in our model because all camera traps were set up in areas with high human activity, so that human occupancy probability is basically one everywhere. We would need sites with moderate and no human activity to be able to assess the effects of human activity on lynx presence. We did include fox and cat in our occupancy analyses (see R Markdown script) but did not consider co-occurrence between these species and our focal species (lynx, chamois and roe deer). Foxes do not prey on roe deer, which is why we did not include co-occurrence between these two species as a parameter to be estimated. This being said, there might be competition between lynx and fox for prey like small rodents and birds. To test this hypothesis, we fitted another occupancy model in which we considered co-occurrence between fox and lynx. This model was better supported by the data than the model without co-occurrence (AIC was 1544 vs. 1557), which opens room for further investigation. We thank the referee for pushing us to investigate this effect, and we added the following sentence in the Discussion section: “Our results are only preliminary and we see several perspectives to our work. First, we focused our analysis on lynx and its main prey, while other species should be included to get a better understanding of the community structure. For example, both lynx and fox prey on small rodents and birds, and a model including co-occurrence between these two predators showed better support by the data (AIC was 1544 when co-occurrence was included vs. 1557 when it was not).”
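
As a concrete illustration of how such a comparison could be set up, here is a minimal two-species sketch using occuMulti() from unmarked (placeholders throughout; this is not the manuscript's actual script, which involves more species):

```r
library(unmarked)

## ylist: hypothetical named list of 0/1 detection matrices (sites x occasions),
## here for two species only, e.g. list(lynx = y_lynx, fox = y_fox).
umf <- unmarkedFrameOccuMulti(y = ylist)

## With two species there are two marginal natural parameters plus one pairwise
## parameter, so stateformulas has length 3; "0" fixes a parameter to zero.
det_f <- c("~1", "~1")

m0 <- occuMulti(detformulas = det_f,
                stateformulas = c("~1", "~1", "0"),   # no lynx-fox co-occurrence
                data = umf)

m1 <- occuMulti(detformulas = det_f,
                stateformulas = c("~1", "~1", "~1"),  # lynx-fox co-occurrence allowed
                data = umf)

## Compare support by AIC (the manuscript reports 1544 vs. 1557 for its own
## model). A penalty can also be supplied to occuMulti() and tuned with
## optimizePenalty() in the development version of unmarked.
c(with_cooccurrence = m1@AIC, no_cooccurrence = m0@AIC)
```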

Point by point:

  • The work could be positioned relative to comparable studies with SDMs: earlier work used deep-learning-based image classification to produce occurrence data exploitable for species monitoring and tested the sensitivity to image classification errors, e.g., Botella et al. (2018).

authors' answer: We agree, and we added this sentence “In that spirit, we praise previous work on plants which used deep learning to produce occurrence data and tested the sensitivity of species distribution models to image classification errors [@botella2018].”

  • “as most, if not all, algorithms are written in the Python language”: Not all; many libraries have R implementations (e.g., MXNet) or R wrappers (e.g., kerasR, R keras). MXNet-R has been used for species distribution modelling.

authors' answer: True. We deleted “if not all” and added a link to both MXNet for R and the R interface to Keras.
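
Since the R interface to Keras comes up here, a minimal, generic illustration of classifying a single image with a pretrained ImageNet network through that interface (unrelated to the authors' actual training pipeline; the file path is a placeholder):

```r
library(keras)

## Load a pretrained ImageNet classifier via the R interface to Keras.
model <- application_resnet50(weights = "imagenet")

## Classify a single camera-trap image (placeholder path).
img <- image_load("camera_trap_photo.jpg", target_size = c(224, 224))
x <- image_to_array(img)
x <- array_reshape(x, c(1, dim(x)))
x <- imagenet_preprocess_input(x)

preds <- model %>% predict(x)
imagenet_decode_predictions(preds, top = 3)[[1]]
```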

  • “wild board” (picturing it was funny though).

authors' answer: Well spotted, we corrected the typo.

  • Figure 2: Add percentages across columns and rows.

authors' answer: Done.

  • “an ecologist would probably wonder whether ecological inference about the interactions between lynx and its prey is biased by these average performances, a question we address in the next section.” -> What you are looking at further is co-occurrence, not interaction, so stick with “co-occurrence” or “association” consistently across the manuscript.

authors' answer: Agreed, we now use co-occurrence throughout the manuscript.

  • Table 2: Are these training or test/validation metrics? You suggested earlier in the text that test/validation metrics were computed on 20% of the Jura dataset (this is what we want to see here), but in the caption you write “Model training performance”. It is not very interesting to look at training metrics, as they do not inform on the generalizing ability of your algorithm. Test/validation metrics on the Jura dataset would be more relevant and would inform on the transferability gap compared to the Ain dataset.

authors' answer: This table is about test/validation metrics (precision/recall, as the column titles suggest). We realize that the table caption is confusing, and we now use “Model performance metrics” to make it clear we provide test/validation metrics here.
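
For readers unfamiliar with these metrics, a generic R sketch (toy data, not the manuscript's results) showing how per-class precision and recall, and the row/column percentages of Figure 2, derive from a confusion matrix:

```r
## Toy confusion matrix: rows are true classes, columns are predicted classes
## (the orientation in the manuscript's Figure 2 may differ).
classes <- c("lynx", "roe_deer", "chamois")
truth <- factor(c("lynx", "lynx", "roe_deer", "roe_deer", "chamois"), levels = classes)
pred  <- factor(c("lynx", "roe_deer", "roe_deer", "chamois", "chamois"), levels = classes)
cm <- table(truth, pred)

precision <- diag(cm) / colSums(cm)  # of images predicted as class k, share truly k
recall    <- diag(cm) / rowSums(cm)  # of images truly in class k, share predicted as k
round(cbind(precision, recall), 2)

## Row and column percentages, as added to Figure 2
round(100 * prop.table(cm, margin = 1), 1)  # row percentages
round(100 * prop.table(cm, margin = 2), 1)  # column percentages
```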

  • “we may also infer potential interactions by calculating conditional probabilities such as for example the probability of a site being occupied by species 2 conditional of species 1” Again, I wouldn’t speak about interaction here, as this probability may also capture a shared response to environmental variability, or a shared response to another species, etc. Association would be better suited.

authors' answer: Agreed, we now use co-occurrence throughout the manuscript.

Reproducibility:

I tried to reproduce the article using the R Markdown script, but without full success. Most of it worked fine, but a chunk (starting l.644 of dl-occupancy-lynx-paper.Rmd) throws an error, and so half of the code could not be tested. The reason is that the function “optimizePenalty” is not defined, even though I have installed all dependencies. I guess it is an author-defined function that they forgot to include in the script? Please include it and test reproducing the article starting from an empty R environment.

authors' answer: Apologies for that. The latest version of the R package unmarked that we used for occupancy analyses is not on CRAN yet. We added a line of code in the R Markdown script so that the dev version is now installed on the user’s computer via devtools::install_github("rbchan/unmarked").
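
The fix could look like the following guarded install at the top of the R Markdown script (a sketch; the actual script may simply call install_github directly):

```r
## Install the development version of unmarked (which provides optimizePenalty()
## for occuMulti fits) when it is not already available.
if (!requireNamespace("unmarked", quietly = TRUE) ||
    !exists("optimizePenalty", where = asNamespace("unmarked"))) {
  devtools::install_github("rbchan/unmarked")
}
library(unmarked)
```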

Second Round

The authors have responded in depth to my previous questions. I have reviewed and appreciated the entire manuscript, which the authors have significantly improved, making it, in my opinion, suitable for publication in the journal with minor modifications.

I also agree with the jugaad spirit of your message and hope it can both reduce the perceived technical barrier to these methods and make their use more efficient/sober.

Please find below my answer and suggestions regarding a discussion initiated during the last round, and point-by-point minor modifications to improve the clarity and readability of the manuscript.

Following up on our discussion:

  • I understand sample size may be limiting, but the comparison of the classified dataset with the ground-truth dataset would be more interesting if the classified dataset contained only machine-classified images (not used in training) and no human-classified images. For instance, you could split the training/test data from Jura based on sites. Thus, you could compare the posterior occupancy estimates (classified vs. ground truth) on the Jura test + Ain sites only.

authors' quote: This is a neat idea, but as the referee figured out, sample size was definitely one of the reasons why we did not consider this option. Another reason was that we were specifically interested in testing the prediction that generalizing classification algorithms trained on one site (Jura in our case study) might have moderate to poor performance when used in another environment (Ain in our case study).

Thanks for this clarification. I do understand now the need for such a transferability assessment, and this use case is indeed a good illustration of future contexts of application where automatic classification will be used to label many camera-trap images in different areas. However, I also realize that this motivation for the study is not clear in the introduction (-> add a sentence in the last paragraph of the introduction), and that it must be explicitly used to justify the choice of the classified dataset (-> add a sentence in the first paragraph of Section 5).

Point by point minor remarks:

  • In the caption of Table 2: “Images from the Jura study site were used for training” -> This is already clear from the text above, so I’m not sure it’s useful here, and it might even be misleading, because what you actually display here are TEST metrics. I would suggest instead being explicit about which subset of the data these metrics are computed on (to avoid confusion with the transferability metrics), namely replacing this sentence with something like “[…] computed on test sites of Jura”.

  • I find Figure 2 very important, and it is really helpful that you added row and column percentages. However, the percentages partially hide some central numbers, making the figure look messy. You may simply reduce the size of these numbers; they don’t need to be that big.

  • In the caption of Figure 2, you should specify “transferability confusion matrix” or something like this to clarify for the reader that these numbers were computed on the Ain sites.

  • In the caption of Figure 2: “An example of row percentage is as follows: of all pictures in which we have a wild boar, we predict 94% of them to be badgers” -> Change “badgers” to “wild boars”.
