Identifying images in the biology literature that are problematic for people with a color-vision deficiency

To help maximize the impact of scientific journal articles, authors must ensure that article figures are accessible to people with color-vision deficiencies (CVDs), which affect up to 8% of males and 0.5% of females. We evaluated images published in biology- and medicine-oriented research articles between 2012 and 2022. Most included at least one color contrast that could be problematic for people with deuteranopia (‘deuteranopes’), the most common form of CVD. However, spatial distances and within-image labels frequently mitigated potential problems. Initially, we reviewed 4964 images from eLife, comparing each against a simulated version that approximated how it might appear to deuteranopes. We identified 636 (12.8%) images that we determined would be difficult for deuteranopes to interpret. Our findings suggest that the frequency of this problem has decreased over time and that articles from cell-oriented disciplines were most often problematic. We used machine learning to automate the identification of problematic images. For a hold-out test set from eLife (n=879), a convolutional neural network classified the images with an area under the precision-recall curve of 0.75. The same network classified images from PubMed Central (n=1191) with an area under the precision-recall curve of 0.39. We created a Web application (https://bioapps.byu.edu/colorblind_image_tester); users can upload images, view simulated versions, and obtain predictions. Our findings shed new light on the frequency and nature of scientific images that may be problematic for deuteranopes and motivate additional efforts to increase accessibility.


Introduction
Most humans have trichromatic vision: they perceive blue, green, and red colors using three types of retinal photoreceptor cells that are sensitive to short, medium, or long wavelengths of light, respectively. Color-vision deficiency (CVD) affects between 2% and 8% of males (depending on ancestry) and approximately 0.5% of females (Delpero et al., 2005). Congenital CVD is commonly caused by mutations in the genes (or nearby promoter regions) that code for red or green cone photopigments; these genes are proximal to each other on the X chromosome (Nathans et al., 1986).
CVD is divided into categories, the most common being deutan CVD, which affects approximately 6% of males of European descent, and protan CVD, which affects 2% of males of European descent (Delpero et al., 2005). Both categories are commonly known as red-green colorblindness. Within each category, CVD is subclassified according to whether individuals are dichromats (able to see two primary colors) or anomalous trichromats (able to see three primary colors, but differently from normal trichromats). Anomalous trichromats differ in the severity with which they can distinguish color patterns. Individuals with deuteranopia ('deuteranopes') or protanopia lack green or red cones, respectively (Simunovic, 2010). Individuals with deuteranomaly do not have properly functioning green cones, and those with protanomaly do not have properly functioning red cones. People with any of these conditions often see green and red as brown or beige colors. Thus, when images contain shades of green and red, or when either is paired with brown, parts of the image may be indistinguishable. Furthermore, pairing some pinks or oranges with greens can be problematic. These issues can lead individuals with CVD to misinterpret figures in scientific journal articles.
Efforts have been made to ensure that scientific figures are accessible to people with CVD. For example, researchers have developed algorithms that attempt to recolor images so that people with CVD can more easily interpret them (Flatla, 2011; Lin et al., 2019; Tsekouras et al., 2021). However, these tools are not in wide use, and more work is needed to verify their efficacy in practice. In the meantime, as researchers prepare scientific figures, they can take measures to improve accessibility for people with CVD. For example, they can avoid rainbow color maps that show colors in a gradient; they can use color schemes or color intensities that are CVD friendly (Crameri et al., 2020); additionally, they can provide labels that complement information implied by color differences. However, for the millions of images that have already been published in scientific articles, little is known about the frequency with which these images are CVD friendly. The presence or absence of particular color pairings, and the distances between them, can be quantified computationally to estimate this frequency. However, a subjective evaluation of individual images is necessary to identify whether color pairings and distances are likely to affect scientific interpretation.
In this paper, we focus on deuteranopia and its subtypes. To estimate the extent to which the biological and medical literature contains images that may be problematic for deuteranopes, we manually reviewed a 'training set' of 4964 images and two 'test sets' of 879 and 1191 images, respectively. These images were published in articles between the years 2012 and 2022. After identifying images that we deemed most likely to be problematic or not, we used machine-learning algorithms to identify patterns that could discriminate between these two categories of images and thus might be useful for automating the identification of problematic images. If successful, such an algorithm could be used to alert authors, presenters, and publishers that scientific images could be modified to improve visual accessibility and thus make biological and medical fields more inclusive.

Results
We downloaded images from research articles published in the eLife journal. Not counting duplicate versions of the same image, we obtained 66,253 images. Of these images, 1744 (2.6%) were grayscale (no color). Of these images, 56,816 (85.6%) included at least one color pair for which the amount of contrast might be problematic for people with moderate-to-severe deuteranopia ('deuteranopes'). To characterize potentially problematic aspects of each color-based image, we calculated five metrics based on color contrasts and distances; we also compared the color profiles against what deuteranopes might see. The mean pixel-wise color distance between the original and simulated image exhibited a bimodal distribution, according to Hartigans' dip test for unimodality (p<0.001; Hartigan and Hartigan, 1985). Specifically, 4708 images (7.3%) had a difference smaller than 0.01, while the median difference for the remaining images was 0.05 (Figure 1). Most other metrics showed similar patterns, although bimodality was less apparent through visual inspection (Figure 2; Figure 3; Figure 4). The exception was the proportion of pixels in the original image that used a color from one of the high-ratio color pairs, which was unimodal (p=1; Figure 5).
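The mean pixel-wise color distance mentioned above can be illustrated as follows. This is a minimal sketch under our own assumptions, not the paper's actual pipeline; the function name and the toy arrays are illustrative.

```python
# Minimal sketch: mean Euclidean distance between corresponding pixels
# of an original image and a deuteranopia-simulated version, with RGB
# channels scaled to [0, 1]. Toy data stand in for real images.
import numpy as np

def mean_pixelwise_distance(original, simulated):
    """Mean per-pixel Euclidean RGB distance between two uint8 images."""
    orig = original.astype(float) / 255.0
    sim = simulated.astype(float) / 255.0
    per_pixel = np.sqrt(((orig - sim) ** 2).sum(axis=-1))  # distance per pixel
    return per_pixel.mean()

# Toy 2x2 image in which one pure-red pixel is rendered as brown,
# mimicking how such a pixel might shift under simulation.
original = np.array([[[255, 0, 0], [0, 0, 255]],
                     [[128, 128, 128], [0, 255, 0]]], dtype=np.uint8)
simulated = original.copy()
simulated[0, 0] = [139, 69, 19]  # red shifted toward brown
d = mean_pixelwise_distance(original, simulated)
```

An image identical to its simulation yields a distance of zero, which matches the near-zero mode of the bimodal distribution described above.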
We determined that many images with the highest (or lowest, as would be the case for the 'Mean Euclidean distance between pixels for high-ratio color pairs') scores for these metrics would be problematic for deuteranopes. However, we noted that certain color pairs were more problematic than others and that effective labels and/or spacing between colors often mitigated potential problems. Thus, to better estimate the extent to which images are problematic for deuteranopes, we manually reviewed a sample of 4964 images and judged whether deuteranopes would likely recognize the scientific message behind each image. Supplementary file 2 contains a record of these evaluations, along with comments that indicate either problematic aspects of the images or factors that mitigated potential problems. We concluded that 636 (12.8%) of the images were 'Definitely problematic', whereas 3865 (77.9%) were 'Definitely okay'. The remaining images were grayscale (n=179), or we were unable to reach a confident conclusion (n=284). For the images that were 'Definitely okay', we visually detected shades of green and red or orange in 2348 (60.8%) images; however, in nearly all (99.3%) of these cases, we deemed that the contrasts between the shades were sufficient for a deuteranope to interpret the images. Furthermore, distance between the colors and/or labels within the images mitigated potential problems in 54.2% and 48.4% of cases, respectively. We also evaluated longitudinal trends and differences among biology subdisciplines. In some cases, multiple images came from the same eLife article. Therefore, to avoid pseudoreplication, we categorized each article as either 'Definitely okay' or 'Definitely problematic'. If an article included at least one 'Definitely problematic' image, we categorized the entire article as 'Definitely problematic'. The percentage of 'Definitely problematic' articles declined steadily between 2012 and 2021, with a modest
increase in 2022 (Figure 6). (Fewer articles were available for 2022 than for prior years.) Using a generalized linear model with a binomial family to perform logistic regression, we found this decline to be statistically significant (p<0.001). A χ² goodness-of-fit test revealed that the number of 'Definitely problematic' articles differed significantly by subdiscipline (p<0.001). The subdisciplines with the highest percentages of problematic articles were Cell Biology, Developmental Biology, and Stem Cells and Regenerative Medicine (Figure 7). The subdisciplines with the lowest percentages were Evolutionary Biology, Genetics and Genomics, and Computational and Systems Biology.
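The two significance tests described above can be sketched as follows. This is a hedged illustration using made-up counts; the real per-article data are in the paper's supplementary files, and all variable names and numbers here are our own.

```python
# Sketch of (1) a binomial-family logistic regression of problematic
# status on publication year and (2) a chi-squared goodness-of-fit test
# across subdisciplines. Counts are toy values, not the paper's data.
import numpy as np
from scipy import stats

# (1) Logistic trend, fit by Newton-Raphson to avoid extra dependencies.
years = np.array([2012, 2014, 2016, 2018, 2020, 2022], dtype=float)
problematic = np.array([40, 35, 28, 20, 12, 15], dtype=float)  # toy counts
total = np.full_like(problematic, 100.0)                       # articles/year

x = np.column_stack([np.ones_like(years), years - years.mean()])
beta = np.zeros(2)
for _ in range(25):
    p = 1.0 / (1.0 + np.exp(-x @ beta))      # fitted probabilities
    w = total * p * (1.0 - p)                # binomial weights
    grad = x.T @ (problematic - total * p)   # score vector
    hess = x.T @ (x * w[:, None])            # observed information
    beta += np.linalg.solve(hess, grad)
slope = beta[1]  # a negative slope indicates a declining trend

# (2) Do problematic-article counts differ across subdisciplines,
# relative to the null hypothesis of equal counts?
observed = np.array([60, 55, 50, 20, 15, 10], dtype=float)  # toy counts
chi2_stat, p_value = stats.chisquare(observed)
```

With these toy counts, the fitted slope is negative (a declining trend) and the χ² test rejects equal counts, mirroring the pattern reported above.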
Despite the benefits of manual review, this process is infeasible on a large scale. Therefore, we evaluated techniques for automating image classification. As an initial test, we used the five image-quantification metrics. We also combined these into a single, rank-based score for each image. In all cases, the metrics differed significantly between the 'Definitely okay' and 'Definitely problematic' images (Figure 8). For the area under the receiver operating characteristic curve (AUROC), a value near 1.0 indicates relatively high performance, whereas a value of 0.5 indicates that predictions are no better than random guessing. The best-performing metric was the number of color pairs that exhibited a high color-distance ratio between the original and simulated images (AUROC: 0.75; AUPRC: 0.34). All the other metrics, except the mean pixel-wise color distance between the original and simulated image, performed better than random guessing (Supplementary file 1A). As an alternative to the combined rank score, we used classification algorithms to make predictions in cross-validation, with the five metrics as inputs.

Discussion
We also manually reviewed images from selected articles from PubMed Central, as images from these articles represent life-science journals more broadly. After manual review, we estimate that 12.8% of the figures in eLife would be challenging to interpret for scientists with moderate-to-severe deuteranopia. The percentage of 'Definitely problematic' figures in PubMed Central articles was considerably lower (5.2%). One reason is that a much higher percentage of the images from PubMed Central were grayscale (38.5%, compared with 4.4% for eLife). The findings for both sources indicate that color accessibility is a problem for thousands of journal articles per year. Significant work has been done to address and improve accessibility for individuals with CVD (Zhu and Mao, 2021). This work can be categorized into four types: simulation methods, recolorization methods, studies estimating the frequency of accessible images, and educational resources. Simulation methods have been developed to better understand how images appear to individuals with CVD. Brettel et al. first simulated CVDs using the long, medium, and short (LMS) color space (Brettel et al., 1997). For dichromacy, the colors in the LMS space are projected onto an axis that corresponds to the nonfunctional cone cell. Viénot et al. expanded on this work by applying a 3x3 transformation matrix to simulate images in the same LMS space (Viénot et al., 1999). Machado et al. created matrices to simulate CVDs based on the shift theory of cone cell sensitivity (Machado et al., 2009; Stockman and Sharpe, 2000). These algorithms allow individuals without CVD to qualitatively test how their images might appear to people with CVD. The simulation algorithms and matrices are freely available via websites and software packages (Coblis, 2021; Color blind, 2020; DaltonLens, 2023; Wilke, 2023). CVD simulations have facilitated the creation of colorblind-friendly palettes (Olson and Brewer, 1997), and they have led to algorithms that recolor images to make them more accessible to people with CVD. Recolorization methods focus on enhancing color contrasts while preserving image naturalness (Zhu and Mao, 2021). Many algorithms have been developed to compensate for dichromacy (Jefferson and Harvey, 2007; Huang et al., 2007; Ruminski et al., 2010; Rasche et al., 2005; Machado and Oliveira, 2010; Ching and Sabudin, 2010; Ribeiro and Gomes, 2020; Li et al., 2020; Wang et al., 2021; Nakauchi and Onouchi, 2008; Zhu et al., 2019b; Ma et al., 2009). These algorithms apply a variety of techniques, including hue rotation, customized difference addition, node mapping, and generative adversarial networks (Zhu and Mao, 2021; Li et al., 2020). Many of these methods have been tested for efficacy, both qualitatively and quantitatively (Zhu and Mao, 2021).
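A matrix-based simulation of the kind described above can be sketched in a few lines. The matrix below is the widely redistributed full-severity deuteranopia matrix attributed to Machado et al. (2009); treat it as an assumption and verify the values against the original supplementary data before relying on them, and note that a faithful implementation would also convert between sRGB and linear RGB.

```python
# Sketch: simulate deuteranopia by applying a 3x3 transformation matrix
# to each RGB pixel, in the spirit of the matrix-based methods above.
# Matrix values are the commonly redistributed Machado et al. (2009)
# full-severity deuteranopia matrix (assumed, not verified here).
import numpy as np

DEUTERANOPIA = np.array([
    [ 0.367322, 0.860646, -0.227968],
    [ 0.280085, 0.672501,  0.047413],
    [-0.011820, 0.042940,  0.968881],
])

def simulate_deuteranopia(rgb):
    """Apply the simulation matrix to an (H, W, 3) array in [0, 1]."""
    out = rgb @ DEUTERANOPIA.T        # per-pixel matrix multiply
    return np.clip(out, 0.0, 1.0)     # keep values displayable

# Pure red and pure green collapse toward similar dull tones.
red = np.array([[[1.0, 0.0, 0.0]]])
green = np.array([[[0.0, 1.0, 0.0]]])
sim_red = simulate_deuteranopia(red)
sim_green = simulate_deuteranopia(green)
```

After simulation, the distance between red and green shrinks substantially, which is why red/green pairings are flagged as risky throughout this paper.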
Recolorization algorithms have been applied to PC displays, websites, and smart glasses (Tanuwidjaja, 2014).Despite the prevalence of these algorithms, current techniques have not been systematically compared and may sacrifice image naturalness to increase contrast.Additionally, recoloring may not improve the accessibility of some scientific figures because papers often reference colors in figure descriptions; recoloring the image could interfere with matching colors between the text and images.
An increase in available resources for making figures accessible to individuals with CVD has prompted some researchers to investigate whether these resources have decreased the frequency of publishing scientific figures with problematic colors. Frane examined the prevalence of images in psychology journals that could be confusing to people with CVD (Frane, 2015). A group of panelists with CVD qualitatively evaluated 246 images and found that 13.8% of color figures caused difficulty for at least one panelist; this percentage is similar to our findings. Frane also found that, in their instructions to authors, journals rarely mentioned the importance of designing figures for CVD accessibility. Angerbauer et al. recruited crowdworkers to analyze a sample of 1,710 published images and to identify issues with the use of color (Angerbauer et al., 2022). On average, 60% of the sampled images were rated as 'accessible' across CVD types. From 2000 to 2019, they observed a slight increase in CVD accessibility for published figures.
Educational resources are available to researchers looking to make their figures suitable for people with CVD. For example, Jambor et al. provide guidelines and examples to help researchers avoid common problems (Jambor et al., 2021). JetFighter scans preprints from bioRxiv and searches for rainbow-based color schemes (Saladi and Maggiolo, 2019). When these are identified, JetFighter notifies the authors about the page(s) that might need to be adjusted. However, as we have shown, the presence of particular color combinations does not necessarily indicate that an image is problematic for people with deuteranopia. Frequently, a more nuanced evaluation is necessary.
The seaborn Python package includes a 'colorblind' palette (Waskom, 2021). The colorBlindness package for R provides simulation tools and CVD-friendly palettes (Ou, 2021). The scatterHatch package facilitates creation of CVD-friendly scatter plots for single-cell data (Guha et al., 2022). When designing figures, researchers may find it useful to first design them so that key elements are distinguishable in grayscale. Then, color can be added, if necessary, to enhance the image. Color should not be used for the sole purpose of making an image aesthetically pleasing. Using minimal color avoids problems that arise from problematic color pairings. Rainbow color maps, in particular, should be avoided. If a researcher finds it necessary to include problematic color pairings in figures, they can vary the saturation and intensity of the colors so they are more distinguishable to people with CVD. Many of the problematic figures that we identified in this study originated from fluorescence microscopy experiments in which red and green dyes were used. Choosing alternative dye colors could reduce this problem and improve the interpretability of microscopy images for people in all fields.
Our analysis has limitations. Firstly, it relied on deuteranopia simulations rather than the experiences of deuteranopes. However, by using simulations, the reviewers could see two versions of each image: the original and a simulated version. We believe this is important in assessing the extent to which deuteranopia could confound image interpretations. Conceivably, this could be done with deuteranopes after recoloration, but it is difficult to know whether deuteranopes would see the recolored images in the same way that non-deuteranopes see the original images. Secondly, because we used a single, relatively high severity threshold, our simulations do not represent the full spectrum of experiences that scientists with deuteranopia have. Thus, our findings and tools should be relevant to some (but not all) people with deuteranopia. Furthermore, recent evidence suggests that commonly used mathematical representations of color differences are unlikely to reflect human perceptions perfectly (Bujack et al., 2022). As methods evolve for more accurately simulating color perception, we will be better able to estimate the extent to which scientific figures are problematic for deuteranopes. Thirdly, our evaluations focused on deuteranopia, the most common form of CVD. It will be important to address other forms of CVD, such as protanopia, in future work. Fourthly, we identified some images as 'Probably problematic' or 'Probably okay'. Using our review process, we were unable to draw firm conclusions about these images. To avoid adding noise to the classification analyses, we excluded these images and provided notes reflecting our reasoning. Future work may help to clarify these labels. Finally, our CNN model performed well at differentiating between 'Definitely okay' and 'Definitely problematic' images in the eLife hold-out test set; however, the model's predictive performance dropped considerably when applied to the PubMed Central hold-out test set. Many of the eLife images
are from cell-related research, and we labeled many of these as problematic.
Many other image types were also identified as unfriendly, including heat maps, line charts, maps, three-dimensional structural representations of proteins, photographs, and network diagrams. Our model may have developed a bias toward patterns specific to image types that are over-represented in eLife, affecting its performance for other journals. The PubMed Central Open Access Subset contains articles from thousands of journals, spanning diverse subdisciplines of biology and medicine. It seems likely that this diversity is a factor behind the drop in performance. Future efforts to review larger collections of PubMed Central articles could help to overcome this limitation. By summarizing color patterns in more than 66,000 images and manually reviewing 8000 images, we have created an open data resource that other researchers can use to develop their own methods. Using all of these images, we trained a machine-learning model that predicts whether images are friendly to deuteranopes. It is available as a Web application (https://bioapps.byu.edu/colorblind_image_tester). Scientists and others can use it to obtain insights into whether individual images are accessible to scientists with deuteranopia. However, this tool should be used as a starting point only; human judgment remains essential.

Image acquisition
We evaluated images in articles from eLife, an open-access journal that publishes research in 'all areas of the life sciences and medicine'. Article content from this journal is released under a Creative Commons Attribution license. On June 1, 2022, we downloaded all available images from an Amazon Web Services storage bucket provided by journal staff. We also cloned a GitHub repository that eLife provides (https://github.com/elifesciences/elife-article-xml). This repository contains text and metadata from all articles published in the journal since its inception. For each article, we parsed the article identifier, digital object identifier, article type, article subject, and publication date. We excluded any article that was not published with the 'Research article' type. These articles were published between the years 2012 and 2022.
On March 21, 2024, we downloaded a list of articles from the PMC Open Access Subset (PMC Open Access Subset, 2003). We filtered the articles to those published between 2012 and 2022 that used a CC BY license (https://creativecommons.org) and were categorized as research articles. This filtering resulted in 2,730,256 article candidates.

Image summarization metrics
For each available image, we identified whether the image was grayscale or contained colors. For each color image, we calculated a series of metrics to summarize the colors, contrasts, and distances between potentially problematic colors. These metrics have similarities to those used to assess recoloring algorithms, including global luminance error (Kuhn et al., 2008) and local contrast error.
• Supplementary file. Results of our categorical reevaluation of each image. 'Unclear' = we continue to conclude that our manual label was correct, and it is unclear what confused the model. 'Understandable' = we continue to conclude that our manual label was correct, and we think we understand what confused the model. 'Agree' = we acknowledge that the manual label was incorrect, and the model helped us identify that.
• Supplementary file 6. Results of manual curation for 2,000 images used from the PubMed Central hold-out test set.
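The 'high-ratio color pair' notion used throughout our metrics can be illustrated as follows. This is a minimal sketch under our own assumptions; the function name and the simulated colors are illustrative stand-ins, not output from our simulation pipeline.

```python
# Sketch: for a pair of colors, the ratio of their Euclidean distance in
# the original image to their distance after deuteranopia simulation.
# A high ratio flags pairs that are distinct for typical trichromats but
# nearly indistinguishable for deuteranopes. Toy values only.
import numpy as np

def distance_ratio(pair_original, pair_simulated):
    d_orig = np.linalg.norm(pair_original[0] - pair_original[1])
    d_sim = np.linalg.norm(pair_simulated[0] - pair_simulated[1])
    return d_orig / max(d_sim, 1e-9)  # guard against identical simulated colors

red, green = np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])
# Under simulation, both colors shift toward similar brownish tones
# (illustrative values, not computed by a real simulation).
sim_red, sim_green = np.array([0.55, 0.45, 0.0]), np.array([0.65, 0.55, 0.05])
ratio = distance_ratio((red, green), (sim_red, sim_green))
```

A pair whose simulated colors remain as far apart as the originals yields a ratio near 1, whereas the red/green pair above yields a much higher ratio.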

Figure 1. Mean pixel-wise color distance between each original and simulated image from eLife. The histogram depicts the frequency distribution of this metric for 64,509 non-grayscale images.

Figure 3. Number of color pairs per image that exhibited a high color-distance ratio between the original and simulated images from eLife. The histogram depicts the frequency distribution of this metric for 64,509 non-grayscale images.
Figure 5. Proportion of pixels in each original image that used a color from one of the high-ratio color pairs from eLife. The histogram depicts the frequency distribution of this metric for 64,509 non-grayscale images.

Figure 15. Receiver operating characteristic curve for the Logistic Regression predictions for the images in the eLife hold-out test set. This curve illustrates tradeoffs between sensitivity and specificity. The area under the curve is 0.82. The dashed, gray line indicates the performance expected by random chance.

Figure 17. Receiver operating characteristic curve for the Convolutional Neural Network predictions for the images in the eLife hold-out test set. This curve illustrates tradeoffs between sensitivity and specificity. The area under the curve is 0.89. The dashed, gray line indicates the performance expected by random chance.

Figure 18. Precision-recall curve for the Convolutional Neural Network predictions for the images in the eLife hold-out test set. This curve illustrates tradeoffs between precision and recall. The area under the curve is 0.75. The dashed, gray line indicates the frequency of the minority class ('Definitely problematic' images).

Figure 19. Convolutional Neural Network predictions for images in the eLife hold-out test set. Each point represents the prediction for an individual image. Relatively high confidence scores indicate that the model had more confidence that a given image was 'Definitely problematic' for a person with deuteranopia.

Figure 21. Precision-recall curve for the Convolutional Neural Network predictions for the images in the PubMed Central hold-out test set. This curve illustrates tradeoffs between precision and recall. The area under the curve is 0.39. The dashed, gray line indicates the frequency of the minority class ('Definitely problematic' images).

Figure 22. Convolutional Neural Network predictions for images in the PubMed Central hold-out test set. Each point represents the prediction for an individual image. Relatively high confidence scores indicate that the model had more confidence that a given image was 'Definitely problematic' for a person with deuteranopia.

Figure 23. Logistic Regression predictions for the images in the PubMed Central hold-out test set. Each point represents the prediction for an individual image. Relatively high confidence scores indicate that the model had more confidence that a given image was 'Definitely problematic' for a person with deuteranopia.

Figure 25. Precision-recall curve for the Logistic Regression predictions for the images in the PubMed Central hold-out test set. This curve illustrates tradeoffs between precision and recall. The area under the curve is 0.16. The dashed, gray line indicates the frequency of the minority class ('Definitely problematic' images).
Figure 6. Longitudinal trends for the eLife articles. For the training set, we summarized our findings per article. This graph shows article counts for the 'Definitely okay' and 'Definitely problematic' categories for each year evaluated.