# DETECTING PUBLISHER BIAS IN ACADEMIC TEXTBOOKS USING BAYESIAN ENSEMBLE METHODS AND LARGE LANGUAGE MODELS

**Derek Lankeaux**

**Applied Statistics Program**

**Rochester Institute of Technology**

**dlankeaux12@gmail.com**

---

## 📚 Project Resources

**Repository:** [TextbookBiasDetection on GitHub](https://github.com/dl1413/TextbookBiasDetection)

**Full Paper:** [Detecting Publisher Bias in Academic Textbooks Using Bayesian Ensemble Methods and Large Language ModelsF.pdf](./Detecting%20Publisher%20Bias%20in%20Academic%20Textbooks%20Using%20Bayesian%20Ensemble%20Methods%20and%20Large%20Language%20ModelsF.pdf)

**Implementation Notebook:** [textbook_bias_detection.ipynb](./textbook_bias_detection.ipynb)

**Contact:** [dlankeaux12@gmail.com](mailto:dlankeaux12@gmail.com)

---

## ABSTRACT

The proliferation of commercial influence in academic publishing raises concerns about potential bias in educational materials. This study develops a framework for detecting and quantifying publisher bias in academic textbooks using an ensemble of large language models (LLMs) combined with exploratory factor analysis and Bayesian hierarchical modeling. We analyzed 4,500 passages from 150 textbooks across six disciplines, each rated by three LLMs (GPT-4, Claude-3, Llama-3) on five bias dimensions. Exploratory factor analysis revealed four latent bias constructs: Political Framing (32.4% variance), Commercial Influence (21.7% variance), Perspective Diversity (18.3% variance), and Epistemic Certainty (14.2% variance). Bayesian hierarchical models quantified publisher type effects: relative to university presses, for-profit publishers exhibited substantially higher commercial influence (posterior mean difference = 1.24, 95% HDI: 1.05 to 1.43) and lower perspective diversity (posterior mean difference = -1.03, 95% HDI: -1.24 to -0.82). Kruskal-Wallis tests indicated significant differences across for-profit, university press, and open-source publishers on all four factors (p < 0.001). The LLM ensemble demonstrated excellent inter-rater reliability (Krippendorff's α = 0.84), supporting this approach for scalable bias detection. This framework provides educators, publishers, and policymakers with empirical tools to identify and mitigate bias in educational materials, promoting more balanced and diverse academic content.

Keywords: textbook bias, publisher influence, large language models, Bayesian factor analysis, educational equity, content analysis, machine learning

## 1. INTRODUCTION

### 1.1 Background and Motivation

Academic textbooks serve as foundational knowledge sources in higher education, shaping how millions of students understand complex subjects ranging from biology to economics. The textbook publishing industry, increasingly dominated by for-profit corporations, generates over $15 billion annually in the United States alone. This consolidation raises critical questions about potential bias in educational content, particularly regarding commercial influence, ideological framing, and perspective diversity.

Traditional methods of detecting bias in textbooks rely heavily on manual expert review, a labor-intensive process that cannot scale to the thousands of textbooks published annually. Recent advances in large language models (LLMs) offer new capabilities for automated content analysis, yet their application to systematic bias detection in educational materials remains largely unexplored.

This research addresses this gap by developing a comprehensive framework that leverages multiple LLMs in an ensemble approach, combined with rigorous statistical methods including exploratory factor analysis (EFA) and Bayesian hierarchical modeling. By analyzing patterns across publisher types—for-profit corporations, university presses, and open-source initiatives—we provide empirical evidence of systematic differences in how educational content is presented.

### 1.2 Research Objectives

This study pursues four primary objectives:

- Develop a scalable, automated framework for detecting bias in academic textbooks using LLM ensemble methods

- Identify and characterize latent dimensions of bias through exploratory factor analysis

- Quantify publisher type effects on bias dimensions with full uncertainty quantification using Bayesian hierarchical models

- Validate the LLM ensemble approach against established psychometric standards for inter-rater reliability

### 1.3 Significance and Contributions

This research makes several important contributions to both methodological and substantive domains:

Methodologically, we introduce the first comprehensive application of LLM ensembles to educational content analysis, demonstrating that automated approaches can achieve reliability comparable to expert human raters. The integration of multiple LLMs provides robustness against individual model biases while enabling analysis at scales impossible with manual review.

Substantively, we provide the first large-scale empirical evidence of systematic differences in textbook content across publisher types. These findings have direct implications for educational equity, as students' exposure to diverse perspectives and balanced content varies significantly based on institutional purchasing decisions.

Practically, this framework offers educators, librarians, and institutional decision-makers actionable tools to evaluate textbook options, while providing publishers with objective metrics to audit their own content development processes.

## 2. LITERATURE REVIEW

### 2.1 Bias in Educational Materials

The presence of bias in educational materials has been a persistent concern in education research. Sleeter and Grant (1991) conducted seminal work examining how textbooks represented race, ethnicity, gender, and disability, finding systematic underrepresentation and stereotyping of marginalized groups. More recent studies have extended this analysis to ideological bias, with Apple (2004) demonstrating how textbook selection processes reflect broader political and economic power structures.

Specifically regarding commercial influence, Krimsky (2003) documented pharmaceutical industry influence in medical textbooks, while Mirowski (2011) examined how corporate funding shapes economics textbook content. These studies relied on qualitative analysis of relatively small samples, limiting generalizability and scalability.

### 2.2 Computational Text Analysis Methods

The field of computational text analysis has advanced rapidly with the development of sophisticated natural language processing techniques. The emergence of large language models represents a paradigm shift. Models like GPT-4, Claude, and Llama demonstrate remarkable capabilities in complex text understanding tasks. However, individual LLMs exhibit their own biases and limitations. Ensemble approaches, where multiple models contribute to final assessments, show promise for improving robustness and reducing individual model biases.

### 2.3 Factor Analysis and Latent Variable Modeling

Factor analysis provides powerful tools for identifying underlying dimensions in complex, multivariate data. Modern approaches distinguish between exploratory factor analysis (EFA), which discovers latent structure, and confirmatory factor analysis (CFA), which tests predetermined structures. This study employs EFA to discover latent bias dimensions from LLM ratings, followed by Bayesian hierarchical modeling to quantify group differences with full uncertainty quantification.

## 3. METHODOLOGY

### 3.1 Dataset Construction

We constructed a stratified sample of 150 academic textbooks published between 2018 and 2023, distributed across three publisher categories: for-profit publishers (n=75: Pearson, Cengage, McGraw-Hill, Elsevier, Wiley), university presses (n=50: Oxford, Cambridge, Princeton, MIT, Chicago), and open-source initiatives (n=25: OpenStax, BCcampus, Saylor). Textbooks were sampled across six disciplines: Biology, Chemistry, Computer Science, Economics, Psychology, and History.

From each textbook, we extracted 30 passages using stratified sampling: 10 conceptual explanations, 10 introductory sections, and 10 controversial topics. This yielded 4,500 total passages for analysis. Passages ranged from 200 to 500 words to provide sufficient context while remaining tractable for LLM analysis.
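The sampling design above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the counts come from the text, while the dictionary names and layout are assumptions, and stratification by discipline is omitted because the per-discipline counts are not stated.

```python
from itertools import product

# Counts are taken from the text; the names and layout are illustrative.
TEXTBOOKS_BY_PUBLISHER_TYPE = {"for_profit": 75, "university_press": 50, "open_source": 25}
PASSAGES_BY_CATEGORY = {"conceptual": 10, "introductory": 10, "controversial": 10}


def passage_counts():
    """Return the number of passages in each (publisher type, category) cell."""
    return {
        (publisher, category): n_books * n_passages
        for (publisher, n_books), (category, n_passages) in product(
            TEXTBOOKS_BY_PUBLISHER_TYPE.items(), PASSAGES_BY_CATEGORY.items()
        )
    }


counts = passage_counts()
total_textbooks = sum(TEXTBOOKS_BY_PUBLISHER_TYPE.values())
total_passages = sum(counts.values())
print(total_textbooks, total_passages)
```

Because the publisher groups are unequal in size, half of all passages (2,250 of 4,500) come from for-profit textbooks, which is worth keeping in mind when comparing group-level estimates.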

### 3.2 LLM Ensemble Rating System

We employed three state-of-the-art LLMs developed with distinct training approaches: GPT-4 (transformer-based with RLHF training), Claude-3 (Constitutional AI approach), and Llama-3 (open-source model). Each LLM rated every passage on five dimensions: Perspective Balance (1-7 scale), Source Authority (1-7 scale), Commercial Framing (1-7 scale), Certainty Language (1-7 scale), and Ideological Framing (-3 to +3 scale). This yielded 15 ratings per passage (3 models × 5 dimensions) across 4,500 passages, totaling 67,500 individual ratings.
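One way to represent the rating scheme is a small schema with range checks. This is a minimal sketch, not the project's implementation: the rater identifiers, the dictionary layout, and the example scores are hypothetical, and averaging across models is an assumed aggregation step that the text does not specify.

```python
from statistics import mean

RATERS = ("gpt4", "claude3", "llama3")  # hypothetical identifiers
SCALES = {  # dimension -> (minimum, maximum), as listed in the text
    "perspective_balance": (1, 7),
    "source_authority": (1, 7),
    "commercial_framing": (1, 7),
    "certainty_language": (1, 7),
    "ideological_framing": (-3, 3),
}


def validate(ratings):
    """Raise ValueError if a rating is missing or falls outside its scale."""
    for rater in RATERS:
        for dim, (low, high) in SCALES.items():
            try:
                value = ratings[rater][dim]
            except KeyError:
                raise ValueError(f"missing rating: {rater}/{dim}") from None
            if not low <= value <= high:
                raise ValueError(f"{rater}/{dim}={value} outside [{low}, {high}]")


def ensemble_mean(ratings):
    """Average each dimension across raters (assumed aggregation rule)."""
    validate(ratings)
    return {dim: mean(ratings[r][dim] for r in RATERS) for dim in SCALES}


# Made-up scores for a single passage, for illustration only.
example = {
    "gpt4": {"perspective_balance": 5, "source_authority": 6, "commercial_framing": 2,
             "certainty_language": 4, "ideological_framing": -1},
    "claude3": {"perspective_balance": 4, "source_authority": 6, "commercial_framing": 3,
                "certainty_language": 4, "ideological_framing": 0},
    "llama3": {"perspective_balance": 6, "source_authority": 5, "commercial_framing": 2,
               "certainty_language": 5, "ideological_framing": -1},
}

ratings_per_passage = len(RATERS) * len(SCALES)
total_ratings = ratings_per_passage * 4500
print(ratings_per_passage, total_ratings, ensemble_mean(example)["perspective_balance"])
```

Keeping the per-model ratings separate, rather than storing only the average, preserves the 15 columns per passage needed for the reliability and factor analyses.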

### 3.3 Statistical Analysis

We assessed inter-rater reliability using Krippendorff's alpha, the intraclass correlation coefficient (ICC), and average pairwise Pearson correlations. We applied EFA with varimax rotation to discover latent bias dimensions, choosing the number of factors using the Kaiser criterion, scree plot analysis, and parallel analysis. Differences across publisher types on each factor were tested with Kruskal-Wallis tests. For each identified factor, we then fitted Bayesian hierarchical models using PyMC3 with 1,000 tuning iterations and 1,000 post-warmup samples.
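The text does not state the exact form of the hierarchical model. One plausible specification, offered only as a sketch, treats the factor score $y_{ij}$ of passage $i$ in textbook $j$ as varying around a publisher-type mean $\mu_{k[j]}$ with a textbook-level random intercept $u_j$. The likelihood, the priors, and all symbols below are assumptions, not the fitted model.

```latex
\begin{aligned}
y_{ij} &\sim \mathcal{N}\!\left(\mu_{k[j]} + u_j,\ \sigma^2\right) \\
u_j &\sim \mathcal{N}\!\left(0,\ \tau^2\right) \\
\mu_k &\sim \mathcal{N}(0,\ 1), \qquad k \in \{\text{for-profit},\ \text{university},\ \text{open-source}\} \\
\sigma,\ \tau &\sim \text{Half-Normal}(1) \\
\delta &= \mu_{\text{for-profit}} - \mu_{\text{university}}
\end{aligned}
```

Under this specification, the publisher contrast $\delta$ is a derived quantity, so its posterior follows directly from the draws of the two group means, and the textbook-level term $u_j$ accounts for the 30 passages from each textbook not being independent.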

## 4. RESULTS

### 4.1 Dataset Characteristics

The final dataset comprised 4,500 passages from 150 textbooks with complete ratings from all three LLMs.

### 4.2 Inter-Rater Reliability

The LLM ensemble demonstrated strong agreement across rating dimensions. Table 1 presents reliability metrics for three of the five dimensions and for the ensemble overall.

**Table 1: Inter-Rater Reliability Metrics**

| Dimension | Krippendorff's α | ICC(2,1) | Avg. Corr. | Interpretation |
| --- | --- | --- | --- | --- |
| Commercial Framing | 0.91 | 0.89 | 0.87 | Excellent |
| Certainty Language | 0.85 | 0.83 | 0.81 | Excellent |
| Perspective Balance | 0.82 | 0.80 | 0.78 | Excellent |
| Overall | 0.84 | 0.82 | 0.79 | Excellent |

The overall Krippendorff's α of 0.84 exceeds the 0.80 threshold conventionally required for reliable content analysis and is comparable to values reported for trained human raters.
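For readers who want to see how the statistic is computed, the sketch below implements Krippendorff's α as one minus the ratio of observed to expected disagreement. It assumes the interval (squared-difference) metric and complete ratings; the text does not say which metric was used, and the function names and toy data are hypothetical.

```python
def sum_sq_diffs(values):
    """Sum of (a - b)**2 over all unordered pairs, via n*sum(v^2) - (sum v)^2."""
    n = len(values)
    return n * sum(v * v for v in values) - sum(values) ** 2


def krippendorff_alpha_interval(units):
    """Krippendorff's alpha with the interval (squared-difference) metric.

    `units` is a list of rating lists, one per passage; units with fewer
    than two ratings cannot be paired and are skipped.
    """
    units = [u for u in units if len(u) >= 2]
    pooled = [v for u in units for v in u]
    n = len(pooled)
    # Observed disagreement: ordered within-unit pairs, weighted by 1/(m_u - 1).
    d_observed = sum(2 * sum_sq_diffs(u) / (len(u) - 1) for u in units) / n
    # Expected disagreement: ordered pairs over all pooled ratings.
    d_expected = 2 * sum_sq_diffs(pooled) / (n * (n - 1))
    if d_expected == 0:
        raise ValueError("no variation in ratings; alpha is undefined")
    return 1 - d_observed / d_expected


# Toy data: three passages, two raters each (not from the study).
toy_units = [[1, 2], [3, 3], [5, 4]]
alpha = krippendorff_alpha_interval(toy_units)
print(round(alpha, 3))  # 0.833
```

In the study's setting, each unit would hold the three model ratings of one passage on one dimension, so α is computed per dimension.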

### 4.3 Exploratory Factor Analysis

Bartlett's test of sphericity (χ² = 48,234.5, df = 105, p < 0.001) confirmed that correlations among rating dimensions were sufficient for factor analysis. The overall KMO measure of 0.89 indicated excellent sampling adequacy. Parallel analysis and scree plot examination converged on a four-factor solution explaining 86.6% of total variance.

The four factors were interpreted as: Factor 1 - Political Framing (32.4% variance), capturing left-right ideological positioning; Factor 2 - Commercial Influence (21.7% variance), reflecting emphasis on commercial applications; Factor 3 - Perspective Diversity (18.3% variance), representing inclusion of multiple viewpoints; and Factor 4 - Epistemic Certainty (14.2% variance), capturing presentation of knowledge with certainty versus uncertainty.
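Two figures in this subsection can be verified by hand. Bartlett's test on a correlation matrix of p variables has p(p - 1)/2 degrees of freedom, so df = 105 implies p = 15, which matches 3 models × 5 dimensions. Treating each model-by-dimension rating as a separate input variable is an inference from that df, not something the text states. The sketch below, with hypothetical names, also checks that the four variance shares sum to the reported total.

```python
def bartlett_df(n_variables):
    """Degrees of freedom for Bartlett's test of sphericity: p(p - 1) / 2."""
    return n_variables * (n_variables - 1) // 2


n_variables = 3 * 5  # assumed: one column per (model, dimension) pair

# Percent of variance explained per factor, as reported in the text.
variance_shares = {
    "Political Framing": 32.4,
    "Commercial Influence": 21.7,
    "Perspective Diversity": 18.3,
    "Epistemic Certainty": 14.2,
}
total_variance = round(sum(variance_shares.values()), 1)

print(bartlett_df(n_variables), total_variance)
```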

### 4.4 Publisher Type Comparisons

Kruskal-Wallis tests revealed significant differences across publisher types for all four factors. Table 2 presents mean factor scores by publisher type.

**Table 2: Mean Factor Scores by Publisher Type**

| Factor | For-Profit | University | Open-Source | p-value |
| --- | --- | --- | --- | --- |
| Political Framing | +0.12 | -0.08 | -0.15 | <0.001 |
| Commercial Influence | +0.73 | -0.51 | -0.62 | <0.001 |
| Perspective Diversity | -0.62 | +0.41 | +0.58 | <0.001 |
| Epistemic Certainty | +0.31 | -0.22 | +0.05 | <0.001 |
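The p-values in Table 2 come from Kruskal-Wallis tests. As a rough illustration of that test, the sketch below computes the tie-corrected H statistic in plain Python. With three groups, H is referred to a chi-square distribution with 2 degrees of freedom, whose upper-tail probability is exactly exp(-H/2). The function names and data are made up for illustration; in practice a library routine such as `scipy.stats.kruskal` would normally be used.

```python
import math


def kruskal_wallis_h(groups):
    """Return the tie-corrected Kruskal-Wallis H statistic."""
    pooled = sorted(v for g in groups for v in g)
    n = len(pooled)
    rank_of = {}  # value -> average rank among tied values
    tie_term = 0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        rank_of[pooled[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        t = j - i
        tie_term += t**3 - t
        i = j
    correction = 1 - tie_term / (n**3 - n)
    if correction == 0:
        raise ValueError("all values are identical; H is undefined")
    rank_sum_term = sum(sum(rank_of[v] for v in g) ** 2 / len(g) for g in groups)
    h = 12 / (n * (n + 1)) * rank_sum_term - 3 * (n + 1)
    return h / correction


def p_value_three_groups(h):
    """Chi-square upper tail with df = 2, valid only for three groups."""
    return math.exp(-h / 2)


# Made-up factor scores for three publisher types (not study data).
toy_groups = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
h_stat = kruskal_wallis_h(toy_groups)
p_val = p_value_three_groups(h_stat)
print(round(h_stat, 2), round(p_val, 4))
```

The chi-square reference distribution is an approximation that improves with group size, so it is far more reasonable for samples of the study's size than for the nine toy values used here.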

Bayesian hierarchical models provided full posterior distributions for each group mean. On the Commercial Influence factor, for-profit publishers scored substantially higher (mean = +0.73, 95% HDI: 0.65 to 0.81) than university presses (mean = -0.51, 95% HDI: -0.61 to -0.41); the posterior difference (mean = 1.24, 95% HDI: 1.05 to 1.43) is large, and its interval excludes zero. Similarly, on the Perspective Diversity factor, for-profit publishers scored markedly lower (mean = -0.62, 95% HDI: -0.70 to -0.54) than university presses (mean = +0.41, 95% HDI: 0.30 to 0.52); the posterior difference (mean = -1.03, 95% HDI: -1.24 to -0.82) indicates substantially more diverse viewpoints in university press textbooks.
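A highest density interval (HDI) is the narrowest interval containing a given share of the posterior draws, and a group difference is summarized by differencing the draws pairwise before taking the interval. The sketch below illustrates both steps. The draws are simulated stand-ins centered on the reported group means with an arbitrary spread, so the interval it prints will not match the reported HDI; the function and variable names are hypothetical.

```python
import math
import random


def hdi(samples, mass=0.95):
    """Return the narrowest interval containing `mass` of the samples."""
    ordered = sorted(samples)
    n = len(ordered)
    k = max(1, math.ceil(mass * n))  # number of draws inside the interval
    # Slide a window of k consecutive draws and keep the narrowest one.
    start = min(range(n - k + 1), key=lambda i: ordered[i + k - 1] - ordered[i])
    return ordered[start], ordered[start + k - 1]


random.seed(0)
# Stand-in posterior draws (assumed normal, sd chosen arbitrarily).
for_profit = [random.gauss(0.73, 0.05) for _ in range(4000)]
university = [random.gauss(-0.51, 0.05) for _ in range(4000)]

# Difference computed draw by draw, then summarized.
difference = [a - b for a, b in zip(for_profit, university)]
low, high = hdi(difference)
print(round(sum(difference) / len(difference), 2), low > 0)
```

With real MCMC output, the two lists would be the sampled group means from the same posterior draws, so that any correlation between them carries through to the difference.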

## 5. DISCUSSION

### 5.1 Principal Findings

This study demonstrates that automated bias detection using LLM ensembles can achieve excellent reliability while revealing systematic differences in textbook content across publisher types. The LLM ensemble achieved inter-rater reliability comparable to that of trained human coders. Factor analysis revealed four interpretable bias dimensions explaining 86.6% of variance. For-profit publishers exhibited substantially higher commercial influence and lower perspective diversity than university presses and open-source alternatives, with large effect sizes.

### 5.2 Interpretation of Publisher Type Effects

The substantial difference in commercial framing between for-profit publishers and alternatives reflects market pressures where emphasizing practical applications may enhance marketability. However, excessive commercial framing risks conflating academic knowledge with market applications and underrepresenting non-commercial perspectives. The finding that for-profit textbooks present fewer diverse perspectives raises concerns about educational equity, as students using these textbooks may receive more homogeneous viewpoints on contested topics.

### 5.3 Methodological Contributions

Using multiple LLMs mitigates individual model biases while providing robust ratings, and the strong inter-rater reliability supports this approach. Combining EFA to discover latent structure with Bayesian hierarchical models for group comparisons provides both exploratory insight and rigorous statistical inference. This framework can analyze thousands of passages in hours rather than the weeks required for manual review.

### 5.4 Practical Implications

These findings provide actionable guidance for textbook selection. Institutions prioritizing diverse perspectives may prefer university press or open-source textbooks. Educators can use this framework to evaluate specific textbooks under consideration. Publishers can use the results to audit content and benchmark against diversity standards. The evidence also supports policy considerations, including funding for open educational resources and transparency requirements.

### 5.5 Limitations

Although the ensemble showed high reliability, LLMs have known limitations, including training data biases and lack of true understanding. Our stratified sample provides diversity but is neither exhaustive nor guaranteed to generalize beyond the publishers and disciplines sampled. Causal claims require caution, as publisher type may correlate with other confounding factors. Our rating dimensions do not cover all forms of bias, such as demographic representation or visual content.

## 6. CONCLUSION

This study demonstrates that large language model ensembles, combined with rigorous statistical methods, provide scalable, reliable tools for detecting bias in educational materials. Analysis of 4,500 passages from 150 textbooks revealed four latent bias dimensions and substantial differences across publisher types, with for-profit publishers exhibiting significantly higher commercial influence and lower perspective diversity compared to university presses and open-source alternatives.

These findings have direct implications for educational equity. Students' exposure to diverse perspectives and balanced content varies systematically based on which textbooks their institutions adopt. The framework developed here offers educators, publishers, and policymakers empirical tools to identify and address bias in educational materials, moving toward more transparent, diverse, and equitable educational content that serves all learners.

As artificial intelligence capabilities continue advancing, the integration of automated content analysis with educational research promises to transform how we understand and improve educational materials. This study represents an important step in that direction, demonstrating both the potential and the challenges of bringing machine learning to bear on fundamental questions about educational equity and quality.

## REFERENCES

- Apple, M. W. (2004). Ideology and curriculum (3rd ed.). New York: RoutledgeFalmer.

- Bender, E. M., Gebru, T., McMillan-Major, A., & Shmitchell, S. (2021). On the dangers of stochastic parrots: Can language models be too big? Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, 610-623.

- Brown, T. A. (2015). Confirmatory factor analysis for applied research (2nd ed.). New York: Guilford Press.

- Grimmer, J., & Stewart, B. M. (2013). Text as data: The promise and pitfalls of automatic content analysis methods for political texts. Political Analysis, 21(3), 267-297.

- Krimsky, S. (2003). Science in the private interest: Has the lure of profits corrupted biomedical research? Lanham, MD: Rowman & Littlefield.

- Mirowski, P. (2011). Science-mart: Privatizing American science. Cambridge, MA: Harvard University Press.

- Roberts, M. E., Stewart, B. M., Tingley, D., et al. (2014). Structural topic models for open-ended survey responses. American Journal of Political Science, 58(4), 1064-1082.

- Sleeter, C. E., & Grant, C. A. (1991). Race, class, gender, and disability in current textbooks. In M. W. Apple & L. K. Christian-Smith (Eds.), The politics of the textbook (pp. 78-110). New York: Routledge.

- Spearman, C. (1904). 'General intelligence,' objectively determined and measured. The American Journal of Psychology, 15(2), 201-292.