Analysis code to complement Lape, et al. "After the Infection: A Survey of Pathogens and Non-communicable Human Disease" (2023). See below for citation information.
This code is made available along with explanatory flowcharts to enable replication of the results reported in the associated manuscript. UK Biobank data and TriNetX data must be obtained from the respective organizations.
There are many well-established relationships between pathogens and human disease, but far fewer when focusing on non-communicable diseases (NCDs). We leverage data from the UK Biobank and TriNetX to perform a systematic survey across 20 pathogens and 426 diseases, focused primarily on NCDs. To this end, we assess the association between disease status and infection history proxies. We identify 206 pathogen-disease pairs that replicate in both cohorts. We replicate many established relationships, including Helicobacter pylori with several gastroenterological diseases, and connections of Epstein-Barr virus with multiple sclerosis and lupus. Overall, our approach identified evidence of association for 15 of the pathogens and 96 distinct diseases, including a currently controversial link between human cytomegalovirus (CMV) and ulcerative colitis (UC). We validate this connection through two orthogonal analyses, revealing increased CMV gene expression in UC patients and enrichment for UC genetic risk signal near human genes that have altered expression upon CMV infection. Collectively, these results form a foundation for future investigations into mechanistic roles played by pathogens in disease.
All patient identifiers are generic and do not correspond to actual identifiers from either UK Biobank (UKB) or TriNetX (TNX). They are included only to illustrate what the input and output files look like.
- R v4.2.2
- Python v3.7.8
- MASS v7.3-58.1
- performance v0.10.2
- logistf v1.24.1
- dplyr v1.1.0
- data.table v1.14.8
- openxlsx v4.2.5.2
- readxl v1.4.2
- stringr v1.5.0
- glue v1.6.2
- DT v0.27
- numpy v1.22.3
- pandas v1.4.2
- scipy v1.8.0
- sklearn v1.0.2
- statsmodels v0.13.2
- matplotlib v3.7.1
- seaborn v0.11.2
- tabulate v0.8.9
- tqdm v4.64.0
- xlrd v2.0.1
- GNU Parallel v20220122
Tange, O. (2022, January 22). GNU Parallel 20220122 ('20 years'). Zenodo. https://doi.org/10.5281/zenodo.5893336
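To help with replication, the Python package versions listed above can be compared against the local environment. The following is a minimal sketch and is not part of the repository: the names `REQUIRED`, `installed_version`, and `check_versions` are hypothetical, and it assumes that the `sklearn` entry refers to the `scikit-learn` distribution on PyPI. It only covers the Python packages; the R packages would need a separate check.

```python
"""Compare installed Python package versions with those listed in this README."""

# Versions copied from the dependency list above. Keys are pip distribution
# names; "scikit-learn" is ASSUMED to be the package the list calls "sklearn".
REQUIRED = {
    "numpy": "1.22.3",
    "pandas": "1.4.2",
    "scipy": "1.8.0",
    "scikit-learn": "1.0.2",
    "statsmodels": "0.13.2",
    "matplotlib": "3.7.1",
    "seaborn": "0.11.2",
    "tabulate": "0.8.9",
    "tqdm": "4.64.0",
    "xlrd": "2.0.1",
}


def installed_version(name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        # Standard library from Python 3.8 onward.
        from importlib.metadata import PackageNotFoundError, version
    except ImportError:
        # Fallback for Python 3.7, where importlib.metadata is unavailable.
        import pkg_resources
        try:
            return pkg_resources.get_distribution(name).version
        except pkg_resources.DistributionNotFound:
            return None
    try:
        return version(name)
    except PackageNotFoundError:
        return None


def check_versions(required, lookup=installed_version):
    """Return {name: (wanted, found)} for each package whose version differs."""
    mismatches = {}
    for name, wanted in required.items():
        found = lookup(name)
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    # Print one line per package that is missing or at a different version.
    for name, (wanted, found) in sorted(check_versions(REQUIRED).items()):
        print(f"{name}: README lists {wanted}, found {found or 'not installed'}")
```

Running the script prints nothing when every listed package is installed at the listed version. The `lookup` parameter exists only so the comparison can be exercised without touching the real environment.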
Code from this repository may be cited as:
Michael Lape, et al. (2023). WeirauchLab/pathogen_ncd: Preprint release
(preprint). Zenodo. https://doi.org/10.5281/zenodo.8423556
The associated manuscript is pending publication. In the meantime, you may cite the preprint on medRxiv:
After the Infection: A Survey of Pathogens and Non-communicable Human Disease
Michael Lape, et al. medRxiv 2023.09.14.23295428;
doi: https://doi.org/10.1101/2023.09.14.23295428
Please contact Dr. Matthew Weirauch via email with any questions or concerns.
Name | Institution | Remarks |
---|---|---|
Mike Lape | University of Cincinnati | primary author |
Source code is ©2023 Cincinnati Children's Hospital Medical Center and Mike Lape.
Released under the terms of the GNU General Public License, Version 3. See LICENSE.txt.