Reproducible single-cell RNA-seq pipeline analyzing sex differences in human ascending thoracic aortic aneurysm (TAA), built on the public GSE155468 dataset from Li et al. (2020), Circulation.
The pipeline runs end-to-end from raw 10x Genomics matrices to publication-style multipanel figures, with a per-cell-type pseudobulk DESeq2 design that explicitly models the sex × disease interaction — i.e., it asks not just "what changes in TAA" but "does the disease look different in men vs women at the gene level."
End-to-end data pipeline implemented as a sequence of numbered, idempotent R scripts with versions pinned by renv for cross-machine reproducibility. Every downstream artifact rebuilds from the raw data alone.
| # | Stage | What happens | Key methods / tools |
|---|---|---|---|
| 1 | Ingest | Eleven per-sample sparse expression matrices (~5K–10K cells × 20K+ genes each) loaded from raw GEO archives via a custom streaming reader for the dataset's nonstandard header format. | data.table::fread, Matrix::sparse |
| 2 | QC / filter | Per-cell mitochondrial fraction, library size, and gene-count thresholds applied per sample; low-quality cells and doublet candidates dropped. | Seurat v5 |
| 3 | Normalization & batch correction | Log-normalize → variable-feature selection → PCA → Harmony batch correction across donors. Preserves biological structure while removing sample-level confounding. | Seurat, Harmony |
| 4 | Unsupervised clustering & embedding | Shared-nearest-neighbor graph + Louvain at resolution 0.5 yielding 17 clusters; UMAP for 2-D visualization. | Seurat |
| 5 | Supervised cell-type annotation | Module scoring against a curated literature panel of canonical aortic markers; per-cluster top-score label assigned across eight lineages. | AddModuleScore, custom marker DB |
| 6 | Pseudobulk aggregation | Counts collapsed from cell-level → (sample × cell type) so the unit of statistical replication is the patient (n = 11), not the cell — avoiding the inflated-significance trap of per-cell DE tests and respecting the hierarchical structure of the experiment. | Matrix::rowSums, base R |
| 7 | GLM fit | Negative-binomial GLM per cell type with design ~ sex + disease + sex:disease. Five biologically distinct contrasts extracted via linear combinations of model coefficients — including the sex × disease interaction, the statistically defensible way to test for sex-differential disease response. Multiple-testing controlled via Benjamini–Hochberg FDR. |
DESeq2 |
| 8 | Pathway enrichment | Pre-ranked GSEA on signed Wald statistics tested against MSigDB Hallmark / KEGG / GO:BP. Complementary over-representation analysis on FDR-significant gene lists. | fgsea, clusterProfiler, msigdbr |
| 9 | Reporting | Multipanel figures (volcano, interaction heatmap, sex-stratified scatter, GSEA bubble, sex-chromosome QC) at Nature-style 183 mm scale with embedded paper-style legends; colored Excel deliverables for non-technical reviewers. | ggplot2, patchwork, openxlsx |
- Pipeline-as-DAG: each script ingests files written by the previous and writes its own outputs to disk. Any stage is independently re-runnable without re-running the upstream — useful for iterating on figures or enrichment without touching the 30-minute integration step.
- Statistical unit = patient, not cell: pseudobulk before testing. Treating each of ~50K cells as an independent observation inflates p-values by orders of magnitude; honest inference uses the n=11 patients.
- Interaction-term-first design: rather than running disease DE separately within each sex and comparing eye-ball, the model fits a single GLM with the
sex:diseaseterm, giving a proper hypothesis test for sex-dimorphic disease response. - Reproducibility:
renvlockfile pins every Bioconductor / CRAN dependency, so the analysis runs identically on a fresh machine; raw data is re-fetchable from GEO viaR/01_download.R.
GSE155468 — 10x Genomics scRNA-seq of human ascending aorta.
| Group | Sex | n |
|---|---|---|
| TAA | Male | 4 |
| TAA | Female | 4 |
| Control | Male | 1 |
| Control | Female | 2 |
Ages 56–78. Eleven samples total.
1. TAA induces an interferon-driven, immune-like transcriptional state in vascular smooth-muscle cells themselves. The top TAA-vs-Control DEGs in contractile SMCs are not canonical SMC genes — they are immune / IFN markers: CD69 (log₂FC = +5.0, padj = 3.0×10⁻⁶), IFITM1, IFNG (+6.4, padj = 7×10⁻⁶), CCL5, CD52. Modulated SMCs reinforce this with CTHRC1 (the canonical aortic-fibrosis marker, padj = 3.5×10⁻⁶) plus chemokines CCL4 / CXCR4 / CCL5 / CCL2 that recruit immune cells. SMCs in TAA aren't only losing their contractile identity; they are acquiring an active immune / cytotoxic gene-expression program.
2. Female TAA macrophages mount a vastly stronger interferon / oxidative response than male TAA macrophages. Hallmark GSEA on Male-TAA-vs-Female-TAA in macrophages — all enriched in females:
| Pathway | NES | padj |
|---|---|---|
| INTERFERON_GAMMA_RESPONSE | −2.82 | 3.6×10⁻²⁴ |
| OXIDATIVE_PHOSPHORYLATION | −2.67 | 4.8×10⁻²⁰ |
| ALLOGRAFT_REJECTION | −2.45 | 1.5×10⁻¹⁴ |
| MYC_TARGETS_V1 | −2.43 | 1.5×10⁻¹⁴ |
| MTORC1_SIGNALING | −2.34 | 2.8×10⁻¹² |
Effect sizes large enough to survive aggressive multiple-testing correction at n = 4 per sex.
3. The same female-skewed IFN / OXPHOS pattern repeats across SMCs, NK cells, T cells, and plasma cells.
- SMC contractile: IFN-γ NES = −2.83 (padj = 2×10⁻²³); IFN-α NES = −2.93 (padj = 2×10⁻¹⁸).
- SMC modulated: IFN-γ NES = −2.94 (padj = 2×10⁻²⁵); IL6-JAK-STAT3 NES = −2.44; EMT NES = −2.16.
- Female-enriched OXPHOS in NK, T-cell, and plasma compartments as well.
Consistent with the broader literature on sex differences in autoimmunity (women generally mount stronger humoral / IFN responses), this is, to our knowledge, the first systematic observation of that pattern in TAA across multiple aortic lineages.
4. Endothelium is the contrarian cell type — it flips the direction of sex-dimorphism. In endothelial cells, males in TAA show the higher pro-inflammatory signal:
| Pathway | NES | padj |
|---|---|---|
| TNFA_SIGNALING_VIA_NFKB | +2.23 | 9.7×10⁻¹⁰ |
| P53_PATHWAY | +1.87 | 4.4×10⁻⁵ |
| ANDROGEN_RESPONSE | +1.94 | 2.4×10⁻⁴ |
Sex-dimorphism in TAA is therefore not unidirectional — different lineages respond oppositely. The androgen-response hit is also a useful sex-annotation sanity check.
5. Plasma-cell antibody biology is sex-dimorphic in TAA. Significant interaction-term genes in plasma cells include several immunoglobulin chains: IGHG2 (log₂FC = +11.1, padj = 0.006), HSPA6 (−12.1, padj = 0.006), PTGDS (−15.8, padj = 0.011), NFKB1 (+4.2, padj = 0.013). Macrophages also pick up IGLC3 (−21.7) and IGHG1 (−9.6) near the FDR boundary. Suggests sex-differential humoral immunity in disease, but magnitudes are extreme and the cohort is small — could be biology or could be one outlier patient driving it.
6. NK cells: MT1X is the single best-powered interaction gene in the dataset. MT1X in NK cells: log₂FC = −3.22, padj = 3.8×10⁻⁶. Metallothioneins are involved in metal binding and oxidative-stress response; together with the broader OXPHOS skew, this hints at sex-differential handling of oxidative damage in TAA leukocytes.
- Sample sizes are small. Per-sex DESeq2 contrasts have very limited power — particularly Male_TAA_vs_Male_Control (n = 4 vs 1), which yields only 21 significant genes across all 8 cell types combined. Findings from that specific contrast should be treated as exploratory.
- The sex × disease interaction term inherits the same n = 1 male-Control limitation. Cell types where the interaction did not reach FDR (e.g., SMC modulated, where the visually divergent GRIA2 / CDKN2A / KRT14 signal has padj ≥ 0.44) are reported as suggestive, not confirmed.
- All claims are based on a single cohort. External replication in an independent TAA scRNA-seq dataset (e.g., Pedroza et al. 2022/2023) would substantially strengthen them.
| Script | Purpose |
|---|---|
| R/00_setup.R | Initialize renv, install Bioconductor + CRAN packages |
| R/01_download.R | Fetch GSE155468 raw 10x matrices via GEOquery |
| R/02_qc_integrate.R | Seurat v5 QC, Harmony integration, clustering, marker-score annotation |
| R/03_celltype_abundance.R | Per-sample cell-type composition shifts by disease × sex |
| R/04_pseudobulk_de.R | Pseudobulk → DESeq2 per cell type; five contrasts including sex × disease interaction |
| R/05_enrichment.R | clusterProfiler ORA + fgsea pre-ranked GSEA against MSigDB Hallmark / KEGG / GO:BP |
| R/06_figures.R | Per-contrast volcanoes, MA plots, UMAPs, PCA, marker sanity panels |
Per cell type, DESeq2 fits
~ sex + disease + sex:disease
with Female and Control as reference levels. Five contrasts are extracted:
- TAA vs Control — main disease effect, averaged across sex (
disease + 0.5 × sex:disease) - Male TAA vs Male Control — disease effect in males (low power, n = 4 vs 1)
- Female TAA vs Female Control — disease effect in females (n = 4 vs 2)
- Male TAA vs Female TAA — sex effect within disease (n = 4 vs 4)
- Sex × disease interaction — genes whose disease response differs by sex (
sex:diseaseterm)
Pseudobulk aggregation per (sample × cell type) — not per-cell tests — so that the experimental unit of replication is the patient, not the cell.
TAA-scRNAseq/
├── R/ numbered pipeline scripts (run in order)
├── results/
│ ├── figures/
│ │ ├── figure2_sex_disease_SMC_contractile.pdf 2-page: 5-panel figure + legend
│ │ ├── figure2_sex_disease_SMC_modulated.pdf 2-page: 5-panel figure + legend
│ │ └── figure2_sex_disease_*_legend.md standalone markdown legends
│ └── tables/
│ ├── degs_SMC_contractile.xlsx color-formatted DEG workbook
│ └── degs_SMC_modulated.xlsx color-formatted DEG workbook
├── data/ gitignored — raw / processed (re-download via R/01)
├── renv/ package lockfile (committed); library is gitignored
└── README.md
The bulk per-contrast DE CSVs and enrichment CSVs (~60 files) are gitignored — they regenerate locally from R/04_pseudobulk_de.R and R/05_enrichment.R. The two color-formatted Excel summary workbooks are committed, so a reviewer can browse top up/down-regulated genes per cell type without installing R.
R is required (≥ 4.4) — install first from https://cran.r-project.org. Then from the project root:
source("R/00_setup.R") # one-time: ~20 min on first install
source("R/01_download.R") # ~5 min, ~3 GB (raw 10x matrices)
source("R/02_qc_integrate.R") # ~20–40 min
source("R/03_celltype_abundance.R")
source("R/04_pseudobulk_de.R")
source("R/05_enrichment.R")
source("R/06_figures.R")All scripts assume the working directory is the project root.
If you use this analysis or build on it, please cite the original dataset:
Li Y, Ren P, Dawson A, Vasquez HG, Ageedi W, Zhang C, Luo W, Chen R, Li Y, Kim S, Lu HS, Cassis LA, Coselli JS, Daugherty A, Shen YH, LeMaire SA. Single-Cell Transcriptome Analysis Reveals Dynamic Cell Populations and Differential Gene Expression Patterns in Control and Aneurysmal Human Aortic Tissue. Circulation 2020. PMID: 33017217.