Skip to content

amarhon786/TAA-scRNAseq

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

1 Commit
 
 
 
 
 
 
 
 

Repository files navigation

TAA-scRNAseq — Sex-Stratified Differential Expression in Thoracic Aortic Aneurysm

Reproducible single-cell RNA-seq pipeline analyzing sex differences in human ascending thoracic aortic aneurysm (TAA), built on the public GSE155468 dataset from Li et al. (2020), Circulation.

The pipeline runs end-to-end from raw 10x Genomics matrices to publication-style multipanel figures, with a per-cell-type pseudobulk DESeq2 design that explicitly models the sex × disease interaction — i.e., it asks not just "what changes in TAA" but "does the disease look different in men vs women at the gene level."


Workflow

End-to-end data pipeline implemented as a sequence of numbered, idempotent R scripts with versions pinned by renv for cross-machine reproducibility. Every downstream artifact rebuilds from the raw data alone.

# Stage What happens Key methods / tools
1 Ingest Eleven per-sample sparse expression matrices (~5K–10K cells × 20K+ genes each) loaded from raw GEO archives via a custom streaming reader for the dataset's nonstandard header format. data.table::fread, Matrix::sparse
2 QC / filter Per-cell mitochondrial fraction, library size, and gene-count thresholds applied per sample; low-quality cells and doublet candidates dropped. Seurat v5
3 Normalization & batch correction Log-normalize → variable-feature selection → PCA → Harmony batch correction across donors. Preserves biological structure while removing sample-level confounding. Seurat, Harmony
4 Unsupervised clustering & embedding Shared-nearest-neighbor graph + Louvain at resolution 0.5 yielding 17 clusters; UMAP for 2-D visualization. Seurat
5 Supervised cell-type annotation Module scoring against a curated literature panel of canonical aortic markers; per-cluster top-score label assigned across eight lineages. AddModuleScore, custom marker DB
6 Pseudobulk aggregation Counts collapsed from cell-level → (sample × cell type) so the unit of statistical replication is the patient (n = 11), not the cell — avoiding the inflated-significance trap of per-cell DE tests and respecting the hierarchical structure of the experiment. Matrix::rowSums, base R
7 GLM fit Negative-binomial GLM per cell type with design ~ sex + disease + sex:disease. Five biologically distinct contrasts extracted via linear combinations of model coefficients — including the sex × disease interaction, the statistically defensible way to test for sex-differential disease response. Multiple-testing controlled via Benjamini–Hochberg FDR. DESeq2
8 Pathway enrichment Pre-ranked GSEA on signed Wald statistics tested against MSigDB Hallmark / KEGG / GO:BP. Complementary over-representation analysis on FDR-significant gene lists. fgsea, clusterProfiler, msigdbr
9 Reporting Multipanel figures (volcano, interaction heatmap, sex-stratified scatter, GSEA bubble, sex-chromosome QC) at Nature-style 183 mm scale with embedded paper-style legends; colored Excel deliverables for non-technical reviewers. ggplot2, patchwork, openxlsx

Engineering choices worth flagging

  • Pipeline-as-DAG: each script ingests files written by the previous and writes its own outputs to disk. Any stage is independently re-runnable without re-running the upstream — useful for iterating on figures or enrichment without touching the 30-minute integration step.
  • Statistical unit = patient, not cell: pseudobulk before testing. Treating each of ~50K cells as an independent observation inflates p-values by orders of magnitude; honest inference uses the n=11 patients.
  • Interaction-term-first design: rather than running disease DE separately within each sex and comparing eye-ball, the model fits a single GLM with the sex:disease term, giving a proper hypothesis test for sex-dimorphic disease response.
  • Reproducibility: renv lockfile pins every Bioconductor / CRAN dependency, so the analysis runs identically on a fresh machine; raw data is re-fetchable from GEO via R/01_download.R.

Dataset

GSE155468 — 10x Genomics scRNA-seq of human ascending aorta.

Group Sex n
TAA Male 4
TAA Female 4
Control Male 1
Control Female 2

Ages 56–78. Eleven samples total.


Key findings

Robust findings (strong stats, replicated across cell types)

1. TAA induces an interferon-driven, immune-like transcriptional state in vascular smooth-muscle cells themselves. The top TAA-vs-Control DEGs in contractile SMCs are not canonical SMC genes — they are immune / IFN markers: CD69 (log₂FC = +5.0, padj = 3.0×10⁻⁶), IFITM1, IFNG (+6.4, padj = 7×10⁻⁶), CCL5, CD52. Modulated SMCs reinforce this with CTHRC1 (the canonical aortic-fibrosis marker, padj = 3.5×10⁻⁶) plus chemokines CCL4 / CXCR4 / CCL5 / CCL2 that recruit immune cells. SMCs in TAA aren't only losing their contractile identity; they are acquiring an active immune / cytotoxic gene-expression program.

2. Female TAA macrophages mount a vastly stronger interferon / oxidative response than male TAA macrophages. Hallmark GSEA on Male-TAA-vs-Female-TAA in macrophages — all enriched in females:

Pathway NES padj
INTERFERON_GAMMA_RESPONSE −2.82 3.6×10⁻²⁴
OXIDATIVE_PHOSPHORYLATION −2.67 4.8×10⁻²⁰
ALLOGRAFT_REJECTION −2.45 1.5×10⁻¹⁴
MYC_TARGETS_V1 −2.43 1.5×10⁻¹⁴
MTORC1_SIGNALING −2.34 2.8×10⁻¹²

Effect sizes large enough to survive aggressive multiple-testing correction at n = 4 per sex.

3. The same female-skewed IFN / OXPHOS pattern repeats across SMCs, NK cells, T cells, and plasma cells.

  • SMC contractile: IFN-γ NES = −2.83 (padj = 2×10⁻²³); IFN-α NES = −2.93 (padj = 2×10⁻¹⁸).
  • SMC modulated: IFN-γ NES = −2.94 (padj = 2×10⁻²⁵); IL6-JAK-STAT3 NES = −2.44; EMT NES = −2.16.
  • Female-enriched OXPHOS in NK, T-cell, and plasma compartments as well.

Consistent with the broader literature on sex differences in autoimmunity (women generally mount stronger humoral / IFN responses), this is, to our knowledge, the first systematic observation of that pattern in TAA across multiple aortic lineages.

4. Endothelium is the contrarian cell type — it flips the direction of sex-dimorphism. In endothelial cells, males in TAA show the higher pro-inflammatory signal:

Pathway NES padj
TNFA_SIGNALING_VIA_NFKB +2.23 9.7×10⁻¹⁰
P53_PATHWAY +1.87 4.4×10⁻⁵
ANDROGEN_RESPONSE +1.94 2.4×10⁻⁴

Sex-dimorphism in TAA is therefore not unidirectional — different lineages respond oppositely. The androgen-response hit is also a useful sex-annotation sanity check.

Suggestive findings (interesting effect sizes, smaller cohorts, mention with hedging)

5. Plasma-cell antibody biology is sex-dimorphic in TAA. Significant interaction-term genes in plasma cells include several immunoglobulin chains: IGHG2 (log₂FC = +11.1, padj = 0.006), HSPA6 (−12.1, padj = 0.006), PTGDS (−15.8, padj = 0.011), NFKB1 (+4.2, padj = 0.013). Macrophages also pick up IGLC3 (−21.7) and IGHG1 (−9.6) near the FDR boundary. Suggests sex-differential humoral immunity in disease, but magnitudes are extreme and the cohort is small — could be biology or could be one outlier patient driving it.

6. NK cells: MT1X is the single best-powered interaction gene in the dataset. MT1X in NK cells: log₂FC = −3.22, padj = 3.8×10⁻⁶. Metallothioneins are involved in metal binding and oxidative-stress response; together with the broader OXPHOS skew, this hints at sex-differential handling of oxidative damage in TAA leukocytes.

Caveats

  • Sample sizes are small. Per-sex DESeq2 contrasts have very limited power — particularly Male_TAA_vs_Male_Control (n = 4 vs 1), which yields only 21 significant genes across all 8 cell types combined. Findings from that specific contrast should be treated as exploratory.
  • The sex × disease interaction term inherits the same n = 1 male-Control limitation. Cell types where the interaction did not reach FDR (e.g., SMC modulated, where the visually divergent GRIA2 / CDKN2A / KRT14 signal has padj ≥ 0.44) are reported as suggestive, not confirmed.
  • All claims are based on a single cohort. External replication in an independent TAA scRNA-seq dataset (e.g., Pedroza et al. 2022/2023) would substantially strengthen them.

Pipeline

Script Purpose
R/00_setup.R Initialize renv, install Bioconductor + CRAN packages
R/01_download.R Fetch GSE155468 raw 10x matrices via GEOquery
R/02_qc_integrate.R Seurat v5 QC, Harmony integration, clustering, marker-score annotation
R/03_celltype_abundance.R Per-sample cell-type composition shifts by disease × sex
R/04_pseudobulk_de.R Pseudobulk → DESeq2 per cell type; five contrasts including sex × disease interaction
R/05_enrichment.R clusterProfiler ORA + fgsea pre-ranked GSEA against MSigDB Hallmark / KEGG / GO:BP
R/06_figures.R Per-contrast volcanoes, MA plots, UMAPs, PCA, marker sanity panels

Statistical model

Per cell type, DESeq2 fits

~ sex + disease + sex:disease

with Female and Control as reference levels. Five contrasts are extracted:

  1. TAA vs Control — main disease effect, averaged across sex (disease + 0.5 × sex:disease)
  2. Male TAA vs Male Control — disease effect in males (low power, n = 4 vs 1)
  3. Female TAA vs Female Control — disease effect in females (n = 4 vs 2)
  4. Male TAA vs Female TAA — sex effect within disease (n = 4 vs 4)
  5. Sex × disease interaction — genes whose disease response differs by sex (sex:disease term)

Pseudobulk aggregation per (sample × cell type) — not per-cell tests — so that the experimental unit of replication is the patient, not the cell.


Repo layout

TAA-scRNAseq/
├── R/                                              numbered pipeline scripts (run in order)
├── results/
│   ├── figures/
│   │   ├── figure2_sex_disease_SMC_contractile.pdf       2-page: 5-panel figure + legend
│   │   ├── figure2_sex_disease_SMC_modulated.pdf         2-page: 5-panel figure + legend
│   │   └── figure2_sex_disease_*_legend.md               standalone markdown legends
│   └── tables/
│       ├── degs_SMC_contractile.xlsx                     color-formatted DEG workbook
│       └── degs_SMC_modulated.xlsx                       color-formatted DEG workbook
├── data/                                          gitignored — raw / processed (re-download via R/01)
├── renv/                                          package lockfile (committed); library is gitignored
└── README.md

The bulk per-contrast DE CSVs and enrichment CSVs (~60 files) are gitignored — they regenerate locally from R/04_pseudobulk_de.R and R/05_enrichment.R. The two color-formatted Excel summary workbooks are committed, so a reviewer can browse top up/down-regulated genes per cell type without installing R.


Running it locally

R is required (≥ 4.4) — install first from https://cran.r-project.org. Then from the project root:

source("R/00_setup.R")           # one-time: ~20 min on first install
source("R/01_download.R")        # ~5 min, ~3 GB (raw 10x matrices)
source("R/02_qc_integrate.R")    # ~20–40 min
source("R/03_celltype_abundance.R")
source("R/04_pseudobulk_de.R")
source("R/05_enrichment.R")
source("R/06_figures.R")

All scripts assume the working directory is the project root.


Citation

If you use this analysis or build on it, please cite the original dataset:

Li Y, Ren P, Dawson A, Vasquez HG, Ageedi W, Zhang C, Luo W, Chen R, Li Y, Kim S, Lu HS, Cassis LA, Coselli JS, Daugherty A, Shen YH, LeMaire SA. Single-Cell Transcriptome Analysis Reveals Dynamic Cell Populations and Differential Gene Expression Patterns in Control and Aneurysmal Human Aortic Tissue. Circulation 2020. PMID: 33017217.

About

Sex-stratified scRNA-seq differential expression pipeline for human thoracic aortic aneurysm (GSE155468). Pseudobulk DESeq2 with sex x disease interaction, fgsea pathway enrichment, Nature-style multipanel figures.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages