In [None]:
import scanpy as sc
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display, Markdown, HTML
import gseapy as gp
from gseapy import Msigdb
from gseapy import GSEA
from gseapy import dotplot
import warnings
warnings.filterwarnings('ignore')

In [None]:
adata = sc.read('../Data/dataset_annotated.h5ad')
violin_plot = ['classification','Basal', 'LumA', 'LumB', 'Her2', 'Normal']
cluster='LumA-G2'

In [None]:
msig = Msigdb()
gmt = msig.get_gmt(category='h.all', dbver="2025.1.Hs")

def plot_bars(tmp, ax):
    col1_props = tmp['pam50 subtype'].value_counts(normalize=True)
    col2_props = tmp['nhg'].value_counts(normalize=True)
    proportions = pd.concat([col1_props, col2_props], axis=1, keys=['pam50 subtype', 'nhg']).fillna(0)
    proportions.T.plot(
        kind='bar',
        stacked=True,
        colormap='tab20',
        edgecolor='black',
        ax=ax
    )

    ax.legend()
    ax.tick_params(rotation=0)
    ax.set_ylabel('Proportion')
    ax.set_title('PAM50 & NHG proportions')
    return ax, proportions


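def plot_violin(tmp, ax):
    # Plot only the numeric score columns; 'classification' is a label, not a score.
    ax = sns.violinplot(data=tmp.select_dtypes(include='number'), ax=ax)
    # Rotate only the x tick labels (the subtype names).
    ax.tick_params(axis='x', rotation=45)
    ax.set_title('PAM50 scoring')
    return ax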

def pathways(cluster, ax):
    # Build the ranked gene list from the differential-expression scores
    # already stored in adata (rank_genes_groups results must be present).
    expr = sc.get.rank_genes_groups_df(adata, group=cluster)[['names', 'scores']]
    expr.columns = ['gene_name', 'score']
    pre_res = gp.prerank(
        rnk=expr,  # DataFrame or path to .rnk file
        gene_sets=gmt,  # Or 'KEGG_2021_Human', 'Reactome_2022', etc.
        permutation_num=1000,  # recommended ≥1000
        seed=42,
        processes=4  # parallelization
    )

    # Dot plot of gene sets passing the FDR q-value cutoff of 0.25.
    ax = dotplot(pre_res.res2d,
                 column="FDR q-val",
                 cmap=plt.cm.viridis,
                 size=4,  # adjust dot size
                 cutoff=0.25, show_ring=False, ax=ax)
    return ax, pre_res.res2d


In [None]:
Markdown(f"""
# Cluster {cluster}""")

In [None]:
fig,axes= plt.subplots(1,2, figsize=(12,5))
sc.pl.umap(adata, color='LumA',ax=axes[0], show=False)
axes[0].set_title('Luminal A - score')
sc.pl.umap(adata, color='classification',ax=axes[1], show=False)
axes[1].set_title('Clusters')
plt.show()


Figure 1: UMAP Visualization of Breast Cancer Samples Colored by Luminal A Score and Assigned Clusters.

This figure displays the dimensionality reduction of breast cancer RNA-Seq samples using Uniform Manifold Approximation and Projection (UMAP). Each point in the plots represents an individual sample.

- Left panel: The UMAP space is colored based on a calculated Luminal A score for each sample. This score was derived from the expression levels of 10 genes from the PAM50 panel known to have high centroid values for the Luminal A subtype. The color bar indicates the range of the score, where samples with higher Luminal A scores are depicted in yellow/green, and samples with lower scores are in purple/blue.
- Right panel: The same UMAP space is shown, with samples colored according to their assigned cluster. These clusters were identified based on the analysis and represent different breast cancer subtypes combined with Nottingham Histologic Grade (G2 or G3): Basal-G3, Her2-G2, Her2-G3, LumA-G2, LumB-G3, and Normal-G2, as indicated by the legend.

Interpretation: The UMAP projection separates the samples into visually distinct regions corresponding to the different assigned clusters/subtypes. Notably, the spatial distribution of samples with high Luminal A scores (left panel) closely matches the region occupied by the LumA-G2 cluster (right panel). This indicates that the calculated Luminal A score captures the molecular characteristics defining the Luminal A subtype, differentiating it from other subtypes like Basal and Her2, which show markedly lower Luminal A scores in their respective regions of the UMAP space.
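The notebook does not show how the Luminal A score was computed. As a rough illustration only, a signature score can be formed by comparing the mean expression of a small gene list against the remaining genes. The gene names, expression values, and the `signature_score` helper below are hypothetical and are not taken from the dataset or from any library.

```python
from statistics import mean


def signature_score(expression, signature_genes):
    """Mean expression of the signature genes minus the mean of all other genes.

    `expression` maps gene name -> expression value for one sample.
    This is a simplified, assumed scoring scheme, not the notebook's method.
    """
    in_signature = [v for g, v in expression.items() if g in signature_genes]
    background = [v for g, v in expression.items() if g not in signature_genes]
    if not in_signature or not background:
        raise ValueError("need at least one signature gene and one background gene")
    return mean(in_signature) - mean(background)


# Made-up expression values for two samples (hypothetical gene names).
signature = {"GENE_A", "GENE_B"}
luminal_like = {"GENE_A": 8.0, "GENE_B": 6.0, "GENE_C": 2.0, "GENE_D": 2.0}
basal_like = {"GENE_A": 1.0, "GENE_B": 3.0, "GENE_C": 5.0, "GENE_D": 7.0}

score_luminal = signature_score(luminal_like, signature)  # 7.0 - 2.0
score_basal = signature_score(basal_like, signature)      # 2.0 - 6.0
print(score_luminal, score_basal)
```

A sample whose signature genes are expressed above background receives a positive score, which is the kind of gradient the left UMAP panel colors by.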

In [None]:
vln = sc.get.obs_df(adata, keys=violin_plot)
vln=vln[vln['classification']==cluster]
prop=sc.get.obs_df(adata, keys=['classification','pam50 subtype','nhg'])
prop=prop[prop['classification']==cluster]


fig,axes=plt.subplots(1,2, figsize=(12,5))
axes[0]=plot_violin(vln,axes[0])
axes[1],prop=plot_bars(prop,axes[1])
plt.show()

In [None]:
Markdown(f"""

### PAM50 scoring - {cluster}

{vln.describe().to_html()}

### PAM50 & NHG proportions {cluster}

{prop.to_html()}
""")

Figure 2: PAM50 Subtype Score Distributions and Proportions of PAM50 Subtypes and NHG Grades in the LumA-G2 Cluster.

This figure summarizes the molecular and histological characteristics of the samples assigned to the LumA-G2 cluster, a subset of the full breast cancer cohort (n=1176).

- Left panel: Violin plots illustrating the distribution of calculated PAM50 scores for each of the five intrinsic molecular subtypes (Basal-like, Luminal A, Luminal B, Her2-enriched, and Normal-like) among the samples in the cluster. Each violin shows the density of samples across the range of that subtype's score, indicating where most samples fall within the scoring spectrum. Embedded box plots summarize the median and interquartile range of each score. The Y-axis represents the PAM50 score, reflecting the similarity of a sample's expression profile to the average profile (centroid) of samples representative of that specific subtype.

- Right panel: Stacked bar plots showing the proportion of samples in the cluster by PAM50 intrinsic subtype (left bar) and by Nottingham Histological Grade (NHG; right bar), based on clinical metadata. The legend specifies the color assigned to each subtype and grade category.

Interpretation: The violin plots show that the PAM50 scoring system differentiates samples based on their subtype-specific gene expression patterns, with distinct score distributions observed for each subtype score. The stacked bar plots show the composition of the LumA-G2 cluster, highlighting the relative frequency of each PAM50 subtype (with Luminal A being the most common) and NHG grade (with G2 being the most prevalent). This figure provides context on the molecular and histological heterogeneity within the LumA-G2 cluster identified in the dimensionality reduction analysis (Figure 1).
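The proportions behind the stacked bars can be sketched without pandas. The snippet below mirrors what `plot_bars` does with `value_counts(normalize=True)`, `pd.concat`, and `fillna(0)`, using made-up labels; the `proportions` and `align` helpers are illustrative names, not part of the notebook.

```python
from collections import Counter


def proportions(labels):
    """Fraction of samples carrying each label (like value_counts(normalize=True))."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def align(*tables):
    """Combine proportion tables on the union of their labels, filling gaps with 0.0."""
    keys = sorted(set().union(*tables))
    return {k: [t.get(k, 0.0) for t in tables] for k in keys}


# Made-up annotations for four samples in one cluster (not real data).
pam50 = ["LumA", "LumA", "LumA", "LumB"]
nhg = ["G2", "G2", "G3", "G2"]

# Each row is a category; the two columns are the 'pam50 subtype' and 'nhg' bars.
table = align(proportions(pam50), proportions(nhg))
for label, row in table.items():
    print(label, row)
```

Each column sums to 1, which is why both stacked bars in the right panel reach the same height.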

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(12, 5))
ax, res = pathways(cluster, ax)
ax.set_title('GSEA')
plt.show()

In [None]:
Markdown(f"""

### GSEA - {cluster}

{res[res['FDR q-val']<0.05].iloc[:,:-1].to_html()}

""")




Figure 3: Gene Set Enrichment Analysis (GSEA) Highlighting Biological Pathways Characterizing the LumA-G2 Cluster.

This bubble plot displays the results of Gene Set Enrichment Analysis (GSEA) comparing gene expression profiles of samples within the identified LumA-G2 cluster against all other samples in the cohort. The plot shows the top significantly enriched Hallmark gene sets from the Molecular Signatures Database (MSigDB).

- X-axis (NES): Normalized Enrichment Score. A positive NES indicates that the gene set's members are concentrated among the genes ranked highest for the LumA-G2 cluster, i.e. predominantly upregulated in the cluster compared to the other samples. A negative NES indicates that they are concentrated among the lowest-ranked genes, i.e. predominantly downregulated in the LumA-G2 cluster (equivalently, upregulated in the other samples). The sign gives only the direction of enrichment; significance is conveyed by the FDR q-value (bubble color).

- Y-axis: The names of the significantly enriched Hallmark gene sets.

- Bubble Size: Represents the percentage of genes within the gene set that contribute to the core enrichment (the "leading edge"). Larger bubbles indicate that a greater proportion of the gene set's members drive the observed enrichment.

- Bubble Color: Represents the statistical significance of the enrichment, specifically the log10(1/FDR). Higher values (more yellow/green) correspond to lower False Discovery Rate (FDR) q-values, indicating higher confidence in the enrichment result.
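To make the sign of the enrichment score concrete, here is a minimal sketch of the weighted running-sum statistic used by preranked GSEA. It is a simplified illustration and not gseapy's implementation: it computes only the raw ES for a toy ranking and omits the permutation step that normalizes ES into NES. The gene names and scores are made up.

```python
def enrichment_score(ranked, gene_set, weight=1.0):
    """Raw enrichment score: the running-sum value of largest magnitude.

    `ranked` is a list of (gene, score) pairs sorted by descending score.
    Walking down the list, the sum rises at genes in `gene_set` (by their
    weighted |score|) and falls by a constant step at all other genes.
    """
    hits = [abs(s) ** weight if g in gene_set else 0.0 for g, s in ranked]
    hit_total = sum(hits)
    n_miss = sum(1 for g, _ in ranked if g not in gene_set)
    if hit_total == 0 or n_miss == 0:
        raise ValueError("gene set must overlap the ranking without covering it entirely")

    running, best = 0.0, 0.0
    for (gene, _), hit in zip(ranked, hits):
        running += hit / hit_total if gene in gene_set else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


# Toy ranking: positive scores are up in the cluster, negative scores are down.
ranked = [("G1", 3.0), ("G2", 2.0), ("G3", 1.0),
          ("G4", -1.0), ("G5", -2.0), ("G6", -3.0)]

es_up = enrichment_score(ranked, {"G1", "G2"})    # members at the top of the list
es_down = enrichment_score(ranked, {"G5", "G6"})  # members at the bottom of the list
print(es_up, es_down)
```

A set whose members sit at the top of the ranking yields a positive ES, and one whose members sit at the bottom yields a negative ES, matching the direction read off the X-axis.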

Interpretation: The GSEA reveals key biological processes that differentiate the LumA-G2 cluster. The most positively enriched pathway (highest positive NES) is HALLMARK_ESTROGEN_RESPONSE_EARLY, consistent with the known hormone receptor-positive nature of Luminal A breast cancers. Conversely, several pathways are significantly negatively enriched (downregulated) in the LumA-G2 cluster relative to other subtypes. These include pathways related to immune response (HALLMARK_INTERFERON_GAMMA_RESPONSE, HALLMARK_ALLOGRAFT_REJECTION, HALLMARK_INFLAMMATORY_RESPONSE, HALLMARK_TNFA_SIGNALING_VIA_NFKB, HALLMARK_COMPLEMENT), cell cycle and proliferation (HALLMARK_E2F_TARGETS, HALLMARK_G2M_CHECKPOINT), and signaling (HALLMARK_IL6_JAK_STAT3_SIGNALING). This pattern of high estrogen signaling coupled with lower immune/inflammatory and proliferative pathway activity supports the characterization of LumA-G2 as a less proliferative and perhaps less immune-infiltrated subtype compared to other breast cancer classes present in the dataset. All shown enrichments are highly statistically significant, as indicated by the predominantly high log10(1/FDR) values (green/yellow colors) and very low FDR q-values (reported as 0.0 for the top entries, which with 1,000 permutations means below the resolution of the permutation test rather than exactly zero).
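Because log10(1/FDR) is undefined at an FDR of exactly 0, q-values reported as 0.0 need a floor before they can be placed on the color scale. The sketch below shows one way to do that; the `neg_log10_fdr` helper and the floor of 1e-4 are assumptions chosen for illustration and do not describe how gseapy's `dotplot` handles zeros internally.

```python
import math


def neg_log10_fdr(q, floor=1e-4):
    """Return log10(1/q), clipping q to `floor` so that q == 0.0 stays finite.

    The floor is an assumed, illustrative value, not a gseapy default.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("FDR q-values must lie in [0, 1]")
    return math.log10(1.0 / max(q, floor))


# A q-value reported as 0.0, a clearly significant one, and one at the plot cutoff.
q_values = [0.0, 0.01, 0.25]
color_values = [neg_log10_fdr(q) for q in q_values]
for q, v in zip(q_values, color_values):
    print(q, round(v, 3))
```

With this transform, smaller q-values map to larger color values, so the entries reported as 0.0 sit at the top of the scale.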