In [None]:
import scanpy as sc
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display, Markdown, HTML
import gseapy as gp
from gseapy import Msigdb, dotplot
import warnings
warnings.filterwarnings('ignore')

In [None]:
adata = sc.read('../Data/dataset_annotated.h5ad')
violin_plot = ['classification','Basal', 'LumA', 'LumB', 'Her2', 'Normal']
cluster='LumB-G3'
score='LumB'

In [None]:
msig = Msigdb()
gmt = msig.get_gmt(category='h.all', dbver="2025.1.Hs")

def plot_bars(tmp, ax):
    col1_props = tmp['pam50 subtype'].value_counts(normalize=True)
    col2_props = tmp['nhg'].value_counts(normalize=True)
    proportions = pd.concat([col1_props, col2_props], axis=1, keys=['pam50 subtype', 'nhg']).fillna(0)
    proportions.T.plot(
        kind='bar',
        stacked=True,
        colormap='tab20',
        edgecolor='black',
        ax=ax
    )

    ax.legend()
    ax.tick_params(axis='x', rotation=0)  # keep the bar labels horizontal
    ax.set_ylabel('Proportion')
    ax.set_title('PAM50 & NHG proportions')
    return ax, proportions


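def plot_violin(tmp, ax):
    # Plot only the numeric PAM50 score columns; 'classification' is a label, not a score
    ax = sns.violinplot(data=tmp.select_dtypes(include='number'), ax=ax)
    ax.tick_params(axis='x', rotation=45)
    ax.set_title('PAM50 scoring')
    return ax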

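def pathways(cluster, ax):
    # Rank genes by their differential-expression score for this cluster
    # (requires sc.tl.rank_genes_groups to have been run on the global `adata`)
    expr = sc.get.rank_genes_groups_df(adata, group=cluster)[['names', 'scores']]
    expr.columns = ['gene_name', 'score']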
    pre_res = gp.prerank(
        rnk=expr,  # DataFrame or path to .rnk file
        gene_sets=gmt,  # Or 'KEGG_2021_Human', 'Reactome_2022', etc.
        permutation_num=1000,  # recommended ≥1000
        seed=42,
        processes=4  # parallelization
    )

    ax = dotplot(pre_res.res2d,
             column="FDR q-val",
             cmap=plt.cm.viridis,
             size=4, # adjust dot size
             cutoff=0.25, show_ring=False,ax=ax)
    return ax, pre_res.res2d


In [None]:
Markdown(f"""
# Cluster {cluster}""")

In [None]:
fig,axes= plt.subplots(1,2, figsize=(12,5))
sc.pl.umap(adata, color=score,ax=axes[0], show=False)
axes[0].set_title(f"{score} - score")
sc.pl.umap(adata, color='classification',ax=axes[1], show=False)
axes[1].set_title('Clusters')
plt.show()


Figure 1: UMAP Visualization of Breast Cancer Samples Colored by Luminal B Score and Assigned Clusters.

This figure displays the dimensionality reduction of the breast cancer RNA-Seq samples (from the full cohort) using Uniform Manifold Approximation and Projection (UMAP). Each point in the plots represents an individual sample.

- Left panel: The UMAP space is colored based on a calculated Luminal B score for each sample. This score was derived from the expression levels of specific genes from the PAM50 panel known to have high centroid values for the Luminal B subtype. The color bar indicates the range of the score, where samples with higher Luminal B scores are depicted in yellow/green, and samples with lower scores are in purple/blue.

- Right panel: The same UMAP space is shown, with samples colored according to their assigned cluster. These clusters were previously identified based on the analysis and represent different breast cancer subtypes combined with Nottingham Histologic Grade (G2 or G3): Basal-G3, Her2-G2, Her2-G3, LumA-G2, LumB-G3, and Normal-G2, as indicated by the legend.
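The scoring idea behind the left panel can be illustrated with a minimal sketch. It assumes a samples x genes matrix of log-expression values and a list of signature genes. The helper `signature_score`, the gene names, and the values are hypothetical; the notebook's actual scoring code is not shown here and may differ.

```python
import numpy as np

def signature_score(expr, genes, signature):
    """Mean z-scored expression of the signature genes, one score per sample.

    expr: 2D array (samples x genes); genes: column names; signature: subset of genes.
    """
    expr = np.asarray(expr, dtype=float)
    # z-score each gene across samples so highly expressed genes do not dominate
    mu = expr.mean(axis=0)
    sd = expr.std(axis=0)
    sd[sd == 0] = 1.0  # avoid division by zero for constant genes
    z = (expr - mu) / sd
    idx = [genes.index(g) for g in signature if g in genes]
    if not idx:
        raise ValueError("no signature genes found in the expression matrix")
    return z[:, idx].mean(axis=1)

# Toy data: hypothetical gene names and made-up expression values
genes = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]
expr = np.array([
    [5.0, 6.0, 1.0, 2.0],
    [1.0, 2.0, 5.0, 2.0],
    [3.0, 4.0, 3.0, 2.0],
])
scores = signature_score(expr, genes, ["GENE_A", "GENE_B"])
print(scores)
```

Samples that express the signature genes above the cohort average receive positive scores, which is what the yellow/green end of the color bar represents.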

Interpretation: The UMAP projection separates the samples into regions that correspond to the different molecular subtypes. Samples with high Luminal B scores (left panel) largely coincide with the region occupied by the LumB-G3 cluster (right panel). This visual correspondence supports two points: the calculated Luminal B score picks out samples with a Luminal B-enriched gene expression signature, and the clustering grouped these samples together, particularly those with Grade 3 histology. The LumB-G3 cluster lies adjacent to the LumA-G2 cluster in the UMAP space, consistent with their shared hormone receptor-positive nature, while remaining distinct in its LumB/LumA scoring profile.
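The overlap between high scores and cluster membership can also be checked numerically rather than by eye. The sketch below uses a toy table with the same column names as the notebook (`classification` and `LumB`) and asks what fraction of the highest-scoring samples fall in the LumB-G3 cluster. The helper name `top_score_fraction`, the 0.75 quantile threshold, and all values are illustrative assumptions.

```python
import pandas as pd

def top_score_fraction(obs, score_col, cluster_col, cluster, q=0.75):
    """Fraction of samples above the q-th score quantile that belong to `cluster`."""
    threshold = obs[score_col].quantile(q)
    top = obs[obs[score_col] > threshold]
    if top.empty:
        return float("nan")
    return float((top[cluster_col] == cluster).mean())

# Toy observations (made-up values, not the cohort data)
obs = pd.DataFrame({
    "classification": ["LumB-G3", "LumB-G3", "LumB-G3", "LumA-G2",
                       "LumA-G2", "Basal-G3", "Basal-G3", "Her2-G3"],
    "LumB": [0.9, 0.8, 0.7, 0.3, 0.2, -0.5, -0.6, -0.1],
})
frac = top_score_fraction(obs, "LumB", "classification", "LumB-G3")
print(frac)
```

On the real data, the same function could be applied to `sc.get.obs_df(adata, keys=['classification', score])`; a value close to 1 would back up the visual impression.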

In [None]:
vln = sc.get.obs_df(adata, keys=violin_plot)
vln=vln[vln['classification']==cluster]
prop=sc.get.obs_df(adata, keys=['classification','pam50 subtype','nhg'])
prop=prop[prop['classification']==cluster]


fig,axes=plt.subplots(1,2, figsize=(12,5))
axes[0]=plot_violin(vln,axes[0])
axes[1],prop=plot_bars(prop,axes[1])
plt.show()

In [None]:
Markdown(f"""

### PAM50 scoring - {cluster}

{vln.describe().to_html()}

### PAM50 & NHG proportions {cluster}

{prop.to_html()}
""")

Figure 2: Molecular and Histological Characterization of the LumB-G3 Cluster.

This figure provides a detailed characterization of the samples assigned to the LumB-G3 cluster (n=520) based on PAM50 subtype scoring and clinical Nottingham Histological Grade (NHG).

- Left panel: Violin plots displaying the distribution of PAM50 centroid scores for the samples within the LumB-G3 cluster. Each violin shows the density distribution of how strongly these samples score against the average gene expression profile (centroid) of each of the five intrinsic PAM50 subtypes (Basal-like, Luminal A, Luminal B, Her2-enriched, and Normal-like). Embedded box plots indicate the median and interquartile range. The Y-axis represents the PAM50 score, reflecting the molecular similarity of the LumB-G3 samples to each respective subtype centroid.

- Right panel: Stacked bar plots illustrating the proportions of samples within the LumB-G3 cluster according to their PAM50 molecular subtype assignment (left bar) and their Nottingham Histological Grade (NHG; right bar), based on clinical metadata. The legend specifies the color assigned to each subtype and grade category. This plot visualizes the molecular and histological composition of this specific cluster.

Interpretation: The violin plots on the left indicate that samples within the LumB-G3 cluster score highest against the Luminal B centroid, but also show substantial scores for the Luminal A centroid and relatively low scores for the Basal and Her2 centroids. This profile is consistent with their classification as Luminal tumors, and in particular Luminal B, which typically exhibits higher proliferation than Luminal A. The stacked bar plots on the right quantify the cluster's composition: the majority of samples (66.7%) are assigned the Luminal B PAM50 subtype, and a substantial minority (26.5%) are classified as Luminal A. Clinically, Histological Grade 3 is the most frequent grade (54.8%), followed by Grade 2 (39.2%), which is why the cluster carries the "G3" designation. Overall, the LumB-G3 cluster is primarily defined by a Luminal B molecular signature and a predominance of Grade 3 tumors, but it is heterogeneous, with a sizeable proportion of Luminal A and Grade 2 samples.
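The naming logic described above (most frequent PAM50 subtype combined with most frequent grade) can be sketched as follows. This is only an illustration: `modal_label` is a hypothetical helper, the toy lists are made up, and the grade labels are assumed to be strings such as "G3", which may not match how `nhg` is encoded in the dataset.

```python
from collections import Counter

def modal_label(subtypes, grades):
    """Combine the most frequent subtype and grade into a label such as 'LumB-G3'."""
    top_subtype, n_sub = Counter(subtypes).most_common(1)[0]
    top_grade, n_grade = Counter(grades).most_common(1)[0]
    label = f"{top_subtype}-{top_grade}"
    # purity: how dominant each modal category is within the cluster
    purity = (n_sub / len(subtypes), n_grade / len(grades))
    return label, purity

# Toy cluster of ten samples (made-up composition)
subtypes = ["LumB"] * 6 + ["LumA"] * 3 + ["Her2"]
grades = ["G3"] * 5 + ["G2"] * 4 + ["G1"]
label, purity = modal_label(subtypes, grades)
print(label, purity)
```

Reporting the purity alongside the label makes the heterogeneity explicit: a label based on a bare majority describes the cluster less completely than one based on a near-unanimous category.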

In [None]:
fig,axes=plt.subplots(1,1, figsize=(12,5))
axes,res=pathways(cluster,axes)
axes.set_title('GSEA')
plt.show()

In [None]:
Markdown(f"""

### GSEA - {cluster}

{res[res['FDR q-val']<0.05].iloc[:,:-1].to_html()}

""")




Figure 3: Gene Set Enrichment Analysis (GSEA) Reveals Proliferative and Metabolic Pathway Enrichment and Reduced Immune/Mesenchymal Activity in the LumB-G3 Cluster.

This bubble plot presents the results of a preranked Gene Set Enrichment Analysis (GSEA). Genes are ranked by their differential expression score for the LumB-G3 cluster against all other samples in the cohort, and the ranked list is tested against the Hallmark gene sets from the Molecular Signatures Database (MSigDB). The plot displays the top gene sets with an FDR q-value below 0.25.
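Preranked GSEA takes a two-column table of gene names and ranking scores as input. The sketch below shows one way such a table could be prepared from a differential expression result with `names` and `scores` columns, as returned by `sc.get.rank_genes_groups_df` in this notebook. Dropping missing scores and resolving duplicated gene names by largest absolute score are assumptions about sensible preprocessing, not steps taken in the notebook; `make_rank_table` is a hypothetical helper.

```python
import pandas as pd

def make_rank_table(de):
    """Return a (gene_name, score) table sorted by decreasing score, one row per gene."""
    rnk = de[["names", "scores"]].rename(
        columns={"names": "gene_name", "scores": "score"}
    )
    rnk = rnk.dropna(subset=["score"])
    # when a gene name is duplicated, keep the entry with the largest absolute score
    order = rnk["score"].abs().sort_values(ascending=False).index
    rnk = rnk.loc[order].drop_duplicates(subset="gene_name", keep="first")
    return rnk.sort_values("score", ascending=False).reset_index(drop=True)

# Toy differential expression table (made-up genes and scores)
de = pd.DataFrame({
    "names": ["G1", "G2", "G3", "G2", "G4"],
    "scores": [2.5, -1.0, 0.3, -3.0, None],
})
rnk = make_rank_table(de)
print(rnk)
```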

- X-axis (NES): Normalized Enrichment Score. A positive NES means that the genes of the set are concentrated at the top of the ranked list, i.e. the pathway is predominantly upregulated in the LumB-G3 cluster compared to the other samples. A negative NES means that they are concentrated at the bottom of the list, i.e. the pathway is predominantly downregulated in the LumB-G3 cluster. The sign gives the direction only; statistical significance is given by the FDR q-value (bubble color).

- Y-axis: The names of the significantly enriched Hallmark gene sets.

- Bubble Size: Represents the percentage of genes within the gene set that are part of the leading edge, meaning they contribute most significantly to the enrichment score. Larger bubbles denote that a higher proportion of the gene set members are driving the enrichment.

- Bubble Color: Indicates the statistical significance of the enrichment result, specifically represented by log10(1/FDR). Higher values (more yellow/green colors) correspond to lower False Discovery Rate (FDR) q-values, denoting higher statistical confidence in the enrichment result. Lower values (purple/blue) indicate less significance.
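To make the sign of the enrichment score concrete, here is a minimal sketch of the weighted running-sum statistic that underlies GSEA. It computes only the raw enrichment score (ES) for a toy ranked list; the permutation step that turns ES into NES and FDR q-values is omitted, and this is not gseapy's implementation. The function name and the toy genes are illustrative.

```python
import numpy as np

def enrichment_score(ranked_genes, scores, gene_set, weight=1.0):
    """Running-sum enrichment score for a ranked gene list (no permutation step)."""
    scores = np.asarray(scores, dtype=float)
    in_set = np.array([g in gene_set for g in ranked_genes])
    n_miss = len(ranked_genes) - in_set.sum()
    if in_set.sum() == 0 or n_miss == 0:
        raise ValueError("gene set must contain some, but not all, ranked genes")
    # hits step up in proportion to |score|; misses step down by a constant amount
    hit_weights = np.where(in_set, np.abs(scores) ** weight, 0.0)
    step_up = hit_weights / hit_weights.sum()
    step_down = np.where(in_set, 0.0, 1.0 / n_miss)
    running = np.cumsum(step_up - step_down)
    # ES is the running-sum value with the largest absolute deviation from zero
    return float(running[np.argmax(np.abs(running))])

# Toy ranked list, already sorted by decreasing score (made-up values)
ranked = ["G1", "G2", "G3", "G4", "G5", "G6"]
scores = [3.0, 2.0, 1.0, -1.0, -2.0, -3.0]
es_top = enrichment_score(ranked, scores, {"G1", "G2"})
es_bottom = enrichment_score(ranked, scores, {"G5", "G6"})
print(es_top, es_bottom)
```

A gene set whose members sit at the top of the list yields a positive score and one whose members sit at the bottom yields a negative score, which is the direction encoded on the x-axis of the plot.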

Interpretation: The GSEA reveals a mixed set of biological pathways enriched in the LumB-G3 cluster. Consistent with the more aggressive nature of Luminal B tumors compared to Luminal A and with their high histological grade (G3), pathways associated with cell cycle progression and proliferation, such as HALLMARK_E2F_TARGETS, HALLMARK_G2M_CHECKPOINT, and HALLMARK_MYC_TARGETS, are significantly positively enriched (upregulated). HALLMARK_OXIDATIVE_PHOSPHORYLATION (metabolism), HALLMARK_UNFOLDED_PROTEIN_RESPONSE (cellular stress), and HALLMARK_DNA_REPAIR are also positively enriched. As expected for a Luminal subtype, both estrogen response gene sets (HALLMARK_ESTROGEN_RESPONSE_EARLY and HALLMARK_ESTROGEN_RESPONSE_LATE) are positively enriched.

In contrast, several pathways commonly associated with more aggressive or treatment-resistant subtypes (such as Basal-like) are significantly negatively enriched (downregulated) in the LumB-G3 cluster. These include processes related to epithelial-mesenchymal transition (HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION), immune and inflammatory responses (HALLMARK_ALLOGRAFT_REJECTION, HALLMARK_INFLAMMATORY_RESPONSE, HALLMARK_TNFA_SIGNALING_VIA_NFKB, HALLMARK_IL2_STAT5_SIGNALING, HALLMARK_IL6_JAK_STAT3_SIGNALING, HALLMARK_COMPLEMENT), KRAS signaling (HALLMARK_KRAS_SIGNALING_UP), cell adhesion (HALLMARK_APICAL_JUNCTION), and various developmental/signaling pathways (HALLMARK_TGF_BETA_SIGNALING, HALLMARK_HEDGEHOG_SIGNALING, HALLMARK_WNT_BETA_CATENIN_SIGNALING, HALLMARK_NOTCH_SIGNALING). This suggests that while proliferative activity is high in LumB-G3, features related to immune infiltration, mesenchymal characteristics, and specific oncogenic signaling cascades are less prominent than in the other subtypes of the cohort. Most of the displayed enrichments are highly statistically significant, as indicated by the predominantly green/yellow bubble colors, which reflect very low FDR q-values.
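The split into positively and negatively enriched gene sets described above can be reproduced from the result table. The sketch below assumes a table with `Term`, `NES`, and `FDR q-val` columns, which is how gseapy's `res2d` is laid out in the versions I am aware of. The numeric coercion is a precaution in case those columns are stored as objects. `split_by_direction`, the set names, and the values are invented for illustration.

```python
import pandas as pd

def split_by_direction(res, fdr_cutoff=0.05):
    """Return (upregulated, downregulated) term lists among significant gene sets."""
    res = res.copy()
    # coerce to numeric in case the columns were stored as strings/objects
    res["NES"] = pd.to_numeric(res["NES"], errors="coerce")
    res["FDR q-val"] = pd.to_numeric(res["FDR q-val"], errors="coerce")
    sig = res[res["FDR q-val"] < fdr_cutoff]
    up = sig[sig["NES"] > 0].sort_values("NES", ascending=False)
    down = sig[sig["NES"] < 0].sort_values("NES")
    return up["Term"].tolist(), down["Term"].tolist()

# Toy result table (placeholder set names and made-up statistics)
toy_res = pd.DataFrame({
    "Term": ["SET_A", "SET_B", "SET_C", "SET_D"],
    "NES": ["2.1", "1.8", "-1.9", "0.9"],
    "FDR q-val": ["0.0", "0.01", "0.001", "0.4"],
})
up_terms, down_terms = split_by_direction(toy_res)
print(up_terms, down_terms)
```

Applied to the notebook's `res` table, this would give the two lists that the interpretation discusses, using the same 0.05 threshold as the summary table.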