In [None]:
import scanpy as sc
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display, Markdown, HTML
import gseapy as gp
from gseapy import Msigdb
from gseapy import GSEA
from gseapy import dotplot
import warnings
warnings.filterwarnings('ignore')

In [None]:
adata = sc.read('../Data/dataset_annotated.h5ad')
violin_plot = ['classification','Basal', 'LumA', 'LumB', 'Her2', 'Normal']
cluster='Normal-G2'
score='Normal'

In [None]:
msig = Msigdb()
gmt = msig.get_gmt(category='h.all', dbver="2025.1.Hs")

def plot_bars(tmp, ax):
    col1_props = tmp['pam50 subtype'].value_counts(normalize=True)
    col2_props = tmp['nhg'].value_counts(normalize=True)
    proportions = pd.concat([col1_props, col2_props], axis=1, keys=['pam50 subtype', 'nhg']).fillna(0)
    proportions.T.plot(
        kind='bar',
        stacked=True,
        colormap='tab20',
        edgecolor='black',
        ax=ax
    )

    ax.legend()
    ax.tick_params(rotation=0)
    ax.set_ylabel('Proportion')
    ax.set_title('PAM50 & NHG proportions')
    return ax, proportions


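def plot_violin(tmp, ax):
    # Plot only the numeric PAM50 score columns; 'classification' is a label, not a score
    ax = sns.violinplot(data=tmp.select_dtypes('number'), ax=ax)
    # Rotate only the x tick labels (subtype names), not the y-axis ticks
    ax.tick_params(axis='x', rotation=45)
    ax.set_title('PAM50 scoring')
    return ax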

def pathways(cluster,ax):
    expr = sc.get.rank_genes_groups_df(adata, group=cluster)[['names','scores']]
    expr.columns = ['gene_name', 'score'] 
    pre_res = gp.prerank(
        rnk=expr,  # DataFrame or path to .rnk file
        gene_sets=gmt,  # Or 'KEGG_2021_Human', 'Reactome_2022', etc.
        permutation_num=1000,  # recommended ≥1000
        seed=42,
        threads=4  # parallelization ('processes' is the deprecated name in gseapy 1.x)
    )

    ax = dotplot(pre_res.res2d,
             column="FDR q-val",
             cmap=plt.cm.viridis,
             size=4, # adjust dot size
             cutoff=0.25, show_ring=False,ax=ax)
    return ax, pre_res.res2d


In [None]:
Markdown(f"""
# Cluster {cluster}""")

In [None]:
fig,axes= plt.subplots(1,2, figsize=(12,5))
sc.pl.umap(adata, color=score,ax=axes[0], show=False)
axes[0].set_title(f"{score} - score")
sc.pl.umap(adata, color='classification',ax=axes[1], show=False)
axes[1].set_title('Clusters')
plt.show()


Figure 1: UMAP Visualization of Breast Cancer Samples Colored by Normal Score and Assigned Clusters.

This figure displays the dimensionality reduction of the breast cancer RNA-Seq samples (from the full cohort) using Uniform Manifold Approximation and Projection (UMAP). Each point in the plots represents an individual sample.

- Left panel: The UMAP space is colored based on a calculated "Normal - score" for each sample. This score was derived from the expression levels of specific genes from the PAM50 panel known to have high centroid values for the Normal-like subtype. The color bar indicates the range of the score, where samples with higher Normal-like scores are depicted in yellow/green, and samples with lower scores are in purple/blue.

- Right panel: The same UMAP space is shown, with samples colored according to their assigned cluster. These clusters were previously identified based on the analysis and represent different breast cancer subtypes combined with Nottingham Histologic Grade (G2 or G3): Basal-G3, Her2-G2, Her2-G3, LumA-G2, LumB-G3, and Normal-G2, as indicated by the legend.

Interpretation: The UMAP projection separates the samples into regions that correspond to the different molecular subtypes. Samples with high "Normal - scores" (left panel) fall largely within the region occupied by the Normal-G2 cluster (right panel), apart from the main tumor subtype clusters. This visual correspondence supports two conclusions: the calculated Normal score captures a gene expression signature characteristic of the Normal-like subtype, and the clustering method grouped the samples sharing that signature together.
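The scoring step itself is not shown in this notebook. As a rough illustration only, a signature score of this kind can be computed per sample as the mean expression of the signature genes minus the mean expression of a set of reference genes, which is similar in spirit to what `sc.tl.score_genes` does. The sketch below assumes that approach; the helper `signature_score`, the gene names, and the expression values are all hypothetical placeholders, not the PAM50 genes or data used here.

```python
from statistics import mean


def signature_score(sample, signature_genes, reference_genes):
    """Mean signature expression minus mean reference expression (hypothetical helper)."""
    sig = mean(sample[g] for g in signature_genes)
    ref = mean(sample[g] for g in reference_genes)
    return sig - ref


# Toy log-expression values for two made-up samples (not real data)
samples = {
    "sample_1": {"GENE_A": 8.0, "GENE_B": 7.0, "GENE_C": 2.0, "GENE_D": 3.0},
    "sample_2": {"GENE_A": 2.0, "GENE_B": 3.0, "GENE_C": 6.0, "GENE_D": 5.0},
}
signature = ["GENE_A", "GENE_B"]  # placeholder for Normal-like signature genes
reference = ["GENE_C", "GENE_D"]  # placeholder for background genes

# One score per sample; higher means the signature genes are relatively more expressed
scores = {
    name: signature_score(expr, signature, reference)
    for name, expr in samples.items()
}
```

A sample whose signature genes are expressed above the reference set gets a positive score, which is the kind of value the left UMAP panel maps to the yellow/green end of the color bar.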

In [None]:
vln = sc.get.obs_df(adata, keys=violin_plot)
vln=vln[vln['classification']==cluster]
prop=sc.get.obs_df(adata, keys=['classification','pam50 subtype','nhg'])
prop=prop[prop['classification']==cluster]


fig,axes=plt.subplots(1,2, figsize=(12,5))
axes[0]=plot_violin(vln,axes[0])
axes[1],prop=plot_bars(prop,axes[1])
plt.show()

In [None]:
Markdown(f"""

### PAM50 scoring - {cluster}

{vln.describe().to_html()}

### PAM50 & NHG proportions {cluster}

{prop.to_html()}
""")

Figure 2: Molecular and Histological Characterization of the Normal-G2 Cluster.

This figure provides a detailed characterization of the samples assigned to the Normal-G2 cluster (n=736) based on PAM50 subtype scoring and clinical Nottingham Histological Grade (NHG).

- Left panel: Violin plots displaying the distribution of PAM50 centroid scores for the samples within the Normal-G2 cluster. Each violin shows the density distribution of how strongly these samples score against the average gene expression profile (centroid) of each of the five intrinsic PAM50 subtypes (Basal-like, Luminal A, Luminal B, Her2-enriched, and Normal-like). Embedded box plots indicate the median and interquartile range. The Y-axis represents the PAM50 score, reflecting the molecular similarity of the Normal-G2 samples to each respective subtype centroid.

- Right panel: Stacked bar plots illustrating the proportions of samples within the Normal-G2 cluster according to their PAM50 molecular subtype assignment (left bar) and their Nottingham Histological Grade (NHG; right bar), based on clinical metadata. The legend specifies the color assigned to each subtype and grade category. This plot visualizes the composition of this specific cluster.

Interpretation: The violin plots on the left show that samples within the Normal-G2 cluster score highest against the Normal-like centroid and also score high against the Luminal A centroid. Scores for the other centroids (Basal, Luminal B, Her2) are markedly lower. The stacked bar plots on the right show the composition of this cluster based on standard PAM50 calls and clinical grade: the majority of samples (72.3%) are classified as Luminal A by PAM50, with a substantial minority (20.9%) classified as Normal-like. Histologically, the cluster is predominantly Grade 2 (59.2%), with a sizeable share of Grade 1 (29.1%), which motivates the "G2" part of the cluster name. Taken together, the Normal-G2 cluster consists largely of Luminal A tumors of lower histological grade (G1/G2) whose gene expression patterns also closely resemble the Normal-like subtype, which is consistent with their grouping in the UMAP space.

In [None]:
fig, ax = plt.subplots(figsize=(12, 5))
ax, res = pathways(cluster, ax)
ax.set_title('GSEA')
plt.show()

In [None]:
Markdown(f"""

### GSEA - {cluster}

{res[res['FDR q-val']<0.05].iloc[:,:-1].to_html()}

""")




Figure 3: Gene Set Enrichment Analysis (GSEA) Highlighting Suppressed Proliferation and Metabolic Pathways in the Normal-G2 Cluster.

This bubble plot presents the results of a pre-ranked Gene Set Enrichment Analysis (GSEA) comparing the gene expression profiles of samples within the Normal-G2 cluster against all other samples in the cohort. Genes are ranked by their differential expression score for the Normal-G2 cluster, and the ranking is tested against the Hallmark gene sets from the Molecular Signatures Database (MSigDB). The plot displays the top gene sets with an FDR q-value below 0.25, positioned along the X-axis by Normalized Enrichment Score (NES).

- X-axis (NES): Normalized Enrichment Score. A positive NES means the members of the gene set are concentrated among genes with higher expression in the Normal-G2 cluster than in the other samples. A negative NES means they are concentrated among genes with lower expression in the Normal-G2 cluster (equivalently, higher expression in the other samples). The sign gives only the direction of enrichment; statistical significance is given by the FDR q-value.

- Y-axis: The names of the significantly enriched Hallmark gene sets.

- Bubble Size: Represents the percentage of genes within the gene set that are part of the leading edge, meaning they contribute most significantly to the enrichment score. Larger bubbles denote that a higher proportion of the gene set members are driving the enrichment.

- Bubble Color: Indicates the statistical significance of the enrichment result, specifically represented by log10(1/FDR). Higher values (more yellow/green colors) correspond to lower False Discovery Rate (FDR) q-values, denoting higher statistical confidence in the enrichment result.
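To make the sign of the enrichment score concrete, here is a minimal sketch of the running-sum statistic behind pre-ranked GSEA. It is an illustration under simplifying assumptions, not gseapy's implementation: the function `enrichment_score`, the gene names, and the ranking values are hypothetical, and the permutation step that turns the raw score into an NES (dividing by the mean of same-sign null scores) is omitted.

```python
def enrichment_score(ranked_genes, scores, gene_set, weight=1.0):
    """Raw running-sum enrichment score for a pre-ranked gene list (illustrative)."""
    hits = [g in gene_set for g in ranked_genes]
    n_miss = len(ranked_genes) - sum(hits)
    # Total weight of the genes in the set, used to normalize each hit step
    hit_total = sum(abs(s) ** weight for s, h in zip(scores, hits) if h)

    running, best = 0.0, 0.0
    for s, h in zip(scores, hits):
        if h:
            running += abs(s) ** weight / hit_total  # step up on a set member
        else:
            running -= 1.0 / n_miss  # step down on a non-member
        if abs(running) > abs(best):
            best = running  # keep the maximum deviation from zero, with its sign
    return best


# Toy ranking, ordered from most up-regulated to most down-regulated (not real data)
ranked = ["g1", "g2", "g3", "g4", "g5", "g6"]
stat = [3.0, 2.0, 1.0, -1.0, -2.0, -3.0]

es_top = enrichment_score(ranked, stat, {"g1", "g2"})     # set sits at the top
es_bottom = enrichment_score(ranked, stat, {"g5", "g6"})  # set sits at the bottom
```

A gene set concentrated at the top of the ranking yields a positive score, and one concentrated at the bottom yields a negative score, which is the direction that the X-axis of the plot encodes.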

Interpretation: The GSEA shows that pathways related to cell cycle progression, proliferation, and key metabolic processes are negatively enriched (downregulated) in the Normal-G2 cluster compared to the other breast cancer subtypes in the cohort. The most prominent negatively enriched pathways include those associated with cell cycle regulation (HALLMARK_E2F_TARGETS, HALLMARK_G2M_CHECKPOINT), cell growth and signaling (HALLMARK_MYC_TARGETS_V1/V2, HALLMARK_MTORC1_SIGNALING), and energy metabolism (HALLMARK_OXIDATIVE_PHOSPHORYLATION, HALLMARK_GLYCOLYSIS), as well as HALLMARK_DNA_REPAIR and HALLMARK_UNFOLDED_PROTEIN_RESPONSE. This pattern of pathway suppression is consistent with the predominantly lower histological grade (G1/G2) and less aggressive nature of the samples in this cluster, which includes a large proportion of Luminal A tumors and samples with a Normal-like PAM50 signature. The only positively enriched pathway among the top-ranked sets is HALLMARK_UV_RESPONSE_DN, which comprises genes that are down-regulated in response to ultraviolet radiation. All displayed gene sets pass the FDR q-value cutoff of 0.25 used for the plot, and the predominantly green/yellow bubble colors indicate very low FDR q-values for most of them.
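When reading a result table like the one above, it can help to separate positively and negatively enriched gene sets explicitly. The following is a small sketch of one way to do that with pandas. It assumes the table has `Term`, `NES`, and `FDR q-val` columns, as gseapy's `res2d` is expected to; the helper `split_by_direction` and the toy rows are hypothetical and do not reproduce the results of this analysis.

```python
import pandas as pd


def split_by_direction(res, fdr_cutoff=0.05):
    """Split an enrichment table into positively and negatively enriched sets."""
    # Cast to float in case the columns were stored with object dtype
    fdr = res["FDR q-val"].astype(float)
    nes = res["NES"].astype(float)
    significant = fdr < fdr_cutoff
    up = res[significant & (nes > 0)].sort_values("NES", ascending=False)
    down = res[significant & (nes < 0)].sort_values("NES")
    return up, down


# Toy table with made-up values (not the results reported in this notebook)
toy = pd.DataFrame(
    {
        "Term": ["SET_A", "SET_B", "SET_C", "SET_D"],
        "NES": [-2.5, 1.8, -1.1, 0.9],
        "FDR q-val": [0.001, 0.01, 0.2, 0.6],
    }
)

up, down = split_by_direction(toy)
```

In the notebook, the same call could be applied to `res` to list suppressed and activated Hallmark sets separately, using whichever FDR cutoff matches the table being reported.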