In [None]:
import scanpy as sc
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display, Markdown, HTML
import gseapy as gp
from gseapy import GSEA, Msigdb, dotplot
import warnings
warnings.filterwarnings('ignore')

In [None]:
adata = sc.read('../Data/dataset_annotated.h5ad')
violin_plot = ['classification','Basal', 'LumA', 'LumB', 'Her2', 'Normal']
cluster='Basal-G3'
score='Basal'

In [None]:
msig = Msigdb()
gmt = msig.get_gmt(category='h.all', dbver="2025.1.Hs")

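def plot_bars(tmp, ax):
    """Plot stacked PAM50-subtype and NHG proportions for one cluster on `ax`."""
    # Fraction of samples per category within the cluster
    col1_props = tmp['pam50 subtype'].value_counts(normalize=True)
    col2_props = tmp['nhg'].value_counts(normalize=True)
    proportions = pd.concat([col1_props, col2_props], axis=1, keys=['pam50 subtype', 'nhg']).fillna(0)
    # Transpose so that each annotation becomes one stacked bar
    proportions.T.plot(
        kind='bar',
        stacked=True,
        colormap='tab20',
        edgecolor='black',
        ax=ax
    )

    ax.legend()
    ax.tick_params(axis='x', rotation=0)  # keep x tick labels horizontal
    ax.set_ylabel('Proportion')
    ax.set_title('PAM50 & NHG proportions')
    return ax, proportions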


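def plot_violin(tmp, ax):
    """Draw one violin per PAM50 score column of `tmp` on `ax`."""
    # Pass the frame as `data` explicitly and keep only the numeric score
    # columns, so the categorical 'classification' column is not plotted.
    ax = sns.violinplot(data=tmp.select_dtypes('number'), ax=ax)
    ax.tick_params(axis='x', rotation=45)
    ax.set_title('PAM50 scoring')
    return ax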

def pathways(cluster, ax):
    """Run preranked GSEA for `cluster` and draw the dot plot on `ax`."""
    # Requires sc.tl.rank_genes_groups to have been run on `adata` beforehand.
    expr = sc.get.rank_genes_groups_df(adata, group=cluster)[['names', 'scores']]
    expr.columns = ['gene_name', 'score']
    pre_res = gp.prerank(
        rnk=expr,  # DataFrame or path to .rnk file
        gene_sets=gmt,  # Or 'KEGG_2021_Human', 'Reactome_2022', etc.
        permutation_num=1000,  # recommended ≥1000
        seed=42,
        threads=4,  # parallelization (called `processes` in older gseapy releases)
    )

    ax = dotplot(
        pre_res.res2d,
        column="FDR q-val",
        cmap=plt.cm.viridis,
        size=4,  # adjust dot size
        cutoff=0.25,  # plot only gene sets with FDR q-val below 0.25
        show_ring=False,
        ax=ax,
    )
    return ax, pre_res.res2d


In [None]:
Markdown(f"""
# Cluster {cluster}""")

In [None]:
fig,axes= plt.subplots(1,2, figsize=(12,5))
sc.pl.umap(adata, color=score,ax=axes[0], show=False)
axes[0].set_title(f"{score} - score")
sc.pl.umap(adata, color='classification',ax=axes[1], show=False)
axes[1].set_title('Clusters')
plt.show()


Figure 1: UMAP Visualization of Breast Cancer Samples Colored by Basal Score and Assigned Clusters.

This figure displays the dimensionality reduction of the breast cancer RNA-Seq samples (same cohort as previously analyzed) using Uniform Manifold Approximation and Projection (UMAP). Each point in the plots represents an individual sample.

- Left panel: The UMAP space is colored based on a calculated Basal score for each sample. This score was derived from the expression levels of specific genes from the PAM50 panel known to have high centroid values for the Basal subtype. The color bar indicates the range of the score, where samples with higher Basal scores are depicted in yellow/green, and samples with lower scores are in purple/blue.

- Right panel: The same UMAP space is shown, with samples colored according to their assigned cluster. These clusters were previously identified based on the analysis and represent different breast cancer subtypes combined with Nottingham Histologic Grade (G2 or G3): Basal-G3, Her2-G2, Her2-G3, LumA-G2, LumB-G3, and Normal-G2, as indicated by the legend.
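
The notebook does not show how the `Basal` score column was computed. As a rough illustration only, one common way to build such a signature score is to average per-gene z-scores over a marker list. The helper `signature_score`, the gene names, and the toy values below are all hypothetical and are not taken from this analysis.

```python
import pandas as pd

def signature_score(expr, genes):
    """Mean z-scored expression of `genes` per sample (rows are samples)."""
    # Ignore marker genes that were not measured in this matrix.
    present = [g for g in genes if g in expr.columns]
    sub = expr[present]
    # Standardize each gene across samples, then average across genes.
    z = (sub - sub.mean(axis=0)) / sub.std(axis=0, ddof=0)
    return z.mean(axis=1)

# Toy expression matrix: 4 samples x 3 genes (invented values).
toy = pd.DataFrame(
    {
        "GENE_A": [1.0, 2.0, 9.0, 10.0],
        "GENE_B": [0.5, 1.5, 8.0, 9.5],
        "GENE_C": [5.0, 5.0, 5.0, 5.0],
    },
    index=["s1", "s2", "s3", "s4"],
)
scores = signature_score(toy, ["GENE_A", "GENE_B", "NOT_MEASURED"])
```

Because each gene is standardized first, highly expressed genes do not dominate the score, and samples with above-average marker expression receive positive values.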

Interpretation: The UMAP projection separates the samples into distinct regions. Samples with high Basal scores (left panel) fall largely within the region occupied by samples assigned to the Basal-G3 cluster (right panel). This visual agreement suggests that the calculated Basal score captures the molecular characteristics that distinguish the Basal subtype, and that the clustering approach isolated this group, particularly the samples with Grade 3 histology.
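
The visual agreement between the two panels can also be checked numerically. The sketch below compares the median score per cluster on a toy table; the values are invented, and the frame only stands in for something like `sc.get.obs_df(adata, keys=['classification', 'Basal'])`.

```python
import pandas as pd

# Toy stand-in for the per-sample metadata (invented values).
obs = pd.DataFrame({
    "classification": ["Basal-G3", "Basal-G3", "LumA-G2", "LumA-G2", "Her2-G3"],
    "Basal": [0.9, 0.7, -0.4, -0.6, 0.1],
})

# Median score per cluster, sorted so the highest-scoring cluster comes first.
median_by_cluster = (
    obs.groupby("classification")["Basal"].median().sort_values(ascending=False)
)
top_cluster = median_by_cluster.index[0]
```

If the score tracks the cluster assignment, the cluster of interest should rank first with a clear margin over the others.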

In [None]:
vln = sc.get.obs_df(adata, keys=violin_plot)
vln=vln[vln['classification']==cluster]
prop=sc.get.obs_df(adata, keys=['classification','pam50 subtype','nhg'])
prop=prop[prop['classification']==cluster]


fig,axes=plt.subplots(1,2, figsize=(12,5))
axes[0]=plot_violin(vln,axes[0])
axes[1],prop=plot_bars(prop,axes[1])
plt.show()

In [None]:
Markdown(f"""

### PAM50 scoring - {cluster}

{vln.describe().to_html()}

### PAM50 & NHG proportions {cluster}

{prop.to_html()}
""")

Figure 2: Molecular and Histological Characterization of the Basal-G3 Cluster.

This figure provides a detailed characterization of the samples assigned to the Basal-G3 cluster (n=302) based on PAM50 subtype scoring and clinical Nottingham Histological Grade (NHG).

- Left panel: Violin plots displaying the distribution of PAM50 centroid scores for the samples within the Basal-G3 cluster. Each violin shows how strongly these samples score against the average gene expression profile (centroid) of each of the five intrinsic PAM50 subtypes (Basal-like, Luminal A, Luminal B, Her2-enriched, and Normal-like). The width of the violin indicates the density distribution of scores, while the embedded box plot shows the median and interquartile range. The Y-axis represents the PAM50 score, reflecting the molecular similarity of the Basal-G3 samples to each respective subtype centroid.

- Right panel: Stacked bar plots illustrating the proportions of samples within the Basal-G3 cluster according to their PAM50 molecular subtype assignment (left bar) and their Nottingham Histological Grade (NHG; right bar), based on clinical metadata. The height of each colored segment represents the fraction of samples within this cluster belonging to that specific category.

Interpretation: The violin plots on the left confirm the strong molecular signature of the Basal-G3 cluster. Samples within this cluster show overwhelmingly high scores for the Basal centroid and correspondingly low scores for the Luminal A and Luminal B centroids. This molecular profile is consistent with their assignment to the Basal subtype. The stacked bar plots on the right further validate the cluster definition: the vast majority (97.0%) of samples within this cluster are classified as Basal by PAM50, and a high proportion (91.4%) are assigned Histological Grade 3. This figure confirms that the Basal-G3 cluster is molecularly defined by a strong Basal signature and clinically characterized by high-grade histology.
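
The percentages quoted above come from the proportions table returned by `plot_bars`. A minimal sketch of the same calculation without plotting, on invented toy data, could look like this:

```python
import pandas as pd

# Toy cluster metadata (invented values), shaped like the `prop` frame.
tmp = pd.DataFrame({
    "pam50 subtype": ["Basal", "Basal", "Basal", "Her2"],
    "nhg": ["G3", "G3", "G2", "G3"],
})

# Fraction of samples per category, one column per annotation.
pam50 = tmp["pam50 subtype"].value_counts(normalize=True)
nhg = tmp["nhg"].value_counts(normalize=True)
proportions = pd.concat(
    [pam50, nhg], axis=1, keys=["pam50 subtype", "nhg"]
).fillna(0)

# Express as percentages rounded to one decimal place.
percent = (proportions * 100).round(1)
```

Categories that occur in only one annotation (for example `G3` has no PAM50 meaning) are filled with zero, so each column still sums to 100%.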

In [None]:
fig, ax = plt.subplots(figsize=(12, 5))
ax, res = pathways(cluster, ax)
ax.set_title('GSEA')
plt.show()

In [None]:
Markdown(f"""

### GSEA - {cluster}

{res[res['FDR q-val']<0.05].iloc[:,:-1].to_html()}

""")




Figure 3: Gene Set Enrichment Analysis (GSEA) Reveals Enriched Biological Pathways in the Basal-G3 Cluster.

This bubble plot displays the results of a preranked Gene Set Enrichment Analysis (GSEA), run on genes ranked by their differential-expression score for the Basal-G3 cluster relative to the other samples in the cohort. The plot highlights the top enriched Hallmark gene sets from the Molecular Signatures Database (MSigDB).

- X-axis (NES): Normalized Enrichment Score. The sign of the NES gives the direction of enrichment, not its significance. A positive NES means the members of the gene set are concentrated among genes upregulated in the Basal-G3 cluster compared to other samples. A negative NES means they are concentrated among genes downregulated in the Basal-G3 cluster (equivalently, upregulated in the other samples). Statistical significance is conveyed by the FDR q-value, shown as bubble color.

- Y-axis: The names of the significantly enriched Hallmark gene sets.

- Bubble Size: Represents the percentage of genes within the gene set that are part of the leading edge, i.e., the genes that contribute most to the enrichment score. Larger bubbles signify that a larger proportion of the gene set members are driving the observed enrichment.

- Bubble Color: Indicates the statistical significance of the enrichment result, specifically represented by log10(1/FDR). Higher values (more yellow/green colors) correspond to smaller False Discovery Rate (FDR) q-values, denoting higher statistical confidence in the enrichment.
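
To make the sign of the enrichment score concrete, the sketch below computes a running-sum enrichment score on a toy ranked list. It is a simplified illustration of the general GSEA idea, not gseapy's implementation: the function `enrichment_score` and the gene names are hypothetical, and the permutation-based normalization that turns ES into NES is omitted.

```python
import numpy as np

def enrichment_score(ranked_genes, scores, gene_set, weight=1.0):
    """Running-sum enrichment score for one gene set.

    `ranked_genes` must already be sorted by `scores`, highest first.
    Returns the ES (signed maximum deviation from zero) and the running sum.
    """
    in_set = np.isin(ranked_genes, list(gene_set))
    # Hits step up in proportion to |score|; misses step down uniformly.
    hit_weights = np.abs(np.asarray(scores, dtype=float)) ** weight * in_set
    step_up = hit_weights / hit_weights.sum()
    step_down = (~in_set) / (~in_set).sum()
    running = np.cumsum(step_up - step_down)
    peak = np.argmax(np.abs(running))
    return running[peak], running

# Toy ranking: six genes, sorted from most up- to most downregulated.
genes = np.array(["g1", "g2", "g3", "g4", "g5", "g6"])
scores = np.array([3.0, 2.0, 1.0, -1.0, -2.0, -3.0])

es_up, _ = enrichment_score(genes, scores, {"g1", "g2"})
es_down, _ = enrichment_score(genes, scores, {"g5", "g6"})
```

A set whose members sit at the top of the ranking yields a positive score, and one whose members sit at the bottom yields a negative score. For a positive score, the leading edge consists of the set members ranked at or before the peak of the running sum.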

Interpretation: The GSEA identifies several key biological pathways significantly enriched in the Basal-G3 cluster. Consistent with the highly proliferative nature and high histological grade (G3) characteristic of many Basal-like breast cancers, pathways related to cell cycle progression and proliferation such as HALLMARK_E2F_TARGETS, HALLMARK_G2M_CHECKPOINT, and HALLMARK_MITOTIC_SPINDLE are among the most highly positively enriched. Pathways associated with immune and inflammatory responses, including HALLMARK_INTERFERON_GAMMA_RESPONSE, HALLMARK_ALLOGRAFT_REJECTION, HALLMARK_INTERFERON_ALPHA_RESPONSE, HALLMARK_INFLAMMATORY_RESPONSE, and HALLMARK_IL6_JAK_STAT3_SIGNALING, also show significant positive enrichment, suggesting increased immune activity in this cluster. Additionally, HALLMARK_MYC_TARGETS_V1, HALLMARK_MYC_TARGETS_V2, and HALLMARK_MTORC1_SIGNALING are positively enriched, reflecting activated oncogenic signaling. In contrast, HALLMARK_ESTROGEN_RESPONSE_EARLY is significantly negatively enriched, consistent with the hormone receptor-negative status typical of Basal-like breast cancers. All gene sets shown pass the plotting cutoff (FDR q-value < 0.25), and the table above lists those with an FDR q-value below 0.05.