# IRD Xenium Spatial Neighborhood Analysis

This notebook performs downstream analysis of Xenium spatial transcriptomics data from the Multiple Myeloma IRD study. The study focuses on understanding the microenvironment changes in newly diagnosed multiple myeloma (NDMM) and after autologous stem cell transplant (ASCT) therapy. The analysis includes cell type composition, gene expression patterns, immune microenvironment characterization, and spatial neighborhood identification.

Data preprocessing and cell type annotation were performed outside of this notebook.


In [None]:
import scanpy as sc
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.backends.backend_pdf import PdfPages
from scipy.stats import mannwhitneyu
from statsmodels.stats.multitest import multipletests
from itertools import combinations

import matplotlib as mpl
mpl.rcParams['pdf.fonttype'] = 42
mpl.rcParams['ps.fonttype'] = 42

In [None]:
# Import custom helper functions
import sys
sys.path.append('/diskmnt/Users2/chouw/Projects/SenNet_bone/src/spatial/utils')
import spatial_utils
import plot_utils

# If helper functions are updated, uncomment and run this line
#import importlib
#importlib.reload(plot_utils)

## Section 1: Data Loading and Cell Type Composition

Load the merged Xenium spatial transcriptomics dataset and analyze cell type composition across different collection timepoints (NBM, NDMM, PT).

### Load merged Xenium data and revised cell type annotations

The merged dataset contains cells from multiple samples across three collection timepoints:
- **NBM**: Normal bone marrow
- **NDMM**: Newly diagnosed multiple myeloma
- **PT**: Post-treatment

In [None]:
ird_xenium_merge = sc.read_h5ad("/diskmnt/Projects/myeloma_scRNA_analysis/MMY_IRD/Xenium/analysis/merged.h5ad")
ird_xenium_merge.obs.head()

In [None]:
# The most up-to-date cell type annotation is in the 'ct' column
ird_xenium_merge.obs['ct'].unique().tolist()

### Cell type composition analysis across timepoints

Calculate the percentage of cell types per sample and compare across collection timepoints to identify disease-associated changes in cell type abundance.
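As a minimal illustration of the per-sample percentage calculation, the sketch below uses a toy table in place of `ird_xenium_merge.obs`. The sample names and cell counts are made up, and `pd.crosstab` is used here only as a compact alternative for illustration.

```python
import pandas as pd

# Toy stand-in for an AnnData `.obs` table; sample and cell type labels are hypothetical.
toy_obs = pd.DataFrame({
    "Sample": ["S1", "S1", "S1", "S1", "S2", "S2"],
    "ct": ["PC", "PC", "Erythroid", "Monocyte", "PC", "Erythroid"],
})

# Rows are samples, columns are cell types; normalize="index" makes each row sum to 1.
toy_pct = pd.crosstab(toy_obs["Sample"], toy_obs["ct"], normalize="index") * 100
print(toy_pct.round(1))
```

Cell types absent from a sample appear as 0% rather than being dropped, which keeps every sample represented in each boxplot.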


In [None]:
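# Subset to NBM and NDMM samples.
# NOTE: PT is excluded here although the observations below also discuss PT;
# add 'PT' to the list to include post-treatment samples in the composition plots.
ird_nbm_mm = ird_xenium_merge[ird_xenium_merge.obs['Collection'].isin(['NBM', 'NDMM'])].copy()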

In [None]:
# Calculate percentage of each cell type in each sample
ird_cell_info = ird_nbm_mm.obs.copy()
ct_counts = ird_cell_info.groupby(['Sample', 'ct']).size().reset_index(name = 'ct_count')   ## Number of cells per cell type per sample
sample_counts = ird_cell_info.groupby('Sample').size().reset_index(name='total_count')        ## Total number of cells per sample
ct_counts = ct_counts.merge(sample_counts, on='Sample', how='left')
ct_counts['pct'] = ct_counts['ct_count'] / ct_counts['total_count'] * 100
ct_counts = ct_counts.merge(ird_cell_info[['Sample', 'Collection']].drop_duplicates(), on='Sample', how='left')
ct_counts.head()


In [None]:
all_ct = ird_xenium_merge.obs['ct'].unique().tolist()
all_ct = [ct for ct in all_ct if ct not in ['Low Confidence']]
high_abundance_ct = ['Erythroid', 'GMP', 'Late Myeloid', 'Neutrophil', 'PC']
low_abundance_ct = [ct for ct in all_ct if ct not in high_abundance_ct]
high_ct_counts = ct_counts[ct_counts['ct'].isin(high_abundance_ct)]
low_ct_counts = ct_counts[ct_counts['ct'].isin(low_abundance_ct)]

In [None]:
plot_utils.plot_multigroup_boxplot_with_significance(high_ct_counts, 'ct', 'pct', 'Collection', show_swarm = False,
                                                    palette = 'Set2', figsize=(6, 4), xlabel='Cell type', ylabel='Percentage of cells', title=None, save_path=None) 
plt.xticks(rotation=45, ha='right')

In [None]:
plot_utils.plot_multigroup_boxplot_with_significance(low_ct_counts, 'ct', 'pct', 'Collection', show_swarm = False,
                                                    palette = 'Set2', figsize=(12, 4), xlabel='Cell type', ylabel='Percentage of cells', title=None, save_path=None) 
plt.xticks(rotation=45, ha='right')

### Observations
- Immune Compartment
    - Plasma cells take over the bone marrow in NDMM and revert to normal ranges after treatment.
    - Correspondingly, the proportions of erythroid cells, GMPs, late myeloid cells, and neutrophils decrease in NDMM. In normal bone marrow, these cells account for 60-70% of the bone marrow.
    - B cells and their progenitors decrease in proportion in NDMM but are restored after stem cell transplant in PT.
    - Macrophages increase in proportion in NDMM and increase further in PT. Monocytes, on the other hand, decrease in proportion relative to normal bone marrow.
    - CD4 T cells decrease drastically in proportion in NDMM and are further depleted in PT. CD8 T cells and NK cells show a higher proportion in NDMM and decrease slightly in PT.
- Stromal Compartment
    - Stromal cells overall increase in proportion in NDMM and PT, including MSCs, endothelial cells, and adipocytes.

These observations give rise to a few biological hypotheses, which will be tested in the following sections.

1. Plasma cells out-compete B cells and their progenitors for survival signals, leading to plasma cell proliferation and inhibiting the normal B-to-PC transformation. This is restored after stem cell transplant.
    - These signals may include APRIL/BAFF signaling from myeloid cells and CXCL signaling from stromal cells (MSCs).
2. The post-transplant immune microenvironment is still dysfunctional, as indicated by the elevated macrophage proportions and decreased CD4 T cells.
    - These macrophages may still exhibit immunosuppressive capabilities that induce T cell exhaustion.

## Section 2: APRIL/BAFF signaling between myeloid and B cells/plasma cells

APRIL pathway: Myeloid cells expressing TNFSF13 (APRIL) and plasma cells expressing TNFRSF17 (BCMA), TNFRSF13B (TACI)

BAFF pathway: Myeloid cells expressing TNFSF13B (BAFF) and plasma cells expressing TNFRSF13B (TACI), TNFRSF13C (BAFF-R)

**Note**: Many of these genes are only present in the v6 custom panel.
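The ligand/receptor pairs above can be written down as a small lookup table and turned into per-cell positivity flags. The sketch below is only illustrative: `LR_PAIRS`, `flag_positive`, and the toy count matrix are hypothetical names and data, and a cell is called positive when its count exceeds zero for at least one listed gene. Genes missing from a panel are skipped.

```python
import pandas as pd

# Ligand -> receptors, as listed in the pathway description
LR_PAIRS = {
    "TNFSF13": ["TNFRSF17", "TNFRSF13B"],    # APRIL -> BCMA, TACI
    "TNFSF13B": ["TNFRSF13B", "TNFRSF13C"],  # BAFF -> TACI, BAFF-R
}

def flag_positive(expr, genes, threshold=0):
    """True for cells whose count exceeds `threshold` for at least one available gene."""
    present = [g for g in genes if g in expr.columns]  # skip genes absent from the panel
    return (expr[present] > threshold).any(axis=1)

# Toy cells-by-genes count table (made-up values)
toy_expr = pd.DataFrame(
    {"TNFSF13": [2, 0, 0], "TNFRSF17": [0, 3, 0], "TNFRSF13B": [0, 0, 0]},
    index=["cell1", "cell2", "cell3"],
)
april_ligand_pos = flag_positive(toy_expr, ["TNFSF13"])
april_receptor_pos = flag_positive(toy_expr, LR_PAIRS["TNFSF13"])
print(pd.DataFrame({"APRILpos": april_ligand_pos, "APRIL_receptor_pos": april_receptor_pos}))
```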

### Expression of APRIL/BAFF receptor/ligands across cell types and timepoints

In [None]:
ird_v6 = ird_xenium_merge[ird_xenium_merge.obs['Panel'] == 'BYGXJ6_hMulti', :].copy()
# Vectorized string concatenation; avoids a slow row-wise apply over all cells
ird_v6.obs['ct_timepoint'] = ird_v6.obs['ct'].astype(str) + '_' + ird_v6.obs['Collection'].astype(str)
sc.pl.dotplot(ird_v6, var_names = ['TNFSF13', 'TNFSF13B', 'TNFRSF17', 'TNFRSF13B', 'TNFRSF13C'], groupby = 'ct_timepoint', standard_scale = 'var', dot_max = 0.5, cmap = 'Reds', swap_axes = True)

Observations:
- APRIL expression is highest on monocytes, but is also present on GMP/Late myeloid/neutrophils/pDCs. APRIL expression decreases in NDMM and increases in PT.
- BCMA/TACI expression on plasma cells increases in NDMM and decreases in PT. These transcripts are also detected in all other cell types in NDMM, most likely due to transcript leakage.
- BAFF expression is also highest on monocytes, but is not as high on GMP/Late myeloid/neutrophils. Interestingly, there is some expression on HSPCs. BAFF seems to increase in PT.
- BAFF-R is expressed on mature B cells and at lower levels on early B cells.

Hypothesis:
**Plasma cells in NDMM hijack the APRIL/BAFF survival signaling that normally supports early/mature B cells**

Are plasma cells closer to APRIL+/BAFF+ cells in NDMM?

In [None]:
# Plot mean expression of APRIL/BAFF receptor/ligands in key cell types across timepoints

### Distance between (APRIL+) myeloid cells and (BCMA+) plasma cells

In [None]:
rn_obj = sc.read_h5ad("/diskmnt/Projects/myeloma_scRNA_analysis/MMY_IRD/Xenium/analysis/radial_neighborhoods/Output/merged_RN.h5ad")
ird_v6 = rn_obj[rn_obj.obs['Panel'] == 'BYGXJ6_hMulti'].copy()

In [None]:
## Filter out cells with unassigned radial neighborhood -> these are cells in low density regions
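# .copy() turns the filtered view into a real AnnData object so that new obs columns can be added below
ird_v6 = ird_v6[ird_v6.obs['rn'] != 'Unassigned'].copy()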

In [None]:
april_df = ird_v6[:, 'TNFSF13'].to_df()
baff_df = ird_v6[:, 'TNFSF13B'].to_df()
ird_v6.obs.loc[:, 'APRILpos'] = april_df['TNFSF13'] > 0
ird_v6.obs.loc[:, 'APRIL_exp'] = april_df['TNFSF13'].values
ird_v6.obs.loc[:, 'BAFFpos'] = baff_df['TNFSF13B'] > 0

# Receptor expression: BCMA (TNFRSF17) and TACI (TNFRSF13B) for APRIL; TACI and BAFF-R (TNFRSF13C) for BAFF
april_receptor_df = ird_v6[:, ['TNFRSF17', 'TNFRSF13B']].to_df()
baff_receptor_df = ird_v6[:, ['TNFRSF13B', 'TNFRSF13C']].to_df()
ird_v6.obs.loc[:, 'APRIL_receptor_pos'] = (april_receptor_df['TNFRSF17'] > 0) | (april_receptor_df['TNFRSF13B'] > 0)
ird_v6.obs.loc[:, 'BAFF_receptor_pos'] = (baff_receptor_df['TNFRSF13B'] > 0) | (baff_receptor_df['TNFRSF13C'] > 0)


In [None]:
ird_cells_info = ird_v6.obs.copy()

#### Nearest distance between all myeloid cells to all PC/B cells

Since detection of individual genes by Xenium probes can be sparse, we first treat all myeloid cells as potential sources of APRIL and all plasma cells/B cells as potential carriers of the APRIL receptors (BCMA/TACI), regardless of whether the transcripts were detected in each cell.
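The implementation of `spatial_utils.nearest_dist_between_two_celltypes` is not shown in this notebook. The sketch below is one way such a per-sample nearest-neighbor distance could be computed with a KD-tree. It assumes the helper returns one row per cell of the first table with a `nearest_dist_to_df2` column, as the later cells suggest. `nearest_dist_sketch` and the toy coordinates are hypothetical.

```python
import pandas as pd
from scipy.spatial import cKDTree

def nearest_dist_sketch(df1, df2, sample_col="Sample", x_col="x_centroid", y_col="y_centroid"):
    """For each cell in df1, Euclidean distance to the closest df2 cell in the same sample."""
    out = []
    for sample, query in df1.groupby(sample_col, observed=True):
        ref = df2[df2[sample_col] == sample]
        if ref.empty:
            continue  # no reference cells in this sample, so no distance is defined
        tree = cKDTree(ref[[x_col, y_col]].to_numpy())
        dist, _ = tree.query(query[[x_col, y_col]].to_numpy(), k=1)
        res = query.copy()
        res["nearest_dist_to_df2"] = dist
        out.append(res)
    if not out:
        return df1.iloc[0:0].assign(nearest_dist_to_df2=pd.Series(dtype=float))
    return pd.concat(out)

# Toy coordinates (made up): two query cells in sample A, one in sample B
toy_myeloid = pd.DataFrame({"Sample": ["A", "A", "B"],
                            "x_centroid": [0.0, 10.0, 0.0],
                            "y_centroid": [0.0, 0.0, 0.0]})
toy_plasma = pd.DataFrame({"Sample": ["A", "B"],
                           "x_centroid": [3.0, 0.0],
                           "y_centroid": [4.0, 5.0]})
toy_dist = nearest_dist_sketch(toy_myeloid, toy_plasma)
print(toy_dist)
```

Distances are computed within each sample only, so cells from different tissue sections are never paired.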

In [None]:
myeloid_pos = ird_cells_info[ird_cells_info['ct'].isin(['GMP', 'Late Myeloid', 'Neutrophil', 'Monocyte', 'Macrophage','cDC', 'pDC'])][['Sample', 'x_centroid', 'y_centroid']]
plasma_pos = ird_cells_info[ird_cells_info['ct'] == 'PC'][['Sample', 'x_centroid', 'y_centroid']]
b_pos = ird_cells_info[ird_cells_info['ct'] == 'Mature B'][['Sample', 'x_centroid', 'y_centroid']]
earlyb_pos = ird_cells_info[ird_cells_info['ct'] == 'Early B'][['Sample', 'x_centroid', 'y_centroid']]
mye_pc_dist = spatial_utils.nearest_dist_between_two_celltypes(myeloid_pos, plasma_pos, sample_col='Sample', x_col='x_centroid', y_col='y_centroid')
mye_b_dist = spatial_utils.nearest_dist_between_two_celltypes(myeloid_pos, b_pos, sample_col='Sample', x_col='x_centroid', y_col='y_centroid')
mye_earlyb_dist = spatial_utils.nearest_dist_between_two_celltypes(myeloid_pos, earlyb_pos, sample_col='Sample', x_col='x_centroid', y_col='y_centroid')

In [None]:
mye_pc_dist['ref_ct'] = 'Plasma cell'
mye_b_dist['ref_ct'] = 'Mature B cell'
mye_earlyb_dist['ref_ct'] = 'Early B cell'
mye_dist = pd.concat([mye_pc_dist, mye_b_dist, mye_earlyb_dist], axis = 0)

In [None]:
from scipy.stats import gaussian_kde
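def get_most_likely_dist(sample_ref_df, dist_col, dist_eval):
    """Return the value in `dist_eval` at which the Gaussian KDE of `sample_ref_df[dist_col]` peaks (the KDE mode)."""
    kde = gaussian_kde(sample_ref_df[dist_col])
    kde_pdf = kde.pdf(dist_eval)
    most_likely_dist = dist_eval[np.argmax(kde_pdf)]
    return most_likely_dist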
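To illustrate why the KDE mode is reported alongside the mean and median, the sketch below applies the same calculation to synthetic, right-skewed distances. The data are simulated, not taken from the study: a tight cluster around 20 plus a long uniform tail.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
# Synthetic nearest-neighbor distances: most cells close by, a minority far away
toy_dists = np.concatenate([rng.normal(20, 3, 900), rng.uniform(50, 400, 100)])

grid = np.arange(0, 501, 1)  # same evaluation grid as `dist_eval` in this notebook
kde_mode = grid[np.argmax(gaussian_kde(toy_dists).pdf(grid))]
print("KDE mode:", kde_mode)
print("mean:", toy_dists.mean(), "median:", np.median(toy_dists))
```

The long tail pulls the mean far above the cluster, while the mode stays near the distance at which most cells sit. Note that `gaussian_kde` chooses its bandwidth from the overall spread of the data by default, so a heavy tail also smooths the peak.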

In [None]:
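# Grid of candidate distances on which the KDE is evaluated (same grid as used in the cells below)
dist_eval = np.arange(0, 501, 1)
res = []
for sample in mye_dist['Sample'].unique():
    # Filter on mye_dist itself; a mask built from mye_pc_dist does not align with the concatenated table
    sample_df = mye_dist[mye_dist['Sample'] == sample]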
    for ref_ct in sample_df['ref_ct'].unique():
        sample_ref_df = sample_df[sample_df['ref_ct'] == ref_ct]
        if len(sample_ref_df) < 10:
            continue

        res.append({
            'Sample': sample,
            'Reference Cell Type': ref_ct,
            'Most Likely Distance': get_most_likely_dist(sample_ref_df, 'nearest_dist_to_df2', dist_eval),
            'Average Distance': sample_ref_df['nearest_dist_to_df2'].mean(),
            'Median Distance': sample_ref_df['nearest_dist_to_df2'].median()
        })

most_likely_dists_df = pd.DataFrame(res)

In [None]:
# Add metadata information
metadata = ird_cells_info[['Sample', 'Collection']].drop_duplicates().set_index('Sample')
most_likely_dists_df['Collection'] = most_likely_dists_df['Sample'].map(metadata['Collection'])
most_likely_dists_df.head()

In [None]:
timecols = {"NBM": "#0C7515", "NDMM": "#E619B9", "PT": "#CF99C3"} 
_ = plot_utils.plot_multigroup_boxplot_with_significance(most_likely_dists_df, 'Reference Cell Type', 'Most Likely Distance', 'Collection', show_swarm = True,
                                                    palette = timecols, figsize=(8, 6), xlabel='Cell type', ylabel='Distance from myeloid cells', title=None, save_path=None) 
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
#plt.savefig('/diskmnt/Users2/chouw/Projects/BM_spatial/IRD/IRD_JW_Xenium_myeloid_nearestDist_to_PC_B_cells_boxplot_v11062025.pdf', dpi = 300, transparent = True)

#### Randomly shuffle cell types to test for significant spatial association

In [None]:
# Randomly shuffle cell type labels within each sample to create a null distribution
num_shuffles = 10
dist_eval = np.arange(0, 501, 1)
res = []
for sample in ird_cells_info.loc[:, 'Sample'].unique().tolist():
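    sample_df = ird_cells_info[ird_cells_info['Sample'] == sample].copy()  # copy so the shuffled label column can be added safely
    
    # Observed data
    # NOTE: despite the `plasma_pos` / `mye_pc_dist` variable names, the target population in this cell is 'Early B'
    myeloid_pos = sample_df[sample_df['ct'].isin(['GMP', 'Late Myeloid', 'Neutrophil', 'Monocyte', 'Macrophage','cDC', 'pDC'])][['Sample', 'x_centroid', 'y_centroid']]
    plasma_pos = sample_df[sample_df['ct'] == 'Early B'][['Sample', 'x_centroid', 'y_centroid']]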
    mye_pc_dist = spatial_utils.nearest_dist_between_two_celltypes(myeloid_pos, plasma_pos, sample_col='Sample', x_col='x_centroid', y_col='y_centroid')
    observed_dist = get_most_likely_dist(mye_pc_dist, 'nearest_dist_to_df2', dist_eval)
    
    # Shuffled data
    shuffled_dists = []
    for i in range(num_shuffles):
        sample_df.loc[:, 'shuffled_ct'] = sample_df['ct'].sample(frac=1, random_state = i).values
        myeloid_pos_shuffled = sample_df[sample_df['shuffled_ct'].isin(['GMP', 'Late Myeloid', 'Neutrophil', 'Monocyte', 'Macrophage','cDC', 'pDC'])][['Sample', 'x_centroid', 'y_centroid']]
        plasma_pos_shuffled = sample_df[sample_df['shuffled_ct'] == 'Early B'][['Sample', 'x_centroid', 'y_centroid']]
        dist_shuffled = spatial_utils.nearest_dist_between_two_celltypes(myeloid_pos_shuffled, plasma_pos_shuffled, sample_col='Sample', x_col='x_centroid', y_col='y_centroid')
        shuffled_dists.append(get_most_likely_dist(dist_shuffled, 'nearest_dist_to_df2', dist_eval))
    mean_shuffled_dist = np.mean(shuffled_dists)
    
    sample_res = {
        'Sample': sample,
        'Observed Distance': observed_dist,
        'Shuffled Distances': mean_shuffled_dist
    }
    res.append(sample_res)

shuffled_dists_df = pd.DataFrame(res)
shuffled_dists_df.head()
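The cell above compares the observed distance with the mean of the shuffled distances. One common way to turn such a shuffle-based null distribution into a significance estimate is an empirical one-sided p-value, sketched below. `empirical_p_lower` and the toy null values are hypothetical and are not part of the notebook's helpers.

```python
import numpy as np

def empirical_p_lower(observed, null_values):
    """One-sided empirical p-value: share of shuffles with a distance <= observed.

    The +1 in numerator and denominator counts the observed value itself,
    so the estimate can never be exactly zero.
    """
    null_values = np.asarray(null_values, dtype=float)
    return (1 + np.sum(null_values <= observed)) / (1 + null_values.size)

# Toy null distribution from 9 label shuffles (made-up distances)
toy_null = np.array([30.0, 28.0, 35.0, 31.0, 29.0, 33.0, 32.0, 34.0, 27.0])
p_close = empirical_p_lower(12.0, toy_null)    # observed is closer than every shuffle
p_typical = empirical_p_lower(31.0, toy_null)  # observed sits inside the null distribution
print(p_close, p_typical)
```

The smallest attainable value is 1 / (number of shuffles + 1), so with `num_shuffles = 10` no sample can fall below roughly 0.09. More shuffles are needed to resolve smaller p-values.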

In [None]:
shuffled_dists_df_melt = shuffled_dists_df.melt(id_vars = ['Sample'], value_vars = ['Observed Distance', 'Shuffled Distances'], var_name = 'Distance Type', value_name = 'Distance')
sns.boxplot(shuffled_dists_df_melt, x = 'Distance Type', y = 'Distance', hue = 'Distance Type')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
#plt.savefig('/diskmnt/Users2/chouw/Projects/BM_spatial/IRD/IRD_JW_Xenium_myeloid_nearestDist_to_PC_B_cells_boxplot_v11062025.pdf', dpi = 300, transparent = True)
#plt.close()


In [None]:
metadata = ird_cells_info[['Sample', 'Collection']].drop_duplicates().set_index('Sample')
shuffled_dists_df_melt['Collection'] = shuffled_dists_df_melt['Sample'].map(metadata['Collection'])
sns.boxplot(shuffled_dists_df_melt, x = 'Collection', y = 'Distance', hue = 'Distance Type')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()

In [None]:
test_sample = 'S13-47674A1U1'
sample_df = ird_cells_info[ird_cells_info['Sample'] == test_sample].copy()  # copy so a shuffled label column can be added safely
myeloid_pos = sample_df[sample_df['ct'].isin(['GMP', 'Late Myeloid', 'Neutrophil', 'Monocyte', 'Macrophage','cDC', 'pDC'])][['Sample', 'x_centroid', 'y_centroid']]
plasma_pos = sample_df[sample_df['ct'] == 'Early B'][['Sample', 'x_centroid', 'y_centroid']]
mye_pc_dist = spatial_utils.nearest_dist_between_two_celltypes(myeloid_pos, plasma_pos, sample_col='Sample', x_col='x_centroid', y_col='y_centroid')
mye_pc_dist.head()


In [None]:
# Use an explicit seed rather than the loop variable `i` left over from the shuffling cell above
sample_df.loc[:, 'shuffled_ct'] = sample_df['ct'].sample(frac=1, random_state = 0).values
myeloid_pos_shuffled = sample_df[sample_df['shuffled_ct'].isin(['GMP', 'Late Myeloid', 'Neutrophil', 'Monocyte', 'Macrophage','cDC', 'pDC'])][['Sample', 'x_centroid', 'y_centroid']]
plasma_pos_shuffled = sample_df[sample_df['shuffled_ct'] == 'Early B'][['Sample', 'x_centroid', 'y_centroid']]
dist_shuffled = spatial_utils.nearest_dist_between_two_celltypes(myeloid_pos_shuffled, plasma_pos_shuffled, sample_col='Sample', x_col='x_centroid', y_col='y_centroid')

In [None]:
get_most_likely_dist(mye_pc_dist, 'nearest_dist_to_df2', dist_eval)

In [None]:
get_most_likely_dist(dist_shuffled, 'nearest_dist_to_df2', dist_eval)

In [None]:
sns.histplot(mye_pc_dist, x = 'nearest_dist_to_df2', binwidth = 1, stat = 'density', common_norm = False, alpha = .5)
sns.histplot(dist_shuffled, x = 'nearest_dist_to_df2', binwidth = 1, stat = 'density', common_norm = False, alpha = .5)
plt.xlim(0, 100)

In [None]:
shuffled_dists_df['Collection'] = shuffled_dists_df['Sample'].map(metadata['Collection'])
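# NOTE: no earlier cell creates an 'Outlier' column, so filtering on it directly raises a KeyError.
# Show the flagged samples only if that column has been added; otherwise show the full table.
shuffled_dists_df[shuffled_dists_df['Outlier'] == True] if 'Outlier' in shuffled_dists_df.columns else shuffled_dists_df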

#### Nearest distance between APRIL+ myeloid and APRIL receptor+ PC/B cells

In [None]:
# The APRIL/BAFF positivity flags were added to ird_v6.obs (copied into ird_cells_info above), not to ird_xenium_merge.obs.
# NOTE: 'annot' is assumed to be a coarser annotation column present in obs; the cells above use 'ct' instead.
myeloid_mask = ird_cells_info['annot'].isin(['Granulo.', 'Mc/Mp', 'cDC', 'pDC'])
ird_april_myeloid = ird_cells_info[myeloid_mask & ird_cells_info['APRILpos']]
ird_april_pc = ird_cells_info[(ird_cells_info['annot'] == 'PC') & ird_cells_info['APRIL_receptor_pos']]
ird_april_b = ird_cells_info[(ird_cells_info['annot'] == 'B') & ird_cells_info['APRIL_receptor_pos']]

ird_baff_myeloid = ird_cells_info[myeloid_mask & ird_cells_info['BAFFpos']]
ird_baff_pc = ird_cells_info[(ird_cells_info['annot'] == 'PC') & ird_cells_info['BAFF_receptor_pos']]
ird_baff_b = ird_cells_info[(ird_cells_info['annot'] == 'B') & ird_cells_info['BAFF_receptor_pos']]

myeloid_april_pos = ird_april_myeloid[['x_centroid', 'y_centroid', 'Sample']]
plasma_april_pos = ird_april_pc[['x_centroid', 'y_centroid', 'Sample']]
bcell_april_pos = ird_april_b[['x_centroid', 'y_centroid', 'Sample']]


In [None]:

# The helper lives in the spatial_utils module imported at the top of the notebook
april_pc_dist = spatial_utils.nearest_dist_between_two_celltypes(myeloid_april_pos, plasma_april_pos, sample_col='Sample', x_col='x_centroid', y_col='y_centroid')
april_b_dist = spatial_utils.nearest_dist_between_two_celltypes(myeloid_april_pos, bcell_april_pos, sample_col='Sample', x_col='x_centroid', y_col='y_centroid')

In [None]:
april_pc_dist['ref_ct'] = 'Plasma cell'
april_b_dist['ref_ct'] = 'B cell'
april_dists = pd.concat([april_pc_dist, april_b_dist], axis = 0)

In [None]:
april_dists.head()

In [None]:
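# Build the sample-to-timepoint map from the full object; `ird_cell_info` only holds NBM/NDMM samples, which would leave PT unmapped
metadata = ird_xenium_merge.obs[['Sample', 'Collection']].drop_duplicates().set_index('Sample')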
april_dists['Collection'] = april_dists['Sample'].map(metadata['Collection'])
april_dists['Collection'] = april_dists['Collection'].astype('str')
april_dists['Condition'] = april_dists['Collection'] + '-' + april_dists['ref_ct']
april_dists.head()

In [None]:
sns.histplot(april_dists[april_dists['Collection']=='NDMM'], x = 'nearest_dist_to_df2', hue = 'Condition', binwidth = 5, stat = 'density', common_norm = False)
plt.xlim(0, 200)

In [None]:
sns.histplot(april_dists[april_dists['Collection']=='NBM'], x = 'nearest_dist_to_df2', hue = 'Condition', binwidth = 5, stat = 'density', common_norm = False)
plt.xlim(0, 200)

In [None]:
sns.histplot(april_dists[april_dists['Collection']=='PT'], x = 'nearest_dist_to_df2', hue = 'Condition', binwidth = 5, stat = 'density', common_norm = False)
plt.xlim(0, 200)

In [None]:
##### Calculate most likely distance using KDE, on a per sample basis
dist_eval = np.arange(0, 501, 1)  # grid of candidate distances for the KDE

res = []
for sample in april_dists['Sample'].unique():
    sample_df = april_dists[april_dists['Sample'] == sample]
    for ref_ct in sample_df['ref_ct'].unique():
        sample_ref_df = sample_df[sample_df['ref_ct'] == ref_ct]
        if len(sample_ref_df) < 10:
            continue  # too few cells for a stable density estimate
        # Mode of the KDE, computed with the helper defined earlier
        most_likely_dist = get_most_likely_dist(sample_ref_df, 'nearest_dist_to_df2', dist_eval)
        res.append({
            'Sample': sample,
            'Reference Cell Type': ref_ct,
            'Most Likely Distance': most_likely_dist,
            'Average Distance': sample_ref_df['nearest_dist_to_df2'].mean(),
            'Median Distance': sample_ref_df['nearest_dist_to_df2'].median()
        })

most_likely_dists_df = pd.DataFrame(res)
most_likely_dists_df['Collection'] = most_likely_dists_df['Sample'].map(metadata['Collection'])
most_likely_dists_df.head()

Plot by comparing distance between cell types within the same timepoint

In [None]:
pdf_name = "/diskmnt/Users2/chouw/Projects/BM_spatial/IRD_JW_Xenium_merge_nearestDist_APRILpairs_boxplot_v10242025.pdf"
with PdfPages(pdf_name) as pdf:
    # One page per distance metric
    for metric in ['Most Likely Distance', 'Average Distance', 'Median Distance']:
        sns.boxplot(most_likely_dists_df, x = 'Collection', y = metric, hue = 'Reference Cell Type')
        sns.swarmplot(most_likely_dists_df, x = 'Collection', y = metric, hue = 'Reference Cell Type', dodge = True, color = 'k', alpha = .5)
        plt.title('Distance from APRIL+ myeloid cells to BCMA/TACI+ B/Plasma cells')
        plt.legend(bbox_to_anchor=(1, 1), loc='upper left')
        plt.tight_layout()
        pdf.savefig(dpi = 300, transparent = True)
        plt.close()

In [None]:
# Stats test between B cells and plasma cell distances in each condition
stats_res = []
dist_metrics = ['Most Likely Distance', 'Average Distance', 'Median Distance']
for condition in most_likely_dists_df['Collection'].unique():
    condition_df = most_likely_dists_df[most_likely_dists_df['Collection'] == condition]

    for ct1, ct2 in combinations(condition_df['Reference Cell Type'].unique(), 2):
        for metric in dist_metrics:
            ct1_dists = condition_df[condition_df['Reference Cell Type'] == ct1][metric].values.tolist()
            ct2_dists = condition_df[condition_df['Reference Cell Type'] == ct2][metric].values.tolist()
            if len(ct1_dists) > 0 and len(ct2_dists) > 0:
                u_stat, p_val = mannwhitneyu(ct1_dists, ct2_dists, alternative='two-sided')
                stats_res.append({
                    'Condition': condition,
                    'Cell Type 1': ct1,
                    'Cell Type 2': ct2,
                    'Distance Metric': metric,
                    'U_statistic': u_stat,
                    'p_value': p_val
                })
stats_res_df = pd.DataFrame(stats_res)
stats_res_df['p_adj'] = multipletests(stats_res_df['p_value'], method='fdr_bh')[1]
stats_res_df.to_csv("/diskmnt/Users2/chouw/Projects/BM_spatial/IRD_JW_Xenium_merge_nearestDist_APRILpairs_stats_v10242025.csv")
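The testing pattern used above (a two-sided Mann-Whitney U test per comparison, followed by Benjamini-Hochberg correction across all comparisons) can be seen in isolation on toy numbers. The three groups below are made up for illustration and do not come from the study.

```python
import numpy as np
from scipy.stats import mannwhitneyu
from statsmodels.stats.multitest import multipletests

# Toy per-sample distances for three groups (made-up values)
group_a = [10, 12, 11, 13, 9]
group_b = [30, 28, 35, 31, 29]   # clearly larger than group_a
group_c = [11, 10, 12, 14, 9]    # similar to group_a

_, p_ab = mannwhitneyu(group_a, group_b, alternative="two-sided")
_, p_ac = mannwhitneyu(group_a, group_c, alternative="two-sided")

# Benjamini-Hochberg adjustment across both comparisons
reject, p_adj, _, _ = multipletests([p_ab, p_ac], method="fdr_bh")
print("raw p-values:", p_ab, p_ac)
print("adjusted p-values:", p_adj, "reject at 0.05:", reject)
```

Adjusted p-values are never smaller than the raw ones, and the adjustment depends on how many tests are corrected together, so pooling all metrics and conditions into one `multipletests` call (as done above) is more conservative than correcting each metric separately.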

Plot by comparing distance between timepoint within the same cell type

In [None]:
pdf_name = "/diskmnt/Users2/chouw/Projects/BM_spatial/IRD_JW_Xenium_merge_nearestDist_APRILpairs_boxplot_byTimepoint_v10242025.pdf"
with PdfPages(pdf_name) as pdf:
    # One page per distance metric
    for metric in ['Most Likely Distance', 'Average Distance', 'Median Distance']:
        sns.boxplot(most_likely_dists_df, x = 'Reference Cell Type', y = metric, hue = 'Collection')
        sns.swarmplot(most_likely_dists_df, x = 'Reference Cell Type', y = metric, hue = 'Collection', dodge = True, color = 'k', alpha = .5)
        plt.title('Distance from APRIL+ myeloid cells to BCMA/TACI+ B/Plasma cells')
        plt.legend(bbox_to_anchor=(1, 1), loc='upper left')
        plt.tight_layout()
        pdf.savefig(dpi = 300, transparent = True)
        plt.close()

In [None]:
# Stats test between timepoints in each reference cell type
from scipy.stats import mannwhitneyu
from itertools import combinations
from statsmodels.stats.multitest import multipletests

stats_res = []
dist_metrics = ['Most Likely Distance', 'Average Distance', 'Median Distance']
for ct in most_likely_dists_df['Reference Cell Type'].unique():
    conditions = most_likely_dists_df['Collection'].unique()
    ct_df = most_likely_dists_df[most_likely_dists_df['Reference Cell Type'] == ct]

    for cond1, cond2 in combinations(conditions, 2):
        for metric in dist_metrics:
            cond1_dists = ct_df[ct_df['Collection'] == cond1][metric].values.tolist()
            cond2_dists = ct_df[ct_df['Collection'] == cond2][metric].values.tolist()
            if len(cond1_dists) > 0 and len(cond2_dists) > 0:
                u_stat, p_val = mannwhitneyu(cond1_dists, cond2_dists, alternative='two-sided')
                stats_res.append({'Condition 1': cond1, 'Condition 2': cond2, 'Reference Cell Type': ct, 'Distance Metric': metric, 'U statistic': u_stat, 'p-value': p_val})
stats_df = pd.DataFrame(stats_res)
stats_df['p_adj'] = multipletests(stats_df['p-value'], method='fdr_bh')[1]
stats_df.to_csv("/diskmnt/Users2/chouw/Projects/BM_spatial/IRD_JW_Xenium_merge_nearestDist_APRILpairs_byTimepoints_stats_v10242025.csv")
stats_df

#### Nearest distance between BAFF+ myeloid and BAFF receptor+ PC/B cells

In [None]:
myeloid_baff_pos = ird_baff_myeloid[['x_centroid', 'y_centroid', 'Sample']]
plasma_baff_pos = ird_baff_pc[['x_centroid', 'y_centroid', 'Sample']]
bcell_baff_pos = ird_baff_b[['x_centroid', 'y_centroid', 'Sample']]
# The helper lives in the spatial_utils module imported at the top of the notebook
baff_pc_dist = spatial_utils.nearest_dist_between_two_celltypes(myeloid_baff_pos, plasma_baff_pos, sample_col='Sample', x_col='x_centroid', y_col='y_centroid')
baff_b_dist = spatial_utils.nearest_dist_between_two_celltypes(myeloid_baff_pos, bcell_baff_pos, sample_col='Sample', x_col='x_centroid', y_col='y_centroid')

In [None]:
baff_pc_dist['ref_ct'] = 'Plasma cell'
baff_b_dist['ref_ct'] = 'B cell'
baff_dists = pd.concat([baff_pc_dist, baff_b_dist], axis = 0)

In [None]:
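# Build the sample-to-timepoint map from the full object; `ird_cell_info` only holds NBM/NDMM samples, which would leave PT unmapped
metadata = ird_xenium_merge.obs[['Sample', 'Collection']].drop_duplicates().set_index('Sample')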
baff_dists['Collection'] = baff_dists['Sample'].map(metadata['Collection'])
baff_dists['Collection'] = baff_dists['Collection'].astype('str')
baff_dists['Condition'] = baff_dists['Collection'] + '-' + baff_dists['ref_ct']
baff_dists.head()

In [None]:
sns.histplot(baff_dists[baff_dists['Collection']=='NDMM'], x = 'nearest_dist_to_df2', hue = 'Condition', binwidth = 5, stat = 'density', common_norm = False)
plt.xlim(0, 200)

In [None]:
sns.histplot(baff_dists[baff_dists['Collection']=='NBM'], x = 'nearest_dist_to_df2', hue = 'Condition', binwidth = 5, stat = 'density', common_norm = False)
plt.xlim(0, 200)

In [None]:
sns.histplot(baff_dists[baff_dists['Collection']=='PT'], x = 'nearest_dist_to_df2', hue = 'Condition', binwidth = 5, stat = 'density', common_norm = False)
plt.xlim(0, 200)

In [None]:
##### Calculate most likely distance using KDE, on a per sample basis
dist_eval = np.arange(0, 501, 1)  # grid of candidate distances for the KDE

res = []
for sample in baff_dists['Sample'].unique():
    sample_df = baff_dists[baff_dists['Sample'] == sample]
    for ref_ct in sample_df['ref_ct'].unique():
        sample_ref_df = sample_df[sample_df['ref_ct'] == ref_ct]
        if len(sample_ref_df) < 10:
            continue  # too few cells for a stable density estimate
        # Mode of the KDE, computed with the helper defined earlier
        most_likely_dist = get_most_likely_dist(sample_ref_df, 'nearest_dist_to_df2', dist_eval)
        res.append({
            'Sample': sample,
            'Reference Cell Type': ref_ct,
            'Most Likely Distance': most_likely_dist,
            'Average Distance': sample_ref_df['nearest_dist_to_df2'].mean(),
            'Median Distance': sample_ref_df['nearest_dist_to_df2'].median()
        })

most_likely_dists_df = pd.DataFrame(res)
most_likely_dists_df['Collection'] = most_likely_dists_df['Sample'].map(metadata['Collection'])
most_likely_dists_df.head()

In [None]:
pdf_name = "/diskmnt/Users2/chouw/Projects/BM_spatial/IRD_JW_Xenium_merge_nearestDist_BAFFpairs_boxplot_byCelltype_v10242025.pdf"
with PdfPages(pdf_name) as pdf:
    fig1 = sns.boxplot(most_likely_dists_df, x = 'Collection', y = 'Most Likely Distance', hue = 'Reference Cell Type')
    sns.swarmplot(most_likely_dists_df, x = 'Collection', y = 'Most Likely Distance', hue = 'Reference Cell Type', dodge = True, color = 'k', alpha = .5)
    plt.title('Distance from BAFF+ myeloid cells to TACI/BAFF-R+ B/Plasma cells')
    plt.legend(bbox_to_anchor=(1, 1), loc='upper left')
    plt.tight_layout()
    pdf.savefig(dpi = 300, transparent = True)
    plt.close()

    fig2 = sns.boxplot(most_likely_dists_df, x = 'Collection', y = 'Average Distance', hue = 'Reference Cell Type')
    sns.swarmplot(most_likely_dists_df, x = 'Collection', y = 'Average Distance', hue = 'Reference Cell Type', dodge = True, color = 'k', alpha = .5)
    plt.title('Distance from BAFF+ myeloid cells to TACI/BAFF-R+ B/Plasma cells')
    plt.legend(bbox_to_anchor=(1, 1), loc='upper left')
    plt.ylim(0, 2000)
    plt.tight_layout()
    pdf.savefig(dpi = 300, transparent = True)
    plt.close()

    fig3 = sns.boxplot(most_likely_dists_df, x = 'Collection', y = 'Median Distance', hue = 'Reference Cell Type')
    sns.swarmplot(most_likely_dists_df, x = 'Collection', y = 'Median Distance', hue = 'Reference Cell Type', dodge = True, color = 'k', alpha = .5)
    plt.title('Distance from BAFF+ myeloid cells to TACI/BAFF-R+ B/Plasma cells')
    plt.legend(bbox_to_anchor=(1, 1), loc='upper left')
    plt.ylim(0, 2000)
    plt.tight_layout()
    pdf.savefig(dpi = 300, transparent = True)
    plt.close()

In [None]:
stats_res = []
dist_metrics = ['Most Likely Distance', 'Average Distance', 'Median Distance']
for condition in most_likely_dists_df['Collection'].unique():
    condition_df = most_likely_dists_df[most_likely_dists_df['Collection'] == condition]

    for ct1, ct2 in combinations(condition_df['Reference Cell Type'].unique(), 2):
        for metric in dist_metrics:
            ct1_dists = condition_df[condition_df['Reference Cell Type'] == ct1][metric].values.tolist()
            ct2_dists = condition_df[condition_df['Reference Cell Type'] == ct2][metric].values.tolist()
            if len(ct1_dists) > 0 and len(ct2_dists) > 0:
                u_stat, p_val = mannwhitneyu(ct1_dists, ct2_dists, alternative='two-sided')
                stats_res.append({
                    'Condition': condition,
                    'Cell Type 1': ct1,
                    'Cell Type 2': ct2,
                    'Distance Metric': metric,
                    'U_statistic': u_stat,
                    'p_value': p_val
                })
stats_res_df = pd.DataFrame(stats_res)
stats_res_df['p_adj'] = multipletests(stats_res_df['p_value'], method='fdr_bh')[1]
stats_res_df.to_csv("/diskmnt/Users2/chouw/Projects/BM_spatial/IRD_JW_Xenium_merge_nearestDist_BAFFpairs_byCelltype_stats_v10242025.csv")
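
The testing pattern used above (pairwise two-sided Mann-Whitney U tests followed by Benjamini-Hochberg correction) can be illustrated on a small toy table. This is only a sketch: the distances and group names are made up, and `bh_adjust` is a hypothetical helper meant to mirror what `multipletests(..., method='fdr_bh')` computes; it is not the statsmodels implementation.

```python
import numpy as np
from itertools import combinations
from scipy.stats import mannwhitneyu

def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values (illustrative helper, not from statsmodels)."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Scale the i-th smallest p-value by m / i
    scaled = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, starting from the largest p-value
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.clip(scaled, 0, 1)
    return adjusted

# Made-up per-sample distances for three hypothetical groups
toy = {
    'group_A': [12.0, 15.0, 11.0, 14.0, 13.0],
    'group_B': [30.0, 28.0, 35.0, 33.0, 31.0],
    'group_C': [13.5, 29.0, 16.0, 27.0, 12.5],
}
pairs = list(combinations(toy, 2))
pvals = [mannwhitneyu(toy[a], toy[b], alternative='two-sided').pvalue for a, b in pairs]
padj = bh_adjust(pvals)
for (a, b), p, q in zip(pairs, pvals, padj):
    print(f"{a} vs {b}: p={p:.4f}, BH-adjusted p={q:.4f}")
```

Note that the correction in the cell above is applied across all conditions, cell type pairs, and distance metrics at once, so the adjusted values depend on how many tests end up in the table.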

In [None]:
pdf_name = "/diskmnt/Users2/chouw/Projects/BM_spatial/IRD_JW_Xenium_merge_nearestDist_BAFFpairs_boxplot_byTimepoint_v10242025.pdf"
with PdfPages(pdf_name) as pdf:
    fig = sns.boxplot(most_likely_dists_df, x = 'Reference Cell Type', y = 'Most Likely Distance', hue = 'Collection')
    sns.swarmplot(most_likely_dists_df, x = 'Reference Cell Type', y = 'Most Likely Distance', hue = 'Collection', dodge = True, color = 'k', alpha = .5)
    plt.title('Distance from BAFF+ myeloid cells to TACI/BAFF-R+ B/Plasma cells')
    plt.legend(bbox_to_anchor=(1, 1), loc='upper left')
    plt.tight_layout()
    pdf.savefig(dpi = 300, transparent = True)
    plt.close()

    fig = sns.boxplot(most_likely_dists_df, x = 'Reference Cell Type', y = 'Average Distance', hue = 'Collection')
    sns.swarmplot(most_likely_dists_df, x = 'Reference Cell Type', y = 'Average Distance', hue = 'Collection', dodge = True, color = 'k', alpha = .5)
    plt.title('Distance from BAFF+ myeloid cells to TACI/BAFF-R+ B/Plasma cells')
    plt.legend(bbox_to_anchor=(1, 1), loc='upper left')
    plt.ylim(0, 2000)
    plt.tight_layout()
    pdf.savefig(dpi = 300, transparent = True)
    plt.close()

    fig = sns.boxplot(most_likely_dists_df, x = 'Reference Cell Type', y = 'Median Distance', hue = 'Collection')
    sns.swarmplot(most_likely_dists_df, x = 'Reference Cell Type', y = 'Median Distance', hue = 'Collection', dodge = True, color = 'k', alpha = .5)
    plt.title('Distance from BAFF+ myeloid cells to TACI/BAFF-R+ B/Plasma cells')
    plt.legend(bbox_to_anchor=(1, 1), loc='upper left')
    plt.ylim(0, 2000)
    plt.tight_layout()
    pdf.savefig(dpi = 300, transparent = True)
    plt.close()

In [None]:
# Stats test between timepoints in each reference cell type
from scipy.stats import mannwhitneyu
from itertools import combinations
from statsmodels.stats.multitest import multipletests

stats_res = []
dist_metrics = ['Most Likely Distance', 'Average Distance', 'Median Distance']
for ct in most_likely_dists_df['Reference Cell Type'].unique():
    conditions = most_likely_dists_df['Collection'].unique()
    ct_df = most_likely_dists_df[most_likely_dists_df['Reference Cell Type'] == ct]

    for cond1, cond2 in combinations(conditions, 2):
        for metric in dist_metrics:
            cond1_dists = ct_df[ct_df['Collection'] == cond1][metric].values.tolist()
            cond2_dists = ct_df[ct_df['Collection'] == cond2][metric].values.tolist()
            if len(cond1_dists) > 0 and len(cond2_dists) > 0:
                u_stat, p_val = mannwhitneyu(cond1_dists, cond2_dists, alternative='two-sided')
                stats_res.append({
                    'Condition 1': cond1,
                    'Condition 2': cond2,
                    'Reference Cell Type': ct,
                    'Distance Metric': metric,
                    'U statistic': u_stat,
                    'p-value': p_val
                })
stats_df = pd.DataFrame(stats_res)
stats_df['p_adj'] = multipletests(stats_df['p-value'], method='fdr_bh')[1]
stats_df.to_csv("/diskmnt/Users2/chouw/Projects/BM_spatial/IRD_JW_Xenium_merge_nearestDist_BAFFpairs_byTimepoint_stats_v10242025.csv")
stats_df

In [None]:
# Count BAFF+ myeloid cells within 200 um of the nearest BAFF receptor+ (TACI/BAFF-R+) B/plasma cell, per sample
proximal_threshold = 200
proximal_counts = res1[res1['nearest_dist_to_df2'] <= proximal_threshold].groupby(['Sample']).size()
proximal_counts = proximal_counts[proximal_counts > 0].to_frame('Counts')
# Normalize by total number of myeloid cells per sample
total_myeloid_counts = myeloid_info['Sample'].value_counts()
proximal_counts = proximal_counts.join(total_myeloid_counts.rename('total_myeloid'))
proximal_counts['frac_proximal'] = proximal_counts['Counts'] / proximal_counts['total_myeloid']
proximal_counts = proximal_counts.join(metadata, how='left')
sns.boxplot(proximal_counts, x = 'Collection', y = 'frac_proximal', showfliers = False)

### APRIL and APRIL receptor expression vs. distance between APRIL+ cells and PC/B cells

#### APRIL expression in myeloid cells with distance from plasma cells

In [None]:
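# .copy() turns the view into a standalone AnnData so that new .obs columns can be added in the next cell
april_pos = ird_v6[(ird_v6.obs['APRILpos'] == True) & (ird_v6.obs['ct'].isin(['GMP', 'Late Myeloid', 'Neutrophil', 'Monocyte', 'Macrophage', 'cDC', 'pDC'])), :].copy()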
#april_neg_pos = ird_v6[ird_v6.obs['APRILpos'] == False, :]
ird_pc = ird_v6[ird_v6.obs['ct'] == 'PC', :]
ird_b = ird_v6[(ird_v6.obs['ct'] == 'Mature B'), :]
april_dist_to_PC = spatial_utils.nearest_dist_between_two_celltypes(april_pos.obs.copy(), ird_pc.obs.copy(), sample_col='Sample', x_col='x_centroid', y_col='y_centroid')
april_dist_to_B = spatial_utils.nearest_dist_between_two_celltypes(april_pos.obs.copy(), ird_b.obs.copy(), sample_col='Sample', x_col='x_centroid', y_col='y_centroid')
#pc_dist_to_aprilneg = nearest_dist_between_two_celltypes(ird_pc.obs.copy(), april_neg_pos, sample_col='Sample', x_col='x_centroid', y_col='y_centroid')
#b_dist_to_aprilneg = nearest_dist_between_two_celltypes(ird_b.obs.copy(), april_neg_pos, sample_col='Sample', x_col='x_centroid', y_col='y_centroid')
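
`spatial_utils.nearest_dist_between_two_celltypes` is defined outside this notebook, so its exact behaviour is not shown here. As a rough sketch of what a per-sample nearest-neighbour distance computation could look like, the hypothetical function `nearest_dist_sketch` below uses `scipy.spatial.cKDTree` on made-up coordinates. It assumes the output should contain one `nearest_dist_to_df2` value per row of the first table; the real helper may differ.

```python
import numpy as np
import pandas as pd
from scipy.spatial import cKDTree

def nearest_dist_sketch(df1, df2, sample_col='Sample', x_col='x_centroid', y_col='y_centroid'):
    """Hypothetical stand-in: distance from each df1 cell to its nearest df2 cell in the same sample."""
    out = df1.copy()
    out['nearest_dist_to_df2'] = np.nan
    for sample, sub1 in df1.groupby(sample_col, observed=True):
        sub2 = df2[df2[sample_col] == sample]
        if sub2.empty:
            continue  # no target cells in this sample: distance stays NaN
        tree = cKDTree(sub2[[x_col, y_col]].to_numpy())
        dist, _ = tree.query(sub1[[x_col, y_col]].to_numpy(), k=1)
        # Assign by index label so the original row order of df1 is preserved
        out.loc[sub1.index, 'nearest_dist_to_df2'] = dist
    return out

# Made-up coordinates for two samples
src = pd.DataFrame({'Sample': ['s1', 's1', 's2'],
                    'x_centroid': [0.0, 10.0, 0.0],
                    'y_centroid': [0.0, 0.0, 0.0]},
                   index=['a', 'b', 'c'])
tgt = pd.DataFrame({'Sample': ['s1', 's2'],
                    'x_centroid': [3.0, 0.0],
                    'y_centroid': [4.0, 5.0]})
res = nearest_dist_sketch(src, tgt)
print(res)
```

Because the sketch aligns results on the index rather than on position, its output can be written back to `.obs` without relying on the row order being unchanged.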

In [None]:
bins = [0, 20, 50, np.inf]
bin_labels = ['0-20', '20-50', '>50']
april_pos.obs['dist_to_nearest_PC'] = april_dist_to_PC['nearest_dist_to_df2'].values
april_pos.obs['dist_to_nearest_B'] = april_dist_to_B['nearest_dist_to_df2'].values
#ird_pc.obs['dist_to_nearest_APRILneg'] = pc_dist_to_aprilneg['nearest_dist_to_df2'].values
#ird_b.obs['dist_to_nearest_APRILneg'] = b_dist_to_aprilneg['nearest_dist_to_df2'].values

april_pos.obs['dist_bin_to_nearest_PC'] = pd.cut(
    april_pos.obs['dist_to_nearest_PC'], bins=bins, labels=bin_labels)
april_pos.obs['dist_bin_to_nearest_B'] = pd.cut(
    april_pos.obs['dist_to_nearest_B'], bins=bins, labels=bin_labels)
#ird_pc.obs['dist_bin_to_nearest_APRILneg'] = pd.cut(ird_pc.obs['dist_to_nearest_APRILneg'], 
#                                                    bins=[0, 50, 100, 150, 200, np.inf], 
#                                                    labels = ['0-50', '50-100', '100-150', '150-200', '>200'])
#ird_b.obs['dist_bin_to_nearest_APRILneg'] = pd.cut(ird_b.obs['dist_to_nearest_APRILneg'], 
#                                                    bins=[0, 50, 100, 150, 200, np.inf], 
#                                                    labels = ['0-50', '50-100', '100-150', '150-200', '>200'])
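
`pd.cut` uses right-closed intervals by default, so with `bins = [0, 20, 50, np.inf]` the first bin is `(0, 20]` and a distance of exactly 0 is left unbinned (`NaN`). The small illustration below, on made-up distances, shows the default behaviour next to `include_lowest=True`. Whether zero distances actually occur in this dataset is not something the notebook shows.

```python
import numpy as np
import pandas as pd

bins = [0, 20, 50, np.inf]
bin_labels = ['0-20', '20-50', '>50']
toy_dist = pd.Series([0.0, 10.0, 20.0, 35.0, 80.0])  # made-up distances

# Default: intervals are (0, 20], (20, 50], (50, inf)
default_bins = pd.cut(toy_dist, bins=bins, labels=bin_labels)
# include_lowest=True closes the first interval on the left: [0, 20]
closed_bins = pd.cut(toy_dist, bins=bins, labels=bin_labels, include_lowest=True)

print(pd.DataFrame({'dist': toy_dist, 'default': default_bins, 'include_lowest': closed_bins}))
```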

In [None]:
# For each sample, calculate the mean expression of APRIL in each distance bin
april_pos_pc_avg = april_pos.obs.groupby(['Sample', 'dist_bin_to_nearest_PC'])['APRIL_exp'].mean().reset_index()
april_pos_b_avg = april_pos.obs.groupby(['Sample', 'dist_bin_to_nearest_B'])['APRIL_exp'].mean().reset_index()
metadata = ird_cells_info[['Sample', 'Collection']].drop_duplicates().set_index('Sample')
april_pos_pc_avg['Collection'] = april_pos_pc_avg['Sample'].map(metadata['Collection'])
april_pos_b_avg['Collection'] = april_pos_b_avg['Sample'].map(metadata['Collection'])
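
Grouping by a categorical column such as the distance bin behaves differently depending on the `observed` argument, and the default has changed across pandas versions. The toy example below (made-up values) shows the difference: with `observed=False` every bin appears for every sample and empty bins get a `NaN` mean, while `observed=True` keeps only the combinations that occur in the data.

```python
import pandas as pd

toy = pd.DataFrame({
    'Sample': ['s1', 's1', 's2'],
    'dist_bin': pd.Categorical(['0-20', '20-50', '0-20'],
                               categories=['0-20', '20-50', '>50']),
    'APRIL_exp': [1.0, 3.0, 2.0],  # made-up expression values
})

# All Sample x bin combinations, including empty ones (NaN mean)
all_bins = toy.groupby(['Sample', 'dist_bin'], observed=False)['APRIL_exp'].mean().reset_index()
# Only combinations present in the data
seen_bins = toy.groupby(['Sample', 'dist_bin'], observed=True)['APRIL_exp'].mean().reset_index()

print(len(all_bins), len(seen_bins))
```

Passing `observed` explicitly keeps the result the same regardless of the installed pandas version.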


In [None]:
sns.boxplot(april_pos_pc_avg, x = 'Collection', y = 'APRIL_exp', hue = 'dist_bin_to_nearest_PC')
sns.swarmplot(april_pos_pc_avg, x = 'Collection', y = 'APRIL_exp', hue = 'dist_bin_to_nearest_PC', dodge = True, color = 'k', alpha = .5)

In [None]:
sns.boxplot(april_pos_b_avg, x = 'Collection', y = 'APRIL_exp', hue = 'dist_bin_to_nearest_B')
sns.swarmplot(april_pos_b_avg, x = 'Collection', y = 'APRIL_exp', hue = 'dist_bin_to_nearest_B', dodge = True, color = 'k', alpha = .5)

#### BCMA expression in plasma cells with distance from APRIL+ cells

In [None]:
# Copy per-cell metadata for PCs and B cells and attach APRIL receptor expression
# (BCMA = TNFRSF17, TACI = TNFRSF13B). Cells without an assigned distance bin are
# filtered out later, just before the per-bin averages are computed.
ird_pc_df = ird_pc.obs.copy()
ird_b_df = ird_b.obs.copy()
ird_pc_aprilR_expr = ird_pc[:, ['TNFRSF17', 'TNFRSF13B']].to_df()
ird_b_aprilR_expr = ird_b[:, ['TNFRSF17', 'TNFRSF13B']].to_df()
ird_pc_df['BCMA_expr'] = ird_pc_aprilR_expr['TNFRSF17'].values
ird_pc_df['TACI_expr'] = ird_pc_aprilR_expr['TNFRSF13B'].values
ird_b_df['BCMA_expr'] = ird_b_aprilR_expr['TNFRSF17'].values
ird_b_df['TACI_expr'] = ird_b_aprilR_expr['TNFRSF13B'].values
#sns.violinplot(ird_pc_df, x = 'Collection', y = 'BCMA_expr', hue = 'dist_bin_to_nearest_APRILpos', inner = 'quartile')


In [None]:
ird_pc_df.head()

In [None]:
# Calculate number of cells in each distance bin per sample
metadata = ird_cells_info[['Sample', 'Collection']].drop_duplicates().set_index('Sample')
pc_bin_counts = ird_pc_df.groupby(['Sample', 'dist_bin_to_nearest_APRILpos']).size().unstack(fill_value=0).reset_index()
pc_bin_counts['Collection'] = pc_bin_counts['Sample'].map(metadata['Collection'])
pc_bin_counts = pc_bin_counts.melt(id_vars=['Sample', 'Collection'], var_name='Distance Bin', value_name='Cell Count')
pc_bin_counts

In [None]:
# TODO: plot the percentage of cells instead of raw counts? Restrict to NDMM samples with low PC%?
sns.boxplot(pc_bin_counts, x = 'Collection', y = 'Cell Count', hue = 'Distance Bin')
plt.ylim(0, 800)
plt.title('Number of plasma cells in each distance bin to nearest APRIL+ cell')

In [None]:
# Calculate average BCMA & TACI expression within each distance bin, for each sample
ird_pc_df = ird_pc_df[ird_pc_df['dist_bin_to_nearest_APRILpos'].cat.codes != -1].copy()
pc_bcma_avg = ird_pc_df.groupby(['Sample', 'dist_bin_to_nearest_APRILpos'])['BCMA_expr'].mean().reset_index()
pc_taci_avg = ird_pc_df.groupby(['Sample', 'dist_bin_to_nearest_APRILpos'])['TACI_expr'].mean().reset_index()

metadata = ird_cells_info[['Sample', 'Collection']].drop_duplicates().set_index('Sample')
pc_bcma_avg['Collection'] = pc_bcma_avg['Sample'].map(metadata['Collection'])
pc_taci_avg['Collection'] = pc_taci_avg['Sample'].map(metadata['Collection'])
pc_bcma_avg

In [None]:
plt.figure(figsize = (8, 5))
sns.boxplot(pc_bcma_avg[pc_bcma_avg['dist_bin_to_nearest_APRILpos']!='>200'], x = 'Collection', y = 'BCMA_expr', hue = 'dist_bin_to_nearest_APRILpos', showfliers = False)
sns.swarmplot(pc_bcma_avg[pc_bcma_avg['dist_bin_to_nearest_APRILpos']!='>200'], x = 'Collection', y = 'BCMA_expr', hue = 'dist_bin_to_nearest_APRILpos', dodge = True, color = 'k', alpha = .5)
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.title('Average BCMA expression in Plasma cells by distance to nearest APRIL+ cell')

In [None]:
plt.figure(figsize = (8, 5))
sns.boxplot(pc_taci_avg[pc_taci_avg['dist_bin_to_nearest_APRILpos']!='>200'], x = 'Collection', y = 'TACI_expr', hue = 'dist_bin_to_nearest_APRILpos', showfliers = False)
sns.swarmplot(pc_taci_avg[pc_taci_avg['dist_bin_to_nearest_APRILpos']!='>200'], x = 'Collection', y = 'TACI_expr', hue = 'dist_bin_to_nearest_APRILpos', dodge = True, color = 'k', alpha = .5)
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.title('Average TACI expression in Plasma cells by distance to nearest APRIL+ cell')

In [None]:
b_bin_counts = ird_b_df.groupby(['Sample', 'dist_bin_to_nearest_APRILpos']).size().unstack(fill_value=0).reset_index()
b_bin_counts['Collection'] = b_bin_counts['Sample'].map(metadata['Collection'])
b_bin_counts = b_bin_counts.melt(id_vars=['Sample', 'Collection'], var_name='Distance Bin', value_name='Cell Count')

In [None]:
sns.boxplot(b_bin_counts, x = 'Collection', y = 'Cell Count', hue = 'Distance Bin')
plt.ylim(0, 1200)
plt.title('Number of B cells in each distance bin to nearest APRIL+ cell')

In [None]:
ird_b_df = ird_b_df[ird_b_df['dist_bin_to_nearest_APRILpos'].cat.codes != -1].copy()
b_bcma_avg = ird_b_df.groupby(['Sample', 'dist_bin_to_nearest_APRILpos'])['BCMA_expr'].mean().reset_index()
b_taci_avg = ird_b_df.groupby(['Sample', 'dist_bin_to_nearest_APRILpos'])['TACI_expr'].mean().reset_index()

metadata = ird_cells_info[['Sample', 'Collection']].drop_duplicates().set_index('Sample')
b_bcma_avg['Collection'] = b_bcma_avg['Sample'].map(metadata['Collection'])
b_taci_avg['Collection'] = b_taci_avg['Sample'].map(metadata['Collection'])


In [None]:
plt.figure(figsize = (8, 5))
sns.boxplot(b_bcma_avg, x = 'Collection', y = 'BCMA_expr', hue = 'dist_bin_to_nearest_APRILpos', showfliers = False)
sns.swarmplot(b_bcma_avg, x = 'Collection', y = 'BCMA_expr', hue = 'dist_bin_to_nearest_APRILpos', dodge = True, color = 'k', alpha = .5)
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.title('Average BCMA expression in B cells by distance to nearest APRIL+ cell')

In [None]:
plt.figure(figsize = (8, 5))
sns.boxplot(b_taci_avg, x = 'Collection', y = 'TACI_expr', hue = 'dist_bin_to_nearest_APRILpos', showfliers = False)
sns.swarmplot(b_taci_avg, x = 'Collection', y = 'TACI_expr', hue = 'dist_bin_to_nearest_APRILpos', dodge = True, color = 'k', alpha = .5)
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.title('Average TACI expression in B cells by distance to nearest APRIL+ cell')

### Exploring a ligand-receptor score based on ligand expression, receptor expression, and distance
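
As implemented in `compute_pairwise_LR_scores`, the score for a ligand-expressing cell $i$ and a receptor-expressing cell $j$ multiplies the two expression values by an exponential distance decay. The symbols below ($s_{ij}$, $L_i$, $R_j$, $d_{ij}$, $\lambda$, $D$, $N_a$, $N_b$) are notation introduced here for illustration; $\lambda$ corresponds to `dist_lambda` and $D$ to `distance_threshold`.

$$
s_{ij} = L_i \, R_j \, \exp\!\left(-\frac{d_{ij}}{\lambda}\right), \qquad d_{ij} \le D
$$

For a ligand cell type $a$ and a receptor cell type $b$, the normalized score divides the summed pair scores by the number of possible cell pairs, where $N_a$ and $N_b$ count all cells of each type in the sample:

$$
\mathrm{norm\_score}_{ab} = \frac{\sum_{i \in a,\; j \in b} s_{ij}}{N_a \, N_b}
$$

With the settings used in the calls in this section ($\lambda = 100$, $D = 200$), a pair at the distance threshold still keeps a weight of $e^{-2} \approx 0.135$.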

In [None]:
from scipy.spatial import KDTree
from scipy.sparse import issparse

def compute_pairwise_LR_scores(adata, ligand_gene, receptor_gene,
                               distance_threshold=200, dist_lambda = 50,
                               expr_layer = None, celltype_col='annot', x_col='x_centroid', y_col='y_centroid'):
    """
    Score all ligand+/receptor+ cell pairs within `distance_threshold` for a single sample.

    For ligand cell i and receptor cell j separated by distance d_ij:
        score_ij = L_i * R_j * exp(-d_ij / dist_lambda)
    Cell indices in the output are positional indices into `adata`. A cell that
    expresses both genes is also paired with itself (distance 0).

    Returns
    -------
    pairwise_scores_df : one row per ligand-receptor cell pair
    agg : scores aggregated per (ligand cell type, receptor cell type) pair
    Both are empty DataFrames when no pair is found.
    """
    def _expr(gene):
        # Expression may be stored sparse or dense, in .X or in a layer
        mat = adata[:, gene].X if expr_layer is None else adata[:, gene].layers[expr_layer]
        return (mat.toarray() if issparse(mat) else np.asarray(mat)).ravel()

    # 1) Get coordinates and expression
    coords = adata.obs[[x_col, y_col]].to_numpy()
    L_exp = _expr(ligand_gene)
    R_exp = _expr(receptor_gene)

    # Find cells expressing ligand and receptor
    ligand_idx = np.where(L_exp > 0)[0]
    receptor_idx = np.where(R_exp > 0)[0]

    # If no ligand or receptor expressing cells, return empty dataframes
    if len(ligand_idx) == 0 or len(receptor_idx) == 0:
        print(f"No ligand or receptor expressing cells for {ligand_gene}-{receptor_gene} in this sample.")
        return pd.DataFrame(), pd.DataFrame()

    # 2) For each ligand cell, find all receptor cells within the distance threshold
    tree = KDTree(coords[receptor_idx])
    results = tree.query_ball_point(coords[ligand_idx], r=distance_threshold)

    # If no pair falls within the threshold, return empty dataframes
    n_neighbors = np.array([len(nbrs) for nbrs in results])
    if n_neighbors.sum() == 0:
        print(f"No {ligand_gene}-{receptor_gene} cell pairs within {distance_threshold} in this sample.")
        return pd.DataFrame(), pd.DataFrame()

    # 3) Build the (ligand, receptor) index pairs and score them in one vectorized step
    lig = np.repeat(ligand_idx, n_neighbors)
    rec = receptor_idx[np.concatenate([np.asarray(nbrs, dtype=int) for nbrs in results])]
    dists = np.linalg.norm(coords[lig] - coords[rec], axis=1)
    celltypes = adata.obs[celltype_col].to_numpy()
    pairwise_scores_df = pd.DataFrame({
        'Ligand Cell Index': lig,
        'Receptor Cell Index': rec,
        'Ligand Cell Type': celltypes[lig],
        'Receptor Cell Type': celltypes[rec],
        'Distance': dists,
        'Ligand Expression': L_exp[lig],
        'Receptor Expression': R_exp[rec],
        'LR Score': L_exp[lig] * R_exp[rec] * np.exp(-dists / dist_lambda)
    })

    # Aggregation per cell type pair
    agg = pairwise_scores_df.groupby(['Ligand Cell Type','Receptor Cell Type']).agg(
        score_sum=('LR Score','sum'),
        score_mean=('LR Score','mean'),
        pair_count=('LR Score','count')
    ).reset_index()

    # Normalize by the number of possible pairs (N_i * N_j), counting all cells of each type
    counts = adata.obs[celltype_col].value_counts().to_dict()
    agg['N_ligandCt'] = agg['Ligand Cell Type'].map(counts)
    agg['N_ReceptorCt'] = agg['Receptor Cell Type'].map(counts)
    agg['norm_score'] = agg['score_sum'] / (agg['N_ligandCt'] * agg['N_ReceptorCt'])

    return pairwise_scores_df, agg
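

To see how the radius search and the distance kernel interact, the following toy example scores one ligand cell against three receptor cells. All coordinates and expression values are made up, and the variable names are illustrative; it is a sketch of the idea, not a replacement for the function above.

```python
import numpy as np
from scipy.spatial import KDTree

# Made-up coordinates (same units as the centroids) and expression values
ligand_xy = np.array([[0.0, 0.0]])
receptor_xy = np.array([[0.0, 0.0], [30.0, 40.0], [300.0, 0.0]])
L_expr = np.array([2.0])
R_expr = np.array([1.0, 3.0, 5.0])
distance_threshold, dist_lambda = 200, 50

# Radius search: receptor cells within the threshold of each ligand cell
tree = KDTree(receptor_xy)
neighbors = tree.query_ball_point(ligand_xy, r=distance_threshold)

scores = {}
for i, nbrs in enumerate(neighbors):
    for j in nbrs:
        d = np.linalg.norm(ligand_xy[i] - receptor_xy[j])
        # Expression product, down-weighted exponentially with distance
        scores[(i, j)] = L_expr[i] * R_expr[j] * np.exp(-d / dist_lambda)

print(scores)
```

The receptor cell at distance 300 is excluded by the threshold even though it has the highest expression, and the cell at distance 50 keeps a factor of `exp(-1)` of its expression product.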



In [None]:
def compute_LR_scores(adata, ligand_gene, receptor_gene,
                      distance_threshold=200, dist_lambda = 50,
                      expr_layer = None, celltype_col='annot', sample_col='Sample',
                      x_col='x_centroid', y_col='y_centroid'):
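    """
    Run compute_pairwise_LR_scores on each sample in `adata` and concatenate the results.

    Returns (all_scores_df, all_agg_df), each with an added 'Sample' column.
    """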
    all_scores = []
    all_agg = []
    for sample in adata.obs[sample_col].unique():
        print(f"Processing sample: {sample}")
        sample_adata = adata[adata.obs[sample_col] == sample]
        sample_scores, agg = compute_pairwise_LR_scores(sample_adata, ligand_gene, receptor_gene,
                                                   distance_threshold, dist_lambda,
                                                   expr_layer, celltype_col, x_col, y_col)
        sample_scores['Sample'] = sample
        agg['Sample'] = sample
        all_agg.append(agg)
        all_scores.append(sample_scores)

    all_scores_df = pd.concat(all_scores, axis=0)
    all_agg_df = pd.concat(all_agg, axis=0)
    
    return all_scores_df, all_agg_df

#### Compute all ligand-receptor cell pairs and their scores

In [None]:
# Filter out cells with unassigned radial neighborhoods
ird_xenium_merge_filtered = ird_xenium_merge[ird_xenium_merge.obs['radial_neighborhood'] != 'Unassigned', :]

In [None]:
ligand_gene = 'TNFSF13'  # APRIL
receptor_gene = 'TNFRSF17'  # BCMA
april_bcma_scores_allsamples, agg_allsamples = compute_LR_scores(ird_xenium_merge_filtered, ligand_gene, receptor_gene,
                                                                 distance_threshold=200, dist_lambda = 100,
                                                                 expr_layer = None, celltype_col='annot', sample_col = 'Sample', x_col='x_centroid', y_col='y_centroid')

In [None]:
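# APRIL (TNFSF13) -> TACI (TNFRSF13B)
# NOTE: this call runs on the unfiltered ird_xenium_merge, whereas the APRIL-BCMA call uses ird_xenium_merge_filtered
april_taci_scores_allsamples, agg_at_allsamples = compute_LR_scores(ird_xenium_merge, 'TNFSF13', 'TNFRSF13B',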
                                                                 distance_threshold=200, dist_lambda = 100,
                                                                 expr_layer = None, celltype_col='annot', sample_col = 'Sample', x_col='x_centroid', y_col='y_centroid')

In [None]:
april_bcma_scores_allsamples.head()

#### LR score per receptor cell - average or sum

In [None]:
# For each receptor cell (unique Receptor Cell Index within a Sample), compute the mean and the sum of the LR Score over all paired ligand cells
receptor_avg_scores = april_bcma_scores_allsamples.groupby(['Receptor Cell Index', 'Sample']).agg(
    avg_LR_score=('LR Score', 'mean'),
    sum_LR_score=('LR Score', 'sum'),
    Receptor_Cell_Type=('Receptor Cell Type', 'first')
).reset_index()
metadata = ird_xenium_merge.obs[['Sample', 'Collection']].drop_duplicates().set_index('Sample')
receptor_avg_scores['Collection'] = receptor_avg_scores['Sample'].map(metadata['Collection'])
receptor_avg_scores.head()
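
The two-step aggregation (first per receptor cell, then per sample) weights every receptor cell equally, which is not the same as pooling all pairs. The toy table below uses made-up scores to show the difference. Because `Receptor Cell Index` is a positional index within each sample, `Sample` has to be part of the grouping key.

```python
import pandas as pd

# Made-up pair table: receptor cell 0 has three ligand partners, cell 1 has one
pairs = pd.DataFrame({
    'Sample': ['s1'] * 4,
    'Receptor Cell Index': [0, 0, 0, 1],
    'Receptor Cell Type': ['PC', 'PC', 'PC', 'PC'],
    'LR Score': [1.0, 2.0, 3.0, 10.0],
})

# Step 1: one row per receptor cell
per_cell = pairs.groupby(['Receptor Cell Index', 'Sample']).agg(
    avg_LR_score=('LR Score', 'mean'),
    sum_LR_score=('LR Score', 'sum'),
    Receptor_Cell_Type=('Receptor Cell Type', 'first'),
).reset_index()

# Step 2: average the per-cell values within each sample and cell type
per_sample = (per_cell
              .groupby(['Sample', 'Receptor_Cell_Type'])[['avg_LR_score', 'sum_LR_score']]
              .mean()
              .reset_index())

# For comparison: pooling all pairs gives more weight to cells with many partners
pooled_mean = pairs['LR Score'].mean()
print(per_sample)
print(pooled_mean)
```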

In [None]:
receptor_avg_scores_per_sample = receptor_avg_scores.groupby(['Sample', 'Receptor_Cell_Type']).agg(
    mean_sum_LR_score=('sum_LR_score', 'mean'),
    mean_avg_LR_score=('avg_LR_score', 'mean'),
    Collection = ('Collection', 'first')
).reset_index()  # flatten the (Sample, Receptor_Cell_Type) index into columns for plotting
plt.figure(figsize=(8,4))
sns.boxplot(receptor_avg_scores_per_sample, x = 'Receptor_Cell_Type', y = 'mean_sum_LR_score', hue = 'Collection', showfliers = False)
plt.title('Average of Sum LR Score per BCMA+ cell')
plt.xlabel('Receptor Cell Type')
plt.ylabel('Average Sum LR Score')
plt.xticks(rotation=45)

In [None]:
plt.figure(figsize=(8,4))
sns.boxplot(receptor_avg_scores_per_sample, x = 'Receptor_Cell_Type', y = 'mean_avg_LR_score', hue = 'Collection', showfliers = False)
plt.title('Average of Mean LR Score per BCMA+ cell')
plt.xlabel('Receptor Cell Type')
plt.ylabel('Average Mean LR Score')
plt.xticks(rotation=45)

#### Number and density of ligand+ cells around receptor+ cells per distance bin (cells per unit bin area)

In [None]:
# Density of ligand+ cells around each receptor cell type within distance bins per sample
binwidth = 20
bin_limit = 200
bins = np.arange(0, bin_limit+1, binwidth)
april_bcma_scores_allsamples['dist_bin'] = pd.cut(april_bcma_scores_allsamples['Distance'], bins=bins, labels = bins[0:-1])
april_bcma_scores_allsamples['Receptor Cell Index'] = april_bcma_scores_allsamples['Receptor Cell Index'].astype(int)
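# dist_bin is categorical: observed=True avoids materializing every Sample x cell x bin combination with zero counts
density_results = april_bcma_scores_allsamples.groupby(['Sample', 'Receptor Cell Index', 'dist_bin'], observed=True).size().reset_index(name='count')
density_results = density_results[density_results['count'] != 0]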
# Normalize by area of annulus
density_results['area'] = np.pi * ((density_results['dist_bin'].astype(float) + binwidth)**2 - (density_results['dist_bin'].astype(float))**2)
density_results['density'] = density_results['count'] / density_results['area']
# Add cell type information
density_results['Sample_Index'] = density_results['Sample'] + '_' + density_results['Receptor Cell Index'].astype(str)
celltype_map = april_bcma_scores_allsamples[['Receptor Cell Index', 'Sample', 'Receptor Cell Type']].drop_duplicates()
celltype_map['Sample_Index'] = celltype_map['Sample'] + '_' + celltype_map['Receptor Cell Index'].astype(int).astype(str)
density_results = density_results.merge(celltype_map[['Sample_Index', 'Receptor Cell Type']], on='Sample_Index', how='left')
density_results.head()
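
The annulus normalization matters because outer bins cover far more area than inner ones. The short sketch below reproduces the area formula from the cell above for the same bin settings and applies it to made-up counts.

```python
import numpy as np

binwidth, bin_limit = 20, 200
inner = np.arange(0, bin_limit, binwidth)  # inner radius of each distance bin
# Area of the ring between inner and inner + binwidth
area = np.pi * ((inner + binwidth) ** 2 - inner ** 2)

# Made-up example: 3 ligand+ cells in the innermost bin and 3 in the outermost bin
counts = np.zeros_like(area)
counts[0] = counts[-1] = 3
density = counts / area

print(area[0], area[-1])
print(density[0] / density[-1])
```

The same count corresponds to a 19-fold higher density in the 0-20 bin than in the 180-200 bin, so count-based and density-based plots can look quite different.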

In [None]:
# Average count of ligand+ cells per distance bin to PC cells per sample
avg_count = density_results[density_results['Receptor Cell Type']=='PC'].groupby(['Sample', 'dist_bin']).agg(
    mean_count = ('count', 'mean')
).reset_index()

metadata = ird_xenium_merge.obs[['Sample', 'Collection']].drop_duplicates().set_index('Sample')
avg_count['Collection'] = avg_count['Sample'].map(metadata['Collection'])
sns.boxplot(avg_count, x = 'dist_bin', y = 'mean_count', hue = 'Collection')
plt.title('Average count of APRIL+ ligand cells around BCMA+ PC cells per distance bin')
plt.xlabel('Distance bin (um)')
plt.ylabel('Average count of ligand+ cells')

In [None]:
ird_cells_info = ird_xenium_merge.obs.copy()
metadata = ird_cells_info[['Sample', 'Collection']].drop_duplicates().set_index('Sample')
metadata

In [None]:
ird_cells_info.head()

In [None]:
# Create a column in ird_cells_info called APRILstat_ct that indicates whether the cells are APRIL+, APRIL_receptor+, both, or neither, and concatenate the cell type
apr = ird_cells_info['APRILpos']
rcp = ird_cells_info['APRIL_receptor_pos']

labels = np.select(
    condlist=[apr & rcp, apr & ~rcp, ~apr & rcp],
    choicelist=['both', 'APRIL+', 'APRILrec+'],
    default='neither'
)

ird_cells_info['APRILstat_ct'] = pd.Series(labels, index=ird_cells_info.index) + '_' + ird_cells_info['annot'].astype(str)

# optional: check
ird_cells_info['APRILstat_ct'].value_counts().head()

In [None]:
sn219_april_annot = ird_cells_info[ird_cells_info['Sample'] == 'SN219R1-Ma1Fd2-1U1'][['Original_Barcode', 'APRILstat_ct']]
sn219_april_annot.columns = ['cell_id', 'group']
sn219_april_annot.to_csv('/diskmnt/Users2/chouw/Projects/BM_spatial/IRD_JW_Xenium_SN219_APRIL_annotation_v10082025.csv', index=False)
s13_april_annot = ird_cells_info[ird_cells_info['Sample'] == 'S13-47674A1U1'][['Original_Barcode', 'APRILstat_ct']]
s13_april_annot.columns = ['cell_id', 'group']
s13_april_annot.to_csv('/diskmnt/Users2/chouw/Projects/BM_spatial/IRD_JW_Xenium_S13-47674_APRIL_annotation_v10082025.csv', index=False)

In [None]:
s18_april_annot = ird_cells_info[ird_cells_info['Sample'] == 'S18-36373A1U1'][['Original_Barcode', 'APRILstat_ct']]
s18_april_annot.columns = ['cell_id', 'group']
s18_april_annot.to_csv('/diskmnt/Users2/chouw/Projects/BM_spatial/IRD_JW_Xenium_S18-36373A1U1_APRIL_annotation_v10082025.csv', index=False)

In [None]:
#avg_count[avg_count['Sample']=='S13-47674A1U1']
avg_count[avg_count['Sample']=='S18-36373A1U1']
#avg_count[avg_count['Sample']=='SN219R1-Ma1Fd2-1U1']

In [None]:
# Average count of ligand+ cells per distance bin to B cells per sample
avg_count = density_results[density_results['Receptor Cell Type']=='B'].groupby(['Sample', 'dist_bin']).agg(
    mean_count = ('count', 'mean')
).reset_index()

avg_count['Collection'] = avg_count['Sample'].map(metadata['Collection'])
sns.boxplot(avg_count, x = 'dist_bin', y = 'mean_count', hue = 'Collection')
plt.title('Average count of APRIL+ ligand cells around BCMA+ B cells per distance bin')
plt.xlabel('Distance bin (um)')
plt.ylabel('Average count of ligand+ cells')

In [None]:
### Check % of APRIL+ cells per sample
april_df = ird_xenium_merge[:, 'TNFSF13'].to_df()
baff_df = ird_xenium_merge[:, 'TNFSF13B'].to_df()
ird_xenium_merge.obs.loc[:, 'APRILpos'] = april_df['TNFSF13'] > 0
ird_xenium_merge.obs.loc[:, 'BAFFpos'] = baff_df['TNFSF13B'] > 0

# Receptors considered here: BCMA (TNFRSF17) and TACI (TNFRSF13B) for APRIL; TACI and BAFF-R (TNFRSF13C) for BAFF
april_receptor_df = ird_xenium_merge[:, ['TNFRSF17', 'TNFRSF13B']].to_df()
baff_receptor_df = ird_xenium_merge[:, ['TNFRSF13B', 'TNFRSF13C']].to_df()
ird_xenium_merge.obs.loc[:, 'APRIL_receptor_pos'] = (april_receptor_df['TNFRSF17'] > 0) | (april_receptor_df['TNFRSF13B'] > 0)
ird_xenium_merge.obs.loc[:, 'BAFF_receptor_pos'] = (baff_receptor_df['TNFRSF13B'] > 0) | (baff_receptor_df['TNFRSF13C'] > 0)

april_pos_frac = ird_xenium_merge.obs.groupby('Sample').agg(
    total_cells = ('APRILpos', 'size'),
    april_pos_cells = ('APRILpos', 'sum'),
    april_receptor_pos_cells = ('APRIL_receptor_pos', 'sum')
).reset_index()
april_pos_frac = april_pos_frac[(april_pos_frac['april_pos_cells'] > 0) & (april_pos_frac['april_receptor_pos_cells'] > 0)]
april_pos_frac['frac_april_pos'] = april_pos_frac['april_pos_cells'] / april_pos_frac['total_cells']
april_pos_frac['frac_april_receptor_pos'] = april_pos_frac['april_receptor_pos_cells'] / april_pos_frac['total_cells']
april_pos_frac['Collection'] = april_pos_frac['Sample'].map(metadata['Collection'])
april_pos_frac.head()

In [None]:
sns.boxplot(april_pos_frac, x = 'Collection', y = 'frac_april_pos')
sns.swarmplot(april_pos_frac, x = 'Collection', y = 'frac_april_pos', color = 'k', alpha = .5)
plt.title('Fraction of APRIL+ cells per sample')
plt.xlabel('Timepoint')
plt.ylabel('Fraction of APRIL+ cells')

In [None]:
# Average density per sample and distance bin for PCs
avg_density = density_results[density_results['Receptor Cell Type'] == 'PC'].groupby(['Sample', 'dist_bin']).agg(
    mean_density = ('density', 'mean')
).reset_index()
# Pivot results for plotting - sample as columns, dist_bin as index
density_mat = avg_density.pivot(index='dist_bin', columns='Sample', values='mean_density')
# Reorder sample columns by their timepoint
metadata = ird_xenium_merge.obs[['Sample', 'Collection']].drop_duplicates()#.set_index('Sample')
sample_order = metadata.sort_values(['Collection', 'Sample'])['Sample'].tolist()  # Collection is ordered categorical already
sample_order = [s for s in sample_order if s in density_mat.columns]  # Keep only samples present in the matrix
density_mat = density_mat.reindex(columns=sample_order)
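
The pivot-and-reorder step can be checked on a toy table. The sketch below assumes that `Collection` is an ordered categorical with the order NBM < NDMM < PT (the notebook states it is ordered but does not show the order); sample names and values are made up.

```python
import pandas as pd

# Made-up metadata; 's4' has no rows in the long table below
meta = pd.DataFrame({
    'Sample': ['s1', 's2', 's3', 's4'],
    'Collection': pd.Categorical(['PT', 'NBM', 'NDMM', 'NBM'],
                                 categories=['NBM', 'NDMM', 'PT'], ordered=True),
})
long_df = pd.DataFrame({
    'dist_bin': [0, 20, 0, 20, 0, 20],
    'Sample': ['s1', 's1', 's2', 's2', 's3', 's3'],
    'mean_density': [0.5, 0.1, 0.4, 0.2, 0.3, 0.3],
})

# Wide matrix: distance bins as rows, samples as columns (alphabetical at first)
mat = long_df.pivot(index='dist_bin', columns='Sample', values='mean_density')

# Sort samples by collection, then by name, and drop samples absent from the matrix
order = meta.sort_values(['Collection', 'Sample'])['Sample'].tolist()
order = [s for s in order if s in mat.columns]
mat = mat.reindex(columns=order)
print(mat)
```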

#### For APRIL+ cells and BCMA+ PCs, the average expression of APRIL and BCMA with distance

In [None]:
expr_summary = april_bcma_scores_allsamples[april_bcma_scores_allsamples['Receptor Cell Type'] == 'PC'].groupby(['Sample', 'dist_bin']).agg(
    mean_L = ('Ligand Expression', 'mean'),
    mean_R = ('Receptor Expression', 'mean')
).reset_index()
lig_mat = expr_summary.pivot(index = 'dist_bin', columns = 'Sample', values = 'mean_L').fillna(0)
rec_mat = expr_summary.pivot(index = 'dist_bin', columns = 'Sample', values = 'mean_R').fillna(0)
# Reorder sample columns by their timepoint
metadata = ird_xenium_merge.obs[['Sample', 'Collection']].drop_duplicates()#.set_index('Sample')
sample_order = metadata.sort_values(['Collection', 'Sample'])['Sample'].tolist()  # Collection is ordered categorical already
sample_order = [s for s in sample_order if s in lig_mat.columns]  # Keep only samples present in the matrix
lig_mat = lig_mat.reindex(columns=sample_order)
sample_order = [s for s in sample_order if s in rec_mat.columns]  # Keep only samples present in the matrix
rec_mat = rec_mat.reindex(columns=sample_order)

In [None]:
metadata = ird_xenium_merge.obs[['Sample', 'Collection']].drop_duplicates().set_index('Sample')
expr_summary['Collection'] = expr_summary['Sample'].map(metadata['Collection'])
sns.boxplot(expr_summary, x = 'Collection', y = 'mean_L', hue = 'dist_bin')
plt.legend(title = 'Distance Bins', bbox_to_anchor=(1, 1), loc='upper left')
plt.title('Mean APRIL expression in APRIL+ cells around BCMA+ PC per distance bin')
plt.xlabel('Timepoint')
plt.ylabel('Mean APRIL expression')

In [None]:
sns.boxplot(expr_summary, x = 'Collection', y = 'mean_R', hue = 'dist_bin')
plt.legend(title = 'Distance Bins', bbox_to_anchor=(1, 1), loc='upper left')
plt.title('Mean BCMA expression in BCMA+ PCs around APRIL+ cells per distance bin')
plt.xlabel('Timepoint')
plt.ylabel('Mean BCMA expression')

#### For APRIL+ cells and BCMA+ B cells, the average expression of APRIL and BCMA with distance

In [None]:
expr_summary_B = april_bcma_scores_allsamples[april_bcma_scores_allsamples['Receptor Cell Type'] == 'B'].groupby(['Sample', 'dist_bin']).agg(
    mean_L = ('Ligand Expression', 'mean'),
    mean_R = ('Receptor Expression', 'mean')
).reset_index()
lig_mat_B = expr_summary_B.pivot(index = 'dist_bin', columns = 'Sample', values = 'mean_L').fillna(0)
rec_mat_B = expr_summary_B.pivot(index = 'dist_bin', columns = 'Sample', values = 'mean_R').fillna(0)
# Reorder sample columns by their timepoint
metadata = ird_xenium_merge.obs[['Sample', 'Collection']].drop_duplicates()#.set_index('Sample')
sample_order = metadata.sort_values(['Collection', 'Sample'])['Sample'].tolist()  # Collection is ordered categorical already
sample_order = [s for s in sample_order if s in lig_mat_B.columns]  # Keep only samples present in the matrix
lig_mat_B = lig_mat_B.reindex(columns=sample_order)
sample_order = [s for s in sample_order if s in rec_mat_B.columns]  # Keep only samples present in the matrix
rec_mat_B = rec_mat_B.reindex(columns=sample_order)

In [None]:
metadata = ird_xenium_merge.obs[['Sample', 'Collection']].drop_duplicates().set_index('Sample')
expr_summary_B['Collection'] = expr_summary_B['Sample'].map(metadata['Collection'])
sns.boxplot(expr_summary_B, x = 'Collection', y = 'mean_L', hue = 'dist_bin')
plt.legend(title = 'Distance Bins', bbox_to_anchor=(1, 1), loc='upper left')
plt.title('Mean APRIL expression in APRIL+ cells around BCMA+ B cells per distance bin')
plt.xlabel('Timepoint')
plt.ylabel('Mean APRIL expression')

In [None]:
sns.boxplot(expr_summary_B, x = 'Collection', y = 'mean_R', hue = 'dist_bin')
plt.legend(title = 'Distance Bins', bbox_to_anchor=(1, 1), loc='upper left')
plt.title('Mean BCMA expression in BCMA+ B cells around APRIL+ cells per distance bin')
plt.xlabel('Timepoint')
plt.ylabel('Mean BCMA expression')

#### Visualization

In [None]:
# Create concentric radial heatmap plots
theta_vals = np.linspace(0, 2*np.pi, len(lig_mat.columns), endpoint=False)

fig, ax = plt.subplots(subplot_kw=dict(projection='polar'), figsize=(7,7))

# Parameters
n_dist = lig_mat.shape[0]
n_samples = lig_mat.shape[1]
theta_edges = np.linspace(0, 2*np.pi, n_samples + 1)

# Define separate radial scales for inner (ligand) and outer (receptor)
r_max = 100#bin_limit  # Maximum distance for receptor ring

r_edges_lig = np.linspace(0, r_max, n_dist + 1)
mesh_L = ax.pcolormesh(theta_edges, r_edges_lig, lig_mat.values,
                       cmap='Blues', shading='auto')#, vmin=0, vmax=1)

# Color code the collections
collections = metadata.loc[lig_mat.columns, 'Collection']
cond_groups = collections.groupby(collections).indices

for cond, idx in cond_groups.items():
    start = theta_edges[min(idx)]
    end = theta_edges[max(idx)+1]
    ax.plot([start, start], [0, r_max], color='red', lw=3)
    ax.plot([end, end], [0, r_max], color='red', lw=3)
plt.title('APRIL expression of APRIL+ cells around BCMA+ Plasma cells')

In [None]:
fig, ax = plt.subplots(subplot_kw=dict(projection='polar'), figsize=(7,7))
r_edges_lig = np.linspace(0, r_max, n_dist + 1)
mesh_L = ax.pcolormesh(theta_edges, r_edges_lig, rec_mat.values,
                       cmap='Blues', shading='auto')#, vmin=0, vmax=1)

for cond, idx in cond_groups.items():
    start = theta_edges[min(idx)]
    end = theta_edges[max(idx)+1]
    ax.plot([start, start], [0, r_max], color='red', lw=3)
    ax.plot([end, end], [0, r_max], color='red', lw=3)
plt.title('BCMA expression of BCMA+ Plasma cells around APRIL+ cells')

In [None]:
fig, ax = plt.subplots(subplot_kw=dict(projection='polar'), figsize=(7,7))
# Define separate radial scales for inner (ligand) and outer (receptor)
r_inner_max = 100   # inner ring radius
r_outer_min = 110   # where outer ring starts
r_outer_max = 200   # outer ring limit

# Ligand ring: map distance bins [0, max] → [0, r_inner_max]
r_edges_lig = np.linspace(0, r_inner_max, n_dist + 1)
mesh_L = ax.pcolormesh(theta_edges, r_edges_lig, lig_mat.values,
                       cmap='Blues', shading='auto')#, vmin=0, vmax=1)
# Receptor ring: map distance bins [0, max] → [r_outer_min, r_outer_max]
r_edges_rec = np.linspace(r_outer_min, r_outer_max, n_dist + 1)
mesh_R = ax.pcolormesh(theta_edges, r_edges_rec, rec_mat.values,
                       cmap='Oranges', shading='auto')#, vmin=0, vmax=1

for cond, idx in cond_groups.items():
    start = theta_edges[min(idx)]
    end = theta_edges[max(idx)+1]
    ax.plot([start, start], [0, r_outer_max], color='red', lw=3)
    ax.plot([end, end], [0, r_outer_max], color='red', lw=3)
plt.title('APRIL (inner ring) and BCMA (outer ring) expression around Plasma cells')

In [None]:
theta_vals = np.linspace(0, 2*np.pi, len(lig_mat_B.columns), endpoint=False)

fig, ax = plt.subplots(subplot_kw=dict(projection='polar'), figsize=(7,7))

# Parameters
n_dist = lig_mat_B.shape[0]
n_samples = lig_mat_B.shape[1]
theta_edges = np.linspace(0, 2*np.pi, n_samples + 1)

# Define separate radial scales for inner (ligand) and outer (receptor)
r_inner_max = 100   # inner ring radius
r_outer_min = 110   # where outer ring starts
r_outer_max = 200   # outer ring limit

# Ligand ring: map distance bins [0, max] → [0, r_inner_max]
r_edges_lig = np.linspace(0, r_inner_max, n_dist + 1)
mesh_L = ax.pcolormesh(theta_edges, r_edges_lig, lig_mat_B.values,
                       cmap='Blues', shading='auto')#, vmin=0, vmax=1)
# Receptor ring: map distance bins [0, max] → [r_outer_min, r_outer_max]
r_edges_rec = np.linspace(r_outer_min, r_outer_max, n_dist + 1)
mesh_R = ax.pcolormesh(theta_edges, r_edges_rec, rec_mat_B.values,
                       cmap='Oranges', shading='auto')#, vmin=0, vmax=1

# Color code the collections
collections = metadata.loc[lig_mat_B.columns, 'Collection']
cond_groups = collections.groupby(collections).indices
for cond, idx in cond_groups.items():
    start = theta_edges[min(idx)]
    end = theta_edges[max(idx)+1]
    ax.plot([start, start], [0, r_outer_max], color='red', lw=3)
    ax.plot([end, end], [0, r_outer_max], color='red', lw=3)
plt.title('APRIL (inner ring) and BCMA (outer ring) expression around B cells')

In [None]:
#Percent ct pairs are interacting

#Are there more APRIL+ cells around PC than B in NDMM?

In [None]:
# Summarize average expression and score per sample, per distance bin
bins = np.arange(0, 200, 20)
april_bcma_scores_allsamples['dist_bin'] = pd.cut(april_bcma_scores_allsamples['Distance'], bins=bins, labels = bins[0:-1])
expr_summary = april_bcma_scores_allsamples.groupby(['Sample', 'dist_bin']).agg(
    mean_L = ('Ligand Expression', 'mean'),
    mean_R = ('Receptor Expression', 'mean'),
    mean_score = ('LR Score', 'mean'),
    sum_score = ('LR Score', 'sum'),
    count_pairs = ('LR Score', 'size')
).reset_index()
expr_summary.head()

In [None]:
lig_mat = expr_summary.pivot(index = 'dist_bin', columns = 'Sample', values = 'mean_L').fillna(0)
rec_mat = expr_summary.pivot(index = 'dist_bin', columns = 'Sample', values = 'mean_R').fillna(0)
score_mat = expr_summary.pivot(index = 'dist_bin', columns = 'Sample', values = 'sum_score').fillna(0)
count_mat = expr_summary.pivot(index = 'dist_bin', columns = 'Sample', values = 'count_pairs').fillna(0)
lig_mat.head()

In [None]:
# Reorder sample columns by their timepoint
metadata = ird_xenium_merge.obs[['Sample', 'Collection']].drop_duplicates()#.set_index('Sample')
sample_order = metadata.sort_values(['Collection', 'Sample'])['Sample'].tolist()  # Collection is ordered categorical already
sample_order = [s for s in sample_order if s in lig_mat.columns]  # Keep only samples present in the matrix
lig_mat = lig_mat.reindex(columns=sample_order)
rec_mat = rec_mat.reindex(columns=sample_order)
score_mat = score_mat.reindex(columns=sample_order)
count_mat = count_mat.reindex(columns=sample_order)

In [None]:
# Create concentric radial heatmap plots
#r_vals = lig_mat.index.astype(float)
theta_vals = np.linspace(0, 2*np.pi, len(lig_mat.columns), endpoint=False)

fig, ax = plt.subplots(subplot_kw=dict(projection='polar'), figsize=(7,7))

# Parameters
n_dist = lig_mat.shape[0]
n_samples = lig_mat.shape[1]
theta_edges = np.linspace(0, 2*np.pi, n_samples + 1)

# Define separate radial scales for inner (ligand) and outer (receptor)
r_inner_max = 100   # inner ring radius
r_outer_min = 110   # where outer ring starts
r_outer_max = 200   # outer ring limit

# Ligand ring: map distance bins [0, max] → [0, r_inner_max]
r_edges_lig = np.linspace(0, r_inner_max, n_dist + 1)
mesh_L = ax.pcolormesh(theta_edges, r_edges_lig, lig_mat.values,
                       cmap='Blues', shading='auto')#, vmin=0, vmax=1)

# Color code the collections
collections = metadata.set_index('Sample').loc[lig_mat.columns, 'Collection']
cond_groups = collections.groupby(collections).indices

for cond, idx in cond_groups.items():
    start = theta_edges[min(idx)]
    end = theta_edges[max(idx)+1]
    ax.plot([start, start], [0, r_inner_max], color='red', lw=3)
    ax.plot([end, end], [0, r_inner_max], color='red', lw=3)
plt.title('APRIL')

In [None]:
fig, ax = plt.subplots(subplot_kw=dict(projection='polar'), figsize=(7,7))
mesh_L = ax.pcolormesh(theta_edges, r_edges_lig, rec_mat.values,
                       cmap='Blues', shading='auto')#, vmin=0, vmax=1)

# Color code the collections
collections = metadata.set_index('Sample').loc[lig_mat.columns, 'Collection']
cond_groups = collections.groupby(collections).indices

for cond, idx in cond_groups.items():
    start = theta_edges[min(idx)]
    end = theta_edges[max(idx)+1]
    ax.plot([start, start], [0, r_inner_max], color='red', lw=3)
    ax.plot([end, end], [0, r_inner_max], color='red', lw=3)
plt.title('BCMA')

In [None]:
fig, ax = plt.subplots(subplot_kw=dict(projection='polar'), figsize=(7,7))
mesh_L = ax.pcolormesh(theta_edges, r_edges_lig, score_mat.values,
                       cmap='Blues', shading='auto')#, vmin=0, vmax=1)

# Color code the collections
collections = metadata.set_index('Sample').loc[lig_mat.columns, 'Collection']
cond_groups = collections.groupby(collections).indices

for cond, idx in cond_groups.items():
    start = theta_edges[min(idx)]
    end = theta_edges[max(idx)+1]
    ax.plot([start, start], [0, r_inner_max], color='red', lw=3)
    ax.plot([end, end], [0, r_inner_max], color='red', lw=3)
plt.title('LR score')

In [None]:
fig, ax = plt.subplots(subplot_kw=dict(projection='polar'), figsize=(7,7))
mesh_L = ax.pcolormesh(theta_edges, r_edges_lig, count_mat.values,
                       cmap='Blues', shading='auto')#, vmin=0, vmax=1)

# Color code the collections
collections = metadata.set_index('Sample').loc[lig_mat.columns, 'Collection']
cond_groups = collections.groupby(collections).indices

for cond, idx in cond_groups.items():
    start = theta_edges[min(idx)]
    end = theta_edges[max(idx)+1]
    ax.plot([start, start], [0, r_inner_max], color='red', lw=3)
    ax.plot([end, end], [0, r_inner_max], color='red', lw=3)
plt.title('LR pair count')

In [None]:
# Create concentric radial heatmap plots
#r_vals = lig_mat.index.astype(float)
theta_vals = np.linspace(0, 2*np.pi, len(lig_mat.columns), endpoint=False)

fig, ax = plt.subplots(subplot_kw=dict(projection='polar'), figsize=(7,7))

# Parameters
n_dist = lig_mat.shape[0]
n_samples = lig_mat.shape[1]
theta_edges = np.linspace(0, 2*np.pi, n_samples + 1)

# Define separate radial scales for inner (ligand) and outer (receptor)
r_inner_max = 100   # inner ring radius
r_outer_min = 110   # where outer ring starts
r_outer_max = 200   # outer ring limit

# Ligand ring: map distance bins [0, max] → [0, r_inner_max]
r_edges_lig = np.linspace(0, r_inner_max, n_dist + 1)
mesh_L = ax.pcolormesh(theta_edges, r_edges_lig, lig_mat.values,
                       cmap='Blues', shading='auto')#, vmin=0, vmax=1)

r_edges_rec = np.linspace(r_outer_min, r_outer_max, n_dist + 1)
mesh_R = ax.pcolormesh(theta_edges, r_edges_rec, rec_mat.values,
                       cmap='Oranges', shading='auto')#, vmin=0, vmax=1

# Color code the collections
collections = metadata.set_index('Sample').loc[lig_mat.columns, 'Collection']
cond_groups = collections.groupby(collections).indices

for cond, idx in cond_groups.items():
    start = theta_edges[min(idx)]
    end = theta_edges[max(idx)+1]
    ax.plot([start, start], [0, r_outer_max], color='red', lw=3)
    ax.plot([end, end], [0, r_outer_max], color='red', lw=3)

## Section 3: Myeloid Immunosuppression Analysis

Myeloid cells, such as myeloid-derived suppressor cells (MDSCs) and tumor-associated macrophages (TAMs), have been shown to contribute to an immunosuppressive tumor microenvironment. Due to the limited gene panel and the sparsity of transcript detection, we may not be able to identify these populations by unsupervised clustering of their Xenium transcript profiles. Instead, this section analyzes the expression of immune checkpoint molecules and immunosuppressive factors in myeloid populations (especially macrophages) across disease stages to identify changes in immunosuppressive capacity.

### Genes associated with myeloid immunosuppression

Myeloid cells are assessed for immunosuppressive phenotype based on expression of the following markers:

**Immune checkpoint and immunosuppressive molecules:**
CD274 (PD-L1) (on all panels), SPP1, MARCO, IL1R2, IL1RL1, TGFB1 (on v5, v6 panels)

**Transcriptional regulators:**
STAT3, STAT1, NFKB1, NFKBIA

Cells expressing ≥1 of these markers are considered to have an immunosuppressive MDSC-like phenotype. We also test for spatial proximity to exhausted T cells, which are hypothesized to be induced by immunosuppressive myeloid cells. These T cells are identified by the following markers:

**T cell exhaustion markers**:
CTLA4, LAG3, HAVCR2 (TIM-3), TIGIT
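
As a minimal sketch of the "≥1 marker" rule described above, the snippet below flags cells in which at least one marker gene has a non-zero count. The toy matrix and its values are made up for illustration; in the notebook, the equivalent matrix would come from subsetting `.X` to the marker genes.

```python
import numpy as np

# Toy count matrix: 4 cells x 3 marker genes (values are invented).
markers = ["CD274", "SPP1", "MARCO"]
expr = np.array([
    [0, 0, 0],   # no marker detected
    [1, 0, 0],   # CD274 only
    [0, 2, 1],   # SPP1 and MARCO
    [0, 0, 3],   # MARCO only
])

# Number of marker genes with a non-zero count in each cell.
n_markers_detected = (expr > 0).sum(axis=1)

# A cell is flagged when at least one marker is detected.
is_flagged = n_markers_detected >= 1

print(dict(zip(range(len(expr)), is_flagged.tolist())))
```

Because Xenium counts are sparse, a rule like this is sensitive to the number of genes in the list: adding genes can only increase the flagged fraction.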



In [None]:
ird_xenium_merge.obs['ct_timepoint'] = ird_xenium_merge.obs.apply(lambda x: f"{x['ct']}_{x['Collection']}", axis = 1)

PDL1 (CD274) is expressed at higher levels in macrophages and megakaryocytes. PDL1+ macrophages could represent a possible mMDSC phenotype. Megakaryocytes may be somewhat novel players in expressing PDL1.

In [None]:
sc.pl.dotplot(ird_xenium_merge, var_names = 'CD274', groupby = 'ct_timepoint', swap_axes = True, dot_max = 0.05)

In [None]:
ird_myeloid = ird_xenium_merge[ird_xenium_merge.obs['ct'].isin(['GMP', 'Late Myeloid', 'Neutrophil', 'Ba/Eo/Ma/', 'Monocyte', 'Macrophage', 'pDC', 'cDC']), :].copy()

In [None]:
timecols = {"NBM": "#0C7515", "NDMM": "#E619B9", "PT": "#CF99C3"} 

SPP1 expression is higher in macrophages in NDMM

In [None]:
ird_macro = ird_myeloid[ird_myeloid.obs['ct'].isin(['Macrophage']), :].copy()
sc.pl.dotplot(ird_macro, var_names = 'SPP1', groupby = 'Collection', swap_axes = True)

In [None]:
ird_macro.obs['SPP1_exp'] = ird_macro[:, 'SPP1'].X.toarray()
ird_macro.obs['SPP1_pos'] = ird_macro[:, 'SPP1'].X.toarray() > 0
macro_spp1_exp = ird_macro.obs.groupby('Sample')['SPP1_exp'].mean().reset_index()
macro_spp1_exp = macro_spp1_exp.merge(ird_macro.obs.groupby('Sample')['SPP1_pos'].mean().reset_index(), on='Sample', how='left')
macro_spp1_exp['Collection'] = macro_spp1_exp['Sample'].map(ird_xenium_merge.obs[['Sample', 'Collection']].drop_duplicates().set_index('Sample')['Collection'])
plot_utils.plot_comparison_with_significance(macro_spp1_exp, 'Collection', 'SPP1_pos', order=['NBM', 'NDMM', 'PT'], 
                                             palette=timecols, xlabel = 'Timepoint', ylabel = 'Fraction of Macrophages with SPP1 Expression')
                                             #save_path = '/diskmnt/Users2/chouw/Projects/BM_spatial/IRD/IRD_JW_Xenium_merge_macrophage_SPP1fraction.pdf')

PDL1 (CD274) expression in macrophages is higher in NDMM

In [None]:
sc.pl.dotplot(ird_macro, var_names = 'CD274', groupby = 'Collection', swap_axes = True)

In [None]:
ird_macro.obs['PDL1_exp'] = ird_macro[:, 'CD274'].X.toarray()
ird_macro.obs['PDL1_pos'] = ird_macro[:, 'CD274'].X.toarray() > 0
macro_pdl1_exp = ird_macro.obs.groupby('Sample')['PDL1_exp'].mean().reset_index()
macro_pdl1_exp = macro_pdl1_exp.merge(ird_macro.obs.groupby('Sample')['PDL1_pos'].mean().reset_index(), on='Sample', how='left')
macro_pdl1_exp['Collection'] = macro_pdl1_exp['Sample'].map(ird_xenium_merge.obs[['Sample', 'Collection']].drop_duplicates().set_index('Sample')['Collection'])
plot_utils.plot_comparison_with_significance(macro_pdl1_exp, 'Collection', 'PDL1_pos', order=['NBM', 'NDMM', 'PT'], 
                                             palette=timecols, xlabel = 'Timepoint', ylabel = 'Fraction of Macrophages with PDL1 Expression')
                                             #save_path = '/diskmnt/Users2/chouw/Projects/BM_spatial/IRD/IRD_JW_Xenium_merge_macrophage_PDL1fraction.pdf')

In [None]:
sc.pl.dotplot(ird_macro, var_names = 'MARCO', groupby = 'Collection', swap_axes = True)

In [None]:
ird_macro.obs['MARCO_exp'] = ird_macro[:, 'MARCO'].X.toarray()
ird_macro.obs['MARCO_pos'] = ird_macro[:, 'MARCO'].X.toarray() > 0
macro_marco_exp = ird_macro.obs.groupby('Sample')['MARCO_exp'].mean().reset_index()
macro_marco_exp = macro_marco_exp.merge(ird_macro.obs.groupby('Sample')['MARCO_pos'].mean().reset_index(), on='Sample', how='left')
macro_marco_exp['Collection'] = macro_marco_exp['Sample'].map(ird_xenium_merge.obs[['Sample', 'Collection']].drop_duplicates().set_index('Sample')['Collection'])
plot_utils.plot_comparison_with_significance(macro_marco_exp, 'Collection', 'MARCO_exp', order=['NBM', 'NDMM', 'PT'], palette=timecols)

### Immunosuppression gene set scores

Both SPP1+ macrophages and PDL1+ macrophages have been reported to exhibit immunosuppressive activities.

Due to the sparse expression of these genes and the lack of other relevant MDSC/SPP1+ macrophage markers, we use a gene set score to identify macrophages that are likely to have immunosuppressive capacity.

We include both SPP1 and CD274 (PDL1) in this gene set. Although MARCO expression is lower in macrophages in NDMM, it is reported to be a marker of SPP1+ macrophages, so we still include MARCO in the gene set score calculation.
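
To make the idea of a gene set score concrete, here is a simplified sketch: the mean expression of the gene set minus the mean expression of a control set. This is not the `sc.tl.score_genes` implementation, which samples its reference genes from expression-matched bins; here the control set is simply every other gene, and `simple_gene_set_score`, `GENE_A`, and `GENE_B` are hypothetical names.

```python
import numpy as np

def simple_gene_set_score(expr, gene_names, gene_set):
    """Mean expression of `gene_set` minus mean expression of all other genes.

    Simplification: the control set is every gene outside `gene_set`,
    not an expression-matched random sample.
    """
    in_set = np.isin(gene_names, gene_set)
    set_mean = expr[:, in_set].mean(axis=1)
    ctrl_mean = expr[:, ~in_set].mean(axis=1)
    return set_mean - ctrl_mean

gene_names = np.array(["SPP1", "CD274", "MARCO", "GENE_A", "GENE_B"])
expr = np.array([
    [3.0, 0.0, 3.0, 1.0, 1.0],  # gene set mean 2, control mean 1
    [0.0, 0.0, 0.0, 2.0, 2.0],  # gene set mean 0, control mean 2
])

scores = simple_gene_set_score(expr, gene_names, ["SPP1", "CD274", "MARCO"])
is_high = scores > 0  # same thresholding idea as used for 'ImmSupp_score_high'
print(scores, is_high)
```

A score above zero means the gene set is expressed above the control background in that cell, which is why a threshold of 0 is a natural (if coarse) cutoff.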

In [None]:
sc.tl.score_genes(ird_macro, gene_list = ['SPP1', 'CD274', 'MARCO'], score_name = 'ImmSupp_score')

In [None]:
sns.histplot(ird_macro.obs['ImmSupp_score'])

Validate the expression of other markers that are reported to be associated with SPP1+ macrophages and MDSC activities.

In [None]:
ird_macro.obs['ImmSupp_score_high'] = ['high' if x > 0 else 'low' for x in ird_macro.obs['ImmSupp_score']]
dotplot = sc.pl.dotplot(ird_macro, var_names = ['SPP1', 'CD274', 'MARCO'], groupby = 'ImmSupp_score_high', return_fig = True)
#dotplot.savefig('/diskmnt/Users2/chouw/Projects/BM_spatial/IRD/IRD_JW_Xenium_merge_macrophage_ImmSuppScore_SPP1CD274MARCO.pdf', dpi = 300)


In [None]:
dotplot = sc.pl.dotplot(ird_macro, var_names = ['TGFB1', 'MMP9', 'STAT1'], groupby = 'ImmSupp_score_high', return_fig = True)
dotplot.savefig('/diskmnt/Users2/chouw/Projects/BM_spatial/IRD/IRD_JW_Xenium_merge_macrophage_ImmSuppScore_TGFB1MMP9STAT1.pdf', dpi = 300)

In [None]:
dotplot = sc.pl.dotplot(ird_macro, var_names = ['STAT3', 'NFKB1', 'NFKBIA'], groupby = 'ImmSupp_score_high', return_fig = True)
dotplot.savefig('/diskmnt/Users2/chouw/Projects/BM_spatial/IRD/IRD_JW_Xenium_merge_macrophage_ImmSuppScore_STAT3NFKB1NFKBIA.pdf', dpi = 300)

In [None]:
dotplot = sc.pl.dotplot(ird_macro, var_names = ['TREM2', 'STAT5A', 'IFI44L', 'ISG15'], groupby = 'ImmSupp_score_high', return_fig = True)
dotplot.savefig('/diskmnt/Users2/chouw/Projects/BM_spatial/IRD/IRD_JW_Xenium_merge_macrophage_ImmSuppScore_TREM2STAT5AIFI44LISG15.pdf', dpi = 300)

Validate if macrophages with high immunosuppression scores are enriched in NDMM

In [None]:
dotplot = sc.pl.dotplot(ird_macro, var_names = 'ImmSupp_score', groupby = 'Collection', return_fig = True)
dotplot.savefig('/diskmnt/Users2/chouw/Projects/BM_spatial/IRD/IRD_JW_Xenium_merge_macrophage_ImmSuppScore_timepoint.pdf', dpi = 300)

In [None]:
# Fraction of macrophages with a high immunosuppression score per sample, grouped by Collection
ird_macro.obs['ImmSupp_score_pos'] = ird_macro.obs['ImmSupp_score'] > 0.
ird_macro_immSupp_summary = ird_macro.obs.groupby(['Sample'])['ImmSupp_score_pos'].mean().reset_index()
ird_macro_immSupp_summary = ird_macro_immSupp_summary.merge(ird_macro.obs[['Sample', 'Collection']].drop_duplicates(), on='Sample', how='left')
plot_utils.plot_comparison_with_significance(ird_macro_immSupp_summary, 'Collection', 'ImmSupp_score_pos', order=['NBM', 'NDMM', 'PT'], palette=timecols, 
                                       xlabel='Timepoint', ylabel='Fraction of Macrophages with High Immunosuppression Score', title=None)
                                       #save_path='/diskmnt/Users2/chouw/Projects/BM_spatial/IRD/IRD_JW_Xenium_merge_macrophage_ImmSuppScoreFrac.pdf', figsize=(6, 5))



Validate if macrophages with high immunosuppression scores have higher expression of other SPP1/MDSC markers in NDMM.

In [None]:
ird_macro_immSupp = ird_macro[ird_macro.obs['ImmSupp_score_high'] == 'high', :].copy()

In [None]:
sc.pl.dotplot(ird_macro_immSupp, var_names = ['TGFB1', 'MMP9', 'STAT1'], groupby = 'Collection', swap_axes = True)

In [None]:
sc.pl.dotplot(ird_macro_immSupp, var_names = ['STAT3', 'NFKB1', 'NFKBIA'], groupby = 'Collection', swap_axes = True)

In [None]:
sc.pl.dotplot(ird_macro_immSupp, var_names = ['TREM2', 'STAT5A', 'IFI44L', 'ISG15'], groupby = 'Collection', swap_axes = True)

### Spatial localization of immunosuppressive macrophages

Are the immunosuppressive macrophages enriched in certain radial neighborhoods? Radial neighborhoods were defined by Julia by clustering the neighbor composition within 50 microns of each cell.
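
The radial neighborhoods themselves are loaded precomputed, and the notebook does not show how they were built. As a rough sketch of the underlying idea only, the snippet below computes, for each cell, the cell type composition of its neighbors within a radius. The function name, coordinates, and labels are hypothetical, and the brute-force distance matrix is only suitable for toy data.

```python
import numpy as np

def neighbor_composition(xy, labels, radius):
    """Per-cell fraction of each cell type among neighbors within `radius`."""
    types = np.unique(labels)
    # Pairwise Euclidean distances (brute force, O(n^2) memory).
    d = np.linalg.norm(xy[:, None, :] - xy[None, :, :], axis=2)
    within = d <= radius
    np.fill_diagonal(within, False)  # a cell is not its own neighbor
    # counts[i, k] = number of neighbors of cell i that have type types[k]
    counts = np.stack(
        [(within & (labels == t)[None, :]).sum(axis=1) for t in types], axis=1
    )
    totals = counts.sum(axis=1, keepdims=True)
    # Cells without neighbors keep an all-zero composition vector.
    frac = np.divide(counts, totals,
                     out=np.zeros(counts.shape, dtype=float), where=totals > 0)
    return types, frac

xy = np.array([[0.0, 0.0], [10.0, 0.0], [20.0, 0.0], [500.0, 500.0]])
labels = np.array(["PC", "Macrophage", "PC", "MSC"])
types, frac = neighbor_composition(xy, labels, radius=50.0)
print(types)
print(frac)
```

Clustering the rows of `frac` (for example with k-means) would then group cells with similar surroundings into neighborhoods; the clustering method actually used for `merged_RN.h5ad` is not stated here.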

In [None]:
rn_obj = sc.read_h5ad('/diskmnt/Projects/myeloma_scRNA_analysis/MMY_IRD/Xenium/analysis/radial_neighborhoods/Output/merged_RN.h5ad')
rn_obj.obs.head()


In [None]:
# Count number of cells in each radial neighborhood
rn_counts = rn_obj.obs['rn'].value_counts().reset_index()
rn_counts

In [None]:
# Add radial neighborhood information to the macrophage data
ird_macro.obs['rn'] = rn_obj.obs.loc[ird_macro.obs.index, 'rn'].values

Calculate the fraction of macrophages that are immunosuppressive in each radial neighborhood. They are more enriched in:
- nbhd 8: CD8 T/NK cell neighborhood
- nbhd 9: stromal neighborhood
- nbhd 12 (few cells per sample): osteoblast neighborhood
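
A small sketch of the per-neighborhood fraction, using a toy table with the same column names as `ird_macro.obs` (the values are invented). Reporting the macrophage count next to the fraction helps flag neighborhoods, such as nbhd 12, where the fraction rests on very few cells.

```python
import pandas as pd

# Toy macrophage table: one row per macrophage (values are invented).
macro = pd.DataFrame({
    "rn": [8, 8, 8, 9, 9, 12],
    "ImmSupp_score_pos": [True, True, False, True, False, True],
})

# Per neighborhood: number of macrophages, number flagged, and flagged fraction.
frac_by_rn = (
    macro.groupby("rn")["ImmSupp_score_pos"]
    .agg(n_macrophages="size", n_immsupp="sum", frac_immsupp="mean")
    .reset_index()
)
print(frac_by_rn)
```

This sketch uses macrophages as the denominator. The cell that follows divides by all cells in each neighborhood (`rn_counts`) instead, which measures how much of a neighborhood consists of each macrophage class.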

In [None]:
# Count the number of macrophages in each radial neighborhood, grouped by ImmSupp_score_pos
macro_rn_counts = (
    ird_macro.obs
    .groupby(['rn', 'ImmSupp_score_pos'])
    .size()
    .reset_index(name='macrophage_count')
)
macro_rn_counts = macro_rn_counts.merge(rn_counts, on='rn', how='left')
macro_rn_counts['macrophage_frac'] = macro_rn_counts['macrophage_count'] / macro_rn_counts['count']
display(macro_rn_counts)



In [None]:
sns.barplot(macro_rn_counts, x = 'macrophage_frac', y = 'rn', hue = 'ImmSupp_score_pos')

Calculate the fraction of macrophages that are immunosuppressive per radial neighborhood on a sample level, and compare across collection timepoints.

In [None]:
macro_per_sample_rn = ird_macro.obs.groupby(['Sample', 'rn']).agg(
    macrophage_count = ('rn', 'count'),
    immSupp_count = ('ImmSupp_score_pos', 'sum')
).reset_index()
macro_per_sample_rn['macrophage_frac'] = macro_per_sample_rn['macrophage_count'] / macro_per_sample_rn['macrophage_count'].sum()
macro_per_sample_rn['immSupp_frac'] = macro_per_sample_rn['immSupp_count'] / macro_per_sample_rn['macrophage_count']
macro_per_sample_rn['immSupp_frac'] = macro_per_sample_rn['immSupp_frac'].fillna(0)
macro_per_sample_rn['Collection'] = macro_per_sample_rn['Sample'].map(ird_xenium_merge.obs[['Sample', 'Collection']].drop_duplicates().set_index('Sample')['Collection'])
macro_per_sample_rn.head()

In [None]:
_ = plot_utils.plot_multigroup_boxplot_with_significance(macro_per_sample_rn, 'rn', 'immSupp_frac', 'Collection', figsize = (15, 6), show_outliers = False,
                                                     xlabel = 'Radial neighborhood', ylabel = 'Fraction of macrophages that are immunosuppressive',
                                                     palette = timecols, save_path = '/diskmnt/Users2/chouw/Projects/BM_spatial/IRD/IRD_JW_Xenium_merge_frac_ImmSupp_macrophage_rn_timepoint.pdf')


### Visualize spatial location of immunosuppressive macrophages

Scatterplot of cell centroids of macrophages, grouped by their immunosuppressive gene expression, and plasma cells (tumor).

In [None]:
test_sample = 'S13-35096A1U1'
ird_macro_test_meta = ird_macro.obs.loc[ird_macro.obs['Sample'] == test_sample, :].copy()
ird_pc_test_meta = ird_xenium_merge.obs.loc[(ird_xenium_merge.obs['Sample'] == test_sample) & (ird_xenium_merge.obs['ct'] == 'PC'), :].copy()
ird_test_meta = pd.concat([ird_macro_test_meta, ird_pc_test_meta], axis = 0)
ird_test_meta.head()



In [None]:
# Define subtypes: PCs remain PCs, macrophages are split into ImmSupp and Non-ImmSupp
ird_test_meta['ct_subtype'] = ['PC' if x == 'PC' else 'ImmSupp' if x == 'Macrophage' and y == 'high' else 'Non-ImmSupp' for x, y in zip(ird_test_meta['ct'], ird_test_meta['ImmSupp_score_high'])]
ird_test_meta.head()

In [None]:
plt.figure(figsize = (15, 15))
sns.scatterplot(ird_test_meta, x = 'x_centroid', y = 'y_centroid', hue = 'ct_subtype', palette = 'muted', s = 2, linewidth = 0)
# Maintain aspect ratio
plt.gca().set_aspect('equal', 'box')
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left', markerscale = 4)

### Archived

#### Gene expression dotplot of immunosuppression genes only in the myeloid populations

In [None]:
ird_v6 = ird_xenium_merge[ird_xenium_merge.obs['Panel'] == 'BYGXJ6_hMulti'].copy()

In [None]:
ird_v6.obs['ct_timepoint'] = ird_v6.obs.apply(lambda x: f"{x['ct']}_{x['Collection']}", axis = 1)

In [None]:
# Copy so that new .obs columns can be added below without modifying a view
ird_myeloid = ird_v6[ird_v6.obs['ct'].isin(['GMP', 'Late Myeloid', 'Neutrophil', 'Ba/Eo/Ma/', 'Monocyte', 'Macrophage', 'pDC', 'cDC']), :].copy()

In [None]:
sc.pl.dotplot(ird_myeloid, var_names = ['CD274', 'IL1RL1'], groupby = 'ct_timepoint', swap_axes = True)

In [None]:
sc.pl.dotplot(ird_myeloid, var_names = ['IL1R2'], groupby = 'ct_timepoint', swap_axes = True)

In [None]:
sc.pl.dotplot(ird_myeloid, var_names = ['TGFB1', 'STAT3', 'STAT5A', 'S100A12'], groupby = 'ct_timepoint', swap_axes = True)

#### PDL1 expression and immunosuppressive score (old version) in all myeloid cells

In [None]:
# Identify immune suppressor genes in Mc/Mp, Granulo.
genes = ['CD274', 'TGFB1', 'CTLA4', 'HAVCR2', 'LAG3']#, 'STAT3']

myeloid_supp_df = ird_myeloid[:, genes].X.toarray()
myeloid_supp_df_binary = myeloid_supp_df > 0
ird_myeloid.obs['Immune_supp'] = np.any(myeloid_supp_df_binary, axis=1)
ird_myeloid.obs['PDL1_pos'] = ird_myeloid[:, 'CD274'].X.toarray() > 0

# Fraction of myeloid cells per sample expressing at least one of the immunosuppressive genes
myeloid_supp_pct = ird_myeloid.obs.groupby('Sample')['Immune_supp'].mean().reset_index()
myeloid_supp_pct = myeloid_supp_pct.merge(ird_myeloid.obs.groupby('Sample')['PDL1_pos'].mean().reset_index(), on='Sample', how='left')
myeloid_supp_pct = myeloid_supp_pct.merge(ird_xenium_merge.obs[['Sample', 'Collection']].drop_duplicates(), on='Sample', how='left')
myeloid_supp_pct.head()

In [None]:
fig, stats = plot_utils.plot_comparison_with_significance(
        myeloid_supp_pct, 
        x_col='Collection', 
        y_col='PDL1_pos',
        order=['NBM', 'NDMM', 'PT'],
        ylabel='Fraction of PDL1+ myeloid cells'
    )

Look at paired data

In [None]:
# Identify paired samples in NDMM and PT, they will have the same UPN (already in meta)
ird_meta = ird_xenium_merge.obs.copy()
ird_meta[['UPN', 'Collection']].drop_duplicates().value_counts('UPN').reset_index()

Compare paired NDMM and PT samples from the same patient (same UPN, already in the metadata).
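
For a paired comparison, each patient's NDMM and PT values must line up row by row. One way to guarantee this, sketched below with invented UPNs and values, is to pivot to one row per UPN and drop patients missing either timepoint.

```python
import pandas as pd

# Toy per-patient fractions (UPNs and values are invented; WU_C has no PT sample).
paired = pd.DataFrame({
    "UPN": ["WU_A", "WU_A", "WU_B", "WU_B", "WU_C"],
    "Collection": ["NDMM", "PT", "NDMM", "PT", "NDMM"],
    "PDL1_pos": [0.10, 0.04, 0.20, 0.25, 0.30],
})

# One row per patient, one column per timepoint.
wide = paired.pivot(index="UPN", columns="Collection", values="PDL1_pos")

# Keep only patients with both timepoints so the pairs stay aligned.
wide = wide.dropna(subset=["NDMM", "PT"])

# Per-patient change from diagnosis to post-treatment.
diff = wide["PT"] - wide["NDMM"]
print(wide)
print(diff)
```

The aligned columns `wide["NDMM"]` and `wide["PT"]` could then be passed to `scipy.stats.wilcoxon`, as in the test further down in this section.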

In [None]:
paired_UPNs = ['WU030', 'WU025', 'WU107', 'WU068', 'WU066', 'WU050', 'WU007', 'WU043']
upn_sample_meta = ird_meta[['Sample', 'UPN', 'Collection']].drop_duplicates().set_index('Sample')
myeloid_supp_pct['UPN'] = myeloid_supp_pct['Sample'].map(upn_sample_meta['UPN'])
myeloid_supp_paired_pct = myeloid_supp_pct[myeloid_supp_pct['UPN'].isin(paired_UPNs)].copy()
myeloid_supp_paired_pct['Collection'] = pd.Categorical(myeloid_supp_paired_pct['Collection'], categories=['NDMM', 'PT'], ordered=True)
# Some samples are from same UPN, need to average the percentages per collection
myeloid_supp_paired_pct = myeloid_supp_paired_pct.groupby(['UPN', 'Collection']).agg({'PDL1_pos': 'mean'}).reset_index().dropna()
sns.boxplot(myeloid_supp_paired_pct, x='Collection', y='PDL1_pos', palette='Set2')

# Plot line connecting paired samples
for upn in paired_UPNs:
    data_upn = myeloid_supp_paired_pct[myeloid_supp_paired_pct['UPN'] == upn]
    plt.plot(data_upn['Collection'], data_upn['PDL1_pos'], color='gray', linestyle='--', marker='o')


In [None]:
paired_UPNs = ['WU030', 'WU025', 'WU107', 'WU068', 'WU066', 'WU050', 'WU007', 'WU043']
upn_sample_meta = ird_meta[['Sample', 'UPN', 'Collection']].drop_duplicates().set_index('Sample')
myeloid_supp_pct['UPN'] = myeloid_supp_pct['Sample'].map(upn_sample_meta['UPN'])
myeloid_supp_paired_pct = myeloid_supp_pct[myeloid_supp_pct['UPN'].isin(paired_UPNs)].copy()
myeloid_supp_paired_pct['Collection'] = pd.Categorical(myeloid_supp_paired_pct['Collection'], categories=['NDMM', 'PT'], ordered=True)
# Some samples are from same UPN, need to average the percentages per collection
myeloid_supp_paired_pct = myeloid_supp_paired_pct.groupby(['UPN', 'Collection']).agg({'Immune_supp': 'mean'}).reset_index().dropna()
sns.boxplot(myeloid_supp_paired_pct, x='Collection', y='Immune_supp', palette='Set2')

# Plot line connecting paired samples
for upn in paired_UPNs:
    data_upn = myeloid_supp_paired_pct[myeloid_supp_paired_pct['UPN'] == upn]
    plt.plot(data_upn['Collection'], data_upn['Immune_supp'], color='gray', linestyle='--', marker='o')


In [None]:
# Calculate p-value using Wilcoxon signed-rank test
from scipy.stats import wilcoxon
# Extract percentage values for NDMM and PT for each paired UPN, make sure they are in the same order
ndmm_pct_df = myeloid_supp_paired_pct[myeloid_supp_paired_pct['Collection'] == 'NDMM'].sort_values('UPN')
pt_pct_df = myeloid_supp_paired_pct[myeloid_supp_paired_pct['Collection'] == 'PT'].sort_values('UPN')

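# Perform Wilcoxon signed-rank test
# Note: this requires the PDL1_pos version of myeloid_supp_paired_pct; the Immune_supp cell above overwrites it
wilcoxon(ndmm_pct_df['PDL1_pos'].values, pt_pct_df['PDL1_pos'].values).pvalue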




#### Distance between immunosuppressive myeloid cells (e.g. PDL1+) and (exhausted) T cells

Check whether CD274+ myeloid cells are located near PDCD1+ T cells. The answer seems to be "no", possibly due to sparse transcript detection.
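
The proximity check relies on a nearest-distance query between two sets of centroids. The helper used in this notebook is not shown, so the following is only a sketch of the general idea, with a hypothetical function name and toy coordinates. It would need to be run separately for each sample, since coordinates from different slides are not comparable.

```python
import numpy as np

def nearest_distance(xy_query, xy_target):
    """Distance from each query point to its closest target point."""
    # Brute-force distance matrix: rows are query cells, columns are target cells.
    d = np.linalg.norm(xy_query[:, None, :] - xy_target[None, :, :], axis=2)
    return d.min(axis=1)

# Toy centroids in microns (invented values).
myeloid_xy = np.array([[0.0, 0.0], [100.0, 0.0]])
tcell_xy = np.array([[3.0, 4.0], [100.0, 30.0]])

dist = nearest_distance(myeloid_xy, tcell_xy)
print(dist)
```

For full samples, a spatial index such as `scipy.spatial.KDTree` avoids building the full distance matrix.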

In [None]:
exh_genes = ['HAVCR2', 'PDCD1', 'LAG3', 'CTLA4', 'TIGIT']
ird_tcells = ird_xenium_merge[ird_xenium_merge.obs['annot'] == 'T', :].copy()  # copy so .obs columns can be added
exh_df = ird_tcells[:, exh_genes].X.toarray()
exh_df_binary = exh_df > 0
ird_tcells.obs['Exh_pos'] = np.any(exh_df_binary, axis=1)
ird_tcells.obs['PDCD1_pos'] = ird_tcells[:, 'PDCD1'].X.toarray() > 0

In [None]:
# Check if CD274+ Mc/Mp are around PDCD1+ T cells
PDL1_pos_mcmp = ird_myeloid.obs.loc[ird_myeloid.obs['Immune_supp'] == True, ['Sample', 'x_centroid', 'y_centroid']].copy()
PDCD1_tcell_pos = ird_tcells.obs.loc[ird_tcells.obs['Exh_pos'] == True, ['Sample', 'x_centroid', 'y_centroid']].copy()

res = nearest_dist_between_two_celltypes(PDL1_pos_mcmp, PDCD1_tcell_pos, 'Sample', x_col='x_centroid', y_col='y_centroid')
res.head()

In [None]:
sns.histplot(res, x='nearest_dist_to_df2', binwidth = 10)
plt.xlim(0, 200)

Count the MDSCs within 20 microns of exhausted T cells in each sample, normalized by the total number of cells within 20 microns of those T cells.
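
The two metrics computed in the next cell can be illustrated on a toy sample. This sketch uses brute-force distances instead of a KD-tree, and the function name and coordinates are hypothetical. As in the notebook, `all_xy` holds every cell except the query T cells and therefore includes the MDSCs.

```python
import numpy as np

def mdsc_neighbor_metrics(t_xy, mdsc_xy, all_xy, radius=20.0):
    """Return (metric1, metric2) for one sample.

    metric1: MDSC neighbors of T cells / all neighbors of T cells (summed over T cells)
    metric2: fraction of T cells with at least one MDSC within `radius`
    """
    def within(a, b):
        # Boolean matrix: is point b[j] within `radius` of point a[i]?
        return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2) <= radius

    mdsc_near = within(t_xy, mdsc_xy)
    any_near = within(t_xy, all_xy)
    n_mdsc, n_any = mdsc_near.sum(), any_near.sum()
    metric1 = n_mdsc / n_any if n_any > 0 else 0.0
    metric2 = mdsc_near.any(axis=1).mean() if len(t_xy) > 0 else 0.0
    return float(metric1), float(metric2)

# Toy centroids in microns (invented values).
t_xy = np.array([[0.0, 0.0], [100.0, 100.0]])
mdsc_xy = np.array([[10.0, 0.0], [200.0, 200.0]])
all_xy = np.array([[10.0, 0.0], [200.0, 200.0], [0.0, 15.0],
                   [100.0, 110.0], [105.0, 100.0]])

metric1, metric2 = mdsc_neighbor_metrics(t_xy, mdsc_xy, all_xy)
print(metric1, metric2)
```

Metric 1 pools neighbors across all T cells, so T cells in dense regions contribute more; metric 2 weights every T cell equally.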

In [None]:
from scipy.spatial import KDTree
all_cell_pos = ird_xenium_merge.obs[['Sample', 'x_centroid', 'y_centroid']].copy()
PDL1_pos_mcmp = ird_myeloid.obs.loc[ird_myeloid.obs['PDL1_pos'] == True, ['Sample', 'x_centroid', 'y_centroid']].copy()
PDCD1_tcell_pos = ird_tcells.obs.loc[ird_tcells.obs['PDCD1_pos'] == True, ['Sample', 'x_centroid', 'y_centroid']].copy()
all_cell_pos = all_cell_pos.drop(PDCD1_tcell_pos.index)

res = []
distance_threshold = 20
for sample in PDL1_pos_mcmp['Sample'].unique():
    # Extract positions of MDSCs, Exh. T cells, and all cells in the sample
    mdsc_pos = PDL1_pos_mcmp[PDL1_pos_mcmp['Sample'] == sample]
    tcell_pos = PDCD1_tcell_pos[PDCD1_tcell_pos['Sample'] == sample]
    cell_pos = all_cell_pos[all_cell_pos['Sample'] == sample]

    # Build KDTree for T cells and MDSCs for distance query
    t_tree = KDTree(tcell_pos[['x_centroid', 'y_centroid']])
    mdsc_tree = KDTree(mdsc_pos[['x_centroid', 'y_centroid']])
    cell_tree = KDTree(cell_pos[['x_centroid', 'y_centroid']])
        
    # For Metric 1: MDSCs within 20 microns of T cells / All cells within 20 microns of T cells
    ind_mdsc_near_t = mdsc_tree.query_ball_point(tcell_pos[['x_centroid', 'y_centroid']], r=distance_threshold)
    ind_cell_near_t = cell_tree.query_ball_point(tcell_pos[['x_centroid', 'y_centroid']], r=distance_threshold)
    
    total_mdscs_around_tcells = sum(len(neighbors) for neighbors in ind_mdsc_near_t)
    total_cells_around_tcells = sum(len(neighbors) for neighbors in ind_cell_near_t)
    metric1 = total_mdscs_around_tcells / total_cells_around_tcells if total_cells_around_tcells > 0 else 0
    
    # For Metric 2: T cells with at least 1 MDSC neighbor / Total T cells
    ind_t_near_mdsc = mdsc_tree.query_ball_point(tcell_pos[['x_centroid', 'y_centroid']], r=distance_threshold)
    total_tcells_with_mdsc = sum(1 for neighbors in ind_t_near_mdsc if len(neighbors) > 0)
    total_tcells = len(tcell_pos)
    metric2 = total_tcells_with_mdsc / total_tcells if total_tcells > 0 else 0
    
    res.append({
        'Sample': sample,
        'frac_mdsc_around_tcells': metric1,
        'frac_tcells_with_mdsc': metric2,
        'total_mdscs_around_tcells': total_mdscs_around_tcells,
        'total_cells_around_tcells': total_cells_around_tcells,
        'total_tcells_with_mdsc': total_tcells_with_mdsc,
        'total_tcells': total_tcells
        })
res = pd.DataFrame(res)
res['Collection'] = res['Sample'].map(ird_meta[['Sample', 'Collection']].drop_duplicates().set_index('Sample')['Collection'])
# Add the total number of myeloid cells per sample (available for downstream normalization)
total_myeloid = ird_myeloid.obs.groupby('Sample').size().reset_index(name='total_myeloid')
res = res.merge(total_myeloid, on='Sample', how='left')



In [None]:
sns.boxplot(res, x='Collection', y='frac_mdsc_around_tcells', palette='Set2')
sns.swarmplot(res, x='Collection', y='frac_mdsc_around_tcells', color='black', size=8, alpha = .5)
plt.xlabel('Timepoint')
plt.ylabel('Fraction of exh. T cell neighbors that are MDSCs')

In [None]:
sns.boxplot(res, x='Collection', y='frac_tcells_with_mdsc', palette='Set2')
sns.swarmplot(res, x='Collection', y='frac_tcells_with_mdsc', color='black', size=8, alpha = .5)
plt.xlabel('Timepoint')
plt.ylabel('Fraction of exh. T cells having MDSC neighbors')

## Section 4: Stromal Niche Dysfunction - MSC Analysis

Characterize mesenchymal stromal cells (MSCs) and their functional changes across disease stages. MSCs are critical components of the bone marrow niche that support hematopoiesis and regulate plasma cell survival through factors like CXCL12. We test whether CXCL12 expression in MSCs is significantly altered in disease and after treatment.

In [None]:
ird_msc = ird_xenium_merge[ird_xenium_merge.obs['ct'] == 'MSC'].copy()

In [None]:
ird_msc_v6 = ird_msc[ird_msc.obs['Panel'] == 'BYGXJ6_hMulti'].copy()

In [None]:
sc.tl.rank_genes_groups(ird_msc_v6, groupby ='Collection', method = 'wilcoxon')

In [None]:
sc.pl.rank_genes_groups(ird_msc_v6, n_genes = 25, show = False)

In [None]:
sc.pl.dotplot(ird_xenium_merge, var_names = ['CD74', 'FTH1'], groupby = 'annot_timepoint', swap_axes = True)

### Plasma cell contamination issue

Initial differential expression analysis reveals MSCs in NDMM samples are heavily confounded by plasma cell gene expression. This contamination likely arises from:
1. **Transcript leakage**: High plasma cell abundance leads to transcript leakage into neighboring cells
2. **Segmentation errors**: Cells near plasma cells may incorrectly capture plasma transcripts

**Solution**: Apply stringent filtering to remove MSCs expressing canonical plasma cell markers before analyzing MSC-specific gene expression changes.
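
The filtering idea can be sketched on a toy count matrix: sum the counts of the marker genes per cell and keep only cells whose total is zero. The gene list, matrix values, and variable names below are hypothetical and serve only to illustrate the logic; they are not taken from the study data.

```python
import numpy as np
from scipy import sparse

# Toy count matrix: 4 cells x 3 genes (made-up values, not study data)
genes = ['CXCL12', 'SDC1', 'MZB1']
counts = sparse.csr_matrix(np.array([
    [5, 0, 0],   # no marker counts -> kept
    [3, 2, 0],   # SDC1 detected -> removed
    [0, 0, 1],   # MZB1 detected -> removed
    [4, 0, 0],   # no marker counts -> kept
]))
pc_markers = ['SDC1', 'MZB1']  # illustrative subset of plasma cell markers

# Sum marker counts per cell; any non-zero total flags the cell
marker_cols = [genes.index(g) for g in pc_markers]
marker_sum = np.asarray(counts[:, marker_cols].sum(axis=1)).ravel()
keep = marker_sum == 0

filtered = counts[keep]
print(f"kept {int(keep.sum())} of {counts.shape[0]} cells")
```

Because any detected marker transcript removes the cell, this filter is deliberately strict and can discard a large share of cells in samples with heavy plasma cell infiltration.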

In [None]:
# Stringent filtering - remove MSCs with any expression of the genes listed below
# (plasma cell markers SDC1, SLAMF7, MZB1, TNFRSF17, plus B2M, TENT5C, CD74)
msc_PCexp = ird_msc[:, ['SDC1', 'SLAMF7', 'MZB1', 'TNFRSF17', 'B2M', 'TENT5C', 'CD74']].X.toarray().sum(axis=1)
ird_msc.obs['PC_exp'] = msc_PCexp > 1e-6
ird_msc_PC_filtered = ird_msc[~ird_msc.obs['PC_exp']].copy()
len(ird_msc_PC_filtered)



### CXCL12/CXCR4 axis analysis

The **CXCL12-CXCR4 axis** is critical for plasma cell homing and retention in the bone marrow niche:

- **CXCL12** (SDF-1): Chemokine produced by MSCs and other stromal cells that attracts CXCR4+ cells
- **CXCR4**: Chemokine receptor expressed on plasma cells and B cell precursors

**Analysis approach:**
1. Quantify CXCL12 expression in filtered MSCs across disease stages
2. Quantify CXCR4 expression in B-lineage cells (Early B, Mature B, Plasma cells)


**Interpretation notes:**
- Decreased MSC CXCL12 may indicate stromal dysfunction and reduced niche support
- Altered CXCR4 levels in plasma cells may affect their retention and survival in the niche
- CellChat analysis using Xenium data (done separately) indicates that the CXCL12-CXCR4 signaling axis is affected
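
The quantification steps in the analysis approach above average expression within each sample before comparing timepoints, so each sample contributes a single data point. A minimal sketch on made-up values (the sample IDs and expression values are hypothetical):

```python
import pandas as pd

# One row per cell; values are invented for illustration
cells = pd.DataFrame({
    'Sample': ['S1', 'S1', 'S2', 'S2', 'S3', 'S3'],
    'Collection': ['NBM', 'NBM', 'NDMM', 'NDMM', 'PT', 'PT'],
    'CXCL12_exp': [2.0, 4.0, 1.0, 0.0, 1.5, 2.5],
})

# Collapse cells to one mean value per sample, keeping the timepoint label
per_sample = (
    cells.groupby(['Sample', 'Collection'])['CXCL12_exp']
    .mean()
    .reset_index()
)
print(per_sample)
```

Comparing per-sample means rather than individual cells avoids treating thousands of cells from the same sample as independent observations.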



In [None]:
sc.tl.rank_genes_groups(ird_msc_PC_filtered, groupby ='Collection', method = 'wilcoxon')

In [None]:
sc.pl.rank_genes_groups(ird_msc_PC_filtered, n_genes = 25, show = False)

In [None]:
ird_msc_PC_filtered_v6 = ird_msc_PC_filtered[ird_msc_PC_filtered.obs['Panel'] == 'BYGXJ6_hMulti'].copy()
sc.tl.rank_genes_groups(ird_msc_PC_filtered_v6, groupby ='Collection', method = 'wilcoxon')
sc.pl.rank_genes_groups(ird_msc_PC_filtered_v6, n_genes = 25, show = False)

In [None]:
sc.pl.dotplot(ird_msc_PC_filtered, var_names = ['CXCL12'], groupby = 'Collection')

In [None]:
ird_msc_PC_filtered.obs['CXCL12_exp'] = ird_msc_PC_filtered[:, 'CXCL12'].X.toarray().flatten()
cxcl12_exp = ird_msc_PC_filtered.obs.groupby('Sample')['CXCL12_exp'].mean().reset_index()
cxcl12_exp['Collection'] = cxcl12_exp['Sample'].map(ird_msc_PC_filtered.obs[['Sample', 'Collection']].drop_duplicates().set_index('Sample')['Collection'])

timecols = {"NBM": "#0C7515", "NDMM": "#E619B9", "PT": "#CF99C3"} 
plot_utils.plot_comparison_with_significance(cxcl12_exp, 'Collection', 'CXCL12_exp', order=['NBM', 'NDMM', 'PT'], palette=timecols, 
                                       xlabel='Timepoint', ylabel='Mean CXCL12 Expression in MSCs', title=None,
                                       save_path='/diskmnt/Users2/chouw/Projects/BM_spatial/IRD/IRD_Xenium_merge_CXCL12_expression_boxplot_v111125.pdf', figsize=(6, 5))

In [None]:
# Boxplot of mean CXCR4 expression in PC, Mature B, and Early B cells per sample, grouped by Collection
ird_B_lineage = ird_xenium_merge[ird_xenium_merge.obs['ct'].isin(['PC', 'Mature B', 'Early B'])].copy()
ird_B_lineage.obs['CXCR4_exp'] = ird_B_lineage[:, 'CXCR4'].X.toarray().flatten()
cxcr4_exp = ird_B_lineage.obs.groupby(['Sample', 'ct'])['CXCR4_exp'].mean().reset_index()
cxcr4_exp['Collection'] = cxcr4_exp['Sample'].map(ird_B_lineage.obs[['Sample', 'Collection']].drop_duplicates().set_index('Sample')['Collection'])

timecols = {"NBM": "#0C7515", "NDMM": "#E619B9", "PT": "#CF99C3"} 
order = cxcr4_exp['ct'].unique().tolist()          # set your desired x order
hue_order = sorted(cxcr4_exp['Collection'].unique())  # set your desired hue order

plot_utils.plot_multigroup_boxplot_with_significance(cxcr4_exp, x_col='ct', y_col='CXCR4_exp', hue_col='Collection', 
                                                     order=order, hue_order=hue_order, show_swarm=True, 
                                                     palette=timecols, figsize=(6, 5), xlabel='Cell type', ylabel='Mean CXCR4 expression', title=None, save_path=None)
plt.legend(bbox_to_anchor=(1.0, 1), loc='upper left')
plt.ylim(-0.1, 1.1)
plt.tight_layout()
plt.savefig('/diskmnt/Users2/chouw/Projects/BM_spatial/IRD/IRD_Xenium_merge_CXCR4_expression_boxplot_v111125.pdf', dpi = 300)

In [None]:
sc.pl.dotplot(ird_xenium_merge, var_names = ['CXCL12'], groupby = 'ct_timepoint', swap_axes = True)

In [None]:
sc.pl.dotplot(ird_xenium_merge, var_names = ['CXCR4'], groupby = 'annot_timepoint', swap_axes = True)

In [None]:
sc.pl.dotplot(ird_msc_PC_filtered, var_names = ['DKK1'], groupby = 'Collection', dot_max = .02)

### Distance between MSCs and B/Plasma cells
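
The distances in this subsection come from the custom helper `spatial_utils.nearest_dist_between_two_celltypes`, whose implementation is not shown in this notebook. The sketch below is a hypothetical stand-in, assuming the helper returns, for each cell in the first DataFrame, the distance to the nearest cell of the second DataFrame within the same sample, stored in a column named `nearest_dist_to_df2`. The function name `nearest_dist_sketch` and the toy coordinates are illustrative only.

```python
import numpy as np
import pandas as pd
from scipy.spatial import KDTree

def nearest_dist_sketch(df1, df2, sample_col='Sample', x_col='x_centroid', y_col='y_centroid'):
    """Hypothetical stand-in: per-sample nearest distance from df1 cells to df2 cells."""
    out = []
    for sample, query in df1.groupby(sample_col):
        ref = df2[df2[sample_col] == sample]
        if ref.empty:
            continue  # no reference cells in this sample, so no distance is defined
        tree = KDTree(ref[[x_col, y_col]].to_numpy())
        dists, _ = tree.query(query[[x_col, y_col]].to_numpy())
        out.append(query.assign(nearest_dist_to_df2=dists))
    if not out:
        return pd.DataFrame(columns=[*df1.columns, 'nearest_dist_to_df2'])
    return pd.concat(out)

# Toy coordinates in microns (made up)
msc_toy = pd.DataFrame({'Sample': ['A', 'A', 'B'],
                        'x_centroid': [0.0, 3.0, 10.0],
                        'y_centroid': [1.0, 4.0, 10.0]})
b_toy = pd.DataFrame({'Sample': ['A', 'B'],
                      'x_centroid': [0.0, 10.0],
                      'y_centroid': [0.0, 13.0]})
toy_dists = nearest_dist_sketch(msc_toy, b_toy)
print(toy_dists)
```

Building one tree per sample keeps cells from different tissue sections from being treated as neighbors, since their coordinate systems are unrelated.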

In [None]:
msc_df = ird_msc.obs[['Sample', 'x_centroid', 'y_centroid']].copy()
b_df = ird_xenium_merge.obs[ird_xenium_merge.obs['ct'] == 'Mature B'][['Sample', 'x_centroid', 'y_centroid']].copy()
earlyB_df = ird_xenium_merge.obs[ird_xenium_merge.obs['ct'] == 'Early B'][['Sample', 'x_centroid', 'y_centroid']].copy()
PC_df = ird_xenium_merge.obs[ird_xenium_merge.obs['ct'] == 'PC'][['Sample', 'x_centroid', 'y_centroid']].copy()
msc_B_dist = spatial_utils.nearest_dist_between_two_celltypes(msc_df, b_df, sample_col = 'Sample', x_col='x_centroid', y_col='y_centroid')
msc_earlyB_dist = spatial_utils.nearest_dist_between_two_celltypes(msc_df, earlyB_df, sample_col = 'Sample', x_col='x_centroid', y_col='y_centroid')
msc_PC_dist = spatial_utils.nearest_dist_between_two_celltypes(msc_df, PC_df, sample_col = 'Sample', x_col='x_centroid', y_col='y_centroid')
msc_B_dist['ref_ct'] = 'B cell'
msc_earlyB_dist['ref_ct'] = 'Early B'
msc_PC_dist['ref_ct'] = 'Plasma cell'
msc_dists = pd.concat([msc_B_dist, msc_earlyB_dist, msc_PC_dist], axis = 0)
msc_dists.head()

In [None]:
from scipy.stats import gaussian_kde

res = []
for sample in msc_dists['Sample'].unique():
    sample_df = msc_dists[msc_dists['Sample'] == sample]
    for ref_ct in sample_df['ref_ct'].unique():
        sample_ref_df = sample_df[sample_df['ref_ct'] == ref_ct]
        if len(sample_ref_df) < 10:
            continue
        # Estimate distance using KDE
        kde = gaussian_kde(sample_ref_df['nearest_dist_to_df2'])
        dist_eval = np.arange(0, 501, 1)
        kde_pdf = kde.pdf(dist_eval)
        most_likely_dist = dist_eval[np.argmax(kde_pdf)]
        res.append({
            'Sample': sample,
            'Reference Cell Type': ref_ct,
            'Most Likely Distance': most_likely_dist,
            'Average Distance': sample_ref_df['nearest_dist_to_df2'].mean(),
            'Median Distance': sample_ref_df['nearest_dist_to_df2'].median()
        })

most_likely_dists_df = pd.DataFrame(res)


In [None]:
most_likely_dists_df

In [None]:
collection_meta = ird_xenium_merge.obs[['Sample', 'Collection']].drop_duplicates().set_index('Sample')
most_likely_dists_df['Collection'] = most_likely_dists_df['Sample'].map(collection_meta['Collection'])

In [None]:
plot_utils.plot_multigroup_boxplot_with_significance(most_likely_dists_df, x_col = 'Reference Cell Type', y_col = 'Most Likely Distance', hue_col = 'Collection', 
                                                    order=None, hue_order=None, show_swarm=True, palette=timecols, figsize=(6, 5), xlabel='Reference cell type', ylabel=r'Distance to MSCs ($\mu$m)', 
                                                    title=None, save_path='/diskmnt/Users2/chouw/Projects/BM_spatial/IRD/IRD_JW_Xenium_merge_MSC_PCB_distance.pdf')

## Section 5: Spatial Neighbor Cell Type Composition

Analyze the cellular neighborhood (nearest neighbors) composition around each cell type using spatial proximity. This approach identifies which cell types preferentially localize near each other, revealing potential cellular interactions and microenvironmental niches.

### Nearest neighbor identification: Delaunay + KDTree hybrid approach

The neighbor detection combines two complementary methods:
1. **Delaunay triangulation**: Identifies neighbors based on spatial tessellation
2. **Distance threshold (KDTree)**: Filters neighbors by maximum distance, removing distant connections from the triangulation

Only cells that satisfy **both** criteria (Delaunay neighbors within distance threshold) are retained as true spatial neighbors. This hybrid approach avoids spurious long-range connections while maintaining biologically relevant local neighborhoods.
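
To see why the distance filter matters, the toy example below uses an equivalent edge-based formulation: a pair of cells passes both criteria exactly when it is a Delaunay edge no longer than the threshold. This is a vectorized sketch for illustration, not the function used in this notebook; `delaunay_edges_within` and the coordinates are hypothetical.

```python
import numpy as np
from scipy.spatial import Delaunay

def delaunay_edges_within(points, distance_threshold):
    """Return (source, neighbor) index pairs for Delaunay edges no longer than the threshold."""
    tri = Delaunay(points)
    indptr, indices = tri.vertex_neighbor_vertices
    # Expand the CSR-like structure into one source index per neighbor entry
    src = np.repeat(np.arange(len(points)), np.diff(indptr))
    lengths = np.linalg.norm(points[src] - points[indices], axis=1)
    keep = lengths <= distance_threshold
    return src[keep], indices[keep]

# Four nearby cells plus one isolated cell far to the right (made-up coordinates)
points = np.array([[0., 0.], [10., 0.], [0., 10.], [11., 12.], [100., 5.]])
src, nbr = delaunay_edges_within(points, distance_threshold=25)
pairs = set(zip(src.tolist(), nbr.tolist()))
print(sorted(pairs))
```

The triangulation alone connects the isolated cell to cells roughly 90 units away; the threshold drops those edges, so the isolated cell ends up with no neighbors.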

In [None]:
from scipy.spatial import Delaunay, KDTree
def get_proximal_neighbors(df, x_col='x_centroid', y_col='y_centroid', distance_threshold=50):
    """
    Identify proximal spatial neighbors using Delaunay triangulation intersected with distance threshold.
    
    This hybrid approach combines:
    1. Delaunay triangulation to find neighbors
    2. KDTree radius search to enforce distance constraints
    
    Only cells that are BOTH Delaunay neighbors AND within the distance threshold are retained,
    ensuring biologically relevant local neighborhoods without spurious long-range connections.

    Parameters:
    -----------
    df : pd.DataFrame
        DataFrame containing cell coordinates with an index that will be preserved
    x_col : str
        Column name for x-coordinates (default: 'x_centroid')
    y_col : str  
        Column name for y-coordinates (default: 'y_centroid')
    distance_threshold : float
        Maximum distance (in same units as coordinates) for considering cells as neighbors.
        Typical values: 20-50 microns for Xenium data

    Returns:
    --------
    dict
        Dictionary mapping each cell index to a list of neighbor cell indices.
        Keys and values are original DataFrame indices.
        
    Example:
    --------
    # Get neighbors within 25 microns for a sample
    sample_data = cell_info[cell_info['Sample'] == 'Sample1']
    neighbors = get_proximal_neighbors(sample_data, distance_threshold=25)
    
    # Access neighbors of first cell
    first_cell_idx = sample_data.index[0]
    first_cell_neighbors = neighbors[first_cell_idx]
    """
    if not isinstance(df, pd.DataFrame):
        raise ValueError("Input 'df' must be a pandas DataFrame.")
    if x_col not in df.columns:
        raise ValueError(f"Column '{x_col}' not found in DataFrame.")
    if y_col not in df.columns:
        raise ValueError(f"Column '{y_col}' not found in DataFrame.")

    points = df[[x_col, y_col]].values
    original_indices_map = df.index # To map points array indices back to original df indices

    
    if len(points) < 3:
        # Delaunay triangulation requires at least 3 points to form a simplex (triangle in 2D).
        print("Not enough points to perform Delaunay triangulation. Need at least 3 points.")
        return {idx: [] for idx in df.index} # Return mapping to original DataFrame indices
        
    tri = Delaunay(points)

    # tri.vertex_neighbor_vertices is a CSR-like structure:
    # indptr[i] to indptr[i+1] gives the slice in 'indices' for the i-th point in the 'points' array
    # indices[indptr[i]:indptr[i+1]] gives the actual indices (referring to the 'points' array)
    # of the neighbors for the i-th point.
    indptr, delaunay_indices = tri.vertex_neighbor_vertices
    kdtree = KDTree(points)

    proximal_neighbors = {}
    for i in range(len(points)):
        current_original_idx = original_indices_map[i]
        
        # a. Get Delaunay neighbors (indices relative to 'points' array)
        delaunay_neighbors_for_i = delaunay_indices[indptr[i]:indptr[i+1]]
        delaunay_set = set(delaunay_neighbors_for_i)
        
        # b. Get K-D tree neighbors (indices relative to 'points' array)
        # query_ball_point returns a list of indices of points within the radius
        kdtree_neighbors_for_i = kdtree.query_ball_point(points[i], r=distance_threshold)
        # Remove the point itself from its k-d tree neighbors
        kdtree_set = set(kdtree_neighbors_for_i) - {i} 
        
        # c. Intersect the two sets
        intersected_neighbor_indices_in_points_array = list(delaunay_set.intersection(kdtree_set))
        
        # d. Map intersected indices back to original DataFrame indices
        final_neighbor_original_indices = [original_indices_map[n_idx] for n_idx in intersected_neighbor_indices_in_points_array]
        
        proximal_neighbors[current_original_idx] = final_neighbor_original_indices
        
    return proximal_neighbors


In [None]:
ird_xenium_merge = sc.read_h5ad("/diskmnt/Projects/myeloma_scRNA_analysis/MMY_IRD/Xenium/analysis/merged_filtered.h5ad")

In [None]:
ird_xenium_merge

In [None]:
ird_cell_info = ird_xenium_merge.obs.copy()
ird_cell_info.head()

In [None]:
ird_cell_info['annot'].unique().tolist()

In [None]:
ird_cell_info['Collection'].unique().tolist()

In [None]:
from collections import Counter
def neighboring_celltype_counts(cell_info, ct_col_name, ct_of_interest, neighbors_dict):
    """
    Count the cell type composition of neighbors around a specific cell type.
    
    For all cells of a given type, aggregates their neighbors' cell types to understand
    the composition around that cell type.
    
    Parameters:
    -----------
    cell_info : pd.DataFrame
        DataFrame with cell metadata including cell type annotations
    ct_col_name : str
        Column name containing cell type annotations (e.g., 'annot', 'group')
    ct_of_interest : str
        Specific cell type to analyze (e.g., 'PC', 'MSC', 'T')
    neighbors_dict : dict
        Output from get_proximal_neighbors() function mapping cell indices to neighbor indices
        
    Returns:
    --------
    dict
        Dictionary mapping each cell type to the count of how many times it appears
        as a neighbor of the cell type of interest. All cell types in the dataset
        are included (with 0 for types that never appear as neighbors).
        
    Example output:
    ---------------
    {'PC': 1250, 'MSC': 450, 'T': 890, 'B': 120, ...}
    # Interpretation: Plasma cells (PC) have 1250 plasma cell neighbors total,
    # 450 MSC neighbors, 890 T cell neighbors, etc.
    
    Example usage:
    --------------
    # Count what cell types are around plasma cells in a sample
    neighbors = get_proximal_neighbors(sample_data)
    pc_neighborhood = neighboring_celltype_counts(
        sample_data, 
        ct_col_name='annot',
        ct_of_interest='PC',
        neighbors_dict=neighbors
    )
    """
    ct_index = cell_info[cell_info[ct_col_name] == ct_of_interest].index.values
    ct_neighbor_indices = [neighbors_dict[i] for i in ct_index]
    ct_neighbor_indices_flat = [item for sublist in ct_neighbor_indices for item in sublist]
    ct_neighbors = [cell_info[ct_col_name][i] for i in ct_neighbor_indices_flat]
    ct_neighbors_counter = Counter(ct_neighbors)
    
    # Get all unique cell types
    all_cell_types = np.unique(cell_info[ct_col_name])
    # Create dictionary with zeros for all cell types, then update with actual counts
    ct_neighbors_counts = {ct: 0 for ct in all_cell_types}
    ct_neighbors_counts.update(ct_neighbors_counter)

    return ct_neighbors_counts

In [None]:
test = ird_cell_info[ird_cell_info['Sample'] == 'IRD_S18-30740A1U3']
all_nbs = get_proximal_neighbors(test, x_col='x_centroid', y_col='y_centroid', distance_threshold=25)
nb_ct = neighboring_celltype_counts(test, 'annot', 'PC', all_nbs)
nb_ct

In [None]:
def run_neighboring_celltype_analysis(df, ct_col, sample_col, distance_threshold=25, x_col='x_centroid', y_col='y_centroid'):
    """
    Calculates neighbor composition across all cell types and samples. Calls neighboring_celltype_counts().

    Parameters:
    -----------
    df : pd.DataFrame
        DataFrame with cell metadata including coordinates, cell types, and sample IDs
    ct_col : str
        Column name for cell type annotations
    sample_col : str
        Column name for sample identifiers
    distance_threshold : float
        Maximum distance (microns) for neighbor consideration (default: 25)
    x_col : str
        Column name for x-coordinates (default: 'x_centroid')
    y_col : str
        Column name for y-coordinates (default: 'y_centroid')
        
    Returns:
        pd.DataFrame: DataFrame with columns for cell type, neighboring cell type, sample ID, and counts of neighboring cell types.
    """

    # Initialize an empty DataFrame to store results
    all_nb_df = pd.DataFrame()

    # Loop through each sample
    for sample in df[sample_col].unique().tolist():
        sample_df = df[df[sample_col] == sample]
        # Get proximal neighbor dict for all cells
        all_neighbors = get_proximal_neighbors(sample_df, x_col=x_col, y_col=y_col, distance_threshold=distance_threshold)

        for ct in sample_df[ct_col].unique().tolist():       
            # Count neighboring cell types of all cell types
            nb_counts = neighboring_celltype_counts(sample_df, ct_col, ct, all_neighbors)
            nb_df = pd.DataFrame.from_dict(nb_counts, orient = "index", columns = ["counts"])
            nb_df['Cell Type'] = ct
            nb_df['Neighboring Cell Type'] = nb_df.index
            nb_df['Sample'] = sample
            nb_df.reset_index(drop = True, inplace = True)
            all_nb_df = pd.concat([all_nb_df, nb_df], axis = 0)
            
        print(f"Finished processing sample {sample}")
    
    return all_nb_df


all_nb_df = run_neighboring_celltype_analysis(ird_cell_info, 'annot', 'Sample', distance_threshold = 25, x_col='x_centroid', y_col='y_centroid')

### Scaled neighboring cell type composition heatmap
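
The scaling used in this subsection has two steps: convert neighbor counts to percentages within each (sample, cell type) pair, then z-score each neighboring cell type across cell types within a sample. A small sketch on invented counts, assuming the same column names as `all_nb_df`:

```python
import numpy as np
import pandas as pd

# Invented neighbor counts for one sample with two cell types
nb = pd.DataFrame({
    'Sample': ['S1'] * 4,
    'Cell Type': ['PC', 'PC', 'MSC', 'MSC'],
    'Neighboring Cell Type': ['PC', 'MSC', 'PC', 'MSC'],
    'counts': [80, 20, 30, 70],
})

# Step 1: percentage of each neighbor type among all neighbors of a cell type
nb['pct'] = nb['counts'] / nb.groupby(['Sample', 'Cell Type'])['counts'].transform('sum') * 100

# Step 2: z-score each neighbor type across cell types (pandas std uses ddof=1)
nb['zscore_nct'] = nb.groupby(['Sample', 'Neighboring Cell Type'])['pct'].transform(
    lambda x: (x - x.mean()) / x.std()
)
print(nb)
```

A neighboring cell type whose percentage is identical across all cell types has zero standard deviation, so its z-scores come out as NaN and appear as empty cells in a heatmap.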

In [None]:
all_nb_df.to_csv("/diskmnt/Users2/chouw/Projects/BM_spatial/IRD_Xenium_merge_neighboringCelltype_counts.csv")

In [None]:
# Collect sample IDs for each collection timepoint (NBM, NDMM, PT)
nbm_samples = ird_cell_info[ird_cell_info['Collection'] == 'NBM']['Sample'].unique().tolist()
ndmm_samples = ird_cell_info[ird_cell_info['Collection'] == 'NDMM']['Sample'].unique().tolist()
pt_samples = ird_cell_info[ird_cell_info['Collection'] == 'PT']['Sample'].unique().tolist()

In [None]:
# Calculate z score for each neighboring cell type across cell types within each sample
all_nb_df['pct'] = all_nb_df['counts'] / all_nb_df.groupby(['Sample', 'Cell Type'])['counts'].transform('sum') * 100
all_nb_df['zscore_nct'] = all_nb_df.groupby(['Sample', 'Neighboring Cell Type'])['pct'].transform(lambda x: (x - x.mean()) / x.std())

In [None]:
nbm_nb_df = all_nb_df[all_nb_df['Sample'].isin(nbm_samples)]
ndmm_nb_df = all_nb_df[all_nb_df['Sample'].isin(ndmm_samples)]
pt_nb_df = all_nb_df[all_nb_df['Sample'].isin(pt_samples)]



In [None]:
# Mean z-score for each (cell type, neighboring cell type) pair across NBM samples
summary_nbm_df = nbm_nb_df.groupby(['Cell Type', 'Neighboring Cell Type']).agg(
    mean_zscore_nct=('zscore_nct', 'mean')
).reset_index()
summary_nbm_df

In [None]:
summary_nbm_df['mean_pct_scaled'] = summary_nbm_df.groupby('Neighboring Cell Type')['mean_zscore_nct'].transform(
    lambda x: (x - x.mean()) / x.std() if x.std() > 0 else 0
)
# Draw heatmap of the mean z-score (mean_zscore_nct); mean_pct_scaled is computed above but not plotted
plt.figure(figsize = (12, 8))
heatmap_data = summary_nbm_df.pivot(index = 'Cell Type', columns = 'Neighboring Cell Type', values = 'mean_zscore_nct')
sns.heatmap(heatmap_data, cmap = 'vlag', center = 0, cbar_kws={'label': 'Mean Scaled Neighboring Cell Type Composition'})
plt.title("NBM samples")
plt.tight_layout()
#plt.savefig("/diskmnt/Users2/chouw/Projects/BM_spatial/NBM_MM_Xenium_merge_WC_neighboringCelltype_composition_healthy_samples_heatmap_v09192025.pdf", dpi = 300, transparent = True)

In [None]:
summary_ndmm_df = ndmm_nb_df.groupby(['Cell Type', 'Neighboring Cell Type']).agg(
    mean_zscore_nct=('zscore_nct', 'mean')
).reset_index()
summary_ndmm_df['mean_pct_scaled'] = summary_ndmm_df.groupby('Neighboring Cell Type')['mean_zscore_nct'].transform(
    lambda x: (x - x.mean()) / x.std() if x.std() > 0 else 0
)
# Draw heatmap of the mean z-score (mean_zscore_nct); mean_pct_scaled is computed above but not plotted
plt.figure(figsize = (12, 8))
heatmap_data = summary_ndmm_df.pivot(index = 'Cell Type', columns = 'Neighboring Cell Type', values = 'mean_zscore_nct')
sns.heatmap(heatmap_data, cmap = 'vlag', center = 0, cbar_kws={'label': 'Mean Scaled Neighboring Cell Type Composition'})
plt.title("NDMM samples")
plt.tight_layout()

In [None]:
summary_pt_df = pt_nb_df.groupby(['Cell Type', 'Neighboring Cell Type']).agg(
    mean_zscore_nct=('zscore_nct', 'mean')
).reset_index()
summary_pt_df['mean_pct_scaled'] = summary_pt_df.groupby('Neighboring Cell Type')['mean_zscore_nct'].transform(
    lambda x: (x - x.mean()) / x.std() if x.std() > 0 else 0
)
# Draw heatmap of the mean z-score (mean_zscore_nct); mean_pct_scaled is computed above but not plotted
plt.figure(figsize = (12, 8))
heatmap_data = summary_pt_df.pivot(index = 'Cell Type', columns = 'Neighboring Cell Type', values = 'mean_zscore_nct')
sns.heatmap(heatmap_data, cmap = 'vlag', center = 0, cbar_kws={'label': 'Mean Scaled Neighboring Cell Type Composition'})
plt.title("PT samples")
plt.tight_layout()

### Comparison between timepoints of specific cell types

In [None]:
all_nb_df['pct'] = all_nb_df['counts'] / all_nb_df.groupby(['Sample', 'Cell Type'])['counts'].transform('sum') * 100
metadata = ird_cell_info[['Sample', 'Collection']].drop_duplicates().set_index('Sample')
all_nb_df['Collection'] = all_nb_df['Sample'].map(metadata['Collection'])

In [None]:
B_nb_df = all_nb_df[all_nb_df['Cell Type'] == 'B']

plt.figure(figsize = (12, 6))
sns.boxplot(B_nb_df, x = 'Neighboring Cell Type', y = 'pct', hue = 'Collection', showfliers = False)
sns.swarmplot(B_nb_df, x = 'Neighboring Cell Type', y = 'pct', hue = 'Collection', dodge = True, color = 'k', alpha = .5, size = 3)
plt.xticks(rotation = 45)
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.title('Nearest neighboring cell type composition of B cells')
plt.tight_layout()

In [None]:
cytoT_nb_df = all_nb_df[all_nb_df['Cell Type'] == 'Cytotoxic T'].copy()  # copy so the column assignment below does not act on a slice
metadata = ird_cell_info[['Sample', 'Collection']].drop_duplicates().set_index('Sample')
cytoT_nb_df['Collection'] = cytoT_nb_df['Sample'].map(metadata['Collection'])

plt.figure(figsize = (12, 6))
sns.boxplot(cytoT_nb_df, x = 'Neighboring Cell Type', y = 'pct', hue = 'Collection', showfliers = False)
sns.swarmplot(cytoT_nb_df, x = 'Neighboring Cell Type', y = 'pct', hue = 'Collection', dodge = True, color = 'k', alpha = .5, size = 3)
plt.xticks(rotation = 45)
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.title('Nearest neighboring cell type composition of Cytotoxic T cells')
plt.tight_layout()

In [None]:
exhT_nb_df = all_nb_df[all_nb_df['Cell Type'] == 'Exhausted T'].copy()  # copy so the column assignment below does not act on a slice
metadata = ird_cell_info[['Sample', 'Collection']].drop_duplicates().set_index('Sample')
exhT_nb_df['Collection'] = exhT_nb_df['Sample'].map(metadata['Collection'])

plt.figure(figsize = (12, 6))
sns.boxplot(exhT_nb_df, x = 'Neighboring Cell Type', y = 'pct', hue = 'Collection', showfliers = False)
sns.swarmplot(exhT_nb_df, x = 'Neighboring Cell Type', y = 'pct', hue = 'Collection', dodge = True, color = 'k', alpha = .5, size = 3)
plt.xticks(rotation = 45)
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.title('Nearest neighboring cell type composition of Exhausted T cells')
plt.tight_layout()

## Section 6: Distance to reference cell type

In [None]:
ird_xenium_merge = sc.read_h5ad("/diskmnt/Projects/myeloma_scRNA_analysis/MMY_IRD/Xenium/analysis/merged_filtered.h5ad")

In [None]:
ird_cell_info = ird_xenium_merge.obs.copy()
ird_cell_info.head()

In [None]:
test = ird_cell_info[ird_cell_info['Sample'] == 'IRD_S18-30740A1U3']

In [None]:
test['annot'].unique().tolist()

In [None]:
from scipy.spatial import KDTree
bcell_info = test[test['annot'] == 'B']
bcell_tree = KDTree(bcell_info[['x_centroid', 'y_centroid']].values)

In [None]:
myeloid_info = test[test['annot'] == 'Granulo.']
bcell_tree.query(myeloid_info[['x_centroid', 'y_centroid']].values)

In [None]:
from scipy.spatial import KDTree
def nearest_dist_to_ref_celltype(df, ct_col_name, ref_ct, x_col='x_centroid', y_col='y_centroid'):
    """
    This function finds the nearest distance for all cells in df to the nearest cell of a reference cell type.

    Args:
        df (pd.DataFrame): DataFrame containing cell coordinates and cell type annotation.
        ct_col_name (str): Name of the column in df that contains cell type annotations.
        ref_ct (str): The reference cell type to which distances will be calculated.
        x_col (str): Name of the column for x-coordinates. Default is 'x_centroid'.
        y_col (str): Name of the column for y-coordinates. Default is 'y_centroid'.
    Returns:
        res_df (pd.DataFrame): DataFrame with original df columns plus a new column 'nearest_dist_to_{ref_ct}'.
    """
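    # Check inputs
    missing = [c for c in (ct_col_name, x_col, y_col) if c not in df.columns]
    if missing:
        raise ValueError(f"Column(s) not found in DataFrame: {missing}")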

    # Extract coordinates of reference cell type to build kdtree
    ref_cells = df[df[ct_col_name] == ref_ct]
    if ref_cells.empty:
        raise ValueError(f"No cells found for reference cell type '{ref_ct}' in column '{ct_col_name}'.")
    ref_tree = KDTree(ref_cells[[x_col, y_col]].values)

    # Query kdtree for all cells to find nearest reference cell type
    all_cells_coords = df[[x_col, y_col]].values
    dists, _ = ref_tree.query(all_cells_coords)

    # Create result DataFrame
    res_df = df.copy()
    res_df[f'nearest_dist_to_{ref_ct}'] = dists

    return res_df

In [None]:
test1 = nearest_dist_to_ref_celltype(test, 'annot', 'B', x_col='x_centroid', y_col='y_centroid')

In [None]:
from scipy.stats import gaussian_kde
kde1 = gaussian_kde(test1[test1['annot']=='Granulo.']['nearest_dist_to_B'])
dist_eval = np.arange(0, 105, 1)
kde1_pdf = kde1.pdf(dist_eval)
plt.plot(dist_eval, kde1_pdf)
print(dist_eval[np.argmax(kde1_pdf)])

In [None]:
ct_to_plot = test1['annot'].unique().tolist()
ct_to_plot.remove('B')

nsubplots = len(ct_to_plot)
ncols = 3
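nrows = -(-nsubplots // ncols)  # ceiling division, avoids an extra empty row
# squeeze=False keeps axes 2D so axes[row, col] indexing works even with a single row
fig, axes = plt.subplots(nrows=nrows, ncols=ncols, figsize=(3*ncols, 3*nrows), squeeze=False)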

for i in range(len(ct_to_plot)):
    sns.histplot(test1[test1['annot']==ct_to_plot[i]], x = 'nearest_dist_to_B', binwidth = 5, stat = 'density', ax = axes[i//ncols, i%ncols])
    axes[i//ncols, i%ncols].set_title(ct_to_plot[i])
    axes[i//ncols, i%ncols].set_xlim(0, 50)


In [None]:
hist, bin_edges = np.histogram(test1[test1['annot']=='Granulo.']['nearest_dist_to_B'], bins=range(0, 105, 5), density=True)

In [None]:
bin_edges

### Most likely distance to a reference cell type
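
The "most likely distance" used here is the mode of a kernel density estimate (KDE) fitted to the nearest-distance values and evaluated on a fixed grid. A minimal sketch on synthetic distances (the simulated distribution is an assumption chosen only to give a clear peak):

```python
import numpy as np
from scipy.stats import gaussian_kde

# Synthetic nearest distances centered near 30 microns (not study data)
rng = np.random.default_rng(0)
dists = np.abs(rng.normal(loc=30, scale=5, size=500))

# Fit the KDE and evaluate it on a 1-micron grid
kde = gaussian_kde(dists)
dist_eval = np.arange(0, 201, 1)
kde_pdf = kde.pdf(dist_eval)

# The grid point with the highest density is the most likely distance
most_likely_dist = dist_eval[np.argmax(kde_pdf)]
print(most_likely_dist)
```

The estimate can only take values on the evaluation grid, so a distribution whose peak lies beyond the grid's upper bound would be reported at that bound.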

In [None]:
ird_cell_info = ird_xenium_merge.obs.copy()

In [None]:
from scipy.stats import gaussian_kde

ref_ct = 'B'
ct_col = 'annot'
dist_eval = np.arange(0, 201, 1)
all_other_ct = ird_cell_info[ct_col].unique().tolist()
all_other_ct.remove(ref_ct)

dist_to_B_df = pd.DataFrame(columns = ['Sample', 'Cell Type', 'Most Likely Distance'])
for sample in ird_cell_info['Sample'].unique().tolist():
    sample_df = ird_cell_info[ird_cell_info['Sample'] == sample]
    sample_df1 = nearest_dist_to_ref_celltype(sample_df, ct_col, ref_ct, x_col='x_centroid', y_col='y_centroid')
    
    dist = []
    for ct in all_other_ct:
        dist_distribution = sample_df1[sample_df1[ct_col]==ct][f'nearest_dist_to_{ref_ct}']
        if len(dist_distribution) < 10:
            dist.append(np.nan)
            continue
        else:
            kde_ct = gaussian_kde(dist_distribution)
            ct_kde_pdf = kde_ct.pdf(dist_eval)
            ct_dist = dist_eval[np.argmax(ct_kde_pdf)]
            dist.append(ct_dist)

    dist_to_B_ct_df = pd.DataFrame({
        'Sample': sample,
        'Cell Type': all_other_ct,
        'Most Likely Distance': dist
    })
    dist_to_B_df = pd.concat([dist_to_B_df, dist_to_B_ct_df], axis = 0)

In [None]:
dist_to_B_df = dist_to_B_df.dropna()

In [None]:
metadata = ird_cell_info[['Sample', 'Collection']].drop_duplicates().set_index('Sample')
dist_to_B_df['Collection'] = dist_to_B_df['Sample'].map(metadata['Collection'])
plt.figure(figsize = (12, 6))
sns.boxplot(dist_to_B_df, x = 'Cell Type', y = 'Most Likely Distance', hue = 'Collection', showfliers = False)
sns.swarmplot(dist_to_B_df, x = 'Cell Type', y = 'Most Likely Distance', hue = 'Collection', dodge = True, color = 'k', alpha = .7)
plt.title('Most likely distance to B cells by cell type')
plt.xticks(rotation = 45)
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
#plt.savefig("/diskmnt/Users2/chouw/Projects/BM_spatial/IRD_JW_Xenium_merge_nearestDist_to_Bcells_by_celltype_boxplot_v09192025.pdf", dpi = 300, transparent = True)


In [None]:
# Statistical test within each cell type, compare distance to B cells between NBM/NDMM, NBM/PT, NDMM/PT
from scipy.stats import mannwhitneyu
from statsmodels.stats.multitest import multipletests

# Pairwise Mann-Whitney U tests between collections, skipping pairs with an empty group
stat_rows = []
for ct in dist_to_B_df['Cell Type'].unique():
    ct_df = dist_to_B_df[dist_to_B_df['Cell Type'] == ct]
    for group1, group2 in combinations(['NBM', 'NDMM', 'PT'], 2):
        dist1 = ct_df.loc[ct_df['Collection'] == group1, 'Most Likely Distance'].tolist()
        dist2 = ct_df.loc[ct_df['Collection'] == group2, 'Most Likely Distance'].tolist()
        if len(dist1) > 0 and len(dist2) > 0:
            u_stat, p_val = mannwhitneyu(dist1, dist2, alternative='two-sided')
            stat_rows.append({'Cell Type': ct, 'Group1': group1, 'Group2': group2,
                              'U_statistic': u_stat, 'p_value': p_val})

# Collect rows once instead of concatenating inside the loop
stat_results = pd.DataFrame(stat_rows, columns = ['Cell Type', 'Group1', 'Group2', 'U_statistic', 'p_value'])
# Benjamini-Hochberg correction across all cell types and comparisons
stat_results['p_adj'] = multipletests(stat_results['p_value'], method='fdr_bh')[1]
stat_results.to_csv("/diskmnt/Users2/chouw/Projects/BM_spatial/IRD_JW_Xenium_merge_nearestDist_to_Bcells_wilcoxtest_v10242025.csv")
stat_results

In [None]:
ref_ct = 'PC'
dist_eval = np.arange(0, 201, 1)
all_other_ct = ird_cell_info['annot'].unique().tolist()
all_other_ct.remove(ref_ct)

dist_to_PC_df = pd.DataFrame(columns = ['Sample', 'Cell Type', 'Most Likely Distance'])
for i in range(len(ird_cell_info['Sample'].unique().tolist())):
    sample = ird_cell_info['Sample'].unique().tolist()[i]
    sample_df = ird_cell_info[ird_cell_info['Sample'] == sample]
    sample_df1 = nearest_dist_to_ref_celltype(sample_df, 'annot', ref_ct, x_col='x_centroid', y_col='y_centroid')
    
    dist = []
    for ct in all_other_ct:
        kde_ct = gaussian_kde(sample_df1[sample_df1['annot']==ct][f'nearest_dist_to_{ref_ct}'])
        ct_kde_pdf = kde_ct.pdf(dist_eval)
        ct_dist = dist_eval[np.argmax(ct_kde_pdf)]
        dist.append(ct_dist)

    dist_to_PC_ct_df = pd.DataFrame({
        'Sample': sample,
        'Cell Type': all_other_ct,
        'Most Likely Distance': dist
    })
    dist_to_PC_df = pd.concat([dist_to_PC_df, dist_to_PC_ct_df], axis = 0)

In [None]:
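# 'Collection' has to be mapped onto the table before it can be used as a filter
dist_to_PC_df['Collection'] = dist_to_PC_df['Sample'].map(metadata['Collection'])
dist_to_PC_df[(dist_to_PC_df['Cell Type'] == 'Erythro.') & (dist_to_PC_df['Collection'] == "PT")]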

In [None]:
dist_to_PC_df['Collection'] = dist_to_PC_df['Sample'].map(metadata['Collection'])
plt.figure(figsize = (12, 6))
sns.boxplot(dist_to_PC_df, x = 'Cell Type', y = 'Most Likely Distance', hue = 'Collection', showfliers = False)
sns.swarmplot(dist_to_PC_df, x = 'Cell Type', y = 'Most Likely Distance', hue = 'Collection', dodge = True, color = 'k', alpha = .5)
plt.title('Most likely distance to Plasma cells by cell type')
plt.xticks(rotation = 45)
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
#plt.savefig("/diskmnt/Users2/chouw/Projects/BM_spatial/IRD_JW_Xenium_merge_nearestDist_to_PC_by_celltype_boxplot_v09192025.pdf", dpi = 300, transparent = True)


In [None]:
# Statistical test within each cell type, compare distance to plasma cells between NBM/NDMM, NBM/PT, NDMM/PT
stat_rows = []
for ct in dist_to_PC_df['Cell Type'].unique():
    ct_df = dist_to_PC_df[dist_to_PC_df['Cell Type'] == ct]
    for group1, group2 in combinations(['NBM', 'NDMM', 'PT'], 2):
        dist1 = ct_df.loc[ct_df['Collection'] == group1, 'Most Likely Distance'].tolist()
        dist2 = ct_df.loc[ct_df['Collection'] == group2, 'Most Likely Distance'].tolist()
        if len(dist1) > 0 and len(dist2) > 0:
            u_stat, p_val = mannwhitneyu(dist1, dist2, alternative='two-sided')
            stat_rows.append({'Cell Type': ct, 'Group1': group1, 'Group2': group2,
                              'U_statistic': u_stat, 'p_value': p_val})

# Collect rows once instead of concatenating inside the loop
stat_results = pd.DataFrame(stat_rows, columns = ['Cell Type', 'Group1', 'Group2', 'U_statistic', 'p_value'])
# Benjamini-Hochberg correction across all cell types and comparisons
stat_results['p_adj'] = multipletests(stat_results['p_value'], method='fdr_bh')[1]
stat_results.to_csv("/diskmnt/Users2/chouw/Projects/BM_spatial/IRD_JW_Xenium_merge_nearestDist_to_PlasmaCells_wilcoxtest_v10242025.csv")
stat_results

In [None]:
from scipy.stats import gaussian_kde
def nearest_dist_to_ref_celltype_allsamples(df, ref_ct, ct_col, sample_col, dist_eval, x_col='x_centroid', y_col='y_centroid'):
    ''' 
    This function calculates the most likely distance to a reference cell type for all other cell types, across all samples.
    Args:
        df (pd.DataFrame): DataFrame containing cell coordinates, cell type annotation, and sample information.
        ref_ct (str): The reference cell type to which distances will be calculated.
        ct_col (str): Name of the column in df that contains cell type annotations.
        sample_col (str): Name of the column in df that contains sample identifiers.
        dist_eval (np.array): Array of distance values over which to evaluate the KDE.
        x_col (str): Name of the column for x-coordinates. Default is 'x_centroid'.
        y_col (str): Name of the column for y-coordinates. Default is 'y_centroid'.
    Returns:
        dist_to_ref_df (pd.DataFrame): DataFrame with columns ['Sample', 'Cell Type', 'Most Likely Distance'].
    '''
    # Initialize empty dataframe to store results
    dist_to_ref_df = pd.DataFrame(columns = ['Sample', 'Cell Type', 'Most Likely Distance'])

    # Identify cell types to evaluate, excluding the reference cell type
    all_other_ct = df[ct_col].unique().tolist()
    all_other_ct.remove(ref_ct)

    # Loop through all samples
    for sample in df[sample_col].unique():
        sample_df = df[df[sample_col] == sample]
        if sample_df[sample_df[ct_col] == ref_ct].empty:
            print(f"Skipping sample {sample} as it has no cells of reference cell type '{ref_ct}'")
            continue
        # Forward the caller's coordinate columns instead of hard-coding the defaults
        sample_df1 = nearest_dist_to_ref_celltype(sample_df, ct_col, ref_ct, x_col=x_col, y_col=y_col)
        
        dist = []
        for ct in all_other_ct:
            dist_distribution = sample_df1[sample_df1[ct_col]==ct][f'nearest_dist_to_{ref_ct}']
            if len(dist_distribution) < 10: # Require at least 10 data points to estimate KDE
                dist.append(np.nan)
                continue
            else:
                kde_ct = gaussian_kde(dist_distribution)
                ct_kde_pdf = kde_ct.pdf(dist_eval)
                ct_dist = dist_eval[np.argmax(ct_kde_pdf)]
                dist.append(ct_dist)

        dist_to_ref_ct_df = pd.DataFrame({
            'Sample': sample,
            'Cell Type': all_other_ct,
            'Most Likely Distance': dist
        })
        dist_to_ref_df = pd.concat([dist_to_ref_df, dist_to_ref_ct_df], axis = 0)
    return dist_to_ref_df
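
The "most likely distance" returned by this function is the mode of a Gaussian kernel density estimate, that is, the point on the `dist_eval` grid where the estimated density peaks. A minimal sketch on synthetic data follows; the normally distributed `toy_distances` and the variable names are assumptions for illustration, not study data.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
# Hypothetical nearest-neighbor distances for one cell type in one sample
toy_distances = rng.normal(loc=40.0, scale=5.0, size=500)

dist_eval = np.arange(0, 201, 1)       # same evaluation grid as used in this notebook
kde = gaussian_kde(toy_distances)      # SciPy's default bandwidth (Scott's rule)
density = kde.pdf(dist_eval)           # estimated density at each grid point
most_likely_distance = dist_eval[np.argmax(density)]
print(most_likely_distance)
```

The result can only take values on the grid, so its resolution is the grid step and a peak beyond the last grid point cannot be reported.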



### Fraction of cell types within a distance bin

In [None]:
ref_ct = 'B'
all_other_ct = ird_cell_info['annot'].unique().tolist()
all_other_ct.remove(ref_ct)

dist_to_B_df = pd.DataFrame(columns = ['Sample', 'Cell Type', 'Distance Bin Start', 'Distance Bin End', 'Fraction'])
for i in range(len(ird_cell_info['Sample'].unique().tolist())):
    sample = ird_cell_info['Sample'].unique().tolist()[i]
    sample_df = ird_cell_info[ird_cell_info['Sample'] == sample]
    sample_df1 = nearest_dist_to_ref_celltype(sample_df, 'annot', ref_ct, x_col='x_centroid', y_col='y_centroid')
    hist_all, bin_edges = np.histogram(sample_df1[sample_df1['annot']!=ref_ct][f'nearest_dist_to_{ref_ct}'], bins=range(0, 105, 5), density=False)  ## number of cells in each distance bin
    
    for ct in all_other_ct:
        hist, _ = np.histogram(sample_df1[sample_df1['annot']==ct][f'nearest_dist_to_{ref_ct}'], bins=range(0, 105, 5), density=False)  ## number of ct in each distance bin
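        # Fraction of ct in each distance bin; bins with no cells stay NaN without a divide-by-zero warning
        frac = np.divide(hist, hist_all, out=np.full(hist.shape, np.nan), where=hist_all > 0)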
        dist_to_B_ct_df = pd.DataFrame({
            'Sample': sample,
            'Cell Type': ct,
            'Distance Bin Start': bin_edges[:-1],
            'Distance Bin End': bin_edges[1:],
            'Fraction': frac
        })
        dist_to_B_df = pd.concat([dist_to_B_df, dist_to_B_ct_df], axis = 0)

In [None]:
metadata = ird_cell_info[['Sample', 'Collection']].drop_duplicates().set_index('Sample')
metadata

In [None]:
dist_to_B_df['Collection'] = dist_to_B_df['Sample'].map(metadata['Collection'])
dist_to_B_df

In [None]:
dist_to_B_df['Distance Bin Mid'] = (dist_to_B_df['Distance Bin Start'] + dist_to_B_df['Distance Bin End']) / 2
for ct in dist_to_B_df['Cell Type'].unique():
    df_plot = dist_to_B_df[dist_to_B_df['Cell Type'] == ct]
    plt.figure(figsize = (6, 4))
    sns.boxplot(df_plot, x = 'Distance Bin Mid', y = 'Fraction', hue = 'Collection')
    plt.title(ct)


## Section 7: BANKSY Spatial Neighborhood Analysis

Identify and characterize spatially-defined neighborhoods using [BANKSY](https://www.nature.com/articles/s41588-024-01664-3). 

The BANKSY pipeline is run outside of this notebook. The goal of this section is to perform downstream analysis of the BANKSY results.

### Analysis Overview

This section performs:
1. **Cluster filtering**: Remove small or sample-specific neighborhoods
2. **Cell type enrichment**: Identify which cell types are enriched in each neighborhood
3. **Neighborhood abundance**: Analyze neighborhood proportions across disease stages
4. **Cellular composition**: Characterize cell type composition within neighborhoods across conditions

In [None]:
banksy = sc.read_h5ad('/diskmnt/Projects/myeloma_scRNA_analysis/MMY_IRD/Xenium/analysis/banksy/Output/Run10242025_final_banksy_matrix.h5ad')
new_annot = pd.read_csv("/diskmnt/Projects/myeloma_scRNA_analysis/MMY_IRD/Xenium/analysis/merged_metadata.csv", index_col = 0)
banksy.obs['revised_annot'] = banksy.obs.index.map(new_annot['ct'])
banksy.obs.head()

In [None]:
banksy.obs['leiden'].value_counts().tolist()

### Filter out clusters that are small or specific to one sample/condition

Remove neighborhoods that may represent technical artifacts or sample-specific biology rather than generalizable spatial patterns. Filtering criteria:
1. **Size filter**: Remove neighborhoods with < 3000 cells (likely noise or rare spatial artifacts)
2. **Sample specificity**: Remove neighborhoods predominantly found in a single sample (sample-specific technical/biological artifacts)

First try: filter out clusters with fewer than 3000 cells
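
The two criteria can be sketched on a toy table as follows. This is an illustration under stated assumptions rather than the notebook's exact procedure: `toy_obs`, `min_cells`, and `max_sample_share` are hypothetical, and the sample-specificity rule used here (share of a cluster's cells contributed by its single largest sample) is only one possible definition.

```python
import pandas as pd

# Hypothetical per-cell table: one row per cell with a cluster label and a sample ID
toy_obs = pd.DataFrame({
    'leiden': ['0'] * 6 + ['1'] * 4 + ['2'] * 2,
    'Sample': ['A', 'B', 'C', 'A', 'B', 'C',   # cluster 0: spread evenly over samples
               'A', 'A', 'A', 'B',             # cluster 1: mostly sample A
               'C', 'C'],                      # cluster 2: small, single sample
})

min_cells = 3            # stand-in for the 3000-cell cutoff
max_sample_share = 0.7   # hypothetical cutoff for "predominantly one sample"

# Criterion 1: cluster size, selected by cluster label rather than by rank
cluster_sizes = toy_obs['leiden'].value_counts()
large_enough = cluster_sizes[cluster_sizes >= min_cells].index

# Criterion 2: share of each cluster contributed by its most dominant sample
sample_share = pd.crosstab(toy_obs['leiden'], toy_obs['Sample'], normalize='index')
dominant_share = sample_share.max(axis=1)
not_sample_specific = dominant_share[dominant_share <= max_sample_share].index

clusters_to_keep = sorted(set(large_enough) & set(not_sample_specific))
print(clusters_to_keep)
```

Selecting by the index of `value_counts()` keeps the cluster labels attached to their sizes, which matters whenever labels are not already ordered by cluster size.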

In [None]:
# Select clusters by label: value_counts() is sorted by size, so positional indices are not guaranteed to equal cluster labels
banksy_cluster_size = banksy.obs['leiden'].value_counts()
banksy_clusters_to_keep = banksy_cluster_size[banksy_cluster_size >= 3000].index.astype(str).tolist()

In [None]:
banksy_clusters_to_keep

In [None]:
banksy1 = banksy[banksy.obs['leiden'].isin(banksy_clusters_to_keep), :]
banksy1

In [None]:
# Plot proportion of cells in each BANKSY cluster per timepoint (Collection)
banksy_meta = banksy1.obs.copy()
banksy_tp_count = banksy_meta.groupby('Collection').size().reset_index(name='total_cells')
banksy_cluster_count = banksy_meta.groupby(['Collection', 'leiden']).size().reset_index(name='cluster_cells')
banksy_cluster_prop = banksy_cluster_count.merge(banksy_tp_count, on='Collection')
banksy_cluster_prop['proportion'] = banksy_cluster_prop['cluster_cells'] / banksy_cluster_prop['total_cells']
plt.figure(figsize=(8,6))
sns.barplot(banksy_cluster_prop, x='leiden', y='proportion', hue='Collection')
plt.title('Proportion of cells in each Banksy cluster per timepoint')
plt.xlabel('Banksy Cluster')
plt.ylabel('Proportion of cells')

Clusters 8 and 12 seem specific to NDMM. Check whether they are sample-specific.
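
A sample-by-cluster proportion table is one way to run this check for all clusters at once. The sketch below uses a hypothetical `toy_obs` table, and the 0.1 cutoff is illustrative only.

```python
import pandas as pd

# Hypothetical per-cell table with two samples and two clusters
toy_obs = pd.DataFrame({
    'Sample': ['S1'] * 20 + ['S2'] * 20,
    'leiden': ['0'] * 19 + ['8'] * 1 + ['0'] * 10 + ['8'] * 10,
})

# Rows = samples, columns = clusters; each row sums to 1
prop_by_sample = pd.crosstab(toy_obs['Sample'], toy_obs['leiden'], normalize='index')

threshold = 0.1  # illustrative cutoff
over_represented = prop_by_sample.index[prop_by_sample['8'] > threshold].tolist()
print(prop_by_sample)
print(over_represented)
```

Returning a list rather than the first match avoids an `IndexError` when no sample exceeds the cutoff.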

In [None]:
# Fraction of each sample's cells that fall in BANKSY cluster 8; print the first sample above 10%
banksy_cluster8 = banksy1[banksy1.obs['leiden'] == '8', :]
banksy8_meta = banksy_cluster8.obs[['Sample']].copy()
banksy8_sample_count = banksy8_meta.groupby('Sample').size().reset_index(name='cluster8_cells')
banksy_sample_count = banksy1.obs.groupby('Sample').size().reset_index(name='total_cells')
banksy8_sample_prop = banksy8_sample_count.merge(banksy_sample_count, on='Sample')
banksy8_sample_prop['proportion'] = banksy8_sample_prop['cluster8_cells'] / banksy8_sample_prop['total_cells']
print("Over-represented sample in BANKSY cluster 8: ", banksy8_sample_prop.loc[banksy8_sample_prop['proportion']>0.1, 'Sample'].values[0])

In [None]:
# Fraction of each sample's cells that fall in BANKSY cluster 12; print the first sample above 5%
banksy_cluster12 = banksy1[banksy1.obs['leiden'] == '12', :]
banksy12_meta = banksy_cluster12.obs[['Sample']].copy()
banksy12_sample_count = banksy12_meta.groupby('Sample').size().reset_index(name='cluster12_cells')
banksy_sample_count = banksy1.obs.groupby('Sample').size().reset_index(name='total_cells')
banksy12_sample_prop = banksy12_sample_count.merge(banksy_sample_count, on='Sample')
banksy12_sample_prop['proportion'] = banksy12_sample_prop['cluster12_cells'] / banksy12_sample_prop['total_cells']
print("Over-represented sample in BANKSY cluster 12: ", banksy12_sample_prop.loc[banksy12_sample_prop['proportion']>0.05, 'Sample'].values[0])

In [None]:
sc.pl.dotplot(banksy1, var_names=['SDC1', 'SLAMF7', 'TNFRSF17', 'MZB1'], groupby = 'leiden')

Nbhd8 and Nbhd12 are both specific to plasma cells in S19-25371 - could be interesting down the road. Discard for now.

In [None]:
banksy_clusters_to_keep = [banksy_cluster for banksy_cluster in banksy_clusters_to_keep if banksy_cluster not in ['8', '12']]
print(banksy_clusters_to_keep)
# Copy so that later assignments to .obs act on an independent AnnData rather than a view
banksy1 = banksy[banksy.obs['leiden'].isin(banksy_clusters_to_keep), :].copy()
banksy1

### Cell type enrichment analysis

Quantify which cell types are significantly enriched or depleted in each BANKSY neighborhood using statistical tests:

**Statistical approach:**
1. **Fisher's exact test**: Calculate odds ratios for each cell type in each neighborhood
2. **Hypergeometric test**: Calculate upper-tail p-values for the statistical significance of enrichment (depletion is not tested)
3. **FDR correction**: Adjust p-values for multiple testing using Benjamini-Hochberg method

This analysis reveals the cellular composition defining each spatial neighborhood and identifies which cell types preferentially co-localize.
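
Steps 1 and 2 can be illustrated on a single 2x2 table. The counts below are made up for illustration. The sketch also checks that the upper-tail hypergeometric p-value coincides with the one-sided (`alternative='greater'`) p-value from Fisher's exact test, since both are computed from the same hypergeometric distribution.

```python
import numpy as np
from scipy.stats import fisher_exact, hypergeom

# Hypothetical counts for one neighborhood / cell type pair
a, b = 12, 88     # inside the neighborhood: of the type, not of the type
c, d = 60, 840    # outside the neighborhood: of the type, not of the type
table = np.array([[a, b], [c, d]])

# Odds ratio reported by Fisher's exact test (the sample odds ratio a*d / (b*c))
odds_ratio, p_fisher_greater = fisher_exact(table, alternative='greater')

# Upper-tail hypergeometric p-value P(X >= a): drawing (a + b) cells from
# (a + b + c + d) cells, of which (a + c) belong to the cell type
total_cells = a + b + c + d
type_cells = a + c
nbhd_cells = a + b
p_hyper = hypergeom.sf(a - 1, total_cells, type_cells, nbhd_cells)

print(np.log2(odds_ratio), p_hyper, p_fisher_greater)
```

Because `sf(x)` is `P(X > x)`, passing `a - 1` gives `P(X >= a)`; when `a` is 0 this evaluates to 1, so a cell type that is absent from a neighborhood can never be called enriched.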

In [None]:
banksy1.obs['revised_annot'].unique()

In [None]:
def calculate_celltype_enrichment(adata, neighborhood_col, celltype_col):
    """
    Calculate cell type enrichment in spatial neighborhoods using Fisher's exact test.
    
    For each neighborhood-celltype pair, computes an odds ratio (> 1 indicates
    enrichment, < 1 depletion) and an upper-tail p-value that tests whether the
    cell type is over-represented in that neighborhood relative to all other cells.
    
    Parameters:
    -----------
    adata : AnnData
        Annotated data object with neighborhood and cell type annotations
    neighborhood_col : str
        Column name in adata.obs containing neighborhood labels (e.g., 'leiden', 'nbhd_annot')
    celltype_col : str
        Column name in adata.obs containing cell type annotations
        
    Returns:
    --------
    tuple: (pd.DataFrame, pd.DataFrame)
        - odds_matrix: DataFrame with odds ratios (rows=neighborhoods, cols=cell types)
        - pval_matrix: DataFrame with hypergeometric p-values (same shape as odds_matrix)
        
    Statistical Tests:
    ------------------
    - **Fisher's exact test**: Computes odds ratio for enrichment
    - **Hypergeometric test**: Computes p-value for significance (upper tail test)
    
    Example:
    --------
    odds_matrix, pval_matrix = calculate_celltype_enrichment(
        banksy_filtered, 
        neighborhood_col='nbhd_annot',
        celltype_col='revised_annot'
    )
    
    # Visualize results
    log2_odds = np.log2(odds_matrix)
    sns.heatmap(log2_odds, center=0, cmap='Spectral_r')
    """
    from scipy.stats import fisher_exact, hypergeom
    
    neighborhoods = adata.obs[neighborhood_col].unique()
    cell_types = adata.obs[celltype_col].unique()
    df = adata.obs[[neighborhood_col, celltype_col]].copy()
    
    # Initialize output matrices
    odds_matrix = pd.DataFrame(
        np.nan,
        index=neighborhoods.categories if hasattr(neighborhoods, 'categories') else neighborhoods,
        columns=cell_types.categories if hasattr(cell_types, 'categories') else cell_types
    )
    
    pval_matrix = pd.DataFrame(
        np.nan,
        index=neighborhoods.categories if hasattr(neighborhoods, 'categories') else neighborhoods,
        columns=cell_types.categories if hasattr(cell_types, 'categories') else cell_types
    )
    
    # Calculate enrichment for each neighborhood-celltype pair
    for nb in odds_matrix.index:
        for ct in odds_matrix.columns:
            # Build 2x2 contingency table
            # a = in nb and of type ct
            a = ((df[neighborhood_col] == nb) & (df[celltype_col] == ct)).sum()
            # b = in nb and not type ct
            b = ((df[neighborhood_col] == nb) & (df[celltype_col] != ct)).sum()
            # c = not in nb and of type ct
            c = ((df[neighborhood_col] != nb) & (df[celltype_col] == ct)).sum()
            # d = not in nb and not type ct
            d = ((df[neighborhood_col] != nb) & (df[celltype_col] != ct)).sum()
            
            tbl = np.array([[a, b], [c, d]])
            
            # Fisher's exact test for odds ratio
            oddsratio, _ = fisher_exact(tbl)
            odds_matrix.loc[nb, ct] = oddsratio
            
            # Hypergeometric test (upper tail) for p-value: P(X >= a)
            m = a + c  # total ct cells
            n = b + d  # total non-ct cells
            k = a + b  # total cells in nb

            # sf(a - 1) = P(X >= a); for a == 0 this is sf(-1) = 1, so an absent cell type is never called enriched
            pval_matrix.loc[nb, ct] = hypergeom.sf(a - 1, m + n, m, k)
    
    return odds_matrix, pval_matrix


In [None]:
banksy1.obs['revised_annot'] = pd.Categorical(banksy1.obs['revised_annot'], 
                                        categories=['HSPC', 'Erythroid', 'Megakaryocyte',
                                        'GMP', 'Late Myeloid', 'Neutrophil', 'Ba/Eo/Ma',
                                        'Monocyte', 'Macrophage','pDC', 'cDC',
                                        'Early B', 'Mature B', 'CD4 T', 'CD8 T', 'NK', 'PC', 
                                        'Adipocyte', 'MSC', 'Endothelial', 'Fibro/Osteo', 'vSMC/Pericyte', 'Low Confidence'],
                                        ordered=True)

In [None]:
odds_matrix, pval_matrix = calculate_celltype_enrichment(
        banksy1, 
        neighborhood_col='leiden',
        celltype_col='revised_annot'
    )

In [None]:
# Draw heatmap to show enriched cell types in each BANKSY neighborhood
from matplotlib.colors import LinearSegmentedColormap
from statsmodels.stats.multitest import multipletests

# Log2 transform odds ratios
log2_odds_matrix = np.log2(odds_matrix)

# Replace negative infinities with the minimum finite value
finite_values = log2_odds_matrix.values[np.isfinite(log2_odds_matrix.values)]
max_neg_value = finite_values.min()
log2_odds_matrix = log2_odds_matrix.replace(-np.inf, max_neg_value)

# Adjust p-values using FDR correction
pval_vector = pval_matrix.values.flatten()
padj_vector = multipletests(pval_vector, method='fdr_bh')[1]
padj_matrix = pd.DataFrame(
    padj_vector.reshape(pval_matrix.shape),
    index=pval_matrix.index,
    columns=pval_matrix.columns
)

# Create star annotation matrix
star_matrix = padj_matrix.map(lambda x: "*" if x < 0.05 else "")

## Define color palette (PiYG-like: pink → white → green)
#colors = ['#8e0152', '#f7f7f7', '#276419']  # Similar to RColorBrewer PiYG
#cmap = LinearSegmentedColormap.from_list('custom_PiYG', colors)

# Create figure and heatmap
fig, ax = plt.subplots(figsize=(8, 6))

# Draw heatmap
sns.heatmap(
    log2_odds_matrix,
    cmap='Spectral_r',
    center=0,
    vmin=-3,
    vmax=3,
    cbar_kws={'label': 'log2(Odds Ratio)'},
    linewidths=0.5,
    linecolor='lightgray',
    ax=ax
)

# Add star annotations
for i in range(len(log2_odds_matrix.index)):
    for j in range(len(log2_odds_matrix.columns)):
        if star_matrix.iloc[i, j] == "*":
            ax.text(j + 0.5, i + 0.5, '*',
                   ha='center', va='center',
                   fontsize=10, color='black')

# Formatting
ax.set_xlabel('Cell Type')
ax.set_ylabel('BANKSY Neighborhood Clusters')
plt.xticks(rotation=45, ha='right')
plt.yticks(rotation=0)
plt.tight_layout()
#plt.savefig('/diskmnt/Users2/chouw/Projects/BM_spatial/IRD/BANKSY/IRD_JW_Xenium_BANKSY_neighborhood_celltype_oddsratio_heatmap_v11062025.pdf', dpi=300, transparent = True)
plt.show()


In [None]:
# Fraction of each sample's cells that fall in BANKSY cluster 10; print all samples above 5%
banksy_cluster10 = banksy1[banksy1.obs['leiden'] == '10', :]
banksy10_meta = banksy_cluster10.obs[['Sample']].copy()
banksy10_sample_count = banksy10_meta.groupby('Sample').size().reset_index(name='cluster10_cells')
banksy_sample_count = banksy1.obs.groupby('Sample').size().reset_index(name='total_cells')
banksy10_sample_prop = banksy10_sample_count.merge(banksy_sample_count, on='Sample')
banksy10_sample_prop['proportion'] = banksy10_sample_prop['cluster10_cells'] / banksy10_sample_prop['total_cells']
print("Over-represented sample in BANKSY cluster 10: ", banksy10_sample_prop.loc[banksy10_sample_prop['proportion']>0.05, 'Sample'].values)

In [None]:
sample_df = banksy1.obs[banksy1.obs['Sample'] == 'S16-30080A1U1']
sns.scatterplot(sample_df, x = 'x_centroid', y = 'y_centroid', hue = 'leiden', s = 1, alpha = 0.8, edgecolor = None)
plt.gca().set_aspect('equal', adjustable='box')
plt.legend(title='BANKSY Neighborhood', bbox_to_anchor=(1.05, 1), loc='upper left', markerscale = 4)

In [None]:
sns.scatterplot(sample_df[sample_df['leiden'].isin(['5', '10'])], x = 'x_centroid', y = 'y_centroid', hue = 'leiden', s = 1, alpha = 0.8, edgecolor = None)
plt.gca().set_aspect('equal', adjustable='box')
plt.legend(title='BANKSY Neighborhood', bbox_to_anchor=(1.05, 1), loc='upper left', markerscale = 4)

In [None]:
sc.tl.rank_genes_groups(banksy1, groupby = 'leiden', method = 'wilcoxon', groups = ['10'])

In [None]:
sc.pl.rank_genes_groups(banksy1)

In [None]:
# Plot scatterplot of cell centroids colored by BANKSY neighborhood, split by sample, save as multi-page PDF
pdf_name = "/diskmnt/Users2/chouw/Projects/BM_spatial/IRD/IRD_JW_Xenium_merge_BANKSY_filtered_scatterplot_110625.pdf"
with PdfPages(pdf_name) as pdf:
    for sample in banksy1.obs['Sample'].unique():
        sample_data = banksy1[banksy1.obs['Sample'] == sample, :]
        plt.figure(figsize=(10, 10))
        sns.scatterplot(
            x=sample_data.obs['x_centroid'],
            y=sample_data.obs['y_centroid'],
            hue=sample_data.obs['leiden'],
            palette='tab20',
            s=2,
            alpha=0.8,
            edgecolor=None
        )
        plt.title(f'Cell Centroids Colored by BANKSY Neighborhood - Sample: {sample}')
        plt.xlabel('X Centroid')
        plt.ylabel('Y Centroid')
        plt.gca().set_aspect('equal', adjustable='box')
        plt.legend(title='BANKSY Neighborhood', bbox_to_anchor=(1.05, 1), loc='upper left')
        plt.tight_layout()
        pdf.savefig()
        plt.close()

In [None]:
# Export BANKSY neighborhood annotations for each sample to Xenium browser format
for sample in banksy1.obs['Sample'].unique():
    sample_data = banksy1[banksy1.obs['Sample'] == sample, :]
    output_df = sample_data.obs[['Original_Barcode', 'leiden']].copy()
    output_df.columns = ['cell_id', 'group']
    output_path = f"/diskmnt/Users2/chouw/Projects/BM_spatial/IRD/BANKSY/Annotation/IRD_JW_Xenium_{sample}_BANKSY_neighborhood_annotation_110625.csv"
    output_df.to_csv(output_path, index=False)

### Annotate neighborhoods and remove unknowns

In [None]:
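# Neighborhood 4 is enriched for unknown cells -> remove (neighborhood 10 is also left out of the list below)
# Copy so that the new .obs column is written to an independent AnnData rather than a view
banksy_filtered = banksy1[banksy1.obs['leiden'].isin(['0', '1', '2', '3', '5', '6', '7', '9', '11']), :].copy()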
banksy_nbhd_annotation = {
    '0': 'Erythroid',
    '1': 'Lymphoid',
    '2': 'Late Myeloid',
    '3': 'Endothelial',
    '5': 'Plasma cell',
    '6': 'Early Myeloid',
    '7': 'Megakaryocyte',
    '9': 'Endosteal',
    '11': 'Arterial'
}
banksy_filtered.obs['nbhd_annot'] = banksy_filtered.obs['leiden'].map(banksy_nbhd_annotation)

In [None]:
banksy_filtered.obs['nbhd_annot'] = pd.Categorical(banksy_filtered.obs['nbhd_annot'], 
                                        categories=['Erythroid', 'Early Myeloid', 'Late Myeloid',
                                                    'Lymphoid', 'Plasma cell', 'Megakaryocyte',
                                                    'Endothelial', 'Arterial', 'Endosteal'],
                                        ordered=True)

In [None]:
odds_matrix, pval_matrix = calculate_celltype_enrichment(
        banksy_filtered, 
        neighborhood_col='nbhd_annot',
        celltype_col='revised_annot'
    )

In [None]:
# Draw heatmap to show enriched cell types in each BANKSY neighborhood
from matplotlib.colors import LinearSegmentedColormap
from statsmodels.stats.multitest import multipletests

# Log2 transform odds ratios
log2_odds_matrix = np.log2(odds_matrix)

# Replace negative infinities with the minimum finite value
finite_values = log2_odds_matrix.values[np.isfinite(log2_odds_matrix.values)]
max_neg_value = finite_values.min()
log2_odds_matrix = log2_odds_matrix.replace(-np.inf, max_neg_value)

# Adjust p-values using FDR correction
pval_vector = pval_matrix.values.flatten()
padj_vector = multipletests(pval_vector, method='fdr_bh')[1]
padj_matrix = pd.DataFrame(
    padj_vector.reshape(pval_matrix.shape),
    index=pval_matrix.index,
    columns=pval_matrix.columns
)

# Create star annotation matrix
star_matrix = padj_matrix.map(lambda x: "*" if x < 0.05 else "")

## Define color palette (PiYG-like: pink → white → green)
#colors = ['#8e0152', '#f7f7f7', '#276419']  # Similar to RColorBrewer PiYG
#cmap = LinearSegmentedColormap.from_list('custom_PiYG', colors)

# Create figure and heatmap
fig, ax = plt.subplots(figsize=(8, 5))

# Draw heatmap
sns.heatmap(
    log2_odds_matrix,
    cmap='Spectral_r',
    center=0,
    vmin=-3,
    vmax=3,
    cbar_kws={'label': 'log2(Odds Ratio)'},
    linewidths=0.5,
    linecolor='lightgray',
    ax=ax
)

# Add star annotations
for i in range(len(log2_odds_matrix.index)):
    for j in range(len(log2_odds_matrix.columns)):
        if star_matrix.iloc[i, j] == "*":
            ax.text(j + 0.5, i + 0.5, '*',
                   ha='center', va='center',
                   fontsize=10, color='black')

# Formatting
ax.set_xlabel('Cell Type')
ax.set_ylabel('BANKSY Neighborhood Annotation')
plt.xticks(rotation=45, ha='right')
plt.yticks(rotation=0)
plt.tight_layout()
#plt.savefig('/diskmnt/Users2/chouw/Projects/BM_spatial/IRD/BANKSY/IRD_JW_Xenium_BANKSY_annotated_celltype_oddsratio_heatmap_v11062025.pdf', dpi=300, transparent = True)
plt.show()


### Scatterplot of select samples

In [None]:
neighborhood_colors = {
    "Late Myeloid":  "#079450",  # later granulo
    "Erythroid":  "#9e9e9e", # erythroid
    "Plasma cell":  "#ff42ca", # PC
    "Early Myeloid": "#00ff1e", # early granulo/mye 
    "Megakaryocyte": "#241717", # MKC
    #"RN6":  '#00ba9e', # other myelo (cDC, ba/eo/ma, low confidence)
    #"RN7":  "#00f7ff",  # early B and myelo
    "Lymphoid":  "#b50d0d",  # cytotoxic T NK
    "Endothelial":  "#de9835",  # endothelial
    #"RN10": "#c6db02",  # HSPC
    #"RN11": "#7875ff",  # lymphoid
    "Endosteal": "#fabc02",  # fibro/osteo
    "Arterial": "#735b2e",  # pericyte
}

In [None]:
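# Subset the sample here so this cell does not depend on a later cell having been run
p157_t0 = banksy_filtered[banksy_filtered.obs['DI_Sample'] == 'P157_T0_S1']
max_dim = max(p157_t0.obs['x_centroid'].max(), p157_t0.obs['y_centroid'].max())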
min_dim = min(p157_t0.obs['x_centroid'].max(), p157_t0.obs['y_centroid'].max())
fig_size = max_dim//1000, min_dim//1000
fig_size

In [None]:
p157_t0 = banksy_filtered[banksy_filtered.obs['DI_Sample'] == 'P157_T0_S1']
max_dim = max(p157_t0.obs['x_centroid'].max(), p157_t0.obs['y_centroid'].max())
min_dim = min(p157_t0.obs['x_centroid'].max(), p157_t0.obs['y_centroid'].max())
fig_size = max_dim//1000, min_dim//1000
plt.figure(figsize = fig_size)
sns.scatterplot(p157_t0.obs.copy(), x = 'x_centroid', y = 'y_centroid', hue = 'nbhd_annot', palette = neighborhood_colors, s = .5, edgecolor = None, legend = False, rasterized = True)
plt.gca().set_aspect('equal', adjustable='box')
sns.despine()
plt.tight_layout()
plt.savefig('/diskmnt/Users2/chouw/Projects/BM_spatial/IRD/BANKSY/IRD_JW_Xenium_BANKSY_neighborhood_scatterplot_P157T0_v11062025.pdf', dpi=300, transparent = True)
plt.show()

In [None]:
p143_t0 = banksy_filtered[banksy_filtered.obs['DI_Sample'] == 'P143_T0_S1']
max_dim = max(p143_t0.obs['x_centroid'].max(), p143_t0.obs['y_centroid'].max())
min_dim = min(p143_t0.obs['x_centroid'].max(), p143_t0.obs['y_centroid'].max())
fig_size = max_dim//1000, min_dim//1000
plt.figure(figsize = fig_size)
sns.scatterplot(p143_t0.obs.copy(), x = 'y_centroid', y = 'x_centroid', hue = 'nbhd_annot', palette = neighborhood_colors, s = .5, edgecolor = None, legend = False, rasterized = True)
plt.gca().set_aspect('equal', adjustable='box')
sns.despine()
plt.tight_layout()
plt.savefig('/diskmnt/Users2/chouw/Projects/BM_spatial/IRD/BANKSY/IRD_JW_Xenium_BANKSY_neighborhood_scatterplot_P143T0_v11062025.pdf', dpi=300, transparent = True)
plt.show()

### Neighborhood abundance analysis across samples and conditions

Quantify how the proportion of each spatial neighborhood differs between collection timepoints (NBM, NDMM, PT). For each neighborhood, per-sample fractions are compared between timepoints with two-sided Mann-Whitney U tests, followed by Benjamini-Hochberg correction across all comparisons.

In [None]:
# Calculate fraction of cells in each BANKSY neighborhood per sample
neighborhood_counts = banksy_filtered.obs.groupby(['Sample', 'nbhd_annot']).size().reset_index(name='count')
total_counts = banksy_filtered.obs.groupby('Sample').size().reset_index(name='total_count')
neighborhood_frac = neighborhood_counts.merge(total_counts, on='Sample')
neighborhood_frac['fraction'] = neighborhood_frac['count'] / neighborhood_frac['total_count']
neighborhood_frac.head()

In [None]:
# Match sample ID to collection timepoint
metadata = banksy_filtered.obs[['Sample', 'Collection']].drop_duplicates().set_index('Sample')
neighborhood_frac['Collection'] = neighborhood_frac['Sample'].map(metadata['Collection'])
neighborhood_frac.head()

In [None]:
# Calculate statistics between timepoints for each neighborhood
stats_res = []
tps = neighborhood_frac['Collection'].unique().tolist()

for nbhd in neighborhood_frac['nbhd_annot'].unique():
    nbhd_fracs = neighborhood_frac[neighborhood_frac['nbhd_annot'] == nbhd]

    for cond1, cond2 in combinations(tps, 2):
        cond1_frac = nbhd_fracs[nbhd_fracs['Collection'] == cond1]['fraction'].values.tolist()
        cond2_frac = nbhd_fracs[nbhd_fracs['Collection'] == cond2]['fraction'].values.tolist()
        if len(cond1_frac) > 0 and len(cond2_frac) > 0:
            u_stat, p_val = mannwhitneyu(cond1_frac, cond2_frac, alternative='two-sided')
            stats_res.append({'Condition 1': cond1, 'Condition 2': cond2, 'Neighborhood': nbhd, 'U statistic': u_stat, 'p-value': p_val})
stats_df = pd.DataFrame(stats_res)
stats_df['p_adj'] = multipletests(stats_df['p-value'], method='fdr_bh')[1]
stats_df['Sig'] = stats_df['p_adj'] < 0.05
stats_df[stats_df['Sig']]

In [None]:
# Plot boxplot of neighborhood fractions across samples, colored by collection timepoint
plt.figure(figsize=(12, 6))
timecols = {"NBM": "#0C7515", "NDMM": "#E619B9", "PT": "#CF99C3"} 
order = neighborhood_frac['nbhd_annot'].unique().tolist()          # set your desired x order
hue_order = sorted(neighborhood_frac['Collection'].unique())  # set your desired hue order

ax = sns.boxplot(
    neighborhood_frac,
    x='nbhd_annot', y='fraction', hue='Collection',
    palette = timecols,
    order=order, hue_order=hue_order,
    showfliers=False
)
sns.swarmplot(
    neighborhood_frac,
    x='nbhd_annot', y='fraction', hue='Collection',
    order=order, hue_order=hue_order, dodge = True,
    color = 'k', alpha = .5, size = 3.5
)

################ Vibe-coded snippet that draws significance bars ################
# Draw significance bars
def dodge_center(i, j, n_hue, width=0.8):
    # seaborn's default group width is ~0.8
    if n_hue <= 1:
        return i
    step = width / n_hue
    return i - width/2 + step/2 + j*step

# Geometry for stacked annotations
ymin, ymax = ax.get_ylim()
y_span = ymax - ymin if ymax > ymin else 1.0
pad = 0.04 * y_span      # space above boxes
step = 0.08 * y_span     # vertical spacing between annotations
h = 0.015 * y_span       # bracket height

top_y_by_nbhd = neighborhood_frac.groupby('nbhd_annot')['fraction'].max().to_dict()
top_y_by_nbhd = {k: v + pad for k, v in top_y_by_nbhd.items()}
stack_idx = {k: 0 for k in order}

# Optional: avoid duplicate pair annotations within the same neighborhood
seen_pairs = set()

for _, row in stats_df.iterrows():
    if not row['Sig']:
        continue
    nbhd = row['Neighborhood']
    cond1, cond2 = row['Condition 1'], row['Condition 2']
    if nbhd not in top_y_by_nbhd:
        continue
    # skip if one of the groups has no data
    m1 = ((neighborhood_frac['nbhd_annot'] == nbhd) &
          (neighborhood_frac['Collection'] == cond1))
    m2 = ((neighborhood_frac['nbhd_annot'] == nbhd) &
          (neighborhood_frac['Collection'] == cond2))
    if not m1.any() or not m2.any():
        continue

    # de-duplicate symmetrical pairs per neighborhood
    key = (nbhd, tuple(sorted([cond1, cond2])))
    if key in seen_pairs:
        continue
    seen_pairs.add(key)

    i = order.index(nbhd)
    j1 = hue_order.index(cond1)
    j2 = hue_order.index(cond2)
    x1 = dodge_center(i, j1, len(hue_order), width=0.8)
    x2 = dodge_center(i, j2, len(hue_order), width=0.8)

    # stacked y position
    y_base = top_y_by_nbhd[nbhd] + stack_idx[nbhd] * step
    y0, y1 = y_base - h, y_base

    ax.plot([x1, x1, x2, x2], [y0, y1, y1, y0], lw=1.5, c='k', clip_on=False)
    ax.text((x1 + x2) / 2, y1 + 0.5*h, f"p={row['p_adj']:.3e}",
            ha='center', va='bottom', color='k', fontsize=8, clip_on=False)

    stack_idx[nbhd] += 1

# Ensure enough headroom
max_top = max((top_y_by_nbhd[k] + max(0, stack_idx[k]-1) * step + pad) for k in order)
if max_top > ymax:
    ax.set_ylim(ymin, max_top)
########################## End of vibe coding snippets ################

ax.set_title('Fraction of Cells in Each BANKSY Neighborhood Across Samples')
ax.set_xlabel('BANKSY Neighborhood')
ax.set_ylabel('Proportion of Cells')
ax.set_ylim(0, 0.65)  # fixed limits; note this overrides the headroom adjustment above
plt.xticks(rotation=45, ha='right')
sns.despine()
plt.tight_layout()
plt.savefig('/diskmnt/Users2/chouw/Projects/BM_spatial/IRD/BANKSY/IRD_JW_Xenium_BANKSY_neighborhood_proportions_v11062025.pdf', dpi=300, transparent = True)
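
The bracket x-positions in the plot above depend on where seaborn places each dodged box. The sketch below is a worked example of the `dodge_center` helper from that cell. It assumes seaborn's default total group width of 0.8 and equal-width boxes per hue level; if `width` or `gap` is changed in `sns.boxplot`, the positions would differ.

```python
def dodge_center(i, j, n_hue, width=0.8):
    """x-center of hue level j within category i, assuming equal-width dodged boxes."""
    if n_hue <= 1:
        return i
    step = width / n_hue  # width allotted to each hue level
    return i - width / 2 + step / 2 + j * step


# Centers of the three timepoint boxes for the category at x = 2
centers = [dodge_center(2, j, n_hue=3) for j in range(3)]
print([round(c, 3) for c in centers])  # [1.733, 2.0, 2.267]
```

The centers are symmetric around the category position, so a bracket between the first and last hue level spans two thirds of the 0.8 group width.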

### Cell type composition analysis within neighborhood across conditions

Analyze how the cellular composition of each neighborhood differs across collection timepoints (NBM, NDMM, PT). This indicates whether specific cell types are enriched in or depleted from particular neighborhoods at each disease stage.

In [None]:
# Calculate proportion of cell types within each BANKSY neighborhood per sample
celltype_counts = banksy_filtered.obs.groupby(['nbhd_annot', 'revised_annot', 'Sample']).size().reset_index(name='count')
total_counts = banksy_filtered.obs.groupby(['nbhd_annot', 'Sample']).size().reset_index(name='total_count')
celltype_frac = celltype_counts.merge(total_counts, on=['nbhd_annot', 'Sample'])
celltype_frac['fraction'] = celltype_frac['count'] / celltype_frac['total_count']
# Map collection timepoint
metadata = banksy_filtered.obs[['Sample', 'Collection']].drop_duplicates().set_index('Sample')
celltype_frac['Collection'] = celltype_frac['Sample'].map(metadata['Collection'])
celltype_frac.head()

In [None]:
# Calculate proportion of cell types per sample
celltype_sample_frac = banksy_filtered.obs.groupby(['revised_annot', 'Sample']).size().reset_index(name='count')
total_sample_counts = banksy_filtered.obs.groupby('Sample').size().reset_index(name='total_count')
celltype_sample_frac = celltype_sample_frac.merge(total_sample_counts, on='Sample')
celltype_sample_frac['fraction'] = celltype_sample_frac['count'] / celltype_sample_frac['total_count']
celltype_sample_frac['Collection'] = celltype_sample_frac['Sample'].map(metadata['Collection'])
celltype_sample_frac.head()

In [None]:
pdf_name = "/diskmnt/Users2/chouw/Projects/BM_spatial/IRD/BANKSY/IRD_JW_Xenium_merge_BANKSY_filtered_celltype_fraction_by_nbhd_110625.pdf"

with PdfPages(pdf_name) as pdf:
    # First plot overall cell type composition per sample across conditions
    plt.figure(figsize = (8, 4))
    sns.boxplot(celltype_sample_frac, x='revised_annot', y='fraction', hue='Collection')
    plt.xlabel('Cell Type')
    plt.ylabel('Fraction within Sample')
    plt.title('Overall Cell Type Composition Across Samples')
    plt.xticks(rotation=45, ha='right')
    plt.legend(title='Collection', bbox_to_anchor=(1.05, 1), loc='upper left')
    plt.tight_layout()
    pdf.savefig()
    plt.close()
    
    # Plot boxplot of cell type fractions within each BANKSY neighborhood across samples
    nbhd_list = celltype_frac['nbhd_annot'].unique().tolist()
    for nbhd in nbhd_list:
        nbhd_fracs = celltype_frac[celltype_frac['nbhd_annot'] == nbhd]
        plt.figure(figsize=(8, 4))
        sns.boxplot(nbhd_fracs, x='revised_annot', y='fraction', hue='Collection')
        plt.xlabel('Cell Type')
        plt.ylabel(f'Fraction within BANKSY Neighborhood {nbhd}')
        plt.title(f'Cell Type Composition in BANKSY Neighborhood {nbhd}')
        plt.xticks(rotation=45, ha='right')
        plt.legend(title='Collection', bbox_to_anchor=(1.05, 1), loc='upper left')
        plt.tight_layout()
        pdf.savefig()
        plt.close()

In [None]:
# Compare cell type proportions within neighborhoods vs overall sample proportions
celltype_frac['Sample_fraction'] = celltype_frac.apply(
    lambda row: celltype_sample_frac[(celltype_sample_frac['revised_annot'] == row['revised_annot']) & (celltype_sample_frac['Sample'] == row['Sample'])]['fraction'].values[0],
    axis=1
)
celltype_frac['Enrichment'] = (celltype_frac['fraction'] - celltype_frac['Sample_fraction'])/celltype_frac['Sample_fraction']
celltype_frac.head()
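
The `Enrichment` value computed above is a relative difference: a cell type's fraction within a neighborhood minus its fraction in the whole sample, divided by the sample fraction. A value of 0 means no enrichment, positive values mean enrichment, and -1 means the cell type is absent from the neighborhood. The sketch below reproduces the calculation on toy data using a merge instead of a row-wise lookup. The numbers and the helper name `add_enrichment` are hypothetical.

```python
import pandas as pd


def add_enrichment(nbhd_frac, sample_frac, keys=('revised_annot', 'Sample')):
    """Attach per-sample fractions and compute relative enrichment.

    Enrichment = (fraction in neighborhood - fraction in sample) / fraction in sample
    """
    keys = list(keys)
    lookup = sample_frac[keys + ['fraction']].rename(columns={'fraction': 'Sample_fraction'})
    # many_to_one: each (cell type, sample) pair must appear once in the lookup table
    out = nbhd_frac.merge(lookup, on=keys, how='left', validate='many_to_one')
    out['Enrichment'] = (out['fraction'] - out['Sample_fraction']) / out['Sample_fraction']
    return out


# Hypothetical toy fractions, NOT study data
toy_nbhd = pd.DataFrame({
    'nbhd_annot': ['Endothelial', 'Endothelial', 'Early Myeloid'],
    'revised_annot': ['NK', 'PC', 'PC'],
    'Sample': ['S1', 'S1', 'S1'],
    'fraction': [0.10, 0.20, 0.60],
})
toy_sample = pd.DataFrame({
    'revised_annot': ['NK', 'PC'],
    'Sample': ['S1', 'S1'],
    'fraction': [0.05, 0.40],
})
toy_enrich = add_enrichment(toy_nbhd, toy_sample)
print(toy_enrich[['nbhd_annot', 'revised_annot', 'Enrichment']])
```

In this toy case NK cells are twice as frequent in the Endothelial neighborhood as in the sample overall (enrichment of 1.0), while plasma cells are half as frequent there (enrichment of -0.5). A left merge leaves `NaN` where a lookup is missing, where the row-wise `.values[0]` approach would raise an error.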

In [None]:
pdf_name = "/diskmnt/Users2/chouw/Projects/BM_spatial/IRD/BANKSY/IRD_JW_Xenium_merge_BANKSY_filtered_celltype_enrichment_by_nbhd_110625.pdf"

with PdfPages(pdf_name) as pdf:
    # Plot boxplot of cell type enrichment within each BANKSY neighborhood across samples
    nbhd_list = celltype_frac['nbhd_annot'].unique().tolist()
    for nbhd in nbhd_list:
        nbhd_fracs = celltype_frac[celltype_frac['nbhd_annot'] == nbhd]
        plt.figure(figsize=(10, 5))
        sns.boxplot(nbhd_fracs, x='revised_annot', y='Enrichment', hue='Collection')
        # Plot horizontal red dashed line at y=0
        plt.axhline(0, color='red', linestyle='--')
        plt.xlabel('Cell Type')
        plt.ylabel(f'Enrichment within BANKSY Neighborhood - {nbhd}')
        plt.title(f'Cell Type Enrichment in BANKSY Neighborhood - {nbhd}')
        plt.xticks(rotation=45, ha='right')
        plt.legend(title='Collection', bbox_to_anchor=(1.05, 1), loc='upper left')
        plt.tight_layout()
        pdf.savefig()
        plt.close()

#### Plasma cells in neighborhood 6 - Early Myeloid

In [None]:
pc_nbhd6 = celltype_frac[(celltype_frac['nbhd_annot'] == 'Early Myeloid') & (celltype_frac['revised_annot'] == 'PC')]
fig, _ = plot_utils.plot_comparison_with_significance(pc_nbhd6, 'Collection', 'Enrichment', palette={"NBM": "#0C7515", "NDMM": "#E619B9", "PT": "#CF99C3"}, 
                                       xlabel=None, ylabel=None, title='Early Myeloid Neighborhood', figsize = (4, 6))
plt.axhline(0, color='red', linestyle='--')
ax = fig.get_axes()[0]
ax1 = ax.twinx()
ax1.set_ylim(ax.get_ylim())
plt.tight_layout()
plt.savefig('/diskmnt/Users2/chouw/Projects/BM_spatial/IRD/BANKSY/IRD_JW_Xenium_BANKSY_neighborhood6_plasma_cell_enrichment_v11062025.pdf', dpi=300, transparent = True)

#### NK cells in endothelial neighborhoods (nbhd 3)

In [None]:
# Calculate statistics between timepoints for NK cells in the Endothelial neighborhood (nbhd3)
nk_nbhd3 = celltype_frac[(celltype_frac['nbhd_annot'] == 'Endothelial') & (celltype_frac['revised_annot'] == 'NK')]
stats_res = []
tps = nk_nbhd3['Collection'].unique().tolist()

for cond1, cond2 in combinations(tps, 2):
    cond1_frac = nk_nbhd3[nk_nbhd3['Collection'] == cond1]['Enrichment'].values.tolist()
    cond2_frac = nk_nbhd3[nk_nbhd3['Collection'] == cond2]['Enrichment'].values.tolist()
    if len(cond1_frac) > 0 and len(cond2_frac) > 0:
        u_stat, p_val = mannwhitneyu(cond1_frac, cond2_frac, alternative='two-sided')
        # Label the neighborhood explicitly instead of reusing a stale loop variable from an earlier cell
        stats_res.append({'Condition 1': cond1, 'Condition 2': cond2, 'Neighborhood': 'Endothelial', 'U statistic': u_stat, 'p-value': p_val})
stats_df = pd.DataFrame(stats_res)
stats_df['p_adj'] = multipletests(stats_df['p-value'], method='fdr_bh')[1]
stats_df['Sig'] = stats_df['p_adj'] < 0.05

# Plot NK cell enrichment in nbhd3 across timepoints
plt.figure(figsize = (4, 6))
timecols = {"NBM": "#0C7515", "NDMM": "#E619B9", "PT": "#CF99C3"}
order = ['NBM', 'NDMM', 'PT']  # fixed x order so bracket positions match the boxes

ax = sns.boxplot(nk_nbhd3, x='Collection', y='Enrichment', hue = 'Collection', order=order, palette = timecols, showfliers=False)
sns.swarmplot(nk_nbhd3, x='Collection', y='Enrichment', order=order, color='black', size=5, alpha = .5)
plt.axhline(0, color='red', linestyle='--')

# Prepare bracket positions (no hue)
x_pos = {lab: i for i, lab in enumerate(order)}

# Geometry
ymin, ymax = ax.get_ylim()
y_span = max(ymax - ymin, 1.0)
pad = 0.05 * y_span
step = 0.08 * y_span
h = 0.015 * y_span
text_pad = 0.5 * h

base = nk_nbhd3['Enrichment'].max() + pad
levels = [0] * len(order)  # per-x used levels to stack nested brackets

# Build pairs as (i, j, p_adj), sort by span so inner brackets go lower
pairs = []
for _, r in stats_df.iterrows():
    i, j = x_pos[r['Condition 1']], x_pos[r['Condition 2']]
    if i == j:
        continue
    if i > j:
        i, j = j, i
    pairs.append((i, j, r['p_adj']))

pairs.sort(key=lambda t: (t[1] - t[0], t[0]))  # shortest span first

for i, j, p in pairs:
    lvl = max(levels[i:j+1])  # find first free level across the span
    y1 = base + lvl * step
    y0 = y1 - h
    x1, x2 = i, j

    ax.plot([x1, x1, x2, x2], [y0, y1, y1, y0], lw=1.5, c='k', clip_on=False)
    ax.text((x1 + x2) / 2, y1 + text_pad, f"p={p:.2g}", ha='center', va='bottom', fontsize=9, clip_on=False)

    # reserve this level across the span so outer brackets go higher
    for k in range(i, j + 1):
        levels[k] = lvl + 1

# Ensure headroom
needed_top = base + (max(levels) * step) + pad
if needed_top > ymax:
    ax.set_ylim(ymin, needed_top)

plt.xlabel('Timepoint')
plt.ylabel('Relative enrichment of NK cells in neighborhood')
plt.title('Endothelial Neighborhood (nbhd3)')
plt.tight_layout()
plt.savefig('/diskmnt/Users2/chouw/Projects/BM_spatial/IRD/BANKSY/IRD_JW_Xenium_BANKSY_neighborhood3_NK_cell_enrichment_v11062025.pdf', dpi=300, transparent = True)
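
The bracket stacking used in the plot above can be isolated as a small pure-Python function, which makes the layout logic easier to check without drawing anything. This is a sketch of the same greedy scheme; the function name `assign_bracket_levels` is hypothetical. Brackets are placed shortest span first, and each one takes the lowest level that is free across every x position it covers.

```python
def assign_bracket_levels(pairs, n_positions):
    """Assign a stacking level to each (i, j) bracket so overlapping brackets do not collide.

    Shorter spans are placed first, so inner brackets sit below outer ones.
    Returns a list of (i, j, level) tuples in placement order.
    """
    levels = [0] * n_positions  # next free level at each x position
    placed = []
    normalized = [(min(i, j), max(i, j)) for i, j in pairs if i != j]
    for i, j in sorted(normalized, key=lambda t: (t[1] - t[0], t[0])):
        lvl = max(levels[i:j + 1])  # lowest level free across the whole span
        placed.append((i, j, lvl))
        for k in range(i, j + 1):
            levels[k] = lvl + 1  # reserve this level for every covered position
    return placed


# Three groups at x = 0, 1, 2 with all pairwise comparisons
layout = assign_bracket_levels([(0, 1), (0, 2), (1, 2)], n_positions=3)
print(layout)  # [(0, 1, 0), (1, 2, 1), (0, 2, 2)]
```

Brackets that share an endpoint, such as (0, 1) and (1, 2), land on different levels because the shared position is reserved by the first bracket. This costs some vertical space but keeps the p-value labels from touching.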

### Comparison with radial neighborhoods

In [None]:
rn_obj = sc.read_h5ad("/diskmnt/Projects/myeloma_scRNA_analysis/MMY_IRD/Xenium/analysis/radial_neighborhoods/Output/merged_RN.h5ad")

In [None]:
rn_obj.obs.head()

In [None]:
banksy_filtered.obs['RN_nbhd'] = rn_obj.obs.loc[banksy_filtered.obs_names, 'rn']
# .copy() materializes the subset so later edits to .obs do not act on an AnnData view
banksy_filtered = banksy_filtered[banksy_filtered.obs['RN_nbhd'] != 'Unassigned'].copy()

In [None]:
# For each radial neighborhood (RN), compute the fraction of its cells assigned to each BANKSY neighborhood
nbhd_counts = banksy_filtered.obs.groupby(['nbhd_annot', 'RN_nbhd']).size().reset_index(name='count')
rn_nbhd_counts = nbhd_counts.groupby('RN_nbhd')['count'].sum().reset_index(name='rn_total')
nbhd_counts = nbhd_counts.merge(rn_nbhd_counts, on = 'RN_nbhd')
nbhd_counts['fraction'] = nbhd_counts['count'] / nbhd_counts['rn_total']
nbhd_counts.head()

In [None]:
nbhd_counts


In [None]:
sns.scatterplot(nbhd_counts, y = 'RN_nbhd', x = 'nbhd_annot', size = 'fraction', hue = 'fraction', palette = 'viridis')
plt.legend(title = 'Fraction', bbox_to_anchor = (1.05, 1), loc = 'upper left')
plt.xticks(rotation = 90, ha = 'center')
plt.xlabel('BANKSY Neighborhood')
plt.ylabel('Radial Neighborhood')
plt.title('Fraction of Each Radial Neighborhood Assigned to Each BANKSY Neighborhood')
plt.tight_layout()
#plt.savefig('/diskmnt/Users2/chouw/Projects/BM_spatial/IRD/BANKSY/IRD_JW_Xenium_BANKSY_neighborhood_fractions_by_RN_v11062025.pdf', dpi=300, transparent = True)
plt.show()
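
As a cross-check on the fractions plotted above, the same table can be built with `pd.crosstab` normalized by column, so that each radial neighborhood sums to 1. The sketch below uses toy labels, which are hypothetical and not study data.

```python
import pandas as pd

# Hypothetical toy labels, NOT study data
toy_obs = pd.DataFrame({
    'nbhd_annot': ['Endothelial', 'Endothelial', 'Early Myeloid',
                   'Early Myeloid', 'Early Myeloid', 'Endothelial'],
    'RN_nbhd': ['RN1', 'RN1', 'RN1', 'RN2', 'RN2', 'RN2'],
})

# Rows: BANKSY neighborhoods; columns: radial neighborhoods; each column sums to 1
frac_table = pd.crosstab(toy_obs['nbhd_annot'], toy_obs['RN_nbhd'], normalize='columns')
print(frac_table)

# Long format with one row per (BANKSY neighborhood, radial neighborhood) pair
frac_long = frac_table.reset_index().melt(
    id_vars='nbhd_annot', var_name='RN_nbhd', value_name='fraction'
)
print(frac_long)
```

The wide table can be passed directly to a heatmap, while the long format matches the layout expected by `sns.scatterplot`.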




