# Broad IBD Challenge: Data and Results Visualizations

## Ideas

Data Exploration: Visualizing input data characteristics
    Spatial distribution of cells colored by region
    Spatial distribution of cells colored by quality score

Feature Extraction: Visualizing extracted image features
    PCA (Principal Component Analysis): Visualizing the principal components of image features
    t-SNE (t-Distributed Stochastic Neighbor Embedding): Visualizing local relationships between features
    UMAP (Uniform Manifold Approximation and Projection): Visualizing global and local structure of feature space
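As a rough illustration of the PCA idea above, the sketch below projects a feature matrix onto its top principal components using only NumPy's SVD. The function name `pca_project` and the synthetic `demo_features` array are assumptions made for this example; real image features from the challenge pipeline would replace them.

```python
import numpy as np

def pca_project(features, n_components=2):
    """Project the rows of `features` onto their top principal components."""
    X = np.asarray(features, dtype=float)
    X_centered = X - X.mean(axis=0)
    # SVD of the centered data: the rows of Vt are the principal axes.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    scores = X_centered @ Vt[:n_components].T
    # Squared singular values are proportional to the variance per component.
    explained_variance_ratio = (S ** 2) / np.sum(S ** 2)
    return scores, explained_variance_ratio[:n_components]

# Synthetic stand-in for extracted image features (200 cells x 16 features).
rng = np.random.default_rng(0)
demo_features = rng.normal(size=(200, 16)) * np.linspace(3.0, 0.5, 16)
scores, ratio = pca_project(demo_features, n_components=2)
```

The two columns of `scores` can be passed to any scatter-plot routine and colored by region or quality score.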

Model Performance: Visualizing model training and evaluation

Training Metrics
    Line plots of training and validation loss over epochs
    Line plots of Spearman correlation over epochs
    Learning rate schedules
    Gradient norms during training
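To plot Spearman correlation over epochs, a per-epoch value has to be computed first. The following is a minimal pure-Python sketch, assuming predictions are stored per epoch; `target` and `epoch_predictions` are made-up illustrative values, not challenge data.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    if sx == 0 or sy == 0:
        return float("nan")  # correlation is undefined for constant input
    return cov / (sx * sy)

# Hypothetical targets and per-epoch predictions for one gene.
target = [1, 2, 3, 4, 5]
epoch_predictions = [[5, 3, 4, 1, 2], [2, 1, 3, 5, 4], [1, 2, 3, 4, 5]]
history = [spearman(pred, target) for pred in epoch_predictions]
```

`history` then holds one value per epoch and can be drawn as a line plot next to the loss curves.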

Prediction Accuracy
    Scatter plots of predicted vs. actual expression values
    Histograms of prediction errors
    Box plots of prediction accuracy across genes
    Heatmaps of correlation matrices
    PCA plots comparing predictions to ground truth
    t-SNE plots comparing predictions to ground truth
    UMAP plots comparing predictions to ground truth

Gene Expression: Visualizing predicted gene expression patterns
    PCA of gene expression colored by region
    Gene expression distributions
    Gene-gene correlation heatmap
        Heatmaps of gene expression across cells
        Clustered heatmaps showing gene modules
        Heatmaps comparing predicted and actual expression
        Heatmaps showing expression differences between tissue regions

Spatial Patterns: Visualizing spatial distribution of gene expression
    Original H&E images showing tissue structure
    Overlay of cell positions on H&E images
    Zoomed-in views of specific tissue regions
    Scatter plots of cells colored by expression level
    Heatmaps overlaid on tissue images
    Contour plots showing expression gradients
    3D surface plots of expression landscapes
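One possible way to build the heatmap layer for the spatial ideas above is to bin cells on a regular grid and average expression per bin. This is a sketch under assumptions: `binned_mean_expression` is a hypothetical helper, coordinates are taken to be (x, y) pairs, and empty bins are marked as NaN.

```python
import numpy as np

def binned_mean_expression(xy, expression, bins=50):
    """Average one gene's expression over a regular spatial grid."""
    xy = np.asarray(xy, dtype=float)
    expression = np.asarray(expression, dtype=float)
    # Sum of expression per bin.
    sums, x_edges, y_edges = np.histogram2d(
        xy[:, 0], xy[:, 1], bins=bins, weights=expression
    )
    # Number of cells per bin, using the same bin edges.
    counts, _, _ = np.histogram2d(xy[:, 0], xy[:, 1], bins=[x_edges, y_edges])
    mean = np.full_like(sums, np.nan)
    # Bins without cells stay NaN so they can be rendered as transparent.
    np.divide(sums, counts, out=mean, where=counts > 0)
    return mean, x_edges, y_edges

# Tiny illustrative input: three cells, one gene.
demo_xy = [[0.0, 0.0], [0.0, 0.1], [1.0, 1.0]]
demo_expression = [2.0, 4.0, 6.0]
grid, x_edges, y_edges = binned_mean_expression(demo_xy, demo_expression, bins=2)
```

The returned `grid` is indexed as `[x_bin, y_bin]`, so it usually needs a transpose before being shown with an image-style plot.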

Differential Expression: Visualizing differences between tissue regions
    Histograms of gene expression values
    Box plots showing expression distribution across genes
    Violin plots comparing expression in different tissue regions
    Density plots of expression values
    Scatter plots of -log10(p-value) vs. log2(fold change)
    Highlighted points for significantly differentially expressed genes
    Labeled points for top genes
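The last three items describe a volcano plot. The sketch below computes its two axes for a pair of cell groups. It is an approximation under stated assumptions: expression values are non-negative, the p-value comes from a Welch-style z statistic with a normal approximation (reasonable only for large groups), and `volcano_coordinates`, `pseudocount`, and the synthetic regions are all hypothetical names and data for illustration.

```python
import math
import numpy as np

def volcano_coordinates(group_a, group_b, pseudocount=1.0):
    """Return per-gene log2 fold change (b vs. a) and -log10 p-value."""
    a = np.asarray(group_a, dtype=float)
    b = np.asarray(group_b, dtype=float)
    mean_a, mean_b = a.mean(axis=0), b.mean(axis=0)
    # The pseudocount avoids division by zero for unexpressed genes.
    log2_fc = np.log2((mean_b + pseudocount) / (mean_a + pseudocount))
    # Standard error of the difference in means (unequal variances).
    se = np.sqrt(a.var(axis=0, ddof=1) / a.shape[0] + b.var(axis=0, ddof=1) / b.shape[0])
    # Genes with zero variance in both groups get z = 0, i.e. p = 1.
    z = np.divide(mean_b - mean_a, se, out=np.zeros_like(se), where=se > 0)
    # Two-sided p-value under a normal approximation (assumption).
    p = np.array([math.erfc(abs(v) / math.sqrt(2.0)) for v in z])
    neg_log10_p = -np.log10(np.clip(p, 1e-300, 1.0))
    return log2_fc, neg_log10_p

# Synthetic cells x genes matrices; gene 0 is shifted upward in region b.
rng = np.random.default_rng(1)
region_a = rng.normal(5.0, 1.0, size=(300, 3))
region_b = rng.normal(5.0, 1.0, size=(300, 3))
region_b[:, 0] += 2.0
log2_fc, neg_log10_p = volcano_coordinates(region_a, region_b)
```

For a real analysis a dedicated statistical test with multiple-testing correction would likely be preferable; this only shows how the plotted quantities are derived.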

Comparison of dysplastic and non-dysplastic regions
    MA Plots
        Scatter plots of log2(fold change) vs. log2(mean expression)
        Highlighted points for significantly differentially expressed genes
        Smoothed trend lines showing overall patterns
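For the MA plot, the two axes can be derived from per-gene group means. This sketch assumes non-negative expression and uses the common convention where A is the average of the two log2 means; `ma_coordinates` and `pseudocount` are hypothetical names introduced for the example.

```python
import numpy as np

def ma_coordinates(group_a, group_b, pseudocount=1.0):
    """Return M (log2 fold change) and A (mean log2 expression) per gene."""
    log_a = np.log2(np.asarray(group_a, dtype=float).mean(axis=0) + pseudocount)
    log_b = np.log2(np.asarray(group_b, dtype=float).mean(axis=0) + pseudocount)
    m_values = log_b - log_a          # difference between the groups
    a_values = 0.5 * (log_a + log_b)  # average abundance on the log2 scale
    return m_values, a_values

# Two genes, two cells per group (illustrative numbers only).
m_values, a_values = ma_coordinates([[1, 3], [1, 3]], [[3, 3], [3, 3]])
```

Plotting `m_values` against `a_values` gives the scatter; a smoothed trend line can be fitted afterwards.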

## Input data

In [2]:
 !ls data | grep ''

Crunch3_gene_list.csv
Crunch3_scRNAseq.h5ad
UC1_I.zarr
UC9_I-crunch3-HE-dysplasia-ROI.tif
UC9_I-crunch3-HE-label-stardist.tif
UC9_I-crunch3-HE.tif
UC9_I.zarr


In [3]:
name_data = 'UC1_I'

In [4]:
data_directory_path = './data'

In [5]:
import os
import spatialdata as sd 

In [6]:
zarr_path = os.path.join(data_directory_path, f"{name_data}.zarr")
print("zarr_path", zarr_path, os.path.exists(zarr_path))
sdata = sd.read_zarr(zarr_path)


zarr_path ./data/UC1_I.zarr True


In [7]:
sdata

SpatialData object, with associated Zarr store: /home/catskills/Desktop/broad/data/data/UC1_I.zarr
├── Images
│     ├── 'DAPI': DataArray[cyx] (1, 47659, 51147)
│     ├── 'DAPI_nuc': DataArray[cyx] (1, 47659, 51147)
│     ├── 'HE_nuc_original': DataArray[cyx] (1, 20000, 22000)
│     ├── 'HE_nuc_registered': DataArray[cyx] (1, 47659, 51147)
│     ├── 'HE_original': DataArray[cyx] (3, 20000, 22000)
│     ├── 'HE_registered': DataArray[cyx] (3, 47659, 51147)
│     ├── 'group': DataArray[cyx] (1, 47659, 51147)
│     └── 'group_HEspace': DataArray[cyx] (1, 20000, 22000)
├── Points
│     └── 'transcripts': DataFrame with shape: (<Delayed>, 8) (2D points)
└── Tables
      ├── 'anucleus': AnnData (202534, 460)
      └── 'cell_id-group': AnnData (234356, 0)
with coordinate systems:
    ▸ 'global', with elements:
        DAPI (Images), DAPI_nuc (Images), HE_nuc_original (Images), HE_nuc_registered (Images), HE_original (Images), HE_registered (Images), group (Images), group_HEspace (Images)
    

In [9]:
gene_name_list = sdata['anucleus'].var['gene_symbols'].values
gene_name_list[0:10], len(gene_name_list)

(array(['A2M', 'ACP5', 'ACTA2', 'ADAMTSL3', 'AFAP1L2', 'AHR', 'ALDH1B1',
        'ANO1', 'ANXA1', 'AQP1'], dtype=object),
 460)

In [10]:
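# Sampling every row (n equals the table size) shuffles the cells reproducibly
size_subset = len(sdata['anucleus'].obs)
rows_to_keep = list(sdata['anucleus'].obs.sample(n=size_subset, random_state=42).index)
# Cell IDs in their original (unshuffled) order
cell_id_train = sdata['anucleus'].obs["cell_id"].values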

In [13]:
dir_processed_dataset = 'resources/processed_dataset'

In [14]:
patch_save_dir = os.path.join(dir_processed_dataset, "patches")
adata_save_dir = os.path.join(dir_processed_dataset, "adata")
splits_save_dir = os.path.join(dir_processed_dataset, "splits")

In [17]:
# Path for the .h5 image dataset
h5_path = os.path.join(patch_save_dir, name_data + '.h5')

In [18]:
sdata['anucleus'].obsm['spatial']

array([[15922.64229765,  4480.31331593],
       [15655.11413043,  4483.20108696],
       [15950.88442211,  4457.26130653],
       ...,
       [36321.08390023, 44317.16780045],
       [36977.44547564, 43940.57308585],
       [36902.07462687, 43880.94626866]])

In [24]:
[x for x in vars(sdata)]

['_path',
 '_shared_keys',
 '_images',
 '_labels',
 '_points',
 '_shapes',
 '_tables',
 '_attrs',
 '_query']

In [20]:
from extract_spatial_positions import extract_spatial_positions

In [21]:
new_spatial_coord = extract_spatial_positions(sdata, cell_id_train)
# Store new spatial coordinates into sdata
sdata['anucleus'].obsm['spatial'] = new_spatial_coord

Extracting spatial positions ...


100%|██████████| 234356/234356 [00:21<00:00, 10875.74it/s]


In [32]:
df = sdata['anucleus'][0].to_df()

In [34]:
df.loc[:, (df != 0).any(axis=0)].values

array([[1.3090631, 2.2091649, 1.3090631, 1.3090631, 1.8571422, 1.8571422,
        1.3090631, 1.3090631, 1.3090631, 1.8571422, 1.3090631, 1.3090631,
        1.3090631, 1.3090631, 1.3090631, 1.3090631, 1.3090631, 1.8571422,
        1.3090631, 1.3090631, 1.3090631, 1.3090631, 2.6750803, 1.3090631,
        1.3090631, 1.3090631, 1.3090631]], dtype=float32)

In [35]:
# Create the gene expression dataset (Y)
print("Create gene expression dataset (Y) ...")
y_subtracted = sdata['anucleus'][rows_to_keep].copy()

Create gene expression dataset (Y) ...


In [41]:
# Trick: pad every index label to the same length to avoid problems when saving to h5
y_subtracted.obs.index = ['x' + str(i).zfill(6) for i in y_subtracted.obs.index]

In [42]:
y_subtracted

AnnData object with n_obs × n_vars = 202534 × 460
    obs: 'cell_id'
    var: 'gene_symbols'
    obsm: 'spatial'
    layers: 'counts'

In [40]:
os.path.exists(os.path.join(adata_save_dir, f'{name_data}.h5ad'))


True

In [None]:
# Save the gene expression data to an H5AD file
y_subtracted.write(os.path.join(adata_save_dir, f'{name_data}.h5ad'))

In [46]:
import warnings

# Warn if any index label differs in length from the first one
for index in y_subtracted.obs.index:
    if len(index) != len(y_subtracted.obs.index[0]):
        warnings.warn("indices of y_subtracted.obs should all have the same length to avoid problems when saving to h5", UserWarning)
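The padding trick and the length check above can be tried in isolation on plain Python lists. This is a small sketch; `pad_index` and `has_uniform_length` are hypothetical helper names, and the labels are made up.

```python
import warnings

def pad_index(labels, prefix="x", width=6):
    """Zero-pad each label to `width` characters and add a prefix."""
    return [prefix + str(label).zfill(width) for label in labels]

def has_uniform_length(labels):
    """True when every label has the same number of characters."""
    return len({len(label) for label in labels}) <= 1

padded = pad_index([7, 4321, 121675])
if not has_uniform_length(padded):
    # Emit a single warning instead of one per offending label.
    warnings.warn("index labels differ in length", UserWarning)
```

Note that `zfill` only pads: a label longer than `width` keeps its full length, which is exactly the case the check is meant to catch.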

In [45]:
import numpy as np

In [47]:
# Extract spatial coordinates and barcodes (cell IDs) for the patches
coords_center = y_subtracted.obsm['spatial']
barcodes = np.array(y_subtracted.obs.index)

In [49]:
barcodes

array(['x121675', 'x157346', 'x164660', ..., 'x161779', 'x179685',
       'x149190'], dtype=object)
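With centers and barcodes available, each patch can be cut from the image around its center. The sketch below is one way this might look, not the pipeline's actual patching code. It assumes a `(c, y, x)` array, centers given as `(x, y)` in the same pixel space as the image, and zero padding at the borders; `crop_patch` and `patch_size` are hypothetical names.

```python
import numpy as np

def crop_patch(image_cyx, center_xy, patch_size=32):
    """Cut a square patch around `center_xy`, zero-filled outside the image."""
    channels, height, width = image_cyx.shape
    half = patch_size // 2
    cx, cy = int(round(center_xy[0])), int(round(center_xy[1]))
    x0, y0 = cx - half, cy - half  # top-left corner, may be negative
    patch = np.zeros((channels, patch_size, patch_size), dtype=image_cyx.dtype)
    # Clip the requested window to the image bounds.
    ys, ye = max(y0, 0), min(y0 + patch_size, height)
    xs, xe = max(x0, 0), min(x0 + patch_size, width)
    if ys < ye and xs < xe:
        patch[:, ys - y0:ye - y0, xs - x0:xe - x0] = image_cyx[:, ys:ye, xs:xe]
    return patch

# Small synthetic image standing in for the H&E array.
demo_image = np.arange(3 * 10 * 12).reshape(3, 10, 12)
inner = crop_patch(demo_image, (6.0, 5.0), patch_size=4)
corner = crop_patch(demo_image, (0.0, 0.0), patch_size=4)
```

Only the requested window is read, so the same approach works on lazily loaded arrays as long as they support slicing.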