# Hahn Paper Data Exploration

The following code is executed after converting the .rds file into several files. Please see the **"Replicating Data Processing for Hanh .rds File.pdf"** file for more information. Here, we import all of the converted data files and construct an AnnData object to use with scanpy.

In [None]:
import scanpy as sc
import matplotlib.pyplot as plt
import numpy as np
import matplotlib as mpl
import os
import math
import anndata as ad
import pandas as pd
import sys

In [None]:
class PDF(object):
  def __init__(self, pdf, size=(200,200)):
    self.pdf = pdf
    self.size = size
  def _repr_html_(self):
    return '<iframe src={0} width={1[0]} height={1[1]}></iframe>'.format(self.pdf, self.size)
  def _repr_latex_(self):
    return r'\includegraphics[width=1.0\textwidth]{{{0}}}'.format(self.pdf)

## Paper: A spatiotemporal map of the aging mouse brain reveals white matter tracts as vulnerable foci

https://www.biorxiv.org/content/10.1101/2023.03.10.531984v1

In [None]:
PDF('2022.09.18.508419v2.full.pdf',size=(1300,1000))

## Oliver Hahn's Email Replies Regarding the Data

Regarding labeling: this was done pretty simply by hand (aka, me). I harmonized the transcriptome spots from all samples and then tried to annotate 
the resulting clusters by hand based on anatomy, marker genes, etc.

I’m not sure I would use that as ‘ground truth’ - I did it to the best of my abilities and so that the annotations are compatible with my bulk-seq data, 
and I think in that regard, it worked. But I would prefer to not have that being used as some sort of ’state of the art’ reference to which you contrast any 
computational method that you are setting up (and show that yours work a lot better ;) this was done in a very, very conservative way - and I might have 
preferred something more sophisticated - to not have reader/reviewers be distracted from the main narrative, which was dealing with the effects of aging. 



It contains a single RDS object, which you can load into R via loadRDS(“FILEPATH.rds”) -> SpatialData

All things considered, I decided to just pass you a cleaned R object that contains all the relevant count matrices (including normalizations), 
mapping locations, various forms of dimensionality reduction, clustering, metadata and spatial images. I assume you probably work more with python 
but I guess you’ll manage to extract what you need from that, and it’s a more compact format than multiple txt files, some images etc. 
This object was generated with the Seurat package, and a couple of relevant commands and ways to interact with the data 
(and extract whatever you need for your python or other environment) you’ll find in there. 
Please let me know once you downloaded the data, so that I can close the folder again.

The metadata contains two sets of region annotations: a ‘cluster-level’ and a ‘region-level’. 
    - In case of the former, we just looked at the transcriptional clusters and tried to annotate these one by one, often matching known substructures of the Allen Brain atlas. 
    - In the region-level  annotation, we grouped these together where we considered it ‘meaningful’. 
You’ll find a bit more explanation, and what clusters were combined into regions, in the attached copy of our respective figure.

One more thing: as you can probably see/imagine, we tried very hard to align the individual tissue slices as good as possible. That is, trying to 
cut at the same depth. We were less interested in analyzing what regions/clusters do we capture - that is kinda boring to us and people like the 
Allen Atlas have already done that - but rather what changes at the same location/region if you go from a young to an old brain.

Since region-identity has so much stronger transcriptional variance than aging (in some analyses it’s 40% of variance explained by region vs. 0.5% for aging), 
we had to be sure we’d not end up comparing apples with oranges by not cutting at the correct depth. However, since we had to do that literally by eye&hand, 
there’s an inevitable degree of variation present (you can see that in Young Sample 1). Just be mindful about that, as it could influence your analysis.

I hope all is clear and you’ll be able to navigate through the data from here on. Keep me in the loop, I’d be genuinely interested what you can dig out. 
The dataset is as of now massively under-explored, which was a concession I just had to make in order to get the paper out. However, as I’m moving to Calico 
this summer to start my own lab, I might revisit some of these things. There are always opportunities to collaborate and exchange ideas.


**Figure S5 Robust capture of spatial transcriptomes across age**

(A) Spatial-seq processing and analysis overview. Whole brains were frozen prior to OCT embedding and cryo-sectioning. Coronal sections were placed on a 10X Visium Spatial Gene Expression slide, followed by H&E staining and spatial reverse-transcription reaction. Single-spot transcriptomes were integrated, clustered with default settings and visualized as UMAP. Clustered spatial spot transcriptomes were mapped to their original location. To annotate the clusters, their marker genes (Table S6) were visualized, compared to the Allen Brain Atlas (23). (B) Complete data description and abbreviations of ontology and nomenclature for spatial transcriptome data. Regional-level annotated manually, and cluster-level determined by Seurat clustering. (C,D) Representative spatial transcriptome data (6 months replicate #2), colored by cluster-level annotation and represented as (C) UMAP and (D) spatial transcriptome. (E,F) Representative spatial transcriptome data (6 months replicate #2), colored by region-level annotation and represented as (E) UMAP and (F) spatial transcriptome. (G,H,I) Cluster-level annotation across replicates and datasets represented as (G) UMAP and (H) spatial transcriptome. (I) Fraction of spots corresponding to each cluster. (J,K,L) Region-level annotation across replicates and datasets represented as (J) UMAP and (K) spatial transcriptome. (L) Fraction of spots corresponding to each region.

In [None]:
PDF('FigS5.pdf',size=(1300,1000))

## Some Information on the Original .rds Data File in RStudio

## Loading .rds's meta.data file from Exported CSV file 

In [None]:
seurat_metadata_df = pd.read_csv("seurat_metadata.csv")
seurat_metadata_df = seurat_metadata_df.rename(columns={"Unnamed: 0": "Cell"})
seurat_metadata_df = seurat_metadata_df.set_index("Cell")
seurat_metadata_df

## Loading .rds's image tissue slices from Exported CSV files

In [None]:
# Load all the tissue image slice files and tag each row with its slice name
slice_names = ["slice1", "slice1_1_1", "slice1_2_2", "slice1_3_3", "slice1_4_4", "slice1_5_5"]

slice_list = []
for slice_name in slice_names:
    slice_df = pd.read_csv(f"tissue_coordinates_slices/tissue_coordinates_{slice_name}.csv")
    slice_df["slice"] = slice_name
    slice_list.append(slice_df)

# Concatenate all the tissue slice DataFrames
tissue_coordinates_all_slices = pd.concat(slice_list)
tissue_coordinates_all_slices = tissue_coordinates_all_slices.rename(columns={"Unnamed: 0": "Cell"})
tissue_coordinates_all_slices = tissue_coordinates_all_slices.set_index("Cell")

tissue_coordinates_all_slices

## Concatenate the Seurat Metadata DataFrame with the Tissue Coordinate DataFrame

In [None]:
seurat_full_metadata_df = pd.concat([seurat_metadata_df, tissue_coordinates_all_slices], axis=1)

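# Create new columns for coordinates rotated by 180 degrees.
# np.rot90 rotates the (n_spots, 2) table itself, which reverses the spot order and swaps the
# two columns, so each spot is instead reflected through the centre of the bounding box of all spots.
coords = seurat_full_metadata_df[['imagerow', 'imagecol']].to_numpy()
rotated_coord = np.nanmin(coords, axis=0) + np.nanmax(coords, axis=0) - coords
seurat_full_metadata_df['imagerow_rotated_v2'] = rotated_coord[:, 0]
seurat_full_metadata_df['imagecol_rotated_v2'] = rotated_coord[:, 1]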

seurat_full_metadata_df
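
Rotating spot coordinates by 180 degrees means reflecting every point through a centre, which is a different operation from rotating the coordinate table as if it were an image. The sketch below illustrates the difference on toy data. The helper name `rotate_180`, the toy spot names, and the choice to rotate each slice about the centre of its own bounding box are assumptions made for illustration, not something taken from the original processing.

```python
import numpy as np
import pandas as pd


def rotate_180(coords_df, row_col="imagerow", col_col="imagecol", group_col="slice"):
    """Rotate spot coordinates 180 degrees about the bounding-box centre of each slice."""
    coords = coords_df[[row_col, col_col]]
    grouped = coords.groupby(coords_df[group_col])
    # Reflecting a point p through a centre c gives 2c - p, and 2c = min + max
    return grouped.transform("min") + grouped.transform("max") - coords


# Toy coordinates for two hypothetical slices
toy = pd.DataFrame(
    {
        "imagerow": [10.0, 20.0, 30.0, 5.0, 15.0],
        "imagecol": [1.0, 2.0, 4.0, 7.0, 9.0],
        "slice": ["a", "a", "a", "b", "b"],
    },
    index=["spot1", "spot2", "spot3", "spot4", "spot5"],
)

# Each spot keeps its own row; only its position within the slice changes
rotated = rotate_180(toy)
print(rotated)

# Rotating the table itself reverses the spot order and swaps the two columns,
# so the first row now holds the (col, row) values of the last spot
table_rotated = np.rot90(toy[["imagerow", "imagecol"]].to_numpy(), 2)
print(table_rotated[0])
```

Because the result of `rotate_180` keeps the original index, it can be assigned back to the metadata table column by column without changing which spot each coordinate belongs to.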

In [None]:
slice1_data_df = seurat_full_metadata_df[seurat_full_metadata_df["slice"]=="slice1_3_3"]
slice1_age_data_df = slice1_data_df[slice1_data_df["age"]=="M6"]
slice1_age_data_df


## Assay CSV Files

Of the available assays, we load the SCT and Spatial assays, using the "data" (normalized) slot of each.

### SCT Assay

In [None]:
sct_data_assay_df = pd.read_csv("assays/sct_data_assay.csv")
sct_data_assay_df = sct_data_assay_df.rename(columns={"Unnamed: 0": "Gene"})
sct_data_assay_df = sct_data_assay_df.set_index("Gene")
sct_data_assay_df

In [None]:
sct_data_assay_df.loc[(sct_data_assay_df!=0).any(axis=1)]

### Spatial Assay

In [None]:
spatial_data_assay_df = pd.read_csv("assays/spatial_data_assay.csv")
spatial_data_assay_df = spatial_data_assay_df.rename(columns={"Unnamed: 0": "Gene"})
spatial_data_assay_df = spatial_data_assay_df.set_index("Gene")
print("Number of Genes in Spatial Assay:", len(spatial_data_assay_df))
print("Number of Cells in Spatial Assay:", len(spatial_data_assay_df.columns))
display(spatial_data_assay_df)

In [None]:
sum(spatial_data_assay_df.loc['AC234645.1'])

In [None]:
# filter out genes that have expression level 0 across all cells

spatial_data_assay_df = spatial_data_assay_df.loc[spatial_data_assay_df.any(axis=1)]
print("Number of Genes in Spatial Assay:", len(spatial_data_assay_df))
print("Number of Cells in Spatial Assay:", len(spatial_data_assay_df.columns))
display(spatial_data_assay_df)

In [None]:
32285-23923
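
The subtraction above appears to count the genes dropped by the all-zero filter. A minimal sketch of deriving that count from the filter mask itself, rather than from numbers copied out of the printed output, is shown below on a toy genes-by-cells table. The gene and cell names and the variable names are hypothetical.

```python
import pandas as pd

# Toy genes-by-cells matrix (hypothetical gene and cell names)
toy_assay_df = pd.DataFrame(
    {"cell_1": [0.0, 1.2, 0.0, 0.0], "cell_2": [0.0, 0.0, 0.0, 3.4]},
    index=pd.Index(["gene_a", "gene_b", "gene_c", "gene_d"], name="Gene"),
)

# A gene is kept when at least one cell has a non-zero value
expressed_mask = (toy_assay_df != 0).any(axis=1)
filtered_df = toy_assay_df.loc[expressed_mask]

# The number of removed genes is the number of False entries in the mask
n_removed = int((~expressed_mask).sum())
print("Genes before filtering:", len(toy_assay_df))
print("Genes after filtering:", len(filtered_df))
print("Genes removed:", n_removed)
```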

## Loading Raw Image Slices

In [None]:
# Load the CSV file into a DataFrame
raw_image_slice1_df = pd.read_csv('raw_image_slices/raw_image_slice1.csv', index_col=0)
raw_image_slice1_1_1_df = pd.read_csv('raw_image_slices/raw_image_slice1_1_1.csv', index_col=0)
raw_image_slice1_2_2_df = pd.read_csv('raw_image_slices/raw_image_slice1_2_2.csv', index_col=0)
raw_image_slice1_3_3_df = pd.read_csv('raw_image_slices/raw_image_slice1_3_3.csv', index_col=0)
raw_image_slice1_4_4_df = pd.read_csv('raw_image_slices/raw_image_slice1_4_4.csv', index_col=0)
raw_image_slice1_5_5_df = pd.read_csv('raw_image_slices/raw_image_slice1_5_5.csv', index_col=0)

# Convert the DataFrame into a NumPy array
raw_image_slice1_data = raw_image_slice1_df.values
raw_image_slice1_1_1_data = raw_image_slice1_1_1_df.values
raw_image_slice1_2_2_data = raw_image_slice1_2_2_df.values
raw_image_slice1_3_3_data = raw_image_slice1_3_3_df.values
raw_image_slice1_4_4_data = raw_image_slice1_4_4_df.values
raw_image_slice1_5_5_data = raw_image_slice1_5_5_df.values

raw_image_slice1_1_1_df

## Building AnnData for Scanpy

In [None]:
# https://anndata.readthedocs.io/en/latest/generated/anndata.AnnData.html

import hdf5plugin

#assays = ["spatial", "sct"]
assays = ["spatial"]

for assay in assays:

    if assay == "spatial":
        assay_data_df = spatial_data_assay_df.copy()
    else:
        assay_data_df = sct_data_assay_df.copy()
    
    # transpose the assay data (genes x cells -> cells x genes) to use as 'X' matrix for the anndata
    transposed_assay_data_df = assay_data_df.T.rename_axis(index="Cell", columns=None)

    # Construct AnnData
    adata_build = ad.AnnData(X=transposed_assay_data_df.values.astype(np.float64))

    # copy Metadata to AnnData Object
    adata_build.obs = seurat_full_metadata_df
    
    # copy gene names for .var 
    adata_build.var = pd.DataFrame(transposed_assay_data_df.columns).rename(columns={0: 'Gene'}).set_index("Gene")

    # copy all Spatial information to AnnData Object (all image files are associated with "Spatial" assay)
    adata_build.obsm["X_spatial"] = np.array(adata_build.obs[["imagerow", "imagecol"]])
    adata_build.obsm["spatial"] = np.array(adata_build.obs[["imagerow", "imagecol"]])
    adata_build.obsm["X_spatial_rotated"] = np.array(adata_build.obs[["imagerow_rotated_v2", "imagecol_rotated_v2"]])
    adata_build.obsm["spatial_rotated"] = np.array(adata_build.obs[["imagerow_rotated_v2", "imagecol_rotated_v2"]])
    
    # placing slice image data in 'uns' (this is an extra, last resort place to put the data for the AnnData)
    adata_build.uns["spatial"] = {}
    adata_build.uns["spatial"]["assay_slice_images"] = {}
    adata_build.uns["spatial"]["assay_slice_images"]["images"] = {}
    adata_build.uns["spatial"]["assay_slice_images"]["images"]["slice1"] = raw_image_slice1_data
    adata_build.uns["spatial"]["assay_slice_images"]["images"]["slice1_1_1"] = raw_image_slice1_1_1_data
    adata_build.uns["spatial"]["assay_slice_images"]["images"]["slice1_2_2"] = raw_image_slice1_2_2_data
    adata_build.uns["spatial"]["assay_slice_images"]["images"]["slice1_3_3"] = raw_image_slice1_3_3_data
    adata_build.uns["spatial"]["assay_slice_images"]["images"]["slice1_4_4"] = raw_image_slice1_4_4_data
    adata_build.uns["spatial"]["assay_slice_images"]["images"]["slice1_5_5"] = raw_image_slice1_5_5_data

    # for visualizations associated with regionLevel
    category_colors = {
        'Cortex': np.array([0.416, 0.0, 0.416, 1.0]),                                 # purple
        'Hypothalamus': np.array([0.0, 0.502, 0.0, 1.0]),                             # green
        'Amygdala': np.array([1.0, 0.80, 0.0, 1.0]),                                  # yellow
        'Thalamus': np.array([0.0, 0.0, 0.502, 1.0]),                                 # dark blue
        'Striatum': np.array([1.0, 0.714, 0.757, 1.0]),                               # light pink
        'White matter': np.array([1.0, 0.0, 0.0, 1.0]),                               # red
        'Cortical subplate': np.array([0.502, 0.502, 0.502, 1.0]),                    # grey
        'Globus pallidus': np.array([0.322, 0.651, 0.839, 1.0]),                      # light blue
        'Hippocampus': np.array([1.0, 0.502, 0.0, 1.0]),                              # orange
        'Ventricle': np.array([0.824, 0.706, 0.549, 1.0]),                            # tan
        'Thalamic reticular nucleus': np.array([0.565, 0.933, 0.560, 1.0]),           # light green
        'Basolateral amygdalar nucleus': np.array([1, 0.07843137, 0.57647059, 1.0]),  # dark pink
        'Choroid Plexus': np.array([0.0, 0.0, 0.0, 1.0]),                             # black
    }

    adata_build.uns["regionLevel_colors"] = category_colors
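
Assigning the metadata table to `.obs` replaces the observation names with the table's index but does not reorder the rows, so the metadata must already be in the same cell order as the expression matrix. The sketch below shows one way to check and enforce that order before the assignment. The helper name `align_obs_to_matrix` and the toy cell names, gene names, and age labels are hypothetical, and the sketch assumes the cell names are spelled identically in both tables.

```python
import pandas as pd


def align_obs_to_matrix(cells_by_genes_df, metadata_df):
    """Return the metadata rows reordered to match the row order of the expression matrix."""
    missing = cells_by_genes_df.index.difference(metadata_df.index)
    if len(missing) > 0:
        raise KeyError(f"{len(missing)} cells have no metadata, e.g. {list(missing[:3])}")
    if not metadata_df.index.is_unique:
        raise ValueError("metadata index contains duplicate cell names")
    # .loc with a list-like of labels returns the rows in the requested order
    return metadata_df.loc[cells_by_genes_df.index]


# Toy cells-by-genes matrix and a metadata table stored in a different cell order
expr = pd.DataFrame(
    [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]],
    index=["c1", "c2", "c3"],
    columns=["g1", "g2"],
)
meta = pd.DataFrame({"age": ["M6", "M18", "M21"]}, index=["c3", "c1", "c2"])

aligned = align_obs_to_matrix(expr, meta)
print(aligned)
```

With a check like this in place, a mismatch between the assay columns and the metadata index raises an error instead of silently pairing a cell with another cell's metadata.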

In [None]:
adata_build
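
If the `regionLevel_colors` entry is meant to drive scanpy's categorical plots, note that scanpy reads `uns["<key>_colors"]` as a sequence of colors ordered like the categories of `obs["<key>"]`, not as a name-to-color dictionary. A minimal sketch of deriving such a list from a dictionary follows. The helper name `palette_for_categories` and the grey fallback color are assumptions, and only two of the regions are reproduced to keep the example short.

```python
import numpy as np
import pandas as pd
from matplotlib.colors import to_hex


def palette_for_categories(categories, color_lookup, fallback="#808080"):
    """Return hex colors ordered like `categories`, using `fallback` for unknown names."""
    return [to_hex(color_lookup[c]) if c in color_lookup else fallback for c in categories]


# Two entries taken from the region color dictionary (RGBA values in [0, 1])
region_colors = {
    "White matter": np.array([1.0, 0.0, 0.0, 1.0]),
    "Choroid Plexus": np.array([0.0, 0.0, 0.0, 1.0]),
}

# pandas sorts string categories alphabetically, so the palette must follow that order
regions = pd.Categorical(["White matter", "Choroid Plexus", "White matter"])
palette = palette_for_categories(regions.categories, region_colors)
print(list(regions.categories), palette)
```

The dictionary can still be kept under a separate `uns` key for custom plotting code, while the ordered list is stored under the `_colors` key.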

In [None]:
adata_build.write_h5ad("hahn_spatial_assay_anndata")

In [None]:
if assay == "spatial":
    adata_build.write_h5ad("hahn_spatial_assay_anndata", compression=hdf5plugin.FILTERS["zstd"])
else:
    adata_build.write_h5ad("hahn_sct_assay_anndata", compression=hdf5plugin.FILTERS["zstd"])