
> **ISO2024 INTRODUCTORY SPATIAL 'OMICS ANALYSIS**
>
>
>- HYBRID : TORONTO & ZOOM
>- 9TH JULY 2024 <br>


>**Module 2 : Pre-processing steps**<BR>
>   * A. Understanding your output *
>   * B. Tidying and pre-evaluating your data *

>
>**Instructor : Shamini Ayyadhury**

---

```
MODULE 2 : SUPPLEMENTARY SCRIPT 01 - PERFORMING QUALITY CONTROL STEPS ON THE WT MOUSE SAMPLE
This repeats module2___script01_pre_processing_steps.ipynb, but here the wildtype mouse sample is processed and saved.
Both the wildtype and AD data outputs will be used in modules 5/6.

In [None]:
### import the following libraries

### Packages for general system functions, miscellaneous operating system interfaces, warning control system
import sys ### general system functions
import os ### miscellaneous operating system interfaces
import warnings ### warning control system
import psutil ### process and system utilities (used for memory monitoring)
warnings.filterwarnings('ignore') ### ignore warnings

### Packages for data manipulation and analysis, data visualization
import pandas as pd ### data manipulation and analysis for tabular data in python
import matplotlib.pyplot as plt ### plotting library for the Python programming language and its numerical mathematics extension NumPy
import seaborn as sns ### data visualization library based on matplotlib (my personal favourite over matplotlib)
import numpy as np ### support for large, multi-dimensional arrays and matrices, along with a collection of mathematical functions to operate on these arrays


In [None]:

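sys.path.append(os.path.expanduser('~/data/projects/spatial_workshop/')) ### add the workshop folder to the module search path ('~' must be expanded explicitly)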
import pre_processing_fnc as ppf

In [None]:
### it's sometimes useful to assign file names or paths to variables to avoid typing errors

### path variables
### os.path.expanduser is needed because Python does not expand '~' on its own
data_dir = os.path.expanduser('~/data1/data_orig/data/spatial/xenium/10xGenomics/') ### data directory
out = os.path.expanduser('~/data/projects/spatial_workshop/out/module2/') ### output directory for saving files. These output directories were created in advance to save time; participants are free to create their own.
os.makedirs(out, exist_ok=True) ### create the output directory if it does not already exist


### object variables
datasets_to_use = 'mice_AD_model/wt/xenium_out/' ### the name of the dataset to use
features_filepath = 'cell_feature_matrix.h5'
cells_filename = 'cells.parquet'
transcripts_filename = 'transcripts.parquet'
metrices_filename = 'metrics_summary.csv'


ppf.get_memory_usage() ### monitor memory usage
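
### --- illustrative sketch (not part of the workshop module) ---
### ppf.get_memory_usage() lives in the workshop's pre_processing_fnc module and its
### implementation is not shown in this script. A minimal stand-in, assuming it simply
### reports the resident memory of the current Python process via psutil, might look
### like the function below. The name demo_memory_usage_mb is hypothetical.
import os
import psutil

def demo_memory_usage_mb():
    """Return the resident set size (RSS) of the current process in megabytes."""
    process = psutil.Process(os.getpid())
    return process.memory_info().rss / (1024 ** 2)

print(f'Memory in use: {demo_memory_usage_mb():.1f} MB')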

In [None]:

### we will load 3 files here: cells.parquet, transcripts.parquet and metrics_summary.csv
### (cell_feature_matrix.h5 is not loaded in this script)
### ppf.check_parquet checks whether string values in the parquet file are stored as bytes and, if so, converts them back to strings
df_cell = ppf.check_parquet(os.path.join(data_dir, datasets_to_use, cells_filename))
df_transcript = ppf.check_parquet(os.path.join(data_dir, datasets_to_use, transcripts_filename))
df_metric = pd.read_csv(os.path.join(data_dir, datasets_to_use, metrices_filename))



#------------------------------------------------
ppf.get_memory_usage() ### monitor memory usage
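
### --- illustrative sketch (not part of the workshop module) ---
### The bytes-to-string check described above can be pictured with a small in-memory
### example. This is only an assumption about what such a check might do; the real
### ppf.check_parquet is not shown in this script. The helper name
### demo_decode_bytes_columns and the toy values are hypothetical.
import pandas as pd

def demo_decode_bytes_columns(df):
    """Return a copy of df in which bytes values in object columns are decoded to str."""
    decoded = df.copy()
    for col in decoded.columns:
        if decoded[col].dtype == object:
            decoded[col] = decoded[col].map(
                lambda v: v.decode('utf-8') if isinstance(v, bytes) else v
            )
    return decoded

### toy table: gene names stored as bytes, quality values stored as floats
demo_raw = pd.DataFrame({'feature_name': [b'Gfap', b'Aqp4'], 'qv': [40.0, 12.5]})
demo_decoded = demo_decode_bytes_columns(demo_raw)
print(demo_decoded)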

Analyze transcript QC

In [None]:


processed_data = ppf.process_data(df_transcript) ### we process and assign the output to an object called processed_data
del df_transcript ### we delete the original transcript dataframe to save memory


#------------------------------------------------
ppf.get_memory_usage() ### monitor memory usage

In [None]:
processed_data.head()
### note the additional columns added to the processed_data dataframe : group and binary


In [None]:
"""
The cleaned data is then used to create the gene matrix, control matrix, counts matrix and cell centroid matrix.
This involves removing low-quality transcripts and assigning the transcripts to cells based on the cell segmentation from the standard Xenium output.
We still keep the negative control values as a separate matrix.

However, the 3 matrices that we will bring forward to the next lesson are:
1. gene matrix
2. counts matrix
3. cell centroid matrix
"""

df_counts, transcripts_df, gene_mtx, neg_mtx, centroids = ppf.clean_processed_tf(processed_data)


#------------------------------------------------
ppf.get_memory_usage() ### monitor memory usage
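
### --- illustrative sketch (not part of the workshop module) ---
### A toy version of the cleaning step described above: drop low-quality and unassigned
### transcripts, then count transcripts per cell and gene, keeping negative controls in
### a separate matrix. The actual logic of ppf.clean_processed_tf is not shown in this
### script, so everything here is an assumption: the qv >= 20 cut-off (a commonly used
### Xenium quality threshold), the 'UNASSIGNED' cell label, the control-name prefixes
### and all demo_* names and values are hypothetical.
import pandas as pd

demo_tx = pd.DataFrame({
    'cell_id': ['c1', 'c1', 'c2', 'UNASSIGNED', 'c2', 'c1'],
    'feature_name': ['Gfap', 'Gfap', 'Aqp4', 'Gfap', 'NegControlProbe_0001', 'Aqp4'],
    'qv': [40.0, 35.0, 22.0, 38.0, 30.0, 10.0],
})

### keep transcripts that pass the quality threshold and fall inside a segmented cell
demo_keep = (demo_tx['qv'] >= 20) & (demo_tx['cell_id'] != 'UNASSIGNED')
demo_tx_ok = demo_tx[demo_keep]

### separate negative control probes from real genes (str.match anchors at the start)
demo_is_control = demo_tx_ok['feature_name'].str.match(r'NegControl|BLANK')
demo_genes = demo_tx_ok[~demo_is_control]
demo_controls = demo_tx_ok[demo_is_control]

### cell x gene count matrices
demo_gene_counts = pd.crosstab(demo_genes['cell_id'], demo_genes['feature_name'])
demo_neg_counts = pd.crosstab(demo_controls['cell_id'], demo_controls['feature_name'])
print(demo_gene_counts)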

In [None]:

os.makedirs(out+'wt_13_4mths', exist_ok=True)


df_counts.to_csv(out+'wt_13_4mths/df_counts.csv', index=True)
gene_mtx.to_csv(out+'wt_13_4mths/gene_mtx.csv', index=True)
neg_mtx.to_csv(out+'wt_13_4mths/neg_mtx.csv', index=True)
centroids.to_csv(out+'wt_13_4mths/centroids.csv', index=True)
transcripts_df.to_csv(out+'wt_13_4mths/transcripts_df.csv', index=True)

ppf.get_memory_usage() ### monitor memory usage

In [None]:
import scanpy as sc ### scanpy is a package for single-cell analysis in python

In [None]:
adata = sc.AnnData(X=gene_mtx, var=pd.DataFrame(index=gene_mtx.columns.values))

df_counts = df_counts[df_counts.index.isin(gene_mtx.index)]

df_counts = df_counts.reindex(gene_mtx.index)
adata.obs = df_counts.copy()
    
centroids = centroids.reindex(adata.obs.index)
adata.obs[['x_location', 'y_location']] = centroids[['centroid_x', 'centroid_y']].values

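### a gene counts as detected in a cell when it has at least one transcript there
gene_mtx_bool = gene_mtx > 0
n_cells = gene_mtx_bool.sum(axis=0) ### per gene: number of cells in which it is detected
n_genes = gene_mtx_bool.sum(axis=1) ### per cell: number of genes detected

adata.var['n_cells'] = n_cells
adata.obs['n_genes'] = n_genes

sc.pp.filter_genes(adata, min_cells=3) ### keep genes detected in at least 3 cells
sc.pp.filter_cells(adata, min_counts=9) ### keep cells with at least 9 total counts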



#------------------------------------------------
ppf.get_memory_usage() ### monitor memory usage

In [None]:
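### re-assign n_genes after filtering; pandas aligns on the cell index, so only the cells kept by the filters receive a value
### note that gene_mtx_bool is the unfiltered matrix, so genes dropped by filter_genes are still counted here
n_genes = gene_mtx_bool.sum(axis=1)
adata.obs['n_genes'] = n_genes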

In [None]:
fig, axes = plt.subplots(1, 1, figsize=(7, 5))


median = np.median(adata.obs['n_genes'])
mean = np.mean(adata.obs['n_genes'])

adata.uns['n_genes_med'] = median
adata.uns['n_genes_mean'] = mean

### first look at the distribution of genes detected per cell
ax = sns.histplot(adata.obs['n_genes'], bins=180, color='gray', ax=axes)
ax.axvline(median, color='red', linestyle='--', label='median')
ax.axvline(mean, color='blue', linestyle='--', label='mean')
ax.legend()
ax.set_title('B2C. Genes expressed per cell distribution')




#------------------------------------------------
ppf.get_memory_usage() ### monitor memory usage

In [None]:
### Always remember to save the coordinates of the spatial data in the adata object into the uns and obsm slots. Many methods call the spatial coordinates from these slots

adata.obsm['spatial'] = adata.obs[['x_location', 'y_location']].values
adata.uns['spatial'] = {'spatial' : adata.obsm['spatial'].copy()}

In [None]:
### save your adata
adata.raw = adata ### keep a copy of the filtered, un-normalized counts in .raw
adata.write(out+'wt_13_4mths/adata_wt.h5ad')