# Quality Control

This notebook contains the code to perform quality control (QC) evaluations on the desired set of 10x VISIUM samples. After the assessment, we set cutoffs to filter out low-quality spots. The anndata files are then saved for further use.

# Table of Contents
* [Parameters and Metadata](#1.-Parameters-and-Metadata)
    * [Parameters to be set](#1.1.-Parameters-to-be-set)
    * [Define Input, sample names and metadata](#1.2.-Define-Input,-sample-names-and-metadata)
* [Quality Control](#2.-Quality-Control)
    * [Quality Control: Global Metrics](#2.1.-Quality-Control:-Goblal-Metrics)
    * [Image-based Quality control](#2.2.-Image-based-Quality-control)
* [Filtering and saving the anndata Objects](#3-Filtering-and-saving-the-anndata-Objects)
    * [QC-based filtering of spots and genes](#3.1.-QC-base-filtering-of-spots-and-genes)
    * [Saving the anndata objects for further analysis](#3.2.-Saving-the-anndata-objects-for-further-analysis)

In [None]:
import scanpy as sc
import pandas as pd
import os
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import squidpy as sq
import re
import besca as bc
from wrapper_functions import *
sns.set()

In [None]:
# Automatically re-load wrapper functions after an update
# Find details here: https://ipython.readthedocs.io/en/stable/config/extensions/autoreload.html
%load_ext autoreload
%autoreload 2

In [None]:
sc.logging.print_versions()
# sc.set_figure_params(facecolor="white", figsize=(6, 6))
sc.settings.verbosity = 3

# 1. Parameters and Metadata

## 1.1. Parameters to be set

In this section, we set up some parameters that are required by the functions contained in our wrappers. 

We currently need to define the 10x VISIUM experimental protocol. There are two options, depending on how the tissue sections were preserved:

- **Fresh Frozen (FF):** An unbiased approach that captures polyadenylated mRNA. It is species-independent and covers the whole transcriptome.
- **Formalin Fixed Paraffin Embedded (FFPE):** A probe-based approach, currently only available for human and mouse. The probes cover over 18,000 genes. Of note, the FFPE protocol can also be performed on FF samples.

We also need to set the species name. It is mainly used to compute the mitochondrial content and for prior-knowledge approaches such as gene set enrichment analysis. The current options are the following:

- **human**
- **mouse**
- **rat**
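
As a rough illustration of why the species matters for the mitochondrial content, the sketch below computes the percentage of mitochondrial counts in a single spot from gene symbols. The `MITO_PREFIX` mapping and the `mito_percentage` helper are hypothetical and are not part of the wrapper functions. The prefixes for human (`MT-`) and mouse (`mt-`) follow the usual gene symbol conventions; rat symbols depend on the annotation in use and are left out here.

```python
# Hypothetical mapping from species to the mitochondrial gene-symbol prefix.
# Rat is omitted on purpose: check the symbols in your own annotation first.
MITO_PREFIX = {"human": "MT-", "mouse": "mt-"}


def mito_percentage(gene_counts, species):
    """Return the percentage of counts in one spot coming from mitochondrial genes.

    gene_counts: dict mapping gene symbol -> count for a single spot.
    """
    prefix = MITO_PREFIX[species]
    total = sum(gene_counts.values())
    if total == 0:
        return 0.0
    mito = sum(count for gene, count in gene_counts.items() if gene.startswith(prefix))
    return 100.0 * mito / total


# Toy mouse spot: 50 of 200 counts come from mitochondrial genes.
spot = {"mt-Co1": 30, "mt-Nd1": 20, "Alb": 150}
print(mito_percentage(spot, "mouse"))  # 25.0
```

Using the wrong species would silently return 0% here, which is one reason to set the organism explicitly.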


In [None]:
organism = Organism.mouse
analyze_params = Analyze(protocol=Protocol.FF, organism=organism)

## 1.2. Define Input, sample names and metadata

Here we define the location of the raw data and the most relevant metadata associated with the samples under consideration.

In [None]:
root_path = os.getcwd()
# Folder containing the Space Ranger output (count matrices, images, spot coordinates)
inpath = root_path + '/spaceranger-local/'
results_folder = os.path.join(root_path, 'results')

In [None]:
## check if folder exists and create it otherwise
if not os.path.exists(results_folder):
    os.makedirs(results_folder)
    print(f"Folder '{results_folder}' created.")
else:
    print(f"Folder '{results_folder}' already exists.")

In [None]:
metadata = pd.read_csv(os.path.join(root_path, 'raw', 'metadata.tsv'), sep="\t", index_col='BARCODE')

In [None]:
metadata

Since the table above holds one row per spot, we can also generate a summary dataframe containing the unique conditions per sample.
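
To make the summarization step concrete, here is a toy sketch on invented data. The column names mirror the ones used in this notebook, but the barcodes and values are made up. It also adds a sanity check that is not part of the original workflow: after dropping duplicates, every sample should appear exactly once.

```python
import pandas as pd

# Toy per-spot metadata; the values are invented for illustration only.
spots = pd.DataFrame(
    {
        "readout_id": ["S1", "S1", "S2", "S2", "S2"],
        "CONDITION": ["control", "control", "treated", "treated", "treated"],
        "treatment_id": ["veh", "veh", "drugA", "drugA", "drugA"],
    },
    index=pd.Index(["AAAC-1", "AAAG-1", "AACA-1", "AACC-1", "AACG-1"], name="BARCODE"),
)

# Collapse the per-spot table into one row per sample.
summary = (
    spots[["readout_id", "CONDITION", "treatment_id"]]
    .drop_duplicates()
    .set_index("readout_id")
)

# A sample listed more than once would mean that its spots carry
# conflicting annotations.
assert summary.index.is_unique, "conflicting metadata within a sample"
print(summary)
```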

In [None]:
metadata_summary = metadata[['readout_id','CONDITION', 'treatment_id']].drop_duplicates().set_index('readout_id')

In [None]:
metadata_summary

# 2. Quality Control

## 2.1. Quality Control: Goblal Metrics

We first take a look at the global metrics produced by the Space Ranger pipeline for each sample and plot them together as barplots for comparison. The barplots can be colored by the different values in our metadata to detect batch- or condition-related effects.
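
For readers without access to the wrapper functions, the following sketch shows one way such a table could be assembled. It is not the implementation of `get_global_QCmetrics`: the `collect_metrics` helper is hypothetical, and it assumes that each sample folder follows the `<inpath>/<sample>/outs/metrics_summary.csv` layout written by Space Ranger.

```python
import os

import pandas as pd


def collect_metrics(inpath, sample_names):
    """Stack the metrics_summary.csv of each sample into a single table."""
    rows = []
    for sample in sample_names:
        # Assumed layout: <inpath>/<sample>/outs/metrics_summary.csv
        csv_path = os.path.join(inpath, sample, "outs", "metrics_summary.csv")
        row = pd.read_csv(csv_path)  # one row of metrics per sample
        row.index = [sample]
        rows.append(row)
    return pd.concat(rows)
```

The result has one row per sample and one column per metric, which is a convenient shape for plotting the samples side by side.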

In [None]:
globalQC_df = get_global_QCmetrics(inpath,  metadata_summary.index)

In [None]:
globalQC_df

In [None]:
get_barplot_qc(globalQC_df, metadata_summary['CONDITION'].tolist(), globalQC_df.columns.values)

## 2.2. Image-based Quality control

We now look in more detail at the QC of the individual samples. We explore potential contamination in the spots not covered by tissue, the number of counts and genes per spot, and the percentage of mitochondrial genes in the different regions of the samples. This analysis and the associated plots help us set the parameters used to filter out low-quality spots or genes expressed in a very limited number of spots. Of note, in this project we used FF samples processed with the FFPE probe-based protocol. Therefore, no or very little mitochondrial content is expected.
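
One simple way to quantify contamination outside the tissue is to compare the counts of spots under tissue with those of uncovered spots. The sketch below uses invented numbers and assumes a boolean `in_tissue` flag per spot, like the one Space Ranger reports in its tissue positions file. It illustrates the idea only and is not what `perform_qc_analysis` computes.

```python
import numpy as np

# Toy data: total counts per spot and whether the spot lies under tissue.
total_counts = np.array([5200, 4800, 6100, 150, 90, 60])
in_tissue = np.array([True, True, True, False, False, False])

median_in = np.median(total_counts[in_tissue])
median_out = np.median(total_counts[~in_tissue])

# A ratio close to 0 suggests little signal leaking outside the tissue.
background_ratio = median_out / median_in
print(f"in tissue: {median_in:.0f}, off tissue: {median_out:.0f}, ratio: {background_ratio:.3f}")
```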

In [None]:
%%capture --no-display
# %%capture --no-display suppresses the warnings emitted by this cell.
# Here we want to hide this warning: 'UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.'

adatas_filter = generate_adata_objects(path = inpath, samples_names = metadata_summary.index, metadata = metadata, analyze_params=analyze_params)
adatas_raw = generate_adata_objects(path = inpath, samples_names = metadata_summary.index, metadata = metadata, analyze_params=analyze_params, count_file='raw_feature_bc_matrix.h5')

In this particular example, all the samples were processed on the same 10x VISIUM slide, so there are no separate batches and we simply use the individual id as the batch name. In general, however, the batch is intended to capture variation that can arise when several slides are run in a study.

In [None]:
#adata_filtered = qc_filtering(list_adatas = adatas_filter, max_counts = 40000, mt_pct_content = 50, min_cells = 5, min_counts=500)

In [None]:
perform_qc_analysis(adatas_filter, adatas_raw, color_map="coolwarm", sample_id="readout_id",
    condition_name="CONDITION", batch_name = 'individual_id')

# 3 Filtering and saving the anndata Objects

## 3.1. QC-base filtering of spots and genes

We are going to implement quality control (QC) filters for spot selection based on the previously displayed plots. Our primary goal is to exclude outlier spots characterized by unusually high counts, as these may not accurately represent typical cellular activity. Simultaneously, we aim to eliminate spots with low counts, which could indicate quality issues or areas where the tissue is damaged. 

Furthermore, we will be scrutinizing spots with a high proportion of mitochondrial content, as this can be a marker of tissue damage. It's important to note that mitochondrial content varies significantly across different tissues. For example, in liver tissues, spots with a mitochondrial content exceeding 5% might suggest potential issues. In contrast, brain tissues can exhibit spots with mitochondrial content as high as 40% without necessarily indicating poor quality or damage.

In this particular example, we remove spots with more than 55,000 or fewer than 500 counts. As the FFPE protocol was followed in this study, mitochondrial genes are barely detected, if at all (there are no probes for them), and therefore we do not set any threshold on mitochondrial content. We also filter out genes expressed in fewer than 20 spots across the concatenated samples. However, you should carefully inspect your own samples before setting these thresholds.

Of note, spots can also be filtered out based on other criteria. For instance, the pathologist monitoring the analysis can label some spots as "exclude" due to folds or other artifacts in the tissue. In addition, one can establish other filtering criteria, such as removing spots with a high content of hemoglobin genes.
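
A minimal sketch of such additional filters is shown below on a toy per-spot table. The column names `pathologist_label` and `pct_counts_hb`, as well as the 20% threshold, are hypothetical and serve only to illustrate how the criteria can be combined into a single boolean mask.

```python
import pandas as pd

# Toy per-spot annotations; column names and values are invented.
obs = pd.DataFrame(
    {
        "pathologist_label": ["ok", "exclude", "ok", "ok"],
        "pct_counts_hb": [1.2, 0.5, 35.0, 3.1],
    },
    index=["spot1", "spot2", "spot3", "spot4"],
)

max_hb_pct = 20.0  # illustrative threshold, not a recommendation

# Keep spots that were not flagged by the pathologist and that are
# not dominated by hemoglobin counts.
keep = (obs["pathologist_label"] != "exclude") & (obs["pct_counts_hb"] <= max_hb_pct)
print(obs.index[keep].tolist())  # ['spot1', 'spot4']
```

With an anndata object, the same mask could be applied along the spot axis, for example `adata[keep.values].copy()`, provided the columns exist in `adata.obs`.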

It is also **very important to consider** that we are setting global thresholds for all the samples in our study. This may not be the best approach for studies with a large number of samples that differ in their range of counts. In these cases, we suggest implementing per-sample cutoffs based on the number of standard deviations from the mean, or of median absolute deviations from the median, of the number of counts. This was for instance applied in the following publication: ["The spatiotemporal program of zonal liver regeneration following acute injury"](https://pubmed.ncbi.nlm.nih.gov/35659879/).
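
The per-sample alternative can be sketched as follows. This is a minimal example on invented counts, assuming a cutoff of 3 median absolute deviations (MADs) around each sample's median. The `mad_outlier_mask` helper is hypothetical, and the number of MADs should be tuned to the data.

```python
import numpy as np


def mad_outlier_mask(counts, sample_ids, n_mads=3.0):
    """Flag spots whose counts lie more than n_mads MADs from their sample's median."""
    counts = np.asarray(counts, dtype=float)
    sample_ids = np.asarray(sample_ids)
    outlier = np.zeros(counts.shape, dtype=bool)
    for sample in np.unique(sample_ids):
        idx = sample_ids == sample
        deviation = np.abs(counts[idx] - np.median(counts[idx]))
        mad = np.median(deviation)
        outlier[idx] = deviation > n_mads * mad
    return outlier


# Two toy samples with very different count ranges.
counts = [1000, 1100, 900, 1050, 9000, 20000, 21000, 19000, 20500, 300]
samples = ["A"] * 5 + ["B"] * 5
mask = mad_outlier_mask(counts, samples)
print(mask)
```

In this toy case the spot with 9,000 counts in sample A is flagged even though it would pass a global window of 500 to 55,000 counts, which is the situation that per-sample cutoffs are meant to address.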

In [None]:
adata_filtered = qc_filtering(list_adatas = adatas_filter, max_counts = 55000, min_cells = 0, min_counts=500)

In [None]:
adata_concat = sc.concat(
    adata_filtered,
    label="library_id",
    uns_merge="unique",
    keys=[
        k
        for d in [adata.uns["spatial"] for adata in adata_filtered]
        for k, v in d.items()
    ],
    index_unique="-",
    join='outer' 
)

In [None]:
### Filter the genes as we did not do any filtering per sample
print(f"# Genes before filter: {adata_concat.n_vars}")
sc.pp.filter_genes(adata_concat, min_cells=20, inplace=True)
print(f"# Genes after filter: {adata_concat.n_vars}")    

In [None]:
for i, adata in enumerate(adata_filtered):
    print(adata.obs["readout_id"].unique()[0])
    print(f"# Genes before filter: {adata.n_vars}")
    # Keep only the genes that survived the filter on the concatenated object
    adata_filtered[i] = adata[:, adata.var.index.isin(adata_concat.var.index)].copy()
    print(f"# Genes after filter: {adata_filtered[i].n_vars}")

In [None]:
fig, axs = plt.subplots(1, 4, figsize=(15, 4))
# sns.histplot replaces the deprecated sns.distplot(..., kde=False)
sns.histplot(adata_concat.obs["total_counts"], ax=axs[0])
sns.histplot(adata_concat.obs["total_counts"][adata_concat.obs["total_counts"] < 10000], bins=40, ax=axs[1])
sns.histplot(adata_concat.obs["n_genes_by_counts"], bins=60, ax=axs[2])
sns.histplot(adata_concat.obs["n_genes_by_counts"][adata_concat.obs["n_genes_by_counts"] < 4000], bins=60, ax=axs[3])

In [None]:
fig, ((ax1, ax2, ax3), (ax4, ax5, ax6)) = plt.subplots(ncols=3, nrows=2)
fig.set_figwidth(17)
fig.set_figheight(9)
fig.tight_layout(pad=4.5)

bc.pl.kp_genes(adata_concat, min_genes=100, ax = ax1)
bc.pl.kp_counts(adata_concat, min_counts=500, ax = ax2)
bc.pl.kp_cells(adata_concat, min_cells=20, ax = ax3)
bc.pl.max_genes(adata_concat, max_genes=10000, ax = ax4)
bc.pl.max_mito(adata_concat, max_mito=0.1, annotation_type='SYMBOL', species='mouse', ax = ax5)
bc.pl.max_counts(adata_concat, max_counts=55000, ax=ax6)

In [None]:
temp=bc.tl.count_occurrence(adata_concat,'readout_id')
sns.barplot(y=temp.index,x=temp.Counts,color='gray',orient='h')

In [None]:
sc.pl.violin(adata_concat, ['n_genes', 'n_counts', 'percent_mito'], jitter=0.2, multi_panel=True, save = '.before_filtering.png')

In [None]:
sc.pl.violin(adata_concat, ['n_genes', 'n_counts'], groupby='readout_id',jitter=0.1,rotation=90, save = '.before_filtering.split.png')

## 3.2. Saving the anndata objects for further analysis

In [None]:
qc_filtered_folder = os.path.join(results_folder, 'qc_filtered') 
    
## check if folder exists and create it otherwise
if not os.path.exists(qc_filtered_folder):
    os.makedirs(qc_filtered_folder)
    print(f"Folder '{qc_filtered_folder}' created.")
else:
    print(f"Folder '{qc_filtered_folder}' already exists.")

for current_adata in adata_filtered:
    
    current_sample = np.asarray(current_adata.obs["readout_id"].unique())
    filename = 'adata_filter_' + current_sample[0] + '.h5ad'

    current_adata.write(os.path.join(qc_filtered_folder , filename))

In [None]:
! jupyter nbconvert --to html 00_Quality_Control.ipynb