
> **ISO2024 INTRODUCTORY SPATIAL 'OMICS ANALYSIS**
>
>
>- HYBRID : TORONTO & ZOOM
>- 9TH JULY 2024 <br>


>**Module 2 : Pre-processing steps**<BR>
>   * A. Understanding your output *
>   * B. Tidying and pre-evaluating your data *

>
>**Instructor : Shamini Ayyadhury**

---

```
A. UNDERSTANDING YOUR OUTPUT
```
01. analysis_summary.html
02. analysis.tar
03. analysis.zarr.zip
04. cell_boundaries.csv 
05. cell_boundaries.parquet ***
06. cell_feature_matrix
07. cell_feature_matrix.h5 ***
08. cell_feature_matrix.tar
09. cell_feature_matrix.zarr.zip
10. cells.csv 
11. cells.parquet ***
12. cells.zarr.zip
13. experiment.xenium
14. gene_panel.json
15. metrics_summary.csv
16. morphology_focus.ome.tif
17. morphology_mip.ome.tif
18. morphology.ome.tif ***
19. nucleus_boundaries.csv
20. nucleus_boundaries.parquet ***
21. transcripts.csv
22. transcripts.parquet ***
23. transcripts.zarr.zip


```
B. TIDYING AND PRE-EVALUATING YOUR DATA
```
In this section participants will review the transcripts file, which contains the identity and position of each transcript detected from gene and control probes. Starting from this file, we will review the quality values (QV) and assess suitable cut-offs.

OBJECTIVES
1. Process transcript file and evaluate quality of gene and control transcripts
2. Assess the distribution of transcripts across fields of view (FOVs).
3. Derive the gene matrix, counts matrix and cell centroid matrix from the transcript file
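
To make the starting point concrete, the sketch below builds a tiny, made-up table with the columns this module relies on (`transcript_id`, `cell_id`, `feature_name`, `x_location`, `y_location`, `qv`, `fov_name`). The column names follow the code used later in this notebook; every value is invented for illustration, and a real `transcripts.parquet` may hold additional columns.

```python
import pandas as pd

# Hypothetical toy rows: values are invented, column names follow this notebook
toy_transcripts = pd.DataFrame({
    'transcript_id': [1, 2, 3, 4],
    'cell_id': ['cell_a', 'cell_a', 'UNASSIGNED', 'cell_b'],
    'feature_name': ['GeneA', 'GeneB', 'NegControlProbe_0001', 'GeneA'],
    'x_location': [10.5, 11.0, 250.2, 400.9],
    'y_location': [20.1, 22.3, 310.7, 95.4],
    'qv': [40.0, 12.5, 8.0, 33.1],
    'fov_name': ['A1', 'A1', 'B2', 'C3'],
})

# One row per detected transcript: identity, position, quality value and FOV
print(toy_transcripts.dtypes)
print(toy_transcripts)
```

Each row is a single detected molecule, which is why this file is the natural starting point for both quality control and re-segmentation.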

DATASETS WE WILL USE
* We will use the FFPE half-brain Xenium sample "TgCRND8 17.9 months" for this first script.
This is a transgenic mouse model of Alzheimer's disease pathology.
"https://www.10xgenomics.com/datasets/xenium-in-situ-analysis-of-alzheimers-disease-mouse-model-brain-coronal-sections-from-one-hemisphere-over-a-time-course-1-standard"




* We will be using some packages routinely throughout this workshop.
* Wrapper functions are provided where necessary.
    * The purpose of this workshop is not to bias anyone towards any standard or popular package, but to build an understanding of what is happening.
    * There are many different tools out there, and the aim is to give you the knowledge needed to understand what is happening under the hood.

---
>>> PACKAGE IMPORT

In [None]:
### import the following libraries

### Packages for general system functions, miscellaneous operating system interfaces, warning control system
import sys ### general system functions
import os ### miscellaneous operating system interfaces
import warnings ### warning control system
import psutil ### cross-platform interface for retrieving information on running processes and system utilization (CPU, memory, disks, network, sensors)
import gc ### garbage collector interface

warnings.filterwarnings('ignore') ### ignore warnings

### Packages for data manipulation and analysis, data visualization
import pandas as pd ### data manipulation and analysis for tabular data in python
import matplotlib.pyplot as plt ### plotting library for the Python programming language and its numerical mathematics extension NumPy
from matplotlib import colors as mcolors
import seaborn as sns ### data visualization library based on matplotlib (my personal favourite over matplotlib)
import numpy as np ### support for large, multi-dimensional arrays and matrices, along with a collection of mathematical functions to operate on these arrays
import scanpy as sc ### single-cell RNA-seq analysis in Python
from IPython.display import display, HTML


---
>>> WRAPPER FUNCTIONS - ONLY FOR MODULE 2

In [None]:
"""
The "pre_processing_fnc" is a custom python script that contains all the functions required for pre-processing the data.
These are wrapper functions written specifically for this workshop to help participants understand the pre-processing steps and to save time.
This package is available in the "spatial_workshop" repository where participants are encouraged to look at the exact codes and scripts after lessons.

The functions in this script include:

1. check_parquet
Sometimes, there is an issue when loading .parquet files, the string values end up as bytes. This function checks for this issue and fixes it.

2. process_data
This function will label the transcripts as gene or control and as assigned to a cell or not

3. clean_processed_tf
This function will remove low quality transcripts and give 4 matrices as output: gene matrix file, control matrix file, counts matrix and cell centroid matrix

4. process_adata
8. prepare_adata
This wrapper will process the centroid matrix, gene matrix and counts matrix properly to create an AnnData object

5. fov_plot
A function to plot the field of view (FOV) images that saves time for participants to visualize the data.

6. plot_gene_and_neg_transcripts
Another function to plot the gene and negative transcripts and assess their QV values that saves time for participants to visualize the data.

7. display_side_by_side

An image plotting wrapper

"""

#sys.path.append('/home/shamini/data/projects/spatial_workshop/')

#import pre_processing_fnc as ppf

---
>>> MANAGING YOUR FOLDER AND FILE PATHS

In [None]:
### it's sometimes useful to assign the file names or paths to variables to avoid typing errors

### path variables
data_dir = '/home/shamini/data1/data_orig/data/spatial/xenium/10xGenomics/' ### data directory
out = '/home/shamini/data/projects/spatial_workshop/out/module2/' ### output directory for saving files. We have created these output directories in advance to save time. Participants are free to create their own if they wish to.
os.makedirs(out, exist_ok=True) ### create a new directory for saving files (but checks if the directory already exists)

### object variables
datasets_to_use = 'mice_AD_model/TgCRND8/xenium_out/' ### the name of the dataset to use
features_filepath = 'cell_feature_matrix.h5'
cells_filename = 'cells.parquet'
transcripts_filename = 'transcripts.parquet'
metrices_filename = 'metrics_summary.csv'


### MEMORY USAGE
### Run the following code to check the memory usage
### The following function is found in the pre_processing_fnc folder
def get_memory_usage():
    gc.collect() ### run the garbage collector first (in the original order this call sat after `return` and never ran)
    process = psutil.Process(os.getpid())
    mem_info = process.memory_info()
    return f"Memory usage: {mem_info.rss / (1024 ** 2):.2f} MB"

#------------------------------------------------
get_memory_usage() ### monitor memory usage

---
>>> LOADING ALL NECESSARY DATA

In [None]:
### The following function is used to check the parquet file to ensure that the string values are not in bytes format and if they are to convert them back to string
### This function is found in the pre_processing_fnc folder
### check your dataframe first before using as it is not required for all inputs

import pandas as pd

# Description: Read parquet file and return a pandas dataframe
def check_parquet(filename):
    df = pd.read_parquet(filename)
    for col in df.columns:
        if df[col].dtype == 'object':  # Check if column is of object type
            df[col] = df[col].apply(lambda x: x.decode('utf-8') if isinstance(x, bytes) else x)
    return df
# ------------

### we will load 3 files here: cells.parquet, transcripts.parquet and metrics_summary.csv
### We check the two parquet files to ensure that the string values are not in bytes format and, if they are, convert them back to strings
df_cell = check_parquet(os.path.join(data_dir+datasets_to_use ,cells_filename))
df_transcript = check_parquet(os.path.join(data_dir+datasets_to_use, transcripts_filename))
df_metric = pd.read_csv(os.path.join(data_dir+datasets_to_use, metrices_filename))



#------------------------------------------------
get_memory_usage() ### monitor memory usage
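
The byte-string issue that `check_parquet` guards against can be reproduced on a toy frame. The sketch below is only an illustration: the helper name `decode_bytes_columns` and the data are hypothetical, and the decoding logic mirrors the loop inside `check_parquet` without reading any file.

```python
import pandas as pd

# Toy frame mimicking the symptom: a string column that came back as bytes
toy = pd.DataFrame({
    'feature_name': [b'GeneA', b'GeneB', 'GeneC'],  # mixed bytes and str
    'qv': [40.0, 12.5, 33.1],
})

def decode_bytes_columns(df):
    """Return a copy in which bytes values in object columns are decoded to str."""
    fixed_df = df.copy()
    for col in fixed_df.columns:
        if fixed_df[col].dtype == 'object':
            fixed_df[col] = fixed_df[col].apply(
                lambda x: x.decode('utf-8') if isinstance(x, bytes) else x
            )
    return fixed_df

fixed = decode_bytes_columns(toy)
print(fixed['feature_name'].tolist())
```

A quick way to tell whether a column needs this treatment is to look at one value, for example `type(df['feature_name'].iloc[0])`.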

---
>>> RUN THE NEXT CELL AND REVIEW THE COLUMNS OF THE DATAFRAME AND TRY TO ANSWER THE QUESTIONS GIVEN BELOW

In [None]:
df_transcript

### Look at the transcript dataframe. What do you see? What are the columns, the values and the data types?
### Which columns are important for our analysis? 
### What is fov?


---
>>> START THE PROCESS OF QC <BR>
>>> STEP 1 : LABEL THE TRANSCRIPTS WELL 

In [None]:

"""
### The process_data function is a wrapper that does the following
1. The transcripts are derived from either gene or control probes
2. There are 3 different types of control probes : Negative codewords, Negative probes and Blank probes. The first two begin with Neg* and the last one begins with BLANK*
3. In this tutorial, we label all control probes as neg_probes and all gene-derived transcripts as gene_probes
4. We also label the transcripts as assigned to a cell or not assigned to a cell. Note this initial assignment is based on the cell segmentation from the standard Xenium output
"""
#### takes the transcript dataframe and labels each transcript by probe type and by assignment status
#### (no transcripts are removed at this stage)
### this function is found in the pre_processing_fnc folder

def process_data(tf):
    ### work on a copy so that the original row order and index are preserved
    df = tf.copy()

    ### label control probes (feature names containing 'BLANK' or 'Neg'); everything else is a gene probe
    is_control = df['feature_name'].str.contains('BLANK|Neg', regex=True)
    df['group'] = np.where(is_control, 'neg_probes', 'gene_probes')

    ### assign binary labels to transcripts that are assigned / not assigned to a cell
    df['binary'] = np.where(df['cell_id'] == 'UNASSIGNED', 'unassigned', 'assigned')

    return df


### execute function
processed_data = process_data(df_transcript) ### we process and assign the output to an object called processed_data
del df_transcript ### we delete the original transcript dataframe to save memory



#------------------------------------------------
get_memory_usage() ### monitor memory usage

---
>>> RUN THE NEXT CELL AND OBSERVE THE ADDITIONAL COLUMNS ON YOUR FAR RIGHT

In [None]:
processed_data.head()
### note the additional columns added to the processed_data dataframe : group and binary
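
The labelling rule used by `process_data` can be checked on a handful of names without loading any data. This is a minimal sketch: the feature names and cell IDs below are illustrative examples, and the helper names `label_feature` and `label_assignment` are hypothetical. Only the pattern `BLANK|Neg` and the `UNASSIGNED` marker come from the function above.

```python
import re

# Same pattern as in process_data: a name containing 'BLANK' or 'Neg' is treated as a control
CONTROL_PATTERN = re.compile(r'BLANK|Neg')

def label_feature(feature_name):
    """Return 'neg_probes' for control features and 'gene_probes' otherwise."""
    return 'neg_probes' if CONTROL_PATTERN.search(feature_name) else 'gene_probes'

def label_assignment(cell_id):
    """Return 'unassigned' for the UNASSIGNED marker and 'assigned' otherwise."""
    return 'unassigned' if cell_id == 'UNASSIGNED' else 'assigned'

# Illustrative (feature_name, cell_id) pairs
examples = [
    ('GeneA', 'cell_a'),
    ('NegControlProbe_00042', 'cell_a'),
    ('NegControlCodeword_0500', 'UNASSIGNED'),
    ('BLANK_0006', 'UNASSIGNED'),
]

for feature_name, cell_id in examples:
    print(feature_name, '->', label_feature(feature_name), '/', label_assignment(cell_id))
```

Because the pattern is a substring match, a gene whose symbol happens to contain `Neg` or `BLANK` would also be labelled as a control, so it is worth scanning the panel's gene names once before relying on it.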


#### Now it is important to understand the quality of transcripts. 
* This is an image-based system. There are many facets of the experimental process that will affect the quality of the tissue.
* The purpose of "Module 1 : Garbage-in, Garbage-out" was to highlight this. 
* Therefore, review these steps carefully as it will help you identify if you need to "quarantine" certain regions from downstream processing steps.

---
>>> THE NEXT TWO CODE BLOCKS SHOW THE OVERALL QUALITY ASSESSMENT OF THE TRANSCRIPTS

B1. QUALITY ASSESSMENT
* Let's look at some plots that assess the quality of these imaged transcripts

In [None]:

fig, ax = plt.subplots(1, 1, figsize=(4.5, 4.5))

fig.suptitle('B1A.Transcript quality values (QV) distribution between control and gene-probe derived transcripts', fontsize=9)
sns.violinplot(x='group', y='qv', data=processed_data, hue='binary', split=True, inner='quartile', ax=ax, palette=['#ebac23', '#b80058'])
ax.set_ylim(0, 50)
ax.xaxis.set_tick_params(rotation=45, labelsize=9)
ax.set_xlabel('')
sns.despine()
plt.legend(title='Assignment status', loc='upper right', bbox_to_anchor=(1, 1.2))
plt.tight_layout(rect=[0, 0, 1.25, 0.95])
#plt.savefig(out+'B1A.Transcript_quality_values_distribution.png', dpi=300)
plt.show()

#------------------------------------------------
get_memory_usage() ### monitor memory usage

In [None]:
### plot the proportion of neg and positive probes and the proportion of assigned and unassigned probes
fig, axes = plt.subplots(1, 3, figsize=(12, 4.5))

fig.suptitle('B1B. Proportion of probes as a fraction of the whole tissue', fontsize=12, fontweight='bold', y=1.05, x=0.01)
ax = processed_data['binary'].value_counts(normalize=True).plot(kind='bar', ax=axes[0])
ax.set_title('A. Proportion of assigned and unassigned transcripts', fontsize=9, loc='left')
ax.xaxis.set_tick_params(rotation=45)

ax = processed_data['group'].value_counts(normalize=True).plot(kind='bar', ax=axes[1])
ax.set_title('B. Proportion of control and gene probe transcripts', fontsize=9, loc='left')
ax.xaxis.set_tick_params(rotation=45)

### panel C shows raw counts (not proportions), split by probe group and assignment status
ax = processed_data.groupby(['group', 'binary']).size().unstack().plot(kind='bar', stacked=False, ax=axes[2])
ax.set_title('C. Number of control and gene probe transcripts by assignment', fontsize=9, loc='left')
ax.xaxis.set_tick_params(rotation=45)

plt.tight_layout()

### save the values above into a table for each sample
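### (no trailing commas here: a trailing comma would silently wrap each Series in a one-element tuple)
binary_proportion = processed_data['binary'].value_counts(normalize=True)
group_proportion = processed_data['group'].value_counts(normalize=True)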
group_binary_proportion = processed_data.groupby(['group', 'binary']).size().unstack()

binary_proportion = pd.DataFrame(binary_proportion)
group_proportion = pd.DataFrame(group_proportion)
group_binary_proportion = pd.DataFrame(group_binary_proportion)

dfs = [binary_proportion, group_proportion, group_binary_proportion]
titles = ['Binary Proportion', 'Group Proportion', 'Group Binary Proportion']


### This function is found in the pre_processing_fnc folder
### display the tables and the corresponding plots side by side

def display_side_by_side(dfs, titles=[]):
    """
    Display DataFrames side by side in Jupyter Notebook.
    
    Parameters:
    dfs (list): List of DataFrames or tuples to display.
    titles (list): List of titles for the DataFrames (optional).
    """
    html_str = ''
    
    for i, df in enumerate(dfs):
        # Convert tuples to DataFrames if necessary
        if isinstance(df, tuple):
            df = pd.DataFrame(df)
        elif not isinstance(df, pd.DataFrame):
            raise TypeError(f"Expected pd.DataFrame or tuple, but got {type(df)} at index {i}")
        
        title = f'<h3>{titles[i]}</h3>' if i < len(titles) else ''
        html_str += f'<td>{title}{df.to_html()}</td>'
    
    display(HTML(f"""
    <table>
        <tr>{html_str}</tr>
    </table>
    """))


### display the tables    
display_side_by_side(dfs, titles)

#------------------------------------------------
get_memory_usage() ### monitor memory usage

* Now let's look at the quality value of transcripts as they are distributed across the tissue area.
* It is important to look at the distribution of your low and high quality transcripts as this reflects the underlying tissue "health" that can help guide further image processing or transcript inclusion.
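
One simple way to put numbers on this, before plotting anything, is to summarise each FOV by its mean QV and by the fraction of transcripts that would fail a cut-off. The sketch below uses invented values and assumes the `fov_name` and `qv` columns seen earlier; the threshold of 20 matches the cut-off used later in this module.

```python
import pandas as pd

# Hypothetical toy data: two FOVs with different transcript quality
toy = pd.DataFrame({
    'fov_name': ['A1', 'A1', 'A1', 'B2', 'B2', 'B2', 'B2'],
    'qv': [40, 35, 10, 15, 8, 30, 12],
})

QV_THRESHOLD = 20  # transcripts with qv <= 20 would be dropped by a "qv > 20" filter

summary = (
    toy.assign(low_qv=toy['qv'] <= QV_THRESHOLD)
       .groupby('fov_name')
       .agg(
           n_transcripts=('qv', 'size'),
           mean_qv=('qv', 'mean'),
           frac_low_qv=('low_qv', 'mean'),  # mean of a boolean column = fraction of True
       )
       .sort_values('frac_low_qv', ascending=False)
)
print(summary)
```

FOVs at the top of this table are the first candidates to inspect on the image and, if needed, to "quarantine" from downstream steps.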

---
>>> NOW WE WILL RUN THE NEXT THREE CODE BLOCKS WHICH WILL BREAK THE QUALITY ASSESSMENT INTO FOVS

In [None]:
# Assuming processed_data is already defined and loaded as a DataFrame
# Reorder the fov_name based on the mean qv values

fov_mean = processed_data.groupby('fov_name')['qv'].mean().reset_index()
fov_mean = fov_mean.sort_values('qv', ascending=False)
processed_data['fov_name'] = pd.Categorical(processed_data['fov_name'], fov_mean['fov_name'])

# Plot directly into the axes
g = sns.FacetGrid(processed_data, col='group', col_wrap=1, height=2.5, aspect=6, sharey=True, sharex=True, despine=True)

# Mapping the boxplot
g.map_dataframe(sns.boxplot, x='fov_name', y='qv', hue='binary', palette=['#ebac23', '#b80058'], showfliers=False)

# Customize the FacetGrid
for ax in g.axes.flat:
    ax.set_ylim(0, 40)
    ax.axhline(20, color='red', linestyle='dashed')
    ax.set_xticklabels(ax.get_xticklabels(), rotation=45, horizontalalignment='right')
    handles, labels = ax.get_legend_handles_labels()
    ax.legend(handles=handles, labels=labels, loc='upper right', fontsize=9, bbox_to_anchor=(1.15, 1.15))

# Add an overall figure title
g.figure.suptitle('B1C. QV values per FOV', y=1.05, x=0.1, fontsize=12, fontweight='bold', color='darkblue')
plt.tight_layout(rect=[0, 0, 1, 0.95])
#plt.savefig(out+'B1C.QV_values_per_fov.png', dpi=300)
plt.show()



#------------------------------------------------
get_memory_usage() ### monitor memory usage

In [None]:
### The fov_plot function allows us to plot the field of view (FOV) images that saves time for participants to visualize the data.
### This function is found in the pre_processing_fnc folder

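### `identifier` must be a single group label ('gene_probes' or 'neg_probes'); a list default would fail in the comparisons below
def fov_plot(processed_data, plot_qv=True, ax=None, identifier='gene_probes'):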
    if ax is None:
        fig, ax = plt.subplots()  # Create figure and axis internally if not provided

    if plot_qv:
        qv_min = 0
        qv_max = 40
        cmap = plt.get_cmap('viridis')
        norm = mcolors.Normalize(vmin=qv_min, vmax=qv_max)
    else:
        cmap = None
        norm = None

    grouped = processed_data.groupby('fov_name')
    
    for fov, group in grouped:
        if group.shape[0] > 0:
            x = group['x_location'].values
            y = group['y_location'].values
            xy00 = (x.min(), y.min())
            xy01 = (x.min(), y.max())
            xy10 = (x.max(), y.min())
            xy11 = (x.max(), y.max())
            xy = [xy00, xy01, xy11, xy10, xy00]

            group_assigned = group[group['binary'] == 'assigned']
        
            if plot_qv:
                qv_avg = None
                ### mean QV of the assigned transcripts of the requested probe group, if this FOV has any
                ### (a duplicated `elif` with the same condition could never run and has been dropped)
                if (group_assigned['group'] == identifier).any():
                    qv_avg = group_assigned.loc[group_assigned['group'] == identifier, 'qv'].mean()
                
                if qv_avg is not None:
                    color = cmap(norm(qv_avg))
                    alpha = 0.5
                    ax.fill(*zip(*xy), color=color, alpha=alpha, edgecolor=color, linewidth=0)
                else:
                    color = 'none'
                    alpha = 0
                    ax.fill(*zip(*xy), color=color, alpha=alpha, edgecolor=color, linewidth=0)
            else:
                color = 'none'
                alpha = 0
                ax.fill(*zip(*xy), color=color, alpha=alpha, edgecolor=color, linewidth=0)

            ax.plot(*zip(*xy), color='black', linewidth=0.5)

            centroid_x = (x.min() + x.max()) / 2
            centroid_y = (y.min() + y.max()) / 2

            ax.text(centroid_x, centroid_y, fov, fontsize=8, ha='center', va='center_baseline', color='black')
    
    if plot_qv:
        sm = plt.cm.ScalarMappable(cmap=cmap, norm=norm)
        sm.set_array([])
        cbar = ax.figure.colorbar(sm, ax=ax)
        cbar.set_label('Quality value (qv)')
    
    ### Note: `ax` is never None at this point (it is created above when not supplied),
    ### so displaying the figure with plt.show() is left to the caller.


fig, ax = plt.subplots(1, 2, figsize=(12, 4.5))
fig.suptitle('B1D. QV values per FOV - plotted as a grid', fontsize=12, fontweight='bold', y=1.05, x=0.15)

### Execute the function
fov_plot(processed_data, plot_qv=True, identifier='gene_probes', ax=ax[0])
ax[0].set_title('A. Gene probes', fontsize=9, loc='left')

fov_plot(processed_data, plot_qv=True, identifier='neg_probes', ax=ax[1])
ax[1].set_title('B. Control probes', fontsize=9, loc='left')
plt.tight_layout()

#plt.savefig(out+'B1D.QV_values_per_fov_grid.png', dpi=300)
plt.show()


#------------------------------------------------
get_memory_usage() ### monitor memory usage

In [None]:


### Function to plot the gene and negative transcripts and assess their QV values
### This function is found in the pre_processing_fnc folder
def plot_gene_and_neg_transcripts(processed_data, type='molecule'):
    gene_assigned = processed_data[(processed_data['binary'] == 'assigned') & (processed_data['group'] == 'gene_probes')].copy()
    neg_assigned = processed_data[(processed_data['group'] == 'neg_probes') & (processed_data['binary'] == 'assigned')].copy()

    if type == 'molecule':
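        ### down-sample gene transcripts to twice the number of control transcripts (capped at what is available)
        gene_assigned = gene_assigned.sample(n=min(len(gene_assigned), neg_assigned.shape[0]*2), random_state=42)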
        data_list = [gene_assigned, neg_assigned]

    elif type == 'mean_fov':
        grouped_neg = neg_assigned.groupby('fov_name')
        neg_mean = []
        for fov, group in grouped_neg:
            if group.shape[0] > 0:
                x = group['x_location'].values
                y = group['y_location'].values
                centroid_x = (x.min() + x.max()) / 2
                centroid_y = (y.min() + y.max()) / 2
                qv_avg = group['qv'].mean()
                neg_mean.append({'fov_name': fov, 'x_location': centroid_x, 'y_location': centroid_y, 'qv': qv_avg})
        
        neg_mean = pd.DataFrame(neg_mean)
        
        grouped_gene = gene_assigned.groupby('fov_name')
        gene_mean = []
        for fov, group in grouped_gene:
            if group.shape[0] > 0:
                x = group['x_location'].values
                y = group['y_location'].values
                centroid_x = (x.min() + x.max()) / 2
                centroid_y = (y.min() + y.max()) / 2
                qv_avg = group['qv'].mean()
                gene_mean.append({'fov_name': fov, 'x_location': centroid_x, 'y_location': centroid_y, 'qv': qv_avg})

        gene_mean = pd.DataFrame(gene_mean)
        data_list = [gene_mean, neg_mean]
        
    # Create a figure and axis
    fig, axs = plt.subplots(1, 2, figsize=(13, 4.5))

    for i, df in enumerate(data_list):
        df = df.sort_values('qv', ascending=True)
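        ### hue_norm fixes the colour scale to 0-40 so that the points match the colourbar drawn below
        sns.scatterplot(x='x_location', y='y_location', data=df, hue='qv', hue_norm=(0, 40), palette='viridis', s=1, ax=axs[i], legend=False)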
        axs[i].set_xlabel('')
        axs[i].set_ylabel('')
        sns.despine()

        norm = plt.Normalize(vmin=0, vmax=40)
        sm = plt.cm.ScalarMappable(cmap='viridis', norm=norm)
        sm.set_array([])

        cbar = fig.colorbar(sm, ax=axs[i], ticks=[0, 20, 40])
        cbar.set_label('Quality value', labelpad=15)
        
        fov_plot(processed_data, plot_qv=False, ax=axs[i])
        axs[i].set_title('Gene probes' if i == 0 else 'Negative control probes')
        
    ### the title is set here, on the figure created inside this function, and before plt.show()
    fig.suptitle('B1E. QV values of each transcript - plotted as a grid', fontsize=12, fontweight='bold', y=1.05, x=0.15)
    #plt.savefig(out+'B1E.QV_values_per_transcript_grid.png', dpi=300)
    plt.show()



### Execute the function
plot_gene_and_neg_transcripts(processed_data)


#------------------------------------------------
get_memory_usage() ### monitor memory usage

---
>>> STOP FOR DISCUSSION / START LECTURE <BR>
>>> WHAT'S THE SIGNIFICANCE OF GOING THROUGH THIS PROCESS OF SPLITTING BY FOV? 

---
>>> NOW WE WILL CLEAN THE TRANSCRIPTS FILE AND PROCESS THE DATA TO CREATE THE FOLLOWING THREE FILES 
* MATRIX FILE - cell x genes = cell.csv output
* COORDINATE FILE
* COUNTS FILE

WE WILL USE THE THREE TO CREATE AN ANNDATA OBJECT <BR>
10X PROVIDES A CLEANED UP VERSION IN THE XENIUM OUTPUT <BR>
WE ARE CREATING THESE FILES FROM THE TRANSCRIPTS FILE <BR>
    * UNDERSTAND THE PROCESS <BR>
    * SEGMENTATION (REQUIRES THE USE OF THE TRANSCRIPTS FILE)
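
The idea behind these three files can be sketched on a toy transcript table. This is only an illustration with invented values: it uses `pd.crosstab` as a compact stand-in for the `groupby` and `pivot_table` steps in the workshop's `clean_processed_tf` function, and it takes the mean transcript position as the cell coordinate, as that function does.

```python
import pandas as pd

# Hypothetical toy transcripts, already filtered and assigned to cells
toy = pd.DataFrame({
    'cell_id': ['cell_a', 'cell_a', 'cell_a', 'cell_b', 'cell_b'],
    'feature_name': ['GeneA', 'GeneA', 'GeneB', 'GeneB', 'GeneC'],
    'x_location': [1.0, 2.0, 3.0, 10.0, 12.0],
    'y_location': [1.0, 1.0, 4.0, 10.0, 14.0],
})

# MATRIX FILE: cells x genes, each entry is the number of transcripts
matrix = pd.crosstab(toy['cell_id'], toy['feature_name'])

# COUNTS FILE: total transcripts per cell
counts = matrix.sum(axis=1).rename('total_counts')

# COORDINATE FILE: mean transcript position per cell
coords = toy.groupby('cell_id')[['x_location', 'y_location']].mean()

print(matrix)
print(counts)
print(coords)
```

Note that a coordinate computed this way is the centre of a cell's transcripts, which can differ slightly from the centroid of its segmentation boundary.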

>>> B2. NOW WE CAN DECIDE TO CLEAN OUR DATASETS BASED ON
1. Choice of QV cut-off: standard practice is to keep transcripts with QV > 20, but this can be changed
2. Keep all or discard certain FOVs.
3. Check for edge-effects and decide to drop transcripts covering the borders
4. Choose to re-evaluate images (H&E or IF images if necessary) 
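
To get a feel for what a QV cut-off means, the sketch below converts quality values to estimated error probabilities. It assumes that the QV is Phred-scaled, that is QV = -10 * log10(p), where p is the estimated probability that a transcript call is wrong. That assumption is not stated in this notebook, so check it against the documentation for your instrument and software version.

```python
import math

def qv_to_error_probability(qv):
    """Estimated probability of a wrong call, assuming a Phred-scaled QV."""
    return 10 ** (-qv / 10)

def error_probability_to_qv(p):
    """Inverse conversion: Phred-scaled QV for an error probability p."""
    return -10 * math.log10(p)

for qv in (10, 20, 30, 40):
    print(f"QV {qv}: estimated error probability {qv_to_error_probability(qv):.4f}")
```

Under this assumption, the usual QV > 20 filter keeps transcripts with an estimated error probability below 1%, and every 10-point increase in the cut-off tightens that bound by a factor of ten.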


In [None]:
"""
The cleaned data is then used to create the gene matrix, control matrix, counts matrix and cell centroid matrix 
Here, it involves removing low quality transcripts and assigning the transcripts to the cells based on the cell segmentation from the standard Xenium output
We still keep the negative control values as a separate matrix

However the 3 matrices that we will bring forward to the next lesson are
1. gene matrix
2. counts matrix
3. cell centroid matrix
"""

### This function is found in the pre_processing_fnc folder

def clean_processed_tf(processed_data, qv=20):
    # Filter for gene probes that are assigned
    gene_assigned = processed_data[(processed_data['binary'] == 'assigned') & (processed_data['group'] == 'gene_probes')].copy()
    # Filter for negative probes that are assigned
    neg_assigned = processed_data[(processed_data['group'] == 'neg_probes') & (processed_data['binary'] == 'assigned')].copy()
    
    # Subset for transcripts with qv above the chosen threshold (default 20)
    gene_qv_tf = gene_assigned[gene_assigned['qv'] > qv].copy()

    # Group by cell_id and feature_name, then count the number of transcripts
    gene_qv = gene_qv_tf.groupby(['cell_id', 'feature_name'])['transcript_id'].size().reset_index(name='transcript_count')
    # Pivot table to create a matrix of transcript counts
    gene_mtx = gene_qv.pivot_table(index='cell_id', columns='feature_name', values='transcript_count').fillna(0)
    new_gene_mtx = pd.DataFrame(gene_mtx.values, columns=gene_mtx.columns, index=gene_mtx.index)
    new_gene_mtx.index.name = None
    new_gene_mtx = new_gene_mtx.rename_axis(None, axis=1)

    # Repeat for negative probes
    neg_qv = neg_assigned[neg_assigned['qv'] > qv].copy()
    neg_qv = neg_qv.groupby(['cell_id', 'feature_name'])['transcript_id'].size().reset_index(name='transcript_count')
    neg_mtx = neg_qv.pivot_table(index='cell_id', columns='feature_name', values='transcript_count').fillna(0)
    new_neg_mtx = pd.DataFrame(neg_mtx.values, columns=neg_mtx.columns, index=neg_mtx.index)
    new_neg_mtx.index.name = None    
    new_neg_mtx = new_neg_mtx.rename_axis(None, axis=1)
    
    # Sum the counts across features for each cell
    gene_counts = gene_mtx.sum(axis=1)
    neg_counts = neg_mtx.sum(axis=1)

    df_counts = pd.concat([gene_counts, neg_counts], axis=1)
    df_counts.columns = ['total_counts', 'neg_counts']
    df_counts = df_counts.fillna(0)

    ### calculate centroids as the mean position of each cell's high-quality gene transcripts
    gene_qv = gene_assigned[gene_assigned['qv'] > qv].copy()
    centroids = gene_qv.groupby('cell_id')[['x_location', 'y_location']].mean().reset_index()
    centroids.columns = ['cell_id', 'centroid_x', 'centroid_y']
    centroids.set_index('cell_id', inplace=True)
    new_centroids = pd.DataFrame(centroids.values, columns=centroids.columns, index=centroids.index)
    new_centroids.index.name = None
    new_centroids = new_centroids.rename_axis(None, axis=1)
    
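    ### return the tidied centroid table (new_centroids), consistent with the other tidied matrices
    return df_counts, gene_qv_tf, new_gene_mtx, new_neg_mtx, new_centroids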


### execute the function
df_counts, transcripts_df, gene_mtx, neg_mtx, centroids = clean_processed_tf(processed_data)


#------------------------------------------------
get_memory_usage() ### monitor memory usage

Look at each matrix and note how these are now your standard single cell matrices that you are probably used to working with.

In [None]:
gene_mtx 

In [None]:
centroids

>>> FURTHER QUALITY CONTROL <br>

It is important to ensure we keep cells that express a minimum number of transcripts and genes

Let's now evaluate the number of genes expressed by each cell and plot a distribution plot for it

In [None]:
### evaluate the distribution of genes expressed per cell
gene_mtx_bool = gene_mtx > 0
gene_counts = gene_mtx_bool.sum(axis=1)
gene_counts.plot(kind='hist', bins=150, color='darkblue', edgecolor=None, label='Genes')
### axvline spans the full height of the axes, so no hard-coded upper limit is needed
plt.axvline(gene_counts.mean(), color='red', linestyle='dashed', label='Mean')
plt.axvline(gene_counts.median(), color='green', linestyle='dashed', label='Median')
plt.axvline(gene_counts.mode().iloc[0], color='yellow', linestyle='dashed', label='Mode') ### first mode if there are ties
plt.axvline(gene_counts.quantile(0.02), color='black', linestyle='dashed', label='2nd percentile')


### label the plot
plt.title('B2A. Distribution of genes expressed per cell', fontsize=12, fontweight='bold')
plt.xlabel('Number of genes expressed')
plt.ylabel('Number of cells')
plt.legend(loc='upper right') ### uses the labels set above, so each entry matches its own line or histogram
#plt.savefig(out+'B2A.Distribution_of_genes_expressed_per_cell.png', dpi=300)
plt.show()


In [None]:
gene_counts.quantile(0.02)

### what would a reasonable threshold be for the number of genes expressed per cell?

Now we can decide on the minimum number of genes or transcripts a cell must have to be kept.

Here, we will use the df_counts table to plot the distribution of control- and gene-derived transcript counts per cell. The plots suggest that discarding any cell with fewer than about 10 transcripts is a reasonable threshold.
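
Before fixing a threshold, it can help to see how many cells each candidate value would keep. The sketch below uses a short list of invented per-cell totals; with real data the same counts would come from the `total_counts` column of `df_counts`. The helper name `cells_kept` is hypothetical.

```python
# Hypothetical per-cell total transcript counts (invented values)
total_counts = [0, 2, 5, 9, 10, 14, 25, 40, 80, 120]

def cells_kept(counts, min_counts):
    """Number of cells with at least `min_counts` transcripts."""
    return sum(1 for c in counts if c >= min_counts)

for threshold in (1, 5, 10, 20):
    kept = cells_kept(total_counts, threshold)
    print(f"min_counts={threshold:>2}: keep {kept}/{len(total_counts)} cells "
          f"({kept / len(total_counts):.0%})")
```

A threshold is easier to defend when the fraction of cells lost changes little around it, so scanning a few neighbouring values is a cheap sanity check.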

In [None]:
fig, axes = plt.subplots(1, 3, figsize=(15.5, 3.9))
### a label is passed to each histogram so that every legend entry matches the colour of its own histogram
ax = sns.histplot(df_counts, x='neg_counts', bins=100, color='#8a034f', alpha=0.5, ax=axes[0], linewidth=0.1, label='Negative probes')
ax = sns.histplot(df_counts, x='total_counts', bins=100, alpha=0.5, ax=axes[0], color='#005a8a', linewidth=0.1, label='Gene probes')
ax.set_title('A. All counts from gene and negative probes', loc='left', color='black')
ax.set_xlabel('Counts')
ax.set_ylabel('Cell frequency')
ax.legend()

ax = sns.histplot(df_counts, x='neg_counts', bins=100, color='#8a034f', alpha=0.5, ax=axes[1], linewidth=0.1, label='Negative probes')
ax = sns.histplot(df_counts, x='total_counts', bins=100, alpha=0.5, ax=axes[1], color='#005a8a', linewidth=0.1, label='Gene probes')
ax.set_yscale('log')
ax.set_xlabel('Counts')
ax.set_ylabel('Cell frequency')
ax.set_title('B. Log scale of plot A' , loc='left', color='blue')
ax.legend()

ax = sns.histplot(df_counts, x='neg_counts', bins=100, color='#8a034f', alpha=0.5, ax=axes[2], linewidth=0.1, label='Negative probes')
#sns.histplot(df_counts, x='total_counts', bins=100, alpha=0.5, ax=ax, color='blue')
ax.set_yscale('log')
ax.set_xlabel('Counts')
ax.set_ylabel('Cell frequency')
ax.set_title('C. Negative counts only', loc='left', color='red')
ax.legend()

fig.suptitle('B2B. Distribution of counts from gene and negative probes', fontsize=12, fontweight='bold', y=1.05, x=0.3)
#plt.savefig(out+'B2B.Distribution_of_counts.png', dpi=300)
plt.show()



#------------------------------------------------
get_memory_usage() ### monitor memory usage

Remember, remember & remember to save your file.

In [None]:
### create folder with sample name to save files
os.makedirs(out+'TgCRND8_17_8mths', exist_ok=True)

df_counts.to_csv(out+'TgCRND8_17_8mths/df_counts.csv', index=True)
gene_mtx.to_csv(out+'TgCRND8_17_8mths/gene_mtx.csv', index=True)
neg_mtx.to_csv(out+'TgCRND8_17_8mths/neg_mtx.csv', index=True)
centroids.to_csv(out+'TgCRND8_17_8mths/centroids.csv', index=True)
transcripts_df.to_csv(out+'TgCRND8_17_8mths/transcripts_df.csv', index=True)



#------------------------------------------------
get_memory_usage() ### monitor memory usage

* Next, let's finally create a single-cell-like object that will store our gene matrix, counts matrix and centroid information in the right order.
* In this workshop, we will use the AnnData container to store all our frames. In practice, you can use any comparable container (such as an S4 object in R), as long as you slot the tables in properly.
* In the next steps, we will perform the final clean-up by removing low-transcript cells (as determined above) and take note of the median/mean expression of transcripts.
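
"In the right order" matters because the per-cell tables must line up row by row with the gene matrix. The sketch below shows, on invented data, how `reindex` aligns a per-cell table to the row order of the matrix and how cells missing from that table show up as `NaN`. All names and values here are hypothetical.

```python
import pandas as pd

# Hypothetical gene matrix: rows are cells, in the order the container will use
gene_matrix = pd.DataFrame(
    {'GeneA': [2, 0, 1], 'GeneB': [1, 4, 0]},
    index=['cell_c', 'cell_a', 'cell_b'],
)

# Hypothetical per-cell table in a different order, with one extra and one missing cell
per_cell = pd.DataFrame(
    {'total_counts': [4, 3, 9]},
    index=['cell_a', 'cell_c', 'cell_z'],
)

# Align the per-cell table to the matrix rows; unmatched cells become NaN
aligned = per_cell.reindex(gene_matrix.index)
print(aligned)

# Cells present in the matrix but absent from the per-cell table
missing = aligned.index[aligned['total_counts'].isna()].tolist()
print('cells without metadata:', missing)
```

Checking for `NaN` after alignment is a quick guard against silently attaching the wrong metadata to a cell.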

In [None]:
gene_mtx

>>> CREATING AN ANNDATA OBJECT

In [None]:
adata = sc.AnnData(X=gene_mtx, var=pd.DataFrame(index=gene_mtx.columns.values))

df_counts = df_counts[df_counts.index.isin(gene_mtx.index)]

df_counts = df_counts.reindex(gene_mtx.index)
adata.obs = df_counts.copy()
    
centroids = centroids.reindex(adata.obs.index)
adata.obs[['x_location', 'y_location']] = centroids[['centroid_x', 'centroid_y']].values

gene_mtx_bool = gene_mtx > 0
n_cells = gene_mtx_bool.sum(axis=0)
n_genes = gene_mtx_bool.sum(axis=1)

adata.var['n_cells'] = n_cells
adata.obs['n_genes'] = n_genes

sc.pp.filter_genes(adata, min_cells=3) ### keep genes detected in at least 3 cells
sc.pp.filter_cells(adata, min_counts=9) ### keep cells with at least 9 transcripts in total



#------------------------------------------------
get_memory_usage() ### monitor memory usage

In [None]:
### Always remember to save the coordinates of the spatial data in the adata object into the uns and obsm slots. Many methods call the spatial coordinates from these slots

adata.obsm['spatial'] = adata.obs[['x_location', 'y_location']].values
adata.uns['spatial'] = {'spatial' : adata.obsm['spatial'].copy()}

In [None]:
### save your adata
adata.raw = adata
adata.write(out+'TgCRND8_17_8mths/adata.h5ad')

```
C. Gene Coverage
```

We will briefly review the genes and the probeset coverage for each gene

In [None]:
# Import Python libraries
# Example with Python v3.12, pandas v2.1.1
import json
import pandas as pd

# Open the JSON file and load it as a dictionary; the `with` block closes the file automatically
with open(data_dir+'mice_AD_model/TgCRND8/xenium_out/gene_panel.json') as f: # Edit file name here
    data = json.load(f)

# Create lists to store extracted information
gene = []
ensembl = []
cov = []
# Iterate through the JSON list to extract information
for i in data['payload']['targets']:
    if (i['type']['descriptor'] == "gene"): # Only collect info for genes, not controls
        gene_name = i['type']['data']['name']
        ensembl_id = i['type']['data']['id']
        coverage = str(i['info']['gene_coverage'])

        gene.append(gene_name)
        ensembl.append(ensembl_id)
        cov.append(coverage)

# Collect the extracted information in a DataFrame (uncomment the next line to write it to a CSV file)
out_df = pd.DataFrame(list(zip(gene, ensembl, cov)), columns=['Gene name', 'Ensembl ID', 'Gene coverage'])
#out_df.to_csv(out+'my_panel_gene_info.csv', index=False)


In [None]:
out_df

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(7, 5))

fig.suptitle('B2C. Gene coverage distribution', fontsize=12, fontweight='bold', y=1.05, x=0.15)
out_df['Gene coverage'].value_counts(normalize=True).plot(kind='bar', color='darkblue', edgecolor='black')
ax.set_title('Gene coverage distribution', fontsize=9, loc='left')
ax.set_xlabel('Gene coverage')
ax.set_ylabel('Proportion of genes')
plt.savefig(out+'B2C.Gene_coverage_distribution.png', dpi=300)
plt.show()


> END OF MODULE 2 : Pre-processing steps <br>
> Thank you and see you in the next lecture where we will tackle spatial clustering