### Exercise: Stressed Cells
Sample preparation can induce artificial changes in the transcriptome of live cells as they react to changes in their environment, such as temperature fluctuations or loss of niche signalling. This is an important confounding factor to keep in mind when analysing your data.

This exercise uses the same data set as the basic workflow exercise (reduced PBMC object).

Research which (groups of) genes are known to be upregulated during sample preparation in PBMCs.

#### Import required packages and data
We will import the processed and annotated object that we generated before.

In [None]:
# general data handling
import numpy as np
import pandas as pd
from scipy import sparse

# single cell analysis
import scanpy as sc
import decoupler as dc

# plotting
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# import the annotated object
adata = 

In [None]:
# basic information about the object is displayed if you enter 'adata' and execute the cell


In [None]:
# get an object of type list of all gene names in the dataset
gene_list = 

In [None]:
# research a few groups of genes that are associated with stress response;
# one example is the FOS gene family. Extract each such family into a list as shown
# in the example.
fos_genes = [gene for gene in gene_list if 'FOS' in gene]
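
Substring matching is convenient, but it can also pick up genes that merely share letters with a family. The sketch below uses a small, made-up gene list (an assumption, not taken from the PBMC object) to compare `in` with `str.startswith`. The prefixes are illustrative choices only, not a curated stress signature.

```python
# Hypothetical gene list for illustration only; your dataset will differ.
toy_genes = ["FOS", "FOSB", "JUN", "JUNB", "HSPA1A", "HSPA1B", "HSPG2", "CD3E"]

# Substring match: 'HSP' also hits HSPG2 (a proteoglycan, not a heat shock gene).
hsp_substring = [gene for gene in toy_genes if "HSP" in gene]

# Prefix match on a narrower prefix keeps only the HSPA heat shock genes.
hsp_prefix = [gene for gene in toy_genes if gene.startswith("HSPA")]

# str.startswith also accepts a tuple, so several families can be matched at once.
stress_prefixes = ("FOS", "JUN", "HSPA")
toy_stress_genes = [gene for gene in toy_genes if gene.startswith(stress_prefixes)]

print(hsp_substring)
print(hsp_prefix)
print(toy_stress_genes)
```

Whichever matching rule you use, printing the resulting lists and checking them by eye is the simplest safeguard against unintended matches.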


In [None]:
# inspect these lists, and combine them into a master list stress_genes
# lists can be combined using the + operator
print(fos_genes)

stress_genes = fos_genes + 

In [None]:
# now, we will add a column to the adata.var dataframe indicating whether a gene is a stress response gene
adata.var["stress"] = adata.var_names.isin(stress_genes)
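
To make the meaning of a per-cell stress percentage concrete, here is a minimal NumPy sketch on a made-up count matrix (cells in rows, genes in columns, as in AnnData). The matrix, gene names and mask are assumptions for illustration. The percentage is computed as the counts in flagged genes divided by the total counts per cell, times 100, which corresponds to what a `pct_counts_*` column represents.

```python
import numpy as np

# Hypothetical raw counts: 3 cells (rows) x 4 genes (columns).
toy_genes = np.array(["FOS", "JUN", "CD3E", "MS4A1"])
toy_counts = np.array([
    [5, 5, 10, 0],
    [0, 0, 20, 0],
    [1, 0, 0, 3],
])

# Boolean mask over genes, analogous to a boolean column in adata.var.
toy_stress_mask = np.isin(toy_genes, ["FOS", "JUN"])

# Per-cell totals, and per-cell totals restricted to the flagged genes.
total_counts = toy_counts.sum(axis=1)
stress_counts = toy_counts[:, toy_stress_mask].sum(axis=1)

# Percentage of each cell's counts that come from flagged genes.
pct_stress = 100 * stress_counts / total_counts
print(pct_stress)
```

The calculation only makes sense on raw counts, which is why the exercise asks for the counts layer rather than the processed values in `adata.X`.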

In [None]:
# we can now use this stress tag in the QC metrics calculation
# use the function calculate_qc_metrics as before for mitochondrial genes,
# but specify layer='counts' for the calculation,
# since the data in adata.X has been processed


In [None]:
# check which items have been added to adata.obs by inspecting it


In [None]:
# use scanpy's violin plot to plot the percentage of stress gene reads per cell type


In [None]:
# below, we have prepared a loop for you which goes through each
# population in the cell_type column, subsets the anndata object
# to that population and calculates the correlation between the percentages
# of stress gene counts and mitochondrial counts
# please explain to yourself what each line of code does, and then
# add a scatter plot at the end of the loop as described below
for ct in adata.obs['cell_type'].cat.categories:
    # Subset data for this cell type
    adata_sub = adata[adata.obs['cell_type'] == ct]
    
    # Calculate correlation
    corr = adata_sub.obs[['pct_counts_stress', 'pct_counts_mt']].corr().iloc[0, 1]
    
    # Plot scatter for this cell type using scanpy's scatter plot function,
    # include the cell type name and the Pearson correlation calculated above (corr)
    # in the title by using an f-string and the function's title parameter
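
The correlation step in the loop above can be tried in isolation on a small table. The sketch below assumes a made-up pandas DataFrame standing in for `adata.obs`, with invented values chosen so that one group is perfectly correlated and the other perfectly anti-correlated. It only illustrates what `.corr().iloc[0, 1]` extracts and how an f-string can format the result.

```python
import pandas as pd

# Hypothetical per-cell QC table, standing in for adata.obs.
toy_obs = pd.DataFrame({
    "cell_type": ["T cell"] * 4 + ["B cell"] * 4,
    "pct_counts_stress": [1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0, 4.0],
    "pct_counts_mt": [2.0, 4.0, 6.0, 8.0, 8.0, 6.0, 4.0, 2.0],
})

toy_corrs = {}
for ct in toy_obs["cell_type"].unique():
    # Keep only the rows (cells) belonging to this cell type.
    sub = toy_obs[toy_obs["cell_type"] == ct]
    # .corr() returns a 2x2 Pearson correlation matrix;
    # position [0, 1] is the off-diagonal entry (stress vs. mt).
    corr_matrix = sub[["pct_counts_stress", "pct_counts_mt"]].corr()
    toy_corrs[ct] = corr_matrix.iloc[0, 1]
    # An f-string like this one could equally be passed as a plot title.
    print(f"{ct}: Pearson r = {toy_corrs[ct]:.2f}")
```

Real data will rarely show such clean relationships, so it helps to look at the scatter plot alongside the coefficient instead of relying on the number alone.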


#### Questions
1. Does the stress count percentage correlate with mitochondrial read content? What could a correlation imply? What could no correlation imply?
2. Is it a viable strategy to remove these genes entirely from the count matrix or is there an argument against doing that? Discuss.
