In [1]:
%load_ext autoreload
%autoreload 2

# Dropout filtering

Across the replicates of each cell-type sample, a gene's expression values fall into one of four patterns, illustrated in the table below (X, Y and Z denote non-zero expression values).

|gene| replicate1  |   replicate2 |   replicate3 |
|---|---|---|---|
| A | 0 | 0 | X |
| B | 0 | X | Y |
| C | 0 | 0 | 0 | 
| D | X | Y | Z | 


First, we filter the genes based on the **majority rule**: if at least two of the three replicates of a gene have expression 0 (cases A and C), we consider the gene not expressed in that cell-type condition. If exactly one replicate is zero (case B), we compute the mean of the values and take that mean as the expression of the gene in the given cell-type condition. If all replicates are non-zero (case D), we compute the mean in the same way as in case B.
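The rule above can be sketched on a toy table. This is only an illustration, not the implementation of `get_mean` in `local_utils`, which is not shown here. The function name `majority_rule_mean` and the toy values are hypothetical, and the sketch assumes that in case B the mean is taken over all three replicates, including the zero.

```python
import pandas as pd

def majority_rule_mean(reps: pd.DataFrame) -> pd.Series:
    """Row-wise mean over replicate columns, forced to 0 when most replicates are 0.

    Assumption: when only a minority of replicates is zero, the mean is taken
    over all replicates (zeros included).
    """
    n_zero = (reps == 0).sum(axis=1)            # zero replicates per gene
    mostly_zero = n_zero > reps.shape[1] / 2    # strict majority of zeros
    return reps.mean(axis=1).where(~mostly_zero, 0.0)

# Toy data mirroring cases A-D from the table (values are made up)
toy = pd.DataFrame(
    {
        "replicate1": [0.0, 0.0, 0.0, 3.0],
        "replicate2": [0.0, 4.0, 0.0, 6.0],
        "replicate3": [5.0, 8.0, 0.0, 9.0],
    },
    index=["A", "B", "C", "D"],
)
toy_means = majority_rule_mean(toy)
print(toy_means)
```

Genes A and C end up with a mean of 0 (not expressed), while B and D keep the mean of their replicates.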

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib
font = {'family' : 'sans-serif',
        'size'   : 15}
matplotlib.rc('font', **font)
from local_utils import get_mean, get_mean2, pca_comp, pca_variance
import seaborn as sns

## 1. Preprocessing the dataframe

In [4]:
# loading the dataset
df_orig= pd.read_csv("rpkm.tsv", sep="\t")

# drop the genes that have zero expression in every condition and replicate
df = df_orig.loc[(df_orig.drop(columns=['geneID'])!=0).any(axis=1)]

We now average over replicates (it can take up to 3-4 minutes depending on the computer, please be patient).

In [5]:
# condition Y3S only has two replicates, thus we have to do it differently

df_3rep = df.drop(columns=['Y3S_1', 'Y3S_3'])  # all conditions except Y3S
df_3rep = get_mean(df_3rep)

df_2rep = df[['Y3S_1','Y3S_3']].copy(deep = True)   # Y3S case
df_2rep = get_mean2(df_2rep)

# join the two dataframes on their shared row index
df_tot = df_3rep.join(df_2rep)
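Separating conditions by their number of replicates could also be done programmatically. The sketch below groups replicate columns by condition, assuming column names follow the `<condition>_<replicate number>` pattern seen in `Y3S_1` and `Y3S_3`. The helper `replicate_groups` and the column name `A_1` are hypothetical and not part of `local_utils`.

```python
def replicate_groups(columns):
    """Map each condition name to its list of replicate columns.

    Assumption: replicate columns are named '<condition>_<integer>';
    any other column (e.g. 'geneID') is ignored.
    """
    groups = {}
    for col in columns:
        condition, sep, rep = col.rpartition("_")
        if sep and rep.isdigit():
            groups.setdefault(condition, []).append(col)
    return groups

example_columns = ["geneID", "A_1", "A_2", "A_3", "Y3S_1", "Y3S_3"]
groups = replicate_groups(example_columns)

# Conditions can then be separated by how many replicates they have
two_rep_conditions = [c for c, cols in groups.items() if len(cols) == 2]
print(groups)
print(two_rep_conditions)
```

With a grouping like this, conditions with two replicates would not need to be listed by hand.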

## 2. Filtering by majority rule

In [21]:
#filter only the columns containing the mean
df_mean = df_tot.filter(regex='mean|geneID')

# select from df_mean only the genes whose mean is non-zero in at least one cell type
df_nonzero_means = df_mean.loc[(df_mean.drop(columns=['geneID'])!=0).any(axis=1)]
df_nonzero_reps = df[df['geneID'].isin(list(df_nonzero_means['geneID']))]

df_nonzero_means.to_csv("df_nonzero_means.csv", sep = "\t", index = False)
df_nonzero_reps.to_csv("df_nonzero_reps.csv", sep = "\t", index = False)

We have now obtained a dataframe with 28603 genes. The dataframe `df_nonzero_means` contains only the means of the replicates for each sample, while `df_nonzero_reps` contains only the replicates of each sample.
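The row filter in this section relies on `.any(axis=1)`, which keeps a gene as soon as one of its means is non-zero. A stricter filter would use `.all(axis=1)` and keep only the genes expressed in every cell type. The toy frame below is a minimal sketch of the difference; the gene names and the columns `A_mean` and `B_mean` are made up for illustration.

```python
import pandas as pd

toy_means_df = pd.DataFrame(
    {
        "geneID": ["g1", "g2", "g3"],
        "A_mean": [0.0, 2.5, 0.0],
        "B_mean": [0.0, 1.0, 4.0],
    }
)

nonzero = toy_means_df.drop(columns=["geneID"]) != 0

# expressed in at least one cell type (the filter used in this section)
kept_any = toy_means_df.loc[nonzero.any(axis=1), "geneID"].tolist()

# expressed in every cell type (a stricter alternative)
kept_all = toy_means_df.loc[nonzero.all(axis=1), "geneID"].tolist()

print(kept_any)
print(kept_all)
```

Here `g1` is dropped by both filters, `g3` survives only the `any` filter, and `g2` survives both.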