### Understanding Paul's Cell Oracle Data

This notebook adjusts a dataset from a paper about Myeloid Progenitors Transcriptional Heterogeneity ([Paul et al 2017](https://pubmed.ncbi.nlm.nih.gov/26627738/)). The goal is to check the dataset against a data checker and make necessary adjustments until it passes all conditions.

Here we tidy the dataset and carry out a simple exploration in pandas to ensure its integrity and usability.

In [3]:
# Get required libraries

import pandas as pd
import scanpy as sc
import numpy as np
import importlib
import pereggrn_perturbations as dc # the data checker (dc)

In [39]:
# Reload the module to catch any updates
importlib.reload(dc)

<module 'pereggrn_perturbations' from '/home/ec2-user/expression_forecasting_benchmarks/perturbation_data/setup/pereggrn_perturbations.py'>

In [4]:
# Load the main dataframe
df = pd.read_csv('../../perturbation_data_wrong/not_ready/paul/GSE72857_umitab.txt', sep='\t')

In [40]:
# Inspect the DataFrame
print(df.head())  # Print the first few rows to inspect the data
print(df.info())  # Print information about the DataFrame
print(df.describe())  # Get descriptive statistics if applicable

                      W29953  W29954  W29955  W29956  W29957  W29958  W29959  \
0610007C21Rik;Apr3         0       0       0       0       0       0       0   
0610007L01Rik              0       2       1       1       2       0       0   
0610007P08Rik;Rad26l       0       0       0       0       1       0       0   
0610007P14Rik              0       0       0       1       1       0       0   
0610007P22Rik              0       0       0       0       0       0       0   

                      W29960  W29961  W29962  ...  W76327  W76328  W76329  \
0610007C21Rik;Apr3         0       0       0  ...       0       0       0   
0610007L01Rik              0       1       1  ...       0       0       0   
0610007P08Rik;Rad26l       0       0       0  ...       0       0       0   
0610007P14Rik              1       0       0  ...       1       0       0   
0610007P22Rik              0       0       0  ...       0       0       0   

                      W76330  W76331  W76332  W76333  W7

In [41]:
# Understand the structure
# Check for the presence of headers, the shape of the data, and sample the data
print(df.columns)  # Check column names
print(df.index)  # Check row indices

Index(['W29953', 'W29954', 'W29955', 'W29956', 'W29957', 'W29958', 'W29959',
       'W29960', 'W29961', 'W29962',
       ...
       'W76327', 'W76328', 'W76329', 'W76330', 'W76331', 'W76332', 'W76333',
       'W76334', 'W76335', 'W76336'],
      dtype='object', length=10368)
Index(['0610007C21Rik;Apr3', '0610007L01Rik', '0610007P08Rik;Rad26l',
       '0610007P14Rik', '0610007P22Rik', '0610008F07Rik', '0610009B22Rik',
       '0610009D07Rik', '0610009O20Rik', '0610010B08Rik;Gm14434;Gm14308',
       ...
       'mTPK1;Tpk1', 'mimp3;Igf2bp3;AK045244', 'mszf84;Gm14288;Gm14435;Gm8898',
       'mt-Nd4', 'mt3-mmp;Mmp16', 'rp9', 'scmh1;Scmh1', 'slc43a2;Slc43a2',
       'tsec-1;Tex9', 'tspan-3;Tspan3'],
      dtype='object', length=27297)


In [7]:
# Load the experimental design table
exp_design = pd.read_csv('../../perturbation_data_wrong/not_ready/paul/GSE72857_experimental_design.txt', sep='\t', skiprows=19)

In [8]:
# Check that it only has the table of interest
print(exp_design)

      Well_ID Seq_batch_ID Amp_batch_ID well_coordinates  Mouse_ID  Plate_ID  \
0      W29953         SB17        AB167               A1         1         1   
1      W29954         SB17        AB167               C1         1         1   
2      W29955         SB17        AB167               E1         1         1   
3      W29956         SB17        AB167               G1         1         1   
4      W29957         SB17        AB167               I1         1         1   
...       ...          ...          ...              ...       ...       ...   
10363  W76332         SB29        AB396              H24         6        27   
10364  W76333         SB29        AB396              J24         6        27   
10365  W76334         SB29        AB396              L24         6        27   
10366  W76335         SB29        AB396              N24         6        27   
10367  W76336         SB29        AB396              P24         6        27   

          Batch_desc Pool_barcode Cell_

##### Combining the Data with the perturbations, controls, etc.

The data is currently stored in two files:
- GSE72857_umitab.txt which has the genes and sample/cell names based on well ID.
- GSE72857_experimental_design which has metadata about each sample (based on well ID).

Our next step will be merging the main dataframe with the experimental design table using the well IDs to have the data for perturbations, controls, etc.

##### Understanding the Data Structure

The data appears to have gene names as row indices and sample/cell names as column headers.
The values represent expression levels for each gene in each sample/cell.

Do the following to convert the txt file to h5ad

In [20]:
# Transpose the DataFrame to have genes as columns and cells as rows
df = df.T

# Extract gene names and expression data
gene_names = df.columns.values  # Gene names are in the columns
expression_data = df.values  # Expression data as numpy array
cell_names = df.index.values  # Cell/sample names are in the index

# Create AnnData object
adata = sc.AnnData(X=expression_data)
adata.var_names = gene_names  # Set gene names
adata.obs_names = cell_names  # Set cell/sample names

Add the necessary metadata to pass the data checker

In [24]:
# add highly_variable_rank by sorting the genes then ranking them

# Calculate the variability (variance) of each gene
variability = np.var(expression_data, axis=0)

# Rank genes based on variability in descending order
variability_rank = np.argsort(-variability)

# Create a DataFrame to hold the rankings
rankings_df = pd.DataFrame({
    'variability': variability,
    'highly_variable_rank': variability_rank
}, index=gene_names)

# Add the rankings to the AnnData object
adata.var['highly_variable_rank'] = rankings_df.loc[adata.var_names, 'highly_variable_rank']


In [42]:
# Use dummy data for the perturbations as I can't find them yet

# Create synthetic perturbation data
num_cells = adata.shape[0]
perturbations = np.random.choice(['geneA', 'geneB', 'geneC'], size=num_cells)
expression_levels = np.random.rand(num_cells)
perturbation_types = np.random.choice(['knockout', 'overexpression', 'knockdown'], size=num_cells)

# Add synthetic perturbation data to AnnData object
adata.obs['perturbation'] = perturbations
adata.obs['expression_level_after_perturbation'] = expression_levels
adata.obs['perturbation_type'] = perturbation_types

In [43]:
# Save to .h5ad file
output_file_path = '../perturbations/paul/test.h5ad'
adata.write_h5ad(output_file_path)

print(f"Data successfully saved to {output_file_path}")

Data successfully saved to ../perturbations/paul/test.h5ad


In [9]:
# set the path to the dataset
dc.set_data_path("../perturbations")

In [45]:
# load the paul perturbation dataset to the data checker
paul = dc.load_perturbation("paul")
print(paul.var.columns)

Index(['highly_variable_rank'], dtype='object')


In [46]:
# check the dataset using the data checker
is_valid = dc.check_perturbation_dataset(ad=paul)
print("Dataset validation result:", is_valid)

Checking gene metadata...
Checking perturbation labels...
Checking control labels...


AssertionError: No 'is_control' column