### Understanding Paul's Cell Oracle Data

This notebook adjusts a dataset from a paper on transcriptional heterogeneity in myeloid progenitors ([Paul et al. 2015](https://pubmed.ncbi.nlm.nih.gov/26627738/)). The goal is to run the dataset through a data checker and adjust it until it passes every check.

Here we tidy the dataset and explore it briefly in pandas to confirm that it is intact and usable.

In [1]:
# Get required libraries

import pandas as pd
import scanpy as sc
import numpy as np
import importlib
import pereggrn_perturbations as dc # the data checker (dc)

In [2]:
# Reload the module to catch any updates
importlib.reload(dc)

<module 'pereggrn_perturbations' from '/home/ec2-user/expression_forecasting_benchmarks/perturbation_data/setup/pereggrn_perturbations.py'>

In [3]:
# Load the main dataframe
df = pd.read_csv('../../perturbation_data_wrong/not_ready/paul/GSE72857_umitab.txt', sep='\t')

In [20]:
# Optional - Inspect the DataFrame
print(df.head())  # Print the first few rows to inspect the data
print(df.info())  # Print information about the DataFrame
print(df.describe())  # Get descriptive statistics if applicable

print(df.index.tolist()) # Print all row names

                      W29953  W29954  W29955  W29956  W29957  W29958  W29959  \
0610007C21Rik;Apr3         0       0       0       0       0       0       0   
0610007L01Rik              0       2       1       1       2       0       0   
0610007P08Rik;Rad26l       0       0       0       0       1       0       0   
0610007P14Rik              0       0       0       1       1       0       0   
0610007P22Rik              0       0       0       0       0       0       0   

                      W29960  W29961  W29962  ...  W76327  W76328  W76329  \
0610007C21Rik;Apr3         0       0       0  ...       0       0       0   
0610007L01Rik              0       1       1  ...       0       0       0   
0610007P08Rik;Rad26l       0       0       0  ...       0       0       0   
0610007P14Rik              1       0       0  ...       1       0       0   
0610007P22Rik              0       0       0  ...       0       0       0   

                      W76330  W76331  W76332  W76333  W7

In [21]:
# Print specific rows by their names (indices)
print(df.loc[['Cebpa', 'Cebpe']])

       W29953  W29954  W29955  W29956  W29957  W29958  W29959  W29960  W29961  \
Cebpa       0       0       0       0       0       0       0       0       1   
Cebpe       0       0       0       0       0       0       0       0       0   

       W29962  ...  W76327  W76328  W76329  W76330  W76331  W76332  W76333  \
Cebpa       0  ...       0       0       0       0       1       0       1   
Cebpe       0  ...       0       0       0       0       0       0       0   

       W76334  W76335  W76336  
Cebpa       0       0       0  
Cebpe       0       0       0  

[2 rows x 10368 columns]


In [41]:
# Optional - Understand the structure
# Check for the presence of headers, the shape of the data, and sample the data
print(df.columns)  # Check column names
print(df.index)  # Check row indices

Index(['W29953', 'W29954', 'W29955', 'W29956', 'W29957', 'W29958', 'W29959',
       'W29960', 'W29961', 'W29962',
       ...
       'W76327', 'W76328', 'W76329', 'W76330', 'W76331', 'W76332', 'W76333',
       'W76334', 'W76335', 'W76336'],
      dtype='object', length=10368)
Index(['0610007C21Rik;Apr3', '0610007L01Rik', '0610007P08Rik;Rad26l',
       '0610007P14Rik', '0610007P22Rik', '0610008F07Rik', '0610009B22Rik',
       '0610009D07Rik', '0610009O20Rik', '0610010B08Rik;Gm14434;Gm14308',
       ...
       'mTPK1;Tpk1', 'mimp3;Igf2bp3;AK045244', 'mszf84;Gm14288;Gm14435;Gm8898',
       'mt-Nd4', 'mt3-mmp;Mmp16', 'rp9', 'scmh1;Scmh1', 'slc43a2;Slc43a2',
       'tsec-1;Tex9', 'tspan-3;Tspan3'],
      dtype='object', length=27297)


In [4]:
# Load the experimental design table
exp_design = pd.read_csv('../../perturbation_data_wrong/not_ready/paul/GSE72857_experimental_design.txt', sep='\t', skiprows=19)

In [8]:
# Optional - Check that it only has the table of interest
print(exp_design)

      Well_ID Seq_batch_ID Amp_batch_ID well_coordinates  Mouse_ID  Plate_ID  \
0      W29953         SB17        AB167               A1         1         1   
1      W29954         SB17        AB167               C1         1         1   
2      W29955         SB17        AB167               E1         1         1   
3      W29956         SB17        AB167               G1         1         1   
4      W29957         SB17        AB167               I1         1         1   
...       ...          ...          ...              ...       ...       ...   
10363  W76332         SB29        AB396              H24         6        27   
10364  W76333         SB29        AB396              J24         6        27   
10365  W76334         SB29        AB396              L24         6        27   
10366  W76335         SB29        AB396              N24         6        27   
10367  W76336         SB29        AB396              P24         6        27   

          Batch_desc Pool_barcode Cell_

##### Isolating the Wildtype and Perturbations

The data are currently stored in two files:
- GSE72857_umitab.txt, which holds the count matrix with genes as rows and samples/cells (named by well ID) as columns.
- GSE72857_experimental_design.txt, which holds the metadata for each sample, keyed by well ID.

The next step is to merge the main dataframe with the experimental design table on the well IDs, so that each cell carries the Batch_desc label that identifies it as wildtype or perturbed.

In [5]:
# Transpose the main dataframe for merging
df_t = df.T

# Merge the main dataframe with the experimental design table using the well IDs
merged_df = df_t.merge(exp_design[['Well_ID', 'Batch_desc']], left_index=True, right_on='Well_ID', how='left')

# Set the index back to the well IDs
merged_df.set_index('Well_ID', inplace=True)
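
A left merge leaves `Batch_desc` as NaN for any well that is missing from the design table, so it can be worth checking the join before splitting the data. The sketch below uses a small made-up matrix and design table (the gene names, well IDs, and batch labels are illustrative, not taken from the GEO files) and relies on the `indicator` and `validate` arguments of `DataFrame.merge`.

```python
import pandas as pd

# Toy genes-by-wells count matrix (values are made up for illustration)
toy_counts = pd.DataFrame(
    {"W1": [0, 2], "W2": [1, 0], "W3": [3, 1]},
    index=["GeneA", "GeneB"],
)

# Toy design table; W3 is deliberately missing
toy_design = pd.DataFrame(
    {"Well_ID": ["W1", "W2"], "Batch_desc": ["Cebpa KO", "Cebpa control"]}
)

# Transpose so wells become rows, then left-join the metadata.
# indicator=True adds a '_merge' column; validate raises on duplicated well IDs.
toy_merged = toy_counts.T.merge(
    toy_design,
    left_index=True,
    right_on="Well_ID",
    how="left",
    indicator=True,
    validate="one_to_one",
)

# Count wells that found no row in the design table
n_unmatched = int((toy_merged["_merge"] == "left_only").sum())
print(f"{n_unmatched} well(s) without metadata")
```

A nonzero count here would mean that some cells end up with no `Batch_desc` and would be assigned to train by default in the split that follows.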

##### Creating the AnnData Structures

In the original file, gene names are the row indices and sample/cell names are the column headers; each value is the expression count of one gene in one sample/cell. After the transpose above, cells are rows and genes are columns, which is the layout AnnData expects.

Do the following to convert the txt file to h5ad and add the necessary metadata:

- Split the data into train (cells whose Batch_desc mentions neither a knockout nor a control, treated here as wildtype) and test (the knockout cells and their controls).
- Create the AnnData structure for train.

In [11]:
# Split the data based on Batch_desc
train_df = merged_df[~merged_df['Batch_desc'].str.contains('control|KO', na=False)]

# Extract gene names and expression data for train
train_numeric_data = train_df.select_dtypes(include=[np.number])
train_gene_names = train_numeric_data.columns.values
train_cell_names = train_numeric_data.index.values

# Create AnnData object for train
adata_train = sc.AnnData(X=train_numeric_data.values.astype(float))
adata_train.var_names = train_gene_names
adata_train.obs_names = train_cell_names

# Add metadata to obs
adata_train.obs['Batch_desc'] = train_df['Batch_desc'].values

# Save train.h5ad
train_output_file_path = '../perturbations/paul/train.h5ad'
adata_train.write_h5ad(train_output_file_path)
print(f"Train data successfully saved to {train_output_file_path}")

KeyboardInterrupt: 

- Prepare the AnnData structure for test.

In [6]:
# Split the data based on Batch_desc
test_df = merged_df[merged_df['Batch_desc'].str.contains('control|KO', na=False)]

# Extract numeric data for test
test_numeric_data = test_df.select_dtypes(include=[np.number])
test_gene_names = test_numeric_data.columns.values
test_cell_names = test_df.index.values

# Create AnnData object for test
adata_test = sc.AnnData(X=test_numeric_data.values.astype(float))
adata_test.var_names = test_gene_names
adata_test.obs_names = test_cell_names

# Add metadata to obs
adata_test.obs['Batch_desc'] = test_df['Batch_desc'].values
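
The train and test cells above use the same `str.contains` mask and its negation, so the two subsets partition the cells. A minimal sketch on made-up labels (the well IDs and the "Unsorted myeloid" label are illustrative assumptions) shows how `na=False` sends cells without a label to train:

```python
import pandas as pd

# Toy metadata; the last cell has no Batch_desc at all
toy_meta = pd.DataFrame(
    {"Batch_desc": ["Unsorted myeloid", "Cebpa KO", "Cebpa control", None]},
    index=["W1", "W2", "W3", "W4"],
)

# True for knockout and control cells; missing labels count as False
is_test = toy_meta["Batch_desc"].str.contains("control|KO", na=False)

toy_train = toy_meta[~is_test]
toy_test = toy_meta[is_test]

print(list(toy_train.index))
print(list(toy_test.index))
```

Because the match is case-sensitive, a label written as "ko" or "Control" would fall into train; passing `case=False` is one option if the labels are not uniform.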

- Add the required metadata (highly_variable_rank).

In [8]:
# Add highly_variable_rank by sorting the genes then ranking them

# Calculate highly variable genes using Scanpy
sc.pp.highly_variable_genes(adata_test, n_bins=50, n_top_genes=adata_test.var.shape[0], flavor="seurat_v3")

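# Rank genes by the normalized variance computed by Scanpy (0 = most variable).
# np.argsort on the boolean 'highly_variable' flag returns sort positions, not ranks.
adata_test.var['highly_variable_rank'] = adata_test.var['variances_norm'].rank(ascending=False, method='first', na_option='bottom').astype(int) - 1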
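
`np.argsort` returns the order in which elements would be sorted, not the rank of each element, and the two are easy to confuse when building a rank column. A minimal sketch with made-up variability scores (not Scanpy output) shows the difference:

```python
import numpy as np

# Made-up variability scores for four genes
scores = np.array([0.2, 1.5, 0.7, 1.1])

# Gene indices ordered from most to least variable
order = np.argsort(-scores)

# Invert the ordering to get each gene's rank (0 = most variable)
ranks = np.empty_like(order)
ranks[order] = np.arange(len(scores))

print(order.tolist())  # [1, 3, 2, 0]
print(ranks.tolist())  # [3, 0, 2, 1]
```

Here gene 1 has the highest score and therefore rank 0, whereas `order[1]` is 3, which is the index of the second most variable gene.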


- Add the required metadata to obs:
    - perturbation
    - expression_level_after_perturbation
    - is_control
    - perturbation_type

In [24]:
# Ensure test_df is a copy to avoid warnings
test_df = test_df.copy()

# Perturbation: Use 'Batch_desc' to infer perturbations
adata_test.obs['perturbation'] = test_df['Batch_desc'].apply(lambda x: 'Cebpa' if 'Cebpa KO' in x else ('Cebpe' if 'Cebpe KO' in x else 'None'))

# Expression level after perturbation: 0 for knockout cells; for control cells,
# the measured count of the gene targeted in the matching knockout.
# Stored in obs (not in test_df) so that it is saved with the AnnData object.
adata_test.obs['expression_level_after_perturbation'] = test_df.apply(
    lambda row: df.loc['Cebpa', row.name] if 'Cebpa control' in row['Batch_desc'] else
                df.loc['Cebpe', row.name] if 'Cebpe control' in row['Batch_desc'] else 0, axis=1
)

# Is control: Infer from 'Batch_desc'
adata_test.obs['is_control'] = test_df['Batch_desc'].apply(lambda x: 'control' in str(x).lower())

# Perturbation type: Always knockout since this is the only possible perturbation in this data
adata_test.obs['perturbation_type'] = 'knockout'
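
The label parsing in the cell above can also be factored into a small pure function that is easy to test in isolation. The sketch below is one way to do that; `parse_batch_desc` is a hypothetical helper, and the example labels assume `Batch_desc` values of the form "Cebpa KO" and "Cebpa control", as implied by the string matches above.

```python
def parse_batch_desc(batch_desc):
    """Return (perturbation, is_control) for a Batch_desc label.

    Hypothetical helper that mirrors the string matching used in the notebook.
    """
    text = str(batch_desc)
    is_control = "control" in text.lower()
    perturbation = "None"  # same placeholder the notebook uses for unperturbed cells
    for gene in ("Cebpa", "Cebpe"):
        if f"{gene} KO" in text:
            perturbation = gene
    return perturbation, is_control


# Assumed label values, for illustration only
labels = ["Cebpa KO", "Cebpa control", "Cebpe KO", "Cebpe control"]
parsed = {label: parse_batch_desc(label) for label in labels}
print(parsed)
```

Keeping the rule in one function means the `perturbation` and `is_control` columns cannot drift apart if the matching logic changes later.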

- Add the required uns for the AnnData object:
    - perturbed_and_measured_genes
    - perturbed_but_not_measured_genes

In [29]:
# Define perturbed and measured genes
perturbed_and_measured_genes = ['Cebpa', 'Cebpe']
perturbed_but_not_measured_genes = []

# Store these lists in the uns attribute
adata_test.uns['perturbed_and_measured_genes'] = perturbed_and_measured_genes
adata_test.uns['perturbed_but_not_measured_genes'] = perturbed_but_not_measured_genes

# Ensure the code correctly checks these lists
assert all([g in adata_test.var_names for g in adata_test.uns["perturbed_and_measured_genes"]]), "perturbed_and_measured_genes not all measured"
assert all([g not in adata_test.var_names for g in adata_test.uns["perturbed_but_not_measured_genes"]]), "perturbed_but_not_measured_genes sometimes measured"


- Save the raw counts, then normalize and natural-log transform

In [35]:
# Save the raw counts before X is modified, so that .raw holds the untransformed data
adata_test.raw = adata_test.copy()

# Calculate the total counts per cell
adata_test.obs['total_counts'] = adata_test.X.sum(axis=1)

# Normalize each cell by the total counts, then multiply by a scaling factor (e.g., 10,000)
scaling_factor = 10000
adata_test.X = adata_test.X / adata_test.obs['total_counts'].values[:, None] * scaling_factor

# Perform log transformation
adata_test.X = np.log1p(adata_test.X)  # This is equivalent to np.log(adata_test.X + 1)
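
The normalization above can be sanity-checked on a toy matrix. This is a minimal sketch with made-up counts: after scaling, every cell should sum to the scaling factor, and `np.expm1` should undo the log transform.

```python
import numpy as np

# Toy cells-by-genes count matrix (made-up values)
toy_counts = np.array([[1.0, 3.0, 0.0],
                       [2.0, 2.0, 4.0]])
scaling_factor = 10_000

# Per-cell totals, kept 2-D so the division broadcasts across each row
totals = toy_counts.sum(axis=1, keepdims=True)
scaled = toy_counts / totals * scaling_factor

# Natural log of (1 + x), then its inverse
logged = np.log1p(scaled)
recovered = np.expm1(logged)

print(recovered.sum(axis=1))
print(float(logged.max()))
```

A cell with zero total counts would produce NaN in the division, so such cells would need to be filtered out before this step.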

- Save the AnnData object to an h5ad file

In [36]:
# Save to .h5ad file
test_output_file_path = '../perturbations/paul/test.h5ad'
adata_test.write_h5ad(test_output_file_path)

print(f"Data successfully saved to {test_output_file_path}")

Data successfully saved to ../perturbations/paul/test.h5ad


In [37]:
# Set the path to the dataset
dc.set_data_path("../perturbations")

In [38]:
# Load the paul perturbation dataset to the data checker
paul = dc.load_perturbation("paul")

In [39]:
# Check the dataset using the data checker
is_valid = dc.check_perturbation_dataset(ad=paul)
print("Dataset validation result:", is_valid)

Checking gene metadata...
Checking perturbation labels...
Checking control labels...
Checking which genes are measured...
Checking for log-transform and raw data...
... done.
Dataset validation result: True
