### Understanding Paul's Cell Oracle Data - Version 2 (combined data)

This notebook ingests a dataset from a paper on transcriptional heterogeneity in myeloid progenitors ([Paul et al. 2015](https://pubmed.ncbi.nlm.nih.gov/26627738/)). We reformat the data, conduct exploratory analysis, and annotate cell types.

In [None]:
# Get required libraries

import pandas as pd
import scanpy as sc
import numpy as np
import importlib
import matplotlib.pyplot as plt
import pereggrn_perturbations as dc # the data checker (dc)
import scipy.sparse

In [None]:
# Load the main dataframe
df = pd.read_csv('../not_ready/paul/GSE72857_umitab.txt', sep='\t')

# Load the experimental design table
exp_design = pd.read_csv('../not_ready/paul/GSE72857_experimental_design.txt', sep='\t', skiprows=19)

##### Isolating the Wildtype and Perturbations

The data is currently stored in two files:
- `GSE72857_umitab.txt`, which holds the expression matrix, with genes as rows and samples/cells (named by well ID) as columns.
- `GSE72857_experimental_design.txt`, which holds the metadata for each sample, also keyed by well ID.

Next, we merge the main dataframe with the experimental design table on the well IDs, so that every cell carries its metadata, including whether it is wildtype or perturbed.

In [None]:
# Transpose the main dataframe for merging
df_t = df.T
df_t.index.name = 'Well_ID'

# Merge the main dataframe with the experimental design table using the well IDs
# merged_df = df_t.merge(exp_design[['Well_ID', 'Batch_desc']], left_index=True, right_on='Well_ID', how='left')
merged_df = df_t.merge(exp_design, left_index=True, right_on='Well_ID', how='left')
merged_df.set_index('Well_ID', inplace=True)
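
The transpose-and-merge step above can be illustrated on a toy table. This is a minimal sketch: the well names, gene counts, and batch labels are made up, and only the column names `Well_ID` and `Batch_desc` are taken from the notebook. It also shows what a left join does to wells that have no row in the design table.

```python
import pandas as pd

# Toy stand-in for the UMI table: genes as rows, wells as columns (hypothetical values).
umi = pd.DataFrame(
    {"W1": [3, 0], "W2": [1, 5], "W3": [0, 2]},
    index=["Gata1", "Mpo"],
)

# Toy stand-in for the experimental design table; W3 is deliberately missing.
design = pd.DataFrame(
    {"Well_ID": ["W1", "W2"], "Batch_desc": ["wildtype", "perturbed"]}
)

# Transpose so wells become rows, then left-join the metadata on the well ID.
cells = umi.T
cells.index.name = "Well_ID"
merged = cells.merge(design, left_index=True, right_on="Well_ID", how="left")
merged = merged.set_index("Well_ID")

# Wells without metadata survive the left join but carry NaN in the metadata columns.
unmatched = merged.index[merged["Batch_desc"].isna()].tolist()
print(merged)
print("Wells without metadata:", unmatched)
```

Checking for unmatched wells in this way is a cheap sanity test before building the AnnData object, since missing metadata would otherwise propagate silently into `obs`.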

##### Creating the AnnData Structures

In the original (untransposed) matrix, rows are genes and columns are samples/cells.
Each value is the expression level (UMI count) of one gene in one sample/cell.

The next cell converts the merged table into an AnnData object and attaches the metadata; the object is written to an h5ad file later in the notebook.

In [None]:
# Extract gene names and expression data (numeric columns that are not metadata columns)
merged_gene_columns = [
    col for col in merged_df.select_dtypes(include=[np.number]).columns
    if col not in exp_design.columns
]
merged_numeric_data = merged_df[merged_gene_columns]
merged_gene_names = merged_numeric_data.columns.values
merged_cell_names = merged_numeric_data.index.values

# Create AnnData object with a sparse (CSR) expression matrix
adata_merged = sc.AnnData(X=scipy.sparse.csr_matrix(merged_numeric_data.values.astype(float)))
adata_merged.var_names = merged_gene_names
adata_merged.obs_names = merged_cell_names

# Add metadata to obs
adata_merged.obs['Batch_desc'] = merged_df['Batch_desc'].values
adata_merged.obs['Seq_batch_ID'] = merged_df['Seq_batch_ID'].values
adata_merged.obs['Amp_batch_ID'] = merged_df['Amp_batch_ID'].values
adata_merged.obs['well_coordinates'] = merged_df['well_coordinates'].values
adata_merged.obs['Mouse_ID'] = merged_df['Mouse_ID'].values
adata_merged.obs['Plate_ID'] = merged_df['Plate_ID'].values
adata_merged.obs['Pool_barcode'] = merged_df['Pool_barcode'].values
adata_merged.obs['Cell_barcode'] = merged_df['Cell_barcode'].values
adata_merged.obs['Number_of_cells'] = merged_df['Number_of_cells'].values
adata_merged.obs['CD34_measurement'] = merged_df['CD34_measurement'].values
adata_merged.obs['FcgR3_measurement'] = merged_df['FcgR3_measurement'].values
adata_merged.obs["well_row"]    = adata_merged.obs["well_coordinates"].str.extract(r"([A-Z])\d+")[0]
adata_merged.obs["well_column"] = adata_merged.obs["well_coordinates"].str.extract(r"([0-9]+)")[0]

##### Labeling by Cell Type

To label cell types, we process the data, run PCA and clustering, and annotate the clusters using our knowledge of the order and relationships of cell types in mouse hematopoiesis.

- Exclude cells with fewer than 1,000 total RNA counts, then normalize and log-transform.

In [None]:
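# Calculate the total counts per cell
# (summing a sparse matrix returns an (n_cells, 1) matrix, so flatten it to 1-D)
adata_merged.obs['total_counts'] = np.asarray(adata_merged.X.sum(axis=1)).ravel()
adata_merged.obs['log10_total_counts'] = np.log10(adata_merged.obs['total_counts'])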
# Filter out cells with fewer than 1,000 RNA counts
print(f"Total cells before filtering: {adata_merged.n_obs}")
adata_merged = adata_merged[adata_merged.obs['total_counts'] >= 1000, :].copy()

# Verify filtering step
print(f"Total cells after filtering: {adata_merged.n_obs}")

In [None]:
# Normalize each cell by its total counts, then multiply by a scaling factor of 10,000.
# Multiplying by a per-cell factor (rather than dividing by a dense array) keeps the matrix sparse.
scale = 10000 / adata_merged.obs['total_counts'].values
adata_merged.X = scipy.sparse.csr_matrix(adata_merged.X.multiply(scale[:, None]))

# Keep the normalized, not yet log-transformed values in .raw
adata_merged.raw = adata_merged.copy()

# Perform log transformation
adata_merged.X = np.log1p(adata_merged.X)  # Same values as np.log(adata_merged.X + 1)
adata_merged.X = adata_merged.X.tocsr()

# Verify normalization and log transformation (densify only the corner that is printed)
print(f"Data after normalization and log transformation (top left corner):\n{adata_merged.X[:5, :5].toarray()}")
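
The normalization and log transformation above amount to the following arithmetic. This is a minimal sketch on a small dense toy matrix; the counts are made up and chosen so that both cells pass the 1,000-count filter.

```python
import numpy as np

# Hypothetical raw UMI counts: 2 cells (rows) x 3 genes (columns).
counts = np.array([[100.0, 300.0, 600.0],
                   [0.0, 500.0, 1500.0]])

# Per-cell library size, kept as a column vector so it broadcasts across genes.
totals = counts.sum(axis=1, keepdims=True)

# Scale every cell to 10,000 total counts, then apply log(1 + x).
normalized = counts / totals * 10_000
logged = np.log1p(normalized)

print(normalized)        # rows: [1000, 3000, 6000] and [0, 2500, 7500]
print(logged.round(3))   # zeros stay exactly zero under log1p
```

Because `log1p` maps 0 to 0, the transformed matrix keeps the same sparsity pattern as the counts, which is why the notebook can keep `X` in CSR format throughout.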

In [None]:
# Check the dimensions of the filtered data
print(f"adata_merged shape: {adata_merged.shape}")

# Check for NaN values (only the stored entries of the sparse matrix need checking)
if np.isnan(adata_merged.X.data).any():
    print("NaN values found in the data.")
else:
    print("No NaN values found in the data.")


We run a typical exploratory analysis: variable gene selection, PCA, nearest-neighbors, UMAP, diffusion maps, and modularity-maximizing graph clustering (Louvain), followed by cell-cycle scoring.

In [None]:
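# Note: flavor="seurat_v3" is designed for raw counts, whereas X here is normalized and log-transformed.
# With n_top_genes set to the total number of genes, every gene is retained.
sc.pp.highly_variable_genes(adata_merged, n_bins=50, n_top_genes=adata_merged.var.shape[0], flavor="seurat_v3")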
sc.tl.pca(adata_merged, svd_solver='arpack', n_comps=50)
sc.pp.neighbors(adata_merged, n_neighbors=4, n_pcs=20)
sc.tl.umap(adata_merged)
sc.tl.diffmap(adata_merged)
sc.tl.louvain(adata_merged, resolution=0.8)
S_genes_hum = ["MCM5", "PCNA", "TYMS", "FEN1", "MCM2", "MCM4", "RRM1", "UNG", "GINS2", 
            "MCM6", "CDCA7", "DTL", "PRIM1", "UHRF1", "CENPU", "HELLS", "RFC2", 
            "RPA2", "NASP", "RAD51AP1", "GMNN", "WDR76", "SLBP", "CCNE2", "UBR7", 
            "POLD3", "MSH2", "ATAD2", "RAD51", "RRM2", "CDC45", "CDC6", "EXO1", "TIPIN", 
            "DSCC1", "BLM", "CASP8AP2", "USP1", "CLSPN", "POLA1", "CHAF1B", "BRIP1", "E2F8"]
G2M_genes_hum = ["HMGB2", "CDK1", "NUSAP1", "UBE2C", "BIRC5", "TPX2", "TOP2A", "NDC80",
             "CKS2", "NUF2", "CKS1B", "MKI67", "TMPO", "CENPF", "TACC3", "PIMREG", 
             "SMC4", "CCNB2", "CKAP2L", "CKAP2", "AURKB", "BUB1", "KIF11", "ANP32E", 
             "TUBB4B", "GTSE1", "KIF20B", "HJURP", "CDCA3", "JPT1", "CDC20", "TTK",
             "CDC25C", "KIF2C", "RANGAP1", "NCAPD2", "DLGAP5", "CDCA2", "CDCA8", "ECT2", 
             "KIF23", "HMMR", "AURKA", "PSRC1", "ANLN", "LBR", "CKAP5", "CENPE", 
             "CTCF", "NEK2", "G2E3", "GAS2L3", "CBX5", "CENPA"]
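# Convert human symbols to mouse-style capitalization.
# str.capitalize() is used instead of str.title(), which would also uppercase letters that follow digits.
sc.tl.score_genes_cell_cycle(
    adata_merged,
    s_genes   = [g.capitalize() for g in S_genes_hum if g.capitalize() in adata_merged.var_names],
    g2m_genes = [g.capitalize() for g in G2M_genes_hum if g.capitalize() in adata_merged.var_names]
)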
adata_merged.write_h5ad('../not_ready/paul/paul_clustered_but_not_annotated.h5ad')
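
The cell-cycle scoring above maps human gene symbols to mouse-style capitalization before matching them against `var_names`. The sketch below shows how `str.title()` and `str.capitalize()` treat symbols that contain digits. It assumes the mouse ortholog uses the same symbol with only the first letter uppercase, which holds for many genes but not all.

```python
# Human cell-cycle symbols taken from the lists above.
human_symbols = ["MCM5", "CASP8AP2", "E2F8", "CKAP2L"]

for symbol in human_symbols:
    # title() restarts capitalization after every non-letter, including digits;
    # capitalize() uppercases only the first character of the string.
    print(f"{symbol:10s} title={symbol.title():10s} capitalize={symbol.capitalize()}")

titled = [s.title() for s in human_symbols]
capitalized = [s.capitalize() for s in human_symbols]
```

For symbols such as `CASP8AP2`, only the `capitalize()` form (`Casp8ap2`) follows the usual mouse convention, so a `title()`-based lookup can silently drop genes from the scoring lists.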

In [None]:
# Color the UMAP by every metadata column except the per-cell identifiers
sc.pl.umap(adata_merged, color=list(adata_merged.obs.columns.difference(["well_coordinates", "Cell_barcode"])))
markers = {
    "Erythroids":["Gata1", "Klf1", "Gypa", "Hba-a2"],
    "Megakaryocytes":["Itga2b", "Pbx1", "Sdpr", "Vwf"],
    "Granulocytes":["Elane", "Cebpe", "Ctsg", "Mpo", "Gfi1"],
    "Monocytes":["Irf8", "Csf1r", "Ctsg", "Mpo"],
    "Mast_cells":["Cma1", "Gzmb", "Kit"],
    "Basophils":["Mcpt8", "Prss34"]
}
for ct, genes in markers.items():
    # Plot only the markers present in the dataset, preserving their listed order
    sc.pl.umap(adata_merged, color=[g for g in genes if g in adata_merged.var_names])