## Subset Oracle objects for in silico KO perturbation

- last updated: 04/17/2024
- author: Yang-Joon Kim


### Goals
- Subset the Oracle objects to a population of interest (for example, the NMP trajectories, as in Zebrahub) and run the in silico KO simulation on that subset, to focus on genes whose KO effect changes across developmental stages.

In [1]:
import copy
import glob
import time
import os
import shutil
import sys

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scanpy as sc
import seaborn as sns
from tqdm.auto import tqdm

## 0.2. Import our library

In [2]:
import celloracle as co
from celloracle.applications import Pseudotime_calculator
co.__version__



'0.14.0'

## 0.3. Plotting parameter setting

In [3]:
#plt.rcParams["font.family"] = "arial"
plt.rcParams["figure.figsize"] = [5,5]
%config InlineBackend.figure_format = 'retina'
plt.rcParams["savefig.dpi"] = 300

%matplotlib inline

# 1. Load data

- If you already have an `Oracle` object, run **1.1. [Option1] Load oracle data**.

- If you have not made an `Oracle` object yet and want to calculate pseudotime from an `AnnData` object, run **1.2. [Option2] Load anndata**.

In this notebook, we load a previously saved `Oracle` object (TDR118) and add pseudotime information to it.

## 1.1. [Option1] Load oracle data

In [7]:
# # Load demo scRNA-seq data.
# oracle = co.data.load_tutorial_oracle_object()

# # Instantiate pseudotime object using oracle object.
# pt = Pseudotime_calculator(oracle_object=oracle)


In [4]:
# Load the TDR118 oracle data
oracle_15somites = co.load_hdf5("/hpc/projects/data.science/yangjoon.kim/zebrahub_multiome/data/processed_data/03_celloracle_celltype_GRNs/TDR118reseq/06_TDR118reseq.celloracle.oracle")
oracle_15somites

Oracle object

Meta data
    celloracle version used for instantiation: 0.14.0
    n_cells: 13614
    n_genes: 3000
    cluster_name: global_annotation
    dimensional_reduction_name: X_umap.joint
    n_target_genes_in_TFdict: 13576 genes
    n_regulatory_in_TFdict: 872 genes
    n_regulatory_in_both_TFdict_and_scRNA-seq: 327 genes
    n_target_genes_both_TFdict_and_scRNA-seq: 1731 genes
    k_for_knn_imputation: 340
Status
    Gene expression matrix: Ready
    BaseGRN: Ready
    PCA calculation: Done
    Knn imputation: Done
    GRN calculation for simulation: Not finished

In [5]:
from celloracle.applications import development_module

In [6]:
help(development_module.subset_oracle_for_development_analysiis)

Help on function subset_oracle_for_development_analysiis in module celloracle.applications.development_module:

subset_oracle_for_development_analysiis(oracle_object, cell_idx_use)
    Make a subset of oracle object by specifying of cluster.
    This function pick up some of attributes that needed for development analysis rather than whole attributes.



In [9]:
# import the csv file for the "alignedUMAP" coordinates for the NMP trajectory
umap_coords_nmps = pd.read_csv("/hpc/projects/data.science/yangjoon.kim/zebrahub_multiome/data/processed_data/01_Signac_processed/aligned_umap_coords_NMPs.csv", index_col=0)
umap_coords_nmps.head()

Unnamed: 0,UMAP_1,UMAP_2,timepoint,cell_type,cell_id
0,1.070022,-6.046746,0somites,Neural_Posterior,AAACAGCCAAACGGGC-1_5
1,2.922442,-6.883583,0somites,Neural_Posterior,AAACAGCCAACACTTG-1_5
2,8.178947,5.224598,0somites,Somites,AAACAGCCACAATGCC-1_5
3,-4.169683,-9.929869,0somites,Neural_Posterior,AAACAGCCACCTGGTG-1_5
4,5.393727,-3.432797,0somites,NMPs,AAACAGCCAGTTATCG-1_5


In [17]:
# keep only the 15-somites cells; .copy() avoids editing a view of the full table
umap_coords_sub = umap_coords_nmps[umap_coords_nmps.timepoint=="15somites"].copy()
umap_coords_sub

Unnamed: 0,UMAP_1,UMAP_2,timepoint,cell_type,cell_id
0,-5.541245,-7.746141,15somites,Neural_Posterior,AAACAGCCATAGACCC-1_1
1,4.825198,-7.891592,15somites,Neural_Posterior,AAACATGCAAACTCAT-1_1
2,-4.484263,7.431350,15somites,Neural_Anterior,AAACATGCAAGGACCA-1_1
3,-3.275126,-4.276315,15somites,Neural_Anterior,AAACATGCAAGGATTA-1_1
4,8.242519,0.078934,15somites,PSM,AAACATGCAGGACCTT-1_1
...,...,...,...,...,...
8232,-3.951273,-3.006925,15somites,Neural_Anterior,TTTGTGTTCGAGGTGG-1_1
8233,-3.961044,-4.258544,15somites,Neural_Anterior,TTTGTGTTCGCTAAGT-1_1
8234,-2.692312,10.970349,15somites,Neural_Anterior,TTTGTTGGTAAAGCAA-1_1
8235,-5.109061,-4.869098,15somites,Neural_Posterior,TTTGTTGGTAATAACC-1_1


In [18]:
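# strip only the trailing "_1" sample suffix so the cell IDs match adata.obs_names
umap_coords_sub["cell_id"] = umap_coords_sub["cell_id"].str.replace(r"_1$", "", regex=True)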
umap_coords_sub.head()

Unnamed: 0,UMAP_1,UMAP_2,timepoint,cell_type,cell_id
0,-5.541245,-7.746141,15somites,Neural_Posterior,AAACAGCCATAGACCC-1
1,4.825198,-7.891592,15somites,Neural_Posterior,AAACATGCAAACTCAT-1
2,-4.484263,7.43135,15somites,Neural_Anterior,AAACATGCAAGGACCA-1
3,-3.275126,-4.276315,15somites,Neural_Anterior,AAACATGCAAGGATTA-1
4,8.242519,0.078934,15somites,PSM,AAACATGCAGGACCTT-1
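
When stripping a per-sample suffix from cell IDs, anchoring the pattern to the end of the string avoids touching any `_1` that happens to occur elsewhere in the ID. The sketch below illustrates the difference with the standard-library `re` module. The second ID is hypothetical and made up for illustration; the barcodes in this dataset only carry the suffix at the end. The pandas equivalent would be `Series.str.replace(r"_1$", "", regex=True)`.

```python
import re

# Toy cell IDs. The second one is hypothetical: it contains "_1" mid-string on purpose.
cell_ids = ["AAACAGCCATAGACCC-1_1", "sample_10_AAACATGCAAACTCAT-1_1"]

# Unanchored replacement removes every "_1" occurrence, including the one inside "sample_10".
unanchored = [cid.replace("_1", "") for cid in cell_ids]

# Anchored regex removes only the trailing "_1" suffix.
anchored = [re.sub(r"_1$", "", cid) for cid in cell_ids]

print(unanchored)
print(anchored)
```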


In [21]:
# extract the adata (all cells)
adata = oracle_15somites.adata

# subset for the NMP trajectories
adata_sub = adata[adata.obs_names.isin(umap_coords_sub.cell_id)]
adata_sub

View of AnnData object with n_obs × n_vars = 8237 × 3000
    obs: 'orig.ident', 'nCount_RNA', 'nFeature_RNA', 'nCount_ATAC', 'nFeature_ATAC', 'nucleosome_signal', 'nucleosome_percentile', 'TSS.enrichment', 'TSS.percentile', 'nCount_SCT', 'nFeature_SCT', 'global_annotation', 'prediction.score.Lateral_Mesoderm', 'prediction.score.Neural_Crest', 'prediction.score.Somites', 'prediction.score.Epidermal', 'prediction.score.Neural_Anterior', 'prediction.score.Neural_Posterior', 'prediction.score.Endoderm', 'prediction.score.PSM', 'prediction.score.Differentiating_Neurons', 'prediction.score.Adaxial_Cells', 'prediction.score.NMPs', 'prediction.score.Notochord', 'prediction.score.Muscle', 'prediction.score.unassigned', 'prediction.score.max', 'nCount_peaks_bulk', 'nFeature_peaks_bulk', 'nCount_peaks_celltype', 'nFeature_peaks_celltype', 'nCount_peaks_merged', 'nFeature_peaks_merged', 'SCT.weight', 'peaks_merged.weight', 'nCount_Gene.Activity', 'nFeature_Gene.Activity'
    var: 'features', 'high
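
Before relying on a subset like this, it can help to check how many of the requested cell IDs were actually found, since a leftover suffix or a mismatched barcode silently drops cells from an `isin` filter. The following is a minimal sketch with plain Python sets and toy IDs. The helper name `summarize_overlap` is hypothetical and not part of CellOracle, scanpy, or pandas; in the notebook, the inputs would be `umap_coords_sub.cell_id` and `adata.obs_names`.

```python
def summarize_overlap(wanted_ids, available_ids):
    """Count how many wanted IDs exist in available_ids and list the ones that do not."""
    wanted = set(wanted_ids)
    available = set(available_ids)
    matched = wanted & available
    return {
        "n_wanted": len(wanted),
        "n_matched": len(matched),
        "missing": sorted(wanted - available),
    }

# Toy example: the last CSV ID still carries its "_1" suffix, so it will not match.
csv_ids = ["AAAC-1", "AAAG-1", "AAAT-1_1"]
obs_names = ["AAAC-1", "AAAG-1", "AAAT-1", "CCCC-1"]

report = summarize_overlap(csv_ids, obs_names)
print(report)
```

A non-empty `missing` list is a cue to re-inspect the ID formatting before subsetting.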

In [None]:
adata_sub.obs_names

In [22]:
oracle_NMPs = development_module.subset_oracle_for_development_analysiis(oracle_15somites, cell_idx_use=adata_sub.obs_names)

AttributeError: 'Oracle' object has no attribute 'delta_embedding'
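
The traceback indicates that the subsetting helper reads `delta_embedding`, which this Oracle object does not have. This is presumably because no perturbation simulation has been run on it yet (its status above reports "GRN calculation for simulation: Not finished"). A small pre-check along the following lines can make this failure explicit before calling the helper. It is only a sketch: `missing_attributes` is a hypothetical helper, the stand-in object is a `SimpleNamespace` rather than a real `Oracle`, and the list of required names contains only the attribute confirmed by the traceback plus `embedding`, which is set elsewhere in this notebook.

```python
from types import SimpleNamespace

def missing_attributes(obj, required):
    """Return the names in `required` that are not set on `obj`."""
    return [name for name in required if not hasattr(obj, name)]

# Stand-in for an Oracle object that has an embedding but no simulation results.
fake_oracle = SimpleNamespace(embedding=[[0.0, 1.0]], adata=None)

# Only "delta_embedding" is confirmed by the traceback; extend this list as needed.
required = ["embedding", "delta_embedding"]

missing = missing_attributes(fake_oracle, required)
if missing:
    print("Missing attributes:", missing)
```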

In [6]:
# redefine the default embedding for the oracle object ("X_atac.umap.cellranger")
oracle.embedding = oracle.adata.obsm["X_umap.joint"]
oracle.embedding_name = "X_umap.joint"

In [7]:
oracle.embedding_name

'X_umap.joint'

In [8]:
# Load the links data (cell-type GRNs); the path below points to the TDR119 output
links = co.load_hdf5("/hpc/projects/data.science/yangjoon.kim/zebrahub_multiome/data/processed_data/TDR119_cicero_output/08_TDR119_celltype_GRNs.celloracle.links")
links

<celloracle.network_analysis.links_object.Links at 0x149a39aef0d0>

In [9]:
# Instantiate pseudotime object using oracle object
pt = Pseudotime_calculator(oracle_object=oracle)

In [10]:
adata = oracle.adata
adata

AnnData object with n_obs × n_vars = 13022 × 3000
    obs: 'orig.ident', 'nCount_RNA', 'nFeature_RNA', 'nCount_ATAC', 'nFeature_ATAC', 'nucleosome_signal', 'nucleosome_percentile', 'TSS.enrichment', 'TSS.percentile', 'nCount_SCT', 'nFeature_SCT', 'global_annotation', 'nCount_peaks_bulk', 'nFeature_peaks_bulk', 'nCount_peaks_celltype', 'nFeature_peaks_celltype', 'SCT.weight', 'peaks_celltype.weight'
    var: 'features', 'highly_variable', 'highly_variable_rank', 'means', 'variances', 'variances_norm', 'symbol', 'isin_top1000_var_mean_genes', 'isin_TFdict_targets', 'isin_TFdict_regulators'
    uns: 'hvg', 'log1p', 'global_annotation_colors'
    obsm: 'X_umap.atac', 'X_umap.joint', 'X_umap.rna'
    layers: 'raw_count', 'normalized_count', 'imputed_count'

In [4]:
# # Load the TDR118 oracle data
# oracle = co.load_hdf5("/hpc/projects/data.science/yangjoon.kim/zebrahub_multiome/data/processed_data/03_celloracle_celltype_GRNs/TDR118reseq/06_TDR118reseq.celloracle.oracle")
# oracle

Oracle object

Meta data
    celloracle version used for instantiation: 0.14.0
    n_cells: 13022
    n_genes: 3000
    cluster_name: global_annotation
    dimensional_reduction_name: X_umap.atac
    n_target_genes_in_TFdict: 12674 genes
    n_regulatory_in_TFdict: 872 genes
    n_regulatory_in_both_TFdict_and_scRNA-seq: 318 genes
    n_target_genes_both_TFdict_and_scRNA-seq: 1637 genes
    k_for_knn_imputation: 325
Status
    Gene expression matrix: Ready
    BaseGRN: Ready
    PCA calculation: Done
    Knn imputation: Done
    GRN calculation for simulation: Not finished