# Replogle dataset

Descriptive statistics and preprocessing of the Replogle dataset, to be used as a perturbation dataset for training bmfm-rna models.  
We use the original file with raw counts, but apply the split from scGPT.  
A small file with only a few perturbations, in the same format, is created for testing.  

We take the scGPT-processed version and splits of the Replogle dataset from the [scGPT paper](https://www.nature.com/articles/s41592-024-02201-0) and its supplementary material. The paper describes the dataset as follows:
```
Replogle. The Replogle perturbation dataset contains genome-wide
perturbations in the K562 leukemia cell line with CRISPR interference34.
With consideration for data quality, we retained the subset of data
matching the 1,973 perturbations identified in the original study with
strong transcriptional phenotypes. We also removed 150 perturbations
that had no expression records of the perturbed gene in the sequencing data. We further retained 100 samples per perturbation and 2,500
control samples. The processed whole dataset consists of 171,542 samples from 1,823 perturbations, among which 99 are perturbations of
transcription factors. The test set consists of 456 perturbations, among
which 25 are perturbations of transcription factors.
```

In [1]:
import pickle
from pathlib import Path

import pandas as pd
import scanpy as sc



In [2]:
#dataset_folder = "/proj/bmfm/datasets/vcc/replogle2022/scGPT_Fig3_AB_PerturbPred/k562_1900_100_re_ctrl_sample"
dataset_folder = "/dccstor/bmfm-targets/data/omics/transcriptome/scRNA/finetune/Perturbation/replogle/k562_1900_100_re_ctrl_sample"
filename = "perturb_processed.h5ad"

In [3]:
adata = sc.read_h5ad(Path(dataset_folder) / filename)
adata

AnnData object with n_obs × n_vars = 171542 × 6017
    obs: 'gem_group', 'gene', 'gene_id', 'transcript', 'gene_transcript', 'sgID_AB', 'mitopercent', 'UMI_count', 'z_gemgroup_UMI', 'core_scale_factor', 'core_adjusted_UMI_count', 'batch', 'condition', 'cell_type', 'dose_val', 'control', 'condition_name'
    var: 'gene_name', 'chr', 'start', 'end', 'class', 'strand', 'length', 'in_matrix', 'mean', 'std', 'cv', 'fano', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
    uns: 'hvg', 'log1p', 'non_dropout_gene_idx', 'non_zeros_gene_idx', 'rank_genes_groups_cov_all', 'top_non_dropout_de_20', 'top_non_zero_de_20'

In [4]:
adata.X

<Compressed Sparse Row sparse matrix of dtype 'float32'
	with 387347629 stored elements and shape (171542, 6017)>

In [5]:
adata.obs.columns

Index(['gem_group', 'gene', 'gene_id', 'transcript', 'gene_transcript',
       'sgID_AB', 'mitopercent', 'UMI_count', 'z_gemgroup_UMI',
       'core_scale_factor', 'core_adjusted_UMI_count', 'batch', 'condition',
       'cell_type', 'dose_val', 'control', 'condition_name'],
      dtype='object')

In [6]:
adata.obs['gene'].value_counts()

gene
non-targeting    2500
NFRKB             100
NCBP2             100
NBAS              100
NARS2             100
                 ... 
RPS27A             26
PDCD11             26
HAUS1              26
PRPF4              25
SEM1               21
Name: count, Length: 1823, dtype: int64

In [7]:
adata.obs['control'].value_counts()

control
0    169042
1      2500
Name: count, dtype: int64

In [8]:
adata.obs[adata.obs['control'] == 1]['gene'].value_counts()

gene
non-targeting    2500
RANBP1              0
RAMAC               0
RAI14               0
RAE1                0
                 ... 
HIRA                0
HIPK3               0
HIPK1               0
HINFP               0
HNRNPK              0
Name: count, Length: 1823, dtype: int64

In [9]:
adata.obs['batch'].value_counts()

batch
1969    2500
1037     100
1005     100
1003     100
1002     100
        ... 
1461      26
1138      26
626       26
1252      25
1510      21
Name: count, Length: 1823, dtype: int64

### Apply split from scGPT file

In [10]:
splits_folder = "splits"
splits_file = "k562_1900_100_re_ctrl_sample_simulation_1_0.75.pkl"

In [11]:
with open(Path(dataset_folder) / splits_folder / splits_file, "rb") as f:
    splits = pickle.load(f)
    for split_name, gene_list in splits.items():
        print(f"{split_name} count: {len(gene_list)}")
        print(gene_list[:10])

test count: 456
['ACTB+ctrl', 'ADAM10+ctrl', 'ADNP+ctrl', 'ADRM1+ctrl', 'ADSL+ctrl', 'ALDOA+ctrl', 'ANKRD27+ctrl', 'ANKRD39+ctrl', 'ARHGAP6+ctrl', 'ASCC3+ctrl']
train count: 1230
['AACS+ctrl', 'AAMP+ctrl', 'AAR2+ctrl', 'AARS+ctrl', 'AARS2+ctrl', 'AATF+ctrl', 'ABCB7+ctrl', 'ABCE1+ctrl', 'ABCF1+ctrl', 'ABT1+ctrl']
val count: 137
['ACTL6A+ctrl', 'ADO+ctrl', 'AHCTF1+ctrl', 'ATR+ctrl', 'BTAF1+ctrl', 'CCNL1+ctrl', 'CCNQ+ctrl', 'CDIPT+ctrl', 'CDK9+ctrl', 'CENPC+ctrl']
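
The three lists printed above add up to 456 + 1230 + 137 = 1823 conditions. A minimal sketch of a sanity check that no condition is assigned to more than one split, using a toy dictionary with made-up gene names (`toy_splits` and `check_splits` are illustrative, not part of the scGPT files):

```python
# Toy stand-in for the unpickled `splits` dict (hypothetical gene names).
toy_splits = {
    "train": ["G1+ctrl", "G2+ctrl", "G3+ctrl"],
    "val": ["G4+ctrl"],
    "test": ["G5+ctrl", "G6+ctrl"],
}


def check_splits(splits):
    """Return the number of distinct conditions; raise if one is in two splits."""
    seen = {}
    for split_name, conditions in splits.items():
        for condition in conditions:
            if condition in seen:
                raise ValueError(
                    f"{condition} is in both {seen[condition]} and {split_name}"
                )
            seen[condition] = split_name
    return len(seen)


total = check_splits(toy_splits)
print(total)
```

Applied to the real `splits` dict, the same function should return the sum of the three list lengths if the split is clean.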


Let's create a column in `adata.obs` that holds the split of each cell:

In [12]:
gene_to_split = {}
for split_name, gene_list in splits.items():
    gene_to_split.update({gene.replace("+ctrl",""): split_name for gene in gene_list})

# Map the "gene" column in adata.obs to split labels
adata.obs["scgpt_split"] = adata.obs["gene"].map(gene_to_split)

#rename val to dev
adata.obs.loc[adata.obs["scgpt_split"] == "val", "scgpt_split"] = "dev"


In [13]:
adata.obs["scgpt_split"].value_counts(dropna=False)

scgpt_split
train    113743
test      42351
dev       12948
NaN        2500
Name: count, dtype: int64
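
The 2500 `NaN` entries match the number of non-targeting control cells, which are not listed in any split. A small sketch on toy data (hypothetical gene names) of the same mapping, with a check that the unlabeled rows are exactly the controls:

```python
import pandas as pd

# Toy stand-ins for `splits` and `adata.obs` (hypothetical gene names).
toy_splits = {"train": ["G1+ctrl"], "val": ["G2+ctrl"], "test": ["G3+ctrl"]}
toy_obs = pd.DataFrame({"gene": ["G1", "G2", "G3", "non-targeting"]})

# Strip the "+ctrl" suffix and rename "val" to "dev" in one pass.
gene_to_split = {
    condition.replace("+ctrl", ""): ("dev" if name == "val" else name)
    for name, conditions in toy_splits.items()
    for condition in conditions
}
toy_obs["scgpt_split"] = toy_obs["gene"].map(gene_to_split)

# Rows without a split label should be exactly the control cells.
nan_is_control = bool(
    (toy_obs["scgpt_split"].isna() == (toy_obs["gene"] == "non-targeting")).all()
)
print(toy_obs)
print(nan_is_control)
```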

In [14]:
adata.obs[adata.obs["scgpt_split"]=="train"]["gene"].value_counts()

gene
ZNHIT6    100
AACS      100
AAMP      100
ZNF787    100
ZNF706    100
         ... 
SARS2       0
SBNO1       0
ESF1        0
ETF1        0
POLR3C      0
Name: count, Length: 1823, dtype: int64

In [15]:
adata.obs[adata.obs["scgpt_split"]=="train"]["gene"].value_counts().mean()

np.float64(62.393307734503566)
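
Note that this mean is taken over all 1823 categories of the categorical `gene` column, including genes with zero train cells (113743 / 1823 ≈ 62.39). Averaging only over the 1229 genes that actually have train cells gives 113743 / 1229 ≈ 92.55. A toy sketch of the difference, with made-up data:

```python
import pandas as pd

# Toy obs table; "gene" is categorical, as in the AnnData object.
toy_obs = pd.DataFrame(
    {
        "gene": pd.Categorical(["A", "A", "B", "C"]),
        "split": ["train", "train", "train", "test"],
    }
)

train_counts = toy_obs[toy_obs["split"] == "train"]["gene"].value_counts()

# value_counts on a categorical keeps unused categories with count 0,
# so gene "C" pulls the mean down.
mean_all_categories = train_counts.mean()

# Restrict to genes that actually occur in the train split.
mean_present_only = train_counts[train_counts > 0].mean()

print(mean_all_categories, mean_present_only)
```

An alternative with the same effect is to call `.cat.remove_unused_categories()` on the filtered column before counting.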

In [16]:
# uncomment to save again

#adata.write("/dccstor/bmfm-targets/data/omics/transcriptome/scRNA/finetune/Perturbation/replogle/scGPT_k562_1900_100_re_ctrl_sample_perturb_processed_with_split.h5ad")

In [17]:
adata.var

Unnamed: 0_level_0,gene_name,chr,start,end,class,strand,length,in_matrix,mean,std,cv,fano,highly_variable,means,dispersions,dispersions_norm
gene_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1
ENSG00000237491,LINC01409,chr1,778747,810065,gene_version10,+,31318,True,0.116626,0.349971,3.000803,1.050194,True,0.113362,0.325672,-0.055866
ENSG00000187961,KLHL17,chr1,960584,965719,gene_version14,+,5135,True,0.105599,0.330678,3.131439,1.035497,False,0.092125,0.245210,-0.637191
ENSG00000188290,HES4,chr1,998962,1000172,gene_version10,-,1210,True,0.242700,0.550596,2.268630,1.249098,True,0.209183,0.462903,0.935613
ENSG00000187608,ISG15,chr1,1001138,1014540,gene_version10,+,13402,True,0.334181,0.675689,2.021921,1.366189,True,0.307255,0.716577,2.768372
ENSG00000078808,SDF4,chr1,1216908,1232067,gene_version17,-,15159,True,0.889459,1.054260,1.185283,1.249596,True,0.628492,0.370973,0.093641
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
ENSG00000198786,MT-ND5,chrM,12337,14148,gene_version2,+,1811,True,24.573463,12.764438,0.519440,6.630358,True,3.177331,1.517886,1.775934
ENSG00000198695,MT-ND6,chrM,14149,14673,gene_version2,-,524,True,6.160370,5.194642,0.843235,4.380306,True,1.858941,1.337521,2.124659
ENSG00000278704,BX004987.1,GL000009.2,56140,58376,gene_version1,-,2236,True,0.195008,0.455099,2.333745,1.062085,True,0.183181,0.353474,0.144999
ENSG00000278384,AL354822.1,GL000218.1,51867,54893,gene_version1,-,3026,True,0.190869,0.450850,2.362089,1.064948,True,0.184353,0.397784,0.465131


In [18]:
adata.X.toarray()[:10,:10]

array([[0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 1.0932776 , 0.        , 1.0932776 ],
       [0.        , 0.        , 0.        , 0.6340871 , 0.6340871 ,
        0.        , 1.2963425 , 0.        , 0.        , 1.5131915 ],
       [0.        , 0.        , 0.        , 0.        , 1.5427817 ,
        0.        , 0.5513052 , 0.        , 0.9046365 , 0.9046365 ],
       [0.        , 0.        , 0.5408124 , 0.8898659 , 0.        ,
        0.        , 0.5408124 , 0.        , 0.5408124 , 0.8898659 ],
       [0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 1.178307  ],
       [0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.8864786 , 0.53841245, 0.        , 0.53841245],
       [0.        , 0.        , 0.        , 0.75866437, 1.1850482 ,
        0.        , 0.        , 0.75866437, 0.        , 0.75866437],
       [0.        , 0.        , 0.       

In [19]:
adata

AnnData object with n_obs × n_vars = 171542 × 6017
    obs: 'gem_group', 'gene', 'gene_id', 'transcript', 'gene_transcript', 'sgID_AB', 'mitopercent', 'UMI_count', 'z_gemgroup_UMI', 'core_scale_factor', 'core_adjusted_UMI_count', 'batch', 'condition', 'cell_type', 'dose_val', 'control', 'condition_name', 'scgpt_split'
    var: 'gene_name', 'chr', 'start', 'end', 'class', 'strand', 'length', 'in_matrix', 'mean', 'std', 'cv', 'fano', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
    uns: 'hvg', 'log1p', 'non_dropout_gene_idx', 'non_zeros_gene_idx', 'rank_genes_groups_cov_all', 'top_non_dropout_de_20', 'top_non_zero_de_20'

### Save a small file for tests

In [20]:
limit_genes=3
split_short_list = {}
for split_name, gene_list in splits.items():
    split_short_list.update({gene.replace("+ctrl",""): split_name for gene in gene_list[:limit_genes]})

In [21]:
split_short_list

{'ACTB': 'test',
 'ADAM10': 'test',
 'ADNP': 'test',
 'AACS': 'train',
 'AAMP': 'train',
 'AAR2': 'train',
 'ACTL6A': 'val',
 'ADO': 'val',
 'AHCTF1': 'val'}

In [22]:
adata_small = adata[adata.obs["gene"].isin(set(split_short_list.keys()|{"non-targeting"}))]

In [23]:
adata_small

View of AnnData object with n_obs × n_vars = 3339 × 6017
    obs: 'gem_group', 'gene', 'gene_id', 'transcript', 'gene_transcript', 'sgID_AB', 'mitopercent', 'UMI_count', 'z_gemgroup_UMI', 'core_scale_factor', 'core_adjusted_UMI_count', 'batch', 'condition', 'cell_type', 'dose_val', 'control', 'condition_name', 'scgpt_split'
    var: 'gene_name', 'chr', 'start', 'end', 'class', 'strand', 'length', 'in_matrix', 'mean', 'std', 'cv', 'fano', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
    uns: 'hvg', 'log1p', 'non_dropout_gene_idx', 'non_zeros_gene_idx', 'rank_genes_groups_cov_all', 'top_non_dropout_de_20', 'top_non_zero_de_20'

In [24]:
adata_small.obs["scgpt_split"].value_counts()

scgpt_split
train    293
test     284
dev      262
Name: count, dtype: int64

In [25]:
adata_small.obs["gene"].value_counts()

gene
non-targeting    2500
AACS              100
AAMP              100
ACTB              100
ADNP              100
ADO               100
AHCTF1            100
AAR2               93
ADAM10             84
ACTL6A             62
Name: count, dtype: int64

In [26]:
# uncomment to save again
#adata_small.write("/dccstor/bmfm-targets/data/omics/transcriptome/scRNA/finetune/Perturbation/replogle/scGPT_k562_1900_100_re_ctrl_sample_perturb_processed_with_split_small.h5ad")

In [27]:
adata_small.X

<Compressed Sparse Row sparse matrix of dtype 'float32'
	with 7650274 stored elements and shape (3339, 6017)>

## Create raw counts file with same split
After experimenting, we decided to use raw counts.  
We merge with the original raw h5ad to get raw counts instead of the scGPT-normalized values, while keeping the split.
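
The scGPT-processed matrix printed earlier holds non-integer values (for example 1.0932776), which is what one would expect after normalization. A minimal sketch of a helper that tells such values apart from raw counts; `looks_like_raw_counts` is a hypothetical name, and for a sparse matrix one would pass its `.data` array:

```python
import numpy as np


def looks_like_raw_counts(values, tol=1e-6):
    """True if all values are non-negative and (numerically) integers."""
    values = np.asarray(values, dtype=np.float64)
    is_non_negative = np.all(values >= 0)
    is_integer = np.all(np.abs(values - np.round(values)) < tol)
    return bool(is_non_negative and is_integer)


# Values copied from the adata.X preview above (normalized).
normalized_values = np.array([0.0, 1.0932776, 0.6340871, 1.2963425])
# Hypothetical raw UMI counts for comparison.
raw_like_values = np.array([0.0, 3.0, 1.0, 12.0])

print(looks_like_raw_counts(normalized_values))
print(looks_like_raw_counts(raw_like_values))
```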

In [28]:
adata.obs_names 

Index(['CATCCCACACAGTGAG-99-0', 'TCCACCACAGATGCGA-127-0',
       'CCTTTGGTCTGTCGTC-250-0', 'CATCCCAGTGAGCGAT-24-0',
       'GTCTACCGTAGCACAG-186-0', 'GCCATGGTCCGTGGCA-88-0',
       'ACAGGGATCACGAACT-101-0', 'TGCACGGCAAAGGGCT-181-0',
       'GTTACCCTCGTGTCAA-104-0', 'GAAGCGATCCTTTGAT-148-0',
       ...
       'GGGCGTTAGATGGTCG-98-1969', 'CTTCTAACAGAGTCAG-119-1969',
       'CACGTTCCACGGTGTC-103-1969', 'GTCAGCGTCCTTGGAA-149-1969',
       'GAGCTGCCACGACTAT-100-1969', 'TGTCCTGCACACCGCA-168-1969',
       'TCCCAGTCAATAGTCC-56-1969', 'CTACCTGCATAACGGG-75-1969',
       'TCCGGGAAGTAGCCAG-96-1969', 'GCCAGCACACTAGGCC-84-1969'],
      dtype='object', name='cell_barcode', length=171542)

In [29]:
raw_counts_filename = "K562_gwps_raw_singlecell_01.h5ad"
raw_counts_subset_filename = "K562_gwps_raw_singlecell_01.subset.h5ad"

# backed='r' keeps the large raw matrix on disk instead of loading it into memory
adata_raw_counts = sc.read_h5ad(Path(dataset_folder).parent / raw_counts_filename, backed='r')

In [30]:
adata_raw_counts.obs_names

Index(['AAACCCAAGAAACCAT-157', 'AAACCCAAGAAACCAT-207', 'AAACCCAAGAAACCAT-29',
       'AAACCCAAGAAAGCGA-149', 'AAACCCAAGAAATCCA-172', 'AAACCCAAGAACAAGG-215',
       'AAACCCAAGAACAGGA-209', 'AAACCCAAGAACAGGA-54', 'AAACCCAAGAACCCGA-228',
       'AAACCCAAGAACCCGA-48',
       ...
       'TTTGTTGTCTTGGCTC-166', 'TTTGTTGTCTTGGCTC-86', 'TTTGTTGTCTTGTGCC-179',
       'TTTGTTGTCTTGTTAC-124', 'TTTGTTGTCTTGTTAC-246', 'TTTGTTGTCTTTCCAA-201',
       'TTTGTTGTCTTTCGAT-79', 'TTTGTTGTCTTTCTAG-218', 'TTTGTTGTCTTTGCGC-169',
       'TTTGTTGTCTTTGCGC-213'],
      dtype='object', name='cell_barcode', length=1989578)

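The cell names in the processed file carry an extra trailing suffix (e.g. `-0`, `-1969`) that is absent from the raw file, so we need to trim `adata.obs_names` before matching: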

In [31]:
cell_names = [s.rsplit("-", 1)[0] for s in adata.obs_names]
cell_names[:10]

['CATCCCACACAGTGAG-99',
 'TCCACCACAGATGCGA-127',
 'CCTTTGGTCTGTCGTC-250',
 'CATCCCAGTGAGCGAT-24',
 'GTCTACCGTAGCACAG-186',
 'GCCATGGTCCGTGGCA-88',
 'ACAGGGATCACGAACT-101',
 'TGCACGGCAAAGGGCT-181',
 'GTTACCCTCGTGTCAA-104',
 'GAAGCGATCCTTTGAT-148']

In [32]:
# check whether any of the first 100 raw cell names appear in the trimmed names
[cell for cell in adata_raw_counts.obs_names[:100] if cell in cell_names]


['AAACCCAAGACAACAT-110',
 'AAACCCAAGACATCCT-144',
 'AAACCCAAGAGCATAT-255',
 'AAACCCAAGAGCCGAT-22',
 'AAACCCAAGAGGTCAC-12']

In [33]:
cell_names_set = set(cell_names)  
mask = adata_raw_counts.obs_names.isin(cell_names_set) 
keep_indices = mask.nonzero()[0]  
len(keep_indices)

171542

In [34]:
len(set(cell_names))

171542
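
Both counts equal 171542, so the trimmed names are unique and every one of them is found in the raw file. The trimming and matching steps can be sketched on toy barcodes as follows (the barcodes are made up; only the naming pattern follows the data above):

```python
import pandas as pd

# Toy cell names: the processed file has an extra trailing "-<batch>" suffix.
processed_names = pd.Index(["AAAC-99-0", "TTTG-127-0", "CCCA-84-1969"])
raw_names = pd.Index(["AAAC-99", "GGGT-5", "TTTG-127", "CCCA-84"])

# Drop only the last "-<suffix>" part of each processed name.
trimmed = [name.rsplit("-", 1)[0] for name in processed_names]

# Trimming must not create duplicates, otherwise the match is ambiguous.
trimmed_is_unique = len(set(trimmed)) == len(trimmed)

# Boolean mask over the raw names, then positional indices of the matches.
mask = raw_names.isin(set(trimmed))
keep_indices = mask.nonzero()[0]

print(trimmed_is_unique, keep_indices)
```

The selected rows keep the order of the raw file, not the order of the processed file. That is fine here because the split is re-derived from the `gene` column rather than copied row by row.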

In [35]:
# works but takes 18 min
#adata_raw_counts[keep_indices, :].copy(Path(dataset_folder).parent / raw_counts_subset_filename)

In [36]:
# uncomment to save again
# adata_raw_counts_subset.write_h5ad(Path(dataset_folder).parent / raw_counts_subset_filename, compression="gzip")


In [37]:
subset_h5ad = sc.read_h5ad(Path(dataset_folder).parent / raw_counts_subset_filename)

In [38]:
subset_h5ad

AnnData object with n_obs × n_vars = 171542 × 8248
    obs: 'gem_group', 'gene', 'gene_id', 'transcript', 'gene_transcript', 'sgID_AB', 'mitopercent', 'UMI_count', 'z_gemgroup_UMI', 'core_scale_factor', 'core_adjusted_UMI_count'
    var: 'gene_name', 'chr', 'start', 'end', 'class', 'strand', 'length', 'in_matrix', 'mean', 'std', 'cv', 'fano'

In [39]:
subset_h5ad.obs.columns

Index(['gem_group', 'gene', 'gene_id', 'transcript', 'gene_transcript',
       'sgID_AB', 'mitopercent', 'UMI_count', 'z_gemgroup_UMI',
       'core_scale_factor', 'core_adjusted_UMI_count'],
      dtype='object')

In [40]:
subset_h5ad.obs["gene"].value_counts()

gene
non-targeting    2500
NFRKB             100
NCBP2             100
NBAS              100
NARS2             100
                 ... 
RPS27A             26
PDCD11             26
HAUS1              26
PRPF4              25
SEM1               21
Name: count, Length: 1823, dtype: int64

In [41]:
# copy split again
# Map the "gene" column in adata.obs to split labels
subset_h5ad.obs["scgpt_split"] = subset_h5ad.obs["gene"].map(gene_to_split)

#rename val to dev
subset_h5ad.obs.loc[subset_h5ad.obs["scgpt_split"] == "val", "scgpt_split"] = "dev"

In [42]:
subset_h5ad.obs["scgpt_split"].value_counts(dropna=False)

scgpt_split
train    113743
test      42351
dev       12948
NaN        2500
Name: count, dtype: int64

In [43]:
# uncomment to save again
# save final h5ad
#subset_h5ad.write_h5ad(Path(dataset_folder).parent / "k562_gwps_raw_singlecell_01_processed_with_scgpt_split.h5ad")

In [44]:
subset_h5ad_small = subset_h5ad[subset_h5ad.obs["gene"].isin(set(split_short_list.keys()|{"non-targeting"}))]

In [45]:
# uncomment to save again
#subset_h5ad_small.write_h5ad(Path(dataset_folder).parent / "k562_gwps_raw_singlecell_01_processed_with_scgpt_split_small.h5ad")

#### Save gene names to a CSV file; this is useful for external evaluations like cell-eval (https://pypi.org/project/cell-eval/)

In [67]:
gene_names_file = "gene_names.csv"
genes = subset_h5ad.obs["gene"].unique()
gene_names = [gene for gene in genes if gene != 'non-targeting']

In [None]:
genes_df = pd.DataFrame(gene_names, columns=["gene"])
genes_df

Unnamed: 0,gene
0,INO80B
1,MRPL52
2,LTV1
3,SNRPC
4,FUNDC2
...,...
1817,CPSF2
1818,GTF2F2
1819,NFYC
1820,PDCD11


In [None]:
# Save to CSV
genes_df.to_csv(Path(dataset_folder).parent / gene_names_file, index=False, header=False)
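
Because the file is written with `index=False, header=False`, it contains one gene name per line and nothing else. A sketch of the round trip on toy data, using an in-memory buffer instead of a real file (the two gene names are taken from the table above; the rest is illustrative):

```python
import io

import pandas as pd

# Toy "gene" column with a duplicate and the control label.
genes = pd.Series(["INO80B", "non-targeting", "MRPL52", "INO80B"])
gene_names = [gene for gene in genes.unique() if gene != "non-targeting"]

# Write one gene per line, without index or header.
buffer = io.StringIO()
pd.DataFrame(gene_names, columns=["gene"]).to_csv(buffer, index=False, header=False)

# Reading back requires header=None, otherwise the first gene becomes the header.
buffer.seek(0)
read_back = pd.read_csv(buffer, header=None, names=["gene"])["gene"].tolist()

print(read_back)
```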

In [None]:
### Save the test set to a separate file to be used as ground truth in comparisons
# NOTE: the output file names below are left empty; set them before running this cell
gt_file_name = ""
small_gt_file_name = ""
gt_subset = subset_h5ad[subset_h5ad.obs["scgpt_split"] == "test"]
gt_subset.write_h5ad(filename=Path(dataset_folder).parent / gt_file_name)

gt_subset_small = subset_h5ad_small[subset_h5ad_small.obs["scgpt_split"] == "test"]
gt_subset_small.write_h5ad(filename=Path(dataset_folder).parent / small_gt_file_name)

# Validate final files
Run only this part to read back and check the final files created above.

In [47]:
from pathlib import Path

import scanpy as sc

In [48]:
dataset_folder = "/dccstor/bmfm-targets/data/omics/transcriptome/scRNA/finetune/Perturbation/replogle/k562_1900_100_re_ctrl_sample"

In [49]:
adata_small = sc.read_h5ad(Path(dataset_folder).parent / "k562_gwps_raw_singlecell_01_processed_with_scgpt_split_small.h5ad")
adata_small

AnnData object with n_obs × n_vars = 3339 × 8248
    obs: 'gem_group', 'gene', 'gene_id', 'transcript', 'gene_transcript', 'sgID_AB', 'mitopercent', 'UMI_count', 'z_gemgroup_UMI', 'core_scale_factor', 'core_adjusted_UMI_count', 'scgpt_split'
    var: 'gene_name', 'chr', 'start', 'end', 'class', 'strand', 'length', 'in_matrix', 'mean', 'std', 'cv', 'fano'

In [50]:
adata_small.obs["gene"].value_counts(dropna=False)

gene
non-targeting    2500
AACS              100
AAMP              100
ACTB              100
ADNP              100
ADO               100
AHCTF1            100
AAR2               93
ADAM10             84
ACTL6A             62
Name: count, dtype: int64

In [51]:
adata_small.obs["scgpt_split"].value_counts(dropna=False)

scgpt_split
NaN      2500
train     293
test      284
dev       262
Name: count, dtype: int64

In [52]:
adata = sc.read_h5ad(Path(dataset_folder).parent / "k562_gwps_raw_singlecell_01_processed_with_scgpt_split.h5ad")
adata

AnnData object with n_obs × n_vars = 171542 × 8248
    obs: 'gem_group', 'gene', 'gene_id', 'transcript', 'gene_transcript', 'sgID_AB', 'mitopercent', 'UMI_count', 'z_gemgroup_UMI', 'core_scale_factor', 'core_adjusted_UMI_count', 'scgpt_split'
    var: 'gene_name', 'chr', 'start', 'end', 'class', 'strand', 'length', 'in_matrix', 'mean', 'std', 'cv', 'fano'

In [53]:
adata.obs["gene"].value_counts(dropna=False)

gene
non-targeting    2500
NFRKB             100
NCBP2             100
NBAS              100
NARS2             100
                 ... 
RPS27A             26
PDCD11             26
HAUS1              26
PRPF4              25
SEM1               21
Name: count, Length: 1823, dtype: int64

In [54]:
adata.obs["scgpt_split"].value_counts(dropna=False)

scgpt_split
train    113743
test      42351
dev       12948
NaN        2500
Name: count, dtype: int64

In [55]:
import pandas as pd

counts = pd.crosstab(
    adata.obs["gene"].astype(str),
    adata.obs["scgpt_split"].astype(str),
)

In [56]:
print(counts["train"][counts["train"]>0])

gene
AACS      100
AAMP      100
AAR2       93
AARS       46
AARS2     100
         ... 
ZNF574    100
ZNF695    100
ZNF706    100
ZNF787    100
ZNHIT6    100
Name: train, Length: 1229, dtype: int64
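
The splits file lists 1230 train conditions, while only 1229 genes have train cells here, so one listed gene appears to have no cells in this file. A sketch, on toy data with hypothetical gene names, of how such entries could be located:

```python
import pandas as pd

# Toy stand-ins for adata.obs and the gene_to_split mapping.
toy_obs = pd.DataFrame({"gene": ["G1", "G1", "G2", "non-targeting"]})
toy_gene_to_split = {"G1": "train", "G2": "test", "G3": "train"}

# Genes that are assigned to a split but never occur in the obs table.
genes_with_cells = set(toy_obs["gene"].unique())
missing = sorted(
    gene for gene in toy_gene_to_split if gene not in genes_with_cells
)

print(missing)
```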


In [57]:
print(counts["dev"][counts["dev"]>0])

gene
ACTL6A     62
ADO       100
AHCTF1    100
ATR        78
BTAF1     100
         ... 
WAC       100
WDR26     100
WDR46     100
YBX3      100
ZNF622     92
Name: dev, Length: 137, dtype: int64


In [58]:
print(counts["test"][counts["test"]>0])

gene
ACTB      100
ADAM10     84
ADNP      100
ADRM1     100
ADSL      100
         ... 
ZNF84     100
ZNHIT1    100
ZNHIT3    100
ZNRD1      81
ZW10      100
Name: test, Length: 456, dtype: int64
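
The per-split listings above suggest that each gene lives in exactly one split (1229 + 137 + 456 genes). A minimal sketch of checking this directly from a gene-by-split count table, on toy data with hypothetical gene names:

```python
import numpy as np
import pandas as pd

# Toy obs table; the control cell has no split label.
toy_obs = pd.DataFrame(
    {
        "gene": ["G1", "G1", "G2", "G3", "non-targeting"],
        "scgpt_split": ["train", "train", "dev", "test", np.nan],
    }
)

# Only consider cells that carry a split label.
labelled = toy_obs.dropna(subset=["scgpt_split"])
toy_counts = pd.crosstab(labelled["gene"], labelled["scgpt_split"])

# For every gene, count in how many splits it has at least one cell.
splits_per_gene = (toy_counts > 0).sum(axis=1)
one_split_each = bool((splits_per_gene == 1).all())

print(toy_counts)
print(one_split_each)
```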
