This script finds the genes with multiple accession entries in the L5.all.agg.loom file. Created Sep 9th, 2022 by Linhe Xu.

In [1]:
import numpy as np
import pandas as pd
import loompy

In [2]:
agg = loompy.connect('../data/single-cell/zeisel_2018/l5_all.agg.loom')

### Some gene names have multiple accessions (rows)

In [3]:
agg.shape

(27998, 265)

In [4]:
len(np.unique(agg.ra['Accession'])) # number of unique accessions

27998

In [5]:
len(np.unique(agg.ra['Gene'])) # number of unique gene names

27933

This means there are 27998 - 27933 = 65 more rows in the agg.loom file than there are unique gene names, i.e. 65 rows whose value in the 'Gene' row attribute repeats a name already used by another row.
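The difference between "rows minus unique names" and "all rows that share a name" can be checked on a small example. The following is only a sketch on toy data: the `genes` series and its values are made up for illustration and stand in for `agg.ra['Gene']`.

```python
import pandas as pd

# Toy stand-in for agg.ra['Gene']; these names are illustrative only.
genes = pd.Series(["A", "B", "B", "C", "C", "C", "D"])

n_rows = len(genes)
n_unique = genes.nunique()

# Rows that repeat a name already seen earlier (first occurrences are not counted).
n_surplus = int(genes.duplicated(keep="first").sum())

# Every row whose name occurs more than once (first occurrences are counted too).
n_involved = int(genes.duplicated(keep=False).sum())

print(n_rows - n_unique, n_surplus, n_involved)
```

Here `n_rows - n_unique` equals `n_surplus`, while `n_involved` is larger because it also includes the first occurrence of each repeated name.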

In [6]:
all_gene_list = agg.ra['Gene'].tolist()

In [7]:
uniq_gene_list = np.unique(agg.ra['Gene']).tolist()

In [8]:
dup_set=set()
for gene in all_gene_list:
    if all_gene_list.count(gene) > 1:
        dup_set.add(gene)

In [9]:
len(dup_set) # the 65 surplus rows come from 56 distinct gene names

56

In [10]:
dup_list=[]
for gene in all_gene_list:
    if all_gene_list.count(gene) > 1:
        dup_list.append(gene)

In [11]:
len(dup_list)

121

In [12]:
dup_dict={}
for gene in dup_set:
    count=all_gene_list.count(gene)
    dup_dict.update({gene:count})

In [13]:
sum(dup_dict.values())

121

In [14]:
max(dup_dict.values())

7

This means 56 gene names have more than one entry, and together they account for 121 rows (56 first occurrences plus the 65 surplus rows). Each of these gene names corresponds to between 2 and 7 rows.
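The cells above call `list.count` inside a loop, which rescans the whole list for every gene. As a possible alternative, `collections.Counter` tallies all names in a single pass. This is a minimal sketch on toy data: `all_gene_list_demo` and `dup_dict_demo` are hypothetical names, not variables from this notebook.

```python
from collections import Counter

# Toy stand-in for all_gene_list; these names are illustrative only.
all_gene_list_demo = ["A", "B", "B", "C", "C", "C", "D"]

# Count every gene name once, in a single pass over the list.
counts = Counter(all_gene_list_demo)

# Keep only names that occur more than once (the analogue of dup_dict).
dup_dict_demo = {gene: n for gene, n in counts.items() if n > 1}

print(len(dup_dict_demo))           # number of duplicated gene names
print(sum(dup_dict_demo.values()))  # number of rows carrying those names
print(max(dup_dict_demo.values()))  # largest number of rows for one name
```

In the notebook itself, the same pattern could be applied to `all_gene_list` to obtain the three summary numbers in one step.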

In [15]:
# Which gene has 7 entries? 
for gene, count in dup_dict.items():
    if count==7:
        print(gene)

Flg


In [16]:
agg.ra.Accession[agg.ra.Gene=='Flg']

array(['ENSMUSG00000102829', 'ENSMUSG00000103750', 'ENSMUSG00000103777',
       'ENSMUSG00000103335', 'ENSMUSG00000102439', 'ENSMUSG00000102395',
       'ENSMUSG00000102724'], dtype=object)

Only one of these accessions (ENSMUSG00000102439) is the one Ensembl assigns to the gene Flg in the mouse reference genome GRCm39 (Ensembl 107, accessed Sep 9th, 2022).
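To inspect the accessions of every duplicated gene name at once, rather than one gene at a time, the two row attributes could be put in a table and grouped by name. This is a sketch under assumptions: the `rows` frame is toy data that reuses three of the Flg accessions listed above, and `GeneX` / `ACC_PLACEHOLDER` are hypothetical placeholders. In the notebook, the frame would be built from `agg.ra['Gene']` and `agg.ra['Accession']`.

```python
import pandas as pd

# Toy table; GeneX and ACC_PLACEHOLDER are hypothetical placeholders.
rows = pd.DataFrame({
    "Gene": ["Flg", "Flg", "Flg", "GeneX"],
    "Accession": [
        "ENSMUSG00000102829",
        "ENSMUSG00000102439",
        "ENSMUSG00000102724",
        "ACC_PLACEHOLDER",
    ],
})

# Keep only rows whose gene name occurs more than once.
dup_rows = rows[rows["Gene"].duplicated(keep=False)]

# Map each duplicated gene name to the list of its accessions.
acc_by_gene = dup_rows.groupby("Gene")["Accession"].apply(list).to_dict()

for gene, accessions in acc_by_gene.items():
    print(gene, accessions)
```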

### Generate a text file with duplicated genes

This list of gene names will be used to look up the "true" Ensembl IDs in Ensembl 107.

In [19]:
with open('../results/single-cell/zeisel_2018/duplicated_gene_names.txt', 'w') as f:
    for item in sorted(dup_set):  # sorted so the output order is reproducible
        f.write(f"{item}\n")