__Author:__ Bram Van de Sande

__Date:__ 5 MAR 2018

__Outline:__ Purpose is to create a collection of single-cell transcriptomics datasets to apply pySCENIC to with the attempt to create meta-regulomes, i.e. precalculated regulomes from a large databases.

The idea would be to use these precalculated meta-regulomes to map new sc-experiment data to meta-regulome feature space and use this new feature space to find similar experiments in the database.

Criteria for selecting datasets:
- Data should be available as an expression matrix in TPM, FPKM and/or a log transformation (e.g. log2(TPM+1)). We will not map reads from FASTQ files ourselves.
- Data should be related to human disease and/or clearly defined cell types.
- Preferrably hg19. Which is the case for most datasets. 
- Check gene annotations and version used.

__TODO:__
- Check how many cell lines are true tumor cell lines (versus host cells).

In [1]:
import pandas as pd
import numpy as np
import os
import math
from pyscenic.rnkdb import FeatherRankingDatabase as RankingDatabase
from functools import partial
import glob
import mygene

In [2]:
DATABASE_FOLDER = "/Users/bramvandesande/Projects/lcb/databases/"
DATABASE_FNAME_500bp = os.path.join(DATABASE_FOLDER, "hg19-500bp-upstream-10species.mc9nr.feather")
RESOURCES_FOLDER = '/Users/bramvandesande/Projects/lcb/resources/sc-experiments/'
RESULTS_FOLDER = '/Users/bramvandesande/Projects/lcb/resources/sc-experiments/clean/'

## List of human transcription factors

Taken from http://www.cell.com/cell/fulltext/S0092-8674(18)30106-5 . See Supplementary Files.


In [3]:
! ls ~/Projects/lcb/resources

GSE60361_C1-3005-Expression.txt      motifs-v8-nr.hgnc-m0.001-o0.0.tbl
c6.all.v6.1.symbols.gmt.txt          motifs-v8-nr.mgi-m0.000-o0.0.tbl
ensg2hgnc.tsv                        motifs-v8-nr.mgi-m0.001-o0.0.tbl
mm_tfs.txt                           motifs-v8-nr.zfin-m0.000-o0.0.tbl
modules_zeisel_2015.yaml             motifs-v8-nr.zfin-m0.001-o0.0.tbl
motifs-v8-nr.fbgn-m0.000-o0.0.tbl    motifs-v9-nr.hgnc-m0.001-o0.0.tbl
motifs-v8-nr.fbgn-m0.001-o0.0.tbl    motifs-v9-nr.mgi-m0.001-o0.0.tbl
motifs-v8-nr.flybase-m0.000-o0.0.tbl regulomes_zeisel_2015.csv
motifs-v8-nr.flybase-m0.001-o0.0.tbl [1m[36mrscenic[m[m
motifs-v8-nr.hgnc-m0.000-o0.0.tbl    [1m[36msc-experiments[m[m


In [4]:
df = pd.read_excel(os.path.join(RESOURCES_FOLDER, 'human_tfs.xlsx'), sheet_name='Table S1. Related to Figure 1B',
                  header=[0], skiprows=[0])

In [5]:
df.columns

Index(['ID', 'Name', 'DBD', 'Unnamed: 3', 'TF assessment', 'Binding mode',
       'Motif status', 'Notes', 'Comments', 'Committee notes', 'MTW Notes',
       'TRH Notes', 'SL notes', 'AJ notes', 'Disagree on Assessment ',
       'Disagree on Binding', 'Author1', 'Assesment1', 'Binding1', 'Comment1',
       'Notes1', 'Author2', 'Assesment2', 'Binding2', 'Comment2', 'Notes2',
       'Vaquerizas 2009 TF classification', 'CisBP considers it as a TF?',
       'TFclass considers it as a TF? ', 'TF-CAT classification', 'Is a GO TF',
       'PDB'],
      dtype='object')

In [6]:
human_tfs = list(df[df['Unnamed: 3'] == 'Yes'].Name.unique())

In [7]:
len(human_tfs)

1639

In [8]:
with open(os.path.join(RESULTS_FOLDER, 'tfs.txt'), 'w') as file:
    file.write('\n'.join(human_tfs) + '\n')

## Source I: SCPortalen (http://single-cell.clst.riken.jp/)

Total list of FPKM expression matrics can be found at http://single-cell.clst.riken.jp/non_riken_data/bd_fpkm_file_list.php?pagesize=-1 .

Might need to convert FPKM to TPM units (https://haroldpimentel.wordpress.com/2014/05/08/what-the-fpkm-a-review-rna-seq-expression-units/). Also, need to convert ENSEMBL gene identifiers (ENSG) to HGNC nomenclature. Cluster assignment might not be available.

| ID | Disease/Cell type | Download URL |
|----|--------------------------|--------------|
| DRA001287 | Lung adenocarcinoma | http://single-cell.clst.riken.jp/fpkm_tables/DRA001287_fpkm.table.gz |
| DRA002730 | Lung adenocarcinoma | http://single-cell.clst.riken.jp/fpkm_tables/DRA002730_fpkm.table.gz |
| E-MTAB-5061 | Human pancreas (healthy and diabetes type II) | http://single-cell.clst.riken.jp/fpkm_tables/E-MTAB-5061_fpkm.table.gz | 
| GSE57872 | Glioblastoma | http://single-cell.clst.riken.jp/fpkm_tables/GSE57872_fpkm.table.gz |
| GSE73121 | Renal Cell Carcinoma | http://single-cell.clst.riken.jp/fpkm_tables/GSE73121_fpkm.table.gz | 

## Source II: Single Cell Portal (https://portals.broadinstitute.org/single_cell)

Gene nomenclature is already HGNC.

| Disease/Cell type | Number of cells | Number of clusters | Publication | Unit | URL | 
|---------|-----------------|--------------------|-------------|------|-----|
| Astrocytoma | 6341 | 4 clusters | doi: 10.1126/science.aai8478 | log2(TPM/10+1), for all genes and cells | https://portals.broadinstitute.org/single_cell/study/single-cell-rna-seq-analysis-of-astrocytoma#study-summary |
| Blood dendritic cells and monocytes | 1078 | 10 clusters | doi: 10.1126/science.aah4573 | Log transformed and filtered TPMs | https://portals.broadinstitute.org/single_cell/study/atlas-of-human-blood-dendritic-cells-and-monocytes#study-summary |
| Melanoma | 4646 | 7 clusters | doi: 10.1126/science.aad0501 | log2(TPM/10+1) of all genes in all cells | https://portals.broadinstitute.org/single_cell/study/melanoma-intra-tumor-heterogeneity |
| Glioblastoma | 430 | ? | doi: 10.1126/science.1254257 | log2(TPM+1) expression matrix for all cells profiled including 5948 genes with highest expression | https://portals.broadinstitute.org/single_cell/study/glioblastoma-intra-tumor-heterogeneity  |
| Oligodendroglioma | 4347 | ? | doi:10.1038/nature20123 | log2(TPM/10+1) of all genes in all cells |  https://portals.broadinstitute.org/single_cell/study/oligodendroglioma-intra-tumor-heterogeneity#study-summary | 

## Source III: GEO (https://www.ncbi.nlm.nih.gov/geo)

Searched Geo Datasets (https://www.ncbi.nlm.nih.gov/gds/) using the following query:

```
"homo sapiens"[Organism] AND "expression profiling by high throughput sequencing"[DataSet Type] AND "single cell" AND "Neoplasms"[Mesh] 
```

Selected for the large sample dataset + included dataset for prevalent tumor types (https://www.cancer.gov/types/common-cancers). Bladder and prostate cancer is missing.

| Downloaded? | Source | Cancer subtype | # cells | Dataset ID | Publication ID | Unit | Method | URL |
|------|--------|--------|---------|------------|----------------|------|--------|-----|
| X (23686x6341) | BROAD (GEO) | IDH-mutant astrocytoma | 6341 | GSE89567 | DOI: 10.1126/science.aai8478 | Expression levels were quantified as Ei,j=log2(TPMi,j/10+1), where TPMi,j refers to transcriptper-million for gene i in sample j, as calculated by RSEM.  Tab-delimited text file containing the normalized expression levels (E) for 23,686 analyzed genes (rows, first, column indicates Gene Symbols) across 6341 astrocytoma cells (columns, sample names indicated at the top row). | SMART-seq2 | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE89567 (https://portals.broadinstitute.org/single_cell/study/single-cell-rna-seq-analysis-of-astrocytoma#study-summary) |
| X | GEO | Head and neck cancer | 5902 | GSE103322 | PMID: 29198524 | Tab-delimited text file containing TPM values ( transcript-per-million reads) for 23,686 analyzed genes (rows), across 5902 cells (columns, sample names indicated at the top row). | SMART-Seq2 | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE103322 |
| X | BROAD (GEO) | Oligodendroglioma | 4347 | GSE70630 | DOI: 10.1038/nature20123 | log2(TPM/10+1) of all genes in all cells | SMART-Seq2 | https://portals.broadinstitute.org/single_cell/study/oligodendroglioma-intra-tumor-heterogeneity#study-summary (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE70630) | 
| X (23686x4646)| BROAD (GEO) | Melanoma | 4646 | GSE72056 | DOI: 10.1126/science.aad0501 | log2(TPM/10+1) of all genes in all cells | SMART-Seq2 | https://portals.broadinstitute.org/single_cell/study/melanoma-intra-tumor-heterogeneity (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE72056)  |
| X | BROAD (GEO +SCPortalen) | Glioblastoma | 430 | GSE57872 | DOI: 10.1126/science.1254257 | log2(TPM+1) expression matrix for all cells profiled including 5948 genes with highest expression. | SMART-seq | https://portals.broadinstitute.org/single_cell/study/glioblastoma-intra-tumor-heterogeneity (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE57872) |
| - (Raw reads) | GEO | Glioblastoma | 3589 | GSE84465 | PMID: 29091775 | Tab-delimited text files with raw read values for each sample | SMART-Seq2 | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE84465 |
| X | GEO | Chronic Myeloid Leukemia | 2289 | GSE76312 | PMID: 28504724 | RPKM | Fluidigm | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE76312 |
| X | GEO | Colorectal tumors | 1591 | GSE81861 | DOI: 10.1038/ng.3818 | FPKM | Fluidigm | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE81861 | 
| X | GEO | Lung adenocarcinoma | 126  | GSE69405 | DOI: 10.1186/s13059-015-0692-3 | TPM | SMART-Seq | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE69405 |
| X | GEO | Breastcancer | 549 | GSE75688 | PMID: 28474673 | A processed file is tab-delimited and contains the raw TPM values for 549 single cells and 14 bulk tumors. | integrated fluidic circuit (IFC) chip | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE75688 |
| - (HGNC gene identifiers) | GEO (SCPortalen) | RCC | 116 | GSE73121 | PMID: 27139883 | Each row of the tab-delimited text file includes ENSEMBL gene ID,ENSEMBL transcript ID, transcript length, effective_length, expected_count, TPM, FPKM for samples. | Fluidigm + SMARTer | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE73121 |

__Disclaimer:__ The current compendium suffers from the limited number of cells available for some very prevalent tumor types (e.g. lung, breast and colorectal).
The reason for this is that single cell has been pioneered in brain research. Hence, the dominance of the brain tumors in this compendium.

__Solution:__ For lung we can use the in-house dataset from Bernard Thienpont with over 55K cells taken from multiple patients.

## Preparation of datasets

In [9]:
!ls {RESOURCES_FOLDER}

GSE103322.mtx+clusters.txt
GSE57872.mtx.txt
GSE69405.mtx.txt
GSE70630.clusters.txt
GSE70630.mtx.txt
GSE72056.clusters.txt
GSE72056.mtx.txt
[1m[36mGSE73121[m[m
GSE75688.mtx.txt
GSE76312.clusters.txt
GSE76312.mtx.txt
GSE81861_CRC_NM_all_cells_FPKM.csv
GSE81861_CRC_NM_epithelial_cells_FPKM.csv
GSE81861_CRC_tumor_all_cells_FPKM.csv
GSE81861_CRC_tumor_epithelial_cells_FPKM.csv
GSE81861_Cell_Line_FPKM.csv
GSE89567.clusters.txt
GSE89567.mtx.txt
bernard_thienpont_2017_lungcancer.txt
bernard_thienpont_2017_lungcancer_grnboost.RData
bernard_thienpont_2017_lungcancer_grnboost.txt
bernard_thienpont_2017_lungcancer_metadata.txt
[1m[36mclean[m[m
human_tfs.xlsx


In [10]:
genes_in_db = RankingDatabase(DATABASE_FNAME_500bp, name="500bp", nomenclature="MGI").geneset

#### B. Thienpont - Lungcarcinoma

GRNBoost data is already available. The disadvantage of using the already available data is that we might not have used the same list of TFs.

In [12]:
adjacencies = pd.read_csv(os.path.join(RESOURCES_FOLDER, 'bernard_thienpont_2017_lungcancer_grnboost.txt'), sep=',')
del adjacencies['Unnamed: 0']
adjacencies.columns = ['TF', 'target', 'importance']

In [13]:
adjacencies.head()

Unnamed: 0,TF,target,importance
0,GATA2,LTC4S,0.877992
1,GATA2,CPA3,0.803889
2,GATA2,TPSAB1,0.679747
3,GATA2,HPGDS,0.653688
4,GATA2,MS4A2,0.651024


In [14]:
adjacencies.to_csv(os.path.join(RESULTS_FOLDER, 'b_thienpont.net.csv'), sep=',', index=False)

#### GSE89567 - Astrocytoma

In [72]:
def process_gse89567(fname, genes_in_db):
    # Load CSV file
    mtx = pd.read_csv(os.path.join(RESOURCES_FOLDER, fname), sep='\t', index_col=0)
    
    # Remove duplicate gene symbols (keep last when sorted ascending order according to row sum)
    mtx = mtx.iloc[mtx.sum(axis=1).argsort()]
    mtx = mtx[~mtx.index.duplicated(keep='last')]
    
    # Keep only genes that are in the ranking database
    mtx = mtx[mtx.index.isin(genes_in_db)]
    
    return mtx

In [73]:
mtx = process_gse89567('GSE89567.mtx.txt', genes_in_db)

In [74]:
mtx.head()

Unnamed: 0_level_0,MGH42_P7_A01,MGH42_P7_A02,MGH42_P7_A03,MGH42_P7_A04,MGH42_P7_A05,MGH42_P7_A07,MGH42_P7_A09,MGH42_P7_A11,MGH42_P7_A12,MGH42_P7_B02,...,MGH107neg_P2_E06,MGH107pos_P2_B03,MGH107neg_P1_F03,MGH107neg_P1_G06,MGH107neg_P2_H03,MGH107neg_P2_C05,MGH107pos_P2_D07,MGH107neg_P1_E01,MGH107pos_P2_G09,MGH107neg_P1_D06
Gene,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
MIR760,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
MIR202,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
MIR200C,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
MIR200B,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
MIR200A,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [75]:
mtx.shape

(20684, 6341)

In [76]:
len(set(mtx.index).intersection(set(human_tfs)))

1567

In [77]:
mtx.to_csv(os.path.join(RESULTS_FOLDER, 'GSE89567.mtx.tsv'), sep='\t', index=True)

#### GSE103322 - Head and neck cancer

In [78]:
def process_gse103322(fname, genes_in_db):
    # Load CSV file
    mtx = pd.read_csv(os.path.join(RESOURCES_FOLDER, fname), sep='\t', index_col=0, skiprows=[1,2,3,4,5])
    
    # Extract gene symbol
    mtx.index = list(map(lambda g: g[1:-1], mtx.index))
    
    # Remove duplicate gene symbols (keep last when sorted ascending order according to row sum)
    mtx = mtx.iloc[mtx.sum(axis=1).argsort()]
    mtx = mtx[~mtx.index.duplicated(keep='last')]
    
    # From TPM to log2(TPM/10+1)
    mtx = mtx.applymap(lambda tpm: math.log2((tpm/10.0) + 1.0))
    
    # Keep only genes that are in the ranking database
    mtx = mtx[mtx.index.isin(genes_in_db)]
    
    return mtx

In [79]:
mtx = process_gse103322('GSE103322.mtx+clusters.txt', genes_in_db)

In [80]:
mtx.head()

Unnamed: 0,HN28_P15_D06_S330_comb,HN28_P6_G05_S173_comb,HN26_P14_D11_S239_comb,HN26_P14_H05_S281_comb,HN26_P25_H09_S189_comb,HN26_P14_H06_S282_comb,HN25_P25_C04_S316_comb,HN26_P25_A11_S107_comb,HN26_P25_C09_S129_comb,HNSCC26_P24_H05_S377_comb,...,HNSCC20_P3_B10_S22_comb,HNSCC20_P13_B11_S215_comb,HNSCC20_P3_C08_S32_comb,HNSCC17_P4_H03_S183_comb,HNSCC20_P3_F09_S69_comb,HNSCC17_P4_G12_S180_comb,HNSCC20_P13_C05_S221_comb,HNSCC17_P4_C12_S132_comb,HNSCC20_P3_H08_S92_comb,HNSCC20_P3_G06_S78_comb
SNORD113-9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
MAGEB16,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
SNORA49,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
MIR26A1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
MIR485,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [81]:
mtx.shape

(20684, 5902)

In [82]:
len(set(mtx.index).intersection(set(human_tfs)))

1567

In [83]:
mtx.to_csv(os.path.join(RESULTS_FOLDER, 'GSE103322.mtx.tsv'), sep='\t', index=True)

#### GSE70630 - Oligodendroglioma

In [11]:
def process_gse70630(fname, genes_in_db):
    # Load CSV file
    mtx = pd.read_csv(os.path.join(RESOURCES_FOLDER, fname), sep='\t', index_col=0)
    
    # Remove duplicate gene symbols (keep last when sorted ascending order according to row sum)
    mtx = mtx.iloc[mtx.sum(axis=1).argsort()]
    mtx = mtx[~mtx.index.duplicated(keep='last')]
    
    # Keep only genes that are in the ranking database
    mtx = mtx[mtx.index.isin(genes_in_db)]
    
    return mtx

In [12]:
mtx = process_gse70630('GSE70630.mtx.txt', genes_in_db)

Remove NaNs.

In [None]:
N_nans = mtx.isnull().values.sum()
from operator import mul
from functools import reduce
N_totals = reduce(mul, mtx.shape)

In [24]:
N_nans/N_totals*100.0

0.018560091878683019

In [25]:
mtx = mtx.fillna(value=0.0)

In [26]:
mtx.head()

Unnamed: 0_level_0,MGH36_P6_A12,MGH36_P6_H09,MGH53_P4_G04,MGH36_P10_G12,MGH53_P2_H12,MGH53_P4_D10,MGH53_P4_D01,MGH36_P6_B07,MGH36_P10_B12,MGH53_P2_G11,...,93_P10_H06,93_P8_B12,93_P8_D09,93_P9_D11,93_P10_G08,93_P8_H06,93_P9_C07,93_P8_A12,93_P8_C01,93_P9_F06
GENE,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
MIR3202-2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
MIR520G,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
MIR520F,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
MIR520E,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
MIR520D,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [27]:
mtx.shape

(20684, 4347)

In [28]:
len(set(mtx.index).intersection(set(human_tfs)))

1567

In [29]:
mtx.to_csv(os.path.join(RESULTS_FOLDER, 'GSE70630.mtx.tsv'), sep='\t', index=True)

#### GSE72056 - Melanoma

In [90]:
def process_gse72056(fname, genes_in_db):
    # Load CSV file
    mtx = pd.read_csv(os.path.join(RESOURCES_FOLDER, fname), sep='\t', index_col=0)
    
    # Remove duplicate gene symbols (keep last when sorted ascending order according to row sum)
    mtx = mtx.iloc[mtx.sum(axis=1).argsort()]
    mtx = mtx[~mtx.index.duplicated(keep='last')]
    
    # Keep only genes that are in the ranking database
    mtx = mtx[mtx.index.isin(genes_in_db)]
    
    return mtx

In [91]:
mtx = process_gse72056('GSE72056.mtx.txt', genes_in_db)

In [92]:
mtx.head()

Unnamed: 0_level_0,Cy72_CD45_H02_S758_comb,CY58_1_CD45_B02_S974_comb,Cy71_CD45_D08_S524_comb,Cy81_FNA_CD45_B01_S301_comb,Cy80_II_CD45_B07_S883_comb,Cy81_Bulk_CD45_B10_S118_comb,Cy72_CD45_D09_S717_comb,Cy74_CD45_A03_S387_comb,Cy71_CD45_B05_S497_comb,Cy80_II_CD45_C09_S897_comb,...,CY75_1_CD45_CD8_7__S265_comb,CY75_1_CD45_CD8_3__S127_comb,CY75_1_CD45_CD8_1__S61_comb,CY75_1_CD45_CD8_1__S12_comb,CY75_1_CD45_CD8_1__S25_comb,CY75_1_CD45_CD8_7__S223_comb,CY75_1_CD45_CD8_1__S65_comb,CY75_1_CD45_CD8_1__S93_comb,CY75_1_CD45_CD8_1__S76_comb,CY75_1_CD45_CD8_7__S274_comb
GENE,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
MIR3924,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
MIR3910-1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
MIR182,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
MIR3665,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
MIR3179-3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [93]:
mtx.shape

(20682, 4645)

In [94]:
len(set(mtx.index).intersection(set(human_tfs)))

1567

In [95]:
mtx.to_csv(os.path.join(RESULTS_FOLDER, 'GSE72056.mtx.tsv'), sep='\t', index=True)

#### GSE57872 - Glioblastoma

In [96]:
def process_gse57872(fname, genes_in_db):
    # Load CSV file
    mtx = pd.read_csv(os.path.join(RESOURCES_FOLDER, fname), sep='\t', index_col=0)
    
    # Remove duplicate gene symbols (keep last when sorted ascending order according to row sum)
    mtx = mtx.iloc[mtx.sum(axis=1).argsort()]
    mtx = mtx[~mtx.index.duplicated(keep='last')]
    
    # Keep only genes that are in the ranking database
    mtx = mtx[mtx.index.isin(genes_in_db)]
    
    return mtx

In [97]:
mtx = process_gse57872('GSE57872.mtx.txt', genes_in_db)

In [98]:
mtx.head()

Unnamed: 0_level_0,MGH264_A01,MGH264_A02,MGH264_A03,MGH264_A04,MGH264_A05,MGH264_A06,MGH264_A07,MGH264_A08,MGH264_A10,MGH264_A11,...,MGH31_H02,MGH31_H04,MGH31_H05,MGH31_H06,MGH31_H07,MGH31_H08,MGH31_H09,MGH31_H10,MGH31_H11,MGH31_H12
GENE,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
OSR1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
KRT73,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
SPRR2A,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
STATH,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
HOXC9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.070857,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [99]:
mtx.shape

(5753, 430)

In [100]:
len(set(mtx.index).intersection(set(human_tfs)))

452

In [101]:
mtx.to_csv(os.path.join(RESULTS_FOLDER, 'GSE57872.mtx.tsv'), sep='\t', index=True)

#### GSE76312 - CML

In [102]:
def process_gse76312(fname, genes_in_db):
    # Load CSV file
    mtx = pd.read_csv(os.path.join(RESOURCES_FOLDER, fname), sep='\t', index_col=0)
    
    # Remove duplicate gene symbols (keep last when sorted ascending order according to row sum)
    mtx = mtx.iloc[mtx.sum(axis=1).argsort()]
    mtx = mtx[~mtx.index.duplicated(keep='last')]
    
    ## From FPKM to log2(TPM/10+1)
    mtx = (mtx / mtx.sum(axis=0)) * 10.0e6
    mtx = mtx.applymap(lambda tpm: math.log2((tpm/10.0) + 1.0))
    
    # Keep only genes that are in the ranking database
    mtx = mtx[mtx.index.isin(genes_in_db)]
    
    return mtx

In [103]:
mtx = process_gse76312('GSE76312.mtx.txt', genes_in_db)

In [104]:
mtx.head()

Unnamed: 0,OX2046D6,OX2046E12,OX2046D11,OX2046B5,OX2046H5,OX2046E9,OX2046D8,OX2046E7,OX2046H12,OX2046H9,...,CML15_3mTKIC7,CML15_3mTKIB10,CML15_3mTKIA6,CML15_3mTKIB11,CML15_3mTKID6,CML15_3mTKIC9,CML15_3mTKIC1,CML15_3mTKIC8,CML15_3mTKID9,CML15_3mTKID10
PPY,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
CCDC151,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
MIR105-1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ALX1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
MIR101-2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [105]:
mtx.shape

(18701, 2287)

In [106]:
len(set(mtx.index).intersection(set(human_tfs)))

1527

In [107]:
mtx.to_csv(os.path.join(RESULTS_FOLDER, 'GSE76312.mtx.tsv'), sep='\t', index=True)

#### GSE75688 - Breast cancer

In [108]:
def process_gse75688(fname, genes_in_db):
    # Load CSV file
    mtx = pd.read_csv(os.path.join(RESOURCES_FOLDER, fname), sep='\t', index_col=1)
    del mtx['gene_id']
    del mtx ['gene_type']

    # Remove duplicate gene symbols (keep last when sorted ascending order according to row sum)
    mtx = mtx.iloc[mtx.sum(axis=1).argsort()]
    mtx = mtx[~mtx.index.duplicated(keep='last')]
    
    # From FPKM to log2(TPM/10+1)
    mtx = mtx.applymap(lambda tpm: math.log2((tpm/10.0) + 1.0))
    
    # Keep only genes that are in the ranking database
    mtx = mtx[mtx.index.isin(genes_in_db)]
    
    return mtx

In [109]:
mtx = process_gse75688('GSE75688.mtx.txt', genes_in_db)

In [110]:
mtx.head()

Unnamed: 0_level_0,BC01_Pooled,BC01_Tumor,BC02_Pooled,BC03_Pooled,BC03LN_Pooled,BC04_Pooled,BC05_Pooled,BC06_Pooled,BC07_Tumor,BC07LN_Pooled,...,BC11_04,BC11_07,BC11_28,BC11_43,BC11_56,BC11_69,BC11_70,BC11_78,BC11_81,BC11_88
gene_name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
SNORD119,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
CDY1B,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
OR5A2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
OR5B12,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
OR5B2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [111]:
mtx.shape

(19685, 563)

In [112]:
len(set(mtx.index).intersection(set(human_tfs)))

1565

In [114]:
mtx.to_csv(os.path.join(RESULTS_FOLDER, 'GSE75688.mtx.tsv'), sep='\t', index=True)

#### GSE69405 - Lung carcinoma

In [115]:
def process_gse69405(fname, genes_in_db):
    # Load CSV file
    mtx = pd.read_csv(os.path.join(RESOURCES_FOLDER, fname), sep='\t', index_col=1)
    del mtx['gene_id']
    del mtx ['gene_type']
    
    # Remove duplicate gene symbols (keep last when sorted ascending order according to row sum)
    mtx = mtx.iloc[mtx.sum(axis=1).argsort()]
    mtx = mtx[~mtx.index.duplicated(keep='last')]
    
    # From FPKM to log2(TPM/10+1)
    mtx = mtx.applymap(lambda tpm: math.log2((tpm/10.0) + 1.0))
    
    # Keep only genes that are in the ranking database
    mtx = mtx[mtx.index.isin(genes_in_db)]
    
    return mtx

In [116]:
mtx = process_gse69405('GSE69405.mtx.txt', genes_in_db)

In [117]:
mtx.head()

Unnamed: 0_level_0,H358_Pooled,H358_SC05,H358_SC09,H358_SC10,H358_SC11,H358_SC12,H358_SC14,H358_SC16,H358_SC17,H358_SC18,...,LC-MBT-15_SC87,LC-MBT-15_SC89,LC-MBT-15_SC92,LC-MBT-15_SC94,LC-MBT-15_SC95,LC-MBT-15_SC96,LC-PT-45-mock,LC-PT-45-Selumetinib_R0,LC-PT-45-Selumetinib_R3,LC-PT-45-Selumetinib_R7
gene_name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
MIR764,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
MIR573,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
SNORA36B,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
PGK2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
OR10J1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [118]:
mtx.shape

(19659, 201)

In [119]:
len(set(mtx.index).intersection(set(human_tfs)))

1565

In [120]:
mtx.to_csv(os.path.join(RESULTS_FOLDER, 'GSE69405.mtx.tsv'), sep='\t', index=True)

#### GSE81861 - Colorectal tumor

In [121]:
def process_gse81861(fname, genes_in_db):
    # Load CSV file
    mtx = pd.read_csv(os.path.join(RESOURCES_FOLDER, fname), sep=',', index_col=0)
    
    # Extract gene symbol
    def gene(s):
        parts = s.split('_')
        return "_".join(parts[1:-1])
    mtx.index = list(map(gene, mtx.index))
    
    # Remove duplicate gene symbols (keep last when sorted ascending order according to row sum)
    mtx = mtx.iloc[mtx.sum(axis=1).argsort()]
    mtx = mtx[~mtx.index.duplicated(keep='last')]
    
    # From FPKM to log2(TPM/10+1)
    mtx = (mtx / mtx.sum(axis=0)) * 10.0e6
    mtx = mtx.applymap(lambda tpm: math.log2((tpm/10.0) + 1.0))
    
    # Keep only genes that are in the ranking database
    mtx = mtx[mtx.index.isin(genes_in_db)]
    
    return mtx

In [122]:
process_gse81861 = partial(process_gse81861, genes_in_db=genes_in_db)
mtx = pd.concat([
    process_gse81861('GSE81861_CRC_NM_all_cells_FPKM.csv'),
    process_gse81861('GSE81861_CRC_NM_epithelial_cells_FPKM.csv'),
    process_gse81861('GSE81861_CRC_tumor_all_cells_FPKM.csv'),
    process_gse81861('GSE81861_CRC_tumor_epithelial_cells_FPKM.csv'),
    process_gse81861('GSE81861_Cell_Line_FPKM.csv'),
    ], axis=1)

In [123]:
mtx.shape

(19683, 1583)

In [124]:
mtx.head()

Unnamed: 0,RHC3934__Bcell__#7DEA7B,RHC3944__Bcell__#7DEA7B,RHC3962__Tcell__#C6E879,RHC4003__Bcell__#7DEA7B,RHC4005__Bcell__#7DEA7B,RHC4007__Bcell__#7DEA7B,RHC4008__NA__grey,RHC4220__Bcell__#7DEA7B,RHC4227__Epithelial__#2749FE,RHC4240__Epithelial__#2749FE,...,RHC2497__H1_B2__brown,RHC2498__H1_B2__brown,RHC2499__H1_B2__brown,RHC2500__H1_B2__brown,RHC2501__H1_B2__brown,RHC2502__H1_B2__brown,RHC2503__H1_B2__brown,RHC2504__H1_B2__brown,RHC2505__H1_B2__brown,RHC2506__H1_B2__brown
A1BG,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,3.358942,0.0,0.0,0.0,0.0,0.0,1.735678,0.0,1.314749,6.163942
A1CF,0.0,0.0,0.105035,0.0,0.0,0.206954,0.0,0.0,0.0,6.558213,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
A2M,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,3.195409,0.0,0.0,1.664019
A2ML1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.083355,0.115427,0.413562,0.0,4.040125,0.053906,0.665874,2.608487,1.393939,0.0
A4GALT,0.0,0.0,0.0,7.488154,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.260969,0.0,0.0,0.0,0.89356,0.0


In [125]:
len(set(mtx.index).intersection(set(human_tfs)))

1565

In [127]:
mtx.to_csv(os.path.join(RESULTS_FOLDER, 'GSE81861.mtx.tsv'), sep='\t', index=True)

#### GSE73121 - RCC

Some difficulties with this dataset:
1. TPM-values need to be transformed, log2(TPM/10+1). EASY
2. Separate file for each cell in the experiment. EASY
3. The genes are specified as ensemble gene IDs (ENSG) and need to be converted to HGNC symbols. DIFFICULT

Downloading FPKM table from SCPortalen is not a solution becuase genes are also provided as ENSG numbers.

In [11]:
!ls {RESOURCES_FOLDER}/GSE73121/

GSM1887215_PDX_mRCC_Pooled.TPM.txt GSM1887276_PDX_pRCC_SC_57.TPM.txt
GSM1887216_PDX_mRCC_SC_01.TPM.txt  GSM1887277_PDX_pRCC_SC_58.TPM.txt
GSM1887217_PDX_mRCC_SC_03.TPM.txt  GSM1887278_PDX_pRCC_SC_59.TPM.txt
GSM1887218_PDX_mRCC_SC_04.TPM.txt  GSM1887279_PDX_pRCC_SC_60.TPM.txt
GSM1887219_PDX_mRCC_SC_05.TPM.txt  GSM1887280_PDX_pRCC_SC_61.TPM.txt
GSM1887220_PDX_mRCC_SC_06.TPM.txt  GSM1887281_PDX_pRCC_SC_62.TPM.txt
GSM1887221_PDX_mRCC_SC_08.TPM.txt  GSM1887282_PDX_pRCC_SC_65.TPM.txt
GSM1887222_PDX_mRCC_SC_12.TPM.txt  GSM1887283_PDX_pRCC_SC_66.TPM.txt
GSM1887223_PDX_mRCC_SC_14.TPM.txt  GSM1887284_PDX_pRCC_SC_67.TPM.txt
GSM1887224_PDX_mRCC_SC_15.TPM.txt  GSM1887285_PDX_pRCC_SC_68.TPM.txt
GSM1887225_PDX_mRCC_SC_20.TPM.txt  GSM1887286_PDX_pRCC_SC_70.TPM.txt
GSM1887226_PDX_mRCC_SC_21.TPM.txt  GSM1887287_PDX_pRCC_SC_72.TPM.txt
GSM1887227_PDX_mRCC_SC_34.TPM.txt  GSM1887288_PDX_pRCC_SC_73.TPM.txt
GSM1887228_PDX_mRCC_SC_41.TPM.txt  GSM1887289_PDX_pRCC_SC_75.TPM.txt
GSM1887229_PDX_mRCC_SC_44.TPM.txt 

In [67]:
def load(fname):
    col = pd.read_csv(fname, sep='\t', index_col=0, usecols=[0,5])
    col.rename(columns={'TPM': os.path.basename(fname).split('.')[0]}, inplace=True)
    return col.applymap(lambda tpm: math.log2((tpm/10.0) + 1.0))

def process_gse73121(foldername, genes_in_db):
    fnames = glob.glob(os.path.join(foldername, "GSM*.TPM.txt"))
    return pd.concat([ load(fname) for fname in fnames], axis=1)

In [68]:
mtx = process_gse73121(os.path.join(RESOURCES_FOLDER, "GSE73121"), genes_in_db)

In [69]:
mtx.shape

(57820, 121)

In [70]:
mtx.head()

Unnamed: 0_level_0,GSM1887287_PDX_pRCC_SC_72,GSM1887294_PDX_pRCC_SC_88,GSM1887230_PDX_mRCC_SC_52,GSM1887285_PDX_pRCC_SC_68,GSM1887254_PDX_pRCC_SC_05,GSM1887274_PDX_pRCC_SC_55,GSM1887335_Pt_mRCC_SC_57,GSM1887336_Pt_mRCC_SC_58,GSM1887286_PDX_pRCC_SC_70,GSM1887345_Pt_mRCC_SC_84,...,GSM1887225_PDX_mRCC_SC_20,GSM1887218_PDX_mRCC_SC_04,GSM1887321_Pt_mRCC_SC_21,GSM1887261_PDX_pRCC_SC_27,GSM1887277_PDX_pRCC_SC_58,GSM1887223_PDX_mRCC_SC_14,GSM1887312_Pt_mRCC_SC_10,GSM1887340_Pt_mRCC_SC_69,GSM1887244_PDX_mRCC_SC_87,GSM1887264_PDX_pRCC_SC_34
gene_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
ENSG00000000003.10,0.166073,0.0,1.954196,0.151859,1.529071,0.267835,1.21661,3.465322,1.226509,0.0,...,0.630405,3.762242,0.0,1.724214,2.931305,1.667665,0.465713,0.0,0.466758,0.0
ENSG00000000005.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ENSG00000000419.8,3.853597,3.245648,4.774787,0.0,2.919531,4.1918,4.71891,3.408712,2.931305,4.652601,...,4.501375,5.600448,5.326573,4.217231,1.473008,4.737849,3.994761,0.084064,2.713256,2.81578
ENSG00000000457.9,0.0,0.0,0.0,0.621056,2.010422,0.344828,1.937344,0.0,1.75446,0.0,...,0.189034,0.0,0.071763,0.516015,0.0,0.014355,0.097611,0.0,0.048236,0.0
ENSG00000000460.12,0.0,0.0,0.0,0.0,0.0,0.0,1.075875,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [72]:
mtx.index = list(map(lambda g: g.split('.')[0], mtx.index))

In [73]:
mtx.head()

Unnamed: 0,GSM1887287_PDX_pRCC_SC_72,GSM1887294_PDX_pRCC_SC_88,GSM1887230_PDX_mRCC_SC_52,GSM1887285_PDX_pRCC_SC_68,GSM1887254_PDX_pRCC_SC_05,GSM1887274_PDX_pRCC_SC_55,GSM1887335_Pt_mRCC_SC_57,GSM1887336_Pt_mRCC_SC_58,GSM1887286_PDX_pRCC_SC_70,GSM1887345_Pt_mRCC_SC_84,...,GSM1887225_PDX_mRCC_SC_20,GSM1887218_PDX_mRCC_SC_04,GSM1887321_Pt_mRCC_SC_21,GSM1887261_PDX_pRCC_SC_27,GSM1887277_PDX_pRCC_SC_58,GSM1887223_PDX_mRCC_SC_14,GSM1887312_Pt_mRCC_SC_10,GSM1887340_Pt_mRCC_SC_69,GSM1887244_PDX_mRCC_SC_87,GSM1887264_PDX_pRCC_SC_34
ENSG00000000003,0.166073,0.0,1.954196,0.151859,1.529071,0.267835,1.21661,3.465322,1.226509,0.0,...,0.630405,3.762242,0.0,1.724214,2.931305,1.667665,0.465713,0.0,0.466758,0.0
ENSG00000000005,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ENSG00000000419,3.853597,3.245648,4.774787,0.0,2.919531,4.1918,4.71891,3.408712,2.931305,4.652601,...,4.501375,5.600448,5.326573,4.217231,1.473008,4.737849,3.994761,0.084064,2.713256,2.81578
ENSG00000000457,0.0,0.0,0.0,0.621056,2.010422,0.344828,1.937344,0.0,1.75446,0.0,...,0.189034,0.0,0.071763,0.516015,0.0,0.014355,0.097611,0.0,0.048236,0.0
ENSG00000000460,0.0,0.0,0.0,0.0,0.0,0.0,1.075875,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [74]:
# Remove duplicate gene symbols (keep last when sorted ascending order according to row sum)
mtx = mtx.iloc[mtx.sum(axis=1).argsort()]
mtx = mtx[~mtx.index.duplicated(keep='last')]

In [75]:
mtx.shape

(57820, 121)

Gene conversion attempt I: Using ensembl conversion tool (https://www.ensembl.org/biomart/)

In [88]:
ensg_hgnc = pd.read_csv('/Users/bramvandesande/Projects/lcb/resources/ensg2hgnc.tsv', sep='\t', index_col=0).dropna()
ensg2hgnc = ensg_hgnc.head().to_dict()['HGNC symbol']

In [90]:
set(mtx.index).intersection(set(ensg2hgnc.keys()))

{'ENSG00000209082',
 'ENSG00000210049',
 'ENSG00000210077',
 'ENSG00000210082',
 'ENSG00000211459'}

Gene conversion attempt II: via mygene REST API 

In [None]:
import mygene
mg = mygene.MyGeneInfo()
out = mg.querymany(list(mtx.index.values), scopes='ensemblgene', fields='ensemblgene,symbol,name', species='human')

In [42]:
ensg2symbol = dict(zip(map(lambda d: d.get('_id', '?'), out), 
                       map(lambda d: d.get('symbol', '?'), out)))

In [45]:
del ensg2symbol['?']

In [46]:
len(ensg2symbol)

52739

In [47]:
mtx.index = list(map(lambda g: ensg2symbol.get(g, '?'), mtx.index))

In [48]:
mtx.head()

Unnamed: 0,GSM1887287_PDX_pRCC_SC_72,GSM1887294_PDX_pRCC_SC_88,GSM1887230_PDX_mRCC_SC_52,GSM1887285_PDX_pRCC_SC_68,GSM1887254_PDX_pRCC_SC_05,GSM1887274_PDX_pRCC_SC_55,GSM1887335_Pt_mRCC_SC_57,GSM1887336_Pt_mRCC_SC_58,GSM1887286_PDX_pRCC_SC_70,GSM1887345_Pt_mRCC_SC_84,...,GSM1887225_PDX_mRCC_SC_20,GSM1887218_PDX_mRCC_SC_04,GSM1887321_Pt_mRCC_SC_21,GSM1887261_PDX_pRCC_SC_27,GSM1887277_PDX_pRCC_SC_58,GSM1887223_PDX_mRCC_SC_14,GSM1887312_Pt_mRCC_SC_10,GSM1887340_Pt_mRCC_SC_69,GSM1887244_PDX_mRCC_SC_87,GSM1887264_PDX_pRCC_SC_34
HMGB3P8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
FTLP1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
?,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
?,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
AL360089.1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [54]:
mtx = mtx[mtx.index != '?']

In [55]:
mtx.shape

(28665, 121)

In [57]:
mtx = mtx.iloc[mtx.sum(axis=1).argsort()]
mtx = mtx[~mtx.index.duplicated(keep='last')]

In [58]:
mtx.shape

(27530, 121)

In [65]:
sorted(mtx[mtx.index.isin(genes_in_db)].index)

['ABCC6P1',
 'ACTR3BP2',
 'AGAP11',
 'AKR7A2P1',
 'ANKRD26P1',
 'ANKRD30BP2',
 'ANKRD36',
 'ANP32C',
 'ANXA2P1',
 'ANXA2P2',
 'ANXA2P3',
 'AOX2P',
 'AQP7P1',
 'ARGFXP2',
 'ASB18',
 'ASB9P1',
 'ATP5EP2',
 'ATXN8OS',
 'AURKAPS1',
 'BAGE2',
 'BCRP2',
 'BCRP3',
 'BMS1P4',
 'BTF3P11',
 'C10orf111',
 'C10orf25',
 'C10orf91',
 'C12orf77',
 'C15orf32',
 'C15orf56',
 'C17orf102',
 'C17orf77',
 'C17orf82',
 'C1orf229',
 'C20orf197',
 'C20orf203',
 'C22orf24',
 'C2orf27A',
 'C2orf27B',
 'C2orf48',
 'C3orf70',
 'C3orf79',
 'C6orf99',
 'C7orf65',
 'C7orf66',
 'C7orf69',
 'C7orf71',
 'C9orf106',
 'C9orf139',
 'C9orf163',
 'C9orf170',
 'CCL27',
 'CCT6P1',
 'CDC14C',
 'CEACAM18',
 'CELP',
 'CEP170P1',
 'CES1P2',
 'CETN4P',
 'CHKB-CPT1B',
 'CIDECP',
 'CLEC4GP1',
 'CLRN3',
 'COG8',
 'COL6A4P2',
 'CRX',
 'CSNK1A1P1',
 'CSPG4P2Y',
 'CTAGE10P',
 'CTAGE11P',
 'CXADRP2',
 'CXADRP3',
 'CXCR2P1',
 'CYP2S1',
 'CYP4F24P',
 'DDX11L2',
 'DEFT1P2',
 'DGCR10',
 'DGCR5',
 'DHRS4L1',
 'DHX40P1',
 'DIO3OS',
 'DIRC3',
 

In [66]:
len(mtx[mtx.index.isin(genes_in_db)].index)

473