# Parsing the TISSUES resource

The TISSUES database reports which genes are present in which tissues, with a confidence score for each gene-tissue association. For more information, see the TISSUES [publication](https://dx.doi.org/10.7717/peerj.1054) or [website](http://tissues.jensenlab.org/Downloads).

In [1]:
base_url = 'http://download.jensenlab.org/'
filenames = [
    'human_tissue_knowledge_filtered.tsv',
    'human_tissue_knowledge_full.tsv',
    'human_tissue_experiments_filtered.tsv',
    'human_tissue_experiments_full.tsv',
    'human_tissue_textmining_filtered.tsv',
    'human_tissue_textmining_full.tsv',
    'human_tissue_integrated_full.tsv',
]

for filename in filenames:
    ! wget --no-verbose --timestamping --directory-prefix download/ {base_url}{filename}
    ! gzip -f download/{filename}
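
Outside a notebook, the `wget` and `gzip` shell commands above could be approximated with the standard library. This is a minimal sketch, not the method used in this notebook: `download_gzipped` is a hypothetical helper name, and unlike `wget --timestamping` it always fetches the file again.

```python
import gzip
import pathlib
import shutil
import urllib.request


def download_gzipped(url, directory='download'):
    """Fetch url and store a gzip-compressed copy in directory; return its path.

    Hypothetical helper for illustration; it does not check timestamps.
    """
    directory = pathlib.Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    # Name the output after the last URL component, with a .gz suffix
    path = directory / (url.rsplit('/', 1)[-1] + '.gz')
    with urllib.request.urlopen(url) as response, gzip.open(path, 'wb') as write_file:
        shutil.copyfileobj(response, write_file)
    return path


# Example call (requires network access, so it is left commented out):
# download_gzipped('http://download.jensenlab.org/human_tissue_integrated_full.tsv')
```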

In [4]:
import pandas

## terminology mappings

In [6]:
# Read BTO to Uberon cross-references
url = 'https://raw.githubusercontent.com/dhimmel/uberon/134f23479186abba03ba340fc6dc90e16c781920/data/xref.tsv'
uberon_map_df = pandas.read_table(url)
uberon_map_df = uberon_map_df[uberon_map_df.xref.str.startswith('BTO:').fillna(False)]
uberon_map_df = uberon_map_df.rename(columns={'xref': 'bto_id'})

In [7]:
# Read entrez gene mappings
url = 'https://raw.githubusercontent.com/dhimmel/entrez-gene/6e133f9ef8ce51a4c5387e58a6cc97564a66cec8/data/xrefs-human.tsv'
entrez_map_df = pandas.read_table(url)
entrez_map_df = entrez_map_df[entrez_map_df.resource == 'Ensembl']
column_map = {'identifier': 'ensembl_id', 'GeneID': 'entrez_gene_id'}
entrez_map_df = entrez_map_df.rename(columns=column_map)[list(column_map.values())]

## dataset formats

Correspondence from Lars Juhl Jensen:

Regarding the file formats, the knowledge and experiments have the following format (columns enumerated):

1. Gene/protein ID (ENSP for proteins, other IDs for ncRNAs)
2. Human readable name for the gene/protein (HGNC gene symbol when available)
3. Brenda Tissue Ontology term
4. Human readable name for the BTO term
5. Source of evidence
6. Association support (can be a GO evidence code, number of ESTs, anything - the original evidence)
7. Confidence score (the score which is shown as stars in the web interface (rounded up on web pages))

The textmining files are a bit different; the first four columns are the same, followed by:

5. Co-occurrence Z-score
6. Confidence score (stars, comparable to other files)
7. Linkout for showing the abstracts that the association is based on
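
The two layouts described above can be captured as column-name lists and passed to a reader for headerless, tab-separated files. The sketch below is only illustrative: the column labels and the `read_tissues` helper are our own naming (the files carry no header), and the sample line reuses a row from the knowledge file.

```python
import io

import pandas

# Labels for the seven columns of each layout (assumed names, not part of the files)
KNOWLEDGE_COLUMNS = ['ensembl_id', 'gene_symbol', 'bto_id', 'bto_name', 'source', 'evidence', 'score']
TEXTMINING_COLUMNS = ['ensembl_id', 'gene_symbol', 'bto_id', 'bto_name', 'z_score', 'score', 'url']


def read_tissues(path_or_buffer, column_names):
    """Read a headerless, tab-separated TISSUES file into a DataFrame."""
    return pandas.read_csv(path_or_buffer, sep='\t', names=column_names)


# One row in the knowledge/experiments layout
sample = 'ENSP00000000233\tARF5\tBTO:0000081\tReproductive system\tUniProtKB-RC\tCURATED\t4\n'
sample_df = read_tissues(io.StringIO(sample), KNOWLEDGE_COLUMNS)
print(sample_df.dtypes)
```

Passing `TEXTMINING_COLUMNS` instead would label the textmining files, whose last three columns differ.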


## textmining dataset

In [8]:
column_names = ['ensembl_id', 'gene_symbol', 'bto_id', 'bto_name', 'z-score', 'score', 'sources']
text_df = pandas.read_table('download/human_tissue_textmining_full.tsv.gz', names=column_names)
text_df.head()

Unnamed: 0,ensembl_id,gene_symbol,bto_id,bto_name,z-score,score,sources
0,5S_rRNA,5S_rRNA,BTO:0001481,Plant,4.614,2.3,http://tissues.jensenlab.org/Entity?documents=...
1,5S_rRNA,5S_rRNA,BTO:0000000,"tissues, cell types and enzyme sources",4.531,2.3,http://tissues.jensenlab.org/Entity?documents=...
2,5S_rRNA,5S_rRNA,BTO:0000964,BTO:0000964,4.286,2.1,http://tissues.jensenlab.org/Entity?documents=...
3,5S_rRNA,5S_rRNA,BTO:0002502,Kinetoplastid,4.15,2.1,http://tissues.jensenlab.org/Entity?documents=...
4,5S_rRNA,5S_rRNA,BTO:0004669,Finger,4.059,2.0,http://tissues.jensenlab.org/Entity?documents=...


## knowledge dataset

In [9]:
column_names = ['ensembl_id', 'gene_symbol', 'bto_id', 'bto_name', 'source', 'evidence', 'score']
knowledge_df = pandas.read_table('download/human_tissue_knowledge_full.tsv.gz', names=column_names)
knowledge_df.head()

Unnamed: 0,ensembl_id,gene_symbol,bto_id,bto_name,source,evidence,score
0,ENSP00000000233,ARF5,BTO:0000000,"tissues, cell types and enzyme sources",UniProtKB-RC,CURATED,4
1,ENSP00000000233,ARF5,BTO:0000000,"tissues, cell types and enzyme sources",UniProtKB-RC,CURATED,4
2,ENSP00000000233,ARF5,BTO:0000042,Animalic,UniProtKB-RC,CURATED,4
3,ENSP00000000233,ARF5,BTO:0000042,Animalic,UniProtKB-RC,CURATED,4
4,ENSP00000000233,ARF5,BTO:0000081,Reproductive system,UniProtKB-RC,CURATED,4


## experimental dataset

In [10]:
column_names = ['ensembl_id', 'gene_symbol', 'bto_id', 'bto_name', 'source', 'evidence', 'score']
experiment_df = pandas.read_table('download/human_tissue_experiments_full.tsv.gz', names=column_names)
experiment_df.head()

Unnamed: 0,ensembl_id,gene_symbol,bto_id,bto_name,source,evidence,score
0,ENSP00000000233,ARF5,BTO:0000000,"tissues, cell types and enzyme sources",Exon array,411 intensity units,1
1,ENSP00000000233,ARF5,BTO:0000000,"tissues, cell types and enzyme sources",GNF,103 Intensity units,0
2,ENSP00000000233,ARF5,BTO:0000000,"tissues, cell types and enzyme sources",HPA-RNA,61.1 FPKM,1
3,ENSP00000000233,ARF5,BTO:0000000,"tissues, cell types and enzyme sources",HPM,6 peptides,0
4,ENSP00000000233,ARF5,BTO:0000000,"tissues, cell types and enzyme sources",RNA-seq,2.156 RPKM,0


In [None]:
experiment_summary_df = experiment_df.groupby(['ensembl_id', 'bto_id']).apply(
    lambda df: pandas.Series({
        'count': len(df),
        'mean_score': df.score.mean(),
        'n_3star': sum(df.score >= 3)
    }))

In [None]:
experiment_summary_df.head()
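
The per-group `apply` above calls a Python function once for every gene-tissue pair, which can be slow on a large table. One possible alternative, assuming pandas 0.25 or later, is named aggregation. The sketch below uses a small made-up table (`toy_df`) in place of `experiment_df`; the identifiers and scores are illustrative only.

```python
import pandas

# Toy rows standing in for experiment_df (illustrative values, not real data)
toy_df = pandas.DataFrame({
    'ensembl_id': ['P1', 'P1', 'P1', 'P2'],
    'bto_id': ['BTO:1', 'BTO:1', 'BTO:2', 'BTO:1'],
    'score': [1, 3, 4, 0],
})

toy_summary_df = (
    toy_df
    # Precompute the threshold test so it can be summed per group
    .assign(is_3star=toy_df.score >= 3)
    .groupby(['ensembl_id', 'bto_id'])
    .agg(
        count=('score', 'size'),
        mean_score=('score', 'mean'),
        n_3star=('is_3star', 'sum'),
    )
)
print(toy_summary_df)
```

The result has the same three columns as `experiment_summary_df`, indexed by `ensembl_id` and `bto_id`.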

## integrated dataset

Correspondence from Lars Juhl Jensen:

> If you go to download.jensenlab.org, you will find that there is also a file called human_tissue_integrated_full.tsv - it is a combined evidence score based on all available evidence. The first four columns are as in the other files; the fifth column contains the combined star confidence score. This file is not yet available via the download webpage, and the scores are not yet used in the web interface. You are welcome to experiment with it if you like, but please bear in mind that this is a test at this stage.

In [None]:
column_names = ['ensembl_id', 'gene_symbol', 'bto_id', 'bto_name', 'score']
integrated_df = pandas.read_table('download/human_tissue_integrated_full.tsv.gz', names=column_names)
integrated_df.head()

In [None]:
# Convert BTO terms to Uberon identifiers; the Entrez Gene merge is left disabled
integrated_df = integrated_df.merge(uberon_map_df)  # .merge(entrez_map_df)
integrated_df.groupby(['ensembl_id', 'uberon_id'])['score'].mean()
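
Because `merge` defaults to an inner join on the shared column (`bto_id` here), rows whose BTO term has no Uberon cross-reference are dropped without warning. A left join with `indicator=True` is one way to see how many rows that affects. The tables below are made up for illustration and only mimic the columns of `integrated_df` and `uberon_map_df`.

```python
import pandas

# Toy stand-ins for integrated_df and uberon_map_df (illustrative values only)
toy_scores_df = pandas.DataFrame({
    'ensembl_id': ['P1', 'P1', 'P2'],
    'bto_id': ['BTO:A', 'BTO:B', 'BTO:A'],
    'score': [4.0, 2.0, 3.0],
})
toy_map_df = pandas.DataFrame({
    'uberon_id': ['UBERON:X'],
    'uberon_name': ['tissue x'],
    'bto_id': ['BTO:A'],
})

# Default merge: inner join on bto_id, so the unmapped BTO:B row disappears
inner_df = toy_scores_df.merge(toy_map_df)

# Left join keeps every score row and records its origin in a _merge column
left_df = toy_scores_df.merge(toy_map_df, on='bto_id', how='left', indicator=True)
unmapped_df = left_df[left_df['_merge'] == 'left_only']

print(len(toy_scores_df), len(inner_df), len(unmapped_df))  # prints: 3 2 1
```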

In [34]:
max(integrated_df.score)

5.0

In [44]:
summary_df = integrated_df.groupby(['uberon_id', 'uberon_name']).apply(lambda df: 
    pandas.Series({'n_2star': sum(df.score >= 2),
                   'n_3star': sum(df.score >= 3),
                   'n_4star': sum(df.score >= 4),
                   'n_5star': sum(df.score >= 5)})
)
summary_df = summary_df.sort_values('n_5star', ascending=False)

In [45]:
summary_df

uberon_id,uberon_name,n_2star,n_3star,n_4star,n_5star
UBERON:0000468,multicellular organism,17933,17430,16991,5458
UBERON:0000033,head,13115,11565,10412,735
UBERON:0001016,nervous system,12498,10941,9706,715
UBERON:0001017,central nervous system,12366,10850,9616,700
UBERON:0000955,brain,12267,10808,9560,691
UBERON:0000474,female reproductive system,12592,10899,9376,576
UBERON:0002530,gland,13117,11243,10354,325
UBERON:0002368,endocrine gland,12524,10697,9760,317
UBERON:0004122,genitourinary system,12597,10685,9691,281
UBERON:0000990,reproductive system,11933,9977,8935,280
