# Extracting data about zebrafish disease models from ZFIN

The aim is to identify potentially interesting targets for UniProt annotation.

Emma Hatton-Ellis. August 2018.

In [1]:
import pandas as pd

The fish_model_disease.txt file provides [Disease Ontology](https://disease-ontology.org/) cross-references and the associated PubMed IDs for each fish model.

In [2]:
zfin_models = pd.read_csv('https://zfin.org/downloads/fish_model_disease.txt',
                          sep='\t',
                          header=None,
                          dtype=str,
                          names=['zfin_fish_id', 'zfin_env_id', 'is_a_model', 'do_id', 'do_name', 'zfin_pub', 'pmid',
                                 'evidence_code'])
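
Reading every column as a string (`dtype=str`) keeps the identifiers intact. A minimal sketch on a made-up two-column extract (not the real file; the column selection is an assumption for illustration) shows what type inference would otherwise do to a PubMed ID column that has gaps:

```python
import io
import pandas as pd

# Hypothetical extract in the same tab-separated, headerless layout.
# The last row has no PubMed ID.
sample = (
    "ZDB-PUB-050711-9\t16000385\n"
    "ZDB-PUB-020913-3\t12223419\n"
    "ZDB-PUB-170525-1\t\n"
)

# With dtype=str the IDs stay as text and the gap becomes NaN.
as_str = pd.read_csv(io.StringIO(sample), sep="\t", header=None,
                     names=["zfin_pub", "pmid"], dtype=str)

# With type inference, a numeric column containing a gap is read as float.
inferred = pd.read_csv(io.StringIO(sample), sep="\t", header=None,
                       names=["zfin_pub", "pmid"])

print(as_str["pmid"].tolist())
print(inferred["pmid"].dtype)  # float64
```

Keeping the IDs as strings also means the merge keys have the same type in every table.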

In [3]:
zfin_models.head()

Unnamed: 0,zfin_fish_id,zfin_env_id,is_a_model,do_id,do_name,zfin_pub,pmid,evidence_code
0,ZDB-FISH-150901-8233,ZDB-EXP-041102-1,is_a_model,DOID:13580,cholestasis,ZDB-PUB-050711-9,16000385,"TAS, ECO:0000304"
1,ZDB-FISH-150901-16512,ZDB-EXP-041102-1,is_a_model,DOID:0060468,Holt-Oram syndrome,ZDB-PUB-020913-3,12223419,"TAS, ECO:0000304"
2,ZDB-FISH-150901-21533,ZDB-EXP-041102-1,is_a_model,DOID:5230,hepatoerythropoietic porphyria,ZDB-PUB-981208-38,9806541,"TAS, ECO:0000304"
3,ZDB-FISH-150901-21533,ZDB-EXP-041102-1,is_a_model,DOID:5230,hepatoerythropoietic porphyria,ZDB-PUB-140325-4,24652768,"TAS, ECO:0000304"
4,ZDB-FISH-150901-24601,ZDB-EXP-041102-1,is_a_model,DOID:5295,intestinal disease,ZDB-PUB-151222-17,26685876,"TAS, ECO:0000304"


In [4]:
zfin_models.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3056 entries, 0 to 3055
Data columns (total 8 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   zfin_fish_id   1595 non-null   object
 1   zfin_env_id    1595 non-null   object
 2   is_a_model     1595 non-null   object
 3   do_id          3056 non-null   object
 4   do_name        3056 non-null   object
 5   zfin_pub       3056 non-null   object
 6   pmid           3054 non-null   object
 7   evidence_code  3056 non-null   object
dtypes: object(8)
memory usage: 191.1+ KB


The zfinpubs.txt file contains bibliographic information about the papers curated in the database, including title and journal citation.

In [21]:
zfin_bib = pd.read_csv('https://zfin.org/downloads/zfinpubs.txt', sep='\t', header=None,
                       names=['zfin_pub', 'pmid', 'authors', 'title', 'journal', 'year', 'vol', 'pages'],
                       dtype=str)

In [22]:
zfin_bib.head()

Unnamed: 0,zfin_pub,pmid,authors,title,journal,year,vol,pages
0,ZDB-PUB-190613-8,31188077,"Lee, M.S., Philippe, J., Katsanis, N., Zhou, W.",Polyketide Synthase Plays a Conserved Role in ...,Zebrafish,2019,16(4),363-369
1,ZDB-PUB-190127-23,30128619,"Galli, C.",UC Davis Transgenic Animal Research Conference...,Transgenic Research,2018,27,461-466
2,ZDB-PUB-181107-8,30398377,"Garcia, G.R., Shankar, P., Dunham, C.L., Garci...",Signaling Events Downstream of AHR Activation ...,Environmental health perspectives,2018,126,117002
3,ZDB-PUB-190516-6,31085211,"Broening, H.W., La Du, J., Carr, G.J., Nash, J...",Determination of narcotic potency using a neur...,Neurotoxicology,2019,74,67-73
4,ZDB-PUB-171204-49,29197232,"Pitt, J.A., Kozal, J.S., Jayasundara, N., Mass...","Uptake, tissue distribution, and toxicity of p...","Aquatic toxicology (Amsterdam, Netherlands)",2017,194,185-194


In [23]:
zfin_bib[['pmid', 'title']].head()

Unnamed: 0,pmid,title
0,31188077,Polyketide Synthase Plays a Conserved Role in ...
1,30128619,UC Davis Transgenic Animal Research Conference...
2,30398377,Signaling Events Downstream of AHR Activation ...
3,31085211,Determination of narcotic potency using a neur...
4,29197232,"Uptake, tissue distribution, and toxicity of p..."


In the disease models table, we are mainly interested in the fish identifier, the DO identifier and name, and the PubMed ID.

In [24]:
zfin_models[['zfin_fish_id', 'do_id', 'do_name', 'pmid']].head()

Unnamed: 0,zfin_fish_id,do_id,do_name,pmid
0,ZDB-FISH-150901-8233,DOID:13580,cholestasis,16000385
1,ZDB-FISH-150901-16512,DOID:0060468,Holt-Oram syndrome,12223419
2,ZDB-FISH-150901-21533,DOID:5230,hepatoerythropoietic porphyria,9806541
3,ZDB-FISH-150901-21533,DOID:5230,hepatoerythropoietic porphyria,24652768
4,ZDB-FISH-150901-24601,DOID:5295,intestinal disease,26685876


Merge the relevant columns of the disease models dataframe with the bibliography, using the PubMed ID as the key.

In [25]:
zf = zfin_models[['zfin_fish_id', 'do_id', 'do_name', 'pmid']].merge(zfin_bib[['zfin_pub', 'pmid', 'title']], on='pmid')

In [31]:
zf.head()

Unnamed: 0,zfin_fish_id,do_id,do_name,pmid,zfin_pub,title
0,ZDB-FISH-150901-8233,DOID:13580,cholestasis,16000385,ZDB-PUB-050711-9,A genetic screen in zebrafish identifies the m...
1,ZDB-FISH-150901-22931,DOID:9452,fatty liver disease,16000385,ZDB-PUB-050711-9,A genetic screen in zebrafish identifies the m...
2,ZDB-FISH-150901-14732,DOID:899,choledochal cyst,16000385,ZDB-PUB-050711-9,A genetic screen in zebrafish identifies the m...
3,ZDB-FISH-150901-16512,DOID:0060468,Holt-Oram syndrome,12223419,ZDB-PUB-020913-3,The heartstrings mutation in zebrafish causes ...
4,ZDB-FISH-150901-12774,DOID:0060468,Holt-Oram syndrome,12223419,ZDB-PUB-020913-3,The heartstrings mutation in zebrafish causes ...
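
One caveat when merging on `pmid`: pandas treats missing keys as equal to each other, so rows without a PubMed ID on both sides get paired up. The `info()` output further down reports fewer non-null `pmid` values than rows, which would be consistent with this. A minimal sketch on made-up data (`PUB-X`, `PUB-Y` and the second disease name are hypothetical) shows the effect and one possible way to avoid it:

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the disease models and bibliography tables.
models = pd.DataFrame({
    "do_name": ["cholestasis", "disease without PMID"],
    "pmid": ["16000385", np.nan],
})
bib = pd.DataFrame({
    "zfin_pub": ["ZDB-PUB-050711-9", "PUB-X", "PUB-Y"],
    "pmid": ["16000385", np.nan, np.nan],
})

# The NaN key in `models` matches both NaN keys in `bib`.
naive = models.merge(bib, on="pmid")

# Dropping rows with a missing key first keeps only genuine matches.
safe = (models.dropna(subset=["pmid"])
              .merge(bib.dropna(subset=["pmid"]), on="pmid"))

print(len(naive), len(safe))  # 3 1
```

Since the disease models table also carries `zfin_pub`, merging on that column instead may be another option, assuming every row has a publication ID.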


To map gene symbols to the disease models, we need the gene_publication.txt mapping table, which links genes to ZFIN publications.

In [33]:
zfin_gene_pub = pd.read_csv('https://zfin.org/downloads/gene_publication.txt',
                            sep='\t',
                            dtype=str,
                            header=None,
                            names=['gene', 'gene_id', 'zfin_pub', 'pub_type', 'pmid'])

In [34]:
zfin_gene_pub.head()

Unnamed: 0,gene,gene_id,zfin_pub,pub_type,pmid
0,panx3,ZDB-GENE-091204-114,ZDB-PUB-170525-1,Data Submission,
1,tradv3.0.4,ZDB-GENE-170601-127,ZDB-PUB-110330-1,Curation,21873635.0
2,fgfr1a,ZDB-GENE-980526-255,ZDB-PUB-181119-1,Journal,30447699.0
3,asb12b,ZDB-GENE-040822-25,ZDB-PUB-061101-1,Curation,
4,rrm2b,ZDB-GENE-030616-614,ZDB-PUB-130213-1,Curation,


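Merge the gene mapping table with the previously generated dataframe on the ZFIN publication ID. Because the join is by publication, each disease model is paired with every gene associated with the same paper, not only the gene underlying the model.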

In [35]:
zfin_disease = zf.merge(zfin_gene_pub[['gene', 'zfin_pub']], on='zfin_pub')

In [37]:
zfin_disease.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 19105 entries, 0 to 19104
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   zfin_fish_id  12461 non-null  object
 1   do_id         19105 non-null  object
 2   do_name       19105 non-null  object
 3   pmid          18393 non-null  object
 4   zfin_pub      19105 non-null  object
 5   title         19105 non-null  object
 6   gene          19105 non-null  object
dtypes: object(7)
memory usage: 1.2+ MB


In [38]:
zfin_disease.head()

Unnamed: 0,zfin_fish_id,do_id,do_name,pmid,zfin_pub,title,gene
0,ZDB-FISH-150901-8233,DOID:13580,cholestasis,16000385,ZDB-PUB-050711-9,A genetic screen in zebrafish identifies the m...,trappc11
1,ZDB-FISH-150901-8233,DOID:13580,cholestasis,16000385,ZDB-PUB-050711-9,A genetic screen in zebrafish identifies the m...,vps18
2,ZDB-FISH-150901-8233,DOID:13580,cholestasis,16000385,ZDB-PUB-050711-9,A genetic screen in zebrafish identifies the m...,tubg1
3,ZDB-FISH-150901-8233,DOID:13580,cholestasis,16000385,ZDB-PUB-050711-9,A genetic screen in zebrafish identifies the m...,sox9a
4,ZDB-FISH-150901-8233,DOID:13580,cholestasis,16000385,ZDB-PUB-050711-9,A genetic screen in zebrafish identifies the m...,thoc2
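
The repeated rows above come from joining two tables that both have several rows per publication. A small sketch with hypothetical identifiers (`FISH-1`, `FISH-2`, `PUB-1`) shows how the row count multiplies:

```python
import pandas as pd

# Two fish models and three genes, all linked to the same toy publication.
models = pd.DataFrame({
    "zfin_fish_id": ["FISH-1", "FISH-2"],
    "do_name": ["myotonic disease", "myotonic disease"],
    "zfin_pub": ["PUB-1", "PUB-1"],
})
gene_pub = pd.DataFrame({
    "gene": ["mbnl2", "myod1", "pax2a"],
    "zfin_pub": ["PUB-1", "PUB-1", "PUB-1"],
})

# Every model row is combined with every gene row of the same publication.
paired = models.merge(gene_pub, on="zfin_pub")
print(len(paired))  # 6 (2 models x 3 genes)

# Distinct genes attached to each fish model.
per_model = paired.groupby("zfin_fish_id")["gene"].nunique()
print(per_model)
```

The resulting gene list is therefore best read as candidates per paper that still need manual review.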


In [39]:
zfin_disease.gene.nunique()

4842

In [40]:
zfin_disease.groupby('title').count()

title,zfin_fish_id,do_id,do_name,pmid,zfin_pub,gene
&#x3b3;-Aminobutyric acid receptor alpha 1 subunit loss of function causes genetic generalized epilepsy by impairing inhibitory network neurodevelopment,22,22,22,22,22,22
3-ketodihydrosphingosine reductase mutation induces steatosis and hepatic injury in zebrafish,104,104,104,104,104,104
<i>BMI1</i> Drives Metastasis of Prostate Cancer in Caucasian and African-American Men and Is A Potential Therapeutic Target: Hypothesis Tested in Race-specific Models.,0,1,1,1,1,1
"<i>EXTL3</i> mutations cause skeletal dysplasia, immune deficiency, and developmental delay.",0,2,2,2,2,2
<i>FLNC</i> Gene Splice Mutations Cause Dilated Cardiomyopathy.,3,6,6,6,6,6
...,...,...,...,...,...,...
tp53-dependent and independent signaling underlies the pathogenesis and possible prevention of Acrofacial Dysostosis - Cincinnati type,6,6,6,6,6,6
trappc11 is required for protein glycosylation in zebrafish and humans,34,34,34,34,34,34
α-COP binding to the survival motor neuron protein SMN is required for neuronal process outgrowth,1,1,1,1,1,1
α5β1 integrin recycling promotes Arp2/3-independent cancer cell invasion via the formin FHOD3,2,4,4,4,4,4
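
`count()` reports the number of non-null values in each column per group, which is why `zfin_fish_id` can be lower than the other columns (or zero) for some titles. If the goal is a per-paper summary, named aggregation is one alternative. The sketch below uses made-up rows, and the output column names `rows`, `genes` and `fish` are assumptions for illustration:

```python
import numpy as np
import pandas as pd

# Toy version of the merged table; paper B has no fish ID.
df = pd.DataFrame({
    "title": ["A", "A", "A", "B"],
    "zfin_fish_id": ["FISH-1", "FISH-1", np.nan, np.nan],
    "gene": ["g1", "g2", "g1", "g3"],
})

summary = df.groupby("title").agg(
    rows=("gene", "size"),             # all rows in the group, nulls included
    genes=("gene", "nunique"),         # distinct gene symbols
    fish=("zfin_fish_id", "nunique"),  # distinct fish IDs, nulls ignored
)
print(summary)
```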


In [89]:
zfin_disease[zfin_disease['title'] == 'Zebrafish deficient for Muscleblind-like 2 exhibit features of myotonic dystrophy']

Unnamed: 0,zfin_fish_id,do_id,do_name,pmid,zfin_pub,title,gene
2369,ZDB-FISH-150901-8303,DOID:450,myotonic disease,21303839,ZDB-PUB-110214-19,Zebrafish deficient for Muscleblind-like 2 exh...,clcn1b
2370,ZDB-FISH-150901-8303,DOID:450,myotonic disease,21303839,ZDB-PUB-110214-19,Zebrafish deficient for Muscleblind-like 2 exh...,mbnl1
2371,ZDB-FISH-150901-8303,DOID:450,myotonic disease,21303839,ZDB-PUB-110214-19,Zebrafish deficient for Muscleblind-like 2 exh...,mbnl2
2372,ZDB-FISH-150901-8303,DOID:450,myotonic disease,21303839,ZDB-PUB-110214-19,Zebrafish deficient for Muscleblind-like 2 exh...,mbnl3
2373,ZDB-FISH-150901-8303,DOID:450,myotonic disease,21303839,ZDB-PUB-110214-19,Zebrafish deficient for Muscleblind-like 2 exh...,myod1
2374,ZDB-FISH-150901-8303,DOID:450,myotonic disease,21303839,ZDB-PUB-110214-19,Zebrafish deficient for Muscleblind-like 2 exh...,pax2a
2375,ZDB-FISH-150901-8303,DOID:450,myotonic disease,21303839,ZDB-PUB-110214-19,Zebrafish deficient for Muscleblind-like 2 exh...,scn4ab
2376,ZDB-FISH-150901-8303,DOID:450,myotonic disease,21303839,ZDB-PUB-110214-19,Zebrafish deficient for Muscleblind-like 2 exh...,tnnt2a
2377,ZDB-FISH-150901-22841,DOID:450,myotonic disease,21303839,ZDB-PUB-110214-19,Zebrafish deficient for Muscleblind-like 2 exh...,clcn1b
2378,ZDB-FISH-150901-22841,DOID:450,myotonic disease,21303839,ZDB-PUB-110214-19,Zebrafish deficient for Muscleblind-like 2 exh...,mbnl1


Export the data to a TSV file.

In [90]:
zfin_disease.to_csv('zebrafish_disease_models.tsv', sep='\t')
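
By default `to_csv` also writes the row index as an unnamed first column. If the index carries no information, passing `index=False` is one way to leave it out. The sketch below demonstrates this on two toy rows and writes to a string instead of a file:

```python
import io
import pandas as pd

# Two toy rows with the same kind of columns as the exported table.
df = pd.DataFrame({
    "do_id": ["DOID:13580", "DOID:13580"],
    "pmid": ["16000385", "16000385"],
    "gene": ["trappc11", "vps18"],
})

with_index = df.to_csv(sep="\t")                  # header starts with an empty field
without_index = df.to_csv(sep="\t", index=False)  # header starts with do_id

# Reading the index-free version back restores the original columns.
roundtrip = pd.read_csv(io.StringIO(without_index), sep="\t", dtype=str)
print(list(roundtrip.columns))
```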