# Accessing Ensembl with BioServices

**This notebook is used to test the BioServices API.** 

**Tested with BioServices 1.3.4**

**author: TC**

It is based on the Ensembl API and does not provide other examples.
For more advanced usage, please see the other Ensembl notebooks provided in BioServices.

- [Introductory example](#introduction)
- [Archive](#archive)
- [Comparative genomics](#comparative)
- [Cross References](#reference)
- [Information](#information)
- [Lookup](#lookup)
- [Mapping](#mapping)
- [Ontology and Taxonomy](#ontology)
- [Overlap](#overlap)
- [Regulation](#regulation)
- [Sequences](#sequences)
- [Variation](#variation)

**References**: http://rest.ensembl.org/

In [1]:
from bioservices import ensembl
# for debugging
reload(ensembl)

<module 'bioservices.ensembl' from '/home/cokelaer/Work/github/bioservices/src/bioservices/ensembl.pyc'>

In [2]:
e = ensembl.Ensembl()

## <a name="introduction"></a> Introductory example

- Most of the methods take one or two compulsory arguments.
- Some are just informative (the get_info family).
- One argument that is not part of the Ensembl API is **frmt**. It can be set to one of the Ensembl output formats. The list of valid formats depends on the method. These two are always available:
    - json
    - jsonp
- By default, output is in json format, which is transformed into a Python dictionary.

In [3]:
res = e.get_archive('ENSG00000157764')
res

{u'assembly': u'GRCh38',
 u'id': u'ENSG00000157764',
 u'is_current': u'1',
 u'latest': u'ENSG00000157764.10',
 u'peptide': None,
 u'possible_replacement': [],
 u'release': u'77',
 u'type': u'Gene',
 u'version': u'10'}

In the following example, the format is set to xml through **frmt**:

In [4]:
print(e.get_archive('ENSG00000157764', frmt='xml'))

<html>
 <head>
  <title>
   EnsEMBL::REST
  </title>
 </head>
 <body>
  <pre>--- 
assembly: GRCh38
id: ENSG00000157764
is_current: 1
latest: ENSG00000157764.10
peptide: ~
possible_replacement: []

release: 77
type: Gene
version: 10
</pre>
 </body>
</html>


Here is another example where the requested **frmt** is json, but a separate
parameter (nh_format) also specifies the format:

In [5]:
res = e.get_genetree_by_member_id('ENSG00000157764', frmt='json', 
                                  nh_format='phylip')
print(res[0:100])

<?xml version="1.0" encoding="UTF-8"?>

<phyloxml xsi:schemaLocation="http://www.phyloxml.org http:/


> Here, the input frmt (json) is changed since nh_format can only be in phyloxml format. When a parameter specifies the format, it may override the value of the argument **frmt**, even if it is provided.


In [6]:
# If your identifier is incorrect, you'll get a 500 error code 
# returned (most probably)
wrong = e.get_map_cds_to_region('ENST0000288602', '1..1000')
good = e.get_map_cds_to_region('ENST00000288602', '1..1000')
wrong, good['mappings'][0]


(500,
 {u'assembly_name': u'GRCh38',
  u'coord_system': u'chromosome',
  u'end': 140924703,
  u'gap': 0,
  u'rank': 0,
  u'seq_region_name': u'7',
  u'start': 140924566,
  u'strand': -1})
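
The examples above return three different shapes: a dictionary (parsed json), a markup string (xml or phyloxml), and an integer status code when the request fails. A minimal sketch for telling them apart before indexing a result is given below. `classify_result` is a hypothetical helper, not part of BioServices, and it assumes that failures always come back as plain integers of 400 or more (so that the `1` returned by a ping is not mistaken for an error).

```python
def classify_result(res):
    """Classify a returned value as 'error', 'json', 'xml' or 'other'."""
    string_types = (str, type(u''))  # also covers Python 2 unicode strings
    # Assumption: failed requests are returned as an integer HTTP status code
    if isinstance(res, int) and not isinstance(res, bool) and res >= 400:
        return 'error'
    # Parsed json comes back as a dictionary or a list
    if isinstance(res, (dict, list)):
        return 'json'
    # xml, phyloxml and html-wrapped output all start with a '<'
    if isinstance(res, string_types) and res.lstrip().startswith('<'):
        return 'xml'
    return 'other'


# Values shaped like the outputs shown above (no network access needed)
wrong = 500
good = {'mappings': [{'seq_region_name': '7', 'start': 140924566,
                      'end': 140924703, 'strand': -1}]}
tree = '<?xml version="1.0" encoding="UTF-8"?>\n\n<phyloxml>'

for name, res in [('wrong', wrong), ('good', good), ('tree', tree)]:
    print("%s -> %s" % (name, classify_result(res)))
```

Checking the kind first avoids a `TypeError` when code such as `res['mappings']` is applied to an error code.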

## <a name="archive"></a> Archive

In [7]:
# Get archived sequence given an identifier
archive = e.get_archive('ENSG00000157764')

In [8]:
identifiers = ["ENSG00000157764", "ENSG00000248378" ]
archives = e.post_archive(identifiers)
assert archive == archives[0]

## <a name="comparative"></a> Comparative genomics

### Gene tree by identifier

In [9]:
res = e.get_genetree_by_id('ENSGT00390000003602', nh_format='simple')
res['id'], res.keys()

(u'ENSGT00390000003602', [u'type', u'tree', u'rooted', u'id'])

In [10]:
res = e.get_genetree_by_id('ENSGT00390000003602', frmt='phyloxml')
print(res[0:200])

<?xml version="1.0" encoding="UTF-8"?>

<phyloxml xsi:schemaLocation="http://www.phyloxml.org http://www.phyloxml.org/1.10/phyloxml.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="ht


Retrieve a genetree by member id; the call returns a phylip structure.
This takes a few seconds and the output XML is large.

In [11]:
# Here, the input frmt (json) is changed since nh_format can be 
# only in phyloxml format
res = e.get_genetree_by_member_id('ENSG00000157764', frmt='json', 
                                  nh_format='phylip')

In [12]:
len(res)

2235232

In [13]:
print(res[0:500])

<?xml version="1.0" encoding="UTF-8"?>

<phyloxml xsi:schemaLocation="http://www.phyloxml.org http://www.phyloxml.org/1.10/phyloxml.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.phyloxml.org">
  <phylogeny rooted="true" type="gene tree">
    <clade branch_length="0">
      <taxonomy>
        <id>33213</id>
        <scientific_name>Bilateria</scientific_name>
      </taxonomy>
      <events>
        <type>speciation_or_duplication</type>
        <duplications>1</dup


In [14]:
res = e.get_genetree_by_member_symbol('human', 'BRCA2', 
                                      nh_format='simple')

In [15]:
print(res[0:200])

((((((((ENSPFOP00000001575:0.046083,ENSXMAP00000006983:0.065551):0.43822,ENSONIP00000006940:0.359035):0.019582,((ENSTRUP00000015030:0.077336,ENSTNIP00000002435:0.099898):0.208834,ENSGACP00000015199:0.


In [16]:
region = '2:106040000-106040050'
species = 'taeniopygia_guttata'
res = e.get_alignment_by_region(region, species, 
                                species_set_group='sauropsids')
res[0]['tree']


u'((gallus_gallus_2_100370206_100370256[+]:0.0414,meleagris_gallopavo_3_49885157_49885207[+]:0.0414)Ggal-Mgal[2]:0.1242,taeniopygia_guttata_2_106040000_106040050[+]:0.1715)Ggal-Mgal-Tgut[3]:0.3044;'
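
The tree returned with an alignment is a Newick string, so the aligned regions can be listed without a dedicated parser. The following is only a sketch: `newick_leaves` is a hypothetical helper, and it assumes that every leaf carries a branch length and that labels are unquoted, as in the output above.

```python
import re


def newick_leaves(newick):
    """Return the leaf labels of a Newick string, in order of appearance."""
    # A leaf label directly follows '(' or ',' and ends at the ':' that
    # introduces its branch length; internal node labels follow ')'.
    return re.findall(r'[(,]([^(),:;]+):', newick)


# Tree copied from the output above
tree = ('((gallus_gallus_2_100370206_100370256[+]:0.0414,'
        'meleagris_gallopavo_3_49885157_49885207[+]:0.0414)Ggal-Mgal[2]:0.1242,'
        'taeniopygia_guttata_2_106040000_106040050[+]:0.1715)'
        'Ggal-Mgal-Tgut[3]:0.3044;')

leaves = newick_leaves(tree)
for leaf in leaves:
    print(leaf)
```

For large gene trees or quoted labels, a real Newick parser is a safer choice than a regular expression.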

In [17]:
res = e.get_homology_by_id('ENSG00000157764')

In [18]:
res = e.get_homology_by_id('ENSG00000157764', frmt='xml')


In [19]:
res = e.get_homology_by_id('ENSG00000157764', format='condensed', 
                           type='orthologues', target_taxon='10090')
res['data'][0]

{u'homologies': [{u'id': u'ENSMUSG00000002413',
   u'method_link_type': u'ENSEMBL_ORTHOLOGUES',
   u'protein_id': u'ENSMUSP00000002487',
   u'species': u'mus_musculus',
   u'taxonomy_level': u'Euarchontoglires',
   u'type': u'ortholog_one2one'}],
 u'id': u'ENSG00000157764'}
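
The condensed homology output nests the homologues under `data` and then `homologies`. As a sketch, the structure can be flattened into a dictionary keyed by the source gene; `homology_targets` is a hypothetical helper, and the sample dictionary simply repeats the entry shown above.

```python
def homology_targets(result):
    """Map each source gene id to a list of (species, id, type) tuples."""
    targets = {}
    for entry in result.get('data', []):
        targets[entry['id']] = [
            (h['species'], h['id'], h['type'])
            for h in entry.get('homologies', [])
        ]
    return targets


# Same shape as the output above (copied by hand, no network access)
sample = {'data': [{
    'id': 'ENSG00000157764',
    'homologies': [{'id': 'ENSMUSG00000002413',
                    'method_link_type': 'ENSEMBL_ORTHOLOGUES',
                    'protein_id': 'ENSMUSP00000002487',
                    'species': 'mus_musculus',
                    'taxonomy_level': 'Euarchontoglires',
                    'type': 'ortholog_one2one'}],
}]}

targets = homology_targets(sample)
print(targets['ENSG00000157764'])
```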

## <a name="reference"></a> Cross references

In [20]:
res = e.get_xrefs_by_id('ENST00000288602', external_db='PDB', 
                        all_levels=True)
res[0]

{u'db_display_name': u'PDB',
 u'dbname': u'PDB',
 u'description': None,
 u'display_id': u'1UWH',
 u'info_text': u'',
 u'info_type': u'DEPENDENT',
 u'primary_id': u'1UWH',
 u'synonyms': [],
 u'version': u'0'}

In [21]:
res = e.get_xrefs_by_name('BRCA2', 'human')    
res[0]


{u'db_display_name': u'Vega gene',
 u'dbname': u'Vega_gene',
 u'description': None,
 u'display_id': u'BRCA2',
 u'info_text': u'',
 u'info_type': u'NONE',
 u'primary_id': u'OTTHUMG00000017411',
 u'synonyms': [],
 u'version': u'1'}

In [22]:
res = e.get_xrefs_by_symbol('BRCA2', 'homo_sapiens',
                            external_db='HGNC')    
res

[{u'id': u'ENSG00000139618', u'type': u'gene'},
 {u'id': u'LRG_293', u'type': u'gene'}]

## <a name="information"></a> Information

In [23]:
len(e.get_info_analysis('human'))

233

In [24]:
e.get_info_assembly('human')['karyotype']

[u'1',
 u'2',
 u'3',
 u'4',
 u'5',
 u'6',
 u'7',
 u'8',
 u'9',
 u'10',
 u'11',
 u'12',
 u'13',
 u'14',
 u'15',
 u'16',
 u'17',
 u'18',
 u'19',
 u'20',
 u'21',
 u'22',
 u'X',
 u'Y',
 u'MT']

In [25]:
e.get_info_assembly_by_region('homo_sapiens', 'X')

{u'assembly_exception_type': u'REF',
 u'assembly_name': u'GRCh38',
 u'coordinate_system': u'chromosome',
 u'is_chromosome': 1,
 u'length': 156040895}

In [26]:
len(e.get_info_biotypes('human'))

55

In [27]:
e.get_info_compara_methods()

{u'ConservationScore.conservation_score': [u'GERP_CONSERVATION_SCORE'],
 u'ConstrainedElement.constrained_element': [u'GERP_CONSTRAINED_ELEMENT'],
 u'Family.family': [u'FAMILY'],
 u'GenomicAlignBlock.multiple_alignment': [u'PECAN'],
 u'GenomicAlignBlock.pairwise_alignment': [u'TRANSLATED_BLAT_NET',
  u'BLASTZ_NET',
  u'LASTZ_NET',
  u'LASTZ_PATCH'],
 u'GenomicAlignTree.ancestral_alignment': [u'EPO'],
 u'GenomicAlignTree.tree_alignment': [u'EPO_LOW_COVERAGE'],
 u'Homology.homology': [u'ENSEMBL_PROJECTIONS',
  u'ENSEMBL_PARALOGUES',
  u'ENSEMBL_ORTHOLOGUES'],
 u'NCTree.nc_tree_node': [u'NC_TREES'],
 u'ProteinTree.protein_tree_node': [u'PROTEIN_TREES'],
 u'SyntenyRegion.synteny': [u'SYNTENY']}

In [28]:
e.get_info_compara_by_method('EPO')[0]

{u'method': u'EPO',
 u'name': u'5 teleost fish EPO',
 u'species_set': [u'oryzias_latipes',
  u'takifugu_rubripes',
  u'gasterosteus_aculeatus',
  u'tetraodon_nigroviridis',
  u'danio_rerio'],
 u'species_set_group': u'fish'}

In [29]:
e.get_info_comparas()

{u'comparas': [{u'name': u'multi', u'release': 77}]}

In [30]:
e.get_info_data()

{u'releases': [77]}

In [31]:
res = e.get_info_external_dbs('human')
[x['name'] for x in res if 'hgnc' in x['name'].lower()]

[u'HGNC',
 u'HGNC_curated_gene',
 u'HGNC_automatic_gene',
 u'HGNC_curated_transcript',
 u'HGNC_automatic_transcript',
 u'HGNC_trans_name']

In [32]:
e.get_info_ping()


1

In [33]:
e.get_info_rest()


{u'release': u'3.1.0'}

In [34]:
res = e.get_info_software()
res

{u'release': 77}

In [35]:
res = e.get_info_species()
[x['name'] for x in res['species'] if 'ovis' in x['name']]


[u'ovis_aries']

## <a name="lookup"></a> Lookup

In [36]:
res = e.get_lookup_by_id('ENSG00000157764', expand=True)
res.keys()

[u'assembly_name',
 u'display_name',
 u'description',
 u'seq_region_name',
 u'logic_name',
 u'object_type',
 u'start',
 u'id',
 u'source',
 u'db_type',
 u'biotype',
 u'end',
 u'Transcript',
 u'species',
 u'strand']

In [37]:
res = e.post_lookup_by_id(["ENSG00000157764", "ENSG00000248378" ], 
                          expand=0)
res['ENSG00000157764']


{u'assembly_name': u'GRCh38',
 u'biotype': u'protein_coding',
 u'db_type': u'core',
 u'description': u'B-Raf proto-oncogene, serine/threonine kinase [Source:HGNC Symbol;Acc:HGNC:1097]',
 u'display_name': u'BRAF',
 u'end': 140924764,
 u'id': u'ENSG00000157764',
 u'logic_name': u'ensembl_havana_gene',
 u'object_type': u'Gene',
 u'seq_region_name': u'7',
 u'source': u'ensembl_havana',
 u'species': u'homo_sapiens',
 u'start': 140719327,
 u'strand': -1}
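
A lookup record carries enough information to derive simple quantities such as the length of the feature. The sketch below assumes that `start` and `end` are 1-based and inclusive; `feature_length` is a hypothetical helper, and the record is a trimmed copy of the output above.

```python
def feature_length(record):
    """Length in base pairs, assuming 1-based inclusive start/end."""
    return record['end'] - record['start'] + 1


# Trimmed copy of the BRAF record shown above
braf = {'display_name': 'BRAF',
        'seq_region_name': '7',
        'start': 140719327,
        'end': 140924764,
        'strand': -1}

length = feature_length(braf)
print("%s spans %d bp on chromosome %s"
      % (braf['display_name'], length, braf['seq_region_name']))
```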

In [38]:
res = e.get_lookup_by_symbol('homo_sapiens', 'BRCA2', expand=True)
len(res['Transcript'])

7

In [39]:
res = e.post_lookup_by_symbol('human', ["BRCA2", "BRAF" ], expand=True)
len(res['BRCA2']['Transcript'])

7

## <a name="mapping"></a> Mapping

- Convert from cDNA coordinates to genomic coordinates. Output reflects forward orientation coordinates as returned from the Ensembl API.
- `GET map/cds/:id/:region`: Convert from CDS coordinates to genomic coordinates. Output reflects forward orientation coordinates as returned from the Ensembl API.
- `GET map/:species/:asm_one/:region/:asm_two`: Convert the coordinates of one assembly to another.
- `GET map/translation/:id/:region`: Convert from protein (translation) coordinates to genomic coordinates. Output reflects forward orientation coordinates as returned from the Ensembl API.
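
To make the resource patterns above concrete, the sketch below fills the `:name` placeholders of the three patterns that are spelled out in the list. It only builds the path as a string; it does not send a request or URL-encode anything. `MAP_TEMPLATES` and `map_path` are hypothetical names introduced for illustration and are not part of BioServices.

```python
# Resource patterns from the list above, with ':name' written as '{name}'
MAP_TEMPLATES = {
    'cds': 'map/cds/{id}/{region}',
    'translation': 'map/translation/{id}/{region}',
    'assembly': 'map/{species}/{asm_one}/{region}/{asm_two}',
}


def map_path(kind, **params):
    """Fill the placeholders of one mapping resource pattern."""
    return MAP_TEMPLATES[kind].format(**params)


cds_path = map_path('cds', id='ENST00000288602', region='1..1000')
asm_path = map_path('assembly', species='human', asm_one='GRCh37',
                    region='X:1000000..1000100:1', asm_two='GRCh38')
print(cds_path)
print(asm_path)
```

The wrapper methods used in the cells below hide this step, so the sketch is mainly useful for reading the Ensembl documentation alongside the notebook.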

In [40]:
# the commented statement does not work
# res = e.get_map_assembly_one_to_two('GRCh37', 'NCBI36',
# region='X:10000000..1000100:1', species='human')
res = e.get_map_assembly_one_to_two('GRCh37', 'GRCh38', 
                                    region='X:1000000..1000100:1')
res

{u'mappings': [{u'mapped': {u'assembly': u'GRCh38',
    u'coordinate_system': u'chromosome',
    u'end': 1039365,
    u'seq_region_name': u'X',
    u'start': 1039265,
    u'strand': 1},
   u'original': {u'assembly': u'GRCh37',
    u'coordinate_system': u'chromosome',
    u'end': 1000100,
    u'seq_region_name': u'X',
    u'start': 1000000,
    u'strand': 1}}]}
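
Each entry of `mappings` pairs an `original` region with its `mapped` counterpart, so the coordinate shift between the two assemblies can be read off directly. A minimal sketch, assuming the structure shown above (`mapping_offsets` is a hypothetical helper):

```python
def mapping_offsets(result):
    """Return (region name, original start, mapped start, shift) tuples."""
    offsets = []
    for m in result['mappings']:
        orig, mapped = m['original'], m['mapped']
        offsets.append((orig['seq_region_name'],
                        orig['start'],
                        mapped['start'],
                        mapped['start'] - orig['start']))
    return offsets


# Same values as the output above
sample = {'mappings': [{
    'mapped': {'assembly': 'GRCh38', 'seq_region_name': 'X',
               'start': 1039265, 'end': 1039365, 'strand': 1},
    'original': {'assembly': 'GRCh37', 'seq_region_name': 'X',
                 'start': 1000000, 'end': 1000100, 'strand': 1},
}]}

offsets = mapping_offsets(sample)
print(offsets)
```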

In [41]:
res = e.get_map_translation_to_region('ENSP00000288602', '100..300')
res['mappings'][0]  

{u'assembly_name': u'GRCh38',
 u'coord_system': u'chromosome',
 u'end': 140834815,
 u'gap': 0,
 u'rank': 0,
 u'seq_region_name': u'7',
 u'start': 140834609,
 u'strand': -1}

In [42]:
res = e.get_map_cds_to_region('ENST00000288602', '1..1000')
res['mappings'][0]

{u'assembly_name': u'GRCh38',
 u'coord_system': u'chromosome',
 u'end': 140924703,
 u'gap': 0,
 u'rank': 0,
 u'seq_region_name': u'7',
 u'start': 140924566,
 u'strand': -1}

In [43]:
res = e.get_map_cdna_to_region('ENST00000288602', '100..300')
res['mappings'][0]

{u'assembly_name': u'GRCh38',
 u'coord_system': u'chromosome',
 u'end': 140924665,
 u'gap': 0,
 u'rank': 0,
 u'seq_region_name': u'7',
 u'start': 140924566,
 u'strand': -1}

## <a name="ontology"></a> Ontologies and Taxonomy

In [44]:
res = e.get_ontology_ancestors_by_id('GO:0005667')
res[0].keys()

[u'definition',
 u'name',
 u'subsets',
 u'namespace',
 u'accession',
 u'synonyms',
 u'ontology']

In [45]:
res = e.get_ontology_ancestors_chart_by_id('GO:0005667')

In [46]:
res = e.get_ontology_descendants_by_id('GO:0005667')
res[0]['accession']

u'GO:0043234'

In [47]:
res = e.get_ontology_by_id('GO:0005667')
res['accession']

u'GO:0005667'

In [48]:
res = e.get_ontology_by_name('transcription factor complex')

In [49]:
res = e.get_taxonomy_classification_by_id(9606)
res[0]['children']

[{u'id': u'9606',
  u'leaf': 0,
  u'name': u'Homo sapiens',
  u'scientific_name': u'Homo sapiens',
  u'tags': {u'authority': [u'Homo sapiens Linnaeus, 1758'],
   u'common name': [u'man'],
   u'ensembl alias name': [u'Human'],
   u'genbank common name': [u'human'],
   u'name': [u'Homo sapiens'],
   u'scientific name': [u'Homo sapiens']}}]

In [50]:
res = e.get_taxonomy_by_name('Homo')

In [51]:
e.get_taxonomy_by_id(9606)['scientific_name']

u'Homo sapiens'

## <a name="overlap"></a> Overlap

In [52]:
res = e.get_overlap_by_id("ENSG00000157764", feature='gene')
len(res)

3

In [53]:
# feature=transcript;feature=cds;feature=exon
res = e.get_overlap_by_region('7:140424943-140624564', 
                              species='human', feature='gene')
len(res)

4

In [54]:
res = e.get_overlap_by_translation('ENSP00000288602', type='Superfamily')
len(res)

3

In [55]:
#feature=transcript_variation;contson;
res = e.get_overlap_by_translation('ENSP00000288602', 
                                   type='missense_variant',
                                   feature='transcript_variation')

len(res)

113

In [56]:
res = e.get_overlap_by_translation('ENSP00000288602', 
                                   type='missense_variant',
                                   feature='somatic_transcript_variation')

len(res)

215

## <a name="regulation"></a> Regulation

In [57]:
e.get_regulatory_by_id('ENSR00001348195', 'human')

{u'ID': u'ENSR00001348195',
 u'activity_evidence': 1,
 u'bound_end': 48551079,
 u'bound_start': 48538280,
 u'cell_type': u'MultiCell',
 u'description': u'Predicted promoter',
 u'end': 48545479,
 u'feature_type': u'regulatory',
 u'seq_region_name': u'17',
 u'start': 48541080,
 u'strand': u'0'}

## <a name="sequences"></a> Sequences

In [58]:
sequence = e.get_sequence_by_id('ENSG00000157764', frmt='text')
print(sequence[0:120])

CGCCTCCCTTCCCCCTCCCCGCCCGACAGCGGCCGCTCGGGCCCCGGCTCTCGGTTATAAGATGGCGGCGCTGAGCGGTGGCGGTGGTGGCGGCGCGGAGCCGGGCCAGGCTCTGTTCAA


In [59]:
sequence = e.get_sequence_by_id('ENSG00000157764', frmt='fasta')
print(sequence[0:120])

>ENSG00000157764 chromosome:GRCh38:7:140719327:140924764:-1
CGCCTCCCTTCCCCCTCCCCGCCCGACAGCGGCCGCTCGGGCCCCGGCTCTCGGTTATAA
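
With `frmt='fasta'` the result is plain text, so it has to be split into header and sequence by hand. The sketch below is a small parser written for illustration (`parse_fasta` is a hypothetical helper); it assumes well-formed FASTA text in which every record starts with a `>` line.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith('>'):
            # A new header closes the previous record, if any
            if header is not None:
                records.append((header, ''.join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, ''.join(chunks)))
    return records


# First two lines of the output above
fasta = (">ENSG00000157764 chromosome:GRCh38:7:140719327:140924764:-1\n"
         "CGCCTCCCTTCCCCCTCCCCGCCCGACAGCGGCCGCTCGGGCCCCGGCTCTCGGTTATAA\n")

records = parse_fasta(fasta)
header, seq = records[0]
print(header.split()[0])
print(seq[:20])
```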


In [60]:
sequence = e.get_sequence_by_id('CCDS5863.1', frmt='fasta', 
                     object_type='transcript', db_type='otherfeatures',
                     type='cds', species='human')
print(sequence[0:120])

>CCDS5863.1
ATGGCGGCGCTGAGCGGTGGCGGTGGTGGCGGCGCGGAGCCGGGCCAGGCTCTGTTCAAC
GGGGACATGGAGCCCGAGGCCGGCGCCGGCGCCGGCGCCGCGGCCTC


In [61]:
sequence = e.get_sequence_by_id('ENSG00000157764', frmt='seqxml',
                                multiple_sequences=True,type='protein')
print(sequence[0:240])

<?xml version="1.0" encoding="UTF-8"?>

<seqXML xsi:noNamespaceSchemaLocation="http://www.seqxml.org/0.4/seqxml.xsd" seqXMLversion="0.4" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <entry id="ENSP00000419060">
    <AAseq>XSTTGL


In [62]:
sequence = e.get_sequence_by_region('X:1000000..1000100:1', 'human')
sequence

{u'id': u'chromosome:GRCh38:X:1000000:1000100:1',
 u'molecule': u'dna',
 u'seq': u'ctgtagaaacattagcctggctaacaaggtgaaaccccatctctactaacaatacaaaatattggttgggcgtggtggcgggtgcttgtaatcccagctac'}
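
Once a region is available as a dictionary, the `seq` field can be processed like any Python string. As an illustration, the sketch below computes the GC fraction; `gc_content` is a hypothetical helper and counts only the letters G and C, ignoring ambiguity codes.

```python
def gc_content(seq):
    """Fraction of G and C bases in a sequence (0.0 for an empty one)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count('G') + seq.count('C')) / float(len(seq))


# 'seq' value copied from the output above
region = {'id': 'chromosome:GRCh38:X:1000000:1000100:1',
          'molecule': 'dna',
          'seq': ('ctgtagaaacattagcctggctaacaaggtgaaaccccatctctactaacaatacaaaat'
                  'attggttgggcgtggtggcgggtgcttgtaatcccagctac')}

print("GC fraction: %.3f" % gc_content(region['seq']))
```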

In [63]:
sequence = e.get_sequence_by_region('ABBA01004489.1:1..100', 'human',
                                    frmt='json', coord_system='seqlevel')
sequence

{u'id': u'contig::ABBA01004489.1:1:100:1',
 u'molecule': u'dna',
 u'seq': u'ctgtactttccttgggatggagtagtttcgaaacacactttctgtagaatctgcaagtggatatttggacctgtctgaggaattcgttggaaacgggata'}

## <a name="variation"></a> Variation

In [64]:
e.get_variation_by_id('rs56116432', 'human')

{u'MAF': u'0.00367309',
 u'ambiguity': u'Y',
 u'ancestral_allele': u'C',
 u'evidence': [u'Multiple_observations', u'1000Genomes', u'ESP'],
 u'mappings': [{u'allele_string': u'C/T',
   u'assembly_name': u'GRCh38',
   u'coord_system': u'chromosome',
   u'end': 133256042,
   u'location': u'9:133256042-133256042',
   u'seq_region_name': u'9',
   u'start': 133256042,
   u'strand': 1}],
 u'most_severe_consequence': u'Missense variant',
 u'name': u'rs56116432',
 u'source': u'Variants (including SNPs and indels) imported from dbSNP (mapped to GRCh38)',
 u'synonyms': [],
 u'var_class': u'SNP'}
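
The genomic position of a variant sits inside the `mappings` list. A small sketch for pulling out the location and allele string of each mapping is given below; `variant_locations` is a hypothetical helper, and the sample is a trimmed copy of the record above.

```python
def variant_locations(record):
    """Return (location, allele_string) pairs from a variation record."""
    return [(m['location'], m['allele_string'])
            for m in record.get('mappings', [])]


# Trimmed copy of the record shown above
sample = {
    'name': 'rs56116432',
    'mappings': [{'allele_string': 'C/T',
                  'assembly_name': 'GRCh38',
                  'location': '9:133256042-133256042',
                  'seq_region_name': '9',
                  'start': 133256042,
                  'end': 133256042,
                  'strand': 1}],
}

locations = variant_locations(sample)
print(locations)
```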

In [65]:
res = e.get_vep_by_id('COSM476', 'human')


In [66]:
res = e.get_vep_by_id('rs116035550', 'human')

In [67]:
res = e.get_vep_by_region('9:22125503-22125502:1', 'C', 'human')
res[0]['most_severe_consequence']

u'downstream_gene_variant'