# Identifier mapping


## Yihang Xin and Alex Pico

## 2020-12-15

This vignette shows you how to map or translate identifiers from one database (e.g., Ensembl) to another (e.g., Entrez Gene). This is a common requirement for data analysis. In the context of Cytoscape, for example, identifier mapping is needed when you want to import data to overlay on a network but you don’t have matching keys. There are three distinct examples below, each highlighting a different lesson that may apply to your use cases.



# Installation
The following chunk of code installs the `py4cytoscape` module.

In [1]:
%%capture
!python3 -m pip install python-igraph requests pandas networkx
!python3 -m pip install py4cytoscape

# Prerequisites
In addition to this package (py4cytoscape version 0.0.7), you will need:

* Latest version of Cytoscape, which can be downloaded from https://cytoscape.org/download.html. Simply follow the installation instructions on screen.

* Complete installation wizard

* Launch Cytoscape

For this vignette, you’ll also need the STRING and WikiPathways apps:

* Install the STRING app from https://apps.cytoscape.org/apps/stringapp

* Install the WikiPathways app from https://apps.cytoscape.org/apps/wikipathways

You can also install an app from within a Python notebook by running `py4cytoscape.install_app('Your App')`.
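If you install apps from the notebook, a small helper can skip the ones that are already present. This is a sketch only: the helper name `install_missing_apps` is hypothetical, the installer is passed in as a function so the logic can be tried without Cytoscape, and the app names in the example call are assumed to match the App Store names.

```python
def install_missing_apps(required, installed, install):
    """Call install(name) for each required app that is not in installed.

    Returns the list of app names that were requested.
    """
    missing = [name for name in required if name not in installed]
    for name in missing:
        install(name)
    return missing


# Dry run with a stand-in installer that only records the requests.
requested = []
install_missing_apps(['stringApp', 'WikiPathways'], {'WikiPathways'}, requested.append)
print(requested)
```

With Cytoscape running, `p4c.install_app` could be passed as the installer in place of `requested.append`.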

# Import the required packages


In [2]:
import os
import sys
import pandas as pd
import py4cytoscape as p4c

# Setup Cytoscape


In [3]:
p4c.cytoscape_version_info()

{'apiVersion': 'v1',
 'cytoscapeVersion': '3.8.2',
 'automationAPIVersion': '1.0.0',
 'py4cytoscapeVersion': '0.0.7'}

# Example: Species specific considerations
When planning to import data, you need to consider the key columns you have in your network data and in your table data. It’s always recommended that you use proper identifiers as your keys (e.g., from databases like Ensembl and Uniprot-TrEMBL). Relying on conventional symbols and names is non-standard and error-prone.

Let’s start with the sample network provided by Cytoscape.

Caution: Loading a session file will discard your current session. Save first if you have networks or data you want to keep, using `p4c.save_session('path_to_file')`.

In [4]:
p4c.open_session()

Opening sampleData/sessions/Yeast Perturbation.cys...


{}

You should now see a network with just over 300 nodes. If you look at the Node Table, you’ll see that there are proper identifiers in the name column, like “YDL194W”. These are the Ensembl-supported IDs for Yeast.



# Perform identifier mapping
You need to know a few things about your network in order to run this function, e.g., the species and starting (or source) identifier type. This isn’t usually a problem, but this example highlights a unique case where the Ensembl ID type for a particular species (i.e., Yeast) has a particular format (e.g., YDL194W), rather than the more typical ENSXXXG00001232 format.
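Before choosing the source type, it can help to check what the identifiers in your key column look like. The following is a minimal sketch and not part of py4cytoscape: the helper name `guess_id_style` and both regular expressions are assumptions for illustration, modeled on the yeast names (e.g., YDL194W) and the ENS-prefixed gene IDs that appear in this vignette.

```python
import re

# Assumed patterns, for illustration only; real databases have more variants.
YEAST_SYSTEMATIC = re.compile(r'^Y[A-P][LR]\d{3}[WC](-[A-Z])?$')
ENSEMBL_GENE = re.compile(r'^ENS[A-Z]{0,4}G\d{11}$')


def guess_id_style(identifier):
    """Return a rough label for the format of a single identifier string."""
    if YEAST_SYSTEMATIC.match(identifier):
        return 'yeast systematic name'
    if ENSEMBL_GENE.match(identifier):
        return 'ENS-prefixed Ensembl gene ID'
    if identifier.isdigit():
        return 'numeric (e.g., Entrez Gene)'
    return 'unknown'


for example in ['YDL194W', 'ENSG00000168036', '856175']:
    print(example, '->', guess_id_style(example))
```

A check like this only describes the format of an ID. It does not confirm that the ID exists in the source database.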

So, with this knowledge, you can run the following function:



In [5]:
mapped_cols = p4c.map_table_column('name','Yeast','Ensembl','Entrez Gene')
mapped_cols.head()

Unnamed: 0,name,Entrez Gene
2560,YPR062W,856175
2561,YLR319C,851029
2562,YNL311C,855405
2563,YKL001C,853869
2564,YOL016C,854144


We are asking Cytoscape to look in the name column for Yeast Ensembl IDs and then provide a new column of corresponding Entrez Gene IDs. If you look back at the Node Table, you’ll see that new column (all the way to the right). That’s it!

The return value is a data frame of all the mappings between Ensembl and Entrez Gene that were found for your network in case you want those details:

In [6]:
mapped_cols.iloc[:3] #first three entries

Unnamed: 0,name,Entrez Gene
2560,YPR062W,856175
2561,YLR319C,851029
2562,YNL311C,855405


Note: the row names of the return data frame are the node SUIDs from Cytoscape. These are handy if you want to load the mappings yourself (see last example).

# Example: From proteins to genes
For this next example, you’ll need the STRING app to access the STRING database from within Cytoscape:
* Install the STRING app from https://apps.cytoscape.org/apps/stringapp



In [7]:
p4c.install_app('stringApp')
p4c.install_app('WikiPathways')

{}


{}

Now we can import protein interaction networks with a ton of annotations from the STRING database with a simple commands_run function, like this:

In [8]:
string_cmd = 'string disease query disease="breast cancer" cutoff=0.9 species="Homo sapiens" limit=150'
p4c.commands.commands_run(string_cmd)

["Loaded network 'String Network - breast cancer' with 150 nodes and 806 edges"]

Check out the Node Table and you’ll see display names and identifiers. In particular, the canonical name column appears to hold Uniprot-TrEMBL IDs. Nice, we can use that!



# Perform identifier mapping


Say we have a dataset keyed by Ensembl gene identifiers. Well, then we would want to perform this mapping:



In [9]:
mapped_cols = p4c.map_table_column('stringdb::canonical name','Human','Uniprot-TrEMBL','Ensembl')
mapped_cols.head()

Unnamed: 0,stringdb::canonical name,Ensembl
3840,P35222,ENSG00000168036
3841,Q05397,ENSG00000169398
3842,P23771,ENSG00000107485
3843,P15407,ENSG00000175592
3844,Q96SC6,


Scroll all the way to the right in the Node Table and you’ll see a new column with Ensembl IDs. This example highlights a useful translation from protein to gene identifiers (or vice versa), but it is also a caution to be aware of the assumptions involved when making this translation. For example, a typical gene encodes many proteins, so several protein IDs may map to the same gene ID (many-to-one mappings) in your results.
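One way to spot such cases is to group the returned mappings by target ID. The sketch below is a Cytoscape-free illustration: it assumes the mappings have been copied into plain (source, target) pairs, and the accession `HYPOTHETICAL1` is a made-up placeholder added only to produce a many-to-one case. The other pairs come from the output above.

```python
from collections import defaultdict

# (UniProt accession, Ensembl gene ID) pairs; None marks an unmapped protein.
pairs = [
    ('P35222', 'ENSG00000168036'),
    ('Q05397', 'ENSG00000169398'),
    ('P23771', 'ENSG00000107485'),
    ('Q96SC6', None),
    ('HYPOTHETICAL1', 'ENSG00000168036'),  # placeholder, not a real accession
]


def summarize_mappings(pairs):
    """Group sources by target and list the sources that have no mapping."""
    by_target = defaultdict(list)
    unmapped = []
    for source, target in pairs:
        if target is None:
            unmapped.append(source)
        else:
            by_target[target].append(source)
    # Keep only the targets reached from more than one source.
    many_to_one = {t: s for t, s in by_target.items() if len(s) > 1}
    return many_to_one, unmapped


many_to_one, unmapped = summarize_mappings(pairs)
print('Genes with several proteins:', many_to_one)
print('Proteins without a gene ID:', unmapped)
```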



# Example: Mixed identifiers


From time to time, you’ll come across a case where the identifiers in your network are of mixed types. This is a rare scenario, but here is one approach to solving it.

First, you’ll need the WikiPathways app to access the WikiPathways database. The pathways in WikiPathways are curated by a community of interested researchers and citizen scientists. As such, there are times where authors might use different sources of identifiers. They are valid IDs, just not all from the same source. Future versions of the WikiPathways app will provide pre-mapped columns to a single ID type. But in the meantime (and relevant to other use cases), this example highlights how to handle a source of mixed identifier types.

* Install the WikiPathways app from https://apps.cytoscape.org/apps/wikipathways

Now we can import an Apoptosis Pathway from WikiPathways. Using the web site (https://wikipathways.org), the Network Search Tool in the Cytoscape GUI, or the rWikiPathways package, we can identify the pathway as WP254.



In [10]:
wp_cmd = 'wikipathways import-as-pathway id="WP254"'
p4c.commands.commands_run(wp_cmd)

[]

Take a look in the XrefId column and you’ll see a mix of identifier types. The next column over, XrefDatasource, conveniently names each type’s source. Ignoring the metabolites for this example, we just have a mix of Ensembl and Entrez Gene to deal with.



# Perform identifier mapping


Say we want a column with only Ensembl IDs. The easiest approach is to simply overwrite all the non-Ensembl IDs, i.e., in this case, Entrez Gene IDs. Let’s collect the mappings first:



In [11]:
mapped_cols = p4c.map_table_column('XrefId','Human','Entrez Gene','Ensembl')

Next, we want to remove the rows with missing values in the Ensembl column of our resulting mapped_cols data frame, i.e., the IDs that had no mapping. We’ll also remove the original source column (to avoid confusion) and rename our Ensembl column to XrefId to prepare to overwrite. Then we’ll load that into Cytoscape:

In [12]:
# Keep only the rows that received an Ensembl mapping
only_mapped_cols = mapped_cols.dropna()
# Keep just the Ensembl column, renamed so that it overwrites XrefId on load
only_mapped_cols = only_mapped_cols[['Ensembl']].rename(columns={'Ensembl': 'XrefId'})
only_mapped_cols.head()

Unnamed: 0,XrefId
5735,ENSG00000283797
5736,ENSG00000284032
5737,ENSG00000141682
5738,ENSG00000132906
5739,ENSG00000135679


In [13]:
p4c.load_table_data(only_mapped_cols, table_key_column='SUID')

'Success: Data loaded in defaultnode table'

Done! See the updated XrefId column in Cytoscape with all Ensembl IDs.

Note: you’ll want to either update the XrefDatasource column as well or simply make a note to ignore it at this point.
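The overwrite logic, including the XrefDatasource update mentioned in the note, can be sketched without Cytoscape as follows. This is an illustration under stated assumptions: the node rows, their SUIDs, and the helper name `unify_to_ensembl` are hypothetical, and the Entrez-to-Ensembl lookup holds a single example pair standing in for the mappings returned by `map_table_column`.

```python
# Hypothetical node rows keyed by SUID: (XrefId, XrefDatasource).
nodes = {
    101: ('7157', 'Entrez Gene'),
    102: ('ENSG00000135679', 'Ensembl'),
    103: ('99999999', 'Entrez Gene'),  # no mapping available in this sketch
}

# Example lookup standing in for the mapping results (Entrez Gene -> Ensembl).
entrez_to_ensembl = {'7157': 'ENSG00000141510'}


def unify_to_ensembl(nodes, lookup):
    """Return updated rows; only mapped Entrez Gene rows are rewritten."""
    updated = {}
    for suid, (xref, source) in nodes.items():
        if source == 'Entrez Gene' and xref in lookup:
            # Replace the ID and record the new source alongside it.
            updated[suid] = (lookup[xref], 'Ensembl')
        else:
            updated[suid] = (xref, source)  # left untouched
    return updated


for suid, row in unify_to_ensembl(nodes, entrez_to_ensembl).items():
    print(suid, row)
```

Rows without a mapping keep their original ID and source, so the XrefDatasource column still tells you which entries were not converted.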



# More advanced cases


If you need an ID mapping solution for species or ID types not covered by this tool, or if you want to connect to alternative sources of mappings, then check out the BridgeDb app: http://apps.cytoscape.org/apps/bridgedb.



In [14]:
# Available in Cytoscape 3.7.0 and above
p4c.install_app('BridgeDb')

{}


{}