This is a reproduction of Basic Protocol 1 from [Biological Network Exploration with Cytoscape 3](https://pubmed.ncbi.nlm.nih.gov/25199793/). It loads an S. cerevisiae network, filters out unneeded nodes, lays out the resulting network, creates clusters of similar nodes, and then performs an enrichment calculation on one cluster.

Note that this workflow executes in a Jupyter Notebook running on your workstation, and communicates with a copy of Cytoscape also running on your workstation. For a similar workflow that runs on a cloud server (e.g., Google Colab), see [here](https://github.com/bdemchak/cytoscape-jupyter/tree/main/gangsu).

---
# Setup data files, py4cytoscape and Cytoscape connection
**NOTE: To run this notebook, you must manually start Cytoscape first -- don't proceed until you have started Cytoscape.**

This workflow requires two files that are located in cloud storage:

* BIOGRID-ORGANISM-Saccharomyces_cerevisiae-3.2.105.mitab (network file)
* GDS112_full.soft (annotation file)

Both files reside in a Dropbox folder, and they are downloaded by this workflow as needed.

## Setup: Fetch latest py4cytoscape




In [13]:
import py4cytoscape as p4c

## Setup: Sanity test to verify Cytoscape connection

By now, the connection to Cytoscape should be up and available. To verify this, try a simple operation that doesn't alter the state of Cytoscape.

In [14]:
p4c.cytoscape_version_info()


{'apiVersion': 'v1',
 'cytoscapeVersion': '3.10.2',
 'automationAPIVersion': '1.10.0',
 'py4cytoscapeVersion': '1.10.0'}
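
The version strings in this dictionary are dotted numbers, so comparing them as plain strings can mislead: `'3.10.2'` sorts before `'3.9'` character by character. The sketch below compares them numerically. `parse_version` and `at_least` are hypothetical helpers written for illustration (they are not part of py4cytoscape), and they assume every component is purely numeric.

```python
def parse_version(text):
    """Turn a dotted version such as '3.10.2' into a tuple of ints: (3, 10, 2)."""
    return tuple(int(part) for part in text.split('.'))


def at_least(actual, minimum):
    """Return True if version string `actual` is >= `minimum`, compared numerically."""
    a, m = parse_version(actual), parse_version(minimum)
    # Pad the shorter tuple with zeros so '3.9' compares like '3.9.0'
    width = max(len(a), len(m))
    a += (0,) * (width - len(a))
    m += (0,) * (width - len(m))
    return a >= m


# Sample values copied from the output above
info = {'apiVersion': 'v1', 'cytoscapeVersion': '3.10.2'}
print(at_least(info['cytoscapeVersion'], '3.9'))  # True  (numeric comparison)
print('3.10.2' >= '3.9')                          # False (naive string comparison)
```

In the notebook, the dictionary returned by `p4c.cytoscape_version_info()` could be passed in place of the sample `info` value.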

## Setup: Notebook data files
Create the 'output' directory, which will be used to store files uploaded from Cytoscape.

This is a good place to prepare any other system resources that might be needed by downstream Notebook cells.

Pro Tip: The "!" commands in this cell are passed to the host operating system. In this example, they're correct for a Windows host. Different commands would be appropriate for a Linux or Mac host.


In [15]:
!del /s/q/f output
!rmdir output
!mkdir output
!dir output
OUTPUT_DIR = 'output/'

 Volume in drive C has no label.
 Volume Serial Number is 50EF-8726

 Directory of C:\Users\CyDeveloper\PycharmProjects\py4cytoscape\tests\Notebooks\output

09/24/2024  02:42 PM    <DIR>          .
09/24/2024  02:42 PM    <DIR>          ..
               0 File(s)              0 bytes
               2 Dir(s)  37,394,010,112 bytes free
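
Because the "!" commands above are specific to Windows, one option is to do the same cleanup in Python, which behaves the same on Windows, Linux and Mac. This is a minimal sketch using only the standard library; `reset_dir` is a hypothetical helper name, and the demonstration runs in a temporary directory so that it does not touch real files.

```python
import shutil
import tempfile
from pathlib import Path


def reset_dir(path):
    """Delete `path` if it exists, recreate it empty, and return it as a Path."""
    target = Path(path)
    shutil.rmtree(target, ignore_errors=True)  # no error if the directory is absent
    target.mkdir(parents=True)
    return target


# Demonstrate in a scratch area: a stale file disappears after the reset
with tempfile.TemporaryDirectory() as scratch:
    stale = Path(scratch) / 'output'
    stale.mkdir()
    (stale / 'old_result.txt').write_text('left over from a previous run')
    output_dir = reset_dir(stale)
    print(sorted(p.name for p in output_dir.iterdir()))  # []
```

In this notebook the call would be `reset_dir('output')`, followed by the same `OUTPUT_DIR = 'output/'` assignment.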


## Setup: Import source data files
The network and annotation files are in a Dropbox folder, and this cell downloads them into the default Sandbox from where Cytoscape will access them.

The files could just as well have been on any cloud resource, including Google Drive, GitHub, Microsoft OneDrive or a private web site. In this case, the network file was too large to be saved on GitHub, so Dropbox was a handy alternative.

*Tip:* An alternative would be to load the files into this Notebook's file system (or create them there) and then download those files to the Sandbox. Loading them into the Notebook file system would require the use of Notebook "!" commands (e.g., !wget). wget for Windows can be found [here](https://eternallybored.org/misc/wget/).
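
If installing wget is inconvenient, the download into the Notebook's file system can also be sketched with Python's standard library. Two assumptions are worth flagging: `force_download` and `fetch` are hypothetical helper names, and setting `dl=1` on a Dropbox share link is assumed to request the raw file rather than a preview page, which should be verified for your own links.

```python
import os
import shutil
import urllib.request
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def force_download(url):
    """Return `url` with its `dl` query parameter set to 1 (assumed Dropbox convention)."""
    parts = urlsplit(url)
    query = dict(parse_qsl(parts.query))
    query['dl'] = '1'
    return urlunsplit(parts._replace(query=urlencode(query)))


def fetch(url, dest):
    """Stream the resource at `url` into the local file `dest`; return the bytes written."""
    with urllib.request.urlopen(url) as response, open(dest, 'wb') as out:
        shutil.copyfileobj(response, out)
    return os.path.getsize(dest)


share_link = 'https://www.dropbox.com/s/r15azh0xb53smu1/GDS112_full.soft?dl=0'
print(force_download(share_link))  # same link, ending in ?dl=1
```

A call such as `fetch(force_download(share_link), 'GDS112_full.soft')` would then place the file in the Notebook's file system, from where it can be transferred to the Sandbox as the tip describes.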

---
Note that this cell uses the `import_file_from_url()` function to load resources from cloud storage.
This function is appropriate for Notebooks running on the same workstation as Cytoscape, but
not for Notebooks running on a remote server. On a remote server, the Notebook's file system is not the same as the Cytoscape workstation's file system, and [Sandbox functions](https://py4cytoscape.readthedocs.io/en/latest/concepts.html#sandboxing) should be used instead.

In [16]:
res_mitab = p4c.import_file_from_url("https://www.dropbox.com/s/8wc8o897tsxewt1/BIOGRID-ORGANISM-Saccharomyces_cerevisiae-3.2.105.mitab?dl=0", "BIOGRID-ORGANISM-Saccharomyces_cerevisiae-3.2.105.mitab")
print(f'Network file BIOGRID-ORGANISM-Saccharomyces_cerevisiae-3.2.105.mitab has {res_mitab["fileByteCount"]} bytes')

Network file BIOGRID-ORGANISM-Saccharomyces_cerevisiae-3.2.105.mitab has 166981992 bytes


---
# Load the Protein-protein Interaction Network into Cytoscape
The network is contained in the S. cerevisiae MITAB file.

Note that in this cell, the `import_network_from_file()` function (incorrectly) throws an exception in pre-3.10.0 Cytoscape. To ignore the exception, we enclose it in a try/except block.

Note: Once the CYTOSCAPE-12772 issue is solved, we can remove the try/except block.

In [17]:
from requests import HTTPError
p4c.close_session(False)

try:
  p4c.import_network_from_file('BIOGRID-ORGANISM-Saccharomyces_cerevisiae-3.2.105.mitab')
except Exception:  # ignore the spurious exception; the network count check below verifies the load
  pass
if p4c.get_network_count() != 1:
  raise Exception('Failed to load network')
net_suid = p4c.get_network_suid()
print(f'Network identifier: {net_suid}')



Network identifier: 17008665
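
The pattern used in this cell (tolerate a known spurious exception, then verify the outcome independently) can be packaged as a small helper. This is only a sketch: `run_then_verify` is a hypothetical name, and `flaky_import` is a stand-in that imitates an operation that succeeds but raises anyway.

```python
def run_then_verify(action, verify, tolerated=(Exception,)):
    """Call `action`, ignore exceptions of the `tolerated` types, then require `verify()` to be true."""
    try:
        action()
    except tolerated as err:
        print(f'Ignoring {type(err).__name__}: {err}')
    if not verify():
        raise RuntimeError('Action did not produce the expected state')


# Stand-in for an import that takes effect but still raises an exception
networks = []


def flaky_import():
    networks.append('BIOGRID')          # the side effect happens...
    raise ValueError('spurious error')  # ...but an exception is raised anyway


run_then_verify(flaky_import, lambda: len(networks) == 1)
print(networks)  # ['BIOGRID']
```

In this workflow, `action` would wrap the `p4c.import_network_from_file()` call and `verify` would test `p4c.get_network_count() == 1`. Catching `Exception` rather than using a bare `except:` keeps Ctrl-C (`KeyboardInterrupt`) working during a long load.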


---
# Import the gene expression data
The expression data is downloaded and merged into the network's node attribute table.

---
*Tip:* This cell shows how to create code that works around changes in Cytoscape capabilities. 

In this case, starting with Cytoscape 3.9.0, the `load_table_data_from_file()` function works as expected, so the gene expression data is merged into the node attribute table. 

Prior to 3.9.0, `load_table_data_from_file()` didn't work. As a workaround, we do most of the work in Pandas and then import the dataframe into the node attribute table. After Pandas reads the tab-separated file, we match the dataframe's Gene ID column to the `name` column in the Cytoscape node attribute table. To do this, we must explicitly convert the Gene ID to a string (even though it's originally parsed as a number) because Cytoscape's `name` column is already a string. 

Pro Tip: The wget and mv commands work on a Jupyter system on a Linux host. You may have to choose different commands for a Windows host. wget for Windows can be found [here](https://eternallybored.org/misc/wget/).

In [18]:
if p4c.check_supported_versions(cytoscape='3.9') is None:
  # Load file directly into Sandbox so Cytoscape can import it
  res_soft = p4c.import_file_from_url("https://www.dropbox.com/s/r15azh0xb53smu1/GDS112_full.soft?dl=0", "GDS112_full.soft")
  print(f'Annotation file GDS112_full.soft has {res_soft["fileByteCount"]} bytes')

  res = p4c.load_table_data_from_file('GDS112_full.soft', start_load_row=83, data_key_column_index=10, delimiters='\t')
  print(f'Load result contains table identifiers: {res["mappedTables"]}')
else:
  # Load file into Notebook file system so Python can import it, tweak it, and download to Cytoscape
  !wget -q --no-check-certificate https://www.dropbox.com/s/r15azh0xb53smu1/GDS112_full.soft?dl=0
  !mv GDS112_full.soft?dl=0 GDS112_full.soft

  import pandas as pd  # conventional alias; 'df' is usually reserved for dataframe variables
  GDS112_full = pd.read_csv('GDS112_full.soft', skiprows=82, sep='\t')
  GDS112_full.dropna(subset=['Gene ID'], inplace=True)
  GDS112_full['Gene ID'] = pd.to_numeric(GDS112_full['Gene ID'], downcast='integer')
  GDS112_full = GDS112_full.astype({'Gene ID': 'string'})
  print(GDS112_full.dtypes)
  print(GDS112_full)
  p4c.load_table_data(GDS112_full, data_key_column='Gene ID')

  import os
  os.remove('GDS112_full.soft')


Annotation file GDS112_full.soft has 5536880 bytes
Load result contains table identifiers: [17008636, 17008674]
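
The reason the fallback branch converts Gene ID to an integer before converting it to a string is easiest to see on a single value. A numeric column that contains missing entries is typically parsed as floating point, so 851303 becomes 851303.0, and its string form no longer equals the node name. The sketch below shows the idea in plain Python; `gene_id_key` is a hypothetical helper, not something the workflow itself defines.

```python
import math


def gene_id_key(value):
    """Convert a parsed Gene ID (float, int, or str) to the string form used as a node name.

    Returns None for missing values (None, NaN, or an empty string).
    """
    if value is None or value == '':
        return None
    number = float(value)
    if math.isnan(number):
        return None
    return str(int(number))


print(str(851303.0))              # 851303.0 (would not match the node name '851303')
print(gene_id_key(851303.0))      # 851303
print(gene_id_key(float('nan')))  # None
```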


---
# Filter the Network with the Genes that have Expression Data
For this, we assume that if a node has no *Gene symbol*, it also has no expression data. 

The filter compares each node's *Gene symbol* attribute to a regular expression. If the attribute matches, the node is selected; otherwise it is not.

In [19]:
res = p4c.create_column_filter('SymbolOK', 'Gene symbol', '[A-Z0-9]*', 'REGEX')
print(f'Nodes selected: {len(res["nodes"])}')

No edges selected.
Nodes selected: 5519
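
To get a feel for what the pattern `[A-Z0-9]*` accepts, it can be tried in Python's `re` module. This is only an approximation: it assumes the filter matches the whole attribute value, and Cytoscape's own regular expression engine may differ in details. The sample strings other than HBT1 are made up for illustration.

```python
import re

SYMBOL_PATTERN = re.compile(r'[A-Z0-9]*')


def symbol_matches(symbol, pattern=SYMBOL_PATTERN):
    """Return True if the whole string consists of characters the pattern allows."""
    return pattern.fullmatch(symbol) is not None


for symbol in ['HBT1', 'hbt1', 'ABC-1', '']:
    print(f'{symbol!r}: {symbol_matches(symbol)}')
```

Under these assumptions the lowercase and hyphenated samples are rejected, while the empty string is accepted because `*` allows zero characters. A pattern ending in `+` instead of `*` would require at least one character.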


---
# Create a New Network with the Selected Subset
Create a subnetwork containing only nodes selected by the filter (i.e., having a *Gene symbol* value, which implies that expression data is present for that node).

This could take several minutes.

At the end, you should see a view containing all nodes laid out. 

If you see only a single rectangle, it could be that your Cytoscape is set to operate with a small stack size. To increase the stack:

1. terminate Cytoscape

2. a) upgrade Cytoscape to 3.9.0 or later 

  ... or b) use a text editor to add -Xss5M to the cytoscape.vmoptions file in your Cytoscape program directory

3. restart Cytoscape

4. re-run this workflow

In [20]:
new_suid = p4c.create_subnetwork()
print(f'New network identifier: {new_suid}')

New network identifier: 18396562


## Get rid of the original network, which isn't needed anymore

In [21]:
p4c.delete_network(net_suid)
net_suid = new_suid

---
# Identify Network Modules
The overall strategy is to find clusters of nodes that share some common attribute. In this case, we use expression data values. Specifically:

* Load Cytoscape's clusterMaker2 app
* Use clusterMaker2 to create a dendrogram showing a hierarchy of similar network modules

## Install clusterMaker2 if it hasn't already been installed

In [22]:
p4c.install_app('clusterMaker2')

{}



## Identify network modules
Create a hierarchical clustering of similar nodes based on the expression data columns. Cytoscape renders the hierarchy as a dendrogram.

*Tip:* Cytoscape's dendrogram window can be used to manually explore module similarity. 

In [23]:
dendo_clustering = p4c.commands_post('cluster hierarchical showUI=true clusterAttributes=false nodeAttributeList="GSM1029,GSM1030,GSM1032,GSM1033,GSM1034"')

# dendo_clustering is a list whose first element is a dictionary with two keys:
#   nodeOrder: [{nodeName: xxx, suid: sss}, ...]
#   nodeTree:  [{name: ggg, left: lll, right: rrr}, ...]
# where nodeOrder maps each leaf node name xxx to the suid sss of a network node,
# and nodeTree is a tree in which the left node lll and right node rrr can be leaf nodes xxx or
# internal nodes ggg.

---
# Perform an enrichment analysis using the gprofiler package

Use a package commonly available in PyPI to calculate functional enrichment for nodes similar to a node in which we may be interested. It's an example of how Cytoscape can work together with Python-based libraries to achieve a useful result.

In this case, we choose HBT1 (entrez-gene ID 851303).

1. Find SUID of network's HBT1 node

1. Find a set of nodes similar to HBT1 by collecting nodes nearby in the tree

1. Use each node's SUID to look up its entrez-gene ID

1. Pass the set of entrez-gene IDs to gprofiler as an enrichment query

## Find HBT1 in the similarity tree

In [24]:
node_suid = p4c.node_name_to_node_suid('851303')[0] # Use entrez-gene ID to get SUID for HBT1

In _item_to_suid(): Invalid name in node name list: [851303]


CyError: In _item_to_suid(): Invalid name in node name list: [851303]

## Collect set of SUIDs representing 85 similar nodes

Note that we use custom functions (in the `parse_dendogram` module) to parse and traverse the dendrogram's similarity tree.

In [13]:
import parse_dendogram as pde # Use custom functions to decode dendogram tree

node_order = dendo_clustering[0]['nodeOrder']
node_tree = dendo_clustering[0]['nodeTree']

node_bag = pde.create_node_bag(node_order, node_tree)
similar_nodes = list(pde.find_node_set(node_suid, 85, node_order, node_bag))
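
The custom `parse_dendogram` module is not shown here, so the following is not its implementation. It is a rough sketch of one way to collect nodes near a target in a tree shaped like `nodeOrder`/`nodeTree`: climb from the target leaf toward the root until the enclosing subtree holds enough leaves. The function names and the toy data are hypothetical, and unlike the call above, this version may return more nodes than requested.

```python
def leaves_under(name, children):
    """Return the leaf names beneath `name`, left to right (or [name] if it is a leaf)."""
    leaves, stack = [], [name]
    while stack:
        current = stack.pop()
        if current in children:
            left, right = children[current]
            stack.extend([right, left])  # push right first so the left branch is popped first
        else:
            leaves.append(current)
    return leaves


def enclosing_cluster(target_suid, min_size, node_order, node_tree):
    """Return SUIDs of the smallest subtree holding `target_suid` and at least `min_size` leaves."""
    name_to_suid = {entry['nodeName']: entry['suid'] for entry in node_order}
    suid_to_name = {suid: name for name, suid in name_to_suid.items()}
    children = {node['name']: (node['left'], node['right']) for node in node_tree}
    parent = {child: name for name, pair in children.items() for child in pair}

    current = suid_to_name[target_suid]
    leaves = [current]
    # Climb toward the root until the subtree is large enough (or the root is reached)
    while len(leaves) < min_size and current in parent:
        current = parent[current]
        leaves = leaves_under(current, children)
    return [name_to_suid[leaf] for leaf in leaves]


# Toy data in the documented shape; names and SUIDs are made up for illustration
toy_order = [{'nodeName': n, 'suid': s} for n, s in [('A', 1), ('B', 2), ('C', 3), ('D', 4)]]
toy_tree = [
    {'name': 'G1', 'left': 'A', 'right': 'B'},
    {'name': 'G2', 'left': 'G1', 'right': 'C'},
    {'name': 'G3', 'left': 'G2', 'right': 'D'},
]
print(enclosing_cluster(1, 3, toy_order, toy_tree))  # [1, 2, 3]
```

The leaf collection is iterative rather than recursive so that a deep, unbalanced tree with thousands of nodes does not hit Python's recursion limit.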

## Using SUIDs, query Cytoscape for each node's entrez-gene ID

In [14]:
suid_to_entrez_gene = p4c.get_table_columns(columns='name')['name']
entrez_gene_query = [int(suid_to_entrez_gene[suid])  for suid in similar_nodes]

print(entrez_gene_query)

[856512, 850295, 855363, 854742, 856832, 853377, 853766, 853357, 855878, 851303, 854125, 852902, 850322, 856013, 851541, 850425, 853677, 856616, 854086, 853288, 851487, 854845, 855029, 855789, 850608, 852123, 855737, 851631, 854206, 854098, 850412, 853405, 856543, 852231, 854297, 854247, 851902, 851649, 852695, 853613, 855593, 852436, 853217, 852874, 851596, 853573, 855229, 855738, 851254, 851070, 851063, 855669, 854898, 853537, 853159, 855020, 851380, 855513, 851433, 853326, 851474, 851239, 850963, 854302, 850687, 851581, 852691, 853758, 854357, 852146, 854822, 854459, 855562, 853115, 851625, 856855, 850500, 850778, 852342, 852543, 851582, 854108, 852721, 852818, 850597]


## Install gprofiler package if it's not already installed

In [15]:
!pip install gprofiler-official
from gprofiler import GProfiler



You should consider upgrading via the 'C:\Users\CyDeveloper\PycharmProjects\py4cytoscape\venv\Scripts\python.exe -m pip install --upgrade pip' command.


## Use entrez-gene IDs to query gprofiler for GO functional enrichment

In [16]:
gp = GProfiler(user_agent='py4cytoscape', return_dataframe=True)
gp.profile(organism='scerevisiae', query=entrez_gene_query)

|   | source | native | name | p_value | significant | description | term_size | query_size | intersection_size | effective_domain_size | precision | recall | query | parents |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | GO:BP | GO:0044281 | small molecule metabolic process | 1.439664e-07 | True | "The chemical reactions and pathways involving... | 789 | 85 | 33 | 6548 | 0.388235 | 0.041825 | query_1 | [GO:0008152] |
| 1 | GO:BP | GO:0019752 | carboxylic acid metabolic process | 4.92481e-06 | True | "The chemical reactions and pathways involving... | 413 | 85 | 22 | 6548 | 0.258824 | 0.053269 | query_1 | [GO:0043436] |
| 2 | GO:BP | GO:0043436 | oxoacid metabolic process | 1.081803e-05 | True | "The chemical reactions and pathways involving... | 431 | 85 | 22 | 6548 | 0.258824 | 0.051044 | query_1 | [GO:0006082] |
| 3 | GO:BP | GO:0006082 | organic acid metabolic process | 1.281259e-05 | True | "The chemical reactions and pathways involving... | 435 | 85 | 22 | 6548 | 0.258824 | 0.050575 | query_1 | [GO:0044237, GO:0044281, GO:0071704] |
| 4 | GO:BP | GO:0044282 | small molecule catabolic process | 0.0001018934 | True | "The chemical reactions and pathways resulting... | 164 | 85 | 13 | 6548 | 0.152941 | 0.079268 | query_1 | [GO:0009056, GO:0044281] |
| 5 | GO:MF | GO:0003824 | catalytic activity | 0.0001680909 | True | "Catalysis of a biochemical reaction at physio... | 2317 | 84 | 52 | 6557 | 0.619048 | 0.022443 | query_1 | [GO:0003674] |
| 6 | GO:BP | GO:0046395 | carboxylic acid catabolic process | 0.0004046858 | True | "The chemical reactions and pathways resulting... | 101 | 85 | 10 | 6548 | 0.117647 | 0.09901 | query_1 | [GO:0016054, GO:0019752] |
| 7 | GO:BP | GO:0016054 | organic acid catabolic process | 0.0004865659 | True | "The chemical reactions and pathways resulting... | 103 | 85 | 10 | 6548 | 0.117647 | 0.097087 | query_1 | [GO:0006082, GO:0044248, GO:0044282, GO:1901575] |
| 8 | GO:CC | GO:0005576 | extracellular region | 0.0006964628 | True | "The space external to the outermost structure... | 127 | 84 | 10 | 6569 | 0.119048 | 0.07874 | query_1 | [GO:0110165] |
| 9 | KEGG | KEGG:01200 | Carbon metabolism | 0.0007192714 | True | Carbon metabolism | 112 | 40 | 10 | 2085 | 0.25 | 0.089286 | query_1 | [KEGG:00000] |
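
The `precision` and `recall` columns in this result can be reproduced from the size columns: the values shown are consistent with precision being `intersection_size / query_size` and recall being `intersection_size / term_size`. The sketch below recomputes them for a few rows copied from the result; `precision_recall` is a hypothetical helper, not part of gprofiler.

```python
# (native, term_size, query_size, intersection_size), copied from rows 0, 1, 5 and 9 of the result
rows = [
    ('GO:0044281', 789, 85, 33),
    ('GO:0019752', 413, 85, 22),
    ('GO:0003824', 2317, 84, 52),
    ('KEGG:01200', 112, 40, 10),
]


def precision_recall(term_size, query_size, intersection_size):
    """Return (precision, recall) rounded to six decimal places."""
    precision = intersection_size / query_size  # share of the query found in the term
    recall = intersection_size / term_size      # share of the term covered by the query
    return round(precision, 6), round(recall, 6)


for native, term_size, query_size, intersection_size in rows:
    print(native, precision_recall(term_size, query_size, intersection_size))
```

For GO:0044281, for example, 33 of the 85 query genes fall in a term of 789 genes, giving a precision of 0.388235 and a recall of 0.041825.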
