# Interacting with Galaxy through the API to run the tool GECCO to identify putative novel Biosynthetic Gene Clusters (BGCs)

**Information about GECCO:** https://github.com/zellerlab/GECCO <br>

**Information about Galaxy** <br>
Training: https://training.galaxyproject.org/ <br>
Galaxy for Earth System and Environment: https://earth-system.usegalaxy.eu/ (DP: this one works)<br>
European Galaxy server: https://usegalaxy.eu/ (DP: I did not manage to run GECCO there as of 25-01-21)<br>

**Questions:**
How do I store the job and file `IDs` that I need to query later?
  - The solution should work both locally and on the BC VRE
  - For now, use `.json`
  - The file is created upon submission but needs to be updated after the job is done. How can this be handled if the user logs out in the meantime?
  - The analysis NB should be the one querying the result `IDs`
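
One possible answer to the log-out question, as a minimal sketch: if every submission leaves a `job_info_<job_id>.json` file on disk, a later session can rediscover unfinished jobs by scanning for those files. The function name `find_pending_jobs` and the set of terminal states are assumptions made for illustration; the file layout (`job_id`, `history_id`, `state` keys) follows what Galaxy's job info plus the extra keys stored in this notebook would contain.

```python
import json
from pathlib import Path

# Assumption: job states after which further polling is pointless.
TERMINAL_STATES = {"ok", "error", "deleted"}


def find_pending_jobs(folder="."):
    """Return (job_id, history_id) pairs for stored jobs that are not finished."""
    pending = []
    for path in sorted(Path(folder).glob("job_info_*.json")):
        with open(path) as f:
            info = json.load(f)
        # A missing or non-terminal state means the job still needs checking.
        if info.get("state") not in TERMINAL_STATES:
            pending.append((info["job_id"], info["history_id"]))
    return pending
```

The analysis notebook could call `find_pending_jobs()` at start-up and query Galaxy only for the jobs it returns.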

<h3> Installing and importing required modules </h3>

In [4]:
import sys
import os
import io

if 'google.colab' not in str(get_ipython()):
    sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '..')))
    from utils import init_setup
    init_setup()
else:
    print('Google Colab')

    # Install ngrok for hosting the dashboard
    # os.system returns the exit status instead of raising, so check it explicitly
    if os.system('pip install pyngrok --quiet') == 0:
        print('ngrok installed')
    else:
        print('An error occurred while installing ngrok')

    # !git clone https://github.com/palec87/momics-demos.git
    # this gives access to utils module
    if os.system('git clone https://github.com/palec87/momics-demos.git') == 0:
        print('Repository cloned')
    else:
        print('An error occurred while cloning the repository')

    sys.path.insert(0,'/content/momics-demos')
    
    from utils import setup_ipython
    setup_ipython()

def get_notebook_environment():
    """
    Determine if the notebook is running in VS Code or JupyterLab.

    Returns:
        str: The environment in which the notebook is running ('vscode', 'jupyterlab', or 'unknown').
    """
    # Check for VS Code environment variable
    if 'VSCODE_PID' in os.environ:
        return 'vscode'

    if "JPY_SESSION_NAME" in os.environ:
        return 'jupyterlab'

    # Neither marker is present
    return 'unknown'

# Initialize the environment variable
notebook_environment = 'unknown'
# Determine the notebook environment
env = get_notebook_environment()
print(f"Environment: {env}")

Platform: local Linux
Environment: vscode


In [17]:
import os
import sys
import json
from datetime import datetime
from platform import python_version
import logging

# Import
import bioblend.galaxy as g  # BioBlend is a Python library, wrapping the functionality of Galaxy and CloudMan APIs
# import boto3
import pandas as pd
from bioblend.galaxy import GalaxyInstance
from bioblend.galaxy.datasets import DatasetClient

from momics.galaxy.blue_cloud import BCGalaxy
# instead of the jupyter magic, you can also use
# from dotenv import load_dotenv
# load_dotenv()
%load_ext dotenv
%dotenv

The dotenv extension is already loaded. To reload it, use:
  %reload_ext dotenv


## Galaxy setup

### How to create a galaxy API key?

Example code for creating a user and retrieving an API key is [here](https://github.com/galaxyproject/bioblend/blob/main/docs/examples/create_user_get_api_key.py). *If you already have a Galaxy login*, go to User (top right) -> Preferences -> Manage API Key.

In [7]:
# Read your secrets from the .env file
# To see your API key: log in -> click 'User' (top right) -> 'Preferences' -> 'Manage API Key' (menu on the left) -> click the icon to copy the key
GALAXY_URL = os.getenv("GALAXY_EARTH_URL")  # e.g. "https://earth-system.usegalaxy.eu/"
GALAXY_KEY = os.getenv("GALAXY_EARTH_KEY")  # os.environ.get("GALAXY_EARTH_KEY") works as well

history_name = "GECCO Run"
# setup for gecco and galaxy
upload_data_flag = False
gecco_tool_id = "toolshed.g2.bx.psu.edu/repos/althonos/gecco/gecco/0.9.6"  # The id of the tool GECCO
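
`os.getenv` returns `None` silently when a variable is not set, which only surfaces later as a confusing connection error. The following is a small sketch of an explicit check; the helper name `load_galaxy_credentials` is hypothetical, while the variable names are the ones read from the `.env` file above.

```python
import os


def load_galaxy_credentials(env=os.environ,
                            url_var="GALAXY_EARTH_URL",
                            key_var="GALAXY_EARTH_KEY"):
    """Return (url, key) from the environment or fail with a clear message."""
    missing = [name for name in (url_var, key_var) if not env.get(name)]
    if missing:
        raise RuntimeError(
            f"Missing environment variables: {', '.join(missing)}. "
            "Check your .env file."
        )
    return env[url_var], env[key_var]
```

With this in place, `GALAXY_URL, GALAXY_KEY = load_galaxy_credentials()` could replace the two `os.getenv` calls.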

In [8]:
# Connect to Galaxy instance
gi = GalaxyInstance(url=GALAXY_URL, key=GALAXY_KEY)

Create a new history for the GECCO run named `GECCO Run`

In [None]:

history = gi.histories.create_history(name=history_name)
history_id = history["id"]
print(history_id)

115816a852b41534
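
Re-running the cell above creates another history with the same name each time. A minimal sketch of reusing an existing history instead is shown below. It assumes that BioBlend's `get_histories(name=...)` filters on the given history name, and the helper name `get_or_create_history` is hypothetical.

```python
def get_or_create_history(histories, name):
    """Return the ID of the first history called `name`, creating one if none exists.

    `histories` is expected to behave like `gi.histories` (a BioBlend
    HistoryClient), i.e. to offer `get_histories` and `create_history`.
    """
    existing = histories.get_histories(name=name)
    if existing:
        return existing[0]["id"]
    return histories.create_history(name=name)["id"]
```

Usage would be `history_id = get_or_create_history(gi.histories, history_name)`.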


#### Upload input files to the Galaxy history

In [6]:
# Path to the local file to upload to Galaxy (here a sample FASTA file)
# file_path = "data/EMOBON00092_final_V2.contigs.fa"  # Ensure the file is in your working directory
file_path = "../input_gecco/EMOBON00092_final_V2.contigs.fa"

In [10]:
# Upload file
upload_data = gi.tools.upload_file(file_path, history_id)
uploaded_dataset_id = upload_data["outputs"][0]["id"]
print(
    f"File uploaded to Galaxy with dataset ID: {uploaded_dataset_id}"
)  # the dataset ID might be useful below

File uploaded to Galaxy with dataset ID: 4838ba20a6d86765a1e2919003005e46


In [18]:
# testing code
dc = DatasetClient(gi)

In [21]:
gi.datasets.get_datasets()

([{'id': '4838ba20a6d867651a5faa09242fcd71',
   'name': 'k141_98582_cluster_1',
   'history_id': '8d8d4bf21253beda',
   'hid': 201,
   'deleted': False,
   'visible': False,
   'type_id': 'dataset-4838ba20a6d867651a5faa09242fcd71',
   'type': 'file',
   'create_time': '2025-01-21T12:15:29.099982',
   'update_time': '2025-01-21T12:15:29.099984',
   'url': '/api/histories/8d8d4bf21253beda/contents/4838ba20a6d867651a5faa09242fcd71',
   'tags': [],
   'history_content_type': 'dataset',
   'dataset_id': '4838ba20a6d86765ba171b3f6da2a098',
   'state': 'ok',
   'extension': 'genbank',
   'purged': False,
   'genome_build': '?',
   'quota_source_label': None,
   'object_store_id': 'files28'},
  {'id': '4838ba20a6d867657abd44f87bc4191e',
   'name': 'k141_98582_cluster_1',
   'history_id': '8d8d4bf21253beda',
   'hid': 200,
   'deleted': False,
   'visible': False,
   'type_id': 'dataset-4838ba20a6d867657abd44f87bc4191e',
   'type': 'file',
   'create_time': '2025-01-21T11:44:35.666038',
   'upd

In [20]:
dc.get_datasets(history_id=history_id)

[]

## Run GECCO in Galaxy

In [28]:

tool_info = gi.tools.show_tool(gecco_tool_id)
print(tool_info)

{'model_class': 'Tool', 'id': 'toolshed.g2.bx.psu.edu/repos/althonos/gecco/gecco/0.9.6', 'name': 'GECCO', 'version': '0.9.6', 'description': 'is a fast and scalable method for identifying putative novel Biosynthetic Gene Clusters (BGCs) in genomic and metagenomic data using Conditional Random Fields (CRFs).', 'labels': [], 'edam_operations': ['operation_0415'], 'edam_topics': ['topic_0080'], 'hidden': '', 'is_workflow_compatible': True, 'xrefs': [], 'tool_shed_repository': {'name': 'gecco', 'owner': 'althonos', 'changeset_revision': 'cc91d730cc4f', 'tool_shed': 'toolshed.g2.bx.psu.edu'}, 'panel_section_id': 'annotation', 'panel_section_name': 'Annotation', 'form_style': 'regular'}


In [29]:
# Helper to find your available datasets on Galaxy that match a given key/value pair
# This function is called upon pressing a button in the webapp
def filter_datasets_by_key(datasets, key, value):
    lst_dict = [k for k in datasets if key in k and k[key] == value]
    names = [(k["name"], k['id']) for k in lst_dict]
    return names

In [31]:
if not upload_data_flag:
    dname, did = filter_datasets_by_key(gi.datasets.get_datasets(), "extension", 'fasta')[0]
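
Indexing with `[0]` raises an `IndexError` when no dataset matches, which gives little hint about what went wrong. As a sketch, a variant with an explicit error message could look like this; `pick_first_dataset` is a hypothetical name and the records in the usage example are made up for illustration.

```python
def pick_first_dataset(datasets, key, value):
    """Return (name, id) of the first dataset whose `key` equals `value`."""
    matches = [(d["name"], d["id"]) for d in datasets if d.get(key) == value]
    if not matches:
        raise ValueError(f"No dataset with {key}={value!r} found")
    return matches[0]


# Made-up records shaped like the entries returned by gi.datasets.get_datasets()
example = [
    {"name": "clusters", "id": "aaa", "extension": "genbank"},
    {"name": "contigs", "id": "bbb", "extension": "fasta"},
]
dname, did = pick_first_dataset(example, "extension", "fasta")
```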

In [32]:
# Define inputs for the GECCO tool with additional parameters

# Use the freshly uploaded dataset, or the existing FASTA dataset selected above
input_dataset_id = uploaded_dataset_id if upload_data_flag else did

inputs = {
    "input": {
        "id": input_dataset_id,
        "src": "hda",  # History Dataset Association
    },
    "mask": True,  # Enable masking of regions with unknown nucleotides
    "cds": 3,  # Minimum number of genes required for a cluster
    "threshold": 0.05,  # Probability threshold for cluster detection
    "postproc": "gecco",  # Post-processing method for gene cluster validation
    "antismash_sideload": False,  # Generate an antiSMASH v6 sideload JSON file
    # 'email': 'email@email.pt'  # Email notification
}

# Run the GECCO tool
tool_run = gi.tools.run_tool(
    history_id=history_id, tool_id=gecco_tool_id, tool_inputs=inputs
)

# Get job ID to monitor
job_id = tool_run["jobs"][0]["id"]
print(f"GECCO tool job submitted with job ID: {job_id}")

GECCO tool job submitted with job ID: 11ac94870d0bb33ac814be21ef6fbc2d


In [33]:
gi.jobs.cancel_job(job_id)

True

### Saving the `.json` file locally for the job

In [34]:
def store_galaxy_job_json(tool_id: str, job_id: str, history_id: str):
    """Store the job information in a JSON file (uses the global `gi` connection)."""
    job_info = gi.jobs.show_job(job_id)
    job_info["tool_id"] = tool_id
    job_info["history_id"] = history_id
    job_info["job_id"] = job_id

    ## Get the current datetime and format it
    # datetime_stamp = datetime.now().strftime("%Y%m%d_%H%M%S")

    job_info_file = f"job_info_{job_id}.json"
    with open(job_info_file, "w") as f:
        json.dump(job_info, f)

In [35]:
# Monitor the job status (get job id from running the previous cell)
# gi.jobs.show_job(job_id), # 11ac94870d0bb33a4a74056d2ffeb889
gi.jobs.get_state(job_id)

'deleted'
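
Checking the state once only gives a snapshot. A minimal polling sketch is shown below; it takes the state lookup as a callable so that `gi.jobs.get_state` can be passed in. The name `wait_for_state`, the default intervals and the set of terminal states are assumptions. BioBlend may also provide its own waiting helper on the jobs client, so check its documentation before relying on this version.

```python
import time


def wait_for_state(get_state, job_id, interval=10.0, timeout=3600.0, sleep=time.sleep):
    """Poll `get_state(job_id)` until the job reaches a terminal state."""
    terminal = {"ok", "error", "deleted"}  # assumed terminal job states
    waited = 0.0
    while True:
        state = get_state(job_id)
        if state in terminal:
            return state
        if waited >= timeout:
            raise TimeoutError(f"Job {job_id} still '{state}' after {waited:.0f} s")
        sleep(interval)
        waited += interval
```

Usage would be `final_state = wait_for_state(gi.jobs.get_state, job_id)`. GECCO runs can take a while, so a generous `timeout` is advisable.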

In [36]:
# Test: can the job info be stored in a JSON file for a deleted job?
# Yes, it works.
store_galaxy_job_json(gecco_tool_id, job_id, history_id)

### TODO: this file also needs to be updated once the job is done and the user accesses it.
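
A possible shape for that update step, as a sketch only: reload the stored file, merge in the latest job information and write it back via a temporary file so that an interrupted write cannot leave a half-written JSON behind. The name `refresh_job_json` is hypothetical; `fetch_job` stands for any callable that returns the current job info as a dictionary, for example `gi.jobs.show_job`.

```python
import json
import os
import tempfile


def refresh_job_json(path, fetch_job):
    """Merge the latest job info into the JSON file at `path` and return it."""
    with open(path) as f:
        stored = json.load(f)

    # Keys added at submission time (e.g. job_id) survive unless Galaxy reports them too.
    stored.update(fetch_job(stored["job_id"]))

    # Write to a temporary file in the same folder, then swap it in atomically.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(stored, f)
    os.replace(tmp_path, path)
    return stored
```

Usage would be `refresh_job_json(f"job_info_{job_id}.json", gi.jobs.show_job)`, called by the analysis notebook whenever the user opens the results.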

#### List the Outputs from the Job

In [37]:
# Uses the history ID obtained when the history was created above
# List datasets in the history after the tool run
datasets = gi.histories.show_history(history_id, contents=True)

### Download the `.tsv` table outputs

In [38]:
# Identify the output dataset ids
# To understand the output: https://git.lumc.nl/mflarralde/gecco
target_names = {
    "GECCO summary of detected genes on data 1 (TSV)": "dataset_id_2",
    "GECCO summary of detected features on data 1 (TSV)": "dataset_id_3",
    "GECCO summary of detected BGCs on data 1 (TSV)": "dataset_id_4",
}

# Map each variable label to its dataset ID (the last matching dataset wins)
found_ids = {
    target_names[dataset["name"]]: dataset["id"]
    for dataset in datasets
    if dataset["name"] in target_names
}

# IDs stay None when the corresponding output is not in the history
dataset_id_2 = found_ids.get("dataset_id_2")
dataset_id_3 = found_ids.get("dataset_id_3")
dataset_id_4 = found_ids.get("dataset_id_4")

# Display the results
print(f"Dataset ID 2: {dataset_id_2}")
print(f"Dataset ID 3: {dataset_id_3}")
print(f"Dataset ID 4: {dataset_id_4}")

Dataset ID 2: 4838ba20a6d86765882e8980eaab1ccd
Dataset ID 3: 4838ba20a6d867656afc7e5eb299efb8
Dataset ID 4: 4838ba20a6d86765ef61086d00cbc156


In [None]:
# Download the datasets (as TSV) to the 'data' folder
tsv_data2 = gi.datasets.download_dataset(
    dataset_id_2,
    file_path="../data/summary_detected_genes.tsv",
    use_default_filename=False,
)
tsv_data3 = gi.datasets.download_dataset(
    dataset_id_3,
    file_path="../data/summary_detected_features.tsv",
    use_default_filename=False,
)
tsv_data4 = gi.datasets.download_dataset(
    dataset_id_4, file_path="../data/summary_detected_BGC.tsv", use_default_filename=False
)
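
The download calls above assume that `../data` already exists and that all three dataset IDs were found. As a hedged sketch, a small helper can create the folder and skip missing IDs before anything is downloaded; `build_download_targets` and the label names are hypothetical.

```python
from pathlib import Path


def build_download_targets(dataset_ids, out_dir):
    """Create `out_dir` and map each available dataset ID to a target .tsv path.

    `dataset_ids` maps a file label to a dataset ID; IDs that are None are skipped.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    return {
        ds_id: str(out / f"{label}.tsv")
        for label, ds_id in dataset_ids.items()
        if ds_id is not None
    }
```

Each returned pair could then be passed on as `gi.datasets.download_dataset(ds_id, file_path=path, use_default_filename=False)`.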

In [37]:
# Read the TSV files into pandas DataFrames

# df_detected_BGC = pd.read_csv('detected_BGC.tsv', sep='\t')
df_summary_detected_genes = pd.read_csv("../data/summary_detected_genes.tsv", sep="\t")
df_summary_detected_features = pd.read_csv("../data/summary_detected_features.tsv", sep="\t")
df_summary_detected_BGC = pd.read_csv("../data/summary_detected_BGC.tsv", sep="\t")

### Display the first few rows of each DataFrame

In [39]:
df_summary_detected_genes.head()

Unnamed: 0,sequence_id,protein_id,start,end,strand,average_p,max_p
0,k141_0,k141_0_1,1,315,+,0.079067,0.079067
1,k141_1,k141_1_1,3,218,-,0.138611,0.138611
2,k141_10,k141_10_1,2,70,-,0.138611,0.138611
3,k141_10,k141_10_2,402,491,-,0.138615,0.138615
4,k141_100,k141_100_1,3,425,-,0.002311,0.002311


In [40]:
df_summary_detected_features.head()

Unnamed: 0,sequence_id,protein_id,start,end,strand,domain,hmm,i_evalue,pvalue,domain_start,domain_end,cluster_probability
0,k141_0,k141_0_1,1,315,+,PF08448,Pfam,2.690279e-16,9.726245e-20,1,92,0.079067
1,k141_100,k141_100_1,3,425,-,PF05343,Pfam,3.8542599999999995e-38,1.3934419999999999e-41,2,139,0.002311
2,k141_100007,k141_100007_1,2,151,-,PF01118,Pfam,1.424239e-08,5.149094e-12,4,46,0.105378
3,k141_10001,k141_10001_1,1,621,+,PF13561,Pfam,4.8442400000000003e-60,1.751352e-63,1,207,0.118275
4,k141_10001,k141_10001_1,1,621,+,PF00106,Pfam,2.675476e-15,9.672726999999999e-19,2,171,0.118275


In [41]:
df_summary_detected_BGC.head()

Unnamed: 0,sequence_id,cluster_id,start,end,average_p,max_p,type,alkaloid_probability,nrp_probability,polyketide_probability,ripp_probability,saccharide_probability,terpene_probability,proteins,domains
0,k141_102496,k141_102496_cluster_1,3,2678,0.089128,0.089709,,0.038711,0.190326,0.472017,0.151461,0.01,0.075419,k141_102496_1;k141_102496_2;k141_102496_3;k141...,PF01261;PF13577;PF14534
1,k141_107500,k141_107500_cluster_1,2,2071,0.055816,0.057462,,0.029044,0.156383,0.163698,0.214358,0.01,0.160596,k141_107500_1;k141_107500_2;k141_107500_3,PF00571;PF00850;PF05175;PF13649
2,k141_111068,k141_111068_cluster_1,1492,4777,0.28885,0.292866,,0.028721,0.458357,0.432158,0.117242,0.0,0.063415,k141_111068_3;k141_111068_4;k141_111068_5;k141...,PF00254;PF00528;PF01613;PF19300
3,k141_11215,k141_11215_cluster_1,1,2794,0.12322,0.124093,,0.160109,0.08919,0.097305,0.055435,0.0,0.083311,k141_11215_1;k141_11215_2;k141_11215_3,PF00202;PF01177;PF01266
4,k141_114753,k141_114753_cluster_1,842,4439,0.066754,0.073121,,0.054746,0.183918,0.332323,0.216,0.01,0.082038,k141_114753_3;k141_114753_4;k141_114753_5;k141...,PF00204;PF01578;PF02518;PF04952
