<a href="https://colab.research.google.com/github/dthorburn/rb_automation/blob/main/scripts/legacy_colabproscan_solution.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ColabProScan: InterProScan with Google Cloud Storage folders

### ***NB: This method is deprecated. Please use the Nextflow implementation, which mounts the database bucket with Fusion.***


## Instructions

**Quick Start**
1. Upload your amino-acid sequences in FASTA format to GCP (NB: file names must end with `.fasta` to be recognised as FASTA here).
2. Set the GCP project ID (`project_name`), the Cloud Storage bucket (`bucket_name`), and the sub-directory within the bucket (`folder_name`). `folder_name` is where the FASTA files are read from and where the results are uploaded.
3. Check that the parameters are set correctly. The defaults are fine for most use cases.
4. Launch a GCP GPU VM with the Colab backend container using this [link](https://console.cloud.google.com/marketplace/product/colab-marketplace-image-public/colab) (ensure you are logged into our GCP account). An NVIDIA T4 with 4 vCPUs and 13-26 GB of memory should be sufficient for most use cases.
5. Select `Connect to a custom GCE VM` under the connection options in the top right and follow the instructions.
6. Once connected, press `Runtime` -> `Run all` (or select each block and run it individually).
7. When the authentication block runs, follow the link it prints to retrieve a temporary GCP access code and paste the code into the provided space.
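
The `.fasta` naming rule in step 1 can be pictured with a small sketch. The helper name `select_fasta_names` and the example object names are hypothetical and exist only for illustration; the sketch assumes that only the final path component of each object name is checked.

```python
from pathlib import PurePosixPath


def select_fasta_names(blob_names, suffix=".fasta"):
    """Return the object names whose final path component ends with `suffix`.

    Hypothetical helper for illustration; it is not part of the notebook.
    """
    selected = []
    for name in blob_names:
        # Cloud Storage object names use "/" as a separator, like POSIX paths.
        basename = PurePosixPath(name).name
        if basename.endswith(suffix):
            selected.append(name)
    return selected


# Made-up object names, as they might appear under a `folder_name` prefix.
example_names = [
    "proteins/",
    "proteins/sample_a.fasta",
    "proteins/sample_a.fasta.fai",
    "proteins/notes.txt",
]
print(select_fasta_names(example_names))
```

Under this rule, index files such as `sample_a.fasta.fai` and folder placeholder objects are skipped, so only the sequence file itself would be picked up.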

**Output**
1. A GFF3 file (`interpro_result.gff`) containing all InterProScan annotations, uploaded to `folder_name` in the bucket.
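
Once the result has been downloaded, a quick sanity check is to count annotations per member database. The sketch below is a minimal example, assuming the standard nine-column, tab-separated GFF3 layout with the analysis name in the second column and an optional trailing `##FASTA` section. The function name `count_features_by_source` and the example lines are made up for illustration.

```python
from collections import Counter


def count_features_by_source(lines):
    """Count GFF3 feature lines per value of the source column (column 2)."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        # Everything after the ##FASTA directive is sequence data, not features.
        if line.startswith("##FASTA"):
            break
        # Skip blank lines, comments, and other directives.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != 9:
            continue  # not a well-formed feature line
        counts[fields[1]] += 1
    return counts


# Made-up lines in GFF3 layout, for illustration only.
example = [
    "##gff-version 3",
    "seq1\t.\tpolypeptide\t1\t120\t.\t+\t.\tID=seq1",
    "seq1\tPfam\tprotein_match\t5\t80\t1.2E-10\t+\t.\tName=PF00001",
    "seq1\tSMART\tprotein_match\t10\t90\t3.0E-5\t+\t.\tName=SM00001",
    "seq1\tPfam\tprotein_match\t85\t118\t4.5E-8\t+\t.\tName=PF00002",
    "##FASTA",
    ">seq1",
    "MKT",
]
counts = count_features_by_source(example)
print(counts["Pfam"], counts["SMART"])
```

To run it on a real result, pass an open file handle instead of the example list, for instance `count_features_by_source(open("interpro_result.gff"))`.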

In [None]:
#@title Install InterProScan - This will take hours!
%%shell
#perl -version
#python3 --version
#java -version

# -p keeps the cell re-runnable if the directory already exists
mkdir -p my_interproscan
#mv interproscan-5.62-94.0-64-bit.tar.gz my_interproscan
#mv interproscan-5.62-94.0-64-bit.tar.gz.md5 my_interproscan
cd my_interproscan
wget https://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/5.62-94.0/interproscan-5.62-94.0-64-bit.tar.gz
wget https://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/5.62-94.0/interproscan-5.62-94.0-64-bit.tar.gz.md5

# Recommended checksum to confirm the download was successful:
md5sum -c interproscan-5.62-94.0-64-bit.tar.gz.md5
tar -pxvzf interproscan-5.62-94.0-*-bit.tar.gz
cd /content/my_interproscan/interproscan-5.62-94.0/
python3 setup.py -f interproscan.properties

In [None]:
#@title Setup and Authenticate with Google Cloud Storage
!pip install --upgrade google-cloud-storage
!gcloud auth application-default login
from google.cloud import storage


In [None]:
#@title Fetch inputs from Google Cloud Storage

#@markdown ## Set the parameters below, run this cell, and wait for the summary.
import os
from pathlib import Path
import shutil
import re

project_name = '' #@param {type:"string"}
bucket_name = '' #@param {type:"string"}
folder_name = '' #@param {type:"string"}

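print("Ensuring download folders are empty")
input_local_dir = Path.home().resolve() / "input"
output_dir = Path.home().resolve() / "output"
for directory in (input_local_dir, output_dir):
  # Remove leftovers from a previous run; os.mkdir alone fails if the folder exists
  shutil.rmtree(directory, ignore_errors=True)
  os.mkdir(directory)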

print("Fetching data...")
storage_client = storage.Client(project_name)
bucket = storage_client.get_bucket(bucket_name)
blobs = bucket.list_blobs(prefix=folder_name)  # Get list of files
for blob in blobs:
  filename = blob.name.split("/")[-1]
  # Only download objects whose file name ends with ".fasta"
  if filename.endswith(".fasta"):
    blob.download_to_filename(input_local_dir.resolve() / filename)

print("Done, input directory state:")
print(os.listdir(input_local_dir))

In [None]:
#@title Run InterProScan
%%shell
export PATH="/content/my_interproscan/interproscan-5.62-94.0/:$PATH"
## NB. This block assumes /root/input contains exactly one fasta file
fasta=$(ls -1 /root/input)
## Strip any * characters (commonly used to mark stop codons) before scanning
sed -i "s/\*//g" "/root/input/${fasta}"
outdir="/root/output"

## Safety check: only run InterProScan if no * characters remain
if [ "$(grep -c "\*" "/root/input/${fasta}")" -eq 0 ];
then
  echo "${fasta}"
  interproscan.sh -i "/root/input/${fasta}" -f gff3 -t "p" -o "${outdir}/interpro_result.gff" -cpu 4 -appl Pfam,Gene3D,SUPERFAMILY,PRINTS,SMART,CDD,ProSiteProfiles -dp
else
  echo "* characters found in fasta. Remove before continuing"
fi

In [None]:
#@title Upload Results to GCP
print('Uploading results...')
for filename in os.listdir(output_dir):
    if os.path.isfile(output_dir.resolve() / filename):
        # Results go to the same folder_name prefix the inputs were read from
        blob = bucket.blob(folder_name + "/" + filename)
        blob.upload_from_filename(output_dir.resolve() / filename)
