# Mesoglea characterization pipeline

*by Bruno Gideon Bergheim, Center For Organismal Studies, Uni Heidelberg*

## Requirements

### Python Packages

**Biopython:** Used for sequence manipulation.

`pip install biopython`

**pathlib:** Used to create folders. It is part of the Python standard library (since Python 3.4), so no installation is needed.

**subprocess:** Used to call InterProScan (hopefully). Part of the standard library.

**logging:** Used to create a log of the pipeline. Part of the standard library.

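```python
# Load packages
from Bio import SeqIO
import pathlib
import subprocess
import logging

# basicConfig fails if the log folder is missing, so create it first
pathlib.Path("logs").mkdir(exist_ok=True)
logging.basicConfig(filename='logs/last_run.log', level=logging.DEBUG)
```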

### InterProScan

We look up a large number of domains, so it is highly recommended to use a local installation of InterProScan for the annotation.

InterProScan is available for Linux and can be run on Windows computers through the Windows Subsystem for Linux (WSL).

1. **(if on Windows) Install the WSL**
2. **Install InterProScan**

    See the [installation instructions](https://interproscan-docs.readthedocs.io/en/latest/InstallationRequirements.html).

    At the time of writing, the instructions were:
    ```shell
    #checking requirements versions
    uname -a
    perl -version
    python3 --version
    java -version

    #downloading interproscan
    mkdir my_interproscan
    cd my_interproscan
    wget https://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/5.57-90.0/interproscan-5.57-90.0-64-bit.tar.gz
    wget https://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/5.57-90.0/interproscan-5.57-90.0-64-bit.tar.gz.md5
    #checking if the download was completed
    md5sum -c interproscan-5.57-90.0-64-bit.tar.gz.md5

    #unpacking interproscan
    tar -pxvzf interproscan-5.57-90.0-*-bit.tar.gz
    cd interproscan-5.57-90.0

    #setup
    python3 initial_setup.py

    #test
    ./interproscan.sh
    ```

3. **Run InterProScan**

    We want to run InterProScan on all FASTA sequences, so we run it in a loop:

    ```shell
    for file in [path to folder]
    do
        ./interproscan.sh -i "$file" -o "$file.xml" -f xml -goterms -pa
    done
    ```
    where:

    - `-f xml` specifies XML as the output format
    - `-goterms` activates GO term annotation
    - `-pa` activates pathway annotation

    For this analysis the command was the following (TSV output, residue-level annotation disabled with `-dra`, 14 CPU cores with `-cpu`, and a fixed list of analyses selected with `-appl`):
    ```shell
    for file in /mnt/d/Data/programs/mesoglea_protein_pipeline/input/Hydra_vulgaris/*.fasta
    do
        sudo ./interproscan.sh -i "$file" -o "$file.tsv" -f tsv -dra -cpu 14 -appl TIGRFAM,SFLD,SUPERFAMILY,PANTHER,ProSiteProfiles,SMART,CDD,PRINTS,PIRSR,ProSitePatterns,AntiFam,Pfam
    done
    ```

4. **Install SignalP, TMHMM and Phobius**

To get all annotations, three licensed tools have to be added:

http://phobius.sbc.su.se/data.html

http://www.cbs.dtu.dk/services/SignalP/

http://www.cbs.dtu.dk/services/TMHMM/

They are available for scientific use once the licence agreement has been accepted. The download files will then be sent to your email address.

After downloading, the files can be moved to the correct folders (adjust the target path to wherever `my_interproscan` was created):

Phobius:
```shell
mv -v [download path]/* /my_interproscan/interproscan-5.57-90.0/bin/phobius/1.01/
```

SignalP:
```shell
mv -v [download path]/* /my_interproscan/interproscan-5.57-90.0/bin/signalp/4.1/
```

TMHMM:
```shell
mv -v [download path]/* /my_interproscan/interproscan-5.57-90.0/bin/tmhmm/2.0c/
```
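
The `subprocess` package listed in the requirements could drive the same InterProScan call from Python instead of a shell loop. The following is a minimal sketch, not the command the pipeline actually runs: the helper names `build_interproscan_command` and `annotate_folder`, the default application list, and the relative `./interproscan.sh` path are assumptions for illustration. Only the flags (`-i`, `-o`, `-f`, `-dra`, `-cpu`, `-appl`) are taken from the commands above.

```python
import pathlib
import subprocess

def build_interproscan_command(fasta, cpu=14, applications=("Pfam", "SMART"),
                               executable="./interproscan.sh"):
    """Build the argument list for one InterProScan run with TSV output.

    The defaults are illustrative assumptions, not recommended settings.
    """
    fasta = pathlib.Path(fasta)
    return [
        executable,
        "-i", str(fasta),
        "-o", str(fasta) + ".tsv",       # same naming as the shell loop
        "-f", "tsv",
        "-dra",                          # disable residue-level annotation
        "-cpu", str(cpu),
        "-appl", ",".join(applications),
    ]

def annotate_folder(folder, **kwargs):
    """Run InterProScan on every .fasta file in a folder (hypothetical helper)."""
    for fasta in sorted(pathlib.Path(folder).glob("*.fasta")):
        command = build_interproscan_command(fasta, **kwargs)
        # check=True raises CalledProcessError if InterProScan exits with an error
        subprocess.run(command, check=True)
```

Passing the command as a list, rather than as one shell string, avoids quoting problems with file names that contain spaces or special characters.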


We first split the large proteome FASTA file into its individual sequences, which makes it easier to work with in the domain annotation steps.

## Input files

**Proteome.fasta:** A .fasta file containing all sequences of a given species. It must be placed in the `input` folder and named after the species, with underscores instead of spaces, e.g. `Hydra_vulgaris.fasta`.

```python
def clean_name(species):
    """Helper function used to clean the species name."""
    return species.strip().replace(" ", "_")

def split_proteome(species, rerun=False):
    """Reads the proteome file and splits it into individual sequence .fasta files."""

    logging.info("Starting new run for {}.".format(species))
    species = clean_name(species)  # fixes name
    # Use forward slashes (portable) and check for the file before parsing
    proteome_path = pathlib.Path("input/{}.fasta".format(species))
    if not proteome_path.is_file():
        logging.error("No proteome found.")
        raise FileNotFoundError("I could not find the proteome file. Are you sure it is named correctly?")
    # Create folder for the individual fasta sequences
    out_dir = pathlib.Path("input") / species
    out_dir.mkdir(parents=True, exist_ok=True)
    # Split the proteome into individual sequences
    created = 0
    for i, sequence in enumerate(SeqIO.parse(str(proteome_path), "fasta")):
        sequence.id = "{species}_{num}_{old_id}".format(species=species, num=i, old_id=sequence.id)
        filename = out_dir / "{}.fasta".format(sequence.id)
        # Only write the file if it does not exist yet, unless a rerun is requested
        if rerun or not filename.is_file():
            SeqIO.write(sequence, str(filename), "fasta")
            created += 1
    logging.info("Created {} files.".format(created))

split_proteome("Hydra vulgaris")
```
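
Because the original sequence ID becomes part of the file name, IDs that contain characters such as `|` or `/` can produce awkward or invalid paths. A minimal sketch of a sanitizing step follows, assuming a hypothetical helper `safe_id` that is not part of the pipeline above; the example ID is made up.

```python
import re

def safe_id(sequence_id):
    """Replace every character outside letters, digits, '.', '_' and '-' with '_'."""
    return re.sub(r"[^A-Za-z0-9._-]", "_", sequence_id)

# Made-up identifier in a pipe-separated style
print(safe_id("Hydra_vulgaris_0_sp|P12345|EXAMPLE"))
```

Such a helper could be applied to `sequence.id` before the file name is built, at the cost that two different IDs may map to the same name.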

## InterProScan annotation

Now the sequences can be annotated using the local InterProScan.

In the first run I only annotate the protein domains coarsely (residue-level annotation is disabled with `-dra`), because it runs on the whole proteome. This is sufficient to identify ECM proteins.

```shell
sudo ./interproscan.sh -i input/Hydra_vulgaris.fasta -o output/Hydra_vulgaris_annot.tsv -f tsv -dra -cpu 14 -appl TIGRFAM,SFLD,SUPERFAMILY,PANTHER,ProSiteProfiles,SMART,CDD,PRINTS,PIRSR,ProSitePatterns,AntiFam,Pfam
```
This creates a TSV file with the annotations for all sequences, one row per signature match.
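
To work with the result in Python, the TSV file can be read with the standard `csv` module. The sketch below assumes the column order documented for InterProScan 5 TSV output (protein accession, MD5, length, analysis, signature accession, signature description, start, stop, score, status, date, then optional InterPro columns). The function name `read_annotations` and the returned structure are illustrative choices, not part of the pipeline.

```python
import csv
from collections import defaultdict

def read_annotations(tsv_path):
    """Group InterProScan TSV rows by protein accession (first column)."""
    annotations = defaultdict(list)
    with open(tsv_path, newline="") as handle:
        # QUOTE_NONE keeps quote characters in descriptions as plain text
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            if len(row) < 8:
                continue  # skip empty or truncated lines
            annotations[row[0]].append({
                "analysis": row[3],      # e.g. Pfam, SMART
                "signature": row[4],     # signature accession
                "description": row[5],   # signature description
                "start": int(row[6]),
                "stop": int(row[7]),
            })
    return dict(annotations)
```

For example, `read_annotations("output/Hydra_vulgaris_annot.tsv")` would return a dictionary mapping each annotated sequence ID to a list of its domain matches.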