**DeepTMHMM from FASTA**

A quick pass that splits all FASTA sequences from TCDB into batches and runs DeepTMHMM on one batch, just to get a view of what it does.

Spoiler: it shows transmembrane segments, and that's basically it.

In [1]:
import biolib
import os
from Bio import SeqIO
from IPython.display import Image
deeptmhmm = biolib.load('DTU/DeepTMHMM')

2024-09-27 11:10:26,348 | INFO : Loaded project DTU/DeepTMHMM:1.0.42


***First off: submitting a large batch at once***

Each run saves the following output files:

- *TMRs.gff3*: transmembrane region annotations in GFF3 format.
- *predicted_topologies.3line*: a plain-text file with the predicted topology for each protein (cytoplasmic, extracellular, transmembrane).
- *deeptmhmm_results.md*: a markdown file summarizing the results.
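
To make the `.3line` output easier to inspect, here is a minimal parsing sketch. It assumes each record spans three lines: a header of the form `>id | TYPE`, the amino-acid sequence, and a per-residue label string in which `M` marks membrane residues. That layout is an assumption and is not shown in this notebook, so check it against a real output file first. The function `parse_3line` and the two example records are made up for illustration.

```python
import re

def parse_3line(text):
    """Parse 3line-formatted text into a list of dicts, one per protein.

    Assumed layout (not confirmed by this notebook): header, sequence,
    per-residue topology labels, repeated for every protein.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    records = []
    # Walk the non-empty lines three at a time; a trailing partial record is ignored.
    for i in range(0, len(lines) - 2, 3):
        header, sequence, topology = lines[i:i + 3]
        seq_id, _, predicted_type = header.lstrip(">").partition("|")
        records.append({
            "id": seq_id.strip(),
            "type": predicted_type.strip(),
            "sequence": sequence,
            "topology": topology,
            # Each uninterrupted run of 'M' labels counts as one transmembrane segment.
            "n_tm_segments": len(re.findall(r"M+", topology)),
        })
    return records

# Made-up example records, not real DeepTMHMM output.
example = """>protA | TM
MKTLLVAGAV
IIIMMMMOOO
>protB | GLOB
MSEQ
IIII
"""
records = parse_3line(example)
for rec in records:
    print(rec["id"], rec["type"], rec["n_tm_segments"])
```

For a real run, the text would come from reading a saved `predicted_topologies.3line` file instead of the inline string.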

Divide the large FASTA file

In [2]:
def split_fasta(input_fasta, output_prefix, chunk_size):
    """Split input_fasta into files of at most chunk_size records each."""
    os.makedirs("fasta_batches_split", exist_ok=True)

    batch_number = 1
    sequences_in_batch = []

    # SeqIO.parse accepts a path directly, so no file handle is left open.
    for record in SeqIO.parse(input_fasta, "fasta"):
        sequences_in_batch.append(record)

        # Flush a full batch to disk and start collecting the next one.
        if len(sequences_in_batch) == chunk_size:
            output_file = f"fasta_batches_split/{output_prefix}_batch_{batch_number}.fasta"
            SeqIO.write(sequences_in_batch, output_file, "fasta")
            sequences_in_batch = []
            batch_number += 1

    # Write any remaining records as a final, smaller batch.
    if sequences_in_batch:
        output_file = f"fasta_batches_split/{output_prefix}_batch_{batch_number}.fasta"
        SeqIO.write(sequences_in_batch, output_file, "fasta")

Process each batch with DeepTMHMM

In [3]:
def process_fasta_batches(output_prefix, batch_count):
    """Run DeepTMHMM on batches 1..batch_count and save each result set."""
    os.makedirs("fasta_batches_processed", exist_ok=True)

    for batch_number in range(1, batch_count + 1):
        batch_file = f"fasta_batches_split/{output_prefix}_batch_{batch_number}.fasta"
        print(f"Processing {batch_file}...")
        # Submit the batch and wait for the job to finish.
        deeptmhmm_job = deeptmhmm.cli(args=f"--fasta {batch_file}")
        # Results for each batch go into a directory named after the batch number.
        deeptmhmm_job.save_files(f"fasta_batches_processed/{batch_number}")
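
Since each batch is a separate job, it can help to know which batches still lack results before re-running the loop. The sketch below is one possible way to do that. It assumes the file naming used above (`..._batch_<n>.fasta`) and that a finished batch leaves a directory named after its number. The helper `pending_batches` is hypothetical, and the demo uses throwaway temporary directories rather than the real ones.

```python
import os
import re
import tempfile

def pending_batches(split_dir, processed_dir):
    """Return batch numbers that have a split FASTA file but no result directory yet."""
    pattern = re.compile(r"_batch_(\d+)\.fasta$")
    numbers = []
    for name in os.listdir(split_dir):
        match = pattern.search(name)
        if match:
            numbers.append(int(match.group(1)))
    # A batch counts as done once a directory named after its number exists.
    return [n for n in sorted(numbers)
            if not os.path.isdir(os.path.join(processed_dir, str(n)))]

# Demo on temporary directories: three split files, only batch 1 processed.
with tempfile.TemporaryDirectory() as tmp:
    split_dir = os.path.join(tmp, "fasta_batches_split")
    processed_dir = os.path.join(tmp, "fasta_batches_processed")
    os.makedirs(split_dir)
    os.makedirs(os.path.join(processed_dir, "1"))
    for n in (1, 2, 3):
        open(os.path.join(split_dir, f"demo_batch_{n}.fasta"), "w").close()
    todo = pending_batches(split_dir, processed_dir)

print(todo)
```

This only checks that a result directory exists, not that the job behind it completed, so a stricter check might look for a specific output file inside it.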

In [4]:
input_fasta = "../TCDB_fasta.txt"
output_prefix = "TCDB_fasta_split"
chunk_size = 50
# 100-500 is reasonable for TCC

split_fasta(input_fasta, output_prefix, chunk_size)

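# Count records without holding them all in memory, then round up to whole batches.
n_sequences = sum(1 for _ in SeqIO.parse(input_fasta, "fasta"))
batch_count = (n_sequences + chunk_size - 1) // chunk_size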
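
The batch count above is an integer ceiling division: adding `chunk_size - 1` before the floor division rounds any partial batch up to a full one. A small self-contained check of that arithmetic follows; `n_batches` is a hypothetical helper name, and the sequence counts are arbitrary test values, not TCDB figures.

```python
import math

def n_batches(n_sequences, chunk_size):
    """Number of batches needed so that every sequence lands in exactly one batch."""
    return (n_sequences + chunk_size - 1) // chunk_size

# The integer formula agrees with math.ceil for exact multiples and remainders alike.
for n in (0, 1, 49, 50, 51, 100, 101):
    assert n_batches(n, 50) == math.ceil(n / 50)

counts = {n: n_batches(n, 50) for n in (50, 51, 101)}
print(counts)  # {50: 1, 51: 2, 101: 3}
```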

In [5]:
# process_fasta_batches(output_prefix, batch_count)
process_fasta_batches(output_prefix, 1)


Processing fasta_batches_split/TCDB_fasta_split_batch_1.fasta...
2024-09-27 11:11:27,080 | INFO : Cloud: Initializing
2024-09-27 11:11:29,227 | INFO : Cloud: Pulling images...
2024-09-27 11:11:29,229 | INFO : Cloud: Computing...
2024-09-27 09:11:31,386 | INFO : Large input detected. Allocating dedicated capacity ...
Running DeepTMHMM on 50 sequences...
Step 1/4 | Loading transformer model...

Step 2/4 | Generating embeddings for sequences...
Generating embeddings: 100% 50/50 [00:06<00:00,  7.86seq/s]

Step 3/4 | Predicting topologies for sequences in batches of 1...
Topology prediction: 100% 50/50 [00:24<00:00,  2.05seq/s]

Step 4/4 | Generating output...
2024-09-27 11:16:19,871 | INFO : Cloud: Computation finished
2024-09-27 09:16:19,689 | INFO : Done in 288.60 seconds
2024-09-27 11:16:21,984 | INFO : Cloud: Result Ready
2024-09-27 11:16:21,986 | INFO : Waiting for job 846b574b-6758-4885-8e13-cef1529b3643 to finish...
2024-09-27 11:16:22,295 | INFO : Job 846b574b-6758-4885-8e13-cef152