<a href="https://colab.research.google.com/github/andyposbe/ColabFold-Pipeline-Toolkit/blob/main/Pre_processing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **ColabFold-Pipeline-Toolkit**: Pre-Processing
<img src="https://github.com/andyposbe/ColabFold-Pipeline-Toolkit/blob/main/pre_processing_1.png?raw=true" height="200" align="right" style="height:200px">



This notebook prepares fasta files for submission to [ColabFold BATCH](https://colab.research.google.com/github/sokrypton/ColabFold/blob/main/batch/AlphaFold2_batch.ipynb).

## Operations covered

1. Fasta combiner for AF2-Multimer
2. Homo-multimer fasta preparation
3. Multifasta demultiplexer


---
**Author:** Andres Posbeyikian
**Date:** August 27th, 2023

For more details, check out the [ColabFold-Pipeline-Toolkit GitHub](https://github.com/andyposbe/ColabFold-Pipeline-Toolkit)
![GitHub](https://img.shields.io/badge/github-%23121011.svg?style=plastic&logo=github&logoColor=white)

To cite this toolkit, refer to this article: [10.5281/zenodo.10565786](https://doi.org/10.5281/zenodo.10565786)

# Mounting the drive and importing libraries

In [None]:
#@title Mount google drive
from google.colab import drive
drive.mount('/content/gdrive')

In [98]:
#@title Import libraries
from pathlib import Path
import glob
import os
import io
import json
import scipy
from google.colab import files
import numpy as np
import pandas as pd
import seaborn as sns
import openpyxl
import matplotlib.pyplot as plt
import matplotlib
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.subplots as sp
import plotly.express as px
import ipywidgets as widgets

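# 1. Fasta combiner for AF2-Multimer
Run the cells below to combine every selected protein from Pool A with every selected protein from Pool B into two-chain fasta files for ColabFold BATCH. First specify the Google Drive directories where each pool of single-sequence fasta files was uploaded.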
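
As a rough illustration of what the combining step produces, the sketch below pairs every sequence in one pool with every sequence in the other and joins each pair with `:`, the chain separator written into the combined files. The function name `combine_pools` and the toy sequences are hypothetical and not part of the notebook; the notebook cells read real fasta files from Google Drive instead.

```python
from itertools import product

def combine_pools(pool_a, pool_b):
    """Pair every Pool A entry with every Pool B entry.

    pool_a, pool_b: dicts mapping a protein name to its sequence.
    Returns a list of (header, sequence) tuples, one per A x B pair.
    """
    combined = []
    for (name_a, seq_a), (name_b, seq_b) in product(pool_a.items(), pool_b.items()):
        header = f">{name_a}_{name_b}"
        # ':' separates the two chains of the complex
        combined.append((header, f"{seq_a}:{seq_b}"))
    return combined

# Toy sequences, for illustration only
pairs = combine_pools({"A1": "MKT", "A2": "MLS"}, {"B1": "GAV"})
for header, sequence in pairs:
    print(header)
    print(sequence)
```

With two proteins in Pool A and one in Pool B, this yields two combined records, so the number of output files grows as the product of the two selections.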

In [91]:
#@title Input/Output directories

#Monomer_directory = '/content/drive/MyDrive/Project/monomer_fastas' # @param {type:"string"}
Pool_A_directory = '/content/gdrive/MyDrive/Project/Pool_A_fastas' # @param {type:"string"}
Pool_A_directory = Pool_A_directory.rstrip('/')
Pool_B_directory = '/content/gdrive/MyDrive/Project/Pool_B_fastas' # @param {type:"string"}
Pool_B_directory = Pool_B_directory.rstrip('/')

output_dir = '/content/gdrive/MyDrive/Project/output' #@param {type:"string"}
output_dir = output_dir.rstrip('/')

In [102]:
#@title Import the fasta files
Pool_A_list = []
Pool_B_list = []

# Load files from each directory into lists
for file in glob.glob(Pool_A_directory+'/*.fasta'):
  Pool_A_list.append(file.split('/')[-1])

for file in glob.glob(Pool_B_directory+'/*.fasta'):
  Pool_B_list.append(file.split('/')[-1])

Pool_A_list.sort()
Pool_B_list.sort()

In [None]:
#@title Select the proteins of interest from **Pool A**


select_multiple_widget = widgets.SelectMultiple(
    options= Pool_A_list,
    description='Pool A',
    disabled=False,
    # layout=widgets.Layout(height='200px', width='auto')
)
# Global list to store selected items
selected_items_list_A = []

# Function to update the list based on selection
def update_list(change):
    global selected_items_list_A
    selected_items_list_A = list(change['new'])

# Attach the update function to the 'value' trait of the widget
select_multiple_widget.observe(update_list, names='value')

# Display the widget
display(select_multiple_widget)

In [None]:
#@title Select the proteins of interest from **Pool B**
# Create the SelectMultiple widget
select_multiple_widget = widgets.SelectMultiple(
    options=Pool_B_list,
    description='Pool B',
    disabled=False,
    layout=widgets.Layout(height='200px', width='auto')
)
# Global list to store selected items
selected_items_list_B = []

# Function to update the list based on selection
def update_list(change):
    global selected_items_list_B
    selected_items_list_B = list(change['new'])

# Attach the update function to the 'value' trait of the widget
select_multiple_widget.observe(update_list, names='value')

# Display the widget
display(select_multiple_widget)

In [109]:
#@title Combine fastas for ColabFold BATCH and save to `output_dir`
def read_single_fasta(file_path):
    """Reads a single-record FASTA file and returns a tuple (header, sequence)."""
    with open(file_path, 'r') as file:
        header = None
        sequence = []

        for line in file:
            if line.startswith(">"):
                if header is not None:
                    break  # Stop if another header is found, assuming only one sequence in the file
                header = line.strip()[1:]  # Remove '>' and newline
            else:
                sequence.append(line.strip())

        return header, ''.join(sequence)


os.makedirs(output_dir, exist_ok=True)  # Make sure the output directory exists

i = 0  # Running index used as a file-name prefix
for pool_A_file in selected_items_list_A:
  header_A, sequence_A = read_single_fasta(os.path.join(Pool_A_directory, pool_A_file))
  for pool_B_file in selected_items_list_B:
      header_B, sequence_B = read_single_fasta(os.path.join(Pool_B_directory, pool_B_file))

      fused_header = '>' + header_A + '_' + header_B
      fused_sequence = sequence_A + ':' + sequence_B  # ':' separates the two chains
      # Drop the '.fasta' extension (splitext removes only the suffix)
      pool_A_name = os.path.splitext(pool_A_file)[0]
      output_path = os.path.join(output_dir, "{:03d}".format(i) + '_' + pool_A_name + '_' + pool_B_file)

      with open(output_path, 'w') as output_file:
        output_file.write(fused_header + '\n')
        output_file.write(fused_sequence + '\n')
      i += 1







---



# 2. Homo-multimer fasta preparation
The cells below build homo-oligomer fasta files from monomer fasta files by repeating each monomer sequence the number of times given by `oligomeric_state`.
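
The core of this step is a single string operation: the monomer sequence is repeated and the copies are joined with `:`. A minimal sketch, assuming a hypothetical helper named `make_homo_oligomer` that is not part of the notebook:

```python
def make_homo_oligomer(sequence, copies):
    """Repeat a monomer sequence `copies` times, joined by ':'."""
    if copies < 2:
        raise ValueError("a homo-oligomer needs at least two copies")
    return ":".join([sequence] * copies)

# Toy sequence, for illustration only
print(make_homo_oligomer("MKT", 4))  # MKT:MKT:MKT:MKT
```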

In [116]:
#@title Input/Output directories
oligomeric_state = 'tetramer' # @param ["dimer", "trimer", "tetramer","pentamer","hexamer","heptamer","octamer","nonamer","decamer"]

input_dir = '/content/gdrive/MyDrive/Project/input' # @param {type:"string"}
input_dir = input_dir.rstrip('/')
output_dir = '/content/gdrive/MyDrive/Project/output' #@param {type:"string"}
output_dir = output_dir.rstrip('/')

In [117]:
#@title Import fasta files
monomer_file_list = []

# Load files from each directory into lists
for file in glob.glob(input_dir+'/*.fasta'):
  monomer_file_list.append(file.split('/')[-1])

monomer_file_list.sort()

In [122]:
#@title Generate homo-oligomer fasta files and save in `output_dir`
def read_single_fasta(file_path):
    """Reads a single-record FASTA file and returns a tuple (header, sequence)."""
    with open(file_path, 'r') as file:
        header = None
        sequence = []

        for line in file:
            if line.startswith(">"):
                if header is not None:
                    break  # Stop if another header is found, assuming only one sequence in the file
                header = line.strip()[1:]  # Remove '>' and newline
            else:
                sequence.append(line.strip())

        return header, ''.join(sequence)


oligomeric_state_dictionary = {'dimer':2,
                               'trimer':3,
                               'tetramer':4,
                               'pentamer':5,
                               'hexamer':6,
                               'heptamer':7,
                               'octamer': 8,
                               'nonamer':9,
                               'decamer':10}

os.makedirs(output_dir, exist_ok=True)  # Make sure the output directory exists
n_copies = oligomeric_state_dictionary[oligomeric_state]  # Number of chains in the oligomer

i = 0  # Running index used as a file-name prefix
for monomer_file in monomer_file_list:
  monomer_header, monomer_sequence = read_single_fasta(os.path.join(input_dir, monomer_file))

  oligomeric_header = '>' + monomer_header + '_' + oligomeric_state
  oligomeric_sequence = ':'.join([monomer_sequence] * n_copies)  # ':' separates the chains
  output_path = os.path.join(output_dir, "{:03d}".format(i) + '_' + oligomeric_header[1:] + '.fasta')
  with open(output_path, 'w') as output_file:
    output_file.write(oligomeric_header + '\n')
    output_file.write(oligomeric_sequence + '\n')
  i += 1




---



# 3. Multifasta demultiplexer
ColabFold BATCH reads individual fasta files, so a multifasta file must be demultiplexed into one file per sequence before it can be fed into the pipeline. The code below does just that.

Specify the **directory** where you want the individual fasta files to be saved, and **upload the multifasta file below**.
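
To make the demultiplexing idea concrete, here is a small sketch that splits multifasta text into individual records in memory. The helper `split_multifasta` and the example records are hypothetical; the notebook cell below works on the uploaded file and writes one fasta file per record instead.

```python
def split_multifasta(text):
    """Split multi-FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # Skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # Sequence lines may wrap over several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

# Toy records, for illustration only
example = ">prot1 first protein\nMKT\nLLV\n>prot2\nGAV\n"
for header, sequence in split_multifasta(example):
    print(header, sequence)
```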

In [None]:
#@title Specify output directory and execute cell to upload multifasta

output_dir = '/content/gdrive/MyDrive/output/' #@param {type:"string"}
output_dir = output_dir.rstrip('/')

# Upload the multifasta file
from google.colab import files
uploaded = files.upload()
file_name = next(iter(uploaded))
file_path = "/content/" + file_name

In [36]:
#@title Demultiplex fasta file
def demultiplex_fasta(input_file, output_dir):
    """
    Reads a multi-FASTA file and creates individual FASTA files for each sequence.

    Args:
    input_file (str): Path to the multi-FASTA file.
    output_dir (str): Directory where the individual FASTA files will be saved.
    """
    import os

    # Create the output directory if it doesn't exist
    os.makedirs(output_dir, exist_ok=True)

    with open(input_file, 'r') as file:
        content = file.read().split('>')[1:]  # Split the file by '>' and drop the text before the first header

    for record_index, record in enumerate(content):
        lines = record.split('\n')
        header = lines[0].split()[0]  # First word of the header line, used in the file name

        output_path = os.path.join(output_dir, "{:03d}".format(record_index) + '_' + header + '.fasta')
        with open(output_path, 'w') as output_file:
            output_file.write('>' + record)

demultiplex_fasta(file_path, output_dir)
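
The demultiplexer above builds each file name from the first word of the header. Headers can contain characters such as `/` or `|` that are invalid or awkward in file names. One possible way to guard against this is sketched below; the helper `safe_filename`, its `max_length` parameter, and the example header are hypothetical and not part of the notebook.

```python
import re

def safe_filename(header, max_length=80):
    """Turn a FASTA header into a conservative file-name stem."""
    words = header.split()
    first_word = words[0] if words else "unnamed"
    # Replace anything other than letters, digits, '.', '_' and '-' with '_'
    cleaned = re.sub(r"[^A-Za-z0-9._-]", "_", first_word)
    return cleaned[:max_length]

# Made-up header, for illustration only
print(safe_filename("sp|X00000|DEMO_PROT example header"))  # sp_X00000_DEMO_PROT
```

Replacing rather than dropping the unwanted characters keeps the stem recognizable, at the cost that two different headers could map to the same name; the numeric prefix used in the output file names keeps the files distinct in that case.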