<h1 style="font-family: 'Times New Roman'; text-align: center; font-weight: bold;">
    CoxL MOTIF Search
</h1>

<div style="font-family: 'Times New Roman'; font-size: 15px; text-align: justify; width: 100%;">
  <div>
    <span style="display: inline-block; width: 100px;"><b>Date</b></span>: 15<sup>th</sup> November 2024
  </div>
  <div>
    <span style="display: inline-block; width: 100px;"><b>Author</b></span>: Deepan Kanagarajan Babu
  </div>
  <div>
    <span style="display: inline-block; width: 100px;"><b>Description</b></span>: In this document, the fetched gene sequences are filtered further by searching them for MOTIF regions. A CoxL protein carries either the form 1 MOTIF ('AYXCSFR') or the form 2 MOTIF ('AYRGAGR'), not both, and the form differentiates the function and capability of the protein. Form 1 is known to be involved in CODH activity. The function of form 2 is unclear: some studies claim it has a divergent function and does not perform CODH activity, while others claim it does, but less effectively than form 1 proteins.
  </div>
</div>
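
As a minimal sketch of the either/or rule described above, the snippet below classifies a protein sequence by which full-length motif it carries. The function name `classify_coxl` and the toy sequences are hypothetical, and `X` in 'AYXCSFR' is assumed to stand for any single residue.

```python
import re

# Full-length motifs from the description; X is assumed to be any one residue
FORM1 = re.compile(r"AY.CSFR")
FORM2 = re.compile(r"AYRGAGR")

def classify_coxl(seq):
    """Return 'form_1', 'form_2', 'both', or 'none' for a protein sequence."""
    has1 = FORM1.search(seq) is not None
    has2 = FORM2.search(seq) is not None
    if has1 and has2:
        return "both"  # not expected for a single CoxL protein; worth manual review
    if has1:
        return "form_1"
    if has2:
        return "form_2"
    return "none"

# Toy sequences, invented for illustration only
print(classify_coxl("MKTAYHCSFRGG"))  # form_1
print(classify_coxl("MKTAYRGAGRGG"))  # form_2
print(classify_coxl("MKTLLSGAGRGG"))  # none
```

Unlike the search used later in this notebook, this sketch requires the full seven-residue motif, so a hit such as 'LLSGAGR' is not counted as form 2.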


<h2 style="font-family: 'Times New Roman'; font-weight: bold;">
    Required Libraries
</h2>

In [2]:
# importing necessary libraries
import os
import re
import pandas as pd
import numpy as np
from Bio import SeqIO
from collections import defaultdict

<h2 style="font-family: 'Times New Roman'; font-weight: bold;">
    Required Functions
</h2>

In [4]:
# Defining a single function to open multiple file formats: .csv, .txt, .xlsx, and .fasta/.fastq (only the first 100 lines are returned)
def open_file(file_path):
    if file_path.endswith('.csv'):
        # Read CSV file
        data = pd.read_csv(file_path)
        return data
    elif file_path.endswith('.txt'):
        # Read TXT file
        try:
            with open(file_path, 'r') as file:
                data = file.read()
            return data
        except FileNotFoundError:
            raise FileNotFoundError(f"The file {file_path} does not exist.")
    elif file_path.endswith(('.fasta', '.fastq')):
        # Read FASTA or FASTQ file
        try:
            with open(file_path, 'r') as file:
                data = []
                for _ in range(100):
                    line = file.readline()
                    if not line:
                        break
                    data.append(line.strip())
            return data
        except FileNotFoundError:
            raise FileNotFoundError(f"The file {file_path} does not exist.")
    elif file_path.endswith('.xlsx'):
        # Read XLSX file
        try:
            data = pd.read_excel(file_path)
            return data
        except FileNotFoundError:
            raise FileNotFoundError(f"The file {file_path} does not exist.")
    else:
        raise ValueError("Unsupported file format")
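
The FASTA/FASTQ branch above returns only the first 100 lines of the file. A minimal sketch of the same preview using `itertools.islice` follows; the helper name `preview_lines` and the temporary file are hypothetical and serve only as an illustration.

```python
import os
import tempfile
from itertools import islice

def preview_lines(file_path, n=100):
    """Return at most the first n lines of a text file, stripped of whitespace."""
    with open(file_path, "r") as handle:
        return [line.strip() for line in islice(handle, n)]

# Illustration with a throwaway FASTA file
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.fasta")
    with open(path, "w") as handle:
        handle.write(">seq1\nMKTAYRGAGR\n>seq2\nMKTAYHCSFR\n")
    print(preview_lines(path, n=3))
```

`islice` stops at the end of the file on its own, so no explicit check for an empty line is needed.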

In [7]:
# Function to search for up to two motifs, each tagged with its own label
def search_motifs(directory, pattern1, label1="Motif1", pattern2=None, label2="Motif2", output_file=None):
    """Search every .fasta/.faa file in `directory` for the given compiled regex patterns.

    Each pattern needs one capture group holding the three residues that
    precede the conserved tail; it is stored in the 'Previous_3_AA' column.
    """
    all_results = []

    # Ensure directory exists
    if not os.path.isdir(directory):
        print(f"Error: Directory '{directory}' not found.")
        return pd.DataFrame()

    # Pair each provided pattern with its label (patterns left as None are skipped)
    motifs = [(pattern, label) for pattern, label in ((pattern1, label1), (pattern2, label2)) if pattern]

    # Process each FASTA or FAA file in the directory
    for filename in os.listdir(directory):
        if not filename.endswith((".fasta", ".faa")):
            continue
        fasta_file = os.path.join(directory, filename)

        for record in SeqIO.parse(fasta_file, "fasta"):
            seq = str(record.seq)

            # Record one row per non-overlapping match of each pattern
            for pattern, label in motifs:
                for match in pattern.finditer(seq):
                    all_results.append({
                        "File_Name": filename,
                        "ID": record.id,
                        "Sequence": seq,
                        "MOTIF": match.group(0),
                        "Previous_3_AA": match.group(1),
                        "Motif_Label": label
                    })

    # Convert results to DataFrame
    df = pd.DataFrame(all_results)

    # Set output file name dynamically if not provided
    if output_file is None:
        output_file = f"{label1}_{label2}_motif_search.csv"  # Default to label-based filename

    # Save only if results are found
    if not df.empty:
        df.to_csv(output_file, index=False)
        print(f"Results saved to {output_file}")
    else:
        print("No results found for motifs.")

    return df
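
To see what `search_motifs` records for each hit without needing Biopython or files on disk, the sketch below runs the same `finditer` loop over an in-memory list of (ID, sequence) pairs. The helper `scan_records` and the toy records are hypothetical; the function above reads its records with `SeqIO.parse`.

```python
import re

def scan_records(records, pattern, label):
    """Collect one row per non-overlapping match, mirroring search_motifs' columns."""
    rows = []
    for seq_id, seq in records:
        for match in pattern.finditer(seq):
            rows.append({
                "ID": seq_id,
                "MOTIF": match.group(0),          # full matched text
                "Previous_3_AA": match.group(1),  # the captured three residues
                "Motif_Label": label,
            })
    return rows

# Toy records, invented for illustration
records = [("toy_1", "MKLLSGAGRTTAYRGAGRQ"), ("toy_2", "MKTTTT")]
rows = scan_records(records, re.compile(r"(.{3})GAGR"), "CoxL_form_2")
for row in rows:
    print(row)
```

The first toy sequence yields two rows ('LLSGAGR' and 'AYRGAGR'), which is why one sequence ID can appear several times in the results table.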

<h2 style="font-family: 'Times New Roman'; font-weight: bold;">
    Required Directories Path
</h2>

In [10]:
# Specify the input folder paths
## concatenated MOTIF form 1 BLAST CSV file path
motif_1_csv_file_path = "CoxL1_Results/MOTIF/coxl1_csv.csv"


## concatenated MOTIF form 2 BLAST CSV file path
motif_2_csv_file_path = "CoxL2_Results/MOTIF/coxl2_csv.csv"

## CoxL Form 1 BLAST proteins sequence directory
motif_1_proteins = "CoxL1_Results/MOTIF_proteins"

## CoxL Form 2 BLAST proteins sequence directory
motif_2_proteins = "CoxL2_Results/MOTIF_proteins"

# Specify the output folder paths
## Directory to store CoxL Form 1 BLAST MOTIF search results output
motif1 = "CoxL1_Results/MOTIF"

## Directory to store CoxL Form 2 BLAST MOTIF search results output
motif2 = "CoxL2_Results/MOTIF"

## Directory to store unaligned sequences for MOTIF form 1
motif_1_fasta_path = "CoxL1_Results/MOTIF/fasta"
## Ensure the output directory exists
os.makedirs(motif_1_fasta_path, exist_ok=True)

## Directory to store unaligned sequences for MOTIF form 2
motif_2_fasta_path = "CoxL2_Results/MOTIF/fasta"
## Ensure the output directory exists
os.makedirs(motif_2_fasta_path, exist_ok=True)

<h2 style="font-family: 'Times New Roman'; font-weight: bold;">
    Search MOTIFs
</h2>

In [13]:
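# MOTIF search pattern
## Define the motifs as regular expressions; the capture group holds the three residues preceding the conserved tail
form1 = re.compile(r"(.{3})CSFR") # any three residues followed by CSFR, the last four amino acids of form 1: AYXCSFR
form2 = re.compile(r"(.{3})GAGR") # any three residues followed by GAGR, the last four amino acids of form 2: AYRGAGR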

## Define Label for MOTIFs
label1 = "CoxL_form_1"
label2 = "CoxL_form_2"
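
Because both patterns accept any three residues before the conserved tail, hits such as 'ISFGAGR' are reported alongside the full motif 'AYRGAGR'. One possible way to separate the two, sketched under the assumption that only prefixes matching the motifs in the description count as canonical, is to test the captured prefix. `is_canonical` is a hypothetical helper, not part of this notebook.

```python
import re

# Canonical prefixes implied by the full motifs AYXCSFR and AYRGAGR
CANONICAL_PREFIX = {
    "CoxL_form_1": re.compile(r"AY."),  # X assumed to be any residue
    "CoxL_form_2": re.compile(r"AYR"),
}

def is_canonical(previous_3_aa, label):
    """True if the three residues before the tail match the full motif."""
    return CANONICAL_PREFIX[label].fullmatch(previous_3_aa) is not None

print(is_canonical("AYR", "CoxL_form_2"))  # True
print(is_canonical("ISF", "CoxL_form_2"))  # False
print(is_canonical("NGS", "CoxL_form_1"))  # False
```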

<h3 style="font-family: 'Times New Roman'; font-weight: bold;">
    Search MOTIFs in CoxL Form 1 BLAST Results
</h3>

In [16]:
# Search MOTIFs for CoxL Form 1 BLAST results
coxl1_motif_results = search_motifs(motif_1_proteins, form1, label1, form2, label2, output_file=f"{motif1}/coxl1_motif_search.csv")

Results saved to CoxL1_Results/MOTIF/coxl1_motif_search.csv


In [17]:
# Checking MOTIF search results
coxl1_motif_results

Unnamed: 0,File_Name,ID,Sequence,MOTIF,Previous_3_AA,Motif_Label
0,uncultured_Acidovorax_sp._ERS6626787.faa,Genome228_k127_160184_1,PRRWSCWPRGSLISFGAGRYARPSRRRARGLLTGRVCRTTRPHRTG...,ISFGAGR,ISF,CoxL_form_2
1,uncultured_Acidovorax_sp._ERS6626787.faa,Genome228_k127_160184_1,PRRWSCWPRGSLISFGAGRYARPSRRRARGLLTGRVCRTTRPHRTG...,ISFGAGR,ISF,CoxL_form_2
2,uncultured_Acidovorax_sp._ERS6626787.faa,Genome228_k127_160184_1,PRRWSCWPRGSLISFGAGRYARPSRRRARGLLTGRVCRTTRPHRTG...,ISFGAGR,ISF,CoxL_form_2
3,uncultured_Acidovorax_sp._ERS6626787.faa,Genome228_k127_160184_1,PRRWSCWPRGSLISFGAGRYARPSRRRARGLLTGRVCRTTRPHRTG...,ISFGAGR,ISF,CoxL_form_2
4,uncultured_Acidovorax_sp._ERS6626787.faa,Genome228_k127_160184_1,PRRWSCWPRGSLISFGAGRYARPSRRRARGLLTGRVCRTTRPHRTG...,ISFGAGR,ISF,CoxL_form_2
...,...,...,...,...,...,...
5145,uncultured_Variovorax_sp._ERS6627246.faa,Genome470_k127_43488_1,GSGQAVRRLEDESLLSGAGRYTDDVTLPEQAHLVFLRSSYPHARIL...,AYRGAGR,AYR,CoxL_form_2
5146,uncultured_Variovorax_sp._ERS6627246.faa,Genome470_k127_43488_1,GSGQAVRRLEDESLLSGAGRYTDDVTLPEQAHLVFLRSSYPHARIL...,LLSGAGR,LLS,CoxL_form_2
5147,uncultured_Variovorax_sp._ERS6627246.faa,Genome470_k127_43488_1,GSGQAVRRLEDESLLSGAGRYTDDVTLPEQAHLVFLRSSYPHARIL...,AYRGAGR,AYR,CoxL_form_2
5148,uncultured_Variovorax_sp._ERS6627246.faa,Genome470_k127_43488_1,QRVVALTIEPRSVLAARDPETGRLTVRMSTQMPSGVRDAICAAIGL...,AYRGAGR,AYR,CoxL_form_2


In [18]:
# Checking for the presence of Form 1 and 2 MOTIFs in the results
coxl1_motif_results['Motif_Label'].unique()

array(['CoxL_form_2', 'CoxL_form_1'], dtype=object)

In [107]:
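# Extracting the Form 1 MOTIF hits (the execution count shows this cell was run after the de-duplication steps below, so its output already includes the later columns)
coxl1_motif_output = coxl1_motif_results[coxl1_motif_results['Motif_Label'] == 'CoxL_form_1']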

In [109]:
coxl1_motif_output

Unnamed: 0,File_Name,ID,Sequence,MOTIF,Previous_3_AA,Motif_Label,target_file,qseqid,sequence_length
9,uncultured_Comamonas_sp._ERS6626602.faa,Genome43_S3Ck127_66545_1,HKQWNYWARGSSTSSGAGRCARPSFRSALRPRRGRGCRTTCPRRTG...,NGSCSFR,NGS,CoxL_form_1,uncultured_Comamonas_sp._ERS6626602.fasta,Genome43_S3Ck127_66545,199


In [19]:
# Checking the number of unique genomes with at least one MOTIF hit (Form 1 or Form 2)
coxl1_motif_results['File_Name'].nunique()

59

In [20]:
# Checking the structure of sequence ID
coxl1_motif_results['ID'].unique()[0]

'Genome228_k127_160184_1'

In [21]:
# Adding a 'target_file' column (File_Name with '.faa' replaced by '.fasta') for easier analysis in later steps
coxl1_motif_results['target_file'] = coxl1_motif_results['File_Name'].apply(lambda x: x[:-4] + '.fasta')
coxl1_motif_results

Unnamed: 0,File_Name,ID,Sequence,MOTIF,Previous_3_AA,Motif_Label,target_file
0,uncultured_Acidovorax_sp._ERS6626787.faa,Genome228_k127_160184_1,PRRWSCWPRGSLISFGAGRYARPSRRRARGLLTGRVCRTTRPHRTG...,ISFGAGR,ISF,CoxL_form_2,uncultured_Acidovorax_sp._ERS6626787.fasta
1,uncultured_Acidovorax_sp._ERS6626787.faa,Genome228_k127_160184_1,PRRWSCWPRGSLISFGAGRYARPSRRRARGLLTGRVCRTTRPHRTG...,ISFGAGR,ISF,CoxL_form_2,uncultured_Acidovorax_sp._ERS6626787.fasta
2,uncultured_Acidovorax_sp._ERS6626787.faa,Genome228_k127_160184_1,PRRWSCWPRGSLISFGAGRYARPSRRRARGLLTGRVCRTTRPHRTG...,ISFGAGR,ISF,CoxL_form_2,uncultured_Acidovorax_sp._ERS6626787.fasta
3,uncultured_Acidovorax_sp._ERS6626787.faa,Genome228_k127_160184_1,PRRWSCWPRGSLISFGAGRYARPSRRRARGLLTGRVCRTTRPHRTG...,ISFGAGR,ISF,CoxL_form_2,uncultured_Acidovorax_sp._ERS6626787.fasta
4,uncultured_Acidovorax_sp._ERS6626787.faa,Genome228_k127_160184_1,PRRWSCWPRGSLISFGAGRYARPSRRRARGLLTGRVCRTTRPHRTG...,ISFGAGR,ISF,CoxL_form_2,uncultured_Acidovorax_sp._ERS6626787.fasta
...,...,...,...,...,...,...,...
5145,uncultured_Variovorax_sp._ERS6627246.faa,Genome470_k127_43488_1,GSGQAVRRLEDESLLSGAGRYTDDVTLPEQAHLVFLRSSYPHARIL...,AYRGAGR,AYR,CoxL_form_2,uncultured_Variovorax_sp._ERS6627246.fasta
5146,uncultured_Variovorax_sp._ERS6627246.faa,Genome470_k127_43488_1,GSGQAVRRLEDESLLSGAGRYTDDVTLPEQAHLVFLRSSYPHARIL...,LLSGAGR,LLS,CoxL_form_2,uncultured_Variovorax_sp._ERS6627246.fasta
5147,uncultured_Variovorax_sp._ERS6627246.faa,Genome470_k127_43488_1,GSGQAVRRLEDESLLSGAGRYTDDVTLPEQAHLVFLRSSYPHARIL...,AYRGAGR,AYR,CoxL_form_2,uncultured_Variovorax_sp._ERS6627246.fasta
5148,uncultured_Variovorax_sp._ERS6627246.faa,Genome470_k127_43488_1,QRVVALTIEPRSVLAARDPETGRLTVRMSTQMPSGVRDAICAAIGL...,AYRGAGR,AYR,CoxL_form_2,uncultured_Variovorax_sp._ERS6627246.fasta


In [22]:
# Adding a 'qseqid' column by removing the trailing '_1' from the sequence ID column of the data frame
coxl1_motif_results['qseqid'] = coxl1_motif_results['ID'].apply(lambda x: x[:-2])
coxl1_motif_results

Unnamed: 0,File_Name,ID,Sequence,MOTIF,Previous_3_AA,Motif_Label,target_file,qseqid
0,uncultured_Acidovorax_sp._ERS6626787.faa,Genome228_k127_160184_1,PRRWSCWPRGSLISFGAGRYARPSRRRARGLLTGRVCRTTRPHRTG...,ISFGAGR,ISF,CoxL_form_2,uncultured_Acidovorax_sp._ERS6626787.fasta,Genome228_k127_160184
1,uncultured_Acidovorax_sp._ERS6626787.faa,Genome228_k127_160184_1,PRRWSCWPRGSLISFGAGRYARPSRRRARGLLTGRVCRTTRPHRTG...,ISFGAGR,ISF,CoxL_form_2,uncultured_Acidovorax_sp._ERS6626787.fasta,Genome228_k127_160184
2,uncultured_Acidovorax_sp._ERS6626787.faa,Genome228_k127_160184_1,PRRWSCWPRGSLISFGAGRYARPSRRRARGLLTGRVCRTTRPHRTG...,ISFGAGR,ISF,CoxL_form_2,uncultured_Acidovorax_sp._ERS6626787.fasta,Genome228_k127_160184
3,uncultured_Acidovorax_sp._ERS6626787.faa,Genome228_k127_160184_1,PRRWSCWPRGSLISFGAGRYARPSRRRARGLLTGRVCRTTRPHRTG...,ISFGAGR,ISF,CoxL_form_2,uncultured_Acidovorax_sp._ERS6626787.fasta,Genome228_k127_160184
4,uncultured_Acidovorax_sp._ERS6626787.faa,Genome228_k127_160184_1,PRRWSCWPRGSLISFGAGRYARPSRRRARGLLTGRVCRTTRPHRTG...,ISFGAGR,ISF,CoxL_form_2,uncultured_Acidovorax_sp._ERS6626787.fasta,Genome228_k127_160184
...,...,...,...,...,...,...,...,...
5145,uncultured_Variovorax_sp._ERS6627246.faa,Genome470_k127_43488_1,GSGQAVRRLEDESLLSGAGRYTDDVTLPEQAHLVFLRSSYPHARIL...,AYRGAGR,AYR,CoxL_form_2,uncultured_Variovorax_sp._ERS6627246.fasta,Genome470_k127_43488
5146,uncultured_Variovorax_sp._ERS6627246.faa,Genome470_k127_43488_1,GSGQAVRRLEDESLLSGAGRYTDDVTLPEQAHLVFLRSSYPHARIL...,LLSGAGR,LLS,CoxL_form_2,uncultured_Variovorax_sp._ERS6627246.fasta,Genome470_k127_43488
5147,uncultured_Variovorax_sp._ERS6627246.faa,Genome470_k127_43488_1,GSGQAVRRLEDESLLSGAGRYTDDVTLPEQAHLVFLRSSYPHARIL...,AYRGAGR,AYR,CoxL_form_2,uncultured_Variovorax_sp._ERS6627246.fasta,Genome470_k127_43488
5148,uncultured_Variovorax_sp._ERS6627246.faa,Genome470_k127_43488_1,QRVVALTIEPRSVLAARDPETGRLTVRMSTQMPSGVRDAICAAIGL...,AYRGAGR,AYR,CoxL_form_2,uncultured_Variovorax_sp._ERS6627246.fasta,Genome470_k127_43488
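
The slice `x[:-2]` removes the trailing `_1` only when the suffix is a single digit. A sketch of a more defensive alternative follows, assuming every ID is a contig name followed by an underscore and a numeric suffix (an assumption about the ID format, based on the example 'Genome228_k127_160184_1'). `strip_suffix` is a hypothetical helper.

```python
def strip_suffix(seq_id):
    """Drop the final '_<number>' from a sequence ID, whatever its length."""
    head, sep, tail = seq_id.rpartition("_")
    # Only strip when the part after the last underscore is numeric
    return head if sep and tail.isdigit() else seq_id

print(strip_suffix("Genome228_k127_160184_1"))   # Genome228_k127_160184
print(strip_suffix("Genome228_k127_160184_12"))  # Genome228_k127_160184
print("Genome228_k127_160184_12"[:-2])           # Genome228_k127_160184_
```

With a two-digit suffix the fixed slice leaves a dangling underscore, which would prevent the ID from matching the BLAST `qseqid` values later on.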


In [23]:
# Finding the length of every sequence and saving the value in the 'sequence_length' column
coxl1_motif_results['sequence_length'] = coxl1_motif_results['Sequence'].apply(len)
coxl1_motif_results

Unnamed: 0,File_Name,ID,Sequence,MOTIF,Previous_3_AA,Motif_Label,target_file,qseqid,sequence_length
0,uncultured_Acidovorax_sp._ERS6626787.faa,Genome228_k127_160184_1,PRRWSCWPRGSLISFGAGRYARPSRRRARGLLTGRVCRTTRPHRTG...,ISFGAGR,ISF,CoxL_form_2,uncultured_Acidovorax_sp._ERS6626787.fasta,Genome228_k127_160184,197
1,uncultured_Acidovorax_sp._ERS6626787.faa,Genome228_k127_160184_1,PRRWSCWPRGSLISFGAGRYARPSRRRARGLLTGRVCRTTRPHRTG...,ISFGAGR,ISF,CoxL_form_2,uncultured_Acidovorax_sp._ERS6626787.fasta,Genome228_k127_160184,196
2,uncultured_Acidovorax_sp._ERS6626787.faa,Genome228_k127_160184_1,PRRWSCWPRGSLISFGAGRYARPSRRRARGLLTGRVCRTTRPHRTG...,ISFGAGR,ISF,CoxL_form_2,uncultured_Acidovorax_sp._ERS6626787.fasta,Genome228_k127_160184,196
3,uncultured_Acidovorax_sp._ERS6626787.faa,Genome228_k127_160184_1,PRRWSCWPRGSLISFGAGRYARPSRRRARGLLTGRVCRTTRPHRTG...,ISFGAGR,ISF,CoxL_form_2,uncultured_Acidovorax_sp._ERS6626787.fasta,Genome228_k127_160184,197
4,uncultured_Acidovorax_sp._ERS6626787.faa,Genome228_k127_160184_1,PRRWSCWPRGSLISFGAGRYARPSRRRARGLLTGRVCRTTRPHRTG...,ISFGAGR,ISF,CoxL_form_2,uncultured_Acidovorax_sp._ERS6626787.fasta,Genome228_k127_160184,197
...,...,...,...,...,...,...,...,...,...
5145,uncultured_Variovorax_sp._ERS6627246.faa,Genome470_k127_43488_1,GSGQAVRRLEDESLLSGAGRYTDDVTLPEQAHLVFLRSSYPHARIL...,AYRGAGR,AYR,CoxL_form_2,uncultured_Variovorax_sp._ERS6627246.fasta,Genome470_k127_43488,761
5146,uncultured_Variovorax_sp._ERS6627246.faa,Genome470_k127_43488_1,GSGQAVRRLEDESLLSGAGRYTDDVTLPEQAHLVFLRSSYPHARIL...,LLSGAGR,LLS,CoxL_form_2,uncultured_Variovorax_sp._ERS6627246.fasta,Genome470_k127_43488,760
5147,uncultured_Variovorax_sp._ERS6627246.faa,Genome470_k127_43488_1,GSGQAVRRLEDESLLSGAGRYTDDVTLPEQAHLVFLRSSYPHARIL...,AYRGAGR,AYR,CoxL_form_2,uncultured_Variovorax_sp._ERS6627246.fasta,Genome470_k127_43488,760
5148,uncultured_Variovorax_sp._ERS6627246.faa,Genome470_k127_43488_1,QRVVALTIEPRSVLAARDPETGRLTVRMSTQMPSGVRDAICAAIGL...,AYRGAGR,AYR,CoxL_form_2,uncultured_Variovorax_sp._ERS6627246.fasta,Genome470_k127_43488,571


In [24]:
# Sort the data frame by 'File_Name', 'qseqid' & 'MOTIF' (ascending) and 'sequence_length' (descending), so the longest sequence comes first in each group
coxl1_motif_results = coxl1_motif_results.sort_values(by=['File_Name', 'qseqid', 'MOTIF', 'sequence_length'],
                                        ascending=[True, True, True, False])
coxl1_motif_results

Unnamed: 0,File_Name,ID,Sequence,MOTIF,Previous_3_AA,Motif_Label,target_file,qseqid,sequence_length
0,uncultured_Acidovorax_sp._ERS6626787.faa,Genome228_k127_160184_1,PRRWSCWPRGSLISFGAGRYARPSRRRARGLLTGRVCRTTRPHRTG...,ISFGAGR,ISF,CoxL_form_2,uncultured_Acidovorax_sp._ERS6626787.fasta,Genome228_k127_160184,197
3,uncultured_Acidovorax_sp._ERS6626787.faa,Genome228_k127_160184_1,PRRWSCWPRGSLISFGAGRYARPSRRRARGLLTGRVCRTTRPHRTG...,ISFGAGR,ISF,CoxL_form_2,uncultured_Acidovorax_sp._ERS6626787.fasta,Genome228_k127_160184,197
4,uncultured_Acidovorax_sp._ERS6626787.faa,Genome228_k127_160184_1,PRRWSCWPRGSLISFGAGRYARPSRRRARGLLTGRVCRTTRPHRTG...,ISFGAGR,ISF,CoxL_form_2,uncultured_Acidovorax_sp._ERS6626787.fasta,Genome228_k127_160184,197
5,uncultured_Acidovorax_sp._ERS6626787.faa,Genome228_k127_160184_1,PRRWSCWPRGSLISFGAGRYARPSRRRARGLLTGRVCRTTRPHRTG...,ISFGAGR,ISF,CoxL_form_2,uncultured_Acidovorax_sp._ERS6626787.fasta,Genome228_k127_160184,197
11,uncultured_Acidovorax_sp._ERS6626787.faa,Genome228_k127_160184_1,PRRWSCWPRGSLISFGAGRYARPSRRRARGLLTGRVCRTTRPHRTG...,ISFGAGR,ISF,CoxL_form_2,uncultured_Acidovorax_sp._ERS6626787.fasta,Genome228_k127_160184,197
...,...,...,...,...,...,...,...,...,...
5078,uncultured_Variovorax_sp._ERS6627246.faa,Genome470_k127_43488_1,VRRLEDESLLSGAGRYTDDVTLPEQAHLVFLRSSYPHARILSIDTS...,LLSGAGR,LLS,CoxL_form_2,uncultured_Variovorax_sp._ERS6627246.fasta,Genome470_k127_43488,758
5080,uncultured_Variovorax_sp._ERS6627246.faa,Genome470_k127_43488_1,VRRLEDESLLSGAGRYTDDVTLPEQAHLVFLRSSYPHARILSIDTS...,LLSGAGR,LLS,CoxL_form_2,uncultured_Variovorax_sp._ERS6627246.fasta,Genome470_k127_43488,758
5098,uncultured_Variovorax_sp._ERS6627246.faa,Genome470_k127_43488_1,VRRLEDESLLSGAGRYTDDVTLPEQAHLVFLRSSYPHARILSIDTS...,LLSGAGR,LLS,CoxL_form_2,uncultured_Variovorax_sp._ERS6627246.fasta,Genome470_k127_43488,757
5108,uncultured_Variovorax_sp._ERS6627246.faa,Genome470_k127_43488_1,QAVRRLEDESLLSGAGRYTDDVTLPEQAHLVFLRSSYPHARILSID...,LLSGAGR,LLS,CoxL_form_2,uncultured_Variovorax_sp._ERS6627246.fasta,Genome470_k127_43488,738


In [25]:
# Grouping the data frame by 'File_Name', 'qseqid' & 'MOTIF' and keeping only the 1st row of every group (the longest sequence after sorting), which removes the duplicates
coxl1_motif_results = coxl1_motif_results.groupby(['File_Name', 'qseqid', 'MOTIF']).head(1).reset_index(drop=True)
coxl1_motif_results

Unnamed: 0,File_Name,ID,Sequence,MOTIF,Previous_3_AA,Motif_Label,target_file,qseqid,sequence_length
0,uncultured_Acidovorax_sp._ERS6626787.faa,Genome228_k127_160184_1,PRRWSCWPRGSLISFGAGRYARPSRRRARGLLTGRVCRTTRPHRTG...,ISFGAGR,ISF,CoxL_form_2,uncultured_Acidovorax_sp._ERS6626787.fasta,Genome228_k127_160184,197
1,uncultured_Acidovorax_sp._ERS6627271.faa,Genome495_k127_317059_1,MGASDFSKLPYIGEALKRKEDYRFLTGAGQYTDDVVLAAQCHAVFV...,AYRGAGR,AYR,CoxL_form_2,uncultured_Acidovorax_sp._ERS6627271.fasta,Genome495_k127_317059,795
2,uncultured_Acinetobacter_sp._ERS6626821.faa,Genome262_k127_235961_1,DMNQTINLSISKKSAKAGDSIPHESAHLHVTGQATYIDDLPELENT...,EQPGAGR,EQP,CoxL_form_2,uncultured_Acinetobacter_sp._ERS6626821.fasta,Genome262_k127_235961,776
3,uncultured_Acinetobacter_sp._ERS6626858.faa,Genome299_k127_417201_1,YLNAVELRNLRCKTNTVSNTAYRGFGGPQGMFVIENIIDDIARYLG...,EQPGAGR,EQP,CoxL_form_2,uncultured_Acinetobacter_sp._ERS6626858.fasta,Genome299_k127_417201,444
4,uncultured_Agrobacterium_sp._ERS6626766.faa,Genome207_k127_77525_1,MLPVFVGPAGPLVQAERGEAVVSAAGLRAAAVRDGPSAGVVLSPAA...,GCIGAGR,GCI,CoxL_form_2,uncultured_Agrobacterium_sp._ERS6626766.fasta,Genome207_k127_77525,220
...,...,...,...,...,...,...,...,...,...
69,uncultured_Stenotrophomonas_sp._ERS6627319.faa,Genome543_k127_120424_1,RSKATYGGRVHGAGRMTASSGVAFGSASRLA*,RVHGAGR,RVH,CoxL_form_2,uncultured_Stenotrophomonas_sp._ERS6627319.fasta,Genome543_k127_120424,32
70,uncultured_Variovorax_sp._ERS6626779.faa,Genome220_k127_244806_1,KRTTLTTEPNPTRFGSGQAVRRLEDESLLSGAGRYTDDVTLPEQAH...,AYRGAGR,AYR,CoxL_form_2,uncultured_Variovorax_sp._ERS6626779.fasta,Genome220_k127_244806,777
71,uncultured_Variovorax_sp._ERS6626779.faa,Genome220_k127_244806_1,KRTTLTTEPNPTRFGSGQAVRRLEDESLLSGAGRYTDDVTLPEQAH...,LLSGAGR,LLS,CoxL_form_2,uncultured_Variovorax_sp._ERS6626779.fasta,Genome220_k127_244806,777
72,uncultured_Variovorax_sp._ERS6627246.faa,Genome470_k127_43488_1,KRTTLTTEPNPTRFGSGQAVRRLEDESLLSGAGRYTDDVTLPEQAH...,AYRGAGR,AYR,CoxL_form_2,uncultured_Variovorax_sp._ERS6627246.fasta,Genome470_k127_43488,777
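
The sort followed by `groupby(...).head(1)` keeps the longest sequence for each (File_Name, qseqid, MOTIF) combination. The sketch below uses a toy frame to show that `drop_duplicates` with `keep="first"` selects the same rows after the same sort. The data are invented for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "File_Name": ["a.faa", "a.faa", "a.faa", "b.faa"],
    "qseqid": ["c1", "c1", "c1", "c2"],
    "MOTIF": ["AYRGAGR", "AYRGAGR", "LLSGAGR", "AYRGAGR"],
    "sequence_length": [196, 197, 197, 760],
})
keys = ["File_Name", "qseqid", "MOTIF"]

# Longest sequence first within each key combination
ordered = toy.sort_values(by=keys + ["sequence_length"],
                          ascending=[True, True, True, False])

via_groupby = ordered.groupby(keys).head(1).reset_index(drop=True)
via_dedup = ordered.drop_duplicates(subset=keys, keep="first").reset_index(drop=True)

print(via_groupby)
print(via_groupby.equals(via_dedup))
```

Either form works; the choice is a matter of readability.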


In [123]:
# Checking the Form 1 MOTIF hits that remain after removing duplicates
coxl1_motif_results[coxl1_motif_results['Motif_Label'] == 'CoxL_form_1']

Unnamed: 0,File_Name,ID,Sequence,MOTIF,Previous_3_AA,Motif_Label,target_file,qseqid,sequence_length
9,uncultured_Comamonas_sp._ERS6626602.faa,Genome43_S3Ck127_66545_1,HKQWNYWARGSSTSSGAGRCARPSFRSALRPRRGRGCRTTCPRRTG...,NGSCSFR,NGS,CoxL_form_1,uncultured_Comamonas_sp._ERS6626602.fasta,Genome43_S3Ck127_66545,199


In [26]:
# Save updated results
coxl1_motif_results.to_csv(f"{motif1}/coxl1_motif_results.csv")

In [27]:
# Generating a list of Genome file names
coxl1_motif_mag_list = list(coxl1_motif_results['target_file'].unique())
coxl1_motif_mag_list

['uncultured_Acidovorax_sp._ERS6626787.fasta',
 'uncultured_Acidovorax_sp._ERS6627271.fasta',
 'uncultured_Acinetobacter_sp._ERS6626821.fasta',
 'uncultured_Acinetobacter_sp._ERS6626858.fasta',
 'uncultured_Agrobacterium_sp._ERS6626766.fasta',
 'uncultured_Aureimonas_sp._ERS6626915.fasta',
 'uncultured_Aureimonas_sp._ERS6627306.fasta',
 'uncultured_Comamonas_sp._ERS6626602.fasta',
 'uncultured_Comamonas_sp._ERS6626630.fasta',
 'uncultured_Comamonas_sp._ERS6626638.fasta',
 'uncultured_Comamonas_sp._ERS6626814.fasta',
 'uncultured_Comamonas_sp._ERS6627283.fasta',
 'uncultured_Deinococcus_sp._ERS6626797.fasta',
 'uncultured_Deinococcus_sp._ERS6626820.fasta',
 'uncultured_Deinococcus_sp._ERS6626837.fasta',
 'uncultured_Deinococcus_sp._ERS6627293.fasta',
 'uncultured_Deinococcus_sp._ERS6627335.fasta',
 'uncultured_Hymenobacter_sp._ERS6626733.fasta',
 'uncultured_Hymenobacter_sp._ERS6627279.fasta',
 'uncultured_Massilia_sp._ERS6627272.fasta',
 'uncultured_Methylobacterium_sp._ERS6627288.fast

In [28]:
# Checking the length of the genome file names list
len(coxl1_motif_mag_list)

59

In [29]:
# Generating a dictionary that maps each genome file name to the set of its query sequence IDs
coxl1_motif_id_set = coxl1_motif_results.groupby('target_file')['qseqid'].apply(set).to_dict()
coxl1_motif_id_set

{'uncultured_Acidovorax_sp._ERS6626787.fasta': {'Genome228_k127_160184'},
 'uncultured_Acidovorax_sp._ERS6627271.fasta': {'Genome495_k127_317059'},
 'uncultured_Acinetobacter_sp._ERS6626821.fasta': {'Genome262_k127_235961'},
 'uncultured_Acinetobacter_sp._ERS6626858.fasta': {'Genome299_k127_417201'},
 'uncultured_Agrobacterium_sp._ERS6626766.fasta': {'Genome207_k127_77525'},
 'uncultured_Aureimonas_sp._ERS6626915.fasta': {'Genome356_k127_339079',
  'Genome356_k127_498042',
  'Genome356_k127_610672'},
 'uncultured_Aureimonas_sp._ERS6627306.fasta': {'Genome530_k127_16854'},
 'uncultured_Comamonas_sp._ERS6626602.fasta': {'Genome43_S3Ck127_66545'},
 'uncultured_Comamonas_sp._ERS6626630.fasta': {'Genome71_S1Ck127_18694'},
 'uncultured_Comamonas_sp._ERS6626638.fasta': {'Genome79_S3Ck127_287303'},
 'uncultured_Comamonas_sp._ERS6626814.fasta': {'Genome255_k127_46261'},
 'uncultured_Comamonas_sp._ERS6627283.fasta': {'Genome507_k127_91973'},
 'uncultured_Deinococcus_sp._ERS6626797.fasta': {'Geno
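
The mapping above pairs each genome file with the set of query sequence IDs that carry a motif. The sketch below builds the same structure in plain Python with `defaultdict(set)`. The (file, ID) pairs are taken from the output above, and `build_id_sets` is a hypothetical helper.

```python
from collections import defaultdict

def build_id_sets(pairs):
    """Map each target file to the set of query sequence IDs seen with it."""
    id_sets = defaultdict(set)
    for target_file, qseqid in pairs:
        id_sets[target_file].add(qseqid)
    return dict(id_sets)

pairs = [
    ("uncultured_Aureimonas_sp._ERS6626915.fasta", "Genome356_k127_339079"),
    ("uncultured_Aureimonas_sp._ERS6626915.fasta", "Genome356_k127_498042"),
    ("uncultured_Aureimonas_sp._ERS6626915.fasta", "Genome356_k127_498042"),
    ("uncultured_Acidovorax_sp._ERS6626787.fasta", "Genome228_k127_160184"),
]
id_sets = build_id_sets(pairs)
print(len(id_sets["uncultured_Aureimonas_sp._ERS6626915.fasta"]))  # 2
```

Using a set means a query ID that hits several motifs in the same genome is stored only once.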

In [30]:
# Generate a nested dictionary mapping each genome file name to its query sequence IDs and the three amino acids preceding each MOTIF hit
## Collect the 'Previous_3_AA' values of every (target_file, qseqid) pair into a list
coxl1_motif_3aa = coxl1_motif_results.groupby(['target_file', 'qseqid'])['Previous_3_AA'].apply(list).reset_index()

## Create the nested dictionary: {target_file: {qseqid: [Previous_3_AA, ...]}}
coxl1_motif_3aa_dict = {
    target_file: dict(zip(group['qseqid'], group['Previous_3_AA']))
    for target_file, group in coxl1_motif_3aa.groupby('target_file')
}

## Print the resulting dictionary
for target_file, ids in coxl1_motif_3aa_dict.items():
    print(f"Target File: {target_file}")
    for seq_id, previous_3_aa in ids.items():
        print(f"  ID: {seq_id}, Previous 3 AA: {previous_3_aa}")

Target File: uncultured_Acidovorax_sp._ERS6626787.fasta
  ID: Genome228_k127_160184, Previous 3 AA: ['ISF']
Target File: uncultured_Acidovorax_sp._ERS6627271.fasta
  ID: Genome495_k127_317059, Previous 3 AA: ['AYR']
Target File: uncultured_Acinetobacter_sp._ERS6626821.fasta
  ID: Genome262_k127_235961, Previous 3 AA: ['EQP']
Target File: uncultured_Acinetobacter_sp._ERS6626858.fasta
  ID: Genome299_k127_417201, Previous 3 AA: ['EQP']
Target File: uncultured_Agrobacterium_sp._ERS6626766.fasta
  ID: Genome207_k127_77525, Previous 3 AA: ['GCI']
Target File: uncultured_Aureimonas_sp._ERS6626915.fasta
  ID: Genome356_k127_339079, Previous 3 AA: ['SGR']
  ID: Genome356_k127_498042, Previous 3 AA: ['AYR']
  ID: Genome356_k127_610672, Previous 3 AA: ['AGT']
Target File: uncultured_Aureimonas_sp._ERS6627306.fasta
  ID: Genome530_k127_16854, Previous 3 AA: ['AYR']
Target File: uncultured_Comamonas_sp._ERS6626602.fasta
  ID: Genome43_S3Ck127_66545, Previous 3 AA: ['NGS', 'TSS']
Target File: uncul
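
The description expects a CoxL protein to carry one motif form, not both, yet a single ID can collect hits from both patterns (for example 'NGS' and 'TSS' under Genome43_S3Ck127_66545 above). The sketch below flags such IDs from (qseqid, label) pairs. `ids_with_both_forms` is a hypothetical helper and the second ID is invented.

```python
from collections import defaultdict

def ids_with_both_forms(hits):
    """Return the sorted IDs whose hits include more than one motif label."""
    labels = defaultdict(set)
    for qseqid, label in hits:
        labels[qseqid].add(label)
    return sorted(q for q, seen in labels.items() if len(seen) > 1)

hits = [
    ("Genome43_S3Ck127_66545", "CoxL_form_1"),  # prefix 'NGS' before CSFR
    ("Genome43_S3Ck127_66545", "CoxL_form_2"),  # prefix 'TSS' before GAGR
    ("toy_contig_1", "CoxL_form_2"),            # invented ID
]
print(ids_with_both_forms(hits))  # ['Genome43_S3Ck127_66545']
```

IDs flagged this way are candidates for manual inspection, since neither hit here has the canonical 'AY' prefix.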


<h3 style="font-family: 'Times New Roman'; font-weight: bold;">
    Filter CoxL Form 1 BLAST Results with MOTIF Search Results
</h3>

In [32]:
# Open the MOTIF 1 concatenated BLAST results
coxl1_csv = open_file(motif_1_csv_file_path)
coxl1_csv.head()

Unnamed: 0,qseqid,sseqid,pident,qcovs,evalue,bitscore,qstart,qend,sstart,send,length,source_file,target_file
0,Genome20_S3Ck127_316360,Mycobacterium_marinum_M,24.634,42,5.05e-43,163.0,4980,2872,53,784,751,Bdellovibrionaceae_bacterium_ERS6626579_scores...,Bdellovibrionaceae_bacterium_ERS6626579.fasta
1,Genome20_S3Ck127_316360,Mycobacterium_tusciae_WP_006247553,24.433,42,2.45e-41,158.0,4980,2872,53,784,749,Bdellovibrionaceae_bacterium_ERS6626579_scores...,Bdellovibrionaceae_bacterium_ERS6626579.fasta
2,Genome20_S3Ck127_316360,Rhodococcus_opacus_B4,24.074,39,4.12e-41,157.0,4980,3022,56,735,702,Bdellovibrionaceae_bacterium_ERS6626579_scores...,Bdellovibrionaceae_bacterium_ERS6626579.fasta
3,Genome20_S3Ck127_316360,Micromonospora_aurantiaca_ATCC_27029,24.536,42,4.1499999999999997e-41,157.0,4980,2872,53,784,754,Bdellovibrionaceae_bacterium_ERS6626579_scores...,Bdellovibrionaceae_bacterium_ERS6626579.fasta
4,Genome20_S3Ck127_316360,Mycobacterium_ulcerans_Agy99_YP_904347,24.528,42,5.75e-41,156.0,4953,2872,62,784,742,Bdellovibrionaceae_bacterium_ERS6626579_scores...,Bdellovibrionaceae_bacterium_ERS6626579.fasta


In [33]:
# Filter the BLAST results with the genome file names list, to drop the results of genomes without a MOTIF hit
coxl1_csv_filtered = coxl1_csv[coxl1_csv['target_file'].isin(coxl1_motif_mag_list)]
coxl1_csv_filtered

Unnamed: 0,qseqid,sseqid,pident,qcovs,evalue,bitscore,qstart,qend,sstart,send,length,source_file,target_file
5188,Genome228_k127_160184,Pseudonocardia_dioxanivorans_CB1190_coxL2_(651...,36.242,0,2.990000e-15,79.7,33523,33092,644,792,149,uncultured_Acidovorax_sp._ERS6626787_scores.csv,uncultured_Acidovorax_sp._ERS6626787.fasta
5189,Genome228_k127_160184,Amycolatopsis_thermoflava_N1165_DSM_44574_2512...,34.228,1,8.420000e-13,71.6,33523,33110,632,777,149,uncultured_Acidovorax_sp._ERS6626787_scores.csv,uncultured_Acidovorax_sp._ERS6626787.fasta
5190,Genome228_k127_160184,Amycolatopsis_thermoflava_N1165_DSM_44574_2512...,28.788,1,5.960000e-07,52.4,34717,34043,26,285,264,uncultured_Acidovorax_sp._ERS6626787_scores.csv,uncultured_Acidovorax_sp._ERS6626787.fasta
5191,Genome228_k127_160184,Solirubrobacter_soli_DSM_1,33.566,0,1.040000e-12,71.2,33523,33110,632,774,143,uncultured_Acidovorax_sp._ERS6626787_scores.csv,uncultured_Acidovorax_sp._ERS6626787.fasta
5192,Genome228_k127_160184,Solirubrobacter_sp._URHD008_1,34.416,0,1.080000e-12,71.2,33523,33086,621,774,154,uncultured_Acidovorax_sp._ERS6626787_scores.csv,uncultured_Acidovorax_sp._ERS6626787.fasta
...,...,...,...,...,...,...,...,...,...,...,...,...,...
96485,Genome470_k127_212027,Halobellus_limi_CGMCC_1.10331_:_Ga0070554_102_...,47.500,4,1.390000e-07,48.1,1,120,762,801,40,uncultured_Variovorax_sp._ERS6627246_scores.csv,uncultured_Variovorax_sp._ERS6627246.fasta
96486,Genome470_k127_212027,Pseudonocardia_autotrophica_2,55.000,4,1.920000e-07,47.8,1,120,737,773,40,uncultured_Variovorax_sp._ERS6627246_scores.csv,uncultured_Variovorax_sp._ERS6627246.fasta
96487,Genome470_k127_212027,Pseduonocardia_autotrophica_1,50.000,4,3.370000e-07,47.0,1,120,737,776,40,uncultured_Variovorax_sp._ERS6627246_scores.csv,uncultured_Variovorax_sp._ERS6627246.fasta
96488,Genome470_k127_212027,Alkalilimnicola_ehrlichei_MLHE-1_YP_742401,50.000,4,5.070000e-07,46.6,1,120,750,789,40,uncultured_Variovorax_sp._ERS6627246_scores.csv,uncultured_Variovorax_sp._ERS6627246.fasta


In [34]:
# Sort the blast results based on genome file name ('target_file'), query sequence id ('qseqid'), E-value, Bitscore, & Percentage Identity ('pident') 
coxl1_csv_sorted = coxl1_csv_filtered.sort_values(by=['target_file', 'qseqid', 'evalue', 'bitscore', 'pident'],
                                        ascending=[True, True, True, False, False])

coxl1_csv_sorted

Unnamed: 0,qseqid,sseqid,pident,qcovs,evalue,bitscore,qstart,qend,sstart,send,length,source_file,target_file
5188,Genome228_k127_160184,Pseudonocardia_dioxanivorans_CB1190_coxL2_(651...,36.242,0,2.990000e-15,79.7,33523,33092,644,792,149,uncultured_Acidovorax_sp._ERS6626787_scores.csv,uncultured_Acidovorax_sp._ERS6626787.fasta
5189,Genome228_k127_160184,Amycolatopsis_thermoflava_N1165_DSM_44574_2512...,34.228,1,8.420000e-13,71.6,33523,33110,632,777,149,uncultured_Acidovorax_sp._ERS6626787_scores.csv,uncultured_Acidovorax_sp._ERS6626787.fasta
5191,Genome228_k127_160184,Solirubrobacter_soli_DSM_1,33.566,0,1.040000e-12,71.2,33523,33110,632,774,143,uncultured_Acidovorax_sp._ERS6626787_scores.csv,uncultured_Acidovorax_sp._ERS6626787.fasta
5192,Genome228_k127_160184,Solirubrobacter_sp._URHD008_1,34.416,0,1.080000e-12,71.2,33523,33086,621,774,154,uncultured_Acidovorax_sp._ERS6626787_scores.csv,uncultured_Acidovorax_sp._ERS6626787.fasta
5193,Genome228_k127_160184,Pseudonocardia_autotrophica_2,32.215,0,2.380000e-12,70.1,33523,33092,630,778,149,uncultured_Acidovorax_sp._ERS6626787_scores.csv,uncultured_Acidovorax_sp._ERS6626787.fasta
...,...,...,...,...,...,...,...,...,...,...,...,...,...
96140,Genome470_k127_46816,Alicyclobacillus_pomorum_DSM_14955_2513792691,22.444,19,2.300000e-07,48.5,4199,5248,23,399,401,uncultured_Variovorax_sp._ERS6627246_scores.csv,uncultured_Variovorax_sp._ERS6627246.fasta
96141,Genome470_k127_46816,Conexibacter_woesei_DSM_14684_(YP_003394821),27.511,12,3.230000e-07,48.1,4598,5266,196,405,229,uncultured_Variovorax_sp._ERS6627246_scores.csv,uncultured_Variovorax_sp._ERS6627246.fasta
96142,Genome470_k127_46816,Micromonospora_aurantiaca_ATCC_27029,23.614,20,3.980000e-07,47.8,4172,5266,23,415,415,uncultured_Variovorax_sp._ERS6627246_scores.csv,uncultured_Variovorax_sp._ERS6627246.fasta
96143,Genome470_k127_46816,Thermocrispum_municipale_coxL2,23.833,20,4.750000e-07,47.8,4199,5266,27,410,407,uncultured_Variovorax_sp._ERS6627246_scores.csv,uncultured_Variovorax_sp._ERS6627246.fasta


In [35]:
# Group the data frame by 'target_file' & 'qseqid' and keep only the 1st row of every group (the best hit after sorting), removing the other rows
coxl1_csv_sorted = coxl1_csv_sorted.groupby(['target_file', 'qseqid']).head(1).reset_index(drop=True)
coxl1_csv_sorted

Unnamed: 0,qseqid,sseqid,pident,qcovs,evalue,bitscore,qstart,qend,sstart,send,length,source_file,target_file
0,Genome228_k127_160184,Pseudonocardia_dioxanivorans_CB1190_coxL2_(651...,36.242,0,2.990000e-15,79.7,33523,33092,644,792,149,uncultured_Acidovorax_sp._ERS6626787_scores.csv,uncultured_Acidovorax_sp._ERS6626787.fasta
1,Genome228_k127_35392,Meiothermus_rufus_DSM_22234_2523182758,26.582,3,7.500000e-36,145.0,48735,46567,16,787,790,uncultured_Acidovorax_sp._ERS6626787_scores.csv,uncultured_Acidovorax_sp._ERS6626787.fasta
2,Genome495_k127_217386,Sulfolobus_solfataricus_98/2,23.488,10,7.110000e-13,67.8,30,1124,28,423,430,uncultured_Acidovorax_sp._ERS6627271_scores.csv,uncultured_Acidovorax_sp._ERS6627271.fasta
3,Genome495_k127_317059,Mycobacterium_smegmatis_MC2_155,43.511,7,0.000000e+00,591.0,29003,31342,27,796,786,uncultured_Acidovorax_sp._ERS6627271_scores.csv,uncultured_Acidovorax_sp._ERS6627271.fasta
4,Genome495_k127_374228,BKZW01_10191_Dictyobacter_vulcani_W12,25.616,15,1.010000e-44,170.0,9123,7084,20,720,730,uncultured_Acidovorax_sp._ERS6627271_scores.csv,uncultured_Acidovorax_sp._ERS6627271.fasta
...,...,...,...,...,...,...,...,...,...,...,...,...,...
259,Genome470_k127_290350,Geoarchaeota_archaeon_OSPB-1_2504814364,25.548,10,6.960000e-41,159.0,1764,3992,23,764,775,uncultured_Variovorax_sp._ERS6627246_scores.csv,uncultured_Variovorax_sp._ERS6627246.fasta
260,Genome470_k127_338375,BKZW01_10191_Dictyobacter_vulcani_W12,41.550,53,0.000000e+00,595.0,2342,6,7,777,787,uncultured_Variovorax_sp._ERS6627246_scores.csv,uncultured_Variovorax_sp._ERS6627246.fasta
261,Genome470_k127_380329,BKZW01_10191_Dictyobacter_vulcani_W12,27.982,4,2.830000e-35,142.0,27723,25837,100,771,679,uncultured_Variovorax_sp._ERS6627246_scores.csv,uncultured_Variovorax_sp._ERS6627246.fasta
262,Genome470_k127_43488,BKZW01_10191_Dictyobacter_vulcani_W12,36.086,32,5.640000e-125,408.0,1355,3637,8,776,787,uncultured_Variovorax_sp._ERS6627246_scores.csv,uncultured_Variovorax_sp._ERS6627246.fasta
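
The de-duplication step above relies on the table already being sorted so that the strongest hit comes first within each group. A minimal sketch on an invented toy table (the file names, IDs, and scores are illustrative only, not taken from the BLAST results) shows that `groupby(...).head(1)` and `drop_duplicates(..., keep="first")` select the same rows under that assumption:

```python
import pandas as pd

# Toy BLAST-like table; all values are invented for illustration.
hits = pd.DataFrame({
    "target_file": ["a.fasta", "a.fasta", "a.fasta", "b.fasta"],
    "qseqid": ["q1", "q1", "q2", "q1"],
    "evalue": [1e-5, 1e-20, 1e-3, 1e-8],
    "bitscore": [50.0, 90.0, 30.0, 60.0],
    "pident": [30.0, 40.0, 25.0, 35.0],
})

# Put the best hit first within each (target_file, qseqid) group:
# lowest E-value, then highest bitscore, then highest percent identity.
ranked = hits.sort_values(
    by=["target_file", "qseqid", "evalue", "bitscore", "pident"],
    ascending=[True, True, True, False, False],
)

# Pattern used in the notebook: first row of every group.
best_by_head = ranked.groupby(["target_file", "qseqid"]).head(1).reset_index(drop=True)

# Alternative with the same effect: drop later rows that repeat the key columns.
best_by_dedup = ranked.drop_duplicates(
    subset=["target_file", "qseqid"], keep="first"
).reset_index(drop=True)

print(best_by_head)
```

Either form works; the result is only meaningful if the sort has been applied first.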


In [36]:
# Filter the data frame with the genome file name -> query sequence ID mapping, removing sequences that do not contain the MOTIF
## Create an empty list to store the filtered rows
filtered_rows = []
## Filter the data frame
for key, values in coxl1_motif_id_set.items():
    filtered_rows.append(coxl1_csv_sorted[(coxl1_csv_sorted['target_file'] == key) & (coxl1_csv_sorted['qseqid'].isin(values))])
## Combine the filtered results once, after the loop has finished
filtered_coxl1_csv = pd.concat(filtered_rows).reset_index(drop=True)
## Check the output
filtered_coxl1_csv

Unnamed: 0,qseqid,sseqid,pident,qcovs,evalue,bitscore,qstart,qend,sstart,send,length,source_file,target_file
0,Genome228_k127_160184,Pseudonocardia_dioxanivorans_CB1190_coxL2_(651...,36.242,0,2.990000e-15,79.7,33523,33092,644,792,149,uncultured_Acidovorax_sp._ERS6626787_scores.csv,uncultured_Acidovorax_sp._ERS6626787.fasta
1,Genome495_k127_317059,Mycobacterium_smegmatis_MC2_155,43.511,7,0.000000e+00,591.0,29003,31342,27,796,786,uncultured_Acidovorax_sp._ERS6627271_scores.csv,uncultured_Acidovorax_sp._ERS6627271.fasta
2,Genome262_k127_235961,BKZW01_10191_Dictyobacter_vulcani_W12,27.422,7,9.230000e-56,205.0,6207,8249,20,717,733,uncultured_Acinetobacter_sp._ERS6626821_scores...,uncultured_Acinetobacter_sp._ERS6626821.fasta
3,Genome299_k127_417201,BKZW01_10191_Dictyobacter_vulcani_W12,26.842,40,9.620000e-28,113.0,121,1212,362,717,380,uncultured_Acinetobacter_sp._ERS6626858_scores...,uncultured_Acinetobacter_sp._ERS6626858.fasta
4,Genome207_k127_77525,Rhodococcus_opacus_B4,31.536,96,2.350000e-19,82.8,1063,23,431,791,371,uncultured_Agrobacterium_sp._ERS6626766_scores...,uncultured_Agrobacterium_sp._ERS6626766.fasta
...,...,...,...,...,...,...,...,...,...,...,...,...,...
63,Genome424_k127_514686,Edaphobacter_aggregans_DSM_19364_2571099646,26.106,76,2.250000e-39,149.0,1902,13,5,655,678,uncultured_Spirosoma_sp._ERS6626983_scores.csv,uncultured_Spirosoma_sp._ERS6626983.fasta
64,Genome452_k127_363663,Methyloferula_stellata_WP_020174914,26.638,22,6.140000e-13,65.9,641,9,20,240,229,uncultured_Spirosoma_sp._ERS6627011_scores.csv,uncultured_Spirosoma_sp._ERS6627011.fasta
65,Genome543_k127_120424,Sphaerobacter_thermophilus_DSM_20745_ZP_04495982,27.835,1,7.470000e-42,164.0,20040,17932,12,769,776,uncultured_Stenotrophomonas_sp._ERS6627319_sco...,uncultured_Stenotrophomonas_sp._ERS6627319.fasta
66,Genome220_k127_244806,BKZW01_10191_Dictyobacter_vulcani_W12,36.086,12,4.310000e-125,410.0,11575,13857,8,776,787,uncultured_Variovorax_sp._ERS6626779_scores.csv,uncultured_Variovorax_sp._ERS6626779.fasta
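
The loop above keeps only rows whose (`target_file`, `qseqid`) pair occurs in the MOTIF mapping. As an alternative sketch, the same selection can be written as an inner merge against a table of allowed pairs. The data below are invented, and `pairs` is a hypothetical helper name that does not appear in the notebook:

```python
import pandas as pd

# Invented stand-ins for the sorted BLAST table and the MOTIF mapping.
blast = pd.DataFrame({
    "target_file": ["a.fasta", "a.fasta", "b.fasta", "c.fasta"],
    "qseqid": ["q1", "q2", "q1", "q9"],
    "bitscore": [90.0, 30.0, 60.0, 10.0],
})
motif_ids = {"a.fasta": {"q1"}, "b.fasta": {"q1", "q7"}}

# Flatten the {file: {ids}} mapping into one row per allowed pair.
pairs = pd.DataFrame(
    [(file_name, seq_id) for file_name, ids in motif_ids.items() for seq_id in sorted(ids)],
    columns=["target_file", "qseqid"],
)

# The inner merge keeps only BLAST rows whose pair is present in `pairs`.
filtered = blast.merge(pairs, on=["target_file", "qseqid"], how="inner")
print(filtered)
```

Row order may differ from the loop version, which follows the order of the dictionary keys, so sort afterwards if a fixed order matters.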


In [37]:
# Save the filtered data frame for further analysis
filtered_coxl1_csv.to_csv(f"{motif1}/filtered_coxl1_csv.csv")

<h3 style="font-family: 'Times New Roman'; font-weight: bold;">
    Search MOTIFs in CoxL Form 2 BLAST Results
</h3>

In [39]:
# Search MOTIFs for CoxL Form 2 BLAST results 
coxl2_motif_results = search_motifs(motif_2_proteins, form1, label1, form2, label2, output_file=f"{motif2}/coxl2_motif_search.csv")

Results saved to CoxL2_Results/MOTIF/coxl2_motif_search.csv


In [40]:
# Checking the MOTIF search results output
coxl2_motif_results

Unnamed: 0,File_Name,ID,Sequence,MOTIF,Previous_3_AA,Motif_Label
0,Burkholderiaceae_bacterium_ERS6626861.faa,Genome302_k127_117597_1,MAGARRHQAAEDVVGAGRDLKRLVRPGREDDEADAAAAVAPPPPVR...,DVVGAGR,DVV,CoxL_form_2
1,uncultured_Acidovorax_sp._ERS6627271.faa,Genome495_k127_317059_1,IGEALKRKEDYRFLTGAGQYTDDVVLAAQCHAVFVRSPHAHAKINS...,AYRGAGR,AYR,CoxL_form_2
2,uncultured_Acidovorax_sp._ERS6627271.faa,Genome495_k127_317059_1,IGEALKRKEDYRFLTGAGQYTDDVVLAAQCHAVFVRSPHAHAKINS...,AYRGAGR,AYR,CoxL_form_2
3,uncultured_Acinetobacter_sp._ERS6626821.faa,Genome262_k127_235961_1,HQVSVESRRMGGGFGGKESQSAQWACIASLAAQKTGRPCKLRLDRD...,EQPGAGR,EQP,CoxL_form_2
4,uncultured_Acinetobacter_sp._ERS6626821.faa,Genome262_k127_235961_1,NTMHLAVGFSSCAKGKISKFDLDAVRQADGVHAVFSAKDIDVENNW...,EQPGAGR,EQP,CoxL_form_2
...,...,...,...,...,...,...
104,uncultured_Variovorax_sp._ERS6626779.faa,Genome220_k127_244806_1,PTRFGSGQAVRRLEDESLLSGAGRYTDDVTLPEQAHLVFLRSSYPH...,AYRGAGR,AYR,CoxL_form_2
105,uncultured_Variovorax_sp._ERS6627246.faa,Genome470_k127_43488_1,GSGQAVRRLEDESLLSGAGRYTDDVTLPEQAHLVFLRSSYPHARIL...,LLSGAGR,LLS,CoxL_form_2
106,uncultured_Variovorax_sp._ERS6627246.faa,Genome470_k127_43488_1,GSGQAVRRLEDESLLSGAGRYTDDVTLPEQAHLVFLRSSYPHARIL...,AYRGAGR,AYR,CoxL_form_2
107,uncultured_Variovorax_sp._ERS6627246.faa,Genome470_k127_43488_1,PTRFGSGQAVRRLEDESLLSGAGRYTDDVTLPEQAHLVFLRSSYPH...,LLSGAGR,LLS,CoxL_form_2


In [41]:
# Checking for the presence of MOTIF form 1 and 2
coxl2_motif_results['Motif_Label'].unique()

array(['CoxL_form_2'], dtype=object)

In [42]:
# Checking for number of genomes containing MOTIF form 2
coxl2_motif_results['File_Name'].nunique()

48

In [43]:
# Generate the genome file name column ('target_file') by replacing the '.faa' extension with '.fasta'
coxl2_motif_results['target_file'] = coxl2_motif_results['File_Name'].apply(lambda x: x[:-4] + '.fasta')
coxl2_motif_results

Unnamed: 0,File_Name,ID,Sequence,MOTIF,Previous_3_AA,Motif_Label,target_file
0,Burkholderiaceae_bacterium_ERS6626861.faa,Genome302_k127_117597_1,MAGARRHQAAEDVVGAGRDLKRLVRPGREDDEADAAAAVAPPPPVR...,DVVGAGR,DVV,CoxL_form_2,Burkholderiaceae_bacterium_ERS6626861.fasta
1,uncultured_Acidovorax_sp._ERS6627271.faa,Genome495_k127_317059_1,IGEALKRKEDYRFLTGAGQYTDDVVLAAQCHAVFVRSPHAHAKINS...,AYRGAGR,AYR,CoxL_form_2,uncultured_Acidovorax_sp._ERS6627271.fasta
2,uncultured_Acidovorax_sp._ERS6627271.faa,Genome495_k127_317059_1,IGEALKRKEDYRFLTGAGQYTDDVVLAAQCHAVFVRSPHAHAKINS...,AYRGAGR,AYR,CoxL_form_2,uncultured_Acidovorax_sp._ERS6627271.fasta
3,uncultured_Acinetobacter_sp._ERS6626821.faa,Genome262_k127_235961_1,HQVSVESRRMGGGFGGKESQSAQWACIASLAAQKTGRPCKLRLDRD...,EQPGAGR,EQP,CoxL_form_2,uncultured_Acinetobacter_sp._ERS6626821.fasta
4,uncultured_Acinetobacter_sp._ERS6626821.faa,Genome262_k127_235961_1,NTMHLAVGFSSCAKGKISKFDLDAVRQADGVHAVFSAKDIDVENNW...,EQPGAGR,EQP,CoxL_form_2,uncultured_Acinetobacter_sp._ERS6626821.fasta
...,...,...,...,...,...,...,...
104,uncultured_Variovorax_sp._ERS6626779.faa,Genome220_k127_244806_1,PTRFGSGQAVRRLEDESLLSGAGRYTDDVTLPEQAHLVFLRSSYPH...,AYRGAGR,AYR,CoxL_form_2,uncultured_Variovorax_sp._ERS6626779.fasta
105,uncultured_Variovorax_sp._ERS6627246.faa,Genome470_k127_43488_1,GSGQAVRRLEDESLLSGAGRYTDDVTLPEQAHLVFLRSSYPHARIL...,LLSGAGR,LLS,CoxL_form_2,uncultured_Variovorax_sp._ERS6627246.fasta
106,uncultured_Variovorax_sp._ERS6627246.faa,Genome470_k127_43488_1,GSGQAVRRLEDESLLSGAGRYTDDVTLPEQAHLVFLRSSYPHARIL...,AYRGAGR,AYR,CoxL_form_2,uncultured_Variovorax_sp._ERS6627246.fasta
107,uncultured_Variovorax_sp._ERS6627246.faa,Genome470_k127_43488_1,PTRFGSGQAVRRLEDESLLSGAGRYTDDVTLPEQAHLVFLRSSYPH...,LLSGAGR,LLS,CoxL_form_2,uncultured_Variovorax_sp._ERS6627246.fasta


In [44]:
# Generate the input query sequence ID column ('qseqid') from the updated sequence ID column ('ID') by dropping the trailing '_N' suffix
coxl2_motif_results['qseqid'] = coxl2_motif_results['ID'].apply(lambda x: x[:-2])
coxl2_motif_results

Unnamed: 0,File_Name,ID,Sequence,MOTIF,Previous_3_AA,Motif_Label,target_file,qseqid
0,Burkholderiaceae_bacterium_ERS6626861.faa,Genome302_k127_117597_1,MAGARRHQAAEDVVGAGRDLKRLVRPGREDDEADAAAAVAPPPPVR...,DVVGAGR,DVV,CoxL_form_2,Burkholderiaceae_bacterium_ERS6626861.fasta,Genome302_k127_117597
1,uncultured_Acidovorax_sp._ERS6627271.faa,Genome495_k127_317059_1,IGEALKRKEDYRFLTGAGQYTDDVVLAAQCHAVFVRSPHAHAKINS...,AYRGAGR,AYR,CoxL_form_2,uncultured_Acidovorax_sp._ERS6627271.fasta,Genome495_k127_317059
2,uncultured_Acidovorax_sp._ERS6627271.faa,Genome495_k127_317059_1,IGEALKRKEDYRFLTGAGQYTDDVVLAAQCHAVFVRSPHAHAKINS...,AYRGAGR,AYR,CoxL_form_2,uncultured_Acidovorax_sp._ERS6627271.fasta,Genome495_k127_317059
3,uncultured_Acinetobacter_sp._ERS6626821.faa,Genome262_k127_235961_1,HQVSVESRRMGGGFGGKESQSAQWACIASLAAQKTGRPCKLRLDRD...,EQPGAGR,EQP,CoxL_form_2,uncultured_Acinetobacter_sp._ERS6626821.fasta,Genome262_k127_235961
4,uncultured_Acinetobacter_sp._ERS6626821.faa,Genome262_k127_235961_1,NTMHLAVGFSSCAKGKISKFDLDAVRQADGVHAVFSAKDIDVENNW...,EQPGAGR,EQP,CoxL_form_2,uncultured_Acinetobacter_sp._ERS6626821.fasta,Genome262_k127_235961
...,...,...,...,...,...,...,...,...
104,uncultured_Variovorax_sp._ERS6626779.faa,Genome220_k127_244806_1,PTRFGSGQAVRRLEDESLLSGAGRYTDDVTLPEQAHLVFLRSSYPH...,AYRGAGR,AYR,CoxL_form_2,uncultured_Variovorax_sp._ERS6626779.fasta,Genome220_k127_244806
105,uncultured_Variovorax_sp._ERS6627246.faa,Genome470_k127_43488_1,GSGQAVRRLEDESLLSGAGRYTDDVTLPEQAHLVFLRSSYPHARIL...,LLSGAGR,LLS,CoxL_form_2,uncultured_Variovorax_sp._ERS6627246.fasta,Genome470_k127_43488
106,uncultured_Variovorax_sp._ERS6627246.faa,Genome470_k127_43488_1,GSGQAVRRLEDESLLSGAGRYTDDVTLPEQAHLVFLRSSYPHARIL...,AYRGAGR,AYR,CoxL_form_2,uncultured_Variovorax_sp._ERS6627246.fasta,Genome470_k127_43488
107,uncultured_Variovorax_sp._ERS6627246.faa,Genome470_k127_43488_1,PTRFGSGQAVRRLEDESLLSGAGRYTDDVTLPEQAHLVFLRSSYPH...,LLSGAGR,LLS,CoxL_form_2,uncultured_Variovorax_sp._ERS6627246.fasta,Genome470_k127_43488
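
The two slices above (`x[:-4]` and `x[:-2]`) assume a three-letter extension and a single-digit suffix. Whether any ID in this data set carries a suffix above 9 is not known from the outputs shown, so the following is only a defensive sketch. `strip_orf_suffix` and `faa_to_fasta` are hypothetical helper names, and `strip_orf_suffix` assumes that every ID does end in `_<number>`:

```python
import os


def strip_orf_suffix(seq_id: str) -> str:
    """Drop a trailing '_<digits>' suffix, whatever its length."""
    base, sep, suffix = seq_id.rpartition("_")
    # Only strip when the part after the last underscore is numeric.
    return base if sep and suffix.isdigit() else seq_id


def faa_to_fasta(file_name: str) -> str:
    """Swap a '.faa' extension for '.fasta'; leave other names unchanged."""
    stem, ext = os.path.splitext(file_name)
    return stem + ".fasta" if ext == ".faa" else file_name


print(strip_orf_suffix("Genome302_k127_117597_1"))
print(strip_orf_suffix("Genome302_k127_117597_12"))
print(faa_to_fasta("uncultured_Acidovorax_sp._ERS6627271.faa"))
```

With a two-digit suffix, the fixed slice `x[:-2]` would leave a dangling underscore, whereas the helper still returns the contig ID.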


In [45]:
# Find the length of every sequence and save the value in the 'sequence_length' column
coxl2_motif_results['sequence_length'] = coxl2_motif_results['Sequence'].apply(len)
coxl2_motif_results

Unnamed: 0,File_Name,ID,Sequence,MOTIF,Previous_3_AA,Motif_Label,target_file,qseqid,sequence_length
0,Burkholderiaceae_bacterium_ERS6626861.faa,Genome302_k127_117597_1,MAGARRHQAAEDVVGAGRDLKRLVRPGREDDEADAAAAVAPPPPVR...,DVVGAGR,DVV,CoxL_form_2,Burkholderiaceae_bacterium_ERS6626861.fasta,Genome302_k127_117597,183
1,uncultured_Acidovorax_sp._ERS6627271.faa,Genome495_k127_317059_1,IGEALKRKEDYRFLTGAGQYTDDVVLAAQCHAVFVRSPHAHAKINS...,AYRGAGR,AYR,CoxL_form_2,uncultured_Acidovorax_sp._ERS6627271.fasta,Genome495_k127_317059,783
2,uncultured_Acidovorax_sp._ERS6627271.faa,Genome495_k127_317059_1,IGEALKRKEDYRFLTGAGQYTDDVVLAAQCHAVFVRSPHAHAKINS...,AYRGAGR,AYR,CoxL_form_2,uncultured_Acidovorax_sp._ERS6627271.fasta,Genome495_k127_317059,781
3,uncultured_Acinetobacter_sp._ERS6626821.faa,Genome262_k127_235961_1,HQVSVESRRMGGGFGGKESQSAQWACIASLAAQKTGRPCKLRLDRD...,EQPGAGR,EQP,CoxL_form_2,uncultured_Acinetobacter_sp._ERS6626821.fasta,Genome262_k127_235961,537
4,uncultured_Acinetobacter_sp._ERS6626821.faa,Genome262_k127_235961_1,NTMHLAVGFSSCAKGKISKFDLDAVRQADGVHAVFSAKDIDVENNW...,EQPGAGR,EQP,CoxL_form_2,uncultured_Acinetobacter_sp._ERS6626821.fasta,Genome262_k127_235961,663
...,...,...,...,...,...,...,...,...,...
104,uncultured_Variovorax_sp._ERS6626779.faa,Genome220_k127_244806_1,PTRFGSGQAVRRLEDESLLSGAGRYTDDVTLPEQAHLVFLRSSYPH...,AYRGAGR,AYR,CoxL_form_2,uncultured_Variovorax_sp._ERS6626779.fasta,Genome220_k127_244806,767
105,uncultured_Variovorax_sp._ERS6627246.faa,Genome470_k127_43488_1,GSGQAVRRLEDESLLSGAGRYTDDVTLPEQAHLVFLRSSYPHARIL...,LLSGAGR,LLS,CoxL_form_2,uncultured_Variovorax_sp._ERS6627246.fasta,Genome470_k127_43488,763
106,uncultured_Variovorax_sp._ERS6627246.faa,Genome470_k127_43488_1,GSGQAVRRLEDESLLSGAGRYTDDVTLPEQAHLVFLRSSYPHARIL...,AYRGAGR,AYR,CoxL_form_2,uncultured_Variovorax_sp._ERS6627246.fasta,Genome470_k127_43488,763
107,uncultured_Variovorax_sp._ERS6627246.faa,Genome470_k127_43488_1,PTRFGSGQAVRRLEDESLLSGAGRYTDDVTLPEQAHLVFLRSSYPH...,LLSGAGR,LLS,CoxL_form_2,uncultured_Variovorax_sp._ERS6627246.fasta,Genome470_k127_43488,767


In [46]:
# Sort the data frame by 'File_Name', 'qseqid' & 'MOTIF' (ascending) and 'sequence_length' (descending), so the longest sequence comes first in each group
coxl2_motif_results = coxl2_motif_results.sort_values(by=['File_Name', 'qseqid', 'MOTIF', 'sequence_length'],
                                        ascending=[True, True, True, False])
coxl2_motif_results

Unnamed: 0,File_Name,ID,Sequence,MOTIF,Previous_3_AA,Motif_Label,target_file,qseqid,sequence_length
0,Burkholderiaceae_bacterium_ERS6626861.faa,Genome302_k127_117597_1,MAGARRHQAAEDVVGAGRDLKRLVRPGREDDEADAAAAVAPPPPVR...,DVVGAGR,DVV,CoxL_form_2,Burkholderiaceae_bacterium_ERS6626861.fasta,Genome302_k127_117597,183
1,uncultured_Acidovorax_sp._ERS6627271.faa,Genome495_k127_317059_1,IGEALKRKEDYRFLTGAGQYTDDVVLAAQCHAVFVRSPHAHAKINS...,AYRGAGR,AYR,CoxL_form_2,uncultured_Acidovorax_sp._ERS6627271.fasta,Genome495_k127_317059,783
2,uncultured_Acidovorax_sp._ERS6627271.faa,Genome495_k127_317059_1,IGEALKRKEDYRFLTGAGQYTDDVVLAAQCHAVFVRSPHAHAKINS...,AYRGAGR,AYR,CoxL_form_2,uncultured_Acidovorax_sp._ERS6627271.fasta,Genome495_k127_317059,781
4,uncultured_Acinetobacter_sp._ERS6626821.faa,Genome262_k127_235961_1,NTMHLAVGFSSCAKGKISKFDLDAVRQADGVHAVFSAKDIDVENNW...,EQPGAGR,EQP,CoxL_form_2,uncultured_Acinetobacter_sp._ERS6626821.fasta,Genome262_k127_235961,663
3,uncultured_Acinetobacter_sp._ERS6626821.faa,Genome262_k127_235961_1,HQVSVESRRMGGGFGGKESQSAQWACIASLAAQKTGRPCKLRLDRD...,EQPGAGR,EQP,CoxL_form_2,uncultured_Acinetobacter_sp._ERS6626821.fasta,Genome262_k127_235961,537
...,...,...,...,...,...,...,...,...,...
101,uncultured_Variovorax_sp._ERS6626779.faa,Genome220_k127_244806_1,GSGQAVRRLEDESLLSGAGRYTDDVTLPEQAHLVFLRSSYPHARIL...,LLSGAGR,LLS,CoxL_form_2,uncultured_Variovorax_sp._ERS6626779.fasta,Genome220_k127_244806,763
108,uncultured_Variovorax_sp._ERS6627246.faa,Genome470_k127_43488_1,PTRFGSGQAVRRLEDESLLSGAGRYTDDVTLPEQAHLVFLRSSYPH...,AYRGAGR,AYR,CoxL_form_2,uncultured_Variovorax_sp._ERS6627246.fasta,Genome470_k127_43488,767
106,uncultured_Variovorax_sp._ERS6627246.faa,Genome470_k127_43488_1,GSGQAVRRLEDESLLSGAGRYTDDVTLPEQAHLVFLRSSYPHARIL...,AYRGAGR,AYR,CoxL_form_2,uncultured_Variovorax_sp._ERS6627246.fasta,Genome470_k127_43488,763
107,uncultured_Variovorax_sp._ERS6627246.faa,Genome470_k127_43488_1,PTRFGSGQAVRRLEDESLLSGAGRYTDDVTLPEQAHLVFLRSSYPH...,LLSGAGR,LLS,CoxL_form_2,uncultured_Variovorax_sp._ERS6627246.fasta,Genome470_k127_43488,767


In [47]:
# Group the data frame by 'File_Name', 'qseqid' & 'MOTIF' and keep the first (longest) row of every group, dropping the remaining duplicates
coxl2_motif_results = coxl2_motif_results.groupby(['File_Name', 'qseqid', 'MOTIF']).head(1).reset_index(drop=True)

coxl2_motif_results

Unnamed: 0,File_Name,ID,Sequence,MOTIF,Previous_3_AA,Motif_Label,target_file,qseqid,sequence_length
0,Burkholderiaceae_bacterium_ERS6626861.faa,Genome302_k127_117597_1,MAGARRHQAAEDVVGAGRDLKRLVRPGREDDEADAAAAVAPPPPVR...,DVVGAGR,DVV,CoxL_form_2,Burkholderiaceae_bacterium_ERS6626861.fasta,Genome302_k127_117597,183
1,uncultured_Acidovorax_sp._ERS6627271.faa,Genome495_k127_317059_1,IGEALKRKEDYRFLTGAGQYTDDVVLAAQCHAVFVRSPHAHAKINS...,AYRGAGR,AYR,CoxL_form_2,uncultured_Acidovorax_sp._ERS6627271.fasta,Genome495_k127_317059,783
2,uncultured_Acinetobacter_sp._ERS6626821.faa,Genome262_k127_235961_1,NTMHLAVGFSSCAKGKISKFDLDAVRQADGVHAVFSAKDIDVENNW...,EQPGAGR,EQP,CoxL_form_2,uncultured_Acinetobacter_sp._ERS6626821.fasta,Genome262_k127_235961,663
3,uncultured_Acinetobacter_sp._ERS6626858.faa,Genome299_k127_417201_1,ELRNLRCKTNTVSNTAYRGFGGPQGMFVIENIIDDIARYLGCDPIE...,EQPGAGR,EQP,CoxL_form_2,uncultured_Acinetobacter_sp._ERS6626858.fasta,Genome299_k127_417201,438
4,uncultured_Aureimonas_sp._ERS6626915.faa,Genome356_k127_498042_1,LAMIAAEKLKRPVRWVADRNEHFLADTHGRANLATATMALDAKGRF...,AYRGAGR,AYR,CoxL_form_2,uncultured_Aureimonas_sp._ERS6626915.fasta,Genome356_k127_498042,498
5,uncultured_Aureimonas_sp._ERS6626915.faa,Genome356_k127_610672_1,IGKGIDRFEGPLKVTGRARYAYENMTGPEVAYGFVVTAGTGAGRIK...,AGTGAGR,AGT,CoxL_form_2,uncultured_Aureimonas_sp._ERS6626915.fasta,Genome356_k127_610672,370
6,uncultured_Aureimonas_sp._ERS6627306.faa,Genome530_k127_16854_1,MGIEGIGARVLRKEDRRFITGRGRYTDDMVVPGMKYAAFVRSPHAH...,AYRGAGR,AYR,CoxL_form_2,uncultured_Aureimonas_sp._ERS6627306.fasta,Genome530_k127_16854,780
7,uncultured_Comamonas_sp._ERS6626630.faa,Genome71_S1Ck127_18694_1,IGEPVKRKEDYRFLTGAGQYTDDIALAAQAHAVFVRSPHAHARVRS...,AYRGAGR,AYR,CoxL_form_2,uncultured_Comamonas_sp._ERS6626630.fasta,Genome71_S1Ck127_18694,782
8,uncultured_Comamonas_sp._ERS6626638.faa,Genome79_S3Ck127_287303_1,PATPADVAAIRTRPAVGGNARPGQRAPGTGKRQRPGPARRAGRATP...,PRPGAGR,PRP,CoxL_form_2,uncultured_Comamonas_sp._ERS6626638.fasta,Genome79_S3Ck127_287303,200
9,uncultured_Comamonas_sp._ERS6626638.faa,Genome79_S3Ck127_287303_3,GVAHSVCRGSCRRRRRRAGGGTTPPTRARPTAAACRRASRARPSAT...,RVSGAGR,RVS,CoxL_form_2,uncultured_Comamonas_sp._ERS6626638.fasta,Genome79_S3Ck127_287303,294
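
The table above lists hits such as `EQPGAGR` and `AGTGAGR` next to `AYRGAGR`, all labelled `CoxL_form_2`. If only exact matches to the Form 2 MOTIF given in the description ('AYRGAGR') are of interest, one optional extra check is to flag them. This is a sketch rather than a step of the notebook; `is_canonical` is a hypothetical column name, and the three rows are copied from the output above:

```python
import pandas as pd

FORM2_MOTIF = "AYRGAGR"  # Form 2 MOTIF as stated in the notebook description

# Three rows taken from the de-duplicated MOTIF table shown above.
motif_hits = pd.DataFrame({
    "qseqid": [
        "Genome495_k127_317059",
        "Genome262_k127_235961",
        "Genome356_k127_610672",
    ],
    "MOTIF": ["AYRGAGR", "EQPGAGR", "AGTGAGR"],
})

# Flag rows whose matched MOTIF is identical to the Form 2 MOTIF.
motif_hits["is_canonical"] = motif_hits["MOTIF"].eq(FORM2_MOTIF)

# Optional: keep only the exact matches.
canonical_only = motif_hits[motif_hits["is_canonical"]]
print(canonical_only)
```

Whether the relaxed matches should be kept or dropped is a biological decision that the notebook leaves open; the flag only makes the distinction visible.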


In [48]:
# Save filtered data frame
coxl2_motif_results.to_csv(f"{motif2}/coxl2_motif_results.csv")

In [49]:
# Generate the list of genome file names
coxl2_motif_mag_list = list(coxl2_motif_results['target_file'].unique())
print(len(coxl2_motif_mag_list))
coxl2_motif_mag_list

48


['Burkholderiaceae_bacterium_ERS6626861.fasta',
 'uncultured_Acidovorax_sp._ERS6627271.fasta',
 'uncultured_Acinetobacter_sp._ERS6626821.fasta',
 'uncultured_Acinetobacter_sp._ERS6626858.fasta',
 'uncultured_Aureimonas_sp._ERS6626915.fasta',
 'uncultured_Aureimonas_sp._ERS6627306.fasta',
 'uncultured_Comamonas_sp._ERS6626630.fasta',
 'uncultured_Comamonas_sp._ERS6626638.fasta',
 'uncultured_Comamonas_sp._ERS6626814.fasta',
 'uncultured_Comamonas_sp._ERS6627283.fasta',
 'uncultured_Deinococcus_sp._ERS6626797.fasta',
 'uncultured_Deinococcus_sp._ERS6626820.fasta',
 'uncultured_Deinococcus_sp._ERS6626837.fasta',
 'uncultured_Deinococcus_sp._ERS6627293.fasta',
 'uncultured_Deinococcus_sp._ERS6627335.fasta',
 'uncultured_Hymenobacter_sp._ERS6626733.fasta',
 'uncultured_Hymenobacter_sp._ERS6627279.fasta',
 'uncultured_Methylobacterium_sp._ERS6627288.fasta',
 'uncultured_Microbacterium_sp._ERS6626906.fasta',
 'uncultured_Microbacterium_sp._ERS6626964.fasta',
 'uncultured_Mucilaginibacter_sp._

In [50]:
# Generate a dictionary that maps every genome file name to the set of its query sequence IDs
coxl2_motif_id_set = coxl2_motif_results.groupby('target_file')['qseqid'].apply(set).to_dict()
coxl2_motif_id_set

{'Burkholderiaceae_bacterium_ERS6626861.fasta': {'Genome302_k127_117597'},
 'uncultured_Acidovorax_sp._ERS6627271.fasta': {'Genome495_k127_317059'},
 'uncultured_Acinetobacter_sp._ERS6626821.fasta': {'Genome262_k127_235961'},
 'uncultured_Acinetobacter_sp._ERS6626858.fasta': {'Genome299_k127_417201'},
 'uncultured_Aureimonas_sp._ERS6626915.fasta': {'Genome356_k127_498042',
  'Genome356_k127_610672'},
 'uncultured_Aureimonas_sp._ERS6627306.fasta': {'Genome530_k127_16854'},
 'uncultured_Comamonas_sp._ERS6626630.fasta': {'Genome71_S1Ck127_18694'},
 'uncultured_Comamonas_sp._ERS6626638.fasta': {'Genome79_S3Ck127_287303'},
 'uncultured_Comamonas_sp._ERS6626814.fasta': {'Genome255_k127_46261'},
 'uncultured_Comamonas_sp._ERS6627283.fasta': {'Genome507_k127_91973'},
 'uncultured_Deinococcus_sp._ERS6626797.fasta': {'Genome238_k127_17753',
  'Genome238_k127_54530'},
 'uncultured_Deinococcus_sp._ERS6626820.fasta': {'Genome261_k127_146972',
  'Genome261_k127_332399'},
 'uncultured_Deinococcus_sp.

In [51]:
# Generate a nested dictionary: genome file name -> query sequence ID -> list of the first three amino acids of each MOTIF hit
## Collect the 'Previous_3_AA' values of every (target_file, qseqid) pair into a list
coxl2_motif_3aa = coxl2_motif_results.groupby(['target_file', 'qseqid'])['Previous_3_AA'].apply(list).reset_index()

## Create the nested dictionary
coxl2_motif_3aa_dict = coxl2_motif_3aa.groupby('target_file').apply(lambda x: x.set_index('qseqid')['Previous_3_AA'].to_dict()).to_dict()

## Print the resulting dictionary
for target_file, ids in coxl2_motif_3aa_dict.items():
    print(f"Target File: {target_file}")
    for seq_id, previous_3_aa in ids.items():  # 'seq_id' avoids shadowing the built-in id()
        print(f"  ID: {seq_id}, Previous 3 AA: {previous_3_aa}")

Target File: Burkholderiaceae_bacterium_ERS6626861.fasta
  ID: Genome302_k127_117597, Previous 3 AA: ['DVV']
Target File: uncultured_Acidovorax_sp._ERS6627271.fasta
  ID: Genome495_k127_317059, Previous 3 AA: ['AYR']
Target File: uncultured_Acinetobacter_sp._ERS6626821.fasta
  ID: Genome262_k127_235961, Previous 3 AA: ['EQP']
Target File: uncultured_Acinetobacter_sp._ERS6626858.fasta
  ID: Genome299_k127_417201, Previous 3 AA: ['EQP']
Target File: uncultured_Aureimonas_sp._ERS6626915.fasta
  ID: Genome356_k127_498042, Previous 3 AA: ['AYR']
  ID: Genome356_k127_610672, Previous 3 AA: ['AGT']
Target File: uncultured_Aureimonas_sp._ERS6627306.fasta
  ID: Genome530_k127_16854, Previous 3 AA: ['AYR']
Target File: uncultured_Comamonas_sp._ERS6626630.fasta
  ID: Genome71_S1Ck127_18694, Previous 3 AA: ['AYR']
Target File: uncultured_Comamonas_sp._ERS6626638.fasta
  ID: Genome79_S3Ck127_287303, Previous 3 AA: ['PRP', 'RVS']
Target File: uncultured_Comamonas_sp._ERS6626814.fasta
  ID: Genome255
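
Depending on the pandas version, calling `groupby(...).apply` on a data frame may emit a deprecation warning about the grouping columns. As a sketch under that assumption, the same nested dictionary can be built with a plain loop over the groups. The data are invented and `nested` is a hypothetical variable name:

```python
import pandas as pd

# Invented stand-in for the de-duplicated MOTIF table.
motif_rows = pd.DataFrame({
    "target_file": ["x.fasta", "x.fasta", "y.fasta", "y.fasta"],
    "qseqid": ["q1", "q2", "q3", "q3"],
    "Previous_3_AA": ["AYR", "AGT", "PRP", "RVS"],
})

# genome file name -> query sequence ID -> list of three-residue prefixes
nested = {}
grouped = motif_rows.groupby(["target_file", "qseqid"])["Previous_3_AA"]
for (target_file, qseqid), prefixes in grouped:
    nested.setdefault(target_file, {})[qseqid] = prefixes.tolist()

for target_file, ids in nested.items():
    print(f"Target File: {target_file}")
    for seq_id, previous_3_aa in ids.items():
        print(f"  ID: {seq_id}, Previous 3 AA: {previous_3_aa}")
```

The loop does the grouping in a single pass and does not depend on how `apply` treats the grouping columns.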



<h3 style="font-family: 'Times New Roman'; font-weight: bold;">
    Filter CoxL Form 2 BLAST Results with MOTIF Search Results
</h3>

In [53]:
# Open the MOTIF 2 concatenated BLAST results
coxl2_csv = open_file(motif_2_csv_file_path)
coxl2_csv.head()

Unnamed: 0,qseqid,sseqid,pident,qcovs,evalue,bitscore,qstart,qend,sstart,send,length,source_file,target_file
0,Genome20_S3Ck127_316360,BAB48572_Form2_Rhodopseudomonas_palustris_BisA...,24.931,42,8.8e-31,119.0,4944,2872,52,754,726,Bdellovibrionaceae_bacterium_ERS6626579_scores...,Bdellovibrionaceae_bacterium_ERS6626579.fasta
1,Genome20_S3Ck127_316360,BAB48572_Form2_Mesorhizobium_japonicum_MAFF,24.933,41,1.08e-26,105.0,4953,2911,48,760,742,Bdellovibrionaceae_bacterium_ERS6626579_scores...,Bdellovibrionaceae_bacterium_ERS6626579.fasta
2,Genome267_k127_201050,BAB48572_Form2_Mesorhizobium_japonicum_MAFF,23.846,21,4.9e-35,134.0,6807,8963,7,767,780,Bdellovibrionaceae_bacterium_ERS6626826_scores...,Bdellovibrionaceae_bacterium_ERS6626826.fasta
3,Genome267_k127_201050,BAB48572_Form2_Rhodopseudomonas_palustris_BisA...,24.791,17,4.11e-34,130.0,7299,9014,188,766,597,Bdellovibrionaceae_bacterium_ERS6626826_scores...,Bdellovibrionaceae_bacterium_ERS6626826.fasta
4,Genome267_k127_433915,BAB48572_Form2_Rhodopseudomonas_palustris_BisA...,24.156,20,1.98e-34,132.0,3178,995,8,741,770,Bdellovibrionaceae_bacterium_ERS6626826_scores...,Bdellovibrionaceae_bacterium_ERS6626826.fasta


In [54]:
# Filter MOTIF form 2 BLAST results with genome files that contain Form 2 MOTIFs
coxl2_csv_filtered = coxl2_csv[coxl2_csv['target_file'].isin(coxl2_motif_mag_list)]
coxl2_csv_filtered

Unnamed: 0,qseqid,sseqid,pident,qcovs,evalue,bitscore,qstart,qend,sstart,send,length,source_file,target_file
12,Genome302_k127_440979,BAB48572_Form2_Mesorhizobium_japonicum_MAFF,50.296,15,4.210000e-43,157.0,509,9,6,172,169,Burkholderiaceae_bacterium_ERS6626861_scores.csv,Burkholderiaceae_bacterium_ERS6626861.fasta
13,Genome302_k127_440979,BAB48572_Form2_Rhodopseudomonas_palustris_BisA...,43.195,16,1.120000e-32,124.0,509,3,8,170,169,Burkholderiaceae_bacterium_ERS6626861_scores.csv,Burkholderiaceae_bacterium_ERS6626861.fasta
14,Genome302_k127_117597,BAB48572_Form2_Rhodopseudomonas_palustris_BisA...,27.599,28,2.150000e-31,121.0,5027,3483,264,758,529,Burkholderiaceae_bacterium_ERS6626861_scores.csv,Burkholderiaceae_bacterium_ERS6626861.fasta
15,Genome302_k127_117597,BAB48572_Form2_Mesorhizobium_japonicum_MAFF,25.149,26,6.280000e-21,87.4,5015,3558,272,759,505,Burkholderiaceae_bacterium_ERS6626861_scores.csv,Burkholderiaceae_bacterium_ERS6626861.fasta
16,Genome302_k127_836765,BAB48572_Form2_Mesorhizobium_japonicum_MAFF,64.516,12,3.680000e-71,244.0,5633,4983,567,778,217,Burkholderiaceae_bacterium_ERS6626861_scores.csv,Burkholderiaceae_bacterium_ERS6626861.fasta
...,...,...,...,...,...,...,...,...,...,...,...,...,...
2138,Genome470_k127_338375,BAB48572_Form2_Rhodopseudomonas_palustris_BisA...,43.406,53,0.000000e+00,578.0,2342,9,8,759,781,uncultured_Variovorax_sp._ERS6627246_scores.csv,uncultured_Variovorax_sp._ERS6627246.fasta
2139,Genome470_k127_380329,BAB48572_Form2_Rhodopseudomonas_palustris_BisA...,26.543,3,2.760000e-19,85.5,27726,25918,97,723,648,uncultured_Variovorax_sp._ERS6627246_scores.csv,uncultured_Variovorax_sp._ERS6627246.fasta
2140,Genome470_k127_380329,BAB48572_Form2_Mesorhizobium_japonicum_MAFF,23.981,3,6.110000e-16,74.7,27669,25918,120,746,638,uncultured_Variovorax_sp._ERS6627246_scores.csv,uncultured_Variovorax_sp._ERS6627246.fasta
2141,Genome470_k127_212027,BAB48572_Form2_Mesorhizobium_japonicum_MAFF,70.000,4,1.200000e-12,59.3,1,120,738,774,40,uncultured_Variovorax_sp._ERS6627246_scores.csv,uncultured_Variovorax_sp._ERS6627246.fasta


In [55]:
# Sort the BLAST results based on genome file name, query sequence id, E-value, Bitscore, & percentage Identity
coxl2_csv_sorted = coxl2_csv_filtered.sort_values(by=['target_file', 'qseqid', 'evalue', 'bitscore', 'pident'],
                                        ascending=[True, True, True, False, False])

coxl2_csv_sorted

Unnamed: 0,qseqid,sseqid,pident,qcovs,evalue,bitscore,qstart,qend,sstart,send,length,source_file,target_file
14,Genome302_k127_117597,BAB48572_Form2_Rhodopseudomonas_palustris_BisA...,27.599,28,2.150000e-31,121.0,5027,3483,264,758,529,Burkholderiaceae_bacterium_ERS6626861_scores.csv,Burkholderiaceae_bacterium_ERS6626861.fasta
15,Genome302_k127_117597,BAB48572_Form2_Mesorhizobium_japonicum_MAFF,25.149,26,6.280000e-21,87.4,5015,3558,272,759,505,Burkholderiaceae_bacterium_ERS6626861_scores.csv,Burkholderiaceae_bacterium_ERS6626861.fasta
12,Genome302_k127_440979,BAB48572_Form2_Mesorhizobium_japonicum_MAFF,50.296,15,4.210000e-43,157.0,509,9,6,172,169,Burkholderiaceae_bacterium_ERS6626861_scores.csv,Burkholderiaceae_bacterium_ERS6626861.fasta
13,Genome302_k127_440979,BAB48572_Form2_Rhodopseudomonas_palustris_BisA...,43.195,16,1.120000e-32,124.0,509,3,8,170,169,Burkholderiaceae_bacterium_ERS6626861_scores.csv,Burkholderiaceae_bacterium_ERS6626861.fasta
18,Genome302_k127_499603,BAB48572_Form2_Mesorhizobium_japonicum_MAFF,39.371,30,4.730000e-139,444.0,7271,4965,5,780,795,Burkholderiaceae_bacterium_ERS6626861_scores.csv,Burkholderiaceae_bacterium_ERS6626861.fasta
...,...,...,...,...,...,...,...,...,...,...,...,...,...
2138,Genome470_k127_338375,BAB48572_Form2_Rhodopseudomonas_palustris_BisA...,43.406,53,0.000000e+00,578.0,2342,9,8,759,781,uncultured_Variovorax_sp._ERS6627246_scores.csv,uncultured_Variovorax_sp._ERS6627246.fasta
2139,Genome470_k127_380329,BAB48572_Form2_Rhodopseudomonas_palustris_BisA...,26.543,3,2.760000e-19,85.5,27726,25918,97,723,648,uncultured_Variovorax_sp._ERS6627246_scores.csv,uncultured_Variovorax_sp._ERS6627246.fasta
2140,Genome470_k127_380329,BAB48572_Form2_Mesorhizobium_japonicum_MAFF,23.981,3,6.110000e-16,74.7,27669,25918,120,746,638,uncultured_Variovorax_sp._ERS6627246_scores.csv,uncultured_Variovorax_sp._ERS6627246.fasta
2135,Genome470_k127_43488,BAB48572_Form2_Mesorhizobium_japonicum_MAFF,40.759,32,1.710000e-153,485.0,1349,3637,5,778,790,uncultured_Variovorax_sp._ERS6627246_scores.csv,uncultured_Variovorax_sp._ERS6627246.fasta


In [56]:
# Group the BLAST results by genome file name and query sequence ID and keep the first row of every group as the best match, dropping the remaining duplicates
coxl2_csv_sorted = coxl2_csv_sorted.groupby(['target_file', 'qseqid']).head(1).reset_index(drop=True)
coxl2_csv_sorted

Unnamed: 0,qseqid,sseqid,pident,qcovs,evalue,bitscore,qstart,qend,sstart,send,length,source_file,target_file
0,Genome302_k127_117597,BAB48572_Form2_Rhodopseudomonas_palustris_BisA...,27.599,28,2.150000e-31,121.0,5027,3483,264,758,529,Burkholderiaceae_bacterium_ERS6626861_scores.csv,Burkholderiaceae_bacterium_ERS6626861.fasta
1,Genome302_k127_440979,BAB48572_Form2_Mesorhizobium_japonicum_MAFF,50.296,15,4.210000e-43,157.0,509,9,6,172,169,Burkholderiaceae_bacterium_ERS6626861_scores.csv,Burkholderiaceae_bacterium_ERS6626861.fasta
2,Genome302_k127_499603,BAB48572_Form2_Mesorhizobium_japonicum_MAFF,39.371,30,4.730000e-139,444.0,7271,4965,5,780,795,Burkholderiaceae_bacterium_ERS6626861_scores.csv,Burkholderiaceae_bacterium_ERS6626861.fasta
3,Genome302_k127_836765,BAB48572_Form2_Mesorhizobium_japonicum_MAFF,64.516,12,3.680000e-71,244.0,5633,4983,567,778,217,Burkholderiaceae_bacterium_ERS6626861_scores.csv,Burkholderiaceae_bacterium_ERS6626861.fasta
4,Genome495_k127_217386,BAB48572_Form2_Rhodopseudomonas_palustris_BisA...,24.510,6,3.890000e-07,43.5,450,1058,204,398,204,uncultured_Acidovorax_sp._ERS6627271_scores.csv,uncultured_Acidovorax_sp._ERS6627271.fasta
...,...,...,...,...,...,...,...,...,...,...,...,...,...
184,Genome470_k127_221644,BAB48572_Form2_Mesorhizobium_japonicum_MAFF,28.889,1,4.050000e-07,45.1,432,40,663,777,135,uncultured_Variovorax_sp._ERS6627246_scores.csv,uncultured_Variovorax_sp._ERS6627246.fasta
185,Genome470_k127_290350,BAB48572_Form2_Rhodopseudomonas_palustris_BisA...,28.371,8,1.420000e-33,130.0,2379,4031,229,758,571,uncultured_Variovorax_sp._ERS6627246_scores.csv,uncultured_Variovorax_sp._ERS6627246.fasta
186,Genome470_k127_338375,BAB48572_Form2_Mesorhizobium_japonicum_MAFF,58.537,53,0.000000e+00,903.0,2342,9,6,778,779,uncultured_Variovorax_sp._ERS6627246_scores.csv,uncultured_Variovorax_sp._ERS6627246.fasta
187,Genome470_k127_380329,BAB48572_Form2_Rhodopseudomonas_palustris_BisA...,26.543,3,2.760000e-19,85.5,27726,25918,97,723,648,uncultured_Variovorax_sp._ERS6627246_scores.csv,uncultured_Variovorax_sp._ERS6627246.fasta


In [57]:
# Filter the data frame with the genome file name -> query sequence ID mapping, removing sequences that do not contain the MOTIF
## Create an empty list to store the filtered rows
filtered_rows = []
## Filter the data frame
for key, values in coxl2_motif_id_set.items():
    filtered_rows.append(coxl2_csv_sorted[(coxl2_csv_sorted['target_file'] == key) & (coxl2_csv_sorted['qseqid'].isin(values))])
## Combine the filtered results once, after the loop has finished
filtered_coxl2_csv = pd.concat(filtered_rows).reset_index(drop=True)
## Check the output
filtered_coxl2_csv

Unnamed: 0,qseqid,sseqid,pident,qcovs,evalue,bitscore,qstart,qend,sstart,send,length,source_file,target_file
0,Genome302_k127_117597,BAB48572_Form2_Rhodopseudomonas_palustris_BisA...,27.599,28,2.15e-31,121.0,5027,3483,264,758,529,Burkholderiaceae_bacterium_ERS6626861_scores.csv,Burkholderiaceae_bacterium_ERS6626861.fasta
1,Genome495_k127_317059,BAB48572_Form2_Mesorhizobium_japonicum_MAFF,61.175,7,0.0,905.0,28991,31339,6,781,783,uncultured_Acidovorax_sp._ERS6627271_scores.csv,uncultured_Acidovorax_sp._ERS6627271.fasta
2,Genome262_k127_235961,BAB48572_Form2_Mesorhizobium_japonicum_MAFF,27.289,5,1.71e-33,130.0,6804,8414,243,776,568,uncultured_Acinetobacter_sp._ERS6626821_scores...,uncultured_Acinetobacter_sp._ERS6626821.fasta
3,Genome299_k127_417201,BAB48572_Form2_Mesorhizobium_japonicum_MAFF,27.152,48,1.1000000000000001e-29,114.0,79,1377,348,776,453,uncultured_Acinetobacter_sp._ERS6626858_scores...,uncultured_Acinetobacter_sp._ERS6626858.fasta
4,Genome356_k127_498042,BAB48572_Form2_Rhodopseudomonas_palustris_BisA...,56.338,41,0.0,552.0,1,1491,264,760,497,uncultured_Aureimonas_sp._ERS6626915_scores.csv,uncultured_Aureimonas_sp._ERS6626915.fasta
5,Genome356_k127_610672,BAB48572_Form2_Mesorhizobium_japonicum_MAFF,24.757,30,6.79e-14,63.9,2298,3407,6,404,412,uncultured_Aureimonas_sp._ERS6626915_scores.csv,uncultured_Aureimonas_sp._ERS6626915.fasta
6,Genome530_k127_16854,BAB48572_Form2_Mesorhizobium_japonicum_MAFF,78.59,24,0.0,1218.0,7296,9635,1,780,780,uncultured_Aureimonas_sp._ERS6627306_scores.csv,uncultured_Aureimonas_sp._ERS6627306.fasta
7,Genome71_S1Ck127_18694,BAB48572_Form2_Mesorhizobium_japonicum_MAFF,59.591,5,0.0,882.0,29663,32008,6,780,782,uncultured_Comamonas_sp._ERS6626630_scores.csv,uncultured_Comamonas_sp._ERS6626630.fasta
8,Genome79_S3Ck127_287303,BAB48572_Form2_Mesorhizobium_japonicum_MAFF,24.521,2,4.89e-26,108.0,67567,65405,19,759,783,uncultured_Comamonas_sp._ERS6626638_scores.csv,uncultured_Comamonas_sp._ERS6626638.fasta
9,Genome255_k127_46261,BAB48572_Form2_Mesorhizobium_japonicum_MAFF,24.036,12,4.94e-28,112.0,13099,10937,19,759,778,uncultured_Comamonas_sp._ERS6626814_scores.csv,uncultured_Comamonas_sp._ERS6626814.fasta
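
Before saving, it can be useful to confirm that every MOTIF-containing sequence actually has a BLAST row in the filtered table. The sketch below uses invented data, and `expected`, `found`, and `missing` are hypothetical names. It compares the pairs in the MOTIF mapping with the pairs present after filtering:

```python
import pandas as pd

# Invented stand-ins for the MOTIF mapping and the filtered BLAST table.
motif_ids = {"a.fasta": {"q1", "q2"}, "b.fasta": {"q3"}}
filtered = pd.DataFrame({
    "target_file": ["a.fasta", "b.fasta"],
    "qseqid": ["q1", "q3"],
})

# Every (file, ID) pair the MOTIF search reported.
expected = {(file_name, seq_id) for file_name, ids in motif_ids.items() for seq_id in ids}

# Every (file, ID) pair that survived the BLAST filtering.
found = set(zip(filtered["target_file"], filtered["qseqid"]))

# Pairs with a MOTIF but no BLAST row; an empty list means full coverage.
missing = sorted(expected - found)
print(missing)
```

A non-empty `missing` list would point to sequences whose MOTIF was detected but whose BLAST hit was lost in an earlier step.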


In [58]:
# Save the filtered data frame
filtered_coxl2_csv.to_csv(f"{motif2}/filtered_coxl2_csv.csv")