<h1 style="font-family: 'Times New Roman'; text-align: center; font-weight: bold;">
    Conserved Domain Search
</h1>

<div style="font-family: 'Times New Roman'; font-size: 15px; text-align: justify; width: 100%;">
  <div>
    <span style="display: inline-block; width: 100px;"><b>Date</b></span>: 16<sup>th</sup> November 2024
  </div>
  <div>
    <span style="display: inline-block; width: 100px;"><b>Author</b></span>: Deepan Kanagarajan Babu
  </div>
  <div>
    <span style="display: inline-block; width: 100px;"><b>Description</b></span>: In this document, the filtered sequences are searched with BLAST against the NCBI Conserved Domain Database (CDD). The conserved domain search (CDS) is run both from within this document and by uploading the sequences directly to the NCBI server (referred to as "open_blast" in this document), and both outputs are analysed here. The in-document CDS only produces a .csv file as output; to visualize the domain architecture of a sequence, use the server. A sequence is considered to carry the coxL domain when its first hit is coxL and the e-value of that hit is significant.
  </div>
</div>
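
The first-hit rule described above can be sketched as follows. This is a minimal illustration, not the analysis actually used: it assumes the column names of the summary CSV written by `conserved_domain_search`, that rows for a sequence appear in the order the hits were returned, and that a coxL hit can be recognised by a keyword in the hit definition. The function name `first_hit_calls`, the keyword `"coxl"`, the e-value cut-off, and the example rows are all hypothetical.

```python
import csv
import io


def first_hit_calls(rows, keyword="coxl", max_evalue=0.01):
    """Return {(fasta_file, header): bool}, judging each sequence by its first hit only."""
    calls = {}
    for row in rows:
        key = (row["Fasta File"], row["Sequence Header"])
        if key in calls:
            continue  # only the first hit of each sequence is considered
        definition = (row["CDD Hit Definition"] or "").lower()
        try:
            e_value = float(row["E-value"])
        except (TypeError, ValueError):
            e_value = float("inf")  # a missing e-value is treated as not significant
        calls[key] = keyword in definition and e_value <= max_evalue
    return calls


# Invented rows, for illustration only (not real CDD output)
example_csv = """Fasta File,Sequence Header,CDD Hit ID,CDD Hit Definition,Bit Score,E-value
a.faa,seq1,hit1,CoxL example domain,500,1e-150
a.faa,seq1,hit2,other domain,80,1e-10
b.faa,seq2,hit3,other domain,90,1e-12
"""

calls = first_hit_calls(csv.DictReader(io.StringIO(example_csv)))
for key, carries_domain in calls.items():
    print(key, carries_domain)
```

The keyword match is the weakest assumption here; the actual hit definitions returned by CDD should be inspected before settling on a rule.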


<h2 style="font-family: 'Times New Roman'; font-weight: bold;">
    Required Libraries
</h2>

In [1]:
import pandas as pd
import numpy as np
import os
import csv
import time
import shutil
from Bio import SeqIO
from Bio.Blast import NCBIWWW
from xml.etree import ElementTree as ET

<h2 style="font-family: 'Times New Roman'; font-weight: bold;">
    Required Functions
</h2>

In [3]:
# Open a file based on its extension: .csv and .xlsx are read as DataFrames, .txt as a single string,
# and .fasta/.fastq as a list containing only the first 100 lines
def open_file(file_path):
    if file_path.endswith('.csv'):
        # Read CSV file
        data = pd.read_csv(file_path)
        return data
    elif file_path.endswith('.txt'):
        # Read TXT file
        try:
            with open(file_path, 'r') as file:
                data = file.read()
            return data
        except FileNotFoundError:
            raise FileNotFoundError(f"The file {file_path} does not exist.")
    elif file_path.endswith(('.fasta', '.fastq')):
        # Read FASTA or FASTQ file
        try:
            with open(file_path, 'r') as file:
                data = []
                for _ in range(100):
                    line = file.readline()
                    if not line:
                        break
                    data.append(line.strip())
            return data
        except FileNotFoundError:
            raise FileNotFoundError(f"The file {file_path} does not exist.")
    elif file_path.endswith('.xlsx'):
        # Read XLSX file
        try:
            data = pd.read_excel(file_path)
            return data
        except FileNotFoundError:
            raise FileNotFoundError(f"The file {file_path} does not exist.")
    else:
        raise ValueError("Unsupported file format")

In [4]:
def filter_fasta(csv_file, input_folder, output_folder, duplicates_folder):
    os.makedirs(output_folder, exist_ok=True)
    os.makedirs(duplicates_folder, exist_ok=True)
    
    # Read CSV file and store sequences to extract
    sequences_to_extract = {}
    try:
        with open(csv_file, newline='', encoding='utf-8') as csvfile:
            reader = csv.DictReader(csvfile)
            for row in reader:
                file_name = row["File_Name"].strip()
                sequence_id = row["ID"].strip()
                sequence = row["Sequence"].strip()
                sequence_length = int(row["sequence_length"].strip())
                
                if file_name not in sequences_to_extract:
                    sequences_to_extract[file_name] = {}
                
                sequences_to_extract[file_name][sequence_id] = (sequence, sequence_length)  # Store ID, sequence, and length
    
    except FileNotFoundError:
        print(f"CSV file not found: {csv_file}")
        return
    except KeyError:
        print("CSV file is missing required columns: 'File_Name', 'ID', 'Sequence', and 'sequence_length'")
        return
    
    # Process each FASTA file
    for file_name, sequence_data in sequences_to_extract.items():
        input_file_path = os.path.join(input_folder, file_name)
        output_file_path = os.path.join(output_folder, f"{os.path.splitext(file_name)[0]}_filtered.faa")
        duplicates_file_path = os.path.join(duplicates_folder, f"{os.path.splitext(file_name)[0]}_duplicates.txt")

        if not os.path.exists(input_file_path):
            print(f"File not found: {input_file_path}")
            continue

        extracted_records = []
        seen_sequences = {}
        duplicate_headers = []

        for record in SeqIO.parse(input_file_path, "fasta"):
            if record.id in sequence_data and str(record.seq) == sequence_data[record.id][0] and len(record.seq) == sequence_data[record.id][1]:
                if str(record.seq) in seen_sequences:
                    duplicate_headers.append(record.id)
                else:
                    seen_sequences[str(record.seq)] = record.id
                    extracted_records.append(record)

        if extracted_records:
            with open(output_file_path, "w") as output_f:
                SeqIO.write(extracted_records, output_f, "fasta")
            print(f"Extracted sequences saved to: {output_file_path}")
        
        if duplicate_headers:
            with open(duplicates_file_path, "w") as duplicates_f:
                duplicates_f.write(f"Duplicate headers in {file_name}:\n")
                for header in duplicate_headers:
                    duplicates_f.write(f"{header}\n")
            print(f"Duplicate headers saved to: {duplicates_file_path}")

    print("Sequence extraction completed.")
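
The duplicate handling inside `filter_fasta` keeps the first record seen for each distinct sequence and logs the IDs of later records with an identical sequence. A minimal sketch of that logic on plain `(id, sequence)` pairs, without Biopython, is shown below; `split_duplicates` and the example records are hypothetical and only mirror the `seen_sequences` / `duplicate_headers` bookkeeping.

```python
def split_duplicates(records):
    """Split (record_id, sequence) pairs into first occurrences and duplicate IDs."""
    seen = {}            # sequence -> ID of the first record carrying it
    kept = []            # records that would be written to the filtered FASTA
    duplicate_ids = []   # IDs that would be written to the duplicates file
    for record_id, sequence in records:
        if sequence in seen:
            duplicate_ids.append(record_id)
        else:
            seen[sequence] = record_id
            kept.append((record_id, sequence))
    return kept, duplicate_ids


# Invented records, for illustration only
records = [("id_1", "MKV"), ("id_2", "MKV"), ("id_3", "MAA")]
kept, duplicate_ids = split_duplicates(records)
print(kept)
print(duplicate_ids)
```

Note that which record counts as the "original" depends purely on file order, so the duplicates file is best read as a list of redundant headers rather than a ranking.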

In [5]:
# Function to count number of sequences in every file
def count_sequences_in_fasta_files(folder_path):
    # Dictionary to store the count of sequences in each file
    sequence_counts = {}
    total_sequences = 0

    # Iterate over each file in the folder
    for file_name in os.listdir(folder_path):
        if file_name.endswith((".faa", ".fasta", ".fna")):
            file_path = os.path.join(folder_path, file_name)
            # Count the number of sequences in the current FASTA file
            sequence_count = sum(1 for _ in SeqIO.parse(file_path, "fasta"))
            sequence_counts[file_name] = sequence_count
            total_sequences += sequence_count

    return sequence_counts, total_sequences

In [6]:
# Function to perform BLAST search for a protein sequence with retry mechanism
# Note: the `timeout` argument is currently not used in the function body
def search_cdd(sequence, program="blastp", db="cdd", e_value=0.01, max_hits=10, retries=10, timeout=600):
    for attempt in range(retries):
        try:
            print(f"Attempt {attempt + 1} for sequence.")
            result_handle = NCBIWWW.qblast(
                program=program,
                database=db,
                sequence=sequence,
                expect=e_value,
                hitlist_size=max_hits
            )
            results = result_handle.read()
            result_handle.close()
            return results
        except Exception as e:
            print(f"Attempt {attempt + 1} failed: {e}")
            time.sleep(5)  # Wait before retrying
    print("All attempts failed.")
    return None

# Function to parse BLAST XML and extract summary information
def parse_cdd_xml(xml_content):
    try:
        root = ET.fromstring(xml_content)
        hits = []
        
        # Parse the hits from the XML
        for hit in root.findall(".//Hit"):
            hit_id = hit.findtext("Hit_id")
            hit_def = hit.findtext("Hit_def")
            hit_score = hit.findtext(".//Hsp_bit-score")
            e_value = hit.findtext(".//Hsp_evalue")
            hits.append({
                "hit_id": hit_id,
                "hit_def": hit_def,
                "hit_score": hit_score,
                "e_value": e_value,
            })
        return hits
    except ET.ParseError as e:
        print(f"Error parsing XML: {e}")
        return []

# Function to process all .faa files in a folder
def conserved_domain_search(folder_path, output_folder, summary_csv, program="blastp", db="cdd"):
    os.makedirs(output_folder, exist_ok=True)
    
    with open(summary_csv, "w", newline="") as csvfile:
        csv_writer = csv.writer(csvfile)
        csv_writer.writerow(["Fasta File", "Sequence Header", "CDD Hit ID", "CDD Hit Definition", "Bit Score", "E-value"])
        
        for file_name in os.listdir(folder_path):
            if file_name.endswith(".faa"):
                file_path = os.path.join(folder_path, file_name)
                print(f"Processing file: {file_name}")
                
                for record in SeqIO.parse(file_path, "fasta"):
                    sequence_id = record.id
                    sequence = str(record.seq)
                    
                    print(f"Searching CDD for sequence: {sequence_id}")
                    result = search_cdd(sequence, program=program, db=db)
                    
                    if result:
                        output_file = os.path.join(output_folder, f"{file_name}_{sequence_id}_cdd.xml")
                        with open(output_file, "w") as f:
                            f.write(result)
                        print(f"Results saved to: {output_file}")
                        
                        hits = parse_cdd_xml(result)
                        for hit in hits:
                            csv_writer.writerow([
                                file_name,
                                sequence_id,
                                hit["hit_id"],
                                hit["hit_def"],
                                hit["hit_score"],
                                hit["e_value"]
                            ])
                print(f"Completed processing file: {file_name}")
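
To make the behaviour of `parse_cdd_xml` concrete, the sketch below applies the same `ElementTree` lookups to a small hand-written XML string. The XML is an invented, heavily reduced stand-in for BLAST XML output (only the tags read by the parser are included), so it illustrates the lookups rather than the real response format. It also shows that `findtext(".//Hsp_bit-score")` returns the first HSP of a hit when several are present.

```python
from xml.etree import ElementTree as ET

# Invented, reduced example; real BLAST XML contains many more elements
sample_xml = """<BlastOutput>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_hits>
        <Hit>
          <Hit_id>example_id_1</Hit_id>
          <Hit_def>example domain A</Hit_def>
          <Hit_hsps>
            <Hsp><Hsp_bit-score>250.0</Hsp_bit-score><Hsp_evalue>1e-80</Hsp_evalue></Hsp>
            <Hsp><Hsp_bit-score>40.0</Hsp_bit-score><Hsp_evalue>0.005</Hsp_evalue></Hsp>
          </Hit_hsps>
        </Hit>
      </Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>"""

root = ET.fromstring(sample_xml)
rows = []
for hit in root.findall(".//Hit"):
    rows.append({
        "hit_id": hit.findtext("Hit_id"),
        "hit_def": hit.findtext("Hit_def"),
        # findtext returns the first match, i.e. the first HSP listed for this hit
        "hit_score": hit.findtext(".//Hsp_bit-score"),
        "e_value": hit.findtext(".//Hsp_evalue"),
    })

print(rows)
```

All values come back as strings, so they need converting with `float()` before any numeric comparison on the bit score or e-value.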

<h2 style="font-family: 'Times New Roman'; font-weight: bold;">
    Required Directory Paths
</h2>

In [8]:
# Specify the input folder paths
## Filtered MOTIF form 1 BLAST CSV file path
motif_1_csv_file_path = "CoxL1_Results/MOTIF/filtered_coxl1_csv.csv"


## Filtered MOTIF form 2 BLAST CSV file path
motif_2_csv_file_path = "CoxL2_Results/MOTIF/filtered_coxl2_csv.csv"

## Filtered CoxL1 MOTIF results
motif_1_results = "CoxL1_Results/MOTIF/coxl1_motif_results.csv"

## Filtered CoxL2 MOTIF results
motif_2_results = "CoxL2_Results/MOTIF/coxl2_motif_results.csv"

## CoxL Form 1 BLAST proteins sequence directory
motif_1_proteins = "CoxL1_Results/MOTIF_proteins"

## CoxL Form 2 BLAST proteins sequence directory
motif_2_proteins = "CoxL2_Results/MOTIF_proteins"

## MOTIF1 results directory
motif1 = "CoxL1_Results/MOTIF"

## MOTIF2 results directory
motif2 = "CoxL2_Results/MOTIF"

## CoxL 2 MOTIF FASTA Output
motif_2_fasta_path = "CoxL2_Results/MOTIF/fasta"

## NCBI Open db BLAST Results Folder
cross_ver = "CoxL2_Results/Cross_Verification"

# Specify the output folder paths
## CoxL Form 1 filtered protein sequences directory
motif_1_filtered_proteins = "CoxL1_Results/MOTIF_filtered_proteins"
## Ensure the output directory exists
os.makedirs(motif_1_filtered_proteins, exist_ok=True)

## CoxL Form 2 filtered protein sequences directory
motif_2_filtered_proteins = "CoxL2_Results/MOTIF_filtered_proteins"
## Ensure the output directory exists
os.makedirs(motif_2_filtered_proteins, exist_ok=True)

## CoxL Form 1 BLAST outputs with Form 2 MOTIFs: CDS results directory for .xml files
xlm1 = "CoxL1_Results/MOTIF/xlm"
## Ensure the output directory exists
os.makedirs(xlm1, exist_ok=True)

## CoxL Form 2 BLAST outputs with Form 2 MOTIFs: CDS results directory for .xml files
xlm2 = "CoxL2_Results/MOTIF/xlm"
## Ensure the output directory exists
os.makedirs(xlm2, exist_ok=True)

## MOTIF1 duplicate sequence headers directory
duplicate1 = "CoxL1_Results/MOTIF/duplicate_list"
## Ensure the output directory exists
os.makedirs(duplicate1, exist_ok=True)

## MOTIF2 duplicate sequence headers directory
duplicate2 = "CoxL2_Results/MOTIF/duplicate"
## Ensure the output directory exists
os.makedirs(duplicate2, exist_ok=True)

## CoxL Form 1 BLAST proteins sequence CDS result file path
cds1 = "CoxL1_Results/MOTIF/cds1.csv"

## CoxL Form 2 BLAST proteins sequence CDS result file path
cds2 = "CoxL2_Results/MOTIF/cds2.csv"

## CoxL Form 2 genes directory
coxl2_genes = "CoxL2_Results/Genes"
## Ensure the output directory exists
os.makedirs(coxl2_genes, exist_ok=True)
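
With this many output folders, one way to keep every `os.makedirs` call paired with its own path is to hold the paths in a single mapping and create them in a loop. The sketch below is only an illustration of that pattern: `ensure_directories` is a hypothetical helper, the mapping lists a subset of the paths defined above, and a temporary base directory is used so that running it does not touch the real results folders.

```python
import os
import tempfile


def ensure_directories(base_dir, relative_paths):
    """Create every directory in relative_paths under base_dir and return the full paths."""
    created = {}
    for name, rel_path in relative_paths.items():
        full_path = os.path.join(base_dir, rel_path)
        os.makedirs(full_path, exist_ok=True)
        created[name] = full_path
    return created


# Subset of the output directories, keyed by the variable names used in this document
output_dirs = {
    "motif_1_filtered_proteins": "CoxL1_Results/MOTIF_filtered_proteins",
    "motif_2_filtered_proteins": "CoxL2_Results/MOTIF_filtered_proteins",
    "duplicate1": "CoxL1_Results/MOTIF/duplicate_list",
    "duplicate2": "CoxL2_Results/MOTIF/duplicate",
}

# A temporary base directory is an assumption made for this demonstration only
with tempfile.TemporaryDirectory() as tmp:
    created = ensure_directories(tmp, output_dirs)
    all_exist = all(os.path.isdir(path) for path in created.values())

print(all_exist)
```

In the notebook itself the base directory would simply be the working directory, and the returned mapping could replace the individual path variables if preferred.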

<h2 style="font-family: 'Times New Roman'; font-weight: bold;">
    Filter FASTA Files with Filtered MOTIF Search Results
</h2>

<h3 style="font-family: 'Times New Roman'; font-weight: bold;">
    For Form 1 BLAST Results with Form 2 MOTIFs
</h3>

In [11]:
# Run the extraction
filter_fasta(motif_1_results, motif_1_proteins, motif_1_filtered_proteins, duplicate1)

Extracted sequences saved to: CoxL1_Results/MOTIF_filtered_proteins\uncultured_Acidovorax_sp._ERS6626787_filtered.faa
Duplicate headers saved to: CoxL1_Results/MOTIF/duplicate_list\uncultured_Acidovorax_sp._ERS6626787_duplicates.txt
Extracted sequences saved to: CoxL1_Results/MOTIF_filtered_proteins\uncultured_Acidovorax_sp._ERS6627271_filtered.faa
Duplicate headers saved to: CoxL1_Results/MOTIF/duplicate_list\uncultured_Acidovorax_sp._ERS6627271_duplicates.txt
Extracted sequences saved to: CoxL1_Results/MOTIF_filtered_proteins\uncultured_Acinetobacter_sp._ERS6626821_filtered.faa
Extracted sequences saved to: CoxL1_Results/MOTIF_filtered_proteins\uncultured_Acinetobacter_sp._ERS6626858_filtered.faa
Extracted sequences saved to: CoxL1_Results/MOTIF_filtered_proteins\uncultured_Agrobacterium_sp._ERS6626766_filtered.faa
Extracted sequences saved to: CoxL1_Results/MOTIF_filtered_proteins\uncultured_Aureimonas_sp._ERS6626915_filtered.faa
Duplicate headers saved to: CoxL1_Results/MOTIF/dupli

In [12]:
# Get the count of sequences in each .faa file and the total sequences
sequence_counts_1, total_sequences_1 = count_sequences_in_fasta_files(motif_1_filtered_proteins)

# Print the results
for file_name, count in sequence_counts_1.items():
    print(f"{file_name}: {count} sequences")

print(f"Total sequences from all files: {total_sequences_1}")


Burkholderiaceae_bacterium_ERS6626861_filtered.faa: 1 sequences
uncultured_Acidovorax_sp._ERS6626735_filtered.faa: 1 sequences
uncultured_Acidovorax_sp._ERS6626737_filtered.faa: 1 sequences
uncultured_Acidovorax_sp._ERS6626738_filtered.faa: 1 sequences
uncultured_Acidovorax_sp._ERS6626787_filtered.faa: 1 sequences
uncultured_Acidovorax_sp._ERS6626883_filtered.faa: 1 sequences
uncultured_Acidovorax_sp._ERS6627009_filtered.faa: 1 sequences
uncultured_Acidovorax_sp._ERS6627271_filtered.faa: 1 sequences
uncultured_Acinetobacter_sp._ERS6626821_filtered.faa: 1 sequences
uncultured_Acinetobacter_sp._ERS6626858_filtered.faa: 1 sequences
uncultured_Agrobacterium_sp._ERS6626766_filtered.faa: 1 sequences
uncultured_Aureimonas_sp._ERS6626915_filtered.faa: 3 sequences
uncultured_Aureimonas_sp._ERS6627306_filtered.faa: 1 sequences
uncultured_Comamonas_sp._ERS6626602_filtered.faa: 1 sequences
uncultured_Comamonas_sp._ERS6626630_filtered.faa: 1 sequences
uncultured_Comamonas_sp._ERS6626638_filtered.fa

<h3 style="font-family: 'Times New Roman'; font-weight: bold;">
    For Form 2 BLAST Results with Form 2 MOTIFs
</h3>

In [22]:
# Run the extraction
filter_fasta(motif_2_results, motif_2_proteins, motif_2_filtered_proteins, duplicate2)

Extracted sequences saved to: CoxL2_Results/MOTIF_filtered_proteins\Burkholderiaceae_bacterium_ERS6626861_filtered.faa
Extracted sequences saved to: CoxL2_Results/MOTIF_filtered_proteins\uncultured_Acidovorax_sp._ERS6627271_filtered.faa
Extracted sequences saved to: CoxL2_Results/MOTIF_filtered_proteins\uncultured_Acinetobacter_sp._ERS6626821_filtered.faa
Extracted sequences saved to: CoxL2_Results/MOTIF_filtered_proteins\uncultured_Acinetobacter_sp._ERS6626858_filtered.faa
Extracted sequences saved to: CoxL2_Results/MOTIF_filtered_proteins\uncultured_Aureimonas_sp._ERS6626915_filtered.faa
Duplicate headers saved to: CoxL2_Results/MOTIF/duplicate\uncultured_Aureimonas_sp._ERS6626915_duplicates.txt
Extracted sequences saved to: CoxL2_Results/MOTIF_filtered_proteins\uncultured_Aureimonas_sp._ERS6627306_filtered.faa
Extracted sequences saved to: CoxL2_Results/MOTIF_filtered_proteins\uncultured_Comamonas_sp._ERS6626630_filtered.faa
Extracted sequences saved to: CoxL2_Results/MOTIF_filtered

In [24]:
# Get the count of sequences in each .faa file and the total sequences
sequence_counts_2, total_sequences_2 = count_sequences_in_fasta_files(motif_2_filtered_proteins)

# Print the results
for file_name, count in sequence_counts_2.items():
    print(f"{file_name}: {count} sequences")

print(f"Total sequences from all files: {total_sequences_2}")


Burkholderiaceae_bacterium_ERS6626861_filtered.faa: 1 sequences
uncultured_Acidovorax_sp._ERS6627271_filtered.faa: 1 sequences
uncultured_Acinetobacter_sp._ERS6626821_filtered.faa: 1 sequences
uncultured_Acinetobacter_sp._ERS6626858_filtered.faa: 1 sequences
uncultured_Aureimonas_sp._ERS6626915_filtered.faa: 2 sequences
uncultured_Aureimonas_sp._ERS6627306_filtered.faa: 1 sequences
uncultured_Comamonas_sp._ERS6626630_filtered.faa: 1 sequences
uncultured_Comamonas_sp._ERS6626638_filtered.faa: 2 sequences
uncultured_Comamonas_sp._ERS6626814_filtered.faa: 1 sequences
uncultured_Comamonas_sp._ERS6627283_filtered.faa: 1 sequences
uncultured_Deinococcus_sp._ERS6626797_filtered.faa: 2 sequences
uncultured_Deinococcus_sp._ERS6626820_filtered.faa: 2 sequences
uncultured_Deinococcus_sp._ERS6626837_filtered.faa: 1 sequences
uncultured_Deinococcus_sp._ERS6627293_filtered.faa: 1 sequences
uncultured_Deinococcus_sp._ERS6627335_filtered.faa: 1 sequences
uncultured_Hymenobacter_sp._ERS6626733_filtered

<h2 style="font-family: 'Times New Roman'; font-weight: bold;">
    Conserved Domain Search
</h2>

<h3 style="font-family: 'Times New Roman'; font-weight: bold;">
    For Form 1 BLAST Results with Form 2 MOTIFs
</h3>

In [27]:
# Conserved Domain Search for CoxL Form 1 BLAST outputs with Form 2 MOTIF
conserved_domain_search(motif_1_filtered_proteins, xlm1, cds1, program="blastp", db="cdd")
print(f"Summary CSV created at: {cds1}")

Processing file: Burkholderiaceae_bacterium_ERS6626861_filtered.faa
Searching CDD for sequence: Genome302_k127_117597_3
Attempt 1 for sequence.
Results saved to: MOTIF1/xlm\Burkholderiaceae_bacterium_ERS6626861_filtered.faa_Genome302_k127_117597_3_cdd.xml
Completed processing file: Burkholderiaceae_bacterium_ERS6626861_filtered.faa
Processing file: uncultured_Acidovorax_sp._ERS6626735_filtered.faa
Searching CDD for sequence: Genome176_S3Ck127_734649_1
Attempt 1 for sequence.
Results saved to: MOTIF1/xlm\uncultured_Acidovorax_sp._ERS6626735_filtered.faa_Genome176_S3Ck127_734649_1_cdd.xml
Completed processing file: uncultured_Acidovorax_sp._ERS6626735_filtered.faa
Processing file: uncultured_Acidovorax_sp._ERS6626737_filtered.faa
Searching CDD for sequence: Genome178_S3Ck127_362732_1
Attempt 1 for sequence.
Results saved to: MOTIF1/xlm\uncultured_Acidovorax_sp._ERS6626737_filtered.faa_Genome178_S3Ck127_362732_1_cdd.xml
Completed processing file: uncultured_Acidovorax_sp._ERS6626737_filte



Results saved to: MOTIF1/xlm\uncultured_Acinetobacter_sp._ERS6626858_filtered.faa_Genome299_k127_417201_1_cdd.xml
Completed processing file: uncultured_Acinetobacter_sp._ERS6626858_filtered.faa
Processing file: uncultured_Aureimonas_sp._ERS6626915_filtered.faa
Searching CDD for sequence: Genome356_k127_610672_1
Attempt 1 for sequence.
Results saved to: MOTIF1/xlm\uncultured_Aureimonas_sp._ERS6626915_filtered.faa_Genome356_k127_610672_1_cdd.xml
Searching CDD for sequence: Genome356_k127_498042_1
Attempt 1 for sequence.
Results saved to: MOTIF1/xlm\uncultured_Aureimonas_sp._ERS6626915_filtered.faa_Genome356_k127_498042_1_cdd.xml
Completed processing file: uncultured_Aureimonas_sp._ERS6626915_filtered.faa
Processing file: uncultured_Aureimonas_sp._ERS6627306_filtered.faa
Searching CDD for sequence: Genome530_k127_16854_1
Attempt 1 for sequence.
Results saved to: MOTIF1/xlm\uncultured_Aureimonas_sp._ERS6627306_filtered.faa_Genome530_k127_16854_1_cdd.xml
Completed processing file: unculture



Results saved to: MOTIF1/xlm\uncultured_Comamonas_sp._ERS6627283_filtered.faa_Genome507_k127_91973_1_cdd.xml
Completed processing file: uncultured_Comamonas_sp._ERS6627283_filtered.faa
Processing file: uncultured_Curtobacterium_sp._ERS6626839_filtered.faa
Searching CDD for sequence: Genome280_k127_966990_1
Attempt 1 for sequence.
Results saved to: MOTIF1/xlm\uncultured_Curtobacterium_sp._ERS6626839_filtered.faa_Genome280_k127_966990_1_cdd.xml
Completed processing file: uncultured_Curtobacterium_sp._ERS6626839_filtered.faa
Processing file: uncultured_Deinococcus_sp._ERS6626797_filtered.faa
Searching CDD for sequence: Genome238_k127_54530_1
Attempt 1 for sequence.
Results saved to: MOTIF1/xlm\uncultured_Deinococcus_sp._ERS6626797_filtered.faa_Genome238_k127_54530_1_cdd.xml
Searching CDD for sequence: Genome238_k127_17753_1
Attempt 1 for sequence.
Results saved to: MOTIF1/xlm\uncultured_Deinococcus_sp._ERS6626797_filtered.faa_Genome238_k127_17753_1_cdd.xml
Completed processing file: uncul



Results saved to: MOTIF1/xlm\uncultured_Deinococcus_sp._ERS6627293_filtered.faa_Genome517_k127_348199_1_cdd.xml
Completed processing file: uncultured_Deinococcus_sp._ERS6627293_filtered.faa
Processing file: uncultured_Deinococcus_sp._ERS6627335_filtered.faa
Searching CDD for sequence: Genome559_k127_243715_1
Attempt 1 for sequence.
Results saved to: MOTIF1/xlm\uncultured_Deinococcus_sp._ERS6627335_filtered.faa_Genome559_k127_243715_1_cdd.xml
Completed processing file: uncultured_Deinococcus_sp._ERS6627335_filtered.faa
Processing file: uncultured_Hymenobacter_sp._ERS6626733_filtered.faa
Searching CDD for sequence: Genome174_S1Ck127_276112_1
Attempt 1 for sequence.
Results saved to: MOTIF1/xlm\uncultured_Hymenobacter_sp._ERS6626733_filtered.faa_Genome174_S1Ck127_276112_1_cdd.xml
Completed processing file: uncultured_Hymenobacter_sp._ERS6626733_filtered.faa
Processing file: uncultured_Hymenobacter_sp._ERS6627279_filtered.faa
Searching CDD for sequence: Genome503_k127_453582_1
Attempt 1 fo



Results saved to: MOTIF1/xlm\uncultured_Microbacterium_sp._ERS6627334_filtered.faa_Genome558_k127_281833_1_cdd.xml
Completed processing file: uncultured_Microbacterium_sp._ERS6627334_filtered.faa
Processing file: uncultured_Mucilaginibacter_sp._ERS6626873_filtered.faa
Searching CDD for sequence: Genome314_k127_168730_2
Attempt 1 for sequence.




Results saved to: MOTIF1/xlm\uncultured_Mucilaginibacter_sp._ERS6626873_filtered.faa_Genome314_k127_168730_2_cdd.xml
Completed processing file: uncultured_Mucilaginibacter_sp._ERS6626873_filtered.faa
Processing file: uncultured_Mucilaginibacter_sp._ERS6626923_filtered.faa
Searching CDD for sequence: Genome364_k127_154890_1
Attempt 1 for sequence.
Results saved to: MOTIF1/xlm\uncultured_Mucilaginibacter_sp._ERS6626923_filtered.faa_Genome364_k127_154890_1_cdd.xml
Completed processing file: uncultured_Mucilaginibacter_sp._ERS6626923_filtered.faa
Processing file: uncultured_Mucilaginibacter_sp._ERS6626962_filtered.faa
Searching CDD for sequence: Genome403_k127_249621_1
Attempt 1 for sequence.
Results saved to: MOTIF1/xlm\uncultured_Mucilaginibacter_sp._ERS6626962_filtered.faa_Genome403_k127_249621_1_cdd.xml
Completed processing file: uncultured_Mucilaginibacter_sp._ERS6626962_filtered.faa
Processing file: uncultured_Mucilaginibacter_sp._ERS6626972_filtered.faa
Searching CDD for sequence: G



Results saved to: MOTIF1/xlm\uncultured_Mucilaginibacter_sp._ERS6627275_filtered.faa_Genome499_k127_471593_1_cdd.xml
Completed processing file: uncultured_Mucilaginibacter_sp._ERS6627275_filtered.faa
Processing file: uncultured_Mucilaginibacter_sp._ERS6627290_filtered.faa
Searching CDD for sequence: Genome514_k127_4285_1
Attempt 1 for sequence.




Results saved to: MOTIF1/xlm\uncultured_Mucilaginibacter_sp._ERS6627290_filtered.faa_Genome514_k127_4285_1_cdd.xml
Completed processing file: uncultured_Mucilaginibacter_sp._ERS6627290_filtered.faa
Processing file: uncultured_Pantoea_sp._ERS6627005_filtered.faa
Searching CDD for sequence: Genome446_k127_627589_1
Attempt 1 for sequence.
Results saved to: MOTIF1/xlm\uncultured_Pantoea_sp._ERS6627005_filtered.faa_Genome446_k127_627589_1_cdd.xml
Completed processing file: uncultured_Pantoea_sp._ERS6627005_filtered.faa
Processing file: uncultured_Pantoea_sp._ERS6627332_filtered.faa
Searching CDD for sequence: Genome556_k127_19993_2
Attempt 1 for sequence.




Results saved to: MOTIF1/xlm\uncultured_Pantoea_sp._ERS6627332_filtered.faa_Genome556_k127_19993_2_cdd.xml
Completed processing file: uncultured_Pantoea_sp._ERS6627332_filtered.faa
Processing file: uncultured_Pseudacidovorax_sp._ERS6626577_filtered.faa
Searching CDD for sequence: Genome18_S1Ck127_217968_1
Attempt 1 for sequence.
Results saved to: MOTIF1/xlm\uncultured_Pseudacidovorax_sp._ERS6626577_filtered.faa_Genome18_S1Ck127_217968_1_cdd.xml
Searching CDD for sequence: Genome18_S1Ck127_178435_1
Attempt 1 for sequence.
Results saved to: MOTIF1/xlm\uncultured_Pseudacidovorax_sp._ERS6626577_filtered.faa_Genome18_S1Ck127_178435_1_cdd.xml
Completed processing file: uncultured_Pseudacidovorax_sp._ERS6626577_filtered.faa
Processing file: uncultured_Pseudacidovorax_sp._ERS6626850_filtered.faa
Searching CDD for sequence: Genome291_k127_419550_1
Attempt 1 for sequence.
Results saved to: MOTIF1/xlm\uncultured_Pseudacidovorax_sp._ERS6626850_filtered.faa_Genome291_k127_419550_1_cdd.xml
Completed



Results saved to: MOTIF1/xlm\uncultured_Pseudacidovorax_sp._ERS6627320_filtered.faa_Genome544_k127_857444_1_cdd.xml
Completed processing file: uncultured_Pseudacidovorax_sp._ERS6627320_filtered.faa
Processing file: uncultured_Pseudacidovorax_sp._ERS6627321_filtered.faa
Searching CDD for sequence: Genome545_k127_344480_1
Attempt 1 for sequence.
Results saved to: MOTIF1/xlm\uncultured_Pseudacidovorax_sp._ERS6627321_filtered.faa_Genome545_k127_344480_1_cdd.xml
Searching CDD for sequence: Genome545_k127_208535_1
Attempt 1 for sequence.
Results saved to: MOTIF1/xlm\uncultured_Pseudacidovorax_sp._ERS6627321_filtered.faa_Genome545_k127_208535_1_cdd.xml
Completed processing file: uncultured_Pseudacidovorax_sp._ERS6627321_filtered.faa
Processing file: uncultured_Pseudomonas_sp._ERS6626621_filtered.faa
Searching CDD for sequence: Genome62_S2Ck127_218854_6
Attempt 1 for sequence.
Results saved to: MOTIF1/xlm\uncultured_Pseudomonas_sp._ERS6626621_filtered.faa_Genome62_S2Ck127_218854_6_cdd.xml
Comp



Results saved to: MOTIF1/xlm\uncultured_Pseudomonas_sp._ERS6626713_filtered.faa_Genome154_S3Ck127_221454_4_cdd.xml
Completed processing file: uncultured_Pseudomonas_sp._ERS6626713_filtered.faa
Processing file: uncultured_Pseudomonas_sp._ERS6627273_filtered.faa
Searching CDD for sequence: Genome497_k127_28836_2
Attempt 1 for sequence.
Results saved to: MOTIF1/xlm\uncultured_Pseudomonas_sp._ERS6627273_filtered.faa_Genome497_k127_28836_2_cdd.xml
Completed processing file: uncultured_Pseudomonas_sp._ERS6627273_filtered.faa
Processing file: uncultured_Pseudorhodoferax_sp._ERS6626941_filtered.faa
Searching CDD for sequence: Genome382_k127_146993_1
Attempt 1 for sequence.
Results saved to: MOTIF1/xlm\uncultured_Pseudorhodoferax_sp._ERS6626941_filtered.faa_Genome382_k127_146993_1_cdd.xml
Completed processing file: uncultured_Pseudorhodoferax_sp._ERS6626941_filtered.faa
Processing file: uncultured_Sphingomonas_sp._ERS6626586_filtered.faa
Searching CDD for sequence: Genome27_S3Ck127_309455_2
Att



Results saved to: MOTIF1/xlm\uncultured_Sphingomonas_sp._ERS6626586_filtered.faa_Genome27_S3Ck127_549998_1_cdd.xml
Completed processing file: uncultured_Sphingomonas_sp._ERS6626586_filtered.faa
Processing file: uncultured_Sphingomonas_sp._ERS6626658_filtered.faa
Searching CDD for sequence: Genome99_S1Ck127_187716_1
Attempt 1 for sequence.




Results saved to: MOTIF1/xlm\uncultured_Sphingomonas_sp._ERS6626658_filtered.faa_Genome99_S1Ck127_187716_1_cdd.xml
Completed processing file: uncultured_Sphingomonas_sp._ERS6626658_filtered.faa
Processing file: uncultured_Sphingomonas_sp._ERS6626669_filtered.faa
Searching CDD for sequence: Genome110_S3Ck127_116759_5
Attempt 1 for sequence.
Results saved to: MOTIF1/xlm\uncultured_Sphingomonas_sp._ERS6626669_filtered.faa_Genome110_S3Ck127_116759_5_cdd.xml
Searching CDD for sequence: Genome110_S3Ck127_256731_4
Attempt 1 for sequence.
Results saved to: MOTIF1/xlm\uncultured_Sphingomonas_sp._ERS6626669_filtered.faa_Genome110_S3Ck127_256731_4_cdd.xml
Completed processing file: uncultured_Sphingomonas_sp._ERS6626669_filtered.faa
Processing file: uncultured_Sphingomonas_sp._ERS6626831_filtered.faa
Searching CDD for sequence: Genome272_k127_58089_2
Attempt 1 for sequence.
Results saved to: MOTIF1/xlm\uncultured_Sphingomonas_sp._ERS6626831_filtered.faa_Genome272_k127_58089_2_cdd.xml
Completed pr



Results saved to: MOTIF1/xlm\uncultured_Sphingomonas_sp._ERS6627258_filtered.faa_Genome482_k127_81292_1_cdd.xml
Completed processing file: uncultured_Sphingomonas_sp._ERS6627258_filtered.faa
Processing file: uncultured_Sphingomonas_sp._ERS6627261_filtered.faa
Searching CDD for sequence: Genome485_k127_83547_3
Attempt 1 for sequence.




Results saved to: MOTIF1/xlm\uncultured_Sphingomonas_sp._ERS6627261_filtered.faa_Genome485_k127_83547_3_cdd.xml
Completed processing file: uncultured_Sphingomonas_sp._ERS6627261_filtered.faa
Processing file: uncultured_Sphingomonas_sp._ERS6627280_filtered.faa
Searching CDD for sequence: Genome504_k127_368_4
Attempt 1 for sequence.
Results saved to: MOTIF1/xlm\uncultured_Sphingomonas_sp._ERS6627280_filtered.faa_Genome504_k127_368_4_cdd.xml
Searching CDD for sequence: Genome504_k127_368_1
Attempt 1 for sequence.
Results saved to: MOTIF1/xlm\uncultured_Sphingomonas_sp._ERS6627280_filtered.faa_Genome504_k127_368_1_cdd.xml
Completed processing file: uncultured_Sphingomonas_sp._ERS6627280_filtered.faa
Processing file: uncultured_Variovorax_sp._ERS6626779_filtered.faa
Searching CDD for sequence: Genome220_k127_244806_3
Attempt 1 for sequence.
Results saved to: MOTIF1/xlm\uncultured_Variovorax_sp._ERS6626779_filtered.faa_Genome220_k127_244806_3_cdd.xml
Completed processing file: uncultured_Var

<h3 style="font-family: 'Times New Roman'; font-weight: bold;">
    For Form 2 BLAST Results with Form 2 MOTIFs
</h3>

In [26]:
conserved_domain_search(motif_2_filtered_proteins, xlm2, cds2, program="blastp", db="cdd")
print(f"Summary CSV created at: {cds2}")

Processing file: Burkholderiaceae_bacterium_ERS6626861_filtered.faa
Searching CDD for sequence: Genome302_k127_117597_1
Attempt 1 for sequence.




Results saved to: MOTIF2/xlm\Burkholderiaceae_bacterium_ERS6626861_filtered.faa_Genome302_k127_117597_1_cdd.xml
Completed processing file: Burkholderiaceae_bacterium_ERS6626861_filtered.faa
Processing file: uncultured_Acidovorax_sp._ERS6627271_filtered.faa
Searching CDD for sequence: Genome495_k127_317059_1
Attempt 1 for sequence.




Results saved to: MOTIF2/xlm\uncultured_Acidovorax_sp._ERS6627271_filtered.faa_Genome495_k127_317059_1_cdd.xml
Completed processing file: uncultured_Acidovorax_sp._ERS6627271_filtered.faa
Processing file: uncultured_Acinetobacter_sp._ERS6626821_filtered.faa
Searching CDD for sequence: Genome262_k127_235961_1
Attempt 1 for sequence.
Results saved to: MOTIF2/xlm\uncultured_Acinetobacter_sp._ERS6626821_filtered.faa_Genome262_k127_235961_1_cdd.xml
Completed processing file: uncultured_Acinetobacter_sp._ERS6626821_filtered.faa
Processing file: uncultured_Acinetobacter_sp._ERS6626858_filtered.faa
Searching CDD for sequence: Genome299_k127_417201_1
Attempt 1 for sequence.
Results saved to: MOTIF2/xlm\uncultured_Acinetobacter_sp._ERS6626858_filtered.faa_Genome299_k127_417201_1_cdd.xml
Completed processing file: uncultured_Acinetobacter_sp._ERS6626858_filtered.faa
Processing file: uncultured_Aureimonas_sp._ERS6626915_filtered.faa
Searching CDD for sequence: Genome356_k127_610672_1
Attempt 1 for



Results saved to: MOTIF2/xlm\uncultured_Aureimonas_sp._ERS6626915_filtered.faa_Genome356_k127_610672_1_cdd.xml
Searching CDD for sequence: Genome356_k127_498042_1
Attempt 1 for sequence.
Results saved to: MOTIF2/xlm\uncultured_Aureimonas_sp._ERS6626915_filtered.faa_Genome356_k127_498042_1_cdd.xml
Completed processing file: uncultured_Aureimonas_sp._ERS6626915_filtered.faa
Processing file: uncultured_Aureimonas_sp._ERS6627306_filtered.faa
Searching CDD for sequence: Genome530_k127_16854_1
Attempt 1 for sequence.
Results saved to: MOTIF2/xlm\uncultured_Aureimonas_sp._ERS6627306_filtered.faa_Genome530_k127_16854_1_cdd.xml
Completed processing file: uncultured_Aureimonas_sp._ERS6627306_filtered.faa
Processing file: uncultured_Comamonas_sp._ERS6626630_filtered.faa
Searching CDD for sequence: Genome71_S1Ck127_18694_1
Attempt 1 for sequence.




Results saved to: MOTIF2/xlm\uncultured_Comamonas_sp._ERS6626630_filtered.faa_Genome71_S1Ck127_18694_1_cdd.xml
Completed processing file: uncultured_Comamonas_sp._ERS6626630_filtered.faa
Processing file: uncultured_Comamonas_sp._ERS6626638_filtered.faa
Searching CDD for sequence: Genome79_S3Ck127_287303_1
Attempt 1 for sequence.




Results saved to: MOTIF2/xlm\uncultured_Comamonas_sp._ERS6626638_filtered.faa_Genome79_S3Ck127_287303_1_cdd.xml
Searching CDD for sequence: Genome79_S3Ck127_287303_3
Attempt 1 for sequence.
Results saved to: MOTIF2/xlm\uncultured_Comamonas_sp._ERS6626638_filtered.faa_Genome79_S3Ck127_287303_3_cdd.xml
Completed processing file: uncultured_Comamonas_sp._ERS6626638_filtered.faa
Processing file: uncultured_Comamonas_sp._ERS6626814_filtered.faa
Searching CDD for sequence: Genome255_k127_46261_1
Attempt 1 for sequence.
Results saved to: MOTIF2/xlm\uncultured_Comamonas_sp._ERS6626814_filtered.faa_Genome255_k127_46261_1_cdd.xml
Completed processing file: uncultured_Comamonas_sp._ERS6626814_filtered.faa
Processing file: uncultured_Comamonas_sp._ERS6627283_filtered.faa
Searching CDD for sequence: Genome507_k127_91973_1
Attempt 1 for sequence.
Results saved to: MOTIF2/xlm\uncultured_Comamonas_sp._ERS6627283_filtered.faa_Genome507_k127_91973_1_cdd.xml
Completed processing file: uncultured_Comamona



Attempt 1 failed: <urlopen error [WinError 10053] An established connection was aborted by the software in your host machine>
Attempt 2 for sequence.
Attempt 2 failed: <urlopen error [Errno 11001] getaddrinfo failed>
Attempt 3 for sequence.
Results saved to: MOTIF2/xlm\uncultured_Deinococcus_sp._ERS6626797_filtered.faa_Genome238_k127_54530_1_cdd.xml
Searching CDD for sequence: Genome238_k127_17753_1
Attempt 1 for sequence.




Results saved to: MOTIF2/xlm\uncultured_Deinococcus_sp._ERS6626797_filtered.faa_Genome238_k127_17753_1_cdd.xml
Completed processing file: uncultured_Deinococcus_sp._ERS6626797_filtered.faa
Processing file: uncultured_Deinococcus_sp._ERS6626820_filtered.faa
Searching CDD for sequence: Genome261_k127_146972_1
Attempt 1 for sequence.
Results saved to: MOTIF2/xlm\uncultured_Deinococcus_sp._ERS6626820_filtered.faa_Genome261_k127_146972_1_cdd.xml
Searching CDD for sequence: Genome261_k127_332399_1
Attempt 1 for sequence.
Results saved to: MOTIF2/xlm\uncultured_Deinococcus_sp._ERS6626820_filtered.faa_Genome261_k127_332399_1_cdd.xml
Completed processing file: uncultured_Deinococcus_sp._ERS6626820_filtered.faa
Processing file: uncultured_Deinococcus_sp._ERS6626837_filtered.faa
Searching CDD for sequence: Genome278_k127_93298_1
Attempt 1 for sequence.
Results saved to: MOTIF2/xlm\uncultured_Deinococcus_sp._ERS6626837_filtered.faa_Genome278_k127_93298_1_cdd.xml
Completed processing file: uncultur



Results saved to: MOTIF2/xlm\uncultured_Mucilaginibacter_sp._ERS6626923_filtered.faa_Genome364_k127_154890_1_cdd.xml
Completed processing file: uncultured_Mucilaginibacter_sp._ERS6626923_filtered.faa
Processing file: uncultured_Mucilaginibacter_sp._ERS6626962_filtered.faa
Searching CDD for sequence: Genome403_k127_249621_1
Attempt 1 for sequence.
Results saved to: MOTIF2/xlm\uncultured_Mucilaginibacter_sp._ERS6626962_filtered.faa_Genome403_k127_249621_1_cdd.xml
Completed processing file: uncultured_Mucilaginibacter_sp._ERS6626962_filtered.faa
Processing file: uncultured_Mucilaginibacter_sp._ERS6626972_filtered.faa
Searching CDD for sequence: Genome413_k127_255610_1
Attempt 1 for sequence.
Results saved to: MOTIF2/xlm\uncultured_Mucilaginibacter_sp._ERS6626972_filtered.faa_Genome413_k127_255610_1_cdd.xml
Completed processing file: uncultured_Mucilaginibacter_sp._ERS6626972_filtered.faa
Processing file: uncultured_Mucilaginibacter_sp._ERS6626981_filtered.faa
Searching CDD for sequence: G



Results saved to: MOTIF2/xlm\uncultured_Mucilaginibacter_sp._ERS6627290_filtered.faa_Genome514_k127_4285_1_cdd.xml
Completed processing file: uncultured_Mucilaginibacter_sp._ERS6627290_filtered.faa
Processing file: uncultured_Pantoea_sp._ERS6626643_filtered.faa
Searching CDD for sequence: Genome84_S2Ck127_371734_2
Attempt 1 for sequence.




Results saved to: MOTIF2/xlm\uncultured_Pantoea_sp._ERS6626643_filtered.faa_Genome84_S2Ck127_371734_2_cdd.xml
Completed processing file: uncultured_Pantoea_sp._ERS6626643_filtered.faa
Processing file: uncultured_Pantoea_sp._ERS6627005_filtered.faa
Searching CDD for sequence: Genome446_k127_205538_3
Attempt 1 for sequence.
Results saved to: MOTIF2/xlm\uncultured_Pantoea_sp._ERS6627005_filtered.faa_Genome446_k127_205538_3_cdd.xml
Completed processing file: uncultured_Pantoea_sp._ERS6627005_filtered.faa
Processing file: uncultured_Pelomonas_sp._ERS6626569_filtered.faa
Searching CDD for sequence: Genome10_S2Ck127_387977_3
Attempt 1 for sequence.




Results saved to: MOTIF2/xlm\uncultured_Pelomonas_sp._ERS6626569_filtered.faa_Genome10_S2Ck127_387977_3_cdd.xml
Completed processing file: uncultured_Pelomonas_sp._ERS6626569_filtered.faa
Processing file: uncultured_Pseudacidovorax_sp._ERS6626577_filtered.faa
Searching CDD for sequence: Genome18_S1Ck127_217968_1
Attempt 1 for sequence.
Results saved to: MOTIF2/xlm\uncultured_Pseudacidovorax_sp._ERS6626577_filtered.faa_Genome18_S1Ck127_217968_1_cdd.xml
Completed processing file: uncultured_Pseudacidovorax_sp._ERS6626577_filtered.faa
Processing file: uncultured_Pseudacidovorax_sp._ERS6626850_filtered.faa
Searching CDD for sequence: Genome291_k127_419550_1
Attempt 1 for sequence.
Results saved to: MOTIF2/xlm\uncultured_Pseudacidovorax_sp._ERS6626850_filtered.faa_Genome291_k127_419550_1_cdd.xml
Completed processing file: uncultured_Pseudacidovorax_sp._ERS6626850_filtered.faa
Processing file: uncultured_Pseudacidovorax_sp._ERS6626909_filtered.faa
Searching CDD for sequence: Genome350_k127_2



Results saved to: MOTIF2/xlm\uncultured_Pseudacidovorax_sp._ERS6626909_filtered.faa_Genome350_k127_213661_1_cdd.xml
Completed processing file: uncultured_Pseudacidovorax_sp._ERS6626909_filtered.faa
Processing file: uncultured_Pseudacidovorax_sp._ERS6627003_filtered.faa
Searching CDD for sequence: Genome444_k127_2949_1
Attempt 1 for sequence.




Results saved to: MOTIF2/xlm\uncultured_Pseudacidovorax_sp._ERS6627003_filtered.faa_Genome444_k127_2949_1_cdd.xml
Searching CDD for sequence: Genome444_k127_25498_1
Attempt 1 for sequence.
Results saved to: MOTIF2/xlm\uncultured_Pseudacidovorax_sp._ERS6627003_filtered.faa_Genome444_k127_25498_1_cdd.xml
Completed processing file: uncultured_Pseudacidovorax_sp._ERS6627003_filtered.faa
Processing file: uncultured_Pseudacidovorax_sp._ERS6627320_filtered.faa
Searching CDD for sequence: Genome544_k127_977136_1
Attempt 1 for sequence.
Results saved to: MOTIF2/xlm\uncultured_Pseudacidovorax_sp._ERS6627320_filtered.faa_Genome544_k127_977136_1_cdd.xml
Searching CDD for sequence: Genome544_k127_833201_1
Attempt 1 for sequence.
Results saved to: MOTIF2/xlm\uncultured_Pseudacidovorax_sp._ERS6627320_filtered.faa_Genome544_k127_833201_1_cdd.xml
Searching CDD for sequence: Genome544_k127_857444_1
Attempt 1 for sequence.




Results saved to: MOTIF2/xlm\uncultured_Pseudacidovorax_sp._ERS6627320_filtered.faa_Genome544_k127_857444_1_cdd.xml
Completed processing file: uncultured_Pseudacidovorax_sp._ERS6627320_filtered.faa
Processing file: uncultured_Pseudacidovorax_sp._ERS6627321_filtered.faa
Searching CDD for sequence: Genome545_k127_344480_1
Attempt 1 for sequence.
Results saved to: MOTIF2/xlm\uncultured_Pseudacidovorax_sp._ERS6627321_filtered.faa_Genome545_k127_344480_1_cdd.xml
Searching CDD for sequence: Genome545_k127_208535_1
Attempt 1 for sequence.
Attempt 1 failed: <urlopen error [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond>
Attempt 2 for sequence.




Results saved to: MOTIF2/xlm\uncultured_Pseudacidovorax_sp._ERS6627321_filtered.faa_Genome545_k127_208535_1_cdd.xml
Completed processing file: uncultured_Pseudacidovorax_sp._ERS6627321_filtered.faa
Processing file: uncultured_Pseudomonas_sp._ERS6626673_filtered.faa
Searching CDD for sequence: Genome114_S3Ck127_13141_2
Attempt 1 for sequence.




Results saved to: MOTIF2/xlm\uncultured_Pseudomonas_sp._ERS6626673_filtered.faa_Genome114_S3Ck127_13141_2_cdd.xml
Completed processing file: uncultured_Pseudomonas_sp._ERS6626673_filtered.faa
Processing file: uncultured_Pseudomonas_sp._ERS6627273_filtered.faa
Searching CDD for sequence: Genome497_k127_28836_2
Attempt 1 for sequence.




Results saved to: MOTIF2/xlm\uncultured_Pseudomonas_sp._ERS6627273_filtered.faa_Genome497_k127_28836_2_cdd.xml
Completed processing file: uncultured_Pseudomonas_sp._ERS6627273_filtered.faa
Processing file: uncultured_Pseudorhodoferax_sp._ERS6626941_filtered.faa
Searching CDD for sequence: Genome382_k127_146993_1
Attempt 1 for sequence.
Results saved to: MOTIF2/xlm\uncultured_Pseudorhodoferax_sp._ERS6626941_filtered.faa_Genome382_k127_146993_1_cdd.xml
Completed processing file: uncultured_Pseudorhodoferax_sp._ERS6626941_filtered.faa
Processing file: uncultured_Sphingomonas_sp._ERS6626680_filtered.faa
Searching CDD for sequence: Genome121_S1Ck127_116016_1
Attempt 1 for sequence.




Results saved to: MOTIF2/xlm\uncultured_Sphingomonas_sp._ERS6626680_filtered.faa_Genome121_S1Ck127_116016_1_cdd.xml
Completed processing file: uncultured_Sphingomonas_sp._ERS6626680_filtered.faa
Processing file: uncultured_Sphingomonas_sp._ERS6626831_filtered.faa
Searching CDD for sequence: Genome272_k127_58089_2
Attempt 1 for sequence.
Results saved to: MOTIF2/xlm\uncultured_Sphingomonas_sp._ERS6626831_filtered.faa_Genome272_k127_58089_2_cdd.xml
Completed processing file: uncultured_Sphingomonas_sp._ERS6626831_filtered.faa
Processing file: uncultured_Sphingomonas_sp._ERS6626957_filtered.faa
Searching CDD for sequence: Genome398_k127_96590_2
Attempt 1 for sequence.
Results saved to: MOTIF2/xlm\uncultured_Sphingomonas_sp._ERS6626957_filtered.faa_Genome398_k127_96590_2_cdd.xml
Completed processing file: uncultured_Sphingomonas_sp._ERS6626957_filtered.faa
Processing file: uncultured_Sphingomonas_sp._ERS6627258_filtered.faa
Searching CDD for sequence: Genome482_k127_81292_1
Attempt 1 for s



Results saved to: MOTIF2/xlm\uncultured_Sphingomonas_sp._ERS6627258_filtered.faa_Genome482_k127_81292_1_cdd.xml
Completed processing file: uncultured_Sphingomonas_sp._ERS6627258_filtered.faa
Processing file: uncultured_Spirosoma_sp._ERS6627011_filtered.faa
Searching CDD for sequence: Genome452_k127_363663_1
Attempt 1 for sequence.
Results saved to: MOTIF2/xlm\uncultured_Spirosoma_sp._ERS6627011_filtered.faa_Genome452_k127_363663_1_cdd.xml
Completed processing file: uncultured_Spirosoma_sp._ERS6627011_filtered.faa
Processing file: uncultured_Stenotrophomonas_sp._ERS6626688_filtered.faa
Searching CDD for sequence: Genome129_S3Ck127_15574_1
Attempt 1 for sequence.
Results saved to: MOTIF2/xlm\uncultured_Stenotrophomonas_sp._ERS6626688_filtered.faa_Genome129_S3Ck127_15574_1_cdd.xml
Completed processing file: uncultured_Stenotrophomonas_sp._ERS6626688_filtered.faa
Processing file: uncultured_Stenotrophomonas_sp._ERS6627319_filtered.faa
Searching CDD for sequence: Genome543_k127_120424_1
Att



Results saved to: MOTIF2/xlm\uncultured_Stenotrophomonas_sp._ERS6627319_filtered.faa_Genome543_k127_120424_1_cdd.xml
Completed processing file: uncultured_Stenotrophomonas_sp._ERS6627319_filtered.faa
Processing file: uncultured_Variovorax_sp._ERS6626779_filtered.faa
Searching CDD for sequence: Genome220_k127_244806_1
Attempt 1 for sequence.
Results saved to: MOTIF2/xlm\uncultured_Variovorax_sp._ERS6626779_filtered.faa_Genome220_k127_244806_1_cdd.xml
Completed processing file: uncultured_Variovorax_sp._ERS6626779_filtered.faa
Processing file: uncultured_Variovorax_sp._ERS6627246_filtered.faa
Searching CDD for sequence: Genome470_k127_43488_1
Attempt 1 for sequence.




Results saved to: MOTIF2/xlm\uncultured_Variovorax_sp._ERS6627246_filtered.faa_Genome470_k127_43488_1_cdd.xml
Completed processing file: uncultured_Variovorax_sp._ERS6627246_filtered.faa
Summary CSV created at: MOTIF1/cds2.csv
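
The log above shows a few transient network errors (`WinError 10053`, `getaddrinfo failed`, `WinError 10060`) that were recovered on a later attempt. A minimal sketch of a generic retry helper with exponential backoff is given below. It is not the notebook's `conserved_domain_search` implementation; the names `call_with_retries`, `max_attempts`, `base_delay` and `sleep` are hypothetical and chosen only for illustration.

```python
import time


def call_with_retries(func, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func() until it succeeds or max_attempts is reached.

    Network failures raised by urllib (URLError) are subclasses of OSError,
    so OSError is caught here. The wait doubles after every failed attempt.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except OSError as error:
            last_error = error
            print(f"Attempt {attempt} failed: {error}")
            if attempt < max_attempts:
                # Exponential backoff: base_delay, 2*base_delay, 4*base_delay, ...
                sleep(base_delay * 2 ** (attempt - 1))
    # All attempts failed: surface the last error to the caller
    raise last_error
```

Passing the `sleep` function as a parameter keeps the helper easy to exercise without real waiting; in practice the wrapped `func` would be the single network request for one sequence.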


<h2 style="font-family: 'Times New Roman'; font-weight: bold;">
    Conserved Domain Search Output Analysis
</h2>

<h3 style="font-family: 'Times New Roman'; font-weight: bold;">
    For Form 1 BLAST Results with Form 2 MOTIFs
</h3>

In [28]:
cds1_out = open_file(cds1)

In [30]:
cds1_out

Unnamed: 0,Fasta File,Sequence Header,CDD Hit ID,CDD Hit Definition,Bit Score,E-value
0,uncultured_Acidovorax_sp._ERS6627271_filtered.faa,Genome495_k127_317059_1,gnl|CDD|131469,"TIGR02416, Carbon_monoxide_dehydrogenase_large...",549.280,0.000000e+00
1,uncultured_Acidovorax_sp._ERS6627271_filtered.faa,Genome495_k127_317059_1,gnl|CDD|441138,"COG1529, CoxL, Aldehyde, CO or xanthine dehydr...",539.650,0.000000e+00
2,uncultured_Acidovorax_sp._ERS6627271_filtered.faa,Genome495_k127_317059_1,gnl|CDD|468947,"NF041018, glyceraldDH_alpha, glyceraldehyde de...",502.671,1.041350e-167
3,uncultured_Acidovorax_sp._ERS6627271_filtered.faa,Genome495_k127_317059_1,gnl|CDD|468726,"NF040766, CODH_aero_grp5, aerobic carbon-monox...",445.277,4.658080e-145
4,uncultured_Acidovorax_sp._ERS6627271_filtered.faa,Genome495_k127_317059_1,gnl|CDD|132238,"TIGR03194, 4-hydroxybenzoyl-CoA_reductase_subu...",273.478,2.331170e-80
...,...,...,...,...,...,...
398,uncultured_Variovorax_sp._ERS6627246_filtered.faa,Genome470_k127_43488_2,gnl|CDD|236637,"PRK09970, PRK09970, xanthine dehydrogenase sub...",216.468,1.198080e-60
399,uncultured_Variovorax_sp._ERS6627246_filtered.faa,Genome470_k127_43488_2,gnl|CDD|466407,"pfam20256, MoCoBD_2, Molybdopterin cofactor-bi...",184.882,1.771450e-53
400,uncultured_Variovorax_sp._ERS6627246_filtered.faa,Genome470_k127_43488_2,gnl|CDD|460671,"pfam02738, Ald_Xan_dh_C2, Molybdopterin-bindin...",179.489,6.816020e-52
401,uncultured_Variovorax_sp._ERS6627246_filtered.faa,Genome470_k127_43488_2,gnl|CDD|443669,"COG4631, XdhB, Xanthine dehydrogenase, molybdo...",187.193,2.291670e-50


In [32]:
cds1_out['Fasta File'].nunique()

33

In [34]:
best_cds1 = cds1_out.loc[cds1_out.groupby('Fasta File')['E-value'].idxmin()]

In [36]:
best_cds1

Unnamed: 0,Fasta File,Sequence Header,CDD Hit ID,CDD Hit Definition,Bit Score,E-value
0,uncultured_Acidovorax_sp._ERS6627271_filtered.faa,Genome495_k127_317059_1,gnl|CDD|131469,"TIGR02416, Carbon_monoxide_dehydrogenase_large...",549.28,0.0
10,uncultured_Acinetobacter_sp._ERS6626821_filter...,Genome262_k127_235961_1,gnl|CDD|443669,"COG4631, XdhB, Xanthine dehydrogenase, molybdo...",737.643,0.0
20,uncultured_Acinetobacter_sp._ERS6626858_filter...,Genome299_k127_417201_1,gnl|CDD|443669,"COG4631, XdhB, Xanthine dehydrogenase, molybdo...",468.003,4.77691e-160
40,uncultured_Aureimonas_sp._ERS6626915_filtered.faa,Genome356_k127_498042_1,gnl|CDD|441138,"COG1529, CoxL, Aldehyde, CO or xanthine dehydr...",362.073,2.64864e-117
50,uncultured_Aureimonas_sp._ERS6627306_filtered.faa,Genome530_k127_16854_1,gnl|CDD|441138,"COG1529, CoxL, Aldehyde, CO or xanthine dehydr...",573.548,0.0
60,uncultured_Comamonas_sp._ERS6626630_filtered.faa,Genome71_S1Ck127_18694_2,gnl|CDD|131469,"TIGR02416, Carbon_monoxide_dehydrogenase_large...",475.707,2.05612e-158
70,uncultured_Comamonas_sp._ERS6627283_filtered.faa,Genome507_k127_91973_1,gnl|CDD|131469,"TIGR02416, Carbon_monoxide_dehydrogenase_large...",540.806,0.0
90,uncultured_Deinococcus_sp._ERS6626797_filtered...,Genome238_k127_17753_1,gnl|CDD|131469,"TIGR02416, Carbon_monoxide_dehydrogenase_large...",573.548,0.0
100,uncultured_Deinococcus_sp._ERS6626820_filtered...,Genome261_k127_146972_1,gnl|CDD|131469,"TIGR02416, Carbon_monoxide_dehydrogenase_large...",573.548,0.0
120,uncultured_Deinococcus_sp._ERS6626837_filtered...,Genome278_k127_93298_2,gnl|CDD|131469,"TIGR02416, Carbon_monoxide_dehydrogenase_large...",564.688,0.0
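
`idxmin` returns the first row among equal minima, so when several hits in one file share an E-value of 0.0, the selected row depends on row order. A minimal sketch of a deterministic tie-break is given below, assuming the column names shown above and using the bit score as a secondary key. The helper name `best_hit_per_file` and the toy values are hypothetical and not part of the notebook.

```python
import pandas as pd


def best_hit_per_file(hits: pd.DataFrame) -> pd.DataFrame:
    """Return one row per 'Fasta File': lowest E-value, ties broken by highest bit score."""
    ranked = hits.sort_values(["E-value", "Bit Score"], ascending=[True, False])
    # After sorting, the first row of each file is its best hit
    return ranked.drop_duplicates(subset="Fasta File", keep="first").sort_index()


# Toy data (illustrative values only, not taken from the CDD results)
toy_hits = pd.DataFrame({
    "Fasta File": ["a.faa", "a.faa", "b.faa", "b.faa"],
    "Bit Score": [500.0, 700.0, 200.0, 300.0],
    "E-value": [0.0, 0.0, 1e-50, 1e-90],
})
print(best_hit_per_file(toy_hits))
```

For `a.faa`, both hits have an E-value of 0.0; plain `idxmin` would keep the first one (bit score 500), whereas this sketch keeps the hit with bit score 700.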


In [38]:
best_cds1['CDD Hit Definition'].unique()

array(['TIGR02416, Carbon_monoxide_dehydrogenase_large_chain, carbon-monoxide dehydrogenase, large subunit.  This model represents the large subunits of group of carbon-monoxide dehydrogenases that include molybdenum as part of the enzymatic cofactor. There are various forms of carbon-monoxide dehydrogenase; Salicibacter pomeroyi DSS-3, for example, has two forms. Note that, at least in some species, the active site Cys is modified with a selenium attached to (rather than replacing) the sulfur atom. This is termed selanylcysteine, and created post-translationally, in contrast to selenocysteine incorporation during translation as for many other selenoproteins. [Energy metabolism, Other].',
       'COG4631, XdhB, Xanthine dehydrogenase, molybdopterin-binding subunit XdhB [Nucleotide transport and metabolism].  ',
       'COG1529, CoxL, Aldehyde, CO or xanthine dehydrogenase, Mo-binding subunit [Energy production and conversion].  Aldehyde, CO or xanthine dehydrogenase, Mo-binding subunit

In [40]:
condition_1 = 'TIGR02416, Carbon_monoxide_dehydrogenase_large_chain, carbon-monoxide dehydrogenase, large subunit.  This model represents the large subunits of group of carbon-monoxide dehydrogenases that include molybdenum as part of the enzymatic cofactor. There are various forms of carbon-monoxide dehydrogenase; Salicibacter pomeroyi DSS-3, for example, has two forms. Note that, at least in some species, the active site Cys is modified with a selenium attached to (rather than replacing) the sulfur atom. This is termed selanylcysteine, and created post-translationally, in contrast to selenocysteine incorporation during translation as for many other selenoproteins. [Energy metabolism, Other].'
condition_2 = 'COG1529, CoxL, Aldehyde, CO or xanthine dehydrogenase, Mo-binding subunit [Energy production and conversion].  Aldehyde, CO or xanthine dehydrogenase, Mo-binding subunit is part of the Pathway/BioSystem: Non-phosphorylated Entner-Doudoroff pathway.'

In [42]:
# Keep only the best hits whose definition matches one of the two coxL-related CDD models
best_cds1 = best_cds1[best_cds1['CDD Hit Definition'].isin([condition_1, condition_2])].reset_index(drop=True)
best_cds1

Unnamed: 0,Fasta File,Sequence Header,CDD Hit ID,CDD Hit Definition,Bit Score,E-value
0,uncultured_Acidovorax_sp._ERS6627271_filtered.faa,Genome495_k127_317059_1,gnl|CDD|131469,"TIGR02416, Carbon_monoxide_dehydrogenase_large...",549.28,0.0
1,uncultured_Aureimonas_sp._ERS6626915_filtered.faa,Genome356_k127_498042_1,gnl|CDD|441138,"COG1529, CoxL, Aldehyde, CO or xanthine dehydr...",362.073,2.64864e-117
2,uncultured_Aureimonas_sp._ERS6627306_filtered.faa,Genome530_k127_16854_1,gnl|CDD|441138,"COG1529, CoxL, Aldehyde, CO or xanthine dehydr...",573.548,0.0
3,uncultured_Comamonas_sp._ERS6626630_filtered.faa,Genome71_S1Ck127_18694_2,gnl|CDD|131469,"TIGR02416, Carbon_monoxide_dehydrogenase_large...",475.707,2.05612e-158
4,uncultured_Comamonas_sp._ERS6627283_filtered.faa,Genome507_k127_91973_1,gnl|CDD|131469,"TIGR02416, Carbon_monoxide_dehydrogenase_large...",540.806,0.0
5,uncultured_Deinococcus_sp._ERS6626797_filtered...,Genome238_k127_17753_1,gnl|CDD|131469,"TIGR02416, Carbon_monoxide_dehydrogenase_large...",573.548,0.0
6,uncultured_Deinococcus_sp._ERS6626820_filtered...,Genome261_k127_146972_1,gnl|CDD|131469,"TIGR02416, Carbon_monoxide_dehydrogenase_large...",573.548,0.0
7,uncultured_Deinococcus_sp._ERS6626837_filtered...,Genome278_k127_93298_2,gnl|CDD|131469,"TIGR02416, Carbon_monoxide_dehydrogenase_large...",564.688,0.0
8,uncultured_Deinococcus_sp._ERS6627293_filtered...,Genome517_k127_348199_1,gnl|CDD|131469,"TIGR02416, Carbon_monoxide_dehydrogenase_large...",595.89,0.0
9,uncultured_Deinococcus_sp._ERS6627335_filtered...,Genome559_k127_243715_1,gnl|CDD|441138,"COG1529, CoxL, Aldehyde, CO or xanthine dehydr...",434.491,5.20927e-142
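
Matching on the full definition text requires the strings to be identical character for character, including the double spaces. A minimal sketch of an alternative is given below: it matches on the accession that precedes the first comma of `CDD Hit Definition` (`TIGR02416` or `COG1529`, as seen in the output above). The helper name `keep_coxl_hits` and the constant `COXL_ACCESSIONS` are hypothetical, and the approach assumes every definition starts with its accession.

```python
import pandas as pd

# Accessions of the two models used as conditions above
COXL_ACCESSIONS = ("TIGR02416", "COG1529")


def keep_coxl_hits(best_hits: pd.DataFrame, accessions=COXL_ACCESSIONS) -> pd.DataFrame:
    """Keep rows whose 'CDD Hit Definition' starts with one of the given accessions."""
    # Take the text before the first comma, e.g. 'TIGR02416'
    accession = best_hits["CDD Hit Definition"].str.split(",", n=1).str[0].str.strip()
    return best_hits[accession.isin(accessions)].reset_index(drop=True)


# Toy data with shortened definitions (illustrative only)
toy_best = pd.DataFrame({
    "Fasta File": ["a.faa", "b.faa", "c.faa"],
    "CDD Hit Definition": [
        "TIGR02416, Carbon_monoxide_dehydrogenase_large_chain, ...",
        "COG4631, XdhB, Xanthine dehydrogenase, ...",
        "COG1529, CoxL, Aldehyde, CO or xanthine dehydrogenase, ...",
    ],
})
print(keep_coxl_hits(toy_best))
```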


In [44]:
best_cds1['Fasta File'].unique()

array(['uncultured_Acidovorax_sp._ERS6627271_filtered.faa',
       'uncultured_Aureimonas_sp._ERS6626915_filtered.faa',
       'uncultured_Aureimonas_sp._ERS6627306_filtered.faa',
       'uncultured_Comamonas_sp._ERS6626630_filtered.faa',
       'uncultured_Comamonas_sp._ERS6627283_filtered.faa',
       'uncultured_Deinococcus_sp._ERS6626797_filtered.faa',
       'uncultured_Deinococcus_sp._ERS6626820_filtered.faa',
       'uncultured_Deinococcus_sp._ERS6626837_filtered.faa',
       'uncultured_Deinococcus_sp._ERS6627293_filtered.faa',
       'uncultured_Deinococcus_sp._ERS6627335_filtered.faa',
       'uncultured_Methylobacterium_sp._ERS6627288_filtered.faa',
       'uncultured_Microbacterium_sp._ERS6626906_filtered.faa',
       'uncultured_Microbacterium_sp._ERS6626964_filtered.faa',
       'uncultured_Mucilaginibacter_sp._ERS6626923_filtered.faa',
       'uncultured_Mucilaginibacter_sp._ERS6626962_filtered.faa',
       'uncultured_Mucilaginibacter_sp._ERS6627275_filtered.faa',
     

<h3 style="font-family: 'Times New Roman'; font-weight: bold;">
    For Form 2 BLAST Results with Form 2 MOTIFs
</h3>

In [47]:
cds2_out = open_file(cds2)
cds2_out

Unnamed: 0,Fasta File,Sequence Header,CDD Hit ID,CDD Hit Definition,Bit Score,E-value
0,uncultured_Acidovorax_sp._ERS6627271_filtered.faa,Genome495_k127_317059_1,gnl|CDD|131469,"TIGR02416, Carbon_monoxide_dehydrogenase_large...",548.895,0.000000e+00
1,uncultured_Acidovorax_sp._ERS6627271_filtered.faa,Genome495_k127_317059_1,gnl|CDD|441138,"COG1529, CoxL, Aldehyde, CO or xanthine dehydr...",540.036,0.000000e+00
2,uncultured_Acidovorax_sp._ERS6627271_filtered.faa,Genome495_k127_317059_1,gnl|CDD|468947,"NF041018, glyceraldDH_alpha, glyceraldehyde de...",499.590,1.025440e-166
3,uncultured_Acidovorax_sp._ERS6627271_filtered.faa,Genome495_k127_317059_1,gnl|CDD|468726,"NF040766, CODH_aero_grp5, aerobic carbon-monox...",445.277,4.983950e-145
4,uncultured_Acidovorax_sp._ERS6627271_filtered.faa,Genome495_k127_317059_1,gnl|CDD|132238,"TIGR03194, 4-hydroxybenzoyl-CoA_reductase_subu...",273.478,1.968160e-80
...,...,...,...,...,...,...
412,uncultured_Variovorax_sp._ERS6627246_filtered.faa,Genome470_k127_43488_1,gnl|CDD|132238,"TIGR03194, 4-hydroxybenzoyl-CoA_reductase_subu...",258.070,6.223070e-75
413,uncultured_Variovorax_sp._ERS6627246_filtered.faa,Genome470_k127_43488_1,gnl|CDD|132356,"TIGR03313, Se_sel_red_Mo, probable selenate re...",235.343,1.521620e-65
414,uncultured_Variovorax_sp._ERS6627246_filtered.faa,Genome470_k127_43488_1,gnl|CDD|443669,"COG4631, XdhB, Xanthine dehydrogenase, molybdo...",214.927,2.707380e-59
415,uncultured_Variovorax_sp._ERS6627246_filtered.faa,Genome470_k127_43488_1,gnl|CDD|274367,"TIGR02965, xanthine_dehydrogenase, xanthine de...",201.830,9.922190e-55


In [49]:
cds2_out['Fasta File'].nunique()

34

In [51]:
best_cds2 = cds2_out.loc[cds2_out.groupby('Fasta File')['E-value'].idxmin()]
best_cds2

Unnamed: 0,Fasta File,Sequence Header,CDD Hit ID,CDD Hit Definition,Bit Score,E-value
0,uncultured_Acidovorax_sp._ERS6627271_filtered.faa,Genome495_k127_317059_1,gnl|CDD|131469,"TIGR02416, Carbon_monoxide_dehydrogenase_large...",548.895,0.0
10,uncultured_Acinetobacter_sp._ERS6626821_filter...,Genome262_k127_235961_1,gnl|CDD|443669,"COG4631, XdhB, Xanthine dehydrogenase, molybdo...",752.666,0.0
20,uncultured_Acinetobacter_sp._ERS6626858_filter...,Genome299_k127_417201_1,gnl|CDD|443669,"COG4631, XdhB, Xanthine dehydrogenase, molybdo...",533.872,0.0
40,uncultured_Aureimonas_sp._ERS6626915_filtered.faa,Genome356_k127_498042_1,gnl|CDD|441138,"COG1529, CoxL, Aldehyde, CO or xanthine dehydr...",362.073,2.55748e-117
50,uncultured_Aureimonas_sp._ERS6627306_filtered.faa,Genome530_k127_16854_1,gnl|CDD|441138,"COG1529, CoxL, Aldehyde, CO or xanthine dehydr...",573.548,0.0
60,uncultured_Comamonas_sp._ERS6626630_filtered.faa,Genome71_S1Ck127_18694_1,gnl|CDD|131469,"TIGR02416, Carbon_monoxide_dehydrogenase_large...",540.036,0.0
70,uncultured_Comamonas_sp._ERS6627283_filtered.faa,Genome507_k127_91973_1,gnl|CDD|131469,"TIGR02416, Carbon_monoxide_dehydrogenase_large...",540.806,0.0
90,uncultured_Deinococcus_sp._ERS6626797_filtered...,Genome238_k127_17753_1,gnl|CDD|131469,"TIGR02416, Carbon_monoxide_dehydrogenase_large...",596.66,0.0
100,uncultured_Deinococcus_sp._ERS6626820_filtered...,Genome261_k127_146972_1,gnl|CDD|131469,"TIGR02416, Carbon_monoxide_dehydrogenase_large...",597.045,0.0
120,uncultured_Deinococcus_sp._ERS6626837_filtered...,Genome278_k127_93298_1,gnl|CDD|131469,"TIGR02416, Carbon_monoxide_dehydrogenase_large...",597.045,0.0


In [53]:
best_cds2['CDD Hit Definition'].unique()

array(['TIGR02416, Carbon_monoxide_dehydrogenase_large_chain, carbon-monoxide dehydrogenase, large subunit.  This model represents the large subunits of group of carbon-monoxide dehydrogenases that include molybdenum as part of the enzymatic cofactor. There are various forms of carbon-monoxide dehydrogenase; Salicibacter pomeroyi DSS-3, for example, has two forms. Note that, at least in some species, the active site Cys is modified with a selenium attached to (rather than replacing) the sulfur atom. This is termed selanylcysteine, and created post-translationally, in contrast to selenocysteine incorporation during translation as for many other selenoproteins. [Energy metabolism, Other].',
       'COG4631, XdhB, Xanthine dehydrogenase, molybdopterin-binding subunit XdhB [Nucleotide transport and metabolism].  ',
       'COG1529, CoxL, Aldehyde, CO or xanthine dehydrogenase, Mo-binding subunit [Energy production and conversion].  Aldehyde, CO or xanthine dehydrogenase, Mo-binding subunit

In [55]:
# Keep only the best hits whose definition matches one of the two coxL-related CDD models
best_cds2 = best_cds2[best_cds2['CDD Hit Definition'].isin([condition_1, condition_2])].reset_index(drop=True)
best_cds2

Unnamed: 0,Fasta File,Sequence Header,CDD Hit ID,CDD Hit Definition,Bit Score,E-value
0,uncultured_Acidovorax_sp._ERS6627271_filtered.faa,Genome495_k127_317059_1,gnl|CDD|131469,"TIGR02416, Carbon_monoxide_dehydrogenase_large...",548.895,0.0
1,uncultured_Aureimonas_sp._ERS6626915_filtered.faa,Genome356_k127_498042_1,gnl|CDD|441138,"COG1529, CoxL, Aldehyde, CO or xanthine dehydr...",362.073,2.55748e-117
2,uncultured_Aureimonas_sp._ERS6627306_filtered.faa,Genome530_k127_16854_1,gnl|CDD|441138,"COG1529, CoxL, Aldehyde, CO or xanthine dehydr...",573.548,0.0
3,uncultured_Comamonas_sp._ERS6626630_filtered.faa,Genome71_S1Ck127_18694_1,gnl|CDD|131469,"TIGR02416, Carbon_monoxide_dehydrogenase_large...",540.036,0.0
4,uncultured_Comamonas_sp._ERS6627283_filtered.faa,Genome507_k127_91973_1,gnl|CDD|131469,"TIGR02416, Carbon_monoxide_dehydrogenase_large...",540.806,0.0
5,uncultured_Deinococcus_sp._ERS6626797_filtered...,Genome238_k127_17753_1,gnl|CDD|131469,"TIGR02416, Carbon_monoxide_dehydrogenase_large...",596.66,0.0
6,uncultured_Deinococcus_sp._ERS6626820_filtered...,Genome261_k127_146972_1,gnl|CDD|131469,"TIGR02416, Carbon_monoxide_dehydrogenase_large...",597.045,0.0
7,uncultured_Deinococcus_sp._ERS6626837_filtered...,Genome278_k127_93298_1,gnl|CDD|131469,"TIGR02416, Carbon_monoxide_dehydrogenase_large...",597.045,0.0
8,uncultured_Deinococcus_sp._ERS6627293_filtered...,Genome517_k127_348199_1,gnl|CDD|131469,"TIGR02416, Carbon_monoxide_dehydrogenase_large...",596.66,0.0
9,uncultured_Deinococcus_sp._ERS6627335_filtered...,Genome559_k127_243715_1,gnl|CDD|441138,"COG1529, CoxL, Aldehyde, CO or xanthine dehydr...",398.667,1.23967e-128


In [57]:
best_cds2['Fasta File'].unique()

array(['uncultured_Acidovorax_sp._ERS6627271_filtered.faa',
       'uncultured_Aureimonas_sp._ERS6626915_filtered.faa',
       'uncultured_Aureimonas_sp._ERS6627306_filtered.faa',
       'uncultured_Comamonas_sp._ERS6626630_filtered.faa',
       'uncultured_Comamonas_sp._ERS6627283_filtered.faa',
       'uncultured_Deinococcus_sp._ERS6626797_filtered.faa',
       'uncultured_Deinococcus_sp._ERS6626820_filtered.faa',
       'uncultured_Deinococcus_sp._ERS6626837_filtered.faa',
       'uncultured_Deinococcus_sp._ERS6627293_filtered.faa',
       'uncultured_Deinococcus_sp._ERS6627335_filtered.faa',
       'uncultured_Methylobacterium_sp._ERS6627288_filtered.faa',
       'uncultured_Microbacterium_sp._ERS6626906_filtered.faa',
       'uncultured_Microbacterium_sp._ERS6626964_filtered.faa',
       'uncultured_Mucilaginibacter_sp._ERS6626923_filtered.faa',
       'uncultured_Mucilaginibacter_sp._ERS6626962_filtered.faa',
       'uncultured_Mucilaginibacter_sp._ERS6626972_filtered.faa',
     

<h3 style="font-family: 'Times New Roman'; font-weight: bold;">
    Find MAGs Unique to Each Output
</h3>

In [60]:
unique_in_best_cds1 = np.setdiff1d(best_cds1['Fasta File'].unique(), best_cds2['Fasta File'].unique())
unique_in_best_cds2 = np.setdiff1d(best_cds2['Fasta File'].unique(), best_cds1['Fasta File'].unique())

print("Unique in best_cds1:", unique_in_best_cds1)
print("Unique in best_cds2:", unique_in_best_cds2)

Unique in best_cds1: []
Unique in best_cds2: ['uncultured_Mucilaginibacter_sp._ERS6626972_filtered.faa'
 'uncultured_Pseudacidovorax_sp._ERS6626909_filtered.faa']
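
The two `np.setdiff1d` calls above can also be expressed as a single table. A minimal sketch is given below, assuming the `Fasta File` column used above and an outer merge with pandas' `indicator` option; the helper name `compare_file_sets` and the toy file names are hypothetical.

```python
import pandas as pd


def compare_file_sets(left: pd.DataFrame, right: pd.DataFrame,
                      column: str = "Fasta File") -> pd.DataFrame:
    """Outer-merge the unique file names of both tables.

    The '_merge' column reports 'left_only', 'right_only' or 'both' for each file.
    """
    left_names = left[[column]].drop_duplicates()
    right_names = right[[column]].drop_duplicates()
    return left_names.merge(right_names, on=column, how="outer", indicator=True)


# Toy data (illustrative file names only)
toy_left = pd.DataFrame({"Fasta File": ["a.faa", "b.faa"]})
toy_right = pd.DataFrame({"Fasta File": ["a.faa", "b.faa", "c.faa"]})
comparison = compare_file_sets(toy_left, toy_right)
print(comparison)
```

Filtering `comparison` on `_merge == 'right_only'` gives the files found only in the second table, which corresponds to `unique_in_best_cds2` above.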


<h3 style="font-family: 'Times New Roman'; font-weight: bold;">
    Cross Check
</h3>

In [64]:
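# Derive the source FASTA name; regex=False makes the replacement literal so '.' is not treated as a wildcard
best_cds2['source_file'] = best_cds2['Fasta File'].str.replace('_filtered.faa', '.fasta', regex=False)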
best_cds2

Unnamed: 0,Fasta File,Sequence Header,CDD Hit ID,CDD Hit Definition,Bit Score,E-value,source_file
0,uncultured_Acidovorax_sp._ERS6627271_filtered.faa,Genome495_k127_317059_1,gnl|CDD|131469,"TIGR02416, Carbon_monoxide_dehydrogenase_large...",548.895,0.0,uncultured_Acidovorax_sp._ERS6627271.fasta
1,uncultured_Aureimonas_sp._ERS6626915_filtered.faa,Genome356_k127_498042_1,gnl|CDD|441138,"COG1529, CoxL, Aldehyde, CO or xanthine dehydr...",362.073,2.55748e-117,uncultured_Aureimonas_sp._ERS6626915.fasta
2,uncultured_Aureimonas_sp._ERS6627306_filtered.faa,Genome530_k127_16854_1,gnl|CDD|441138,"COG1529, CoxL, Aldehyde, CO or xanthine dehydr...",573.548,0.0,uncultured_Aureimonas_sp._ERS6627306.fasta
3,uncultured_Comamonas_sp._ERS6626630_filtered.faa,Genome71_S1Ck127_18694_1,gnl|CDD|131469,"TIGR02416, Carbon_monoxide_dehydrogenase_large...",540.036,0.0,uncultured_Comamonas_sp._ERS6626630.fasta
4,uncultured_Comamonas_sp._ERS6627283_filtered.faa,Genome507_k127_91973_1,gnl|CDD|131469,"TIGR02416, Carbon_monoxide_dehydrogenase_large...",540.806,0.0,uncultured_Comamonas_sp._ERS6627283.fasta
5,uncultured_Deinococcus_sp._ERS6626797_filtered...,Genome238_k127_17753_1,gnl|CDD|131469,"TIGR02416, Carbon_monoxide_dehydrogenase_large...",596.66,0.0,uncultured_Deinococcus_sp._ERS6626797.fasta
6,uncultured_Deinococcus_sp._ERS6626820_filtered...,Genome261_k127_146972_1,gnl|CDD|131469,"TIGR02416, Carbon_monoxide_dehydrogenase_large...",597.045,0.0,uncultured_Deinococcus_sp._ERS6626820.fasta
7,uncultured_Deinococcus_sp._ERS6626837_filtered...,Genome278_k127_93298_1,gnl|CDD|131469,"TIGR02416, Carbon_monoxide_dehydrogenase_large...",597.045,0.0,uncultured_Deinococcus_sp._ERS6626837.fasta
8,uncultured_Deinococcus_sp._ERS6627293_filtered...,Genome517_k127_348199_1,gnl|CDD|131469,"TIGR02416, Carbon_monoxide_dehydrogenase_large...",596.66,0.0,uncultured_Deinococcus_sp._ERS6627293.fasta
9,uncultured_Deinococcus_sp._ERS6627335_filtered...,Genome559_k127_243715_1,gnl|CDD|441138,"COG1529, CoxL, Aldehyde, CO or xanthine dehydr...",398.667,1.23967e-128,uncultured_Deinococcus_sp._ERS6627335.fasta


In [66]:
# Add a 'qseqid' column by dropping the last two characters (the trailing "_1") from each sequence header
best_cds2['qseqid'] = best_cds2['Sequence Header'].apply(lambda x: x[:-2])
best_cds2

Unnamed: 0,Fasta File,Sequence Header,CDD Hit ID,CDD Hit Definition,Bit Score,E-value,source_file,qseqid
0,uncultured_Acidovorax_sp._ERS6627271_filtered.faa,Genome495_k127_317059_1,gnl|CDD|131469,"TIGR02416, Carbon_monoxide_dehydrogenase_large...",548.895,0.0,uncultured_Acidovorax_sp._ERS6627271.fasta,Genome495_k127_317059
1,uncultured_Aureimonas_sp._ERS6626915_filtered.faa,Genome356_k127_498042_1,gnl|CDD|441138,"COG1529, CoxL, Aldehyde, CO or xanthine dehydr...",362.073,2.55748e-117,uncultured_Aureimonas_sp._ERS6626915.fasta,Genome356_k127_498042
2,uncultured_Aureimonas_sp._ERS6627306_filtered.faa,Genome530_k127_16854_1,gnl|CDD|441138,"COG1529, CoxL, Aldehyde, CO or xanthine dehydr...",573.548,0.0,uncultured_Aureimonas_sp._ERS6627306.fasta,Genome530_k127_16854
3,uncultured_Comamonas_sp._ERS6626630_filtered.faa,Genome71_S1Ck127_18694_1,gnl|CDD|131469,"TIGR02416, Carbon_monoxide_dehydrogenase_large...",540.036,0.0,uncultured_Comamonas_sp._ERS6626630.fasta,Genome71_S1Ck127_18694
4,uncultured_Comamonas_sp._ERS6627283_filtered.faa,Genome507_k127_91973_1,gnl|CDD|131469,"TIGR02416, Carbon_monoxide_dehydrogenase_large...",540.806,0.0,uncultured_Comamonas_sp._ERS6627283.fasta,Genome507_k127_91973
5,uncultured_Deinococcus_sp._ERS6626797_filtered...,Genome238_k127_17753_1,gnl|CDD|131469,"TIGR02416, Carbon_monoxide_dehydrogenase_large...",596.66,0.0,uncultured_Deinococcus_sp._ERS6626797.fasta,Genome238_k127_17753
6,uncultured_Deinococcus_sp._ERS6626820_filtered...,Genome261_k127_146972_1,gnl|CDD|131469,"TIGR02416, Carbon_monoxide_dehydrogenase_large...",597.045,0.0,uncultured_Deinococcus_sp._ERS6626820.fasta,Genome261_k127_146972
7,uncultured_Deinococcus_sp._ERS6626837_filtered...,Genome278_k127_93298_1,gnl|CDD|131469,"TIGR02416, Carbon_monoxide_dehydrogenase_large...",597.045,0.0,uncultured_Deinococcus_sp._ERS6626837.fasta,Genome278_k127_93298
8,uncultured_Deinococcus_sp._ERS6627293_filtered...,Genome517_k127_348199_1,gnl|CDD|131469,"TIGR02416, Carbon_monoxide_dehydrogenase_large...",596.66,0.0,uncultured_Deinococcus_sp._ERS6627293.fasta,Genome517_k127_348199
9,uncultured_Deinococcus_sp._ERS6627335_filtered...,Genome559_k127_243715_1,gnl|CDD|441138,"COG1529, CoxL, Aldehyde, CO or xanthine dehydr...",398.667,1.23967e-128,uncultured_Deinococcus_sp._ERS6627335.fasta,Genome559_k127_243715
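
<p style="font-family: 'Times New Roman'; font-size: 15px; text-align: justify; width: 100%;">
    The fixed slice <code>x[:-2]</code> used above gives the expected IDs because every header shown ends in a single-digit suffix ("_1"). The following is a minimal sketch, not the method used in this analysis, of a suffix-aware alternative; the helper name <code>strip_orf_suffix</code> and the header "Genome12_k127_555_12" are hypothetical and only illustrate a multi-digit suffix.
</p>

```python
def strip_orf_suffix(header):
    """Drop the final '_<number>' token from a header, if there is one."""
    base, sep, suffix = header.rpartition("_")
    # Strip only when the last token is purely numeric; otherwise keep the header unchanged.
    return base if sep and suffix.isdigit() else header


# Two headers taken from the table above, plus one hypothetical multi-digit case.
headers = [
    "Genome495_k127_317059_1",
    "Genome71_S1Ck127_18694_1",
    "Genome12_k127_555_12",  # hypothetical: x[:-2] would leave a trailing "_" here
]

for header in headers:
    print(header, "->", strip_orf_suffix(header))
```

<p style="font-family: 'Times New Roman'; font-size: 15px; text-align: justify; width: 100%;">
    Under these assumptions, the helper could be applied as <code>best_cds2['Sequence Header'].apply(strip_orf_suffix)</code>.
</p>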


In [68]:
# Create the destination folder if it doesn't exist
os.makedirs(coxl2_genes, exist_ok=True)

# Get the list of files in the source folder
files_in_source_folder = os.listdir(motif_2_fasta_path)

# Iterate over each file in the source folder
for file in files_in_source_folder:
    if file in best_cds2['source_file'].values:
        # Move the file if it is in the DataFrame
        shutil.move(os.path.join(motif_2_fasta_path, file), os.path.join(coxl2_genes, file))

print("Files have been filtered and transferred successfully.")

Files have been filtered and transferred successfully.


In [70]:
parent_directory = os.path.dirname(coxl2_genes)
duplicates_folder = os.path.join(parent_directory, 'gene_duplicates')

# Create the duplicates folder if it doesn't exist
os.makedirs(duplicates_folder, exist_ok=True)

# Collect, for each source FASTA file, the sequence IDs listed in best_cds2
sequences_to_extract = {}
for index, row in best_cds2.iterrows():
    file_name = row["source_file"].strip()
    sequence_id = row["qseqid"].strip()
    
    if file_name not in sequences_to_extract:
        sequences_to_extract[file_name] = set()
    
    sequences_to_extract[file_name].add(sequence_id)  # Store ID

# Process each FASTA file
for file_name, sequence_ids in sequences_to_extract.items():
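    input_file_path = os.path.join(coxl2_genes, file_name)
    # NOTE: when file_name already ends in ".fasta", this path equals input_file_path,
    # so the input file is overwritten with only the extracted records.
    output_file_path = os.path.join(coxl2_genes, f"{os.path.splitext(file_name)[0]}.fasta")
    duplicates_file_path = os.path.join(duplicates_folder, f"{os.path.splitext(file_name)[0]}_duplicates.txt")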

    if not os.path.exists(input_file_path):
        print(f"File not found: {input_file_path}")
        continue

    extracted_records = []
    seen_sequences = set()
    duplicate_headers = []

    for record in SeqIO.parse(input_file_path, "fasta"):
        if record.id in sequence_ids:
            if str(record.seq) in seen_sequences:
                duplicate_headers.append(record.id)
            else:
                seen_sequences.add(str(record.seq))
                extracted_records.append(record)

    if extracted_records:
        with open(output_file_path, "w") as output_f:
            SeqIO.write(extracted_records, output_f, "fasta")
        print(f"Extracted sequences saved to: {output_file_path}")
    
    if duplicate_headers:
        with open(duplicates_file_path, "w") as duplicates_f:
            duplicates_f.write(f"Duplicate headers in {file_name}:\n")
            for header in duplicate_headers:
                duplicates_f.write(f"{header}\n")
        print(f"Duplicate headers saved to: {duplicates_file_path}")

print("Sequence extraction completed.")

Extracted sequences saved to: Coxl2_Results/Genes\uncultured_Acidovorax_sp._ERS6627271.fasta
Extracted sequences saved to: Coxl2_Results/Genes\uncultured_Aureimonas_sp._ERS6626915.fasta
Extracted sequences saved to: Coxl2_Results/Genes\uncultured_Aureimonas_sp._ERS6627306.fasta
Extracted sequences saved to: Coxl2_Results/Genes\uncultured_Comamonas_sp._ERS6626630.fasta
Extracted sequences saved to: Coxl2_Results/Genes\uncultured_Comamonas_sp._ERS6627283.fasta
Extracted sequences saved to: Coxl2_Results/Genes\uncultured_Deinococcus_sp._ERS6626797.fasta
Extracted sequences saved to: Coxl2_Results/Genes\uncultured_Deinococcus_sp._ERS6626820.fasta
Extracted sequences saved to: Coxl2_Results/Genes\uncultured_Deinococcus_sp._ERS6626837.fasta
Extracted sequences saved to: Coxl2_Results/Genes\uncultured_Deinococcus_sp._ERS6627293.fasta
Extracted sequences saved to: Coxl2_Results/Genes\uncultured_Deinococcus_sp._ERS6627335.fasta
Extracted sequences saved to: Coxl2_Results/Genes\uncultured_Methyl

In [71]:
# Get the count of sequences in each .fasta file and the total number of sequences
sequence_counts, total_sequences = count_sequences_in_fasta_files(coxl2_genes)

# Print the results
for file_name, count in sequence_counts.items():
    print(f"{file_name}: {count} sequences")

print(f"Total sequences from all files: {total_sequences}")

uncultured_Acidovorax_sp._ERS6627271.fasta: 2 sequences
uncultured_Aureimonas_sp._ERS6626915.fasta: 2 sequences
uncultured_Aureimonas_sp._ERS6627306.fasta: 2 sequences
uncultured_Comamonas_sp._ERS6626630.fasta: 2 sequences
uncultured_Comamonas_sp._ERS6627283.fasta: 2 sequences
uncultured_Deinococcus_sp._ERS6626797.fasta: 1 sequences
uncultured_Deinococcus_sp._ERS6626820.fasta: 1 sequences
uncultured_Deinococcus_sp._ERS6626837.fasta: 1 sequences
uncultured_Deinococcus_sp._ERS6627293.fasta: 1 sequences
uncultured_Deinococcus_sp._ERS6627335.fasta: 2 sequences
uncultured_Methylobacterium_sp._ERS6627288.fasta: 2 sequences
uncultured_Microbacterium_sp._ERS6626906.fasta: 2 sequences
uncultured_Microbacterium_sp._ERS6626964.fasta: 2 sequences
uncultured_Mucilaginibacter_sp._ERS6626923.fasta: 2 sequences
uncultured_Mucilaginibacter_sp._ERS6626962.fasta: 2 sequences
uncultured_Mucilaginibacter_sp._ERS6626972.fasta: 2 sequences
uncultured_Mucilaginibacter_sp._ERS6627275.fasta: 2 sequences
uncultu

In [72]:
# List to store dataframes
dfs = []

# Iterate over each file in the directory
for filename in os.listdir(cross_ver):
    if filename.endswith('.csv'):
        # Read the CSV file
        df = pd.read_csv(os.path.join(cross_ver, filename))
        
        # Add 'MAG' and 'query_index' columns
        if filename.endswith('_1.csv'):
            mag_value = filename[:-6]  
            query_index_value = 'Query 2'
        else:
            mag_value = filename[:-4]  
            query_index_value = 'Query 1'
        
        df['MAG'] = mag_value
        df['query_index'] = query_index_value
        
        # Append the dataframe to the list
        dfs.append(df)

# Concatenate all dataframes
result_df = pd.concat(dfs, ignore_index=True)

# Save the concatenated dataframe to a new CSV file
result_df.to_csv('open_blast.csv', index=False)

print("All files have been concatenated into 'open_blast.csv'.")


All files have been concatenated into 'open_blast.csv'.


In [75]:
open_blast = open_file('open_blast.csv')
open_blast

Unnamed: 0,Description,Scientific Name,Max Score,Total Score,Query Cover,E value,Per. ident,Acc. Len,Accession,MAG,query_index
0,xanthine dehydrogenase family protein molybdop...,uncultured Acidovorax sp.,1548,1548,100%,0.0,100.00,795,"=HYPERLINK(""https://www.ncbi.nlm.nih.gov/prote...",uncultured_Acidovorax_sp._ERS6627271,Query 1
1,xanthine dehydrogenase family protein molybdop...,Acidovorax temperans,1545,1545,100%,0.0,99.62,795,"=HYPERLINK(""https://www.ncbi.nlm.nih.gov/prote...",uncultured_Acidovorax_sp._ERS6627271,Query 1
2,xanthine dehydrogenase family protein molybdop...,Acidovorax temperans,1545,1545,100%,0.0,99.36,795,"=HYPERLINK(""https://www.ncbi.nlm.nih.gov/prote...",uncultured_Acidovorax_sp._ERS6627271,Query 1
3,xanthine dehydrogenase family protein molybdop...,Acidovorax temperans,1545,1545,100%,0.0,99.62,795,"=HYPERLINK(""https://www.ncbi.nlm.nih.gov/prote...",uncultured_Acidovorax_sp._ERS6627271,Query 1
4,xanthine dehydrogenase family protein molybdop...,Acidovorax sp. IB03,1544,1544,100%,0.0,99.49,795,"=HYPERLINK(""https://www.ncbi.nlm.nih.gov/prote...",uncultured_Acidovorax_sp._ERS6627271,Query 1
...,...,...,...,...,...,...,...,...,...,...,...
4495,xanthine dehydrogenase family protein molybdop...,Variovorax paradoxus,1342,1342,100%,0.0,91.00,776,"=HYPERLINK(""https://www.ncbi.nlm.nih.gov/prote...",uncultured_Variovorax_sp._ERS6627246,Query 2
4496,MULTISPECIES: xanthine dehydrogenase family pr...,Variovorax,1341,1341,100%,0.0,90.85,769,"=HYPERLINK(""https://www.ncbi.nlm.nih.gov/prote...",uncultured_Variovorax_sp._ERS6627246,Query 2
4497,MULTISPECIES: xanthine dehydrogenase family pr...,Variovorax,1341,1341,100%,0.0,91.26,776,"=HYPERLINK(""https://www.ncbi.nlm.nih.gov/prote...",uncultured_Variovorax_sp._ERS6627246,Query 2
4498,xanthine dehydrogenase family protein molybdop...,Variovorax paradoxus,1341,1341,100%,0.0,91.13,776,"=HYPERLINK(""https://www.ncbi.nlm.nih.gov/prote...",uncultured_Variovorax_sp._ERS6627246,Query 2
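
<p style="font-family: 'Times New Roman'; font-size: 15px; text-align: justify; width: 100%;">
    The MAG and query_index columns in the table above are derived from the name of each downloaded .csv file: a name ending in "_1.csv" is labelled "Query 2", any other .csv is labelled "Query 1". The sketch below isolates that rule in a small function so it can be checked on its own; the function name <code>parse_blast_filename</code> is hypothetical and is not part of the notebook.
</p>

```python
import os


def parse_blast_filename(filename):
    """Return (MAG name, query label) for a downloaded hit-table file name."""
    stem, ext = os.path.splitext(filename)
    if ext.lower() != ".csv":
        raise ValueError(f"Not a .csv file: {filename}")
    # A trailing "_1" marks the second query sequence of the same MAG.
    if stem.endswith("_1"):
        return stem[:-2], "Query 2"
    return stem, "Query 1"


print(parse_blast_filename("uncultured_Acidovorax_sp._ERS6627271.csv"))
print(parse_blast_filename("uncultured_Acidovorax_sp._ERS6627271_1.csv"))
```

<p style="font-family: 'Times New Roman'; font-size: 15px; text-align: justify; width: 100%;">
    Like the slicing in the cell above, this sketch assumes that no MAG name itself ends in "_1".
</p>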


In [76]:
# Keep the best hit per MAG and query: lowest E-value first, ties broken by highest percent identity
open_blast = (
    open_blast
    .sort_values(by=['MAG', 'query_index', 'E value', 'Per. ident'], ascending=[True, True, True, False])
    .drop_duplicates(subset=['MAG', 'query_index'])
    .reset_index(drop=True)
)
open_blast

Unnamed: 0,Description,Scientific Name,Max Score,Total Score,Query Cover,E value,Per. ident,Acc. Len,Accession,MAG,query_index
0,xanthine dehydrogenase family protein molybdop...,uncultured Acidovorax sp.,1548,1548,100%,0.0,100.0,795,"=HYPERLINK(""https://www.ncbi.nlm.nih.gov/prote...",uncultured_Acidovorax_sp._ERS6627271,Query 1
1,xanthine dehydrogenase family protein molybdop...,uncultured Acidovorax sp.,1545,1545,100%,0.0,100.0,795,"=HYPERLINK(""https://www.ncbi.nlm.nih.gov/prote...",uncultured_Acidovorax_sp._ERS6627271,Query 2
2,xanthine dehydrogenase family protein molybdop...,Bosea sp. (in: a-proteobacteria),996,996,100%,0.0,96.78,769,"=HYPERLINK(""https://www.ncbi.nlm.nih.gov/prote...",uncultured_Aureimonas_sp._ERS6626915,Query 1
3,xanthine dehydrogenase family protein molybdop...,Bosea sp. (in: a-proteobacteria),998,998,100%,0.0,96.79,769,"=HYPERLINK(""https://www.ncbi.nlm.nih.gov/prote...",uncultured_Aureimonas_sp._ERS6626915,Query 2
4,xanthine dehydrogenase family protein molybdop...,uncultured Aureimonas sp.,1531,1531,100%,0.0,100.0,784,"=HYPERLINK(""https://www.ncbi.nlm.nih.gov/prote...",uncultured_Aureimonas_sp._ERS6627306,Query 1
5,xanthine dehydrogenase family protein molybdop...,uncultured Aureimonas sp.,1523,1523,100%,0.0,100.0,784,"=HYPERLINK(""https://www.ncbi.nlm.nih.gov/prote...",uncultured_Aureimonas_sp._ERS6627306,Query 2
6,xanthine dehydrogenase family protein molybdop...,Delftia acidovorans,1521,1521,100%,0.0,99.87,796,"=HYPERLINK(""https://www.ncbi.nlm.nih.gov/prote...",uncultured_Comamonas_sp._ERS6626630,Query 1
7,xanthine dehydrogenase family protein molybdop...,Delftia acidovorans,1515,1515,100%,0.0,99.87,796,"=HYPERLINK(""https://www.ncbi.nlm.nih.gov/prote...",uncultured_Comamonas_sp._ERS6626630,Query 2
8,xanthine dehydrogenase family protein molybdop...,Delftia sp. DT-2,1507,1507,100%,0.0,100.0,796,"=HYPERLINK(""https://www.ncbi.nlm.nih.gov/prote...",uncultured_Comamonas_sp._ERS6627283,Query 1
9,xanthine dehydrogenase family protein molybdop...,Delftia sp. DT-2,1502,1502,100%,0.0,100.0,796,"=HYPERLINK(""https://www.ncbi.nlm.nih.gov/prote...",uncultured_Comamonas_sp._ERS6627283,Query 2
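
<p style="font-family: 'Times New Roman'; font-size: 15px; text-align: justify; width: 100%;">
    The best-hit selection above relies on sorting before <code>drop_duplicates</code>, which by default keeps the first row of each group. The toy example below is a minimal sketch of the same pattern; the MAG names and values are made up for illustration and are not results from this analysis.
</p>

```python
import pandas as pd

# Made-up hits: two MAGs, with several candidate hits per (MAG, query) pair.
hits = pd.DataFrame({
    "MAG": ["mag_a", "mag_a", "mag_a", "mag_b", "mag_b"],
    "query_index": ["Query 1", "Query 1", "Query 2", "Query 1", "Query 1"],
    "E value": [0.0, 0.0, 1e-50, 1e-120, 1e-30],
    "Per. ident": [99.62, 100.00, 91.00, 95.77, 99.00],
})

best = (
    hits
    # Lowest E-value first; among equal E-values, highest percent identity first.
    .sort_values(
        by=["MAG", "query_index", "E value", "Per. ident"],
        ascending=[True, True, True, False],
    )
    # keep='first' is the default, so the top-ranked row of each group survives.
    .drop_duplicates(subset=["MAG", "query_index"])
    .reset_index(drop=True)
)

print(best)
```

<p style="font-family: 'Times New Roman'; font-size: 15px; text-align: justify; width: 100%;">
    In this toy table, the two "mag_a" / "Query 1" hits tie on E-value, so the 100.00 % identity row is kept; for "mag_b" the lower E-value wins even though the other hit has a higher identity.
</p>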


In [77]:
open_blast['Description'].unique()

array(['xanthine dehydrogenase family protein molybdopterin-binding subunit [uncultured Acidovorax sp.]',
       'xanthine dehydrogenase family protein molybdopterin-binding subunit [Bosea sp. (in: a-proteobacteria)]',
       'xanthine dehydrogenase family protein molybdopterin-binding subunit [uncultured Aureimonas sp.]',
       'xanthine dehydrogenase family protein molybdopterin-binding subunit [Delftia acidovorans]',
       'xanthine dehydrogenase family protein molybdopterin-binding subunit [Delftia sp. DT-2]',
       'xanthine dehydrogenase family protein molybdopterin-binding subunit [Deinococcus gobiensis]',
       'xanthine dehydrogenase family protein molybdopterin-binding subunit [uncultured Deinococcus sp.]',
       'xanthine dehydrogenase family protein molybdopterin-binding subunit [uncultured Methylobacterium sp.]',
       'MULTISPECIES: xanthine dehydrogenase family protein molybdopterin-binding subunit [Methylobacterium]',
       'molybdopterin-dependent oxidoreductase

In [78]:
open_blast[open_blast['Description'] == 'molybdopterin-dependent oxidoreductase [Microbacterium hydrothermale]']

Unnamed: 0,Description,Scientific Name,Max Score,Total Score,Query Cover,E value,Per. ident,Acc. Len,Accession,MAG,query_index
20,molybdopterin-dependent oxidoreductase [Microb...,Microbacterium hydrothermale,1212,1212,100%,0.0,95.77,1024,"=HYPERLINK(""https://www.ncbi.nlm.nih.gov/prote...",uncultured_Microbacterium_sp._ERS6626964,Query 1
21,molybdopterin-dependent oxidoreductase [Microb...,Microbacterium hydrothermale,1256,1256,100%,0.0,95.97,1024,"=HYPERLINK(""https://www.ncbi.nlm.nih.gov/prote...",uncultured_Microbacterium_sp._ERS6626964,Query 2


<p style="font-family: 'Times New Roman'; font-size: 15px; text-align: justify; width: 100%;">
    The open_blast DataFrame ("Open BLAST") holds the output downloaded from the CDD BLAST analysis run on the NCBI server.
</p>