<h1 style="font-family: 'Times New Roman'; text-align: center; font-weight: bold;">
    CODH Function Search
</h1>

<div style="font-family: 'Times New Roman'; font-size: 15px; text-align: justify; width: 100%;">
  <div>
    <span style="display: inline-block; width: 100px;"><b>Date</b></span>: 14<sup>th</sup> February 2025
  </div>
  <div>
    <span style="display: inline-block; width: 100px;"><b>Author</b></span>: Deepan Kanagarajan Babu
  </div>
  <div>
    <span style="display: inline-block; width: 100px;"><b>Description</b></span>: In this document, the RAST output is analysed for carbon monoxide dehydrogenase (CODH) functions in the uploaded MAGs. The SEED RAST server enforces upload thresholds, so only 534 metagenome-assembled genomes (MAGs) were uploaded; the remaining MAGs were too large to upload. The sequences annotated with CODH functions are then searched for the presence of the motif. Other CODH-associated functions are examined in the shortlisted MAGs.
  </div>
</div>


<h2 style="font-family: 'Times New Roman'; font-weight: bold;">
    Required Libraries
</h2>

In [2]:
# import necessary libraries
import csv
import os
import re
import shutil
import time
import xml.etree.ElementTree as ET

import pandas as pd
from Bio import SeqIO
from Bio.Blast import NCBIWWW
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

<h2 style="font-family: 'Times New Roman'; font-weight: bold;">
    Required Functions
</h2>

In [3]:
# Single function to open multiple file formats (.csv, .txt, .xlsx; for .fasta or .fastq only the first 100 lines are returned)
def open_file(file_path):
    if file_path.endswith('.csv'):
        # Read CSV file
        data = pd.read_csv(file_path)
        return data
    elif file_path.endswith('.txt'):
        # Read TXT file
        try:
            with open(file_path, 'r') as file:
                data = file.read()
            return data
        except FileNotFoundError:
            raise FileNotFoundError(f"The file {file_path} does not exist.")
    elif file_path.endswith(('.fasta', '.fastq')):
        # Read FASTA or FASTQ file
        try:
            with open(file_path, 'r') as file:
                data = []
                for _ in range(100):
                    line = file.readline()
                    if not line:
                        break
                    data.append(line.strip())
            return data
        except FileNotFoundError:
            raise FileNotFoundError(f"The file {file_path} does not exist.")
    elif file_path.endswith('.xlsx'):
        # Read XLSX file
        try:
            data = pd.read_excel(file_path)
            return data
        except FileNotFoundError:
            raise FileNotFoundError(f"The file {file_path} does not exist.")
    else:
        raise ValueError("Unsupported file format")

In [4]:
# Function to concatenate Excel files
def concatenate_excel(input_file, output_file, summary):
    # Initialize an empty DataFrame to store the compiled data
    compiled_results = pd.DataFrame()

    # Initialize lists to store the number of entries and empty files
    entries_count = []
    empty_files = []

    # Initialize a counter for the number of processed files
    processed_files = 0

    # Iterate over each file in the folder
    for file_name in os.listdir(input_file):
        if file_name.endswith('.xls'):
            # Read the Excel file into a DataFrame
            file_path = os.path.join(input_file, file_name)
            try:
                df = pd.read_excel(file_path)
                processed_files += 1

                # Check if the file is empty
                if df.empty:
                    empty_files.append(file_name)
                    print(f"File {file_name} is empty.")
                    continue  # Skip further processing of this file

                # Remove '.xls' from the file name
                modified_file_name = file_name[:-4]

                # Add a new column with the modified file name
                df.insert(0, 'MAG', modified_file_name)

                # Concatenate non-empty DataFrames
                compiled_results = pd.concat([compiled_results, df], ignore_index=True)
                # Append the number of entries to the list
                entries_count.append({'file_name': modified_file_name, 'entries': len(df)})

            except Exception as e:
                print(f"Error processing file {file_name}: {e}")

    # Save the compiled data to the new CSV file
    compiled_results.to_csv(output_file, index=False)

    # Save the number of entries from individual files to a new CSV file
    entries_count_df = pd.DataFrame(entries_count)
    entries_count_file_path = os.path.join(summary, 'number_of_annotations.csv')
    entries_count_df.to_csv(entries_count_file_path, index=False)

    # Save the empty files to a separate .txt file
    empty_files_file_path = os.path.join(summary, 'empty_files.txt')
    with open(empty_files_file_path, 'w') as f:
        if empty_files:
            for file in empty_files:
                f.write(file + '\n')
        else:
            f.write("No empty files.\n")

    # Confirmation code
    print(f"Data compilation complete. The compiled data is saved to '{output_file}'.")
    print(f"The number of entries from individual files is saved to '{entries_count_file_path}'.")
    print(f"The empty files have been saved to '{empty_files_file_path}'.")
    print(f"Processed {processed_files} files out of {len(os.listdir(input_file))} total files.")

In [5]:
# Function to Filter FASTA file sequences
def filter_fasta_files(input_folder, output_folder, df):
    # Filter the DataFrame to get the list of IDs to keep
    ids_to_keep = set(df['feature_id'])

    # Create the output directory if it doesn't exist
    os.makedirs(output_folder, exist_ok=True)

    # Initialize a list to store the results
    results = []

    # Process each FASTA file in the input directory
    for filename in os.listdir(input_folder):
        if filename.endswith(".faa"):
            fasta_file = os.path.join(input_folder, filename)
            filtered_sequences = []

            for record in SeqIO.parse(fasta_file, "fasta"):
                sequence_id = record.id.split("Sequence ID: ")[-1]
                if sequence_id in ids_to_keep:
                    filtered_sequences.append(record)
                    results.append({
                        "FASTA_File": filename,
                        "Sequence_ID": sequence_id
                    })

            # Save the filtered sequences to a new FASTA file in the output directory if not empty
            if filtered_sequences:
                filtered_fasta_file = os.path.join(output_folder, f"filtered_{filename}")
                SeqIO.write(filtered_sequences, filtered_fasta_file, "fasta")

    # Create a DataFrame from the results
    df_results = pd.DataFrame(results)

    # Save the DataFrame to a CSV file in the output directory
    df_results.to_csv(os.path.join(output_folder, "filtered_sequences_info.csv"), index=False)

    print(f"Filtered sequences have been saved to new FASTA files in '{output_folder}' and results have been saved to 'filtered_sequences_info.csv'")

In [6]:
# Function to search for 2 motifs with unique labels
def search_motifs(directory, pattern1, label1="Motif1", pattern2=None, label2="Motif2", output_file=None):
    all_results = []

    # Ensure directory exists
    if not os.path.isdir(directory):
        print(f"Error: Directory '{directory}' not found.")
        return pd.DataFrame()

    # Process each FASTA or FAA file in the directory
    for filename in os.listdir(directory):
        if filename.endswith((".fasta", ".faa")):
            fasta_file = os.path.join(directory, filename)

            # Derive the MAG name: drop the extension and the 'filtered_' prefix,
            # so that both .faa and .fasta files give the same label
            mag_name = os.path.splitext(filename)[0]
            if mag_name.startswith("filtered_"):
                mag_name = mag_name[len("filtered_"):]

            for record in SeqIO.parse(fasta_file, "fasta"):
                seq = str(record.seq)

                # Search each provided pattern and tag its hits with the matching label
                for pattern, label in ((pattern1, label1), (pattern2, label2)):
                    if not pattern:
                        continue
                    for match in pattern.finditer(seq):
                        all_results.append({
                            "MAG": mag_name,
                            "feature_id": record.id,
                            "aa_sequence": seq,
                            "MOTIF": match.group(0),
                            "Previous_3_AA": match.group(1),
                            "Motif_Label": label
                        })

    # Convert results to DataFrame
    df = pd.DataFrame(all_results)

    # Set output file name dynamically if not provided
    if output_file is None:
        output_file = f"{label1}_{label2}_motif_search.csv"  # Default to label-based filename

    # Save only if results are found
    if not df.empty:
        df.to_csv(output_file, index=False)
        print(f"Results saved to {output_file}")
    else:
        print("No results found for motifs.")

    return df
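
The `MAG` label in the motif table is derived from the FASTA file name. The sketch below illustrates one way to do this that does not depend on the length of the file extension. The helper name `mag_from_filename` and the toy file name are assumptions made for illustration; the `filtered_` prefix is the one written by `filter_fasta_files`.

```python
import os

def mag_from_filename(filename, prefix="filtered_"):
    """Return the MAG name encoded in a FASTA file name (hypothetical helper)."""
    # Drop only the final extension; dots inside the MAG name (e.g. 'sp.') are kept
    stem = os.path.splitext(filename)[0]
    # Drop the prefix added when the filtered FASTA files were written
    if stem.startswith(prefix):
        stem = stem[len(prefix):]
    return stem

faa_name = mag_from_filename("filtered_uncultured_Acidovorax_sp._ERS6627271.faa")
fasta_name = mag_from_filename("filtered_uncultured_Acidovorax_sp._ERS6627271.fasta")
print(faa_name)
print(fasta_name)

# A fixed slice assumes a four-character extension and misbehaves for '.fasta'
sliced = "filtered_toy_MAG.fasta"[9:-4]
print(sliced)
```

Both calls return the same MAG name, whereas the fixed slice leaves part of the `.fasta` extension attached.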

In [7]:
# Function to extract nucleotide sequence from RAST output CSV
def create_fasta_from_df(df, mag_col, feature_id_col, sequence_col, output_folder):
    # Ensure the output folder exists
    if not os.path.exists(output_folder):
        os.makedirs(output_folder)

    # Group by 'MAG' and create separate FASTA files for each group
    for mag, group in df.groupby(mag_col):
        fasta_records = []
        for _, row in group.iterrows():
            record = SeqRecord(Seq(row[sequence_col]), id=row[feature_id_col], description="")
            fasta_records.append(record)
        
        # Write the records to a FASTA file named after the 'MAG' value in the specified output folder
        fasta_file = os.path.join(output_folder, f"{mag}.fasta")
        SeqIO.write(fasta_records, fasta_file, "fasta")

    print("FASTA files created successfully.")

In [8]:
# Function to count number of sequences in every file
def count_sequences_in_fasta_files(folder_path):
    # Dictionary to store the count of sequences in each file
    sequence_counts = {}
    total_sequences = 0

    # Iterate over each file in the folder
    for file_name in os.listdir(folder_path):
        if file_name.endswith((".faa", ".fasta", ".fna")):
            file_path = os.path.join(folder_path, file_name)
            # Count the number of sequences in the current FASTA file
            sequence_count = sum(1 for _ in SeqIO.parse(file_path, "fasta"))
            sequence_counts[file_name] = sequence_count
            total_sequences += sequence_count

    return sequence_counts, total_sequences

In [9]:
# Function to perform BLAST search for a protein sequence with retry mechanism
def search_cdd(sequence, program="blastp", db="cdd", e_value=0.01, max_hits=10, retries=10, timeout=600):
    for attempt in range(retries):
        try:
            print(f"Attempt {attempt + 1} for sequence.")
            result_handle = NCBIWWW.qblast(
                program=program,
                database=db,
                sequence=sequence,
                expect=e_value,
                hitlist_size=max_hits
            )
            results = result_handle.read()
            result_handle.close()
            return results
        except Exception as e:
            print(f"Attempt {attempt + 1} failed: {e}")
            time.sleep(5)  # Wait before retrying
    print("All attempts failed.")
    return None

# Function to parse BLAST XML and extract summary information
def parse_cdd_xml(xml_content):
    try:
        root = ET.fromstring(xml_content)
        hits = []
        
        # Parse the hits from the XML
        for hit in root.findall(".//Hit"):
            hit_id = hit.findtext("Hit_id")
            hit_def = hit.findtext("Hit_def")
            hit_score = hit.findtext(".//Hsp_bit-score")
            e_value = hit.findtext(".//Hsp_evalue")
            hits.append({
                "hit_id": hit_id,
                "hit_def": hit_def,
                "hit_score": hit_score,
                "e_value": e_value,
            })
        return hits
    except ET.ParseError as e:
        print(f"Error parsing XML: {e}")
        return []

# Function to process all .faa files in a folder
def conserved_domain_search(folder_path, output_folder, summary_csv, program="blastp", db="cdd"):
    os.makedirs(output_folder, exist_ok=True)
    
    with open(summary_csv, "w", newline="") as csvfile:
        csv_writer = csv.writer(csvfile)
        csv_writer.writerow(["Fasta File", "Sequence Header", "CDD Hit ID", "CDD Hit Definition", "Bit Score", "E-value"])
        
        for file_name in os.listdir(folder_path):
            if file_name.endswith(".faa"):
                file_path = os.path.join(folder_path, file_name)
                print(f"Processing file: {file_name}")
                
                for record in SeqIO.parse(file_path, "fasta"):
                    sequence_id = record.id
                    sequence = str(record.seq)
                    
                    print(f"Searching CDD for sequence: {sequence_id}")
                    result = search_cdd(sequence, program=program, db=db)
                    
                    if result:
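                        # RAST feature IDs contain '|', which is not a valid file name character on Windows
                        safe_id = re.sub(r'[^\w.-]', '_', sequence_id)
                        output_file = os.path.join(output_folder, f"{file_name}_{safe_id}_cdd.xml")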
                        with open(output_file, "w") as f:
                            f.write(result)
                        print(f"Results saved to: {output_file}")
                        
                        hits = parse_cdd_xml(result)
                        for hit in hits:
                            csv_writer.writerow([
                                file_name,
                                sequence_id,
                                hit["hit_id"],
                                hit["hit_def"],
                                hit["hit_score"],
                                hit["e_value"]
                            ])
                print(f"Completed processing file: {file_name}")
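
To make the XML summary step easier to follow, here is a minimal sketch that applies the same `findall`/`findtext` calls to a hand-written snippet. The snippet is hypothetical: it only mimics the element names used above (`Hit`, `Hit_id`, `Hit_def`, `Hsp_bit-score`, `Hsp_evalue`) and contains no real search results. The function name `summarise_hits` is also an assumption.

```python
import xml.etree.ElementTree as ET

# Hypothetical snippet shaped like BLAST XML output (toy values, not real results)
toy_xml = """<BlastOutput>
  <Iteration_hits>
    <Hit>
      <Hit_id>toy|hit1</Hit_id>
      <Hit_def>toy domain A</Hit_def>
      <Hit_hsps>
        <Hsp>
          <Hsp_bit-score>250.0</Hsp_bit-score>
          <Hsp_evalue>1e-70</Hsp_evalue>
        </Hsp>
        <Hsp>
          <Hsp_bit-score>40.0</Hsp_bit-score>
          <Hsp_evalue>0.005</Hsp_evalue>
        </Hsp>
      </Hit_hsps>
    </Hit>
  </Iteration_hits>
</BlastOutput>"""

def summarise_hits(xml_content):
    """Return one summary dict per <Hit> element."""
    root = ET.fromstring(xml_content)
    rows = []
    for hit in root.findall(".//Hit"):
        rows.append({
            "hit_id": hit.findtext("Hit_id"),
            "hit_def": hit.findtext("Hit_def"),
            # findtext returns the first match in document order,
            # so only the first HSP of each hit is reported
            "hit_score": hit.findtext(".//Hsp_bit-score"),
            "e_value": hit.findtext(".//Hsp_evalue"),
        })
    return rows

toy_hits = summarise_hits(toy_xml)
print(toy_hits)
```

Note that the values come back as strings. Convert them with `float()` before sorting or filtering by score or E-value.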

<h2 style="font-family: 'Times New Roman'; font-weight: bold;">
    Required Directories Path
</h2>

In [11]:
# Input Folders
## RAST excel output
rast_xls = 'xls'

## RAST output protein sequences
rast_faa = 'faa'

## Folder will be used to store the open BLAST results of carriers
cross_ver = 'summary/cross_verification'

# Output Folder
## Results summary folder
summary = 'summary'
## Ensure the output directory exists
os.makedirs(summary, exist_ok=True)

## Folder to store filtered fasta files
filtered_fasta = 'summary/faa'
## Ensure the output directory exists
os.makedirs(filtered_fasta, exist_ok=True)

## Folder to store extracted nucleotide fasta files
fna = 'summary/fna'
## Ensure the output directory exists
os.makedirs(fna, exist_ok=True)

## Concatenated CSV file output path
rast_out = 'summary/rast.csv'

## Filtered RAST output for CODH functions file path
codh_out = 'summary/codh.csv'

## MOTIF Search output file path
motif_out = 'summary/codh_motif.csv'

<h2 style="font-family: 'Times New Roman'; font-weight: bold;">
    Data Preprocessing
</h2>

In [13]:
# Concatenate RAST outputs
concatenate_excel(rast_xls, rast_out, summary)

Data compilation complete. The compiled data is saved to 'summary/rast.csv'.
The number of entries from individual files is saved to 'summary\number_of_annotations.csv'.
The empty files have been saved to 'summary\empty_files.txt'.
Processed 534 files out of 534 total files.


In [14]:
# Import the file
rast = open_file(rast_out)
rast.head()

Unnamed: 0,MAG,contig_id,feature_id,type,location,start,stop,strand,function,aliases,figfam,evidence_codes,nucleotide_sequence,aa_sequence
0,Bdellovibrionaceae_bacterium_ERS6626579,Genome20_S3Ck127_100924,fig|2026715.123.peg.1,peg,Genome20_S3Ck127_100924_3_194,3,194,+,hypothetical protein,,,,gcgatcttcacgcactctgacgaacaaaaaagcgtcgctcttcagg...,AIFTHSDEQKSVALQAKAKAQQKWDKPIVTQIVDAKPFYSAERVHQ...
1,Bdellovibrionaceae_bacterium_ERS6626579,Genome20_S3Ck127_100924,fig|2026715.123.peg.2,peg,Genome20_S3Ck127_100924_197_814,197,814,+,hypothetical protein,,,,atgccgattcaatcctggtttccgacgtcgctttatgttgagcccc...,MPIQSWFPTSLYVEPLLANGLARFNRELLEQCYLVRDTDPHGVEWS...
2,Bdellovibrionaceae_bacterium_ERS6626579,Genome20_S3Ck127_100924,fig|2026715.123.peg.3,peg,Genome20_S3Ck127_100924_1362_811,1362,811,-,hypothetical protein,,,,atgaagcgcttgatctcagctattcttttaatttcttttttaacca...,MKRLISAILLISFLTTQAEAQVAERESRISRAATLLSDVAESVSPV...
3,Bdellovibrionaceae_bacterium_ERS6626579,Genome20_S3Ck127_100924,fig|2026715.123.peg.4,peg,Genome20_S3Ck127_100924_2234_1452,2234,1452,-,hypothetical protein,,,,tttgtcataacggcttccgcctcagcgcaactcacgaatggccgca...,FVITASASAQLTNGRIRQRTANAPGFTACLSKDGDKCRLIAPAVAI...
4,Bdellovibrionaceae_bacterium_ERS6626579,Genome20_S3Ck127_104562,fig|2026715.123.peg.5,peg,Genome20_S3Ck127_104562_2_271,2,271,+,hypothetical protein,,,,aaaaagaaaagttggaattacgagttcggccagcgcgcgctagaag...,KKKSWNYEFGQRALEELAKKFEGGTDVVKQLSTPEDHHAALEALWQ...


In [15]:
# Check for number of unique MAGs
rast['MAG'].nunique()

534

In [16]:
# Search for CODH function
search_strings = [r'Carbon monoxide', r'carbon monoxide', r'CoxL', r'CODH', r'Xanthine dehydrogenase family protein', r'\bCO\b', r'TIGR02416', r'COG1529']
## In r'\bCO\b', each \b marks a word boundary, so 'CO' is matched only as a whole word
regex_pattern = '|'.join(search_strings)

codh = rast[rast['function'].str.contains(regex_pattern, regex=True, na=False)].reset_index(drop=True)
codh.head()

Unnamed: 0,MAG,contig_id,feature_id,type,location,start,stop,strand,function,aliases,figfam,evidence_codes,nucleotide_sequence,aa_sequence
0,Burkholderiaceae_bacterium_ERS6626828,Genome269_k127_405606,fig|2030806.1252.peg.2805,peg,Genome269_k127_405606_190211_189204,190211,189204,-,Xanthine and CO dehydrogenases maturation fact...,,,idu(1);Purine_Utilization idu(1);Carbon_monoxi...,atggacaacatggatctgcaagtgctgcgccaggtgcagcaatgga...,MDNMDLQVLRQVQQWMDADQRVVLGTITRTWGSAPRPVGSVVAVRG...
1,Burkholderiaceae_bacterium_ERS6626828,Genome269_k127_422113,fig|2030806.1252.peg.3400,peg,Genome269_k127_422113_152936_152133,152936,152133,-,Xanthine and CO dehydrogenases maturation fact...,,,idu(1);Purine_Utilization idu(1);Carbon_monoxi...,atgcgtccctggcaacgtctcgccgcgcacctgcatcacgcgccgg...,MRPWQRLAAHLHHAPAVLVRVDALQGSGPREVGAWMAVTATELVGT...
2,Burkholderiaceae_bacterium_ERS6626861,Genome302_k127_1008953,fig|2030806.1251.peg.31,peg,Genome302_k127_1008953_868_44,868,44,-,Xanthine and CO dehydrogenases maturation fact...,,,isu;Xanthine_dehydrogenase_subunits isu;Purine...,atgaccggcgccgtcgacgaacttctggaccgcctctcgcgtggcg...,MTGAVDELLDRLSRGDEGVLIRVTATQGSVPREAGTWMAVWTDALT...
3,Burkholderiaceae_bacterium_ERS6626861,Genome302_k127_440979,fig|2030806.1251.peg.1673,peg,Genome302_k127_440979_542_3,542,3,-,Aerobic carbon monoxide dehydrogenase (quinone...,,,,atgggtgcttccgatttcgccaagctgccgcacatcggcgaggcca...,MGASDFAKLPHIGEAIRRKEDARFLTGAGNYTDDVVLPNQAHAVFL...
4,Burkholderiaceae_bacterium_ERS6626861,Genome302_k127_440979,fig|2030806.1251.peg.1674,peg,Genome302_k127_440979_1043_555,1043,555,-,Aerobic carbon monoxide dehydrogenase (quinone...,,,,atgcaggtatccctcaccgtcaacgggcgcgaggtcacgatcgacg...,MQVSLTVNGREVTIDAPPNTLLVQAIREHLRLTGTHVGCDTAQCGA...
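
The effect of the word boundary in the keyword pattern can be checked on a few strings with the standard `re` module, as sketched below. The pattern is the one defined in the cell above. Two of the toy strings are function names that appear in this notebook; the other two are made up for illustration. `re.search` is used as a stand-in for the per-row regex search that `str.contains(..., regex=True)` performs.

```python
import re

search_strings = [r'Carbon monoxide', r'carbon monoxide', r'CoxL', r'CODH',
                  r'Xanthine dehydrogenase family protein', r'\bCO\b',
                  r'TIGR02416', r'COG1529']
regex_pattern = re.compile('|'.join(search_strings))

toy_functions = [
    # Matches only through \bCO\b ('CO' stands alone as a word)
    "Xanthine and CO dehydrogenases maturation factor, XdhC/CoxF family",
    # Made-up string: 'CO' is followed by a letter, so \bCO\b does not match
    "COG0001 toy family protein",
    # Matches through the lower-case 'carbon monoxide' keyword
    "carbon monoxide dehydrogenase G protein",
    # No keyword present
    "hypothetical protein",
]

keyword_hits = [bool(regex_pattern.search(text)) for text in toy_functions]
print(keyword_hits)
```

The pattern is case sensitive, which is why both `Carbon monoxide` and `carbon monoxide` are listed separately.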


In [17]:
# Find the distinct function annotations returned by the keyword search
codh['function'].unique()

array(['Xanthine and CO dehydrogenases maturation factor, XdhC/CoxF family',
       'Aerobic carbon monoxide dehydrogenase (quinone), large chain (EC 1.2.5.3)',
       'Aerobic carbon monoxide dehydrogenase (quinone), small chain (EC 1.2.5.3)',
       'carbon monoxide dehydrogenase G protein',
       'carbon monoxide dehydrogenase E protein',
       'Carbon monoxide oxidation accessory protein CoxD',
       'Aerobic carbon monoxide dehydrogenase (quinone), medium chain (EC 1.2.5.3)',
       'Carbon monoxide oxidation accessory protein CoxG',
       'Carbon monoxide oxidation accessory protein CoxE',
       'Carbon monoxide dehydrogenase F protein',
       'Carbon monoxide dehydrogenase large chain (EC 1.2.99.2)',
       'Aerobic-type carbon monoxide dehydrogenase, large subunit CoxL/CutL homologs',
       'Aerobic-type carbon monoxide dehydrogenase, small subunit CoxS/CutS homologs',
       'Aerobic carbon monoxide dehydrogenase molybdenum cofactor insertion protein CoxF',
       'Xant

In [18]:
# List of MAGs with CODH gene
codh['MAG'].unique()

array(['Burkholderiaceae_bacterium_ERS6626828',
       'Burkholderiaceae_bacterium_ERS6626861',
       'Candidatus_Sericytochromatia_bacterium_ERS6626968',
       'Rhizobiaceae_bacterium_ERS6626796',
       'uncultured_Achromobacter_sp._ERS6626913',
       'uncultured_Acidovorax_sp._ERS6626561',
       'uncultured_Acidovorax_sp._ERS6626565',
       'uncultured_Acidovorax_sp._ERS6626576',
       'uncultured_Acidovorax_sp._ERS6626591',
       'uncultured_Acidovorax_sp._ERS6626592',
       'uncultured_Acidovorax_sp._ERS6626596',
       'uncultured_Acidovorax_sp._ERS6626599',
       'uncultured_Acidovorax_sp._ERS6626607',
       'uncultured_Acidovorax_sp._ERS6626644',
       'uncultured_Acidovorax_sp._ERS6626705',
       'uncultured_Acidovorax_sp._ERS6626715',
       'uncultured_Acidovorax_sp._ERS6626735',
       'uncultured_Acidovorax_sp._ERS6626737',
       'uncultured_Acidovorax_sp._ERS6626738',
       'uncultured_Acidovorax_sp._ERS6626787',
       'uncultured_Acidovorax_sp._ERS6626798'

In [19]:
codh['MAG'].nunique()

293

In [20]:
codh['Organism'] = codh['MAG'].str[:-11]
codh.head()

Unnamed: 0,MAG,contig_id,feature_id,type,location,start,stop,strand,function,aliases,figfam,evidence_codes,nucleotide_sequence,aa_sequence,Organism
0,Burkholderiaceae_bacterium_ERS6626828,Genome269_k127_405606,fig|2030806.1252.peg.2805,peg,Genome269_k127_405606_190211_189204,190211,189204,-,Xanthine and CO dehydrogenases maturation fact...,,,idu(1);Purine_Utilization idu(1);Carbon_monoxi...,atggacaacatggatctgcaagtgctgcgccaggtgcagcaatgga...,MDNMDLQVLRQVQQWMDADQRVVLGTITRTWGSAPRPVGSVVAVRG...,Burkholderiaceae_bacterium
1,Burkholderiaceae_bacterium_ERS6626828,Genome269_k127_422113,fig|2030806.1252.peg.3400,peg,Genome269_k127_422113_152936_152133,152936,152133,-,Xanthine and CO dehydrogenases maturation fact...,,,idu(1);Purine_Utilization idu(1);Carbon_monoxi...,atgcgtccctggcaacgtctcgccgcgcacctgcatcacgcgccgg...,MRPWQRLAAHLHHAPAVLVRVDALQGSGPREVGAWMAVTATELVGT...,Burkholderiaceae_bacterium
2,Burkholderiaceae_bacterium_ERS6626861,Genome302_k127_1008953,fig|2030806.1251.peg.31,peg,Genome302_k127_1008953_868_44,868,44,-,Xanthine and CO dehydrogenases maturation fact...,,,isu;Xanthine_dehydrogenase_subunits isu;Purine...,atgaccggcgccgtcgacgaacttctggaccgcctctcgcgtggcg...,MTGAVDELLDRLSRGDEGVLIRVTATQGSVPREAGTWMAVWTDALT...,Burkholderiaceae_bacterium
3,Burkholderiaceae_bacterium_ERS6626861,Genome302_k127_440979,fig|2030806.1251.peg.1673,peg,Genome302_k127_440979_542_3,542,3,-,Aerobic carbon monoxide dehydrogenase (quinone...,,,,atgggtgcttccgatttcgccaagctgccgcacatcggcgaggcca...,MGASDFAKLPHIGEAIRRKEDARFLTGAGNYTDDVVLPNQAHAVFL...,Burkholderiaceae_bacterium
4,Burkholderiaceae_bacterium_ERS6626861,Genome302_k127_440979,fig|2030806.1251.peg.1674,peg,Genome302_k127_440979_1043_555,1043,555,-,Aerobic carbon monoxide dehydrogenase (quinone...,,,,atgcaggtatccctcaccgtcaacgggcgcgaggtcacgatcgacg...,MQVSLTVNGREVTIDAPPNTLLVQAIREHLRLTGTHVGCDTAQCGA...,Burkholderiaceae_bacterium
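
The `Organism` column is obtained by cutting the last 11 characters, which is the length of a suffix such as `_ERS6626828`. As a hedged alternative, the sketch below strips the accession with a regular expression, so it does not depend on the accession length. The third MAG name is hypothetical and only demonstrates a longer accession.

```python
import re

mags = [
    "Burkholderiaceae_bacterium_ERS6626828",
    "uncultured_Acidovorax_sp._ERS6626561",
    "toy_organism_ERS12345678",  # hypothetical name with an 8-digit accession
]

# Fixed-width slice, as used for the 'Organism' column
by_slice = [name[:-11] for name in mags]

# Pattern-based alternative: remove a trailing '_ERS' followed by digits
by_regex = [re.sub(r'_ERS\d+$', '', name) for name in mags]

print(by_slice)
print(by_regex)
```

For the two real names both approaches agree. For the hypothetical longer accession the slice leaves a trailing underscore, while the regular expression does not. The same pattern could be applied to a pandas column with `str.replace(..., regex=True)`.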


In [21]:
codh['Organism'].nunique()

42

In [22]:
codh['Organism'].unique()

array(['Burkholderiaceae_bacterium',
       'Candidatus_Sericytochromatia_bacterium', 'Rhizobiaceae_bacterium',
       'uncultured_Achromobacter_sp.', 'uncultured_Acidovorax_sp.',
       'uncultured_Acinetobacter_sp.', 'uncultured_Agrobacterium_sp.',
       'uncultured_Aquabacterium_sp.', 'uncultured_Aureimonas_sp.',
       'uncultured_Blastomonas_sp.', 'uncultured_Chryseobacterium_sp.',
       'uncultured_Comamonas_sp.', 'uncultured_Cupriavidus_sp.',
       'uncultured_Curtobacterium_sp.', 'uncultured_Deinococcus_sp.',
       'uncultured_Epilithonimonas_sp.', 'uncultured_Erwinia_sp.',
       'uncultured_Herbaspirillum_sp.', 'uncultured_Hymenobacter_sp.',
       'uncultured_Klebsiella_sp.', 'uncultured_Massilia_sp.',
       'uncultured_Methylobacterium_sp.', 'uncultured_Microbacterium_sp.',
       'uncultured_Mucilaginibacter_sp.', 'uncultured_Naasia_sp.',
       'uncultured_Novosphingobium_sp.', 'uncultured_Ochrobactrum_sp.',
       'uncultured_Oxalicibacterium_sp.', 'uncultured_Panto

In [23]:
# save the data frame into .csv file
codh.to_csv(codh_out, index=False)

In [24]:
protein1 = 'Aerobic carbon monoxide dehydrogenase (quinone), large chain (EC 1.2.5.3)'
protein2 = 'Carbon monoxide dehydrogenase large chain (EC 1.2.99.2)'
protein3 = 'Aerobic-type carbon monoxide dehydrogenase, large subunit CoxL/CutL homologs'

In [25]:
coxl = codh[(codh['function'] == protein1) | (codh['function'] == protein2) | (codh['function'] == protein3)].reset_index(drop=True)
coxl

Unnamed: 0,MAG,contig_id,feature_id,type,location,start,stop,strand,function,aliases,figfam,evidence_codes,nucleotide_sequence,aa_sequence,Organism
0,Burkholderiaceae_bacterium_ERS6626861,Genome302_k127_440979,fig|2030806.1251.peg.1673,peg,Genome302_k127_440979_542_3,542,3,-,Aerobic carbon monoxide dehydrogenase (quinone...,,,,atgggtgcttccgatttcgccaagctgccgcacatcggcgaggcca...,MGASDFAKLPHIGEAIRRKEDARFLTGAGNYTDDVVLPNQAHAVFL...,Burkholderiaceae_bacterium
1,Burkholderiaceae_bacterium_ERS6626861,Genome302_k127_836765,fig|2030806.1251.peg.2769,peg,Genome302_k127_836765_5636_4905,5636,4905,-,Aerobic carbon monoxide dehydrogenase (quinone...,,,,accaaggccaagaagatcgccgcgcacctgatggaagccagagacg...,TKAKKIAAHLMEARDADIEFANGEFTVKGTDKKKTFGEVALTAYVP...,Burkholderiaceae_bacterium
2,uncultured_Acidovorax_sp._ERS6627271,Genome495_k127_317059,fig|158751.153.peg.1364,peg,Genome495_k127_317059_28958_31345,28958,31345,+,Aerobic carbon monoxide dehydrogenase (quinone...,,,,atgggtgcatccgacttttccaagctgccatacattggcgaggcct...,MGASDFSKLPYIGEALKRKEDYRFLTGAGQYTDDVVLAAQCHAVFV...,uncultured_Acidovorax_sp.
3,uncultured_Aureimonas_sp._ERS6626915,Genome356_k127_498042,fig|1604662.9.peg.2537,peg,Genome356_k127_498042_1_1509,1,1509,+,Aerobic carbon monoxide dehydrogenase (quinone...,,,,ctggcgatgatcgcggccgagaagctgaagcgcccggtgcgctggg...,LAMIAAEKLKRPVRWVADRNEHFLADTHGRANLATATMALDAKGRF...,uncultured_Aureimonas_sp.
4,uncultured_Aureimonas_sp._ERS6627306,Genome530_k127_16854,fig|1604662.10.peg.491,peg,Genome530_k127_16854_7296_9650,7296,9650,+,Aerobic carbon monoxide dehydrogenase (quinone...,,,,atgggtatcgagggcatcggcgcgcgcgtcctgcgcaaggaagacc...,MGIEGIGARVLRKEDRRFITGRGRYTDDMVVPGMKYAAFVRSPHAH...,uncultured_Aureimonas_sp.
5,uncultured_Aureimonas_sp._ERS6627306,Genome530_k127_443989,fig|1604662.10.peg.1867,peg,Genome530_k127_443989_33169_36138,33169,36138,+,Carbon monoxide dehydrogenase large chain (EC ...,,,,atgaccgccattccgccaacgggcggctggacggggcgttctgtgc...,MTAIPPTGGWTGRSVPRVEDAALLTGRGRYMDDLPITPGTLHAAIL...,uncultured_Aureimonas_sp.
6,uncultured_Aureimonas_sp._ERS6627306,Genome530_k127_444238,fig|1604662.10.peg.1877,peg,Genome530_k127_444238_3814_3455,3814,3455,-,Aerobic carbon monoxide dehydrogenase (quinone...,,,,cgcatcgtcaacccaatgatcgtcgaggggcaggtgcatggcggct...,RIVNPMIVEGQVHGGLAQGIGQALLEGVHYDESGQLVTASYNDYAM...,uncultured_Aureimonas_sp.
7,uncultured_Comamonas_sp._ERS6626602,Genome43_S3Ck127_324968,fig|114710.41.peg.1792,peg,Genome43_S3Ck127_324968_46683_46309,46683,46309,-,Aerobic carbon monoxide dehydrogenase (quinone...,,,,ttcggcaccatcatcaatcccatgatcgtcgagggccaggtgcacg...,FGTIINPMIVEGQVHGGVAQGIGQALLENCVYDRETGQLLTGSFMD...,uncultured_Comamonas_sp.
8,uncultured_Comamonas_sp._ERS6626602,Genome43_S3Ck127_506375,fig|114710.41.peg.2966,peg,Genome43_S3Ck127_506375_11614_14535,11614,14535,+,"Aerobic-type carbon monoxide dehydrogenase, la...",,,,atgcgcccagaacaacactccggcgccggcagccaggccggcgcgg...,MRPEQHSGAGSQAGAAPLSRRGFLQAASIAGVSVWVAPLASQAYAA...,uncultured_Comamonas_sp.
9,uncultured_Comamonas_sp._ERS6626602,Genome43_S3Ck127_58254,fig|114710.41.peg.3673,peg,Genome43_S3Ck127_58254_2031_1,2031,1,-,Aerobic carbon monoxide dehydrogenase (quinone...,,,,atgggtgcatccgacttttccaagctcccccacatcggcgagcccg...,MGASDFSKLPHIGEPVKRKEDYRFLTGAGQYTDDIALAAQAHAVFV...,uncultured_Comamonas_sp.


In [26]:
coxl['MAG'].nunique()

35

In [27]:
coxl['MAG'].unique()

array(['Burkholderiaceae_bacterium_ERS6626861',
       'uncultured_Acidovorax_sp._ERS6627271',
       'uncultured_Aureimonas_sp._ERS6626915',
       'uncultured_Aureimonas_sp._ERS6627306',
       'uncultured_Comamonas_sp._ERS6626602',
       'uncultured_Comamonas_sp._ERS6626630',
       'uncultured_Comamonas_sp._ERS6626650',
       'uncultured_Comamonas_sp._ERS6627283',
       'uncultured_Cupriavidus_sp._ERS6626993',
       'uncultured_Deinococcus_sp._ERS6626797',
       'uncultured_Deinococcus_sp._ERS6626811',
       'uncultured_Deinococcus_sp._ERS6626820',
       'uncultured_Deinococcus_sp._ERS6626837',
       'uncultured_Deinococcus_sp._ERS6627282',
       'uncultured_Deinococcus_sp._ERS6627293',
       'uncultured_Deinococcus_sp._ERS6627335',
       'uncultured_Methylobacterium_sp._ERS6627288',
       'uncultured_Pantoea_sp._ERS6626969',
       'uncultured_Pseudacidovorax_sp._ERS6626577',
       'uncultured_Pseudacidovorax_sp._ERS6626767',
       'uncultured_Pseudacidovorax_sp._ERS

In [28]:
# Filter the fasta files with CODH function
filter_fasta_files(rast_faa, filtered_fasta, coxl)

Filtered sequences have been saved to new FASTA files in 'summary/faa' and results have been saved to 'filtered_sequences_info.csv'


<h2 style="font-family: 'Times New Roman'; font-weight: bold;">
    MOTIF Search
</h2>

In [30]:
# MOTIF search pattern
## Define the motifs as regular expressions
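form1 = re.compile(r"(.{3})CSFR") # Form 1 motif AYXCSFR: match its last four residues (CSFR) and capture the three residues before them
form2 = re.compile(r"(.{3})GAGR") # Form 2 motif AYRGAGR: match its last four residues (GAGR) and capture the three residues before them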

## Define Label for MOTIFs
label1 = "CoxL_form_1"
label2 = "CoxL_form_2"

## Search for MOTIFs
rast_motif_results = search_motifs(filtered_fasta, form1, label1, form2, label2, output_file = motif_out)

Results saved to summary/codh_motif.csv
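
The sketch below shows, on short made-up sequences, what the two patterns return: `group(0)` is the full seven-residue hit and `group(1)` holds the three residues in front of the anchor. The toy sequences are assumptions for illustration and are not taken from the MAGs.

```python
import re

form1 = re.compile(r"(.{3})CSFR")
form2 = re.compile(r"(.{3})GAGR")

# Hypothetical toy sequences
seq_form1 = "QQAYSCSFRGG"
seq_form2 = "MKTLAYRGAGRPEV"

m1 = form1.search(seq_form1)
m2 = form2.search(seq_form2)

hit1 = (m1.group(0), m1.group(1), m1.start())
hit2 = (m2.group(0), m2.group(1), m2.start())
print(hit1)  # full hit, three preceding residues, 0-based start
print(hit2)

# The anchor needs three residues in front of it, so a motif that starts
# within the first three positions of a sequence is not reported
too_short = form2.search("AYGAGR")
print(too_short)
```

Because the three leading positions are matched with `.`, any residues are accepted there. The `Previous_3_AA` column is what shows whether a hit actually starts with `AY`.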


In [31]:
rast_motif_results

Unnamed: 0,MAG,feature_id,aa_sequence,MOTIF,Previous_3_AA,Motif_Label
0,uncultured_Acidovorax_sp._ERS6627271,fig|158751.153.peg.1364,MGASDFSKLPYIGEALKRKEDYRFLTGAGQYTDDVVLAAQCHAVFV...,AYRGAGR,AYR,CoxL_form_2
1,uncultured_Aureimonas_sp._ERS6626915,fig|1604662.9.peg.2537,LAMIAAEKLKRPVRWVADRNEHFLADTHGRANLATATMALDAKGRF...,AYRGAGR,AYR,CoxL_form_2
2,uncultured_Aureimonas_sp._ERS6627306,fig|1604662.10.peg.491,MGIEGIGARVLRKEDRRFITGRGRYTDDMVVPGMKYAAFVRSPHAH...,AYRGAGR,AYR,CoxL_form_2
3,uncultured_Comamonas_sp._ERS6626602,fig|114710.41.peg.3673,MGASDFSKLPHIGEPVKRKEDYRFLTGAGQYTDDIALAAQAHAVFV...,AYRGAGR,AYR,CoxL_form_2
4,uncultured_Comamonas_sp._ERS6626630,fig|114710.42.peg.1808,MGASDFSKLPHIGEPVKRKEDYRFLTGAGQYTDDIALAAQAHAVFV...,AYRGAGR,AYR,CoxL_form_2
5,uncultured_Comamonas_sp._ERS6626650,fig|114710.46.peg.4086,MGASDFSKLPHIGEAVLRREDDRFLTGAGQYTDDISLGAQAYAVFV...,AYRGAGR,AYR,CoxL_form_2
6,uncultured_Comamonas_sp._ERS6627283,fig|114710.43.peg.5499,MGASDFSKLPHIGEPVKRKEDYRFLTGAGQYTDDIALAAQAHAVFV...,AYRGAGR,AYR,CoxL_form_2
7,uncultured_Cupriavidus_sp._ERS6626993,fig|314008.12.peg.105,MNAPDKQALIGASVRRKEDYRFLTGNGQYTDDVTLPHQSYAYFVRS...,AYRGAGR,AYR,CoxL_form_2
8,uncultured_Deinococcus_sp._ERS6626797,fig|158789.27.peg.203,MTESRTDKYVGQALKRKEDPRFITGTGNYTDDIVLHGMLHAAMVRS...,AYRGAGR,AYR,CoxL_form_2
9,uncultured_Deinococcus_sp._ERS6626811,fig|158789.28.peg.1757,MTESRTDKYVGQALKRKEDPRFITGTGNYTDDIVLHGMLHAAMVRS...,AYRGAGR,AYR,CoxL_form_2


In [32]:
rast_motif_results['Motif_Label'].unique()

array(['CoxL_form_2'], dtype=object)

In [33]:
motif_results = rast_motif_results.merge(coxl, on = ['MAG', 'feature_id', 'aa_sequence'], how='inner')
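
A minimal sketch of the inner merge on toy data follows, assuming pandas is installed. All values are made up. It shows that rows are kept only when `MAG`, `feature_id` and `aa_sequence` all match, and that a feature with two motif hits yields two merged rows.

```python
import pandas as pd

# Toy motif table: peg.1 has two hits, peg.9 has one
motif_hits = pd.DataFrame({
    "MAG": ["toy_MAG_1", "toy_MAG_1", "toy_MAG_2"],
    "feature_id": ["peg.1", "peg.1", "peg.9"],
    "aa_sequence": ["AAAA", "AAAA", "CCCC"],
    "MOTIF": ["AYRGAGR", "AYRGAGR", "AYRGAGR"],
})

# Toy annotation table: only peg.1 is shared with the motif table
annotations = pd.DataFrame({
    "MAG": ["toy_MAG_1", "toy_MAG_3"],
    "feature_id": ["peg.1", "peg.5"],
    "aa_sequence": ["AAAA", "GGGG"],
    "function": ["toy function", "toy function"],
})

merged = motif_hits.merge(annotations, on=["MAG", "feature_id", "aa_sequence"], how="inner")
print(merged)
```

Comparing `len(merged)` with the length of the motif table is a quick way to confirm that no motif hit was lost in the join.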

In [34]:
motif_results

Unnamed: 0,MAG,feature_id,aa_sequence,MOTIF,Previous_3_AA,Motif_Label,contig_id,type,location,start,stop,strand,function,aliases,figfam,evidence_codes,nucleotide_sequence,Organism
0,uncultured_Acidovorax_sp._ERS6627271,fig|158751.153.peg.1364,MGASDFSKLPYIGEALKRKEDYRFLTGAGQYTDDVVLAAQCHAVFV...,AYRGAGR,AYR,CoxL_form_2,Genome495_k127_317059,peg,Genome495_k127_317059_28958_31345,28958,31345,+,Aerobic carbon monoxide dehydrogenase (quinone...,,,,atgggtgcatccgacttttccaagctgccatacattggcgaggcct...,uncultured_Acidovorax_sp.
1,uncultured_Aureimonas_sp._ERS6626915,fig|1604662.9.peg.2537,LAMIAAEKLKRPVRWVADRNEHFLADTHGRANLATATMALDAKGRF...,AYRGAGR,AYR,CoxL_form_2,Genome356_k127_498042,peg,Genome356_k127_498042_1_1509,1,1509,+,Aerobic carbon monoxide dehydrogenase (quinone...,,,,ctggcgatgatcgcggccgagaagctgaagcgcccggtgcgctggg...,uncultured_Aureimonas_sp.
2,uncultured_Aureimonas_sp._ERS6627306,fig|1604662.10.peg.491,MGIEGIGARVLRKEDRRFITGRGRYTDDMVVPGMKYAAFVRSPHAH...,AYRGAGR,AYR,CoxL_form_2,Genome530_k127_16854,peg,Genome530_k127_16854_7296_9650,7296,9650,+,Aerobic carbon monoxide dehydrogenase (quinone...,,,,atgggtatcgagggcatcggcgcgcgcgtcctgcgcaaggaagacc...,uncultured_Aureimonas_sp.
3,uncultured_Comamonas_sp._ERS6626602,fig|114710.41.peg.3673,MGASDFSKLPHIGEPVKRKEDYRFLTGAGQYTDDIALAAQAHAVFV...,AYRGAGR,AYR,CoxL_form_2,Genome43_S3Ck127_58254,peg,Genome43_S3Ck127_58254_2031_1,2031,1,-,Aerobic carbon monoxide dehydrogenase (quinone...,,,,atgggtgcatccgacttttccaagctcccccacatcggcgagcccg...,uncultured_Comamonas_sp.
4,uncultured_Comamonas_sp._ERS6626630,fig|114710.42.peg.1808,MGASDFSKLPHIGEPVKRKEDYRFLTGAGQYTDDIALAAQAHAVFV...,AYRGAGR,AYR,CoxL_form_2,Genome71_S1Ck127_18694,peg,Genome71_S1Ck127_18694_29630_32020,29630,32020,+,Aerobic carbon monoxide dehydrogenase (quinone...,,,,atgggtgcatccgacttttccaagctcccccacatcggcgagcccg...,uncultured_Comamonas_sp.
5,uncultured_Comamonas_sp._ERS6626650,fig|114710.46.peg.4086,MGASDFSKLPHIGEAVLRREDDRFLTGAGQYTDDISLGAQAYAVFV...,AYRGAGR,AYR,CoxL_form_2,Genome91_S1Ck127_6696,peg,Genome91_S1Ck127_6696_93338_90951,93338,90951,-,Aerobic carbon monoxide dehydrogenase (quinone...,,,,atgggtgcatcagacttttccaagctgccgcatatcggcgaagccg...,uncultured_Comamonas_sp.
6,uncultured_Comamonas_sp._ERS6627283,fig|114710.43.peg.5499,MGASDFSKLPHIGEPVKRKEDYRFLTGAGQYTDDIALAAQAHAVFV...,AYRGAGR,AYR,CoxL_form_2,Genome507_k127_91973,peg,Genome507_k127_91973_28823_31213,28823,31213,+,Aerobic carbon monoxide dehydrogenase (quinone...,,,,atgggtgcatccgacttttccaagctcccccacatcggcgagcccg...,uncultured_Comamonas_sp.
7,uncultured_Cupriavidus_sp._ERS6626993,fig|314008.12.peg.105,MNAPDKQALIGASVRRKEDYRFLTGNGQYTDDVTLPHQSYAYFVRS...,AYRGAGR,AYR,CoxL_form_2,Genome434_k127_125020,peg,Genome434_k127_125020_1352_3,1352,3,-,Aerobic carbon monoxide dehydrogenase (quinone...,,,,atgaacgcacccgacaaacaggcgctgatcggcgcatccgtgcgcc...,uncultured_Cupriavidus_sp.
8,uncultured_Deinococcus_sp._ERS6626797,fig|158789.27.peg.203,MTESRTDKYVGQALKRKEDPRFITGTGNYTDDIVLHGMLHAAMVRS...,AYRGAGR,AYR,CoxL_form_2,Genome238_k127_17753,peg,Genome238_k127_17753_3580_5970,3580,5970,+,Aerobic carbon monoxide dehydrogenase (quinone...,,,,atgaccgaatccagaacggacaagtacgtcgggcaggccctcaagc...,uncultured_Deinococcus_sp.
9,uncultured_Deinococcus_sp._ERS6626811,fig|158789.28.peg.1757,MTESRTDKYVGQALKRKEDPRFITGTGNYTDDIVLHGMLHAAMVRS...,AYRGAGR,AYR,CoxL_form_2,Genome252_k127_249421,peg,Genome252_k127_249421_3031_641,3031,641,-,Aerobic carbon monoxide dehydrogenase (quinone...,,,,atgaccgaatccagaacggacaagtacgtcgggcaggccctcaagc...,uncultured_Deinococcus_sp.


In [36]:
motif_results['MAG'].nunique()

25

In [44]:
motif_results['MAG'].unique()

array(['uncultured_Acidovorax_sp._ERS6627271',
       'uncultured_Aureimonas_sp._ERS6626915',
       'uncultured_Aureimonas_sp._ERS6627306',
       'uncultured_Comamonas_sp._ERS6626602',
       'uncultured_Comamonas_sp._ERS6626630',
       'uncultured_Comamonas_sp._ERS6626650',
       'uncultured_Comamonas_sp._ERS6627283',
       'uncultured_Cupriavidus_sp._ERS6626993',
       'uncultured_Deinococcus_sp._ERS6626797',
       'uncultured_Deinococcus_sp._ERS6626811',
       'uncultured_Deinococcus_sp._ERS6626820',
       'uncultured_Deinococcus_sp._ERS6626837',
       'uncultured_Deinococcus_sp._ERS6627282',
       'uncultured_Deinococcus_sp._ERS6627293',
       'uncultured_Deinococcus_sp._ERS6627335',
       'uncultured_Methylobacterium_sp._ERS6627288',
       'uncultured_Pseudacidovorax_sp._ERS6626577',
       'uncultured_Pseudacidovorax_sp._ERS6626850',
       'uncultured_Pseudacidovorax_sp._ERS6626909',
       'uncultured_Pseudacidovorax_sp._ERS6627003',
       'uncultured_Pseudacidov

In [46]:
motif_results['function'].unique()

array(['Aerobic carbon monoxide dehydrogenase (quinone), large chain (EC 1.2.5.3)'],
      dtype=object)

In [47]:
set(coxl['MAG'].unique()) - set(motif_results['MAG'].unique())

{'Burkholderiaceae_bacterium_ERS6626861',
 'uncultured_Pantoea_sp._ERS6626969',
 'uncultured_Pseudacidovorax_sp._ERS6626767',
 'uncultured_Pseudomonas_sp._ERS6626566',
 'uncultured_Pseudomonas_sp._ERS6626713',
 'uncultured_Pseudomonas_sp._ERS6626952',
 'uncultured_Pseudomonas_sp._ERS6627245',
 'uncultured_Serratia_sp._ERS6626624',
 'uncultured_Serratia_sp._ERS6626636',
 'uncultured_Serratia_sp._ERS6627010'}
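
The set difference above yields only the MAG names. To keep the full annotation rows of the MAGs without a motif hit, the same comparison can be written with `isin`. This is a minimal sketch on hypothetical frames (`coxl_demo`, `motif_demo`) standing in for `coxl` and `motif_results`:

```python
import pandas as pd

# Hypothetical stand-ins for coxl and motif_results
coxl_demo = pd.DataFrame({
    "MAG": ["mag_A", "mag_A", "mag_B", "mag_C"],
    "feature_id": ["a1", "a2", "b1", "c1"],
})
motif_demo = pd.DataFrame({"MAG": ["mag_A"], "feature_id": ["a1"]})

# MAGs with at least one motif hit
mags_with_motif = set(motif_demo["MAG"])

# Annotation rows belonging to MAGs that have no motif hit at all
no_motif = coxl_demo[~coxl_demo["MAG"].isin(mags_with_motif)]
print(sorted(no_motif["MAG"].unique()))
```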

In [48]:
create_fasta_from_df(coxl, mag_col = 'MAG', feature_id_col = 'feature_id', sequence_col = 'nucleotide_sequence', output_folder = fna)

FASTA files created successfully.


In [49]:
# Get the count of sequences in each FASTA file and the total number of sequences
sequence_counts, total_sequences = count_sequences_in_fasta_files(fna)

# Print the results
for file_name, count in sequence_counts.items():
    print(f"{file_name}: {count} sequences")

print(f"Total sequences from all files: {total_sequences}")

Burkholderiaceae_bacterium_ERS6626861.fasta: 2 sequences
uncultured_Acidovorax_sp._ERS6627271.fasta: 1 sequences
uncultured_Aureimonas_sp._ERS6626915.fasta: 1 sequences
uncultured_Aureimonas_sp._ERS6627306.fasta: 3 sequences
uncultured_Comamonas_sp._ERS6626602.fasta: 3 sequences
uncultured_Comamonas_sp._ERS6626630.fasta: 2 sequences
uncultured_Comamonas_sp._ERS6626650.fasta: 3 sequences
uncultured_Comamonas_sp._ERS6627283.fasta: 3 sequences
uncultured_Cupriavidus_sp._ERS6626993.fasta: 1 sequences
uncultured_Deinococcus_sp._ERS6626797.fasta: 1 sequences
uncultured_Deinococcus_sp._ERS6626811.fasta: 1 sequences
uncultured_Deinococcus_sp._ERS6626820.fasta: 1 sequences
uncultured_Deinococcus_sp._ERS6626837.fasta: 1 sequences
uncultured_Deinococcus_sp._ERS6627282.fasta: 1 sequences
uncultured_Deinococcus_sp._ERS6627293.fasta: 1 sequences
uncultured_Deinococcus_sp._ERS6627335.fasta: 1 sequences
uncultured_Methylobacterium_sp._ERS6627288.fasta: 3 sequences
uncultured_Pantoea_sp._ERS6626969.fas

In [50]:
# List to store dataframes
dfs = []

# Iterate over each file in the directory
for filename in os.listdir(cross_ver):
    if filename.endswith('.csv'):
        # Read the CSV file
        df = pd.read_csv(os.path.join(cross_ver, filename))
        
        # Add 'MAG' and 'query_index' columns
        if filename.endswith('_2.csv'):
            mag_value = filename[:-6]  
            query_index_value = 'Query 3'
        elif filename.endswith('_1.csv'):
            mag_value = filename[:-6]
            query_index_value = 'Query 2'
        else:
            mag_value = filename[:-4]  
            query_index_value = 'Query 1'
        
        df['MAG'] = mag_value
        df['query_index'] = query_index_value
        
        # Append the dataframe to the list
        dfs.append(df)

# Concatenate all dataframes
result_df = pd.concat(dfs, ignore_index=True)

# Save the concatenated dataframe to a new CSV file
result_df.to_csv('open_blast.csv', index=False)

print("All files have been concatenated into 'open_blast.csv'.")

All files have been concatenated into 'open_blast.csv'.
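
The filename convention used above (`<MAG>.csv`, `<MAG>_1.csv`, `<MAG>_2.csv` for Query 1 to 3) can also be parsed with a single regular expression. The sketch below is an assumption-laden variant, not the notebook's method: `parse_blast_filename` is a hypothetical helper, and the files are written to a temporary folder so that the example runs on its own.

```python
import re
import tempfile
from pathlib import Path

import pandas as pd

def parse_blast_filename(filename):
    """Split '<MAG>[_1|_2].csv' into (MAG, query label), following the naming above."""
    stem = Path(filename).stem
    # Assumes no MAG name itself ends in '_1' or '_2'
    match = re.fullmatch(r"(.+)_([12])", stem)
    if match:
        return match.group(1), f"Query {int(match.group(2)) + 1}"
    return stem, "Query 1"

with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)
    # Hypothetical BLAST exports for one MAG with three query sequences
    for name in ("mag_X.csv", "mag_X_1.csv", "mag_X_2.csv"):
        pd.DataFrame({"Description": ["hit"]}).to_csv(tmp_path / name, index=False)

    frames = []
    for csv_path in sorted(tmp_path.glob("*.csv")):
        mag, query = parse_blast_filename(csv_path.name)
        frames.append(pd.read_csv(csv_path).assign(MAG=mag, query_index=query))

    combined = pd.concat(frames, ignore_index=True)

print(combined)
```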


In [51]:
open_blast = open_file('open_blast.csv')
open_blast

Unnamed: 0,Description,Scientific Name,Max Score,Total Score,Query Cover,E value,Per. ident,Acc. Len,Accession,MAG,query_index
0,xanthine dehydrogenase family protein molybdop...,Pseudomonadota bacterium,368.0,368.0,100%,1.000000e-119,98.89,814,"=HYPERLINK(""https://www.ncbi.nlm.nih.gov/prote...",Burkholderiaceae_bacterium_ERS6626861,Query 1
1,xanthine dehydrogenase family protein molybdop...,Xenophilus aerolatus,365.0,365.0,100%,1.000000e-118,97.78,814,"=HYPERLINK(""https://www.ncbi.nlm.nih.gov/prote...",Burkholderiaceae_bacterium_ERS6626861,Query 1
2,xanthine dehydrogenase family protein molybdop...,Burkholderiaceae bacterium,364.0,364.0,100%,1.000000e-118,98.33,788,"=HYPERLINK(""https://www.ncbi.nlm.nih.gov/prote...",Burkholderiaceae_bacterium_ERS6626861,Query 1
3,carbon monoxide dehydrogenase [Xenophilus aero...,Xenophilus aerolatus,365.0,365.0,100%,2.000000e-118,98.33,814,"=HYPERLINK(""https://www.ncbi.nlm.nih.gov/prote...",Burkholderiaceae_bacterium_ERS6626861,Query 1
4,xanthine dehydrogenase family protein molybdop...,Pseudomonadota bacterium,364.0,364.0,100%,3.000000e-118,97.78,814,"=HYPERLINK(""https://www.ncbi.nlm.nih.gov/prote...",Burkholderiaceae_bacterium_ERS6626861,Query 1
...,...,...,...,...,...,...,...,...,...,...,...
4795,MULTISPECIES: xanthine dehydrogenase family pr...,unclassified Variovorax,1551.0,1551.0,100%,0.000000e+00,95.32,816,"=HYPERLINK(""https://www.ncbi.nlm.nih.gov/prote...",uncultured_Variovorax_sp._ERS6627246,Query 2
4796,xanthine dehydrogenase family protein molybdop...,Variovorax paradoxus,1551.0,1551.0,100%,0.000000e+00,94.44,812,"=HYPERLINK(""https://www.ncbi.nlm.nih.gov/prote...",uncultured_Variovorax_sp._ERS6627246,Query 2
4797,xanthine dehydrogenase family protein molybdop...,Variovorax sp. 54,1550.0,1550.0,100%,0.000000e+00,94.80,794,"=HYPERLINK(""https://www.ncbi.nlm.nih.gov/prote...",uncultured_Variovorax_sp._ERS6627246,Query 2
4798,xanthine dehydrogenase family protein molybdop...,Variovorax sp. PAMC26660,1549.0,1549.0,100%,0.000000e+00,94.30,794,"=HYPERLINK(""https://www.ncbi.nlm.nih.gov/prote...",uncultured_Variovorax_sp._ERS6627246,Query 2


In [52]:
# Fetch the best hit for every query sequence: lowest E value, ties broken by highest percent identity
open_blast = (
    open_blast
    .sort_values(by=['MAG', 'query_index', 'E value', 'Per. ident'], ascending=[True, True, True, False])
    .drop_duplicates(subset=['MAG', 'query_index'])
    .reset_index(drop=True)
)
open_blast

Unnamed: 0,Description,Scientific Name,Max Score,Total Score,Query Cover,E value,Per. ident,Acc. Len,Accession,MAG,query_index
0,xanthine dehydrogenase family protein molybdop...,Pseudomonadota bacterium,368.0,368.0,100%,1e-119,98.89,814,"=HYPERLINK(""https://www.ncbi.nlm.nih.gov/prote...",Burkholderiaceae_bacterium_ERS6626861,Query 1
1,xanthine dehydrogenase family protein molybdop...,Pseudomonadota bacterium,456.0,456.0,99%,6e-153,97.93,814,"=HYPERLINK(""https://www.ncbi.nlm.nih.gov/prote...",Burkholderiaceae_bacterium_ERS6626861,Query 2
2,xanthine dehydrogenase family protein molybdop...,uncultured Acidovorax sp.,1574.0,1574.0,100%,0.0,100.0,795,"=HYPERLINK(""https://www.ncbi.nlm.nih.gov/prote...",uncultured_Acidovorax_sp._ERS6627271,Query 1
3,xanthine dehydrogenase family protein molybdop...,Bosea sp. TND4EK4,1006.0,1006.0,100%,0.0,96.81,769,"=HYPERLINK(""https://www.ncbi.nlm.nih.gov/prote...",uncultured_Aureimonas_sp._ERS6626915,Query 1
4,xanthine dehydrogenase family protein molybdop...,uncultured Aureimonas sp.,1540.0,1540.0,100%,0.0,100.0,784,"=HYPERLINK(""https://www.ncbi.nlm.nih.gov/prote...",uncultured_Aureimonas_sp._ERS6627306,Query 1
5,xanthine dehydrogenase family protein molybdop...,uncultured Aureimonas sp.,1861.0,1861.0,97%,0.0,100.0,989,"=HYPERLINK(""https://www.ncbi.nlm.nih.gov/prote...",uncultured_Aureimonas_sp._ERS6627306,Query 2
6,molybdopterin cofactor-binding domain-containi...,uncultured Aureimonas sp.,211.0,211.0,99%,4e-68,100.0,119,"=HYPERLINK(""https://www.ncbi.nlm.nih.gov/prote...",uncultured_Aureimonas_sp._ERS6627306,Query 3
7,carbon monoxide dehydrogenase [Delftia acidovo...,Delftia acidovorans,256.0,256.0,99%,5.999999999999999e-85,100.0,180,"=HYPERLINK(""https://www.ncbi.nlm.nih.gov/prote...",uncultured_Comamonas_sp._ERS6626602,Query 1
8,xanthine dehydrogenase family protein molybdop...,Delftia tsuruhatensis,1873.0,1873.0,100%,0.0,99.9,973,"=HYPERLINK(""https://www.ncbi.nlm.nih.gov/prote...",uncultured_Comamonas_sp._ERS6626602,Query 2
9,MULTISPECIES: xanthine dehydrogenase family pr...,Delftia,1286.0,1286.0,100%,0.0,99.85,796,"=HYPERLINK(""https://www.ncbi.nlm.nih.gov/prote...",uncultured_Comamonas_sp._ERS6626602,Query 3


In [53]:
open_blast['Description'].unique()

array(['xanthine dehydrogenase family protein molybdopterin-binding subunit [Pseudomonadota bacterium]',
       'xanthine dehydrogenase family protein molybdopterin-binding subunit [uncultured Acidovorax sp.]',
       'xanthine dehydrogenase family protein molybdopterin-binding subunit [Bosea sp. TND4EK4]',
       'xanthine dehydrogenase family protein molybdopterin-binding subunit [uncultured Aureimonas sp.]',
       'molybdopterin cofactor-binding domain-containing protein, partial [uncultured Aureimonas sp.]',
       'carbon monoxide dehydrogenase [Delftia acidovorans]',
       'xanthine dehydrogenase family protein molybdopterin-binding subunit [Delftia tsuruhatensis]',
       'MULTISPECIES: xanthine dehydrogenase family protein molybdopterin-binding subunit [Delftia]',
       'xanthine dehydrogenase family protein molybdopterin-binding subunit [Delftia acidovorans]',
       'MULTISPECIES: xanthine dehydrogenase family protein molybdopterin-binding subunit [Comamonas]',
       'xan

In [54]:
open_blast[open_blast['Description'] == 'carbon monoxide dehydrogenase [Delftia acidovorans]']

Unnamed: 0,Description,Scientific Name,Max Score,Total Score,Query Cover,E value,Per. ident,Acc. Len,Accession,MAG,query_index
7,carbon monoxide dehydrogenase [Delftia acidovo...,Delftia acidovorans,256.0,256.0,99%,5.999999999999999e-85,100.0,180,"=HYPERLINK(""https://www.ncbi.nlm.nih.gov/prote...",uncultured_Comamonas_sp._ERS6626602,Query 1


In [55]:
open_blast[open_blast['Description'] == 'molybdopterin cofactor-binding domain-containing protein, partial [uncultured Aureimonas sp.]']

Unnamed: 0,Description,Scientific Name,Max Score,Total Score,Query Cover,E value,Per. ident,Acc. Len,Accession,MAG,query_index
6,molybdopterin cofactor-binding domain-containi...,uncultured Aureimonas sp.,211.0,211.0,99%,4e-68,100.0,119,"=HYPERLINK(""https://www.ncbi.nlm.nih.gov/prote...",uncultured_Aureimonas_sp._ERS6627306,Query 3


In [56]:
open_blast[open_blast['Description'] == 'molybdopterin-dependent oxidoreductase [Enterobacteriaceae bacterium RIT691]']

Unnamed: 0,Description,Scientific Name,Max Score,Total Score,Query Cover,E value,Per. ident,Acc. Len,Accession,MAG,query_index
29,molybdopterin-dependent oxidoreductase [Entero...,Enterobacteriaceae bacterium RIT691,1553.0,1553.0,100%,0.0,78.66,928,"=HYPERLINK(""https://www.ncbi.nlm.nih.gov/prote...",uncultured_Pantoea_sp._ERS6626969,Query 1
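
The three cells above filter on the exact description string, which includes the organism name in brackets. A substring match on the protein name is less brittle. A minimal sketch follows; the frame `hits` and its MAG names are hypothetical, while the description strings are copied from the output above:

```python
import pandas as pd

# Hypothetical stand-in for open_blast
hits = pd.DataFrame({
    "Description": [
        "xanthine dehydrogenase family protein molybdopterin-binding subunit [Delftia acidovorans]",
        "carbon monoxide dehydrogenase [Delftia acidovorans]",
        "molybdopterin-dependent oxidoreductase [Enterobacteriaceae bacterium RIT691]",
    ],
    "MAG": ["mag_A", "mag_B", "mag_C"],
})

# Case-insensitive literal substring match on the description
is_codh = hits["Description"].str.lower().str.contains(
    "carbon monoxide dehydrogenase", regex=False
)
print(hits.loc[is_codh, "MAG"].tolist())

# Strip the trailing "[organism]" part so that protein names can be compared directly
hits["protein"] = hits["Description"].str.replace(r"\s*\[[^\]]*\]$", "", regex=True)
print(hits["protein"].value_counts())
```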


In [57]:
codh

Unnamed: 0,MAG,contig_id,feature_id,type,location,start,stop,strand,function,aliases,figfam,evidence_codes,nucleotide_sequence,aa_sequence,Organism
0,Burkholderiaceae_bacterium_ERS6626828,Genome269_k127_405606,fig|2030806.1252.peg.2805,peg,Genome269_k127_405606_190211_189204,190211,189204,-,Xanthine and CO dehydrogenases maturation fact...,,,idu(1);Purine_Utilization idu(1);Carbon_monoxi...,atggacaacatggatctgcaagtgctgcgccaggtgcagcaatgga...,MDNMDLQVLRQVQQWMDADQRVVLGTITRTWGSAPRPVGSVVAVRG...,Burkholderiaceae_bacterium
1,Burkholderiaceae_bacterium_ERS6626828,Genome269_k127_422113,fig|2030806.1252.peg.3400,peg,Genome269_k127_422113_152936_152133,152936,152133,-,Xanthine and CO dehydrogenases maturation fact...,,,idu(1);Purine_Utilization idu(1);Carbon_monoxi...,atgcgtccctggcaacgtctcgccgcgcacctgcatcacgcgccgg...,MRPWQRLAAHLHHAPAVLVRVDALQGSGPREVGAWMAVTATELVGT...,Burkholderiaceae_bacterium
2,Burkholderiaceae_bacterium_ERS6626861,Genome302_k127_1008953,fig|2030806.1251.peg.31,peg,Genome302_k127_1008953_868_44,868,44,-,Xanthine and CO dehydrogenases maturation fact...,,,isu;Xanthine_dehydrogenase_subunits isu;Purine...,atgaccggcgccgtcgacgaacttctggaccgcctctcgcgtggcg...,MTGAVDELLDRLSRGDEGVLIRVTATQGSVPREAGTWMAVWTDALT...,Burkholderiaceae_bacterium
3,Burkholderiaceae_bacterium_ERS6626861,Genome302_k127_440979,fig|2030806.1251.peg.1673,peg,Genome302_k127_440979_542_3,542,3,-,Aerobic carbon monoxide dehydrogenase (quinone...,,,,atgggtgcttccgatttcgccaagctgccgcacatcggcgaggcca...,MGASDFAKLPHIGEAIRRKEDARFLTGAGNYTDDVVLPNQAHAVFL...,Burkholderiaceae_bacterium
4,Burkholderiaceae_bacterium_ERS6626861,Genome302_k127_440979,fig|2030806.1251.peg.1674,peg,Genome302_k127_440979_1043_555,1043,555,-,Aerobic carbon monoxide dehydrogenase (quinone...,,,,atgcaggtatccctcaccgtcaacgggcgcgaggtcacgatcgacg...,MQVSLTVNGREVTIDAPPNTLLVQAIREHLRLTGTHVGCDTAQCGA...,Burkholderiaceae_bacterium
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
603,uncultured_Variovorax_sp._ERS6627246,Genome470_k127_495571,fig|114708.15.peg.4765,peg,Genome470_k127_495571_5644_6453,5644,6453,+,Xanthine and CO dehydrogenases maturation fact...,,,isu;Xanthine_dehydrogenase_subunits isu;Molybd...,atgagcggtgcggtcgatcaattgctggcgcgcctgcgccaagaag...,MSGAVDQLLARLRQEDAVLVRVESTQGSAPREAGTWMAVWADGLTA...,uncultured_Variovorax_sp.
604,uncultured_Xylophilus_sp._ERS6626940,Genome381_k127_215459,fig|296832.12.peg.1402,peg,Genome381_k127_215459_1357_2370,1357,2370,+,Xanthine and CO dehydrogenases maturation fact...,,,idu(1);Carbon_monoxide_oxidation idu(1);Purine...,atggaaaacctcgacctcaccgttctgcgcaccctgcgcgactggc...,MENLDLTVLRTLRDWRLAGRCALLATVVRTWGSSPRPVGSTMALRE...,uncultured_Xylophilus_sp.
605,uncultured_Xylophilus_sp._ERS6626940,Genome381_k127_48640,fig|296832.12.peg.2590,peg,Genome381_k127_48640_258_1103,258,1103,+,Xanthine and CO dehydrogenases maturation fact...,,,idu(1);Carbon_monoxide_oxidation idu(1);Purine...,atgccgatggacaggctgtggcaacggtgcgaacggcggctgcacg...,MPMDRLWQRCERRLHEGAPAVLVRVVQAQGSVPRGPGAWMLVFADA...,uncultured_Xylophilus_sp.
606,uncultured_Xylophilus_sp._ERS6627302,Genome526_k127_645719,fig|296832.11.peg.3427,peg,Genome526_k127_645719_16781_17794,16781,17794,+,Xanthine and CO dehydrogenases maturation fact...,,,idu(1);Purine_Utilization idu(1);Xanthine_dehy...,atggaaaacctcgacctcaccgttctgcgcaccctgcgcgactggc...,MENLDLTVLRTLRDWRLAGHRALLATVVRTWGSSPRPVGSTMALRE...,uncultured_Xylophilus_sp.


In [58]:
print(list(codh[codh['MAG'] == 'uncultured_Acidovorax_sp._ERS6627271']['function']))

['Aerobic carbon monoxide dehydrogenase (quinone), small chain (EC 1.2.5.3)', 'Aerobic carbon monoxide dehydrogenase (quinone), large chain (EC 1.2.5.3)', 'Aerobic carbon monoxide dehydrogenase (quinone), medium chain (EC 1.2.5.3)', 'Carbon monoxide oxidation accessory protein CoxD', 'Carbon monoxide oxidation accessory protein CoxE', 'Carbon monoxide oxidation accessory protein CoxG', 'Xanthine and CO dehydrogenases maturation factor, XdhC/CoxF family', 'Carbon monoxide oxidation accessory protein CoxG']


In [59]:
print(list(codh[codh['MAG'] == 'uncultured_Aureimonas_sp._ERS6627306']['function']))

['Aerobic carbon monoxide dehydrogenase (quinone), small chain (EC 1.2.5.3)', 'Aerobic carbon monoxide dehydrogenase (quinone), large chain (EC 1.2.5.3)', 'Carbon monoxide oxidation accessory protein CoxG', 'Carbon monoxide dehydrogenase large chain (EC 1.2.99.2)', 'Aerobic carbon monoxide dehydrogenase (quinone), medium chain (EC 1.2.5.3)', 'Aerobic carbon monoxide dehydrogenase (quinone), small chain (EC 1.2.5.3)', 'Aerobic carbon monoxide dehydrogenase (quinone), medium chain (EC 1.2.5.3)', 'Aerobic carbon monoxide dehydrogenase (quinone), large chain (EC 1.2.5.3)', 'Carbon monoxide oxidation accessory protein CoxD', 'Carbon monoxide oxidation accessory protein CoxE', 'Xanthine and CO dehydrogenases maturation factor, XdhC/CoxF family', 'Carbon monoxide dehydrogenase F protein', 'Xanthine and CO dehydrogenases maturation factor, XdhC/CoxF family']


In [60]:
print(list(codh[codh['MAG'] == 'uncultured_Comamonas_sp._ERS6626602']['function']))

['Xanthine and CO dehydrogenases maturation factor, XdhC/CoxF family', 'Carbon monoxide oxidation accessory protein CoxG', 'Carbon monoxide oxidation accessory protein CoxE', 'Carbon monoxide oxidation accessory protein CoxD', 'Aerobic carbon monoxide dehydrogenase (quinone), medium chain (EC 1.2.5.3)', 'Aerobic carbon monoxide dehydrogenase (quinone), large chain (EC 1.2.5.3)', 'Aerobic-type carbon monoxide dehydrogenase, large subunit CoxL/CutL homologs', 'Aerobic-type carbon monoxide dehydrogenase, small subunit CoxS/CutS homologs', 'Carbon monoxide oxidation accessory protein CoxG', 'Xanthine and CO dehydrogenases maturation factor, XdhC/CoxF family', 'Aerobic carbon monoxide dehydrogenase (quinone), large chain (EC 1.2.5.3)', 'Aerobic carbon monoxide dehydrogenase (quinone), small chain (EC 1.2.5.3)']


In [61]:
print(list(codh[codh['MAG'] == 'uncultured_Comamonas_sp._ERS6626630']['function']))

['Xanthine and CO dehydrogenases maturation factor, XdhC/CoxF family', 'Aerobic carbon monoxide dehydrogenase (quinone), small chain (EC 1.2.5.3)', 'Aerobic carbon monoxide dehydrogenase (quinone), large chain (EC 1.2.5.3)', 'Aerobic carbon monoxide dehydrogenase (quinone), medium chain (EC 1.2.5.3)', 'Carbon monoxide oxidation accessory protein CoxD', 'Carbon monoxide oxidation accessory protein CoxE', 'Carbon monoxide oxidation accessory protein CoxG', 'Xanthine and CO dehydrogenases maturation factor, XdhC/CoxF family', 'Aerobic-type carbon monoxide dehydrogenase, small subunit CoxS/CutS homologs', 'Aerobic-type carbon monoxide dehydrogenase, large subunit CoxL/CutL homologs']


In [62]:
print(list(codh[codh['MAG'] == 'uncultured_Comamonas_sp._ERS6626650']['function']))

['Xanthine and CO dehydrogenases maturation factor, XdhC/CoxF family', 'Carbon monoxide oxidation accessory protein CoxG', 'Carbon monoxide oxidation accessory protein CoxG', 'Carbon monoxide oxidation accessory protein CoxE', 'Carbon monoxide oxidation accessory protein CoxD', 'Aerobic carbon monoxide dehydrogenase (quinone), medium chain (EC 1.2.5.3)', 'Aerobic carbon monoxide dehydrogenase (quinone), large chain (EC 1.2.5.3)', 'Aerobic carbon monoxide dehydrogenase (quinone), small chain (EC 1.2.5.3)', 'Aerobic-type carbon monoxide dehydrogenase, small subunit CoxS/CutS homologs', 'Aerobic-type carbon monoxide dehydrogenase, large subunit CoxL/CutL homologs', 'Aerobic-type carbon monoxide dehydrogenase, small subunit CoxS/CutS homologs', 'Aerobic-type carbon monoxide dehydrogenase, large subunit CoxL/CutL homologs', 'Xanthine and CO dehydrogenases maturation factor, XdhC/CoxF family']


In [63]:
print(list(codh[codh['MAG'] == 'uncultured_Comamonas_sp._ERS6627283']['function']))

['Aerobic-type carbon monoxide dehydrogenase, large subunit CoxL/CutL homologs', 'Aerobic-type carbon monoxide dehydrogenase, small subunit CoxS/CutS homologs', 'Xanthine and CO dehydrogenases maturation factor, XdhC/CoxF family', 'Xanthine and CO dehydrogenases maturation factor, XdhC/CoxF family', 'Aerobic-type carbon monoxide dehydrogenase, large subunit CoxL/CutL homologs', 'Aerobic carbon monoxide dehydrogenase (quinone), small chain (EC 1.2.5.3)', 'Aerobic carbon monoxide dehydrogenase (quinone), large chain (EC 1.2.5.3)', 'Aerobic carbon monoxide dehydrogenase (quinone), medium chain (EC 1.2.5.3)', 'Carbon monoxide oxidation accessory protein CoxD', 'Carbon monoxide oxidation accessory protein CoxE', 'Carbon monoxide oxidation accessory protein CoxG', 'Xanthine and CO dehydrogenases maturation factor, XdhC/CoxF family']


In [64]:
print(list(codh[codh['MAG'] == 'uncultured_Deinococcus_sp._ERS6626797']['function']))

['Xanthine and CO dehydrogenases maturation factor, XdhC/CoxF family', 'Aerobic carbon monoxide dehydrogenase (quinone), small chain (EC 1.2.5.3)', 'Aerobic carbon monoxide dehydrogenase (quinone), large chain (EC 1.2.5.3)', 'Aerobic carbon monoxide dehydrogenase (quinone), medium chain (EC 1.2.5.3)', 'Xanthine and CO dehydrogenases maturation factor, XdhC/CoxF family', 'Carbon monoxide oxidation accessory protein CoxD', 'carbon monoxide dehydrogenase G protein']


In [65]:
print(list(codh[codh['MAG'] == 'uncultured_Deinococcus_sp._ERS6626811']['function']))

['Xanthine and CO dehydrogenases maturation factor, XdhC/CoxF family', 'carbon monoxide dehydrogenase E protein', 'Xanthine and CO dehydrogenases maturation factor, XdhC/CoxF family', 'Aerobic carbon monoxide dehydrogenase (quinone), medium chain (EC 1.2.5.3)', 'Aerobic carbon monoxide dehydrogenase (quinone), large chain (EC 1.2.5.3)', 'Aerobic carbon monoxide dehydrogenase (quinone), small chain (EC 1.2.5.3)']


In [66]:
print(list(codh[codh['MAG'] == 'uncultured_Deinococcus_sp._ERS6626820']['function']))

['Aerobic carbon monoxide dehydrogenase (quinone), small chain (EC 1.2.5.3)', 'Aerobic carbon monoxide dehydrogenase (quinone), large chain (EC 1.2.5.3)', 'Aerobic carbon monoxide dehydrogenase (quinone), medium chain (EC 1.2.5.3)', 'Carbon monoxide oxidation accessory protein CoxD', 'Xanthine and CO dehydrogenases maturation factor, XdhC/CoxF family', 'Xanthine and CO dehydrogenases maturation factor, XdhC/CoxF family', 'Xanthine and CO dehydrogenases maturation factor, XdhC/CoxF family']


In [67]:
print(list(codh[codh['MAG'] == 'uncultured_Deinococcus_sp._ERS6626837']['function']))

['carbon monoxide dehydrogenase G protein', 'Carbon monoxide oxidation accessory protein CoxD', 'Xanthine and CO dehydrogenases maturation factor, XdhC/CoxF family', 'Aerobic carbon monoxide dehydrogenase (quinone), small chain (EC 1.2.5.3)', 'Aerobic carbon monoxide dehydrogenase (quinone), large chain (EC 1.2.5.3)', 'Aerobic carbon monoxide dehydrogenase (quinone), medium chain (EC 1.2.5.3)', 'Xanthine and CO dehydrogenases maturation factor, XdhC/CoxF family']


In [68]:
print(list(codh[codh['MAG'] == 'uncultured_Deinococcus_sp._ERS6627282']['function']))

['Xanthine and CO dehydrogenases maturation factor, XdhC/CoxF family', 'Carbon monoxide oxidation accessory protein CoxD', 'carbon monoxide dehydrogenase G protein', 'Aerobic carbon monoxide dehydrogenase (quinone), medium chain (EC 1.2.5.3)', 'Aerobic carbon monoxide dehydrogenase (quinone), large chain (EC 1.2.5.3)', 'Aerobic carbon monoxide dehydrogenase (quinone), small chain (EC 1.2.5.3)', 'Xanthine and CO dehydrogenases maturation factor, XdhC/CoxF family']


In [69]:
print(list(codh[codh['MAG'] == 'uncultured_Deinococcus_sp._ERS6627293']['function']))

['Xanthine and CO dehydrogenases maturation factor, XdhC/CoxF family', 'Aerobic carbon monoxide dehydrogenase (quinone), small chain (EC 1.2.5.3)', 'Aerobic carbon monoxide dehydrogenase (quinone), large chain (EC 1.2.5.3)', 'Aerobic carbon monoxide dehydrogenase (quinone), medium chain (EC 1.2.5.3)', 'Xanthine and CO dehydrogenases maturation factor, XdhC/CoxF family', 'carbon monoxide dehydrogenase G protein', 'Carbon monoxide oxidation accessory protein CoxD']


In [70]:
print(list(codh[codh['MAG'] == 'uncultured_Deinococcus_sp._ERS6627335']['function']))

['Xanthine and CO dehydrogenases maturation factor, XdhC/CoxF family', 'Carbon monoxide oxidation accessory protein CoxD', 'carbon monoxide dehydrogenase G protein', 'Aerobic carbon monoxide dehydrogenase (quinone), medium chain (EC 1.2.5.3)', 'Aerobic carbon monoxide dehydrogenase (quinone), large chain (EC 1.2.5.3)', 'Aerobic carbon monoxide dehydrogenase (quinone), small chain (EC 1.2.5.3)', 'Xanthine and CO dehydrogenases maturation factor, XdhC/CoxF family']


In [71]:
print(list(codh[codh['MAG'] == 'uncultured_Methylobacterium_sp._ERS6627288']['function']))

['Carbon monoxide oxidation accessory protein CoxD', 'Aerobic carbon monoxide dehydrogenase (quinone), medium chain (EC 1.2.5.3)', 'Aerobic carbon monoxide dehydrogenase (quinone), large chain (EC 1.2.5.3)', 'Aerobic carbon monoxide dehydrogenase (quinone), small chain (EC 1.2.5.3)', 'Carbon monoxide dehydrogenase large chain (EC 1.2.99.2)', 'Aerobic carbon monoxide dehydrogenase (quinone), small chain (EC 1.2.5.3)', 'Carbon monoxide oxidation accessory protein CoxG', 'Aerobic carbon monoxide dehydrogenase (quinone), small chain (EC 1.2.5.3)', 'Aerobic carbon monoxide dehydrogenase (quinone), large chain (EC 1.2.5.3)']


In [72]:
print(list(codh[codh['MAG'] == 'uncultured_Pseudacidovorax_sp._ERS6626577']['function']))

['Xanthine and CO dehydrogenases maturation factor, XdhC/CoxF family', 'Aerobic carbon monoxide dehydrogenase (quinone), small chain (EC 1.2.5.3)', 'Aerobic carbon monoxide dehydrogenase (quinone), large chain (EC 1.2.5.3)', 'Aerobic carbon monoxide dehydrogenase (quinone), medium chain (EC 1.2.5.3)', 'Carbon monoxide oxidation accessory protein CoxD', 'Carbon monoxide oxidation accessory protein CoxE', 'carbon monoxide dehydrogenase G protein']


In [73]:
print(list(codh[codh['MAG'] == 'uncultured_Pseudacidovorax_sp._ERS6626850']['function']))

['Aerobic carbon monoxide dehydrogenase (quinone), small chain (EC 1.2.5.3)', 'Aerobic carbon monoxide dehydrogenase (quinone), large chain (EC 1.2.5.3)', 'Aerobic carbon monoxide dehydrogenase (quinone), medium chain (EC 1.2.5.3)', 'Carbon monoxide oxidation accessory protein CoxD', 'Carbon monoxide oxidation accessory protein CoxE', 'carbon monoxide dehydrogenase G protein', 'Xanthine and CO dehydrogenases maturation factor, XdhC/CoxF family']


In [74]:
print(list(codh[codh['MAG'] == 'uncultured_Pseudacidovorax_sp._ERS6626909']['function']))

['Xanthine and CO dehydrogenases maturation factor, XdhC/CoxF family', 'Carbon monoxide oxidation accessory protein CoxD', 'Aerobic carbon monoxide dehydrogenase (quinone), medium chain (EC 1.2.5.3)', 'Aerobic carbon monoxide dehydrogenase (quinone), large chain (EC 1.2.5.3)', 'Aerobic carbon monoxide dehydrogenase (quinone), small chain (EC 1.2.5.3)']


In [75]:
print(list(codh[codh['MAG'] == 'uncultured_Pseudacidovorax_sp._ERS6627003']['function']))

['carbon monoxide dehydrogenase G protein', 'Carbon monoxide oxidation accessory protein CoxE', 'Carbon monoxide oxidation accessory protein CoxD', 'Xanthine and CO dehydrogenases maturation factor, XdhC/CoxF family', 'Aerobic carbon monoxide dehydrogenase (quinone), small chain (EC 1.2.5.3)', 'Aerobic carbon monoxide dehydrogenase (quinone), large chain (EC 1.2.5.3)']


In [76]:
print(list(codh[codh['MAG'] == 'uncultured_Pseudacidovorax_sp._ERS6627320']['function']))

['Carbon monoxide oxidation accessory protein CoxG', 'Aerobic carbon monoxide dehydrogenase (quinone), small chain (EC 1.2.5.3)', 'Aerobic carbon monoxide dehydrogenase (quinone), large chain (EC 1.2.5.3)', 'Aerobic carbon monoxide dehydrogenase (quinone), medium chain (EC 1.2.5.3)', 'Carbon monoxide oxidation accessory protein CoxD', 'Carbon monoxide oxidation accessory protein CoxE', 'carbon monoxide dehydrogenase G protein', 'Xanthine and CO dehydrogenases maturation factor, XdhC/CoxF family']


In [77]:
print(list(codh[codh['MAG'] == 'uncultured_Pseudacidovorax_sp._ERS6627321']['function']))

['Carbon monoxide oxidation accessory protein CoxG', 'Aerobic carbon monoxide dehydrogenase (quinone), small chain (EC 1.2.5.3)', 'Aerobic carbon monoxide dehydrogenase (quinone), large chain (EC 1.2.5.3)', 'Aerobic carbon monoxide dehydrogenase (quinone), medium chain (EC 1.2.5.3)', 'Carbon monoxide oxidation accessory protein CoxD', 'Carbon monoxide oxidation accessory protein CoxE', 'carbon monoxide dehydrogenase G protein', 'Xanthine and CO dehydrogenases maturation factor, XdhC/CoxF family']


In [78]:
print(list(codh[codh['MAG'] == 'uncultured_Pseudorhodoferax_sp._ERS6626941']['function']))

['Aerobic carbon monoxide dehydrogenase (quinone), medium chain (EC 1.2.5.3)', 'Aerobic carbon monoxide dehydrogenase (quinone), large chain (EC 1.2.5.3)', 'Aerobic carbon monoxide dehydrogenase (quinone), small chain (EC 1.2.5.3)', 'Xanthine and CO dehydrogenases maturation factor, XdhC/CoxF family', 'Carbon monoxide oxidation accessory protein CoxD', 'Carbon monoxide oxidation accessory protein CoxE', 'carbon monoxide dehydrogenase G protein', 'Aerobic-type carbon monoxide dehydrogenase, small subunit CoxS/CutS homologs', 'Carbon monoxide oxidation accessory protein CoxD', 'Carbon monoxide oxidation accessory protein CoxE']


In [79]:
print(list(codh[codh['MAG'] == 'uncultured_Variovorax_sp._ERS6626779']['function']))

['Xanthine and CO dehydrogenases maturation factor, XdhC/CoxF family', 'carbon monoxide dehydrogenase G protein', 'carbon monoxide dehydrogenase E protein', 'Carbon monoxide oxidation accessory protein CoxD', 'Aerobic carbon monoxide dehydrogenase (quinone), medium chain (EC 1.2.5.3)', 'Aerobic carbon monoxide dehydrogenase (quinone), large chain (EC 1.2.5.3)', 'Aerobic carbon monoxide dehydrogenase (quinone), small chain (EC 1.2.5.3)']


In [80]:
print(list(codh[codh['MAG'] == 'uncultured_Variovorax_sp._ERS6627246']['function']))

['Aerobic carbon monoxide dehydrogenase (quinone), large chain (EC 1.2.5.3)', 'Aerobic carbon monoxide dehydrogenase (quinone), medium chain (EC 1.2.5.3)', 'Carbon monoxide oxidation accessory protein CoxD', 'carbon monoxide dehydrogenase E protein', 'carbon monoxide dehydrogenase G protein', 'carbon monoxide dehydrogenase E protein', 'Aerobic carbon monoxide dehydrogenase (quinone), large chain (EC 1.2.5.3)', 'Aerobic carbon monoxide dehydrogenase (quinone), small chain (EC 1.2.5.3)', 'Xanthine and CO dehydrogenases maturation factor, XdhC/CoxF family']


In [81]:
print(list(codh[codh['MAG'] == 'uncultured_Microbacterium_sp._ERS6626906']['function']))

[]


In [82]:
rast[(rast['MAG'] == 'uncultured_Microbacterium_sp._ERS6626906') & (rast['function'] == 'Aerobic carbon monoxide dehydrogenase (quinone), large chain (EC 1.2.5.3)')]

Unnamed: 0,MAG,contig_id,feature_id,type,location,start,stop,strand,function,aliases,figfam,evidence_codes,nucleotide_sequence,aa_sequence


In [83]:
rast[rast['MAG'] == 'uncultured_Microbacterium_sp._ERS6626906']

Unnamed: 0,MAG,contig_id,feature_id,type,location,start,stop,strand,function,aliases,figfam,evidence_codes,nucleotide_sequence,aa_sequence
1021324,uncultured_Microbacterium_sp._ERS6626906,Genome347_k127_100671,fig|191216.135.peg.1,peg,Genome347_k127_100671_320_3,320,3,-,FIG00820327: hypothetical protein,,,isu;Lipid-linked_oligosaccharide_synthesis_rel...,atggcagaccgcagtctccgcggcatccgactcggcgcccagagcc...,MADRSLRGIRLGAQSLQSEDGVVFHDRAQHTYTCSTCGRDTVLTFA...
1021325,uncultured_Microbacterium_sp._ERS6626906,Genome347_k127_100671,fig|191216.135.peg.2,peg,Genome347_k127_100671_1190_414,1190,414,-,Glycerophosphoryl diester phosphodiesterase (E...,,,isu;Glycerol_and_Glycerol-3-phosphate_Uptake_a...,tccgggcccgctccgcgcgtgctcgcgcaccgcggactggtgacgc...,SGPAPRVLAHRGLVTPDAAAQGVAENSFAAVAAAHAAGAVYVESDC...
1021326,uncultured_Microbacterium_sp._ERS6626906,Genome347_k127_100932,fig|191216.135.peg.3,peg,Genome347_k127_100932_1339_2,1339,2,-,Putative formate dehydrogenase oxidoreductase ...,,,idu(3);Formate_hydrogenase,tgccacgaaggttcggggcgtgggctcacggcatccctcgccaccg...,CHEGSGRGLTASLATGKGTADLEDWQNADALFILGVNAASNAPRML...
1021327,uncultured_Microbacterium_sp._ERS6626906,Genome347_k127_101353,fig|191216.135.peg.4,peg,Genome347_k127_101353_2_703,2,703,+,hypothetical protein,,,,ccgtggatccagcccgtgctcgatgcggcgccgatctgggtgggtg...,PWIQPVLDAAPIWVGAGGEELAAIVGWHYPFTVWIAFLLAGMGVGR...
1021328,uncultured_Microbacterium_sp._ERS6626906,Genome347_k127_101353,fig|191216.135.peg.5,peg,Genome347_k127_101353_713_2137,713,2137,+,Argininosuccinate lyase (EC 4.3.2.1),,,isu;Arginine_Biosynthesis_--_gjo isu;Arginine_...,atgagcagcgagaccggtcagtcgacgaacgagggcgccctgtggg...,MSSETGQSTNEGALWGARFSGGPSPELAALSRSTHFDWDLALYDIQ...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1024458,uncultured_Microbacterium_sp._ERS6626906,Genome347_k127_96279,fig|191216.135.peg.3111,peg,Genome347_k127_96279_2027_900,2027,900,-,Glutamine--fructose-6-phosphate aminotransfera...,,,,gacgggagttcccggccgcgcggacgatggccggtgccgcagacgg...,DGSSRPRGRWPVPQTALMASTGRHRDHMGNEIDEQPAVARRVIDAH...
1024459,uncultured_Microbacterium_sp._ERS6626906,Genome347_k127_968,fig|191216.135.peg.3112,peg,Genome347_k127_968_2_277,2,277,+,hypothetical protein,,,,cggtcgtaccgtgacggcatgagcaccgcactgtggaacgacctcg...,RSYRDGMSTALWNDLAPAPRSAGWGAATHVAAGLWRIADPRGIVVG...
1024460,uncultured_Microbacterium_sp._ERS6626906,Genome347_k127_968,fig|191216.135.peg.3113,peg,Genome347_k127_968_1348_302,1348,302,-,"Transcriptional regulator, GntR family domain ...",,,idu(2);Threonine_and_Homoserine_Biosynthesis i...,gcccaagaggctcctgctctgctgtcgcgcgtcggctacgacgtgg...,AQEAPALLSRVGYDVVGDPALRVAIADHYGRRGMPTSTDQILVTSG...
1024461,uncultured_Microbacterium_sp._ERS6626906,Genome347_k127_97343,fig|191216.135.peg.3114,peg,Genome347_k127_97343_2_658,2,658,+,Dihydrolipoamide dehydrogenase (EC 1.8.1.4),,,idu(2);5-FCL-like_protein,gcctacaccgcgaaggacggctcgcagggctcgatcgacgccgacc...,AYTAKDGSQGSIDADRVLMSIGFAPKVYGFGLENTGVKLTERGAID...
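
The empty result above comes from an exact string match on `function`. A case-insensitive keyword search is one way to double-check that a MAG has no CO-related annotation under a slightly different name. The block below is only a sketch: `rast_demo`, the MAG names `mag_A`/`mag_B`, and the helper `find_functions` are hypothetical stand-ins, not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for the `rast` table; the rows are invented for illustration only.
rast_demo = pd.DataFrame({
    'MAG': ['mag_A', 'mag_A', 'mag_B'],
    'function': [
        'Argininosuccinate lyase (EC 4.3.2.1)',
        'hypothetical protein',
        'carbon monoxide dehydrogenase G protein',
    ],
})

def find_functions(df, mag, keyword):
    """Return the rows of one MAG whose function contains `keyword`, ignoring case."""
    in_mag = df['MAG'] == mag
    # regex=False treats the keyword literally; na=False skips missing annotations.
    has_kw = df['function'].str.contains(keyword, case=False, regex=False, na=False)
    return df[in_mag & has_kw]

hits_a = find_functions(rast_demo, 'mag_A', 'carbon monoxide')
hits_b = find_functions(rast_demo, 'mag_B', 'Carbon Monoxide')
print(len(hits_a), len(hits_b))
```

With the real table, the call would presumably look like `find_functions(rast, 'uncultured_Microbacterium_sp._ERS6626906', 'carbon monoxide')`.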


In [84]:
print(list(codh[codh['MAG'] == 'Burkholderiaceae_bacterium_ERS6626861']['function']))

['Xanthine and CO dehydrogenases maturation factor, XdhC/CoxF family', 'Aerobic carbon monoxide dehydrogenase (quinone), large chain (EC 1.2.5.3)', 'Aerobic carbon monoxide dehydrogenase (quinone), small chain (EC 1.2.5.3)', 'carbon monoxide dehydrogenase G protein', 'carbon monoxide dehydrogenase E protein', 'Carbon monoxide oxidation accessory protein CoxD', 'Aerobic carbon monoxide dehydrogenase (quinone), medium chain (EC 1.2.5.3)', 'Aerobic carbon monoxide dehydrogenase (quinone), large chain (EC 1.2.5.3)']
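
The cells above inspect the function list of each MAG one at a time. A possible way to automate this is sketched below, under the assumption that the large, medium and small chains are the subunits of interest; this criterion, the table `codh_demo`, and the helper `missing_core` are illustrative and are not taken from the original analysis.

```python
import pandas as pd

# Hypothetical mini version of `codh`; MAG names and rows are made up.
codh_demo = pd.DataFrame({
    'MAG': ['mag_A', 'mag_A', 'mag_A', 'mag_B', 'mag_B'],
    'function': [
        'Aerobic carbon monoxide dehydrogenase (quinone), large chain (EC 1.2.5.3)',
        'Aerobic carbon monoxide dehydrogenase (quinone), medium chain (EC 1.2.5.3)',
        'Aerobic carbon monoxide dehydrogenase (quinone), small chain (EC 1.2.5.3)',
        'Aerobic carbon monoxide dehydrogenase (quinone), large chain (EC 1.2.5.3)',
        'Carbon monoxide oxidation accessory protein CoxD',
    ],
})

# Assumed set of core subunits (function names as they appear in the RAST output).
CORE = {
    'Aerobic carbon monoxide dehydrogenase (quinone), large chain (EC 1.2.5.3)',
    'Aerobic carbon monoxide dehydrogenase (quinone), medium chain (EC 1.2.5.3)',
    'Aerobic carbon monoxide dehydrogenase (quinone), small chain (EC 1.2.5.3)',
}

def missing_core(df):
    """Map each MAG to the sorted list of core subunits it lacks."""
    return {mag: sorted(CORE - set(grp['function']))
            for mag, grp in df.groupby('MAG')}

missing = missing_core(codh_demo)
for mag, absent in missing.items():
    print(mag, 'complete' if not absent else f'missing: {absent}')
```

A MAG with no CODH rows at all never appears in the result, so it would still need the separate check against the full `rast` table.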


In [85]:
carriers_list = ['uncultured_Acidovorax_sp._ERS6627271', 'uncultured_Aureimonas_sp._ERS6627306', 'uncultured_Comamonas_sp._ERS6626602', 'uncultured_Comamonas_sp._ERS6626630',
                'uncultured_Comamonas_sp._ERS6626650', 'uncultured_Comamonas_sp._ERS6627283', 'uncultured_Deinococcus_sp._ERS6626797', 'uncultured_Deinococcus_sp._ERS6626811',
                'uncultured_Deinococcus_sp._ERS6626820', 'uncultured_Deinococcus_sp._ERS6626837', 'uncultured_Deinococcus_sp._ERS6627282', 'uncultured_Deinococcus_sp._ERS6627293',
                'uncultured_Deinococcus_sp._ERS6627335', 'uncultured_Methylobacterium_sp._ERS6627288', 'uncultured_Pseudacidovorax_sp._ERS6626577', 'uncultured_Pseudacidovorax_sp._ERS6626850',
                'uncultured_Pseudacidovorax_sp._ERS6626909', 'uncultured_Pseudacidovorax_sp._ERS6627003', 'uncultured_Pseudacidovorax_sp._ERS6627320', 'uncultured_Pseudacidovorax_sp._ERS6627321',
                'uncultured_Pseudorhodoferax_sp._ERS6626941', 'uncultured_Variovorax_sp._ERS6626779', 'uncultured_Variovorax_sp._ERS6627246', 'uncultured_Microbacterium_sp._ERS6626906', 'Burkholderiaceae_bacterium_ERS6626861']

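# Keep only the CODH rows of the shortlisted MAGs; .copy() makes `carriers` independent of `codh`
carriers = codh[codh['MAG'].isin(carriers_list)].copy()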
carriers

Unnamed: 0,MAG,contig_id,feature_id,type,location,start,stop,strand,function,aliases,figfam,evidence_codes,nucleotide_sequence,aa_sequence,Organism
2,Burkholderiaceae_bacterium_ERS6626861,Genome302_k127_1008953,fig|2030806.1251.peg.31,peg,Genome302_k127_1008953_868_44,868,44,-,Xanthine and CO dehydrogenases maturation fact...,,,isu;Xanthine_dehydrogenase_subunits isu;Purine...,atgaccggcgccgtcgacgaacttctggaccgcctctcgcgtggcg...,MTGAVDELLDRLSRGDEGVLIRVTATQGSVPREAGTWMAVWTDALT...,Burkholderiaceae_bacterium
3,Burkholderiaceae_bacterium_ERS6626861,Genome302_k127_440979,fig|2030806.1251.peg.1673,peg,Genome302_k127_440979_542_3,542,3,-,Aerobic carbon monoxide dehydrogenase (quinone...,,,,atgggtgcttccgatttcgccaagctgccgcacatcggcgaggcca...,MGASDFAKLPHIGEAIRRKEDARFLTGAGNYTDDVVLPNQAHAVFL...,Burkholderiaceae_bacterium
4,Burkholderiaceae_bacterium_ERS6626861,Genome302_k127_440979,fig|2030806.1251.peg.1674,peg,Genome302_k127_440979_1043_555,1043,555,-,Aerobic carbon monoxide dehydrogenase (quinone...,,,,atgcaggtatccctcaccgtcaacgggcgcgaggtcacgatcgacg...,MQVSLTVNGREVTIDAPPNTLLVQAIREHLRLTGTHVGCDTAQCGA...,Burkholderiaceae_bacterium
5,Burkholderiaceae_bacterium_ERS6626861,Genome302_k127_836765,fig|2030806.1251.peg.2765,peg,Genome302_k127_836765_1907_1332,1907,1332,-,carbon monoxide dehydrogenase G protein,,,,atggacatgcaaggcacccgccagctcggcgtcacccaggaacagg...,MDMQGTRQLGVTQEQAWEALNDPETLKGCLPGCDKFESTGENQYAV...,Burkholderiaceae_bacterium
6,Burkholderiaceae_bacterium_ERS6626861,Genome302_k127_836765,fig|2030806.1251.peg.2766,peg,Genome302_k127_836765_3145_1934,3145,1934,-,carbon monoxide dehydrogenase E protein,,,,atggagcgcgtgcagcaactcggcgacgcccggcgcggcaagctgg...,MERVQQLGDARRGKLAGNLVGFGRALRRAGVRLDASRIALAAEAAQ...,Burkholderiaceae_bacterium
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
599,uncultured_Variovorax_sp._ERS6627246,Genome470_k127_318739,fig|114708.15.peg.2866,peg,Genome470_k127_318739_4423_3758,4423,3758,-,carbon monoxide dehydrogenase G protein,,,,atggaaatgctcggaaaccgccgcctaggcgtcacccaacaacagg...,MEMLGNRRLGVTQQQAWEALNDPETLKKCIPGCDKFELTGENQYSV...,uncultured_Variovorax_sp.
600,uncultured_Variovorax_sp._ERS6627246,Genome470_k127_318739,fig|114708.15.peg.2867,peg,Genome470_k127_318739_4889_4458,4889,4458,-,carbon monoxide dehydrogenase E protein,,,,ttcgctttcggcacccggctgagcgacctgacgcccgcattccggc...,FAFGTRLSDLTPAFRLADTDEMLGAASLAIDDFAGGTRFGASIAEL...,uncultured_Variovorax_sp.
601,uncultured_Variovorax_sp._ERS6627246,Genome470_k127_338375,fig|114708.15.peg.3042,peg,Genome470_k127_338375_2375_3,2375,3,-,Aerobic carbon monoxide dehydrogenase (quinone...,,,,atgggcgcttccgatttctcgaacctgccgcacatcggcgaggcag...,MGASDFSNLPHIGEAVRRKEDYRFLTGSGNYTDDITLANQSHAVFV...,uncultured_Variovorax_sp.
602,uncultured_Variovorax_sp._ERS6627246,Genome470_k127_338375,fig|114708.15.peg.3043,peg,Genome470_k127_338375_2883_2386,2883,2386,-,Aerobic carbon monoxide dehydrogenase (quinone...,,,,atgcaggtcgctttcaaggtcaacggccgacaggtcacggtcgacg...,MQVAFKVNGRQVTVDAPPNTFLVHALREHLHLTGTHVGCDTAQCGA...,uncultured_Variovorax_sp.


In [86]:
# Renumber the rows from 0 after filtering
carriers = carriers.reset_index(drop=True)
carriers

Unnamed: 0,MAG,contig_id,feature_id,type,location,start,stop,strand,function,aliases,figfam,evidence_codes,nucleotide_sequence,aa_sequence,Organism
0,Burkholderiaceae_bacterium_ERS6626861,Genome302_k127_1008953,fig|2030806.1251.peg.31,peg,Genome302_k127_1008953_868_44,868,44,-,Xanthine and CO dehydrogenases maturation fact...,,,isu;Xanthine_dehydrogenase_subunits isu;Purine...,atgaccggcgccgtcgacgaacttctggaccgcctctcgcgtggcg...,MTGAVDELLDRLSRGDEGVLIRVTATQGSVPREAGTWMAVWTDALT...,Burkholderiaceae_bacterium
1,Burkholderiaceae_bacterium_ERS6626861,Genome302_k127_440979,fig|2030806.1251.peg.1673,peg,Genome302_k127_440979_542_3,542,3,-,Aerobic carbon monoxide dehydrogenase (quinone...,,,,atgggtgcttccgatttcgccaagctgccgcacatcggcgaggcca...,MGASDFAKLPHIGEAIRRKEDARFLTGAGNYTDDVVLPNQAHAVFL...,Burkholderiaceae_bacterium
2,Burkholderiaceae_bacterium_ERS6626861,Genome302_k127_440979,fig|2030806.1251.peg.1674,peg,Genome302_k127_440979_1043_555,1043,555,-,Aerobic carbon monoxide dehydrogenase (quinone...,,,,atgcaggtatccctcaccgtcaacgggcgcgaggtcacgatcgacg...,MQVSLTVNGREVTIDAPPNTLLVQAIREHLRLTGTHVGCDTAQCGA...,Burkholderiaceae_bacterium
3,Burkholderiaceae_bacterium_ERS6626861,Genome302_k127_836765,fig|2030806.1251.peg.2765,peg,Genome302_k127_836765_1907_1332,1907,1332,-,carbon monoxide dehydrogenase G protein,,,,atggacatgcaaggcacccgccagctcggcgtcacccaggaacagg...,MDMQGTRQLGVTQEQAWEALNDPETLKGCLPGCDKFESTGENQYAV...,Burkholderiaceae_bacterium
4,Burkholderiaceae_bacterium_ERS6626861,Genome302_k127_836765,fig|2030806.1251.peg.2766,peg,Genome302_k127_836765_3145_1934,3145,1934,-,carbon monoxide dehydrogenase E protein,,,,atggagcgcgtgcagcaactcggcgacgcccggcgcggcaagctgg...,MERVQQLGDARRGKLAGNLVGFGRALRRAGVRLDASRIALAAEAAQ...,Burkholderiaceae_bacterium
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
195,uncultured_Variovorax_sp._ERS6627246,Genome470_k127_318739,fig|114708.15.peg.2866,peg,Genome470_k127_318739_4423_3758,4423,3758,-,carbon monoxide dehydrogenase G protein,,,,atggaaatgctcggaaaccgccgcctaggcgtcacccaacaacagg...,MEMLGNRRLGVTQQQAWEALNDPETLKKCIPGCDKFELTGENQYSV...,uncultured_Variovorax_sp.
196,uncultured_Variovorax_sp._ERS6627246,Genome470_k127_318739,fig|114708.15.peg.2867,peg,Genome470_k127_318739_4889_4458,4889,4458,-,carbon monoxide dehydrogenase E protein,,,,ttcgctttcggcacccggctgagcgacctgacgcccgcattccggc...,FAFGTRLSDLTPAFRLADTDEMLGAASLAIDDFAGGTRFGASIAEL...,uncultured_Variovorax_sp.
197,uncultured_Variovorax_sp._ERS6627246,Genome470_k127_338375,fig|114708.15.peg.3042,peg,Genome470_k127_338375_2375_3,2375,3,-,Aerobic carbon monoxide dehydrogenase (quinone...,,,,atgggcgcttccgatttctcgaacctgccgcacatcggcgaggcag...,MGASDFSNLPHIGEAVRRKEDYRFLTGSGNYTDDITLANQSHAVFV...,uncultured_Variovorax_sp.
198,uncultured_Variovorax_sp._ERS6627246,Genome470_k127_338375,fig|114708.15.peg.3043,peg,Genome470_k127_338375_2883_2386,2883,2386,-,Aerobic carbon monoxide dehydrogenase (quinone...,,,,atgcaggtcgctttcaaggtcaacggccgacaggtcacggtcgacg...,MQVAFKVNGRQVTVDAPPNTFLVHALREHLHLTGTHVGCDTAQCGA...,uncultured_Variovorax_sp.


In [87]:
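# For each function, join the MAG names of all matching rows (a MAG repeats once per annotated copy)
function_group = carriers.groupby('function').agg({'MAG': lambda x: ', '.join(x)}).reset_index()
# Keep only functions found on more than one row (the joined string then contains ', ')
carriers_function = function_group[function_group['MAG'].str.contains(', ')]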
carriers_function

Unnamed: 0,function,MAG
0,Aerobic carbon monoxide dehydrogenase (quinone...,"Burkholderiaceae_bacterium_ERS6626861, Burkhol..."
1,Aerobic carbon monoxide dehydrogenase (quinone...,"Burkholderiaceae_bacterium_ERS6626861, uncultu..."
2,Aerobic carbon monoxide dehydrogenase (quinone...,"Burkholderiaceae_bacterium_ERS6626861, uncultu..."
3,"Aerobic-type carbon monoxide dehydrogenase, la...","uncultured_Comamonas_sp._ERS6626602, unculture..."
4,"Aerobic-type carbon monoxide dehydrogenase, sm...","uncultured_Comamonas_sp._ERS6626602, unculture..."
6,Carbon monoxide dehydrogenase large chain (EC ...,"uncultured_Aureimonas_sp._ERS6627306, uncultur..."
7,Carbon monoxide oxidation accessory protein CoxD,"Burkholderiaceae_bacterium_ERS6626861, uncultu..."
8,Carbon monoxide oxidation accessory protein CoxE,"uncultured_Acidovorax_sp._ERS6627271, uncultur..."
9,Carbon monoxide oxidation accessory protein CoxG,"uncultured_Acidovorax_sp._ERS6627271, uncultur..."
10,Xanthine and CO dehydrogenases maturation fact...,"Burkholderiaceae_bacterium_ERS6626861, uncultu..."


In [88]:
list(carriers_function['function'].unique())

['Aerobic carbon monoxide dehydrogenase (quinone), large chain (EC 1.2.5.3)',
 'Aerobic carbon monoxide dehydrogenase (quinone), medium chain (EC 1.2.5.3)',
 'Aerobic carbon monoxide dehydrogenase (quinone), small chain (EC 1.2.5.3)',
 'Aerobic-type carbon monoxide dehydrogenase, large subunit CoxL/CutL homologs',
 'Aerobic-type carbon monoxide dehydrogenase, small subunit CoxS/CutS homologs',
 'Carbon monoxide dehydrogenase large chain (EC 1.2.99.2)',
 'Carbon monoxide oxidation accessory protein CoxD',
 'Carbon monoxide oxidation accessory protein CoxE',
 'Carbon monoxide oxidation accessory protein CoxG',
 'Xanthine and CO dehydrogenases maturation factor, XdhC/CoxF family',
 'carbon monoxide dehydrogenase E protein',
 'carbon monoxide dehydrogenase G protein']
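
The per-function cells that follow print one long comma-joined string each. As an alternative view, a function-by-MAG count table can be built in one step with `pd.crosstab`. The sketch below uses a made-up frame `carriers_demo` with shortened, hypothetical function labels; with the real data the same call would presumably take `carriers['function']` and `carriers['MAG']`.

```python
import pandas as pd

# Invented example rows; 'CoxL' and 'CoxS' are shortened placeholder labels.
carriers_demo = pd.DataFrame({
    'MAG': ['mag_A', 'mag_A', 'mag_A', 'mag_B'],
    'function': ['CoxL', 'CoxL', 'CoxS', 'CoxL'],
})

# Rows are functions, columns are MAGs, cells are the number of annotated copies.
counts = pd.crosstab(carriers_demo['function'], carriers_demo['MAG'])

# Boolean presence/absence matrix derived from the counts.
presence = counts > 0

print(counts)
print(presence)
```

Copy numbers stay visible in `counts`, while `presence` answers the simpler question of which MAG carries which function.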

In [89]:
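# Each function has a single comma-joined string, so the list has one element; a MAG appears once per annotated copy
list(carriers_function[carriers_function['function'] == 'Aerobic carbon monoxide dehydrogenase (quinone), large chain (EC 1.2.5.3)']['MAG'].unique())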

['Burkholderiaceae_bacterium_ERS6626861, Burkholderiaceae_bacterium_ERS6626861, uncultured_Acidovorax_sp._ERS6627271, uncultured_Aureimonas_sp._ERS6627306, uncultured_Aureimonas_sp._ERS6627306, uncultured_Comamonas_sp._ERS6626602, uncultured_Comamonas_sp._ERS6626602, uncultured_Comamonas_sp._ERS6626630, uncultured_Comamonas_sp._ERS6626650, uncultured_Comamonas_sp._ERS6627283, uncultured_Deinococcus_sp._ERS6626797, uncultured_Deinococcus_sp._ERS6626811, uncultured_Deinococcus_sp._ERS6626820, uncultured_Deinococcus_sp._ERS6626837, uncultured_Deinococcus_sp._ERS6627282, uncultured_Deinococcus_sp._ERS6627293, uncultured_Deinococcus_sp._ERS6627335, uncultured_Methylobacterium_sp._ERS6627288, uncultured_Methylobacterium_sp._ERS6627288, uncultured_Pseudacidovorax_sp._ERS6626577, uncultured_Pseudacidovorax_sp._ERS6626850, uncultured_Pseudacidovorax_sp._ERS6626909, uncultured_Pseudacidovorax_sp._ERS6627003, uncultured_Pseudacidovorax_sp._ERS6627320, uncultured_Pseudacidovorax_sp._ERS6627321, un

In [90]:
list(carriers_function[carriers_function['function'] == 'Aerobic carbon monoxide dehydrogenase (quinone), medium chain (EC 1.2.5.3)']['MAG'].unique())

['Burkholderiaceae_bacterium_ERS6626861, uncultured_Acidovorax_sp._ERS6627271, uncultured_Aureimonas_sp._ERS6627306, uncultured_Aureimonas_sp._ERS6627306, uncultured_Comamonas_sp._ERS6626602, uncultured_Comamonas_sp._ERS6626630, uncultured_Comamonas_sp._ERS6626650, uncultured_Comamonas_sp._ERS6627283, uncultured_Deinococcus_sp._ERS6626797, uncultured_Deinococcus_sp._ERS6626811, uncultured_Deinococcus_sp._ERS6626820, uncultured_Deinococcus_sp._ERS6626837, uncultured_Deinococcus_sp._ERS6627282, uncultured_Deinococcus_sp._ERS6627293, uncultured_Deinococcus_sp._ERS6627335, uncultured_Methylobacterium_sp._ERS6627288, uncultured_Pseudacidovorax_sp._ERS6626577, uncultured_Pseudacidovorax_sp._ERS6626850, uncultured_Pseudacidovorax_sp._ERS6626909, uncultured_Pseudacidovorax_sp._ERS6627320, uncultured_Pseudacidovorax_sp._ERS6627321, uncultured_Pseudorhodoferax_sp._ERS6626941, uncultured_Variovorax_sp._ERS6626779, uncultured_Variovorax_sp._ERS6627246']

In [91]:
list(carriers_function[carriers_function['function'] == 'Aerobic carbon monoxide dehydrogenase (quinone), small chain (EC 1.2.5.3)']['MAG'].unique())

['Burkholderiaceae_bacterium_ERS6626861, uncultured_Acidovorax_sp._ERS6627271, uncultured_Aureimonas_sp._ERS6627306, uncultured_Aureimonas_sp._ERS6627306, uncultured_Comamonas_sp._ERS6626602, uncultured_Comamonas_sp._ERS6626630, uncultured_Comamonas_sp._ERS6626650, uncultured_Comamonas_sp._ERS6627283, uncultured_Deinococcus_sp._ERS6626797, uncultured_Deinococcus_sp._ERS6626811, uncultured_Deinococcus_sp._ERS6626820, uncultured_Deinococcus_sp._ERS6626837, uncultured_Deinococcus_sp._ERS6627282, uncultured_Deinococcus_sp._ERS6627293, uncultured_Deinococcus_sp._ERS6627335, uncultured_Methylobacterium_sp._ERS6627288, uncultured_Methylobacterium_sp._ERS6627288, uncultured_Methylobacterium_sp._ERS6627288, uncultured_Pseudacidovorax_sp._ERS6626577, uncultured_Pseudacidovorax_sp._ERS6626850, uncultured_Pseudacidovorax_sp._ERS6626909, uncultured_Pseudacidovorax_sp._ERS6627003, uncultured_Pseudacidovorax_sp._ERS6627320, uncultured_Pseudacidovorax_sp._ERS6627321, uncultured_Pseudorhodoferax_sp._ER

In [92]:
list(carriers_function[carriers_function['function'] == 'Aerobic-type carbon monoxide dehydrogenase, large subunit CoxL/CutL homologs']['MAG'].unique())

['uncultured_Comamonas_sp._ERS6626602, uncultured_Comamonas_sp._ERS6626630, uncultured_Comamonas_sp._ERS6626650, uncultured_Comamonas_sp._ERS6626650, uncultured_Comamonas_sp._ERS6627283, uncultured_Comamonas_sp._ERS6627283']

In [93]:
list(carriers_function[carriers_function['function'] == 'Aerobic-type carbon monoxide dehydrogenase, small subunit CoxS/CutS homologs']['MAG'].unique())

['uncultured_Comamonas_sp._ERS6626602, uncultured_Comamonas_sp._ERS6626630, uncultured_Comamonas_sp._ERS6626650, uncultured_Comamonas_sp._ERS6626650, uncultured_Comamonas_sp._ERS6627283, uncultured_Pseudorhodoferax_sp._ERS6626941']

In [94]:
list(carriers_function[carriers_function['function'] == 'Carbon monoxide dehydrogenase large chain (EC 1.2.99.2)']['MAG'].unique())

['uncultured_Aureimonas_sp._ERS6627306, uncultured_Methylobacterium_sp._ERS6627288']

In [95]:
list(carriers_function[carriers_function['function'] == 'Carbon monoxide oxidation accessory protein CoxD']['MAG'].unique())

['Burkholderiaceae_bacterium_ERS6626861, uncultured_Acidovorax_sp._ERS6627271, uncultured_Aureimonas_sp._ERS6627306, uncultured_Comamonas_sp._ERS6626602, uncultured_Comamonas_sp._ERS6626630, uncultured_Comamonas_sp._ERS6626650, uncultured_Comamonas_sp._ERS6627283, uncultured_Deinococcus_sp._ERS6626797, uncultured_Deinococcus_sp._ERS6626820, uncultured_Deinococcus_sp._ERS6626837, uncultured_Deinococcus_sp._ERS6627282, uncultured_Deinococcus_sp._ERS6627293, uncultured_Deinococcus_sp._ERS6627335, uncultured_Methylobacterium_sp._ERS6627288, uncultured_Pseudacidovorax_sp._ERS6626577, uncultured_Pseudacidovorax_sp._ERS6626850, uncultured_Pseudacidovorax_sp._ERS6626909, uncultured_Pseudacidovorax_sp._ERS6627003, uncultured_Pseudacidovorax_sp._ERS6627320, uncultured_Pseudacidovorax_sp._ERS6627321, uncultured_Pseudorhodoferax_sp._ERS6626941, uncultured_Pseudorhodoferax_sp._ERS6626941, uncultured_Variovorax_sp._ERS6626779, uncultured_Variovorax_sp._ERS6627246']

In [96]:
list(carriers_function[carriers_function['function'] == 'Carbon monoxide oxidation accessory protein CoxE']['MAG'].unique())

['uncultured_Acidovorax_sp._ERS6627271, uncultured_Aureimonas_sp._ERS6627306, uncultured_Comamonas_sp._ERS6626602, uncultured_Comamonas_sp._ERS6626630, uncultured_Comamonas_sp._ERS6626650, uncultured_Comamonas_sp._ERS6627283, uncultured_Pseudacidovorax_sp._ERS6626577, uncultured_Pseudacidovorax_sp._ERS6626850, uncultured_Pseudacidovorax_sp._ERS6627003, uncultured_Pseudacidovorax_sp._ERS6627320, uncultured_Pseudacidovorax_sp._ERS6627321, uncultured_Pseudorhodoferax_sp._ERS6626941, uncultured_Pseudorhodoferax_sp._ERS6626941']

In [97]:
list(carriers_function[carriers_function['function'] == 'Carbon monoxide oxidation accessory protein CoxG']['MAG'].unique())

['uncultured_Acidovorax_sp._ERS6627271, uncultured_Acidovorax_sp._ERS6627271, uncultured_Aureimonas_sp._ERS6627306, uncultured_Comamonas_sp._ERS6626602, uncultured_Comamonas_sp._ERS6626602, uncultured_Comamonas_sp._ERS6626630, uncultured_Comamonas_sp._ERS6626650, uncultured_Comamonas_sp._ERS6626650, uncultured_Comamonas_sp._ERS6627283, uncultured_Methylobacterium_sp._ERS6627288, uncultured_Pseudacidovorax_sp._ERS6627320, uncultured_Pseudacidovorax_sp._ERS6627321']

In [98]:
list(carriers_function[carriers_function['function'] == 'Xanthine and CO dehydrogenases maturation factor, XdhC/CoxF family']['MAG'].unique())

['Burkholderiaceae_bacterium_ERS6626861, uncultured_Acidovorax_sp._ERS6627271, uncultured_Aureimonas_sp._ERS6627306, uncultured_Aureimonas_sp._ERS6627306, uncultured_Comamonas_sp._ERS6626602, uncultured_Comamonas_sp._ERS6626602, uncultured_Comamonas_sp._ERS6626630, uncultured_Comamonas_sp._ERS6626630, uncultured_Comamonas_sp._ERS6626650, uncultured_Comamonas_sp._ERS6626650, uncultured_Comamonas_sp._ERS6627283, uncultured_Comamonas_sp._ERS6627283, uncultured_Comamonas_sp._ERS6627283, uncultured_Deinococcus_sp._ERS6626797, uncultured_Deinococcus_sp._ERS6626797, uncultured_Deinococcus_sp._ERS6626811, uncultured_Deinococcus_sp._ERS6626811, uncultured_Deinococcus_sp._ERS6626820, uncultured_Deinococcus_sp._ERS6626820, uncultured_Deinococcus_sp._ERS6626820, uncultured_Deinococcus_sp._ERS6626837, uncultured_Deinococcus_sp._ERS6626837, uncultured_Deinococcus_sp._ERS6627282, uncultured_Deinococcus_sp._ERS6627282, uncultured_Deinococcus_sp._ERS6627293, uncultured_Deinococcus_sp._ERS6627293, uncul

In [99]:
list(carriers_function[carriers_function['function'] == 'carbon monoxide dehydrogenase E protein']['MAG'].unique())

['Burkholderiaceae_bacterium_ERS6626861, uncultured_Deinococcus_sp._ERS6626811, uncultured_Variovorax_sp._ERS6626779, uncultured_Variovorax_sp._ERS6627246, uncultured_Variovorax_sp._ERS6627246']

In [100]:
list(carriers_function[carriers_function['function'] == 'carbon monoxide dehydrogenase G protein']['MAG'].unique())

['Burkholderiaceae_bacterium_ERS6626861, uncultured_Deinococcus_sp._ERS6626797, uncultured_Deinococcus_sp._ERS6626837, uncultured_Deinococcus_sp._ERS6627282, uncultured_Deinococcus_sp._ERS6627293, uncultured_Deinococcus_sp._ERS6627335, uncultured_Pseudacidovorax_sp._ERS6626577, uncultured_Pseudacidovorax_sp._ERS6626850, uncultured_Pseudacidovorax_sp._ERS6627003, uncultured_Pseudacidovorax_sp._ERS6627320, uncultured_Pseudacidovorax_sp._ERS6627321, uncultured_Pseudorhodoferax_sp._ERS6626941, uncultured_Variovorax_sp._ERS6626779, uncultured_Variovorax_sp._ERS6627246']

In [101]:
list(codh[codh['function'] == 'Aerobic carbon monoxide dehydrogenase molybdenum cofactor insertion protein CoxF']['MAG'].unique())

['uncultured_Cupriavidus_sp._ERS6626993']
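
The lookups above repeat the same expression for every function name. A small helper can return the de-duplicated MAGs for any function directly from the unaggregated table. This is a sketch only: `codh_demo` and `mags_with_function` are hypothetical names, and the rows are invented.

```python
import pandas as pd

# Made-up rows standing in for `codh`.
codh_demo = pd.DataFrame({
    'MAG': ['mag_B', 'mag_A', 'mag_A', 'mag_C'],
    'function': [
        'Carbon monoxide oxidation accessory protein CoxD',
        'Carbon monoxide oxidation accessory protein CoxD',
        'Carbon monoxide oxidation accessory protein CoxD',
        'carbon monoxide dehydrogenase G protein',
    ],
})

def mags_with_function(df, function_name):
    """Return the sorted, de-duplicated MAG names annotated with `function_name`."""
    rows = df[df['function'] == function_name]
    return sorted(rows['MAG'].unique().tolist())

coxd_mags = mags_with_function(codh_demo, 'Carbon monoxide oxidation accessory protein CoxD')
none_found = mags_with_function(codh_demo, 'not an annotated function')
print(coxd_mags)
print(none_found)
```

Because the helper works on individual rows rather than on the joined strings, each MAG is listed once regardless of how many copies of the function it carries.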