<h1 style="font-family: 'Times New Roman'; text-align: center; font-weight: bold;">
    CBB Cycle
</h1>

<div style="font-family: 'Times New Roman'; font-size: 15px; text-align: justify; width: 100%;">
  <div>
    <span style="display: inline-block; width: 100px;"><b>Date</b></span>: 20<sup>th</sup> April 2025
  </div>
  <div>
    <span style="display: inline-block; width: 100px;"><b>Author</b></span>: Deepan Kanagarajan Babu
  </div>
  <div>
    <span style="display: inline-block; width: 100px;"><b>Description</b></span>: In this document, the RAST output is analysed to look for RuBisCO genes in the uploaded MAGs.
  </div>
</div>


<h2 style="font-family: 'Times New Roman'; font-weight: bold;">
    Required Libraries
</h2>

In [7]:
import pandas as pd
import re
import os
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
import shutil

<h2 style="font-family: 'Times New Roman'; font-weight: bold;">
    Required Functions
</h2>

In [10]:
# Single helper to open several file formats: .csv and .xlsx (as DataFrames), .txt (as one string), and .fasta/.fastq (first 100 lines only, as a list)
def open_file(file_path):
    if file_path.endswith('.csv'):
        # Read CSV file
        data = pd.read_csv(file_path)
        return data
    elif file_path.endswith('.txt'):
        # Read TXT file
        try:
            with open(file_path, 'r') as file:
                data = file.read()
            return data
        except FileNotFoundError:
            raise FileNotFoundError(f"The file {file_path} does not exist.")
    elif file_path.endswith(('.fasta', '.fastq')):
        # Read FASTA or FASTQ file
        try:
            with open(file_path, 'r') as file:
                data = []
                for _ in range(100):
                    line = file.readline()
                    if not line:
                        break
                    data.append(line.strip())
            return data
        except FileNotFoundError:
            raise FileNotFoundError(f"The file {file_path} does not exist.")
    elif file_path.endswith('.xlsx'):
        # Read XLSX file
        try:
            data = pd.read_excel(file_path)
            return data
        except FileNotFoundError:
            raise FileNotFoundError(f"The file {file_path} does not exist.")
    else:
        raise ValueError("Unsupported file format")

In [12]:
# Function to write the sequences in a RAST output DataFrame to one FASTA file per MAG
def create_fasta_from_df(df, mag_col, feature_id_col, sequence_col, output_folder):
    # Ensure the output folder exists
    os.makedirs(output_folder, exist_ok=True)

    # Group by 'MAG' and create separate FASTA files for each group
    for mag, group in df.groupby(mag_col):
        fasta_records = []
        for _, row in group.iterrows():
            record = SeqRecord(Seq(row[sequence_col]), id=row[feature_id_col], description="")
            fasta_records.append(record)
        
        # Write the records to a FASTA file named after the 'MAG' value in the specified output folder
        fasta_file = os.path.join(output_folder, f"{mag}.fasta")
        SeqIO.write(fasta_records, fasta_file, "fasta")

    print("FASTA files created successfully.")

<h2 style="font-family: 'Times New Roman'; font-weight: bold;">
    Required Directory Paths
</h2>

In [15]:
# Specify the path to the compiled RAST .csv output file
## Concatenated CSV file path
rast_out = 'summary/rast.csv'

# Create output directories
## Output directory for cbb sequences
cbb_out = 'cbb'
## Ensure the output directory exists
os.makedirs(cbb_out, exist_ok=True)

<h2 style="font-family: 'Times New Roman'; font-weight: bold;">
    Looking for RuBisCO related genes
</h2>

In [18]:
# Import the file
rast = open_file(rast_out)
rast.head()

Unnamed: 0,MAG,contig_id,feature_id,type,location,start,stop,strand,function,aliases,figfam,evidence_codes,nucleotide_sequence,aa_sequence
0,Bdellovibrionaceae_bacterium_ERS6626579,Genome20_S3Ck127_100924,fig|2026715.123.peg.1,peg,Genome20_S3Ck127_100924_3_194,3,194,+,hypothetical protein,,,,gcgatcttcacgcactctgacgaacaaaaaagcgtcgctcttcagg...,AIFTHSDEQKSVALQAKAKAQQKWDKPIVTQIVDAKPFYSAERVHQ...
1,Bdellovibrionaceae_bacterium_ERS6626579,Genome20_S3Ck127_100924,fig|2026715.123.peg.2,peg,Genome20_S3Ck127_100924_197_814,197,814,+,hypothetical protein,,,,atgccgattcaatcctggtttccgacgtcgctttatgttgagcccc...,MPIQSWFPTSLYVEPLLANGLARFNRELLEQCYLVRDTDPHGVEWS...
2,Bdellovibrionaceae_bacterium_ERS6626579,Genome20_S3Ck127_100924,fig|2026715.123.peg.3,peg,Genome20_S3Ck127_100924_1362_811,1362,811,-,hypothetical protein,,,,atgaagcgcttgatctcagctattcttttaatttcttttttaacca...,MKRLISAILLISFLTTQAEAQVAERESRISRAATLLSDVAESVSPV...
3,Bdellovibrionaceae_bacterium_ERS6626579,Genome20_S3Ck127_100924,fig|2026715.123.peg.4,peg,Genome20_S3Ck127_100924_2234_1452,2234,1452,-,hypothetical protein,,,,tttgtcataacggcttccgcctcagcgcaactcacgaatggccgca...,FVITASASAQLTNGRIRQRTANAPGFTACLSKDGDKCRLIAPAVAI...
4,Bdellovibrionaceae_bacterium_ERS6626579,Genome20_S3Ck127_104562,fig|2026715.123.peg.5,peg,Genome20_S3Ck127_104562_2_271,2,271,+,hypothetical protein,,,,aaaaagaaaagttggaattacgagttcggccagcgcgcgctagaag...,KKKSWNYEFGQRALEELAKKFEGGTDVVKQLSTPEDHHAALEALWQ...


In [19]:
# Search for RuBisCO-related functions
search_strings = [r'RUBISCO', r'rubisco', r'RuBisCO', r'Ribulose-1,5-bisphosphate carboxylase/oxygenase']
regex_pattern = '|'.join(search_strings)

cbb = rast[rast['function'].str.contains(regex_pattern, regex=True, na=False)].reset_index(drop=True)
cbb.head()

Unnamed: 0,MAG,contig_id,feature_id,type,location,start,stop,strand,function,aliases,figfam,evidence_codes,nucleotide_sequence,aa_sequence
0,Burkholderiaceae_bacterium_ERS6626828,Genome269_k127_390201,fig|2030806.1252.peg.2398,peg,Genome269_k127_390201_245816_244911,245816,244911,-,RuBisCO operon transcriptional regulator CbbR,,,"isu;CO2_uptake,_carboxysome",atgcatgcgacttttcgtcagctgaagatgctgttggcattggcgg...,MHATFRQLKMLLALAETGSITGAARVCHVTQPTVSMQLKDLAESVG...
1,uncultured_Acidovorax_sp._ERS6627271,Genome495_k127_101610,fig|158751.153.peg.14,peg,Genome495_k127_101610_9457_8558,9457,8558,-,RuBisCO operon transcriptional regulator CbbR,,,"isu;CO2_uptake,_carboxysome",atgaacatcacctttcgccaattgcgactgttcctggcgctggctg...,MNITFRQLRLFLALAETGSVSAAAKVMHVTQPTASMQLREVTQAVG...
2,uncultured_Flavobacterium_sp._ERS6627313,Genome537_k127_132050,fig|165435.165.peg.232,peg,Genome537_k127_132050_48562_49482,48562,49482,+,RuBisCO operon transcriptional regulator CbbR,,,"icw(1);CO2_uptake,_carboxysome",atgaaatatacccttaaccagcttcagatatttttaaaagtagcag...,MKYTLNQLQIFLKVAETESITKAADELHLTQPAVSIQLKNFQAQFD...
3,uncultured_Massilia_sp._ERS6626988,Genome429_k127_453002,fig|169973.34.peg.4148,peg,Genome429_k127_453002_3267_2356,3267,2356,-,RuBisCO operon transcriptional regulator CbbR,,,"icw(2);CO2_uptake,_carboxysome",atgcgccgctatacgctacgccagctcgacaccttcatcgaagtcg...,MRRYTLRQLDTFIEVARQLSISRAASALHVSQPAVSMQLRQLEEAL...
4,uncultured_Methylobacterium_sp._ERS6627288,Genome512_k127_94909,fig|157278.22.peg.6242,peg,Genome512_k127_94909_46899_45958,46899,45958,-,RuBisCO operon transcriptional regulator CbbR,,,"isu;CO2_uptake,_carboxysome",atggcgcaggaccgcatacgcaacctctcgctcaagcagctccatg...,MAQDRIRNLSLKQLHAVAAIARLGTMTRAAQELNVTPAALTARIKG...
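
<p style="font-family: 'Times New Roman'; font-size: 15px; text-align: justify; width: 100%;">
    The search above lists several spellings of the same term by hand. A minimal sketch of an alternative, assuming a case-insensitive match is acceptable, is shown below. The <code>toy</code> table and its feature IDs are invented for illustration and are not taken from the RAST output; <code>re.escape</code> is used so that the search terms are matched literally.
</p>

```python
import re
import pandas as pd

# Toy stand-in for the RAST table (invented rows); only 'function' matters here.
toy = pd.DataFrame({
    "feature_id": ["peg.1", "peg.2", "peg.3", "peg.4"],
    "function": [
        "RuBisCO operon transcriptional regulator CbbR",
        "Ribulose-1,5-bisphosphate carboxylase/oxygenase large subunit",
        "hypothetical protein",
        None,  # missing annotations are treated as non-matches via na=False
    ],
})

# One spelling per term is enough because the match ignores case.
terms = ["rubisco", "Ribulose-1,5-bisphosphate carboxylase/oxygenase"]
pattern = "|".join(re.escape(term) for term in terms)

mask = toy["function"].str.contains(pattern, case=False, regex=True, na=False)
hits = toy[mask].reset_index(drop=True)
print(hits["feature_id"].tolist())
```

<p style="font-family: 'Times New Roman'; font-size: 15px; text-align: justify; width: 100%;">
    On the real table, the same <code>pattern</code> and <code>case=False</code> arguments could be passed to the filter on <code>rast['function']</code>.
</p>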


In [20]:
# Search for the CbbL, CbbS and CbbR gene names
search_strings = [r'CbbL', r'CbbS', r'CbbR']
regex_pattern = '|'.join(search_strings)

cbb = rast[rast['function'].str.contains(regex_pattern, regex=True, na=False)].reset_index(drop=True)
cbb.head()

Unnamed: 0,MAG,contig_id,feature_id,type,location,start,stop,strand,function,aliases,figfam,evidence_codes,nucleotide_sequence,aa_sequence
0,Burkholderiaceae_bacterium_ERS6626828,Genome269_k127_390201,fig|2030806.1252.peg.2398,peg,Genome269_k127_390201_245816_244911,245816,244911,-,RuBisCO operon transcriptional regulator CbbR,,,"isu;CO2_uptake,_carboxysome",atgcatgcgacttttcgtcagctgaagatgctgttggcattggcgg...,MHATFRQLKMLLALAETGSITGAARVCHVTQPTVSMQLKDLAESVG...
1,uncultured_Acidovorax_sp._ERS6627271,Genome495_k127_101610,fig|158751.153.peg.14,peg,Genome495_k127_101610_9457_8558,9457,8558,-,RuBisCO operon transcriptional regulator CbbR,,,"isu;CO2_uptake,_carboxysome",atgaacatcacctttcgccaattgcgactgttcctggcgctggctg...,MNITFRQLRLFLALAETGSVSAAAKVMHVTQPTASMQLREVTQAVG...
2,uncultured_Flavobacterium_sp._ERS6627313,Genome537_k127_132050,fig|165435.165.peg.232,peg,Genome537_k127_132050_48562_49482,48562,49482,+,RuBisCO operon transcriptional regulator CbbR,,,"icw(1);CO2_uptake,_carboxysome",atgaaatatacccttaaccagcttcagatatttttaaaagtagcag...,MKYTLNQLQIFLKVAETESITKAADELHLTQPAVSIQLKNFQAQFD...
3,uncultured_Massilia_sp._ERS6626988,Genome429_k127_453002,fig|169973.34.peg.4148,peg,Genome429_k127_453002_3267_2356,3267,2356,-,RuBisCO operon transcriptional regulator CbbR,,,"icw(2);CO2_uptake,_carboxysome",atgcgccgctatacgctacgccagctcgacaccttcatcgaagtcg...,MRRYTLRQLDTFIEVARQLSISRAASALHVSQPAVSMQLRQLEEAL...
4,uncultured_Methylobacterium_sp._ERS6627288,Genome512_k127_94909,fig|157278.22.peg.6242,peg,Genome512_k127_94909_46899_45958,46899,45958,-,RuBisCO operon transcriptional regulator CbbR,,,"isu;CO2_uptake,_carboxysome",atggcgcaggaccgcatacgcaacctctcgctcaagcagctccatg...,MAQDRIRNLSLKQLHAVAAIARLGTMTRAAQELNVTPAALTARIKG...


<h2 style="font-family: 'Times New Roman'; font-weight: bold;">
    Save the sequences as FASTA files
</h2>

In [26]:
create_fasta_from_df(cbb, "MAG", "feature_id", "nucleotide_sequence", cbb_out)

FASTA files created successfully.
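
<p style="font-family: 'Times New Roman'; font-size: 15px; text-align: justify; width: 100%;">
    As a quick sanity check after writing, the number of records in each FASTA file can be counted from its header lines. The sketch below is one possible way to do this with the standard library only; <code>count_fasta_records</code> is a hypothetical helper that is not part of the original notebook, and the demo file is made up.
</p>

```python
import os
import tempfile

def count_fasta_records(folder):
    """Map each .fasta file name in `folder` to its number of '>' header lines."""
    counts = {}
    for name in sorted(os.listdir(folder)):
        if name.endswith(".fasta"):
            with open(os.path.join(folder, name)) as handle:
                counts[name] = sum(1 for line in handle if line.startswith(">"))
    return counts

# Demo on a throwaway folder containing one made-up FASTA file with two records.
with tempfile.TemporaryDirectory() as demo_dir:
    with open(os.path.join(demo_dir, "toy_MAG.fasta"), "w") as handle:
        handle.write(">peg.1\natgc\n>peg.2\nggcc\n")
    demo_counts = count_fasta_records(demo_dir)

print(demo_counts)  # {'toy_MAG.fasta': 2}
```

<p style="font-family: 'Times New Roman'; font-size: 15px; text-align: justify; width: 100%;">
    In this notebook the helper would be called as <code>count_fasta_records(cbb_out)</code>, and the counts could be compared with the number of rows per MAG in <code>cbb</code>.
</p>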


<p style="font-family: 'Times New Roman'; font-size: 15px; text-align: justify; width: 100%;">
    The search found no cbbL or cbbS gene in the data. CbbR is the transcriptional regulator of the microbial carbon dioxide fixation operon. Its presence alone does not verify that these MAGs fix carbon, because cbbL and cbbS, the genes that encode the large and small subunits of the RuBisCO protein itself, are absent. The CbbR sequences are taken for a conserved domain search on the NCBI server. To learn more about CbbR, refer to the following article: <a href="https://doi.org/10.1128/jb.00442-15">"CbbR, the Master Regulator for Microbial Carbon Dioxide Fixation"</a> (click to open the article).
</p>
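
<p style="font-family: 'Times New Roman'; font-size: 15px; text-align: justify; width: 100%;">
    Because <code>head()</code> shows only the first five hits, the absence of a gene is easier to confirm by counting matches per gene name over the whole table. The following is a minimal sketch under that assumption; <code>count_markers</code> is a hypothetical helper and the <code>toy</code> rows are invented for illustration.
</p>

```python
import pandas as pd

def count_markers(df, function_col, markers):
    """Count the rows whose annotation mentions each marker (case-insensitive, literal match)."""
    return {
        marker: int(
            df[function_col]
            .str.contains(marker, case=False, regex=False, na=False)
            .sum()
        )
        for marker in markers
    }

# Toy annotations (invented) standing in for the 'function' column of the RAST table.
toy = pd.DataFrame({"function": [
    "RuBisCO operon transcriptional regulator CbbR",
    "RuBisCO operon transcriptional regulator CbbR",
    "hypothetical protein",
]})

counts = count_markers(toy, "function", ["CbbL", "CbbS", "CbbR"])
print(counts)  # {'CbbL': 0, 'CbbS': 0, 'CbbR': 2}
```

<p style="font-family: 'Times New Roman'; font-size: 15px; text-align: justify; width: 100%;">
    Calling <code>count_markers(rast, "function", ["CbbL", "CbbS", "CbbR"])</code> on the real table would report the per-gene counts directly.
</p>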