<h1 style="font-family: 'Times New Roman'; text-align: center; font-weight: bold;">
    Data Preprocessing
</h1>

<div style="font-family: 'Times New Roman'; font-size: 15px; text-align: justify; width: 100%;">
  <div>
    <span style="display: inline-block; width: 100px;"><b>Date</b></span>: 29<sup>th</sup> January 2024
  </div>
  <div>
    <span style="display: inline-block; width: 100px;"><b>Author</b></span>: Deepan Kanagarajan Babu
  </div>
  <div>
    <span style="display: inline-block; width: 100px;"><b>Description</b></span>: This notebook checks for the presence of the Form 2 CoxL protein in the <i>Deinococcus gobiensis</i> I-0 genome, which comes from an isolated strain. The genome was uploaded to the RAST server and the annotation results were downloaded in '.faa' and '.xls' file formats. The 'function' column of the 'Deinococcus_gobiensis_I-0_rast.xls' file is searched for aerobic carbon monoxide dehydrogenase (CODH) annotations, and the FASTA amino acid file is then filtered to keep only the sequences with a CODH function. The filtered FASTA amino acid file is used for the MOTIF search, which covers both the form 1 (AYXCSFR) and form 2 (AYRGAGR) MOTIFs. Only the form 2 MOTIF is expected to be present.
  </div>
</div>


<h2 style="font-family: 'Times New Roman'; font-weight: bold;">
    Required Libraries
</h2>

In [5]:
#Import required modules
import pandas as pd
import os
from Bio import SeqIO
import re

<h2 style="font-family: 'Times New Roman'; font-weight: bold;">
    Import and Process Files
</h2>

In [7]:
# Import RAST .xls file using pandas
deinococcus = pd.read_excel('Deinococcus_gobiensis_I-0_rast.xls')
deinococcus

Unnamed: 0,contig_id,feature_id,type,location,start,stop,strand,function,aliases,figfam,evidence_codes,nucleotide_sequence,aa_sequence
0,NC_017771.1,fig|745776.20.peg.1,peg,NC_017771.1_2_523,2,523,+,SOS-response repressor and protease LexA (EC 3...,,,"isu;DNA_repair,_bacterial_UmuCD_system isu;DNA...",gacctgatcgagtacgacaccgcggagcgccggacagccatcatcc...,DLIEYDTAERRTAIIRLTAKGRREAGVPENGEVPELPFPILGEVAA...
1,NC_017771.1,fig|745776.20.peg.2,peg,NC_017771.1_694_1410,694,1410,+,hypothetical protein,,,,atggccctggaactccgccagcgccgcctgctggccgccctcgcgg...,MALELRQRRLLAALAAHDGEVPNDVLASAVQSPPGVAFGLMVRLCE...
2,NC_017771.1,fig|745776.20.peg.3,peg,NC_017771.1_2257_1394,2257,1394,-,Transposase,,,,gtgacgacccgcgatactgcccgactgcatgctgacacgctggctg...,MTTRDTARLHADTLAAHLKTHLPHRRLDALRRLAEVLLALLQAEST...
3,NC_017771.1,fig|745776.20.peg.4,peg,NC_017771.1_2317_2469,2317,2469,+,hypothetical protein,,,,gtgtcaggtgctgaggcctcgacgtccatcccgctgagtgccaaca...,MSGAEASTSIPLSANTLASQISFLDMPHDTEDQISFPSNFDMSPIS...
4,NC_017771.1,fig|745776.20.peg.5,peg,NC_017771.1_2522_4462,2522,4462,+,"Type I restriction-modification system, DNA-me...",,,idu(1);Type_I_Restriction-Modification idu(1);...,atgactcagacgcgcatgggcaacatcatctgggccaccgcagaac...,MTQTRMGNIIWATAELLRGDYKQADYGKVILPMTIARRLDGMAGRH...
...,...,...,...,...,...,...,...,...,...,...,...,...,...
4895,NC_017806.1,fig|745776.20.peg.4459,peg,NC_017806.1_52489_52821,52489,52821,+,hypothetical protein,,,,atgacggccacccagaatgtggactacacctatgagctgaccctgc...,MTATQNVDYTYELTLPQILREGMDRHGDRAYVALGAAISAVHGLSP...
4896,NC_017806.1,fig|745776.20.peg.4460,peg,NC_017806.1_52903_53088,52903,53088,+,hypothetical protein,,,,gtgggcgtggctcggaaaatgctgaccagtgacgacacacctgagc...,MGVARKMLTSDDTPERLEVDARLAVQVLRDLGRVHAAGEHLASLDP...
4897,NC_017806.1,fig|745776.20.repeat.78,repeat,NC_017806.1_53368_54602,53368,54602,+,repeat region,,,,cggtaatgctccagagatgacttccagaacctgacccttgtgatgc...,
4898,NC_017806.1,fig|745776.20.peg.4461,peg,NC_017806.1_53440_53727,53440,53727,+,hypothetical protein,,,,atgggaaagcagagaaaagtctggagcacggacgtcaaagaagcca...,MGKQRKVWSTDVKEAIVLSVLRGDLGVAEAARQHRVNESLIHTWKT...


In [8]:
# Search for the string 'carbon monoxide' in the 'function' column
## Define the strings to search for and build the regex pattern
search_strings = [r'carbon monoxide', r'Carbon Monoxide'] # matching is case-sensitive, so both capitalisations are listed
regex_pattern = '|'.join(search_strings)

## Search and save the filtered rows to 'coxl' variable
coxl = deinococcus[deinococcus['function'].str.contains(regex_pattern, regex=True, na=False)]
coxl

Unnamed: 0,contig_id,feature_id,type,location,start,stop,strand,function,aliases,figfam,evidence_codes,nucleotide_sequence,aa_sequence
2983,NC_017790.1,fig|745776.20.peg.2909,peg,NC_017790.1_2693628_2694125,2693628,2694125,+,Aerobic carbon monoxide dehydrogenase (quinone...,,,,atgaacgtgaccatgaccgtgaacggcaagacgtacacccgtgagg...,MNVTMTVNGKTYTREVEPRRLLVQFLREDLGLTGTHVGCDTSQCGA...
2984,NC_017790.1,fig|745776.20.peg.2910,peg,NC_017790.1_2694250_2696640,2694250,2696640,+,Aerobic carbon monoxide dehydrogenase (quinone...,,,,atgaccgaatccagaacggacaagtacgtcgggcaggccctcaagc...,MTESRTDKYVGQALKRKEDPRFITGTGNYTDDIVLHGMLHAAMVRS...
2985,NC_017790.1,fig|745776.20.peg.2911,peg,NC_017790.1_2696699_2697493,2696699,2697493,+,Aerobic carbon monoxide dehydrogenase (quinone...,,,,atgtacccagcccagttcgactaccagaaagcgacgagcgtggatg...,MYPAQFDYQKATSVDDALRALAENPDLKVIAGGHSLLPAMKLRLAQ...
3198,NC_017790.1,fig|745776.20.peg.3110,peg,NC_017790.1_2882128_2881664,2882128,2881664,-,carbon monoxide dehydrogenase G protein,,,,atgaaactcaactattccggccaggaacacgtcaaggccccgcccg...,MKLNYSGQEHVKAPPAAVWAFVRDPERVARCLPDVQDVQVRDATHM...
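
The pattern above lists two capitalisations explicitly, so a spelling such as 'CARBON MONOXIDE' would be missed. A minimal sketch of a case-insensitive alternative follows, using the `case=False` option of `Series.str.contains`. The `toy` table and its function strings are made up for illustration and are not taken from the RAST output.

```python
import pandas as pd

# Hypothetical miniature of the RAST table; only the 'function' column matters here.
toy = pd.DataFrame({
    "feature_id": ["peg.a", "peg.b", "peg.c", "peg.d"],
    "function": [
        "aerobic carbon monoxide dehydrogenase (made-up entry)",
        "CARBON MONOXIDE dehydrogenase subunit (made-up entry)",
        "hypothetical protein",
        None,  # missing annotation, excluded by na=False
    ],
})

# case=False ignores capitalisation; regex=False treats the text as a plain substring.
mask = toy["function"].str.contains("carbon monoxide", case=False, regex=False, na=False)
hits = toy[mask]
print(hits["feature_id"].tolist())
```

Applied to the real table, the same call would replace the two-entry `search_strings` list with a single search term.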


<h2 style="font-family: 'Times New Roman'; font-weight: bold;">
    Filtering FASTA Amino Acid file
</h2>


In [11]:
# Get the IDs from the coxl DataFrame to filter the '.faa' file
ids_to_keep = set(coxl['feature_id'])

# Name the FASTA file
filename = "Deinococcus_gobiensis_I-0_rast.faa"

# Initialize a list to store the results
results = []

# Process the FASTA file
filtered_sequences = []

for record in SeqIO.parse(filename, "fasta"):
    sequence_id = record.id.split("Sequence ID: ")[-1]
    if sequence_id in ids_to_keep:
        filtered_sequences.append(record)
        results.append({
            "FASTA_File": filename,
            "Sequence_ID": sequence_id
        })

# Save the filtered sequences to a new FASTA file
filtered_fasta_file = f"filtered_{filename}"
SeqIO.write(filtered_sequences, filtered_fasta_file, "fasta")


print(f"Filtered sequences have been saved to '{filtered_fasta_file}'")


Filtered sequences have been saved to 'filtered_Deinococcus_gobiensis_I-0_rast.faa'


In [13]:
# Checking the filtered fasta file
fasta_file = "filtered_Deinococcus_gobiensis_I-0_rast.faa"

# Read the filtered FASTA file
for record in SeqIO.parse(fasta_file, "fasta"):
    print(f"Sequence ID: {record.id}")
    print(f"Sequence: {record.seq}\n")


Sequence ID: fig|745776.20.peg.2909
Sequence: MNVTMTVNGKTYTREVEPRRLLVQFLREDLGLTGTHVGCDTSQCGACTVHVGGDAVKSCTLLAVQVGDLPVTTIEGMGTPGDLHPLQTGFWEQHGLQCGFCTPGMIMASAELLRHNPQPSEEEIRFHLEGNYCRCTGYHNIVRAVQQAAGAIREAAPGQSQAADD

Sequence ID: fig|745776.20.peg.2910
Sequence: MTESRTDKYVGQALKRKEDPRFITGTGNYTDDIVLHGMLHAAMVRSPYPHARITAIDKGSVADFPGVQLVLTGQDVKDAGLGSIPVGWLLPDLKTPAHPAIALEEANHVGDIVAVVVAETRAQAEDAAAALDVSYEALPSVSSAVAALEDGAPVVHDDVPGNVAFRWEIGDEAALNESFRRAHKTVKVKLRNHRLIANAIEPRASLAQFSPASGEYTLHTTSQNPHIHRLILAAFVMNIPEHKLRVISPDVGGGFGSKIFQYQEEVIVLLAAQKLGRPVKWAARRSESFVSDAQGRDHDTETEMAVDENGMMLGLRVSTVANLGAYQTLFSPAVPTYLYGTLCNGVYKLPAVHVKVTGVMTNTVPVDAYRGAGRPEATYAVERTVDVMAHELGEDPAEFRRRNFIQPDEFPYQTPVALVYDSGDYEPALDKALGMMNYPALREEQARMKGGRKILGVGLISYLEACGLAPSALVGQLGAQAGQWESSLVRVHPTGKVELFTGSHSHGQGHETAFAQIAADELQIPIEDIELIHGDTGRMPYGWGTYGSRSAAVGGSALKMALLKVTAKAKKIAAHLLEASEEDVEHEGGVFRIKGAPGQSKSFFDVSLMAHLAHNLPADMEPGLEATAFYDPKNFVYPFGTHIAVVEIDTDTGHVKLRQYGCVDDCGPLINPLIAEGQVHGGIAQGAAQALLEDAAYDEDGSLLAGTFMEYAIPRADDVPSFLIDHTVTPSPHNPLGVKGI
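
The coxl table lists four feature IDs, while the printout above shows two records. A small set comparison makes it easy to confirm whether every requested ID reached the filtered file. The sketch below is only an illustration: `missing_ids` is a hypothetical helper, and `written_ids` mirrors the IDs visible in the printout rather than the verified contents of the file.

```python
def missing_ids(ids_to_keep, written_ids):
    """Return the requested IDs that never appeared in the filtered output, sorted."""
    return sorted(set(ids_to_keep) - set(written_ids))

# The four feature IDs listed in the coxl table.
ids_to_keep = {
    "fig|745776.20.peg.2909",
    "fig|745776.20.peg.2910",
    "fig|745776.20.peg.2911",
    "fig|745776.20.peg.3110",
}

# Illustrative only: the two record IDs visible in the printout above.
written_ids = ["fig|745776.20.peg.2909", "fig|745776.20.peg.2910"]

missing = missing_ids(ids_to_keep, written_ids)
print(f"{len(missing)} requested ID(s) not seen:", missing)
```

In the notebook itself, `written_ids` could be collected as `[record.id for record in SeqIO.parse(fasta_file, "fasta")]`.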

<h2 style="font-family: 'Times New Roman'; font-weight: bold;">
    MOTIF Search
</h2>

In [19]:
## Match the last four amino acids of each motif and capture the three amino acids preceding the match
# Defining the motifs as regular expressions
form1 = re.compile(r"(.{3})CSFR") # last four amino acids of form 1 MOTIF: AYXCSFR
form2 = re.compile(r"(.{3})GAGR") # Last four amino acids of form 2 MOTIF: AYRGAGR

# Function to search for a motif and return results as a list of dictionaries
def search_motif(motif, record, motif_name):
    matches = motif.finditer(str(record.seq))
    results = []
    for match in matches:
        start = match.start()
        end = match.end()
        first_three = match.group(1)
        results.append({
            "Sequence_ID": record.id,
            "First_Three_Aminoacids": first_three,
            "Motif": motif_name,
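            "Subject_Start": start,  # 0-based index of the first of the three captured residues
            "Subject_End": end - 1  # 0-based index of the last motif residue (inclusive)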
        })
    return results

# Initialize lists to store results for each motif
all_results1 = []
all_results2 = []

# Name the FASTA file
filename = "filtered_Deinococcus_gobiensis_I-0_rast.faa"

# Process the FASTA file
for record in SeqIO.parse(filename, "fasta"):
    # Search for motif 1
    results1 = search_motif(form1, record, "CSFR")
    all_results1.extend(results1)
    
    # Search for motif 2
    results2 = search_motif(form2, record, "GAGR")
    all_results2.extend(results2)

# Create DataFrames from the results
Form1 = pd.DataFrame(all_results1)
Form2 = pd.DataFrame(all_results2)

# Confirm the presence of Form 1 or Form 2 CoxL protein
if not Form1.empty:
    print("Deinococcus gobiensis I-0 carry Form1 CoxL protein")
else:
    print("No results found for form 1.")

if not Form2.empty:
    print("Deinococcus gobiensis I-0 carry Form2 CoxL protein")
else:
    print("No results found for form 2.")


No results found for form 1.
Deinococcus gobiensis I-0 carry Form2 CoxL protein


In [21]:
# Checking Form 2 matches
Form2

Unnamed: 0,Sequence_ID,First_Three_Aminoacids,Motif,Subject_Start,Subject_End
0,fig|745776.20.peg.2910,AYR,GAGR,367,373
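
The search above matches only the last four residues of each motif and reports 0-based positions. As a complementary check, the full seven-residue signatures from the description can be matched directly. The sketch below assumes that "X" in AYXCSFR stands for any amino acid; `FULL_MOTIFS` and `find_full_motifs` are hypothetical names, and positions are reported 1-based, as is common for protein coordinates.

```python
import re

# Full seven-residue signatures as given in the description.
# Assumption: "X" means "any amino acid", written as "." in the regex.
FULL_MOTIFS = {
    "Form 1 (AYXCSFR)": re.compile(r"AY.CSFR"),
    "Form 2 (AYRGAGR)": re.compile(r"AYRGAGR"),
}

def find_full_motifs(sequence):
    """Return (motif_name, matched_text, start_1based, end_1based) tuples."""
    hits = []
    for name, pattern in FULL_MOTIFS.items():
        for match in pattern.finditer(sequence):
            # match.start() is 0-based; match.end() is exclusive, so it already
            # equals the 1-based position of the last matched residue.
            hits.append((name, match.group(0), match.start() + 1, match.end()))
    return hits

# Short fragment copied from the fig|745776.20.peg.2910 sequence printed earlier.
fragment = "TNTVPVDAYRGAGRPEATYAVERT"
hits = find_full_motifs(fragment)
print(hits)
```

Positions here are relative to the fragment, not to the full protein. Running the function on each full record sequence would give coordinates comparable to the table above after adding 1 to `Subject_Start` and `Subject_End`.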


<h2 style="font-family: 'Times New Roman'; font-weight: bold;">
    Result
</h2>
<p style="font-family: 'Times New Roman'; font-size: 15px; text-align: justify; width: 100%;">
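    According to this analysis, <i>Deinococcus gobiensis</i> I-0 carries a Form 2 CoxL protein: the form 2 MOTIF (AYRGAGR) was found in fig|745776.20.peg.2910 at 0-based positions 367 to 373, and no form 1 MOTIF was detected.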
</p>