## **Homology Analysis Using BLAST**

This script performs the following operations:

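1. Reads the query sequence for each gene from `genes/<gene name>.fasta` (one record per file; because the search uses `blastp`, the queries must be protein sequences).
2. Runs a remote BLASTP search for each query against the NCBI "nr" protein database through `NCBIWWW.qblast`.
3. Parses the BLAST results and keeps only HSPs that pass three filters: e-value (default ≤ 1e-5), percent identity (default ≥ 50%), and query coverage (default ≥ 50%).
4. Writes the aligned subject sequence of each retained hit to `blast_results/<gene name>_blast.fasta`, recording the e-value, identities, query coverage, percent identity, and species in the sequence description.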

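The script uses Biopython for sequence handling and BLAST execution. The search itself runs on NCBI's servers, so an internet connection is required and each query can take several minutes.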
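
The filtering rule in step 3 can be sketched as a small standalone function. This is only an illustration of the criteria, not part of the script: the helper names `hsp_metrics` and `passes_filters` are hypothetical, and the formulas mirror the ones used in `blast_and_filter` (percent identity and query coverage are both computed from the alignment length).

```python
def hsp_metrics(identities, align_length, query_length):
    """Return (percent_identity, query_cover) with the same formulas as the script."""
    percent_identity = identities / align_length * 100
    query_cover = align_length / query_length * 100
    return percent_identity, query_cover


def passes_filters(expect, identities, align_length, query_length,
                   e_value_threshold=1e-5, percent_identity_threshold=50,
                   coverage_threshold=50):
    """True if an HSP satisfies all three thresholds (hypothetical helper)."""
    percent_identity, query_cover = hsp_metrics(identities, align_length, query_length)
    return (expect <= e_value_threshold
            and percent_identity >= percent_identity_threshold
            and query_cover >= coverage_threshold)


# First ptsP hit from the run logged in this notebook: 547 identities over
# 547 alignment columns; the query length of 547 is inferred from the 100% cover.
print(passes_filters(0.0, 547, 547, 547))    # True

# A made-up weak hit: e-value above the threshold, short alignment.
print(passes_filters(1e-3, 40, 100, 547))    # False
```

All three conditions must hold at once, so a hit with a very low e-value is still discarded if it covers less than half of the query.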

In [3]:
import os
import re

from Bio import SeqIO
from Bio.Blast import NCBIWWW, NCBIXML
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord


def blast_and_filter(gene_names, e_value_threshold=1e-5, percent_identity_threshold=50, coverage_threshold=50):
    """BLAST each genes/<name>.fasta query against nr and save the hits that pass the filters."""
    # Query FASTA files are read from genes/; filtered hits are written to blast_results/
    os.makedirs("genes", exist_ok=True)
    os.makedirs("blast_results", exist_ok=True)

    for name_gene in gene_names:
        # Read the query sequence; skip the gene if its FASTA file is missing
        try:
            query_seq = SeqIO.read(f"genes/{name_gene}.fasta", "fasta")
        except FileNotFoundError:
            print(f"File not found for {name_gene}. Skipping...")
            continue

        # Remote BLASTP search against the NCBI nr protein database
        print(f"Starting BLAST for  {name_gene}...")
        result_handle = NCBIWWW.qblast("blastp", "nr", str(query_seq.seq))
        print(f"BLAST completed for {name_gene}.")

        # Parse and filter results
        blast_records = NCBIXML.parse(result_handle)
        output_path = f"blast_results/{name_gene}_blast.fasta"

        with open(output_path, "w") as output_handle:
            for blast_record in blast_records:
                print(f"Number of alignments found for  {name_gene}:", len(blast_record.alignments))
                for alignment in blast_record.alignments:
                    print("Alignment title:", alignment.title)
                    for hsp in alignment.hsps:
                        # align_length counts gap columns, so query_cover can slightly exceed 100%
                        query_cover = (hsp.align_length / blast_record.query_letters) * 100
                        percent_identity = (hsp.identities / hsp.align_length) * 100
                        print(f"HSP: E-value: {hsp.expect}, Identities: {hsp.identities}, "
                              f"Align length: {hsp.align_length}, Query Cover: {query_cover:.2f}%")

                        if (hsp.expect <= e_value_threshold and
                            percent_identity >= percent_identity_threshold and
                            query_cover >= coverage_threshold):

                            # Species is taken from the first bracketed name in the hit title
                            species_match = re.search(r"\[(.*?)\]", alignment.title)
                            species = species_match.group(1) if species_match else "Unknown species"

                            record = SeqRecord(
                                Seq(hsp.sbjct),  # aligned subject segment; may contain '-' gap characters
                                id=alignment.accession,
                                description=f"E-value: {hsp.expect:.2e}, Identities: {hsp.identities}/{hsp.align_length}, "
                                            f"Query Cover: {query_cover:.2f}%, Percent Identity: {percent_identity:.2f}%, "
                                            f"Species: {species}"
                            )
                            SeqIO.write(record, output_handle, "fasta")
                            break  # Keep only the first HSP that passes the filters for each alignment

        result_handle.close()
        print(f"Filtered BLAST results for {name_gene} were saved in '{output_path}'")

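The `query_cover` value in the function above divides the alignment length by the query length. Because the alignment length includes gap columns, the result can exceed 100%, which is why the MutS run further down reports values such as 100.38%. The sketch below contrasts that formula with a gap-free variant that counts only query residues. It is an illustration under assumptions, not the notebook's method: the function names are hypothetical, and the toy aligned string stands in for the `hsp.query` attribute of a Biopython HSP.

```python
def query_cover_with_gaps(align_length, query_length):
    """Coverage as computed in the script: alignment columns / query length."""
    return align_length / query_length * 100


def query_cover_gap_free(aligned_query, query_length):
    """Coverage counting only query residues, ignoring '-' gap columns."""
    covered_residues = len(aligned_query.replace("-", ""))
    return covered_residues / query_length * 100


# Toy example (made up): a 10-residue query aligned with two gap columns.
aligned_query = "MKT--LLVAGGS"   # 12 alignment columns, 10 residues
query_length = 10

print(query_cover_with_gaps(len(aligned_query), query_length))   # 120.0
print(query_cover_gap_free(aligned_query, query_length))         # 100.0
```

With the gap-free variant the coverage is bounded by 100%, at the cost of needing the aligned query string rather than only the alignment length.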
#### **1. Gene ptsP**

In [4]:
gene_names = ["ptsP"]
blast_and_filter(gene_names)

Starting BLAST for  ptsP...
BLAST completed for ptsP.
Number of alignments found for  ptsP: 50
Alignment title: ref|WP_005925321.1| phosphoenolpyruvate--protein phosphotransferase [Faecalibacterium prausnitzii] >gb|EDP19718.1| phosphoenolpyruvate-protein phosphotransferase [Faecalibacterium prausnitzii M21/2] >gb|MCI3184523.1| phosphoenolpyruvate--protein phosphotransferase [Faecalibacterium prausnitzii] >gb|MCI3202328.1| phosphoenolpyruvate--protein phosphotransferase [Faecalibacterium prausnitzii] >gb|MDU8657129.1| phosphoenolpyruvate--protein phosphotransferase [Faecalibacterium prausnitzii] >gb|MDW2997156.1| phosphoenolpyruvate--protein phosphotransferase [Faecalibacterium prausnitzii]
HSP: E-value: 0.0, Identities: 547, Align length: 547, Query Cover: 100.00%
Alignment title: ref|WP_097783314.1| phosphoenolpyruvate--protein phosphotransferase [Faecalibacterium prausnitzii] >gb|MDU8670066.1| phosphoenolpyruvate--protein phosphotransferase [Faecalibacterium prausnitzii] >gb|MDU87245

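Each record written to `blast_results/ptsP_blast.fasta` carries its statistics in the FASTA description, in the format defined by the f-string in `blast_and_filter`. A minimal sketch of reading those fields back is shown here. The `parse_description` helper is hypothetical, and the example string was assembled from the format string and the first hit logged above; it was not copied from an actual output file.

```python
import re

# Mirrors the description format written by blast_and_filter.
DESCRIPTION_RE = re.compile(
    r"E-value: (?P<e_value>[^,\s]+), "
    r"Identities: (?P<identities>\d+)/(?P<align_length>\d+), "
    r"Query Cover: (?P<query_cover>[\d.]+)%, "
    r"Percent Identity: (?P<percent_identity>[\d.]+)%, "
    r"Species: (?P<species>.+)$"
)


def parse_description(description):
    """Return the statistics in a description as a dict, or None if it does not match."""
    match = DESCRIPTION_RE.search(description)
    if match is None:
        return None
    fields = match.groupdict()
    return {
        "e_value": float(fields["e_value"]),
        "identities": int(fields["identities"]),
        "align_length": int(fields["align_length"]),
        "query_cover": float(fields["query_cover"]),
        "percent_identity": float(fields["percent_identity"]),
        "species": fields["species"],
    }


# Illustrative description built from the format string (assumption, see above).
example = ("E-value: 0.00e+00, Identities: 547/547, Query Cover: 100.00%, "
           "Percent Identity: 100.00%, Species: Faecalibacterium prausnitzii")
print(parse_description(example))
```

When reading the file with `SeqIO.parse`, the same string is available as the part of `record.description` that follows the record ID.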


#### **2. Gene ButyrylCoA**

In [5]:
gene_names = ["butyrylCoA"]
blast_and_filter(gene_names)

Starting BLAST for  butyrylCoA...
BLAST completed for butyrylCoA.
Number of alignments found for  butyrylCoA: 50
Alignment title: ref|WP_044960620.1| MULTISPECIES: butyryl-CoA:acetate CoA-transferase [Faecalibacterium] >gb|MBP9564639.1| butyryl-CoA:acetate CoA-transferase [Faecalibacterium sp.] >gb|AXB28579.1| butyryl-CoA:acetate CoA-transferase [Faecalibacterium prausnitzii] >gb|MBV0896480.1| butyryl-CoA:acetate CoA-transferase [Faecalibacterium prausnitzii] >gb|MBV0926594.1| butyryl-CoA:acetate CoA-transferase [Faecalibacterium prausnitzii] >gb|MCG4793536.1| butyryl-CoA:acetate CoA-transferase [Faecalibacterium prausnitzii]
HSP: E-value: 0.0, Identities: 448, Align length: 448, Query Cover: 100.00%
Alignment title: ref|WP_097783900.1| MULTISPECIES: butyryl-CoA:acetate CoA-transferase [Faecalibacterium] >gb|MDR3769479.1| butyryl-CoA:acetate CoA-transferase [Faecalibacterium sp.] >gb|UYI72064.1| MAG: butyryl-CoA:acetate CoA-transferase [Oscillospiraceae bacterium] >gb|MBD8928006.1| but

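Hit titles from "nr" often merge several identical proteins, each with its own bracketed organism name, as in the MULTISPECIES entry above. The script keeps only the first bracketed name, which for that entry is the genus `Faecalibacterium` rather than a species. The sketch below collects every distinct bracketed name instead. The helper `species_in_title` is hypothetical, and the title is a shortened copy of the first butyrylCoA hit (first three entries only).

```python
import re


def species_in_title(title):
    """Return the distinct bracketed names in a hit title, in order of first appearance."""
    names = re.findall(r"\[(.*?)\]", title)
    return list(dict.fromkeys(names))   # de-duplicate while preserving order


# Shortened from the first butyrylCoA alignment title in the output above.
title = (
    "ref|WP_044960620.1| MULTISPECIES: butyryl-CoA:acetate CoA-transferase [Faecalibacterium] "
    ">gb|MBP9564639.1| butyryl-CoA:acetate CoA-transferase [Faecalibacterium sp.] "
    ">gb|AXB28579.1| butyryl-CoA:acetate CoA-transferase [Faecalibacterium prausnitzii]"
)

print(species_in_title(title))
# ['Faecalibacterium', 'Faecalibacterium sp.', 'Faecalibacterium prausnitzii']
```

Note that some protein names themselves contain square brackets, so a bracket-based pattern like this one (and the one in the script) can occasionally pick up text that is not an organism name.
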
#### **3. Gene MutS**

In [6]:
gene_names = ["MutS"]
blast_and_filter(gene_names)

Starting BLAST for  MutS...
BLAST completed for MutS.
Number of alignments found for  MutS: 50
Alignment title: ref|WP_341271153.1| SNF2-related protein [Faecalibacterium prausnitzii]
HSP: E-value: 0.0, Identities: 2117, Align length: 2117, Query Cover: 100.00%
Alignment title: gb|EDP20631.1| MutS domain I protein [Faecalibacterium prausnitzii M21/2]
HSP: E-value: 0.0, Identities: 2117, Align length: 2117, Query Cover: 100.00%
Alignment title: ref|WP_207708712.1| SNF2-related protein [Clostridium porci]
HSP: E-value: 0.0, Identities: 1772, Align length: 2125, Query Cover: 100.38%
Alignment title: ref|WP_333523603.1| SNF2-related protein [Clostridium fessum]
HSP: E-value: 0.0, Identities: 1763, Align length: 2125, Query Cover: 100.38%
Alignment title: gb|MCG4781639.1| SNF2-related protein [Acetatifactor sp. DFI.5.50]
HSP: E-value: 0.0, Identities: 1733, Align length: 2130, Query Cover: 100.61%
Alignment title: gb|MCB6198170.1| DEAD/DEAH box helicase family protein [Lacrimispora saccharo