## **Homology Analysis Using BLAST**

This script performs the following operations:

1. Reads one query sequence per gene from a FASTA file (`genes/{gene_name}.fasta`).  
2. Runs a remote BLASTP search for each query against the NCBI "swissprot" database through `NCBIWWW.qblast`.  
3. Parses the BLAST results and filters them by e-value, identity percentage, and query coverage.  
4. Saves the hits that pass the filters in a new FASTA file, with the species and accession in the record ID and the e-value, identities, coverage, and percent identity in the description.  

The script uses Biopython for sequence handling and BLAST execution.
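
Because step 4 stores the alignment statistics in the FASTA header, they can be recovered later without rerunning BLAST. The following is a minimal sketch, assuming the description format written by the script (`E-value: ..., Identities: a/b, Query Cover: x%, Percent Identity: y%`). The names `HEADER_STATS` and `parse_header_stats` are hypothetical helpers, not part of the script, and the example header uses made-up values. Since the record ID contains spaces, the sketch matches against the whole header line rather than a parsed ID.

```python
import re

# Pattern for the statistics the script appends to each FASTA header.
HEADER_STATS = re.compile(
    r"E-value: (?P<evalue>[^,\s]+), "
    r"Identities: (?P<identities>\d+)/(?P<align_length>\d+), "
    r"Query Cover: (?P<query_cover>[\d.]+)%, "
    r"Percent Identity: (?P<percent_identity>[\d.]+)%"
)


def parse_header_stats(header):
    """Return the alignment statistics found in a FASTA header line, or None."""
    match = HEADER_STATS.search(header)
    if match is None:
        return None
    return {
        "evalue": float(match["evalue"]),
        "identities": int(match["identities"]),
        "align_length": int(match["align_length"]),
        "query_cover": float(match["query_cover"]),
        "percent_identity": float(match["percent_identity"]),
    }


# Hypothetical header, made up for illustration only.
example = (
    ">Some species | X00000 E-value: 1.50e-20, Identities: 50/200, "
    "Query Cover: 80.00%, Percent Identity: 25.00%"
)
print(parse_header_stats(example))
```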

In [1]:
from Bio import SeqIO
from Bio.Blast import NCBIWWW, NCBIXML
import re
import os

def blast_and_filter(gene_names, e_value_threshold=0.01, percent_identity_threshold=30, coverage_threshold=30):
    """
    Performs BLAST search for given genes and filters results based on thresholds.

    Inputs:
        gene_names (list of str): List of gene names to process.
        e_value_threshold (float): Maximum E-value to consider an alignment (default: 0.01).
        percent_identity_threshold (float): Minimum percentage identity for inclusion (default: 30).
        coverage_threshold (float): Minimum query coverage percentage for inclusion (default: 30).

    Outputs:
        - Filtered BLAST results in FASTA format (saved in blast_results/{gene_name}_blast.fasta)
        - Summary of alignments printed to console
    """
    if not os.path.exists("blast_results"):
        os.makedirs("blast_results")
        
    for name_gene in gene_names:
        # Read sequence and run BLAST
        try:
            if not os.path.exists("genes"):
                os.makedirs("genes")
        
            query_seq = SeqIO.read(f"genes/{name_gene}.fasta", "fasta")
        except FileNotFoundError:
            print(f"File not found for {name_gene}. Skipping...")
            continue

        print(f"Starting BLAST for {name_gene}...")
        result_handle = NCBIWWW.qblast("blastp", "swissprot", query_seq.seq)
        print(f"BLAST completed for {name_gene}.")

        # Parse and filter results
        blast_records = NCBIXML.parse(result_handle)
        output_path = f"blast_results/{name_gene}_blast.fasta"
        
        with open(output_path, "w") as output_handle:
            total_alignments = 0
            saved_alignments = 0
            
            for blast_record in blast_records:
                total_alignments = len(blast_record.alignments)
                print(f"Number of alignments found for {name_gene}: {total_alignments}")
                
                for i, alignment in enumerate(blast_record.alignments, start=1):
                    for hsp in alignment.hsps:
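                        # align_length counts alignment columns (gaps included),
                        # so query_cover can slightly exceed 100% for gapped alignments
                        query_cover = (hsp.align_length / blast_record.query_letters) * 100
                        percent_identity = (hsp.identities / hsp.align_length) * 100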
                        
                        # Displays the alignment in the terminal
                        species_match = re.search(r"\[(.*?)\]", alignment.title)
                        species = species_match.group(1) if species_match else "Unknown species"
                        print(f"Alignment {i}: {species} | {alignment.accession} | "
                              f"Identity: {percent_identity:.2f}% | E-value: {hsp.expect:.2e} | Coverage: {query_cover:.2f}%")
                        
                        # Applies the filters
                        if (hsp.expect <= e_value_threshold and
                            percent_identity >= percent_identity_threshold and
                            query_cover >= coverage_threshold):
                            
                            # Writes the hit to the file if it passes all filters
                            SeqIO.write(
                                SeqIO.SeqRecord(
                                    seq=hsp.sbjct,
                                    id=f"{species} | {alignment.accession}",  # Puts the species first
                                    description=f"E-value: {hsp.expect:.2e}, Identities: {hsp.identities}/{hsp.align_length}, "
                                                f"Query Cover: {query_cover:.2f}%, Percent Identity: {percent_identity:.2f}%"
                                ), output_handle, "fasta")

                            saved_alignments += 1
                            break  # Keeps only the first HSP that passes the filters for each alignment
            
            # Displays the final summary
            
            print(f"Alignments saved for {name_gene}: {saved_alignments}")
        
        print(f"Filtered BLAST results for {name_gene} were saved in '{output_path}'")
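
The coverage in `blast_and_filter` divides `hsp.align_length` by the query length. Because `align_length` counts alignment columns, gaps included, the value can exceed 100% for gapped alignments. One alternative, sketched below under the assumption that the HSP exposes 1-based inclusive `query_start` and `query_end` coordinates (as Biopython's BLAST XML records are expected to), is to measure the span of the query that the HSP covers. The function name `query_coverage` and the numbers in the example are hypothetical.

```python
def query_coverage(query_start, query_end, query_length):
    """Percentage of query residues spanned by an HSP (1-based, inclusive coordinates)."""
    if query_length <= 0:
        raise ValueError("query_length must be positive")
    start, end = sorted((query_start, query_end))
    return (end - start + 1) / query_length * 100


# Hypothetical case: a 550-residue query aligned end to end,
# with 2 gap columns inserted in the query (552 alignment columns).
query_length = 550
aligned_columns = 552

print(f"Column-based: {aligned_columns / query_length * 100:.2f}%")          # 100.36%
print(f"Span-based:   {query_coverage(1, 550, query_length):.2f}%")          # 100.00%
```

Inside the script this would be called as `query_coverage(hsp.query_start, hsp.query_end, blast_record.query_letters)`. Switching to it would change the coverage values reported in the outputs below.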


#### **1. Gene ptsP**

In [None]:
blast_and_filter(gene_names = ["ptsP"])

Starting BLAST for ptsP...
BLAST completed for ptsP.
Number of alignments found for ptsP: 50
Alignment 1: Halalkalibacterium halodurans C-125 | Q9K8D3 | Identity: 43.02% | E-value: 5.96e-160 | Coverage: 96.89%
Alignment 2: Bacillus sp. S | O83018 | Identity: 44.34% | E-value: 4.76e-154 | Coverage: 96.89%
Alignment 3: Geobacillus stearothermophilus | P42014 | Identity: 44.34% | E-value: 6.92e-153 | Coverage: 96.89%
Alignment 4: Priestia megaterium | O69251 | Identity: 42.35% | E-value: 2.70e-150 | Coverage: 97.99%
Alignment 5: Bacillus subtilis subsp. subtilis str. 168 | P08838 | Identity: 42.14% | E-value: 3.80e-149 | Coverage: 98.90%
Alignment 6: Enterococcus faecalis V583 | P23530 | Identity: 42.34% | E-value: 2.50e-145 | Coverage: 100.18%
Alignment 7: Staphylococcus carnosus subsp. carnosus TM300 | P23533 | Identity: 39.96% | E-value: 9.58e-145 | Coverage: 99.27%
Alignment 8: Streptococcus equinus | Q9WXK9 | Identity: 41.71% | E-value: 6.55e-144 | Coverage: 100.37%
Alignment 9: List



#### **2. Gene ButyrylCoA**

In [5]:
blast_and_filter(gene_names = ["butyrylCoA"])

Starting BLAST for butyrylCoA...
BLAST completed for butyrylCoA.
Number of alignments found for butyrylCoA: 8
Alignment 1: Roseburia hominis A2-183 | G2SYC0 | Identity: 74.27% | E-value: 0.00e+00 | Coverage: 99.78%
Alignment 2: Anaerostipes caccae L1-92 | B0MC58 | Identity: 71.14% | E-value: 0.00e+00 | Coverage: 99.78%
Alignment 3: Syntrophomonas wolfei subsp. wolfei str. Goettingen G311 | Q0AVM5 | Identity: 51.58% | E-value: 2.48e-164 | Coverage: 99.11%
Alignment 4: Clostridium kluyveri DSM 555 | P38942 | Identity: 39.86% | E-value: 7.36e-97 | Coverage: 97.99%
Alignment 5: Fasciola hepatica | C6EUD4 | Identity: 36.19% | E-value: 1.19e-66 | Coverage: 83.26%
Alignment 6: Clostridium kluyveri DSM 555 | P38946 | Identity: 24.37% | E-value: 3.80e-17 | Coverage: 87.95%
Alignment 7: Dictyostelium discoideum | Q54K91 | Identity: 23.42% | E-value: 6.86e-13 | Coverage: 81.03%
Alignment 8: Arabidopsis thaliana | Q9AR19 | Identity: 21.29% | E-value: 9.18e+00 | Coverage: 34.60%
Alignments saved fo

#### **3. Gene MutS**

In [5]:
blast_and_filter(gene_names = ["MutS"], e_value_threshold=0.1, percent_identity_threshold=15, coverage_threshold=5)

Starting BLAST for MutS...
BLAST completed for MutS.
Number of alignments found for MutS: 7
Alignment 1: Escherichia phage P1 | Q71TF8 | Identity: 25.00% | E-value: 2.43e-18 | Coverage: 21.73%
Alignment 2: Paramecium bursaria Chlorella virus CV-XZ6E | P52284 | Identity: 35.87% | E-value: 9.46e-08 | Coverage: 4.35%
Alignment 3: Streptomyces albus G | Q53609 | Identity: 30.36% | E-value: 6.72e-03 | Coverage: 5.29%
Alignment 4: Paramecium bursaria Chlorella virus NC1A | P10835 | Identity: 27.50% | E-value: 4.38e-02 | Coverage: 5.67%
Alignment 5: Schizosaccharomyces pombe 972h- | Q10332 | Identity: 23.20% | E-value: 3.28e-01 | Coverage: 11.81%
Alignment 6: Cryptococcus neoformans var. grubii H99 | J9VI03 | Identity: 30.36% | E-value: 4.88e-01 | Coverage: 5.29%
Alignment 7: Escherichia coli | P25240 | Identity: 24.68% | E-value: 4.92e+00 | Coverage: 7.27%
Alignments saved for MutS: 3
Filtered BLAST results for MutS were saved in 'blast_results/MutS_blast.fasta'
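
The count of 3 saved alignments can be checked by hand against the relaxed thresholds used in this call (E-value ≤ 0.1, identity ≥ 15%, coverage ≥ 5%). The sketch below applies the same three conditions to the values printed above. The helper `passes_filters` and the list `muts_hits` are hypothetical names, and the inputs are the rounded values from the output, not the raw BLAST numbers.

```python
# (alignment number, percent identity, E-value, query coverage %)
# copied from the MutS output above
muts_hits = [
    (1, 25.00, 2.43e-18, 21.73),
    (2, 35.87, 9.46e-08, 4.35),
    (3, 30.36, 6.72e-03, 5.29),
    (4, 27.50, 4.38e-02, 5.67),
    (5, 23.20, 3.28e-01, 11.81),
    (6, 30.36, 4.88e-01, 5.29),
    (7, 24.68, 4.92e+00, 7.27),
]


def passes_filters(identity, e_value, coverage,
                   e_value_threshold=0.1,
                   percent_identity_threshold=15,
                   coverage_threshold=5):
    """Same three conditions as in blast_and_filter, with the MutS thresholds as defaults."""
    return (e_value <= e_value_threshold
            and identity >= percent_identity_threshold
            and coverage >= coverage_threshold)


kept = [n for n, identity, e_value, coverage in muts_hits
        if passes_filters(identity, e_value, coverage)]
print(kept)       # [1, 3, 4]
print(len(kept))  # 3
```

Alignment 2 is dropped for coverage below 5%, and alignments 5 to 7 are dropped for E-values above 0.1.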


