# Lab10 - Bioinformatics Algorithms

# Gene prediction - ORF finding

## Background
An open reading frame (ORF) is a stretch of DNA that can be read as consecutive codons without encountering a stop codon (which functions as a stop signal). A codon is a sequence of three DNA or RNA nucleotides (a trinucleotide) that encodes a particular amino acid or signals the termination of protein synthesis (a stop codon).
A prokaryotic gene typically begins with a start codon (e.g., ATG) and ends with one of the three stop codons (TAG, TAA, or TGA).
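
To make the idea of reading frames concrete, here is a minimal sketch that splits a short sequence into codons in each of the three forward frames and labels start and stop codons. The toy sequence and the helper name `codons_in_frame` are made up for illustration and are not part of the lab code.

```python
# Toy sequence invented for illustration (not taken from the B. subtilis genome)
toy = "CCATGAAATAGGC"

START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}

def codons_in_frame(seq, frame):
    """Return the complete codons of seq when reading starts at offset frame (0, 1, or 2)."""
    return [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]

for frame in range(3):
    codons = codons_in_frame(toy, frame)
    # Label each codon so start and stop signals are easy to spot
    labels = ["START" if c == START else "STOP" if c in STOPS else "-" for c in codons]
    print(f"frame {frame}:", list(zip(codons, labels)))
```

In this toy example only frame 2 reads as a start codon followed by a stop codon (ATG AAA TAG), which is why an ORF finder has to scan every frame separately.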

## Discussion10

We will explore the bacterial genome of Bacillus subtilis (NC_000964.3) with two approaches:

1- Regular ORF finder function (Jupyter notebook below)

2- Gene prediction program using HMMs (GeneMarkS-2)
download: https://genemark.bme.gatech.edu/license_download.cgi
online: https://genemark.bme.gatech.edu/genemarks2.cgi

The genome FASTA file is in MyCourses under Content > Labs, so that everyone uses the same version. If you want to know more about the B. subtilis genome:
https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000009045.1/

We will use the UCSC genome browser for microbial genomes to visualize our results:
https://microbes.ucsc.edu/

For Discussion10:
1. Understand the basic ORF rules.
2. Implement an ORF finder that scans both strands, using start codons, stop codons, and a minimum gene length.
3. Get the GFF3 file and save it locally.
4. Run GeneMarkS-2 on the bacterial genome. Make sure you select GFF3 as the output format.
5. Upload both annotations to the UCSC browser (custom tracks), and take a screenshot or two showing your results, with your reflections. Submit it as a PDF to MyCourses.


## GeneMarkS-2 command to run on your computer (not here in the Jupyter notebook)

In [None]:
perl gms2.pl --seq Bsubtilis-NC_000964.3.fa --genome-type bacteria --output Bsubtilis.gff3 --format gff3

## In Oedipus:
Copy the "Lab10" folder, which contains the bacterial genome, into YOUR home directory:


In [None]:
cp -R /mnt/classes/biol_530_630/data/Lab10/ ~/

cd ~/Lab10

### GeneMark is installed in 
 /mnt/classes/biol_530_630/bioware/gms2_linux_64/
 You will need the full path to run it. But first, copy the GeneMark key into your home directory:

In [None]:
cp /mnt/classes/biol_530_630/bioware/gms2_linux_64/.gmhmmp2_key ~/

In [None]:
/mnt/classes/biol_530_630/bioware/gms2_linux_64/gms2.pl --seq Bsubtilis-NC_000964.3.fa --genome-type bacteria --output Bsubtilis.gff3 --format gff3

# Now the jupyter notebook with the ORF finder ("ORF-Finder")

## Import necessary libraries

In [None]:
from Bio import SeqIO
from Bio.Seq import Seq
import pandas as pd

## Define start and stop codons

In [None]:
start_codon = 'ATG'
stop_codons = ['TAA', 'TAG', 'TGA']

## Function to find ORFs in all six reading frames

In [None]:
def find_orfs_in_six_frames(sequence, min_gene_length=250):
    orfs = []
    
    # Analyze both the sense and antisense strands
    for strand, nuc in [(+1, sequence), (-1, sequence.reverse_complement())]:
        seq_len = len(nuc)
        for frame in range(3):
            protein = []
            start_pos = None
            
            # Scan in each frame
            for i in range(frame, seq_len - 2, 3):
                codon = str(nuc[i:i+3])
                
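                if codon == start_codon and not protein:
                    protein = [codon]
                    # 0-based forward-strand position of the first base of the start codon.
                    # On the minus strand this is the rightmost forward coordinate of the ORF.
                    start_pos = i if strand == 1 else seq_len - i - 1

                elif protein:
                    protein.append(codon)
                    if codon in stop_codons:
                        if len(protein) * 3 >= min_gene_length:
                            # 1-based forward-strand position of the last base of the stop codon.
                            # On the minus strand this is the leftmost forward coordinate of the ORF.
                            end_pos = i + 3 if strand == 1 else seq_len - i - 2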
                            orfs.append({
                                "strand": strand,
                                "start": start_pos + 1,  # Convert to 1-based indexing
                                "end": end_pos,
                                "length": len(protein) * 3
                            })
                        protein = []
    return orfs
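
Mapping minus-strand hits back to forward-strand coordinates is an easy place for off-by-a-few errors, so it can help to check the arithmetic on a tiny example. The sketch below is a stand-alone illustration in plain Python; the helper names `reverse_complement` and `minus_to_forward_interval` and the toy sequence are hypothetical and not part of the notebook. It assumes `i_start` and `i_stop` are the 0-based indices, on the reverse complement, of the first base of the start and stop codons.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Plain-Python reverse complement for an uppercase A/C/G/T string."""
    return seq.translate(COMPLEMENT)[::-1]

def minus_to_forward_interval(seq_len, i_start, i_stop):
    """Map an ORF found on the reverse complement to forward coordinates.

    Returns (left, right) as 1-based inclusive forward-strand positions.
    The stop codon ends up at the left edge, the start codon at the right edge.
    """
    left = seq_len - (i_stop + 3) + 1   # last base of the stop codon
    right = seq_len - i_start           # first base of the start codon
    return left, right

# Toy forward sequence whose reverse complement is CCATGAAATAGCC
forward = "GGCTATTTCATGG"
rc = reverse_complement(forward)

i_start = rc.find("ATG")   # index of the start codon on the reverse complement
i_stop = rc.find("TAG")    # index of the stop codon on the reverse complement
left, right = minus_to_forward_interval(len(forward), i_start, i_stop)

# Slicing the forward strand with the mapped interval and reverse-complementing
# it should give back the ORF exactly as it was read on the minus strand.
recovered = reverse_complement(forward[left - 1:right])
print(left, right, recovered)
```

A useful sanity check on real output is that `end - start + 1` equals the reported ORF length on both strands.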

## Load sequence from a FASTA file

In [None]:
sequence_file = "Bsubtilis-NC_000964.3.fa"  # Replace with your file path
record = SeqIO.read(sequence_file, "fasta")
sequence = record.seq

## Find ORFs

In [None]:
orfs = find_orfs_in_six_frames(sequence)

## Convert results to DataFrame with strand as "+" or "-"

In [None]:
orfs_df = pd.DataFrame(orfs)
orfs_df['strand'] = orfs_df['strand'].apply(lambda x: '+' if x == 1 else '-')
orfs_df['type'] = 'gene'
orfs_df['score'] = '.'
orfs_df['phase'] = '.'
orfs_df['source'] = 'GeneFinder'
orfs_df['seqid'] = record.id
orfs_df['unique_id'] = [f"{i+1}-{length}" for i, length in enumerate(orfs_df['length'])]

# Ensure start < end for all entries, regardless of strand
orfs_df['start'], orfs_df['end'] = zip(*orfs_df.apply(
    lambda row: (min(row['start'], row['end']), max(row['start'], row['end'])), axis=1))

# Rearrange columns for GFF3 format
orfs_df = orfs_df[['seqid', 'source', 'type', 'start', 'end', 'score', 'strand', 'phase', 'unique_id']]


## Save ORFs to GFF3 format

In [None]:
orfs_df.to_csv("output_genes.gff3", sep='\t', header=False, index=False)
print("Gene locations saved to output_genes.gff3")

Gene locations saved to output_genes.gff3
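
The GFF3 specification describes the ninth column as a list of `tag=value` attributes (for example `ID=...`) and expects the file to begin with a `##gff-version 3` line. The notebook writes a bare identifier in that column, which some tools may accept and others may not. As a hedged sketch of a stricter variant, the plain-Python helpers below build such lines by hand; the function names and the `orf1`-style IDs are hypothetical choices for illustration.

```python
def format_gff3_line(seqid, source, ftype, start, end, strand, feature_id,
                     score=".", phase="."):
    """Build one tab-separated GFF3 feature line with an ID attribute."""
    attributes = f"ID={feature_id}"
    fields = [seqid, source, ftype, str(start), str(end),
              score, strand, phase, attributes]
    return "\t".join(fields)

def to_gff3_text(rows, source="GeneFinder", ftype="gene"):
    """Turn (seqid, start, end, strand) tuples into GFF3 text with a version header."""
    lines = ["##gff-version 3"]
    for n, (seqid, start, end, strand) in enumerate(rows, start=1):
        lines.append(format_gff3_line(seqid, source, ftype, start, end,
                                      strand, feature_id=f"orf{n}"))
    return "\n".join(lines) + "\n"

# Two coordinates copied from the notebook output; the IDs are made up
rows = [("chr", 1816, 3075, "+"), ("chr", 4867, 6783, "+")]
gff3_text = to_gff3_text(rows)
print(gff3_text)
```

The text can then be written with an ordinary `open(..., "w")` call instead of `to_csv`.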


## Check GFF3 file here

In [None]:
print(orfs_df.to_string(index=False, header=False))

chr GeneFinder gene    1816    3075 . + .     1-1260
chr GeneFinder gene    4867    6783 . + .     2-1917
chr GeneFinder gene    6994    9459 . + .     3-2466
chr GeneFinder gene   14038   14376 . + .      4-339
chr GeneFinder gene   16888   17220 . + .      5-333
chr GeneFinder gene   25852   26337 . + .      6-486
chr GeneFinder gene   28867   29463 . + .      7-597
chr GeneFinder gene   31474   31728 . + .      8-255
chr GeneFinder gene   35845   36459 . + .      9-615
chr GeneFinder gene   36478   37638 . + .    10-1161
chr GeneFinder gene   37720   39162 . + .    11-1443
chr GeneFinder gene   39871   40200 . + .     12-330
chr GeneFinder gene   40306   40653 . + .     13-348
chr GeneFinder gene   42598   42858 . + .     14-261
chr GeneFinder gene   43921   44799 . + .     15-879
chr GeneFinder gene   45796   46134 . + .     16-339
chr GeneFinder gene   57745   58698 . + .     17-954
chr GeneFinder gene   58771   59397 . + .     18-627
chr GeneFinder gene   60430   63963 . + .    1

## Sort ORFs dataframe

In [None]:
def sort_orfs_by_start(orfs_df):
    """
    Sorts ORFs DataFrame by the 'start' column in ascending order.

    Parameters:
        orfs_df (pd.DataFrame): DataFrame containing ORF information.

    Returns:
        pd.DataFrame: Sorted DataFrame by 'start' column.
    """
    sorted_orfs_df = orfs_df.sort_values(by='start').reset_index(drop=True)
    return sorted_orfs_df

# Example usage:
sorted_orfs_df = sort_orfs_by_start(orfs_df)
print(sorted_orfs_df.head())


  seqid      source  type  start   end score strand phase unique_id
0   chr  GeneFinder  gene    326  1750     .      +     .  977-1425
1   chr  GeneFinder  gene   1816  3075     .      +     .    1-1260
2   chr  GeneFinder  gene   3554  4549     .      +     .   978-996
3   chr  GeneFinder  gene   3799  4067     .      -     .  5858-273
4   chr  GeneFinder  gene   4867  6783     .      +     .    2-1917


## Save Sorted ORFs to GFF3 format

In [None]:
sorted_orfs_df.to_csv("output_genes.sorted.gff3", sep='\t', header=False, index=False)
print("Gene locations saved to output_genes.sorted.gff3")

Gene locations saved to output_genes.sorted.gff3


# Activity10
1. Implement Alternative Start Codons: Bacteria may use non-standard start codons like... <insert here>
2. Optimize ORF Search Parameters: Experiment with minimum ORF lengths, and the alternative start codons to assess prediction accuracy. Hint: check average values of gene length in GeneMarkS-2 output.
3. Export in Additional Formats: Extend export options (like BED format https://genome.ucsc.edu/FAQ/FAQformat.html) for compatibility with other bioinformatics tools.
4. Can you think of a metric that can be used to compare two gene prediction annotations, like GeneMarkS-2 and GeneFinder GFF3 output files?
5. Save your notebook with your reflections as markdown cells, export it as a PDF file (give it a specific name, e.g., Activity10 + your NAME), and submit both to MyCourses. To export as PDF from VisualStudio, you may have to export to HTML first and then convert to PDF.
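
For the hint in item 2, one possible starting point is to compute feature lengths directly from a GFF3 file. The sketch below is a minimal plain-Python parser under stated assumptions: the function name `feature_lengths` is hypothetical, and the feature type to count (`"CDS"` by default here) is an assumption, so check the third column of your own GeneMarkS-2 output before relying on it. The two example lines are taken from the ORF finder output shown earlier, which uses the type `gene`.

```python
from statistics import mean

def feature_lengths(gff3_lines, feature_type="CDS"):
    """Return the lengths (end - start + 1) of all features of the given type."""
    lengths = []
    for line in gff3_lines:
        # Skip blank lines and comment/directive lines such as ##gff-version 3
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != feature_type:
            continue
        start, end = int(fields[3]), int(fields[4])
        lengths.append(end - start + 1)
    return lengths

example = [
    "##gff-version 3",
    "chr\tGeneFinder\tgene\t1816\t3075\t.\t+\t.\t1-1260",
    "chr\tGeneFinder\tgene\t4867\t6783\t.\t+\t.\t2-1917",
]
lengths = feature_lengths(example, feature_type="gene")
print(len(lengths), mean(lengths), min(lengths))
```

To run it on a real file, pass an open file handle instead of the list, for example `with open("Bsubtilis.gff3") as fh: lengths = feature_lengths(fh)`.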