# Implementation of Brute Force Searching Algorithm on Genome Sequence
Note: Run all cells in this notebook in order

### 1. File Handling
>#### Functions to read and clean a genome (DNA/RNA) sequence
>Note: The data-cleaning function works with .fna files from https://www.ncbi.nlm.nih.gov/genome/

In [1]:
# Read a .fna file and return its raw contents as a single string

def readSourceFile(filename):
    with open(filename, 'r') as source_file:  # the file is closed automatically on exit
        return source_file.read()

In [2]:
# Strip the title line and newline characters from the .fna file
# and return the clean, upper-case genome sequence

def cleanSourceFile(filename):
    with open(filename, 'r') as source_file:
        lines = source_file.readlines()[1:]  # skip the title line
    raw_DNA = "".join(lines)
    clean_DNA = raw_DNA.replace('\n', '')  # remove newline characters
    clean_DNA = clean_DNA.upper()  # normalize to upper case
    return clean_DNA


# This function assumes a .fna file with a single title line followed by sequence lines,
# as downloaded from https://www.ncbi.nlm.nih.gov/genome/
# .fna files from other sources may be formatted differently, so the function may need to be adapted accordingly
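
For files that do not follow the single-title layout assumed above (for example, several records in one file, or blank lines between records), a more tolerant parser can be sketched as follows. This is only an illustration, not part of the original notebook: the helper name `parseFasta` and the sample text are hypothetical, and it assumes every title line starts with `>`.

```python
def parseFasta(text):
    """Hypothetical helper: split FASTA-formatted text into {title: sequence}."""
    records = {}
    title = None
    for line in text.splitlines():      # splitlines also copes with \r\n endings
        line = line.strip()
        if not line:
            continue                    # ignore blank lines
        if line.startswith('>'):        # assumption: every title line starts with '>'
            title = line[1:]
            records[title] = []
        elif title is not None:         # sequence line belonging to the current record
            records[title].append(line.upper())
    # join the collected lines of each record into one sequence string
    return {t: "".join(parts) for t, parts in records.items()}


# made-up sample text with two records and a blank line between them
example = ">seq1 demo record\nattaaagg\nTTTATACC\n\n>seq2 another record\nACGT\n"
records = parseFasta(example)
print(records)
```

Each sequence in the returned dictionary could then be passed to the search function on its own, rather than concatenating unrelated records into one string.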

### 2. Brute Force Searching Algorithm

In [3]:
def bruteForce(s1, s2): # s1: query sequence; s2: source genome sequence
    occurrences = []    # 1-based start positions of every match
    for i in range(len(s2) - len(s1) + 1):  # try every possible alignment of s1 within s2
        match = True
        for j in range(len(s1)):
            if s2[i+j] != s1[j]: # compare characters
                match = False    # mismatch found
                break            # exit the inner loop early
        if match:       # every character matched at this alignment
            occurrences.append(i+1) # positions are 1-based: the first character is position 1, so store i+1 rather than i
    # print the positions of the query sequence and the number of occurrences
    print("The query sequence can be found at the position(s) of", occurrences,", there are/is", len(occurrences),"occurrence(s) in total.")
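
Because the function above prints its result instead of returning it, the positions are hard to reuse or verify. The sketch below shows one possible variant that returns the list, together with a cross-check built on Python's built-in `str.find`. The names `bruteForcePositions` and `findPositions` and the short test strings are hypothetical, and treating an empty query as "no match" is an assumption made here, not something the original code specifies.

```python
def bruteForcePositions(query, genome):
    """Hypothetical variant of the brute force search that returns 1-based positions."""
    positions = []
    if not query:
        return positions                 # assumption: an empty query matches nothing
    for i in range(len(genome) - len(query) + 1):
        for j in range(len(query)):
            if genome[i + j] != query[j]:
                break                    # mismatch: move on to the next alignment
        else:
            positions.append(i + 1)      # inner loop finished without a mismatch
    return positions


def findPositions(query, genome):
    """Cross-check using str.find; overlapping matches are included."""
    positions = []
    if not query:
        return positions
    start = genome.find(query)
    while start != -1:
        positions.append(start + 1)
        start = genome.find(query, start + 1)   # resume one character further on
    return positions


genome = "ATATATGCAT"                    # made-up sequence for illustration
print(bruteForcePositions("ATA", genome))
print(findPositions("ATA", genome))
```

Both functions should agree on any input, which gives a quick sanity check of the hand-written loops before running them on a full genome.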

### 3. Execution
#### Searching UI starts here

In [4]:
# prompt the user to input a .fna file
file = input("Please enter the source file.fna: ")

# print the content of the .fna file
print("The input file: ")
file_read = readSourceFile(file)
file_read

Please enter the source file.fna: test.fna
The input file: 


'>NC_045512.2 Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome\nATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCTGTTCTCTAAA\nCGAACTTTAAAATCTGTGTGGCTGTCACTCGGCTGCATGCTTAGTGCACTCACGCAGTATAATTAATAAC\nTAATTACTGTCGTTGACAGGACACGAGTAACTCGTCTATCTTCTGCAGGCTGCTTACGGTTTCGTCCGTG\nTTGCAGCCGATCATCAGCACATCTAGGTTTCGTCCGGGTGTGACCGAAAGGTAAGATGGAGAGCCTTGTC\nCCTGGTTTCAACGAGAAAACACACGTCCAACTCAGTTTGCCTGTTTTACAGGTTCGCGACGTGCTCGTAC\nGTGGCTTTGGAGACTCCGTGGAGGAGGTCTTATCAGAGGCACGTCAACATCTTAAAGATGGCACTTGTGG\nCTTAGTAGAAGTTGAAAAAGGCGTTTTGCCTCAACTTGAACAGCCCTATGTGTTCATCAAACGTTCGGAT\nGCTCGAACTGCACCTCATGGTCATGTTATGGTTGAGCTGGTAGCAGAACTCGAAGGCATTCAGTACGGTC\nGTAGTGGTGAGACACTTGGTGTCCTTGTCCCTCATGTGGGCGAAATACCAGTGGCTTACCGCAAGGTTCT\nTCTTCGTAAGAACGGTAATAAAGGAGCTGGTGGCCATAGTTACGGCGCCGATCTAAAGTCATTTGACTTA\nGGCGACGAGCTTGGCACTGATCCTTATGAAGATTTTCAAGAAAACTGGAACACTAAACATAGCAGTGGTG\nTTACCCGTGAACTCATGCGTGAGCTTAACGGAGGGGCATACACTCGCTATGTCGATAACAACTTCTGTGG\nCCCTGATGGCTACCCTCTTGAGTGCATTAAAGACCTT

In [5]:
# print the clean genome sequence
print("Complete genome sequence after cleaning: ")
source = cleanSourceFile(file)
source

Complete genome sequence after cleaning: 


'ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCTGTTCTCTAAACGAACTTTAAAATCTGTGTGGCTGTCACTCGGCTGCATGCTTAGTGCACTCACGCAGTATAATTAATAACTAATTACTGTCGTTGACAGGACACGAGTAACTCGTCTATCTTCTGCAGGCTGCTTACGGTTTCGTCCGTGTTGCAGCCGATCATCAGCACATCTAGGTTTCGTCCGGGTGTGACCGAAAGGTAAGATGGAGAGCCTTGTCCCTGGTTTCAACGAGAAAACACACGTCCAACTCAGTTTGCCTGTTTTACAGGTTCGCGACGTGCTCGTACGTGGCTTTGGAGACTCCGTGGAGGAGGTCTTATCAGAGGCACGTCAACATCTTAAAGATGGCACTTGTGGCTTAGTAGAAGTTGAAAAAGGCGTTTTGCCTCAACTTGAACAGCCCTATGTGTTCATCAAACGTTCGGATGCTCGAACTGCACCTCATGGTCATGTTATGGTTGAGCTGGTAGCAGAACTCGAAGGCATTCAGTACGGTCGTAGTGGTGAGACACTTGGTGTCCTTGTCCCTCATGTGGGCGAAATACCAGTGGCTTACCGCAAGGTTCTTCTTCGTAAGAACGGTAATAAAGGAGCTGGTGGCCATAGTTACGGCGCCGATCTAAAGTCATTTGACTTAGGCGACGAGCTTGGCACTGATCCTTATGAAGATTTTCAAGAAAACTGGAACACTAAACATAGCAGTGGTGTTACCCGTGAACTCATGCGTGAGCTTAACGGAGGGGCATACACTCGCTATGTCGATAACAACTTCTGTGGCCCTGATGGCTACCCTCTTGAGTGCATTAAAGACCTTCTAGCACGTGCTGGTAAAGCTTCATGCACTTTGTCCGAACAACTGGACTTTATTGACACTAAGAGGGGTGTATACTGCTGCCGTGAACATGAGCATGAAATTGCTTGGTACACGGAACGTTC

In [6]:
# prompt the user to enter a query sequence
target = input("Please enter the query sequence to be found: ") 

import datetime
run_start_time = datetime.datetime.now()

# print results
bruteForce(target, source)

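# report the processing time of the searching algorithm
run_time = datetime.datetime.now() - run_start_time
# .microseconds holds only the sub-second part of a timedelta, so convert the whole duration instead
print("Search took", run_time // datetime.timedelta(microseconds=1), "microseconds")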

Please enter the query sequence to be found: CTCTTGAAACTGCTCAAAATTCTG
The query sequence can be found at the position(s) of [1917] , there are/is 1 occurrence(s) in total.
Search took 11007 microseconds


### 4. References
- https://www.geeksforgeeks.org/naive-algorithm-for-pattern-searching/  
- https://www.youtube.com/watch?v=KUbsdGm3G7s&t=219s  