
<h1>Finding Degenerate Primers using Regex</h1>

This Python program uses regular expressions (regex) to find where degenerate primers match a genome sequence.

### Part 0 - Setup (Provided code)
Read the GenBank sample data file

In [None]:
import re # python regex module
import os.path

#Read the sequence file
DATA_FILE_GITHUB = "https://raw.githubusercontent.com/MusBansal/BNFO301Data/main/primersequencewhole.txt"
DEFAULT_FILE_NAME = 'primersequencewhole.txt'

fileName = DEFAULT_FILE_NAME
# If the file does not exist locally, download it from GitHub
if not os.path.exists(fileName):
  # Save the file from GitHub into the current folder
  !wget --no-check-certificate --content-disposition $DATA_FILE_GITHUB

print("Reading file:", fileName)

--2022-02-17 11:13:34--  https://raw.githubusercontent.com/MusBansal/BNFO301Data/main/primersequencewhole.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 582665 (569K) [text/plain]
Saving to: ‘primersequencewhole.txt’


2022-02-17 11:13:34 (17.9 MB/s) - ‘primersequencewhole.txt’ saved [582665/582665]

Reading file: primersequencewhole.txt


### Testing degenerate primers

PCR is a commonly used tool to amplify short regions of a genome. PCR requires primers, short sequences of DNA that bind to specific sites in a genome. It is important to test primers to determine whether they will bind to more than one site in the genome. Here you will use regex to identify matches to each of the four primers listed below. Note that the primers are degenerate, meaning that each primer is a mixture of sequences (e.g., some copies have an A at a given site while others have a C). You will be provided with a FASTA file. Examine this file carefully to determine whether you need to remove newline characters before identifying primer matches.

* M-1B, 5′-agagtttgatmcacc-3′ (M -> A or C)
* M-2B, 5′-ctgctgcsycc-3′ (S -> G or C), (Y -> C or T)
* M-3B, 5′-gcaacgaa-3′
* M-4B, 5′-ggcggtgtgtrc-3′ (R -> A or G)
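
The degenerate codes in the list above map directly onto regex character classes. The following is a minimal sketch of that translation. The names `IUPAC_CLASSES` and `primer_to_regex` are hypothetical helpers introduced here for illustration, and the table covers only the four codes that appear in this list (M, S, Y, R).

```python
import re

# Hypothetical helper table (not part of the assignment code):
# only the degenerate codes used by the four primers are included.
IUPAC_CLASSES = {"m": "[ac]", "r": "[ag]", "y": "[ct]", "s": "[gc]"}

def primer_to_regex(primer):
    """Translate a primer string into a case-insensitive regex pattern."""
    # Plain bases (a, c, g, t) pass through unchanged.
    body = "".join(IUPAC_CLASSES.get(base, base) for base in primer.lower())
    return "(?i)" + body

print(primer_to_regex("agagtttgatmcacc"))  # prints (?i)agagtttgat[ac]cacc
print(primer_to_regex("ggcggtgtgtrc"))     # prints (?i)ggcggtgtgt[ag]c
```

Building the pattern from the primer string, rather than typing it by hand, makes it harder to drop or mistype a base.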

<br>
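
To see why line breaks matter before matching, consider a made-up toy string in which a primer site straddles a line boundary. This is only a sketch: it assumes a single header line followed by wrapped sequence lines, and the toy text is not the assignment data.

```python
import re

# Toy FASTA-style text (made up for illustration): the site "gcaacgaa"
# is split across two lines by a newline.
toy = ">toy header\nttttgcaa\ncgaatttt\n"

# Drop the header line, then remove line breaks and tabs.
body = re.sub(r"^>.*\n", "", toy)
joined = re.sub(r"[\r\n\t]+", "", body)

print(len(re.findall("(?i)gcaacgaa", body)))    # prints 0: the newline hides the site
print(len(re.findall("(?i)gcaacgaa", joined)))  # prints 1: the site is found once joined
```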

In [None]:
# Function to remove the header line (the line that starts with ">") using regex
######################
def removeHeaders(data):
  # Without re.MULTILINE, ^ anchors only at the start of the string, so this
  # matches a header on the first line plus any newlines that follow it
  pattern = r"^>(.)*(\n)*"
  header_re = re.compile(pattern)

  # Count the header matches in the text that was passed in
  list_of_headers = header_re.findall(data)
  print("Total Sequences Found for Replacement:",len(list_of_headers))

  # Replace each header with an empty string
  output = header_re.sub('', data)
  return output
######################


# Read in the string from the file
with open(fileName, "r") as myfile:
    gb_input = myfile.read()

gb_input = removeHeaders(gb_input)


# #Saving it in output file
# OUTPUT_FILE_NAME = 'HW2_Results1.txt'
# outfile = open(OUTPUT_FILE_NAME, "w")
# print("Writing to the file:", OUTPUT_FILE_NAME)
# print(output, file=outfile)
# # close the output file
# outfile.close()



# The sequence is split across lines, so remove all line breaks and tabs.
# The pattern matches runs of CRLF, CR, LF, or tab characters.
pattern = "(\r\n)+|\r+|\n+|\t+"

# Compile the whitespace pattern
whitespace_re = re.compile(pattern)


# Count the whitespace runs that will be removed
list_of_breaks = whitespace_re.findall(gb_input)
print("Total Sequences Found for Replacement:",len(list_of_breaks))

# Remove them so that matches can span former line boundaries
output = whitespace_re.sub('', gb_input)
#print(output)



#function to find sequences
def findSequences(pattern,data):
  count = 0
  for match in re.finditer(pattern, data):
    count = count + 1
    # Start index of match (integer)
    sStart = match.start()

    # End index of match (integer, exclusive)
    sEnd = match.end()

    # Complete match (string)
    sGroup = match.group()

    # Print match
    print('Match: {} "{}" found at: [{},{}]'.format(count, sGroup, sStart,sEnd))
  print("Total Sequences Found:",count, "\n")
  return count

# Finding the patterns

#1. Finding the pattern "agagtttgatmcacc"
#(?i) - ignores the case, m can be a or c -> [ac]
pattern = "(?i)agagtttgat[ac]cacc"
findSequences(pattern,output)

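#2. Finding the pattern "ctgctgcsycc"
#s can be g or c -> [gc], y can be c or t -> [ct]
#NOTE: the primer has a "c" before s ("ctgctgc[gc][ct]cc"); the pattern below
#omits that "c", and the recorded output was produced with the pattern as written
pattern = "(?i)ctgctg[gc][ct]cc"
findSequences(pattern,output)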

#3. Finding the pattern "gcaacgaa"
pattern = "(?i)gcaacgaa"
findSequences(pattern,output)

#4. Finding the pattern "ggcggtgtgtrc"
#r can be a or g -> [ag]
pattern = "(?i)ggcggtgtgt[ag]c"
findSequences(pattern,output)



Total Sequences Found for Replacement: 1
Total Sequences Found for Replacement: 8206
Match: 1 "AGAGTTTGATACACC" found at: [5418,5433]
Total Sequences Found: 1 

Match: 1 "CTGCTGGTCC" found at: [26006,26016]
Match: 2 "CTGCTGGTCC" found at: [559709,559719]
Total Sequences Found: 2 

Match: 1 "GCAACGAA" found at: [73955,73963]
Match: 2 "GCAACGAA" found at: [92807,92815]
Match: 3 "GCAACGAA" found at: [162554,162562]
Match: 4 "GCAACGAA" found at: [192590,192598]
Match: 5 "GCAACGAA" found at: [482871,482879]
Total Sequences Found: 5 

Match: 1 "GGCGGTGTGTAC" found at: [38814,38826]
Match: 2 "GGCGGTGTGTAC" found at: [337431,337443]
Total Sequences Found: 2 



2
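
Since the goal is to check whether each primer binds at more than one site, the per-primer counts can be collected and labeled in one pass. The sketch below runs on a made-up toy sequence, not the assignment data. The names `toy_seq`, `patterns`, `site_counts`, and `describe` are hypothetical and introduced only for illustration.

```python
import re

# Toy sequence (made up): contains "gcaacgaa" twice and "ggcggtgtgtac" once.
toy_seq = "ttgcaacgaaccggcaacgaattggcggtgtgtacaa"

# Patterns mirror two of the primers tested above.
patterns = {
    "M-3B": "(?i)gcaacgaa",
    "M-4B": "(?i)ggcggtgtgt[ag]c",
}

def describe(n):
    """Label a match count in terms of primer specificity."""
    if n == 0:
        return "no site"
    if n == 1:
        return "single site"
    return f"{n} sites"

# Count non-overlapping matches for each primer pattern.
site_counts = {name: len(re.findall(p, toy_seq)) for name, p in patterns.items()}

for name, n in site_counts.items():
    print(name, describe(n))
```

On the toy sequence this reports two sites for M-3B and a single site for M-4B.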