# Finding k-mers 

We use the term k-mer for a string of DNA of length *k*. You can think of k-mers loosely as words within a DNA sequence. For example:
> 'ATCG' is a 4-mer (k-mer of length 4)
> 'ATTGGGCT' is an 8-mer (k-mer of length 8)

**Think about it: How many 2-mers of DNA are there? How many 8-mers? What is the general formula for calculating the number of possible k-mers?**
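
Once you have a formula in mind, one way to check it is to enumerate every possible k-mer for a few small values of k and count them. The sketch below uses `itertools.product` from the standard library. The helper name `all_kmers` is made up for this illustration and is not part of the assignment.

```python
from itertools import product

def all_kmers(k, alphabet="ACGT"):
    """Return every possible string of length k over the alphabet."""
    return ["".join(letters) for letters in product(alphabet, repeat=k)]

# Print how many distinct k-mers exist for a few small k.
for k in (1, 2, 3):
    print(k, len(all_kmers(k)))
```

Compare the counts it prints against your formula. Enumerating is only practical for small k because the list grows very quickly.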

k-mers of various lengths are used across biology to facilitate the analysis of sequence information. For example, k-mers are used for:
1. Assembling genomes and transcriptomes through the construction of de Bruijn graphs (more on this later)
2. Identifying protein-coding regions
3. Identifying species within metagenomes
4. Detecting repeats and transposable elements
5. Finding evidence of recombination
6. Identifying contamination or misassembly

And many more! We will talk more about k-mers in the future, but for today let's just think about how to find them and count them.

Often, sequences are broken down into their component k-mers. For example: 

```
AGATTATATAGATA
```
Is broken into the following 5-mers:
```
AGATT
 GATTA
  ATTAT
   TTATA
    TATAT
     ATATA
      TATAG
       ATAGA
        TAGAT
         AGATA     
```
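
The breakdown above can be reproduced with a sliding window. This is a minimal sketch, assuming plain Python string slicing. The helper name `sliding_kmers` is hypothetical, and unlike the exercise later in this notebook it keeps repeated k-mers and their order.

```python
def sliding_kmers(sequence, k):
    """Return every k-mer in sequence, in order, including repeats."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

sequence = "AGATTATATAGATA"
for offset, kmer in enumerate(sliding_kmers(sequence, 5)):
    # Indent each k-mer by its start position to reproduce the staircase layout.
    print(" " * offset + kmer)
```

A sequence of length 14 contains 14 - 5 + 1 = 10 windows of length 5, which matches the ten 5-mers listed.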

### Finding and counting a given k-mer
Let's write a function that counts how many times a given pattern occurs within a string. Python's `re` module can already do this kind of thing, so feel free to look that up later. For now, let's try doing it ourselves:
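
If you do look up `re` later, keep in mind that k-mer occurrences can overlap. The sketch below is one possible approach, not the method this notebook asks you to write: it compares `str.count`, which skips past each match, with a zero-width lookahead in `re.findall`, which tests every start position. The variable names are chosen for this illustration only.

```python
import re

sequence = "ATCGCTCTCTCACGTGCTCCTATGCT"
pattern = "CTC"

# str.count resumes searching after the end of each match, so overlaps are missed.
non_overlapping = sequence.count(pattern)

# A lookahead matches at every start position without consuming any characters.
overlapping = len(re.findall(f"(?={re.escape(pattern)})", sequence))

print(non_overlapping, overlapping)  # 3 4
```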

First, write a function called `PatternCount`. The idea is that you start with some sequence (a string) and search it for a particular k-mer or pattern (also a string). The function should count the number of times the pattern occurs, including overlapping occurrences, and return that number so it can be printed.

Let's work with this pseudocode and convert it into Python. (Note: `|x|` indicates the length of `x`, `Text(i, n)` is the substring of `Text` of length `n` starting at position `i`, and `<-` indicates assignment.)
```
PatternCount(Text, Pattern)
	count <- 0
	for i <- 0 to |Text| - |Pattern|
		if Text(i, |Pattern|) = Pattern
			count <- count + 1
	return count
```
A trick to making sure this works *properly* is setting the range of the for loop over your string correctly.

> Think about it: if you choose to use `range()`, how should you set the start and stop values so that you search the full string without running past the end of the string being searched?
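
One Python detail is worth knowing here: slicing past the end of a string does not raise an error, it quietly returns a shorter string. The short sketch below, with variable names chosen only for this illustration, loops over every index of a toy string to show what the windows look like near the end.

```python
text = "ATCG"
k = 3

# Slices never raise IndexError; windows that run off the end are just truncated.
for i in range(len(text)):
    window = text[i:i + k]
    print(i, repr(window), len(window) == k)
```

Only the first two windows are full 3-mers, so a loop that runs too far will not crash but will compare against fragments that are too short.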

In [None]:
#Some strings to practice with
starting_string='ATCGCTCTCTCACGTGCTCCTATGCT'
search_1='ATC' 
search_2='CTC'
search_3='GCTC'
search_4='CT'

In [None]:
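def PatternCount(text, pattern):
    count = 0  # running total of matches found so far
    # TODO: compare each window of len(pattern) characters in text against pattern

    return count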
        

In [None]:
## Uncomment this block of text to test out your function!
# print(PatternCount(starting_string, search_1)) #should return 1
# print(PatternCount(starting_string, search_2)) #should return 4
# print(PatternCount(starting_string, search_3)) #should return 2
# print(PatternCount(starting_string, search_4)) #should return 6


### Finding k-mers
Alright, now let's make a function called `PatternFinder`. Given a sequence (string) and a k-mer length (integer), the function should move through the sequence, identify every k-mer it contains, and save them to an array called `Kmers`.

``` 
PatternFinder(Text, k)
	L <- |Text|
	Kmers <- empty array
	for n <- 0 to L - k 
		add Text(n, k) to Kmers
	return Kmers
```

> Think about it: What kind of data type should we use to store our k-mers? 

In [None]:
string1='ATCGC'
string2='AGTCCTCTCGAGACT'

In [None]:
def PatternFinder(text, k):

    return(Kmers)

In [None]:
# Uncomment to check your code! 
# print(PatternFinder(string1,2)) # Should return: 'AT', 'CG', 'GC', 'TC'
# print(PatternFinder(string1,3)) # Should return: 'TCG', 'CGC', 'ATC'
# print(PatternFinder(string2,10)) # Should return: 'CTCTCGAGAC', 'CCTCTCGAGA', 'GTCCTCTCGA', 'TCTCGAGACT', 'TCCTCTCGAG', 'AGTCCTCTCG'

# Fun extra question: 

## How could you make `PatternFinder` impervious to mixed capitalization?

For example, if I passed the string `ACTacTcT` to your function with `k = 2`, it would return:

> `'AC', 'CT', 'TA', 'TC'`

instead of:

> `'AC', 'CT', 'Ta', 'Tc', 'ac', 'cT'`
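
One possible approach, sketched below, is to convert the sequence to a single case before extracting k-mers. The helper name `normalize_case` is hypothetical, and the set comprehension is just a compact stand-in for your own `PatternFinder`.

```python
def normalize_case(sequence):
    """Return the sequence in upper case so 'a' and 'A' compare equal."""
    return sequence.upper()

mixed = "ACTacTcT"
upper = normalize_case(mixed)

# Collect the unique 2-mers of the normalized sequence.
two_mers = {upper[i:i + 2] for i in range(len(upper) - 1)}

print(upper)             # ACTACTCT
print(sorted(two_mers))  # ['AC', 'CT', 'TA', 'TC']
```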

In [None]:
extra_string='ACTacTcT'

# How long did this homework take you? 

## Answer here: 