# 1.1 Text manipulation
This tutorial notebook introduces the following key concepts and their implementations in Python:
- the data type for text, `str`
- `dict` data structure
- iteration
- creating a function

By the end of this tutorial you will have all the ingredients to solve the reverse complement problem introduced in the README.

### Requirements
**Dependencies:**  
Python 3

**Prerequisites:**  
Tutorial 1.0 'Python is a calculator'


## Text with `str`

[docs](https://docs.python.org/3/library/stdtypes.html#text-sequence-type-str)  
Apart from numbers, another common data format we encounter is text. Text in programming languages is called a 'string' (`str`) and is indicated using single, double or triple quotes.

In [None]:
c = 'a single-quoted string'
d = "a double quoted string containing 'single quotes'"

print(c)
print(d)
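
Triple quotes are mentioned above but not shown. They let a string span several lines. A minimal sketch (the variable name `e` is only illustrative):

```python
# a triple-quoted string may contain line breaks and both kinds of quotes
e = """a triple-quoted string
spanning 'two' lines"""

print(e)

# the line break is stored in the string as the newline character '\n'
print(e.count('\n'))
```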

Strings are useful for representing any kind of sequence-like data, including DNA, RNA, and proteins.

In [None]:
FLAG = 'DYKDDDDK'
mRNA = 'GACUACAAAGACGAUGACGAUAAA'
cDNA = 'GACTACAAAGACGATGACGATAAA'

Python strings are immutable sequences of single characters, so you can index and slice them using the same syntax as for lists, which we learned in the previous chapter. Unlike a list, a string cannot be changed in place; operations on strings produce new strings.  
Here are some useful operations on strings:

In [None]:
print("Here are some facts about",FLAG)

# find the length of a string using the len() function
print("length:",len(FLAG))

# get the first, second, etc. character using []
# note that a string index starts at 0.
print("The first character:", FLAG[0]) # the 0th character

# you can index from the end of the string using a negative index.
# negative indices start at -1.
print("The last character:",FLAG[-1])

# get a substring using [start:end].
# the start index is included, the end is not.
print("The first codon:",mRNA[0:3])

# if you omit start, 0 is assumed; if you omit end, the slice runs to the end of the string.
print("The other codons:",mRNA[3:])

# you can even "add" strings.
print("Double FLAG:",FLAG+FLAG)
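
One difference from lists is worth a quick illustration: a string cannot be changed in place. The sketch below (the names `tag` and `new_tag` are just for illustration) shows that assigning to an index raises a `TypeError`, and that the usual workaround is to build a new string from slices.

```python
tag = 'DYKDDDDK'

# strings are immutable: assigning to an index raises TypeError
try:
    tag[0] = 'E'
except TypeError as err:
    print("cannot modify a string in place:", err)

# build a new string from a new first character plus the rest of the old one
new_tag = 'E' + tag[1:]
print(new_tag)

# the original string is unchanged
print(tag)
```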

In addition, strings come with a library of [built-in methods](https://docs.python.org/3/library/stdtypes.html#string-methods), which are called on the string itself.

In [None]:
# lower() returns a new lowercase string; cDNA itself is unchanged
cDNA.lower()
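
A few other string methods are handy for sequence data. This sketch uses `upper()`, `count()`, `find()`, and `replace()`, all standard `str` methods; the variable names other than `cDNA` and `mRNA` are illustrative.

```python
cDNA = 'GACTACAAAGACGATGACGATAAA'
mRNA = 'GACUACAAAGACGAUGACGAUAAA'

# each method returns a new value and leaves the original string untouched
lowered = cDNA.lower()
print(lowered.upper() == cDNA)

# count how many times a base occurs
n_adenine = cDNA.count('A')
print("number of A:", n_adenine)

# find the index of the first occurrence of a substring (-1 if absent)
first_gat = cDNA.find('GAT')
print("first GAT at index:", first_gat)

# replace every T with U to turn the cDNA sequence into the mRNA sequence
converted = cDNA.replace('T', 'U')
print(converted == mRNA)
```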

## `dict`

`dict`, short for 'dictionary', is a lookup table that maps unique *keys* to *values*. Create a `dict` using curly braces `d={key:value}`:

In [None]:
# Create a dict
pokemon_release_years = {
    'Red':1998,
    'Blue':1998,
    'Yellow':1999,
    'Gold':2000,
    'Silver':2000,
    'Crystal':2001,
    'Ruby':2003,
    'Sapphire':2003,
    'FireRed':2004,
    'LeafGreen':2004
}

# Access an entry using dict[key]
best = 'Red'
print('Pokemon',best,'was released in North America in',pokemon_release_years[best])

# Add a new entry
pokemon_release_years['Emerald'] = 2005

Note that each key must be unique, but you can have duplicate values.
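
To see what key uniqueness means in practice, here is a small sketch (the `release_years` dict is a shortened, illustrative copy of the one above). Assigning to an existing key overwrites its value rather than adding a second entry, and the `in` operator checks whether a key is present before you look it up.

```python
release_years = {'Red': 1998, 'Blue': 1998, 'Yellow': 1999}

# two different keys may share the same value
print(release_years['Red'] == release_years['Blue'])

# assigning to an existing key replaces the old value; the dict does not grow
release_years['Yellow'] = 0
release_years['Yellow'] = 1999
print(len(release_years))

# looking up a missing key raises KeyError, so check with `in` first
print('Gold' in release_years)

# .get() returns a default value instead of raising an error
print(release_years.get('Gold', 'unknown'))
```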

A `dict` is useful in any situation where we want to map a limited number of inputs to exactly one output value each. In mathematical terms, this is a *discrete function* on a *finite input space*.  
For example, in molecular biology, we know that DNA always forms complementary pairs: 'A' always pairs with 'T', and 'C' with 'G'.  Suppose we want to write a program to emulate this behavior. We have a finite input space on {A,C,G,T} and each maps to exactly one output, so a `dict` is a good way to implement this:

In [None]:
dna_complements = {
    'A':'T',
    'C':'G',
    'G':'C',
    'T':'A'
}

# To get the complement of 'A',
dna_complements['A']

**Exercise**: A DNA polymerase would replicate DNA using a mapping like the above. RNA polymerase uses DNA as a template to create RNA, which uses U instead of T. What `dict` would an RNA polymerase use?

In [None]:
rna_complements = {
    ## Your code here
    
}

One more note about dictionaries. It is sometimes useful to get all of the keys, values, or key-value pairs in a `dict`. 

In [None]:
print('keys:',pokemon_release_years.keys())
print('values:',pokemon_release_years.values())
print('key-value pairs:',pokemon_release_years.items())
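
The objects returned by `keys()`, `values()`, and `items()` are *views* onto the dict rather than lists. If you need an ordinary list, for example to index into it, you can wrap the view in `list()`. A minimal sketch with a shortened, illustrative dict:

```python
release_years = {'Red': 1998, 'Blue': 1998, 'Yellow': 1999}

# convert the views to plain lists
games = list(release_years.keys())
years = list(release_years.values())
pairs = list(release_years.items())

# dicts keep insertion order (Python 3.7+), so 'Red' comes first
print(games[0])
print(years[0])

# each item is a (key, value) pair
print(pairs[0])
```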

## Iteration

Earlier, we used functions that operate on whole collections of data, such as `len()` to find the length of a string. How do these functions work, and how can we write one ourselves? The same technique is essential if, for example, we want to find the reverse complement of a DNA sequence by *iterating* over the sequence and finding the complement of each base.

To begin, let's imagine how we could write the `len` function:
```
# Pseudocode for function len, which accepts 1 input sequence
start count at 0
for each element in input, add 1 to count
output count
```
Now, let's see the Python code:

In [None]:
# get len(FLAG)
count = 0
for element in FLAG:
    count = count + 1
print(FLAG,"length:",count)

Explaining that code in more detail,
```
count = 0
```
To begin, we define a new variable `count` and set it to 0. 
```
for element in FLAG:
```
The `for x in y:` syntax tells Python to *iterate* over all elements in the collection `y`, each time setting the new variable `x` to the value of that element.
```
    count = count + 1
```
Here indentation is used to define the *body* of the `for` loop. Every line indented beneath the `for` line, with a tab '\t' or (by convention) four spaces '    ', will be run in each iteration.
```
print(FLAG,"length:",count)
```
Because the `print` call is not indented, it marks the end of the `for` loop and is only run once, after the loop has finished.

So, the above `for` loop is equivalent to running:

In [None]:
count = 0
element = FLAG[0]
count = count + 1
element = FLAG[1]
count = count + 1
element = FLAG[2]
count = count + 1
element = FLAG[3]
count = count + 1
element = FLAG[4]
count = count + 1
element = FLAG[5]
count = count + 1
element = FLAG[6]
count = count + 1
element = FLAG[7]
count = count + 1
print(FLAG,"length:",count)

but far more elegant, especially if we wanted to find the length of a much longer string.
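
The same loop pattern works for other collections too. As a sketch, iterating over `dict.items()` yields one key-value pair per iteration, which can be unpacked into two loop variables (the names `game`, `year`, and `titles`, and the shortened dict, are illustrative):

```python
release_years = {'Red': 1998, 'Yellow': 1999, 'Gold': 2000}

# build up a result across iterations, just like `count` in the loop above
titles = ''
for game, year in release_years.items():
    print(game, "was released in", year)
    titles = titles + game + ' '

# this line is not indented, so it runs once after the loop ends
print("all titles:", titles)
```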

Finally, I'll introduce a pattern used frequently across programming, which is to iterate over the indices of a sequence instead of its elements. The following two code snippets are equivalent:

In [None]:
for element in FLAG:
    pass # `pass` does nothing; it is a placeholder where a statement is required

for i in range(len(FLAG)):
    element = FLAG[i]
    pass

The former iterates on the elements of FLAG, i.e. `element` = `'D'`, `'Y'`, `'K'`, ...  
The latter iterates on the *index*, i.e. `i` = `0`, `1`, `2`, ...  and then gets the elements using `element = FLAG[i]`.  
Having access to the index within the `for` loop can be useful in some situations, for example when reversing a string...
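
As one illustration of why the index is useful, `range()` also accepts a start, stop, and step, so the index can jump three bases at a time and slice out each codon. This is only a sketch; the names `codons` and `codon` are illustrative.

```python
mRNA = 'GACUACAAAGACGAUGACGAUAAA'

codons = []
# i = 0, 3, 6, ... : the start index of each codon
for i in range(0, len(mRNA), 3):
    codon = mRNA[i:i+3]
    codons.append(codon)
    print(i, codon)

print(len(codons), "codons")
```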

## Writing a function

So far, we haven't quite replicated the `len` function, as we're missing the nice `len(x)` call syntax. In programming languages, a *function* is a small program-within-a-program that helps you organize, reuse, and share your code. Let's wrap our length code in a function:

In [None]:
def my_len(x):
    count = 0
    for element in x:
        count = count + 1
    return count

Let's take a look at the new lines of code.  
```
def my_len(x):
```
Create a new function using the `def` keyword. Inputs, also called *parameters* or *arguments*, are listed in parentheses `()` and separated by commas `,`. As in the `for` loop, a colon `:` followed by indentation marks the body of the function. Variables defined within the function exist only within its *scope*: they do not persist after the function returns.
```
return count
```
The only values that persist outside the function are the output(s) given to the `return` keyword.

In [None]:
print(FLAG,"length:",my_len(FLAG))
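
To illustrate the point about scope, the sketch below uses a hypothetical function `count_chars` whose internal variable `tally` exists only while the function runs. Referring to `tally` outside the function raises a `NameError`; only the returned value is available to the caller.

```python
def count_chars(x):
    tally = 0  # local variable: exists only inside count_chars
    for element in x:
        tally = tally + 1
    return tally

# the returned value can be stored in a variable of any name
n = count_chars('DYKDDDDK')
print("length:", n)

# `tally` was never defined at this level, so this raises NameError
try:
    print(tally)
except NameError as err:
    print("not visible outside the function:", err)
```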

## Putting it all together

You now have all the ingredients you need to create a `reverse_complement` function. Use a `dict` to encode DNA base complement chemistry, a `for` loop to iterate over the input DNA sequence, and a clever bit of math on the index to reverse the sequence.

In [None]:
## Exercise: complete the function.
complements = {
    'A':'T',
    'C':'G',
    'G':'C',
    'T':'A'
}

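def reverse(x):
    result = ''
    for i in range(len(x)):
        # TODO reverse the string
        pass # placeholder so the cell runs; replace with your code
    return result

def complement(x):
    result = ''
    for base in x:
        # TODO append the complement of base to result
        pass # placeholder so the cell runs; replace with your code
    return result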

def reverse_complement(x):
    return reverse(complement(x))

In [None]:
# Check your work
cDNA = 'GACTACAAAGACGATGACGATAAA'
reverse_complement(cDNA)
# answer should be 'TTTATCGTCATCGTCTTTGTAGTC'

## Solutions

In [None]:
rna_complements = {
    'A':'U',
    'C':'G',
    'G':'C',
    'T':'A'
}

def complement(x):
    result = ''
    for base in x:
        result = result + complements[base]
    return result

# There are a few ways to complete the reverse function:
def reverse(x):
    result = ''
    for i in range(len(x)):
        result = x[i] + result
    return result

def reverse2(x):
    result = ''
    for i in range(len(x)):
        j = len(x)-i-1 # j = len(x)-1, len(x)-2, ..., 0
        result = result + x[j]
    return result

def reverse3(x):
    result = ''
    for i in range(len(x)):
        j = -i-1 # j = -1,-2,-3,...
        result = result + x[j]
    return result

def reverse_complement(x):
    return reverse(complement(x))
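
As a sanity check on the solutions, the sketch below restates them in self-contained form and confirms that the result matches the expected answer, that two of the reversing approaches agree, and that taking the reverse complement twice returns the original sequence. It also compares against the slice `x[::-1]`, a built-in shortcut for reversing a sequence that was not covered in this tutorial.

```python
complements = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}

def complement(x):
    result = ''
    for base in x:
        result = result + complements[base]
    return result

def reverse(x):
    result = ''
    for i in range(len(x)):
        result = x[i] + result  # prepend, so later characters end up first
    return result

def reverse3(x):
    result = ''
    for i in range(len(x)):
        result = result + x[-i-1]  # read from the end using negative indices
    return result

def reverse_complement(x):
    return reverse(complement(x))

cDNA = 'GACTACAAAGACGATGACGATAAA'
rc = reverse_complement(cDNA)
print(rc)

# matches the expected answer from the "Check your work" cell
print(rc == 'TTTATCGTCATCGTCTTTGTAGTC')

# the loop-based versions agree with each other and with slicing
print(reverse(cDNA) == reverse3(cDNA) == cDNA[::-1])

# applying the reverse complement twice gives back the original sequence
print(reverse_complement(rc) == cDNA)
```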