# Python for Scientific Research
# Functions & modules
# Answers to exercises

## February 2020

## Exercise 1
### Exercise 1.1
1. Reproduce the `find_motif()` function presented in the lecture. Make sure you understand every statement in that code. Use the `help()` function to display the documentation string.

In [18]:
def find_motif(DNA, motif="gaatca"):
    """
    Finds a motif within a DNA sequence and returns a list of start indices
    
    Parameters
    ----------
    DNA : str
        A string containing the DNA sequence to be searched
    motif : str, optional
        The motif to be found in the DNA sequence

    Returns
    -------
    A list of indices highlighting the start of the motif in the DNA sequence

    """
    index = 0 # set initial index at the start of the string
    indices = [] # initialize empty list to store any successful finds of motifs
    while index != -1:  # go on as long as DNA.find does not return a -1 (i.e., no motifs in remainder of DNA)
        index = DNA.find(motif, index) # find the motif
        
        if index != -1: # successful find of motif 
            indices.append(index) # add position at which motif was found to list of starting indices 
            index += 1 # move the start position used by DNA.find() one step forward, otherwise the same position would be found again
    
    return(indices)

Now use the help function to display the documentation string we wrote:

In [24]:
help(find_motif)

Help on function find_motif in module __main__:

find_motif(DNA, motif='gaatca')
    Finds a motif within a DNA sequence and returns a list of start indices
    
    Parameters
    ----------
    DNA : str
        A string containing the DNA sequence to be searched
    motif : str, optional
        The motif to be found in the DNA sequence
    
    Returns
    -------
    A list of indices highlighting the start of the motif in the DNA sequence



### Exercise 1.2

2. Play around with different values for `motif` and call the function by:
    * argument order/position
    * argument keyword
    * using default arguments

Example using a sequence that does not contain the default motif:

In [25]:
focalDNA="aaagggaggggggaggagag"
indicesDefaultMotif = find_motif(DNA=focalDNA)
print(indicesDefaultMotif) # returns empty list as the default motif is not present in the focalDNA string

[]


Now use a custom motif and print the indices:

In [26]:
focalDNA="aaagggaggggggaggagag"
motif1 = "gg"
indicesGG = find_motif(DNA=focalDNA,motif=motif1)
print(indicesGG)

[3, 4, 7, 8, 9, 10, 11, 14]


The same call, but with the arguments in a different order when we call `find_motif()`. As long as the arguments are identified by keyword, their order does not matter:

In [28]:
focalDNA="aaagggaggggggaggagag"
motif1 = "gg"
indicesGG = find_motif(motif=motif1,DNA=focalDNA)
print(indicesGG)

[3, 4, 7, 8, 9, 10, 11, 14]


What happens if we call the function with no arguments at all?

In [31]:
find_motif()

TypeError: find_motif() missing 1 required positional argument: 'DNA'

Python complains about the missing argument for the parameter `DNA`, but not about `motif`, because a default value is provided for it.
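The calls above all identify arguments by keyword. As a minimal sketch of the remaining case from the exercise, calling by argument position, the block below restates `find_motif()` in compact form (only so that the example runs on its own) and then calls it positionally. The motif `"agg"` and the result variable names are chosen here purely for illustration.

```python
def find_motif(DNA, motif="gaatca"):
    """Compact restatement of find_motif() so that this example is self-contained."""
    indices = []
    index = DNA.find(motif)
    while index != -1:
        indices.append(index)
        index = DNA.find(motif, index + 1)  # continue searching one position further
    return indices

focalDNA = "aaagggaggggggaggagag"

# Arguments matched by position: the first goes to DNA, the second to motif
by_position = find_motif(focalDNA, "gg")
print(by_position)  # [3, 4, 7, 8, 9, 10, 11, 14]

# Position and keyword can be mixed, as long as positional arguments come first
mixed = find_motif(focalDNA, motif="agg")
print(mixed)  # [2, 6, 13]

# With positional arguments the order matters: here "gg" is searched
# for the long sequence, which it obviously does not contain
swapped = find_motif("gg", focalDNA)
print(swapped)  # []
```

Swapping positional arguments does not raise an error, it silently gives a different answer, which is one reason keyword arguments are often the safer choice.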

### Exercise 1.3
For the biologists amongst you, write a function to return the complement of a DNA sequence. That is, if the input is `"acgt"` the function returns `"tgca"`. Similarly, write a function that returns the reverse complement of a DNA sequence.

**Hint 1**: Use a dictionary to specify which character is swapped with what:

`compDict = {'a': 't', 'c': 'g', 'g': 'c', 't': 'a'} # i.e., 'c' should be swapped with 'g' etc.`

Then use a list comprehension to loop through each character in your string and convert to its complement using the dictionary.

**Hint 2**: To reverse a string/list use the slice operator `[::-1]`

In [39]:
def DNA_complement1(sequence):
    compDict = {"a":"t","c":"g","g":"c","t":"a"}
    
    newseq = [ compDict[x] for x in sequence]
    
    return("".join(newseq))

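def DNA_complement2(sequence):
    # reverse complement: take the complement first, then reverse the result
    return(DNA_complement1(sequence)[::-1])

In [41]:
DNA_complement1("acgt")

'tgca'

In [42]:
DNA_complement2("acgt")

'acgt'

Note that `"acgt"` happens to be its own reverse complement, so `DNA_complement2()` returns the input unchanged here.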
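As an alternative to the dictionary and list comprehension suggested in the hints, the same result can be sketched with the built-in string methods `str.maketrans()` and `str.translate()`. The names `COMPLEMENT_TABLE`, `complement` and `reverse_complement` are made up for this illustration, and the sketch assumes lowercase sequences.

```python
# Translation table mapping each base to its complement (lowercase bases only)
COMPLEMENT_TABLE = str.maketrans("acgt", "tgca")

def complement(sequence):
    """Return the complement of a lowercase DNA sequence."""
    return sequence.translate(COMPLEMENT_TABLE)

def reverse_complement(sequence):
    """Return the reverse complement: complement first, then reverse."""
    return complement(sequence)[::-1]

print(complement("aacg"))          # ttgc
print(reverse_complement("aacg"))  # cgtt
```

One difference in behaviour: `translate()` leaves characters that are not in the table unchanged, whereas the dictionary lookup raises a `KeyError` for an unexpected character. Which of the two is preferable depends on whether invalid input should be noticed immediately.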

## Exercise 2

### Exercise 2.1
Consider the following two-dimensional list of sequences: `sequences_2d = [["ccggattc","ggggta","ggc"],["aat","agcgaccc"],["ggccccaaa"]]`. Write a function called `append_seq` that accepts a two-dimensional list of DNA sequences (for now, we can assume that the list is indeed 2-dimensional and contains sequences). It then adds a user-specified sequence, say `"ggtttaa"`, to the front of each individual sequence. The function then returns the updated two-dimensional list of sequences to the variable `sequences_2d_new`. 

I will show four attempts, where only the last two attempts are successful:

#### First attempt:

In [59]:
def append_seq1(list2d, z="ggtttaa"):
    """
    Adds a sequence z to each element within a 2d list of sequences,
    contained in list2d

    Parameters
    ----------
    list2d : list
        two-dimensional list with sequences.
    z : str, optional
        sequence to add to the front of each sequence. The default is "ggtttaa".

    Returns
    -------
    two dimensional list with updated sequences.

    """
    
    # outer loop going through rows of the 2d list
    for row in list2d:
        # inner loop going through the columns within each row
        for col in row:
            # 'update' value of the individual elements
            col = z + col
                        
    return(list2d)
            
sequences_2d = [["ccggattc","ggggta","ggc"],["aat","agcgaccc"],["ggccccaaa"]]        
sequences_2d_new = append_seq1(sequences_2d)
print(sequences_2d)
print(sequences_2d_new)

[['ccggattc', 'ggggta', 'ggc'], ['aat', 'agcgaccc'], ['ggccccaaa']]
[['ccggattc', 'ggggta', 'ggc'], ['aat', 'agcgaccc'], ['ggccccaaa']]


#### Nothing happened: what is going on?
Ultimately, nothing happens within the function `append_seq1`. In the statement `for col in row`, the name `col` is assigned *a reference* to each individual element of `list2d` in turn, as in any copying operation of a single-dimensional variable (see the lecture on data types).

Because strings are immutable, the expression `z + col` creates a new string, and the assignment `col = z + col` merely makes the name `col` refer to that new string. By contrast, the original reference stored in `list2d` is left untouched.

To really change the elements of `list2d`, working with copies of the individual elements (as we did above) does not suffice. Instead, we have to assign directly to the elements of `list2d` themselves using two-dimensional index operators, e.g., `list2d[row_idx][col_idx] = ...`, where `row_idx` and `col_idx` are the index numbers of the rows and columns of our two-dimensional list, respectively. This is what we will do in the second attempt. As we will see below, however, this method is still far from perfect, because directly updating list elements introduces other problems.
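The difference between rebinding the loop variable and assigning through an index can be seen in isolation on a single row. This is a small illustrative sketch; the prefix `"gg"` and the variable `after_rebinding` are arbitrary choices for the demonstration.

```python
row = ["aat", "agcgaccc"]

# Rebinding the loop variable: the list itself is never touched
for col in row:
    col = "gg" + col  # 'col' now names a new string; row still holds the old one
after_rebinding = list(row)  # snapshot of the row at this point
print(after_rebinding)  # ['aat', 'agcgaccc']

# Assigning through an index: the list element itself is replaced
for col_idx, col in enumerate(row):
    row[col_idx] = "gg" + col
print(row)  # ['ggaat', 'ggagcgaccc']
```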

#### Second attempt (still not perfect):

In [60]:
def append_seq2(list2d, z="ggtttaa"):
    """
    Adds a sequence z to each element within a 2d list of sequences,
    contained in list2d

    Parameters
    ----------
    list2d : list
        two-dimensional list with sequences.
    z : str, optional
        sequence to add to the front of each sequence. The default is "ggtttaa".

    Returns
    -------
    two dimensional list with updated sequences.

    """
    
    # outer loop going through rows of the 2d list, but now we
    # use enumerate() to
    # keep track of the index in the original list
    # so that we can directly access individual list elements
    for row_idx, row in enumerate(list2d):
        # inner loop going through the columns within each row, again using enumerate()
        # so that we retain indices of the columns in col_idx
        for col_idx, col in enumerate(row):
            # update the value of the individual elements
            list2d[row_idx][col_idx] = z + col
                        
    return(list2d)
            
sequences_2d = [["ccggattc","ggggta","ggc"],["aat","agcgaccc"],["ggccccaaa"]]        
sequences_2d_new = append_seq2(sequences_2d)
print(sequences_2d)
print(sequences_2d_new)

[['ggtttaaccggattc', 'ggtttaaggggta', 'ggtttaaggc'], ['ggtttaaaat', 'ggtttaaagcgaccc'], ['ggtttaaggccccaaa']]
[['ggtttaaccggattc', 'ggtttaaggggta', 'ggtttaaggc'], ['ggtttaaaat', 'ggtttaaagcgaccc'], ['ggtttaaggccccaaa']]


#### Woah, the values of `sequences_2d` and `sequences_2d_new` are the same! What happened?!

Now we work directly with the elements of `list2d` rather than with copies of these elements, so changes to the individual elements are indeed carried through. The problem is that assigning to `list2d[row_idx][col_idx]` also modifies the original list. When the function `append_seq2()` starts, the parameter `list2d` is assigned a reference to the same list that `sequences_2d` refers to. As explained in the lecture on data types, assigning new values to the *elements* of the list that `list2d` points to changes them for every variable that refers to that list, unless we first make a deep copy. This is what we will do in our next example.
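To make the deep-copy idea concrete, here is a minimal sketch using `copy.deepcopy()` from the standard library. The function name `append_seq_copy` is hypothetical and this is only one possible way to write it, not necessarily the version from the lecture.

```python
import copy

def append_seq_copy(list2d, z="ggtttaa"):
    """Return a new 2d list with z added to the front of every sequence."""
    # Deep copy: the outer list and every inner list are duplicated,
    # so changes below cannot reach the caller's list
    new_list = copy.deepcopy(list2d)
    for row in new_list:
        for col_idx, col in enumerate(row):
            row[col_idx] = z + col
    return new_list

sequences_2d = [["ccggattc", "ggggta", "ggc"], ["aat", "agcgaccc"], ["ggccccaaa"]]
sequences_2d_new = append_seq_copy(sequences_2d)
print(sequences_2d)      # the original list is unchanged
print(sequences_2d_new)  # every sequence now starts with "ggtttaa"
```

A shallow copy (for example `list2d[:]`) would not be enough here, because it duplicates only the outer list while the inner lists would still be shared with the original.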



2. After you have called `append_seq`, inspect the value of the original two-dimensional list `sequences_2d`. Does it differ from `sequences_2d_new`? Why/why not?

3. Extend the function with code that checks whether the list is indeed two-dimensional (do not bother yet with checking whether the strings of text are indeed sequences, we will do this later). If the list is not two-dimensional, print an error message and return `None`.
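One possible way to approach the dimensionality check is sketched below. It assumes that "two-dimensional" means a list whose elements are all lists, none of which contain further lists. The helper `is_two_dimensional` and the function name `append_seq_checked` are hypothetical names introduced for this illustration.

```python
def is_two_dimensional(candidate):
    """Return True if candidate is a list of lists without any deeper nesting."""
    if not isinstance(candidate, list):
        return False
    for row in candidate:
        if not isinstance(row, list):
            return False  # a row that is not a list: at most one-dimensional
        if any(isinstance(element, list) for element in row):
            return False  # a list inside a row: more than two dimensions
    return True

def append_seq_checked(list2d, z="ggtttaa"):
    """Add z to the front of each sequence, or return None for a non-2d list."""
    if not is_two_dimensional(list2d):
        print("Error: the list provided is not two-dimensional")
        return None
    # Build a new 2d list so that the caller's list is left untouched
    return [[z + seq for seq in row] for row in list2d]

print(append_seq_checked([["aat"], ["ggc"]]))  # [['ggtttaaaat'], ['ggtttaaggc']]
print(append_seq_checked(["aat", "ggc"]))      # prints the error message, then None
```

Note that under this definition an empty list passes the check, because there is no row that violates it. Whether that is desirable is a design decision.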