# ACTIVITY 1: Loops and Conditional Statements

In last week's homework, you were asked to use `range()` as a way to run a `for` loop over two lists at once. In today's lecture, we went through yet another way to loop over multiple lists using `zip()`.

It may seem redundant to practice multiple ways of iterating over lists with a `for` loop, but as you'll see later in the course, it's not. `zip()` and `range()` are two critical, commonly used Python functions, and it's worth your while to know how to use both.
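As a minimal sketch of the difference, the two loops below pair up items from two short lists, first with index-based `range()` and then with `zip()`. The lists `samples` and `readings` are made-up placeholders, not course data.

```python
# Hypothetical example lists (not the reporter gene data)
samples = ['a', 'b', 'c']
readings = [10, 20, 30]

# range(): loop over indices, then look up each item by position
pairs_from_range = []
for i in range(len(samples)):
    pairs_from_range.append((samples[i], readings[i]))

# zip(): loop over the items themselves, one pair at a time
pairs_from_zip = []
for sample, reading in zip(samples, readings):
    pairs_from_zip.append((sample, reading))

print(pairs_from_range == pairs_from_zip)  # True: both loops build the same list
```

The `range()` version is handy when you need the index itself; the `zip()` version is shorter when you only need the paired values.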

## Problem 1: Reporter gene data version 3

In this problem, we return yet again to the red/green reporter gene experiment we saw in the last lecture and in the homework. The problem is the same one you were asked to complete in the homework, but this time you should solve it with `zip()`.

Recall that your task is to write a `for` loop that iterates over both `red` and `green` data at the same time, and divide the red value by its corresponding green value. The result should be appended to a new list called `norm_data`.

In the cell below, do the following:

1. Create an empty list called `norm_data`, to hold the normalized values.

2. Write a `for` loop using `zip()` to iterate over both `red` and `green`. 

3. In the loop block (indented lines), divide the red value by the green value and `.append()` the result to the list `norm_data`.


In [1]:
red = [23, 145, 203, 235, 354, 456]
green = [5, 11, 6, 9, 8, 4]
norm_data = []

# Write your code below. Use print() to display the result.
for r_gene, g_gene in zip(red, green):
    norm_data.append(r_gene / g_gene)

norm_data

[4.6, 13.181818181818182, 33.833333333333336, 26.11111111111111, 44.25, 114.0]

In [2]:
# Run this cell to test your answer
assert len(norm_data) == len(red)
assert norm_data[2] < 33.834 and norm_data[0] == 4.6

## Problem 1a: Convert from string to numeric data
Here is a quick problem whose relevance will become more apparent in the next lecture:

To normalize the reporter data, you are dividing *numbers*. But (as you'll see next week), numerical data sometimes arrives as a Python string. Before doing any math with the data, you need to convert strings to a numeric data type, either **integers** or **floats** (floating point numbers, meaning anything with a decimal).

To convert one data type to another, there are handy Python functions:

```python
datum = '23' # string data type
datum = int(datum) # convert the string to integer
datum = str(datum) # convert back to a string
datum = float(datum) # convert to float
```

Two points to keep in mind:

1) Don't worry about converting integers to floats before dividing. In Python 3, the `/` operator always returns a float, even when both operands are integers and the answer is a whole number (for example, `8 / 4` gives `2.0`).

2) In the code above I changed the type of the original variable `datum` by assigning the converted value back to the original variable name. Often we *won't* change the original variable; it's usually better to keep the original data unchanged, in case your code returns to it later.
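The short sketch below illustrates both points under simple assumptions: the converted values go into new variables (`as_int` and `as_float` are names chosen here for illustration), so the original string `raw` is left untouched.

```python
raw = '23'             # original string, left unchanged
as_int = int(raw)      # 23
as_float = float(raw)  # 23.0

print(as_int / 5)  # 4.6
print(8 / 4)       # 2.0 (a float, even though the answer is a whole number)
print(type(raw))   # <class 'str'>: the original is still a string
```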

In the cell below, our reporter data is in string format. Rewrite your code above to convert strings to integers before dividing red by green.

In [3]:
red = ['23', '145', '203', '235', '354', '456']
green = ['5', '11', '6', '9', '8', '4']
norm_data = []

# Write your code below. Use print() to display the result.
for r_gene, g_gene in zip(red, green):
    r_gene_int = int(r_gene)
    g_gene_int = int(g_gene)
    norm_data.append(r_gene_int / g_gene_int)

norm_data

[4.6, 13.181818181818182, 33.833333333333336, 26.11111111111111, 44.25, 114.0]

## Problem 2: `zip()` and unequal lists

The code you've written to normalize reporter gene data assumes that the `red` and `green` lists are of equal length. What if they aren't? In that case, you'd like the code to raise an error. So which method of looping, `range()` or `zip()`, will give you an error if the lists are unequal?

Try both methods in the cells below to see what happens when you iterate over two lists of unequal length. You can use the code you wrote as a starting point.

In [4]:
red = [23, 145, 203, 235, 354, 456]
green = [5, 11, 6, 9, 8] # MISSING DATA - shorter list
norm_data = []

for r_gene, g_gene in zip(red, green):
    norm_data.append(r_gene / g_gene)

print(len(red))
print(len(green))
print(len(norm_data))

norm_data

6
5
5


[4.6, 13.181818181818182, 33.833333333333336, 26.11111111111111, 44.25]

In [5]:
# Use range() to loop over these lists. 
# Use range(len(red)) in your for statement.

red = [23, 145, 203, 235, 354, 456]
green = [5, 11, 6, 9, 8] # MISSING DATA - shorter list
norm_data = []

# Use range() to loop over these lists (do you get an error?).
for i in range(len(red)):
    r_gene = red[i]
    g_gene = green[i]
    norm_data.append(r_gene / g_gene)

norm_data

# This error is expected

IndexError: list index out of range

How does `zip()` handle two lists of unequal lengths? What happens when you use `range()` to generate list indices beyond the length of the list?

The bigger lesson here is that our reporter data normalization code should first test whether the lengths of the two lists are equal before running the `for` loop. You could write something like this:

```python
if len(red) == len(green):
    # run the for loop here
    for r_gene, g_gene in zip(red, green):
        norm_data.append(r_gene / g_gene)
else:
    print('Unequal data in red and green channels!')
```

We'll focus more on such tests when we begin writing Python functions.
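To make the two behaviors concrete, here is a small sketch using shortened, made-up lists. It uses `try`/`except` (not yet covered in this course) only so that both cases can run in a single cell without stopping at the error.

```python
red = [23, 145, 203]  # made-up values
green = [5, 11]       # one measurement missing

# zip() stops quietly when the shorter list runs out
zip_results = []
for r_gene, g_gene in zip(red, green):
    zip_results.append(r_gene / g_gene)
print(len(zip_results))  # 2

# Indexing with range(len(red)) runs past the end of green
range_error = None
try:
    for i in range(len(red)):
        red[i] / green[i]
except IndexError as err:
    range_error = err
print(range_error)  # list index out of range
```

In Python 3.10 and later, `zip(red, green, strict=True)` raises a `ValueError` when the lists differ in length, which is another way to catch missing data.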

## Problem 3: Use `if` to avoid dividing by 0

Zeros in our control data are another potential source of error in our reporter gene normalization code. If we got a blank measurement for one sample in the green channel, then our code above won't work – you can't divide red by green if green is zero.

So let's make our normalization code a little more sophisticated, by adding a conditional statement to check for zero values in the green channel.

We want a `for` loop that recognizes when green is zero and then executes an alternate instruction.

In the cell below, I've edited `green` to include a 0. Write a `for` loop to normalize red by green, as before, but change the block of the `for` loop, using `if` and `else`, to do the following:

 *If* the value of green is greater than zero, normalize red by green and append the result to `norm_data`.
 *Else*, print: "Green value is zero." (And do nothing else - we skip that sample.)

In [6]:
red = [23, 145, 203, 235, 354, 456]
green = [5, 11, 6, 0, 8, 4]
norm_data = [] # clear the list - we'll reuse this variable name

# Write a loop to normalize the data, checking for divide by zero errors.
for r_gene, g_gene in zip(red, green):
    if g_gene > 0:
        norm_data.append(r_gene / g_gene)
    else:
        print('Green value is zero.')  # skip this sample

norm_data

Green value is zero.

[4.6, 13.181818181818182, 33.833333333333336, 44.25, 114.0]

## Problem 4: Calculate GC content of a sequence

This problem is a slightly more challenging application of the basic concepts of `for` loops and `if` statements. As we'll discuss more in the second half of today's lecture, a string in Python is somewhat similar to a list: A string is an ordered collection of text characters, just like a list is an ordered collection of items (of any data type).

You can iterate over a string with a `for` loop just as you iterate over a list. In the cell below, do the following: 

1) Initialize a variable `gc_count` by setting it equal to zero.

2) Iterate over each base of the DNA sequence below using `for`.

3) In the block of the loop, determine whether the base is a G or C. If it is, increment `gc_count` by one.

4) After the `for` loop finishes, calculate fraction GC by dividing `gc_count` by the total length of the DNA sequence. (What function gives you the length of a string?) Assign the answer to the variable `gc_frac`.


HINT 1: The syntax to loop over a string will be the same as a `for` loop over just one list, as we practiced last week. Just replace the list with a string in the `for` expression.

HINT 2: Here is how to put two conditions into a single `if` expression by using `or`:

```python
if my_variable == 'value1' or my_variable == 'value2':
```
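A brief sketch of how `or` behaves, including a common mistake. The variable names here are illustrative only, and the `in` form is an alternative that this activity does not require.

```python
base = 'A'

# Correct: each side of `or` is a complete comparison
is_gc = base == 'G' or base == 'C'
print(is_gc)  # False

# Equivalent membership test against a list of allowed values
is_gc_in = base in ['G', 'C']
print(is_gc_in)  # False

# Common mistake: `or 'C'` is not a comparison. A non-empty string
# counts as true, so this expression is truthy for every base.
always_true = bool(base == 'G' or 'C')
print(always_true)  # True
```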

In [7]:
dna = 'GTGCATGTTTGTCGGATGAGCCTAAACTAGGGAGGTCAGCCAGGGTT'
gc_count = 0
gc_frac = 0

# Write a loop to count the number of G and C characters in the DNA sequence.
# Find GC content by dividing the number of G and C characters by the total number of characters.
for base in dna:
    if base == 'G' or base == 'C':
        gc_count += 1

gc_frac = gc_count / len(dna)

gc_frac

0.5319148936170213

In [8]:
# Run this cell to check your answer
assert gc_count == 25
assert gc_frac > 0.53 and gc_frac < 0.532
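As an aside, once the loop version works, the same count can be cross-checked with the built-in string method `str.count()`. This is only a sketch of an alternative, not the approach the problem asks for; the `_alt` variable names are chosen here to avoid overwriting your answer.

```python
dna = 'GTGCATGTTTGTCGGATGAGCCTAAACTAGGGAGGTCAGCCAGGGTT'

# str.count() returns the number of occurrences of a substring
gc_count_alt = dna.count('G') + dna.count('C')
gc_frac_alt = gc_count_alt / len(dna)

print(gc_count_alt)  # 25
```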

# END Activity 1.
If you're done early, take a break.

# ACTIVITY 2: Building and using Dictionaries

The problems in this activity will (mostly) focus on creating and manipulating Python dictionaries.

## Problem 1: Gene Ontology Dictionary
This problem is a simplified version of a type of problem that will come up frequently in this course, especially when we do enrichment calculations with genomic data.

Imagine you performed some experiment in which you measured gene expression in a sample before and after some perturbation. You have obtained gene ontology annotations for the genes whose expression increased after the perturbation. Your task is to count the number of times each gene ontology term shows up in your results. To keep track of the counts, you will create a dictionary in which the *keys* are gene ontology terms and the *values* are number of times the term occurs in the data.

In the cell below is a list of gene ontology terms for each gene whose expression increased. You're going to loop over the list and 1) add new terms to the dictionary as you encounter them, *or* 2) increment the count for entries that already exist. 

For this, we can use the keywords `in` and `not in` to check if an entry is already in the dictionary. Below is a simple example of code that checks to see if a dictionary entry exists. If the entry *doesn't* exist, one is created and given an initial value of 1. If the entry already exists, the code increments the count by 1:

```python
### Counting coin tosses

coin_tosses = {} # empty dictionary
toss = 'heads' # the data

if toss not in coin_tosses:
    coin_tosses[toss] = 1 # create a new entry and set value to 1
else:
    coin_tosses[toss] += 1 # increment existing entry
```

Below, apply the logic of the coin toss example to write code to count occurrences of gene ontology terms. Write the code to do the following:

1. Create an empty dictionary called `go_counts`.
2. Use a `for` loop to iterate over the list `go_annotations`, which is defined in the cell below.
3. In the block of the `for` loop, add new GO terms to the dictionary as they come up, and increment the count of terms that are already in the dictionary.

Use `print()` to see the results.

(If you're not familiar with gene ontology analysis, you can find an introduction here: http://geneontology.org.)

In [9]:
# Some simplified GO annotations
go_annotations = ['neurogenesis', 'eye development', 'neurogenesis', 'cell differentiation',
                  'detoxification', 'cell differentiation', 'neurogenesis', 'detoxification',
                  'exocytosis', 'lipid transport', 'signaling', 'neurogenesis']

go_counts = {}

# Write your code below. Use print() on your dictionary to see how it worked.
for go in go_annotations:
    if go not in go_counts:
        go_counts[go] = 1
    else:
        go_counts[go] += 1

go_counts

{'neurogenesis': 4,
 'eye development': 1,
 'cell differentiation': 2,
 'detoxification': 2,
 'exocytosis': 1,
 'lipid transport': 1,
 'signaling': 1}

In [10]:
# Check your answer
assert go_counts['neurogenesis'] == 4
assert go_counts['eye development'] == 1
assert len(go_counts) == 7
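For comparison, Python's standard library provides `collections.Counter`, which performs this kind of tally directly. The sketch below uses a shortened, made-up list (`terms`) rather than the full annotation data. Writing the loop by hand, as above, is still the point of the exercise.

```python
from collections import Counter

terms = ['neurogenesis', 'signaling', 'neurogenesis']  # shortened example list
term_counts = Counter(terms)

print(term_counts['neurogenesis'])  # 2
print(term_counts['exocytosis'])    # 0: a Counter reports zero for missing keys
```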

## Problem 2: Restriction enzymes

Your goal in this problem is to identify which restriction enzymes cut a DNA sequence. You'll write code to do the following:

1) Create a dictionary of restriction enzyme specificities in which the sequence is the *key* and the enzyme name is the *value*.

2) Run a `for` loop over a DNA sequence, using `range()` to take 6 bp windows. Check if those 6 bp sequences match a dictionary entry of enzyme specificities. If yes, append the name of the restriction enzyme to a list.

It sounds like a simple problem, but the code will bring together most of the concepts we've learned so far. We'll solve this in two steps.

### 2A: Build the dictionary by looping over two lists.
In the cell below are two lists, `enzyme_names` and `specificities`. Matching names and specificities are in the corresponding position in each list.

Build a dictionary called `enzyme_dictionary` by using `for` with `zip()` to loop over the two lists. Add dictionary entries as you go, with the *specificity as the key and the enzyme name as the value.*

HINT: Define your empty dictionary *before* the `for` loop. (Why is this important?)

In [11]:
enzyme_names = ['EcoRI', 'BamHI', 'SphI', 'XhoI', 'XbaI', 'SacII', 'HindIII']
specificities = ['GAATTC', 'GGATCC', 'GCATGC', 'CTCGAG', 'TCTAGA', 'CCGCGG', 'AAGCTT']

enzyme_dictionary = {}

# Write your code below
for enzyme, specificity in zip(enzyme_names, specificities):
    enzyme_dictionary[specificity] = enzyme

enzyme_dictionary

{'GAATTC': 'EcoRI',
 'GGATCC': 'BamHI',
 'GCATGC': 'SphI',
 'CTCGAG': 'XhoI',
 'TCTAGA': 'XbaI',
 'CCGCGG': 'SacII',
 'AAGCTT': 'HindIII'}
 'AAGCTT': 'HindIII'}
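Once the loop version is clear, the same dictionary can also be built in one step by passing `zip()` to `dict()`. The sketch below uses a three-enzyme subset, and the names `names`, `sites`, and `site_to_name` are chosen here for illustration.

```python
names = ['EcoRI', 'BamHI', 'HindIII']
sites = ['GAATTC', 'GGATCC', 'AAGCTT']

# zip() pairs each site with its name; dict() turns the pairs into entries
site_to_name = dict(zip(sites, names))

print(site_to_name['GGATCC'])  # BamHI
```

Note the argument order: the first list passed to `zip()` supplies the keys, the second supplies the values.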

### 2B: Loop over a DNA sequence & take 6 bp slices

In the cell below, you're given a DNA sequence. Write code to do the following:

1) Create a blank list called `cutters` to hold the list of enzymes that cut the sequence.

2) Iterate over the DNA sequence, 1 bp at a time using `range()`.

3) In the block of the loop:
   1. Take a 6 bp slice of the DNA sequence.
   2. Check whether that 6 bp sequence is a key in the dictionary.
   3. If it is, append the enzyme name to `cutters`.
        
#### Hints
To step through a string (such as a DNA sequence) one window at a time, we use two tools: slices and `range()`. You've already used `range()` in a `for` loop, so let's look at slices:

**Grabbing three bases with a slice**

Strings, like lists, can be accessed with indexing and slices, like this:

```python
dna = 'ATGAGCAGGTCAGTGACTGAT' # a DNA sequence
dna[0] # the first base
dna[0:3] # a slice that grabs the first three bases
i = 2 # any valid starting index
dna[i:i+3] # a slice that grabs three bases beginning at index i
```
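Combining the two tools, the sketch below collects every 3-base window from a short made-up sequence. The stopping point `len(dna) - 2` is a choice made here so that every window is full length; a slice that runs past the end of a string does not raise an error, it just returns fewer characters.

```python
dna = 'ATGAGCAGG'  # short made-up sequence

windows = []
for i in range(len(dna) - 2):  # the last full window starts at len(dna) - 3
    windows.append(dna[i:i+3])

print(windows)
# ['ATG', 'TGA', 'GAG', 'AGC', 'GCA', 'CAG', 'AGG']

print(dna[8:11])  # 'G': slicing past the end gives a shorter string
```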

In [12]:
# Check this sequence for cut sites
dna_seq = 'TAGGCTGGATCCTCGATTCGATGGGGCCCATTAATCTAGAGATCGGATCGGACTGAAAGCTCTTTTGATTCGAAGCTTGCGATGCGAAGCTTGCTAGCTA'
cutters = []

# Write your code below
for index in range(len(dna_seq)):
    # get the next six bases using slicing
    next_six_bases = dna_seq[index:index+6]

    # check if the next six bases are in the dictionary and if so, add the enzyme name to the cutters list
    if next_six_bases in enzyme_dictionary:
        enzyme_name = enzyme_dictionary[next_six_bases]
        cutters.append(enzyme_name)

cutters

['BamHI', 'XbaI', 'HindIII', 'HindIII']

In [13]:
# Check your answers
assert len(cutters) == 4
assert cutters.count('HindIII') == 2
assert 'XbaI' in cutters

If you wanted to remove all duplicate elements from `cutters`, you could convert it to a *set*. Create a blank cell above and try `set(cutters)`. What happens?
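For reference, here is a small sketch of what the conversion does, using the `cutters` result from above copied in by hand. A set keeps one copy of each item and does not promise any particular order, so `sorted()` is used here to get a predictable list back.

```python
cutters = ['BamHI', 'XbaI', 'HindIII', 'HindIII']

unique_cutters = set(cutters)  # duplicates collapse to a single entry
print(len(unique_cutters))     # 3
print(sorted(unique_cutters))  # ['BamHI', 'HindIII', 'XbaI']
```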



# END Activity 2.
If you're done early, take a break.