<div align=right>
<img src="img/logosmall.png" width="100px" align=right>
</div>

# Conditions

<div class="alert alert-warning">
Parts of this section have been adapted from copyrighted material in *Jones, M: Python for Biologists: A complete programming course for beginners (2013)*.

**Please do not distribute it!**

</div>

## Programs need to make decisions

At first our programs were just executed from top to bottom, like a recipe.  Then we learned how to write a loop, to iterate over the elements of a container.  But we still don't know how to make our program branch off in different directions depending on some deciding factor.

Real-life problems often require our programs to act as decision-makers: to examine a property of some bit of data and decide what to do with it. In this section, we’ll see how to do that using conditional statements. Conditional statements are features of Python that allow us to build decision points in our code. They allow our programs to decide which out of a number of possible courses of action to take – instructions like *“print the name of the sequence if it’s longer than 300 bases”* or *“group two samples together if they were collected less than 10 megabases apart”*.

Before we can start using conditional statements, however, we need to understand conditions.

## Conditions, `True` and `False`

A *condition* (sometimes just called a *conditional*) is simply a bit of code that can produce a true or false answer. The easiest way to understand how conditions work in Python is to try out a few examples. The following examples show the result of testing (or evaluating) a bunch of different conditions – some mathematical examples, some using string methods, and one for testing if a value is included in a list:

In [1]:
3 == 5

False

In [4]:
print(3 == 5)

False


In [5]:
3 > 5

False

In [7]:
3 <= 5

True

In [8]:
len("ATGC") > 5

False

In [9]:
"GAATTC".count("T") > 1

True

In [10]:
"ATGCTT".startswith("ATG")

True

In [11]:
"ATGCTT".endswith("TTT")

False

In [12]:
"ATGCTT".isupper()

True

In [13]:
"ATGCTT".islower()

False

In [14]:
"V" in ["V", "W", "L"]

True

But what’s actually being printed here? At first glance, it looks like we’re printing the strings `"True"` and `"False"`, but those strings don’t appear anywhere in our code. What are actually being printed are the special built-in values that Python uses to represent true and false.

In fact, the values `True` and `False` are examples of another *type* of data Python recognises.  It's called the *boolean* type, and a variable of type boolean can **only** have the value `True` or `False`.

We can show that these values are special by trying to print them. The following code runs without errors (note the absence of quotation marks):

In [15]:
print(True)
print(False)

True
False


…whereas trying to print arbitrary unquoted words:

In [16]:
print(Hello)

NameError: name 'Hello' is not defined

…causes a `NameError`.

We can also use the built-in function `type` to test the type of an object:

In [17]:
type("ACTG")

str

In [18]:
type(592)

int

In [19]:
type(["A", "C", "T", "G"])

list

In [20]:
type("True")

str

In [21]:
type(True)

bool

`bool` is shorthand for *boolean*.

There’s a wide range of things that we can include in conditions, and it would be impossible to give an exhaustive list here. Some basic operators and keywords are:

| meaning                       | syntax       |
|-------------------------------|:------------:|
| is equal to                   | `==`         |
| is not equal to               | `!=`         |
| greater than                  | `>`          |
| less than                     | `<`          |
| greater than or equal to      | `>=`         |
| less than or equal to         | `<=`         |
| is an element in a collection |  `in`        |


> Note that Python uses two equals signs (`==`) to test for equality, because the single equals sign `=` is already used for *variable assignment*!

Many data types also provide methods that return `True` or `False` values. We’ve already seen a few in the code sample above: for example, strings have `startswith()` and `endswith()` methods that return `True` if the string starts (or ends) with the substring given as an argument.
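Because a condition produces an ordinary boolean value, its result can be stored in a variable like any other value. The short sketch below also contrasts `=` (assignment) with `==` (comparison); the variable names are purely illustrative:

```python
codon = "ATG"                           # single = : assignment, stores a value
is_start = (codon == "ATG")             # double == : comparison, produces a boolean
begins_with_a = codon.startswith("A")   # a method call that returns a boolean

print(is_start)
print(begins_with_a)
print(type(is_start))
```

Both variables now hold the boolean value `True`, and `type` reports `bool` for them.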

Now that we know how to express tests as conditions, let’s see what we can do with them!

## `if` statements

The simplest kind of conditional statement is an `if` statement. The syntax is fairly simple to understand:

In [24]:
expression_level = 125

if expression_level > 100:
    print("gene is highly expressed")

gene is highly expressed


We write the word `if`, followed by a condition, and end the condition line with a colon.

There follows a block of indented lines of code (the *body* of the `if` statement), which will **only be executed if the condition evaluates to `True`**.

(This colon-plus-block syntax should be familiar to you from the sections on loops and functions.)

Most of the time, we want to use an `if` statement to test a property of some variable whose value we don’t know at the time when we are writing the program. The example above is obviously useless as the value of the `expression_level` variable is not going to change!

Here’s a slightly more interesting example: we’ll define a list of gene accession names and print out just the ones that start with `"a"`:

In [25]:
accs = ['ab56', 'bh84', 'hv76', 'ay93', 'ap97', 'bd72']

for accession in accs:
    if accession.startswith('a'):
        print(accession)

ab56
ay93
ap97


If you take a close look at the code above, you’ll see something interesting: The lines of code inside the `for` loop are indented (just as before), but the line of code inside the `if` statement is indented **twice** – once for the `for` loop, and once for the `if` statement.

This is the first time we’ve seen multiple levels of indentation, but it’s very common once we start working with larger programs – whenever we have one loop or if statement nested inside another, we’ll have multiple levels of indentation.

Python is quite happy to have as many levels of indentation as needed, but you’ll need to keep  track of which lines of code belong at which level. If you find yourself writing a piece of code that requires very deep indentation, it could be an indication that you should think of structuring it differently — maybe by using a function.

Let's repeat the example above, but this time let's build up a list of accessions that start with “`a`” to use in further calculations:

In [31]:
accs = ['ab56', 'bh84', 'hv76', 'ay93', 'ap97', 'bd72']
starts_with_a = []

for accession in accs:
    if accession.startswith('a'):
        starts_with_a.append(accession)
        
print(starts_with_a)

['ab56', 'ay93', 'ap97']


So we've got a list to start with, as well as a condition (`startswith('a')`).  We look at the elements of the list one by one, and then retain only those that *pass* the condition, discarding those that don't.

In effect, we're *filtering* the list based on a condition.

Filtering a sequence is a rather common operation, and Python provides built-in ways to make this easier, as we'll see later on in the course…

## `else` statements

Closely related to the `if` statement is the `else` statement.

The examples above use a yes/no type of decision-making:  Based on some condition, a code block is either executed or not.

Often we need an either/or type of decision, where we have two possible actions to take (two code blocks to choose from).

To do this, we can add on an `else` clause after the end of the body of an `if` statement:

In [27]:
expression_level = 90

if expression_level > 100:
    print("gene is highly expressed")
else:
    print("gene is not highly expressed")

gene is not highly expressed


The `else` statement doesn’t have any condition of its own – rather, the `else` statement body is executed when the body of the `if` statement to which it’s attached is *not* executed.  That is, when the *condition* of the `if` statement evaluates to `False`.

Note how indentation is used:  The `else` statement is indented to the same level as its corresponding `if` statement — the first column of text, in the example above.

Here’s an example which uses `if` and `else` to split up a list of accession names into two different files – accessions that start with `"a"` go into the first file, and all other accessions go into the second file:

In [44]:
%cd /Users/sabineurban/EVOP2017/files

file1 = open("one.txt", "w")
file2 = open("two.txt", "w")

accs = ['ab56', 'bh84', 'hv76', 'ay93', 'ap97', 'bd72']

for accession in accs:
    if accession.startswith('a'):
        file1.write(accession + "\n")
    else:
        file2.write(accession + "\n")

file1.close()
file2.close()

/Users/sabineurban/EVOP2017/files


Notice how there are multiple indentation levels as before, but that the `if` and `else` statements are at the same level.

You can inspect the two resulting files here to see if they contain what you think they ought to contain:

* [one.txt](../edit/files/one.txt)
* [two.txt](../edit/files/two.txt)

## `elif` statements

What if we have *more than two* possible branches? For example, say we want three files of accession names:

* ones that start with `"a"`
* ones that start with `"b"`
* all others.

We could have a second if statement nested inside the else clause of the first if statement:

In [46]:
file1 = open("one.txt", "w")
file2 = open("two.txt", "w")
file3 = open("three.txt", "w")

accs = ['ab56', 'bh84', 'hv76', 'ay93', 'ap97', 'bd72']

for accession in accs:
    if accession.startswith('a'):
        file1.write(accession + "\n")
    else:
        if accession.startswith('b'):
            file2.write(accession + "\n")
        else:
            file3.write(accession + "\n")

file1.close()
file2.close()
file3.close()

This works, but is difficult to read – we can quickly see that we need an extra level of indentation for every additional choice we want to include. To get round this, Python has an `elif` statement, which merges together `else` and `if` and allows us to rewrite the above example in a much more elegant way:

In [47]:
file1 = open("one.txt", "w")
file2 = open("two.txt", "w")
file3 = open("three.txt", "w")

accs = ['ab56', 'bh84', 'hv76', 'ay93', 'ap97', 'bd72']

for accession in accs:
    if accession.startswith('a'):
        file1.write(accession + "\n")
    elif accession.startswith('b'):
        file2.write(accession + "\n")
    else:
        file3.write(accession + "\n")

file1.close()
file2.close()
file3.close()

Notice how this version of the code only needs two levels of indentation. In fact, using `elif` we can have any number of branches and still only require a single extra level of indentation:

```python
for accession in accs:
    if accession.startswith('a'):
        file1.write(accession + "\n")
    elif accession.startswith('b'):
        file2.write(accession + "\n")
    elif accession.startswith('c'):
        file3.write(accession + "\n")
    elif accession.startswith('d'):
        file4.write(accession + "\n")
    elif accession.startswith('e'):
        file5.write(accession + "\n")
    else:
        file6.write(accession + "\n")
```
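One detail worth keeping in mind with a chain like this: Python tests the conditions from top to bottom and runs only the first branch whose condition is true. Here is a minimal sketch, using a hypothetical `classify` function and made-up thresholds:

```python
def classify(expression_level):
    # Conditions are tested in order; only the first true branch runs.
    if expression_level > 100:
        return "high"
    elif expression_level > 50:
        return "medium"
    else:
        return "low"

print(classify(125))   # 125 > 50 is also true, but that branch is never reached
print(classify(75))
print(classify(10))
```

This means the order of the branches matters whenever more than one condition could be true for the same value.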

## **Aside:**  Using a context manager to control file access

You may have noticed in the previous section's code snippets how ungainly it has become to work with multiple file objects — opening them one by one, and then having to remember to call the `close()` method again on each open file.

An advanced Python structure known as a *context manager* can help to make the syntax for working with file objects less cumbersome and more readable.

Context managers are well beyond the scope of this introductory course, so we'll just look at their use in terms of file objects — an idiom you'll see a lot in modern Python code.

>An aside **only** for those who are interested:

>Objects that act as context managers are usually the sort of things that need to be "built up" before you can use them, and then "torn down" afterwards.  Python has a special keyword, `with`, that works with context managers.  In the `with` statement, the context manager gets built up.  It then stays intact during the code block that follows the `with`, and automatically gets torn down when execution leaves that block.

>The file object can act as a context manager.  When you use a file object as a context manager, it automatically closes the underlying file as soon as the `with` block is left.

Here's how to use a file object as a context manager, which requires the `with` keyword — let's repeat the example from the start of the previous subsection:

In [None]:
accs = ['ab56', 'bh84', 'hv76', 'ay93', 'ap97', 'bd72']

with open("one.txt", 'w') as file1:
    for accession in accs:
        if accession.startswith('a'):
            file1.write(accession + "\n")

The entire block under the `with` line lies within the managed context. When code execution leaves this block, the file will automatically be closed.
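If you'd like to check this for yourself, file objects have a `closed` attribute that reports whether the file has been closed. Here is a small sketch; the file name `closed_demo.txt` is made up for this illustration:

```python
with open("closed_demo.txt", "w") as demo_file:   # hypothetical file name
    demo_file.write("ab56\n")
    print(demo_file.closed)    # still inside the block: the file is open

print(demo_file.closed)        # the block has ended: the file is closed
```

The first `print` shows `False` and the second shows `True`, even though we never called `close()` ourselves.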

Now, let's use multiple file objects:

In [None]:
accs = ['ab56', 'bh84', 'hv76', 'ay93', 'ap97', 'bd72']

with open("one.txt", 'w') as file1, open("two.txt", 'w') as file2:
    for accession in accs:
        if accession.startswith('a'):
            file1.write(accession + "\n")
        else:
            file2.write(accession + "\n")

Again, both `file1` and `file2` will be closed as soon as code execution leaves the managed context.

And that's all we'll say about context managers for now.  Feel free to use this idiom when reading from or writing to files;  it is the standard way to do so in modern Python, so you'll see it a lot in other people's code.

## `while` loops

Another thing we can do with conditionals is use them to determine *when to exit a loop*.

Previously we've learned about `for` loops that iterate over a collection of items. Python also has a `while` loop. Rather than iterating over a set number of items, a `while` loop keeps running *for as long as some condition is true*. For example, here’s a bit of code that increments a `count` variable by one each time round the loop, stopping when the `count` variable reaches ten:

In [49]:
count = 0

while count < 10:
    print(count)
    count = count + 1

0
1
2
3
4
5
6
7
8
9


Each time we go around the loop, the condition in the `while` statement is tested.  If it evaluates to `True`, we go around the loop one more time.  If it's `False`, we exit the loop and continue execution below it.
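A `while` loop is most useful when we can't tell in advance how many times the loop needs to run. As a sketch with made-up numbers, here is a loop that counts how many generations a doubling population needs to reach at least 1000 cells:

```python
population = 100        # made-up starting number of cells
generations = 0

while population < 1000:
    population = population * 2      # the population doubles each generation
    generations = generations + 1

print(generations, population)
```

The loop stops after four rounds, when `population` has reached 1600 and the condition `population < 1000` evaluates to `False`.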

We know by now that iterating over a file object with a `for` loop is the simplest way to iterate over the lines of a file.  But if we didn't know this, we could've used a `while` loop as follows, keeping in mind that the `readline()` method of a file object returns an empty string when you "fall off the end" of a file:

In [50]:
file1 = open("one.txt", 'r')

line = file1.readline()

while line != "":
    line = line.rstrip()
    print("Line:", line)
    line = file1.readline()

Line: ab56
Line: ay93
Line: ap97


Can you figure out how the program in the code block above works?

>Don't actually iterate over a file like that.  Use a `for` loop!

## Truthiness and falsiness

American comedian and political commentator Stephen Colbert gave the world the word "truthiness", used to describe the quality of complex statements made by politicians that *sound* true, but in fact have a more complicated relationship with actual truth.

Many programming languages regard more than just the boolean values `True` and `False` as true or false, and programmers were quick to appropriate the words *truthiness* and *falsiness* to describe this.

A thing is *truthy* if it tests as true in a conditional statement, and *falsy* if it tests as false.

Let's write a little function to test the truthiness of things:

In [59]:
def truth_test(thing):
    if thing:
        print("Truthy!")
    else:
        print("Falsy!")

Let's test some things:

In [60]:
truth_test("ACTG")        # a string

Truthy!


In [None]:
truth_test("")            # an empty string

In [None]:
truth_test([34, 56, 11])  # a list

In [None]:
truth_test([])            # an empty list

In [None]:
truth_test(42)            # a number

In [None]:
truth_test(-42)           # a negative number

In [None]:
truth_test(0)             # zero

As it happens, Python also has a built-in function that does much the same as `truth_test`.  It's called `bool` — it evaluates an argument, and returns the boolean values `True` or `False` depending on the truthiness of that argument:

In [None]:
print(bool(42))
print(bool(0))
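The cells above are shown without their output. As a sketch, the same values can be passed through `bool` in a single loop. In Python, empty strings, empty lists and zero are falsy, while non-empty strings, non-empty lists and non-zero numbers (including negative ones) are truthy:

```python
things = ["ACTG", "", [34, 56, 11], [], 42, -42, 0]

for thing in things:
    # repr() keeps quotes and brackets visible, so "" and [] can be told apart
    print(repr(thing), "->", bool(thing))
```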

Often we can write conditional statements more simply by keeping truthiness in mind.

In the `files` subdirectory there's a Fasta file called `sample1.fa`:

* [sample1.fa](../edit/files/sample1.fa)

As you can see, it contains blank lines between records.  What if we want to filter out those blank lines?  We could do it like this:

In [None]:
for line in open("sample1.fa", 'r'):
    line = line.rstrip()
    if line != "":
        print(line)

But by keeping truthiness in mind (a non-empty string is "true", and an empty string is "false"), we can write the condition more succinctly:

In [None]:
for line in open("sample1.fa", 'r'):
    line = line.rstrip()
    if line:
        print(line)

## Building up compound conditions

What if we wanted to express a condition that was made up of several parts? Imagine we want to go through our list of accessions and print out only the ones that start with `"a"` **and** end with `"3"`. We could use two nested if statements:

In [52]:
accs = ['ab56', 'bh84', 'hv76', 'ay93', 'ap97', 'bd72']

for accession in accs:
    if accession.startswith('a'):
        if accession.endswith('3'):
            print(accession)

ay93


…but this brings in an extra level of indentation. A better way is to join up the two conditions with `and` to make a complex expression:

In [51]:
accs = ['ab56', 'bh84', 'hv76', 'ay93', 'ap97', 'bd72']

for accession in accs:
    if accession.startswith('a') and accession.endswith('3'):
        print(accession)

ay93


This version is nicer in two ways: it doesn’t require the extra level of indentation, and the condition reads in a very natural way. We can also use `or` to join up two conditions, to produce a complex condition that will be true if either (or both) of the two simple conditions is true:

In [53]:
accs = ['ab56', 'bh84', 'hv76', 'ay93', 'ap97', 'bd72']

for accession in accs:
    if accession.startswith('a') or accession.startswith('b'):
        print(accession)

ab56
bh84
ay93
ap97
bd72


Say we have two variables `x` and `y`, which can both evaluate to either `True` or `False`.  This table shows the value of `x and y` and `x or y` for all possible combinations of truth values of `x` and `y`:

| `x`     | `y`     | `x and y`   | `x or y`   |
|---------|---------|-------------|------------|
| `True`  | `True`  | `True`      |`True`      |
| `True`  | `False` | `False`     | `True`     | 
| `False` | `True`  | `False`     | `True`     |
| `False` | `False` | `False`     | `False`    |
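The table can be reproduced with two nested loops, which is a handy way to check your understanding. In this minimal sketch, the list name `rows` is just an illustrative choice:

```python
rows = []

for x in [True, False]:
    for y in [True, False]:
        rows.append([x, y, x and y, x or y])
        print(x, y, x and y, x or y)
```

Each printed line corresponds to one row of the table, in the same order.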

We can join up compound conditions to make more complex compound conditions – here’s an example which prints accessions if they start with either `"a"` or `"b"`, and end with `"4"`:

In [54]:
accs = ['ab56', 'bh84', 'hv76', 'ay93', 'ap97', 'bd72']

for acc in accs:
    if (acc.startswith('a') or acc.startswith('b')) and acc.endswith('4'):
        print(acc)

bh84


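Notice the parentheses in the example above. They are needed here: Python evaluates `and` before `or`, so without them the condition would mean *starts with `"a"`, or (starts with `"b"` and ends with `"4"`)*.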
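A quick way to see what the parentheses change is to evaluate both forms for a single accession. Here is a minimal sketch using `'ab56'` from the list above; the variable names are illustrative:

```python
acc = 'ab56'

# (starts with a OR starts with b) AND ends with 4
with_parens = (acc.startswith('a') or acc.startswith('b')) and acc.endswith('4')

# Python reads this as: starts with a OR (starts with b AND ends with 4)
without_parens = acc.startswith('a') or acc.startswith('b') and acc.endswith('4')

print(with_parens)
print(without_parens)
```

The first condition is `False` because `'ab56'` does not end with `"4"`, but the second is `True` because starting with `"a"` is enough on its own.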

Finally, we can negate any type of condition by prefixing it with the word `not`. This example will print out accessions that start with `"a"` and don’t end with `"6"`:

In [55]:
accs = ['ab56', 'bh84', 'hv76', 'ay93', 'ap97', 'bd72']

for acc in accs:
    if acc.startswith('a') and not acc.endswith('6'):
        print(acc)

ay93
ap97


By using a combination of `and`, `or` and `not` (along with parentheses where necessary) we can build up arbitrarily complex conditions.

> This kind of use for conditions – identifying elements in a list – can often be done in a more readable fashion using some advanced Python features we'll learn later on.

These three keywords — `and`, `or`, `not` — are collectively known as *boolean operators*.

## Writing true/false functions

Sometimes we want to write a function that can be used in a condition. This is very easy to do – we just make sure that our function always returns a boolean value — either `True` or `False`.  (A function that returns a boolean value is called a *predicate function*.)

Remember that `True` and `False` are built-in values in Python, so they can be passed around, stored in variables, and returned, just like numbers or strings.

Here’s a function that determines whether or not a DNA sequence is AT-rich (we’ll say that a sequence is AT-rich if it has an AT content of more than 0.65):

In [56]:
def is_at_rich(dna):
    length = len(dna)
    a_count = dna.upper().count('A')
    t_count = dna.upper().count('T')
    at_content = (a_count + t_count) / length
    if at_content > 0.65:
        return True
    else:
        return False

We’ll test this function on a few sequences to see if it works:

In [57]:
print(is_at_rich("ATTATCTACTA"))
print(is_at_rich("CGGCAGCGCT"))

True
False


The output shows that the function returns `True` or `False` just like the other conditions we’ve been looking at.  Therefore we can use our function in an `if` statement:

```python
my_dna = "ATTATCTACTA"

if is_at_rich(my_dna):
    print("AT-rich sequence found")   # do something with the sequence
```

Because the last four lines of our function are devoted to evaluating a condition and returning `True` or `False`, we can write a slightly more compact version. In this example we evaluate the condition, and then return the result right away:

In [None]:
def is_at_rich(dna):
    length = len(dna)
    a_count = dna.upper().count('A')
    t_count = dna.upper().count('T')
    at_content = (a_count + t_count) / length
    return at_content > 0.65

This is a little more concise, and even easier to read.
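A predicate function like this combines naturally with the filtering pattern from earlier in this section. In the sketch below, the sequences in the list are made up for illustration:

```python
def is_at_rich(dna):
    length = len(dna)
    a_count = dna.upper().count('A')
    t_count = dna.upper().count('T')
    at_content = (a_count + t_count) / length
    return at_content > 0.65

sequences = ["ATTATCTACTA", "CGGCAGCGCT", "aattttgcat"]   # made-up sequences
at_rich = []

for seq in sequences:
    if is_at_rich(seq):          # the function call is the condition
        at_rich.append(seq)

print(at_rich)
```

Only the first and third sequences end up in `at_rich`. The third one passes even though it is lower case, because the function converts to upper case before counting.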

---

## Exercises

In the `files` directory where this Notebook is located, you’ll find a text file called `data.csv` containing some arbitrary data for a number of genes:

* [`data.csv`](../edit/files/data.csv)

>We have already used `%cd` to change into the `files` directory earlier in this Notebook.

Each line contains the following fields for a single gene in this order:

    species name, sequence, gene name, expression level

The fields are separated by commas.  (Hence the name of the file – `csv` stands for "Comma Separated Values").

Think of it as a representation of a table in a spreadsheet – each line is a row, and each field in a line is a column. All the exercises for this section use the data read from this file.
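Before starting on the exercises, it may help to see how a single line is taken apart. The line in this sketch is hypothetical and not taken from `data.csv`. Using `float` for the last field assumes that expression levels are plain numbers:

```python
line = "Homo sapiens,actgtgac,xyz123,42\n"   # hypothetical line, not from data.csv

fields = line.rstrip().split(',')

species_name = fields[0]
sequence = fields[1]
gene_name = fields[2]
# split() always returns strings, so convert before comparing with a number
expression_level = float(fields[3])

print(species_name, len(sequence), gene_name, expression_level)
```

Without the conversion, a comparison such as `fields[3] > 200` would raise a `TypeError`, because Python 3 refuses to compare a string with a number using `>`.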

### 1. Several species

Print out the gene names for all genes belonging to *Drosophila melanogaster* or *Drosophila simulans*.

In [None]:
for line in open("data.csv", 'r'):
    line = line.rstrip()
    
    fields = line.split(',')
    
    species_name = fields[0]
    gene_name = fields[2]
    
    #if species_name == "Drosophila melanogaster" or species_name == "Drosophila simulans":
    #   print(gene_name)
    
    if species_name in ["Drosophila melanogaster", "Drosophila simulans"]:
        print(gene_name)

### 2. Length range

Print out the gene names for all genes between 90 and 110 bases long.

In [62]:
for line in open("data.csv", 'r'):
    line = line.rstrip()
    
    fields = line.split(',')
    
    sequence = fields[1]
    gene_name = fields[2]
    
    # the *length* of the sequence must lie between 90 and 110
    # (both limits are treated as included here)
    if len(sequence) >= 90 and len(sequence) <= 110:
        print(gene_name)

### 3. AT content

Print out the gene names for all genes whose AT content is less than 0.5 and whose expression level is greater than 200.

In [76]:
# define the helper function once, before the loop
def at_content(dna):
    length = len(dna)
    a_count = dna.upper().count('A')
    t_count = dna.upper().count('T')
    return (a_count + t_count) / length

for line in open("data.csv", 'r'):
    line = line.rstrip()
    
    fields = line.split(',')
    
    sequence = fields[1]
    gene_name = fields[2]
    expression = float(fields[3])   # fields are strings; convert before comparing
    
    # AT content less than 0.5 and expression level greater than 200
    if at_content(sequence) < 0.5 and expression > 200:
        print(gene_name)

### 4. Complex condition

Print out the gene names for all genes whose name begins with “k” or “h” except those belonging to *Drosophila melanogaster*.

In [None]:
for line in open("data.csv", 'r'):
    line = line.rstrip()
    
    fields = line.split(',')
    
    species_name = fields[0]
    gene_name = fields[2]
    
    # name starts with k or h, but the species is not Drosophila melanogaster;
    # the parentheses are needed because `and` is evaluated before `or`
    if (gene_name.startswith('k') or gene_name.startswith('h')) and species_name != "Drosophila melanogaster":
        print(gene_name)

### 5. High low medium

For each gene, print out a message giving the gene name and saying whether its AT content is high (greater than 0.65), low (less than 0.45) or medium (between 0.45 and 0.65).
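One possible approach is sketched below, using an `if`/`elif`/`else` chain inside a helper function. The function names and the gene list are made up for illustration. In a full solution the names and sequences would come from `data.csv`, as in the earlier exercises:

```python
def at_content(dna):
    a_count = dna.upper().count('A')
    t_count = dna.upper().count('T')
    return (a_count + t_count) / len(dna)

def at_category(dna):
    content = at_content(dna)
    if content > 0.65:
        return "high"
    elif content < 0.45:
        return "low"
    else:
        return "medium"      # everything from 0.45 up to 0.65

# made-up gene names and sequences, standing in for the rows of data.csv
genes = [["geneA", "ATTATCTACTA"], ["geneB", "CGGCAGCGCT"], ["geneC", "ATGCATGCAT"]]

for gene in genes:
    print(gene[0], "has", at_category(gene[1]), "AT content")
```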