<h1 id="toctitle">Conditions</h1>
<ul id="toc"/>

## True and False

__Conditions__ are things that can be evaluated as either `True` or `False`:

In [1]:
print(3 == 5)  # prints the result of the test
print(3 > 5)
print(3 <= 5)
print(5 != 4)

print(len("ATGC") > 5)
print("GAATTC".count("T") > 1)  # "GAATTC" contains two Ts, and 2 > 1, so this is True

print("ATGCTT".startswith("ATG"))
print("ATGCTT".endswith("TTT"))
print("ATGCTT".isupper())
print("ATGCTT".islower())

print("V" in ["V", "W", "L"])

False
False
True
True
False
True
True
False
True
False
True


Above are a bunch of different ways to test true/false conditions in Python:

- equals (`==`) -- and not equals (`!=`)
- numerical comparisons (`>`, `<`, `>=`, `<=`)
- string methods (`startswith`, `isupper`)
- is a value in a list (`in` keyword)

We can experiment with these directly. Note that `True` and `False` are special values in Python and always have the initial capital:

In [2]:
4 != 5

True
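
Because a condition evaluates to an ordinary value, it can also be stored in a variable and reused. A minimal sketch (the variable name `is_long` is made up for illustration):

```python
# A condition produces a value of type bool, which can be stored like any other value.
is_long = len("ATGC") > 5
print(is_long)        # False
print(type(is_long))  # <class 'bool'>

# Lowercase `true` is not defined in Python; only `True` and `False` are built in.
try:
    true
except NameError as error:
    print("NameError:", error)
```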

## Using conditions

### `if` statements

The simplest thing we can do with a condition, apart from printing it, is execute some code if it's true:

In [3]:
expression_level = 125
if expression_level > 100:
    print("gene is highly expressed")  # runs only if the condition is true

gene is highly expressed


Notice that

- the condition line starts with `if`
- the thing we want to test goes after the `if`
- the line ends with a colon
- just like with loops, the body is indented
- the body can contain multiple lines
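
A small sketch of a multi-line body, reusing `expression_level` from the cell above (the name `fold_over_threshold` is invented for this illustration):

```python
expression_level = 125
if expression_level > 100:
    # Every line indented under the `if` belongs to the body
    # and runs only when the condition is true.
    print("gene is highly expressed")
    fold_over_threshold = expression_level / 100
    print(fold_over_threshold)

# Back at the original indentation level: this line runs either way.
print("done")
```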

A more interesting example:

In [5]:
sequences = ['atcgatgctact', 'ccttacgt', 'tctgatgctagct', 'atctactg']
for sequence in sequences:
    if sequence.count("g") > 1:
        print(sequence + " contains more than one G")
        
    print("still in the loop")  # inside the loop but outside the if block: the indentation decides which

atcgatgctact contains more than one G
still in the loop
still in the loop
tctgatgctagct contains more than one G
still in the loop
still in the loop


Code with conditions often involves multiple levels of indentation. Watch out for `IndentationError`s and for lines that are indented to the wrong level.

### `else` statements

The `if` examples above are yes/no - either we execute the bit of code, or we do nothing. 

Sometimes we want either/or - two different branches:

In [6]:
sequences = ['atcgatgctact', 'ccttacgt', 'tctgatgctagct', 'atctactg']
for sequence in sequences:
    if sequence.startswith('a'):
        print(sequence  + " starts with a")
    else:
        print(sequence  + " doesn't start with a")

atcgatgctact starts with a
ccttacgt doesn't start with a
tctgatgctagct doesn't start with a
atctactg starts with a


Notice that:
- The `else` is followed by a colon
- The `else` line has nothing else on it (there is no condition to test)
- The `else` block is indented and is run when the condition is false

---

### `elif` statements

Sometimes we need more than two branches. Use `elif` for each extra branch:

In [8]:
sequences = ['atcgatgctact', 'ccttacgt', 'tctgatgctagct', 'atctactg', 'gctagctga', 'tctagtacg']
for sequence in sequences:
    if sequence.startswith('a'):
        print(sequence  + " starts with a")
    elif sequence.startswith('t'):
        print(sequence  + " starts with t")
    else:  # else also needs a colon
        print(sequence  + " starts with something else")

atcgatgctact starts with a
ccttacgt starts with something else
tctgatgctagct starts with t
atctactg starts with a
gctagctga starts with something else
tctagtacg starts with t


This is equivalent to putting a second `if`/`else` into the `else` block, but much easier to read, and we can add as many conditions as we like without introducing a huge multi-level indent:

In [9]:
sequences = ['atcgatgctact', 'ccttacgt', 'tctgatgctagct', 'atctactg', 'gctagctga', 'tctagtacg', 'ntggattg']

for sequence in sequences:
    if sequence.startswith('a'):
        print('a!')
    elif sequence.startswith('t'):
        print('t!')
    elif sequence.startswith('g'):
        print('g!')
    elif sequence.startswith('c'):
        print('c!')
    else:
        print('something else!')

a!
c!
t!
a!
g!
t!
something else!
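
For comparison, here is a sketch of the nested `if`/`else` form that `elif` replaces, next to the flat version. The list names `nested_results` and `elif_results` are made up for this illustration, and only three of the sequences are used to keep it short:

```python
sequences = ['atcgatgctact', 'ccttacgt', 'tctgatgctagct']

# Nested form: each extra branch adds another level of indentation.
nested_results = []
for sequence in sequences:
    if sequence.startswith('a'):
        nested_results.append('a!')
    else:
        if sequence.startswith('t'):
            nested_results.append('t!')
        else:
            if sequence.startswith('c'):
                nested_results.append('c!')
            else:
                nested_results.append('something else!')

# Flat form with elif: same logic, one level of indentation.
elif_results = []
for sequence in sequences:
    if sequence.startswith('a'):
        elif_results.append('a!')
    elif sequence.startswith('t'):
        elif_results.append('t!')
    elif sequence.startswith('c'):
        elif_results.append('c!')
    else:
        elif_results.append('something else!')

print(nested_results == elif_results)  # True
```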


`else` and `elif` are good when we have _mutually exclusive_ possibilities. If more than one can be `True`, then use multiple `if` lines:

In [10]:
sequences = ['atcgatgctact', 'ccttacgt', 'tctgatgctagct', 'atctactg', 'gctagctga', 'tctagtacg', 'ntggattg']
for sequence in sequences:
    if sequence.startswith('a'):
        print(sequence  + " starts with a")
    if sequence.endswith('t'):
        print(sequence  + " ends with t")

atcgatgctact starts with a
atcgatgctact ends with t
ccttacgt ends with t
tctgatgctagct ends with t
atctactg starts with a


## Combining conditions
### `and`

We can join together two conditions with `and` to make a combined one that's only true if both of the simple ones are true:

In [11]:
sequences = ['atcgatgctact', 'ccttacgt', 'tctgatgctagct', 'atctactg', 'gctagctga', 'tctagtacg', 'ntggattg']
for sequence in sequences:
    if sequence.startswith('t') and sequence.count('g') > 2:
        print("both conditions are true for " + sequence)

both conditions are true for tctgatgctagct


### `or`

We can do the same with `or`, and the combined condition will be true if at least one of the simple conditions is true:

In [12]:
sequences = ['atcgatgctact', 'ccttacgt', 'tctgatgctagct', 'atctactg', 'gctagctga', 'tctagtacg', 'ntggattg']
for sequence in sequences:
    if sequence.startswith('t') or sequence.count('g') > 2:
        print("at least one condition is true for " + sequence)

at least one condition is true for tctgatgctagct
at least one condition is true for gctagctga
at least one condition is true for tctagtacg
at least one condition is true for ntggattg


We can even join together complex conditions in this way. Use parentheses `()` to disambiguate the order in which combinations are evaluated:

In [13]:
sequences = ['atcgatgctact', 'ccttacgt', 'tctgatgctagct', 'atctactg', 'gctagctga', 'tctagtacg', 'ntggattg']
for sequence in sequences:
    if (sequence.startswith('t') or sequence.startswith('n')) and sequence.count('g') > 2:
        print("the condition is true for " + sequence)

the condition is true for tctgatgctagct
the condition is true for ntggattg
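
Parentheses matter because Python evaluates `and` before `or`. A short sketch using one of the sequences above (the variable names are made up for illustration):

```python
sequence = 'tctagtacg'  # starts with 't' but contains only two g's

starts_t = sequence.startswith('t')
starts_n = sequence.startswith('n')
many_g = sequence.count('g') > 2

# With parentheses: (t or n) must hold AND there must be more than two g's.
with_parens = (starts_t or starts_n) and many_g

# Without parentheses Python reads this as: starts_t or (starts_n and many_g)
without_parens = starts_t or starts_n and many_g

print(with_parens)     # False
print(without_parens)  # True
```

The two versions disagree for this sequence, so leaving out the parentheses would silently change which sequences are reported.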


## Exercises

### Processing tabular data

Open the text file called _data.csv_, which contains some made-up data for a number of genes. Each line contains the following fields for a single gene in this order: species name, sequence, gene name, expression level. The fields are separated by commas (hence the name of the file – __csv__ stands for __Comma Separated Values__):

```
Drosophila yakuba,cgcgcgc...gatgc,hdt739,85
```

Think of it as a representation of a table in a spreadsheet – each line is a row, and each field in a line is a column. 

Print out the gene names for all genes belonging to Drosophila melanogaster or Drosophila simulans.

Print out the gene names for all genes between 90 and 110 bases long.

Print out the gene names for all genes whose AT content is less than 0.5 and whose expression level is greater than 200.

Print out the gene names for all genes whose name begins with "k" or "h" except those belonging to Drosophila melanogaster.

For each gene, print out a message giving the gene name and saying whether its AT content is high (greater than 0.65), low (less than 0.45) or medium (between 0.45 and 0.65).

__Important note: for this exercise, it's quite easy to write a program that works, but which gives incorrect output, so check each answer carefully.__
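
One way a program can run and still give wrong answers is by comparing the fields as text. Everything read from a file is a string, so numbers need converting before they are compared. A minimal sketch, using a made-up row in the same four-column layout (it is not taken from _data.csv_):

```python
# A made-up row for illustration only (not a real line from data.csv).
line = "Drosophila example,atatgc,abc123,250\n"

columns = line.rstrip("\n").split(",")
species = columns[0]
sequence = columns[1]
gene_name = columns[2]
expression = int(columns[3])  # convert the text "250" to the number 250

# Comparing strings compares them character by character, not numerically.
print("85" > "200")  # True, because "8" sorts after "2"
print(85 > 200)      # False

# AT content as a fraction between 0 and 1.
at_content = (sequence.count("a") + sequence.count("t")) / len(sequence)
print(gene_name, len(sequence), at_content, expression)
```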

In [3]:
dros = list(open("data.csv"))  # reads the whole file into a list of lines
# For a large dataset, loop over the file object directly instead: it reads
# one line at a time, so the whole file never has to fit in memory.

for line in dros:
    line = line.rstrip('\n').split(',')
    if line[0] == 'Drosophila melanogaster' or line[0] == 'Drosophila simulans':
        print("Gene names belonging to D.melanogaster and D. simulans: " + str(line[2]))

# dros is a list, so it can be looped over again without rewinding.
# An open file object would need .seek(0) before a second pass, and .close() when finished.

for line in dros:
    line = line.rstrip('\n').split(',')
    # Strictly between 90 and 110; use >= and <= if the endpoints should count.
    if len(line[1]) > 90 and len(line[1]) < 110:
        print("Gene names for genes between 90 and 110 bp long: " + str(line[2]))



Gene names belonging to D.melanogaster and D. simulans: kdy647
Gene names belonging to D.melanogaster and D. simulans: jdg766
Gene names belonging to D.melanogaster and D. simulans: kdy533
Gene names for genes between 90 and 110 bp long: kdy647
Gene names for genes between 90 and 110 bp long: teg436


In [53]:
#### how the tutor did it ####

data = open("data.csv")
organism = []  # list to collect the matching gene names

for line in data:
    columns = line.rstrip('\n').split(',')
    spp_name = columns[0]
    seq = columns[1]
    gene_name = columns[2]
    expression = columns[3]

    if spp_name == 'Drosophila melanogaster' or spp_name == 'Drosophila simulans':
        organism.append(gene_name)  # add the matching gene name to the list
        print(gene_name)

data.close()

kdy647
jdg766
kdy533


In [54]:
for line in dros:
    line=line.rstrip('\n').split(',')
    ATcont=(line[1].count("a")+line[1].count("t"))/len(line[1])
    # No int() here: int() truncates any fraction below 1 to 0, so every AT content would become 0.
    if ATcont < 0.5 and int(line[3]) > 200:
        print("Gene names for genes with AT cont < 0.5 and expression lvl > 200: " + str(line[2]))
    

Gene names for genes with AT cont < 0.5 and expression lvl > 200: teg436


In [40]:
#print(ATcont)
#(line[1].count("a")+line[1].count("t"))/len(line[1])

    #Acont=(line[1]).count("a")
    #Tcont=(line[1]).count("t")
    #fulllen=len(line[1])
    # did not work; try again if time allows

0.45918367346938777


In [26]:
#Print out the gene names for all genes whose name begins with "k" or "h" except those belonging to Drosophila melanogaster.

for line in dros:
    line=line.rstrip('\n').split(',')
    # Parentheses are needed: without them `and` binds first and any gene starting with "h" passes.
    if line[0] != 'Drosophila melanogaster' and (line[2].startswith("k") or line[2].startswith("h")):
        print("Gene names beginning w/ k or h not incl D. melanogaster: " + str(line[2]) + " in " + str(line[0]))
        


Gene names beginning w/ k or h not incl D. melanogaster: kdy533 in Drosophila simulans
Gene names beginning w/ k or h not incl D. melanogaster: hdt739 in Drosophila yakuba
Gene names beginning w/ k or h not incl D. melanogaster: hdu045 in Drosophila ananassae


In [50]:
#For each gene, print out a message giving the gene name and saying whether its AT content is 
#high (greater than 0.65), low (less than 0.45) or medium (between 0.45 and 0.65).

for line in dros:
    line=line.rstrip('\n').split(',')
    ATcont=(line[1].count("a")+line[1].count("t"))/len(line[1])
    #AT_con=int(ATcont)
    if ATcont > 0.65:
        print("For gene "+ str(line[2]) + " the AT cont is high(>0.65)")
    elif ATcont < 0.45:
        print("For gene "+ str(line[2]) + " the AT cont is low(<0.45)")
    elif ATcont >= 0.45 and ATcont <= 0.65:
        print("For gene "+ str(line[2]) + " the AT cont is medium(0.45 - 0.65)")
    else:
        print("error")

##### or, more concisely: a final else covers the medium case, since it is all that remains

dros = list(open("data.csv"))  # reads the whole file into a list of lines
        
for line in dros:
    line=line.rstrip('\n').split(',')
    ATcont=(line[1].count("a")+line[1].count("t"))/len(line[1])
    if ATcont > 0.65:
        print("For gene "+ str(line[2]) + " the AT cont is high(>0.65)")
    elif ATcont < 0.45:
        print("For gene "+ str(line[2]) + " the AT cont is low(<0.45)")
    else:
        print("For gene "+ str(line[2]) + " the AT cont is medium(0.45 - 0.65)")
    
        
    
    

For gene kdy647 the AT cont is high(>0.65)
For gene jdg766 the AT cont is medium(0.45 - 0.65)
For gene kdy533 the AT cont is medium(0.45 - 0.65)
For gene hdt739 the AT cont is low(<0.45)
For gene hdu045 the AT cont is medium(0.45 - 0.65)
For gene teg436 the AT cont is medium(0.45 - 0.65)
For gene kdy647 the AT cont is high(>0.65)
For gene jdg766 the AT cont is medium(0.45 - 0.65)
For gene kdy533 the AT cont is medium(0.45 - 0.65)
For gene hdt739 the AT cont is low(<0.45)
For gene hdu045 the AT cont is medium(0.45 - 0.65)
For gene teg436 the AT cont is medium(0.45 - 0.65)


In [43]:
print(ATcont)

0.45918367346938777


### Bonus exercise: pairwise distance

*Warning: This takes a bit of thought. Plan your whole strategy before writing any code!*

Here is a list of DNA sequences:

`['ATTGTACGGA', 'AATGAACCGA', 'AATGAACCCA', 'AATGGGAATA']`

Write a program that calculates and prints, for each pair of sequences, the percentage of identical positions. 

Hint: 

```
if base1 == base2:
    # do something
```

In [58]:
DNA_seqs = ['ATTGTACGGA', 'AATGAACCGA', 'AATGAACCCA', 'AATGGGAATA']


# First step only: compare the first base of the second and third sequences.
if DNA_seqs[1][0] == DNA_seqs[2][0]:
    print("yes")

yes
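
One possible way to extend this first step to the whole exercise is sketched below. It is not the tutor's solution: the helper name `percent_identity` is made up, and it assumes every sequence has the same length.

```python
sequences = ['ATTGTACGGA', 'AATGAACCGA', 'AATGAACCCA', 'AATGGGAATA']


def percent_identity(seq1, seq2):
    """Return the percentage of positions at which two equal-length sequences match."""
    matches = 0
    for position in range(len(seq1)):
        base1 = seq1[position]
        base2 = seq2[position]
        if base1 == base2:
            matches += 1
    return 100 * matches / len(seq1)


# Visit each pair once: j always starts after i, so (i, j) and (j, i) are not both compared.
for i in range(len(sequences)):
    for j in range(i + 1, len(sequences)):
        identity = percent_identity(sequences[i], sequences[j])
        print(sequences[i] + " vs " + sequences[j] + ": " + str(identity) + "%")
```

Starting the inner loop at `i + 1` also skips comparing a sequence with itself, which would always give 100%.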


### Bonus exercise: kmer counting

*Warning: difficult!*

Write a program that, given a DNA sequence, will print all the k-mers (e.g. 4-mers) that occur more than n times. 

E.g. with dna="ATGCATCATG", k=2 and n=2 print:

AT 


In [None]:
# A set is a collection that cannot hold more than one copy of the same item.
# "not in" is the opposite of "in": "kmer not in all_kmers" is true when the k-mer is absent.

dna = "ATGCATCATG"
k = 2
n = 2

# An empty list
all_kmers = []

# First get the k-mers...
# Stop k - 1 positions before the end so that every slice is a full k-mer
# (e.g. for k == 2, stop 1 base from the end)
for idx in range(len(dna) - (k-1)):

    # Put them into a list
    kmer = dna[idx:idx+k]
    if kmer not in all_kmers:
        all_kmers.append(kmer)

print(all_kmers)

# Now for all the kmers, print them if they occur >n times
for kmer in all_kmers:
    count_of_kmer = dna.count(kmer)  # note: str.count counts non-overlapping matches only
    if count_of_kmer > n:
        print("I saw " + str(count_of_kmer) + " of the kmer " + kmer)
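
An alternative sketch that tallies every k-mer in a dictionary while sliding along the sequence. This is an assumption about how one might do it, not part of the original solution, and the names `kmer_counts` and `frequent` are made up. Unlike `dna.count(kmer)`, it also counts overlapping occurrences: `"AAAA".count("AA")` gives 2, although `AA` starts at three positions.

```python
dna = "ATGCATCATG"
k = 2
n = 2

# Count each k-mer once per starting position, so overlapping copies are included.
kmer_counts = {}
for idx in range(len(dna) - (k - 1)):
    kmer = dna[idx:idx + k]
    if kmer not in kmer_counts:
        kmer_counts[kmer] = 0
    kmer_counts[kmer] += 1

# Keep only the k-mers seen more than n times.
frequent = []
for kmer in kmer_counts:
    if kmer_counts[kmer] > n:
        frequent.append(kmer)
        print("I saw " + str(kmer_counts[kmer]) + " of the kmer " + kmer)
```

For the example input this prints only `AT`, which occurs three times.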
