#**WEEK 4**

This week we'll be working through an adapted version of Chapter 2 from Applied Bioinformatics.

First, we have to install BioPython again, since this is a new notebook. You can do that by running the code block below.

In [1]:
##Install BioPython in Jupyter Notebook
%pip install biopython

Collecting biopython
  Downloading biopython-1.81-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.1 MB)
Installing collected packages: biopython
Successfully installed biopython-1.81


#Part 1: Introduction to Sequence Motifs
First, make sure you've read Sections 2.1-3 of the textbook. If you haven't read them yet, do that now.

The first examples from Sections 2.1 and 2.2 involve finding a specific sequence within a longer DNA string. This task is very common in bioinformatics, so it is built into BioPython.

In [None]:
##Run the code below. What do the numbers mean and how are they defined?
from Bio.Seq import Seq
from Bio import SeqUtils
pattern = Seq("ACG")
sequence = Seq("ATGCGCGACGGCGTGATCAGCTTATAGCCGTACGACTGCTGCAACGTGACTGAT")
results = SeqUtils.nt_search(str(sequence),pattern)
print(results)

In [None]:
##The code below searches the sequence for the reverse complement of the pattern, which finds matches on the opposite (reverse) strand. When might you want to search a reverse complement?
results_rc = SeqUtils.nt_search(str(sequence),pattern.reverse_complement())
print(results_rc)
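
If you'd like to see what a both-strands search involves, here is a minimal plain-Python sketch that does not use BioPython. The helper names `reverse_complement` and `find_all` are made up for this example, and this is not how `nt_search` works internally. It only illustrates the idea: a match for the reverse complement of the pattern in the sequence as written corresponds to a match for the original pattern on the opposite strand.

```python
# Plain-Python sketch: search both strands of a DNA string.
# reverse_complement() and find_all() are hypothetical helpers for this example.

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    """Complement every base, then reverse the string."""
    return dna.translate(COMPLEMENT)[::-1]

def find_all(dna, pattern):
    """Return every start position of pattern in dna, including overlapping hits."""
    positions = []
    start = dna.find(pattern)
    while start != -1:
        positions.append(start)
        # Resume one base further along so overlapping hits are not skipped.
        start = dna.find(pattern, start + 1)
    return positions

sequence = "ATGCGCGACGGCGTGATCAGCTTATAGCCGTACGACTGCTGCAACGTGACTGAT"
pattern = "ACG"

forward_hits = find_all(sequence, pattern)
reverse_hits = find_all(sequence, reverse_complement(pattern))

print("Forward strand:", forward_hits)
print("Reverse strand:", reverse_hits)
```

Both lists give start positions in the sequence as written, so you can compare them directly with the output of the two cells above.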

Now it's time for Exercise 6:
>Find the number of occurrences and locations of `ACTT` (including the forward and reverse strand) within the following sequence:
>
>```
>sequence
>AGCGATCTAGCATACTTATACGCGCGCAGCTATCGATCACTTGTGCTAGTAAAGTGCGCGCCGCATTAAAGTGCTAGCTAGCTACTTAGCTAGCTAGTCG
>```



In [None]:
##You can write your code in this block.

#Part 2: Consensus Sequences
First, make sure you've read Sections 2.1-3 of the textbook. If you haven't read them yet, do that now.

First: Why do we as biologists care about consensus sequences? Give an example of a consensus sequence. (If you've already taken BIOL3101 this should be very easy!)

\---------------------------
You can add your answer in this text box.
\---------------------------

Because biologists care a lot about consensus sequences, BioPython has a built-in function to search for them.

In [7]:
##Run the code below. What does the resulting output mean?
from Bio import SeqUtils
consensus = "RGWYV"
sequence = "CGTAGCTAGCTCAGAGCAGGGACACGTGCTAGCAACAGCGCT"
results = SeqUtils.nt_search(sequence,consensus)
print(results)

['[AG]G[AT][CT][ACG]', 19]


Consensus sequences will be really important when we use BLAST later, so explain below, in your own words, what the following variables represent. Also give at least three different sequences that satisfy the criteria:
1. consensus
2. results

\---------------------------
You can add your answer in this text box.
\---------------------------
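
Once you've written your answer, you can compare it against the sketch below. It is a plain-Python illustration of how a consensus written in IUPAC ambiguity codes can be expanded into a regular expression and then searched for. The `IUPAC` dictionary and the function names `consensus_to_regex` and `search_consensus` are made up for this example, and BioPython's internals may differ.

```python
import re

# IUPAC nucleotide ambiguity codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def consensus_to_regex(consensus):
    """Turn a consensus such as 'RGWYV' into a regular expression."""
    parts = []
    for code in consensus:
        bases = IUPAC[code]
        # A single possible base is written as-is; several become a [..] character class.
        parts.append(bases if len(bases) == 1 else "[" + bases + "]")
    return "".join(parts)

def search_consensus(dna, consensus):
    """Return the regex and every start position (counting from 0) where it matches."""
    regex = consensus_to_regex(consensus)
    # The lookahead (?=...) lets overlapping matches be reported too.
    positions = [match.start() for match in re.finditer("(?=" + regex + ")", dna)]
    return regex, positions

sequence = "CGTAGCTAGCTCAGAGCAGGGACACGTGCTAGCAACAGCGCT"
regex, positions = search_consensus(sequence, "RGWYV")
print(regex, positions)
```

Any five-letter sequence that fits the regular expression satisfies the consensus, which is why one consensus can describe many different sequences.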

#Part 3: Finding new motifs
Up until this point, we've been pre-defining our motifs and searching for them in a larger sequence. We already know a lot of motifs from our previous studies: for example, you might know the motif "AUG" as the start codon. However, this only works when we know exactly what we're looking for within the genome. Unfortunately, we *don't* always know exactly where a motif will occur in a genome, how many motifs there are, or what a motif looks like.

We can calculate these things using a weight matrix, which summarises the probability that a given sequence in a dataset is a recurring motif rather than a random sequence. This is difficult, because you can only judge it in the context of the whole genome.

Think about playing Snap. I could shuffle a deck of cards, get you to draw a card and ask you if it's Snap or not. You'd tell me, well that depends on the context of the other cards, what's ahead and what's before. It's kind of the same with motif searching.
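
To make the idea of a weight matrix a little more concrete, here is a minimal sketch. Everything in it is an assumption chosen for illustration: the four `examples` are toy data, the `PSEUDOCOUNT` of 1 is arbitrary, and the background assumes all four bases are equally likely in random sequence. It uses some Python we haven't covered yet, so read it for the idea rather than the syntax. Each position gets a score for each base: positive if the base is more common there than chance would predict, negative if it is less common.

```python
import math

# Toy data made up for this sketch: four aligned examples of a motif.
examples = ["TATA", "TATT", "TACA", "TATA"]

BASES = "ACGT"
PSEUDOCOUNT = 1    # assumed value; avoids log(0) for bases never seen at a position
BACKGROUND = 0.25  # assumes all four bases are equally likely in random sequence

def build_weight_matrix(seqs):
    """Return one dictionary per motif position, mapping base -> log2-odds score."""
    matrix = []
    # zip(*seqs) walks through the aligned examples one column at a time.
    for column in zip(*seqs):
        total = len(column) + PSEUDOCOUNT * len(BASES)
        scores = {}
        for base in BASES:
            probability = (column.count(base) + PSEUDOCOUNT) / total
            scores[base] = math.log2(probability / BACKGROUND)
        matrix.append(scores)
    return matrix

def score(matrix, candidate):
    """Add up the per-position scores; a higher total means more motif-like."""
    return sum(column[base] for column, base in zip(matrix, candidate))

weights = build_weight_matrix(examples)
print(round(score(weights, "TATA"), 2))
print(round(score(weights, "GCGC"), 2))
```

A candidate that looks like the examples (`TATA`) gets a higher total score than one that doesn't (`GCGC`).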

There is a LOT of math in this section of the textbook that is beyond the scope of this course, which is why I'm not suggesting you understand the entire section before you start this bit of the lab. However, I do encourage you to read it and ask Brian lots of hard questions.

As before, finding motifs *de novo* (this means without additional info like what motif we're looking for) is very important in bioinformatics and there is a function for it in BioPython. Before we use that function, however, we have to learn a tiny bit more about how the Python programming language works.

In [15]:
##This code creates a variable called instances and saves a list of DNA sequences to it.
from Bio.Seq import Seq
instances = [Seq("CAGTT"), Seq("CATTT"), Seq("ATTTA"), Seq("CAGTA"), Seq("CAGTT"), Seq("CAGTA")]

When you run the code above, you create a variable and assign a value to it, just like before. However, up until this point we've been working with just numbers and strings. A string is a sequence of characters, usually used for text like `Hello world` or `ATGTCTGTGT`, and is defined in Python using quotation marks:

In [16]:
x = "hello world"
print(x)

hello world


In the `instances` code, we're using a list. A list is an ordered collection of elements that can be manipulated with various Python functions and, unlike a string, is not limited to characters. A list is defined in Python with square brackets, with the elements separated by commas.

In [18]:
x = [1, 'hello', 3.14, True]
print(x)
print(x[1])

[1, 'hello', 3.14, True]
hello
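
Notice that `x[1]` printed `hello`, which is the second element of the list, not the first. That is because Python counts positions (called indexes) starting from 0. Here is a short sketch using a made-up list called `bases`:

```python
# A made-up list of bases; any list is indexed the same way.
bases = ["A", "C", "G", "T"]

print(len(bases))   # how many elements the list holds
print(bases[0])     # the first element lives at index 0
print(bases[3])     # so the fourth element lives at index 3
print(bases[-1])    # a negative index counts back from the end of the list

bases.append("N")   # lists can be changed after they are created
print(bases)
```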


Create a list that includes five elements and at least three data types. Print the first, third, and fifth elements of the list.

In [None]:
##You can write your code in this block.

Now let's go back to our motif function and our `instances` list.

In [19]:
#This code takes six short DNA strings (contained in the variable instances) and saves them as a motif.
from Bio import motifs
from Bio.Seq import Seq
motif = motifs.create(instances)
print(motif)

CAGTT
CATTT
ATTTA
CAGTA
CAGTT
CAGTA



At first glance, this doesn't look very interesting - we already had a variable `instances` that could print our list of sequences. However, by creating a motif of this list, we can identify all sorts of other attributes. For example, we can use the code below to quantify the amount of variation at each site in our motif.

In [12]:
##This code prints the motif counts of our sequences
print(motif.counts)

        0      1      2      3      4
A:   1.00   5.00   0.00   0.00   3.00
C:   5.00   0.00   0.00   0.00   0.00
G:   0.00   0.00   4.00   0.00   0.00
T:   0.00   1.00   2.00   6.00   3.00
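
To see where these numbers come from, here is a plain-Python sketch that builds the same table by hand. It uses ordinary strings instead of `Seq` objects, and the variable names are made up for this example. It is not how BioPython does it internally, just the same tally written out step by step.

```python
from collections import Counter

# The same six sequences as in instances, written as plain strings.
sequences = ["CAGTT", "CATTT", "ATTTA", "CAGTA", "CAGTT", "CAGTA"]

# Start with an empty row for each base.
counts = {base: [] for base in "ACGT"}

# zip(*sequences) groups all the first letters together, then all the second letters, and so on.
for column in zip(*sequences):
    tally = Counter(column)
    for base in "ACGT":
        # A Counter gives 0 for a base that never appears in this column.
        counts[base].append(tally[base])

for base in "ACGT":
    print(base, counts[base])
```

Each column adds up to six, because every one of the six sequences contributes exactly one base at each position.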



We can also identify the consensus sequence.

In [20]:
##This code prints the motif consensus of our sequences
print(motif.degenerate_consensus)

CAKTW


This code identifies the degenerate consensus sequence. What does the term degenerate mean in biology, and what does it mean in this context?

\---------------------------
You can add your answer in this text box.
\---------------------------
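
Once you've answered, you might wonder how a letter like `K` or `W` gets chosen. The sketch below is one plausible set of rules, loosely modelled on the Cavener-style rules that BioPython's documentation credits. The exact thresholds here are an assumption for illustration, and the function name `degenerate_consensus` and the `AMBIGUITY` dictionary are made up for this example. The counts are copied from the `motif.counts` table earlier in this notebook.

```python
# Counts per position for each base, copied from the motif.counts table.
counts = {
    "A": [1, 5, 0, 0, 3],
    "C": [5, 0, 0, 0, 0],
    "G": [0, 0, 4, 0, 0],
    "T": [0, 1, 2, 6, 3],
}

# IUPAC codes for sets of two or three bases (keys are in alphabetical order).
AMBIGUITY = {
    "AC": "M", "AG": "R", "AT": "W", "CG": "S", "CT": "Y", "GT": "K",
    "ACG": "V", "ACT": "H", "AGT": "D", "CGT": "B",
}

def degenerate_consensus(counts):
    """Pick one IUPAC letter per position using assumed Cavener-style thresholds."""
    letters = []
    for i in range(len(counts["A"])):
        # Rank the four bases from most to least common at this position.
        ranked = sorted("ACGT", key=lambda base: counts[base][i], reverse=True)
        n = [counts[base][i] for base in ranked]
        total = sum(n)
        if n[0] > total / 2 and n[0] > 2 * n[1]:
            # One base clearly dominates, so use it directly.
            letters.append(ranked[0])
        elif n[0] + n[1] > 0.75 * total:
            # Two bases together cover most sequences, so use a two-base code.
            letters.append(AMBIGUITY["".join(sorted(ranked[:2]))])
        elif n[3] == 0:
            # Three bases occur and one never does, so use a three-base code.
            letters.append(AMBIGUITY["".join(sorted(ranked[:3]))])
        else:
            # All four bases occur with no clear pattern.
            letters.append("N")
    return "".join(letters)

print(degenerate_consensus(counts))
```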

The motifs tool also allows us to visualize our motifs.

In [21]:
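#This code creates an image of your motif and saves it as LOGO.pdf. It needs an internet connection, because BioPython sends the motif to the WebLogo web server.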
motif.weblogo("LOGO.pdf",format="pdf")

You might recognise these extremely fancy diagrams from other genetics classes.
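
In a sequence logo, the height of each stack of letters reflects how conserved that position is, measured in bits of information. Below is a minimal sketch of that calculation, using the counts from the `motif.counts` table earlier in this notebook. The function name `information_content` is made up for this example, and the WebLogo server may apply extra corrections for small numbers of sequences, so the heights in your PDF may not match these numbers exactly.

```python
import math

# Counts per position for each base, copied from the motif.counts table.
counts = {
    "A": [1, 5, 0, 0, 3],
    "C": [5, 0, 0, 0, 0],
    "G": [0, 0, 4, 0, 0],
    "T": [0, 1, 2, 6, 3],
}

def information_content(counts, position):
    """Return 2 minus the entropy (in bits) of one column of a DNA counts table."""
    column = [counts[base][position] for base in "ACGT"]
    total = sum(column)
    entropy = 0.0
    for n in column:
        if n > 0:  # bases that never occur contribute nothing to the entropy
            p = n / total
            entropy -= p * math.log2(p)
    # With four possible bases, the maximum is 2 bits (a perfectly conserved position).
    return 2.0 - entropy

for position in range(5):
    print(position, round(information_content(counts, position), 2))
```

Position 3, where every sequence has a `T`, reaches the maximum of 2 bits. Position 4, which is split evenly between `A` and `T`, carries only 1 bit.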

© Elisabeth Richardson, 2023. Adapted from Applied Bioinformatics by David A. Hendricks under a [CC-by-4.0 license](https://creativecommons.org/licenses/by/4.0/) and shared under the same license.