<div align=right>
<img src="img/logosmall.png" width="100px" align=right>
</div>

# Working with text and numbers

### *&mdash; Values, Types and Operators &mdash;*

<div class="alert alert-warning">
Parts of this section have been adapted from copyrighted material in *Jones, M:
Python for Biologists: A complete programming course for beginners (2013)*.

**Please do not distribute it!**
</div>

## Working with text

Many introductory programming texts, whether aimed at Python or another language, start by discussing numbers.  Not only are numbers well understood by most people, but it's also a safe bet that anyone learning to program will wish to crunch numerical data at some point.  To be different, we'll start this course by looking at how Python works with *text* data.

Why are we, as biologists, interested in working with text?  The answer should be obvious:  Much of the data we work with is most easily represented as text, for example *sequence* data (like DNA or protein sequences).

>By implication we've learned our first lesson about Python:  It can distinguish between different *types* of data, and it treats textual and numerical data differently.

## Strings

The basic *data type* used by Python to represent textual data is the *string*.  Strings are sequences of *characters*, and in Python we write a string by enclosing its contents in quotes:

In [6]:
"Hello world"

'Hello world'


<div class="alert alert-info">
**Remember:**  You can *evaluate* a code cell in a Jupyter Notebook by placing the cursor (grey frame with blue border) on it and pressing `Shift-Enter` or `Control-Enter`.

Throughout this course, you're expected to execute the code examples to see what
they do!  But even more importantly, you can *and should* experiment by changing the code!
</div>

<div class="alert alert-info">
When you pressed `Shift-Enter` (or `Ctrl-Enter`, or pressed the "run" button on the toolbar)  on the code cell above, the Python kernel *evaluated* the *expression* in the cell.  In this case, it was a string:

```python
"Hello world"
```

Jupyter then printed the *result of the evaluation* directly below the code cell.
</div>


Python doesn't care whether you use double or single quotes to surround (or *delimit*) a string:

In [7]:
'Hello world'

'Hello world'

However, the quotes at the start and end of the string must match!

Let's see what happens if they don't:

In [8]:
"Hello world'

SyntaxError: EOL while scanning string literal (<ipython-input-8-271324297b4f>, line 1)

Well, that looks a little scary!  What you see there is Python's way of telling you that an error occurred.  We say that Python *raised an exception*, and the message wherein it tells you about the error is called a *traceback*.

Tracebacks may seem scary at first, but really, they're your friends.  Python's tracebacks are extremely verbose and help you get to the root of the problem quickly.


Why does Python support both single and double quotes as *string delimiters*?  One reason is so that we can create a string that contains one kind of quote by delimiting it with the other:

In [9]:
"She said, 'Hello world'"

"She said, 'Hello world'"

In [10]:
'He said, "Hello world"'

'He said, "Hello world"'

In other words, a string delimited by double quotes can contain single quote characters, and vice versa.

What happens if we forget the quotes altogether?

In [11]:
Hello world

SyntaxError: invalid syntax (<ipython-input-11-f0ef7081e153>, line 1)

Without quotes, Python doesn't know that the expression represents a string — a piece of data.  Python is now trying to evaluate the expression as Python program code, which it simply isn't.  Hence, we get a `SyntaxError` once again.


How do we make a string that spans more than one line?  What if we wanted a string that contains the words "Hello" and "world" on two consecutive lines, like so:

```
Hello
world
```

We could try to break the string up over two lines, like this:

In [12]:
"Hello
world"

SyntaxError: EOL while scanning string literal (<ipython-input-12-b0d9821feecf>, line 1)

…but clearly Python doesn't allow that.

We could try to put a string on either line, but that way we end up with two strings instead of one:

In [1]:
"Hello"
"world"

'world'

<div class="alert alert-info">
This code cell contained *two* expressions, each one consisting of a string.  Note that, while Jupyter evaluates all the expressions in a code cell, it only prints the result of the last evaluation!
</div>

The way to split a string across multiple lines in Python is to use a *triple-quoted string*.  That is, delimit the string using *three* quote characters at either end:

In [14]:
"""Hello
world"""

'Hello\nworld'

The result of evaluating that triple-quoted string may look a little different than you expected.  Instead of "Hello" and "world" on two consecutive lines, it consists of "Hello" and "world" separated by a backslash and a lowercase ‘n’, like so:  `\n`.

The `\n` is Python's way of indicating a *newline*, i.e. the character you get when you hit the `<Enter>` key.

We can also explicitly add a newline character to a string:

In [16]:
print("Hello\nworld")

Hello
world


The `\n` is usually called a *special character*, and it's one of several.  Another common special character is `\t`, which represents the tab character you get by pressing `<Tab>`.

Because Python views the combination of a backslash and the character following it in a string as a special character, adding a simple backslash to a Python string is … less than simple:  A single backslash in a Python string is represented by a double backslash:
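
A minimal sketch of that rule. The Windows-style path is a made-up example string, used only because it contains backslashes:

```python
# Each "\\" in the source code stands for ONE backslash in the string itself.
# The path below is invented for illustration, not a real location.
print("C:\\data\\sequences")   # prints C:\data\sequences

# A string that contains nothing but a single backslash.
print("\\")                    # prints \
```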


Let’s take a look at the various bits of a line of code such as `print("Hello world")`, and give some of
them names:

* The whole line is called a *statement*.


* `print` is the name of a *function*. A function is, roughly, a pre-packaged action that Python can perform.  (We'll learn more about functions later.)  The function name is always followed by parentheses.


* The bits of text inside the parentheses are called the *arguments* to the function. In this case, we just have one argument.  Later on we’ll see examples of functions that take more than one argument.  The arguments provide the data for the function to work on – in this case, the argument tells Python exactly what it is we want the `print` function to print.


* When we *execute* a statement that contains a function with arguments in parentheses (e.g. by pressing `Shift-Enter` on a code cell), we say we *call* that function.
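
To make these terms concrete, here is a minimal sketch of a function call with the parts labelled in comments. The greeting text is only a placeholder; any string would do.

```python
# The whole line below is a statement.
# `print` is the function name, and "Hello world" is its single argument.
# Executing the line calls the function.
print("Hello world")
```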

>Note: Python syntax uses different kinds of brackets for different purposes. Throughout this course, we'll consistently refer to them as:

> * `()` &mdash; "parentheses"
> * `[]` &mdash; "square brackets", or just "brackets"
> * `{}` &mdash; "curly braces", or just "braces"

## Quotes are important

In Python strings are **always** surrounded by quotes — we say the string is *delimited* by quotes. That is how Python is able to tell the difference between
the instructions (like the function name) and the data (the thing we want to print). We can use either single quotes “`''`” or double quotes “`""`” for delimiting strings — Python will happily accept either. The following two statements behave exactly the same — run them and you'll see:

In [2]:
print("Hello world")
print('Hello world')

Hello world
Hello world


You’ll notice that the output above doesn’t contain quotes – they are part of
the code, not part of the string itself. If we *do* want to include quotes in
the output, the easiest thing to do is use the other type of quotes for
surrounding the string:

In [3]:
print("She said, 'Hello world'")
print('He said, "Hello world"')

She said, 'Hello world'
He said, "Hello world"


In other words, single quotes can be used to delimit strings that contain double quotes, and vice versa.

Be careful when writing and reading code that involves quotes – you have to make
sure that the quotes at the beginning and end of the string match up.  Using a code editor that *highlights* code using colours helps with this, and you'll note that Jupyter highlights code by default in code cells:  Data like strings is shown in red.

## Use comments to annotate your code

Occasionally, we want to write some text in a program that is for humans to
read, rather than for the computer to execute. We call this type of line a
*comment*. To include a comment in your source code, start the line with a `#`
symbol (called “hash” or “pound”).

In [17]:
# this is a comment, it will be ignored by the computer
print("Comments are very useful!")

Comments are very useful!


The Python interpreter completely ignores everything from the `#` symbol to the end of the line.

Comments are a very useful way to document your code, for a number of reasons:

* You can put the explanation of what a particular bit of code does right next to the code itself. This makes it easier to find documentation than searching a separate document.


* Because the comments are part of the source code, they can never get mixed up or separated.


* Having the comments right next to the code acts as a reminder to update the documentation whenever you change the code. The only thing worse than undocumented code is code with old documentation that is no longer accurate!

Don't think that comments are only useful to someone else.  Mainly, you're writing comments *for yourself, six months in the future!*  You'll be amazed how quickly you forget the details of an intricate piece of code you're working on.

Comments are the breadcrumbs you leave yourself to remind you of the route your thinking took while solving a problem.  They save you time by helping you avoid having to spend hours understanding your own code.  Additionally, they help you understand what you're doing *while you're writing code*, by forcing you to put your thinking into words.

In short:  Write comments.  The more, and the more explicit, the better!

In [None]:
# Note how the following comment describes what the function call does:

# print a friendly greeting
print("Hello world")

# A comment extends from the `#` sign to the end of the line:

print("It's a lovely day!")    # This comment comes after some code and is still ignored

# ****************************************** #
#                                            #
# You can do anything after the '#' sign     #
# including fancy comment blocks like this!  #
#                                            #
# ****************************************** #

>Python provides special ways of writing comments that work with the integrated documentation system.  We'll see those later.

## Error messages and debugging

Wait, we haven't even written a real program yet and we're already talking about errors?

Yep, it's a fact of life that **computer programs almost never work correctly the first time**.  Computers have the unfortunate habit of doing what you tell them to, instead of what you want them to.

Unlike natural languages, programming languages are very explicit.  They have strict rules (of syntax, etc.) and if you break any of them, the computer will not and cannot attempt to guess what you intended, but instead will stop running and present you with an error message.  You'll be seeing a lot of these error messages in your programming career, so let's get
used to them as soon as possible.

### Forgetting quotes

Here’s one possible error we can make when printing a line of output – we can
forget to include the quotes:

In [19]:
print(Hello world)

SyntaxError: invalid syntax


This is easily done, so let’s take a look at the output we’ll get if we try to
run the above code.

We can see that the error occurs on the first line of code. Python’s best guess
at the location of the error is just before the closing parenthesis. Depending on
the type of error, this can be wrong by quite a bit, so don’t rely on it too
much!

The type of error is a `SyntaxError`, which means that Python can’t understand
the code – it breaks the rules in some way. We’ll see different types of errors
later in this course.

>Note that syntax highlighting could have saved you from the error above.  If you get used to seeing strings in red, it may have occurred to you that the statement above "looked wrong" before you tried to execute it.

### Spelling mistakes

What happens if we misspell the name of the function?

In [20]:
prin("Hello world")

NameError: name 'prin' is not defined

We get a different type of error, a `NameError`, and the error message is a
bit more helpful.

This time, Python doesn’t try to show us where on the line the error occurred,
it just shows us the whole line. The error message tells us which word Python
doesn’t understand, so it’s quite easy to fix.

### Splitting a statement over two lines

What if we want to print some output that spans multiple lines? For example, we
want to print the word “Hello” on one line and then the word “World” on the next
line – like this:

In [None]:
Hello
World

We might try putting a new line in the middle of our string like this:

In [24]:
print("Hello
World")

SyntaxError: EOL while scanning string literal


…but that won’t work and we’ll get an error message.

Python finds the error when it gets to the end of the first line of code. The
error message is a bit more cryptic than the others. `EOL` stands for End Of
Line, and string literal means a string in quotes. So to put this error message
in plain English: *“I started reading a string in quotes, and I got to the end
of the line before I came to the closing quotation mark”*

>Note again how the syntax highlighting should've given you a clue that something was wrong!

If splitting the line up doesn’t work, then how do we get the output we want…?  Write some code to print the words "Hello" and "World" on consecutive lines:

In [None]:
# Write your code in this code cell

## Printing special characters

The reason that the code above didn’t work is that Python got confused about
whether the new line was part of the string (which is what we wanted) or part of
the source code (which is how it was actually interpreted). What we need is a
way to include a new line as part of a string, and luckily for us, Python has
just such a tool built in. To include a new line, we write a backslash followed
by the letter `n` – Python knows that this is a special character and will
interpret it accordingly. Here’s the code which prints “Hello world” across two
lines:

In [None]:
# how to include a new line in the middle of a string
print("Hello\nworld")

Notice that there’s no need for a space before or after the new line.

There are a few other useful special characters as well, all of which consist of
a backslash followed by a letter. The only ones which you are likely to need for
the exercises in this course are the tab character (`\t`) and the carriage
return character (`\r`). The tab character can sometimes be useful when writing
a program that will produce a lot of output. The carriage return character works
a bit like a new line in that it puts the cursor back to the start of the line,
but doesn’t actually start a new line, so you can use it to overwrite output –
this is sometimes useful for long-running programs.
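
Although we type two keystrokes (a backslash and a letter) for each special character, Python stores it as a single character. The sketch below checks this with `len()`, a built-in function that counts the characters in a string and is introduced later in this section. The sample record (`seq1` and its sequence) is invented for illustration.

```python
# A special character is written with two keystrokes but counts as one character.
print(len("\n"))   # 1
print(len("\t"))   # 1

# Made-up record: four letters, one tab, four letters.
print(len("seq1\tATGC"))   # 9
```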

See if you can figure out what the following statement will print before you
execute it:

In [2]:
print("First value:\t45\nSecond value:\t97")

First value:	45
Second value:	97


## Triple-quoted strings

Sometimes we want to create strings that represent blocks of text with many line
breaks, and typing such strings with `\n` special characters is inelegant and
messy.  For this purpose, Python has triple-quoted strings.  If you delimit a
string with *three* quotes (either `"""` or `'''` will work), Python allows line
breaks inside that string and keeps them as newline characters.  Some examples:

In [26]:
print('''Hello
World!''')

Hello
World!


In [None]:
limerick = """A programming genius called Gertie
Had a penchant for graphics so dirty
No computer she knew
Would accept what she drew
Until she had tickled its QWERTY."""

print(limerick)

## *Aside:* Evaluating vs. `print`ing in a Jupyter Notebook

What's the difference between running this line:

In [27]:
"ACTG"

'ACTG'

…and this line?

In [28]:
print("ACTG")

ACTG


When we write standalone scripts later on in this course, we'll use the `print` function whenever we need to print output to the terminal.

A Jupyter Notebook is a special interactive case, though.  When you type a
series of Python *statements* into a code box in a Notebook, the Python
interpreter *evaluates* them all in turn — this includes printing the output of any calls to the `print` function — and then **automatically prints the result of the last statement evaluated**.

In [29]:
"ACGTTG"
'Hello there'
"AATTT"

'AATTT'

Note that only the value of the last statement (of the three) was auto-printed.

Don't confuse an explicit call of the `print()` function with Jupyter automatically printing the result of the last statement evaluated in the code box. The latter is purely an artefact of the Jupyter environment.

>Note that when Jupyter auto-prints an evaluation, it prefixes it with
`Out[n]:` in the left margin, where `n` is the same number as in the `In [n]:` prompt of the cell that produced it.

>By contrast, output from the `print()` function has no marker in the left margin.
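
One way to see the difference for yourself is the built-in `repr()` function, which returns the textual form Python would use to write a value as code. For strings, this appears to be the form Jupyter shows when it auto-prints; treat that as an assumption about the Notebook rather than a guarantee. A minimal sketch:

```python
# print() shows the contents of the string; repr() shows it as a literal,
# with quotes and with special characters written as escape sequences.
print("Hello\nworld")        # two lines of output, no quotes
print(repr("Hello\nworld"))  # 'Hello\nworld'
```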

## Assigning variables to strings

We've been printing strings for a while now, but every time we used a string we did so explicitly:  we used *literal* string values. Such literal values are of limited use 
for storing actual data we want to use in our computations, since there's no way to refer back to them; once they've been evaluated, they're gone!

If we want to use a string value for "real" computational work, we need to store it in the computer's memory in such a way that we can later retrieve it.  To do this, we need to assign it some sort of name or label. In Python, such a label is called a *variable*. We can *assign* a value to a variable by using the *assignment operator*.  Python repurposes the equals sign (“`=`”) as its assignment operator:

In [34]:
# attach the variable (or label) my_dna to the sequence string
my_dna = "ATGCGTA"


The variable `my_dna` now points to the string `"ATGCGTA"`. Once we have executed this *assignment* statement, we can use the variable name instead of the string itself; for example, we can pass it to the `print()` function:

In [35]:
# attach the variable my_dna to a short DNA sequence fragment
my_dna = "ATGCGTA"
# now print the DNA sequence
print(my_dna)

ATGCGTA


Notice that when we use the variable as argument to a `print()` function, we don’t need any quotation marks – the quotes are part of the representation of a literal string in Python.  If we *were* to put quotation marks around `my_dna`, we wouldn't be talking about the variable at all anymore, but rather a literal string with the literal value `"my_dna"`!

In [36]:
# attach the variable my_dna to a short DNA sequence fragment
my_dna = "ATGCGTA"
print(my_dna)
print("my_dna")

ATGCGTA
my_dna


>Note: Other types (not just strings) can of course also be stored in variables, as we'll see in later sections.

We can *reassign* a variable to a new value as often as we like:

In [37]:
my_dna = "ATGCGTA"
print(my_dna)
# assign my_dna to a new value
my_dna = "TGGTCCA"
print(my_dna)

ATGCGTA
TGGTCCA


> Note that the Python assignment operator `=` has a very different meaning as compared to the mathematical equals sign!
>  * In mathematics, `=` states equality; we read it as *"is equal to"*.
>  * In Python, `=` indicates that the variable on the left gets
assigned the value of the expression on the right. We can read it as *"is
assigned the value"*.

Remember: variable names are arbitrary, which means that we can pick whatever we like to be the name of a variable. So our code above would work in exactly the same way if we picked a different variable name:

In [38]:
# attach the variable banana to a DNA sequence
banana = "ATGCGTA"
# now print the DNA sequence
print(banana)

ATGCGTA


However it's generally a good idea to use a variable name that gives us a clue as to what the variable refers to.

In this example:

* `my_dna` is a good variable name, because it tells us that the content of the variable is a DNA sequence
* `banana` is a bad variable name, because it doesn’t really tell us anything about the value that’s stored

A *descriptive* variable name implicitly helps you to document your code.

Python enforces a few rules on variable names:

* Variable names in Python can consist of letters, digits, and the underscore character “`_`”.  Variable names cannot *start* with a digit, though.


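* You cannot pick a variable name in Python that's identical to one of the language's reserved keywords, like `if`, `for` or `class`.  (Names of built-in functions such as `print` are not keywords, so Python will let you re-use them as variable names, but doing so hides the original function and is best avoided.)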


* Finally, variable names are *case sensitive*. This means that `my_seq`, `My_seq` and `my_SEQ` are all different variables!

Technically this last point means that you could use all three of those names in a Python program to reference different values.  **Don't do this** – it is very easy to become confused when you use very similar variable names.
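
If you're unsure whether a name is allowed, you can ask Python itself. The sketch below uses the standard library's `keyword` module and the string method `isidentifier()` (methods are introduced later in this section). The candidate names are arbitrary examples.

```python
import keyword

# Reserved keywords cannot be used as variable names.
print(keyword.iskeyword("for"))      # True
print(keyword.iskeyword("my_dna"))   # False

# isidentifier() checks the character rules (letters, digits, underscore,
# no leading digit). It does NOT check for keywords.
print("my_dna".isidentifier())       # True
print("2nd_seq".isidentifier())      # False

# Names are case sensitive, so these are two different variables.
my_seq = "ACTG"
My_seq = "GGCC"
print(my_seq, My_seq)                # ACTG GGCC
```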

## Visualising variables and assignment

You can think of a variable as a label that's assigned to a piece of data that's
stored at a specific address in the computer's memory.  For instance, if we look
at an assignment like this one&hellip;

In [39]:
my_seq = "ACTTCGT"

&hellip;we can think of it as looking something like this:

<img src="img/variables01.png" align="left">
<div style="clear:both;height:1px;"> </div>

The blue cloud represents the location in the computer's memory that is now
storing your sequence data. The variable is a (yellow) label that has been
attached to the data &mdash; it points to the data where it sits in memory.

When Python evaluates the variable `my_seq`, it returns the string data that the
variable was referencing:

In [40]:
my_seq

'ACTTCGT'

Let's make a second variable, and give it the name `T_seq`:

In [41]:
T_seq = "GAAT"

We now have two variables (labels) pointing at two separate bits of data sitting
in two different locations in the computer's memory.

<img src="img/variables02.png" align="left">
<div style="clear:both;height:1px;"> </div>

We can evaluate our new variable:

In [42]:
T_seq

'GAAT'

What happens if we reassign `T_seq` to point to `my_seq`?

In [43]:
T_seq = my_seq

This amounts to telling Python, "Attach the `T_seq` label to the same piece of
data that bears the `my_seq` label."

<img src="img/variables03.png" align="left">
<div style="clear:both;height:1px;"> </div>

You can confirm this by evaluating both variables:

In [44]:
my_seq

'ACTTCGT'

In [45]:
T_seq

'ACTTCGT'

What happens to the bit of sequence data that `T_seq` was pointing to before?
You'll note that this data now has no label attached at all. If data is not
referenced by a variable, we say it has become *unreferenced*. There's no way to
get this data back. In fact, after a while Python will reclaim the memory used
by this data in a process known as *garbage collection*:

<img src="img/variables04.png" align="left">
<div style="clear:both;height:1px;"> </div>

If we assign `my_seq` a new value, we're telling Python to re-use the `my_seq`
label to point to a new, different piece of data that sits in a different
location in the computer's memory.

In [46]:
my_seq = "GA"

But, very importantly:  `T_seq` **still points to the original string!**

<img src="img/variables05.png" align="left">
<div style="clear:both;height:1px;"> </div>

Let's confirm this by evaluating both variables:

In [47]:
my_seq

'GA'

In [48]:
T_seq

'ACTTCGT'
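
The label picture can also be checked in code. The sketch below replays the assignments from this section and uses the `is` operator, which tests whether two variables point to the very same piece of data. That operator is not otherwise covered here, so take this as an aside.

```python
my_seq = "ACTTCGT"
T_seq = "GAAT"

# Attach the T_seq label to the data that my_seq points to.
T_seq = my_seq
print(T_seq is my_seq)   # True: two labels, one piece of data

# Re-use the my_seq label for new data; T_seq is unaffected.
my_seq = "GA"
print(T_seq is my_seq)   # False
print(my_seq)            # GA
print(T_seq)             # ACTTCGT
```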

## Tools for manipulating strings

Now that we know how to print strings and assign them to variables, we can take a look at a few of the facilities that Python has for manipulating them. For now we’ll just take a look at some of the most useful ones. In
the exercises at the end of this section, we’ll look at how we can use multiple
different tools together in order to carry out more complex operations.

### Concatenation

In Python we can concatenate (stick together) two strings using the “`+`” (plus) symbol, borrowed from mathematics.  (We say Python has *overloaded* this mathematical operator to work on the string type.)  This *operator* will join together the string on the left with the string on the right and produce a new, concatenated string:

In [49]:
"AATT" + "GGCC"

'AATTGGCC'

We can assign a variable to the result of the concatenation:

In [50]:
my_dna = "AATT" + "GGCC"
print(my_dna)

AATTGGCC


In the above examples we concatenated literal strings, but we can also concatenate variables that point to strings:

In [51]:
upstream = "AAA"
my_dna = upstream + "ATGC"
# my_dna should now be "AAAATGC"
print(my_dna)

AAAATGC


We can also join multiple strings together in one go:

In [52]:
upstream = "AAA"
downstream = "GGG"
my_dna = upstream + "ATGC" + downstream
# my_dna should now be "AAAATGCGGG"
print(my_dna)

AAAATGCGGG


The result of concatenating two strings is itself a string, so we can use a concatenation anywhere we could use a literal string, for example as the argument in a call to `print()`:

In [53]:
print("Hello" + " " + "world")

Hello world


Note what's going on in this example:  The `print` function is called with a *single argument*, but this argument isn't a string *literal* as before, nor is it a variable that references a string;  rather, it's an *expression that evaluates to a string*.  (Here the expression is built entirely out of string literals, but it could just as well include variables that point to strings.)

>Python always evaluates the arguments of a function before executing the function. A computer scientist will say that Python is a *strict* language, or that it performs *strict evaluation* or *applicative evaluation*.

### Repetition

The plus sign isn't the only mathematical operator which Python has *overloaded* to work on strings.  A somewhat strange — but nevertheless occasionally useful — operator for strings is “`*`” (the multiplication operator, represented — as in most programming languages — by an asterisk):

In [4]:
5 * "ACTG"

'ACTGACTGACTGACTGACTG'

In [3]:
'N' * 100

'NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN'

You can only "multiply" a string by an integer number (say `n`), and this results in a new string which is the original one repeated `n` times. You cannot "multiply" two strings with each other.  What would that even mean?

In [56]:
"ACTG" * "GC"

TypeError: can't multiply sequence by non-int of type 'str'
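
Concatenation and repetition can be combined in a single expression. As in arithmetic, `*` is applied before `+`. The sketch below pads a made-up sequence with a run of `N` characters; the variable names and sequences are illustrative only.

```python
start = "ATG"
stop = "TAA"

# "N" * 5 is evaluated first, then the three strings are concatenated.
padded = start + "N" * 5 + stop
print(padded)   # ATGNNNNNTAA
```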

### Finding the length of a string

Another useful built-in tool in Python is the `len()` function.  (`len` is short for “length”.)

Just like the `print()` function, the `len()` function takes a single argument which is a string. However, the behaviour of the `len()` function is quite different. Instead of outputting text to the screen, `len()` behaves more like a function in mathematics:  it returns a value that can be stored, which we call the *return value*.

>`print()` is actually the outlier in this case.  Most functions behave like `len()`.

If we write a program that uses `len()` to calculate the length of a string, the program will run but we won't see any output.  However, recall that Jupyter automatically evaluates the last line in a code box and *implicitly* prints the result of that evaluation, so we should see a value if we evaluate a call to the `len()` function:

In [57]:
# no call to print() here; Jupyter displays the return value itself
len("ATGC")

4

If we want to actually use the return value, we need to assign it to a variable, and then do something useful with it (like printing it or using it in a further computation):

In [59]:
dna_length = len("AGTC")
print(dna_length)

4


Note that you can also *nest* function calls, i.e. use a function call as an argument in another function call:

In [60]:
print(len("ACTG"))

4


As we've seen before, Python evaluates arguments before evaluating the function call.  In the case of nested functions, Python therefore evaluates "from the inside out":

* it first evaluates `"ACTG"`, which is a literal string and therefore evaluates to itself
* it then evaluates the call to `len()`, to produce the length of its argument
* it then evaluates the call to `print()`, which prints the result returned by `len()`
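
The same inside-out evaluation can be spelled out one step at a time. This is only a sketch of the order of evaluation, with hypothetical variable names holding the intermediate results:

```python
# Step 1: the literal string evaluates to itself.
sequence = "ACTG"

# Step 2: len() is called with that string and returns a number.
sequence_length = len(sequence)

# Step 3: print() is called with the number returned by len().
print(sequence_length)    # 4

# The nested call does all three steps in one line.
print(len("ACTG"))        # 4
```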

There’s another interesting thing about the `len` function: the result (or
return value) is not a string, it’s a number.  This is the first time we've worked with numbers!

Python treats strings and numbers differently. We can see that this is the case if we try to concatenate together a number and a string. Consider this short program which calculates the length of a DNA sequence and then prints a message telling us the length:

In [1]:
# assign the DNA sequence to a variable
my_dna = "ATGCGAGT"
# calculate the length of the sequence and assign it to a variable
dna_length = len(my_dna)
# print a message telling us the DNA sequence length
print("The length of the DNA sequence is " + dna_length)

TypeError: must be str, not int

When we try to run this program, we get an error.

Python is complaining that it doesn’t know how to concatenate a string (which it
calls `str` for short) and a number (which it calls `int` – short for integer).
Strings and numbers are examples of *types* – different kinds of information
that can exist inside a program. We'll talk a lot more about types later on.
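
If you're curious which type a value has, the built-in `type()` function will tell you. A minimal sketch, re-using the sequence from the example above:

```python
my_dna = "ATGCGAGT"
dna_length = len(my_dna)

print(type(my_dna))       # <class 'str'>
print(type(dna_length))   # <class 'int'>
```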

We can fix this bug by using another built-in function supplied by Python:  It's called `str()` and it turns a number into a string, so that we can concatenate it with other strings.

Here's our modified program (without comments to make it more compact):

In [65]:
my_dna = "ATGCGAGT"
dna_length = len(my_dna)
print("The length of the DNA sequence is " + str(dna_length))

The length of the DNA sequence is 8


The only thing we have changed is that we’ve replaced `dna_length` with
`str(dna_length)` inside the call to `print()`.

### Aside:  The clever `print()` function

We've seen many times that `print()` can print string literals.  We've also just seen that `print()` can print variables that point to integer (numeric) values.  It can also print literal numbers:

In [2]:
# literal string:
print("Hello World")

# string variable:
print(my_dna)

# literal number:
print(101)

# numeric variable
print(dna_length)

Hello World
ATGCGAGT
101
8


`print()` is clever enough to take the argument you give it and *implicitly* convert it to a textual representation before printing it.

>Note that we can insert blank lines in our code to make it more readable.

But wait!  That's not all!

We can also call print with *multiple arguments*.  To call a function with more than one argument in Python, we simply put all the arguments between the parentheses and separate them with commas:

In [66]:
print("Hello", "World")

Hello World


What happens in this example?  `print()` is now given *two* strings as arguments:  `"Hello"` and `"World"`.  What does it do?  It prints both strings, separated by a space.

Let's experiment:  What if we add yet more arguments:

In [6]:
print("Hello", 1, "World")

Hello 1 World


As you can see, `print()` prints *all* its arguments, separated by spaces, implicitly converting arguments to a textual representation as necessary.

We can now rewrite our previous example a little more succinctly:

In [7]:
my_dna = "ATGCGAGT"
dna_length = len(my_dna)
print("The length of the DNA sequence is", dna_length)

The length of the DNA sequence is 8


Since we didn't concatenate strings to form a single argument, we didn't need to call the `str()` function to convert a number to a string.  Instead, we simply called `print()` with multiple arguments and depended on its implicit conversion to print the number.

### Changing case

We can convert a string to lower case by using a new type of syntax – a *method*
that belongs to strings. A *method* is like a *function*, but instead of being
built in to the core Python language, it belongs to a particular *type* of data.

The method we need is called `lower()` and we say that it belongs to the *string* type. Here’s how we use it:

In [69]:
my_dna = "ATGC"
# convert my_dna to lower case
my_dna.lower()

'atgc'

Notice how using a method looks different to using a function. When we use a
function like `print()` or `len()`, we write the function name first and the
arguments go in parentheses:

In [70]:
print("ATGC")
len(my_dna)

ATGC


4

When we use a method…
* we write the name of the variable first
* followed by a period (`'.'`)
* followed by the name of the method
* followed by the method's arguments in parentheses

In this example `lower()` takes no argument, so the parentheses are empty.  They're still required, though, because they're part of Python's *function call* syntax, which applies equally to methods.
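
The same syntax also works on a string literal, not only on a variable. A small sketch (the variable names `from_variable` and `from_literal` are made up for this example):

```python
my_dna = "ATGC"

# method called on a variable
from_variable = my_dna.lower()

# the same method called directly on a string literal
from_literal = "ATGC".lower()

print(from_variable)
print(from_literal)
```

Both lines of output should read `atgc`.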

It’s important to notice that the `lower()` method does not actually change the value pointed to by the variable `my_dna`; instead it **returns a copy of the value pointed to by the variable**, but in lower case.

We can easily show this by printing the variable before and after running `lower()`:

In [71]:
my_dna = "AtgC"

# print the variable
print("before:\t", my_dna)

# run the lower method and store the result
lowercase_dna = my_dna.lower()

# print the variable again
print("after:\t", my_dna)

# print the copy of the variable
print("lower:\t", lowercase_dna)

before:	 AtgC
after:	 AtgC
lower:	 atgc


Analogous to `lower()`, the string type also has a method `upper()`.  You can probably guess what it does!

In [None]:
my_dna = "AtgC"
print("before:\t", my_dna)
uppercase_dna = my_dna.upper()
print("after:\t", my_dna)
print("upper:\t", uppercase_dna)

Because `upper()` and `lower()` are methods of the string type, we can only use them on variables that are strings. If we try to use one of them on a number:

In [5]:
my_number = len("AGTC")
# my_number is 4
print(my_number.lower())

AttributeError: 'int' object has no attribute 'lower'

…we will get an error.

The error message is a bit cryptic, but hopefully you can grasp the meaning:
something that is a number (an `int`, or integer) does not have a `lower()`
method.

### Replacement

Here’s another example of a useful method that belongs to the string type: `replace()`.

`replace()` takes *two* arguments (both strings) and returns a copy of the value pointed to by the string variable where all occurrences of the first string are replaced by the second string.

Just as when we used the `print()` function with multiple arguments, we use commas to separate multiple arguments to a method.

Note that the `replace()` method **of a string variable** takes **two further strings** as arguments.  There are three strings involved here:

* the original string, pointed to by the string variable
* the first argument, indicating the substring that will be replaced
* the second argument, indicating the string that will replace it

Confused?  An example should make things clearer:

In [72]:
protein = "vlspadktnv"

# replace valine with tyrosine
print(protein.replace("v", "y"))

ylspadktny


In [73]:
# we can replace more than one character
print(protein.replace("vls", "ymt"))

ymtpadktnv


In [74]:
# the original variable is not affected
print(protein)

vlspadktnv


We’ll take a look at more powerful tools for carrying out string replacement
later on.
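
One place where `replace()` comes in handy is turning a DNA sequence into the corresponding RNA sequence by swapping every thymine for a uracil. The sequence below is made up for illustration, and we assume it is written as the coding strand:

```python
coding_strand = "ATGCGTTAA"  # hypothetical example sequence

# replace every T with U; the original string is left untouched
rna = coding_strand.replace("T", "U")

print("DNA:", coding_strand)
print("RNA:", rna)
```

As with the protein example, `coding_strand` itself is unchanged; the result lives in the new variable `rna`.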

### Counting and finding substrings

A very common job in biology is to count the number of times some pattern occurs
in a (DNA, RNA, protein, …) sequence. In programming terms, we want to count the number of times a substring occurs in a string.

The string type has a method that does this job, called `count()`. It takes a single argument whose type is string, and returns the number of times that this string is found as a substring of the string pointed to by the variable. The return type is (obviously) a number.

An example:

In [6]:
protein = "vlspadktnv"

# count amino acid residues
valine_count = protein.count('v')
lsp_count = protein.count('lsp')
tryptophan_count = protein.count('w')
 
# now print the counts
print("valines:", valine_count)
print("lsp:", lsp_count)
print("tryptophans:", tryptophan_count)

valines: 2
lsp: 1
tryptophans: 0


Often we don't just want to count patterns in a sequence, but also know their location.  The string type's `find()` method will give us this answer.

`find()` takes a single string argument (just like `count()`) and returns a number which is the position at which that substring first appears in the string (we call that the *index* of the substring).

**Important:**  In Python we start counting from zero rather than one (we say that Python strings are *zero-indexed*), so position 0 is the first character, position 1 is the second character, position 2 the third and so on.

A couple of examples:

In [76]:
protein = "vlspadktnv"
print(protein.find('p'))
print(protein.find('kt'))
print(protein.find('w'))

3
6
-1


Notice the behaviour of `find()` when we ask it to locate a substring that doesn’t exist – we get back the answer `-1`.

Python strings have another method called `index()` that works almost identically to `find()` … except for what it does if it can't find the substring:

In [77]:
protein = "vlspadktnv"
print(protein.index('p'))
print(protein.index('kt'))
print(protein.index('w'))

3
6


ValueError: substring not found

>`count()`, `find()` and `index()` all have the limitation that you can only search for *exact* substrings. If you need to count the number of occurrences of a variable protein motif, or find the position of a variable transcription factor binding site, they will not help you. Later on in the course we'll encounter more powerful tools that can do these sorts of jobs.
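
One more detail worth knowing is that `count()` only counts occurrences that do not overlap each other. The short sketch below uses made-up sequences to show the effect (the variable names are chosen just for this example):

```python
# "AA" fits into "AAAA" at three positions (0, 1 and 2),
# but only two of those occurrences do not overlap
homopolymer_count = "AAAA".count("AA")
print(homopolymer_count)

# "ATA" starts at positions 0 and 2 of "ATATA", but the two occurrences overlap
overlap_count = "ATATA".count("ATA")
print(overlap_count)
```

If overlapping motifs matter for your analysis, `count()` on its own will under-report them.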

Of the tools we’ve discussed in this section, three – `replace`, `count` and `find` – require at least two strings to work.  For instance, in the case of `find`:  a substring you're searching *for* and the string you're searching *in*.  So this:

```python
my_dna.count(my_motif)
```

…is not the same as:

```python
my_motif.count(my_dna)
```

The first example looks for `my_motif` in `my_dna`, which is probably what you wanted to do.  The second example is most likely not what you intended.
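
A quick sketch of the difference, using a made-up sequence and motif:

```python
my_dna = "ATGCGAATTCGAATTC"  # hypothetical sequence containing the motif twice
my_motif = "GAATTC"

# how often does the motif occur in the sequence?
motif_in_dna = my_dna.count(my_motif)

# how often does the (longer) sequence occur in the motif? Never.
dna_in_motif = my_motif.count(my_dna)

print(motif_in_dna)
print(dna_in_motif)
```

Swapping the two strings does not raise an error; it silently answers a different question, which is why this mistake is easy to overlook.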

### Extracting a character from a string

It's possible to extract a single character from a Python string, using Python's string *subscript* syntax. We place the *index* of the character we want in **square brackets**, *directly* after the variable name (or literal string):

In [82]:
protein = "vlspadktnv"
print("The value of 'protein' is currently:\t", protein)
print("The first residue of 'protein' is:\t", protein[0])
print("The second residue of 'protein' is:\t", protein[1])
print("The third residue of 'protein' is:\t", protein[2])

The value of 'protein' is currently:	 vlspadktnv
The first residue of 'protein' is:	 v
The second residue of 'protein' is:	 l
The third residue of 'protein' is:	 s


We can also use subscript syntax with string literals:

In [83]:
print("CGATTAG"[0])
print("CGATTAG"[1])
print("CGATTAG"[2])
print("CGATTAG"[3])
print("CGATTAG"[4])
print("CGATTAG"[5])
print("CGATTAG"[6])

C
G
A
T
T
A
G


**Once again:**  It's very important to note that the *first* character has the index `0` (zero), the second character has the index `1`, and so on.  Python starts counting from zero — we say Python's indexing is *zero-based*.

| Character | Index |
|:---------:|:-----:|
| `C`       | `0` |
| `G`       | `1` |
| `A`       | `2` |
| `T`       | `3` |
| `T`       | `4` |
| `A`       | `5` |
| `G`       | `6` |

What happens when we try to use an index larger than 6 with a string that is only 7 characters long (i.e. runs from index 0 to index 6)?  Let's try:

In [84]:
print("CGATTAG"[7])

IndexError: string index out of range

As we may have expected, we get an error.  An `IndexError` to be exact.

Note what happens when the string index is *negative*:

In [85]:
print("CGATTAG"[-1])
print("CGATTAG"[-2])
print("CGATTAG"[-3])
print("CGATTAG"[-4])
print("CGATTAG"[-5])
print("CGATTAG"[-6])
print("CGATTAG"[-7])

G
A
T
T
A
G
C


It counts *backwards* from the end of the string!

Hence, the index `-1` indicates the last character of a string, `-2` is the penultimate character, and so forth:

| Character | Index | Negative index |
|:---------:|:-----:|:--------------:|
| `C`       | `0` | `-7` |
| `G`       | `1` | `-6` |
| `A`       | `2` | `-5` |
| `T`       | `3` | `-4` |
| `T`       | `4` | `-3` |
| `A`       | `5` | `-2` |
| `G`       | `6` | `-1` |

Mostly we use negative indices to extract characters from somewhere near the end of a string.  They give us a quick way of referencing the last (or penultimate, etc.) character.
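
Without negative indices we would have to work out the position from the length of the string ourselves. The sketch below (variable names are illustrative) shows that both routes lead to the same character:

```python
my_seq = "CGATTAG"
seq_length = len(my_seq)

# last character: negative index vs. length-based index
last_negative = my_seq[-1]
last_positive = my_seq[seq_length - 1]

# third-to-last character, both ways
third_last_negative = my_seq[-3]
third_last_positive = my_seq[seq_length - 3]

print(last_negative, last_positive)
print(third_last_negative, third_last_positive)
```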

### Extracting a substring from a string

What if we want to extract more than one *contiguous* character?

If we want to extract a *substring* (that is, a series of contiguous characters) from a string, we use Python's *slicing* syntax. This looks a lot like subscripts, except we now give *two* values (the beginning and end positions of the slice), separated by a colon (“`:`”).

In [87]:
my_seq = "CGATTAG"
print(my_seq)

CGATTAG


In [88]:
print(my_seq[1:3]) # from index 1 up to, but not including, index 3 (i.e. the 2nd and 3rd characters)

GA


In [89]:
print(my_seq[2:5])

ATT


Note that the way in which slicing works might seem a little unintuitive at first blush:

`my_seq[1:3]` means:

* **from** the character at index 1 (the second character)&hellip;
* &hellip;until **just before** the character at index 3 (the fourth character).

In other words, the slice notation is *inclusive* at the start, and *exclusive* at the end.  Alternatively, we could say it's open-ended on the right.
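
A handy consequence of this convention is that, as long as both positions lie within the string, the length of a slice is simply the end position minus the start position. A minimal sketch (the names `start`, `stop` and `fragment` are made up for this example):

```python
my_seq = "CGATTAG"
start = 2
stop = 5

# characters at indices 2, 3 and 4; index 5 is not included
fragment = my_seq[start:stop]

print(fragment)
print(len(fragment), stop - start)
```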

You may also leave out either of the two endpoints of a slice. A missing
start means "from the start of the string", and a missing end means "to the end
of the string". For example:

In [90]:
my_seq = "CGATTAG"

# This is equivalent to: my_seq[0:3]
# i.e. "from the start of the string till just before character 3"
print(my_seq[:3])

CGA


In [91]:
# This is equivalent to: my_seq[2:7], for our string of length 7
# i.e. "from character 2 till the end of the string"
print(my_seq[2:])

ATTAG


In [92]:
# This is equivalent to: my_seq[-2:7], for our string with length of 7
# i.e. "from the second-to-last character to the end of the string"
print(my_seq[-2:])

AG


In [93]:
# This is equivalent to the entire my_seq
# i.e. "from the first character to the last character"
print(my_seq[:])

CGATTAG


> Note that the "slightly unintuitive" way in which slicing works means
that&hellip;
>
>     my_seq[:n] + my_seq[n:]
>
> …is always equal to just `my_seq` for any value of “`n`”! This may help you to remember how it works.
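
We can check this claim for a few values of `n`. The sketch below (variable names are illustrative) also shows that, unlike a single subscript, a slice that reaches beyond the end of the string does not raise an error; it simply yields fewer characters or an empty string:

```python
my_seq = "CGATTAG"

# split in the middle and glue the pieces back together
rejoined_middle = my_seq[:2] + my_seq[2:]

# split at the very start: the first piece is empty
rejoined_start = my_seq[:0] + my_seq[0:]

# split far beyond the end: the second piece is empty
rejoined_beyond = my_seq[:100] + my_seq[100:]

print(rejoined_middle)
print(rejoined_start)
print(rejoined_beyond)
```

All three lines print the original sequence.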

Applying the string slicing syntax always yields a new string, leaving the original unchanged.  Thus, the empty slice `[:]` in the last example above can be used to make an exact copy of a string:

In [94]:
my_seq = "CGATTAG"
seq_2 = my_seq[:]
my_seq = "AG"
print("'myseq':\t", my_seq)
print("'seq_2':\t", seq_2)

'myseq':	 AG
'seq_2':	 CGATTAG


If you include a third integer between the square brackets, this becomes the *step* value. The default step value (if you leave out the third integer, as we've done so far) is `1`.

In [8]:
my_seq = "CGATTAG"

# Equivalent to: my_seq[1:3]
print(my_seq[1:3:1])

GA


In [96]:
# Equivalent to: my_seq[3:]
print(my_seq[3::1])

TTAG


In [97]:
# Step value of 2; print every 2nd character from 3 till before 6 
print(my_seq[3:6:2])

TA


In [98]:
# Print every 2nd character of the entire string
print(my_seq[::2])

CATG


In [11]:
# Print *backwards* from the penultimate character (index -2) till *just before* index 3
print(my_seq[-2:3:-1])

AT


In [100]:
# Print the entire string backwards
print(my_seq[::-1])

GATTAGC
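
Slices and step values map nicely onto codons. The sketch below uses a made-up nine-base sequence and hypothetical variable names; it assumes the reading frame starts at the first base:

```python
my_orf = "ATGGCCTAA"  # hypothetical sequence: three codons

# consecutive slices of width 3 give the individual codons
first_codon = my_orf[0:3]
second_codon = my_orf[3:6]
third_codon = my_orf[6:9]

# a step value of 3 picks out one position from every codon
first_positions = my_orf[0::3]
third_positions = my_orf[2::3]

print(first_codon, second_codon, third_codon)
print(first_positions)
print(third_positions)
```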


### Recap of string manipulation

We've seen four ways to manipulate strings in this section:

* Built-in Python operators like `+`
* Built-in Python functions like `len()`
* String methods like `replace()` and `lower()`
* String indexing and slicing

There's still more to come!  For instance, we couldn't touch on all string methods in this section, because some of them yield data structures we have not yet learned about (like *lists*!)

## Recap of the `print()` function

The humble `print()` function has some useful tricks up its sleeve.

* It always prints a textual representation of the argument you give it, even if that argument is not a string.


* When called with multiple arguments, `print()` will print textual representations of all arguments on a single line, separated by spaces.


* A feature of `print()` which we have been using silently all along, is that it implicitly appends a newline character (“`\n`”) to the string being printed. Thus, if multiple `print()` statements follow each other, their output appears on consecutive lines:

In [None]:
print("Hello")
print("World")

The `print()` function has even more tricks up its sleeve, as we'll see later…

---

## Exercises

### 1. Calculating GC content

Here’s a short DNA sequence:

    ACTGATCGATTACGTATAGTATTTGCTATCATACATATATATCGATGCGTTCAT

Write a program that will print out the GC content of this DNA sequence. Hint:
you can use normal mathematical symbols like add (`+`), subtract (`-`), multiply
(`*`), divide (`/`) and parentheses to carry out calculations on numbers in
Python.

You can do the exercise in the following code block.  I've started by putting
the sequence into a variable `dna_seq` for you:

In [208]:
# Exercise on calculating GC content
dna_seq = "ACTGATCGATTACGTATAGTATTTGCTATCATACATATATATCGATGCGTTCAT"

# count G and C separately
G_content = dna_seq.count('G')
C_content = dna_seq.count('C')
 
# now print the counts
print("G:", G_content)
print("C:", C_content)

dna_length = len(dna_seq)
print(dna_length)

gc_content = ((G_content + C_content) / dna_length) * 100

print(gc_content)


G: 8
C: 9
54
31.48148148148148


### 2. Complementing DNA

Here’s a short DNA sequence:

    ACTGATCGATTACGTATAGTATTTGCTATCATACATATATATCGATGCGTTCAT

Write a program that will print the complement of this sequence.

In [128]:
# Exercise on complementing a DNA sequence
dna_seq = "ACTGATCGATTACGTATAGTATTTGCTATCATACATATATATCGATGCGTTCAT"

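# swap each base for its complement in lower case first, so that a later
# replace() cannot swap an already-complemented base back again
complement = dna_seq.replace("A", "t").replace("T", "a").replace("C", "g").replace("G", "c")
print(complement.upper())

TGACTAGCTAATGCATATCATAAACGATAGTATGTATATATAGCTACGCAAGTA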


### 3. Restriction fragment lengths

Here’s a short DNA sequence:

    ACTGATCGATTACGTATAGTAGAATTCTATCATACATATATATCGATGCGTTCAT

The sequence contains a recognition site for the EcoRI restriction enzyme, which
cuts at the motif G\*AATTC (the position of the cut is indicated by an
asterisk).

Write a program which will calculate the size of the two fragments that will be
produced when the DNA sequence is digested with EcoRI.

In [187]:
# Exercise 3
dna_seq = "ACTGATCGATTACGTATAGTAGAATTCTATCATACATATATATCGATGCGTTCAT"

print("complete sequence length:", len(dna_seq))

# find() returns the index of the first base (G) of the recognition site
site_position = dna_seq.find("GAATTC")
# EcoRI cuts between the G and the AATTC, one base after the start of the site
cut_position = site_position + 1

print("Position of recognition site: ", site_position)
print("Position of the cut: ", cut_position)

# slice at the cut position: no bases are removed from the sequence
fragment_1 = dna_seq[:cut_position]
fragment_2 = dna_seq[cut_position:]
print(fragment_1, fragment_2)
print("length of the first fragment: ", len(fragment_1))
print("length of the second fragment: ", len(fragment_2))

complete sequence length: 55
Position of recognition site:  21
Position of the cut:  22
ACTGATCGATTACGTATAGTAG AATTCTATCATACATATATATCGATGCGTTCAT
length of the first fragment:  22
length of the second fragment:  33


### 4. Splicing out introns, Part 1

Here’s a short section of genomic DNA:

    ATCGATCGATCGATCGACTGACTAGTCATAGCTATGCATGTAGCTACTCGATCGATCGATCGATCGATCGATCGATCGATCGATCATGCTATCATCGATCGATATCGATGCATCGACTACTAT

It comprises two exons and an intron. The first exon runs from the start of the sequence to the sixty-third character, and the second exon runs from the ninety-first character to the end of the sequence. Write a program that will print just
the coding regions of the DNA sequence.

In [199]:
# Exercise 4

dna_seq = "ATCGATCGATCGATCGACTGACTAGTCATAGCTATGCATGTAGCTACTCGATCGATCGATCGATCGATCGATCGATCGATCGATCATGCTATCATCGATCGATATCGATGCATCGACTACTAT"

sequence_length = len(dna_seq)
print("Sequence length: ", sequence_length)

# exon 1: first character up to and including the 63rd (indices 0 to 62)
exon1 = dna_seq[0:63]
# exon 2: 91st character (index 90) to the end of the sequence
exon2 = dna_seq[90:]

print("Exon 1: ", exon1)
print("Exon 2: ", exon2)

Sequence length:  123
Exon 1:  ATCGATCGATCGATCGACTGACTAGTCATAGCTATGCATGTAGCTACTCGATCGATCGATCGA
Exon 2:  ATCATCGATCGATATCGATGCATCGACTACTAT


### 5. Splicing out introns, Part 2

Using the data from part one, write a program that will calculate what
percentage of the DNA sequence is coding.

>**Reminder:**  This entire Jupyter Notebook is in reality just one Python
session, so any variables you defined in Part 1 will still be defined;  you
don't need to define them again.

In [204]:
# Exercise 5

sequence_length = len(dna_seq)
print("Sequence length: ", sequence_length)

# exon boundaries as given in Part 1: characters 1-63 and 91-end
exon1_length = len(dna_seq[0:63])
exon2_length = len(dna_seq[90:])
print(exon1_length)
print(exon2_length)

# coding percentage: the parentheses make sure the two lengths are added
# before dividing by the total length
print(100 * (exon1_length + exon2_length) / sequence_length, "%")

Sequence length:  123
63
33
78.04878048780488 %


### 6. Splicing out introns, Part 3

Using the data from Part 1, write a program that will print out the original
genomic DNA sequence with coding bases in uppercase and non-coding bases in
lowercase.

In [5]:
# Exercise 6
# coding bases (exon boundaries as given in Part 1)
exon1 = dna_seq[0:63]
exon2 = dna_seq[90:]

# non-coding bases: everything between the two exons
intron = dna_seq[63:90]

# join the three pieces back together, with only the intron in lower case
print(exon1.upper() + intron.lower() + exon2.upper())

ATCGATCGATCGATCGACTGACTAGTCATAGCTATGCATGTAGCTACTCGATCGATCGATCGAtcgatcgatcgatcgatcgatcatgctATCATCGATCGATATCGATGCATCGACTACTAT
