# Welcome to Day 1! 

## Introduction to strings, Biopython, and Biopython sequences

### Section 1: String manipulation in Python


### Section 2: Installing and importing Biopython 


### Section 3: Working with sequences using Biopython 

----
## Session summary

Our first session will move slowly to make sure everyone's Jupyter notebook is working correctly and that everyone has a basic understanding of strings in Python, which is important for understanding Biopython. Afterwards, we will load Biopython and get to work modifying sequences!

----

# Section 1: String manipulation in Python

A string in Python is any sequence of characters enclosed in single quotes or double quotes.

In [3]:
# this is a string
print('this is a string')
# this is also a string
print("this is also a string")


this is a string
this is also a string


Strings can be manipulated using basic Python syntax.

One way is slicing: use square brackets to give the range of characters in a string that you wish to keep. Positions are counted from 0, and the stop position itself is not included.

In [12]:
# [start:stop]
# remove the first four characters of the string
print('this is a string'[4:])
# keep the first 12 characters 
print('this is a string'[:12])
# remove the last four characters
print('this is a string'[:-4])
# keep the 5th through 10th characters
print('this is a string'[4:10])

 is a string
this is a st
this is a st
 is a 
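
Slice positions are easy to miscount. As a minimal sketch (the string and the variable names here are only illustrations, not part of the course material), the cell below shows that Python counts positions from 0 and leaves the stop position out of the result:

```python
# Illustrative string; any string behaves the same way
codons = 'ATGCGTACC'

# positions start at 0, so [0:3] keeps the characters at positions 0, 1 and 2
first_three = codons[0:3]
# leaving out the stop position means "go to the end"
from_fourth = codons[3:]
# negative positions count back from the end
last_three = codons[-3:]

print(first_three)       # ATG
print(from_fourth)       # CGTACC
print(last_three)        # ACC
# the length of a slice is stop - start
print(len(codons[2:7]))  # 5
```

A handy rule of thumb: `[start:stop]` always returns `stop - start` characters, as long as both positions fall inside the string.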


### Exercise 1:

Complete the code to show only the characters `'string'`.

**Hint: replace the ___ with character positions.**

In [None]:
print('this is a string'[___:___])

Rather than typing out the string each time, we can save it to a variable.

In [14]:
string_var = 'this is a string'

We can still manipulate the string variable using brackets as before.

In [15]:
# [start:stop]
# remove the first four characters of the string
print(string_var[4:])
# keep the first 12 characters 
print(string_var[:12])
# remove the last four characters
print(string_var[:-4])
# keep the 5th through 10th characters
print(string_var[4:10])

 is a string
this is a st
this is a st
 is a 


However, `string_var` itself is not affected. It still prints 'this is a string'.

In [16]:
print(string_var)

this is a string


We can easily replace the contents of `string_var` by assigning a modified string back to it.

In [17]:
string_var = string_var[4:]
print(string_var)

 is a string
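
Why does slicing leave the original untouched? Python strings cannot be edited in place, so every operation builds a new string. The short sketch below (using a hypothetical variable `text` rather than `string_var`) illustrates this:

```python
text = 'this is a string'

# slicing builds a new string; the original is left untouched
shorter = text[4:]
print(text)     # this is a string
print(shorter)  #  is a string

# strings cannot be edited in place, so this assignment fails
try:
    text[0] = 'T'
except TypeError as err:
    print('TypeError:', err)

# to "change" a string, build a new one and reassign the name
text = 'T' + text[1:]
print(text)     # This is a string
```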


---
### Exercise 2a: 
Assign the string `'my new string'` to `string_var_2`. Print `string_var_2`.

In [None]:
___ = 'my new string'
print(string_var_2)

### Exercise 2b:
Make a new variable called `string_var_3` that contains only the first three characters of `string_var_2`. Print `string_var_3`.

In [None]:
___ = string_var_2[___:___]
print(string_var_3)

---
There are additional Python string methods that return modified copies of a string. One pair is `upper()` and `lower()`.

In [19]:
string_var_4 = 'I_Am_A_beAuTifUl_sTrIng'
# make all characters upper case
print(string_var_4.upper())
# make all lower case
print(string_var_4.lower())

I_AM_A_BEAUTIFUL_STRING
i_am_a_beautiful_string


The `replace()` method is also very useful. It replaces every occurrence of one substring with another.

In [21]:
# replace beAuTifUl with an empty string (effectively deleting it)
print(string_var_4.replace('beAuTifUl',''))
# replace beAuTifUl with a new string
print(string_var_4.replace('beAuTifUl','GREAT'))
# replace all underscores with a space
print(string_var_4.replace('_',' '))

I_Am_A__sTrIng
I_Am_A_GREAT_sTrIng
I Am A beAuTifUl sTrIng
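
A few details of `replace()` are worth seeing in action. The sketch below uses a made-up string, `label`, to show that all occurrences are replaced by default, that an optional third argument limits the number of replacements, and that string methods can be chained because each one returns a new string:

```python
label = 'gene_A_gene_B_gene_C'

# every occurrence is replaced by default
print(label.replace('gene', 'GENE'))     # GENE_A_GENE_B_GENE_C
# an optional third argument limits how many occurrences are replaced
print(label.replace('gene', 'GENE', 1))  # GENE_A_gene_B_gene_C
# methods return new strings, so they can be chained
cleaned = label.replace('gene_', '').lower()
print(cleaned)                           # a_b_c
# the original string is unchanged
print(label)                             # gene_A_gene_B_gene_C
```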


You can also add a string to another string using the + operator. You can create new variables from this too.

In [55]:
# add exclamation marks
print(string_var_4+'!!!!!')

# add question marks and save to new variable 
string_var_4_modified = string_var_4+'???'
print(string_var_4_modified)

I_Am_A_beAuTifUl_sTrIng!!!!!
I_Am_A_beAuTifUl_sTrIng???


---
### Exercise 3a:

Assign `string_var_4` to a new variable called `string_var_5`. Print `string_var_5`

In [22]:
___ = string_var_4
print(___)

I_Am_A_beAuTifUl_sTrIng


### Exercise 3b:
Add the string `'i think'` to `string_var_5`. Then `replace()` all underscores with hyphens (-). Finally, print `string_var_5`!

In [None]:
string_var_5 = string_var_5+___

string_var_5 = ___.replace(___, ___)

print(____)

That's it for string manipulation in base Python for now! 

# Section 2: Installing and importing Biopython

Installing the Biopython package is simple:

In [24]:
pip install Biopython

Note: you may need to restart the kernel to use updated packages.


**Import** the Biopython package to make sure that it installed successfully. Note that the package's import name is `Bio`.

In [27]:
# When you run this cell, there should be no errors
import Bio

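Check the version of Biopython. It should be 1.77, because the alphabet examples in Section 3 use `Bio.Alphabet`, which was removed in Biopython 1.78.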

In [31]:
Bio.__version__

'1.77'
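
If you are not sure whether your installed version will work with the alphabet examples later in this session, a small check can help. The function below, `has_alphabet_module`, is a hypothetical helper written for this tutorial (it is not part of Biopython); it only compares version numbers, relying on the fact that `Bio.Alphabet` was removed in release 1.78:

```python
def has_alphabet_module(version):
    """Return True if this Biopython version still ships Bio.Alphabet.

    Hypothetical helper for this tutorial; it only compares version numbers.
    """
    major, minor = version.split('.')[:2]
    # Bio.Alphabet was removed in Biopython 1.78
    return (int(major), int(minor)) < (1, 78)

print(has_alphabet_module('1.77'))  # True
print(has_alphabet_module('1.78'))  # False
print(has_alphabet_module('1.81'))  # False
```

In the notebook you could call `has_alphabet_module(Bio.__version__)`. If it returns `False`, one option is to install the older release with `pip install biopython==1.77` and restart the kernel.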

That's it for installing Biopython!

# Section 3: Working with sequences using Biopython

Now that we have reviewed the basics of strings in Python we are ready to adapt them to biological sequences and learn how to work with them using Biopython.

Let's make a short DNA sequence and check what class of data Python sees it as.

In [126]:
# make string
short_dna = 'atgtacgacaacgccagcaccaggatcaacggc'
# check the class of the variable
type(short_dna)


str


We will start by importing Biopython's **Seq** class (Seq is shorthand for Sequence) from the `Bio.Seq` module and comparing it to a basic Python string.

Biopython's Seq class is designed to work with sequences. We will convert `short_dna` to a Seq **object**.

In [125]:
from Bio.Seq import Seq
# convert short_dna to a Seq object
short_dna = Seq(short_dna)
#The Biopython Seq class
type(short_dna)

Bio.Seq.Seq

### What does it mean for short_dna to be a Seq object? 

When a variable like `short_dna` is converted to a Seq object, Biopython treats it as a biological sequence. In Biopython 1.77 it can carry extra properties, such as an alphabet that says whether it is a DNA, RNA, or protein sequence. It also gains sequence-specific methods, like translation or calculating the reverse complement.

Here are some useful methods that can be applied to a DNA Seq object:

In [117]:
# make a Seq object from a string of DNA characters
seq_object_dna = Seq('atgtacgacaacgccagcaccaggatcaacggc')

In [118]:
# translate
seq_object_dna.translate()

Seq('MYDNASTRING', ExtendedIUPACProtein())

In [119]:
# transcribe
seq_object_dna.transcribe()

Seq('auguacgacaacgccagcaccaggaucaacggc', RNAAlphabet())

In [120]:
# reverse complement
seq_object_dna.reverse_complement()

Seq('gccgttgatcctggtgctggcgttgtcgtacat')

In [121]:
# complement
seq_object_dna.complement()

Seq('tacatgctgttgcggtcgtggtcctagttgccg')

In [122]:
# convert sequence to upper case
seq_object_dna.upper()

Seq('ATGTACGACAACGCCAGCACCAGGATCAACGGC')

In [123]:
# convert sequence to lower case
seq_object_dna.lower()

Seq('atgtacgacaacgccagcaccaggatcaacggc')
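
To get a feel for what `translate()` is doing, here is a plain-Python sketch of the same idea: cut the DNA into groups of three letters (codons) and look each one up in a table. The dictionary `codon_to_aa` is an assumption made for this illustration and covers only the codons in our example sequence; it is not how Biopython stores its tables.

```python
# Partial codon table: only the codons that occur in this tutorial's sequence.
# A real table has 64 entries; Biopython ships complete ones.
codon_to_aa = {
    'atg': 'M', 'tac': 'Y', 'gac': 'D', 'aac': 'N', 'gcc': 'A',
    'agc': 'S', 'acc': 'T', 'agg': 'R', 'atc': 'I', 'ggc': 'G',
}

dna = 'atgtacgacaacgccagcaccaggatcaacggc'

# split the sequence into consecutive groups of three letters
codons = [dna[i:i + 3] for i in range(0, len(dna), 3)]
# look each codon up and join the amino acids into one string
protein = ''.join(codon_to_aa[codon] for codon in codons)

print(codons[:3])  # ['atg', 'tac', 'gac']
print(protein)     # MYDNASTRING
```

The result matches the output of `seq_object_dna.translate()` above. In practice we let Biopython do this, since it handles the full genetic code, stop codons, and alternative tables.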

We can check that `seq_object_dna` is a Biopython Seq object by entering the variable name on its own in a cell and running it. A Seq object will have `Seq()` shown as part of the **returned** value.


In [58]:
seq_object_dna

Seq('atgtacgacaacgccagcaccaggatcaacggc')

What if our sequence was not a Seq object, but instead a normal Python string? For example, if we try to translate a string, we will get an error. Python strings do have a `translate()` method, but it is an unrelated character-mapping method that requires a table argument. `string_dna` has no information about what can and can't be done to it with Biopython methods.

In [57]:
# same dna as above but not a seq object
string_dna = 'atgtacgacaacgccagcaccaggatcaacggc'
# trying to use translate
print(string_dna.translate())

TypeError: translate() takes exactly one argument (0 given)
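
So what does the built-in `str.translate()` actually do? It swaps characters according to a lookup table built with `str.maketrans()`. As a sketch, we can even use it to compute a complement by hand; the table name `complement_table` is our own choice for this illustration:

```python
string_dna = 'atgtacgacaacgccagcaccaggatcaacggc'

# build a table that swaps each base for its partner
complement_table = str.maketrans('acgt', 'tgca')

# str.translate applies the table one character at a time
complement = string_dna.translate(complement_table)
# [::-1] reverses a string, giving the reverse complement
reverse_complement = complement[::-1]

print(complement)          # tacatgctgttgcggtcgtggtcctagttgccg
print(reverse_complement)  # gccgttgatcctggtgctggcgttgtcgtacat
```

These match the `complement()` and `reverse_complement()` results from the Seq object earlier. The Seq methods are still the better choice, because they also handle upper case and ambiguous bases that this small table ignores.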

---
### Exercise 4a
Assign the string `'aacgagtggagcgagcaggagaactgcgag'` to a new variable called `dna_seq`. Afterwards, convert it to a Seq object using `Seq()`. Are the classes different?

In [68]:
# assign the string 'aacgagtggagcgagcaggagaactgcgag' to dna_seq
dna_seq = ___
print(type(dna_seq)) 

# convert it to a Seq object
dna_seq = Seq(___)
print(type(dna_seq))


<class 'str'>
<class 'Bio.Seq.Seq'>


### Exercise 4b
`translate()` the Seq object `dna_seq` and store the result in a variable called `translated_seq`. Then print it.

In [None]:
___ = ___.translate()
print(___)

### Exercise 4c
Take the `reverse_complement()` of `dna_seq` and store it in `dna_seq_rc`. Then print it.



In [None]:
dna_seq_rc = ____
print(dna_seq_rc)

---
The very last thing we'll touch on in this session is adding sequences to, and removing them from, an existing sequence. We will also make sure that we are not accidentally mixing sequence types when we do this (adding a DNA sequence to an RNA sequence, for example).

In [88]:
short_seq = Seq('agccacaggaccagcgagcag')

In [89]:
longer_seq = short_seq + Seq('aacgagtgggacaacgcc')

In [92]:
longer_seq

Seq('agccacaggaccagcgagcagaacgagtgggacaacgcc')

When adding sequences together, it is important that you guard against potential errors, such as adding a DNA sequence to an RNA sequence.

How can we control for this? We will use Biopython's Alphabet functions to make sure that each Biopython Seq object has the correct sequence 'alphabet'.

In [80]:
from Bio.Alphabet import IUPAC

In [93]:
longer_seq.alphabet = IUPAC.unambiguous_dna
longer_seq

Seq('agccacaggaccagcgagcagaacgagtgggacaacgcc', IUPACUnambiguousDNA())

When we return the `longer_seq` variable, we see that it now has an additional descriptor, IUPACUnambiguousDNA.

If we explicitly state the types of sequences we are working with, Biopython will raise errors on certain operations that don't make biological sense. For example:



In [114]:
# Make a Seq object of RNA and assign the alphabet as IUPAC.unambiguous_rna
RNA_seq = Seq('aacgagugggacaacgccuuu', IUPAC.unambiguous_rna)
# Make a Seq object of amino acids and assign the alphabet as IUPAC.protein
protein_seq = Seq('PRTIENSEQ', IUPAC.protein)
# Make a Seq object of DNA and assign the alphabet as IUPAC.unambiguous_dna
DNA_seq = Seq('agccacaggaccagcgagcag', IUPAC.unambiguous_dna)
# Make a Seq object but do not define alphabet.
random_seq = Seq('asdfl;kjsadfoij')


In [111]:
# try adding RNA_seq to DNA_seq. This will raise an error.
DNA_seq+RNA_seq

TypeError: Incompatible alphabets IUPACUnambiguousDNA() and IUPACUnambiguousRNA()

In [112]:
# try adding protein_seq to RNA_seq. This will raise an error.
protein_seq+RNA_seq

TypeError: Incompatible alphabets IUPACProtein() and IUPACUnambiguousRNA()

In [113]:
# add protein to protein. This will work.
protein_seq+protein_seq

Seq('PRTIENSEQPRTIENSEQ', IUPACProtein())

In [115]:
# add random_seq to DNA_seq. This will work but is not biologically correct.
random_seq+DNA_seq

Seq('asdfl;kjsadfoijagccacaggaccagcgagcag')

We can see from the last example that when we add sequences together, we want to make sure they have the correct sequence type labels, or else we may end up adding gibberish or mixing sequence types.
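
Alphabets label a sequence, but they do not stop us from typing the wrong letters into it. A simple extra safeguard is to check the letters of a string before turning it into a Seq object. The sketch below uses plain Python only; `unexpected_letters`, `DNA_LETTERS`, and `RNA_LETTERS` are hypothetical names made up for this example and are not Biopython functions.

```python
def unexpected_letters(sequence, allowed):
    """Return the set of characters in sequence that are not in allowed.

    Hypothetical helper for illustration; it is not a Biopython function.
    """
    return set(sequence.lower()) - set(allowed.lower())

DNA_LETTERS = 'acgt'
RNA_LETTERS = 'acgu'

# a clean DNA string has no unexpected letters
print(unexpected_letters('agccacaggaccagcgagcag', DNA_LETTERS))  # set()
# an RNA string checked against DNA letters is flagged for its 'u'
print(unexpected_letters('aacgagugggacaacgccuuu', DNA_LETTERS))  # {'u'}
# gibberish is flagged too
print(sorted(unexpected_letters('asdfl;kjsadfoij', DNA_LETTERS)))
# [';', 'd', 'f', 'i', 'j', 'k', 'l', 'o', 's']
```

An empty set means every character is allowed. Because this check does not rely on `Bio.Alphabet`, it should also work with newer Biopython releases.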

---
### Exercise 5a

Convert the string `'PRTIENSEQ'` into a Seq object with the `IUPAC.protein` alphabet.

In [None]:
my_prtn_seq = ___
my_prtn_seq

### Exercise 5b
Create a new protein `Seq()` object, `my_prtn_seq2`, with the amino acids `'FGTYRL'`. Afterwards, join `my_prtn_seq2` with `my_prtn_seq`.

In [None]:
my_prtn_seq2 = Seq(___)
my_prtn_seq2 + ___ 

---

That's it for Day 1!