# Welcome to Day 1! 

## Introduction to strings, Biopython, and Biopython sequences

### Section 1: String manipulation in Python


### Section 2: Installing and importing Biopython 


### Section 3: Working with sequences using Biopython 

----
## Session summary

Our first session will move at a slow pace to make sure everyone's Jupyter notebook is working correctly and that everyone has a basic understanding of strings in Python, which is important for understanding Biopython. Afterwards, we will load Biopython and get to work modifying sequences!

----

# Section 1: String manipulation in Python

A string in Python is any sequence of characters enclosed in single quotes or double quotes.

In [None]:
# this is a string
'this is a string'

In [None]:
# this is also a string
"this is also a string"

Strings can be manipulated using basic Python syntax.

One way is to use brackets with a colon `[ : ]` to indicate the range of characters in a string that you wish to keep: `[start:stop]`. Positions are counted from 0, and the character at position `stop` is not included.

In [None]:
# [start:stop]
# remove the first four characters of the string
'this is a string'[4:]

In [None]:
# keep the first 12 characters 
'this is a string'[:12]

In [None]:
# remove the last four characters
'this is a string'[:-4]

In [None]:
# keep the 5th through 10th characters
'this is a string'[4:10]
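Two details are easy to miss in the cells above: positions are counted from 0, and the `stop` position itself is left out. Here is a minimal sketch of both rules, using a different example string (the variable names are purely illustrative):

```python
# positions:  b i o p y t h o n
# index:      0 1 2 3 4 5 6 7 8
word = 'biopython'

# a single position uses one index, counted from 0
first_char = word[0]       # 'b'

# [start:stop] keeps start up to, but not including, stop
prefix = word[:3]          # 'bio'
suffix = word[3:]          # 'python'

# negative positions count back from the end of the string
last_six = word[-6:]       # 'python'

# the number of characters kept is stop - start
middle = word[2:5]         # 'opy'

print(first_char, prefix, suffix, last_six, middle)
print(len(middle) == 5 - 2)
```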

---

### Exercise 1

Complete the code to show only characters `'string'`. 

**Hint: replace the ___ with the correct character position.**

In [None]:
'this is a string'[___:]

---

Rather than typing out the string each time, we can save it to a variable.

In [None]:
string_var = 'this is a string'

string_var

We can still manipulate the string variable using brackets, as before.

In [None]:
# [start:stop]
# remove the first four characters of the string
string_var[4:]

In [None]:
# keep the first 12 characters 
string_var[:12]

In [None]:
# remove the last four characters
string_var[:-4]

In [None]:
# keep the middle 5-10 characters
string_var[4:10]

However, `string_var` itself is not affected. It still returns `'this is a string'`.

In [None]:
string_var

We can easily change the contents of `string_var` by assigning a modified string back to it.

In [None]:
string_var = string_var[4:]
string_var
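The reason a variable only changes when we assign to it is that Python strings are immutable: slicing always builds a new string and leaves the original untouched. A short sketch of that behaviour (the names `original` and `shortened` are illustrative only):

```python
original = 'gene_name'

# slicing returns a new string; the original is left as it was
shortened = original[5:]
print(original)    # gene_name
print(shortened)   # name

# strings cannot be edited in place, so this raises a TypeError
try:
    original[0] = 'G'
except TypeError as error:
    print('TypeError:', error)

# to "change" a string, assign a new string to the same variable
original = original[:4]
print(original)    # gene
```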

Difference between `return` and `print()`

So far we have been using `return`: when a cell ends with a variable name, the notebook displays that variable's value. If a cell ends with several variable names, only the value of the last one is returned:

In [None]:
variable1 = 'hello'
variable2 = 'hi'

variable1
variable2

Only the value of `variable2` is displayed. If we use `print()`, things work a little differently:

In [None]:
variable1 = 'hello'
variable2 = 'hi'

print(variable1)
print(variable2)

Now the value of both variables is visible to us. 

As we go forward, there will be times where it is simpler to `return` a value and other times when we need multiple values to be displayed to us, in which case we will use `print()`
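One visible difference between the two: when the notebook displays the last line of a cell, a string is shown with its quotes, while `print()` shows only the characters. The built-in `repr()` gives the quoted form as text, which lets us sketch the difference in a single cell (the variable names here are illustrative):

```python
greeting = 'hello'

# the quoted form, as the notebook shows it for the last line of a cell
displayed = repr(greeting)

# print() writes the characters themselves, without quotes
print(displayed)   # 'hello'
print(greeting)    # hello

# print() can also show several values at once, separated by spaces
print(greeting, 'world')   # hello world
```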

---
### Exercise 2a
Assign a string, `'my new string'` to `string_var_2`. Next, `print()` `string_var_2`

In [None]:
___ = 'my new string'
print(string_var_2)

### Exercise 2b
Make a new variable called `string_var_3` that contains only the 4th-6th characters of `string_var_2`. `print()` `string_var_3`

In [None]:
___ = string_var_2[___:___]
print(string_var_3)

---
There are additional Python functions that return a modified copy of a string. One pair is `upper()` and `lower()`.

In [None]:
string_var_4 = 'I_Am_A_beAuTifUl_sTrIng'
# make all characters upper case
print(string_var_4.upper())
# make all lower case
print(string_var_4.lower())

The `replace()` function is also very useful. It replaces every occurrence of a character or string with another character or string of interest, or even with an empty string `''`.

In [None]:
# replace beAuTifUl with an empty string (effectively deleting it)
string_var_4.replace('beAuTifUl','')

In [None]:
# replace beAuTifUl with a new string
string_var_4.replace('beAuTifUl','GREAT')

In [None]:
# replace all underscores with a space
string_var_4.replace('_',' ')
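Because each of these functions returns a new string, they can be chained one after another, and `replace()` accepts an optional third argument that limits how many replacements are made. A small sketch with a made-up label:

```python
label = 'Sample_01_Raw'

# string functions return a new string, so they can be chained
tidy = label.lower().replace('_', ' ')
print(tidy)            # sample 01 raw

# an optional third argument limits how many replacements are made
first_only = label.replace('_', ' ', 1)
print(first_only)      # Sample 01_Raw

# the original string is unchanged by any of this
print(label)           # Sample_01_Raw
```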

You can also add a string to another string using the + operator. You can create new variables from this too.

In [None]:
# add exclamation marks
string_var_4+'!!!!!'

In [None]:
# add question marks and save to new variable 
string_var_4_modified = string_var_4+'???'
string_var_4_modified
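Two things to keep in mind when joining with `+`: both sides must be strings, and no space is inserted for you. A brief sketch (the variable names are illustrative):

```python
exon_number = 3

# both sides of + must be strings, so numbers are converted with str()
label = 'exon' + str(exon_number)
print(label)                 # exon3

# + does not add spaces; include them yourself if you need them
sentence = 'exon' + ' ' + str(exon_number)
print(sentence)              # exon 3

# mixing a string and a number directly raises a TypeError
try:
    'exon' + exon_number
except TypeError as error:
    print('TypeError:', error)
```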

---
### Exercise 3a

Assign `string_var_4` to a new variable called `string_var_5`. `print()` `string_var_5`

In [None]:
___ = string_var_4
print(___)

### Exercise 3b
Add the string `'i think'` to `string_var_5`. Then `replace()` all underscores with hyphens (-). Finally, `print()` `string_var_5`!

In [None]:
string_var_5 = string_var_5+___

string_var_5 = ___.replace(___, ___)

print(____)

Now that we have covered strings, it is time to move on to Biopython.

# Section 2: Installing and importing Biopython

Installing the Biopython package is simple:

In [None]:
pip install Biopython

If you already have Biopython installed, we need to make sure that it is version 1.78 or greater.

In [None]:
pip install Biopython --upgrade

`import` the Biopython package to make sure that it installed successfully. Note that the package's name is `Bio`

In [None]:
# When you run this cell, there should be no errors
import Bio

Check the version of Biopython. It should be `1.78` or greater (for example, `1.79`).

In [None]:
Bio.__version__
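The version is stored as a string, so comparing it directly with another string can mislead (as text, `'1.100'` sorts before `'1.78'`). Here is a minimal sketch of a numeric check, assuming the version has the form `major.minor`; the helper `version_at_least` is hypothetical and not part of Biopython:

```python
def version_at_least(version_string, minimum=(1, 78)):
    """Return True if the major.minor part of version_string is >= minimum."""
    major, minor = version_string.split('.')[:2]
    return (int(major), int(minor)) >= minimum

# in the notebook you would pass Bio.__version__ instead of a literal string
print(version_at_least('1.79'))   # True
print(version_at_least('1.77'))   # False
```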

That's it for installing Biopython!

# Section 3: Working with sequences using Biopython

Now that we have reviewed the basics of strings in Python we are ready to adapt them to biological sequences and learn how to work with them using Biopython.

Let's make a short DNA sequence and check what class of data Python sees it as.

In [None]:
# make string
short_dna = 'atgtacgacaacgccagcaccaggatcaacggc'
short_dna

In [None]:
# check what class of variable
type(short_dna)


We will start by importing the **Seq** class from Biopython's `Bio.Seq` module (Seq is shorthand for Sequence) and comparing it to a basic Python string.

The Seq class is designed to work with sequences. We will convert `short_dna` to a Seq **object**.

In [None]:
# import the Seq class from the Bio.Seq module
from Bio.Seq import Seq

In [None]:
# convert short_dna to a Seq object
short_dna = Seq(short_dna)
short_dna

In [None]:
# the Biopython Seq class
type(short_dna)

### What does it mean for short_dna to be a Seq object? 

When a variable like `short_dna` is converted to a Seq object, Biopython's sequence functions, like `translate()` or `reverse_complement()`, become available on it. Note that from Biopython 1.78 onwards a Seq object does not record whether it holds a DNA, RNA, or protein sequence. It is up to you to keep track of that and to use only the functions that make sense for your sequence.

Here are some useful functions that can be applied to a DNA Seq object:

In [None]:
# make a Seq object from a string of DNA characters
seq_object_dna = Seq('atgtacgacaacgccagcaccaggatcaacggc')

In [None]:
seq_object_dna

In [None]:
# translate
seq_object_dna.translate()
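To see what translation does conceptually, here is a plain-Python sketch that reads a DNA string three letters at a time and looks each codon up in the standard genetic code. This is an illustration only, not how Biopython implements `translate()`; the names `CODON_TABLE` and `translate_sketch` are made up for this example, and it handles only the letters a, c, g and t.

```python
# standard genetic code, with codons ordered by base in the order T, C, A, G
BASES = 'TCAG'
AMINO_ACIDS = ('FFLLSSSSYY**CC*W' 'LLLLPPPPHHQQRRRR'
               'IIIMTTTTNNKKSSRR' 'VVVVAAAADDEEGGGG')

# build a lookup from each three-letter codon to its amino acid
CODON_TABLE = {}
position = 0
for first in BASES:
    for second in BASES:
        for third in BASES:
            CODON_TABLE[first + second + third] = AMINO_ACIDS[position]
            position += 1

def translate_sketch(dna):
    """Translate a DNA string codon by codon (illustration only)."""
    dna = dna.upper()
    protein = ''
    # step through the sequence three letters at a time,
    # ignoring any incomplete codon at the end
    for start in range(0, len(dna) - 2, 3):
        protein += CODON_TABLE[dna[start:start + 3]]
    return protein

print(translate_sketch('atgtacgacaacgccagcaccaggatcaacggc'))   # MYDNASTRING
```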

In [None]:
# transcribe
seq_object_dna.transcribe()

In [None]:
# reverse complement
seq_object_dna.reverse_complement()

In [None]:
# complement
seq_object_dna.complement()
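The idea behind `complement()` and `reverse_complement()` can be sketched with ordinary strings: swap each base for its partner, then (for the reverse complement) read the result backwards. The function names below are hypothetical, and the sketch handles only the letters a, c, g and t; it is not Biopython's own code.

```python
# map each base to its partner; both cases are covered
COMPLEMENT = str.maketrans('acgtACGT', 'tgcaTGCA')

def complement_sketch(dna):
    """Swap every base for its complement (illustration only)."""
    return dna.translate(COMPLEMENT)

def reverse_complement_sketch(dna):
    """Complement the sequence, then read it backwards."""
    # [::-1] is a slice with a step of -1, which reverses the string
    return complement_sketch(dna)[::-1]

print(complement_sketch('atgtac'))           # tacatg
print(reverse_complement_sketch('atgtac'))   # gtacat
```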

In [None]:
# convert sequence to upper case
seq_object_dna.upper()

In [None]:
# convert sequence to lower case
seq_object_dna.lower()

We can check that `seq_object_dna` is a Biopython Seq object by entering the variable name and running the cell. A Seq object will have `Seq()` returned as part of the value.


In [None]:
seq_object_dna

What if our sequence were not a Seq object, but instead a normal Python string? For example, if we try to translate a string, we will get an error. Python strings do have a function called `translate()`, but it is unrelated to Biopython's: it swaps characters using a lookup table and cannot be called without one. A plain string like `string_dna` has none of Biopython's sequence functions.

In [None]:
# same dna as above but not a seq object
string_dna = 'atgtacgacaacgccagcaccaggatcaacggc'
# trying to use translate
string_dna.translate()
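To make the difference concrete, here is a short sketch of what a plain string's `translate()` actually does, using the built-in `str.maketrans()` to build the lookup table it expects:

```python
string_dna = 'atgtac'

# str.translate() needs a lookup table, so calling it with no argument fails
try:
    string_dna.translate()
except TypeError as error:
    print('TypeError:', error)

# with a table, it swaps characters one for one; it knows nothing about codons
table = str.maketrans('t', 'u')
print(string_dna.translate(table))   # auguac

# a plain string has none of the sequence-specific functions
print(hasattr(string_dna, 'reverse_complement'))   # False
```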

---
### Exercise 4a
Assign the string `'aacgagtggagcgagcaggagaactgcgag'` to a new variable called `dna_seq` and `print()` the `type()` of `dna_seq`.

Afterwards, convert `dna_seq` to a Seq object using `Seq()` and `print()` the `type()` of `dna_seq` again.

Are the classes different?

In [None]:
# assign the string 'aacgagtggagcgagcaggagaactgcgag' to dna_seq
dna_seq = ___
print(type(dna_seq)) 

# convert the string 'aacgagtggagcgagcaggagaactgcgag' to a Seq object and assign to dna_seq
dna_seq = Seq(___)
print(type(dna_seq))


### Exercise 4b
`translate()` the Seq object `dna_seq` and store the result in a variable called `translated_seq`. Then `print()` it.

In [None]:
___ = ___.translate()
print(___)

### Exercise 4c
Take the `reverse_complement()` of `dna_seq` and store it in `dna_seq_rc`. `print()` `dna_seq_rc`



In [None]:
dna_seq_rc = ___.___
print(dna_seq_rc)

---
We can add sequences to an existing sequence, and we can select parts of it.

In [None]:
short_seq = Seq('agccacaggaccagcgagcag')

In [None]:
longer_seq = short_seq + Seq('aacgagtgggacaacgcc')

In [None]:
longer_seq

What if we want to select only some parts of the sequence?

We can work with the `Seq` object as though it were a string.

In [None]:
# select first 10 characters in longer_seq
longer_seq[:10]

In [None]:
# select 6th-10th characters in longer_seq
longer_seq[5:10]

In [None]:
# store the selected results in a variable called subset_seq
subset_seq = longer_seq[5:10]
subset_seq
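It can help to know what comes back from each of these operations. The sketch below assumes Biopython 1.78 or greater is installed, as set up in Section 2, and uses a made-up sequence: a slice of a Seq object is another Seq object, while `str()` gives back a plain Python string.

```python
from Bio.Seq import Seq

example_seq = Seq('agccacagg')

# a slice of a Seq object is itself a Seq object
fragment = example_seq[3:6]
print(type(fragment))
print(fragment)              # cac

# a single position gives back an ordinary one-character string
print(example_seq[0])        # a

# len() works as it does for strings
print(len(example_seq))      # 9

# str() converts a Seq object back to a plain Python string
plain = str(fragment)
print(type(plain))
```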

---
### Exercise 5a

Make a `Seq()` object with the sequence `'PRTIENSEQ'` and assign it to the variable `my_prtn_seq1`

In [None]:
___ = ___('___')

my_prtn_seq1

### Exercise 5b
Create a new protein `Seq()` object, `my_prtn_seq2` with the sequence `'HISTAG'`.

Afterwards, add (`+`) the sequence from `my_prtn_seq2` to `my_prtn_seq1`

In [None]:
my_prtn_seq2 = Seq('___')
___ + my_prtn_seq1 

### Exercise 5c

Make a new variable called `my_prtn_seq3` that contains only the second-fifth amino acids from `my_prtn_seq1`

In [None]:
my_prtn_seq3 = ___[1:___]
my_prtn_seq3

### Exercise 5d

Add (`+`) `my_prtn_seq2` to `my_prtn_seq3` and store the result in `my_prtn_seq4`

In [None]:
___ = ___ + my_prtn_seq3

my_prtn_seq4

---

That's it for Day 1!