<a href="https://colab.research.google.com/github/agfoote/Python4Biologists/blob/main/P34B_Day1_BasicDataTypes.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Practical Python Programming for Biologists
Author: Dr. Daniel Pass | www.CompassBioinformatics.co.uk

---
# General Python Concepts

We want to get on to some cool biology but first we need to learn a few basic things about how python works. Let's go!

## Basic Data types in Python

All data in python fall into a small number of types (e.g. strings, numbers, lists, and dictionaries), and we store data in named variables. The type of a value changes how it can be used in your code, so it's important to know which one to use.

We use the equals ```=``` symbol to assign data to a variable.

### Strings
A string is the most straightforward data type: a sequence of characters which will be interpreted as text.

We assign it with an equals character, with the information within either single or double quotes ```" "```/```' '``` .

We can use the ```type()``` function to tell us what type of data it is:

In [None]:
my_gene = "TGCATGCATGCTAGCGTAC"
type(my_gene)

str

In [None]:
my_abundance_value = "742.37 od"
type(my_abundance_value)

str

Note that even plain numbers are a string if they're put inside of quote characters!

In [None]:
my_other_value = "9231414"
type(my_other_value)

str

---
### Numbers

To create a number variable, simply do **NOT** put it in quotes and python will accept it as a number.

There are two number formats: **integers** (whole numbers) and **floats** (decimal or floating-point numbers).

In [None]:
read_count = 400500
type(read_count)

int

In [None]:
frequencyObserved = 0.8231
type(frequencyObserved)

float


You have to be especially careful when converting number values, as treating a float (decimal number) as an integer can silently change your results.

```int()``` doesn't round the value, it just discards everything after the decimal point!

In [None]:
print("my frequency is:", frequencyObserved)
print("my frequency as an integer is:", int(frequencyObserved))

my frequency is: 0.8231
my frequency as an integer is: 0
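
If you want the nearest whole number rather than truncation, the built-in ```round()``` is one option. This is a minimal sketch comparing the two; the variable names are only for illustration:

```python
frequency_observed = 0.8231

truncated = int(frequency_observed)   # discards everything after the decimal point
rounded = round(frequency_observed)   # rounds to the nearest whole number

print("int() gives:  ", truncated)
print("round() gives:", rounded)

# int() truncates towards zero, so negative numbers move up, not down
negative_truncated = int(-2.7)
print("int(-2.7) gives:", negative_truncated)
```

Here ```int()``` gives 0 while ```round()``` gives 1, so which one you choose can change a downstream result.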


See this example of how storing the number as an integer affects the calculation:

In [None]:
average_leaf_count = 425.35
average_leaf_count_int = int(425.35)

print("Leaves in 1000 forests is:  ", average_leaf_count * 1000)
print("Leaves in 1000 int forests: ", average_leaf_count_int * 1000)

Leaves in 1000 forests is:   425350.0
Leaves in 1000 int forests:  425000


It is really important to be aware of what format your data is in especially when combining data. If they are different types it could go very wrong!

In the old days (python 2) dividing two integers would return an ```int``` by default! Just think of all the unexpected errors!

In python 3, ```/``` always outputs a float, but there can still be some unexpected rounding in the last decimal places because of complicated computer science "*reasons*":

In [None]:
100/3

33.333333333333336

More information on how floating-point numbers become strange [can be found here](https://docs.python.org/3/tutorial/floatingpoint.html)
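
To make the rounding issue concrete, here is a small sketch using only built-in python and the standard ```math``` module. It shows a classic floating-point surprise, one hedged way to compare floats safely, and the difference between ```/``` and ```//``` division:

```python
import math

total = 0.1 + 0.2
print(total)                      # not exactly 0.3
print(total == 0.3)               # an exact comparison fails
print(math.isclose(total, 0.3))   # a tolerance-based comparison succeeds

true_division = 100 / 3    # always a float in python 3
floor_division = 100 // 3  # whole-number result when both inputs are integers
print(true_division, type(true_division))
print(floor_division, type(floor_division))
```

Using ```math.isclose()``` is just one possible approach; rounding both values before comparing them is another.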

---
### Booleans (True/False)
Another important data type is the Boolean, which holds a True/False value. This can be useful for storing information, or for deciding whether to perform a process.

Note the capital letter at the start of ```True``` and ```False```.

In [None]:
# Was the site sampled?
WT01 = True
WT02 = True
WT03 = False

type(WT01)

bool

Usefully, printing a boolean variable displays it as the text True or False:

In [None]:
print(WT01, WT02, WT03)

True True False
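
Booleans can also decide whether a step runs at all. The sketch below uses an ```if``` statement, which this section has not introduced, so treat it as a preview rather than something you need to know yet. The ```sampled_sites``` counter is a made-up name for illustration:

```python
# Was the site sampled?
WT01 = True
WT03 = False

sampled_sites = 0

# The indented line only runs when the boolean is True
if WT01:
    sampled_sites = sampled_sites + 1

if WT03:
    sampled_sites = sampled_sites + 1

print("Sites sampled:", sampled_sites)
```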


If we run a line of code that makes a comparison, it will return a boolean: whether the comparison is true or false. The comparison could be mathematical, string-based, or one of many other kinds we will see.

Note:
*   one equals symbol ( ```=``` ) assigns a value
*   two equals symbols ( ```==``` ) TEST whether the two sides are equal.

So this line is TESTING if two plus two equals four:

In [None]:
2 + 2 == 4

True

In [None]:
my_species = "Xenopus"

"Drosophila" == my_species

False
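
Beyond ```==```, python has several other comparison operators that also return booleans. This is a brief sketch of a few of them, reusing values from earlier in this section; the result variable names are only for illustration:

```python
my_species = "Xenopus"
read_count = 400500

is_different = my_species != "Drosophila"   # not equal to
is_deep = read_count > 100000               # greater than
has_prefix = "Xen" in my_species            # is this text found inside the string?
same_lowercase = my_species == "xenopus"    # comparisons are case-sensitive

print(is_different, is_deep, has_prefix, same_lowercase)
```

The last comparison is False because an uppercase "X" and a lowercase "x" are different characters.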

---
### Combining variables

We can do mathematical operations as you would expect with ```+ - * /``` (more on maths later in the course), but we have to be careful when combining variables!

In [None]:
samples = 15
sites = "four"
print(samples + 5)
print(sites + 3)

20
TypeError: can only concatenate str (not "int") to str


Here we have three variables. A string, an integer, and a float. Or do we?

**Exercise:** Before running the code, think about what you would expect the outputs to be. And then run the codeblock and work out what is happening.

In [None]:
my_gene = "TGCATGCATGCTAGCGTAC"
read_count = 400500
frequency_observed = "0.4231"

print(my_gene * 4)
print(read_count * 4)
print(frequency_observed * 4)

TGCATGCATGCTAGCGTACTGCATGCATGCTAGCGTACTGCATGCATGCTAGCGTACTGCATGCATGCTAGCGTAC
1602000
0.42310.42310.42310.4231


Can you see what went wrong there? Because the float value was inside " " quotes, it was interpreted as a string, so it was repeated four times rather than used as a number like the **read_count** variable above it.

Going the other way, to join a number onto the end of a string, convert it to a string first with ```str()```:

In [None]:
samples = 15
sites = "four"
print(samples + 5)
print(sites + str(3)) # Convert 3 to a string using str()

20
four3
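
To treat the quoted ```frequency_observed``` value as a number, one option is to convert it with the built-in ```float()``` function. This is a minimal sketch; ```frequency_as_number``` is a made-up name for illustration:

```python
frequency_observed = "0.4231"   # a string, because of the quotes

frequency_as_number = float(frequency_observed)   # convert the text to a float

print(type(frequency_observed))
print(type(frequency_as_number))
print(frequency_as_number * 4)   # now this is real multiplication
```

Note that ```float()``` raises an error if the text is not a valid number, such as "742.37 od".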

> In each of these examples we're using the variable format to assign data or values to a label or "variable name". It means we can refer back to them later or multiple times through our code. Variables don't automatically get displayed on the screen or saved to a file which is why we use the ```print()``` command whenever we want to see a variable's value. It also means we can reuse variables multiple times in our code.

**Exercise:** Let's play with strings and convert an example gene sequence into something a bit more gene-like by adding a start codon, a polyA tail, and repeating it 4 times for good luck!

Take a few moments to try and interpret what the code is going to do before running it.

In [None]:
# Defining some string variables
my_sequence = "TGCATGCATGCTAGCGTAC"
start_codon = "ATG"
stop_codon = "GGC"
polyA = "AAAAAAAAAA"

# My code
newGene = start_codon + my_sequence + stop_codon + polyA * 4
print("My new gene is:", newGene)

My new gene is: ATGTGCATGCATGCTAGCGTACGGCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA


We introduced a few things there. Firstly, you'll see the lines that begin with a # character (called a hash, a pound, or a hashtag in the twitter era!). We use them for putting comments in our code to explain it, or to help it make sense when we return to it later. Comments are super important!

Secondly, we define our variables before using them. In this short sequence we could have just typed them straight in, but imagine you had code that was hundreds of lines long and you typed the stop codon every time. Then you realise you used a stop codon for bacteria, not eukaryotes! This way you only need to change one line and it will correct your whole script.

The actual code line does what is called string concatenation - basically just joining strings together.

---
Let's make our code a bit more advanced by repeating our sequence 4 times numerically, and let's also change the stop codon like we mentioned. All the other variables stay the same, but we have now written over the ```stop_codon``` variable.

In [None]:
# New variables
stop_codon = "TAG"

# My code
newerGene = start_codon + my_sequence * 4 + stop_codon + polyA
print("My newer gene is:", newerGene)

My newer gene is: ATGTGCATGCATGCTAGCGTACTGCATGCATGCTAGCGTACTGCATGCATGCTAGCGTACTGCATGCATGCTAGCGTACTAGAAAAAAAAAA


Here we're multiplying a string. This is a very python way to work: other languages may require you to declare the types, but python works them out for you.

**Exercise:** In the above code make a variable named "repeats" and assign a numerical value, and replace the 4 in the code with the new variable.

**Extension Exercise:** There is a function named ```len()``` which calculates the length of a sequence. Try using that to print the length of the "newerGene" string. If you get an error be sure to read it as it should help solve the problem!

## A quick look at print()

We have used ```print()``` many times already, and it will be one of your most used functions. The main rule is to put everything within the brackets ```(  )```. Most of the time it is very intuitive, and it is very useful that you can refer to variables without needing extra characters (some languages such as Perl or Bash require ```$``` to refer to variables). Python just knows!

But there are a few interesting cases to make sure you're aware of, for example using commas (```,```) and plusses (```+```):

In [None]:
# variables
fav_text = "My favourite animal is"
animal_choice = "Platypus"
rating = 8

A simple print command using commas:

In [None]:
print(fav_text, animal_choice, "! I give it", rating, "/10")

My favourite animal is Platypus ! I give it 8 /10


Hmmmm. That works, but it's a little annoying with the whitespace. Using commas in a print command puts a space between each item, but sometimes that's not what we want.

Using plusses we can skip the whitespace where we want to:

In [None]:
print(fav_text, animal_choice + "! I give it", rating + "/10")

TypeError: unsupported operand type(s) for +: 'int' and 'str'

Oh no! It gave an error! That's because python thinks you are now trying to do maths on an integer and a string together. We can fix that by telling it to turn the rating number into a string:

In [None]:
print(fav_text, animal_choice + "! I give it", str(rating) + "/10")

My favourite animal is Platypus! I give it 8/10


Much better! We'll look at some more advanced ways to use print later in the course (f-strings can be really useful) but this should be good for now.
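
As a brief preview, one of those more advanced options is the f-string (formatted string literal), which assumes python 3.6 or newer. Putting an ```f``` before the opening quote lets you place variables inside ```{ }``` and control the spacing yourself. This sketch reuses the variables from above; ```message``` is a made-up name for illustration:

```python
fav_text = "My favourite animal is"
animal_choice = "Platypus"
rating = 8

# Anything inside { } is replaced by the variable's value, numbers included
message = f"{fav_text} {animal_choice}! I give it {rating}/10"
print(message)
```

There is no need for ```str()``` here because the f-string converts the integer for you.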

---
## Errors & Help

Coding is always going to go wrong at some point. Python is quite helpful in the error messages it gives (usually!). Let's see what happens and how to read the error messages.

Here we have a similar program to the one above, but I've made some mistakes for you to find and fix.

### Exercise - BugFix

Run the next broken code and read the error messages. Try to correct the errors without looking back at the example above.

In [None]:
# Defining some variables
gene_part1 = "TGCATGCATGCTAGCGTAC"
gene_part2 = "GGGTCGATAGCGCGTATAATGC"
repeat_element = "CAG"
insertion = "-"

# My code - Join the parts, with an x8 CAG repeat and insertion.
newGene = gene_part1 + (insertion + repeat_element) * 8 + "-" + gene_part2
print("My new gene is:", newGene)
print("It is", len(newGene), "bases long")

My new gene is: TGCATGCATGCTAGCGTAC-CAG-CAG-CAG-CAG-CAG-CAG-CAG-CAG-GGGTCGATAGCGCGTATAATGC
It is 74 bases long


Sometimes you'll get an error message, but other times your code will run to completion and the output will still be incorrect, so it's important to check and test your outputs to be certain they are as expected!
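
One hedged way to check an output is to work out what you expect independently and compare it with what the code produced. The sketch below does this for the gene built above; ```repeats``` and ```expected_length``` are made-up names for illustration:

```python
gene_part1 = "TGCATGCATGCTAGCGTAC"
gene_part2 = "GGGTCGATAGCGCGTATAATGC"
repeat_element = "CAG"
insertion = "-"
repeats = 8

newGene = gene_part1 + (insertion + repeat_element) * repeats + "-" + gene_part2

# Work out the expected length from the parts, then compare with the result
expected_length = (len(gene_part1) + len(gene_part2)
                   + repeats * (len(insertion) + len(repeat_element)) + 1)
print("Length as expected?", len(newGene) == expected_length)

# Check that the repeat element appears the intended number of times
print("Repeat count as expected?", newGene.count(repeat_element) == repeats)
```

Both checks print True here. If either printed False, it would be a sign that the code ran but did not do what was intended.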

There are so many methods and so much syntax in python that it's good to consult the manual or find help. Google is always useful, but the results can come from different python versions, so it is often easiest to open the help within your own python environment, which is guaranteed to match the version you're working with:

In [None]:
help(print)

Help on built-in function print in module builtins:

print(*args, sep=' ', end='\n', file=None, flush=False)
    Prints the values to a stream, or to sys.stdout by default.
    
    sep
      string inserted between values, default a space.
    end
      string appended after the last value, default a newline.
    file
      a file-like object (stream); defaults to the current sys.stdout.
    flush
      whether to forcibly flush the stream.



---

## Exercise - Data Types
Let's imagine we have written a program that outputs a range of variables as the set of results. Here, try to create a useful output in normal text that could be easily understood by a non-coding biologist, using as many of the variables as possible.

**Exercise:** Use ```print()``` to make a summary that could be the first line of an RNAseq results paper, e.g. "In this study we had....."

**Extension Exercise:** Include a calculation of the percentage of Up/Down regulated genes

In [None]:
# Number of upregulated genes
up_genes = 8423

# Number of downregulated genes
down_genes = 7239

# Total genes in reference database
tot_genes = 23000

# Significance value for t.test
sig_thresh = 0.05

# False Discovery Rate correction was performed?
FDR = True

# Author of analysis
auth = "~Foote et. al."


In [None]:
# Write your output code here
print("In this study we had", up_genes, "upregulated genes and", down_genes, "downregulated genes", "out of", tot_genes, "total genes", "with a threshold cutoff of", sig_thresh, auth)
print("In our study by", auth, "we reported", up_genes, "upregulated genes, and", down_genes, "downregulated genes.")
percentage_value = (up_genes + down_genes) / tot_genes * 100
print("We reported", str(round(percentage_value, 4)) + "% significantly up/down regulated genes.")

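print("In this study we had", up_genes, "upregulated and", str(down_genes) + " downregulated genes. The percentage of upregulated genes is", str(round(up_genes/tot_genes * 100, 2)) + "%.")

In this study we had 8423 upregulated genes and 7239 downregulated genes out of 23000 total genes with a threshold cutoff of 0.05 ~Foote et. al.
In our study by ~Foote et. al. we reported 8423 upregulated genes, and 7239 downregulated genes.
We reported 68.0957% significantly up/down regulated genes.
In this study we had 8423 upregulated and 7239 downregulated genes. The percentage of upregulated genes is 36.62%.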

# Methods & Data Manipulation
There are many useful functions and methods built in to python. Functions stand alone, while methods belong to a particular type of object, such as a string.

We've already seen ```len()``` to output the length of an object. Two other useful numerical ones are ```.count()``` and ```round()```.

```.count()``` is a string method and ```round()``` is a function, which affects how they are used in code, but let's not worry about the background for now. We'll look in more detail at methods later, but here is a quick demonstration of how they are used, as we will rely on them a lot:

In [None]:
myAA = "MVKLRYFMVKLRYFHPCQDEGANISTWHPCQDEGANISTISTW"
print("My amino acid sequence is:", myAA)

My amino acid sequence is: MVKLRYFMVKLRYFHPCQDEGANISTWHPCQDEGANISTISTW


The ```.count()``` method will output the number of times a character is in your string. The value inside the ```( )``` is what to look for, and the method acts ON the string it is attached to.

In [None]:
# Count the number of Ks (lysine residues) in your sequence
countK = myAA.count("K")
print("There are", countK ,"Lysine residues in my protein")

K_percent = countK / len(myAA) * 100
print("That is", K_percent, "% of the protein")

There are 2 Lysine residues in my protein
That is 4.651162790697675 % of the protein


**Exercise:** Calculate the percentage that are Lysine (K) residues using the ```len()``` function.

And we can use the ```round()``` function to tidy up the number:

In [None]:
# Rounded
K_percent_rounded = round(K_percent, 2)
print("That is", K_percent_rounded, "% of the protein")

That is 4.65 % of the protein


Note how in these two processes the keyword either comes before your variable, which goes inside the brackets ```( )``` (this is performing a *function* on your variable), or it comes after your variable following a dot/stop/period ``` . ``` (here it is a *method* acting on the variable it is attached to).

Often they work in much the same way and you just need to remember which form to use, but we will see what is going on in the background later in the course.
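
To see the two forms side by side, here is a short sketch using the amino acid sequence from above. ```len()``` is a function, while ```.count()```, ```.lower()``` and ```.startswith()``` are all built-in string methods; the result variable names are made up for illustration:

```python
myAA = "MVKLRYFMVKLRYFHPCQDEGANISTWHPCQDEGANISTISTW"

# Function: the keyword comes first and the variable goes inside the brackets
length = len(myAA)

# Methods: the keyword comes after the variable and a dot
tryptophan_count = myAA.count("W")
lowercase_AA = myAA.lower()
starts_with_met = myAA.startswith("M")

print(length, tryptophan_count, starts_with_met)
print(lowercase_AA)
print(myAA)   # the original string is unchanged by any of the methods
```

String methods return a new value rather than changing the original, which is why ```myAA``` still prints in uppercase at the end.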



---


# Exercises - Data Manipulation

We've seen various ways of changing the raw data, so now you will try to use them. Don't forget to use the ```help()``` function for a quick reminder. In Colab, hovering the mouse over a function will also show the same help text.

**1 - Write some code to calculate the GC% of the sequence and print the result** (the count method above will help!)

**2 - Output your result as a float number. Then to 2 decimal places.**

**3 - Include your key information and outputs as a sentence combining the numbers and text**

I have put in the first lines for you, which read in a block of text and remove the newlines. We'll look at better ways of reading data into our scripts in a later session.

In [None]:
#>U96639.2:5349-6893 Canis familiaris mitochondrion, cytochrome c oxidase subunit I
gene_id = "C. familiaris COI"
input = """ATGTTCATTAACCGATGACTGTTCTCCACTAATCACAAGGATATTGGTACTTTATACTTACTATTTGGAG
CATGAGCCGGTATAGTAGGCACTGCTTTGAGCCTCCTCATCCGAGCCGAACTAGGTCAGCCCGGTACTTT
ACTAGGTGACGATCAAATTTATAATGTCATCGTAACCGCCCATGCTTTCGTAATAATCTTCTTCATAGTC
ATGCCCATCATAATTGGGGGCTTTGGAAACTGACTAGTGCCGTTAATAATTGGTGCTCCGGACATGGCAT
TCCCCCGAATAAATAACATGAGCTTCTGACTCCTTCCTCCATCCCGCCTTCTACTATTAGCATCTTCTAT
GGTAGAAGCAGGTGCAGGAACGGGATGAACCGTATACCCCCCACTGGCTGGCAATCTGGCCCATGCAGGA
GCATCCGTTGACCTTAGCGCGCGCGCGCCACACTTAGCCGGAGTCTCTTCTATTTTAGGGGCAATTAATT
TCATCACTACTATTATCAACATAAAACCCCCTGCAATATCCCAGTATCAAACTCCCCTGTTTGTATGATC
AGTACTAATTACAGCAGTTCTACTCTTACTATCCCTGCCTGTACTGGCTGCTGGAATTACAATACTTTTA
ACAGACCGGAATCTTAATACAACACGCGCGGATCCCGCTGGAGGAGGAGACCCTATCCTATATCAACACC
TATTCTGATTCTTCGGACATCCTGAAGTTTACATTCTTATCCTGCCCGGATTCGGAATAATTTCTCACAT
CGCGCGCGCGCGCGAGCGAGCGCGCAGCGGGCGCATGCCAACCACGGCATCGCGCGCGACGCGCCCCCCG"""

gene_sequence = input.replace("\n","") # This removes the newlines, which are stored as characters, and so provides the sequence as one long string

In [None]:
# Write your code here
print(gene_sequence)

GC_percent = (gene_sequence.count("G") + gene_sequence.count("C")) / len(gene_sequence) * 100
GC_percent_rounded = round(GC_percent, 2)

print("The CG content of C. familiaris COI is", GC_percent, "% of the sequence.")
print("The CG content of C. familiaris COI is", GC_percent_rounded, "% of the sequence.")

# Write your code here again in multiple steps
G_count = gene_sequence.count("G")
C_count = gene_sequence.count("C")

GC_count = G_count + C_count
GC_perc = GC_count / len(gene_sequence) * 100

print(GC_perc)
print("GC%:", round(GC_perc, 2))

ATGTTCATTAACCGATGACTGTTCTCCACTAATCACAAGGATATTGGTACTTTATACTTACTATTTGGAGCATGAGCCGGTATAGTAGGCACTGCTTTGAGCCTCCTCATCCGAGCCGAACTAGGTCAGCCCGGTACTTTACTAGGTGACGATCAAATTTATAATGTCATCGTAACCGCCCATGCTTTCGTAATAATCTTCTTCATAGTCATGCCCATCATAATTGGGGGCTTTGGAAACTGACTAGTGCCGTTAATAATTGGTGCTCCGGACATGGCATTCCCCCGAATAAATAACATGAGCTTCTGACTCCTTCCTCCATCCCGCCTTCTACTATTAGCATCTTCTATGGTAGAAGCAGGTGCAGGAACGGGATGAACCGTATACCCCCCACTGGCTGGCAATCTGGCCCATGCAGGAGCATCCGTTGACCTTAGCGCGCGCGCGCCACACTTAGCCGGAGTCTCTTCTATTTTAGGGGCAATTAATTTCATCACTACTATTATCAACATAAAACCCCCTGCAATATCCCAGTATCAAACTCCCCTGTTTGTATGATCAGTACTAATTACAGCAGTTCTACTCTTACTATCCCTGCCTGTACTGGCTGCTGGAATTACAATACTTTTAACAGACCGGAATCTTAATACAACACGCGCGGATCCCGCTGGAGGAGGAGACCCTATCCTATATCAACACCTATTCTGATTCTTCGGACATCCTGAAGTTTACATTCTTATCCTGCCCGGATTCGGAATAATTTCTCACATCGCGCGCGCGCGCGAGCGAGCGCGCAGCGGGCGCATGCCAACCACGGCATCGCGCGCGACGCGCCCCCCG
The CG content of C. familiaris COI is 48.69047619047619 % of the sequence.
The CG content of C. familiaris COI is 48.69 % of the sequence.
48.69047619047619
G