# Intermediate Python Skills

MiCM Workshop - October 31, 2023

Benjamin Rudski, PhD Student, Quantitative Life Sciences, McGill University

Dear `Reader | Workshop Attendee`,
Welcome! In this interactive Jupyter notebook, we will explore intermediate-level skills in the Python programming language. We'll cover material ranging from functions and basic classes to writing scripts to run on the command line. This workshop assumes that you have a basic knowledge of Python. If you don't, feel free to check out some beginner resources. In a shameless plug, you may find my [Intro to Python](https://github.com/bzrudski/micm_intro_to_python_summer_2023) workshop helpful.

This workshop will involve working through a single project, which we will progressively build. The final product, as well as a filled-in notebook, can be found in the [solutions](../solutions/) folder (to be uploaded at a later date). I recommend trying the exercises yourself before looking at the solutions. There's often more than one way to accomplish a task, so it's better that you figure out the intuition for yourself. Who knows... your answer may actually be better than the one I've provided!

Here's the outline of this workshop:

1.	Module 0: Introduction and Problem Scenario (10 min)
    1.	Introducing DNA sequence processing
    2.	Why do we need scripts?
2.	Module 1: Functions (45 minutes)
    1.	Writing custom functions
        1.	Function parameters
        2.	Function return values
        3.	Function documentation and type hints
    2.	Hands-on activity: Write protein expression functions.
3.	Module 2: Classes and Object-Oriented Programming (1 hour)
    1.	Introduction to classes and objects
        1.	What are objects?
        2.	What are classes? 
        3.	What's the difference????
    2.	Writing classes – Gaining a sense of self
        1.	Attributes and initialization
        2.	Methods
        3.	Documentation
        4.	Hands-on activity: Writing classes to represent biological sequences.
    3.	Data classes
        1.	What are data classes?
        2.	Using data classes
        3.	Hands-on activity: Updating our biological sequence classes.
4.	Module 3: Packages (45 minutes)
    1.	Installing packages
        1.	Finding packages online (GitHub, PyPI)
        2.	Installing packages using pip
        3.	Installing packages using conda
    2.	Using packages
        1.	Importing packages
        2.	Using NumPy (includes reading documentation)
    3.	Hands-on activity: Use NumPy to analyse the properties of multiple sequences.
5.	Module 4: Working with the Operating System and the User (1 hour)
    1.	Interacting with the operating system
        1.	Working with text files
        2.	Hands-on activity: Write code to read and write FASTA files. 
        3.	Files and paths: os and shutil
        4.	Hands-on activity: Write code to copy all the FASTA files from a folder to another folder, read and translate them, then write the amino acid sequence to a new FASTA file with the same header in a new folder called “proteins”.
    2.	Basic scripts
        1.	Running Python files from the command line
        2.	Hands-on exercise: Move the code we’ve been working on in the Jupyter Notebook to a Python script.
        3.	Adding command line arguments using argparse
        4.	Hands-on exercise: Update the script to accept the input sequence directory and output directory as command line arguments. Add optional arguments, like maximum number of sequences to process or maximum sequence length.
6.	Module 5: Where to go from here (10 min)
    1.	Next steps (topics to mention):
        1.	Enumerated types
        2.	Class inheritance
        3.	Working with data frames using Pandas
        4.	Creating Python packages and distributing using PyPI, developing GUIs using PyQt5/6.
    2.	Important resources
        1.	Python documentation
        2.	Online tutorials
        3.	Stack Overflow

By the end of this workshop, you should feel comfortable writing Python scripts that involve classes and functions, as well as command line arguments. We'll also see some useful skills for writing code, such as reading and writing documentation.


# Module 0 - Introduction and Problem Scenario

In this section, we will introduce the major topics discussed in this workshop, as well as the project we will use to explore the skills we develop.

**Outline:**

1. Introducing DNA Sequence Processing
2. Why do we need scripts?

## Introducing DNA Sequence Processing

One of the most important tasks in genetics and genomics is processing DNA sequences. One of the first things we learn in university-level biology is the central dogma: how genes are expressed, going from DNA sequences to peptides. Doing this task manually works well for one or two short sequences, but what if we have hundreds of sequences, each hundreds of nucleotides long?

In this workshop, we will develop a tool to take a set of DNA sequences, translate them into peptide sequences and compute statistics about nucleotides and amino acids. Each section will bring new elements to this tool, until we have the finished product. We'll be able to run our final script from the command line, without needing to write any extra Python code.

## Why do we need scripts?

Some of you may be wondering... why should we bother with writing a script?

Well, let's consider the alternatives...

* We could just re-write the same code over and over again... but won't this take a lot of time?
* Ok, fine... well what about just copy-pasting code? Well, then we need to make sure all our variables have the right name and we need to make sure all the processing steps are done right.
* Well, what about having a sophisticated Python package that takes care of all the processing code? Well, you still need to plug in your inputs!

So, what are the benefits of writing a script?

* Easily perform multiple complicated tasks in series.
* No need to write tons of code each time! Just write the code once and it's done!
* Scripts can be run **automatically** on your own computer, or on a cluster, like DRAC (formerly Compute Canada).

# Module 1 - Functions

It's all good and fun to write all the steps you want to do line-by-line. But, let's say you want to run the same set of steps multiple times, potentially on different inputs. Instead of copying the code, we can write **functions**. Chances are you've seen functions already, but I'm including them here to make sure everyone is at the same level.

## Writing Custom Functions

We can think of functions as machines that take in **inputs**, run code (do calculations, magic or a bit of both), and then produce an **output** that can be used.

![Diagram of a function as taking input and producing output](assets/Function.png)

The inputs are known as *parameters* or *arguments* and the outputs are known as *return values*.

You've probably already *used* functions when working with the Python standard library and built-in objects, like `str`s and `list`s. Now, let's **define** functions. In Python, functions are defined using the `def` keyword. The syntax is:

```python

def function_name(argument1, argument2, argument3, ..., argumentN):
    """
    documentation here
    """

    your_code_here...

    return some_value

```
Here are the important elements to notice when **defining** a function:

* The function definition begins with the `def` keyword. This is similar to the `function` keyword in JavaScript, or the `func` keyword in Swift.
* The **function name** follows the same rules as variable names. There are different naming conventions for names that consist of multiple words (`snake_case` vs `camelCase`).
* After the function name, you can include a list of parameters. **If your function takes no arguments, you must still put the brackets.** Each argument in the list must have a valid variable name. We'll discuss these in more detail.
* After closing the argument list bracket, we put a **colon** (`:`).
* After the first line, we must **indent**. This tells Python where the function body begins and ends.
* We can start the body with a **docstring**, which describes the function. We'll discuss these more later.
* Then, you write your code as normal. In this function body, treat the arguments **like normal variables**.
* To **output** a result that can be used later, use the keyword `return`, followed by the result. We'll discuss this in more detail.
* Once you've finished defining the function, simply stop indenting. There's no need to close any brackets or type `end`.

To demonstrate, let's write a simple function with no arguments that simply prints a string onto the screen:

In [181]:
# Your code here

def do_nothing():
    print("Hello, World!")

Wait! What happened? Or well, what didn't happen? We didn't see any string... What's going on?

Well, we only **defined** the function. To actually run the function we must *call* it. To call a function, simply write the name of the function, followed by the desired arguments in brackets. **If the function takes no arguments, you must still type the empty brackets.**

Let's call our function we just defined:

In [182]:
# Your code here
do_nothing()

Hello, World!


### Function Parameters

This function was great, but we said that a function takes input and produces output... This does neither!!! So, let's add some parameters to this function. Let's look at the specific syntax:

```python

def my_function(arg1, arg2, arg3, ..., argN):
    my_code...

```

We separate each parameter using commas. We can then refer to these as variables in the function body. Let's write a new function that takes a DNA sequence as input and prints the transcribed RNA. To make it more interesting, let's add an extra parameter that indicates whether we are considering the sequence to be on the template strand.

In [183]:
# Your code here

def transcribe_dna(dna_sequence, is_template_strand):
    if is_template_strand:
        m_rna_sequence = ""

        for nt in dna_sequence:
            if nt == "A":
                m_rna_sequence += "U"
            elif nt == "T":
                m_rna_sequence += "A"
            elif nt == "C":
                m_rna_sequence += "G"
            else:
                m_rna_sequence += "C"
    else:
        m_rna_sequence = dna_sequence.replace("T", "U")

    print(m_rna_sequence)

And now, let's call this function using a specific sequence.

In [184]:
my_sequence = "AATTAGCGAGCCGAATATATAGCCGCGATTCAGACAGTTCCAGCGCA"

# Your code here
transcribe_dna(my_sequence, True)
transcribe_dna(my_sequence, False)

UUAAUCGCUCGGCUUAUAUAUCGGCGCUAAGUCUGUCAAGGUCGCGU
AAUUAGCGAGCCGAAUAUAUAGCCGCGAUUCAGACAGUUCCAGCGCA


This works well! Except, what if most of the time, we're going to call the function on the template strand? It would be nice if we didn't have to specify this argument every time we call the function.

Good news! We can set default values for function parameters. Parameters with a default value are known as *keyword* arguments, while parameters without a default value are known as *positional* arguments. To specify the default value, simply assign the value with `=`:

```python

def my_function(my_positional_arg, my_kw_arg=default_value):
    ...

```

Let's extend our transcription example to set a default value for the `is_template_strand` parameter:

In [185]:
# Your code here

def transcribe_dna(dna_sequence, is_template_strand=True):
    if is_template_strand:
        m_rna_sequence = ""

        for nt in dna_sequence:
            if nt == "A":
                m_rna_sequence += "U"
            elif nt == "T":
                m_rna_sequence += "A"
            elif nt == "C":
                m_rna_sequence += "G"
            else:
                m_rna_sequence += "C"
    else:
        m_rna_sequence = dna_sequence.replace("T", "U")

    print(m_rna_sequence)

transcribe_dna(my_sequence)
transcribe_dna(my_sequence, False)
transcribe_dna(is_template_strand=False, dna_sequence=my_sequence)

UUAAUCGCUCGGCUUAUAUAUCGGCGCUAAGUCUGUCAAGGUCGCGU
AAUUAGCGAGCCGAAUAUAUAGCCGCGAUUCAGACAGUUCCAGCGCA
AAUUAGCGAGCCGAAUAUAUAGCCGCGAUUCAGACAGUUCCAGCGCA


There are a few important rules to remember about positional and keyword arguments:

1. Positional arguments **always** come first, both when defining and when calling functions.
2. When calling a function, you **must** include **all** positional arguments, but you can omit keyword arguments (since they have default values).
3. **Any** argument can be written in keyword argument form when calling a function, but if you write a positional argument in keyword form, **all** subsequent arguments must be written in keyword form.
4. **Any** argument can be written in positional form when calling a function, but **all** preceding arguments must be positional as well.
5. Keyword arguments can be passed in **any order**, but positional arguments must be kept in the same order.
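
The rules above can be tried out with a small sketch. The function `describe_read` and its parameters are hypothetical, made up purely for illustration; they are not part of the workshop project.

```python
def describe_read(sequence, label="unnamed", is_template_strand=True):
    """Build a short description of a sequence (hypothetical helper)."""
    strand = "template" if is_template_strand else "non-template"
    return f"{label}: {len(sequence)} nt, {strand} strand"


# Rule 2: the positional argument is required; keyword arguments may be omitted.
first = describe_read("ATGC")

# Rule 5: keyword arguments can be passed in any order.
second = describe_read("ATGC", is_template_strand=False, label="read_1")

# Rule 3: a positional argument can also be written in keyword form.
third = describe_read(sequence="ATGC", label="read_2")

print(first)   # unnamed: 4 nt, template strand
print(second)  # read_1: 4 nt, non-template strand
print(third)   # read_2: 4 nt, template strand

# Rule 1: a positional argument after a keyword argument is a SyntaxError,
# so the following line is left commented out:
# describe_read(label="read_3", "ATGC")

# Rule 2 again: leaving out the required argument raises a TypeError.
try:
    describe_read()
except TypeError as error:
    print(f"TypeError: {error}")
```

Try uncommenting the invalid call to see the error message that Python gives you.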

## Function Return Values

So, we've seen how to pass information into functions, but now, how do we get information out? The answer is **return values**. These return values let us capture the result of a function, which we can then use like a normal variable in code. To return a value, we simply type `return` followed by the value we want to return.

Here's the syntax:
```python

def my_function(...):
    ...

    my_result = ...

    ...

    return my_result

```

Let's now switch our previous transcription function to *return* the mRNA instead of simply printing it:

In [186]:
# Your code here

def transcribe_dna(dna_sequence, is_template_strand=True):
    if is_template_strand:
        m_rna_sequence = ""

        for nt in dna_sequence:
            if nt == "A":
                m_rna_sequence += "U"
            elif nt == "T":
                m_rna_sequence += "A"
            elif nt == "C":
                m_rna_sequence += "G"
            else:
                m_rna_sequence += "C"
    else:
        m_rna_sequence = dna_sequence.replace("T", "U")

    return m_rna_sequence

So, this is how to return the value. Now, let's see how to capture and use it. To capture the value, we simply assign it to a variable, like normal.

In [187]:
# Your code here
my_mrna = transcribe_dna(my_sequence)

print(my_mrna)
print(f"The length of my mrna is {len(my_mrna)}")

UUAAUCGCUCGGCUUAUAUAUCGGCGCUAAGUCUGUCAAGGUCGCGU
The length of my mrna is 47


**Note:** If your code has multiple branches, you can put multiple return statements in your code. **But**, once your code reaches the `return` line, the function **stops** and returns to the code that called it. Any code that you've written after the `return` statement **will not run**.
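
As a minimal sketch of this behaviour, here is a function with several `return` statements. The function name `classify_nucleotide` is an assumption made for this example and is not used elsewhere in the workshop.

```python
def classify_nucleotide(nt):
    """Classify a DNA nucleotide as a purine or a pyrimidine."""
    nt = nt.upper()

    if nt in ("A", "G"):
        # The function stops here for A and G.
        return "purine"

    if nt in ("C", "T"):
        # The function stops here for C and T.
        return "pyrimidine"

    # Only reached if neither branch above returned.
    return "unknown"

    print("This line never runs!")  # unreachable: it comes after a return


print(classify_nucleotide("a"))  # purine
print(classify_nucleotide("T"))  # pyrimidine
print(classify_nucleotide("N"))  # unknown
```

Notice that the final `print` inside the function is never executed, no matter what input we pass.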

We can also return *multiple* values using tuples, lists or dictionaries. For example, let's say we want to count the number of each type of nucleotide in a sequence of DNA:

In [188]:
# Your code here

def count_nucleotides(dna_sequence):
    number_of_a = 0
    number_of_t = 0
    number_of_c = 0
    number_of_g = 0

    dna_sequence = dna_sequence.upper()

    for nt in dna_sequence:
        if nt == "A":
            number_of_a += 1
        elif nt == "T":
            number_of_t += 1
        elif nt == "C":
            number_of_c += 1
        elif nt == "G":  # only count G; ignore unexpected characters
            number_of_g += 1
    
    return number_of_a, number_of_t, number_of_c, number_of_g

Now, let's run this code on an example:

In [189]:
# Your code here
count_nucleotides(my_sequence)

(15, 9, 12, 11)
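
When a function returns a tuple, we can *unpack* the values into separate variables. Returning a dictionary is another option, which lets us look up each value by name instead of remembering the order. Here is a short sketch of both approaches; the functions `gc_and_at_counts` and `count_nucleotides_as_dict` are hypothetical variants written for this example.

```python
def gc_and_at_counts(dna_sequence):
    """Return the number of G/C and A/T nucleotides as a tuple."""
    dna_sequence = dna_sequence.upper()
    gc = dna_sequence.count("G") + dna_sequence.count("C")
    at = dna_sequence.count("A") + dna_sequence.count("T")
    return gc, at


def count_nucleotides_as_dict(dna_sequence):
    """Count A, T, C and G in a DNA sequence, returning a dictionary."""
    counts = {"A": 0, "T": 0, "C": 0, "G": 0}

    for nt in dna_sequence.upper():
        if nt in counts:  # ignore any character that is not A, T, C or G
            counts[nt] += 1

    return counts


# Tuple unpacking: one variable per returned value, in order.
gc_count, at_count = gc_and_at_counts("AATTAGCG")
print(gc_count, at_count)  # 3 5

# Dictionary: look up each count by its key.
counts = count_nucleotides_as_dict("AATTAGCG")
print(counts["A"])  # 3
print(counts)       # {'A': 3, 'T': 2, 'C': 1, 'G': 2}
```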

This is great! But let's say you get this function from someone else to import and use in your own code. You don't want to have to find this function and read all the code just to use it... But, how do we know what parameters this function takes and what values it returns...

## Function Documentation and Type Hints

The answer to this question is **documentation**. When defining a function, we can provide a *docstring*, which describes the important information about a function in a **human-readable** form. The docstring is just a string that a person can read to learn more about a function. If you're using a code editor or IDE, like VS Code or PyCharm, this string appears when you hover your mouse over a function. The information contained in this docstring can include:

* A brief description of the function.
* A longer description of the function. If you're implementing an existing approach, it could be good to include a citation here. You can also include equations here.
* A description of the function parameters, including their types.
* A description of the function return values, as well as their types. This is especially useful if you are returning multiple values and need to include their order.

Let's clarify our previous example by adding a docstring:


In [190]:
# Your code here

def count_nucleotides(dna_sequence):
    """
    Nucleotide Counter.

    This function counts the number of each type of nucleotide in a DNA sequence.

    Arguments:
        - dna_sequence: string containing a DNA sequence.
    
    Returns:
        - Tuple containing number of (A, T, C, G)
    """

    number_of_a = 0
    number_of_t = 0
    number_of_c = 0
    number_of_g = 0

    dna_sequence = dna_sequence.upper()

    for nt in dna_sequence:
        if nt == "A":
            number_of_a += 1
        elif nt == "T":
            number_of_t += 1
        elif nt == "C":
            number_of_c += 1
        elif nt == "G":  # only count G; ignore unexpected characters
            number_of_g += 1
    
    return number_of_a, number_of_t, number_of_c, number_of_g

In [191]:
help(count_nucleotides)

Help on function count_nucleotides in module __main__:

count_nucleotides(dna_sequence)
    Nucleotide Counter.
    
    This function counts the number of each type of nucleotide in a DNA sequence.
    
    Arguments:
        - dna_sequence: string containing a DNA sequence.
    
    Returns:
        - Tuple containing number of (A, T, C, G)
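
The `help` function isn't the only way to read a docstring. Python stores the docstring on the function itself, in the `__doc__` attribute, and the standard library's `inspect.getdoc` returns it with the indentation tidied up. Here is a small sketch; the function `gc_content` is a made-up example and not part of the workshop solutions.

```python
import inspect


def gc_content(dna_sequence):
    """
    GC Content.

    Compute the fraction of nucleotides in a DNA sequence that are G or C.

    Arguments:
        - dna_sequence: string containing a DNA sequence.

    Returns:
        - Fraction of G and C nucleotides, between 0 and 1.
    """

    dna_sequence = dna_sequence.upper()

    # Avoid dividing by zero for an empty sequence.
    if len(dna_sequence) == 0:
        return 0.0

    gc_count = dna_sequence.count("G") + dna_sequence.count("C")
    return gc_count / len(dna_sequence)


# The raw docstring, including the indentation from the source code.
print(gc_content.__doc__)

# inspect.getdoc returns the same text with the indentation cleaned up.
print(inspect.getdoc(gc_content))
```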



So, this is great for making it easy for other people to read... But, the docstring is just a string. The code editor doesn't understand it and can't give us suggestions based on it. But, we can get this extra help using **type hints**.

**Type hints** are a relatively recent addition to Python. They allow us to *explicitly* indicate the types of function parameters, return values and any other variable. That way, the code editor can tell us if we've passed the wrong type of value somewhere, or even give us suggestions as we type.

### A quick refresh on types

Everything in Python has a type. When you have a string of text, such as `"Hello, world!"`, it is a Python *object* of type `str` (string). If you have an integer number, like `4`, it is an object of type `int`. If you have a list, it is an object of type `list`. Hopefully, you get the idea by now.

In many cases the type can be inferred. For example, if you write
```python

x = 5

```

Then you know that `x` has type `int`. You can check the type of a variable `x` by running:
```python
type(x)
```

We'll see later how to create new types.

Let's see a few more examples of types:

In [192]:
# Your code here
x: float = 4

type(x)

int

### Type Hints

Although in the earlier cases we saw the type is implied, we can also explicitly set the type of different variables. We do this by adding a colon and the name of the type after the variable name. For example:
```python

x: int = 6
y: str = "world"

```

This extra bit that we add is known as a *type hint*. It gives a hint to the reader and the code editor about the specific type of the variable.

**Note:** Type hints are simply annotations. They *don't* actually change the type of the object. If you want to convert from one type to another, you need to call a conversion function, such as `int()`, `float()` or `str()`.
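
To make the difference concrete, here is a short sketch comparing a type hint with an actual conversion. The variable names are arbitrary.

```python
# The hint says float, but the value we assigned is still an int.
x: float = 4
print(type(x))  # <class 'int'>

# Calling float() actually converts the value.
y = float(x)
print(type(y))  # <class 'float'>

# Other built-in conversion functions work the same way.
n = int("47")   # str -> int
s = str(47)     # int -> str
print(n + 1)    # 48
print(s + "!")  # 47!
```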

### Type Hints and Functions

There are certain operations that we can only perform on certain types. For example, you can subtract two `int`s or two `float`s, but you can't subtract two `str`s. In your function, you perform operations on the arguments that are passed in. These operations often make assumptions about the **type** of the arguments. With type hints, you state clearly, for the reader as well as your code editor, what exactly those assumptions are. To add type hints to the arguments, we use the same syntax as above: a colon and the type name after each parameter name. Let's add type hints to our earlier transcription code:

In [193]:
# Your code here

def transcribe_dna(dna_sequence: str, is_template_strand: bool = True) -> str:
    """
    Transcribe DNA to mRNA.

    ...
    """
    
    if is_template_strand:
        m_rna_sequence = ""

        for nt in dna_sequence:
            if nt == "A":
                m_rna_sequence += "U"
            elif nt == "T":
                m_rna_sequence += "A"
            elif nt == "C":
                m_rna_sequence += "G"
            else:
                m_rna_sequence += "C"
    else:
        m_rna_sequence = dna_sequence.replace("T", "U")

    return m_rna_sequence

Type hints aren't just for parameters! They can also be used to specify function return types. After the parameter definition, but before the last colon, we add an arrow (`->`) followed by the return type. Let's add this to our function!

In [194]:
#Your code here

my_mrna = transcribe_dna(my_sequence, is_template_strand=False)

You may not be entirely convinced yet... Well, try calling the function with the wrong type of argument. Or try assigning the result to a variable of the wrong type. See what your editor does. If it does nothing, **make sure that you have turned on type checking**.

In [195]:
# Your code here
# my_mrna: int = transcribe_dna(56, is_template_strand=False)
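
It's worth knowing that Python itself does not check type hints when the code runs. The warnings come from your editor or from a separate type checker, such as mypy. The sketch below shows this with a hypothetical helper, `repeat_sequence`, which is not part of the workshop project.

```python
def repeat_sequence(sequence: str, times: int) -> str:
    """Repeat a sequence a given number of times (hypothetical helper)."""
    return sequence * times


# Correct usage: a str and an int.
print(repeat_sequence("AT", 3))  # ATATAT

# Wrong usage: the first argument is an int, not a str.
# A type checker would flag this line, but Python still runs it,
# because the hints are not enforced at run time.
result = repeat_sequence(2, 3)
print(result)        # 6
print(type(result))  # <class 'int'>
```

The second call silently returns an `int`, even though the function promises a `str`. This is exactly the kind of mistake that type checking in your editor helps you catch early.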

For collection types, like `list` and `tuple`, we can also modify the type hint to give information about the *contents*. We put the type contained in the collection in **square brackets** after the collection type name. Using this knowledge, let's update our transcription function.

In [196]:
from typing import List

def transcribe_dna_list(dna_sequence: List[str], is_template_strand: bool = True) -> str:
    """
    Transcribe DNA to mRNA.

    ...
    """
    
    if is_template_strand:
        m_rna_sequence = ""

        for nt in dna_sequence:
            if nt == "A":
                m_rna_sequence += "U"
            elif nt == "T":
                m_rna_sequence += "A"
            elif nt == "C":
                m_rna_sequence += "G"
            else:
                m_rna_sequence += "C"
    else:
        m_rna_sequence = "" 
        
        for nt in dna_sequence:
            if nt == "T":
                new_nt = "U"
            else:
                new_nt = nt

            m_rna_sequence += new_nt

    return m_rna_sequence
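
The same square-bracket syntax works for other collection types, and for return values too. Here is a short sketch using `List`, `Tuple` and `Dict` from the `typing` module; the function names are made up for this example. In Python 3.9 and later, you can also write the built-in names directly, such as `list[str]`.

```python
from typing import Dict, List, Tuple


def sequence_lengths(sequences: List[str]) -> List[int]:
    """Return the length of each sequence in a list."""
    return [len(sequence) for sequence in sequences]


def longest_sequence(sequences: List[str]) -> Tuple[str, int]:
    """Return the longest sequence and its length (assumes a non-empty list)."""
    longest = max(sequences, key=len)
    return longest, len(longest)


def count_characters(sequence: str) -> Dict[str, int]:
    """Map each character in a sequence to the number of times it appears."""
    counts: Dict[str, int] = {}
    for character in sequence:
        counts[character] = counts.get(character, 0) + 1
    return counts


my_sequences = ["ATG", "ATGCCGTA", "TTAG"]

print(sequence_lengths(my_sequences))  # [3, 8, 4]
print(longest_sequence(my_sequences))  # ('ATGCCGTA', 8)
print(count_characters("ATTA"))        # {'A': 2, 'T': 2}
```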

Another cool thing is that in some editors, if you hover your mouse over the function name, you see the function header, including the types! This helps you get a better idea of what types you need to pass in.

## Module 1 - Hands-on Activity

Let's put the skills we've learned in this module into practice.

**Recall our overall project.** We want to write code that performs transcription and translation. We've already written a function to perform transcription from DNA to mRNA. Let's now work on another function: translation from mRNA to protein, ignoring the mRNA preprocessing.

To test this code, I've provided a couple of random DNA sequences.

In [197]:
amino_acid_to_codon_table = {
    "F": ["UUU", "UUC"],
    "L": ["UUA", "UUG", "CUU", "CUC", "CUA", "CUG"],
    "I": ["AUU", "AUC", "AUA"],
    "M": ["AUG"],
    "V": ["GUU", "GUC", "GUA", "GUG"],
    "S": ["UCU", "UCC", "UCA", "UCG", "AGU", "AGC"],
    "P": ["CCU", "CCC", "CCA", "CCG"],
    "T": ["ACU", "ACC", "ACA", "ACG"],
    "A": ["GCU", "GCC", "GCA", "GCG"],
    "Y": ["UAU", "UAC"],
    "STOP": ["UAA", "UAG", "UGA"],
    "H": ["CAU", "CAC"],
    "Q": ["CAA", "CAG"],
    "N": ["AAU", "AAC"],
    "K": ["AAA", "AAG"],
    "D": ["GAU", "GAC"],
    "E": ["GAA", "GAG"],
    "C": ["UGU", "UGC"],
    "W": ["UGG"],
    "R": ["CGU", "CGC", "CGA", "CGG", "AGA", "AGG"],
    "G": ["GGU", "GGC", "GGA", "GGG"]
}

# Convert the table to have the codons as keys and the amino acids as values.
my_codon_table = {}

for amino_acid, codon_list in amino_acid_to_codon_table.items():
    for codon in codon_list:
        my_codon_table[codon] = amino_acid

# Your code here

def translate_mrna(mrna_sequence: str) -> str:
    """
    Translate mRNA to Amino Acid Sequence.

    Translate mRNA sequence containing a start codon and open reading frame
    to an amino acid sequence using single-letter codes.

    Arguments:
        - mrna_sequence: string containing mRNA.
    
    Returns:
        - string of single-letter amino acids, beginning with M.
    """

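    amino_acid_sequence = ""
    start_position = -1

    # Scan for the first start codon (AUG).
    for i in range(len(mrna_sequence)):

        candidate_codon = mrna_sequence[i: i+3]

        if candidate_codon == "AUG":
            start_position = i
            amino_acid_sequence += my_codon_table["AUG"]
            break

    # Without a start codon, there is nothing to translate.
    if start_position == -1:
        return amino_acid_sequence

    # Read the codons after the start codon, three nucleotides at a time.
    for i in range(start_position + 3, len(mrna_sequence), 3):
        
        new_codon = mrna_sequence[i: i+3]

        # Stop if the sequence ends with an incomplete codon.
        if len(new_codon) < 3:
            break

        new_amino_acid = my_codon_table[new_codon]

        if new_amino_acid == "STOP":
            break
        else:
            amino_acid_sequence += new_amino_acid

    return amino_acid_sequence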


# To make things easier later, copy your transcription function here.
# Your code here

def transcribe_dna(dna_sequence: str, is_template_strand: bool = True) -> str:
    """
    Transcribe DNA to mRNA.

    ...
    """
    
    if is_template_strand:
        m_rna_sequence = ""

        for nt in dna_sequence:
            if nt == "A":
                m_rna_sequence += "U"
            elif nt == "T":
                m_rna_sequence += "A"
            elif nt == "C":
                m_rna_sequence += "G"
            else:
                m_rna_sequence += "C"
    else:
        m_rna_sequence = dna_sequence.replace("T", "U")

    return m_rna_sequence

In [198]:
# Testing: Here are the random DNA sequences. One is a template strand, while the other is not.

template_strand = "TGTATGGGGGCACGCGCGCTGAGTCGCAGCGGGCCCTTGATGTCCCGCGATGCCCTCCCCGAAGTCCTGAGGTGGCCGCGTGCAATACCCCTTGCGACGCCACTATGCTGTGCAGGCACGGCGCGGGGTCGCGCAGCGCGCGGTTGCTGCGGGCGCGGAATTCTCCCGCGGCCTAAGCCAGGGTCGCTCTACTCATGTACGGGGTGTCGAAGCCCGTACGTCCGCTTGCGCACGGACAACACGGCGTGCGGGCCTGGACCGTTGCCCAAACCCACGGCGCCTGCGTATGCCGCGGGCTACCGGCATGGTGCGCCCCACCCTGCGGACCCCCATGAGGGTTCTCCGGTCCGCCAAAGACACATGGTGCCCGACGGACCCCTCCCCATAAGCTCTACCTGACTGCTTAGTGGAGCCCCTGTGCGTCGCAGGGTAGGCCGACTTCGATGATGAGCGGTGCGCCTGTCGCCCCGTAGTGCCCGCCCCAGGGGCGCCAAACCAGACTCTGGACCGATAGAGAGTAGCCGACTGTAACGCCGTGGGAGAGGCTTGCCGAATCAACGGATCCACGGCTCCAGACTGCCGTGGGCGTTGACCCCGGGCGGACCACGGCCCGGCAATGGACTTTTCTTTCACCGTGAACAAATGAGACTCAGCCGAGCCCAGGTAGAGAGACCATGGCGTGTTCGAGACGGTCTGCGTGGCTAGTTACAACTGTGGCGGGGATGTCCGGCGGCTCAACACACCCGGCCAGATCTGGAAGTCCATCCGCGTCGGTGCCCGGTTTGTACTGGCCTGCGCCATCAGTAGTGTCATCGGGCCTAGGTATTGGATCTTGGTGAAGGAGGCCATAACCCCGCGGGCGGCAGCAATACAGCGGCCCATAAACCTCCCGTCTCCGCGGCCTTAGACCGCGGGCCCCGAATGACGCCAGGGCCTAGCGCCGCGGGCCCGCCACTCGGGGCCCCTACGCACTGACACGCACGTGGTCCCTCCGATCCTC"
non_template_strand = "TACGCCGAGAAAGTACTTAAACGCAACCTCTACCCCCTGGGAGACTACCGCAAAAGATTGAACACGTATTAATGGCACAACGCTGCCAAGAGGCTACTTAAATGTAAGTAGTGGGCGACCGAGTGCCGCACACTCAAGAGTAGCGGTGATTACTCTAGTTGCTTATTGGAGTTTTTTCTCCGTGTGGGCTTTACCCGGACATTGTACGGTAATTTTATTAAGGCAATGTTAGACAATGTACGTCTAGCCATCCTGGCGCCTAATGGTGCCTCTCCGCTGTGGTCTTGTAAACCGTACTGCATCCCTCGGGCCGTGAGAGTCTTACATTTAAGTGGCTCAAAATCAACGGTCCTAAAGAAGATCACAGTAAAGTCCCTGATACGAAATACATTGTCCCGGTCCGCGAAGTACTACGCCACGCCGGCATGGAGATGAATAGGTTTGGAAAATCACTTCACAAAAGATGCGATAAAAACGGGTATAACACCTCTAGATGGTGACACCACTACTAATAGCAGTGAGTAGTCATTCTTCCCCTGCTGCTAGAGAGTTAGTTATGGCGCTTGGACTAATGCCATATTTATGCGTCTCCTACGGATCAGTTATTTTAGTCTTCCCAGCCCCTGCGGTCCCCGTATCTCAACTCGAGTTATGGGATTGGAATCAACCCGCATATGTCCTACCAGTTAGAGTACTAGCGTCGATCGCATCCGCCTTTCAATTAGTCCCAGAGCGATCAAGGGGTCTACATGGCACGATACAGGTTAACCACATTCAGTGAATAAAAGGACAACCGTAACTGATCGGGTGCGAGAACAAGCGACTTTGGAGTACACTGTAAAGTCTTTGTGTGGATGTATTCATGTCATTCGGGAGTATTGCTTGGCGATGGTTGGGGCCGAGCGGAAAAGCCGACTATTTGGCGAATTGGGCCCTGTTACAATGCCTTACTTCGCTCACTCGTCAATAAGTCGATTAAGGCGGGGCGGTCCTCATCCGC"

my_mrna = transcribe_dna(template_strand, is_template_strand=True)
my_amino_acids = translate_mrna(my_mrna)

print(f"My transcribed mRNA is {my_mrna}")
print(f"My translated peptide is {my_amino_acids}")

My transcribed mRNA is ACAUACCCCCGUGCGCGCGACUCAGCGUCGCCCGGGAACUACAGGGCGCUACGGGAGGGGCUUCAGGACUCCACCGGCGCACGUUAUGGGGAACGCUGCGGUGAUACGACACGUCCGUGCCGCGCCCCAGCGCGUCGCGCGCCAACGACGCCCGCGCCUUAAGAGGGCGCCGGAUUCGGUCCCAGCGAGAUGAGUACAUGCCCCACAGCUUCGGGCAUGCAGGCGAACGCGUGCCUGUUGUGCCGCACGCCCGGACCUGGCAACGGGUUUGGGUGCCGCGGACGCAUACGGCGCCCGAUGGCCGUACCACGCGGGGUGGGACGCCUGGGGGUACUCCCAAGAGGCCAGGCGGUUUCUGUGUACCACGGGCUGCCUGGGGAGGGGUAUUCGAGAUGGACUGACGAAUCACCUCGGGGACACGCAGCGUCCCAUCCGGCUGAAGCUACUACUCGCCACGCGGACAGCGGGGCAUCACGGGCGGGGUCCCCGCGGUUUGGUCUGAGACCUGGCUAUCUCUCAUCGGCUGACAUUGCGGCACCCUCUCCGAACGGCUUAGUUGCCUAGGUGCCGAGGUCUGACGGCACCCGCAACUGGGGCCCGCCUGGUGCCGGGCCGUUACCUGAAAAGAAAGUGGCACUUGUUUACUCUGAGUCGGCUCGGGUCCAUCUCUCUGGUACCGCACAAGCUCUGCCAGACGCACCGAUCAAUGUUGACACCGCCCCUACAGGCCGCCGAGUUGUGUGGGCCGGUCUAGACCUUCAGGUAGGCGCAGCCACGGGCCAAACAUGACCGGACGCGGUAGUCAUCACAGUAGCCCGGAUCCAUAACCUAGAACCACUUCCUCCGGUAUUGGGGCGCCCGCCGUCGUUAUGUCGCCGGGUAUUUGGAGGGCAGAGGCGCCGGAAUCUGGCGCCCGGGGCUUACUGCGGUCCCGGAUCGCGGCGCCCGGGCGGUGAGCCCCGGGGAUGCGUGACUGU

## Module 1 - Summary

In this module, we've seen how to write **functions** that take in **parameters** (inputs), produce **return values** (output) and how to document these functions using docstrings and type hints.

Here are the key concepts to remember:
* Function **parameters**: positional vs. keyword parameters.
* Function **return values** and how to return multiple values.
* Function **documentation** through **docstrings** and **type hints**.

Keep these key elements in mind as we move on to the next section.

# Module 2 - Classes and Object-Oriented Programming

In your programming experience, you've worked with variables and basic operations. Now we've also seen how to write functions. We can now store data and perform complicated operations on it. So, is this it?

Well, not exactly. We can do much of what we want with variables and operations and functions, but what if we want to combine functions and variables together to more easily model a real-world thing... or an *object*? Good news! This is the realm of **object-oriented programming** (OOP).

## Introduction to Classes and Objects

Two **extremely** important concepts in OOP are **objects** and **classes**. Understanding both is **vital**. So, let's dive into each.

### What are Objects?

In OOP, an **object** is, as the name suggests, a representation of a real-world (or abstract) *thing*. You can access the properties, or *attributes* of this thing, and you can perform operations, known as *methods*, with it.

For example, let's say there is a ball outside. This ball is an object. Looking at it, you determine that it's green. This is an **attribute** of the ball. You can also do operations on it, like deflate it, inflate it or throw it. These are functions, or *methods*, that you can perform on the object.

We've actually been using objects all along. In Python, **everything is an object**. For example, let's create a list and add a new entry to it:

In [199]:
# Your code here

my_list = [3, 4, 5]

my_list.append(6)

print(my_list)

[3, 4, 5, 6]


The `append` function that we used is a *method* belonging to the list object. Similarly, if we want to take a string and find a certain character we write this code:

In [200]:
# Your code here

my_string = "Hello!!!"

my_string.find("!")

5

This code calls the string's `find` method. Notice how the variable comes before a dot. This **dot notation** is important in Object-Oriented Programming (OOP).
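
Dot notation is used for both attributes and methods. The difference is that methods are *called*, so they are followed by brackets, while attributes are not. As a small illustration, here is a `range` object like the ones we used in our loops; the variable name is arbitrary.

```python
# A range object representing the start positions of three codons.
codon_starts = range(0, 9, 3)

# Attributes: accessed with a dot, no brackets.
print(codon_starts.start)  # 0
print(codon_starts.stop)   # 9
print(codon_starts.step)   # 3

# Methods: called with a dot, followed by brackets (and any arguments).
print(codon_starts.index(6))  # 2
print(codon_starts.count(4))  # 0
```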

So, we've seen that **objects group together useful variables and functions**, but how do we create a new type of object? The answer is coming right up!

### What are Classes?

Let's say we want to represent a DNA sequence in Python. We could simply represent the sequence as a string and then write a bunch of functions that take in a string as argument to perform DNA processing tasks. But, we've just seen that we can use objects to group together these *attributes* (the DNA sequence) and *methods* (the DNA processing functions). So, we want to make an object. But how?

We need to create a template that defines what a DNA sequence is. This template states which attributes exist for a DNA sequence, as well as which methods, and how to create a new DNA sequence object. This template is known as a **class** in Python. We'll see in the next section exactly how to write classes, but here's the syntax quickly:

```python

class ClassName:
    """
    docstring goes here
    """

    def __init__(self, arg1, arg2, arg3):
        self.attr1 = arg1
        self.attr2 = arg2
        self.attr3 = arg3

    def some_method(self, some_arg1, some_arg2):
        ...

```

We'll go into **much** more depth about this in the next section.

A final important note is that a **class defines a new type**. This is especially important if you are adding type hints to your code. The code in the above example creates a new type of object, called `ClassName`. All new objects created using this class are of type `ClassName`.
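To make this concrete, here is a minimal sketch using a hypothetical `Ball` class (the class and its `colour` attribute are invented purely for illustration). It suggests how you could check that an object's type really is the class it was created from:

```python
class Ball:
    """Hypothetical class, used only to illustrate types."""

    def __init__(self, colour: str):
        self.colour = colour


my_ball = Ball("green")

# The type of the object is the class it was created from.
print(type(my_ball) is Ball)      # True
print(isinstance(my_ball, Ball))  # True
print(isinstance(my_ball, str))   # False
```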

### What's the difference????

Some of you may be wondering what the difference is between classes and objects. Here are the important points to remember:

* *Objects* are essentially a group of variables and functions representing a single thing (whether it is something concrete like a DNA sequence, or something abstract). Each object has its own internal attributes.
* *Classes* are **templates** used to create many new objects. The class defines what attributes and methods each newly created object has.

Still confused? Think about this: the classes tell you how to make the objects. The class is the blueprint and it tells you what type of object you are creating, while the object is the actual finished product.

Want a biology analogy? The **class is the gene**, the **object is the protein**.
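The blueprint idea can be sketched in a few lines. The `Ball` class below is hypothetical and only serves as an example: one class produces two separate objects, and each object keeps its own attribute values.

```python
class Ball:
    """Hypothetical blueprint for balls."""

    def __init__(self, colour: str):
        self.colour = colour


# One class (the blueprint), two separate objects (the finished products).
green_ball = Ball("green")
red_ball = Ball("red")

# Changing one object leaves the other untouched.
green_ball.colour = "blue"

print(green_ball.colour)  # blue
print(red_ball.colour)    # red
```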

## Writing Classes - Gaining a Sense of `self`

Now that we've gotten the theoretical intro out of the way, let's talk about actually writing classes! I gave you a quick teaser of the syntax earlier. Now let's look at it in more depth:

```python

class ClassName:
    """
    docstring goes here
    """

    def __init__(self, arg1, arg2, arg3):
        self.attr1 = arg1
        self.attr2 = arg2
        self.attr3 = arg3

    def some_method(self, some_arg1, some_arg2):
        ...

```

Here are the important points to remember:
* We declare a new class with the `class` keyword.
* The class needs a name to define what type of object it creates. The naming convention for class names is `UpperCamelCase`, *regardless of your naming convention for everything else*. The first letter of the class name is traditionally capitalised, as is the first letter of any other word combined to make the name.
* As usual, in Python, no brace brackets. We put a colon (:) after the class name.
* Like functions, classes can include documentation in the form of a docstring. We'll cover this more in a bit.
* The first function we see looks a bit weird. This `__init__` is a special function called an initialiser. We'll see more about it in the next section.
* The next function, or *method*, that we see is called `some_method`. We'll discuss methods soon, as well as the mysterious `self` parameter that appears in both functions.
* As usual, no closing brackets or `end` or anything like that. The class definition ends when you stop indenting. **Note:** The entire body of the class definition is indented. If you forget to indent, the class will **not** work as expected.

### Attributes and Initialisation

Attributes are the properties associated with an object. For example, if we want to create a class to represent a DNA sequence, the main attribute we need is the actual nucleotide sequence, either as a string or as a list. Ok, good! So now we know what attribute we want... but how do we actually set it?

The answer is: using the **initialiser**, also known in other languages as the *constructor*. The **initialiser** is a special method that runs whenever you create, or *instantiate* an object of that class. The name of this method is `__init__`, with **two** underscores before and after.

Here's the typical syntax of an initialiser:
```python

def __init__(self, arg1, arg2, arg3):
    self.attr1 = arg1
    ... # Other initialisation code

    # NO return statement!

```

There's a bit to unpack here. We've already discussed the name of the method. I want to emphasise that you **CANNOT** change the name of the init method!

#### The Inner `self`

It's time to get in touch with your inner `self`... Let's talk about what `self` means. In all methods (as we'll see), including the initialiser, the first argument is (almost) **always** `self`.

`self`, very simply, is a variable that refers to the current object you're working with. In the `__init__` initialiser, `self` refers to the new object that you are constructing. We'll discuss more in the **methods** section.
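One way to convince yourself of this is the small sketch below. The `Ball` class and its `get_self` method are hypothetical and exist only for this demonstration: the method hands back whatever `self` is, so we can compare it with the object we called the method on.

```python
class Ball:
    """Hypothetical class used to illustrate `self`."""

    def __init__(self, colour: str):
        # Here, `self` is the new object being set up.
        self.colour = colour

    def get_self(self) -> "Ball":
        # Return whatever object this method was called on.
        return self


my_ball = Ball("green")

# `self` inside the method was the very same object as `my_ball`.
print(my_ball.get_self() is my_ball)  # True
```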

#### Back to the init

So, now that we've figured out what `self` means, let's see the rest of the initialiser. Since the initialiser is a **function**, it can take in other parameters. 

In the function body, the magic happens. Here, you **define the attributes**. To assign an attribute, you must use the **dot syntax**. To define a new attribute, you simply perform variable assignment, **making sure that you have put `self.` before the name of the attribute**.

Finally, **be careful**, the initialiser has **NO RETURN VALUE**. You **do not** return the created object. This is automatically performed.
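If you are curious about what happens when an initialiser does return something, here is a small sketch. The `BrokenSequence` class is hypothetical and deliberately wrong; creating an object from it should raise a `TypeError`, which we catch so that the cell keeps running.

```python
class BrokenSequence:
    """Hypothetical class whose initialiser wrongly returns a value."""

    def __init__(self, sequence: str):
        self.sequence = sequence
        return 42  # Wrong: initialisers must not return a value.


try:
    BrokenSequence("ACGT")
    error_message = None
except TypeError as error:
    # Python refuses to create the object and explains why.
    error_message = str(error)

print(error_message)
```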

As an example, let's write a simple class for representing a DNA sequence:

In [201]:
# Your code here

class DnaSequence:
    
    def __init__(self, dna_sequence: str):
        self.dna_sequence = dna_sequence

Recall that when we were looking at functions, *defining* the function was quite different from actually *calling* it. Well, we have the same idea for classes. To be able to create a new object, we need to instantiate it using the initialiser. To do this, we simply write the class name, and then put the initialiser arguments in parentheses:

In [202]:
# Your code here

my_dna_sequence = DnaSequence("AGGAGAGATATAGATAGTCCGATCG")

Now that we have our object, let's try to access its attribute. Remember, like when working with strings or lists, to access information in our object, we use **dot notation**. For example, let's look at the `dna_sequence` attribute of our DNA sequence:

In [203]:
# Your code here

print(my_dna_sequence)
print(my_dna_sequence.dna_sequence)

<__main__.DnaSequence object at 0x7fcb28a1b4c0>
AGGAGAGATATAGATAGTCCGATCG


### Methods

**Methods** are simply functions that live in a class and are (usually) called on a specific object. As explained above, the first parameter is **always** `self`. To be able to access information (e.g., attributes, methods) about the object that the method is being called on, you **must** use `self`. **Python is not like certain other programming languages** in this way. You **cannot** omit `self`. Unlike the initialiser, methods **can** return values.

Let's now re-write our DNA sequence class, adding a transcription method to it:

In [204]:
# Your code here

class DnaSequence:
    
    def __init__(self, dna_sequence: str):
        self.dna_sequence = dna_sequence

    def transcribe(self, is_template_strand: bool = True) -> str:
        """
        Transcribe DNA to mRNA.

        ...
        """
        
        if is_template_strand:
            m_rna_sequence = ""

            for nt in self.dna_sequence:
                if nt == "A":
                    m_rna_sequence += "U"
                elif nt == "T":
                    m_rna_sequence += "A"
                elif nt == "C":
                    m_rna_sequence += "G"
                else:
                    m_rna_sequence += "C"
        else:
            m_rna_sequence = self.dna_sequence.replace("T", "U")

        return m_rna_sequence

Let's now call this method on a sample sequence:

In [205]:
# Your code here
my_dna_sequence = DnaSequence(my_sequence)
my_mrna_sequence = my_dna_sequence.transcribe(is_template_strand=False)

print(my_mrna_sequence)

AAUUAGCGAGCCGAAUAUAUAGCCGCGAUUCAGACAGUUCCAGCGCA


But wait!!! What happened to `self`? When we call the method on the object using the dot notation, the `self` is *implicitly* passed. The object before the dot is passed as `self`. There is a way of calling the method while explicitly specifying `self`... It involves placing the *class name* before the dot and then the instance as the first argument.
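Here is a minimal sketch of both call styles side by side. The `SimpleSequence` class and its `count_nucleotide` method are hypothetical stand-ins, kept short so the example runs on its own:

```python
class SimpleSequence:
    """Hypothetical class used to show the two call styles."""

    def __init__(self, sequence: str):
        self.sequence = sequence

    def count_nucleotide(self, nucleotide: str) -> int:
        return self.sequence.count(nucleotide)


my_simple_sequence = SimpleSequence("AGGATA")

# Usual style: the object before the dot is passed as `self`.
implicit_count = my_simple_sequence.count_nucleotide("A")

# Explicit style: class name before the dot, object as the first argument.
explicit_count = SimpleSequence.count_nucleotide(my_simple_sequence, "A")

print(implicit_count)  # 3
print(explicit_count)  # 3
```

In everyday code, the first style is the one you will almost always want; the second is shown only to make the role of `self` visible.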

A common thing we like to do in programming is print text onto the screen. You've probably seen the `print` function called on strings, integers and lists. Well, now let's call `print` on our DNA sequence:

In [206]:
# Your code here
print(my_dna_sequence)

<__main__.DnaSequence object at 0x7fcaf8c0e470>


Well... That's not very helpful. We haven't gotten any information about our DNA sequence from this... what is this thing anyways? It's some sort of indication of where in the computer's memory your data is stored. You don't need to worry about that. But, what if we want to call `print` on our DNA sequence object and get some nice output involving the sequence?

To do this, we use some special Python methods. In addition to regular methods, classes can define special *magic* Python methods. One such method is `__str__(self) -> str`. This function is called whenever your object is passed as an argument to the `print` function. Let's define it in our DNA sequence class:

In [207]:
# Your code here

class DnaSequence:
    
    def __init__(self, dna_sequence: str):
        self.dna_sequence = dna_sequence

    def __str__(self) -> str:
        return f"DNA sequence of {len(self.dna_sequence)} nucleotides: {self.dna_sequence}"

    def transcribe(self, is_template_strand: bool = True) -> str:
        """
        Transcribe DNA to mRNA.

        ...
        """
        
        if is_template_strand:
            m_rna_sequence = ""

            for nt in self.dna_sequence:
                if nt == "A":
                    m_rna_sequence += "U"
                elif nt == "T":
                    m_rna_sequence += "A"
                elif nt == "C":
                    m_rna_sequence += "G"
                else:
                    m_rna_sequence += "C"
        else:
            m_rna_sequence = self.dna_sequence.replace("T", "U")

        return m_rna_sequence

In [208]:
my_dna_sequence = DnaSequence(my_sequence)

print(my_dna_sequence)
print(type(my_dna_sequence))

my_new_dna_sequence = DnaSequence("AGGATGTAGTCTCGCATGCTAGCTAGCTACGTAGCATGCATGCATGCTATCATGCTAGTAGCT")

print(my_new_dna_sequence)
print(type(my_new_dna_sequence))

DNA sequence of 47 nucleotides: AATTAGCGAGCCGAATATATAGCCGCGATTCAGACAGTTCCAGCGCA
<class '__main__.DnaSequence'>
DNA sequence of 63 nucleotides: AGGATGTAGTCTCGCATGCTAGCTAGCTACGTAGCATGCATGCATGCTATCATGCTAGTAGCT
<class '__main__.DnaSequence'>


There are a bunch of other functions like this. Take the time to explore what they have to offer (see [here](https://docs.python.org/3/reference/datamodel.html#specialnames) and [here](https://realpython.com/python-classes/#special-methods-and-protocols) and [here](https://medium.com/fintechexplained/advanced-python-what-are-magic-methods-d21891cf9a08)).
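As a taste of what these other special methods look like, here is a small sketch with two of them: `__len__`, which is used by the built-in `len` function, and `__eq__`, which is used by the `==` operator. The `SimpleSequence` class is a hypothetical stand-in for our DNA sequence class.

```python
class SimpleSequence:
    """Hypothetical class showing two more special methods."""

    def __init__(self, sequence: str):
        self.sequence = sequence

    def __len__(self) -> int:
        # Called by the built-in len() function.
        return len(self.sequence)

    def __eq__(self, other: object) -> bool:
        # Called by the == operator.
        if not isinstance(other, SimpleSequence):
            return NotImplemented
        return self.sequence == other.sequence


first = SimpleSequence("ACGT")
second = SimpleSequence("ACGT")

print(len(first))       # 4
print(first == second)  # True: same sequence
print(first is second)  # False: still two separate objects
```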

### Documentation

Just like we can document functions using docstrings, we can also document entire classes. This is especially useful if your class implements an idea from the literature, or if you would like to explain any quirks about the class. You can also add docstrings to all your methods.

The docstring usually contains a brief explanation of the class, followed by a more detailed description, and then a description of attributes, and a list of the methods (see [here](https://peps.python.org/pep-0257/)).

Let's add a docstring to our DNA sequence:

In [209]:
# Your code here

class DnaSequence:
    """
    DNA Sequence

    This class represents a DNA sequence.

    Attributes:
        - dna_sequence: string containing DNA nucleotides.

    Methods:
        - transcribe: produce mRNA based on DNA sequence.
    """
    
    def __init__(self, dna_sequence):
        self.dna_sequence = dna_sequence

    def __str__(self):
        return f"DNA sequence of {len(self.dna_sequence)} nucleotides: {self.dna_sequence}"

    def transcribe(self, is_template_strand=True):
        """
        Transcribe DNA to mRNA.

        ...
        """
        
        if is_template_strand:
            m_rna_sequence = ""

            for nt in self.dna_sequence:
                if nt == "A":
                    m_rna_sequence += "U"
                elif nt == "T":
                    m_rna_sequence += "A"
                elif nt == "C":
                    m_rna_sequence += "G"
                else:
                    m_rna_sequence += "C"
        else:
            m_rna_sequence = self.dna_sequence.replace("T", "U")

        return m_rna_sequence

One final thing we can do is add type hints to our methods. In fact, you can even "declare" your variables in the class main body with the type hints. Let's tweak our DNA sequence class a bit more.

In [210]:
# Your code here

class DnaSequence:
    """
    DNA Sequence

    This class represents a DNA sequence.

    Attributes:
        - dna_sequence: string containing DNA nucleotides.

    Methods:
        - transcribe: produce mRNA based on DNA sequence.
    """
    
    # Here, we essentially declare our attribute.
    dna_sequence: str
    
    def __init__(self, dna_sequence: str):
        self.dna_sequence = dna_sequence

    def __str__(self) -> str:
        return f"DNA sequence of {len(self.dna_sequence)} nucleotides: {self.dna_sequence}"

    def transcribe(self, is_template_strand: bool = True) -> str:
        """
        Transcribe DNA to mRNA.

        ...
        """
        
        if is_template_strand:
            m_rna_sequence = ""

            for nt in self.dna_sequence:
                if nt == "A":
                    m_rna_sequence += "U"
                elif nt == "T":
                    m_rna_sequence += "A"
                elif nt == "C":
                    m_rna_sequence += "G"
                else:
                    m_rna_sequence += "C"
        else:
            m_rna_sequence = self.dna_sequence.replace("T", "U")

        return m_rna_sequence

One last important thing! Classes define new types. So, you can use these in type annotations elsewhere in your code!
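For instance, a function elsewhere in your code could declare that it expects one of your objects. The sketch below assumes a hypothetical `SimpleSequence` class and a hypothetical `compute_gc_content` function, both invented for illustration:

```python
class SimpleSequence:
    """Hypothetical stand-in for a DNA sequence class."""

    def __init__(self, sequence: str):
        self.sequence = sequence


def compute_gc_content(dna: SimpleSequence) -> float:
    """Return the fraction of G and C nucleotides in the sequence."""
    gc_count = dna.sequence.count("G") + dna.sequence.count("C")
    return gc_count / len(dna.sequence)


print(compute_gc_content(SimpleSequence("GGCA")))  # 0.75
```

The annotation `dna: SimpleSequence` tells both readers and your editor that the function is meant to receive one of our objects, not a plain string.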

### Classes - Hands-on Activity I: Biological Sequences

We've been seeing how to write well-documented classes that include attributes and methods. Let's put this into practice. Over the course of this module, we have been working on writing a DNA sequence class. Let's now write a class to represent RNA sequences. Think about what attributes and methods we need.

In [211]:
# Your code here

from typing import Dict

class RnaSequence:
    """
    RNA Sequence

    This class represents an mRNA sequence.

    Attributes:
        - rna_sequence: string of ribonucleotides (A, U, C, G).
        - codon_table: dictionary containing codons as keys and amino acids as values.

    Methods:
        - translate: convert mRNA into amino acid sequence.
    """

    rna_sequence: str
    codon_table: Dict[str, str]

    def __init__(self, rna_sequence: str):
        self.rna_sequence = rna_sequence

        amino_acid_to_codon_table = {
            "F": ["UUU", "UUC"],
            "L": ["UUA", "UUG", "CUU", "CUC", "CUA", "CUG"],
            "I": ["AUU", "AUC", "AUA"],
            "M": ["AUG"],
            "V": ["GUU", "GUC", "GUA", "GUG"],
            "S": ["UCU", "UCC", "UCA", "UCG", "AGU", "AGC"],
            "P": ["CCU", "CCC", "CCA", "CCG"],
            "T": ["ACU", "ACC", "ACA", "ACG"],
            "A": ["GCU", "GCC", "GCA", "GCG"],
            "Y": ["UAU", "UAC"],
            "STOP": ["UAA", "UAG", "UGA"],
            "H": ["CAU", "CAC"],
            "Q": ["CAA", "CAG"],
            "N": ["AAU", "AAC"],
            "K": ["AAA", "AAG"],
            "D": ["GAU", "GAC"],
            "E": ["GAA", "GAG"],
            "C": ["UGU", "UGC"],
            "W": ["UGG"],
            "R": ["CGU", "CGC", "CGA", "CGG", "AGA", "AGG"],
            "G": ["GGU", "GGC", "GGA", "GGG"]
        }

        # Convert the table to have the codons as keys and the amino acids as values.
        self.codon_table = {}

        for amino_acid, codon_list in amino_acid_to_codon_table.items():
            for codon in codon_list:
                self.codon_table[codon] = amino_acid

    def __str__(self) -> str:
        return f"RNA sequence of {len(self.rna_sequence)} nucleotides"

    def translate(self) -> str:
        """
        Translate mRNA to Amino Acid Sequence.

        Translate mRNA sequence containing a start codon and open reading frame
        to an amino acid sequence using single-letter codes.
        
        Returns:
            - string of single-letter amino acids, beginning with M.
        """

        amino_acid_sequence = ""
        start_position = -1

        # We want to make sure that we don't go beyond the end of the string!
        for i in range(len(self.rna_sequence)-2):

            candidate_codon = self.rna_sequence[i: i+3]

            if candidate_codon == "AUG":
                start_position = i
                amino_acid_sequence += self.codon_table["AUG"]
                break

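        # If no start codon was found, there is nothing to translate.
        if start_position == -1:
            return amino_acid_sequence

        # Again, we want to not go beyond the end.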
        for i in range(start_position + 3, len(self.rna_sequence)-2, 3):
            
            new_codon = self.rna_sequence[i: i+3]
            new_amino_acid = self.codon_table[new_codon]

            if new_amino_acid == "STOP":
                break
            else:
                amino_acid_sequence += new_amino_acid

        return amino_acid_sequence



## Data Classes

Recall our initialiser for our DNA sequence class. We didn't really do much in this method, aside from setting some attributes that have the same name as our arguments. What if we want to add a new attribute? Later on, we'll look at working with FASTA files. In a FASTA file, each sequence has a name associated with it. So, let's add a `sequence_name` attribute to the class:

In [212]:
# Your code here

class DnaSequence:
    """
    DNA Sequence

    This class represents a DNA sequence.

    Attributes:
        - dna_sequence: string containing DNA nucleotides.
        - sequence_name: string representing the name of the sequence.

    Methods:
        - transcribe: produce mRNA based on DNA sequence.
    """
    
    dna_sequence: str
    sequence_name: str
    
    def __init__(self, dna_sequence: str, sequence_name: str):
        self.dna_sequence = dna_sequence
        self.sequence_name = sequence_name

    def __str__(self) -> str:
        return f"DNA sequence of {len(self.dna_sequence)} nucleotides: {self.dna_sequence}"


    def transcribe(self, is_template_strand: bool = True) -> str:
        """
        Transcribe DNA to mRNA.

        ...
        """
        
        if is_template_strand:
            m_rna_sequence = ""

            for nt in self.dna_sequence:
                if nt == "A":
                    m_rna_sequence += "U"
                elif nt == "T":
                    m_rna_sequence += "A"
                elif nt == "C":
                    m_rna_sequence += "G"
                else:
                    m_rna_sequence += "C"
        else:
            m_rna_sequence = self.dna_sequence.replace("T", "U")

        return m_rna_sequence

So, we had to add another argument, and another line to the constructor, just to assign a variable that has the same name as our argument. This seems a bit tedious.

In Python, there's actually a way of creating classes that **automatically generate** the initialiser based on the specified attributes. Classes created this way are known as **data classes**. For all the details, see the [Python documentation](https://docs.python.org/3/library/dataclasses.html).

### What are Data Classes?

Data classes have a few helpful features:

* No need to write the initialiser! `__init__` is generated automatically based on the specified attributes.
* You get a simple default string representation (a generated `__repr__` method, which `print` falls back on when no `__str__` is defined) that provides helpful information about the attributes.
* You can easily generate a Python dictionary based on a data class, and vice-versa.

These features make data classes especially helpful if you're passing around a bunch of... well, data. For example, if you are reading parameters from a file, it could be nice to simply load them into a data class.

### Using Data Classes

Here's the overall syntax:

```python

import dataclasses


@dataclasses.dataclass
class MyDataClass:
    """
    A class to contain data.

    This class contains a number of different attributes.
    """
    my_first_attribute: int
    my_second_attribute: str
    my_third_attribute: float = 2.3

```

Let's look at this code in a bit more detail. First, we have an import statement. We'll talk about imports more in the next module, but the important thing to know is that we are bringing in code from a different part of the Python Standard Library. Next, we have what looks like a normal class declaration, but there's a weird thing above the `class` line. The `@dataclasses.dataclass` is a special Python object called a *decorator*. We won't go into much detail about these, but it is **extremely** important to remember the decorator. This decorator takes the class that you've defined and runs some code to turn it into a data class.

**If you forget the decorator, your data class WILL NOT WORK.**
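Here is a small sketch of what that failure can look like. The two classes below are hypothetical and differ only in the decorator; the one without it has no generated initialiser, so passing arguments to it should raise a `TypeError`.

```python
import dataclasses


class WithoutDecorator:
    """Hypothetical class where the decorator was forgotten."""

    value: int


@dataclasses.dataclass
class WithDecorator:
    """Same attribute, but with the decorator applied."""

    value: int


# With the decorator, an initialiser is generated for us.
print(WithDecorator(value=3))

# Without it, there is no initialiser that accepts `value`.
try:
    WithoutDecorator(value=3)
    forgot_decorator_failed = False
except TypeError as error:
    forgot_decorator_failed = True
    print(f"TypeError: {error}")
```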

Then, in the class definition, we include all our attributes **with type annotations**.

That's it. No need to write a separate initialiser, and we get a clear view of our attributes and any default values. The `@dataclasses.dataclass` decorator takes care of the initialiser. Don't believe me? Let's try to write this class:

In [213]:
# Your code here

import dataclasses

@dataclasses.dataclass
class MyDataClass:
    """
    Simple data class.

    Simple data class containing a few attributes.
    """

    my_string: str
    my_int: int = 42
    my_float: float = 4.2

In [214]:
my_simple_data_object = MyDataClass(my_string="hi!")

Now, for another nice feature, let's try to print our data class:

In [215]:
# Your code here
print(my_simple_data_object)

MyDataClass(my_string='hi!', my_int=42, my_float=4.2)


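So, we also get a nice string representation for free! (Strictly speaking, the decorator generates a `__repr__` method, which `print` falls back on when the class defines no `__str__`.)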

In practice, I recommend using data classes when the main purpose of your class is simply to store data. But, you may be wondering... why bother creating a whole class? Why not use a dictionary? Well, in a dictionary, you have to remember what the keys are. If the keys are strings, you could make a typo, which you'll only realise once your code doesn't work. With a class, your editor can flag problems with incorrectly named attributes **before** you run the code. Also, it's actually very easy to convert a data class to a dictionary, and a dictionary with properly named keys to a data class.

To convert a data class to a dictionary, we just pass the object to the `dataclasses.asdict()` function:

In [216]:
# Your code here
my_data_dictionary = dataclasses.asdict(my_simple_data_object)

print(my_data_dictionary)

{'my_string': 'hi!', 'my_int': 42, 'my_float': 4.2}


To convert a dictionary to a data class, you **must** have all the required attributes, with the correct names. Then, you can use a trick in Python that lets you convert a dictionary to Python keyword arguments:
```python

my_dataclass_instance = MyDataClass(**my_dict)

```

The important thing here is to remember to put the two asterisks (`**`) before the dictionary name. This converts a dictionary into keyword arguments. (For completeness, if you use a single asterisk before a list or a tuple, that converts your object to positional arguments. Try it! It's fun!)

Let's see an example:

In [217]:
# Your code here

my_new_data_object = MyDataClass(**my_data_dictionary)

print(my_new_data_object)

MyDataClass(my_string='hi!', my_int=42, my_float=4.2)
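The asterisk trick is not specific to data classes; it works with any function call. As a small sketch, here is a hypothetical `describe_codon` function called once with a list (single asterisk) and once with a dictionary (double asterisk):

```python
def describe_codon(first: str, second: str, third: str) -> str:
    """Hypothetical helper that joins three nucleotides into a codon."""
    return first + second + third


nucleotide_list = ["A", "U", "G"]
nucleotide_dict = {"first": "A", "second": "U", "third": "G"}

# Single asterisk: list entries become positional arguments.
print(describe_codon(*nucleotide_list))   # AUG

# Double asterisk: dictionary entries become keyword arguments.
print(describe_codon(**nucleotide_dict))  # AUG
```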


Hopefully, by now I've convinced you that data classes are worth using. Again, **not every class should be a data class**, especially if you need to do additional processing when initialising the class.

### Classes - Hands-on Activity II: Data Classes

We've now reached the end of our material on data classes, so let's do a hands-on example. Let's rewrite our DNA sequence class to be a data class.

In [218]:
# Your code here

import dataclasses

@dataclasses.dataclass
class DnaSequence:
    """
    DNA Sequence

    Data class representing a DNA sequence.

    Attributes:
        - dna_sequence: string of nucleotides.
        - sequence_name: name from FASTA file.

    Methods:
        - transcribe: transcribe DNA to mRNA
    """

    dna_sequence: str
    sequence_name: str

    def transcribe(self, is_template_strand: bool = True) -> RnaSequence:
        """
        Transcribe DNA to mRNA.

        ...
        """
        
        if is_template_strand:
            m_rna_sequence = ""

            for nt in self.dna_sequence:
                if nt == "A":
                    m_rna_sequence += "U"
                elif nt == "T":
                    m_rna_sequence += "A"
                elif nt == "C":
                    m_rna_sequence += "G"
                else:
                    m_rna_sequence += "C"
        else:
            m_rna_sequence = self.dna_sequence.replace("T", "U")

        return RnaSequence(m_rna_sequence)

    

In [219]:
my_dna_sequence = DnaSequence(template_strand, "my_seq")

print(my_dna_sequence)

my_rna_sequence = my_dna_sequence.transcribe(is_template_strand=True)

print(my_rna_sequence)

my_protein = my_rna_sequence.translate()

print(my_protein)

DnaSequence(dna_sequence='TGTATGGGGGCACGCGCGCTGAGTCGCAGCGGGCCCTTGATGTCCCGCGATGCCCTCCCCGAAGTCCTGAGGTGGCCGCGTGCAATACCCCTTGCGACGCCACTATGCTGTGCAGGCACGGCGCGGGGTCGCGCAGCGCGCGGTTGCTGCGGGCGCGGAATTCTCCCGCGGCCTAAGCCAGGGTCGCTCTACTCATGTACGGGGTGTCGAAGCCCGTACGTCCGCTTGCGCACGGACAACACGGCGTGCGGGCCTGGACCGTTGCCCAAACCCACGGCGCCTGCGTATGCCGCGGGCTACCGGCATGGTGCGCCCCACCCTGCGGACCCCCATGAGGGTTCTCCGGTCCGCCAAAGACACATGGTGCCCGACGGACCCCTCCCCATAAGCTCTACCTGACTGCTTAGTGGAGCCCCTGTGCGTCGCAGGGTAGGCCGACTTCGATGATGAGCGGTGCGCCTGTCGCCCCGTAGTGCCCGCCCCAGGGGCGCCAAACCAGACTCTGGACCGATAGAGAGTAGCCGACTGTAACGCCGTGGGAGAGGCTTGCCGAATCAACGGATCCACGGCTCCAGACTGCCGTGGGCGTTGACCCCGGGCGGACCACGGCCCGGCAATGGACTTTTCTTTCACCGTGAACAAATGAGACTCAGCCGAGCCCAGGTAGAGAGACCATGGCGTGTTCGAGACGGTCTGCGTGGCTAGTTACAACTGTGGCGGGGATGTCCGGCGGCTCAACACACCCGGCCAGATCTGGAAGTCCATCCGCGTCGGTGCCCGGTTTGTACTGGCCTGCGCCATCAGTAGTGTCATCGGGCCTAGGTATTGGATCTTGGTGAAGGAGGCCATAACCCCGCGGGCGGCAGCAATACAGCGGCCCATAAACCTCCCGTCTCCGCGGCCTTAGACCGCGGGCCCCGAATGACGCCAGGGCCTAGCGCCGCGGGCCCGCCACTCGGGGCCCCTACGCACTG

## Module Summary

Congratulations! We've reached the end of another module. In this module, we learned:

* That **objects** are groups of information (*attributes*) and functionality (*methods*) contained together.
* **Classes** are *templates* for creating new objects of a certain type. We set an object's *attributes* in the `__init__` *initialiser*, which runs when the object is created, and we define methods just like normal functions, except that the first parameter is `self`.
* `self` is a special variable that refers to the object that the method is being called on.
* When creating classes that are designed mainly to hold data, and don't need a complicated initialiser, we can create **data classes** to use less code.
* We can use **docstrings** and **type hints** to make our class, attributes and methods easier to understand.

Here's a useful example we did in the live session to explain where classes and objects can be helpful. Let's say we want to represent a circle. Well, the important number we need is the radius. If we want to find the area and circumference, we could just write functions that take in a radius and return the answer, but writing a class makes our code much more organised. Instead of just looking at numbers, we can think of `Circle`s. This will make our code easier to understand.

In [220]:
# Here's another example that we did in the live session about why it's helpful to create classes and objects.

from math import pi

class Circle:
    """
    Simple Circle

    This class represents a circle with a specified radius.
    """

    def __init__(self, radius: float):
        self.radius = radius

    def compute_area(self) -> float:
        return pi * self.radius * self.radius
    
    def compute_circumference(self) -> float:
        return 2 * pi * self.radius
    
    def get_geometric_description(self) -> dict[str, float]:
        my_area = self.compute_area()
        my_circumference = self.compute_circumference()

        return {
            "area": my_area,
            "circumference": my_circumference
        }
    
my_circle = Circle(3)

print(f"The area of my circle is {my_circle.compute_area()}")
print(f"The circumference of my circle is {my_circle.compute_circumference()}")

print(my_circle.get_geometric_description())

The area of my circle is 28.274333882308138
The circumference of my circle is 18.84955592153876
{'area': 28.274333882308138, 'circumference': 18.84955592153876}


# Module 3 - Packages

So far, we've seen how to write **functions** and how to define **classes**. These two concepts help us write simple, smarter code more efficiently. But, sometimes we want to perform more complicated tasks. Instead of re-inventing the wheel, we can take advantage of **packages**. These **packages** contain code *modules* written by other people that we can easily use in our own code. Each module typically corresponds to a Python file, which we can load in its entirety, or from which we can load specific classes and/or functions.

## Intro to Packages

Python has a **massive** number of packages available to accomplish all sorts of different tasks. There are Python packages for performing mathematical and scientific computations, making nice plots, doing machine learning, analysing images, designing user interfaces and more!

### The Python Standard Library

Before installing packages, let's actually talk about the modules that come *built-in* with Python. Python has a lot of functionality built-in, but much of it is found in different modules. These form the **Python Standard Library**. Examples of modules found here are:

* `random` - dealing with random numbers.
* `os` - interacting with the operating system. We'll see this more later.
* `shutil` - moving and copying files. Again more later.
* `math` - basic mathematical operations (including trigonometry and factorial).

To use these modules, you don't need to install anything new. All the information about them is found in the Python [online documentation](https://docs.python.org/3/library/index.html).
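As a quick sketch, here is how two of these modules could be used straight away, with nothing extra to install:

```python
import math
import random

# Functions from the math module are accessed with dot notation.
print(math.factorial(5))  # 120
print(math.cos(0))        # 1.0

# Pick a random nucleotide; the result changes from run to run.
random_nucleotide = random.choice("ACGT")
print(random_nucleotide)
```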

### External Packages

In addition to everything we can do in the standard library, there are also tons of external packages that can be installed to add extra functionality. Here are a few of the common packages:

* NumPy - numeric and mathematical operations, matrix calculations.
* SciPy - scientific operations.
* Matplotlib - data visualisation and plotting.
* scikit-learn - data science and machine learning.
* scikit-image - image processing tools.

There are a bunch of others!

Now, let's talk about how to install these packages. These packages are pretty standard. So, if you installed Anaconda, these actually came with your installation.

If you didn't install Anaconda, don't worry! It's very easy to install packages.

## Installing Packages

Installing packages in Python is easy! Let's go over the main steps.

### Finding Packages Online

Well, before installing packages, first you need to find them. There are many different ways of finding Python packages. You can easily search on Google to find a package. Alternatively, you can search on GitHub. Or, you can search for packages on a dedicated Python package database. There are two major databases online: the official **Python Package Index (PyPI)** and the **Anaconda** repository. You can search each online to find a specific package.

Let's start with an example using PyPI: https://pypi.org/

Now, let's look at Anaconda: https://anaconda.org/

*Note:* This is anaconda.**org** not anaconda.**com**.

### Installing Packages

These two repositories offer very easy ways to install packages. Each repository has its own tool:

* To install packages from PyPI, we use `pip`.
* To install packages from Anaconda, we use `conda`.

We'll see how to use each in more detail.



### Creating a Virtual Environment

Before we discuss how to install packages, we need to discuss Python environments. Often, you may find yourself working on multiple projects at the same time. Each may have similar packages installed, but due to complicated webs of dependencies, you may need different package versions for different projects. This is a problem! In a given Python environment, you can only have **one version** of each package installed. To get around this issue, we can create **virtual environments**.

A **virtual environment** is an environment that provides its own Python interpreter and its own packages, separate from your main Python installation. This lets you have your own set of packages, and in some cases even a different version of Python.

One easy way to create a new virtual environment is using `virtualenv`. To create a virtual environment, we use the **terminal** NOT the Python interpreter. At the prompt ($ or % on macOS and Linux, > on Windows), we run the following code:

```bash

    virtualenv environment_name

```

This will create a new virtual Python environment called `environment_name`. If you look in the folder using `ls`, you'll see a new folder with the same name as the environment. To **activate** this environment on the command line, you must run:

* On macOS or Linux: `source environment_name/bin/activate`.
* On Windows, using PowerShell: `environment_name\Scripts\activate.ps1`.

On macOS and Linux, the name of the environment will appear in parentheses before the command line prompt when the environment is active. Once this environment is active, any packages you install will **only live in this environment**. So, you can install specific versions and they **will not** affect your bigger Python installation. To **leave** the virtual environment, simply type `deactivate` on the command line.

You can also use the `venv` tool to create virtual environments (see [here](https://docs.python.org/3/library/venv.html)).

When working with virtual environments, **make sure to also configure your code editor** to use the Python interpreter from the virtual environment.
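
If you're not sure whether your editor or terminal is really using the virtual environment, you can ask Python itself. Here is a minimal sketch, assuming a recent `venv` or `virtualenv` environment. The function name `in_virtual_environment` is just an illustrative choice, and this check does **not** detect `conda` environments.

```python
import sys


def in_virtual_environment() -> bool:
    """Return True if the running interpreter belongs to a virtual environment."""
    # Inside a venv/virtualenv environment, sys.prefix points at the environment
    # folder, while sys.base_prefix points at the Python installation that the
    # environment was created from. Outside an environment, they are equal.
    return sys.prefix != sys.base_prefix


# sys.executable is the full path of the interpreter that is running this code.
print(f"Interpreter: {sys.executable}")
print(f"Inside a virtual environment: {in_virtual_environment()}")
```

If the interpreter path doesn't point inside your environment folder, your editor is probably still using your main Python installation.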

Using `conda`, we can also create new Python environments. To create a new, empty `conda` environment called `new_env`, type this:
```bash
    conda create --name new_env
```

To activate this new environment, you would type:
```bash
    conda activate new_env
```

To deactivate the environment, simply type `conda deactivate`.

Now that we've seen how to create environments, let's discuss how to install packages.

### Installing Packages using `pip`

Let's start with `pip`. `pip` is typically available regardless of how you've installed Python.

To install packages using `pip`, you must open the **command line**, NOT a Python interpreter. At the prompt, you write:
```bash
    pip install package_name
```
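
Once a package is installed, you may want to confirm from Python that it can actually be found. Here is a small sketch using the standard library's `importlib.util.find_spec`. The helper name `is_installed` and the fake package name are made up for illustration, and the check is only meant for top-level module names.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module with this name can be found."""
    # find_spec returns None when the module cannot be located.
    return importlib.util.find_spec(module_name) is not None


# "math" is in the standard library; the last name is deliberately fake.
for name in ["math", "numpy", "not_a_real_package_123"]:
    status = "available" if is_installed(name) else "not found"
    print(f"{name}: {status}")
```

Keep in mind that the name you `import` is not always the same as the name you give to `pip install`, so check the package's documentation if the result surprises you.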

**Note:** `pip` should come with just about any installation of Python. If you didn't install Anaconda, things may get a bit messy. There are two major versions of Python in use: 2.7 and 3.*. On some operating systems, typing `python` or `pip` on the command line uses Python 2, while you must type `python3` or `pip3` to use the more up-to-date and supported version of Python. When you install Anaconda, you no longer need to deal with this issue.

Now, let's say you're doing research and you have all your packages installed and you want someone to be able to reproduce your environment with the exact same versions. We can do this easily using `pip`. The package information is stored in a **requirements** file, commonly called `requirements.txt`. To create this file, at the command line, you use `pip freeze`:

```bash
    pip freeze > requirements.txt
```

Then, on another computer, if you want to install all these packages, you just write:
```bash
    pip install -r requirements.txt
```

This way, you have a nice, easily reproducible environment.
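
To get a feel for what ends up in a requirements file, here is a rough sketch that builds similar `name==version` lines from inside Python using the standard library's `importlib.metadata`. This is only an illustration of the idea, not how `pip freeze` is implemented, and the function name `list_installed_packages` is hypothetical.

```python
from importlib.metadata import distributions


def list_installed_packages() -> list[str]:
    """Return 'name==version' lines for packages visible to this interpreter."""
    lines = []
    for dist in distributions():
        name = dist.metadata.get("Name")
        # Skip entries whose metadata is missing a name.
        if name:
            lines.append(f"{name}=={dist.version}")
    # Sort alphabetically, ignoring upper/lower case.
    return sorted(lines, key=str.lower)


# Print just the first few entries.
for line in list_installed_packages()[:5]:
    print(line)
```

For sharing an environment, stick with `pip freeze`. This sketch is just to show that the pinned versions come from metadata stored with each installed package.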

### Installing Packages using `conda`

Note: you only have `conda` available if you've installed Anaconda or Miniconda. To install packages using `conda`, you can either use the graphical **Anaconda Navigator** or use the command line.

In the **Anaconda Navigator**, click on the **Environments** tab. Then, you can select a specific environment and search for packages.

To install a package using `conda` at the **command prompt** in a terminal, **NOT IN A PYTHON SHELL**, you would write:

```bash
conda install package_name
```

Press enter, wait for it to prompt you, type `y` and hit enter again to install! If you don't want to be prompted, then you can just add `-y` to the command so that it automatically answers "yes" to the prompt for installation.

Sometimes, the package that you want isn't available in the main channel, so you may have to specify an additional option (see [here](https://docs.conda.io/projects/conda/en/latest/commands/install.html) for more details). For example, packages may come from `conda-forge`, so you would have to specify:
```bash
conda install -c conda-forge package_name
```

You can also add additional channels, such as [**bioconda**](https://bioconda.github.io/).

**Note:** On Windows and macOS, if you installed Anaconda, you can perform package management graphically using the "Anaconda Navigator". Also, on Windows, if you are installing packages from the command line, make sure you open the **Anaconda Prompt** from the Start Menu and you **don't** just go in through `cmd` or PowerShell. On Linux, you may need to install the `anaconda-navigator` package to be able to graphically install packages.

For reproducibility, you can also export your `conda` environment to a file. This will let other users recreate your setup. The process is discussed briefly in the `conda` documentation [here](https://conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html).

At the command prompt (with your `conda` environment active), type the following (after the dollar sign). The exported file is in YAML format, so it is conventionally named `environment.yml`:

```bash
    conda env export -f environment.yml
```

To create a new `conda` environment called `new_env` based on the file, type this:
```bash
    conda env create --file environment.yml --name new_env
```

To activate this new environment, you would type:
```bash
    conda activate new_env
```

Environments help you keep multiple versions of Python (and versions of packages) separate.

If you have both `conda` and `pip` installed, the `conda` [documentation](https://conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html) recommends trying to install packages with `conda` first.

### Other Installation Tips

Most packages give you information in the **documentation** about how to install them. For example, NumPy provides the following [page](https://numpy.org/install/). Matplotlib provides [this page](https://matplotlib.org/stable/users/getting_started/index.html#installation-quick-start). Another interesting example is CuPy, which allows performing NumPy and SciPy operations on the GPU. This package **does not** work on all systems. It requires an NVIDIA GPU and CUDA, which is not available on macOS. In these cases, it's very important to read the [installation instructions](https://docs.cupy.dev/en/stable/install.html).

### Reading Documentation

When in doubt, **consult the documentation!** The documentation usually provides instructions on installation and basic usage, and it often has a collection of all the docstrings so that you can understand what all the functions, classes and methods do. For example, let's look at [NumPy](https://numpy.org).

## Using Packages

Now that we've installed packages, we need to use them. We'll discuss how to load a module using `import`. Then, we'll also look specifically at using the NumPy package for performing array operations and data processing.

### Importing Packages

To import a package into a Python script, we use the `import` keyword, followed by the name of the module we are importing. **This is the same regardless of whether the package is in the standard library or has been installed separately.** To access functions or variables (actually more like "constants" here) from the imported module, we use the same dot notation from object-oriented programming.

**Note:** For people coming from Java, or any other extremely object-oriented programming language, although we use dot notation, the imported module is **not** a class.

Let's do a simple example using the `math` module from the standard library.

In [221]:
# Your code here

import math

math.factorial(7)

5040

When importing a package, we can also give it a new name. This is helpful if the package name is long (or even if it's short but we use it a ton), or if we run the risk of having multiple modules with the same name. To do this renaming, we simply add `as` after the module name, then the new name:

```python
import module as m
```

A very popular example is when using the NumPy package, which is commonly abbreviated `np`:

In [222]:
# Your code here

import numpy as np

np.cos(np.pi / 2)

6.123233995736766e-17

Python packages can often have multiple folders and hierarchies. To import modules in a subfolder, we use dot syntax. A popular example is Matplotlib for creating graphs:

In [223]:
# Your code here

import matplotlib.pyplot as plt

plt.figure()

<Figure size 640x480 with 0 Axes>


We can also import specific classes and functions from a module. To do this, we write:
```python

from module_name import ClassName, function_name

```

Then, you can use the class or call the function without using the dot notation. Instead, we can call the function as we would any other function we had defined.

Let's see an example from the `math` module:

In [224]:
# Your code here

from math import factorial

factorial(7)

5040
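
To tie the three import styles together, here is a small sketch using only the `math` module. It shows that every style gives access to the same function, and that the imported module really is a module object rather than a class. The alias `m` is an arbitrary choice for this example.

```python
import math
import math as m
from math import factorial

# All three names refer to the very same function object.
print(math.factorial is m.factorial is factorial)

# The imported module is an object of type "module", not a class.
print(type(math))
```

Which style to use is mostly a matter of readability. Keeping the module name (or a well-known alias like `np`) makes it obvious where a function comes from.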

### Using NumPy

For a specific example of using a package, let's learn a bit about NumPy.

NumPy provides important mathematical tools for operations on arrays. It simplifies performing bulk calculations.

You may be wondering: If I have lists, why bother with arrays? Can't I do everything with lists and nested lists?

Well, to do operations on lists, you need to use `for` loops. If you want to do operations on nested lists (essentially 2D lists), you have to nest the `for` loops. That starts to get messy to write... and it also becomes very inefficient to run.

So, given these difficulties, it is **definitely** worth the time investment to learn NumPy.
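
To make the comparison concrete, here is a minimal sketch that doubles every value in a small table, first with nested lists and then with a NumPy array. The variable names are just for illustration, and `np.array` is explained properly in the sections that follow.

```python
import numpy as np

table = [[1, 2, 3], [4, 5, 6]]

# With nested lists, doubling every value needs nested loops.
doubled_list = []
for row in table:
    new_row = []
    for value in row:
        new_row.append(value * 2)
    doubled_list.append(new_row)

# With a NumPy array, the same operation is a single expression.
doubled_array = np.array(table) * 2

print(doubled_list)
print(doubled_array)
```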

#### NumPy Fundamentals

The most important object in NumPy is the **array**. Very important: **Arrays are *not* lists**. Here are the important differences:

* NumPy arrays have a **fixed** size. You **cannot** add new elements to an existing array (functions such as `np.append` return a new array instead).
* NumPy arrays can easily have more than one dimension. For example, an array can be 2D (to represent a table or an image), or even higher dimensional. The dimensions are represented as a tuple. The length of this tuple is the number of dimensions, while each element in the tuple indicates the size of the array along that axis.
* All elements in a NumPy array have the same type.
* NumPy includes functions that perform mathematical operations on entire arrays without needing to perform iteration.
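
Here is a short sketch of how the differences listed above show up as attributes of an array. The array names are made up for this example.

```python
import numpy as np

# Integer and float inputs are stored using one common element type.
mixed = np.array([1, 2.5, 3])

# A 2D array, like a small table or image.
image_like = np.zeros((4, 5))

print(mixed.dtype)        # the single element type chosen by NumPy
print(image_like.shape)   # size along each axis, as a tuple
print(image_like.ndim)    # number of dimensions (length of the shape tuple)
print(image_like.size)    # total number of elements
```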

Let's see some of the basics of using NumPy arrays.

#### Creating NumPy Arrays

There are a number of different ways of creating a NumPy array. We can create an array based on nested lists using the `np.array` function. Here's an example:

In [226]:
# Your code here

my_arr = np.array([
    [2, 6], [4, 8]
])

print(my_arr)

[[2 6]
 [4 8]]


Note that the lists need to line up properly. In an array, all rows **must** have the same number of elements (the same with columns and any other dimension). If the nested list sizes don't line up properly, you won't get the array you were expecting: recent versions of NumPy raise a `ValueError`, while older versions created an array of Python list objects instead.

There are a couple of helpful functions for creating arrays of a specific size. These are `np.zeros` and `np.ones`. They each take as argument the size of the desired array in the form of a tuple.

Let's see some examples:

In [235]:
# Your code here

my_zeros = np.zeros((4,5)) # Number of rows, number of columns
my_ones = np.ones((2, 3, 4))

print(f"My Zeros, with shape {my_zeros.shape}")
print(my_zeros)

print("\n")

print(f"My Ones, with shape {my_ones.shape}")
print(my_ones)

My Zeros, with shape (4, 5)
[[0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]]


My Ones, with shape (2, 3, 4)
[[[1. 1. 1. 1.]
  [1. 1. 1. 1.]
  [1. 1. 1. 1.]]

 [[1. 1. 1. 1.]
  [1. 1. 1. 1.]
  [1. 1. 1. 1.]]]


There's another easy way to create a 1D array. Using `np.arange`, we can create simple arrays of numbers separated by a consistent step size. There are different ways of using the function:
```python

np.arange(a) # Array contains all integers from zero up to and excluding a.
np.arange(a, b) # Array contains all numbers from a (inclusive) to b (exclusive).
np.arange(a, b, c) # Array contains all numbers from a (inclusive) to b (exclusive) incrementing by c.

```

**Note:** for the last two, the arguments don't have to be integers.

Let's see a few examples.

In [236]:
# Your code here

my_first_range = np.arange(5)
print(my_first_range)

my_second_range = np.arange(3, 7)
print(my_second_range)

my_third_range = np.arange(-1, -6, -2)
print(my_third_range)

[0 1 2 3 4]
[3 4 5 6]
[-1 -3 -5]
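
Since the arguments to `np.arange` don't have to be integers, here is a short sketch with floating-point values. The step sizes here were chosen because they can be represented exactly. With steps like `0.1`, rounding can make the number of elements harder to predict, so treat this as an illustration rather than a general recipe.

```python
import numpy as np

# Start at 0.0, stop before 1.0, step by 0.25.
quarters = np.arange(0.0, 1.0, 0.25)
print(quarters)

# A negative step counts down; the stop value 0.0 is still excluded.
countdown = np.arange(2.0, 0.0, -0.5)
print(countdown)
```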


We can check the dimensions, or shape, of an array using the `shape` attribute of the array.

In [237]:
# Your code here

print(f"My third range has shape {my_third_range.shape}")

My third range has shape (3,)


But, wait! This only returns a 1D array! What if we don't want only a one-dimensional array?

Introducing the `reshape` method. We can take an array and reshape it to match desired dimensions. As argument, we pass the desired dimensions, either as a tuple or as separate integers.

**Note:** the dimensions must be compatible with the number of elements in the array. If you know all the dimensions except one, you can set that dimension size as `-1` and NumPy will infer it.

Let's see some examples:

In [240]:
# Your code here

# Let's make a sequence of 30 integers.
my_arr = np.arange(30)

# Now, let's rearrange the array to have 5 rows.
my_reshaped_arr = my_arr.reshape(5, -1)

print(my_reshaped_arr)

# By setting the second number as -1, we tell NumPy to figure out the missing dimension
print(my_reshaped_arr.shape)

[[ 0  1  2  3  4  5]
 [ 6  7  8  9 10 11]
 [12 13 14 15 16 17]
 [18 19 20 21 22 23]
 [24 25 26 27 28 29]]
(5, 6)
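
What happens when the dimensions are *not* compatible? Here is a small sketch, reusing the same 30-element array, that shows both a valid reshape and one that fails.

```python
import numpy as np

values = np.arange(30)

# Both the tuple form and separate integers are accepted.
print(values.reshape((3, -1)).shape)
print(values.reshape(5, 6).shape)

# 30 elements cannot be split evenly into 7 rows, so NumPy raises an error.
try:
    values.reshape(7, -1)
except ValueError as error:
    print(f"Reshape failed: {error}")
```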


### Array Operations

Now that we know how to *create* arrays, we can think about operations that we can perform with them.

#### Basic Mathematical Operations

We can perform *mathematical operations* with arrays in the same way that we would with numbers:

* `+` - addition of two arrays, element-wise.
* `-` - subtraction of two arrays, element-wise.
* `*` - **element-wise** multiplication of two arrays.
* `/` - division of two arrays, element-wise.
* `**` - **element-wise** exponentiation of arrays.
* `%` - modulo of two arrays, element-wise.
* `//` - integer division (floor division) of two arrays, element-wise.

Again, note that these operations are all **element-wise**. To perform these operations, the arrays must have *compatible shapes*. Here are some of the rules of working with different array shapes:

* The two arrays can have the *exact same shape*, in which case the operation works directly.
* One of the operands can simply be a `float` or an `int`, in which case the operation is performed between that number and each element of the array.
* We can *broadcast* the shapes together. This process is described extensively in the [documentation](https://numpy.org/doc/stable/user/basics.broadcasting.html). Basically, one of the key ideas is making sure that, working from the **last** dimension to the first for each, the respective dimensions are either **equal** or **one of them is one**. We won't see too many examples of this, but I encourage you to look at the documentation.

Let's do a few examples of basic array operations:

In [249]:
# Your code here.

my_arr = np.arange(30).reshape(5, -1)

print(my_arr)
print("")

my_one_column = np.ones((5, 1))

print(my_one_column)
print("")

my_other_arr = (np.arange(30).reshape(5, -1) - 15) ** 2

print(my_other_arr)
print("")

print(my_arr + my_other_arr)
print("")

print(my_other_arr - my_arr)
print("")

print(my_arr - my_one_column)

[[ 0  1  2  3  4  5]
 [ 6  7  8  9 10 11]
 [12 13 14 15 16 17]
 [18 19 20 21 22 23]
 [24 25 26 27 28 29]]

[[1.]
 [1.]
 [1.]
 [1.]
 [1.]]

[[225 196 169 144 121 100]
 [ 81  64  49  36  25  16]
 [  9   4   1   0   1   4]
 [  9  16  25  36  49  64]
 [ 81 100 121 144 169 196]]

[[225 197 171 147 125 105]
 [ 87  71  57  45  35  27]
 [ 21  17  15  15  17  21]
 [ 27  35  45  57  71  87]
 [105 125 147 171 197 225]]

[[225 195 167 141 117  95]
 [ 75  57  41  27  15   5]
 [ -3  -9 -13 -15 -15 -13]
 [ -9  -3   5  15  27  41]
 [ 57  75  95 117 141 167]]

[[-1.  0.  1.  2.  3.  4.]
 [ 5.  6.  7.  8.  9. 10.]
 [11. 12. 13. 14. 15. 16.]
 [17. 18. 19. 20. 21. 22.]
 [23. 24. 25. 26. 27. 28.]]
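
The broadcasting rule described above (compare the dimensions from last to first, and each pair must be equal or contain a one) can be written as a few lines of Python. This is a simplified sketch of the rule for two shapes, not NumPy's actual implementation, and `can_broadcast` is a hypothetical helper name.

```python
import numpy as np


def can_broadcast(shape_a: tuple, shape_b: tuple) -> bool:
    """Check whether two shapes are compatible, working from the last axis."""
    for dim_a, dim_b in zip(reversed(shape_a), reversed(shape_b)):
        # Each pair of dimensions must be equal, or one of them must be 1.
        if dim_a != dim_b and dim_a != 1 and dim_b != 1:
            return False
    # Extra leading dimensions in the longer shape never cause a conflict.
    return True


print(can_broadcast((5, 6), (5, 1)))  # like my_arr and my_one_column
print(can_broadcast((5, 6), (6,)))    # a single row applied to every row
print(can_broadcast((5, 6), (5,)))    # last dimensions are 6 and 5

# NumPy applies the same rule when it performs the operation.
result = np.ones((5, 6)) + np.ones((6,))
print(result.shape)
```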


#### Advanced Mathematical Operations

In addition to the basic mathematical operations we've seen before, there are new operations we can perform.

One of the most important ones is **matrix multiplication**. We can perform matrix multiplication using the `@` operator between two arrays. Let's see an example:

In [250]:
# Your code here.

my_matrix = np.array([
    [3, 2, 1],
    [4, 6, 5],
    [7, 0, 2]
])

my_column_vector = np.array([1, 5, -2]).reshape(3, 1)

print("My matrix product is:")

print(my_matrix @ my_column_vector)

My matrix product is:
[[11]
 [24]
 [ 3]]
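
It's easy to mix up `*` and `@`, so here is a tiny sketch that applies both to the same pair of 2x2 arrays. The arrays `a` and `b` are made up for this comparison.

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[0, 1], [1, 0]])

# Element-wise: each entry of a is multiplied by the matching entry of b.
elementwise = a * b
print(elementwise)

# Matrix product: rows of a are combined with columns of b.
matrix_product = a @ b
print(matrix_product)
```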


There are other matrix and vector operations that we can perform, such as dot products. We won't see those here, but I encourage you to read up on them in the [documentation](https://numpy.org/doc/stable/reference/routines.linalg.html).

We can also perform element-wise trigonometric functions, such as `cos` and `sin`. These exist as functions in the NumPy package. When we import NumPy, we can simply call these functions using the dot notation.
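
For instance, here is a minimal sketch applying `np.sin` and `np.cos` to a whole array of angles at once. The results are rounded only to hide tiny floating-point errors like the one we saw with `np.cos(np.pi / 2)`.

```python
import numpy as np

# Angles in radians.
angles = np.array([0.0, np.pi / 2, np.pi])

# Each function is applied to every element of the array.
sines = np.sin(angles)
cosines = np.cos(angles)

print(np.round(sines, 3))
print(np.round(cosines, 3))
```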

There are also functions that allow us to learn more about the data represented in a NumPy array. Here are some of the methods we can call on arrays:

* `sum` - compute the sum of elements.
* `mean` - compute the mean of elements.
* `var` - compute the variance of elements (by default, dividing by `n`, but you can divide by `n-1` by passing `ddof=1`).
* `std` - compute the standard deviation of elements (square root of the variance).

These functions can be called on either **all values** in an array, or all values **along a specific axis**. To specify the axis, we use the `axis` keyword argument. We can also use **negative indexing** for specifying the axis number. So, if we want to compute the variance of all elements, going along the last axis, we can write:

```python

my_variances = arr.var(axis=-1)

```

Let's see a few examples:

In [251]:
# Your code here

my_array = np.array([
    [3, 5, 6],
    [2, 3, 1],
    [6, 7, 2]
])

print(f"The sum of all entries is {my_array.sum()}")
print(f"The sums of rows are {my_array.sum(axis=1)}")
print(f"The sums of columns are {my_array.sum(axis=0)}")


print(f"The mean of all entries is {my_array.mean()}")
print(f"The means of rows are {my_array.mean(axis=1)}")
print(f"The means of columns are {my_array.mean(axis=0)}")

The sum of all entries is 35
The sums of rows are [14  6 15]
The sums of columns are [11 15  9]
The mean of all entries is 3.888888888888889
The means of rows are [4.66666667 2.         5.        ]
The means of columns are [3.66666667 5.         3.        ]
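
The variance and standard deviation work the same way. Here is a short sketch on a small made-up array, computing the statistics along the last axis (so, one value per row). It also shows the `ddof` keyword argument, which switches from dividing by `n` to dividing by `n-1`.

```python
import numpy as np

data = np.array([
    [1, 2, 3],
    [2, 4, 6],
])

# Default: divide by n (the number of elements along the axis).
population_var = data.var(axis=-1)

# With ddof=1: divide by n - 1 instead.
sample_var = data.var(axis=-1, ddof=1)
sample_std = data.std(axis=-1, ddof=1)

print(f"Variances (dividing by n): {population_var}")
print(f"Variances (dividing by n-1): {sample_var}")
print(f"Standard deviations (dividing by n-1): {sample_std}")
```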


### Array Indexing

These operations are great for working with complete arrays. But, remember that for lists, tuples and dictionaries, it was very important to access subsets of elements. Well, we can also access sub-arrays using indexing.

To index NumPy arrays, we can use square brackets `[]`. To index along multiple axes, we can use multiple numbers separated by commas. We can also index sub-ranges using colons `:`, just as with lists.

Let's see some examples:

In [258]:
# Your code here

# Let's make a 3D array
my_arr = np.arange(30).reshape(3, 5, 2)
print("My array:")
print(my_arr)
print("\n")

# In 3D, let's think of the array indices as (sheets, rows, columns).

my_first_sheet = my_arr[0]
print("My first sheet:")
print(my_first_sheet)
print("\n")

my_first_columns = my_arr[:, :, 0]
print("First respective columns:")
print(my_first_columns)
print("\n")

my_last_row_in_sheet_2 = my_arr[2, -1, :]
print("Last row in sheet 2:")
print(my_last_row_in_sheet_2)
print("\n")

My array:
[[[ 0  1]
  [ 2  3]
  [ 4  5]
  [ 6  7]
  [ 8  9]]

 [[10 11]
  [12 13]
  [14 15]
  [16 17]
  [18 19]]

 [[20 21]
  [22 23]
  [24 25]
  [26 27]
  [28 29]]]


My first sheet:
[[0 1]
 [2 3]
 [4 5]
 [6 7]
 [8 9]]


First respective columns:
[[ 0  2  4  6  8]
 [10 12 14 16 18]
 [20 22 24 26 28]]


Last row in sheet 2:
[28 29]




We can also use indexing in assignments, just as with lists. Here are a few examples:

In [259]:
# Your code here

# We can assign multiple values at once

# Here, we're changing the first column in all rows in the last sheet.
print("First change:")
my_arr[-1, :, 0] = 5
print(my_arr)

# Now, in the first sheet, let's change the first row.
print("\nSecond change:")
my_arr[0, 0] = np.array([4, 4])
print(my_arr)

First change:
[[[ 0  1]
  [ 2  3]
  [ 4  5]
  [ 6  7]
  [ 8  9]]

 [[10 11]
  [12 13]
  [14 15]
  [16 17]
  [18 19]]

 [[ 5 21]
  [ 5 23]
  [ 5 25]
  [ 5 27]
  [ 5 29]]]

Second change:
[[[ 4  4]
  [ 2  3]
  [ 4  5]
  [ 6  7]
  [ 8  9]]

 [[10 11]
  [12 13]
  [14 15]
  [16 17]
  [18 19]]

 [[ 5 21]
  [ 5 23]
  [ 5 25]
  [ 5 27]
  [ 5 29]]]


NumPy is a **huge** library. So, it's impossible to know and remember all the functions, their arguments, and all the details about them.

Instead, it's better to become familiar with the **documentation**. It can be found on the NumPy website, at https://numpy.org/doc/stable/index.html. Let's look at the documentation for some of the functions we've discussed.

The docs give detailed information about every class, function, constant and method available. There are instructions on how to use (and not use) everything. When in doubt, **check the documentation**. This should be your **first stop**. Unlike checking on Stack Overflow, the docs give you **official information** and **important guidance**. It's also important to remember that sometimes, just because code *works*, that doesn't mean the code is *right*. The documentation will help clear up any confusion.

## Packages - Hands-on Activity: DNA Sequence Statistics

We've now reached the end of our module on packages. Let's now do a hands-on activity, using NumPy to compute basic statistics on DNA and protein sequences. 

Here's the problem: Let's say we have a collection of many DNA sequences of the **same length**. We want to record whether the nucleotide at each position in each sequence is a C or a G, and then use this to determine the proportion of sequences that have a C or G at each position. (Taking the mean along the other axis would instead give the CG proportion within each sequence.)


In [262]:
# Your code here.

sequence1 = "TCACGTTCCAACGCCTACGAATTTAACCGATTGATTTCTCGAAATATCCGTGAGGGTTCGAGCCCTACCAGCGAGCGCCACAATGACGTGTCTCCACTAACACGCTGGCGGATGCTGAGGTGTGCTGTACAGGCTAGACCAGAGAGCATGATGGGCCCACCCACCACGAGATTCAGTTGTCACTTTTCCGGATCTCAGGATAACCGTACTATCGACAATTCATTTCGAGCCACCACAAATCAGCCCACTCATTAGATGTCGTTATCCCAACCAATTATGAACGGTGGCGATTTCACATTTTAGCCACCCGGTCATCGACTGTGGAATCGAGGAGTTAAATTCTCTTAGAGAACTTCCATCGGAAAAGAGCTAATTAGCACTCGGCTTGTGCATGTCCAACGTCCATGTCCAGACGCGTACAGGTGCGCGGGATACACACGCTTTCTGAAGTTTGACTAACTCGCCCCGCCCACGGTCAATCCCTATAGTTATATGTCACGAATAGTCGGAGACGCTATCCGTTGTAAAGCTTCAGTTGGCTAAACCAAAGTACGTCTGCCAAGTATCCTCCGTCACGTCCAGATAAACTTTCGTACTCAGGATCCGAGTTAATTGGCCTCTAGAATGATACTACCTACGCAAAGTCTGAGCCGTTGCGTGCACTGCGGCCTTACATTGGAAACTTAAACACCGGGATCTTCGCATGCTATTGAACTAGAAATTCACATAGTCGGGGAGCTCAAGATCACACACACGGGCTACATCGCAACTTTCCTCGTTATGCTCAAACTATAGGCTGGCAAGGGGCCTTTAAGGATAGATGGCCCTGATCGTGGTAGCTCAATCGATGGTGATACGAGAAGCTCCTGCAAGAAAAAAATGGTGCTCAAATTCTTGCGCGATGCAGATTTAGCTTACCCCTCAGTGCTATGCACGAGTGTTTGCCGGCAAGGTAAGTCCATTCCATGTTACACAGCGCCGTGTAGACCCTACCGAGTCA"
sequence2 = "GCGGTTAAGGGAGGCTCGCGGCCACGCAACTAACCGGCGAGGTGGCGCCCTTGCCCCCGGTCCTGCATGATGGAGGCGATACGCGCCTTCCGGTTGTCTAGCAGGGGTAGAGCCGGGCCCTGAATAAAGGGCTGAGGGGGCTCGATAGCGTAGGGGCTGGTTAGAAGTACCGGCCGCGAGACGCCCCGGGGGGACGAACCCATAAGCATGCCCCGTACTGGTGGGTGAGGCACCCGAAGATGTGGCGCCAGCGGGTGTCCGGACGGCCGCTACTCGGGACCCACTCGTCCCGTCCTACAGTATTGGCCGCCCGACCAGACCGGGCCGCGCTGGAAGCCCGGAGTAGAGCGCCGGCCCGCCCAGACTACAAGGGGTCCTTCGGTACCAGCGGTCCGCGCGGTATTCCAGTCCTGAGTCGAGCTGAAGGACCCCCTCGCGAGGCGTTCGGTGTCGCCCGGAATGGGGGCCGGCTCTGACGGCGCCACCGGGGGCGCACGGGTCCGGCCGTTCATACCGGTGCGGGGACCGTAGGGCGTACATCATCTGCCGAAGACCTGGCTCTCGGTCCGCGAACCAGAACTCAGAAGCCGGCGCTGGCTCAAGGCCCCCCGACTGCTCCCGACGGTTTCTGGCCGAGAGCAGCCTTAACTTTGCATAATACGTGCTTGCACCTCCATGGCCCTCGGGACCCGGTGTCCCCCCCTAGGAAGCTCACAGACCCTCGGTGGGGCCTGCCAGACCAACGACAGCCGTCCCCTGCAGCGGGGACGCCCCGGGCTGGGGTGCGTTGTACCTCCTCCGTCCGTTGGAGCGCTCGCATCCTAGAAACCTTACCTCGCTGATCGAAGCTGGGAGTCGGTGCCAGGTGGCGGGACGCGCGGTGCAATCAGCGCACGAGAGAGCCCCCCTGCAGCGCCGGGGCTTCTCAAGGCCGTGAACGTAGGACCGGGTGAGCAACTTGTGCCCGGCCACGGCGAGGTTTCCCGGGCACATGCTCTTG"

dna_seq1 = DnaSequence(dna_sequence=sequence1, sequence_name="Sequence 1")
dna_seq2 = DnaSequence(dna_sequence=sequence2, sequence_name="Sequence 2")

my_sequences = [dna_seq1, dna_seq2]

def compute_cg_proportion(sequences: list[DnaSequence]) -> np.ndarray:
    """
    Compute the CG proportion at each position in sequences of the same length.

    This function computes the proportion of nucleotides (between 0 and 1)
    that are C or G at each position in a set of nucleotide sequences of the
    same length.

    Parameters:
        - sequences: List containing DNA sequences of the **same length**.

    Returns:
        - NumPy array of shape (sequence_length,) where sequence_length 
          corresponds to the common length of the nucleotide sequences.
    """

    # Get the sequence length and the number of sequences
    sequence_length = len(sequences[0].dna_sequence)
    number_of_sequences = len(sequences)

    # Create an array where we will record whether each nucleotide is C or G
    cg_count_array = np.zeros((number_of_sequences, sequence_length))

    # Let's iterate over the sequences. We need the indices!
    for i in range(number_of_sequences):

        # Get the sequence
        current_sequence = sequences[i].dna_sequence.upper()

        # Mark the positions that contain C or G:
        for j in range(sequence_length):
            nt = current_sequence[j]

            if nt in "CG":
                cg_count_array[i, j] = 1

    # Now, we need to take the mean along each column to get the proportion!
    cg_prob_array = cg_count_array.mean(axis=0)

    return cg_prob_array


Now, let's test it!

In [263]:
my_cg_array = compute_cg_proportion(my_sequences)

print(my_cg_array)

[0.5 1.  0.5 1.  0.5 0.  0.  0.5 1.  0.5 0.5 0.5 1.  1.  1.  0.  0.5 1.
 1.  0.5 0.5 0.5 0.5 0.  0.5 0.5 1.  0.5 0.5 0.5 0.  0.  0.5 0.5 0.5 0.5
 0.5 1.  0.5 0.5 1.  0.5 0.  0.5 0.5 0.5 0.5 1.  1.  1.  0.  0.5 0.5 1.
 1.  1.  0.5 0.5 1.  1.  0.  1.  1.  0.5 1.  0.5 0.  0.5 1.  0.  0.5 1.
 1.  0.  1.  1.  1.  1.  0.5 0.  0.5 0.5 0.5 0.5 1.  0.5 1.  0.5 0.  1.
 0.5 1.  0.5 0.5 0.5 0.5 0.5 0.5 0.  0.  1.  0.5 0.5 1.  1.  0.5 1.  0.5
 0.5 1.  0.5 0.5 0.5 1.  1.  0.5 1.  0.5 1.  1.  0.  1.  0.  0.5 0.5 0.
 0.5 0.  0.5 1.  0.5 1.  0.5 1.  0.  0.5 1.  0.5 1.  1.  0.5 0.5 0.5 1.
 0.  0.5 0.5 0.5 0.5 1.  0.  0.  1.  1.  1.  1.  1.  0.5 0.5 1.  0.5 0.5
 0.  1.  0.5 0.  1.  0.5 0.  1.  0.5 0.5 0.5 1.  0.5 1.  0.5 0.5 0.5 0.5
 0.5 0.5 1.  0.5 0.5 0.5 0.5 1.  1.  1.  1.  0.5 0.5 0.5 0.5 1.  0.  0.5
 1.  0.5 0.5 0.  0.  0.5 0.5 1.  0.5 0.  0.5 0.5 0.5 0.5 1.  1.  0.5 0.5
 0.  0.5 0.  0.5 1.  0.  0.5 0.5 0.5 0.5 1.  0.  1.  1.  1.  0.  1.  1.
 0.5 1.  0.  0.  0.5 0.  0.5 0.5 0.5 1.  1.  1.  0.5 1.  0

In [264]:
# And now for a shorter example:
sequences = ["ATCCATTACAGGCGCA", "CGACAGTGCATGTGAA", "CGACGTAGTCGATGCA", "TCGAGCTAGCATCGAT",
              "CAGTACGTACGATCGA", "CAGTACGTACGATCGA","CGATCGATCGATCGAT",]

# This is a list comprehension! They're really cool and fun!
sequences = [DnaSequence(dna_sequence=seq, sequence_name=f"Sequence{i}") for i, seq in enumerate(sequences)]

gc_probs = compute_cg_proportion(sequences)

print(gc_probs)

[0.71428571 0.57142857 0.57142857 0.42857143 0.42857143 0.71428571
 0.28571429 0.28571429 0.57142857 0.71428571 0.57142857 0.28571429
 0.42857143 1.         0.57142857 0.        ]


For fun, here's an alternative approach that relies on NumPy arrays containing strings! Yes, those are possible!

In [302]:
def compute_cg_proportion(sequences: list[DnaSequence]) -> np.ndarray:
    """
    Compute the CG proportion at each position in sequences of the same length.

    This function computes the proportion of nucleotides (between 0 and 1)
    that are C or G at each position in a set of nucleotide sequences of the
    same length.

    Parameters:
        - sequences: List containing DNA sequences of the **same length**.

    Returns:
        - NumPy array of shape (sequence_length,) where sequence_length 
          corresponds to the common length of the nucleotide sequences.
    """

    # Split sequences into the nucleotides (see https://www.geeksforgeeks.org/python-split-string-into-list-of-characters/)
    split_sequences = [list(seq.dna_sequence.upper()) for seq in sequences]

    # Create NumPy array based on the sequences
    nucleotide_array = np.array(split_sequences)

    # Create the voting array, with the same shape as the nucleotide array
    cg_count_array = np.zeros(nucleotide_array.shape)

    # Now, for the assignment magic! We're using conditional assignment!
    cg_count_array[nucleotide_array=="C"] = 1
    cg_count_array[nucleotide_array=="G"] = 1

    # Now, for the mean
    cg_probs = cg_count_array.mean(axis=0)

    return cg_probs

In [303]:
# And now for a shorter example:
sequences = ["ATCCATTACAGGCGCA", "CGACAGTGCATGTGAA", "CGACGTAGTCGATGCA", "TCGAGCTAGCATCGAT",
              "CAGTACGTACGATCGA", "CAGTACGTACGATCGA","CGATCGATCGATCGAT",]

# This is a list comprehension! They're really cool and fun!
sequences = [DnaSequence(dna_sequence=seq, sequence_name=f"Sequence{i}") for i, seq in enumerate(sequences)]

gc_probs = compute_cg_proportion(sequences)

print(gc_probs)

[0.71428571 0.57142857 0.57142857 0.42857143 0.42857143 0.71428571
 0.28571429 0.28571429 0.57142857 0.71428571 0.57142857 0.28571429
 0.42857143 1.         0.57142857 0.        ]
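
The conditional assignment used above relies on **Boolean masks**: comparing an array to a value gives an array of `True` and `False` with the same shape. Here is a stripped-down sketch of the same idea that works on plain strings instead of `DnaSequence` objects. It uses `np.isin` to test for C and G in one step, which is a variation on the approach above, and the three short sequences are made up for this example.

```python
import numpy as np

sequences = ["ATCG", "GGCA", "TTCA"]

# One row per sequence, one column per position.
nucleotides = np.array([list(seq) for seq in sequences])

# Boolean mask: True wherever the nucleotide is C or G.
is_cg = np.isin(nucleotides, ["C", "G"])
print(is_cg)

# True counts as 1 and False as 0, so the mean gives a proportion.
cg_per_position = is_cg.mean(axis=0)   # across sequences, at each position
cg_per_sequence = is_cg.mean(axis=1)   # across positions, within each sequence

print(cg_per_position)
print(cg_per_sequence)
```

Changing only the `axis` argument switches between the per-position and the per-sequence proportions.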


## Module Summary

Yay! We've reached the end of the module on packages. In this module, we learned:

* What Python **packages** are, how to find them on **PyPI** and **Anaconda** and how to **install** them using `pip` and `conda`.
* How to **import** modules, classes and functions into Python files.
* How to use **NumPy** to perform mathematical and array operations.
* How to **read documentation** for different NumPy classes and functions.

The skills we've seen in this module will come in very handy in the next module.

# Module 4 - Working with the Operating System and the User

It's all good and fun to have scripts that run in a vacuum. But, we often need to bring data in from files to process. We also need to take in arguments from the user, who may or may not know how to write Python code.

In this module, we'll see the basics of working with the operating system, reading and copying files. Then, we'll see how to organise our code into scripts that can be run from the command line.

## Working with the Operating System

Whenever you're using the computer, your **operating system** does most of the management work. It provides the basis for running all other software, managing the memory and the CPU. The operating system also manages the file system, allowing reading and writing files.

### Working with Text Files

Text files are simple files that contain... text! It's **very important** to know the difference between a *text* file and a *word processing* file. A **text file** contains only text. To open it, you would use a program like **Notepad** or **Notepad++** on Windows, **TextEdit** on macOS, or something like **Gedit** on Linux. You can also edit them on the command line using tools like `vim`, `emacs` or `nano`, or read them using `cat` or `less`. You **do not** open these files using Microsoft Word, LibreOffice or Pages.

Again, these are very simple files that contain **only text**. No images, no tables, no fancy fonts.

We can read and write these files using Python (see [here](https://docs.python.org/3/tutorial/inputoutput.html#tut-files) for complete details). To open a file, we use the built-in `open` function. This function is documented [here](https://docs.python.org/3/library/functions.html#open).

When calling this function, here are the common arguments:

```python

my_file = open("filename.txt", mode="<mode>")

```

* The first argument is the **filename**. This positional argument is *required*.
* The second argument is the **mode**. Commonly-used values for this argument are:
    * `r` - open the file only for *reading* (default).
    * `w` - open the file only for *writing*. **If the file exists, it will be cleared and overwritten**.
    * `a` - open the file only for *appending*. If the file exists, new text is added to the end.

These values for the `mode` are specifically for working with text files. There are *slightly* different values for a different type of file, known as a *binary* file (which we won't see).

When you're done working with a file, you close it using the `close` method.
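
To make the difference between the modes above more concrete, here is a minimal sketch. The file name `mode_demo.txt` and the temporary folder are assumptions for illustration only, chosen so that nothing in your working directory is touched.

```python
import os
import tempfile

# Work in a throwaway folder so that no real files are overwritten.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "mode_demo.txt")  # hypothetical file name

# "w" creates the file (or clears it if it already exists).
demo_file = open(demo_path, mode="w")
demo_file.write("first line\n")
demo_file.close()

# "a" keeps the existing text and adds to the end.
demo_file = open(demo_path, mode="a")
demo_file.write("second line\n")
demo_file.close()

# "r" (the default) opens the file for reading only.
demo_file = open(demo_path, mode="r")
contents_after_append = demo_file.read()
demo_file.close()

# Opening with "w" again wipes out everything written so far.
demo_file = open(demo_path, mode="w")
demo_file.write("only line\n")
demo_file.close()

demo_file = open(demo_path)
contents_after_overwrite = demo_file.read()
demo_file.close()

print(repr(contents_after_append))
print(repr(contents_after_overwrite))
```

After the second `"w"` call, only the last text written remains in the file, which is why this mode should be used with care.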

#### Reading Files

We've seen how to open the files. Now let's look at how to read. There are a few different ways to read a file. The most direct way is to use a `for` loop to read the file line-by-line. We can then process each line.

Here's the syntax:

```python

my_file = open('my_file.txt')

my_lines = []

for line in my_file:
    my_lines.append(line)

my_file.close()

```

Remember to close the file!

Let's see an example with a text file I've included called "my_text_file.txt":

In [266]:
# Your code here

my_file = open("my_text_file.txt")

print(my_file.readlines())

my_file.close()

['Hello, MiCM workshop!\n', 'This is a text file!\n', "It's a nice, simple file.\n", 'The only thing in here is text...']


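Alternatively, we can use the file methods `readline` and `readlines` to read a single line, or all lines (as a list of strings), respectively. The `read` method returns the entire contents of the file as a single string.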
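
As a quick comparison of the reading methods, here is a small sketch. It first writes its own two-line file (`reading_demo.txt`, a hypothetical name) in a temporary folder so that it can run on its own.

```python
import os
import tempfile

demo_path = os.path.join(tempfile.mkdtemp(), "reading_demo.txt")  # hypothetical file

demo_file = open(demo_path, "w")
demo_file.write("line one\nline two\n")
demo_file.close()

# `read` returns the whole file as one string.
demo_file = open(demo_path)
whole_text = demo_file.read()
demo_file.close()

# `readline` returns only the next line (including its newline character).
demo_file = open(demo_path)
first_line = demo_file.readline()
second_line = demo_file.readline()
demo_file.close()

# `readlines` returns a list with one string per line.
demo_file = open(demo_path)
all_lines = demo_file.readlines()
demo_file.close()

print(repr(whole_text))
print(repr(first_line))
print(all_lines)
```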

We can also change the syntax a bit so that we don't have to explicitly close the file. This approach involves using a `with` statement:

```python

with open("my_file.txt") as my_file:
    all_lines = my_file.readlines()

    print(f"All lines in file are:\n {all_lines}")

```

Here's an explanation of the `with` statement. Immediately after `with`, we put in code that runs and returns a value. This value is given the name provided after `as`. Once the `with` block finishes, the file is automatically closed.

Let's re-write our example from before:

In [267]:
# Your code here

with open("my_text_file.txt") as my_file:
    print(my_file.read())

Hello, MiCM workshop!
This is a text file!
It's a nice, simple file.
The only thing in here is text...


#### Writing Files

In addition to reading files, we can also write files. The key methods are `write` and `writelines`. It is very important to note that the newline character `\n` is **not** automatically added to each line.

Let's do an example, where we write to `my_new_text.txt`:

In [269]:
# Your code here
my_file = open("my_new_text.txt", "w")
my_file.write("Hello, world!\n")
my_file.write("This file was written using Python!\n")
my_file.close()

with open("my_new_text2.txt", "w") as my_file:
    my_file.write("Hello, world!\n")
    my_file.write("This file was written using Python!\n")

# There's no need to close the file when using `with`

**Note:** If you don't use the `with` syntax, make sure to `.close()` the file to ensure your contents are written!
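
The `writelines` method was mentioned above but not shown, so here's a minimal sketch. It writes to a hypothetical file called `writelines_demo.txt` in a temporary folder. Notice that we add the `\n` to each string ourselves.

```python
import os
import tempfile

lines_to_write = ["Hello, world!", "This file was written using Python!"]

demo_path = os.path.join(tempfile.mkdtemp(), "writelines_demo.txt")  # hypothetical file

with open(demo_path, "w") as demo_file:
    # `writelines` writes each string as-is, so we add the newlines ourselves.
    demo_file.writelines([line + "\n" for line in lines_to_write])

with open(demo_path) as demo_file:
    written_text = demo_file.read()

print(written_text)
```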

### Operating System - Hands-On Activity I: Working with FASTA Files

Often, DNA and protein sequences are stored as FASTA files. The file format is described in detail [here](https://blast.ncbi.nlm.nih.gov/doc/blast-topics/). In the file, there are two types of lines:

* Description lines, which begin with the character `>`. These lines **don't** contain the sequences.
* Sequence lines, which contain nucleotides or amino acids.

Let's work on writing code to read and write FASTA files based on our DNA and RNA objects. I've included some sample sequence files in the `sequences` directory.

In [272]:
# Your code here
from typing import List

my_filename = "../code/sequences/non_template_sequences_biased.fasta"

my_sequences: List[DnaSequence] = []

with open(my_filename) as fasta_file:

    current_sequence = ""
    current_name = ""

    for line in fasta_file:

        current_line = line.strip()

        if current_line.startswith(">"):

            if current_sequence != "":
                new_sequence_object = DnaSequence(current_sequence, current_name)
                my_sequences.append(new_sequence_object)

                current_sequence = ""

            sequence_name = current_line[1:].strip()
            current_name = sequence_name

        else:
            current_sequence += current_line

print(len(my_sequences))

99


In [304]:
# Let's now rewrite this as a function!

def read_dna_fasta_file(filename: str) -> List[DnaSequence]:
    """
    Read DNA FASTA file.

    This function reads a DNA FASTA file and creates a list of ``DnaSequence`` objects.

    Parameters:
        - filename: path to the FASTA file.

    Returns:
        - List of ``DnaSequence`` objects representing the loaded DNA sequences.
    """

    my_sequences: List[DnaSequence] = []

    with open(filename) as fasta_file:

        current_sequence = ""
        current_name = ""

        for line in fasta_file:

            current_line = line.strip()

            if current_line.startswith(">"):

                if current_sequence != "":
                    new_sequence_object = DnaSequence(current_sequence, current_name)
                    my_sequences.append(new_sequence_object)

                    current_sequence = ""

                sequence_name = current_line[1:].strip()
                current_name = sequence_name

            else:
                current_sequence += current_line

    return my_sequences

def write_fasta_file(sequences: List[str], names: List[str], filename: str):
    """
    Write FASTA file containing sequences of any type.

    This function writes a FASTA file, regardless of the type of sequences passed in.
    Lines in the FASTA file have a fixed width of 80 characters.

    Parameters:
        - sequences: sequences to write. May be nucleotide or amino acids.
        - names: sequence names, or other metadata to encode in the FASTA file.
        - filename: path to the FASTA file.

    Returns:
        - None.
    """

    with open(filename, "w") as fasta_file:
        for seq, name in zip(sequences, names):
            fasta_file.write(f"> {name}\n")

            # We want to clip the lines to be a certain length so that the file is easier to read.
            # Let's say 80 characters.
            line_length = 80
            seq_length = len(seq)

            number_of_full_lines = seq_length // line_length
            remaining_chars = seq_length % line_length

            for i in range(number_of_full_lines):
                
                # Get the sub_sequence
                sub_sequence = seq[i * line_length: (i+1) * line_length]
                fasta_file.write(f"{sub_sequence}\n")

            # Check if we have a remainder
            if remaining_chars > 0:
                # Take the last `remaining_chars` characters
                sub_sequence = seq[-remaining_chars:]
                fasta_file.write(f"{sub_sequence}\n")
            

In [275]:
# Let's now try it!

my_sequences = read_dna_fasta_file(my_filename)

print(f"We have {len(my_sequences)} DNA sequences loaded from {my_filename}")

We have 99 DNA sequences loaded from ../code/sequences/non_template_sequences_biased.fasta
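
One detail to watch for in the reading loop above: a sequence is only appended when the *next* description line is reached, so the last sequence in the file is never added to the list. Here is a minimal sketch of one way to handle this. It uses a hypothetical helper, `parse_fasta_lines`, that works on plain strings and returns `(name, sequence)` tuples instead of `DnaSequence` objects, so that it can run without the rest of the notebook.

```python
from typing import Iterable, List, Tuple


def parse_fasta_lines(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """
    Parse FASTA-formatted lines into (name, sequence) pairs.

    Hypothetical helper for illustration only.
    """
    records: List[Tuple[str, str]] = []

    current_name = ""
    current_sequence = ""

    for line in lines:
        current_line = line.strip()

        # Skip blank lines.
        if current_line == "":
            continue

        if current_line.startswith(">"):
            # A new description line means the previous record is complete.
            if current_sequence != "":
                records.append((current_name, current_sequence))
                current_sequence = ""

            current_name = current_line[1:].strip()
        else:
            current_sequence += current_line

    # No description line follows the last record, so add it after the loop.
    if current_sequence != "":
        records.append((current_name, current_sequence))

    return records


example_lines = [">seq0\n", "ATCG\n", "GGTA\n", "\n", ">seq1\n", "TTAA\n"]
parsed_records = parse_fasta_lines(example_lines)
print(parsed_records)
```

The same final check could be added after the loop in `read_dna_fasta_file`, creating a `DnaSequence` instead of a tuple.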


### Files and Paths - Using `os` and `shutil`

We've just seen how to read and write files. The most important argument when performing file tasks is the filename. A filename typically contains **two** parts:

* The **file path** - the address where the file is located on your file system. This represents a *folder* or a *directory* on your hard drive. In some cases, this part is optional.
* The **filename** - the actual name of the file. This name *usually* consists of a name and an *extension*, like `.txt` or `.csv`, that indicates what type of file it is.

If the file is located in the same folder as your Python code, you don't need to provide the *file path*. Otherwise, this element is required. There are two different types of paths:

* **Absolute** paths - where the file is found with respect to the *filesystem root*, which is the top level of the drive where the file is found. On Windows, this is usually your `C:\` drive. On macOS and Linux, the filesystem root is represented as `/`.
* **Relative** paths - where the file is found with respect to your *current location*. There are a couple of important shortcuts to know. `.` represents the current folder while `..` represents the parent folder (the folder above where you currently are).

The Python Standard Library provides us with some tools for working with files, folders and paths.

For working with paths, we can use the `os` module ([docs](https://docs.python.org/3/library/os.html)). This module provides a number of tools for interacting with the operating system. In the submodule `os.path` ([docs](https://docs.python.org/3/library/os.path.html)), we have a number of useful functions:

* `os.path.join` - join together multiple path elements to produce a full path. This is useful if creating a new path based on variables.
* `os.path.abspath` - convert a relative path to an absolute path.
* `os.path.relpath` - convert an absolute path to a relative path.
* `os.path.isdir` - check to see if a path represents a directory.
* `os.path.isfile` - check to see if a path represents a file.
* `os.path.dirname` - separate out the file path from the complete path.
* `os.path.basename` - separate out the filename from the complete path.
* `os.path.splitext` - split a filename into a base and an extension.

For exploring a given directory, `os` provides useful functions. We can use `os.listdir` to see the names of all files and folders in a given directory. We can even create new folders using `os.mkdir`.

Let's do a few examples with paths to understand a bit more:

In [277]:
# Your code here
import os

# Let's start with a dummy path name (so that I don't share my entire directory structure 😅 )
my_reference_path = os.path.join("Users", "benjamin", "documents")
print(f"My reference path is: {my_reference_path}")

my_folder_path = os.path.join(my_reference_path, "micm_workshop")
print(f"My folder path is: {my_folder_path}")

my_folder_relative_path = os.path.relpath(my_folder_path, start=my_reference_path)
print(f"My folder path relative to the reference is: {my_folder_relative_path}")

my_filename = os.path.join(my_folder_path, "scripts", "dna_script.py")
print(f"My script file path is: {my_filename}")

my_file_base_name = os.path.basename(my_filename)
my_file_dir_name = os.path.dirname(my_filename)
print(f"My script has a base name of {my_file_base_name} and is in directory {my_file_dir_name}")

file_base, file_ext = os.path.splitext(my_file_base_name)
print(f"My file has a base name of {file_base} and has extension {file_ext}")


My reference path is: Users/benjamin/documents
My folder path is: Users/benjamin/documents/micm_workshop
My folder path relative to the reference is: micm_workshop
My script file path is: Users/benjamin/documents/micm_workshop/scripts/dna_script.py
My script has a base name of dna_script.py and is in directory Users/benjamin/documents/micm_workshop/scripts
My file has a base name of dna_script and has extension .py
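
The example above didn't cover `os.path.abspath`, `os.path.isdir`, `os.path.isfile`, `os.listdir` or `os.mkdir`, so here's a short sketch of those. The folder and file names are hypothetical, and everything happens inside a temporary folder so that the code can run anywhere.

```python
import os
import tempfile

# Work inside a throwaway folder (an assumption, so that the sketch runs anywhere).
base_dir = tempfile.mkdtemp()

# Create a new folder and a small text file inside the base folder.
new_folder = os.path.join(base_dir, "my_new_folder")  # hypothetical names
new_file = os.path.join(base_dir, "notes.txt")

os.mkdir(new_folder)

with open(new_file, "w") as notes_file:
    notes_file.write("Some notes\n")

# `os.listdir` returns the names (not the full paths) of everything in the folder.
entries = sorted(os.listdir(base_dir))
print(entries)

for entry in entries:
    full_path = os.path.join(base_dir, entry)
    print(entry, "| folder:", os.path.isdir(full_path), "| file:", os.path.isfile(full_path))

# Relative paths are resolved with respect to the current working directory.
relative_path = os.path.join("..", "sequences")
absolute_path = os.path.abspath(relative_path)
print(os.path.isabs(relative_path), os.path.isabs(absolute_path))
```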


These functions are great for working with paths and creating folders. But let's say we want to **move** or **copy** files. What can we do? Well, Python provides the `shutil` module ([docs](https://docs.python.org/3/library/shutil.html)). Here are some of the useful functions in this module:

* `shutil.copy` and `shutil.copy2` - copy a file from one location to another (`copy2` also tries to preserve the file's metadata, such as modification times).
* `shutil.copytree` - copy an entire folder hierarchy from one location to another.
* `shutil.rmtree` - delete a folder hierarchy and all included files.
* `shutil.move` - move a file or folder from one location to another.

These functions are very useful if you want to perform processing steps and keep all your files in a temporary folder (to avoid having to keep using long paths).

Let's see some examples:

In [286]:
# Your code here

import shutil

shutil.copy("my_text_file.txt", "copied_text_file.txt")

os.mkdir("copied_text_files")
shutil.move("copied_text_file.txt", "copied_text_files")

'copied_text_files/copied_text_file.txt'

In [287]:
# Now we can delete the file and folder!
shutil.rmtree("copied_text_files")

### Operating System - Hands-On Activity II: Processing FASTA Files

Earlier we wrote code to read and write FASTA files. Well, now let's take things a step further. Let's write code to copy a specific FASTA file to a new folder called `DNA` in a specified parent folder, read the file, perform translation, save the new file in a folder called `Proteins` and then copy the protein file back to the original location. Performing this task will involve using `os` and `shutil`.

In [307]:
# Your code here

def process_fasta_files(fasta_file: str, temp_dir: str = "temp", is_template_strand: bool = True):
    """
    FASTA File Processing

    Translate DNA from a specified FASTA file to a new FASTA file containing the protein sequences.

    Parameters:
        - fasta_file: FASTA file containing the DNA sequences.
        - temp_dir: Name for temporary directory where the processing will occur.
        - is_template_strand: Indicate whether DNA is on template strand.
    """

    # Delete the temp folder if it exists
    if os.path.isdir(temp_dir):
        shutil.rmtree(temp_dir)

    # Create our temporary directory structure.
    os.mkdir(temp_dir)

    dna_path = os.path.join(temp_dir, "DNA")
    protein_path = os.path.join(temp_dir, "Proteins")

    os.mkdir(dna_path)
    os.mkdir(protein_path)

    # Copy the DNA FASTA file to our temporary directory
    temp_fasta_location = os.path.join(dna_path, "dna_file.fasta")

    shutil.copy2(fasta_file, temp_fasta_location)

    temp_protein_file = os.path.join(protein_path, "protein_file.fasta")
    
    # Perform the translation
    dna_sequences = read_dna_fasta_file(temp_fasta_location)
    protein_sequences: List[str] = []
    protein_sequence_names: List[str] = []

    for sequence in dna_sequences:
        m_rna_sequence = sequence.transcribe(is_template_strand=is_template_strand)
        protein_sequence = m_rna_sequence.translate()

        protein_sequences.append(protein_sequence)
        protein_sequence_names.append(sequence.sequence_name)

    # Write the FASTA file
    write_fasta_file(sequences=protein_sequences, names=protein_sequence_names, filename=temp_protein_file)

    # Copy the protein file to the original location... but first, give the file a new name
    original_dna_base_filename = os.path.basename(fasta_file)
    original_dna_folder = os.path.dirname(fasta_file)
    original_dna_base_name_no_ext, original_dna_base_name_ext = os.path.splitext(original_dna_base_filename)

    protein_filename = f"{original_dna_base_name_no_ext}_PROTEIN{original_dna_base_name_ext}"

    final_protein_filename = os.path.join(original_dna_folder, protein_filename)
    
    shutil.copy2(temp_protein_file, final_protein_filename)



Now, let's test it with a file!

In [308]:
my_dna_file = "sequences/non_template_sequences_biased.fasta"
my_temp_dir = "./temp"

process_fasta_files(fasta_file=my_dna_file, temp_dir=my_temp_dir, is_template_strand=False)

Feel free to use any of the other files I've provided in the `code/sequences` folder.
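
As an aside, `process_fasta_files` creates and deletes its temporary folder by hand. A different technique (not the one used above) is to let the `tempfile` module from the Python Standard Library create a uniquely-named folder that is removed automatically. Here is a minimal sketch, where the file name `example.fasta` and its contents are made up for illustration.

```python
import os
import shutil
import tempfile

# Create a small input file to copy (hypothetical name and contents).
source_dir = tempfile.mkdtemp()
source_file = os.path.join(source_dir, "example.fasta")

with open(source_file, "w") as fasta_file:
    fasta_file.write("> Sequence0\nATCG\n")

with tempfile.TemporaryDirectory() as temp_dir:
    dna_path = os.path.join(temp_dir, "DNA")
    os.mkdir(dna_path)

    # `shutil.copy2` returns the path of the new copy.
    temp_copy = shutil.copy2(source_file, os.path.join(dna_path, "dna_file.fasta"))
    copy_existed_inside_block = os.path.isfile(temp_copy)
    print(f"Copy exists inside the block: {copy_existed_inside_block}")

# Once the `with` block ends, the temporary folder and its contents are deleted.
temp_dir_exists_afterwards = os.path.isdir(temp_dir)
print(f"Temporary folder still exists: {temp_dir_exists_afterwards}")
```

With this approach, any results that you want to keep need to be copied out of the temporary folder before the `with` block ends.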

## Basic Scripts

So far, all our Python code has existed in this Jupyter notebook. It's great for me to teach you the basics, but it makes running automated scripts a bit more difficult. In this section, we'll see how to write scripts that you can run easily from the command line. We'll then see how to make them a bit more complicated to allow users to configure options.

To create a Python script, first you need to create a Python file. This is a text file that has the `.py` extension. If you're using Microsoft Visual Studio Code to view this Jupyter Notebook, go to the **Explorer** button on the left side. Then, create a new directory under **Code**, called **MyScripts**. Then, create a new file called `test_script.py`. This file will serve as our script.

Let's write a quick test script that simply prints the string "Hello World!". But first, let's also write some documentation!
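
A minimal sketch of what `test_script.py` could contain is shown below. The wording of the documentation is my own assumption. The string at the very top of the file is the module's docstring, which serves as the documentation for the script.

```python
"""
Test script.

This script prints a greeting to the screen. It is only used to check that
we can run Python files from the command line.
"""

greeting = "Hello World!"

print(greeting)
```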

To run this script, let's now go to the command line. In vscode, we can easily open a terminal and navigate to our script directory. To run our Python file, we simply type:

```bash

python test_script.py

```

**Note:** If you're not using Anaconda, you may need to type `python3`. If you're on Windows, the terminal in vscode may not work. If this is the case, you will need to open the Anaconda Prompt from the *Anaconda Navigator*, then change to the script's parent directory and run the above line.

What's going on here? We're telling the computer to use `python` to run the file `test_script.py`.

So, this is the bare minimum we need for running a Python script. But, there's more we can do.

We saw how to use packages. We can also create our own packages and modules. To start building our DNA processing script, let's copy our DNA sequence class to a new file called `biological_sequences.py`. We can write a bit of code underneath to test this class.

Now, let's work on writing another file that uses this DNA sequence. Let's create `dna_processing.py`. Let's import `biological_sequences` and create a new DNA sequence and print it to the screen. You may be wondering why we should bother making another file. Well, if your project gets **really big**, you're going to want to have different components in different files.

And this raises an important issue. Let's run our `dna_processing.py` file. Wait!!! What's going on! All the code from `biological_sequences.py` is running!

To prevent this from happening, we can tell Python to only run the script if the file itself is run from the command line, but not to run everything if we're importing the file. In other programming languages, this would involve defining a `main` function or method. In Python, we nest our script functionality in the following `if` statement:

```python

if __name__ == "__main__":
    # Script code goes here...

```

Now, the script code only runs if the file itself is run directly (for example, using `python biological_sequences.py`). If the file is imported, this code doesn't run and we can simply use its classes and functions.
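
For example, a simplified stand-in for a file like `biological_sequences.py` could be organised as follows. The functions `describe_sequence` and `main` are hypothetical names used only for this sketch. Putting the script code in a `main` function is a common convention, but it isn't required.

```python
def describe_sequence(sequence: str) -> str:
    """Hypothetical helper that other files could import."""
    return f"Sequence of length {len(sequence)}: {sequence}"


def main():
    # Test code that should only run when this file is run directly.
    print(describe_sequence("ATCG"))


if __name__ == "__main__":
    main()
```

If another file imports `describe_sequence` from this file, the function is available but nothing is printed.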

### Basic Scripts - Hands-on Exercise I: DNA Processing Script

We've seen how easy it is to create a simple script. Now, let's take all our processing code so far and put it in a new script that takes all the sequences from our `sequences` folder, loads them, computes and prints the basic information from our NumPy section, performs translation and saves the outputs to a new `sequence_proteins` folder. Write this all in a file called `basic_script_exercise.py`.

In [None]:
# See scripts/basic_script_exercise.py

### Getting User Parameters

This script that we've written works well... but if we want to change where the sequences are coming from, we need to edit the script each time. This is complicated if we're dealing with lots of files and if we're giving the script to people who don't know how to write Python code. Instead of modifying the code, we'd like to pass **command line arguments** to the script.

#### Introduction to Linux Command Line Arguments

On the command line, we can pass arguments to a program in order to modify its behaviour. We'll specifically see the Linux/Unix/macOS command line. Recall that for functions, we had two types of arguments: positional and keyword. We have two similar types of command line arguments: **positional** and **optional**. Here is the syntax:

```bash

python my_script.py arg1 arg2 --opt-arg1 val1 --opt-arg2 val2 -q val3

```

**Positional arguments** are generally required and don't have any flags before them. **Optional arguments** are generally not required and are introduced by a flag, which can appear in one of two ways:

* A *single letter* after a single dash (`-`).
* A *longer name* after two dashes (`--`).

The optional arguments can go in any order, while the positional arguments must follow a specific order.
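
To get a feeling for how a command is broken up into separate arguments, here is a small sketch using `shlex.split` from the Python Standard Library, which splits a string using shell-like rules. The script name, file names and flags in the command are hypothetical.

```python
import shlex

command = "python my_script.py input.fasta output_dir --max-count 5 -q"

# `shlex.split` splits a string using shell-like rules.
tokens = shlex.split(command)
print(tokens)

# Inside the script, `sys.argv` would start at the script name,
# so the arguments themselves are everything after it.
script_arguments = tokens[2:]
print(script_arguments)
```

Here, `input.fasta` and `output_dir` are positional arguments, `--max-count` is an optional argument with the value `5` and `-q` is an optional argument without a value.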

#### Adding Command Line Arguments to Python - Using `argparse`

To add command line arguments to our script, we use the `argparse` module from the Python Standard Library ([docs](https://docs.python.org/3/library/argparse.html)). Here are the main steps that we use:

* Create an `argparse.ArgumentParser`. This object contains a description of the program and serves as a base for the arguments we will add.
* Add arguments to the parser using `argparse.ArgumentParser.add_argument`. This same function is used for both positional and optional arguments. We'll see it in more detail in a bit.
* Parse the arguments passed in from the command line using `argparse.ArgumentParser.parse_args`. This produces an `argparse.Namespace` object that holds the values of each argument. We can then pass these to functions and use them in our script.

We'll go into each step in a bit more detail. A nice thing about using `argparse` is that the parser can automatically generate a help screen that provides detailed information about all your command line arguments. This help screen is built using description strings that you pass when creating the parser and adding the arguments.

#### Creating the Argument Parser

To create the argument parser, we create an object of type `argparse.ArgumentParser`. There are several parameters that we can pass in, but for many of them the default value is quite good. The ones that I recommend setting are:

* `description` - provide a string explaining what the script does, as well as putting in any relevant citations (if you're implementing a published algorithm). This text is displayed at the *top* of the help screen.
* `epilog` - provide additional summary information about your script. It may be useful to put copyright and contact information here, as this text appears at the bottom of the help screen.

Our code usually looks something like this:

```python

parser = argparse.ArgumentParser(description="description here", epilog="epilog here")

```
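
To see where these two strings end up, here is a short sketch that builds a parser and prints its help screen using the `format_help` method. The description and epilog text are placeholders, and I've set `prog` (the program name shown in the help screen) to a hypothetical script name so that the output doesn't depend on where the code is run.

```python
import argparse

parser = argparse.ArgumentParser(
    prog="dna_processing.py",  # hypothetical; defaults to the name of the running script
    description="Translate DNA sequences from FASTA files into protein sequences.",
    epilog="Questions? Contact the workshop instructor.",
)

# `format_help` returns the same text that is shown when the user passes `-h` or `--help`.
help_text = parser.format_help()
print(help_text)
```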

#### Adding Arguments

To add arguments, we simply use the `add_argument` method ([docs](https://docs.python.org/3/library/argparse.html#the-add-argument-method)). This method takes in a number of arguments. Here are some of the main ones. See the documentation for a full explanation:

* The first argument is the *name* in the case of a positional argument or the *flags* in the case of an optional argument. For the optional arguments, you can specify **two** values: a short, single letter flag after a single dash, and a longer flag after two dashes. In the second case, if multiple words are combined, they are typically separated by a single dash (**not** an underscore).
* `type` - the type of the value (typically `str`, `int`, `float`).
* `required` - indicates whether the optional argument is required (although typically optional arguments are not required).
* `default` - the default value for the argument.
* `action` - specify an action for a special type of value (we'll see a few examples).
* `choices` - specify specific values that the argument can take.
* `help` - text to display for this argument in the help screen.

##### Actions

The `action` keyword argument in the `add_argument` function can take in a number of different values. We'll only discuss some of them here:

* `store_true` - if the optional argument is present, it has a value of `True` in the parsed namespace. Otherwise, it stores `False`.
* `store_false` - if the optional argument is present, it has a value of `False` in the parsed namespace. Otherwise, it stores `True`.
* `store_const` - if the optional argument is present, it has a value of `const` specified in the `add_argument` function. Otherwise, it has a value of `default`.

For each argument we want to add, we call `add_argument`.
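
Here is a minimal sketch of the three actions described above. All of the flag names are made up for this example. For `--no-header`, I've also used the `dest` keyword argument so that the value is stored under the name `header`.

```python
import argparse

action_parser = argparse.ArgumentParser(description="Demo of argparse actions.")

# `store_true`: False by default, True if the flag is present.
action_parser.add_argument("-v", "--verbose", action="store_true", help="Print extra information.")

# `store_false`: True by default, False if the flag is present.
action_parser.add_argument("--no-header", dest="header", action="store_false", help="Skip the header.")

# `store_const`: `default` if the flag is absent, `const` if it is present.
action_parser.add_argument("--strand", action="store_const", const="template", default="non-template",
                           help="Treat the DNA as the template strand.")

no_flags = action_parser.parse_args([])
all_flags = action_parser.parse_args(["--verbose", "--no-header", "--strand"])

print(no_flags.verbose, no_flags.header, no_flags.strand)
print(all_flags.verbose, all_flags.header, all_flags.strand)
```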

#### Parsing

To parse the arguments, we call the `parse_args` method on the argument parser object. There are two common ways to call this method:

* With a `list` of strings - each positional argument value, optional argument flag and optional argument value is a separate string in the list. If you have the arguments as a single string (as you would write them on the command line), you need to split it into a list first.
* With nothing - if no arguments are provided, the parser uses the command line arguments stored in `sys.argv`. Typically in a script, we use this.

The result of the method call is a `Namespace` object. The attributes of this object are named after the command line arguments from the parser. For positional arguments, the names are exactly as indicated. For optional arguments, the names follow the long-form names, without the preceding dashes and with all internal dashes replaced with underscores. You can also choose a different attribute name when adding an optional argument by using the `dest` keyword argument.

Let's write a sample parser here:

In [7]:
# Your code here

import argparse

my_parser = argparse.ArgumentParser(description="Test parser for demonstration.")
my_parser.add_argument("positional_argument", choices=["Opt1", "Opt2", "Opt3"], help="Required positional argument.")
my_parser.add_argument("-o", "--option-argument", type=int, default=32, help="Some numeric optional argument")
my_parser.add_argument("-t", "--test", action="store_true", help="Demo of `store_true`.")

parsed_vals = my_parser.parse_args(["Opt2", "-o", "5", "--test"])
print(parsed_vals)

Namespace(positional_argument='Opt2', option_argument=5, test=True)
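
Once the arguments are parsed, we access each value as an attribute of the `Namespace` object. The sketch below uses a separate parser with hypothetical argument names to show how the attribute names are derived, including one renamed using `dest`.

```python
import argparse

demo_parser = argparse.ArgumentParser(description="Test parser for demonstration.")
demo_parser.add_argument("input_name", help="Required positional argument.")
demo_parser.add_argument("-m", "--max-count", type=int, default=0, help="Numeric optional argument.")
demo_parser.add_argument("--label", dest="custom_name", default="none", help="Stored under a custom name.")

parsed = demo_parser.parse_args(["my_input", "--max-count", "5"])

# Positional arguments keep their name.
print(parsed.input_name)

# Dashes in the long flag become underscores in the attribute name.
print(parsed.max_count)

# `dest` replaces the name that would otherwise be derived from the flag.
print(parsed.custom_name)

# `vars` converts the namespace into a dictionary.
print(vars(parsed))
```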


### Basic Scripts - Hands-on Exercise II: DNA Processing Script with Arguments

To finish up this module, let's modify our script from the previous exercise to take in the following command line arguments:

* The input directory of FASTA DNA sequences (positional).
* The output directory of FASTA protein sequences (optional - default is a directory called "protein_output" in the current directory).
* The maximum number of sequences to process, as an integer (optional - default is 0, meaning to process all sequences).
* A flag indicating whether to compute the statistics and print the results.

Modify the code from the previous exercise.

In [None]:
# Your code in a separate file
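
If you're not sure where to start, here is one possible sketch of the parser for this exercise. The flag names (`--output-dir`, `--max-sequences` and `--stats`) and the function name `build_parser` are my own assumptions, so feel free to choose different ones. The processing code itself is left out.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build the argument parser (flag names are assumptions for illustration)."""
    parser = argparse.ArgumentParser(description="Translate FASTA DNA sequences into protein sequences.")

    parser.add_argument("input_dir", help="Directory containing the FASTA DNA sequences.")
    parser.add_argument("-o", "--output-dir", default="protein_output",
                        help="Directory where the FASTA protein sequences are saved.")
    parser.add_argument("-n", "--max-sequences", type=int, default=0,
                        help="Maximum number of sequences to process (0 means all).")
    parser.add_argument("-s", "--stats", action="store_true",
                        help="Compute the statistics and print the results.")

    return parser


# Passing a list keeps this sketch runnable in a notebook.
# In the real script, call `parse_args()` with no arguments.
example_args = build_parser().parse_args(["sequences", "--max-sequences", "10", "--stats"])
print(example_args)
```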

## Module Summary

We've reached the end of our final content-heavy module. In this module, we have seen:

* How to **read and write text files**.
* How to work with **operating system file paths** and **files** using the modules `os` and `shutil`.
* How to **run basic scripts** from the command line.
* How to add **command line arguments** using `argparse` to be able to customise functionality.

# Module 5 - Where to Go From Here
We've just about reached the end of this workshop. Let's take a moment to review what we've seen and where we can go from here.

## Workshop Summary
In this workshop, we've seen a bunch of different topics. Here are the key points of what we've discussed:

* We saw how to wrap small units of behaviour into **functions** so that they can be easily repeated, with the ability to pass specific **arguments** as **parameters** to produce certain **outputs**.
* We saw how to create **classes** as *templates* to define new types of **objects** to group together different **attributes** (variables) and different **methods** (functions). We also saw how to use **data classes** to simplify some of the class creation.
* We saw how to **find and install packages** and how to **import** them into our own code. Specifically, we saw how to use **NumPy** and how to read **documentation** to learn more about this package.
* We saw how to interact with the **operating system** to **read and write** text files and **move and copy** files.
* We saw how to write **Python scripts** that don't run in a Jupyter Notebook and can be called from the **command line** with different **command line arguments**.

While these skills aren't *everything*, they still provide a solid basis for you to be able to learn **even more** about writing code in Python.

## Next Steps

What is this **even more**, you may be wondering... Well, here are some of the topics that you can explore.

### Class Inheritance

We saw how to define classes from scratch. But, it's also possible to define new classes based on old classes to extend their functionality. The new class is said to *inherit* attributes and methods from the old class, known as the *superclass*. It's also possible to **override** methods to change their functionality. While I didn't have time to cover inheritance today, here are some links that may be able to help:

* From the official [Python docs](https://docs.python.org/3/tutorial/classes.html#inheritance).
* From an [online tutorial](https://realpython.com/inheritance-composition-python/).
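
To give a small taste, here is a minimal sketch of inheritance and overriding. The classes `BiologicalSequence` and `SimpleDnaSequence` are hypothetical and are not the classes that we wrote earlier in this workshop.

```python
class BiologicalSequence:
    """Hypothetical superclass, used only for this illustration."""

    def __init__(self, sequence: str, sequence_name: str):
        self.sequence = sequence
        self.sequence_name = sequence_name

    def describe(self) -> str:
        return f"{self.sequence_name}: sequence of length {len(self.sequence)}"


class SimpleDnaSequence(BiologicalSequence):
    """Hypothetical subclass that inherits from ``BiologicalSequence``."""

    def describe(self) -> str:
        # Override the superclass method, but reuse its result using `super()`.
        return f"{super().describe()} (DNA)"

    def count_gc(self) -> int:
        # New method that only exists on the subclass.
        return self.sequence.count("G") + self.sequence.count("C")


my_dna = SimpleDnaSequence("ATCGGC", "Sequence0")
print(my_dna.describe())
print(my_dna.count_gc())
```

The subclass doesn't define `__init__`, so it inherits the one from the superclass.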

### Enumerated Types

We've been working a lot with DNA sequences represented as strings. We've been assuming that users only input valid sequences. But, a string can contain many more characters than just `ATGC`. Instead of relying on strings, what if we had a type of object that could **only** be `A`, `T`, `G` or `C`? Well, we can define such an object using **enumerated types**. An **enumerated type** (also known as an *enum*) is an object that can only take one of a set of valid values. These values can be represented by different underlying Python types, such as integers or strings. Defining an enum is very straightforward and relies on the `Enum` class in the `enum` module from the Python Standard Library. For our DNA example, we could define an enum as:

```python

from enum import Enum

class DeoxyRiboNucleotide(Enum):
    A = 'A'
    T = 'T'
    G = 'G'
    C = 'C'


```

And then, in our DNA sequence object, we can change the type of the `sequence` to be a list of `DeoxyRiboNucleotide`s.

Enums are also useful if you have a function where there are various options available for a parameter, and you want to make sure the user enters a valid option. In my own work, I've used enumerated types to define units.
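
Here is a minimal sketch of how the enum could be used to validate a sequence. The helper function `parse_nucleotides` is hypothetical. Calling the enum class with a value looks up the matching member, and an invalid value raises a `ValueError`.

```python
from enum import Enum
from typing import List


class DeoxyRiboNucleotide(Enum):
    A = 'A'
    T = 'T'
    G = 'G'
    C = 'C'


def parse_nucleotides(sequence: str) -> List[DeoxyRiboNucleotide]:
    """Hypothetical helper: convert a string into a list of nucleotides."""
    # Calling the enum with a value looks up the matching member.
    # An invalid character raises a ``ValueError``.
    return [DeoxyRiboNucleotide(character) for character in sequence]


valid_nucleotides = parse_nucleotides("ATG")
print(valid_nucleotides)

try:
    parse_nucleotides("ATX")
    invalid_rejected = False
except ValueError as error:
    invalid_rejected = True
    print(f"Invalid sequence: {error}")
```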

To learn more about using enumerated types, see the [official documentation](https://docs.python.org/3/library/enum.html).

### Working with Data Frames using Pandas

We briefly touched on data analysis using NumPy. But, what if we want something that can work more like a spreadsheet, and can handle different types of columns? Here, **[Pandas](https://pandas.pydata.org/)** is a great choice! Pandas has tools for inputting data from a number of file formats, including Excel, processing the data and outputting it to various formats. If you have to work with data that is not necessarily all the same type, I definitely recommend checking it out!

### Creating Python Packages to Distribute with PyPI

We've seen how to *install* packages, but did you know it's very easy to also create and publish your own? Python has an official [tutorial](https://packaging.python.org/en/latest/tutorials/packaging-projects/) that explains how to organise your code to create a package that you can then upload to PyPI to distribute. If you plan on distributing your code, it's also a good idea to read up about different **software licenses** (see [here](https://choosealicense.com/)). Different licenses give users different levels of freedom to use and modify your code.

### Writing Good Documentation

If you're going to distribute your package, you may want to write some nice documentation to go with it. To generate a nice documentation website, check out [Sphinx](https://www.sphinx-doc.org/en/master/). I haven't had time to learn it myself, but it comes highly recommended. You can also host Sphinx documentation on a platform like **[Read the Docs](https://about.readthedocs.com/?ref=readthedocs.org)**.

### Developing Python GUIs using PyQt

Working on the command line is great fun... but what if you want to build something visual so that people don't need to open the scary terminal to run your code? The good news is that there are tools for developing graphical user interfaces (GUIs) in Python. A common library used is PyQt. There are tutorials online for learning how to use these tools to develop an intuitive GUI. I recommend [this site](https://www.pythonguis.com/). It explains multiple different libraries in great depth.

### Other Topics of Interest

These other topics may also be interesting to some of you:

* Hosting your code online using GitHub.
* Parallel processing using the `multiprocessing` module.
* Image processing using scikit-image.
* Generating plots using Matplotlib.
* Generating websites using Flask or Django.

## Important Resources

As you continue your Python journey, there are many resources available to help you. We'll discuss a few of them here.

### Python Documentation

When in doubt, start [here](https://docs.python.org/3/)! Python maintains very clear documentation that includes descriptions of all the built-in modules, classes and functions. They also have tutorials online. If you are learning a new concept that is built into Python, chances are you can find something here to help you. The documentation also explains any quirks associated with functions (like that `math.cos` takes an angle in **radians**), so it's always a good idea to read up about any functions you're using.
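
Here is a small sketch of the `math.cos` quirk mentioned above. The variable names are just for illustration. The `math.radians` function, also from the `math` module, converts degrees to radians.

```python
import math

angle_degrees = 60

# Passing degrees directly is a common mistake: 60 is treated as 60 radians.
wrong = math.cos(angle_degrees)

# Converting to radians first gives the cosine of 60 degrees (about 0.5).
right = math.cos(math.radians(angle_degrees))

print(wrong)
print(right)
```

You can also read a function's documentation without leaving Python by running `help(math.cos)`.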

### Package Documentation

If you're using a package, chances are it has a similar documentation website. [NumPy](https://numpy.org/doc/stable/) does. So does [SciPy](https://docs.scipy.org/doc/scipy/). And [Matplotlib](https://matplotlib.org/stable/api/index). And [scikit-learn](https://scikit-learn.org/stable/modules/classes.html). These packages offer both full API documentation and a user's guide. Take advantage of both of these resources to learn how to use new packages.

### Online Tutorials

There are many websites that offer online tutorials about various aspects of Python. These include [W3Schools](https://www.w3schools.com/python/default.asp) and [Real Python](https://realpython.com/). There are certainly others online. You can also look on developer forums, like [dev.to](https://dev.to/), which aren't specific to Python.

### Your Code Editor

This one may raise a few eyebrows... But, Microsoft, JetBrains and others have invested **tons** of time, effort (and I'm sure, money) into developing tools that help programmers. Please, please, please, please use a good code editor to write your code. Don't just write it in Notepad, or TextEdit, or Gedit. At the very least, make sure you have a code linter. Your code editor can often point out places where you've forgotten to close a bracket or you've misspelled a variable. Don't underestimate the power of your code editor! Also, take advantage of the features it offers! Use **auto-complete**, use **code suggestions**. Hover your mouse over different words to see the documentation. Learn to use the debugger. Take advantage of the resources you have available!

### Stack Overflow

If all else fails, check Stack Overflow. Stack Overflow is great, since if you're having a problem with a very widely-used package, chances are someone else has had it too. So, you'll very easily find an answer. But, make sure you've exhausted all other avenues first. Try to understand what the code is doing instead of just sticking together snippets from Stack Overflow. Read the answer, try to understand what the answer is saying. If you don't understand something, try to read up about it. And always look at suggested code with a critical eye. If the suggested code looks sketchy, it's probably sketchy and there may be another, better way.

### A Word on ChatGPT

Over the course of the past year, generative AI tools, like ChatGPT, have completely changed how we get information. These tools can be very useful for performing some tasks, especially ones that we sometimes find tedious. I hear that ChatGPT is quite good at coding, as are other tools, such as GitHub Copilot. I ***strongly*** recommend that if you plan on using these tools, you develop a solid understanding of programming concepts ***first***. You should be able to understand what the code you are being given does. As with Stack Overflow responses, don't just copy the answer. Make sure you read it, understand it, and can explain it to a friend. That way, when (not if) something doesn't work perfectly, you are familiar enough with the code to track down the problem and fix it. These tools are very powerful and (I hear) produce code that runs very well, but make sure that you aren't copying any code that you don't understand. Programming is not just about getting tasks done. It's about *thinking* and developing certain logic and intuition. It can also be very fun.

## Conclusion

We've now reached the end of this workshop. You now have the skills necessary to write powerful scripts that look professional. I hope that you continue your programming journey and apply these skills to help advance your research and the field as a whole. Please remember to fill out the feedback survey. If you have any questions about the material, feel free to reach out to me. For more details about MiCM, and to suggest topics for future workshops, please visit https://www.mcgill.ca/micm/.

Until next time!

```python
print("Good luck in the Python-verse!")
```

```
Good luck in the Python-verse!
```
