# Errors, or bugs, in your software

Today we'll cover dealing with errors in your Python code, an important aspect of writing software.

#### What is a software bug?

According to [Wikipedia](https://en.wikipedia.org/wiki/Software_bug) (accessed 16 Oct 2018), a software bug is an error, flaw, failure, or fault in a computer program or system that causes it to produce an incorrect or unexpected result, or behave in unintended ways.

#### Where did the terminology come from?

Engineers used the term well before electronic computers and software existed. Thomas Edison is sometimes credited with the first recorded use of "bug" in that sense. [[Wikipedia](https://en.wikipedia.org/wiki/Software_bug#Etymology)]

#### If incorrect code is never executed, is it a bug?

This is the software equivalent of "If a tree falls and no one hears it, does it make a sound?"

## Three classes of bugs

Let's discuss three major types of bugs in your code, from easiest to most difficult to diagnose:

1. **Syntax errors**
1. **Runtime errors**
1. **Semantic errors**

In [1]:
import numpy as np

### Syntax errors

Errors where the code is not written in a valid way. (Generally easiest to fix.)

In [2]:
print "This should only work in 2.X NOT used in this class."

SyntaxError: Missing parentheses in call to 'print'. Did you mean print("This should only work in 2.X NOT used in this class.")? (4125897787.py, line 1)

In [3]:
print("This should only work in 2.X NOT used in this class.)

SyntaxError: EOL while scanning string literal (1784136621.py, line 1)

In [4]:
print("This should only work in 2.X NOT used in this class.")

This should only work in 2.X NOT used in this class.
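
Syntax errors are detected when Python parses the source, before any statement runs, so they can be checked for without executing the code. A minimal sketch using the built-in `compile` function (the helper name `has_valid_syntax` is made up for illustration):

```python
def has_valid_syntax(source):
    """Return True if `source` parses as Python 3 code, False otherwise."""
    try:
        # compile() parses the source without executing it
        compile(source, "<example>", "exec")
    except SyntaxError:
        return False
    return True

print(has_valid_syntax('print("hello")'))  # True
print(has_valid_syntax('print("hello)'))   # False: missing closing quote
print(has_valid_syntax('print "hello"'))   # False: Python 2 style print statement
```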


### Runtime errors

Errors where the code is syntactically valid but fails during execution, often by raising an exception. Sometimes these are easy to fix; they are harder to fix when they occur in other people's code.

In [5]:
# invalid operation
a = 0
5/a  # division by 0

ZeroDivisionError: division by zero

In [6]:
# invalid operation
input = "40"
input/11  # Incompatible types for the operation

TypeError: unsupported operand type(s) for /: 'str' and 'int'

In [7]:
while True:
    pass  # Infinite loop

KeyboardInterrupt: 

In [8]:
x = 5
x.isalphanum()  # method doesn't exist on the type

AttributeError: 'int' object has no attribute 'isalphanum'
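
Because runtime errors surface as exceptions, they can be caught and inspected with `try`/`except`. The sketch below is only an illustration: `safe_divide` is a hypothetical helper, and returning `None` on failure is an assumption made to keep the example short. In real code it is often better to let the exception propagate so the traceback shows where the problem occurred.

```python
def safe_divide(numerator, denominator):
    """Divide two values, returning None when the operation fails at runtime."""
    try:
        return numerator / denominator
    except ZeroDivisionError as err:
        print(f"ZeroDivisionError: {err}")
    except TypeError as err:
        print(f"TypeError: {err}")
    return None

safe_divide(5, 0)            # reports the division by zero
safe_divide("40", 11)        # reports the incompatible types
print(safe_divide(40, 8))    # 5.0
```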

### Semantic errors

Errors where the code is syntactically valid and runs, but contains errors in logic. These can be difficult to find and fix.

Say we're trying to confirm that a trigonometric identity holds. Let's use the basic relationship between sine and cosine, given by the Pythagorean identity

$$
\sin^2 \theta + \cos^2 \theta = 1
$$

We can write a function to check this:

In [9]:
import math

def pythagorean_identity(theta):
    return math.sin(theta)**2 + math.cos(theta)**4

In [10]:
pythagorean_identity(.4)

0.8713500597123389

What is the bug in this code?

* We accidentally raised the cosine to the fourth power instead of squaring it.
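
Semantic errors raise no exception, so one way to catch them is to compare results against a value known in advance. A sketch of such a check, using a version of the function with both terms squared and `math.isclose` to allow for floating-point rounding:

```python
import math

def pythagorean_identity(theta):
    # Both terms are squared, as in the identity above
    return math.sin(theta)**2 + math.cos(theta)**2

# Compare against the known answer with a tolerance rather than ==
for theta in [0.0, 0.4, 1.0, math.pi / 3, 2.5]:
    assert math.isclose(pythagorean_identity(theta), 1.0), theta
```

The choice of test angles is arbitrary; the point is that the check fails loudly if the exponent is wrong.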

## How to find and resolve bugs?

Debugging has the following steps:

1. **Detection** of an exception or invalid results. 
2. **Isolation** of where the program causes the error. This is often the most difficult step.
3. **Resolution** of how to change the code to eliminate the error. Usually this is straightforward, but sometimes it requires major revisions to the code.


### Detection of Bugs

How can we detect that a bug exists?

* Chance. While running your Python code, you encounter unexpected functionality, exceptions, or syntax errors. While we'll focus on this in today's lecture, you should never leave this up to chance in the future.
* Software testing practices allow for thoughtful detection of bugs in software. We'll discuss more in the lecture on testing.

### Isolation of Bugs

How can you figure out where the bug is?

1. The "thought" method. Think about how your code is structured and which part of your code would most likely lead to the exception or invalid result.
    1. "Rubber-ducking" fits here - explaining your code to a rubber duck in order to figure out the error.
2. Inserting ``print`` statements (or other logging techniques)
    1. Using the ``assert`` statement is particularly helpful
3. Using a line-by-line debugger like ``pdb``.

Typically, all three strategies are used in combination, often repeatedly.

Other isolation strategies:
1. Change one thing at a time.
2. Use binary search (check the state halfway through your program to figure out which half the problem is in).
3. Simplify your code (comment out pieces to figure out where the problem arises).

## Debugging Example: Entropy

Say we're trying to compute the **entropy** of a set of probabilities.  The
form of the equation is

$$
H = -\sum_i p_i \log(p_i)
$$

The choice of base for the logarithm varies for different applications. Base 2 gives the unit of bits, while base `e` gives nats, and base 10 gives units of "dits". 
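
To make the effect of the base concrete, here is a small sketch that evaluates the formula for a fair coin in all three bases. It uses only the standard library; `math.log(x, base)` takes the base as an optional second argument. The dictionary name `entropies` is just for illustration.

```python
import math

p = [0.5, 0.5]  # a fair coin

entropies = {}
for unit, base in [("bits", 2), ("nats", math.e), ("dits", 10)]:
    entropies[unit] = -sum(p_i * math.log(p_i, base) for p_i in p)
    print(f"{unit}: {entropies[unit]:.4f}")
```

The three results differ only by a constant factor: the value in nats is the value in bits multiplied by $\ln 2$.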

We can write the function like this:

In [11]:
def entropy(ps):
    items = []
    for p_i in ps:
        interm = p_i * np.log2(p_i)
        items.append(interm)
    return -np.sum(items)

Imagine if we have the nucleotides in DNA: A, T, C & G for adenine, thymine, cytosine, and guanine, respectively.  If we have a given DNA sequence, e.g. `ATCGATCG` we can compute the observed probability for each nucleotide in the sequence as:

| Nucleotide | Occurrences | P(Nucleotide) |
|------------|-------------|---------------|
| A | 2 | .25 |
| T | 2 | .25 |
| C | 2 | .25 |
| G | 2 | .25 |

Thus, we have 4 states that are all equally likely, and we can compute the entropy with the following:

In [12]:
entropy([.25, .25, .25, .25])

2.0
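
The probabilities in the table can also be computed directly from the sequence rather than typed in by hand. A minimal sketch using `collections.Counter` (the function name `nucleotide_probabilities` is hypothetical):

```python
from collections import Counter

def nucleotide_probabilities(sequence):
    """Map each symbol in `sequence` to its observed relative frequency."""
    counts = Counter(sequence)
    total = len(sequence)
    return {symbol: count / total for symbol, count in counts.items()}

probs = nucleotide_probabilities("ATCGATCG")
print(probs)  # each of A, T, C, G maps to 0.25
```

Passing `list(probs.values())` to the `entropy` function defined above should reproduce the same result of 2.0 bits.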

To reinforce this concept, let's imagine we want to encode the nucleotides A, T, C, G as bits.  Without any knowledge of nucleotide frequency, i.e. assuming they are all equally likely, we could define a bitwise representation of the nucleotides as:

| Nucleotide | Bit Representation |
|------------|--------------------|
| A | `00` |
| T | `01` |
| C | `10` |
| G | `11` |

We can see that it takes two bits to represent the nucleotides uniquely.  This is exactly what the `entropy` function predicts for the representation given they are all equally likely, e.g. `entropy([.25, .25, .25, .25])`.

In information theory, this is called **Shannon's source coding theorem**, which states that it is impossible to compress data such that the code rate (average number of bits per symbol) is less than the entropy of the source, without it being virtually certain that information will be lost. So in the DNA case, we must use 2 bits per symbol.

----------------------

Let's try out the function a little bit more:

In [13]:
entropy([0.5, 0.5])

1.0

In [14]:
entropy([0.5, 0.5, 0.5, 0.5])

2.0

In [15]:
entropy([0.5, 1.2])

0.18435871299944745

In [16]:
entropy([-1, 0.5, 0.5])

  interm = p_i * np.log2(p_i)


nan

In [17]:
entropy([0.5, 0.25, 0, 0.25])

  interm = p_i * np.log2(p_i)
  interm = p_i * np.log2(p_i)


nan

In [18]:
entropy([1])

-0.0

What problems did we find?

* We can call the function with probabilities that don't sum to 1 (the inputs are supposed to represent probabilities).
* We can call the function with probabilities greater than 1.
* We get a `nan` result in some cases.
* Asking for the entropy of a single probability of 1 gives us a -0.0 result.

Let's try to figure out what is causing our problems using print statement debugging:

In [19]:
def entropy(p):
    print(p)
    items = []
    for p_i in p:
        print(p_i)
        interm = p_i * np.log2(p_i)
        print(interm)
        items.append(interm)
    print(items)
    return -np.sum(items)

In [20]:
entropy([0.5, 0.5, 0.5])

[0.5, 0.5, 0.5]
0.5
-0.5
0.5
-0.5
0.5
-0.5
[-0.5, -0.5, -0.5]


1.5

More useful: use labels:

In [21]:
def entropy(p):
    print(f"p: {p}")
    items = []
    for p_i in p:
        print(f"p_i: {p_i}")
        interm = p_i * np.log2(p_i)
        print(f"interm: {interm}")
        items.append(interm)
    print(f"items: {items}")
    return -np.sum(items)

In [22]:
entropy([0.5, 0.5, 0.5])

p: [0.5, 0.5, 0.5]
p_i: 0.5
interm: -0.5
p_i: 0.5
interm: -0.5
p_i: 0.5
interm: -0.5
items: [-0.5, -0.5, -0.5]


1.5
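
The same labelled output can be produced with the standard `logging` module, which has the advantage that the messages can be switched off without deleting them. This is a sketch under some assumptions: the function and logger names are made up, and it uses `math.log2` instead of `np.log2` to stay within the standard library (note that `math.log2` raises a `ValueError` for zero or negative input, whereas `np.log2` returns `-inf` or `nan` with a warning).

```python
import logging
import math

logging.basicConfig(level=logging.DEBUG, format="%(levelname)s %(message)s")
logger = logging.getLogger("entropy_debug")  # hypothetical logger name

def entropy_logged(p):
    logger.debug("p: %s", p)
    items = []
    for p_i in p:
        interm = p_i * math.log2(p_i)
        logger.debug("p_i: %s, interm: %s", p_i, interm)
        items.append(interm)
    logger.debug("items: %s", items)
    return -sum(items)

entropy_logged([0.5, 0.5, 0.5])  # returns 1.5, logging each step

# Raising the level silences the debug messages without editing the function
logger.setLevel(logging.WARNING)
entropy_logged([0.5, 0.5, 0.5])
```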

Modify our entropy code so that we don't have the problems of values that are not valid probabilities:

In [23]:
def entropy(p):
    items = []
    # begin by checking all inputs are between 0 and 1 inclusive
    check1 = []
    for ele in p:
        check1.append((ele <= 1) and (ele >= 0))
    # verify the sum of the probabilities is 1
    # note, the use of np.isclose is correct but the following 
    # may not return the expected result
    # check2 = 1 == np.sum(p)
    check2 = np.isclose(1, np.sum(p), atol=1e-08)
    if all(check1) and check2:
        for p_i in p:
            interm = p_i * np.log2(p_i)
            items.append(interm)
        return -np.sum(items)
    else:
        # return an error value if the preconditions aren't true
        return -1

In [24]:
entropy([0.5, 0.5, 0.5])

-1

In [25]:
entropy([0.5, 1.2])

-1
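
The comment in the code warns that `1 == np.sum(p)` may not return the expected result. The reason is floating-point rounding: values such as 0.1 have no exact binary representation, so a sum that is mathematically 1 can come out slightly off. The sketch below illustrates this with the built-in `sum` and `math.isclose`, used here as standard-library stand-ins for `np.sum` and `np.isclose` (NumPy may round a given sum differently, but the same caution applies).

```python
import math

p = [0.1] * 10  # ten equally likely outcomes

total = sum(p)
print(total)                                     # 0.9999999999999999
print(total == 1.0)                              # False
print(math.isclose(total, 1.0, abs_tol=1e-08))   # True
```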

Modify our entropy code so that we don't have any other problems we have discovered:

In [26]:
def entropy(p):
    items = []
    # begin by checking all inputs are between 0 and 1 inclusive
    check1 = []
    for ele in p:
        check1.append((ele <= 1) and (ele >= 0))
    # verify the sum of the probabilities is 1
    # note, the use of np.isclose is correct but the following 
    # may not return the expected result
    # check2 = 1 == np.sum(p)
    check2 = np.isclose(1, np.sum(p), atol=1e-08)
    if all(check1) and check2:
        for p_i in p:
            if p_i > 0:
                interm = p_i * np.log2(p_i)
                items.append(interm)
        return np.abs(-np.sum(items))
    else:
        return -1

In [27]:
entropy([0.5, 0.25, 0, 0.25])

1.5

In [28]:
entropy([1])

0.0

### Using assertions

Python comes with a built-in `assert` statement that basically says "I expect this statement to be true, and if it's not, raise an exception" ([more info](https://docs.python.org/3/reference/simple_stmts.html#the-assert-statement)). Assertions are usually only enabled during development for debugging purposes; when running the same Python code in a production environment, developers typically "optimize" their code with the `-O` option, which disables assertions (i.e., they run `python -O my_module.py`).

An assert statement looks like:

```python
assert <expression>, <message>
```

For instance:

```python
assert x > 5, "x is not greater than 5"
```

Let's try converting some of our code to assertions:

In [29]:
def entropy(p):
    items = []
    for ele in p:
        assert (ele <= 1) and (ele >= 0), "element is not a valid probability"
    assert np.isclose(1, np.sum(p), atol=1e-08), "probabilities do not sum to 1"
    for p_i in p:
        if p_i > 0:
            interm = p_i * np.log2(p_i)
            items.append(interm)
    return np.abs(-np.sum(items))

In [30]:
entropy([0.5, 0.5, 0.5])

AssertionError: probabilities do not sum to 1

In [31]:
entropy([0.5, 1.2])

AssertionError: element is not a valid probability
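
Since assertions are skipped when Python runs with `-O`, input checks that must always run are often written as explicit exceptions instead. The following is a sketch of that alternative, not a replacement for the version above: the name `entropy_checked` is hypothetical, and it uses `math` rather than NumPy so that it is self-contained.

```python
import math

def entropy_checked(p):
    """Entropy in bits; raises ValueError for invalid input."""
    for ele in p:
        if not 0 <= ele <= 1:
            raise ValueError(f"{ele} is not a valid probability")
    if not math.isclose(sum(p), 1.0, abs_tol=1e-08):
        raise ValueError("probabilities do not sum to 1")
    # Terms with p_i == 0 contribute nothing, so skip them
    total = sum(p_i * math.log2(p_i) for p_i in p if p_i > 0)
    return abs(-total)  # abs() avoids returning -0.0

try:
    entropy_checked([0.5, 1.2])
except ValueError as err:
    print(f"ValueError: {err}")
```

Callers can then handle `ValueError` specifically, and the check still runs under `python -O`.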

### Using Python's debugger, `pdb`

Python comes with a built-in debugger called [pdb](https://docs.python.org/3/library/pdb.html).  It allows you to step line-by-line through a computation and examine what's happening at each step.  Note that this should probably be your last resort in tracking down a bug, because stepping through code line by line can be tedious.

You can use the debugger by inserting the line
``` python
import pdb; pdb.set_trace()
```
within your script. To leave the debugger, type "exit()". To see the commands you can use, type "help".
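
Since Python 3.7 there is also a built-in `breakpoint()` function which, by default, behaves like `import pdb; pdb.set_trace()`. The sketch below guards it behind a `debug` flag so the function runs normally unless debugging is requested; the function name and the `debug` parameter are assumptions for illustration. The comments list a few commonly used debugger commands.

```python
import math

def entropy_with_breakpoint(p, debug=False):
    items = [p_i * math.log2(p_i) for p_i in p if p_i > 0]
    if debug:
        # By default equivalent to `import pdb; pdb.set_trace()`
        breakpoint()
    return -sum(items)

# Commands commonly used at the debugger prompt:
#   n (next)      run the current line, stepping over function calls
#   s (step)      run the current line, stepping into function calls
#   c (continue)  resume until the next breakpoint
#   p expr        print the value of an expression
#   l (list)      show source around the current line
#   q (quit)      leave the debugger
print(entropy_with_breakpoint([0.5, 0.5]))        # runs normally
# entropy_with_breakpoint([0.5, 0.5], debug=True) # uncomment to enter the debugger
```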

Let's try this out:

In [32]:
def entropy(p):
    p = np.asarray(p)
    items = p * np.log2(p)
    import pdb; pdb.set_trace()
    return -np.sum(items)

This can be a more convenient way to debug programs and step through the actual execution.

In [33]:
p = [.1, -.2, .3]
entropy(p)

  items = p * np.log2(p)


> /var/folders/3g/gsjskh4d3zj8rtt3_vvxbvbc0000gn/T/ipykernel_67921/555854820.py(5)entropy()
      1 def entropy(p):
      2     p = np.asarray(p)
      3     items = p * np.log2(p)
      4     import pdb; pdb.set_trace()
----> 5     return -np.sum(items)

ipdb> print(p)
[ 0.1 -0.2  0.3]
ipdb> s
--Call--
> <__array_function__ internals>(2)sum

--------

## Exercise

*[Drawn from Software Carpentry](https://swcarpentry.github.io/python-novice-inflammation/11-debugging/index.html#not-supposed-to-be-the-same)*

You are assisting a researcher with Python code that computes the Body Mass Index (BMI) of patients. The researcher is concerned because all patients seemingly have unusual and identical BMIs, despite having different physiques. BMI is calculated as weight in kilograms divided by the square of height in metres.

1. Use print statement debugging to debug the following code which computes BMI for a list of people. If you see the bug right away, you should STILL do the print statement debugging to demonstrate the bug.

In [34]:
patients = [[70, 1.8], [80, 1.9], [150, 1.7]]

def calculate_bmi(weight, height):
    print(f"weight={weight}, height={height}")
    return weight / (height ** 2)

def calculate_bmi_for_list(patients):
    print(f"patients: {patients}")
    for patient in patients:
        weight, height = patients[0]
        print(f"weight={weight}, height={height}")
        bmi = calculate_bmi(height, weight)
        print("Patient's BMI is:", bmi)

calculate_bmi_for_list(patients)

patients: [[70, 1.8], [80, 1.9], [150, 1.7]]
weight=70, height=1.8
weight=1.8, height=70
Patient's BMI is: 0.00036734693877551024
weight=70, height=1.8
weight=1.8, height=70
Patient's BMI is: 0.00036734693877551024
weight=70, height=1.8
weight=1.8, height=70
Patient's BMI is: 0.00036734693877551024


What's wrong with this code? Hint: there is more than one problem.

* Instead of using the `patient` variable within the loop, the code is always using `patients[0]`.
* The call to `calculate_bmi` passes the arguments in the wrong order.
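
With both problems addressed, a corrected version might look like the sketch below. Collecting the results in a list and returning them is an addition made here so the values can be checked; the original code only prints them.

```python
patients = [[70, 1.8], [80, 1.9], [150, 1.7]]

def calculate_bmi(weight, height):
    return weight / (height ** 2)

def calculate_bmi_for_list(patients):
    bmis = []
    for patient in patients:
        weight, height = patient              # use the loop variable
        bmi = calculate_bmi(weight, height)   # arguments in the declared order
        print("Patient's BMI is:", bmi)
        bmis.append(bmi)
    return bmis

bmis = calculate_bmi_for_list(patients)
```

The three patients now get different BMIs, roughly 21.6, 22.2, and 51.9.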

2. Use the Python debugger (`pdb`) to debug instead. Set a breakpoint at the beginning of the `calculate_bmi_for_list` function and step through (and into!) the `calculate_bmi` function.

In [35]:
patients = [[70, 1.8], [80, 1.9], [150, 1.7]]

def calculate_bmi(weight, height):
    return weight / (height ** 2)

def calculate_bmi_for_list(patients):
    import pdb; pdb.set_trace()
    for patient in patients:
        weight, height = patients[0]
        bmi = calculate_bmi(height, weight)
        print("Patient's BMI is:", bmi)

calculate_bmi_for_list(patients)

> /var/folders/3g/gsjskh4d3zj8rtt3_vvxbvbc0000gn/T/ipykernel_67921/2027794685.py(8)calculate_bmi_for_list()
      6 def calculate_bmi_for_list(patients):
      7     import pdb; pdb.set_trace()
----> 8     for patient in patients:
      9         weight, height = patients[0]
     10         bmi = calculate_bmi(height, weight)

ipdb> n
> /var/folders/3g/gsjskh4d3zj8rtt3_vvxbvbc0000gn/T/ipykernel_679