# Type Annotations in Python

## Introduction

Many people find scripting languages attractive because they enable fast prototyping of new code and quick development of extensive programs. Simply calling the interpreter to execute the code in our source files is certainly much easier, and gives better feedback, than invoking makefiles and compilers and sorting through their error messages.

## *Intermission*: Build systems and compilers

Luckily, much progress has been made on this front. In most cases there is no need to manually write makefiles, figure out appropriate compiler flags, and list the source files of your project, since many new build systems and programs have been developed to do just that.

For "older" languages like C and C++, check out:
- [CMake](https://cmake.org/)
- [build2](https://build2.org/)
- [Conan](https://conan.io/) (this is more of a package manager and build orchestrator)
- [xmake](https://xmake.io/)
- [premake](https://premake.github.io/)

Of course this diversity raises some more problems. How do we integrate packages that use different build systems? What happens if a user cannot install the build system we are using to develop our project (e.g. it is not available on Windows)? Containers and images, like the ones provided by [Docker](https://docker.com), can help with this problem, but that is another topic entirely.

Newer compiled languages like [Go](https://golang.org/) and [Rust](https://www.rust-lang.org/) ship with their own build automation and package management solutions, so there is no need to use third-party software for that.

## *Intermission* over

Quick development is also helped by the lack of static typing. Most scripting languages, including Python, are dynamically typed. This does not mean that there are absolutely no types inside the language, but that the type information of a variable is only available at runtime, not at compile time. (We can still associate some kind of compilation step with the language: the Python interpreter generates bytecode files that live in the `__pycache__` directories, which allows it to skip parsing a source file if its contents have not changed.)

It is possible to check type information at runtime like so:

In [1]:
def append_to_string(text):
    assert(isinstance(text, str))
    return text + "appended_text"

print(append_to_string("123"))
append_to_string(1.0)


123appended_text


AssertionError: 

But this is not considered idiomatic Python and is not feasible in the long term, since the developer has to insert these type assertions everywhere in the code.

One might think that this is not a problem since the language was designed to be dynamically typed and there is no need for type checking since bugs and variables of the wrong types will be discovered while using the program. Before responding to this let me mention one more thing.

## *Intermission*: "Old" statically typed languages

Many people have had a bad experience with statically typed and compiled languages like C, C++, and Fortran (e.g. strange and long compiler messages, long compile times, frustrating debugging sessions). Because of these experiences they consider a type system a burden and a nuisance instead of a helping hand. I cannot really blame them. Take a look at the example below written in C++:

```c++
vector<int> v = {1, 2, 3};
for (vector<int>::iterator i = v.begin(); i != v.end(); ++i) {
   // ... use *i ...
}
```

Here we iterate over the elements of variable `v`. Let's look at the same example in Python:

```python
v = [1, 2, 3]

for elem in v:
    print(elem)  # use elem
```

This is much clearer and more readable. Two things to mention here:

1. Compiled languages have come a long way, so the same `for` loop written in modern C++ is almost as clear as the one written in Python.
2. The type annotations I will touch upon are much lighter than the type system of "old" C++.

This was just an example of how bad experiences with older compiled languages could leave a sour taste regarding statically enforced types.

## *Intermission* over

Back to the issue at hand. Why would we want to use statically checked types in Python? Many libraries have been written without them and they seem to work fine. It may seem that adding some kind of type system to Python would be detrimental, until you:

- have written a Python library with more than 500-1000 SLOC (source lines of code, i.e. the lines in the source files that are code, not comments or empty lines)
- try to refactor said library's source code (i.e. rename a function or variable, or change the behaviour of a code block)
- try to figure out the arguments of a function, or their order, while trying to call said function

There are quite a few cases where declaring the types of your variables and function arguments can be really useful. Indeed, in the documentation of many Python libraries the authors refer to the types of function arguments (see the example from numpy below). Before Python 3.5, docstrings like this were the usual way to communicate the expected types.

In [6]:
import numpy as np
help(np.asarray)

Help on function asarray in module numpy:

asarray(a, dtype=None, order=None)
    Convert the input to an array.
    
    Parameters
    ----------
    a : array_like
        Input data, in any form that can be converted to an array.  This
        includes lists, lists of tuples, tuples, tuples of tuples, tuples
        of lists and ndarrays.
    dtype : data-type, optional
        By default, the data-type is inferred from the input data.
    order : {'C', 'F'}, optional
        Whether to use row-major (C-style) or
        column-major (Fortran-style) memory representation.
        Defaults to 'C'.
    
    Returns
    -------
    out : ndarray
        Array interpretation of `a`.  No copy is performed if the input
        is already an ndarray with matching dtype and order.  If `a` is a
        subclass of ndarray, a base class ndarray is returned.
    
    See Also
    --------
    asanyarray : Similar function which passes through subclasses.
    ascontiguousarray : Convert input 

Python 3.5 introduced so-called [type annotations](https://www.python.org/dev/peps/pep-0484/). These annotations can be used to hint to the user what kind of arguments a function expects. I will demonstrate using the example from the start.

In [7]:
def append_to_string(text: str) -> str:
    return text + "appended_text"

The expected type of the argument `text` is given by the type name (`str` in our case) written after the colon (`:`). Since the colon also marks the start of the function block, the return type of the function is given after an arrow (`->`), made up of the characters `-` and `>`.

Let's try it out!

In [8]:
append_to_string("abc")

'abcappended_text'

In [9]:
append_to_string(1.0)

TypeError: unsupported operand type(s) for +: 'float' and 'str'

It looks like we did not get any type checks or warnings. The `TypeError` above comes from the `+` operation inside the function, not from the annotation.

It is important to note that these annotations do not result in any actual runtime or compile-time type checks, and they also do not decrease runtime performance.
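A minimal sketch of what the interpreter does with annotations: it stores them on the function object and otherwise ignores them. The function `describe` is a made-up example and not part of the code above.

```python
def describe(value: int) -> str:
    return f"value={value}"

# The annotations are stored as plain metadata on the function object.
annotations = describe.__annotations__
print(annotations)  # {'value': <class 'int'>, 'return': <class 'str'>}

# Nothing checks them at call time, so passing a str is accepted.
result = describe("not an int")
print(result)  # value=not an int
```

The call with the wrong argument type succeeds here only because the function body happens to work for strings as well.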

So how can these help?

First of all, types help with documenting our code and functions as they show up in the built-in Python help:

In [10]:
help(append_to_string)

Help on function append_to_string in module __main__:

append_to_string(text: str) -> str



Second, there are a couple of external tools that can use type information to check if we passed the variables with the right types to function arguments:

- [mypy](http://mypy-lang.org/) is probably the oldest of the Python static type checkers. It works similarly to a compiler: you give it a filename and it checks the correctness of the types
- [pyre](https://pyre-check.org/) is a relatively new tool that is more performant than mypy and also includes linting capabilities
- [pyright](https://github.com/microsoft/pyright) is a type checker from Microsoft that supports both a command line and a [language server](https://langserver.org/) mode
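To illustrate what such tools look for, here is a sketch of a small file one could hand to a checker (with mypy, by running `mypy` followed by the filename). The exact wording of the report differs between tools, so none is shown here. The `try`/`except` is only there so that the file also runs to completion under the plain interpreter.

```python
def append_to_string(text: str) -> str:
    return text + "appended_text"

# This call matches the annotation, so a checker has nothing to report.
ok = append_to_string("abc")

try:
    # A static checker is expected to flag this call before the program
    # runs, because a float is passed where a str is annotated.
    append_to_string(1.0)
    failed_at_runtime = False
except TypeError:
    # Without a checker we only find out when this line executes.
    failed_at_runtime = True

print(ok, failed_at_runtime)  # abcappended_text True
```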

## *Intermission*: Code linting

I would really recommend using some kind of linter, not just when writing Python code but when writing code in general. A linter is a tool that checks for syntax errors and other kinds of errors in a source code file. It can detect bad coding style, missing variables, and other gotchas.

Usually they are command line programs that write their output to the terminal, but nowadays there are plenty of extensions for mainstream text editors that integrate them. This means that you do not have to jump back and forth between your text editor and command line to see the warnings given by the linter. Instead the text editor will display errors and warnings next to the relevant line.

A widely used editor is [Visual Studio Code](https://code.visualstudio.com/) (note: this is different from Visual Studio, the proprietary IDE) that supports pretty much all programming languages and many Python linters.

If you decide to use VS Code, consider checking out [VSCodium](https://vscodium.com/) as an alternative. This version does not have the built-in telemetry that sends your editing data to Microsoft.

## *Intermission* over

## New style classes, `dataclass` and `NamedTuple`

Now that we have looked at type annotations, let's see where else they would be useful. Structures or classes are among the most basic building blocks of programs. Note that a structure in C and a class in Python are completely different concepts; here I am referring to structures and classes as entities that aggregate or group together variables of different types that usually belong together.

### Creating Python classes the hard way

For many years, defining a Python class usually required a lot of boilerplate code (i.e. code that is repeated often). Let's take a look at a simple example. Below the class `MinMax` is defined. It aggregates two variables, `min` and `max`.

In [4]:
class MinMax(object):
    def __init__(self, min, max):
        self.min, self.max = min, max
        # sanity check that they are indeed minimum and
        # maximum values
        assert self.min < self.max
    
    # a simple function to shift both values
    def shift(self, val):
        self.min += val
        self.max += val

Let's create an instance of this class.

In [5]:
m = MinMax(min=1.0, max=2.0)

Check whether shifting works.

In [6]:
m.shift(5.0)

Print the value of `m`.

In [7]:
print(m)

<__main__.MinMax object at 0x7fe7002b57c0>


This is not really helpful. By default Python only prints the name of the class and the memory address of our object.

We can implement custom printing by defining a "special" `__str__` method.

In [16]:
class MinMax(object):
    def __init__(self, min, max):
        self.min, self.max = min, max
    
    def __str__(self):
        return "MinMax(min=%s, max=%s)" % (self.min, self.max)
    
    def shift(self, val):
        self.min += val
        self.max += val

In [9]:
m = MinMax(min=1.0, max=2.0)

Try printing the value of `m`.

In [10]:
print(m)

MinMax(min=1.0, max=2.0)


It works! There is nothing exceptional about the `__str__` method defined for our class. It is the same kind of method as `shift`. The only difference is that its name starts and ends with double underscores (such methods are called "dunder" methods for short). The `print` function recognizes that our object has this method and uses it to format the object's value during printing.

In [19]:
m.shift(0.5)

In [20]:
print(m)

MinMax(min=1.5, max=2.5)

Our `shift` method works too.

This has been relatively simple. But what happens if another class has more than two member fields? Then we have to implement the `__str__` method again for that class, taking care not to forget any of its fields.

Worse, the `__str__` method is not the only special method we can implement for a class. For example, there is an `__eq__` special method that lets us compare instances of the class like so: `MinMax(min=1.5, max=2.5) == MinMax(min=1.5, max=2.3)`. This should evaluate to `False`.
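A hand-written version of such a comparison might look like the sketch below. It is only one possible way to write it; the choice to return `NotImplemented` for foreign types is an assumption about the desired behaviour, not something fixed by the class above.

```python
class MinMax:
    def __init__(self, min, max):
        self.min, self.max = min, max

    def __eq__(self, other):
        # Let Python fall back to its default handling for other types.
        if not isinstance(other, MinMax):
            return NotImplemented
        # Every field has to be listed by hand; forgetting one is an easy bug.
        return (self.min, self.max) == (other.min, other.max)

different = MinMax(min=1.5, max=2.5) == MinMax(min=1.5, max=2.3)
same = MinMax(min=1.5, max=2.5) == MinMax(min=1.5, max=2.5)
print(different, same)  # False True
```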

Imagine implementing all these methods for every class you define in your program or library. It is quite a boring and cumbersome task, and when a task is boring and cumbersome, not to mention repetitive, the people carrying it out tend to make mistakes. These mistakes can be costly: an incorrectly implemented comparison method can cause serious problems in the program down the line.

Naturally programmers do not like doing repetitive tasks, so there is a solution to this. But before we delve into that, I would like to show another problem with our class and introduce a concept.

### Mutability and immutability

Let's say that we would like to use the `MinMax` class to represent where continents and oceans start and end in the horizontal dimension (i.e. an instance of `MinMax` represents the minimum and maximum `x` coordinate value of each of these components).

For the sake of simplicity we will have two `MinMax` instances: one representing the extent of a continent along the `x` axis, the other representing the extent of the ocean. We assume that we measure distance in kilometres.

In this example we will have a setup where the continent extends from 0 to 100 km and the ocean from 100 to 300 km.

In [23]:
continent = MinMax(min=0.0, max=100.0)
ocean = MinMax(min=100.0, max=300.0)

Say we want to set up another model where the continent goes from 0 to 250 km and the ocean from 250 to 450 km.

In [24]:
continent2 = continent
continent2.max = 250.0

ocean2 = ocean
ocean2.min = 250.0
ocean2.max = 450.0

Okay we set up the second case. Let's print the values to check whether we have the right x coordinates.

In [25]:
print(continent2, ocean2)

MinMax(min=0.0, max=250.0) MinMax(min=250.0, max=450.0)


This seems right. Now let's check the first case:

In [26]:
print(continent, ocean)

MinMax(min=0.0, max=250.0) MinMax(min=250.0, max=450.0)


We have overwritten the values defined for our first case. This is because when we assigned `continent` to `continent2` we did not make a copy of the original object. `continent2` is just a reference to the same object as `continent` (to be precise, the variable `continent` is also just a reference; you can think of it as a pointer if you come from the C world), so changing member variables through `continent2` also changes what we see through `continent`. This is possible because the object is mutable: we can change its member variables, and since `continent2` is just another reference to it, we did so through `continent2`.
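The difference between a second reference and an actual copy can be sketched as follows. Using `copy.copy` from the standard library is just one way to get an independent object and is not part of the example above; the names `alias` and `independent` are made up for illustration.

```python
import copy

class MinMax:
    def __init__(self, min, max):
        self.min, self.max = min, max

continent = MinMax(min=0.0, max=100.0)

# Plain assignment only creates a second name for the same object.
alias = continent
print(alias is continent)  # True

# A shallow copy is a new object that starts with the same field values.
independent = copy.copy(continent)
independent.max = 250.0
print(continent.max, independent.max)  # 100.0 250.0
```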

The opposite of mutable is immutable. When a variable is immutable its value cannot be changed. In Python there are types that are mutable and other types that are immutable. The classic example is lists and tuples.

The contents of a list can change:

In [11]:
a_list = [0, 2, 4]
print("Original: ", a_list)
a_list[2] = 4.5
print("Changed: ", a_list)

Original:  [0, 2, 4]
Changed:  [0, 2, 4.5]


The contents of a tuple cannot change:

In [2]:
a_tuple = (1, 2, 3)
print(a_tuple)
a_tuple[0] = 2.0

(1, 2, 3)


TypeError: 'tuple' object does not support item assignment

Another example is strings:

In [3]:
a_str = "abc"
a_str[0] = "b"

TypeError: 'str' object does not support item assignment

Unfortunately, in Python there is no indicator that tells you whether a variable is immutable or mutable. You have to keep in mind which kinds of variables are mutable and which are immutable.

Back to our example. You might say that the example with the continent and ocean extents does not highlight a real issue, since the problem with our code is easy to spot. This is of course true, but real-world programs are longer than 5 or 10 lines. In a codebase that has hundreds or perhaps thousands of lines of code, tracking down these side effects can be a real pain.

So generally I would avoid using mutable variables. But how can we do that when the class we have defined is clearly mutable? This brings us to the next section. You can read more about immutability through this link: https://en.wikipedia.org/wiki/Immutable_object.

### Creating Python classes the new way

Here I would like to show a better way of implementing Python classes. Python 3.5 introduced the `typing` module (https://docs.python.org/3/library/typing.html) to help with type annotations. This module contains a class called `NamedTuple`, which can be used to create new Python classes (the class syntax used below needs Python 3.6 or newer). Let's use it to implement our `MinMax` class.

#### `NamedTuple`, the immutable solution

In [4]:
from typing import NamedTuple

class MinMax(NamedTuple):
    min: float
    max: float

In [5]:
m = MinMax(min=0.0, max=100.0)

So far this is not too exciting. We did not have to define the `__init__` function manually, but nothing else has changed. Now let's try to print the value of `m`.

In [7]:
print(m)

MinMax(min=0.0, max=100.0)


By inheriting from `NamedTuple`, several special methods were automatically generated. We can even try the equality operator:

In [8]:
MinMax(min=0.1, max=0.25) == MinMax(min=0.0, max=0.12)

False

In [9]:
MinMax(min=0.1, max=0.25) == MinMax(min=0.1, max=0.25)

True

It works! But how does this help us with the problem of mutability? Let's try to change the `min` value of `m`.

In [10]:
m.min = 5.0

AttributeError: can't set attribute

We get an error. That is because we defined our class using `NamedTuple`. We discussed briefly that the values of tuples cannot be changed. So what does `NamedTuple` have to do with tuples? Well, a tuple is just a collection of variables that can be indexed. If we think about it, a class in its simplest form is just a collection of variables as well. A named tuple is just a tuple where we use names or labels, `min` and `max` in our case, to access the different variables stored in it.
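The relationship between the two can be sketched in a few lines: the fields of a named tuple are reachable both by label and by position, and the instance really is a tuple.

```python
from typing import NamedTuple

class MinMax(NamedTuple):
    min: float
    max: float

m = MinMax(min=0.0, max=100.0)

# The same stored value, reached by label and by index.
print(m.min, m[0])  # 0.0 0.0

# The labels are kept on the class, in definition order.
print(MinMax._fields)  # ('min', 'max')

# The instance is an ordinary tuple underneath.
print(isinstance(m, tuple))  # True
```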

If we want to, we can transform our `MinMax` object into a plain old Python tuple.

In [12]:
tuple(m)

(0.0, 100.0)

Or, as is possible with Python tuples, destructure its member variables into distinct variables:

In [14]:
Min, Max = m
print(Min, Max)

0.0 100.0


Of course, in many cases we would like to change the values stored in the member variables of a class. `NamedTuple` creates a special method called `_replace` for our class that we can use to do that.

In [16]:
m2 = m._replace(min=10.0)
print(m, m2)

MinMax(min=0.0, max=100.0) MinMax(min=10.0, max=100.0)


Notice how the original variable `m` did not change. Instead, `m2` is a new instance of the `MinMax` class, independent of `m`. Well, almost independent. In Python every object has an identity, which the built-in `id` function returns. If two variables give the same ID, then they refer to the same object. Let's check the ID of the `max` member variables.

In [18]:
id(m.max) == id(m2.max)

True

Indeed, in the case of `m` and `m2` the `max` variable refers to the same object, so the two instances share memory. The good thing is that the value of `max` cannot be changed in place, as we have seen before, so there is no danger of overwriting its value for both `m` and `m2`.

Let's use our new `MinMax` class to solve our previous problem with the setup of continent and ocean boundaries.

In [20]:
continent = MinMax(min=0.0, max=100.0)
ocean = MinMax(min=100.0, max=300.0)

In [21]:
continent2 = continent._replace(max=250.0)
ocean2 = ocean._replace(min=250.0, max=450.0)

In [22]:
print(continent, ocean)
print(continent2, ocean2)

MinMax(min=0.0, max=100.0) MinMax(min=100.0, max=300.0)
MinMax(min=0.0, max=250.0) MinMax(min=250.0, max=450.0)


Now we have preserved the original values of the first continent and ocean setup.

#### `dataclass`, the mutable solution

There is another way of defining "new style" classes, using the `dataclasses` module.

In [24]:
from dataclasses import dataclass

@dataclass
class MinMax:
    min: float
    max: float

In [25]:
m = MinMax(min=-22.1, max=45.1)
m

MinMax(min=-22.1, max=45.1)

The main difference here is that `m` is mutable.

In [26]:
m.min = 1.0
m

MinMax(min=1.0, max=45.1)

We can make it immutable again by passing an argument to the `@dataclass` decorator call.

In [27]:
@dataclass(frozen=True)
class MinMax:
    min: float
    max: float

m = MinMax(min=1.0, max=2.23)
m.min = -34.0

FrozenInstanceError: cannot assign to field 'min'
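For frozen dataclasses the standard library offers `dataclasses.replace`, which plays a role similar to `_replace` for named tuples. A minimal sketch, reusing the continent example from earlier:

```python
from dataclasses import FrozenInstanceError, dataclass, replace

@dataclass(frozen=True)
class MinMax:
    min: float
    max: float

continent = MinMax(min=0.0, max=100.0)

# replace() builds a new instance and leaves the original untouched.
continent2 = replace(continent, max=250.0)
print(continent, continent2)
# MinMax(min=0.0, max=100.0) MinMax(min=0.0, max=250.0)

try:
    continent.min = -1.0
    assignment_blocked = False
except FrozenInstanceError:
    # Direct assignment is still rejected on the frozen instance.
    assignment_blocked = True
```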

Personally I think the `NamedTuple` solution is cleaner, and an adequate solution for most cases. The `dataclass` way of defining a new class is more customizable, however. For the details check out the Python website here: https://www.python.org/dev/peps/pep-0557/.
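As one illustration of that customizability, a dataclass can define a `__post_init__` method, which runs right after the generated `__init__`. The sketch below uses it to bring back the sanity check from the very first `MinMax` class; raising `ValueError` instead of using `assert` is my own choice here.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MinMax:
    min: float
    max: float

    def __post_init__(self):
        # Called automatically once the fields have been set.
        if not self.min < self.max:
            raise ValueError(f"min ({self.min}) must be smaller than max ({self.max})")

valid = MinMax(min=0.0, max=100.0)

try:
    MinMax(min=5.0, max=1.0)
    rejected = False
except ValueError:
    rejected = True

print(valid, rejected)  # MinMax(min=0.0, max=100.0) True
```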

There are many things I have not mentioned (e.g. generics, interfaces), but I think this is more than enough for a short introduction to the new type annotation syntax introduced for Python.

The main takeaways are:
- type annotations are a new feature introduced to Python that allows annotating the types of variables so that they can be checked, usually by a third-party program
- type annotations by themselves do not ensure type safety, and they do not impose a penalty on runtime performance
- types are helpful tools and can help in documenting the code and communicating our intent to the users of our code (who in most cases are just the person who wrote the code, so essentially they help us maintain and understand the code we write)
- immutability and mutability are important concepts and are worth considering when designing and using modules
- I would argue that in most cases immutability is preferred, and when we cannot avoid mutability, we should make it explicit through documentation and code examples
- I would encourage using classes to group together different variables that belong together
- for defining classes `NamedTuple` and `dataclass` are useful since they generate much of the boilerplate code that is generally required for a fully functioning class.