# Best practices

## Typing

By original design, Python is a _dynamically typed_ language, meaning that a
variable's type (e.g., `int` or `str`) can change over time. As in most
dynamically typed languages, a variable's type also does not have to be
declared before the variable is used. Thus, as you know, you can do something
like this without problems:

```python
x = 18
print(f"Value of x: {x}")
print(f"Type of x: {type(x)}")
x = "Now I'm a string"
print(f"Value of x: {x}")
print(f"Type of x: {type(x)}")
```

This is in contrast to _statically typed_ languages, such as Java or C, where
variable types have to be declared first. In C, e.g., this could look like
this:

```c
int x;
x = 18;
int y = 20; /* you can also assign to the variable at the same time */
z = 20; /* raises an error because the type of z hadn't been declared! */
```

In addition to being _dynamically typed_, Python is also a
[_duck-typed_](https://en.wikipedia.org/wiki/Duck_typing) language, meaning
that whether the current value of a given variable matches a required type is
evaluated at run time with the "duck test":

> _"If it walks like a duck and it quacks like a duck, then it must be a duck"_

An example (taken from [Wikipedia](https://en.wikipedia.org/wiki/Duck_typing)):

```python
class Duck:
    def swim(self):
        print("Duck swimming")

    def fly(self):
        print("Duck flying")


class Whale:
    def swim(self):
        print("Whale swimming")


for animal in [Duck(), Whale()]:
    animal.swim()
    animal.fly()
```

So if we say "everything that can swim is a duck", then a whale is a duck.
That holds until we get into a situation where a duck also needs to fly, hence
the error raised when the code above asks the whale to fly.
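
The failure described above surfaces as an `AttributeError` at run time. The following minimal sketch shows one way of handling it; the helper `try_to_fly()` is a hypothetical name introduced for illustration, and the classes return strings instead of printing so that the results are easy to inspect:

```python
class Duck:
    def swim(self):
        return "Duck swimming"

    def fly(self):
        return "Duck flying"


class Whale:
    def swim(self):
        return "Whale swimming"


def try_to_fly(animal):
    """Return the result of `animal.fly()`, or a note if it cannot fly."""
    try:
        return animal.fly()
    except AttributeError:
        # Raised at run time because the object has no `fly` attribute.
        return f"{type(animal).__name__} cannot fly"


results = [try_to_fly(animal) for animal in (Duck(), Whale())]
print(results)  # ['Duck flying', 'Whale cannot fly']
```

Note that the problem is only detected once the offending call is actually executed, which is exactly the weakness discussed in this section.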

Both _dynamic_ and _static_ typing systems for programming languages have
advantages and disadvantages. The advantage of _dynamic_ typing, especially
when combined with _duck typing_, is that it is less tedious and more flexible.
The disadvantage is that it is not tedious enough and too flexible.
Duck typing takes a load off your mind to such an extent that you are prone not
to think about your code _enough_ and hence are more likely to introduce
errors, especially in edge cases. And once a bug is introduced into your
codebase, _dynamic typing_ also makes it a lot harder to spot. It also makes it
harder to write unit tests for your code, because it is difficult to foresee
all of the different scenarios in which that code may be used and, e.g., what
types of inputs your function may receive in these scenarios. In other words,
_dynamic typing_ tends to be more dangerous, especially for more complex
codebases!

One other major disadvantage of _dynamic typing_ nowadays is that you cannot
use the full power of modern, _smart_ editors, which are able to check your
code for (potential) issues that may arise from _duck typing_. If the types are
not known until the code is run, the editor cannot help you in spotting these!

### Type hints

Before Python 3.5, there was no standard way to declare types in Python.
Because of the disadvantages of _dynamic typing_ mentioned in the previous
section, an _optional_ "type hinting" system was introduced in Python 3.5, and
a dedicated syntax for annotating variables followed in Python 3.6. In Python
3.9 and above, the system is quite mature, and we strongly recommend that you
make use of it for production code. You may perhaps skip it for your unit
tests, though - best of both worlds!

So how does it work? It's quite simple!


> Note that the typing system in Python and the `typing` module have been
> undergoing a lot of changes since they were first introduced in Python 3.5.
> We are referring here to how things are done in Python 3.9, where the system
> is more mature and has stabilized to some degree. Note that if you need to
> support older Python versions, you may need to do things slightly differently
> (generally the older ways are still supported in the newer Python minor
> releases).

To declare the type of a variable, you can do:

```python
x: int
x = 18
```

Or you can declare the type and assign at the same time (more common):

```python
x: int = 18
```

However, note that typing _is not enforced_! Unlike in C (see above), your
runtime won't complain at all if you do something like this:

```python
x: int
x = "But I'm a string!"
```

Python remains a _dynamically typed_ language and the type hints are, well,
just _hints_!

So why should I bother with adding them, then?

The answer is that you can use static type checkers like `mypy` to check your
code for typing issues. You will likely be able to configure your editor to use
it and tell you in real time if you run into potential issues. If you make use
of type hinting, you should also include `mypy` in your CI. Just include a call
to `mypy name_of_your_package` and it will report any issues it finds. After
coding for a bit, even after passing all your other linter tests, you might be
amazed what issues `mypy` finds!
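
To make this more concrete, here is a minimal sketch of the kind of mistake a type checker is meant to catch. The function `total_length()` is a hypothetical example; the second call runs without any error, although it contradicts the type hint, and we would expect `mypy` to report it as an incompatible argument type:

```python
def total_length(words: list[str]) -> int:
    """Return the combined length of all words."""
    return sum(len(word) for word in words)


# Matches the hint: a list of strings.
print(total_length(["duck", "whale"]))  # 9

# Contradicts the hint, yet Python runs it without complaint, because
# iterating over a string yields one-character strings.
print(total_length("duck"))  # 4
```

Bugs of this kind are easy to miss in tests, precisely because nothing fails at run time.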

Let's look at how to use type hints in functions and methods:

```python
def my_func(
    a: list,
    b: bool = False,
) -> str:
    ...  # your code goes here
```

Here we have defined a function that takes one required parameter `a` that is
supposed to be of type `list`, as well as an optional (default value provided!)
parameter `b` that is of type `bool`. The _return type_ is declared as `str`.
We recommend that you use type hints at the very least for your functions so
that your interfaces are well defined. It also helps you with writing
docstrings, as you don't need to bother with adding variable types in them.
Tools that are able to process properly formatted docstrings (e.g.,
Google-style docstrings) can detect the types from the hints in the function
or method signature. This is of course better, because a docstring is just
text: it doesn't enforce anything, even if you use `mypy`. A docstring may
declare that a parameter requires a certain type while the actual
implementation uses another. Docstrings tend to degrade more easily, and the
real source of truth is always the code itself!
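
Type hints are stored with the function and can be read back at run time, which is what allows tools to pick them up. A small sketch using `typing.get_type_hints()` from the standard library (the function `repeat()` is a made-up example):

```python
from typing import get_type_hints


def repeat(word: str, times: int = 2) -> str:
    return word * times


# Maps each annotated parameter name (plus "return") to its declared type.
hints = get_type_hints(repeat)
print(hints)
```

The resulting dictionary contains one entry per annotated parameter and one entry under the key `"return"` for the return type.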

Let's look at some more type hints:

```python
a: list[str]  # a list of strings
b: tuple[str, int]  # a tuple with two items, the first a string, the second an integer
c: dict  # a dictionary
d: dict[str, int]  # a dictionary with the keys being strings and the values being integers
```

If you want your variables to _optionally_ accept `None` (in addition to the
declared type), you can use `Optional` from the `typing` module:

```python
from typing import Optional

a: Optional[list[str]]  # here we accept a list of strings, or `None`
```

Two other useful features of the `typing` module are `Union` and `Any`. `Union`
lets you list several accepted types, while `Any` accepts every type:

```python
from typing import (Any, Union)

a: Union[str, int]  # accepts a string OR an integer
b: Any  # accepts any type
```
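
As a sketch of how `Optional` and `Union` may look in a function signature, consider the hypothetical function `describe()` below. Since hints are not enforced, the function still has to check at run time which of the allowed types it actually received:

```python
from typing import Optional, Union


def describe(value: Union[int, str], label: Optional[str] = None) -> str:
    """Describe `value`, optionally prefixed with `label`."""
    # `value` may be an integer or a string, so we check which one we got.
    if isinstance(value, int):
        text = f"integer {value}"
    else:
        text = f"string {value!r}"
    # `label` may be `None`, so that case is handled explicitly.
    if label is None:
        return text
    return f"{label}: {text}"


print(describe(3))                   # integer 3
print(describe("x", label="input"))  # input: string 'x'
```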

### Further reading

Of course there's a lot more to the new Python typing system, but you will
probably be able to get quite far with the rather simple examples above. Once
you run into situations where they won't be enough (or if you want to support
Python versions below 3.9), you can check the following resources:

* [Official documentation](https://docs.python.org/3/library/typing.html)
* [PEP 484](https://www.python.org/dev/peps/pep-0484/)
* [Cheat sheet](https://mypy.readthedocs.io/en/latest/cheat_sheet_py3.html)

## Docstrings

One particular aspect of coding style covers the feature of Python that allows modules, functions, classes and methods to be described by a simple triple-quoted string, the documentation string, or more commonly referred to as just "docstring". Whenever docstrings are used they have to represent the first statement (i.e., non-comment and non-blank line) of the code unit they describe. For example, for a function a docstring would be placed right after the function definition:

```python
def my_function():
    """I am a docstring"""
    pass
```

Docstrings are important because they can be used to automatically generate extensive documentation for a Python package/module/class/function, and they tie the documentation directly to the code, which helps tremendously in ensuring that code and documentation do not diverge over time. In fact, docstrings are deemed important enough that they received their own PEP, [PEP 257](https://www.python.org/dev/peps/pep-0257/), which goes way beyond what [PEP 8](https://www.python.org/dev/peps/pep-0008/) has to say about them.
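
Because a docstring is attached to the object it describes, it can be read programmatically, which is the basis for such tooling. A minimal sketch using the `__doc__` attribute and `inspect.getdoc()` from the standard library (the function `area()` is a made-up example):

```python
import inspect


def area(width, height):
    """Return the area of a rectangle.

    The result is `width` multiplied by `height`.
    """
    return width * height


print(area.__doc__)          # raw docstring, indentation included
print(inspect.getdoc(area))  # same text with the indentation cleaned up
```

The built-in `help()` function displays the same information interactively.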

We _strongly_ recommend that you write extensive docstrings at least for those of your modules, classes and functions that might ever be useful on their own to other people, including your future self! It may be tedious to write them, but in the end it will increase the uptake of your code, build trust in it and improve your reputation as a developer. Writing the documentation also makes you think again about your code and possible side effects or edge cases of your functions, methods and classes.

There are some popular styles in which docstrings are written, all of which follow the conventions and have tooling available to auto-generate beautiful documentation from them:

* [Sphinx style](https://pythonhosted.org/an_example_pypi_project/sphinx.html#function-definitions)
  ```python
  def func(arg1, arg2):
      """Summary line.

      Extended description of function.

      :param arg1: Description of arg1
      :type arg1: int
      :param arg2: Description of arg2
      :type arg2: str
      ...
      :returns: Description of return value
      :rtype: bool
      """
  ```


* [NumPy style](https://sphinxcontrib-napoleon.readthedocs.io/en/latest/example_numpy.html)
  ```python
  def func(arg1, arg2):
      """Summary line.

      Extended description of function.

      Parameters
      ----------
      arg1 : int
          Description of arg1
      arg2 : str
          Description of arg2

      Returns
      -------
      bool
          Description of return value
      """
      return True
  ```


* [Google style](https://sphinxcontrib-napoleon.readthedocs.io/en/latest/example_google.html)
  ```python
  def func(arg1, arg2):
      """Summary line.

      Extended description of function.

      Args:
          arg1 (int): Description of arg1
          arg2 (str): Description of arg2

      Returns:
          bool: Description of return value
      """
      return True
  ```

We will be using **Google-style docstrings** for our collaborative coding projects, as we feel they are easiest to read and write in most situations, and with tooling available to render them for auto-documentation, they are not really less functional than the reStructuredText-based Sphinx style. We recommend that you use the same style for small to medium-sized projects.

> Note that if you are using _type hinting_ (see above), you can _and should_
> omit the type information for the arguments and the return value from the
> docstring definition. Otherwise you unnecessarily increase the risk of your
> code and your documentation diverging over time.
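
Putting the two recommendations together, a Google-style docstring for a type-hinted function could look like the following sketch (the function `scale()` is a hypothetical example). The types live only in the signature, while the docstring describes meaning:

```python
def scale(values: list[float], factor: float = 2.0) -> list[float]:
    """Multiply every value by a constant factor.

    Args:
        values: Numbers to be scaled.
        factor: Multiplier applied to each value.

    Returns:
        A new list with the scaled values.
    """
    return [value * factor for value in values]


print(scale([1.0, 2.5]))  # [2.0, 5.0]
```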

## Linting

Linting refers to the automated checking of your codebase for coding (i.e., programmatic) and style issues. _**Linters**_ are tools that can be run to analyze your codebase, and they are available for most programming languages, domain-specific languages and markup languages. In fact, there are often different _linters_ available for different _aspects_ of different languages, e.g., those that focus on enforcing particular coding style guides, on documentation or on functionally assessing your code based on type information (very cool!).

Two of the most popular linters for Python are:
* [Pylint](https://pylint.org/), perhaps the oldest and most widely used one, but also (by default) the strictest
* [Flake8](https://flake8.pycqa.org/), somewhat less strict than Pylint with default settings, but still good enough to achieve decent results

> Note that there are also _code formatters_ that not only analyze your code, but actually modify it to automatically adhere to a configurable style (or use the default one). The most popular one for Python is [Black](https://github.com/psf/black). While code formatters can be very helpful, we do not recommend them for beginners, as it is better to put work into developing good habits than to rely on software. Also, they are harder to automate than linters: because they actually _change_ the code (something that shouldn't normally happen during CI), they do not fit into a CI pipeline as naturally, unless they are run in a check-only mode (e.g., `black --check`) that reports problems without modifying files.  
>  
> Also note that it is well possible (and frequently done) to use _more than one_ linter at a time, even if these linters focus on the same aspects.  
>  
> Finally, note that if you are using a commonly used code editor, such as [Visual Studio Code](https://code.visualstudio.com/) or [Atom](https://atom.io/), you can easily configure linters to run automatically every time you save your code, or even continuously. We strongly recommend that you set this up for yourself, as it is a huge time saver and, given that you get immediate feedback if you go wrong, you will learn to stick to a clean coding style much faster.

### `flake8`

We will be using the [`flake8`](https://flake8.pycqa.org/) linter for our collaborative coding project. It is very easy to install:

```bash
pip install flake8
```

...and use:

```bash
flake8 code_directory/ tests/
```

There we go!

Like all other linters, and indeed all tools presented during this session, `flake8` can be configured to your heart's content. However, we will be using the defaults for now, with one exception: we will also install the [`flake8-docstrings`](https://pypi.org/project/flake8-docstrings/) extension:

```bash
pip install flake8-docstrings
```

Now we can run

```bash
flake8 --docstring-convention google code_directory/ tests/
```

and `flake8` will also warn us about issues with our docstrings! :)

## Writing tests

_**"Untested code is broken code."**_  
Martin Aspeli, Philipp von Weitershausen

To keep your sanity while trying to maintain and extend an ever-growing codebase, it is crucially important to not only test your code while you are writing it, but to keep your test cases in your code repository, keep them up to date with your codebase and run them whenever you add new code. You should strive to cover _every line of code_ in your codebase.

Testing is a complex subject and it takes a lot of practice. Also, there are various types of tests, primarily:

* _**unit tests**_ test a block of code (typically a function or method) in isolation, thus focusing only on the behavior of that specific code block
* _**integration tests**_ test the behavior of two or more code blocks together
* _**end-to-end tests**_ are a special type of _integration test_ that test the behavior of the entire program

To get you started with testing, here we will **focus on unit tests**, as they are the easiest to write. Having your entire codebase covered by unit tests already goes a very long way in preventing bugs and easing maintenance, especially if you encapsulate and isolate your _code units_ well and minimize _side effects_ (i.e., when a code block is relying on or modifying code outside of its scope).
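
The difference between unit and integration tests can be sketched with a few hypothetical helper functions (all names below are made up for illustration). The first two tests each exercise one function in isolation, while the third checks that both functions work together:

```python
def parse_numbers(text):
    """Split a comma-separated string into a list of floats."""
    return [float(part) for part in text.split(",")]


def mean(numbers):
    """Return the arithmetic mean of a non-empty list of numbers."""
    return sum(numbers) / len(numbers)


def mean_from_text(text):
    """Parse a comma-separated string and return the mean of its numbers."""
    return mean(parse_numbers(text))


# Unit tests: each one exercises a single function in isolation.
def test_parse_numbers():
    assert parse_numbers("1, 2,3") == [1.0, 2.0, 3.0]


def test_mean():
    assert mean([1.0, 2.0, 3.0]) == 2.0


# Integration test: exercises both functions working together.
def test_mean_from_text():
    assert mean_from_text("1,2,3") == 2.0
```

If the integration test fails while both unit tests pass, the problem most likely lies in how the two functions are combined.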

### `pytest`

[`pytest`](https://docs.pytest.org/) is a widely used package for code testing in Python. Like other packages, it can be installed with 

```
pip install pytest
```

and once installed it can be called with 
```
pytest
```

It collects and runs the tests (by default, functions whose names start with `test`) defined in files called `test_*.py` or `*_test.py` that are located in the current directory or its subdirectories. Which directories and file patterns to search can be changed via command-line parameters of `pytest`. More on customizing `pytest` can be found at https://docs.pytest.org/en/6.2.x/reference.html. For now, we will use the default parameters.

It is a good idea to separate the actual code to be tested from the testing code. Typically, we put the latter in a directory `tests/` in the repository root directory. However, the testing code will need to access the code of the module being tested. The way to do this is to _import_ the module to be tested into the testing code, in a way that does not make assumptions about the directories in which the modules to be tested reside. Given that we know how to package our code, this is simple, as we can install our package by executing the following in the repository root (compare the `EncapsulationPackaging.ipynb` notebook):

```
pip install -e .
```

#### A simple example

Let's now look at a simple example (adapted from https://gist.github.com/bobhsr/4635489). Assume that we have the following code in module `arithmetic/arithmetic.py`:

```python
"""Classes for arithmetics operations."""


class Arithmetic:
    """A python class for basic arithmetic operations for two rational numbers.

    Non-number inputs are attempted to be cast to floats. The behavior for
    passing values that are not (rational) numbers and cannot be easily cast
    to numbers is not well defined.
    """
    def add(self, x, y):
        """Calculate the sum of inputs.

        Args:
            x: Number to be added to `y`.
            y: Number to be added to `x`.

        Returns:
            Sum of inputs.
        """
        return float(x) + float(y)

    def subtract(self, x, y):
        """Calculate the difference between inputs.

        Args:
            x: Number from which `y` is to be subtracted.
            y: Number to be subtracted from `x`.

        Returns:
            Difference between inputs.
        """
        return float(x) - float(y)
```

Let's further assume that we have created a package out of directory `arithmetic/` (by adding `__init__.py`) and installed it by creating a corresponding `setup.py` and running `pip install -e .` from the repository root directory.

Now, we want to write code that tests all of the methods in the class `Arithmetic` defined above. As mentioned previously, we will save this code in a `tests/` directory. One way of organizing the tests is to create one module of testing code for each module of code to be tested. Considering the naming conventions that ensure that `pytest` finds the tests, (part of) our project's directory structure will look something like this:

```
├── arithmetic
│   ├── __init__.py
│   └── arithmetic.py
├── setup.py
└── tests
    └── test_arithmetic.py
```

The `test_arithmetic.py` file could look like this:

```python
# imports
import pytest

from arithmetic.arithmetic import Arithmetic  # we import class `Arithmetic` from module `arithmetic` in package `arithmetic`

# create an instance of the Arithmetic class
ar = Arithmetic()


# tests for the `.add()` method
def test_add():
    assert ar.add(1, 2) == 3.0  # we ensure that the addition works as expected for a few cases
    assert ar.add(3, -7) == -4.0
    assert ar.add(-10, -10) == -20.0
    assert ar.add(1, "2") == 3.0  # we also ensure that strings that can be cast to numbers are handled as expected
    assert ar.add("1", 2) == 3.0
    assert ar.add("1", "2") == 3.0
    with pytest.raises(ValueError):  # and we make sure that Python complains if we try to convert letters to numbers
        ar.add(1, "a")
    with pytest.raises(ValueError):
        ar.add("a", 1)
    with pytest.raises(ValueError):
        ar.add("a", "b")


# tests for the `.subtract()` method
# here we are making use of pytest's functionalities to run parametrized tests by creating from a nested list of tuples
# (1) a list `test_input` of input tuples and (2) a list `expected` of expected values
# now let's use those lists to write the tests
@pytest.mark.parametrize(
    "test_input,expected",
    [((1, 2), -1.0), ((3, -7), 10.0), ((-10, -10), 0.0), ((1, "2"), -1.0), (("1", 2), -1.0), (("1", "2"), -1.0)]
)
def test_subtract_param(test_input, expected):  # we need to pass the lists to the test function...
    assert ar.subtract(test_input[0], test_input[1]) == expected


# we can do the same for the tests that raise an error:
@pytest.mark.parametrize(
    "test_input,expected",
    [((1, "a"), ValueError), (("a", 1), ValueError), (("a", "b"), ValueError)]
)
def test_subtract_param_failing(test_input, expected):
    with pytest.raises(expected):
        ar.subtract(test_input[0], test_input[1])
```

The first two lines of code import `pytest` and the `Arithmetic` class, respectively. We then create an instance of the class to be used in our tests. What follows is a block of functions, each designed to test the functionality of one method of the `Arithmetic` class. Specifically, we are using two basic test cases:

1. Asserting a specific result when calling the tested code, with the general syntax:
   ```python
   assert func(x, y) == result
   ```
2. Ensuring that a specific error is raised when calling the tested code, with the general syntax:
   ```python
   with pytest.raises(Error):
       func(x, y)
   ```

When running `pytest` from the repository root directory, all tests will be executed and if we did everything correctly, we will learn that all tests we have set up have passed - yay! :)

> Note that running `pytest` will create a directory `.pytest_cache/` in the repository root directory. Make sure you do _not_ version control it by including it in your `.gitignore` file. If you have automatically created your `.gitignore` file via http://gitignore.io/ and you have selected Python, you will not need to add it manually.

Let's be brave and tweak one of the tests to find out how `pytest` reacts if a test fails. For example, if you do

```python
def test_add():
    with pytest.raises(ValueError):
        ar.add(1, True)
```

`pytest` will fail because no `ValueError` is raised. Can you imagine why not?
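
One way to investigate the question above is to check in an interactive session how Python treats booleans when they are converted to numbers:

```python
# `bool` is a subclass of `int`, so `True` behaves like the number 1.
print(issubclass(bool, int))  # True
print(float(True))            # 1.0

# A string of letters, in contrast, cannot be converted to a number.
try:
    float("a")
except ValueError as error:
    print(f"ValueError: {error}")
```

So `ar.add(1, True)` simply returns `2.0` instead of raising an error, which may or may not be what you want your code to do.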

Apart from test parametrization, `pytest` offers many more features and "fixtures" (functions that help you set up a test case), such as creating temporary files, checking the screen output etc. Have a look at `pytest`'s [documentation](https://docs.pytest.org/en/6.2.x/contents.html) for more info. But don't be put off by the complexity, you will learn more of it incrementally, when you need it.

There are, however, two important aspects of testing that we would like to highlight.

#### Monkeypatching

As mentioned before, in _unit tests_ we are testing blocks of code in isolation. But what if our code depends on the functionality of third-party code? Should our tests extend to documented (and hopefully tested) behavior of other people's code? For unit tests, the answer is "It depends!". While we should never go as far as writing entire unit tests for other people's code, if our code depends on it, we should usually try to cover with our tests all responses (return values or errors) we may reasonably expect from it. However, we should also make sure that our tests run quickly and ideally do not depend on too many dependencies, e.g., the availability of an external service or database. So what can we do to avoid it?

The answer is _**monkeypatching**_ or _**mocking**_, the practice of overriding code to return a specific, well-defined response.

There are several use cases for monkeypatching (see [here](https://docs.pytest.org/en/6.2.x/monkeypatch.html) for some more), but let's focus on the example mentioned earlier, the dependency on an external service that we are trying to call via HTTP. Consider this example code (adapted from [`pytest`'s documentation](https://docs.pytest.org/en/6.2.x/monkeypatch.html#monkeypatching-returned-objects-building-mock-classes)):

```python
# contents of app.py, a simple example where a JSON response is retrieved from
# a web service available at a specified URL and is serialized into a Python dictionary
import requests


def get_json(url):
    """Takes a URL, and returns the JSON."""
    r = requests.get(url)
    return r.json()
```

So what if the URL is down when we are running our tests? The tests would likely fail and we might scratch our heads, thinking that something is wrong with our own code, when in fact it was just a server outage. Of course, you should _also_ account in your code for the possibility of such outages and test for _that_ behavior, but you will also need to test the expected _normal_ behavior, so in comes _monkeypatching_:

```python
# contents of test_app.py, a simple test for our API retrieval

# import requests for the purposes of monkeypatching
import requests

import pytest

# our app.py that includes the `get_json()` function
# see the previous code block example
import app

# custom class to be the mock return value
# will override the `requests.Response` object returned from `requests.get()`
class MockResponse:

    # `requests.Response` has a `.json()` method that we are relying on
    # our mock `.json()` method will always return a specific, well-defined testing dictionary
    @staticmethod
    def json():
        return {"mock_key": "mock_response"}


# now let's write our test case
def test_get_json(monkeypatch):  # `monkeypatch` is a built-in pytest fixture; pytest passes it in by argument name

    # here we are defining a mock method: whatever arguments are passed to it,
    # mock_get() will always return our mocked object `MockResponse`, which only has the .json() method.
    def mock_get(*args, **kwargs):
        return MockResponse()

    # now let's monkeypatch requests.get with mock_get
    # we do this by telling `monkeypatch` to set the `.get` function (or more generally, attribute)
    # of the `requests` module to the `mock_get` function so that whenever `requests.get()` is called
    # our `mock_get()` function is called, which returns our `MockResponse` object
    monkeypatch.setattr(requests, "get", mock_get)

    # now let's call `app.get_json()`, which contains `requests.get()` (see previous code block)
    # it will use our monkeypatch...
    result = app.get_json("https://fakeurl")
    # ...and so our JSON response returns our pre-defined dictionary!
    assert result["mock_key"] == "mock_response"
```

As we have seen, we set our test up in a way that it isn't actually making that call to the `url` anymore, but rather we simply get the response that we put in. For the simple code block we have tested, this may seem somewhat pointless, because there's basically _nothing else_ but that call to the URL and so why bother testing it if we basically then don't actually make that call and just tell it what to return us instead. We're just getting out what we put in - way to test! But now imagine that there's more in `get_json()`, some processing that actually needs proper testing? In that case, we would be able to test that code independently of the availability of a server at URL `url` or in the absence of an internet connection.
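
To illustrate the case where there is actual processing to test, here is a self-contained sketch. Note that it swaps in different tools than the example above so that it runs with the standard library alone: `urllib.request` instead of `requests`, and `unittest.mock` instead of pytest's `monkeypatch` fixture. The function `get_titles()` and the shape of the JSON payload are assumptions made up for illustration:

```python
import io
import json
import urllib.request
from unittest import mock


def get_titles(url):
    """Fetch JSON from `url` and return its item titles in upper case."""
    with urllib.request.urlopen(url) as response:
        payload = json.load(response)
    # This processing step is what the test below actually checks.
    return [item["title"].upper() for item in payload["items"]]


def test_get_titles():
    # An in-memory byte stream stands in for the HTTP response.
    fake_response = io.BytesIO(b'{"items": [{"title": "duck"}, {"title": "whale"}]}')
    # Replace `urlopen` for the duration of the `with` block only.
    with mock.patch.object(urllib.request, "urlopen", return_value=fake_response):
        assert get_titles("https://fakeurl") == ["DUCK", "WHALE"]
```

The test never touches the network, yet it still verifies that the titles are extracted and converted correctly.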

Apart from setting/overriding and deleting class attributes, the `monkeypatch` fixture also has methods to set and delete dictionary items and environment variables. And while it is difficult for beginners to estimate when to use monkeypatching in practice and we will probably not need to make use of it for our coding project, we feel it is important for you to know that overriding external code, data and variables is _possible_ should the need ever arise (and it soon will if you start doing something more complex!).

### Code coverage

The last aspect of testing we want to touch upon is the concept of _**code coverage**_, which is defined as the percentage of all code statements that are covered by the entirety of available test cases. Say your code consists of 100 statements, but your test cases never run 30 of these statements; then your _code coverage_ is 70%. As mentioned before, we should be striving to have _all_ statements covered by tests, so in terms of _code coverage_, we are striving for 100% - a very high bar! But again, use/publish untested code at your own peril - sooner or later it's gonna come back at you hard!
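
The arithmetic behind this definition can be written down in a few lines. This is only a sketch of the calculation described above, with a hypothetical function name:

```python
def coverage_percent(total_statements, missed_statements):
    """Return the share of statements executed by the tests, in percent."""
    covered = total_statements - missed_statements
    return 100 * covered / total_statements


print(coverage_percent(100, 30))  # 70.0
```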

There is a nice Python package [`coverage`](https://coverage.readthedocs.io/en/6.0.2/) that conveniently allows you to calculate your code coverage. It is easy to install:

```bash
pip install coverage
```

...and use (call from repository root directory):
```bash
coverage run --source=code_directory/ -m pytest
# where code directory is the _top-level_ directory containing the code to be tested
```

This will calculate the _code coverage_ across all modules inside the `code_directory/` directory and all its subdirectories and write output to a file `.coverage` that is created in the repository root directory.

> Similar to the `.pytest_cache/` directory mentioned above, make sure not to version control the `.coverage` file. And again, if you have automatically created your `.gitignore` file via http://gitignore.io/ and you have selected "Python", you don't need to worry about adding it manually.

You can also specify more than one directory at a time, like so:

```bash
coverage run --source=code_directory_1/,code_directory_2/ -m pytest
```

In order to see the calculated _code coverage_, execute the following:

```bash
coverage report -m
```

The `-m` flag ensures that the lines of code that are not covered by tests are explicitly mentioned in the output, which is very useful to tell you what tests you still need to implement. Luckily for our `arithmetic` example, there are no statements missing (not surprising, because our methods were just single lines of code with no conditionals/branching), so we are at 100% coverage:

```console
Name                       Stmts   Miss  Cover   Missing
--------------------------------------------------------
arithmetic/__init__.py         0      0   100%
arithmetic/arithmetic.py       5      0   100%
--------------------------------------------------------
TOTAL                          5      0   100%
```
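
Things get more interesting once the code contains branches. In the following sketch (the function `safe_divide()` and its tests are made up for illustration), the first test alone never executes the `return None` statement, so we would expect that line to be listed as missing in the report until the second test is added:

```python
def safe_divide(x, y):
    """Divide `x` by `y`, returning `None` for a zero divisor."""
    if y == 0:
        return None
    return x / y


def test_safe_divide():
    assert safe_divide(6, 3) == 2.0


def test_safe_divide_by_zero():
    # Without this test, the `return None` statement would never be executed.
    assert safe_divide(1, 0) is None
```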

## Git

[Git](https://git-scm.com/) is a distributed VCS originally launched in 2005 by
Linus Torvalds, the father and namesake of the Linux kernel. Over the years,
Git outcompeted other VCS solutions to the extent that it is now considered the
**_de facto_ standard VCS for open source software development** (check [this
report from
RhodeCode](https://rhodecode.com/insights/version-control-systems-2016) for
some actual, though dated, numbers; the actual dominance of Git has likely
further increased substantially since the report was published in 2016; you may
also want to read [this Hackernoon blog
post](https://hackernoon.com/how-git-changed-the-history-of-software-version-control-5f2c0a0850df)
for some additional context on the impact of Git). 

Given the popularity of Git, especially in the scientific software community
(we are not aware of a single piece of serious, widely used open source
software in the field that does not offer its code in a Git repository), **we
will be using Git as our VCS of choice throughout this course**.

Git is **fast, free, open source** and comes preinstalled on most Mac and Linux
machines. It stores a project's history as a directed graph, with a root, edges
("branches"), nodes and leaves ("commits"). Commits represent snapshots of a
project's state, and given Git's distributed nature (i.e., users work on a
clone of the entire code repository, not just the current state) it is easy to
traverse the tree to go back in time, roll back changes or to compare one state
of the project with another (which makes code review a breeze). Branches
represent different lines of work and they can be used for various purposes,
e.g., maintaining multiple versions of your software or keeping stable,
fully-tested code separately from features that are currently being
implemented. Generally, there is one default branch (typically the `main` or,
formerly, the `master` branch - depending on your version of Git). This is also
often referred to as a _stable branch_, _release branch_ or _production branch_.

> Note that while Git is distributed by design, it doesn't mean that
> there cannot be a central, authoritative repository in Git workflows. It just
> allows you _not_ to have one, if it happens to suit your project's needs. In
> reality, though, most open source software projects _will_ make use of such a
> _blessed_ repository, and so will we.

One other important property of Git is that it distinguishes between three
different environments:

* The **working directory**  
  This is your current directory structure and corresponds to the state of the
  project on your file system.
* The **staging area**  
  This includes all changes _staged_ to be included in the next commit.
* The actual **local repository**  
  This holds the committed history of the project. The commit you are currently
  viewing is referred to as the HEAD.

Now, when you create a fresh Git repository from an empty directory, clone a
repository from a Git server or pull the latest changes from a remote
repository to your local copy (more on the latter two later), there will be no
difference between the working directory and the HEAD, and the staging area
will be empty. But once you start adding files to or editing files in the
directory containing the Git repository, the status of your working directory
will start to differ from that of the HEAD. You can then _stage_ one, multiple
or all of the created or modified files to be included in the next commit,
filling up the staging area. Changes that are not staged will not be part of
that commit. Once you are happy with your staged changes, you can go ahead and
commit them - at which point (provided that you staged all of your changes) the
working directory and the HEAD will be in sync again.
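
The interplay of the three areas can be sketched with a small Python toy model. Everything here is an assumption made for illustration (the class `ToyRepo` and its methods are hypothetical and do not mirror Git's internals): files are plain strings, deletions are not modeled, and untracked and modified files are lumped together as "unstaged".

```python
class ToyRepo:
    """Toy model of Git's three areas; file contents are plain strings."""

    def __init__(self):
        self.working_dir = {}  # files as they currently exist on disk
        self.staging = {}      # content marked for inclusion in the next commit
        self.head = {}         # snapshot recorded by the latest commit

    def add(self, *names):
        """Like "git add": copy the files' current content to the staging area."""
        for name in names:
            self.staging[name] = self.working_dir[name]

    def commit(self):
        """Like "git commit": record the staged content, then empty the staging area."""
        self.head = {**self.head, **self.staging}
        self.staging = {}

    def status(self):
        """Report staged files and files with changes that are not staged."""
        known = {**self.head, **self.staging}
        unstaged = sorted(
            name for name, content in self.working_dir.items()
            if known.get(name) != content
        )
        return {"staged": sorted(self.staging), "unstaged": unstaged}


repo = ToyRepo()
repo.working_dir["my_file_1"] = "some_content"
repo.working_dir["my_file_2"] = "other_content"

repo.add("my_file_1")
print(repo.status())  # {'staged': ['my_file_1'], 'unstaged': ['my_file_2']}

repo.commit()
print(repo.head)      # {'my_file_1': 'some_content'}
print(repo.status())  # {'staged': [], 'unstaged': ['my_file_2']}
```

Note how `my_file_2` never makes it into the commit because it was not staged; only after adding and committing it as well would the working directory and the HEAD match.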

### Common Git workflows for collaborative coding

When working on a project collaboratively, it is critically important to keep
the corresponding code repository in a clean state so as to avoid, as much as
possible, conflicts introduced by different people modifying the same portions
of the same files. Conflicts cannot always be avoided, and that's totally fine
and no one's mistake, but it helps to adopt a common Git workflow that everyone
can follow.

Here are some of the most commonly used Git workflows (or branching models),
in increasing order of complexity. Which one to pick will depend on the
requirements of your project, but generally the bigger the code base and the
more people are contributing, the more complex the workflow should be. All of
these branching models have in common that they do not allow anyone to push
code directly to the main/default branch.

* [**"GitHub flow"**](https://guides.github.com/introduction/flow/) (low
  complexity)  
  Feature branches are created off the main branch and are merged back into it after a feature has been implemented, tested and reviewed; multiple feature branches can be worked on at the same time, one for each feature and ideally by a single person; the main branch is always assumed to be stable and in a state to be deployed
* [**"GitLab flow"**](https://docs.gitlab.com/ee/topics/gitlab_flow.html)
  (medium complexity)  
  This branching model builds on the GitHub flow and includes guidelines for setting up optional production (to deploy code from development whenever the time is right), environment and release branches on top of the default/main branch and feature branches; this is useful for situations where intensive testing is required prior to deployment (e.g., staging, pre-production, production deployments and corresponding environment branches), when deployments are scheduled or where multiple explicitly versioned releases of a piece of software are to be maintained
* [**"GitFlow"**](https://nvie.com/posts/a-successful-git-branching-model/)
  (high complexity)  
  The GitHub and GitLab flows represent recent simplifications of this workflow, one of the oldest and most widely known Git branching models; while still a good fit for some release/versioning schemes, its popularity is decreasing with the increasing adoption of continuous integration and delivery solutions, which render some of its features unnecessary; GitFlow prescribes the use of several branch types (main/production, hotfix, release, development and feature branches), with the development branch being the default branch

**For this course, we will be making use of the simplest branching model, the
GitHub flow**, with the additional constraint that every feature branch will be
the sole responsibility of a single person. In this way, it is not so important
that feature branches are kept overly tidy and you can commit to them as often
as you like (it takes experience to keep commits tidy). "Features" to be
implemented during the collaborative coding project will also be kept small,
thus minimizing the possibility of merge conflicts.

### Interactive session: Basic Git commands

In this session, we will create a local Git repository, learn about staging
and committing files, practice the GitHub flow and get to know some useful Git
utilities.

As a first step, please make sure that Git is installed on your machine by
executing `git --version` in your shell. If the command is not found, you will
first need to [install
Git](https://git-scm.com/book/en/v2/Getting-Started-Installing-Git) before you
can continue. Please let us know if you run into any problems during
installation.

Please refer to the following documentation for further details or in case you
missed the interactive session:

* Git commands
  * [`git config`](https://www.atlassian.com/git/tutorials/setting-up-a-repository/git-config):
    set your user name and email address for attribution and feedback
  * [`git init`](https://www.atlassian.com/git/tutorials/setting-up-a-repository/git-init):
    create a Git repository out of the current working directory
  * [`git status`](https://www.atlassian.com/git/tutorials/inspecting-a-repository):
    display current branch and the state of the working directory and staging area 
  * [`git diff`](https://www.atlassian.com/git/tutorials/saving-changes/git-diff):
    show changes in the working directory that have not been staged yet
  * [`git add`](https://www.atlassian.com/git/tutorials/saving-changes):
    add files to the staging area
  * [`git commit`](https://www.atlassian.com/git/tutorials/saving-changes/git-commit):
    commit staged files to local repository
  * [`git log`](https://www.atlassian.com/git/tutorials/inspecting-a-repository):
    display commit history
  * [`git branch`](https://www.atlassian.com/git/tutorials/using-branches):
    create or delete a branch
  * [`git checkout`](https://www.atlassian.com/git/tutorials/using-branches/git-checkout):
    switch to a different branch or commit
  * [`git merge`](https://www.atlassian.com/git/tutorials/using-branches/git-merge):
    merge one branch into another
 
> Note that all of these commands have multiple options and most offer more
> functionality than mentioned here. For brevity, we are focusing only on the
> functionality that we will likely be using during the course.

Execute the following commands in Bash:

```bash
# 1. SET UP GIT

# check if Git is installed
git --version

# configure user details
# omit the value to print the current setting instead, e.g.,
# "git config --global user.name"
git config --global user.name "MY NAME"
git config --global user.email "my@email.com"

# 2. CREATE REPOSITORY

# create new directory and move into it
mkdir -p my_repository
cd my_repository

# create Git repository from current working directory (and all subdirectories)
# directory can be empty or contain preexisting files and directories
# generates a repository root, an (empty) default branch ("main" or "master",
# depending on Git version) and a ".git" directory that stores all changesets,
# metadata etc 
git init

# 3. ADD COMMIT STRAIGHT TO DEFAULT BRANCH

# check status of working directory and staging area relative to state of
# repository
git status

# now it's time to add or modify some files...
touch my_file_1 my_file_2

# confirm that repository status changed: the new files are now listed as
# untracked; note that "git diff" prints nothing at this point, because it only
# reports changes to files that Git already tracks
git status
git diff

# stage files for inclusion in next commit, then check and commit
git add my_file_1 my_file_2
# alternatively, do: "git add -A" to add _all_ new/modified files to staging
# area
git status
git commit -m "initial commit"
# alternatively just do: "git commit" to open an editor where you can enter a
# more detailed commit message

# confirm that commit history now contains a new entry and that the staging
# area is clean again
git log
# alternatives:
# short representation with one line per commit: "git log --oneline"
# include visualization of branch tree:
# "git log --graph --decorate --oneline --all"
git status

# make sure your default branch is "main", not "master"
git branch -m main

# 4. MERGE IN CHANGES FROM FEATURE BRANCH

# create feature branch and switch to feature branch
git branch my_feature
# you can use "git branch", without argument, to list available branches and
# verify that a branch was indeed created
git checkout my_feature
# alternatively, do "git checkout -b my_feature" to create and switch to
# feature branch with a single command
# alternatively, you can use "git checkout" also to switch to a specific commit
# by passing a commit identifier/hash (abbreviated, e.g., 7 characters, or full
# length); you can get these from "git log"

# add and/or modify some files...
touch my_file_3
echo "some_content" >> my_file_1

# add to staging area and commit
git status
git diff
git add my_file_3 my_file_1
git status
git commit -m "feat: add my feature"
git log --oneline

# now merge new commit(s) from feature branch into default branch
git checkout main
git merge my_feature

# delete feature branch; "-d" only deletes branches that have been merged, use
# "-D" to force the deletion of unmerged branches
git branch -d my_feature
# you could use "git branch" to verify that the branch was indeed deleted
```

### Remote repositories & GitLab

So far we have only been working with Git locally. But to successfully use Git
for collaborative coding, we need a common remote repository to push our code
changes to, pull the work of others from, etc.

**For hosting our remote repository we will be making use of the popular
Git server and social coding platform [GitLab](https://gitlab.com/)**. While
GitLab is not quite as widely used as Microsoft's
[GitHub](https://github.com/) platform, we chose to use it for this course as
the University of Basel's scientific compute center
[sciCORE](https://scicore.unibas.ch/) offers a local deployment of GitLab at
http://git.scicore.unibas.ch/ and so it will be easy to apply what you have
learned here while working at the Biozentrum. Besides, GitLab and GitHub are
actually quite similar, so it will not be very difficult to transition in case
the need arises (we are using both in our lab).

Next to simple hosting of Git repositories (there are various services that do
just that), GitLab and other social coding platforms offer various project
management functionalities that are incredibly useful for running a software
development project, including merge requests (called pull requests on GitHub
and other platforms), code review tools, an issue tracker and automated kanban
boards for managing issues and merge requests. Social coding platforms (and
Git servers in general) also allow users to _fork_ any public repository,
i.e., create a personal, remote copy of it. In open source software
development, creating merge/pull requests from a fork to the corresponding
original repository is the typical way for people to contribute code to
projects that they are not directly affiliated with and for which they thus do
not have the permissions to push code directly to the original repository.
During the collaborative coding project in the second half of this course, we
will also make use of forking, as we will keep the remote repository of the
project on our sciCORE GitLab instance, to which not all of you may have access
(and thus we may not be able to add you as collaborators with the necessary
write permissions to the project).

### Best practices

1. **Mind what you commit**  
   Only commit manually generated files. Auto-generated files can be recreated
   later and only clutter the repository. For the same reason (and also because
   Git servers impose restrictions on total file and/or repository sizes), do
   not commit big files, such as data files, so keep your test files small and
   relevant. Do not commit any artefacts that are specific to your environment,
   e.g., absolute file paths (this is also a potential security issue!) and,
   most importantly, **do not commit secrets or any other sensitive
   information!** There are bots out there that are constantly scanning public
   Git repositories for such information, and it is cumbersome to rewrite
   the Git history to completely remove such information. String patterns
   matching the names of files and directories to be ignored can be indicated
   in various places, most often in a version-controlled `.gitignore` file that
   is placed in the repository root directory. However, patterns that are
   specific to your work environment and that you want to apply to all of your
   repositories (e.g., editor-specific artefacts such as lock and backup
   files) are better configured globally. Have a look at the ["gitignore"
   documentation](https://git-scm.com/docs/gitignore) to find out how to use
   it. We also recommend that you make use of the
   [gitignore.io](http://gitignore.io/) service that auto-generates "gitignore"
   patterns for you based on a wide list of keywords (e.g., `Python`,
   `VisualStudioCode`, `Linux`). Given the importance of configuring Git to
   ignore certain files, we have included the generation of a `.gitignore` file
   in the homework below.
2. **Create single-purpose commits with semantic commit messages**  
   Analogous to the [single-responsibility
   principle](https://en.wikipedia.org/wiki/Single-responsibility_principle)
   of software development, a commit should wrap only functionally and/or
   semantically related changes. For example, if you find a typo in your
   project's documentation while you are working on something else, do not
   just fix it along with your other changes, but put the fix into a commit of
   its own. It's easier to review changes that span as few lines as possible.
   It's also easier to roll back buggy code without any additional side
   effects. Finally, it will make for a clean commit history and changelog, so
   that users of your software can track its development and the release of new
   features and bug fixes. In this regard, we strongly recommend that you
   describe your commits using concise (up to 50 characters in the title line)
   semantic commit messages following the
   [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/)
   specification. It may be painful at first, but learning it early should
   prevent you from adopting bad practices, and it comes with multiple
   benefits, including a clean and consistent commit history, the ability to
   create changelogs, bump versions according to the widely used [Semantic
   Versioning (SemVer)](https://semver.org/) specification and even publish
   entire releases automatically (in Python, e.g., with the
   [`python-semantic-release`](https://python-semantic-release.readthedocs.io/en/latest/)
   package).
3. **Do not rewrite history**  
   Once you push code to a public repository, others are able to pull the code
   and work on it (which you may not necessarily be aware of). If you change
   the commit graph between the time point that someone has checked out the old
   history tree and the time point they are trying to merge their new code, Git
   will not be able to resolve the resulting conflicts, causing much justified
   frustration. Apart from creating irresolvable conflicts, rewriting history
   can also lead to contributions being misattributed to different people
   (imagine someone rewriting your paper, stripping your name and replacing the
   old version with the new one on the publisher's website) and difficulties in
   tracking the progress of your project. Of course, as with (almost) every
   guideline, there are valid exceptions: For example, in some Git workflows,
   including the one we are using, the history of a _feature branch_ (but not
   a release/stable branch) can be rewritten in order to clean up or squash
   commits before merging. Here, by convention, the assumption generally holds
   that only a single person is working on a given feature branch at a time.
   Another exception is when you inadvertently commit sensitive information to
   your repository (see above). Clearly, in such a situation even rewriting a
   stable branch may become necessary.
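
To illustrate the second best practice, here is a minimal Python sketch that checks whether a commit title follows the basic `<type>[(scope)][!]: <description>` shape of Conventional Commits and respects the 50-character limit mentioned above. The function name `check_title` and the list of accepted types are assumptions made for this example: the specification itself only defines `feat` and `fix` and leaves further types to each project.

```python
import re

# Assumption: illustrative list of types; the specification only defines
# "feat" and "fix" and allows projects to agree on additional ones.
TYPES = ("feat", "fix", "docs", "refactor", "test", "chore")
TITLE_PATTERN = re.compile(rf"^({'|'.join(TYPES)})(\([\w-]+\))?!?: \S.*$")
MAX_TITLE_LENGTH = 50


def check_title(title):
    """Return a list of problems found in a commit title (empty if none)."""
    problems = []
    if not TITLE_PATTERN.match(title):
        problems.append("does not match '<type>[(scope)][!]: <description>'")
    if len(title) > MAX_TITLE_LENGTH:
        problems.append(f"longer than {MAX_TITLE_LENGTH} characters")
    return problems


print(check_title("feat: add my feature"))  # []
print(check_title("fixed some stuff"))      # one problem: wrong format
```

A check along these lines could, for example, be run before committing, but the sketch only looks at the title line and ignores the optional body and footers of a commit message.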


> More thoughts and best practices can be found in [this extensive
> resource](https://sethrobertson.github.io/GitBestPractices/), courtesy of
> Seth Robertson.
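
As a rough sketch of why semantic commit messages make automated version bumps possible (see the second best practice above), the following Python function derives the next SemVer version from a list of commit titles: a breaking change (marked with `!`) bumps the major version, `feat` the minor version and `fix` the patch version. The function `next_version` is hypothetical and much simpler than real release tools: it only inspects title lines, so breaking changes declared in a `BREAKING CHANGE:` footer are not detected, and the special rules for `0.x` versions are not considered.

```python
import re


def next_version(current, commit_titles):
    """Return the next SemVer version implied by a list of commit titles."""
    major, minor, patch = (int(part) for part in current.split("."))
    bump = None
    for title in commit_titles:
        header = re.match(r"^(\w+)(\([\w-]+\))?(!)?: ", title)
        if header is None:
            continue  # not a conventional commit title; ignored in this sketch
        commit_type, _, breaking = header.groups()
        if breaking:
            bump = "major"
        elif commit_type == "feat" and bump != "major":
            bump = "minor"
        elif commit_type == "fix" and bump is None:
            bump = "patch"
    if bump == "major":
        return f"{major + 1}.0.0"
    if bump == "minor":
        return f"{major}.{minor + 1}.0"
    if bump == "patch":
        return f"{major}.{minor}.{patch + 1}"
    return current  # no release-relevant commits


print(next_version("1.2.3", ["fix: handle empty input", "docs: update README"]))  # 1.2.4
print(next_version("1.2.3", ["feat: add export", "fix: handle empty input"]))     # 1.3.0
print(next_version("1.2.3", ["feat!: drop support for old config files"]))        # 2.0.0
```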



### Interactive session: Git & GitLab

In this session, we will create a _remote_ on GitLab. It will serve as our central, authoritative repository to/from which we push/pull the latest code changes. Next to the relevant Git commands to interact with a remote, we will explore GitLab's basic project management, code review and merging features.

As a first step, please [register with GitLab](https://gitlab.com/users/sign_up) if you haven't already done so.

Please refer to the following documentation for further details or in case you missed the interactive session:

* GitLab functionalities
  * [Creating a blank project](https://docs.gitlab.com/ee/user/project/working_with_projects.html#blank-projects)
  * [Managing project permissions](https://docs.gitlab.com/ee/user/project/working_with_projects.html#blank-projects)
  * [Setting branch protection rules to enforce GitHub flow](https://docs.gitlab.com/ee/user/project/protected_branches.html#require-everyone-to-submit-merge-requests-for-a-protected-branch)
  * [Creating issues](https://docs.gitlab.com/ee/user/project/issues/managing_issues.html#create-a-new-issue)
  * [Creating merge requests](https://docs.gitlab.com/ee/user/project/merge_requests/creating_merge_requests.html)
  * [Automatically closing issues](https://docs.gitlab.com/ee/user/project/issues/managing_issues.html#closing-issues-automatically)
  * [Reviewing merge requests](https://docs.gitlab.com/ee/user/project/merge_requests/reviews/)
  * [Squashing commits before merging](https://docs.gitlab.com/ee/user/project/merge_requests/squash_and_merge.html)


* Git commands
  * [`git remote add`](https://docs.github.com/en/get-started/getting-started-with-git/managing-remote-repositories#adding-a-remote-repository):
    connect a local with a remote repository
  * [`git push`](https://docs.github.com/en/get-started/using-git/pushing-commits-to-a-remote-repository#about-git-push):
    push changes from local to remote repository
  * [`git fetch`](https://docs.github.com/en/get-started/using-git/getting-changes-from-a-remote-repository#fetching-changes-from-a-remote-repository):
    download the latest commits and branch information from the remote repository but do _not_ merge any changes into your local branches
  * [`git pull`](https://docs.github.com/en/get-started/using-git/getting-changes-from-a-remote-repository#pulling-changes-from-a-remote-repository):
    fetch the latest changes and merge them into the current local branch (shortcut for executing `git fetch` and `git merge` one after another)
  * [`git clone`](https://docs.github.com/en/get-started/using-git/getting-changes-from-a-remote-repository#cloning-a-repository):
    create a local copy of a remote repository

Execute the following code in Bash:

```bash
REPO_ADDRESS=git@gitlab.com:username/repo_name.git  # replace this with your real remote repository URL

# 1. CONNECT LOCAL WITH REMOTE REPOSITORY

# connect the local to the remote repository; call the remote repository
# "origin"
git remote add origin "$REPO_ADDRESS"
# push everything in local repository (all branches and tags) to the remote
# called "origin"
# tags are just optional labels for specific commits, e.g., if you decide that
# this commit right here is going to represent "v1.2.3" of your software
git push -u origin --all
git push -u origin --tags

# 2. ADD CODE CHANGES TO FEATURE BRANCH AND PUSH TO REMOTE

# make sure you are on the default branch and that it is in sync with all the
# latest changes on the remote
git checkout main
git fetch
git merge
# alternatively: "git pull" does "git fetch" (to fetch changes) and "git merge"
# (merge in changes) all in one

# create new branch and switch to it
git branch my_new_feature
git checkout my_new_feature


# add and/or modify some files...
touch my_file_4
echo "some_other_content" >> my_file_2

# add to staging area and commit
git status
git diff
git add my_file_4 my_file_2
git status
git commit -m "feat: add my new feature"
git log --oneline

# push your feature branch to the remote repo
# setting the "-u" flag sets the default remote branch for the current local
# branch, so that for future "git push" operations on your feature branch,
# you only need to execute "git push" (and similarly, git pull)
git push -u origin my_new_feature

# you can now go ahead and create a merge request on your Git server to have
# your code reviewed and your feature merged into the main/default/production
# branch
```

### Further reading

We focused here on functionalities that you are likely going to use during the
course, but there are plenty of other Git commands, as well as nuances to
the introduced commands that have not been addressed. The [official Git
website](https://git-scm.com/) offers extensive documentation, including the
[reference documentation](https://git-scm.com/docs), the book "[Pro
Git](https://git-scm.com/book/en/v2)" (free), some
[videos](https://git-scm.com/videos) and [links](https://git-scm.com/doc/ext)
to externally hosted tutorials, books, videos and courses. Apart from that,
[Stack Overflow](https://stackoverflow.com/) has [more than 100'000 questions
tagged with "Git"](https://stackoverflow.com/questions/tagged/git), so you will
likely find answers to pretty much any question you may have.

We encourage you to (re)visit these resources whenever you need them so that
you can, with time, add cool new Git skills to your toolbox.