# Lab 1: Environments and Reproducibility

The software that you write for your research, whether it is a data analysis script or a complex software package, should be FAIR: Findable, Accessible, Interoperable and Reusable. In short, the work you present in a paper should be reproducible with minimal effort by an external party. To facilitate this, we will go through some techniques for reproducing your working environment. It begins with creating an environment in the first place!

## Environments
It is important to keep projects separate, since each one may need its own version of Python and its own package dependencies. There are several popular package and environment managers: `conda`, `poetry` and `venv`, to name a few.

We're going to keep things basic and use `venv`. Creating virtual environments is really easy. You would usually first create a directory and navigate to it. On macOS or Linux:

```bash
$ mkdir ProjectName
$ cd ProjectName
```

On Windows it's the same. However, we already have a folder to work in, so navigate to the AutumnSchool folder using the command line. To make the virtual environment:

```bash
$ python -m venv venv # the second venv is the name of the environment. You can call this anything.
```

and again it should be the same in Windows.

To activate the `venv` on macOS or Linux:

```bash
$ source venv/bin/activate
```

and in Windows command line:

```bash
C:\> venv\Scripts\activate.bat
```

or in Powershell:

```bash
PS C:\> venv\Scripts\Activate.ps1
```
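If you are unsure whether activation worked, a quick check from Python can help. This is a minimal sketch that relies only on the standard library; the helper name `in_virtual_environment` is made up for illustration.

```python
import sys


def in_virtual_environment() -> bool:
    """Return True if the running interpreter belongs to a virtual environment."""
    # Inside a venv, sys.prefix points at the environment directory, while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print("Interpreter:", sys.executable)
print("Inside a virtual environment:", in_virtual_environment())
```

With the environment active, the interpreter path should sit inside the `venv` folder you just created.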

Now we are ready to open VSCode. From where you are in the terminal, run

```bash
$ code . # there is a fullstop here...
```

This should open VSCode in the current directory. If you want to keep using the existing terminal, that's fine, but it's convenient to use the integrated terminal at the bottom of the screen. Press `Ctrl` + `` ` `` (backtick) to open it if it isn't already open. You can now run the following:

```bash
$ python -m pip install -r requirements.txt
```

This should install all of the packages we need for the course. It is generally a good idea to add packages to your `requirements.txt` file as you go. You can run

```bash
$ pip freeze
```

to see a list of the Python packages installed, and you can also dump the installed packages to a file using

```bash
$ pip freeze > requirements.txt
```
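To get a feel for what `pip freeze` reports, the same kind of `name==version` list can be approximated from inside Python. This is only a sketch using the standard library's `importlib.metadata`; the function name `frozen_requirements` is hypothetical, and the output will not match `pip freeze` exactly (for example, `pip freeze` treats editable installs and a few bootstrap packages differently).

```python
from importlib.metadata import distributions


def frozen_requirements() -> list[str]:
    """Return sorted 'name==version' pins for every installed distribution."""
    pins = set()
    for dist in distributions():
        name = dist.metadata["Name"]
        if name:  # skip distributions whose metadata is missing a name
            pins.add(f"{name}=={dist.version}")
    return sorted(pins, key=str.lower)


# Show the first few pins found in the active environment.
for line in frozen_requirements()[:5]:
    print(line)
```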

Package managers like `conda` and Poetry can generate files that recreate your environment for you: `conda` exports an `environment.yml` file, while Poetry records dependencies in `pyproject.toml` and pins exact versions in `poetry.lock`. We'll talk about this later...

### A note on Anaconda, `conda` and IDEs

Anaconda has dual functionality: it will help manage your packages and will also manage your virtual environments. It is a very powerful tool, especially if collaborating with others. It comes with `conda`, a command line tool, which is also available independently of Anaconda (for example through Miniconda).

Why are we using VSCode? Because it's great. It's lightweight, supports every language under the sun, has a plugin for practically everything, and you can choose from loads of fancy themes!

Spyder and PyCharm are also very powerful IDEs. One feature of PyCharm is that it will create separate Anaconda environments for your projects, which can sometimes take up a lot of disk space.

For this course, we will be using VSCode. You're welcome to use any other IDE or editor (like VIM or Nano, if you're insane), but you're on your own in terms of support.

## Reproducibility

There might be occasions where you need random number generation. For example:
- Drawing from probability distributions for simulations
- Initializing the weights in a neural network
- Shuffling training data

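Although we call this "random", it isn't, of course. True randomness is very difficult to generate at the scale of computer hardware. Instead we use pseudo-random number generation: a deterministic algorithm that produces a sequence which looks random but is fully determined by its starting value, the seed.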
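To make the idea concrete, here is a toy pseudo-random number generator. It is a minimal sketch of a linear congruential generator (the class name `TinyLCG` is made up, and the constants are the well-known ones from Numerical Recipes); it is not the algorithm Python or NumPy actually use, and it is far too weak for real work.

```python
class TinyLCG:
    """A toy linear congruential generator, for illustration only."""

    def __init__(self, seed: int) -> None:
        self.modulus = 2**32
        self.multiplier = 1664525      # constants from Numerical Recipes
        self.increment = 1013904223
        self.state = seed % self.modulus

    def random(self) -> float:
        """Advance the internal state and return a float in [0, 1)."""
        self.state = (self.multiplier * self.state + self.increment) % self.modulus
        return self.state / self.modulus


gen = TinyLCG(1337)
print(gen.random(), gen.random())

gen = TinyLCG(1337)  # same seed, so the same sequence comes out again
print(gen.random(), gen.random())
```

Every value is computed from the previous state, which is why fixing the seed fixes the whole sequence.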

In your programs, there might be a few sources of randomness that you want to control in order to ensure reproducibility. The first of these is Python's built-in random number generator.

```python
import random
random.seed(1337)
print(random.random())
print(random.random())
print("Reinitializing the random number generator...")
random.seed(1337)
print(random.random())
print(random.random())
```

```
0.6177528569514706
0.5332655736050008
Reinitializing the random number generator...
0.6177528569514706
0.5332655736050008
```


The next one is the NumPy random number generator. It is common to set the global seed using

```python
import numpy as np

np.random.seed(1337)
```

but the best practice is to use a `Generator` instance instead:

```python
from numpy.random import default_rng

rng = default_rng(1337)
print(rng.random())
print(rng.random())
print("Reinitializing the random number generator...")
rng = default_rng(1337)
print(rng.random())
print(rng.random())
```

```
0.8781019003471183
0.18552796163759344
Reinitializing the random number generator...
0.8781019003471183
0.18552796163759344
```


The advantage of this is that you can dedicate a random number generator to a specific purpose, and no other imported package can interfere with it by resetting the global random seed. For most uses, however, using the global method will be OK.
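The isolation described above can be checked directly. In this small sketch, the calls to `np.random.seed` and `np.random.rand` stand in for a third-party package meddling with the global state; the `Generator` draws are unaffected.

```python
import numpy as np
from numpy.random import default_rng

rng = default_rng(1337)
first = rng.random(3)

rng = default_rng(1337)
np.random.seed(0)    # pretend an imported package reseeds the global state...
np.random.rand(10)   # ...and draws from it
second = rng.random(3)

# The private generator ignores whatever happened to the global state.
print(np.array_equal(first, second))  # True
```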

Next up, we have sources of randomness from Scikit-Learn. Let's have a look at this...

```python
from sklearn.utils import check_random_state
type(check_random_state(None))
```

```
numpy.random.mtrand.RandomState
```

So when `random_state=None`, sklearn falls back on NumPy's global `RandomState`, and therefore on the global NumPy seed. Let's quickly verify this:

```python
from sklearn.neural_network import MLPClassifier
import numpy as np

X = np.random.rand(5,5)
y = np.random.randint(0,5, (5,))

np.random.seed(1337)
clf = MLPClassifier(solver='lbfgs', alpha=1e-5,
                    hidden_layer_sizes=(5, 5), random_state=None)
clf.fit(X, y)
print(clf.coefs_[0])

np.random.seed(1337)
clf = MLPClassifier(solver='lbfgs', alpha=1e-5,
                    hidden_layer_sizes=(5, 5), random_state=None)
clf.fit(X, y)
print(clf.coefs_[0])
```

```
[[-0.36765581 -0.52731021 -0.34277954 -0.06888807  3.23657406]
 [ 0.02841567 -0.3677821   0.73551962  0.75131746 -4.07905129]
 [-0.17569734  0.19852565 -0.57926018  1.6052553   1.35179923]
 [ 0.44734809  0.45439334 -0.2143415  -1.1645211   2.64666125]
 [ 0.40194782 -0.48231493 -0.32726745  0.4198575   3.73596932]]
[[-0.36765581 -0.52731021 -0.34277954 -0.06888807  3.23657406]
 [ 0.02841567 -0.3677821   0.73551962  0.75131746 -4.07905129]
 [-0.17569734  0.19852565 -0.57926018  1.6052553   1.35179923]
 [ 0.44734809  0.45439334 -0.2143415  -1.1645211   2.64666125]
 [ 0.40194782 -0.48231493 -0.32726745  0.4198575   3.73596932]]
```
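As an aside, sklearn functions and estimators that accept `random_state` also take an integer, which gives them a private generator that does not depend on the global seed. The sketch below uses `sklearn.utils.shuffle` purely as a small, fast example; the choice of function and the toy data are assumptions for illustration.

```python
import numpy as np
from sklearn.utils import check_random_state, shuffle

data = np.arange(10)

# With an integer random_state, sklearn builds its own RandomState from that
# value, so the result does not depend on the global NumPy seed.
np.random.seed(0)
first = shuffle(data, random_state=1337)
np.random.seed(42)
second = shuffle(data, random_state=1337)
print(np.array_equal(first, second))  # True

# check_random_state shows what an integer is turned into internally.
state = check_random_state(1337)
print(type(state))
```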


It's super annoying having to reset the global random seed every time you want to reproduce results, so we'll have a look at setting seeds globally later. But first, let's tackle PyTorch. There are a lot of nondeterministic features in PyTorch. You can use the torch `manual_seed()` function to fix the RNG:

```python
import torch
torch.manual_seed(1337)
```

```
<torch._C.Generator at 0x135438ad0>
```

Annoyingly, some PyTorch operations draw from internal random number generators, so calling the same operation over and over gives different results unless you reset the manual seed between calls, as we did with sklearn.
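A short sketch of this behaviour, using `torch.rand` as a stand-in for any operation that consumes random numbers. The second half shows one possible alternative, a private `torch.Generator` passed through the `generator` argument; that pattern is a suggestion rather than something the rest of this lab relies on.

```python
import torch

torch.manual_seed(1337)
first = torch.rand(3)
second = torch.rand(3)   # the global generator has advanced, so this differs

torch.manual_seed(1337)  # reset the global generator between calls
repeat = torch.rand(3)

# Alternative: a private generator that global reseeding does not touch.
gen = torch.Generator().manual_seed(1337)
private_a = torch.rand(3, generator=gen)
torch.manual_seed(0)     # changing the global seed...
gen = torch.Generator().manual_seed(1337)
private_b = torch.rand(3, generator=gen)  # ...does not change this draw

print(torch.equal(first, repeat))         # True
print(torch.equal(first, second))         # False
print(torch.equal(private_a, private_b))  # True
```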

In addition, the cuDNN library can also be a source of nondeterminism. This has to do with how cuDNN benchmarks several convolution algorithms and picks the fastest one, which may differ from run to run. You can disable this benchmarking by using

```python
torch.backends.cudnn.benchmark = False
```

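You can also avoid nondeterministic algorithms altogether by using

```python
torch.use_deterministic_algorithms(True)
```

With this enabled, operations that have no deterministic implementation raise a `RuntimeError` instead of running. This can get in the way of a lot of neural network layers, like LSTM and max pooling layers, so it should probably be avoided unless you really need it.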

The short version of the story is that it is almost impossible to guarantee absolute reproducibility across all PyTorch versions and platforms. **In general you should not assume perfect reproducibility between CPU and GPU executions**.
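One contributing factor, added here as general background rather than anything PyTorch-specific, is that floating-point addition is not associative: summing the same numbers in a different order can change the last few bits. A practical consequence is that results from different platforms are better compared with a tolerance than with exact equality. A minimal sketch in plain Python:

```python
import math

left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

# Same numbers, different grouping, slightly different floating-point result.
print(left_to_right == right_to_left)              # False
# Comparing with a tolerance treats the two as the same answer.
print(math.isclose(left_to_right, right_to_left))  # True
```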

### Reproducibility from MATLAB to NumPy
This might matter to you when verifying algorithms that you are translating from MATLAB to NumPy, and it is something that has personally caused me quite a bit of grief.

It is possible to exactly reproduce a significant amount of randomness between MATLAB and NumPy. NumPy's legacy generator (`np.random.seed` and `RandomState`) uses the Mersenne Twister 19937 algorithm, and you can force MATLAB to use the same algorithm. Note that the newer `default_rng` uses PCG64 instead. With the same algorithm and seed, both languages will produce the same sequence of random numbers.

Since MATLAB and NumPy both build on the same families of linear algebra subroutines (BLAS and LAPACK, originally written in Fortran), you can also reproduce the results of many common linear solvers, at least to within floating-point rounding.
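On the NumPy side, matching MATLAB means using the legacy `RandomState` rather than `default_rng`. The sketch below shows only the Python half; the MATLAB call in the comment is the counterpart it is expected to match for uniform draws, but it was not run here, so treat the pairing as an assumption to verify on your own installation.

```python
import numpy as np
from numpy.random import default_rng

# The legacy RandomState is a Mersenne Twister (MT19937) generator.
legacy = np.random.RandomState(1)
values = legacy.rand(2)
print(values)  # approximately [0.4170, 0.7203]

# Assumed MATLAB counterpart (not executed here):
#   rng(1, 'twister'); rand(1, 2)

# default_rng is backed by a different bit generator, so it will not match.
print(type(default_rng(1).bit_generator).__name__)  # PCG64
```

Keep in mind that MATLAB fills matrices column by column while NumPy fills them row by row, so multi-dimensional draws may need a transpose before they can be compared.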

**WARNING**: You should not use these random number generators for security or cryptographic purposes. There are other libraries for this, such as Python's built-in `secrets` module.

Now that we've discussed randomness and reproducibility a little, let's see how we can set these things globally.

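```python
import os
import random
import numpy as np
import torch

DEFAULT_SEED = 1337

def set_python(seed=DEFAULT_SEED):
    # PYTHONHASHSEED is read at interpreter startup, so this only affects
    # Python processes launched after this point (e.g. subprocesses).
    os.environ['PYTHONHASHSEED'] = str(seed)
    random.seed(seed)

def set_numpy(seed=DEFAULT_SEED):
    # Seeds the legacy global RandomState, which sklearn uses by default.
    np.random.seed(seed)

def set_torch(seed=DEFAULT_SEED, deterministic=False):
    torch.manual_seed(seed)
    # Seed every GPU; this is silently ignored when CUDA is unavailable.
    torch.cuda.manual_seed_all(seed)
    if deterministic:
        # Disable cuDNN autotuning and request deterministic cuDNN kernels.
        torch.backends.cudnn.benchmark = False
        torch.backends.cudnn.deterministic = True

def set_all_seeds(seed=DEFAULT_SEED, deterministic=False):
    set_python(seed)
    set_numpy(seed)
    set_torch(seed, deterministic)

set_all_seeds(1337)
```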

Do not optimize for random seed! Do not base any decisions you make on your random seed!
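A healthier habit is to repeat the experiment over several seeds and report the spread, so that no single lucky seed drives your conclusions. The sketch below uses a made-up `run_experiment` function as a stand-in for training and evaluating a model; the "true" score of 0.80 and the noise level are invented for illustration.

```python
import random
import statistics


def run_experiment(seed: int) -> float:
    """Hypothetical stand-in for training and evaluating a model."""
    rng = random.Random(seed)
    # Pretend the underlying score is 0.80 and the seed only adds noise.
    return 0.80 + rng.gauss(0.0, 0.02)


seeds = [0, 1, 2, 3, 4]
scores = [run_experiment(seed) for seed in seeds]

# Report the mean and spread across seeds, not the single best value.
print(f"mean = {statistics.mean(scores):.3f}, std = {statistics.stdev(scores):.3f}")
```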