# Lab 1: Environments and Reproducibility

The software that you write for your research, whether it is a data analysis script or a complex software package, should be FAIR: Findable, Accessible, Interoperable, and Reusable.



## Environments
It is important to keep projects separate from one another, with each one managing its own version of Python, its own package dependencies, and so on. There are several popular package and environment managers: conda, poetry, and pyenv, to name a few.

We're going to keep things basic and use pyenv.

## Reproducibility

There might be occasions where you need random number generation. For example:
- Drawing from probability distributions for simulations
- Initializing the weights in a neural network
- Shuffling training data

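Although we call this "random", it isn't, of course. True randomness is very difficult to generate on computer hardware at scale. Instead, we use pseudo-random number generation: a deterministic algorithm that, given the same starting seed, always produces the same sequence of numbers.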
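
To make the idea concrete, here is a minimal sketch of a pseudo-random number generator: a linear congruential generator (LCG). It is only an illustration and not what Python or NumPy use internally. The class name `TinyLCG` is made up for this example, and the multiplier and increment are the commonly quoted Numerical Recipes constants.

```python
class TinyLCG:
    """Toy linear congruential generator (illustration only)."""

    MODULUS = 2**32
    MULTIPLIER = 1664525
    INCREMENT = 1013904223

    def __init__(self, seed):
        # The seed is simply the initial internal state.
        self.state = seed % self.MODULUS

    def random(self):
        """Advance the state and return a float in [0, 1)."""
        self.state = (self.MULTIPLIER * self.state + self.INCREMENT) % self.MODULUS
        return self.state / self.MODULUS


first_run = TinyLCG(1337)
second_run = TinyLCG(1337)
sequence_a = [first_run.random() for _ in range(3)]
sequence_b = [second_run.random() for _ in range(3)]
print(sequence_a == sequence_b)  # True: same seed, same sequence
```

Nothing here is random in the everyday sense: the whole sequence is fixed once the seed is chosen, which is exactly what makes seeded experiments reproducible.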

In your programs, there may be several sources of randomness that you want to control in order to ensure reproducibility. The first of these is the seed of Python's built-in `random` module.

In [4]:
import random
random.seed(1337)
print(random.random())
print(random.random())
print("Reinitializing the random number generator...")
random.seed(1337)
print(random.random())
print(random.random())

0.6177528569514706
0.5332655736050008
Reinitializing the random number generator...
0.6177528569514706
0.5332655736050008


The next one is the NumPy random number generator. It is common to set the seed using

```python
np.random.seed(1337)
```

but the best practice is to use a `Generator` instance instead:

In [7]:
from numpy.random import default_rng

rng = default_rng(1337)
print(rng.random())
print(rng.random())
print("Reinitializing the random number generator...")
rng = default_rng(1337)
print(rng.random())
print(rng.random())

0.8781019003471183
0.18552796163759344
Reinitializing the random number generator...
0.8781019003471183
0.18552796163759344


The advantage of this is that you can dedicate a random number generator to a specific purpose, while preventing any other imported package from resetting the seed it depends on. For most uses, however, the global method will be OK.
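
As a small sketch of that isolation (the variable names are illustrative): resetting the global NumPy seed, as an imported package might do, has no effect on the stream of a dedicated `Generator`.

```python
import numpy as np
from numpy.random import default_rng

# Reference: three draws from a dedicated generator, with no interference.
rng = default_rng(1337)
expected = rng.random(3)

# Same generator seed, but the global seed is reset between draws.
rng = default_rng(1337)
first = rng.random()
np.random.seed(0)  # stands in for a package resetting the global seed
rest = rng.random(2)

print(np.allclose(expected, [first, *rest]))  # True
```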

Next up, we have sources of randomness from scikit-learn. Let's have a look at this...

So now we have seen that, when `random_state=None`, sklearn just uses the global NumPy random state. Let's quickly verify this:

In [59]:
from sklearn.utils import check_random_state
type(check_random_state(None))

numpy.random.mtrand.RandomState

In [56]:
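import numpy as np
from sklearn.neural_network import MLPClassifier

# X, y are the training data defined earlier in the notebook (not shown here).
# First fit: seed the global NumPy generator, which random_state=None falls back to.
np.random.seed(1337)
clf = MLPClassifier(solver='lbfgs', alpha=1e-5,
                    hidden_layer_sizes=(5, 3), random_state=None)
clf.fit(X, y)
print(clf.coefs_[0])

# Second fit: resetting the same global seed reproduces the same weights.
np.random.seed(1337)
clf = MLPClassifier(solver='lbfgs', alpha=1e-5,
                    hidden_layer_sizes=(5, 3), random_state=None)
clf.fit(X, y)
print(clf.coefs_[0])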

[[-0.3884982  -0.5572034   1.2191349  -0.68651964 -0.29221923]
 [ 0.03002655 -0.38863165  2.03523039  0.28036444 -0.62807045]
 [-0.18565761  0.20978005 -2.28069864  0.56606969 -0.09268623]
 [ 0.47270823  0.48015287 -1.54028493 -0.11151901  0.13755263]]
[[-0.3884982  -0.5572034   1.2191349  -0.68651964 -0.29221923]
 [ 0.03002655 -0.38863165  2.03523039  0.28036444 -0.62807045]
 [-0.18565761  0.20978005 -2.28069864  0.56606969 -0.09268623]
 [ 0.47270823  0.48015287 -1.54028493 -0.11151901  0.13755263]]
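
If resetting the global seed before every fit feels fragile, scikit-learn estimators also accept an integer `random_state`. Here is a minimal sketch of the difference, using only the `check_random_state` helper from the cell above rather than a full model fit:

```python
import numpy as np
from sklearn.utils import check_random_state

# random_state=None hands back NumPy's global RandomState, so the result
# depends on whatever the global seed happens to be at that moment.
np.random.seed(1337)
from_global = check_random_state(None).rand(3)

# An integer builds a fresh, private RandomState every time it is passed in.
private_a = check_random_state(1337).rand(3)
private_b = check_random_state(1337).rand(3)

print(np.array_equal(private_a, private_b))    # True
print(np.array_equal(from_global, private_a))  # True
```

Passing an integer therefore gives the same stream as seeding the global generator with that integer, without touching global state.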


It's super annoying having to reset the random seed every time you want to reproduce results, so we'll have a look at this later. But first, let's tackle PyTorch. PyTorch has a lot of nondeterministic features. You can use the `torch.manual_seed()` function to fix the RNG:

In [60]:
import torch
torch.manual_seed(1337)

<torch._C.Generator at 0x135438ad0>

Annoyingly, PyTorch operations can sometimes use internal random number generators, so if an operation is called over and over, you'll get different results unless you reset the manual seed between calls, as we did with sklearn.
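
Here is a minimal CPU-only sketch of this behaviour (the tensor sizes are arbitrary). It also shows `torch.Generator`, which can be handed to functions that accept a `generator` argument instead of reseeding the global one:

```python
import torch

torch.manual_seed(1337)
first = torch.rand(3)
second = torch.rand(3)   # the global generator has advanced, so this differs

torch.manual_seed(1337)  # reset the global generator
replay = torch.rand(3)

print(torch.equal(first, replay))  # True
print(torch.equal(first, second))  # False

# Dedicated generators, independent of the global seed.
gen_a = torch.Generator().manual_seed(1337)
gen_b = torch.Generator().manual_seed(1337)
draw_a = torch.rand(3, generator=gen_a)
draw_b = torch.rand(3, generator=gen_b)
print(torch.equal(draw_a, draw_b))  # True
```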

In addition, the cuDNN library can also be a source of nondeterminism. This has to do with how cuDNN benchmarks candidate convolution algorithms in order to pick the fastest one. You can disable this by using

```python
torch.backends.cudnn.benchmark = False
```

You can also avoid nondeterministic algorithms altogether by using

```python
torch.use_deterministic_algorithms(True)
```

With this enabled, operations that have no deterministic implementation raise a `RuntimeError`. This will mess with a lot of common neural network layers, such as LSTM and max pooling layers, and should probably be avoided.

The short version of the story is it is almost impossible to guarantee absolute reproducibility across all PyTorch versions, on multiple platforms. **In general you should not assume reproducibility between CPU and GPU executions**.

### Reproducibility from MATLAB to NumPy
This is something that might be important to you when verifying algorithms that you are translating from MATLAB to NumPy, and it is something that has personally caused me quite a bit of grief.

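It is possible to exactly reproduce a significant amount of randomness between MATLAB and NumPy. NumPy's legacy global generator (the one seeded by `np.random.seed`) uses the Mersenne Twister 19937 algorithm, and you can force MATLAB to use the same algorithm. This means that both languages will produce the same sequence of random numbers. Note that `default_rng` uses a different algorithm (PCG64) by default, so its output will not match.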

Since MATLAB and NumPy also both build on the same families of linear algebra subroutines (BLAS and LAPACK, originally written in Fortran), you can also reproduce the results of many common linear solvers, typically to within floating-point tolerance.
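
Here is a sketch of the NumPy side of such a comparison of random streams. The MATLAB commands in the comments are the assumed counterpart and are not executed here, and the seed is arbitrary. As far as we are aware, the correspondence holds for uniform draws. Normally distributed draws (`randn`) are generated differently in the two systems and should not be expected to match. MATLAB also fills arrays column by column, whereas NumPy fills them row by row.

```python
import numpy as np

SEED = 1337  # arbitrary; use the same value on the MATLAB side

# The legacy RandomState is backed by the Mersenne Twister (MT19937).
legacy = np.random.RandomState(SEED)
uniform_draws = legacy.rand(5)
print(uniform_draws)

# Assumed MATLAB counterpart (not run here):
#   rng(1337, 'twister');
#   rand(1, 5)

# default_rng() is backed by PCG64 instead, so its stream will not match.
bit_generator_name = type(np.random.default_rng(SEED).bit_generator).__name__
print(bit_generator_name)  # PCG64
```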

**WARNING**: You should not use these random number generators for security or cryptographic purposes. There are other libraries for this, such as Python's built-in `secrets` module.

Now that we've discussed randomness and reproducibility a little, let's see how we can set these things globally.

In [70]:
import os
import random
import numpy as np
import torch

DEFAULT_SEED = 1337

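def set_python(seed=DEFAULT_SEED):
    # PYTHONHASHSEED is read at interpreter startup, so setting it here only
    # affects subprocesses launched afterwards, not hashing in this process.
    os.environ['PYTHONHASHSEED'] = str(seed)
    random.seed(seed)

def set_numpy(seed=DEFAULT_SEED):
    # Seeds the legacy global RandomState; Generator instances created with
    # default_rng() are unaffected and need their own seed.
    np.random.seed(seed)

def set_torch(seed=DEFAULT_SEED, deterministic=False):
    torch.manual_seed(seed)
    # Seeds every CUDA device; silently ignored when CUDA is unavailable.
    torch.cuda.manual_seed_all(seed)
    if deterministic:
        # Stop cuDNN from benchmarking and restrict it to deterministic algorithms.
        torch.backends.cudnn.benchmark = False
        torch.backends.cudnn.deterministic = True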

def set_all_seeds(seed=DEFAULT_SEED, deterministic=False):
    set_python(seed)
    set_numpy(seed)
    set_torch(seed, deterministic)

set_all_seeds(1337)
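
If you only need a fixed seed for one block of code, a context manager can set it and then restore the previous state afterwards. This is a minimal sketch for Python's built-in `random` module only; `temp_seed` is a hypothetical helper, not part of any library:

```python
import random
from contextlib import contextmanager


@contextmanager
def temp_seed(seed):
    """Seed `random` inside the with-block, then restore the prior state."""
    saved_state = random.getstate()
    random.seed(seed)
    try:
        yield
    finally:
        # Runs even if the block raises, so the outer stream is never lost.
        random.setstate(saved_state)


random.seed(42)
state_before = random.getstate()
with temp_seed(1337):
    inside = random.random()
print(random.getstate() == state_before)  # True: outer stream left untouched
```

The same pattern could be extended to NumPy with `np.random.get_state()` and `np.random.set_state()`.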

Do not optimize for the random seed! Do not base any decisions you make on your random seed!
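
One way to put this into practice is to repeat an experiment over several seeds and report the spread rather than a single number. In this sketch, `run_experiment` is a hypothetical stand-in for your own training and evaluation code, and the scores it returns are synthetic:

```python
import random
import statistics


def run_experiment(seed):
    """Hypothetical placeholder: returns a noisy 'accuracy' for a given seed."""
    rng = random.Random(seed)  # private generator, independent of global state
    return 0.80 + rng.uniform(-0.05, 0.05)


seeds = [0, 1, 2, 3, 4]
scores = [run_experiment(seed) for seed in seeds]

print(f"mean = {statistics.mean(scores):.3f}")
print(f"std  = {statistics.stdev(scores):.3f}")
```

If a conclusion only holds for one particular seed, the spread across seeds will make that visible.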