In [1]:
import json
import numpy as np
import os
import pandas as pd
import sklearn

print(np.__version__)
print(sklearn.__version__)

1.18.5
0.24.1


## Multivariate Reproducibility

This notebook demonstrates that samples drawn from a multivariate normal distribution in NumPy are not always reproducible across operating systems and BLAS backends (MKL and OpenBLAS). It uses the following [GitHub issue](https://github.com/numpy/numpy/issues/2435) as an example.
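For orientation, here is a minimal sketch of how the three kinds of samples compared below could be drawn. The seed, mean, covariance matrix, and sample size are made-up placeholders, and `draw_samples` is a hypothetical helper; the script that actually produced the files in `output` is not shown in this notebook.

```python
import numpy as np

# Assumed placeholder inputs; the notebook does not state the real ones.
SEED = 0
mean = np.zeros(3)
cov = np.array([
    [2.0, 0.5, 0.3],
    [0.5, 1.0, 0.2],
    [0.3, 0.2, 1.5],
])  # symmetric and strictly diagonally dominant, hence positive definite


def draw_samples(seed, mean, cov, size=5):
    """Draw samples with the three approaches compared in this notebook."""
    # Scenario 1: legacy RandomState API.
    legacy = np.random.RandomState(seed).multivariate_normal(mean, cov, size=size)
    # Scenario 2: Generator with its default factorization (SVD).
    svd = np.random.default_rng(seed).multivariate_normal(
        mean, cov, size=size, method="svd"
    )
    # Scenario 3: Generator with the Cholesky factorization.
    chol = np.random.default_rng(seed).multivariate_normal(
        mean, cov, size=size, method="cholesky"
    )
    return {"randomstate": legacy, "svd": svd, "cholesky": chol}


samples = draw_samples(SEED, mean, cov)
for name, values in samples.items():
    print(name, values.shape)
```

Running a script like this in each environment and saving the flattened arrays to CSV would yield files with the `randomstate`, `svd`, and `cholesky` columns read below.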

In [2]:
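# Load one CSV of samples per environment. The environment name is the
# filename up to its last underscore.
results = {}
for file in sorted(os.listdir('output')):
    df = pd.read_csv(os.path.join('output', file))
    env = file.rsplit('_', 1)[0]
    results[env] = df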

In [3]:
from itertools import combinations

randomstate_matches = []
svd_matches = []
cholesky_matches = []
# Compare every unordered pair of environments exactly once.
for (env1, df1), (env2, df2) in combinations(results.items(), 2):
    randomstate_match = np.allclose(df1['randomstate'], df2['randomstate'])
    svd_match = np.allclose(df1['svd'], df2['svd'])
    cholesky_match = np.allclose(df1['cholesky'], df2['cholesky'])
    print(f'Looking to see if {env1} and {env2} samples match..')
    print(f"RandomState matches: {randomstate_match}")
    print(f"SVD matches: {svd_match}")
    print(f"Cholesky matches: {cholesky_match}")
    randomstate_matches.append(randomstate_match)
    svd_matches.append(svd_match)
    cholesky_matches.append(cholesky_match)

Looking to see if linuxubuntu_mkl and linuxubuntu_openblas samples match..
RandomState matches: False
SVD matches: False
Cholesky matches: True
Looking to see if linuxubuntu_mkl and macosbigsur_mkl samples match..
RandomState matches: False
SVD matches: False
Cholesky matches: True
Looking to see if linuxubuntu_mkl and macosbigsur_openblas samples match..
RandomState matches: False
SVD matches: False
Cholesky matches: True
Looking to see if linuxubuntu_openblas and macosbigsur_mkl samples match..
RandomState matches: True
SVD matches: True
Cholesky matches: True
Looking to see if linuxubuntu_openblas and macosbigsur_openblas samples match..
RandomState matches: True
SVD matches: True
Cholesky matches: True
Looking to see if macosbigsur_mkl and macosbigsur_openblas samples match..
RandomState matches: True
SVD matches: True
Cholesky matches: True


### Scenario 1

Using the legacy `RandomState` class, we sample from the multivariate normal:

In [4]:
print(f"Ratio of environment pairs with matching samples using RandomState: {np.mean(randomstate_matches)}")

Ratio of environment pairs with matching samples using RandomState: 0.5


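Only half of the environment pairs (3 of 6) produce matching samples. Every mismatch involves `linuxubuntu_mkl`; the other three environments agree with one another.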
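Note that this ratio counts pairs of environments, not individual environments. The sketch below uses made-up data and hypothetical environment names (`env_a` to `env_d`) to show that a single divergent environment out of four is enough to bring the pair ratio down to 0.5, because it takes part in 3 of the 6 pairs.

```python
from itertools import combinations

import numpy as np
import pandas as pd

# Made-up data: four hypothetical environments, one of which ("env_a") diverges.
base = np.array([0.1, -0.4, 1.3])
toy_results = {
    "env_a": pd.DataFrame({"svd": base + 0.5}),
    "env_b": pd.DataFrame({"svd": base}),
    "env_c": pd.DataFrame({"svd": base}),
    "env_d": pd.DataFrame({"svd": base}),
}


def pair_match_ratio(results, column):
    """Fraction of unordered environment pairs whose samples agree."""
    matches = [
        np.allclose(df1[column], df2[column])
        for (_, df1), (_, df2) in combinations(results.items(), 2)
    ]
    return float(np.mean(matches))


ratio = pair_match_ratio(toy_results, "svd")
print(ratio)  # 0.5: only the three pairs that exclude env_a match
```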

### Scenario 2

Using the new `Generator` class with its default arguments (`method='svd'`), we sample from the multivariate normal:

In [5]:
print(f"Ratio of environment pairs with matching samples using Generator with SVD: {np.mean(svd_matches)}")

Ratio of environment pairs with matching samples using Generator with SVD: 0.5


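Again, only half of the environment pairs (3 of 6) produce matching samples, and the mismatches are the same three pairs involving `linuxubuntu_mkl`.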

### Scenario 3

Using the new `Generator` class with `method='cholesky'`, we sample from the multivariate normal:

In [6]:
print(f"Ratio of environment pairs with matching samples using Generator with Cholesky: {np.mean(cholesky_matches)}")

Ratio of environment pairs with matching samples using Generator with Cholesky: 1.0


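With the Cholesky decomposition, all six environment pairs produce matching samples (within the `np.allclose` tolerance). Of the three approaches tested, it is the only one that is reproducible across every environment.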
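One plausible explanation, which this notebook does not verify, is that the factor of the covariance matrix obtained from an SVD is not unique: flipping the sign of a singular vector gives another equally valid factor, and different linear algebra backends may make different choices. The Cholesky factor of a positive-definite matrix, by contrast, is unique once its diagonal is required to be positive. The sketch below uses a made-up 2x2 covariance matrix and builds the factors by hand to illustrate the idea; it is not NumPy's internal implementation.

```python
import numpy as np

# Made-up positive-definite covariance matrix and shared standard normal draws.
cov = np.array([[2.0, 0.5], [0.5, 1.0]])
z = np.random.default_rng(0).standard_normal((4, 2))

# An SVD-based factor A with A @ A.T == cov.
u, s, vh = np.linalg.svd(cov)
A = (np.sqrt(s)[:, None] * vh).T

# Flipping the sign of one column gives a second, equally valid factor.
A_flipped = A * np.array([1.0, -1.0])

# Both factors reproduce the covariance matrix ...
print(np.allclose(A @ A.T, cov), np.allclose(A_flipped @ A_flipped.T, cov))

# ... but turn the same standard normal draws into different samples.
samples_a = z @ A.T
samples_b = z @ A_flipped.T
print(np.allclose(samples_a, samples_b))

# The lower-triangular Cholesky factor with a positive diagonal is unique.
L = np.linalg.cholesky(cov)
print(np.allclose(L @ L.T, cov), bool(np.all(np.diag(L) > 0)))
```

Both sets of samples follow the same distribution, so neither is wrong; they simply differ value by value, which is the kind of difference an `np.allclose` comparison flags.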