In [1]:
import json
import numpy as np
import os
import pandas as pd
import sklearn
import sys

print(sys.version)
print(np.__version__)
print(sklearn.__version__)

3.7.10 (default, Feb 26 2021, 10:16:00) 
[Clang 10.0.0 ]
1.18.5
0.24.1


## LinTS GoodReads Recommendations

This notebook compares GoodReads recommendations generated in multiple NumPy environments. It uses the preprocessed data produced in [Goodreads Preprocessing](Goodreads%20Preprocessing.ipynb), subsampled to keep run time manageable, as in [Goodreads Samples](Goodreads%20Samples.ipynb). All scenarios use LinTS (Linear Thompson Sampling) to generate recommendations.

In [2]:
results = {}
for file in sorted(os.listdir('output')):
    if file.endswith('recs.csv'):
        df = pd.read_csv(os.path.join('output', file))
        env = file.rsplit('_', 1)[0]
        results[env] = df

In [3]:
randomstate_matches = []
svd_matches = []
cholesky_matches = []
for i, (env1, df1) in enumerate(results.items()):
    for j, (env2, df2) in enumerate(results.items()):
        if i < j:
            randomstate_match = np.allclose(df1['randomstate'], df2['randomstate'])
            svd_match = np.allclose(df1['svd'], df2['svd'])
            cholesky_match = np.allclose(df1['cholesky'], df2['cholesky'])
            print(f'Looking to see if {env1} and {env2} samples match..')
            print(f"RandomState matches: {randomstate_match}")
            print(f"SVD matches: {svd_match}")
            print(f"Cholesky matches: {cholesky_match}")
            randomstate_matches.append(randomstate_match)
            svd_matches.append(svd_match)
            cholesky_matches.append(cholesky_match)

Looking to see if linuxubuntu_mkl and linuxubuntu_openblas samples match..
RandomState matches: False
SVD matches: False
Cholesky matches: True
Looking to see if linuxubuntu_mkl and macosbigsur_mkl samples match..
RandomState matches: False
SVD matches: False
Cholesky matches: True
Looking to see if linuxubuntu_mkl and macosbigsur_openblas samples match..
RandomState matches: False
SVD matches: False
Cholesky matches: True
Looking to see if linuxubuntu_openblas and macosbigsur_mkl samples match..
RandomState matches: False
SVD matches: False
Cholesky matches: True
Looking to see if linuxubuntu_openblas and macosbigsur_openblas samples match..
RandomState matches: False
SVD matches: False
Cholesky matches: True
Looking to see if macosbigsur_mkl and macosbigsur_openblas samples match..
RandomState matches: False
SVD matches: False
Cholesky matches: True


### Option 1
We use the default implementation, which calls `np.random.multivariate_normal`, and set the global seed to ensure reproducibility within a single environment. This is equivalent to using `np.random.RandomState`, because the global seed sets the state of the global `RandomState` instance.
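The equivalence between the global seed and an explicit `RandomState` can be checked directly. The snippet below is a minimal sketch; the seed `42` and the small covariance matrix are illustrative assumptions, not values from the LinTS runs.

```python
import numpy as np

# Illustrative (hypothetical) mean vector and covariance matrix.
mean = np.zeros(3)
cov = np.array([[2.0, 0.5, 0.1],
                [0.5, 1.0, 0.2],
                [0.1, 0.2, 1.5]])

# Seed the global state, then sample through the module-level function.
np.random.seed(42)
global_sample = np.random.multivariate_normal(mean, cov)

# Sample through an explicit RandomState seeded with the same value.
random_state = np.random.RandomState(42)
random_state_sample = random_state.multivariate_normal(mean, cov)

# Within one environment, both routes draw from the same stream.
same_stream = np.array_equal(global_sample, random_state_sample)
print(same_stream)
```

This only demonstrates reproducibility inside a single environment; it says nothing about whether two different environments agree.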

In [4]:
print(f"Ratio of environment pairs with matching samples using RandomState: {np.mean(randomstate_matches)}")

Ratio of environment pairs with matching samples using RandomState: 0.0


No pair of environments produces matching samples.


### Option 2
We use the new `Generator` class with default parameters. By default, `Generator.multivariate_normal` decomposes the covariance matrix with SVD (`method='svd'`).
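As a minimal sketch of this option, the snippet below draws one sample with the default arguments and one with `method='svd'` spelled out, to show they are the same call. The seed and covariance matrix are illustrative assumptions, and the `method` argument requires NumPy 1.18 or later.

```python
import numpy as np

# Illustrative (hypothetical) mean vector and covariance matrix.
mean = np.zeros(3)
cov = np.array([[2.0, 0.5, 0.1],
                [0.5, 1.0, 0.2],
                [0.1, 0.2, 1.5]])

# Default arguments: the decomposition method is not specified.
default_sample = np.random.default_rng(42).multivariate_normal(mean, cov)

# Same seed, with the SVD method named explicitly.
svd_sample = np.random.default_rng(42).multivariate_normal(mean, cov, method="svd")

# The default and the explicit 'svd' call produce the same draw.
default_is_svd = np.array_equal(default_sample, svd_sample)
print(default_is_svd)
```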

In [5]:
print(f"Ratio of environment pairs with matching samples using Generator with SVD: {np.mean(svd_matches)}")

Ratio of environment pairs with matching samples using Generator with SVD: 0.0


No pair of environments produces matching samples.
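One plausible contributor to the mismatch, offered here as an assumption that this notebook does not verify, is that an SVD is not unique: flipping the sign of a singular vector yields an equally valid decomposition, and different linear algebra backends are free to return either one. The sketch below builds a sampling factor by hand from `np.linalg.svd` (an illustration, not NumPy's internal code) and shows that a sign flip preserves the covariance but changes the sample drawn from the same standard normal vector.

```python
import numpy as np

# Illustrative (hypothetical) mean vector and covariance matrix.
mean = np.zeros(3)
cov = np.array([[2.0, 0.5, 0.1],
                [0.5, 1.0, 0.2],
                [0.1, 0.2, 1.5]])

# Build a factor A with A @ A.T == cov from the SVD.
u, s, vh = np.linalg.svd(cov)
factor = u * np.sqrt(s)

# Flip the sign of the first singular vector: still a valid factor.
flipped_u = u.copy()
flipped_u[:, 0] *= -1
flipped_factor = flipped_u * np.sqrt(s)

both_valid = (np.allclose(factor @ factor.T, cov)
              and np.allclose(flipped_factor @ flipped_factor.T, cov))

# Transform the same standard normal draw with each factor.
z = np.random.default_rng(0).standard_normal(3)
sample = mean + z @ factor.T
flipped_sample = mean + z @ flipped_factor.T
samples_differ = not np.allclose(sample, flipped_sample)

print(both_valid, samples_differ)
```

Both samples follow the same distribution, so neither is wrong; they simply fail an element-wise comparison such as `np.allclose`.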

### Option 3
We use the new `Generator` class with Cholesky decomposition (`method='cholesky'`). Our hypothesis is that this produces reproducible results across different environments.
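A minimal sketch of this option follows, again with an illustrative seed and covariance matrix rather than the LinTS data. It also checks two properties that may help explain the hypothesis: for a positive definite matrix, the lower-triangular Cholesky factor with a positive diagonal is unique, so backends have no freedom in which factor they return beyond rounding.

```python
import numpy as np

# Illustrative (hypothetical) mean vector and covariance matrix.
mean = np.zeros(3)
cov = np.array([[2.0, 0.5, 0.1],
                [0.5, 1.0, 0.2],
                [0.1, 0.2, 1.5]])

# Draw with the Cholesky method (NumPy 1.18 or later).
rng = np.random.default_rng(42)
cholesky_sample = rng.multivariate_normal(mean, cov, method="cholesky")

# Re-seeding reproduces the draw within a single environment.
repeat_sample = np.random.default_rng(42).multivariate_normal(
    mean, cov, method="cholesky")
reproducible = np.array_equal(cholesky_sample, repeat_sample)

# Properties of the Cholesky factor itself.
lower = np.linalg.cholesky(cov)
reconstructs = np.allclose(lower @ lower.T, cov)
lower_triangular = np.allclose(lower, np.tril(lower))
positive_diagonal = bool(np.all(np.diag(lower) > 0))

print(reproducible, reconstructs, lower_triangular, positive_diagonal)
```

Note that the Cholesky method requires a positive definite covariance matrix, whereas the SVD method also accepts positive semidefinite ones.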

In [6]:
print(f"Ratio of environment pairs with matching samples using Generator with Cholesky: {np.mean(cholesky_matches)}")

Ratio of environment pairs with matching samples using Generator with Cholesky: 1.0


With Cholesky decomposition, the samples match (within the `np.allclose` tolerance) across all six environment pairs.