# Hashing can be weird for some objects
Below we illustrate two potentially confusing behaviors that are hard to
eliminate in general:
- even if we set all random seeds properly, certain computations (e.g., training
a `scikit-learn` model) result in objects with non-deterministic content IDs
- certain objects can change their content ID after making a roundtrip through
the serialization-deserialization pipeline

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_digits
import random
import numpy as np

from mandala.utils import get_content_hash, serialize, deserialize

X, y = load_digits(n_class=10, return_X_y=True)

def train_model():
    # set both the numpy and python random seed
    np.random.seed(42)
    random.seed(42)
    # train a model, passing the random_state explicitly
    model = RandomForestClassifier(
        max_depth=2, n_estimators=100, random_state=42
    ).fit(X, y)
    return model

# training in the exact same way will produce different content hashes
model_1 = train_model()
model_2 = train_model()
print(f'Content IDs of the two models: {get_content_hash(model_1)} and {get_content_hash(model_2)}')

# a roundtrip serialization will produce a different content hash
roundtrip_model_1 = deserialize(serialize(model_1))
print(f'Content IDs of the original and restored model: {get_content_hash(model_1)} and {get_content_hash(roundtrip_model_1)}')
```

```
Content IDs of the two models: e50ecc81eb3892c1e40a41539d8cf0e1 and 01899c01a78746fd1c554171b1e944fc
Content IDs of the original and restored model: e50ecc81eb3892c1e40a41539d8cf0e1 and 549fedf90f84de7c8c77ac26940c7ed6
```


**Why is this hard to get rid of in general?** One pervasive issue is that some
custom Python objects, e.g. many kinds of ML models and even `pytorch` tensors,
carry internal state tied to system resources, such as memory layout. This
state can differ between objects whose semantically meaningful state is
identical, which leads to different content hashes. No hash function can
reliably ignore such state for arbitrary classes, because in general there is
no way to tell which attributes of an object are semantically meaningful and
which are incidental.
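
To make this concrete, here is a minimal sketch of the problem using only the
standard library. `ToyModel`, `naive_hash` and `semantic_hash` are hypothetical
names invented for this illustration; they are not part of `mandala` and do not
reflect how `get_content_hash` is implemented. The class keeps a lazily filled
cache next to its meaningful state, so a hash that knows nothing about the class
sees two equivalent objects as different.

```python
import hashlib
import json


class ToyModel:
    """Hypothetical class mixing meaningful and incidental state."""

    def __init__(self, weights):
        self.weights = weights  # semantically meaningful
        self._cache = {}        # incidental: filled lazily by predict()

    def predict(self, x):
        if x not in self._cache:
            self._cache[x] = sum(w * x for w in self.weights)
        return self._cache[x]


def naive_hash(obj):
    """Hash every instance attribute, with no knowledge of the class."""
    payload = json.dumps(vars(obj), sort_keys=True)
    return hashlib.md5(payload.encode()).hexdigest()


def semantic_hash(obj, fields):
    """Hash only the attributes the caller declares meaningful."""
    payload = json.dumps({f: getattr(obj, f) for f in fields}, sort_keys=True)
    return hashlib.md5(payload.encode()).hexdigest()


a = ToyModel([1, 2, 3])
b = ToyModel([1, 2, 3])
b.predict(2)  # changes only the cache, not the weights

# the generic hash sees the cache and reports a difference
print(naive_hash(a) == naive_hash(b))
# the class-aware hash ignores the cache and reports equality
print(semantic_hash(a, ["weights"]) == semantic_hash(b, ["weights"]))
```

The second hash only works because the caller supplies the list of meaningful
fields. A generic hashing routine has no such list for arbitrary classes.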

**What should you do about it?** This issue does not come up that often in
practice. It does not affect many kinds of objects, such as primitive Python
types and nested Python collections thereof, as well as some other types like
numpy arrays. If the inputs you pass to `@op`s are always objects like these, or
`Ref`s obtained from other `@op`s, the issue will not come up. Indeed, if
"unwieldy" objects are always results of `@op`s, a single copy of each such
object is saved, and that same copy is deserialized every time.

This problem does, however, make it very difficult to detect when your `@op`s
have side effects.
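
The recommendation above, to key calls on primitive inputs and let "unwieldy"
objects appear only as outputs, can be sketched with a toy memoizer. This is
not how `mandala` is implemented: `ToyForest`, `stable_key`, `train`, and the
in-memory `_store` are hypothetical stand-ins for an `@op` and its storage,
written with the standard library only.

```python
import hashlib
import json

_store = {}           # toy stand-in for persistent storage
calls = {"train": 0}  # counts how often the expensive body actually runs


class ToyForest:
    """Stand-in for an 'unwieldy' object whose own hash may be unstable."""

    def __init__(self, max_depth, n_estimators):
        self.max_depth = max_depth
        self.n_estimators = n_estimators
        self._scratch = object()  # incidental state, differs per instance


def stable_key(func_name, **params):
    """Key a call by its function name and primitive arguments only."""
    payload = json.dumps([func_name, params], sort_keys=True)
    return hashlib.md5(payload.encode()).hexdigest()


def train(**params):
    """Memoized on primitive inputs; the model is only ever an output."""
    key = stable_key("train", **params)
    if key not in _store:
        calls["train"] += 1
        _store[key] = ToyForest(**params)
    return _store[key]


m1 = train(max_depth=2, n_estimators=100)
m2 = train(n_estimators=100, max_depth=2)  # same call, other keyword order
print(m1 is m2, calls["train"])
```

Because the key is computed from the primitive arguments alone, the model's
own unstable hash never enters the lookup, and both calls return the single
stored object.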