In [None]:
import numpy as np
import pandas as pd
import blindat as bd

Create a `pandas.DataFrame` with four columns of random data:


In [None]:
# data params
COLUMNS = ["A", "B", "C", "D"]
NUM_ROWS = int(1e7)
DATA_SEED = 19421127

# generate data
np.random.seed(DATA_SEED)
data = np.random.rand(NUM_ROWS, len(COLUMNS))
df = pd.DataFrame(data, columns=COLUMNS)

df.head()

### `generate_rules()`

Rules are generated from a specification that names the columns to blind and the linear transform applied to them: `blinded_data = scale * data + offset`.

The `offset` and `scale` parameters can be fixed or randomly sampled from a given range.  Random sampling keeps the exact transform hidden from the user.
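
To make the transform concrete, here is a minimal plain-Python sketch of a seeded linear rule. It is not how `blindat` is implemented; `sample_param` and `make_linear_rule` are hypothetical helpers invented for this illustration, and the only assumption carried over from the text is that a parameter is either a fixed number or a `(low, high)` range.

```python
import random


def sample_param(value, rng):
    """Return a fixed value as-is, or draw uniformly from a (low, high) range."""
    if isinstance(value, tuple):
        low, high = value
        return rng.uniform(low, high)
    return float(value)


def make_linear_rule(offset=0.0, scale=1.0, random_seed=None):
    """Build a hypothetical blinding function: x -> scale * x + offset."""
    rng = random.Random(random_seed)
    b = sample_param(offset, rng)  # hidden offset
    a = sample_param(scale, rng)  # hidden scale (fixed at 1.0 by default)
    return lambda x: a * x + b


rule = make_linear_rule(offset=(10.0, 20.0), random_seed=42)
values = [0.0, 0.5, 1.0]
blinded = [rule(v) for v in values]

# with scale = 1, differences between values survive the blinding
print(blinded[1] - blinded[0])
```

With only an offset, absolute values are hidden but differences are preserved; sampling `scale` as well would hide those too.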

The simplest way to specify rules is with a column name (or a list of names) and global ranges for `offset` and/or `scale`.

In [None]:
# single column with a global offset range
rules = bd.generate_rules("A", offset=(10.0, 20.0), random_seed=42)

### `inspect()`

You shouldn't be looking at the rule parameters.  But maybe you have a legit reason, in which case, use `inspect()`.

In [None]:
bd.inspect(rules)

It's not necessary to save the rules because they can be recreated by fixing the `random_seed`.  But if you really want to store them, consider using `dill` (regular pickling doesn't work with lambda functions).  Or you could save the output of `inspect()`.
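
Both points can be illustrated without `blindat`. The sketch below uses a hypothetical `make_rule` helper (not part of the library) that returns a lambda with a seeded random offset. It shows that rebuilding from the same seed gives the same transform, and that the standard `pickle` module refuses to serialize the lambda.

```python
import pickle
import random


def make_rule(random_seed):
    """Hypothetical rule: a lambda that adds an offset drawn from [10, 20]."""
    offset = random.Random(random_seed).uniform(10.0, 20.0)
    return lambda x: x + offset


rule_a = make_rule(42)
rule_b = make_rule(42)  # rebuilt later from the same seed

# the same seed reproduces the same transform
same = rule_a(1.0) == rule_b(1.0)

# the standard pickle module cannot serialize the lambda;
# the exception type depends on the Python version
try:
    pickle.dumps(rule_a)
    pickled = True
except (pickle.PicklingError, AttributeError):
    pickled = False

print(same, pickled)
```

Recording the seed alongside the analysis is therefore usually enough to recover the rules later.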


### `blind()`

In [None]:
# blind data
df1 = bd.blind(df, rules)
df1.head()

### `@obfuscate`

If an experiment generates many different data files, it might be convenient to develop a custom class with methods for accessing each component. The decorator `@obfuscate` adds blinding to functions or methods that return a pandas DataFrame as the first or only result.  The method must accept the keyword argument `transform` (or `**kwargs`).

In [None]:
from blindat import obfuscate


class MeasurementData:
    def __init__(self, path=None):
        self.path = path  # path to data directory
        self._sim()

    def _sim(self):
        np.random.seed(DATA_SEED)
        self._columns = COLUMNS
        self._data = np.random.rand(NUM_ROWS, len(self._columns))

    @obfuscate
    def load_dataframe(self, transform=None):
        df = pd.DataFrame(self._data, columns=self._columns)
        return df


# initialize
measurement = MeasurementData()

In [None]:
# load dataframe
measurement.load_dataframe(transform=rules).head()

In [None]:
# original data
measurement.load_dataframe().head()

This example requires the user to explicitly opt in to blinding their data (Zen of Python #2: "Explicit is better than implicit").
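
The opt-in behaviour can be pictured with a small stand-in decorator. This is a sketch only, not `blindat`'s implementation: `obfuscate_sketch` is a hypothetical name, it intercepts `transform` itself rather than passing it to the wrapped function, and `transform` here is a plain callable rather than a set of rules.

```python
import functools


def obfuscate_sketch(func):
    """Hypothetical stand-in for a blinding decorator (not blindat's code)."""

    @functools.wraps(func)
    def wrapper(*args, transform=None, **kwargs):
        result = func(*args, **kwargs)
        if transform is None:
            return result  # no opt-in: data is returned unchanged
        if isinstance(result, tuple):
            # blind only the first element of a multi-value result
            return (transform(result[0]),) + result[1:]
        return transform(result)

    return wrapper


@obfuscate_sketch
def load_values():
    return [1.0, 2.0, 3.0]


def add_ten(values):
    """Toy transform: shift every value by a fixed offset."""
    return [v + 10.0 for v in values]


original = load_values()
blinded = load_values(transform=add_ten)
```

Calling the function without `transform` leaves the data untouched, which is what makes the blinding explicit.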

For consistency, and to save the user a little effort, you could include a `default_rules()` function in your data-access module.  This might be appropriate if columns with certain names always have similar values and should always be blinded.

In [None]:
# in your data access module
DEFAULT_SPECIFICATION = {
    "A": {"offset": (10.0, 20.0), "scale": (0.9, 1.1)},
}


def default_rules(random_seed=None):
    return bd.generate_rules(DEFAULT_SPECIFICATION, random_seed=random_seed)


# in your analysis notebook
measurement.load_dataframe(transform=default_rules(42)).head()

Alternatively, you could hard-code rules into a data-access class.  However, forgetting that the data are blinded could be disastrous! Consider using an unambiguously named subclass and/or warnings.

In [None]:
import warnings
from blindat import blind


class BlindData(MeasurementData):
    def __init__(self, *args, random_seed=None, **kwargs):
        super().__init__(*args, **kwargs)
        self._rules = default_rules(random_seed)

    def _secret_data(self):
        return super().load_dataframe()

    def load_dataframe(self):
        warnings.warn("data may be altered to mitigate experimenter bias.")
        return blind(self._secret_data(), rules=self._rules)


blind_data = BlindData(random_seed=42)

# blind by default
blind_data.load_dataframe().head()