In [None]:
!pip install antimatter==0.1.12 pandas pyarrow

In [None]:
import antimatter
import pandas 

The first step is to create an Antimatter domain. This can be done with the CLI or with the Python library.

In [None]:
# Create a new domain
amr = antimatter.new_domain("my@email.com")

This sends a confirmation email to the address you provided (it can take a few minutes to arrive). Click the button in the email to activate your domain before proceeding.

We can print the details of the domain we just created. Save these: you can use them to log in to the domain with the CLI, or to use the Python library with an existing domain:

In [None]:
# Print domain details
amr.config()
# To interact with an existing domain:
# amr = antimatter.Session(domain="dm-xxxxxxxxxxx", api_key="xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx")

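One way to keep the domain details for later is to write them to a local file. The sketch below is a minimal example and not part of the Antimatter library: `save_domain_config` and `load_domain_config` are hypothetical helpers, and the code assumes that `amr.config()` returns a JSON-serializable dict. The `domain` key is used later in this notebook; the `api_key` key name is an assumption, so check the printed output before relying on it.

```python
import json
import os


def save_domain_config(config: dict, path: str) -> None:
    """Write domain details to a JSON file (hypothetical helper)."""
    # Request owner-only permissions (0o600). The mode is applied only when
    # the file is created, not when an existing file is overwritten.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as f:
        json.dump(config, f)


def load_domain_config(path: str) -> dict:
    """Read domain details previously written by save_domain_config."""
    with open(path) as f:
        return json.load(f)


# Usage in this notebook (needs a live domain, so it is left commented out):
# save_domain_config(amr.config(), "antimatter_domain.json")
# cfg = load_domain_config("antimatter_domain.json")
# The "api_key" key name below is an assumption about what config() returns.
# amr = antimatter.Session(domain=cfg["domain"], api_key=cfg["api_key"])
```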
Now that we have an Antimatter domain, let's see how we can use Antimatter to classify some sensitive data and redact it. Let's load a Parquet file that contains a mix of structured data and embedded unstructured data (a comments column):

In [None]:
df = pandas.read_parquet("https://get.antimatter.io/data/example_data.parquet")
df

You can see that we have a bunch of data here that would probably be considered sensitive in some contexts. Some of it is in clearly labelled columns, which might be easy to deal with manually, but some of it is embedded within free-form text in the comments column. That's pretty common whenever you are storing data coming from users: they can (and often do) enter sensitive info that needs special handling.

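To make the point about free-form text concrete, here is a toy scan for PII embedded in a comment string. It is only an illustration with deliberately naive regular expressions. It is not how the `fast-pii` classifier works, and the sample comment is made up. Note that patterns like these cannot find names at all, which is one reason a real classifier is needed.

```python
import re

# Deliberately naive patterns, for illustration only.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def find_embedded_pii(text):
    """Return (kind, match) pairs for emails and SSN-shaped numbers in text."""
    found = [("email", m) for m in EMAIL.findall(text)]
    found += [("ssn", m) for m in SSN.findall(text)]
    return found


# Hypothetical comment, similar in spirit to the comments column above.
comment = "Customer asked us to reach her at jane@example.com, SSN 123-45-6789."
print(find_embedded_pii(comment))
# [('email', 'jane@example.com'), ('ssn', '123-45-6789')]
```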
Let's *encapsulate* this data. A capsule is Antimatter's object format for tagged data. It stores the full set of data, as well as all the classification tags. It's encrypted, so it can be stored anywhere without worrying that whoever can access the object can also read the sensitive data. Having an intermediate file format lets you do the classification once and then reuse the result multiple times. This is convenient, because classification is often fairly heavyweight.

When we encapsulate, we have to specify a *write_context* that contains the configuration for which classifiers to run on the data. New domains come with a `default` write context that does no classification, and a `sensitive` write context that uses the `fast-pii` classifier to tag common PII. You can change the configuration of these write contexts or add new ones as needed, but we're going to just use `sensitive` for now.

In [None]:
capsule = amr.encapsulate(df, write_context="sensitive")

Once the classification is done, we can write the capsule to a file, or just use it directly to read the data. When reading data, you need to specify a `read_context`, which contains the configuration for what redaction and transformation should occur on the data. New domains come with a `default` read context that does no redaction, but we will add some rules to it in a bit.

In [None]:
capsule.save("mycapsule.ca")
data = amr.load_capsule(data=capsule, read_context="default").data()
data

You will notice that what we got back was a Pandas dataframe. Antimatter stores some information about the shape of the data during encapsulation, and automatically presents the data in the same form when reading (in this case, a Pandas dataframe). This lets you insert Antimatter into your data pipeline without impacting any of the operations that happen after the read. You can use `.data_as()` instead of `.data()` if you'd like to read the data in a different format.

So now, let's do some redaction. This is achieved by adding a rule to the read context. Rules can be fairly complex, to handle advanced cases like reproducing permissions that existed in the original source of the data, but we're going to make a simple one that just references a tag and redacts the data if the tag exists:

In [None]:
amr.add_read_context_rules('default', antimatter.ReadContextRuleBuilder()
                    .add_match_expression(antimatter.Source.Tags, 
                                          key="tag.antimatter.io/pii/name",
                                          operator=antimatter.Operator.Exists)
                    .set_action(antimatter.Action.Redact))

Now, if we read the data again, we'll see that names have been redacted not only in the name columns but also in the comments:

In [None]:
amr.load_capsule(data=capsule, read_context="default").data()

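A quick way to confirm which columns a read context affected is to compare a read taken before the rule was added (the `data` variable from the earlier cell) with a fresh read. The sketch below is a hypothetical helper, not an Antimatter API. It works on plain lists of row dicts, which you can get from a dataframe with `DataFrame.to_dict("records")`. It assumes redaction keeps the number and order of rows unchanged. The redaction marker in the sample rows is made up, and missing values stored as NaN would be reported as changed because NaN never compares equal to itself.

```python
def changed_columns(before, after):
    """Return the sorted column names whose values differ between two reads.

    Both arguments are lists of row dicts with the same number of rows.
    """
    if len(before) != len(after):
        raise ValueError("both reads must contain the same number of rows")
    changed = set()
    for row_before, row_after in zip(before, after):
        # Look at every column that appears in either version of the row.
        for column in row_before.keys() | row_after.keys():
            if row_before.get(column) != row_after.get(column):
                changed.add(column)
    return sorted(changed)


# Hypothetical rows; the "<redacted>" marker is invented for illustration.
original = [{"name": "Alan McKinsey", "comment": "call Alan", "amount": 10}]
redacted = [{"name": "<redacted>", "comment": "call <redacted>", "amount": 10}]
print(changed_columns(original, redacted))  # ['comment', 'name']

# Usage in this notebook (needs a live domain, so it is left commented out):
# after = amr.load_capsule(data=capsule, read_context="default").data()
# changed_columns(data.to_dict("records"), after.to_dict("records"))
```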
We added the rule to the `default` read context, but the purpose of read contexts is to capture policy about what data can be used under which conditions. So you might have different read contexts for different use cases (e.g. model training) or for different teams (e.g. fraud). Let's make a new read context called `analytics` and configure some rules to redact more of the PII in this dataset:

In [None]:
# add a new read context
amr.add_read_context('analytics', antimatter.ReadContextBuilder()
                     .set_summary("redacts data for use in analytics"))

# set some rules
amr.add_read_context_rules('analytics',antimatter.ReadContextRuleBuilder()
                    .add_match_expression(antimatter.Source.Tags, 
                                          key="tag.antimatter.io/pii/name",
                                          operator=antimatter.Operator.Exists)
                    .set_action(antimatter.Action.Redact))
amr.add_read_context_rules('analytics',antimatter.ReadContextRuleBuilder()
                    .add_match_expression(antimatter.Source.Tags, 
                                          key="tag.antimatter.io/pii/email_address",
                                          operator=antimatter.Operator.Exists)
                    .set_action(antimatter.Action.Redact))
amr.add_read_context_rules('analytics',antimatter.ReadContextRuleBuilder()
                    .add_match_expression(antimatter.Source.Tags, 
                                          key="tag.antimatter.io/pii/ssn",
                                          operator=antimatter.Operator.Exists)
                    .set_action(antimatter.Action.Redact))

Now, if we load the capsule again, we'll see that more of the PII has been redacted:

In [None]:
amr.load_capsule(data=capsule, read_context="analytics").data()

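The three rules added to `analytics` differ only in the tag key, so they could also be generated in a loop. The helper below is a sketch of that idea, intended as an alternative to the three separate calls rather than something to run in addition to them. `add_redaction_rules` is a hypothetical name. It uses only the builder calls already shown in this notebook and takes the session (`amr`) and the imported `antimatter` module as arguments so that it has no import-time dependency. It does not check whether an equivalent rule already exists in the read context.

```python
TAG_PREFIX = "tag.antimatter.io/pii/"


def add_redaction_rules(session, am, read_context, pii_types):
    """Add one 'redact if the tag exists' rule per PII type.

    session: the domain session (`amr` in this notebook).
    am: the imported `antimatter` module.
    Returns the list of tag keys that rules were added for.
    """
    keys = []
    for pii_type in pii_types:
        key = TAG_PREFIX + pii_type
        rule = (
            am.ReadContextRuleBuilder()
            .add_match_expression(am.Source.Tags, key=key,
                                  operator=am.Operator.Exists)
            .set_action(am.Action.Redact)
        )
        session.add_read_context_rules(read_context, rule)
        keys.append(key)
    return keys


# Usage in this notebook (needs a live domain, so it is left commented out):
# add_redaction_rules(amr, antimatter, "analytics",
#                     ["name", "email_address", "ssn"])
```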
You can see which tags are available to reference in your rules by calling `list_hooks`. The `sensitive` write context uses `fast-pii` by default:

In [None]:
amr.list_hooks()

One of the advantages of using Antimatter is that the policy captured in the read context is separate from your data pipeline. Often, the rules about what data may be used by whom are decided by different stakeholders (e.g. the security or legal teams) than the people who do data cleaning and augmentation for analytics.

The read context rules can be configured by anyone who is invited to the domain, using the Python library, the command line tool, or the web app. Let's create an API key for a colleague on the security team to configure the read contexts. For simplicity, we'll make them an `admin`:

In [None]:
apik = amr.insert_identity_provider_principal('apikey',
    capabilities={'admin':None}, 
    principal_type=antimatter.PrincipalType.ApiKey)
print(f"Login with --domain-id={amr.config()['domain']} --api-key={apik['api_key']}")

They can use the CLI (for example) to interact with the domain like this:

```bash
# get the latest Antimatter CLI:
$ sudo curl https://get.antimatter.io/cli/latest-macos-arm64 -o /usr/local/bin/am
$ sudo chmod a+x /usr/local/bin/am

$ am login --domain-id dm-xxxxxxxxxxx --api-key xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
$ am read-context list
readContexts:
- name: analytics
  summary: redacts data for use in analytics
  description: ''
  disableReadLogging: false
  keyCacheTTL: 0
  readParameters: []
  imported: false
- name: default
  summary: Default read context
  description: The default read context
  disableReadLogging: false
  keyCacheTTL: 0
  readParameters: []
  imported: false
```

They can add a read context rule, for example to redact physical addresses, like this:

```bash
$ am read-context rule create \
--name analytics \
--action redact \
--match 'exists(tag("tag.antimatter.io/pii/location"))' \
--priority 0
```


Now, if we read the same capsule as before, we will see that addresses have been redacted. We don't need to re-encapsulate or re-materialize any datasets. For example purposes, let's read it from the file instead of using the `capsule` variable:

In [None]:
# we saved "mycapsule.ca" earlier
amr.load_capsule(path="mycapsule.ca",read_context="analytics").data()

You can see that the addresses are now redacted too.

We used a Pandas dataframe above, but you can encapsulate data of many different shapes. For example, even a plain string can be encapsulated by itself:

In [None]:
string_cap = amr.encapsulate(
    """
    This works with arbitrary data, e.g. 'contact Alan McKinsey at some@email.com'.
    We support many shapes of data, like strings, dicts, lists of dicts, pandas dataframes, pytorch data loaders etc
    """, write_context="sensitive")
print(amr.load_capsule(data=string_cap, read_context="analytics").data())

For more information, please see [the docs](https://docs.antimatter.io).