In [None]:
!pip install antimatter==2.0.1 pandas pyarrow

In [None]:
import antimatter as am
import pandas
import json

pandas.set_option('display.max_colwidth', None)

In [None]:
# IMPORTANT: Change to your business email address to create a free account.
# Follow the link in your inbox to verify email address before proceeding.
my_email_address = "email@sample.com"

The first thing we do is create an Antimatter domain. This can be done with the CLI or with the Python library.

To make it easier to re-run the notebook, we'll save the credentials for the created domain and reuse them if they exist.

In [None]:
# Log into (or create) an Antimatter domain, re-using the saved domain if it's found
config_path = my_email_address + ".cfg"
try:
    amr = am.load_domain(config_path=config_path)
    # Double-check that the domain is verified
    _ = amr.list_read_context()
except am.errors.errors.SessionLoadError:
    # Failed to load, so create a new domain
    amr = am.new_domain(email=my_email_address, make_active=True, config_path=config_path)
    raise Exception(f"New domain created. Check your email address ({my_email_address}) and click the link to verify, then re-Run All to proceed.")
except am.errors.errors.SessionError as e:
    if "domain not authenticated" in str(e):
        # amr.resend_verification_email(my_email_address)
        raise Exception(f"Domain not verified. Check your email address ({my_email_address}) and click the link to verify, then re-Run All to proceed.")
    # Any other session error is unexpected, so surface it instead of silently continuing
    raise

If this is the first time you have run the notebook, this will have sent a confirmation email to your email address (it can take a few minutes). 

**Click the button in the email to activate your domain before proceeding.**


Now that we have an Antimatter domain, let's see how we can use Antimatter to classify some sensitive data and redact it. Let's load a Parquet file that contains a mix of structured data and embedded unstructured data (a comments column).

In [None]:
df = pandas.read_parquet("https://get.antimatter.io/data/example_data.parquet")
df

You can see that we have a bunch of data here that would probably be considered sensitive in some contexts. Some of it is in clearly labelled columns, which might be easy to deal with manually, but some of it is embedded within free-form text in the comments column. That's pretty common whenever you are storing data coming from users: they can (and often do) enter sensitive info that needs special handling.
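
To make the problem concrete, here is a minimal sketch in plain Python (this is not Antimatter's classifier) of why column names alone are not enough: a naive regex scan finds an email address inside a free-form comment. The sample rows, the field names, and the `EMAIL_RE` pattern are all made up for illustration; a real classifier handles far more cases than this.

```python
import re

# Deliberately simple pattern; real-world email matching is more involved.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Hypothetical rows: no field name hints that an email address is present.
rows = [
    {"id": 1, "comments": "Please reply to jane.doe@example.com by Friday"},
    {"id": 2, "comments": "No contact details provided"},
]

def find_emails(text):
    """Return (start, end, match) for every email-like span in text."""
    return [(m.start(), m.end(), m.group()) for m in EMAIL_RE.finditer(text)]

# Map each row id to the email-like spans found in its comment.
hits = {row["id"]: find_emails(row["comments"]) for row in rows}
print(hits)
```

Recording the span offsets, rather than just a yes/no flag per row, is what makes it possible to redact only the sensitive part of a comment later.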

Let's *encapsulate* this data. A capsule is Antimatter's object format for tagged data. It stores the full set of data, as well as all the classification tags. It's encrypted, so it can be stored anywhere without worrying that sensitive data might be exposed to anyone who can access the object. Having an intermediate file format lets you do the classification once and then reuse the result many times. This is convenient, because classification is often fairly heavyweight.

When we encapsulate, we have to specify a *write_context* that contains the configuration for which classifiers to run on the data. New domains come with a `default` write context that does no classification, and a `sensitive` write context that uses the `fast-pii` classifier to tag common PII. You can change the configuration of these write contexts or add new ones as needed, but we're going to just use `sensitive` for now.

In [None]:
capsule = amr.encapsulate(df, write_context="sensitive")

Once the classification is done, we can write the capsule to a file, or just use it directly to read the data. When reading data, you need to specify a `read_context`, which contains the configuration for what redaction and transformation should occur on the data. New domains come with a `default` read context that does no redaction, but we will add some rules to it in a bit.

In [None]:
capsule.save("mycapsule.ca")
data = amr.load_capsule(data=capsule, read_context="default").data()
data

You will notice that what we got back was a Pandas dataframe. Antimatter stores some information about the shape of the data during encapsulation, and automatically presents the data in the same form when reading (in this case, a Pandas dataframe). This lets you insert Antimatter into your data pipeline without impacting any of the operations that happen after the read. You can use `.data_as()` instead of `.data()` if you'd like to read the data in a different format.

So now, let's do some redaction. This is achieved by binding a data policy rule to the read context. Data policy rules can be fairly complex, to deal with advanced cases like reproducing permissions that existed in the original source of data, but we're going to make a simple one that just references a tag and redacts if it exists.
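
As a mental model for what such a rule does at read time, here is a small sketch in plain Python. It is not Antimatter's implementation: the `Span` class, the `span_of` and `redact` helpers, and the `{redacted}` placeholder are all hypothetical. The only things borrowed from this notebook are the tag names. The idea is that classification produces tagged spans, and a rule redacts any span whose tag is in the rule's set.

```python
from dataclasses import dataclass

@dataclass
class Span:
    start: int
    end: int
    tag: str

def span_of(text, needle, tag):
    """Build a tagged span for the first occurrence of needle in text."""
    start = text.index(needle)
    return Span(start, start + len(needle), tag)

def redact(text, spans, redact_tags, mask="{redacted}"):
    """Replace every span whose tag appears in redact_tags (an 'any of' match)."""
    out, cursor = [], 0
    for span in sorted(spans, key=lambda s: s.start):
        # Skip spans the rule does not mention, and spans overlapping one already masked.
        if span.tag not in redact_tags or span.start < cursor:
            continue
        out.append(text[cursor:span.start])
        out.append(mask)
        cursor = span.end
    out.append(text[cursor:])
    return "".join(out)

text = "Contact Alan at alan@example.com"
spans = [
    span_of(text, "Alan", "tag.antimatter.io/pii/name"),
    span_of(text, "alan@example.com", "tag.antimatter.io/pii/email_address"),
]

# A rule that only mentions the email tag leaves the name untouched.
email_only = redact(text, spans, {"tag.antimatter.io/pii/email_address"})
print(email_only)
```

Because the tags are stored alongside the data, choosing a different set of tags to redact needs no new classification pass.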

First we make a data policy and save its ID. Next, we update the policy rules and add our new rule to redact email addresses. Once we have our data policy configured to our liking, we bind it to the read context to ensure it gets used:

In [None]:
# create a new data policy and save its ID
res = amr.create_data_policy("my_data_policy", "sample description")
policy_id = res.policy_id

# update the data policy with rules to redact email addresses
email_rule = amr.update_data_policy_rules(
    policy_id=policy_id,
    rules=am.DataPolicyRuleChangesBuilder().add_rule(
        am.NewDataPolicyRuleBuilder(comment="Deny", effect=am.RuleEffect.REDACT, priority=10).add_clause(
            clause=am.DataPolicyClauseBuilder(am.ClauseOperator.AnyOf).add_tag(
                am.TagExpressionBuilder()
                .set_name("tag.antimatter.io/pii/email_address")
                .set_operator(am.Operator.Exists)
            )
        )
    ),
)

# bind the data policy to the default read context
amr.set_data_policy_binding(
    policy_id=policy_id,
    default_attachment=am.Attachment.NotAttached,
    read_contexts=[("default", am.Attachment.Attached)],
)

Now, if we read the data again, we'll see that email addresses have been redacted, both in the structured columns and in the free-form comments:

In [None]:
amr.load_capsule(data=capsule, read_context="default").data()

We added the rule to the `default` read context, but the purpose of read contexts is to capture policy about which data can be used under which conditions. So you might have different read contexts for different use cases (e.g. model training) or for different teams (e.g. fraud). Let's make a new read context for `analytics` and configure some data policy rules to redact more of the PII in this dataset:

In [None]:
# create the "analytics" read context
amr.add_read_context("analytics",
    am.ReadContextBuilder().
        set_summary("Analytics read context").
        set_description("Read context for data analysis").
        add_required_hook(am.Hook.Fast)
)

# create a new data policy and save its ID
res = amr.create_data_policy("analytics_data_policy", "policies related to the analysis of data")
policy_id = res.policy_id

# update the data policy with rules to redact PII
amr.update_data_policy_rules(
    policy_id=policy_id,
    rules=am.DataPolicyRuleChangesBuilder().add_rule(
        am.NewDataPolicyRuleBuilder(comment="Deny", effect=am.RuleEffect.REDACT, priority=10)
        .add_clause(
            clause=am.DataPolicyClauseBuilder(am.ClauseOperator.AnyOf)
            .add_tag(
                am.TagExpressionBuilder()
                .set_name("tag.antimatter.io/pii/name")
                .set_operator(am.Operator.Exists)
            )
            .add_tag(
                am.TagExpressionBuilder()
                .set_name("tag.antimatter.io/pii/email_address")
                .set_operator(am.Operator.Exists)
            )
            .add_tag(
                am.TagExpressionBuilder()
                .set_name("tag.antimatter.io/pii/ssn")
                .set_operator(am.Operator.Exists)
            )
        )
    ),
)

# bind the data policy to the analytics read context
amr.set_data_policy_binding(
    policy_id=policy_id,
    default_attachment=am.Attachment.NotAttached,
    read_contexts=[("analytics", am.Attachment.Attached)],
)

Now, if we load the capsule again, we'll see that more of the PII has been redacted:

In [None]:
amr.load_capsule(data=capsule, read_context="analytics").data()

You can see which tags are available to reference in your rules by calling `list_hooks`. The `sensitive` write context uses `fast-pii` by default:

In [None]:
amr.list_hooks()

One of the advantages of using Antimatter is that the policy captured in the read context is separate from your data pipeline. Often, the rules of what data is allowed to be used by whom are actually decided by different stakeholders (e.g. the security or legal teams) rather than by the folks who are doing data cleaning and augmentation for the purposes of analytics. 

The data policy rules can be configured by anyone who is invited to the domain, using the Python library, the command line tool, or the web app. Let's create an API key for a colleague on the security team to configure the read contexts. For simplicity, we'll make them an `admin`:

In [None]:
apik = amr.insert_identity_provider_principal(
    'apikey',
    capabilities={'admin': None},
    principal_type=am.PrincipalType.ApiKey,
)
print(f"Login with --domain-id={amr.config()['domain_id']} --api-key={apik.api_key}")

They can use the CLI (for example) to interact with the domain like this:

```bash
# get the latest Antimatter CLI:
$ sudo curl https://get.antimatter.io/cli/darwin/arm64/am -o /usr/local/bin/am
$ sudo chmod a+x /usr/local/bin/am

$ am config domain login --domain-id dm-xxxxxxxxxxx --api-key xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
$ am read-context list
readContexts:
- name: analytics
  summary: Analytics read context
  description: Read context for data analysis
  disableReadLogging: false
  keyCacheTTL: 0
  readParameters: []
  imported: false
- name: default
  summary: Default read context
  description: The default read context
  disableReadLogging: false
  keyCacheTTL: 0
  readParameters: []
  imported: false
```

In order to create a new data policy rule, they will need the existing policy's ID. Below we are using the [yq](https://mikefarah.gitbook.io/yq) command line tool but you can also just use `am data-policy list` without yq and copy the ID for the data policy named `analytics_data_policy`.

```bash
$ POLICY_ID=$(am data-policy list | yq '.policies[] | select(.name == "analytics_data_policy") | .id')
```

Our colleague can then add a new data policy rule, e.g. to redact physical addresses like this:

```bash
$ am data-policy rule create \
  --policy-id $POLICY_ID \
  --effect Redact \
  --priority 20 \
  --clause '{"operator": "AnyOf", "tags": [{"name": "tag.antimatter.io/pii/location", "values": [], "operator": "Exists"}]}'
```


Now, if we read the same capsule as before, we will see that addresses have been redacted. We don't need to re-encapsulate or re-materialize any datasets. For example purposes, let's read it from the file instead of using the `capsule` variable:

In [None]:
# we saved "mycapsule.ca" earlier
amr.load_capsule(path="mycapsule.ca",read_context="analytics").data()

You can see that the addresses are now redacted too.
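
Why no re-encapsulation is needed can be illustrated with a toy model in plain Python. Everything here is a stand-in rather than Antimatter's API: the `classify` and `read` functions, the regex patterns, the short tag names, and the `{redacted}` placeholder are assumptions made for the sketch. The point is that the expensive classification runs once when the data is stored, while the policy is consulted on every read, so a later policy change affects the very next read.

```python
import re

classifier_runs = 0  # counts how often the (expensive) classifier is invoked

def classify(text):
    """Stand-in classifier: tag email-like and SSN-like spans."""
    global classifier_runs
    classifier_runs += 1
    patterns = {
        "email_address": r"[\w.+-]+@[\w-]+\.[\w.]+",
        "ssn": r"\b\d{3}[- ]\d{2}[- ]\d{4}\b",
    }
    return [(m.start(), m.end(), tag)
            for tag, pattern in patterns.items()
            for m in re.finditer(pattern, text)]

def read(text, spans, redact_tags):
    """Apply the current policy; work right to left so earlier offsets stay valid."""
    for start, end, tag in sorted(spans, reverse=True):
        if tag in redact_tags:
            text = text[:start] + "{redacted}" + text[end:]
    return text

text = "Reach me at bob@example.com, SSN 555-55-5555"
stored = (text, classify(text))  # "encapsulate": classify once, keep the tags

policy = {"analytics": {"email_address"}}
before = read(*stored, policy["analytics"])

policy["analytics"].add("ssn")  # a colleague tightens the policy later
after = read(*stored, policy["analytics"])

print(before)
print(after)
print("classifier runs:", classifier_runs)
```

In this model the stored tags never change; only the set of tags the read context redacts does, which mirrors how the rule added through the CLI took effect on an existing capsule file.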

We used a Pandas dataframe above, but you can encapsulate data of multiple different shapes. For example, even a plain string can be encapsulated by itself:

In [None]:
string_cap = amr.encapsulate(
    """
    This works with arbitrary data, e.g. 'contact Alan McKinsey at some@email.com'.
    We support many shapes of data, like strings, dicts, lists of dicts, pandas dataframes, pytorch data loaders etc
    """, write_context="sensitive")
print(amr.load_capsule(data=string_cap, read_context="analytics").data())

In [None]:
# You can also try it with your own data. NOTE: This will redact names, emails and SSNs
#  Try filling in some text here:
sentence = "my social is 555 55 5555"
amr.classify_and_redact(sentence, write_context="sensitive", read_context="analytics").data()

For more information, please see [the docs](https://docs.antimatter.io).