# Tools to understand the behavior of ML algorithms - Hands-on tutorials

In this notebook, we provide step-by-step guides for getting started with two Microsoft Responsible AI techniques for protecting the data used by AI systems:

- **Data Anonymization** with [Presidio](https://github.com/Microsoft/presidio)
- **Differential Privacy** with the [SmartNoise system](https://github.com/opendp/smartnoise-core)

These guides are inspired by the example notebooks provided in each repo and are meant to bring these tools together in a single document so you can get started as quickly as possible.

## Presidio

Presidio is a data protection and anonymization SDK for text and images. It provides fast identification and anonymization of private entities in text, such as credit card numbers, names, locations, social security numbers, bitcoin wallets, US phone numbers, financial data and more.

In this tutorial, we introduce two of Presidio's modules:
1. **Presidio analyzer**, which detects custom or predefined PII in text by leveraging Named Entity Recognition, regular expressions, rule-based logic, and checksums with relevant context in multiple languages.
2. **Presidio anonymizer**, which anonymizes the detected PII entities using different operators.

We follow four simple steps to show how Presidio works.

**Step 1:** Installing the presidio_analyzer and presidio_anonymizer libraries using pip along with the spaCy English language model needed by the analyzer.

In [1]:
# Installing packages if not already done
# !pip install presidio_analyzer 
# !pip install presidio_anonymizer

# Presidio analyzer requires a spaCy language model. 
# !python -m spacy download en_core_web_lg

# Importing the required modules
from presidio_analyzer import AnalyzerEngine, PatternRecognizer
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities.engine import OperatorConfig

**Step 2:** Once the presidio-analyzer package is installed, run this simple analysis script. This will print the result of the PII analysis, in this case the detected phone numbers in the provided text.

In [2]:
text_to_anonymize = "His name is Mr. Jones and his phone number is 212-555-5555"
analyzer = AnalyzerEngine()
analyzer_results = analyzer.analyze(text=text_to_anonymize, entities=["PHONE_NUMBER"], language='en')

print(analyzer_results)



[type: PHONE_NUMBER, start: 46, end: 58, score: 1.0]
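The `start` and `end` values in the result above are character offsets into the input text, with `end` exclusive. The following minimal sketch uses only the Python standard library to show what those offsets point at, and how a pattern-based detector could arrive at the same span. The regular expression `PHONE_PATTERN` is a simplified assumption for illustration and is not the recognizer Presidio actually uses.

```python
import re

text = "His name is Mr. Jones and his phone number is 212-555-5555"

# Offsets reported in the analyzer output above: start=46, end=58 (end is exclusive).
detected = text[46:58]
print(detected)

# Simplified stand-in for a pattern-based recognizer (an assumption, not Presidio's own pattern):
# three digits, a dash, three digits, a dash, four digits.
PHONE_PATTERN = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")
spans = [(m.start(), m.end()) for m in PHONE_PATTERN.finditer(text)]
print(spans)
```

Slicing the text with the reported offsets recovers the phone number, which is also how the anonymizer later knows which characters to transform.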


**Step 3:**  Creating Custom PII Entity Recognizers.

In [8]:
from presidio_analyzer import PatternRecognizer

text_to_anonymize = "His name is Mr. Jones and his phone number is 212-555-5555" 
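# Deny-list recognizers flag occurrences of the listed terms as the given entity type.
titles_recognizer = PatternRecognizer(supported_entity="TITLE",
                                      deny_list=["Mr.", "Mrs.", "Miss"])
pronoun_recognizer = PatternRecognizer(supported_entity="PRONOUN",
                                       deny_list=["he", "his", "she", "hers"])

# Register the custom recognizers alongside the predefined ones.
analyzer.registry.add_recognizer(titles_recognizer)
analyzer.registry.add_recognizer(pronoun_recognizer)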

analyzer_results = analyzer.analyze(text=text_to_anonymize, language='en')

analyzer_results



[type: TITLE, start: 12, end: 15, score: 1.0,
 type: PRONOUN, start: 26, end: 29, score: 1.0,
 type: PHONE_NUMBER, start: 46, end: 58, score: 1.0,
 type: PERSON, start: 16, end: 21, score: 0.85]

The previous code sample:
1. Creates custom title and pronoun recognizers.
2. Adds the new custom recognizers to the analyzer.
3. Calls analyzer to get results from the old and new recognizers.

It prints all the detected PII, including the titles and pronouns we just defined.
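A deny-list recognizer can be pictured as a regular expression built from the listed terms. The sketch below illustrates that idea with the standard library only; it is not Presidio's implementation. The helper `deny_list_spans` is hypothetical, and case-sensitive matching is an assumption chosen so that the sketch reproduces the `TITLE` and `PRONOUN` spans shown in the output above.

```python
import re

def deny_list_spans(text, entity, deny_list):
    """Return (entity, start, end) for each deny-list term found as a standalone token."""
    # Try longer terms first so that "hers" is preferred over "he" at the same position.
    terms = sorted(deny_list, key=len, reverse=True)
    alternation = "|".join(re.escape(term) for term in terms)
    # A match must be delimited by a non-word character or by the start/end of the text.
    pattern = re.compile(r"(?:^|(?<=\W))(?:" + alternation + r")(?:(?=\W)|$)")
    return [(entity, m.start(), m.end()) for m in pattern.finditer(text)]

text = "His name is Mr. Jones and his phone number is 212-555-5555"
title_spans = deny_list_spans(text, "TITLE", ["Mr.", "Mrs.", "Miss"])
pronoun_spans = deny_list_spans(text, "PRONOUN", ["he", "his", "she", "hers"])
print(title_spans)
print(pronoun_spans)
```

Under this assumption the capitalized "His" at the start of the sentence is not matched, which is consistent with the single `PRONOUN` result above.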

**Step 4:**  Anonymizing the identified PII entities.

In [9]:

from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities.engine import OperatorConfig

anonymizer = AnonymizerEngine()

# Each entity type maps to an operator; "DEFAULT" applies to any type not listed.
anonymized_results = anonymizer.anonymize(
    text=text_to_anonymize,
    analyzer_results=analyzer_results,
    operators={
        # Replace every other entity with a fixed placeholder.
        "DEFAULT": OperatorConfig("replace", {"new_value": "<ANONYMIZED>"}),
        # Mask the last 12 characters of each phone number with "*".
        "PHONE_NUMBER": OperatorConfig(
            "mask",
            {"type": "mask", "masking_char": "*", "chars_to_mask": 12, "from_end": True},
        ),
        # Remove titles entirely.
        "TITLE": OperatorConfig("redact", {}),
    },
)

anonymized_results.to_json()


'{"text": "His name is  <ANONYMIZED> and <ANONYMIZED> phone number is ************", "items": [{"start": 59, "end": 71, "entity_type": "PHONE_NUMBER", "text": "************", "operator": "mask"}, {"start": 30, "end": 42, "entity_type": "PRONOUN", "text": "<ANONYMIZED>", "operator": "replace"}, {"start": 13, "end": 25, "entity_type": "PERSON", "text": "<ANONYMIZED>", "operator": "replace"}, {"start": 12, "end": 12, "entity_type": "TITLE", "text": "", "operator": "redact"}]}'

The previous code sample:
1. Sets up the anonymizer engine.
2. Builds an anonymization request from the text to anonymize, the operators to apply to each entity type, and the results from the analyzer.
3. Anonymizes the text.

It prints the anonymized text along with a list of the anonymized PII entities and their positions in the new text.
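To make the three operators concrete, here is a minimal standard-library sketch of how `replace`, `mask`, and `redact` could be applied to the detected spans. It is an illustration under stated assumptions, not Presidio's implementation: the function `apply_operators` and the plain-dict operator format with a `kind` key are hypothetical. Spans are processed from the end of the text backwards so that earlier offsets stay valid after each substitution.

```python
def apply_operators(text, spans, operators):
    """Apply a per-entity operator to each (entity, start, end) span of the text."""
    # Work right to left so that replacing one span does not shift the others.
    for entity, start, end in sorted(spans, key=lambda s: s[1], reverse=True):
        op = operators.get(entity, operators["DEFAULT"])
        original = text[start:end]
        if op["kind"] == "replace":
            new_value = op["new_value"]
        elif op["kind"] == "mask":
            n = min(op["chars_to_mask"], len(original))
            if op.get("from_end"):
                new_value = original[:len(original) - n] + op["masking_char"] * n
            else:
                new_value = op["masking_char"] * n + original[n:]
        elif op["kind"] == "redact":
            new_value = ""
        else:
            raise ValueError(f"unknown operator: {op['kind']}")
        text = text[:start] + new_value + text[end:]
    return text

text = "His name is Mr. Jones and his phone number is 212-555-5555"
# Spans taken from the analyzer output shown earlier.
spans = [("TITLE", 12, 15), ("PRONOUN", 26, 29),
         ("PHONE_NUMBER", 46, 58), ("PERSON", 16, 21)]
# Hypothetical dict-based equivalents of the OperatorConfig settings used above.
operators = {
    "DEFAULT": {"kind": "replace", "new_value": "<ANONYMIZED>"},
    "PHONE_NUMBER": {"kind": "mask", "masking_char": "*",
                     "chars_to_mask": 12, "from_end": True},
    "TITLE": {"kind": "redact"},
}
anonymized = apply_operators(text, spans, operators)
print(anonymized)
```

The result matches the `text` field of the JSON output above, including the double space left behind where the title was redacted.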

## SmartNoise

**SmartNoise** is a joint project by Microsoft and Harvard's Institute for Quantitative Social Science (IQSS) and the School of Engineering and Applied Sciences (SEAS) as part of the OpenDP initiative. It aims to make Differential Privacy broadly accessible.

The SmartNoise tools primarily focus on the "global model" of **Differential Privacy**, where a trusted data collector is presumed to have access to unprotected data and wishes to protect public releases of aggregate information. For example, a hospital has access to its patients’ information and wishes to release aggregate statistics about these patients without compromising their privacy.
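As a rough illustration of the global model, the sketch below shows a trusted curator answering a count query with the Laplace mechanism, a standard differential privacy technique. It uses only the Python standard library and is not the SmartNoise API. The function names `laplace_noise` and `dp_count`, the toy patient records, and the choice of `epsilon=1.0` are all assumptions made for this example.

```python
import random

def laplace_noise(scale, rng):
    """Draw one sample from a Laplace(0, scale) distribution."""
    # The difference of two independent exponential draws with rate 1/scale
    # follows a Laplace distribution centered at 0 with the given scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def dp_count(records, predicate, epsilon, rng):
    """Return a noisy count of the records that satisfy the predicate."""
    true_count = sum(1 for record in records if predicate(record))
    sensitivity = 1.0  # adding or removing one person changes a count by at most 1
    return true_count + laplace_noise(sensitivity / epsilon, rng)

rng = random.Random(0)
# Made-up toy data: 1000 patients, of whom 120 have some condition.
patients = [{"has_condition": i < 120} for i in range(1000)]
noisy_count = dp_count(patients, lambda p: p["has_condition"], epsilon=1.0, rng=rng)
print(round(noisy_count, 2))
```

Only the noisy count leaves the curator. A smaller `epsilon` means a larger noise scale and therefore stronger privacy at the cost of accuracy.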

SmartNoise is an open-source project that contains different components for building global differentially private systems. SmartNoise is made up of a core library and an SDK; only the [SmartNoise core library](https://github.com/opendp/smartnoise-core) is explored here.

In this tutorial, we will explore how data can be protected against reidentification using Differential Privacy and the SmartNoise system.

The goal is to show how an attacker can leverage basic demographic information like age and zip codes to reidentify individuals even when the sensitive data is published in an anonymized format. Then we show how Differential Privacy can help prevent such an attack. 
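The kind of attack described here is often called a linkage attack: records stripped of names are joined with a public dataset on shared quasi-identifiers. The sketch below shows the idea with the standard library on a few made-up records. The data, the `link_records` helper, and the field names are all hypothetical and are not taken from the notebook's dataset.

```python
from collections import defaultdict

# Made-up "anonymized" release: names removed, but age and zip code kept.
medical = [
    {"age": 34, "zip": "98101", "diagnosis": "asthma"},
    {"age": 34, "zip": "98052", "diagnosis": "diabetes"},
    {"age": 61, "zip": "98101", "diagnosis": "hypertension"},
    {"age": 61, "zip": "98101", "diagnosis": "migraine"},
]

# Made-up public demographic data that includes names.
public = [
    {"name": "Person A", "age": 34, "zip": "98101"},
    {"name": "Person B", "age": 34, "zip": "98052"},
    {"name": "Person C", "age": 61, "zip": "98101"},
    {"name": "Person D", "age": 61, "zip": "98101"},
]

def link_records(medical, public):
    """Map names to diagnoses wherever (age, zip) identifies exactly one person and one record."""
    medical_by_key = defaultdict(list)
    public_by_key = defaultdict(list)
    for record in medical:
        medical_by_key[(record["age"], record["zip"])].append(record)
    for person in public:
        public_by_key[(person["age"], person["zip"])].append(person)

    reidentified = {}
    for key, people in public_by_key.items():
        records = medical_by_key.get(key, [])
        # A unique match on both sides reveals that person's diagnosis.
        if len(people) == 1 and len(records) == 1:
            reidentified[people[0]["name"]] = records[0]["diagnosis"]
    return reidentified

reidentified = link_records(medical, public)
print(reidentified)
```

In this toy example, two of the four people are uniquely identified by age and zip code alone. The other two share the same combination, so this simple join cannot tell them apart.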


In [10]:
# Install required libraries, uncomment if needed
# %pip install git+https://github.com/opendifferentialprivacy/smartnoise-sdk#subdirectory=sdk
# %pip install faker zipcodes tqdm opendp-smartnoise
# !pip install z3-solver==4.8.9.0

In [None]:
import warnings
warnings.filterwarnings("ignore")

import logging
logger = logging.getLogger()
logger.setLevel(logging.INFO)
import pandas as pd
import numpy as np
import random
import string
import uuid
import time
from datetime import datetime
import matplotlib.pyplot as plt
from tqdm import tqdm
import reident_tools as reident
from opendp.smartnoise.synthesizers.mwem import MWEMSynthesizer

%config InlineBackend.figure_format = 'retina'
%load_ext autoreload
%autoreload 2