# PII Data Detection Demo

In this demo, I use Presidio (through its Python interface) to detect PII in a given text and then replace the detected entities with masks.

The GitHub repository for Presidio is: https://github.com/microsoft/presidio

A YouTube video introducing Presidio PII detection: https://www.youtube.com/watch?v=1pUEG0MZxvM

I assume this notebook runs in an environment compatible with an __[Anaconda Individual Edition](https://www.anaconda.com/open-source)__ setup on a __Spark cluster__.

# Install Main Open Source Resources

In [1]:
# upgrade pip to the latest version
!pip install pip --upgrade

Collecting pip
  Using cached pip-20.3.3-py2.py3-none-any.whl (1.5 MB)
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 20.0.2
    Uninstalling pip-20.0.2:
      Successfully uninstalled pip-20.0.2
Successfully installed pip-20.3.3


## Step 1: Install the spaCy Python library
```shell
pip install spacy
```
or
```shell
conda install -c conda-forge spacy
```

In [2]:
!pip install spacy

Collecting spacy
  Downloading spacy-2.3.5-cp37-cp37m-manylinux2014_x86_64.whl (10.4 MB)
Collecting blis<0.8.0,>=0.4.0
  Downloading blis-0.7.4-cp37-cp37m-manylinux2014_x86_64.whl (9.8 MB)
Collecting catalogue<1.1.0,>=0.0.7
  Using cached catalogue-1.0.0-py2.py3-none-any.whl (7.7 kB)
Collecting cymem<2.1.0,>=2.0.2
  Using cached cymem-2.0.5-cp37-cp37m-manylinux2014_x86_64.whl (35 kB)
Collecting murmurhash<1.1.0,>=0.28.0
  Using cached murmurhash-1.0.5-cp37-cp37m-manylinux2014_x86_64.whl (20 kB)
Collecting plac<1.2.0,>=0.9.6
  Using cached plac-1.1.3-py2.py3-none-any.whl (20 kB)
Collecting preshed<3.1.0,>=3.0.2
  Using cached preshed-3.0.5-cp37-cp37m-manylinux2014_x86_64.whl (126 kB)
Collecting srsly<1.1.0,>=1.0.2
  Using cached srsly-1.0.5-cp37-cp37m-manylinux2014_x86_64.whl (184 kB)
Collecting thinc<7.5.0,>=7.4.1
  Downloading thinc-7.4

## Step 2: Install spaCy pre-trained NLP models (English and Spanish)
```shell
python3 -m spacy download en_core_web_lg
python3 -m spacy download es_core_news_md
```
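
Because `spacy download` installs each model as an ordinary Python package, one way to check whether the downloads are still needed is to look the packages up by name. The following is a minimal sketch under that assumption; the helper name `missing_models` is hypothetical and not part of spaCy.

```python
import importlib.util

# The two model packages this notebook relies on.
REQUIRED_MODELS = ["en_core_web_lg", "es_core_news_md"]


def missing_models(model_names):
    """Return the model packages that cannot be found in this environment."""
    return [
        name for name in model_names
        if importlib.util.find_spec(name) is None
    ]


missing = missing_models(REQUIRED_MODELS)
if missing:
    print("Models still to download:", ", ".join(missing))
else:
    print("All required spaCy models are installed.")
```

This only confirms that the package is importable; it does not verify that the model version matches the installed spaCy version.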

In [14]:
!python3 -m spacy download en_core_web_lg

Collecting en_core_web_lg==2.2.5
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-2.2.5/en_core_web_lg-2.2.5.tar.gz (827.9 MB)
Building wheels for collected packages: en-core-web-lg
  Building wheel for en-core-web-lg (setup.py) ... done
  Created wheel for en-core-web-lg: filename=en_core_web_lg-2.2.5-py3-none-any.whl size=829180943 sha256=4d5f41f892e2aec4bf405f6bb9776ed34d36d885a2456ce525196a95578a4071
  Stored in directory: /tmp/pip-ephem-wheel-cache-7pn3m87z/wheels/11/95/ba/2c36cc368c0bd339b44a791

In [15]:
!python3 -m spacy download es_core_news_md

Collecting es_core_news_md==2.2.5
  Downloading https://github.com/explosion/spacy-models/releases/download/es_core_news_md-2.2.5/es_core_news_md-2.2.5.tar.gz (78.4 MB)
Building wheels for collected packages: es-core-news-md
  Building wheel for es-core-news-md (setup.py) ... done
  Created wheel for es-core-news-md: filename=es_core_news_md-2.2.5-py3-none-any.whl size=79649480 sha256=151af54cfb47fc151e7e2592bd8dcf82a745f90b5a2c17ef79286cedbe62e5d2
  Stored in directory: /tmp/pip-ephem-wheel-cache-20mzj118/wheels/d8/f5/92/ee8a4f74fac67775fbc0314b1c9ae4694f4180437f6fc3dd1c
Successfully built es-core-news-md
Installing collected packages: es-core-news-md
Successfully installed es-core-news-md-2.2.5
✔ Download and installation successful
You can now load the model via spacy.load('es_core_news_md')


## Step 3: Install Presidio from GitHub source code (via pip install from a wheel file)

Instructions for installing presidio_analyzer from the GitHub source code can be found at https://microsoft.github.io/presidio/deploy.html.

Presidio Analyzer requires gcc to compile and install the final Python package, so the following shell commands install gcc if it is not already present.
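
Before running the `apt-get` cells, it may be worth checking whether a compiler is already on the `PATH`. This is a minimal sketch using only the standard library; the helper name `find_compiler` is hypothetical.

```python
import shutil


def find_compiler(name="gcc"):
    """Return the full path of the executable on PATH, or None if it is absent."""
    return shutil.which(name)


gcc_path = find_compiler()
if gcc_path is None:
    print("gcc not found; install it before building the Presidio dependencies.")
else:
    print("gcc found at", gcc_path)
```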

In [17]:
!apt-get update

Get:1 http://security.debian.org/debian-security buster/updates InRelease [65.4 kB]
Get:2 http://deb.debian.org/debian buster InRelease [121 kB]
Get:3 http://deb.debian.org/debian buster-updates InRelease [51.9 kB]
Get:4 http://security.debian.org/debian-security buster/updates/main amd64 Packages [260 kB]
Get:5 http://deb.debian.org/debian buster/main amd64 Packages [7907 kB]
Get:6 http://deb.debian.org/debian buster-updates/main amd64 Packages [7860 B]
Fetched 8414 kB in 2s (5024 kB/s)
Reading package lists... Done


In [19]:
!apt-get install gcc -y

Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following additional packages will be installed:
  binutils binutils-common binutils-x86-64-linux-gnu cpp cpp-8 gcc-8 libasan5
  libatomic1 libbinutils libc-dev-bin libc6-dev libcc1-0 libgcc-8-dev libisl19
  libitm1 liblsan0 libmpc3 libmpfr6 libmpx2 libquadmath0 libtsan0 libubsan1
  linux-libc-dev manpages manpages-dev
Suggested packages:
  binutils-doc cpp-doc gcc-8-locales gcc-multilib make autoconf automake
  libtool flex bison gdb gcc-doc gcc-8-multilib gcc-8-doc libgcc1-dbg
  libgomp1-dbg libitm1-dbg libatomic1-dbg libasan5-dbg liblsan0-dbg
  libtsan0-dbg libubsan1-dbg libmpx2-dbg libquadmath0-dbg glibc-doc
  man-browser
The following NEW packages will be installed:
  binutils binutils-common binutils-x86-64-linux-gnu cpp cpp-8 gcc gcc-8
  libasan5 libatomic1 libbinutils libc-dev-bin libc6-dev libcc1-0 libgcc-8-dev
  libisl19 libitm1 liblsan0 libmpc3 libmpfr6 libmpx2 libquadmath0 li

Now, let's install the Presidio Analyzer Python package, which serves as the main PII detection engine.

In [20]:
!pip install presidio_analyzer-0.3.dev0-py2.py3-none-any.whl

Processing ./presidio_analyzer-0.3.dev0-py2.py3-none-any.whl
Collecting cython==0.29.10
  Using cached Cython-0.29.10-cp37-cp37m-manylinux1_x86_64.whl (2.1 MB)
Collecting grpcio==1.21.1
  Using cached grpcio-1.21.1-cp37-cp37m-manylinux1_x86_64.whl (2.2 MB)
Collecting knack==0.6.2
  Using cached knack-0.6.2-py2.py3-none-any.whl (54 kB)
Collecting protobuf==3.8.0
  Using cached protobuf-3.8.0-cp37-cp37m-manylinux1_x86_64.whl (1.2 MB)
Collecting regex==2019.6.8
  Using cached regex-2019.06.08.tar.gz (651 kB)
Building wheels for collected packages: regex
  Building wheel for regex (setup.py) ... done
  Created wheel for regex: filename=regex-2019.6.8-cp37-cp37m-linux_x86_64.whl size=679492 sha256=92e6b65dc87b777db29b2a34e6f1c58a71c35231f05ce17197b23ec0a65ef7de
  Stored in directory: /root/.cache/pip/wheels/fe/d3/3d/9d4cc9eb91c616089c9851063afef76b3b71052d1f2f17f8ad
Successfully built regex
Installing collected packages: regex, protobuf, knack, grpcio, cython, presidio-analyzer


# PII Data Detection

## Import PySpark Libraries

In [1]:
import pyspark
from pyspark.sql import SparkSession
# Set up a Spark session that uses all available CPU cores
spark = SparkSession \
        .builder \
        .master('local[*]')\
        .appName("NLP") \
        .getOrCreate()

In [2]:
# Load pyspark SQL library
from pyspark.sql.types import StructType, StructField, StringType, BooleanType, IntegerType, ArrayType, FloatType, DoubleType
from pyspark.sql import Row
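# import only the functions used below; a wildcard import would shadow Python built-ins such as sum and max
from pyspark.sql.functions import col, udf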

## Import Presidio Libraries

In [3]:
# import presidio python libraries
from presidio_analyzer import AnalyzerEngine
from presidio_analyzer.nlp_engine import SpacyNlpEngine
from presidio_analyzer.recognizer_registry import RecognizerRegistry

## Presidio Analyzer Engine

### Initialize an Engine with Two Language Pre-trained Models

In [4]:
# initialize presidio analyzer engine
registry = RecognizerRegistry()
nlp = SpacyNlpEngine({"en": "en_core_web_lg", "es": "es_core_news_md"})
registry.load_predefined_recognizers(["en", "es"], "spacy")
analyzer = AnalyzerEngine(registry=registry, nlp_engine=nlp, default_language="es")

lang             en                            
name             core_web_lg                   
license          MIT                           
author           Explosion                     
url              https://explosion.ai          
email            contact@explosion.ai          
description      English multi-task CNN trained on OntoNotes, with GloVe vectors trained on Common Crawl. Assigns word vectors, context-specific token vectors, POS tags, dependency parse and named entities.
sources          [{'name': 'OntoNotes 5', 'url': 'https://catalog.ldc.upenn.edu/LDC2013T19', 'license': 'commercial (licensed by Explosion)'}, {'name': 'Common Crawl'}]
pipeline         ['tagger', 'parser', 'ner']   
version          2.2.5                         
spacy_version    >=2.2.2                       
parent_package   spacy                         
labels           {'tagger': ['$', "''", ',', '-LRB-', '-RRB-', '.', ':', 'ADD', 'AFX', 'CC', 'CD', 'DT', 'EX', 'FW', 'HYPH', 'IN', 'JJ', '

[2021-01-07 11:03:48,105][presidio][INFO]Loaded recognizer: UsLicenseRecognizer
[2021-01-07 11:03:48,105][presidio][INFO]Loaded recognizer: UsItinRecognizer
[2021-01-07 11:03:48,106][presidio][INFO]Loaded recognizer: UsPassportRecognizer
[2021-01-07 11:03:48,106][presidio][INFO]Loaded recognizer: UsPhoneRecognizer
[2021-01-07 11:03:48,107][presidio][INFO]Loaded recognizer: UsSsnRecognizer
[2021-01-07 11:03:48,107][presidio][INFO]Loaded recognizer: NhsRecognizer
[2021-01-07 11:03:48,108][presidio][INFO]Loaded recognizer: SgFinRecognizer
[2021-01-07 11:03:48,108][presidio][INFO]Loaded recognizer: CreditCardRecognizer
[2021-01-07 11:03:48,108][presidio][INFO]Loaded recognizer: CryptoRecognizer
[2021-01-07 11:03:48,109][presidio][INFO]Loaded recognizer: DomainRecognizer
[2021-01-07 11:03:48,109][presidio][INFO]Loaded recognizer: EmailRecognizer
[2021-01-07 11:03:48,110][presidio][INFO]Loaded recognizer: IbanRecognizer
[2021-01-07 11:03:48,110][presidio][INFO]Loaded recognizer: IpRecognizer

### Test the Engine

In [37]:
# define a masking function for the identified PII
def mask_pii(text, response):
    mask_text = text
    for res in response:
        # mask every detection, including very low-confidence ones (score >= 0.01)
        if res.score >= 0.01:
            # use asterisks of the same length as the span so later offsets stay valid
            mask_string = "*" * (res.end - res.start)
            mask_text = mask_text[:res.start] + mask_string + mask_text[res.end:]
    return mask_text
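
The masking step can be tried without Presidio installed. The sketch below is a stand-alone illustration of the same idea: `Span` is a hypothetical stand-in for Presidio's result objects (only `start`, `end`, and `score` are modelled), and `mask_spans` is a hypothetical name. It shows that equal-length masks keep the text length unchanged, so overlapping spans can be applied in any order.

```python
from collections import namedtuple

# Hypothetical stand-in for a detection result; not a Presidio class.
Span = namedtuple("Span", ["start", "end", "score"])


def mask_spans(text, spans, threshold=0.01):
    """Replace each span scoring at least `threshold` with asterisks of equal length."""
    masked = text
    for span in spans:
        if span.score >= threshold:
            stars = "*" * (span.end - span.start)
            masked = masked[:span.start] + stars + masked[span.end:]
    return masked


text = "Call 818-828-6231 or mail a@b.io"
spans = [
    Span(5, 17, 0.85),   # phone number
    Span(26, 32, 1.0),   # email address
    Span(28, 32, 1.0),   # domain name, nested inside the email span
]
masked = mask_spans(text, spans)
print(masked)  # Call ************ or mail ******
```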

In [38]:
# detect PII with the Presidio analyzer in a given English text block
original_text = "Good morning, everybody. My name is Van Bokhorst Serdar, and today I feel like sharing a whole lot of personal information with you. Let's start with my Email address SerdarvanBokhorst@dayrep.com. My address is 2657 Koontz Lane, Los Angeles, CA. My phone number is 818-828-6231. My Social security number is 548-95-6370. My Bank account number is 940517528812 and routing number 195991012. My credit card number is 5534816011668430, Expiration Date 6/1/2022, my C V V code is 121, and my pin 123456. Well, I think that's it. You know a whole lot about me. And I hope that Amazon comprehend is doing a good job at identifying PII entities so you can redact my personal information away from this document. Let's check. This is my website: https://www.dougf.io/"

presidio_response = analyzer.analyze(original_text,language='en',all_fields=True)
print(presidio_response)

[2021-01-07 11:30:43,657][presidio][ERROR]Failed to get recognizers hash

[type: EMAIL_ADDRESS, start: 167, end: 195, score: 1.0, type: DOMAIN_NAME, start: 185, end: 195, score: 1.0, type: DOMAIN_NAME, start: 746, end: 758, score: 1.0, type: DATE_TIME, start: 5, end: 12, score: 0.85, type: PERSON, start: 36, end: 55, score: 0.85, type: PERSON, start: 216, end: 227, score: 0.85, type: LOCATION, start: 229, end: 240, score: 0.85, type: PHONE_NUMBER, start: 265, end: 277, score: 0.85, type: US_SSN, start: 308, end: 319, score: 0.85, type: DATE_TIME, start: 347, end: 359, score: 0.85, type: DATE_TIME, start: 415, end: 431, score: 0.85, type: US_BANK_NUMBER, start: 347, end: 359, score: 0.4, type: US_BANK_NUMBER, start: 379, end: 388, score: 0.4, type: US_SSN, start: 379, end: 388, score: 0.3, type: US_PASSPORT, start: 379, end: 388, score: 0.05, type: US_BANK_NUMBER, start: 415, end: 431, score: 0.05, type: US_DRIVER_LICENSE, start: 347, end: 359, score: 0.01, type: US_DRIVER_LICENSE, start: 379, end: 388, score: 0.01, type: US_DRIVER_LICENSE, start: 415, end: 4
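
The response above contains overlapping detections: the `DOMAIN_NAME` span at 185-195 sits inside the `EMAIL_ADDRESS` span at 167-195, and the span 379-388 is reported as three different types. If a single label per span is wanted, one option is a greedy pass that keeps the highest-scoring detection and drops anything overlapping it. This is a sketch, not Presidio's own behavior: `Detection` is a hypothetical stand-in for the result objects, and the `min_score` cut-off of 0.3 is an arbitrary assumption.

```python
from collections import namedtuple

# Hypothetical stand-in for the result objects printed above; not a Presidio class.
Detection = namedtuple("Detection", ["entity_type", "start", "end", "score"])


def resolve_overlaps(detections, min_score=0.3):
    """Greedily keep the highest-scoring detections whose spans do not overlap."""
    kept = []
    # Highest score first; ties are broken in favor of the longer span.
    ranked = sorted(detections, key=lambda d: (-d.score, -(d.end - d.start)))
    for det in ranked:
        if det.score < min_score:
            continue
        if not any(det.start < k.end and k.start < det.end for k in kept):
            kept.append(det)
    return sorted(kept, key=lambda d: d.start)


# A few of the detections from the response above.
sample = [
    Detection("EMAIL_ADDRESS", 167, 195, 1.0),
    Detection("DOMAIN_NAME", 185, 195, 1.0),
    Detection("US_BANK_NUMBER", 379, 388, 0.4),
    Detection("US_SSN", 379, 388, 0.3),
    Detection("US_PASSPORT", 379, 388, 0.05),
]
resolved = resolve_overlaps(sample)
for det in resolved:
    print(det.entity_type, det.start, det.end, det.score)
```

For masking alone this step is optional, since equal-length masks tolerate overlaps; it matters mainly when each span should be reported under one entity type.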

In [39]:
print(mask_pii(original_text, presidio_response))

Good *******, everybody. My name is *******************, and today I feel like sharing a whole lot of personal information with you. Let's start with my Email address ****************************. My address is 2657 ***********, ***********, CA. My phone number is ************. My Social security number is ***********. My Bank account number is ************ and routing number *********. My credit card number is ****************, Expiration Date 6/1/2022, my C V V code is 121, and my pin ******. Well, I think that's it. You know a whole lot about me. And I hope that Amazon comprehend is doing a good job at identifying PII entities so you can redact my personal information away from this document. Let's check. This is my website: https://************/


In [40]:
# detect PII with the Presidio analyzer in a given Spanish text block
original_text = "Mi nombre es Francisco Pérez con DNI 55555555-K, vivo en Madrid y trabajo para la ONU."

presidio_response = analyzer.analyze(original_text,language='es',all_fields=True)
presidio_response

[2021-01-07 11:30:57,714][presidio][ERROR]Failed to get recognizers hash

[type: ES_NIF, start: 37, end: 47, score: 1.0,
 type: PERSON, start: 13, end: 28, score: 0.85,
 type: LOCATION, start: 57, end: 63, score: 0.85]

In [41]:
print(mask_pii(original_text, presidio_response))

Mi nombre es *************** con DNI **********, vivo en ****** y trabajo para la ONU.


## Load the Data into Spark DataFrame

In [52]:
# read the sample data into a DataFrame
df = spark.read.csv('./pii-data-samples.txt', header=True, inferSchema=True)
df.show()

+--------------------+
|                text|
+--------------------+
|Good morning, eve...|
|Hello Zhang Wei. ...|
+--------------------+



## Define a UDF for PII Detection

In [53]:
def pii_detect(text, languageCode='en'):
    # initialize the Presidio recognizer registry
    registry = RecognizerRegistry()
    if languageCode == 'en':
        # English: pre-trained spaCy model plus the predefined recognizers
        engine = {"en": "en_core_web_lg"}
        registry.load_predefined_recognizers(["en"], "spacy")
    else:
        # otherwise assume Spanish ('es')
        engine = {"es": "es_core_news_md"}
        registry.load_predefined_recognizers(["es"], "spacy")
    # NOTE: the engine is rebuilt on every call, so the spaCy model is loaded again for each row
    nlp = SpacyNlpEngine(engine)
    analyzer = AnalyzerEngine(registry=registry, nlp_engine=nlp, default_language=languageCode)
    # apply the Presidio analyzer to the input text to detect PII
    text = str(text)
    response = analyzer.analyze(text, language=languageCode, all_fields=True)
    # return the masked text
    return mask_pii(text, response)

# wrap pii_detect as a Spark UDF; null inputs pass through as null
pii_detect_udf = udf(lambda x: pii_detect(x) if x is not None else None, StringType())
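
`pii_detect` builds a new analyzer on every call, which means the language model is loaded once per row. A common way to avoid this is to cache the engine per language inside each Python worker process. The sketch below shows only the caching pattern: `get_engine`, `detect`, and `build_calls` are hypothetical names, and the dictionary returned by `get_engine` is a placeholder for the `RecognizerRegistry` / `SpacyNlpEngine` / `AnalyzerEngine` construction used in `pii_detect`.

```python
from functools import lru_cache

build_calls = []  # records every expensive build, for illustration only


@lru_cache(maxsize=None)
def get_engine(language_code):
    """Build the engine once per language per Python process."""
    build_calls.append(language_code)
    # Placeholder: a real version would construct and return the analyzer here.
    return {"language": language_code}


def detect(text, language_code="en"):
    engine = get_engine(language_code)  # served from the cache after the first call
    return "[{}] {}".format(engine["language"], text)


rows = ["first row", "second row", "third row"]
results = [detect(row) for row in rows]
print(build_calls)  # ['en']
```

With this pattern the UDF body would call the cached factory instead of constructing the analyzer itself, so each Spark worker pays the model-loading cost once per language rather than once per row.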

## Apply UDF to the Spark DataFrame

In [54]:
# Apply the UDF to the DataFrame to generate a new column
df = df.withColumn("text_NoPII", pii_detect_udf(col('text')))

In [56]:
df.show()

+--------------------+--------------------+
|                text|          text_NoPII|
+--------------------+--------------------+
|Good morning, eve...|Good *******, eve...|
|Hello Zhang Wei. ...|Hello *********. ...|
+--------------------+--------------------+



# Appendix - spaCy NER usage

In [1]:
# Import the spacy library
import spacy

# Load the pre-trained statistical model
nlp = spacy.load("en_core_web_lg")

# Run the spaCy pipeline on a sample sentence
doc = nlp("Hi my name is Doug Funny and this is my website: https://www.dougf.io/, and I live in Miami. My drivers license is AC432223 and phone number is 212-555-5555")

# List to store all the proper nouns
pii = []

# Loop through each token and print its POS tag; the commented-out
# lines would keep only the proper nouns
for token in doc:
#     if token.pos_ == 'PROPN':
#         pii.append(token.text)
    print("token: ", token, "  type: ", token.pos_)

# Print all the unique proper nouns
# print(set(pii))

token:  Hi   type:  INTJ
token:  my   type:  DET
token:  name   type:  NOUN
token:  is   type:  AUX
token:  Doug   type:  PROPN
token:  Funny   type:  PROPN
token:  and   type:  CCONJ
token:  this   type:  DET
token:  is   type:  AUX
token:  my   type:  DET
token:  website   type:  NOUN
token:  :   type:  PUNCT
token:  https://www.dougf.io/   type:  X
token:  ,   type:  PUNCT
token:  and   type:  CCONJ
token:  I   type:  PRON
token:  live   type:  VERB
token:  in   type:  ADP
token:  Miami   type:  PROPN
token:  .   type:  PUNCT
token:  My   type:  DET
token:  drivers   type:  NOUN
token:  license   type:  NOUN
token:  is   type:  AUX
token:  AC432223   type:  NOUN
token:  and   type:  CCONJ
token:  phone   type:  NOUN
token:  number   type:  NOUN
token:  is   type:  AUX
token:  212   type:  NUM
token:  -   type:  SYM
token:  555   type:  NUM
token:  -   type:  PUNCT
token:  5555   type:  NUM
