<a href="https://colab.research.google.com/github/UrielBaldesco/Machine-Learning-Custom-Models/blob/main/Presidio_Practicing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

https://microsoft.github.io/presidio/samples/python/presidio_notebook/
https://github.com/microsoft/presidio/blob/main/docs/samples/python/presidio_notebook.ipynb

Goal:
* Be able to use Presidio to recognize, redact, and anonymize text

Tasks:
* Get familiar with Presidio's functionality

To customize:
* numbers spelled out in words vs. numbers written as digits
* Eastern names vs. Western names

In [None]:
# Install Presidio (analyzer and anonymizer)
!pip install presidio_analyzer presidio_anonymizer
# Download en_core_web_lg, the large English model for spaCy (the NLP library Presidio uses by default)
!python -m spacy download en_core_web_lg

Collecting presidio_analyzer
  Downloading presidio_analyzer-2.2.355-py3-none-any.whl.metadata (2.9 kB)
Collecting presidio_anonymizer
  Downloading presidio_anonymizer-2.2.355-py3-none-any.whl.metadata (8.2 kB)
Collecting phonenumbers<9.0.0,>=8.12 (from presidio_analyzer)
  Downloading phonenumbers-8.13.45-py2.py3-none-any.whl.metadata (10 kB)
Collecting tldextract (from presidio_analyzer)
  Downloading tldextract-5.1.2-py3-none-any.whl.metadata (11 kB)
Collecting azure-core (from presidio_anonymizer)
  Downloading azure_core-1.31.0-py3-none-any.whl.metadata (39 kB)
Collecting pycryptodome>=3.10.1 (from presidio_anonymizer)
  Downloading pycryptodome-3.20.0-cp35-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.4 kB)
Collecting requests-file>=1.4 (from tldextract->presidio_analyzer)
  Downloading requests_file-2.1.0-py2.py3-none-any.whl.metadata (1.7 kB)
Downloading presidio_analyzer-2.2.355-py3-none-any.whl (109 kB)

In [27]:
# Standard library
import json
from pprint import pprint
from typing import List

# Presidio analyzer: engine, recognizer base classes, and result types
from presidio_analyzer import (
    AnalyzerEngine,
    PatternRecognizer,
    EntityRecognizer,
    Pattern,
    RecognizerResult,
)
from presidio_analyzer.recognizer_registry import RecognizerRegistry
from presidio_analyzer.nlp_engine import NlpEngine, SpacyNlpEngine, NlpArtifacts
from presidio_analyzer.context_aware_enhancers import LemmaContextAwareEnhancer

# Presidio anonymizer
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig

Below is a simple PII Recognizer:

In [11]:
#analyze text for personally identifiable entities
text_to_anonymize = "My name is Zhang Liu. His name is Aung Khant Than. And his phone number is 212-555-5555"

# Set up the engine, loads the NLP module (spaCy model by default) and other PII recognizers
analyzer = AnalyzerEngine()

# Call analyzer to get results
analyzer_results = analyzer.analyze(text=text_to_anonymize,
                                    entities=["PHONE_NUMBER"],
                                    language='en')

print(analyzer_results)



[type: PHONE_NUMBER, start: 75, end: 87, score: 0.75]
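
Side note: the `start`/`end` values above are character offsets into the text, which is all a redaction step needs. Below is a minimal sketch of that step in plain Python. `Span` and `redact` are hypothetical helpers written for this note, not Presidio classes, and the offsets are copied by hand from the output above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Span:
    """Hypothetical stand-in for one analyzer result (not a Presidio class)."""
    entity_type: str
    start: int
    end: int


def redact(text: str, spans: List[Span]) -> str:
    """Replace each span with a <ENTITY_TYPE> placeholder."""
    # Work from the end of the string so earlier offsets stay valid.
    for span in sorted(spans, key=lambda s: s.start, reverse=True):
        text = text[:span.start] + f"<{span.entity_type}>" + text[span.end:]
    return text


text_to_anonymize = "My name is Zhang Liu. His name is Aung Khant Than. And his phone number is 212-555-5555"
# Offsets copied from the analyzer output above.
redacted = redact(text_to_anonymize, [Span("PHONE_NUMBER", 75, 87)])
print(redacted)
```

In Presidio itself this step belongs to the anonymizer: `AnonymizerEngine().anonymize(text=text_to_anonymize, analyzer_results=analyzer_results)` takes the analyzer results directly, so the helper above is only for understanding the offsets.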


In [34]:
#creating simple customized recognizers

#https://microsoft.github.io/presidio/samples/python/customizing_presidio_analyzer/

titles_recognizer = PatternRecognizer(supported_entity="TITLE",
                                      deny_list=["Mr.","Mrs.","Miss"])

pronoun_recognizer = PatternRecognizer(supported_entity="PRONOUN",
                                       deny_list=["he", "He", "his", "His", "she", "She", "hers", "Hers"])


#add the recognizer directly to the existing registry
analyzer.registry.add_recognizer(titles_recognizer)
analyzer.registry.add_recognizer(pronoun_recognizer)

# Run every recognizer (built-in plus the two custom ones).
# Passing entities=["TITLE", "PRONOUN"] instead would restrict the results to those two types.
analyzer_results = analyzer.analyze(text=text_to_anonymize, language="en")
print(f"entity recognition:\n{analyzer_results}")
print("\n")
print("Identified these PII entities:")
for each in analyzer_results:
    print(f"- {text_to_anonymize[each.start:each.end]} as {each.entity_type}")

entity recognition:
[type: PRONOUN, start: 22, end: 25, score: 1.0, type: PRONOUN, start: 55, end: 58, score: 1.0, type: PERSON, start: 11, end: 20, score: 0.85, type: PERSON, start: 34, end: 44, score: 0.85, type: PHONE_NUMBER, start: 75, end: 87, score: 0.75]


Identified these PII entities:
- His as PRONOUN
- his as PRONOUN
- Zhang Liu as PERSON
- Aung Khant as PERSON
- 212-555-5555 as PHONE_NUMBER
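
Side note: a deny list boils down to whole-word matching against a fixed set of strings. The sketch below shows the idea with the standard `re` module. `deny_list_matches` is a hypothetical helper, and the regex is my own guess at a reasonable pattern, not the one Presidio builds internally.

```python
import re
from typing import List, Tuple


def deny_list_matches(text: str, deny_list: List[str]) -> List[Tuple[int, int, str]]:
    """Return (start, end, matched_text) for every whole-word deny-list hit."""
    # Longest terms first so "hers" is tried before "he".
    terms = sorted(deny_list, key=len, reverse=True)
    alternation = "|".join(re.escape(term) for term in terms)
    # (?<!\w) and (?!\w) keep "he" from matching inside words such as "the".
    pattern = re.compile(rf"(?<!\w)(?:{alternation})(?!\w)")
    return [(m.start(), m.end(), m.group()) for m in pattern.finditer(text)]


text_to_anonymize = "My name is Zhang Liu. His name is Aung Khant Than. And his phone number is 212-555-5555"
pronouns = ["he", "He", "his", "His", "she", "She", "hers", "Hers"]
print(deny_list_matches(text_to_anonymize, pronouns))
```

Matching is case-sensitive here, which is why the list carries both "his" and "His". The sample text contains no titles, so a TITLE list would return nothing on it.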


**Practice making customized recognizers**:


---



BUILT-IN SUPPORTED ENTITIES (partial list):


CREDIT_CARD: Recognizes credit card numbers based on known patterns for major card providers (Visa, MasterCard, etc.).

EMAIL_ADDRESS: Identifies email addresses using standard email format patterns.

PHONE_NUMBER: Detects phone numbers, including different formats across countries.

DATE_TIME: Recognizes dates and times in various formats (e.g., MM/DD/YYYY, DD-MM-YYYY).

IBAN_CODE: Recognizes International Bank Account Numbers (IBAN) based on structure and length.

IP_ADDRESS: Detects IP addresses, both IPv4 and IPv6.

NRP (nationality, religious, or political group): Recognizes mentions of a person's nationality or religious or political affiliation.

US_SSN: Recognizes US Social Security Numbers (SSNs) based on structure.

LOCATION: Detects geographical locations (city names, street addresses, etc.).

PERSON: Identifies proper names that refer to individuals.

US_BANK_NUMBER: Recognizes US bank account numbers.

CRYPTO: Detects cryptocurrency wallet addresses (Presidio's documentation lists Bitcoin addresses as the supported type).


In [35]:
#RESOURCE USED: https://microsoft.github.io/presidio/samples/python/customizing_presidio_analyzer/

#making a list of tokens which should be marked as PII if detected
countries_list = [
    "America",
    "United States",
    "Canada",
    "Mexico",
    "China",
    "Philippines",
    "Japan",
    "Myanmar",
    "Spain",
    "India",
]

countries_recognizer = PatternRecognizer(supported_entity="LOCATION", deny_list=countries_list)
text_example = "My name is Zhang Wei, I reside in China. His name is Aung Khant Than from Myanmar. Her name is Himari Sato from Japan."
result = countries_recognizer.analyze(text_example, entities=["LOCATION"])

# Now that the custom recognizer exists, register it with a fresh AnalyzerEngine so analyzer.analyze() can use it:
analyzer = AnalyzerEngine()
analyzer.registry.add_recognizer(countries_recognizer)

print(f"Detecting countries:\n{result}\n")
#now check
print("Identified these PII entities:")
for each in result:
    print(f"- {text_example[each.start:each.end]} as {each.entity_type}")



Detecting countries:
[type: LOCATION, start: 34, end: 39, score: 1.0, type: LOCATION, start: 74, end: 81, score: 1.0, type: LOCATION, start: 112, end: 117, score: 1.0]

Identified these PII entities:
- China as LOCATION
- Myanmar as LOCATION
- Japan as LOCATION




---

Regex-based PII recognition: detecting numbers in text

In [48]:
# Define the regex pattern in a Presidio `Pattern` object
# (raw string, so "\d" is not treated as an invalid Python escape sequence):
numbers_pattern = Pattern(name="numbers_pattern", regex=r"\d+", score=0.5)

# Define the recognizer with one or more patterns
number_recognizer = PatternRecognizer(
    supported_entity="NUMBER", patterns=[numbers_pattern]
)

text2 = "I live in 510 Broad st. where I have 10000 dollars in cash"

numbers_result = number_recognizer.analyze(text=text2, entities=["NUMBER"])
print("Result:")
print(numbers_result)

print("\n")
print("Identified these PII entities:")
for each in numbers_result:
    print(f"- {text2[each.start:each.end]} as {each.entity_type}")

Result:
[type: NUMBER, start: 10, end: 13, score: 0.5, type: NUMBER, start: 37, end: 42, score: 0.5]


Identified these PII entities:
- 510 as NUMBER
- 10000 as NUMBER
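
Side note: the same pattern can be sanity-checked with the standard `re` module before wrapping it in a `Pattern`. A minimal sketch, independent of Presidio:

```python
import re

text2 = "I live in 510 Broad st. where I have 10000 dollars in cash"

# Raw string so the backslash reaches the regex engine unchanged.
digits = re.compile(r"\d+")
spans = [(m.start(), m.end()) for m in digits.finditer(text2)]
print(spans)  # [(10, 13), (37, 42)]

# The pattern only sees digits, so spelled-out numbers are missed entirely.
print(digits.findall("Forty five"))  # []
```

The offsets agree with the recognizer output above. The second check shows the gap this regex leaves for numbers written as words.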




---

Rule-based logic recognizer

* Goal: detect numbers written as words, e.g. the "One" in "Number One".
* Create a new class that subclasses EntityRecognizer, the base recognizer in Presidio. This abstract class requires us to implement the load method and the analyze method.

In [49]:
# Detecting numbers in either numerical or alphabetic (e.g. Forty five) form
class NumbersRecognizer(EntityRecognizer):

    expected_confidence_level = 0.7  # expected confidence level for this recognizer

    def load(self) -> None:
        """No loading is required."""
        pass

    def analyze(
        self, text: str, entities: List[str], nlp_artifacts: NlpArtifacts
    ) -> List[RecognizerResult]:
        """
        Analyze text to find tokens which represent numbers (either 123 or One Two Three).
        """
        results = []

        # iterate over the spaCy tokens, and call `token.like_num`
        for token in nlp_artifacts.tokens:
            if token.like_num:
                result = RecognizerResult(
                    entity_type="NUMBER",
                    start=token.idx,
                    end=token.idx + len(token),
                    score=self.expected_confidence_level,
                )
                results.append(result)
        return results

In [53]:
new_numbers_recognizer = NumbersRecognizer(supported_entities=["NUMBER"])
text3 = "Roberto lives in Five 10 Broad st. and he keeps Ten Thousand in Cash"
analyzer = AnalyzerEngine()
analyzer.registry.add_recognizer(new_numbers_recognizer)

numbers_results2 = analyzer.analyze(text=text3, language="en")
print("\n")
print("Identified these PII entities:")
for each in numbers_results2:
    print(f"- {text3[each.start:each.end]} as {each.entity_type}")
# print_analyzer_results(numbers_results2, text=text3)





Identified these PII entities:
- Roberto as PERSON
- Broad st. as LOCATION
- Five as NUMBER
- 10 as NUMBER
- Ten as NUMBER
- Thousand as NUMBER
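
Side note: "Ten" and "Thousand" come back as two separate NUMBER results because the recognizer works token by token. One possible follow-up, sketched below in plain Python, is to merge results that are separated only by whitespace. `merge_adjacent` is a hypothetical helper, not part of Presidio, and the offsets are worked out by hand from `text3`.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets


def merge_adjacent(text: str, spans: List[Span]) -> List[Span]:
    """Merge spans separated only by whitespace into one span."""
    merged: List[Span] = []
    for start, end in sorted(spans):
        if merged and text[merged[-1][1]:start].isspace():
            # Only whitespace lies between the previous span and this one.
            merged[-1] = (merged[-1][0], end)
        else:
            merged.append((start, end))
    return merged


text3 = "Roberto lives in Five 10 Broad st. and he keeps Ten Thousand in Cash"
# NUMBER offsets for "Five", "10", "Ten", "Thousand", worked out by hand.
number_spans = [(17, 21), (22, 24), (48, 51), (52, 60)]
for start, end in merge_adjacent(text3, number_spans):
    print(text3[start:end])
```

This prints "Five 10" and "Ten Thousand". Whether "Five 10" should count as a single number is a judgment call, so the merge rule would need tuning for real data.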




---
Supporting new languages:
* Attempt: use this method to detect names that are not written in the English (Latin) alphabet

Use: https://microsoft.github.io/presidio/analyzer/languages/


In [55]:
# to download the Spanish medium spaCy model:
!python -m spacy download es_core_news_md

# Configure Presidio to use spaCy as its underlying NLP framework, with NLP models in English and Spanish:
from presidio_analyzer.nlp_engine import NlpEngineProvider

# import spacy
# spacy.cli.download("es_core_news_md")

# Create configuration containing engine name and models
configuration = {
    "nlp_engine_name": "spacy",
    "models": [
        {"lang_code": "es", "model_name": "es_core_news_md"},
        {"lang_code": "en", "model_name": "en_core_web_lg"},
    ],
}

# Create NLP engine based on configuration
provider = NlpEngineProvider(nlp_configuration=configuration)
nlp_engine_with_spanish = provider.create_engine()

# Pass the created NLP engine and supported_languages to the AnalyzerEngine
analyzer = AnalyzerEngine(
    nlp_engine=nlp_engine_with_spanish, supported_languages=["en", "es"]
)

# Analyze in different languages
results_spanish = analyzer.analyze(text="Mi nombre es Morris", language="es")
print("Results from Spanish request:")
print(results_spanish)

results_english = analyzer.analyze(text="My name is Morris", language="en")
print("Results from English request:")
print(results_english)

Collecting es-core-news-md==3.7.0
  Downloading https://github.com/explosion/spacy-models/releases/download/es_core_news_md-3.7.0/es_core_news_md-3.7.0-py3-none-any.whl (42.3 MB)
Installing collected packages: es-core-news-md
Successfully installed es-core-news-md-3.7.0
✔ Download and installation successful
You can now load the package via spacy.load('es_core_news_md')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.




Results from Spanish request:
[type: PERSON, start: 13, end: 19, score: 0.85]
Results from English request:
[type: PERSON, start: 11, end: 17, score: 0.85]
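
Side note for the non-English-alphabet attempt: Spanish still uses Latin letters, so this test does not yet cover names in other scripts. A rough first step could be to check whether a text contains non-Latin letters at all, and use that to decide which `language` to pass to the analyzer. The helper below is a hypothetical heuristic built on the standard `unicodedata` module, not anything Presidio provides.

```python
import unicodedata


def has_non_latin_letters(text: str) -> bool:
    """True if any alphabetic character is outside the Latin script."""
    for ch in text:
        # Unicode names of Latin letters start with "LATIN", accented ones included.
        if ch.isalpha() and not unicodedata.name(ch, "").startswith("LATIN"):
            return True
    return False


print(has_non_latin_letters("Mi nombre es José"))  # False
print(has_non_latin_letters("我的名字是张伟"))        # True
```

This only flags the script. Recognizing the names themselves would still need a spaCy model for that language, configured the same way as the Spanish one above.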
