# Overview

This notebook provides an overview of the `anonipy` package and its functionality.

In [1]:
# used to hide warnings
import warnings

warnings.filterwarnings("ignore")

Let us first define the text that we will use to showcase the package's functionality.

In [2]:
original_text = """\
Medical Record

Patient Name: John Doe
Date of Birth: 15-01-1985
Date of Examination: 20-05-2024
Social Security Number: 123-45-6789

Examination Procedure:
John Doe underwent a routine physical examination. The procedure included measuring vital signs (blood pressure, heart rate, temperature), a comprehensive blood panel, and a cardiovascular stress test. The patient also reported occasional headaches and dizziness, prompting a neurological assessment and an MRI scan to rule out any underlying issues.

Medication Prescribed:

Ibuprofen 200 mg: Take one tablet every 6-8 hours as needed for headache and pain relief.
Lisinopril 10 mg: Take one tablet daily to manage high blood pressure.
Next Examination Date:
15-11-2024
"""

## Extract personal information from text

`anonipy` provides entity extraction components that can be used to extract personal information from text.

More can be found in the chapter [Extractors](/documentation/notebooks/01-extractors).

### Language detector

In [3]:
from anonipy.utils.language_detector import LanguageDetector

lang_detector = LanguageDetector()

In [4]:
# identify the language of the original text
language = lang_detector(original_text)
language

('en', 'English')

### Extract personal information

In [5]:
from anonipy.anonymize.extractors import NERExtractor

In [6]:
# define the labels to be extracted and anonymized
labels = [
    {"label": "name", "type": "string"},
    {
        "label": "social security number",
        "type": "custom",
        "regex": "[0-9]{3}-[0-9]{2}-[0-9]{4}",
    },
    {"label": "date of birth", "type": "date"},
    {"label": "date", "type": "date"},
]
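
The `regex` value of the custom label is an ordinary regular expression, so it can be checked on its own with Python's `re` module. The sketch below only tests the pattern itself; how `NERExtractor` applies it internally is not shown in this notebook, and the name `SSN_PATTERN` is introduced here purely for illustration.

```python
import re

# the pattern used for the "social security number" label above
SSN_PATTERN = re.compile(r"[0-9]{3}-[0-9]{2}-[0-9]{4}")

# full match: the whole candidate string must have the NNN-NN-NNNN shape
print(bool(SSN_PATTERN.fullmatch("123-45-6789")))  # True
print(bool(SSN_PATTERN.fullmatch("15-01-1985")))   # False

# the pattern is unanchored, so it is also found inside longer text
print(SSN_PATTERN.search("SSN: 123-45-6789").group())  # 123-45-6789
```

Because the pattern carries no anchors, it would also match inside a longer digit sequence; adding word boundaries (`\b`) is one way to tighten it if that matters for your data.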

In [7]:
# language taken from the language detector
entity_extractor = NERExtractor(labels, lang=language, score_th=0.5)

In [8]:
# extract the entities from the original text
doc, entities = entity_extractor(original_text)

In [9]:
# display the entities in the original text
entity_extractor.display(doc)

The metadata of the extracted entities is available in the `entities` variable:

In [10]:
entities

[Entity(text='John Doe', label='name', start_index=30, end_index=38, score=0.9961156845092773, type='string', regex='.*'),
 Entity(text='15-01-1985', label='date of birth', start_index=54, end_index=64, score=0.9937193393707275, type='date', regex='(\\d{1,2}[\\/\\-\\.]\\d{1,2}[\\/\\-\\.]\\d{2,4})|(\\d{2,4}[\\/\\-\\.]\\d{1,2}[\\/\\-\\.]\\d{1,2})'),
 Entity(text='20-05-2024', label='date', start_index=86, end_index=96, score=0.9867385625839233, type='date', regex='(\\d{1,2}[\\/\\-\\.]\\d{1,2}[\\/\\-\\.]\\d{2,4})|(\\d{2,4}[\\/\\-\\.]\\d{1,2}[\\/\\-\\.]\\d{1,2})'),
 Entity(text='123-45-6789', label='social security number', start_index=121, end_index=132, score=0.9993416666984558, type='custom', regex='[0-9]{3}-[0-9]{2}-[0-9]{4}'),
 Entity(text='John Doe', label='name', start_index=157, end_index=165, score=0.994924783706665, type='string', regex='.*'),
 Entity(text='15-11-2024', label='date', start_index=717, end_index=727, score=0.8285622596740723, type='date', regex='(\\d{1,2}[\\/\\-\\.
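
The `start_index` and `end_index` values in the output above are character offsets into the original text, with `end_index` exclusive, so slicing the text with them returns the entity text. The sketch below checks this on the first lines of the example record; `header` and `spans` are illustrative names, not part of `anonipy`.

```python
# first lines of the example record (same characters as in original_text)
header = (
    "Medical Record\n"
    "\n"
    "Patient Name: John Doe\n"
    "Date of Birth: 15-01-1985\n"
    "Date of Examination: 20-05-2024\n"
    "Social Security Number: 123-45-6789\n"
)

# (start_index, end_index) pairs copied from the entities listed above
spans = [(30, 38), (54, 64), (86, 96), (121, 132)]

for start, end in spans:
    # end_index is exclusive, matching Python slice semantics
    print(repr(header[start:end]))
```

This prints `'John Doe'`, `'15-01-1985'`, `'20-05-2024'`, and `'123-45-6789'`, one per line.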

## Anonymize the original text

`anonipy` provides generators for different types of information that can be used
to generate replacements for the entities found in the original text.

More on generators can be found in the chapter [Generators](/documentation/notebooks/02-generators),
while chapter [Strategies](/documentation/notebooks/03-strategies) provides strategies for anonymizing
the original text.

### Prepare generators for generating replacements

In [11]:
from anonipy.anonymize.generators import (
    LLMLabelGenerator,
    DateGenerator,
    NumberGenerator,
)

In [12]:
# initialize the generators
llm_generator = LLMLabelGenerator()
date_generator = DateGenerator()
number_generator = NumberGenerator()

In [13]:
# prepare the anonymization mapping
def anonymization_mapping(text, entity):
    if entity.type == "string":
        return llm_generator.generate(entity, temperature=0.7)
    if entity.label == "date":
        return date_generator.generate(entity, output_gen="middle_of_the_month")
    if entity.label == "date of birth":
        return date_generator.generate(entity, output_gen="middle_of_the_year")
    if entity.label == "social security number":
        return number_generator.generate(entity)
    return "[REDACTED]"
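
To make the two `output_gen` options more concrete, here is a minimal sketch of what they appear to do, inferred only from the results in this notebook (`20-05-2024` becomes `15-05-2024`, and `15-01-1985` becomes `01-07-1985`). The helper functions and the fixed `DATE_FORMAT` are assumptions for illustration; the actual `DateGenerator` implementation is not shown here and may handle formats and edge cases differently.

```python
from datetime import datetime

DATE_FORMAT = "%d-%m-%Y"  # day-month-year, as in the example record


def middle_of_the_month(date_str: str) -> str:
    """Hypothetical helper: move the date to the 15th of the same month."""
    date = datetime.strptime(date_str, DATE_FORMAT)
    return date.replace(day=15).strftime(DATE_FORMAT)


def middle_of_the_year(date_str: str) -> str:
    """Hypothetical helper: move the date to 1 July of the same year."""
    date = datetime.strptime(date_str, DATE_FORMAT)
    return date.replace(month=7, day=1).strftime(DATE_FORMAT)


print(middle_of_the_month("20-05-2024"))  # 15-05-2024
print(middle_of_the_year("15-01-1985"))   # 01-07-1985
```

Note that a date already on the 15th, such as `15-11-2024`, is left unchanged by the month-level rule.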

### Anonymize the original text

In [14]:
from anonipy.anonymize.strategies import PseudonymizationStrategy

In [15]:
# initialize the pseudonymization strategy
pseudo_strategy = PseudonymizationStrategy(mapping=anonymization_mapping)

In [16]:
# anonymize the original text
anonymized_text, replacements = pseudo_strategy.anonymize(original_text, entities)

The anonymized text is:

In [17]:
print(anonymized_text)

Medical Record

Patient Name: Ethan Lane
Date of Birth: 01-07-1985
Date of Examination: 15-05-2024
Social Security Number: 588-85-9388

Examination Procedure:
Ethan Lane underwent a routine physical examination. The procedure included measuring vital signs (blood pressure, heart rate, temperature), a comprehensive blood panel, and a cardiovascular stress test. The patient also reported occasional headaches and dizziness, prompting a neurological assessment and an MRI scan to rule out any underlying issues.

Medication Prescribed:

Ibuprofen 200 mg: Take one tablet every 6-8 hours as needed for headache and pain relief.
Lisinopril 10 mg: Take one tablet daily to manage high blood pressure.
Next Examination Date:
15-11-2024



And the associated replacements are:

In [18]:
replacements

[{'original_text': '15-11-2024',
  'label': 'date',
  'start_index': 717,
  'end_index': 727,
  'anonymized_text': '15-11-2024'},
 {'original_text': 'John Doe',
  'label': 'name',
  'start_index': 157,
  'end_index': 165,
  'anonymized_text': 'Ethan Lane'},
 {'original_text': '123-45-6789',
  'label': 'social security number',
  'start_index': 121,
  'end_index': 132,
  'anonymized_text': '588-85-9388'},
 {'original_text': '20-05-2024',
  'label': 'date',
  'start_index': 86,
  'end_index': 96,
  'anonymized_text': '15-05-2024'},
 {'original_text': '15-01-1985',
  'label': 'date of birth',
  'start_index': 54,
  'end_index': 64,
  'anonymized_text': '01-07-1985'},
 {'original_text': 'John Doe',
  'label': 'name',
  'start_index': 30,
  'end_index': 38,
  'anonymized_text': 'Ethan Lane'}]
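
The replacements above are listed from the last span to the first. Applying index-based replacements in that order is a common technique: a substitution of a different length then never shifts the offsets of spans that are still to be applied. The sketch below illustrates the idea with a hypothetical `apply_replacements` helper; it is not the library's implementation.

```python
def apply_replacements(text, replacements):
    """Hypothetical helper: substitute each [start_index, end_index) span."""
    # process spans from the end of the text towards the start, so that a
    # replacement of different length never shifts spans still to be applied
    ordered = sorted(replacements, key=lambda r: r["start_index"], reverse=True)
    for item in ordered:
        start, end = item["start_index"], item["end_index"]
        text = text[:start] + item["anonymized_text"] + text[end:]
    return text


text = "Patient Name: John Doe\nJohn Doe underwent an examination."
replacements = [
    {"start_index": 14, "end_index": 22, "anonymized_text": "Ethan Lane"},
    {"start_index": 23, "end_index": 31, "anonymized_text": "Ethan Lane"},
]
print(apply_replacements(text, replacements))
```

Both occurrences of `John Doe` are replaced, even though `Ethan Lane` is two characters longer than the original name.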

### Fixing the anonymized text

If the anonymized text is not suitable, we can fix it with the `anonymize` function from the `anonipy.anonymize` module.
To do this, let us define a new set of replacements.

We can edit existing replacements by changing their `anonymized_text` value, remove the ones that are not suitable,
and add missing ones.

Note that entries in the new set do not require the `original_text` and `label` values;
`start_index`, `end_index`, and `anonymized_text` are sufficient (see the first entry below).

In [19]:
new_replacements = [
    {
        "start_index": 30,
        "end_index": 38,
        "anonymized_text": "Mark Strong",
    },
    {
        "original_text": "20-05-2024",
        "label": "date",
        "start_index": 86,
        "end_index": 96,
        "anonymized_text": "18-05-2024",
    },
    {
        "original_text": "123-45-6789",
        "label": "social security number",
        "start_index": 121,
        "end_index": 132,
        "anonymized_text": "119-88-7014",
    },
    {
        "original_text": "John Doe",
        "label": "name",
        "start_index": 157,
        "end_index": 165,
        "anonymized_text": "Mark Strong",
    },
]
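
Hand-edited replacements are easy to get wrong, so a quick sanity check can help before they are used. The sketch below, built around a hypothetical `validate_replacements` helper that is not part of `anonipy`, verifies that every span lies inside the text and that no two spans overlap.

```python
def validate_replacements(text, replacements):
    """Hypothetical check: spans must lie inside the text and not overlap."""
    previous_end = 0
    for item in sorted(replacements, key=lambda r: r["start_index"]):
        start, end = item["start_index"], item["end_index"]
        if not 0 <= start < end <= len(text):
            raise ValueError(f"span ({start}, {end}) is outside the text")
        if start < previous_end:
            raise ValueError(f"span ({start}, {end}) overlaps an earlier span")
        previous_end = end
    return True


sample_text = "Patient Name: John Doe\nJohn Doe underwent an examination."
sample_replacements = [
    {"start_index": 14, "end_index": 22, "anonymized_text": "Mark Strong"},
    {"start_index": 23, "end_index": 31, "anonymized_text": "Mark Strong"},
]
print(validate_replacements(sample_text, sample_replacements))  # True
```

In the notebook, the same check could be run as `validate_replacements(original_text, new_replacements)`.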

Now, let us anonymize the original text using the new replacements.

In [20]:
from anonipy.anonymize import anonymize

In [21]:
# anonymize the original text using the new replacements
anonymized_text, replacements = anonymize(original_text, new_replacements)

In [22]:
print(anonymized_text)

Medical Record

Patient Name: Mark Strong
Date of Birth: 15-01-1985
Date of Examination: 18-05-2024
Social Security Number: 119-88-7014

Examination Procedure:
Mark Strong underwent a routine physical examination. The procedure included measuring vital signs (blood pressure, heart rate, temperature), a comprehensive blood panel, and a cardiovascular stress test. The patient also reported occasional headaches and dizziness, prompting a neurological assessment and an MRI scan to rule out any underlying issues.

Medication Prescribed:

Ibuprofen 200 mg: Take one tablet every 6-8 hours as needed for headache and pain relief.
Lisinopril 10 mg: Take one tablet daily to manage high blood pressure.
Next Examination Date:
15-11-2024

