# NLP for Document De-Identification Using Python

## Introduction
This notebook demonstrates how to use **Natural Language Processing (NLP)** to detect and anonymize Personally Identifiable Information (PII) in text documents. PII includes names, addresses, phone numbers, and email addresses, which must be protected for privacy and to comply with regulations such as GDPR and HIPAA.

### Objectives
- Create a synthetic feedback document containing PII.
- Use `spaCy` and `Presidio` to identify and anonymize PII.
- Understand how to visualize and evaluate PII detection using NLP tools.

By the end of this exercise, you will have learned practical methods for automating privacy compliance in datasets.

## Sections Overview
The notebook starts with a setup cell that installs the required libraries, followed by these sections:
1. Create a Synthetic Feedback Document
2. Detect PII Entities Using spaCy
3. Mask Detected PII
4. Enhanced Detection and Anonymization Using Presidio
5. Visualize Detected Entities

It closes with a short set of quiz questions.

In [1]:
# Install necessary libraries and the spaCy model loaded in Section 2
!pip install presidio-analyzer presidio-anonymizer -q
!python -m spacy download en_core_web_sm



## Section 1: Create a Synthetic Feedback Document
### Code Explanation
This section creates a synthetic text document containing PII such as names, addresses, and phone numbers. The document is saved locally as `feedback_data.txt`.

In [2]:
# Import required libraries
import spacy

# Step 1: Generate a synthetic document
fake_document = """
Dear John Doe,

We appreciate your feedback on our services. Your experience at 123 Elm Street, Springfield, was invaluable.
If you have further concerns, please contact us at 555-1234 or email john.doe@example.com.
Your ticket number is #12345.

Best regards,
Customer Service Team
"""

# Save the fake document to a text file
with open("feedback_data.txt", "w") as file:
    file.write(fake_document)

print("Synthetic document created and saved as feedback_data.txt.")

Synthetic document created and saved as feedback_data.txt.


## Section 2: Detect PII Entities Using spaCy
### Code Explanation
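This section uses the `spaCy` NLP library to run named entity recognition (NER) over the text. The small English model recognizes general-purpose entity types such as `PERSON`, `GPE`, and `LOC`. It has no dedicated label for phone numbers or email addresses, so in the output below the phone number is tagged `DATE`, the ticket number is tagged `MONEY`, and the email address is not detected at all.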

In [3]:
# Load spaCy NLP model
nlp = spacy.load("en_core_web_sm")

# Read the synthetic document
with open("feedback_data.txt", "r") as file:
    feedback = file.read()

# Process the document using spaCy
doc = nlp(feedback)

# Display the detected entities
print("Detected Entities:")
for ent in doc.ents:
    print(f"Entity: {ent.text} | Label: {ent.label_}")

Detected Entities:
Entity: John Doe | Label: PERSON
Entity: Elm Street | Label: LOC
Entity: Springfield | Label: GPE
Entity: 555-1234 | Label: DATE
Entity: 12345 | Label: MONEY
Entity: Customer Service Team
 | Label: ORG
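Because the NER model misses the email address and mislabels the phone number, structured identifiers like these are often caught with pattern matching instead. The following is a minimal sketch using only Python's `re` module. The `PII_PATTERNS` table and the `find_pattern_pii` helper are hypothetical names introduced for illustration, and the two regular expressions are deliberately simple assumptions that will not cover every real-world format.

```python
import re

# Illustrative patterns only; real phone and email formats vary widely.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE": re.compile(r"\b(?:\d{3}[-.\s])?\d{3}[-.\s]\d{4}\b"),
}


def find_pattern_pii(text):
    """Return (start, end, label, matched_text) tuples sorted by position."""
    hits = []
    for label, pattern in PII_PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((match.start(), match.end(), label, match.group()))
    return sorted(hits)


sample = "please contact us at 555-1234 or email john.doe@example.com."
for start, end, label, value in find_pattern_pii(sample):
    print(f"{label}: {value} ({start}-{end})")
```

The character offsets returned here have the same meaning as spaCy's `ent.start_char` and `ent.end_char`, so pattern hits and NER entities can be merged into one list before masking.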


## Section 3: Mask Detected PII
### Code Explanation
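This section replaces each entity that spaCy detected with a placeholder built from its label (e.g., `[PERSON]` or `[LOC]`). Masking can only cover what the model found: anything it missed, such as the email address in this example, stays in the output unchanged, and mislabeled entities receive misleading placeholders such as `[DATE]` for the phone number.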

In [4]:
# Mask the detected PII
masked_feedback = feedback
for ent in doc.ents:
    masked_feedback = masked_feedback.replace(ent.text, f"[{ent.label_}]")

print("\nMasked Document:")
print(masked_feedback)

# Save the masked document to a file
with open("masked_feedback_data.txt", "w") as file:
    file.write(masked_feedback)

print("Masked document saved as masked_feedback_data.txt.")


Masked Document:

Dear [PERSON],

We appreciate your feedback on our services. Your experience at 123 [LOC], [GPE], was invaluable.
If you have further concerns, please contact us at [DATE] or email john.doe@example.com.
Your ticket number is #[MONEY].

Best regards,
[ORG]
Masked document saved as masked_feedback_data.txt.
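Note that `str.replace` substitutes every occurrence of the entity text, including occurrences the model did not flag. An alternative is to replace by character offset. The sketch below assumes non-overlapping spans given as `(start, end, label)` tuples, which could be built from spaCy with `[(e.start_char, e.end_char, e.label_) for e in doc.ents]`. The `mask_spans` helper and the sample sentence are hypothetical and only illustrate the difference.

```python
def mask_spans(text, spans):
    """Replace each (start, end, label) span with "[label]".

    Spans are applied from the end of the text backwards so that the
    offsets of earlier spans stay valid while the string changes length.
    Assumes the spans do not overlap.
    """
    for start, end, label in sorted(spans, reverse=True):
        text = text[:start] + f"[{label}]" + text[end:]
    return text


text = "John Doe wrote to John Doe Ltd."
# Hypothetical detector output: only the first "John Doe" is a person.
spans = [(0, 8, "PERSON")]

print(text.replace("John Doe", "[PERSON]"))  # masks both occurrences
print(mask_spans(text, spans))               # masks only the detected span
```

Offset-based replacement keeps the masking decision tied to what the detector actually reported, which also makes the result easier to audit.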


## Section 4: Enhanced Detection and Anonymization Using Presidio
### Code Explanation
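This section uses Microsoft's Presidio library, which combines a spaCy NER model with pattern-based recognizers for structured identifiers such as email addresses. `AnalyzerEngine` locates PII spans in the text, and `AnonymizerEngine` replaces them with placeholders. With the default configuration, the analyzer loads spaCy's `en_core_web_lg` model and downloads it on first use if it is not installed.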

In [5]:
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

# Initialize Presidio components
analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

# Analyze and anonymize the text
results = analyzer.analyze(text=feedback, language="en")
anonymized_feedback = anonymizer.anonymize(text=feedback, analyzer_results=results)

print("\nAnonymized Document using Presidio:")
print(anonymized_feedback)



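# Save the anonymized text to a file
with open("anonymized_feedback_data.txt", "w") as file:
    file.write(anonymized_feedback.text)

print("Anonymized document saved as anonymized_feedback_data.txt.")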


Anonymized Document using Presidio:
text: 
Dear <PERSON>,

We <IN_PAN> your feedback on our services. Your <IN_PAN> at 123 <LOCATION>, <LOCATION>, was <IN_PAN>.
If you have further concerns, please contact us at 555-1234 or email <EMAIL_ADDRESS>.
Your ticket number is #12345.

Best regards,
Customer Service Team

items:
[
    {'start': 188, 'end': 203, 'entity_type': 'EMAIL_ADDRESS', 'text': '<EMAIL_ADDRESS>', 'operator': 'replace'},
    {'start': 109, 'end': 117, 'entity_type': 'IN_PAN', 'text': '<IN_PAN>', 'operator': 'replace'},
    {'start': 93, 'end': 103, 'entity_type': 'LOCATION', 'text': '<LOCATION>', 'operator': 'replace'},
    {'start': 81, 'end': 91, 'entity_type': 'LOCATION', 'text': '<LOCATION>', 'operator': 'replace'},
    {'start': 65, 'end': 73, 'entity_type': 'IN_PAN', 'text': '<IN_PAN>', 'operator': 'replace'},
    {'start': 20, 'end': 28, 'entity_type': 'IN_PAN', 'text': '<IN_PAN>', 'operator': 'replace'},
    {'start': 6, 'end': 14, 'entity_type': 'PERSON', 'text':
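
In the output above, the `IN_PAN` recognizer (Indian Permanent Account Number) fires on ordinary ten-letter words such as "appreciate", "experience", and "invaluable", while the phone number is left untouched. One common remedy is to keep only the entity types relevant to the document and to drop low-confidence findings. Presidio's `analyze` method accepts an `entities` list and a `score_threshold` argument that serve a similar purpose. The sketch below shows the idea in plain Python so that it runs without Presidio: `Finding` is a hypothetical stand-in for an analyzer result, `keep_relevant` is an illustrative helper, and the offsets and scores are made-up values, not real analyzer output.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    """Hypothetical stand-in for one analyzer result."""
    entity_type: str
    start: int
    end: int
    score: float


def keep_relevant(findings, allowed_types, min_score=0.5):
    """Keep findings whose type is allowed and whose score is high enough."""
    return [
        f for f in findings
        if f.entity_type in allowed_types and f.score >= min_score
    ]


# Made-up findings for illustration only.
findings = [
    Finding("PERSON", 6, 14, 0.85),
    Finding("IN_PAN", 20, 30, 0.05),
    Finding("EMAIL_ADDRESS", 100, 120, 1.0),
]

allowed = {"PERSON", "LOCATION", "EMAIL_ADDRESS", "PHONE_NUMBER"}
for finding in keep_relevant(findings, allowed):
    print(finding.entity_type, finding.start, finding.end)
```

Restricting the entity list up front is usually preferable to filtering afterwards, since recognizers that are never run cannot produce false positives.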

## Section 5: Visualize Detected Entities
### Code Explanation
This section uses `spaCy`'s `displacy` module to visualize the detected entities in the text.

In [6]:
from spacy import displacy

# Visualize detected entities
displacy.render(doc, style="ent", jupyter=True)
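
Visual inspection can be complemented by a simple quantitative check. The sketch below computes exact-match precision and recall by comparing predicted spans against hand-labelled ones. The `precision_recall` helper, the sample sentence, and both span sets are hypothetical; in practice the predicted spans would come from spaCy or Presidio and the gold spans from manual annotation.

```python
def precision_recall(predicted, gold):
    """Exact-match precision and recall over (start, end, label) spans."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


text = "Call John Doe at 555-1234."
# Hypothetical hand-labelled spans for the sentence above.
gold = {(5, 13, "PERSON"), (17, 25, "PHONE")}
# Hypothetical detector output: the phone number is mislabelled as a date.
predicted = {(5, 13, "PERSON"), (17, 25, "DATE")}

p, r = precision_recall(predicted, gold)
print(f"precision={p:.2f} recall={r:.2f}")
```

For de-identification, recall usually matters most, because every missed span is PII that remains in the released text.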

## Quiz Questions
1. **What are the benefits of automating PII detection using NLP?**
   - A. Saves time compared to manual review.
   - B. Ensures compliance with data protection laws.
   - C. Detects non-sensitive data.
   - D. Both A and B.

2. **Which library provides customizable PII detection and anonymization capabilities?**
   - A. spaCy
   - B. Presidio
   - C. TensorFlow
   - D. OpenCV

3. **Why is it important to mask PII in datasets?**
   - A. To enhance readability.
   - B. To ensure privacy and avoid misuse of sensitive information.
   - C. To improve model training performance.
   - D. To comply with ethical hacking standards.

## Completion Message
Congratulations! You have successfully anonymized a document using NLP techniques. 🎉

**Next Steps:** Explore other datasets and customize your PII detection pipeline!