# **Named Entity Recognition and Sensitive Information Masking**

## Introduction

In this notebook, we will explore how to use Natural Language Processing (NLP) techniques to handle sensitive information in text data. Our focus will be on two main tasks:


1. Named Entity Recognition (NER) for Person Names using SpaCy:
We will use SpaCy, a widely used NLP library, to identify and mask the names of people mentioned in the text. NER automatically detects entities such as person names, organizations, and locations, which is useful for both text analysis and privacy protection.

2. Hiding Emails, Phone Numbers, and URLs using the clean-text Library:
Apart from person names, we often need to anonymize other sensitive information, such as email addresses, phone numbers, and URLs. For this task, we will use the clean-text library, whose `clean` function standardizes text and can remove or mask patterns like emails, URLs, and phone numbers. It also supports further cleaning options (Unicode fixes, ASCII transliteration, number and punctuation handling), so the code in this notebook lists those options as well, even where they are switched off.



## Why is this important?

With the growing concerns around data privacy and security, it is essential to handle sensitive information carefully. Whether working with customer data, social media content, or any other text data, it's crucial to anonymize identifiable information to comply with privacy regulations like GDPR and to maintain trust with users.

## What will you learn?

1. How to perform Named Entity Recognition (NER) using SpaCy to identify and mask person names in text data.
2. How to use the clean-text library to hide emails, URLs, phone numbers, and other sensitive patterns in the text.
3. How to combine these techniques into a single pipeline for anonymizing sensitive information in text datasets.


## Steps to Follow

1. Install the required libraries: We will start by installing SpaCy and clean-text in our Python environment.
2. Load SpaCy's pre-trained model: We will use SpaCy's English model to perform NER on our text data. Because this model is trained on English, a non-English dataset should be translated before it is processed.
3. Write functions for masking names and other sensitive information: We'll create custom functions that use SpaCy for person names and clean-text for emails, URLs, and phone numbers.
4. Apply these functions to sample text data: We'll test our functions on sample text to see how they perform in masking sensitive information.
5. Combine both functions: We'll wrap the two masking steps in a single function for complete anonymization.

Let's get started!



## 1. Install the Required Libraries



In [None]:
!pip install spacy clean-text
!python -m spacy download en_core_web_sm

## 2. Import Libraries and Load SpaCy Model

In [45]:
import pandas as pd
import numpy as np
import spacy
import re
from cleantext import clean

In [46]:
nlp = spacy.load('en_core_web_sm')

## 3. Define Functions for Masking Names, Emails, and URLs

In [62]:
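def mask_person_names(text):
    """
    Function to mask person names using SpaCy NER.
    Replaces recognized person names with [PERSON].
    """
    doc = nlp(text)
    masked_text = text
    # Replace by character offsets, right to left, so earlier offsets stay valid
    # and only the recognized spans change (not every matching substring).
    for ent in reversed(doc.ents):
        if ent.label_ == "PERSON":
            masked_text = masked_text[:ent.start_char] + "[PERSON]" + masked_text[ent.end_char:]
    return masked_text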

def mask_sensitive_entity(text):
    """
    Function to mask emails, URLs, and phone numbers using clean-text library.
    Replaces these patterns with the placeholders <EMAIL>, <URL>, and <PHONE>.
    """
    cleaned_text = clean(
        text,
        no_urls=True,                   # replace all URLs with a special token
        no_emails=True,                 # replace all email addresses with a special token
        no_phone_numbers=True,          # replace all phone numbers with a special token
        fix_unicode=True,               # fix various unicode errors
        to_ascii=True,                  # transliterate to closest ASCII representation
        lower=False,                    # keep original case (True would lowercase the text)
        no_line_breaks=False,           # False: only normalize line breaks; True: strip them fully
        no_numbers=False,               # True would replace all numbers with a special token
        no_digits=False,                # True would replace all digits with a special token
        no_currency_symbols=False,      # True would replace all currency symbols with a special token
        no_punct=False,                 # True would remove punctuation
        replace_with_punct="",          # replacement for punctuation (only used when no_punct=True)
        replace_with_url="<URL>",
        replace_with_email="<EMAIL>",
        replace_with_phone_number="<PHONE>",
        replace_with_number="<NUMBER>",
        replace_with_digit="0",
        replace_with_currency_symbol="<CUR>",
        lang="en"                       # set to 'de' for German special handling
    )

    return cleaned_text
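
To give a feel for the kind of pattern matching behind options such as `no_emails`, `no_urls`, and `no_phone_numbers`, here is a minimal sketch using only Python's `re` module. The function name `mask_with_regex` and the three patterns are assumptions made for this illustration; they are deliberately simplified and are not the patterns clean-text itself uses, which cover many more cases.

```python
import re

# Simplified, illustrative patterns (assumptions for this sketch only);
# they are NOT the patterns used inside clean-text.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
URL_RE = re.compile(r"(?:https?://|www\.)[^\s<>]+[^\s<>.,;:!?)]")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def mask_with_regex(text):
    """Mask emails, URLs, and phone numbers with simple regular expressions."""
    text = EMAIL_RE.sub("<EMAIL>", text)
    text = URL_RE.sub("<URL>", text)
    text = PHONE_RE.sub("<PHONE>", text)
    return text


example = "Mail lara.smith@example.com, see https://www.larasmith.com. Call +1-123-456-7890."
print(mask_with_regex(example))
# Mail <EMAIL>, see <URL>. Call <PHONE>.
```

Hand-written patterns like these tend to miss unusual formats (international numbers, uncommon URL schemes), which is one reason to rely on a maintained library for real data.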

## 4. Test the Functions on Sample Text

In [64]:
# Sample text
sample_text = """
Hello, my name is Lara Smith. You can reach me at lara.smith@example.com or visit my website at https://www.larasmith.com. Feel free to call me at +1-123-456-7890.
My colleague, Jane Smith, also shares similar contact details.
"""

# Apply the person name masking function
masked_names_text = mask_person_names(sample_text)
print("Text after masking person names:\n", masked_names_text)

# Apply the sensitive info masking function
masked_info_text = mask_sensitive_entity(masked_names_text)
print("\nText after masking sensitive information:\n", masked_info_text)


Text after masking person names:
 
Hello, my name is [PERSON]. You can reach me at lara.smith@example.com or visit my website at https://www.larasmith.com. Feel free to call me at +1-123-456-7890. 
My colleague, [PERSON], also shares similar contact details.


Text after masking sensitive information:
 Hello, my name is [PERSON]. You can reach me at <EMAIL> or visit my website at <URL>. Feel free to call me at <PHONE>.
My colleague, [PERSON], also shares similar contact details.
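
One subtlety when masking names is how the replacement is applied. Replacing by value with `str.replace` changes every occurrence of the matched string, including ones that are not names, whereas replacing by character offsets changes only the detected spans. The sketch below illustrates the difference without SpaCy: `mask_spans` is a hypothetical helper, and the hard-coded span stands in for the `(ent.start_char, ent.end_char)` offsets that SpaCy provides for each entity.

```python
def mask_spans(text, spans, placeholder="[PERSON]"):
    """Replace each (start, end) character span in `text` with `placeholder`."""
    # Work right to left so earlier offsets stay valid after each replacement.
    for start, end in sorted(spans, reverse=True):
        text = text[:start] + placeholder + text[end:]
    return text


text = "Will asked: Will you join us?"
# Hypothetical span for the name "Will" at the start of the sentence; with SpaCy
# this would come from (ent.start_char, ent.end_char) of a PERSON entity.
spans = [(0, 4)]

print(mask_spans(text, spans))
# [PERSON] asked: Will you join us?
print(text.replace("Will", "[PERSON]"))
# [PERSON] asked: [PERSON] you join us?
```

Which behavior is preferable depends on the goal: offset-based masking avoids corrupting ordinary words, while value-based masking also hides repeated mentions that the model did not tag.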


## 5. Combine Both Functions for Complete Anonymization


In [65]:
def anonymize_text(text):
    """
    Function to fully anonymize text by first masking person names,
    then masking emails, URLs, and phone numbers.
    """
    text_with_masked_names = mask_person_names(text)
    fully_anonymized_text = mask_sensitive_entity(text_with_masked_names)
    return fully_anonymized_text

# Test the combined function
fully_anonymized_text = anonymize_text(sample_text)
print("\nFully anonymized text:\n", fully_anonymized_text)


Fully anonymized text:
 Hello, my name is [PERSON]. You can reach me at <EMAIL> or visit my website at <URL>. Feel free to call me at <PHONE>.
My colleague, [PERSON], also shares similar contact details.
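
When moving from a single string to a whole dataset, missing or non-text values need care, since the masking functions expect strings. Here is a minimal sketch of a batch helper; `anonymize_many` and the stand-in `demo_masker` are hypothetical names introduced for this example so that it runs on its own. In this notebook you would pass `anonymize_text` as the second argument instead.

```python
def anonymize_many(texts, anonymize):
    """Apply `anonymize` to every string in `texts`; pass other values through unchanged."""
    return [anonymize(t) if isinstance(t, str) else t for t in texts]


def demo_masker(text):
    # Stand-in for anonymize_text (hypothetical), so the sketch has no dependencies.
    return text.replace("Lara Smith", "[PERSON]")


rows = ["Contact Lara Smith today.", None, "No names here."]
print(anonymize_many(rows, demo_masker))
# ['Contact [PERSON] today.', None, 'No names here.']
```

With a pandas DataFrame, the same idea is usually expressed by applying the function to a text column with `Series.apply`, keeping a similar check for missing values.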
