# Sensitive Data Detection with the Data Labeler

In this example, we use the Data Labeler component of the Data Profiler to detect sensitive information in both structured and unstructured data. We also show how to train the Data Labeler on a specific dataset with a different list of entities.

First, let's dive into what the Data Labeler is.

## What is the Data Labeler

The Data Labeler is a pipeline designed to make building, training, and predicting with ML models quick and easy. The Data Labeler has three major components: the preprocessor, the model, and the postprocessor.

![Data Labeler flowchart](DL-Flowchart.png "Data Labeler pipeline")

Each component can be switched out individually to suit your needs. As you might expect, the preprocessor takes in raw data and prepares it for the model, the model performs the prediction or training, and the postprocessor takes prediction results and turns them into human-readable results. 
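
To make the three-stage idea concrete, here is a minimal sketch of a preprocessor, model, and postprocessor chained together. Everything in it (`preprocess`, `toy_model`, `postprocess`, `run_pipeline`, and the `DIGIT` label) is a hypothetical stand-in written for illustration; it is not the Data Labeler's implementation, and the "model" is a hand-written rule rather than a trained network.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical stand-ins for the three stages; none of these are dataprofiler classes.
ID_TO_LABEL: Dict[int, str] = {0: "PAD", 1: "UNKNOWN", 2: "DIGIT"}


def preprocess(samples: List[str], max_length: int = 10) -> List[List[int]]:
    """Encode each string as a fixed-length list of character codes (0 = padding)."""
    batch = []
    for text in samples:
        codes = [ord(ch) for ch in text[:max_length]]
        batch.append(codes + [0] * (max_length - len(codes)))
    return batch


def toy_model(batch: List[List[int]]) -> List[List[int]]:
    """Predict one label id per character: digits get 2, other characters 1, padding 0."""
    def label_id(code: int) -> int:
        if code == 0:
            return 0
        return 2 if chr(code).isdigit() else 1
    return [[label_id(code) for code in row] for row in batch]


def postprocess(pred_ids: List[List[int]]) -> List[str]:
    """Collapse per-character ids into one readable label per sample by majority vote."""
    labels = []
    for row in pred_ids:
        votes = Counter(i for i in row if i != 0)
        labels.append(ID_TO_LABEL[votes.most_common(1)[0][0]] if votes else "None")
    return labels


def run_pipeline(samples: List[str]) -> List[str]:
    """Chain the three stages: raw text -> encoded batch -> label ids -> readable labels."""
    return postprocess(toy_model(preprocess(samples)))


print(run_pipeline(["12345", "hello", ""]))
```

Because each stage only depends on the output shape of the previous one, any single stage can be replaced without touching the other two, which is the property the Data Labeler pipeline is built around.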

Now let's run some examples. Start by importing all the requirements.

In [1]:
import os
import sys
import json
import pandas as pd
sys.path.insert(0, '..')
import dataprofiler as dp

## Structured Data Prediction

We'll use the AWS honeypot dataset in the test folder for this example. First, look at the data using the Data Reader class of the Data Profiler.

In [2]:
data = dp.Data("../dataprofiler/tests/data/csv/aws_honeypot_marx_geo.csv")
df_data = data.data
df_data.head()

Unnamed: 0,datetime,host,src,proto,type,srcport,destport,srcip,locale,localeabbr,postalcode,latitude,longitude,owner,comment,int_col
0,3/3/13 21:53,groucho-oregon,1032051418.0,TCP,,6000,1433,61.131.218.218,Jiangxi Sheng,36,,28.55,115.9333,,He my polite be object oh change. Consider no ...,9464.0
1,3/3/13 21:57,groucho-oregon,1347834426.0,UDP,,5270,5060,80.86.82.58,,,,51.0,9.0,,,3731.0
2,3/3/13 21:58,groucho-oregon,2947856490.0,TCP,,2489,1080,175.180.184.106,Taipei,,,25.0392,121.525,,Of on affixed civilly moments promise explain ...,3963.0
3,3/3/13 21:58,,,UDP,,43235,1900,,Oregon,OR,97124.0,45.5848,-122.9117,,,1422.0
4,3/3/13 21:58,groucho-singapore,3587648279.0,TCP,,56577,80,213.215.43.23,,,,48.86,2.35,,Affronting everything discretion men now own d...,9271.0


We can directly predict the labels of a structured dataset on the cell level.

In [6]:
data_labeler = dp.DataLabeler(labeler_type='structured')

# print out the labels and label mapping
print(data_labeler.labels) 
print("\n")
print(data_labeler.label_mapping)
print("\n")

# make predictions and get labels for each cell going row by row
# predict options are model dependent and the default model can show prediction confidences
predictions = data_labeler.predict(data, predict_options={"show_confidences": True})

# display prediction results
print(predictions['pred'])
print("\n")

# display confidence results
print(predictions['conf'])

['PAD', 'UNKNOWN', 'ADDRESS', 'BAN', 'CREDIT_CARD', 'DATE', 'TIME', 'DATETIME', 'DRIVERS_LICENSE', 'EMAIL_ADDRESS', 'UUID', 'HASH_OR_KEY', 'IPV4', 'IPV6', 'MAC_ADDRESS', 'PERSON', 'PHONE_NUMBER', 'SSN', 'URL', 'US_STATE', 'INTEGER', 'FLOAT', 'QUANTITY', 'ORDINAL']


{'PAD': 0, 'UNKNOWN': 1, 'ADDRESS': 2, 'BAN': 3, 'CREDIT_CARD': 4, 'DATE': 5, 'TIME': 6, 'DATETIME': 7, 'DRIVERS_LICENSE': 8, 'EMAIL_ADDRESS': 9, 'UUID': 10, 'HASH_OR_KEY': 11, 'IPV4': 12, 'IPV6': 13, 'MAC_ADDRESS': 14, 'PERSON': 15, 'PHONE_NUMBER': 16, 'SSN': 17, 'URL': 18, 'US_STATE': 19, 'INTEGER': 20, 'FLOAT': 21, 'QUANTITY': 22, 'ORDINAL': 23}


['DATETIME' 'UNKNOWN' 'PHONE_NUMBER' ... 'None' 'None' 'FLOAT']


[[0. 0. 0. ... 0. 0. 0.]
 [0. 1. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 1. 0. 0.]]
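
The confidence matrix and the label mapping printed above can be read together: assuming each row holds one score per label id (which matches the printed shapes, but is an assumption here), the predicted label is the index of the largest score. The sketch below uses a shortened, hypothetical mapping and made-up scores, and treats an all-zero row as "no label"; `top_label` is an illustrative helper, not a library function.

```python
from typing import List, Optional, Tuple

# Shortened, hypothetical mapping; the real one is printed by data_labeler.label_mapping.
label_mapping = {"PAD": 0, "UNKNOWN": 1, "IPV4": 2, "FLOAT": 3}
id_to_label = {idx: label for label, idx in label_mapping.items()}


def top_label(conf_row: List[float]) -> Tuple[Optional[str], float]:
    """Return (label, confidence) for the highest score, or (None, 0.0) for an all-zero row."""
    best = max(range(len(conf_row)), key=conf_row.__getitem__)
    if conf_row[best] == 0:
        return None, 0.0
    return id_to_label[best], conf_row[best]


# Made-up confidence rows, one per cell.
confidences = [
    [0.0, 1.0, 0.0, 0.0],
    [0.05, 0.1, 0.8, 0.05],
    [0.0, 0.0, 0.0, 0.0],
]

for row in confidences:
    print(top_label(row))
```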


The Data Profiler uses the Data Labeler to make column-by-column predictions. The dataset contains 16 columns, each of which may carry a different label. Next, we use the Data Profiler to predict a label for each column of this tabular dataset. To run only the labeling functionality, the other Data Profiler options are disabled.

In [7]:
# set options to only run the data labeler
profile_options = dp.ProfilerOptions()
profile_options.set({"text.is_enabled": False, 
                     "int.is_enabled": False, 
                     "float.is_enabled": False, 
                     "order.is_enabled": False, 
                     "category.is_enabled": False, 
                     "datetime.is_enabled": False,})

profile = dp.Profiler(data, profiler_options=profile_options)

# get the prediction from the data profiler
def get_structured_results(results):
    columns = []
    predictions = []
    for col in results['data_stats']:
        columns.append(col)
        predictions.append(results['data_stats'][col]['data_label'])

    df_results = pd.DataFrame({'Column': columns, 'Prediction': predictions})
    return df_results

results = profile.report()    
print(get_structured_results(results))


        Column        Prediction
0     datetime    DATETIME|FLOAT
1         host           UNKNOWN
2          src  BAN|PHONE_NUMBER
3        proto           UNKNOWN
4         type           INTEGER
5      srcport           INTEGER
6     destport           INTEGER
7        srcip              IPV4
8       locale           UNKNOWN
9   localeabbr   INTEGER|UNKNOWN
10  postalcode           INTEGER
11    latitude             FLOAT
12   longitude             FLOAT
13       owner              None
14     comment           UNKNOWN
15     int_col             FLOAT





The results show that the Data Profiler can detect sensitive information such as datetimes, IPv4 addresses, and phone numbers.
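
A report like the one above can be turned into a short list of columns to review. The sketch below is one possible way to do that, under two assumptions: the set of labels treated as sensitive (`SENSITIVE_LABELS`) is a policy choice made up for this example, and multi-label predictions are `|`-separated strings as in the printed table. `flag_sensitive` is a hypothetical helper, not part of the library.

```python
from typing import Dict, List, Optional

# Assumption: which labels count as sensitive is a policy choice, not defined by the library.
SENSITIVE_LABELS = {
    "DATETIME", "IPV4", "PHONE_NUMBER", "BAN",
    "SSN", "CREDIT_CARD", "EMAIL_ADDRESS", "PERSON",
}


def flag_sensitive(column_predictions: Dict[str, Optional[str]]) -> Dict[str, List[str]]:
    """Map each column to the sensitive labels found in its '|'-separated prediction."""
    flagged = {}
    for column, prediction in column_predictions.items():
        if not prediction or prediction == "None":
            continue  # column had no usable prediction
        hits = [label for label in prediction.split("|") if label in SENSITIVE_LABELS]
        if hits:
            flagged[column] = hits
    return flagged


# A few rows copied from the table above.
predictions = {
    "datetime": "DATETIME|FLOAT",
    "host": "UNKNOWN",
    "src": "BAN|PHONE_NUMBER",
    "srcip": "IPV4",
    "owner": None,
}

print(flag_sensitive(predictions))
```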

## Unstructured Data Prediction

Besides structured data, the Data Profiler can also detect sensitive information in unstructured text. For this demo we use a sample spam email from the Enron email dataset. As above, we start by looking at the content of the email sample.

In [8]:
# load data (backslashes in the X-Folder path are escaped so Python reads them literally)
data = "Message-ID: <11111111.1111111111111.JavaMail.evans@thyme>\n" + \
        "Date: Fri, 10 Aug 2005 11:31:37 -0700 (PDT)\n" + \
        "From: w..smith@company.com\n" + \
        "To: john.smith@company.com\n" + \
        "Subject: RE: ABC\n" + \
        "Mime-Version: 1.0\n" + \
        "Content-Type: text/plain; charset=us-ascii\n" + \
        "Content-Transfer-Encoding: 7bit\n" + \
        "X-From: Smith, Mary W. </O=ENRON/OU=NA/CN=RECIPIENTS/CN=SSMITH>\n" + \
        "X-To: Smith, John </O=ENRON/OU=NA/CN=RECIPIENTS/CN=JSMITH>\n" + \
        "X-cc: \n" + \
        "X-bcc: \n" + \
        "X-Folder: \\SSMITH (Non-Privileged)\\Sent Items\n" + \
        "X-Origin: Smith-S\n" + \
        "X-FileName: SSMITH (Non-Privileged).pst\n\n" + \
        "All I ever saw was the e-mail from the office.\n\n" + \
        "Mary\n\n" + \
        "-----Original Message-----\n" + \
        "From:   Smith, John  \n" + \
        "Sent:   Friday, August 10, 2005 13:07 PM\n" + \
        "To:     Smith, Mary W.\n" + \
        "Subject:        ABC\n\n" + \
        "Have you heard any more regarding the ABC sale? I guess that means that " + \
        "it's no big deal here, but you think they would have send something.\n\n\n" + \
        "John Smith\n" + \
        "123-456-7890\n"

# convert string data to list to feed into the data labeler
data = [data]

By default, the Data Profiler predicts the results at the character level for unstructured text.

In [10]:
data_labeler = dp.DataLabeler(labeler_type='unstructured')

# make predictions and get labels per character
predictions = data_labeler.predict(data)

# display results
print(predictions['pred'])

[array([ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,
       21., 21., 21., 21., 21., 21., 21., 21., 21., 21., 21., 21., 21.,
       21., 22., 18., 18.,  9.,  9.,  9.,  9.,  9.,  9.,  9.,  9.,  9.,
        9.,  9.,  9.,  9.,  9.,  9.,  9.,  9.,  9.,  9.,  9.,  9.,  9.,
        9.,  9.,  9.,  9.,  9.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,
        5.,  5.,  5.,  5., 20., 20.,  1.,  7.,  7.,  7.,  7.,  7.,  7.,
        7.,  7.,  7.,  7.,  7.,  7.,  7.,  7.,  7.,  7.,  7.,  7.,  7.,
        7.,  7.,  7.,  6.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,
        1.,  1.,  1.,  1.,  9.,  9.,  9.,  9.,  9.,  9.,  9.,  9.,  9.,
        9.,  9.,  9.,  9.,  9.,  9.,  9.,  9.,  9.,  9.,  9.,  1.,  1.,
        1.,  1.,  1.,  9.,  9.,  9.,  9.,  9.,  9.,  9.,  9.,  9.,  9.,
        9.,  9.,  9.,  9.,  9.,  9.,  9.,  9.,  9.,  9.,  9.,  9.,  1.,
        1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,
        1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  
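
Per-character label ids like those above are easier to read once consecutive identical ids are merged into spans. The following is a minimal sketch of that grouping, using ids from the label mapping printed earlier (1 = UNKNOWN, 5 = DATE, 9 = EMAIL_ADDRESS) on a short made-up sequence. `char_labels_to_spans` is a hypothetical helper and is simpler than the library's own word-level postprocessing.

```python
from itertools import groupby
from typing import Dict, Iterable, List, Tuple

# Subset of the label mapping printed earlier in this example.
id_to_label: Dict[int, str] = {0: "PAD", 1: "UNKNOWN", 5: "DATE", 9: "EMAIL_ADDRESS"}


def char_labels_to_spans(
    char_ids: Iterable[float],
    ignore: Tuple[str, ...] = ("PAD", "UNKNOWN"),
) -> List[Tuple[int, int, str]]:
    """Merge runs of identical per-character ids into (start, end, label) spans."""
    spans = []
    position = 0
    for label_id, run in groupby(char_ids):
        length = len(list(run))
        label = id_to_label[int(label_id)]
        if label not in ignore:
            spans.append((position, position + length, label))
        position += length
    return spans


# Made-up character-level prediction for a 10-character string.
char_ids = [1.0, 1.0, 1.0, 9.0, 9.0, 9.0, 9.0, 1.0, 5.0, 5.0]
print(char_labels_to_spans(char_ids))
```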

In addition to character-level results, the Data Profiler can return word-level results in the standard NER (Named Entity Recognition) format of (start position, end position, label), as used by tools such as spaCy.

In [11]:
# convert prediction to word format and ner format
# Set the output to the NER format (start position, end position, label)
data_labeler.set_params(
    { 'postprocessor': { 'output_format':'ner', 'use_word_level_argmax':True } } 
)

# make predictions and get labels in NER format
predictions = data_labeler.predict(data)

# display results
print('\n')
print('=======================Prediction======================\n')
for pred in predictions['pred'][0]:
    print('{}: {}'.format(data[0][pred[0]: pred[1]], pred[2]))
    print('--------------------------------------------------------')




<11111111: FLOAT
--------------------------------------------------------
JavaMail.evans@thyme>: EMAIL_ADDRESS
--------------------------------------------------------
Aug 2005 11:31:37 -0700: DATETIME
--------------------------------------------------------
smith@company.com: EMAIL_ADDRESS
--------------------------------------------------------
john.smith@company.com: EMAIL_ADDRESS
--------------------------------------------------------
text/plain: ORDINAL
--------------------------------------------------------
7bit: ORDINAL
--------------------------------------------------------
Smith, Mary W: PERSON
--------------------------------------------------------
</O=ENRON/OU=NA/CN=RECIPIENTS/CN=SSMITH>: HASH_OR_KEY
--------------------------------------------------------
Smith, John: PERSON
--------------------------------------------------------
</O=ENRON/OU=NA/CN=RECIPIENTS/CN=JSMITH>: HASH_OR_KEY
--------------------------------------------------------
Smith, John: PERSON
-------

Here, the Data Profiler is able to identify sensitive information such as datetime, email address, person names, and phone number in an email sample. 
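
Once entities are available as (start, end, label) tuples, a common next step is redaction. The sketch below shows one way to do it; the spans are written by hand for a short sample string rather than taken from real model output, and `redact` is a hypothetical helper. Spans are applied from right to left so that earlier offsets stay valid after each replacement.

```python
from typing import List, Tuple


def redact(text: str, spans: List[Tuple[int, int, str]]) -> str:
    """Replace each (start, end, label) span with a [LABEL] placeholder."""
    # Work from the end of the string so replacements do not shift pending offsets.
    for start, end, label in sorted(spans, reverse=True):
        text = text[:start] + "[" + label + "]" + text[end:]
    return text


sample = "From: john.smith@company.com\nJohn Smith\n123-456-7890\n"

# Hand-written spans in NER format; assumed non-overlapping.
spans = [
    (6, 28, "EMAIL_ADDRESS"),
    (29, 39, "PERSON"),
    (40, 52, "PHONE_NUMBER"),
]

print(redact(sample, spans))
```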

## Train the Data Labeler from Scratch

The Data Labeler can be trained from scratch with a new list of labels. Below, we show an example of training the Data Labeler on a dataset with labels given as the columns of that dataset.

In [17]:
data = dp.Data("../dataprofiler/tests/data/csv/aws_honeypot_marx_geo.csv")

# the column 'comment' is changed to UNKNOWN, as the data labeler requires at least one column with label UNKNOWN
df = data.data.rename({'comment': 'UNKNOWN'}, axis=1)

# shuffle the rows, then split them into training and test sets
split_ratio = 0.2
df = df.sample(frac=1).reset_index(drop=True)
data_train = df[:int((1 - split_ratio) * len(df))]
data_test = df[int((1 - split_ratio) * len(df)):]

# train a new data labeler with column names as labels
if not os.path.exists('data_labeler_saved'):
    os.makedirs('data_labeler_saved')

data_labeler = dp.train_structured_labeler(
    data=data_train,
    save_dirpath="data_labeler_saved",
    epochs=4
)

EPOCH 0, validation_batch_id 1(After removing non-entity tokens)
               precision    recall  f1-score   support

    datetime       0.16      0.21      0.19      5331
        host       0.00      0.00      0.00      6422
         src       0.00      0.00      0.00      4773
       proto       0.00      0.00      0.00      1451
        type       0.00      0.00      0.00        29
     srcport       0.00      0.00      0.00      1807
    destport       0.00      0.00      0.00      1641
       srcip       0.00      0.00      0.00      6449
      locale       0.00      0.00      0.00      3974
  localeabbr       0.00      0.00      0.00       695
  postalcode       0.00      0.00      0.00       436
    latitude       0.00      0.00      0.00      2803
   longitude       0.08      0.27      0.13      3246
       owner       0.00      0.00      0.00         0
     int_col       0.00      0.00      0.00      3599

   micro avg       0.03      0.05      0.04     42656
   macro avg  
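
For readers unfamiliar with the columns in the report above, the sketch below computes precision, recall, F1, and support per label from two lists of labels. It is a simplified illustration with made-up labels: the training report is computed over tokens after removing non-entity ones, which this sketch does not attempt to reproduce. `per_label_metrics` is a hypothetical helper. A label with no predictions or no true examples gets 0.0, which is why a row such as `owner` (support 0) shows all zeros.

```python
from collections import Counter
from typing import Dict, List


def per_label_metrics(y_true: List[str], y_pred: List[str]) -> Dict[str, Dict[str, float]]:
    """Compute precision, recall, F1, and support for every label seen in either list."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, guess in zip(y_true, y_pred):
        if truth == guess:
            tp[truth] += 1
        else:
            fp[guess] += 1  # predicted this label, but it was wrong
            fn[truth] += 1  # missed an example of the true label

    report = {}
    for label in sorted(set(y_true) | set(y_pred)):
        predicted = tp[label] + fp[label]
        support = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / support if support else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        report[label] = {
            "precision": precision,
            "recall": recall,
            "f1-score": f1,
            "support": support,
        }
    return report


# Made-up labels for five samples.
y_true = ["host", "host", "srcip", "srcip", "srcip"]
y_pred = ["host", "srcip", "srcip", "srcip", "host"]

for label, row in per_label_metrics(y_true, y_pred).items():
    print("{:8s} {:.2f} {:.2f} {:.2f} {:d}".format(
        label, row["precision"], row["recall"], row["f1-score"], row["support"]))
```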

The trained Data Labeler is then used by the Data Profiler to predict labels on the held-out test set.

In [18]:
# predict with the data labeler object
profile_options.set({'data_labeler.data_labeler_object': data_labeler})
profile = dp.Profiler(data_test, profiler_options=profile_options)

# get the prediction from the data profiler
results = profile.report()
print(get_structured_results(results))


        Column           Prediction
0     datetime       host|longitude
1         host                 host
2          src                 host
3        proto                 host
4         type                  src
5      srcport               locale
6     destport               locale
7        srcip                srcip
8       locale                 host
9   localeabbr         UNKNOWN|host
10  postalcode               locale
11    latitude     locale|longitude
12   longitude  could not determine
13       owner                 None
14     UNKNOWN                 host
15     int_col             destport





Another way to use the trained Data Labeler is through the directory path of the saved labeler.

In [19]:
# predict with the data labeler loaded from path
profile_options.set({'data_labeler.data_labeler_dirpath': 'data_labeler_saved'})
profile = dp.Profiler(data_test, profiler_options=profile_options)

# get the prediction from the data profiler
results = profile.report()
print(get_structured_results(results))


        Column           Prediction
0     datetime       host|longitude
1         host                 host
2          src                 host
3        proto                 host
4         type                  src
5      srcport               locale
6     destport               locale
7        srcip                srcip
8       locale                 host
9   localeabbr         UNKNOWN|host
10  postalcode               locale
11    latitude     locale|longitude
12   longitude  could not determine
13       owner                 None
14     UNKNOWN                 host
15     int_col             destport





## Transfer Learning a Labeler

Instead of training a model from scratch, we can also use transfer learning to improve an existing model and/or extend its labels.

In [21]:
data = dp.Data("../dataprofiler/tests/data/csv/diamonds.csv")
df_data = data.data[:1000]
df_data.head()

# prep data
df_data = df_data.reset_index(drop=True).melt()
df_data.columns = [1, 0]  # labels=1, values=0 in that order
df_data = df_data.astype(str)
new_labels = df_data[1].unique().tolist()

# load structured Data Labeler w/ trainable set to True
data_labeler = dp.DataLabeler(labeler_type='structured', trainable=True, dirpath="data_labeler_saved")

# Reconstruct the model to add each new label
for label in new_labels:
    data_labeler.add_label(label)

# this will use transfer learning to retrain the data labeler on your new
# dataset and labels.
# Setting labels with a list of labels or label mapping will overwrite the existing labels with new ones
# Setting the reset_weights parameter to false allows transfer learning to occur
model_results = data_labeler.fit(x=df_data[0], y=df_data[1], validation_split=0.2, 
                                 epochs=2, labels=None, reset_weights=False)

EPOCH 0, validation_batch_id 1(After removing non-entity tokens)
               precision    recall  f1-score   support

    datetime       0.00      0.00      0.00         0
        host       0.00      0.00      0.00         0
         src       0.00      0.00      0.00         0
       proto       0.00      0.00      0.00         0
        type       0.00      0.00      0.00         0
     srcport       0.00      0.00      0.00         0
    destport       0.00      0.00      0.00         0
       srcip       0.00      0.00      0.00         0
      locale       0.00      0.00      0.00         0
  localeabbr       0.00      0.00      0.00         0
  postalcode       0.00      0.00      0.00         0
    latitude       0.00      0.00      0.00         0
   longitude       0.00      0.00      0.00         0
       owner       0.00      0.00      0.00         0
     int_col       0.00      0.00      0.00         0
       carat       0.00      0.00      0.00       753
         cut   

Let's display the training results of the last epoch:

In [22]:
print("{:14s}  Precision  Recall  F1-score  Support".format(""))
for item in model_results[-1][2]:
    print("{:14s}  {:4.3f}      {:4.3f}   {:4.3f}     {:7.0f}".format(item,
                                                                      model_results[-1][2][item]["precision"],
                                                                      model_results[-1][2][item]["recall"],
                                                                      model_results[-1][2][item]["f1-score"],
                                                                      model_results[-1][2][item]["support"]))

                Precision  Recall  F1-score  Support
datetime        0.000      0.000   0.000           0
host            0.000      0.000   0.000           0
src             0.000      0.000   0.000           0
proto           0.000      0.000   0.000           0
type            0.000      0.000   0.000           0
srcport         0.000      0.000   0.000           0
destport        0.000      0.000   0.000           0
srcip           0.000      0.000   0.000           0
locale          0.000      0.000   0.000           0
localeabbr      0.000      0.000   0.000           0
postalcode      0.000      0.000   0.000           0
latitude        0.000      0.000   0.000           0
longitude       0.000      0.000   0.000           0
owner           0.000      0.000   0.000           0
int_col         0.000      0.000   0.000           0
carat           0.000      0.000   0.000         753
cut             0.000      0.000   0.000        1246
color           0.000      0.000   0.000      

The model is now trained to detect additional labels!
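
The data preparation step in the transfer learning cell relies on `melt()` to reshape a wide table into one (label, value) pair per cell, with the column name serving as the label. The pure-Python sketch below mimics that reshaping on two made-up rows so the resulting training pairs are easy to see. `melt_rows` is a hypothetical helper, not a pandas or dataprofiler function.

```python
from typing import Any, Dict, List, Tuple


def melt_rows(rows: List[Dict[str, Any]]) -> List[Tuple[str, str]]:
    """Reshape wide rows into (label, value) string pairs, column by column."""
    if not rows:
        return []
    pairs = []
    for column in rows[0].keys():
        for row in rows:
            pairs.append((column, str(row[column])))
    return pairs


# Two made-up rows with the same columns.
rows = [
    {"carat": 0.23, "cut": "Ideal"},
    {"carat": 0.21, "cut": "Premium"},
]

pairs = melt_rows(rows)
labels = [label for label, _ in pairs]      # plays the role of df_data[1]
values = [value for _, value in pairs]      # plays the role of df_data[0]
new_labels = list(dict.fromkeys(labels))    # unique labels, original order kept

print(pairs)
print(new_labels)
```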

## Saving and Loading a Data Labeler

The Data Labeler can be saved to disk or loaded from disk with a single call each.

In [23]:
# Ensure the save directory exists
os.makedirs('data_labeler_saved_again', exist_ok=True)

# Save the data labeler
data_labeler.save_to_disk("data_labeler_saved_again")

# Load the data labeler back from the same directory
data_labeler = dp.DataLabeler(labeler_type='structured', dirpath="data_labeler_saved_again")

INFO:tensorflow:Assets written to: data_labeler_saved_again/assets


## Building a Data Labeler from the Ground Up

As mentioned earlier, the Data Labeler is composed of three components, and each component can be created and swapped in the Data Labeler pipeline.

In [None]:
import dataprofiler as dp
from dataprofiler.labelers.character_level_cnn_model import \
    CharacterLevelCnnModel
from dataprofiler.labelers.data_processing import \
    StructCharPreprocessor, StructCharPostprocessor

model = CharacterLevelCnnModel({"PAD":0, "UNKNOWN":1, "Test_Label":2})
preprocessor = StructCharPreprocessor()
postprocessor = StructCharPostprocessor()

data_labeler = dp.DataLabeler(labeler_type='structured')
data_labeler.set_preprocessor(preprocessor)
data_labeler.set_model(model)
data_labeler.set_postprocessor(postprocessor)

# check for basic compatibility between the processors and the model
data_labeler.check_pipeline()

data_labeler.help()

You can create each component yourself by inheriting from BaseModel for the model and BaseProcessor for the processors. More information about coding your own components can be found in the Labeler section of the [documentation](https://capitalone.github.io/dataprofiler).

### Setting Parameters

The parameters of each component are set by passing a nested dictionary to the Data Labeler. Calling help() lists the parameters that can be set.

In [None]:
# First call help() to learn what parameters are available
data_labeler.help()

These parameters can then be set by sending in a nested dictionary like this:

In [None]:
import random
parameters={
    'preprocessor':{
        'max_length': 100,
    },
    'model':{
        'max_length': 100,
    },
    'postprocessor':{
        'random_state': random.Random(1)
    }
} 
data_labeler.set_params(parameters)
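
To illustrate why the nested dictionary is keyed by component name, here is a minimal sketch of how such a dictionary could be routed to the right component, followed by a simple consistency check between the preprocessor and the model. `ToyComponent`, `ToyLabeler`, `lengths_match`, and the default values are all hypothetical and written for this example; they do not describe how the library implements `set_params` or `check_pipeline`.

```python
from typing import Any, Dict


class ToyComponent:
    """Hypothetical component that only accepts parameters it already knows."""

    def __init__(self, **defaults: Any) -> None:
        self.params: Dict[str, Any] = dict(defaults)

    def set_params(self, **kwargs: Any) -> None:
        unknown = set(kwargs) - set(self.params)
        if unknown:
            raise ValueError("unknown parameter(s): {}".format(sorted(unknown)))
        self.params.update(kwargs)


class ToyLabeler:
    """Hypothetical labeler that routes nested parameters to its components."""

    def __init__(self) -> None:
        # Default values are arbitrary placeholders.
        self.components = {
            "preprocessor": ToyComponent(max_length=50),
            "model": ToyComponent(max_length=50),
            "postprocessor": ToyComponent(random_state=None),
        }

    def set_params(self, params: Dict[str, Dict[str, Any]]) -> None:
        for name, component_params in params.items():
            if name not in self.components:
                raise KeyError(name)
            self.components[name].set_params(**component_params)

    def lengths_match(self) -> bool:
        """Check that the preprocessor and the model agree on max_length."""
        pre = self.components["preprocessor"].params["max_length"]
        model = self.components["model"].params["max_length"]
        return pre == model


labeler = ToyLabeler()
labeler.set_params({"preprocessor": {"max_length": 100}, "model": {"max_length": 100}})
print(labeler.lengths_match())
```

Setting `max_length` on both the preprocessor and the model, as the cell above does, keeps the two stages consistent; changing only one of them is the kind of mismatch a pipeline check is meant to catch.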

In summary, the Data Profiler open source library can be used to scan for sensitive information in both structured and unstructured data across different file types. It supports multiple input formats, and output formats at the word and character levels. Users can also train the Data Labeler on their own datasets.