# Data Profiler - What's in your data?

The library is designed to easily detect sensitive data and gather statistics on your datasets with just a few lines of code.

This demo covers the following:

  * Basic usage of the Data Profiler
  * The data reader class
  * Updating and merging profiles
  * Profile differences
  * Graphing a profile
  * Saving profiles
  * Data labeling

First, let's import the libraries needed for this example.

In [None]:
import os
import sys
import json

import pandas as pd
import tensorflow as tf

try:
    sys.path.insert(0, '..')
    import dataprofiler as dp
except ImportError:
    import dataprofiler as dp
    
data_folder = "../dataprofiler/tests/data"

# remove extra tf logging
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.ERROR)

## Basic Usage of the Data Profiler

This section shows a basic example of the Data Profiler. A CSV dataset is read with the data reader, and the resulting Data object is passed to the Data Profiler, which detects sensitive data and computes statistics.

In [None]:
# read, profile, and get the report in 3 lines

# get the data
data = dp.Data(os.path.join(data_folder, "csv/diamonds.csv"))

# profile the data
profile = dp.Profiler(data)

# generate the report
report = profile.report(report_options={"output_format": "compact"})

In [None]:
data.head() # data.data provides access to a pandas.DataFrame

In [None]:
# print the report
print('\nREPORT:\n' + '='*80)
print(json.dumps(report, indent=4))

## Data reader class -- Automatic Detection

Within the Data Profiler, there are 5 data reader classes:

  * CSVData (delimited data: CSV, TSV, etc.)
  * JSONData
  * ParquetData
  * AVROData
  * TextData

In [None]:
# use data reader to read input data with different file types
data_folder = "../dataprofiler/tests/data"
csv_files = [
    "csv/aws_honeypot_marx_geo.csv",
    "csv/all-strings-skip-header-author.csv", # csv files with the author/description on the first line
    "csv/sparse-first-and-last-column-empty-first-row.txt", # csv file with the .txt extension
]
json_files = [
    "json/complex_nested.json",
    "json/honeypot_intentially_mislabeled_file.csv", # json file with the .csv extension
]
parquet_files = [
    "parquet/nation.dict.parquet",
    "parquet/nation.plain.intentionally_mislabled_file.csv", # parquet file with the .csv extension
]
avro_files = [
    "avro/userdata1.avro",
    "avro/userdata1_intentionally_mislabled_file.json", # avro file with the .json extension
]
text_files = [
    "txt/discussion_reddit.txt",
]

all_files = csv_files + json_files + parquet_files + avro_files + text_files

print('filepath' + ' ' * 58 + 'data type')
print('='*80)
for file in all_files:
    filepath = os.path.join(data_folder, file)
    ############################
    ##### READING THE DATA #####
    data = dp.Data(filepath)
    ############################
    print("{:<65} {:<15}".format(file, data.data_type))
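The intentionally mislabeled files in the list above are there to show that detection does not rely on the file extension. The block below is a rough, standard-library-only sketch of how content-based detection can work. It is not DataProfiler's implementation: `sniff_format` is a hypothetical helper, and the specific checks (Parquet and Avro magic bytes, a JSON parse attempt, `csv.Sniffer`) are assumptions chosen for illustration.

```python
import csv
import json


def sniff_format(raw: bytes) -> str:
    """Guess a format from file contents (hypothetical helper, not dataprofiler API)."""
    # Parquet files begin with the magic bytes b"PAR1".
    if raw[:4] == b"PAR1":
        return "parquet"
    # Avro object container files begin with b"Obj" followed by the byte 0x01.
    if raw[:4] == b"Obj\x01":
        return "avro"

    text = raw.decode("utf-8", errors="replace")

    # JSON: the sample parses into an object or an array.
    try:
        if isinstance(json.loads(text), (dict, list)):
            return "json"
    except ValueError:
        pass

    # Delimited data: csv.Sniffer finds a consistent delimiter.
    try:
        csv.Sniffer().sniff(text, delimiters=",\t;|")
        return "csv"
    except csv.Error:
        # Nothing structured was recognized, so treat the content as free text.
        return "text"


# The extension plays no role: only the bytes are inspected.
print(sniff_format(b'{"a": [1, 2]}'))
print(sniff_format(b"a,b,c\n1,2,3\n4,5,6\n"))
```

A sketch like this is easily fooled (free text that happens to contain commas may look delimited), which is one reason a real reader needs more careful checks.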

In [None]:
# importing from a url
data = dp.Data('https://raw.githubusercontent.com/capitalone/DataProfiler/main/dataprofiler/tests/data/csv/diamonds.csv')
data.head()

## Data Profiling
As we saw above, profiling is as simple as:

```python
import dataprofiler as dp

data = dp.Data('my_data.csv')
profiler = dp.Profiler(data)
report = profiler.report(report_options={"output_format": "compact"})
```

### Update profiles - the case for batching / streaming data

A profile can be updated with new data in batches, so a dataset does not have to be profiled all at once.

In [None]:
# divide dataset in half
data = dp.Data(os.path.join(data_folder, "csv/diamonds.csv"))
df = data.data
df1 = df.iloc[:int(len(df)/2)]
df2 = df.iloc[int(len(df)/2):]

In [None]:
# Update the profile with the first half
profile = dp.Profiler(df1)

############################
####### BATCH UPDATE #######
profile.update_profile(df2)
############################
report_batch = profile.report(report_options={"output_format": "compact"})

# print('\nREPORT:\n' + '='*80)
print(json.dumps(report_batch, indent=4))
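Batch updates are possible when a profile stores summaries that can be folded forward without revisiting earlier rows. As an illustration only (this is not DataProfiler's internal code), the hypothetical `RunningStats` class below keeps a count, mean, and variance up to date with Welford's online algorithm. The price values are made up.

```python
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Hypothetical running summary of one numeric column."""

    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # running sum of squared deviations from the mean

    def update(self, batch):
        """Fold a new batch of values into the summary (Welford's method)."""
        for x in batch:
            self.count += 1
            delta = x - self.mean
            self.mean += delta / self.count
            self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Sample variance, or None when fewer than two values were seen."""
        return self.m2 / (self.count - 1) if self.count > 1 else None


stats = RunningStats()
stats.update([326, 326, 327])  # first batch
stats.update([334, 335, 336])  # second batch, arriving later
print(stats.count, stats.mean, stats.variance)
```

Only three numbers are stored per column, no matter how many batches arrive.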

### Merge profiles -- the case for parallelization

Two profiles can be added together to create a combined profile.

In [None]:
# create two profiles and merge
profile1 = dp.Profiler(df1)
profile2 = dp.Profiler(df2)
profile_merge = profile1 + profile2

# check results of the merged profile
report_merge = profile_merge.report(report_options={"output_format": "compact"})

# # print the report
# print('\nREPORT:\n' + '='*80)
# print(json.dumps(report_merge, indent=4))
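Merging relies on summaries that can be combined without access to the raw rows, which is what makes it useful for parallel work. The following is a minimal sketch, assuming a hypothetical `ColumnSummary` class that is not part of DataProfiler. It overloads `+` and combines counts, means, and squared deviations with the pairwise formula of Chan et al. The values are invented.

```python
from dataclasses import dataclass


@dataclass
class ColumnSummary:
    """Hypothetical mergeable summary of one numeric column."""

    count: int
    mean: float
    m2: float  # sum of squared deviations from the mean

    @classmethod
    def from_values(cls, values):
        """Build a summary from a non-empty list of numbers."""
        n = len(values)
        mean = sum(values) / n
        return cls(n, mean, sum((v - mean) ** 2 for v in values))

    def __add__(self, other):
        """Combine two summaries without looking at the original values."""
        n = self.count + other.count
        delta = other.mean - self.mean
        mean = self.mean + delta * other.count / n
        m2 = self.m2 + other.m2 + delta ** 2 * self.count * other.count / n
        return ColumnSummary(n, mean, m2)


# Each half could be summarized by a different worker.
first_half = ColumnSummary.from_values([326, 326, 327])
second_half = ColumnSummary.from_values([334, 335, 336])
merged = first_half + second_half

# Summarizing everything at once gives the same result, up to rounding.
whole = ColumnSummary.from_values([326, 326, 327, 334, 335, 336])
print(merged)
print(whole)
```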

# Differences in Data
Profile differences can be computed for both structured and unstructured datasets.

Such reports can show how two datasets differ, for example training and testing data, as in this pseudocode example:
```python
profiler_training = dp.Profiler(training_data)
profiler_testing = dp.Profiler(testing_data)

validation_report = profiler_training.diff(profiler_testing)
```

In [None]:
from pprint import pprint

# structured differences example
data_split_differences = profile1.diff(profile2)
pprint(data_split_differences)
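To make the idea of a difference report concrete, here is a small sketch that compares two nested dictionaries. The helper `diff_reports`, the report keys, and the output conventions (a numeric difference, the string `"unchanged"`, or a `[left, right]` pair) are all assumptions made for this illustration and should not be read as the exact format `diff` returns.

```python
def diff_reports(left, right):
    """Recursively compare two nested report dictionaries (hypothetical helper)."""
    if isinstance(left, dict) and isinstance(right, dict):
        # Walk the union of keys so entries present on one side still show up.
        return {
            key: diff_reports(left.get(key), right.get(key))
            for key in sorted(set(left) | set(right))
        }
    if left == right:
        return "unchanged"
    numeric = (int, float)
    if isinstance(left, numeric) and isinstance(right, numeric):
        # Numbers are reported as a signed difference.
        return left - right
    # Anything else is reported as the pair of differing values.
    return [left, right]


# Made-up summaries standing in for two profile reports.
training = {"row_count": 100, "price": {"mean": 12.5, "data_type": "float"}}
testing = {"row_count": 80, "price": {"mean": 12.0, "data_type": "float"}}

result = diff_reports(training, testing)
print(result)
```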

## Graphing a Profile

We've also added the ability to generate visual reports from a profile.

The following plots can currently be generated directly from a profiler:

  * missing values matrix
  * histogram (numeric columns only)

In [None]:
import matplotlib.pyplot as plt


# get the data
data = dp.Data(os.path.join(data_folder, "csv/aws_honeypot_marx_geo.csv"))

# profile the data
profile = dp.Profiler(data)

In [None]:
# generate a missing values matrix
fig = plt.figure(figsize=(8, 6), dpi=100)
fig = dp.graphs.plot_missing_values_matrix(profile, ax=fig.gca(), title="Missing Values Matrix")

In [None]:
# generate histogram of all int/float columns
fig = dp.graphs.plot_histograms(profile)
fig.set_size_inches(8, 6)
fig.set_dpi(100)
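For readers unfamiliar with a missing values matrix, the sketch below shows the information it encodes: one cell per row and column, marked as present or missing. This is a plain-Python illustration with made-up rows. The `missing_matrix` helper and the set of null markers are assumptions, not part of `dp.graphs`.

```python
# Made-up rows standing in for a loaded dataset.
rows = [
    {"host": "groucho", "port": "80", "country": ""},
    {"host": "zeppo", "port": "", "country": "US"},
    {"host": "", "port": "22", "country": "US"},
]
columns = ["host", "port", "country"]

# Assumption: these strings count as "missing" in this sketch.
NULL_MARKERS = {"", "nan", "none", "null"}


def missing_matrix(rows, columns):
    """Return one list of booleans per row; True marks a missing cell."""
    return [
        [str(row.get(col, "")).strip().lower() in NULL_MARKERS for col in columns]
        for row in rows
    ]


matrix = missing_matrix(rows, columns)

# Text rendering: '#' is a present value, '.' is a missing one.
for flags in matrix:
    print("".join("." if missing else "#" for missing in flags))
```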

## Saving and Loading a Profile

Beyond creating and updating profiles, the Profiler can save a profile to disk and load it later for further use.

In [None]:
# Load data
data = dp.Data(os.path.join(data_folder, "csv/diamonds.csv"))

# Generate a profile
profile = dp.Profiler(data)

# Save a profile to disk for later (saves as pickle file)
profile.save(filepath="my_profile.pkl")

# Load a profile from disk
loaded_profile = dp.Profiler.load("my_profile.pkl")

# Report the compact version of the profile
# report = loaded_profile.report(report_options={"output_format": "compact"})
# print(json.dumps(report, indent=4))
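Since the profile is saved as a pickle file, the round trip behaves like any other pickled Python object. The sketch below shows that pattern with the standard `pickle` module on a made-up summary dictionary. It stands in for a profile and does not reproduce what `save` and `load` do internally.

```python
import os
import pickle
import tempfile

# Made-up summary standing in for a profile object.
summary = {"row_count": 3, "columns": {"price": {"min": 326, "max": 327}}}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "my_summary.pkl")

    # Save: serialize the object to disk.
    with open(path, "wb") as handle:
        pickle.dump(summary, handle)

    # Load: rebuild an equal, independent object from the file.
    with open(path, "rb") as handle:
        restored = pickle.load(handle)

print(restored == summary)
```

Unpickling can execute arbitrary code, so pickle files should only be loaded from sources you trust.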

# Unstructured Profiling

Like structured datasets, text data can be profiled with the unstructured profiler.
It currently provides an overview of the text, including:
  * memory size
  * char stats
  * word stats
  * data labeling entity stats

In [None]:
profiler_string = dp.Profiler("This is my random text: 332-23-2123")
print(json.dumps(profiler_string.report(), indent=4))
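To give a feel for the kind of numbers behind character and word statistics, here is a standard-library sketch on the same sample string. The tokenization rule and the way size is measured are assumptions for illustration, so the report above may count things differently.

```python
import re
from collections import Counter

text = "This is my random text: 332-23-2123"

# Character stats: how often each character occurs.
char_counts = Counter(text)

# Word stats: lowercase alphabetic tokens (an assumed tokenization rule).
word_counts = Counter(re.findall(r"[a-z']+", text.lower()))

# Size of the text when encoded as UTF-8.
memory_bytes = len(text.encode("utf-8"))

print(char_counts.most_common(3))
print(word_counts)
print(memory_bytes)
```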

In [None]:
email_data = ["Message-ID: <11111111.1111111111111.JavaMail.evans@thyme>\n" + \
              "Date: Fri, 10 Aug 2005 11:31:37 -0700 (PDT)\n" + \
              "From: w..smith@company.com\n" + \
              "To: john.smith@company.com\n" + \
              "Subject: RE: ABC\n" + \
              "Mime-Version: 1.0\n" + \
              "Content-Type: text/plain; charset=us-ascii\n" + \
              "Content-Transfer-Encoding: 7bit\n" + \
              "X-From: Smith, Mary W. </O=ENRON/OU=NA/CN=RECIPIENTS/CN=SSMITH>\n" + \
              "X-To: Smith, John </O=ENRON/OU=NA/CN=RECIPIENTS/CN=JSMITH>\n" + \
              "X-cc: \n" + \
              "X-bcc: \n" + \
              "X-Folder: \\SSMITH (Non-Privileged)\\Sent Items\n" + \
              "X-Origin: Smith-S\n" + \
              "X-FileName: SSMITH (Non-Privileged).pst\n\n" + \
              "All I ever saw was the e-mail from the office.\n\n" + \
              "Mary\n\n" + \
              "-----Original Message-----\n" + \
              "From:   Smith, John  \n" + \
              "Sent:   Friday, August 10, 2005 13:07 PM\n" + \
              "To:     Smith, Mary W.\n" + \
              "Subject:        ABC\n\n" + \
              "Have you heard any more regarding the ABC sale? I guess that means that " + \
              "it's no big deal here, but you think they would have send something.\n\n\n" + \
              "John Smith\n" + \
              "123-456-7890\n"]

profiler_email = dp.Profiler(email_data, profiler_type='unstructured')
print(json.dumps(profiler_email.report(), indent=4))

## Merging Unstructured Data

In [None]:
merged_profile = profiler_string + profiler_email
print(json.dumps(merged_profile.report(), indent=4))

## Differences in Unstructured Data

In [None]:
# unstructured differences example
validation_report = profiler_email.diff(profiler_string)
print(json.dumps(validation_report, indent=4))

## Data Labeling

The Labeler is a pipeline designed to make building, training, and predicting with ML models quick and easy. It has 3 major components: the preprocessor, the model, and the postprocessor.

![Data Labeler flowchart](DL-Flowchart.png "Data Labeler flowchart")

Default labels:
* UNKNOWN
* ADDRESS
* BAN (bank account number, 10-18 digits)
* CREDIT_CARD
* EMAIL_ADDRESS
* UUID
* HASH_OR_KEY (md5, sha1, sha256, random hash, etc.)
* IPV4
* IPV6
* MAC_ADDRESS
* PERSON
* PHONE_NUMBER
* SSN
* URL
* US_STATE
* DRIVERS_LICENSE
* DATE
* TIME
* DATETIME
* INTEGER
* FLOAT
* QUANTITY
* ORDINAL
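The three-stage structure of the Labeler (preprocessor, model, postprocessor) can be pictured with a toy pipeline. In the sketch below every function is hypothetical, and two regular expressions stand in for the trained model, so it only illustrates how the stages hand data to each other. The output is a list of (start, end, label) spans using two of the default label names.

```python
import re
from itertools import groupby

# Stand-in "model": two regexes instead of a trained network (an assumption).
PATTERNS = {
    "EMAIL_ADDRESS": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE_NUMBER": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}


def preprocess(samples):
    """Preprocessor: make sure the model receives a list of strings."""
    return [samples] if isinstance(samples, str) else list(samples)


def model(text):
    """Model: assign one label to every character."""
    labels = ["UNKNOWN"] * len(text)
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            start, end = match.span()
            labels[start:end] = [label] * (end - start)
    return labels


def postprocess(char_labels):
    """Postprocessor: collapse character labels into (start, end, label) spans."""
    spans, position = [], 0
    for label, run in groupby(char_labels):
        length = len(list(run))
        if label != "UNKNOWN":
            spans.append((position, position + length, label))
        position += length
    return spans


def predict(samples):
    """Run the three stages in order for each input text."""
    return [postprocess(model(text)) for text in preprocess(samples)]


text = "To: john.smith@company.com, call 123-456-7890"
spans = predict(text)[0]
for start, end, label in spans:
    print(label, repr(text[start:end]))
```

Keeping the stages separate means the model could be swapped without touching how the input is prepared or how the output is formatted.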

In [None]:
# helper functions for printing results

def get_structured_results(results):
    """Helper function to get data labels for each column."""
    columns = []
    predictions = []
    samples = []
    for col in results['data_stats']:
        columns.append(col['column_name'])
        predictions.append(col['data_label'])
        samples.append(col['samples'])

    df_results = pd.DataFrame({'Column': columns, 'Prediction': predictions, 'Sample': samples})
    return df_results

def get_unstructured_results(data, results):
    """Helper function to get data labels for each labeled piece of text."""
    labeled_data = []
    for pred in results['pred'][0]:
        labeled_data.append([data[0][pred[0]:pred[1]], pred[2]])
    label_df = pd.DataFrame(labeled_data, columns=['Text', 'Labels'])
    return label_df
    

pd.set_option('display.width', 100)

### Structured Labeling

Each column within your profile is given a suggested data label.

In [None]:
# profile data and get labels for each column
data = dp.Data(os.path.join(data_folder, "csv/SchoolDataSmall.csv"))
profiler = dp.Profiler(data)
report = profiler.report()


print('\nLabel Predictions:\n' + '=' * 85)
print(get_structured_results(report))

### Unstructured Labeling

In [None]:
# load data
email_data = ["Message-ID: <11111111.1111111111111.JavaMail.evans@thyme>\n" + \
              "Date: Fri, 10 Aug 2005 11:31:37 -0700 (PDT)\n" + \
              "From: w..smith@company.com\n" + \
              "To: john.smith@company.com\n" + \
              "Subject: RE: ABC\n" + \
              "Mime-Version: 1.0\n" + \
              "Content-Type: text/plain; charset=us-ascii\n" + \
              "Content-Transfer-Encoding: 7bit\n" + \
              "X-From: Smith, Mary W. </O=ENRON/OU=NA/CN=RECIPIENTS/CN=SSMITH>\n" + \
              "X-To: Smith, John </O=ENRON/OU=NA/CN=RECIPIENTS/CN=JSMITH>\n" + \
              "X-cc: \n" + \
              "X-bcc: \n" + \
              "X-Folder: \\SSMITH (Non-Privileged)\\Sent Items\n" + \
              "X-Origin: Smith-S\n" + \
              "X-FileName: SSMITH (Non-Privileged).pst\n\n" + \
              "All I ever saw was the e-mail from the office.\n\n" + \
              "Mary\n\n" + \
              "-----Original Message-----\n" + \
              "From:   Smith, John  \n" + \
              "Sent:   Friday, August 10, 2005 13:07 PM\n" + \
              "To:     Smith, Mary W.\n" + \
              "Subject:        ABC\n\n" + \
              "Have you heard any more regarding the ABC sale? I guess that means that " + \
              "it's no big deal here, but you think they would have send something.\n\n\n" + \
              "John Smith\n" + \
              "123-456-7890\n"]

In [None]:
labeler = dp.DataLabeler(labeler_type='unstructured')

# use word-level predictions and set the output to the NER format
# (start position, end position, label)
labeler.set_params(
    {'postprocessor': {'output_format': 'ner', 'use_word_level_argmax': True}}
)

# make predictions and get labels per character
predictions = labeler.predict(email_data)

# display results
print('=========================Prediction========================')
print(get_unstructured_results(email_data, predictions))