# DataProfiler - Profilers

Data profiling is the process of examining a dataset and collecting statistical or informational summaries about said dataset.

The Profiler in DataProfiler is designed to calculate multiple statistics, predict the entities in a given column or key-value store (via the Labeler), and generally produce informational summaries.

In [None]:
import os
import sys
import json
import pandas as pd
import matplotlib.pyplot as plt
sys.path.insert(0, '..')
import dataprofiler as dp

data_path = "../dataprofiler/tests/data"

## Reporting

One of the primary purposes of the Profiler is to quickly identify what's in the dataset. This can be useful for analyzing a dataset prior to use or for determining which columns could be useful for a given purpose.

In terms of reporting, there are four reporting options:

* **Pretty**: floats are rounded to four decimal places, and lists are shortened.
* **Compact**: Similar to pretty, but removes detailed statistics such as runtimes, label probabilities, index locations of null types, etc.
* **Serializable**: Output is JSON-serializable and not prettified.
* **Flat**: Nested output is returned as a flattened dictionary
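
To make the idea behind the **Flat** option concrete, here is a minimal sketch of flattening a nested dictionary. It is not DataProfiler's implementation: the `flatten` helper, the `.` key separator, and the hand-written `nested_report` fragment are all assumptions made for illustration, and the keys produced by the real flat report may look different.

```python
def flatten(nested, parent_key="", sep="."):
    """Flatten a nested dict into a single-level dict with joined keys.

    Hypothetical helper for illustration; the separator is an assumption.
    """
    flat = {}
    for key, value in nested.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else str(key)
        if isinstance(value, dict):
            # Recurse into nested dictionaries, carrying the key prefix along
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


# Hand-written fragment shaped like a nested report (not real profiler output)
nested_report = {
    "global_stats": {"row_count": 3, "column_count": 2},
    "data_stats": {"col_int": {"min": -1, "max": 1}},
}

flat_report = flatten(nested_report)
print(flat_report)
```

A flat dictionary like this is convenient when the values need to go into a single table row or be compared key by key.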

In [None]:
data = dp.Data(os.path.join(data_path, "csv/aws_honeypot_marx_geo.csv"))
profile = dp.Profiler(data)

# Compact - A high level view, good for quick reviews
report  = profile.report(report_options={"output_format":"compact"})
print(json.dumps(report, indent=4))

The report includes `global_stats` and `data_stats` for the given dataset. The former contains overall properties of the data, such as the number of rows and columns, the null ratio, and the duplicate ratio, while the latter contains properties and statistics for each column, such as min, max, mean, and variance. In this example, the `compact` format is used to shorten the full list of results. To get histogram results or detailed entity-level predictions from the DataLabeler component, use the `pretty` format.

In addition to reading input data from multiple file types, DataProfiler also accepts a pandas DataFrame as input.

In [None]:
# run data profiler and get the report
my_dataframe = pd.DataFrame([[1, 2.0],[1, 2.2],[-1, 3]], columns=["col_int", "col_float"])
profile = dp.Profiler(my_dataframe)

report  = profile.report(report_options={"output_format":"compact"})
print(json.dumps(report, indent=4))

## Profiler options

DataProfiler can run only selected components if needed. For example, users who only want statistics can turn off the DataLabeler functionality. The example below disables both the histogram and the DataLabeler components while running DataProfiler.

Full list of options: https://capitalone.github.io/DataProfiler

In [None]:
profile_options = dp.ProfilerOptions()
profile_options.set({
    "histogram.is_enabled": False,
    "data_labeler.is_enabled": False
})

profile = dp.Profiler(data, profiler_options=profile_options)
report  = profile.report(report_options={"output_format":"compact"})

# Print the report
print(json.dumps(report, indent=4))

### Updating Profiles

Beyond just profiling, one of the unique aspects of the DataProfiler (as compared to pandas-profiling) is the ability to update profiles. For an update to work, the schema (columns / keys) of the new data must match that of the existing profile.

In [None]:
# Load and profile a CSV file
data = dp.Data(os.path.join(data_path, "csv/sparse-first-and-last-column-header-and-author.txt"))
profile = dp.Profiler(data)

# Update the profile with new data:
new_data = dp.Data(os.path.join(data_path, "csv/sparse-first-and-last-column-skip-header.txt"))
profile.update_profile(new_data)

# Report the compact version of the profile
report  = profile.report(report_options={"output_format":"compact"})
print(json.dumps(report, indent=4))

### Merging Profiles

Merging profiles is an alternative to updating them. In particular, multiple profiles can be generated separately, then added together with the `+` operator: `profile3 = profile1 + profile2`

In [None]:
# Load a CSV file with a schema
data1 = dp.Data(os.path.join(data_path, "csv/sparse-first-and-last-column-header-and-author.txt"))
profile1 = dp.Profiler(data1)

# Load another CSV file with the same schema
data2 = dp.Data(os.path.join(data_path, "csv/sparse-first-and-last-column-skip-header.txt"))
profile2 = dp.Profiler(data2)

# Merge the profiles
profile3 = profile1 + profile2

# Report the compact version of the profile
report  = profile3.report(report_options={"output_format":"compact"})
print(json.dumps(report, indent=4))

As you can see, the `update_profile` function and the `+` operator function similarly. The reason the `+` operator is important is that it's possible to *save and load profiles*, which we cover next.
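
One way to see why separately built profiles can be combined without revisiting the raw data is that many summary statistics are mergeable. The sketch below is not DataProfiler's code: `ColumnSummary` is a hypothetical stand-in class that tracks only count, mean, and sample variance, and it merges two summaries with the standard pairwise update for mean and sum of squared deviations.

```python
import statistics
from dataclasses import dataclass


@dataclass
class ColumnSummary:
    """Hypothetical stand-in for a column profile (not a DataProfiler class)."""
    count: int
    mean: float
    m2: float  # sum of squared deviations from the mean

    @classmethod
    def from_values(cls, values):
        mean = statistics.fmean(values)
        m2 = sum((v - mean) ** 2 for v in values)
        return cls(len(values), mean, m2)

    def __add__(self, other):
        # Pairwise merge: combine two summaries without the original values
        count = self.count + other.count
        delta = other.mean - self.mean
        mean = self.mean + delta * other.count / count
        m2 = self.m2 + other.m2 + delta ** 2 * self.count * other.count / count
        return ColumnSummary(count, mean, m2)

    @property
    def variance(self):
        # Sample variance (divides by count - 1)
        return self.m2 / (self.count - 1)


batch1 = [1.0, 2.0, 3.0]
batch2 = [10.0, 20.0]

# Summarize each batch separately, then merge with `+`
merged = ColumnSummary.from_values(batch1) + ColumnSummary.from_values(batch2)

# Summarize all values at once for comparison
everything = ColumnSummary.from_values(batch1 + batch2)

print(merged.count, merged.mean, merged.variance)
print(everything.count, everything.mean, everything.variance)
```

Both lines should print the same statistics up to floating-point rounding, which is the property that makes adding profiles meaningful.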

### Saving and Loading a Profile

Not only can the Profiler create and update profiles, it can also save them. This makes it possible to generate multiple profiles, save them, and load them later to be merged.

In [None]:
# CSV files to profile (both share the same schema)
filenames = [
    "csv/sparse-first-and-last-column-header-and-author.txt",
    "csv/sparse-first-and-last-column-skip-header.txt"
]

data_objects = [dp.Data(os.path.join(data_path, filename)) for filename in filenames]


# Generate and save profiles
for i, data_object in enumerate(data_objects):
    profile = dp.Profiler(data_object)
    profile.save(filepath=f"data-{i}.pkl")


# Load profiles and add them together
profile = None
for i in range(len(data_objects)):
    saved_profile_name = "data-"+str(i)+".pkl"
    profile_tmp = dp.Profiler.load(saved_profile_name)
    if profile is None:
        profile = profile_tmp
    else:
        profile += profile_tmp


# Report the compact version of the profile
report = profile.report(report_options={"output_format":"compact"})
print(json.dumps(report, indent=4))
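
The load-and-add loop above is a fold over the saved profiles. For any objects that support `+`, the same pattern can be written with `functools.reduce`. The sketch below uses `collections.Counter` purely as a stand-in because it supports `+`; it does not touch DataProfiler, and whether this style reads better than the explicit loop is a matter of taste.

```python
import operator
from collections import Counter
from functools import reduce

# Stand-ins for partial results loaded from disk (Counter supports `+`)
partials = [Counter(a=2, b=1), Counter(b=3), Counter(a=1, c=4)]

# Fold the list into a single combined object, left to right
combined = reduce(operator.add, partials)
print(combined)
```

Unlike the loop, `reduce` raises a `TypeError` on an empty list unless an initial value is supplied, so the explicit `None` check in the loop is the safer choice when the list of saved profiles might be empty.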