# DataProfiler - Profilers

**Data profiling** - *is the process of examining a dataset and collecting statistical or informational summaries about said dataset.*

The Profiler class inside the DataProfiler generates *data profiles* from data loaded through the Data class or from a pandas DataFrame.

Currently, the Data class supports loading the following file formats:

* Any delimited file (CSV, TSV, etc.)
* JSON object
* Avro
* Parquet
* Text files

Once the data is loaded, the Profiler calculates statistics and predicts the entities (via the Labeler) of every column (CSV) or key-value store (JSON), along with dataset-wide information such as the number of nulls and duplicates.

In [10]:
import os
import sys
import json
sys.path.insert(0, '..')
import dataprofiler as dp

data_path = "../dataprofiler/tests/data"

## Reporting

One of the primary purposes of the Profiler is to quickly identify what is in a dataset. This is useful for analyzing a dataset before using it or for determining which columns could serve a given purpose.

In terms of reporting, there are multiple reporting options:

* **Pretty**: Floats are rounded to four decimal places, and lists are shortened.
* **Compact**: Similar to pretty, but removes detailed statistics such as runtimes, label probabilities, index locations of null types, etc.
* **Serializable**: Output is JSON serializable and not prettified.
* **Flat**: Nested output is returned as a flattened dictionary.
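
To make the difference between nested and flat output concrete, here is a minimal sketch of flattening a nested dictionary. The helper name `flatten_report` and the `.` separator are assumptions made for illustration; this is not DataProfiler's implementation, and the library's actual flat key format may differ.

```python
def flatten_report(nested, parent_key="", sep="."):
    """Flatten nested dicts into one level, joining keys with `sep`."""
    flat = {}
    for key, value in nested.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else str(key)
        if isinstance(value, dict):
            # Recurse into sub-dictionaries, carrying the key prefix along
            flat.update(flatten_report(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


# Hypothetical miniature report, shaped like the nested output
example = {
    "global_stats": {"row_count": 3, "column_count": 2},
    "data_stats": {"col_int": {"statistics": {"min": -1.0, "max": 1.0}}},
}
flat = flatten_report(example)
print(flat)
```

A flat dictionary like this is convenient when a report has to be written to a single table row or compared key by key.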

The **Pretty** and **Compact** reports are the two most commonly used reports and include `global_stats` and `data_stats` for the given dataset.

`global_stats` contains overall properties of the data such as number of rows/columns, null ratio, duplicate ratio:

```
"global_stats": {
    "samples_used": int,
    "column_count": int,
    "row_count": int,
    "row_has_null_ratio": float,
    "row_is_null_ratio": float,    
    "unique_row_ratio": float,
    "duplicate_row_count": int,
    "file_type": string,
    "encoding": string,
}
```
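
As a rough illustration of what the row-level fields mean, the sketch below computes similar values for a small list of rows in plain Python. The function name `global_stats_sketch`, the choice of null values, and the exact definitions (for example, duplicates counted as total rows minus distinct rows) are assumptions inferred from the field names, not the library's code.

```python
def global_stats_sketch(rows, null_values=("", None)):
    """Compute dataset-wide row statistics for a list of equal-length tuples."""
    row_count = len(rows)
    if row_count == 0:
        raise ValueError("at least one row is required")
    # Rows containing at least one null cell
    has_null = sum(any(cell in null_values for cell in row) for row in rows)
    # Rows in which every cell is null
    is_null = sum(all(cell in null_values for cell in row) for row in rows)
    unique_rows = len(set(rows))
    return {
        "row_count": row_count,
        "column_count": len(rows[0]),
        "row_has_null_ratio": has_null / row_count,
        "row_is_null_ratio": is_null / row_count,
        "unique_row_ratio": unique_rows / row_count,
        "duplicate_row_count": row_count - unique_rows,
    }


rows = [(1, "a"), (1, "a"), (2, ""), ("", "")]
stats = global_stats_sketch(rows)
print(stats)
```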

`data_stats` contains specific properties and statistics for each column such as min, max, mean, variance, etc.


```
"data_stats": {
    <column name>: {
        "column_name": string,
        "data_type": string,
        "data_label": string,
        "categorical": bool,
        "order": string,
        "samples": list(str),
        "statistics": {
            "sample_size": int,
            "null_count": int,
            "null_types": list(string),
            "null_types_index": {
                string: list(int)
            },
            "data_type_representation": [string, list(string)],
            "min": [null, float],
            "max": [null, float],
            "mean": float,
            "variance": float,
            "stddev": float,
            "histogram": { 
                "bin_counts": list(int),
                "bin_edges": list(float),
            },
            "quantiles": {
                int: float
            },
            "vocab": list(char),
            "avg_predictions": dict(float), 
            "data_label_representation": dict(float),
            "categories": list(str),
            "unique_count": int,
            "unique_ratio": float,
            "precision": {
                'min': int,
                'max': int,
                'mean': float,
                'var': float,
                'std': float,
                'sample_size': int,
                'margin_of_error': float,
                'confidence_level': float		
            },
            "times": dict(float),
            "format": string
        }
    }
}
```
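
For a sense of how a few of these per-column fields relate to the raw values, here is a minimal sketch using only the standard library. The name `column_statistics` is hypothetical, the null handling is simplified, and the use of the sample variance (dividing by n - 1) is an assumption; it needs at least two non-null numeric values.

```python
import statistics


def column_statistics(values, null_values=("", None)):
    """Summarize one numeric column, ignoring null entries."""
    present = [v for v in values if v not in null_values]
    numbers = [float(v) for v in present]
    unique_count = len(set(present))
    return {
        "sample_size": len(values),
        "null_count": len(values) - len(present),
        "min": min(numbers),
        "max": max(numbers),
        "mean": statistics.mean(numbers),
        "variance": statistics.variance(numbers),  # sample variance (n - 1)
        "stddev": statistics.stdev(numbers),
        "unique_count": unique_count,
        "unique_ratio": unique_count / len(present),
    }


stats = column_statistics([1, 1, -1])
print({key: round(value, 4) for key, value in stats.items()})
```

For the values `[1, 1, -1]`, the mean, variance, and standard deviation round to 0.3333, 1.3333, and 1.1547, which agrees with the `col_int` figures in the DataFrame example later in this notebook.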

In the example below, the `compact` format is used to keep the report short. For detailed entity-level predictions from the DataLabeler component, or for histogram results, use the `pretty` format instead.

In [2]:
data = dp.Data(os.path.join(data_path, "csv/aws_honeypot_marx_geo.csv"))
profile = dp.Profiler(data)

# Compact - A high level view, good for quick reviews
report  = profile.report(report_options={"output_format":"compact"})
print(json.dumps(report, indent=4))





BEFORE USE IN PRODUCTION CONTACT THE DEVELOPMENT TEAM

EMAIL: #Data-Innovation@capitalone.com
SLACK: #data-profiler-support

Controls must be put into place to ensure data security and the 
Data Innovation Team @ UIUC must notify the MRO of new usecases to 
ensure compliance with the Capital One Model Risk Office.

The instantiation of this class will also be recorded and sent to 
the Data Innovation Team.










{
    "global_stats": {
        "samples_used": 2999,
        "column_count": 16,
        "row_count": 2999,
        "row_has_null_ratio": 1.0,
        "row_is_null_ratio": 0.0,
        "unique_row_ratio": 1.0,
        "duplicate_row_count": 0,
        "file_type": "csv",
        "encoding": "utf-8"
    },
    "data_stats": {
        "datetime": {
            "column_name": "datetime",
            "data_type": "datetime",
            "data_label": "INTEGER|TIME|DATETIME",
            "categorical": false,
            "order": "random",
            "samples": "['3/5/13 7:42', '3/5/13 0:36', '3/4/13 7:08', '3/5/13 2:55', '3/4/13 6:46']",
            "statistics": {
                "min": "3/3/13 21:53",
                "max": "3/25/13 16:34",
                "format": "['%m/%d/%y %H:%M']",
                "unique_count": 1400,
                "unique_ratio": 0.4671,
                "sample_size": 2999,
                "null_count": 2,
                "null_types": "['']",
               

Note that, in addition to reading input data from multiple file types, DataProfiler also accepts a pandas DataFrame as input.

In [3]:
# run data profiler and get the report
import pandas as pd
my_dataframe = pd.DataFrame([[1, 2.0],[1, 2.2],[-1, 3]], columns=["col_int", "col_float"])
profile = dp.Profiler(my_dataframe)

report  = profile.report(report_options={"output_format":"pretty"})
print(json.dumps(report, indent=4))






{
    "global_stats": {
        "samples_used": 3,
        "column_count": 2,
        "row_count": 3,
        "row_has_null_ratio": 0.0,
        "row_is_null_ratio": 0.0,
        "unique_row_ratio": 1.0,
        "duplicate_row_count": 0,
        "file_type": "<class 'pandas.core.frame.DataFrame'>",
        "encoding": null
    },
    "data_stats": {
        "col_int": {
            "column_name": "col_int",
            "data_type": "int",
            "data_label": "INTEGER",
            "categorical": true,
            "order": "descending",
            "samples": "['1', '1', '-1']",
            "statistics": {
                "min": -1.0,
                "max": 1.0,
                "mean": 0.3333,
                "variance": 1.3333,
                "stddev": 1.1547,
                "histogram": {
                    "bin_counts": "[1, 0, 2]",
                    "bin_edges": "[-1.        , -0.33333333,  0.33333333,  1.        ]"
                },
                "quantiles": {
      




## Profiler options

The DataProfiler has the ability to turn on and off components as needed. This is accomplished via the `ProfilerOptions` class.

For example, if a user doesn't need histogram information, they can turn off the histogram functionality. Similarly, if a user wants more accurate labeling, they can increase the number of samples used to label.

Below, we disable the histogram and increase the number of samples passed to the labeler component to 1,000.

The full list of options is in the Profiler section of the [DataProfiler documentation](https://capitalone.github.io/DataProfiler).

In [4]:
data = dp.Data(os.path.join(data_path, "csv/diamonds.csv"))

profile_options = dp.ProfilerOptions()
profile_options.set({ 
    "histogram.is_enabled": False 
})
profile_options.structured_options.data_labeler.max_sample_size = 1000

profile = dp.Profiler(data, profiler_options=profile_options)
report  = profile.report(report_options={"output_format":"compact"})

# Print the report
print(json.dumps(report, indent=4))







{
    "global_stats": {
        "samples_used": 10788,
        "column_count": 10,
        "row_count": 53940,
        "row_has_null_ratio": 0.0,
        "row_is_null_ratio": 0.0,
        "unique_row_ratio": 0.9973,
        "duplicate_row_count": 146,
        "file_type": "csv",
        "encoding": "utf-8"
    },
    "data_stats": {
        "carat": {
            "column_name": "carat",
            "data_type": "float",
            "data_label": "FLOAT",
            "categorical": true,
            "order": "random",
            "samples": "['0.75', '1.12', '0.88', '0.32', '0.72']",
            "statistics": {
                "min": 0.2,
                "max": 4.01,
                "mean": 0.7976,
                "variance": 0.2293,
                "stddev": 0.4789,
                "quantiles": {
                    "0": 0.3465,
                    "1": 0.6396,
                    "2": 0.9913
                },
                "precision": 1.0,
                "unique_count": 236,
    

## Updating Profiles

Beyond just profiling, one of the unique aspects of the DataProfiler is the ability to update existing profiles. For an update to succeed, the schema (columns / keys) of the new data must match the schema of the data already profiled.

In [5]:
# Load and profile a CSV file
data = dp.Data(os.path.join(data_path, "csv/sparse-first-and-last-column-header-and-author.txt"))
profile = dp.Profiler(data)

# Update the profile with new data:
new_data = dp.Data(os.path.join(data_path, "csv/sparse-first-and-last-column-skip-header.txt"))
# new_data = dp.Data(os.path.join(data_path, "iris-utf-16.csv")) # will error due to schema mismatch
profile.update_profile(new_data)

# Take a peek at the data
print(data.data)
print(new_data.data)

# Report the compact version of the profile
report  = profile.report(report_options={"output_format":"compact"})
print(json.dumps(report, indent=4))







  CONTAR                EL NOMBRE  EL NUMERO
0      1        George Washington          1
1      2               John Adams         12
2                Thomas Jefferson           
3                   James Madison         13
4                    James Monroe           
5                      John Adams           
6      7           Andrew Jackson           
7                Martin Van Buren         83
8          William Henry Harrison           
  CONTAR                EL NOMBRE  EL NUMERO
0      1        George Washington          1
1      2               John Adams         12
2                Thomas Jefferson           
3                   James Madison         13
4                    James Monroe           
5                      John Adams           
6      7           Andrew Jackson           
7                Martin Van Buren         83
8          William Henry Harrison           
{
    "global_stats": {
        "samples_used": 18,
        "column_count": 3,
        "row_count": 

## Merging Profiles

Merging profiles is an alternative method for updating profiles. In particular, multiple profiles can be generated separately, then added together with a simple `+` command: `profile3 = profile1 + profile2`
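
To see why summaries generated separately can be added together, consider this minimal sketch of a mergeable numeric summary. It is independent of DataProfiler: the class `NumericSummary` is hypothetical, and the combination rule is the standard pairwise formula for merging counts, means, and sums of squared deviations, not necessarily what the library uses internally.

```python
import statistics
from dataclasses import dataclass


@dataclass
class NumericSummary:
    count: int
    mean: float
    m2: float  # sum of squared deviations from the mean

    @classmethod
    def from_values(cls, values):
        values = [float(v) for v in values]
        mean = statistics.mean(values)
        return cls(len(values), mean, sum((v - mean) ** 2 for v in values))

    @property
    def variance(self):
        # Sample variance; requires at least two values
        return self.m2 / (self.count - 1)

    def __add__(self, other):
        # Combine two summaries without revisiting the raw data
        count = self.count + other.count
        delta = other.mean - self.mean
        mean = self.mean + delta * other.count / count
        m2 = self.m2 + other.m2 + delta ** 2 * self.count * other.count / count
        return NumericSummary(count, mean, m2)


first = NumericSummary.from_values([1, 2, 7])
second = NumericSummary.from_values([1, 2, 7])
merged = first + second
print(merged.count, round(merged.mean, 4), round(merged.variance, 4))
```

Because only the count, mean, and squared deviations are stored, the merged result equals what a single pass over all six values would give.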

In [6]:
# Load a CSV file with a schema
data1 = dp.Data(os.path.join(data_path, "csv/sparse-first-and-last-column-header-and-author.txt"))
profile1 = dp.Profiler(data1)

# Load another CSV file with the same schema
data2 = dp.Data(os.path.join(data_path, "csv/sparse-first-and-last-column-skip-header.txt"))
profile2 = dp.Profiler(data2)

# Merge the profiles
profile3 = profile1 + profile2

# Report the compact version of the profile
report  = profile3.report(report_options={"output_format":"compact"})
print(json.dumps(report, indent=4))










{
    "global_stats": {
        "samples_used": 18,
        "column_count": 3,
        "row_count": 18,
        "row_has_null_ratio": 0.7778,
        "row_is_null_ratio": 0.0,
        "unique_row_ratio": 0.5,
        "duplicate_row_count": 9,
        "file_type": "csv",
        "encoding": "utf-8"
    },
    "data_stats": {
        "CONTAR": {
            "column_name": "CONTAR",
            "data_type": "int",
            "data_label": "INTEGER",
            "categorical": true,
            "order": "random",
            "samples": "['7', '1', '2']",
            "statistics": {
                "min": 1.0,
                "max": 7.0,
                "mean": 3.3333,
                "variance": 8.2667,
                "stddev": 2.8752,
                "quantiles": {
                    "0": 1.75,
                    "1": 1.75,
                    "2": 5.5
                },
                "unique_count": 3,
                "unique_ratio": 0.5,
                "categories": "['2', '7', '

As you can see, the `update_profile` function and the `+` operator function similarly. The reason the `+` operator is important is that it's possible to *save and load profiles*, which we cover next.

## Saving and Loading a Profile

Not only can the Profiler create and update profiles, it's also possible to save, load, and then manipulate profiles.

In [9]:
# Load data
data = dp.Data(os.path.join(data_path, "csv/names-col.txt"))

# Generate a profile
profile = dp.Profiler(data)

# Save a profile to disk for later use
profile.save(filepath="my_profile.pkl")

# Load a profile from disk
loaded_profile = dp.Profiler.load("my_profile.pkl")

# Report the compact version of the loaded profile
report = loaded_profile.report(report_options={"output_format":"compact"})
print(json.dumps(report, indent=4))







AttributeError: 'Profiler' object has no attribute 'save'

With the ability to save and load profiles, profiles can be generated on multiple machines and then merged. Further, profiles can be stored and later used in applications such as change point detection, synthetic data generation, and more.
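
The save, load, then merge pattern can be sketched without the library by using a `collections.Counter` as a stand-in for a profile, since counters also support `+`. Everything here is an assumption for illustration: the shard contents, the `counts-<i>.pkl` file names, and the use of `pickle` as the storage format (DataProfiler's own on-disk format is not described in this notebook).

```python
import pickle
import tempfile
from collections import Counter
from pathlib import Path

# Hypothetical data that would live on two different machines
shards = [["a", "b", "a"], ["b", "c"]]

with tempfile.TemporaryDirectory() as tmp_dir:
    # Step 1: summarize each shard separately and save the summary to disk
    paths = []
    for i, shard in enumerate(shards):
        path = Path(tmp_dir) / f"counts-{i}.pkl"
        with path.open("wb") as handle:
            pickle.dump(Counter(shard), handle)
        paths.append(path)

    # Step 2: load every saved summary and add them together
    merged = Counter()
    for path in paths:
        with path.open("rb") as handle:
            merged += pickle.load(handle)

print(merged)
```

Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.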

In [None]:
# Load multiple files via the Data class
filenames = ["csv/sparse-first-and-last-column-header-and-author.txt",
             "csv/sparse-first-and-last-column-skip-header.txt"]
data_objects = []
for filename in filenames:
    data_objects.append(dp.Data(os.path.join(data_path, filename)))


# Generate and save profiles
for i, data_object in enumerate(data_objects):
    profile = dp.Profiler(data_object)
    profile.save(filepath="data-"+str(i)+".pkl")


# Load profiles and add them together
profile = None
for i in range(len(data_objects)):
    if profile is None:
        profile = dp.Profiler.load("data-"+str(i)+".pkl")
    else:
        profile += dp.Profiler.load("data-"+str(i)+".pkl")


# Report the compact version of the profile
report = profile.report(report_options={"output_format":"compact"})
print(json.dumps(report, indent=4))