# Overview

This example demonstrates a Data Profiler utility for merging profile objects, as you would when profiles are built in a distributed fashion. It assumes the user passes a list of profile objects to the function, which merges them into a single profile.

# Imports

Let's start by importing the necessary packages...

In [1]:
import os
import sys
import json

import pandas as pd
import tensorflow as tf

try:
    sys.path.insert(0, '..')
    import dataprofiler as dp
    from dataprofiler.profilers.utils import merge_profile_list
except ImportError:
    import dataprofiler as dp
    from dataprofiler.profilers.utils import merge_profile_list

# remove extra tf logging
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.ERROR)

## Set Up the Data and Profiler

This section walks through a basic Data Profiler example.

1. Instantiate a Pandas DataFrame.
2. Pass the DataFrame to `Profiler` twice to build a list of two separate profilers.

In [2]:
d = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data=d)

list_of_profiles = [dp.Profiler(df), dp.Profiler(df)]

2022-08-08 12:50:43.340749: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.


INFO:DataProfiler.profilers.profile_builder: Finding the Null values in the columns... 


100%|██████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 499.35it/s]


INFO:DataProfiler.profilers.profile_builder: Calculating the statistics...  (with 4 processes)


100%|███████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  4.13it/s]


INFO:DataProfiler.profilers.profile_builder: Finding the Null values in the columns... 


100%|██████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 767.13it/s]


INFO:DataProfiler.profilers.profile_builder: Calculating the statistics...  (with 4 processes)


100%|███████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  7.68it/s]
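
The cell above profiles the same DataFrame twice to keep the demonstration small. In a distributed job, each profiler would more likely see a different slice of the data. The sketch below, in plain Python, shows one way to split rows into contiguous partitions while keeping each row's original index. The helper `partition_rows` and the sample rows are hypothetical and are not part of Data Profiler.

```python
from typing import List, Sequence, Tuple

Row = Tuple[int, int]
IndexedRow = Tuple[int, Row]


def partition_rows(rows: Sequence[Row], n_parts: int) -> List[List[IndexedRow]]:
    """Split rows into n_parts contiguous chunks, keeping original indices.

    Hypothetical helper for illustration only (not a Data Profiler function).
    """
    if n_parts < 1:
        raise ValueError("n_parts must be at least 1")
    indexed = list(enumerate(rows))
    # Spread any remainder over the first `extra` chunks.
    base, extra = divmod(len(indexed), n_parts)
    chunks: List[List[IndexedRow]] = []
    start = 0
    for i in range(n_parts):
        size = base + (1 if i < extra else 0)
        chunks.append(indexed[start:start + size])
        start += size
    return chunks


# Made-up sample rows; each chunk would go to its own worker.
rows = [(1, 3), (2, 4), (5, 7), (6, 8), (9, 11)]
chunks = partition_rows(rows, 2)
for worker_id, chunk in enumerate(chunks):
    print(worker_id, chunk)
```

Each chunk could then be turned into its own DataFrame and profiled separately before the merge step.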


Take a look at the list of profiles... 

In [3]:
list_of_profiles

[<dataprofiler.profilers.profile_builder.StructuredProfiler at 0x7ff2af027b10>,
 <dataprofiler.profilers.profile_builder.StructuredProfiler at 0x7ff2af1c9650>]

## Run Merge on List of Profiles

Now let's merge the list of profiles into a `single_profile`:

In [4]:
single_profile = merge_profile_list(list_of_profiles=list_of_profiles)

  f"Overlapping indices detected. To resolve, indices "
  f"Overlapping indices detected. To resolve, indices "
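
To build intuition for what merging a list of profiles involves, here is a minimal sketch in plain Python. It assumes each profile keeps a small summary per column (count, sum, min, max, and the sum of squared deviations) and folds the list pairwise, combining variances with the standard parallel-variance update. The `ColumnSummary` class and `merge_summaries` function are hypothetical and are not Data Profiler's implementation.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Iterable, List


@dataclass(frozen=True)
class ColumnSummary:
    """Hypothetical mergeable summary of one numeric column."""
    count: int
    total: float
    minimum: float
    maximum: float
    m2: float  # sum of squared deviations from the mean

    @classmethod
    def from_values(cls, values: Iterable[float]) -> "ColumnSummary":
        data = [float(v) for v in values]  # assumes at least one value
        mean = sum(data) / len(data)
        return cls(len(data), sum(data), min(data), max(data),
                   sum((x - mean) ** 2 for x in data))

    @property
    def mean(self) -> float:
        return self.total / self.count

    @property
    def variance(self) -> float:
        # Sample variance; assumes count >= 2.
        return self.m2 / (self.count - 1)

    def merge(self, other: "ColumnSummary") -> "ColumnSummary":
        count = self.count + other.count
        delta = other.mean - self.mean
        # Combine squared deviations without revisiting the raw values.
        m2 = self.m2 + other.m2 + delta ** 2 * self.count * other.count / count
        return ColumnSummary(count, self.total + other.total,
                             min(self.minimum, other.minimum),
                             max(self.maximum, other.maximum), m2)


def merge_summaries(summaries: List[ColumnSummary]) -> ColumnSummary:
    """Fold a non-empty list of summaries into one."""
    return reduce(ColumnSummary.merge, summaries)


# Two partitions holding the same values as `col1` in this notebook.
partitions = [[1, 2], [1, 2]]
merged = merge_summaries([ColumnSummary.from_values(p) for p in partitions])
print(merged.count, merged.total, merged.mean, merged.variance)
```

For these two partitions the sketch gives a count of 4, a sum of 6.0, a mean of 1.5, and a variance of one third, which agrees with the `col1` statistics in the merged report in this notebook.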


Then check the output of `.report()` on the merged profile:

In [5]:
single_profile.report()

{'global_stats': {'samples_used': 4,
  'column_count': 2,
  'row_count': 4,
  'row_has_null_ratio': 0.0,
  'row_is_null_ratio': 0.0,
  'unique_row_ratio': 0.5,
  'duplicate_row_count': 2,
  'file_type': "<class 'pandas.core.frame.DataFrame'>",
  'encoding': None,
  'correlation_matrix': None,
  'chi2_matrix': array([[1.        , 0.04601171],
         [0.04601171, 1.        ]]),
  'profile_schema': defaultdict(list, {'col1': [0], 'col2': [1]}),
  'times': {'row_stats': 0.004318714141845703}},
 'data_stats': [{'column_name': 'col1',
   'data_type': 'int',
   'data_label': 'INTEGER',
   'categorical': True,
   'order': 'random',
   'samples': ['2', '1'],
   'statistics': {'min': 1.0,
    'max': 2.0,
    'mode': [1.0005, 1.9995],
    'median': 1.5,
    'sum': 6.0,
    'mean': 1.5,
    'variance': 0.3333333333333333,
    'stddev': 0.5773502691896257,
    'skewness': 0.0,
    'kurtosis': -6.0,
    'histogram': {'bin_edges': array([1.        , 1.33333333, 1.66666667, 2.        ]),
     'bin_c