# DataProfiler - What's in your data?

This introductory Jupyter notebook demonstrates basic usage of the DataProfiler. The library is designed so that users can obtain statistics and other detailed information about their input datasets with just a few lines of code. DataProfiler provides multiple data classes that handle different types of data. Users can also skip properties they do not need while profiling their datasets. The last key feature covered here is the ability to update a profiler with successive batches of a large dataset, or to merge multiple profilers, which suits distributed computing environments. In particular, this example covers the following:

- Basic usage of DataProfiler
- Profiler options
- Data reader class
- Update profiles and merge profiles

First, let's import the libraries needed for this example.

In [None]:
import os
import sys
import json
import pandas as pd
import matplotlib.pyplot as plt
sys.path.insert(0, '..')
import dataprofiler as dp

## Basic examples

This section shows a basic example of DataProfiler usage. A CSV dataset is read with the data reader, and the returned data is passed to the DataProfiler to obtain its properties and statistics.

In [None]:
# use data reader to read input data
data = dp.Data("../dataprofiler/tests/data/csv/aws_honeypot_marx_geo.csv")
print(data.data.head())

# run data profiler and get the report
profile = dp.Profiler(data)
report  = profile.report(report_options={"output_format":"compact"})

# print the report
print(json.dumps(report, indent=4))

The report includes `global_stats` and `data_stats` for the given dataset. The former contains overall properties of the data, such as the number of rows and columns, the null ratio, and the duplicate ratio, while the latter contains properties and statistics specific to each column, such as min, max, mean, and variance. In this example, the `compact` report format is used to shorten the full list of results. To see more detail, such as entity-level predictions from the DataLabeler component or histogram results, use the `pretty` format.
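
As a minimal sketch of how such a report can be navigated, the helpers below read per-column statistics out of a report-shaped dictionary. They assume the layout used later in this notebook, where `data_stats` is keyed by column name and each entry holds a `statistics` dictionary. The names `list_columns`, `column_statistic`, and `mock_report` are hypothetical, and the mock values are hand-written rather than produced by DataProfiler.

```python
from typing import Any, Dict, List


def list_columns(report: Dict[str, Any]) -> List[str]:
    """Return the column names that have an entry under `data_stats`."""
    return list(report["data_stats"].keys())


def column_statistic(report: Dict[str, Any], column: str, stat: str,
                     default: Any = None) -> Any:
    """Look up one statistic for one column; return `default` if it is absent."""
    column_entry = report["data_stats"].get(column, {})
    return column_entry.get("statistics", {}).get(stat, default)


# Hand-written stand-in for a report (assumption: not real DataProfiler output).
mock_report = {
    "global_stats": {},  # overall dataset properties, omitted in this sketch
    "data_stats": {
        "col_int": {"statistics": {"min": -1, "max": 1}},
        "col_float": {"statistics": {"min": 2.0, "max": 3.0}},
    },
}

for name in list_columns(mock_report):
    print(name, column_statistic(mock_report, name, "min"),
          column_statistic(mock_report, name, "max"))
```

Returning a default for a missing column or statistic keeps the lookup usable when some properties have been disabled through the profiler options.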

In addition to reading input data from multiple file types, DataProfiler accepts a pandas DataFrame as input.

In [None]:
# run data profiler and get the report
my_dataframe = pd.DataFrame([[1, 2.0],[1, 2.2],[-1, 3]], columns=['col_int', 'col_float'])
profile = dp.Profiler(my_dataframe)
report  = profile.report(report_options={"output_format":"compact"})

# Print the report
print(json.dumps(report, indent=4))

## Profiler options

DataProfiler can run only selected components if needed. For example, users who only want statistical information may turn off the DataLabeler functionality. Below, we disable the histogram and data labeler components while running DataProfiler.

In [None]:
profile_options = dp.ProfilerOptions()
profile_options.set({"histogram.is_enabled": False,
                     "data_labeler.is_enabled": False,})

profile = dp.Profiler(data, profiler_options=profile_options)
report  = profile.report(report_options={"output_format":"compact"})

# Print the report
print(json.dumps(report, indent=4))
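
The option names above use a dotted-key convention, where `"histogram.is_enabled"` addresses the `is_enabled` field of the `histogram` component. The sketch below illustrates that convention on a plain nested dictionary. It is not how `ProfilerOptions` is implemented; `set_dotted` and `settings` are hypothetical names used only for illustration.

```python
from typing import Any, Dict


def set_dotted(settings: Dict[str, Any], updates: Dict[str, Any]) -> Dict[str, Any]:
    """Apply updates such as {"a.b": 1} to a nested dict, in place."""
    for dotted_key, value in updates.items():
        # Split "histogram.is_enabled" into parents ["histogram"] and leaf "is_enabled".
        *parents, leaf = dotted_key.split(".")
        node = settings
        for name in parents:
            # Walk down one level, creating it if it does not exist yet.
            node = node.setdefault(name, {})
        node[leaf] = value
    return settings


# Hypothetical starting configuration with both components switched on.
settings = {
    "histogram": {"is_enabled": True},
    "data_labeler": {"is_enabled": True},
}
set_dotted(settings, {"histogram.is_enabled": False,
                      "data_labeler.is_enabled": False})
print(settings)
```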

## Data reader class

DataProfiler can detect multiple file types, including CSV, JSON, Parquet, AVRO, and text. The example below shows that it correctly detects the data type of files from each category, regardless of their file extensions.

In [None]:
# use data reader to read input data with different file types
csv_files = [
    "../dataprofiler/tests/data/csv/aws_honeypot_marx_geo.csv",
    "../dataprofiler/tests/data/csv/all-strings-skip-header-author.csv", # csv file with the author/description on the first line
    "../dataprofiler/tests/data/csv/sparse-first-and-last-column-empty-first-row.txt", # csv file with the .txt extension
]
json_files = [
    "../dataprofiler/tests/data/json/complex_nested.json",
    "../dataprofiler/tests/data/json/honeypot_intentially_mislabeled_file.csv", # json file with the .csv extension
]
parquet_files = [
    "../dataprofiler/tests/data/parquet/nation.dict.parquet",
    "../dataprofiler/tests/data/parquet/nation.plain.intentionally_mislabled_file.csv", # parquet file with the .csv extension
]
avro_files = [
    "../dataprofiler/tests/data/avro/userdata1.avro",
    "../dataprofiler/tests/data/avro/userdata1_intentionally_mislabled_file.json", # avro file with the .json extension
]
text_files = [
    "../dataprofiler/tests/data/txt/discussion_reddit.txt",
]

all_files = {
    'csv': csv_files,
    'json': json_files,
    'parquet': parquet_files,
    'avro': avro_files,
    'text': text_files
}

for file_type in all_files:
    print(file_type)
    for file in all_files[file_type]:
        data = dp.Data(file)
        print("{:<85} {:<15}".format(file, data.data_type))
    print('\n')
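
To give a feel for how a file type can be recognized from its content instead of its extension, here is a rough sketch. It is not DataProfiler's detection logic. It relies on two documented markers, Parquet files beginning with the bytes `PAR1` and Avro object container files beginning with `Obj` followed by byte `0x01`, and falls back to simple heuristics for JSON, CSV, and text. The function name `sniff_format` is hypothetical.

```python
import json


def sniff_format(raw: bytes) -> str:
    """Guess a format label from raw file content (illustrative heuristics only)."""
    # Binary formats announce themselves with magic bytes at the start.
    if raw[:4] == b"PAR1":
        return "parquet"
    if raw[:4] == b"Obj\x01":
        return "avro"

    try:
        text = raw.decode("utf-8").strip()
    except UnicodeDecodeError:
        return "unknown"

    # JSON: content starts with an object or array and parses as a whole.
    if text[:1] in ("{", "["):
        try:
            json.loads(text)
            return "json"
        except json.JSONDecodeError:
            pass

    # CSV: assume a comma in the first line (a deliberately crude check).
    lines = text.splitlines()
    if lines and "," in lines[0]:
        return "csv"
    return "text"


samples = {
    "table.csv": b"PAR1\x15\x00",       # Parquet content behind a .csv name
    "records.json": b"Obj\x01\x04",     # Avro content behind a .json name
    "data.txt": b"a,b\n1,2\n",          # CSV content behind a .txt name
}
for name, content in samples.items():
    print("{:<15} {}".format(name, sniff_format(content)))
```

A real detector would need to handle other delimiters, header lines, and line-delimited JSON, all of which this sketch ignores.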

The `Data` class detects the file type and uses one of the following classes: `CSVData`, `JSONData`, `ParquetData`, `AVROData`, `TextData`. Users can call these specific classes directly if desired. For example, the cell below takes a collection of files of different types and processes each one with its corresponding data class.

In [None]:
# use individual data reader classes
from dataprofiler.data_readers.csv_data import CSVData
from dataprofiler.data_readers.json_data import JSONData
from dataprofiler.data_readers.parquet_data import ParquetData
from dataprofiler.data_readers.avro_data import AVROData
from dataprofiler.data_readers.text_data import TextData

csv_files = "../dataprofiler/tests/data/csv/aws_honeypot_marx_geo.csv"
json_files = "../dataprofiler/tests/data/json/complex_nested.json"
parquet_files = "../dataprofiler/tests/data/parquet/nation.dict.parquet"
avro_files = "../dataprofiler/tests/data/avro/userdata1.avro"
text_files = "../dataprofiler/tests/data/txt/discussion_reddit.txt"

all_files = {
    'csv': [csv_files, CSVData],
    'json': [json_files, JSONData],
    'parquet': [parquet_files, ParquetData],
    'avro': [avro_files, AVROData],
    'text': [text_files, TextData],
}

for file_type in all_files:
    file, data_reader = all_files[file_type]
    data = data_reader(file)
    print('File name {}\n'.format(file))
    if file_type == 'text':
        print(data.data[0][:1000]) # print the first 1000 characters
    else:
        print(data.data)
    print('===============================================================================')

## Update profiles

One of the interesting features of the DataProfiler is the ability to update profiles from batches of data, which makes DataProfiler applicable to data streaming. In this section, the original dataset is split into two batches of equal size. The profile is built from the first batch and then updated with the second.

In [None]:
# read the input data and divide it into two equal halves
data = dp.Data("../dataprofiler/tests/data/csv/aws_honeypot_marx_geo.csv")
df = data.data
df1 = df.iloc[:int(len(df)/2)]
df2 = df.iloc[int(len(df)/2):]

# get profile for the first half
profile = dp.Profiler(df1)

# Update the profile with the second half
profile.update_profile(df2)
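
To see why batch updates are possible at all, consider how a few statistics can be maintained incrementally. The sketch below uses Welford's online algorithm to keep a running count, mean, and variance, plus min and max, without storing earlier batches. It illustrates the general idea only and is not DataProfiler's internal implementation; `RunningStats` is a hypothetical class, and the sample variance (dividing by `count - 1`) is an assumption made here.

```python
import math
from dataclasses import dataclass
from typing import Iterable


@dataclass
class RunningStats:
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # running sum of squared deviations from the mean
    minimum: float = math.inf
    maximum: float = -math.inf

    def update(self, batch: Iterable[float]) -> None:
        """Fold a new batch into the running statistics (Welford's method)."""
        for x in batch:
            self.count += 1
            delta = x - self.mean
            self.mean += delta / self.count
            # Uses the deviation from the old mean and from the updated mean.
            self.m2 += delta * (x - self.mean)
            self.minimum = min(self.minimum, x)
            self.maximum = max(self.maximum, x)

    @property
    def variance(self) -> float:
        """Sample variance; undefined (NaN) for fewer than two values."""
        return self.m2 / (self.count - 1) if self.count > 1 else float("nan")


values = [4.0, 7.0, 13.0, 16.0]
half = len(values) // 2

streamed = RunningStats()
streamed.update(values[:half])   # first batch
streamed.update(values[half:])   # second batch arrives later

print(streamed.count, streamed.mean, streamed.variance)
```

Only a handful of numbers are carried between batches, which is what makes this style of update attractive for streaming data.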

## Merge profiles

In addition to profile updates, DataProfiler provides merging functionality, which allows users to combine profiles built in multiple locations. This enables DataProfiler to be used in a distributed computing environment. Below, we assume that the two aforementioned halves of the original dataset come from two different machines. Each half is profiled on its own machine, and the resulting profiles are then merged.

In [None]:
# get profile for the first half
profile1 = dp.Profiler(df1)

# get profile for the second half
profile2 = dp.Profiler(df2)

# merge profiles
profile = profile1 + profile2

## More on the profile merge for the statistics update

After merging, we expect the resulting profile to give the same statistics as a profile built from the full dataset. The following example demonstrates this by checking several statistics of the given dataset. First, let's see the full list of statistics collected for a selected column.

In [None]:
# run data profiler with labeler disabled (just to reduce running time)
profile_options = dp.ProfilerOptions()
profile_options.set({"data_labeler.is_enabled": False,})
profile_full = dp.Profiler(df, profiler_options=profile_options)

# select `int_col` column from the data to show the statistics update
report = profile_full.report()
selected_col = 'int_col'
stats = report['data_stats'][selected_col]['statistics']
print(stats)

Now, let's choose several statistics, `min`, `max`, `mean`, `variance`, and `stddev`, to check the profile merging. We'll see that the statistics from the merged profile are the same as those from the profile built on the full dataset.

In [None]:
import pandas as pd

profile1 = dp.Profiler(df1, profiler_options=profile_options)
profile2 = dp.Profiler(df2, profiler_options=profile_options)
profile_merge = profile1 + profile2

list_stats = ['min', 'max', 'mean', 'variance', 'stddev']
result_stats = {stat:[] for stat in list_stats}
list_profiles = [profile1, profile2, profile_merge, profile_full]
selected_col = 'int_col'

for profile in list_profiles:
    report = profile.report()
    stats = report['data_stats'][selected_col]['statistics']
    for stat in result_stats:
        result_stats[stat].append(stats[stat])
result_stats = pd.concat([pd.DataFrame({'Profile': ['profile1', 'profile2', 'profile_merge', 'profile_full']}), 
                          pd.DataFrame(result_stats)], axis=1)
result_stats
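
The agreement between merged and full-dataset statistics follows from the fact that these statistics can be combined exactly from per-partition summaries. The sketch below shows one standard way to do this, the pairwise update formula for mean and variance often attributed to Chan et al. It is an illustration under stated assumptions, not DataProfiler's implementation: `Summary` is a hypothetical class, its `+` operator only mirrors the `profile1 + profile2` syntax, and the sample variance is assumed.

```python
import math
import statistics
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Summary:
    count: int
    mean: float
    m2: float  # sum of squared deviations from the mean
    minimum: float
    maximum: float

    @classmethod
    def from_values(cls, values: Sequence[float]) -> "Summary":
        """Summarize one non-empty partition of the data."""
        mean = statistics.fmean(values)
        m2 = sum((x - mean) ** 2 for x in values)
        return cls(len(values), mean, m2, min(values), max(values))

    def __add__(self, other: "Summary") -> "Summary":
        """Combine two summaries without revisiting the raw values."""
        count = self.count + other.count
        delta = other.mean - self.mean
        mean = self.mean + delta * other.count / count
        # The cross term corrects for the two partitions having different means.
        m2 = self.m2 + other.m2 + delta ** 2 * self.count * other.count / count
        return Summary(count, mean, m2,
                       min(self.minimum, other.minimum),
                       max(self.maximum, other.maximum))

    @property
    def variance(self) -> float:
        """Sample variance (assumes at least two values)."""
        return self.m2 / (self.count - 1)


part1 = [2.0, 4.0, 4.0]
part2 = [4.0, 5.0, 5.0, 7.0, 9.0]

merged = Summary.from_values(part1) + Summary.from_values(part2)
full = Summary.from_values(part1 + part2)

print(merged.mean, merged.variance)
print(full.mean, full.variance)
```

The merged and full summaries agree up to floating-point rounding, so comparisons between them are better made with a tolerance, for example via `math.isclose`, than with exact equality.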

## Conclusion

We have walked through some basic examples of DataProfiler usage with different input data types and profiling options. We also worked with the update and merge functionality of the DataProfiler, which makes it applicable to data streaming and distributed environments. Interested users can experiment with different datasets and functionality as desired.