# Dataloader with Popmon Reports

This demo shows how to use popmon with the dataloader from the dataprofiler.

It covers the following:

    - How to install popmon
    - Comparison of the dynamic dataloader from dataprofiler to the 
        standard dataloader used in pandas
    - Popmon's usage example using both dataloaders
    - Dataprofiler's examples using both dataloaders
    - Usage of the pm_stability_report function (popmon reports)


## How to Install Popmon
To install popmon you can use the command below:

`pip3 install popmon`
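After installing, a quick check along the lines of the sketch below can confirm that the package is importable in the active environment. The helper name `is_installed` is made up for this illustration; `importlib.util.find_spec` is part of the Python standard library.

```python
import importlib.util


def is_installed(package_name):
    """Return True if a top-level package can be found on the import path."""
    return importlib.util.find_spec(package_name) is not None


# Prints True only if popmon is available in this environment
print("popmon installed:", is_installed("popmon"))
```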


From here, we can import the libraries needed for this demo.

In [None]:
import os
import sys
try:
    sys.path.insert(0, '..')
    import dataprofiler as dp
except ImportError:
    import dataprofiler as dp
import pandas as pd
import popmon  # noqa: F401 (importing popmon adds pm_stability_report to pandas DataFrames)

## Comparison of Dataloaders

First, we have the original pandas data loading, which works for specific file types.
This works well when the data format is known ahead of time, but it is less useful for more dynamic cases.

In [None]:
def pm_dataloader(path, time_index):
    # Load the dataframe with pandas (reads only CSVs unless the reader is changed)
    if time_index is not None:
        # Parse the time column as datetimes while reading
        pm_data = pd.read_csv(path, parse_dates=[time_index])
    else:
        pm_data = pd.read_csv(path)
    return pm_data
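To make the "unless the reader is changed" caveat concrete, here is a minimal sketch of what supporting a second format with plain pandas could look like: the caller has to map each file extension to a reader by hand. The names `READERS` and `load_by_extension` are hypothetical and exist only for this illustration; this is not how the dataprofiler detects formats.

```python
import os
import tempfile

import pandas as pd

# Hypothetical mapping from file extension to a pandas reader
READERS = {".csv": pd.read_csv, ".json": pd.read_json}


def load_by_extension(path):
    """Pick a pandas reader from the file extension (illustration only)."""
    ext = os.path.splitext(path)[1].lower()
    if ext not in READERS:
        raise ValueError(f"No reader registered for '{ext}' files")
    return READERS[ext](path)


# Write the same two records as CSV and as JSON, then load both
with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "sample.csv")
    json_path = os.path.join(tmp, "sample.json")
    with open(csv_path, "w") as f:
        f.write("value,label\n1,a\n2,b\n")
    with open(json_path, "w") as f:
        f.write('[{"value": 1, "label": "a"}, {"value": 2, "label": "b"}]')
    csv_df = load_by_extension(csv_path)
    json_df = load_by_extension(json_path)

print(csv_df.shape, json_df.shape)
```

Every new format needs another entry in the mapping, which is the maintenance burden a dynamic loader is meant to remove.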

Next, we have the dataprofiler's dataloader. It loads different data formats dynamically, which is useful when the data format is not known ahead of time.
It is intended as an improvement on the standard pandas data loading.

In [None]:
def dp_dataloader(path, time_index):
    # Dataloader from dataprofiler used
    dp_data = dp.Data(path)

    # Make dataframe from dataprofiler popmon compliant
    if time_index is not None:
        # Mimics the optional parse_dates=[time_index] param of pd.read_csv
        dp_data.data[time_index] = pd.to_datetime(dp_data.data[time_index])

    return dp_data.data, time_index
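The comment about mimicking `parse_dates` can be checked on a small in-memory CSV. The sketch below uses made-up data (the `DELAY` column and its values are invented for the example) and compares parsing dates during `pd.read_csv` with converting the column afterwards through `pd.to_datetime`.

```python
import io

import pandas as pd

# Made-up sample data for illustration
csv_text = "DATE,DELAY\n2015-07-02,5\n2015-07-09,12\n"

# Option 1: parse the date column while reading
parsed_on_read = pd.read_csv(io.StringIO(csv_text), parse_dates=["DATE"])

# Option 2: read first, then convert the column afterwards
parsed_after = pd.read_csv(io.StringIO(csv_text))
parsed_after["DATE"] = pd.to_datetime(parsed_after["DATE"])

# Both columns should hold datetimes with identical values
same_values = (parsed_on_read["DATE"] == parsed_after["DATE"]).all()
print(parsed_on_read["DATE"].dtype, parsed_after["DATE"].dtype, same_values)
```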

## Popmon's usage example using both dataloaders

To execute this example, we need a dataset from popmon's "resources" component, so we must import that into our project as well.

In [None]:
from popmon import resources

Next, we'll get a dataset from the component and decompress it into the working directory.

In [None]:
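import gzip
import shutil

# Path to the gzipped tutorial dataset provided by popmon
popmon_tutorial_data = resources.data("flight_delays.csv.gz")

# Decompress it to a plain CSV in the working directory
with gzip.open(popmon_tutorial_data, 'rb') as f_in:
    with open('./flight_delays.csv', 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)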

Finally, we read in the data with the pandas-based loader, generate the popmon report, and write it to a file.

In [None]:
# Default csv from popmon example
path = "./flight_delays.csv"
time_index = "DATE"
report_output_dir = "./popmon_output/flight_delays_full"
# Create the output directory if it does not exist yet
os.makedirs(report_output_dir, exist_ok=True)


In [None]:
pm_data = pm_dataloader(path, time_index)

report_pm_loader = pm_data.pm_stability_report(
    time_axis=time_index,
    time_width="1w",
    time_offset="2015-07-02",
    extended_report=False,
    pull_rules={"*_pull": [10, 7, -7, -10]},
)
# Save pm reports
report_pm_loader.to_file(os.path.join(report_output_dir, "popmon_loader_report.html"))
print("Report printed at:", os.path.join(report_output_dir, "popmon_loader_report.html"))
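The `time_width` and `time_offset` arguments above describe fixed-width time bins anchored at a start date. As a rough mental model (a sketch using only the standard library, not popmon's internal code), a row's bin can be thought of as the number of whole weeks between the offset and its timestamp. The function name `time_bin_index` is hypothetical.

```python
from datetime import datetime, timedelta

# Assumed to correspond to time_width="1w" and time_offset="2015-07-02"
WIDTH = timedelta(weeks=1)
OFFSET = datetime(2015, 7, 2)


def time_bin_index(timestamp, offset=OFFSET, width=WIDTH):
    """Index of the fixed-width bin containing timestamp, counted from offset."""
    # Floor division of two timedeltas yields an integer number of bins
    return (timestamp - offset) // width


for day in (datetime(2015, 7, 1), datetime(2015, 7, 2), datetime(2015, 7, 9)):
    print(day.date(), time_bin_index(day))
```

Under this model, timestamps before the offset fall into negative bins, and each bin spans exactly seven days.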

We then do the same with the dataprofiler loader.

In [None]:
dp_dataframe, dp_time_index = dp_dataloader(path, time_index)
# Generate pm report using dp dataloader
report_dp_loader = dp_dataframe.pm_stability_report(
    time_axis=dp_time_index,
    time_width="1w",
    time_offset="2015-07-02",
    extended_report=False,
    pull_rules={"*_pull": [10, 7, -7, -10]},
)

report_dp_loader.to_file(os.path.join(report_output_dir, "dataprofiler_loader_report.html"))

print("Report printed at:", os.path.join(report_output_dir, "dataprofiler_loader_report.html"))
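The `pull_rules` argument used in both report calls sets traffic-light boundaries for the pull statistics. The sketch below assumes the four numbers are read as `[red high, yellow high, yellow low, red low]`, which is how the popmon documentation describes such boundaries. The function is a simplified illustration rather than popmon's implementation, and the treatment of values that land exactly on a boundary is an assumption.

```python
def traffic_light(value, bounds):
    """Classify value given [red_high, yellow_high, yellow_low, red_low]."""
    red_high, yellow_high, yellow_low, red_low = bounds
    if value > red_high or value < red_low:
        return "red"
    if value > yellow_high or value < yellow_low:
        return "yellow"
    return "green"


# Same boundaries as pull_rules={"*_pull": [10, 7, -7, -10]}
PULL_BOUNDS = [10, 7, -7, -10]

# Made-up pull values for illustration
for pull in (0.5, 8.2, -11.0):
    print(pull, traffic_light(pull, PULL_BOUNDS))
```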

## Examples of data
Next, we'll use some data from the data profiler's test files to compare the dynamic loading of the dataprofiler's data loader with the standard pandas approach.


## Dataprofiler's examples using both dataloaders

To execute this properly, run only one of the three example cells below, then run the report generation cells that follow.

In [None]:
# Default csv from popmon example (mini version)
path = "../dataprofiler/tests/data/csv/flight_delays.csv"
time_index = "DATE"
report_output_dir = "./popmon_output/flight_delays_mini"


In [None]:
# Random csv from dataprofiler tests
path = "../dataprofiler/tests/data/csv/aws_honeypot_marx_geo.csv"
time_index = "datetime"
report_output_dir = "./popmon_output/aws_honeypot_marx_geo"

In [None]:
# Random json file from dataprofiler tests
path = "../dataprofiler/tests/data/json/math.json"

time_index = "data.9"
report_output_dir = "./popmon_output/math"
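As an alternative to running only one of the three cells above, the same settings could be collected in a dictionary and selected by name. This is a sketch of one possible arrangement; `EXAMPLES` and `select_example` are hypothetical names, and the paths are copied from the cells above.

```python
# Hypothetical registry: name -> (path, time_index, report_output_dir)
EXAMPLES = {
    "flight_delays_mini": (
        "../dataprofiler/tests/data/csv/flight_delays.csv",
        "DATE",
        "./popmon_output/flight_delays_mini",
    ),
    "aws_honeypot_marx_geo": (
        "../dataprofiler/tests/data/csv/aws_honeypot_marx_geo.csv",
        "datetime",
        "./popmon_output/aws_honeypot_marx_geo",
    ),
    "math": (
        "../dataprofiler/tests/data/json/math.json",
        "data.9",
        "./popmon_output/math",
    ),
}


def select_example(name):
    """Return (path, time_index, report_output_dir) for a named example."""
    if name not in EXAMPLES:
        raise KeyError(f"Unknown example '{name}'; choose from {sorted(EXAMPLES)}")
    return EXAMPLES[name]


path, time_index, report_output_dir = select_example("math")
print(path, time_index, report_output_dir)
```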

Run the block below to create an output directory for your popmon reports and to load the selected data with the dataprofiler's dataloader.

In [None]:
# Create the output directory if it does not exist yet
os.makedirs(report_output_dir, exist_ok=True)
dp_dataframe, dp_time_index = dp_dataloader(path, time_index)

## Report comparison

Below, we generate reports from the data loaded by the dataprofiler and by pandas.


The dataprofiler's dataloader can seamlessly switch between data formats and generate reports with the exact same code in place.

In [None]:
# Generate pm report using dp dataloader
report_dp_loader = dp_dataframe.pm_stability_report(
    time_axis=dp_time_index,
    time_width="1w",
    time_offset="2015-07-02",
    extended_report=False,
    pull_rules={"*_pull": [10, 7, -7, -10]},
)

Unlike the dataprofiler loader, the pandas loader runs into problems with some inputs (for example, the JSON data example above). It cannot handle a different data format unless the loading code is changed to tell it about the new format.

In [None]:
# Generate pm report using pm dataloader
loader_fail = False
try:
    pm_data = pm_dataloader(path, time_index)
except Exception as e:
    loader_fail = True
    print(e)
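The failure mode can be reproduced without any files. In the sketch below, two lines of made-up JSON are passed to `pd.read_csv` together with `parse_dates`. Because the text is split on commas, no column named exactly `DATE` exists and pandas is expected to raise an error. The exact exception type and message may differ between pandas versions, so the example catches broadly, mirroring the cell above.

```python
import io

import pandas as pd

# Made-up JSON-lines content standing in for a JSON file
json_lines = '{"DATE": "2015-07-02", "x": 1}\n{"DATE": "2015-07-09", "x": 2}\n'

error_message = None
try:
    # The CSV reader treats the JSON text as comma-separated values
    pd.read_csv(io.StringIO(json_lines), parse_dates=["DATE"])
except Exception as e:  # broad on purpose, as in the cell above
    error_message = f"{type(e).__name__}: {e}"

print(error_message)
```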

## Report generation 
(using popmon's pm_stability_report)

If both dataloaders succeeded, the reports are written to the output directory. The path of each report is printed below its code block (the two code blocks below), so you can open the reports and compare them.

In [None]:
if not loader_fail:
    report_pm_loader = pm_data.pm_stability_report(
        time_axis=time_index,
        time_width="1w",
        time_offset="2015-07-02",
        extended_report=False,
        pull_rules={"*_pull": [10, 7, -7, -10]},
    )
    # Save pm reports
    report_pm_loader.to_file(os.path.join(report_output_dir, "popmon_loader_report.html"))
    print("Report printed at:", os.path.join(report_output_dir, "popmon_loader_report.html"))
    

In [None]:
# Save dp reports
report_dp_loader.to_file(os.path.join(report_output_dir, "dataprofiler_loader_report.html"))
print("Report printed at:", os.path.join(report_output_dir, "dataprofiler_loader_report.html"))

