Due to the long runtime of this feature and issues with displaying widgets on GitHub, we provide an HTML version of this notebook that has already been run.\
Please see the HTML file with the same name as this notebook to view those results.

# Automated Data Exploration Using LLMs Demo

Pd-Explain can use an LLM to conduct automated data exploration.\
\
The user provides a query describing what they wish to explore, and we use an LLM to generate queries aimed at discovering information relevant to that request.\
Each generated query is analyzed using our explainers (specifically, the FEDEx and MetaInsight explainers), and the results are fed back to the LLM. The process is iterative: in each iteration the LLM creates more queries to glean further information relevant to the user's request, building up a query tree.\
At the end of the process, a tab widget presents the user with a summary of findings, visualizations of the queries referenced when producing the conclusions, and a log of the process itself.
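
The loop described above can be pictured roughly as follows. This is a minimal sketch of the idea, not pd-explain's actual implementation: `QueryNode`, `explore`, `propose_queries`, and `analyze` are hypothetical names, and the LLM and the explainers are replaced by plain callables so the example runs on its own.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class QueryNode:
    """One node of the query tree: a query plus the findings it produced."""
    query: str
    findings: List[str] = field(default_factory=list)
    children: List["QueryNode"] = field(default_factory=list)


def explore(
    user_query: str,
    propose_queries: Callable[[str, List[QueryNode]], List[Tuple[int, str]]],
    analyze: Callable[[str], List[str]],
    num_iterations: int = 3,
) -> QueryNode:
    """Grow a query tree rooted at the user's request.

    propose_queries stands in for the LLM: given the request and the nodes
    explored so far, it returns (parent_index, query) pairs.
    analyze stands in for the explainers: it returns findings for a query.
    """
    root = QueryNode(query=user_query)
    nodes = [root]  # flat view of the tree, used to address parents by index
    for _ in range(num_iterations):
        for parent_index, query in propose_queries(user_query, nodes):
            child = QueryNode(query=query, findings=analyze(query))
            nodes[parent_index].children.append(child)
            nodes.append(child)
    return root


# Stand-ins for the LLM and the explainers (hypothetical, for illustration only)
def fake_llm(request, nodes):
    # Always extend the most recently added node with one new query
    return [(len(nodes) - 1, f"query_{len(nodes)}")]


def fake_explainers(query):
    return [f"finding for {query}"]


tree = explore("effect of education on capital-gain", fake_llm, fake_explainers)
```

With these stand-ins, each iteration adds one query beneath the previous one, so the resulting tree is a single chain three levels deep. A real LLM would typically branch from several earlier nodes.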

## Import pd-explain and load data

In [None]:
import pandas as pd
import pd_explain

In [None]:
adults = pd.read_csv("../Datasets/adult.csv")
adults

#### Note about LLM performance for automated data exploration

In our experience, among the free models available, deepseek-ai/DeepSeek-R1-Distill-Llama-70B-free works best for this task: it gives the best performance in terms of producing valid and relevant queries.

# Using automated data exploration

## Automated data exploration with a user defined query

Using automated data exploration is simple.\
Note, however, that it may take several minutes to run.\
Once it finishes, a widget with several tabs will open.

In [None]:
adults.automated_data_exploration(user_query="Explore the effect that education and occupation have on one's capital-loss and capital-gain")

#### Other parameters

There are other parameters that can be controlled:

- `num_iterations`: Number of iterations to run the deep dive analysis. Default is 10. Note that each iteration will call the LLM once.
- `fedex_top_k`: Number of top findings to return from the FEDEx explainer. Default is 3.
- `metainsight_top_k`: Number of top findings to return from the MetaInsight explainer. Default is 2.
- `metainsight_max_filter_cols`: Maximum number of columns to analyze distribution of in the MetaInsight explainer. Default is 3.
- `metainsight_max_agg_cols`: Maximum number of columns to aggregate by in the MetaInsight explainer. Default is 3.
- `max_iterations_to_add`: Maximum number of extra iterations to add when the LLM fails during some iterations. Default is 3. This helps mitigate cases where the LLM fails in too many iterations and therefore does not gather enough information. A failure is either a failure to generate queries or an iteration in which all generated queries produced zero findings.
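
As a sketch, the helper below spells out every parameter from the list above at its documented default. The `run_exploration` wrapper, the `EXPLORATION_DEFAULTS` dictionary, and `max_llm_calls` are hypothetical conveniences, not part of pd-explain; only the parameter names and default values come from the list. The upper bound on LLM calls assumes that each added iteration also calls the LLM once, which the description suggests but does not state outright.

```python
# Documented defaults, written out explicitly (names and values from the list above)
EXPLORATION_DEFAULTS = {
    "num_iterations": 10,
    "fedex_top_k": 3,
    "metainsight_top_k": 2,
    "metainsight_max_filter_cols": 3,
    "metainsight_max_agg_cols": 3,
    "max_iterations_to_add": 3,
}


def run_exploration(df, user_query, **overrides):
    """Hypothetical helper: call automated_data_exploration with explicit settings."""
    params = {**EXPLORATION_DEFAULTS, **overrides}
    return df.automated_data_exploration(user_query=user_query, **params)


def max_llm_calls(params):
    """Upper bound on LLM calls, assuming one call per regular or added iteration."""
    return params["num_iterations"] + params["max_iterations_to_add"]


print(max_llm_calls(EXPLORATION_DEFAULTS))  # 13 under the stated assumption
```

For a quicker and cheaper run, you might call `run_exploration(adults, "your query", num_iterations=5)`, which overrides only that one setting.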

#### Saving and re-loading results

You can save the results once the automated exploration has finished and re-load them later. This avoids re-running the process, which costs time and tokens and is likely to give a non-identical (though typically somewhat similar) result.

To save the results, use:

In [None]:
adults.save_data_exploration(file_path="data_exploration_example.dill")

To load and visualize the results again, use:

In [None]:
adults.visualize_from_saved_data_exploration(file_path="data_exploration_example.dill")
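
The two calls can be combined so that the expensive exploration only runs when no saved result exists yet. This is a sketch: `explore_or_load` is a hypothetical helper, and it assumes the three methods behave as shown in the cells above (in particular, that `save_data_exploration` stores the most recent exploration run on that DataFrame).

```python
from pathlib import Path


def explore_or_load(df, user_query, file_path):
    """Hypothetical helper: reuse a saved exploration if one exists, else run and save."""
    if Path(file_path).exists():
        # Re-display the saved results without calling the LLM again
        return df.visualize_from_saved_data_exploration(file_path=file_path)
    # No saved file yet: run the exploration, then persist it for next time
    result = df.automated_data_exploration(user_query=user_query)
    df.save_data_exploration(file_path=file_path)
    return result
```

A call such as `explore_or_load(adults, "your query", "data_exploration_example.dill")` would then run the exploration the first time and only re-display the saved results on later runs. Delete the file if you want a fresh exploration.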

## Automated follow up on explainer results using automated data exploration

You can use the automated data exploration feature to follow up on the results you got from the explainer, particularly if you added LLM reasoning to it.\
The follow-up feature automatically passes your selected explanations and formats a query for you, instructing the LLM to draw out more information about the explanation(s), add context to them, and potentially corroborate any reasoning added by an LLM.

In [None]:
low_income = adults[adults['label'] == '<=50K']
low_income.explain(top_k=4)

The follow-up feature requires the index of the explanation to follow up on. Indices are always 0-based.\
For FEDEx, top left is 0, top right is 1, and so on. The same applies to MetaInsight.\
For the many to one explainer, the index is the row number of the explanation in the table.\
For the outlier explainer, this parameter is not required, as it only has one explanation.

In [None]:
# You need to use the same DF you called 'explain' on
low_income.follow_up_with_automated_data_exploration(0)
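
If you find it easier to think in terms of grid position, the index convention described above can be sketched as a row-major mapping. The helper name `grid_position_to_index` and the two-column default are assumptions for illustration (the text only states that top left is 0 and top right is 1), so check the layout of your own output before relying on it.

```python
def grid_position_to_index(row, col, n_cols=2):
    """Map a 0-based (row, col) position in the explanation grid to a 0-based index.

    Assumes explanations are laid out row by row, n_cols per row.
    """
    if row < 0 or not 0 <= col < n_cols:
        raise ValueError("position is outside the grid")
    return row * n_cols + col


# Second row, left plot, in an assumed two-column layout
index = grid_position_to_index(row=1, col=0)
```

The resulting `index` could then be passed on, for example as `low_income.follow_up_with_automated_data_exploration(index)`.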