# LLM Features Demo

PD-Explain supports integration with LLMs (Large Language Models) to provide additional features and capabilities.\
In this demo, we will explore the LLM features of PD-Explain.

## Import pd-explain and load data

In [1]:
import pandas as pd
import pd_explain



In [2]:
adults = pd.read_csv("../Datasets/adult.csv")
adults

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,label
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48837,39,Private,215419,Bachelors,13,Divorced,Prof-specialty,Not-in-family,White,Female,0,0,36,United-States,<=50K
48838,64,?,321403,HS-grad,9,Widowed,?,Other-relative,Black,Male,0,0,40,United-States,<=50K
48839,38,Private,374983,Bachelors,13,Married-civ-spouse,Prof-specialty,Husband,White,Male,0,0,50,United-States,<=50K
48840,44,Private,83891,Bachelors,13,Divorced,Adm-clerical,Own-child,Asian-Pac-Islander,Male,5455,0,40,United-States,<=50K


## Automated Data Exploration

PD-Explain can use an LLM to conduct automated data exploration.\
\
Given a description of what the user wishes to explore, an LLM generates queries aimed at discovering information relevant to that request.\
Each query is analyzed using our explainers (specifically, the FEDEx and MetaInsight explainers), and the results are fed back to the LLM. In each iteration, the LLM creates more queries to glean further relevant information, building up a query tree.\
At the end of the process, the user receives a summary of findings, visualizations of the queries deemed most important to the conclusions, and the query tree drawn as a graph, where clicking a node shows its query and (where possible) a visualization of its findings.
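
The iterative process described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not PD-Explain's actual implementation: `QueryNode`, `explore`, `propose_queries`, and `analyze` are hypothetical names, and the LLM and the explainers are replaced by stub functions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QueryNode:
    """One node of the query tree (hypothetical structure)."""
    node_id: int
    parent_id: Optional[int]
    query: str
    findings: str = ""


def explore(user_query, propose_queries, analyze, num_iterations=3):
    """Grow a query tree: one LLM call per iteration, then analyze each new query."""
    nodes = [QueryNode(node_id=0, parent_id=None, query=user_query)]
    for _ in range(num_iterations):
        # The LLM sees every node so far and returns (parent_id, query) pairs.
        proposals = propose_queries(user_query, nodes)
        if not proposals:
            break
        for parent_id, query in proposals:
            nodes.append(
                QueryNode(
                    node_id=len(nodes),
                    parent_id=parent_id,
                    query=query,
                    findings=analyze(query),  # stand-in for the explainers
                )
            )
    return nodes


def fake_llm(user_query, nodes):
    """Stub LLM: always refines the most recent node with two follow-up queries."""
    parent = nodes[-1]
    return [(parent.node_id, f"{parent.query} / q{i}") for i in range(2)]


def fake_analyze(query):
    """Stub explainer: returns a placeholder finding for the query."""
    return f"findings for {query}"


tree = explore("root", fake_llm, fake_analyze, num_iterations=2)
for node in tree:
    print(node.node_id, node.parent_id, node.query)
```

Storing a `parent_id` on each node is enough to redraw the tree as a graph afterwards, which is why the sketch keeps a flat list rather than nested objects.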

### Note about LLM performance for automated data exploration

In our experience, deepseek-ai/DeepSeek-R1-Distill-Llama-70B-free works best for this task, giving the best performance in terms of producing valid and relevant queries.\
On the other hand, gemini-2.5-flash often performs poorly, ignoring instructions and producing syntactically incorrect queries.
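
To make "syntactically correct" concrete, here is a small sketch of how a generated query string could be screened before it is executed. It assumes the queries are Python (pandas) expressions; the helper `is_syntactically_valid` is hypothetical and is not claimed to be the check PD-Explain performs.

```python
import ast


def is_syntactically_valid(query: str) -> bool:
    """Return True if `query` parses as a single Python expression."""
    try:
        ast.parse(query, mode="eval")
    except SyntaxError:
        return False
    return True


# Illustrative candidate queries, as an LLM might produce them.
candidates = [
    "df[df['age'] > 30]",
    "df.groupby('education')['capital-gain'].mean()",
    "df[df['age'] > 30",  # unbalanced bracket
]
valid = [q for q in candidates if is_syntactically_valid(q)]
print(f"{len(valid)} of {len(candidates)} candidate queries parse")
```

A parse check only catches malformed syntax. A query can still fail at run time, for example by referring to a column that does not exist.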

### Using automated data exploration

Using automated data exploration is simple.\
However, note that it may take a while to run (possibly several minutes).\
Once it finishes, a widget with several tabs is displayed.

In [3]:
adults.automated_data_exploration(user_query="Explore the effect that education and occupation have on one's capital-loss and capital-gain", verbose=True)

Initial plan generated by the LLM:

1. **Summarize Key Variables**: Generate summary statistics for education, occupation, capital-gain, and capital-loss to understand their distributions and central tendencies.

2. **Explore Capital-Gain and Capital-Loss Distributions**: Run queries to examine the distribution of capital-gain and capital-loss across the dataset, including mean, median, min, max, and standard deviation.

3. **Examine Education Distribution**: Generate a frequency distribution of the education variable to understand the prevalence of different education levels.

4. **Examine Occupation Distribution**: Generate a frequency distribution of the occupation variable to understand the prevalence of different occupations.

5. **Calculate Average Capital-Gain by Education**: Group the data by education and calculate the average capital-gain for each education level.

6. **Calculate Average Capital-Loss by Education**: Group the data by education and calculate the average capita

Tab(children=(HTML(value='\n            <div style=\'padding:20px; max-width:800px; line-height:1.5; font-fami…

There are other parameters that can be controlled:

- `num_iterations`: Number of iterations to run the deep dive analysis. Default is 10. Note that each iteration will call the LLM once.
- `queries_per_iteration`: Number of queries to generate per iteration. Default is 5. This number is not set in stone, and may go up during the process if the LLM's queries fail too often.
- `fedex_top_k`: Number of top findings to return from the FEDEx explainer. Default is 3.
- `metainsight_top_k`: Number of top findings to return from the MetaInsight explainer. Default is 2.
- `metainsight_max_filter_cols`: Maximum number of columns whose distribution is analyzed in the MetaInsight explainer. Default is 3.
- `metainsight_max_agg_cols`: Maximum number of columns to aggregate by in the MetaInsight explainer. Default is 3.
- `visualization_type`: The type of visualization for the query tree. Can be 'graph' for an interactive graph visualization, or 'simple' for a simpler, static HTML visualization. Default is 'graph'.
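
As a sketch, the documented defaults can be written out explicitly and passed as keyword arguments. The parameter names and default values are taken from the list above. The budget arithmetic is only a rough estimate based on the note that each iteration calls the LLM once, and the final call is left commented out because it needs the `adults` DataFrame and a configured LLM.

```python
# Defaults as documented in the list above, spelled out explicitly.
exploration_kwargs = {
    "num_iterations": 10,
    "queries_per_iteration": 5,
    "fedex_top_k": 3,
    "metainsight_top_k": 2,
    "metainsight_max_filter_cols": 3,
    "metainsight_max_agg_cols": 3,
    "visualization_type": "graph",
}

# Rough budget: one LLM call per iteration, and a nominal number of queries.
# The real query count may be higher if the LLM's queries fail often.
llm_calls = exploration_kwargs["num_iterations"]
nominal_queries = llm_calls * exploration_kwargs["queries_per_iteration"]
print(f"about {llm_calls} iteration calls, about {nominal_queries} queries")

# With the `adults` DataFrame from this notebook loaded, the call would be:
# adults.automated_data_exploration(
#     user_query="Explore the effect that education and occupation have on "
#                "one's capital-loss and capital-gain",
#     **exploration_kwargs,
# )
```

Lowering `num_iterations` or `queries_per_iteration` is the most direct way to shorten a run, at the cost of a shallower query tree.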

#### Saving and re-loading results

Results can be saved once the automated exploration finishes and re-loaded later. This avoids re-running the process, which costs time and tokens and is likely to give a different (though typically similar) result.

To save the results, use:

In [4]:
adults.save_data_exploration(file_path="data_exploration_example_3.dill")

To load and visualize the results again, use the following (`'graph'` is the default value of `visualization_type`):

In [3]:
adults.visualize_from_saved_data_exploration(file_path="data_exploration_example_3.dill", visualization_type='graph')

Tab(children=(HTML(value='\n            <div style=\'padding:20px; max-width:800px; line-height:1.5; font-fami…
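
One way to combine saving and loading is a small "load if saved, otherwise run" helper. The sketch below is an assumption-laden illustration: `load_or_run` is a hypothetical helper, not part of PD-Explain, and the PD-Explain calls appear only in the commented usage.

```python
from pathlib import Path


def load_or_run(file_path, run, load):
    """Call `load` when a saved result exists at `file_path`, otherwise call `run`."""
    if Path(file_path).exists():
        return load(file_path)
    return run(file_path)


# Hypothetical usage with the methods shown above (needs `adults` and an LLM):
#
# def run_and_save(path):
#     adults.automated_data_exploration(user_query="...")
#     adults.save_data_exploration(file_path=path)
#
# load_or_run(
#     "data_exploration_example_3.dill",
#     run=run_and_save,
#     load=lambda path: adults.visualize_from_saved_data_exploration(file_path=path),
# )
```

Passing the two actions in as callables keeps the helper independent of PD-Explain, so it can be tried out with plain functions first.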

### Loading pre-prepared results for this demo

Since the automated data exploration can take a while to run, we have prepared a file with the results of the automated data exploration for this demo.

In [None]:
# Uncomment the following line if you don't have gdown installed
#!pip install gdown

#### Example 1

In [12]:
import gdown
url = "https://drive.google.com/uc?export=download&id=1AlXSWy-P_Uif6BZ2pBcIZDyyvShFF5LV"
output = "data_exploration_example.dill"
gdown.download(url, output, quiet=False)

Downloading...
From (original): https://drive.google.com/uc?export=download&id=1AlXSWy-P_Uif6BZ2pBcIZDyyvShFF5LV
From (redirected): https://drive.google.com/uc?export=download&id=1AlXSWy-P_Uif6BZ2pBcIZDyyvShFF5LV&confirm=t&uuid=d3bacf88-4a62-4c28-8406-6c1cead829b1
To: C:\Users\Yuval\PycharmProjects\pd-explain\Examples\Notebooks\data_exploration_example.dill
100%|██████████| 157M/157M [00:04<00:00, 36.0MB/s] 


'data_exploration_example.dill'

In [17]:
adults.visualize_from_saved_data_exploration(file_path="data_exploration_example.dill")

Tab(children=(HTML(value='\n            <div style=\'padding:20px; max-width:800px; line-height:1.5; font-fami…

#### Example 2

In [14]:
import gdown
url = "https://drive.google.com/uc?export=download&id=19LC-vv7rHnIbqzzsn5RAoFm12erZAL0N"
output = "data_exploration_example2.dill"
gdown.download(url, output, quiet=False)

Downloading...
From: https://drive.google.com/uc?export=download&id=19LC-vv7rHnIbqzzsn5RAoFm12erZAL0N
To: C:\Users\Yuval\PycharmProjects\pd-explain\Examples\Notebooks\data_exploration_example2.dill
100%|██████████| 29.8M/29.8M [00:01<00:00, 25.9MB/s]


'data_exploration_example2.dill'

In [18]:
adults.visualize_from_saved_data_exploration(file_path="data_exploration_example2.dill")

Tab(children=(HTML(value='\n            <div style=\'padding:20px; max-width:800px; line-height:1.5; font-fami…
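
The two download cells above follow the same pattern, so they could be folded into one helper that skips files already on disk. This is a sketch: `drive_url` and `download_missing` are hypothetical helper names, the file IDs are copied from the cells above, and `gdown` is imported lazily so the URL helper works even when `gdown` is not installed.

```python
from pathlib import Path

DRIVE_URL = "https://drive.google.com/uc?export=download&id={file_id}"

# Output file name -> Google Drive file ID, copied from the cells above.
EXAMPLE_FILES = {
    "data_exploration_example.dill": "1AlXSWy-P_Uif6BZ2pBcIZDyyvShFF5LV",
    "data_exploration_example2.dill": "19LC-vv7rHnIbqzzsn5RAoFm12erZAL0N",
}


def drive_url(file_id):
    """Build the direct-download URL used in this notebook for a Drive file ID."""
    return DRIVE_URL.format(file_id=file_id)


def download_missing(files=EXAMPLE_FILES):
    """Download each example file unless it already exists locally."""
    import gdown  # lazy import: only needed when a download actually happens

    for output, file_id in files.items():
        if not Path(output).exists():
            gdown.download(drive_url(file_id), output, quiet=False)


# download_missing()  # uncomment to fetch both example files
```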

### Automated follow-up on explainer results using automated data exploration

You can use the automated data exploration feature to follow up on the results you got from an explainer, particularly if you added LLM reasoning to them.\
The follow-up feature automatically passes your selected explanations and formats a query for you, instructing the LLM to draw out more information about the explanation(s), add context to them, and potentially corroborate any reasoning added by an LLM.

In [None]:
low_income = adults[adults['label'] == '<=50K']
low_income.explain(top_k=4)

In [None]:
# For this example, we are using the explanations for low_income from the first section.
# The index passed is the index of the explanation, in this case, the top left plot.
low_income.follow_up_with_automated_data_exploration(explanation_index=0)

#### Pre-prepared results for this demo

There are two pre-prepared results for this demo: one without LLM reasoning and one with LLM reasoning.\
First, we will load the results without LLM reasoning:

In [None]:
# Uncomment the following line if you don't have gdown installed
#!pip install gdown

In [19]:
import gdown
url = "https://drive.google.com/uc?export=download&id=1zpkg2U0nFR0-k8wXpTFe01fOjWLfT7hr"
output = "follow_up_example.dill"
gdown.download(url, output, quiet=False)

Downloading...
From (original): https://drive.google.com/uc?export=download&id=1zpkg2U0nFR0-k8wXpTFe01fOjWLfT7hr
From (redirected): https://drive.google.com/uc?export=download&id=1zpkg2U0nFR0-k8wXpTFe01fOjWLfT7hr&confirm=t&uuid=72a9457c-7478-44ce-af80-e607f32b6294
To: C:\Users\Yuval\PycharmProjects\pd-explain\Examples\Notebooks\follow_up_example.dill
100%|██████████| 234M/234M [00:05<00:00, 41.2MB/s] 


'follow_up_example.dill'

In [20]:
adults.visualize_from_saved_data_exploration(file_path="follow_up_example.dill")

Tab(children=(HTML(value='\n            <div style=\'padding:20px; max-width:800px; line-height:1.5; font-fami…

And next, the results with LLM reasoning:

In [21]:
import gdown
url = "https://drive.google.com/uc?export=download&id=1iaxDdzAEjKtytYAItip23MnVNBQ3ZK_l"
output = "follow_up_example_with_llm.dill"
gdown.download(url, output, quiet=False)

Downloading...
From (original): https://drive.google.com/uc?export=download&id=1iaxDdzAEjKtytYAItip23MnVNBQ3ZK_l
From (redirected): https://drive.google.com/uc?export=download&id=1iaxDdzAEjKtytYAItip23MnVNBQ3ZK_l&confirm=t&uuid=1a498f49-8f2f-435c-9647-691e7112e578
To: C:\Users\Yuval\PycharmProjects\pd-explain\Examples\Notebooks\follow_up_example_with_llm.dill
100%|██████████| 265M/265M [00:06<00:00, 44.0MB/s] 


'follow_up_example_with_llm.dill'

In [22]:
adults.visualize_from_saved_data_exploration(file_path="follow_up_example_with_llm.dill")

Tab(children=(HTML(value='\n            <div style=\'padding:20px; max-width:800px; line-height:1.5; font-fami…

## Beautifying Plots with LLMs

The beautification feature allows you to use an LLM to generate a (hopefully) more aesthetically pleasing version of your plots.\
Note that this feature requires a multi-modal LLM that can handle images, such as Google's Gemini.\
\
The beautification process is done by:
1. Showing the LLM the plot + the code and data used to generate it, and asking it to generate new code that will produce a more aesthetically pleasing version of the plot.
2. Iteratively running the generated code until the LLM either produces a plot that is satisfactory, or it runs out of attempts to fix the plot. The LLM will score each attempt where a plot was generated, and will stop when it reaches a satisfactory score or when it runs out of attempts.
3. Returning the plot with the highest score, alongside the original plot.
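
The three steps above amount to a generate, run, score loop that keeps the best attempt. The sketch below illustrates that control flow with stub functions in place of the LLM and the plotting code. All names (`beautify`, `run_code`, `score_plot`, `improve_code`) and the `satisfactory_score` threshold are hypothetical, since the actual approval threshold is not stated here.

```python
def beautify(initial_code, run_code, score_plot, improve_code,
             max_fix_attempts=10, satisfactory_score=9.5):
    """Return (best_score, best_code) over at most `max_fix_attempts` attempts."""
    best = None
    code = initial_code
    for _ in range(max_fix_attempts):
        try:
            plot = run_code(code)
        except Exception as error:
            # Execution failed: ask for a fix and spend one attempt.
            code = improve_code(code, feedback=f"Execution error - {error}")
            continue
        score = score_plot(plot)
        if best is None or score > best[0]:
            best = (score, code)
        if score >= satisfactory_score:
            break  # good enough, stop early
        code = improve_code(code, feedback=f"Scored {score} / 10")
    return best


# Stubs standing in for code execution and the LLM.
scores = iter([6.0, 9.0, 8.0])


def fake_run(code):
    if code == "v0":
        raise RuntimeError("broken")
    return code  # pretend the "plot" is just the code version


def fake_score(plot):
    return next(scores)


def fake_improve(code, feedback):
    return f"v{int(code[1:]) + 1}"


best = beautify("v0", fake_run, fake_score, fake_improve, max_fix_attempts=4)
print(best)
```

Because later attempts can score lower than earlier ones, the loop tracks the best score seen so far instead of returning the last attempt.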

Currently, only the `fedex` and `MetaInsight` explainers support beautification.\
\
Beautification can also be used with the automated data exploration feature, to beautify each explainer's plots as well as the query tree visualization.\
\
Parameters:
- `beautify`: Whether to beautify the plot or not. Default is False.
- `beautify_max_fix_attempts`: Maximum number of attempts to fix the plot. Default is 10.
- `silent_beautify`: Whether to suppress progress messages during the beautification process. Default is False, meaning progress messages are printed.

### Example 1: Beautifying a FEDEx Explanation Plot

In [3]:
low_income = adults[adults['label'] == '<=50K']
low_income

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,label
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48836,33,Private,245211,Bachelors,13,Never-married,Prof-specialty,Own-child,White,Male,0,0,40,United-States,<=50K
48837,39,Private,215419,Bachelors,13,Divorced,Prof-specialty,Not-in-family,White,Female,0,0,36,United-States,<=50K
48838,64,?,321403,HS-grad,9,Widowed,?,Other-relative,Black,Male,0,0,40,United-States,<=50K
48839,38,Private,374983,Bachelors,13,Married-civ-spouse,Prof-specialty,Husband,White,Male,0,0,50,United-States,<=50K


In [4]:
low_income.explain(top_k=4, beautify=True, add_llm_explanation_reasoning=True, silent_beautify=False)

Error encountered in LLM generated code - Execution error - cannot access local variable 'get_max_k' where it is not associated with a value
Attempting to fix the code... (1/10)
The generated code executed successfully.
Approving or improving the generated visualization... 2/10
The LLM disapproved the generated visualization and scored it 6.0 / 10. It will attempt to improve it.
The generated code executed successfully.
Approving or improving the generated visualization... 3/10
The LLM disapproved the generated visualization and scored it 9.0 / 10. It will attempt to improve it.
The generated code executed successfully.
Approving or improving the generated visualization... 4/10
The LLM disapproved the generated visualization and scored it 8.0 / 10. It will attempt to improve it.
The generated code executed successfully.
Approving or improving the generated visualization... 5/10
The LLM disapproved the generated visualization and scored it 9.0 / 10. It will attempt to improve it.
The ge

Tab(children=(Output(), Output()), selected_index=0, titles=('Original Visualization', 'Beautified Visualizati…

### Example 2: Beautifying a MetaInsight Explanation Plot

In [10]:
by_marital_status = adults.groupby("marital-status").mean()
by_marital_status

marital-status,age,fnlwgt,education-num,capital-gain,capital-loss,hours-per-week
Divorced,43.159204,184749.693954,10.052917,793.675562,67.654304,41.115483
Married-AF-spouse,31.945946,184132.675676,10.432432,2971.621622,84.756757,39.810811
Married-civ-spouse,43.353724,186790.58251,10.303275,1739.700612,120.619509,43.306984
Married-spouse-absent,40.613057,197523.157643,9.377389,629.004777,63.184713,39.684713
Never-married,28.128064,195450.902836,9.972141,384.382639,54.126078,36.891357
Separated,39.72549,202974.111111,9.270588,581.842484,56.618954,39.667974
Widowed,59.37747,175529.942688,9.088274,603.644269,81.620553,33.438076


In [11]:
by_marital_status.explain(top_k=4,
                          explainer='metainsight',
                          beautify=True,
                          add_llm_explanation_reasoning=True,
                          silent_beautify=False)

The generated code executed successfully.
Approving or improving the generated visualization... 1/10
The LLM disapproved the generated visualization and scored it 6.0 / 10. It will attempt to improve it.
Error encountered in LLM generated code - Execution error - GridSpecFromSubplotSpec.__init__() missing 1 required positional argument: 'subplot_spec'
Attempting to fix the code... (2/10)
Error encountered in LLM generated code - Execution error - subplot_spec must be type SubplotSpec, usually from GridSpec, or axes.get_subplotspec.
Attempting to fix the code... (3/10)
The generated code executed successfully.
Approving or improving the generated visualization... 4/10
The LLM disapproved the generated visualization and scored it 0.0 / 10. It will attempt to improve it.
The generated code executed successfully.
Approving or improving the generated visualization... 5/10
The LLM disapproved the generated visualization and scored it 0.0 / 10. It will attempt to improve it.
The generated cod

Tab(children=(Output(), Output()), selected_index=0, titles=('Original Visualization', 'Beautified Visualizati…