# Automating Insight Extraction from ML Visual Data (with Large Language Models)

## Introduction

+ visualisations as a tool to "test" visual properties of ML pipelines (data, model)

![](../catalogue/select-04-linear.png)
![](../catalogue/select-04-rbf.png)
+ ...but data changes as a reflection of the real world
+ visual properties observed become **implicit expectations**
+ `assert` statements to define **explicit expectations**

![](../catalogue/select-12.png)

```python
assert np.mean(history[:10]) > np.mean(history[-10:]), "RNN didn't converge"
```

## The Data

We mined **Jupyter Notebooks** from:
+ GitHub: 52K (35GB)
+ Kaggle: 250K (252GB)



In [1]:
%%bash
(
    cd ~/phd/shome2023notebook
    find data/shome2023notebook -type d -depth 1 -not -path "*mondal2023cell2doc*" -exec du -h -d 0 {} \;
)

3.8G	data/shome2023notebook/quaranta2021kgtorrent
4.2G	data/shome2023notebook/assert_notebooks


+ extract **contents** of code cells with `assert` keyword
+ extract **context** of code cells with `assert` keyword
+ compute high-level descriptive statistics of notebooks
+ extract **content** and **output** of all cells that produce a visualisation
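
The four extraction steps above can be sketched for a single notebook as follows. This is a minimal illustration, not the pipeline that produced the CSV files: the function name `extract_cells` is hypothetical, the notebook is assumed to follow the nbformat v4 JSON layout (`cells`, `cell_type`, `source`, `outputs`), and the `assert` check is a naive substring match that also fires on comments and strings.

```python
def extract_cells(notebook: dict) -> dict:
    """Collect assert cells, their neighbouring cells, and PNG outputs (hypothetical helper)."""
    cells = notebook.get("cells", [])
    result = {"assert_content": [], "assert_context": [], "visualisations": []}
    for index, cell in enumerate(cells):
        if cell.get("cell_type") != "code":
            continue
        # `source` may be a string or a list of lines; join handles both.
        source = "".join(cell.get("source", []))
        if "assert" in source:  # naive keyword match (assumption)
            result["assert_content"].append((index, source))
            # Context = the cell directly above and directly below the assert cell.
            for location, neighbour in (("above", index - 1), ("below", index + 1)):
                if 0 <= neighbour < len(cells):
                    other = cells[neighbour]
                    result["assert_context"].append(
                        (neighbour, other.get("cell_type"),
                         "".join(other.get("source", [])), location, index)
                    )
        # A cell produces a visualisation if any output carries an image/png payload.
        for output in cell.get("outputs", []):
            png = output.get("data", {}).get("image/png")
            if png is not None:
                result["visualisations"].append((index, source, png))
    return result


# Tiny in-memory notebook; the image payload is a placeholder string.
example = {
    "cells": [
        {"cell_type": "markdown", "source": ["Check the shape"]},
        {"cell_type": "code", "source": ["assert x.shape == (2, 2)"], "outputs": []},
        {"cell_type": "code", "source": ["plt.plot(x)"],
         "outputs": [{"output_type": "display_data",
                      "data": {"image/png": "iVBORw0KGgo="}}]},
    ]
}
extracted = extract_cells(example)
```

A real notebook would be read with `json.load` on the `.ipynb` file before being passed in.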

## Enough talk, show me the data!

In [9]:
import pandas as pd

github_stats = pd.read_csv(
    "shome2023notebook/github-stats.csv",
    header=None,
    names=["notebook", "num_code_cells", "num_md_cells", "num_assert_cells"],
)
github_visualisations = pd.read_csv(
    "shome2023notebook/github-visualisations.csv",
    header=None,
    names=["notebook", "image/png"]
)

github_assert_content = pd.read_csv(
    "shome2023notebook/github-assert-content.csv",
    header=None,
    names=["cell_type", "source", "notebook"],
)
github_assert_context = pd.read_csv(
    "shome2023notebook/github-assert-context.csv",
    header=None,
    names=["cell_type", "source", "notebook", "location", "assert_cell_index"]
)

kaggle_stats = pd.read_csv(
    "shome2023notebook/quaranta2021kgtorrent-stats.csv",
    header=None,
    names=["notebook", "num_code_cells", "num_md_cells", "num_assert_cells"],
)
kaggle_visualisations = pd.read_csv(
    "shome2023notebook/quaranta2021kgtorrent-visualisations.csv",
    header=None,
    names=["notebook", "image/png"]
)
kaggle_assert_content = pd.read_csv(
    "shome2023notebook/quaranta2021kgtorrent-assert-content.csv",
    header=None,
    names=["cell_type", "source", "notebook"],
)
kaggle_assert_context = pd.read_csv(
    "shome2023notebook/quaranta2021kgtorrent-assert-context.csv",
    header=None,
    names=["cell_type", "source", "notebook", "location", "assert_cell_index"]
)

stats = pd.concat([github_stats, kaggle_stats])
visualisations = pd.concat([github_visualisations, kaggle_visualisations])
assert_content = pd.concat([github_assert_content, kaggle_assert_content])
assert_context = pd.concat([github_assert_context, kaggle_assert_context])

In [10]:
stats.head()

| | notebook | num_code_cells | num_md_cells | num_assert_cells |
|---|---|---|---|---|
| 0 | data/assert_notebooks/tanmay2298/Advanced-Mach... | 33 | 12 | 1 |
| 1 | data/assert_notebooks/tanmay2298/Advanced-Mach... | 20 | 13 | 8 |
| 2 | data/assert_notebooks/tanmay2298/Advanced-Mach... | 24 | 15 | 3 |
| 3 | data/assert_notebooks/mykolesiko/advanced_RL/t... | 30 | 10 | 2 |
| 4 | data/assert_notebooks/raotnameh/NLP_LECTURE/As... | 32 | 24 | 2 |


In [11]:
stats.shape

(44655, 4)

In [12]:
assert_content.head()

| | cell_type | source | notebook |
|---|---|---|---|
| 10 | code | `# simple test on random numbers\n\ndummy_X = n...` | data/assert_notebooks/tanmay2298/Advanced-Mach... |
| 9 | code | `# some tests\nfrom util import eval_numerical_...` | data/assert_notebooks/tanmay2298/Advanced-Mach... |
| 12 | code | `class Dense(Layer):\n def __init__(self, in...` | data/assert_notebooks/tanmay2298/Advanced-Mach... |
| 14 | code | `l = Dense(128, 150)\n\nassert -0.05 < l.weight...` | data/assert_notebooks/tanmay2298/Advanced-Mach... |
| 15 | code | `# To test the grads, we use gradients obtained...` | data/assert_notebooks/tanmay2298/Advanced-Mach... |


In [13]:
assert_content.shape

(95906, 3)

In [23]:
print(assert_content.iloc[0].source)

# simple test on random numbers

dummy_X = np.array([
        [0,0],
        [1,0],
        [2.61,-1.28],
        [-0.59,2.1]
    ])

# call your expand function
dummy_expanded = expand(dummy_X)

# what it should have returned:   x0       x1       x0^2     x1^2     x0*x1    1
dummy_expanded_ans = np.array([[ 0.    ,  0.    ,  0.    ,  0.    ,  0.    ,  1.    ],
                               [ 1.    ,  0.    ,  1.    ,  0.    ,  0.    ,  1.    ],
                               [ 2.61  , -1.28  ,  6.8121,  1.6384, -3.3408,  1.    ],
                               [-0.59  ,  2.1   ,  0.3481,  4.41  , -1.239 ,  1.    ]])

#tests
assert isinstance(dummy_expanded,np.ndarray), "please make sure you return numpy array"
assert dummy_expanded.shape == dummy_expanded_ans.shape, "please make sure your shape is correct"
assert np.allclose(dummy_expanded,dummy_expanded_ans,1e-3), "Something's out of order with features"

print("Seems legit!")



In [15]:
assert_context.head()

| | cell_type | source | notebook | location | assert_cell_index |
|---|---|---|---|---|---|
| 9 | markdown | `Here are some tests for your implementation of...` | data/assert_notebooks/tanmay2298/Advanced-Mach... | above | 10 |
| 11 | markdown | `## Logistic regression\n\nTo classify objects ...` | data/assert_notebooks/tanmay2298/Advanced-Mach... | below | 10 |
| 8 | code | `class ReLU(Layer):\n def __init__(self):\n ...` | data/assert_notebooks/tanmay2298/Advanced-Mach... | above | 9 |
| 10 | markdown | `#### Instant primer: lambda functions\n\nIn py...` | data/assert_notebooks/tanmay2298/Advanced-Mach... | below | 9 |
| 11 | markdown | `### Dense layer\n\nNow let's build something m...` | data/assert_notebooks/tanmay2298/Advanced-Mach... | above | 12 |


In [16]:
assert_context.shape

(186223, 5)

In [17]:
visualisations.head()

| | notebook | image/png |
|---|---|---|
| 5 | data/assert_notebooks/tanmay2298/Advanced-Mach... | iVBORw0KGgoAAAANSUhEUgAAAXYAAAD8CAYAAABjAo9vAA... |
| 28 | data/assert_notebooks/tanmay2298/Advanced-Mach... | iVBORw0KGgoAAAANSUhEUgAAAXYAAAD8CAYAAABjAo9vAA... |
| 31 | data/assert_notebooks/tanmay2298/Advanced-Mach... | iVBORw0KGgoAAAANSUhEUgAAAXYAAAD8CAYAAABjAo9vAA... |
| 36 | data/assert_notebooks/tanmay2298/Advanced-Mach... | iVBORw0KGgoAAAANSUhEUgAAAXYAAAD8CAYAAABjAo9vAA... |
| 41 | data/assert_notebooks/tanmay2298/Advanced-Mach... | iVBORw0KGgoAAAANSUhEUgAAAXYAAAD8CAYAAABjAo9vAA... |


In [25]:
visualisations.shape

(165694, 2)
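
The `image/png` column holds base64-encoded PNG data, as stored in notebook outputs. Below is a minimal sketch of how one value could be turned back into image bytes; `decode_png` is a hypothetical helper, and the sample payload is built in place rather than taken from the dataset.

```python
import base64

# Every PNG file starts with this fixed 8-byte signature.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def decode_png(encoded: str) -> bytes:
    """Decode a base64 string and check that it looks like a PNG (hypothetical helper)."""
    raw = base64.b64decode(encoded)
    if not raw.startswith(PNG_SIGNATURE):
        raise ValueError("payload is not a PNG image")
    return raw


# Stand-in payload: a PNG signature followed by placeholder bytes, not a real image.
sample = base64.b64encode(PNG_SIGNATURE + b"placeholder").decode("ascii")
decoded = decode_png(sample)
```

The returned bytes can be written to a `.png` file or handed to an image library for inspection.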

## Research Questions

### RQ1: seq-to-seq translation of Python visualisation code to Python assertions
**WARNING**: this is bleeding-edge work with a high risk of a negative outcome! Choose this only if you are really interested in and passionate about NLP!

+ curate a dataset of ML visualisation code and related Python assertion code
+ train seq-to-seq models to automatically *translate* given visualisation code to a Python assertion

### RQ1 (backup): automatically detect related visualisation-assertion code pairs (VA pairs)

+ we manually created a dataset of 256 VA pairs (ground truth)
+ explore code similarity metrics for detecting such pairs automatically
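
As a starting point, one simple candidate metric is the Jaccard similarity over the identifiers two snippets share. The sketch below is only an illustration of that idea, not the metric used to build the ground truth: `code_tokens` and `jaccard` are hypothetical helpers, and the snippets are toy examples.

```python
import io
import keyword
import tokenize


def code_tokens(source: str) -> set:
    """Return the identifiers used in a snippet, ignoring Python keywords."""
    names = set()
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            names.add(tok.string)
    return names


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the identifier sets of two snippets."""
    tokens_a, tokens_b = code_tokens(a), code_tokens(b)
    if not tokens_a and not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


visualisation = "plt.plot(history)\nplt.ylabel('loss')"
related_assert = "assert np.mean(history[:10]) > np.mean(history[-10:])"
unrelated_assert = "assert isinstance(dummy_expanded, np.ndarray)"

related_score = jaccard(visualisation, related_assert)
unrelated_score = jaccard(visualisation, unrelated_assert)
```

Here the related pair scores higher because both snippets refer to `history`. Note that `tokenize` raises an error on incomplete code, so real cells may need a fallback.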

### RQ2: taxonomy of visualisations in ML

+ start with a high-level data exploration of visualisations in ML (e.g. what is the most frequent type of plot?)
+ create a taxonomy of how visualisations are used in ML to perform specific verification/validation tasks
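
For the first bullet, plot types could be counted by parsing cell source with the standard `ast` module. This is a rough sketch under several assumptions: `count_plot_calls` is a hypothetical helper, `PLOT_FUNCTIONS` is a hand-picked and incomplete set of `matplotlib.pyplot` functions, and only calls made directly on the `plt` alias are seen.

```python
import ast
from collections import Counter

# Hand-picked subset of pyplot plotting functions (assumption, not exhaustive).
PLOT_FUNCTIONS = {"plot", "scatter", "hist", "bar", "boxplot", "imshow"}


def count_plot_calls(source: str, alias: str = "plt") -> Counter:
    """Count calls such as plt.hist(...) in one code cell."""
    counts = Counter()
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and isinstance(node.func.value, ast.Name)
            and node.func.value.id == alias
            and node.func.attr in PLOT_FUNCTIONS
        ):
            counts[node.func.attr] += 1
    return counts


cell = "plt.hist(ages)\nplt.scatter(x, y)\nplt.hist(incomes)\nplt.show()"
plot_counts = count_plot_calls(cell)
```

Cells containing IPython magics (e.g. `%matplotlib inline`) are not valid Python and would raise `SyntaxError` in `ast.parse`, so they need to be cleaned or skipped first.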

### RQ3: taxonomy of assertions in ML

+ start with a high-level data exploration of assert statements in ML
+ create a taxonomy of how assertions are used in ML notebooks for specific V/V tasks
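
A possible first pass for this exploration is to group `assert` statements by the kind of expression they test, using the standard `ast` module. The sketch below assumes each cell parses as plain Python; `assert_kinds` is a hypothetical helper, and the AST node names are only a crude proxy for a real taxonomy.

```python
import ast
from collections import Counter


def assert_kinds(source: str):
    """Count assert statements by the AST type of their test expression."""
    kinds = Counter()
    with_message = 0
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Assert):
            kinds[type(node.test).__name__] += 1
            # node.msg is None when the assert carries no failure message.
            if node.msg is not None:
                with_message += 1
    return kinds, with_message


cell = '''
assert isinstance(dummy_expanded, np.ndarray), "please make sure you return numpy array"
assert dummy_expanded.shape == dummy_expanded_ans.shape
assert np.allclose(dummy_expanded, dummy_expanded_ans, 1e-3)
'''
kinds, with_message = assert_kinds(cell)
```

In this example two assertions test a function call and one tests a comparison; only the first carries a message.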

### RQ4: explore the role of unit testing in mature ML projects

```bibtex
@InProceedings{   widyasari2023niche,
  title         = {NICHE: A Curated Dataset of Engineered Machine Learning
                  Projects in Python},
  url           = {http://dx.doi.org/10.1109/MSR59073.2023.00022},
  doi           = {10.1109/msr59073.2023.00022},
  booktitle     = {2023 IEEE/ACM 20th International Conference on Mining
                  Software Repositories (MSR)},
  publisher     = {IEEE},
  author        = {Widyasari, Ratnadira and Yang, Zhou and Thung, Ferdian and
                  Qin Sim, Sheng and Wee, Fiona and Lok, Camellia and Phan,
                  Jack and Qi, Haodi and Tan, Constance and Tay, Qijin and
                  Lo, David},
  year          = {2023},
  month         = may
}
```
+ dataset of 470 mature ML projects
+ how does testing in mature projects differ from notebooks?

### RQ5: smells in ML visualisations
**NOTE**: more open-ended RQ with multiple directions to look into

+ what are some frequent anti-patterns in ML visualisation code? (e.g. lack of labels)
+ smells can range from design aspects of a plot to the performance of the visualisation code itself
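
The "lack of labels" example could be checked statically. The following is a minimal sketch of one such detector, assuming the pyplot state-machine style (`plt.plot`, `plt.xlabel`, ...); `missing_axis_labels` is a hypothetical helper, the set of drawing functions is hand-picked, and the object-oriented API (`ax.set_xlabel`) is not covered.

```python
import ast

# Hand-picked pyplot functions that draw data (assumption, not exhaustive).
DRAWING_FUNCTIONS = {"plot", "scatter", "hist", "bar"}
LABEL_FUNCTIONS = {"xlabel", "ylabel"}


def missing_axis_labels(source: str, alias: str = "plt") -> list:
    """Return the axis-label calls absent from a cell that draws a plot."""
    called = set()
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and isinstance(node.func.value, ast.Name)
            and node.func.value.id == alias
        ):
            called.add(node.func.attr)
    if not called & DRAWING_FUNCTIONS:
        return []  # nothing is drawn, so there is nothing to label
    return sorted(LABEL_FUNCTIONS - called)


unlabelled = missing_axis_labels("plt.plot(history)\nplt.xlabel('epoch')")
labelled = missing_axis_labels(
    "plt.plot(history)\nplt.xlabel('epoch')\nplt.ylabel('loss')"
)
```

The check works per cell, so labels set in a later cell for the same figure would be reported as missing.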