# Symphony: CIFAR-10 Example

**Visualizing a [DNIKit](https://betterwithdata.github.io/dnikit/) [Dataset Report](https://betterwithdata.github.io/dnikit/introspectors/data_introspection/dataset_report.html)**

This is an example of visualizing the CIFAR-10 dataset with Symphony. Beyond the image samples themselves, we've used [DNIKit](https://betterwithdata.github.io/dnikit/) to compute some other statistics about the data. Symphony uses this data in the Familiarity and Duplicates widgets.

In DNIKit, you can create a `DatasetReport` object whose `data` field is a pandas DataFrame with metadata about each data sample, such as its familiarity, duplicates, overall summary, and dimensionality projection coordinates. Symphony can visualize this DataFrame directly.

For this example, we'll load a precomputed analysis for the CIFAR-10 dataset that has been saved to disk as a pandas DataFrame. If you are interested in generating this DataFrame yourself (or for a different dataset or model), see [this DNIKit example](https://betterwithdata.github.io/dnikit/notebooks/data_introspection/dataset_report.ipynb). This Symphony example picks up at the end of it.

## Symphony in Jupyter Notebooks

Let's use Symphony to explore this dataset in a Jupyter notebook.

In [1]:
import os
from pathlib import Path
import pandas as pd
import numpy as np
import cv2
from keras.datasets import cifar10


In [2]:
from watermark import watermark
print(watermark(packages="numpy,scipy,traitlets,tqdm,easyimages,tensorflow,keras,dnikit,cffi"))

numpy     : 1.26.4
scipy     : 1.14.1
traitlets : 5.14.3
tqdm      : 4.67.1
easyimages: not installed
tensorflow: 2.15.0
keras     : 2.15.0
dnikit    : not installed
cffi      : 1.17.1



Let's first download and load the CIFAR-10 dataset. We'll save it to a folder named `cifar`.

In [3]:
data_path = "./cifar/"

In [4]:
# Load data
(x_train, y_train), (x_test, y_test) = cifar10.load_data()
class_to_name = ['airplane', 'automobile', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']

# Concatenate the train and test into one array, as well as the train/test labels, and the class labels
full_dataset = np.concatenate((x_train, x_test))
dataset_labels = ['train']*len(x_train) + ['test']*len(x_test)
class_labels = np.squeeze(np.concatenate((y_train, y_test)))

# Helper function for file pathing
def class_path(index, dataset_labels, class_labels):
    return f"{dataset_labels[index]}/{class_to_name[int(class_labels[index])]}"

In [5]:
# Loop through data and save images to `cifar` folder
for idx in range(full_dataset.shape[0]):
    base_path = os.path.join(data_path, class_path(idx, dataset_labels, class_labels))
    Path(base_path).mkdir(exist_ok=True, parents=True)
    filename = os.path.join(base_path, f"image{idx}.png")
    # Write to disk after converting to BGR format, used by opencv
    cv2.imwrite(filename, cv2.cvtColor(full_dataset[idx, ...], cv2.COLOR_RGB2BGR))
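
The color conversion above matters because OpenCV writes images assuming BGR channel order, while the CIFAR-10 arrays are stored as RGB. As a minimal sketch of what the conversion does for a 3-channel `uint8` image, reversing the last axis with NumPy yields the same channel swap. The tiny `rgb` array here is made up for illustration:

```python
import numpy as np

# A hypothetical 2x2 image that is pure red in RGB order.
rgb = np.zeros((2, 2, 3), dtype=np.uint8)
rgb[..., 0] = 255

# Reversing the channel axis turns (R, G, B) into (B, G, R).
bgr = rgb[..., ::-1]

print(rgb[0, 0].tolist())  # channel values in RGB order
print(bgr[0, 0].tolist())  # same pixel, channels reversed
```

Skipping this step would not raise an error; the saved PNGs would simply have their red and blue channels swapped.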

Now that we have the images saved, we can load our precomputed analysis from DNIKit to visualize in Symphony. You can use Symphony to visualize CIFAR-10, and other datasets, directly, but some components require special metadata, which DNIKit's Dataset Report can generate automatically.

We can also print out the DataFrame to see the types of metadata columns that are included.

In [6]:
df = pd.read_pickle('canvas_cifar_example.pkl')



In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60000 entries, 0 to 59999
Data columns (total 9 columns):
 #   Column                                      Non-Null Count  Dtype  
---  ------                                      --------------  -----  
 0   id                                          60000 non-null  object 
 1   class                                       60000 non-null  object 
 2   dataset                                     60000 non-null  object 
 3   duplicates_conv_pw_13                       60000 non-null  int32  
 4   projection_conv_pw_13_x                     60000 non-null  float32
 5   projection_conv_pw_13_y                     60000 non-null  float32
 6   familiarity_conv_pw_13                      60000 non-null  float64
 7   splitFamiliarity_conv_pw_13_byAttr_class    60000 non-null  object 
 8   splitFamiliarity_conv_pw_13_byAttr_dataset  60000 non-null  object 
dtypes: float32(2), float64(1), int32(1), object(5)
memory usage: 3.4+ MB
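
The column names follow a visible pattern: a component prefix (`duplicates`, `projection`, `familiarity`, `splitFamiliarity`) followed by the model layer name (`conv_pw_13`). As a rough sketch, the columns can be grouped by that prefix to see which ones feed which kind of analysis. The helper `group_by_component` and the `"metadata"` bucket are illustrative names, not part of Symphony or DNIKit:

```python
from collections import defaultdict

# Column names copied from the df.info() output above.
columns = [
    "id", "class", "dataset",
    "duplicates_conv_pw_13",
    "projection_conv_pw_13_x", "projection_conv_pw_13_y",
    "familiarity_conv_pw_13",
    "splitFamiliarity_conv_pw_13_byAttr_class",
    "splitFamiliarity_conv_pw_13_byAttr_dataset",
]

def group_by_component(names):
    """Group column names by their leading component prefix (hypothetical helper)."""
    prefixes = ("duplicates", "projection", "familiarity", "splitFamiliarity")
    groups = defaultdict(list)
    for name in names:
        prefix = name.split("_", 1)[0]
        # Anything without a known prefix is treated as plain metadata.
        groups[prefix if prefix in prefixes else "metadata"].append(name)
    return dict(groups)

groups = group_by_component(columns)
for component, names in groups.items():
    print(component, names)
```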


To use Symphony, we'll import the main library and instantiate a Symphony object, passing the pandas DataFrame analysis and a file path to the dataset we downloaded.

In [8]:
import pyarrow as pa

table = pa.Table.from_pandas(df)

In [9]:
table.slice(0, 1)['id'].to_numpy()[0]

'train/frog/image0.png'
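
The `id` value has the same `dataset/class/imageN.png` layout as the files written to the `cifar` folder earlier, so it appears to be a path relative to that folder. A small sanity check, sketched here under that assumption, lists any ids that have no matching file on disk. `missing_files` is a hypothetical helper, not a Symphony function:

```python
from pathlib import Path

import pandas as pd

def missing_files(frame, files_path, id_column="id"):
    """Return ids whose file does not exist under files_path (hypothetical helper)."""
    root = Path(files_path)
    return [rel for rel in frame[id_column] if not (root / rel).is_file()]

# Example usage with the names from this notebook:
# missing = missing_files(df, data_path)
# print(f"{len(missing)} ids have no image on disk")
```

An empty result suggests the DataFrame and the image folder line up; a non-empty one usually points to a different `data_path` or an incomplete image export.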

In [10]:
import canvas_ux

symph = canvas_ux.Canvas(df, files_path=str(data_path), notebook=True)

Canvas spect dict value is {'filesPath': '/files/./cifar/', 'dataType': 2, 'instancesPerPage': 40, 'showUnfilteredData': True, 'idColumn': 'id'}


To use the different Symphony widgets, you can import them independently. Let's first look at the Summary widget to see the overall distributions of our dataset.

In [11]:
from canvas_summary import CanvasSummary

symph.widget(CanvasSummary)

Canvas spect dict value is {'width': 'XXL', 'height': 'M', 'page': 'Summary', 'name': 'CanvasSummary', 'description': 'A Canvas component that visualizes an overview of a dataset', 'summaryElements': []}


HBox(children=(CanvasSummary(layout=Layout(overflow='unset', width='100%'), widget_spec={'width': 'XXL', 'heig…

Instead of a summary, if we want to browse through the data we can use the List widget.

In [12]:
from canvas_list import CanvasList

symph.widget(CanvasList)

Canvas spect dict value is {'width': 'XXL', 'height': 'M', 'page': 'List', 'name': 'CanvasList', 'description': 'A Canvas component that displays a view of data instances'}


HBox(children=(CanvasList(layout=Layout(overflow='unset', width='100%'), widget_spec={'width': 'XXL', 'height'…

It's common to use dimensionality reduction techniques to summarize and find patterns in ML datasets. DNIKit already runs a reduction and saves the result when generating a Dataset Report. We can use the Scatterplot widget to visualize this embedding.

In [18]:
from canvas_scatterplot import CanvasScatterplot

symph.widget(CanvasScatterplot)

HBox(children=(CanvasScatterplot(canvas_spec={'filesPath': '/files/./cifar/', 'dataType': 2, 'instancesPerPage…

Some datasets can contain duplicates: data instances that are the same or very similar to others. These can be hard to find, and become especially problematic if the same data instance appears in both the training and testing splits. We can look for such cases using the Duplicates widget.

Hint: Take a look at the `automobile` class, where there are duplicates across train and test data!

In [17]:
from canvas_duplicates import CanvasDuplicates

symph.widget(CanvasDuplicates)

Canvas spect dict value is {'width': 'XXL', 'height': 'M', 'page': 'Duplicates', 'name': 'CanvasDuplicates', 'description': 'A Canvas component for inspecting potential duplicates in a dataset'}


HBox(children=(CanvasDuplicates(layout=Layout(overflow='unset', width='100%'), widget_spec={'width': 'XXL', 'h…
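
The train/test overlap mentioned in the hint can also be checked outside the widget. The following is a minimal pandas sketch, assuming that `duplicates_conv_pw_13` holds a cluster id shared by near-duplicate samples and that `-1` marks samples with no duplicates. Both are assumptions about the column's encoding, so verify them against your own DataFrame. The toy rows and the helper name are made up for illustration:

```python
import pandas as pd

def cross_split_duplicates(frame, cluster_col="duplicates_conv_pw_13",
                           split_col="dataset", unique_marker=-1):
    """Return rows whose duplicate cluster spans more than one split (hypothetical helper)."""
    # Assumption: unique_marker flags samples that belong to no duplicate cluster.
    clustered = frame[frame[cluster_col] != unique_marker]
    splits_per_cluster = clustered.groupby(cluster_col)[split_col].nunique()
    leaking = splits_per_cluster[splits_per_cluster > 1].index
    return clustered[clustered[cluster_col].isin(leaking)].sort_values(cluster_col, kind="stable")

# Toy data with made-up ids: cluster 0 spans train and test, cluster 1 is train-only.
toy = pd.DataFrame({
    "id": ["train/automobile/image1.png", "test/automobile/image2.png",
           "train/cat/image3.png", "train/cat/image4.png", "test/dog/image5.png"],
    "dataset": ["train", "test", "train", "train", "test"],
    "duplicates_conv_pw_13": [0, 0, 1, 1, -1],
})
leaks = cross_split_duplicates(toy)
print(leaks[["id", "dataset", "duplicates_conv_pw_13"]])
```

Running `cross_split_duplicates(df)` on the real report would list candidate leaks to inspect visually in the widget.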

Lastly, we can use advanced ML metrics and the Familiarity widget to find the most and least representative data instances in a given dataset, which can help identify model biases and annotation errors.

In [18]:
from canvas_familiarity import CanvasFamiliarity

symph.widget(CanvasFamiliarity)

Canvas spect dict value is {'width': 'XXL', 'height': 'M', 'page': 'Familiarity', 'name': 'CanvasFamiliarity', 'description': 'A Canvas component to find outliers and common instances in a dataset'}


HBox(children=(CanvasFamiliarity(layout=Layout(overflow='unset', width='100%'), widget_spec={'width': 'XXL', '…
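
For a quick look at the same information without the widget, the familiarity column can be sorted per class. This sketch assumes that lower `familiarity_conv_pw_13` values mean less representative samples, which is an assumption about the score's direction worth confirming in the DNIKit documentation. The toy scores and the helper name are invented for illustration:

```python
import pandas as pd

def least_familiar(frame, n=2, score_col="familiarity_conv_pw_13", group_col="class"):
    """Return the n lowest-scoring rows per class (hypothetical helper)."""
    # Assumption: a lower score means a less familiar, more outlier-like sample.
    ordered = frame.sort_values(score_col, kind="stable")
    return ordered.groupby(group_col).head(n)

# Toy data with made-up ids and scores.
toy = pd.DataFrame({
    "id": ["a", "b", "c", "d", "e"],
    "class": ["cat", "cat", "cat", "dog", "dog"],
    "familiarity_conv_pw_13": [-5.0, -1.0, -3.0, -2.0, -4.0],
})
outliers = least_familiar(toy, n=1)
print(outliers)
```

Flipping the sort with `ascending=False` would instead surface the most representative samples per class.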

## Symphony as a Standalone Export

Symphony can also be exported as a standalone static report to be shared with others or hosted. To explore this example in a web browser, you can export the report to a local folder.

If you only want to visualize locally without sharing the data, you can have Symphony handle the file paths for a local standalone visualization by setting `symlink_files` to True:

In [None]:
symph.export('./symphony_report', name="Symphony CIFAR-10 Example", symlink_files=True)

You can now serve the dataset report. For example, from the `symphony_report` folder, run a simple server from the command line:

```bash
python -m http.server
```

And navigate to http://localhost:8000/.