
# FiftyOne Workshop: Loading and Exploring Datasets (March 12th 2025)

Welcome to this hands-on workshop where we will learn how to load and explore datasets using FiftyOne.
This notebook will guide you through programmatic interaction via the **FiftyOne SDK** and visualization using the **FiftyOne App**.

![Image](https://github.com/user-attachments/assets/d2830448-530e-4336-b3a4-f8f3838f5c73)

## 🏆 Learning Objectives:
- Load datasets into FiftyOne from different sources.
- Understand the structure and metadata of datasets.
- Use FiftyOne’s querying and filtering capabilities.
- Interactively explore datasets in the FiftyOne App.

In this example, we load the dataset from the Hugging Face Hub, but you are encouraged to explore other sources such as local files, cloud storage, or custom dataset loaders.

---


## Requirements and FiftyOne Installation

First, create a Python environment on your system. If you are not familiar with this step, see this [README](https://github.com/voxel51/fiftyone-examples?tab=readme-ov-file#-prerequisites-for-beginners-), which explains how to create one. Then activate the environment and install FiftyOne in it.

## Install FiftyOne

Run the line below to install FiftyOne on your machine or Google Colab instance.

In [1]:
!pip install fiftyone==1.3.1 -q
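
The install cell pins version 1.3.1, so it can help to confirm what actually ended up in the environment. The following is a minimal sketch that uses only the standard library; the helper name `installed_version` is illustrative and not part of FiftyOne.

```python
from importlib.metadata import PackageNotFoundError, version

EXPECTED_VERSION = "1.3.1"  # the version pinned in the install cell


def installed_version(package):
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


found = installed_version("fiftyone")
if found is None:
    print("fiftyone is not installed in this environment")
elif found != EXPECTED_VERSION:
    print(f"fiftyone {found} is installed; this notebook was written for {EXPECTED_VERSION}")
else:
    print(f"fiftyone {found} is installed")
```

If the check reports a missing package, the most likely cause is that the notebook kernel is running in a different environment from the one where `pip` was executed.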




## 📥 Loading a Dataset into FiftyOne

FiftyOne provides multiple ways to import datasets, including:
- **Hugging Face Hub** (as demonstrated here)
- **Local files** (images, videos, or annotations in JSON, COCO, PASCAL VOC, etc.)
- **Cloud storage** (AWS S3, Google Cloud Storage, etc.), available only in FiftyOne Enterprise

To load a dataset, we specify the source and format, ensuring FiftyOne properly indexes the data.

🔗 **Relevant Documentation:** [Dataset Importing in FiftyOne](https://voxel51.com/docs/fiftyone/user_guide/dataset_creation/index.html)

We are using the [MVTec AD dataset](https://www.mvtec.com/company/research/datasets/mvtec-ad) from [Voxel51's Hugging Face Hub](https://huggingface.co/datasets/Voxel51/mvtec-ad). The two versions differ in how the data is structured: the original is organized as a directory tree with one folder per category, whereas the Voxel51 version is a flat collection of samples whose metadata is stored in fields such as `category.label` and `defect.label`.
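
To make the structural difference concrete, here is a small sketch that turns a path from a category-based directory tree into the kind of flat record that a sample carries in the Hub version. It assumes the original archive follows a `<category>/<split>/<defect>/<file>` layout; the function name and the example path are illustrative, so check them against your local copy.

```python
from pathlib import PurePosixPath


def parse_tree_path(relative_path):
    """Split a `<category>/<split>/<defect>/<file>` path into flat fields."""
    parts = PurePosixPath(relative_path).parts
    if len(parts) != 4:
        raise ValueError(
            f"Expected 4 path components, got {len(parts)}: {relative_path}"
        )

    category, split, defect, _filename = parts
    return {
        "filepath": relative_path,
        "category": category,  # stored as `category.label` in FiftyOne
        "split": split,        # stored as `split`
        "defect": defect,      # stored as `defect.label`
    }


# Illustrative path; the layout is an assumption of this sketch
record = parse_tree_path("bottle/test/broken_large/000.png")
print(record)
```

With the flat representation, category and defect become queryable fields instead of information that has to be recovered from folder names.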


In [2]:
import fiftyone as fo # base library and app
import fiftyone.utils.huggingface as fouh # Hugging Face integration

# Download the dataset from the Hub; overwrite=True replaces an existing dataset with the same name
dataset_ = fouh.load_from_hub("Voxel51/mvtec-ad", persistent=True, overwrite=True)

# If the dataset is already in your local FiftyOne database (for example,
# because you have run this notebook before), you can load it directly instead
#dataset = fo.load_dataset("Voxel51/mvtec-ad")

# Define the new dataset name
dataset_name = "mvtec-ad_1"

# Check if the dataset exists
if dataset_name in fo.list_datasets():
    print(f"Dataset '{dataset_name}' exists. Loading...")
    dataset = fo.load_dataset(dataset_name)
else:
    print(f"Dataset '{dataset_name}' does not exist. Creating a new one...")
    # Clone the dataset with a new name and make it persistent
    dataset = dataset_.clone(dataset_name, persistent=True)

Downloading config file fiftyone.yml from Voxel51/mvtec-ad

fiftyone.yml:   0%|          | 0.00/127 [00:00<?, ?B/s]

Loading dataset

Importing samples...

 100% |███████████████| 5354/5354 [142.1ms elapsed, 0s remaining, 37.7K samples/s]     

Migrating dataset 'Voxel51/mvtec-ad' to v1.3.1

Downloading 5354 media files...
100%|██████████| 54/54 [05:12<00:00,  5.79s/it]


Dataset 'mvtec-ad_1' does not exist. Creating a new one...



## 🧐 Exploring the Dataset

Once the dataset is loaded, we can inspect its structure using FiftyOne’s SDK.
We will explore:
- The number of samples in the dataset.
- Available metadata and labels.
- How images/videos are structured.

🔗 **Relevant Documentation:** [Inspecting Datasets in FiftyOne](https://docs.voxel51.com/user_guide/using_datasets.html#using-fiftyone-datasets)


In [3]:
print(dataset)
print(dataset.last())  # Inspect the last sample; use dataset.first() for the first one

Name:        mvtec-ad_1
Media type:  image
Num samples: 5354
Persistent:  True
Tags:        []
Sample fields:
    id:               fiftyone.core.fields.ObjectIdField
    filepath:         fiftyone.core.fields.StringField
    tags:             fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)
    metadata:         fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.ImageMetadata)
    created_at:       fiftyone.core.fields.DateTimeField
    last_modified_at: fiftyone.core.fields.DateTimeField
    category:         fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Classification)
    defect:           fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Classification)
    split:            fiftyone.core.fields.StringField
    defect_mask:      fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Segmentation)
<Sample: {
    'id': '6621d76a324f6e05d5838ef7',
    'media_type': 'image',
    'filepath': '/root/fiftyone/hugging
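
Printing the dataset shows its schema but not how the labels are distributed. The sketch below summarizes a dataset (or view) with the `count_values()` and `distinct()` aggregations. The function name `summarize_labels` is illustrative, and the field paths are taken from the schema printed by `print(dataset)`.

```python
def summarize_labels(dataset):
    """Return basic label statistics for a FiftyOne dataset or view."""
    return {
        "num_samples": len(dataset),
        # Distinct category names found in the collection
        "categories": dataset.distinct("category.label"),
        # Mapping of defect label -> number of samples with that label
        "defect_counts": dataset.count_values("defect.label"),
        # Mapping of split name -> number of samples in that split
        "split_counts": dataset.count_values("split"),
    }


# Example call, using the `dataset` loaded earlier in this notebook:
# for key, value in summarize_labels(dataset).items():
#     print(key, value)
```

Because the helper only relies on methods shared by datasets and views, it can be reused unchanged on any filtered view created later.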

In [4]:
session = fo.launch_app(dataset, port=5151, auto=False)


Could not connect session, trying again in 10 seconds

Session launched. Run `session.show()` to open the App in a cell output.





Welcome to

███████╗██╗███████╗████████╗██╗   ██╗ ██████╗ ███╗   ██╗███████╗
██╔════╝██║██╔════╝╚══██╔══╝╚██╗ ██╔╝██╔═══██╗████╗  ██║██╔════╝
█████╗  ██║█████╗     ██║    ╚████╔╝ ██║   ██║██╔██╗ ██║█████╗
██╔══╝  ██║██╔══╝     ██║     ╚██╔╝  ██║   ██║██║╚██╗██║██╔══╝
██║     ██║██║        ██║      ██║   ╚██████╔╝██║ ╚████║███████╗
╚═╝     ╚═╝╚═╝        ╚═╝      ╚═╝    ╚═════╝ ╚═╝  ╚═══╝╚══════╝ v1.3.1

If you're finding FiftyOne helpful, here's how you can get involved:

|
|  ⭐⭐⭐ Give the project a star on GitHub ⭐⭐⭐
|  https://github.com/voxel51/fiftyone
|
|  🚀🚀🚀 Join the FiftyOne Discord community 🚀🚀🚀
|  https://community.voxel51.com/
|






![Image](https://github.com/user-attachments/assets/82e253a8-d17d-4d39-a957-a406c23d70b6)


## 🔍 Querying and Filtering

FiftyOne provides a powerful querying engine to filter and analyze datasets.
We can apply filters to:
- Retrieve specific labels (e.g., all images with "cat" labels).
- Apply confidence thresholds to object detections.
- Filter data based on metadata (e.g., image size, timestamp).

🔗 **Relevant Documentation:** [Dataset views](https://docs.voxel51.com/user_guide/using_views.html#dataset-views), [Querying Samples](https://docs.voxel51.com/user_guide/using_views.html#querying-samples), [Common filters](https://docs.voxel51.com/user_guide/using_views.html#querying-samples)

### Examples:
- Show all images containing a particular class.
- Retrieve samples with object detection confidence above a threshold.
- Filter out low-quality images based on metadata.
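
As one illustration of the metadata bullet, the sketch below keeps only images that are at least a given width. It assumes image metadata may not be populated yet, so it calls `compute_metadata()` first. The function name and the threshold in the example call are placeholders.

```python
def images_at_least(dataset, min_width):
    """Return a view of the samples whose image width is >= `min_width`."""
    # Imported here so that defining the helper does not load FiftyOne
    from fiftyone import ViewField as F

    # Populates `metadata` (width, height, size_bytes, ...) for samples
    # that do not have it yet
    dataset.compute_metadata()

    return dataset.match(F("metadata.width") >= min_width)


# Example call (the threshold is arbitrary):
# wide_view = images_at_least(dataset, 1024)
# print(len(wide_view))
```

The same pattern applies to other metadata attributes, for example `metadata.height` or `metadata.size_bytes`.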


In [5]:
import fiftyone.core.expressions as foe
# Query images where the defect is labeled as "scratch"
view = dataset.match(foe.ViewField("defect.label") == "scratch")
print(view)

# Launch FiftyOne App with the filtered dataset
session = fo.launch_app(view, port=5151, auto=False)


Dataset:     mvtec-ad_1
Media type:  image
Num samples: 91
Sample fields:
    id:               fiftyone.core.fields.ObjectIdField
    filepath:         fiftyone.core.fields.StringField
    tags:             fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)
    metadata:         fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.ImageMetadata)
    created_at:       fiftyone.core.fields.DateTimeField
    last_modified_at: fiftyone.core.fields.DateTimeField
    category:         fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Classification)
    defect:           fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Classification)
    split:            fiftyone.core.fields.StringField
    defect_mask:      fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Segmentation)
View stages:
    1. Match(filter={'$expr': {'$eq': [...]}})
Session launched. Run `session.show()` to open the App in a cell output.




![Image](https://github.com/user-attachments/assets/a8143d7a-40cb-4f4c-8a04-0daf6f29abe7)

In [6]:
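# Narrow the "scratch" view further to samples in the "wood" category
# (note: the name `filter` shadows Python's built-in `filter`)
filter = view.match(foe.ViewField("category.label") == "wood")
session.view = filter  # update the running App session in place
print(filter)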

Dataset:     mvtec-ad_1
Media type:  image
Num samples: 21
Sample fields:
    id:               fiftyone.core.fields.ObjectIdField
    filepath:         fiftyone.core.fields.StringField
    tags:             fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)
    metadata:         fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.ImageMetadata)
    created_at:       fiftyone.core.fields.DateTimeField
    last_modified_at: fiftyone.core.fields.DateTimeField
    category:         fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Classification)
    defect:           fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Classification)
    split:            fiftyone.core.fields.StringField
    defect_mask:      fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Segmentation)
View stages:
    1. Match(filter={'$expr': {'$eq': [...]}})
    2. Match(filter={'$expr': {'$eq': [...]}})
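
Chaining two `match()` stages narrows the view step by step. A sketch of an equivalent single-stage form, assuming the same field names, combines both conditions with `&`. The function name is illustrative.

```python
def match_defect_and_category(dataset, defect, category):
    """Return samples with the given defect label and category label."""
    # Imported here so that defining the helper does not load FiftyOne
    from fiftyone import ViewField as F

    # Each comparison is parenthesized because `&` binds more tightly
    # than `==` in Python
    expr = (F("defect.label") == defect) & (F("category.label") == category)
    return dataset.match(expr)


# Example call; the result should contain the same samples as the
# two chained match() stages:
# combined = match_defect_and_category(dataset, "scratch", "wood")
# print(len(combined))
```

A single expression keeps the condition in one place, while chained stages are convenient when each intermediate view is also of interest.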


In [7]:
# Launch FiftyOne App with the filtered dataset
session = fo.launch_app(filter, port=5151, auto=False)

Session launched. Run `session.show()` to open the App in a cell output.




![Image](https://github.com/user-attachments/assets/961e07d1-f1fe-4a69-93bc-92be0d1a700b)

In [None]:
# Display the FiftyOne app within the notebook
session.show()



## 🖥️ Interactive Exploration with the FiftyOne App

The **FiftyOne App** allows users to interactively browse, filter, and analyze datasets.
This visual interface is an essential tool for understanding dataset composition and refining data exploration workflows.

Key features of the FiftyOne App:
- Interactive filtering of images/videos.
- Object detection visualization.
- Dataset statistics and metadata overview.

🔗 **Relevant Documentation:** [Using the FiftyOne App](https://voxel51.com/docs/fiftyone/user_guide/app.html)


### Interacting with Plugins to Understand the Dataset

FiftyOne provides a powerful [plugin framework](https://docs.voxel51.com/plugins/index.html) for extending and customizing the tool to suit your needs. Here we use the [@voxel51/dashboard](https://github.com/voxel51/fiftyone-plugins/blob/main/plugins/dashboard/README.md) plugin, which lets you build custom dashboards that display statistics of interest about the current dataset (and beyond).

In [8]:
#!fiftyone plugins download https://github.com/voxel51/fiftyone-plugins --plugin-names @voxel51/dashboard
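
The CLI command in the cell above has a Python counterpart in the `fiftyone.plugins` module. The following is a sketch that assumes `download_plugin()` and `list_downloaded_plugins()` are available with these signatures in your installed version; the wrapper name `ensure_dashboard_plugin` is hypothetical.

```python
def ensure_dashboard_plugin():
    """Download the dashboard plugin if needed; return downloaded plugin names."""
    # Imported here so that defining the helper does not load FiftyOne
    import fiftyone.plugins as fop

    plugin_name = "@voxel51/dashboard"

    # Skip the download when the plugin is already present locally
    if plugin_name not in fop.list_downloaded_plugins():
        fop.download_plugin(
            "https://github.com/voxel51/fiftyone-plugins",
            plugin_names=[plugin_name],
        )

    return fop.list_downloaded_plugins()


# Example call:
# print(ensure_dashboard_plugin())
```

After a plugin is downloaded, the App may need to be relaunched or refreshed before the new panel shows up.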

![Image](https://github.com/user-attachments/assets/107f1873-8e19-4c37-abe1-46bb39cb993c)

## New Dataset

Here we create a new FiftyOne dataset containing a copy of the contents of `view`, the 91-sample "scratch" view created earlier.

In [9]:
new_dataset = view.clone()
print(new_dataset)

Name:        2025.03.12.10.16.10
Media type:  image
Num samples: 91
Persistent:  False
Tags:        []
Sample fields:
    id:               fiftyone.core.fields.ObjectIdField
    filepath:         fiftyone.core.fields.StringField
    tags:             fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)
    metadata:         fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.ImageMetadata)
    created_at:       fiftyone.core.fields.DateTimeField
    last_modified_at: fiftyone.core.fields.DateTimeField
    category:         fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Classification)
    defect:           fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Classification)
    split:            fiftyone.core.fields.StringField
    defect_mask:      fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Segmentation)
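
The clone printed here received an auto-generated timestamp name and is not persistent, and non-persistent datasets are removed when the database shuts down. A minimal sketch of a named, persistent clone follows, mirroring the load-or-clone pattern from the loading cell. The function name and the dataset name in the example call are placeholders.

```python
def clone_view_as(view, name):
    """Clone `view` into a persistent dataset called `name`, or load it if it exists."""
    # Imported here so that defining the helper does not load FiftyOne
    import fiftyone as fo

    # Reuse the dataset from a previous run instead of failing on a name clash
    if fo.dataset_exists(name):
        return fo.load_dataset(name)

    return view.clone(name=name, persistent=True)


# Example call (hypothetical dataset name):
# scratch_dataset = clone_view_as(view, "mvtec-ad-scratch")
# print(scratch_dataset.name, scratch_dataset.persistent)
```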


## Exporting `Dataset` to `FiftyOneDataset`

FiftyOne supports various dataset formats. So far we have worked with a dataset loaded from the Hugging Face Hub. Here we export the cloned subset to disk in the `FiftyOneDataset` format so that it can be shared or re-imported later.

For more details on the dataset types supported by FiftyOne, refer to this [documentation page](https://docs.voxel51.com/api/fiftyone.types.dataset_types.html?highlight=dataset%20type#module-fiftyone.types.dataset_types).

In [10]:
export_dir = "MVTec_scratch"
new_dataset.export(
    export_dir=export_dir,
    dataset_type=fo.types.FiftyOneDataset,
)

Exporting samples...




 100% |██████████████████████| 91/91 [808.9ms elapsed, 0s remaining, 112.5 docs/s]      
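
A quick way to check an export is to load it back. The sketch below re-imports a `FiftyOneDataset` directory with `fo.Dataset.from_dir`; the function name and the dataset name in the example call are placeholders.

```python
def reload_export(export_dir, name):
    """Import a directory written in FiftyOneDataset format as a new dataset."""
    # Imported here so that defining the helper does not load FiftyOne
    import fiftyone as fo

    return fo.Dataset.from_dir(
        dataset_dir=export_dir,
        dataset_type=fo.types.FiftyOneDataset,
        name=name,
    )


# Example call, reusing the export directory from the cell above
# (the dataset name is hypothetical):
# reloaded = reload_export("MVTec_scratch", "mvtec-scratch-reloaded")
# print(len(reloaded))
```

Comparing the sample count of the reloaded dataset with that of `new_dataset` is a simple sanity check that nothing was lost in the round trip.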




## In this notebook, we covered:
1. Loading datasets from Hugging Face Hub (extendable to other sources).
2. Exploring dataset structure and metadata.
3. Applying filtering and querying techniques to analyze data.
4. Utilizing the FiftyOne App for interactive visualization.
5. Cloning dataset views and exporting data in the FiftyOne format.




### Next Steps:
Try modifying the dataset loading parameters, apply different filters, and explore the FiftyOne App’s visualization features! 🚀

🔗 **More Resources:**  
- [FiftyOne Docs](https://voxel51.com/docs/fiftyone/)  
- [FiftyOne Tutorials](https://voxel51.com/docs/fiftyone/tutorials/index.html)