<div align="center" dir="auto">
<p dir="auto"><a href="https://colab.research.google.com/github/encord-team/encord-notebooks/blob/main/colab-notebooks/Encord_Active_HuggingFace_Dataset_Exploration.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
<div align="center" dir="auto">
  <div style="flex: 1; padding: 10px;">
    <a href="https://join.slack.com/t/encordactive/shared_invite/zt-1hc2vqur9-Fzj1EEAHoqu91sZ0CX0A7Q" target="_blank" style="text-decoration:none">
      <img alt="Join us on Slack" src="https://img.shields.io/badge/Join_Our_Community-4A154B?label=&logo=slack&logoColor=white">
    </a>
    <a href="https://docs.encord.com/docs/active-overview" target="_blank" style="text-decoration:none">
      <img alt="Documentation" src="https://img.shields.io/badge/docs-Online-blue">
    </a>
    <a href="https://twitter.com/encord_team" target="_blank" style="text-decoration:none">
      <img alt="Twitter Follow" src="https://img.shields.io/twitter/follow/encord_team?label=%40encord_team&amp;style=social">
    </a>
    <img alt="Python versions" src="https://img.shields.io/pypi/pyversions/encord-active">
    <a href="https://pypi.org/project/encord-active/" target="_blank" style="text-decoration:none">
      <img alt="PyPi project" src="https://img.shields.io/pypi/v/encord-active">
    </a>
    <a href="https://docs.encord.com/docs/active-contributing" target="_blank" style="text-decoration:none">
      <img alt="PRs Welcome" src="https://img.shields.io/badge/PRs-Welcome-blue">
    </a>
    <img alt="License" src="https://img.shields.io/github/license/encord-team/encord-active">
  </div>
</div>

<div align="center">
  <p>
    <a align="center" href="" target="_blank">
      <img
        width="7232"
        src="https://storage.googleapis.com/encord-notebooks/encord_active_notebook_banner.png">
    </a>
  </p>
</div>

# 🟣 Encord Active | 🤗 HuggingFace Dataset Exploration

## 🏁 Overview

👋 Hi there! In this notebook, you will use Encord Active (EA) to explore the quality of a dataset from the [Hugging Face Datasets](https://huggingface.co/datasets) library.


> ⚠️ **Prerequisites:** you should have `encord-active` [installed](https://docs.encord.com/docs/active-overview) in your environment.

This 📒 notebook will cover:
* Using 🤗 Datasets to download and generate the dataset.
* Creating an Encord Active project.
* Inspecting problematic images in the dataset.
* Exploring more features with the EA UI.

<br>

> 💡 Learn more about 🟣 Encord Active:
* [GitHub](https://github.com/encord-team/encord-active)
* [Docs](https://docs.encord.com/docs/active-overview)

## 🛠️ Install Encord Active

📌 Encord Active requires Python 3.9, 3.10, or 3.11.

In [None]:
# Assert that the Python version is 3.9, 3.10, or 3.11
import sys
assert sys.version_info[:2] in [(3, 9), (3, 10), (3, 11)], "Encord Active is only supported on Python 3.9, 3.10, and 3.11."

!pip install encord-active &> /dev/null
!encord-active --version

## 📥 Install the 🤗 Hugging Face Datasets package

👟 Run the following cell to install [🤗 Datasets](https://huggingface.co/datasets).



In [None]:
# Install the Hugging Face Datasets library
%pip install datasets &> /dev/null

# 📨 Download a Hugging Face Dataset

You can browse the [Hugging Face datasets](https://huggingface.co/datasets) directory and load any dataset you would like to explore.

Here, you will download the [`sasha/dog-food`](https://huggingface.co/datasets/sasha/dog-food) dataset, which contains 3000 images of dogs and food.

In [None]:
from datasets import load_dataset, concatenate_datasets
from pathlib import Path
import shutil
from tqdm import tqdm

# Use the load_dataset function to download any dataset from the Hugging Face Hub
# You can browse through datasets here: https://huggingface.co/datasets
dataset_dict = load_dataset('sasha/dog-food')
dataset = concatenate_datasets([d for d in dataset_dict.values()])

huggingface_dataset_path =  Path.cwd() / "huggingface_dataset"

if huggingface_dataset_path.exists():
  shutil.rmtree(huggingface_dataset_path)
huggingface_dataset_path.mkdir()

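# Wrap the dataset (not the enumerate object) so tqdm knows the total length
for counter, item in enumerate(tqdm(dataset)):
  image = item['image']
  # Save into the folder created above, using the image's own format as the extension
  image.save(huggingface_dataset_path / f'{counter}.{image.format}')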
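
The export loop above builds each file name from `image.format`. Pillow sets that attribute to `None` for images that were created in memory rather than opened from a file, so a dataset that yields such images would produce unusable file names. A minimal sketch of a fallback follows; `extension_for` and its `default` argument are hypothetical helpers, not part of Pillow, 🤗 Datasets, or Encord Active.

```python
from typing import Optional


def extension_for(image_format: Optional[str], default: str = "png") -> str:
    """Return a lowercase file extension for a Pillow format name.

    Hypothetical helper: falls back to `default` when the format is missing.
    """
    if not image_format:
        return default
    return image_format.lower()


print(extension_for("JPEG"))
print(extension_for(None))
```

With a helper like this, the file name could be built as `f"{counter}.{extension_for(image.format)}"`.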

# 🔧 Create an Encord Active project

## 👉 Add the Dataset to an Encord Active Project

The code below collects the image files, creates a local Encord Active project from them, and runs metrics on the project's data:

* `collect_all_images` takes a root folder path (here, the `huggingface_dataset` directory) and returns a list of `Path` objects for the image files found under it.

* Encord Active's `init_local_project` function creates a local project in the `projects_dir` directory from those image files and the given project name.

* The [`run_metrics_by_embedding_type`](https://docs.encord.com/docs/active-sdk-quality-metric-execution#compute-only-data-or-label-metrics) function runs the metrics for the image embeddings (`EmbeddingType.IMAGE`) on the data in `project_path`.

In [None]:
from pathlib import Path

from encord_active.lib.metrics.execute import run_metrics, run_metrics_by_embedding_type
from encord_active.lib.metrics.metric import EmbeddingType
from encord_active.lib.project.local import ProjectExistsError, init_local_project
from encord_active.lib.project.project import Project

def collect_all_images(root_folder: Path) ->  list[Path]:
    image_extensions = {".jpg", ".jpeg", ".png", ".bmp"}
    image_paths = []

    for file_path in root_folder.glob("**/*"):
        if file_path.suffix.lower() in image_extensions:
            image_paths.append(file_path)

    return image_paths

# Enter path to the downloaded hugging face dataset
root_folder = Path("./huggingface_dataset")
projects_dir = Path.cwd()

if not projects_dir.exists():
  projects_dir.mkdir()

image_files = collect_all_images(root_folder)

try:
    project_path: Path = init_local_project(
        files = image_files,
        target = projects_dir,
        project_name = "sample_ea_project",
        symlinks = False,
    )
except ProjectExistsError as e:
    # Reuse the existing project located under projects_dir
    project_path = projects_dir / "sample_ea_project"
    print(e)  # A project already exists with that name at the given path.

run_metrics_by_embedding_type(
    EmbeddingType.IMAGE,
    data_dir=project_path,
    use_cache_only=True
)

ea_project = Project(project_path)
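
Before creating a project, it can help to confirm that the extension filter picks up the files you expect. The sketch below applies the same extension set as `collect_all_images` to a temporary folder and tallies the matches per extension. The `count_by_extension` helper and the file names are made up for illustration.

```python
import tempfile
from collections import Counter
from pathlib import Path

# Same extensions as the collect_all_images function in this notebook
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp"}


def count_by_extension(root_folder: Path) -> Counter:
    """Count image files under root_folder, grouped by lowercase extension."""
    return Counter(
        path.suffix.lower()
        for path in root_folder.glob("**/*")
        if path.is_file() and path.suffix.lower() in IMAGE_EXTENSIONS
    )


# Build a throwaway folder with a few empty files to exercise the filter
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ["0.JPEG", "1.JPEG", "2.PNG", "notes.txt"]:
        (root / name).touch()
    counts = count_by_extension(root)

print(counts)
```

Files with other extensions (here `notes.txt`) are skipped, and uppercase suffixes such as `.JPEG` still match because the comparison is done in lowercase.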

# 📥 Import helper functions

Now import some helper functions from Encord Active, along with plotting libraries to visualize the images.

In [None]:
import matplotlib.pyplot as plt
import plotly.express as px

from encord_active.lib.charts.data_quality_summary import create_image_size_distribution_chart, create_outlier_distribution_chart
from encord_active.lib.dataset.summary_utils import get_all_image_sizes, get_metric_summary, get_median_value_of_2d_array
from encord_active.lib.metrics.utils import load_available_metrics
from encord_active.lib.dataset.outliers import MetricsSeverity, get_all_metrics_outliers
from encord_active.lib.common.image_utils import load_or_fill_image
from encord_active.lib.charts.histogram import get_histogram

def plot_top_k_images(metric_name: str, metrics_data_summary: MetricsSeverity, project: Project, k: int, show_description: bool = False, ascending: bool = True):
    metric_df = metrics_data_summary.metrics[metric_name].df
    metric_df.sort_values(by='score', ascending=ascending, inplace=True)

    for _, row in metric_df.head(k).iterrows():
        image = load_or_fill_image(row, project.file_structure)
        plt.imshow(image)
        plt.show()
        print(f"{metric_name} score: {row['score']}")
        if show_description:
          print(f"{row['description']}")

def plot_metric_distribution(metric_name: str, metrics_data_summary: MetricsSeverity):
    # Use the summary passed in as an argument rather than a global of the same name
    fig = px.histogram(metrics_data_summary.metrics[metric_name].df, x="score", nbins=50)

    fig.update_layout(title=f"{metric_name} score distribution", bargap=0.2)
    fig.show()
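
The `plot_top_k_images` helper sorts a metric's rows by `score` and shows the first `k`, so the `ascending` flag decides which end of the ranking you see. The same idea can be sketched with plain Python data. The rows and identifiers below are invented for illustration, and `top_k` is a hypothetical stand-in for the DataFrame logic.

```python
def top_k(rows, k, ascending=True):
    """Return the k rows with the lowest (ascending) or highest scores."""
    return sorted(rows, key=lambda row: row["score"], reverse=not ascending)[:k]


# Invented example rows; real rows come from the metric's DataFrame
rows = [
    {"identifier": "img_a", "score": 0.91},
    {"identifier": "img_b", "score": 0.12},
    {"identifier": "img_c", "score": 0.55},
    {"identifier": "img_d", "score": 0.03},
]

lowest = [row["identifier"] for row in top_k(rows, k=2, ascending=True)]
highest = [row["identifier"] for row in top_k(rows, k=2, ascending=False)]
print(lowest, highest)
```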


# 🔔 Plot image size distributions

In [None]:
image_sizes = get_all_image_sizes(ea_project.file_structure)
median_image_dimension = get_median_value_of_2d_array(image_sizes)

fig = create_image_size_distribution_chart(image_sizes)

print(f"Total images in the dataset: {len(image_sizes)}")
print(f"Median image sizes: {median_image_dimension[0]}x{median_image_dimension[1]}")
fig.show()
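
To make the "median image size" figure concrete, here is a small sketch that takes the median width and the median height separately from a list of `(width, height)` pairs. It assumes that this column-wise median is what `get_median_value_of_2d_array` reports. The sizes are made-up values, and `median_size` is a hypothetical helper.

```python
import statistics


def median_size(sizes):
    """Return (median width, median height) for a list of (width, height) pairs."""
    widths = [width for width, _ in sizes]
    heights = [height for _, height in sizes]
    return statistics.median(widths), statistics.median(heights)


# Made-up image sizes in pixels
sizes = [(640, 480), (800, 600), (1024, 768), (500, 375), (1920, 1080)]
median_width, median_height = median_size(sizes)
print(f"Median image size: {median_width}x{median_height}")
```

Because each dimension is handled independently, the resulting pair does not have to match the size of any single image in the dataset.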

# 📈 Show total outliers

In [None]:
available_metrics = load_available_metrics(ea_project.file_structure.metrics)
metrics_data_summary = get_metric_summary(available_metrics)
all_metrics_outliers = get_all_metrics_outliers(metrics_data_summary)
fig = create_outlier_distribution_chart(all_metrics_outliers, "tomato", 'orange')

print(f'Total severe outliers: {metrics_data_summary.total_unique_severe_outliers} \n'
      f'Total moderate outliers: {metrics_data_summary.total_unique_moderate_outliers}')

fig.show()
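
The summary above splits outliers into moderate and severe. One common way to make such a split is an interquartile range (IQR) rule, sketched below. This is an illustration under assumptions: the multipliers 1.5 and 2.5, the `classify_outliers` helper, and the scores are all chosen for the example and may not match the thresholds Encord Active applies internally.

```python
import statistics


def classify_outliers(scores, moderate_scale=1.5, severe_scale=2.5):
    """Label each score as 'ok', 'moderate', or 'severe' using an IQR rule.

    The scale values are assumptions made for this sketch.
    """
    q1, _, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    iqr = q3 - q1
    labels = []
    for score in scores:
        # Distance outside the [q1, q3] box; zero for scores inside it
        distance = max(q1 - score, score - q3, 0.0)
        if distance > severe_scale * iqr:
            labels.append("severe")
        elif distance > moderate_scale * iqr:
            labels.append("moderate")
        else:
            labels.append("ok")
    return labels


# Invented metric scores with two unusually large values at the end
scores = [10, 12, 13, 14, 15, 16, 17, 18, 20, 29, 40]
labels = classify_outliers(scores)
print(list(zip(scores, labels)))
```

In this example the quartiles are 13.5 and 19, so a score must lie more than 8.25 beyond them to count as moderate and more than 13.75 to count as severe.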

# 🧐 Inspect problematic images

Next, inspect the dataset for problematic images.

In [None]:
# First, get the list of available metrics
[metric.name for metric in available_metrics]

# 👁️ Visualize score distributions based on metric

In [None]:
for metric in available_metrics:
  plot_metric_distribution(metric.name, metrics_data_summary)

# Get the smallest images

In [None]:
plot_top_k_images('Area', metrics_data_summary, ea_project, k=5, ascending=True)

# Get the biggest images

In [None]:
plot_top_k_images('Area', metrics_data_summary, ea_project, k=5, ascending=False)

# Get the blurriest images

In [None]:
plot_top_k_images('Blur', metrics_data_summary, ea_project, k=5, ascending=False)

# Get the brightest images

In [None]:
plot_top_k_images('Brightness', metrics_data_summary, ea_project, k=5, ascending=False)

# Get the darkest images

In [None]:
plot_top_k_images('Brightness', metrics_data_summary, ea_project, k=5, ascending=True)

# Get the least unique images

In [None]:
plot_top_k_images('Image Singularity', metrics_data_summary, ea_project, k=15, show_description=True)

# Get the images that have the smallest aspect ratio

In [None]:
plot_top_k_images('Aspect Ratio', metrics_data_summary, ea_project, k=10)

# Get the images that have the biggest aspect ratio

In [None]:
plot_top_k_images('Aspect Ratio', metrics_data_summary, ea_project, k=10, ascending=False)
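
As a rough guide to the `Area` and `Aspect Ratio` rankings above, the sketch below computes both quantities directly from image sizes. It assumes that area is width times height in pixels and that aspect ratio is width divided by height. Check the Encord Active metric descriptions for the exact definitions. The sizes are invented.

```python
def area(width, height):
    """Area in pixels (assumed definition: width * height)."""
    return width * height


def aspect_ratio(width, height):
    """Aspect ratio (assumed definition: width / height)."""
    return width / height


# Invented (width, height) pairs
sizes = [(640, 480), (1920, 1080), (300, 900)]

smallest_area = min(sizes, key=lambda size: area(*size))
narrowest = min(sizes, key=lambda size: aspect_ratio(*size))
widest = max(sizes, key=lambda size: aspect_ratio(*size))
print(smallest_area, narrowest, widest)
```

Under these definitions, tall portrait images sit at the low end of the aspect ratio ranking and wide landscape images at the high end.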

# ✅ Wrap Up: Explore more features with Encord Active UI

This was just a small part of Encord Active's capabilities. Use the Encord Active app to explore more of your dataset, labels, and model performance through an easy-to-use interface. With the Encord Active UI, you can:

* Understand the data and label distribution
* Search through data in natural language
* Detect exact and near duplicate images
* Detect label errors and biases
* Gain insights into your model's weak areas
* Generate model explainability reports
* Test, validate, and evaluate your models with advanced error analysis


<br>

![Encord Active UI](https://images.prismic.io/encord/73635182-4f04-4299-a992-a4d383e19765_image2.gif?auto=compress,format)




🟣 Encord Active is an open-source toolkit to prioritize the most valuable image data for labeling to supercharge model performance! **Check out the project on [GitHub](https://github.com/encord-team/encord-active), leave a star 🌟** if you like it. We welcome you to [contribute](https://docs.encord.com/docs/active-contributing) if you find something is missing.

---

👉 Check out the 📖 [Encord Blog](https://encord.com/blog/) and 📺 [YouTube](https://www.youtube.com/@encord) channel to stay up-to-date with the latest in computer vision, foundation models, active learning, and data-centric AI.

---

Thanks for now!

# ⏭️ Next: Learn how to build custom metrics functions in Encord Active

What should you check out next? 👀 Learn how to build custom metrics functions in Encord Active. The Colab notebook will cover code samples and example walkthroughs for:
* Defining metric sub-classes.
* Executing metric functions.
* Investigating custom metrics in the Encord Active UI.

### $~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~$ *üëá*

### ⬅️ [*Previous Notebook*](./Encord_Active_Torchvision_Dataset_Exploration.ipynb) $~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~$ [*Next Notebook*](./Encord_Active_Building_a_Custom_Metric_Function.ipynb) *➡️*
