<a href="https://colab.research.google.com/github/wandb/edu/blob/main/mlops-001/lesson1/01_EDA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
<!--- @wandbcode{course-lesson1} -->

# EDA 

In this notebook, we will download a sample of the [BDD100K](https://www.bdd100k.com/) semantic segmentation dataset and use W&B Artifacts and Tables to version and analyze our data. 

In [1]:
DEBUG = False # set this flag to True to use a small subset of data for testing

In [2]:
from fastai.vision.all import *
import params

import wandb

## Download and look at data

We have defined some global configuration parameters in the `params.py` file. `ENTITY` should be your W&B team name if you work in a team; set it to `None` if you work individually.
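For orientation, a `params.py` along these lines would satisfy every reference this notebook makes to it. This is only a sketch: the `BDD_CLASSES` mapping is copied from the output shown later in the notebook, while the project and artifact names are assumed placeholders that you should replace with your own.

```python
# Hypothetical sketch of params.py; only BDD_CLASSES is taken from this notebook's output.
WANDB_PROJECT = "mlops-course-001"  # assumed placeholder project name
ENTITY = None  # your W&B team name, or None to log to your personal account
RAW_DATA_AT = "bdd_simple_1k"  # assumed placeholder name for the raw-data Artifact

# Maps the integer pixel values found in the mask files to class names.
BDD_CLASSES = {
    0: "background",
    1: "road",
    2: "traffic light",
    3: "traffic sign",
    4: "person",
    5: "vehicle",
    6: "bicycle",
}
```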

In the section below, we will use the `untar_data` function from `fastai` to download and unzip our dataset.

In [3]:
URL = 'https://storage.googleapis.com/wandb_course/bdd_simple_1k.zip'

In [4]:
path = Path(untar_data(URL, force_download=True))

In [5]:
path.ls()

(#3) [Path('/Users/enrythebest/.fastai/data/bdd_simple_1k/images'),Path('/Users/enrythebest/.fastai/data/bdd_simple_1k/labels'),Path('/Users/enrythebest/.fastai/data/bdd_simple_1k/LICENSE.txt')]

In [6]:
(path / "images").ls()


(#1000) [Path('/Users/enrythebest/.fastai/data/bdd_simple_1k/images/a59131a5-00000000.jpg'),Path('/Users/enrythebest/.fastai/data/bdd_simple_1k/images/6886b3d9-6ab2b28d.jpg'),Path('/Users/enrythebest/.fastai/data/bdd_simple_1k/images/115e4aff-00000000.jpg'),Path('/Users/enrythebest/.fastai/data/bdd_simple_1k/images/b803d91d-671b8cff.jpg'),Path('/Users/enrythebest/.fastai/data/bdd_simple_1k/images/c665137e-6fffaf45.jpg'),Path('/Users/enrythebest/.fastai/data/bdd_simple_1k/images/6b293d3e-59d5f868.jpg'),Path('/Users/enrythebest/.fastai/data/bdd_simple_1k/images/898ac5b9-00000000.jpg'),Path('/Users/enrythebest/.fastai/data/bdd_simple_1k/images/a91b7555-00001125.jpg'),Path('/Users/enrythebest/.fastai/data/bdd_simple_1k/images/16e186ec-00000000.jpg'),Path('/Users/enrythebest/.fastai/data/bdd_simple_1k/images/b18cb922-e3af77af.jpg')...]

In [7]:
(path / "labels").ls()


(#1001) [Path('/Users/enrythebest/.fastai/data/bdd_simple_1k/labels/2ad035f4-cd94d608_mask.png'),Path('/Users/enrythebest/.fastai/data/bdd_simple_1k/labels/7f92cc43-edf59deb_mask.png'),Path('/Users/enrythebest/.fastai/data/bdd_simple_1k/labels/3d0d454e-f0132c99_mask.png'),Path('/Users/enrythebest/.fastai/data/bdd_simple_1k/labels/a1acfdb8-68db64af_mask.png'),Path('/Users/enrythebest/.fastai/data/bdd_simple_1k/labels/1af21ce1-e157b867_mask.png'),Path('/Users/enrythebest/.fastai/data/bdd_simple_1k/labels/167a78e3-f955e644_mask.png'),Path('/Users/enrythebest/.fastai/data/bdd_simple_1k/labels/652f7310-5dbedb62_mask.png'),Path('/Users/enrythebest/.fastai/data/bdd_simple_1k/labels/6b73ccdd-00000000_mask.png'),Path('/Users/enrythebest/.fastai/data/bdd_simple_1k/labels/203fb7bc-cbb4be86_mask.png'),Path('/Users/enrythebest/.fastai/data/bdd_simple_1k/labels/0027eed2-09c90000_mask.png')...]

## Functions to handle the dataset

Here we define several functions to help us process the data and upload it as a `Table` to W&B. 

In [8]:
def label_func(fname: Path) -> Path:
    """This function takes a file name for an image
    and it returns the full path to the corresponding mask/label
    file.

    Args:
        fname (Path): The filename of the image we want the label of. 
                     (it should be in the 'images' folder)

    Returns:
        Path: The full path to the mask file in the 'labels' folder
    """
    return (fname.parent.parent/"labels")/f"{fname.stem}_mask.png"


In [9]:
image_path = (path / "images").ls()[0]

In [10]:
print(image_path)

/Users/enrythebest/.fastai/data/bdd_simple_1k/images/a59131a5-00000000.jpg


In [11]:
label_func(image_path)

Path('/Users/enrythebest/.fastai/data/bdd_simple_1k/labels/a59131a5-00000000_mask.png')
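The listings above show 1000 entries in `images` and 1001 in `labels`, so it can be worth checking that every image has a mask and seeing which label entries have no matching image. The helper below is a minimal sketch using only the standard library; `find_unpaired` is a hypothetical name, and it assumes the same `<stem>_mask.png` naming convention that `label_func` relies on. The file names in the example are made up.

```python
from pathlib import Path
from typing import Iterable, List, Tuple


def find_unpaired(
    image_files: Iterable[Path], label_files: Iterable[Path]
) -> Tuple[List[str], List[str]]:
    """Return (missing mask names, label entries with no matching image)."""
    # Mask names we expect, following the '<stem>_mask.png' convention.
    expected = {f"{Path(f).stem}_mask.png" for f in image_files}
    # Names actually present in the labels folder.
    actual = {Path(f).name for f in label_files}
    return sorted(expected - actual), sorted(actual - expected)


# Made-up file names, for illustration only.
images = [Path("images/a.jpg"), Path("images/b.jpg")]
labels = [
    Path("labels/a_mask.png"),
    Path("labels/b_mask.png"),
    Path("labels/notes.txt"),
]
missing, extra = find_unpaired(images, labels)
print("missing masks:", missing)
print("extra label entries:", extra)
```

In this notebook you could call it as `find_unpaired((path / "images").ls(), (path / "labels").ls())`.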

In [12]:
from typing import Dict


def get_classes_per_image(mask_data: np.ndarray, class_labels: Dict) -> Dict:
    """This function takes the data of a label file (mask)
    and a dictionary mapping numbers to different object classes.
    Then it returns a new dictionary with all the classes in the
    given mask.

    Args:
        mask_data (np.ndarray): The data from the mask/label file
        class_labels (Dict): A dictionary of all classes and their names

    Returns:
        Dict: A dictionary of the classes in the input mask file
    """
    unique = list(np.unique(mask_data))
    result_dict = {}
    for _class in class_labels.keys():
        result_dict[class_labels[_class]] = int(_class in unique)
    return result_dict


In [13]:
params.BDD_CLASSES  # all class labels

{0: 'background',
 1: 'road',
 2: 'traffic light',
 3: 'traffic sign',
 4: 'person',
 5: 'vehicle',
 6: 'bicycle'}

In [14]:
mask_data = np.array(Image.open(label_func(image_path)))

In [15]:
mask_data.shape

(720, 1280)

In [16]:
get_classes_per_image(mask_data=mask_data, class_labels=params.BDD_CLASSES)

{'background': 1,
 'road': 1,
 'traffic light': 0,
 'traffic sign': 1,
 'person': 0,
 'vehicle': 1,
 'bicycle': 0}
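Because each image yields a dictionary of 0/1 flags, summing those dictionaries over many images gives the number of images that contain each class. The sketch below shows one way to do that with the standard library; `class_frequency` is a hypothetical helper and the flags in the example are made up rather than taken from the dataset.

```python
from collections import Counter
from typing import Dict, Iterable


def class_frequency(per_image_flags: Iterable[Dict[str, int]]) -> Dict[str, int]:
    """Sum per-image 0/1 presence flags into per-class image counts."""
    totals: Counter = Counter()
    for flags in per_image_flags:
        # Counter.update with a mapping adds the values key by key.
        totals.update(flags)
    return dict(totals)


# Made-up flags for three images, in the format get_classes_per_image returns.
example_flags = [
    {"road": 1, "person": 0, "vehicle": 1},
    {"road": 1, "person": 1, "vehicle": 0},
    {"road": 1, "person": 0, "vehicle": 0},
]
counts = class_frequency(example_flags)
print(counts)
```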

In [17]:
from typing import List


def _create_table(image_files: List, class_labels: Dict) -> wandb.Table:
    """This function creates a wandb Table to be stored on wandb.
    It includes columns with images and numbers. The images come from the
    image files and label files.

    Args:
        image_files (List): A list of Paths to files with images.
        class_labels (Dict): The dictionary defining the labels in the dataset

    Returns:
        wandb.Table: A wandb table including images and numbers
    """
    labels = [
        str(class_labels[_lab]) for _lab in list(class_labels)
    ]  # strings for each label
    table = wandb.Table(
        columns=["File_Name", "Images", "Split"] + labels
    )  # init table with these columns (split is for train/valid/test indicator)

    for image_file in progress_bar(image_files, total=len(image_files)):
        # loop over all images in the list
        image = Image.open(image_file)
        mask_data = np.array(Image.open(label_func(image_file)))
        class_in_image = get_classes_per_image(mask_data, class_labels)
        # add data to table: one row
        table.add_data(
            str(image_file.name),
            # use the Image type: it can handle masks!
            # https://docs.wandb.ai/guides/track/log/media#image-overlays-in-tables
            wandb.Image(
                image,
                masks={
                    "predictions": {
                        "mask_data": mask_data,
                        "class_labels": class_labels,
                    }
                },
            ),
            "None",  # we don't have a dataset split yet
            *[class_in_image[_lab] for _lab in labels]
        )

    return table


## Start a W&B run

We will start a new W&B `run` and put everything into an Artifact of type `raw_data`.

In [18]:
# job_type labels what kind of run this is, which helps when filtering and grouping runs. Here we just use the run to upload artifacts.
run = wandb.init(project=params.WANDB_PROJECT, entity=params.ENTITY, job_type="upload")

[34m[1mwandb[0m: Currently logged in as: [33merinaldi[0m ([33merinaldi-team[0m). Use [1m`wandb login --relogin`[0m to force relogin


In [19]:
# init the Artifact, giving it a name and a type. Here we want to upload the raw data
raw_data_at = wandb.Artifact(params.RAW_DATA_AT, type="raw_data")

Add the first file, the dataset license.

In [20]:
raw_data_at.add_file(path/'LICENSE.txt', name='LICENSE.txt')

ArtifactManifestEntry(path='LICENSE.txt', digest='X+6ZFkDOlnKesJCNt20yRg==', ref=None, birth_artifact_id=None, size=1594, extra={}, local_path='/Users/enrythebest/Library/Application Support/wandb/artifacts/staging/tmp3ym4gd0o')

Let's add the images and label masks.

In [21]:
raw_data_at.add_dir(path/'images', name='images')
raw_data_at.add_dir(path/'labels', name='labels')

[34m[1mwandb[0m: Adding directory to artifact (/Users/enrythebest/.fastai/data/bdd_simple_1k/images)... Done. 0.3s
[34m[1mwandb[0m: Adding directory to artifact (/Users/enrythebest/.fastai/data/bdd_simple_1k/labels)... Done. 0.3s


Let's get the file names of images in our dataset and use the function we defined above to create a W&B Table. 

In [22]:
image_files = get_image_files(path/"images", recurse=False)

# sample a subset if DEBUG
if DEBUG: image_files = image_files[:10]

In [23]:
table = _create_table(image_files, params.BDD_CLASSES)

In [24]:
table

<wandb.data_types.Table at 0x2a16d1c60>

Finally, we will add the Table to our Artifact, log it to W&B and finish our `run`. 

In [25]:
raw_data_at.add(table, "eda_table")

ArtifactManifestEntry(path='eda_table.table.json', digest='Kb95NefVOOU0ulL2SYXFtw==', ref=None, birth_artifact_id=None, size=588824, extra={}, local_path='/Users/enrythebest/Library/Application Support/wandb/artifacts/staging/tmpeis8mvcv')

`log_artifact` attaches the Artifact to our run and starts pushing it to the W&B server.

In [26]:
run.log_artifact(raw_data_at)


<wandb.sdk.wandb_artifacts.Artifact at 0x280301690>

For now we are done, so we can close the W&B run we started with `init`. This can take some time because `finish` waits for the pending uploads to complete.

In [27]:
run.finish()
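Once the Artifact is logged, a later run can fetch it by name. The helper below is a minimal sketch of that step, not part of the original notebook: `download_raw_data` is a hypothetical name, `job_type="download"` is an arbitrary label chosen here, and the `:latest` alias assumes you want the most recent version.

```python
from typing import Optional


def download_raw_data(project: str, entity: Optional[str], artifact_name: str) -> str:
    """Download the latest version of a logged Artifact and return its local directory."""
    import wandb  # imported lazily so the helper can be defined without wandb installed

    # job_type is an arbitrary label chosen for this sketch.
    run = wandb.init(project=project, entity=entity, job_type="download")
    try:
        # Declare the Artifact as an input of this run, then fetch its files.
        artifact = run.use_artifact(f"{artifact_name}:latest")
        local_dir = artifact.download()
    finally:
        # Always close the run, even if the download fails.
        run.finish()
    return local_dir
```

With the parameters used in this notebook, the call would be `download_raw_data(params.WANDB_PROJECT, params.ENTITY, params.RAW_DATA_AT)`.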