<a href="https://colab.research.google.com/github/wandb/edu/blob/main/mlops-001/lesson1/01_EDA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
<!--- @wandbcode{course-lesson1} -->

In [1]:
!pip install wandb -qq

In [2]:
# Install dependencies (run once)
!wget https://raw.githubusercontent.com/wandb/edu/main/mlops-001/lesson1/requirements.txt
!wget https://raw.githubusercontent.com/wandb/edu/main/mlops-001/lesson1/params.py
!wget https://raw.githubusercontent.com/wandb/edu/main/mlops-001/lesson1/utils.py
!pip install -r requirements.txt

--2024-12-17 20:08:07--  https://raw.githubusercontent.com/wandb/edu/main/mlops-001/lesson1/requirements.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 82 [text/plain]
Saving to: ‘requirements.txt’


2024-12-17 20:08:08 (1.04 MB/s) - ‘requirements.txt’ saved [82/82]

--2024-12-17 20:08:08--  https://raw.githubusercontent.com/wandb/edu/main/mlops-001/lesson1/params.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 295 [text/plain]
Saving to: ‘params.py’


2024-12-17 20:08:08 (22.5 MB/s) - ‘params.py’ saved [295/295]

--2024-1

# EDA

In this notebook, we will download a sample of the [BDD100K](https://www.bdd100k.com/) semantic segmentation dataset and use W&B Artifacts and Tables to version and analyze our data.

In [3]:
DEBUG = False # set this flag to True to use a small subset of data for testing

In [4]:
from fastai.vision.all import *
import params

import wandb

We have defined some global configuration parameters in the `params.py` file. `ENTITY` should be your W&B team name if you work in a team; set it to `None` if you work individually.
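For orientation, this notebook reads four names from `params`: `WANDB_PROJECT`, `ENTITY`, `RAW_DATA_AT`, and `BDD_CLASSES`. The sketch below shows the shape such a module might take. Every value is an illustrative placeholder, not the contents of the downloaded `params.py`.

```python
# Hypothetical stand-in for params.py; all values below are placeholders.
WANDB_PROJECT = "my-segmentation-project"  # assumed project name
ENTITY = None                              # None falls back to your default W&B entity
RAW_DATA_AT = "raw_dataset"                # assumed name for the raw-data Artifact

# Maps integer mask values to class names (illustrative entries only)
BDD_CLASSES = {0: "background", 1: "road", 2: "vehicle"}
```

The only structural requirement the later cells place on `BDD_CLASSES` is that its keys are the integer ids found in the mask images and its values are the class names used as table columns.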

In the section below, we will use the `untar_data` function from `fastai` to download and unzip our dataset.

In [5]:
URL = 'https://storage.googleapis.com/wandb_course/bdd_simple_1k.zip'

In [6]:
path = Path(untar_data(URL, force_download=True))

In [8]:
(path/'images').ls()

(#1000) [Path('/root/.fastai/data/bdd_simple_1k/images/25531db8-076c0000.jpg'),Path('/root/.fastai/data/bdd_simple_1k/images/069837be-00000000.jpg'),Path('/root/.fastai/data/bdd_simple_1k/images/2d783b3a-a6304dbd.jpg'),Path('/root/.fastai/data/bdd_simple_1k/images/6b9077b7-e6a07c88.jpg'),Path('/root/.fastai/data/bdd_simple_1k/images/3f7e121a-53b6beb4.jpg'),Path('/root/.fastai/data/bdd_simple_1k/images/2d3734da-e7e6f31c.jpg'),Path('/root/.fastai/data/bdd_simple_1k/images/68e7781a-cffc2268.jpg'),Path('/root/.fastai/data/bdd_simple_1k/images/8e74dd69-c75b794b.jpg'),Path('/root/.fastai/data/bdd_simple_1k/images/6a06ce90-c2a41753.jpg'),Path('/root/.fastai/data/bdd_simple_1k/images/6ec802d7-6f673b8a.jpg'),Path('/root/.fastai/data/bdd_simple_1k/images/3a875150-0882e9d2.jpg'),Path('/root/.fastai/data/bdd_simple_1k/images/38e561dd-8ea8ae2d.jpg'),Path('/root/.fastai/data/bdd_simple_1k/images/c2c5fee5-a8360924.jpg'),Path('/root/.fastai/data/bdd_simple_1k/images/982f86ee-c452edf3.jpg'),Path('/root
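Before building anything on top of the download, it can help to confirm that every image has a matching mask. The following is a minimal sketch, assuming the `labels/<stem>_mask.png` naming convention that `label_func` relies on later in this notebook. `find_unpaired` is a hypothetical helper, and the demo runs on a temporary folder with made-up file names rather than the real dataset.

```python
from pathlib import Path
import tempfile

def find_unpaired(images_dir, labels_dir):
    """Return names of .jpg images that have no `<stem>_mask.png` in labels_dir."""
    images_dir, labels_dir = Path(images_dir), Path(labels_dir)
    return sorted(
        img.name
        for img in images_dir.glob("*.jpg")
        if not (labels_dir / f"{img.stem}_mask.png").exists()
    )

# Tiny self-contained demo on a temporary folder (hypothetical file names)
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "images").mkdir()
    (root / "labels").mkdir()
    for stem in ("a", "b"):
        (root / "images" / f"{stem}.jpg").touch()
    (root / "labels" / "a_mask.png").touch()  # "b" deliberately has no mask
    missing = find_unpaired(root / "images", root / "labels")

print(missing)  # ['b.jpg']
```

On the downloaded data, `find_unpaired(path/'images', path/'labels')` would be expected to return an empty list.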

Here we define several functions to help us process the data and upload it as a `Table` to W&B.

In [9]:
def label_func(fname):
    "Return the mask path for an image: <dataset>/labels/<stem>_mask.png"
    return (fname.parent.parent/"labels")/f"{fname.stem}_mask.png"

def get_classes_per_image(mask_data, class_labels):
    "Map each class name to 1 if its id appears in the mask, else 0"
    unique = list(np.unique(mask_data))
    result_dict = {}
    for _class in class_labels.keys():
        result_dict[class_labels[_class]] = int(_class in unique)
    return result_dict

def _create_table(image_files, class_labels):
    "Create a table with the dataset"
    labels = [str(class_labels[_lab]) for _lab in list(class_labels)]
    table = wandb.Table(columns=["File_Name", "Images", "Split"] + labels)

    for i, image_file in progress_bar(enumerate(image_files), total=len(image_files)):
        image = Image.open(image_file)
        mask_data = np.array(Image.open(label_func(image_file)))
        class_in_image = get_classes_per_image(mask_data, class_labels)
        table.add_data(
            str(image_file.name),
            wandb.Image(
                    image,
                    masks={
                        "predictions": {
                            "mask_data": mask_data,
                            "class_labels": class_labels,
                        }
                    }
            ),
            "None", # we don't have a dataset split yet
            *[class_in_image[_lab] for _lab in labels]
        )

    return table
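To make the per-image class flags concrete, here is a dependency-free sketch of what `get_classes_per_image` computes. The mask and label map are tiny, hand-written, hypothetical examples, and a plain Python set stands in for `np.unique`.

```python
def classes_present(mask_rows, class_labels):
    """Pure-Python analogue of get_classes_per_image for a nested-list mask."""
    unique = {value for row in mask_rows for value in row}
    return {name: int(class_id in unique) for class_id, name in class_labels.items()}

# Illustrative 2x3 mask: class ids 0 and 2 appear, 1 does not
toy_mask = [[0, 0, 2],
            [0, 2, 2]]
toy_labels = {0: "background", 1: "road", 2: "vehicle"}

flags = classes_present(toy_mask, toy_labels)
print(flags)  # {'background': 1, 'road': 0, 'vehicle': 1}
```

Each table row then holds the file name, the image with its mask overlay, the split placeholder, and one such 0/1 flag per class column.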

We will start a new W&B `run` and put everything into a raw Artifact.

In [10]:
run = wandb.init(project=params.WANDB_PROJECT, entity=params.ENTITY, job_type="upload")
raw_data_at = wandb.Artifact(params.RAW_DATA_AT, type="raw_data")

[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.


<IPython.core.display.Javascript object>

[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:

 ··········


[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


In [11]:
raw_data_at.add_file(path/'LICENSE.txt', name='LICENSE.txt')

ArtifactManifestEntry(path='LICENSE.txt', digest='X+6ZFkDOlnKesJCNt20yRg==', size=1594, local_path='/root/.local/share/wandb/artifacts/staging/tmpbfnih2nt', skip_cache=False)
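The `digest` in the manifest entry is 24 base64 characters, which decodes to 16 bytes, the size of an MD5 hash. Assuming the digest is a base64-encoded MD5 of the file contents (an inference from its shape, not something this notebook states), a value of the same form could be computed with the standard library. `b64_md5` is a hypothetical helper and the input bytes are made up.

```python
import base64
import hashlib

def b64_md5(data: bytes) -> str:
    """Base64-encode the MD5 hash of `data`."""
    return base64.b64encode(hashlib.md5(data).digest()).decode("ascii")

digest = b64_md5(b"example file contents")
print(len(digest))                    # 24 characters
print(len(base64.b64decode(digest)))  # 16 bytes
```

A content-based digest of this kind is what would allow unchanged files to be recognized when a new version of an Artifact is logged.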

Let's add the images and label masks.

In [12]:
raw_data_at.add_dir(path/'images', name='images')
raw_data_at.add_dir(path/'labels', name='labels')

[34m[1mwandb[0m: Adding directory to artifact (/root/.fastai/data/bdd_simple_1k/images)... Done. 1.2s
[34m[1mwandb[0m: Adding directory to artifact (/root/.fastai/data/bdd_simple_1k/labels)... Done. 0.6s


Let's get the file names of images in our dataset and use the function we defined above to create a W&B Table.

In [13]:
image_files = get_image_files(path/"images", recurse=False)

# sample a subset if DEBUG
if DEBUG: image_files = image_files[:10]

In [14]:
table = _create_table(image_files, params.BDD_CLASSES)
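Since every class column is a 0/1 flag, summing a column gives the number of images that contain that class. The sketch below works on hand-written rows that stand in for the table contents; the data is hypothetical and `class_frequency` is not part of the notebook.

```python
def class_frequency(rows):
    """Count, per class name, how many rows have the flag set to 1."""
    totals = {}
    for flags in rows:
        for name, flag in flags.items():
            totals[name] = totals.get(name, 0) + flag
    return totals

# Three made-up images with per-class presence flags
rows = [
    {"background": 1, "road": 1, "vehicle": 0},
    {"background": 1, "road": 0, "vehicle": 1},
    {"background": 1, "road": 1, "vehicle": 1},
]

freq = class_frequency(rows)
print(freq)  # {'background': 3, 'road': 2, 'vehicle': 2}
```

Counts like these give a quick view of class imbalance before the data is split.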

Finally, we will add the Table to our Artifact, log it to W&B and finish our `run`.

In [15]:
raw_data_at.add(table, "eda_table")

ArtifactManifestEntry(path='eda_table.table.json', digest='1QwTQVOGS15Fy6Q8TGSegA==', size=588824, local_path='/root/.local/share/wandb/artifacts/staging/tmp_r3a4n7e', skip_cache=False)

In [16]:
run.log_artifact(raw_data_at)
run.finish()
