<h1 style="padding-top: 25px;padding-bottom: 25px;text-align: left; padding-left: 10px; background-color: #DDDDDD;
    color: black;"> <img style="float: left; padding-right: 10px;" src="https://raw.githubusercontent.com/Harvard-IACS/2018-CS109A/master/content/styles/iacs.png" height="50px"> <a href='https://harvard-iacs.github.io/2023-AC215/' target='_blank'><strong><font color="#A41034">AC215: Productionizing AI (MLOps)</font></strong></a></h1>

# **<font color="#A41034">Tutorial - Data Pipelines, Labeling, Versioning, Cloud Storage</font>**

**Harvard University**<br/>
**Fall 2023**<br/>
**Instructor:**<br/>
Pavlos Protopapas

<hr style="height:2pt">

## **<font color="#A41034">Tutorial Outline</font>**

## **Learning Objectives**

By the end of this notebook, you will be:
* Familiar with viewing and retrieving data versions from a DVC repo


## **Tutorial Content**

To run this tutorial, you should have completed the [Data Labeling & Versioning](https://github.com/dlops-io/data-labeling) steps.

- View data from a remote DVC repo
- Retrieve data from a remote DVC repo
- Retrieve a specific data version from DVC


## **<font color="#A41034">Setup Notebook</font>**

**Copy & setup Colab**

1) Select the "File" menu and pick "Save a copy in Drive"

**Installs**

In [None]:
!pip install dvc dvc-gs

Collecting dvc
  Downloading dvc-3.22.0-py3-none-any.whl (425 kB)
Collecting dvc-gs
  Downloading dvc_gs-2.22.1-py3-none-any.whl (10 kB)
Collecting colorama>=0.3.9 (from dvc)
  Downloading colorama-0.4.6-py2.py3-none-any.whl (25 kB)
Collecting configobj>=5.0.6 (from dvc)
  Downloading configobj-5.0.8-py2.py3-none-any.whl (36 kB)
Collecting dpath<3,>=2.1.0 (from dvc)
  Downloading dpath-2.1.6-py3-none-any.whl (17 kB)
Collecting dvc-data<2.17.0,>=2.16.0 (from dvc)
  Downloading dvc_data-2.16.1-py3-none-any.whl (67 kB)

**Imports**

In [None]:
import os
import cv2
import numpy as np
import pandas as pd

# Colab auth
from google.colab import auth
from google.cloud import storage

**Utils**

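The utility function below is used throughout this notebook to summarize a downloaded dataset: it prints the labels, the number of images per label, and the total size of the decoded images in memory.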

In [None]:
def compute_dataset_metrics(dataset_path):
  # Each sub-directory of dataset_path is treated as one label
  label_names = sorted(
    d for d in os.listdir(dataset_path)
    if os.path.isdir(os.path.join(dataset_path, d))
  )
  print("Labels:", label_names)

  # Generate a list of (label, image path) pairs
  data_list = []
  for label in label_names:
    image_files = os.listdir(os.path.join(dataset_path, label))
    data_list.extend([(label, os.path.join(dataset_path, label, f)) for f in image_files])

  # Compute the decoded (in-memory) size of each image in MB
  data_list_with_metrics = []
  for label, path in data_list:
    image = cv2.imread(path)
    if image is None:
      # cv2.imread returns None for files it cannot decode; skip them
      continue
    data_list_with_metrics.append((label, path, image.nbytes / (1024 * 1024.0)))

  # Build a dataframe (numeric sizes stay floats, no string round-trip)
  dataset_df = pd.DataFrame(data_list_with_metrics, columns=["label", "path", "size"])

  dataset_mem_size = dataset_df["size"].sum()
  value_counts = dataset_df["label"].value_counts()

  print("Dataset Metrics:")
  print("----------------")
  print("Label Counts:")
  print(value_counts)
  print("Size in memory:", round(dataset_mem_size, 2), "MB")
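The "Size in memory" reported by `compute_dataset_metrics` is the size of the decoded pixel array, not the size of the JPEG file on disk, which is why a few small files can add up to several MB. As a minimal sketch, assuming 8-bit, 3-channel images (what `cv2.imread` returns by default), the decoded size depends only on the image dimensions. The helper name `decoded_size_mb` and the 1000 x 1500 dimensions are illustrative assumptions, not part of the tutorial's dataset.

```python
import numpy as np

def decoded_size_mb(height, width, channels=3, dtype=np.uint8):
  """Size in MB of a decoded image array with the given shape and dtype."""
  n_bytes = height * width * channels * np.dtype(dtype).itemsize
  return n_bytes / (1024 * 1024.0)

# A hypothetical 1000 x 1500 colour image, shaped like a cv2.imread result
image = np.zeros((1000, 1500, 3), dtype=np.uint8)

# The array's nbytes matches the shape-based calculation
size_from_array = image.nbytes / (1024 * 1024.0)
size_from_shape = decoded_size_mb(1000, 1500)
print(round(size_from_array, 2), "MB")
print(round(size_from_shape, 2), "MB")
```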

## **<font color="#A41034">Authenticate</font>**

Authenticate with the Google account that has access to your GCS bucket.

In [None]:
# This step is required for DVC in colab to access your Bucket
auth.authenticate_user()

## **View Remote Data**

Use `dvc list` to view the contents of a DVC repository. It lists the data files and directories tracked by DVC alongside the remaining Git repo contents; the `-R` flag makes the listing recursive.

[dvc list reference](https://dvc.org/doc/command-reference/list)

In [None]:
# Replace github url with your url
!dvc list -R https://github.com/dlops-io/data-versioning

.dvcignore
.gitignore
Dockerfile
Pipfile
Pipfile.lock
README.md
cli.py
docker-shell.bat
docker-shell.sh
mushroom_dataset.dvc
mushroom_dataset/amanita/1.jpg
mushroom_dataset/crimini/10.jpg
mushroom_dataset/crimini/11.jpg
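The recursive listing is just a set of paths, so it can be summarized before any data is downloaded. The sketch below is one way to count tracked files per label from such a listing. It assumes the `mushroom_dataset/<label>/<file>` layout seen in the output above, and the helper name `count_files_per_label` is hypothetical.

```python
from collections import Counter

def count_files_per_label(listed_paths, dataset_dir="mushroom_dataset"):
  """Count files per label, assuming a <dataset_dir>/<label>/<file> layout."""
  counts = Counter()
  for path in listed_paths:
    parts = path.strip().split("/")
    # Keep only entries of the form <dataset_dir>/<label>/<file>
    if len(parts) == 3 and parts[0] == dataset_dir:
      counts[parts[1]] += 1
  return dict(counts)

# A few paths copied from the listing above
listing = [
  "README.md",
  "mushroom_dataset.dvc",
  "mushroom_dataset/amanita/1.jpg",
  "mushroom_dataset/crimini/10.jpg",
  "mushroom_dataset/crimini/11.jpg",
]
label_counts = count_files_per_label(listing)
print(label_counts)
```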

## **Retrieving Data**

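Once DVC-tracked data are stored remotely, they can be downloaded with `dvc get` when needed. The `--rev` option selects the Git revision (branch, tag, or commit) to download from, and `--force` overwrites the target if it already exists.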

[dvc get reference](https://dvc.org/doc/command-reference/get)

In [None]:
!rm -rf mushroom_dataset
!dvc get https://github.com/dlops-io/data-versioning mushroom_dataset --force --rev dataset_v1

Downloading mushroom_dataset: 100% 3/3 [00:00<00:00, 25.74files/s]

In [None]:
# Check the dataset
compute_dataset_metrics("mushroom_dataset")

Labels: ['amanita', 'crimini']
Dataset Metrics:
----------------
Label Counts:
crimini    2
amanita    1
Name: label, dtype: int64
Size in memory: 9.56 MB
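Retrieving a specific revision can also be scripted from Python. The sketch below only wraps the `dvc get` command line with `subprocess`; it is not the DVC Python API. The helper names `build_dvc_get_command` and `fetch_version` and the output folder `mushroom_dataset_v1` are assumptions made for illustration, and `fetch_version` needs `dvc` installed, network access, and credentials for the remote.

```python
import subprocess

def build_dvc_get_command(repo_url, path, rev, out=None, force=True):
  """Build the argument list for `dvc get` at a given Git revision."""
  cmd = ["dvc", "get", repo_url, path, "--rev", rev]
  if out is not None:
    # -o sets the local target path
    cmd += ["-o", out]
  if force:
    # --force overwrites the target if it already exists
    cmd.append("--force")
  return cmd

def fetch_version(repo_url, path, rev, out=None):
  """Run `dvc get`; raises CalledProcessError if the command fails."""
  subprocess.run(build_dvc_get_command(repo_url, path, rev, out=out), check=True)

# Build (but do not run) the command for the dataset_v1 revision
cmd = build_dvc_get_command(
  "https://github.com/dlops-io/data-versioning",
  "mushroom_dataset",
  rev="dataset_v1",
  out="mushroom_dataset_v1",  # hypothetical output folder name
)
print(" ".join(cmd))
```

Passing a different `out` folder for each revision would keep several versions side by side instead of deleting and re-downloading `mushroom_dataset`.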


## **Retrieve a different data version**

In [None]:
!rm -rf mushroom_dataset
!dvc get https://github.com/dlops-io/data-versioning mushroom_dataset --force --quiet --rev dataset_v2

In [None]:
# Check the dataset
compute_dataset_metrics("mushroom_dataset")

Labels: ['amanita', 'crimini', 'oyster']
Dataset Metrics:
----------------
Label Counts:
crimini    2
amanita    1
oyster     1
Name: label, dtype: int64
Size in memory: 11.7 MB
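Comparing the two outputs by eye shows that `dataset_v2` adds an `oyster` label. As a small sketch of how that comparison could be automated, the hypothetical helper `diff_label_counts` below reports the change in image count per label between two versions. The counts are copied by hand from the outputs above rather than computed from the downloaded data.

```python
def diff_label_counts(old_counts, new_counts):
  """Return {label: change} for labels whose count differs between versions."""
  labels = set(old_counts) | set(new_counts)
  diff = {}
  for label in sorted(labels):
    change = new_counts.get(label, 0) - old_counts.get(label, 0)
    if change != 0:
      diff[label] = change
  return diff

# Label counts copied from the compute_dataset_metrics outputs above
counts_v1 = {"crimini": 2, "amanita": 1}
counts_v2 = {"crimini": 2, "amanita": 1, "oyster": 1}
version_diff = diff_label_counts(counts_v1, counts_v2)
print(version_diff)
```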
