# DVC Demo of Common Use Case

Nutrition5k contains dish images and the corresponding dish macronutrient labels (CSV files) collected from different Google cafeterias. This tutorial walks through pulling different versions of the dish macronutrient label CSVs. This will be useful once we start adding new batches of images and label CSVs, or changing existing ones.

Note: This notebook is intended to run in Google Colab and assumes that previous versions of the desired dataset are already tracked with DVC and tagged in Git.

## Install DVC

In [1]:
# DVC installation: uncomment the line below and run this cell

#!pip install dvc dvc-gs

In [2]:
import os
import cv2
import numpy as np
import pandas as pd

# Colab auth
from google.colab import auth
from google.cloud import storage

import dvc.api

In [3]:
# This step is required so that DVC in Colab can access our Google Cloud Storage bucket
auth.authenticate_user()

### Utils


In [7]:
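# Util to list the files in a downloaded dataset directory
def print_directory_metrics(dataset_path):
    """Print the names of the entries directly inside dataset_path."""
    file_names = os.listdir(dataset_path)
    print("Files:", file_names)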
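If more than the file names is useful, the helper could also report a file count and total size. The following is a minimal sketch and not part of the original notebook; `directory_metrics` and the keys of the dictionary it returns are hypothetical names chosen for illustration.

```python
import os


def directory_metrics(dataset_path):
    """Return the file names, file count, and total size in bytes of a directory.

    Hypothetical helper for illustration. Only regular files directly inside
    dataset_path are counted; subdirectories are ignored.
    """
    file_names = sorted(
        name
        for name in os.listdir(dataset_path)
        if os.path.isfile(os.path.join(dataset_path, name))
    )
    total_bytes = sum(
        os.path.getsize(os.path.join(dataset_path, name)) for name in file_names
    )
    return {
        "files": file_names,
        "count": len(file_names),
        "total_bytes": total_bytes,
    }
```

Returning a dictionary instead of printing makes it easy to compare the metrics of two downloaded versions programmatically.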

## View Remote Data

In [8]:
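# Replace the GitHub URL below with the URL of your own repo.
# If the repo is private, replace the {INSERT_GIT_CLASSIC_PERSONAL_TOKEN} placeholder in front of @github.com with a Git classic personal access token.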

!dvc list -R https://{INSERT_GIT_CLASSIC_PERSONAL_TOKEN}@github.com/dyeramosu/AC215_snapnutrition/

.gitignore
README.md
app/.flaskenv
app/.gitignore
app/Dockerfile
app/Pipfile
app/Pipfile.lock
app/app.py
app/static/css/custom_styles.css
app/static/css/styles.css
app/static/fonts/glyphicons-halflings-regular.eot
app/static/fonts/glyphicons-halflings-regular.svg
app/static/fonts/glyphicons-halflings-regular.ttf
app/static/fonts/glyphicons-halflings-regular.woff
app/static/fonts/glyphicons-halflings-regular.woff2
app/static/img/1837-diabetic-pecan-crusted-chicken-breast_JulAug20DF_clean-simple_061720 Background Removed.png
app/static/img/1837-diabetic-pecan-crusted-chicken-breast_JulAug20DF_clean-simple_061720.jpg
app/static/img/construction_img.jpeg
app/static/img/sample_upload_file_ui.png
app/static/js/main.js
app/static/js/scripts.js
app/templates/layouts/main.html
app/templates/pages/home.html
app/templates/pages/results.html
app/templates/pages/under_construction.html
app/templates/pages/upload_photo.html


Previously, we tagged version 1 of our dish labels with the Git tag "dishlabels_raw_v1".

In [9]:
# Delete the existing dish_labels folder, if any, then download v1
!rm -rf dish_labels
!dvc get https://{INSERT_GIT_CLASSIC_PERSONAL_TOKEN}@github.com/dyeramosu/AC215_snapnutrition/ snapnutrition_data_bucket/data/raw_data/Nutrition5k_Other/dish_labels --force --rev dishlabels_raw_v1

Downloading dish_labels:   0% 0/2 [00:00<?, ?files/s]
Downloading dish_labels:  50% 1/2 [00:00<00:00,  1.49files/s]
Downloading dish_labels: 100% 2/2 [00:01<00:00,  1.80files/s]


In [10]:
# Check the dataset
print_directory_metrics("dish_labels")

Files: ['dish_metadata_cafe2.csv', 'dish_metadata_cafe1.csv']


Hooray, we have the original two CSVs from two Google cafeterias.

Next, let's show that pulling the Git tag dishlabels_raw_v2, in which a third CSV was added, gives us three CSVs.
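The v1 and v2 downloads differ only in the value passed to `--rev`, so the command can be assembled by a small helper. This is a sketch under stated assumptions: `build_dvc_get_command` and the `GIT_TOKEN` environment variable are hypothetical names that do not appear in the original notebook, and reading the token from the environment is just one way to avoid pasting it into a cell.

```python
import os


def build_dvc_get_command(repo_host_path, data_path, rev, token=None):
    """Build the argument list for `dvc get` at a given Git revision.

    repo_host_path is the repo URL without the scheme,
    e.g. "github.com/user/repo/". token is only needed for private repos.
    """
    credentials = f"{token}@" if token else ""
    repo_url = f"https://{credentials}{repo_host_path}"
    return ["dvc", "get", repo_url, data_path, "--force", "--rev", rev]


# GIT_TOKEN is an assumed environment variable name, not a DVC convention.
cmd = build_dvc_get_command(
    "github.com/dyeramosu/AC215_snapnutrition/",
    "snapnutrition_data_bucket/data/raw_data/Nutrition5k_Other/dish_labels",
    rev="dishlabels_raw_v2",
    token=os.environ.get("GIT_TOKEN"),
)
# The list can then be run with subprocess.run(cmd, check=True).
```

Avoid printing `cmd` when a token is set, since the token is embedded in the URL.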

In [11]:
# Delete the existing dish_labels folder, if any, then download v2

!rm -rf dish_labels
!dvc get https://{INSERT_GIT_CLASSIC_PERSONAL_TOKEN}@github.com/dyeramosu/AC215_snapnutrition/ snapnutrition_data_bucket/data/raw_data/Nutrition5k_Other/dish_labels --force --rev dishlabels_raw_v2

Downloading dish_labels:   0% 0/3 [00:00<?, ?files/s]
Downloading dish_labels:  33% 1/3 [00:00<00:01,  1.03files/s]

In [12]:
# Check the dataset
print_directory_metrics("dish_labels")

Files: ['dish_metadata_cafe2.csv', 'dish_metadata_TESTcafe3.csv', 'dish_metadata_cafe1.csv']


Hooray! We now have a third file, 'dish_metadata_TESTcafe3.csv', standing in for a hypothetical third cafeteria or any new batch of labels.
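To make the difference between the two pulled versions explicit, the file listings can be compared as sets. This is a minimal sketch; `diff_file_listings` is a hypothetical helper, and the two lists are copied from the `Files:` outputs shown for each tag.

```python
def diff_file_listings(old_files, new_files):
    """Return (added, removed) file names between two directory listings."""
    old_set, new_set = set(old_files), set(new_files)
    return sorted(new_set - old_set), sorted(old_set - new_set)


# File names as printed for dishlabels_raw_v1 and dishlabels_raw_v2.
v1_files = ["dish_metadata_cafe2.csv", "dish_metadata_cafe1.csv"]
v2_files = [
    "dish_metadata_cafe2.csv",
    "dish_metadata_TESTcafe3.csv",
    "dish_metadata_cafe1.csv",
]

added, removed = diff_file_listings(v1_files, v2_files)
print("Added:", added)      # Added: ['dish_metadata_TESTcafe3.csv']
print("Removed:", removed)  # Removed: []
```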