# Using Verta Managed Dataset Versioning on S3

Verta's experiment management can fully manage dataset versions: the data itself is stored in Verta's artifact store, which makes the dataset fully reproducible.

This notebook shows how to create and use Verta-managed dataset versions with Verta's experiment management system. See the Verta [documentation](https://verta.readthedocs.io/en/master/_autogen/verta.dataset.html) for full details on Verta's dataset versioning capabilities.

Updated for Verta version: 0.18.2

<a href="https://colab.research.google.com/github/VertaAI/examples/blob/main/experiment_management/dataset-versioning/verta-managed-dataset-versioning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## 0. Imports

In [1]:
from __future__ import print_function

import warnings
from sklearn.exceptions import ConvergenceWarning
warnings.filterwarnings("ignore", category=ConvergenceWarning)
warnings.filterwarnings("ignore", category=FutureWarning)

import itertools
import os

import numpy as np
import pandas as pd

import sklearn
from sklearn import model_selection
from sklearn import linear_model
import wget

### 0.1 Verta import and setup

In [2]:
# restart your notebook if prompted on Colab
try:
    import verta
except ImportError:
    !pip install verta

In [3]:
# import os
# os.environ['VERTA_EMAIL'] = 
# os.environ['VERTA_DEV_KEY'] = 
# os.environ['VERTA_HOST'] = 
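
Before creating the client, it can help to confirm that the three variables named in the cell above are actually set. The following is a minimal sketch, not part of the Verta API: the helper `missing_verta_env` and the constant `REQUIRED_VERTA_VARS` are hypothetical names introduced here for illustration, and the check assumes an empty string should count as "not set".

```python
import os

# Variable names taken from the commented-out setup cell in this notebook.
REQUIRED_VERTA_VARS = ("VERTA_EMAIL", "VERTA_DEV_KEY", "VERTA_HOST")


def missing_verta_env(environ=None):
    """Return the required variables that are unset or empty (hypothetical helper)."""
    if environ is None:
        environ = os.environ
    return [name for name in REQUIRED_VERTA_VARS if not environ.get(name)]


missing = missing_verta_env()
if missing:
    print("Set these environment variables before creating the client:",
          ", ".join(missing))
```

Passing a plain dictionary instead of `os.environ` makes the helper easy to try out without touching the real environment.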

In [4]:
from verta import Client
from verta.utils import ModelAPI
import os

# VERTA_EMAIL and VERTA_DEV_KEY are read from the environment
client = Client(os.environ['VERTA_HOST'])

---

## 1. Model Training

### 1.1 Prepare data

In [5]:
bucket = "verta-starter"
key = "census-train.csv"

First we download our data from S3 for use in this notebook.

In [6]:
data_dir = os.curdir
train_data_filename = os.path.join(data_dir, key)

if not os.path.exists(train_data_filename):
    train_data_url = "http://s3.amazonaws.com/{}/{}".format(bucket, key)
    wget.download(train_data_url, train_data_filename)

Then we version our dataset; with `enable_mdb_versioning=True`, the client will obtain the data file(s) from S3 and store them in ModelDB.

In [7]:
from verta.dataset import S3

dataset = client.set_dataset(name="Census Income S3")
content = S3("s3://{}/{}".format(bucket, key), enable_mdb_versioning=True)
dataset_version = dataset.create_version(content)

In [8]:
df_train = pd.read_csv(train_data_filename)
X_train = df_train.iloc[:,:-1]
y_train = df_train.iloc[:, -1]

df_train.head()

In [9]:
# create objects to track the experiment run
proj = client.set_project("Verta Managed Dataset Versioning")
expt = client.set_experiment("My Experiment")
run = client.set_experiment_run()

# log training data
run.log_dataset_version("train", dataset_version)

# ---------------------- other tracking below ------------------------

# create validation split
(X_val_train, X_val_test,
 y_val_train, y_val_test) = model_selection.train_test_split(X_train, y_train,
                                                             test_size=0.2,
                                                             shuffle=True)
# log hyperparameters
hyperparams = {"C" : 10}
run.log_hyperparameters(hyperparams)
print(hyperparams, end=' ')

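# create and train model on the training split only,
# so the validation rows stay unseen and the score below is not inflated
model = linear_model.LogisticRegression(**hyperparams)
model.fit(X_val_train, y_val_train)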

# calculate and log validation accuracy
val_acc = model.score(X_val_test, y_val_test)
run.log_metric("val_acc", val_acc)
print("Validation accuracy: {:.4f}".format(val_acc))

In [10]:
# fetch the dataset version info
run.get_dataset_version("train")

---

## 2. Retrieve data

Suppose our data file is lost at some point in the future.

In [11]:
os.remove(train_data_filename)

Because we've used Verta to manage our data, we can obtain the dataset version from our experiment run, and use it to recover our original dataset.

In [12]:
run = proj.expt_runs[0]
version = run.get_dataset_version("train")

version.get_content().download(
    "s3://{}/{}".format(bucket, key),
    download_to_path=train_data_filename,
)

In [13]:
pd.read_csv(train_data_filename).head()
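
Viewing the first rows shows that a file came back, but not that it is byte-for-byte the file we started with. One way to check this is to compare checksums taken before deletion and after recovery. The sketch below is an illustration under stated assumptions rather than a Verta feature: `file_sha256` is a hypothetical helper, and the demo runs on two throwaway files with made-up contents instead of the census data.

```python
import hashlib
import os
import tempfile


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks (hypothetical helper)."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demo on throwaway files with placeholder contents. In this notebook you would
# call file_sha256(train_data_filename) once before os.remove() and once after
# download(), then compare the two digests.
with tempfile.TemporaryDirectory() as tmp:
    original = os.path.join(tmp, "original.csv")
    recovered = os.path.join(tmp, "recovered.csv")
    for path in (original, recovered):
        with open(path, "wb") as f:
            f.write(b"age,income\n39,<=50K\n")
    files_match = file_sha256(original) == file_sha256(recovered)

print("Recovered file matches original:", files_match)
```

Reading in chunks keeps memory use flat, which matters once the dataset is larger than this example file.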

---