# Tracking S3 dataset versions using Verta

Verta's experiment management system enables data scientists to track dataset versions stored on S3.

This notebook shows how to create a dataset version from an S3 bucket and attach it to an experiment run. See the Verta [documentation](https://verta.readthedocs.io/en/master/_autogen/verta.dataset.html) for full details on Verta's dataset versioning capabilities.

Updated for Verta version: 0.18.2

<a href="https://colab.research.google.com/github/VertaAI/examples/blob/main/experiment_management/dataset-versioning/s3-dataset-versioning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## 0. Imports

In [1]:
# NOTE: you may need pip3 instead of pip
!pip install boto3
!pip install wget

# from __future__ import print_function

import warnings
from sklearn.exceptions import ConvergenceWarning
warnings.filterwarnings("ignore", category=ConvergenceWarning)
warnings.filterwarnings("ignore", category=FutureWarning)

from datetime import datetime
import itertools
import os
import time

import numpy as np
import pandas as pd

import sklearn
from sklearn import model_selection
from sklearn import linear_model
from sklearn import metrics

import boto3
import wget

### 0.1 Verta import and setup

In [2]:
# restart your notebook if prompted on Colab
try:
    import verta
except ImportError:
    !pip install verta

In [3]:
# uncomment and fill in your own values
# import os
# os.environ['VERTA_EMAIL'] = ''
# os.environ['VERTA_DEV_KEY'] = ''
# os.environ['VERTA_HOST'] = ''

In [4]:
import os

from verta import Client

# the host is read from VERTA_HOST; this raises KeyError if it is not set
client = Client(os.environ['VERTA_HOST'])
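
Because the client cell above indexes `os.environ` directly, a missing variable surfaces as a bare `KeyError`. The following is a minimal sketch of a pre-flight check, assuming the three variable names from the commented-out cell are the ones your setup needs. The helper `missing_verta_vars` is hypothetical and not part of the Verta library.

```python
import os

# Variable names taken from the commented-out setup cell above.
REQUIRED_VARS = ("VERTA_EMAIL", "VERTA_DEV_KEY", "VERTA_HOST")


def missing_verta_vars(env=None):
    """Return the required variable names that are unset or empty in `env`."""
    # Hypothetical helper for illustration; defaults to the process environment.
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_verta_vars()
if missing:
    print("Set these before creating the client:", ", ".join(missing))
```

Running this before the client cell gives a readable message instead of a stack trace when a variable is absent.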

---

## 1. Model Training

### 1.1 Prepare Data

In [5]:
from verta.dataset import S3

# create a dataset
dataset = client.set_dataset(name="Census Income S3")

# create a version using the S3 versioning convenience function
# https://verta.readthedocs.io/en/master/_autogen/verta.dataset.S3.html#verta.dataset.S3
dataset_version = dataset.create_version(S3("s3://verta-starter"))

In [6]:
print(dataset, dataset_version)
# Notice that the S3 convenience function has collected metadata about
# every blob in the S3 bucket, including its checksum and last-modified
# date. Verta uses this information to determine whether a new version
# must be created.
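
To make the idea of change detection from per-blob metadata concrete, here is a rough sketch that hashes the key, checksum, and last-modified value of every object in a listing. It is an illustration only and is not Verta's actual algorithm. The function name `listing_fingerprint` is hypothetical, the input is assumed to be shaped like the `Contents` entries that boto3's `list_objects_v2` returns, and the sample ETag and date values are made up.

```python
import hashlib


def listing_fingerprint(objects):
    """Hash the (key, checksum, last-modified) triple of every object.

    `objects` is a list of dicts with "Key", "ETag" and "LastModified"
    entries, as found under "Contents" in a boto3 list_objects_v2 response.
    """
    digest = hashlib.sha256()
    # Sort by key so the result does not depend on listing order.
    for obj in sorted(objects, key=lambda o: o["Key"]):
        record = "{}\t{}\t{}\n".format(obj["Key"], obj["ETag"], obj["LastModified"])
        digest.update(record.encode("utf-8"))
    return digest.hexdigest()


# Made-up sample metadata, for illustration only.
before = [
    {"Key": "census-train.csv", "ETag": '"abc"', "LastModified": "2020-01-01T00:00:00"},
    {"Key": "census-test.csv", "ETag": '"def"', "LastModified": "2020-01-01T00:00:00"},
]
# Same listing, but the first object's checksum has changed.
after = [dict(before[0], ETag='"xyz"'), before[1]]

print(listing_fingerprint(before) == listing_fingerprint(before[::-1]))  # True
print(listing_fingerprint(before) == listing_fingerprint(after))  # False
```

Two listings with identical metadata produce the same fingerprint regardless of order, while any changed checksum or timestamp produces a different one, which is the signal a versioning system needs to decide whether a new version is warranted.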

In [7]:
DATASET_PATH = "./"
train_data_filename = DATASET_PATH + "census-train.csv"
test_data_filename = DATASET_PATH + "census-test.csv"

def download_starter_dataset(bucket_name):
    # download each file only if it is not already present locally
    for filename in (train_data_filename, test_data_filename):
        if not os.path.isfile(filename):
            url = "http://s3.amazonaws.com/" + bucket_name + "/" + os.path.basename(filename)
            wget.download(url, out=filename)

download_starter_dataset("verta-starter")
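
The skip-if-present logic in the download cell can be exercised without network access by passing the downloader in as an argument. The sketch below assumes a hypothetical helper `fetch_missing` and a stand-in `fake_download`; neither comes from the notebook or from any library.

```python
import os
import tempfile


def fetch_missing(filenames, base_url, dest_dir, download):
    """Call download(url, path) for each file not already in dest_dir."""
    fetched = []
    for name in filenames:
        path = os.path.join(dest_dir, name)
        if os.path.isfile(path):
            continue  # already present, so skip the download
        download(base_url.rstrip("/") + "/" + name, path)
        fetched.append(name)
    return fetched


def fake_download(url, path):
    # Stand-in for a real downloader; it only writes the URL into the file.
    with open(path, "w") as f:
        f.write(url)


with tempfile.TemporaryDirectory() as tmp:
    names = ["census-train.csv", "census-test.csv"]
    base = "http://s3.amazonaws.com/verta-starter"
    print(fetch_missing(names, base, tmp, fake_download))  # ['census-train.csv', 'census-test.csv']
    print(fetch_missing(names, base, tmp, fake_download))  # []
```

For real use, any callable that accepts `(url, path)` should fit in place of `fake_download`, for example `urllib.request.urlretrieve` from the standard library.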

In [8]:
df_train = pd.read_csv(train_data_filename)
# every column except the last is a feature; the last column is the label
X_train = df_train.iloc[:, :-1]
Y_train = df_train.iloc[:, -1]

df_train.head()

### 1.2 Train a Model and Associate the Dataset Version with It

In [9]:
# create object to track experiment run
run = client.set_experiment_run()

# log training data
run.log_dataset_version("train", dataset_version)

# ---------------------- other tracking below ------------------------

# create validation split
(X_val_train, X_val_test,
 Y_val_train, Y_val_test) = model_selection.train_test_split(X_train, Y_train,
                                                             test_size=0.2,
                                                             shuffle=True)
# log hyperparameters
hyperparams = {"C" : 10}
run.log_hyperparameters(hyperparams)
print(hyperparams, end=' ')

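# create and train model on the training portion of the split only,
# so that the validation accuracy below is measured on held-out rows
model = linear_model.LogisticRegression(**hyperparams)
model.fit(X_val_train, Y_val_train)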

# calculate and log validation accuracy
val_acc = model.score(X_val_test, Y_val_test)
run.log_metric("val_acc", val_acc)
print("Validation accuracy: {:.4f}".format(val_acc))

In [10]:
# fetch the dataset version info
run.get_dataset_version("train")

---