# Tracking HDFS dataset versions using Verta

Verta's experiment management system enables data scientists to track dataset versions stored on HDFS.

This notebook walks through registering files on HDFS as a dataset version, logging that version to an experiment run, and retrieving it from the run afterwards. See the Verta [documentation](https://verta.readthedocs.io/en/master/_autogen/verta.dataset.HDFSPath.html) for full details on Verta's dataset versioning capabilities.

Updated for Verta version: 0.18.2

## 0. Imports

In [1]:
from __future__ import print_function

import warnings
from sklearn.exceptions import ConvergenceWarning
warnings.filterwarnings("ignore", category=ConvergenceWarning)
warnings.filterwarnings("ignore", category=FutureWarning)

import itertools
import os

import numpy as np
import pandas as pd

import sklearn
from sklearn import model_selection
from sklearn import linear_model

# !pip install pyspark
import pyspark

### 0.1 Verta import and setup

In [2]:
# restart your notebook if prompted on Colab
try:
    import verta
except ImportError:
    !pip install verta

In [3]:
# import os
# os.environ['VERTA_EMAIL'] = 
# os.environ['VERTA_DEV_KEY'] = 
# os.environ['VERTA_HOST'] = 
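
`VERTA_HOST` is read explicitly when the client is created in the next cell. As a minimal sketch, assuming your Verta deployment needs all three variables named above, the following check reports any that are unset or empty. The helper `missing_verta_env` is hypothetical and not part of the Verta client.

```python
import os

# Variable names taken from the commented cell above.
REQUIRED_VARS = ("VERTA_EMAIL", "VERTA_DEV_KEY", "VERTA_HOST")


def missing_verta_env(environ=None):
    """Return the names in REQUIRED_VARS that are unset or empty (hypothetical helper)."""
    environ = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not environ.get(name)]


missing = missing_verta_env()
if missing:
    print("Set these environment variables before creating the client:",
          ", ".join(missing))
```

Passing a plain dictionary instead of `os.environ` makes the check easy to try without touching the real environment.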

In [4]:
from verta import Client
from verta.utils import ModelAPI
import os

client = Client(os.environ['VERTA_HOST'])

---

## 1. Model Training

### 1.1 Prepare Data

In [5]:
from pyspark import SparkContext

sc = SparkContext("local")

In [6]:
from verta.dataset import HDFSPath

# NOTE: requires a running HDFS instance; replace HOST and PORT with its address
hdfs = "hdfs://HOST:PORT"

dataset = client.set_dataset(name="Census Income HDFS")
blob = HDFSPath.with_spark(sc, "{}/data/census/*".format(hdfs))
dataset_version = dataset.create_version(blob)
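
The `hdfs` string in the cell above is a placeholder. The following is a small sketch of how the URI prefix and path passed to `HDFSPath.with_spark` could be assembled from a host, a port, and a path. `hdfs_uri` is a hypothetical helper, and the host and port shown are example values only, not defaults taken from Verta or from this notebook.

```python
def hdfs_uri(host, port, path):
    """Build an hdfs:// URI from a host, a port, and an absolute path (hypothetical helper)."""
    return "hdfs://{}:{}/{}".format(host, int(port), path.lstrip("/"))


# Example values only; substitute the address of your own HDFS instance.
print(hdfs_uri("namenode.example.com", 8020, "/data/census/*"))
```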

In [7]:
print(dataset, dataset_version)

In [8]:
csv = sc.textFile("{}/data/census/census-train.csv".format(hdfs)).collect()

In [9]:
from verta.external.six import StringIO

df_train = pd.read_csv(StringIO('\n'.join(csv)))
X_train = df_train.iloc[:,:-1]
Y_train = df_train.iloc[:, -1]

df_train.head()
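
`sc.textFile(...).collect()` returns the file as a list of lines, which is why the cell above joins them with newlines before parsing. The sketch below illustrates the same join, parse, and last-column split using only the standard library's `csv` module, so it runs without Spark or pandas. The sample rows and column names are invented for illustration and are not taken from the census data.

```python
import csv
from io import StringIO

# Stand-in for the list of lines returned by collect(); values are invented.
lines = [
    "age,hours_per_week,label",
    "39,40,0",
    "50,13,1",
]

# Join the lines back into one CSV document and parse it.
rows = list(csv.reader(StringIO("\n".join(lines))))
header, records = rows[0], rows[1:]

# Every column except the last is a feature; the last column is the label.
X = [[float(value) for value in record[:-1]] for record in records]
y = [int(record[-1]) for record in records]

print(header[:-1], X, y)
```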

In [10]:
# create object to track experiment run
run = client.set_experiment_run()

# log training data
run.log_dataset_version("train", dataset_version)

# ---------------------- other tracking below ------------------------

# create validation split
(X_val_train, X_val_test,
 Y_val_train, Y_val_test) = model_selection.train_test_split(X_train, Y_train,
                                                             test_size=0.2,
                                                             shuffle=True)
# log hyperparameters
hyperparams = {"C" : 10}
run.log_hyperparameters(hyperparams)
print(hyperparams, end=' ')

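# create and train model on the training portion of the split only,
# so the held-out rows are not seen before validation
model = linear_model.LogisticRegression(**hyperparams)
model.fit(X_val_train, Y_val_train)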

# calculate and log validation accuracy
val_acc = model.score(X_val_test, Y_val_test)
run.log_metric("val_acc", val_acc)
print("Validation accuracy: {:.4f}".format(val_acc))

In [11]:
# fetch the dataset version info
run.get_dataset_version("train")

---