# Data preparation using Verta for data versioning
This notebook shows an example of using Verta to manage input and output datasets for a data preparation step.

## Initialize Verta client

In [2]:
from verta import Client

VERTA_HOST = "https://cm.dev.verta.ai"

client = Client(VERTA_HOST)

set email from environment
set developer key from environment
connection successfully established


## Load latest version of the dataset

In [3]:
dataset = client.get_dataset(name="Census Income S3")
dataset_version = dataset.get_latest_version()
# dataset_version.download()

set existing Dataset: Census Income S3 from personal workspace
got existing dataset version: d5a01a87188b0a2884466a51aa2e721a4a13d7f3629a4a8e76f92f6ebc82d8ee


## Perform some feature engineering
Create two new boolean features from the number of hours per week that each person works: `part-time` (fewer than 30 hours) and `over-time` (more than 40 hours).

In [14]:
import pandas as pd

output_files = []
for type_ in ["train", "test"]:
    df = pd.read_csv("census-{}.csv".format(type_))
    # Flag rows by weekly hours: under 30 is part-time, over 40 is over-time
    df['part-time'] = df['hours-per-week'] < 30
    df['over-time'] = df['hours-per-week'] > 40
    output_file = "census-{}-processed.csv".format(type_)
    df.to_csv(output_file)
    # Collect the processed file paths for the dataset version created below
    output_files.append(output_file)
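
The thresholds used for the two flags can be checked on a small in-memory frame before running the step on the real files. The sketch below is illustrative only: the toy values and the helper name `add_hours_features` are assumptions, not part of the notebook, and only the `hours-per-week` column name comes from the cell above.

```python
import pandas as pd

# Hypothetical toy data standing in for census-train.csv; only the
# 'hours-per-week' column name is taken from the notebook.
df = pd.DataFrame({"hours-per-week": [20, 30, 40, 45]})


def add_hours_features(frame):
    """Return a copy of `frame` with boolean part-time / over-time flags."""
    out = frame.copy()
    out["part-time"] = out["hours-per-week"] < 30
    out["over-time"] = out["hours-per-week"] > 40
    return out


processed = add_hours_features(df)
print(processed)
```

Because both comparisons are strict, rows with exactly 30 or exactly 40 hours are flagged as neither part-time nor over-time.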

## Save the output dataset
The output dataset version is created with `enable_mdb_versioning=True`, so the processed files themselves are stored in Verta and can be downloaded and consumed later.

In [15]:
from verta.dataset import Path

# Use the public client method rather than the private client._set_dataset2()
output_dataset = client.get_or_create_dataset(name="Census Income S3 - Processed")
output_dataset.create_version(
    Path(
        output_files,
        enable_mdb_versioning=True,
    ),
)

got existing Dataset: Census Income S3 - Processed
created new Dataset Version: 2 for Census Income S3 - Processed
