<a href="https://colab.research.google.com/github/wandb/edu/blob/main/mlops-001/lesson1/02_Split.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
<!--- @wandbcode{course-lesson1} -->

# Data preparation

In this notebook we will prepare the data for training our deep learning model. To do so, we will:

- start a new W&B `run` and use our raw data artifact
- split the data and save the splits into a new W&B Artifact
- join information about the split with our EDA Table

In [1]:
import os, warnings
import wandb

import pandas as pd
from fastai.vision.all import *
from sklearn.model_selection import StratifiedGroupKFold

import params
warnings.filterwarnings('ignore')

## Start a W&B run and get stored artifacts

In [2]:
run = wandb.init(project=params.WANDB_PROJECT, entity=params.ENTITY, job_type="data_split")

[34m[1mwandb[0m: Currently logged in as: [33merinaldi[0m ([33merinaldi-team[0m). Use [1m`wandb login --relogin`[0m to force relogin


Let's use the artifact we previously saved to W&B (we store artifact names and other global parameters in `params`).

In [3]:
raw_data_at = run.use_artifact(f'{params.RAW_DATA_AT}:latest')
path = Path(raw_data_at.download())

[34m[1mwandb[0m: Downloading large artifact bdd_simple_1k:latest, 846.57MB. 4007 files... 
[34m[1mwandb[0m:   4007 of 4007 files downloaded.  
Done. 0:0:1.1
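
For readers who do not have the course's `params` module at hand, here is a minimal sketch of what it might contain. Only the four constant names and the raw artifact name `bdd_simple_1k` (visible in the download log above) come from this notebook; the other values are hypothetical placeholders.

```python
# params.py (sketch): shared constants imported by every notebook in the lesson.
# Values marked "hypothetical" are assumptions for illustration only.
WANDB_PROJECT = "my-mlops-project"          # hypothetical project name
ENTITY = None                               # None lets wandb use your default entity
RAW_DATA_AT = "bdd_simple_1k"               # artifact name seen in the download log
PROCESSED_DATA_AT = "bdd_simple_1k_split"   # hypothetical name for the split artifact

# The notebook appends an alias to pick a version of the artifact.
raw_artifact_ref = f"{RAW_DATA_AT}:latest"
print(raw_artifact_ref)
```

Keeping these names in one module means a renamed project or artifact only has to be changed in one place.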


In [5]:
type(raw_data_at)

wandb.apis.public.Artifact

The download creates a _local_ `artifacts` folder.

In [4]:
path.ls()

(#5) [Path('artifacts/bdd_simple_1k:v0/images'),Path('artifacts/bdd_simple_1k:v0/labels'),Path('artifacts/bdd_simple_1k:v0/LICENSE.txt'),Path('artifacts/bdd_simple_1k:v0/eda_table.table.json'),Path('artifacts/bdd_simple_1k:v0/media')]

To split the data into training, validation and test sets, we need the file names, groups (derived from the file names) and a target (here we use our rare class, bicycle, for stratification). We read the file names from the downloaded `images` folder and derive the groups from them; the target column was previously saved to the EDA table, so we will retrieve it from there.

In [6]:
fnames = os.listdir(path/'images')
groups = [s.split('-')[0] for s in fnames]
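
The group derivation can be illustrated on a handful of names. This is a small sketch that assumes every file follows the `<video id>-<frame id>.jpg` pattern; the file names below are illustrative, not read from the artifact.

```python
from collections import Counter

# Illustrative file names following the "<video id>-<frame id>.jpg" pattern
demo_fnames = [
    "a91b7555-00000000.jpg",
    "a91b7555-6ab2b28d.jpg",
    "00e9be89-00000000.jpg",
]

# Everything before the first '-' identifies the source video
demo_groups = [name.split("-")[0] for name in demo_fnames]

# Count how many frames each video contributes
frames_per_video = Counter(demo_groups)
print(frames_per_video)
```

Frames that share a video id end up in the same group, which is what later keeps them on the same side of the split.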

In [8]:
groups  # they correspond to different videos.

['a59131a5',
 '6886b3d9',
 '115e4aff',
 'b803d91d',
 'c665137e',
 '6b293d3e',
 '898ac5b9',
 'a91b7555',
 '16e186ec',
 'b18cb922',
 'c7d8260d',
 '56d9d586',
 '9a888ffa',
 '6268fd0c',
 '00e9be89',
 '31690fd0',
 '3a7896fe',
 '01b29c03',
 'a91b7555',
 '0ea7f502',
 '31a9e4c6',
 'a3b51b78',
 '84af3c94',
 '7234b48f',
 '76df02f8',
 '0d03369b',
 'b6b047b4',
 '55856ef6',
 '08ab784d',
 '5d03a12f',
 'a7542d36',
 '71ec3de7',
 '05b10029',
 '9dec6d97',
 '4adb062d',
 '843af705',
 '75b3cdd3',
 '150a973a',
 '21fcab2c',
 '67c3676f',
 '14df900d',
 '4ace777f',
 '3d0d454e',
 '0acc0c71',
 '9c38185c',
 '156b8888',
 '8e74dd69',
 '0f145ef9',
 '0f1a2d28',
 '1b22fdfb',
 '9d5576ab',
 '5b37ba9e',
 '835df354',
 '5231447b',
 '5dff6680',
 'ab6a3dc3',
 '22d63fa2',
 '9f442f32',
 '340257de',
 '3426cfd1',
 'c6d7400d',
 '9861744f',
 '4c7d75ce',
 '022ec367',
 'ad62ff26',
 '7f92cc43',
 '53af996f',
 '00e9be89',
 '069837be',
 '15430558',
 '6ea4cd53',
 '4899c3bf',
 '67ed7da0',
 '9dccd5bd',
 '346be4c0',
 '661cdd5a',
 'c122b0dd',

The artifact we downloaded contains the table. We can also `get` it from the `Artifact` object.

In [9]:
orig_eda_table = raw_data_at.get("eda_table")

[34m[1mwandb[0m: Downloading large artifact bdd_simple_1k:latest, 846.57MB. 4007 files... 
[34m[1mwandb[0m:   4007 of 4007 files downloaded.  
Done. 0:0:0.3


In [10]:
type(orig_eda_table)

wandb.data_types.Table

In [11]:
y = orig_eda_table.get_column('bicycle')

In [13]:
np.unique(y)

array([0, 1])
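
Note that `fnames` comes from `os.listdir`, while `y` comes from the table, so the two are only aligned if both happen to be in the same order. A defensive sketch is to look up each label by file name. The data below is made up; in the notebook the names could presumably be read with `orig_eda_table.get_column('File_Name')`, assuming the table has that column.

```python
# Stand-ins for the table columns (values are made up for illustration)
table_file_names = ["b-2.jpg", "a-1.jpg", "c-3.jpg"]
table_bicycle = [1, 0, 0]

# Stand-in for os.listdir output, which may come back in a different order
listed_fnames = ["a-1.jpg", "c-3.jpg", "b-2.jpg"]

# Build a file name -> label lookup, then reorder labels to match the listing
label_by_file = dict(zip(table_file_names, table_bicycle))
missing = [name for name in listed_fnames if name not in label_by_file]
if missing:
    raise ValueError(f"No label found for: {missing}")

aligned_y = [label_by_file[name] for name in listed_fnames]
print(aligned_y)
```

With the labels reordered this way, position `i` in the target always refers to the same image as position `i` in the file list.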

## Create 10 stratified folds

Now we will split the data into train (80%), validation (10%) and test (10%) sets. As we do that, we need to be careful to:

- *avoid leakage*: for that reason we are grouping data according to video identifier (we want to make sure our model can generalize to new cars or video frames)
- handle the *label imbalance*: for that reason we stratify data with our target column

We will use sklearn's `StratifiedGroupKFold` to split the data into 10 folds and assign 1 fold for test, 1 for validation and the rest for training.

In [14]:
df = pd.DataFrame()
df['File_Name'] = fnames
df['fold'] = -1

In [16]:
cv = StratifiedGroupKFold(n_splits=10)
for i, (train_idxs, test_idxs) in enumerate(cv.split(fnames, y, groups)):
    df.loc[test_idxs, ['fold']] = i  # assign split number to 'fold' column of the test examples

In [17]:
df.head()

| | File_Name | fold |
|---|---|---|
| 0 | a59131a5-00000000.jpg | 3 |
| 1 | 6886b3d9-6ab2b28d.jpg | 3 |
| 2 | 115e4aff-00000000.jpg | 7 |
| 3 | b803d91d-671b8cff.jpg | 8 |
| 4 | c665137e-6fffaf45.jpg | 7 |


All the files (rows in `df`) now have an associated split number (`fold`). We assign `fold=0` to the test set and `fold=1` to the validation set; the remaining eight folds form the training set.

In [18]:
df['Stage'] = 'train'
df.loc[df.fold == 0, ['Stage']] = 'test'
df.loc[df.fold == 1, ['Stage']] = 'valid'
del df['fold']
df.Stage.value_counts()

train    800
valid    100
test     100
Name: Stage, dtype: int64
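
The same fold-to-stage mapping can also be written with `Series.map`, and it is worth verifying that no video spans more than one stage. The sketch below uses a tiny hypothetical frame (10 videos, 2 frames each, one video per fold), not the real `df`.

```python
import pandas as pd

# Hypothetical frame: 10 videos x 2 frames, video g assigned to fold g
demo_df = pd.DataFrame({
    "File_Name": [f"vid{g}-{k:08d}.jpg" for g in range(10) for k in range(2)],
    "fold": [g for g in range(10) for _ in range(2)],
})

# fold 0 -> test, fold 1 -> valid, everything else -> train
demo_df["Stage"] = demo_df["fold"].map({0: "test", 1: "valid"}).fillna("train")

# Leakage check: each video id should appear in exactly one stage
demo_df["group"] = demo_df["File_Name"].str.split("-").str[0]
stages_per_group = demo_df.groupby("group")["Stage"].nunique()
stage_counts = demo_df["Stage"].value_counts().to_dict()
print(stage_counts)
```

If `stages_per_group.max()` were ever greater than 1, frames from one video would be shared between stages and the evaluation would be optimistic.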

Save the splits to a CSV file (2 columns: file name and data split).

In [19]:
df.to_csv('data_split.csv', index=False)

We will now create a new artifact and add our data there. 

In [20]:
processed_data_at = wandb.Artifact(params.PROCESSED_DATA_AT, type="split_data")

In [21]:
processed_data_at.add_file('data_split.csv')
processed_data_at.add_dir(path)

[34m[1mwandb[0m: Adding directory to artifact (./artifacts/bdd_simple_1k:v0)... Done. 1.5s


Finally, the split information may be relevant for our analyses. Rather than uploading the images again, we will save the split information to a new table and join it with the EDA table we created previously.

In [22]:
data_split_table = wandb.Table(dataframe=df[['File_Name', 'Stage']])  # use File_Name as in the previous table for joining

In [23]:
join_table = wandb.JoinedTable(orig_eda_table, data_split_table, "File_Name")
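
Conceptually, joining on `File_Name` is similar to a key-based merge in pandas. The sketch below is only an analogy on made-up rows: it assumes one row per file in each table and does not claim to reproduce how W&B resolves the join internally.

```python
import pandas as pd

# Made-up stand-ins for the EDA table and the split table
eda = pd.DataFrame({"File_Name": ["a-1.jpg", "b-2.jpg"], "bicycle": [0, 1]})
split = pd.DataFrame({"File_Name": ["b-2.jpg", "a-1.jpg"], "Stage": ["train", "test"]})

# Match rows by file name; validate= raises if a file name is duplicated
joined = eda.merge(split, on="File_Name", how="inner", validate="one_to_one")
stage_by_file = joined.set_index("File_Name")["Stage"].to_dict()
print(joined)
```

Row order does not matter for a key-based join, which is why the split table only needs the `File_Name` and `Stage` columns.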

## Add split dataset as new artifact

Let's add the joined table to our artifact, log the artifact and finish our `run`.

In [24]:
processed_data_at.add(join_table, "eda_table_data_split")

ArtifactManifestEntry(path='eda_table_data_split.joined-table.json', digest='YvCnLQmJm8ISmhHyrsPo/A==', ref=None, birth_artifact_id=None, size=123, extra={}, local_path='/Users/enrythebest/Library/Application Support/wandb/artifacts/staging/tmp0ywz4wjp')

In [25]:
run.log_artifact(processed_data_at)

<wandb.sdk.wandb_artifacts.Artifact at 0x297ef4f40>

In [26]:
# finish the run; this waits for the artifact upload to complete
run.finish()
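
A later step can read `data_split.csv` back to pick the files for each stage. Here is a minimal standard-library sketch; the helper name `files_by_stage` is hypothetical, and the demo writes a tiny temporary CSV with the same two columns rather than using the real file.

```python
import csv
import tempfile
from collections import defaultdict
from pathlib import Path


def files_by_stage(csv_path):
    """Group file names by their 'Stage' value (hypothetical helper)."""
    stages = defaultdict(list)
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            stages[row["Stage"]].append(row["File_Name"])
    return dict(stages)


# Demo on a temporary CSV with the same columns as data_split.csv
with tempfile.TemporaryDirectory() as tmp:
    demo_csv = Path(tmp) / "data_split.csv"
    demo_csv.write_text(
        "File_Name,Stage\n"
        "a-1.jpg,train\n"
        "b-2.jpg,valid\n"
        "c-3.jpg,test\n"
        "d-4.jpg,train\n"
    )
    splits = files_by_stage(demo_csv)

print(splits)
```

In the course setting the CSV would come from the downloaded split artifact instead of a temporary directory.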