# Tutorial 2: Using In-Disk Data in FastEstimator

In Tutorial 1, we introduced the three main APIs and the general workflow of a deep learning task: `Pipeline` -> `Network` -> `Estimator`, and we trained on in-memory data. But what if the dataset is too big to fit in memory, say, on the scale of ImageNet?

The short answer: use one more API for in-disk data, `RecordWriter`, so that the overall workflow becomes `RecordWriter` -> `Pipeline` -> `Network` -> `Estimator`. This tutorial shows how to train on in-disk data in FastEstimator.

## Before we start:

Two things are required for in-disk data:
* Data files, obviously :)
* A csv file that describes the data (prepare two csv files if you have a separate validation set)

In the csv file, each row represents one example and each column represents one feature of that example. For a classification task, the csv may look like:

| image  | label  |
|---|---|
|/data/image1.png   | 0  |
|/data/image2.png   |  1 |
|/data/image3.png | 0  |
|... | ... |

The csv of a multi-mask segmentation task may look like:

| img  | msk1  | msk2  |
|---|---|---|
|/data/image1.png   | /maska/mask1.png  |/maskb/mask1.png|
|/data/image2.png   |  /maska/mask2.png |/maskb/mask2.png|
|/data/image3.png | /maska/mask3.png  |/maskb/mask3.png|
|... | ...  |...|
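
If you do not have such a csv file yet, it can be generated with a few lines of Python. The snippet below is a minimal sketch using only the standard library; the helper `write_description_csv`, the file name `train.csv`, and the image paths are hypothetical and simply mirror the classification table above.

```python
import csv
import os
import tempfile


def write_description_csv(csv_path, rows, fieldnames):
    """Write one row per example; each column holds one feature."""
    with open(csv_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)


# Hypothetical examples mirroring the classification table above.
classification_rows = [
    {"image": "/data/image1.png", "label": 0},
    {"image": "/data/image2.png", "label": 1},
    {"image": "/data/image3.png", "label": 0},
]

out_dir = tempfile.mkdtemp()
train_csv_path = os.path.join(out_dir, "train.csv")
write_description_csv(train_csv_path, classification_rows, ["image", "label"])

# Read the file back to confirm the layout: one dict per example.
with open(train_csv_path, newline="") as f:
    loaded_rows = list(csv.DictReader(f))
print(loaded_rows[0])
```

A segmentation csv would be written the same way, with the `img`, `msk1` and `msk2` columns as field names.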


Keep in mind that there is no restriction on the data folder structure, the number of features, or the names of the features. Now, let's generate some in-disk data for this tutorial:

In [None]:
from fastestimator.dataset.mnist import load_data

train_csv, eval_csv, folder_path = load_data()
print("training csv path is {}".format(train_csv))
print("evaluation csv path is {}".format(eval_csv))
print("mnist image path is {}".format(folder_path))

Let's take a look at the csv file and one of the image files:

In [None]:
import pandas as pd
df_train = pd.read_csv(train_csv)
df_train.head()

In [None]:
import matplotlib.pyplot as plt
import os
img = plt.imread(os.path.join(folder_path, df_train["x"][1]))
plt.imshow(img)
print("ground truth of image is {}".format(df_train["y"][1]))

## Step 0: RecordWriter


In FastEstimator, we convert the user's in-disk data to TFRecord for the best training speed. `RecordWriter` takes the following arguments:

`save_dir` is required and specifies the path where the records are written.

`train_data` can be either a csv path or a dictionary like the one used in tutorial 1.

`validation_data` is optional and accepts every input format of `train_data`. It can also be a floating point number between 0 and 1 that indicates the validation split ratio, in which case the validation data is randomly sampled from the training data.
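
To make the split ratio concrete, here is a minimal sketch of a random validation split over csv-like rows. It is not FastEstimator's implementation: the helper `split_rows`, the fixed seed, and the assumption that sampled rows are held out of the training set are all choices made for illustration.

```python
import random


def split_rows(rows, validation_ratio, seed=0):
    """Randomly hold out a fraction of the rows for validation."""
    if not 0 < validation_ratio < 1:
        raise ValueError("validation_ratio must be between 0 and 1")
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    n_val = int(len(rows) * validation_ratio)
    val_indices = set(indices[:n_val])
    # Assumption: held-out rows are removed from the training set.
    train = [row for i, row in enumerate(rows) if i not in val_indices]
    val = [row for i, row in enumerate(rows) if i in val_indices]
    return train, val


# Hypothetical rows with the same "x"/"y" columns as the MNIST csv.
rows = [{"x": "image_{}.png".format(i), "y": i % 10} for i in range(100)]
train_rows, val_rows = split_rows(rows, validation_ratio=0.2)
print(len(train_rows), len(val_rows))
```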

Before the data is converted to TFRecord, users can apply a series of preprocessing operations to it through the `ops` argument. We will cover them in detail in tutorial 3.

In [None]:
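import os

import fastestimator as fe
from fastestimator.record.preprocess import ImageReader

# Records are written to an "FEdata" folder inside the image folder.
# ImageReader reads the image path in column "x" (resolved against parent_path)
# and replaces it with the grayscale image under the same key.
writer = fe.RecordWriter(save_dir=os.path.join(folder_path, "FEdata"),
                         train_data=train_csv,
                         validation_data=eval_csv,
                         ops=ImageReader(inputs="x", parent_path=folder_path, outputs="x", grey_scale=True))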

## Step 1 and above: Pipeline -> Network -> Estimator, same as tutorial 1.

In [None]:
from fastestimator.pipeline.processing import Minmax
from fastestimator.architecture import LeNet
from fastestimator.network.model import FEModel
from fastestimator.network.model import ModelOp
from fastestimator.network.loss import SparseCategoricalCrossentropy

pipeline = fe.Pipeline(batch_size=32, data=writer, ops=Minmax(inputs="x", outputs="x"))

model = FEModel(model_def=LeNet, model_name="lenet", optimizer="adam", loss_name="loss")
network = fe.Network(ops=[ModelOp(inputs="x", model=model, outputs="y_pred"), 
                          SparseCategoricalCrossentropy(y_pred="y_pred", y_true="y", outputs="loss")])
estimator = fe.Estimator(network=network, pipeline=pipeline, epochs=2)
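
For intuition about what the `Minmax` op in the pipeline above does to each image, here is a plain-Python sketch of min-max scaling. It assumes the usual definition, `(x - min) / (max - min)`, which maps values into the range 0 to 1; the helper `minmax_scale` and its `epsilon` guard are illustrative and not taken from the FastEstimator source.

```python
def minmax_scale(values, epsilon=1e-7):
    """Rescale a sequence of numbers so the smallest maps to 0 and the largest to 1."""
    lowest = min(values)
    highest = max(values)
    # Guard against division by zero when all values are identical.
    span = max(highest - lowest, epsilon)
    return [(value - lowest) / span for value in values]


# Hypothetical pixel intensities from an 8-bit grayscale image.
pixels = [0, 64, 128, 255]
scaled = minmax_scale(pixels)
print(scaled)
```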

## Start training

In [None]:
estimator.fit()

## Before we finish

As mentioned in tutorial 1, the preprocessing in `RecordWriter` is the place for "once-and-for-all" preprocessing. If a preprocessing function only needs to run once (e.g., Resize and Rescale), it is recommended to put it in `RecordWriter`. Doing so reduces the amount of computation needed during training and thereby makes training faster.
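
The saving can be illustrated by counting how often a preprocessing function runs in each setup. The sketch below uses no FastEstimator code; `resize_stub` is a hypothetical stand-in for an expensive operation such as Resize, and the two functions only mimic the loop structure of training.

```python
def resize_stub(example, counter):
    """Stand-in for an expensive preprocessing step; only counts its calls."""
    counter["calls"] += 1
    return example


def calls_when_preprocessing_every_epoch(examples, epochs):
    """Preprocessing runs inside the training loop, once per example per epoch."""
    counter = {"calls": 0}
    for _ in range(epochs):
        for example in examples:
            resize_stub(example, counter)
    return counter["calls"]


def calls_when_preprocessing_once(examples, epochs):
    """Preprocessing runs once up front; training only reads the stored result."""
    counter = {"calls": 0}
    stored = [resize_stub(example, counter) for example in examples]
    for _ in range(epochs):
        for _example in stored:
            pass  # the training step would consume the stored example here
    return counter["calls"]


examples = list(range(1000))
every_epoch_calls = calls_when_preprocessing_every_epoch(examples, epochs=2)
once_calls = calls_when_preprocessing_once(examples, epochs=2)
print(every_epoch_calls, once_calls)
```

The gap grows linearly with the number of epochs, which is why deterministic steps are good candidates for `RecordWriter`.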

Now, to summarize the conceptual workflow in FastEstimator for any deep learning task:
0. Optional: do I have in-disk data, or preprocessing that should be applied once and for all? --> Express it in `RecordWriter`
1. How do I want my data to be processed during training? --> Express it in `Pipeline`
2. How are my network architecture and loss defined, and how are the networks connected if there are several of them? --> Express it in `Network`
3. How long do I train the model, and what do I need during the training loop? --> Express it in `Estimator`