# unit 1.4 - datasets

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://githubtocolab.com/culurciello/deep-learning-course-source/blob/main/source/lectures/14-datasets.ipynb)

We will now look at how data is organized in professional datasets. We will look at individual samples and at datasets as collections of samples. We will also cover train, development and test dataset splits, as well as the batches of samples and epochs used during training.

Creating a proper dataset is a complex and time-consuming endeavour. Careful management of data is as important in neural network training as the books one needs to read to learn a subject properly.

## Samples

Samples are "examples" used for training neural networks. Typically, in supervised learning, they are composed of two parts: an input and a label. The input is the data that is fed to the neural network. The label is the "ground truth" or "desired output": the output that our neural network should produce given the input.

![](images/dataset1.png)

## Datasets

A dataset is a collection of samples. The samples make up a list and this list is the dataset. A dataset is used to train or test a neural network.

As an example, in supervised learning for categorization, a dataset may contain a set of pictures for each category. Imagine we want a neural network that can tell animals apart. Each sample is a pair (input = picture of animal, label = type of animal), and the dataset is a list of these samples.
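As a minimal sketch of the animal example, a dataset can be written as a plain Python list of (input, label) pairs. The file names and labels below are made up for illustration; a real dataset would load actual images.

```python
# Hypothetical toy dataset: each sample is an (input, label) pair.
# The file names stand in for real pictures and are invented for this sketch.
dataset = [
    ("cat_001.png", "cat"),
    ("dog_001.png", "dog"),
    ("cat_002.png", "cat"),
    ("bird_001.png", "bird"),
]

# Access one sample and unpack it into its input and its label.
picture, label = dataset[1]

# Count how many samples each category has.
counts = {}
for _, sample_label in dataset:
    counts[sample_label] = counts.get(sample_label, 0) + 1
```

Counting samples per category is a quick check that no category is missing or badly under-represented.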

When training a neural network, the dataset samples need to be in random order. This has been found to be a more effective way of training. If a network is trained on all examples of one category, then all examples of another, and so on, it is more prone to getting stuck in local minima and to forgetting older categories.

Randomization of a dataset is a must in neural network training. To avoid learning problems, always check that your code and routines apply randomization during training.
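A minimal sketch of randomizing a dataset, assuming it is held in a Python list. The toy samples and the fixed seed are assumptions made only so the example is small and reproducible.

```python
import random

# Hypothetical dataset sorted by category, the ordering we want to avoid.
dataset = [(f"cat_{i}.png", "cat") for i in range(3)] + \
          [(f"dog_{i}.png", "dog") for i in range(3)]

rng = random.Random(0)   # fixed seed only so the example is reproducible
shuffled = dataset[:]    # copy, so the original order is kept for comparison
rng.shuffle(shuffled)    # in-place random permutation of the copy
```

Shuffling changes only the order: the shuffled list holds exactly the same samples as the original.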

## Train, development, test datasets

So far we have talked about "The Dataset" as one giant list. In most cases, we will split the dataset into 2 or 3 portions. 

![](images/dataset2.png)

If you want to train a neural network on your data, you should use some 80-90% of your data to train and save the remaining 10-20% to test your trained neural network. Testing means checking that the neural network can perform with high accuracy (number of correct predictions / total number of samples) on your data. If you train on ALL your data, the neural network may memorize all your samples well and give a high training accuracy (a low training loss), but then you would not have any way to test whether this network can perform well on unseen examples. Therefore in most cases, we split a dataset into two parts: training and testing datasets.
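The accuracy described above can be sketched in a few lines. The function name `accuracy` and the toy predictions are hypothetical, chosen only to illustrate the ratio of correct predictions to total samples.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the ground-truth labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

# Made-up test labels and network outputs: 4 of the 5 predictions are correct.
labels = ["cat", "dog", "cat", "bird", "dog"]
predictions = ["cat", "dog", "dog", "bird", "dog"]
acc = accuracy(predictions, labels)
```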

If you are employing others to train a neural network then you will do the same: give them 80-90% of your data to train, but save 10-20% of the data to check their results later. The company you employed will receive a single dataset, and they will have to again divide it into 2 parts: train and development datasets, often called the train and dev sets.
They will train and test on their own, but then provide you with a trained neural network that you can test on your saved test dataset. The company has never seen that test set so they cannot use these test samples for training. This will make sure you can verify that your neural network is able to perform on unseen data.

To summarize: first randomize your entire dataset, then always split the dataset into train and test. If you are employing others to do the job, you can split into train and test and only give them the train dataset. They will then again split it into train2 and dev datasets. Effectively this splits a dataset into 3 portions: train, dev, test.
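The randomize-then-split procedure can be sketched as follows. This is one possible implementation, not a standard API: the function name `split_dataset`, the default fractions and the seed are assumptions, and the three portions are cut in a single pass rather than in two separate steps.

```python
import random

def split_dataset(samples, test_fraction=0.1, dev_fraction=0.1, seed=0):
    """Shuffle a copy of the samples, then cut it into train, dev and test."""
    samples = list(samples)                # copy, the caller's list is untouched
    random.Random(seed).shuffle(samples)   # randomize before splitting
    n_test = int(len(samples) * test_fraction)
    n_dev = int(len(samples) * dev_fraction)
    test = samples[:n_test]
    dev = samples[n_test:n_test + n_dev]
    train = samples[n_test + n_dev:]
    return train, dev, test

# Hypothetical dataset of 100 (input, label) pairs.
data = [(i, i % 3) for i in range(100)]
train, dev, test = split_dataset(data)
```

Every sample lands in exactly one portion, so the test set stays unseen during training. Setting `dev_fraction=0` gives the two-way train and test split.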

You may find datasets split in 2 or 3 parts. If the dataset has 2 portions (train, test), you will only train on the "train dataset" and test on the "test dataset". If the dataset has 3 portions, you will only train on the "train dataset" and test on the "dev dataset". You will leave "test" for your final submission, after you have trained your final neural network model.

The dev and test datasets are useful for redesigning your neural network architecture and for improving your hyper-parameters (all parameters that are not neural network weights).

## Batches and epochs

When training neural networks with gradient descent, with gradients computed by back-propagation, researchers found that averaging gradients over multiple samples can give faster training times and sometimes better accuracy.

If your computing hardware allows it, running on the ENTIRE dataset at once is the fastest option per pass over the data. This is called "batch gradient descent". The slowest option is to run on one sample at a time. This is called "stochastic gradient descent".

Training a neural network requires running through the entire dataset multiple times. Each pass through the entire dataset is called an "epoch". Running 10 epochs means you will use your dataset for training a total of 10 times.

Unfortunately, training on the entire dataset at once may be prohibitive if the dataset is large and does not fit into your computer memory. In this case, which is the typical case, instead of training on one sample at a time, we can use "batches" of samples. The batch size is usually a power of 2, or as many samples as your training hardware can afford: 64, 128, 256 or even more in some cases. This is called "mini-batch gradient descent."
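A minimal sketch of cutting a dataset into batches, assuming the dataset is a Python list. The helper name `make_batches` is hypothetical, and the tiny sizes are chosen only to keep the example readable.

```python
def make_batches(samples, batch_size):
    """Yield consecutive chunks of at most batch_size samples."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]

dataset = list(range(10))   # stand-in for 10 samples

# Mini-batch gradient descent: a few samples per update.
batches = list(make_batches(dataset, batch_size=4))

# The two extremes use the same helper with a different batch size.
full_batch = list(make_batches(dataset, batch_size=len(dataset)))  # batch gradient descent
single = list(make_batches(dataset, batch_size=1))                 # stochastic gradient descent
```

The last batch can be smaller than the others when the dataset size is not a multiple of the batch size. One epoch corresponds to consuming all the batches once.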

![](images/dataset3.png)

When training, you will then take a batch of samples and back-propagate in chunks. In addition to improving training times, batching data can also maximize the efficiency of your computing hardware. This means that your hardware can be fully utilized and run faster. Therefore batches are used not only in training, but also in testing.

![](images/batching.png)

Why is batching helpful during training? Because averaging gradients reduces the noise of individual samples, so the training updates point more consistently toward a good local minimum of the loss function. Even when averaging only a few samples, as in "mini-batch" gradient descent, we are more efficient than pure stochastic gradient descent, which uses one sample at a time.
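To make the gradient averaging concrete, here is a small sketch on a one-weight model `y = w * x` with a squared-error loss. The model, the toy data, the learning rate and the function names are all assumptions chosen so the arithmetic is easy to follow; a real network would obtain its gradients through back-propagation.

```python
def sample_gradient(w, x, y):
    """Gradient of the squared error (w*x - y)**2 with respect to w."""
    return 2.0 * (w * x - y) * x

def minibatch_step(w, batch, lr):
    """One weight update using the gradient averaged over the batch."""
    grads = [sample_gradient(w, x, y) for x, y in batch]
    avg_grad = sum(grads) / len(grads)
    return w - lr * avg_grad

# Toy data generated from y = 3 * x, so the ideal weight is 3.
data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0), (4.0, 12.0)]

w = 0.0
batch_size = 2
for epoch in range(50):
    for start in range(0, len(data), batch_size):
        w = minibatch_step(w, data[start:start + batch_size], lr=0.05)
```

With these settings `w` approaches 3 after a few epochs. Each update uses the mean of the per-sample gradients in the batch rather than any single sample's gradient.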

A final note: When training a neural network, dataset samples should be re-arranged randomly at each epoch.
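A sketch of re-shuffling at every epoch, assuming a list-based dataset. The sizes, the seed and the variable names are illustrative, and the training step is left as a placeholder.

```python
import random

dataset = list(range(8))   # stand-in for 8 samples
rng = random.Random(0)     # fixed seed only so the example is reproducible
batch_size = 4
orders = []                # sample order used in each epoch, kept for inspection

for epoch in range(3):
    order = dataset[:]     # copy, so the original list is not modified
    rng.shuffle(order)     # draw a new random order for this epoch
    batches = [order[i:i + batch_size] for i in range(0, len(order), batch_size)]
    for batch in batches:
        pass               # a training step on `batch` would go here
    orders.append(order)
```

Every epoch still visits each sample exactly once; only the order, and therefore the composition of each batch, changes.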

## When to stop training?

How many epochs should we use to train a neural network? A naive answer would be to run until the train loss is the lowest. Sometimes the loss on the train dataset will continue to decrease at each epoch indefinitely. Because of that, one needs to check both the train loss and the test loss after every epoch!

![](images/stopping.png)

A neural network can begin to memorize all the training samples very effectively, thus showing a lower and lower loss at each epoch. But by learning specific samples, the network will also lose generalization capabilities. In a typical case the test loss will have a minimum as a function of epochs. The epoch at which that minimum occurs is the ideal point to stop training. This point cannot be known a priori and has to be found empirically. This means you will need to run training for as many epochs as you need to be convinced that you have passed the minimum test loss.

Your final trained model can be the set of weights which provided the minimum test loss. As such it is a good idea to use a script that saves the weights of the network with the minimum test loss.
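A sketch of such a script, under stated assumptions: the `patience` rule (stop after a number of epochs without improvement) is one common way to decide that the minimum has been passed and is not prescribed by the text above. The function names are hypothetical, and the loss values are a made-up curve rather than real results.

```python
import copy

def train_with_best_checkpoint(weights, train_one_epoch, evaluate,
                               max_epochs, patience):
    """Train, keep a copy of the weights with the lowest test loss,
    and stop once the loss has not improved for `patience` epochs."""
    best_loss = float("inf")
    best_weights = copy.deepcopy(weights)
    best_epoch = -1
    for epoch in range(max_epochs):
        weights = train_one_epoch(weights)
        loss = evaluate(weights)
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            best_weights = copy.deepcopy(weights)   # "save" the best weights
        elif epoch - best_epoch >= patience:
            break                                   # minimum is likely behind us
    return best_weights, best_loss, best_epoch

# Simulated run: the "weights" are just an epoch counter and the test loss
# follows an invented curve that falls, reaches a minimum, then rises.
fake_test_losses = [0.9, 0.6, 0.4, 0.35, 0.38, 0.45, 0.6, 0.8]
best_weights, best_loss, best_epoch = train_with_best_checkpoint(
    weights=0,
    train_one_epoch=lambda w: w + 1,
    evaluate=lambda w: fake_test_losses[w - 1],
    max_epochs=len(fake_test_losses),
    patience=3,
)
```

In this simulated run the lowest loss occurs at epoch index 3, and training stops three epochs later. In a real setup the in-memory copy would be replaced by writing the weights to disk.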
