# Building Industrial Deep Learning Solutions

<p style="text-align: center;">Daniele Solombrino <br>
dansolombrino@{gmail.com,GitHub}</p>

## Virtual Environments

First rule of a good coding project: hygiene!

Every project should have its own **Python virtual environment**, so that packages installed for one project do not collide with packages needed by other projects.

Python virtual environments can be created in different ways.

Today, we will use `venv`, a module that ships with the Python standard library (Python >= 3.3).

First, we create the virtual environment by running the following command in the terminal:

`python -m venv digits_classification`

Then, we activate it by running this other command (on Linux/macOS; on Windows the activation script is `digits_classification\Scripts\activate`):

`source digits_classification/bin/activate`
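To confirm that the activation worked, we can ask the interpreter itself. The snippet below is a minimal sketch, not part of the original material; the function name `in_virtual_environment` is a hypothetical helper. It relies on the fact that, inside a `venv`, `sys.prefix` points at the environment directory while `sys.base_prefix` points at the base Python installation.

```python
import sys


def in_virtual_environment() -> bool:
    """Return True when the running interpreter belongs to a virtual environment."""
    # Inside a venv, sys.prefix is the environment directory,
    # while sys.base_prefix is still the base Python installation.
    return sys.prefix != sys.base_prefix


print(f"Interpreter in use: {sys.executable}")
print(f"Inside a virtual environment: {in_virtual_environment()}")
```

If the second line prints `False` after activation, the terminal is most likely still using the system-wide Python.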

## requirements.txt

Each project requires some Python packages to be installed.

List them all in a file named `requirements.txt`.

Whenever someone needs to replicate your environment, they can simply run `pip3 install -r requirements.txt` on their machine, after creating and activating the virtual environment.
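As a quick sanity check after installing, we can verify that every listed package is actually present in the active environment. This is a rough sketch under simplifying assumptions: the helper name `missing_packages` is hypothetical, and the parsing only handles plain `name` or `name==version` style lines (option lines such as `-r other.txt` are skipped, and URLs are not supported).

```python
import re
from importlib import metadata


def missing_packages(requirement_lines):
    """Return the package names from requirements-style lines that are not installed."""
    missing = []
    for line in requirement_lines:
        line = line.split("#", 1)[0].strip()  # drop comments and surrounding whitespace
        if not line or line.startswith("-"):  # skip blank lines and pip options
            continue
        # Keep only the package name, dropping extras and version specifiers such as "==1.2".
        name = re.split(r"[\[<>=!~; ]", line, maxsplit=1)[0]
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing
```

A possible usage is `missing_packages(open("requirements.txt"))`: an empty list means every listed package was found.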

## The problem

You are working with a client who has given you a problem to solve.

Part of this problem is the following need: recognizing handwritten digits that have been OCR'd by a system the client is already using.

## Data

### Data loading

After some time spent online, you find a dataset that is the closest match to the client's inputs: [MNIST](https://en.wikipedia.org/wiki/MNIST_database)

You start looking around and find several sources for it, including [this](https://www.kaggle.com/datasets/hojjatk/mnist-dataset) Kaggle page and [this](https://github.com/yawen-d/Neural-Network-on-MNIST-with-NumPy-from-Scratch) GitHub repo.

You look at both pages and find that the GitHub repo provides the files in a much more convenient format, so you decide to download them from there.

For practicality, the data has already been downloaded for you and is included in this repo.

Remember what we stated in the previous weeks: never, EVER, blindly trust code (or data) coming from other people.

For this reason, we are going to look into the data.

The file is in HDF5 format, which needs a dedicated Python library, `h5py`, to be handled correctly.

We add the library to the `requirements.txt` file and install it by running `pip3 install h5py`.

In [2]:
import h5py

In [3]:
MNIST_data = h5py.File('MNISTdata.hdf5', 'r')

Since we have zero trust in whoever provided the data, first and foremost we are going to see what data is stored inside the archive file.

HDF5 files can be treated like dictionaries, so the first thing that makes sense is to look at the keys of the archive.

In [4]:
from rich import print # nicer prints

In [5]:
print(MNIST_data.keys())
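Beyond the keys, it helps to check what is stored under each of them before loading anything. The function below is a sketch, and `describe_archive` is a hypothetical helper name. It assumes only that the archive behaves like a dictionary (it has `.items()`) and that stored arrays expose `.shape` and `.dtype`, which is the case for `h5py` datasets.

```python
def describe_archive(archive):
    """Map each top-level key to the (shape, dtype) of the object stored under it."""
    summary = {}
    for name, item in archive.items():
        # Datasets expose .shape and .dtype; anything else (e.g. a nested group) yields None.
        summary[name] = (getattr(item, "shape", None), getattr(item, "dtype", None))
    return summary
```

With the archive opened above, `print(describe_archive(MNIST_data))` would show the shape and data type of every entry without copying the data into memory.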

### Data splitting

The archive has four datasets (arrays): `x_train`, `x_test`, `y_train`, `y_test`.

A couple of questions for you:

1) Why $X$ and $Y$?
2) Why train and test?

$X$ are the inputs and $Y$ are the expected outputs (the labels), as we are in a supervised learning setting!

As we have stated over these weeks, we want Machine/Deep Learning models that generalize to never-seen-before data! <br>
For this reason, we hold a portion of the data out, the test data, to verify our performance on data never seen before!<br>
To be complete, we are also going to set aside another portion of data not used for training, the "validation" set.<br>
We will give more details about test vs. validation at the appropriate time; for now, we are just going to save the various data splits.

In [6]:
# train data
X_train = MNIST_data['x_train'][:]
Y_train = MNIST_data['y_train'][:]

# test data
X_test = MNIST_data['x_test'][:]
Y_test = MNIST_data['y_test'][:]

In [7]:
print(f"X_train.shape: {X_train.shape}")
print(f"Y_train.shape: {Y_train.shape}")
print(f"X_test.shape: {X_test.shape}")
print(f"Y_test.shape: {Y_test.shape}")

Do these shapes make sense to you? What do you think these values mean?

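$60k$ and $10k$ are the numbers of data samples in the train and test splits, respectively.

$784$ is the number of features that each data sample has: MNIST images are $28 \times 28$ pixels, flattened into a single vector.

$1$ is the size of the label dimension: each sample has exactly one label.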
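These expectations can be turned into an explicit check. The sketch below uses a hypothetical helper, `check_split_shapes`, which works on plain shape tuples. It assumes each sample is a flattened square image, as in MNIST, where $28 \cdot 28 = 784$.

```python
import math


def check_split_shapes(x_shape, y_shape):
    """Check that inputs and labels line up, and return the implied image side length."""
    n_samples, n_features = x_shape
    if y_shape[0] != n_samples:
        raise ValueError(f"{n_samples} inputs but {y_shape[0]} labels")
    side = math.isqrt(n_features)
    if side * side != n_features:
        raise ValueError(f"{n_features} features do not form a square image")
    return side


print(check_split_shapes((60000, 784), (60000, 1)))  # 28
```

In the notebook, this could be called as `check_split_shapes(X_train.shape, Y_train.shape)`.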

Let's create the additional validation split by carving the last $10k$ samples out of the training data:

In [8]:
X_val = X_train[50000:60000]
Y_val = Y_train[50000:60000]

X_train = X_train[0:50000]
Y_train = Y_train[0:50000]

In [9]:
print(f"X_train.shape: {X_train.shape}")
print(f"Y_train.shape: {Y_train.shape}")
print(f"X_val.shape: {X_val.shape}")
print(f"Y_val.shape: {Y_val.shape}")
print(f"X_test.shape: {X_test.shape}")
print(f"Y_test.shape: {Y_test.shape}")

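There is no fixed rule about the split sizes. The most used values are 60/20/20, 70/20/10 and 80/10/10 (train %/val %/test %).

In this case, we are using roughly 70/15/15 (50k/10k/10k samples out of 70k in total), because the test split was already provided and the validation split only takes one slicing operation; other percentages would have required more operations.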
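The exact percentages follow directly from the split sizes. Here is a small sketch of the arithmetic; `split_percentages` is a hypothetical helper name.

```python
def split_percentages(n_train, n_val, n_test):
    """Return the train/val/test sizes as percentages of the total, rounded to one decimal."""
    total = n_train + n_val + n_test
    return tuple(round(100 * n / total, 1) for n in (n_train, n_val, n_test))


print(split_percentages(50_000, 10_000, 10_000))  # (71.4, 14.3, 14.3)
```

So the 50k/10k/10k split is closer to 71/14/14 than to an exact 70/15/15, which is perfectly fine given that there is no fixed rule.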

Hygiene rule: never, EVER, have overlapping data samples between the splits. 

Training data is for training.
Validation data is for validation.
Test data is for testing.

Each split has its own "goal".
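The "no overlap" rule can be checked rather than assumed. The sketch below counts samples that appear, value for value, in two splits. `count_shared_samples` is a hypothetical helper that works on any sequence of rows (lists or NumPy arrays). Converting every row to a tuple is simple but not fast, so treat it as an illustration rather than an optimized check.

```python
def count_shared_samples(split_a, split_b):
    """Count the samples of split_b that also appear, value for value, in split_a."""
    seen = {tuple(row) for row in split_a}  # tuples are hashable, rows usually are not
    return sum(1 for row in split_b if tuple(row) in seen)


# Tiny toy example: the row [2, 3] is present in both splits.
print(count_shared_samples([[0, 1], [2, 3]], [[2, 3], [4, 5]]))  # 1
```

In the notebook, one would call it on each pair of splits, e.g. `count_shared_samples(X_train, X_val)`. A count of zero means no exact duplicates were found between the two.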

### Providing data to models

We have some data in memory, but we still need a way of bringing this data to the model that we will use for our predictions.

The framework we are going to use, PyTorch, provides an abstract class, `Dataset`, which we are supposed to specialize.

Specializing `Dataset` means writing two methods, `__init__()` and `__getitem__()`. In practice, `__len__()` is usually implemented as well, since the default `DataLoader` options expect to know the dataset size.

The `__init__()` method simply builds the specialized `Dataset` object, while the `__getitem__()` method is responsible for actually providing a data sample when one is requested.
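In Python, these responsibilities map onto the special methods `__init__`, `__getitem__` and (optionally, but commonly) `__len__`. The class below is a minimal sketch of such a specialization, not the final implementation: the name `DigitsDataset` and its attributes are hypothetical, and the `try`/`except` around the import is only there so the sketch can also be run on a machine without PyTorch installed.

```python
try:
    from torch.utils.data import Dataset  # PyTorch's base class
except ImportError:
    Dataset = object  # fallback so the sketch still runs without PyTorch


class DigitsDataset(Dataset):
    """Hypothetical map-style dataset wrapping in-memory inputs and labels."""

    def __init__(self, inputs, labels):
        # Inputs and labels must be paired one-to-one.
        if len(inputs) != len(labels):
            raise ValueError("inputs and labels must have the same number of samples")
        self.inputs = inputs
        self.labels = labels

    def __len__(self):
        # Number of samples available in this split.
        return len(self.inputs)

    def __getitem__(self, index):
        # Return one (input, label) pair.
        return self.inputs[index], self.labels[index]
```

A possible usage would be `train_dataset = DigitsDataset(X_train, Y_train)`, after which `train_dataset[0]` returns the first image together with its label.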