## The lifecycle of a Machine Learning project

- Planning/choosing a goal
- Data collection & labelling
- Creating features & preprocessing
- Training and optimization
- Deployment

## Planning

Choosing what to work on and how you'll measure success is the most important part of the project.

Get help here! Ask domain experts and business people, and meditate on it. Do spend some time thinking about it, but don't linger for too long. Analysis paralysis is a very common phenomenon in the real world!

## Prototyping a baseline model

Jupyter notebooks are a great prototyping/experimentation tool. You can use one or more notebooks to get a quick sense of the feasibility and performance of a model.

After you get some results, you'll proceed to create a full-blown project containing the baseline model. This will be a lot of work, but the rewards you might expect include catching bugs, getting new ideas, and developing something that (hopefully) has a real-world impact.

Next, we'll go over an example task of automating the decision of whether a bank customer has good or bad credit risk.

### Getting your data

If you haven't solved any real-world ML problems yet, you might believe that most datasets get stored as CSV files. While this format is great when learning, you'll often need to understand SQL, pickle, HDF5, Parquet, and many more. 
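To give a rough feel for what that means in practice, here is a minimal sketch that round-trips the same small table through CSV, SQLite and pickle. The table contents, the file names and the `customers` table name are made up for illustration; HDF5 and Parquet are left out because they need extra packages.

```python
import os
import pickle
import sqlite3
import tempfile

import pandas as pd

# Hypothetical toy table; the column names mirror the credit dataset.
customers = pd.DataFrame(
    {"duration": [6, 48, 12], "credit_amount": [1169, 5951, 2096]}
)

with tempfile.TemporaryDirectory() as tmp_dir:
    # CSV: plain text, column types are re-inferred when reading.
    csv_path = os.path.join(tmp_dir, "customers.csv")
    customers.to_csv(csv_path, index=False)
    from_csv = pd.read_csv(csv_path)

    # SQL: here an SQLite file accessed through the standard library driver.
    connection = sqlite3.connect(os.path.join(tmp_dir, "customers.db"))
    customers.to_sql("customers", connection, index=False)
    from_sql = pd.read_sql("SELECT * FROM customers", connection)
    connection.close()

    # pickle: Python-specific binary format; only load files you trust.
    pickle_path = os.path.join(tmp_dir, "customers.pkl")
    with open(pickle_path, "wb") as f:
        pickle.dump(customers, f)
    with open(pickle_path, "rb") as f:
        from_pickle = pickle.load(f)

print(from_csv.shape, from_sql.shape, from_pickle.shape)
```

Whatever the storage format, the goal is the same: end up with one in-memory table you can inspect and split.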

#### Labeling

Sometimes your data will have labels, but it might not be exactly the data you need. Other times, the labels will be missing altogether. What can you do?

Creating a labeling infrastructure will be time well spent. Increasing the size of your dataset and reducing its noise (fewer wrong or mislabeled examples) can substantially increase the predictive power of your model(s).

You will always want more and cleaner data. So keep getting it, slowly but surely. At some point, you'll notice you start getting diminishing returns. Depending on the problem you're solving, it might be a good idea to focus on other issues in the project.

#### Looking at the data

This might sound boring, or even like complete nonsense, but go through individual examples from the data. Can you figure out the label for each one? Are the labels correct? Are they consistent?
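Part of the consistency check can be automated. The sketch below, which uses made-up rows rather than the real dataset, flags feature combinations that appear with more than one label. It won't replace reading examples by hand, but it can point you to the suspicious ones first.

```python
from collections import defaultdict

# Hypothetical labelled examples: (features, label).
examples = [
    (("radio/tv", 6), "good"),
    (("education", 12), "good"),
    (("radio/tv", 6), "bad"),  # same features as the first row, different label
    (("new car", 24), "bad"),
]

labels_by_features = defaultdict(set)
for features_row, label in examples:
    labels_by_features[features_row].add(label)

# Keep only the feature combinations that were given more than one label.
conflicts = {
    row: sorted(labels)
    for row, labels in labels_by_features.items()
    if len(labels) > 1
}
print(conflicts)
```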

Remember, feeding your model with crappy data will give you garbage results. At this stage of the project, you don't want to deal with crap - there will be plenty later on.

In [1]:
from sklearn import datasets

features, targets = datasets.fetch_openml(
    name="credit-g", 
    version=1, 
    return_X_y=True, 
    as_frame=True
)

features.shape, targets.shape

((1000, 20), (1000,))

In [2]:
features.head()

| | checking_status | duration | credit_history | purpose | credit_amount | savings_status | employment | installment_commitment | personal_status | other_parties | residence_since | property_magnitude | age | other_payment_plans | housing | existing_credits | job | num_dependents | own_telephone | foreign_worker |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | <0 | 6.0 | critical/other existing credit | radio/tv | 1169.0 | no known savings | >=7 | 4.0 | male single | none | 4.0 | real estate | 67.0 | none | own | 2.0 | skilled | 1.0 | yes | yes |
| 1 | 0<=X<200 | 48.0 | existing paid | radio/tv | 5951.0 | <100 | 1<=X<4 | 2.0 | female div/dep/mar | none | 2.0 | real estate | 22.0 | none | own | 1.0 | skilled | 1.0 | none | yes |
| 2 | no checking | 12.0 | critical/other existing credit | education | 2096.0 | <100 | 4<=X<7 | 2.0 | male single | none | 3.0 | real estate | 49.0 | none | own | 1.0 | unskilled resident | 2.0 | none | yes |
| 3 | <0 | 42.0 | existing paid | furniture/equipment | 7882.0 | <100 | 4<=X<7 | 2.0 | male single | guarantor | 4.0 | life insurance | 45.0 | none | for free | 1.0 | skilled | 2.0 | none | yes |
| 4 | <0 | 24.0 | delayed previously | new car | 4870.0 | <100 | 1<=X<4 | 3.0 | male single | none | 4.0 | no known property | 53.0 | none | for free | 2.0 | skilled | 2.0 | none | yes |


In [3]:
targets.head()

0    good
1     bad
2    good
3    good
4     bad
Name: class, dtype: category
Categories (2, object): ['good', 'bad']

In [4]:
targets.value_counts()

good    700
bad     300
Name: class, dtype: int64
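
The classes are imbalanced, so it helps to know what a model that always predicts the most frequent class would score. A minimal sketch, using only the counts printed above:

```python
from collections import Counter

# Class counts copied from the value_counts() output above.
class_counts = Counter({"good": 700, "bad": 300})

majority_class, majority_count = class_counts.most_common(1)[0]
baseline_accuracy = majority_count / sum(class_counts.values())

print(majority_class, baseline_accuracy)
```

Any accuracy you measure later is worth comparing against this number before you celebrate.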

### Note on reproducibility


In [5]:
import random
import numpy as np
import torch

RANDOM_SEED = 42

random.seed(RANDOM_SEED)
np.random.seed(RANDOM_SEED + 1)
torch.manual_seed(RANDOM_SEED + 2)

<torch._C.Generator at 0x7f89e95d9528>
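
Seeding the random number generators, as the cell above does, is what lets you rerun an experiment and get the same numbers. Here is a minimal sketch of that effect using only `random` and NumPy; `seed_everything` and `draw_sample` are hypothetical helper names, not part of any library.

```python
import random

import numpy as np


def seed_everything(seed):
    """Seed the Python and NumPy global generators (hypothetical helper)."""
    random.seed(seed)
    np.random.seed(seed)


def draw_sample():
    # One draw from each generator, standing in for a shuffle or a weight init.
    return random.random(), np.random.rand(3).tolist()


seed_everything(42)
first_run = draw_sample()

seed_everything(42)
second_run = draw_sample()

print(first_run == second_run)
```

scikit-learn falls back to NumPy's global generator when you don't pass `random_state`, so seeding NumPy also covers calls like `train_test_split`, as long as the cells run in the same order every time.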

## Feature engineering

One of the advantages of Deep Neural Networks is that they automate the process of feature engineering. At least, that was the grand promise.

In practice, adding manual features might significantly improve the performance of your model. But creating good features is black magic. Ideas for those come almost always from spending absurd amounts of time with the raw data.

Start by thinking of a couple of features and encode them. Use classical ML algorithms (like Random Forest) to evaluate their importance. Those features will be prime candidates for inclusion in your Deep Learning model later on.
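
As a rough illustration of that workflow, the sketch below fits a Random Forest on synthetic data (not the credit dataset) and ranks the features by `feature_importances_`. The feature names are made up, and only the first column carries any signal, so it should come out on top.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in data: only the first column determines the label.
n_samples = 500
informative = rng.normal(size=n_samples)
noise = rng.normal(size=(n_samples, 2))
X = np.column_stack([informative, noise])
y = (informative > 0).astype(int)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

feature_names = ["informative", "noise_1", "noise_2"]  # hypothetical names
ranked = sorted(
    zip(feature_names, forest.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked:
    print(f"{name}: {importance:.3f}")
```

Treat the ranking as a hint rather than a verdict: impurity-based importances can favor features with many distinct values.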

In [6]:
FEATURES = [
  "duration", 
  "credit_amount", 
  "age", 
  "existing_credits", 
  "residence_since"
]

In [7]:
from sklearn.model_selection import train_test_split
from sklearn import preprocessing

X_train, X_test, y_train, y_test = train_test_split(
  features[FEATURES], 
  targets, 
  test_size=0.2
)

label_encoder = preprocessing.LabelEncoder()
label_encoder = label_encoder.fit(y_train)

X_train = X_train.to_numpy()
y_train = label_encoder.transform(y_train)

X_test = X_test.to_numpy()
y_test = label_encoder.transform(y_test)

In [8]:
X_train.shape, y_train.shape

((800, 5), (800,))

In [9]:
X_test.shape, y_test.shape

((200, 5), (200,))

## Training and evaluation

Training a Deep Neural Net using any of the popular libraries for Deep Learning is relatively straightforward. That is, as long as you keep playing with toy examples.

In practice, the training might include a lot of hacks that change the generic process just a bit - enough to introduce bugs and produce tons of incomprehensible code.

Using a library like scikit-learn is a great first choice for building a baseline model. It takes very little time, the code is easier to understand and you can gain a lot of insight into the problem you're solving. 

Here's a quick example of how you can use a Random Forest classifier:

In [10]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=200)
model = model.fit(X_train, y_train)

In [11]:
model.score(X_test, y_test)

0.68

### Training a Deep Neural Net in PyTorch

That is a great start, but we're interested in building a Neural Net. Let's create a simple one using PyTorch:

In [28]:
from torch.utils.data.dataset import Dataset
from torch.utils.data import DataLoader

class CreditTypeDataset(Dataset):
  def __init__(self, features, labels):
    self.features = features
    self.labels = labels
      
  def __getitem__(self, index):
    return (
      torch.from_numpy(self.features[index]).float(), 
      self.labels[index]
    )

  def __len__(self):
    return len(self.features)

In [29]:
dataset = CreditTypeDataset(X_train, y_train)
data_loader = DataLoader(dataset, batch_size=4, shuffle=True)
features_batch, labels_batch = next(iter(data_loader))

In [30]:
features_batch

tensor([[2.4000e+01, 1.2850e+03, 3.2000e+01, 1.0000e+00, 4.0000e+00],
        [1.5000e+01, 1.5320e+03, 3.1000e+01, 1.0000e+00, 3.0000e+00],
        [1.2000e+01, 5.2200e+02, 4.2000e+01, 2.0000e+00, 4.0000e+00],
        [1.2000e+01, 1.8580e+03, 2.2000e+01, 1.0000e+00, 1.0000e+00]])

In [31]:
labels_batch

tensor([0, 1, 1, 1])
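
Notice that the columns in this batch live on very different scales: `credit_amount` is in the thousands while `existing_credits` is close to 1. The notebook feeds the raw values to the network. A common preprocessing step, shown here only as a sketch and not as something the original pipeline does, is to standardize each column to zero mean and unit variance:

```python
import numpy as np

# The batch shown above:
# duration, credit_amount, age, existing_credits, residence_since.
batch = np.array([
    [24.0, 1285.0, 32.0, 1.0, 4.0],
    [15.0, 1532.0, 31.0, 1.0, 3.0],
    [12.0, 522.0, 42.0, 2.0, 4.0],
    [12.0, 1858.0, 22.0, 1.0, 1.0],
])

# In a real pipeline these statistics would come from the training split only.
mean = batch.mean(axis=0)
std = batch.std(axis=0)

standardized = (batch - mean) / std
print(standardized.round(2))
```

Whether this helps on the credit data is something you'd have to measure; it is cheap to try.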

In [32]:
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

class CreditTypeClassifierNet(nn.Module):

  def __init__(self, n_features, n_credit_types):
    super(CreditTypeClassifierNet, self).__init__()
    self.fc1 = nn.Linear(n_features, n_features * 2)
    self.fc2 = nn.Linear(n_features * 2, n_credit_types)

  def forward(self, x):    
    x = F.relu(self.fc1(x))
    x = self.fc2(x)
    return x

  def create_optimizer(self):
    return optim.Adam(self.parameters(), lr=0.01)

  def create_criterion(self):
    return nn.CrossEntropyLoss()

In [33]:
model = CreditTypeClassifierNet(len(FEATURES), 2)

In [34]:
model

CreditTypeClassifierNet(
  (fc1): Linear(in_features=5, out_features=10, bias=True)
  (fc2): Linear(in_features=10, out_features=2, bias=True)
)
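
Note that `forward` returns raw scores (logits) with no softmax at the end. That works because `nn.CrossEntropyLoss` applies log-softmax internally. To make the computation less of a black box, here is a NumPy sketch of the same idea; the function name and the example scores are made up.

```python
import numpy as np


def cross_entropy_from_logits(logits, labels):
    """Mean cross-entropy for integer class labels, computed from raw scores."""
    # Subtract the row maximum before exponentiating for numerical stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Pick the log-probability assigned to the correct class of each row.
    return -log_probs[np.arange(len(labels)), labels].mean()


# Made-up scores for two examples and two classes.
logits = np.array([[2.0, 0.0], [0.0, 0.0]])
labels = np.array([0, 1])

print(cross_entropy_from_logits(logits, labels))
```

The second row has equal scores for both classes, so it contributes `log(2)`, which is the loss of a coin flip on a two-class problem.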

In [86]:
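class Phase:
  TRAIN = "train"
  TEST = "test"

  # Inside the class body the names are referenced directly;
  # the name Phase is not bound until the class body finishes.
  values = [TRAIN, TEST]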

In [87]:
class Evaluator:

  def __init__(self, criterion):
    self.criterion = criterion

  def eval(self, model, X, y, phase: Phase):
    with torch.set_grad_enabled(phase == Phase.TRAIN):
      model = model.train() if phase == Phase.TRAIN else model.eval()
      outputs = model(X)
      loss = self.criterion(outputs, y)

      _, predicted = outputs.max(dim=1)
      correct_count = torch.sum(predicted == y)

      return loss, correct_count

In [88]:
from collections import defaultdict, namedtuple

Progress = namedtuple("Progress", ["loss", "accuracy"])

class ProgressLogger:

  def __init__(self):
    self.progress = defaultdict(
      lambda: {Phase.TRAIN: None, Phase.TEST: None}
    )

  @staticmethod
  def _round(value, precision=3):
    return np.round(value, precision)

  def save_progress(self, epoch, phase, loss, accuracy):
    self.progress[epoch][phase] = Progress(loss, accuracy)

  def log(self, epoch):
    print(f"Epoch {epoch + 1}")

    train_progress = self.progress[epoch][Phase.TRAIN]
    print(f"Train: loss {ProgressLogger._round(train_progress.loss)} accuracy {ProgressLogger._round(train_progress.accuracy)}")

    test_progress = self.progress[epoch][Phase.TEST]
    print(f"Test: loss {ProgressLogger._round(test_progress.loss)} accuracy {ProgressLogger._round(test_progress.accuracy)}")
    print()

In [93]:
class Trainer:

  def __init__(self, train_dataset, test_dataset):

    train_data_loader = DataLoader(train_dataset)
    test_data_loader = DataLoader(test_dataset)

    self.data_loaders = {
        Phase.TRAIN: train_data_loader,
        Phase.TEST: test_data_loader
    }

    self.dataset_sizes = {
        Phase.TRAIN: len(train_dataset),
        Phase.TEST: len(test_dataset)
    }

  def train(self, model, n_epochs):
    optimizer = model.create_optimizer()

    logger = ProgressLogger()

    evaluator = Evaluator(model.create_criterion())

    for epoch in range(n_epochs):
    
      for phase in Phase.values:

        phase_loss = 0.0
        phase_correct = 0

        for inputs, labels in self.data_loaders[phase]:

          loss, correct_count = evaluator.eval(model, inputs, labels, phase)

          if phase == Phase.TRAIN:

            optimizer.zero_grad()
            loss.backward()

            optimizer.step()

          phase_loss += loss.item() * inputs.size(0)
          phase_correct += correct_count

        epoch_loss = phase_loss / self.dataset_sizes[phase]
        epoch_accuracy = phase_correct.double() / self.dataset_sizes[phase]

        logger.save_progress(epoch, phase, epoch_loss, epoch_accuracy)

      logger.log(epoch)
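
One detail in the loop above deserves a closer look: `loss.item() * inputs.size(0)`. The criterion returns the mean loss of a batch, so multiplying by the batch size and dividing by the dataset size at the end gives a proper per-example average. A small arithmetic sketch with made-up numbers:

```python
# Hypothetical per-batch mean losses and batch sizes; the last batch is smaller.
batch_mean_losses = [0.50, 0.70, 1.00]
batch_sizes = [4, 4, 2]

# What the training loop accumulates: mean loss times number of examples.
total_loss = sum(
    loss * size for loss, size in zip(batch_mean_losses, batch_sizes)
)
epoch_loss = total_loss / sum(batch_sizes)

# Averaging the batch means directly over-weights the small final batch.
naive_average = sum(batch_mean_losses) / len(batch_mean_losses)

print(epoch_loss, naive_average)
```

With the default `DataLoader` batch size of 1 used here, both averages coincide, but the weighted version stays correct if you later increase the batch size.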

In [94]:
train_dataset = CreditTypeDataset(X_train, y_train)
test_dataset = CreditTypeDataset(X_test, y_test)

trainer = Trainer(train_dataset, test_dataset)
trainer.train(model, n_epochs=10)

Epoch 1
Train: loss 0.608 accuracy 0.70625
Test: loss 0.632 accuracy 0.675

Epoch 2
Train: loss 0.608 accuracy 0.70625
Test: loss 0.632 accuracy 0.675

Epoch 3
Train: loss 0.608 accuracy 0.70625
Test: loss 0.632 accuracy 0.675

Epoch 4
Train: loss 0.608 accuracy 0.70625
Test: loss 0.632 accuracy 0.675

Epoch 5
Train: loss 0.608 accuracy 0.70625
Test: loss 0.632 accuracy 0.675

Epoch 6
Train: loss 0.608 accuracy 0.70625
Test: loss 0.632 accuracy 0.675

Epoch 7
Train: loss 0.608 accuracy 0.70625
Test: loss 0.632 accuracy 0.675

Epoch 8
Train: loss 0.608 accuracy 0.70625
Test: loss 0.632 accuracy 0.675

Epoch 9
Train: loss 0.608 accuracy 0.70625
Test: loss 0.632 accuracy 0.675

Epoch 10
Train: loss 0.608 accuracy 0.70625
Test: loss 0.632 accuracy 0.675



### Evaluation

How well will your model do in production? To answer this question, you need answers to the following two:

- What resources (CPU, GPU, RAM and disk space) does my model need to run? What is the expected response time?
- How well will the predicted values match the real ones?

You can usually answer the first question using a variety of tools from the software development world (like time, top and htop). However, you need to take into account the size of the input data. If you're loading large text or image data into memory, it might exceed the available RAM and crash your program. Make sure you know the bounds of your data.
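
You can also take quick measurements from inside Python. The sketch below times a single call and records peak memory with the standard library; `predict` is a hypothetical stand-in for your real model, and `tracemalloc` only sees allocations made through Python's allocator, so treat the memory figure as a lower bound.

```python
import time
import tracemalloc


def predict(rows):
    """Stand-in for a real model call (hypothetical): sums each row."""
    return [sum(row) for row in rows]


rows = [[float(i), float(i + 1)] for i in range(10_000)]

tracemalloc.start()
start = time.perf_counter()
predictions = predict(rows)
elapsed_seconds = time.perf_counter() - start
_, peak_bytes = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"{elapsed_seconds * 1000:.2f} ms, peak {peak_bytes / 1024:.1f} KiB")
```

Repeat the measurement with the largest inputs you expect, not just the typical ones.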

How good are your model's predictions? A wide variety of statistical tests are available to evaluate the performance of different models, and they are very good at what they do. But having large amounts of data changes the game a little bit. You can use simple tools like accuracy, a confusion matrix, precision and recall, and apply appropriate thresholding. Proper evaluation of your model is possible only if you're intimately familiar with the domain.
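
Here is a minimal sketch of those simple tools, written in plain NumPy so nothing is hidden. The labels and probabilities are made up, and `evaluate` is a hypothetical helper; it shows how moving the decision threshold trades precision for recall.

```python
import numpy as np

# Made-up labels and predicted probabilities of the positive class (1).
y_true = np.array([1, 1, 1, 0, 0, 1, 0, 1])
y_prob = np.array([0.9, 0.8, 0.4, 0.6, 0.2, 0.7, 0.1, 0.3])


def evaluate(y_true, y_prob, threshold):
    """Confusion matrix and basic metrics at a given decision threshold."""
    y_pred = (y_prob >= threshold).astype(int)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    tn = int(np.sum((y_pred == 0) & (y_true == 0)))
    return {
        "confusion_matrix": [[tn, fp], [fn, tp]],
        "accuracy": (tp + tn) / len(y_true),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


for threshold in (0.5, 0.25):
    print(threshold, evaluate(y_true, y_prob, threshold))
```

Which threshold is "appropriate" depends on what each kind of error costs, and that is where domain knowledge comes in.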

One critical step in the process is looking at errors. Where does your model make errors? You should manually go through some of them to get a feel for the patterns. How do you fix those? One simple and effective way to make your model better is to add more data that matches the conditions where the model makes errors.
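
A simple way to start is to group the misclassified examples by some attribute and count them. The records below are made up for illustration; in a real project they would come from your test set.

```python
from collections import Counter

# Made-up evaluation records: (purpose, true label, predicted label).
records = [
    ("radio/tv", "good", "good"),
    ("new car", "bad", "good"),
    ("education", "good", "good"),
    ("new car", "bad", "good"),
    ("radio/tv", "bad", "bad"),
    ("education", "good", "bad"),
]

errors = [record for record in records if record[1] != record[2]]

# Count errors per slice to see where extra data would help most.
errors_by_purpose = Counter(purpose for purpose, _, _ in errors)
print(errors_by_purpose.most_common())
```

The slice with the most errors is a natural place to look for more (or cleaner) data.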

## Deployment

Deploying your model allows you to get your work to your users. It might be that millions will use it (given you work at a company like Google) or just you. Either way, you'll need to make your model available.

The most common way of deploying your model is behind a REST API. You can also embed it into a user's device (building a mobile app for iOS or Android).

### Building an API