# Model Training

## Compute Class Weights

Class imbalance in machine learning can be handled in several ways.

One of them is adjusting the class weights.

By giving higher weights to the minority class and lower weights to the majority class, we rebalance how much each class contributes to the loss function.

Misclassifying a minority-class sample then results in a higher loss because of its higher weight.

To incorporate class weights in TensorFlow, use `scikit-learn`'s `compute_class_weight` function.

In [None]:
import numpy as np
import tensorflow as tf
from sklearn.utils import compute_class_weight

X, y = ...

# will return an array with weights for each class, e.g. [0.6, 0.6, 1.]
class_weights = compute_class_weight(
  class_weight="balanced",
  classes=np.unique(y),
  y=y
)

# to get a dictionary with {<class>:<weight>}
class_weights = dict(enumerate(class_weights))

model = tf.keras.Sequential(...)
model.compile(...)

# using class_weights in the .fit() method
model.fit(X, y, class_weight=class_weights, ...)
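
For intuition, scikit-learn documents the `"balanced"` mode as the heuristic `n_samples / (n_classes * np.bincount(y))`. The sketch below checks that formula against `compute_class_weight` on a small, made-up label array; the toy `y` is an assumption for illustration only and is not data from this notebook.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical imbalanced labels: six samples of class 0, three of class 1, one of class 2
y = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 2])
classes = np.unique(y)

# Weights as computed by scikit-learn
sk_weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)

# The same weights computed by hand with the documented heuristic
counts = np.bincount(y)
manual_weights = len(y) / (len(classes) * counts)

# Dictionary in the {<class>: <weight>} form that Keras' fit() accepts
class_weight_dict = dict(zip(classes.tolist(), sk_weights.tolist()))
print(class_weight_dict)
```

The rarest class receives the largest weight, so its errors count the most in the loss.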

## Reset TensorFlow/Keras Global State

In TensorFlow/Keras, when you create multiple models in a loop, you should call `tf.keras.backend.clear_session()`.

Keras manages a global state, which includes configurations and the current values (weights and biases) of the models.

So when you create models in a loop, the global state grows with every model created. Running `del model` does not clear the state, because it only deletes the Python variable.

`tf.keras.backend.clear_session()` is the better option. It resets the global state and helps avoid clutter from old models.

In the first loop below, each iteration increases the size of the global state and your memory consumption.

In the second loop, memory consumption stays constant because the state is cleared on every iteration.

In [None]:
import tensorflow as tf

def create_model():
  model = tf.keras.Sequential(...)
  return model

# without clearing session
for _ in range(20):
  model = create_model()
  
# with clearing session
for _ in range(20):
  tf.keras.backend.clear_session()
  model = create_model()

## Find dirty labels with cleanlab

Do you want to identify noisy labels in your dataset?

Try `cleanlab` for Python.

`cleanlab` is a data-centric AI package that uses confident learning algorithms to automatically detect noisy labels and other dataset issues so that you can fix them.

It works with nearly every model possible:

- XGBoost
- scikit-learn models
- TensorFlow
- PyTorch
- Hugging Face
- etc.

In [None]:
!pip install cleanlab

import cleanlab
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data
y = iris.target

clf = RandomForestClassifier(n_estimators=100)

cl = cleanlab.classification.CleanLearning(clf)

label_issues = cl.find_label_issues(X, y)

print(label_issues.query('is_label_issue == True'))

    is_label_issue  label_quality  given_label  predicted_label
70            True           0.07            1                2
77            True           0.01            1                2

## Evaluate your Classifier with classification_report

Would you like to evaluate your Machine Learning model quickly?

Try `classification_report` from scikit-learn.

With `classification_report`, you can quickly assess the performance of your model.

It summarizes Precision, Recall, F1-Score, and Support for each class.
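
A minimal sketch of how this could look, assuming the Iris dataset and a `LogisticRegression` model as stand-ins for your own data and classifier (both are choices made for this illustration):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

iris = load_iris()
X, y = iris.data, iris.target

# Hold out a stratified test set so every class appears in the report
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Text table with precision, recall, F1-score, and support per class
print(classification_report(y_test, y_pred, target_names=iris.target_names))

# The same numbers as a nested dictionary, handy for logging or further processing
report = classification_report(y_test, y_pred, output_dict=True)
```

Passing `output_dict=True` is optional; it is useful when you want to track individual metrics programmatically instead of reading the printed table.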

## Obtain Reproducible Optimization Results in Optuna

Optuna is a powerful hyperparameter optimization framework that supports many machine learning frameworks, including TensorFlow, PyTorch, and XGBoost.

But you need to take care if you want reproducible results from hyperparameter tuning.

To achieve reproducible results, you need to set the seed for your sampler.

Below you can see how it is done for `TPESampler`.

In [None]:
import optuna
from optuna.samplers import TPESampler

def objective(trial):
    ...
    
sampler = TPESampler(seed=42)
study = optuna.create_study(sampler=sampler)
study.optimize(objective, n_trials=100)

## Find bad labels with doubtlab

Do you want to find bad labels in your data?

Try `doubtlab` for Python.

With `doubtlab`, you can define reasons to doubt your labels and take a closer look.

Reasons to doubt your labels can be, for example:

- `ProbaReason`: When the confidence values are low for any label
- `WrongPredictionReason`: When a model cannot predict the listed label
- `DisagreeReason`: When two models disagree on a prediction
- `RelativeDifferenceReason`: When the relative difference between label and prediction is too high

So, identify your noisy labels and fix them.


In [None]:
from doubtlab.ensemble import DoubtEnsemble
from doubtlab.reason import ProbaReason, WrongPredictionReason
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression()
model.fit(X, y)

# Define reasons to check
reasons = {
    'proba': ProbaReason(model=model),
    'wrong_pred': WrongPredictionReason(model=model),
}

# Pass reasons to a DoubtEnsemble instance
doubt = DoubtEnsemble(**reasons)

# Returns DataFrame with reasoning
predicates = doubt.get_predicates(X, y)
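
To make the idea behind the first two reasons more concrete, here is a rough approximation in plain scikit-learn and NumPy. It is a sketch of the concept only and not `doubtlab`'s implementation; the threshold `MAX_PROBA` and the names `doubt_score` and `suspect_indices` are hypothetical and chosen for this illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Assumed threshold: below this, the model's top confidence counts as "low"
MAX_PROBA = 0.55

# Doubt 1: the model is not confident about any label for this row
proba = model.predict_proba(X)
low_confidence = proba.max(axis=1) < MAX_PROBA

# Doubt 2: the model's prediction differs from the listed label
wrong_prediction = model.predict(X) != y

# Count how many reasons fire per row and collect rows with at least one
doubt_score = low_confidence.astype(int) + wrong_prediction.astype(int)
suspect_indices = np.flatnonzero(doubt_score > 0)

print(f"{len(suspect_indices)} of {len(y)} rows deserve a closer look")
```

Rows flagged by both checks are natural candidates to review first.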

## Get notified when your model is finished with training

Never stare at your screen, waiting for your model to finish training.

Try `knockknock` for Python.

`knockknock` is a library that notifies you when your training is finished.

You only need to add a decorator.

Currently, you can get a notification through 12 different channels, such as:

- Email
- Slack
- Telegram
- Discord
- MS Teams


Use it for your next training run and stop watching your screen.

In [None]:
# !pip install knockknock
from knockknock import email_sender

@email_sender(
    recipient_emails=["coolmail@python.com", "2coolmail@python.com"],
    sender_email="anothercoolmail@python.com",
)
def train_model(model, X, y):
    model.fit(X, y)

## Get Model Summary in PyTorch with torchinfo

Do you want a model summary in PyTorch?

Like in Keras with `model.summary()`?

Use `torchinfo`.

With `torchinfo`, you can get a model summary as you know it from Keras.

Just add one line of code.

In [None]:
# !pip install torchinfo

import torch
from torchinfo import summary

class MyModel(torch.nn.Module):
  ...
  
model = MyModel()

BATCH_SIZE = 16
summary(model, input_size=(BATCH_SIZE, 1, 28, 28))

'''
==========================================================================================
Layer (type:depth-idx)                   Output Shape              Param #
==========================================================================================
Net                                      [16, 10]                  --
├─Sequential: 1-1                        [16, 4, 7, 7]             --
│    └─Conv2d: 2-1                       [16, 4, 28, 28]           40
│    └─BatchNorm2d: 2-2                  [16, 4, 28, 28]           8
│    └─ReLU: 2-3                         [16, 4, 28, 28]           --
│    └─MaxPool2d: 2-4                    [16, 4, 14, 14]           --
│    └─Conv2d: 2-5                       [16, 4, 14, 14]           148
│    └─BatchNorm2d: 2-6                  [16, 4, 14, 14]           8
│    └─ReLU: 2-7                         [16, 4, 14, 14]           --
│    └─MaxPool2d: 2-8                    [16, 4, 7, 7]             --
├─Sequential: 1-2                        [16, 10]                  --
│    └─Linear: 2-9                       [16, 10]                  1,970
==========================================================================================
Total params: 2,174
Trainable params: 2,174
Non-trainable params: 0
Total mult-adds (M): 1.00
==========================================================================================
Input size (MB): 0.05
Forward/backward pass size (MB): 1.00
Params size (MB): 0.01
Estimated Total Size (MB): 1.06
==========================================================================================
'''

## Boost scikit-learn's performance with Intel Extension

Scikit-learn is one of the most popular ML packages for Python.

But, to be honest, its algorithms are not the fastest ones.

With Intel’s Extension for scikit-learn, `scikit-learn-intelex`, you can speed up training time for some popular algorithms like:

- Support Vector Classifier/Regressor
- Random Forest Classifier/Regressor
- LASSO
- DBSCAN

Just add two lines of code.

In [None]:
from sklearnex import patch_sklearn
patch_sklearn()

from sklearn.svm import SVR
from sklearn.datasets import make_regression

X, y = make_regression(
    n_samples=100000,
    n_features=10,
    noise=0.5,
)

svr = SVR()

svr.fit(X, y)

## Incorporate Domain Knowledge into XGBoost with Feature Interaction Constraints

Want to incorporate your domain knowledge into `XGBoost`?

Try using Feature Interaction Constraints.

Feature Interaction Constraints allow you to control which features are allowed to interact with each other and which are not while building the trees.

For example, the constraint [0, 1] means that Feature_0 and Feature_1 are allowed to interact with each other but with no other variable. Similarly, [3, 5, 9] means that Feature_3, Feature_5, and Feature_9 are allowed to interact with each other but with no other variable.

With this in mind, you can define feature interaction constraints:

- Based on domain knowledge, when you know that some feature interactions will lead to better results
- Based on regulatory constraints in your industry/company, where some features are not allowed to interact with each other

In [None]:
import xgboost as xgb

X, y = ...

dmatrix = xgb.DMatrix(X, label=y)

params = {
    "objective": "reg:squarederror",
    "eval_metric": "rmse",
    "interaction_constraints": [[0, 2], [1, 3, 4]],
}

model_with_constraints = xgb.train(params, dmatrix)

## Powerful AutoML with FLAML

Do you always hear about AutoML?

And want to try it out?

Use `FLAML` for Python.

`FLAML` (Fast and Lightweight AutoML) is an AutoML package developed by Microsoft.

It can do Model Selection, Hyperparameter tuning, and Feature Engineering automatically.

Thus, it removes the pain of choosing the best model and parameters so that you can focus more on your data.

By default, its estimator list contains only tree-based models like XGBoost, CatBoost, and LightGBM. But you can also add custom models.

A powerful library!

## Aspect-based Sentiment Analysis with PyABSA

Traditional sentiment analysis focuses on determining the overall sentiment of a piece of text.

For example, the sentence:

“The food was bad and the staff was rude”

would output only a negative sentiment.

But what if I want to extract which aspects have a negative or positive sentiment?

That’s the job of aspect-based sentiment analysis.

It aims to identify and extract the sentiment expressed towards specific aspects of a text.

For the sentence:

“The battery life is excellent but the camera quality is bad.”

a model's output would be:

- Battery life: positive
- Camera quality: negative

With aspect-based sentiment analysis, you can understand the opinions and feelings expressed about specific aspects.

To do that in Python, use the package `PyABSA`.

It contains pre-trained models with an easy-to-use API for aspect-term extraction and sentiment classification.

`PyABSA` can be used for a variety of applications, such as:

- Customer feedback analysis
- Product reviews analysis
- Social media monitoring

In [None]:
# !pip install pyabsa==1.16.27

from pyabsa import ATEPCCheckpointManager

extractor = ATEPCCheckpointManager.get_aspect_extractor(
    checkpoint="multilingual",
    auto_device=False,
)

example = ["Location and food were excellent but staff was very unfriendly."]
result = extractor.extract_aspect(inference_source=example, pred_sentiment=True)

print(result)

## Use XGBoost for Random Forests

In [None]:
import numpy as np
from xgboost import XGBRFRegressor

xgbrf = XGBRFRegressor(n_estimators=100)

X = np.random.rand(100000, 10)
y = np.random.rand(100000)

xgbrf.fit(X, y)