# Class-weighted Learning
- Class imbalance is a common problem. For instance, non-purchasers far outnumber purchasers among mobile app users, and non-criminals far outnumber criminals in society.
- However, if the class imbalance is too severe (i.e., the training set is highly skewed), it is likely to have undesirable effects.
    - For instance, the algorithm will tend to predict the majority class almost all the time.
    - This is risky because the model may fail to identify the cases we care about, such as purchasers among mobile app users or criminals, which are relatively rare among the training instances.
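
To make the risk concrete, here is a minimal sketch of a "classifier" that always votes for the majority class. The labels are hypothetical (950 majority instances, 50 minority instances) and are not taken from the dataset used below. Accuracy looks high even though no minority instance is ever found.

```python
from collections import Counter

# Hypothetical labels: 950 majority-class (0) and 50 minority-class (1) instances.
y_true = [0] * 950 + [1] * 50

# A "classifier" that always predicts the most frequent class in the data.
majority_class = Counter(y_true).most_common(1)[0][0]
y_pred = [majority_class] * len(y_true)

# Overall accuracy: fraction of labels predicted correctly.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Minority recall: fraction of true minority instances that were predicted as minority.
minority_hits = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
minority_recall = minority_hits / y_true.count(1)

print("Accuracy: ", accuracy)
print("Minority recall: ", minority_recall)
```

This is why overall accuracy alone can be misleading on a skewed dataset.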

In [18]:
import numpy as np
from sklearn.utils import class_weight
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, accuracy_score
from collections import Counter
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam

## Load Dataset
- Breast cancer dataset in ```sklearn```
- doc: http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html

In [2]:
data = load_breast_cancer()
X_data = data.data.tolist()
y_data = data.target.tolist()

In [3]:
print("Number of malignant instances (0): ", Counter(y_data)[0])
print("Number of benign instances (1): ", Counter(y_data)[1])

Number of malignant instances (0):  212
Number of benign instances (1):  357


In [4]:
# Delete some of the malignant instances to artificially create a class-imbalanced dataset
for i in range(200):
    if y_data[i] == 0:
        X_data[i] = None
        y_data[i] = None

In [5]:
X_data = [x for x in X_data if x is not None]
y_data = [y for y in y_data if y is not None]

In [6]:
print("Number of malignant instances (0): ", Counter(y_data)[0])
print("Number of benign instances (1): ", Counter(y_data)[1])

Number of malignant instances (0):  108
Number of benign instances (1):  357


In [7]:
X_train, X_test, y_train, y_test = train_test_split(np.asarray(X_data), np.asarray(y_data), test_size = 0.2, random_state = 7) 

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(372, 30) (93, 30) (372,) (93,)


## Computing class weights
- We compute class weights from the training set and pass them as the `class_weight` parameter when fitting the model

In [8]:
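# Recent scikit-learn versions require these arguments to be passed by keyword
weights = class_weight.compute_class_weight(class_weight = 'balanced', classes = np.unique(y_train), y = y_train)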

In [9]:
class_weights = dict(zip(np.unique(y_train), weights))
print("Computed class weights: ", class_weights)

Computed class weights:  {0: 2.2409638554216866, 1: 0.643598615916955}
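
The `'balanced'` heuristic gives each class the weight `n_samples / (n_classes * class_count)`, so rarer classes receive larger weights. The sketch below reproduces that formula in plain Python. The class counts of 83 and 289 are not printed in the notebook; they are inferred from the weights above and the training set size of 372. `balanced_class_weights` is a hypothetical helper name used only for illustration.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in sorted(counts.items())}

# Assumed training labels: 83 malignant (0) and 289 benign (1), inferred from the output above.
y_train_example = [0] * 83 + [1] * 289

weights_example = balanced_class_weights(y_train_example)
print("Class weights: ", weights_example)
```

With these weights, each class contributes the same total weight (`count * weight = n_samples / n_classes`) to the loss.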


## Naive Learning

In [28]:
def simple_mlp():
    model = Sequential()
    model.add(Dense(10, input_shape = (X_train.shape[1],), activation = 'relu'))
    model.add(Dense(1, activation = 'sigmoid'))
    model.compile(optimizer = Adam(learning_rate = 0.001), loss = 'binary_crossentropy', metrics = ['accuracy'])
    return model

In [29]:
model = simple_mlp()
model.fit(X_train, y_train, epochs = 50)

Epoch 1/50
...
Epoch 50/50


<tensorflow.python.keras.callbacks.History at 0x7f776ab28080>

In [30]:
y_prob = model.predict(X_test)
y_pred = y_prob.round()

In [31]:
print("% of predicted 1's: ", y_pred.sum()/len(y_pred))
print("ROC AUC score: ", roc_auc_score(y_test, y_prob))
print("Overall Accuracy Score: ", accuracy_score(y_test, y_pred))

% of predicted 1's:  0.7526881720430108
ROC AUC score:  0.9523529411764706
Overall Accuracy Score:  0.9139784946236559


## Class-weighted learning

In [39]:
model = simple_mlp()
model.fit(X_train, y_train, epochs = 50, class_weight = class_weights)

Epoch 1/50
...
Epoch 50/50


<tensorflow.python.keras.callbacks.History at 0x7f77670537f0>
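
Conceptually, passing `class_weight` scales each sample's loss by the weight of its class, so errors on the rare class cost more. The sketch below illustrates the idea with a weighted binary cross-entropy in plain Python. It is not the Keras implementation, whose exact reduction may differ; `weighted_bce`, the toy labels, and the toy probabilities are hypothetical and chosen only for illustration.

```python
import math

def weighted_bce(y_true, y_prob, class_weights):
    """Mean binary cross-entropy with each sample's loss scaled by its class weight."""
    total = 0.0
    for t, p in zip(y_true, y_prob):
        sample_loss = -(t * math.log(p) + (1 - t) * math.log(1 - p))
        total += class_weights[t] * sample_loss
    return total / len(y_true)

# Toy batch: one malignant (0) instance that the model gets wrong, three benign (1) instances.
y_true_toy = [0, 1, 1, 1]
y_prob_toy = [0.6, 0.9, 0.8, 0.7]  # predicted probability of class 1

# Unweighted loss (all class weights equal to 1) versus the class weights computed earlier.
plain_loss = weighted_bce(y_true_toy, y_prob_toy, {0: 1.0, 1: 1.0})
weighted_loss = weighted_bce(y_true_toy, y_prob_toy, {0: 2.2409638554216866, 1: 0.643598615916955})

print("Unweighted loss: ", plain_loss)
print("Class-weighted loss: ", weighted_loss)
```

In this toy batch the misclassified minority instance dominates the weighted loss, which pushes the optimizer to pay more attention to the rare class.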

In [40]:
y_prob = model.predict(X_test)
y_pred = y_prob.round()

In [41]:
print("% of predicted 1's: ", y_pred.sum()/len(y_pred))
print("ROC AUC score: ", roc_auc_score(y_test, y_prob))
print("Overall Accuracy Score: ", accuracy_score(y_test, y_pred))

% of predicted 1's:  0.7849462365591398
ROC AUC score:  0.9311764705882354
Overall Accuracy Score:  0.9032258064516129
