# Class-weighted Learning
- Class imbalance is a common problem. For instance, non-purchasers far outnumber purchasers among mobile app users, and non-criminals far outnumber criminals in society.
- However, if the class imbalance is too severe (i.e., the training set is highly skewed), it is likely to have undesirable effects.
    - For instance, the algorithm will tend to predict the majority class all the time.
    - This is risky because we may fail to detect the cases we actually care about, such as purchasers among mobile app users or criminals, which are relatively rare among the training instances.

In [134]:
import numpy as np
from sklearn.utils import class_weight
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from collections import Counter
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import adam

## Load Dataset
- Breast cancer dataset in ```sklearn```
- doc: http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html

In [88]:
data = load_breast_cancer()
X_data = data.data.tolist()
y_data = data.target.tolist()

In [89]:
print("Number of malignant instances (0): ", Counter(y_data)[0])
print("Number of benign instances (1): ", Counter(y_data)[1])

Number of malignant instances (0):  212
Number of benign instances (1):  357


In [90]:
# delete the malignant instances among the first 200 rows to artificially create a class-imbalance situation
for i in range(200):
    if y_data[i] == 0:
        X_data[i] = None
        y_data[i] = None

In [91]:
X_data = [x for x in X_data if x is not None]
y_data = [y for y in y_data if y is not None]

In [92]:
print("Number of malignant instances (0): ", Counter(y_data)[0])
print("Number of benign instances (1): ", Counter(y_data)[1])

Number of malignant instances (0):  108
Number of benign instances (1):  357
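
The same subsampling can also be sketched with a NumPy boolean mask instead of marking rows with `None`. This is an alternative formulation, not the notebook's own code; the names `drop`, `X_imb`, and `y_imb` are introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X, y = data.data, data.target

# Mark malignant rows (label 0) that fall within the first 200 instances
drop = (np.arange(len(y)) < 200) & (y == 0)

# Keep everything that is not marked for removal
X_imb, y_imb = X[~drop], y[~drop]

# Class counts after subsampling, indexed by label
print(np.bincount(y_imb))
```

Working on arrays directly avoids the list round trip, so the later `np.asarray` conversion would no longer be needed.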


In [93]:
X_train, X_test, y_train, y_test = train_test_split(np.asarray(X_data), np.asarray(y_data), test_size = 0.2, random_state = 7) 

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(372, 30) (93, 30) (372,) (93,)
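
With a skewed dataset, a plain random split can leave the training and test sets with somewhat different class ratios. One option, not used in this notebook, is the `stratify` argument of `train_test_split`. The sketch below uses a synthetic stand-in with the same class counts as above (108 vs. 357); the feature array is a placeholder.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the same class counts as the subsampled dataset
y = np.array([0] * 108 + [1] * 357)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder features

# stratify=y keeps the class proportions roughly equal in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=7, stratify=y
)

frac_tr = np.mean(y_tr == 0)  # share of class 0 in the training split
frac_te = np.mean(y_te == 0)  # share of class 0 in the test split
print(frac_tr, frac_te)
```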


## Computing class weights
- We compute class weights from the training set and pass them as the ```class_weight``` parameter when fitting

In [94]:
# keyword arguments are required by recent scikit-learn releases
weights = class_weight.compute_class_weight(class_weight = 'balanced', classes = np.unique(y_train), y = y_train)

In [95]:
class_weights = dict(zip(np.unique(y_train), weights))
print("Computed class weights: ", class_weights)

Computed class weights:  {0: 2.2409638554216866, 1: 0.643598615916955}


## Naive Learning

In [149]:
def simple_mlp():
    model = Sequential()
    model.add(Dense(10, input_shape = (X_train.shape[1],), activation = 'relu'))
    model.add(Dense(1, activation = 'sigmoid'))
    model.compile(optimizer = adam(lr = 0.001), loss = 'binary_crossentropy', metrics = ['acc'])
    return model

In [None]:
model = simple_mlp()
model.fit(X_train, y_train, epochs = 100)

In [152]:
y_pred = model.predict(X_test).round()

In [153]:
print("Fraction of predicted 1's: ", y_pred.sum()/len(y_pred))
print("Overall Accuracy Score: ", accuracy_score(y_test, y_pred))

Fraction of predicted 1's:  1.0
Overall Accuracy Score:  0.731182795699
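
Accuracy alone hides what happened here. Since every prediction is 1 and the accuracy is 68/93, the test set must contain 68 benign and 25 malignant instances, and all 25 malignant ones were missed. The sketch below rebuilds that situation from these inferred counts (the arrays are reconstructions, not the notebook's actual `y_test`) and looks at the confusion matrix and the recall of class 0.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Reconstructed test labels: 25 malignant (0) and 68 benign (1), inferred from the output above
y_true = np.array([0] * 25 + [1] * 68)
# The naive model predicted the majority class for every instance
y_hat = np.ones_like(y_true)

acc = accuracy_score(y_true, y_hat)
cm = confusion_matrix(y_true, y_hat, labels=[0, 1])  # rows: true class, columns: predicted class
recall_malignant = recall_score(y_true, y_hat, pos_label=0)

print("Accuracy:", acc)
print("Confusion matrix:\n", cm)
print("Recall for class 0:", recall_malignant)
```

The recall for the malignant class is zero even though the overall accuracy looks acceptable, which is exactly the failure mode described at the top of this notebook.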


## Class-weighted learning

In [None]:
model = simple_mlp()
model.fit(X_train, y_train, epochs = 100, class_weight = class_weights)

In [155]:
y_pred = model.predict(X_test).round()

In [156]:
print("Fraction of predicted 1's: ", y_pred.sum()/len(y_pred))
print("Overall Accuracy Score: ", accuracy_score(y_test, y_pred))

Fraction of predicted 1's:  0.763440860215
Overall Accuracy Score:  0.967741935484
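
Conceptually, passing `class_weight` to `fit` scales each sample's loss contribution by the weight of its class, so mistakes on the rare class cost more. The NumPy sketch below illustrates the idea for binary cross-entropy. It is not Keras's actual implementation (normalization details may differ between versions), and `weighted_bce` is a hypothetical helper introduced here.

```python
import numpy as np

def weighted_bce(y_true, p_pred, class_weights, eps=1e-7):
    """Mean binary cross-entropy with each sample scaled by its class weight."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip probabilities to avoid log(0)
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)
    per_sample = -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
    # Look up the weight of each sample's true class
    w = np.where(y_true == 1, class_weights[1], class_weights[0])
    return float(np.mean(w * per_sample))

# Weights computed earlier in this notebook
class_weights = {0: 2.2409638554216866, 1: 0.643598615916955}

# Toy batch: one malignant sample, three benign; the model leans towards 1 everywhere
y_true = np.array([0, 1, 1, 1])
p_pred = np.array([0.9, 0.9, 0.9, 0.9])

unweighted = weighted_bce(y_true, p_pred, {0: 1.0, 1: 1.0})
weighted = weighted_bce(y_true, p_pred, class_weights)
print(unweighted, weighted)
```

With the class weights applied, the single misclassified malignant sample dominates the loss, which pushes the optimizer away from the "always predict 1" solution.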
