### What is the impact of data imbalance?

In [1]:
from sklearn.linear_model import LogisticRegression

In [2]:
import numpy as np

### Data generation

Each input consists of three numbers drawn independently and uniformly from the interval (0, 1). The output is 1 when the sum of the three numbers is greater than 1.5, and 0 otherwise.

In [3]:
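def get_imbalanced_data():
    """Return 1000 samples: 10 from class 1 and 990 from class 0."""
    data = np.random.random((10000, 3))
    y = (data.sum(axis=1) > 1.5).astype(np.int8)
    x_zeros = data[(y == 0)][:990]
    # The pool is random, so fail loudly if it holds too few class-0 samples.
    assert len(x_zeros) == 990, "retry"
    x_ones = data[(y == 1)][:10]
    x = np.vstack((x_ones, x_zeros))
    # Recompute the labels so that they line up with the stacked rows.
    y = (x.sum(axis=1) > 1.5).astype(int)
    return x, y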

In [4]:
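def get_balanced_data():
    """Return 1000 samples: 500 from class 1 and 500 from class 0."""
    data = np.random.random((10000, 3))
    y = (data.sum(axis=1) > 1.5).astype(np.int8)
    x_zeros = data[(y == 0)][:500]
    # The pool is random, so fail loudly if it holds too few class-0 samples.
    assert len(x_zeros) == 500, "retry"
    x_ones = data[(y == 1)][:500]
    x = np.vstack((x_ones, x_zeros))
    # Recompute the labels so that they line up with the stacked rows.
    y = (x.sum(axis=1) > 1.5).astype(int)
    return x, y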

In [5]:
def train(x, y):
    lr = LogisticRegression(C=10000)  # Large C makes the weight regularization negligible
    lr.fit(x, y)
    return lr

### Train on balanced data

We train a logistic regression model here. Its decision boundary is a plane with coefficients $(c_0, c_1, c_2, c_3)$: for a given input $(x_1, x_2, x_3)$, we classify the input as class 1 if $c_0 + c_1 x_1 + c_2 x_2 + c_3 x_3 > 0$, and as class 0 otherwise. For a good classifier, $E_{x}[c_0 + c_1 x_1 + c_2 x_2 + c_3 x_3] = 0$, because the true boundary $x_1 + x_2 + x_3 = 1.5$ passes through the mean input $(0.5, 0.5, 0.5)$. Since each $x_i$ has mean $0.5$, this expectation equals $c_0 + (c_1 + c_2 + c_3)/2$. We check this value for models trained on both class-balanced and class-imbalanced data.
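As a sanity check on the quantity printed in the cells that follow, here is a minimal sketch comparing the closed form $c_0 + (c_1 + c_2 + c_3)/2$ against a Monte Carlo estimate. The helper names `expected_score` and `monte_carlo_score` are hypothetical, and the coefficients used are those of the true boundary, not of a trained model.

```python
import numpy as np

def expected_score(intercept, coef):
    """Closed form of E_x[c0 + c.x] when every x_i is uniform on (0, 1)."""
    return intercept + np.sum(coef) / 2

def monte_carlo_score(intercept, coef, n=200_000, seed=0):
    """Estimate the same expectation by averaging over random uniform inputs."""
    rng = np.random.default_rng(seed)
    x = rng.random((n, len(coef)))
    return float(np.mean(intercept + x @ coef))

# Coefficients of the true boundary x1 + x2 + x3 = 1.5 (illustrative only).
c0, c = -1.5, np.ones(3)
print(expected_score(c0, c), monte_carlo_score(c0, c))
```

For a fitted model, the same closed form is `lr.intercept_ + lr.coef_.sum() / 2`.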

In [8]:
for _ in range(10):
    x_balanced, y_balanced = get_balanced_data()
    lr_balanced = train(x_balanced, y_balanced)
    print(lr_balanced.intercept_ + lr_balanced.coef_.sum()/2)

[0.58999268]
[0.02556999]
[-0.22096021]
[0.56560189]
[-0.05742626]
[0.31944081]
[-0.1985386]
[-0.27049313]
[-0.75952504]
[-0.36116866]


### Train on imbalanced data

In [9]:
for _ in range(10):
    x_imbalanced, y_imbalanced = get_imbalanced_data()
    lr_imbalanced = train(x_imbalanced, y_imbalanced)
    print(lr_imbalanced.intercept_ + lr_imbalanced.coef_.sum()/2)

[-2.33068237]
[-4.96141314]
[-7.99122521]
[-3.20028506]
[-4.41962578]
[-8.44975192]
[-6.70729705]
[-6.13358125]
[-2.65972415]
[-4.93137753]


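**These results indicate that, because of the class imbalance, the model is more likely to predict class 0: the values for the balanced data scatter around 0, while those for the imbalanced data are all strongly negative.**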
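To make the effect of a negative mean score concrete, the sketch below estimates how often a given plane assigns uniformly drawn inputs to class 1. The helper `fraction_predicted_positive` is hypothetical, and the "shifted" coefficients are an illustrative assumption, not values taken from the runs above.

```python
import numpy as np

def fraction_predicted_positive(intercept, coef, n=200_000, seed=0):
    """Estimate the share of uniform inputs on (0, 1)^3 that land in class 1."""
    rng = np.random.default_rng(seed)
    x = rng.random((n, len(coef)))
    return float(np.mean(intercept + x @ coef > 0))

# True boundary: by symmetry, about half of the inputs fall in class 1.
ideal = fraction_predicted_positive(-1.5, np.ones(3))
# Hypothetical boundary with the intercept pushed down, as imbalance tends to do.
shifted = fraction_predicted_positive(-2.0, np.ones(3))
print(ideal, shifted)
```

Lowering the intercept while keeping the weights fixed moves the plane towards the class 1 corner, so fewer inputs are labelled as class 1.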

One way to overcome this problem is to change the threshold for deciding the class label: choose a threshold $\eta$ and assign an example to class 1 when $c_0 + c_1 x_1 + c_2 x_2 + c_3 x_3 > \eta$. However, $\eta$ is then a new hyperparameter that has to be tuned.
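One possible way to tune $\eta$ is sketched below. It assumes that a labelled validation set is available and that balanced accuracy (the mean of the two per-class recalls) is the selection criterion. Both are assumptions for illustration, and the helper names `balanced_accuracy` and `pick_threshold` are hypothetical.

```python
import numpy as np

def balanced_accuracy(scores, labels, eta):
    """Mean of the per-class recalls when predicting class 1 iff score > eta."""
    pred = scores > eta
    recall_pos = pred[labels == 1].mean()
    recall_neg = (~pred[labels == 0]).mean()
    return (recall_pos + recall_neg) / 2

def pick_threshold(scores, labels):
    """Return the candidate eta with the highest balanced accuracy.

    Assumes both classes are present in `labels`.
    """
    s = np.unique(scores)  # sorted distinct scores
    # Candidates: one value below every score, then midpoints between neighbours.
    candidates = np.concatenate(([s[0] - 1.0], (s[:-1] + s[1:]) / 2))
    values = [balanced_accuracy(scores, labels, eta) for eta in candidates]
    return candidates[int(np.argmax(values))]

# Toy scores c0 + c1*x1 + c2*x2 + c3*x3 that are all negative, as in the imbalanced case.
toy_scores = np.array([-5.0, -4.0, -3.0, -1.0])
toy_labels = np.array([0, 0, 1, 1])
eta = pick_threshold(toy_scores, toy_labels)
print(eta)
```

With a fitted model, the scores could come from `lr.decision_function(x_val)`, where `x_val` stands for held-out inputs that are not part of this notebook.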