## Handling Unbalanced Datasets in Deep Learning with Class Weights

#### What is an unbalanced dataset?

When the number of samples per class differs across the classes of a dataset, the dataset is called unbalanced.

For example, consider an image dataset with two classes, A and B.
The number of samples in each class is as follows:

| Class Name | Number of Samples |
|------------------|---------------------------|
| Class A | 25 |
| Class B | 75 |

If you are already familiar with machine learning and deep learning, you will recognize that a model trained on this dataset is unlikely to generalize well. It will tend to be biased toward class B.

#### How can we solve this bias problem?

There are two ways to solve this problem:

1. Increase the number of samples in class A, so that both classes have roughly the same number of data points
2. Explicitly tell the model during training to put more focus on class A while optimizing the loss
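
The first option can be sketched as random oversampling, where existing minority-class samples are duplicated until the classes are the same size. This is only a minimal illustration: the helper name `oversample_minority` and the toy data are assumptions made for this example, and in practice new samples are often collected or generated by augmentation rather than copied.

```python
import random
from collections import Counter

def oversample_minority(samples, labels, seed=0):
    """Randomly duplicate samples until every class matches the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    majority = max(counts.values())

    out_samples, out_labels = list(samples), list(labels)
    for cls, count in counts.items():
        # Indices of the samples that belong to this class
        pool = [i for i, label in enumerate(labels) if label == cls]
        # Draw (with replacement) as many extra samples as the class is missing
        for i in rng.choices(pool, k=majority - count):
            out_samples.append(samples[i])
            out_labels.append(cls)
    return out_samples, out_labels

# Toy data mirroring the table: 25 samples of class A, 75 of class B
labels = ["A"] * 25 + ["B"] * 75
samples = list(range(len(labels)))  # integer stand-ins for images

balanced_samples, balanced_labels = oversample_minority(samples, labels)
print(Counter(balanced_labels))
```

The rest of this article focuses on the second option, which leaves the data untouched and changes how much each sample counts in the loss.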

### Class weight

Class B has 3 times as many samples as class A, so during loss optimization class B contributes roughly 3 times as much to the loss and gets 3 times the focus from the model.

So, with the current amount of data, the classes effectively carry the following weights:

| Class Name | Current Weight |
|-----------------|----------------------|
| Class A | 1 |
| Class B | 3 |

We can explicitly tell the model to put more focus on class A by giving it a larger weight. How much weight to add depends on the current weights: the more a class is outnumbered, the more weight it needs.
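
To make "more focus" concrete, here is a minimal sketch of a class-weighted cross-entropy loss, where each sample's loss is multiplied by the weight of its true class. The function name, the predicted probabilities, and the weight of 3 for class A are illustrative assumptions, and real frameworks may normalize the weighted sum differently.

```python
import math

def weighted_cross_entropy(probs, labels, class_weights):
    """Mean negative log-likelihood, each sample scaled by its class weight."""
    total = 0.0
    for p, label in zip(probs, labels):
        # -log(probability given to the true class), scaled by the class weight
        total += class_weights[label] * -math.log(p[label])
    return total / len(labels)

# Hypothetical model outputs: probability assigned to each class per sample
probs = [{"A": 0.6, "B": 0.4}, {"A": 0.3, "B": 0.7}]
labels = ["A", "B"]

unweighted = weighted_cross_entropy(probs, labels, {"A": 1.0, "B": 1.0})
weighted = weighted_cross_entropy(probs, labels, {"A": 3.0, "B": 1.0})

# The error on the class A sample now counts three times as much
print(unweighted, weighted)
```

With the larger weight, a mistake on a class A sample raises the loss more than the same mistake on a class B sample, so the optimizer is pushed harder to get class A right.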

### Class weight calculation

So, how can we calculate the class weight?

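```python
A = 25  # number of samples in class A
B = 75  # number of samples in class B

# Size of the largest class
majority = max(A, B)

# Each class is weighted by how much smaller it is than the largest class
A_weight = majority / A  # 3.0
B_weight = majority / B  # 1.0
```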

Let's define a simple helper that computes the class weights from a list of labels:

```python
from collections import Counter

def get_class_weights(y):
    counter = Counter(y)
    # Number of occurrences of the most frequent class
    majority = max(counter.values())
    return {cls: majority / count for cls, count in counter.items()}
```

```python
# Ground-truth class labels; how often a label appears gives the size of its class
y = [1, 1, 0, 0, 0, 2, 3, 1, 2, 5]
weights = get_class_weights(y)
weights
```

```
{1: 1.0, 0: 1.0, 2: 1.5, 3: 3.0, 5: 3.0}
```
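
For comparison, the scikit-learn method listed in the references documents a "balanced" mode computed as `n_samples / (n_classes * np.bincount(y))`. The sketch below re-implements that formula in plain Python so it runs without scikit-learn; the function name `balanced_class_weights` is an assumption made for this example and is not part of any library.

```python
from collections import Counter

def balanced_class_weights(y):
    """Weights from the formula n_samples / (n_classes * count_of_class)."""
    counter = Counter(y)
    n_samples, n_classes = len(y), len(counter)
    return {cls: n_samples / (n_classes * count) for cls, count in counter.items()}

y = [1, 1, 0, 0, 0, 2, 3, 1, 2, 5]
balanced = balanced_class_weights(y)
print(balanced)

# The ratio between any two classes matches the majority-based weights:
# classes 3 and 1 differ by a factor of 3 under both schemes
print(balanced[3] / balanced[1])
```

Both schemes give each class a weight inversely proportional to its sample count, so they differ only by a constant factor that rescales the overall loss.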

Now we can pass this dictionary of class weights as the `class_weight` parameter of the Keras `fit` method while training our model.
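
A minimal sketch of that call is shown below. The wrapper `fit_with_class_weights` is a hypothetical helper written for this example, `model` is assumed to be an already compiled Keras model, and `y_train` is assumed to contain integer class labels rather than one-hot vectors.

```python
from collections import Counter

def get_class_weights(y):
    counter = Counter(y)
    majority = max(counter.values())
    return {cls: majority / count for cls, count in counter.items()}

def fit_with_class_weights(model, x_train, y_train, **fit_kwargs):
    """Train `model`, weighting each class by majority_count / class_count.

    Assumes `model` exposes a Keras-style fit() accepting `class_weight`.
    """
    # Convert labels to plain ints so the dictionary keys are class indices
    weights = get_class_weights([int(label) for label in y_train])
    return model.fit(x_train, y_train, class_weight=weights, **fit_kwargs)

# Example call, assuming `model`, `x_train` and `y_train` already exist:
# history = fit_with_class_weights(model, x_train, y_train, epochs=5)
```

Depending on the Keras version, the dictionary keys may need to be contiguous integer class indices starting at 0, so labels such as `0, 1, 2, 3, 5` may have to be remapped first.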

I ran an experiment with the well-known MNIST dataset, and training with class weights appears to give a significant gain in model accuracy.

See the following Jupyter notebooks for a detailed explanation of the experiment and its results:

[Explanation of Class weight and analyzing MNIST data]()

[Training models with and without class weight and evaluate performance]()

### References

[scikit-learn class weight computation method](https://scikit-learn.org/stable/modules/generated/sklearn.utils.class_weight.compute_class_weight.html)