# Lecture 6 - Adjusting class imbalance

Load the COVID data set

In [29]:
df = epi7913A::oncovid

set.seed(171)

head(df)

| | case_status | age_group | gender | date_reported | exposure | health_region |
|---|---|---|---|---|---|---|
| | `<dbl>` | `<fct>` | `<fct>` | `<dbl>` | `<fct>` | `<fct>` |
| 1 | 0 | 40-49 | Female | 285.4583 | Close Contact | York Region Public Health Services |
| 2 | 0 | <20 | Male | 297.4583 | Close Contact | York Region Public Health Services |
| 3 | 0 | 50-59 | Male | 274.4583 | Not Reported | Peel Public Health |
| 4 | 0 | 20-29 | Female | 260.4583 | Close Contact | Halton Region Health Department |
| 5 | 0 | 30-39 | Female | 307.5 | Not Reported | Wellington-Dufferin-Guelph Public Health |
| 6 | 0 | 40-49 | Female | 306.5 | Close Contact | Halton Region Health Department |


## Check the class balance (outcome variable is *case_status*)

In [2]:
table(df$case_status)


    0     1 
87494  3486 
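
To put the imbalance into numbers, the counts above can be turned into class proportions and a majority-to-minority ratio. This is a minimal sketch that copies the two counts from the `table()` output instead of loading the course data set; the variable names are illustrative only.

```r
# Class counts copied from the table() output above
counts <- c("0" = 87494, "1" = 3486)

# Share of each class in the full data set
props <- counts / sum(counts)
round(props, 3)

# Majority-to-minority ratio (non-cases per case)
imbalance_ratio <- unname(counts["0"] / counts["1"])
round(imbalance_ratio, 1)
```

Cases make up a little under 4% of the records, which is why a model could reach high accuracy by predicting class 0 for nearly everything.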

## Sampling the data for the purpose of balancing
### We resample the data so that each outcome class (here, case_status 0 or 1) ends up with an equal, or nearly equal, number of points.


## The balancing can be done by:
- over-sampling,
- under-sampling, or
- both.


## 1. Balancing by over-sampling:
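### The idea is to draw many more data points from the minority class (with replacement) until it is about the same size as the majority class. The argument p = 0.5 is the probability of drawing from the minority class, so the resulting class counts are close to equal but not identical.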

In [3]:
# obtain a balanced sample of the data by "over" sampling the minority class
df_overBalanced <- ROSE::ovun.sample(case_status ~., data=df, p=0.5, seed = 11, method = "over")$data

# look at the new class ratio
cat("The class ratio of the new sample:")
table(df_overBalanced$case_status)

The class ratio of the new sample:


    0     1 
87494 87647 
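
The mechanics of random over-sampling can be sketched in base R on a small toy data frame. This is an illustration of the idea, not the internals of `ROSE::ovun.sample`; the toy data and all variable names are hypothetical. Unlike the output above, where the counts come out close but not identical, this sketch matches the class sizes exactly.

```r
set.seed(11)

# Toy imbalanced data: 95 non-cases and 5 cases (hypothetical values)
toy <- data.frame(
  case_status = c(rep(0, 95), rep(1, 5)),
  x = rnorm(100)
)

majority <- toy[toy$case_status == 0, ]
minority <- toy[toy$case_status == 1, ]

# Draw minority rows WITH replacement until the class matches the majority size
idx <- sample(nrow(minority), size = nrow(majority), replace = TRUE)
toy_over <- rbind(majority, minority[idx, ])

table(toy_over$case_status)
```

Because only 5 distinct minority rows exist, the over-sampled class consists of repeated copies of those rows; no new information is created.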

## 2. Balancing by under-sampling:

### The idea is to keep only a random subset of the majority class so that it shrinks to about the same size as the minority class.

In [4]:
# obtain a balanced sample of the data by "under" sampling the majority class
df_underBalanced <- ROSE::ovun.sample(case_status ~., data=df, p=0.5, seed = 11, method = "under")$data

# look at the new class ratio
cat("The class ratio of the new sample:")
table(df_underBalanced$case_status)

The class ratio of the new sample:


   0    1 
3513 3486 
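
Random under-sampling can be sketched the same way in base R. Again this is a toy illustration with hypothetical data and names, not the internals of `ROSE::ovun.sample`, and it matches the class sizes exactly instead of approximately.

```r
set.seed(11)

# Toy imbalanced data: 95 non-cases and 5 cases (hypothetical values)
toy <- data.frame(
  case_status = c(rep(0, 95), rep(1, 5)),
  x = rnorm(100)
)

majority <- toy[toy$case_status == 0, ]
minority <- toy[toy$case_status == 1, ]

# Keep a random subset of majority rows, WITHOUT replacement
idx <- sample(nrow(majority), size = nrow(minority), replace = FALSE)
toy_under <- rbind(majority[idx, ], minority)

table(toy_under$case_status)
```

The price of under-sampling is discarded data: here 90 of the 95 majority rows are thrown away.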

## 3. Balancing by both (over- and under-sampling):
### The idea is to sample fewer data points from the majority class and more data points from the minority class at the same time. The result is a balanced data set whose total size is about the same as that of the original data set.

In [5]:
# obtain a balanced sample of the data by "both" over and under sampling the minority
# and the majority classes respectively
df_bothBalanced <- ROSE::ovun.sample(case_status ~., data=df, p=0.5, seed = 11, method = "both")$data

# look at the new class ratio
cat("The class ratio of the new sample:")
table(df_bothBalanced$case_status)

The class ratio of the new sample:


    0     1 
45364 45616 
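
The output above keeps the total at the original 90,980 rows (45364 + 45616) while splitting them roughly evenly. One plausible way to get that behaviour is sketched below in base R: fix the total size, let the minority share be random with probability `p`, then over-sample one class and under-sample the other. This is an assumption about the general approach, not a claim about how `ROSE::ovun.sample` is implemented; the toy data and names are hypothetical.

```r
set.seed(11)

# Toy imbalanced data: 95 non-cases and 5 cases (hypothetical values)
toy <- data.frame(
  case_status = c(rep(0, 95), rep(1, 5)),
  x = rnorm(100)
)

n_total <- nrow(toy)  # keep the original sample size
p <- 0.5              # target share of the minority class

# The number of minority rows in the new sample is random
n_min <- rbinom(1, size = n_total, prob = p)
n_maj <- n_total - n_min

# Over-sample the minority class (with replacement) and
# under-sample the majority class (without replacement)
min_idx <- sample(which(toy$case_status == 1), n_min, replace = TRUE)
maj_idx <- sample(which(toy$case_status == 0), n_maj, replace = FALSE)

toy_both <- toy[c(maj_idx, min_idx), ]
table(toy_both$case_status)
```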

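## 4. Balancing by SMOTE:
SMOTE (Synthetic Minority Over-sampling Technique) creates new minority class observations by interpolating between an existing minority point and one of its nearest minority neighbours, so the synthetic values stay within the range of the existing variables. This implementation expects all of the predictors to be numeric. Note that converting a factor with `as.numeric()` replaces each level with its integer code, which imposes an artificial ordering on nominal variables such as health_region.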

In [26]:
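# themis::smote() needs a factor outcome and numeric predictors
df$case_status <- as.factor(df$case_status)

# as.numeric() on a factor returns its integer level codes
df$age_group <- as.numeric(df$age_group)
df$gender <- as.numeric(df$gender)
df$exposure <- as.numeric(df$exposure)
df$health_region <- as.numeric(df$health_region)

df_smote <- themis::smote(df, var = "case_status")
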
# look at the new class ratio
cat("The class ratio of the new sample:")
table(df_smote$case_status)

The class ratio of the new sample:


    0     1 
87494 87494 
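
The interpolation step at the heart of SMOTE can be sketched in a few lines of base R. This is a simplified illustration, not the `themis::smote` implementation: the minority points are simulated, and the helper `smote_one()` and the choice `k = 3` are hypothetical.

```r
set.seed(11)

# Hypothetical minority-class points with two numeric features
minority <- matrix(rnorm(20), ncol = 2, dimnames = list(NULL, c("x1", "x2")))
k <- 3  # number of nearest neighbours to consider (assumed value)

smote_one <- function(X, k) {
  i <- sample(nrow(X), 1)                 # pick a minority point at random
  d <- sqrt(colSums((t(X) - X[i, ])^2))   # Euclidean distance to every point
  nn <- order(d)[2:(k + 1)]               # k nearest neighbours, skipping the point itself
  j <- nn[sample(length(nn), 1)]          # choose one of those neighbours
  gap <- runif(1)                         # random position along the connecting segment
  X[i, ] + gap * (X[j, ] - X[i, ])        # synthetic point between the two
}

# Generate five synthetic minority observations
synthetic <- t(replicate(5, smote_one(minority, k)))
synthetic
```

Each synthetic point lies on the line segment between two real minority points, which is why SMOTE never produces values outside the observed range of a variable.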

## **Important Note:**

### Data balancing is a form of data manipulation and should be applied only to the training data, so that the hold-out data used for testing keeps the original class distribution. In other words, the test set stays blind to how the training data was manipulated.
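
The order of operations can be sketched as follows: split first, then balance the training portion only. This is a minimal base R illustration on hypothetical toy data with a stratified 70/30 split and simple over-sampling; in the lecture's setting any of the balancing methods above could take the place of the over-sampling step.

```r
set.seed(171)

# Toy imbalanced data: 80 non-cases and 20 cases (hypothetical values)
toy <- data.frame(
  case_status = c(rep(0, 80), rep(1, 20)),
  x = rnorm(100)
)

# Hold out 30% of each class for testing BEFORE any balancing
test_idx <- unlist(lapply(
  split(seq_len(nrow(toy)), toy$case_status),
  function(i) sample(i, size = round(0.3 * length(i)))
))
train <- toy[-test_idx, ]
test  <- toy[test_idx, ]

# Balance the training set only (over-sample the minority class)
train_maj <- train[train$case_status == 0, ]
train_min <- train[train$case_status == 1, ]
boost_idx <- sample(nrow(train_min), size = nrow(train_maj), replace = TRUE)
train_bal <- rbind(train_maj, train_min[boost_idx, ])

table(train_bal$case_status)  # balanced
table(test$case_status)       # still imbalanced, as in the original data
```

Balancing before the split would let copies of the same minority rows land in both the training and the test set, which would make the test results look better than they should.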