# Active Learning IMDB sentiment analysis

The well-known (and clichéd) IMDB dataset contains 50k reviews, each labeled with a binary sentiment (`positive` or `negative`). It is commonly used to train sentiment analysis models in Natural Language Processing. In practice, however, we rarely have the luxury of ready-made labeled data, and labeling is a costly and tedious process. Active learning, sometimes known as "human-in-the-loop" learning, is one of the tools that can reduce the cost of data labeling.

As such, we will take the existing IMDB data and simulate an active learning environment. In our case, we will start with only a limited set of labeled data for initial model development. We will also have a pool of unlabeled data to draw from during the active learning process.

To do this, we will split the existing 50k IMDB reviews as follows:
1. `Initial Training` set: we will allow ourselves **5,000** randomly sampled **labeled examples** of the reviews for the initial development of the model.
2. `Validation` set: we will set aside another **10,000** randomly sampled **labeled examples** of the reviews for validating the model throughout the end-to-end active learning cycle. Ideally, we would refresh the `Validation` set periodically and monitor model performance for drift, but for simplicity we will assume these 10k examples stay fixed from the Initial Training stage to the end of the Active Learning stage.
3. `Unlabeled Pool`: we will use the remaining **35,000** reviews as **unlabeled examples**, from which our active learning model will select examples using `uncertainty sampling`.
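
To make the `uncertainty sampling` step in the list above more concrete, here is a minimal sketch of one common variant, least-confidence sampling. The text does not say which variant this project uses, so treat this as an illustration only. The function name `least_confident_indices` and the probability values are made up for the example.

```python
import numpy as np

def least_confident_indices(probabilities: np.ndarray, k: int) -> np.ndarray:
    """Return the row indices of the k examples the model is least sure about.

    probabilities: array of shape (n_examples, n_classes), one row per example.
    """
    # Confidence is the probability assigned to the top (predicted) class.
    confidence = probabilities.max(axis=1)
    # Sort ascending so the least confident examples come first.
    return np.argsort(confidence, kind="stable")[:k]

# Hypothetical predicted probabilities for four unlabeled reviews.
probs = np.array([
    [0.95, 0.05],
    [0.55, 0.45],
    [0.20, 0.80],
    [0.51, 0.49],
])
query = least_confident_indices(probs, k=2)
print(query)
```

The selected rows are the ones closest to a 50/50 prediction, which are the reviews a human annotator would be asked to label next.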

Let's first check that our environment is set up correctly:

In [5]:
!pip freeze | grep scikit-learn
!pip freeze | grep pandas

scikit-learn==1.1.1
pandas==1.4.3
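
The `pip freeze | grep` cells above rely on shell access from the notebook. A rough Python-only equivalent, using the standard library's `importlib.metadata`, might look like the sketch below. The helper name `installed_versions` is hypothetical and not part of this project.

```python
from importlib.metadata import PackageNotFoundError, version

def installed_versions(packages):
    """Map each package name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # The package is not installed in the current environment.
            found[name] = None
    return found

versions = installed_versions(["scikit-learn", "pandas"])
for name, installed in versions.items():
    print(f"{name}=={installed}")
```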


### Preparing the dataset to simulate active learning environment

In [7]:
import pandas as pd
from pathlib import Path

In [8]:
data_path = Path("./dataset")

In [16]:
df = pd.read_csv(data_path / "IMDB_Dataset.csv")
print("Original Dataset Size:", len(df), "rows")

Original Dataset Size: 50000 rows


We will not spend much time on EDA of the dataset, since this project focuses on the active learning MLOps workflow. We will go ahead and sample the dataset for its different purposes:

In [17]:
# sample 5,000 rows for training
training_set = df.sample(5000)

# remove the sampled rows from the original data
df = df.drop(training_set.index)

# check to make sure the dataset is correctly sampled
print("Training Set:", len(training_set), "rows")
print("Original Dataset:", len(df), "rows")

Training Set: 5000 rows
Original Dataset: 45000 rows
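
Note that `df.sample(5000)` is called without a seed, so rerunning the cell draws different rows each time. If reproducibility matters, one option is to pass `random_state`. The sketch below shows the same sample-then-drop pattern on a small toy frame. The helper `split_off`, the seed value, and the `review` column name are assumptions for illustration, not taken from the notebook.

```python
import pandas as pd

SEED = 42  # assumed value; any fixed integer gives repeatable draws

def split_off(frame: pd.DataFrame, n: int, seed: int):
    """Sample n rows and return (sampled, remainder) with no overlap."""
    sampled = frame.sample(n=n, random_state=seed)
    # Dropping by index label relies on the index being unique.
    remainder = frame.drop(sampled.index)
    return sampled, remainder

# Toy stand-in for the IMDB frame; the "review" column name is assumed.
toy = pd.DataFrame({
    "review": [f"review {i}" for i in range(50)],
    "sentiment": ["positive", "negative"] * 25,
})

train_a, rest_a = split_off(toy, 5, SEED)
train_b, rest_b = split_off(toy, 5, SEED)
print(len(train_a), len(rest_a))
```

With the same seed, both calls select the same rows, so the saved CSV files can be regenerated exactly.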


The class distribution (`positive` vs `negative`) of the `training_set` is well balanced:

In [28]:
training_set["sentiment"].value_counts()

negative    2511
positive    2489
Name: sentiment, dtype: int64
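
Plain random sampling happens to give a near-even split here because the source data is balanced. If an exactly even split were required, one alternative (not what this notebook does) is to sample per class with `groupby(...).sample(...)`, available in pandas 1.1 and later. The toy frame and its `review` column are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for the IMDB frame; the "review" column name is assumed.
toy = pd.DataFrame({
    "review": [f"review {i}" for i in range(100)],
    "sentiment": ["positive"] * 50 + ["negative"] * 50,
})

# Draw the same number of rows from each sentiment class.
balanced = toy.groupby("sentiment").sample(n=10, random_state=0)
counts = balanced["sentiment"].value_counts()
print(counts)
```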

In [21]:
# save to file
training_set.to_csv(data_path / "training.csv")

In [18]:
# sample 10,000 rows for validation
validation_set = df.sample(10000)

# remove the sampled rows from the original data
df = df.drop(validation_set.index)

# check to make sure the dataset is correctly sampled
print("Validation Set:", len(validation_set), "rows")
print("Original Dataset:", len(df), "rows")

Validation Set: 10000 rows
Original Dataset: 35000 rows


The class distribution (`positive` vs `negative`) of the `validation_set` is also well balanced:

In [26]:
validation_set["sentiment"].value_counts()

negative    5040
positive    4960
Name: sentiment, dtype: int64

In [23]:
# save to file
validation_set.to_csv(data_path / "validation.csv")

In [20]:
# use the remaining 35,000 examples as unlabeled pool of dataset
unlabeled_pool = df.copy()

# remove labels
unlabeled_pool = unlabeled_pool.drop(columns="sentiment")

# check the shape
print("Shape of Unlabeled Pool:", unlabeled_pool.shape)


Shape of Unlabeled Pool: (35000, 1)
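
Since each split is carved out by dropping previously sampled rows, the three sets should be disjoint and together cover every original row. A small sanity check along those lines is sketched below on toy data. The helper `is_clean_partition` and the `review` column name are hypothetical.

```python
import pandas as pd

def is_clean_partition(total_rows: int, *parts: pd.DataFrame) -> bool:
    """True if the parts share no index labels and together cover total_rows."""
    labels = [label for part in parts for label in part.index]
    return len(labels) == total_rows and len(set(labels)) == total_rows

# Toy stand-in for the IMDB frame; the "review" column name is assumed.
toy = pd.DataFrame({
    "review": [f"review {i}" for i in range(20)],
    "sentiment": ["positive", "negative"] * 10,
})

train = toy.sample(n=4, random_state=0)
rest = toy.drop(train.index)
valid = rest.sample(n=6, random_state=0)
pool = rest.drop(valid.index).drop(columns="sentiment")

ok = is_clean_partition(len(toy), train, valid, pool)
# Passing `rest` instead of `pool` double-counts the validation rows.
leaky = is_clean_partition(len(toy), train, valid, rest)
print(ok, leaky)
```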


The class distribution (`positive` vs `negative`) of the remaining rows in the original `df` is also well balanced, which means that our `unlabeled_pool` is balanced too:

In [29]:
df["sentiment"].value_counts()

positive    17551
negative    17449
Name: sentiment, dtype: int64

In [24]:
# save to file
unlabeled_pool.to_csv(data_path / "unlabeled.csv")
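
One detail worth knowing when these files are read back: `to_csv` writes the DataFrame index as an unnamed first column by default, so the original row labels are preserved, but a plain `read_csv` will load them as an `Unnamed: 0` column. The sketch below demonstrates this on a toy frame in a temporary directory. The data and the `review` column name are assumptions for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame with non-contiguous index labels, as left behind by sampling.
toy = pd.DataFrame(
    {"review": ["good film", "dull plot", "great cast"]},
    index=[7, 21, 42],
)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "unlabeled.csv"
    toy.to_csv(path)  # index is written as an unnamed first column

    naive = pd.read_csv(path)                   # index becomes a data column
    restored = pd.read_csv(path, index_col=0)   # index is restored

naive_columns = list(naive.columns)
restored_index = list(restored.index)
print(naive_columns, restored_index)
```

Reading with `index_col=0` keeps the original row labels, which can help trace a sampled review back to its position in the source file.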