# Active Learning IMDB sentiment analysis

The well-known, if clichéd, IMDB dataset contains 50k movie reviews labeled with binary sentiment (`positive` or `negative`). It is commonly used to train sentiment analysis models in Natural Language Processing. In reality, however, we rarely have the luxury of ready-made labeled data, and labeling is a costly and tedious process. Active learning, sometimes known as "human-in-the-loop" learning, is one of the tools that can effectively reduce the cost of data labeling.

As such, we will take the existing IMDB data and simulate an active learning environment. In our case, we will have only a limited set of labeled data to start model development. We will also have a pool of unlabeled data, which we will draw from during the active learning process.

To do this, we will split the existing 50,000 IMDB reviews as follows:
1. `Initial Training` set: we will allow ourselves **5,000** randomly sampled **labeled examples** of the reviews for the initial development of the model.
2. `Validation` set: we will set aside another **2,000** randomly sampled **labeled examples** to validate the model throughout the end-to-end active learning cycle. Ideally, we would refresh the `Validation` set periodically and monitor model performance for drift, but for simplicity we will assume that these 2,000 sampled examples stay fixed from the Initial Training stage to the end of the Active Learning stage.
3. `Unlabeled Pool`: we will use the remaining **43,000** examples, with their labels removed, as the `Unlabeled Pool`, from which our active learning model will select examples using `uncertainty sampling`.
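
To make `uncertainty sampling` concrete, here is a minimal sketch of one common variant, least-confidence sampling. The notebook does not say which variant it uses, so the function name `least_confident_indices` and the probability values below are assumptions for illustration only.

```python
import numpy as np


def least_confident_indices(probabilities: np.ndarray, n_queries: int) -> np.ndarray:
    """Return the row indices of the n_queries least confident predictions."""
    # Confidence is the probability assigned to the predicted (most likely) class
    confidence = probabilities.max(axis=1)
    # Lowest confidence first; a stable sort keeps ties in their original order
    return np.argsort(confidence, kind="stable")[:n_queries]


# Hypothetical predicted probabilities [P(negative), P(positive)] for four pool reviews
probs = np.array([
    [0.95, 0.05],
    [0.52, 0.48],
    [0.10, 0.90],
    [0.40, 0.60],
])

query_idx = least_confident_indices(probs, n_queries=2)
print(query_idx)  # rows the model is least sure about
```

In an active learning loop, the reviews at these indices would be the ones sent to a human annotator first.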

Let's first check that our environment is set up correctly:

In [1]:
!pip freeze | grep scikit-learn
!pip freeze | grep pandas

scikit-learn==1.1.1
pandas==1.4.3
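
The same check can also be done from Python, which may be handy where shell commands are unavailable. This is a small sketch using the standard-library `importlib.metadata`; the helper name `installed_version` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


for name in ("scikit-learn", "pandas"):
    print(name, installed_version(name))
```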


In [17]:
import pandas as pd
from pathlib import Path
import sqlite3

In [20]:
data_path = Path("./dataset")

### Setting up an SQLite database

In [48]:
conn = sqlite3.connect(data_path / 'dataset.db')
c = conn.cursor()

In [49]:
# create new tables

# training table
c.execute('''CREATE TABLE IF NOT EXISTS training (id int, review text, sentiment text)''')

# validation table
c.execute('''CREATE TABLE IF NOT EXISTS validation (id int, review text, sentiment text)''')

# unlabeled pool table
c.execute('''CREATE TABLE IF NOT EXISTS  unlabeled_pool (id int, review text)''')


<sqlite3.Cursor at 0x7fc3a17ba500>
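
To confirm that the tables were created with the expected columns, we could query SQLite's catalog. The sketch below repeats the same `CREATE TABLE` statements against an in-memory database (the name `conn_demo` is an assumption) so that it does not touch `dataset.db`.

```python
import sqlite3

# In-memory database used only for this illustration
conn_demo = sqlite3.connect(":memory:")
cur = conn_demo.cursor()

# Same schema as the tables created above
cur.execute("CREATE TABLE IF NOT EXISTS training (id int, review text, sentiment text)")
cur.execute("CREATE TABLE IF NOT EXISTS validation (id int, review text, sentiment text)")
cur.execute("CREATE TABLE IF NOT EXISTS unlabeled_pool (id int, review text)")
conn_demo.commit()

# sqlite_master lists every table in the database
cur.execute("SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")
tables = [row[0] for row in cur.fetchall()]
print(tables)

# PRAGMA table_info returns one row per column; the name is the second field
cur.execute("PRAGMA table_info(unlabeled_pool)")
columns = [row[1] for row in cur.fetchall()]
print(columns)
```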

### Preparing the dataset to simulate active learning environment

In [10]:
df = pd.read_csv(data_path / "IMDB_Dataset.csv")
print("Original Dataset Size:", len(df), "rows")

Original Dataset Size: 50000 rows


In [11]:
df["sentiment"].value_counts()

positive    25000
negative    25000
Name: sentiment, dtype: int64

We will not spend much time on EDA of the dataset, as this project focuses on the active learning MLOps workflow. Instead, we will go ahead and sample the dataset into the three sets described above.

### Training Dataset

In [13]:
# sample 5,000 rows for training
training_set = df.sample(5000, random_state=158)

# remove the original data 
df = df.drop(training_set.index)

# check to make sure the dataset is correctly sampled
print("Training Set:", len(training_set), "rows")
print("Original Dataset:", len(df), "rows")

Training Set: 5000 rows
Original Dataset: 45000 rows
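
The sample-then-drop pattern used here is repeated for the validation set, so it could be wrapped in a small helper. This is a sketch on a toy DataFrame; `sample_and_remove` and the toy data are hypothetical, and the approach assumes the DataFrame index is unique (as it is for the default index produced by `pd.read_csv`).

```python
import pandas as pd


def sample_and_remove(frame: pd.DataFrame, n: int, seed: int):
    """Sample n rows and return (sampled rows, remaining rows)."""
    sampled = frame.sample(n, random_state=seed)
    # Dropping by index label only works as intended when the index is unique
    remaining = frame.drop(sampled.index)
    return sampled, remaining


# Toy stand-in for the IMDB DataFrame
toy = pd.DataFrame({
    "review": [f"review {i}" for i in range(10)],
    "sentiment": ["positive", "negative"] * 5,
})

train_demo, rest_demo = sample_and_remove(toy, n=3, seed=158)
val_demo, pool_demo = sample_and_remove(rest_demo, n=2, seed=158)
print(len(train_demo), len(val_demo), len(pool_demo))
```

Because each sample is removed before the next one is drawn, the three resulting sets cannot share rows.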


In [15]:
training_set.head(10)

| | review | sentiment |
|---|---|---|
| 24708 | Okay. Who was it? Who gave Revolver 10 out of ... | negative |
| 1946 | One of the best 'guy' movies I've ever seen ha... | positive |
| 24381 | Wow...sheer brilliance.<br /><br />Turning a t... | negative |
| 46271 | It's wonderful to see that Shane Meadows is al... | positive |
| 8995 | This is a pretty obscure, dumb horror movie se... | positive |
| 46380 | OSS 117 was fun from start to finish.<br /><br... | positive |
| 38049 | It's nice that these three young directors hav... | negative |
| 24573 | That's what I thought, when I heard about the ... | positive |
| 30594 | Watch the 1936 version. As personally annoying... | negative |
| 549 | Another Spanish movie about the 1936 Civil War... | positive |


We can see that the classes (`positive` vs `negative`) in the `training_set` are well balanced:

In [16]:
training_set["sentiment"].value_counts()

negative    2518
positive    2482
Name: sentiment, dtype: int64
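
Balance is sometimes easier to judge as proportions than as raw counts. The sketch below rebuilds a Series with the same class counts as the output above (the variable `labels_demo` is an assumption) and uses `value_counts(normalize=True)`.

```python
import pandas as pd

# Stand-in Series with the same class counts as the training set above
labels_demo = pd.Series(["negative"] * 2518 + ["positive"] * 2482, name="sentiment")

# normalize=True returns the fraction of rows per class instead of the count
proportions = labels_demo.value_counts(normalize=True)
print(proportions)
```

With 2,518 negative and 2,482 positive reviews, the split is about 50.4% to 49.6%.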

In [23]:
# save to file
training_set.to_csv(data_path / "training.csv")

In [50]:
# export to sql table 'training'
training_set.reset_index().rename(columns={"index": "id"}).to_sql('training', conn, if_exists='append', index=False)

5000
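
One thing to watch with `if_exists='append'`: re-running the export cell inserts the same rows again. The sketch below shows the effect on an in-memory database with toy rows (`conn_demo` and `rows` are assumptions) and contrasts it with `if_exists='replace'`, which drops and recreates the table using column types inferred by pandas.

```python
import sqlite3

import pandas as pd

conn_demo = sqlite3.connect(":memory:")
rows = pd.DataFrame({
    "id": [1, 2],
    "review": ["good", "bad"],
    "sentiment": ["positive", "negative"],
})

# Running the same append twice duplicates every row
rows.to_sql("training", conn_demo, if_exists="append", index=False)
rows.to_sql("training", conn_demo, if_exists="append", index=False)
count_append = int(pd.read_sql("SELECT COUNT(*) AS n FROM training", conn_demo)["n"].iloc[0])

# replace drops the existing table first, so a re-run leaves only one copy
rows.to_sql("training", conn_demo, if_exists="replace", index=False)
count_replace = int(pd.read_sql("SELECT COUNT(*) AS n FROM training", conn_demo)["n"].iloc[0])

print(count_append, count_replace)
```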

### Validation Set

In [53]:
# sample 2,000 rows for validation
validation_set = df.sample(2000, random_state=158)

# remove the original data 
df = df.drop(validation_set.index)

# check to make sure the dataset is correctly sampled
print("Validation Set:", len(validation_set), "rows")
print("Original Dataset:", len(df), "rows")

Validation Set: 2000 rows
Original Dataset: 43000 rows


We can see that the classes (`positive` vs `negative`) in the `validation_set` are also well balanced:

In [54]:
validation_set["sentiment"].value_counts()

positive    1001
negative     999
Name: sentiment, dtype: int64

In [55]:
# save to file
validation_set.to_csv(data_path / "validation.csv", index_label=False)

In [56]:
# export to sql table 'validation'
validation_set.reset_index().rename(columns={"index": "id"}).to_sql('validation', conn, if_exists='append', index=False)

2000

In [66]:
pd.read_sql('''SELECT * FROM validation''', conn)

| | id | review | sentiment |
|---|---|---|---|
| 0 | 46788 | Andy Goldsworthy is a taoist master of the fir... | positive |
| 1 | 31303 | I would not consider myself as one of Leonard ... | negative |
| 2 | 44672 | Nothing but the director's juvenile fantasy co... | negative |
| 3 | 30408 | Letting the class watch this in English was a ... | negative |
| 4 | 3400 | This is high grade cheese fare of B movie kung... | negative |
| ... | ... | ... | ... |
| 1995 | 22743 | When John Wayne filmed his Alamo story he had ... | positive |
| 1996 | 30884 | Being a middle aged mom myself, I very much ap... | positive |
| 1997 | 28231 | One True Thing may have seemed like a horror m... | negative |
| 1998 | 24033 | Made it through the first half an hour and des... | negative |


### Unlabeled Pool

In [57]:
# use the remaining 43,000 examples as unlabeled pool of dataset
unlabeled_pool = df.copy()

# remove labels
unlabeled_pool = unlabeled_pool.drop(columns="sentiment")

# check the shape
print("Shape of Unlabeled Pool:", unlabeled_pool.shape)


Shape of Unlabeled Pool: (43000, 1)


We can see that the classes (`positive` vs `negative`) in the remainder of the original `df` are also well balanced, which means that our `unlabeled_pool` is balanced as well:

In [58]:
df["sentiment"].value_counts()

positive    21517
negative    21483
Name: sentiment, dtype: int64

In [61]:
# save to file
unlabeled_pool.to_csv(data_path / "unlabeled.csv",  index_label=False)

In [62]:
# export to sql table 'unlabeled_pool'
unlabeled_pool.reset_index().rename(columns={"index": "id"}).to_sql('unlabeled_pool', conn, if_exists='append', index=False)

43000

In [65]:
pd.read_sql('''SELECT * FROM unlabeled_pool''', conn)

| | id | review |
|---|---|---|
| 0 | 0 | One of the other reviewers has mentioned that ... |
| 1 | 1 | A wonderful little production. <br /><br />The... |
| 2 | 2 | I thought this was a wonderful way to spend ti... |
| 3 | 3 | Basically there's a family where a little boy ... |
| 4 | 6 | I sure would like to see a resurrection of a u... |
| ... | ... | ... |
| 42995 | 49995 | I thought this movie did a down right good job... |
| 42996 | 49996 | Bad plot, bad dialogue, bad acting, idiotic di... |
| 42997 | 49997 | I am a Catholic taught in parochial elementary... |
| 42998 | 49998 | I'm going to have to disagree with the previou... |
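
As a final sanity check, we might verify that no review `id` appears in more than one table. The sketch below does this with SQL `INTERSECT` on an in-memory database populated with toy rows; `conn_demo`, the toy data, and the helper `overlapping_ids` are assumptions for illustration.

```python
import sqlite3

conn_demo = sqlite3.connect(":memory:")
cur = conn_demo.cursor()

# Same schema as the notebook's tables, filled with toy rows
cur.execute("CREATE TABLE training (id int, review text, sentiment text)")
cur.execute("CREATE TABLE validation (id int, review text, sentiment text)")
cur.execute("CREATE TABLE unlabeled_pool (id int, review text)")
cur.executemany("INSERT INTO training VALUES (?, ?, ?)",
                [(0, "a", "positive"), (1, "b", "negative")])
cur.executemany("INSERT INTO validation VALUES (?, ?, ?)", [(2, "c", "positive")])
cur.executemany("INSERT INTO unlabeled_pool VALUES (?, ?)", [(3, "d"), (4, "e")])


def overlapping_ids(cursor, left, right):
    """Return ids present in both tables."""
    # Table names are interpolated directly, so only pass trusted, hard-coded names
    cursor.execute(f"SELECT id FROM {left} INTERSECT SELECT id FROM {right}")
    return [row[0] for row in cursor.fetchall()]


pairs = [
    ("training", "validation"),
    ("training", "unlabeled_pool"),
    ("validation", "unlabeled_pool"),
]
overlaps = {pair: overlapping_ids(cur, *pair) for pair in pairs}
print(overlaps)  # every list should be empty if the splits are disjoint
```

Run against the real `dataset.db`, a non-empty list for any pair would indicate that a labeled example leaked into another set.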
