## Simple Active Learning

This notebook shows a very simple implementation of active learning.

Active learning is a technique that lets data scientists build a labeled dataset from an unlabeled one with less effort than labeling it completely.
Labeling is still involved, but the process uses a model to identify the most useful samples to annotate, which reduces the time required and gives immediate feedback on when the dataset is good enough.

Active learning can be split into three steps.
* 1: Seeding
* 2: Similarity
* 3: Iteration

In step 1, we provide seeds (i.e. a few samples of each class) to identify potentially easy data points.
In step 2, we run a similarity search over the dataset and present similar samples that could also be used as starting points.
In step 3, we iterate: train a model on the currently labeled data, make predictions on the unlabeled data, and select the best samples to annotate next. This last step can be repeated until we are satisfied with the metrics.

Note that the first two steps are not strictly necessary and one can jump straight to the iteration. However, providing seeds helps the model find the right samples sooner and can save precious time.
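
The iteration step can be sketched as a pool-based loop. This is a minimal sketch, not this notebook's implementation. It assumes that the "best samples" are the ones the model is least sure about (uncertainty sampling), which is only one possible selection strategy. It uses synthetic data from `make_classification`, with the true labels standing in for the human annotator. The helper name `least_confident` is hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression


def least_confident(model, X_pool, k):
    """Return positions of the k pool samples with the lowest top-class probability."""
    proba = model.predict_proba(X_pool)
    confidence = proba.max(axis=1)
    return np.argsort(confidence)[:k]


# Synthetic stand-in for a vectorized corpus; y plays the role of the annotator.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Start from a few labeled samples of each class (the "seeds").
labeled = list(np.where(y == 0)[0][:5]) + list(np.where(y == 1)[0][:5])
labeled_set = set(labeled)
pool = [i for i in range(len(X)) if i not in labeled_set]

for round_ in range(5):
    # Train on everything labeled so far
    model = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])

    # Pick the 10 pool samples the model is least confident about
    picks = least_confident(model, X[pool], k=10)
    chosen = [pool[p] for p in picks]

    # In a real workflow a human would label `chosen` here
    labeled.extend(chosen)
    chosen_set = set(chosen)
    pool = [i for i in pool if i not in chosen_set]

    # Accuracy on the remaining pool, only available because this toy data has labels
    print(round_, len(labeled), model.score(X[pool], y[pool]))
```

In practice the loop stops when the tracked metric is good enough rather than after a fixed number of rounds.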

In [1]:
# Import Libraries

In [34]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import HashingVectorizer
import scipy.spatial.distance  # import the submodule explicitly so cdist is available
import numpy as np
from ipyannotations.text import ClassLabeller

### Load dataset

For this experiment, we are going to use the IMDB reviews sentiment dataset. Although it includes the actual labels, only the review column will be used.

We'll generate the labels using active learning.

We keep the labeled dataset around so that we can later compare a model built through active learning against a model trained with all the labels from the beginning.

For this, we are doing an 80-20 split of the dataset.

In [6]:
dataset = pd.read_csv('./../data/IMDB Dataset.csv')

In [7]:
dataset.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [8]:
train_dataset, test_dataset = train_test_split(dataset, test_size=0.2)

In [9]:
al_dataset = train_dataset[['review']]

In [10]:
train_dataset.head()

Unnamed: 0,review,sentiment
14351,"I think it's a great movie!! It's fun, maybe a...",negative
44066,"Drifting around on bootlegs, sometimes thought...",negative
10909,"This movie had some andrenaline kickers, but i...",negative
2676,This is a extremely well-made film. The acting...,positive
2514,The defining scene to this movie is when the f...,negative


In [11]:
al_dataset.head()

Unnamed: 0,review
14351,"I think it's a great movie!! It's fun, maybe a..."
44066,"Drifting around on bootlegs, sometimes thought..."
10909,"This movie had some andrenaline kickers, but i..."
2676,This is a extremely well-made film. The acting...
2514,The defining scene to this movie is when the f...


### Seeds

What is very interesting about seeds is that they don't need to be part of the dataset. In many scenarios it would be easier to go to the dataset and pick a few examples, but here we already know what good and bad reviews look like, so we can simply write a few ourselves.

In [12]:
good_reviews_seeds = [
    "This movie was great!",
    "I had a great time watching this masterpiece.",
    "The actors were very good in this movie",
    "A good movie"
]

In [13]:
bad_reviews_seeds = [
    "This movie was terrible!",
    "I didn't enjoy this movie at all!",
    "Don't watch it. It is a waste of time",
    "A boring movie"
]

### Similarity

Once we have the seeds, we can find similar samples in our corpus to start the labeling process.

Depending on the problem, this is a point where we can get very creative or keep it simple. We are going for the second option here.

For each seed, we'll get the two most similar reviews.

In [14]:
vectorizer = HashingVectorizer(n_features=600,ngram_range=(1,6),lowercase=False,analyzer='char_wb')

In [15]:
# HashingVectorizer is stateless, so transform() needs no prior fit
good_vectors = vectorizer.transform(good_reviews_seeds).toarray()

In [16]:
bad_vectors = vectorizer.transform(bad_reviews_seeds).toarray()

In [17]:
reviews_vectors = vectorizer.transform(al_dataset.review.tolist()).toarray()
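
Vectorizing the seeds and the corpus in separate calls still yields comparable vectors because `HashingVectorizer` is stateless: it maps each n-gram to a column with a fixed hash function rather than a learned vocabulary. A small check under the same settings as above, using made-up strings:

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer

vectorizer = HashingVectorizer(
    n_features=600, ngram_range=(1, 6), lowercase=False, analyzer="char_wb"
)

text = ["A good movie"]
other = ["Something completely different", "Another unrelated sentence"]

# Vectorize the same text alone and as part of a larger batch
alone = vectorizer.transform(text).toarray()
with_others = vectorizer.transform(text + other).toarray()

# The vector for a text does not depend on what else is in the batch
print(np.allclose(alone[0], with_others[0]))
```

A vocabulary-based vectorizer such as `CountVectorizer` would instead need to be fitted once and then reused for every call.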

In [18]:
NUM_SAMPLES = 2

# Calculate cosine distance (1 - cosine similarity), so smaller means more similar
good_distances = scipy.spatial.distance.cdist(good_vectors, reviews_vectors, metric='cosine')
bad_distances = scipy.spatial.distance.cdist(bad_vectors, reviews_vectors, metric='cosine')

# Sort in ascending order and pick the NUM_SAMPLES closest reviews for each seed
closest_k_good = np.argsort(good_distances)[:, :NUM_SAMPLES].flatten()
closest_k_bad = np.argsort(bad_distances)[:, :NUM_SAMPLES].flatten()

good_to_label_idxs = list(set(list(closest_k_good)))
bad_to_label_idxs = list(set(list(closest_k_bad)))
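
Note that `np.argsort` returns positions within `reviews_vectors`, not the shuffled index labels that `al_dataset` inherited from the split, so the selected rows should be looked up with `.iloc` rather than `.loc`. A small sketch with a made-up four-review DataFrame (the texts, index labels and seed are illustrative only):

```python
import numpy as np
import pandas as pd
import scipy.spatial.distance
from sklearn.feature_extraction.text import HashingVectorizer

# Toy corpus with non-contiguous index labels, like a shuffled train split
toy = pd.DataFrame(
    {
        "review": [
            "A boring movie, a waste of time",
            "This movie was great, I loved it",
            "The plot made no sense at all",
            "A great movie with great actors",
        ]
    },
    index=[14351, 44066, 10909, 2676],
)
seeds = ["This movie was great!"]

vectorizer = HashingVectorizer(
    n_features=600, ngram_range=(1, 6), lowercase=False, analyzer="char_wb"
)
seed_vectors = vectorizer.transform(seeds).toarray()
toy_vectors = vectorizer.transform(toy.review.tolist()).toarray()

# One row of distances per seed, one column per review
distances = scipy.spatial.distance.cdist(seed_vectors, toy_vectors, metric="cosine")

# Positions of the 2 closest reviews per seed, de-duplicated across seeds
closest = np.argsort(distances)[:, :2].flatten()
unique_positions = np.unique(closest)

# Positional lookup: .iloc, not .loc
to_label = toy.iloc[unique_positions]
print(to_label)
```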

In [36]:
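from IPython.display import display

def display_text(ix):
    # Assumed intent: show the review at positional index ix of al_dataset
    display('Review:')
    display(al_dataset.review.iloc[ix])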


In [38]:
widget = ClassLabeller(
    features=["A", "B"],
    #display_function=display_text,
    options=["positive", "negative"],
    allow_freetext=False
)
widget
