# Data Labeling with a GUI

We've all been there. We have a large data collection we were lucky enough to procure, but now it's sitting there waiting for us to label it manually. This is arguably the least fun activity in data science.

As data scientists, we sometimes forget that we are not only meant to create tools and programs which improve others' workflow, but also our own. In this post, I will briefly go over one such method of improving the data labeling workflow.

First, we will procure some image data and briefly discuss the theoretical time required to label the entire dataset manually. Then we will introduce a simple GUI which can make our lives slightly easier. Finally, we will explore how to combine the GUI with a simple ML technique to improve our lives further.

Ready? Here we go!

## Step 1: Let's procure some data

I'm going to grab a dataset from Kaggle called dogs-cats-images (https://www.kaggle.com/chetankv/dogs-cats-images). As you may imagine, it contains images of cats and dogs, 4,000 of each.

Once we procure the dataset, we end up with the following directory structure:

`IMAGE GOES HERE`

Granted, a truly unlabeled dataset wouldn't be nicely split into two directories, so I will modify this structure and copy all files into a single "unknown" directory.

`IMAGE GOES HERE`

Now let's explore the most inefficient hand labeling method for a second. We'd open a CSV file and, for each image, copy its name and add the appropriate label.

Labeling 10 examples like that took me about 1 minute and 36 seconds, averaging ~9.6 seconds per image. Given that there are 8,000 images to label (4,000 cats and 4,000 dogs), that would take me upwards of 20 hours. That's assuming I don't stop to sleep, or cry at the sheer inefficiency of this process.
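
As a quick sanity check on that estimate, here is a minimal sketch of the arithmetic. The helper name `estimate_labeling_hours` is made up for illustration, and it assumes the per-image pace stays constant over the whole dataset, which is optimistic. Rounding the pace down to 9 seconds gives 20 hours, while the measured 9.6 seconds gives a little over 21.

```python
def estimate_labeling_hours(seconds_per_image, n_images):
    """Return the total labeling time in hours at a constant per-image pace."""
    return seconds_per_image * n_images / 3600


n_images = 8000  # 4,000 cats + 4,000 dogs

# Pace rounded down to 9 seconds per image
rounded_pace = estimate_labeling_hours(9, n_images)
# Measured pace: 1 minute 36 seconds (96 s) for 10 images
measured_pace = estimate_labeling_hours(96 / 10, n_images)

print(f"{rounded_pace:.1f} h at 9 s/image, {measured_pace:.1f} h at 9.6 s/image")
```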

So we can all agree this is not a reasonable method.
What if we had a Graphical User Interface (GUI) to help us?

## Employing a GUI to help with labeling

I'm going to employ the tkinter library which comes shipped with Python to create a simple GUI which can help me label the images.

First, let's build a simple image generator:

In [None]:
import os

def image_generator():
    for filename in os.listdir('dogs-cats-images/unknown'):
        yield 'dogs-cats-images/unknown/'+filename
        
my_image_generator = image_generator()
current_image_name = None # The current image which was just classified

Notice this function uses the `yield` keyword rather than `return`. It hands back the image paths one at a time as we call `next(my_image_generator)`, until every image has been visited.
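
If generators are new to you, the following toy example (not part of the labeling tool; `countdown` is a made-up name) shows the behavior we rely on: each `next()` call resumes the function until the next `yield`, and once the values run out, `next()` raises `StopIteration`.

```python
def countdown(n):
    """Yield n, n-1, ..., 1, one value per call to next()."""
    while n > 0:
        yield n
        n -= 1


gen = countdown(3)
first = next(gen)    # runs the function body up to the first yield
second = next(gen)   # resumes right after the previous yield
rest = list(gen)     # drains whatever is left

# An exhausted generator raises StopIteration on the next call
try:
    next(gen)
    exhausted = False
except StopIteration:
    exhausted = True

print(first, second, rest, exhausted)
```

This is why the GUI code later wraps `next(my_image_generator)` in a `try`/`except StopIteration` block.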

Next, we will build a simple GUI which will employ the generator.

First, let's create a window that shows one image inside a label, along with a counter and two buttons:

In [None]:
import tkinter
from PIL import ImageTk, Image

window = tkinter.Tk()
window.title("GUI")

# Preview with a throwaway generator so my_image_generator is left untouched
image_name = next(image_generator())
# Image.LANCZOS replaces Image.ANTIALIAS, which was removed in Pillow 10
my_image = ImageTk.PhotoImage(Image.open(image_name).resize((250, 250),
                                                            Image.LANCZOS))

image_label = tkinter.Label(window, image = my_image)
image_label.pack()

counter_label = tkinter.Label(window, text="0")
counter_label.pack()

# The buttons are not bound to any command yet
tkinter.Button(window, text = "Cat").pack(side='left')
tkinter.Button(window, text = "Dog").pack(side='right')


window.mainloop()

Well, that's nice: we now have a GUI which opens up and shows buttons for labeling the image. But the buttons don't do anything yet; we have to bind them to a command.

So we're going to create a function which will be triggered every time one of the buttons is pressed:

In [None]:
hand_labeled_images = []
hand_labeled_classes = []

def selection_made(selected_class, image_label, counter_label):
    global my_image_generator
    global current_image_name
    global hand_labeled_images, hand_labeled_classes
    

    if (current_image_name is not None):
        hand_labeled_images.append(current_image_name)
        hand_labeled_classes.append(selected_class)

    old_text = counter_label.text
    new_text = str(int(old_text)+1)
    counter_label.configure(text = new_text)
    counter_label.text = new_text
    
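    try:
        current_image_name = next(my_image_generator)
    except StopIteration:
        # No images left: stop the event loop and skip loading another image
        current_image_name = None
        window.quit()
        return
        
    # Image.LANCZOS replaces Image.ANTIALIAS, which was removed in Pillow 10
    image = ImageTk.PhotoImage(Image.open(current_image_name).resize((250, 250), 
                                                                     Image.LANCZOS))
    image_label.configure(image=image)
    image_label.image = image  # keep a reference so the image is not garbage-collected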

In [None]:
def close_window():
    global window
    
    window.destroy()

Now this function can be called whenever one of our buttons is pressed. A button's `command` is called with no arguments, so we will use some lambda magic to supply the arguments ourselves.

In [None]:
import tkinter
from PIL import ImageTk, Image

window = tkinter.Tk()
window.title("GUI")

current_image_name = next(my_image_generator)
my_image = ImageTk.PhotoImage(Image.open(current_image_name).resize((250, 250), 
                                                                    Image.LANCZOS))

image_label = tkinter.Label(window, image = my_image)
image_label.pack()

counter_label = tkinter.Label(window, text="0")
counter_label.pack()
counter_label.text = "0"

tkinter.Button(window, text = "Cat", command=lambda: selection_made('cat', image_label, counter_label)).pack(side='left')
tkinter.Button(window, text = "Dog", command=lambda: selection_made('dog', image_label, counter_label)).pack(side='right')

tkinter.Button(text = 'Save', command=close_window).pack(side='bottom')


window.mainloop()
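
The `command=lambda: ...` pattern works because the `lambda` is itself a zero-argument callable that remembers which arguments to pass along. The sketch below shows the same idea without opening a window; the recording `selection_made` here is a stand-in written for this example, not the real handler, and `functools.partial` is shown as an equivalent alternative from the standard library.

```python
from functools import partial

calls = []


def selection_made(selected_class, source):
    """Stand-in for the real callback; records what it was called with."""
    calls.append((selected_class, source))


# Two equivalent ways to build a zero-argument callable
cat_command = lambda: selection_made('cat', 'button')
dog_command = partial(selection_made, 'dog', 'button')

# A button would invoke these with no arguments when clicked
cat_command()
dog_command()

print(calls)
```

One thing to keep in mind: a `lambda` looks up variable names when it runs, not when it is defined, so creating buttons in a loop usually calls for `partial` or a default argument instead.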

Once we have all images labeled, we can save them nicely into a dataframe as follows:

In [None]:
import pandas as pd
labeled_df = pd.DataFrame({'image_name':hand_labeled_images, 'image_label': hand_labeled_classes})
labeled_df.to_csv('labeled_examples.csv', index=False)

Timing myself again, it took ~28 seconds to label 20 images, which averages out to 1.4 seconds per image. At that rate, labeling 8,000 images would take about 3 hours.

MUCH better than the 20 hours it would take manually, and not bad at all for a few lines of code. It took me about 6 hours to learn the basics of tkinter, and the GUI saved me ~17 hours of manual labeling, so we're looking at ~11 hours saved overall.

However, we haven't considered one important point: once I've labeled some number of images, I have a small training set. It's not a terribly good one, but it is enough for even a simple model to handle the easy examples on its own. So let's leverage our ML skills and put them to work as well.

## Using ML for labeling

Rather than labeling every image by hand, we are going to decide that once X images are labeled, we will start leveraging a machine learning model.

First, we're going to change our generator a little:

In [1]:
import os
import random

def image_generator():
    file_list = os.listdir('dogs-cats-images/unknown')
    random.shuffle(file_list)
    for filename in file_list:
        yield 'dogs-cats-images/unknown/'+filename
        
my_image_generator = image_generator()
current_image_name = None # The current image which was just classified

The reason we shuffle our image set is to make sure we quickly run into an example of every class. (This is very likely with binary classification, but consider a case with more classes.)
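
To put a number on "very likely", here is a small sketch for the binary case. It assumes the files are in uniformly random order, which is what `random.shuffle` gives us, and the function name `prob_both_classes_seen` is made up for this example. It computes the chance that the first few images drawn from 4,000 cats and 4,000 dogs include at least one of each.

```python
from math import comb


def prob_both_classes_seen(n_drawn, n_cats=4000, n_dogs=4000):
    """Probability that n_drawn images sampled without replacement
    include at least one cat and at least one dog."""
    total = n_cats + n_dogs
    if n_drawn < 0 or n_drawn > total:
        raise ValueError("n_drawn must be between 0 and the dataset size")
    if n_drawn == 0:
        return 0.0
    all_draws = comb(total, n_drawn)
    only_cats = comb(n_cats, n_drawn)  # draws that contain no dog
    only_dogs = comb(n_dogs, n_drawn)  # draws that contain no cat
    return 1 - (only_cats + only_dogs) / all_draws


for n in (1, 2, 5, 10):
    print(n, round(prob_both_classes_seen(n), 4))
```

With a balanced two-class dataset, ten images are almost always enough to see both classes. With many classes, or rare ones, the same calculation would be far less comfortable.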

Next, let's create a function which will train a model on the images labeled so far. I'm choosing logistic regression since it trains quickly, and I can use the `warm_start` option so that each refit starts from the previous coefficients.

In [2]:
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder
from PIL import Image  # used below to load the labeled images
import numpy as np

my_model = None
my_encoder = LabelEncoder()
initial_train_threshold = 10
retraining_threshhold = initial_train_threshold

UNCERTAINTY_THRESHOLD = 0.95

def retrain_model(hand_labeled_image_names, hand_labeled_classes, counter_label, main_window):
    
    global retraining_threshhold
    
    global my_encoder, my_model

    
    
    if (len(hand_labeled_image_names) == 0):
        return None


    if ('cat' not in hand_labeled_classes) or ('dog' not in hand_labeled_classes):
        return my_model


    if (len(hand_labeled_classes) < retraining_threshhold):
        return my_model
    else:
        retraining_threshhold *= 2

    old_text = counter_label.text

    counter_label.configure(text = "Re-training")
    counter_label.text = "Re-training"
    main_window.update()

    images_set = np.zeros((len(hand_labeled_image_names), 
                           250, 
                           250, 
                           3))




    for i, image_name in enumerate(hand_labeled_image_names):
        # Force 3 channels so grayscale or RGBA files still fit the array
        image = Image.open(image_name).convert('RGB').resize((250, 250), 
                                                             Image.LANCZOS)
        images_set[i][:, :, :] = np.array(image)



    X = images_set.reshape((images_set.shape[0], -1))
    y = my_encoder.fit_transform(hand_labeled_classes)

    
    if (my_model is None):
        my_model = LogisticRegression(verbose=1, warm_start=True, solver='saga', max_iter=1000)
    my_model.fit(X,y)

    counter_label.configure(text = old_text)
    counter_label.text = old_text
    main_window.update()

    return my_model
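
Because `retrain_model` doubles its threshold after every fit, the model is retrained less and less often as the labeled set grows. The sketch below lists the labeled-image counts at which retraining would be triggered. It assumes the count grows one image at a time and never skips over a threshold, and `retraining_points` is a name made up for this illustration.

```python
def retraining_points(initial_threshold, total_images):
    """Yield the labeled-image counts at which a doubling threshold is crossed."""
    if initial_threshold <= 0:
        raise ValueError("initial_threshold must be positive")
    threshold = initial_threshold
    while threshold <= total_images:
        yield threshold
        threshold *= 2  # mirrors `retraining_threshhold *= 2` in retrain_model


points = list(retraining_points(10, 8000))
print(len(points), points)
```

For 8,000 images and a starting threshold of 10, that is only a handful of fits, which keeps the time spent waiting on "Re-training" manageable even though each fit gets slower as the training set grows.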

Now we can rewrite the selection function. 



In [3]:
hand_labeled_images = []
hand_labeled_classes = []

def reload_image(image_name, image_label, main_window):
    image = Image.open(image_name).resize((250, 250), 
                                          Image.LANCZOS)
    render_image = ImageTk.PhotoImage(image)
    image_label.configure(image=render_image)
    image_label.image = render_image
    main_window.update()
    

def selection_made(selected_class, image_label, counter_label, main_window):
    global my_image_generator
    global current_image_name
    global hand_labeled_images, hand_labeled_classes
    global my_model
    

    if (current_image_name is not None):
        hand_labeled_images.append(current_image_name)
        hand_labeled_classes.append(selected_class)

    old_text = counter_label.text
    new_text = str(int(old_text)+1)
    counter_label.configure(text = new_text)
    counter_label.text = new_text
    
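    try:
        current_image_name = next(my_image_generator)
        # Force 3 channels so the flattened image matches the training features
        image = Image.open(current_image_name).convert('RGB').resize((250, 250), 
                                                                     Image.LANCZOS)
    except StopIteration:
        # No images left: stop the event loop and skip the rest of the handler
        current_image_name = None
        main_window.quit()
        return
    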
    if (len(hand_labeled_images) >= initial_train_threshold):        
        if (('dog' in hand_labeled_classes) and ('cat' in hand_labeled_classes)):
        
            model_uncertain = False
            while not (model_uncertain):    
                my_model = retrain_model(hand_labeled_images, 
                                         hand_labeled_classes, 
                                         counter_label, 
                                         main_window)

                new_X = np.array(image).reshape((1, -1))
                reload_image(current_image_name, image_label, main_window)

                probability_predictions = my_model.predict_proba(new_X)[0]


                if (np.max(probability_predictions) <= UNCERTAINTY_THRESHOLD):
                    model_uncertain = True
                else:
                    class_label = np.array([np.argmax(probability_predictions)])
                    class_name = my_encoder.inverse_transform(class_label)[0]
                    hand_labeled_images.append(current_image_name)
                    hand_labeled_classes.append(class_name)

                    old_text = counter_label.text
                    new_text = str(int(old_text)+1)
                    counter_label.configure(text = new_text)
                    counter_label.text = new_text
                    main_window.update()

                    try:
                        current_image_name = next(my_image_generator)
                        image = Image.open(current_image_name).convert('RGB').resize((250, 250), 
                                                                                     Image.LANCZOS)
                    except StopIteration:
                        # No images left: stop the event loop and do not reload the last image
                        current_image_name = None
                        main_window.quit()
                        return
    
    reload_image(current_image_name, image_label, main_window)
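
The heart of `selection_made` is a simple rule: if the model's most confident class probability is above `UNCERTAINTY_THRESHOLD`, accept the model's label; otherwise stop and ask the human. Here is that rule isolated as a plain function, as a sketch only. The name `route_prediction` and its return convention are assumptions made for this example, and plain lists stand in for the output of `predict_proba`.

```python
def route_prediction(probabilities, class_names, threshold=0.95):
    """Return (label, needs_human) for one image's class probabilities.

    The model's label is accepted only when its top probability is strictly
    above the threshold; otherwise the image goes to the human labeler.
    """
    best_index = max(range(len(probabilities)), key=probabilities.__getitem__)
    if probabilities[best_index] <= threshold:
        return None, True
    return class_names[best_index], False


classes = ['cat', 'dog']
confident = route_prediction([0.97, 0.03], classes)
unsure = route_prediction([0.60, 0.40], classes)
print(confident, unsure)
```

Raising the threshold sends more images to the human and should reduce wrong automatic labels; lowering it does the opposite. Note that the automatically assigned labels end up in the same lists as the hand-assigned ones, so any mistake the model makes is fed back into later training rounds.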
    

In [None]:
import tkinter
from PIL import ImageTk, Image


def close_window():
    global window
    
    window.destroy()
    
    
    

window = tkinter.Tk()
window.title("GUI")

current_image_name = next(my_image_generator)
my_image = ImageTk.PhotoImage(Image.open(current_image_name).resize((250, 250), 
                                                                    Image.LANCZOS))

image_label = tkinter.Label(window, image = my_image)
image_label.pack()

counter_label = tkinter.Label(window, text="0")
counter_label.pack()
counter_label.text = "0"

tkinter.Button(window, text = "Cat", command=lambda: selection_made('cat', image_label, counter_label, window)).pack(side='left')
tkinter.Button(window, text = "Dog", command=lambda: selection_made('dog', image_label, counter_label, window)).pack(side='right')

tkinter.Button(text = 'Save', command=close_window).pack(side='bottom')


window.mainloop()

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


max_iter reached after 13 seconds


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   13.4s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


max_iter reached after 26 seconds


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   25.7s finished


Using this method, labeling this dataset took 14 minutes and 40 seconds overall.

In [None]:
import pandas as pd
labeled_df = pd.DataFrame({'image_name':hand_labeled_images, 'image_label': hand_labeled_classes})
labeled_df.to_csv('labeled_examples.csv', index=False)

In [None]:
print(labeled_df.shape)
labeled_df.head()

## SOMETHING NEW

In [None]:
import pandas as pd
import os
import random
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder
import numpy as np
import tkinter
from PIL import ImageTk, Image

In [None]:
def create_dataset():
    file_list = os.listdir('dogs-cats-images/unknown')
    random.shuffle(file_list)
    
    
    dataset_df = pd.DataFrame({'file': file_list})
    dataset_df['file'] = dataset_df['file'].apply(lambda x: 'dogs-cats-images/unknown/'+x)
    
    dataset_df['labeled'] = 0
    dataset_df['class'] = "None"
    
    return dataset_df

In [None]:
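def carve_train_and_test_set(dataset_df):
    train_files = dataset_df[dataset_df['labeled'] == 1]
    test_files = dataset_df[dataset_df['labeled'] == 0]
    
    # Too few labeled examples to train on yet
    if (train_files.shape[0] < 10):
        return None, None, dataset_df['file']

    def load_images(file_names):
        # Load every file as a 250x250 RGB array.
        # Note: this holds all images in memory at once, which is heavy for large sets.
        images = np.zeros((len(file_names), 250, 250, 3))
        for i, filename in enumerate(file_names):
            image = Image.open(filename).convert('RGB').resize((250, 250), Image.LANCZOS)
            images[i] = np.array(image)
        return images

    X_train = load_images(train_files['file'].values)
    X_test = load_images(test_files['file'].values)
    y_train = train_files['class']

    return X_train, y_train, X_test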
    

In [None]:
the_df = create_dataset() 
X_train, y_train, X_test = carve_train_and_test_set(the_df)
the_df.head()

In [None]:
my_model = None
my_encoder = LabelEncoder()
initial_train_threshold = 10
retraining_threshhold = initial_train_threshold

UNCERTAINTY_THRESHOLD = 0.95

def retrain_model(X_train, y_train, counter_label, main_window):
    
    global retraining_threshhold
    
    global my_encoder, my_model

    
    
    if (len(y_train) == 0):
        return None


    # Compare against the values: `in` on a pandas Series checks the index instead
    if ('cat' not in set(y_train)) or ('dog' not in set(y_train)):
        return my_model


    if (len(y_train) < retraining_threshhold):
        return my_model
    else:
        retraining_threshhold *= 2

    old_text = counter_label.text

    counter_label.configure(text = "Re-training")
    counter_label.text = "Re-training"
    main_window.update()




    # Flatten each image into a single feature vector
    X = X_train.reshape((X_train.shape[0], -1))
    y = my_encoder.fit_transform(y_train)

    
    if (my_model is None):
        my_model = LogisticRegression(verbose=1, warm_start=True)
    my_model.fit(X,y)

    counter_label.configure(text = old_text)
    counter_label.text = old_text
    main_window.update()

    return my_model

In [None]:


def reload_image(image_name, image_label, main_window):
    image = Image.open(image_name).resize((250, 250), 
                                          Image.LANCZOS)
    render_image = ImageTk.PhotoImage(image)
    image_label.configure(image=render_image)
    image_label.image = render_image
    main_window.update()
    

def selection_made(selected_class, image_label, counter_label, main_window, dataset_df):
    global my_image_generator
    global current_image_name
    global my_model
    

    dataset_df.sort_values(by='labeled', ascending=False, axis=0, inplace=True)
    display(dataset_df)
    if (current_image_name is not None):
        hand_labeled_images.append(current_image_name)
        hand_labeled_classes.append(selected_class)

    old_text = counter_label.text
    new_text = str(int(old_text)+1)
    counter_label.configure(text = new_text)
    counter_label.text = new_text
    
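    try:
        current_image_name = next(my_image_generator)
        image = Image.open(current_image_name).convert('RGB').resize((250, 250), 
                                                                     Image.LANCZOS)
    except StopIteration:
        # No images left: stop the event loop and skip the rest of the handler
        current_image_name = None
        main_window.quit()
        return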
    
    if (len(hand_labeled_images) >= initial_train_threshold):        
        if (('dog' in hand_labeled_classes) and ('cat' in hand_labeled_classes)):
        
            model_uncertain = False
            while not (model_uncertain):    
                my_model = retrain_model(hand_labeled_images, 
                                         hand_labeled_classes, 
                                         counter_label, 
                                         main_window)

                new_X = np.array(image).reshape((1, -1))
                reload_image(current_image_name, image_label, main_window)

                probability_predictions = my_model.predict_proba(new_X)[0]


                if (np.max(probability_predictions) <= UNCERTAINTY_THRESHOLD):
                    model_uncertain = True
                else:
                    class_label = np.array([np.argmax(probability_predictions)])
                    class_name = my_encoder.inverse_transform(class_label)[0]
                    hand_labeled_images.append(current_image_name)
                    hand_labeled_classes.append(class_name)

                    old_text = counter_label.text
                    new_text = str(int(old_text)+1)
                    counter_label.configure(text = new_text)
                    counter_label.text = new_text
                    main_window.update()

                    try:
                        current_image_name = next(my_image_generator)
                        image = Image.open(current_image_name).convert('RGB').resize((250, 250), 
                                                                                     Image.LANCZOS)
                    except StopIteration:
                        # No images left: stop the event loop and do not reload the last image
                        current_image_name = None
                        main_window.quit()
                        return
    
    reload_image(current_image_name, image_label, main_window)
    

In [None]:



def close_window():
    global window
    
    window.destroy()
    
    
    
the_df = create_dataset() 

window = tkinter.Tk()
window.title("GUI")

current_image_name = the_df['file'].iloc[0]
my_image = ImageTk.PhotoImage(Image.open(current_image_name).resize((250, 250), 
                                                                    Image.LANCZOS))

image_label = tkinter.Label(window, image = my_image)
image_label.pack()

counter_label = tkinter.Label(window, text="0")
counter_label.pack()
counter_label.text = "0"

tkinter.Button(window, text = "Cat", command=lambda: selection_made('cat', image_label, counter_label, window, the_df)).pack(side='left')
tkinter.Button(window, text = "Dog", command=lambda: selection_made('dog', image_label, counter_label, window, the_df)).pack(side='right')

tkinter.Button(text = 'Save', command=close_window).pack(side='bottom')


window.mainloop()

In [None]:
labeled_df.head()

In [None]:
assert(False)

In [None]:
next(my_image_generator)

In [None]:

hand_labeled_images = []
hand_labeled_classes = []

def selection_made(selected_class):
    global my_image_generator
    global current_image_name
    

    if (current_image_name is not None):
        hand_labeled_images.append(current_image_name)
        hand_labeled_classes.append(selected_class)
              
    
    
    try:
        current_image_name = next(my_images)
    except StopIteration:
        window.quit()
        
    # PIL's resize expects (width, height); Image.LANCZOS replaces the removed Image.ANTIALIAS
    image = ImageTk.PhotoImage(Image.open(current_image_name).resize((global_width, global_height), Image.LANCZOS))
    label.configure(image=image)
    label.image = image

In [None]:
import pandas as pd
import numpy as np

import tkinter
from PIL import ImageTk, Image


from sklearn.preprocessing import LabelEncoder



global_height = 250
global_width = 250

global_divisor = 10
global_certainty_threshold = 0.85


hand_labeled_images = []
hand_labeled_classes = []
current_image_name = None


my_encoder = LabelEncoder()
my_model = None




In [None]:
# https://www.kaggle.com/chetankv/dogs-cats-images

dog_directory = 'dogs-cats-images/dogs'
cat_directory = 'dogs-cats-images/cats'

image_names = []
images_classes = []

for directory in [dog_directory, cat_directory]:

    for filename in os.listdir(directory):
        new_directory = 'dogs-cats-images/unknown'
        image_names.append(str(new_directory)+"/"+str(filename))
        if ('cat' in filename):
            images_classes.append('cat')
        else:
            images_classes.append('dog')
            
true_labels_df = pd.DataFrame({'Image':image_names, 'Class':images_classes})
display(true_labels_df.head())

true_labels_df.to_csv('labels.csv', index=False)

In [None]:
hand_labels_df = pd.read_csv('hand_labels.csv')
hand_labels_df.head()

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
def retrain_model(labeled_images, labeled_classes):
    
    global window
    global labeled_counter
    global global_divisor
    global my_encoder
    global my_model
    
    if (len(labeled_images) == 0):
        return None
    

    if ('cat' not in labeled_classes) or ('dog' not in labeled_classes):
        return my_model
    
    
    if ((len(labeled_images) % global_divisor) > 0):
        return my_model
    else:
        global_divisor *= 2
    
    old_text = labeled_counter.text
    
    labeled_counter.configure(text = "Re-training")
    labeled_counter.text = "Re-training"
    window.update()
    
    images_set = np.zeros((len(labeled_images), 
                           global_height, 
                           global_width, 
                           3))
    
    
    
    
    for i, image_name in enumerate(labeled_images):
        # Force 3 channels so grayscale or RGBA files still fit the array
        image = Image.open(image_name).convert('RGB').resize((global_width, global_height), Image.LANCZOS)
        images_set[i][:, :, :] = np.array(image)
        
        
    
    X = images_set.reshape((images_set.shape[0], -1))
    y = my_encoder.fit_transform(labeled_classes)
    
    allowed_depth = int(0.00001*X.shape[1])
    
    
    
    #my_model = DecisionTreeClassifier(max_depth=allowed_depth, min_samples_leaf=2)
    if (my_model is None):
        my_model = LogisticRegression(verbose=1, warm_start=True)
    my_model.fit(X,y)
    
    labeled_counter.configure(text = old_text)
    labeled_counter.text = old_text
    window.update()
    
    return my_model


#retrain_model(hand_labeled_images, hand_labeled_classes)

In [None]:
def selection_made_cat():
    selection_made('cat')
    
def selection_made_dog():
    selection_made('dog')

def selection_made(selected_class):
    global label, labeled_counter
    global my_images
    global current_image_name
    global my_model
    global window
    
    #print(current_image, selected_class)
    if (current_image_name is not None):
        hand_labeled_images.append(current_image_name)
        hand_labeled_classes.append(selected_class)
    
    #print(len(hand_labeled_images), end='\r')
    
    labeled_counter.configure(text = str(len(hand_labeled_images)))
    labeled_counter.text = str(len(hand_labeled_images))
          
    
    my_model = retrain_model(hand_labeled_images, hand_labeled_classes)
    
    if (my_model is not None):
        model_uncertain = False
        while not (model_uncertain):
            try:
                current_image_name = next(my_images)
            except StopIteration:
                # No images left: stop the event loop and leave the labeling loop
                window.quit()
                break
            
            #print("labeling ",current_image_name)
            
            image = Image.open(current_image_name).convert('RGB').resize((global_width, global_height), Image.LANCZOS)
            image_object = ImageTk.PhotoImage(image)
            label.configure(image=image_object)
            label.image = image_object
            
            new_X = np.array(image).reshape((1, -1))
            predictions = my_model.predict_proba(new_X)[0]
            
            
            
            if (np.max(predictions) <= global_certainty_threshold):
                print(predictions, end='\r')
                model_uncertain = True
                image = ImageTk.PhotoImage(Image.open(current_image_name).resize((global_width, global_height), Image.LANCZOS))
                label.configure(image=image)
                label.image = image
            else:
                new_y = np.argmax(predictions)
                selected_class = my_encoder.inverse_transform(np.array([new_y]))[0]
                hand_labeled_images.append(current_image_name)
                hand_labeled_classes.append(selected_class)
                #print("\n",len(hand_labeled_images), end='\r')
                
            
            labeled_counter.configure(text = str(len(hand_labeled_images)))
            labeled_counter.text = str(len(hand_labeled_images))
            
            my_model = retrain_model(hand_labeled_images, hand_labeled_classes)
            window.update()
                
    else:
        try:
            current_image_name = next(my_images)
        except StopIteration:
            # No images left: stop the event loop and skip loading another image
            window.quit()
            return
        image = ImageTk.PhotoImage(Image.open(current_image_name).resize((global_width, global_height), Image.LANCZOS))
        label.configure(image=image)
        label.image = image

In [None]:


# Let's create the Tkinter window
window = tkinter.Tk()
window.title("GUI")

# Finally, to display the image, create a Label widget, pass it the image as a parameter, and call pack() to place it inside the GUI.
label = tkinter.Label(window, image = None)
label.pack()

labeled_counter = tkinter.Label(window, text="0")
labeled_counter.pack()

selection_made('bla')

tkinter.Button(window, text = "Cat", command=selection_made_cat).pack(side='left')
tkinter.Button(window, text = "Dog", command=selection_made_dog).pack(side='right')


window.mainloop()