# Weak supervision pipeline 

## What does this notebook do?

This notebook is an initial, rough pipeline for developing and applying weak labels to images of book pages from the Internet Archive. The goal is to use weak supervision to label each image as illustrated or not illustrated.

## Data used
This notebook uses the following two datasets:

### [ImageIN/ImageIn_annotations](https://huggingface.co/datasets/ImageIN/ImageIn_annotations)

This is a subset of images of book pages sampled from the Internet Archive, annotated for whether the page contains an illustration. This dataset serves as the ground truth for developing and evaluating weak labelling functions.

### [ImageIN/IA_loaded](https://huggingface.co/datasets/ImageIN/IA_loaded) 

This is a subset of images of book pages sampled from the Internet Archive with no labels.

In [1]:
%pip install datasets wandb transformers timm scikit-learn snorkel piffle requests-cache httpx


We need to authenticate to access some of the datasets stored on the 🤗 Hub and to push our annotated dataset to the Hub.

In [2]:
from huggingface_hub import notebook_login

In [3]:
notebook_login()

Login successful
Your token has been saved to /root/.huggingface/token


In [4]:
from datasets import load_dataset

We load both of our datasets. One contains labels which we'll use to evaluate whether a label function is useful or not. The other is not annotated at all. 

In [5]:
LABELLED_DATASET = "ImageIN/ImageIn_annotations"
UNLABELLED_DATASET = "davanstrien/IA_loaded"

In [6]:
labelled_ds = load_dataset(LABELLED_DATASET, split="train")
unlabelled_ds = load_dataset(UNLABELLED_DATASET, split="train", use_auth_token=True)





In [78]:
unlabelled_ds

Dataset({
    features: ['image', 'manifest_url', 'license', 'label', 'attribution', 'loaded_image', 'detr_preds_count', 'manuscript_count', 'mean_rgb', 'illustration_classifier'],
    num_rows: 66836
})

In [7]:
import io
from typing import Optional

import dask.dataframe as dd
import numpy as np
import PIL
import requests
from datasets import load_dataset_builder
from httpx import HTTPError, ReadTimeout
from piffle.image import IIIFImageClient, ParseError
from PIL import Image, UnidentifiedImageError
from requests_cache import CachedSession
from sklearn.model_selection import train_test_split
from transformers import AutoFeatureExtractor, AutoModelForObjectDetection, pipeline

We look at the first row of each dataset to compare the columns available.

In [8]:
labelled_ds[0]

{'image': 'https://iiif.archivelab.org/iiif/holylandbiblebok0000cunn$49/full/full/0/default.jpg',
 'manifest_url': 'https://iiif.archivelab.org/iiif/holylandbiblebok0000cunn/manifest.json',
 'license': '',
 'label': '34',
 'attribution': 'The Internet Archive',
 'id': 11967,
 'choice': 'not-illustrated',
 'annotator': 1,
 'annotation_id': 6063,
 'created_at': Timestamp('2022-09-26 09:58:03.514234+0000', tz='UTC'),
 'updated_at': Timestamp('2022-09-26 09:58:03.514258+0000', tz='UTC'),
 'lead_time': 3.618,
 'loaded_image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=1757x3005 at 0x7F209D7C4690>}

In [24]:
unlabelled_ds[0]

{'image': 'https://iiif.archivelab.org/iiif/historicalcollec01howe_1$258/full/full/0/default.jpg',
 'manifest_url': 'https://iiif.archivelab.org/iiif/historicalcollec01howe_1/manifest.json',
 'license': None,
 'label': '263',
 'attribution': 'The Internet Archive',
 'loaded_image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=250x250 at 0x7F244A69B0D0>}

In [9]:
def resize_url_image(url):
    """Return the IIIF image URL rewritten to request a 250x250 image, or None if it cannot be parsed."""
    try:
        iiif = IIIFImageClient.init_from_url(url)
        iiif.size = "250,250"
        return str(iiif)
    except ParseError:
        return None
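
For reference, `resize_url_image` relies on the layout of IIIF Image API URLs, which end in `{region}/{size}/{rotation}/{quality}.{format}`. The sketch below performs the same size substitution with plain string handling. The helper name `set_iiif_size` is hypothetical, and it assumes the URL has exactly that trailing layout.

```python
def set_iiif_size(url: str, size: str = "250,250") -> str:
    """Replace the size segment of a IIIF Image API URL (hypothetical helper)."""
    # The last four path segments are region / size / rotation / quality.format
    head, region, _old_size, rotation, quality = url.rsplit("/", 4)
    return "/".join([head, region, size, rotation, quality])


url = "https://iiif.archivelab.org/iiif/holylandbiblebok0000cunn$49/full/full/0/default.jpg"
resized = set_iiif_size(url)
print(resized)
```

Unlike the piffle-based function, this sketch does no validation, so a malformed URL would raise a `ValueError` rather than return `None`.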

## Labelling functions

Below we generate some additional columns for our data that we can use as data points for labelling functions. Normally we would do this work inside a labelling function but because some of these functions are expensive to run we do it upfront and save the results.  

### Heuristics 

One simple indicator of whether a page is illustrated is the mean of its RGB pixel values. Some images in our dataset are blank pages. We expect these not to be illustrated, and they should be easy to detect because their mean value is very high (almost white) or very low (almost black). The function below returns the mean across all pixels and channels of an image, which we can then use inside a labelling function.

In [None]:
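def calculate_mean_rgb_values(image: Optional[Image.Image]) -> Optional[float]:
    """Return the mean pixel intensity across all RGB channels, or None if there is no image."""
    if image is None:
        return None
    # Normalise to RGB so greyscale or palette images are measured on the same scale.
    image_array = np.asarray(image.convert("RGB"))
    return float(np.mean(image_array))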
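
To see why this mean is a useful signal for blank pages, here is a toy sketch that uses nested lists of `(r, g, b)` tuples in place of real images. The helper `mean_intensity` and the pixel values are illustrative assumptions, not part of the pipeline.

```python
from statistics import fmean


def mean_intensity(pixels):
    """Mean over every channel of every pixel, like np.mean on an RGB array."""
    return fmean(channel for row in pixels for pixel in row for channel in pixel)


# A 4x4 all-white "page"
blank_page = [[(255, 255, 255)] * 4 for _ in range(4)]
# A 4x4 "page" where half the pixels are dark ink
printed_page = [[(255, 255, 255), (20, 20, 20)] * 2 for _ in range(4)]

print(mean_intensity(blank_page))    # 255.0
print(mean_intensity(printed_page))  # 137.5
```

A blank white page sits at the top of the 0 to 255 range, while any substantial amount of ink pulls the mean down, so a single threshold can separate the two cases.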

## Existing models

We may be able to leverage existing computer vision models to generate some potential signal for weak labels. 



### Generic object detection model 

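We use a generic [object detection model](https://huggingface.co/facebook/detr-resnet-50) (DETR with a ResNet-50 backbone), trained end-to-end on COCO 2017, and keep its predictions as a potential signal. The `object-detection` pipeline defaults to this model when no model is specified.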

In [12]:
detr_pipe = pipeline("object-detection", device=0)

No model was supplied, defaulted to facebook/detr-resnet-50 and revision 2729413 (https://huggingface.co/facebook/detr-resnet-50).
Using a pipeline without specifying a model name and revision in production is not recommended.



In [13]:
preds = detr_pipe(labelled_ds[:16]["loaded_image"])
preds

[[],
 [],
 [{'score': 0.9938995838165283,
   'label': 'book',
   'box': {'xmin': 0, 'ymin': 20, 'xmax': 2505, 'ymax': 3322}}],
 [],
 [{'score': 0.9972436428070068,
   'label': 'book',
   'box': {'xmin': 4, 'ymin': 14, 'xmax': 2412, 'ymax': 3442}}],
 [{'score': 0.9922118186950684,
   'label': 'book',
   'box': {'xmin': 7, 'ymin': 48, 'xmax': 2488, 'ymax': 3369}}],
 [],
 [{'score': 0.9844426512718201,
   'label': 'book',
   'box': {'xmin': 0, 'ymin': 0, 'xmax': 2146, 'ymax': 3372}}],
 [],
 [],
 [{'score': 0.9604831337928772,
   'label': 'book',
   'box': {'xmin': -5, 'ymin': 0, 'xmax': 1973, 'ymax': 3261}}],
 [{'score': 0.9940760135650635,
   'label': 'book',
   'box': {'xmin': -2, 'ymin': 2, 'xmax': 2136, 'ymax': 3519}}],
 [],
 [],
 [],
 []]

In [25]:
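def get_detr_object_count(image_batch):
    """Run the generic DETR pipeline on a batch; the detections are counted later with len()."""
    try:
        return detr_pipe(image_batch)
    except Exception:
        # Keep one entry per image so the batched map still gets equal-length columns.
        return [None] * len(image_batch)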
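
Despite its name, `get_detr_object_count` returns the raw detections, and the labelling functions later take `len()` of them. If a confidence cut-off were wanted, a count could be derived as sketched below. The helper `count_confident` and the 0.9 threshold are hypothetical, and the sample data mimics the pipeline output shown above.

```python
def count_confident(detections, min_score=0.9):
    """Count detections for one image with score >= min_score; None passes through."""
    if detections is None:  # the wrapper returns None when inference failed
        return None
    return sum(1 for det in detections if det["score"] >= min_score)


sample_preds = [
    [],
    [{"score": 0.9938995838165283, "label": "book",
      "box": {"xmin": 0, "ymin": 20, "xmax": 2505, "ymax": 3322}}],
    None,
]
counts = [count_confident(p) for p in sample_preds]
print(counts)  # [0, 1, None]
```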

### Illustration bounding box detection model 

In [26]:
model_id = "biglam/detr-resnet-50_fine_tuned_nls_chapbooks"

In [27]:
model = AutoModelForObjectDetection.from_pretrained(model_id)
extractor = AutoFeatureExtractor.from_pretrained(model_id)

In [28]:
manu_pipe = pipeline(
    "object-detection", model=model, feature_extractor=extractor, device=0
)

In [29]:
def get_manuscript_illustration_model_object_count(image_batch):
    try:
        return manu_pipe(image_batch)
    except Exception:
        return [None] * len(image_batch)

### Baseline illustration classification model 

In [30]:
from transformers import AutoModelForImageClassification

model_id = "ImageIN/convnext-base-224_finetuned_on_ImageIn_annotations"
model = AutoModelForImageClassification.from_pretrained(model_id)
extractor = AutoFeatureExtractor.from_pretrained(model_id)

In [31]:
illustration_classifier = pipeline(
    "image-classification", model=model, feature_extractor=extractor, device=0
)

In [32]:
illustration_classifier(labelled_ds[0]["loaded_image"])

[{'score': 0.9981074333190918, 'label': 'not-illustrated'},
 {'score': 0.001892568077892065, 'label': 'illustrated'}]

In [33]:
def get_illustration_classifier(image_batch):
    try:
        return illustration_classifier(image_batch)
    except Exception:
        return [None] * len(image_batch)

In [34]:
def pre_process(batch):
    images = batch["loaded_image"]
    return {
        "detr_preds_count": get_detr_object_count(images),
        "manuscript_count": get_manuscript_illustration_model_object_count(images),
        "mean_rgb": [calculate_mean_rgb_values(image) for image in images],
        "illustration_classifier": get_illustration_classifier(images),
    }

In [35]:
labelled_ds = labelled_ds.map(pre_process,
                batched=True, 
                batch_size=512,
                writer_batch_size=1024)
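
With `batched=True`, `map` passes the function a dict mapping each column name to a list of values and expects a dict of equal-length lists back, which become new columns. A dependency-free sketch of that contract, using a made-up batch and a toy function (`toy_pre_process` is hypothetical):

```python
def toy_pre_process(batch):
    """Toy stand-in for pre_process: one output value per input row."""
    images = batch["loaded_image"]
    return {"name_length": [len(name) for name in images]}


batch = {"loaded_image": ["page_a", "page_bb", "page_ccc"]}
new_columns = toy_pre_process(batch)

# Every returned column must have one entry per row in the batch.
assert all(len(col) == len(batch["loaded_image"]) for col in new_columns.values())
print(new_columns)  # {'name_length': [6, 7, 8]}
```

This is also why the model wrappers above return `[None] * len(image_batch)` on failure: the output keeps one entry per image.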


In [37]:
labelled_ds

Dataset({
    features: ['image', 'manifest_url', 'license', 'label', 'attribution', 'id', 'choice', 'annotator', 'annotation_id', 'created_at', 'updated_at', 'lead_time', 'loaded_image', 'detr_preds_count', 'manuscript_count', 'mean_rgb', 'illustration_classifier'],
    num_rows: 1896
})

## Create DataFrame

We create a DataFrame without the image column. Snorkel's labelling functions are applied to this DataFrame.

In [38]:
labelled_ds_with_out_image = labelled_ds.remove_columns("loaded_image")

In [39]:
df = labelled_ds_with_out_image.to_pandas()

In [40]:
df

| | image | manifest_url | license | label | attribution | id | choice | annotator | annotation_id | created_at | updated_at | lead_time | detr_preds_count | manuscript_count | mean_rgb | illustration_classifier |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | https://iiif.archivelab.org/iiif/holylandbible... | https://iiif.archivelab.org/iiif/holylandbible... | | 34 | The Internet Archive | 11967 | not-illustrated | 1 | 6063 | 2022-09-26 09:58:03.514234+00:00 | 2022-09-26 09:58:03.514258+00:00 | 3.618 | [] | [] | 184.929093 | [{'label': 'not-illustrated', 'score': 0.99810... |
| 1 | https://iiif.archivelab.org/iiif/ancientgreekf... | https://iiif.archivelab.org/iiif/ancientgreekf... | | p. | The Internet Archive | 11966 | illustrated | 1 | 6062 | 2022-09-26 09:57:59.634415+00:00 | 2022-09-26 09:57:59.634439+00:00 | 6.742 | [] | [{'box': {'xmax': 1372, 'xmin': 282, 'ymax': 2... | 223.384461 | [{'label': 'illustrated', 'score': 0.999426603... |
| 2 | https://iiif.archivelab.org/iiif/saintstheirsy... | https://iiif.archivelab.org/iiif/saintstheirsy... | | 25 | The Internet Archive | 11965 | not-illustrated | 1 | 6061 | 2022-09-26 09:57:52.637790+00:00 | 2022-09-26 09:57:52.637832+00:00 | 4.282 | [{'box': {'xmax': 2505, 'xmin': 0, 'ymax': 332... | [] | 195.217543 | [{'label': 'not-illustrated', 'score': 0.99941... |
| 3 | https://iiif.archivelab.org/iiif/orchidaceaeil... | https://iiif.archivelab.org/iiif/orchidaceaeil... | | p. | The Internet Archive | 11964 | not-illustrated | 1 | 6060 | 2022-09-26 09:57:48.066408+00:00 | 2022-09-26 09:57:48.066436+00:00 | 4.041 | [] | [] | 230.156799 | [{'label': 'not-illustrated', 'score': 0.99941... |
| 4 | https://iiif.archivelab.org/iiif/movingpicture... | https://iiif.archivelab.org/iiif/movingpicture... | | 1339 | The Internet Archive | 11963 | not-illustrated | 1 | 6059 | 2022-09-26 09:57:43.710760+00:00 | 2022-09-26 09:57:43.710808+00:00 | 19.219 | [{'box': {'xmax': 2412, 'xmin': 4, 'ymax': 344... | [] | 198.680118 | [{'label': 'not-illustrated', 'score': 0.99704... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1891 | https://iiif.archivelab.org/iiif/bnf-bpt6k6566... | https://iiif.archivelab.org/iiif/bnf-bpt6k6566... | | p. | The Internet Archive | 10007 | illustrated | 1 | 4103 | 2022-09-15 16:16:51.178485+00:00 | 2022-09-15 16:16:51.178509+00:00 | 0.946 | [{'box': {'xmax': 4241, 'xmin': 29, 'ymax': 53... | [{'box': {'xmax': 2845, 'xmin': 462, 'ymax': 4... | 201.694078 | [{'label': 'illustrated', 'score': 0.999818146... |
| 1892 | https://iiif.archivelab.org/iiif/atreatiseillu... | https://iiif.archivelab.org/iiif/atreatiseillu... | | p. | The Internet Archive | 10006 | illustrated | 1 | 4102 | 2022-09-15 16:16:49.989463+00:00 | 2022-09-15 16:16:49.989490+00:00 | 1.624 | [{'box': {'xmax': 784, 'xmin': 485, 'ymax': 97... | [] | 161.211220 | [{'label': 'illustrated', 'score': 0.999732792... |
| 1893 | https://iiif.archivelab.org/iiif/illustrations... | https://iiif.archivelab.org/iiif/illustrations... | | 69 | The Internet Archive | 10005 | not-illustrated | 1 | 4101 | 2022-09-15 16:16:48.121866+00:00 | 2022-09-15 16:16:48.121905+00:00 | 0.847 | [{'box': {'xmax': 3033, 'xmin': 0, 'ymax': 401... | [] | 217.400830 | [{'label': 'not-illustrated', 'score': 0.99934... |
| 1894 | https://iiif.archivelab.org/iiif/chaitanyahisc... | https://iiif.archivelab.org/iiif/chaitanyahisc... | | 69 | The Internet Archive | 10004 | not-illustrated | 1 | 4100 | 2022-09-15 16:16:47.011223+00:00 | 2022-09-15 16:16:47.011249+00:00 | 0.873 | [{'box': {'xmax': 2258, 'xmin': -1, 'ymax': 39... | [] | 172.259406 | [{'label': 'not-illustrated', 'score': 0.99901... |


In [41]:
df_train, df_valid = train_test_split(df, test_size=0.2)

In [42]:
ABSTAIN = -1
ILLUSTRATED = 0
NOT_ILLUSTRATED = 1

### Creating our Snorkel Labelling functions

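Each labelling function below takes one row of the DataFrame and returns `ILLUSTRATED`, `NOT_ILLUSTRATED`, or `ABSTAIN`. Snorkel later combines these votes into a single label.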

In [43]:
from snorkel.labeling import labeling_function
from snorkel.preprocess import preprocessor

In [44]:
@labeling_function()
def detr_preds_count_none(x):
    "no objects often means no illustration"
    return NOT_ILLUSTRATED if len(x.detr_preds_count) == 0 else ABSTAIN

In [56]:
@labeling_function()
def many_detr_objects(x):
    return ILLUSTRATED if len(x.detr_preds_count) >= 2 else ABSTAIN

In [57]:
@labeling_function()
def manuscript_object(x):
    return ILLUSTRATED if len(x.manuscript_count) >= 2 else ABSTAIN

In [58]:
@labeling_function()
def manuscript_no_object(x):
    return NOT_ILLUSTRATED if len(x.manuscript_count) == 0 else ABSTAIN

In [59]:
@labeling_function()
def rgb_above_threshold(x, threshold=250):
    return NOT_ILLUSTRATED if x.mean_rgb >= threshold else ABSTAIN

In [60]:
@labeling_function()
def rgb_below_threshold(x, threshold=64):
    return NOT_ILLUSTRATED if x.mean_rgb <= threshold else ABSTAIN

In [61]:
@labeling_function()
def predicted_not_illustrated_high_prob(x):
    return (
        NOT_ILLUSTRATED
        if x.illustration_classifier[0]["label"] == "not-illustrated"
        and x.illustration_classifier[0]["score"] > 0.95
        else ABSTAIN
    )

In [62]:
@labeling_function()
def predicted_illustrated_high_prob(x):
    return (
        ILLUSTRATED
        if x.illustration_classifier[0]["label"] != "not-illustrated"
        and x.illustration_classifier[0]["score"] > 0.95
        else ABSTAIN
    )
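
The logic of these rules can be checked on a hand-built row before involving Snorkel. The sketch below uses plain functions rather than `@labeling_function` so that it runs without Snorkel installed. `rgb_rule` and `classifier_rule` are hypothetical names that merge the paired rules above for brevity, and the row values are copied from the first labelled example.

```python
from types import SimpleNamespace

ABSTAIN, ILLUSTRATED, NOT_ILLUSTRATED = -1, 0, 1


def rgb_rule(x, low=64, high=250):
    """Very dark or very bright pages are treated as blank (not illustrated)."""
    return NOT_ILLUSTRATED if x.mean_rgb <= low or x.mean_rgb >= high else ABSTAIN


def classifier_rule(x, min_score=0.95):
    """Trust the classifier's top prediction only when it is confident."""
    top = x.illustration_classifier[0]
    if top["score"] <= min_score:
        return ABSTAIN
    return NOT_ILLUSTRATED if top["label"] == "not-illustrated" else ILLUSTRATED


row = SimpleNamespace(
    mean_rgb=184.929093,
    illustration_classifier=[
        {"label": "not-illustrated", "score": 0.9981074333190918},
        {"label": "illustrated", "score": 0.001892568077892065},
    ],
)
print(rgb_rule(row), classifier_rule(row))  # -1 1
```

The brightness rule abstains on this page because its mean sits between the two thresholds, while the classifier rule votes `NOT_ILLUSTRATED`.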

We combine the labelling functions into a list so they can be applied together.

In [64]:
lfs = [
    detr_preds_count_none,
    manuscript_no_object,
    many_detr_objects,
    manuscript_object,
    rgb_above_threshold,
    rgb_below_threshold,
    predicted_not_illustrated_high_prob,
    predicted_illustrated_high_prob,
]

In [65]:
from snorkel.labeling import PandasLFApplier

In [66]:
applier = PandasLFApplier(lfs=lfs)
L_train = applier.apply(df=df_train)

100%|██████████| 1516/1516 [00:00<00:00, 12737.05it/s]


In [67]:
from snorkel.labeling import LFAnalysis

In [68]:
LFAnalysis(L=L_train, lfs=lfs).lf_summary()

| | j | Polarity | Coverage | Overlaps | Conflicts |
|---|---|---|---|---|---|
| detr_preds_count_none | 0 | [1] | 0.614116 | 0.613456 | 0.097625 |
| manuscript_no_object | 1 | [1] | 0.725594 | 0.723615 | 0.077177 |
| many_detr_objects | 2 | [0] | 0.082454 | 0.082454 | 0.032322 |
| manuscript_object | 3 | [0] | 0.029024 | 0.029024 | 0.015172 |
| rgb_above_threshold | 4 | [1] | 0.023747 | 0.023747 | 0.001319 |
| rgb_below_threshold | 5 | [1] | 0.004617 | 0.004617 | 0.0 |
| predicted_not_illustrated_high_prob | 6 | [1] | 0.746042 | 0.724934 | 0.009894 |
| predicted_illustrated_high_prob | 7 | [0] | 0.246042 | 0.202507 | 0.145778 |
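
To make the summary columns concrete, the sketch below recomputes them for one labelling function from a tiny, made-up label matrix. It follows my understanding of Snorkel's definitions: coverage is the fraction of rows the function labels, overlaps the fraction it labels together with at least one other function, and conflicts the fraction where one of those other functions disagrees. The helper `lf_stats` is hypothetical and is not Snorkel's implementation.

```python
ABSTAIN = -1


def lf_stats(L, j):
    """Coverage, overlaps and conflicts for column j of a label matrix L."""
    n = len(L)
    coverage = overlaps = conflicts = 0
    for row in L:
        if row[j] == ABSTAIN:
            continue
        coverage += 1
        # Votes cast on the same row by the other labelling functions
        others = [v for k, v in enumerate(row) if k != j and v != ABSTAIN]
        if others:
            overlaps += 1
        if any(v != row[j] for v in others):
            conflicts += 1
    return coverage / n, overlaps / n, conflicts / n


# Rows are data points, columns are labelling functions.
L = [
    [1, 1, -1],
    [1, -1, 0],
    [-1, -1, 0],
    [1, -1, -1],
]
stats = lf_stats(L, 0)
print(stats)  # (0.75, 0.5, 0.25)
```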


In [69]:
L_test = applier.apply(df=df_valid)

100%|██████████| 380/380 [00:00<00:00, 12673.63it/s]


In [70]:
from snorkel.labeling.model import LabelModel

label_model = LabelModel(cardinality=2, verbose=True)
label_model.fit(L_train=L_train, n_epochs=500, log_freq=100, seed=123)

100%|██████████| 500/500 [00:00<00:00, 1158.32epoch/s]


In [71]:
L_test = applier.apply(df=df_valid)

100%|██████████| 380/380 [00:00<00:00, 11949.76it/s]


In [72]:
Y_test = df_valid.choice.values

In [73]:
Y_test = np.array([0 if label == "illustrated" else 1 for label in Y_test])

In [74]:
label_model_acc = label_model.score(
    L=L_test, Y=Y_test, metrics=["f1", "accuracy"], tie_break_policy="random"
)
label_model_acc

{'f1': 0.9862068965517242, 'accuracy': 0.9789473684210527}
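
When reading these scores, note that `NOT_ILLUSTRATED` is encoded as 1 here. If the F1 is computed with label 1 as the positive class (scikit-learn's default for binary F1, which I assume Snorkel follows), the F1 above describes the not-illustrated class. The sketch below computes both metrics by hand on made-up labels; `accuracy_and_f1` is a hypothetical helper.

```python
def accuracy_and_f1(y_true, y_pred, positive=1):
    """Accuracy and binary F1, treating `positive` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


y_true = [1, 1, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0]
accuracy, f1 = accuracy_and_f1(y_true, y_pred)
print(accuracy, round(f1, 3))  # 0.6 0.667
```

Passing `positive=0` gives the F1 for the illustrated class instead, which is often the more informative number when illustrated pages are the minority.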


In [None]:
L_train = applier.apply(df=df)

100%|██████████| 1896/1896 [00:00<00:00, 12928.69it/s]


In [None]:
probs_train = label_model.predict_proba(L_train)

In [None]:
len(probs_train)

1896

In [None]:
probs_train.max(axis=1)

array([0.90770403, 0.99976089, 0.85902286, ..., 0.85902286, 0.85902286,
       0.90770403])

In [None]:
from snorkel.utils import probs_to_preds

In [None]:
probs_to_preds(probs_train)

array([1, 0, 1, ..., 1, 1, 1])
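
Converting probabilities to hard labels amounts to taking the index of the largest probability in each row. A minimal sketch of that idea in plain Python follows. The probabilities are illustrative values chosen to match the maxima printed above, and Snorkel's own function additionally has a tie-breaking policy that this sketch does not reproduce.

```python
def argmax_preds(probs):
    """Index of the largest probability in each row (first index wins ties)."""
    return [max(range(len(row)), key=row.__getitem__) for row in probs]


probs = [
    [0.09229597, 0.90770403],  # mostly NOT_ILLUSTRATED (1)
    [0.99976089, 0.00023911],  # mostly ILLUSTRATED (0)
]
preds = argmax_preds(probs)
print(preds)  # [1, 0]
```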

In [None]:
mask = (L_train != -1).any(axis=1)
mask

array([ True,  True,  True, ...,  True,  True,  True])

In [None]:
len(probs_train[mask])

1894
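
Rows on which every labelling function abstained carry no signal for the label model, which is why 1894 of the 1896 rows remain. The sketch below expresses the same filter in plain Python and returns integer positions, the form that index-based selection such as `Dataset.select` expects. The helper `covered_indices` and the toy matrix are hypothetical.

```python
ABSTAIN = -1


def covered_indices(L):
    """Positions of rows where at least one labelling function voted."""
    return [i for i, row in enumerate(L) if any(v != ABSTAIN for v in row)]


L = [
    [1, -1, -1],
    [-1, -1, -1],  # every function abstained, so this row is dropped
    [-1, 0, 1],
]
keep = covered_indices(L)
print(keep)  # [0, 2]
```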

In [None]:
df_train_filtered = df[mask]  # keep rows where at least one labelling function voted
df_train_filtered.iloc[0, 0]

'https://iiif.archivelab.org/iiif/holylandbiblebok0000cunn$49/full/full/0/default.jpg'

In [None]:
labelled_ds.select(np.flatnonzero(mask))  # select expects integer indices, not a boolean mask

Dataset({
    features: ['image', 'manifest_url', 'license', 'label', 'attribution', 'id', 'choice', 'annotator', 'annotation_id', 'created_at', 'updated_at', 'lead_time', 'loaded_image', 'detr_preds_count', 'manuscript_count', 'mean_rgb', 'illustration_classifier'],
    num_rows: 1894
})

In [None]:
(probs_train != -1).any(axis=1)

array([ True,  True,  True, ...,  True,  True,  True])

In [77]:
unlabelled_ds = unlabelled_ds.map(
    pre_process, batched=True, batch_size=512, writer_batch_size=1024
)




In [81]:
unlabelled_ds_without_image = unlabelled_ds.remove_columns("loaded_image")

In [82]:
df_full = unlabelled_ds_without_image.to_pandas()

In [83]:
df_full

| | image | manifest_url | license | label | attribution | detr_preds_count | manuscript_count | mean_rgb | illustration_classifier |
|---|---|---|---|---|---|---|---|---|---|
| 0 | https://iiif.archivelab.org/iiif/historicalcol... | https://iiif.archivelab.org/iiif/historicalcol... | | 263 | The Internet Archive | [{'box': {'xmax': 249, 'xmin': 0, 'ymax': 250,... | [{'box': {'xmax': 96, 'xmin': 45, 'ymax': 174,... | 179.541067 | [{'label': 'illustrated', 'score': 0.999695539... |
| 1 | https://iiif.archivelab.org/iiif/notesonscienc... | https://iiif.archivelab.org/iiif/notesonscienc... | | p. | The Internet Archive | [{'box': {'xmax': 249, 'xmin': 0, 'ymax': 250,... | [] | 205.527088 | [{'label': 'not-illustrated', 'score': 0.99895... |
| 2 | https://iiif.archivelab.org/iiif/scandinavianf... | https://iiif.archivelab.org/iiif/scandinavianf... | | 334 | The Internet Archive | [{'box': {'xmax': 249, 'xmin': 0, 'ymax': 250,... | [] | 195.941051 | [{'label': 'not-illustrated', 'score': 0.99932... |
| 3 | https://iiif.archivelab.org/iiif/cataloguepict... | https://iiif.archivelab.org/iiif/cataloguepict... | | 336 | The Internet Archive | [] | [] | 222.432368 | [{'label': 'not-illustrated', 'score': 0.99954... |
| 4 | https://iiif.archivelab.org/iiif/illustratedit... | https://iiif.archivelab.org/iiif/illustratedit... | | 107 | The Internet Archive | [{'box': {'xmax': 249, 'xmin': 0, 'ymax': 250,... | [] | 197.561888 | [{'label': 'not-illustrated', 'score': 0.99909... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 66831 | https://iiif.archivelab.org/iiif/narrativedisc... | https://iiif.archivelab.org/iiif/narrativedisc... | | p. | The Internet Archive | [] | [] | 203.316592 | [{'label': 'not-illustrated', 'score': 0.99955... |
| 66832 | https://iiif.archivelab.org/iiif/landscapeillu... | https://iiif.archivelab.org/iiif/landscapeillu... | | p. | The Internet Archive | [] | [{'box': {'xmax': 205, 'xmin': 35, 'ymax': 71,... | 245.947008 | [{'label': 'not-illustrated', 'score': 0.99859... |
| 66833 | https://iiif.archivelab.org/iiif/rice-lewis-il... | https://iiif.archivelab.org/iiif/rice-lewis-il... | | 734 | The Internet Archive | [{'box': {'xmax': 229, 'xmin': 150, 'ymax': 19... | [] | 173.834384 | [{'label': 'illustrated', 'score': 0.999768555... |
| 66834 | https://iiif.archivelab.org/iiif/villageworkin... | https://iiif.archivelab.org/iiif/villageworkin... | | 41 | The Internet Archive | [] | [] | 185.071595 | [{'label': 'not-illustrated', 'score': 0.99904... |


In [84]:
L_train = applier.apply(df=df_full)

100%|██████████| 66836/66836 [00:05<00:00, 12935.20it/s]


In [85]:
LFAnalysis(L=L_train, lfs=lfs).lf_summary()

| | j | Polarity | Coverage | Overlaps | Conflicts |
|---|---|---|---|---|---|
| detr_preds_count_none | 0 | [1] | 0.590999 | 0.589278 | 0.089084 |
| manuscript_no_object | 1 | [1] | 0.809923 | 0.80524 | 0.092944 |
| many_detr_objects | 2 | [0] | 0.070217 | 0.070142 | 0.024478 |
| manuscript_object | 3 | [0] | 0.016698 | 0.016698 | 0.006793 |
| rgb_above_threshold | 4 | [1] | 0.026213 | 0.026198 | 0.000584 |
| rgb_below_threshold | 5 | [1] | 0.004638 | 0.004623 | 0.000658 |
| predicted_not_illustrated_high_prob | 6 | [1] | 0.745796 | 0.74219 | 0.003441 |
| predicted_illustrated_high_prob | 7 | [0] | 0.237387 | 0.190736 | 0.140568 |


In [86]:
label_model = LabelModel(cardinality=2, verbose=True)
label_model.fit(L_train=L_train, n_epochs=500, log_freq=100, seed=123)

100%|██████████| 500/500 [00:00<00:00, 1254.01epoch/s]


In [87]:
probs_train = label_model.predict_proba(L_train)

In [90]:
len(probs_train) == len(unlabelled_ds)

True

In [95]:
unlabelled_ds_with_probs = unlabelled_ds.add_column(
    "snorkel_label_model_probs", probs_train.tolist()
)

In [97]:
unlabelled_ds_with_probs[0]

{'image': 'https://iiif.archivelab.org/iiif/historicalcollec01howe_1$258/full/full/0/default.jpg',
 'manifest_url': 'https://iiif.archivelab.org/iiif/historicalcollec01howe_1/manifest.json',
 'license': None,
 'label': '263',
 'attribution': 'The Internet Archive',
 'loaded_image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=250x250 at 0x7F230BDCBF10>,
 'detr_preds_count': [{'box': {'xmax': 249, 'xmin': 0, 'ymax': 250, 'ymin': 0},
   'label': 'laptop',
   'score': 0.9441924691200256}],
 'manuscript_count': [{'box': {'xmax': 96,
    'xmin': 45,
    'ymax': 174,
    'ymin': 113},
   'label': 'early_printed_illustration',
   'score': 0.9674574732780457}],
 'mean_rgb': 179.54106666666667,
 'illustration_classifier': [{'label': 'illustrated',
   'score': 0.9996955394744873},
  {'label': 'not-illustrated', 'score': 0.0003044367767870426}],
 'snorkel_label_model_probs': [0.9999849810676339, 1.5018932366001192e-05]}

In [100]:
from snorkel.utils import probs_to_preds

arg_max_preds = probs_to_preds(probs_train)

In [103]:
unlabelled_ds_with_snorkel_labels = unlabelled_ds_with_probs.add_column(
    "snorkel_label", arg_max_preds
)

In [104]:
unlabelled_ds_with_snorkel_labels[0]

{'image': 'https://iiif.archivelab.org/iiif/historicalcollec01howe_1$258/full/full/0/default.jpg',
 'manifest_url': 'https://iiif.archivelab.org/iiif/historicalcollec01howe_1/manifest.json',
 'license': None,
 'label': '263',
 'attribution': 'The Internet Archive',
 'loaded_image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=250x250 at 0x7F22C79D5DD0>,
 'detr_preds_count': [{'box': {'xmax': 249, 'xmin': 0, 'ymax': 250, 'ymin': 0},
   'label': 'laptop',
   'score': 0.9441924691200256}],
 'manuscript_count': [{'box': {'xmax': 96,
    'xmin': 45,
    'ymax': 174,
    'ymin': 113},
   'label': 'early_printed_illustration',
   'score': 0.9674574732780457}],
 'mean_rgb': 179.54106666666667,
 'illustration_classifier': [{'label': 'illustrated',
   'score': 0.9996955394744873},
  {'label': 'not-illustrated', 'score': 0.0003044367767870426}],
 'snorkel_label_model_probs': [0.9999849810676339, 1.5018932366001192e-05],
 'snorkel_label': 0}

In [106]:
import datasets

In [107]:
unlabelled_ds_with_snorkel_labels = unlabelled_ds_with_snorkel_labels.cast_column(
    "snorkel_label", datasets.ClassLabel(names=["illustrated", "not-illustrated"])
)


In [108]:
unlabelled_ds_with_snorkel_labels.push_to_hub(
    "ImageIN/unlabelled_IA_with_snorkel_labels", private=True
)




