<a href="https://colab.research.google.com/github/eddyxu/rikai/blob/lei%2Funcertainty_sampling/notebooks/UncertaintySampling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Uncertainty Sampling

Uncertainty Sampling is an [Active Learning](https://en.wikipedia.org/wiki/Active_learning_(machine_learning))
strategy that uses the uncertainty in a model's predictions to find the examples most worth labelling.


In [8]:
!python -V
!nvidia-smi
!df -h

Python 3.7.12
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

Filesystem      Size  Used Avail Use% Mounted on
overlay         108G   43G   65G  40% /
tmpfs            64M     0   64M   0% /dev
shm             5.8G     0  5.8G   0% /dev/shm
/dev/root       2.0G  1.2G  817M  59% /sbin/docker-init
tmpfs           6.4G   44K  6.4G   1% /var/colab
/dev/sda1        81G   47G   34G  59% /etc/hosts
tmpfs           6.4G     0  6.4G   0% /proc/acpi
tmpfs           6.4G     0  6.4G   0% /proc/scsi
tmpfs           6.4G     0  6.4G   0% /sys/firmware


In [2]:
!pip install rikai[mlflow,torch]

Collecting rikai[mlflow,torch]
  Downloading rikai-0.1.0-py3-none-any.whl (102 kB)
Collecting antlr4-python3-runtime==4.8
  Downloading antlr4-python3-runtime-4.8.tar.gz (112 kB)

In [7]:
# From https://pytorch.org/vision/0.11/models.html#object-detection-instance-segmentation-and-person-keypoint-detection

COCO_INSTANCE_CATEGORY_NAMES = [
    '__background__', 'person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus',
    'train', 'truck', 'boat', 'traffic light', 'fire hydrant', 'N/A', 'stop sign',
    'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow',
    'elephant', 'bear', 'zebra', 'giraffe', 'N/A', 'backpack', 'umbrella', 'N/A', 'N/A',
    'handbag', 'tie', 'suitcase', 'frisbee', 'skis', 'snowboard', 'sports ball',
    'kite', 'baseball bat', 'baseball glove', 'skateboard', 'surfboard', 'tennis racket',
    'bottle', 'N/A', 'wine glass', 'cup', 'fork', 'knife', 'spoon', 'bowl',
    'banana', 'apple', 'sandwich', 'orange', 'broccoli', 'carrot', 'hot dog', 'pizza',
    'donut', 'cake', 'chair', 'couch', 'potted plant', 'bed', 'N/A', 'dining table',
    'N/A', 'N/A', 'toilet', 'N/A', 'tv', 'laptop', 'mouse', 'remote', 'keyboard', 'cell phone',
    'microwave', 'oven', 'toaster', 'sink', 'refrigerator', 'N/A', 'book',
    'clock', 'vase', 'scissors', 'teddy bear', 'hair drier', 'toothbrush'
]

In [8]:
import mlflow
from pyspark.sql import SparkSession
from rikai.spark.utils import get_default_jar_version

MLFLOW_TRACKING_URI = "sqlite:///mlflow.db"

rikai_version = get_default_jar_version(use_snapshot=False)

mlflow.set_tracking_uri(MLFLOW_TRACKING_URI)
spark = (
    SparkSession
    .builder
    .config("spark.jars.packages", f"ai.eto:rikai_2.12:{rikai_version}")
    .config(
        "spark.sql.extensions",
        "ai.eto.rikai.sql.spark.RikaiSparkSessionExtensions",
    )
    .config(
        "spark.rikai.sql.ml.registry.mlflow.tracking_uri",
        MLFLOW_TRACKING_URI,
    )
    .config("spark.executor.memory", "8g")
    .config("spark.driver.memory", "4g")
    .master("local[2]")
    .getOrCreate()
);

# Preparing data

Use rikai.contrib.datasets.coco.convert to build a Rikai dataset from the COCO files under "./coco" and save it as the "coco" table.

In [10]:
from pathlib import Path

coco_dir = Path("./coco")
# Download and unpack COCO val2017 only if "./coco" does not exist yet.
if not coco_dir.exists():
  !mkdir -p coco
  !wget http://images.cocodataset.org/zips/val2017.zip -O coco/val2017.zip
  !wget http://images.cocodataset.org/annotations/annotations_trainval2017.zip -O coco/annotations.zip
  !cd coco && find . -name '*.zip' -exec unzip {} \; && rm *.zip

from rikai.contrib.datasets.coco import convert

df = convert(spark, "coco")
df.write.saveAsTable("coco")

loading annotations into memory...
Done (t=19.08s)
creating index...
index created!
loading annotations into memory...
Done (t=0.60s)
creating index...
index created!


In [11]:
spark.sql("SHOW TABLES").show()

+--------+---------+-----------+
|database|tableName|isTemporary|
+--------+---------+-----------+
| default|     coco|      false|
+--------+---------+-----------+



In [12]:
spark.sql("select * from coco").printSchema()

root
 |-- image_id: long (nullable = true)
 |-- annotations: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- label_id: integer (nullable = true)
 |    |    |-- label: string (nullable = true)
 |    |    |-- area: float (nullable = true)
 |    |    |-- bbox: box2d (nullable = true)
 |-- image: image (nullable = true)
 |-- split: string (nullable = true)



In [13]:
import rikai
from torchvision.models.detection.ssd import ssd300_vgg16
from rikai.contrib.torch.inspect.ssd import SSDClassScoresExtractor
from rikai.contrib.torch.detections import OUTPUT_SCHEMA

ssd = ssd300_vgg16(pretrained=True)
class_scores_extractor = SSDClassScoresExtractor(ssd, topk_candidates=90)

print(OUTPUT_SCHEMA)

with mlflow.start_run():
    rikai.mlflow.pytorch.log_model(
        ssd, 
        "model", 
        OUTPUT_SCHEMA,
        pre_processing="rikai.contrib.torch.transforms.ssd.pre_processing",
        post_processing="rikai.contrib.torch.transforms.ssd.post_processing",
        registered_model_name="ssd"
    )
with mlflow.start_run():
    rikai.mlflow.pytorch.log_model(
        class_scores_extractor,
        "model_scores",
        SSDClassScoresExtractor.SCHEMA,
        pre_processing="rikai.contrib.torch.inspect.ssd.class_scores_extractor_pre_processing",
        post_processing="rikai.contrib.torch.inspect.ssd.class_scores_extractor_post_processing",
        registered_model_name="class_scores"
    )

ModuleNotFoundError: ignored

In [None]:
spark.sql("CREATE OR REPLACE MODEL ssd OPTIONS (batch_size=128) USING 'mlflow:/ssd'")
spark.sql("CREATE OR REPLACE MODEL class_scores OPTIONS (batch_size=128) USING 'mlflow:/class_scores'")

In [None]:
spark.sql("SHOW MODELS").show()


# Least Confidence

**Least Confidence** looks for predictions whose most likely label, $y_1$, has the lowest confidence, i.e. the largest value of

$$ 1 - P(y_1 | x) $$
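To make the score concrete, here is a minimal pure-Python sketch. The probability vectors and the `least_confidence` helper are hypothetical and are not part of Rikai. Sorting by `1 - max(probs)` in descending order gives the same ranking as sorting by the top score in ascending order, which is what the SQL query below does with `ssd.score`.

```python
def least_confidence(probs):
    """Return 1 - P(y_1 | x), where y_1 is the most likely label."""
    return 1.0 - max(probs)


# Hypothetical class-probability vectors for two detections.
confident = [0.90, 0.05, 0.05]
uncertain = [0.40, 0.35, 0.25]

# Rank detections so the least confident one comes first.
ranked = sorted([confident, uncertain], key=least_confidence, reverse=True)
print(ranked[0])
```

Here the second vector ranks first because its top probability is lower, so it would be the first candidate sent for labelling.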

In [None]:
df = spark.sql("""
SELECT image_id, image, explode(ML_PREDICT(ssd, image)) AS ssd FROM (
    SELECT image_id, image FROM coco LIMIT 1000
) ORDER BY ssd.score ASC
""").cache()

In [None]:
from rikai.viz import Text

for row in df.take(3):
    text = COCO_INSTANCE_CATEGORY_NAMES[row.ssd.label_id]
    display(row.image 
        | row.ssd.box@{"color": "yellow", "width": 3} 
        | Text(f"{text} | {row.ssd.score:.3f}", (row.ssd.box.xmin, row.ssd.box.ymax + 3))@{"color": "yellow"}
    )


# Least Margin of Confidence

**Margin of Confidence** looks for examples with the smallest difference between the most likely and second most likely labels, $y_1$ and $y_2$. Intuitively, it shows where the model is most confused between two classes.

$$ P(y_1 | x) - P(y_2 | x) $$
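As a small illustration, the margin can be computed in plain Python as follows. The `margin_of_confidence` helper and the sample vectors are hypothetical. The helper sorts the probabilities itself, whereas the SQL query below subtracts `scores[1]` from `scores[0]` directly, which presumably relies on the scores already being in descending order.

```python
def margin_of_confidence(probs):
    """Return P(y_1 | x) - P(y_2 | x) for the two most likely labels."""
    top1, top2 = sorted(probs, reverse=True)[:2]
    return top1 - top2


# Hypothetical class-probability vectors for two detections.
clear = [0.80, 0.15, 0.05]
confused = [0.45, 0.43, 0.12]

# The smallest margin comes first: the model can barely separate its top two labels.
ranked = sorted([clear, confused], key=margin_of_confidence)
print(ranked[0])
```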

In [None]:
# Assign the result to df so that the cells below operate on this query.
df = spark.sql("""
SELECT image_id, image, detection, detection.scores[0] - detection.scores[1] AS margin FROM (
    SELECT image_id, image, explode(ML_PREDICT(class_scores, image)) AS detection FROM (
        SELECT image_id, image FROM coco LIMIT 100
    )
) ORDER BY margin
""")

In [None]:
df.printSchema()
df.cache()

In [None]:
first = df.first()
label1 = COCO_INSTANCE_CATEGORY_NAMES[first.detection.label_ids[0]]
label2 = COCO_INSTANCE_CATEGORY_NAMES[first.detection.label_ids[1]]
text = f"{label1} - {label2} = {first.margin}"
box = first.detection.box
(
    first.image 
    | box@{"color": "yellow", "width": 3} 
    | Text(text, (box.xmin, box.ymax))@{"color": "yellow"}
)

# Entropy

**Entropy** measures the average level of uncertainty over all $n$ labels. Dividing by $\log_2 n$ scales the score to the range from 0 to 1.

$$ \frac{-\sum_{i=1}^{n} P(y_i | x) \log_2 P(y_i | x)}{\log_2 n} $$
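The formula can be sketched in plain Python as shown here. The `normalized_entropy` helper and the sample vectors are hypothetical. The helper rescales its input to sum to 1 and skips zero entries, which contribute nothing to the sum.

```python
import math


def normalized_entropy(probs):
    """Shannon entropy of `probs` divided by log2(n), giving a value in [0, 1]."""
    n = len(probs)
    if n < 2:
        return 0.0
    total = sum(probs)
    # Rescale so the values sum to 1; skip zeros, which contribute nothing.
    h = -sum((p / total) * math.log2(p / total) for p in probs if p > 0)
    return h / math.log2(n)


# Hypothetical class-probability vectors.
uniform = [0.25, 0.25, 0.25, 0.25]  # maximally uncertain
peaked = [0.97, 0.01, 0.01, 0.01]   # almost certain

print(normalized_entropy(uniform))
print(normalized_entropy(peaked))
```

A uniform distribution scores 1 and a one-hot distribution scores 0, so sorting in descending order puts the most uncertain predictions first.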

In [None]:
from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType
from scipy.stats import entropy as scipyEntropy

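@udf(returnType=FloatType())
def entropy(arr) -> float:
    # scipy.stats.entropy rescales arr to sum to 1 and uses the natural log.
    # It does not divide by log(n), so this is the unnormalized entropy; the
    # ranking is unchanged when every row has the same number of scores.
    return float(scipyEntropy(arr))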

spark.udf.register("entropy", entropy)

In [None]:
# Assign the result to df so that the cells below operate on this query.
df = spark.sql("""
SELECT image_id, image, detection, entropy(detection.scores) AS entropy FROM (
    SELECT image_id, image, explode(ML_PREDICT(class_scores, image)) AS detection FROM (
        SELECT image_id, image FROM coco LIMIT 1000
    )
) ORDER BY entropy DESC
""")

In [None]:
df.cache()

In [None]:
first = df.first()
text = COCO_INSTANCE_CATEGORY_NAMES[first.detection.label_ids[0]]
box = first.detection.box
print(box)
(
    first.image 
    | box@{"color": "yellow", "width": 3} 
    | Text(text, (box.xmin, box.ymax))@{"color": "yellow"}
)