# Full Pipeline for Automated Passenger Tracking - Evaluation
This notebook contains the evaluation of the full pipeline. It also shows how the full pipeline works. The testset used in this notebook was a custom dataset, which consists of frames from real train CCTV footage, provided to us by Televic GSP/Rail, which has been manually annotated by the authors of this notebook.

## Step 0: Reading in the test-data.
In this step, we'll read in and parse the raw annotated testset. After each step, we'll save the progress, so we don't have to rerun the entire pipeline every time we want to update our code.

In [1]:
import os
import json
import pandas as pd
from tqdm import tqdm
from pathlib import Path

In [2]:
# Define some constants and paths
test_set_folder = Path().resolve() / 'testset'
image_folder = test_set_folder / 'images'
description_folder = test_set_folder / 'descriptions'
assert test_set_folder.exists(), "Can't find the folder containing the testset"
assert image_folder.exists(), "Can't find the folder containing the testset images"
assert description_folder.exists(), "Can't find the folder containing the testset description jsons"

In [3]:
all_data = []
for json_file in tqdm(os.listdir(description_folder)):
    with open(description_folder / json_file, "r") as fd:
        data = json.load(fd)
    source_file = Path(data["asset"]['path']).name
    for region in data['regions']:
        tag = region['tags'][0]
        point1, point2 = [(region['points'][0]['x'], region['points'][0]['y']), (region['points'][2]['x'], region['points'][2]['y'])]
        all_data.append([source_file, tag, point1, point2])
df_step_0 = pd.DataFrame(all_data, columns=["image", "tag", "left bottom", "right top"])
df_step_0.to_csv(str(test_set_folder / "step_0_df.csv"), index=False)

100%|██████████| 23/23 [00:00<00:00, 2090.97it/s]


In [4]:
df_step_0

Unnamed: 0,image,tag,left bottom,right top
0,vlc-record-2019-02-15-10h16m31s-rtsp___80_0.jpg,Persoon 1,"(0, 266.62240663900417)","(303.8008298755187, 1024)"
1,vlc-record-2019-02-15-10h16m31s-rtsp___80_0.jpg,Persoon 2,"(116.84647302904564, 205.0124481327801)","(189.07883817427387, 268.746887966805)"
2,vlc-record-2019-02-15-10h16m31s-rtsp___80_0.jpg,Persoon 3,"(278.3070539419087, 74.35684647302905)","(402.58921161825725, 388.78008298755185)"
3,vlc-record-2019-02-15-10h16m31s-rtsp___80_0.jpg,Persoon 5,"(826.4232365145228, 294.2406639004149)","(979.3858921161826, 374.97095435684645)"
4,vlc-record-2019-02-15-10h16m31s-rtsp___80_0.jpg,Persoon 6,"(787.1203319502075, 370.7219917012448)","(1018.688796680498, 778.6224066390041)"
...,...,...,...,...
147,vlc-record-2019-02-15-10h16m31s-rtsp___80_1014...,Persoon 4,"(551.3029045643153, 222.00829875518673)","(834.9211618257261, 472.6970954356847)"
148,vlc-record-2019-02-15-10h16m31s-rtsp___80_1014...,Persoon 7,"(389.84232365145226, 146.58921161825725)","(550.2406639004149, 619.2863070539419)"
149,vlc-record-2019-02-15-10h16m31s-rtsp___80_1014...,Persoon 9,"(250.68879668049792, 169.95850622406638)","(433.3941908713693, 655.402489626556)"
150,vlc-record-2019-02-15-10h16m31s-rtsp___80_1014...,Persoon 5,"(0, 281.49377593361)","(278.3070539419087, 1024)"


## Step 1: Extracting people from the images
The first step in the pipeline is to extract the people from the images. Currently, these extracted persons are saved as separate images.

In [5]:
import torch
import cv2
from PIL import Image
import matplotlib.pyplot as plt
from transformers import DetrImageProcessor, DetrForObjectDetection, DetrFeatureExtractor

2023-05-16 13:41:41.476645: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-05-16 13:41:41.610870: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-05-16 13:41:42.808919: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /vsc-hard-mounts/leuven-data/351/vsc35135/miniconda3/lib/python3.9/site-packages/cv2/..

In [6]:
# Define some constants and paths
DETR_MODEL = "facebook/detr-resnet-50"
DETR_OUTPUT = Path().resolve()/ 'IMAGE_OUTPUT_PATH'
assert DETR_OUTPUT.exists(), f"Can't find the folder to store the cut-out persons: {DETR_OUTPUT}"

First, read in the data that was prepared for us in the previous step.

In [7]:
df_step_0 = pd.read_csv(test_set_folder / "step_0_df.csv")

Then, load the model and feature extractor

In [8]:
detr_feature_extractor = DetrFeatureExtractor.from_pretrained(DETR_MODEL)
detr_model = DetrForObjectDetection.from_pretrained(DETR_MODEL)
detr_model = detr_model.eval()

The `max_size` parameter is deprecated and will be removed in v4.26. Please specify in `size['longest_edge'] instead`.


In [9]:
def get_iou(left_bottom, right_top, bounding_box):
    X, Y, W, H = bounding_box.int()
    box1 = [*eval(left_bottom), *eval(right_top)]
    box2 = [X.item(), Y.item(), (X+W).item(), (Y+H).item()]
    x1, y1, x2, y2 = max(box1[0], box2[0]), max(box1[1], box2[1]), min(box1[2], box2[2]), min(box1[3], box2[3])
    interArea = abs(max((x2 - x1, 0)) * max((y2 - y1), 0))
    box1Area = abs((box1[2] - box1[0]) * (box1[3] - box1[1]))
    box2Area = abs((box2[2] - box2[0]) * (box2[3] - box2[1]))
    iou = interArea / float(box1Area + box2Area - interArea)
    return iou

In [10]:
step_1_data = []

for index, row in tqdm(df_step_0.iterrows(), total=len(df_step_0)):
    # Load image
    image = Image.open(image_folder / row["image"])
    # Extract image features
    inputs = detr_feature_extractor(image, return_tensors="pt")
    inputs.keys()
    outputs = detr_model(**inputs)
    # keep bounding boxes with a 0.90 threshold and get their labels
    probability = outputs.logits.softmax(-1)[0, :, :-1]
    threshold = probability.max(-1).values > 0.90
    labels = [detr_model.config.id2label[p.argmax().item()] for p in probability[threshold]]
    # rescale 
    size = torch.tensor(image.size[::-1]).unsqueeze(0)
    output_after = detr_feature_extractor.post_process(outputs, size)
    bounding_boxes = output_after[0]['boxes'][threshold]
    
    i = 0
    for bounding_box, label in zip(bounding_boxes, labels):
        # Only keep bounding boxes labeled as person
        if label == "person":
            truth_label, max_iou = None, 0
            i_image = cv2.imread(str(image_folder / row["image"]))
            stem = row["image"].split(".")[0]
            save_as = DETR_OUTPUT / f"{stem}_{i}.JPG"
            X, Y, W, H = bounding_box.int()
            coordinates = i_image[Y:H, X:W]
            coordinates = cv2.cvtColor(coordinates, cv2.COLOR_BGR2RGB)
            plt.imshow(coordinates)
            plt.savefig(str(save_as))
            iou = get_iou(row["left bottom"], row["right top"], bounding_box)
            if iou > max_iou:
                truth_label = row["tag"]
                max_iou = iou
            plt.close()
            step_1_data.append([row["image"], f"{stem}_{i}.JPG", truth_label, max_iou])
            i += 1
    image.close()
df_step_1 = pd.DataFrame(step_1_data, columns=["full_image", "image_snippet", "person_id", "iou"])

  0%|          | 0/152 [00:00<?, ?it/s]`post_process` is deprecated and will be removed in v5 of Transformers, please use `post_process_object_detection`
100%|██████████| 152/152 [03:15<00:00,  1.29s/it]


In [19]:
df_step_1.to_csv(str(test_set_folder / "step_1_df.csv"), index=False)
df_step_1

Unnamed: 0,full_image,image_snippet,person_id,iou
0,vlc-record-2019-02-15-10h16m31s-rtsp___80_0.jpg,vlc-record-2019-02-15-10h16m31s-rtsp___80_0_0.JPG,,0.000000
1,vlc-record-2019-02-15-10h16m31s-rtsp___80_0.jpg,vlc-record-2019-02-15-10h16m31s-rtsp___80_0_1.JPG,,0.000000
2,vlc-record-2019-02-15-10h16m31s-rtsp___80_0.jpg,vlc-record-2019-02-15-10h16m31s-rtsp___80_0_2.JPG,Persoon 1,0.001705
3,vlc-record-2019-02-15-10h16m31s-rtsp___80_0.jpg,vlc-record-2019-02-15-10h16m31s-rtsp___80_0_3.JPG,Persoon 1,0.439613
4,vlc-record-2019-02-15-10h16m31s-rtsp___80_0.jpg,vlc-record-2019-02-15-10h16m31s-rtsp___80_0_4.JPG,,0.000000
...,...,...,...,...
1017,vlc-record-2019-02-15-10h16m31s-rtsp___80_1014...,vlc-record-2019-02-15-10h16m31s-rtsp___80_1014...,,0.000000
1018,vlc-record-2019-02-15-10h16m31s-rtsp___80_1014...,vlc-record-2019-02-15-10h16m31s-rtsp___80_1014...,Persoon 1,0.022706
1019,vlc-record-2019-02-15-10h16m31s-rtsp___80_1014...,vlc-record-2019-02-15-10h16m31s-rtsp___80_1014...,,0.000000
1020,vlc-record-2019-02-15-10h16m31s-rtsp___80_1014...,vlc-record-2019-02-15-10h16m31s-rtsp___80_1014...,,0.000000


## Step 2: Creating captions for each image
The second step in the pipeline is to use these extracted images as input in the transformer encoder-decoder pair to create captions.

In [12]:
from transformers import GPT2TokenizerFast, ViTImageProcessor, VisionEncoderDecoderModel

In [13]:
# Define some constants and paths
VITGPT2_BASE_MODEL = "nlpconnect/vit-gpt2-image-captioning"
VITGPT2_RETRAINED_MODEL = Path().resolve() / 'Models' / 'retrained_model'
DETR_OUTPUT = Path().resolve() / 'IMAGE_OUTPUT_PATH'
assert VITGPT2_RETRAINED_MODEL.exists(), f"Can't find the retrained model. Did you forget to unzip it?"
assert DETR_OUTPUT.exists(), f"Can't find the folder to store the cut-out persons: {DETR_OUTPUT}"

First, read in the data that was prepared for us in the previous step.

In [14]:
df_step_1 = pd.read_csv(test_set_folder / "step_1_df.csv")

Then, load the model and feature extractor

In [15]:
visionencoderdecoder = VisionEncoderDecoderModel.from_pretrained(VITGPT2_RETRAINED_MODEL)
vitgpt2_image_processor = ViTImageProcessor.from_pretrained(VITGPT2_BASE_MODEL)
vitgpt2_tokenizer = GPT2TokenizerFast.from_pretrained(VITGPT2_BASE_MODEL)

# GPT2 only has bos/eos tokens but not decoder_start/pad tokens
vitgpt2_tokenizer.pad_token = vitgpt2_tokenizer.eos_token
# update the model config
visionencoderdecoder.config.eos_token_id = vitgpt2_tokenizer.eos_token_id
visionencoderdecoder.config.decoder_start_token_id = vitgpt2_tokenizer.bos_token_id
visionencoderdecoder.config.pad_token_id = vitgpt2_tokenizer.pad_token_id

In [16]:
transcriptions = []
i_images = []
df_step_2 = df_step_1.copy(deep=True)
model_kwargs = {"max_new_tokens": 25}
for index, row in tqdm(df_step_1.iterrows(), total=len(df_step_1)):
    if index % 10 == 9:
        pixel_values = vitgpt2_image_processor(images=i_images, return_tensors="pt").pixel_values
        output_ids = visionencoderdecoder.generate(pixel_values=pixel_values, **model_kwargs)
        predictions = vitgpt2_tokenizer.batch_decode(output_ids, skip_special_tokens=True)
        transcriptions.extend(predictions)
        df_step_2["caption"] = pd.Series(transcriptions)
        i_images = []
    else:
        full_image_path = DETR_OUTPUT / row['image_snippet']
        i_image = Image.open(full_image_path)
        if i_image.mode != "RGB":
            i_image = i_image.convert(mode="RGB")
        i_images.append(i_image)
pixel_values = vitgpt2_image_processor(images=i_images, return_tensors="pt").pixel_values
output_ids = visionencoderdecoder.generate(pixel_values=pixel_values, **model_kwargs)
predictions = vitgpt2_tokenizer.batch_decode(output_ids, skip_special_tokens=True)
transcriptions.extend(predictions)
df_step_2["caption"] = pd.Series(transcriptions)

100%|██████████| 1022/1022 [06:01<00:00,  2.83it/s]


In [18]:
df_step_2.to_csv(str(test_set_folder / "step_2_df.csv"), index=False)
df_step_2

Unnamed: 0,full_image,image_snippet,person_id,iou,caption
0,vlc-record-2019-02-15-10h16m31s-rtsp___80_0.jpg,vlc-record-2019-02-15-10h16m31s-rtsp___80_0_0.JPG,,0.000000,A woman in a black coat is walking down the st...
1,vlc-record-2019-02-15-10h16m31s-rtsp___80_0.jpg,vlc-record-2019-02-15-10h16m31s-rtsp___80_0_1.JPG,,0.000000,A man in a black coat is walking down the stre...
2,vlc-record-2019-02-15-10h16m31s-rtsp___80_0.jpg,vlc-record-2019-02-15-10h16m31s-rtsp___80_0_2.JPG,Persoon 1,0.001705,A woman in a black coat is walking down the st...
3,vlc-record-2019-02-15-10h16m31s-rtsp___80_0.jpg,vlc-record-2019-02-15-10h16m31s-rtsp___80_0_3.JPG,Persoon 1,0.439613,A woman in a black coat is walking down the st...
4,vlc-record-2019-02-15-10h16m31s-rtsp___80_0.jpg,vlc-record-2019-02-15-10h16m31s-rtsp___80_0_4.JPG,,0.000000,A man in a black coat is walking down the stre...
...,...,...,...,...,...
1017,vlc-record-2019-02-15-10h16m31s-rtsp___80_1014...,vlc-record-2019-02-15-10h16m31s-rtsp___80_1014...,,0.000000,
1018,vlc-record-2019-02-15-10h16m31s-rtsp___80_1014...,vlc-record-2019-02-15-10h16m31s-rtsp___80_1014...,Persoon 1,0.022706,
1019,vlc-record-2019-02-15-10h16m31s-rtsp___80_1014...,vlc-record-2019-02-15-10h16m31s-rtsp___80_1014...,,0.000000,
1020,vlc-record-2019-02-15-10h16m31s-rtsp___80_1014...,vlc-record-2019-02-15-10h16m31s-rtsp___80_1014...,,0.000000,
