If your goal is to only detect class A, and you have limited labeled data for class A compared to the unlabeled data, a semi-supervised learning approach can be beneficial. Here's a suggested approach using semi-supervised learning in your scenario:

Labeled Data for Class A: Utilize the available labeled data for class A to train a YOLOv8 model initially. This labeled dataset provides the ground truth annotations for class A instances, allowing the model to learn from the labeled examples. However, the limited amount of labeled data may result in suboptimal performance.

Self-Supervised Pretraining: As you mentioned, you have a larger unlabeled dataset. You can leverage this unlabeled data to train a self-supervised model using one of the methods discussed earlier (e.g., instance discrimination, context prediction, or contrastive learning). This self-supervised model will learn to extract useful features and representations from the unlabeled data.

Initializing the Backbone: Take the pretrained weights from the self-supervised model and use them to initialize the early (backbone) layers of your YOLOv8 model. This only works if the self-supervised encoder has the same architecture as those layers, so that the weight shapes match. The YOLOv8 model then starts from representations that already capture meaningful structure in the unlabeled data.

Semi-Supervised Training: After initializing the backbone, fine-tune the YOLOv8 model using a combination of the limited labeled data for class A and the unlabeled data. You can employ semi-supervised learning techniques, such as consistency regularization or pseudo-labeling, to utilize the unlabeled data effectively.

Consistency Regularization: Apply consistency regularization to encourage the model to produce consistent predictions on different augmentations or transformations of the same unlabeled sample. Because this objective needs no annotations, it helps the model learn more robust and generalizable representations from the unlabeled images.

Pseudo-labeling: Pseudo-labeling can also be employed, where the model's predictions on the unlabeled data are used as pseudo-labels for training. Only confident predictions should be kept, since training on wrong pseudo-labels reinforces the model's own mistakes. Used this way, pseudo-labeling expands the labeled dataset and can improve the model's performance.
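The pseudo-labeling step can be sketched as a simple confidence filter. This is a minimal illustration, assuming the detector's output has already been converted to `(box, confidence)` pairs per unlabeled image; the function name `select_pseudo_labels`, the data layout, and the 0.8 threshold are hypothetical and not part of any YOLOv8 API.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x_center, y_center, width, height), normalized
Detection = Tuple[Box, float]            # (box, confidence)


def select_pseudo_labels(
    predictions: Dict[str, List[Detection]],
    min_confidence: float = 0.8,  # hypothetical threshold; tune on a validation set
) -> Dict[str, List[Box]]:
    """Keep only confident detections and drop images left with none."""
    pseudo_labels: Dict[str, List[Box]] = {}
    for image_id, detections in predictions.items():
        confident = [box for box, conf in detections if conf >= min_confidence]
        if confident:
            pseudo_labels[image_id] = confident
    return pseudo_labels


# Toy predictions for two unlabeled tiles (made-up values).
toy_predictions = {
    "tile_a": [((0.5, 0.5, 0.2, 0.2), 0.93), ((0.1, 0.1, 0.05, 0.05), 0.40)],
    "tile_b": [((0.3, 0.7, 0.1, 0.1), 0.55)],
}
pseudo = select_pseudo_labels(toy_predictions)
print(pseudo)
```

Images whose detections all fall below the threshold are left out entirely rather than added as empty negatives, which is one possible design choice among several.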

By combining the limited labeled data for class A with the larger unlabeled dataset and leveraging the benefits of self-supervised learning, you can improve the performance of your YOLOv8 model in detecting class A instances. The self-supervised pretraining helps to capture useful representations from the unlabeled data, while the semi-supervised training enables the model to learn from both labeled and unlabeled data, enhancing its ability to generalize.
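To make the consistency regularization idea described above concrete, here is a minimal sketch, assuming the model's outputs for two augmented views of the same unlabeled image are available as flat lists of scores. The names `consistency_loss` and `total_loss` and the weight `unlabeled_weight` are illustrative assumptions; a real detector would compare matched boxes or class scores instead of raw lists.

```python
from typing import Sequence


def consistency_loss(pred_view_a: Sequence[float], pred_view_b: Sequence[float]) -> float:
    """Mean squared difference between predictions on two views of one sample."""
    if len(pred_view_a) != len(pred_view_b):
        raise ValueError("both views must produce the same number of outputs")
    squared = [(a - b) ** 2 for a, b in zip(pred_view_a, pred_view_b)]
    return sum(squared) / len(squared)


def total_loss(supervised: float, unsupervised: float, unlabeled_weight: float = 0.5) -> float:
    """Labeled loss plus a weighted consistency term from the unlabeled data."""
    return supervised + unlabeled_weight * unsupervised


# Made-up scores for a weakly and a strongly augmented view of the same tile.
weak_view = [0.9, 0.1]
strong_view = [0.7, 0.3]
unsup = consistency_loss(weak_view, strong_view)
combined = total_loss(supervised=1.0, unsupervised=unsup)
print(unsup, combined)
```

The consistency term is zero when both views give identical predictions, so it penalizes only disagreement and needs no labels.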

Remember to tune the model's hyperparameters and settings on a separate validation set, and keep a held-out test set for the final evaluation, so that the reported performance is not biased by the tuning.

In [1]:
# import necessary packages:
import numpy as np
import os
import torch
import pandas as pd
import json
from torch.utils.data import Dataset
import cv2
from tqdm import tqdm
from sklearn.model_selection import StratifiedKFold
import shutil
import ast

In [2]:
print(torch.cuda.is_available())

True


In [3]:
# flags and directories for data and csv:
kaggle_env = False
if kaggle_env:
    data_directory = "/kaggle/input/hubmap-hacking-the-human-vasculature"
else:
    #local_env = r"C:\Users\labadmin\PycharmProjects\hubmap" # edit local directory accordingly
    local_env = r"C:\Users\Kevin\PycharmProjects\hubmap"
    #data_directory = r"C:\Users\labadmin\Desktop\hubmap"
    data_directory = r"C:\Users\Kevin\Desktop\hubmap"
wsi_csv_src = os.path.join(data_directory,"wsi_meta.csv")
tile_csv_src = os.path.join(data_directory,"tile_meta.csv")
train_tile_src = os.path.join(data_directory,"train")
json_src = os.path.join(data_directory,"polygons.jsonl")
wsi_df = pd.read_csv(wsi_csv_src)
tile_df = pd.read_csv(tile_csv_src)

In [5]:
wsi_df

Unnamed: 0,source_wsi,age,sex,race,height,weight,bmi
0,1,58,F,W,160.0,59.0,23.0
1,2,56,F,W,175.2,139.6,45.5
2,3,73,F,W,162.3,87.5,33.2
3,4,53,M,B,166.0,73.0,26.5


### IMPORTANT EDA CONCLUSIONS (from: https://www.kaggle.com/code/leonidkulyk/eda-hubmap-hhv-interactive-annotations): DATASET 1 = EXPERT-REVIEWED ANNOTATIONS, DATASET 2 = SPARSE ANNOTATIONS THAT WERE NOT EXPERT REVIEWED, DATASET 3 = NO LABELS! ALSO, THERE ARE 15 UNIQUE SOURCE_WSIS, MOST WITH ABOUT 600 ANNOTATIONS AND SOME WITH 200-500.

In [5]:
tile_df.head(10)

Unnamed: 0,id,source_wsi,dataset,i,j
0,0006ff2aa7cd,2,2,16896,16420
1,000e79e206b7,6,3,10240,29184
2,00168d1b7522,2,2,14848,14884
3,00176a88fdb0,7,3,14848,25088
4,0033bbc76b6b,1,1,10240,43008
5,003504460b3a,3,2,8192,11776
6,00359ab8338b,8,3,6656,9216
7,00488ca285ee,9,3,8192,37888
8,004daf1cbe75,3,2,6144,11264
9,004fb033dd09,7,3,20480,31232


In [6]:
class Hubmap_dataset(Dataset):
    #credits to: https://www.kaggle.com/code/alincijov/dataset-hubmap-vasc-768x768-segments
    def __init__(self, json_path, image_dir, augments = False, train=True):
        #read jsonl file, contains all info about the annotations
        with open(json_path) as json_file:
            json_list = list(json_file) #list of all jsons
        #create new df
        dataset = []
        for json_str in tqdm(json_list, desc="Json Data Loaded"):
            result = json.loads(json_str)

            annotations = result['annotations']
            for ann in annotations:
                row = {}
                row["id"] = result["id"]
                row["type"] = ann["type"]
                row["coordinates"] = ann["coordinates"]
                row["mask"] = self.coordinates_to_masks(ann["coordinates"], (512, 512))[0]
                row["rle"] = self.mask2enc(row["mask"])
                dataset.append(row)

        # define dataset, to make it easier to get...
        self.dataset = pd.DataFrame(dataset, columns=["id", "type", "coordinates", "mask", "rle"])
        self.train = train
        self.image_dir = image_dir

    def enc2mask(self, encs, shape):
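        '''
        Function to go from input RLE encodings to mask
        :param encs: input RLE encodings
        :param shape: shape of output mask
        :return: label mask of the given shape; pixels of encoding m are set to m + 1
        '''
        img = np.zeros(shape[0]*shape[1], dtype=np.uint8)
        for m,enc in enumerate(encs):
            # np.float was removed in NumPy 1.24; the builtin float also matches np.nan
            if isinstance(enc, float) and np.isnan(enc):
                continue #skip nan's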
            s = enc.split()
            for i in range(len(s)//2):
                start = int(s[2*i]) - 1
                length = int(s[2*i+1])
                img[start:start+length] = 1 + m
        return img.reshape(shape).T

    def mask2enc(self, mask, n=1):
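        '''
        Function to go from input mask to RLE encodings
        :param mask: 2D label mask; pixel value i marks label i
        :param n: number of labels (1..n) to encode
        :return: list of n RLE strings (np.nan for labels absent from the mask)
        '''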
        pixels = mask.T.flatten()
        encs = []
        for i in range(1,n+1):
            p = (pixels == i).astype(np.int8)
            if p.sum() == 0: encs.append(np.nan)
            else:
                p = np.concatenate([[0], p, [0]])
                runs = np.where(p[1:] != p[:-1])[0] + 1
                runs[1::2] -= runs[::2]
                encs.append(' '.join(str(x) for x in runs))
        return encs

    def coordinates_to_masks(self, coordinates, shape):
        masks = []
        for coord in coordinates:
            mask = np.zeros(shape, dtype=np.uint8) #512x512 empty initialization
            # cv2.fillPoly expects int32 points; the default integer dtype can be int64
            cv2.fillPoly(mask, [np.array(coord, dtype=np.int32)], 1)
            masks.append(mask)
        return masks #return appended masks

    def __getitem__(self, idx):

        data = self.dataset.iloc[idx]

        imageLoc = os.path.join(self.image_dir,data.id+".tif")
        img = cv2.imread(imageLoc)

        type_struct = data.type
        coord = data.coordinates[0]

        # create mask array
        mask = np.zeros((512, 512), dtype=np.float32)
        points = np.array(coord, dtype=np.int32)  # cv2.fillPoly expects int32 points
        points = points.reshape((1, -1, 2))
        mask = cv2.fillPoly(mask, pts=points, color=(255))

        return img, type_struct, mask

    def __len__(self):
        return len(self.dataset)
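The run-length encoding used by `mask2enc` and `enc2mask` can be checked with a small round trip. The following is a simplified sketch for a single binary mask (the class methods above additionally handle several labels and `np.nan` for empty masks); the function names `rle_encode` and `rle_decode` are hypothetical stand-ins, and the toy mask is made up.

```python
import numpy as np


def rle_encode(mask: np.ndarray) -> str:
    """Column-major run-length encoding with 1-based start positions."""
    pixels = mask.T.flatten()
    padded = np.concatenate([[0], pixels, [0]])
    # Positions where the value changes mark the start and end of each run.
    runs = np.where(padded[1:] != padded[:-1])[0] + 1
    runs[1::2] -= runs[::2]  # turn end positions into run lengths
    return " ".join(str(x) for x in runs)


def rle_decode(rle: str, shape: tuple) -> np.ndarray:
    """Rebuild a binary mask of shape (height, width) from its encoding."""
    values = [int(v) for v in rle.split()]
    flat = np.zeros(shape[0] * shape[1], dtype=np.uint8)
    for start, length in zip(values[0::2], values[1::2]):
        flat[start - 1:start - 1 + length] = 1
    # Undo the column-major flattening: fill (width, height), then transpose.
    return flat.reshape(shape[1], shape[0]).T


toy_mask = np.array(
    [[0, 1, 1, 0],
     [0, 1, 0, 0],
     [0, 0, 0, 1]],
    dtype=np.uint8,
)
encoded = rle_encode(toy_mask)
decoded = rle_decode(encoded, toy_mask.shape)
print(encoded)
```

The decoder reshapes to `(width, height)` before transposing so that it also works for non-square masks; for the square 512x512 tiles used here the distinction does not matter.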

In [7]:
annotation_df = Hubmap_dataset(json_path = json_src, augments=False,image_dir=train_tile_src, train=True)

Json Data Loaded: 100%|██████████| 1633/1633 [00:13<00:00, 117.69it/s]


### AS SEEN BELOW, ID MATCHES THE IMAGE FILE NAME IN THE DATASET, AND TYPE IS ONE OF BLOOD_VESSEL (TARGET), GLOMERULUS (WE DON'T WANT) OR UNSURE. MOST IMPORTANT: THE NUMBER OF ANNOTATED TILES IS 1633, WHILE THE NUMBER OF UNANNOTATED TILES IS 5400, FOR A TOTAL OF 7033 TILES IN THE IMAGES WE ARE GIVEN. NEED TO TRAIN A SEMI-SUPERVISED MODEL!

In [8]:
annotation_df.dataset

Unnamed: 0,id,type,coordinates,mask,rle
0,0006ff2aa7cd,glomerulus,"[[[167, 249], [166, 249], [165, 249], [164, 24...","[[0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...",[2 105 513 107 1025 111 1537 113 2049 114 2561...
1,0006ff2aa7cd,blood_vessel,"[[[283, 109], [282, 109], [281, 109], [280, 10...","[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...",[139350 10 139860 17 140370 22 140880 27 14139...
2,0006ff2aa7cd,blood_vessel,"[[[104, 292], [103, 292], [102, 292], [101, 29...","[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...",[37108 4 37618 8 38129 11 38640 14 39152 15 39...
3,0006ff2aa7cd,blood_vessel,"[[[505, 442], [504, 442], [503, 442], [502, 44...","[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...",[250783 11 251292 20 251801 26 252311 30 25282...
4,0006ff2aa7cd,blood_vessel,"[[[375, 477], [374, 477], [373, 477], [372, 47...","[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...",[167352 5 167863 7 168374 10 168886 11 169398 ...
...,...,...,...,...,...
17513,ffd3d193c71e,blood_vessel,"[[[184, 308], [183, 308], [182, 308], [181, 30...","[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...",[88875 8 89385 11 89896 13 90407 15 90918 16 9...
17514,ffd3d193c71e,blood_vessel,"[[[42, 92], [41, 92], [40, 92], [39, 92], [38,...","[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...",[14393 12 14904 20 15415 23 15925 28 16434 40 ...
17515,ffd3d193c71e,blood_vessel,"[[[287, 480], [286, 480], [285, 480], [284, 48...","[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...",[140243 6 140750 13 141261 15 141772 17 142284...
17516,ffd3d193c71e,blood_vessel,"[[[493, 388], [492, 388], [491, 388], [490, 38...","[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...",[244001 5 244512 7 244603 6 245023 9 245112 11...


In [5]:
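save_df = False # save only once
if save_df:
    # write the underlying DataFrame; keep annotation_df as the Hubmap_dataset object,
    # because the next cell still reads annotation_df.dataset
    annotation_df.dataset.to_excel(os.path.join(data_directory,"annotation_meta.xlsx"))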

### To create a train_df for an object detection model using only Dataset 1, filter the annotation_df above:

In [6]:
if not save_df:
    tmp_df = pd.read_excel(os.path.join(data_directory,"annotation_meta.xlsx"))
else:
    tmp_df = annotation_df.dataset.copy()
tmp_df

Unnamed: 0.1,Unnamed: 0,id,type,coordinates,mask,rle
0,0,0006ff2aa7cd,glomerulus,"[[[167, 249], [166, 249], [165, 249], [164, 24...",[[0 1 1 ... 0 0 0]\n [1 1 1 ... 0 0 0]\n [1 1 ...,['2 105 513 107 1025 111 1537 113 2049 114 256...
1,1,0006ff2aa7cd,blood_vessel,"[[[283, 109], [282, 109], [281, 109], [280, 10...",[[0 0 0 ... 0 0 0]\n [0 0 0 ... 0 0 0]\n [0 0 ...,['139350 10 139860 17 140370 22 140880 27 1413...
2,2,0006ff2aa7cd,blood_vessel,"[[[104, 292], [103, 292], [102, 292], [101, 29...",[[0 0 0 ... 0 0 0]\n [0 0 0 ... 0 0 0]\n [0 0 ...,['37108 4 37618 8 38129 11 38640 14 39152 15 3...
3,3,0006ff2aa7cd,blood_vessel,"[[[505, 442], [504, 442], [503, 442], [502, 44...",[[0 0 0 ... 0 0 0]\n [0 0 0 ... 0 0 0]\n [0 0 ...,['250783 11 251292 20 251801 26 252311 30 2528...
4,4,0006ff2aa7cd,blood_vessel,"[[[375, 477], [374, 477], [373, 477], [372, 47...",[[0 0 0 ... 0 0 0]\n [0 0 0 ... 0 0 0]\n [0 0 ...,['167352 5 167863 7 168374 10 168886 11 169398...
...,...,...,...,...,...,...
17513,17513,ffd3d193c71e,blood_vessel,"[[[184, 308], [183, 308], [182, 308], [181, 30...",[[0 0 0 ... 0 0 0]\n [0 0 0 ... 0 0 0]\n [0 0 ...,['88875 8 89385 11 89896 13 90407 15 90918 16 ...
17514,17514,ffd3d193c71e,blood_vessel,"[[[42, 92], [41, 92], [40, 92], [39, 92], [38,...",[[0 0 0 ... 0 0 0]\n [0 0 0 ... 0 0 0]\n [0 0 ...,['14393 12 14904 20 15415 23 15925 28 16434 40...
17515,17515,ffd3d193c71e,blood_vessel,"[[[287, 480], [286, 480], [285, 480], [284, 48...",[[0 0 0 ... 0 0 0]\n [0 0 0 ... 0 0 0]\n [0 0 ...,['140243 6 140750 13 141261 15 141772 17 14228...
17516,17516,ffd3d193c71e,blood_vessel,"[[[493, 388], [492, 388], [491, 388], [490, 38...",[[0 0 0 ... 0 0 0]\n [0 0 0 ... 0 0 0]\n [0 0 ...,['244001 5 244512 7 244603 6 245023 9 245112 1...
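The output above suggests that list-valued columns such as `coordinates` come back from the Excel file as plain strings. A minimal sketch of turning such a string back into nested lists with `ast.literal_eval` follows; the cell content is a shortened, made-up example, and whether the notebook later parses the column this way is an assumption.

```python
import ast

# Shortened, made-up example of a `coordinates` cell after an Excel round trip.
raw_cell = "[[[167, 249], [166, 249], [165, 250]]]"

# literal_eval only parses Python literals, so it is safer than eval().
polygons = ast.literal_eval(raw_cell)
outer_ring = polygons[0]

print(type(raw_cell).__name__, "->", type(polygons).__name__)
print(outer_ring[0])
```

The `mask` column is shown in abbreviated form (`...`) in the output above, so it presumably cannot be parsed back this way and would have to be rebuilt from the coordinates.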


In [11]:
supervised_tile_df = tile_df[tile_df["dataset"] == 1]
supervised_tile_ids = supervised_tile_df.id.tolist() #list of picture ids that are in dataset 1
supervised_tile_df

Unnamed: 0,id,source_wsi,dataset,i,j
4,0033bbc76b6b,1,1,10240,43008
16,00656c6f2690,1,1,10240,46080
17,0067d5ad2250,2,1,23552,22528
33,00d75ad65de3,1,1,8192,39424
34,00da70813521,1,1,10240,46592
...,...,...,...,...,...
6844,f86347534ec1,2,1,16896,20992
6895,faba1bf818ae,1,1,3072,39424
6933,fc6def641612,1,1,7680,40960
6951,fd2437954fd8,1,1,5120,39424


In [12]:
supervised_filtered_tile_df = tmp_df[tmp_df['id'].isin(supervised_tile_ids)]
supervised_filtered_tile_df = supervised_filtered_tile_df.reset_index(drop=True)
supervised_filtered_tile_df

Unnamed: 0,id,type,coordinates,mask,rle
0,0033bbc76b6b,blood_vessel,"[[[169, 228], [168, 228], [167, 228], [166, 22...","[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...",[73387 16 73897 28 74401 38 74911 43 75420 49 ...
1,0033bbc76b6b,blood_vessel,"[[[1, 59], [0, 59], [0, 58], [0, 57], [0, 56],...","[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...",[30 31 541 32 1053 31 1566 27 2079 25 2594 21 ...
2,0033bbc76b6b,unsure,"[[[177, 37], [176, 37], [175, 37], [174, 37], ...","[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...",[78855 14 79364 19 79874 22 80385 24 80897 25 ...
3,0033bbc76b6b,blood_vessel,"[[[406, 511], [405, 511], [404, 511], [403, 51...","[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...",[190976 1 191485 4 191995 6 192506 7 193017 8 ...
4,00656c6f2690,blood_vessel,"[[[511, 426], [511, 426], [510, 426], [510, 42...","[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...",[257947 6 258457 12 258967 16 259478 18 259989...
...,...,...,...,...,...
4401,fd2437954fd8,blood_vessel,"[[[481, 454], [480, 454], [479, 454], [478, 45...","[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...",[238002 14 238510 20 239012 32 239520 37 24000...
4402,fd2437954fd8,blood_vessel,"[[[416, 511], [415, 511], [414, 511], [413, 51...","[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...",[191998 3 192502 11 193012 13 193522 15 194032...
4403,fd2437954fd8,blood_vessel,"[[[18, 362], [17, 362], [16, 362], [16, 361], ...","[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...",[3365 50 3874 66 4385 70 4896 72 5407 74 5918 ...
4404,fe248458ea89,blood_vessel,"[[[131, 501], [130, 501], [129, 501], [128, 50...","[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...",[61412 4 61922 8 62432 10 62444 6 62943 21 634...


In [13]:
types = supervised_filtered_tile_df.type.tolist()
print(np.unique(types,return_counts=True))

(array(['blood_vessel', 'glomerulus', 'unsure'], dtype='<U12'), array([3498,   36,  872], dtype=int64))


In [14]:
# now let's just add some extra info from the tile meta, such as source_wsi, i, j, and full image path of the source dataset.
mapping = supervised_tile_df.set_index('id')[['source_wsi', 'i', 'j']].to_dict(orient='index')

# Assign the values from dataframe A to dataframe B based on the 'id' column
supervised_filtered_tile_df[['source_wsi', 'i', 'j']] = supervised_filtered_tile_df['id'].map(mapping).apply(pd.Series)
supervised_filtered_tile_df

Unnamed: 0,id,type,coordinates,mask,rle,source_wsi,i,j
0,0033bbc76b6b,blood_vessel,"[[[169, 228], [168, 228], [167, 228], [166, 22...","[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...",[73387 16 73897 28 74401 38 74911 43 75420 49 ...,1,10240,43008
1,0033bbc76b6b,blood_vessel,"[[[1, 59], [0, 59], [0, 58], [0, 57], [0, 56],...","[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...",[30 31 541 32 1053 31 1566 27 2079 25 2594 21 ...,1,10240,43008
2,0033bbc76b6b,unsure,"[[[177, 37], [176, 37], [175, 37], [174, 37], ...","[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...",[78855 14 79364 19 79874 22 80385 24 80897 25 ...,1,10240,43008
3,0033bbc76b6b,blood_vessel,"[[[406, 511], [405, 511], [404, 511], [403, 51...","[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...",[190976 1 191485 4 191995 6 192506 7 193017 8 ...,1,10240,43008
4,00656c6f2690,blood_vessel,"[[[511, 426], [511, 426], [510, 426], [510, 42...","[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...",[257947 6 258457 12 258967 16 259478 18 259989...,1,10240,46080
...,...,...,...,...,...,...,...,...
4401,fd2437954fd8,blood_vessel,"[[[481, 454], [480, 454], [479, 454], [478, 45...","[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...",[238002 14 238510 20 239012 32 239520 37 24000...,1,5120,39424
4402,fd2437954fd8,blood_vessel,"[[[416, 511], [415, 511], [414, 511], [413, 51...","[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...",[191998 3 192502 11 193012 13 193522 15 194032...,1,5120,39424
4403,fd2437954fd8,blood_vessel,"[[[18, 362], [17, 362], [16, 362], [16, 361], ...","[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...",[3365 50 3874 66 4385 70 4896 72 5407 74 5918 ...,1,5120,39424
4404,fe248458ea89,blood_vessel,"[[[131, 501], [130, 501], [129, 501], [128, 50...","[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...",[61412 4 61922 8 62432 10 62444 6 62943 21 634...,1,10240,44032
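The same enrichment can also be written as a left join. This is an alternative sketch on toy data, not the approach used in the cell above; the frame names `annotations` and `tile_meta` and all rows are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for the annotation table and the tile metadata (made-up rows).
annotations = pd.DataFrame({
    "id": ["tile_a", "tile_a", "tile_b"],
    "type": ["blood_vessel", "unsure", "blood_vessel"],
})
tile_meta = pd.DataFrame({
    "id": ["tile_a", "tile_b", "tile_c"],
    "source_wsi": [1, 2, 1],
    "i": [10240, 23552, 8192],
    "j": [43008, 22528, 39424],
})

# A left join keeps every annotation row and attaches the matching tile metadata.
enriched = annotations.merge(
    tile_meta[["id", "source_wsi", "i", "j"]],
    on="id",
    how="left",
    validate="many_to_one",  # raises if a tile id is duplicated in tile_meta
)
print(enriched)
```

A join avoids building an intermediate dictionary and expanding it row by row with `apply(pd.Series)`, which tends to be slow on tables with thousands of rows.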


### As seen above, we have 3498 blood_vessel annotations that we can trust for supervised training. For now, let's focus only on blood_vessel, since we are training an object detection model for one class. If it does not perform well, we can come back to this and train a 3-class segmentation model.
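For an object detection model, each polygon annotation has to be reduced to a bounding box. The following is a minimal sketch, assuming `(x, y)` pixel coordinates on 512x512 tiles and the normalized `class x_center y_center width height` label layout; the function name `polygon_to_yolo_line` and the class index 0 for blood_vessel are hypothetical, and the actual conversion in preprocessing_yolov8.ipynb may differ.

```python
from typing import Sequence


def polygon_to_yolo_line(
    polygon: Sequence[Sequence[float]],
    image_size: int = 512,  # tiles are 512x512 in this notebook
    class_id: int = 0,      # assumed index for blood_vessel
) -> str:
    """Convert one polygon to a normalized bounding-box label line."""
    xs = [point[0] for point in polygon]
    ys = [point[1] for point in polygon]
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    x_center = (x_min + x_max) / 2 / image_size
    y_center = (y_min + y_max) / 2 / image_size
    width = (x_max - x_min) / image_size
    height = (y_max - y_min) / image_size
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"


# Made-up square polygon for illustration.
toy_polygon = [[128, 64], [256, 64], [256, 192], [128, 192]]
label_line = polygon_to_yolo_line(toy_polygon)
print(label_line)
```

One such line per annotation, written to a text file named after the tile id, would give one label file per image.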

In [15]:
final_df = supervised_filtered_tile_df[supervised_filtered_tile_df["type"] == "blood_vessel"]
final_df = final_df.reset_index(drop=True)
final_df # these are all from dataset 1.


Unnamed: 0,id,type,coordinates,mask,rle,source_wsi,i,j
0,0033bbc76b6b,blood_vessel,"[[[169, 228], [168, 228], [167, 228], [166, 22...","[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...",[73387 16 73897 28 74401 38 74911 43 75420 49 ...,1,10240,43008
1,0033bbc76b6b,blood_vessel,"[[[1, 59], [0, 59], [0, 58], [0, 57], [0, 56],...","[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...",[30 31 541 32 1053 31 1566 27 2079 25 2594 21 ...,1,10240,43008
2,0033bbc76b6b,blood_vessel,"[[[406, 511], [405, 511], [404, 511], [403, 51...","[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...",[190976 1 191485 4 191995 6 192506 7 193017 8 ...,1,10240,43008
3,00656c6f2690,blood_vessel,"[[[511, 426], [511, 426], [510, 426], [510, 42...","[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...",[257947 6 258457 12 258967 16 259478 18 259989...,1,10240,46080
4,00656c6f2690,blood_vessel,"[[[157, 404], [156, 404], [155, 404], [154, 40...","[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...",[58240 8 58750 13 59261 15 59772 16 60284 17 6...,1,10240,46080
...,...,...,...,...,...,...,...,...
3493,fd2437954fd8,blood_vessel,"[[[481, 454], [480, 454], [479, 454], [478, 45...","[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...",[238002 14 238510 20 239012 32 239520 37 24000...,1,5120,39424
3494,fd2437954fd8,blood_vessel,"[[[416, 511], [415, 511], [414, 511], [413, 51...","[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...",[191998 3 192502 11 193012 13 193522 15 194032...,1,5120,39424
3495,fd2437954fd8,blood_vessel,"[[[18, 362], [17, 362], [16, 362], [16, 361], ...","[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...",[3365 50 3874 66 4385 70 4896 72 5407 74 5918 ...,1,5120,39424
3496,fe248458ea89,blood_vessel,"[[[131, 501], [130, 501], [129, 501], [128, 50...","[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...",[61412 4 61922 8 62432 10 62444 6 62943 21 634...,1,10240,44032


In [19]:
save_final_df = False
if save_final_df:
    save_src = r"\\fatherserverdw\Kevin\hubmap\obj_detect_bvonly.xlsx"
    final_df.to_excel(save_src)

In [18]:
ids = final_df.id.tolist()
np.unique(ids) #416 unique images with blood vessel annotations that are trustworthy

array(['0033bbc76b6b', '00656c6f2690', '0067d5ad2250', '00d75ad65de3',
       '00da70813521', '0596bfb19322', '06034408218a', '06b972c417e7',
       '072f5307f243', '0754412b2917', '0788fc3be62e', '07bdbe578ded',
       '089a9e6be240', '093a52e9bc1d', '097dd2ed6c14', '0a10b8716b30',
       '0a43459733e7', '0a5be90855e3', '0acd70e887b3', '0b8029db1fb4',
       '0b849dde56be', '0b89ab7f9f07', '0c3086bd8efb', '0c54942878fa',
       '0d6c0cfee2db', '0d9d65340ef8', '0dce2b8d2c25', '0e0836cf1824',
       '0e6eeb39d8f6', '0e8aed930dc6', '0f5b52a768e2', '10162104cbaf',
       '103906a73e6d', '11128ba6d78c', '1222b4306c01', '12ab9ac0fc55',
       '130d4d323ce4', '13aa34ced90d', '1423b40ad0dc', '15e5df255b86',
       '15ed954c0cdf', '19551df15a1e', '1a54cda8f32d', '1a785cc0d167',
       '1b12cccdfdd2', '1bd726055ad8', '1c6c39a22324', '1c8af4861691',
       '1d8aa548c370', '1e0efbb8979f', '1e14b9acc9b9', '1e2fd6c2950d',
       '1f0d3cd4b621', '209426451a36', '212fc627af24', '2136df4a5aeb',
      

### Note: Dataset 1 has two WSIs (WSIs #1 and #2, from two white female patients). Dataset 2 (sparse annotations, not expert reviewed) has two WSIs (WSIs #3 and #4, from one white female and one black male patient). Dataset 3 has nine further WSIs (WSIs #6 to #14) and is unlabeled (we will use it later for contrastive learning).

### Since we only deal with dataset 1 right now, we will stratify on source_wsi (1 and 2), since the two are unbalanced. Let's do 5-fold stratified CV: store the random fold assignment in the df and save it, so that we can load this df when training the object detection model. For a segmentation model we would have to use StratifiedGroupKFold, stratifying over classes while grouping by patient or source_wsi. But first, let's create a .txt label file for each blood vessel image for YOLOv8 by modifying our final_df above. The YOLOv8 preprocessing is done in preprocessing_yolov8.ipynb.
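To illustrate what stratifying on source_wsi means, here is a dependency-free sketch that spreads each stratum evenly over the folds. It is a simplified stand-in for `StratifiedKFold` (imported at the top of the notebook), not a reimplementation of it; the function name `assign_stratified_folds` and the toy counts are assumptions. The strata should presumably be given per tile rather than per annotation, so that all annotations of one tile land in the same fold.

```python
import random
from collections import Counter, defaultdict
from typing import Dict, Hashable, List, Sequence


def assign_stratified_folds(
    strata: Sequence[Hashable], n_splits: int = 5, seed: int = 0
) -> List[int]:
    """Return a fold index per sample so each stratum is spread evenly over folds."""
    rng = random.Random(seed)
    indices_by_stratum: Dict[Hashable, List[int]] = defaultdict(list)
    for index, stratum in enumerate(strata):
        indices_by_stratum[stratum].append(index)

    folds = [-1] * len(strata)
    for indices in indices_by_stratum.values():
        rng.shuffle(indices)  # random order within the stratum
        for position, index in enumerate(indices):
            folds[index] = position % n_splits  # deal samples round-robin
    return folds


# Toy imbalance: 20 tiles from WSI 1 and 10 tiles from WSI 2 (made-up counts).
toy_source_wsi = [1] * 20 + [2] * 10
toy_folds = assign_stratified_folds(toy_source_wsi)
per_fold = Counter(zip(toy_folds, toy_source_wsi))
print(sorted(per_fold.items()))
```

Each fold then keeps the 2:1 ratio of the two WSIs, which is the property stratification is meant to preserve.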

### Since there is not a lot of training data, let's also try including the annotations from dataset 2 and the glomerulus class, and compare the results with the dataset-1, blood-vessel-only setup:

In [7]:
tmp_df = pd.read_excel(os.path.join(data_directory,"annotation_meta.xlsx"))

In [13]:
supervised_tile_df = tile_df[tile_df["dataset"] == 1]
supervised_tile_df2 = tile_df[tile_df["dataset"] == 2]
supervised_tile_df3 = pd.concat([supervised_tile_df,supervised_tile_df2],axis=0) # 422 + 1211 = 1633 images
supervised_tile_ids = supervised_tile_df3.id.tolist() #list of picture ids that are in dataset 1 or 2
supervised_tile_df3

Unnamed: 0,id,source_wsi,dataset,i,j
4,0033bbc76b6b,1,1,10240,43008
16,00656c6f2690,1,1,10240,46080
17,0067d5ad2250,2,1,23552,22528
33,00d75ad65de3,1,1,8192,39424
34,00da70813521,1,1,10240,46592
...,...,...,...,...,...
7016,ff434af74304,4,2,3072,22528
7017,ff4897b3eda6,4,2,11776,20992
7021,ff66dec71c4c,3,2,5120,10752
7025,ff99cdef0f2a,4,2,5120,24064


In [17]:
supervised_filtered_tile_df = tmp_df[tmp_df['id'].isin(supervised_tile_ids)]
supervised_filtered_tile_df = supervised_filtered_tile_df.reset_index(drop=True)
supervised_filtered_tile_df

Unnamed: 0.1,Unnamed: 0,id,type,coordinates,mask,rle
0,0,0006ff2aa7cd,glomerulus,"[[[167, 249], [166, 249], [165, 249], [164, 24...",[[0 1 1 ... 0 0 0]\n [1 1 1 ... 0 0 0]\n [1 1 ...,['2 105 513 107 1025 111 1537 113 2049 114 256...
1,1,0006ff2aa7cd,blood_vessel,"[[[283, 109], [282, 109], [281, 109], [280, 10...",[[0 0 0 ... 0 0 0]\n [0 0 0 ... 0 0 0]\n [0 0 ...,['139350 10 139860 17 140370 22 140880 27 1413...
2,2,0006ff2aa7cd,blood_vessel,"[[[104, 292], [103, 292], [102, 292], [101, 29...",[[0 0 0 ... 0 0 0]\n [0 0 0 ... 0 0 0]\n [0 0 ...,['37108 4 37618 8 38129 11 38640 14 39152 15 3...
3,3,0006ff2aa7cd,blood_vessel,"[[[505, 442], [504, 442], [503, 442], [502, 44...",[[0 0 0 ... 0 0 0]\n [0 0 0 ... 0 0 0]\n [0 0 ...,['250783 11 251292 20 251801 26 252311 30 2528...
4,4,0006ff2aa7cd,blood_vessel,"[[[375, 477], [374, 477], [373, 477], [372, 47...",[[0 0 0 ... 0 0 0]\n [0 0 0 ... 0 0 0]\n [0 0 ...,['167352 5 167863 7 168374 10 168886 11 169398...
...,...,...,...,...,...,...
17513,17513,ffd3d193c71e,blood_vessel,"[[[184, 308], [183, 308], [182, 308], [181, 30...",[[0 0 0 ... 0 0 0]\n [0 0 0 ... 0 0 0]\n [0 0 ...,['88875 8 89385 11 89896 13 90407 15 90918 16 ...
17514,17514,ffd3d193c71e,blood_vessel,"[[[42, 92], [41, 92], [40, 92], [39, 92], [38,...",[[0 0 0 ... 0 0 0]\n [0 0 0 ... 0 0 0]\n [0 0 ...,['14393 12 14904 20 15415 23 15925 28 16434 40...
17515,17515,ffd3d193c71e,blood_vessel,"[[[287, 480], [286, 480], [285, 480], [284, 48...",[[0 0 0 ... 0 0 0]\n [0 0 0 ... 0 0 0]\n [0 0 ...,['140243 6 140750 13 141261 15 141772 17 14228...
17516,17516,ffd3d193c71e,blood_vessel,"[[[493, 388], [492, 388], [491, 388], [490, 38...",[[0 0 0 ... 0 0 0]\n [0 0 0 ... 0 0 0]\n [0 0 ...,['244001 5 244512 7 244603 6 245023 9 245112 1...


In [18]:
types = supervised_filtered_tile_df.type.tolist()
print(np.unique(types,return_counts=True))

(array(['blood_vessel', 'glomerulus', 'unsure'], dtype='<U12'), array([16054,   567,   897], dtype=int64))


In [22]:
# now let's just add some extra info from the tile meta, such as source_wsi, i, j, and full image path of the source dataset.
mapping = supervised_tile_df3.set_index('id')[['source_wsi', 'i', 'j']].to_dict(orient='index')

# Assign the values from dataframe A to dataframe B based on the 'id' column
supervised_filtered_tile_df[['source_wsi', 'i', 'j']] = supervised_filtered_tile_df['id'].map(mapping).apply(pd.Series)
supervised_filtered_tile_df

Unnamed: 0.1,Unnamed: 0,id,type,coordinates,mask,rle,source_wsi,i,j
0,0,0006ff2aa7cd,glomerulus,"[[[167, 249], [166, 249], [165, 249], [164, 24...",[[0 1 1 ... 0 0 0]\n [1 1 1 ... 0 0 0]\n [1 1 ...,['2 105 513 107 1025 111 1537 113 2049 114 256...,2,16896,16420
1,1,0006ff2aa7cd,blood_vessel,"[[[283, 109], [282, 109], [281, 109], [280, 10...",[[0 0 0 ... 0 0 0]\n [0 0 0 ... 0 0 0]\n [0 0 ...,['139350 10 139860 17 140370 22 140880 27 1413...,2,16896,16420
2,2,0006ff2aa7cd,blood_vessel,"[[[104, 292], [103, 292], [102, 292], [101, 29...",[[0 0 0 ... 0 0 0]\n [0 0 0 ... 0 0 0]\n [0 0 ...,['37108 4 37618 8 38129 11 38640 14 39152 15 3...,2,16896,16420
3,3,0006ff2aa7cd,blood_vessel,"[[[505, 442], [504, 442], [503, 442], [502, 44...",[[0 0 0 ... 0 0 0]\n [0 0 0 ... 0 0 0]\n [0 0 ...,['250783 11 251292 20 251801 26 252311 30 2528...,2,16896,16420
4,4,0006ff2aa7cd,blood_vessel,"[[[375, 477], [374, 477], [373, 477], [372, 47...",[[0 0 0 ... 0 0 0]\n [0 0 0 ... 0 0 0]\n [0 0 ...,['167352 5 167863 7 168374 10 168886 11 169398...,2,16896,16420
...,...,...,...,...,...,...,...,...,...
17513,17513,ffd3d193c71e,blood_vessel,"[[[184, 308], [183, 308], [182, 308], [181, 30...",[[0 0 0 ... 0 0 0]\n [0 0 0 ... 0 0 0]\n [0 0 ...,['88875 8 89385 11 89896 13 90407 15 90918 16 ...,3,7680,16896
17514,17514,ffd3d193c71e,blood_vessel,"[[[42, 92], [41, 92], [40, 92], [39, 92], [38,...",[[0 0 0 ... 0 0 0]\n [0 0 0 ... 0 0 0]\n [0 0 ...,['14393 12 14904 20 15415 23 15925 28 16434 40...,3,7680,16896
17515,17515,ffd3d193c71e,blood_vessel,"[[[287, 480], [286, 480], [285, 480], [284, 48...",[[0 0 0 ... 0 0 0]\n [0 0 0 ... 0 0 0]\n [0 0 ...,['140243 6 140750 13 141261 15 141772 17 14228...,3,7680,16896
17516,17516,ffd3d193c71e,blood_vessel,"[[[493, 388], [492, 388], [491, 388], [490, 38...",[[0 0 0 ... 0 0 0]\n [0 0 0 ... 0 0 0]\n [0 0 ...,['244001 5 244512 7 244603 6 245023 9 245112 1...,3,7680,16896


In [25]:
np.unique(supervised_filtered_tile_df.type,return_counts=True)

(array(['blood_vessel', 'glomerulus', 'unsure'], dtype=object),
 array([16054,   567,   897], dtype=int64))

In [28]:
### Total of 16054 bv annotations and 567 glomerulus annotations!

final_df = supervised_filtered_tile_df[supervised_filtered_tile_df["type"] == "blood_vessel"]
final_df2 = supervised_filtered_tile_df[supervised_filtered_tile_df["type"] == "glomerulus"]
final_df3 = pd.concat([final_df,final_df2],axis=0)
final_df3 = final_df3.reset_index(drop=True)
final_df3  # these are from datasets 1 and 2.

save_final_df = False #already saved
if save_final_df:
    save_src = r"\\fatherserverdw\Kevin\hubmap\obj_detect_bv_glo.xlsx"
    final_df3.to_excel(save_src)
ids = final_df3.id.tolist()
len(np.unique(ids))  #1622 unique images with blood vessel + glomerulus annotations

1622