If your goal is to detect only class A, and you have little labeled data for class A compared to the unlabeled data, semi-supervised learning can help. Here's a suggested approach for your scenario:

Labeled Data for Class A: Utilize the available labeled data for class A to train a YOLOv8 model initially. This labeled dataset provides the ground truth annotations for class A instances, allowing the model to learn from the labeled examples. However, the limited amount of labeled data may result in suboptimal performance.

Self-Supervised Pretraining: As you mentioned, you have a larger unlabeled dataset. You can leverage this unlabeled data to train a self-supervised model using one of the methods discussed earlier (e.g., instance discrimination, context prediction, or contrastive learning). This self-supervised model will learn to extract useful features and representations from the unlabeled data.
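
To make the contrastive option more concrete, here is a minimal NumPy sketch of an InfoNCE-style loss over two batches of embeddings, where row i of each batch is assumed to come from two augmented views of the same tile. The function name `info_nce_loss` and the default `temperature` are illustrative assumptions; an actual pretraining run would compute this in PyTorch on the outputs of an encoder.

```python
import numpy as np

def info_nce_loss(z1, z2, temperature=0.1):
    """InfoNCE-style loss: row i of z1 should match row i of z2 and no other row."""
    # L2-normalize so the dot product is a cosine similarity
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature                      # (N, N) similarity matrix
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # positives sit on the diagonal; average their negative log-probability
    return float(-np.mean(np.diag(log_prob)))

# toy check: correctly paired views should score lower than mismatched ones
rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))
loss_aligned = info_nce_loss(z, z)
loss_mismatched = info_nce_loss(z, z[::-1])
```

Minimizing this loss pulls the two views of each tile together and pushes views of different tiles apart, which is what lets the encoder learn features without labels.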

Initializing the Backbone: Take the pretrained weights from the self-supervised model and use them to initialize the backbone (the early feature-extraction layers) of your YOLOv8 model. This only works if the self-supervised model uses the same backbone architecture, so that the weights can be mapped layer by layer. The YOLOv8 model then starts from representations that already capture meaningful information from the unlabeled data.

Semi-Supervised Training: After initializing the initial layers, fine-tune the YOLOv8 model using a combination of the limited labeled data for class A and the unlabeled data. You can employ semi-supervised learning techniques, such as consistency regularization or pseudo-labeling, to utilize the unlabeled data effectively.

Consistency Regularization: Apply consistency regularization to encourage the model to produce consistent predictions on different augmentations or transformations of the same unlabeled sample. This helps the model learn more robust and generalizable representations even though labeled data for class A is scarce.
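
As a simplified illustration of the idea, the sketch below measures how much a dense prediction map changes when the input is flipped horizontally and the prediction is flipped back. The names `flip_consistency_loss` and `predict` are hypothetical, and a detector such as YOLOv8 outputs boxes rather than a dense map, so a real implementation would also need to map and match boxes between the two views.

```python
import numpy as np

def flip_consistency_loss(predict, image):
    """Mean squared disagreement between the prediction on an image and the
    un-flipped prediction on its horizontally flipped copy."""
    pred = predict(image)
    pred_on_flipped = predict(image[:, ::-1])
    # flip the second prediction back so both maps are in the same frame
    return float(np.mean((pred - pred_on_flipped[:, ::-1]) ** 2))

# toy predictors (assumptions for illustration only)
rng = np.random.default_rng(0)
image = rng.random((4, 6))
pixelwise = lambda x: (x > 0.5).astype(float)             # flip-equivariant
position_biased = lambda x: x * np.arange(x.shape[1])     # depends on column index

loss_consistent = flip_consistency_loss(pixelwise, image)
loss_inconsistent = flip_consistency_loss(position_biased, image)
```

On unlabeled tiles this term can be added to the supervised loss, since it needs no ground truth.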

Pseudo-labeling: Pseudo-labeling can also be employed, where the model's predictions on the unlabeled data are used as pseudo-labels for training. Keep only high-confidence predictions, because wrong pseudo-labels reinforce the model's own mistakes. This effectively expands the labeled dataset and can improve the model's performance.
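
The selection step of pseudo-labeling can be sketched as a simple confidence filter. The detection format (dicts with `box` and `confidence` keys), the function name, and the 0.9 threshold are assumptions for illustration; the real format depends on how the detector's outputs are parsed.

```python
def select_pseudo_labels(detections, min_confidence=0.9):
    """Keep only detections confident enough to be trusted as training labels."""
    return [d for d in detections if d["confidence"] >= min_confidence]

# hypothetical detections on one unlabeled tile
detections = [
    {"box": (10, 20, 50, 60), "confidence": 0.95},
    {"box": (5, 5, 15, 15), "confidence": 0.40},
]
pseudo_labels = select_pseudo_labels(detections)
```

A high threshold gives fewer but cleaner pseudo-labels; it is usually tuned on the validation set.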

By combining the limited labeled data for class A with the larger unlabeled dataset and leveraging the benefits of self-supervised learning, you can improve the performance of your YOLOv8 model in detecting class A instances. The self-supervised pretraining helps to capture useful representations from the unlabeled data, while the semi-supervised training enables the model to learn from both labeled and unlabeled data, enhancing its ability to generalize.

Remember to tune the model's hyperparameters and settings on a separate validation set, and keep a held-out test set untouched until the final evaluation so that the reported performance stays unbiased.

In [1]:
# import necessary packages:
import numpy as np
import os
import torch
import pandas as pd
import json
from torch.utils.data import Dataset
import cv2
from tqdm import tqdm

In [2]:
print(torch.cuda.is_available())

True


In [3]:
# flags and directories for data and csv:
kaggle_env = False
if kaggle_env:
    data_directory = "/kaggle/input/hubmap-hacking-the-human-vasculature"
else:
    #local_env = r"C:\Users\labadmin\PycharmProjects\hubmap" # edit local directory accordingly
    local_env = r"C:\Users\Kevin\PycharmProjects\hubmap"
    #data_directory = r"C:\Users\labadmin\Desktop\hubmap"
    data_directory = r"C:\Users\Kevin\Desktop\hubmap"
wsi_csv_src = os.path.join(data_directory,"wsi_meta.csv")
tile_csv_src = os.path.join(data_directory,"tile_meta.csv")
train_tile_src = os.path.join(data_directory,"train")
json_src = os.path.join(data_directory,"polygons.jsonl")
wsi_df = pd.read_csv(wsi_csv_src)
tile_df = pd.read_csv(tile_csv_src)

In [4]:
wsi_df

Unnamed: 0,source_wsi,age,sex,race,height,weight,bmi
0,1,58,F,W,160.0,59.0,23.0
1,2,56,F,W,175.2,139.6,45.5
2,3,73,F,W,162.3,87.5,33.2
3,4,53,M,B,166.0,73.0,26.5


### IMPORTANT EDA CONCLUSIONS (from: https://www.kaggle.com/code/leonidkulyk/eda-hubmap-hhv-interactive-annotations): DATASET 1 = EXPERT-REVIEWED ANNOTATIONS, DATASET 2 = SPARSE ANNOTATIONS NOT REVIEWED BY EXPERTS, DATASET 3 = NO LABELS! ALSO, THERE ARE 15 UNIQUE SOURCE_WSIS, WITH MOST WSIS HAVING AROUND 600 ANNOTATIONS AND SOME HAVING 200-500.

In [5]:
tile_df.head(10)

Unnamed: 0,id,source_wsi,dataset,i,j
0,0006ff2aa7cd,2,2,16896,16420
1,000e79e206b7,6,3,10240,29184
2,00168d1b7522,2,2,14848,14884
3,00176a88fdb0,7,3,14848,25088
4,0033bbc76b6b,1,1,10240,43008
5,003504460b3a,3,2,8192,11776
6,00359ab8338b,8,3,6656,9216
7,00488ca285ee,9,3,8192,37888
8,004daf1cbe75,3,2,6144,11264
9,004fb033dd09,7,3,20480,31232


In [6]:
class Hubmap_dataset(Dataset):
    #credits to: https://www.kaggle.com/code/alincijov/dataset-hubmap-vasc-768x768-segments
    def __init__(self, json_path, image_dir, augments = False, train=True):
        #read jsonl file, contains all info about the annotations
        with open(json_path) as json_file:
            json_list = list(json_file) #list of all jsons
        #create new df
        dataset = []
        for json_str in tqdm(json_list, desc="Json Data Loaded"):
            result = json.loads(json_str)

            annotations = result['annotations']
            for ann in annotations:
                row = {}
                row["id"] = result["id"]
                row["type"] = ann["type"]
                row["coordinates"] = ann["coordinates"]
                row["mask"] = self.coordinates_to_masks(ann["coordinates"], (512, 512))[0]
                row["rle"] = self.mask2enc(row["mask"])
                dataset.append(row)

        # define dataset, to make it easier to get...
        self.dataset = pd.DataFrame(dataset, columns=["id", "type", "coordinates", "mask", "rle"])
        self.train = train
        self.image_dir = image_dir

    def enc2mask(self, encs, shape):
        '''
        Function to go from input RLE encodings to mask
        :param encs: input RLE encodings
        :param shape: shape of output mask
        :return:
        '''
        img = np.zeros(shape[0]*shape[1], dtype=np.uint8)
        for m,enc in enumerate(encs):
            if isinstance(enc, float) and np.isnan(enc):  # np.float was removed in NumPy 1.24; np.nan is a plain float
                continue #skip nan's
            s = enc.split()
            for i in range(len(s)//2):
                start = int(s[2*i]) - 1
                length = int(s[2*i+1])
                img[start:start+length] = 1 + m
        return img.reshape(shape).T

    def mask2enc(self, mask, n=1):
        '''
        Function to go from input mask to RLE encodings
        :param mask:
        :param n:
        :return:
        '''
        pixels = mask.T.flatten()
        encs = []
        for i in range(1,n+1):
            p = (pixels == i).astype(np.int8)
            if p.sum() == 0: encs.append(np.nan)
            else:
                p = np.concatenate([[0], p, [0]])
                runs = np.where(p[1:] != p[:-1])[0] + 1
                runs[1::2] -= runs[::2]
                encs.append(' '.join(str(x) for x in runs))
        return encs

    def coordinates_to_masks(self, coordinates, shape):
        masks = []
        for coord in coordinates:
            mask = np.zeros(shape, dtype=np.uint8) #512x512 empty initialization
            cv2.fillPoly(mask, [np.array(coord, dtype=np.int32)], 1)  # fillPoly requires int32 points
            masks.append(mask)
        return masks #return appended masks

    def __getitem__(self, idx):

        data = self.dataset.iloc[idx]

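        imageLoc = os.path.join(self.image_dir, data.id + ".tif")
        img = cv2.imread(imageLoc)  # BGR array, or None if the file cannot be read
        if img is None:
            raise FileNotFoundError(imageLoc)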

        type_struct = data.type
        coord = data.coordinates[0]

        # create mask array
        mask = np.zeros((512, 512), dtype=np.float32)
        points = np.array(coord, dtype=np.int32)  # fillPoly requires int32 points
        points = points.reshape((1, -1, 2))
        mask = cv2.fillPoly(mask, pts=points, color=(255))

        return img, type_struct, mask

    def __len__(self):
        return len(self.dataset)
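
As a quick sanity check of the run-length encoding used above, here is a standalone sketch of a single-mask encode/decode round trip. It follows the same column-major, 1-indexed convention as `mask2enc` and `enc2mask`, but the function names `rle_encode` and `rle_decode` are hypothetical, it handles only one binary mask, and an empty mask yields an empty string rather than NaN.

```python
import numpy as np

def rle_encode(mask):
    """Encode a binary mask as 'start length ...' (column-major, 1-indexed starts)."""
    pixels = mask.T.flatten()
    padded = np.concatenate([[0], pixels, [0]])
    runs = np.where(padded[1:] != padded[:-1])[0] + 1  # positions where the value changes
    runs[1::2] -= runs[::2]                            # turn run ends into run lengths
    return " ".join(str(x) for x in runs)

def rle_decode(enc, shape):
    """Decode an RLE string back into a binary mask of the given (height, width)."""
    flat = np.zeros(shape[0] * shape[1], dtype=np.uint8)
    values = [int(x) for x in enc.split()]
    for start, length in zip(values[0::2], values[1::2]):
        flat[start - 1:start - 1 + length] = 1
    return flat.reshape(shape[1], shape[0]).T  # undo the column-major flattening

# toy mask: two foreground pixels in row 1
mask = np.zeros((3, 4), dtype=np.uint8)
mask[1, 1:3] = 1
enc = rle_encode(mask)          # "5 1 8 1"
restored = rle_decode(enc, mask.shape)
```

If `restored` does not equal `mask`, the encoder and decoder disagree on the pixel ordering.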

In [7]:
train_dataset= Hubmap_dataset(json_path = json_src, augments=False,image_dir=train_tile_src, train=True)

Json Data Loaded: 100%|██████████| 1633/1633 [00:15<00:00, 107.20it/s]


In [8]:
### AS SEEN BELOW, ID MATCHES THE IMAGE FILE NAME IN THE DATASET, AND TYPE IS ONE OF BLOOD_VESSEL (TARGET), GLOMERULUS (WE DON'T WANT), OR UNSURE. MOST IMPORTANT: THE NUMBER OF ANNOTATED TILES IS 1633, WHILE THE NUMBER OF UNANNOTATED TILES IS 5400, FOR A TOTAL OF 7033 TILES IN THE IMAGES WE ARE GIVEN. NEED TO TRAIN A SEMI-SUPERVISED MODEL!
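
The annotated/unannotated split described above can be reproduced with a small helper. This is a sketch, and `split_tile_ids` is a hypothetical name; in this notebook it would presumably be called with `tile_df["id"]` and `train_dataset.dataset["id"]`.

```python
def split_tile_ids(all_tile_ids, annotated_ids):
    """Split tile ids into those with at least one annotation and those without."""
    annotated = set(annotated_ids)  # duplicates collapse: one tile has many annotations
    labeled = [t for t in all_tile_ids if t in annotated]
    unlabeled = [t for t in all_tile_ids if t not in annotated]
    return labeled, unlabeled

# toy ids; in the notebook:
# labeled, unlabeled = split_tile_ids(tile_df["id"], train_dataset.dataset["id"])
labeled, unlabeled = split_tile_ids(["a", "b", "c", "d"], ["b", "d", "b"])
```

The `unlabeled` list is the pool that self-supervised pretraining and pseudo-labeling would draw from.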

In [9]:
train_dataset.dataset

Unnamed: 0,id,type,coordinates,mask,rle
0,0006ff2aa7cd,glomerulus,"[[[167, 249], [166, 249], [165, 249], [164, 24...","[[0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...",[2 105 513 107 1025 111 1537 113 2049 114 2561...
1,0006ff2aa7cd,blood_vessel,"[[[283, 109], [282, 109], [281, 109], [280, 10...","[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...",[139350 10 139860 17 140370 22 140880 27 14139...
2,0006ff2aa7cd,blood_vessel,"[[[104, 292], [103, 292], [102, 292], [101, 29...","[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...",[37108 4 37618 8 38129 11 38640 14 39152 15 39...
3,0006ff2aa7cd,blood_vessel,"[[[505, 442], [504, 442], [503, 442], [502, 44...","[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...",[250783 11 251292 20 251801 26 252311 30 25282...
4,0006ff2aa7cd,blood_vessel,"[[[375, 477], [374, 477], [373, 477], [372, 47...","[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...",[167352 5 167863 7 168374 10 168886 11 169398 ...
...,...,...,...,...,...
17513,ffd3d193c71e,blood_vessel,"[[[184, 308], [183, 308], [182, 308], [181, 30...","[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...",[88875 8 89385 11 89896 13 90407 15 90918 16 9...
17514,ffd3d193c71e,blood_vessel,"[[[42, 92], [41, 92], [40, 92], [39, 92], [38,...","[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...",[14393 12 14904 20 15415 23 15925 28 16434 40 ...
17515,ffd3d193c71e,blood_vessel,"[[[287, 480], [286, 480], [285, 480], [284, 48...","[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...",[140243 6 140750 13 141261 15 141772 17 142284...
17516,ffd3d193c71e,blood_vessel,"[[[493, 388], [492, 388], [491, 388], [490, 38...","[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...",[244001 5 244512 7 244603 6 245023 9 245112 11...


In [10]:
save_df = False # save only once
if save_df:
    train_df_new = train_dataset.dataset
    train_df_new.to_excel(os.path.join(data_directory,"new_train.xlsx"))

### Now use RandStainNA to augment and normalize all of the training images: