# Data Challenge: [Help a Hematologist out!](https://helmholtz-data-challenges.de/web/challenges/challenge-page/93/overview) 
*** 
<b> Group: </b> 
> #      $ BLAMAD $  
<b> members </b> 
> Bashir K., 
> Lea G., 
> Ankita N., 
> Martin B., 
> Arnab M., 
> Dawit H. 
***

![logo](https://github.com/christinab12/Data-challenge-logo/blob/main/logo.jpg?raw=true)

## Getting started


This notebook is a short guide to getting started with the challenge (found [here](https://helmholtz-data-challenges.de/web/challenges/challenge-page/93/overview)). It shows how to download the dataset and its labels, explore and analyze the input and output data of the challenge, run a baseline model, and create a submission file to upload to the leaderboard.

***

<b>dataset:</b>

Three datasets, each constituting a different domain, will be used for this challenge:
> 1. The Acevedo_20 dataset with labels
> 2. The Matek_19 dataset with labels
> 3. The WBC dataset <b> without labels </b> (Used for domain adaptation and performance measurement)

The Acevedo_20 and Matek_19 datasets are labeled and should be used to train the model for the domain generalization task.
A small subpart of the WBC dataset, WBC1, will be downloadable from the beginning of the challenge. It is unlabeled and should be used for evaluation and domain adaptation techniques.

A second similar subpart of the WBC dataset, WBC2, will become available for download during phase 2 of the challenge, i.e. on the last day, 24 hours before submissions close.

***
<b>Goal: </b> 

This is a transfer learning challenge, <b> specifically on domain generalization (DG) and domain adaptation (DA) </b> techniques. The focus lies on using deep neural networks to classify single white blood cell images obtained from peripheral blood smears.
<b> The goal of this challenge is to achieve high performance, especially a high macro F1 score, on the WBC2 dataset. </b>

***
<b>Notes: </b>

This challenge aims to motivate research in domain generalization and adaptation techniques:

For deep learning to be useful in the medical routine, the techniques must work in realistic cases. If a peripheral blood smear is acquired from a patient and classified by a neural network, the classification has to be reliable. However, the patient’s blood smear will very likely differ from the image domains used to train the network, which leads to untrustworthy results. Overcoming this obstacle and building robust, domain-invariant classifiers requires research in domain generalization and adaptation.

***
<b>f1_score: </b>
[Wikipedia](https://en.wikipedia.org/wiki/F-score)

> sklearn.metrics.f1_score(y_true, y_pred, *, labels=None, pos_label=1,<b> average='macro' </b>, sample_weight=None, zero_division='warn')

The formula can be seen in the [scikit-learn source code](https://github.com/scikit-learn/scikit-learn/blob/36958fb24/sklearn/metrics/_classification.py#L1001) and is given per class as

> <g> F1 = 2 * (precision * recall) / (precision + recall) </g>
***
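With `average='macro'`, the F1 formula is evaluated once per class and the per-class scores are averaged without weighting, so rare cell types count as much as frequent ones. The following is a minimal pure-Python sketch of that computation, not the scikit-learn implementation; the function name `f1_macro` and the toy labels are assumptions made for illustration, and a class without any true positives is assumed to score 0.

```python
from collections import Counter


def f1_macro(y_true, y_pred, labels=None):
    """Unweighted mean of the per-class F1 scores (illustrative helper)."""
    if labels is None:
        labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    scores = []
    for cls in labels:
        # F1 = 2 * P * R / (P + R) is equivalent to 2 * TP / (2 * TP + FP + FN)
        denom = 2 * tp[cls] + fp[cls] + fn[cls]
        scores.append(2 * tp[cls] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example with three classes (hypothetical labels)
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
score = f1_macro(y_true, y_pred)
print(f"macro F1: {score:.4f}")
```

Because every class contributes equally to the average, a model that ignores an underrepresented class is penalized much more than it would be by plain accuracy.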


## Downloading the data
***
 Uncomment the code below to download the dataset. Make sure you adjust the paths to match the location you download it to.

In [1]:
# !wget --user YraZEdrHytaCSza --password BgZL3j8DT4 https://hmgubox2.helmholtz-muenchen.de/public.php/webdav/Acevedo_20.zip -O Acevedo_20.zip #(230M) [application/zip]
# !wget --user YraZEdrHytaCSza --password BgZL3j8DT4 https://hmgubox2.helmholtz-muenchen.de/public.php/webdav/Matek_19.zip -O Matek_19.zip #(5.7G) [application/zip]
# !wget --user YraZEdrHytaCSza --password BgZL3j8DT4 https://hmgubox2.helmholtz-muenchen.de/public.php/webdav/WBC1.zip -O WBC1.zip #(357M) [application/zip]
# !wget --user YraZEdrHytaCSza --password BgZL3j8DT4 https://hmgubox2.helmholtz-muenchen.de/public.php/webdav/val_dummy.csv -O val_dummy.csv #44834 (44K) [text/csv]
# !wget --user YraZEdrHytaCSza --password BgZL3j8DT4 https://hmgubox2.helmholtz-muenchen.de/public.php/webdav/metadata2.csv -O metadata2.csv #2019059 (1.9M) [text/csv]
# print('download complete') 

# import shutil

# shutil.unpack_archive('Acevedo_20.zip', 'Datasets/Acevedo_20')
# shutil.unpack_archive('Matek_19.zip', 'Datasets/Matek_19')
# shutil.unpack_archive('WBC1.zip', 'Datasets/WBC1')
# !ls

<b> datapath </b>

In [2]:
data_path = {
        "Ace_20": "/beegfs/desy/user/hailudaw/challenge/Datasets/Acevedo_20", # Acevedo_20 Dataset
        "Mat_19": "/beegfs/desy/user/hailudaw/challenge/Datasets/Matek_19", # Matek_19 Dataset
        "WBC1": "/beegfs/desy/user/hailudaw/challenge/Datasets/WBC1" # WBC1 dataset
    }
    

<b> labels </b>

In [3]:
# Common classes of the datasets and their labels: 
# Highly underrepresented classes like atypical lymphocytes and smudge cells were left out.

label_map_all = {
        'basophil': 0,
        'eosinophil': 1,
        'erythroblast': 2,
        'myeloblast' : 3,
        'promyelocyte': 4,
        'myelocyte': 5,
        'metamyelocyte': 6,
        'neutrophil_banded': 7,
        'neutrophil_segmented': 8,
        'monocyte': 9,
        'lymphocyte_typical': 10
    }

label_map_reverse = {
        0: 'basophil',
        1: 'eosinophil',
        2: 'erythroblast',
        3: 'myeloblast',
        4: 'promyelocyte',
        5: 'myelocyte',
        6: 'metamyelocyte',
        7: 'neutrophil_banded',
        8: 'neutrophil_segmented',
        9: 'monocyte',
        10: 'lymphocyte_typical'
    }

# The unlabeled WBC dataset gets the class name 'DATA-VAL' for every image

label_map_pred = {
        'DATA-VAL': 0
    }
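The two label dictionaries above have to stay in sync by hand. As a small sketch (assuming the class names and indices listed above), the reverse map can instead be derived from `label_map_all`, which also makes it easy to turn predicted class indices back into class names. The helper name `decode_predictions` is hypothetical and not part of the challenge code.

```python
# Same class-to-index mapping as in the cell above
label_map_all = {
    'basophil': 0,
    'eosinophil': 1,
    'erythroblast': 2,
    'myeloblast': 3,
    'promyelocyte': 4,
    'myelocyte': 5,
    'metamyelocyte': 6,
    'neutrophil_banded': 7,
    'neutrophil_segmented': 8,
    'monocyte': 9,
    'lymphocyte_typical': 10,
}

# Derive the index-to-class mapping instead of writing it out a second time
label_map_reverse = {index: name for name, index in label_map_all.items()}


def decode_predictions(indices):
    """Translate predicted class indices into class names (hypothetical helper)."""
    return [label_map_reverse[index] for index in indices]


decoded = decode_predictions([0, 8, 10])
print(decoded)
```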

<b> convert the dataset to a pandas DataFrame and compute the per-channel mean of each image </b> 

In [4]:
import pandas as pd
import numpy as np
from tqdm import tqdm  # import the progress-bar function, not the module
import ntpath
import os
import skimage.io as io

# candidate output files; only the first one is used below
savepaths = ['metadata.csv', 'metadata_noisy.csv', 'metadata_rescaled.csv']
savepath = savepaths[0]  # path where the created dataframe will be stored

def finding_classes(data_dir):
    """
    this function finds the folders in the root path and considers them
    as classes
    """
    classes = [folder for folder in sorted(os.listdir(data_dir)) if not folder.startswith('.') and not folder.startswith('_')]
    return classes


def metadata_generator(data_path):
    # this function generates a pandas dataframe containing image information (paths, labels, dataset)
    columns = ["Image", "file", "label", "dataset", "set", 'mean1', 'mean2', 'mean3']
    frames = []
    for ds in data_path:
        list_of_classes = finding_classes(data_path[ds])
        for cl in list_of_classes:
            metadata_dummy = pd.DataFrame(columns=columns)
            metadata_dummy["file"] = io.imread_collection(os.path.join(data_path[ds], cl, "*")).files
            # the image name is the file name without its directory
            metadata_dummy["Image"] = metadata_dummy["file"].map(ntpath.basename)
            metadata_dummy["label"] = cl
            metadata_dummy["dataset"] = ds
            metadata_dummy["set"] = "train"
            frames.append(metadata_dummy)
    # DataFrame.append was removed in pandas 2.0, so concatenate all frames once
    return pd.concat(frames, ignore_index=True)

metadata = metadata_generator(data_path)

def compute_mean(dataframe=metadata, savepath=savepath, selected_channels=(0, 1, 2)):
    # stores the mean of each selected channel for every labeled image
    for idx in tqdm(range(len(dataframe)), position=0, leave=True):
        # the unlabeled WBC1 images are skipped
        if dataframe.loc[idx, "dataset"] != "WBC1":
            file_path = dataframe.loc[idx, "file"]
            try:
                image = io.imread(file_path)[:, :, list(selected_channels)]
            except ValueError:
                # report the file that could not be read and stop
                print(file_path)
                break
            #image = rgb2hsv(image)
            dataframe.loc[idx, 'mean1'] = np.mean(image[:, :, 0])
            dataframe.loc[idx, 'mean2'] = np.mean(image[:, :, 1])
            dataframe.loc[idx, 'mean3'] = np.mean(image[:, :, 2])
    dataframe.to_csv(savepath, index=False)
    print(f'The dataframe was saved to {savepath}')
    print(dataframe)
    return dataframe

compute_mean()

<b> in parallel </b>

In [41]:
import ray
if not ray.is_initialized():
    ray.init()
import pandas as pd 
import ntpath
import os
import numpy as np
import tqdm
import skimage.io as io

savepaths=['metadata.csv', 'metadata_noisy.csv', 'metadata_rescaled.csv'] # path where the created dataframe will be stored
savepath = savepaths[0]  # path where the created dataframe will be stored
def finding_classes(data_dir):
    """
    this function finds the folders in the root path and considers them
    as classes
    """
    classes = [folder for folder in sorted(os.listdir(data_dir)) if not folder.startswith('.') and not folder.startswith('_')]
    return classes

@ray.remote
def metadata_dummy_getter(ds, cl):
    # builds the metadata rows for one class folder (cl) of one dataset (ds)
    metadata_dummy = pd.DataFrame(columns=["Image", "file", "label", "dataset", "set", 'mean1', 'mean2', 'mean3'])
    metadata_dummy["file"] = io.imread_collection(os.path.join(data_path[ds], cl, "*")).files
    # the image name is the file name without its directory
    metadata_dummy["Image"] = metadata_dummy["file"].map(ntpath.basename)
    metadata_dummy["label"] = cl
    metadata_dummy["dataset"] = ds
    metadata_dummy["set"] = "train"
    return metadata_dummy

@ray.remote
def metadata_generator(ds):
    # this function generates a pandas dataframe containing image information (paths, labels, dataset)
    # for one dataset, using one Ray task per class folder
    list_of_classes = finding_classes(data_path[ds])
    # ray.get returns a list of dataframes, which is what pd.concat expects
    frames = ray.get([metadata_dummy_getter.remote(ds, cl) for cl in list_of_classes])
    return pd.concat(frames, ignore_index=True)


In [42]:
# for ds in data_path:
#     list_classes = finding_classes(data_path[ds])
#     print(list_classes)
# x = ray.get([(metadata_dummy_getter.remote('Ace_20',cl) for cl in list_classes)])
# one task per dataset key; the per-dataset frames are joined into a single dataframe
metadata = pd.concat(ray.get([metadata_generator.remote(ds) for ds in data_path]), ignore_index=True)

In [32]:
x

Unnamed: 0,Image,file,label,dataset,set,mean1,mean2,mean3
0,BA_47.jpg,/beegfs/desy/user/hailudaw/challenge/Datasets/...,basophil,Ace_20,train,,,
1,BA_580.jpg,/beegfs/desy/user/hailudaw/challenge/Datasets/...,basophil,Ace_20,train,,,
2,BA_1223.jpg,/beegfs/desy/user/hailudaw/challenge/Datasets/...,basophil,Ace_20,train,,,
3,BA_1581.jpg,/beegfs/desy/user/hailudaw/challenge/Datasets/...,basophil,Ace_20,train,,,
4,BA_2035.jpg,/beegfs/desy/user/hailudaw/challenge/Datasets/...,basophil,Ace_20,train,,,
...,...,...,...,...,...,...,...,...
587,PMY_988901.jpg,/beegfs/desy/user/hailudaw/challenge/Datasets/...,promyelocyte,Ace_20,train,,,
588,PMY_989352.jpg,/beegfs/desy/user/hailudaw/challenge/Datasets/...,promyelocyte,Ace_20,train,,,
589,PMY_991745.jpg,/beegfs/desy/user/hailudaw/challenge/Datasets/...,promyelocyte,Ace_20,train,,,
590,PMY_993959.jpg,/beegfs/desy/user/hailudaw/challenge/Datasets/...,promyelocyte,Ace_20,train,,,


In [6]:
from tqdm import tqdm  # import the progress-bar function, not the module

@ray.remote
def compute_mean_meta(file_path):
    # per-channel mean of a single image, returned as [mean1, mean2, mean3]
    image = io.imread(file_path)
    #image = rgb2hsv(image)
    return [np.mean(image[:,:,0]), np.mean(image[:,:,1]), np.mean(image[:,:,2])]

def compute_mean(meta):
    meta = meta.reset_index(drop=True)
    # the unlabeled WBC1 images are skipped, as in the sequential version
    rows = [idx for idx in range(len(meta)) if meta.loc[idx, "dataset"] != "WBC1"]
    # one Ray task per image; ray.get returns the results in submission order
    means = ray.get([compute_mean_meta.remote(meta.loc[idx, "file"]) for idx in tqdm(rows)])
    for idx, (mean1, mean2, mean3) in zip(rows, means):
        meta.loc[idx, 'mean1'] = mean1
        meta.loc[idx, 'mean2'] = mean2
        meta.loc[idx, 'mean3'] = mean3
    return meta

# mean_meta = compute_mean(metadata)

###  Authors

> Armin Gruber

> Ali Boushehri

> Christina Bukas

> Dawit Hailu