# [How To] Upload Label Predictions 


In our latest <a href="https://www.datagym.ai/feature-introduction-upload-label-predictions/">blog post</a> we introduced the __"Import Labels"__ feature, which allows DataGym users to import their annotated image data directly into their DataGym Projects. Users can now inspect and evaluate the results of their prediction models from within DataGym. This workshop gives a real-life example of how to use the __"Import Labels"__ feature with our <a href="https://docs.datagym.ai/documentation/python-api/getting-started">Python API</a>. The code samples in this guide are also available as a Jupyter Notebook on <a href="https://github.com/datagym-ai/datagym-python/tree/master/notebooks">GitHub</a>.

## Use Case

In their cutting-edge paper, Wang et al. [1] presented a new chest X-ray database, __"ChestX-ray8"__, which comprises 108,948 frontal-view X-ray images of 32,717 unique patients. Each image carries labels for eight diseases (an image can have multiple labels), text-mined from the associated radiological reports using natural language processing. The authors demonstrated that commonly occurring __thoracic diseases__ can be spatially located with their unified weakly supervised multi-label image classification and disease localization framework.

<img src="https://media.datagym.ai/blog/chestxray/guide/thorax_scans.png" width="500px">

Wang et al. made their findings and resources <a href="https://m.box.com/shared_item/https%3A%2F%2Fnihcc.app.box.com%2Fv%2FChestXray-NIHCC/browse/36938765345">publicly available</a> to other researchers. Their datasets contain frontal-view chest X-ray PNG images as well as coordinates for bounding boxes that identify the location of the detected thoracic diseases. In this workshop, we use their images and annotated label data to create a project in DataGym that imports and combines these resources and allows users to inspect and re-evaluate the predicted labels.

### Our Starting Point

We start our workshop with the <a href="https://m.box.com/shared_item/https%3A%2F%2Fnihcc.app.box.com%2Fv%2FChestXray-NIHCC/browse/36938765345">resources</a> provided by Wang et al.:

+  A <a href="https://media.datagym.ai/blog/chestxray/BBox_List_2017.csv">.csv file</a> that contains the bounding boxes, which identify the location of thorax diseases in X-ray images. Preview:

<img src="https://media.datagym.ai/blog/chestxray/guide/csv.png" width="750px">

+ A set of 880 X-ray images. Example:

<img src="https://media.datagym.ai/blog/chestxray/images/00027937_004.png" width="350px">

### Our Goal

Our goal is to import the images and the labeled data into a DataGym Project. This allows the labelers to view the predicted diseases and to correct the size and location of the labels if necessary. Instead of labeling images from scratch, labelers now have the reduced task of correcting pre-labeled images.

<img src="https://media.datagym.ai/blog/chestxray/guide/labeled_segment.png" width="750px">



### References

[1] Wang, Xiaosong, et al. "Chestx-ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases." *Proceedings of the IEEE conference on computer vision and pattern recognition.* 2017.


# Import the Data

We use Pandas' DataFrames to read and process the prepared data from the .csv file.

In [1]:
import pandas as pd

csv_online = "https://media.datagym.ai/blog/chestxray/BBox_List_2017.csv"

df = pd.read_csv(csv_online)
print("Number of rows: {}".format(len(df)))
print("Number of images: {}\n".format(len(df['Image Index'].unique())))

df.head()

Number of rows: 984
Number of images: 880



| | Image Index | Finding Label | Bbox [x | y | w | h] | Unnamed: 6 | Unnamed: 7 | Unnamed: 8 |
|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|
| 0 | 00013118_008.png | Atelectasis | 225.084746 | 547.019217 | 86.779661 | 79.186441 | NaN | NaN | NaN |
| 1 | 00014716_007.png | Atelectasis | 686.101695 | 131.543498 | 185.491525 | 313.491525 | NaN | NaN | NaN |
| 2 | 00029817_009.png | Atelectasis | 221.830508 | 317.053115 | 155.118644 | 216.949153 | NaN | NaN | NaN |
| 3 | 00014687_001.png | Atelectasis | 726.237288 | 494.95142 | 141.016949 | 55.322034 | NaN | NaN | NaN |
| 4 | 00017877_001.png | Atelectasis | 660.067797 | 569.780787 | 200.677966 | 78.101695 | NaN | NaN | NaN |


As you can see, this is a bit messy. Therefore, we start by removing the unnecessary columns and changing the column names into a more readable format.

In [2]:
df = df.iloc[:,:6]  # select columns with content
df.columns = ['image_name', 'label', 'x', 'y', 'w', 'h']

df.head()

| | image_name | label | x | y | w | h |
|:---|:---|:---|:---|:---|:---|:---|
| 0 | 00013118_008.png | Atelectasis | 225.084746 | 547.019217 | 86.779661 | 79.186441 |
| 1 | 00014716_007.png | Atelectasis | 686.101695 | 131.543498 | 185.491525 | 313.491525 |
| 2 | 00029817_009.png | Atelectasis | 221.830508 | 317.053115 | 155.118644 | 216.949153 |
| 3 | 00014687_001.png | Atelectasis | 726.237288 | 494.95142 | 141.016949 | 55.322034 |
| 4 | 00017877_001.png | Atelectasis | 660.067797 | 569.780787 | 200.677966 | 78.101695 |


This looks much better!
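
Before moving on, it can be worth checking that the cleaned bounding boxes are plausible. The following is a minimal sketch, not part of the original notebook: it assumes the rows are available as plain dictionaries (for example via `df.to_dict('records')`) and that the images are 1024 x 1024 pixels, which this guide does not state. `find_invalid_boxes` is a hypothetical helper name, and the second sample row is made up for illustration.

```python
def find_invalid_boxes(rows, image_width=1024, image_height=1024):
    """Return the rows whose bounding box is empty or leaves the image area."""
    invalid = []
    for row in rows:
        x, y, w, h = row["x"], row["y"], row["w"], row["h"]
        inside = (x >= 0 and y >= 0
                  and x + w <= image_width
                  and y + h <= image_height)
        if w <= 0 or h <= 0 or not inside:
            invalid.append(row)
    return invalid


sample_rows = [
    # first row of the cleaned table
    {"image_name": "00013118_008.png", "label": "Atelectasis",
     "x": 225.084746, "y": 547.019217, "w": 86.779661, "h": 79.186441},
    # made-up row: the box extends past the assumed right image edge
    {"image_name": "example.png", "label": "Mass",
     "x": 1000.0, "y": 10.0, "w": 100.0, "h": 50.0},
]

invalid_rows = find_invalid_boxes(sample_rows)
print([row["image_name"] for row in invalid_rows])
```

If the image size differs, pass the actual width and height instead of relying on the assumed defaults.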

# Prepare the Resources in DataGym

Before we can import our labeled data into DataGym, we need to create a Project and set a Label Configuration. If you have any trouble following these steps, please visit DataGym's <a href="https://docs.datagym.ai/documentation/">documentation</a>.

## Create a Project

Creating a Project in DataGym is simple. <a href="https://app.datagym.ai/">Sign in</a> to your account and create a Project named 'Research Project'.

<img src="https://media.datagym.ai/blog/chestxray/guide/create_project.png" width="650px">


## Create a Label Configuration

In order to import annotated data into your Project, you need to define a Label Configuration first. The label configuration defines which labels and classifications are available in a Project. You can define geometries for labeling, attach classifications to them or create global classifications for more general questions.
<br>
<br>
In our case, we will create a geometry for each label from our .csv file. These labels are:
+ Atelectasis
+ Cardiomegaly
+ Effusion
+ Infiltrate
+ Mass
+ Nodule
+ Pneumonia
+ Pneumothorax

You can also print these labels to check if the list is complete.

In [3]:
print(df['label'].unique())

['Atelectasis' 'Cardiomegaly' 'Effusion' 'Infiltrate' 'Mass' 'Nodule'
 'Pneumonia' 'Pneumothorax']


Follow the steps below to create a Label configuration <br> 

__Step 1:__ Navigate to the __"Label configuration"__ tab within your "Research Project"


<img src="https://media.datagym.ai/blog/chestxray/guide/create_config_01.png" width="650px">

<br>

__Step 2:__ Add a __Geometry__ to your Label configuration and choose a __rectangle__

<div>
    <div >
         <img src="https://media.datagym.ai/blog/chestxray/guide/create_config_02.png" width="350px">
    </div>
    <div>
         <img src="https://media.datagym.ai/blog/chestxray/guide/create_config_03.png" width="350px">
    </div>
</div>
    
<br>

__Step 3:__ Enter the first __label name__ as key in your new __geometry__

You can choose a color and shortcut to better distinguish between your annotations.

<img src="https://media.datagym.ai/blog/chestxray/guide/create_config_04.png" width="450px">

<br>

__Step 4:__ Click on __"save edits"__ to see your label entry in the overview

<img src="https://media.datagym.ai/blog/chestxray/guide/create_config_05.png" width="500px">

<br>

__Step 5:__ Add the remaining labels from the list.
The result should look like this:

<img src="https://media.datagym.ai/blog/chestxray/guide/create_config_06.png" width="500px">

Don't forget to save your configuration!


## Create a Dataset

To upload images to our Project, we need to create a Dataset that can hold the images from our annotated data. This time we can use the Python API to quickly generate a Dataset directly from a Jupyter Notebook. For this, we need the following Client methods:


```python
Client.get_project_by_name(project_name)

Client.create_dataset(name, owner, short_description)
```

If this is your first time using the Python API, please visit our <a href="https://docs.datagym.ai/documentation/python-api/getting-started">Getting Started Guide</a>.

In [4]:
from datagym import Client

client = Client("<API_KEY>")  # replace <API_KEY> with your personal DataGym API key
# Only needed when working against a local DataGym instance:
# client._endpoint.BASE_PATH = "http://localhost:8080/"

In [5]:
# fetch your new Project first
project = client.get_project_by_name(project_name="Research Project")

# create a dataset for the x-ray scans
client.create_dataset(name="xray_images", 
                      owner=project.owner, 
                      short_description="Chest X-ray")

<Dataset {'id': 'ee6115ec-bfc4-4a9f-8ee4-5037b771c90e', 'name': 'xray_images', 'short_description': 'Chest X-ray', 'timestamp': '1586174249366', 'owner': '3360f10f-a5ab-48a6-966c-cdba2d63116a', 'images': <List[Image] with 0 elements>}>

We easily created a Dataset via Python. But there aren't any images in this Dataset yet.

## Upload Images to the Dataset

### Prepare a list with image URLs

To upload images, we need a list of URLs that reference all of the annotated images. You can find the images on our server (https://media.datagym.ai/blog/chestxray/images/). Since we already know the image names from our .csv file, we can combine the "image_name" column with our image server path. This gives us a set of URLs that covers every image in our .csv file; using a set removes the duplicates that arise when an image has more than one bounding box.

In [6]:
image_url_path = "https://media.datagym.ai/blog/chestxray/images/"

image_urls = image_url_path + df['image_name']  # combine the server path and image names

image_url_set = set(image_urls)  # convert the pandas Series into a Python set (drops duplicate URLs)

print("Number of URLs: {}".format(len(image_url_set)))
print("Example URL: {}".format(list(image_url_set)[0]))

Number of URLs: 880
Example URL: https://media.datagym.ai/blog/chestxray/images/00027278_007.png
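
The .csv has 984 rows but only 880 distinct images, so some images carry more than one bounding box. The sketch below illustrates, with plain Python and a short made-up list of names, how such images could be identified and why building a set yields one URL per image. The repeated image name is an assumption for illustration and not taken from the dataset.

```python
from collections import Counter

image_url_path = "https://media.datagym.ai/blog/chestxray/images/"

# a few image names as they might appear in the 'image_name' column;
# the repeated name is made up to illustrate an image with two boxes
image_names = ["00013118_008.png", "00014716_007.png", "00013118_008.png"]

# count how often each image name occurs
counts = Counter(image_names)
multi_box_images = [name for name, n in counts.items() if n > 1]

# a set keeps each URL only once, no matter how many boxes the image has
image_url_set = {image_url_path + name for name in image_names}

print(len(image_names), len(image_url_set), multi_box_images)
```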


### Upload the list with the Python API 

Now we can upload the images to our Dataset. For this, we use the following method of the Python API's Client class:

```python
Client.create_images_from_urls(dataset_id, image_url_set)
```
    

In [7]:
dataset = client.get_dataset_by_name("xray_images")  # fetch the dataset from DataGym

upload_results = client.create_images_from_urls(dataset_id=dataset.id, 
                                                image_url_set=image_url_set)

In [8]:
upload_results[0]

{'internal_image_ID': 'bf496881-cb06-443b-a1b3-ee306dcb2a3b',
 'external_image_ID': '00020819_002.png',
 'imageUrl': 'https://media.datagym.ai/blog/chestxray/images/00020819_002.png',
 'imageUploadStatus': 'SUCCESS'}

DataGym returns a status entry for each uploaded image. The response contains the internal image IDs (`internal_image_ID`), which we need later as a reference when we upload our annotated data to DataGym.

### Create an internal image ID reference Dictionary

In the next step, we create a Dictionary that maps the internal image ID to the image name from our .csv file. This is needed to identify the images in our DataGym Project when we upload our annotated image data.

    image_ids_dict:
        
        Dict[image_name] = internal_image_id 
    
There are two ways to generate this Dictionary:

1. By using the response data from the image upload:

In [9]:
image_ids_dict = dict()

for image_response in upload_results:
    if image_response["imageUploadStatus"] == "SUCCESS":
        image_ids_dict[image_response["external_image_ID"]] = image_response["internal_image_ID"]

2. By using the Images from your Dataset

In [10]:
# Fetch your Dataset
dataset = client.get_dataset_by_name(dataset_name="xray_images")

image_ids_dict = dict()

for image in dataset.images:
    image_ids_dict[image.image_name] = image.id
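
Whichever way the Dictionary is built, it is worth confirming that every image name from the .csv has an internal image ID before continuing; otherwise the later lookup by image name would fail with a `KeyError`. A minimal sketch, in which `find_missing_images` is a hypothetical helper and the IDs are placeholders:

```python
def find_missing_images(image_names, image_ids):
    """Return the image names that have no internal image ID yet."""
    return sorted(set(image_names) - set(image_ids))


# placeholder data for illustration only
csv_image_names = ["00013118_008.png", "00014716_007.png"]
example_ids = {"00013118_008.png": "hypothetical-internal-id"}

missing = find_missing_images(csv_image_names, example_ids)
print(missing)
```

In the notebook, the same check could be run with `df['image_name']` and `image_ids_dict`; an empty result means every image can be addressed.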

## Connect the Dataset to the Project

We already created a Dataset and filled it with our images. The only thing that's left is to connect this dataset to our Research Project. The Python API provides a simple method to add a Dataset:
    
```python
Client.add_dataset(dataset_id, project_id)
```

In [11]:
# Fetch the Research Project and the X-ray Dataset
dataset = client.get_dataset_by_name(dataset_name="xray_images")
project = client.get_project_by_name(project_name="Research Project")

client.add_dataset(dataset_id=dataset.id, project_id=project.id)

True

# Prepare the upload of annotated image data

We want to upload the annotated image data from our .csv file to our DataGym Project. DataGym uses a specific format for annotated data imports. Therefore, we have to convert our rows from the .csv file into this specific form.

## Understand the schema

Before we start, let's have a look at our .csv again.

In [12]:
df.head()

| | image_name | label | x | y | w | h |
|:---|:---|:---|:---|:---|:---|:---|
| 0 | 00013118_008.png | Atelectasis | 225.084746 | 547.019217 | 86.779661 | 79.186441 |
| 1 | 00014716_007.png | Atelectasis | 686.101695 | 131.543498 | 185.491525 | 313.491525 |
| 2 | 00029817_009.png | Atelectasis | 221.830508 | 317.053115 | 155.118644 | 216.949153 |
| 3 | 00014687_001.png | Atelectasis | 726.237288 | 494.95142 | 141.016949 | 55.322034 |
| 4 | 00017877_001.png | Atelectasis | 660.067797 | 569.780787 | 200.677966 | 78.101695 |


Every row in this table represents an annotated segment in an image. DataGym can add these annotations to its Project images when a specific JSON format is used. As an example we take the first row of the table above and convert it to valid DataGym JSON:

    {
    "internal_image_ID" : "ebfbc807-5c52-431c-8f23-28a70f66488c",
    "global_classifications" : {},
    "keepData": false,
    "labels" : {
        "atelectasis" :  [  
                   {
                     "geometry" : [ { "x" : 225.084746, 
                                      "y" : 547.019217, 
                                      "h" : 79.186441, 
                                      "w" : 86.779661  } ],
                     "classifications" : {  }
                   }
                 ]
               }
    }
    
|Property | Description| 
|:---|:---|
|__internal_image_ID__ | The internal ID that identifies the image. <br><br> To address the correct image, we have to replace the image name from our .csv with the internal image ID of our Dataset. We already prepared the *image_ids_dict* to look up these internal image IDs by image name.| 
|__keepData__ | If keepData is false, all existing labels for the image are deleted when the new labels are uploaded. If keepData is true, the new labels are added to the existing labels for the image. <br>The default value for keepData is true. |
|__global\_classifications__ | It can be left empty because we haven't defined any global classifications in our Label configuration. |
| __labels__ | A label describes a geometry within an image. <br><br>The annotated image segments end up in the __labels__ attribute. Remember that we defined a __Geometry__ in our DataGym Project for every label of our .csv file. One of these labels is 'atelectasis', which we created as a rectangle. |
| __geometry__ | The __geometry__ attribute contains the coordinates of the annotated segment. |
| __classifications__ | The __classifications__ attribute can be left empty because we haven't defined any additional classifications in our Label configuration. |
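
To make the mapping from a .csv row to this schema concrete, here is a minimal sketch that converts a single row into one entry. `row_to_label_data` is a hypothetical helper, not part of the DataGym Python API, and the internal image ID is simply passed in as a string.

```python
import json


def row_to_label_data(row, internal_image_id, keep_data=False):
    """Convert one .csv row (as a dict) into a single DataGym-style entry."""
    rectangle = {"x": row["x"], "y": row["y"], "w": row["w"], "h": row["h"]}
    return {
        "internal_image_ID": internal_image_id,
        "global_classifications": {},
        "keepData": keep_data,
        "labels": {
            # label keys are lower case, matching the geometry keys
            row["label"].lower(): [
                {"geometry": [rectangle], "classifications": {}}
            ]
        },
    }


# first row of the table above
first_row = {"image_name": "00013118_008.png", "label": "Atelectasis",
             "x": 225.084746, "y": 547.019217, "w": 86.779661, "h": 79.186441}

entry = row_to_label_data(first_row, "ebfbc807-5c52-431c-8f23-28a70f66488c")
print(json.dumps(entry, indent=2))
```

Serializing with `json.dumps` turns Python's `False` into JSON's `false`, so the printed structure follows the schema shown above.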


## Convert the .csv into DataGym JSON

Follow this step-by-step guide to create the DataGym JSON from the .csv file

### 1. Create a Dictionary template for annotated data

First, we need to create a Dictionary that can hold all labels for every image in our .csv. Therefore, we define a nested Dictionary based on the image name and label. Since there can be multiple instances of annotated data (aka geometries) for each label, we initialize a (yet) empty list per label. The code snippet below results in a template Dictionary called __labels_per_image__ which has the following form: 


    labels_per_image: Dict[image_name][label]
    
    
    labels_per_image =
    
        {
            'image_name': {
                               'label_1': [],
                               'label_2': [],
                               ...

                          }
        }

In [13]:
labels_per_image = {}

for index,row in df.iterrows():  # iterate over the DataFrame rows
    image_name = row['image_name']
    label = row['label'].lower() # label keys must be lower case
    
    if image_name not in labels_per_image:
        labels_per_image[image_name] = dict()
    
    if label not in labels_per_image[image_name]:
        labels_per_image[image_name][label] = list()
    

### 2. Create and Add Label Entries

We iterate a second time over the DataFrame to generate a Label Entry for every row in our .csv. A __label_entry__ has the following form:

    label_entry = 
    
        {
         'geometry': [{'x': 343.438229166667,
                       'y': 446.198524305556,
                       'h': 53.4755555555556,
                       'w': 120.60444444444401}],
         'classifications': {}
        }
    
Then we add the Label Entries to our template Dictionary __labels_per_image__:

    labels_per_image =
    
        {
            'image_name': {
                               'label_1': [label_entry_1, label_entry_2],
                               'label_2': [label_entry_3],
                               ...

                          }
        }


In [14]:
for index,row in df.iterrows():
    rectangle = {
        "x": row['x'],
        "y": row['y'],
        "h": row['h'],
        "w": row['w'],
    }
    
    label_entry = {
        "geometry": [rectangle],
        "classifications" : {  }
    }
    
    
    image_name = row['image_name']
    label = row['label'].lower()
    
    labels_per_image[image_name][label].append(label_entry)
    
    

In [15]:
label_entry

{'geometry': [{'x': 343.438229166667,
   'y': 446.198524305556,
   'h': 53.4755555555556,
   'w': 120.60444444444401}],
 'classifications': {}}
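
Steps 1 and 2 can also be combined into a single pass. The sketch below is an alternative under the assumption that the rows are available as plain dictionaries (for example via `df.to_dict('records')`); it uses `collections.defaultdict` so that the nested Dictionary and the per-label lists are created on first access. `build_labels_per_image` is a hypothetical helper, and the sample rows are shortened, made-up values.

```python
from collections import defaultdict


def build_labels_per_image(rows):
    """Group label entries by image name and lower-case label key."""
    grouped = defaultdict(lambda: defaultdict(list))
    for row in rows:
        label_entry = {
            "geometry": [{"x": row["x"], "y": row["y"],
                          "h": row["h"], "w": row["w"]}],
            "classifications": {},
        }
        grouped[row["image_name"]][row["label"].lower()].append(label_entry)
    # convert back to plain dicts so the result serializes and prints cleanly
    return {name: dict(labels) for name, labels in grouped.items()}


# made-up rows for illustration
rows = [
    {"image_name": "a.png", "label": "Atelectasis", "x": 1.0, "y": 2.0, "w": 3.0, "h": 4.0},
    {"image_name": "a.png", "label": "Atelectasis", "x": 5.0, "y": 6.0, "w": 7.0, "h": 8.0},
    {"image_name": "a.png", "label": "Mass", "x": 9.0, "y": 9.0, "w": 1.0, "h": 1.0},
    {"image_name": "b.png", "label": "Nodule", "x": 0.0, "y": 0.0, "w": 2.0, "h": 2.0},
]

labels_per_image_example = build_labels_per_image(rows)
print(sorted(labels_per_image_example["a.png"]))
```

The two-pass version in the notebook is easier to follow step by step; the single-pass variant mainly saves a second iteration over the DataFrame.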

### 3. Generate and fill the final template

The final step is to create a list of Dictionaries that hold the image data and label data we defined above. At this point, we can recreate the schema introduced at the beginning of this section. The __label\_data__ Dictionary is created for each image in our DataFrame. It also uses the __image_ids_dict__ to set the internal image IDs. The already formatted labels of the image can be set via the __labels_per_image__ Dictionary.
    
    label_data =
    
        {
         "internal_image_ID" : image_ids_dict[image_name],
         "global_classifications" : {},
         "keepData": False,
         "labels" : labels_per_image[image_name]
        }
        
The only thing left to do is to create a list with all the labeled data. This List is now in a valid DataGym JSON format and contains all annotated image segments from the .csv file.
    
    label_data_list =
    
        [
            label_data_1,
            label_data_2,
           
            ...
        ]

In [16]:
label_data_list = []

for image_name in df['image_name'].unique():  # iterate over all image names
    label_data = {}
    
    label_data['image_name'] = image_name
    label_data['internal_image_ID'] = image_ids_dict[image_name]
    label_data['global_classifications'] = {}
    label_data["keepData"] = False
    label_data['labels'] = labels_per_image[image_name]
    
    label_data_list.append(label_data)

Look at the last __label\_data__ object to get an example of the final form after the conversion:

In [17]:
import pprint

pprint.pprint(label_data)

{'global_classifications': {},
 'image_name': '00026920_000.png',
 'internal_image_ID': '337f70f0-e03f-498e-be84-fc0dbdd1ae31',
 'keepData': False,
 'labels': {'atelectasis': [{'classifications': {},
                             'geometry': [{'h': 53.4755555555556,
                                           'w': 120.60444444444401,
                                           'x': 343.438229166667,
                                           'y': 446.198524305556}]}]}}
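
Before uploading, a quick local check of the list can catch structural mistakes early. The following sketch is a hypothetical pre-flight check and not the validation DataGym itself performs: it only verifies the keys described in the schema section, that label keys are lower case, and that each rectangle has exactly the keys `x`, `y`, `w` and `h`. The IDs and the second entry are made up.

```python
REQUIRED_KEYS = {"internal_image_ID", "global_classifications", "keepData", "labels"}
RECTANGLE_KEYS = {"x", "y", "w", "h"}


def validate_label_data(label_data_list):
    """Return a list of problem descriptions; an empty list means none were found."""
    problems = []
    for position, label_data in enumerate(label_data_list):
        missing = REQUIRED_KEYS - set(label_data)
        if missing:
            problems.append(f"entry {position}: missing keys {sorted(missing)}")
            continue
        for label, entries in label_data["labels"].items():
            if label != label.lower():
                problems.append(f"entry {position}: label key '{label}' is not lower case")
            for entry in entries:
                for geometry in entry.get("geometry", []):
                    if set(geometry) != RECTANGLE_KEYS:
                        problems.append(
                            f"entry {position}: unexpected rectangle keys {sorted(geometry)}")
    return problems


good = {"internal_image_ID": "hypothetical-id", "global_classifications": {},
        "keepData": False,
        "labels": {"atelectasis": [{"geometry": [{"x": 225.084746, "y": 547.019217,
                                                  "w": 86.779661, "h": 79.186441}],
                                    "classifications": {}}]}}
# made-up faulty entry: capitalized label key and incomplete rectangle
bad = {"internal_image_ID": "hypothetical-id", "global_classifications": {},
       "keepData": False,
       "labels": {"Atelectasis": [{"geometry": [{"x": 1.0, "y": 2.0}],
                                   "classifications": {}}]}}

problems = validate_label_data([good, bad])
print(problems)
```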


## Import the Annotated Data into the DataGym Project

Now we can finally import the annotated data from the .csv file to our DataGym Project. We only need to pass the __label\_data\_list__ to the Python API Client.

```python
Client.import_label_data(project_id, label_data)
```
    

In [18]:
# fetch your Research Project first
project = client.get_project_by_name(project_name="Research Project")

errors = client.import_label_data(project_id=project.id, label_data=label_data_list)

In [19]:
errors # returns a list with possible errors (JSON Format, label keys/values, etc.)

[]

## Inspect your Results

Visit DataGym to inspect your labeled images.

<img src="https://media.datagym.ai/blog/chestxray/guide/labeled_segment.png" width="700px">

That's it! You imported your labeled data into your DataGym Project. Now you can evaluate the labeled data by adjusting misplaced labels or adding labels that were missing. There is no longer any need to start labeling images from scratch: labelers now have the reduced task of correcting pre-labeled images. This saves labeling time and helps you improve your prediction models more quickly.