This script converts the annotations produced with the VGG annotator into YOLO-compatible annotations.

For each box, VGG provides (Xmin, Ymin, BoxWidth, BoxHeight), all expressed in pixels.

YOLO needs: 

    - BoxXRelCenter = (xmin + xmax)/(2* img_width) 
    - BoxYRelCenter = (ymin + ymax)/(2* img_height)
    - BoxRelWidth = (BoxWidth/img_width)
    - BoxRelHeight = (BoxHeight/img_height)

where:

    - xmax = xmin + BoxWidth
    - ymax = ymin + BoxHeight
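
As a quick check of the formulas above, here is a minimal sketch of the conversion for a single box. The function name `vgg_to_yolo` and the 500 x 400 image with its box are made up for illustration; they are not taken from the dataset.

```python
def vgg_to_yolo(xmin, ymin, box_width, box_height, img_width, img_height):
    """Return (BoxXRelCenter, BoxYRelCenter, BoxRelWidth, BoxRelHeight)."""
    xmax = xmin + box_width
    ymax = ymin + box_height
    return (
        (xmin + xmax) / (2 * img_width),
        (ymin + ymax) / (2 * img_height),
        box_width / img_width,
        box_height / img_height,
    )

# Hypothetical 500 x 400 image; the box is 200 x 100 with top-left corner (100, 50)
yolo_box = vgg_to_yolo(100, 50, 200, 100, img_width=500, img_height=400)
print(yolo_box)  # (0.4, 0.25, 0.4, 0.25)
```

All four values are fractions of the image size, so a box that lies inside the image always gives values in [0, 1].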


To identify the class we will use the following convention:

- 0 - PELVIS
- 1 - SPINE
- 2 - ABDOMEN
- 3 - CHEST
- 4 - HEAD
- 5 - SHOULDER R
- 6 - SHOULDER L

In [1]:
import os # list folders and build file paths
import pandas as pd # all metadata are stored in a DataFrame
import re # regular expressions
import cv2 # needed to get the size of the image
import shutil as sh # needed to copy files

In [2]:
## Type here the path to the corresponding folders - NOTE: do not include at the end of the path the folder separator

# Directory where the VGG annotation file is stored
#VGGAnnotatorDir = "D:\\BIG_DATA\\DSTI\\OneDrive - Data ScienceTech Institute\\2020-05-30-python_crash_course\\projects_with_assan\\healthcare\\material\\images\\pelvis\\tech-adjust\\testYoloFile"
VGGAnnotatorDir = "D:\\BIG_DATA\\DSTI\\OneDrive - Data ScienceTech Institute\\2020-05-30-python_crash_course\\projects_with_assan\\healthcare\\material\\images\\CHEST\\Andrea"

#name of the annotation file from VGG
#VGGAnnotationFile = "pelvis_13_10_2020_annotation.csv"
VGGAnnotationFile = "VGG_CHEST_Andrea_16Nov2020_merged_with_Mouna.csv"

# Directory with all and only images
imgDir = "D:\\BIG_DATA\\DSTI\\OneDrive - Data ScienceTech Institute\\2020-05-30-python_crash_course\\projects_with_assan\\healthcare\\material\\images\\CHEST\\Andrea\\img"

# Directory where to save the <imageId>.txt files along with the <imageId>.png
imgAndYoloAnn = "D:\\BIG_DATA\\DSTI\\OneDrive - Data ScienceTech Institute\\2020-05-30-python_crash_course\\projects_with_assan\\healthcare\\material\\images\\CHEST\\Andrea\\annotations"

# Directory where to save train.txt and test.txt
directoryTrainTxt = "D:\\BIG_DATA\\DSTI\\OneDrive - Data ScienceTech Institute\\2020-05-30-python_crash_course\\projects_with_assan\\healthcare\\material\\images\\CHEST\\Andrea\\pathImg"

# Name of the txt file listing all images - make sure to specify the proper 
# name giving info about what images you are dealing with
# this file will be saved in the folder referred by directoryTrainTxt
nameTxt = "Images_chest.txt"

# Path of the folder on docker image that will host the pics
darknetFolderTxt = "/exchange/images"

In [3]:
# Dictionary with the class code for each class
classDic = {"PELVIS": 0, "SPINE": 1, "ABDOMEN": 2, "CHEST": 3, "HEAD": 4, "SHOULDER R": 5, "SHOULDER L": 6}
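
Darknet reads class names from a `.names` file with one name per line, where the line position (starting at 0) is the class code. The sketch below builds that content from the same dictionary so the two cannot drift apart. The variable names `names_in_order` and `names_content` are illustrative, and writing the string to disk is left out.

```python
classDic = {"PELVIS": 0, "SPINE": 1, "ABDOMEN": 2, "CHEST": 3,
            "HEAD": 4, "SHOULDER R": 5, "SHOULDER L": 6}

# Order the class names by their code so that line i holds the name of class i
names_in_order = sorted(classDic, key=classDic.get)

# One class name per line
names_content = "\n".join(names_in_order) + "\n"
print(names_content)
```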

In [4]:
#import annotations
annotation = pd.read_csv(VGGAnnotatorDir + "\\" + VGGAnnotationFile)

#check annotations
#annotation.head(5)

In [5]:
# retain only columns of interest
annotShort = annotation[["filename","region_id","region_shape_attributes", "region_attributes"]]

In [6]:
# check
#annotShort.head(4)

In [7]:
def spotFront(attributes):
    """
    This function returns 1 if the input string contains the word "FRONT", 0 otherwise 
    
    This is needed to select only those annotations that refer to images taken from the FRONT view
    """
    return 1 if "FRONT" in attributes else 0

In [8]:
# Create column Front: it takes value 1 if the annotation refers to a FRONT image, 0 otherwise 
annotShort["Front"] = annotShort["region_attributes"].map( spotFront )

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
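
The warning is raised because `annotShort` was obtained by selecting columns from `annotation`, so pandas cannot tell whether the assignment is meant to reach the original DataFrame. One common way to avoid it, sketched here on a toy DataFrame with made-up rows, is to take an explicit copy before adding columns. The same applies to `frontAnnotations`, which is a filtered slice of `annotShort`.

```python
import pandas as pd

# Toy stand-in for the VGG csv (hypothetical rows)
annotation = pd.DataFrame({
    "filename": ["a.png", "b.png"],
    "region_attributes": ['{"CHEST":{"FRONT":true}}', '{"CHEST":{}}'],
    "unused_column": [1, 2],
})

# .copy() makes annotShort an independent DataFrame, so adding a column is unambiguous
annotShort = annotation[["filename", "region_attributes"]].copy()
annotShort["Front"] = annotShort["region_attributes"].map(lambda s: 1 if "FRONT" in s else 0)

print(annotShort["Front"].tolist())  # [1, 0]
```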
  


In [9]:
#annotShort.head(3)

In [10]:
# retain only rows that have column Front = 1
frontAnnotations = annotShort[annotShort["Front"] == 1]

In [11]:
#frontAnnotations.head(3)

In [12]:
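def extractAttribute(inputString):
    """
    This function returns the section of the body that has the attribute "FRONT":true,
    e.g. CHEST for an input that contains "CHEST":{"FRONT":true}
    """
    #print(type(inputString))
    charToRemove = str.maketrans("","","{}") # translation map to remove "{" and "}"
    
    inputString = inputString.translate(charToRemove) # remove "{" and "}"
    
    # keep the first comma-separated element that mentions FRONT, e.g. "CHEST":"FRONT":true
    res = [x for x in inputString.split(",") if x.find("FRONT") != -1][0]
    
    # drop the attribute part so that only the quoted section name is left
    res = res.split(":\"FRONT\":true")[0]
    
    # remove the quotes around the section name
    res = res.replace("\"","")
    
    return res    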
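
Since `region_attributes` holds JSON, an alternative to the string manipulation above is to parse it with the standard `json` module. This is only a sketch: the name `extract_front_section` is hypothetical and the input is a shortened string shaped like the rows displayed in this notebook.

```python
import json

def extract_front_section(region_attributes):
    """Return the first body section flagged with "FRONT": true, or None if there is none."""
    attributes = json.loads(region_attributes)
    for section, flags in attributes.items():
        if isinstance(flags, dict) and flags.get("FRONT") is True:
            return section
    return None

# Shortened, illustrative input
example = '{"HEAD":{},"CHEST":{"FRONT":true},"ABDOMEN":{}}'
print(extract_front_section(example))  # CHEST
```

Unlike indexing the first match with `[0]`, this version returns `None` instead of raising when no section is flagged.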

In [13]:
#type(frontAnnotations["region_attributes"][0])

In [14]:
# create column Section: it shows the body section the annotation refers to
frontAnnotations["Section"] = frontAnnotations["region_attributes"].map(extractAttribute)



In [15]:
frontAnnotations.head(3)

Unnamed: 0,filename,region_id,region_shape_attributes,region_attributes,Front,Section
0,4800.png,0,"{""name"":""rect"",""x"":36,""y"":40,""width"":397,""heig...","{""HEAD"":{},""CHEST"":{""FRONT"":true},""ABDOMEN"":{}...",1,CHEST
1,4800.png,1,"{""name"":""rect"",""x"":189,""y"":0,""width"":83,""heigh...","{""HEAD"":{},""CHEST"":{},""ABDOMEN"":{},""PELVIS"":{}...",1,SPINE
2,4800.png,2,"{""name"":""rect"",""x"":35,""y"":343,""width"":399,""hei...","{""HEAD"":{},""CHEST"":{},""ABDOMEN"":{""FRONT"":true}...",1,ABDOMEN


In [16]:
def extractBox(boxCoord):
    """
    This function takes a string like the one in the column "region_shape_attributes" and returns
    the values of x, y, width, height - note: the values are returned as a list of floats
    """
    
    
    tranMap = str.maketrans("","","{}\"") # translation map to remove "{", "}" and the double quote
    
    boxCoord = boxCoord.translate(tranMap) # e.g. name:rect,x:36,y:40,width:397,height:380
    
    splitBoxCoord = re.split(":|,",boxCoord) # alternating keys and values
    
    # the values of x, y, width, height sit at positions 3, 5, 7, 9
    res = [ float(splitBoxCoord[i]) for i in range(3,10,2) ]
    
    return res
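
The positional indexing above relies on the keys always appearing in the order name, x, y, width, height. A sketch that reads the keys by name with the standard `json` module is shown below. The function name `extract_box` is hypothetical, and the example string is reconstructed from the first displayed row, assuming the truncated key is `height`.

```python
import json

def extract_box(region_shape_attributes):
    """Return [x, y, width, height] as floats, looked up by key rather than by position."""
    shape = json.loads(region_shape_attributes)
    return [float(shape[key]) for key in ("x", "y", "width", "height")]

# Reconstructed example (assumption: the truncated key in the display is "height")
example = '{"name":"rect","x":36,"y":40,"width":397,"height":380}'
print(extract_box(example))  # [36.0, 40.0, 397.0, 380.0]
```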

In [17]:
frontAnnotations["Xmin, Ymin, Width, Heigth"] = frontAnnotations["region_shape_attributes"].map(extractBox)



In [18]:
frontAnnotations.head(2)

Unnamed: 0,filename,region_id,region_shape_attributes,region_attributes,Front,Section,"Xmin, Ymin, Width, Heigth"
0,4800.png,0,"{""name"":""rect"",""x"":36,""y"":40,""width"":397,""heig...","{""HEAD"":{},""CHEST"":{""FRONT"":true},""ABDOMEN"":{}...",1,CHEST,"[36.0, 40.0, 397.0, 380.0]"
1,4800.png,1,"{""name"":""rect"",""x"":189,""y"":0,""width"":83,""heigh...","{""HEAD"":{},""CHEST"":{},""ABDOMEN"":{},""PELVIS"":{}...",1,SPINE,"[189.0, 0.0, 83.0, 416.0]"


In [19]:
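def getImgDim(imageFile, imgPath):
    """
    This function takes as input the image file name along with its path, and returns the
    dimensions of the image as the tuple (height, width, channels)
    """
    
    finalPath = os.path.join(imgPath, imageFile)
    
    idImg = cv2.imread(finalPath)
    
    # cv2.imread returns None, without raising, when the file cannot be read
    if idImg is None:
        raise FileNotFoundError(f"Cannot read image: {finalPath}")
    
    return idImg.shape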

In [20]:
#imgDir = os.path.join(imgDir,"images_renamed_pos") using os.path.join does not work

# all images file names
imgID = os.listdir(imgDir)

#dictionary where the {key:value} is {imgFileName:tuple(Img_height, Img_width, channel)} ==> needed to get image dimensions
IDdimension = { name : getImgDim(name, imgDir) for name in imgID }


In [21]:
def VggToYolo(filename, vggAnn, imageDimDict ):
    """
    This function converts one VGG box into a Yolo compatible annotation. It is applied row by row
    in order to create a column that has the annotation in Yolo compatible format.
    E.g.:
    
    dataframe["YoloAnn"] = dataframe[["filename", "Xmin, Ymin, Width, Heigth"]].apply(
        lambda x: VggToYolo(x.loc["filename"], x.loc["Xmin, Ymin, Width, Heigth"], IDdimension), axis = 1)
    
    Input:
    
        filename = filename of an image

        vggAnn = list of 4 floats [BoxXmin, BoxYmin, BoxWidth, BoxHeight]

        imageDimDict = dictionary with {imgFileName:tuple(Img_height, Img_width, channel)}
    
    Output:
    
        [BoxXRelCenter, BoxYRelCenter, BoxRelWidth, BoxRelHeight], each one as a string with 6 decimals
        
        where
            - BoxXRelCenter = (xmin + xmax)/(2* img_width) 
            - BoxYRelCenter = (ymin + ymax)/(2* img_height)
            - BoxRelWidth = (BoxWidth/img_width)
            - BoxRelHeight = (BoxHeight/img_height)
    """
    #extract real image height
    img_height = imageDimDict[filename][0] 
    
    #extract real image width
    img_width = imageDimDict[filename][1] 
    
    #Vgg info
    xmin , ymin, BoxWidth, BoxHeight = vggAnn
    
    #infer the absolute xmax and ymax of the annotation
    xmax = xmin + BoxWidth
    ymax = ymin + BoxHeight
    
    #Yolo annotation as numbers
    yoloNum = [
        round((xmin + xmax)/(2* img_width),6),   # BoxXRelCenter
        round((ymin + ymax)/(2* img_height),6),  # BoxYRelCenter
        round((BoxWidth/img_width),6),           # BoxRelWidth
        round((BoxHeight/img_height),6),         # BoxRelHeight
    ]
    
    #every Yolo coordinate has to lie in the interval [0,1]
    if any((v < 0) or (v > 1) for v in yoloNum):
        print("ERROR: one of the Yolo coordinates is outside the interval [0,1]")
        print(f"File name is {filename}")
        print(f"BoxXRelCenter is {yoloNum[0]}")
        print(f"BoxYRelCenter is {yoloNum[1]}")
        print(f"BoxRelWidth is {yoloNum[2]}")
        print(f"BoxRelHeight is {yoloNum[3]}")
    
    #Yolo annotation as strings with 6 decimals
    return ["%0.6f" % v for v in yoloNum]
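
A simple way to sanity-check a conversion like the one above is to invert it and compare the result with the original pixel box. The sketch below does this for a made-up 500 x 400 image; the helper name `yolo_to_vgg` is hypothetical and not part of the notebook.

```python
def yolo_to_vgg(yolo_box, img_width, img_height):
    """Map [XcRel, YcRel, WidthRel, HeightRel] back to pixel [xmin, ymin, width, height]."""
    x_center, y_center, rel_width, rel_height = (float(v) for v in yolo_box)
    box_width = rel_width * img_width
    box_height = rel_height * img_height
    xmin = x_center * img_width - box_width / 2
    ymin = y_center * img_height - box_height / 2
    return [xmin, ymin, box_width, box_height]

# Hypothetical Yolo annotation for a 500 x 400 image, given as strings like the notebook's output
recovered = yolo_to_vgg(["0.400000", "0.250000", "0.400000", "0.250000"], 500, 400)
print([round(v, 3) for v in recovered])  # [100.0, 50.0, 200.0, 100.0]
```

Because the Yolo values are rounded to 6 decimals, the recovered pixels match the originals only up to a small rounding error.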
    

In [22]:

frontAnnotations["Yolo XcRel, YcRel, WidthRel, HeightRel"] = (
    
    frontAnnotations[["filename", "Xmin, Ymin, Width, Heigth"]].apply(lambda x: VggToYolo(x.loc["filename"] ,x.loc["Xmin, Ymin, Width, Heigth"],IDdimension), axis = 1)
    )



In [23]:
#frontAnnotations[["filename", "Xmin, Ymin, Width, Heigth","Yolo XcRel, YcRel, WidthRel, HeightRel"]].head(3)

In [24]:
print(frontAnnotations["Yolo XcRel, YcRel, WidthRel, HeightRel"][2])

['0.519956', '0.856540', '0.884701', '0.265823']


In [25]:
frontAnnotations["class"] = frontAnnotations["Section"].map(classDic)



In [26]:
YoloAnnotations = frontAnnotations[["filename", "class","Xmin, Ymin, Width, Heigth","Yolo XcRel, YcRel, WidthRel, HeightRel"]]

## Create the txt files needed by Yolo

Each file has to be named <pictureId>.txt and has to contain:
    - one line per annotation on the picture
    - on each line:
        - the class the object belongs to, following the same convention established in the file *.names
        - the relative coordinates of the box: XcRel, YcRel, WidthRel, HeightRel
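
To make the line format concrete, the sketch below parses one such line and checks it. The example line combines class 2 (ABDOMEN) with the coordinates printed earlier for row 2; the function name `parse_yolo_line` is hypothetical.

```python
def parse_yolo_line(line):
    """Split "<class> <XcRel> <YcRel> <WidthRel> <HeightRel>" into (class_id, coords)."""
    fields = line.split()
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}: {line!r}")
    class_id = int(fields[0])
    coords = [float(v) for v in fields[1:]]
    if any(c < 0 or c > 1 for c in coords):
        raise ValueError(f"coordinate outside [0,1]: {line!r}")
    return class_id, coords

class_id, coords = parse_yolo_line("2 0.519956 0.856540 0.884701 0.265823")
print(class_id, coords)  # 2 [0.519956, 0.85654, 0.884701, 0.265823]
```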

In [27]:
# Algo is: 
# extract the file name 
# extract the class
# extract the Yolo coordinates
# append the line "<class> <XcRel> <YcRel> <WidthRel> <HeightRel>" to <pictureId>.txt

def YoloFileAnnotation(picId, classId, YoloCoord, targetDirImgAndAnn):
    """
    This function has to be applied to each row of the YoloAnnotations dataframe and creates one file 
    per image listed in YoloAnnotations.
    Each file has as many rows as the annotations of its image, and each row lists the 
    class and the Yolo-compatible coordinates of one annotation.
    
    Note: the file is opened in append mode, so running this twice on the same folder duplicates the rows.
    """
    
    #replace the image extension with ".txt"
    picId = os.path.splitext(picId)[0] + ".txt"
    
    #print(picId)
    
    #build the line to be written
    output = (str(classId) + " ")
    output += " ".join(str(x) for x in YoloCoord)
    
    try:
        with open(targetDirImgAndAnn + "\\" + picId,"a") as f:
            f.write(output + "\n")
        #print("Executed")
    except Exception as e:
        print("Cannot write the file on disk. \nCheck whether the below target folder exists/is accessible:\n")
        print(targetDirImgAndAnn)
        print(e)
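
Because the function above appends to each file, rerunning the cell adds the same rows again. A possible alternative, sketched below under the assumption that all rows are available at once, is to group the lines per image and write each file a single time in write mode. The name `write_yolo_files` and the rows are made up for illustration, and a temporary folder stands in for the real target directory.

```python
import os
import tempfile
from collections import defaultdict

def write_yolo_files(rows, target_dir):
    """Write one <pictureId>.txt per image; rows are (filename, class_id, yolo_coords) tuples."""
    lines_per_file = defaultdict(list)
    for filename, class_id, yolo_coords in rows:
        txt_name = os.path.splitext(filename)[0] + ".txt"
        line = str(class_id) + " " + " ".join(str(v) for v in yolo_coords)
        lines_per_file[txt_name].append(line)

    # "w" mode overwrites any previous content, so a rerun does not duplicate lines
    for txt_name, lines in lines_per_file.items():
        with open(os.path.join(target_dir, txt_name), "w") as f:
            f.write("\n".join(lines) + "\n")
    return sorted(lines_per_file)

# Hypothetical rows: two boxes on one image, one box on another
rows = [
    ("img_a.png", 3, ["0.500000", "0.500000", "0.800000", "0.600000"]),
    ("img_a.png", 1, ["0.500000", "0.450000", "0.200000", "0.900000"]),
    ("img_b.png", 0, ["0.400000", "0.250000", "0.400000", "0.250000"]),
]

with tempfile.TemporaryDirectory() as tmp_dir:
    write_yolo_files(rows, tmp_dir)
    written = write_yolo_files(rows, tmp_dir)  # the second run overwrites the first
    with open(os.path.join(tmp_dir, "img_a.txt")) as f:
        img_a_lines = f.read().splitlines()

print(written)           # ['img_a.txt', 'img_b.txt']
print(len(img_a_lines))  # 2
```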
        
    

In [28]:
def ExecuteYoloFileAnnotation(inputDF, targetDirImgAndAnn):
    """
    This function applies the function YoloFileAnnotation to the dataframe inputDF
    """
    inputDF.apply(lambda x:  YoloFileAnnotation(x.loc["filename"], x.loc["class"], x.loc["Yolo XcRel, YcRel, WidthRel, HeightRel"], targetDirImgAndAnn), axis = 1)
    
    
    

In [29]:
ExecuteYoloFileAnnotation(YoloAnnotations, imgAndYoloAnn)

## Copy images to the folder pointed to by variable imgAndYoloAnn

In [30]:
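def moveImage(imageId, currentDir, newDir):
    """
    This function copies the image imageId from currentDir to newDir (the original stays in currentDir)
    """
    sh.copy(currentDir + "\\" + imageId, newDir)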

In [31]:
YoloAnnotations["filename"].map( lambda x: moveImage(x,imgDir, imgAndYoloAnn ) )

0       None
1       None
2       None
3       None
4       None
        ... 
1090    None
1091    None
1092    None
1093    None
1094    None
Name: filename, Length: 1075, dtype: object
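
The Series above has 1075 entries because the copy runs once per annotation row, while the notebook later reports 363 distinct file names, so images with several boxes are copied more than once. A sketch that copies each file only once is shown below. The name `copy_unique_images` is hypothetical, and temporary folders with placeholder files stand in for the real image folders.

```python
import os
import shutil
import tempfile

def copy_unique_images(filenames, source_dir, target_dir):
    """Copy each distinct file name once, keeping the order of first appearance."""
    copied = []
    for name in dict.fromkeys(filenames):  # drops duplicates, preserves order
        shutil.copy(os.path.join(source_dir, name), target_dir)
        copied.append(name)
    return copied

with tempfile.TemporaryDirectory() as src, tempfile.TemporaryDirectory() as dst:
    # Placeholder files standing in for real images
    for name in ("img_a.png", "img_b.png"):
        with open(os.path.join(src, name), "wb") as f:
            f.write(b"placeholder")
    copied = copy_unique_images(["img_a.png", "img_a.png", "img_b.png"], src, dst)
    in_target = sorted(os.listdir(dst))

print(copied)     # ['img_a.png', 'img_b.png']
print(in_target)  # ['img_a.png', 'img_b.png']
```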

## Create file with list of images for train and test

In [32]:
#extract images filename
ArrayFile = YoloAnnotations["filename"].unique()

In [33]:
# Create train.txt
def CreateTrainTxt(inputSeriesImages, SaveTotargetDir, targetFileName, ImagePathInsideDarknet):
    """
    This function writes a txt file named targetFileName that lists all images.
    Specifically:
    
    - inputSeriesImages: Series or array containing the image file names 
    - SaveTotargetDir: local directory where the final txt file will be saved
    - targetFileName: name of the txt file that this function will create
    - ImagePathInsideDarknet: each entry in the txt file will be the concatenation of ImagePathInsideDarknet and 
        the image name taken from inputSeriesImages
    
    Note: the file is opened in append mode, so calling the function twice with the same targetFileName
    lists every image twice.
    """
    
    # Convert the input to a list
    listImages = list(inputSeriesImages)
    
    # create a list of strings where each element is the concatenation of ImagePathInsideDarknet and the image file name 
    finalListImages = [ImagePathInsideDarknet + "/" + i for i in listImages]
    
    # one image path per line
    content = "\n".join(finalListImages)
    
    try:
        with open(SaveTotargetDir + "\\" + targetFileName,"a") as f:
            f.write(content + "\n")
        #print("Executed")
    except Exception as e:
        print("Cannot write the file on disk. \nCheck whether the below target folder exists/is accessible:\n")
        print(SaveTotargetDir)
        print(e)


In [34]:
CreateTrainTxt(ArrayFile, directoryTrainTxt, nameTxt, darknetFolderTxt)

In [35]:
len(ArrayFile)

363

In [37]:
from sklearn.model_selection import train_test_split

In [38]:
train, test = train_test_split(ArrayFile)
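
Called as above, `train_test_split` shuffles the files and holds out 25% of them for the test set by default, and each run produces a different split. If the split needs to be reproducible, one option is to fix `random_state`. The sketch below uses made-up file names, and both the seed value 42 and the explicit `test_size` are arbitrary illustrative choices.

```python
from sklearn.model_selection import train_test_split

# Hypothetical file names standing in for ArrayFile
files = [f"img_{i}.png" for i in range(8)]

# The same random_state gives the same split on every run
train_files, test_files = train_test_split(files, test_size=0.25, random_state=42)
train_again, test_again = train_test_split(files, test_size=0.25, random_state=42)

print(len(train_files), len(test_files))  # 6 2
print(test_files == test_again)           # True
```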

In [54]:
CreateTrainTxt(train, directoryTrainTxt, "train_chest.txt", darknetFolderTxt)

In [55]:
CreateTrainTxt(test, directoryTrainTxt, "test_chest.txt", darknetFolderTxt)