<a href="https://colab.research.google.com/github/candicesheehan/MusselCNN/blob/main/Mussel_3_VIA_to_RetinaNet_subsetted.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Subset and convert VIA annotations CSV file to RetinaNet CSV format

**Before running this script, make sure that your Google Drive folder contains all of the tiles and the `spatial_data.json` that you created (step 1), and the annotations `csv` that you exported from VIA (step 2).**

<a href="https://colab.research.google.com/github/candicesheehan/MusselCNN/blob/master/3_VIA_to_RetinaNet_subsetted.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
<center> Be sure to update the hyperlink above if you clone this notebook and want to point to a different GitHub repository </center>

### Connect to our Google Drive folder and pull `csv` and `json` files
Note: when you run this cell, it will print a link that you must click. Grant Google the requested permissions, then copy the resulting code into the box that appears in the output section of this cell.

If customizing this code, point the `drive_folder` variable to the URL of your shared Google Drive folder.

In [None]:
!pip install -U -q PyDrive
import os, json, csv
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

# 1. Authenticate and create the PyDrive client.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

# choose a local (colab) directory to store the data.
local_download_path = os.path.expanduser('VIA_annotations')
# create the directory if it doesn't already exist
os.makedirs(local_download_path, exist_ok=True)

# 2. Auto-iterate using the query syntax
#    https://developers.google.com/drive/v2/web/search-parameters

# set variable to the destination google drive folder you want to pull from
drive_folder = 'https://drive.google.com/drive/folders/1INuRNVKvKMy8L_Nb6lmoVbyvScWK0-0D'

# this bit points the code to that google drive folder
pointer = str("'" + drive_folder.split("/")[-1] + "'" + " in parents")

file_list = drive.ListFile({'q': pointer}).GetList()

# this bit pulls all csv and json files from the directory specified above
for f in file_list:
  fname = os.path.join(local_download_path, f['title'])
  if fname.endswith(".json") or fname.endswith(".csv"):
    f_ = drive.CreateFile({'id': f['id']})
    f_.GetContentFile(fname)
    print("Pulled file: " + fname)
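
The query string passed to `drive.ListFile` is built from the last path segment of the folder URL. The sketch below isolates that step in a hypothetical helper, `folder_query`, which is not part of the notebook. It additionally assumes that the URL may carry a trailing slash or a query string such as `?usp=sharing`, which a plain `split("/")[-1]` would not handle.

```python
from urllib.parse import urlparse


def folder_query(drive_folder_url):
    """Build a Drive search query for files directly inside a folder.

    Hypothetical helper (not part of the notebook). Assumes the folder ID
    is the last non-empty path segment of the URL.
    """
    # urlparse(...).path drops any query string such as "?usp=sharing"
    path = urlparse(drive_folder_url).path
    folder_id = path.rstrip("/").split("/")[-1]
    if not folder_id:
        raise ValueError("no folder ID found in URL: " + drive_folder_url)
    return "'{}' in parents".format(folder_id)


url = "https://drive.google.com/drive/folders/1INuRNVKvKMy8L_Nb6lmoVbyvScWK0-0D"
print(folder_query(url))
print(folder_query(url + "?usp=sharing"))
```

Both calls should produce the same query, so a link copied from the Drive "share" dialog would work as well as one copied from the address bar.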


### Set up the python environment

In [None]:
# import necessary modules
import os, csv, random, json

# set pseudo-random values for replicability
random.seed(1)

# use this variable to set input directory
input_dir = local_download_path

# use this variable to set output directory
output_dir = 'RetinaNet_annotations'

# create the directory if it doesn't already exist
os.makedirs(output_dir, exist_ok=True)

### Identify necessary files from among files in the input directory

In [None]:
annotations_file = []
spatial_data_file = []

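for fname in os.listdir(input_dir):
  if fname.endswith(".csv"):
    annotations_candidate = os.path.join(input_dir, fname)
    with open(annotations_candidate, "r", newline="") as f:
      # a VIA export starts with these three column names;
      # default to an empty header so an empty csv doesn't raise StopIteration
      header = next(csv.reader(f, delimiter=","), [])
      if header[0:3] == ['filename', 'file_size', 'file_attributes']:
        annotations_file = annotations_candidate

  if fname.endswith(".json"):
    spatial_data_candidate = os.path.join(input_dir, fname)
    with open(spatial_data_candidate) as f:
      try:
        # the tile spatial data file is the one with this nested key structure
        image_list = list(json.load(f)["tile_pointers"]["image_locations"].keys())
        spatial_data_file = spatial_data_candidate
      except (ValueError, KeyError, TypeError, AttributeError):
        # not valid JSON, or not the tile spatial data file
        continue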

if annotations_file == []:
  raise Exception("VIA annotations file not found")
elif spatial_data_file == []:
  raise Exception("tile spatial data file not found")

print("annotations file identified as " + annotations_file)
print("tile data file identified as " + spatial_data_file)
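
The two checks above can also be read as small predicates: a CSV counts as the VIA export if its first three column names match, and a JSON counts as the tile spatial data if it has the `tile_pointers` → `image_locations` structure. The sketch below uses hypothetical helper names (`is_via_annotations`, `is_tile_spatial_data`) and made-up file contents written to a temporary directory. It only illustrates the idea and is not part of the notebook.

```python
import csv
import json
import os
import tempfile

VIA_HEADER = ['filename', 'file_size', 'file_attributes']


def is_via_annotations(path):
    """Return True if the CSV at `path` starts with the VIA export header."""
    with open(path, "r", newline="") as f:
        header = next(csv.reader(f), [])  # [] if the file is empty
    return header[:3] == VIA_HEADER


def is_tile_spatial_data(path):
    """Return True if the JSON at `path` has tile_pointers -> image_locations."""
    try:
        with open(path) as f:
            data = json.load(f)
        return isinstance(data["tile_pointers"]["image_locations"], dict)
    except (ValueError, KeyError, TypeError):
        # not valid JSON, or the expected keys are missing
        return False


# Demo with made-up files (tile name and values are placeholders)
with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "via_export.csv")
    with open(csv_path, "w", newline="") as f:
        csv.writer(f).writerow(VIA_HEADER + ["region_count", "region_id"])

    json_path = os.path.join(tmp, "spatial_data.json")
    with open(json_path, "w") as f:
        json.dump({"tile_pointers": {"image_locations": {"tile_0_0.jpg": [0, 0]}}}, f)

    via_ok = is_via_annotations(csv_path)
    json_ok = is_tile_spatial_data(json_path)
    wrong_file_ok = is_tile_spatial_data(csv_path)

print(via_ok, json_ok, wrong_file_ok)
```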

### Shuffle and split images into 3 datasets: Training, Testing, Validation

In [None]:
# create a list of tiles from our tile spatial data json
with open(spatial_data_file) as f:
    image_list = list(json.load(f)["tile_pointers"]["image_locations"].keys())

# shuffle the image list randomly and get total count
random.shuffle(image_list)
total_count = len(image_list)

# set indices for breaking up the total dataset into TTV parts
test_fraction, valid_fraction, train_fraction = 0.1, 0.04, 0.86

# raise an error if the fractions don't add up to 1
# (compare with a tolerance, since exact float equality is unreliable)
if abs(sum([test_fraction, valid_fraction, train_fraction]) - 1.0) > 1e-9:
    raise Exception("fractions should add up to 1")

test_index = int(total_count * test_fraction)
valid_index = int(total_count * (test_fraction + valid_fraction))

# use indices to break up dataset into the three parts
test_dataset, valid_dataset, train_dataset = image_list[:test_index], image_list[test_index:valid_index], image_list[valid_index:]
print(len(test_dataset), len(valid_dataset), len(train_dataset))

# spit out CSV listing the image subsets
subset_list = []
for row in test_dataset:
    subset_list.append([row, "testing"])
for row in valid_dataset:
    subset_list.append([row, "validation"])
for row in train_dataset:
    subset_list.append([row, "training"])
with open(output_dir + '/subset_list.csv', 'w', newline='') as fp:
    writer = csv.writer(fp)
    writer.writerows(subset_list)
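
To see how the fractions turn into slice indices, the sketch below wraps the same shuffle-and-slice logic in a hypothetical function, `split_dataset`, and applies it to 100 made-up tile names. It uses a private `random.Random(seed)` instance instead of the module-level seed, which is an assumption of this sketch rather than what the notebook does.

```python
import random


def split_dataset(items, test_fraction=0.1, valid_fraction=0.04, seed=1):
    """Shuffle a copy of `items` and slice it into (test, valid, train)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    # int() truncates, so any rounding remainder ends up in the training set
    test_index = int(len(shuffled) * test_fraction)
    valid_index = int(len(shuffled) * (test_fraction + valid_fraction))
    return (shuffled[:test_index],
            shuffled[test_index:valid_index],
            shuffled[valid_index:])


tiles = ["tile_{}.jpg".format(i) for i in range(100)]  # made-up tile names
test, valid, train = split_dataset(tiles)
print(len(test), len(valid), len(train))  # 10 4 86
```

With 100 tiles the indices are `int(100 * 0.1) = 10` and `int(100 * 0.14) = 14`, so the three slices hold 10, 4, and 86 tiles and never overlap.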

### Reformat annotations from VIA to RetinaNet format
The following loop reads the VIA-exported CSV line by line, extracts the necessary information from each annotation, converts it to the format that RetinaNet requires (https://github.com/fizyr/keras-retinanet#annotations-format), and builds new annotation lists, one per subset, that RetinaNet can read.

In [None]:
# Create blank variable for each annotations list as we build it
image_annotations_train, image_annotations_test, image_annotations_valid = [], [], []

# Create blank list for class names
class_list = []

# read each line, parse it, convert it, put it all back together
# then drop it in the appropriate subset
with open(annotations_file, "r") as f:
    reader = csv.reader(f, delimiter=",")
    for line in reader: 
        # output we want:
        # format: path/to/image.jpg,x1,y1,x2,y2,class_name
        # example: /data/imgs/img_001.jpg,837,346,981,456,cow
        if 'filename' in line[0]:
            # skip the header row
            continue
        if '{}' in line[5]:
            # skip images with no annotated regions
            continue
            
        filename = line[0]
        
        # pulling from column named "region_shape_attributes"
        box_entry = json.loads(line[5])
        top_left_x, top_left_y, width, height = box_entry["x"], box_entry["y"], box_entry["width"], box_entry["height"]
 
        # skip zero-area (empty) boxes
        if width == 0 or height == 0:
            continue
        
        # clamp negative coordinates (boxes drawn past the image edge) to 1
        if top_left_x < 0:
            top_left_x = 1
        if top_left_y < 0:
            top_left_y = 1

        # convert from "top left and width/height" to "x and y values at each corner of the box"
        x1, x2, y1, y2 = top_left_x, top_left_x + width, top_left_y, top_left_y + height
        
        # pulling from column named "region_attributes" to get class names
        name = json.loads(line[6])["Age Class"]

        # skip unknown class, in this case. Might be useful in other applications though,
        # e.g. total object count irrespective of class
        if name == "Unknown":
            continue

        # build list of classes as we encounter new names
        if name not in class_list:
            class_list.append(name)

        # create the annotation row
        new_row = [filename, x1, y1, x2, y2, name]
        
        # append the row to the correct subset (training, testing, or validation)
        if filename in train_dataset:
            image_annotations_train.append(new_row)
        elif filename in test_dataset:
            image_annotations_test.append(new_row)
        else:
            image_annotations_valid.append(new_row)

ttv_ = list(map(len, [image_annotations_train, image_annotations_test, image_annotations_valid]))
ttv = list(map(int, [x/sum(ttv_)*100 for x in ttv_]))
print("total breakdown of annotations: {n} - {tr}% training set, {t}% testing set, {v}% validation set".format(tr=str(ttv[0]), t=str(ttv[1]), v=str(ttv[2]), n=ttv_))
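
The core of the conversion is the change from VIA's "top-left corner plus width and height" to RetinaNet's two-corner form. The sketch below isolates that arithmetic in a hypothetical helper, `via_box_to_corners`, applied to a made-up `region_shape_attributes` cell. It leaves out the clamping of negative coordinates and the class filtering done in the loop above.

```python
import json


def via_box_to_corners(region_shape_attributes):
    """Convert a VIA rect JSON string to (x1, y1, x2, y2) corner coordinates."""
    box = json.loads(region_shape_attributes)
    x1, y1 = box["x"], box["y"]
    x2, y2 = x1 + box["width"], y1 + box["height"]
    return x1, y1, x2, y2


# made-up contents of one "region_shape_attributes" cell
cell = '{"name":"rect","x":837,"y":346,"width":144,"height":110}'
x1, y1, x2, y2 = via_box_to_corners(cell)

# filename and class name are placeholders
row = ["img_001.jpg", x1, y1, x2, y2, "cow"]
print(",".join(map(str, row)))  # img_001.jpg,837,346,981,456,cow
```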

### Output annotations.csv and classes.csv

In [None]:
with open(output_dir + '/annotations_train.csv', 'w', newline='') as fp:
    writer = csv.writer(fp)
    writer.writerows(image_annotations_train)

with open(output_dir + '/annotations_test.csv', 'w', newline='') as fp:
    writer = csv.writer(fp)
    writer.writerows(image_annotations_test)

with open(output_dir + '/annotations_valid.csv', 'w', newline='') as fp:
    writer = csv.writer(fp)
    writer.writerows(image_annotations_valid)

In [None]:
# this bit uses our class_list (built during annotations processing) to create our classes file
# note again that "unknown" ambiguous cases have been excluded in this case

detection_classes = []

for i, class_name in enumerate(class_list):
    detection_classes.append([class_name, i])

with open(output_dir + '/classes.csv', 'w', newline='') as fp:
    writer = csv.writer(fp)
    writer.writerows(detection_classes)
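
The classes file pairs each class name with an integer ID starting at 0, in the order the names were first encountered. The sketch below shows the resulting rows for two made-up class names (they are placeholders, not the project's actual "Age Class" values) and writes them to an in-memory buffer instead of a file.

```python
import csv
import io


def class_rows(class_names):
    """Return [name, id] rows, keeping first-seen order and dropping repeats."""
    unique_names = []
    for name in class_names:
        if name not in unique_names:
            unique_names.append(name)
    return [[name, i] for i, name in enumerate(unique_names)]


# placeholder class names, in the order they might appear in the annotations
rows = class_rows(["Adult", "Juvenile", "Adult"])

buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
lines = buffer.getvalue().splitlines()
print(lines)  # ['Adult,0', 'Juvenile,1']
```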

#### Zip data folder for download

In [None]:
# zip up the output directory into an archive for download
import subprocess
output_file_name = 'Step_3_{o}'.format(o=output_dir)
subprocess.call(['zip', '-r', output_file_name + '.zip', '/content/' + output_dir])

from google.colab import files
files.download(output_file_name + ".zip")
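
If the `zip` command-line tool is not available (for example, when running outside Colab), the standard library's `shutil.make_archive` is one possible alternative. The sketch below is not what the notebook does. It builds a throwaway directory with a made-up `classes.csv` so that it can run anywhere, then archives it and lists the archive contents.

```python
import os
import shutil
import tempfile
import zipfile

with tempfile.TemporaryDirectory() as tmp:
    # stand-in for the notebook's output directory, with one placeholder file
    out_dir = os.path.join(tmp, "RetinaNet_annotations")
    os.makedirs(out_dir)
    with open(os.path.join(out_dir, "classes.csv"), "w") as f:
        f.write("Adult,0\n")

    # make_archive appends ".zip" to the base name and returns the archive path
    archive_base = os.path.join(tmp, "Step_3_RetinaNet_annotations")
    archive_path = shutil.make_archive(
        archive_base, "zip", root_dir=tmp, base_dir="RetinaNet_annotations"
    )

    with zipfile.ZipFile(archive_path) as zf:
        archived_names = zf.namelist()

print(archived_names)
```

Passing `root_dir` and `base_dir` keeps the paths inside the archive relative to the output folder rather than to the filesystem root.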

At the end of this script you should have downloaded a zip archive containing 5 `csv` files (the testing, training, and validation annotation subsets, the subset list, and the classes list). Drop these all in the Google Drive folder so they can be ingested by our CNN code in the next step.

Next steps:

4) train, refine, and test CNN using VIA annotations and the tiles generated here

5) export CNN outputs