<a href="https://colab.research.google.com/github/aubricot/computer_vision_with_eol_images/blob/master/object_detection_for_image_cropping/multitaxa/multitaxa_split_train_test.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Split EOL user crops dataset into train and test for all taxa
---
*Last Updated 29 March 2020*  
Instead of creating image annotations from scratch, this notebook uses EOL user-generated cropping coordinates to build the training and testing data used to train and evaluate object detection models (YOLO via darkflow, SSD, and Faster-RCNN). 

Following the [Pareto principle](https://en.wikipedia.org/wiki/Pareto_principle), 80% of the original EOL crops dataset for each taxon is randomly selected as training data, and the remaining 20% is held out to test model accuracy. 

Resulting train and test datasets for each taxon are exported for further pre-processing in [multitaxa_preprocessing.ipynb](https://github.com/aubricot/computer_vision_with_eol_images/tree/master/object_detection_for_image_cropping/multitaxa/multitaxa_preprocessing.ipynb), before they are ready to use with the object detection models.

In [0]:
# Mount google drive to import/export files
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Run the cell below once for each taxon (Coleoptera, Anura, Squamata, and Carnivora), changing the file names wherever you see '# TO-DO'.

In [0]:
import pandas as pd
import numpy as np

# Read in EOL user-generated cropping data
# TO-DO: Change the file name to anura_crops.tsv, coleoptera_crops.tsv, squamata_crops.tsv, or carnivora_crops.tsv
crops = pd.read_csv('drive/My Drive/fall19_smithsonian_informatics/train/carnivora_crops.tsv', sep="\t", header=0)
print(crops.head())

# Randomly select 80% of data to use for training
# set seed with random_state=2 for reproducible results
idx = crops.sample(frac = 0.8, random_state=2).index
# idx holds index labels (not positions), so select rows by label with .loc
train = crops.loc[idx]
print(train.head())

# Select the remaining 20% of data for testing using the inverse index from above
test = crops.loc[crops.index.difference(idx)]
print(test.head())

# Write train and test to TSV files
# TO-DO: Change the file names to <taxon>_crops_train.tsv and <taxon>_crops_test.tsv for anura, coleoptera, squamata, or carnivora
train.to_csv('drive/My Drive/fall19_smithsonian_informatics/train/carnivora_crops_train.tsv', sep='\t', header=True, index=False)
test.to_csv('drive/My Drive/fall19_smithsonian_informatics/train/carnivora_crops_test.tsv', sep='\t', header=True, index=False)
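Rather than editing the file names by hand for each taxon, the same split could be wrapped in a loop. The following is a minimal sketch, not part of the original workflow: the helper names `split_train_test` and `split_all_taxa` are hypothetical, and it assumes all four input files follow the `<taxon>_crops.tsv` naming pattern of the carnivora file and sit in the same Drive folder. It also checks that every row ends up in exactly one of the two subsets.

```python
import pandas as pd

# Assumption: all four taxa use the same folder and naming pattern as carnivora above
TAXA = ["coleoptera", "anura", "squamata", "carnivora"]
DATA_DIR = "drive/My Drive/fall19_smithsonian_informatics/train"


def split_train_test(crops, frac=0.8, seed=2):
    """Return (train, test): a random `frac` share of rows and the remainder."""
    # Same sampling call as the cell above (random_state fixes the seed)
    train = crops.sample(frac=frac, random_state=seed)
    # Rows not sampled for training become the test set (requires a unique index)
    test = crops.drop(train.index)
    # Sanity checks: no row is lost and no row appears in both subsets
    assert len(train) + len(test) == len(crops)
    assert train.index.intersection(test.index).empty
    return train, test


def split_all_taxa(data_dir=DATA_DIR, taxa=TAXA):
    """Split each <taxon>_crops.tsv and write the _train/_test files beside it."""
    for taxon in taxa:
        crops = pd.read_csv(f"{data_dir}/{taxon}_crops.tsv", sep="\t", header=0)
        train, test = split_train_test(crops)
        train.to_csv(f"{data_dir}/{taxon}_crops_train.tsv", sep="\t", header=True, index=False)
        test.to_csv(f"{data_dir}/{taxon}_crops_test.tsv", sep="\t", header=True, index=False)
        print(f"{taxon}: {len(train)} train rows, {len(test)} test rows")
```

With Google Drive mounted, calling `split_all_taxa()` would write all eight output files in one pass. Keeping the split in its own function means it can be tried on a small in-memory DataFrame before touching the files on Drive.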