# Split EOL user crops dataset into train and test
---
*Last Updated 11 February 2020*  
Instead of creating image annotations from scratch, EOL user-generated cropping coordinates are used to build training and testing data. The training data teach the object detection models (YOLO via darkflow, SSD, and Faster-RCNN), and the testing data are used to evaluate their accuracy.

Following the [Pareto principle](https://en.wikipedia.org/wiki/Pareto_principle), 80% of the original EOL crops dataset is randomly selected as training data, and the remaining 20% is used to test model accuracy.
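
As a minimal sketch of this 80/20 split, the example below uses a small toy table in place of the real crops file. The `xmin` column and all values are made up for illustration; only the `sample`/`drop` pattern is the point.

```python
import pandas as pd

# Toy stand-in for the crops table (values and the xmin column are hypothetical)
crops = pd.DataFrame({
    "data_object_id": range(100, 110),
    "xmin": range(10),
})

# Draw 80% of the rows at random; random_state makes the draw repeatable
train = crops.sample(frac=0.8, random_state=2)

# Every row that was not drawn for training goes to the test set
test = crops.drop(train.index)

print(len(train), len(test))  # 8 2
```

Dropping by index label guarantees that the two sets do not overlap and together cover every row of the original table.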

The resulting train and test datasets are exported for further pre-processing in [lepidoptera_preprocessing.ipynb](https://github.com/aubricot/object_detection_for_image_cropping/blob/master/lepidoptera_preprocessing.ipynb), after which they are ready to use with the object detection models.

In [1]:
# Mount google drive to import/export files
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


In [0]:
import pandas as pd
import numpy as np

# Read in EOL user-generated cropping data
crops = pd.read_csv('drive/My Drive/fall19_smithsonian_informatics/train/lepidoptera_crops.tsv', sep="\t", header=0)
print(crops.head())

# Randomly select 80% of the rows to use for training
# random_state=2 seeds the sampler so the split is reproducible
train = crops.sample(frac=0.8, random_state=2)
print(train.head())

# Use the remaining 20% of the rows for testing
# Dropping by index label (not position) stays correct even if the index is not 0..n-1
test = crops.drop(train.index)
print(test.head())

# Write test and train to tsvs 
train.to_csv('drive/My Drive/fall19_smithsonian_informatics/train/lepidoptera_crops_train.tsv', sep='\t', header=True, index=False)
test.to_csv('drive/My Drive/fall19_smithsonian_informatics/train/lepidoptera_crops_test.tsv', sep='\t', header=True, index=False)
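
The follow-up cell compares the test table against its pre-processed counterpart and collects image files that no longer have a matching row. A minimal sketch of that comparison is shown here as a dry run, assuming made-up IDs and a placeholder `test_images` directory rather than the real Drive paths:

```python
import os
import pandas as pd

# Hypothetical IDs standing in for the test table and its pre-processed version
test_ids = pd.Series([101, 102, 103, 104], name="data_object_id")
kept_ids = pd.Series([102, 104], name="data_object_id")

# IDs present in the test table but absent after pre-processing
missing = test_ids[~test_ids.isin(kept_ids)]

# Build candidate file paths ("test_images" is a placeholder directory)
paths = [os.path.join("test_images", f"{i}.jpg") for i in missing]

# Dry run: report what would be deleted instead of calling os.remove
for p in paths:
    status = "exists" if os.path.exists(p) else "not found"
    print(f"would delete: {p} ({status})")
```

Printing the candidates first makes it possible to check the list before any file is removed.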

In [12]:
# TODO (katie): figure out why crops_train.tsv and crops_train_notaug.tsv have different numbers of rows in preprocessing.ipynb
# Fix to correct for this (18 feb 19):
# delete test images that are not found in test_notaug
import pandas as pd
import numpy as np
import os

# Read in test images df
test = pd.read_csv('drive/My Drive/fall19_smithsonian_informatics/train/lepidoptera_crops_test.tsv', sep="\t", header=0)
test = test['data_object_id']
print(test.head())

# Read in test images transf df
transf = pd.read_csv('drive/My Drive/fall19_smithsonian_informatics/train/lepidoptera_crops_test_notaug.tsv', sep="\t", header=0)
transf = transf['data_object_id']
print(transf.head())

# Get test images not found in transf df
delfiles = test[~test.isin(transf)]
print(delfiles.head())

# Get test image filenames and paths not found in transf df
delnames = [(str(delfile) + '.jpg') for delfile in delfiles]
print(delnames)
delpaths = [('drive/My Drive/fall19_smithsonian_informatics/train/test_images/' + str(delname)) for delname in delnames]
print(delpaths)
print(len(delpaths))

# Delete test image files not found in transf df from Google Drive
#for delpath in delpaths: 
  #if os.path.exists(delpath):
    #os.remove(delpath)

0     1997045
1    20605305
2    12518535
3    25755145
4    25766731
Name: data_object_id, dtype: int64
0    1997701
1    1999202
2    5818670
3    5818728
4    5819278
Name: data_object_id, dtype: int64
0     1997045
1    20605305
2    12518535
3    25755145
4    25766731
Name: data_object_id, dtype: int64
['1997045.jpg', '20605305.jpg', '12518535.jpg', '25755145.jpg', '25766731.jpg', '24942871.jpg', '25808950.jpg', '9001609.jpg', '29872439.jpg', '26863443.jpg', '25205619.jpg', '31809982.jpg', '19885397.jpg', '12516623.jpg', '19175371.jpg', '29963868.jpg', '21946267.jpg', '32293086.jpg', '28720329.jpg', '32315171.jpg', '32305744.jpg', '32287009.jpg', '1997127.jpg', '1997135.jpg', '19605919.jpg', '29206711.jpg', '19607092.jpg', '19606562.jpg', '20604809.jpg', '20816732.jpg', '26182577.jpg', '24965720.jpg', '26327044.jpg', '22488379.jpg']
['drive/My Drive/fall19_smithsonian_informatics/train/test_images/1997045.jpg', 'drive/My Drive/fall19_smithsonian_informatics/train/test_images/2060