## Split polyps / non-polyps dataset

Use the `polyps` and `non_polyps` folders from `cropped` to copy files into the `train` and `validation` subfolders of `data_polyps`. This folder should already exist and contain both subfolders (or adjust the script to create them). The **data_polyps** folder will be the dataset folder for the deep learning classifications that follow.

We shall use a dataset split percentage such as **75% train** and **25% validation**.

We need `os` and `shutil` to manage the files, and `random` to randomly split the dataset into train and validation subsets. You should have a folder structure such as:

* `./data_polyps/train/polyps`
* `./data_polyps/train/non_polyps`
* `./data_polyps/validation/polyps`
* `./data_polyps/validation/non_polyps`
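
If the folders listed above do not exist yet, they could be created from the script rather than by hand. The following is a minimal sketch, not part of the original notebook: the helper name `make_split_folders` is hypothetical, and the root path is whatever you pass in.

```python
import os

def make_split_folders(base_dir):
    """Create <base_dir>/<subset>/<class> folders and return their paths.

    Hypothetical helper: the subset and class names mirror the folder
    structure listed above.
    """
    created = []
    for subset in ("train", "validation"):
        for label in ("polyps", "non_polyps"):
            path = os.path.join(base_dir, subset, label)
            # exist_ok=True makes this a no-op when the folder is already there
            os.makedirs(path, exist_ok=True)
            created.append(path)
    return created
```

Calling `make_split_folders("./data_polyps")` would produce the four folders above; existing folders and their contents are left untouched.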

In [0]:
import os
import random
import shutil

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


These lines set the locations of the source files (the entire dataset) and the destination folders for the train and validation subsets:

In [0]:
# Source dataset: from where to copy the files
sourceFolderClass1 = '/content/drive/My Drive/Colab Notebooks/dataset/cropped/polyps'
sourceFolderClass2 = '/content/drive/My Drive/Colab Notebooks/dataset/cropped/non_polyps'
# Destination folders: dataset split into train and validation for polyps and non-polyps
destFolderClass1_tr  = '/content/drive/My Drive/Colab Notebooks/dataset/train/polyps'
destFolderClass2_tr  = '/content/drive/My Drive/Colab Notebooks/dataset/train/non_polyps'
destFolderClass1_val = '/content/drive/My Drive/Colab Notebooks/dataset/validation/polyps'
destFolderClass2_val = '/content/drive/My Drive/Colab Notebooks/dataset/validation/non_polyps'

Get the list with all the files in the source folder:

In [4]:
sourceFiles1 = os.listdir(sourceFolderClass1)
sourceFiles2 = os.listdir(sourceFolderClass2)
print("Class 1 - polyps:", len(sourceFiles1))
print("Class 2 - non-polyps:", len(sourceFiles2))

Class 1 - polyps: 606
Class 2 - non-polyps: 606


Let's shuffle the lists of source files using a fixed random seed:

In [0]:
random.seed(1)
random.shuffle(sourceFiles1)
random.shuffle(sourceFiles2)
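
One caveat about the seeded shuffle above: `os.listdir` returns names in arbitrary order, so the same seed only gives the same split if the starting order is also fixed. A minimal sketch of one way to handle this, assuming a hypothetical helper `reproducible_shuffle` that sorts first and uses its own generator:

```python
import random

def reproducible_shuffle(names, seed=1):
    """Return a shuffled copy of names that is identical on every run.

    Hypothetical helper, not part of the original notebook.
    """
    ordered = sorted(names)      # fix the starting order first
    rng = random.Random(seed)    # private generator, global state untouched
    rng.shuffle(ordered)
    return ordered
```

With this, `reproducible_shuffle(os.listdir(sourceFolderClass1))` would not depend on the order in which the file system lists the files.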

We shall define a number of files to copy in the `validation` subfolder for each class. If you want a different split, you should modify `val_files`.

In [6]:
# Number of files to copy into the VALIDATION folder for each class
val_files = 151

# Copy the first val_files files for polyps and non-polyps into the validation folders
print('--> Validation split ...')
for i in range(val_files):
    # copy validation polyps
    File1 = os.path.join(sourceFolderClass1, sourceFiles1[i])
    File2 = os.path.join(destFolderClass1_val, sourceFiles1[i])
    shutil.copy(File1, File2)
    # copy validation non-polyps
    File1 = os.path.join(sourceFolderClass2, sourceFiles2[i])
    File2 = os.path.join(destFolderClass2_val, sourceFiles2[i])
    shutil.copy(File1, File2)

print('--> Done!')

--> Validation split ...
--> Done!
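
The hard-coded `val_files = 151` corresponds to 25% of the 606 files per class, rounded down. As a rough sketch, the count could instead be derived from the split fraction; `validation_count` and `val_fraction` are hypothetical names introduced here for illustration.

```python
def validation_count(n_files, val_fraction=0.25):
    """Number of files per class to hold out for validation (rounded down)."""
    if not 0.0 <= val_fraction <= 1.0:
        raise ValueError("val_fraction must be between 0 and 1")
    return int(n_files * val_fraction)

# 606 files per class, as reported by the listing cell above
val_files = validation_count(606)
print(val_files)  # 151
```

If the two classes had different sizes, the function would have to be called once per class, or with the size of the smaller class to keep the validation subset balanced.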


In [7]:
# Copy polyps to train
print('--> Train split ...')
for i in range(val_files, len(sourceFiles1)):
    File1 = os.path.join(sourceFolderClass1, sourceFiles1[i])
    File2 = os.path.join(destFolderClass1_tr, sourceFiles1[i])
    shutil.copy(File1, File2)
# Copy non-polyps to train
for i in range(val_files, len(sourceFiles2)):
    File1 = os.path.join(sourceFolderClass2, sourceFiles2[i])
    File2 = os.path.join(destFolderClass2_tr, sourceFiles2[i])
    shutil.copy(File1, File2)

print('--> Done!')

--> Train split ...
--> Done!


Now we have the dataset split into train and validation subfolders, with each class inside:
* **1212** images in the entire dataset;
* **910** images for training: 455 polyps + 455 non-polyps;
* **302** images for validation: 151 polyps + 151 non-polyps.

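Let's check the composition of the subsets for the future classification. Note that `os.listdir` counts everything already present in the destination folders, so the numbers match the figures above only if those folders were empty before copying: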

In [8]:
print('--> Dataset: data_polyps')
print('> Train - polyps:', len(os.listdir(destFolderClass1_tr)))
print('> Train - non-polyps:', len(os.listdir(destFolderClass2_tr)))
print('> Validation - polyps:', len(os.listdir(destFolderClass1_val)))
print('> Validation - non-polyps:', len(os.listdir(destFolderClass2_val)))

--> Dataset: data_polyps
> Train - polyps: 569
> Train - non-polyps: 565
> Validation - polyps: 265
> Validation - non-polyps: 261
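
The counts printed above are larger than the 455 + 151 files per class expected from the split. One possible explanation, stated here as an assumption rather than something the notebook confirms, is that the destination folders still held files from an earlier run. A quick sanity check is to look for file names that appear in both the train and validation folders of a class; `overlapping_files` is a hypothetical helper.

```python
import os

def overlapping_files(train_dir, val_dir):
    """Return the sorted file names present in both folders."""
    return sorted(set(os.listdir(train_dir)) & set(os.listdir(val_dir)))
```

For example, `overlapping_files(destFolderClass1_tr, destFolderClass1_val)` should return an empty list for a clean split; any names it reports would leak validation images into training.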


We are ready to use deep learning to build a classifier for polyp/non-polyp images. Remember that you can modify the dataset split, manually remove a specific list of files, use different folder names, etc.

Let's create some classifiers with the next script [3-Small_CNNs.ipynb](./3-Small_CNNs.ipynb).

Have fun with DL! @muntisa

### Acknowledgements

I gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research ([https://developer.nvidia.com/academic_gpu_seeding](https://developer.nvidia.com/academic_gpu_seeding)).