# Organize directories

The first step in training an ML model on sequences of images acquired from the ProtectCode app is to make sure the directory structure is set up for the data loading that follows.

These commands assume the following:

1) The top directory for the training data and this notebook are both located in `~/src/YPB-AI/` on the EC2 instance. To copy a zipped directory from S3 into that directory on your EC2 instance, run `aws s3 cp s3://your-bucket-name/yourZipFile.zip ~/src/YPB-AI/yourZipFile.zip`, then unzip it with `unzip yourZipFile.zip`.

2) The Jupyter notebook server was started from the `~/src/` directory on the EC2 instance with the command `jupyter notebook --ip=0.0.0.0 --no-browser`.

3) Underneath the top directory (which I usually call "Training" or something similar), there is a series of subdirectories named and organized by classification label. In my implementation, every subdirectory that holds true-negative image sequences has the word "Negative" in its name, and subdirectories that hold true positives do not.
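
As a quick illustration of the naming convention, here is a minimal sketch of the labelling rule. It mirrors the condition used in the sorting cell further down, which also sends names containing "Sat" to the negative class. The helper name `class_label` and the second example name are hypothetical, chosen only for illustration.

```python
def class_label(subdir_name):
    """Return the class folder ('Absent' or 'Present') for a sample subdirectory."""
    # Same rule as the sorting cell in this notebook: names containing
    # 'Negative' or 'Sat' go to Absent, everything else goes to Present.
    if 'Negative' in subdir_name or 'Sat' in subdir_name:
        return 'Absent'
    return 'Present'


# Example names for illustration only; 'XS_Positive_Dark' is hypothetical.
for name in ['XS_Negative_Dark', 'XS_Positive_Dark']:
    print(name, '->', class_label(name))
```

Because the rule is a plain substring test, it is worth checking that no positive sample happens to have "Negative" or "Sat" somewhere in its name.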



In [1]:
import numpy as np
from glob import glob
import os, sys
import re

Set the name of your top data directory below. 

In [2]:
topDirName = 'Training'

Lump together all subdirectories with the same classification label. 

In [3]:
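# Assumes the current working directory is ~/src/YPB-AI.
import shutil

# List the samples first and skip the class directories themselves,
# so that rerunning the cell does not try to move 'Absent' or 'Present'.
allSamples = [sample for sample in glob(os.path.join(topDirName, '*'))
              if os.path.basename(sample) not in ('Absent', 'Present')]
os.makedirs(os.path.join(topDirName, 'Absent'), exist_ok=True)
os.makedirs(os.path.join(topDirName, 'Present'), exist_ok=True)

for sample in allSamples:
    # Names containing 'Negative' or 'Sat' are treated as the Absent class.
    if 'Negative' in sample or 'Sat' in sample:
        shutil.move(sample, os.path.join(topDirName, 'Absent'))
    else:
        shutil.move(sample, os.path.join(topDirName, 'Present'))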

Each directory within the classification subdirectory contains a series of folders that have the following contents:

1) A .txt file containing some timing and other metadata about the app run.

2) A folder called 'Photos' which contains the sequence of images acquired.

The next cell will obliterate the middle directory (the directory just below the classification directory) and repackage each run's photos into a uniquely identifiable folder of its own (lacking the text file) at the level where the middle directories once sat.
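
To make the renaming concrete, here is a minimal sketch of how the new folder name is derived from a run's path: the middle directory name is joined with the date and time fields of the run folder. The run folder name `Run_14-Mar-2019_17-13-38` is a hypothetical example (only the `prefix_date_time` pattern matters), and `new_dir_name` is an illustrative helper, not part of the notebook.

```python
import re


def new_dir_name(sample_path):
    """Build the repackaged folder name from 'top/class/middle/run'."""
    # Assumes the path has exactly four '/'-separated parts.
    parts = re.split('/', sample_path)
    # The run folder is assumed to look like '<prefix>_<date>_<time>'.
    stamp = re.split('_', parts[3])
    return parts[2] + '_' + stamp[1] + '_' + stamp[2]


# Hypothetical run path, for illustration only.
example = 'Training/Absent/XS_Negative_Dark/Run_14-Mar-2019_17-13-38'
print(new_dir_name(example))
```

Note that the indexing depends on the top directory name containing no slashes; a nested top directory would shift every index.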



In [4]:
import shutil

for classDir in ('Present', 'Absent'):
    classPath = os.path.join(topDirName, classDir)
    # Find 'vestigial' directories to be deleted after unpacking.
    vestDirs = glob(os.path.join(classPath, '*'))

    # Unpack the JPG photo files into identifiable directories.
    for sample in glob(os.path.join(classPath, '*', '*')):
        # Grab the unique identifiers -- phone model, sample # and date/time.
        # Assumes topDirName contains no '/', so the path splits into
        # [topDirName, class, middle directory, run directory].
        digestedSample = re.split('/',sample)
        digestedTimestamp = re.split('_',digestedSample[3])
        newDir = digestedSample[2] + '_' + digestedTimestamp[1] + '_' + digestedTimestamp[2]
        # Make the new directory.
        newPath = os.path.join(classPath, newDir)
        os.makedirs(newPath, exist_ok=True)
        # Add the photos directly to it.
        for photo in glob(os.path.join(sample, 'Photos', '*')):
            shutil.move(photo, newPath)

    # Delete vestigial directories (the metadata .txt files go with them).
    for direc in vestDirs:
        shutil.rmtree(direc, ignore_errors=True)

Look through all directories and find image sequences containing the file "IMG_9999.JPG". If one is found, rename the files that follow it so that the sequence still sorts correctly. This is necessary because when iOS reaches "IMG_9999.JPG" it wraps around to "IMG_0000.JPG" for the next image in the sequence. Left as is, the sorting operation that follows would put the images in the wrong temporal order.
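
A small sketch of the problem and of the renaming trick used in the next cell (changing the leading "I" to "J" for files after the wraparound). The four filenames are a made-up sequence for illustration.

```python
# Filenames in true acquisition order, crossing the 9999 -> 0000 wraparound.
names = ['IMG_9998.JPG', 'IMG_9999.JPG', 'IMG_0000.JPG', 'IMG_0001.JPG']

# A plain sort puts the post-wraparound files first, which is the wrong order.
naive = sorted(names)
print(naive)

# Rename everything that does not start with 'IMG_9' so it starts with 'J';
# 'J' sorts after 'I', so those files now land after the wraparound.
renamed = [n if n.startswith('IMG_9') else n.replace('I', 'J', 1) for n in names]
fixed = sorted(renamed)
print(fixed)
```

The trick relies on every pre-wraparound file in the sequence starting with "IMG_9"; a sequence that began below IMG_9000 would need a different rule.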

In [5]:
# Get the filenames. Paths are relative to ~/src/YPB-AI, which must also be the working directory.
imgSeqsToCorrect = os.popen('cd ~/src/YPB-AI/; find . -name "*IMG_9999.JPG"').read()
# Split into unique entities, dropping the empty entry after the final newline.
imgSeqsToCorrect = [line for line in imgSeqsToCorrect.split('\n') if line]

# Change any image files that don't start with 'IMG_9' (i.e., any files that come after the wraparound) to start with 'J'.
# This will mean that when they are sorted, they will come correctly after the wraparound.
# Note: this assumes every file before the wraparound starts with 'IMG_9'.
for sample in imgSeqsToCorrect:
    pathToDir = os.path.dirname(sample)
    for fileName in os.listdir(pathToDir):
        if fileName.startswith('IMG_') and not fileName.startswith('IMG_9'):
            newFile = fileName.replace('I','J',1)
            os.rename(os.path.join(pathToDir,fileName), os.path.join(pathToDir,newFile))

# Use a CNN to transform image sequences to a pickle file and organize the results.

In [6]:
import torch
import torchvision.models as models
from PIL import Image
import pickle

Use a pretrained CNN model (here using DenseNet121 because it has worked well in the past) with the final classifier layer chopped off.

In [7]:
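# Import the CNN model and move it to CUDA if available.
model = models.densenet121(pretrained=True)

# Freeze the weights; the model is only used as a fixed feature extractor.
for param in model.parameters():
    param.requires_grad = False
# Replace the final classifier with an identity layer so the forward pass
# returns the 1024-dimensional pooled features. Deleting the attribute outright
# makes forward() fail, because it still calls self.classifier.
model.classifier = torch.nn.Identity()

if torch.cuda.is_available():
    model.to('cuda');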

Define the function that will transform model output into a pickle file.

In [19]:
def transform_to_pickle(model,topdir):
    model.eval()
    ndx = 0
    topdirs = [direc for direc in os.listdir(topdir)
               if direc[0] != '.' and os.path.isdir(os.path.join(topdir, direc))]

    def list_stacks(classPath):
        # Only directories are image stacks. Skip hidden entries and stray files,
        # such as pickle files left over from a previous run.
        return [direc for direc in sorted(os.listdir(classPath))
                if direc[0] != '.' and os.path.isdir(os.path.join(classPath, direc))]

    # Calculate the total number of stacks to transform.
    numStacks = sum(len(list_stacks(os.path.join(topdir, classDir))) for classDir in topdirs)

    for classDir in topdirs:
        thisClassDir = os.path.join(topdir,classDir)

        # Do the transformation.
        for photoDir in list_stacks(thisClassDir):
            thisPhotoDir = os.path.join(thisClassDir, photoDir)
            files = [photofile for photofile in os.listdir(thisPhotoDir) if photofile.lower().endswith('jpg')]
            print("Transforming image stack {}".format(thisPhotoDir))
            files.sort()
            output = []
            # Resize each image and normalize the RGB values, as the model expects.
            for file in files:
                thisFile = os.path.join(thisPhotoDir,file)
                img = Image.open(thisFile).convert('RGB').resize((224,224))
                # PIL gives height x width x channel, while the model expects
                # 1 x channel x height x width. Transpose the axes; a plain
                # reshape would scramble the pixels across channels.
                imgArr = np.ascontiguousarray(np.float32(np.array(img)).transpose(2,0,1)[np.newaxis])
                imgArr[0][0] = ((imgArr[0][0] - 123.675)/58.395)
                imgArr[0][1] = ((imgArr[0][1] - 116.28)/57.12)
                imgArr[0][2] = ((imgArr[0][2] - 103.53)/57.375)
                inputTensor = torch.from_numpy(imgArr)
                if torch.cuda.is_available():
                    inputTensor = inputTensor.to('cuda')
                # No gradients are needed for feature extraction.
                with torch.no_grad():
                    output.append(model(inputTensor).cpu())

            # Collect the per-image feature vectors into one tensor per stack.
            FinalOutput = torch.FloatTensor(len(files),1,1024)
            for i in range(len(files)):
                FinalOutput[i] = output[i]
            pathStr = thisPhotoDir + '.pickle'
            with open(pathStr,'wb') as f:
                pickle.dump(FinalOutput,f)
            ndx += 1
            print("{}% complete.".format(ndx/numStacks*100))
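
The per-channel constants in `transform_to_pickle` appear to be the ImageNet channel means and standard deviations commonly used with torchvision's pretrained models, rescaled from the 0-1 range to the 0-255 range. A minimal check, assuming that is where they come from (the sample pixel values are arbitrary):

```python
import numpy as np

# ImageNet statistics on the 0-1 scale (assumed origin of the constants).
mean = np.array([0.485, 0.456, 0.406])
std = np.array([0.229, 0.224, 0.225])

# Scaling by 255 gives the values hard-coded in transform_to_pickle.
print(mean * 255)
print(std * 255)

# Normalizing raw 0-255 pixels with the scaled constants is equivalent to
# dividing by 255 first and then using the 0-1 constants.
pixel = np.array([255.0, 128.0, 0.0])  # arbitrary R, G, B values
scaled_first = (pixel / 255 - mean) / std
raw_constants = (pixel - mean * 255) / (std * 255)
print(np.allclose(scaled_first, raw_constants))
```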
    

Do the transformation

In [20]:
transform_to_pickle(model,topDirName)

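Note: the error below appears to come from a run in which pickle files from an earlier pass were already sitting in the class directories. The `os.path.isdir` check inside `transform_to_pickle` is there to skip such files.

NotADirectoryError: [Errno 20] Not a directory: 'Training/Absent/XS_Negative_Dark_14-Mar-2019_17-13-38.pickle'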

Clean up the directory, removing everything but the pickle files.

In [45]:
# Assemble all the pickle files and remove the rest.
classDirs = [direc for direc in os.listdir(topDirName) if direc[0] != '.']
for classDir in classDirs:
    newDir = os.path.join(topDirName,classDir)
    # Everything that is neither hidden nor a pickle file gets deleted.
    fileDirs = [direc for direc in os.listdir(newDir) if direc[0] != '.' and not direc.lower().endswith('.pickle')]
    for fileDir in fileDirs:
        pathToRemove = os.path.join(newDir,fileDir)
        os.system('rm -rf ' + pathToRemove)

# Split the directory into train/test/validation sets

In [61]:
import random

In [82]:
finalDirName = 'Split_Data'

In [83]:
def split_folders(inputDir, outputDir, trainValTestRatio, seed):
    # Assume inputDir is organized as 'inputDir/class/sample', where each sample
    # is a pickle file (or a folder of photos if the cleanup step was skipped).
    classes = os.listdir(inputDir)
    # Exclude folders where '.' is the first character.
    classes = [directory for directory in classes if directory[0] != '.']
    # Create output directories (no error if they already exist).
    for split in ('train', 'valid', 'test'):
        os.makedirs(os.path.join(outputDir, split), exist_ok=True)
    # Pull files and put into the appropriate folders.
    for directory in classes:
        files = os.listdir(os.path.join(inputDir, directory))
        files.sort()
        random.seed(seed)
        random.shuffle(files)
        numFiles = len(files)
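        # Get the indices. Convert the ratio to an array so a plain list can be passed in.
        ratio = np.asarray(trainValTestRatio, dtype=float)
        ndxAmts = np.intc(np.round(ratio / np.sum(ratio) * numFiles))
        numNdxes = np.sum(ndxAmts)
        # Rounding can leave the counts slightly off; give the difference to the training set.
        if numNdxes != numFiles:
            ndxAmts[0] += numFiles-numNdxes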
        trainFiles = files[0:ndxAmts[0]]
        valFiles = files[ndxAmts[0]:ndxAmts[0]+ndxAmts[1]]
        testFiles = files[ndxAmts[0]+ndxAmts[1]:]
        # Put the training directories in the appropriate folder
        # First make the class directories.
        for split in ('train', 'valid', 'test'):
            os.makedirs(os.path.join(outputDir, split, directory), exist_ok=True)
        for file in trainFiles:
            os.system('cp -r ' + inputDir + '/' + directory + '/' + file + ' ' + outputDir + '/train/' + directory + '/' + file)
        for file in valFiles:
            os.system('cp -r ' + inputDir + '/' + directory + '/' + file + ' ' + outputDir + '/valid/' + directory + '/' + file)
        for file in testFiles:
            os.system('cp -r ' + inputDir + '/' + directory + '/' + file + ' ' + outputDir + '/test/' + directory + '/' + file)
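
The count arithmetic inside `split_folders` can be tried on its own. This is a minimal sketch of the same rounding rule, with a hypothetical helper name `split_counts`: each share is rounded to the nearest whole number of files, and any shortfall or excess is absorbed by the training set.

```python
import numpy as np


def split_counts(ratio, num_files):
    """Return [train, valid, test] file counts for one class."""
    ratio = np.asarray(ratio, dtype=float)
    # Round each share to a whole number of files.
    counts = np.round(ratio / ratio.sum() * num_files).astype(int)
    # Rounding may not add up exactly; the training set takes the difference.
    counts[0] += num_files - counts.sum()
    return counts


for n in (10, 25, 33):
    train, valid, test = split_counts([0.8, 0.1, 0.1], n)
    print(n, 'files ->', train, 'train,', valid, 'valid,', test, 'test')
```

With very small classes the validation and test shares can round down to zero, so it is worth printing the counts before relying on the split.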

In [84]:
split_folders(topDirName, finalDirName,[0.8,0.1,0.1],7778)