## Downloading the data

The dataset is downloaded from http://data.vision.ee.ethz.ch/cvl/food-101.tar.gz. It was introduced in the following research paper:


> Lukas Bossard, Matthieu Guillaumin, Luc Van Gool - Food-101 – Mining Discriminative Components with Random Forests

The Food-101 data set consists of images from Foodspotting [1]. Any use beyond
   scientific fair use must be negotiated with the respective picture owners
   according to the Foodspotting terms of use [2].

[1] http://www.foodspotting.com/
[2] http://www.foodspotting.com/terms/

In [1]:
#importing essential modules
import pandas as pd
import numpy as np
import os
from os import path
import time
from random import seed, choice
import shutil

In [2]:
#deleting the data folder in case a cleaning is required
#shutil.rmtree("../data")

In [3]:
#the data will be downloaded and automatically extracted to the data folder ../data/food-101/images
%mkdir ../data
!wget -O ../data/food-101.tar.gz http://data.vision.ee.ethz.ch/cvl/food-101.tar.gz
!tar -zxf ../data/food-101.tar.gz -C ../data

--2020-03-07 08:30:11--  http://data.vision.ee.ethz.ch/cvl/food-101.tar.gz
Resolving data.vision.ee.ethz.ch (data.vision.ee.ethz.ch)... 129.132.52.162
Connecting to data.vision.ee.ethz.ch (data.vision.ee.ethz.ch)|129.132.52.162|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://data.vision.ee.ethz.ch/cvl/food-101.tar.gz [following]
--2020-03-07 08:30:12--  https://data.vision.ee.ethz.ch/cvl/food-101.tar.gz
Connecting to data.vision.ee.ethz.ch (data.vision.ee.ethz.ch)|129.132.52.162|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4996278331 (4.7G) [application/x-gzip]
Saving to: ‘../data/food-101.tar.gz’


2020-03-07 08:32:12 (39.6 MB/s) - ‘../data/food-101.tar.gz’ saved [4996278331/4996278331]
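
The wget log reports a length of 4996278331 bytes, so one quick sanity check before extracting is to compare the size of the archive on disk with that number. The following is a minimal sketch: the helper name `archive_is_complete` is hypothetical, and the expected size is simply copied from the log above, so it would need updating if the hosted archive ever changes.

```python
import os

# Size in bytes reported by wget in the log above
# (assumption: the hosted archive has not changed since then)
EXPECTED_BYTES = 4996278331

def archive_is_complete(archive_path, expected_bytes=EXPECTED_BYTES):
    """Return True if the file exists and has exactly the expected size."""
    return os.path.isfile(archive_path) and os.path.getsize(archive_path) == expected_bytes
```

In the notebook this could be called as `archive_is_complete("../data/food-101.tar.gz")` before running `tar`. A size match does not prove the content is intact, it only catches truncated downloads.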



## Organise train and test set

In [4]:
#dividing into train and test set using the json metadata 

metafolder = "../data/food-101/meta/"
train_meta = pd.read_json(path_or_buf = metafolder + "train.json")
test_meta = pd.read_json(path_or_buf = metafolder + "test.json")

> ### Organising metadata for training, testing and validation

In [5]:
#organising metadata for training, testing and validation
validation_split = 0.2
val_split_idx = int(np.floor(train_meta.shape[0]*validation_split))

#folder with all the food images
data_dir = "../data/food-101/images/"
folders_sorted = sorted(os.listdir(data_dir))

#number of categories to randomly select
nc = 5

#selecting a random subset of categories
seed(42)

selection = []
while len(selection) < nc:
    pick = choice(folders_sorted)
    if pick not in selection:
        selection.append(pick)
        
print("Selected categories : {}".format(', '.join(map(str, selection))))

Selected categories : ramen, carrot_cake, beef_carpaccio, strawberry_shortcake, escargots
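
The rejection loop above can also be expressed with `random.sample`, which draws without replacement in a single call. This is only a sketch of an alternative: the function name `pick_categories` and the toy folder list are made up for illustration, and for a given seed it will generally not reproduce the same five categories as the loop.

```python
from random import Random

def pick_categories(folders, n, seed_value=42):
    """Draw n distinct category names from folders, reproducibly."""
    rng = Random(seed_value)  # local generator, leaves the global random state untouched
    return rng.sample(folders, n)

# toy list standing in for sorted(os.listdir(data_dir))
example_folders = ["apple_pie", "baklava", "carrot_cake", "donuts",
                   "escargots", "ramen", "sushi"]
example_selection = pick_categories(example_folders, 5)
print("Selected categories : {}".format(", ".join(example_selection)))
```

Using a local `Random` instance keeps the draw reproducible even if other cells call functions from the `random` module in between.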


In [6]:
#create folder with data to upload to s3
%mkdir ../data/s3_train_test_data
%mkdir ../data/s3_train_test_data/train_img 
%mkdir ../data/s3_train_test_data/valid_img
%mkdir ../data/s3_train_test_data/test_img

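#the split index is computed once on the full training metadata and the validation
#rows are taken before train_meta is overwritten, so the two sets never overlap
train_split_idx = train_meta.shape[0] - val_split_idx

valid_meta = train_meta[selection].iloc[train_split_idx:]

train_meta = train_meta[selection].iloc[:train_split_idx]

test_meta = test_meta[selection]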

#Setting train, validation and test set target folder
#target folder - train
trainfolder = "../data/s3_train_test_data/train_img/"

#target folder - validation
validfolder = "../data/s3_train_test_data/valid_img/"

#target folder -test
testfolder = "../data/s3_train_test_data/test_img/"

print("{} images used for training".format(train_meta.shape[0]*train_meta.shape[1]))
print("{} images used for validation".format(valid_meta.shape[0]*valid_meta.shape[1]))
print("{} images used for testing".format(test_meta.shape[0]*test_meta.shape[1]))

3000 images used for training
750 images used for validation
1250 images used for testing
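
The counts printed above follow directly from the split arithmetic. As a rough check, assuming the Food-101 metadata lists 750 training and 250 test images per class (the figures implied by the output above), the numbers can be reproduced without touching the data. `split_counts` is a hypothetical helper, not part of the notebook.

```python
import math

def split_counts(train_per_class, test_per_class, n_classes, validation_split):
    """Return (train, validation, test) image totals for the selected classes."""
    n_valid = math.floor(train_per_class * validation_split)  # rows held out per class
    n_train = train_per_class - n_valid                       # rows kept for training per class
    return n_train * n_classes, n_valid * n_classes, test_per_class * n_classes

train_total, valid_total, test_total = split_counts(
    750, 250, n_classes=5, validation_split=0.2
)
print(train_total, valid_total, test_total)
```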


In [7]:
#copying the selected images into the train, validation and test folders

def organise_files_from_df(df, datafolder, datatarget):
    """
    Copies files contained in a folder (datafolder) to a target path (datatarget),
    based on the information contained in a dataframe (df) where each column corresponds to a 
    class name (sub-folder). Every column of the dataframe contains a list of filenames
    (in the form "label/name", without extension) to be copied.
    """
    
    #iterating through dataframe columns ( =  labels)
    for label in df.columns:
        
        #create the target folder and the label sub-folder if they do not exist yet
        os.makedirs(path.join(datatarget, str(label)), exist_ok=True)
        
        #copy each file
        for file in df[label]:
            
            fileoriginal = path.join(datafolder, file + ".jpg")
            filetarget = path.join(datatarget, file + ".jpg")
            
            try:
                if not path.exists(filetarget):
                    shutil.copyfile(fileoriginal, filetarget)

            except FileNotFoundError:
                print("File {} not found!".format(file))

#origin folder
imagefolder = "../data/food-101/images/"


organise_files_from_df(train_meta, imagefolder, trainfolder)

organise_files_from_df(valid_meta, imagefolder, validfolder)

organise_files_from_df(test_meta, imagefolder, testfolder)

#to delete the origin folder, uncomment the line below
#shutil.rmtree(imagefolder)
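
After copying, it can be worth confirming that every class folder received the expected number of images. The sketch below counts the `.jpg` files in each sub-folder of a target directory. `count_images_per_class` is a hypothetical helper and assumes the layout produced above (one sub-folder per label).

```python
from pathlib import Path

def count_images_per_class(target_dir):
    """Map each label sub-folder of target_dir to its number of .jpg files."""
    counts = {}
    for sub in sorted(Path(target_dir).iterdir()):
        if sub.is_dir():
            counts[sub.name] = sum(1 for _ in sub.glob("*.jpg"))
    return counts
```

For instance, `count_images_per_class(trainfolder)` would be expected to report 600 images for each of the five selected categories, given the 3000 training images printed earlier.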

## Load Data to S3

> The cells below import the AWS SageMaker libraries, start a SageMaker session and create a default bucket. Once the bucket exists, the locally stored data is uploaded to S3.

In [8]:
import boto3
import sagemaker

# session and role
sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()

# create an S3 bucket
bucket = sagemaker_session.default_bucket()

> ### Upload training and testing data

In [9]:
prefix = "food-classifier"
datafolder = "../data/s3_train_test_data"
# upload all data to S3

#this is slow!
start = time.time()
input_data = sagemaker_session.upload_data(path=datafolder, bucket=bucket, key_prefix=prefix)
end = time.time()

print("Data uploaded to s3 after {} seconds".format(end - start))

Data uploaded to s3 after 361.97146677970886 seconds
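
The start/end bookkeeping used for timing the upload can be wrapped in a small context manager, which also reports the duration when the timed block raises. This is a minimal sketch with a toy workload. The name `timed` is made up for illustration and is not part of the SageMaker SDK.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label):
    """Print how long the enclosed block took, even if it raises."""
    start = time.perf_counter()
    try:
        yield
    finally:
        print("{} after {:.1f} seconds".format(label, time.perf_counter() - start))

# toy workload standing in for the slow upload call
with timed("Toy workload finished"):
    total = sum(range(100000))
```

In the notebook, the `upload_data` call would go inside the `with` block in place of the toy workload.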


In [10]:
# check that data is in S3 bucket
empty_check = []
for obj in boto3.resource('s3').Bucket(bucket).objects.all():
    empty_check.append(obj.key)
    #print(obj.key)

assert len(empty_check) !=0, 'S3 bucket is empty.'
print('All good!')

All good!
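
The check above only proves that the bucket is not empty, and a default bucket may also hold objects from other projects. A slightly stricter check could group the keys under the project prefix by their first path component. The sketch below works on a plain list of keys. The example keys are invented and assume that the upload keeps the local folder structure under the prefix. In the notebook, the keys would come from the `obj.key` values collected in `empty_check`.

```python
from collections import Counter

def count_keys_per_split(keys, prefix):
    """Count object keys under prefix by their first path component after the prefix."""
    counts = Counter()
    for key in keys:
        if not key.startswith(prefix + "/"):
            continue  # object belongs to something else in the bucket
        remainder = key[len(prefix) + 1:]
        counts[remainder.split("/", 1)[0]] += 1
    return dict(counts)

# made-up keys, shaped like what the upload is assumed to produce
example_keys = [
    "food-classifier/train_img/ramen/1.jpg",
    "food-classifier/train_img/ramen/2.jpg",
    "food-classifier/valid_img/ramen/3.jpg",
    "food-classifier/test_img/escargots/4.jpg",
    "other-project/readme.txt",
]
example_counts = count_keys_per_split(example_keys, "food-classifier")
print(example_counts)
```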


> ## Checking model.py

In [25]:
!pygmentize pytorch_source/model.py

import torch
import torchvision.models as models

#importing pretrained ResNet for transfer learning
ResNetTransfer = models.resnet50(pretrained=True) 


> ## Checking train.py

In [26]:
!pygmentize pytorch_source/train.py

import argparse
import json
import os
import pandas as pd
import numpy as np

import torch
import torch.optim as optim
import torch.nn as nn
import torch.utils.data

from torchvision import datasets
import torchvision.transforms as transforms

import torch.optim as optim
from torch.optim.lr_

In [3]:
import boto3
import sagemaker

# S3 bucket to load
bucket = "sagemaker-eu-central-1-515611759963"

prefix = "food-classifier"

train_test_folder = 's3://{}/{}/'.format(bucket, prefix)

# session and role
sagemaker_session = sagemaker.Session(default_bucket=bucket)
role = sagemaker.get_execution_role()

train_test_folder

's3://sagemaker-eu-central-1-515611759963/food-classifier/'

> ### Upload training and testing data

In [4]:

#datafolder = "../data/s3_train_test_data"
# upload all data to S3

#this is slow!
#start = time.time()
#input_data = sagemaker_session.upload_data(path=datafolder, bucket=bucket, key_prefix=prefix)
#end = time.time()

#print("Data uploaded to s3 after {} seconds".format(end - start))

> ## Create pytorch estimator

In [5]:
# import a PyTorch wrapper
from sagemaker.pytorch import PyTorch, PyTorchModel

In [7]:
# specify an output path
# prefix is specified above
output_path = 's3://{}/{}'.format(bucket, prefix)

# instantiate a pytorch estimator
estimator = PyTorch(entry_point='train.py',
                    source_dir='pytorch_source', 
                    role=role,
                    framework_version= '1.1.0', #'1.3.1',
                    train_instance_count=1,
                    train_instance_type='ml.p2.xlarge',
                    output_path=output_path,
                    sagemaker_session=sagemaker_session,
                    hyperparameters={
                        'n_classes': nc + 1,  # num of classes for the fully connected layer at the end of the network (defined on the first cells and increased by 1 for pytorch counting)
                        'n_epochs': 3,
                        'img_short_side_resize':256,
                        'img_input_size':224,
                        'num_workers':16,
                        'batch_size':64
                    })
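
SageMaker passes the `hyperparameters` dictionary to the entry point as command-line arguments, so a training script typically reads them with `argparse`. Since the listing of `train.py` above is cut off, the parser below is only a guess at how that could look: the argument names mirror the dictionary keys, while the defaults and the function name `parse_hyperparameters` are assumptions.

```python
import argparse

def parse_hyperparameters(argv=None):
    """Parse the hyperparameters defined on the estimator (hypothetical sketch)."""
    parser = argparse.ArgumentParser()
    # names mirror the keys of the hyperparameters dict; defaults are assumptions
    parser.add_argument("--n_classes", type=int, default=6)
    parser.add_argument("--n_epochs", type=int, default=3)
    parser.add_argument("--img_short_side_resize", type=int, default=256)
    parser.add_argument("--img_input_size", type=int, default=224)
    parser.add_argument("--num_workers", type=int, default=16)
    parser.add_argument("--batch_size", type=int, default=64)
    # parse_known_args tolerates extra arguments the script does not declare
    args, _ = parser.parse_known_args(argv)
    return args

example_args = parse_hyperparameters(
    ["--n_classes", "6", "--n_epochs", "3", "--batch_size", "64"]
)
print(example_args.n_classes, example_args.n_epochs, example_args.batch_size)
```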

In [8]:
#estimator.fit({'training': input_data})

estimator.fit({'training': train_test_folder}) #to use already loaded data

2020-03-08 17:17:19 Starting - Starting the training job...
2020-03-08 17:17:21 Starting - Launching requested ML instances...
2020-03-08 17:18:18 Starting - Preparing the instances for training.........
2020-03-08 17:19:37 Downloading - Downloading input data......
2020-03-08 17:20:40 Training - Downloading the training image..
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2020-03-08 17:21:00,643 sagemaker-containers INFO     Imported framework sagemaker_pytorch_container.training
2020-03-08 17:21:00,668 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2020-03-08 17:21:01,293 sagemaker_pytorch_container.training INFO     Invoking user training script.
2020-03-08 17:21:01,512 sagemaker-containers INFO     Module train does not provide a setup.py. 
Generating setup.py
2020-03-08 17:21:01,512 sagemaker-containers INFO 

> ## Deploy

In [9]:
#predictor = estimator.deploy(instance_type='ml.p2.xlarge', initial_instance_count=1)

-------------!

> ## Delete endpoint

In [10]:
#predictor.delete_endpoint()