# Data & Model Preparation
This notebook prepares the dataset and model for the module evaluation lab. This step is optional if you have kept your artifacts from previous modules.

## Import modules and initialize parameters for this notebook

In [10]:
import sagemaker
from sagemaker import get_execution_role
import glob
import random
import shutil
import os

role = get_execution_role()
sess = sagemaker.Session()

account = sess.account_id()
region = sess.boto_region_name
bucket = sess.default_bucket() # or use your own custom bucket name
prefix = 'BIRD-Sagemaker-Deployment'

## Dataset
The dataset is [Caltech Birds (CUB 200 2011)](http://www.vision.caltech.edu/visipedia/CUB-200-2011.html), which contains 11,788 images across 200 bird species (see the original technical report for details). Each species comes with around 60 images, with a typical size of about 350 pixels by 500 pixels. Bounding boxes are provided, as are annotations of bird parts. A recommended train/test split is given, but image size data is not.
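The annotation files that ship with the dataset (`classes.txt`, `images.txt`, `image_class_labels.txt`, `bounding_boxes.txt`) are plain text with one space-separated record per line, keyed by an integer ID. The following is a minimal sketch of joining them into one record per image. It uses a few inline sample lines instead of the real files; the file names and box values in the samples are illustrative, and the helper names are hypothetical.

```python
# Illustrative sample lines in the CUB layout: "<id> <value...>" per line.
CLASSES_TXT = """1 001.Black_footed_Albatross
2 002.Laysan_Albatross"""
IMAGES_TXT = """1 001.Black_footed_Albatross/sample_0001.jpg
2 002.Laysan_Albatross/sample_0002.jpg"""
IMAGE_CLASS_LABELS_TXT = """1 1
2 2"""
BOUNDING_BOXES_TXT = """1 60.0 27.0 325.0 304.0
2 139.0 30.0 153.0 264.0"""


def parse_id_map(text):
    """Map the leading integer ID on each line to the rest of the line."""
    result = {}
    for line in text.strip().splitlines():
        key, value = line.split(maxsplit=1)
        result[int(key)] = value.strip()
    return result


def build_records(classes_txt, images_txt, labels_txt, boxes_txt):
    """Join the four annotation files into one dict per image."""
    classes = parse_id_map(classes_txt)                                   # class_id -> name
    images = parse_id_map(images_txt)                                     # image_id -> path
    labels = {k: int(v) for k, v in parse_id_map(labels_txt).items()}     # image_id -> class_id
    boxes = {
        k: tuple(float(c) for c in v.split())                             # (x, y, width, height)
        for k, v in parse_id_map(boxes_txt).items()
    }
    return [
        {
            "image_id": image_id,
            "path": images[image_id],
            "class_name": classes[labels[image_id]],
            "bbox": boxes[image_id],
        }
        for image_id in sorted(images)
    ]


records = build_records(CLASSES_TXT, IMAGES_TXT, IMAGE_CLASS_LABELS_TXT, BOUNDING_BOXES_TXT)
```

With the real dataset, the same functions could be fed the contents of each file read from the extracted archive.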

Run the cell below to unpack the dataset archive (`./CUB_10_2011.zip`), or download the full dataset manually [here](https://course.fast.ai/datasets). Note that the file is around 1.2 GB and can take a while to download. If you plan to complete the entire workshop, keep the file to avoid re-downloading and re-processing the data.

In [11]:
!yum install unzip
!unzip ./CUB_10_2011.zip

/bin/bash: yum: command not found
Archive:  ./CUB_10_2011.zip
   creating: CUB_10_2011/
  inflating: CUB_10_2011/image_class_labels.txt  
  inflating: CUB_10_2011/bounding_boxes.txt  
  inflating: CUB_10_2011/images.txt  
  inflating: CUB_10_2011/classes.txt  
   creating: CUB_10_2011/parts/
  inflating: CUB_10_2011/parts/parts.txt  
  inflating: CUB_10_2011/parts/part_click_locs.txt  
  inflating: CUB_10_2011/parts/part_locs.txt  
   creating: CUB_10_2011/.ipynb_checkpoints/
  inflating: CUB_10_2011/.ipynb_checkpoints/classes-checkpoint.txt  
  inflating: CUB_10_2011/.ipynb_checkpoints/image_class_labels-checkpoint.txt  
   creating: CUB_10_2011/images/
   creating: CUB_10_2011/images/010.Red_winged_Blackbird/
  inflating: CUB_10_2011/images/010.Red_winged_Blackbird/Red_Winged_Blackbird_0096_5019.jpg  
  inflating: CUB_10_2011/images/010.Red_winged_Blackbird/Red_Winged_Blackbird_0028_4709.jpg  
  inflating: CUB_10_2011/images/010.Red_winged_Blackbird/Red_Winged_Blackbird_0089_4188.jpg
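The log above shows `yum: command not found`, so the install step is skipped on images without `yum`. If the `unzip` binary is also unavailable, Python's standard `zipfile` module can do the same job. This is a minimal sketch; the helper name `extract_archive` is hypothetical, and the archive path in the usage comment is assumed to match the cell above.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, dest_dir="."):
    """Extract every member of zip_path into dest_dir and return the member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        return archive.namelist()


# Usage in the notebook (assumes the archive sits next to the notebook):
# extract_archive("./CUB_10_2011.zip")
```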

## Generate test samples for this lab

In [12]:
img_array = []
image_folder = 'CUB_10_2011/images'
dst = 'build/image_classification/images'

if not os.path.exists(dst):
    os.makedirs(dst)
    print("make new directory.....")

for sub_dir in (glob.glob(f'{image_folder}/*')):
    for filename in (glob.glob(f'{sub_dir}/*')):
        img_array.append(filename)

# Copy 10 distinct images; random.sample never picks the same file twice
for src in random.sample(img_array, min(10, len(img_array))):
    shutil.copy(src, dst)

make new directory.....
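The test images are drawn at random, so each run of the cell produces a different set. The following is a minimal sketch of drawing distinct files reproducibly, assuming a fixed seed is acceptable for this lab. `pick_samples` and the seed value are hypothetical and not part of the workshop code.

```python
import random


def pick_samples(paths, k=10, seed=42):
    """Return up to k distinct paths, chosen reproducibly for a given seed."""
    rng = random.Random(seed)   # local generator; the global random state is untouched
    ordered = sorted(paths)     # glob order is not guaranteed, so sort first
    return rng.sample(ordered, min(k, len(ordered)))


# Usage in the notebook (img_array and dst come from the cell above):
# for src in pick_samples(img_array):
#     shutil.copy(src, dst)
```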


## Copy data to S3

In [13]:
s3_raw_data = f's3://{bucket}/{prefix}/full/data'
!aws s3 cp --recursive ./CUB_10_2011 $s3_raw_data

upload: CUB_10_2011/.ipynb_checkpoints/classes-checkpoint.txt to s3://sagemaker-ap-south-1-650790930882/BIRD-Sagemaker-Deployment/full/data/.ipynb_checkpoints/classes-checkpoint.txt
upload: CUB_10_2011/attributes/certainties.txt to s3://sagemaker-ap-south-1-650790930882/BIRD-Sagemaker-Deployment/full/data/attributes/certainties.txt
upload: CUB_10_2011/README to s3://sagemaker-ap-south-1-650790930882/BIRD-Sagemaker-Deployment/full/data/README
upload: CUB_10_2011/.ipynb_checkpoints/image_class_labels-checkpoint.txt to s3://sagemaker-ap-south-1-650790930882/BIRD-Sagemaker-Deployment/full/data/.ipynb_checkpoints/image_class_labels-checkpoint.txt
upload: CUB_10_2011/attributes/class_attribute_labels_continuous.txt to s3://sagemaker-ap-south-1-650790930882/BIRD-Sagemaker-Deployment/full/data/attributes/class_attribute_labels_continuous.txt
upload: CUB_10_2011/images.txt to s3://sagemaker-ap-south-1-650790930882/BIRD-Sagemaker-Deployment/full/data/images.txt
upload: CUB_10_2011/classes.txt to
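The upload log shows that Jupyter's `.ipynb_checkpoints` files are copied to S3 along with the dataset. The following is a minimal sketch of listing only the files worth uploading, assuming you want to skip those checkpoint directories. `files_to_upload` and `SKIP_DIRS` are hypothetical names, and the actual upload call is left out.

```python
import os

SKIP_DIRS = {".ipynb_checkpoints"}  # directory names to leave out of the upload


def files_to_upload(root):
    """Yield (local_path, relative_key) pairs under root, skipping SKIP_DIRS."""
    for current, dirs, files in os.walk(root):
        # Prune in place so os.walk does not descend into skipped directories.
        dirs[:] = [d for d in dirs if d not in SKIP_DIRS]
        for name in sorted(files):
            local_path = os.path.join(current, name)
            relative_key = os.path.relpath(local_path, root).replace(os.sep, "/")
            yield local_path, relative_key


# Usage in the notebook:
# for local_path, key in files_to_upload("./CUB_10_2011"):
#     print(local_path, "->", f"{prefix}/full/data/{key}")
```

The AWS CLI also accepts an `--exclude` pattern on `aws s3 cp --recursive`, which may be simpler if you stay with the shell command.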

In [14]:
!rm -rf ./CUB_10_2011
!rm -rf attributes.txt

In [15]:
from sagemaker.sklearn.processing import SKLearnProcessor

from sagemaker.processing import (
    ProcessingInput,
    ProcessingOutput,
)
import time

# Epoch seconds as a string, handy for building unique names
timestamp = str(int(time.time()))
# SKlearnProcessor for preprocessing
output_prefix = f'{prefix}/outputs'
output_s3_uri = f's3://{bucket}/{output_prefix}'

class_selection = '1, 2, 3, 4, 5, 6, 7, 8'
input_annotation = 'classes.txt'
processing_instance_type = "ml.m5.xlarge"
processing_instance_count = 1

sklearn_processor = SKLearnProcessor(base_job_name = f"{prefix}-preprocess",  # choose any name
                                    framework_version='0.20.0',
                                    role=role,
                                    instance_type=processing_instance_type,
                                    instance_count=processing_instance_count)
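`class_selection` is a single comma-separated string and `input_annotation` is a file name, so a processing script that receives them as command-line arguments has to split and convert the class IDs itself. `preprocessing.py` is not shown in this notebook; the following is only a sketch of how such a script might parse these two values, with hypothetical function and attribute names.

```python
import argparse


def parse_args(argv=None):
    """Parse the two arguments this notebook defines for the preprocessing job."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--classes", type=str, default="")
    parser.add_argument("--input-data", type=str, default="classes.txt")
    args = parser.parse_args(argv)
    # "1, 2, 3" -> [1, 2, 3]; int() tolerates the surrounding whitespace
    args.class_ids = [int(item) for item in args.classes.split(",") if item.strip()]
    return args


example = parse_args(["--classes", "1, 2, 3, 4, 5, 6, 7, 8", "--input-data", "classes.txt"])
```

Note that argparse exposes `--input-data` as `args.input_data`, replacing the hyphen with an underscore.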

In [17]:
sklearn_processor.run(
    code='./preprocessing.py',
    arguments=["--classes", class_selection, 
               "--input-data", input_annotation],
    inputs=[ProcessingInput(source=s3_raw_data, 
            destination="/opt/ml/processing/input")],
    outputs=[
            ProcessingOutput(source="/opt/ml/processing/output/train", destination = output_s3_uri +'/train'),
            ProcessingOutput(source="/opt/ml/processing/output/valid", destination = output_s3_uri +'/valid'),
            ProcessingOutput(source="/opt/ml/processing/output/test", destination = output_s3_uri +'/test'),
            ProcessingOutput(source="/opt/ml/processing/output/manifest", destination = output_s3_uri +'/manifest'),
        ],
    )

INFO:sagemaker:Creating processing-job with name BIRD-Sagemaker-Deployment-preprocess-2023-07-07-12-49-45-496


...................................
['001.Black_footed_Albatross', '002.Laysan_Albatross', '003.Sooty_Albatross', '004.Groove_billed_Ani', '005.Crested_Auklet', '006.Least_Auklet', '007.Parakeet_Auklet', '008.Rhinoceros_Auklet']
Using 424 images from 8 classes
num images total: 11788
num train: 255
num val: 84
num test: 85
Copying files for 84 images in channel: valid...
Copying files for 85 images in channel: test...
Copying files for 255 images in channel: train...
Finished running processing job

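The job log reports 255 training, 84 validation, and 85 test images out of 424, which is roughly a 60/20/20 split. Since `preprocessing.py` is not shown here, the following is only a sketch of one way to produce such a split class by class. The function name, ratios, and seed are assumptions, and the sketch is not expected to reproduce the exact counts above.

```python
import random
from collections import defaultdict


def stratified_split(labels, ratios=(0.6, 0.2, 0.2), seed=0):
    """Split image IDs into train/valid/test, class by class.

    labels maps image_id -> class_id. The test split receives whatever is
    left after the train and valid shares are taken from each class.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for image_id, class_id in sorted(labels.items()):
        by_class[class_id].append(image_id)

    splits = {"train": [], "valid": [], "test": []}
    for class_id in sorted(by_class):
        ids = by_class[class_id]
        rng.shuffle(ids)
        n_train = round(len(ids) * ratios[0])
        n_valid = round(len(ids) * ratios[1])
        splits["train"] += ids[:n_train]
        splits["valid"] += ids[n_train:n_train + n_valid]
        splits["test"] += ids[n_train + n_valid:]
    return splits


# Toy input: 200 images spread evenly over 4 classes.
toy_labels = {image_id: image_id % 4 for image_id in range(200)}
toy_splits = stratified_split(toy_labels)
```

Splitting per class keeps each species represented in all three channels, which matters when some classes have only a few dozen images.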


The cell below prints the S3 locations of your test images and annotation files. You will need them for this module.

In [18]:
print(f"Test dataset located here: {output_s3_uri +'/test'} ===========")

print(f"Test annotation file is located here: {output_s3_uri +'/manifest'} ===========")
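Tools such as boto3 take a bucket name and a key prefix rather than a full `s3://` URI. The following is a minimal sketch of splitting the printed locations into those two parts with the standard library; `split_s3_uri` is a hypothetical helper, and the bucket name in the example is a placeholder.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/some/prefix' into ('bucket', 'some/prefix')."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


# Placeholder bucket name; in the notebook you would pass output_s3_uri + '/test'.
example_bucket, example_prefix = split_s3_uri(
    "s3://my-bucket/BIRD-Sagemaker-Deployment/outputs/test"
)
```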



In [19]:
from sagemaker.inputs import TrainingInput
from sagemaker.workflow.steps import TrainingStep
from sagemaker.tensorflow import TensorFlow

TF_FRAMEWORK_VERSION = '2.4.1'

hyperparameters = {
    'initial_epochs':  5,
    'batch_size': 8,
    'fine_tuning_epochs': 20, 
    'dropout': 0.4,
    'data_dir': '/opt/ml/input/data'
}

metric_definitions = [{'Name': 'loss',      'Regex': 'loss: ([0-9\\.]+)'},
                  {'Name': 'acc',       'Regex': 'accuracy: ([0-9\\.]+)'},
                  {'Name': 'val_loss',  'Regex': 'val_loss: ([0-9\\.]+)'},
                  {'Name': 'val_acc',   'Regex': 'val_accuracy: ([0-9\\.]+)'}]


distribution = {'parameter_server': {'enabled': False}}
DISTRIBUTION_MODE = 'FullyReplicated'
    
train_in = TrainingInput(s3_data=output_s3_uri +'/train', distribution=DISTRIBUTION_MODE)
val_in   = TrainingInput(s3_data=output_s3_uri +'/valid', distribution=DISTRIBUTION_MODE)
test_in  = TrainingInput(s3_data=output_s3_uri +'/test', distribution=DISTRIBUTION_MODE)

inputs = {'train':train_in, 'test': test_in, 'validation': val_in}

training_instance_type = 'ml.c5.4xlarge'

training_instance_count = 1
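Each entry in `metric_definitions` pairs a metric name with a regular expression whose capture group pulls the value out of the training log. The following sketch applies the same four patterns to one Keras-style progress line using Python's `re` module. The sample line is illustrative, and how SageMaker itself handles multiple matches on a line is not covered here.

```python
import re

# Same patterns as in the cell above.
metric_definitions = [
    {"Name": "loss", "Regex": "loss: ([0-9\\.]+)"},
    {"Name": "acc", "Regex": "accuracy: ([0-9\\.]+)"},
    {"Name": "val_loss", "Regex": "val_loss: ([0-9\\.]+)"},
    {"Name": "val_acc", "Regex": "val_accuracy: ([0-9\\.]+)"},
]

# Illustrative log line in the style Keras prints at the end of an epoch.
SAMPLE_LINE = (
    "32/32 [==============================] - 5s 150ms/step - "
    "loss: 0.4567 - accuracy: 0.8123 - val_loss: 0.5012 - val_accuracy: 0.7900"
)


def extract_metrics(line):
    """Return {metric name: list of captured strings} for one log line."""
    return {m["Name"]: re.findall(m["Regex"], line) for m in metric_definitions}


matches = extract_metrics(SAMPLE_LINE)

# A leading space is one way to keep "loss" from matching inside "val_loss".
strict_loss = re.findall(" loss: ([0-9\\.]+)", SAMPLE_LINE)
```

Because the `loss` and `accuracy` patterns are unanchored, they also match inside `val_loss` and `val_accuracy`, so `matches["loss"]` holds two values for this line while `strict_loss` holds one.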

In [20]:
model_path = f"s3://{bucket}/{prefix}"

estimator = TensorFlow(entry_point='train-mobilenet.py',
               source_dir='./code',
               output_path=model_path,
               instance_type=training_instance_type,
               instance_count=training_instance_count,
               distribution=distribution,
               hyperparameters=hyperparameters,
               metric_definitions=metric_definitions,
               role=role,
               framework_version=TF_FRAMEWORK_VERSION, 
               py_version='py37',
               base_job_name=prefix,
               script_mode=True)
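In script mode, SageMaker passes each entry in `hyperparameters` to the entry point as a `--name value` command-line argument and mounts each input channel under `/opt/ml/input/data/<channel>`. `train-mobilenet.py` is not shown in this notebook, so the following is only a sketch of how such a script might read these values. The function names are hypothetical, and the defaults simply mirror the dictionary defined earlier.

```python
import argparse
import posixpath


def parse_training_args(argv=None):
    """Read the hyperparameters defined in this notebook from the command line."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--initial_epochs", type=int, default=5)
    parser.add_argument("--fine_tuning_epochs", type=int, default=20)
    parser.add_argument("--batch_size", type=int, default=8)
    parser.add_argument("--dropout", type=float, default=0.4)
    parser.add_argument("--data_dir", type=str, default="/opt/ml/input/data")
    # parse_known_args ignores any extra arguments the platform may add.
    args, _unknown = parser.parse_known_args(argv)
    return args


def channel_dirs(args, channels=("train", "validation", "test")):
    """Map each channel name used in estimator.fit() to its directory in the container."""
    return {name: posixpath.join(args.data_dir, name) for name in channels}


example_args = parse_training_args(
    ["--initial_epochs", "5", "--batch_size", "8", "--dropout", "0.4", "--extra_flag", "x"]
)
example_dirs = channel_dirs(example_args)
```

The channel names match the keys of the `inputs` dictionary (`train`, `test`, `validation`), which is why `data_dir` points at their common parent directory.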

In [21]:
estimator.fit(inputs)

INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.


Using provided s3_resource


INFO:sagemaker:Creating training-job with name: BIRD-Sagemaker-Deployment-2023-07-07-12-56-42-862


2023-07-07 12:56:43 Starting - Starting the training job...
2023-07-07 12:56:58 Starting - Preparing the instances for training......
2023-07-07 12:57:45 Downloading - Downloading input data...
2023-07-07 12:58:16 Training - Training image download completed. Training in progress.
2023-07-07 12:58:29.121450: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:460] Initializing the SageMaker Profiler.
2023-07-07 12:58:29.121572: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:105] SageMaker Profiler is not enabled. The timeline writer thread will not be started, future recorded events will be dropped.
2023-07-07 12:58:29.146704: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:460] Initializing the SageMaker Profiler.
2023-07-07 12:58:30,350 sagemaker-training-toolkit INFO     Imported framework sagemaker_tensorflow_container.training
2023-07-07 12:58:30,358 sagemaker-training-toolkit INFO     No GPUs detected (normal i

In [22]:
training_job_name = estimator.latest_training_job.name

print(f"model artifacts file is uploaded here: {model_path}/{training_job_name}/output ========")
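SageMaker stores the training output as a `model.tar.gz` archive under the `output` folder printed above. The following is a minimal sketch of building that URI and unpacking a downloaded copy with the standard `tarfile` module. Both helper names are hypothetical, and the download step itself is left out.

```python
import tarfile
from pathlib import Path


def model_artifact_uri(output_path, job_name):
    """Build the S3 URI of the model archive for a finished training job."""
    return f"{output_path.rstrip('/')}/{job_name}/output/model.tar.gz"


def extract_model(tar_path, dest_dir):
    """Unpack a local model.tar.gz into dest_dir and return the member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(tar_path, "r:gz") as archive:
        archive.extractall(dest)
        return archive.getnames()


# Usage in the notebook, after copying the archive locally:
# print(model_artifact_uri(model_path, training_job_name))
# extract_model("model.tar.gz", "./model")
```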



