# Feature extraction
## Overview
This notebook demonstrates the image preprocessing required for transfer learning, as discussed in the [How to Retrain an Image Classifier for New Categories](https://www.tensorflow.org/tutorials/image_retraining) tutorial. Transfer learning reuses a pre-trained model, here Inception V3, to build a new image classifier from new images and custom classes. The preprocessing covered in this notebook runs in two steps:
*   Convert images to JPEG format
*   Extract features from the images by running them through the Inception V3 model

The main advantage of transfer learning is a considerable reduction in cost and training time compared to training the classifier from scratch.

This notebook does preprocessing using [Cloud Dataflow](https://cloud.google.com/dataflow/) for image and text manipulation and [TensorFlow](https://www.tensorflow.org/) for feature extraction.

### Prepare imports
Execute the code below to import all necessary Python modules and initialize the global variables used later on.

In [1]:
import IPython
import apache_beam as beam
import datetime
import logging
import os
import shutil
import tempfile
import warnings

from apache_beam.io import ReadFromText
from apache_beam.io import WriteToText
from apache_beam.io import tfrecordio

warnings.filterwarnings(action='ignore')

project_id = datalab_project_id()
bucket = 'gs://candies-{}-dantest'.format(project_id)
preprocess_dir = '{}/candies_preprocessed_cloud'.format(bucket)
model_dir = '{}/candies_model_cloud'.format(bucket)
staging_dir = '{}/staging'.format(bucket)

In [None]:
!gsutil mb $bucket

### Visualize training data
We will train our classifier to recognize images of candies. Let's have a look at a few samples from the dataset.

In [2]:
import pandas as pd
from cStringIO import StringIO

def resize_image(image_str_tensor):
  """Decodes a JPEG string, resizes it and re-encodes it as JPEG."""
  import tensorflow as tf
  
  # These constants are set by Inception v3's expectations.
  height = 299
  width = 299
  channels = 3

  image = tf.image.decode_jpeg(image_str_tensor, channels=channels)
  # Note resize expects a batch dimension, but tf.map_fn suppresses that index,
  # thus we have to expand then squeeze.  Resize returns float32 in the
  # range [0, uint8_max]
  image = tf.expand_dims(image, 0)
  image = tf.image.resize_bilinear(image, [height, width], align_corners=False)
  image = tf.squeeze(image, squeeze_dims=[0])
  image = tf.cast(image, dtype=tf.uint8)
  image = tf.image.encode_jpeg(image, quality=100)
  return image


def display_images(image_files):
  """Resize the given images and display them in an HTML table."""

  import IPython
  import base64
  import collections
  import tensorflow as tf
  from tensorflow.python.lib.io import file_io as tf_file_io
  
  images = []
  for image_file in image_files:
    with tf_file_io.FileIO(image_file, 'r') as ff:
      images.append(ff.read())

  # To resize, run a tf session so we can reuse 'resize_image()' defined
  # above (the comment originally named it 'decode_and_resize()', as used in
  # the prediction graph). This makes sure we don't lose any quality in
  # prediction, while decreasing the size of the images submitted to the
  # model over the network.
  image_str_tensor = tf.placeholder(tf.string, shape=[None])
  image = tf.map_fn(resize_image, image_str_tensor, back_prop=False)
  feed_dict = collections.defaultdict(list)
  feed_dict[image_str_tensor.name] = images
  with tf.Session() as sess:
    images_resized = sess.run(image, feed_dict=feed_dict)

  html = '<table>'
  for i, image in enumerate(images_resized):
    encoded_image = base64.b64encode(image)
    image_html = "<td><img src='data:image/jpeg;base64,%s'></td>" % encoded_image
    if i % 4 == 0:
      html = html + '<tr>'
    html = html + image_html
    if i % 4 == 3:
      html = html + '</tr>'
  #IPython.display.display(IPython.display.Image(data=image))
  html = html + '</table>'
  IPython.display.display(IPython.display.HTML(html))


train_dataset = 'gs://candies-ml/dataset/metadata/train_candies560.csv'
#train_dataset = 'gs://candies-ml/dataset/metadata/train_candies16.csv'
%storage read --object $train_dataset --variable text
df = pd.read_csv(StringIO(text), names=['image_uri', 'category'])

images = []
categories = df['category'].drop_duplicates().values
for category in categories:
  image_uris = df.loc[df['category'] == category]
  images.extend(image_uris['image_uri'].sample(n=4, replace=False).values)
display_images(images)
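
The table-building loop in `display_images` can be looked at in isolation. The following is a minimal sketch, not the notebook's own helper: `images_to_html_table` is a hypothetical name, it skips the TensorFlow resizing step entirely, and it assumes the inputs are already JPEG byte strings. It groups the images four per row and always closes the last row, even when the image count is not a multiple of four.

```python
import base64


def images_to_html_table(jpeg_images, columns=4):
    """Render JPEG byte strings as an HTML table with `columns` images per row."""
    rows = []
    for start in range(0, len(jpeg_images), columns):
        cells = []
        for image_bytes in jpeg_images[start:start + columns]:
            # b64encode returns bytes; decode so the string formats cleanly.
            encoded = base64.b64encode(image_bytes).decode('ascii')
            cells.append("<td><img src='data:image/jpeg;base64,%s'></td>" % encoded)
        rows.append('<tr>' + ''.join(cells) + '</tr>')
    return '<table>' + ''.join(rows) + '</table>'


# Placeholder payloads stand in for real JPEG data in this illustration.
html = images_to_html_table([b'a', b'b', b'c', b'd', b'e'])
```

In a notebook, the resulting string could then be passed to `IPython.display.HTML`, as the cell above does.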

## Load and prepare the training data
Code the pipeline transformations

### Exercise 1 - Load images and labels from CSV
TODO : Write the exercise description. Extract the list of labels and of labeled images from the CSV configuration file on GCS.

In [None]:
class LoadImagesAndLabels(beam.PTransform):
  """Load labels and labeled images from GCS.
  """

  def __init__(self, train_dataset):
    super(LoadImagesAndLabels, self).__init__('LoadImagesAndLabels')
    self._train_dataset = train_dataset

  def expand(self, begin):
    """TODO : write the description of the function
    """
    import csv
    # TODO : write what needs to be done in the exercise
    raise NotImplementedError()

### Solution
To see the solution please display the code in the box below.

In [3]:
class LoadImagesAndLabels(beam.PTransform):
  """Load labels and labeled images from GCS.
  """

  def __init__(self, train_dataset):
    super(LoadImagesAndLabels, self).__init__('LoadImagesAndLabels')
    self._train_dataset = train_dataset

  def expand(self, begin):
    import csv
    images = (begin |
              'ReadCSVFile' >> ReadFromText(self._train_dataset, strip_trailing_newlines=True) |
              'DictFromCSV' >> beam.Map(lambda line: next(csv.DictReader([line], fieldnames=['image_url', 'label']))))
    labels = (images |
              'ExtractLabels' >> beam.Map(lambda x: str(x['label'])) |
              'CombineLabels' >> beam.transforms.combiners.Count.PerElement() |
              'KeepLabelName' >> beam.Map(lambda label_count: label_count[0]))
    return images, labels
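
To see what the two branches of this transform produce, the same logic can be run on a plain Python list, without Beam. This is only a sketch under assumptions: `parse_rows` and `distinct_labels` are hypothetical helper names, and the sample URIs are made up. The result is sorted here purely to make the output deterministic; the pipeline itself does not sort at this stage.

```python
import csv


def parse_rows(lines):
    """Turn 'image_url,label' lines into dicts, one per line."""
    fieldnames = ['image_url', 'label']
    return [next(csv.DictReader([line], fieldnames=fieldnames)) for line in lines]


def distinct_labels(rows):
    """Return each label once, like Count.PerElement followed by keeping the key."""
    return sorted(set(row['label'] for row in rows))


# Hypothetical sample rows; the URIs are made up for illustration.
sample_lines = [
    'gs://example-bucket/img1.jpg,gummy',
    'gs://example-bucket/img2.jpg,chocolate',
    'gs://example-bucket/img3.jpg,gummy',
]
rows = parse_rows(sample_lines)
labels = distinct_labels(rows)
```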

### Exercise 2 - Label images with label ids
TODO : Write the exercise description. Extracts (uri, label_ids) tuples from CSV rows.

In [None]:
class LabelImagesWithIdsDoFn(beam.DoFn):
  """Extracts (uri, label_ids) tuples from CSV rows.
  """

  def start_bundle(self, context=None):
    self.label_to_id_map = {}

  def process(self, element, all_labels):
    all_labels = list(all_labels)
    # TODO : write what needs to be done in the exercise
    raise NotImplementedError()

### Solution
To see the solution please display the code in the box below.

In [4]:
class LabelImagesWithIdsDoFn(beam.DoFn):
  """Extracts (uri, label_ids) tuples from CSV rows.
  """

  def start_bundle(self, context=None):
    self.label_to_id_map = {}

  def process(self, element, all_labels):
    all_labels = list(all_labels)
    # Dataflow cannot guarantee the order of the labels when materializing them.
    # The labels materialized and consumed by training may not be in the same order
    # as the one used in preprocessing. So we need to sort it in both preprocessing
    # and training so the order matches.
    all_labels.sort()
    if not self.label_to_id_map:
      for i, label in enumerate(all_labels):
        label = label.strip()
        if label:
          self.label_to_id_map[label] = i

    # Each element is a dict with the keys:
    # image_url, label
    if not element:
      return

    uri = element['image_url']
    label_id = self.label_to_id_map[element['label'].strip()]
    yield uri, label_id
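
The reason for sorting the labels before assigning ids can be shown without Beam. In the sketch below, `build_label_to_id_map` is a hypothetical helper that mirrors the loop in `process`: whatever order the labels arrive in, the resulting mapping is the same.

```python
def build_label_to_id_map(labels):
    """Map each non-empty label to its index in sorted order."""
    label_to_id = {}
    for i, label in enumerate(sorted(labels)):
        label = label.strip()
        if label:
            label_to_id[label] = i
    return label_to_id


# The same labels in two different arrival orders give the same mapping.
first = build_label_to_id_map(['gummy', 'chocolate', 'lollipop'])
second = build_label_to_id_map(['lollipop', 'gummy', 'chocolate'])
```

Training code that reads the labels file would need to apply the same sort for the ids to line up.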

### Create the data preparation transform

In [5]:
class PrepareTrainingData(beam.PTransform):
  """Load images and labels from GCS and tag each image with its label id.
  """

  def __init__(self, train_dataset):
    super(PrepareTrainingData, self).__init__('PrepareTrainingData')
    self._train_dataset = train_dataset

  def expand(self, begin):
    images, labels = begin | 'LoadImagesAndLabels' >> LoadImagesAndLabels(self._train_dataset)
    labeled_images = images | 'LabelImagesWithIds' >> beam.ParDo(LabelImagesWithIdsDoFn(), beam.pvalue.AsIter(labels))
    return labeled_images, labels

## Extract features from images

### Convert images to JPEG format
Read the files from GCS and convert the images to JPEG format. We do this even for images that are already JPEG, to remove variations such as differing numbers of channels.

In [6]:
class ConvertImagesToJpegDoFn(beam.DoFn):

  def process(self, element):
    
    import cStringIO
    from PIL import Image
    from tensorflow.python.lib.io import file_io as tf_file_io

    uri, label_id = element
    try:
      with tf_file_io.FileIO(uri, 'r') as f:
        img = Image.open(f).convert('RGB')
    # A variety of different calling libraries throw different exceptions here.
    # They all correspond to an unreadable file so we treat them equivalently.
    # pylint: disable=broad-except
    except Exception as e:
      logging.exception('Error processing image %s: %s', uri, str(e))
      return

    # Convert to desired format and output.
    output = cStringIO.StringIO()
    img.save(output, 'jpeg')
    image_bytes = output.getvalue()
    yield uri, label_id, image_bytes

### Extract features from images

Compute an embedding for each image and store it, together with the label, in a tensorflow.Example.

In [7]:
class ExtractTFExampleFromImagesDoFn(beam.DoFn):
  """Embeds image bytes and labels, stores them in tensorflow.Example.

  (uri, label_ids, image_bytes) -> (tensorflow.Example).

  Output proto contains 'label', 'image_uri' and 'embedding'.
  The 'embedding' is calculated by feeding image into input layer of image
  neural network and reading output of the bottleneck layer of the network.

  Attributes:
    _checkpoint_path: path to the Inception V3 checkpoint used to build the
                      embeddings graph.
  """
  
  def __init__(self, checkpoint_path):
    import tensorflow as tf
    self._tf = tf
    self._tf_train = tf.train
    self.tf_session = None
    self.graph = None
    self.preprocess_graph = None
    self._checkpoint_path = checkpoint_path

  def start_bundle(self, context=None):
    # The graph and session are built lazily and kept on the DoFn instance.
    # Because finish_bundle() closes the session and clears these fields,
    # they are rebuilt here whenever the instance processes another bundle.
    import mltoolbox.image.classification._preprocess as preprocess
    if not self.graph:
      self.graph = self._tf.Graph()
      self.tf_session = self._tf.InteractiveSession(graph=self.graph)
      with self.graph.as_default():
        self.preprocess_graph = preprocess.EmbeddingsGraph(self.tf_session, self._checkpoint_path)

  def finish_bundle(self, context=None):
    if self.tf_session is not None:
      self.tf_session.close()
      # Reset the state so that a reused instance never runs on a closed session.
      self.tf_session = None
      self.graph = None
      self.preprocess_graph = None

  def process(self, element):

    def _bytes_feature(value):
      return self._tf_train.Feature(bytes_list=self._tf_train.BytesList(value=value))

    def _float_feature(value):
      return self._tf_train.Feature(float_list=self._tf_train.FloatList(value=value))

    uri, label_id, image_bytes = element

    try:
      embedding = self.preprocess_graph.calculate_embedding(image_bytes)
    except self._tf.errors.InvalidArgumentError as e:
      logging.warning('Could not encode an image from %s: %s', uri, str(e))
      return

    features = self._tf_train.Features(
      feature={
        'image_uri': _bytes_feature([str(uri)]),
        'embedding': _float_feature(embedding.ravel().tolist())
      })
    example = self._tf_train.Example(features=features)
    example.features.feature['label'].int64_list.value.append(label_id)

    yield example
    
class ExtractFeatures(beam.PTransform):
  """Convert images to JPEG and extract their Inception V3 embeddings.
  """

  def __init__(self):
    super(ExtractFeatures, self).__init__('ExtractFeatures')
    self._checkpoint = 'gs://cloud-ml-data/img/flower_photos/inception_v3_2016_08_28.ckpt'

  def expand(self, labeled_images):
    return (labeled_images |
            'ConvertImagesToJpeg' >> beam.ParDo(ConvertImagesToJpegDoFn()) |
            'ExtractTFExampleFromImages' >> beam.ParDo(ExtractTFExampleFromImagesDoFn(self._checkpoint)))

## Create training and test data

### Exercise 3 - Create training and test data
TODO : Write the exercise description. Point attendees to the documentation of PartitionFn.

TODO : Make sure that all classes are equally represented in the training and test data.

In [8]:
class TrainEvalSplitPartitionFn(beam.PartitionFn):
  """Split train and eval data."""
  def partition_for(self, element, num_partitions):
    import random
    # TODO : write what needs to be done in the exercise
    raise NotImplementedError()

### Solution
To see the solution please display the code in the box below.

In [9]:
class TrainEvalSplitPartitionFn(beam.PartitionFn):
  """Split train and eval data."""
  def partition_for(self, element, num_partitions):
    import random
    return 1 if random.random() > 0.7 else 0
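
Because each element is assigned independently at random, the 70/30 split is only approximate, and nothing balances the classes between the two partitions. The sketch below illustrates the first point on plain Python data. The standalone `partition_for` function and the seeded generator are assumptions made for repeatability; they are not part of the pipeline.

```python
import collections
import random


def partition_for(rng, train_fraction=0.7):
    """Return 0 (train) with probability train_fraction, else 1 (eval)."""
    return 1 if rng.random() > train_fraction else 0


# A seeded generator makes this illustration repeatable; the pipeline above
# uses the unseeded module-level random.random() instead.
rng = random.Random(42)
counts = collections.Counter(partition_for(rng) for _ in range(10000))
train_share = counts[0] / 10000.0
```

With 10,000 elements the training share should land close to 0.7. With a small dataset the deviation, and the imbalance per class, can be noticeably larger.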

## Write features to Cloud Storage

In [10]:
class ExampleProtoCoder(beam.coders.Coder):
  """A coder to encode and decode TensorFlow Example objects."""

  def __init__(self):
    import tensorflow as tf
    self._tf_train = tf.train

  def encode(self, example_proto):
    return example_proto.SerializeToString()

  def decode(self, serialized_str):
    example = self._tf_train.Example()
    example.ParseFromString(serialized_str)
    return example


class SaveFeatures(beam.PTransform):
  """Save Features in a TFRecordIO format.
  """

  def __init__(self, file_path_prefix):
    super(SaveFeatures, self).__init__('SaveFeatures')
    self._file_path_prefix = file_path_prefix

  def expand(self, features):
    return (features |
            'Write features' >> tfrecordio.WriteToTFRecord(file_path_prefix=self._file_path_prefix,
                                                           file_name_suffix='.tfrecord.gz',
                                                           coder=ExampleProtoCoder()))
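
Conceptually, a coder is a pair of inverse functions: `encode` turns an object into bytes and `decode` restores it. The sketch below shows that round trip with a hypothetical `JsonDictCoder` that works on plain dicts. It deliberately does not subclass `beam.coders.Coder` or use TensorFlow protos, so it runs on its own; it illustrates the contract and is not a replacement for `ExampleProtoCoder`.

```python
import json


class JsonDictCoder(object):
    """Hypothetical stand-in for a coder: encode() and decode() are inverses."""

    def encode(self, value):
        # sort_keys keeps the encoding deterministic for equal dicts.
        return json.dumps(value, sort_keys=True).encode('utf-8')

    def decode(self, encoded):
        return json.loads(encoded.decode('utf-8'))


coder = JsonDictCoder()
record = {'image_uri': 'gs://example-bucket/img1.jpg', 'label': 2}
encoded = coder.encode(record)
decoded = coder.decode(encoded)
```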

## Create the pipeline

Boilerplate code to create the pipeline

In [11]:
def create_pipeline(train_dataset, output_dir, project, pipeline_option):
  """Create the Dataflow pipeline."""
  import csv
  
  job_name = ('preprocess-image-classification-' + datetime.datetime.now().strftime('%y%m%d-%H%M%S'))
  options = {
      'staging_location': os.path.join(output_dir, 'tmp', 'staging'),
      'temp_location': os.path.join(output_dir, 'tmp'),
      'job_name': job_name,
      'project': project,
      'teardown_policy': 'TEARDOWN_ALWAYS',
      'no_save_main_session': True
  }
  if pipeline_option is not None:
    options.update(pipeline_option)
  opts = beam.pipeline.PipelineOptions(flags=[], **options)
  
  p = beam.Pipeline('DataflowRunner', options=opts)
  #p = beam.Pipeline('DirectRunner', options=opts) 
  
  # Load the images from Cloud Storage and extract Inception features from them
  labeled_images, labels = p | 'PrepareTrainingData' >> PrepareTrainingData(train_dataset)
  preprocessed_data = labeled_images | 'ExtractFeatures' >> ExtractFeatures()
  
  # Create the training and evaluation datasets
  training_data, eval_data = preprocessed_data | 'CreateTrainingAndEvalDatasets' >> beam.Partition(TrainEvalSplitPartitionFn(), 2)
  
  # Write the training and evaluation datasets to Cloud Storage
  output_train_path = os.path.join(output_dir, job_name, 'train')
  output_eval_path = os.path.join(output_dir, job_name, 'eval')
  output_labels_file = os.path.join(output_dir, job_name, 'labels')
  output_latest_file = os.path.join(output_dir, 'latest')
  train_save = training_data | 'WriteTrainingData' >> SaveFeatures(output_train_path)
  eval_save = eval_data | 'WriteEvalData' >> SaveFeatures(output_eval_path)
  labels_save = labels | 'WriteLabels' >> WriteToText(output_labels_file, shard_name_template='')

  # Write the checkpoint for the next step
  ([eval_save, train_save, labels_save] |
   'WaitEndOfWrite' >> beam.Flatten() |
   'CombineWriteResults' >> beam.transforms.combiners.Sample.FixedSizeGlobally(1) |
   'JobNameToMap' >> beam.Map(lambda path: job_name) |
   'WriteJobNameToLatest' >> beam.io.textio.WriteToText(output_latest_file, shard_name_template=''))
  
  return p
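
The output layout that `create_pipeline` derives from the job name can be sketched separately. `build_output_paths` is a hypothetical helper, and the bucket name and timestamp are made up. It uses `posixpath.join` instead of `os.path.join` so that the `gs://` paths keep forward slashes on any operating system; that is a design choice of this sketch, not something the notebook does.

```python
import datetime
import posixpath


def build_output_paths(output_dir, now):
    """Return the job name and the output locations derived from it."""
    job_name = 'preprocess-image-classification-' + now.strftime('%y%m%d-%H%M%S')
    paths = {
        'train': posixpath.join(output_dir, job_name, 'train'),
        'eval': posixpath.join(output_dir, job_name, 'eval'),
        'labels': posixpath.join(output_dir, job_name, 'labels'),
        # 'latest' sits outside the job folder and records the last job name.
        'latest': posixpath.join(output_dir, 'latest'),
    }
    return job_name, paths


# Hypothetical bucket and a fixed timestamp, for illustration only.
job_name, paths = build_output_paths(
    'gs://example-bucket/preprocessed', datetime.datetime(2017, 6, 1, 9, 30, 0))
```

A later step can read the `latest` file to find the job folder that holds the `train`, `eval` and `labels` outputs.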

## Run the pipeline 
Run the pipeline and wait until it finishes.

In [12]:
%run 'Common.ipynb'

In [13]:
pipeline_runner = PipelineRunner(train_dataset, preprocess_dir, project_id, pipeline_options={'num_workers': 2})
#pipeline_runner = PipelineRunner(train_dataset, preprocess_dir, project_id)
pipeline_runner.run()

DataflowRuntimeException: Dataflow pipeline failed. State: FAILED, Error:
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 581, in do_work
    work_executor.execute()
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/executor.py", line 166, in execute
    op.start()
  File "dataflow_worker/operations.py", line 283, in dataflow_worker.operations.DoOperation.start (dataflow_worker/operations.c:10680)
    def start(self):
  File "dataflow_worker/operations.py", line 284, in dataflow_worker.operations.DoOperation.start (dataflow_worker/operations.c:10574)
    with self.scoped_start_state:
  File "dataflow_worker/operations.py", line 289, in dataflow_worker.operations.DoOperation.start (dataflow_worker/operations.c:9775)
    pickler.loads(self.spec.serialized_fn))
  File "/usr/local/lib/python2.7/dist-packages/apache_beam/internal/pickler.py", line 225, in loads
    return dill.loads(s)
  File "/usr/local/lib/python2.7/dist-packages/dill/dill.py", line 277, in loads
    return load(file)
  File "/usr/local/lib/python2.7/dist-packages/dill/dill.py", line 266, in load
    obj = pik.load()
  File "/usr/lib/python2.7/pickle.py", line 858, in load
    dispatch[key](self)
  File "/usr/lib/python2.7/pickle.py", line 1133, in load_reduce
    value = func(*args)
  File "/usr/local/lib/python2.7/dist-packages/dill/dill.py", line 767, in _import_module
    return getattr(__import__(module, None, None, [obj]), obj)
ImportError: No module named api.generator.api
