
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 1200px">
</div>

# Convolutional Neural Networks

We will use pre-trained Convolutional Neural Networks (CNNs), trained on the [ImageNet](http://www.image-net.org/) image dataset, to demonstrate two things: first, how to explore and classify images; second, how to use transfer learning with existing trained models (next lab).


## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) In this lesson you:<br>
 - Analyze popular CNN architectures
 - Apply pre-trained CNNs to images using Pandas Scalar Iterator UDF

In [3]:
%run "./Includes/Classroom-Setup"

## VGG16
![vgg16](https://neurohive.io/wp-content/uploads/2018/11/vgg16-neural-network.jpg)

We are going to start with the VGG16 model, which was introduced by Simonyan and Zisserman in their 2014 paper [Very Deep Convolutional Networks for Large-Scale Image Recognition](https://arxiv.org/abs/1409.1556).

Let's start by downloading VGG's weights and model architecture.

In [5]:
from tensorflow.keras.preprocessing import image
from tensorflow.keras.applications.vgg16 import preprocess_input, decode_predictions, VGG16
import numpy as np

vgg16Model = VGG16(weights="imagenet")

We can look at the model summary. Look at how many parameters there are! Imagine if you had to train all 138,357,544 parameters from scratch! This is one motivation for re-using existing model weights.
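
To see where that parameter count comes from, here is a minimal sketch that recomputes it by hand. It assumes the standard VGG16 layout (thirteen 3 x 3 convolutions followed by three dense layers on a 224 x 224 RGB input); the helper names `conv_params` and `dense_params` are made up for this illustration and are not part of Keras.

```python
# Channel widths (in, out) of the 13 convolutional layers, assuming the standard VGG16 layout
conv_channels = [
    (3, 64), (64, 64),
    (64, 128), (128, 128),
    (128, 256), (256, 256), (256, 256),
    (256, 512), (512, 512), (512, 512),
    (512, 512), (512, 512), (512, 512),
]
# Five 2 x 2 max-pooling layers shrink 224 x 224 to 7 x 7, so the flattened input is 7 * 7 * 512
dense_units = [(7 * 7 * 512, 4096), (4096, 4096), (4096, 1000)]


def conv_params(c_in, c_out, kernel=3):
    # One kernel x kernel x c_in filter per output channel, plus one bias per output channel
    return kernel * kernel * c_in * c_out + c_out


def dense_params(n_in, n_out):
    # One weight per (input, output) pair, plus one bias per output unit
    return n_in * n_out + n_out


conv_total = sum(conv_params(c_in, c_out) for c_in, c_out in conv_channels)
dense_total = sum(dense_params(n_in, n_out) for n_in, n_out in dense_units)
total_params = conv_total + dense_total

print(f"conv: {conv_total:,}  dense: {dense_total:,}  total: {total_params:,}")
```

The total matches the 138,357,544 figure above, and roughly 89% of it sits in the three dense layers rather than in the convolutions.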

**RECAP**: What is a convolution? Max pooling?

**Question**: What do the input and output shapes represent?

In [8]:
vgg16Model.summary()

## Inception-V3 + Batch Normalization

In 2016, developers from [Google published a paper](https://www.cv-foundation.org/openaccess/content_cvpr_2016/papers/Szegedy_Rethinking_the_Inception_CVPR_2016_paper.pdf) updating their Inception architecture with a number of optimizations.  This included a technique known as batch normalization.

Each layer in a deep neural network (with 10+ layers, for instance) expects the inputs from the previous layer to come from the same distribution.  However, in practice each layer is being updated, changing the distribution of its output to the next layer.  This is called "internal covariate shift" and can result in an unstable learning process since each layer is effectively learning a moving target. 

**Batch normalization is a technique for very deep neural networks (especially CNNs) that standardizes the inputs to a layer for each mini-batch.** Generally speaking, this stabilizes the learning process and so reduces the number of training epochs needed.
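
As a rough illustration of the standardization step, here is a NumPy sketch of the training-time forward pass only. It assumes a 2-D input of shape (batch, features); the function name `batch_norm`, the `eps` value, and the random data are choices made for this example. Keras' `BatchNormalization` layer additionally tracks running statistics for inference, and for convolutional feature maps the statistics are computed per channel.

```python
import numpy as np


def batch_norm(x, gamma, beta, eps=1e-5):
    """Standardize each feature over the mini-batch axis, then scale and shift."""
    mean = x.mean(axis=0)                    # per-feature mean over the mini-batch
    var = x.var(axis=0)                      # per-feature variance over the mini-batch
    x_hat = (x - mean) / np.sqrt(var + eps)  # zero mean, (almost) unit variance
    return gamma * x_hat + beta              # learnable scale (gamma) and shift (beta)


rng = np.random.default_rng(0)
# Hypothetical mini-batch: 32 samples, 4 features, far from zero mean / unit variance
batch = rng.normal(loc=5.0, scale=3.0, size=(32, 4))
out = batch_norm(batch, gamma=np.ones(4), beta=np.zeros(4))

print(out.mean(axis=0).round(3), out.std(axis=0).round(3))
```

With `gamma` set to ones and `beta` to zeros, each feature of `out` has a mean close to 0 and a standard deviation close to 1, regardless of how the incoming batch was distributed.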

Batch normalization should generally not be used with dropout, another regularization technique discussed in the Advanced Keras notebook.  While there's some contention over which is a more effective method ([see this paper for details](https://link.springer.com/article/10.1007/s11042-019-08453-9)), batch normalization is generally preferred over dropout for deep neural networks.

In [10]:
from tensorflow.keras.applications.inception_v3 import InceptionV3

inceptionModel = InceptionV3()

Take a look at the architecture and note where batch normalization is performed.

In [12]:
inceptionModel.summary()

Looking for more reference architectures?  Check out [`tf.keras.applications`](https://www.tensorflow.org/api_docs/python/tf/keras/applications) for what's available out of the box.

## Apply pre-trained model

We are going to make a helper function that resizes each image to 224 x 224 and prints the top 3 predicted classes for it.

TensorFlow represents images in a channels-last format: (samples, height, width, color_depth)
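
A small NumPy sketch of what that layout looks like for one image. The all-zeros array is just a stand-in for a decoded 224 x 224 RGB image; the channels-first transpose is shown only for contrast.

```python
import numpy as np

# Stand-in for a single decoded RGB image: (height, width, color_depth)
single_image = np.zeros((224, 224, 3), dtype="float32")

# Models expect a batch, so add a leading "samples" axis
batch = np.expand_dims(single_image, axis=0)
samples, height, width, color_depth = batch.shape
print(batch.shape)  # (1, 224, 224, 3)

# For contrast, the channels-first layout moves color_depth ahead of height and width
channels_first = np.transpose(batch, (0, 3, 1, 2))
print(channels_first.shape)  # (1, 3, 224, 224)
```

This is the same `np.expand_dims(x, axis=0)` step the helper performs before calling `model.predict`.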

In [15]:
def predict_images(images, model):
  for i in images:
    print(f"Processing image: {i}")
    img = image.load_img(i, target_size=(224, 224))
    # Convert to numpy array for Keras image format processing
    x = image.img_to_array(img)
    x = np.expand_dims(x, axis=0)
    x = preprocess_input(x)
    preds = model.predict(x)
    # Decode the results into a list of tuples (class, description, probability)
    print(f"Predicted: {decode_predictions(preds, top=3)[0]}\n")

## Images
<div style="text-align: left; line-height: 0; padding-top: 9px;">
  <img src="https://files.training.databricks.com/images/pug.jpg" height="150" width="150" alt="Pug">
  <img src="https://files.training.databricks.com/images/strawberries.jpg" height="150" width="150" alt="Strawberries">
  <img src="https://files.training.databricks.com/images/rose.jpg" height="150" width="150" alt="Rose">
  
</div>

Let's make sure the datasets are already mounted, then classify the three images with VGG16.

In [17]:
img_paths = [
  "/dbfs/mnt/training/dl/img/pug.jpg", 
  "/dbfs/mnt/training/dl/img/strawberries.jpg", 
  "/dbfs/mnt/training/dl/img/rose.jpg"
]

predict_images(img_paths, vgg16Model)

The network did well with the pug and strawberries! What happened with the rose? It turns out that `rose` is not one of the 1,000 ImageNet categories that VGG16 was trained to predict. Still, it is quite interesting that it predicted `sea_anemone` and `vase`.
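
The reason the model still returns confident-looking labels is that a softmax classifier always spreads its probability over the classes it knows. The sketch below illustrates this with a made-up five-class label set and made-up logits (neither comes from VGG16): even when no label fits the image, the top 3 are simply the three largest probabilities.

```python
import numpy as np


def softmax(logits):
    shifted = logits - logits.max()  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()


# Hypothetical label set and logits for an image whose true class is not in the list
labels = ["pug", "strawberry", "vase", "sea_anemone", "hip"]
logits = np.array([0.2, 0.1, 1.4, 1.9, 1.1])

probs = softmax(logits)
top3_idx = np.argsort(probs)[::-1][:3]  # indices of the three largest probabilities
top3 = [(labels[i], float(probs[i])) for i in top3_idx]

for label, prob in top3:
    print(f"{label}: {prob:.3f}")
```

The probabilities always sum to 1, so there is no built-in way for the network to answer "none of the above".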

You can play around with this with your own images by doing the following:

Get a new file: 

`%sh wget image_url.jpg`

`%fs cp file:/databricks/driver/image_name.jpg yourName/tmp/image_name.jpg`


OR

You can upload this file via the Data UI and read in from the FileStore path (e.g. `/dbfs/FileStore/image_name.jpg`).

## Classify Co-Founders of Databricks
<div style="text-align: left; line-height: 0; padding-top: 9px;">
  <img src="https://files.training.databricks.com/images/Ali-Ghodsi-4.jpg" height="150" width="150" alt="Databricks Nerds!">
  <img src="https://files.training.databricks.com/images/andy-konwinski-1.jpg" height="150" width="150" alt="Databricks Nerds!">
  <img src="https://files.training.databricks.com/images/ionS.jpg" height="150" width="150" alt="Databricks Nerds!">
  <img src="https://files.training.databricks.com/images/MateiZ.jpg" height="200" width="150" alt="Databricks Nerds!">
  <img src="https://files.training.databricks.com/images/patrickW.jpg" height="150" width="150" alt="Databricks Nerds!">
  <img src="https://files.training.databricks.com/images/Reynold-Xin.jpg" height="150" width="150" alt="Databricks Nerds!">
</div>

Load these images into a DataFrame.

In [22]:
df = spark.read.format("image").load("/mnt/training/dl/img/founders/")
df.cache().count()
display(df)

Let's wrap the prediction code inside a UDF so we can apply this model in parallel on each row of the DataFrame.

In [24]:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType, ArrayType

# Helper func to return top3 results as strings in an array
def get_results(path, model, preprocess_input, decode_predictions, target_size=(224,224)):
  img = image.load_img(path, target_size=target_size)
  x = image.img_to_array(img)
  x = np.expand_dims(x, axis=0)
  x = preprocess_input(x)
  preds = model.predict(x)

  # Decode the results into a list of tuples (class, description, probability)  
  top_3 = decode_predictions(preds, top=3)[0]
  result = []
  for _, label, prob in top_3:
    result.append(f"{label}: {prob:.3f}")
  return result

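# Define UDF to do preprocessing and prediction steps
@udf(ArrayType(StringType()))
def vgg16_predict_udf(image_data):
  # The first field of the image struct is "origin" (the file URI); convert it to a local /dbfs path
  path = image_data[0].replace("dbfs:", "/dbfs")
  # Note: the model is loaded again for every row, which makes this version slow
  model = VGG16(weights="imagenet")
  return get_results(path, model, preprocess_input, decode_predictions)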

display(df.withColumn("Top 3 VGG16 Predictions", vgg16_predict_udf("image")))

### Vectorized UDF

As of Spark 2.3, there are Vectorized UDFs available in Python to help speed up the computation.

* [Blog post](https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html)
* [Documentation](https://spark.apache.org/docs/latest/sql-programming-guide.html#pyspark-usage-guide-for-pandas-with-apache-arrow)

<img src="https://databricks.com/wp-content/uploads/2017/10/image1-4.png" alt="Benchmark" width ="500" height="1500">

Vectorized UDFs utilize Apache Arrow to speed up computation. Let's see how that helps improve our processing time.

Vectorized UDFs rely on two components:
* [Apache Arrow](https://arrow.apache.org/), an in-memory columnar data format that Spark uses to transfer data efficiently between JVM and Python processes with near-zero (de)serialization cost. See more [here](https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html).
* pandas, used inside the function to work with pandas instances and APIs.
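
To make the "pandas inside the function" part concrete, here is a plain-pandas sketch with no Spark involved. It contrasts a row-at-a-time function with one that handles a whole `pd.Series` in a single call, using the `dbfs:` to `/dbfs` path conversion from this notebook. The function names are made up for this example; in Spark, a function shaped like the second one is what `pandas_udf` would wrap.

```python
import pandas as pd


def to_local_path(path: str) -> str:
    # Row-at-a-time: called once per value, as in a regular Python UDF
    return path.replace("dbfs:", "/dbfs")


def to_local_path_vectorized(paths: pd.Series) -> pd.Series:
    # Batch-at-a-time: one call processes the whole Series
    return paths.str.replace("dbfs:", "/dbfs", regex=False)


paths = pd.Series([
    "dbfs:/mnt/training/dl/img/pug.jpg",
    "dbfs:/mnt/training/dl/img/rose.jpg",
])

row_wise = paths.map(to_local_path)
vectorized = to_local_path_vectorized(paths)
print(vectorized.tolist())
```

Both versions produce the same result; the vectorized one simply receives its input in batches, which is what lets Arrow hand data to Python in bulk.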

### Pandas Scalar Iterator UDF

If you define your own UDF to apply a model to each record of your DataFrame in Python, opt for vectorized UDFs for optimized serialization and deserialization. However, if your model is very large, then there is high overhead for the Pandas UDF to repeatedly load the same model for every batch in the same Python worker process. In Spark 3.0, Pandas UDFs can accept an iterator of pandas.Series or pandas.DataFrame so that you can load the model only once instead of loading it for every series in the iterator.

This way, the cost of any setup (like loading the VGG16 model in our case) is incurred fewer times. When the number of images you’re working with is greater than `spark.conf.get('spark.sql.execution.arrow.maxRecordsPerBatch')`, which is 10,000 by default, you'll see significant speedups over a pandas scalar UDF, because the scalar UDF reloads the model for each batch while the iterator UDF loads it once and then iterates through the batches of pd.Series.

It has the general syntax of: 
```
@pandas_udf(...)
def predict(batch_iter):
  model = ... # load the model once per iterator
  for features in batch_iter:
    yield model.predict(features)
```
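
The effect of the iterator pattern on setup cost can be sketched in plain Python, without Spark. Everything here is a hypothetical stand-in: `load_model` mimics an expensive call such as `VGG16(weights="imagenet")` and just counts how often it runs, and the "model" returns string lengths instead of predictions.

```python
from typing import Iterator, List

calls = {"load": 0}  # counts how many times the (pretend) model is loaded


def load_model():
    # Hypothetical stand-in for an expensive model load
    calls["load"] += 1
    return lambda batch: [len(item) for item in batch]


def predict_per_batch(batch: List[str]) -> List[int]:
    # Scalar-style: the model is loaded every time a batch arrives
    model = load_model()
    return model(batch)


def predict_iterator(batches: Iterator[List[str]]) -> Iterator[List[int]]:
    # Iterator-style: load once, then reuse the model for every batch
    model = load_model()
    for batch in batches:
        yield model(batch)


batches = [["a.jpg", "bb.jpg"], ["ccc.jpg"], ["dddd.jpg"]]

calls["load"] = 0
per_batch_out = [predict_per_batch(b) for b in batches]
loads_per_batch = calls["load"]

calls["load"] = 0
iterator_out = list(predict_iterator(iter(batches)))
loads_iterator = calls["load"]

print(f"per-batch loads: {loads_per_batch}, iterator loads: {loads_iterator}")
```

Both approaches return the same results, but the per-batch version loads the model once per batch while the iterator version loads it once in total.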


<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> If the workers cache the model weights after loading them for the first time, subsequent calls to the same UDF that load the same model will be significantly faster.

In [27]:
import pandas as pd
from pyspark.sql.functions import pandas_udf
from typing import Iterator

def preprocess(image_path):
  path = image_path.replace("dbfs:", "/dbfs")
  img = image.load_img(path, target_size=(224, 224))
  x = image.img_to_array(img)
  x = preprocess_input(x)
  return x

@pandas_udf(ArrayType(StringType()))
def vgg16_predict_pandas_udf(image_data_iter: Iterator[pd.DataFrame]) -> Iterator[pd.Series]:
  # Load model outside of for loop
  model = VGG16(weights="imagenet") 
  for image_data_series in image_data_iter:
    image_path_series = image_data_series["origin"]
    # Apply functions to entire series at once
    x = image_path_series.map(preprocess) 
    x = np.stack(list(x.values))
    preds = model.predict(x, batch_size=6)
    top_3s = decode_predictions(preds, top=3)
    
    # Format results
    results = []
    for top_3 in top_3s:
      result = []
      for _, label, prob in top_3:
        result.append(f"{label}: {prob:.3f}")
      results.append(result)
    yield pd.Series(results)

display(df.withColumn("Top 3 VGG16 Predictions (pandas udf)", vgg16_predict_pandas_udf("image")))

Yikes! These are not the most natural predictions (because ImageNet did not have a `person` category). In the next lab, we will cover how to utilize existing components of the VGG16 architecture, and how to retrain the final classifier.

&copy; 2020 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>