# Deep Learning with Spark
Databricks provides a deep learning library for Spark that runs deep learning models through Spark ML pipelines. It uses Spark's distributed processing engine to run models in parallel across a cluster.

https://github.com/databricks/spark-deep-learning

# Installation
We can either create a conda environment or load the deep learning library directly in the PySpark shell.

## Conda Environment
To use the deep learning package from a Jupyter notebook, create a conda environment with the required packages and use it as the notebook kernel. We will call this environment pyspark-dl:

*conda create -n pyspark-dl python=3.6 pyspark six=1.11.0 nomkl pandas=0.23.4 h5py=2.8.0 pillow=4.1.1 cloudpickle=0.5.2 tensorflow=1.12.0 keras=2.2.4 paramiko=2.4.1 wrapt=1.10.11 nb_conda*

Activate the *pyspark-dl* environment so that the remaining packages can be installed with pip:

*source activate pyspark-dl*

Run the following line to install the missing packages via pip:

*pip install tensorframes kafka tensorflowonspark jieba*

Once the installation has finished, deactivate the environment again (the command takes no environment name):

*source deactivate*

Start Jupyter Notebook and select the *pyspark-dl* environment as the kernel (Kernel - Change Kernel - *pyspark-dl*, or choose the *pyspark-dl* kernel when creating a new notebook).


See https://github.com/databricks/spark-deep-learning/blob/master/environment.yml for the full list of packages.
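
Before running the examples, it can help to confirm that the selected kernel actually sees the installed packages. The following is a minimal sketch; the helper `missing_modules` and the list of import names are assumptions made for illustration (for example, the pillow package is imported as `PIL`), not part of the library.

```python
import importlib.util

# Import names assumed for the packages installed above.
# Note that the "pillow" distribution is imported as "PIL".
REQUIRED_MODULES = ["pyspark", "tensorflow", "keras", "h5py", "PIL",
                    "pandas", "tensorframes", "tensorflowonspark"]


def missing_modules(names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Run this in a notebook cell that uses the pyspark-dl kernel.
print("Missing modules:", missing_modules(REQUIRED_MODULES) or "none")
```

If the list is not empty, the notebook is probably running on a different kernel than *pyspark-dl*.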

## PySpark Shell

To run PySpark with deep learning support in the shell, we need to start PySpark with Spark's Deep Learning library. The following command starts a PySpark shell and downloads the Spark Deep Learning package:

pyspark --packages databricks:spark-deep-learning:1.5.0-spark2.4-s_2.11

Also see https://databricks.github.io/spark-deep-learning/docs/_site/quick-start.html for installation instructions.
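
When working from a notebook rather than the shell, the same package can in principle be requested through the `spark.jars.packages` setting. The sketch below is an assumption-laden illustration: the helper names `sparkdl_coordinate` and `create_session` are hypothetical, and the setting only takes effect if no Spark session is already running in the process.

```python
def sparkdl_coordinate(version="1.5.0", spark="2.4", scala="2.11"):
    """Build the package coordinate in the form used by the --packages flag."""
    return "databricks:spark-deep-learning:{}-spark{}-s_{}".format(
        version, spark, scala)


def create_session(app_name="spark-dl"):
    """Create a SparkSession that requests the Spark Deep Learning package.

    Hypothetical helper: pyspark is imported lazily, and the package is
    only resolved when the session (and its JVM) is first started.
    """
    from pyspark.sql import SparkSession

    return (SparkSession.builder
            .appName(app_name)
            .config("spark.jars.packages", sparkdl_coordinate())
            .getOrCreate())
```

Calling `create_session()` at the top of a fresh notebook would then play the role of the shell command above.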

# Image Data Ingestion
Since version 2.3, Spark supports reading images into DataFrames. We import ImageSchema to read the images, then label and combine them in the code below.

In [1]:
from pyspark.ml.image import ImageSchema
from pyspark.sql.functions import lit

img_dir = "personalities"

# Read the images of each class into its own DataFrame
jobs_df = ImageSchema.readImages(img_dir + "/jobs")
zuckerberg_df = ImageSchema.readImages(img_dir + "/zuckerberg")

# Define 1 as the jobs class and 0 as the zuckerberg class
jobs_df = jobs_df.withColumn("label", lit(1))
zuckerberg_df = zuckerberg_df.withColumn("label", lit(0))

# Combine both classes into one labeled DataFrame (unionAll is deprecated)
data = jobs_df.union(zuckerberg_df)

data.show(10)

+--------------------+-----+
|               image|label|
+--------------------+-----+
|[file:/home/dan/g...|    1|
|[file:/home/dan/g...|    1|
|[file:/home/dan/g...|    1|
|[file:/home/dan/g...|    1|
|[file:/home/dan/g...|    1|
|[file:/home/dan/g...|    1|
|[file:/home/dan/g...|    1|
|[file:/home/dan/g...|    1|
|[file:/home/dan/g...|    1|
|[file:/home/dan/g...|    1|
+--------------------+-----+
only showing top 10 rows
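
The read, label, and combine steps above follow the same pattern for every class, so they can be generalized to any number of class folders. The following is a sketch under stated assumptions: `class_dirs` and `read_labeled_images` are hypothetical helper names, and the Spark part expects an active Spark session.

```python
import posixpath
from functools import reduce


def class_dirs(img_dir, class_labels):
    """Pair each class folder under img_dir with its numeric label."""
    return [(posixpath.join(img_dir, name), label)
            for name, label in sorted(class_labels.items())]


def read_labeled_images(img_dir, class_labels):
    """Read every class folder and return one labeled DataFrame.

    Hypothetical helper: pyspark is imported lazily, and an active
    Spark session is assumed.
    """
    from pyspark.ml.image import ImageSchema
    from pyspark.sql.functions import lit

    frames = [ImageSchema.readImages(path).withColumn("label", lit(label))
              for path, label in class_dirs(img_dir, class_labels)]
    # Stack the per-class DataFrames on top of each other.
    return reduce(lambda left, right: left.union(right), frames)
```

With these helpers, `read_labeled_images("personalities/", {"jobs": 1, "zuckerberg": 0})` would build the same kind of DataFrame as the cell above.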



## Training and Testing Data

In [2]:
train_df, test_df = data.randomSplit([0.6, 0.4])
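
Note that a random split assigns each row independently, so the 60/40 ratio only holds approximately, and the result changes between runs unless a seed is fixed. The following plain-Python sketch illustrates that idea; it is a conceptual model with a hypothetical `random_split` helper, not Spark's actual implementation.

```python
import random


def random_split(rows, weights, seed=None):
    """Assign each row to one split by a random draw against cumulative weights."""
    rng = random.Random(seed)
    total = float(sum(weights))

    # Cumulative upper bounds of the normalized weights, e.g. [0.6, 1.0].
    bounds = []
    running = 0.0
    for weight in weights:
        running += weight / total
        bounds.append(running)

    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for index, bound in enumerate(bounds):
            # The last split also catches any floating-point rounding gap.
            if draw < bound or index == len(bounds) - 1:
                splits[index].append(row)
                break
    return splits


train, test = random_split(range(100), [0.6, 0.4], seed=42)
print(len(train), len(test))  # roughly, but not necessarily exactly, 60 and 40
```

In PySpark itself, a reproducible split can be requested with `data.randomSplit([0.6, 0.4], seed=42)`.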

## Deep Learning Pipeline definition

In [13]:
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml import Pipeline
from sparkdl import DeepImageFeaturizer

# image features
featurizer = DeepImageFeaturizer(inputCol="image", 
                                 outputCol="features", 
                                 modelName="InceptionV3")
# model definition
lr = LogisticRegression(maxIter=20, 
                        regParam=0.05, 
                        elasticNetParam=0.3, 
                        labelCol="label")

# create pipeline
pipeline = Pipeline(stages=[featurizer, lr])

pipeline.getStages()

[DeepImageFeaturizer_de011117e8ba, LogisticRegression_e9088ce523fa]
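
The `regParam` and `elasticNetParam` arguments together control regularization: `regParam` sets the overall strength, and `elasticNetParam` mixes an L1 penalty (value 1) with an L2 penalty (value 0). The sketch below computes that penalty term for a given weight vector, following the elastic net formula in the Spark MLlib documentation; the function name `elastic_net_penalty` is an assumption made for illustration.

```python
def elastic_net_penalty(weights, reg_param, elastic_net_param):
    """Elastic net penalty: reg * (alpha * L1 + (1 - alpha) / 2 * L2^2)."""
    l1 = sum(abs(w) for w in weights)
    l2_squared = sum(w * w for w in weights)
    return reg_param * (elastic_net_param * l1
                        + (1.0 - elastic_net_param) * 0.5 * l2_squared)


# Penalty for a toy weight vector with the settings used in the pipeline.
print(elastic_net_penalty([1.0, -2.0], reg_param=0.05, elastic_net_param=0.3))
```

With `elasticNetParam=0.3`, the penalty is mostly L2 with a smaller L1 share, which shrinks weights while still pushing some of them toward zero.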

## Train and Test Model

In [5]:
# train model
model = pipeline.fit(train_df)    # train_df is a dataset of images and labels

# create predictions of test_df
predictions = model.transform(test_df)

predictions.select("label", "prediction").show()

## Evaluate Model

In [7]:
predictionAndLabels = predictions.select("prediction", "label")

evaluator = MulticlassClassificationEvaluator(metricName="accuracy")

# The predictions were computed on test_df, so this is the test set accuracy
print("Test set accuracy = " + str(evaluator.evaluate(predictionAndLabels)))

Test set accuracy = 0.6363636363636364
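
The accuracy metric is simply the fraction of rows whose prediction equals the label; a value of 0.6363... is, for example, what 7 correct predictions out of 11 rows would give. The following plain-Python sketch shows the calculation on a small hand-made list of pairs; the `accuracy` helper and the sample pairs are illustrative assumptions, not output from the model above.

```python
def accuracy(prediction_and_labels):
    """Return the fraction of (prediction, label) pairs that match."""
    pairs = list(prediction_and_labels)
    if not pairs:
        raise ValueError("accuracy is undefined for an empty list")
    correct = sum(1 for prediction, label in pairs if prediction == label)
    return correct / float(len(pairs))


# Illustrative pairs only: two of the three predictions match their label.
print(accuracy([(1.0, 1), (0.0, 1), (0.0, 0)]))
```

On a small DataFrame, the same pairs could be obtained with `predictionAndLabels.collect()`.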


# Exercise
Start a PySpark shell with the following command:

pyspark --master local[2] --packages databricks:spark-deep-learning:1.2.0-spark2.3-s_2.11

## Image Data Ingestion
Import the ImageSchema class.

In [1]:
# Your code goes here

## Data Ingestion
Download and preprocess the [flowers data](http://download.tensorflow.org/example_images/flower_photos.tgz).

## Training and Testing Data
Split the data into training and test sets with a 70:30 ratio.

In [None]:
# your code goes here

## Deep Learning Pipeline definition
Import the required classes and define the deep learning pipeline.

In [2]:
#Your code goes here


## Train and Test Model

In [3]:
#Your code goes here

## Evaluate Model

In [4]:
# your code goes here

# References
- https://medium.com/linagora-engineering/making-image-classification-simple-with-spark-deep-learning-f654a8b876b8
- https://github.com/databricks/spark-deep-learning