### 4. Solution deployment on the cloud environment

Now that the solution works locally, it is time to deploy it at a larger scale on a real cluster of machines.

**Attention:** this work was done on macOS, so the commands described below were run exclusively in this environment.

<u>Several constraints to take into account:</u>
1. Which cloud provider to choose?
2. Which solution from this provider to use?
3. Where to store the data?
4. How to set up the tools in this new environment?


#### 4.1. Choice of cloud provider: Amazon Web Services (AWS)

**Amazon Web Services** is the best-known cloud provider and currently offers the broadest range of cloud computing services. Some of these services are well suited to this problem, which is why this provider is used.

The primary objective is to <u>rent computing power on demand</u> from AWS: enough to process the images whatever the workload, even if the volume of data increases significantly.

Additionally, paying for this computing power only when it is needed drastically reduces costs compared with renting a complete server for a fixed period of time (e.g., 1 month, 1 year).

#### 4.2. Choice of the technical solution: Elastic MapReduce (EMR)

Several solutions are available: 
1. **IAAS** solution (Infrastructure As A Service)
In this setup, AWS provides blank servers with administrator access. These servers are called EC2 instances. Simply put, this solution makes it possible to reproduce the solution implemented locally.

<u>All the tools must then be installed manually, in the following steps: </u>
- Installation of **Java**, **Spark**, etc.
- Installation of **Python** (via Anaconda, for instance)
- Installation of **Jupyter Notebook**
- Installation of **additional libraries**
- It will also be necessary to **ensure that the required libraries are installed on all machines (workers) in the cluster**

- <u> Advantages: </u>
    - Total freedom to implement the solution
    - Easy to implement from a model that runs locally on a Linux machine
- <u> Disadvantages: </u>
    - Time-consuming
    - Need to install and set up the entire solution
    - Possible technical problems when installing the tools (problems that did not exist on the local machine may appear on the EC2 server)
    - Solution that is hard to maintain over time: the tools must be kept up to date, which may mean reinstalling Spark, Java, etc.

2. **PAAS** solution (Platform As A Service)
AWS provides many different services. One of them makes it possible to rent **EC2 instances** with applications pre-installed and configured: this is the **EMR service**.

<u> The EMR service has the following characteristics: </u>
- Spark will already be installed
- Possibility to request the installation of **TensorFlow** as well as **JupyterHub**
- Possibility to specify **additional packages** to install **on all machines in the cluster** when the server is initialized

- <u> Advantages: </u>
    - Easy to implement
        - It takes very little setup to get a perfectly functional environment
    - Speed of implementation
        - Once the first setup is completed, it is very easy and very fast to recreate identical clusters, which become available as soon as the servers are instantiated (approximately 15 to 20 minutes)
    - Hardware and software solutions optimized by AWS engineers
        - It is known that the installed versions will work and that the proposed architecture is optimized
    - Stability of the solution
    - Scalable solution
        - Each new instantiation provides an up-to-date version of each package, with guaranteed compatibility with the rest of the environment
    - Safer
        - Security patches are applied automatically at each new instantiation of the EMR cluster
- <u> Disadvantages: </u>
    - Perhaps a certain lack of freedom on the version of the packages available? (It was not a problem in this project)
    - Price?


The PAAS solution is the most appropriate for this problem, so the Amazon Web Services **EMR service** is chosen. It allows an implementation that is both faster and more efficient than the IAAS solution.

#### 4.3. Choice of data storage solution: Simple Storage Service (S3)

<u>Amazon offers a very effective solution for data storage management</u>: **Amazon S3** (Amazon Simple Storage Service).

It might be tempting to store our data on the disk space allocated to the **EC2** server, but unless we take steps to save it on another medium, <u>the data will be lost</u> when the server is terminated (for cost reasons, the server is terminated when it is not in use). In other words, using the disk space of the EC2 server means having to come up with a way to save the data before the server is terminated. In addition, we would be exposed to problems (slowdowns, malfunctions) if our data were to **saturate** the available space on our servers.

<u>Using **Amazon S3** overcomes all these problems</u>. Available disk space is **unlimited**, and it is **independent of our EC2 servers**. Access to data is **very fast** because we stay within the AWS environment and we take care to <u>choose the same region for our **EC2** servers and our **S3** bucket</u>.

Additionally, as we will see, <u>it is possible to access data on **S3** in the same way as **accessing data on a local disk**</u>. We will simply use a **PATH in the format s3://...**.
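
To make the `s3://...` path format concrete, here is a minimal sketch that splits such a path into its bucket and object key using only the Python standard library. The helper name `split_s3_uri` and the file name `example.jpg` are hypothetical and only serve the illustration.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/some/key' into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    # The bucket is the network location; the key is the path without its leading slash.
    return parsed.netloc, parsed.path.lstrip("/")


# Hypothetical object path, used only for illustration.
bucket, key = split_s3_uri("s3://p8-data/Test/Watermelon/example.jpg")
print(bucket)  # p8-data
print(key)     # Test/Watermelon/example.jpg
```

Spark performs this resolution itself when it is given an `s3://` path, so the helper is only useful to understand how the path is structured.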

#### 4.4. Set up the working environment

The first step is to install and configure the [**AWS CLI**](https://aws.amazon.com/fr/cli/), the AWS **command line interface**. It allows us to **interact with different AWS services**, such as **S3**.

To be able to use the **AWS CLI**, you must configure it by first creating a user to whom you grant the permissions you need. In this project, the user must have at least full control over the S3 service.

<u>Users and their rights are managed via the AWS **IAM**</u> service (Identity and Access Management).

Once the user has been created and its permissions configured, we create a **pair of keys** which will allow us to **connect without having to systematically enter our login/password**.

We must also configure **SSH access** to our future EC2 servers. This also relies on a key pair, which frees us from having to authenticate "by hand" at each connection.

All these configuration steps are described in detail in the project course: [Perform distributed calculations on massive data / Discover Amazon Web Services](https://openclassrooms.com/fr/courses/4297166-realisez-des-calculs-distribues-sur-des-donnees-massives/4308686-decouvrez-amazon-web-services#/id/r-4355822)
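
As an illustration of what the key pair configuration produces, the sketch below writes a credentials file in the INI layout that the AWS CLI reads from `~/.aws/credentials`. It is written to a temporary directory so that no real configuration is touched, and the key values are placeholders, not real credentials. The helper name `write_credentials` is hypothetical.

```python
import configparser
import tempfile
from pathlib import Path


def write_credentials(path, access_key_id, secret_access_key, profile="default"):
    """Write an AWS-style credentials file containing one profile."""
    config = configparser.ConfigParser()
    config[profile] = {
        "aws_access_key_id": access_key_id,
        "aws_secret_access_key": secret_access_key,
    }
    with open(path, "w") as handle:
        config.write(handle)


# Temporary location (assumption for the demo): the real file lives in ~/.aws/credentials.
demo_path = Path(tempfile.mkdtemp()) / "credentials"
write_credentials(demo_path, "YOUR_ACCESS_KEY_ID", "YOUR_SECRET_ACCESS_KEY")
print(demo_path.read_text())
```

In practice the `aws configure` command creates this file interactively, so writing it by hand is rarely necessary.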

#### 4.5. Upload data to S3

Our tools are configured. We now need to upload our working data to Amazon S3.

Here too the steps are described precisely in the course [Carry out distributed calculations on massive data / Store data on S3](https://openclassrooms.com/fr/courses/4297166-realisez-des-calculs-distribues-sur-des-donnees-massives/4308691-stockez-des-donnees-sur-s3)

I decide to only upload the data contained in the **Test** folder of the [project dataset](https://www.kaggle.com/moltean/fruits/download)


The first step is to **create a bucket on S3** into which we will upload the project data:
- **aws s3 mb s3://p8-data**

We check that the bucket has been created
- **aws s3 ls**
  - If the bucket name is displayed then it has been correctly created.

We then copy the contents of the "**Test**" folder into a "**Test**" directory on our "**p8-data**" bucket:
1. We place ourselves inside the **Test** directory
2. **aws s3 sync . s3://p8-data/Test**

The **sync** command is useful for synchronizing two directories.

<u>Our project data is now available on Amazon S3</u>.
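
The upload sequence can also be scripted. The sketch below builds the three AWS CLI calls (`aws s3 mb`, `aws s3 ls`, `aws s3 sync`) as argument lists and, by default, only prints them. The helper names and the `dry_run` flag are assumptions made for this example; actually running the commands requires a configured AWS CLI.

```python
import subprocess


def s3_upload_commands(bucket, local_dir, prefix):
    """Return the AWS CLI calls as argument lists, in the order they are run."""
    target = f"s3://{bucket}/{prefix}"
    return [
        ["aws", "s3", "mb", f"s3://{bucket}"],      # create the bucket
        ["aws", "s3", "ls"],                         # list buckets to confirm creation
        ["aws", "s3", "sync", local_dir, target],    # copy the folder content
    ]


def run_all(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for command in commands:
        print(" ".join(command))
        if not dry_run:
            subprocess.run(command, check=True)


# Dry run from inside the Test directory: nothing is sent to AWS.
run_all(s3_upload_commands("p8-data", ".", "Test"))
```

Passing `dry_run=False` would execute the same commands in order and stop at the first failure, because of `check=True`.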

#### 4.6. Set up EMR server

Once again, the course [Perform distributed calculations on massive data / Deploy a distributed calculation cluster](https://openclassrooms.com/fr/courses/4297166-realisez-des-calculs-distribues-sur-des-donnees-massives/4308696-deployez-un-cluster-de-calculs-distribues) details the essential steps to launch a cluster with **EMR**.

<u>I will detail here the specific steps that allow us to configure the server according to our needs</u>:

1. Click Create Cluster
![Create a cluster](img/EMR_creer.png)
2. Click Access advanced options
![Create a cluster](img/EMR_options_avancees.png)

##### 4.6.1. Step 1: Software and steps
##### 4.6.1.1. Software setup 

<u>Select the packages we will need as in the screenshot</u>:
1. We select the latest version of **EMR**, which is version **6.3.0** at the time of writing this document
2. We obviously check **Hadoop** and **Spark** which will be pre-installed in their most recent version
3. We will also need **TensorFlow** to import our model and carry out **transfer learning**
4. Finally, we will work with a **Jupyter notebook** via the **JupyterHub** application - as we will see in a moment, we will <u>configure the application so that the notebooks</u>, like the rest of our working data, <u>are saved directly on S3</u>.
![Create a cluster](img/EMR_configuration_logiciels.png)

##### 4.6.1.2. Change software parameters

<u>Configure the persistence of notebooks created and opened via JupyterHub</u>:
- At this stage we can make specific configuration requests for our applications. The objective is, as with the rest of our working data, to avoid all the problems mentioned above: <u>we will save and open the notebooks</u> not on the disk space of the EC2 instance (as would be the case in the default configuration of JupyterHub) but <u>directly on **Amazon S3**</u>.
- <u>Two solutions are possible to achieve this</u>:
  1. Create a **JSON configuration file** which we **upload to S3** and then indicate the path to the JSON file
  2. Enter the configuration directly in JSON format
 
I personally created a JSON file when creating my first EMR instance. When a cluster is later cloned to easily recreate an identical one, the content of the JSON file is copied directly into the configuration, as in the screenshot below.

<u>Here is the content of my JSON file</u>:
[{"classification":"jupyter-s3-conf","properties":{"s3.persistence.bucket":"p8-data","s3.persistence.enabled":"true"}}]
![Change software settings](img/EMR_parametres_logiciel.png)
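
For readers who prefer to generate the configuration rather than type it, the following sketch produces the same JSON with the standard `json` module. The function name `jupyter_s3_config` is hypothetical; the classification and property names are the ones shown in the JSON file above.

```python
import json


def jupyter_s3_config(bucket):
    """Build the EMR configuration that stores JupyterHub notebooks in an S3 bucket."""
    return [
        {
            "classification": "jupyter-s3-conf",
            "properties": {
                "s3.persistence.bucket": bucket,
                # EMR expects the value as a string, as in the original file.
                "s3.persistence.enabled": "true",
            },
        }
    ]


config_text = json.dumps(jupyter_s3_config("p8-data"), indent=2)
print(config_text)
```

The resulting text can either be pasted in the configuration field or saved to a file and uploaded to S3, matching the two options listed above.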

##### 4.6.2. Step 2: Hardware
At this step, leave the default choices. <u>The important thing here is the selection of our instances</u>:

1. I choose **M5** type instances which are **balanced type instances**
2. I choose the **xlarge** type which is the **least expensive instance available**
[More information about Amazon EC2 M5 instances](https://aws.amazon.com/fr/ec2/instance-types/m5/)
3. I select **1 Master instance** (the driver) and **2 Core instances** (the workers) for a **total of 3 EC2 instances**.
![Choice of hardware](img/EMR_materiel.png)

##### 4.6.3. Step 3: General cluster settings
##### 4.6.3.1. General options

<u>The first thing to do is to give the cluster a name</u>:

![Cluster name](img/EMR_nom_cluster.png)

##### 4.6.3.2. Bootstrapping actions

At this step we **choose the missing packages to install**, which will be needed to run our notebook. <u>The advantage of carrying out this step now is that the packages will be installed on all of the machines in the cluster</u>.

The procedure for creating the **bootstrap** file which contains all the instructions for installing all the packages we will need is explained in the course [Perform distributed calculations on massive data / Bootstrapping](https://openclassrooms.com/fr/courses/4297166-realisez-des-calculs-distribues-sur-des-donnees-massives/4308696-deployez-un-cluster-de-calculs-distribues#/id/r-4356490)

We therefore create a file named "**bootstrap-emr.sh**" which we <u>upload to S3</u> (I put it at the root of my **bucket**) and we add it as shown in the screenshot below:
![Bootstrapping actions](img/EMR_amorcage.png)

Here is the content of the **bootstrap-emr.sh** file. As we can see it is simply a "**pip install**" command to **install the missing libraries** as done locally. Once again, <u>it is necessary to carry out these actions at this step</u> so that <u>the packages are installed on all the machines in the cluster</u> and not just on the driver, as would be the case if we executed these commands directly in the JupyterHub notebook or in the EMR console (connected to the driver).
![Bootstrap file](img/EMR_bootstrap.png)

**setuptools** and **pip** are updated to avoid a problem with the installation of the **pyarrow** package. **Pandas** received a major update (1.3.0) less than a week before this notebook was written, and the new version of **Pandas** requires a more recent version of **Numpy** than the one installed by default (1.16.5) when initializing **EC2** instances. <u>It does not seem possible to impose a version of Numpy other than the one installed by default</u>, even when forcing the installation of a recent version of **Numpy** (at least not simply or intuitively). Because the update is very recent, <u>the version of **Numpy** is not yet updated on **EC2**</u>, but we can expect this to happen quickly, after which it will no longer be necessary to impose a specific version of **Pandas**. In the meantime, I request <u>the installation of the previous version of **Pandas (1.2.5)**</u>.
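
Since the bootstrap file is only visible in the screenshot, the sketch below is an assumption about its general shape, not a copy of it: it renders a small shell script that upgrades **setuptools** and **pip**, then installs packages with an optional pinned version. The `sudo python3 -m pip install` form and the helper name `render_bootstrap` are choices made for this example, and only the packages named in the text (Pandas 1.2.5, pyarrow) are listed.

```python
def render_bootstrap(pinned, upgraded=("setuptools", "pip")):
    """Render a bootstrap shell script as text.

    pinned maps a package name to a version string, or to None for "latest".
    """
    lines = ["#!/bin/bash", "set -e"]
    # Upgrade the packaging tools first, as described in the text.
    lines.append("sudo python3 -m pip install -U " + " ".join(upgraded))
    specs = [f"{name}=={version}" if version else name
             for name, version in pinned.items()]
    lines.append("sudo python3 -m pip install " + " ".join(specs))
    return "\n".join(lines) + "\n"


script = render_bootstrap({"pandas": "1.2.5", "pyarrow": None})
print(script)
```

Because EMR runs the bootstrap action on every node, a script of this kind installs the same versions on the driver and on the workers.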

##### 4.6.4. Step 4: Security
##### 4.6.4.1. Security options

At this step we select the **EC2 key pair** created earlier. It will allow us to connect via **ssh** to our **EC2 instances** without having to enter our login/password. We leave the other settings at their default values.
![EMR security](img/EMR_securite.png)

#### 4.7. Server instantiation

All we have to do now is wait for the server to be ready. <br />
This step may take around **10 minutes**.

<u>Several stages follow one another; the screenshot shows the last step of the **EMR cluster** instantiation</u>:

![Instantiation](img/EMR_instanciation.png)

<u>When the status displays in green: "**Waiting**" this means that the instantiation was successful and that our server is ready to be used</u>.

#### 4.8. Create the SSH tunnel to the EC2 instance (Primary)

##### 4.8.1. Create permissions on incoming connections

<u>We now want to be able to access our applications</u>:
  - **JupyterHub** for running our notebook
  - **Spark history server** for tracking the execution of our script's tasks when it is launched
 
However, <u>these applications are only accessible from the driver's local network</u>, and to access them we must **create an SSH tunnel to the driver**.

By default, this driver is located behind a firewall which blocks SSH access. <u>To open port 22 which corresponds to the port on which the SSH server listens, you must modify the **EC2 security group of the driver**</u>.

This step is described in the course [Perform distributed calculations on massive data / Launching an application from the driver](https://openclassrooms.com/fr/courses/4297166-realisez-des-calculs-distribues-sur-des-donnees-massives/4308696-deployez-un-cluster-de-calculs-distribues#/id/r-4356512):

*We will need to connect via SSH to the driver of our cluster. By default, this driver is located behind a firewall which blocks SSH access. To open port 22 which corresponds to the port on which the SSH server listens, you must modify the EC2 security group of the driver. On the EC2 console page, in the Network and Security tab, click Security Groups. You will need to modify the security group of ElasticMapReduce-Master. In the "Incoming" tab, add an SSH rule whose source is "Anywhere" (or "My IP" if you have a fixed IP address).*

![Configuring inbound port permissions for ssh](img/EMR_config_ssh_01.png)

<u>Once this step is completed you should have a configuration similar to mine</u>:

![ssh configuration completed](img/EMR_config_ssh_02.png)
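
To clarify the difference between the two possible sources of the SSH rule, here is a small sketch using the standard `ipaddress` module. "Anywhere" corresponds to the IPv4 range `0.0.0.0/0`, while "My IP" corresponds to a single address written as a `/32` range. The addresses below come from ranges reserved for documentation and are purely hypothetical.

```python
import ipaddress


def ssh_rule_allows(source_cidr, client_ip):
    """Return True if client_ip falls inside the rule's source range."""
    network = ipaddress.ip_network(source_cidr)
    return ipaddress.ip_address(client_ip) in network


anywhere = "0.0.0.0/0"          # source "Anywhere": every IPv4 address
my_ip_only = "203.0.113.7/32"   # source "My IP": one hypothetical fixed address

print(ssh_rule_allows(anywhere, "198.51.100.20"))    # True
print(ssh_rule_allows(my_ip_only, "203.0.113.7"))    # True
print(ssh_rule_allows(my_ip_only, "198.51.100.20"))  # False
```

Restricting the source to a single address is the safer choice when a fixed IP address is available, since port 22 is then closed to everyone else.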

##### 4.8.2. Create the ssh tunnel to the Driver
<u>We then retrieve the command provided by Amazon to **establish the SSH tunnel**</u>:

![Retrieve the command to establish the ssh tunnel](img/EMR_tunnel_ssh.png)

<u>In my case, the command does not work as is</u> and I had to **adapt it to my configuration**. The **ssh key** is located in a "**.ssh**" folder, itself located in my **home directory**, which on Unix-like systems such as macOS and Linux is denoted by a tilde "**~**".

Having followed the course [Perform distributed calculations on massive data / Launch an application from the driver](https://openclassrooms.com/fr/courses/4297166-realisez-des-calculs-distribues-sur-des-donnees-massives) I chose to use port **5555** instead of **8157**, even though the choice is not very important. I also encountered a <u>compatibility problem</u> with the "**-N**" argument (the list of arguments and their meanings is available [here](https://explainshell.com/explain?cmd=ssh+-L+-N+-f+-l+-D)). I decided to simply remove it.

<u>Finally, I use the following command in a terminal to establish my ssh tunnel (only the URL changes from one instance to another)</u>.

<u>We enter "**yes**" to validate the connection and if the connection is established we obtain the following result</u>.

![Creating the SSH tunnel](img/EMR_connexion_ssh_01.png)

We have **correctly established the ssh tunnel with the driver** on port "5555".
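
As a hedged illustration of the adaptations described in this section (key stored under `~/.ssh`, port 5555, no `-N` argument), the sketch below assembles an `ssh` command with dynamic port forwarding (`-D`). The key file name, the host name and the `hadoop` user are assumptions made for this example; the exact command to use is the one shown by Amazon in the EMR console.

```python
import os
import shlex


def tunnel_command(key_file, host, local_port=5555, user="hadoop"):
    """Build an ssh command that opens a SOCKS proxy on local_port."""
    return [
        "ssh",
        "-i", os.path.expanduser(key_file),  # expand "~" to the home directory
        "-D", str(local_port),               # dynamic (SOCKS) port forwarding
        f"{user}@{host}",
    ]


# Hypothetical key name and placeholder host: replace with your own values.
command = tunnel_command("~/.ssh/my-key.pem", "ec2-xx-xx-xx-xx.compute.amazonaws.com")
print(shlex.join(command))
```

Only the host part changes from one cluster instance to another, so keeping the rest in a helper avoids retyping the command each time.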

##### 4.8.3. FoxyProxy setup

A final step is necessary to access our applications, by asking our browser to use the ssh tunnel. I use **FoxyProxy** for this.
[Again, you can use the course to configure it](https://openclassrooms.com/fr/courses/4297166-realisez-des-calculs-distribues-sur-des-donnees-massives/4308701-realisez-la-maintenance-dun-cluster#/id/r-4356554).

Otherwise, open the **FoxyProxy** configuration and <u>click on **Add**</u> at the top left then fill in the elements as in the screenshot below:
![FoxyProxy configuration step 1](img/EMR_foxyproxy_config_01.png)

##### 4.8.4. Access to EMR server applications via SSH tunnel

<u>We activate the **ssh tunnel** as seen previously then we ask our browser to use it with **FoxyProxy**</u>:

![FoxyProxy activation](img/EMR_foxyproxy_activation.png)
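
If the applications do not load in the browser, a first check is whether the tunnel is actually listening on the local port. The following sketch, which assumes the port 5555 chosen earlier, tries to open a TCP connection to it. The helper name `is_port_open` is hypothetical.

```python
import socket


def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timeout, or unreachable host.
        return False


# True only while the ssh tunnel is running on this machine.
print(is_port_open("localhost", 5555))
```

A result of `False` usually means the ssh session has been closed and the tunnel must be re-established before FoxyProxy can use it.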

#### 4.9. Connect to JupyterHub notebook

To connect to **JupyterHub** in order to run our **notebook**, you must start by <u>clicking on the **JupyterHub**</u> application, which has appeared now that we have configured the **ssh tunnel** and **foxyproxy** in our browser (refresh the page if this is not the case).

![Starting JupyterHub](img/EMR_jupyterhub_connexion_01.png)

We go through any security warnings then arrive at a login page.
    
<u>We connect with the default information</u>:
  - <u>login</u>: **jovyan**
  - <u>password</u>: **jupyter**

![Connecting to JupyterHub](img/EMR_jupyterhub_connexion_02.png)

We then arrive at a blank notebook folder. Simply create one by clicking on "**New**" at the top right.

![Listing and creating notebooks](img/EMR_jupyterhub_creer_notebooks.png)

It is also possible to <u>upload one directly to our **S3 bucket**</u>.

Thanks to the <u>**persistence** configured at cluster instantiation, we are currently in the tree of our **S3 bucket**</u>

![Notebooks stored on S3](img/EMR_jupyterhub_S3.png)

I decide to **import a notebook already written locally directly to S3** and I open it from **the JupyterHub interface**.

#### 4.10. Code execution

I decide to run this part of the code from **JupyterHub hosted on our EMR cluster**. To avoid unnecessarily burdening the explanations in the **notebook**, I will not re-explain the common steps that we have already seen in the first part where we executed the code locally.

<u>Before you begin</u>, you must make sure to use the **pyspark kernel**.

**Using this kernel, a spark session is created at the execution of the first cell**. It is therefore no longer necessary to execute the code **"spark = (SparkSession ..."** as when running our notebook locally.

##### 4.10.1. Start Spark session

In [1]:
# Running this cell will start Spark application

Starting Spark application


| ID | YARN Application ID | Kind | State | Spark UI | Driver log | User | Current session? |
|---|---|---|---|---|---|---|---|
| 0 | application_1707907462196_0001 | pyspark | idle | Link | Link | | ✔ |


SparkSession available as 'spark'.


<u> Display current session information and links to Spark: </u>

In [2]:
%%info

| ID | YARN Application ID | Kind | State | Spark UI | Driver log | User | Current session? |
|---|---|---|---|---|---|---|---|
| 0 | application_1707907462196_0001 | pyspark | idle | Link | Link | | ✔ |


##### 4.10.2. Install additional packages

The additional packages were installed via the **bootstrap** step when instantiating the server. 

Check the bootstrap file: bootstrap-emr.sh

##### 4.10.3. Import libraries

In [3]:
# General 
import os
import io

# Data handling
import pandas   as pd
import numpy    as np

# Image processing
import tensorflow   as tf
from PIL            import Image
from tensorflow.keras.applications.mobilenet_v2 import MobileNetV2, preprocess_input
from tensorflow.keras.preprocessing.image       import img_to_array
from tensorflow.keras                           import Model

# Big data
from pyspark.sql            import SparkSession
from pyspark.sql.functions  import col, pandas_udf, PandasUDFType, element_at, split
from pyspark.ml.feature     import PCA
from pyspark.sql.functions  import udf
from pyspark.ml.linalg      import Vectors, VectorUDT

##### 4.10.4. Set PATH to load images and save results

We access the **data on S3** directly, as if they were **stored locally**.

In [4]:
PATH = 's3://openclassrooms-p8-fruits-data'
PATH_Data = PATH+'/data/test'
PATH_Result = PATH+'/data/results'
print('PATH:        '+\
      PATH+'\nPATH_Data:   '+\
      PATH_Data+'\nPATH_Result: '+PATH_Result)

PATH:        s3://openclassrooms-p8-fruits-data
PATH_Data:   s3://openclassrooms-p8-fruits-data/data/test
PATH_Result: s3://openclassrooms-p8-fruits-data/data/results

##### 4.10.5. Data processing

##### 4.10.5.1. Load data

In [5]:
images = spark.read.format("binaryFile") \
  .option("pathGlobFilter", "*.jpg") \
  .option("recursiveFileLookup", "true") \
  .load(PATH_Data)

In [6]:
images.show(5)

+--------------------+-------------------+------+--------------------+
|                path|   modificationTime|length|             content|
+--------------------+-------------------+------+--------------------+
|s3://openclassroo...|2024-02-14 09:41:46|  7353|[FF D8 FF E0 00 1...|
|s3://openclassroo...|2024-02-14 09:41:46|  7350|[FF D8 FF E0 00 1...|
|s3://openclassroo...|2024-02-14 09:41:46|  7349|[FF D8 FF E0 00 1...|
|s3://openclassroo...|2024-02-14 09:41:46|  7348|[FF D8 FF E0 00 1...|
|s3://openclassroo...|2024-02-14 09:41:47|  7328|[FF D8 FF E0 00 1...|
+--------------------+-------------------+------+--------------------+
only showing top 5 rows

All columns are kept, and a **label** column is added, extracted from the image path (the name of the folder that contains the image).

In [7]:
# Add a 'label' column: the second-to-last component of the path (the parent folder name)
images = images.withColumn('label', element_at(split(images['path'], '/'), -2))
# printSchema() and show() print their output themselves and return None
images.printSchema()
images.select('path', 'label').show(5, False)

root
 |-- path: string (nullable = true)
 |-- modificationTime: timestamp (nullable = true)
 |-- length: long (nullable = true)
 |-- content: binary (nullable = true)
 |-- label: string (nullable = true)

+---------------------------------------------------------------------+----------+
|path                                                                 |label     |
+---------------------------------------------------------------------+----------+
|s3://openclassrooms-p8-fruits-data/data/test/Watermelon/r_106_100.jpg|Watermelon|
|s3://openclassrooms-p8-fruits-data/data/test/Watermelon/r_109_100.jpg|Watermelon|
|s3://openclassrooms-p8-fruits-data/data/test/Watermelon/r_108_100.jpg|Watermelon|
|s3://openclassrooms-p8-fruits-data/data/test/Watermelon/r_107_100.jpg|Watermelon|
|s3://openclassrooms-p8-fruits-data/data/test/Watermelon/r_95_100.jpg |Watermelon|
+---------------------------------------------------------------------+----------+
only showing top 5 rows
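
The label extraction performed by `element_at(split(path, '/'), -2)` can be reproduced in plain Python to see exactly which part of the path is kept. This is only an equivalent sketch for a single string, not the code run on the cluster, and the function name `label_from_path` is hypothetical.

```python
def label_from_path(path):
    """Return the second-to-last '/'-separated component of the path."""
    return path.split("/")[-2]


path = "s3://openclassrooms-p8-fruits-data/data/test/Watermelon/r_106_100.jpg"
print(label_from_path(path))  # Watermelon
```

The approach relies on the dataset layout, where each image is stored in a folder named after its class.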

##### 4.10.5.2. Model preparation

In [8]:
model = MobileNetV2(weights='imagenet',
                    include_top=True,
                    input_shape=(224, 224, 3))

Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/mobilenet_v2/mobilenet_v2_weights_tf_dim_ordering_tf_kernels_1.0_224.h5

In [9]:
new_model = Model(inputs=model.input,
                  outputs=model.layers[-2].output)

In [10]:
brodcast_weights = sc.broadcast(new_model.get_weights())

In [11]:
new_model.summary()

Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            [(None, 224, 224, 3) 0                                            
__________________________________________________________________________________________________
Conv1 (Conv2D)                  (None, 112, 112, 32) 864         input_1[0][0]                    
__________________________________________________________________________________________________
bn_Conv1 (BatchNormalization)   (None, 112, 112, 32) 128         Conv1[0][0]                      
__________________________________________________________________________________________________
Conv1_relu (ReLU)               (None, 112, 112, 32) 0           bn_Conv1[0][0]                   
______________________________________________________________________________________________

In [12]:
def model_fn():
    """
    Returns a MobileNetV2 model with top layer removed 
    and broadcasted pretrained weights.
    """
    model = MobileNetV2(weights='imagenet',
                        include_top=True,
                        input_shape=(224, 224, 3))
    for layer in model.layers:
        layer.trainable = False
    new_model = Model(inputs=model.input,
                  outputs=model.layers[-2].output)
    new_model.set_weights(brodcast_weights.value)
    return new_model

##### 4.10.5.3. Outline of the process of loading images and the application of their featurization through the use of Pandas UDF

In [13]:
def preprocess(content):
    """
    Preprocesses raw image bytes for prediction.
    """
    img = Image.open(io.BytesIO(content)).resize([224, 224])
    arr = img_to_array(img)
    return preprocess_input(arr)

def featurize_series(model, content_series):
    """
    Featurize a pd.Series of raw images using the input model.
    :return: a pd.Series of image features
    """
    input = np.stack(content_series.map(preprocess))
    preds = model.predict(input)
    # For some layers, output features will be multi-dimensional tensors.
    # We flatten the feature tensors to vectors for easier storage in Spark DataFrames.
    output = [p.flatten() for p in preds]
    return pd.Series(output)

@pandas_udf('array<float>', PandasUDFType.SCALAR_ITER)
def featurize_udf(content_series_iter):
    '''
    This method is a Scalar Iterator pandas UDF wrapping our featurization function.
    The decorator specifies that this returns a Spark DataFrame column of type ArrayType(FloatType).

    :param content_series_iter: This argument is an iterator over batches of data, where each batch
                              is a pandas Series of image data.
    '''
    # With Scalar Iterator pandas UDFs, we can load the model once and then re-use it
    # for multiple data batches.  This amortizes the overhead of loading big models.
    model = model_fn()
    for content_series in content_series_iter:
        yield featurize_series(model, content_series)
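
The benefit of the Scalar Iterator form is easier to see without Spark or TensorFlow. The sketch below uses a stand-in `load_model` function (hypothetical, it simply counts its calls and returns a toy "model") to show that the model is loaded once for the whole iterator and then reused for every batch.

```python
load_count = 0


def load_model():
    """Stand-in for an expensive model load; counts how often it is called."""
    global load_count
    load_count += 1
    # Toy "model": returns the length of each item in a batch.
    return lambda batch: [len(item) for item in batch]


def featurize_batches(batches):
    """Generator mirroring the structure of a Scalar Iterator pandas UDF."""
    model = load_model()  # loaded once per iterator, not once per batch
    for batch in batches:
        yield model(batch)


results = list(featurize_batches([[b"ab", b"cde"], [b"f"]]))
print(results)     # [[2, 3], [1]]
print(load_count)  # 1
```

With a per-batch function instead of a generator, `load_model` would be called once per batch, which is the overhead the iterator form avoids.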



##### 4.10.5.4. Run feature extraction actions

In [14]:
# spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", "1024")

In [15]:
features_df = images.repartition(24).select(col("path"),
                                            col("label"),
                                            featurize_udf("content").alias("features")
                                           )

In [16]:
print(PATH_Result)

s3://openclassrooms-p8-fruits-data/data/results

In [17]:
features_df.write.mode("overwrite").parquet(PATH_Result)

##### 4.10.5.5. Apply dimension reduction to the test data

In [18]:
# Convert the features column to a dense vector
dense_vector = udf(lambda a: Vectors.dense(a), VectorUDT())
dense_df = features_df.select("path", "label", dense_vector("features").alias("dense_features"))

In [19]:
# Apply PCA to the dense vector
pca = PCA(k=50, inputCol="dense_features", outputCol="pca_features")
model = pca.fit(dense_df)
result = model.transform(dense_df).select("path", "label", "pca_features")

In [20]:
# Write the result after PCA to a parquet file
result.write.mode("overwrite").parquet(PATH_Result + "/pca_results")

In [21]:
print(PATH_Result + "/pca_results")

s3://openclassrooms-p8-fruits-data/data/results/pca_results

##### 4.10.5.6. Load saved data and validate results

<u>Let's load into a Spark DataFrame the data that have just been saved: </u>

In [22]:
#df = pd.read_parquet(PATH_Result, engine='pyarrow')
df = spark.read.parquet(PATH_Result)

In [23]:
df.head()

Row(path='s3://openclassrooms-p8-fruits-data/data/test/Watermelon/r_87_100.jpg', label='Watermelon', features=[0.029508255422115326, 0.06816301494836807, 0.0, 0.016227003186941147, 0.557296633720398, 0.0, 0.7705082893371582, 0.38912272453308105, 0.0, 0.0, 0.6036859154701233, 0.16412188112735748, 0.00229330244474113, 0.08688440173864365, 0.056950509548187256, 0.0, 0.0, 0.007440071552991867, 0.0, 0.4715491831302643, 0.0, 0.0, 0.0, 0.0, 0.05052385851740837, 1.2084769010543823, 2.2853941917419434, 0.0, 0.0, 1.5637874603271484, 0.17250505089759827, 0.0, 0.15200825035572052, 0.860082745552063, 0.0, 0.017210079357028008, 0.6624481081962585, 3.016568660736084, 0.0, 0.0, 0.006589928641915321, 0.0, 0.10211249440908432, 0.0, 0.029085317626595497, 0.0, 0.42747583985328674, 0.23558345437049866, 1.3489508628845215, 0.20422177016735077, 0.0, 0.0, 0.17094318568706512, 0.0, 0.004170811735093594, 2.3509719371795654, 0.0, 0.6281321048736572, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.5626542568206787, 0.52347391843

In [24]:
df.columns

['path', 'label', 'features']

In [25]:
df.count()

22688

In [26]:
# printSchema() prints the schema itself and returns None, so no print() is needed
df.printSchema()

root
 |-- path: string (nullable = true)
 |-- label: string (nullable = true)
 |-- features: array (nullable = true)
 |    |-- element: float (containsNull = true)

In [27]:
#df_pca = pd.read_parquet(PATH_Result + '/pca_results', engine='pyarrow')
df_pca=spark.read.parquet(PATH_Result + "/pca_results")

In [28]:
df_pca.head()

Row(path='s3://openclassrooms-p8-fruits-data/data/test/Watermelon/r_67_100.jpg', label='Watermelon', pca_features=DenseVector([-3.451, 5.4159, -5.3283, -4.8208, 5.9778, 5.4517, 2.1771, 0.2174, -9.5208, 0.0079, -2.0946, 2.9029, -1.7029, 6.3195, 0.1022, -0.0209, -4.8671, 0.4627, -0.6855, 5.3621, 0.3786, -0.0948, 6.3297, -1.4221, 0.9524, 0.6563, 1.3332, -3.4886, 1.2113, 1.5237, 2.8523, 0.6011, 1.84, -0.5275, -1.2889, -3.8125, -0.4837, -1.5041, -0.4523, -2.2355, 0.2967, 1.805, -0.3623, -0.2969, -3.2374, 0.2936, 1.1547, -1.9716, -0.8415, -2.4493]))

In [29]:
# printSchema() prints the schema itself and returns None, so no print() is needed
df_pca.printSchema()

root
 |-- path: string (nullable = true)
 |-- label: string (nullable = true)
 |-- pca_features: vector (nullable = true)

In [30]:
df_pca.columns

['path', 'label', 'pca_features']

In [31]:
df_pca.count()

22688
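
The checks made cell by cell above (column names, row counts) could be gathered into a single helper. The following is only a minimal sketch and is not part of the original notebook: the name `check_saved_outputs` and its parameters are hypothetical, and it relies solely on the `count()` method and the `columns` attribute that Spark DataFrames expose.

```python
def check_saved_outputs(df, df_pca, expected_rows,
                        feature_cols=("path", "label", "features"),
                        pca_cols=("path", "label", "pca_features")):
    """Return a list of problems found; an empty list means every check passed.

    Hypothetical helper: `df` and `df_pca` only need `.count()` and `.columns`,
    which Spark DataFrames provide.
    """
    problems = []

    # Column names must match what was written to parquet.
    if list(df.columns) != list(feature_cols):
        problems.append(f"unexpected feature columns: {list(df.columns)}")
    if list(df_pca.columns) != list(pca_cols):
        problems.append(f"unexpected PCA columns: {list(df_pca.columns)}")

    # Both datasets must contain one row per processed image.
    n_features = df.count()
    n_pca = df_pca.count()
    if n_features != expected_rows:
        problems.append(f"expected {expected_rows} feature rows, found {n_features}")
    if n_pca != n_features:
        problems.append(f"PCA rows ({n_pca}) differ from feature rows ({n_features})")

    return problems
```

In this notebook it would be called as `check_saved_outputs(df, df_pca, expected_rows=22688)`, where 22688 is the row count observed above.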

In [None]:
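# A Spark DataFrame has no to_csv(); convert to pandas first (assumes the data fit in driver memory and that pandas can write to S3, e.g. via s3fs)
df_pca.toPandas().to_csv(PATH_Result + "/pca_results/pca_matrix.csv", sep=";")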

<u>We can also see the files in "**parquet**" format in the **S3 bucket**</u>:

![Results displayed on S3](img/S3_Results.png)

#### 4.11. Track task progress with Spark History Server

It is possible to follow the progress of running tasks with the **Spark history server**.

![Access to the Spark history server](img/EMR_serveur_historique_spark_acces.png)

**It is also possible to go back and study completed tasks, in order to debug and optimize future ones.**

<u>While the command "**features_df.write.mode("overwrite").parquet(PATH_Result)**" <br />
was running, we could observe its progress</u>:

![Script execution progress](img/EMR_jupyterhub_avancement.png)

<u>The **Spark history server** gives us a much more detailed view of the execution of the different tasks on the different machines in the cluster</u>:

![Spark task tracking](img/EMR_SHSpark_01.png)

We can also see that our compute cluster took a little **less than 10 minutes** to process the **22,688 images**.

![Processing time](img/EMR_SHSpark_02.png)
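
As a rough worked example of what this figure implies, the sketch below derives a lower bound on throughput from the two numbers quoted above (22,688 images, at most 10 minutes), which comes to roughly 37.8 images per second. The function names are hypothetical, and the extrapolation to a larger workload assumes that throughput stays constant on the same cluster, which is an assumption and not something measured in this project.

```python
# Figures taken from the text above; 10 minutes is an upper bound on the run time.
N_IMAGES = 22_688
MAX_DURATION_S = 10 * 60


def min_throughput(n_images, max_duration_s):
    """Lower bound on throughput (images/second) given an upper bound on duration."""
    return n_images / max_duration_s


def estimated_duration_min(n_images, throughput):
    """Estimated duration in minutes, ASSUMING throughput stays constant."""
    return n_images / throughput / 60


throughput = min_throughput(N_IMAGES, MAX_DURATION_S)
print(f"At least {throughput:.1f} images/s")

# Hypothetical workload ten times larger, same cluster, same throughput.
bigger = estimated_duration_min(10 * N_IMAGES, throughput)
print(f"10x the images: about {bigger:.0f} min at most")
```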


#### 4.12. Termination of the EMR instance

Our work is now complete. The EMR machine cluster is **billed on demand**, and we continue to be billed even when the machines are idle. To **optimize billing**, we now need to **terminate the cluster**.

<u>I perform this operation from the AWS interface</u>:

1. Start by **disabling the SSH tunnel in FoxyProxy** to avoid **timeout** issues.
![Disabling FoxyProxy](img/EMR_foxyproxy_desactivation.png)
2. Click on “**Terminate**”
![Click on Terminate](img/EMR_resiliation_01.png)
3. Confirm termination
![Confirm termination](img/EMR_resiliation_02.png)
4. Termination takes approximately **1 minute**
![Termination in progress](img/EMR_resiliation_03.png)
5. Termination is complete
![Termination complete](img/EMR_resiliation_04.png)
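
The termination itself (steps 2 to 5) can presumably also be scripted rather than done by hand. The following is a minimal sketch under stated assumptions, not the method used in this project: it assumes an EMR client such as the one returned by `boto3.client("emr")` and uses its `list_clusters` and `terminate_job_flows` calls. The function name `terminate_waiting_clusters` and the `dry_run` parameter are hypothetical. The client is passed in as an argument so that the function can be tried with a stand-in object first.

```python
def terminate_waiting_clusters(emr_client, dry_run=True):
    """Terminate every EMR cluster that is idle (state WAITING).

    `emr_client` is expected to behave like boto3.client("emr").
    With dry_run=True (the default) nothing is terminated and the
    matching cluster ids are only returned.
    Only the first page of results is inspected (no pagination handling).
    """
    response = emr_client.list_clusters(ClusterStates=["WAITING"])
    cluster_ids = [cluster["Id"] for cluster in response.get("Clusters", [])]

    # Ask AWS to shut the clusters down only when explicitly requested.
    if cluster_ids and not dry_run:
        emr_client.terminate_job_flows(JobFlowIds=cluster_ids)

    return cluster_ids
```

With boto3 installed and credentials configured, a call such as `terminate_waiting_clusters(boto3.client("emr"), dry_run=False)` would request the termination. Disabling the FoxyProxy tunnel remains a manual step.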

#### 4.13. Clone the EMR server (if necessary)

If we need to run our notebook again under the same conditions, we just need to **clone our cluster** and thus obtain a functional copy within 10 minutes, the time it takes to instantiate it.

<u>There are two solutions for this</u>:
1. <u>From the AWS interface</u>:
    1. Click “**Clone**”
    ![Clone a cluster](img/EMR_cloner_01.png)
    2. The cluster configuration is recreated identically. You can go back through the different steps if you want to make changes. When everything is ready, click on "**Create cluster**"
    ![Review/Modify/Create a cluster](img/EMR_cloner_02.png)
2. <u>On the command line</u> (with the AWS CLI installed and configured, and making sure to grant the necessary rights to the IAM account used)
    1. Click “**Export AWS CLI**”
    ![Export AWS CLI](img/EMR_cloner_cli_01.png)
    2. Copy/paste the command **into a terminal**
    ![Copy and paste the command](img/EMR_cloner_cli_02.png)

#### 4.14. S3 server tree at the end of the project
<u>For information, here is **the complete tree structure of my S3 p8-data bucket** at the end of the project</u>: *For the sake of readability, I do not list the 131 subfolders in the "Test" directory.*

1. Results/_SUCCESS
1. Results/part-00000-2cc36f38-19ef-4d8a-a0d1-5ddb309b3894-c000.snappy.parquet
1. Results/part-00001-2cc36f38-19ef-4d8a-a0d1-5ddb309b3894-c000.snappy.parquet
1. Results/part-00002-2cc36f38-19ef-4d8a-a0d1-5ddb309b3894-c000.snappy.parquet
1. Results/part-00003-2cc36f38-19ef-4d8a-a0d1-5ddb309b3894-c000.snappy.parquet
1. Results/part-00004-2cc36f38-19ef-4d8a-a0d1-5ddb309b3894-c000.snappy.parquet
1. Results/part-00005-2cc36f38-19ef-4d8a-a0d1-5ddb309b3894-c000.snappy.parquet
1. Results/part-00006-2cc36f38-19ef-4d8a-a0d1-5ddb309b3894-c000.snappy.parquet
1. Results/part-00007-2cc36f38-19ef-4d8a-a0d1-5ddb309b3894-c000.snappy.parquet
1. Results/part-00008-2cc36f38-19ef-4d8a-a0d1-5ddb309b3894-c000.snappy.parquet
1. Results/part-00009-2cc36f38-19ef-4d8a-a0d1-5ddb309b3894-c000.snappy.parquet
1. Results/part-00010-2cc36f38-19ef-4d8a-a0d1-5ddb309b3894-c000.snappy.parquet
1. Results/part-00011-2cc36f38-19ef-4d8a-a0d1-5ddb309b3894-c000.snappy.parquet
1. Results/part-00012-2cc36f38-19ef-4d8a-a0d1-5ddb309b3894-c000.snappy.parquet
1. Results/part-00013-2cc36f38-19ef-4d8a-a0d1-5ddb309b3894-c000.snappy.parquet
1. Results/part-00014-2cc36f38-19ef-4d8a-a0d1-5ddb309b3894-c000.snappy.parquet
1. Results/part-00015-2cc36f38-19ef-4d8a-a0d1-5ddb309b3894-c000.snappy.parquet
1. Results/part-00016-2cc36f38-19ef-4d8a-a0d1-5ddb309b3894-c000.snappy.parquet
1. Results/part-00017-2cc36f38-19ef-4d8a-a0d1-5ddb309b3894-c000.snappy.parquet
1. Results/part-00018-2cc36f38-19ef-4d8a-a0d1-5ddb309b3894-c000.snappy.parquet
1. Results/part-00019-2cc36f38-19ef-4d8a-a0d1-5ddb309b3894-c000.snappy.parquet
1. Results/part-00020-2cc36f38-19ef-4d8a-a0d1-5ddb309b3894-c000.snappy.parquet
1. Results/part-00021-2cc36f38-19ef-4d8a-a0d1-5ddb309b3894-c000.snappy.parquet
1. Results/part-00022-2cc36f38-19ef-4d8a-a0d1-5ddb309b3894-c000.snappy.parquet
1. Results/part-00023-2cc36f38-19ef-4d8a-a0d1-5ddb309b3894-c000.snappy.parquet
1. Test/
1. bootstrap-emr.sh
1. jupyter-s3-conf.json
1. jupyter/jovyan/.s3keep
1. jupyter/jovyan/P8_01_Notebook.ipynb
1. jupyter/jovyan/_metadata
1. jupyter/jovyan/e-5OTY4VKPDT21945FF6DN15E35/.aws-editors-workspace-metadata/
1. jupyter/jovyan/e-5OTY4VKPDT21945FF6DN15E35/.aws-editors-workspace-metadata/file-perm.sqlite
1. jupyter/jovyan/e-5OTY4VKPDT21945FF6DN15E35/.aws-editors-workspace-metadata/nbconvert/
1. jupyter/jovyan/e-5OTY4VKPDT21945FF6DN15E35/.aws-editors-workspace-metadata/nbconvert/templates/
1. jupyter/jovyan/e-5OTY4VKPDT21945FF6DN15E35/.aws-editors-workspace-metadata/nbconvert/templates/html/
1. jupyter/jovyan/e-5OTY4VKPDT21945FF6DN15E35/.aws-editors-workspace-metadata/nbconvert/templates/latex/
1. jupyter/jovyan/e-5OTY4VKPDT21945FF6DN15E35/.aws-editors-workspace-metadata/nbsignatures.db
1. jupyter/jovyan/e-5OTY4VKPDT21945FF6DN15E35/.aws-editors-workspace-metadata/notebook_secret
1. jupyter/jovyan/e-5OTY4VKPDT21945FF6DN15E35/.ipynb_checkpoints/
1. jupyter/jovyan/e-5OTY4VKPDT21945FF6DN15E35/.ipynb_checkpoints/Untitled-checkpoint.ipynb
1. jupyter/jovyan/e-5OTY4VKPDT21945FF6DN15E35/.ipynb_checkpoints/Untitled1-checkpoint.ipynb
1. jupyter/jovyan/e-5OTY4VKPDT21945FF6DN15E35/.ipynb_checkpoints/test3-checkpoint.ipynb
1. jupyter/jovyan/e-5OTY4VKPDT21945FF6DN15E35/Untitled.ipynb
1. jupyter/jovyan/e-5OTY4VKPDT21945FF6DN15E35/Untitled1.ipynb
1. jupyter/jovyan/e-5OTY4VKPDT21945FF6DN15E35/test3.ipynb
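
A listing like the one above is easier to read once summarised by top-level prefix. The sketch below is only an illustration: the function name `count_by_top_level` is hypothetical, and the sample keys are a hand-copied subset of the listing. In practice the keys would come from an S3 listing call.

```python
from collections import Counter


def count_by_top_level(keys):
    """Count S3 object keys per top-level prefix (the text before the first '/')."""
    counts = Counter()
    for key in keys:
        top, sep, _rest = key.partition("/")
        # Keys without '/' are objects stored at the bucket root.
        counts[top + "/" if sep else "(root)"] += 1
    return dict(counts)


# Hand-copied subset of the listing above, for illustration only.
sample_keys = [
    "Results/_SUCCESS",
    "Results/part-00000-2cc36f38-19ef-4d8a-a0d1-5ddb309b3894-c000.snappy.parquet",
    "bootstrap-emr.sh",
    "jupyter-s3-conf.json",
    "jupyter/jovyan/P8_01_Notebook.ipynb",
]
summary = count_by_top_level(sample_keys)
print(summary)
```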

### 5. Conclusion

We carried out this project **in two stages** taking into account the constraints imposed on us.

We **initially developed our solution locally**. The <u>first phase</u> consisted of **installing the Spark working environment**. **Spark** has a parameter that allows us to work locally and thus to **simulate distributed computing** by considering **each core of a processor as an independent worker**. We worked on a **smaller data set**; the idea was simply to **validate the correct functioning of the solution**.

We chose to carry out **transfer learning** using the **MobileNetV2** model. This model was chosen for its **lightness** and its **speed of execution**, as well as for the **low dimension of its output vector**.

The results were saved on disk in several partitions in "**parquet**" format.

<u>**The solution worked perfectly in local mode**</u>.

The <u>second phase</u> consisted of creating a **real compute cluster**. The objective was to be able to **anticipate a future increase in workload**.

The best choice was to use the service provider **Amazon Web Services**, which allows us to **rent computing power on demand** for a **completely acceptable cost**. This service is called **EC2** and is classified among the **Infrastructure As A Service** (IAAS) offers.

We went further by using a higher-level service (**Platform As A Service**, PAAS): the **EMR** service, which allows us to **instantiate several servers at once (one cluster)** on which we were able to request the installation and configuration of several programs and libraries necessary for our project, such as **Spark**, **Hadoop**, **JupyterHub**, as well as the **TensorFlow** library.

In addition to being **quicker and more efficient to implement**, this solution gives us the **certainty that it works correctly**, since it was previously validated by Amazon engineers.

We were also able to install, without difficulty, **the necessary packages on all the machines in the cluster**.

Finally, with very little modification, and even more simply, we were able to **run our notebook as we had done locally**. This time we executed the processing on **all the images in our "Test" folder**.

We opted for the **Amazon S3** service to **store our project data**. S3 offers, at a low cost, all the conditions we need to efficiently store and use our data. The allocated space is potentially **unlimited**, but costs will depend on the space used.

It will be **easy for us to cope with an increase in the workload** by simply **resizing** our cluster of machines (horizontally and/or vertically if necessary). The costs will increase accordingly but will remain significantly lower than the costs generated by the purchase of equipment or the rental of dedicated servers.