<a href="https://colab.research.google.com/github/ankesh86/PySparkNotebooks/blob/main/DeepLearning_Model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Building a Multilayer Perceptron Model**

In [1]:
!pip install pyspark==3.4.0

Collecting pyspark==3.4.0
  Downloading pyspark-3.4.0.tar.gz (310.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 310.8/310.8 MB 3.9 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.4.0-py2.py3-none-any.whl size=311317122 sha256=27942c7e87b7cd6cda130ddc77df92882100eb65e7ff15ca65d55a8a0b910024
  Stored in directory: /root/.cache/pip/wheels/7b/1b/4b/3363a1d04368e7ff0d408e57ff57966fcdf00583774e761327
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.4.0


In [2]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('deep_learning').getOrCreate()

## **Loading the Libraries**

In [3]:
import os
import numpy as np
import pandas as pd
from pyspark.sql.types import *

## Loading the files

In [5]:
data = spark.read.csv('sample_data/dl_data.csv', header=True, inferSchema=True)
data.printSchema()

root
 |-- Visit_Number_Bucket: string (nullable = true)
 |-- Page_Views_Normalized: double (nullable = true)
 |-- Orders_Normalized: integer (nullable = true)
 |-- Internal_Search_Successful_Normalized: double (nullable = true)
 |-- Internal_Search_Null_Normalized: double (nullable = true)
 |-- Email_Signup_Normalized: double (nullable = true)
 |-- Total_Seconds_Spent_Normalized: double (nullable = true)
 |-- Store_Locator_Search_Normalized: double (nullable = true)
 |-- Mapped_Last_Touch_Channel: string (nullable = true)
 |-- Mapped_Mobile_Device_Type: string (nullable = true)
 |-- Mapped_Browser_Type: string (nullable = true)
 |-- Mapped_Entry_Pages: string (nullable = true)
 |-- Mapped_Site_Section: string (nullable = true)
 |-- Mapped_Promo_Code: string (nullable = true)
 |-- Maped_Product_Name: string (nullable = true)
 |-- Mapped_Search_Term: string (nullable = true)
 |-- Mapped_Product_Collection: string (nullable = true)



## **Transformation of Data**

In [7]:
data = data.withColumnRenamed('Orders_Normalized','label')
data.printSchema()

root
 |-- Visit_Number_Bucket: string (nullable = true)
 |-- Page_Views_Normalized: double (nullable = true)
 |-- label: integer (nullable = true)
 |-- Internal_Search_Successful_Normalized: double (nullable = true)
 |-- Internal_Search_Null_Normalized: double (nullable = true)
 |-- Email_Signup_Normalized: double (nullable = true)
 |-- Total_Seconds_Spent_Normalized: double (nullable = true)
 |-- Store_Locator_Search_Normalized: double (nullable = true)
 |-- Mapped_Last_Touch_Channel: string (nullable = true)
 |-- Mapped_Mobile_Device_Type: string (nullable = true)
 |-- Mapped_Browser_Type: string (nullable = true)
 |-- Mapped_Entry_Pages: string (nullable = true)
 |-- Mapped_Site_Section: string (nullable = true)
 |-- Mapped_Promo_Code: string (nullable = true)
 |-- Maped_Product_Name: string (nullable = true)
 |-- Mapped_Search_Term: string (nullable = true)
 |-- Mapped_Product_Collection: string (nullable = true)



## Pipeline to convert categorical columns to numerical

In [13]:
from pyspark.ml.feature import OneHotEncoder, VectorAssembler, StringIndexer
from pyspark.ml import Pipeline
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.classification import MultilayerPerceptronClassifier

In [14]:
# 70% train, 20% validation, 10% test; the seed makes the split repeatable
train, validation, test = data.randomSplit([0.7, 0.2, 0.1], seed=1234)

In [15]:
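# Split the columns by type; the integer `label` column matches neither filter
categorical_columns = [item[0] for item in data.dtypes if item[1].startswith('string')]
numeric_columns = [item[0] for item in data.dtypes if item[1].startswith('double')]

# One StringIndexer per categorical column, writing to '<column>_index'
indexers = [StringIndexer(inputCol=column, outputCol='{0}_index'.format(column)) for column in categorical_columns]

# Combine the indexed categorical columns and the numeric columns into one feature vector
featuresCreator = VectorAssembler(inputCols=[indexer.getOutputCol() for indexer in indexers] + numeric_columns, outputCol="features")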
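
To make the indexing step concrete, the following plain-Python sketch imitates what `StringIndexer` does with its default `frequencyDesc` ordering, where the most frequent string receives index 0. It is an illustration only, not Spark's implementation: the helper `fit_string_index` and the sample channel values are hypothetical.

```python
from collections import Counter


def fit_string_index(values):
    """Map each distinct string to an index, most frequent first.

    Hypothetical helper imitating StringIndexer's default
    'frequencyDesc' ordering; ties are broken alphabetically.
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(index) for index, label in enumerate(ordered)}


# Made-up sample values for a single categorical column.
channels = ["Email", "Search", "Search", "Direct", "Search", "Email"]

mapping = fit_string_index(channels)               # learned on the "training" values
indexed = [mapping[value] for value in channels]   # what the *_index column would hold

print(mapping)
print(indexed)
```

Because the mapping is learned when the pipeline is fitted on `train`, the same string always receives the same index in `validation` and `test`.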


### Layers
The `layers` parameter is an array of integers in which each integer is the number of neurons in one layer. The array used in this notebook is structured as follows:

* First element: the number of input neurons, which must match the length of the feature vector. In this notebook it is set dynamically with `len(featuresCreator.getInputCols())`. This works because every input to `featuresCreator` is a scalar column (an indexed categorical column or a numeric column), so the assembled vector has one entry per input column.
* Intermediate elements: the number of neurons in each hidden layer. In this notebook, 4 and 2 are the sizes of the two hidden layers. The hidden layers are where the model learns non-linear relationships between the features and the label.
* Last element: the number of output neurons, which must match the number of classes (2 for binary classification). Each output is a class score that, after the softmax transformation, can be interpreted as a probability.

### Other Configurable Parameters
In addition to `layers`, `MultilayerPerceptronClassifier` has several other parameters that you can configure:

* labelCol: The name of the column in the dataset that contains the label to predict.
* featuresCol: The name of the column in the dataset that contains the feature vector.
* maxIter: The maximum number of training iterations. More iterations may let the network converge to a better solution but take longer to compute.
* blockSize: The number of input rows stacked into one matrix and processed together within a partition. Larger blocks speed up the matrix computations but use more memory.
* seed: The seed for random number generation. Fixing it makes the weight initialization, and therefore the results, reproducible.
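
As a rough check of the `layers` array, the sketch below counts the string and double columns from the schema printed earlier and derives the input size, along with the number of trainable weights for `[input, 4, 2, 2]`. It assumes a fully connected network with one bias per neuron; `count_parameters` is a hypothetical helper, not part of the Spark API.

```python
# Column types copied from the printSchema() output (the integer label is excluded).
column_types = {
    "Visit_Number_Bucket": "string",
    "Page_Views_Normalized": "double",
    "Internal_Search_Successful_Normalized": "double",
    "Internal_Search_Null_Normalized": "double",
    "Email_Signup_Normalized": "double",
    "Total_Seconds_Spent_Normalized": "double",
    "Store_Locator_Search_Normalized": "double",
    "Mapped_Last_Touch_Channel": "string",
    "Mapped_Mobile_Device_Type": "string",
    "Mapped_Browser_Type": "string",
    "Mapped_Entry_Pages": "string",
    "Mapped_Site_Section": "string",
    "Mapped_Promo_Code": "string",
    "Maped_Product_Name": "string",
    "Mapped_Search_Term": "string",
    "Mapped_Product_Collection": "string",
}

categorical = [name for name, dtype in column_types.items() if dtype == "string"]
numeric = [name for name, dtype in column_types.items() if dtype == "double"]

# One vector entry per indexed categorical column plus one per numeric column.
input_size = len(categorical) + len(numeric)
layers = [input_size, 4, 2, 2]


def count_parameters(layer_sizes):
    """Weights plus biases of a fully connected network (hypothetical helper)."""
    return sum((n_in + 1) * n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


print(layers)                    # [16, 4, 2, 2]
print(count_parameters(layers))  # 84
```

With only 84 weights the network is very small, which is consistent with the near-identical train, validation, and test scores reported at the end of the notebook.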

In [16]:
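# Input size = number of assembled columns; hidden layers of 4 and 2 neurons; 2 output classes
layers = [len(featuresCreator.getInputCols()), 4, 2, 2]
classifier = MultilayerPerceptronClassifier(labelCol='label', featuresCol='features', maxIter=100, layers=layers, blockSize=128, seed=1234)
# Stages run in order: index the categorical columns, assemble the features, train the classifier
pipeline = Pipeline(stages=indexers + [featuresCreator, classifier])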




## **Fit the model**

In [None]:
model = pipeline.fit(train)

## **Calculating the predictions**

In [17]:
train_output_df = model.transform(train)
validation_output_df = model.transform(validation)
test_output_df = model.transform(test)

## **Evaluate the Predictions**

In [18]:
train_predictionAndLabels = train_output_df.select("prediction", "label")
validation_predictionAndLabels = validation_output_df.select("prediction", "label")
test_predictionAndLabels = test_output_df.select("prediction", "label")

metrics = ['weightedPrecision', 'weightedRecall', 'accuracy']

for metric in metrics:
    evaluator = MulticlassClassificationEvaluator(metricName=metric)
    print('Train ' + metric + ' = ' + str(evaluator.evaluate(train_predictionAndLabels)))
    print('Validation ' + metric + ' = ' + str(evaluator.evaluate(validation_predictionAndLabels)))
    print('Test ' + metric + ' = ' + str(evaluator.evaluate(test_predictionAndLabels)))

Train weightedPrecision = 0.9702977504063095
Validation weightedPrecision = 0.9700372050939752
Test weightedPrecision = 0.9678924180434567
Train weightedRecall = 0.9698878326626621
Validation weightedRecall = 0.9696474751599292
Test weightedRecall = 0.9673558215451578
Train accuracy = 0.9698878326626621
Validation accuracy = 0.9696474751599292
Test accuracy = 0.9673558215451578
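
In the output above, `weightedRecall` and `accuracy` are identical on every split. This is expected: weighting each class's recall by that class's share of the true labels reduces to the fraction of correct predictions. The plain-Python sketch below illustrates this on a made-up 2x2 confusion matrix. The counts and the helper `weighted_metrics` are hypothetical and unrelated to the notebook's data.

```python
def weighted_metrics(confusion):
    """Return (weighted precision, weighted recall, accuracy).

    confusion[i][j] is the number of rows whose true class is i
    and whose predicted class is j.
    """
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[c][c] for c in range(n_classes))

    weighted_precision = 0.0
    weighted_recall = 0.0
    for c in range(n_classes):
        true_count = sum(confusion[c])                                # rows that are class c
        predicted_count = sum(confusion[r][c] for r in range(n_classes))  # rows predicted as c
        weight = true_count / total                                   # class share of true labels
        precision = confusion[c][c] / predicted_count if predicted_count else 0.0
        recall = confusion[c][c] / true_count if true_count else 0.0
        weighted_precision += weight * precision
        weighted_recall += weight * recall

    return weighted_precision, weighted_recall, correct / total


# Made-up counts: 100 rows of class 0 and 50 rows of class 1.
confusion = [[90, 10],
             [15, 35]]

wp, wr, acc = weighted_metrics(confusion)
print(f"weightedPrecision = {wp:.4f}")
print(f"weightedRecall    = {wr:.4f}")
print(f"accuracy          = {acc:.4f}")
```

Since weighted recall carries no information beyond accuracy, per-class precision and recall are usually more informative when the classes are imbalanced.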
