<a href="https://colab.research.google.com/github/datasigntist/deeplearning/blob/master/pySpark_Codes_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Experiments with Spark**

Author : Vishwanathan Raman

EmailId : datasigntist@gmail.com

Description : This notebook covers the basics of building a machine learning model in Spark. A logistic regression model is trained on the Iris dataset (https://archive.ics.uci.edu/ml/datasets/Iris). Download the data and upload it to the Google Colab environment as `iris_data`, the file name that the code below reads.
   

Reference Links:

*   Introduction to Spark 1 : https://youtu.be/TuGn3e1EgXM
*   Introduction to Spark 2 : https://youtu.be/JruCKuWHKpk
*   Introduction to Spark 3 : https://youtu.be/c9jd4yZGyT8
*   Introduction to RDD 1   : https://youtu.be/M7UuKHYecXQ
*   Introduction to RDD 2   : https://youtu.be/qLGUPdSvAVg
*   Introduction to RDD 3   : https://youtu.be/9NBP-FiHrQg



## Spark Installation in Google Colab

In [0]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
# Old Spark releases are kept on archive.apache.org; mirror sites only carry current releases.
!wget -q https://archive.apache.org/dist/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz
!tar xf spark-2.4.4-bin-hadoop2.7.tgz
!pip install -q findspark

In [0]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.4-bin-hadoop2.7"

In [0]:
import findspark
findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()

## Importing Specific Libraries and Onboarding data

In [0]:
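# Explicit imports avoid the wildcard import of pyspark.sql.functions,
# which shadows Python builtins such as sum, min, max and round.
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.sql import SQLContext
import pandas as pd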

In [0]:
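# SQLContext expects a SparkContext as its first argument; the session is passed separately.
sqlContext = SQLContext(spark.sparkContext, spark)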

In [0]:
irisData = pd.read_csv('iris_data',header=None)

In [11]:
irisData.head()

     0    1    2    3            4
0  5.1  3.5  1.4  0.2  Iris-setosa
1  4.9  3.0  1.4  0.2  Iris-setosa
2  4.7  3.2  1.3  0.2  Iris-setosa
3  4.6  3.1  1.5  0.2  Iris-setosa
4  5.0  3.6  1.4  0.2  Iris-setosa


Using sqlContext to create a Spark DataFrame from the pandas DataFrame, naming the columns along the way

In [14]:
sdf = sqlContext.createDataFrame(irisData,["SepalLength","SepalWidth","PetalLength","PetalWidth","class"]) 
sdf.printSchema()

root
 |-- SepalLength: double (nullable = true)
 |-- SepalWidth: double (nullable = true)
 |-- PetalLength: double (nullable = true)
 |-- PetalWidth: double (nullable = true)
 |-- class: string (nullable = true)



In [15]:
sdf.show(5)

+-----------+----------+-----------+----------+-----------+
|SepalLength|SepalWidth|PetalLength|PetalWidth|      class|
+-----------+----------+-----------+----------+-----------+
|        5.1|       3.5|        1.4|       0.2|Iris-setosa|
|        4.9|       3.0|        1.4|       0.2|Iris-setosa|
|        4.7|       3.2|        1.3|       0.2|Iris-setosa|
|        4.6|       3.1|        1.5|       0.2|Iris-setosa|
|        5.0|       3.6|        1.4|       0.2|Iris-setosa|
+-----------+----------+-----------+----------+-----------+
only showing top 5 rows



## Prepping the data for modelling

Identifying the feature columns: every column except the last one, which holds the class

In [18]:
# All columns except the last one (class) are features.
feature_cols = sdf.columns[:-1]
print("Feature Columns ",feature_cols)

Feature Columns  ['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth']


VectorAssembler builds the feature vector for each data point as reflected in the features column below

In [19]:
assembler = VectorAssembler(inputCols=feature_cols, outputCol='features')
data = assembler.transform(sdf)
data.show(5)

+-----------+----------+-----------+----------+-----------+-----------------+
|SepalLength|SepalWidth|PetalLength|PetalWidth|      class|         features|
+-----------+----------+-----------+----------+-----------+-----------------+
|        5.1|       3.5|        1.4|       0.2|Iris-setosa|[5.1,3.5,1.4,0.2]|
|        4.9|       3.0|        1.4|       0.2|Iris-setosa|[4.9,3.0,1.4,0.2]|
|        4.7|       3.2|        1.3|       0.2|Iris-setosa|[4.7,3.2,1.3,0.2]|
|        4.6|       3.1|        1.5|       0.2|Iris-setosa|[4.6,3.1,1.5,0.2]|
|        5.0|       3.6|        1.4|       0.2|Iris-setosa|[5.0,3.6,1.4,0.2]|
+-----------+----------+-----------+----------+-----------+-----------------+
only showing top 5 rows
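
To make the effect of the assembler concrete, here is a minimal plain-Python sketch of the same idea: the values of the chosen input columns are packed, in order, into one list per row. This is only an illustration and not how Spark implements `VectorAssembler`; the helper name `assemble` and the dictionary representation of rows are assumptions made for the example.

```python
# Plain-Python illustration only; this is not Spark's VectorAssembler.
feature_cols = ["SepalLength", "SepalWidth", "PetalLength", "PetalWidth"]

# Two rows copied from the output above, represented as dictionaries.
rows = [
    {"SepalLength": 5.1, "SepalWidth": 3.5, "PetalLength": 1.4,
     "PetalWidth": 0.2, "class": "Iris-setosa"},
    {"SepalLength": 4.9, "SepalWidth": 3.0, "PetalLength": 1.4,
     "PetalWidth": 0.2, "class": "Iris-setosa"},
]


def assemble(row, input_cols, output_col="features"):
    """Return a copy of row with the input columns packed into one list."""
    assembled = dict(row)
    assembled[output_col] = [float(row[c]) for c in input_cols]
    return assembled


assembled_rows = [assemble(r, feature_cols) for r in rows]
for r in assembled_rows:
    print(r["features"], r["class"])
```

The original columns are kept next to the new `features` entry, which mirrors the table above where `features` appears as an additional column.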



Selecting only the columns that need to be taken forward for modelling

In [21]:
dataForModelling = data.select(['features', 'class'])
dataForModelling.show(5)

+-----------------+-----------+
|         features|      class|
+-----------------+-----------+
|[5.1,3.5,1.4,0.2]|Iris-setosa|
|[4.9,3.0,1.4,0.2]|Iris-setosa|
|[4.7,3.2,1.3,0.2]|Iris-setosa|
|[4.6,3.1,1.5,0.2]|Iris-setosa|
|[5.0,3.6,1.4,0.2]|Iris-setosa|
+-----------------+-----------+
only showing top 5 rows



Using StringIndexer to convert the class labels to numerical indices, stored in a new label column

In [0]:
label_indexer = StringIndexer(inputCol='class', outputCol='label').fit(data)

In [0]:
dataForModellingWithLabels = label_indexer.transform(dataForModelling)

In [26]:
dataForModellingWithLabels.show(5)

+-----------------+-----------+-----+
|         features|      class|label|
+-----------------+-----------+-----+
|[5.1,3.5,1.4,0.2]|Iris-setosa|  2.0|
|[4.9,3.0,1.4,0.2]|Iris-setosa|  2.0|
|[4.7,3.2,1.3,0.2]|Iris-setosa|  2.0|
|[4.6,3.1,1.5,0.2]|Iris-setosa|  2.0|
|[5.0,3.6,1.4,0.2]|Iris-setosa|  2.0|
+-----------------+-----------+-----+
only showing top 5 rows



In [27]:
dataForModellingWithLabels = dataForModellingWithLabels.select(['features', 'label'])
dataForModellingWithLabels.show(5)

+-----------------+-----+
|         features|label|
+-----------------+-----+
|[5.1,3.5,1.4,0.2]|  2.0|
|[4.9,3.0,1.4,0.2]|  2.0|
|[4.7,3.2,1.3,0.2]|  2.0|
|[4.6,3.1,1.5,0.2]|  2.0|
|[5.0,3.6,1.4,0.2]|  2.0|
+-----------------+-----+
only showing top 5 rows
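
The label indices come from a mapping that `StringIndexer` learns during `fit`. By default it orders labels by descending frequency, so the most common label gets index 0.0. The sketch below imitates that idea in plain Python with hypothetical labels of unequal counts. The helper `fit_string_indexer` is made up for this example, and the alphabetical tie-break is an assumption of the sketch only. The Iris classes all have 50 rows each, so the order Spark assigns to them depends on how it handles ties, which may differ between versions.

```python
from collections import Counter

# Hypothetical labels with unequal counts, so the ordering is unambiguous.
labels = ["b", "a", "b", "c", "b", "a"]


def fit_string_indexer(values):
    """Map each distinct label to a float index, most frequent label first."""
    counts = Counter(values)
    # Sort by descending count; ties are broken alphabetically in this sketch only.
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {label: float(i) for i, label in enumerate(ordered)}


mapping = fit_string_indexer(labels)
indexed = [mapping[v] for v in labels]
print(mapping)
print(indexed)
```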



Splitting the data into training (70%) and test (30%) sets, then fitting a logistic regression model on the training set

In [0]:
train, test = dataForModellingWithLabels.randomSplit([0.70, 0.30])
lr = LogisticRegression(regParam=0.01)
model = lr.fit(train)
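
A weighted random split assigns each row to a split at random, so the resulting sizes are only approximately 70% and 30%. Without a fixed seed, the split and the accuracy reported later change from run to run. `randomSplit` accepts an optional `seed` argument for reproducibility. The plain-Python sketch below shows the general idea under stated assumptions: the function `random_split`, the seed value 42 and the integer stand-ins for rows are all hypothetical, and Spark's own sampling differs in detail.

```python
import random


def random_split(rows, weights, seed=None):
    """Assign each row independently to one split according to normalized weights."""
    total = sum(weights)
    # Cumulative upper bounds, e.g. roughly [0.7, 1.0] for weights [0.70, 0.30].
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)

    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        u = rng.random()
        for i, upper in enumerate(bounds):
            # The last split also catches any floating-point rounding gap.
            if u < upper or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits


rows = list(range(150))  # stand-in for the 150 iris rows
train, test = random_split(rows, [0.70, 0.30], seed=42)
print(len(train), len(test))
```

Calling the function again with the same seed reproduces the same split, which is the property a fixed seed provides.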

Run the model on the test data

In [31]:
prediction = model.transform(test)
print("Prediction")
prediction.show(10)

Prediction
+-----------------+-----+--------------------+--------------------+----------+
|         features|label|       rawPrediction|         probability|prediction|
+-----------------+-----+--------------------+--------------------+----------+
|[4.4,3.2,1.3,0.2]|  2.0|[2.04363189580373...|[0.01170535202404...|       2.0|
|[4.7,3.2,1.3,0.2]|  2.0|[2.18593338860795...|[0.02027222498244...|       2.0|
|[4.9,3.0,1.4,0.2]|  2.0|[2.50699848637643...|[0.06039632656466...|       2.0|
|[5.0,3.4,1.5,0.2]|  2.0|[2.07914187176316...|[0.02179594114258...|       2.0|
|[5.0,3.5,1.3,0.3]|  2.0|[1.88931514926596...|[0.01459390090949...|       2.0|
|[5.0,3.6,1.4,0.2]|  2.0|[1.85294443586416...|[0.01029704957166...|       2.0|
|[5.1,3.3,1.7,0.5]|  2.0|[1.96369925950425...|[0.05924765576852...|       2.0|
|[5.1,3.5,1.4,0.2]|  2.0|[2.01729291370674...|[0.01720705906148...|       2.0|
|[5.1,3.8,1.9,0.4]|  2.0|[1.45203810055444...|[0.01257389227628...|       2.0|
|[5.2,2.7,3.9,1.4]|  0.0|[1.7511379749147
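
The `rawPrediction`, `probability` and `prediction` columns are related. For a label with more than two classes, Spark fits a multinomial logistic regression, and to my understanding the probabilities are the softmax of the raw scores while the prediction is the index of the largest probability. The sketch below illustrates that relationship in plain Python. The raw scores are hypothetical and are not taken from the output above.

```python
import math


def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw scores for three classes (illustrative values only).
raw_prediction = [2.0, -4.5, 6.0]
probability = softmax(raw_prediction)

# The predicted class is the index of the highest probability, as a float.
prediction = float(probability.index(max(probability)))
print(probability, prediction)
```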

In [32]:
evaluator = MulticlassClassificationEvaluator(metricName='accuracy')
accuracy = evaluator.evaluate(prediction)

print()
print('#####################################')
print('Regularization rate is {}'.format(lr.getRegParam()))  # read the value from the estimator instead of repeating it
print("Accuracy is {}".format(accuracy))
print('#####################################')
print()


#####################################
Regularization rate is 0.01
Accuracy is 0.9111111111111111
#####################################
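
Accuracy is the fraction of test rows whose predicted label equals the true label. The sketch below computes it in plain Python on hypothetical labels and predictions. The function name `accuracy` and the sample values are assumptions for illustration, not data from this notebook.

```python
def accuracy(labels, predictions):
    """Fraction of positions where the prediction equals the label."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


# Hypothetical values: four of the five predictions are correct.
labels = [2.0, 2.0, 0.0, 1.0, 1.0]
predictions = [2.0, 2.0, 0.0, 1.0, 0.0]
acc = accuracy(labels, predictions)
print(acc)
```

The value reported above, 0.9111..., equals 41/45, which would fit a test set of 45 rows with 4 misclassified. The notebook does not show the test-set size, so this is only an inference.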

