We examine a dataset of ecommerce customer data for a company's website and mobile app, then see whether we can build a regression model that predicts a customer's yearly spend on the company's product.

### Initialize and create a spark session

In [3]:
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().appName("ECommerce").getOrCreate()

2019-12-26 12:10:38 WARN  SparkSession$Builder:66 - Using an existing SparkSession; some configuration may not take effect.


import org.apache.spark.sql.SparkSession
spark: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@148e681e


### Initialize Logger

In [4]:
import org.apache.log4j._
Logger.getLogger("org").setLevel(Level.ERROR)

import org.apache.log4j._


### Import statements to setup ML

In [6]:
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vectors

import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vectors


### Using Spark to read in the Ecommerce Customers csv file

In [7]:
val data = spark.read.options(Map(("header","true"),("inferSchema","true"))).csv("Ecommerce_Customers.csv")

data: org.apache.spark.sql.DataFrame = [Email: string, Address: string ... 6 more fields]


### Print the count of the Dataframe

In [8]:
data.count

res1: Long = 500


### Printing the first five rows of the dataframe

In [11]:
for (row <- data.head(5)){
    println(row)
}

[mstephenson@fernandez.com,835 Frank TunnelWrightmouth, MI 82180-9605,Violet,34.49726772511229,12.65565114916675,39.57766801952616,4.0826206329529615,587.9510539684005]
[hduke@hotmail.com,4547 Archer CommonDiazchester, CA 06566-8576,DarkGreen,31.92627202636016,11.109460728682564,37.268958868297744,2.66403418213262,392.2049334443264]
[pallen@yahoo.com,24645 Valerie Unions Suite 582Cobbborough, DC 99414-7564,Bisque,33.000914755642675,11.330278057777512,37.110597442120856,4.104543202376424,487.54750486747207]
[riverarebecca@gmail.com,1414 David ThroughwayPort Jason, OH 22070-1220,SaddleBrown,34.30555662975554,13.717513665142507,36.72128267790313,3.120178782748092,581.8523440352177]
[mstephens@davidson-herman.com,14023 Rodriguez PassagePort Jacobville, PR 37242-1057,MediumAquaMarine,33.33067252364639,12.795188551078114,37.53665330059473,4.446308318351434,599.4060920457634]


### Printing the schema of the dataframe

In [12]:
data.printSchema

root
 |-- Email: string (nullable = true)
 |-- Address: string (nullable = true)
 |-- Avatar: string (nullable = true)
 |-- Avg Session Length: double (nullable = true)
 |-- Time on App: double (nullable = true)
 |-- Time on Website: double (nullable = true)
 |-- Length of Membership: double (nullable = true)
 |-- Yearly Amount Spent: double (nullable = true)



### Showing the first three rows as a table

In [13]:
data.show(3)

+--------------------+--------------------+---------+------------------+------------------+------------------+--------------------+-------------------+
|               Email|             Address|   Avatar|Avg Session Length|       Time on App|   Time on Website|Length of Membership|Yearly Amount Spent|
+--------------------+--------------------+---------+------------------+------------------+------------------+--------------------+-------------------+
|mstephenson@ferna...|835 Frank TunnelW...|   Violet| 34.49726772511229| 12.65565114916675| 39.57766801952616|  4.0826206329529615|  587.9510539684005|
|   hduke@hotmail.com|4547 Archer Commo...|DarkGreen| 31.92627202636016|11.109460728682564|37.268958868297744|    2.66403418213262|  392.2049334443264|
|    pallen@yahoo.com|24645 Valerie Uni...|   Bisque|33.000914755642675|11.330278057777512|37.110597442120856|   4.104543202376424| 487.54750486747207|
+--------------------+--------------------+---------+------------------+------------------+------------------+--------------------+-------------------+

### See an example of what the data looks like by printing out a Row

In [19]:
val colnames = data.columns
val firstRow = data.head(1)(0)

for (i <- Range(0,colnames.size)){
    println("Column: "+colnames(i))
    println("Data: "+firstRow(i))
    println()
}

Column: Email
Data: mstephenson@fernandez.com

Column: Address
Data: 835 Frank TunnelWrightmouth, MI 82180-9605

Column: Avatar
Data: Violet

Column: Avg Session Length
Data: 34.49726772511229

Column: Time on App
Data: 12.65565114916675

Column: Time on Website
Data: 39.57766801952616

Column: Length of Membership
Data: 4.0826206329529615

Column: Yearly Amount Spent
Data: 587.9510539684005



colnames: Array[String] = Array(Email, Address, Avatar, Avg Session Length, Time on App, Time on Website, Length of Membership, Yearly Amount Spent)
firstRow: org.apache.spark.sql.Row = [mstephenson@fernandez.com,835 Frank TunnelWrightmouth, MI 82180-9605,Violet,34.49726772511229,12.65565114916675,39.57766801952616,4.0826206329529615,587.9510539684005]


### Dropping the string columns and converting the dataframe to the format Spark ML expects: a label column and a "features" vector column

In [20]:
val filtered_data = data.select("Avg Session Length", "Time on App", "Time on Website", "Length of Membership", "Yearly Amount Spent")

filtered_data: org.apache.spark.sql.DataFrame = [Avg Session Length: double, Time on App: double ... 3 more fields]


In [21]:
val assembler = new VectorAssembler().setInputCols(Array("Avg Session Length", "Time on App", "Time on Website"
                                                         , "Length of Membership")).setOutputCol("features")

assembler: org.apache.spark.ml.feature.VectorAssembler = vecAssembler_ec652b3bcb54


In [22]:
val output = assembler.transform(filtered_data)

output: org.apache.spark.sql.DataFrame = [Avg Session Length: double, Time on App: double ... 4 more fields]


In [23]:
output.show(3)

+------------------+------------------+------------------+--------------------+-------------------+--------------------+
|Avg Session Length|       Time on App|   Time on Website|Length of Membership|Yearly Amount Spent|            features|
+------------------+------------------+------------------+--------------------+-------------------+--------------------+
| 34.49726772511229| 12.65565114916675| 39.57766801952616|  4.0826206329529615|  587.9510539684005|[34.4972677251122...|
| 31.92627202636016|11.109460728682564|37.268958868297744|    2.66403418213262|  392.2049334443264|[31.9262720263601...|
|33.000914755642675|11.330278057777512|37.110597442120856|   4.104543202376424| 487.54750486747207|[33.0009147556426...|
+------------------+------------------+------------------+--------------------+-------------------+--------------------+
only showing top 3 rows



In [24]:
val final_data = output.select("Yearly Amount Spent","features")

final_data: org.apache.spark.sql.DataFrame = [Yearly Amount Spent: double, features: vector]


In [26]:
final_data.show(5,false)

+-------------------+----------------------------------------------------------------------------+
|Yearly Amount Spent|features                                                                    |
+-------------------+----------------------------------------------------------------------------+
|587.9510539684005  |[34.49726772511229,12.65565114916675,39.57766801952616,4.0826206329529615]  |
|392.2049334443264  |[31.92627202636016,11.109460728682564,37.268958868297744,2.66403418213262]  |
|487.54750486747207 |[33.000914755642675,11.330278057777512,37.110597442120856,4.104543202376424]|
|581.8523440352177  |[34.30555662975554,13.717513665142507,36.72128267790313,3.120178782748092]  |
|599.4060920457634  |[33.33067252364639,12.795188551078114,37.53665330059473,4.446308318351434]  |
+-------------------+----------------------------------------------------------------------------+
only showing top 5 rows
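As a rough mental model, the assembler and the final `select` turn each numeric row into a label plus a fixed-order feature vector. The plain Scala sketch below mirrors that for the first row shown above. `Customer`, `LabeledRow`, and `toLabeledRow` are hypothetical names used only for illustration; they are not Spark types, and the real work is done by `VectorAssembler`.

```scala
// Hypothetical plain-Scala model of one numeric row; not a Spark type.
final case class Customer(
  avgSessionLength: Double,
  timeOnApp: Double,
  timeOnWebsite: Double,
  lengthOfMembership: Double,
  yearlyAmountSpent: Double
)

// Hypothetical (label, features) pair, mirroring the two columns of final_data.
final case class LabeledRow(label: Double, features: Vector[Double])

// Pack the four inputs in the same order as setInputCols; the label stays separate.
def toLabeledRow(c: Customer): LabeledRow =
  LabeledRow(
    label = c.yearlyAmountSpent,
    features = Vector(c.avgSessionLength, c.timeOnApp, c.timeOnWebsite, c.lengthOfMembership)
  )

// First row of the dataset, copied from the output above.
val first = Customer(34.49726772511229, 12.65565114916675, 39.57766801952616,
  4.0826206329529615, 587.9510539684005)
val firstLabeled = toLabeledRow(first)
println(firstLabeled)
```

The order of the feature vector matters later: the model's coefficients line up with these positions one to one.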



### Splitting the resulting data into training data and testing data

<code>
<b>Training data is used to train the model</b>
<b>Testing data is used to test the built model</b>
</code>

#### Splitting the data 70% / 30% into training data and testing data respectively

In [27]:
val Array(train_data,test_data) = final_data.randomSplit(Array(0.7,0.3))

train_data: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [Yearly Amount Spent: double, features: vector]
test_data: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [Yearly Amount Spent: double, features: vector]
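`randomSplit` assigns rows at random, so the two parts only approximate 70% and 30%, and without a seed the split differs from run to run (Spark also offers a `randomSplit(weights, seed)` overload for repeatable splits). The sketch below is a simplified plain Scala stand-in that shows the idea; it is not Spark's implementation, and `bernoulliSplit` is a hypothetical helper name.

```scala
import scala.util.Random

// Simplified stand-in for a weighted random split (NOT Spark's implementation):
// each row independently goes to the first part with probability `trainFraction`.
def bernoulliSplit[A](rows: Seq[A], trainFraction: Double, seed: Long): (Seq[A], Seq[A]) = {
  val rng = new Random(seed)
  // Draw one flag per row up front so every row is decided exactly once.
  val flagged = rows.map(r => (r, rng.nextDouble() < trainFraction))
  (flagged.collect { case (r, true) => r }, flagged.collect { case (r, false) => r })
}

val rows = (1 to 500).toVector
val (trainA, testA) = bernoulliSplit(rows, 0.7, seed = 42L)
val (trainB, testB) = bernoulliSplit(rows, 0.7, seed = 42L)

// Sizes land near 350/150 but are generally not exact; the same seed repeats the split.
println(s"train = ${trainA.size}, test = ${testA.size}")
println(s"same split with same seed: ${trainA == trainB && testA == testB}")
```

Fixing the seed is what makes metrics comparable across reruns of the notebook.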


In [29]:
final_data.describe().show()

+-------+-------------------+
|summary|Yearly Amount Spent|
+-------+-------------------+
|  count|                500|
|   mean|  499.3140382585909|
| stddev|   79.3147815497068|
|    min| 256.67058229005585|
|    max|  765.5184619388373|
+-------+-------------------+



In [30]:
train_data.describe().show()

+-------+-------------------+
|summary|Yearly Amount Spent|
+-------+-------------------+
|  count|                349|
|   mean| 498.91895574689846|
| stddev|  80.16344725795035|
|    min|   266.086340948469|
|    max|  765.5184619388373|
+-------+-------------------+



In [31]:
test_data.describe().show()

+-------+-------------------+
|summary|Yearly Amount Spent|
+-------+-------------------+
|  count|                151|
|   mean|  500.2271759842895|
| stddev|  77.57302061278227|
|    min| 256.67058229005585|
|    max|  725.5848140556806|
+-------+-------------------+
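A quick sanity check on the split, sketched in plain Scala: the two counts should add up to the full dataset, and the count-weighted mean of the two parts should reproduce the overall mean. All numbers are copied from the three `describe()` outputs above; the variable names are only illustrative.

```scala
// Counts and means copied from the describe() outputs above.
val (trainCount, trainMean) = (349, 498.91895574689846)
val (testCount, testMean) = (151, 500.2271759842895)
val (fullCount, fullMean) = (500, 499.3140382585909)

// The parts should add up to the whole, and the count-weighted mean of the
// two parts should reproduce the mean of the full dataset.
val combinedCount = trainCount + testCount
val combinedMean = (trainCount * trainMean + testCount * testMean) / combinedCount

println(s"combined count = $combinedCount (full: $fullCount)")
println(s"combined mean  = $combinedMean (full: $fullMean)")
println(f"train share    = ${trainCount.toDouble / fullCount}%.3f")
```

The train share comes out slightly under 0.7, which is expected for a random split.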



### Creating a linear regression model object

In [32]:
val lr = new LinearRegression().setLabelCol("Yearly Amount Spent").setFeaturesCol("features")

lr: org.apache.spark.ml.regression.LinearRegression = linReg_ad567351be48


### Fitting the linear regression model to the training data

In [33]:
val lrModel = lr.fit(train_data)

2019-12-26 12:28:55 WARN  BLAS:61 - Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
2019-12-26 12:28:55 WARN  BLAS:61 - Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS
2019-12-26 12:28:55 WARN  LAPACK:61 - Failed to load implementation from: com.github.fommil.netlib.NativeSystemLAPACK
2019-12-26 12:28:55 WARN  LAPACK:61 - Failed to load implementation from: com.github.fommil.netlib.NativeRefLAPACK


lrModel: org.apache.spark.ml.regression.LinearRegressionModel = linReg_ad567351be48


### Getting the training summary of the created model

In [34]:
val trainingSummary = lrModel.summary

trainingSummary: org.apache.spark.ml.regression.LinearRegressionTrainingSummary = org.apache.spark.ml.regression.LinearRegressionTrainingSummary@4414fb8e


### Residuals on the training data

In [35]:
trainingSummary.residuals.show(5)

+-------------------+
|          residuals|
+-------------------+
|-17.132000238974456|
| -5.249605680316165|
| -5.707320807003555|
| 16.638244206776164|
| -9.089548631163439|
+-------------------+
only showing top 5 rows



### Error metrics on the training data

In [36]:
println(s"Mean Absolute Error: ${trainingSummary.meanAbsoluteError}")
println(s"Mean Squared Error: ${trainingSummary.meanSquaredError}")
println(s"Root Mean Squared Error: ${trainingSummary.rootMeanSquaredError}")
println(s"R Squared: ${trainingSummary.r2}")

Mean Absolute Error: 8.020052223176389
Mean Squared Error: 102.25435605992648
Root Mean Squared Error: 10.11208959908517
R Squared: 0.984042118657839
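For reference, the four metrics can be written out from their usual textbook definitions. The plain Scala sketch below is an illustration under that assumption, not Spark's code; `Metrics` and `regressionMetrics` are hypothetical names, and the toy inputs are not taken from the dataset. Note that R squared is a goodness-of-fit score (closer to 1 is better), not an error.

```scala
// Hypothetical helper, not Spark's API: the four metrics from labels and predictions.
final case class Metrics(mae: Double, mse: Double, rmse: Double, r2: Double)

def regressionMetrics(labels: Seq[Double], predictions: Seq[Double]): Metrics = {
  require(labels.nonEmpty && labels.size == predictions.size, "need equal, non-empty inputs")
  val n = labels.size.toDouble
  // Residual = actual label minus predicted value.
  val residuals = labels.zip(predictions).map { case (y, yHat) => y - yHat }
  val mae = residuals.map(r => math.abs(r)).sum / n
  val mse = residuals.map(r => r * r).sum / n
  val mean = labels.sum / n
  val totalSumSquares = labels.map(y => (y - mean) * (y - mean)).sum
  // R squared: share of the label variance explained by the predictions
  // (undefined when all labels are identical, since totalSumSquares is then 0).
  val r2 = 1.0 - (mse * n) / totalSumSquares
  Metrics(mae, mse, math.sqrt(mse), r2)
}

// Toy values chosen for easy arithmetic; they are not from the dataset.
val m = regressionMetrics(Seq(3.0, 5.0, 7.0), Seq(2.0, 5.0, 9.0))
println(m)
```

The root mean squared error is simply the square root of the mean squared error, which is why the two values printed above differ only by that transformation.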


### Evaluating the model against test data

In [38]:
val test_results = lrModel.evaluate(test_data)

test_results: org.apache.spark.ml.regression.LinearRegressionSummary = org.apache.spark.ml.regression.LinearRegressionSummary@2ba3a0c4


### Getting the coefficients and intercept

In [39]:
println(s"Coefficients: ${lrModel.coefficients}")
println(s"Intercept: ${lrModel.intercept}")

Coefficients: [25.37725532838959,37.99550195624851,0.47167795343915264,61.91345660542813]
Intercept: -1033.6792488814692
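To see what these numbers mean, the prediction for one customer can be reproduced by hand: intercept plus the dot product of the coefficients and the feature vector. The plain Scala sketch below does this for the first row of the dataset, assuming the coefficients follow the assembler's input column order; `predict` is a hypothetical helper, not a Spark method.

```scala
// Values copied from the output above, in the assembler's input order:
// Avg Session Length, Time on App, Time on Website, Length of Membership.
val coefficients = Vector(25.37725532838959, 37.99550195624851, 0.47167795343915264, 61.91345660542813)
val intercept = -1033.6792488814692

// A linear model's prediction: intercept plus the dot product of coefficients and features.
def predict(features: Vector[Double]): Double = {
  require(features.size == coefficients.size, "feature count must match coefficient count")
  intercept + coefficients.zip(features).map { case (w, x) => w * x }.sum
}

// First row of the dataset (it may have landed in either the training or the test split).
val features = Vector(34.49726772511229, 12.65565114916675, 39.57766801952616, 4.0826206329529615)
val actual = 587.9510539684005

val predicted = predict(features)
println(s"predicted = $predicted, actual = $actual, residual = ${actual - predicted}")
```

Read this way, each coefficient is the change in predicted yearly spend per one-unit change in that feature with the others held fixed: "Length of Membership" has the largest coefficient (about 61.9) and "Time on Website" the smallest (about 0.47).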


### Getting the residuals on the test data

In [40]:
test_results.residuals.show(5)

+-------------------+
|          residuals|
+-------------------+
| 1.4054810152236428|
|-6.4842896796715195|
| -8.494262494387215|
|0.16506940239332835|
| -6.569702155294692|
+-------------------+
only showing top 5 rows



### Evaluating the model by checking the different types of error

In [41]:
println(s"Mean Absolute Error: ${test_results.meanAbsoluteError}")
println(s"Mean Squared Error: ${test_results.meanSquaredError}")
println(s"Root Mean Squared Error: ${test_results.rootMeanSquaredError}")
println(s"R Squared: ${test_results.r2}")

Mean Absolute Error: 7.628800374545236
Mean Squared Error: 92.02837009666581
Root Mean Squared Error: 9.593141826151943
R Squared: 0.9846047759701744
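The reported values are internally consistent, which can be checked with a few lines of plain Scala: R squared should equal one minus the mean squared error divided by the variance of the test labels. The sketch assumes `describe()` reports the sample standard deviation (dividing by n - 1), so it is rescaled to a population variance first; the variable names are illustrative.

```scala
// Values copied from the test-set outputs above.
val testCount = 151
val labelStddev = 77.57302061278227 // assumed to be the sample standard deviation
val meanSquaredError = 92.02837009666581
val reportedR2 = 0.9846047759701744

// Convert the sample variance (divides by n - 1) into a population variance (divides by n)
// so it is on the same footing as the mean squared error.
val populationVariance = labelStddev * labelStddev * (testCount - 1) / testCount

val r2FromParts = 1.0 - meanSquaredError / populationVariance
println(s"R squared from MSE and variance = $r2FromParts (reported: $reportedR2)")
```

Put differently, a root mean squared error of about 9.6 against a label standard deviation of about 77.6 is what a score near 0.98 looks like in the units of the data. The test metrics are also close to the training metrics, so there is no obvious sign of overfitting in this run.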


### Getting predictions from the built model on data without the label column

In [42]:
val unlabelled_data = test_data.select("features")

unlabelled_data: org.apache.spark.sql.DataFrame = [features: vector]


In [43]:
val predictions = lrModel.transform(unlabelled_data)

predictions: org.apache.spark.sql.DataFrame = [features: vector, prediction: double]


In [44]:
predictions.show(false)

+----------------------------------------------------------------------------+------------------+
|features                                                                    |prediction        |
+----------------------------------------------------------------------------+------------------+
|[32.83694076702139,10.25654903128796,36.143908456341634,0.7895199078816915] |255.2651012748322 |
|[32.529768731474434,11.747731701242175,36.93988205032054,0.8015157200042076]|305.2462975414792 |
|[34.083663301629485,8.668349517101323,35.90675636579306,2.2524459633808416] |317.0220090524206 |
|[32.40237101796123,10.875559548189257,37.78114255999947,1.9140899242311]    |338.1547932391288 |
|[33.50370517913956,12.399436075147092,35.01280603355904,0.9686221157417688] |364.1611415703808 |
|[33.07773079450243,11.466984219092824,35.675727630820134,1.8092295917763102]|370.2797045192233 |
|[32.58249357081853,11.739743796165989,36.85481082475086,2.1820169698233887] |391.7122827915598 |
|[34.18777482695728,

### Stopping the created spark session

In [45]:
spark.stop()

## Thank You!