### For continuous values you need the right evaluation metrics
### Mean Absolute Error (MAE)
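MAE is the mean of the absolute values of the errors
Easy to interpret: it is the average size of the error, in the same units as the target
Each error is the difference between the true value and the predicted value
The absolute value means predictions below and above the true value count equally
E.g. for house price prediction: on average, how far is my prediction from the actual house price?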
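
As a minimal sketch of the definition above, MAE can be computed in plain Python. The house prices and the helper name `mean_absolute_error` are made up for illustration and are not part of the notebook.

```python
# Hypothetical true and predicted house prices (illustrative values only).
y_true = [200_000, 350_000, 500_000]
y_pred = [210_000, 340_000, 530_000]

def mean_absolute_error(actual, predicted):
    """Average of |actual - predicted| over all observations."""
    errors = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(errors) / len(errors)

mae = mean_absolute_error(y_true, y_pred)
print(mae)
```

The three absolute errors are 10,000, 10,000 and 30,000, so the MAE is about 16,667: on average the prediction is off by that many dollars, regardless of direction.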

### Mean Squared Error (MSE)
MSE is the mean of the squared errors
Squaring means larger errors are penalized more heavily than with MAE
Square each error first, then take the mean
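
To illustrate how squaring weights large errors, the sketch below compares two hypothetical sets of predictions that have the same MAE (5) but very different MSE. The data and the helper name `mean_squared_error` are assumptions for illustration only.

```python
def mean_squared_error(actual, predicted):
    """Square each error first, then average the squares."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sum(squared) / len(squared)

y_true = [100.0, 100.0, 100.0, 100.0]
even_pred = [105.0, 95.0, 105.0, 95.0]     # four errors of size 5
spiky_pred = [100.0, 100.0, 100.0, 120.0]  # a single error of size 20

mse_even = mean_squared_error(y_true, even_pred)    # (25 * 4) / 4
mse_spiky = mean_squared_error(y_true, spiky_pred)  # 400 / 4
print(mse_even, mse_spiky)
```

Both prediction sets are off by 5 on average, yet the one with a single large miss has four times the MSE.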

### Root Mean Squared Error (RMSE)
This is the square root of the mean of the squared errors
Taking the root puts the metric back in the same units as the target
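
A minimal sketch of RMSE in plain Python, using made-up values and a hypothetical helper name `root_mean_squared_error`:

```python
import math

def root_mean_squared_error(actual, predicted):
    """Square root of the mean of the squared errors."""
    mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    return math.sqrt(mse)

y_true = [100.0, 100.0, 100.0, 100.0]
y_pred = [100.0, 100.0, 100.0, 120.0]  # a single error of size 20

rmse = root_mean_squared_error(y_true, y_pred)
print(rmse)
```

Here the MSE is 100, so the RMSE is 10. If the target is in dollars, the RMSE is in dollars too.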

### R Squared Values
Also known as the coefficient of determination
It's a measure of how much of the variance in the target your model accounts for (usually reported as 0-100%)
There are variants of it, e.g. adjusted R squared, which penalizes adding extra features
It shouldn't be the sole source for evaluating a model
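
The sketch below computes R squared as one minus the ratio of the residual sum of squares to the total sum of squares, and adjusted R squared with the standard correction for the number of features. The data and the helper names `r_squared` and `adjusted_r_squared` are illustrative assumptions.

```python
def r_squared(actual, predicted):
    """1 - SS_res / SS_tot: share of the target's variance the model explains."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

def adjusted_r_squared(r2, n_obs, n_features):
    """Penalize R squared for the number of features used by the model."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_features - 1)

# Hypothetical values, assuming a model with a single feature.
y_true = [1.0, 2.0, 3.0, 4.0, 5.0]
y_pred = [1.1, 1.9, 3.2, 3.8, 5.0]

r2 = r_squared(y_true, y_pred)
adj_r2 = adjusted_r_squared(r2, n_obs=len(y_true), n_features=1)
print(r2, adj_r2)
```

With these numbers R squared is about 0.99 and adjusted R squared is slightly lower, which is the expected direction: the adjustment never rewards adding a feature for free.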

In [1]:
# Goal: predict each customer's yearly expenditure
from pyspark.sql import SparkSession

In [2]:
spark = SparkSession.builder.appName('lr_example').getOrCreate()

In [3]:
from pyspark.ml.regression import LinearRegression

In [4]:
data = spark.read.csv('Ecommerce_Customers.csv', inferSchema=True, header=True)

In [5]:
data.printSchema()

root
 |-- Email: string (nullable = true)
 |-- Address: string (nullable = true)
 |-- Avatar: string (nullable = true)
 |-- Avg Session Length: double (nullable = true)
 |-- Time on App: double (nullable = true)
 |-- Time on Website: double (nullable = true)
 |-- Length of Membership: double (nullable = true)
 |-- Yearly Amount Spent: double (nullable = true)



In [6]:
for item in data.head(1)[0]:
    print(item)

mstephenson@fernandez.com
835 Frank TunnelWrightmouth, MI 82180-9605
Violet
34.49726772511229
12.65565114916675
39.57766801952616
4.0826206329529615
587.9510539684005


In [7]:
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler

In [8]:
data.columns

['Email',
 'Address',
 'Avatar',
 'Avg Session Length',
 'Time on App',
 'Time on Website',
 'Length of Membership',
 'Yearly Amount Spent']

In [11]:
data.select(['Avg Session Length'])

DataFrame[Avg Session Length: double]

In [12]:
assembler = VectorAssembler(
    inputCols=['Avg Session Length', 'Time on App', 'Time on Website', 'Length of Membership'],
    outputCol='features',
)

In [13]:
output = assembler.transform(data)

In [14]:
output.printSchema()

root
 |-- Email: string (nullable = true)
 |-- Address: string (nullable = true)
 |-- Avatar: string (nullable = true)
 |-- Avg Session Length: double (nullable = true)
 |-- Time on App: double (nullable = true)
 |-- Time on Website: double (nullable = true)
 |-- Length of Membership: double (nullable = true)
 |-- Yearly Amount Spent: double (nullable = true)
 |-- features: vector (nullable = true)



In [15]:
output.head(1)

[Row(Email='mstephenson@fernandez.com', Address='835 Frank TunnelWrightmouth, MI 82180-9605', Avatar='Violet', Avg Session Length=34.49726772511229, Time on App=12.65565114916675, Time on Website=39.57766801952616, Length of Membership=4.0826206329529615, Yearly Amount Spent=587.9510539684005, features=DenseVector([34.4973, 12.6557, 39.5777, 4.0826]))]

In [16]:
final_data = output.select('features', 'Yearly Amount Spent')
final_data.show()

+--------------------+-------------------+
|            features|Yearly Amount Spent|
+--------------------+-------------------+
|[34.4972677251122...|  587.9510539684005|
|[31.9262720263601...|  392.2049334443264|
|[33.0009147556426...| 487.54750486747207|
|[34.3055566297555...|  581.8523440352177|
|[33.3306725236463...|  599.4060920457634|
|[33.8710378793419...|   637.102447915074|
|[32.0215955013870...|  521.5721747578274|
|[32.7391429383803...|  549.9041461052942|
|[33.9877728956856...|  570.2004089636196|
|[31.9365486184489...|  427.1993848953282|
|[33.9925727749537...|  492.6060127179966|
|[33.8793608248049...|  522.3374046069357|
|[29.5324289670579...|  408.6403510726275|
|[33.1903340437226...|  573.4158673313865|
|[32.3879758531538...|  470.4527333009554|
|[30.7377203726281...|  461.7807421962299|
|[32.1253868972878...| 457.84769594494855|
|[32.3388993230671...| 407.70454754954415|
|[32.1878120459321...|  452.3156754800354|
|[32.6178560628234...|   605.061038804892|
+--------------------+-------------------+

In [17]:
train_data,test_data = final_data.randomSplit([0.7,0.3])

In [18]:
train_data.describe().show()

+-------+-------------------+
|summary|Yearly Amount Spent|
+-------+-------------------+
|  count|                360|
|   mean|  497.1237839361732|
| stddev|  80.30613395961169|
|    min| 256.67058229005585|
|    max|  765.5184619388373|
+-------+-------------------+



In [19]:
test_data.describe().show()

+-------+-------------------+
|summary|Yearly Amount Spent|
+-------+-------------------+
|  count|                140|
|   mean|  504.9461208019503|
| stddev|  76.70088710979184|
|    min| 327.37795258965207|
|    max|  744.2218671047146|
+-------+-------------------+



In [20]:
lr = LinearRegression(labelCol='Yearly Amount Spent',featuresCol='features')

In [21]:
lr_model = lr.fit(train_data)

In [22]:
test_results = lr_model.evaluate(test_data)

In [25]:
test_results.residuals.show()

+--------------------+
|           residuals|
+--------------------+
| -21.941779724957428|
|  -7.591139381314861|
|-0.24378732833395134|
|  20.069245379300867|
|   4.519046695336442|
|  -7.718195774283629|
|   4.869718700020542|
|  18.486306811935265|
|  17.408491086132074|
| -3.6536245890020496|
|  -6.372841603458653|
| -3.3724135512247813|
|   8.336351224155635|
| -2.1140713641290176|
|    -9.9441195294213|
|  -7.936339346780358|
|  13.061967366456486|
|   4.249671476033143|
|   4.734548616729057|
|  4.8823943074412455|
+--------------------+
only showing top 20 rows



In [26]:
# RMSE: predictions are typically off by roughly $10 (same units as the label)
test_results.rootMeanSquaredError

9.752910623797812

In [27]:
# R squared: the model explains about 98% of the variance in the test labels
test_results.r2

0.9837152554585455

In [29]:
final_data.describe().show()

+-------+-------------------+
|summary|Yearly Amount Spent|
+-------+-------------------+
|  count|                500|
|   mean|  499.3140382585909|
| stddev|   79.3147815497068|
|    min| 256.67058229005585|
|    max|  765.5184619388373|
+-------+-------------------+
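
One way to read this summary alongside the RMSE reported earlier is to express the RMSE relative to the label's mean and standard deviation. The sketch below is plain arithmetic on values copied from the notebook output; the variable names are illustrative.

```python
# Values copied from the notebook output above.
rmse = 9.752910623797812         # test_results.rootMeanSquaredError
label_mean = 499.3140382585909   # mean of 'Yearly Amount Spent'
label_stddev = 79.3147815497068  # stddev of 'Yearly Amount Spent'

rmse_vs_mean = rmse / label_mean
rmse_vs_stddev = rmse / label_stddev

print(f"RMSE is {rmse_vs_mean:.1%} of the mean yearly spend")
print(f"RMSE is {rmse_vs_stddev:.1%} of the label's standard deviation")
```

The RMSE comes out at roughly 2% of the mean spend and roughly 12% of the label's spread, which suggests the error is small relative to the values being predicted.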



In [30]:
unlabeled_data = test_data.select('features')
unlabeled_data.show()

+--------------------+
|            features|
+--------------------+
|[31.1239743499119...|
|[31.1280900496166...|
|[31.2606468698795...|
|[31.3123495994443...|
|[31.3662121671876...|
|[31.5261978982398...|
|[31.5316044825729...|
|[31.6005122003032...|
|[31.6098395733896...|
|[31.6253601348306...|
|[31.7242025238451...|
|[31.8124825597242...|
|[31.8512531286083...|
|[31.8627411090001...|
|[31.8648325480987...|
|[31.8854062999117...|
|[31.9262720263601...|
|[32.0123007682454...|
|[32.0215955013870...|
|[32.0542618511847...|
+--------------------+
only showing top 20 rows



In [31]:
predictions = lr_model.transform(unlabeled_data)
predictions.show()

+--------------------+------------------+
|            features|        prediction|
+--------------------+------------------+
|[31.1239743499119...| 508.8888335647232|
|[31.1280900496166...| 564.8438261283695|
|[31.2606468698795...| 421.5704185852853|
|[31.3123495994443...|443.52217264863975|
|[31.3662121671876...| 426.0698358611485|
|[31.5261978982398...|416.81272196662144|
|[31.5316044825729...|  431.645887029342|
|[31.6005122003032...|460.68654467916167|
|[31.6098395733896...| 427.1370585649761|
|[31.6253601348306...|379.99052534592624|
|[31.7242025238451...|509.76072889141915|
|[31.8124825597242...|  396.182758535022|
|[31.8512531286083...|464.65589544264276|
|[31.8627411090001...| 558.4122125381757|
|[31.8648325480987...|  449.835400006235|
|[31.8854062999117...|398.03961231925587|
|[31.9262720263601...| 379.1429660778699|
|[32.0123007682454...| 488.6953815899251|
|[32.0215955013870...| 516.8376261410983|
|[32.0542618511847...| 556.9922633615417|
+--------------------+------------------+