# Spark MLlib - LinearRegression example

<p>Note: After exporting the Databricks notebook to .ipynb, the cell output loses its formatting. If you run this notebook on a Databricks cluster, the output is displayed as a table.</p>

<p>For example, the following output:</p>
<p>+----+-------+ age| name| +----+-------+ null|Michael| 30| Andy| 19| Justin| +----+-------+</p>
<p>is actually displayed as:</p>
<pre>+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+</pre>
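
To make the table layout concrete, here is a minimal pure-Python sketch that produces the same kind of text table. `render_table` is a hypothetical helper written for this note, not Spark's own code; it assumes every cell is right-aligned to the widest entry in its column and that missing values are printed as `null`, which matches the example above.

```python
def render_table(header, rows):
    """Render rows as a fixed-width text table with right-aligned cells."""
    # Missing values are shown as the string "null", as in the example above.
    cells = [["null" if v is None else str(v) for v in row] for row in rows]
    # Each column is as wide as its widest entry, header included.
    widths = [
        max([len(h)] + [len(r[i]) for r in cells])
        for i, h in enumerate(header)
    ]
    rule = "+" + "+".join("-" * w for w in widths) + "+"

    def line(values):
        return "|" + "|".join(v.rjust(w) for v, w in zip(values, widths)) + "|"

    return "\n".join([rule, line(header), rule] + [line(r) for r in cells] + [rule])


table = render_table(["age", "name"], [(None, "Michael"), (30, "Andy"), (19, "Justin")])
print(table)
```

The line breaks carry the structure, which is why the flattened single-line version is hard to read.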

### Create session

In [1]:
from pyspark.sql import SparkSession

In [2]:
spark = SparkSession.builder.appName('lr_example').getOrCreate()

### Import LinearRegression

In [3]:
from pyspark.ml.regression import LinearRegression

### Load CSV data

In [4]:
data = spark.read.csv('/FileStore/tables/Ecommerce_Customers-83b1c.csv', header=True, inferSchema=True)

In [5]:
data.printSchema()

In [6]:
data.show()

In [7]:
data.head(1)

### Import VectorAssembler to transform the data into MLlib format

In [8]:
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler

In [9]:
data.columns

### Instantiate assembler

In [10]:
assembler = VectorAssembler(inputCols=['Avg Session Length', 'Time on App', 'Time on Website', 'Length of Membership'],
                            outputCol='features')

### Transform the data

In [11]:
output = assembler.transform(data)

In [12]:
output.select('features').show()

In [13]:
output.head(1)

In [14]:
final_data = output.select(['features', 'Yearly Amount Spent'])

In [15]:
final_data.show()
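
As a rough picture of what the assembler does to each row, the sketch below collects the selected columns into a single list in the order given by `inputCols`. It is plain Python, not Spark: `assemble` is a hypothetical helper, and the customer record is made up for illustration only.

```python
def assemble(row, input_cols):
    """Collect the values of input_cols from a row into one feature list, in order."""
    return [float(row[col]) for col in input_cols]


# Made-up record for illustration; these are not values from the dataset.
row = {
    "Avg Session Length": 33.0,
    "Time on App": 12.5,
    "Time on Website": 37.0,
    "Length of Membership": 4.0,
    "Yearly Amount Spent": 550.0,
}
input_cols = ["Avg Session Length", "Time on App", "Time on Website", "Length of Membership"]

features = assemble(row, input_cols)
print(features)
```

The label column (`Yearly Amount Spent`) is left out of the feature vector on purpose; it stays as a separate column, which is what `final_data` selects next to `features`.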

### Split the data

In [16]:
train_data, test_data = final_data.randomSplit([0.7, 0.3])

In [17]:
train_data.describe().show()

In [18]:
test_data.describe().show()
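
The counts shown by `describe()` are usually close to 70% and 30% but not exact, and they change between runs. The sketch below illustrates why, under the assumption that each row is assigned independently by a random draw. `random_split` is a hypothetical pure-Python stand-in, not Spark's implementation.

```python
import random


def random_split(rows, train_fraction, seed):
    """Assign each row to train or test with an independent random draw."""
    rng = random.Random(seed)  # a fixed seed makes the split reproducible
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test


rows = list(range(1000))
train, test = random_split(rows, 0.7, seed=42)
print(len(train), len(test))  # close to 700 and 300, but not exactly
```

In Spark itself, `randomSplit` accepts an optional `seed` argument, e.g. `final_data.randomSplit([0.7, 0.3], seed=42)`, which is worth setting if you want the same split on every run.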

### Instantiate the model

In [19]:
lr = LinearRegression(labelCol='Yearly Amount Spent')

### Fit the model on the training data

In [20]:
lr_model = lr.fit(train_data)
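
With its default settings (no regularization), `LinearRegression` fits the coefficients that minimize the squared error on the training data. The sketch below shows that idea for a single feature using the closed-form least-squares formulas. It is only an illustration on made-up points; `fit_line` is a hypothetical helper, and Spark's solver handles several features and works differently internally.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for one feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both without the 1/n factor).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up points that lie exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```

Note that `lr.fit` reads the inputs from the column named `features` by default, which is why the assembler's `outputCol` was set to that name.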

### Evaluate the model

In [21]:
test_results = lr_model.evaluate(test_data)

In [22]:
test_results.residuals.show()

In [23]:
test_results.rootMeanSquaredError

In [24]:
test_results.r2
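
For reference, the three quantities above can be computed by hand. The sketch below uses made-up labels and predictions: the residual is the label minus the prediction, RMSE is the square root of the mean squared residual, and R² is one minus the ratio of the residual sum of squares to the total sum of squares. `regression_metrics` is a hypothetical helper, not part of PySpark.

```python
import math


def regression_metrics(labels, predictions):
    """Return residuals, root mean squared error and R^2."""
    residuals = [y - p for y, p in zip(labels, predictions)]
    n = len(labels)
    ss_res = sum(r * r for r in residuals)           # residual sum of squares
    mean_y = sum(labels) / n
    ss_tot = sum((y - mean_y) ** 2 for y in labels)  # total sum of squares
    rmse = math.sqrt(ss_res / n)
    r2 = 1 - ss_res / ss_tot
    return residuals, rmse, r2


# Made-up values for illustration.
labels = [3.0, 5.0, 7.0, 9.0]
predictions = [2.0, 5.0, 8.0, 9.0]
residuals, rmse, r2 = regression_metrics(labels, predictions)
print(residuals, rmse, r2)
```

RMSE is in the same units as the label, so it is easiest to judge against the mean and standard deviation of `Yearly Amount Spent`, which is what the `final_data.describe()` call in the next cell shows.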

In [25]:
final_data.describe().show()

In [26]:
unlabeled_data = test_data.select('features')

In [27]:
unlabeled_data.show()

### Make predictions

In [28]:
predictions = lr_model.transform(unlabeled_data)
predictions.show()
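
Each value in the `prediction` column is the dot product of the feature vector with the fitted coefficients, plus the intercept. The fitted values are available as `lr_model.coefficients` and `lr_model.intercept`. The sketch below reproduces the calculation in plain Python with made-up numbers; `predict` is a hypothetical helper.

```python
def predict(features, coefficients, intercept):
    """Linear model prediction: dot(coefficients, features) + intercept."""
    return sum(w * x for w, x in zip(coefficients, features)) + intercept


# Made-up coefficients and features for illustration.
coefficients = [2.0, 0.5]
intercept = 1.0
features = [3.0, 4.0]

prediction = predict(features, coefficients, intercept)
print(prediction)  # 2.0*3.0 + 0.5*4.0 + 1.0 = 9.0
```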