In [1]:
!pip install pyspark

Collecting pyspark
  Downloading pyspark-3.5.1.tar.gz (317.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 317.0/317.0 MB 4.5 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.5.1-py2.py3-none-any.whl size=317488493 sha256=922d7fecf48826f97a9fd3948a31d8a08c021853120d44b67a85a6644aa9b8e9
  Stored in directory: /root/.cache/pip/wheels/80/1d/60/2c256ed38dddce2fdd93be545214a63e02fbd8d74fb0b7f3a6
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.5.1


In [46]:
from pyspark.ml.regression import LinearRegression
from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors


Initialize the Spark session

In [47]:
spark = SparkSession.builder.appName("LinearRegression").getOrCreate()

Get input data

Input data example:

The file has two comma-separated columns:

-1.74,1.66
1.24,-1.18
0.29,-0.40
-0.13,0.09
-0.39,0.38
-1.79,1.73

In [48]:
inputLines = spark.sparkContext.textFile("/content/regression.txt")

We use the RDD interface to parse the data. Each row x of the RDD is split on the comma: the first column is the label we are predicting (the amount spent), and the columns after it are the features. In our case there is only one feature, the page speed. For multivariate linear regression, we could instead build a dense vector containing several features.
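As a minimal sketch of the parsing step described above, the helper below splits one line into a label and its feature values. The name parse_line and its to_vector parameter are hypothetical and not part of this notebook, and the multi-column sample line is made up for illustration.

```python
def parse_line(line, to_vector=list):
    """Split a comma-separated line into (label, features).

    The first field is the label; every remaining field becomes a feature.
    `to_vector` (hypothetical parameter) converts the list of floats into
    whatever vector type is needed, e.g. pyspark.ml.linalg.Vectors.dense.
    """
    fields = [float(value) for value in line.split(",")]
    return fields[0], to_vector(fields[1:])


# One feature, as in regression.txt
label, features = parse_line("-1.74,1.66")

# Several features (made-up line) for a multivariate model
multi_label, multi_features = parse_line("0.50,1.0,2.0,3.0")
```

In this notebook it could be applied as inputLines.map(lambda line: parse_line(line, Vectors.dense)), assuming the function is defined in the same session.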

In [49]:
inputLines.take(2)

['-1.74,1.66', '1.24,-1.18']

In [50]:
data = inputLines.map(lambda x:x.split(",")).map(lambda x:(float(x[0]), Vectors.dense(float(x[1]))))

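We now create a Spark DataFrame. The columns are named label and features, which are the default column names that LinearRegression expects.

In [39]:
df = data.toDF(["label", "features"])
df.take(2)

[Row(label=-1.74, features=DenseVector([1.66])), Row(label=1.24, features=DenseVector([-1.18]))]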

In [57]:
trainTest = df.randomSplit([0.8,0.2])
trainDF = trainTest[0]
testDF = trainTest[1]

Initialize LinearRegression with the following hyperparameters:

- maxIter: Maximum number of iterations of the optimization algorithm.
- regParam: Regularization strength. Larger values penalize large coefficients more heavily, which helps prevent overfitting.
- elasticNetParam: Mixing ratio between L1 and L2 regularization. A value of 0 gives a pure L2 (ridge) penalty and a value of 1 gives a pure L1 (lasso) penalty.
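To make the roles of regParam and elasticNetParam concrete, the sketch below evaluates the elastic-net penalty term in the form given in the Spark MLlib guide, regParam * (elasticNetParam * ||w||_1 + (1 - elasticNetParam) / 2 * ||w||_2^2). The function name and the sample weights are hypothetical; this only illustrates the formula and is not how Spark computes it internally.

```python
def elastic_net_penalty(weights, reg_param, elastic_net_param):
    """Return the elastic-net penalty for a list of model weights."""
    l1 = sum(abs(w) for w in weights)          # L1 norm
    l2_squared = sum(w * w for w in weights)   # squared L2 norm
    return reg_param * (
        elastic_net_param * l1
        + (1.0 - elastic_net_param) / 2.0 * l2_squared
    )


weights = [1.0, -2.0]  # made-up weights for illustration

# Same settings as the notebook: regParam=0.3, elasticNetParam=0.8
mixed = elastic_net_penalty(weights, reg_param=0.3, elastic_net_param=0.8)

# The two extremes: pure L1 (lasso) and pure L2 (ridge)
pure_l1 = elastic_net_penalty(weights, reg_param=0.3, elastic_net_param=1.0)
pure_l2 = elastic_net_penalty(weights, reg_param=0.3, elastic_net_param=0.0)
```

With elasticNetParam=0.8, most of the penalty comes from the L1 term, which tends to push small coefficients to exactly zero.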

In [56]:
spark_LR = LinearRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8, labelCol="label")

In [58]:
model = spark_LR.fit(trainDF)

Next, we predict on the test set. We cache the resulting DataFrame so that later operations on it do not recompute the predictions.

In [59]:
predictions = model.transform(testDF).cache()

In [61]:
predictions.show()

+-----+--------+-------------------+
|label|features|         prediction|
+-----+--------+-------------------+
|-3.74|  [3.75]| -2.646319368372646|
|-2.58|  [2.57]|  -1.81081549155598|
|-2.07|  [2.04]|-1.4355468011213761|
|-1.94|  [1.94]|-1.3647413878318282|
|-1.74|  [1.66]| -1.166486230621094|
|-1.67|  [1.46]| -1.024875404041998|
|-1.58|  [1.65]| -1.159405689292139|
|-1.42|  [1.59]|-1.1169224413184105|
| -1.4|  [1.32]|-0.9257478254366309|
| -1.3|  [1.45]| -1.017794862713043|
|-1.12|   [1.1]|-0.7699759161996255|
|-1.11|   [1.0]|-0.6991705029100774|
|-0.94|   [1.0]|-0.6991705029100774|
|-0.89|  [1.04]|-0.7274926682258966|
|-0.84|  [0.83]|-0.5788013003178459|
|-0.84|  [0.89]|-0.6212845482915746|
| -0.8|   [0.8]|-0.5575596763309816|
|-0.76|  [0.84]|-0.5858818416468007|
|-0.71|  [0.55]|-0.3805461431071117|
|-0.68|  [0.88]|-0.6142040069626199|
+-----+--------+-------------------+
only showing top 20 rows
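One thing that might be done with the cached result is to measure the prediction error. The sketch below computes the root-mean-square error (RMSE) in plain Python from (label, prediction) pairs. The function name rmse is hypothetical, and the sample pairs are the first three rows of the table above with predictions rounded to two decimals.

```python
import math


def rmse(pairs):
    """Root-mean-square error over an iterable of (label, prediction) pairs."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("rmse() needs at least one (label, prediction) pair")
    squared_errors = [(label - prediction) ** 2 for label, prediction in pairs]
    return math.sqrt(sum(squared_errors) / len(pairs))


# First three rows of the predictions table, rounded for readability
sample = [(-3.74, -2.65), (-2.58, -1.81), (-2.07, -1.44)]
sample_rmse = rmse(sample)
```

On the full test set, the pairs could be obtained with predictions.select("label", "prediction").collect(). Spark also provides pyspark.ml.evaluation.RegressionEvaluator, which computes the same metric without collecting rows to the driver.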



In [62]:
spark.stop()