In [3]:
!apt-get update

Get:1 http://security.ubuntu.com/ubuntu bionic-security InRelease [88.7 kB]
Ign:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  InRelease
Get:3 https://cloud.r-project.org/bin/linux/ubuntu bionic-cran40/ InRelease [3,626 B]
Hit:4 http://archive.ubuntu.com/ubuntu bionic InRelease
Get:5 http://ppa.launchpad.net/c2d4u.team/c2d4u4.0+/ubuntu bionic InRelease [15.9 kB]
Ign:6 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64  InRelease
Get:7 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  Release [697 B]
Hit:8 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64  Release
Get:9 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  Release.gpg [836 B]
Get:10 http://archive.ubuntu.com/ubuntu bionic-updates InRelease [88.7 kB]
Hit:11 http://ppa.launchpad.net/graphics-drivers/ppa/ubuntu bionic InRelease
Get:12 http://secur

In [4]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://downloads.apache.org/spark/spark-2.4.7/spark-2.4.7-bin-hadoop2.7.tgz
!tar xf spark-2.4.7-bin-hadoop2.7.tgz
!pip install -q findspark

In [6]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.7-bin-hadoop2.7"

In [7]:
import findspark
findspark.init()

# **Linear Regression**

In this notebook, we build a Linear Regression model using Spark’s MLlib library and
predict the target variable from the input features.

### Data Info
The dataset used in this example is a dummy dataset with 1,232 rows and
6 columns. We use the 5 input variables to predict the target variable
with the Linear Regression model.

Step **1**: Create the SparkSession Object

In [9]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('lin_reg').getOrCreate()

Step **2**: Read the Dataset

In [10]:
df = spark.read.csv('Linear_regression_dataset.csv',inferSchema=True,header=True)

Step **3**: Exploratory Data Analysis

In [13]:
# Shape of the dataset: (number of rows, number of columns)
print((df.count(), len(df.columns)))

(1232, 6)


In [14]:
df.printSchema()

root
 |-- var_1: integer (nullable = true)
 |-- var_2: integer (nullable = true)
 |-- var_3: integer (nullable = true)
 |-- var_4: double (nullable = true)
 |-- var_5: double (nullable = true)
 |-- output: double (nullable = true)



In [16]:
df.describe().show()

+-------+-----------------+-----------------+------------------+--------------------+--------------------+-------------------+
|summary|            var_1|            var_2|             var_3|               var_4|               var_5|             output|
+-------+-----------------+-----------------+------------------+--------------------+--------------------+-------------------+
|  count|             1232|             1232|              1232|                1232|                1232|               1232|
|   mean|715.0819805194806|715.0819805194806| 80.90422077922078|  0.3263311688311693| 0.25927272727272715|0.39734172077922014|
| stddev| 91.5342940441652|93.07993263118064|11.458139049993724|0.015012772334166148|0.012907228928000298|0.03326689862173776|
|    min|              463|              472|                40|               0.277|               0.214|              0.301|
|    max|             1009|             1103|               116|               0.373|               0.294|     

This gives us a sense of the distribution, center, and spread of each
column in the dataset. We then take a quick look at the first few rows
using the head function:

In [19]:
df.head(4)

[Row(var_1=734, var_2=688, var_3=81, var_4=0.328, var_5=0.259, output=0.418),
 Row(var_1=700, var_2=600, var_3=94, var_4=0.32, var_5=0.247, output=0.389),
 Row(var_1=712, var_2=705, var_3=93, var_4=0.311, var_5=0.247, output=0.417),
 Row(var_1=734, var_2=806, var_3=69, var_4=0.315, var_5=0.26, output=0.415)]

We can check the correlation between an input variable and the output
variable using the corr function:

In [27]:
from pyspark.sql.functions import corr
df.select(corr('var_1','output')).show()
# var_1 is strongly correlated with the output column (correlation of about 0.92).

+-------------------+
|corr(var_1, output)|
+-------------------+
| 0.9187399607627283|
+-------------------+
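
To make the number above less of a black box, here is a minimal plain-Python sketch of the Pearson correlation coefficient, which is the statistic we assume `corr` reports by default. The helper name `pearson_corr` is our own, and the data is only the four rows printed by `df.head(4)`, so the result will not match the full-dataset value of about 0.92.

```python
import math

def pearson_corr(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (proportional to the covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (proportional to the variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# The four rows printed by df.head(4) above (var_1 and output only).
var_1 = [734, 700, 712, 734]
output = [0.418, 0.389, 0.417, 0.415]

r = pearson_corr(var_1, output)
print(f"corr(var_1, output) on four rows: {r:.3f}")
```

In the notebook itself, repeating `df.select(corr(c, 'output'))` for each input column would show which variable is most strongly correlated with the output.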



Step **4**: Feature Engineering

In this step, we combine all input features into a single vector by using
Spark’s VectorAssembler. For each row, it produces one vector that holds
that row's input values. So, instead of five separate input columns, we get
a single feature vector column.
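
As a rough illustration of what the assembler does to each row, the following plain-Python sketch gathers the five input values of one row into a single list of floats. The function `assemble` and the dict representation of a row are our own stand-ins, not part of Spark's API.

```python
def assemble(row, input_cols):
    """Collect the named columns of one row into a single list of floats."""
    return [float(row[col]) for col in input_cols]

input_cols = ["var_1", "var_2", "var_3", "var_4", "var_5"]

# First row printed by df.head(4) above, written as a plain dict.
row = {"var_1": 734, "var_2": 688, "var_3": 81,
       "var_4": 0.328, "var_5": 0.259, "output": 0.418}

features = assemble(row, input_cols)
print(features)  # [734.0, 688.0, 81.0, 0.328, 0.259]
```

Note that the integer columns become floats and the output column is left out, since only the columns named in `input_cols` enter the vector.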

In [51]:

from pyspark.ml.feature import VectorAssembler

One can select any subset of columns to use as input features and pass
only those columns to the VectorAssembler. In our case, we pass all five
input columns to create a single feature vector column.

In [52]:
df.columns

['var_1', 'var_2', 'var_3', 'var_4', 'var_5', 'output']

In [53]:
vec_assembler = VectorAssembler(inputCols=['var_1', 'var_2', 'var_3', 'var_4', 'var_5'],outputCol='features')

In [54]:
features_df= vec_assembler.transform(df)
features_df.printSchema()

root
 |-- var_1: integer (nullable = true)
 |-- var_2: integer (nullable = true)
 |-- var_3: integer (nullable = true)
 |-- var_4: double (nullable = true)
 |-- var_5: double (nullable = true)
 |-- output: double (nullable = true)
 |-- features: vector (nullable = true)



As we can see, we have an additional column (‘features’) that contains
a single dense vector of all the inputs for each row.

In [55]:
features_df.select('features').show(5)

+--------------------+
|            features|
+--------------------+
|[734.0,688.0,81.0...|
|[700.0,600.0,94.0...|
|[712.0,705.0,93.0...|
|[734.0,806.0,69.0...|
|[613.0,759.0,61.0...|
+--------------------+
only showing top 5 rows



We take the subset of the dataframe and select only the features
column and the output column to build the Linear Regression model.

In [56]:
model_df = features_df.select('features','output')
model_df.show(5)

+--------------------+------+
|            features|output|
+--------------------+------+
|[734.0,688.0,81.0...| 0.418|
|[700.0,600.0,94.0...| 0.389|
|[712.0,705.0,93.0...| 0.417|
|[734.0,806.0,69.0...| 0.415|
|[613.0,759.0,61.0...| 0.378|
+--------------------+------+
only showing top 5 rows



In [57]:
print((model_df.count(),len(model_df.columns)))

(1232, 2)


Step **5**: Splitting the Dataset

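We split the dataset into training and test sets so that we can train the
model and then evaluate it on data it has not seen. We use a 70/30 ratio and
train our model on roughly 70% of the dataset. We can print the shape of the
train and test data to validate the sizes.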

In [58]:
train_df,test_df=model_df.randomSplit([0.7,0.3])
print((train_df.count(),len(train_df.columns)))
print((test_df.count(),len(test_df.columns)))


(864, 2)
(368, 2)
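
The training set above holds 864 of 1,232 rows, which is close to 70% but not exactly 70%. One way to picture why is the sketch below, which assigns each row to a split with an independent random draw. This is a simplified stand-in written in plain Python, not Spark's actual implementation, and the function name `random_split` and the seed value are our own choices.

```python
import random

def random_split(rows, train_fraction, seed):
    """Assign each row to train or test with an independent random draw."""
    rng = random.Random(seed)  # fixed seed makes the split repeatable
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test

rows = list(range(1232))  # stand-in for the 1,232 rows of model_df
train, test = random_split(rows, train_fraction=0.7, seed=42)
print(len(train), len(test))
```

In the notebook, `randomSplit` also accepts a `seed` argument, for example `model_df.randomSplit([0.7, 0.3], seed=42)`. Without it, rerunning the cell will generally give different row counts and slightly different model metrics.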


Step **6**: Build and Train Linear Regression Model

In [59]:
from pyspark.ml.regression import LinearRegression
lin_Reg = LinearRegression(labelCol='output')
lr_model = lin_Reg.fit(train_df)
print(lr_model.coefficients)

[0.0003374392956775735,5.160609482977743e-05,0.0001318541177160561,-0.628626969273809,0.5276636824697607]


In [60]:
print(lr_model.intercept)


0.1768447412971254
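
To see how the coefficients and the intercept turn into a prediction, the sketch below applies the fitted linear equation by hand to the first row shown by `df.head(4)`. The values are copied from the outputs above; the helper `predict` is our own and only mirrors the formula, not Spark's internals.

```python
# Coefficients and intercept copied from the outputs printed above.
coefficients = [0.0003374392956775735, 5.160609482977743e-05,
                0.0001318541177160561, -0.628626969273809,
                0.5276636824697607]
intercept = 0.1768447412971254

def predict(features):
    """Linear model: intercept plus the dot product of weights and features."""
    return intercept + sum(w * x for w, x in zip(coefficients, features))

# First row from df.head(4): var_1..var_5, with observed output 0.418.
first_row = [734, 688, 81, 0.328, 0.259]
prediction = predict(first_row)
print(f"predicted {prediction:.3f}, observed 0.418")
```

The prediction comes out near 0.40 against an observed value of 0.418. Because the split is random, this row may have landed in either the training or the test set.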


In [62]:
training_predictions = lr_model.evaluate(train_df)
print(training_predictions.r2)
# R-squared is about 0.87: the model explains roughly 87% of the variance in the training output.

0.8657799077266927
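
The `r2` value is the coefficient of determination, not a classification-style accuracy. A minimal sketch of the usual definition, 1 minus the ratio of residual to total sum of squares, is given below. The function name `r_squared` and the sample values are hypothetical and serve only to illustrate the formula.

```python
def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, the share of variance explained."""
    mean_y = sum(y_true) / len(y_true)
    # Residual sum of squares: error left over after the model's predictions.
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean.
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1 - ss_res / ss_tot

# Hypothetical values, chosen only to illustrate the formula.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 3.8]
print(r_squared(y_true, y_pred))
```

A value of 1 means perfect predictions, while a value of 0 means the model does no better than always predicting the mean of the target.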


Step **7**: Evaluate Linear Regression Model on Test Data

In [63]:
test_predictions = lr_model.evaluate(test_df)
print(test_predictions.r2)

0.875806489988025


In [64]:
print(test_predictions.meanSquaredError)

0.0001424534922119856
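
The mean squared error is easier to interpret once converted to a root mean squared error (RMSE), which is in the same units as the output column. The sketch below shows the MSE formula on hypothetical values and then takes the square root of the test MSE reported above. The function name `mean_squared_error` is our own.

```python
import math

def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between observed and predicted."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical values, chosen only to illustrate the formula.
y_true = [0.40, 0.42, 0.38]
y_pred = [0.41, 0.40, 0.38]
mse = mean_squared_error(y_true, y_pred)

# Test-set MSE reported by the notebook above, converted to RMSE.
reported_mse = 0.0001424534922119856
rmse = math.sqrt(reported_mse)
print(f"toy MSE: {mse:.6f}, RMSE from reported MSE: {rmse:.4f}")
```

An RMSE of roughly 0.012 against a mean output of about 0.397 suggests that typical prediction errors are around 3% of the target's average value.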
