# Linear Regression

## Step 1: Create the SparkSession Object

We start the Jupyter Notebook, import `SparkSession`, and create a new 
`SparkSession` object so that we can use Spark.

In [2]:
import findspark
findspark.init()

In [3]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('lin_reg').getOrCreate()

## Step 2: Read the Dataset

We then load the dataset into a Spark DataFrame. PySpark must be started 
from the directory that contains the dataset; otherwise, we have to pass 
the path of the data folder along with the file name.

In [4]:
df = spark.read.csv('Linear_regression_dataset.csv', inferSchema=True,header=True)
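
If the notebook is started from a different directory, one option is to build the file path explicitly before reading. The sketch below uses only the standard library; the `data` folder name is a hypothetical placeholder and not the dataset's actual location.

```python
from pathlib import Path

# Hypothetical folder holding the CSV file; replace with the real location.
data_dir = Path("data")

# Build the full path to the dataset in an OS-independent way.
csv_path = data_dir / "Linear_regression_dataset.csv"

# The string form is what would be passed to the reader, for example:
# spark.read.csv(str(csv_path), inferSchema=True, header=True)
print(csv_path)
```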

## Step 3: Exploratory Data Analysis

In this section, we take a closer look at the dataset by viewing a few rows, 
validating its shape, and examining statistical measures and correlations 
between the input and output variables.

In [5]:
print((df.count(), len(df.columns)))

(1232, 6)


In [6]:
df.printSchema()

root
 |-- var_1: integer (nullable = true)
 |-- var_2: integer (nullable = true)
 |-- var_3: integer (nullable = true)
 |-- var_4: double (nullable = true)
 |-- var_5: double (nullable = true)
 |-- output: double (nullable = true)



In [7]:
df.head(3)

[Row(var_1=734, var_2=688, var_3=81, var_4=0.328, var_5=0.259, output=0.418),
 Row(var_1=700, var_2=600, var_3=94, var_4=0.32, var_5=0.247, output=0.389),
 Row(var_1=712, var_2=705, var_3=93, var_4=0.311, var_5=0.247, output=0.417)]

In [8]:
from pyspark.sql.functions import corr

df.select(corr('var_1','output')).show()

+-------------------+
|corr(var_1, output)|
+-------------------+
| 0.9187399607627283|
+-------------------+
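
To make the number above more concrete, here is a minimal plain-Python sketch of the Pearson correlation coefficient, which is the measure `corr` reports. The helper name `pearson_corr` is an illustrative assumption, and the inputs are only the first five rows shown in this walkthrough, so the result will differ from the value computed on all 1,232 rows.

```python
import math

def pearson_corr(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# First five rows of var_1 and output, as displayed in this walkthrough.
var_1 = [734, 700, 712, 734, 613]
output = [0.418, 0.389, 0.417, 0.415, 0.378]

r = pearson_corr(var_1, output)
print(r)
```

A value close to 1 indicates a strong positive linear relationship between the input and the output.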



## Step 4: Feature Engineering

In this step, we use Spark’s `VectorAssembler` to combine all input features 
into a single vector. For each row, it produces one feature vector that 
holds that row’s input values. So, instead of five separate input columns, 
we get a single feature vector column.

In [11]:
from pyspark.ml.linalg import Vector
from pyspark.ml.feature import VectorAssembler

In [12]:
df.columns

['var_1', 'var_2', 'var_3', 'var_4', 'var_5', 'output']

In [15]:
vec_assembler = VectorAssembler(inputCols=['var_1', 'var_2', 'var_3', 'var_4', 'var_5'], outputCol='features')

features_df = vec_assembler.transform(df)

In [16]:
features_df.printSchema()

root
 |-- var_1: integer (nullable = true)
 |-- var_2: integer (nullable = true)
 |-- var_3: integer (nullable = true)
 |-- var_4: double (nullable = true)
 |-- var_5: double (nullable = true)
 |-- output: double (nullable = true)
 |-- features: vector (nullable = true)



In [17]:
features_df.select('features').show(5,False)

+------------------------------+
|features                      |
+------------------------------+
|[734.0,688.0,81.0,0.328,0.259]|
|[700.0,600.0,94.0,0.32,0.247] |
|[712.0,705.0,93.0,0.311,0.247]|
|[734.0,806.0,69.0,0.315,0.26] |
|[613.0,759.0,61.0,0.302,0.24] |
+------------------------------+
only showing top 5 rows
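
Conceptually, the assembling step can be pictured with the plain-Python sketch below. It is not how `VectorAssembler` is implemented; the `assemble` helper is a hypothetical name used only to illustrate how five column values become one vector per row.

```python
# Column order used for the feature vector (the five input columns).
input_cols = ["var_1", "var_2", "var_3", "var_4", "var_5"]

# First row of the dataset, written as a plain dictionary.
row = {"var_1": 734, "var_2": 688, "var_3": 81,
       "var_4": 0.328, "var_5": 0.259, "output": 0.418}

def assemble(row, cols):
    """Collect the chosen columns into one list of floats, in order."""
    return [float(row[c]) for c in cols]

features = assemble(row, input_cols)
print(features)  # [734.0, 688.0, 81.0, 0.328, 0.259]
```

The integer columns are converted to floating-point values, which matches the first row of the `features` column shown above.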



In [18]:
model_df = features_df.select('features','output')
model_df.show(5,False)

+------------------------------+------+
|features                      |output|
+------------------------------+------+
|[734.0,688.0,81.0,0.328,0.259]|0.418 |
|[700.0,600.0,94.0,0.32,0.247] |0.389 |
|[712.0,705.0,93.0,0.311,0.247]|0.417 |
|[734.0,806.0,69.0,0.315,0.26] |0.415 |
|[613.0,759.0,61.0,0.302,0.24] |0.378 |
+------------------------------+------+
only showing top 5 rows



In [19]:
print((model_df.count(), len(model_df.columns)))

(1232, 2)


## Step 5: Splitting the Dataset

We split the dataset into training and test sets so that we can train the 
Linear Regression model and then evaluate its performance. We use a 70/30 
ratio and train the model on roughly 70% of the data. Because `randomSplit` 
assigns rows randomly, the resulting sizes are approximately, not exactly, 
70% and 30%. We can print the shape of the train and test data to validate 
the sizes.

In [20]:
train_df, test_df = model_df.randomSplit([0.7, 0.3])

In [21]:
print((train_df.count(), len(train_df.columns)))

(883, 2)


In [22]:
print((test_df.count(), len(test_df.columns)))

(349, 2)
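
The counts above are close to, but not exactly, 70% and 30% of 1,232 rows. The plain-Python sketch below illustrates why a row-by-row random assignment behaves this way. It is an illustration only, not Spark's implementation, and `random_split` is a hypothetical helper. Note that `randomSplit` also accepts an optional `seed` argument, which makes a split repeatable across runs.

```python
import random

def random_split(rows, train_weight, seed):
    """Assign each row to train or test with an independent random draw."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        # Each row lands in the training set with probability train_weight.
        (train if rng.random() < train_weight else test).append(row)
    return train, test

rows = list(range(1232))  # stand-in for the 1,232 rows of the dataset
train, test = random_split(rows, train_weight=0.7, seed=42)
print(len(train), len(test))
```

With a fixed seed the same rows end up in each set every time, which helps when comparing models across runs.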


## Step 6: Build and Train Linear Regression Model

In this part, we build and train the Linear Regression model using the 
`features` column as input and the `output` column as the label. We can 
fetch the coefficients (B1, B2, B3, B4, B5) and the intercept (B0) of the 
fitted model. We can also evaluate the model on the training data using r2. 
The model reaches an r2 of about 0.87 on the training data, which means it 
explains roughly 87% of the variance in the output.

In [25]:
from pyspark.ml.regression import LinearRegression

lin_Reg = LinearRegression(labelCol='output')

lr_model = lin_Reg.fit(train_df)

In [26]:
print(lr_model.coefficients)

[0.0003416993111494008,5.3460409502920966e-05,0.00018721969114898545,-0.6379763976924058,0.4852854913533292]


In [27]:
print(lr_model.intercept)

0.18208587329270542
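
To see how these values are used, the sketch below applies the linear model by hand: the prediction is the intercept plus the sum of each coefficient times its feature value. The coefficients and intercept are copied from the output above, the `predict` helper is a hypothetical name, and the first row of the dataset is used purely as an example (it may have landed in either the training or the test set).

```python
# Coefficients (B1..B5) and intercept (B0) printed by the fitted model above.
coefficients = [0.0003416993111494008, 5.3460409502920966e-05,
                0.00018721969114898545, -0.6379763976924058,
                0.4852854913533292]
intercept = 0.18208587329270542

def predict(features, coefficients, intercept):
    """Linear model: intercept plus the weighted sum of the features."""
    return intercept + sum(b * x for b, x in zip(coefficients, features))

# Feature vector and actual output of the first row of the dataset.
first_row = [734.0, 688.0, 81.0, 0.328, 0.259]
actual = 0.418

predicted = predict(first_row, coefficients, intercept)
print(predicted, actual - predicted)
```

The difference between the actual and predicted values is the residual for that row.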


In [29]:
training_predictions = lr_model.evaluate(train_df)

In [30]:
print(training_predictions.r2)

0.8691103397865702


## Step 7: Evaluate Linear Regression Model on Test Data

The final part of this exercise is to check the performance of the model 
on unseen (test) data. We use the `evaluate` function to make predictions 
for the test data and use r2 and the mean squared error to measure how well 
the model fits. The r2 on the test data (about 0.87) is almost identical to 
the r2 on the training data.

In [34]:
test_predictions = lr_model.evaluate(test_df)

In [36]:
print(test_predictions.r2)

0.8691408114274075


In [38]:
print(test_predictions.meanSquaredError)

0.00013818268764449108
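
For reference, the sketch below shows the usual textbook definitions of these two metrics in plain Python. The helper names and the four toy values are hypothetical and are not taken from the dataset. The last lines take the square root of the test mean squared error reported above to express the error in the same units as the `output` column.

```python
import math

def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def r2_score(actual, predicted):
    """1 minus the ratio of residual to total sum of squares."""
    mean_a = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Small made-up example, not taken from the dataset.
actual = [0.40, 0.42, 0.38, 0.41]
predicted = [0.41, 0.41, 0.39, 0.40]
print(r2_score(actual, predicted), mean_squared_error(actual, predicted))

# Root mean squared error from the test MSE reported above.
test_mse = 0.00013818268764449108
rmse = math.sqrt(test_mse)
print(rmse)
```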
