# Linear Regression with SparkML
$y=mx+b$
- y (dependent variable)
- x (independent variable)
- m (slope) represents the rate of change of the dependent variable y with respect to the independent variable x.
- b (intercept) represents the value of the dependent variable when the independent variable is zero.
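
To make the roles of m and b concrete, here is a minimal sketch in plain Python that fits a line to a handful of points using the ordinary least-squares formulas for a single feature. The helper name `fit_line` and the sample points are made up for illustration; this is not how Spark ML fits its model internally, only a small instance of the same idea.

```python
def fit_line(xs, ys):
    """Return (m, b) that minimize the squared error of y = m*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: co-variation of x and y divided by the variation of x
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    # Intercept: the fitted line passes through the point of means
    b = mean_y - m * mean_x
    return m, b


# Made-up points that lie exactly on y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
m, b = fit_line(xs, ys)
print("slope =", m, "intercept =", b)
```

With several input columns, as in the rest of this notebook, the single slope m becomes one coefficient per feature.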


# Setup dependencies
I will be using PySpark (Spark ML) for managing data and machine learning, with findspark to locate the Spark installation.
<details>
    <summary>pip install...</summary>

```bash
# Install a Python package
pip install package-name
# Or install a specific version of a Python package
pip install package-name==version
```
</details>


In [1]:
# You can use this section to suppress warnings generated by your code:
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn
warnings.filterwarnings('ignore')

# Importing required libraries

In [2]:
# FindSpark simplifies the process of using Apache Spark with Python
import findspark
findspark.init()

from pyspark.sql import SparkSession

# Import functions/classes for Spark ML

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

# Import functions/classes for metrics
from pyspark.ml.evaluation import RegressionEvaluator

# Creating Spark Session

In [3]:
spark = SparkSession.builder.appName("Regressing using SparkML").getOrCreate()

# Reading CSV Dataset

In [4]:
mpg_data = spark.read.csv("data/mpg.csv", header=True, inferSchema=True)

In [5]:
mpg_data.printSchema()

root
 |-- MPG: double (nullable = true)
 |-- Cylinders: integer (nullable = true)
 |-- Engine Disp: double (nullable = true)
 |-- Horsepower: integer (nullable = true)
 |-- Weight: integer (nullable = true)
 |-- Accelerate: double (nullable = true)
 |-- Year: integer (nullable = true)
 |-- Origin: string (nullable = true)



## Task 3 - Identify the label column and the input columns

In [6]:
# Prepare feature vector
assembler = VectorAssembler(inputCols=["Cylinders", "Engine Disp", "Horsepower", "Weight", "Accelerate", "Year"], outputCol="features")
mpg_transformed_data = assembler.transform(mpg_data)

In [7]:
# Display the assembled "features" and the label column "MPG"
mpg_transformed_data.select("features","MPG").show(5)

+--------------------+----+
|            features| MPG|
+--------------------+----+
|[8.0,390.0,190.0,...|15.0|
|[6.0,199.0,90.0,2...|21.0|
|[6.0,199.0,97.0,2...|18.0|
|[8.0,304.0,150.0,...|16.0|
|[8.0,455.0,225.0,...|14.0|
+--------------------+----+
only showing top 5 rows



## Task 4 - Split the data

In [8]:
# We split the data set in the ratio of 70:30. 70% training data, 30% testing data.
(training_data, testing_data) = mpg_transformed_data.randomSplit([0.7, 0.3], seed=42)

# The "seed" argument fixes the random number generator used for the split, so the same rows land in the same partition each time the notebook is run
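
The effect of a fixed seed can be illustrated without Spark. The sketch below is a plain-Python stand-in, not Spark's actual implementation: the helper `seeded_split` is hypothetical, and it assumes each row is assigned independently by a random draw, which as far as I understand is also why the sizes of a random split are only approximately 70:30.

```python
import random


def seeded_split(rows, train_fraction, seed):
    """Assign each row to train or test using a seeded random generator."""
    rng = random.Random(seed)  # same seed -> same sequence of draws
    train, test = [], []
    for row in rows:
        if rng.random() < train_fraction:
            train.append(row)
        else:
            test.append(row)
    return train, test


rows = list(range(100))  # made-up stand-in for the rows of a dataset
train_a, test_a = seeded_split(rows, 0.7, seed=42)
train_b, test_b = seeded_split(rows, 0.7, seed=42)

# Two calls with the same seed give identical partitions
print(train_a == train_b and test_a == test_b)  # True
# Every row ends up in exactly one of the two partitions
print(len(train_a) + len(test_a))  # 100
```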

## Task 5 - Build and Train a Linear Regression Model

In [9]:
lr = LinearRegression(featuresCol="features", labelCol="MPG")
model = lr.fit(training_data)

## Task 6 - Evaluate the model

In [10]:
predictions = model.transform(testing_data)

##### R Squared
R-squared (R2): R2 is a statistical measure that represents the proportion of variance
in the dependent variable (target) that is explained by the independent variables (features).
Higher values indicate better performance.
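
As a rough illustration of the definition above, R2 can be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The function name `r_squared` and the numbers below are made up for the example; they are not taken from the MPG data.

```python
def r_squared(actual, predicted):
    """R2 = 1 - (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    # Variation left unexplained by the predictions
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Total variation of the actual values around their mean
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Made-up values for illustration only
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.5, 5.5, 7.0, 9.0]
r2_example = r_squared(actual, predicted)
print("R Squared =", r2_example)
```

A model that always predicts the mean of the actual values scores 0 under this formula, and a perfect model scores 1.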

In [11]:
evaluator = RegressionEvaluator(labelCol="MPG", predictionCol="prediction", metricName="r2")
r2 = evaluator.evaluate(predictions)
print("R Squared =", r2)

R Squared = 0.8046190375720326


##### Root Mean Squared Error
Root Mean Squared Error (RMSE): RMSE is the square root of the average of the squared differences
between the predicted and actual values. It is expressed in the same units as the label (MPG here) and
penalizes large errors more heavily than small ones. Lower values indicate better performance.
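
The definition translates directly into a few lines of plain Python. This is a sketch with a hypothetical helper `rmse_by_hand` and made-up numbers, intended only to show the arithmetic.

```python
import math


def rmse_by_hand(actual, predicted):
    """Square root of the mean of the squared prediction errors."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up values: the errors are 3, 4, 0 and 0
actual = [10.0, 20.0, 30.0, 40.0]
predicted = [13.0, 16.0, 30.0, 40.0]
rmse_example = rmse_by_hand(actual, predicted)
print("RMSE =", rmse_example)  # RMSE = 2.5
```

Here the squared errors are 9, 16, 0 and 0, so their mean is 6.25 and the square root is 2.5.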

In [13]:
evaluator = RegressionEvaluator(labelCol="MPG", predictionCol="prediction", metricName="rmse")
rmse = evaluator.evaluate(predictions)
print("RMSE =", rmse)

RMSE = 3.453104969079216


##### Mean Absolute Error
Mean Absolute Error (MAE): MAE is the average of the absolute differences between the predicted and
actual values. It is expressed in the same units as the label (MPG here) and weights every error in
proportion to its size. Lower values indicate better performance.
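
A plain-Python sketch of the same calculation follows. The helper name `mae_by_hand` and the numbers are made up for illustration.

```python
def mae_by_hand(actual, predicted):
    """Mean of the absolute prediction errors."""
    absolute_errors = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(absolute_errors) / len(absolute_errors)


# Made-up values: the absolute errors are 3, 4, 0 and 0
actual = [10.0, 20.0, 30.0, 40.0]
predicted = [13.0, 16.0, 30.0, 40.0]
mae_example = mae_by_hand(actual, predicted)
print("MAE =", mae_example)  # MAE = 1.75
```

The absolute errors sum to 7 over 4 rows, which gives 1.75.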

In [14]:
evaluator = RegressionEvaluator(labelCol="MPG", predictionCol="prediction", metricName="mae")
mae = evaluator.evaluate(predictions)
print("MAE =", mae)

MAE = 2.842391179195012


# Stop Spark Session

In [15]:
spark.stop()