## Regression using SparkML


## __Table of Contents__

<ol>
  <li>
    <a href="#Objectives">Objectives
    </a>
  </li>
  <li>
    <a href="#Datasets">Datasets
    </a>
  </li>
  <li>
    <a href="#Setup">Setup
    </a>
    <ol>
      <li>
        <a href="#Installing-Required-Libraries">Installing Required Libraries
        </a>
      </li>
      <li>
        <a href="#Importing-Required-Libraries">Importing Required Libraries
        </a>
      </li>
    </ol>
  </li>
  <li>
    <a href="#Examples">Examples
    </a>
    <ol>
    <li>
      <a href="#Task-1---Create-a-spark-session">Task 1 - Create a spark session
      </a>
    </li>
    <li>
      <a href="#Task-2---Load-the-data-in-a-csv-file-into-a-dataframe">Task 2 - Load the data in a csv file into a dataframe
      </a>
    </li>
    <li>
      <a href="#Task-3---Identify-the-label-column-and-the-input-columns">Task 3 - Identify the label column and the input columns
      </a>
    </li>
    <li>
      <a href="#Task-4---Split-the-data">Task 4 - Split the data
      </a>
    </li>
    <li>
      <a href="#Task-5---Build-and-Train-a-Linear-Regression-Model">Task 5 - Build and Train a Linear Regression Model
      </a>
    </li>
    <li>
      <a href="#Task-6---Evaluate-the-model">Task 6 - Evaluate the model
      </a>
    </li>
    </ol>
  </li>
  <li>
    <a href="#Exercises">Exercises
    </a>
    <ol>
      <li>
        <a href="#Exercise-1---Create-a-spark-session">Exercise 1 - Create a spark session
        </a>
      </li>
      <li>
        <a href="#Exercise-2---Load-the-data-in-a-csv-file-into-a-dataframe">Exercise 2 - Load the data in a csv file into a dataframe
        </a>
      </li>
      <li>
        <a href="#Exercise-3---Identify-the-label-column-and-the-input-columns">Exercise 3 - Identify the label column and the input columns
        </a>
      </li>
      <li>
        <a href="#Exercise-4---Split-the-data">Exercise 4 - Split the data
        </a>
      </li>
      <li>
        <a href="#Exercise-5---Build-and-Train-a-Linear-Regression-Model">Exercise 5 - Build and Train a Linear Regression Model
        </a>
      </li>
      <li>
        <a href="#Exercise-6---Evaluate-the-model">Exercise 6 - Evaluate the model
        </a>
      </li>
    </ol>
  </li>
</ol>


## Objectives

After completing this lab you will be able to:

 - Use PySpark to connect to a Spark cluster.
 - Create a Spark session.
 - Read a CSV file into a data frame.
 - Split the dataset into training and testing sets.
 - Use VectorAssembler to combine multiple columns into a single vector column.
 - Use Linear Regression to build a prediction model.
 - Use metrics to evaluate the model.
 - Stop the Spark session.





## Datasets

In this lab you will use the following datasets:

 - Modified version of car mileage dataset. Original dataset available at https://archive.ics.uci.edu/ml/datasets/auto+mpg
 - Modified version of diamonds dataset. Original dataset available at https://www.openml.org/search?type=data&sort=runs&id=42225&status=active



----


## Setup


### Installing Required Libraries

The following required libraries are __not__ pre-installed in the Skills Network Labs environment. __You will need to run the following cell__ to install them:


In [1]:
!pip install pyspark==3.1.2 -q
!pip install findspark -q



In [2]:
# You can use this section to suppress warnings generated by your code:
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn
warnings.filterwarnings('ignore')


### Importing Required Libraries

_We recommend you import all required libraries in one place (here):_


In [3]:
# FindSpark simplifies the process of using Apache Spark with Python

import findspark
findspark.init()

from pyspark.sql import SparkSession

#import functions/Classes for sparkml

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

# import functions/Classes for metrics
from pyspark.ml.evaluation import RegressionEvaluator


## Task 1 - Create a spark session


In [4]:
# Create the SparkSession
# Ignore any warnings emitted by the SparkSession command

spark = SparkSession.builder.appName("Regression using SparkML").getOrCreate()

## Task 2 - Load the data in a csv file into a dataframe


Download the data file


In [5]:
!wget https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-BD0231EN-SkillsNetwork/datasets/mpg.csv


--2024-06-17 18:11:40--  https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-BD0231EN-SkillsNetwork/datasets/mpg.csv
Resolving cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)... 169.45.118.108
Connecting to cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)|169.45.118.108|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 13891 (14K) [text/csv]
Saving to: ‘mpg.csv’


2024-06-17 18:11:41 (304 MB/s) - ‘mpg.csv’ saved [13891/13891]



Load the dataset into the spark dataframe


In [6]:
# Use the spark.read.csv function to load the data into a dataframe.
# header=True tells Spark that the first row of our CSV file is a header row.
# inferSchema=True tells Spark to work out the data type of each column automatically.

# Load mpg dataset
mpg_data = spark.read.csv("mpg.csv", header=True, inferSchema=True)


Print the schema of the dataset


In [7]:
mpg_data.printSchema()

root
 |-- MPG: double (nullable = true)
 |-- Cylinders: integer (nullable = true)
 |-- Engine Disp: double (nullable = true)
 |-- Horsepower: integer (nullable = true)
 |-- Weight: integer (nullable = true)
 |-- Accelerate: double (nullable = true)
 |-- Year: integer (nullable = true)
 |-- Origin: string (nullable = true)
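
To make the effect of `header=True` and `inferSchema=True` concrete, here is a minimal pure-Python sketch of the idea: use the first line as column names, then pick the narrowest type that fits every value in a column. The helper `infer_type` and the two sample rows are assumptions made for illustration; Spark's own inference handles more types and edge cases than this.

```python
import csv
import io

# Two sample rows in the shape of mpg.csv (illustrative data only).
CSV_TEXT = """MPG,Cylinders,Engine Disp,Horsepower,Weight,Accelerate,Year,Origin
15.0,8,390.0,190,3850,8.5,70,American
21.0,6,199.0,90,2648,15.0,70,American
"""


def infer_type(values):
    """Return the narrowest type name that fits every value (hypothetical helper)."""
    for type_name, caster in (("integer", int), ("double", float)):
        try:
            for value in values:
                caster(value)
            return type_name
        except ValueError:
            continue
    return "string"


# DictReader treats the first line as the header, similar in spirit to header=True.
reader = csv.DictReader(io.StringIO(CSV_TEXT))
rows = list(reader)

# Infer one type per column from all of its values, similar in spirit to inferSchema=True.
schema = {name: infer_type([row[name] for row in rows]) for name in reader.fieldnames}

for name, type_name in schema.items():
    print(f"{name}: {type_name}")
```

Without schema inference, every column would stay a string, and the numeric columns could not be fed to the regression model without explicit casts.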



Show the top 5 rows of the dataset


In [8]:
mpg_data.show(5)

+----+---------+-----------+----------+------+----------+----+--------+
| MPG|Cylinders|Engine Disp|Horsepower|Weight|Accelerate|Year|  Origin|
+----+---------+-----------+----------+------+----------+----+--------+
|15.0|        8|      390.0|       190|  3850|       8.5|  70|American|
|21.0|        6|      199.0|        90|  2648|      15.0|  70|American|
|18.0|        6|      199.0|        97|  2774|      15.5|  70|American|
|16.0|        8|      304.0|       150|  3433|      12.0|  70|American|
|14.0|        8|      455.0|       225|  3086|      10.0|  70|American|
+----+---------+-----------+----------+------+----------+----+--------+
only showing top 5 rows



## Task 3 - Identify the label column and the input columns


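The label column (the value we want to predict) is "MPG". The remaining numeric columns are the inputs. We ask the VectorAssembler to combine these input columns (`inputCols`) into a single vector column named "features".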


In [9]:
# Prepare feature vector
assembler = VectorAssembler(inputCols=["Cylinders", "Engine Disp", "Horsepower", "Weight", "Accelerate", "Year"], outputCol="features")
mpg_transformed_data = assembler.transform(mpg_data)


Display the assembled "features" and the label column "MPG"


In [10]:
mpg_transformed_data.select("features","MPG").show()

+--------------------+----+
|            features| MPG|
+--------------------+----+
|[8.0,390.0,190.0,...|15.0|
|[6.0,199.0,90.0,2...|21.0|
|[6.0,199.0,97.0,2...|18.0|
|[8.0,304.0,150.0,...|16.0|
|[8.0,455.0,225.0,...|14.0|
|[8.0,350.0,165.0,...|15.0|
|[8.0,307.0,130.0,...|18.0|
|[8.0,454.0,220.0,...|14.0|
|[8.0,400.0,150.0,...|15.0|
|[8.0,307.0,200.0,...|10.0|
|[8.0,383.0,170.0,...|15.0|
|[8.0,318.0,210.0,...|11.0|
|[8.0,360.0,215.0,...|10.0|
|[8.0,429.0,198.0,...|15.0|
|[6.0,200.0,85.0,2...|21.0|
|[8.0,302.0,140.0,...|17.0|
|[8.0,304.0,193.0,...| 9.0|
|[8.0,340.0,160.0,...|14.0|
|[6.0,198.0,95.0,2...|22.0|
|[8.0,440.0,215.0,...|14.0|
+--------------------+----+
only showing top 20 rows
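
As a rough illustration of what the assembled column contains, the following pure-Python sketch builds the same kind of feature vector for the first two rows of the dataset. The function `assemble` is a hypothetical stand-in, not part of the Spark API; it only mimics the observable result (numeric inputs concatenated, in order, as doubles).

```python
# Input columns in the same order as passed to VectorAssembler in this lab.
FEATURE_COLUMNS = ["Cylinders", "Engine Disp", "Horsepower", "Weight", "Accelerate", "Year"]

# First two rows of the mpg dataset, as displayed by mpg_data.show(5).
rows = [
    {"MPG": 15.0, "Cylinders": 8, "Engine Disp": 390.0, "Horsepower": 190,
     "Weight": 3850, "Accelerate": 8.5, "Year": 70, "Origin": "American"},
    {"MPG": 21.0, "Cylinders": 6, "Engine Disp": 199.0, "Horsepower": 90,
     "Weight": 2648, "Accelerate": 15.0, "Year": 70, "Origin": "American"},
]


def assemble(row, input_cols):
    """Concatenate the chosen columns into one list of floats (hypothetical helper)."""
    return [float(row[col]) for col in input_cols]


# Keep the label next to the assembled vector, as in select("features", "MPG").
assembled = [{"features": assemble(row, FEATURE_COLUMNS), "MPG": row["MPG"]} for row in rows]

for item in assembled:
    print(item["features"], item["MPG"])
```

Note that the string column "Origin" is left out: it is not numeric, so it would need to be encoded before it could join the feature vector.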



## Task 4 - Split the data


We split the data set in a 70:30 ratio: roughly 70% of the rows for training and 30% for testing.


In [11]:
# Split data into training and testing sets
(training_data, testing_data) = mpg_transformed_data.randomSplit([0.7, 0.3], seed=42)


The `seed` parameter fixes the random sampling that assigns each row to a split. Pass the same integer to get the same split each time you run the cell on the same data. Because rows are assigned at random, the resulting sizes are only approximately 70% and 30%.
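
The role of the seed can be sketched in plain Python. This is a conceptual illustration under the assumption that each row is assigned to a split by one random draw; the helper `random_split` is hypothetical and is not Spark's implementation.

```python
import random


def random_split(rows, weights, seed):
    """Assign each row to a split by a seeded random draw (conceptual sketch only)."""
    total = sum(weights)
    # Cumulative upper boundaries, e.g. [0.7, 1.0] for weights [0.7, 0.3].
    boundaries = []
    running = 0.0
    for weight in weights:
        running += weight / total
        boundaries.append(running)

    rng = random.Random(seed)  # same seed -> same sequence of draws
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for index, upper in enumerate(boundaries):
            # The last split also catches any draw missed due to rounding.
            if draw < upper or index == len(boundaries) - 1:
                splits[index].append(row)
                break
    return splits


rows = list(range(100))
train_a, test_a = random_split(rows, [0.7, 0.3], seed=42)
train_b, test_b = random_split(rows, [0.7, 0.3], seed=42)

print(len(train_a), len(test_a))               # sizes are close to, not exactly, 70 and 30
print(train_a == train_b, test_a == test_b)    # identical splits for the same seed
```

The point of the sketch is that the seed makes the split repeatable, while the split sizes remain approximate.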


## Task 5 - Build and Train a Linear Regression Model


Create a linear regression model and train it on the training data set


In [12]:
# Train linear regression model

lr = LinearRegression(featuresCol="features", labelCol="MPG")
model = lr.fit(training_data)
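
To give some intuition for what fitting does, here is a minimal sketch of ordinary least squares with a single feature, written in plain Python. It is a simplification: the model in this lab uses six features and Spark's own solver, and the helper name `fit_simple_linear_regression` is made up for this example. The sample values are the Weight and MPG columns of the five rows shown earlier.

```python
def fit_simple_linear_regression(xs, ys):
    """Closed-form least squares for y = slope * x + intercept (illustrative helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Weight and MPG of the five rows displayed by mpg_data.show(5).
weights = [3850, 2648, 2774, 3433, 3086]
mpg = [15.0, 21.0, 18.0, 16.0, 14.0]

slope, intercept = fit_simple_linear_regression(weights, mpg)
print("slope =", slope)
print("intercept =", intercept)

# A prediction is the intercept plus the slope times the feature value.
print("predicted MPG for a 3000 lb car =", intercept + slope * 3000)
```

On these five rows the slope comes out negative, matching the intuition that heavier cars tend to have lower mileage. Five rows are far too few to draw conclusions; the sketch only shows the mechanics.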

## Task 6 - Evaluate the model


Your model is now trained. We use the testing data to make predictions.


In [13]:
# Make predictions on testing data
predictions = model.transform(testing_data)

In [14]:
predictions.show(5)

+----+---------+-----------+----------+------+----------+----+--------+--------------------+------------------+
| MPG|Cylinders|Engine Disp|Horsepower|Weight|Accelerate|Year|  Origin|            features|        prediction|
+----+---------+-----------+----------+------+----------+----+--------+--------------------+------------------+
|10.0|        8|      360.0|       215|  4615|      14.0|  70|American|[8.0,360.0,215.0,...| 6.683344024048662|
|11.0|        8|      429.0|       208|  4633|      11.0|  72|American|[8.0,429.0,208.0,...| 8.344953219723493|
|12.0|        8|      350.0|       180|  4499|      12.5|  73|American|[8.0,350.0,180.0,...|10.043420590827143|
|12.0|        8|      383.0|       180|  4955|      11.5|  71|American|[8.0,383.0,180.0,...| 5.252194346982389|
|13.0|        8|      302.0|       129|  3169|      12.0|  75|American|[8.0,302.0,129.0,...|21.473697417345097|
+----+---------+-----------+----------+------+----------+----+--------+--------------------+------------------+

##### R Squared


In [15]:
#R-squared (R2): R2 is a statistical measure that represents the proportion of variance
#in the dependent variable (target) that is explained by the independent variables (features).
#Higher values indicate better performance.

evaluator = RegressionEvaluator(labelCol="MPG", predictionCol="prediction", metricName="r2")
r2 = evaluator.evaluate(predictions)
print("R Squared =", r2)


R Squared = 0.8046190375720325


##### Root Mean Squared Error


In [16]:
#Root Mean Squared Error (RMSE): RMSE is the square root of the average of the squared differences
#between the predicted and actual values. It measures the average distance between the predicted
#and actual values, and lower values indicate better performance.

evaluator = RegressionEvaluator(labelCol="MPG", predictionCol="prediction", metricName="rmse")
rmse = evaluator.evaluate(predictions)
print("RMSE =", rmse)


RMSE = 3.453104969079217


##### Mean Absolute Error


In [17]:
#Mean Absolute Error (MAE): MAE is the average of the absolute differences between the predicted and
#actual values. It measures the average absolute distance between the predicted and actual values, and
#lower values indicate better performance.

evaluator = RegressionEvaluator(labelCol="MPG", predictionCol="prediction", metricName="mae")
mae = evaluator.evaluate(predictions)
print("MAE =", mae)


MAE = 2.8423911791950123


Stop the Spark session


In [18]:
spark.stop()