<a href="https://colab.research.google.com/github/fauzanghazi/mllib-spark/blob/main/notebook/spark-mllib.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Machine Learning with Spark MLlib via PySpark

This notebook demonstrates **Spark MLlib**, the scalable machine learning library built on **Apache Spark**, in an end-to-end regression workflow. Using **PySpark**, we load the California housing dataset, clean it, engineer features, and train a linear regression model.

The workflow includes Spark-native techniques like `Imputer`, `VectorAssembler`, and `StandardScaler` for pipeline construction, ensuring scalability across distributed environments.

Finally, we evaluate the model on the held-out test set with `RegressionMetrics`.

The notebook is intended to run on *Colab* to avoid local Java installation issues.


### Install PySpark

This installs the PySpark library, which provides the Python API for Apache Spark.

In [1]:
%pip install pyspark



### Initialize Spark Session

Creates a local Spark session named **MRTB1163** with a custom UI port (4050). The `local` master runs Spark in a single JVM with one worker thread.

All later Spark operations in this notebook run through this session.

In [2]:
from pyspark.sql import SparkSession

# Build (or reuse) a local Spark session; the Spark UI is served on port 4050.
spark = (
    SparkSession.builder
    .master("local")
    .appName("MRTB1163")
    .config("spark.ui.port", "4050")
    .getOrCreate()
)

spark

### Load and Preview Dataset

Reads the `housing.csv` file into a Spark DataFrame with headers and inferred data types.

Displays the schema and shows the first 5 rows for a quick preview.

In [3]:
df = spark.read.format("csv").load("housing.csv", header=True, inferSchema=True)

df.printSchema()

df.show(5)

root
 |-- longitude: double (nullable = true)
 |-- latitude: double (nullable = true)
 |-- housing_median_age: double (nullable = true)
 |-- total_rooms: double (nullable = true)
 |-- total_bedrooms: double (nullable = true)
 |-- population: double (nullable = true)
 |-- households: double (nullable = true)
 |-- median_income: double (nullable = true)
 |-- median_house_value: double (nullable = true)
 |-- ocean_proximity: string (nullable = true)

+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+---------------+
|longitude|latitude|housing_median_age|total_rooms|total_bedrooms|population|households|median_income|median_house_value|ocean_proximity|
+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+---------------+
|  -122.23|   37.88|              41.0|      880.0|         129.0|     322.0|     126.0|       8.3252|          452600.0|       NEAR B

### Add ID Column and Basic Aggregation

Adds a unique `id` column to each row using `monotonically_increasing_id()`. The generated IDs are guaranteed to be unique and increasing, but not necessarily consecutive.

Reorders columns to place `id` first.  

Displays a sample of 3 rows, counts total records, and computes the average of `total_rooms`.


In [4]:
from pyspark.sql.functions import monotonically_increasing_id

df = df.withColumn('id', monotonically_increasing_id())

df = df[['id'] + df.columns[:-1]]

df.show(3)

+---+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+---------------+
| id|longitude|latitude|housing_median_age|total_rooms|total_bedrooms|population|households|median_income|median_house_value|ocean_proximity|
+---+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+---------------+
|  0|  -122.23|   37.88|              41.0|      880.0|         129.0|     322.0|     126.0|       8.3252|          452600.0|       NEAR BAY|
|  1|  -122.22|   37.86|              21.0|     7099.0|        1106.0|    2401.0|    1138.0|       8.3014|          358500.0|       NEAR BAY|
|  2|  -122.24|   37.85|              52.0|     1467.0|         190.0|     496.0|     177.0|       7.2574|          352100.0|       NEAR BAY|
+---+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+---------------+
only showing top 3 rows
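
Why the IDs are unique but not always consecutive can be sketched in plain Python. The Spark documentation describes the current implementation as placing the partition ID in the upper 31 bits and the record number within the partition in the lower 33 bits. The helper `monotonic_id` and the two-partition layout below are illustrative assumptions, not Spark code; in this notebook's run the IDs happen to come out as 0, 1, 2, ...

```python
def monotonic_id(partition_id: int, row_in_partition: int) -> int:
    """Mimic the documented layout: partition ID in the upper 31 bits,
    per-partition record number in the lower 33 bits."""
    return (partition_id << 33) + row_in_partition

# Hypothetical layout: two partitions with three rows each.
ids = [monotonic_id(p, r) for p in range(2) for r in range(3)]
print(ids)  # [0, 1, 2, 8589934592, 8589934593, 8589934594]
```

The jump between the third and fourth ID is the gap between partitions, which is why the column should be treated as a unique key rather than a row number.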

In [5]:
df.count()

20640

In [6]:
df.select('total_rooms').agg({'total_rooms': 'avg'}).show()

+------------------+
|  avg(total_rooms)|
+------------------+
|2635.7630813953488|
+------------------+



### Summary Statistics and Grouped Aggregation

Calculates the mean of every column in the dataset. The string column `ocean_proximity` has no numeric mean, so its average is reported as `NULL`.  

Then, groups the data by `ocean_proximity` and computes the average for selected numerical columns.


In [7]:
from pyspark.sql.functions import mean

df.select(*[mean(c) for c in df.columns]).show()

+-------+-------------------+----------------+-----------------------+------------------+-------------------+------------------+-----------------+------------------+-----------------------+--------------------+
|avg(id)|     avg(longitude)|   avg(latitude)|avg(housing_median_age)|  avg(total_rooms)|avg(total_bedrooms)|   avg(population)|  avg(households)|avg(median_income)|avg(median_house_value)|avg(ocean_proximity)|
+-------+-------------------+----------------+-----------------------+------------------+-------------------+------------------+-----------------+------------------+-----------------------+--------------------+
|10319.5|-119.56970445736148|35.6318614341087|     28.639486434108527|2635.7630813953488|  537.8705525375618|1425.4767441860465|499.5396802325581|3.8706710029070246|     206855.81690891474|                NULL|
+-------+-------------------+----------------+-----------------------+------------------+-------------------+------------------+-----------------+----------

In [8]:
df.groupby('ocean_proximity').agg({col: 'avg' for col in df.columns[3:-1]}).show()

+---------------+------------------+------------------+-------------------+------------------+------------------+-----------------------+-----------------------+
|ocean_proximity|   avg(households)|   avg(population)|avg(total_bedrooms)|avg(median_income)|  avg(total_rooms)|avg(median_house_value)|avg(housing_median_age)|
+---------------+------------------+------------------+-------------------+------------------+------------------+-----------------------+-----------------------+
|         ISLAND|             276.6|             668.0|              420.4|2.7444200000000003|            1574.6|               380440.0|                   42.4|
|     NEAR OCEAN|501.24454477050415|1354.0086531226486|  538.6156773211568| 4.005784800601957| 2583.700902934537|     249433.97742663656|     29.347253574115875|
|       NEAR BAY| 488.6161572052402|1230.3174672489083|  514.1828193832599| 4.172884759825336| 2493.589519650655|     259212.31179039303|      37.73013100436681|
|      <1H OCEAN| 517.744964

### Custom UDF for Feature Transformation

Defines a user-defined function (UDF) to square the `total_rooms` column.

Applies the UDF to create a new column `total_rooms_squared` and displays the first 5 rows.

In [9]:
from pyspark.sql.types import FloatType
from pyspark.sql.functions import udf

def squared(value):
  return value * value

squared_udf = udf(squared, FloatType())

df.withColumn('total_rooms_squared', squared_udf('total_rooms')).show(5)

+---+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+---------------+-------------------+
| id|longitude|latitude|housing_median_age|total_rooms|total_bedrooms|population|households|median_income|median_house_value|ocean_proximity|total_rooms_squared|
+---+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+---------------+-------------------+
|  0|  -122.23|   37.88|              41.0|      880.0|         129.0|     322.0|     126.0|       8.3252|          452600.0|       NEAR BAY|           774400.0|
|  1|  -122.22|   37.86|              21.0|     7099.0|        1106.0|    2401.0|    1138.0|       8.3014|          358500.0|       NEAR BAY|          5.03958E7|
|  2|  -122.24|   37.85|              52.0|     1467.0|         190.0|     496.0|     177.0|       7.2574|          352100.0|       NEAR BAY|          2152089.0|
|  3|  -122.25|   37.85|    
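
In the output above, the row with `total_rooms` = 7099 shows `5.03958E7` rather than the exact square. A likely explanation is the UDF's declared return type: `FloatType` is a 32-bit float, which cannot represent 50,395,801 exactly. The sketch below reproduces the rounding in plain Python; `to_float32` is a hypothetical helper, not part of PySpark.

```python
import struct

def to_float32(value: float) -> float:
    """Round a Python float (64-bit) to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", value))[0]

exact = 7099.0 * 7099.0         # exactly representable as a 64-bit double
as_float32 = to_float32(exact)  # what a 32-bit float column can hold

print(exact)       # 50395801.0
print(as_float32)  # 50395800.0
```

Declaring the UDF with `DoubleType()` instead would keep the full 64-bit value.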

In [10]:
df.show(5)

+---+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+---------------+
| id|longitude|latitude|housing_median_age|total_rooms|total_bedrooms|population|households|median_income|median_house_value|ocean_proximity|
+---+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+---------------+
|  0|  -122.23|   37.88|              41.0|      880.0|         129.0|     322.0|     126.0|       8.3252|          452600.0|       NEAR BAY|
|  1|  -122.22|   37.86|              21.0|     7099.0|        1106.0|    2401.0|    1138.0|       8.3014|          358500.0|       NEAR BAY|
|  2|  -122.24|   37.85|              52.0|     1467.0|         190.0|     496.0|     177.0|       7.2574|          352100.0|       NEAR BAY|
|  3|  -122.25|   37.85|              52.0|     1274.0|         235.0|     558.0|     219.0|       5.6431|          341300.0|       NEAR BAY|
|  4| 

### Train-Test Split and Feature Selection

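Splits the dataset randomly into training (about 70%) and testing (about 30%) sets. No seed is set, so the split differs between runs.

Removes the label (`median_house_value`), the `id` column, and the categorical `ocean_proximity` column from the feature list, leaving the eight numerical features used as model input.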

In [11]:
train, test = df.randomSplit([0.7, 0.3])

train, test

(DataFrame[id: bigint, longitude: double, latitude: double, housing_median_age: double, total_rooms: double, total_bedrooms: double, population: double, households: double, median_income: double, median_house_value: double, ocean_proximity: string],
 DataFrame[id: bigint, longitude: double, latitude: double, housing_median_age: double, total_rooms: double, total_bedrooms: double, population: double, households: double, median_income: double, median_house_value: double, ocean_proximity: string])
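
The split sizes are only approximately 70/30 because rows are assigned at random rather than counted out. A minimal plain-Python sketch of that idea follows; `random_split` is an illustrative stand-in, not Spark's implementation, and the seed value is arbitrary.

```python
import random

def random_split(rows, train_fraction=0.7, seed=42):
    """Assign each row independently to train or test."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test

rows = list(range(1000))  # stand-in for DataFrame rows
train_rows, test_rows = random_split(rows)
print(len(train_rows), len(test_rows))  # close to 700 / 300, not exact
```

In PySpark, passing a seed, for example `df.randomSplit([0.7, 0.3], seed=42)`, makes the split repeatable in the same way the fixed `seed` does here.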

In [12]:
numerical_features_lst = train.columns
numerical_features_lst.remove('median_house_value')
numerical_features_lst.remove('id')
numerical_features_lst.remove('ocean_proximity')

numerical_features_lst

['longitude',
 'latitude',
 'housing_median_age',
 'total_rooms',
 'total_bedrooms',
 'population',
 'households',
 'median_income']

### Handle Missing Values with Imputer

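Uses `Imputer` from `pyspark.ml.feature` to fill missing values in the selected numerical columns. By default each missing value is replaced with the column mean, which is learned from the training set only.  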

Applies the transformation to both training and testing datasets.

In [13]:
from pyspark.ml.feature import Imputer

imputer = Imputer(inputCols=numerical_features_lst,
                  outputCols=numerical_features_lst)

imputer = imputer.fit(train)

train = imputer.transform(train)
test = imputer.transform(test)

train.show(3)

+---+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+---------------+
| id|longitude|latitude|housing_median_age|total_rooms|total_bedrooms|population|households|median_income|median_house_value|ocean_proximity|
+---+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+---------------+
|  1|  -122.22|   37.86|              21.0|     7099.0|        1106.0|    2401.0|    1138.0|       8.3014|          358500.0|       NEAR BAY|
|  2|  -122.24|   37.85|              52.0|     1467.0|         190.0|     496.0|     177.0|       7.2574|          352100.0|       NEAR BAY|
|  3|  -122.25|   37.85|              52.0|     1274.0|         235.0|     558.0|     219.0|       5.6431|          341300.0|       NEAR BAY|
+---+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+---------------+
only showing top 3 rows
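
The fit-on-train, transform-both pattern used for the imputer can be sketched in plain Python. The functions `fit_mean` and `impute` and the sample values are made up for illustration; they mirror mean imputation, not the PySpark internals.

```python
from statistics import mean

def fit_mean(values):
    """Learn the fill value from the non-missing training values only."""
    return mean(v for v in values if v is not None)

def impute(values, fill_value):
    """Replace missing entries (None) with the learned fill value."""
    return [fill_value if v is None else v for v in values]

train_bedrooms = [100.0, None, 200.0, 300.0]  # made-up sample
test_bedrooms = [None, 280.0]

fill = fit_mean(train_bedrooms)
print(impute(train_bedrooms, fill))  # [100.0, 200.0, 200.0, 300.0]
print(impute(test_bedrooms, fill))   # [200.0, 280.0]
```

Reusing the training mean on the test set keeps information from the test data out of the preprocessing step.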

### Assemble Numerical Features

Combines all selected numerical columns into a single vector column named `numerical_feature_vector`.

Prepares the data for scaling and model input.

In [14]:
from pyspark.ml.feature import VectorAssembler

numerical_vector_assembler = VectorAssembler(inputCols=numerical_features_lst,
                                             outputCol='numerical_feature_vector')

train = numerical_vector_assembler.transform(train)
test = numerical_vector_assembler.transform(test)

train.show(2)

+---+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+---------------+------------------------+
| id|longitude|latitude|housing_median_age|total_rooms|total_bedrooms|population|households|median_income|median_house_value|ocean_proximity|numerical_feature_vector|
+---+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+---------------+------------------------+
|  1|  -122.22|   37.86|              21.0|     7099.0|        1106.0|    2401.0|    1138.0|       8.3014|          358500.0|       NEAR BAY|    [-122.22,37.86,21...|
|  2|  -122.24|   37.85|              52.0|     1467.0|         190.0|     496.0|     177.0|       7.2574|          352100.0|       NEAR BAY|    [-122.24,37.85,52...|
+---+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+---------------+------------------------

In [15]:
train.select('numerical_feature_vector').take(2)

[Row(numerical_feature_vector=DenseVector([-122.22, 37.86, 21.0, 7099.0, 1106.0, 2401.0, 1138.0, 8.3014])),
 Row(numerical_feature_vector=DenseVector([-122.24, 37.85, 52.0, 1467.0, 190.0, 496.0, 177.0, 7.2574]))]

### Standardize Features

Applies `StandardScaler` to normalize the numerical feature vector by removing the mean and scaling to unit variance.

Outputs the result as `scaled_numerical_feature_vector` for both training and testing sets.

In [16]:
from pyspark.ml.feature import StandardScaler

scaler = StandardScaler(inputCol='numerical_feature_vector',
                        outputCol='scaled_numerical_feature_vector',
                        withStd=True, withMean=True)

scaler = scaler.fit(train)

train = scaler.transform(train)
test = scaler.transform(test)

train.show(3)

+---+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+---------------+------------------------+-------------------------------+
| id|longitude|latitude|housing_median_age|total_rooms|total_bedrooms|population|households|median_income|median_house_value|ocean_proximity|numerical_feature_vector|scaled_numerical_feature_vector|
+---+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+---------------+------------------------+-------------------------------+
|  1|  -122.22|   37.86|              21.0|     7099.0|        1106.0|    2401.0|    1138.0|       8.3014|          358500.0|       NEAR BAY|    [-122.22,37.86,21...|           [-1.3234119661236...|
|  2|  -122.24|   37.85|              52.0|     1467.0|         190.0|     496.0|     177.0|       7.2574|          352100.0|       NEAR BAY|    [-122.24,37.85,52...|           [-1.3333943817601...|
|  3|

In [17]:
train.select('scaled_numerical_feature_vector').take(3)

[Row(scaled_numerical_feature_vector=DenseVector([-1.3234, 1.0431, -0.609, 2.1143, 1.3753, 0.8714, 1.6971, 2.3328])),
 Row(scaled_numerical_feature_vector=DenseVector([-1.3334, 1.0384, 1.8457, -0.5424, -0.8339, -0.8182, -0.848, 1.7841])),
 Row(scaled_numerical_feature_vector=DenseVector([-1.3384, 1.0384, 1.8457, -0.6334, -0.7254, -0.7632, -0.7368, 0.9358]))]
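
With `withMean=True` and `withStd=True`, each feature is transformed as (x - mean) / std, where the standard deviation is the corrected sample standard deviation (dividing by n - 1). A plain-Python sketch for one column, using made-up values and the hypothetical helpers `fit_scaler` and `scale`:

```python
from statistics import mean, stdev

def fit_scaler(column):
    """Return the mean and corrected sample standard deviation (n - 1)."""
    return mean(column), stdev(column)

def scale(column, mu, sigma):
    """Standardize values with statistics learned elsewhere."""
    return [(x - mu) / sigma for x in column]

train_income = [2.0, 4.0, 6.0]  # made-up values
test_income = [8.0]

mu, sigma = fit_scaler(train_income)
print(scale(train_income, mu, sigma))  # [-1.0, 0.0, 1.0]
print(scale(test_income, mu, sigma))   # [2.0]
```

As in the notebook, the statistics come from the training data, so test values can fall outside the range seen during fitting.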

### Encode Categorical Feature

Uses `StringIndexer` to convert the categorical `ocean_proximity` column into a numerical index named `ocean_category_index`.

Applies the transformation to both training and testing datasets.

In [18]:
from pyspark.ml.feature import StringIndexer

indexer = StringIndexer(inputCol='ocean_proximity',
                        outputCol='ocean_category_index')

indexer = indexer.fit(train)
train = indexer.transform(train)
test = indexer.transform(test)

train.show(3)

+---+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+---------------+------------------------+-------------------------------+--------------------+
| id|longitude|latitude|housing_median_age|total_rooms|total_bedrooms|population|households|median_income|median_house_value|ocean_proximity|numerical_feature_vector|scaled_numerical_feature_vector|ocean_category_index|
+---+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+---------------+------------------------+-------------------------------+--------------------+
|  1|  -122.22|   37.86|              21.0|     7099.0|        1106.0|    2401.0|    1138.0|       8.3014|          358500.0|       NEAR BAY|    [-122.22,37.86,21...|           [-1.3234119661236...|                 3.0|
|  2|  -122.24|   37.85|              52.0|     1467.0|         190.0|     496.0|     177.0|       7.2574|          3521

In [19]:
# Check Unique Encoded Categories

set(train.select('ocean_category_index').collect())

{Row(ocean_category_index=0.0),
 Row(ocean_category_index=1.0),
 Row(ocean_category_index=2.0),
 Row(ocean_category_index=3.0),
 Row(ocean_category_index=4.0)}
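
By default `StringIndexer` orders labels by descending frequency, so the most common label receives index 0.0. The sketch below imitates that ordering on a made-up label list; `fit_string_indexer` is a hypothetical helper, not the PySpark class.

```python
from collections import Counter

def fit_string_indexer(labels):
    """Map each label to an index, most frequent label first."""
    counts = Counter(labels)
    # Descending frequency; ties (none in this toy data) broken alphabetically.
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

labels = ["INLAND", "NEAR BAY", "INLAND", "ISLAND", "INLAND", "NEAR BAY"]
mapping = fit_string_indexer(labels)
print(mapping)  # {'INLAND': 0.0, 'NEAR BAY': 1.0, 'ISLAND': 2.0}
```

The indices therefore reflect label frequency in the training data, not any natural order of the categories.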

### One-Hot Encode Categorical Feature

Applies `OneHotEncoder` to convert the indexed categorical column into a sparse binary vector named `ocean_category_one_hot`.

One-hot encoding prevents the model from treating the category indices as ordered numeric values during training.

In [20]:
from pyspark.ml.feature import OneHotEncoder

one_hot_encoder = OneHotEncoder(inputCol='ocean_category_index',
                                outputCol='ocean_category_one_hot')

one_hot_encoder = one_hot_encoder.fit(train)

train = one_hot_encoder.transform(train)
test = one_hot_encoder.transform(test)

train.show(3)

+---+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+---------------+------------------------+-------------------------------+--------------------+----------------------+
| id|longitude|latitude|housing_median_age|total_rooms|total_bedrooms|population|households|median_income|median_house_value|ocean_proximity|numerical_feature_vector|scaled_numerical_feature_vector|ocean_category_index|ocean_category_one_hot|
+---+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+---------------+------------------------+-------------------------------+--------------------+----------------------+
|  1|  -122.22|   37.86|              21.0|     7099.0|        1106.0|    2401.0|    1138.0|       8.3014|          358500.0|       NEAR BAY|    [-122.22,37.86,21...|           [-1.3234119661236...|                 3.0|         (4,[3],[1.0])|
|  2|  -122.24|   37.85|    
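
The output `(4,[3],[1.0])` is a sparse vector of size 4 with a 1.0 at position 3. The vector has four slots for five categories because `OneHotEncoder` drops the last category by default (`dropLast=True`), encoding it as all zeros. A dense plain-Python sketch, with `one_hot` as an illustrative helper:

```python
def one_hot(index: int, num_categories: int, drop_last: bool = True):
    """Encode a category index as a dense list of 0.0/1.0 values."""
    size = num_categories - 1 if drop_last else num_categories
    vector = [0.0] * size
    if index < size:
        vector[index] = 1.0
    return vector

print(one_hot(3, 5))  # [0.0, 0.0, 0.0, 1.0]
print(one_hot(4, 5))  # [0.0, 0.0, 0.0, 0.0]  (the dropped last category)
```

Dropping one category avoids a redundant column, since the last category is implied when all other slots are zero.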

### Combine All Features

Merges scaled numerical features and one-hot encoded categorical features into a single column `final_feature_vector` for model training.

In [21]:
assembler = VectorAssembler(inputCols=['scaled_numerical_feature_vector',
                                       'ocean_category_one_hot'],
                            outputCol='final_feature_vector')

train = assembler.transform(train)
test = assembler.transform(test)

In [22]:
train.select('final_feature_vector').take(2)

[Row(final_feature_vector=DenseVector([-1.3234, 1.0431, -0.609, 2.1143, 1.3753, 0.8714, 1.6971, 2.3328, 0.0, 0.0, 0.0, 1.0])),
 Row(final_feature_vector=DenseVector([-1.3334, 1.0384, 1.8457, -0.5424, -0.8339, -0.8182, -0.848, 1.7841, 0.0, 0.0, 0.0, 1.0]))]

### Train Linear Regression Model and Predict

1. Initialize a linear regression model with input features and target label.
2. Fit the model on the training dataset.
3. Apply the trained model to the training data to generate predictions and rename the prediction column for clarity.

In [23]:
from pyspark.ml.regression import LinearRegression

lr = LinearRegression(featuresCol='final_feature_vector',
                      labelCol='median_house_value')

lr

LinearRegression_01301fd3c7e9

In [24]:
lr = lr.fit(train)

lr

LinearRegressionModel: uid=LinearRegression_01301fd3c7e9, numFeatures=12
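
The fitted model reports `numFeatures=12`: eight scaled numerical features plus the four one-hot slots. Each prediction is the intercept plus the dot product of the learned weights and the feature vector. The sketch below uses made-up weights for a 3-feature model; `predict` is a hypothetical helper, while the fitted PySpark model exposes its own values through the `coefficients` and `intercept` attributes.

```python
def predict(features, coefficients, intercept):
    """Linear model: intercept plus the dot product of weights and features."""
    return intercept + sum(w * x for w, x in zip(coefficients, features))

# Hypothetical 3-feature model; the notebook's model has 12 features.
coefficients = [2.0, -1.0, 0.5]
intercept = 10.0
print(predict([1.0, 2.0, 4.0], coefficients, intercept))  # 12.0
```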

In [25]:
pred_train_df = lr.transform(train).withColumnRenamed('prediction',
                                                      'predicted_median_house_value')

pred_train_df.show(5)

+---+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+---------------+------------------------+-------------------------------+--------------------+----------------------+--------------------+----------------------------+
| id|longitude|latitude|housing_median_age|total_rooms|total_bedrooms|population|households|median_income|median_house_value|ocean_proximity|numerical_feature_vector|scaled_numerical_feature_vector|ocean_category_index|ocean_category_one_hot|final_feature_vector|predicted_median_house_value|
+---+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+---------------+------------------------+-------------------------------+--------------------+----------------------+--------------------+----------------------------+
|  1|  -122.22|   37.86|              21.0|     7099.0|        1106.0|    2401.0|    1138.0|       8.3014|          35850

### Predict on Test Data

Applies the trained linear regression model to the test dataset and renames the prediction column to `predicted_median_house_value` for easier interpretation.

In [26]:
pred_test_df = lr.transform(test).withColumnRenamed('prediction', 'predicted_median_house_value')

pred_test_df.show(5)

+---+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+---------------+------------------------+-------------------------------+--------------------+----------------------+--------------------+----------------------------+
| id|longitude|latitude|housing_median_age|total_rooms|total_bedrooms|population|households|median_income|median_house_value|ocean_proximity|numerical_feature_vector|scaled_numerical_feature_vector|ocean_category_index|ocean_category_one_hot|final_feature_vector|predicted_median_house_value|
+---+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+---------------+------------------------+-------------------------------+--------------------+----------------------+--------------------+----------------------------+
|  0|  -122.23|   37.88|              41.0|      880.0|         129.0|     322.0|     126.0|       8.3252|          45260

### Convert to Pandas DataFrame

Converts the Spark DataFrame to a Pandas DataFrame for easier inspection or visualization with Python libraries such as matplotlib or seaborn. `toPandas()` collects all rows to the driver, so it is only suitable when the result fits in memory.

In [27]:
pred_test_pd_df = pred_test_df.toPandas()

pred_test_pd_df.head(2)

Unnamed: 0,id,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity,numerical_feature_vector,scaled_numerical_feature_vector,ocean_category_index,ocean_category_one_hot,final_feature_vector,predicted_median_house_value
0,0,-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,NEAR BAY,"[-122.23, 37.88, 41.0, 880.0, 129.0, 322.0, 12...","[-1.328403173941875, 1.0524586361422354, 0.974...",3.0,"(0.0, 0.0, 0.0, 1.0)","[-1.328403173941875, 1.0524586361422354, 0.974...",409816.47533
1,4,-122.25,37.85,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0,NEAR BAY,"[-122.25, 37.85, 52.0, 1627.0, 280.0, 565.0, 2...","[-1.3383855895783474, 1.0384181679590128, 1.84...",3.0,"(0.0, 0.0, 0.0, 1.0)","[-1.3383855895783474, 1.0384181679590128, 1.84...",254166.050964


### Prepare Data for Regression Evaluation

Selects only the predicted and actual values, then converts the Spark DataFrame to an RDD and maps each row to a tuple. `RegressionMetrics` from MLlib expects an RDD of (prediction, observation) pairs, so the prediction must come first.

In [28]:
predictions_and_actuals = pred_test_df[['predicted_median_house_value',
                                        'median_house_value']]

predictions_and_actuals_rdd = predictions_and_actuals.rdd

predictions_and_actuals_rdd.take(2)

[Row(predicted_median_house_value=409816.4753304599, median_house_value=452600.0),
 Row(predicted_median_house_value=254166.0509641794, median_house_value=342200.0)]

In [29]:
predictions_and_actuals_rdd = predictions_and_actuals_rdd.map(tuple)

predictions_and_actuals_rdd.take(2)

[(409816.4753304599, 452600.0), (254166.0509641794, 342200.0)]

### Evaluate Model Performance

Uses `RegressionMetrics` from `pyspark.mllib.evaluation` to calculate and display evaluation metrics for the linear regression model on the test set:
- Mean Squared Error (MSE)
- Root Mean Squared Error (RMSE)
- Mean Absolute Error (MAE)
- R-squared (R²)

In [30]:
from pyspark.mllib.evaluation import RegressionMetrics

metrics = RegressionMetrics(predictions_and_actuals_rdd)

s = '''
Mean Squared Error:      {0}
Root Mean Squared Error: {1}
Mean Absolute Error:     {2}
R**2:                    {3}
'''.format(metrics.meanSquaredError,
           metrics.rootMeanSquaredError,
           metrics.meanAbsoluteError,
           metrics.r2
           )

print(s)




Mean Squared Error:      4752151745.122743
Root Mean Squared Error: 68935.85239280605
Mean Absolute Error:     49860.51170153819
R**2:                    0.6454138983817461
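
For reference, the four metrics can be computed by hand from (prediction, actual) pairs. The sketch below uses the standard definitions with R² = 1 - SS_res / SS_tot; `regression_metrics` and the sample pairs are illustrative and are not the MLlib implementation.

```python
import math

def regression_metrics(pairs):
    """Compute MSE, RMSE, MAE and R² from (prediction, actual) pairs."""
    n = len(pairs)
    errors = [actual - pred for pred, actual in pairs]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    mean_actual = sum(actual for _, actual in pairs) / n
    ss_tot = sum((actual - mean_actual) ** 2 for _, actual in pairs)
    r2 = 1 - (mse * n) / ss_tot  # mse * n is the residual sum of squares
    return {"mse": mse, "rmse": math.sqrt(mse), "mae": mae, "r2": r2}

pairs = [(2.0, 1.0), (2.0, 3.0), (5.0, 5.0), (8.0, 7.0)]  # made-up values
print(regression_metrics(pairs))
```

RMSE is the square root of MSE, which matches the notebook output: the square root of 4752151745.12 is about 68935.85.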



### Visualize Actual vs Predicted (Plotly)

This interactive scatter plot shows how close the model's predictions are to the actual values.

The trendline offers a visual indicator of the model’s fit.

In [31]:
import plotly.express as px

fig = px.scatter(
    pred_test_pd_df,
    x='median_house_value',
    y='predicted_median_house_value',
    title='Actual vs Predicted House Values',
    labels={'median_house_value': 'Actual', 'predicted_median_house_value': 'Predicted'},
    opacity=0.7,
    trendline='ols',  # OLS trendline; requires the statsmodels package
    color='predicted_median_house_value'
)

fig.update_layout(showlegend=False)
fig.show()


The scatter plot compares actual and predicted house values from the PySpark linear regression model.

The model captures the overall upward trend, but its predictions tend to fall below the actual values, especially for high-priced properties.

This suggests the model is underfitting: a linear model may not be expressive enough to capture nonlinear patterns in the data.


### Plotly Bar Chart of Evaluation Metrics

This chart provides a visual summary of the model’s error and accuracy metrics including R².

In [33]:
import plotly.graph_objects as go

# Extract metric values
mse = metrics.meanSquaredError
rmse = metrics.rootMeanSquaredError
mae = metrics.meanAbsoluteError
r2 = metrics.r2

# Create bar chart
fig = go.Figure(data=[
    go.Bar(
        name='Regression Metrics',
        x=['MSE', 'RMSE', 'MAE', 'R²'],
        y=[mse, rmse, mae, r2],
        marker_color='indianred',
        text=[f'{mse:.2e}', f'{rmse:,.0f}', f'{mae:,.0f}', f'{r2:.2f}'],
        textposition='outside'
    )
])

fig.update_layout(
    title='Model Evaluation Metrics',
    yaxis_title='Value (log scale)',
    xaxis_title='Metric',
    yaxis_type='log',  # Log scale: values range from below 1 (R²) to about 4.8e9 (MSE)
    template='plotly_white',
    showlegend=False
)

fig.show()


The bar chart presents regression evaluation metrics for the PySpark linear regression model.

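Mean Squared Error (MSE, about 4.75e9) is large because it is measured in squared units of the target. RMSE (about 68,900) and MAE (about 49,900) are in the same units as `median_house_value`, so typical prediction errors are roughly 50k to 70k.

An R² of about 0.65 means the model explains roughly 65% of the variance in house prices on the test set, indicating moderate performance.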