# Machine Learning with PySpark
* Notebook by Adam Lang
* Date: 12/19/2024

# Overview
* This notebook demonstrates a basic machine learning example using linear regression in PySpark.

# Create Spark session

In [1]:
%%capture
!pip install pyspark

In [4]:
## spark session
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('Missing').getOrCreate()

# Read the Data

In [5]:
## data path
data_path= '/content/drive/MyDrive/Colab Notebooks/Deep Learning Notebooks/test1.csv'

In [6]:
## training data
train = spark.read.csv(data_path,
                       header=True,
                       inferSchema=True)

In [7]:
## view data
train.show()

+-------+---+----------+------+
|   name|age|experience|salary|
+-------+---+----------+------+
|  Alice| 25|         3| 60000|
|    Bob| 30|         5| 75000|
|Charlie| 28|         4| 70000|
|  David| 35|         8| 90000|
|    Eve| 22|         1| 50000|
|  Frank| 40|        12|110000|
|  Grace| 27|         3| 65000|
|  Helen| 32|         6| 80000|
+-------+---+----------+------+



In [8]:
## print schema
train.printSchema()

root
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- experience: integer (nullable = true)
 |-- salary: integer (nullable = true)



In [9]:
## columns
train.columns

['name', 'age', 'experience', 'salary']

# Machine Learning with PySpark
* PySpark's ML API differs in some ways from sklearn and other Python libraries.
* We will use the `VectorAssembler`: https://api-docs.databricks.com/python/pyspark/latest/api/pyspark.ml.feature.VectorAssembler.html
  * It groups several feature columns together into a single vector column.
  * In this dataset we will apply it to:
    * `[age, experience]` --> these columns will be combined into
    * one new vector column -->
    * which holds the `independent features` (the model inputs).
## VectorAssembler - Feature Engineering
* VectorAssembler is a transformer that combines a given list of columns into a **single vector column.**
  * It is very useful for combining raw features and features generated by different feature transformers into a **single feature vector**, in order to train ML models like logistic regression and decision trees.
  * VectorAssembler accepts the following input column types:
    * all numeric types
    * boolean type, and
    * vector type.
* In each row, the values of the input columns will be concatenated into a vector in the specified order.
* Source: https://george-jen.gitbook.io/data-science-and-apache-spark/vectorassembler
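
As a rough illustration of the row-wise concatenation described above, here is a minimal pure-Python sketch. The helper `assemble_row` is hypothetical and only mimics the idea; it is not how Spark implements `VectorAssembler`, and plain lists stand in for vector columns.

```python
def assemble_row(row, input_cols):
    """Concatenate the values of input_cols, in order, into one flat list of floats.

    Hypothetical helper for illustration only: numeric and boolean values are
    cast to float, and list/tuple values (standing in for vectors) are flattened.
    """
    vector = []
    for col in input_cols:
        value = row[col]
        if isinstance(value, (list, tuple)):
            vector.extend(float(v) for v in value)
        else:
            vector.append(float(value))
    return vector


# One row of the dataset shown earlier, as a plain dict
row = {"name": "Alice", "age": 25, "experience": 3, "salary": 60000}
features = assemble_row(row, ["age", "experience"])
print(features)  # [25.0, 3.0]
```

The order of `input_cols` determines the order of the values in the resulting vector, which later determines which coefficient belongs to which feature.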

In [12]:
## load the VectorAssembler
from pyspark.ml.feature import VectorAssembler

## init assembler
feature_assembler = VectorAssembler(inputCols=['age', 'experience'], ## input columns list to be transformed into Vectors
                                   outputCol='Independent Features') ## new combined column

In [13]:
## now apply the feature_assembler
output = feature_assembler.transform(train)

In [14]:
## show output
output.show()

+-------+---+----------+------+--------------------+
|   name|age|experience|salary|Independent Features|
+-------+---+----------+------+--------------------+
|  Alice| 25|         3| 60000|          [25.0,3.0]|
|    Bob| 30|         5| 75000|          [30.0,5.0]|
|Charlie| 28|         4| 70000|          [28.0,4.0]|
|  David| 35|         8| 90000|          [35.0,8.0]|
|    Eve| 22|         1| 50000|          [22.0,1.0]|
|  Frank| 40|        12|110000|         [40.0,12.0]|
|  Grace| 27|         3| 65000|          [27.0,3.0]|
|  Helen| 32|         6| 80000|          [32.0,6.0]|
+-------+---+----------+------+--------------------+



Summary
* The input features `age` and `experience` are now combined into one vector in the new column `Independent Features`.
* The output feature (the label) will be `salary`.

In [15]:
## output columns
output.columns

['name', 'age', 'experience', 'salary', 'Independent Features']

## Final Feature Engineering
* We need a filtered dataset that contains only the feature vector and the label we want to predict.

In [16]:
## final dataset
final_data = output.select("Independent Features", "salary")

## show final_data
final_data.show()

+--------------------+------+
|Independent Features|salary|
+--------------------+------+
|          [25.0,3.0]| 60000|
|          [30.0,5.0]| 75000|
|          [28.0,4.0]| 70000|
|          [35.0,8.0]| 90000|
|          [22.0,1.0]| 50000|
|         [40.0,12.0]|110000|
|          [27.0,3.0]| 65000|
|          [32.0,6.0]| 80000|
+--------------------+------+



# Train/Test Split
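* Similar to `train_test_split` in sklearn, except that `randomSplit` treats the weights as approximate proportions, so the split sizes can differ from 75/25 and can vary between runs unless a `seed` is passed.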

In [17]:
from pyspark.ml.regression import LinearRegression

## train/test split
train_data, test_data = final_data.randomSplit([0.75,0.25])
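
To see why a weight-based random split does not give exact 75/25 sizes on a tiny dataset, here is a simplified pure-Python model of the idea. The function `random_split` is hypothetical and is not Spark's implementation; it only assumes that each row is assigned to a split independently, with probability proportional to the weights.

```python
import random


def random_split(rows, weights, seed=None):
    """Assign each row independently to a split, with probability proportional to its weight.

    Simplified stand-in for weight-based splitting; not Spark's implementation.
    """
    rng = random.Random(seed)
    total = sum(weights)
    # Cumulative cut points in (0, 1], e.g. [0.75, 1.0] for weights [0.75, 0.25]
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for i, bound in enumerate(bounds):
            # The last split catches any draw not claimed earlier
            if draw < bound or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits


rows = list(range(8))  # stand-in for the 8 rows of final_data
for seed in (0, 1, 2):
    train_rows, test_rows = random_split(rows, [0.75, 0.25], seed=seed)
    print(seed, len(train_rows), len(test_rows))
```

With only 8 rows, the sizes can move noticeably from seed to seed, while fixing the seed makes a given split reproducible.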

# Build Regression Model

In [18]:
## setup regression
regressor = LinearRegression(featuresCol='Independent Features',
                             labelCol='salary')


## fit model
regressor = regressor.fit(train_data)

In [19]:
## coefficients
regressor.coefficients

DenseVector([2389.9371, 1257.8616])

In [20]:
## intercept
regressor.intercept

-3616.352201246163

# Predictions

In [21]:
## predict on test set
pred_results = regressor.evaluate(test_data)

## show predictions
pred_results.predictions.show()

+--------------------+------+------------------+
|Independent Features|salary|        prediction|
+--------------------+------+------------------+
|          [27.0,3.0]| 65000| 64685.53459119421|
|          [28.0,4.0]| 70000| 68333.33333333312|
|         [40.0,12.0]|110000|107075.47169811501|
+--------------------+------+------------------+



In [22]:
## other metrics
pred_results.meanAbsoluteError, pred_results.meanSquaredError

(1635.2201257858833, 3809844.0198800503)

# Summary:
1. Coefficients: `DenseVector([2389.9371, 1257.8616])`
* The first coefficient (2389.9371) corresponds to `age`.
* The second coefficient (1257.8616) corresponds to `experience`.
* This means that, according to the fitted model:
  * For each additional year of age, with experience held constant, the predicted salary increases by $2,389.94.
  * For each additional year of experience, with age held constant, the predicted salary increases by $1,257.86.

2. Intercept: `-3616.352201246163`

* This is the **expected salary when both age and experience are 0** (which doesn't make practical sense in this context, but it's part of the mathematical model).

3. Predictions: On the three test rows, the model predicts salaries close to the actual values:

* For age 27 and 3 years experience, it predicts $64,685.53 (actual: $65,000)
* For age 28 and 4 years experience, it predicts $68,333.33 (actual: $70,000)
* For age 40 and 12 years experience, it predicts $107,075.47 (actual: $110,000)
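
The predictions listed above can be reproduced by hand from the coefficients and intercept. The sketch below uses the coefficients as printed in the notebook output, rounded to 4 decimals, so small differences in the last digits are expected; the `predict` function is a hypothetical helper, not part of PySpark.

```python
# Values copied from the fitted model output (coefficients rounded to 4 decimals)
coefficients = [2389.9371, 1257.8616]  # order: [age, experience]
intercept = -3616.352201246163


def predict(age, experience):
    """Linear model: intercept + coef_age * age + coef_experience * experience."""
    return intercept + coefficients[0] * age + coefficients[1] * experience


# The three (age, experience) pairs that ended up in the test set
for age, experience in [(27, 3), (28, 4), (40, 12)]:
    print(age, experience, round(predict(age, experience), 2))
```

The results agree with the `prediction` column up to the rounding of the coefficients.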

4. Error Metrics:
  * Mean Absolute Error (MAE): 1635.2201257858833
    * On average, the model's predictions are off by about $1,635.22
  * Mean Squared Error (MSE): 3809844.0198800503
    * This is the average of the squared differences between predicted and actual values.
    * The Root Mean Squared Error (RMSE) would be sqrt(MSE) ≈ $1,951.88
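
As a check on the metrics above, MAE, MSE, and RMSE can be recomputed in plain Python from the three rows of the predictions table. This is a sketch for verification only; in the notebook the values come from the evaluation summary.

```python
import math

# (actual salary, prediction) pairs copied from the predictions table
pairs = [
    (65000, 64685.53459119421),
    (70000, 68333.33333333312),
    (110000, 107075.47169811501),
]

errors = [actual - predicted for actual, predicted in pairs]
mae = sum(abs(e) for e in errors) / len(errors)   # mean absolute error
mse = sum(e ** 2 for e in errors) / len(errors)   # mean squared error
rmse = math.sqrt(mse)                             # root mean squared error

print(f"MAE:  {mae:.2f}")
print(f"MSE:  {mse:.2f}")
print(f"RMSE: {rmse:.2f}")
```

Because the errors are squared before averaging, the largest miss (the $110,000 row) dominates the MSE, which is why the RMSE is larger than the MAE.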

* Interpretation:

1. The model shows a **positive relationship between salary and both age and experience.**

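2. Age has the larger coefficient in this model, but with so few rows, and with age and experience rising together in the data, this is weak evidence that age matters more than experience.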

3. The predictions are reasonably close to the actual values, with a mean absolute error of about $1,635 on the three test rows.

4. How much of the variation in salary the model explains was not measured here (no R² was computed), and there might be other factors influencing salary that aren't captured in this model.

5. The negative intercept suggests that the model should not be extrapolated to very low values of age and experience.


6. Since this was just a dummy example, to improve the model we might consider:

  * Adding more relevant features if available
  * Exploring non-linear relationships in the data
  * Checking for multicollinearity in the features
  * Checking feature importance (e.g. permutation feature importance)
  * Scaling the features and applying regularization such as lasso or ridge regression
  * Checking for and handling outliers
  * And lastly, collecting a sufficiently large and representative dataset, which this toy example does not have.
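
As one concrete example of the multicollinearity check suggested above, the Pearson correlation between `age` and `experience` can be computed directly from the 8 rows of the dataset. This is a minimal plain-Python sketch; the `pearson` helper is written here for illustration and is not a PySpark function.

```python
import math

# age and experience columns copied from the dataset shown earlier
age = [25, 30, 28, 35, 22, 40, 27, 32]
experience = [3, 5, 4, 8, 1, 12, 3, 6]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


r = pearson(age, experience)
print(f"correlation(age, experience) = {r:.3f}")
```

On this data the correlation comes out close to 1, meaning the two features carry nearly the same information, so the individual coefficients should be read with caution.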

