# Caching and Broadcasting

## Spark Set Up

In [1]:
## Imports
import re
import json
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

from pyspark.sql import SparkSession

app_name = "week2_cache"
master = "local[*]"
spark = SparkSession\
        .builder\
        .appName(app_name)\
        .master(master)\
        .config("spark.ui.port","42229")\
        .getOrCreate()
sc = spark.sparkContext

## Change the working directory
%cd /media

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/04/25 03:41:23 INFO org.apache.spark.SparkEnv: Registering MapOutputTracker
22/04/25 03:41:23 INFO org.apache.spark.SparkEnv: Registering BlockManagerMaster
22/04/25 03:41:23 INFO org.apache.spark.SparkEnv: Registering BlockManagerMasterHeartbeat
22/04/25 03:41:23 INFO org.apache.spark.SparkEnv: Registering OutputCommitCoordinator


/media


## Caching

So far we have worked with different types of RDDs, and we know that, thanks to lazy evaluation, Spark does not execute anything until an action forces it to. This allows Spark to optimize how the work is distributed at execution time. However, some processes are iterative in nature, and under lazy evaluation every action recomputes the RDD from scratch, which can slow performance considerably. Enter caching.

To avoid recomputing the RDD for every action, we can cache the data at the executor level. The RDD's partitions then **persist** on the executors after the first action computes them, so loading and transforming the data (including any network transfer) is done only once. Note that `cache()` is itself lazy: it only marks the RDD, and the data is stored the first time an action runs.
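
The effect of recomputation versus caching can be illustrated without a cluster. The sketch below is a toy model, not Spark's API: `LazyDataset` and `expensive_load` are hypothetical names made up for this illustration. It mimics an RDD that stores a recipe rather than data, and counts how many times that recipe runs across five actions.

```python
class LazyDataset:
    """Toy stand-in for an RDD: it stores a recipe, not the data (hypothetical class)."""

    def __init__(self, compute):
        self._compute = compute      # function that produces the full list of records
        self._cached = False         # whether to keep the result after the first action
        self._store = None           # materialized records, filled on first use
        self.n_computations = 0      # how many times the recipe actually ran

    def cache(self):
        # Like RDD.cache(), this only marks the dataset; nothing is computed yet.
        self._cached = True
        return self

    def _materialize(self):
        if self._cached and self._store is not None:
            return self._store       # reuse the stored records
        self.n_computations += 1
        records = self._compute()
        if self._cached:
            self._store = records
        return records

    def mean(self):
        # An "action": it forces the computation.
        records = self._materialize()
        return sum(records) / len(records)


def expensive_load():
    # Placeholder for reading and transforming a large data set.
    return [float(i) for i in range(1000)]


uncached = LazyDataset(expensive_load)
cached = LazyDataset(expensive_load).cache()

for _ in range(5):                   # five "iterations", one action each
    uncached.mean()
    cached.mean()

print(uncached.n_computations)       # 5
print(cached.n_computations)         # 1
```

In the gradient descent loop further down, each step triggers two actions on the same RDD (one for the update, one for the loss), which is the situation where caching the input pays off.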

### Example - Gradient Descent

Let's run an example. We will use gradient descent to estimate the coefficients of a linear regression.

Linear Regression tackles the __prediction task__ by assuming that we can compute our output variable, $y$, using a linear combination of our input variables. That is, we assume there exists a set of **weights** such that for any input $\mathbf{x}_j \in \mathbb{R}^{m+1}$:

\begin{equation}\tag{1.1}
y_j = \displaystyle\sum_{i=1}^{m+1}{w_i\cdot x_{ji}}
\end{equation}

In vector notation, this can be written:

\begin{equation}
y_j = \displaystyle{\mathbf{w}^T\mathbf{x}_{j}}
\end{equation}

Linear Regression attempts to fit (i.e. **learn**) the best line (in 1 dimension) or hyperplane (in 2 or more dimensions) to the data. In the case of **ordinary least squares (OLS)** linear regression, best fit is defined as minimizing the sum of the squared vertical distances between each point in the dataset and the line or hyperplane. These distances are often referred to as **residuals**.

There is a closed-form solution for OLS that you have probably seen in a statistics class. However, at scale the required matrix operations are too slow, so we use **Gradient Descent** (GD) instead. Without going too deep into the technical details, GD starts from an initial guess for the weights and repeatedly moves them a small step in the direction that reduces the loss function, until the loss stops improving (**optimization**).
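
For reference, the quantities computed by the code below can be written out explicitly. This is a sketch under two assumptions that the text above does not state: the loss is the mean squared error over $N$ observations, and $\eta$ denotes the learning rate. Here $\mathbf{x}_j$ is the input augmented with a leading 1 for the bias term.

\begin{equation}
L(\mathbf{w}) = \frac{1}{N}\displaystyle\sum_{j=1}^{N}{\left(\mathbf{w}^T\mathbf{x}_{j} - y_j\right)^2}
\end{equation}

\begin{equation}
\nabla_{\mathbf{w}} L(\mathbf{w}) = \frac{2}{N}\displaystyle\sum_{j=1}^{N}{\left(\mathbf{w}^T\mathbf{x}_{j} - y_j\right)\mathbf{x}_{j}}
\end{equation}

\begin{equation}
\mathbf{w} \leftarrow \mathbf{w} - \eta\,\nabla_{\mathbf{w}} L(\mathbf{w})
\end{equation}

Both the loss and the gradient are averages over the whole data set, so every update requires a full pass over the data.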

In [2]:
## Let's define the two main functions: Compute the Loss and the Gradient Descent Update
def OLSLoss(dataRDD, W):
    """
    Compute mean squared error.
    Args:
        dataRDD - each record is a tuple of (features_array, y)
        W       - (array) model coefficients with bias at index 0
    """
    # Add the intercept
    augmentedData = dataRDD.map(lambda x: (np.append([1.0], x[0]), x[1]))
    
    # Compute the loss
    loss = augmentedData.map(lambda x: (W.dot(x[0]) - x[1])**2).mean()
    return loss

def GDUpdate(dataRDD, W, learningRate = 0.1):
    """
    Perform one OLS gradient descent step/update.
    Args:
        dataRDD - records are tuples of (features_array, y)
        W       - (array) model coefficients with bias at index 0
    Returns:
        new_model - (array) updated coefficients, bias at index 0
    """
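    # Add the intercept (no .cache() here: this RDD is rebuilt on every call
    # and consumed by a single action, so caching it would bring no reuse)
    augmentedData = dataRDD.map(lambda x: (np.append([1.0], x[0]), x[1]))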
    
    # Compute the gradient
    grad = augmentedData.map(lambda x: (W.dot(x[0]) - x[1])*x[0]).mean() * 2
    new_model = W - learningRate * grad
    return new_model
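
The two functions above can be sanity-checked locally before running them on an RDD. The following is a minimal sketch in plain Python, assuming a tiny noise-free data set generated from $y = 1 + 2x$. The names `dot`, `ols_loss`, `gd_update`, and the data itself are illustrative and not part of the notebook. It replaces the RDD `map(...).mean()` calls with list comprehensions but keeps the same arithmetic, including the factor of 2 in the gradient.

```python
def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))


def ols_loss(data, w):
    """Mean squared error; data is a list of (features, y), w has the bias at index 0."""
    augmented = [([1.0] + list(x), y) for x, y in data]   # add the intercept
    errors = [(dot(w, x) - y) ** 2 for x, y in augmented]
    return sum(errors) / len(errors)


def gd_update(data, w, learning_rate=0.1):
    """One gradient descent step on the mean squared error."""
    augmented = [([1.0] + list(x), y) for x, y in data]   # add the intercept
    # Per-record contributions (w.x - y) * x, one list per record
    contribs = [[(dot(w, x) - y) * xi for xi in x] for x, y in augmented]
    n_records = len(contribs)
    # Average the contributions and multiply by 2, as in GDUpdate
    grad = [2 * sum(c[i] for c in contribs) / n_records for i in range(len(w))]
    return [wi - learning_rate * gi for wi, gi in zip(w, grad)]


# Tiny noise-free data set: y = 1 + 2 * x
data = [((0.0,), 1.0), ((1.0,), 3.0), ((2.0,), 5.0), ((3.0,), 7.0)]

w = [0.0, 0.0]          # initial guess: [bias, slope]
losses = []
for _ in range(200):
    w = gd_update(data, w)
    losses.append(ols_loss(data, w))

print([round(v, 3) for v in w])
print(losses[0], losses[-1])
```

With this step size the weights should approach an intercept of 1 and a slope of 2, and the final loss should be far below the first one. If that does not happen, the update or the loss has a bug that is much cheaper to find here than on a cluster.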

In [3]:
## Set up synthetic data: y is the sum of the features plus Gaussian noise
# Number of features
n = 50
# Number of observations
N = 10000

np.random.seed(2022)
X = np.random.uniform(size = (N,n))
y = np.sum(X, axis=1) + np.random.normal(0,0.1,N)
data = pd.DataFrame(np.vstack((X.T, y)).T)

In [4]:
%%time
## Run the model without caching
  
## RDD Creation
PointsRDD = spark.createDataFrame(data)\
            .rdd.map(lambda Row: (Row[0:-1], Row[-1]))

## Number of iterations
nSteps = 5

## Initial guess
model = np.array([0]+[1]*n)

for idx in range(nSteps):
    print("----------")
    print(f"STEP: {idx+1}")
    model = GDUpdate(PointsRDD, model)
    loss = OLSLoss(PointsRDD, model)
    print(f"Loss: {loss}")
    print(f"Model: {[round(w,3) for w in model]}")

----------
STEP: 1


22/04/25 03:41:47 WARN org.apache.spark.scheduler.TaskSetManager: Stage 0 contains a task of very large size (1390 KiB). The maximum recommended task size is 1000 KiB.
22/04/25 03:41:50 WARN org.apache.spark.scheduler.TaskSetManager: Stage 1 contains a task of very large size (1390 KiB). The maximum recommended task size is 1000 KiB.


Loss: 0.009826849449816691
Model: [-0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
----------
STEP: 2


22/04/25 03:41:50 WARN org.apache.spark.scheduler.TaskSetManager: Stage 2 contains a task of very large size (1390 KiB). The maximum recommended task size is 1000 KiB.
22/04/25 03:41:51 WARN org.apache.spark.scheduler.TaskSetManager: Stage 3 contains a task of very large size (1390 KiB). The maximum recommended task size is 1000 KiB.


Loss: 0.009825258503583697
Model: [-0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
----------
STEP: 3


22/04/25 03:41:51 WARN org.apache.spark.scheduler.TaskSetManager: Stage 4 contains a task of very large size (1390 KiB). The maximum recommended task size is 1000 KiB.
22/04/25 03:41:52 WARN org.apache.spark.scheduler.TaskSetManager: Stage 5 contains a task of very large size (1390 KiB). The maximum recommended task size is 1000 KiB.


Loss: 0.009823774724840903
Model: [-0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
----------
STEP: 4


22/04/25 03:41:52 WARN org.apache.spark.scheduler.TaskSetManager: Stage 6 contains a task of very large size (1390 KiB). The maximum recommended task size is 1000 KiB.
22/04/25 03:41:53 WARN org.apache.spark.scheduler.TaskSetManager: Stage 7 contains a task of very large size (1390 KiB). The maximum recommended task size is 1000 KiB.


Loss: 0.009822503314113504
Model: [-0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.999, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
----------
STEP: 5


22/04/25 03:41:53 WARN org.apache.spark.scheduler.TaskSetManager: Stage 8 contains a task of very large size (1390 KiB). The maximum recommended task size is 1000 KiB.
22/04/25 03:41:54 WARN org.apache.spark.scheduler.TaskSetManager: Stage 9 contains a task of very large size (1390 KiB). The maximum recommended task size is 1000 KiB.


Loss: 0.009821761400156535
Model: [0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.999, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.001, 1.0]
CPU times: user 2.25 s, sys: 36.2 ms, total: 2.29 s
Wall time: 12.8 s


In [5]:
%%time
## Same model, but now we cached the data
## RDD creation - note the .cache() at the end!
PointsRDDcached = spark.createDataFrame(data)\
            .rdd.map(lambda Row: (Row[0:-1], Row[-1])).cache()

## Number of iterations
nSteps = 5

## Initial guess
model = np.array([0]+[1]*n)

for idx in range(nSteps):
    print("----------")
    print(f"STEP: {idx+1}")
    model = GDUpdate(PointsRDDcached, model)
    loss = OLSLoss(PointsRDDcached, model)
    print(f"Loss: {loss}")
    print(f"Model: {[round(w,3) for w in model]}")

----------
STEP: 1


22/04/25 03:42:01 WARN org.apache.spark.scheduler.TaskSetManager: Stage 10 contains a task of very large size (1390 KiB). The maximum recommended task size is 1000 KiB.
22/04/25 03:42:01 WARN org.apache.spark.scheduler.TaskSetManager: Stage 11 contains a task of very large size (1390 KiB). The maximum recommended task size is 1000 KiB.


Loss: 0.009826849449816691
Model: [-0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
----------
STEP: 2


22/04/25 03:42:01 WARN org.apache.spark.scheduler.TaskSetManager: Stage 12 contains a task of very large size (1390 KiB). The maximum recommended task size is 1000 KiB.
22/04/25 03:42:02 WARN org.apache.spark.scheduler.TaskSetManager: Stage 13 contains a task of very large size (1390 KiB). The maximum recommended task size is 1000 KiB.


Loss: 0.009825258503583697
Model: [-0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
----------
STEP: 3


22/04/25 03:42:02 WARN org.apache.spark.scheduler.TaskSetManager: Stage 14 contains a task of very large size (1390 KiB). The maximum recommended task size is 1000 KiB.
22/04/25 03:42:02 WARN org.apache.spark.scheduler.TaskSetManager: Stage 15 contains a task of very large size (1390 KiB). The maximum recommended task size is 1000 KiB.


Loss: 0.009823774724840903
Model: [-0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
----------
STEP: 4


22/04/25 03:42:02 WARN org.apache.spark.scheduler.TaskSetManager: Stage 16 contains a task of very large size (1390 KiB). The maximum recommended task size is 1000 KiB.
22/04/25 03:42:03 WARN org.apache.spark.scheduler.TaskSetManager: Stage 17 contains a task of very large size (1390 KiB). The maximum recommended task size is 1000 KiB.
22/04/25 03:42:03 WARN org.apache.spark.scheduler.TaskSetManager: Stage 18 contains a task of very large size (1390 KiB). The maximum recommended task size is 1000 KiB.


Loss: 0.009822503314113504
Model: [-0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.999, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
----------
STEP: 5
Loss: 0.009821761400156535
Model: [0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.999, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.001, 1.0]
CPU times: user 2.26 s, sys: 61.1 ms, total: 2.32 s
Wall time: 4.98 s


22/04/25 03:42:03 WARN org.apache.spark.scheduler.TaskSetManager: Stage 19 contains a task of very large size (1390 KiB). The maximum recommended task size is 1000 KiB.


## Broadcasting

The other element that we need to introduce today is **broadcast variables**. Like the accumulators we saw before, broadcast variables are shared variables: they allow us to send a read-only value to the executors once and reuse it there. Why is this important? Suppose you don't use them and have a large lookup table or feature vector living in the driver. Any driver variable referenced inside a function such as a `map` lambda is then serialized and shipped along with every task that uses it, which incurs repeated network I/O and slows down performance. A broadcast variable is instead sent to each executor only once and kept there for all of its tasks.
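
A back-of-envelope calculation shows why this matters. The sketch below is a deliberately simplified cost model, not a measurement: the function names and all the numbers are hypothetical, and it ignores serialization overhead, compression, and how Spark actually distributes broadcast blocks. It only compares "one copy per task" with "one copy per executor".

```python
def bytes_shipped_with_closure(payload_bytes, n_tasks):
    """The payload is serialized into every task, so it travels once per task."""
    return payload_bytes * n_tasks


def bytes_shipped_with_broadcast(payload_bytes, n_executors):
    """The payload is sent to each executor once and reused by all of its tasks."""
    return payload_bytes * n_executors


# Hypothetical numbers, chosen only for illustration
lookup_table_bytes = 50 * 1024 * 1024   # a 50 MiB lookup table
n_tasks = 2000                          # tasks that reference the table
n_executors = 20                        # executors running those tasks

closure = bytes_shipped_with_closure(lookup_table_bytes, n_tasks)
broadcast = bytes_shipped_with_broadcast(lookup_table_bytes, n_executors)

print(closure // broadcast)             # 100
```

Under this model the saving grows with the size of the shared value and with the number of tasks per executor. For a very small value, such as a weight vector with a few dozen entries, the difference is correspondingly small.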

In [6]:
## Let's change our GD Update to include Broadcasting of the model
def GDUpdateBroad(dataRDD, W, learningRate = 0.1):
    """
    Perform one OLS gradient descent step/update.
    Args:
        dataRDD - records are tuples of (features_array, y)
        W       - (array) model coefficients with bias at index 0
    Returns:
        new_model - (array) updated coefficients, bias at index 0
    """
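    # Add the intercept (no .cache() here: this RDD is rebuilt on every call
    # and consumed by a single action, so caching it would bring no reuse)
    augmentedData = dataRDD.map(lambda x: (np.append([1.0], x[0]), x[1]))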
    
    # Compute the gradient
    W_broadcast = sc.broadcast(W)
    grad = augmentedData.map(lambda x: (W_broadcast.value.dot(x[0]) - x[1])*x[0]).mean() * 2
    new_model = W - learningRate * grad
    return new_model

In [7]:
%%time
## Let's run the same example as before, without caching but with broadcasting
## RDD Creation
PointsRDD = spark.createDataFrame(data)\
            .rdd.map(lambda Row: (Row[0:-1], Row[-1]))

## Number of iterations
nSteps = 5

## Initial guess
model = np.array([0]+[1]*n)

for idx in range(nSteps):
    print("----------")
    print(f"STEP: {idx+1}")
    model = GDUpdateBroad(PointsRDD, model)
    loss = OLSLoss(PointsRDD, model)
    print(f"Loss: {loss}")
    print(f"Model: {[round(w,3) for w in model]}")

----------
STEP: 1


22/04/25 03:42:16 WARN org.apache.spark.scheduler.TaskSetManager: Stage 20 contains a task of very large size (1390 KiB). The maximum recommended task size is 1000 KiB.
22/04/25 03:42:16 WARN org.apache.spark.scheduler.TaskSetManager: Stage 21 contains a task of very large size (1390 KiB). The maximum recommended task size is 1000 KiB.


Loss: 0.009826849449816691
Model: [-0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
----------
STEP: 2


22/04/25 03:42:16 WARN org.apache.spark.scheduler.TaskSetManager: Stage 22 contains a task of very large size (1390 KiB). The maximum recommended task size is 1000 KiB.
22/04/25 03:42:17 WARN org.apache.spark.scheduler.TaskSetManager: Stage 23 contains a task of very large size (1390 KiB). The maximum recommended task size is 1000 KiB.


Loss: 0.009825258503583697
Model: [-0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
----------
STEP: 3


22/04/25 03:42:17 WARN org.apache.spark.scheduler.TaskSetManager: Stage 24 contains a task of very large size (1390 KiB). The maximum recommended task size is 1000 KiB.
22/04/25 03:42:18 WARN org.apache.spark.scheduler.TaskSetManager: Stage 25 contains a task of very large size (1390 KiB). The maximum recommended task size is 1000 KiB.


Loss: 0.009823774724840903
Model: [-0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
----------
STEP: 4


22/04/25 03:42:18 WARN org.apache.spark.scheduler.TaskSetManager: Stage 26 contains a task of very large size (1390 KiB). The maximum recommended task size is 1000 KiB.
22/04/25 03:42:18 WARN org.apache.spark.scheduler.TaskSetManager: Stage 27 contains a task of very large size (1390 KiB). The maximum recommended task size is 1000 KiB.


Loss: 0.009822503314113504
Model: [-0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.999, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
----------
STEP: 5


22/04/25 03:42:19 WARN org.apache.spark.scheduler.TaskSetManager: Stage 28 contains a task of very large size (1390 KiB). The maximum recommended task size is 1000 KiB.
22/04/25 03:42:19 WARN org.apache.spark.scheduler.TaskSetManager: Stage 29 contains a task of very large size (1390 KiB). The maximum recommended task size is 1000 KiB.


Loss: 0.009821761400156535
Model: [0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.999, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.001, 1.0]
CPU times: user 2.32 s, sys: 56.7 ms, total: 2.37 s
Wall time: 6 s


In [8]:
%%time
## Let's now combine caching and broadcasting
PointsRDDcached = spark.createDataFrame(data)\
            .rdd.map(lambda Row: (Row[0:-1], Row[-1])).cache()

## Number of iterations
nSteps = 5

## Initial guess
model = np.array([0]+[1]*n)

for idx in range(nSteps):
    print("----------")
    print(f"STEP: {idx+1}")
    model = GDUpdateBroad(PointsRDDcached, model)
    loss = OLSLoss(PointsRDDcached, model)
    print(f"Loss: {loss}")
    print(f"Model: {[round(w,3) for w in model]}")

----------
STEP: 1


22/04/25 03:42:27 WARN org.apache.spark.scheduler.TaskSetManager: Stage 30 contains a task of very large size (1390 KiB). The maximum recommended task size is 1000 KiB.
22/04/25 03:42:27 WARN org.apache.spark.scheduler.TaskSetManager: Stage 31 contains a task of very large size (1390 KiB). The maximum recommended task size is 1000 KiB.
22/04/25 03:42:27 WARN org.apache.spark.scheduler.TaskSetManager: Stage 32 contains a task of very large size (1390 KiB). The maximum recommended task size is 1000 KiB.


Loss: 0.009826849449816691
Model: [-0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
----------
STEP: 2


22/04/25 03:42:28 WARN org.apache.spark.scheduler.TaskSetManager: Stage 33 contains a task of very large size (1390 KiB). The maximum recommended task size is 1000 KiB.
22/04/25 03:42:28 WARN org.apache.spark.scheduler.TaskSetManager: Stage 34 contains a task of very large size (1390 KiB). The maximum recommended task size is 1000 KiB.


Loss: 0.009825258503583697
Model: [-0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
----------
STEP: 3


22/04/25 03:42:28 WARN org.apache.spark.scheduler.TaskSetManager: Stage 35 contains a task of very large size (1390 KiB). The maximum recommended task size is 1000 KiB.
22/04/25 03:42:28 WARN org.apache.spark.scheduler.TaskSetManager: Stage 36 contains a task of very large size (1390 KiB). The maximum recommended task size is 1000 KiB.


Loss: 0.009823774724840903
Model: [-0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
----------
STEP: 4


22/04/25 03:42:28 WARN org.apache.spark.scheduler.TaskSetManager: Stage 37 contains a task of very large size (1390 KiB). The maximum recommended task size is 1000 KiB.
22/04/25 03:42:29 WARN org.apache.spark.scheduler.TaskSetManager: Stage 38 contains a task of very large size (1390 KiB). The maximum recommended task size is 1000 KiB.


Loss: 0.009822503314113504
Model: [-0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.999, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
----------
STEP: 5
Loss: 0.009821761400156535
Model: [0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.999, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.001, 1.0]
CPU times: user 2.29 s, sys: 28.4 ms, total: 2.32 s
Wall time: 4.65 s


22/04/25 03:42:29 WARN org.apache.spark.scheduler.TaskSetManager: Stage 39 contains a task of very large size (1390 KiB). The maximum recommended task size is 1000 KiB.
