# Estimator and Transformer

## Setup and Initialization

In [1]:
// Set log level to ERROR (less verbose)
sc.setLogLevel("ERROR")

Initializing Scala interpreter ...

Spark Web UI available at http://252f7d7c2f69:4040
SparkContext available as 'sc' (version = 2.4.2, master = local[*], app id = local-1559548724393)
SparkSession available as 'spark'


### Importing Libraries

In [2]:
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.sql.Row


## Data Sources

### Training Set

In [3]:
// Prepare training data from a list of (label, features) tuples.
val training = spark.createDataFrame(Seq(
  (1.0, Vectors.dense(0.0, 1.1, 0.1)),
  (0.0, Vectors.dense(2.0, 1.0, -1.0)),
  (0.0, Vectors.dense(2.0, 1.3, 1.0)),
  (1.0, Vectors.dense(0.0, 1.2, -0.5))
)).toDF("label", "features")


training.show()

+-----+--------------+
|label|      features|
+-----+--------------+
|  1.0| [0.0,1.1,0.1]|
|  0.0|[2.0,1.0,-1.0]|
|  0.0| [2.0,1.3,1.0]|
|  1.0|[0.0,1.2,-0.5]|
+-----+--------------+



training: org.apache.spark.sql.DataFrame = [label: double, features: vector]


### Test Set

In [4]:
// Prepare test data.
val test = spark.createDataFrame(Seq(
  (1.0, Vectors.dense(-1.0, 1.5, 1.3)),
  (0.0, Vectors.dense(3.0, 2.0, -0.1)),
  (1.0, Vectors.dense(0.0, 2.2, -1.5))
)).toDF("label", "features")

test.show()

+-----+--------------+
|label|      features|
+-----+--------------+
|  1.0|[-1.0,1.5,1.3]|
|  0.0|[3.0,2.0,-0.1]|
|  1.0|[0.0,2.2,-1.5]|
+-----+--------------+



test: org.apache.spark.sql.DataFrame = [label: double, features: vector]


## Estimator

In [5]:
// Create a LogisticRegression instance. This instance is an Estimator.
val lr = new LogisticRegression()

lr: org.apache.spark.ml.classification.LogisticRegression = logreg_f463ed4a4d9d


### Parameters

In [6]:
// Print out the parameters, documentation, and any default values.
println(s"LogisticRegression parameters:\n ${lr.explainParams()}\n")

LogisticRegression parameters:
 aggregationDepth: suggested depth for treeAggregate (>= 2) (default: 2)
elasticNetParam: the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty (default: 0.0)
family: The name of family which is a description of the label distribution to be used in the model. Supported options: auto, binomial, multinomial. (default: auto)
featuresCol: features column name (default: features)
fitIntercept: whether to fit an intercept term (default: true)
labelCol: label column name (default: label)
lowerBoundsOnCoefficients: The lower bounds on coefficients if fitting under bound constrained optimization. (undefined)
lowerBoundsOnIntercepts: The lower bounds on intercepts if fitting under bound constrained optimization. (undefined)
maxIter: maximum number of iterations (>= 0) (default: 100)
predictionCol: prediction column name (default: prediction)
probabilityCol: Column name for predicted class c

In [7]:
// We may set parameters using setter methods.
lr.setMaxIter(10)
  .setRegParam(0.01)

res4: lr.type = logreg_f463ed4a4d9d
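
The setters above return the estimator itself, so calls can be chained. As a minimal sketch of how the resulting values can be read back, the cell below builds a separate estimator (the name `lrDemo` is an assumption for illustration and is not part of the original notebook) and queries it through the standard getter and `isSet` methods.

```scala
import org.apache.spark.ml.classification.LogisticRegression

// A separate estimator (hypothetical name) so this sketch does not touch `lr`.
val lrDemo = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.01)

// Getters return the explicitly set value, or the default if none was set.
println(s"maxIter = ${lrDemo.getMaxIter}, regParam = ${lrDemo.getRegParam}")

// isSet distinguishes explicitly set parameters from those still at their default.
println(s"maxIter set explicitly? ${lrDemo.isSet(lrDemo.maxIter)}")
println(s"tol set explicitly?     ${lrDemo.isSet(lrDemo.tol)}")
```

Parameters that were never set, such as `tol` here, still have a value through their default, which is why they appear when the full parameter map of a fitted model is extracted.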


In [8]:
// We may alternatively specify parameters using a ParamMap,
// which supports several methods for specifying parameters.
val paramMap = ParamMap(lr.maxIter -> 20)
  .put(lr.maxIter, 30)  // Specify 1 Param. This overwrites the original maxIter.
  .put(lr.regParam -> 0.1, lr.threshold -> 0.55)  // Specify multiple Params.

paramMap: org.apache.spark.ml.param.ParamMap =
{
	logreg_f463ed4a4d9d-maxIter: 30,
	logreg_f463ed4a4d9d-regParam: 0.1,
	logreg_f463ed4a4d9d-threshold: 0.55
}
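
Two `ParamMap`s can also be merged with `++`. The following is a sketch under stated assumptions rather than part of the original notebook: `lrSketch`, `base`, `extra`, and `combined` are hypothetical names, and `"myProbability"` is an illustrative output column name.

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.param.ParamMap

// A separate estimator (hypothetical name); ParamMap keys are tied to its uid.
val lrSketch = new LogisticRegression()

val base = ParamMap(lrSketch.maxIter -> 30)
  .put(lrSketch.regParam -> 0.1, lrSketch.threshold -> 0.55)

// A second map that renames the probability output column (illustrative name).
val extra = ParamMap(lrSketch.probabilityCol -> "myProbability")

// ++ returns a new ParamMap; if both maps hold the same parameter,
// the value from the right-hand map is kept.
val combined = base ++ extra

println(combined.get(lrSketch.maxIter))             // an Option holding the stored value
println(combined.contains(lrSketch.probabilityCol)) // whether the parameter is present
```

Because `++` builds a new map, `base` is left unchanged and can still be passed to `fit` on its own.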


### Modeling

In [9]:
// Learn a LogisticRegression model. This uses the parameters stored in lr.
val model1 = lr.fit(training)

// Learn a second model. Parameters in paramMap override those set earlier via lr.set* methods.
val model2 = lr.fit(training, paramMap)

model1: org.apache.spark.ml.classification.LogisticRegressionModel = LogisticRegressionModel: uid = logreg_f463ed4a4d9d, numClasses = 2, numFeatures = 3
model2: org.apache.spark.ml.classification.LogisticRegressionModel = LogisticRegressionModel: uid = logreg_f463ed4a4d9d, numClasses = 2, numFeatures = 3
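
A fitted `LogisticRegressionModel` exposes what it learned. The sketch below is self-contained and only illustrative: it refits on the same four training rows with the setter values used earlier, and the names `session`, `data`, and `fitted` are assumptions introduced here. Inside this notebook, `getOrCreate()` should simply return the existing session.

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

// Reuses the running session if there is one, otherwise starts a local one.
val session = SparkSession.builder()
  .master("local[*]")
  .appName("lr-sketch")
  .getOrCreate()

// Same (label, features) rows as the training set above.
val data = session.createDataFrame(Seq(
  (1.0, Vectors.dense(0.0, 1.1, 0.1)),
  (0.0, Vectors.dense(2.0, 1.0, -1.0)),
  (0.0, Vectors.dense(2.0, 1.3, 1.0)),
  (1.0, Vectors.dense(0.0, 1.2, -0.5))
)).toDF("label", "features")

val fitted = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.01)
  .fit(data)

// For a binary model, coefficients is a vector with one weight per feature
// and intercept is a single Double.
println(s"Coefficients: ${fitted.coefficients}")
println(s"Intercept:    ${fitted.intercept}")
```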


In [10]:
// Since model1 is a Model (i.e., a Transformer produced by an Estimator),
// we can view the parameters it used during fit().
// This prints the parameter (name: value) pairs, where names are unique IDs for this
// LogisticRegression instance.
println(s"Model 1 was fit using parameters: ${model1.parent.extractParamMap}")

Model 1 was fit using parameters: {
	logreg_f463ed4a4d9d-aggregationDepth: 2,
	logreg_f463ed4a4d9d-elasticNetParam: 0.0,
	logreg_f463ed4a4d9d-family: auto,
	logreg_f463ed4a4d9d-featuresCol: features,
	logreg_f463ed4a4d9d-fitIntercept: true,
	logreg_f463ed4a4d9d-labelCol: label,
	logreg_f463ed4a4d9d-maxIter: 10,
	logreg_f463ed4a4d9d-predictionCol: prediction,
	logreg_f463ed4a4d9d-probabilityCol: probability,
	logreg_f463ed4a4d9d-rawPredictionCol: rawPrediction,
	logreg_f463ed4a4d9d-regParam: 0.01,
	logreg_f463ed4a4d9d-standardization: true,
	logreg_f463ed4a4d9d-threshold: 0.5,
	logreg_f463ed4a4d9d-tol: 1.0E-6
}


In [11]:
// Model 2 was fit with paramMap, so maxIter, regParam, and threshold differ from Model 1.
println(s"Model 2 was fit using parameters: ${model2.parent.extractParamMap}")

Model 2 was fit using parameters: {
	logreg_f463ed4a4d9d-aggregationDepth: 2,
	logreg_f463ed4a4d9d-elasticNetParam: 0.0,
	logreg_f463ed4a4d9d-family: auto,
	logreg_f463ed4a4d9d-featuresCol: features,
	logreg_f463ed4a4d9d-fitIntercept: true,
	logreg_f463ed4a4d9d-labelCol: label,
	logreg_f463ed4a4d9d-maxIter: 30,
	logreg_f463ed4a4d9d-predictionCol: prediction,
	logreg_f463ed4a4d9d-probabilityCol: probability,
	logreg_f463ed4a4d9d-rawPredictionCol: rawPrediction,
	logreg_f463ed4a4d9d-regParam: 0.1,
	logreg_f463ed4a4d9d-standardization: true,
	logreg_f463ed4a4d9d-threshold: 0.55,
	logreg_f463ed4a4d9d-tol: 1.0E-6
}


In [12]:
// Make predictions on test data using the Transformer.transform() method.
// LogisticRegressionModel.transform uses only the 'features' column.
println("Model 1")
model1.transform(test)
  .show()

println("Model 2")
model2.transform(test)
  .show()

Model 1
+-----+--------------+--------------------+--------------------+----------+
|label|      features|       rawPrediction|         probability|prediction|
+-----+--------------+--------------------+--------------------+----------+
|  1.0|[-1.0,1.5,1.3]|[-6.5872014439355...|[0.00137599470692...|       1.0|
|  0.0|[3.0,2.0,-0.1]|[3.98018281942565...|[0.98166040093741...|       0.0|
|  1.0|[0.0,2.2,-1.5]|[-6.3765177028604...|[0.00169814755783...|       1.0|
+-----+--------------+--------------------+--------------------+----------+

Model 2
+-----+--------------+--------------------+--------------------+----------+
|label|      features|       rawPrediction|         probability|prediction|
+-----+--------------+--------------------+--------------------+----------+
|  1.0|[-1.0,1.5,1.3]|[-2.8046569418746...|[0.05707304171034...|       1.0|
|  0.0|[3.0,2.0,-0.1]|[2.49587635664205...|[0.92385223117041...|       0.0|
|  1.0|[0.0,2.2,-1.5]|[-2.0935249027913...|[0.10972776114779...|       