# **Parameter Tuning**
Setting the parameters of an algorithm is always a difficult task.
- A “brute force” approach can be used to find the setting that optimizes a quality index
- The training data is split into two subsets
    - The first set is used to build a model
    - The second one is used to evaluate the quality of the model
- The setting that maximizes a quality index (e.g., the prediction accuracy) is used to build the final model on the whole training dataset.
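
The steps in the list above can be sketched in plain Python. This is only an illustration under stated assumptions: the 1-D k-nearest-neighbour classifier, the toy data, and the names `knn_predict`, `accuracy`, and `tune_k` are all hypothetical stand-ins and are not part of the notebook or of MLlib.

```python
import random

def knn_predict(train_rows, x, k):
    """Majority vote among the k training rows closest to x."""
    nearest = sorted(train_rows, key=lambda row: abs(row[0] - x))[:k]
    positives = sum(label for _, label in nearest)
    return 1 if 2 * positives > len(nearest) else 0

def accuracy(train_rows, eval_rows, k):
    """Fraction of eval_rows whose label is predicted correctly."""
    hits = sum(knn_predict(train_rows, x, k) == label for x, label in eval_rows)
    return hits / len(eval_rows)

def tune_k(rows, candidates, build_fraction=0.7, seed=0):
    """Try every candidate k on one build/evaluate split and keep the best."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * build_fraction)
    build_set, eval_set = shuffled[:cut], shuffled[cut:]  # the two subsets
    scores = {k: accuracy(build_set, eval_set, k) for k in candidates}
    best_k = max(scores, key=scores.get)  # setting maximizing the quality index
    return best_k, scores

# Toy labeled data (assumption): label 1 when x < 5.0, else 0
rows = [(i / 10, 1 if i < 50 else 0) for i in range(100)]
best_k, scores = tune_k(rows, candidates=[1, 3, 5])

# Final "model": the chosen setting applied to the whole training dataset
final_prediction = knn_predict(rows, 0.5, best_k)
```

The key point is the last step: the split is used only to pick the setting, and the final model is rebuilt from all the training rows.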

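With a single split, the quality estimate depends on which rows happen to fall in each subset. Hence, the cross-validation approach is usually used. It creates k splits of the training data and, for each parameter configuration, builds and evaluates k models, one per split. The total number of models is therefore k times the number of configurations.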
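
The counting in the paragraph above can be checked with a small sketch. The helper names `k_fold_indices` and `models_built` are hypothetical, and the round-robin fold assignment is a simplification chosen for readability; it is not a claim about how Spark assigns rows to folds.

```python
def k_fold_indices(n_rows, k):
    """Yield (build_idx, eval_idx) pairs, one per fold."""
    # Round-robin assignment of row indices to k folds (a simplification)
    folds = [list(range(start, n_rows, k)) for start in range(k)]
    for held_out in range(k):
        eval_idx = folds[held_out]
        build_idx = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield build_idx, eval_idx

def models_built(num_configurations, k):
    """Each configuration is fit once per fold."""
    return num_configurations * k

splits = list(k_fold_indices(n_rows=9, k=3))

# 6 configurations and k = 3 match the grid and fold count used in this notebook
total = models_built(num_configurations=6, k=3)
```

Every row appears in exactly one evaluation fold, so each configuration is scored on all the training data across its k models.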

Input:
   - An MLlib pipeline
   - A set of values to be evaluated for each input parameter of the pipeline
       - All the possible combinations of the specified parameter values are considered and the related models are automatically generated and evaluated by Spark.
   - A quality evaluation metric to evaluate the result of the input pipeline

Output:
   - The model associated with the best parameter setting, in terms of the quality evaluation metric

In [1]:
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder
from pyspark.ml.tuning import CrossValidator
from pyspark.ml import Pipeline
from pyspark.ml import PipelineModel

In [3]:
# input and output folders
trainingData = "./databases/trainingData.csv"
unlabeledData = "./databases/unlabeledData.csv"
outputPath = "./tuning/"

# Training data in raw format
labeledDataDF = spark.read.load(trainingData,\
                                format="csv",\
                                header=True,\
                                inferSchema=True)

In [4]:
# *************************
# Training step
# *************************

# Define an assembler to create a column (features) of type Vector
# containing the double values associated with columns attr1, attr2, attr3
assembler = VectorAssembler(inputCols=["attr1", "attr2", "attr3"],\
                                    outputCol="features")

In [5]:
lr = LogisticRegression()
pipeline = Pipeline().setStages([assembler, lr])

**Parameter Grid** construction:

In [6]:
paramGrid = ParamGridBuilder()\
.addGrid(lr.maxIter, [10, 100, 1000])\
.addGrid(lr.regParam, [0.1, 0.01])\
.build()
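
The grid is the Cartesian product of the value lists, so the cell above describes 3 x 2 = 6 configurations. A minimal plain-Python sketch of the same enumeration follows; `build_grid` is a hypothetical helper, and the string keys stand in for the `Param` objects (`lr.maxIter`, `lr.regParam`) that the real grid uses.

```python
from itertools import product

def build_grid(values_by_param):
    """Return one dict per combination of the given parameter values."""
    names = list(values_by_param)
    value_lists = [values_by_param[name] for name in names]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]

# Same values as the ParamGridBuilder cell; string keys are stand-ins
grid = build_grid({"maxIter": [10, 100, 1000], "regParam": [0.1, 0.01]})

for configuration in grid:
    print(configuration)
```

Adding a third parameter with m values multiplies the number of configurations by m, which is why grids are usually kept small.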

In [7]:
# We now treat the Pipeline as an Estimator, wrapping it in a
# CrossValidator instance. This allows us to jointly choose parameters
# for all Pipeline stages.
# CrossValidator requires
# - an Estimator
# - a set of Estimator ParamMaps
# - an Evaluator (BinaryClassificationEvaluator uses areaUnderROC
#   as its default metric).

cv = CrossValidator()\
.setEstimator(pipeline)\
.setEstimatorParamMaps(paramGrid)\
.setEvaluator(BinaryClassificationEvaluator())\
.setNumFolds(3)

In [8]:
# Run cross-validation. The result is a CrossValidatorModel wrapping
# the pipeline model (assembler + logistic regression) refit on the
# whole training dataset with the best set of parameters (based on
# the results of the cross-validation operation).
tunedLRmodel = cv.fit(labeledDataDF)
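
To see which configuration won, each parameter map can be paired with its average metric. The sketch below is plain Python: `rank_configurations` is a hypothetical helper, and the two scores are made-up placeholders, not results from this notebook. The commented call assumes the fitted model exposes `avgMetrics` and `getEstimatorParamMaps()`, as `CrossValidatorModel` does in recent PySpark versions.

```python
def rank_configurations(param_maps, avg_metrics, larger_is_better=True):
    """Pair each parameter map with its average metric, best first."""
    paired = list(zip(param_maps, avg_metrics))
    return sorted(paired, key=lambda pair: pair[1], reverse=larger_is_better)

# Placeholder inputs (assumptions), standing in for the real param maps and scores
example_maps = [{"maxIter": 10, "regParam": 0.1}, {"maxIter": 10, "regParam": 0.01}]
example_scores = [0.91, 0.95]

ranking = rank_configurations(example_maps, example_scores)
best_map, best_score = ranking[0]

# With the fitted model from the cell above, the call would look like:
# ranking = rank_configurations(tunedLRmodel.getEstimatorParamMaps(),
#                               tunedLRmodel.avgMetrics)
```

For metrics where smaller is better (for example an error measure), pass `larger_is_better=False`.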

In [9]:
# *************************
# Prediction step
# *************************

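# Use a separate name for the DataFrame so the path string in
# unlabeledData is not overwritten and the cell can be re-run
unlabeledDataDF = spark.read.load(unlabeledData,\
format="csv", header=True, inferSchema=True)
predictionsDF = tunedLRmodel.transform(unlabeledDataDF)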

In [10]:
predictions = predictionsDF.select("attr1", "attr2", "attr3", "prediction")

In [13]:
labeledDataDF.show()
predictionsDF.show()

+-----+-----+-----+-----+
|label|attr1|attr2|attr3|
+-----+-----+-----+-----+
|  1.0|  0.0|  1.1|  0.1|
|  0.0|  2.0|  1.0| -1.0|
|  0.0|  2.0|  1.3|  1.0|
|  1.0|  0.0|  1.2| -0.5|
+-----+-----+-----+-----+

+-----+-----+-----+-----+--------------+--------------------+--------------------+----------+
|label|attr1|attr2|attr3|      features|       rawPrediction|         probability|prediction|
+-----+-----+-----+-----+--------------+--------------------+--------------------+----------+
| null| -1.0|  1.5|  1.3|[-1.0,1.5,1.3]|[-2.9809716290125...|[0.04829295232390...|       1.0|
| null|  3.0|  2.0| -0.1|[3.0,2.0,-0.1]|[1.84892116210460...|[0.86400038529945...|       0.0|
| null|  0.0|  2.2| -1.5|[0.0,2.2,-1.5]|[-2.9350858066952...|[0.05044615025400...|       1.0|
+-----+-----+-----+-----+--------------+--------------------+--------------------+----------+

