# Categorical Class Labels
Usually the class label is a categorical value (i.e., a string). As reported before, Spark MLlib works only with numerical values and hence categorical class label values must be mapped to integer (and then double) values.

The Estimator StringIndexer and the Transformer IndexToString support the transformation of a categorical class label into a numerical one and vice versa.

Main steps:
1. Use StringIndexer to extend the input DataFrame with a new column, called “label”, containing the numerical representation of the class label column
2. Create a column, called “features”, of type vector containing the predictive features
3. Infer a classification model by using a classification algorithm (e.g., Decision Tree, Logistic Regression). The model is built by considering only the values of features and label. All the other columns are ignored by the classification algorithm during the generation of the prediction model.
4. Apply the model on a set of unlabeled data to predict their numerical class label
5. Use IndexToString to convert the predicted numerical class label values to the original categorical values
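
The mapping performed in steps 1 and 5 can be illustrated without Spark. The following is a minimal pure-Python sketch, not Spark's implementation: the helper names (`fit_label_index`, `to_index`, `to_string`) and the sample label values are hypothetical. It assumes StringIndexer's default ordering (most frequent label gets index 0, with ties broken alphabetically in recent Spark versions) and mimics `handleInvalid="keep"` by giving unseen values one extra index.

```python
from collections import Counter


def fit_label_index(values):
    """Order the distinct labels by descending frequency (ties alphabetical)."""
    counts = Counter(values)
    return sorted(counts, key=lambda v: (-counts[v], v))


def to_index(labels, value):
    """Map a label to its index as a double; unseen labels get len(labels)."""
    if value in labels:
        return float(labels.index(value))
    return float(len(labels))


def to_string(labels, index):
    """Inverse mapping: from a numerical index back to the original label."""
    return labels[int(index)]


# Hypothetical training labels, used only for illustration
training_labels = ["Positive", "Negative", "Positive",
                   "Neutral", "Positive", "Negative"]
labels = fit_label_index(training_labels)

# "Unknown" never appears in the training labels
indexed = [to_index(labels, v) for v in ["Negative", "Positive", "Unknown"]]

# Convert the known indices back to their categorical values
restored = [to_string(labels, i) for i in indexed[:2]]
```

In the Spark pipeline the list of labels learned during fitting plays the same role: it is what IndexToString needs in order to translate predicted indices back into strings.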

### We perform the same operations as for Linear Regression

In [None]:
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.feature import StringIndexer
from pyspark.ml.feature import IndexToString
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml import Pipeline
from pyspark.ml import PipelineModel

# input and output folders
trainingData = "ex_dataCategorical/trainingData.csv"
unlabeledData = "ex_dataCategorical/unlabeledData.csv"
outputPath = "predictionsDTCategoricalPipeline/"

In [None]:
# *************************
# Training step
# *************************

# Create a DataFrame from trainingData.csv
# Training data in raw format
trainingData = spark.read.load(trainingData, format="csv", header=True, inferSchema=True)

# Define an assembler to create a column (features) of type Vector
# containing the double values associated with columns attr1, attr2, attr3
assembler = VectorAssembler(inputCols=["attr1", "attr2", "attr3"], outputCol="features")

### Here is the difference

In [None]:
# The StringIndexer Estimator is used to map each class label
# value to an integer value (cast to a double).
# A new attribute called label is generated by
# transforming the content of the categoricalLabel attribute.
labelIndexer = StringIndexer(inputCol="categoricalLabel", outputCol="label",
                                handleInvalid="keep").fit(trainingData)

In [None]:
# Create a DecisionTreeClassifier object.
# DecisionTreeClassifier is an Estimator that is used to
# create a classification model based on decision trees.
dt = DecisionTreeClassifier()

# We can set the values of the parameters of the DecisionTree
# For example we can set the measure that is used to decide if a
# node must be split.
# In this case we set gini index
dt.setImpurity("gini")
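
To make the impurity measure concrete, here is a minimal pure-Python sketch of the Gini index of a single node, computed as 1 minus the sum of the squared class proportions. The function name `gini_impurity` and the sample label lists are hypothetical and only illustrate the formula; this is not how Spark computes it internally.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a list of class labels: 1 - sum_i p_i^2."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((n / total) ** 2 for n in counts.values())


# A node whose records all belong to one class is pure
pure_node = gini_impurity([0.0, 0.0, 0.0, 0.0])

# A node split evenly between two classes is maximally impure for two classes
mixed_node = gini_impurity([0.0, 0.0, 1.0, 1.0])
```

A lower value means the records in a node are more homogeneous, so the tree prefers splits that reduce impurity. `DecisionTreeClassifier` also accepts `"entropy"` as an alternative impurity measure.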

### Finally, we convert the result from numerical to categorical

In [None]:
# At the end of the pipeline we must convert indexed labels back
# to original labels (from numerical to string).
# The content of the prediction attribute is the index of the predicted class
# The original name of the predicted class is stored in the predictedLabel
# attribute.

# IndexToString creates a new column (called predictedLabel in
# this example) that is based on the content of the prediction column.
# prediction is a double while predictedLabel is a string
labelConverter = IndexToString(inputCol="prediction", outputCol="predictedLabel",\
                                    labels=labelIndexer.labels)

In [None]:
# Define a pipeline that is used to create the decision tree
# model on the training data. The pipeline also includes
# the preprocessing and postprocessing steps
pipeline = Pipeline().setStages([assembler, labelIndexer, dt, labelConverter])

# Execute the pipeline on the training data to build the
# classification model
classificationModel = pipeline.fit(trainingData)
# Now, the classification model can be used to predict the class label
# of new unlabeled data

In [None]:
# *************************
# Prediction step
# *************************
# Create a DataFrame from unlabeledData.csv
# Unlabeled data in raw format
unlabeledData = spark.read.load(unlabeledData, format="csv", header=True, inferSchema=True)

# Make predictions on the unlabeled data using the transform() method of the
# trained classification model. The model uses only the content of 'features'
# to perform the predictions. The model is associated with the pipeline and hence
# the assembler is also executed
predictions = classificationModel.transform(unlabeledData)

In [None]:
# The returned DataFrame has the following schema (attributes)
# - attr1: double (nullable = true)
# - attr2: double (nullable = true)
# - attr3: double (nullable = true)
# - features: vector (values of the attributes)
# - label: double (value of the class label)
# - rawPrediction: vector (nullable = true)
# - probability: vector (the i-th cell contains the probability that the
#   current record belongs to the i-th class)
# - prediction: double (the predicted class label)
# - predictedLabel: string (nullable = true)

# Select only the original features (i.e., the value of the original attributes
# attr1, attr2, attr3) and the predicted class for each record
predictions = predictions.select("attr1", "attr2", "attr3", "predictedLabel")