### Create a cluster

* Cluster name: User Name’s cluster (the default cluster name)
* Policy: Unrestricted
* Cluster mode: Single Node
* Access mode: Single user (with your user account selected)
* Databricks runtime version: Select the ML edition of the latest non-beta version of the runtime (not a Standard runtime version) that:
  * Does not use a GPU
  * Includes Scala > 2.11
  * Includes Spark > 3.4
* Use Photon Acceleration: Unselected
* Node type: Standard_D4ds_v5
* Terminate after 20 minutes of inactivity

### Ingest the data

Download [`penguins.csv`](https://raw.githubusercontent.com/MicrosoftLearning/mslearn-databricks/main/data/penguins.csv) into the `/dbfs/ml_lab` folder:

In [0]:
%sh
rm -r /dbfs/ml_lab
mkdir /dbfs/ml_lab
wget -O /dbfs/ml_lab/penguins.csv https://raw.githubusercontent.com/MicrosoftLearning/mslearn-databricks/main/data/penguins.csv

rm: cannot remove '/dbfs/ml_lab': No such file or directory
--2025-01-23 10:35:00--  https://raw.githubusercontent.com/MicrosoftLearning/mslearn-databricks/main/data/penguins.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.111.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 9533 (9.3K) [text/plain]
Saving to: ‘/dbfs/ml_lab/penguins.csv’

     0K .........                                             100%  582K=0.02s

2025-01-23 10:35:00 (582 KB/s) - ‘/dbfs/ml_lab/penguins.csv’ saved [9533/9533]



### Explore and clean up the data

* Load into a dataframe

In [0]:
df = spark.read.format("csv").option("header", "true").load("/ml_lab/penguins.csv")
display(df)

Island,CulmenLength,CulmenDepth,FlipperLength,BodyMass,Species
Torgersen,39.1,18.7,181.0,3750.0,0
Torgersen,39.5,17.4,186.0,3800.0,0
Torgersen,40.3,18.0,195.0,3250.0,0
Torgersen,,,,,0
Torgersen,36.7,19.3,193.0,3450.0,0
Torgersen,39.3,20.6,190.0,3650.0,0
Torgersen,38.9,17.8,181.0,3625.0,0
Torgersen,39.2,19.6,195.0,4675.0,0
Torgersen,34.1,18.1,193.0,3475.0,0
Torgersen,42.0,20.2,190.0,4250.0,0


Because this data was loaded from a text file without a schema (and without the inferSchema option), Spark has assigned a string data type to all of the columns, including those that contain blank values.
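
As a rough plain-Python analogy (not Spark itself), the standard `csv` module behaves the same way: every field read from a text file arrives as a string, and a blank field is just an empty string, until something casts it. The two sample rows are copied from the output above.

```python
import csv
import io

# Header plus two rows of penguins.csv as displayed above, including the incomplete row.
sample = """Island,CulmenLength,CulmenDepth,FlipperLength,BodyMass,Species
Torgersen,39.1,18.7,181.0,3750.0,0
Torgersen,,,,,0
"""

rows = list(csv.DictReader(io.StringIO(sample)))

# Every value is text until it is cast explicitly.
value_types = {type(v).__name__ for row in rows for v in row.values()}
print(value_types)                 # {'str'}
print(repr(rows[1]["BodyMass"]))   # '' (the blank field is an empty string)
```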

The data itself consists of measurements of the following details of penguins that have been observed in Antarctica:

* Island: The island in Antarctica where the penguin was observed.
* CulmenLength: The length in mm of the penguin’s culmen (bill).
* CulmenDepth: The depth in mm of the penguin’s culmen.
* FlipperLength: The length in mm of the penguin’s flipper.
* BodyMass: The body mass of the penguin in grams.
* Species: An integer value that represents the species of the penguin:
  * 0: Adelie
  * 1: Gentoo
  * 2: Chinstrap

Our goal in this project is to use the observed characteristics of a penguin (its features) to predict its species (which, in machine learning terminology, we call the label).

* Transform data

Remove the rows with incomplete data by using the dropna method, and apply appropriate data types to the data by using the select method with the col function and the astype column method.

In [0]:
from pyspark.sql.functions import col

# Drop incomplete rows, then cast each column to an appropriate type
data = df.dropna().select(
  col("Island").astype("string"),
  col("CulmenLength").astype("float"),
  col("CulmenDepth").astype("float"),
  col("FlipperLength").astype("float"),
  col("BodyMass").astype("float"),
  col("Species").astype("int")
)
display(data)

Island,CulmenLength,CulmenDepth,FlipperLength,BodyMass,Species
Torgersen,39.1,18.7,181.0,3750.0,0
Torgersen,39.5,17.4,186.0,3800.0,0
Torgersen,40.3,18.0,195.0,3250.0,0
Torgersen,36.7,19.3,193.0,3450.0,0
Torgersen,39.3,20.6,190.0,3650.0,0
Torgersen,38.9,17.8,181.0,3625.0,0
Torgersen,39.2,19.6,195.0,4675.0,0
Torgersen,34.1,18.1,193.0,3475.0,0
Torgersen,42.0,20.2,190.0,4250.0,0
Torgersen,37.8,17.1,186.0,3300.0,0


Other data preparation tasks you might need to perform include fixing (or removing) errors in the data, identifying and removing outliers (atypically large or small values), and balancing the data so there’s a reasonably equal number of rows for each label you’re trying to predict.
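
A minimal plain-Python sketch of two of these checks, assuming a small hand-made sample: the body masses are taken from the rows shown above, while the 37500.0 entry and the label list are invented purely for illustration. The rule of flagging values more than 1.5 times the interquartile range (IQR) beyond the quartiles is one common convention, not something the notebook prescribes.

```python
from collections import Counter
from statistics import quantiles

# Body masses from the rows displayed above, plus one invented typo-like value.
body_mass = [3750.0, 3800.0, 3250.0, 3450.0, 3650.0,
             3625.0, 4675.0, 3475.0, 4250.0, 37500.0]

# Flag values that lie more than 1.5 * IQR outside the quartiles.
q1, _, q3 = quantiles(body_mass, n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [m for m in body_mass if m < low or m > high]
print("Outliers:", outliers)

# Count rows per label to see how balanced the classes are (labels invented).
labels = [0, 0, 0, 0, 1, 1, 1, 2]
label_counts = Counter(labels)
print("Rows per label:", dict(label_counts))
```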

### Split the data

Split the full data set into two randomized subsets: 70% for training and 30% for testing.

In [0]:
splits = data.randomSplit([0.7, 0.3])
train = splits[0]
test = splits[1]
print ("Training Rows:", train.count(), " Testing Rows:", test.count())

Training Rows: 251  Testing Rows: 91
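
The row counts above are not an exact 70/30 split (251 of 342 rows is about 73%), which is consistent with each row being assigned to a subset at random rather than by exact count. The plain-Python sketch below mimics that idea with one random draw per row. It illustrates the concept and is not Spark's implementation; `random_split` and the seed value are hypothetical choices made here. In PySpark itself, `randomSplit` accepts an optional `seed` argument when a repeatable split is needed.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row to a subset using one random draw per row."""
    rng = random.Random(seed)
    total = sum(weights)
    # Cumulative upper bounds, roughly [0.7, 1.0] for weights [0.7, 0.3].
    bounds, running = [], 0.0
    for w in weights:
        running += w / total
        bounds.append(running)
    subsets = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for i, upper in enumerate(bounds):
            # The last subset catches any draw left over by rounding.
            if draw < upper or i == len(bounds) - 1:
                subsets[i].append(row)
                break
    return subsets

rows = list(range(342))  # stand-ins for the 342 cleaned penguin rows
train_rows, test_rows = random_split(rows, [0.7, 0.3], seed=42)
print(len(train_rows), len(test_rows))  # close to, but rarely exactly, 70/30
```

Running the function again with the same seed returns the same two subsets, which is what makes a seeded split reproducible.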


### Perform feature engineering

* Encode categorical features

Encode the Island categorical column values as numeric indexes.

In [0]:
from pyspark.ml.feature import StringIndexer

indexer = StringIndexer(inputCol="Island", outputCol="IslandIdx")
indexedData = indexer.fit(train).transform(train).drop("Island")
display(indexedData)

CulmenLength,CulmenDepth,FlipperLength,BodyMass,Species,IslandIdx
35.0,17.9,190.0,3450.0,0,0.0
35.7,16.9,185.0,3150.0,0,0.0
35.9,19.2,189.0,3800.0,0,0.0
36.5,16.6,181.0,2850.0,0,0.0
37.6,17.0,185.0,3600.0,0,0.0
37.6,19.1,194.0,3750.0,0,0.0
37.7,16.0,183.0,3075.0,0,0.0
37.7,18.7,180.0,3600.0,0,0.0
37.8,18.3,174.0,3400.0,0,0.0
37.8,20.0,190.0,4250.0,0,0.0
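
By default, `StringIndexer` orders labels by descending frequency, so the most common value receives index 0.0. A plain-Python sketch of that ordering follows; the island list is a made-up sample rather than the real dataset counts, and `fit_string_index` is a hypothetical helper, not part of PySpark.

```python
from collections import Counter

def fit_string_index(values):
    """Map each distinct label to an index, most frequent label first."""
    counts = Counter(values)
    # Sort by descending count; ties are broken alphabetically.
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

# Made-up sample of island names (not the real dataset counts).
islands = ["Biscoe", "Dream", "Biscoe", "Torgersen", "Dream", "Biscoe"]
mapping = fit_string_index(islands)
print(mapping)  # {'Biscoe': 0.0, 'Dream': 1.0, 'Torgersen': 2.0}
print([mapping[name] for name in islands])
```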


* Normalize (scale) numeric features

These values (CulmenLength, CulmenDepth, FlipperLength, and BodyMass) all represent measurements of one sort or another, but they’re on different scales.

We need to scale multiple column values at the same time, so the technique we use is to create a single column containing a vector (essentially an array) of all the numeric features, and then apply a scaler to produce a new vector column with the equivalent normalized values.

In [0]:
from pyspark.ml.feature import VectorAssembler, MinMaxScaler

# Create a vector column containing all numeric features
numericFeatures = ["CulmenLength", "CulmenDepth", "FlipperLength", "BodyMass"]
numericColVector = VectorAssembler(inputCols=numericFeatures, outputCol="numericFeatures")
vectorizedData = numericColVector.transform(indexedData)
   
# Use a MinMax scaler to normalize the numeric values in the vector
minMax = MinMaxScaler(inputCol = numericColVector.getOutputCol(), outputCol="normalizedFeatures")
scaledData = minMax.fit(vectorizedData).transform(vectorizedData)
   
# Display the data with numeric feature vectors (before and after scaling)
compareNumerics = scaledData.select("numericFeatures", "normalizedFeatures")
display(compareNumerics)

numericFeatures,normalizedFeatures
"Map(vectorType -> dense, length -> 4, values -> List(35.0, 17.899999618530273, 190.0, 3450.0))","Map(vectorType -> dense, length -> 4, values -> List(0.10545460094105114, 0.5714285065527647, 0.3050847457627119, 0.1875))"
"Map(vectorType -> dense, length -> 4, values -> List(35.70000076293945, 16.899999618530273, 185.0, 3150.0))","Map(vectorType -> dense, length -> 4, values -> List(0.13090917413884942, 0.4523808820988284, 0.22033898305084745, 0.09375))"
"Map(vectorType -> dense, length -> 4, values -> List(35.900001525878906, 19.200000762939453, 189.0, 3800.0))","Map(vectorType -> dense, length -> 4, values -> List(0.13818192915482955, 0.726190554582076, 0.288135593220339, 0.296875))"
"Map(vectorType -> dense, length -> 4, values -> List(36.5, 16.600000381469727, 181.0, 2850.0))","Map(vectorType -> dense, length -> 4, values -> List(0.16000005548650567, 0.416666685588777, 0.15254237288135594, 0.0))"
"Map(vectorType -> dense, length -> 4, values -> List(37.599998474121094, 17.0, 185.0, 3600.0))","Map(vectorType -> dense, length -> 4, values -> List(0.19999999999999998, 0.46428568995728675, 0.22033898305084745, 0.234375))"
"Map(vectorType -> dense, length -> 4, values -> List(37.599998474121094, 19.100000381469727, 194.0, 3750.0))","Map(vectorType -> dense, length -> 4, values -> List(0.19999999999999998, 0.7142857467236177, 0.3728813559322034, 0.28125))"
"Map(vectorType -> dense, length -> 4, values -> List(37.70000076293945, 16.0, 183.0, 3075.0))","Map(vectorType -> dense, length -> 4, values -> List(0.20363644686612214, 0.34523806550335046, 0.1864406779661017, 0.0703125))"
"Map(vectorType -> dense, length -> 4, values -> List(37.70000076293945, 18.700000762939453, 180.0, 3600.0))","Map(vectorType -> dense, length -> 4, values -> List(0.20363644686612214, 0.6666667423551079, 0.13559322033898305, 0.234375))"
"Map(vectorType -> dense, length -> 4, values -> List(37.79999923706055, 18.299999237060547, 174.0, 3400.0))","Map(vectorType -> dense, length -> 4, values -> List(0.2072727550159801, 0.6190475109212744, 0.03389830508474576, 0.171875))"
"Map(vectorType -> dense, length -> 4, values -> List(37.79999923706055, 20.0, 190.0, 4250.0))","Map(vectorType -> dense, length -> 4, values -> List(0.2072727550159801, 0.8214285633190956, 0.3050847457627119, 0.4375))"


The numericFeatures column in the results contains a vector for each row. The vector includes four unscaled numeric values (the original measurements of the penguin). The normalizedFeatures column also contains a vector for each penguin observation, but this time the values in the vector are normalized to a relative scale based on the minimum and maximum values for each measurement.
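
The scaling can be checked by hand. Min-max normalization maps each value x to (x - min) / (max - min). In the sketch below, the BodyMass minimum of 2850 and maximum of 6050 are inferred from the displayed output (2850.0 scales to 0.0 and 3450.0 to 0.1875); they are an assumption about the training split, not values printed by the notebook.

```python
def min_max_scale(value, minimum, maximum):
    """Rescale value to the range [0, 1] relative to the observed min and max."""
    return (value - minimum) / (maximum - minimum)

# BodyMass bounds inferred from the output above (assumed, not printed by the notebook).
BODY_MASS_MIN, BODY_MASS_MAX = 2850.0, 6050.0

scaled = {}
for grams in [3450.0, 3150.0, 3800.0, 2850.0]:
    scaled[grams] = min_max_scale(grams, BODY_MASS_MIN, BODY_MASS_MAX)
    print(grams, "->", scaled[grams])
# 3450.0 -> 0.1875
# 3150.0 -> 0.09375
# 3800.0 -> 0.296875
# 2850.0 -> 0.0
```

These match the fourth element of the normalizedFeatures vectors in the first four rows of the output.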

* Prepare features and labels for training

Now, let’s bring everything together and create a single column containing all of the features (the encoded categorical island name and the normalized penguin measurements), and another column containing the class label we want to train a model to predict (the penguin species).

In [0]:
featVect = VectorAssembler(inputCols=["IslandIdx", "normalizedFeatures"], outputCol="featuresVector")
preppedData = featVect.transform(scaledData)[col("featuresVector").alias("features"), col("Species").alias("label")]
display(preppedData)

features,label
"Map(vectorType -> dense, length -> 5, values -> List(0.0, 0.10545460094105114, 0.5714285065527647, 0.3050847457627119, 0.1875))",0
"Map(vectorType -> dense, length -> 5, values -> List(0.0, 0.13090917413884942, 0.4523808820988284, 0.22033898305084745, 0.09375))",0
"Map(vectorType -> dense, length -> 5, values -> List(0.0, 0.13818192915482955, 0.726190554582076, 0.288135593220339, 0.296875))",0
"Map(vectorType -> dense, length -> 5, values -> List(0.0, 0.16000005548650567, 0.416666685588777, 0.15254237288135594, 0.0))",0
"Map(vectorType -> dense, length -> 5, values -> List(0.0, 0.19999999999999998, 0.46428568995728675, 0.22033898305084745, 0.234375))",0
"Map(vectorType -> dense, length -> 5, values -> List(0.0, 0.19999999999999998, 0.7142857467236177, 0.3728813559322034, 0.28125))",0
"Map(vectorType -> dense, length -> 5, values -> List(0.0, 0.20363644686612214, 0.34523806550335046, 0.1864406779661017, 0.0703125))",0
"Map(vectorType -> dense, length -> 5, values -> List(0.0, 0.20363644686612214, 0.6666667423551079, 0.13559322033898305, 0.234375))",0
"Map(vectorType -> dense, length -> 5, values -> List(0.0, 0.2072727550159801, 0.6190475109212744, 0.03389830508474576, 0.171875))",0
"Map(vectorType -> dense, length -> 5, values -> List(0.0, 0.2072727550159801, 0.8214285633190956, 0.3050847457627119, 0.4375))",0


The features vector contains five values (the encoded island and the normalized culmen length, culmen depth, flipper length, and body mass). The label contains a simple integer code that indicates the class of penguin species.

### Train a machine learning model

The algorithm used here is logistic regression, which iteratively attempts to find the optimal coefficients that can be applied to the feature data in a logistic calculation that predicts the probability of each class label value. To train the model, you fit the logistic regression algorithm to the training data. You specify the maximum number of iterations performed to find optimal coefficients for the logistic calculation, and a regularization parameter that helps prevent the model from overfitting.

In [0]:
from pyspark.ml.classification import LogisticRegression

lr = LogisticRegression(labelCol="label", featuresCol="features", maxIter=10, regParam=0.3)
model = lr.fit(preppedData)
print ("Model trained!")

Model trained!
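
To make the logistic calculation more concrete, the sketch below shows how a multinomial logistic model turns a feature vector into class probabilities: one score per class (a weighted sum of the features plus an intercept), passed through a softmax. The coefficients and intercepts are invented for illustration and are not the values learned by the model above; `predict_proba` is a hypothetical helper.

```python
import math

def predict_proba(features, coefficients, intercepts):
    """Return softmax probabilities, one per class."""
    scores = [
        sum(w * x for w, x in zip(class_weights, features)) + b
        for class_weights, b in zip(coefficients, intercepts)
    ]
    # Subtract the largest score before exponentiating for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# First feature vector from the prepared data above.
features = [0.0, 0.10545460094105114, 0.5714285065527647, 0.3050847457627119, 0.1875]

# Invented coefficients (one row per class) and intercepts, for illustration only.
coefficients = [
    [0.1, -1.0, 1.5, -1.2, -0.8],
    [-0.3, 0.4, -1.6, 1.4, 1.1],
    [0.2, 0.9, 0.6, -0.5, -0.6],
]
intercepts = [0.5, -0.2, -0.3]

probabilities = predict_proba(features, coefficients, intercepts)
predicted_class = probabilities.index(max(probabilities))
print(probabilities, predicted_class)
```

Training amounts to adjusting the coefficients and intercepts so that these probabilities agree with the known labels as closely as the regularization allows.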


### Test the model

Prepare the test data and then generate predictions.

In [0]:
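# Prepare the test data (note: the indexer and scaler are refit on the test set here,
# so its encoding and scaling come from the test data rather than from training)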
indexedTestData = indexer.fit(test).transform(test).drop("Island")
vectorizedTestData = numericColVector.transform(indexedTestData)
scaledTestData = minMax.fit(vectorizedTestData).transform(vectorizedTestData)
preppedTestData = featVect.transform(scaledTestData)[col("featuresVector").alias("features"), col("Species").alias("label")]
   
# Get predictions
prediction = model.transform(preppedTestData)
predicted = prediction.select("features", "probability", col("prediction").astype("Int"), col("label").alias("trueLabel"))
display(predicted)

features,probability,prediction,trueLabel
"Map(vectorType -> dense, length -> 5, values -> List(0.0, 0.0, 0.5866666683620875, 0.17307692307692307, 0.05555555555555555))","Map(vectorType -> dense, length -> 3, values -> List(0.9044455472067027, 0.03732982651059977, 0.05822462628269763))",0,0
"Map(vectorType -> dense, length -> 5, values -> List(0.0, 0.02127659574468085, 0.5599999033610149, 0.2692307692307693, 0.2847222222222222))","Map(vectorType -> dense, length -> 3, values -> List(0.8610205002371472, 0.07821972732117789, 0.06075977244167493))",0,0
"Map(vectorType -> dense, length -> 5, values -> List(0.0, 0.03404252072598072, 0.6933332197401404, 0.17307692307692307, 0.3055555555555556))","Map(vectorType -> dense, length -> 3, values -> List(0.8933948146599047, 0.04775491452805716, 0.058850270812038245))",0,0
"Map(vectorType -> dense, length -> 5, values -> List(0.0, 0.0425531914893617, 0.33333341810437295, 0.3269230769230769, 0.18055555555555555))","Map(vectorType -> dense, length -> 3, values -> List(0.7883351742657827, 0.13910254823774043, 0.07256227749647698))",0,0
"Map(vectorType -> dense, length -> 5, values -> List(0.0, 0.08085112876080452, 0.4533333519829621, 0.11538461538461539, 0.041666666666666664))","Map(vectorType -> dense, length -> 3, values -> List(0.8608783088017331, 0.05734325469572048, 0.08177843650254632))",0,0
"Map(vectorType -> dense, length -> 5, values -> List(0.0, 0.14468091599484706, 0.6533333265516502, 0.2884615384615385, 0.0625))","Map(vectorType -> dense, length -> 3, values -> List(0.8309874786628647, 0.06196598730610369, 0.10704653403103173))",0,0
"Map(vectorType -> dense, length -> 5, values -> List(0.0, 0.2595744031540891, 0.6799998372396041, 0.2884615384615385, 0.3055555555555556))","Map(vectorType -> dense, length -> 3, values -> List(0.7389654282886562, 0.1156282644418229, 0.14540630726952097))",0,0
"Map(vectorType -> dense, length -> 5, values -> List(0.0, 0.2723404904629322, 0.0, 0.6923076923076923, 0.5416666666666666))","Map(vectorType -> dense, length -> 3, values -> List(0.17105237919019448, 0.7717155092249804, 0.05723211158482511))",1,1
"Map(vectorType -> dense, length -> 5, values -> List(0.0, 0.28085099889876997, 0.6000000508626238, 0.2692307692307693, 0.375))","Map(vectorType -> dense, length -> 3, values -> List(0.6928545042547388, 0.15728006636198907, 0.1498654293832722))",0,0
"Map(vectorType -> dense, length -> 5, values -> List(0.0, 0.3021275946434508, 0.5733332858615512, 0.2692307692307693, 0.3472222222222222))","Map(vectorType -> dense, length -> 3, values -> List(0.669755170765458, 0.16746217196142363, 0.16278265727311833))",0,0


* features: The prepared features data from the test dataset.
* probability: The probability calculated by the model for each class. This consists of a vector containing three probability values (because there are three classes) which add up to a total of 1.0 (it’s assumed that there’s a 100% probability that the penguin belongs to one of the three species classes).
* prediction: The predicted class label (the one with the highest probability).
* trueLabel: The actual known label value from the test data.
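
These relationships can be checked against two of the rows displayed above. The sketch below is plain Python, with the probability vectors and predictions copied from that output.

```python
import math

# (probability vector, prediction) pairs copied from the output above.
rows = [
    ([0.9044455472067027, 0.03732982651059977, 0.05822462628269763], 0),
    ([0.17105237919019448, 0.7717155092249804, 0.05723211158482511], 1),
]

for probability, prediction in rows:
    # The three class probabilities add up to 1.0 (within floating-point error).
    assert math.isclose(sum(probability), 1.0, rel_tol=1e-9)
    # The prediction is the index of the largest probability.
    assert probability.index(max(probability)) == prediction
    print(prediction, round(sum(probability), 6))
```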

Get evaluation metrics for the classification model based on the results from the test data.

In [0]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
   
evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction")
   
# Simple accuracy
accuracy = evaluator.evaluate(prediction, {evaluator.metricName:"accuracy"})
print("Accuracy:", accuracy)
   
# Individual class metrics
labels = [0,1,2]
print("\nIndividual class metrics:")
for label in sorted(labels):
  print ("Class %s" % (label))
  
  # Precision
  precision = evaluator.evaluate(prediction, {evaluator.metricLabel:label,
    evaluator.metricName:"precisionByLabel"})
  print("\tPrecision:", precision)
  
  # Recall
  recall = evaluator.evaluate(prediction, {evaluator.metricLabel:label,
    evaluator.metricName:"recallByLabel"})
  print("\tRecall:", recall)
  
  # F1 score
  f1 = evaluator.evaluate(prediction, {evaluator.metricLabel:label,
    evaluator.metricName:"fMeasureByLabel"})
  print("\tF1 Score:", f1)
   
# Weighted (overall) metrics
overallPrecision = evaluator.evaluate(prediction, {evaluator.metricName:"weightedPrecision"})
print("Overall Precision:", overallPrecision)
overallRecall = evaluator.evaluate(prediction, {evaluator.metricName:"weightedRecall"})
print("Overall Recall:", overallRecall)
overallF1 = evaluator.evaluate(prediction, {evaluator.metricName:"weightedFMeasure"})
print("Overall F1 Score:", overallF1)

Accuracy: 0.8901098901098901

Individual class metrics:
Class 0
	Precision: 0.782608695652174
	Recall: 1.0
	F1 Score: 0.878048780487805
Class 1
	Precision: 1.0
	Recall: 1.0
	F1 Score: 1.0
Class 2
	Precision: 1.0
	Recall: 0.5833333333333334
	F1 Score: 0.7368421052631579
Overall Precision: 0.9139990444338271
Overall Recall: 0.8901098901098901
Overall F1 Score: 0.8823512815810634


* Accuracy: The proportion of overall predictions that were correct.
* Per-class metrics:
  * Precision: The proportion of predictions of this class that were correct.
  * Recall: The proportion of actual instances of this class that were correctly predicted.
  * F1 score: A combined metric for precision and recall (their harmonic mean).
* Combined (weighted) precision, recall, and F1 metrics for all classes.
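
As a worked check, the sketch below recomputes these metrics in plain Python from a confusion matrix. The matrix is not printed by the notebook; it is reconstructed here as one that is consistent with the logistic regression results reported above (91 test rows, with 10 Chinstrap penguins predicted as Adelie), so treat it as an assumption.

```python
# Rows are actual classes, columns are predicted classes (0, 1, 2).
# Reconstructed to match the reported metrics; an assumption, not notebook output.
confusion = [
    [36, 0, 0],
    [0, 31, 0],
    [10, 0, 14],
]
classes = range(len(confusion))
total = sum(sum(row) for row in confusion)

accuracy = sum(confusion[c][c] for c in classes) / total
print("Accuracy:", accuracy)

per_class = {}
weighted_f1 = 0.0
for c in classes:
    true_positives = confusion[c][c]
    predicted_as_c = sum(confusion[r][c] for r in classes)  # column total
    actually_c = sum(confusion[c])                          # row total
    precision = true_positives / predicted_as_c
    recall = true_positives / actually_c
    f1 = 2 * precision * recall / (precision + recall)
    per_class[c] = (precision, recall, f1)
    # Weighted metrics average the per-class values by each class's share of rows.
    weighted_f1 += f1 * actually_c / total
    print(f"Class {c}: precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")

print("Weighted F1:", round(weighted_f1, 4))
```

Class 0 shows why precision and recall differ: every actual class 0 row is found (recall 1.0), but 10 of the 46 rows predicted as class 0 really belong to class 2, which lowers its precision.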

### Use a pipeline

You trained your model by performing the required feature engineering steps and then fitting an algorithm to the data. To use the model with some test data to generate predictions (referred to as inferencing), you had to apply the same feature engineering steps to the test data. A more efficient way to build and use models is to encapsulate the transformers used to prepare the data and the algorithm used to train the model in a pipeline.

Create a pipeline that encapsulates the data preparation and model training steps

In [0]:
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler, MinMaxScaler
from pyspark.ml.classification import LogisticRegression
   
catFeature = "Island"
numFeatures = ["CulmenLength", "CulmenDepth", "FlipperLength", "BodyMass"]
   
# Define the feature engineering and model training algorithm steps
catIndexer = StringIndexer(inputCol=catFeature, outputCol=catFeature + "Idx")
numVector = VectorAssembler(inputCols=numFeatures, outputCol="numericFeatures")
numScaler = MinMaxScaler(inputCol = numVector.getOutputCol(), outputCol="normalizedFeatures")
featureVector = VectorAssembler(inputCols=["IslandIdx", "normalizedFeatures"], outputCol="Features")
algo = LogisticRegression(labelCol="Species", featuresCol="Features", maxIter=10, regParam=0.3)
   
# Chain the steps as stages in a pipeline
pipeline = Pipeline(stages=[catIndexer, numVector, numScaler, featureVector, algo])
   
# Use the pipeline to prepare data and fit the model algorithm
model = pipeline.fit(train)
print ("Model trained!")

Model trained!


Since the feature engineering steps are now encapsulated in the model trained by the pipeline, you can use the model with the test data without needing to apply each transformation (they’ll be applied automatically by the model).
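
The key idea is that a fitted pipeline remembers the parameters learned from the training data (for example, a column's minimum and maximum) and reuses them on any later data. A minimal plain-Python sketch of that fit-then-transform pattern follows; `MinMaxStage` and `SimplePipeline` are hypothetical classes written for illustration and the numbers are made up, so this is not how Spark implements `Pipeline`.

```python
class MinMaxStage:
    """Learns min and max on fit, then rescales values with those bounds."""

    def fit(self, values):
        self.low, self.high = min(values), max(values)
        return self

    def transform(self, values):
        return [(v - self.low) / (self.high - self.low) for v in values]


class SimplePipeline:
    """Fits each stage in order, feeding it the previous stage's output."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, values):
        for stage in self.stages:
            values = stage.fit(values).transform(values)
        return self

    def transform(self, values):
        for stage in self.stages:
            values = stage.transform(values)
        return values


train_mass = [2850.0, 3450.0, 6050.0]   # made-up training values
test_mass = [3650.0, 4450.0]            # made-up test values

fitted = SimplePipeline([MinMaxStage()]).fit(train_mass)
# The test values are scaled with the bounds learned from the training data.
scaled_test = fitted.transform(test_mass)
print(scaled_test)  # [0.25, 0.5]
```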

Apply the pipeline to the test data

In [0]:
prediction = model.transform(test)
predicted = prediction.select("Features", "probability", col("prediction").astype("Int"), col("Species").alias("trueLabel"))
display(predicted)

Features,probability,prediction,trueLabel
"Map(vectorType -> dense, length -> 5, values -> List(0.0, 0.08727278275923295, 0.5952381222696814, 0.2542372881355932, 0.015625))","Map(vectorType -> dense, length -> 3, values -> List(0.8600654621440087, 0.05272275500971718, 0.08721178284627418))",0,0
"Map(vectorType -> dense, length -> 5, values -> List(0.0, 0.10545460094105114, 0.5714285065527647, 0.3389830508474576, 0.2734375))","Map(vectorType -> dense, length -> 3, values -> List(0.8043831315095489, 0.10903148871629018, 0.08658537977416107))",0,0
"Map(vectorType -> dense, length -> 5, values -> List(0.0, 0.1163636641068892, 0.6904761310067009, 0.2542372881355932, 0.296875))","Map(vectorType -> dense, length -> 3, values -> List(0.843064474421891, 0.07186204450202806, 0.08507348107608095))",0,0
"Map(vectorType -> dense, length -> 5, values -> List(0.0, 0.12363641912286931, 0.3690476812202672, 0.3898305084745763, 0.15625))","Map(vectorType -> dense, length -> 3, values -> List(0.7294587650054535, 0.17059439702997495, 0.09994683796457159))",0,0
"Map(vectorType -> dense, length -> 5, values -> List(0.0, 0.1563637473366477, 0.47619049781574513, 0.2033898305084746, 0.0))","Map(vectorType -> dense, length -> 3, values -> List(0.8101205090029675, 0.07504685201824543, 0.11483263897878696))",0,0
"Map(vectorType -> dense, length -> 5, values -> List(0.0, 0.21090920188210227, 0.6547619344966495, 0.3559322033898305, 0.0234375))","Map(vectorType -> dense, length -> 3, values -> List(0.7779183297320522, 0.0793501455030605, 0.14273152476488735))",0,0
"Map(vectorType -> dense, length -> 5, values -> List(0.0, 0.3090909090909091, 0.6785713231482425, 0.3559322033898305, 0.296875))","Map(vectorType -> dense, length -> 3, values -> List(0.6788846725124053, 0.1462061812806686, 0.174909146206926))",0,0
"Map(vectorType -> dense, length -> 5, values -> List(0.0, 0.32000011097301134, 0.07142850655276464, 0.711864406779661, 0.5625))","Map(vectorType -> dense, length -> 3, values -> List(0.16347329710551844, 0.7704479304646414, 0.06607877242984007))",1,1
"Map(vectorType -> dense, length -> 5, values -> List(0.0, 0.32727272727272727, 0.6071429301281398, 0.3389830508474576, 0.375))","Map(vectorType -> dense, length -> 3, values -> List(0.6304612397533593, 0.1943578352797864, 0.1751809249668543))",0,0
"Map(vectorType -> dense, length -> 5, values -> List(0.0, 0.34545454545454546, 0.583333314411223, 0.3389830508474576, 0.34375))","Map(vectorType -> dense, length -> 3, values -> List(0.6100505977506275, 0.20212548600597946, 0.1878239162433929))",0,0


### Try a different algorithm

Create a pipeline that uses a decision tree algorithm.

In [0]:
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler, MinMaxScaler
from pyspark.ml.classification import DecisionTreeClassifier
   
catFeature = "Island"
numFeatures = ["CulmenLength", "CulmenDepth", "FlipperLength", "BodyMass"]
   
# Define the feature engineering and model steps
catIndexer = StringIndexer(inputCol=catFeature, outputCol=catFeature + "Idx")
numVector = VectorAssembler(inputCols=numFeatures, outputCol="numericFeatures")
numScaler = MinMaxScaler(inputCol = numVector.getOutputCol(), outputCol="normalizedFeatures")
featureVector = VectorAssembler(inputCols=["IslandIdx", "normalizedFeatures"], outputCol="Features")
algo = DecisionTreeClassifier(labelCol="Species", featuresCol="Features", maxDepth=10)
   
# Chain the steps as stages in a pipeline
pipeline = Pipeline(stages=[catIndexer, numVector, numScaler, featureVector, algo])
   
# Use the pipeline to prepare data and fit the model algorithm
model = pipeline.fit(train)
print ("Model trained!")

Model trained!


Use the new pipeline with the test data.

In [0]:
# Get predictions
prediction = model.transform(test)
predicted = prediction.select("Features", "probability", col("prediction").astype("Int"), col("Species").alias("trueLabel"))
   
# Generate evaluation metrics
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
   
evaluator = MulticlassClassificationEvaluator(labelCol="Species", predictionCol="prediction")
   
# Simple accuracy
accuracy = evaluator.evaluate(prediction, {evaluator.metricName:"accuracy"})
print("Accuracy:", accuracy)
   
# Class metrics
labels = [0,1,2]
print("\nIndividual class metrics:")
for label in sorted(labels):
    print ("Class %s" % (label))
   
    # Precision
    precision = evaluator.evaluate(prediction, {evaluator.metricLabel:label,
      evaluator.metricName:"precisionByLabel"})
    print("\tPrecision:", precision)
   
    # Recall
    recall = evaluator.evaluate(prediction, {evaluator.metricLabel:label,
      evaluator.metricName:"recallByLabel"})
    print("\tRecall:", recall)
   
    # F1 score
    f1 = evaluator.evaluate(prediction, {evaluator.metricLabel:label,
      evaluator.metricName:"fMeasureByLabel"})
    print("\tF1 Score:", f1)
   
# Weighted (overall) metrics
overallPrecision = evaluator.evaluate(prediction, {evaluator.metricName:"weightedPrecision"})
print("Overall Precision:", overallPrecision)
overallRecall = evaluator.evaluate(prediction, {evaluator.metricName:"weightedRecall"})
print("Overall Recall:", overallRecall)
overallF1 = evaluator.evaluate(prediction, {evaluator.metricName:"weightedFMeasure"})
print("Overall F1 Score:", overallF1)

Accuracy: 0.978021978021978

Individual class metrics:
Class 0
	Precision: 0.9722222222222222
	Recall: 0.9722222222222222
	F1 Score: 0.9722222222222222
Class 1
	Precision: 0.96875
	Recall: 1.0
	F1 Score: 0.9841269841269841
Class 2
	Precision: 1.0
	Recall: 0.9583333333333334
	F1 Score: 0.9787234042553191
Overall Precision: 0.9783653846153846
Overall Recall: 0.9780219780219781
Overall F1 Score: 0.9779922880226832
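
Comparing the two runs comes down to picking the better score. A tiny plain-Python sketch using the accuracy and weighted F1 values reported above; choosing weighted F1 as the selection metric is an assumption made here, as the notebook does not prescribe one.

```python
# Metrics reported above for the two pipelines.
results = {
    "LogisticRegression": {"accuracy": 0.8901098901098901, "f1": 0.8823512815810634},
    "DecisionTreeClassifier": {"accuracy": 0.978021978021978, "f1": 0.9779922880226832},
}

# Pick the candidate with the highest weighted F1 score.
best = max(results, key=lambda name: results[name]["f1"])
print(best)  # DecisionTreeClassifier
```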


### Save the model

In reality, you’d iteratively try training the model with different algorithms (and parameters) to find the best model for your data. For now, we’ll stick with the decision tree model we’ve trained. Let’s save it so we can use it later with some new penguin observations.

In [0]:
# Overwrite any model previously saved at this path so the cell can be re-run
model.write().overwrite().save("/models/penguin.model")

Load the model and use it to predict the species for a new penguin observation.

In [0]:
from pyspark.ml.pipeline import PipelineModel

persistedModel = PipelineModel.load("/models/penguin.model")
   
newData = spark.createDataFrame ([{
  "Island": "Biscoe",
  "CulmenLength": 47.6,
  "CulmenDepth": 14.5,
  "FlipperLength": 215,
  "BodyMass": 5400
  }])

predictions = persistedModel.transform(newData)
display(predictions.select("Island", "CulmenDepth", "CulmenLength", "FlipperLength", "BodyMass", col("prediction").alias("PredictedSpecies")))

Island,CulmenDepth,CulmenLength,FlipperLength,BodyMass,PredictedSpecies
Biscoe,14.5,47.6,215,5400,1.0
