# Spark MLlib Exercises


http://spark.apache.org/docs/latest/ml-statistics.html

In [1]:
!pip install pyspark


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pyspark
  Downloading pyspark-3.3.1.tar.gz (281.4 MB)
Collecting py4j==0.10.9.5
  Downloading py4j-0.10.9.5-py2.py3-none-any.whl (199 kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.3.1-py2.py3-none-any.whl size=281845512 sha256=1a60c551d9c5e9a34165b1acfb11b5d7bd1db918ad6c5f898004747d531812fb
  Stored in directory: /root/.cache/pip/wheels/43/dc/11/ec201cd671da62fa9c5cc77078235e40722170ceba231d7598
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9.5 pyspark-3.3.1


In [2]:
from pyspark.sql import SparkSession
from pyspark.conf import SparkConf
spark = SparkSession.builder.getOrCreate()

In [3]:
spark

## 1. Statistics (1p.)

Download the following dataset: https://www.kaggle.com/c/titanic/data?select=train.csv

In [5]:
file = "titanic_train.csv"
titanic_df = spark.read.format("csv").options(inferSchema="true", header="true").load(file)
titanic_df = titanic_df.dropna(how='any')
titanic_df.show(10)
print(titanic_df.dtypes)

+-----------+--------+------+--------------------+------+----+-----+-----+--------+-------+-----------+--------+
|PassengerId|Survived|Pclass|                Name|   Sex| Age|SibSp|Parch|  Ticket|   Fare|      Cabin|Embarked|
+-----------+--------+------+--------------------+------+----+-----+-----+--------+-------+-----------+--------+
|          2|       1|     1|Cumings, Mrs. Joh...|female|38.0|    1|    0|PC 17599|71.2833|        C85|       C|
|          4|       1|     1|Futrelle, Mrs. Ja...|female|35.0|    1|    0|  113803|   53.1|       C123|       S|
|          7|       0|     1|McCarthy, Mr. Tim...|  male|54.0|    0|    0|   17463|51.8625|        E46|       S|
|         11|       1|     3|Sandstrom, Miss. ...|female| 4.0|    1|    1| PP 9549|   16.7|         G6|       S|
|         12|       1|     1|Bonnell, Miss. El...|female|58.0|    0|    0|  113783|  26.55|       C103|       S|
|         22|       1|     2|Beesley, Mr. Lawr...|  male|34.0|    0|    0|  248698|   13.0|     

### Exercise 1.A.
**TODO:** Calculate descriptive statistics for 'Age' and 'Fare' (see https://spark.apache.org/docs/1.6.1/api/java/org/apache/spark/sql/DataFrame.html#describe(scala.collection.Seq))

In [6]:
titanic_df.describe("Age").show()
titanic_df.describe("Fare").show()

+-------+------------------+
|summary|               Age|
+-------+------------------+
|  count|               183|
|   mean|  35.6744262295082|
| stddev|15.643865966849717|
|    min|              0.92|
|    max|              80.0|
+-------+------------------+

+-------+-----------------+
|summary|             Fare|
+-------+-----------------+
|  count|              183|
|   mean|78.68246885245901|
| stddev|76.34784270040569|
|    min|              0.0|
|    max|         512.3292|
+-------+-----------------+



### Exercise 1.B.

**TODO:** Check if 'Age' and 'Fare' have normal distribution (see http://spark.apache.org/docs/latest/api/java/org/apache/spark/ml/stat/KolmogorovSmirnovTest.html)

In [28]:
from pyspark.ml.stat import Correlation, KolmogorovSmirnovTest
from pyspark.sql import functions as F

# Compute the sample mean and standard deviation of both columns in a single Spark job
# (F.stddev is the sample standard deviation, the same quantity describe() reports).
stats = titanic_df.select(
    F.mean("Age").alias("mean_age"),
    F.stddev("Age").alias("std_age"),
    F.mean("Fare").alias("mean_fare"),
    F.stddev("Fare").alias("std_fare"),
).first()

# One-sample, two-sided KS test of each column against N(mean, std).
print(KolmogorovSmirnovTest.test(titanic_df, "Age", "norm", stats["mean_age"], stats["std_age"]).first())
print(KolmogorovSmirnovTest.test(titanic_df, "Fare", "norm", stats["mean_fare"], stats["std_fare"]).first())



Row(pValue=0.8522382560293139, statistic=0.04414417432750317)
Row(pValue=6.282326284745565e-07, statistic=0.2005338889924242)


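For Age we cannot reject the hypothesis of a normal distribution (p-value ≈ 0.85,
far above the usual 0.05 threshold), whereas for Fare the p-value is practically 0
(≈ 6.3e-07), so we reject the hypothesis of a normal distribution. Because the mean
and standard deviation were estimated from the same sample, these p-values are only
approximate.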
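
To make the `statistic` value in the output above more concrete, here is a minimal pure-Python sketch of the one-sample Kolmogorov-Smirnov statistic: the largest gap between the empirical CDF of the sample and the CDF of the fitted normal distribution. The helper names (`normal_cdf`, `ks_statistic`) and the toy sample are assumptions made for illustration; they are not part of Spark and the numbers are not taken from the Titanic data.

```python
import math


def normal_cdf(x: float, mean: float, std: float) -> float:
    """CDF of N(mean, std^2), written with the error function."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def ks_statistic(sample, mean: float, std: float) -> float:
    """Largest gap between the empirical CDF of `sample` and the normal CDF."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        cdf = normal_cdf(x, mean, std)
        # The empirical CDF jumps from i/n to (i+1)/n at x, so check both sides.
        d = max(d, (i + 1) / n - cdf, cdf - i / n)
    return d


# Hypothetical toy sample (not the Titanic data).
toy_ages = [22.0, 38.0, 26.0, 35.0, 54.0, 2.0, 27.0, 14.0]
mean = sum(toy_ages) / len(toy_ages)
# Sample standard deviation (divides by n - 1), as describe() does.
std = math.sqrt(sum((a - mean) ** 2 for a in toy_ages) / (len(toy_ages) - 1))

d_stat = ks_statistic(toy_ages, mean, std)
print(f"D = {d_stat:.4f}")
```

A small `D` means the sample follows the reference distribution closely; the p-value then expresses how surprising a gap of that size would be for the given sample size.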

### Exercise 1.C.

**TODO:** Calculate Pearson correlation between the following pairs of features:  
* 'Age' and 'Survived'
* 'Sex' and 'Survived' *(remember to encode the 'Sex' values as 0s and 1s)*

Which correlation is stronger?

In [30]:
from pyspark.sql import functions as F
from pyspark.sql import types as T

@F.udf(returnType=T.IntegerType())
def sex_to_integers(sex: str) -> int:
    return int(sex == "male")

titanic_df_encoded = (
    titanic_df
    .select("Age", "Sex", "Survived")
    .withColumn("Sex", sex_to_integers(F.col("Sex")))
)

print(titanic_df_encoded.corr("Age", "Survived", method="pearson"))
print(titanic_df_encoded.corr("Sex", "Survived", method="pearson"))

-0.2540847542030532
-0.5324179744538412


The correlation with Survived is weak for Age (about -0.25) and noticeably stronger
for Sex (about -0.53, with male encoded as 1). This makes sense, because women were
evacuated earlier.
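
As a rough illustration of what `corr(..., method="pearson")` computes, the sketch below implements the Pearson coefficient by hand on a handful of made-up rows. The function name `pearson` and the toy data are assumptions for illustration only. It also shows that the sign depends on which sex is encoded as 1, so the strength of the two correlations should be compared by absolute value.

```python
import math


def pearson(xs, ys) -> float:
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy rows (not the Titanic data).
sexes = ["male", "female", "female", "male", "male", "female"]
survived = [0, 1, 1, 0, 1, 0]

# Same encoding as in the notebook: male -> 1, female -> 0.
sex_encoded = [int(s == "male") for s in sexes]
r_male = pearson(sex_encoded, survived)

# Encoding female as 1 instead flips only the sign, not the magnitude.
r_female = pearson([1 - s for s in sex_encoded], survived)

print(r_male, r_female)
```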

## 2. Loading data

Doc: http://spark.apache.org/docs/latest/ml-datasource.html 

Download data from https://github.com/apache/spark/blob/master/data/mllib/sample_libsvm_data.txt and load as DataFrame. 

In [31]:
!wget -O sample_libsvm_data.txt 'https://raw.githubusercontent.com/apache/spark/master/data/mllib/sample_libsvm_data.txt'

--2022-12-19 11:02:43--  https://raw.githubusercontent.com/apache/spark/master/data/mllib/sample_libsvm_data.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 104736 (102K) [text/plain]
Saving to: ‘sample_libsvm_data.txt’


2022-12-19 11:02:43 (6.31 MB/s) - ‘sample_libsvm_data.txt’ saved [104736/104736]



In [32]:
file = "sample_libsvm_data.txt"

df = spark.read.format("libsvm").option("numFeatures", "780").load(file)
df.show(10)
df.take(1)

+-----+--------------------+
|label|            features|
+-----+--------------------+
|  0.0|(780,[127,128,129...|
|  1.0|(780,[158,159,160...|
|  1.0|(780,[124,125,126...|
|  1.0|(780,[152,153,154...|
|  1.0|(780,[151,152,153...|
|  0.0|(780,[129,130,131...|
|  1.0|(780,[158,159,160...|
|  1.0|(780,[99,100,101,...|
|  0.0|(780,[154,155,156...|
|  0.0|(780,[127,128,129...|
+-----+--------------------+
only showing top 10 rows



[Row(label=0.0, features=SparseVector(780, {127: 51.0, 128: 159.0, 129: 253.0, 130: 159.0, 131: 50.0, 154: 48.0, 155: 238.0, 156: 252.0, 157: 252.0, 158: 252.0, 159: 237.0, 181: 54.0, 182: 227.0, 183: 253.0, 184: 252.0, 185: 239.0, 186: 233.0, 187: 252.0, 188: 57.0, 189: 6.0, 207: 10.0, 208: 60.0, 209: 224.0, 210: 252.0, 211: 253.0, 212: 252.0, 213: 202.0, 214: 84.0, 215: 252.0, 216: 253.0, 217: 122.0, 235: 163.0, 236: 252.0, 237: 252.0, 238: 252.0, 239: 253.0, 240: 252.0, 241: 252.0, 242: 96.0, 243: 189.0, 244: 253.0, 245: 167.0, 262: 51.0, 263: 238.0, 264: 253.0, 265: 253.0, 266: 190.0, 267: 114.0, 268: 253.0, 269: 228.0, 270: 47.0, 271: 79.0, 272: 255.0, 273: 168.0, 289: 48.0, 290: 238.0, 291: 252.0, 292: 252.0, 293: 179.0, 294: 12.0, 295: 75.0, 296: 121.0, 297: 21.0, 300: 253.0, 301: 243.0, 302: 50.0, 316: 38.0, 317: 165.0, 318: 253.0, 319: 233.0, 320: 208.0, 321: 84.0, 328: 253.0, 329: 252.0, 330: 165.0, 343: 7.0, 344: 178.0, 345: 252.0, 346: 240.0, 347: 71.0, 348: 19.0, 349: 28.0

### Exercise 2.A
**TODO:** Load wine data from https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/wine.scale
Dataset description: http://archive.ics.uci.edu/ml/datasets/Wine

In [33]:
!wget -O wine.scale 'https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/wine.scale'

--2022-12-19 11:05:24--  https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/wine.scale
Resolving www.csie.ntu.edu.tw (www.csie.ntu.edu.tw)... 140.112.30.26
Connecting to www.csie.ntu.edu.tw (www.csie.ntu.edu.tw)|140.112.30.26|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 28116 (27K)
Saving to: ‘wine.scale’


2022-12-19 11:05:26 (171 KB/s) - ‘wine.scale’ saved [28116/28116]



In [34]:
wine_df = (
    spark.read
    .format("libsvm")
    .option("numFeatures", "13").load('wine.scale')
)

wine_df.show(10)
wine_df.take(1)

+-----+--------------------+
|label|            features|
+-----+--------------------+
|  1.0|(13,[0,1,2,3,4,5,...|
|  1.0|(13,[0,1,2,3,4,5,...|
|  1.0|(13,[0,1,2,3,4,5,...|
|  1.0|(13,[0,1,2,3,4,5,...|
|  1.0|(13,[0,1,2,3,4,5,...|
|  1.0|(13,[0,1,2,3,4,5,...|
|  1.0|(13,[0,1,2,3,4,5,...|
|  1.0|(13,[0,1,2,3,4,5,...|
|  1.0|(13,[0,1,2,3,4,5,...|
|  1.0|(13,[0,1,2,3,4,5,...|
+-----+--------------------+
only showing top 10 rows



[Row(label=1.0, features=SparseVector(13, {0: 0.6842, 1: -0.6166, 2: 0.1444, 3: -0.4845, 4: 0.2391, 5: 0.2552, 6: 0.1477, 7: -0.434, 8: 0.1861, 9: -0.256, 10: -0.0894, 11: 0.9414, 12: 0.1227}))]
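
The loaded vectors above use indices 0 to 12, while LIBSVM text files number their features starting from 1; Spark's `libsvm` reader converts the indices to zero-based on load. The sketch below mimics that conversion in plain Python. The function `parse_libsvm_line` and the sample line are hypothetical and only approximate what the reader does.

```python
def parse_libsvm_line(line: str, num_features: int):
    """Parse '<label> <index>:<value> ...' into (label, {zero_based_index: value})."""
    parts = line.split()
    label = float(parts[0])
    features = {}
    for item in parts[1:]:
        index_str, value_str = item.split(":")
        # LIBSVM files are one-based; the resulting vector is zero-based.
        index = int(index_str) - 1
        if not 0 <= index < num_features:
            raise ValueError(f"feature index {index + 1} outside 1..{num_features}")
        features[index] = float(value_str)
    return label, features


# Hypothetical line in LIBSVM format (not copied from wine.scale).
label, features = parse_libsvm_line("1 1:0.5 3:-0.25 13:1.0", num_features=13)
print(label, features)
```

This is also why `numFeatures` matters: it fixes the vector size up front instead of letting the reader infer it from the largest index it encounters.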

## 3. Classification (2p.)

In [35]:
!wget -O wine.csv 'https://gist.githubusercontent.com/tijptjik/9408623/raw/b237fa5848349a14a14e5d4107dc7897c21951f5/wine.csv'

--2022-12-19 11:12:57--  https://gist.githubusercontent.com/tijptjik/9408623/raw/b237fa5848349a14a14e5d4107dc7897c21951f5/wine.csv
Resolving gist.githubusercontent.com (gist.githubusercontent.com)... 185.199.109.133, 185.199.108.133, 185.199.110.133, ...
Connecting to gist.githubusercontent.com (gist.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10889 (11K) [text/plain]
Saving to: ‘wine.csv’


2022-12-19 11:12:57 (70.3 MB/s) - ‘wine.csv’ saved [10889/10889]



In [36]:
file = "wine.csv" # https://gist.githubusercontent.com/tijptjik/9408623/raw/b237fa5848349a14a14e5d4107dc7897c21951f5/wine.csv

# Remember about deleting dots from the headers of this csv file!
winedf2 = spark.read.format("csv").options(inferSchema="true", header="true").load(file)
winedf2.show(10)
print(winedf2.dtypes)

+----+-------+----------+----+----+---+-------+----------+--------------------+-------+---------+----+----+-------+
|Wine|Alcohol|Malic.acid| Ash| Acl| Mg|Phenols|Flavanoids|Nonflavanoid.phenols|Proanth|Color.int| Hue|  OD|Proline|
+----+-------+----------+----+----+---+-------+----------+--------------------+-------+---------+----+----+-------+
|   1|  14.23|      1.71|2.43|15.6|127|    2.8|      3.06|                0.28|   2.29|     5.64|1.04|3.92|   1065|
|   1|   13.2|      1.78|2.14|11.2|100|   2.65|      2.76|                0.26|   1.28|     4.38|1.05| 3.4|   1050|
|   1|  13.16|      2.36|2.67|18.6|101|    2.8|      3.24|                 0.3|   2.81|     5.68|1.03|3.17|   1185|
|   1|  14.37|      1.95| 2.5|16.8|113|   3.85|      3.49|                0.24|   2.18|      7.8|0.86|3.45|   1480|
|   1|  13.24|      2.59|2.87|21.0|118|    2.8|      2.69|                0.39|   1.82|     4.32|1.04|2.93|    735|
|   1|   14.2|      1.76|2.45|15.2|112|   3.27|      3.39|              

### Exercise 3.A
**TODO:** 

Remember to remove the dots from the headers of this CSV file and to split the data into a train and a test set.


1) Create pipeline with VectorAssembler and DecisionTreeClassifier.

2) Use the pipeline to make predictions.

3) Evaluate predictions using MulticlassClassificationEvaluator.

4) Calculate accuracy and test error

5) Print the structure of the trained decision tree (hint: use toDebugString attribute)

In [37]:
filename = "wine.csv"

wine_df = (
    spark.read
    .format("csv")
    .options(inferSchema="true", header="true")
    .load(filename)
    .withColumnRenamed("Malic.acid", "malic_acid")
    .withColumnRenamed("Nonflavanoid.phenols", "nonflavanoid_phenols")
    .withColumnRenamed("Color.int", "color_int")
)

train_df, test_df = wine_df.randomSplit([0.8, 0.2], seed=0)
wine_df.show(3)

+----+-------+----------+----+----+---+-------+----------+--------------------+-------+---------+----+----+-------+
|Wine|Alcohol|malic_acid| Ash| Acl| Mg|Phenols|Flavanoids|nonflavanoid_phenols|Proanth|color_int| Hue|  OD|Proline|
+----+-------+----------+----+----+---+-------+----------+--------------------+-------+---------+----+----+-------+
|   1|  14.23|      1.71|2.43|15.6|127|    2.8|      3.06|                0.28|   2.29|     5.64|1.04|3.92|   1065|
|   1|   13.2|      1.78|2.14|11.2|100|   2.65|      2.76|                0.26|   1.28|     4.38|1.05| 3.4|   1050|
|   1|  13.16|      2.36|2.67|18.6|101|    2.8|      3.24|                 0.3|   2.81|     5.68|1.03|3.17|   1185|
+----+-------+----------+----+----+---+-------+----------+--------------------+-------+---------+----+----+-------+
only showing top 3 rows



In [38]:
from pyspark.ml.classification import DecisionTreeClassifier, LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.feature import HashingTF, IDF, QuantileDiscretizer, StringIndexer, Tokenizer, VectorAssembler
from pyspark.ml import Pipeline

def decision_tree(train_df, test_df):
    # All columns except the first one ("Wine", the label) are features.
    feature_cols = train_df.columns[1:]
    
    assembler = VectorAssembler(
        inputCols=feature_cols, 
        outputCol="features"
    )
    decision_tree = DecisionTreeClassifier(
        labelCol="Wine", 
        featuresCol="features"
    )
    pipeline = Pipeline(stages=[assembler, decision_tree]) 

    model = pipeline.fit(train_df)
    predictions = model.transform(test_df)

    evaluator = MulticlassClassificationEvaluator(
        labelCol="Wine", 
        predictionCol="prediction", 
        metricName="accuracy"
    )
    accuracy = evaluator.evaluate(predictions) * 100
    
    return model, accuracy, predictions

In [39]:
model, accuracy, predictions = decision_tree(train_df, test_df)

print(accuracy)

86.11111111111111
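
The task also asks for the test error, which the cell above does not print. It is simply the complement of accuracy, so an accuracy of 86.11% corresponds to a test error of about 13.89%. A minimal sketch of both metrics on made-up labels follows; the function name and the data are assumptions for illustration.

```python
def accuracy_and_test_error(labels, predictions):
    """Return (accuracy, test_error) as fractions; test_error = 1 - accuracy."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    accuracy = correct / len(labels)
    return accuracy, 1.0 - accuracy


# Hypothetical labels and predictions (not the wine test set).
labels = [1, 1, 2, 2, 3, 3, 3, 2]
predictions = [1, 2, 2, 2, 3, 3, 1, 2]

acc, err = accuracy_and_test_error(labels, predictions)
print(f"Accuracy: {acc:.2%}, test error: {err:.2%}")
```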


In [40]:
tree_model = model.stages[1]
print(tree_model.toDebugString)

DecisionTreeClassificationModel: uid=DecisionTreeClassifier_2fd66b713573, depth=5, numNodes=19, numClasses=4, numFeatures=13
  If (feature 12 <= 755.0)
   If (feature 6 <= 1.385)
    If (feature 9 <= 3.77)
     Predict: 2.0
    Else (feature 9 > 3.77)
     Predict: 3.0
   Else (feature 6 > 1.385)
    If (feature 0 <= 13.135)
     Predict: 2.0
    Else (feature 0 > 13.135)
     If (feature 1 <= 1.6749999999999998)
      Predict: 2.0
     Else (feature 1 > 1.6749999999999998)
      If (feature 0 <= 13.285)
       Predict: 1.0
      Else (feature 0 > 13.285)
       Predict: 3.0
  Else (feature 12 > 755.0)
   If (feature 5 <= 1.6150000000000002)
    If (feature 1 <= 1.62)
     Predict: 2.0
    Else (feature 1 > 1.62)
     Predict: 3.0
   Else (feature 5 > 1.6150000000000002)
    If (feature 0 <= 11.98)
     Predict: 2.0
    Else (feature 0 > 11.98)
     Predict: 1.0
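
The debug string refers to features by their position in the assembled vector. Since the feature columns are `train_df.columns[1:]`, the indices can be mapped back to column names with a small text substitution. The helper `name_features` below is a hypothetical convenience function, and the column list is copied from the table header shown earlier.

```python
import re

# Feature columns in assembler order (all columns after the "Wine" label).
feature_cols = [
    "Alcohol", "malic_acid", "Ash", "Acl", "Mg", "Phenols", "Flavanoids",
    "nonflavanoid_phenols", "Proanth", "color_int", "Hue", "OD", "Proline",
]


def name_features(debug_string: str, columns) -> str:
    """Replace every 'feature N' with the N-th column name."""
    return re.sub(r"feature (\d+)", lambda m: columns[int(m.group(1))], debug_string)


# A few lines taken from the tree printed above.
snippet = """If (feature 12 <= 755.0)
 If (feature 6 <= 1.385)
  Predict: 2.0"""

print(name_features(snippet, feature_cols))
```

Read this way, the root split of the tree is on Proline and the next one on Flavanoids.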



### Exercise 3.B
**TODO:** 

1) Extend the pipeline from the previous task with QuantileDiscretizer

2) Try a few different numbers of buckets. Which configuration gives the best results?

3) Can you see any difference in the structure of the decision tree?

In [41]:
def decision_tree_bins(train_df, test_df, num_buckets: int):
    # All columns except the first one ("Wine", the label) are features.
    feature_cols = train_df.columns[1:]
    discretized_cols = [f"{col}_disc" for col in feature_cols]
    
    discretizer = QuantileDiscretizer(
        inputCols=feature_cols,
        outputCols=discretized_cols,
        numBuckets=num_buckets
    )
    assembler = VectorAssembler(
        inputCols=discretized_cols, 
        outputCol="features"
    )
    decision_tree = DecisionTreeClassifier(
        labelCol="Wine", 
        featuresCol="features",
        
    )
    pipeline = Pipeline(stages=[discretizer, assembler, decision_tree]) 

    model = pipeline.fit(train_df)
    predictions = model.transform(test_df)

    evaluator = MulticlassClassificationEvaluator(
        labelCol="Wine", 
        predictionCol="prediction", 
        metricName="accuracy"
    )
    accuracy = evaluator.evaluate(predictions) * 100
    
    return model, accuracy

for num_bins in range(2, 10):
    print("Bins:", num_bins)
    model, accuracy = decision_tree_bins(train_df, test_df, num_bins)
    print(f"Accuracy: {accuracy:.2f}")

Bins: 2
Accuracy: 88.89
Bins: 3
Accuracy: 97.22
Bins: 4
Accuracy: 91.67
Bins: 5
Accuracy: 94.44
Bins: 6
Accuracy: 91.67
Bins: 7
Accuracy: 97.22
Bins: 8
Accuracy: 91.67
Bins: 9
Accuracy: 91.67
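
To see what the discretizer feeds into the tree, the sketch below bins a column by quantiles in plain Python: it picks split points at evenly spaced ranks and replaces each value with its bucket index. This is a simplified, exact version written for illustration; Spark's QuantileDiscretizer computes approximate quantiles, and the helper names and the toy values here are assumptions.

```python
import bisect


def quantile_splits(values, num_buckets: int):
    """Return num_buckets - 1 interior split points taken at evenly spaced ranks."""
    xs = sorted(values)
    n = len(xs)
    return [xs[min(n - 1, (k * n) // num_buckets)] for k in range(1, num_buckets)]


def bucketize(value: float, splits) -> int:
    """Bucket index of `value`; a value equal to a split goes to the upper bucket."""
    return bisect.bisect_right(splits, value)


# Hypothetical values for one numeric column (not the full wine data).
proline = [520, 680, 735, 1050, 1065, 1185, 1280, 1480]

splits = quantile_splits(proline, num_buckets=4)
buckets = [bucketize(v, splits) for v in proline]
print(splits, buckets)
```

With only a few buckets every feature collapses to a handful of integer codes, so the tree chooses among bucket indices instead of raw thresholds. That may act as a form of regularization on a dataset this small, which would be consistent with the accuracy differences above, although with a single random split those differences should be read with caution.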


In [42]:
for num_bins in range(2, 10):
    print("Bins:", num_bins)
    model, accuracy = decision_tree_bins(train_df, test_df, num_bins)
    tree_model = model.stages[2]
    print(tree_model.toDebugString)
    print()

Bins: 2
DecisionTreeClassificationModel: uid=DecisionTreeClassifier_0014f4a257eb, depth=5, numNodes=21, numClasses=4, numFeatures=13
  If (feature 6 in {0.0})
   If (feature 9 in {0.0})
    If (feature 10 in {0.0})
     If (feature 8 in {0.0})
      If (feature 1 in {0.0})
       Predict: 2.0
      Else (feature 1 not in {0.0})
       Predict: 3.0
     Else (feature 8 not in {0.0})
      Predict: 2.0
    Else (feature 10 not in {0.0})
     Predict: 2.0
   Else (feature 9 not in {0.0})
    If (feature 10 in {0.0})
     Predict: 3.0
    Else (feature 10 not in {0.0})
     Predict: 2.0
  Else (feature 6 not in {0.0})
   If (feature 12 in {0.0})
    Predict: 2.0
   Else (feature 12 not in {0.0})
    If (feature 0 in {0.0})
     If (feature 2 in {0.0})
      Predict: 2.0
     Else (feature 2 not in {0.0})
      If (feature 3 in {0.0})
       Predict: 1.0
      Else (feature 3 not in {0.0})
       Predict: 2.0
    Else (feature 0 not in {0.0})
     Predict: 1.0


Bins: 3
DecisionTreeClassifi

## 4. Text classification (2p.)

### Exercise 4
**TODO:** 
Build a pipeline consisting of Tokenizer, HashingTF, IDF, StringIndexer and LogisticRegression, and fit it to the training data:
http://help.sentiment140.com/for-students/

What is the accuracy of this classifier?

In [43]:
!wget -O sentiment.zip 'http://cs.stanford.edu/people/alecmgo/trainingandtestdata.zip'
!unzip sentiment.zip

--2022-12-19 11:45:30--  http://cs.stanford.edu/people/alecmgo/trainingandtestdata.zip
Resolving cs.stanford.edu (cs.stanford.edu)... 171.64.64.64
Connecting to cs.stanford.edu (cs.stanford.edu)|171.64.64.64|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://cs.stanford.edu/people/alecmgo/trainingandtestdata.zip [following]
--2022-12-19 11:45:30--  https://cs.stanford.edu/people/alecmgo/trainingandtestdata.zip
Connecting to cs.stanford.edu (cs.stanford.edu)|171.64.64.64|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 81363704 (78M) [application/zip]
Saving to: ‘sentiment.zip’


2022-12-19 11:45:32 (46.4 MB/s) - ‘sentiment.zip’ saved [81363704/81363704]

Archive:  sentiment.zip
  inflating: testdata.manual.2009.06.14.csv  
  inflating: training.1600000.processed.noemoticon.csv  


In [44]:
columns = ["label", "id", "date", "query", "user", "text"]

train_df = (
    spark.read
    .format("csv")
    .options(inferSchema="true", header="false")
    .load("training.1600000.processed.noemoticon.csv")
)
for old, new in zip(train_df.columns, columns):
    train_df = train_df.withColumnRenamed(old, new)


test_df = (
    spark.read
    .format("csv")
    .options(inferSchema="true", header="false")
    .load("testdata.manual.2009.06.14.csv")
)
for old, new in zip(test_df.columns, columns):
    test_df = test_df.withColumnRenamed(old, new)

    
train_df = train_df.select("label", "text")
test_df = test_df.select("label", "text")

train_df.show(5)
print(train_df.dtypes)

+-----+--------------------+
|label|                text|
+-----+--------------------+
|    0|@switchfoot http:...|
|    0|is upset that he ...|
|    0|@Kenichan I dived...|
|    0|my whole body fee...|
|    0|@nationwideclass ...|
+-----+--------------------+
only showing top 5 rows

[('label', 'int'), ('text', 'string')]


In [45]:
tokenizer = Tokenizer(
    inputCol="text", 
    outputCol="tokens"
)
hashing_tf = HashingTF(
    inputCol="tokens", 
    outputCol="features", 
    numFeatures=50
)
idf = IDF(
    inputCol="features", 
    outputCol="final_features"
)
string_indexer = StringIndexer(
    inputCol="label", 
    outputCol="final_label",
    handleInvalid="skip"
)
classifier = LogisticRegression(
    featuresCol="final_features", 
    labelCol="final_label", 
    predictionCol="prediction"
)

pipeline = Pipeline(stages=[tokenizer, hashing_tf, idf, string_indexer, classifier])

In [46]:
model = pipeline.fit(train_df)
predictions = model.transform(test_df)

evaluator = MulticlassClassificationEvaluator(
    labelCol="final_label", 
    predictionCol="prediction", 
    metricName="accuracy"
)
accuracy = evaluator.evaluate(predictions) * 100

print(f"Accuracy: {accuracy:.2f}")

Accuracy: 53.20


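The accuracy of the classifier is quite low (53.20%). A likely contributor is `numFeatures=50` in HashingTF, which forces many different words into the same feature slot.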
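
The effect of a small `numFeatures` can be illustrated without Spark. The sketch below hashes words into a fixed number of slots and counts how many distinct words end up sharing a slot. It uses `zlib.crc32` as a deterministic stand-in hash (Spark's HashingTF uses MurmurHash3), and the vocabulary and function names are made up for illustration.

```python
import zlib


def hashed_term_frequencies(tokens, num_features: int):
    """Term-frequency vector where each token is counted in the slot hash(token) % num_features."""
    vector = [0] * num_features
    for token in tokens:
        # crc32 is a stand-in chosen for determinism; it is not the hash Spark uses.
        index = zlib.crc32(token.encode("utf-8")) % num_features
        vector[index] += 1
    return vector


def count_collisions(vocabulary, num_features: int) -> int:
    """Number of distinct words that land in a slot already taken by another word."""
    used = set()
    collisions = 0
    for word in sorted(set(vocabulary)):
        index = zlib.crc32(word.encode("utf-8")) % num_features
        if index in used:
            collisions += 1
        used.add(index)
    return collisions


# Hypothetical vocabulary of 1000 distinct words.
vocabulary = [f"word{i}" for i in range(1000)]

few_slots = count_collisions(vocabulary, 50)        # the setting used in the notebook
many_slots = count_collisions(vocabulary, 2 ** 18)  # HashingTF's default size (262144)
print(few_slots, many_slots)
```

With 50 slots at least 950 of the 1000 words must collide, so the model cannot tell most words apart. Raising `numFeatures` in the pipeline is therefore a natural first experiment, though whether it improves the accuracy reported above would have to be checked by rerunning the notebook.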