# Comprehensive Student Data Analysis with PySpark

## Objective

Leverage PySpark to perform a thorough analysis of student performance data. This exercise covers data loading and manipulation using RDDs and DataFrames, and it culminates in building and evaluating a logistic regression model to predict student success.



## Dataset

**student_data.csv** includes:

- age: Age of the student
- study_time: Weekly study hours
- failures: Number of past class failures
- passed: Course outcome (1: passed, 0: failed)

## Set Up

In [1]:
# !pip install pyspark

In [2]:
import pyspark

In [3]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator
from pyspark.ml import Pipeline

spark = SparkSession.builder.appName("Student Data Analysis").getOrCreate()

24/04/08 16:34:37 WARN Utils: Your hostname, codespaces-f38966 resolves to a loopback address: 127.0.0.1; using 172.16.5.4 instead (on interface eth0)
24/04/08 16:34:37 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/04/08 16:34:38 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


## Tasks

### Task 1: Resilient Distributed Dataset (RDD) Operations

1. Load student_data.csv into an RDD and remove the header.

In [4]:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Read CSV and Remove Header") \
    .getOrCreate()


24/04/08 16:34:41 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.


In [5]:
student_data_rdd = spark.sparkContext.textFile("student_data.csv")

In [6]:
header = student_data_rdd.first()

                                                                                

In [7]:
data_without_header = student_data_rdd.filter(lambda row: row != header)

In [8]:
data_without_header.take(10)

['22,8,0,1',
 '19,7,2,0',
 '23,8,1,1',
 '20,6,2,0',
 '22,9,0,1',
 '18,5,3,0',
 '22,3,3,0',
 '23,3,1,0',
 '20,1,0,0',
 '19,8,1,1']

2. Filter to include only students older than 20 years.

In [9]:
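# YOUR CODE HERE
# Columns are age, study_time, failures, passed, so age is field 0.
students_over_20 = data_without_header.filter(lambda row: int(row.split(',')[0]) > 20)

# View a sample of the filtered data (optional)
data_sample = students_over_20.take(10)
print(data_sample)

['22,8,0,1', '23,8,1,1', '22,9,0,1', '22,3,3,0', '23,3,1,0', '23,2,0,0', '23,3,2,0', '21,8,3,0', '23,5,1,1', '21,6,3,0']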


3. Count students older than 20 with past failures.


In [10]:
# YOUR CODE HERE
# Columns are age, study_time, failures, passed: age is field 0 and failures is field 2.
failing_students_over_20 = data_without_header.filter(
    lambda row: int(row.split(',')[0]) > 20 and int(row.split(',')[2]) > 0
)

# Count the matching students
student_count = failing_students_over_20.count()

# Print the count
print(f"Number of students older than 20 with past failures: {student_count}")
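The raw lines follow the column order age, study_time, failures, passed, so age is field 0 and failures is field 2. As a minimal sketch, the predicate for this task can be sanity-checked in plain Python against the ten sample rows printed by take(10) earlier. The helper name is_over_20_with_failures is hypothetical and is used only for illustration.

```python
# Ten sample rows copied from the take(10) output earlier in the notebook.
sample_rows = [
    '22,8,0,1', '19,7,2,0', '23,8,1,1', '20,6,2,0', '22,9,0,1',
    '18,5,3,0', '22,3,3,0', '23,3,1,0', '20,1,0,0', '19,8,1,1',
]

def is_over_20_with_failures(row):
    """Hypothetical helper: True if age > 20 and failures > 0."""
    age, study_time, failures, passed = (int(v) for v in row.split(','))
    return age > 20 and failures > 0

matching = [row for row in sample_rows if is_over_20_with_failures(row)]
print(matching)       # ['23,8,1,1', '22,3,3,0', '23,3,1,0']
print(len(matching))  # 3
```

The same function could be passed directly to an RDD filter, since it takes one raw line and returns a boolean.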


### Task 2: DataFrame Operations

1. Load student_data.csv into a DataFrame.

In [11]:
# YOUR CODE HERE
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Load CSV to DataFrame") \
    .getOrCreate()

# Read the CSV file into a DataFrame; without inferSchema, every column is read as a string
df = spark.read.option("header", "true").csv("student_data.csv")
df.show()

24/04/08 16:34:44 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.


+---+----------+--------+------+
|age|study_time|failures|passed|
+---+----------+--------+------+
| 22|         8|       0|     1|
| 19|         7|       2|     0|
| 23|         8|       1|     1|
| 20|         6|       2|     0|
| 22|         9|       0|     1|
| 18|         5|       3|     0|
| 22|         3|       3|     0|
| 23|         3|       1|     0|
| 20|         1|       0|     0|
| 19|         8|       1|     1|
| 23|         2|       0|     0|
| 23|         3|       2|     0|
| 18|         8|       1|     1|
| 21|         8|       3|     0|
| 20|         4|       2|     0|
| 17|         4|       2|     0|
| 23|         5|       1|     1|
| 21|         6|       3|     0|
| 17|         5|       2|     0|
| 20|         7|       0|     1|
+---+----------+--------+------+
only showing top 20 rows



2. Explore the data by displaying the schema and the first five rows.

In [12]:
# YOUR CODE HERE
df.printSchema()

root
 |-- age: string (nullable = true)
 |-- study_time: string (nullable = true)
 |-- failures: string (nullable = true)
 |-- passed: string (nullable = true)



In [13]:
df.show(5)

+---+----------+--------+------+
|age|study_time|failures|passed|
+---+----------+--------+------+
| 22|         8|       0|     1|
| 19|         7|       2|     0|
| 23|         8|       1|     1|
| 20|         6|       2|     0|
| 22|         9|       0|     1|
+---+----------+--------+------+
only showing top 5 rows



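3. Add a new column, study_time_hours, that converts study time from hours to minutes. Despite its name, the column holds minutes.

In [14]:
# YOUR CODE HERE
# study_time is a string column here, so Spark casts it to double before multiplying;
# that is why the results below appear as 480.0 rather than 480.
df = df.withColumn("study_time_hours", col("study_time") * 60)
df.show()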

+---+----------+--------+------+----------------+
|age|study_time|failures|passed|study_time_hours|
+---+----------+--------+------+----------------+
| 22|         8|       0|     1|           480.0|
| 19|         7|       2|     0|           420.0|
| 23|         8|       1|     1|           480.0|
| 20|         6|       2|     0|           360.0|
| 22|         9|       0|     1|           540.0|
| 18|         5|       3|     0|           300.0|
| 22|         3|       3|     0|           180.0|
| 23|         3|       1|     0|           180.0|
| 20|         1|       0|     0|            60.0|
| 19|         8|       1|     1|           480.0|
| 23|         2|       0|     0|           120.0|
| 23|         3|       2|     0|           180.0|
| 18|         8|       1|     1|           480.0|
| 21|         8|       3|     0|           480.0|
| 20|         4|       2|     0|           240.0|
| 17|         4|       2|     0|           240.0|
| 23|         5|       1|     1|           300.0|


4. Calculate the average age of students grouped by their pass/fail status.

In [15]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, avg

In [16]:
# YOUR CODE HERE
# The pass/fail status is stored in the "passed" column (1: passed, 0: failed).
average_age_df = df.groupBy("passed").agg(avg(df["age"]).alias("average_age"))
average_age_df.show()

24/04/08 16:34:57 WARN GarbageCollectionMetrics: To enable non-built-in garbage collector(s) List(G1 Concurrent GC), users should configure it(them) to spark.eventLog.gcMetrics.youngGenerationGarbageCollectors or spark.eventLog.gcMetrics.oldGenerationGarbageCollectors
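
To make the aggregation concrete, here is a plain-Python sketch of what grouping by passed and averaging age computes. It uses only the first ten rows shown earlier, not the full dataset, so the resulting numbers are illustrative and are not the notebook's actual averages.

```python
from collections import defaultdict

# (age, passed) pairs taken from the first ten rows of the dataset shown earlier.
sample = [(22, 1), (19, 0), (23, 1), (20, 0), (22, 1),
          (18, 0), (22, 0), (23, 0), (20, 0), (19, 1)]

# Collect the ages for each pass/fail status.
ages_by_status = defaultdict(list)
for age, passed in sample:
    ages_by_status[passed].append(age)

# Average the ages within each group.
average_age = {status: sum(ages) / len(ages)
               for status, ages in ages_by_status.items()}
print(average_age)
```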


### Task 3: Logistic Regression Model

1. Prepare the data by vectorizing features and splitting into training and test datasets.


In [None]:
# YOUR CODE HERE
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.tuning import TrainValidationSplit
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Initialize the Spark session
spark = SparkSession.builder \
    .appName("Data Preparation") \
    .getOrCreate()

24/04/08 16:11:42 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.


In [None]:
data_model = spark.read.csv("student_data.csv", header=True, inferSchema=True)

In [None]:
data_model.printSchema()

root
 |-- age: integer (nullable = true)
 |-- study_time: integer (nullable = true)
 |-- failures: integer (nullable = true)
 |-- passed: integer (nullable = true)



In [None]:
# Vectorize the required feature columns
feature_cols = data_model.columns[:-1]
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
data_model = assembler.transform(data_model)

In [None]:
data_model.show(5)

+---+----------+--------+------+--------------+
|age|study_time|failures|passed|      features|
+---+----------+--------+------+--------------+
| 22|         8|       0|     1|[22.0,8.0,0.0]|
| 19|         7|       2|     0|[19.0,7.0,2.0]|
| 23|         8|       1|     1|[23.0,8.0,1.0]|
| 20|         6|       2|     0|[20.0,6.0,2.0]|
| 22|         9|       0|     1|[22.0,9.0,0.0]|
+---+----------+--------+------+--------------+
only showing top 5 rows



In [None]:
# Split the data into a training set and a test set
(train_data, test_data) = data_model.randomSplit([0.8, 0.2], seed=42)

In [None]:
print("Number of training rows:", train_data.count())
print("Number of test rows:", test_data.count())

Number of training rows: 838
Number of test rows: 162
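
The split sizes above (838 and 162) are not exactly 800 and 200 because randomSplit assigns each row at random according to the weights rather than cutting the data at a fixed position. The plain-Python sketch below mimics that idea with an independent draw per row. It is an illustration only, not Spark's implementation, and the function name random_split is hypothetical.

```python
import random

def random_split(rows, train_weight=0.8, seed=42):
    """Assign each row to train or test with an independent random draw."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_weight else test).append(row)
    return train, test

train, test = random_split(range(1000))
# The sizes land close to 800/200 but are usually not exact.
print(len(train), len(test))
```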


In [None]:
train_data.show(5)

+---+----------+--------+------+--------------+
|age|study_time|failures|passed|      features|
+---+----------+--------+------+--------------+
| 16|         1|       0|     0|[16.0,1.0,0.0]|
| 16|         1|       0|     0|[16.0,1.0,0.0]|
| 16|         1|       0|     0|[16.0,1.0,0.0]|
| 16|         1|       0|     0|[16.0,1.0,0.0]|
| 16|         1|       0|     0|[16.0,1.0,0.0]|
+---+----------+--------+------+--------------+
only showing top 5 rows



2. Build and train a logistic regression model.

In [None]:
# YOUR CODE HERE
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

In [None]:
lr = LogisticRegression(featuresCol='features', labelCol='passed')

In [None]:
lr_model = lr.fit(train_data)

24/04/08 16:14:00 WARN InstanceBuilder: Failed to load implementation from:dev.ludovic.netlib.blas.JNIBLAS
24/04/08 16:14:00 WARN InstanceBuilder: Failed to load implementation from:dev.ludovic.netlib.blas.VectorBLAS
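
Once fitted, a logistic regression model scores a row by taking a weighted sum of its features plus an intercept and passing the result through the sigmoid function. The sketch below shows that calculation in plain Python. The coefficient and intercept values are made up for illustration and are not the values learned by lr_model.

```python
import math

def predict_probability(features, coefficients, intercept):
    """Return P(passed = 1) under a logistic regression model."""
    z = intercept + sum(w * x for w, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical parameters for (age, study_time, failures); NOT the fitted values.
coefficients = [0.0, 1.0, -1.0]
intercept = -5.0

p = predict_probability([22.0, 8.0, 0.0], coefficients, intercept)
print(round(p, 4))               # 0.9526
print(1.0 if p >= 0.5 else 0.0)  # predicted class at a 0.5 threshold
```

For the fitted model, the learned values are available as lr_model.coefficients and lr_model.intercept.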


3. Evaluate the model using accuracy, precision, recall, F1 score, and the area under the ROC curve.


In [None]:
# YOUR CODE HERE
from pyspark.ml.evaluation import MulticlassClassificationEvaluator, BinaryClassificationEvaluator
from pyspark.mllib.evaluation import MulticlassMetrics

In [None]:
predictions = lr_model.transform(test_data)

In [None]:
# Accuracy
accuracy_evaluator = MulticlassClassificationEvaluator(labelCol='passed', predictionCol='prediction', metricName='accuracy')
accuracy = accuracy_evaluator.evaluate(predictions)
print("accuracy:", accuracy)

accuracy: 0.8888888888888888


In [None]:
# Weighted precision
precision_evaluator = MulticlassClassificationEvaluator(labelCol='passed', predictionCol='prediction', metricName='weightedPrecision')
precision = precision_evaluator.evaluate(predictions)
print("precision:", precision)

precision: 0.8901141743247007


In [None]:
# Weighted recall
recall_evaluator = MulticlassClassificationEvaluator(labelCol='passed', predictionCol='prediction', metricName='weightedRecall')
recall = recall_evaluator.evaluate(predictions)
print("recall:", recall)

recall: 0.8888888888888888


In [None]:
# F1 score
f1_evaluator = MulticlassClassificationEvaluator(labelCol='passed', predictionCol='prediction', metricName='f1')
f1_score = f1_evaluator.evaluate(predictions)
print("F1_score:", f1_score)

F1_score: 0.8893568433662774
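
The weightedPrecision, weightedRecall and f1 metrics average the per-class values, weighting each class by its share of the true labels. One consequence is that weighted recall always equals accuracy, which is why those two numbers coincide above. The sketch below works through this in plain Python with a hypothetical confusion matrix. The counts are invented and are not the results of this model.

```python
# Hypothetical counts, indexed as confusion[true_label][predicted_label].
confusion = {0: {0: 50, 1: 10},
             1: {0: 5, 1: 35}}

labels = sorted(confusion)
total = sum(sum(row.values()) for row in confusion.values())

def per_class(label):
    """Return (precision, recall, f1, weight) for one class."""
    tp = confusion[label][label]
    predicted = sum(confusion[t][label] for t in labels)  # column sum
    actual = sum(confusion[label].values())               # row sum
    precision = tp / predicted
    recall = tp / actual
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1, actual / total

accuracy = sum(confusion[l][l] for l in labels) / total
weighted_precision = sum(per_class(l)[0] * per_class(l)[3] for l in labels)
weighted_recall = sum(per_class(l)[1] * per_class(l)[3] for l in labels)
weighted_f1 = sum(per_class(l)[2] * per_class(l)[3] for l in labels)

# Accuracy and weighted recall agree up to floating-point rounding.
print(accuracy, weighted_recall)
print(weighted_precision, weighted_f1)
```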


In [None]:
# the area under the ROC curve
binary_evaluator = BinaryClassificationEvaluator(labelCol='passed')
roc_auc = binary_evaluator.evaluate(predictions)
print("the area under the ROC curve:", roc_auc)

the area under the ROC curve: 0.977570093457944
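
The area under the ROC curve can be read as the probability that a randomly chosen passing student receives a higher score than a randomly chosen failing student. Here is a minimal plain-Python sketch of that pairwise interpretation, using invented labels and scores rather than this model's output. The function name auc_by_pairs is hypothetical.

```python
def auc_by_pairs(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Invented example: one positive (0.4) is scored below one negative (0.5).
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]
auc = auc_by_pairs(labels, scores)
print(auc)  # 8 of the 9 pairs are ranked correctly
```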


In [None]:
# Compute the confusion matrix.
# MulticlassMetrics needs (prediction, label) pairs of floats; 'passed' is an integer column,
# and an integer label fails with CANNOT_ACCEPT_OBJECT_IN_TYPE, so cast it to double first.
predictionAndLabels = predictions.select(
    col('prediction'), col('passed').cast('double')
).rdd.map(tuple)
metrics = MulticlassMetrics(predictionAndLabels)
confusion_matrix = metrics.confusionMatrix()
print("Confusion matrix:")
print(confusion_matrix)
