# Comprehensive Student Data Analysis with PySpark

## Objective

Leverage PySpark to perform a thorough analysis of student performance data. This exercise covers data loading and manipulation using RDDs and DataFrames, and it culminates in building and evaluating a logistic regression model to predict student success.



## Dataset

**student_data.csv** includes:

- age: Age of the student
- study_time: Weekly study hours
- failures: Number of past class failures
- passed: Course outcome (1: passed, 0: failed)

## Set Up

In [1]:
# !pip install pyspark

In [2]:
import pyspark

In [3]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator
from pyspark.ml import Pipeline

spark = SparkSession.builder.appName("Student Data Analysis").getOrCreate()



## Tasks

### Task 1: Resilient Distributed Dataset (RDD) Operations

1. Load student_data.csv into an RDD and remove the header.

In [4]:
# getOrCreate() returns the session started in Set Up, so the appName given here has no effect
spark = SparkSession.builder \
    .appName("Read CSV and Remove Header") \
    .getOrCreate()




In [5]:
student_data_rdd = spark.sparkContext.textFile("student_data.csv")

In [6]:
header = student_data_rdd.first()


In [7]:
data_without_header = student_data_rdd.filter(lambda row: row != header)

In [8]:
data_without_header.take(10)

['22,8,0,1',
 '19,7,2,0',
 '23,8,1,1',
 '20,6,2,0',
 '22,9,0,1',
 '18,5,3,0',
 '22,3,3,0',
 '23,3,1,0',
 '20,1,0,0',
 '19,8,1,1']

2. Filter to include only students older than 20 years.

In [9]:
# YOUR CODE HERE
# Columns are age,study_time,failures,passed, so age is at index 0 (index 1 is study_time)
students_over_20 = data_without_header.filter(lambda row: int(row.split(',')[0]) > 20)

# View a sample of the filtered data (optional)
data_sample = students_over_20.take(10)
print(data_sample)
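Hard-coded column positions are easy to get wrong. The sketch below is plain Python, so it can be checked without a Spark session. It looks up the position of `age` in the header line and applies the filter predicate to the ten rows printed by `data_without_header.take(10)`. The header string is assumed from the Dataset section, and the helper names `column_index` and `is_over_20` are hypothetical.

```python
# Header assumed from the column list in the Dataset section.
header = "age,study_time,failures,passed"

# The ten rows shown by data_without_header.take(10).
sample_rows = [
    "22,8,0,1", "19,7,2,0", "23,8,1,1", "20,6,2,0", "22,9,0,1",
    "18,5,3,0", "22,3,3,0", "23,3,1,0", "20,1,0,0", "19,8,1,1",
]


def column_index(header_line, name):
    """Return the position of a named column in a comma-separated header (hypothetical helper)."""
    return header_line.split(",").index(name)


age_idx = column_index(header, "age")


def is_over_20(row):
    """Predicate usable with RDD.filter: True when the row's age exceeds 20."""
    return int(row.split(",")[age_idx]) > 20


over_20 = [row for row in sample_rows if is_over_20(row)]
print(over_20)
```

The same predicate could be passed to the RDD directly, for example `data_without_header.filter(is_over_20)`.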


3. Count students older than 20 with past failures.


In [10]:
# YOUR CODE HERE
# age is at index 0 and failures at index 2; a student has past failures when failures > 0
failing_students_over_20 = data_without_header.filter(
    lambda row: int(row.split(',')[0]) > 20 and int(row.split(',')[2]) > 0)

# Count the matching students
student_count = failing_students_over_20.count()

# Print the count
print(f"Number of students older than 20 with past failures: {student_count}")
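The combined condition can be checked on a handful of rows before running it on the cluster. This plain-Python sketch parses each line into named fields and counts the rows with `age > 20` and `failures > 0`. It uses only the ten sample rows printed in Task 1, so the count applies to that sample and not to the full dataset. The helpers `parse_row` and `over_20_with_failures` are hypothetical names.

```python
# The ten rows shown by data_without_header.take(10); columns are age,study_time,failures,passed.
sample_rows = [
    "22,8,0,1", "19,7,2,0", "23,8,1,1", "20,6,2,0", "22,9,0,1",
    "18,5,3,0", "22,3,3,0", "23,3,1,0", "20,1,0,0", "19,8,1,1",
]


def parse_row(row):
    """Split a CSV line into named integer fields (hypothetical helper)."""
    age, study_time, failures, passed = (int(value) for value in row.split(","))
    return {"age": age, "study_time": study_time, "failures": failures, "passed": passed}


def over_20_with_failures(row):
    """True when the student is older than 20 and has at least one past failure."""
    fields = parse_row(row)
    return fields["age"] > 20 and fields["failures"] > 0


sample_count = sum(1 for row in sample_rows if over_20_with_failures(row))
print(sample_count)
```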


### Task 2: DataFrame Operations

1. Load student_data.csv into a DataFrame.

In [11]:
# YOUR CODE HERE
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Load CSV to DataFrame") \
    .getOrCreate()

# Read the CSV file with a header row; inferSchema is not set, so every column is loaded as a string
df = spark.read.option("header", "true").csv("student_data.csv")
df.show()



+---+----------+--------+------+
|age|study_time|failures|passed|
+---+----------+--------+------+
| 22|         8|       0|     1|
| 19|         7|       2|     0|
| 23|         8|       1|     1|
| 20|         6|       2|     0|
| 22|         9|       0|     1|
| 18|         5|       3|     0|
| 22|         3|       3|     0|
| 23|         3|       1|     0|
| 20|         1|       0|     0|
| 19|         8|       1|     1|
| 23|         2|       0|     0|
| 23|         3|       2|     0|
| 18|         8|       1|     1|
| 21|         8|       3|     0|
| 20|         4|       2|     0|
| 17|         4|       2|     0|
| 23|         5|       1|     1|
| 21|         6|       3|     0|
| 17|         5|       2|     0|
| 20|         7|       0|     1|
+---+----------+--------+------+
only showing top 20 rows



2. Explore the data by displaying the schema and the first five rows.

In [12]:
# YOUR CODE HERE
df.printSchema()

root
 |-- age: string (nullable = true)
 |-- study_time: string (nullable = true)
 |-- failures: string (nullable = true)
 |-- passed: string (nullable = true)



In [13]:
df.show(5)

+---+----------+--------+------+
|age|study_time|failures|passed|
+---+----------+--------+------+
| 22|         8|       0|     1|
| 19|         7|       2|     0|
| 23|         8|       1|     1|
| 20|         6|       2|     0|
| 22|         9|       0|     1|
+---+----------+--------+------+
only showing top 5 rows



3. Add a new column, study_time_hours, that converts study time from hours to minutes (despite its name, the column holds minutes).

In [14]:
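# YOUR CODE HERE
# study_time was loaded as a string, so Spark casts it to double before multiplying (hence 480.0, not 480)
df = df.withColumn("study_time_hours", col("study_time") * 60)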
df.show()

+---+----------+--------+------+----------------+
|age|study_time|failures|passed|study_time_hours|
+---+----------+--------+------+----------------+
| 22|         8|       0|     1|           480.0|
| 19|         7|       2|     0|           420.0|
| 23|         8|       1|     1|           480.0|
| 20|         6|       2|     0|           360.0|
| 22|         9|       0|     1|           540.0|
| 18|         5|       3|     0|           300.0|
| 22|         3|       3|     0|           180.0|
| 23|         3|       1|     0|           180.0|
| 20|         1|       0|     0|            60.0|
| 19|         8|       1|     1|           480.0|
| 23|         2|       0|     0|           120.0|
| 23|         3|       2|     0|           180.0|
| 18|         8|       1|     1|           480.0|
| 21|         8|       3|     0|           480.0|
| 20|         4|       2|     0|           240.0|
| 17|         4|       2|     0|           240.0|
| 23|         5|       1|     1|           300.0|
| 21|         6|       3|     0|           360.0|
| 17|         5|       2|     0|           300.0|
| 20|         7|       0|     1|           420.0|
+---+----------+--------+------+----------------+
only showing top 20 rows


4. Calculate the average age of students grouped by their pass/fail status.

In [15]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, avg

In [16]:
# YOUR CODE HERE
# The outcome column is named "passed" (1: passed, 0: failed)
average_age_df = df.groupBy("passed").agg(avg(col("age")).alias("average_age"))
average_age_df.show()
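Grouping by the outcome column and averaging `age` amounts to keeping a running sum and a row count per group. The plain-Python sketch below does this for the ten sample rows printed in Task 1. The resulting averages describe that sample only and are not the result for the full dataset.

```python
from collections import defaultdict

# The ten rows shown by data_without_header.take(10); columns are age,study_time,failures,passed.
sample_rows = [
    "22,8,0,1", "19,7,2,0", "23,8,1,1", "20,6,2,0", "22,9,0,1",
    "18,5,3,0", "22,3,3,0", "23,3,1,0", "20,1,0,0", "19,8,1,1",
]

# passed -> [sum of ages, number of rows]
totals = defaultdict(lambda: [0, 0])
for row in sample_rows:
    age, _study_time, _failures, passed = (int(value) for value in row.split(","))
    totals[passed][0] += age
    totals[passed][1] += 1

# Divide each group's age sum by its row count to get the group mean.
average_age = {passed: age_sum / count for passed, (age_sum, count) in totals.items()}
print(average_age)
```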


### Task 3: Logistic Regression Model

1. Prepare the data by vectorizing features and splitting into training and test datasets.


In [17]:
# YOUR CODE HERE
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.tuning import TrainValidationSplit
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Initialize the Spark session
spark = SparkSession.builder \
    .appName("Data Preparation") \
    .getOrCreate()



In [41]:
data_model = spark.read.csv("student_data.csv", header=True, inferSchema=True)

In [42]:
data_model.printSchema()

root
 |-- age: integer (nullable = true)
 |-- study_time: integer (nullable = true)
 |-- failures: integer (nullable = true)
 |-- passed: integer (nullable = true)



In [43]:
# Vectorize the required feature columns (every column except the last, which is the label)
feature_cols = data_model.columns[:-1]
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
data_model = assembler.transform(data_model)

In [44]:
data_model.show(5)

+---+----------+--------+------+--------------+
|age|study_time|failures|passed|      features|
+---+----------+--------+------+--------------+
| 22|         8|       0|     1|[22.0,8.0,0.0]|
| 19|         7|       2|     0|[19.0,7.0,2.0]|
| 23|         8|       1|     1|[23.0,8.0,1.0]|
| 20|         6|       2|     0|[20.0,6.0,2.0]|
| 22|         9|       0|     1|[22.0,9.0,0.0]|
+---+----------+--------+------+--------------+
only showing top 5 rows



In [45]:
# Split the data into a training set and a test set
(train_data, test_data) = data_model.randomSplit([0.8, 0.2], seed=42)

In [46]:
print("Number of training rows:", train_data.count())
print("Number of test rows:", test_data.count())

Number of training rows: 838
Number of test rows: 162
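The counts above are 838 and 162 rather than exactly 800 and 200 because the split is random per row: the weights give proportions that are met only approximately. The sketch below illustrates the idea in plain Python. It is not Spark's actual sampler and will not reproduce the counts above; `bernoulli_split` is a hypothetical helper.

```python
import random


def bernoulli_split(rows, train_fraction, seed):
    """Assign each row to train or test independently (hypothetical helper, not Spark's sampler)."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        # Each row draws its own random number, so group sizes vary around the target.
        if rng.random() < train_fraction:
            train.append(row)
        else:
            test.append(row)
    return train, test


rows = list(range(1000))  # stand-in for the 1000 rows of the dataset
train, test = bernoulli_split(rows, train_fraction=0.8, seed=42)
print(len(train), len(test))
```

Fixing the seed makes the split repeatable, which is why `randomSplit` is called with `seed=42` in the notebook.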


In [47]:
train_data.show(5)

+---+----------+--------+------+--------------+
|age|study_time|failures|passed|      features|
+---+----------+--------+------+--------------+
| 16|         1|       0|     0|[16.0,1.0,0.0]|
| 16|         1|       0|     0|[16.0,1.0,0.0]|
| 16|         1|       0|     0|[16.0,1.0,0.0]|
| 16|         1|       0|     0|[16.0,1.0,0.0]|
| 16|         1|       0|     0|[16.0,1.0,0.0]|
+---+----------+--------+------+--------------+
only showing top 5 rows



2. Build and train a logistic regression model.

In [48]:
# YOUR CODE HERE
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

In [52]:
# The label column in this dataset is "passed"; there is no column named "label"
lr = LogisticRegression(featuresCol="features", labelCol="passed")

In [53]:
lr_model = lr.fit(train_data)
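A logistic regression model scores a student by taking a weighted sum of the features, adding an intercept, and passing the result through the sigmoid function to obtain a probability of `passed = 1`. The sketch below shows that computation in plain Python. The weights and intercept are hypothetical values chosen for illustration and are not fitted to student_data.csv.

```python
import math


def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(features, weights, intercept):
    """Probability of the positive class for one feature vector."""
    z = intercept + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


# Hypothetical coefficients for (age, study_time, failures); NOT fitted values.
weights = [0.0, 0.5, -1.0]
intercept = -2.0

# Feature vector of the first row shown earlier: age 22, study_time 8, failures 0.
p = predict_proba([22.0, 8.0, 0.0], weights, intercept)
print(round(p, 3))
```

After a successful fit in PySpark, the learned values should be available as `lr_model.coefficients` and `lr_model.intercept`.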

3. Evaluate the model using accuracy, precision, recall, F1 score, and the area under the ROC curve.


In [None]:
# YOUR CODE HERE

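As a reference for what the requested metrics measure, the sketch below computes accuracy, precision, recall, F1 and the area under the ROC curve in plain Python. The labels and scores are made up for illustration and are not model output; the function names are hypothetical.

```python
def classification_metrics(labels, predictions):
    """Accuracy, precision, recall and F1 for the positive class (label 1)."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def roc_auc(labels, scores):
    """Chance that a random positive outscores a random negative; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Made-up labels and predicted probabilities, for illustration only.
labels = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.3, 0.2, 0.7]
predictions = [1 if s >= 0.5 else 0 for s in scores]

metrics = classification_metrics(labels, predictions)
auc = roc_auc(labels, scores)
print(metrics, auc)
```

In PySpark, `BinaryClassificationEvaluator` with `metricName="areaUnderROC"` and `MulticlassClassificationEvaluator` with `metricName` set to `"accuracy"`, `"weightedPrecision"`, `"weightedRecall"` or `"f1"` report comparable figures when `labelCol="passed"`. The multiclass metrics are weighted across both classes, so they can differ from the positive-class values computed here.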
