# Predicting Student Depression: A Big Data Analytics Approach with Apache Spark

- **Author:** David Araba
- **Student ID:** 48093143
- **Course:** INFS3208 - Cloud Computing
- **Date:** October 2025

## 1. Introduction & Project Goals

### 1.1. Objective
The primary objective of this project is to develop and evaluate a suite of machine learning models using Apache Spark to predict the likelihood of depression among students. By leveraging a comprehensive dataset that includes demographic, academic, and lifestyle factors, I aim to identify key indicators associated with mental health challenges in an academic environment.

### 1.2. Significance
Student mental health is a growing concern globally. The pressures of academic life, combined with financial and social stressors, can significantly affect a student's well-being and academic performance. This project seeks to create a data-driven framework that could identify at-risk students, enabling educational institutions to offer timely and targeted support. Building on scalable cloud computing technologies lays a foundation for a system capable of handling large-scale, real-world student data, and supports a shift from reactive to proactive mental wellness strategies. This addresses the need for solutions that scale where traditional single-machine computing struggles.

### 1.3. Technical Stack
This project will be implemented using the following technologies:
* **Language:** Python 3.x
* **Core Engine:** Apache Spark (via PySpark)
* **Libraries:**
    * **Spark MLlib:** For building scalable machine learning pipelines.
    * **Pandas:** For initial data handling and manipulation.
    * **Matplotlib & Seaborn:** For data visualisation and result interpretation.

## 2. Project Architecture & Workflow


### 2.1. Workflow Description
This project follows a standard big data analytics workflow, as depicted in the diagram in Section 2.2. The process begins with the ingestion of four separate but related data files into the Spark environment. These datasets are then joined and pre-processed to create a unified, analysis-ready master dataset. This dataset is then used to train and evaluate the four machine learning functionalities required by the project specification: classification, regression, clustering, and association rule mining. Finally, the insights and model performance metrics are visualised to provide clear, interpretable results.

### 2.2. Architecture Diagram
This diagram illustrates the project's workflow from data source to final analysis. It explicitly shows the use of multiple data sources and the application of various Spark MLlib functionalities, fulfilling the key project requirements.


### 2.3. Architecture Explanation
The workflow diagram above shows a big data analytics pipeline designed for scalable student mental health analysis. The process begins with four distinct data sources being ingested into the Spark environment, where Spark's distributed processing allows large datasets to be loaded efficiently. The sources are joined into a unified master dataset, which passes through feature engineering pipelines before being split for model training and evaluation. The four ML functionalities (classification, regression, clustering, and association rule mining) are then applied independently, each addressing a different aspect of student mental health prediction and analysis. This architecture illustrates how cloud computing technologies can handle complex, multi-dimensional analysis tasks that would be challenging with traditional single-machine approaches.


## 3. Environment Setup
This section prepares the notebook's environment. The first part imports all necessary libraries for the project, including `pyspark` for distributed data processing, `pyspark.ml` for machine learning, and `matplotlib` for visualisation. The second part initialises the `SparkSession`, which is the essential entry point to all of Spark's functionalities.

```python
# Part 0: Project Initialisation & Overview
# 3. Environment Setup

# 3.1. Import All Necessary Libraries
# ---

# Spark libraries for session management, data manipulation, and ML
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when, expr, avg, mean, array
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.regression import LinearRegression
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import BinaryClassificationEvaluator, RegressionEvaluator, ClusteringEvaluator
from pyspark.ml.fpm import FPGrowth

# Standard Python libraries for data handling and plotting
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

print("Libraries imported successfully.")

# 3.2. Initialise Spark Session
# ---
# Create a SparkSession, which is the entry point to any Spark functionality.
# - appName: Sets a name for the application, which will appear in the Spark UI.
# - getOrCreate(): Gets an existing SparkSession or, if there is none, creates a new one.
spark = (
    SparkSession.builder
    .appName("StudentMentalHealthPrediction")
    .getOrCreate()
)

print(f"Spark session created successfully. Version: {spark.version}")
```