1. YAEL

### Student Mental Health Analysis - Final Project
Authors: Eden Elkoubi, Yael Barbash, Avigail Cohen

Dataset: Student Mental Health Survey
Project Overview & Kick-off
This project investigates the relationship between academic disciplines (STEM vs. Non-STEM), social support, and mental health outcomes among students.

Research Questions
1. Is there a significant difference in anxiety and stress levels between STEM and Humanities/Social Science students?
2. Does social support act as a moderator in the relationship between academic stress and mental health?

In [None]:
code

2. AVIGAIL

In [None]:
code

3. EDEN

### Setup and Logging 
According to the project guidelines, execution flow must be tracked with a logger rather than print statements. This part configures logging so that every statistical result is written both to the console and to a file named analysis_log.log.

In [None]:
import pandas as pd
import numpy as np
import logging
from scipy import stats
import statsmodels.api as sm
# Configure the logger to track the analysis process
def setup_logger():
    """
    Sets up a logger that outputs to both a log file and the console.
    This replaces standard 'print' statements as per project requirements.
    """
    logging.basicConfig(
        level=logging.INFO,
        format='%(asctime)s - %(levelname)s - %(message)s',
        handlers=[
            logging.FileHandler("analysis_log.log"),
            logging.StreamHandler()
        ]
    )
    return logging.getLogger(__name__)
   
logger = setup_logger()
logger.info("Logger initialized. Ready for statistical analysis.")
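
One caveat when running this in a notebook: `logging.basicConfig` does nothing if the root logger already has handlers, so re-running the cell (or running it in an environment that has already configured logging) may leave the file handler unattached. The following is a minimal sketch of one way around that, not part of the project code. The helper name `get_analysis_logger`, the logger name `analysis_demo`, and the temporary log path are all assumptions made for illustration.

```python
import logging
import os
import tempfile


def get_analysis_logger(name="analysis", log_file="analysis_log.log"):
    """Hypothetical helper: return a named logger with file and console handlers.

    Handlers are attached only on the first call, so re-running the cell
    does not produce duplicated log lines.
    """
    named_logger = logging.getLogger(name)
    named_logger.setLevel(logging.INFO)
    if not named_logger.handlers:
        formatter = logging.Formatter("%(asctime)s - %(levelname)s - %(message)s")
        for handler in (logging.FileHandler(log_file), logging.StreamHandler()):
            handler.setFormatter(formatter)
            named_logger.addHandler(handler)
        # Do not pass records on to the root logger, which might print them again
        named_logger.propagate = False
    return named_logger


# Demo: the log file goes to a temporary directory (an assumption for this sketch)
demo_log_path = os.path.join(tempfile.gettempdir(), "analysis_demo.log")
demo_logger = get_analysis_logger("analysis_demo", demo_log_path)
demo_logger = get_analysis_logger("analysis_demo", demo_log_path)  # second call adds nothing
demo_logger.info("Logger ready.")
```

Calling the helper twice still leaves exactly two handlers on the logger, which is the property that matters when cells are executed repeatedly.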

### Hypothesis Testing - Mann–Whitney U Test
Explanation: The first hypothesis explores whether mental health scores differ significantly between STEM students and students from other faculties. Because the scores are ordinal (0–5) and not normally distributed, we use the non-parametric Mann–Whitney U test rather than an independent t-test, dropping missing values (NaN) from each group before testing.

In [None]:
def perform_group_comparison(df, group_col, target_col):
    """
    Hypothesis:
    Mental health levels differ between STEM and Non-STEM students.

    Method:
    Mann–Whitney U test (non-parametric), suitable for ordinal data (0–5)
    and non-normal distributions.
    """
    logger.info(f"Comparing {target_col} between groups in {group_col}")

    # Separate groups
    stem_group = df[df[group_col] == 1][target_col].dropna()
    non_stem_group = df[df[group_col] == 0][target_col].dropna()

    # Mann–Whitney U Test
    u_stat, p_val = stats.mannwhitneyu(
        stem_group,
        non_stem_group,
        alternative='two-sided'
    )

    logger.info(f"Mann–Whitney U results - U statistic: {u_stat:.4f}, P-value: {p_val:.4f}")

    # Interpretation
    if p_val < 0.05:
        logger.info("Result is statistically significant (p < 0.05).")
    else:
        logger.info("Result is not statistically significant (p >= 0.05).")

    return u_stat, p_val
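
To make the U statistic less of a black box, here is a small hand-rolled sketch of what it counts: across all (STEM, non-STEM) pairs, a pair scores 1 when the STEM value is larger and 0.5 when the two are tied. The function name `mann_whitney_u` and the scores are invented for illustration; the analysis itself relies on `stats.mannwhitneyu`, which in recent SciPy versions reports the statistic for the first sample passed.

```python
def mann_whitney_u(sample_a, sample_b):
    """Return U for sample_a: pairs with a > b count 1, tied pairs count 0.5."""
    u = 0.0
    for a in sample_a:
        for b in sample_b:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u


# Made-up ordinal scores (0-5), not taken from the survey
stem = [4, 5, 3, 4]
non_stem = [2, 3, 1, 2]

u_stem = mann_whitney_u(stem, non_stem)
u_non_stem = mann_whitney_u(non_stem, stem)
n_pairs = len(stem) * len(non_stem)
```

The two U values always sum to the number of pairs, so `u_stem / n_pairs` can be read as the share of pairs in which the STEM score is higher (ties counted as half).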


### Correlation Analysis
Explanation: The second hypothesis checks for a monotonic association between academic performance (CGPA) and substance use. We use Spearman's rank-order correlation coefficient, which suits ordinal and non-normally distributed variables, to determine the strength and direction of this relationship.

In [None]:
def analyze_variable_correlation(df, var_x, var_y):
    """
    Hypothesis:
    There is an association between academic factors (e.g., CGPA or credit load)
    and substance use behavior.

    Method:
    Spearman rank-order correlation, suitable for ordinal and non-normally
    distributed variables.
    """
    logger.info(f"Calculating correlation between {var_x} and {var_y}")

    # Remove missing values
    valid_data = df[[var_x, var_y]].dropna()

    # Spearman correlation
    correlation_coeff, p_value = stats.spearmanr(
        valid_data[var_x],
        valid_data[var_y]
    )

    logger.info(
        f"Spearman correlation results - "
        f"Coefficient: {correlation_coeff:.4f}, P-value: {p_value:.4f}"
    )

    if p_value < 0.05:
        logger.info("Correlation is statistically significant (p < 0.05).")
    else:
        logger.info("Correlation is not statistically significant (p >= 0.05).")

    return correlation_coeff, p_value
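
Spearman's coefficient is the Pearson correlation computed on ranks rather than raw values, with tied values sharing the average of their rank positions. The sketch below spells this out in plain Python. The helper names and the toy numbers are assumptions for illustration only; the analysis itself uses `stats.spearmanr`.

```python
def average_ranks(values):
    """Rank values from 1 to n; tied values receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up values, not taken from the survey
cgpa = [2.1, 2.8, 3.0, 3.4, 3.9]
substance_use = [5, 3, 4, 1, 0]
rho = spearman(cgpa, substance_use)
```

Because only the ordering matters, any monotonic rescaling of either variable leaves `rho` unchanged, which is why the method suits ordinal survey items.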


### Moderation Analysis (Regression)
Explanation: To raise the complexity level required by the grading rubric, we implement a moderation model: an OLS regression with mean-centered predictors and an interaction term that tests whether "Social Support" moderates the relationship between "Academic Load" and "Mental Health Stress".

In [None]:
def run_moderation_model(df, outcome, predictor, moderator):
    """
    Hypothesis:
    Social support moderates the relationship between academic load
    and mental health outcomes.

    Method:
    OLS regression with centered predictors and an interaction term.
    """
    logger.info(
        f"Starting moderation analysis with outcome={outcome}, "
        f"predictor={predictor}, moderator={moderator}"
    )

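    # Keep only complete cases so centering uses the same rows the model is fitted on
    df_temp = df[[outcome, predictor, moderator]].dropna().copy()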

    # Centering predictor and moderator (important for moderation analysis)
    df_temp[f'{predictor}_c'] = df_temp[predictor] - df_temp[predictor].mean()
    df_temp[f'{moderator}_c'] = df_temp[moderator] - df_temp[moderator].mean()

    # Interaction term
    df_temp['interaction'] = (
        df_temp[f'{predictor}_c'] * df_temp[f'{moderator}_c']
    )

    # Design matrix
    X = df_temp[[f'{predictor}_c', f'{moderator}_c', 'interaction']]
    X = sm.add_constant(X)

    y = df_temp[outcome]

    # Fit OLS model
    model = sm.OLS(y, X, missing='drop').fit()

    logger.info("Moderation model fitted successfully.")
    logger.info(f"R-squared: {model.rsquared:.4f}")

    return model.summary()
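
The interaction coefficient is easiest to interpret through simple slopes: with centered variables, the effect of the predictor at a given moderator value is `b_predictor + b_interaction * moderator_centered`. The sketch below illustrates this with NumPy least squares on a made-up, noise-free outcome, so the fitted coefficients match the ones chosen. All numbers and variable names here are assumptions for illustration, not results from the survey.

```python
import numpy as np

# Made-up balanced design: four credit loads at each of three support levels
credit_load = np.array([12, 15, 18, 21] * 3, dtype=float)
support = np.repeat([1.0, 2.0, 3.0], 4)

# Center both variables, then build the interaction term
load_c = credit_load - credit_load.mean()
support_c = support - support.mean()
interaction = load_c * support_c

# Design matrix: intercept, centered predictor, centered moderator, interaction
X = np.column_stack([np.ones_like(load_c), load_c, support_c, interaction])

# Noise-free outcome generated from coefficients chosen for this sketch
true_coefs = np.array([2.0, 0.5, -0.3, -0.2])
stress = X @ true_coefs

# Ordinary least squares via NumPy
coefs, *_ = np.linalg.lstsq(X, stress, rcond=None)
b_intercept, b_load, b_support, b_interaction = coefs

# Simple slopes of credit load at low (1) and high (3) support
slope_low_support = b_load + b_interaction * (1.0 - support.mean())
slope_high_support = b_load + b_interaction * (3.0 - support.mean())
```

In this toy setup the negative interaction coefficient makes the load slope flatter at high support than at low support, which is the pattern a buffering effect of social support would produce.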


### Main Execution Block
Explanation: This is the entry point of the script. Following the guidelines, the main() function is kept minimal, serving only to orchestrate the flow of the analysis modules.

In [None]:
def main():
    """
    Main execution flow for the statistical analysis phase.
    Assumes that the dataset has undergone basic preprocessing.
    """
    logger.info("Final Project – Statistical Analysis Phase Started")

    # Load dataset
    try:
        df = pd.read_csv("st_1.csv")
        logger.info(f"Dataset loaded successfully: {df.shape[0]} rows, {df.shape[1]} columns")
    except Exception:
        # logger.exception logs at ERROR level and appends the traceback
        logger.exception("Failed to load dataset")
        return

 
    # Feature Engineering
    # Create STEM indicator
    stem_fields = ['Engineering', 'Computer Science', 'Medicine', 'Science', 'Technology']
    df['Is_STEM'] = df['Course'].isin(stem_fields).astype(int)

    # Encode Social Support as ordinal numeric variable
    support_map = {'Low': 1, 'Moderate': 2, 'High': 3}
    df['Social_Support_Num'] = df['Social_Support'].map(support_map)

    logger.info("Feature engineering completed.")


    # Group Comparisons
    perform_group_comparison(df, 'Is_STEM', 'Stress_Level')
    perform_group_comparison(df, 'Is_STEM', 'Anxiety_Score')

    # Correlation Analysis
    analyze_variable_correlation(df, 'CGPA', 'Substance_Use')


    # Moderation Analysis
    regression_summary = run_moderation_model(
        df,
        outcome='Stress_Level',
        predictor='Semester_Credit_Load',
        moderator='Social_Support_Num'
    )

    logger.info("Moderation analysis completed.")
    logger.info(f"\n{regression_summary}")

    logger.info("Statistical Analysis Phase Completed Successfully.")


if __name__ == "__main__":
    main()
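
One thing to watch in the feature engineering step: `Series.map` with a dictionary returns NaN for any label that is not a key, so a differently spelled category (for example lowercase "moderate") silently becomes missing and is later dropped from the regression. The following sketch shows a possible check on a toy frame. The toy labels are invented, and the real `Social_Support` column may not contain any such variants.

```python
import pandas as pd

support_map = {"Low": 1, "Moderate": 2, "High": 3}

# Toy data with one misspelled label and one genuinely missing value
toy = pd.DataFrame({"Social_Support": ["Low", "High", "moderate", None, "Moderate"]})
toy["Social_Support_Num"] = toy["Social_Support"].map(support_map)

# Rows that had a label, but the label matched no key in support_map
unmapped_mask = toy["Social_Support"].notna() & toy["Social_Support_Num"].isna()
unmapped_labels = sorted(toy.loc[unmapped_mask, "Social_Support"].unique())
n_missing_after_map = int(toy["Social_Support_Num"].isna().sum())
```

Logging `unmapped_labels` before the analyses run would make it visible how many responses are lost to encoding rather than to real non-response.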
