### Student Mental Health Analysis - Final Project
Authors: Eden Elkoubi, Yael Barbash, Avigail Cohen

Dataset: Student Mental Health Survey
Project Overview & Kick-off
This project investigates the relationship between academic disciplines (STEM vs. Non-STEM), social support, and mental health outcomes among students.

Research Questions
1. Is there a significant difference in anxiety and stress levels between STEM and Humanities/Social Science students?
2. Does social support act as a moderator in the relationship between academic stress and mental health?

### Step 1: Data Preprocessing and Cleaning

Handling missing values
Missing values were identified and treated in key variables. Rows with missing values in Substance_Use were removed, as this categorical variable cannot be meaningfully imputed. Missing values in CGPA, a continuous variable, were replaced with the mean CGPA calculated from the available (non-missing) data.

Feature engineering
A new binary variable named Is_STEM was created to represent whether a student belongs to a STEM-related field. The variable was coded as 1 for students in Engineering, Medical, or Computer Science programs, and 0 for all other fields of study.

Categorical data transformation
Categorical variables were converted into numerical form to enable statistical analysis. Specifically, Social Support was encoded as low = 1, moderate = 2, and high = 3, while Sleep Quality was encoded as poor = 1, average = 2, and good = 3.
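
A quick way to sanity-check an ordinal encoding like this is to look for labels the mapping does not cover, since `Series.map` silently turns them into NaN. The sketch below uses made-up answers; the `"Unknown"` label and the `support` series are illustrative and not taken from the survey.

```python
import pandas as pd

# Hypothetical survey answers; "Unknown" is deliberately absent from the mapping
support = pd.Series(["Low", "High", "Moderate", "Unknown"])
support_mapping = {"Low": 1, "Moderate": 2, "High": 3}

# Labels missing from the mapping become NaN after .map()
encoded = support.map(support_mapping)

# Collect the original labels that were present but could not be encoded
unmapped = support[encoded.isna() & support.notna()]
```

If `unmapped` is not empty, the mapping dictionary probably needs another entry (or the raw labels need normalising, for example with `str.strip()` or `str.title()`).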

Outlier detection and treatment
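Outliers in Age, CGPA, and Semester_Credit_Load were addressed using the Interquartile Range (IQR) method. Rows with values more than 1.5 times the IQR below the first quartile or above the third quartile were removed from the dataset. In addition, rows whose Stress_Level, Depression_Score, Anxiety_Score, or Financial_Stress fell outside the valid range of 0 to 5 were dropped.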
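
The IQR rule can be applied in two ways: dropping rows that fall outside the bounds, or capping them at the bounds so that no observation is lost. The following is a minimal sketch of both on made-up ages; the helper `iqr_bounds` and the sample values are hypothetical.

```python
import pandas as pd

def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) bounds of the k * IQR rule for a numeric Series."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up ages with one obvious extreme value
ages = pd.Series([19, 20, 21, 22, 23, 60])
lower, upper = iqr_bounds(ages)

# Option A: remove observations outside the bounds
filtered = ages[(ages >= lower) & (ages <= upper)]

# Option B: keep every observation but cap extreme values at the bounds
capped = ages.clip(lower=lower, upper=upper)
```

Filtering shrinks the sample (here from six rows to five), while capping keeps every row but replaces the extreme value with the upper bound.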

Final dataset preparation
After completing the preprocessing steps, the cleaned and transformed dataset was saved and used for subsequent statistical analyses and modeling.

In [None]:
import pandas as pd

# Level 1 - Handling missing values
df = pd.read_csv("st_1.csv") # Load the dataset
cgpa_mean = df["CGPA"].mean(skipna=True) # Calculate the mean CGPA (excluding missing values)
df_clean = df.dropna(subset=["Substance_Use"]).copy() # Remove rows with missing values in Substance_Use
df_clean["CGPA"] = df_clean["CGPA"].fillna(cgpa_mean) # Fill missing CGPA values with the calculated mean
df_clean.to_csv("st_1_cleaned.csv", index=False) # Save the cleaned dataset


#level 2- Feature Engineering: Create a binary variable 'Is_STEM'
# Assign 1 for Engineering, Medical, and Computer Science; 0 for all others
stem_courses = ['Engineering', 'Medical', 'Computer Science']
df_clean['Is_STEM'] = df_clean['Course'].isin(stem_courses).astype(int)

# level 3- Data Transformation: Convert categorical variables to numerical values
# Define mapping: Poor/Low = 1, Average/Moderate = 2, Good/High = 3
sleep_mapping = {'Poor': 1, 'Average': 2, 'Good': 3}
support_mapping = {'Low': 1, 'Moderate': 2, 'High': 3}

# Apply the mapping to the respective columns
df_clean['Sleep_Quality'] = df_clean['Sleep_Quality'].map(sleep_mapping)
df_clean['Social_Support'] = df_clean['Social_Support'].map(support_mapping)


# level 4a- Handling Outliers using IQR for continuous variables
# This will remove extreme/unrealistic values for Age, CGPA, and Credit Load
outlier_columns = ['Age', 'CGPA', 'Semester_Credit_Load']

for col in outlier_columns:
    Q1 = df_clean[col].quantile(0.25)
    Q3 = df_clean[col].quantile(0.75)
    IQR = Q3 - Q1
    
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    
    # Filter the data to keep only values within the calculated bounds
    df_clean = df_clean[(df_clean[col] >= lower_bound) & (df_clean[col] <= upper_bound)]


# level 4b- Logical Range Validation for score columns
# Ensure that scores like Stress, Depression, and Anxiety are within the valid range [0, 5]
score_columns = ['Stress_Level', 'Depression_Score', 'Anxiety_Score', 'Financial_Stress']

for col in score_columns:
    # Filter out any values that are negative or greater than 5
    df_clean = df_clean[(df_clean[col] >= 0) & (df_clean[col] <= 5)]

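# Save the final cleaned dataset under its own name
df_clean.to_csv("st_1_cleaned_final.csv", index=False)

# Also overwrite st_1.csv with the cleaned data.
# Caution: this replaces the raw input file, so re-running this cell would
# re-clean already-encoded data (the Sleep_Quality and Social_Support
# mappings would then produce NaN).
df_clean.to_csv("st_1.csv", index=False)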


### Setup and Logging
Per the project guidelines, execution flow is tracked with a logger instead of print statements. This part sets up a logging configuration that writes all statistical results to the console and to a file named analysis_log.log.
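
One caveat worth knowing in a notebook: `logging.basicConfig` does nothing when the root logger already has handlers, so re-running a setup cell with different settings has no effect unless `force=True` is passed (Python 3.8+). As an alternative, the sketch below attaches handlers to a named logger only once. The function name `get_analysis_logger`, the logger name `"analysis"`, and the temporary log path are assumptions made for illustration.

```python
import logging
import os
import tempfile

def get_analysis_logger(log_path, name="analysis"):
    """Return a named logger that writes to log_path and to the console."""
    analysis_logger = logging.getLogger(name)
    analysis_logger.setLevel(logging.INFO)
    analysis_logger.propagate = False  # avoid duplicate lines via the root logger

    # Attach handlers only once, so re-running the cell does not duplicate output
    if not analysis_logger.handlers:
        formatter = logging.Formatter('%(asctime)s - %(levelname)s - %(message)s')
        for handler in (logging.FileHandler(log_path), logging.StreamHandler()):
            handler.setFormatter(formatter)
            analysis_logger.addHandler(handler)
    return analysis_logger

# Demo: write to a temporary directory instead of the project folder
demo_log_path = os.path.join(tempfile.mkdtemp(), "analysis_log.log")
demo_logger = get_analysis_logger(demo_log_path)
demo_logger_again = get_analysis_logger(demo_log_path)  # same object, no new handlers
demo_logger.info("Logger initialized.")
```

Calling the function twice returns the same logger with the same two handlers, which is the behaviour one usually wants when cells are executed repeatedly.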

In [None]:
import pandas as pd
import numpy as np
import logging
from scipy import stats
import statsmodels.api as sm

# Configure the logger to track the analysis process
def setup_logger():
    """
    Sets up a logger that outputs to both a log file and the console.
    This replaces standard 'print' statements as per project requirements.
    """
    logging.basicConfig(
        level=logging.INFO,
        format='%(asctime)s - %(levelname)s - %(message)s',
        handlers=[
            logging.FileHandler("analysis_log.log"),
            logging.StreamHandler()
        ]
    )
    return logging.getLogger(__name__)
   
logger = setup_logger()
logger.info("Logger initialized. Ready for statistical analysis.")

### Hypothesis Testing - Independent T-Test
Explanation: The first hypothesis explores whether mental health scores differ significantly between STEM students and students from other faculties. We use an independent two-sample t-test, passing nan_policy='omit' so that missing values (NaN) are excluded from the calculation.
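
By default, `stats.ttest_ind` assumes equal variances in the two groups (Student's t-test); passing `equal_var=False` gives Welch's t-test, which drops that assumption. The sketch below runs both on simulated scores. The group means, spreads, and sizes are invented and are not results from the survey.

```python
import numpy as np
from scipy import stats

# Simulated scores for two groups with different spreads (illustrative only)
rng = np.random.default_rng(0)
stem = rng.normal(loc=3.2, scale=1.0, size=200)
non_stem = rng.normal(loc=2.9, scale=1.5, size=150)

# Inject a few missing values to show that nan_policy='omit' skips them
stem[:5] = np.nan

# Student's t-test (equal variances assumed, the SciPy default)
student_t, student_p = stats.ttest_ind(stem, non_stem, nan_policy="omit")

# Welch's t-test (no equal-variance assumption)
welch_t, welch_p = stats.ttest_ind(stem, non_stem, equal_var=False, nan_policy="omit")
```

When the two groups differ noticeably in size or variance, the Welch variant is often the more cautious choice; comparing both p-values shows whether the conclusion depends on the variance assumption.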

In [None]:
def perform_group_comparison(df, group_col, target_col):
    """
    Hypothesis: Mental health levels differ between STEM and Non-STEM students.
    Method: Independent Two-Sample T-Test.
    """
    logger.info(f"Comparing {target_col} between groups in {group_col}")

    # Separating groups based on the binary indicator (1 for STEM, 0 for others)
    # Using meaningful variable names instead of raw indices
    stem_group = df[df[group_col] == 1][target_col]
    non_stem_group = df[df[group_col] == 0][target_col]

    # Performing the T-test
    t_stat, p_val = stats.ttest_ind(stem_group, non_stem_group, nan_policy='omit')

    logger.info(f"T-test results - Statistic: {t_stat:.4f}, P-value: {p_val:.4f}")
    
    # Interpretation logic
    if p_val < 0.05:
        logger.info("Result is statistically significant (p < 0.05).")
    else:
        logger.info("Result is not statistically significant (p >= 0.05).")
        
    return t_stat, p_val

### Correlation Analysis
Explanation: The second hypothesis checks for a linear relationship between academic performance (CGPA) and substance use. We use Pearson's correlation coefficient to determine the strength and direction of this relationship.
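
Correlation coefficients are computed on numbers, so a text-valued Substance_Use column needs numeric codes before it can be passed to `stats.pearsonr`. The following is a minimal sketch on made-up rows; the labels `"Never"`, `"Occasionally"`, `"Frequently"` and their ordering are assumptions, since the actual category names depend on the dataset. For an ordinal variable like this, Spearman's rank correlation is a common alternative and is shown alongside.

```python
import pandas as pd
from scipy import stats

# Made-up rows; the category labels are hypothetical
df_demo = pd.DataFrame({
    "CGPA": [3.9, 3.5, 3.1, 2.8, 3.6, 2.5],
    "Substance_Use": ["Never", "Never", "Occasionally",
                      "Frequently", "Occasionally", "Frequently"],
})

# Encode the ordered categories as integers (assumed ordering)
use_mapping = {"Never": 0, "Occasionally": 1, "Frequently": 2}
df_demo["Substance_Use_Code"] = df_demo["Substance_Use"].map(use_mapping)

# Pearson measures linear association between the numeric columns
pearson_r, pearson_p = stats.pearsonr(df_demo["CGPA"], df_demo["Substance_Use_Code"])

# Spearman uses ranks only, which suits ordinal codes
spearman_rho, spearman_p = stats.spearmanr(df_demo["CGPA"], df_demo["Substance_Use_Code"])
```

In this toy data, higher CGPA goes with lower use codes, so both coefficients come out negative.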

In [None]:
def analyze_variable_correlation(df, var_x, var_y):
    """
    Hypothesis: There is a correlation between Academic Load/CGPA and Substance Use.
    Method: Pearson Correlation.
    """
    logger.info(f"Calculating correlation between {var_x} and {var_y}")

    # Cleaning data locally for this specific test
    valid_data = df[[var_x, var_y]].dropna()
    
    correlation_coeff, p_value = stats.pearsonr(valid_data[var_x], valid_data[var_y])

    logger.info(f"Correlation results - Coefficient: {correlation_coeff:.4f}, P-value: {p_value:.4f}")
    return correlation_coeff, p_value

### Moderation Analysis (Regression)
Explanation: To increase the "Complexity Level" (as required in the grading rubric), we implement a moderation model. We test if "Social Support" moderates the relationship between "Academic Load" and "Mental Health Stress".
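
As an illustration of what the interaction term captures, the sketch below fits a moderation model on simulated data using the statsmodels formula API, where `a * b` expands to both main effects plus their interaction. The predictors are mean-centered first, a common step that makes the main effects easier to interpret. All column names (`load`, `support`, `stress`) and the simulated coefficients are assumptions for the example and not estimates from the survey.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data in which higher support weakens the effect of load on stress
rng = np.random.default_rng(42)
n = 300
demo = pd.DataFrame({
    "load": rng.normal(15, 3, n),        # e.g. credits per semester
    "support": rng.integers(1, 4, n),    # ordinal 1..3
})
demo["stress"] = (
    2.0
    + 0.40 * demo["load"]
    - 0.10 * demo["load"] * demo["support"]
    + rng.normal(0, 0.5, n)
)

# Mean-center the predictor and the moderator
demo["load_c"] = demo["load"] - demo["load"].mean()
demo["support_c"] = demo["support"] - demo["support"].mean()

# 'load_c * support_c' = load_c + support_c + load_c:support_c
model = smf.ols("stress ~ load_c * support_c", data=demo).fit()

# The interaction coefficient and its p-value indicate moderation
interaction_coef = model.params["load_c:support_c"]
interaction_p = model.pvalues["load_c:support_c"]
```

A negative interaction coefficient here means the slope of stress on load becomes flatter as support increases, which is the pattern a buffering moderator would produce.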

In [None]:
def run_moderation_model(df, outcome, predictor, moderator):
    """
    Hypothesis: Social Support acts as a moderator for the effect of Academic Load on Stress.
    Method: OLS Regression with an interaction term.
    """
    logger.info("Starting Moderation Analysis (Regression with interaction term)")

    # Creating the interaction term (Predictor * Moderator)
    df_temp = df.copy()
    df_temp['interaction'] = df_temp[predictor] * df_temp[moderator]

    # Independent variables: Predictor, Moderator, and their Interaction
    X = df_temp[[predictor, moderator, 'interaction']]
    X = sm.add_constant(X)  # Adds the intercept (constant) to the model
    y = df_temp[outcome]

    # Fit the OLS model
    model = sm.OLS(y, X, missing='drop').fit()

    logger.info("Moderation model fitting complete.")
    # We return the summary which contains all statistical details
    return model.summary()

### Main Execution Block
Explanation: This is the entry point of the script. Following the guidelines, the main() function is kept minimal, serving only to orchestrate the flow of the analysis modules.

In [None]:
def main():
    """
    Main execution flow. Loads the cleaned dataset written by the preprocessing
    step and runs each analysis in turn.
    """
    logger.info("Final Project - Part II: Statistical Analysis Phase Started")

    # Load the cleaned data saved at the end of Step 1
    df = pd.read_csv("st_1_cleaned_final.csv")

    # Column names follow the preprocessing code (Stress_Level, Anxiety_Score,
    # Semester_Credit_Load); adjust them if the dataset uses different names.
    perform_group_comparison(df, 'Is_STEM', 'Stress_Level')
    perform_group_comparison(df, 'Is_STEM', 'Anxiety_Score')

    # Substance_Use is categorical, so encode it numerically before enabling this call:
    # analyze_variable_correlation(df, 'CGPA', 'Substance_Use')

    regression_results = run_moderation_model(
        df, 'Stress_Level', 'Semester_Credit_Load', 'Social_Support'
    )
    logger.info(f"Moderation model summary:\n{regression_results}")

    logger.info("Statistical Analysis Phase Completed Successfully.")

if __name__ == "__main__":
    main()