## Goal
The goal of collecting this dataset is to analyze the factors that contribute to students' stress levels using data mining techniques such as classification and clustering. This involves gathering information on variables such as study load, bullying, self-esteem, and mental health history. The dataset aims to show how different stressors affect students and to uncover patterns that can guide schools in developing support systems and strategies that reduce student stress and improve academic success.

## Source of the Dataset
https://www.kaggle.com/datasets/rxnach/student-stress-factors-a-comprehensive-analysis

## General information about the dataset
The dataset contains 21 attributes and 1,100 objects. Twenty of the attributes are grouped into five major factors related to student stress (Psychological, Physiological, Social, Environmental, and Academic), and the remaining attribute, stress_level, is the class label. All attributes are ordinal except mental_health_history, which is binary.

Attribute Explanations:

1- Psychological Factors:<br>
● anxiety_level (0-21): An ordinal attribute that measures the severity of a student’s anxiety. 0–4: Minimal anxiety, 5–9: Mild anxiety, 10–14: Moderate anxiety, 15–21: Severe anxiety.<br>
● self_esteem (0-30): An ordinal attribute that reflects a student’s sense of self-worth. 0–15: Low self-esteem, 16–25: Normal self-esteem, 26–30: High self-esteem.<br>
● mental_health_history (0-1): An asymmetric binary attribute that indicates whether a student has a history of mental health issues (1) or not (0).<br>
● depression (0-27): An ordinal attribute that assesses the severity of depressive symptoms. 0–4: Minimal depression, 5–9: Mild depression, 10–14: Moderate depression, 15–19: Moderately severe depression, 20–27: Severe depression.<br>

2- Physiological Factors:<br>
● headache (0-5): An ordinal attribute that measures the frequency or intensity of headaches. Higher values indicate more headaches.<br>
● blood_pressure (1-3): An ordinal attribute that categorizes blood pressure as low (1), normal (2), or high (3).<br>
● sleep_quality (0-5): An ordinal attribute that evaluates sleep quality. Higher scores mean better sleep.<br>
● breathing_problem (0-5): An ordinal attribute that measures the severity of breathing issues. Higher scores indicate more problems.<br>

3- Environmental Factors:<br>
● noise_level (0-5): An ordinal attribute that assesses environmental noise levels. Higher values mean more noise.<br>
● living_conditions (0-5): An ordinal attribute that rates the quality of living conditions. Higher scores reflect better conditions.<br>
● safety (0-5): An ordinal attribute that measures a student’s sense of safety. Higher scores indicate greater safety.<br>
● basic_needs (0-5): An ordinal attribute that evaluates if basic needs are met. Higher scores mean better fulfillment.<br>

4- Academic Factors:<br>
● academic_performance (0-5): An ordinal attribute that rates academic success. Higher scores indicate better performance.<br>
● study_load (0-5): An ordinal attribute that measures the amount of study work. Higher values indicate heavier loads.<br>
● teacher_student_relationship (0-5): An ordinal attribute that assesses the quality of the relationship with teachers. Higher scores mean better relationships.<br>
● future_career_concerns (0-5): An ordinal attribute that evaluates concerns about future careers. Higher values mean more concerns.<br>

5- Social Factors:<br>
● social_support (0-3): An ordinal attribute that measures available social support. Higher scores indicate more support.<br>
● peer_pressure (0-5): An ordinal attribute that assesses the level of peer pressure. Higher scores mean more pressure.<br>
● extracurricular_activities (0-5): An ordinal attribute that rates involvement in activities outside of academics. Higher scores mean more involvement.<br>
● bullying (0-5): An ordinal attribute that measures the extent of bullying experienced. Higher scores indicate more bullying.

Class Label:<br>
● stress_level (0-2): An ordinal attribute that categorizes stress into three levels: 0 for low stress, 1 for moderate stress, and 2 for high stress.
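
The documented ranges above can be turned into a quick sanity check. The following is a minimal sketch, not part of the original notebook: the names `EXPECTED_RANGES` and `out_of_range_counts` are hypothetical, only a subset of the attributes is listed, and the small frame is made-up data rather than rows from the dataset.

```python
import pandas as pd

# Documented (min, max) ranges copied from the attribute list above.
# Only a few attributes are shown; the rest follow the same pattern.
EXPECTED_RANGES = {
    "anxiety_level": (0, 21),
    "self_esteem": (0, 30),
    "mental_health_history": (0, 1),
    "depression": (0, 27),
    "blood_pressure": (1, 3),
    "social_support": (0, 3),
    "stress_level": (0, 2),
}


def out_of_range_counts(frame, expected_ranges):
    """Return {column: number of values outside the documented range}."""
    counts = {}
    for column, (low, high) in expected_ranges.items():
        if column not in frame.columns:
            continue  # skip attributes that are absent from this frame
        inside = frame[column].between(low, high)  # inclusive on both ends
        counts[column] = int((~inside).sum())
    return counts


# Tiny made-up frame for illustration only (not rows from the dataset).
sample = pd.DataFrame({
    "anxiety_level": [0, 14, 22],   # 22 exceeds the documented maximum of 21
    "blood_pressure": [1, 2, 3],
    "stress_level": [0, 1, 2],
})
report = out_of_range_counts(sample, EXPECTED_RANGES)
print(report)
```

Calling `out_of_range_counts(df, EXPECTED_RANGES)` on the loaded dataset would apply the same check to the real columns.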

In [40]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
df = pd.read_csv('StressLevelDataset.csv')

In [41]:
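# Collect basic facts about the dataset: column dtypes, attribute count, object count
t0 = "\033[1m" + "Data types: " + "\033[0m"  # ANSI escape codes print the label in bold
num_attributes = len(df.columns)
attribute_types = df.dtypes.to_frame().rename(columns={0: t0})
num_objects = len(df)
class_name = df.columns[-1]  # the class label (stress_level) is the last column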

t = "\033[1m" + "Attribute types:" + "\033[0m"
print(t)
print(attribute_types)
print("\n")

t1= "\033[1m" + "Number of attributes:" + "\033[0m"
print(t1, num_attributes)

t2 = "\033[1m" + "Number of objects:" + "\033[0m"
print(t2, num_objects)

Attribute types:
                             Data types: 
anxiety_level                               int64
self_esteem                                 int64
mental_health_history                       int64
depression                                  int64
headache                                    int64
blood_pressure                              int64
sleep_quality                               int64
breathing_problem                           int64
noise_level                                 int64
living_conditions                           int64
safety                                      int64
basic_needs                                 int64
academic_performance                        int64
study_load                                  int64
teacher_student_relationship                int64
future_career_concerns                      int64
social_support                              int64
peer_pressure                               int64
extracurricular_activitie

In [42]:
print(df.head(5))   # first 5 rows
print(df.tail(3))   # last 3 rows
print(df.dtypes)    # data type of each column
print(df.index)     # row index
print(df.columns)   # column names
print(df.values)    # underlying data as a NumPy array

   anxiety_level  self_esteem  mental_health_history  depression  headache  \
0             14           20                      0          11         2   
1             15            8                      1          15         5   
2             12           18                      1          14         2   
3             16           12                      1          15         4   
4             16           28                      0           7         2   

   blood_pressure  sleep_quality  breathing_problem  noise_level  \
0               1              2                  4            2   
1               3              1                  4            3   
2               1              2                  2            2   
3               3              1                  3            4   
4               3              5                  1            3   

   living_conditions  ...  basic_needs  academic_performance  study_load  \
0                  3  ...            2        

In [43]:
df.describe()

,anxiety_level,self_esteem,mental_health_history,depression,headache,blood_pressure,sleep_quality,breathing_problem,noise_level,living_conditions,...,basic_needs,academic_performance,study_load,teacher_student_relationship,future_career_concerns,social_support,peer_pressure,extracurricular_activities,bullying,stress_level
count,1100.0,1100.0,1100.0,1100.0,1100.0,1100.0,1100.0,1100.0,1100.0,1100.0,...,1100.0,1100.0,1100.0,1100.0,1100.0,1100.0,1100.0,1100.0,1100.0,1100.0
mean,11.063636,17.777273,0.492727,12.555455,2.508182,2.181818,2.66,2.753636,2.649091,2.518182,...,2.772727,2.772727,2.621818,2.648182,2.649091,1.881818,2.734545,2.767273,2.617273,0.996364
std,6.117558,8.944599,0.500175,7.727008,1.409356,0.833575,1.548383,1.400713,1.328127,1.119208,...,1.433761,1.414594,1.315781,1.384579,1.529375,1.047826,1.425265,1.417562,1.530958,0.821673
min,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,6.0,11.0,0.0,6.0,1.0,1.0,1.0,2.0,2.0,2.0,...,2.0,2.0,2.0,2.0,1.0,1.0,2.0,2.0,1.0,0.0
50%,11.0,19.0,0.0,12.0,3.0,2.0,2.5,3.0,3.0,2.0,...,3.0,2.0,2.0,2.0,2.0,2.0,2.0,2.5,3.0,1.0
75%,16.0,26.0,1.0,19.0,3.0,3.0,4.0,4.0,3.0,3.0,...,4.0,4.0,3.0,4.0,4.0,3.0,4.0,4.0,4.0,2.0
max,21.0,30.0,1.0,27.0,5.0,3.0,5.0,5.0,5.0,5.0,...,5.0,5.0,5.0,5.0,5.0,3.0,5.0,5.0,5.0,2.0


In [44]:
# df is already a DataFrame, so no conversion is needed
missing_values = df.isna().sum()  # number of missing values per column
print("Missing values in each column")
print(missing_values)
print("\nTotal number of missing values: ", missing_values.sum())

Missing values in each column
anxiety_level                   0
self_esteem                     0
mental_health_history           0
depression                      0
headache                        0
blood_pressure                  0
sleep_quality                   0
breathing_problem               0
noise_level                     0
living_conditions               0
safety                          0
basic_needs                     0
academic_performance            0
study_load                      0
teacher_student_relationship    0
future_career_concerns          0
social_support                  0
peer_pressure                   0
extracurricular_activities      0
bullying                        0
stress_level                    0
dtype: int64

Total number of missing values:  0


In [45]:
def describe_with_central_tendencies(df):
    """Extend df.describe() with extra measures of center and spread."""
    description = df.describe().T  # one row per attribute

    description['median'] = df.median()
    description['mode'] = df.mode().iloc[0]  # first mode if there are ties
    description['midrange'] = (df.min() + df.max()) / 2
    description['range'] = df.max() - df.min()
    description['variance'] = df.var()  # sample variance (ddof=1)
    description['IQR'] = df.quantile(0.75) - df.quantile(0.25)

    return description

numeric_columns = df.select_dtypes(include=[np.number])
central_tendencies = describe_with_central_tendencies(numeric_columns)

print("\nCentral Tendencies for Numeric Columns:")
print(central_tendencies)


Central Tendencies for Numeric Columns:
                               count       mean       std  min   25%   50%  \
anxiety_level                 1100.0  11.063636  6.117558  0.0   6.0  11.0   
self_esteem                   1100.0  17.777273  8.944599  0.0  11.0  19.0   
mental_health_history         1100.0   0.492727  0.500175  0.0   0.0   0.0   
depression                    1100.0  12.555455  7.727008  0.0   6.0  12.0   
headache                      1100.0   2.508182  1.409356  0.0   1.0   3.0   
blood_pressure                1100.0   2.181818  0.833575  1.0   1.0   2.0   
sleep_quality                 1100.0   2.660000  1.548383  0.0   1.0   2.5   
breathing_problem             1100.0   2.753636  1.400713  0.0   2.0   3.0   
noise_level                   1100.0   2.649091  1.328127  0.0   2.0   3.0   
living_conditions             1100.0   2.518182  1.119208  0.0   2.0   2.0   
safety                        1100.0   2.737273  1.406171  0.0   2.0   2.0   
basic_needs            
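
To make the added measures concrete, here is a small worked example on a single made-up series of 0-5 scores (illustrative values, not taken from the dataset). It is only a sketch of the same formulas used in `describe_with_central_tendencies`, using pandas' default linear interpolation for quantiles.

```python
import pandas as pd

# Made-up scores on a 0-5 scale, for illustration only.
scores = pd.Series([0, 1, 2, 2, 3, 5])

median = scores.median()                       # middle of the sorted values
mode = scores.mode().iloc[0]                   # smallest of the most frequent values
midrange = (scores.min() + scores.max()) / 2   # halfway between the extremes
value_range = scores.max() - scores.min()      # spread between the extremes
iqr = scores.quantile(0.75) - scores.quantile(0.25)  # spread of the middle half

print(median, mode, midrange, value_range, iqr)
```

Here the median and mode are both 2, the midrange is 2.5, the range is 5, and the IQR is 2.75 - 1.25 = 1.5.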

In [46]:
def detect_outliers_iqr(df):
    """Return {column: Series of values outside the 1.5 * IQR fences}."""
    outliers = {}
    Q1 = df.quantile(0.25)
    Q3 = df.quantile(0.75)
    IQR = Q3 - Q1

    # Values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR are flagged
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    for col in df.columns:
        outliers[col] = df[(df[col] < lower_bound[col]) | (df[col] > upper_bound[col])][col]

    return outliers

numeric_columns = df.select_dtypes(include=[np.number])
outliers = detect_outliers_iqr(numeric_columns)

print("\nDetected Outliers for Numeric Columns:")
for column, values in outliers.items():
    print(f"\nOutliers in {column}:")
    print(values)


Detected Outliers for Numeric Columns:

Outliers in anxiety_level:
Series([], Name: anxiety_level, dtype: int64)

Outliers in self_esteem:
Series([], Name: self_esteem, dtype: int64)

Outliers in mental_health_history:
Series([], Name: mental_health_history, dtype: int64)

Outliers in depression:
Series([], Name: depression, dtype: int64)

Outliers in headache:
Series([], Name: headache, dtype: int64)

Outliers in blood_pressure:
Series([], Name: blood_pressure, dtype: int64)

Outliers in sleep_quality:
Series([], Name: sleep_quality, dtype: int64)

Outliers in breathing_problem:
Series([], Name: breathing_problem, dtype: int64)

Outliers in noise_level:
9       0
11      5
21      5
27      5
29      5
       ..
1085    5
1090    0
1091    5
1094    5
1096    0
Name: noise_level, Length: 173, dtype: int64

Outliers in living_conditions:
9       5
114     5
125     5
184     5
204     0
       ..
1037    0
1040    0
1053    5
1065    5
1099    0
Name: living_conditions, Length: 62, dt
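
The flagged noise_level values can be traced back to the quartiles reported by `df.describe()` (25% = 2.0, 75% = 3.0). The sketch below recomputes the fences by hand; `iqr_fences` is a hypothetical helper, not part of the original notebook.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return the (lower, upper) fences used by the IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Quartiles of noise_level as reported by df.describe() above.
lower, upper = iqr_fences(q1=2.0, q3=3.0)
print(lower, upper)  # 0.5 4.5

# Every value on the documented 0-5 scale that falls outside the fences.
flagged = [v for v in range(0, 6) if v < lower or v > upper]
print(flagged)  # [0, 5]
```

With an IQR of 1, the fences are 0.5 and 4.5, so only the scale endpoints 0 and 5 are flagged. Both are valid scores according to the attribute description, so these rows may reflect the narrow spread of an ordinal scale rather than data errors; whether to treat them as outliers is a judgment call.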

In [47]:
column_to_discretize = 'anxiety_level'
# Edges for the bands 0-4, 5-9, 10-14, 15-21 (pd.cut intervals include the right edge)
bin_edges = [0, 4, 9, 14, 21]
bin_labels = ['0-4', '5-9', '10-14', '15-21']

df['discretized_' + column_to_discretize] = pd.cut(df[column_to_discretize], bins=bin_edges, labels=bin_labels, include_lowest=True)

print("Original DataFrame:")
print(df[['anxiety_level', 'discretized_anxiety_level']])

Original DataFrame:
      anxiety_level discretized_anxiety_level
0                14                     10-14
1                15                     15-21
2                12                     10-14
3                16                     15-21
4                16                     15-21
...             ...                       ...
1095             11                     10-14
1096              9                       5-9
1097              4                       0-4
1098             21                     15-21
1099             18                     15-21

[1100 rows x 2 columns]


In [48]:
column_to_discretize = 'self_esteem'
# Edges for the bands 0-15, 16-25, 26-30 (pd.cut intervals include the right edge)
bin_edges = [0, 15, 25, 30]
bin_labels = ['0-15', '16-25', '26-30']

df['discretized_' + column_to_discretize] = pd.cut(df[column_to_discretize], bins=bin_edges, labels=bin_labels, include_lowest=True)

print("Original DataFrame:")
print(df[['self_esteem', 'discretized_self_esteem']])

Original DataFrame:
      self_esteem discretized_self_esteem
0              20                   16-25
1               8                    0-15
2              18                   16-25
3              12                    0-15
4              28                   26-30
...           ...                     ...
1095           17                   16-25
1096           12                    0-15
1097           26                   26-30
1098            0                    0-15
1099            6                    0-15

[1100 rows x 2 columns]


In [49]:
column_to_discretize = 'depression'
# Edges for the bands 0-4, 5-9, 10-14, 15-19, 20-27 (pd.cut intervals include the right edge)
bin_edges = [0, 4, 9, 14, 19, 27]
bin_labels = ['0-4', '5-9', '10-14', '15-19', '20-27']

df['discretized_' + column_to_discretize] = pd.cut(df[column_to_discretize], bins=bin_edges, labels=bin_labels, include_lowest=True)

print("Original DataFrame:")
print(df[['depression', 'discretized_depression']])

Original DataFrame:
      depression discretized_depression
0             11                  10-14
1             15                  15-19
2             14                  10-14
3             15                  15-19
4              7                    5-9
...          ...                    ...
1095          14                  10-14
1096           8                    5-9
1097           3                    0-4
1098          19                  15-19
1099          15                  15-19

[1100 rows x 2 columns]
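
The three discretization cells repeat the same pattern, so they could be folded into one helper. The sketch below is one possible way to do that, under some assumptions: `discretize` and `BINS` are hypothetical names, and the labels use the severity names from the attribute descriptions instead of the numeric range strings used above. The sample frame holds the first five rows shown in the `df.head(5)` and discretization outputs.

```python
import pandas as pd


def discretize(frame, column, bin_edges, bin_labels):
    """Add a 'discretized_<column>' column using pd.cut and return the frame."""
    frame["discretized_" + column] = pd.cut(
        frame[column],
        bins=bin_edges,
        labels=bin_labels,
        include_lowest=True,  # so the lowest edge (0) lands in the first bin
    )
    return frame


# Bin edges and severity names taken from the attribute descriptions.
BINS = {
    "anxiety_level": ([0, 4, 9, 14, 21], ["Minimal", "Mild", "Moderate", "Severe"]),
    "self_esteem": ([0, 15, 25, 30], ["Low", "Normal", "High"]),
    "depression": ([0, 4, 9, 14, 19, 27],
                   ["Minimal", "Mild", "Moderate", "Moderately severe", "Severe"]),
}

# First five rows of the three columns, as printed in the outputs above.
sample = pd.DataFrame({
    "anxiety_level": [14, 15, 12, 16, 16],
    "self_esteem": [20, 8, 18, 12, 28],
    "depression": [11, 15, 14, 15, 7],
})
for name, (edges, labels) in BINS.items():
    sample = discretize(sample, name, edges, labels)
print(sample)
```

The resulting categories line up with the range labels in the outputs above; for example, an anxiety_level of 14 falls in the 10-14 band, which the attribute description calls moderate.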
