# Student Stress Analysis

We will use the "Student Stress Analysis" Kaggle dataset by Warda Bilal to investigate the relationships between different stress factors and student stress levels, and to determine which factor contributes most to stress.

The dataset contains 520 instances with the following 6 features:
- `sleep_quality`: number from 1-5, with 1 being the worst and 5 being the best

- `headache_freq`: number of headaches suffered per week

- `academic_performance`: number from 1-5, with 1 being the worst and 5 being the best

- `study_load`: number from 1-5, with 1 being the lightest and 5 being the heaviest

- `extracurricular_activity`: number of practices per week

- `stress_level`: target value from 1-5, with 1 being low stress and 5 being high stress
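
The documented scales can be checked programmatically once the data is loaded. The following is a minimal sketch, assuming every column should hold integers between 1 and 5. That bound is stated above only for the rating columns; for the two weekly-frequency columns it is an assumption. `find_out_of_range` is a hypothetical helper, and the small frame contains made-up values for illustration only.

```python
import pandas as pd


def find_out_of_range(df: pd.DataFrame, low: int = 1, high: int = 5) -> pd.Series:
    """Return, per column, how many values fall outside [low, high]."""
    outside = (df < low) | (df > high)
    return outside.sum()


# Made-up rows for illustration only, not taken from the dataset.
sample = pd.DataFrame({
    "sleep_quality": [3, 5, 0],   # 0 is below the assumed minimum of 1
    "stress_level": [2, 6, 4],    # 6 is above the assumed maximum of 5
})

print(find_out_of_range(sample))
```

A count of zero in every column would indicate that all answers respect the assumed scale.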

## Data Cleaning / Preprocessing

We begin by loading and inspecting the dataset:
1. Import all libraries needed

2. Load the CSV file

3. See what features are included

4. Simplify feature names

5. Identify missing or duplicated values

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [None]:
df = pd.read_csv('stressfactors.csv')
df.info()
df.columns

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 520 entries, 0 to 519
Data columns (total 6 columns):
 #   Column                                                            Non-Null Count  Dtype
---  ------                                                            --------------  -----
 0   Kindly Rate your Sleep Quality 😴                                  520 non-null    int64
 1   How many times a week do you suffer headaches 🤕?                  520 non-null    int64
 2   How would you rate you academic performance 👩‍🎓?                  520 non-null    int64
 3   how would you rate your study load?                               520 non-null    int64
 4   How many times a week you practice extracurricular activities 🎾?  520 non-null    int64
 5   How would you rate your stress levels?                            520 non-null    int64
dtypes: int64(6)
memory usage: 24.5 KB


Index(['Kindly Rate your Sleep Quality 😴',
       'How many times a week do you suffer headaches 🤕?',
       'How would you rate you academic performance 👩‍🎓?',
       'how would you rate your study load?',
       'How many times a week you practice extracurricular activities 🎾?',
       'How would you rate your stress levels?'],
      dtype='object')

In [3]:
df.head()
df.describe()

| | Kindly Rate your Sleep Quality 😴 | How many times a week do you suffer headaches 🤕? | How would you rate you academic performance 👩‍🎓? | how would you rate your study load? | How many times a week you practice extracurricular activities 🎾? | How would you rate your stress levels? |
|---|---|---|---|---|---|---|
| count | 520.0 | 520.0 | 520.0 | 520.0 | 520.0 | 520.0 |
| mean | 3.125 | 2.182692 | 3.326923 | 2.75 | 2.682692 | 2.875 |
| std | 1.099023 | 1.247459 | 1.061158 | 1.372381 | 1.470745 | 1.357825 |
| min | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 25% | 2.0 | 1.0 | 3.0 | 2.0 | 1.0 | 2.0 |
| 50% | 3.0 | 2.0 | 3.0 | 2.5 | 3.0 | 3.0 |
| 75% | 4.0 | 3.0 | 4.0 | 4.0 | 4.0 | 4.0 |
| max | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 |


Since the original column names are full survey questions and unnecessarily long, we will rename them to shorter, more straightforward names.

In [9]:
df.columns = ['sleep_quality', 'headache_freq', 'academic_performance', 'study_load', 'extracurricular_activity', 'stress_level']
df.columns

Index(['sleep_quality', 'headache_freq', 'academic_performance', 'study_load',
       'extracurricular_activity', 'stress_level'],
      dtype='object')
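
Assigning a list to `df.columns` relies on the columns being in the expected order. As an alternative, the rename can be written as an explicit mapping with `DataFrame.rename`. This is a minimal sketch, not the approach used in this notebook; the frame below uses two of the original column names with made-up values for illustration only.

```python
import pandas as pd

# Two of the original survey-question columns, filled with made-up values.
raw = pd.DataFrame({
    "how would you rate your study load?": [2, 4],
    "How would you rate your stress levels?": [3, 5],
})

# Explicit old-name -> new-name mapping; it does not depend on column order.
short_names = {
    "how would you rate your study load?": "study_load",
    "How would you rate your stress levels?": "stress_level",
}

# errors="raise" makes pandas raise a KeyError if a key is not an existing column.
renamed = raw.rename(columns=short_names, errors="raise")

print(list(renamed.columns))
```

With a mapping, a misspelled or reordered source column produces an error instead of a silently mislabeled feature.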

We will now check for missing values and duplicated rows.

In [14]:
display(df.isna().sum())
df.duplicated().sum()

sleep_quality               0
headache_freq               0
academic_performance        0
study_load                  0
extracurricular_activity    0
stress_level                0
dtype: int64

np.int64(416)

Because different students can legitimately give identical answers, and the dataset has no identifier column that distinguishes one respondent from another, we will keep the 416 duplicated rows. There are also no missing values, so we do not have to drop any rows.
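
To see how the duplicate count relates to the number of distinct answer combinations, the following minimal sketch may help. It uses a small made-up frame for illustration only, not the survey data.

```python
import pandas as pd

# Made-up survey answers for illustration only.
answers = pd.DataFrame({
    "study_load": [2, 2, 4, 2],
    "stress_level": [3, 3, 5, 3],
})

# duplicated() flags each repeat of an earlier row; the first occurrence is not flagged.
n_repeats = int(answers.duplicated().sum())

# drop_duplicates() keeps one copy of each distinct answer combination.
n_distinct = len(answers.drop_duplicates())

# value_counts() on the whole frame counts how often each combination occurs.
combo_counts = answers.value_counts()

print(n_repeats, n_distinct)
print(combo_counts)
```

Because only repeats are flagged, the flagged rows plus the distinct combinations always add up to the total row count. Applied to this dataset, 416 flagged rows out of 520 correspond to 104 distinct answer combinations.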