### 1. Importing Libraries

In [1]:
import pandas as pd
import numpy as np

### 2. Importing the Dataset

In [2]:
df = pd.read_csv('data/student_lifestyle_dataset.csv', sep=',')

In [3]:
df.head()

| | Student_ID | Study_Hours_Per_Day | Extracurricular_Hours_Per_Day | Sleep_Hours_Per_Day | Social_Hours_Per_Day | Physical_Activity_Hours_Per_Day | GPA | Stress_Level |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 6.9 | 3.8 | 8.7 | 2.8 | 1.8 | 2.99 | Moderate |
| 1 | 2 | 5.3 | 3.5 | 8.0 | 4.2 | 3.0 | 2.75 | Low |
| 2 | 3 | 5.1 | 3.9 | 9.2 | 1.2 | 4.6 | 2.67 | Low |
| 3 | 4 | 6.5 | 2.1 | 7.2 | 1.7 | 6.5 | 2.88 | Moderate |
| 4 | 5 | 8.1 | 0.6 | 6.5 | 2.2 | 6.6 | 3.51 | High |


### 3. Data Cleaning
Get general information about the dataset

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 8 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   Student_ID                       2000 non-null   int64  
 1   Study_Hours_Per_Day              2000 non-null   float64
 2   Extracurricular_Hours_Per_Day    2000 non-null   float64
 3   Sleep_Hours_Per_Day              2000 non-null   float64
 4   Social_Hours_Per_Day             2000 non-null   float64
 5   Physical_Activity_Hours_Per_Day  2000 non-null   float64
 6   GPA                              2000 non-null   float64
 7   Stress_Level                     2000 non-null   object 
dtypes: float64(6), int64(1), object(1)
memory usage: 125.1+ KB


All columns have 2000 non-null values, indicating there are no missing values in the dataset.
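
The same conclusion can be checked directly instead of being read off the `df.info()` output. The sketch below uses a small hypothetical frame (`sample`) standing in for the real dataset; `isna().sum()` counts missing values per column, and `duplicated().sum()` counts fully duplicated rows, which `df.info()` does not report.

```python
import pandas as pd

# Hypothetical miniature frame standing in for the real dataset.
sample = pd.DataFrame({
    "Student_ID": [1, 2, 3],
    "GPA": [2.99, 2.75, 2.67],
    "Stress_Level": ["Moderate", "Low", "Low"],
})

# Missing values per column; all zeros means nothing is missing.
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Number of rows that exactly repeat an earlier row.
duplicate_rows = int(sample.duplicated().sum())
print(duplicate_rows)
```

On the real data, the same two calls would be applied to `df`.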

In [5]:
df.describe()

| | Student_ID | Study_Hours_Per_Day | Extracurricular_Hours_Per_Day | Sleep_Hours_Per_Day | Social_Hours_Per_Day | Physical_Activity_Hours_Per_Day | GPA |
|---|---|---|---|---|---|---|---|
| count | 2000.0 | 2000.0 | 2000.0 | 2000.0 | 2000.0 | 2000.0 | 2000.0 |
| mean | 1000.5 | 7.4758 | 1.9901 | 7.50125 | 2.70455 | 4.3283 | 3.11596 |
| std | 577.494589 | 1.423888 | 1.155855 | 1.460949 | 1.688514 | 2.51411 | 0.298674 |
| min | 1.0 | 5.0 | 0.0 | 5.0 | 0.0 | 0.0 | 2.24 |
| 25% | 500.75 | 6.3 | 1.0 | 6.2 | 1.2 | 2.4 | 2.9 |
| 50% | 1000.5 | 7.4 | 2.0 | 7.5 | 2.6 | 4.1 | 3.11 |
| 75% | 1500.25 | 8.7 | 3.0 | 8.8 | 4.1 | 6.1 | 3.33 |
| max | 2000.0 | 10.0 | 4.0 | 10.0 | 6.0 | 13.0 | 4.0 |


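The spread differs by column. For Study_Hours_Per_Day, Sleep_Hours_Per_Day, and GPA, the standard deviation is less than 20% of the mean, so the values cluster fairly tightly around it. For Extracurricular_Hours_Per_Day, Social_Hours_Per_Day, and Physical_Activity_Hours_Per_Day, it is roughly 60% of the mean, which is a noticeably wider spread. (Student_ID is an identifier, so its statistics are not meaningful.) The minimum and maximum values all fall within plausible ranges, but that alone does not rule out outliers: the maximum of Physical_Activity_Hours_Per_Day (13.0) lies about 3.4 standard deviations above its mean of 4.33 and is worth a closer look.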
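
Summary statistics alone are a weak test for outliers, so an explicit rule helps. The following is a minimal sketch of the common 1.5 × IQR rule, which is not part of the original notebook; the `hours` values are hypothetical, with the last one made deliberately extreme.

```python
import pandas as pd

# Hypothetical values; the last one is deliberately extreme.
hours = pd.Series(
    [2.4, 4.1, 6.1, 3.0, 5.2, 4.4, 13.0],
    name="Physical_Activity_Hours_Per_Day",
)

# Interquartile range and the usual 1.5 * IQR fences.
q1 = hours.quantile(0.25)
q3 = hours.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Keep only the values that fall outside the fences.
outliers = hours[(hours < lower) | (hours > upper)]
print(outliers)
```

The 1.5 multiplier is a convention, not a law; a flagged value is a candidate for inspection rather than automatic removal.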

Display the column names to see if there are any discrepancies

In [6]:
df.columns

Index(['Student_ID', 'Study_Hours_Per_Day', 'Extracurricular_Hours_Per_Day',
       'Sleep_Hours_Per_Day', 'Social_Hours_Per_Day',
       'Physical_Activity_Hours_Per_Day', 'GPA', 'Stress_Level'],
      dtype='object')

The column names aren't in lower case, so let's convert them.

In [7]:
df.columns = df.columns.str.lower()

In [8]:
df.columns

Index(['student_id', 'study_hours_per_day', 'extracurricular_hours_per_day',
       'sleep_hours_per_day', 'social_hours_per_day',
       'physical_activity_hours_per_day', 'gpa', 'stress_level'],
      dtype='object')
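
Column names in other files may also carry stray whitespace or spaces between words, which `str.lower()` alone does not fix. The sketch below chains a few `Index.str` methods on a hypothetical frame (`messy`) whose column names were invented for illustration.

```python
import pandas as pd

# Hypothetical frame with inconsistently formatted column names.
messy = pd.DataFrame(columns=[" Student_ID", "Study Hours Per Day", "GPA "])

# Trim surrounding whitespace, lower-case, and use underscores between words.
messy.columns = (
    messy.columns
    .str.strip()
    .str.lower()
    .str.replace(" ", "_", regex=False)
)
print(list(messy.columns))
```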

Check the categorical column for inconsistent values

In [9]:
print(df['stress_level'].unique())

['Moderate' 'Low' 'High']


The stress_level column contains only three values ('Moderate', 'Low', and 'High'), all with consistent spelling and capitalization.
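
If a column like this did contain variants such as trailing spaces or mixed case, one way to handle it is to normalize the strings and then compare against the expected set. This is a sketch under that assumption; the `raw` values and the `expected_levels` set are hypothetical.

```python
import pandas as pd

# The set of labels we expect to see after normalization.
expected_levels = {"Low", "Moderate", "High"}

# Hypothetical raw values with inconsistent case and whitespace.
raw = pd.Series(["Moderate", "low ", "HIGH", "Low"], name="stress_level")

# Trim whitespace and capitalize the first letter of each word.
normalized = raw.str.strip().str.title()

# Anything left over is a value that needs manual attention.
unexpected = set(normalized.unique()) - expected_levels
print(normalized.tolist(), unexpected)
```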

### 4. Creating Necessary Columns

We will need to work with the stress level as a number, so let's map its three levels to an ordinal encoding (Low = 1, Moderate = 2, High = 3).

In [10]:
df['stress_level_numeric'] = df['stress_level'].map({'Low': 1, 'Moderate': 2, 'High': 3})

In [11]:
df['stress_level_numeric'].unique()

array([2, 1, 3])
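
One caveat with `Series.map` and a dictionary is that any label missing from the dictionary silently becomes `NaN`. The sketch below counts such cases and shows an ordered `pd.Categorical` as an alternative encoding; the `"Unknown"` value is hypothetical and added only to trigger the edge case.

```python
import pandas as pd

levels = ["Low", "Moderate", "High"]

# Hypothetical column with one label that is not in the mapping.
stress = pd.Series(["Moderate", "Low", "High", "Unknown"])

# Dictionary mapping: unmapped labels turn into NaN.
mapped = stress.map({"Low": 1, "Moderate": 2, "High": 3})
unmapped_count = int(mapped.isna().sum())

# Ordered categorical: codes start at 0, and unknown labels get code -1,
# so adding 1 gives 1..3 for known levels and 0 for unknown ones.
ordered = pd.Categorical(stress, categories=levels, ordered=True)
codes = ordered.codes + 1

print(unmapped_count, codes.tolist())
```

Checking `unmapped_count` right after the mapping makes a silent mismatch visible before it affects later analysis.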

Create a gpa_group column that bins GPA into four ranges

In [12]:
df['gpa_group'] = pd.cut(df['gpa'], bins=[2, 2.5, 3, 3.5, 4], labels=['2-2.5', '2.5-3', '3-3.5', '3.5-4'])

In [13]:
df['gpa_group'].unique()

['2.5-3', '3.5-4', '3-3.5', '2-2.5']
Categories (4, object): ['2-2.5' < '2.5-3' < '3-3.5' < '3.5-4']
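
By default `pd.cut` builds right-closed intervals, so with these bins a GPA of exactly 2.5 falls in '2-2.5' and a GPA of exactly 2.0 falls in no bin at all. That does not affect this dataset, whose minimum GPA is 2.24, but it matters for other data. The sketch below uses hypothetical GPA values to show the default behavior next to `include_lowest=True`.

```python
import pandas as pd

# Hypothetical GPA values chosen to sit on the bin edges.
gpa = pd.Series([2.0, 2.5, 3.0, 3.51, 4.0])
bins = [2, 2.5, 3, 3.5, 4]
labels = ["2-2.5", "2.5-3", "3-3.5", "3.5-4"]

# Default: intervals are (2, 2.5], (2.5, 3], ..., so 2.0 is left unbinned.
default_groups = pd.cut(gpa, bins=bins, labels=labels)

# include_lowest=True also closes the first interval on the left.
inclusive_groups = pd.cut(gpa, bins=bins, labels=labels, include_lowest=True)

print(default_groups.tolist())
print(inclusive_groups.tolist())
```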

In [14]:
df.head()

| | student_id | study_hours_per_day | extracurricular_hours_per_day | sleep_hours_per_day | social_hours_per_day | physical_activity_hours_per_day | gpa | stress_level | stress_level_numeric | gpa_group |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 6.9 | 3.8 | 8.7 | 2.8 | 1.8 | 2.99 | Moderate | 2 | 2.5-3 |
| 1 | 2 | 5.3 | 3.5 | 8.0 | 4.2 | 3.0 | 2.75 | Low | 1 | 2.5-3 |
| 2 | 3 | 5.1 | 3.9 | 9.2 | 1.2 | 4.6 | 2.67 | Low | 1 | 2.5-3 |
| 3 | 4 | 6.5 | 2.1 | 7.2 | 1.7 | 6.5 | 2.88 | Moderate | 2 | 2.5-3 |
| 4 | 5 | 8.1 | 0.6 | 6.5 | 2.2 | 6.6 | 3.51 | High | 3 | 3.5-4 |


Data cleaning is done. Now let's save the cleaned dataset.

In [15]:
df.to_csv('data/cleaned_student_lifestyle_dataset.csv', index=False)
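
Note that CSV stores only text, so the ordered categorical dtype of a column produced by `pd.cut` does not survive a save and reload. The sketch below demonstrates this with a hypothetical two-row frame and an in-memory buffer instead of a file on disk, then restores the ordering after reloading.

```python
import io

import pandas as pd

labels = ["2-2.5", "2.5-3", "3-3.5", "3.5-4"]

# Hypothetical two-row frame with a binned column.
frame = pd.DataFrame({"gpa": [2.99, 3.51]})
frame["gpa_group"] = pd.cut(frame["gpa"], bins=[2, 2.5, 3, 3.5, 4], labels=labels)

# Write to an in-memory buffer and read it back, as a stand-in for a file.
buffer = io.StringIO()
frame.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# The reloaded column is plain text, not an ordered categorical.
print(frame["gpa_group"].dtype, reloaded["gpa_group"].dtype)

# Restore the ordered categories explicitly after loading.
reloaded["gpa_group"] = pd.Categorical(
    reloaded["gpa_group"], categories=labels, ordered=True
)
```

If preserving dtypes matters, a binary format such as Parquet may be an option, although that requires an additional engine to be installed.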