# ENDTERM Project Data Analysis <hr style="border:2.5px solid #126782"></hr>

Name: **Ely John C. Punzalan** <br>
Course: **CPE2A**

# Data Cleaning

## Setting Up the Data File

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
df = pd.read_csv(r"../dataset/student-record-unclean.csv",
                 delimiter=",")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 17 columns):
 #   Column                                   Non-Null Count  Dtype  
---  ------                                   --------------  -----  
 0   Student_ID                               9833 non-null   object 
 1   Age                                      9807 non-null   float64
 2   Gender                                   9830 non-null   object 
 3   Study_Hours_per_Week                     9830 non-null   float64
 4   Preferred_Learning_Style                 9824 non-null   object 
 5   Online_Courses_Completed                 9833 non-null   float64
 6   Participation_in_Discussions             9834 non-null   object 
 7   Assignment_Completion_Rate (%)           9831 non-null   float64
 8   Exam_Score (%)                           9821 non-null   float64
 9   Attendance_Rate (%)                      9815 non-null   float64
 10  Use_of_Educational_Tech                  9821 n

In [3]:
# Drop the first column (Student_ID) by position
df = df.drop(df.columns[[0]], axis=1)
df.head()

| | Age | Gender | Study_Hours_per_Week | Preferred_Learning_Style | Online_Courses_Completed | Participation_in_Discussions | Assignment_Completion_Rate (%) | Exam_Score (%) | Attendance_Rate (%) | Use_of_Educational_Tech | Self_Reported_Stress_Level | Time_Spent_on_Social_Media (hours/week) | Sleep_Hours_per_Night | Equiv_Grade | College | Degree_Program |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | Female | 48.0 | Kinesthetic | 14.0 | Yes | 100.0 | 69.0 | 66.0 | Yes | High | 9.0 | 8.0 | 3.885 | Engineering | Mechanical |
| 1 | 29.0 | Female | 30.0 | Reading/Writing | 20.0 | No | 71.0 | 40.0 | 57.0 | Yes | Medium | 28.0 | 8.0 | 2.635 | Health Sciences | Nursing |
| 2 | 20.0 | Female | 47.0 | Kinesthetic | 11.0 | No | 60.0 | 43.0 | 79.0 | Yes | Low | 13.0 | 7.0 | 2.765 | Health Sciences | Nursing |
| 3 | 23.0 | Female | 13.0 | Auditory | 0.0 | Yes | 63.0 | 70.0 | 60.0 | Yes | Low | 24.0 | 10.0 | 3.295 | Business and Finance | Accountancy |
| 4 | 19.0 | Female | 24.0 | Auditory | 19.0 | Yes | 59.0 | 63.0 | 93.0 | Yes | Medium | 26.0 | 8.0 | 3.39 | Health Sciences | Pharmacy |
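
The cell above removes `Student_ID` by position, which works only while that column stays first in the file. As a possible alternative, the sketch below drops the column by label instead. The `toy` frame and its `S001`/`S002` IDs are hypothetical stand-ins, not values from the real dataset.

```python
import pandas as pd

# Hypothetical miniature of the raw data (made-up Student_ID values)
toy = pd.DataFrame({
    "Student_ID": ["S001", "S002"],
    "Age": [18.0, 29.0],
    "Gender": ["Female", "Female"],
})

# Positional drop, as in the notebook: removes whichever column comes first
by_position = toy.drop(toy.columns[[0]], axis=1)

# Label-based drop: removes Student_ID regardless of where it sits
by_label = toy.drop(columns=["Student_ID"])

print(list(by_label.columns))
print(by_position.equals(by_label))
```

Both calls give the same result here; the label-based form would simply fail with a `KeyError` if the column were missing, rather than silently dropping a different column.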


## Renaming Columns

In [4]:
df_cleaning = df.rename(
    columns={
        'Study_Hours_per_Week' : 'StudyHrs' ,
        'Preferred_Learning_Style' : 'Learning_Style' ,
        'Online_Courses_Completed' : 'Courses_Comp' ,
        'Participation_in_Discussions' : 'Discuss_Part' ,
        'Assignment_Completion_Rate (%)': 'AssignRate' ,
        'Exam_Score (%)' : 'Exam_Score(%)' ,
        'Attendance_Rate (%)' : 'AttendRate' ,
        'Use_of_Educational_Tech' : 'TechUse' ,
        'Self_Reported_Stress_Level' : 'StressLevel',
        'Time_Spent_on_Social_Media (hours/week)' : 'SocMediaHrs',
        'Sleep_Hours_per_Night' : 'SleepHrs'
    }
)
df_cleaning.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 16 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Age             9807 non-null   float64
 1   Gender          9830 non-null   object 
 2   StudyHrs        9830 non-null   float64
 3   Learning_Style  9824 non-null   object 
 4   Courses_Comp    9833 non-null   float64
 5   Discuss_Part    9834 non-null   object 
 6   AssignRate      9831 non-null   float64
 7   Exam_Score(%)   9821 non-null   float64
 8   AttendRate      9815 non-null   float64
 9   TechUse         9821 non-null   object 
 10  StressLevel     9816 non-null   object 
 11  SocMediaHrs     9817 non-null   float64
 12  SleepHrs        9814 non-null   float64
 13  Equiv_Grade     10000 non-null  float64
 14  College         10000 non-null  object 
 15  Degree_Program  10000 non-null  object 
dtypes: float64(9), object(7)
memory usage: 1.2+ MB
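
One caveat with `rename`: by default, a key that matches no existing column is silently ignored, so a typo in the mapping leaves the old name in place. The sketch below, on a hypothetical two-column `toy` frame, shows how passing `errors="raise"` could surface such a typo.

```python
import pandas as pd

# Hypothetical miniature frame using two of the original column names
toy = pd.DataFrame({"Study_Hours_per_Week": [48.0, 30.0], "Age": [18.0, 29.0]})

# Deliberate typo: "Per" instead of "per"
bad_mapping = {"Study_Hours_Per_Week": "StudyHrs"}

# Default behaviour: the unmatched key is ignored and nothing is renamed
renamed = toy.rename(columns=bad_mapping)
print(list(renamed.columns))

# errors="raise" turns the same typo into a KeyError
try:
    toy.rename(columns=bad_mapping, errors="raise")
except KeyError as exc:
    typo_caught = True
    print("rename failed:", exc)
else:
    typo_caught = False
```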


## Determining NaN values

In [5]:
df_cleaning.isna().sum()

Age               193
Gender            170
StudyHrs          170
Learning_Style    176
Courses_Comp      167
Discuss_Part      166
AssignRate        169
Exam_Score(%)     179
AttendRate        185
TechUse           179
StressLevel       184
SocMediaHrs       183
SleepHrs          186
Equiv_Grade         0
College             0
Degree_Program      0
dtype: int64

## Dropping NaN values <br>
<b>Reason:</b> Each affected column is missing at most 193 of its 10,000 values (under 2%), and dropping every row that contains a NaN value still leaves 9,629 of the 10,000 rows (about 96%). Hence, dropping the NaN values will not affect the dataset significantly.
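
The claim above can be checked numerically before dropping anything. In the real dataset the drop takes the frame from 10,000 to 9,629 rows, which is 371 rows, or about 3.7%. The sketch below shows one way to compute the same two figures (missing share per column, and share of rows that `dropna()` would remove) on a hypothetical `toy` frame with made-up gaps.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_cleaning (gaps placed by hand, not the real dataset)
toy = pd.DataFrame({
    "Age": [18.0, np.nan, 20.0, 23.0, 19.0],
    "Gender": ["Female", "Male", None, "Female", "Male"],
    "Equiv_Grade": [3.885, 2.635, 2.765, 3.295, 3.39],
})

# Percentage of missing values in each column
missing_pct = toy.isna().mean() * 100

# Rows with at least one missing value, i.e. the rows dropna() would remove
rows_with_nan = int(toy.isna().any(axis=1).sum())
rows_lost_pct = rows_with_nan / len(toy) * 100

print(missing_pct.round(1))
print(f"Rows removed by dropna(): {rows_with_nan} ({rows_lost_pct:.1f}%)")
```

The row-level figure is the one that matters for the decision, because missing values in different columns usually fall on different rows, so the rows lost can exceed any single column's missing count.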

In [6]:
# Cleaning NaN values via dropping
df_cleaning = df_cleaning.dropna(subset=['Age', 'Gender', 'StudyHrs', 'Learning_Style', 
                                       'Courses_Comp', 'Discuss_Part', 'AssignRate', 
                                       'Exam_Score(%)', 'AttendRate', 'TechUse', 'StressLevel', 
                                       'SocMediaHrs', 'SleepHrs', 'Equiv_Grade'])
df_cleaning.isna().sum()

Age               0
Gender            0
StudyHrs          0
Learning_Style    0
Courses_Comp      0
Discuss_Part      0
AssignRate        0
Exam_Score(%)     0
AttendRate        0
TechUse           0
StressLevel       0
SocMediaHrs       0
SleepHrs          0
Equiv_Grade       0
College           0
Degree_Program    0
dtype: int64

In [7]:
df_cleaning.info()

<class 'pandas.core.frame.DataFrame'>
Index: 9629 entries, 0 to 9999
Data columns (total 16 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Age             9629 non-null   float64
 1   Gender          9629 non-null   object 
 2   StudyHrs        9629 non-null   float64
 3   Learning_Style  9629 non-null   object 
 4   Courses_Comp    9629 non-null   float64
 5   Discuss_Part    9629 non-null   object 
 6   AssignRate      9629 non-null   float64
 7   Exam_Score(%)   9629 non-null   float64
 8   AttendRate      9629 non-null   float64
 9   TechUse         9629 non-null   object 
 10  StressLevel     9629 non-null   object 
 11  SocMediaHrs     9629 non-null   float64
 12  SleepHrs        9629 non-null   float64
 13  Equiv_Grade     9629 non-null   float64
 14  College         9629 non-null   object 
 15  Degree_Program  9629 non-null   object 
dtypes: float64(9), object(7)
memory usage: 1.2+ MB
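
Note that the index above still runs from 0 to 9999 even though only 9,629 rows remain: `dropna()` keeps the original row labels, leaving gaps. If a contiguous index is wanted, one option (not applied in this notebook) is `reset_index(drop=True)`, sketched here on a hypothetical `toy` frame.

```python
import numpy as np
import pandas as pd

# Hypothetical five-row frame with two missing ages
toy = pd.DataFrame({"Age": [18.0, np.nan, 20.0, np.nan, 19.0]})

# dropna() keeps the surviving rows' original labels, so the index has gaps
dropped = toy.dropna()
print(list(dropped.index))

# drop=True discards the old labels instead of storing them as a new column
renumbered = dropped.reset_index(drop=True)
print(list(renumbered.index))
```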


## Saving the Cleaned Dataset

In [8]:
df_cleaning.to_csv(r"../dataset/student-record-cleaned.csv")
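
By default, `to_csv` also writes the row index as an unnamed first column, which shows up as `Unnamed: 0` when the file is read back. Whether that is wanted depends on how the cleaned file is used later; the sketch below only illustrates the difference that `index=False` makes, using an in-memory buffer and a hypothetical `toy` frame rather than the real file.

```python
import io
import pandas as pd

# Hypothetical frame with a gapped index, as left behind by dropna()
toy = pd.DataFrame(
    {"Age": [18.0, 20.0], "Gender": ["Female", "Male"]},
    index=[0, 2],
)

# Default: the index is written as an unnamed first column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)

# index=False: only the data columns are written
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)
print(cols_without_index)
```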