# **Final Project 2025 - Student Performance Analysis**
**Student:** Daud Rusyad Nurdin, DA/DS bootcamp Dibimbing.id, batch 34



## **Title: Student Performance Analysis**

**Description:**
- This project predicts students’ final scores and letter grades from the features available in the dataset, and identifies which features most strongly affect academic success or failure. The analysis uses the dataset "Students_Grading_Dataset.csv", sourced from Kaggle: https://www.kaggle.com/datasets/mahmoudelhemaly/students-grading-dataset/data.

**Objective:**
- Predict each student’s final score (numeric) and grade (categorical) from the available features, and determine which factors most strongly influence academic success or failure.

## **Library Preparation**

In [13]:
# Data libraries
import numpy as np  # numerical computing
import pandas as pd  # tabular data handling

# Plotting libraries
import seaborn as sns
import matplotlib.pyplot as plt

# Suppress unnecessary warnings
import warnings
warnings.filterwarnings('ignore')

## **Data Understanding**
- Size ⟶ shape: (rows, columns)
- Data Structure ⟶ info & description @Kaggle
- Data Quality:
  - Duplicate
  - Missing Values
  - Outliers/Anomalies
  - Descriptive Statistics
  - Feature Correlation
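
The checklist above can be sketched on a small example. This is a minimal illustration, not the notebook's actual implementation: the helper `quality_report`, the toy values, and the 1.5 × IQR outlier rule are all assumptions chosen for demonstration.

```python
import numpy as np
import pandas as pd

def quality_report(frame: pd.DataFrame) -> dict:
    """Summarize size, duplicate rows, missing values, and IQR outliers.

    Hypothetical helper; the 1.5 * IQR threshold is an assumed convention.
    """
    numeric = frame.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Flag values more than 1.5 * IQR outside the quartiles (NaN is never flagged)
    outlier_mask = (numeric < q1 - 1.5 * iqr) | (numeric > q3 + 1.5 * iqr)
    return {
        "shape": frame.shape,
        "duplicate_rows": int(frame.duplicated().sum()),
        "missing_per_column": frame.isna().sum().to_dict(),
        "outliers_per_column": outlier_mask.sum().to_dict(),
    }

# Toy data, made up for illustration (not rows from the real dataset)
toy = pd.DataFrame({
    "Midterm_Score": [55.0, 60.0, 65.0, 70.0, 70.0, 300.0],
    "Final_Score": [50.0, 58.0, 66.0, 72.0, 72.0, 80.0],
    "Attendance (%)": [90.0, np.nan, 85.0, 95.0, 95.0, 88.0],
})

report = quality_report(toy)
print(report)
print(toy.describe())  # descriptive statistics
print(toy.corr())      # pairwise Pearson correlation between features
```

On the real data, the same calls would be applied to `df` in place of `toy`.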

In [28]:
# Load the dataset from the project's GitHub repository
file_name = "Students_Grading_Dataset.csv"

def load_data():
    """Download the CSV from GitHub and return it as a DataFrame."""
    url = "https://raw.githubusercontent.com/daudrusyadnurdin/final-project-2025/main/data/" + file_name
    return pd.read_csv(url)

df = load_data()

### **Dataset Size**

In [29]:
# Count the number of rows and columns
n_rows, n_cols = df.shape
print(f"This dataset has {n_rows} rows and {n_cols} columns.")

This dataset has 5000 rows and 23 columns.


### **Data Structure**
Source: https://www.kaggle.com/code/mahmoudelhemaly/student-performance-analysis/input?select=metadata.xlsx

No | Column Name | Data Type | Description
---|-------------|-----------|------------
1 | Student_ID | String | Unique identifier for each student
2 | First_Name | String | Student’s first name
3 | Last_Name | String | Student’s last name
4 | Email | String | Contact email (can be anonymized)
5 | Gender | Categorical | Male, Female, Other
6 | Age | Integer | Age of the student
7 | Department | String | Student's department (e.g., CS, Engineering, Business)
8 | Attendance (%) | Float | Attendance percentage (0-100%)
9 | Midterm_Score | Float | Midterm exam score (out of 100)
10 | Final_Score | Float | Final exam score (out of 100)
11 | Assignments_Avg | Float | Average score of all assignments (out of 100)
12 | Quizzes_Avg | Float | Average quiz scores (out of 100)
13 | Participation_Score | Float | Score based on class participation (0-10)
14 | Projects_Score | Float | Project evaluation score (out of 100)
15 | Total_Score | Float | Weighted sum of all grades
16 | Grade | Categorical | Letter grade (A, B, C, D, F)
17 | Study_Hours_per_Week | Float | Average study hours per week
18 | Extracurricular_Activities | Boolean | Whether the student participates in extracurriculars (Yes/No)
19 | Internet_Access_at_Home | Boolean | Does the student have access to the internet at home? (Yes/No)
20 | Parent_Education_Level | Categorical | Highest education level of parents (None, High School, Bachelor's, Master's, PhD)
21 | Family_Income_Level | Categorical | Low, Medium, High
22 | Stress_Level (1-10) | Integer | Self-reported stress level (1: Low, 10: High)
23 | Sleep_Hours_per_Night | Float | Average hours of sleep per night
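
As a quick sanity check, the column names in the table above can be compared against a loaded DataFrame. The sketch below is illustrative only: `check_schema` is a hypothetical helper, and the toy frame (with one column deliberately swapped) stands in for the real `df`.

```python
import pandas as pd

# Column names copied from the metadata table above
EXPECTED_COLUMNS = [
    "Student_ID", "First_Name", "Last_Name", "Email", "Gender", "Age",
    "Department", "Attendance (%)", "Midterm_Score", "Final_Score",
    "Assignments_Avg", "Quizzes_Avg", "Participation_Score",
    "Projects_Score", "Total_Score", "Grade", "Study_Hours_per_Week",
    "Extracurricular_Activities", "Internet_Access_at_Home",
    "Parent_Education_Level", "Family_Income_Level",
    "Stress_Level (1-10)", "Sleep_Hours_per_Night",
]

def check_schema(frame, expected):
    """Return (missing, unexpected) column names relative to `expected`."""
    missing = [c for c in expected if c not in frame.columns]
    unexpected = [c for c in frame.columns if c not in expected]
    return missing, unexpected

# Toy frame for illustration: the last expected column is replaced by "Notes"
toy = pd.DataFrame(columns=EXPECTED_COLUMNS[:-1] + ["Notes"])
missing, unexpected = check_schema(toy, EXPECTED_COLUMNS)
print("Missing:", missing)
print("Unexpected:", unexpected)
```

Calling `check_schema(df, EXPECTED_COLUMNS)` on the real data should return two empty lists if the file matches the metadata.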

**Notes:**
- On initial inspection, four columns are irrelevant to the subsequent analysis because they carry no analytical signal:
  - **Student_ID**
  - **First_Name**
  - **Last_Name**
  - **Email**

- These features are merely identifiers or personal details and do not help explain or predict a student's academic performance or final grade.

- The six features below are likely less directly relevant to performance, but are worth keeping in mind for other analyses, such as sociological or psychological ones:
  - **Gender** : Can be examined to identify potential performance differences across genders, but is not expected to have a direct causal effect on academic achievement.
  - **Department** : May hold relevance if the dataset covers multiple disciplines with varying levels of difficulty; otherwise, its analytical contribution is limited.
  - **Extracurricular_Activities** : Participation may affect time management and overall balance, yet its impact on academic outcomes tends to be indirect.
  - **Internet_Access_at_Home** : Serves as a supportive factor rather than a direct determinant of performance.
  - **Parent_Education_Level** : Reflects background context; it may influence students’ attitudes toward learning but not directly their scores.
  - **Family_Income_Level** : Similar to parental education, it can affect access to resources and learning environments, though not as an immediate predictor of success.

- The eleven features below are highly relevant for performance analysis and will be retained:
  - **Age** : May influence students’ maturity levels and sense of academic responsibility.
  - **Attendance (%)** : Commonly exhibits a strong correlation with academic achievement.
  - **Midterm_Score**, **Assignments_Avg**, **Quizzes_Avg**, **Participation_Score**, **Projects_Score**, **Total_Score** : All represent the core indicators of students’ academic performance and are therefore essential for predictive modeling.
  - **Study_Hours_per_Week** : A key behavioral variable that directly contributes to academic success.
  - **Stress_Level (1-10)** : Can significantly affect cognitive performance and learning efficiency.
  - **Sleep_Hours_per_Night** : Impacts concentration, memory retention, and overall learning outcomes.

- In this project, **Final_Score** and **Grade** are defined as the two target variables (labels). These columns represent the outcomes to be predicted based on the remaining input features, reflecting students’ overall academic performance in numerical and categorical forms, respectively.
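
The grouping described in these notes can be written down as column lists. The sketch below is one possible way to do it, not the notebook's actual code: the list names, the helper `split_features_targets`, and the one-row toy frame are assumptions for illustration.

```python
import pandas as pd

# Column groups as described in the notes above
ID_COLS = ["Student_ID", "First_Name", "Last_Name", "Email"]
CONTEXT_COLS = [
    "Gender", "Department", "Extracurricular_Activities",
    "Internet_Access_at_Home", "Parent_Education_Level", "Family_Income_Level",
]
CORE_COLS = [
    "Age", "Attendance (%)", "Midterm_Score", "Assignments_Avg",
    "Quizzes_Avg", "Participation_Score", "Projects_Score", "Total_Score",
    "Study_Hours_per_Week", "Stress_Level (1-10)", "Sleep_Hours_per_Night",
]
TARGET_COLS = ["Final_Score", "Grade"]

def split_features_targets(frame, include_context=False):
    """Return (X, y_score, y_grade); identifier columns are always excluded."""
    feature_cols = CORE_COLS + (CONTEXT_COLS if include_context else [])
    X = frame[feature_cols].copy()
    y_score = frame["Final_Score"].copy()  # numeric target
    y_grade = frame["Grade"].copy()        # categorical target
    return X, y_score, y_grade

# One-row toy frame with placeholder values, for illustration only
all_cols = ID_COLS + CONTEXT_COLS + CORE_COLS + TARGET_COLS
toy = pd.DataFrame([[0] * len(all_cols)], columns=all_cols)

X, y_score, y_grade = split_features_targets(toy)
print(len(all_cols), X.shape)
```

Passing `include_context=True` adds the six background features, which makes it easy to compare models with and without them.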

In [27]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 23 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Student_ID                  5000 non-null   object 
 1   First_Name                  5000 non-null   object 
 2   Last_Name                   5000 non-null   object 
 3   Email                       5000 non-null   object 
 4   Gender                      5000 non-null   object 
 5   Age                         5000 non-null   int64  
 6   Department                  5000 non-null   object 
 7   Attendance (%)              4484 non-null   float64
 8   Midterm_Score               5000 non-null   float64
 9   Final_Score                 5000 non-null   float64
 10  Assignments_Avg             4483 non-null   float64
 11  Quizzes_Avg                 5000 non-null   float64
 12  Participation_Score         5000 non-null   float64
 13  Projects_Score              5000 