# Student Performance Indicator

### *Project's Life Cycle:*
*- Problem Statement* <br>
*- Data Collection*  <br>
*- Data Checks to perform*  <br>
*- Exploratory Data Analysis (EDA)*  <br>
*- Data Pre-processing* <br>
*- Model Training* <br>
*- Best Model Recommendation* <br>

### **1) Problem Statement**

- This project examines how student performance (test scores) is affected by other variables such as Gender, Ethnicity, Parental Level of Education, Lunch, and Test Preparation Course.

### **2) Data Collection**

- Dataset Source: https://www.kaggle.com/datasets/spscientist/students-performance-in-exams?datasetId=74977
- The dataset consists of 1000 rows and 8 columns: five categorical features and three numeric test scores (math, reading, and writing).

#### 2.1. Importing Data and necessary packages

In [3]:
# Package imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [34]:
# Data
student_data = pd.read_csv("data/stud.csv")
print(student_data.head())

# Numerical information (select every numeric dtype, not only int64)
numeric_cols = list(student_data.select_dtypes(include='number').columns)
print(f"The dataset provides a glimpse into the academic performance of the students, including {numeric_cols}.\n")
print(student_data[numeric_cols].describe())

   gender race_ethnicity parental_level_of_education         lunch  \
0  female        group B           bachelor's degree      standard   
1  female        group C                some college      standard   
2  female        group B             master's degree      standard   
3    male        group A          associate's degree  free/reduced   
4    male        group C                some college      standard   

  test_preparation_course  math_score  reading_score  writing_score  
0                    none          72             72             74  
1               completed          69             90             88  
2                    none          90             95             93  
3                    none          47             57             44  
4                    none          76             78             75  
The dataset provides a glimpse into the academic performance of the students, including ['math_score', 'reading_score', 'writing_score'].

       math_score  r

In [None]:
# Categorical features
print(f"The student body consists of {student_data['gender'].unique()} students.")
print(f"Ethnicity-wise, they belong to {student_data['race_ethnicity'].unique()}.")
print(f"Their parents' education levels are {student_data['parental_level_of_education'].unique()}.")
print(f"Students have lunch options of {student_data['lunch'].unique()}.")
print(f"Their test preparation course status is either {student_data['test_preparation_course'].unique()} before exams.")

The student body consists of ['female' 'male'] students.
Ethnicity-wise, they belong to ['group B' 'group C' 'group A' 'group D' 'group E'].
Their parents' education levels are ["bachelor's degree" 'some college' "master's degree" "associate's degree"
 'high school' 'some high school'].
Students have lunch options of ['standard' 'free/reduced'].
Their test preparation course status is either ['none' 'completed'] before exams.
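
The unique values show which categories exist, but not how common each one is. The following is a minimal sketch of tabulating category shares with `value_counts(normalize=True)`. It assumes a five-row sample that copies the `gender` and `lunch` values from the `head()` output above; `category_shares` is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd

# Five-row sample copied from the head() output; the full analysis
# would pass student_data (loaded from data/stud.csv) instead.
sample = pd.DataFrame({
    "gender": ["female", "female", "female", "male", "male"],
    "lunch": ["standard", "standard", "standard", "free/reduced", "standard"],
})


def category_shares(df: pd.DataFrame) -> dict:
    """Return the relative frequency of each category for every non-numeric column."""
    shares = {}
    for col in df.select_dtypes(exclude="number").columns:
        shares[col] = df[col].value_counts(normalize=True)
    return shares


shares = category_shares(sample)
for col, dist in shares.items():
    print(f"{col}:")
    print(dist.round(2).to_string())
```

On the full dataset the same call would be `category_shares(student_data)`; the shares on this five-row sample say nothing about the real distribution.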


### **3) Data Checks to perform**

- Missing Values
- Duplicates
- Data type
- Unique values of each column
- Statistics of dataset
- Category names
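
The checks listed above can also be gathered into a single report. This is only a sketch under stated assumptions: `run_data_checks` is a hypothetical helper and `toy` is made-up data used to keep the example self-contained; neither appears in the original notebook.

```python
import pandas as pd


def run_data_checks(df: pd.DataFrame) -> dict:
    """Collect the listed checks into one dictionary (hypothetical helper)."""
    categorical = df.select_dtypes(exclude="number").columns
    return {
        "missing_values": df.isnull().sum().to_dict(),    # missing values per column
        "duplicate_rows": int(df.duplicated().sum()),     # fully repeated rows
        "dtypes": df.dtypes.astype(str).to_dict(),        # data type of each column
        "unique_counts": df.nunique().to_dict(),          # distinct values per column
        "statistics": df.describe(),                      # summary statistics
        "categories": {                                   # category names
            col: sorted(df[col].dropna().unique()) for col in categorical
        },
    }


# Made-up toy frame for illustration only; not the project data.
toy = pd.DataFrame({
    "gender": ["female", "male", "male", "female"],
    "math_score": [72, 47, 47, 90],
})

report = run_data_checks(toy)
print(report["missing_values"])
print(report["duplicate_rows"])
print(report["categories"])
```

Calling `run_data_checks(student_data)` would apply the same checks to the real dataset once it is loaded.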

In [36]:
# Missing values
print("\nMissing values in each column:")
print(student_data.isnull().sum())
if not student_data.isnull().values.any():
    print("No missing values found in the dataset.")


Missing values in each column:
gender                         0
race_ethnicity                 0
parental_level_of_education    0
lunch                          0
test_preparation_course        0
math_score                     0
reading_score                  0
writing_score                  0
dtype: int64
No missing values found in the dataset.


In [38]:
# Duplicates
n_duplicates = student_data.duplicated().sum()
print(f"\nThere are {n_duplicates} duplicate rows in the dataset.")
if n_duplicates > 0:
    student_data = student_data.drop_duplicates()
    print("Duplicates have been removed from dataset.")


There are 0 duplicate rows in the dataset.
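
Because the dataset contains no duplicate rows, the removal branch never runs here. As a small illustration of what `duplicated()` and `drop_duplicates()` do when a repeat is present, the sketch below uses a made-up toy frame (an assumption for demonstration, not the project data).

```python
import pandas as pd

# Made-up toy frame: the third row repeats the first one exactly.
toy = pd.DataFrame({
    "gender": ["female", "male", "female"],
    "math_score": [72, 47, 72],
})

# duplicated() flags later occurrences of a row (keep="first" is the default).
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence; the index is reset for tidiness.
deduplicated = toy.drop_duplicates().reset_index(drop=True)

print(f"Duplicates found: {n_duplicates}")
print(f"Rows after removal: {len(deduplicated)}")
```

Only rows that match in every column count as duplicates by default; passing `subset=[...]` to either method restricts the comparison to selected columns.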
