> ### 1) Problem statement
- This project examines how student performance (test scores) is affected by other variables such as gender, ethnicity, parental level of education, lunch type, and test preparation course.


> ### 2) Data Collection
- Dataset Source - https://www.kaggle.com/datasets/spscientist/students-performance-in-exams?datasetId=74977
- The data consists of 8 columns and 1,000 rows.

> ### ***2.1 - Import required libraries and data***

In [3]:
## import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# set the style of seaborn
sns.set_style("whitegrid")

import warnings
# ignore warnings
warnings.filterwarnings("ignore")

In [4]:
## read the dataset 
df = pd.read_csv('data/stud.csv')

In [5]:
## show the first 5 rows of the dataset
df.head()

Unnamed: 0,gender,race_ethnicity,parental_level_of_education,lunch,test_preparation_course,math_score,reading_score,writing_score
0,female,group B,bachelor's degree,standard,none,72,72,74
1,female,group C,some college,standard,completed,69,90,88
2,female,group B,master's degree,standard,none,90,95,93
3,male,group A,associate's degree,free/reduced,none,47,57,44
4,male,group C,some college,standard,none,76,78,75


In [6]:
## shape of the dataset
df.shape 

(1000, 8)

> ### 3) Data checks to perform

- Check Missing values
- Check Duplicates
- Check data type
- Check the number of unique values of each column
- Check statistics of data set
- Check the categories present in each categorical column
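
The checks listed above can also be gathered into one helper, which is handy when the same inspection has to be repeated on another file. This is only a sketch: `run_basic_checks` is a hypothetical name, and the small frame below is made-up illustration data, not rows from the real dataset.

```python
import pandas as pd


def run_basic_checks(frame: pd.DataFrame) -> dict:
    """Collect the basic data checks into one dictionary (hypothetical helper)."""
    return {
        "missing_per_column": frame.isna().sum().to_dict(),
        "duplicate_rows": int(frame.duplicated().sum()),
        "dtypes": frame.dtypes.astype(str).to_dict(),
        "unique_per_column": frame.nunique().to_dict(),
        "numeric_summary": frame.describe(),
        "categories": {
            col: sorted(frame[col].dropna().unique())
            for col in frame.select_dtypes(exclude="number").columns
        },
    }


# Made-up rows for illustration only: one missing lunch value, one repeated row.
toy = pd.DataFrame({
    "gender": ["female", "male", "female", "female"],
    "lunch": ["standard", "standard", None, "standard"],
    "math_score": [72, 47, 90, 72],
})

report = run_basic_checks(toy)
print(report["missing_per_column"])
print(report["duplicate_rows"])
print(report["categories"])
```

On the real data the call would be `run_basic_checks(df)`; the sections below run each check separately so the output of every step stays visible.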

> #### 3.1 Check for missing values

In [8]:
## Check for missing values
df.isna().sum()

gender                         0
race_ethnicity                 0
parental_level_of_education    0
lunch                          0
test_preparation_course        0
math_score                     0
reading_score                  0
writing_score                  0
dtype: int64

> ### **Observations :-** 🚫
- There are no missing values in the dataset.

> #### 3.2 Check for duplicate values

In [10]:
## check for duplicate rows
df.duplicated().sum()

np.int64(0)

> ### **Observations :-** 🚫
- There are no duplicate rows in the dataset.

> #### 3.3 Check the data types 

In [11]:
## Check info of the dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   gender                       1000 non-null   object
 1   race_ethnicity               1000 non-null   object
 2   parental_level_of_education  1000 non-null   object
 3   lunch                        1000 non-null   object
 4   test_preparation_course      1000 non-null   object
 5   math_score                   1000 non-null   int64 
 6   reading_score                1000 non-null   int64 
 7   writing_score                1000 non-null   int64 
dtypes: int64(3), object(5)
memory usage: 62.6+ KB


> ###  **Observations :-** 📅
#### Basic Information
- **Samples**: 1,000 student records
- **Features**: 8 (3 numeric, 5 categorical)
- **Missing Values**: None (complete dataset)
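- **Memory Usage**: 62.6+ KB as reported by `df.info()`; the `+` means the strings held in `object` columns are not fully counted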

#### Column Breakdown
| # | Column Name                  | Non-Null | Dtype   | Unique Values* | Sample Values              |
|---|------------------------------|----------|---------|----------------|----------------------------|
| 0 | gender                       | 1000     | object  | 2              | female, male               |
| 1 | race_ethnicity               | 1000     | object  | 5              | group A, group B, group C  |
| 2 | parental_level_of_education  | 1000     | object  | 6              | high school, some college  |
| 3 | lunch                        | 1000     | object  | 2              | standard, free/reduced     |
| 4 | test_preparation_course      | 1000     | object  | 2              | none, completed            |
| 5 | math_score                   | 1000     | int64   | 81             | 72, 69, 90                 |
| 6 | reading_score                | 1000     | int64   | 72             | 72, 90, 95                 |
| 7 | writing_score                | 1000     | int64   | 77             | 74, 88, 93                 |

*Unique-value counts are taken from `df.nunique()`; the score columns hold integer values, and their sample values come from the first rows shown by `df.head()`.
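
A breakdown table like the one above does not have to be typed by hand. The sketch below builds it from a DataFrame; `summarize_columns` is a hypothetical helper name, and the frame holds only three columns of the five rows shown by `df.head()`, so its counts differ from those of the full dataset.

```python
import pandas as pd


def summarize_columns(frame: pd.DataFrame, n_samples: int = 3) -> pd.DataFrame:
    """Build one summary row per column (hypothetical helper)."""
    rows = []
    for name in frame.columns:
        col = frame[name]
        rows.append({
            "column": name,
            "non_null": int(col.notna().sum()),
            "dtype": str(col.dtype),
            "unique": int(col.nunique()),
            # First few distinct values, in order of appearance.
            "sample_values": ", ".join(map(str, col.dropna().unique()[:n_samples])),
        })
    return pd.DataFrame(rows)


# Three columns of the five rows displayed by df.head().
head_rows = pd.DataFrame({
    "gender": ["female", "female", "female", "male", "male"],
    "lunch": ["standard", "standard", "standard", "free/reduced", "standard"],
    "math_score": [72, 69, 90, 47, 76],
})

column_summary = summarize_columns(head_rows)
print(column_summary.to_string(index=False))
```

Calling `summarize_columns(df)` on the full dataset should reproduce the non-null and unique-value counts reported by `df.info()` and `df.nunique()`.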



> #### 3.4 Check the number of unique values

In [12]:
## checking the unique values
df.nunique() 


gender                          2
race_ethnicity                  5
parental_level_of_education     6
lunch                           2
test_preparation_course         2
math_score                     81
reading_score                  72
writing_score                  77
dtype: int64

> #### 3.5 Check the statistics of the dataset 

In [13]:
## check the statistics of the dataset
df.describe()

Unnamed: 0,math_score,reading_score,writing_score
count,1000.0,1000.0,1000.0
mean,66.089,69.169,68.054
std,15.16308,14.600192,15.195657
min,0.0,17.0,10.0
25%,57.0,59.0,57.75
50%,66.0,70.0,69.0
75%,77.0,79.0,79.0
max,100.0,100.0,100.0


> ###  **Observations :-** 🛃

### Test Score Distribution Analysis

#### Summary Statistics
| Metric            | Math Score | Reading Score | Writing Score |
|-------------------|------------|---------------|---------------|
| **Count**         | 1000       | 1000          | 1000          |
| **Mean**          | 66.09      | 69.17         | 68.05         |
| **Std Dev**       | 15.16      | 14.60         | 15.20         |
| **Minimum**       | 0          | 17            | 10            |
| **25th %ile**     | 57         | 59            | 57.75         |
| **Median**        | 66         | 70            | 69            |
| **75th %ile**     | 77         | 79            | 79            |
| **Maximum**       | 100        | 100           | 100           |


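1. **Score Comparisons**:
   - Reading scores show the highest average (69.17)
   - Writing scores have the widest variability (SD=15.20), narrowly ahead of math (SD=15.16)
   - Reading and writing have medians above their means (70 vs 69.17 and 69 vs 68.05), which suggests a mild left skew; math is nearly symmetric (66 vs 66.09)

2. **Potential Outliers**:
   - Math: minimum of 0 (possible data entry error)
   - Reading: minimum of 17, far below the 25th percentile (59)
   - Writing: minimum of 10, far below the 25th percentile (57.75)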
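
The skew and outlier remarks above can be checked with simple arithmetic on the `df.describe()` output. The sketch below uses the common 1.5 × IQR rule for the lower fence; that rule is an assumption made here, not something the notebook applies, and `lower_fence` is a hypothetical helper.

```python
# Values copied from the df.describe() output above.
summary = {
    "math_score":    {"mean": 66.089, "median": 66.0, "min": 0.0,  "q1": 57.0,  "q3": 77.0},
    "reading_score": {"mean": 69.169, "median": 70.0, "min": 17.0, "q1": 59.0,  "q3": 79.0},
    "writing_score": {"mean": 68.054, "median": 69.0, "min": 10.0, "q1": 57.75, "q3": 79.0},
}


def lower_fence(q1, q3, k=1.5):
    """Lower fence of the k * IQR rule; values below it are conventionally flagged."""
    return q1 - k * (q3 - q1)


results = {}
for name, s in summary.items():
    fence = lower_fence(s["q1"], s["q3"])
    results[name] = {
        # A negative gap (mean below median) hints at a left skew.
        "mean_minus_median": round(s["mean"] - s["median"], 3),
        "lower_fence": fence,
        "min_below_fence": s["min"] < fence,
    }
    print(name, results[name])
```

The mean-minus-median gap is only a rough indicator; with the data loaded, `df[numerical_columns].skew()` would give a direct skewness estimate, where `numerical_columns` stands for the three score columns.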
   


> #### 3.6 Exploring Data

In [14]:
df.head()

Unnamed: 0,gender,race_ethnicity,parental_level_of_education,lunch,test_preparation_course,math_score,reading_score,writing_score
0,female,group B,bachelor's degree,standard,none,72,72,74
1,female,group C,some college,standard,completed,69,90,88
2,female,group B,master's degree,standard,none,90,95,93
3,male,group A,associate's degree,free/reduced,none,47,57,44
4,male,group C,some college,standard,none,76,78,75


In [18]:
## print the categories of each categorical column
categorical_columns = ['gender', 'race_ethnicity', 'parental_level_of_education',
                       'lunch', 'test_preparation_course']
for column in categorical_columns:
    print(f"Categories in '{column}' variable: ", end="")
    print(df[column].unique())

Categories in 'gender' variable: ['female' 'male']
Categories in 'race_ethnicity' variable: ['group B' 'group C' 'group A' 'group D' 'group E']
Categories in 'parental_level_of_education' variable: ["bachelor's degree" 'some college' "master's degree" "associate's degree"
 'high school' 'some high school']
Categories in 'lunch' variable: ['standard' 'free/reduced']
Categories in 'test_preparation_course' variable: ['none' 'completed']
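
Knowing which categories exist is a first step; how often each one occurs is usually the next question. A minimal sketch with `Series.value_counts` follows. It runs on three columns of the five rows shown by `df.head()`, which is far too little data to draw conclusions from.

```python
import pandas as pd

# Three columns of the five rows displayed by df.head(); illustration only.
sample = pd.DataFrame({
    "gender": ["female", "female", "female", "male", "male"],
    "race_ethnicity": ["group B", "group C", "group B", "group A", "group C"],
    "test_preparation_course": ["none", "completed", "none", "none", "none"],
})

for column in sample.columns:
    counts = sample[column].value_counts()
    shares = sample[column].value_counts(normalize=True).round(2)
    # Both Series share the same index, so they align into one table.
    print(pd.DataFrame({"count": counts, "share": shares}), end="\n\n")

gender_counts = sample["gender"].value_counts()
prep_shares = sample["test_preparation_course"].value_counts(normalize=True)
```

On the full dataset the same loop would run over `df[column]` for each categorical column.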


In [21]:
## define numerical & categorical columns
numerical_features = [feature for feature in df.columns if df[feature].dtype != "O"]
categorical_features = [feature for feature in df.columns if df[feature].dtype == "O"]

## print the numerical features
print(f"We have {len(numerical_features)} Numerical Features: {numerical_features}")
## print the categorical features
print(f"\n We have {len(categorical_features)} Categorical Features: {categorical_features} ")

We have 3 Numerical Features: ['math_score', 'reading_score', 'writing_score']

 We have 5 Categorical Features: ['gender', 'race_ethnicity', 'parental_level_of_education', 'lunch', 'test_preparation_course'] 
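
The split above compares each column's dtype with `"O"` (object). An alternative that does not depend on how strings are stored is `DataFrame.select_dtypes`, sketched here on three columns of the five rows shown by `df.head()`.

```python
import pandas as pd

# Three columns of the five rows displayed by df.head(); illustration only.
head_rows = pd.DataFrame({
    "gender": ["female", "female", "female", "male", "male"],
    "lunch": ["standard", "standard", "standard", "free/reduced", "standard"],
    "math_score": [72, 69, 90, 47, 76],
})

# Numeric columns are selected explicitly; everything else is treated as categorical.
numerical_features = head_rows.select_dtypes(include="number").columns.tolist()
categorical_features = head_rows.select_dtypes(exclude="number").columns.tolist()

print(f"We have {len(numerical_features)} Numerical Features: {numerical_features}")
print(f"We have {len(categorical_features)} Categorical Features: {categorical_features}")
```

Note that `exclude="number"` also picks up boolean or datetime columns if a dataset has them; this one does not, so both approaches should give the same two lists here.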
