# Student Performance Indicator

---

## Problem Statement

---

> Students' test performance may depend on many factors, not merely on how well they prepared for the test.\
> The goal of this project is to understand how factors such as a student's **gender**, **ethnicity**, and **parental level of education** affect their test performance.

## Data Collection

---

### Dataset Source
[Students Performance in Exams, Kaggle](https://www.kaggle.com/datasets/spscientist/students-performance-in-exams?datasetId=74977)

### About the Dataset
1. Includes marks secured by high school students in the United States of America (USA)

|Feature/Column|Explanation|
|:---:|:---:|
|`gender`|Biological sex of student (Male/Female)|
|`race_ethnicity`|Ethnicity of student (anonymized) $\rightarrow$ Group A/B/C/D/E|
|`parental_level_of_education`|Highest qualification of parents $\rightarrow$ Bachelor's degree/Some college/Master's degree/Associate's degree/High school/Some high school|
|`lunch`|Type of lunch eaten by student before taking the test $\rightarrow$ Standard or Free/reduced|
|`test_preparation_course`|Whether student completed the test preparation materials before taking the test|
|`math_score`|Marks secured in the math test|
|`reading_score`|Marks secured in the reading test|
|`writing_score`|Marks secured in the writing test|

### Importing Data and Required Libraries

In [1]:
# DEPENDENCIES 

# for numerical computations
import numpy as np
# for working with dataframes and `csv` files
import pandas as pd
# for data visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# for suppressing warnings
import warnings
warnings.filterwarnings("ignore")

# for working with file paths
from pathlib import Path

In [5]:
# DATA

file_path = Path("data/stud.csv")

# load csv into dataframe
dataset = pd.read_csv(file_path)

### Getting to Know the Dataset

In [7]:
# make copy of dataframe to avoid accidental changes
df = dataset.copy()

In [8]:
# check top five rows
df.head()

||gender|race_ethnicity|parental_level_of_education|lunch|test_preparation_course|math_score|reading_score|writing_score|
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
|0|female|group B|bachelor's degree|standard|none|72|72|74|
|1|female|group C|some college|standard|completed|69|90|88|
|2|female|group B|master's degree|standard|none|90|95|93|
|3|male|group A|associate's degree|free/reduced|none|47|57|44|
|4|male|group C|some college|standard|none|76|78|75|


In [9]:
# check shape of dataset
n_samples, n_features = df.shape

print(f"Dataset has {n_samples} samples and {n_features} features.")

Dataset has 1000 samples and 8 features.


## Data Checks before EDA

The following basic checks are performed to ensure a smooth EDA:
1. Missing data
2. Duplicate data
3. Data type of columns
4. No. of unique values per numeric column
5. No. of unique categories per categorical column
6. Basic statistics

---
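Several of the checks listed above can be gathered into a single per-column summary. The sketch below is only an illustration: `summarize_columns` is a hypothetical helper (not part of this notebook), and `toy` is a small made-up frame standing in for the real dataset.

```python
import pandas as pd


def summarize_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column with its dtype, missing count and unique count."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "n_missing": frame.isnull().sum(),
        # nunique() ignores missing values by default
        "n_unique": frame.nunique(),
    })


# toy stand-in for the real dataset (values are illustrative only)
toy = pd.DataFrame({
    "gender": ["female", "male", "female", None],
    "math_score": [72, 47, 72, 90],
})

summary = summarize_columns(toy)
print(summary)
```

On the real data the same helper could be called as `summarize_columns(df)`; the individual checks in the following cells show the same information step by step.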

### Check for Missing data

In [11]:
# total missing data per column
df.isnull().sum(axis=0)

gender                         0
race_ethnicity                 0
parental_level_of_education    0
lunch                          0
test_preparation_course        0
math_score                     0
reading_score                  0
writing_score                  0
dtype: int64

### Check for Duplicates

In [14]:
# total duplicate entries
df.duplicated().sum()

0
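To make clear what this count means: `duplicated()` flags a row only when an identical row has already appeared, so the first occurrence is not counted. A minimal sketch on a made-up frame (the `toy` data is an assumption for illustration, not taken from the dataset):

```python
import pandas as pd

# toy frame in which row 2 repeats row 0 exactly
toy = pd.DataFrame({
    "gender": ["female", "male", "female"],
    "math_score": [72, 47, 72],
})

# default keep="first": only the later copy is flagged
n_repeats = int(toy.duplicated().sum())

# keep=False flags every member of a duplicate group, which helps when inspecting them
all_copies = toy[toy.duplicated(keep=False)]

print(n_repeats)
print(all_copies)
```

A result of zero on the real dataset therefore means that no row is an exact copy of an earlier row across all columns.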

### Data types of the Columns

In [15]:
df.dtypes

gender                         object
race_ethnicity                 object
parental_level_of_education    object
lunch                          object
test_preparation_course        object
math_score                      int64
reading_score                   int64
writing_score                   int64
dtype: object

### Unique Values/Categories per Column

In [18]:
# No. of unique values in each feature
df.nunique()

gender                          2
race_ethnicity                  5
parental_level_of_education     6
lunch                           2
test_preparation_course         2
math_score                     81
reading_score                  72
writing_score                  77
dtype: int64

In [24]:
# unique categories
categorical_cols = [
    column
    for column in df.columns
    if df[column].dtype == "O"
]

for col in categorical_cols:
    print(f"\nCategorical Column -> {col}")
    print(f"{len(df[col].unique())} Unique categories:\n{df[col].unique()}")


Categorical Column -> gender
2 Unique categories:
['female' 'male']

Categorical Column -> race_ethnicity
5 Unique categories:
['group B' 'group C' 'group A' 'group D' 'group E']

Categorical Column -> parental_level_of_education
6 Unique categories:
["bachelor's degree" 'some college' "master's degree" "associate's degree"
 'high school' 'some high school']

Categorical Column -> lunch
2 Unique categories:
['standard' 'free/reduced']

Categorical Column -> test_preparation_course
2 Unique categories:
['none' 'completed']


### Basic Statistics

In [26]:
# descriptive statistics of numerical columns
df.describe().T

||count|mean|std|min|25%|50%|75%|max|
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
|math_score|1000.0|66.089|15.16308|0.0|57.0|66.0|77.0|100.0|
|reading_score|1000.0|69.169|14.600192|17.0|59.0|70.0|79.0|100.0|
|writing_score|1000.0|68.054|15.195657|10.0|57.75|69.0|79.0|100.0|


### Observations
1. There are **no missing values** in the dataset
2. There are **no duplicate entries** in the dataset
3. The data types are as follows:
    - **Numeric columns**: `math_score`, `reading_score`, `writing_score`
    - **Categorical columns**: `gender`, `race_ethnicity`, `parental_level_of_education`, `lunch`, `test_preparation_course`
4. `parental_level_of_education` has 6 unique categories, but 2 of them indicate a similar level of education $\rightarrow$ 'high school' and 'some high school'
5. `test_preparation_course` has the value 'none', which likely indicates that the student did not complete the test preparation course
6. Preliminary observations on **test scores** show:
    - **Mean** scores in all three tests are close to each other ($66.09, 69.17, 68.05$)
    - **Standard deviations** of marks in all three tests are also similar ($15.16, 14.60, 15.20$)
    - **Minimum** scores show greater variation:
        - only Maths has a minimum score of zero
        - the minimum scores in reading and writing are $17$ and $10$ respectively
    - **Maximum** score in all three tests is $100$
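The figures quoted in observation 6 can be read off a compact, rounded summary rather than the full `describe()` output. The sketch below assumes a small frame `toy` built from the five rows shown by `df.head()` earlier, so its numbers differ from the full-dataset statistics.

```python
import pandas as pd

# first five rows of the score columns, copied from the df.head() output
toy = pd.DataFrame({
    "math_score": [72, 69, 90, 47, 76],
    "reading_score": [72, 90, 95, 57, 78],
    "writing_score": [74, 88, 93, 44, 75],
})

# one row per test, one column per statistic, rounded to two decimals
stats = toy.agg(["mean", "std", "min", "max"]).round(2).T
print(stats)
```

Applying the same `agg(...)` call to the three score columns of `df` should give the rounded means, standard deviations, minima and maxima in one table.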
    
---

## Exploratory Data Analysis (EDA)

---

In [29]:
# separate out column names for easier analysis

categorical_cols = [
    col for col in df.columns
    if df[col].dtype == "O"
]

numeric_cols = [
    col for col in df.columns
    if df[col].dtype != "O"
]

print(f"Categorical columns -> {categorical_cols}\n\nNumeric columns -> {numeric_cols}")

Categorical columns -> ['gender', 'race_ethnicity', 'parental_level_of_education', 'lunch', 'test_preparation_course']

Numeric columns -> ['math_score', 'reading_score', 'writing_score']
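An alternative way to obtain the same split is `DataFrame.select_dtypes`, which avoids comparing dtypes against `"O"` by hand. This is a sketch on a made-up frame `toy`; it treats every non-numeric column as categorical, which matches this dataset but is an assumption that may not hold for data containing, say, datetime columns.

```python
import pandas as pd

# toy frame with two text columns and one numeric column
toy = pd.DataFrame({
    "gender": ["female", "male"],
    "lunch": ["standard", "free/reduced"],
    "math_score": [72, 47],
})

# numeric columns: any integer or float dtype
numeric_cols = toy.select_dtypes(include="number").columns.tolist()

# everything else is treated as categorical here
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(f"Categorical columns -> {categorical_cols}")
print(f"Numeric columns -> {numeric_cols}")
```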


### Adding New columns for better analysis

In [30]:
# new 'total_score' and 'avg_score' columns

df["total_score"] = df["math_score"] + df["writing_score"] + df["reading_score"]  # total score across the 3 tests
df["avg_score"] = df["total_score"] / 3  # average score across the 3 tests

# check top five rows
df.head()

||gender|race_ethnicity|parental_level_of_education|lunch|test_preparation_course|math_score|reading_score|writing_score|total_score|avg_score|
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
|0|female|group B|bachelor's degree|standard|none|72|72|74|218|72.666667|
|1|female|group C|some college|standard|completed|69|90|88|247|82.333333|
|2|female|group B|master's degree|standard|none|90|95|93|278|92.666667|
|3|male|group A|associate's degree|free/reduced|none|47|57|44|148|49.333333|
|4|male|group C|some college|standard|none|76|78|75|229|76.333333|
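The same two columns could also be derived with row-wise aggregation over a list of score columns, which scales more easily if further tests were added. A minimal sketch, assuming a frame `toy` that holds the first two rows shown above and a helper list `score_cols` introduced here for illustration:

```python
import pandas as pd

# first two rows of the score columns from the table above
toy = pd.DataFrame({
    "math_score": [72, 69],
    "reading_score": [72, 90],
    "writing_score": [74, 88],
})

score_cols = ["math_score", "reading_score", "writing_score"]

# axis=1 aggregates across columns, giving one value per student
toy["total_score"] = toy[score_cols].sum(axis=1)
toy["avg_score"] = toy[score_cols].mean(axis=1)

print(toy)
```

Note that `sum(axis=1)` and `mean(axis=1)` skip missing values by default, unlike explicit addition; since this dataset has no missing values, both approaches give the same result.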
