Understanding and Forecasting Student Performance in India

Objective  

Analyze student performance data to uncover patterns in academic success across subjects
and demographics. Use statistical analysis and machine learning to predict performance and
identify interventions for improvement — useful for educators and policymakers alike. 

Dataset  
* Source: [Kaggle - Student Performance Dataset (Math, Reading, Writing Scores)]  
* Link: https://www.kaggle.com/datasets/spscientist/students-performance-in-exams  

Columns include:  
* gender, race/ethnicity, parental level of education
* lunch, test preparation course
* math score, reading score, writing score

Use Cases  
* Predict students at risk of underperforming
* Understand how socio-economic and educational backgrounds impact scores
* Recommend interventions (e.g., test prep, tutoring)
* Visualize gaps across gender or ethnicity groups

SECTION A: Python & Data Cleaning  
* Load the dataset and inspect the first few rows, datatypes, and null values.
* Check for duplicate rows or invalid data entries.
* Standardize categorical values (e.g., group education levels, rename ethnicities).
* Add derived columns:  
      * Average Score = (Math + Reading + Writing)/3  
      * Performance Category: Low, Medium, High based on average score  
      * Preparation Effectiveness: Compare scores with and without test prep
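
The derived columns listed above could be sketched as follows. This is a minimal illustration, not the notebook's own implementation: the sample frame is rebuilt from the first five rows shown in the output below, the new column names (`average score`, `performance category`) are hypothetical, and the 60/80 cut-offs for Low, Medium, and High are an assumption, since the task does not fix them.

```python
import pandas as pd

# Sample built from the first five rows of the dataset as displayed in this notebook.
sample = pd.DataFrame({
    "gender": ["female", "female", "female", "male", "male"],
    "test preparation course": ["none", "completed", "none", "none", "none"],
    "math score": [72, 69, 90, 47, 76],
    "reading score": [72, 90, 95, 57, 78],
    "writing score": [74, 88, 93, 44, 75],
})

score_cols = ["math score", "reading score", "writing score"]

# Average Score = (Math + Reading + Writing) / 3
sample["average score"] = sample[score_cols].mean(axis=1)

# Performance Category. Assumed cut-offs (not given in the task):
# below 60 is Low, 60 up to (not including) 80 is Medium, 80 and above is High.
sample["performance category"] = pd.cut(
    sample["average score"],
    bins=[float("-inf"), 60, 80, float("inf")],
    labels=["Low", "Medium", "High"],
    right=False,
)

# Preparation Effectiveness: mean average score with and without test prep,
# and the gap between the two groups.
prep_means = sample.groupby("test preparation course")["average score"].mean()
prep_gap = prep_means["completed"] - prep_means["none"]

print(sample[["average score", "performance category"]])
print(prep_means)
```

On the full dataset the same lines would run against `df` instead of `sample`. A gap computed from five rows says nothing about the real effect of test preparation; it only shows the mechanics.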

In [1]:
import pandas as pd 
import numpy as np

In [2]:
df = pd.read_csv("StudentsPerformance.csv")
df

,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score
0,female,group B,bachelor's degree,standard,none,72,72,74
1,female,group C,some college,standard,completed,69,90,88
2,female,group B,master's degree,standard,none,90,95,93
3,male,group A,associate's degree,free/reduced,none,47,57,44
4,male,group C,some college,standard,none,76,78,75
...,...,...,...,...,...,...,...,...
995,female,group E,master's degree,standard,completed,88,99,95
996,male,group C,high school,free/reduced,none,62,55,55
997,female,group C,high school,free/reduced,completed,59,71,65
998,female,group D,some college,standard,completed,68,78,77


In [3]:
df.dtypes

gender                         object
race/ethnicity                 object
parental level of education    object
lunch                          object
test preparation course        object
math score                      int64
reading score                   int64
writing score                   int64
dtype: object

In [4]:
df.isnull().sum()

gender                         0
race/ethnicity                 0
parental level of education    0
lunch                          0
test preparation course        0
math score                     0
reading score                  0
writing score                  0
dtype: int64

In [5]:
df["test preparation course"].unique()

array(['none', 'completed'], dtype=object)

In [6]:
df["lunch"].unique()

array(['standard', 'free/reduced'], dtype=object)

In [7]:
df["parental level of education"].unique()

array(["bachelor's degree", 'some college', "master's degree",
       "associate's degree", 'high school', 'some high school'],
      dtype=object)

In [8]:
df["race/ethnicity"].unique()

array(['group B', 'group C', 'group A', 'group D', 'group E'],
      dtype=object)
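
With the unique values of each categorical column known, the standardization step from Section A (grouping education levels, renaming ethnicities) might look like the sketch below. The three-tier grouping and its labels are hypothetical choices made for illustration; the source does not prescribe a particular grouping.

```python
import pandas as pd

# Category values exactly as returned by the unique() calls above.
education = pd.Series([
    "bachelor's degree", "some college", "master's degree",
    "associate's degree", "high school", "some high school",
])
ethnicity = pd.Series(["group B", "group C", "group A", "group D", "group E"])

# Hypothetical grouping of the six education levels into three tiers.
education_tiers = {
    "some high school": "high school or below",
    "high school": "high school or below",
    "some college": "college or associate",
    "associate's degree": "college or associate",
    "bachelor's degree": "bachelor's or above",
    "master's degree": "bachelor's or above",
}

# Normalize whitespace and case first so the mapping keys match reliably.
education_clean = education.str.strip().str.lower()
education_grouped = education_clean.map(education_tiers)

# Values missing from the mapping become NaN; list them instead of losing them silently.
unmapped = education_clean[education_grouped.isna()].unique()

# Shorten "group B" to "B" for more compact labels in tables and plots.
ethnicity_short = ethnicity.str.replace("group ", "", regex=False)

print(education_grouped.tolist())
print(ethnicity_short.tolist())
```

Applied to the full frame, the same `map` and `str.replace` calls would be assigned back to new columns of `df`, leaving the original columns intact for comparison.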

In [10]:
duplicate_value = df.duplicated().sum()
duplicate_value

np.int64(0)

In [11]:
# Show the first few duplicated rows, if there are any
if duplicate_value > 0:
    print(df[df.duplicated()].head())
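
The duplicate check covers only half of the second task in Section A; invalid data entries still need a look. One possible sketch is shown here, assuming that valid scores lie between 0 and 100 and that the only legitimate category values are the ones returned by the `unique()` calls earlier. The sample frame is again rebuilt from the first five rows displayed above, and the variable names are hypothetical.

```python
import pandas as pd

# Sample built from the first five rows of the dataset as displayed in this notebook.
sample = pd.DataFrame({
    "lunch": ["standard", "standard", "standard", "free/reduced", "standard"],
    "test preparation course": ["none", "completed", "none", "none", "none"],
    "math score": [72, 69, 90, 47, 76],
    "reading score": [72, 90, 95, 57, 78],
    "writing score": [74, 88, 93, 44, 75],
})

score_cols = ["math score", "reading score", "writing score"]

# Assumption: every score should fall in the range 0 to 100 inclusive.
in_range = sample[score_cols].ge(0) & sample[score_cols].le(100)
invalid_rows = sample[~in_range.all(axis=1)]

# Assumption: the allowed categories are the values seen in the unique() outputs.
expected = {
    "lunch": {"standard", "free/reduced"},
    "test preparation course": {"none", "completed"},
}
unexpected = {
    col: set(sample[col].unique()) - allowed
    for col, allowed in expected.items()
}

print(f"Rows with out-of-range scores: {len(invalid_rows)}")
print(f"Unexpected category values: {unexpected}")
```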
