## Predicting Academic Performance Using Demographic and Behavioral Data

by Zhengling Jiang, Colombe Tolokin, Franklin Aryee, Tien Nguyen

Packages:

In [16]:
import pandas as pd
import altair as alt
from sklearn.model_selection import train_test_split

## Summary

## Introduction

Mathematics teaches us to think logically and builds analytical and problem-solving skills that apply across many academic and professional fields. However, student performance in mathematics can be influenced by many factors, including individual, social, and family factors. Research has shown that attributes such as study habits, age, and family background can significantly impact a student's academic success (Amuda, Bulus, and Joseph 2016; Modi 2023). Understanding these factors is crucial for improving educational outcomes.

In this study, we aim to address the following question: **“Can we predict a student's math academic performance based on demographic and behavioral data?”** Answering this question is important because understanding the factors behind student performance can help teachers support struggling students. Furthermore, the ability to predict academic performance could help schools develop educational strategies suited to students' different backgrounds.
The goal of this study is to develop a machine learning model capable of predicting a student's math performance with high accuracy.

## Methods & Results

The objective here is to prepare the data for our classification analysis by exploring relevant features and summarizing key insights through data wrangling and visualization.

### Data Loading, Wrangling and Summary

Let's start by loading the data and taking an initial look at the structure of the data set.

The file is a `.csv` file that uses `;` as its delimiter. Let's use `pandas` to read it in.

In [39]:
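# Load data
# Landing page of the data set (for reference only; not a direct CSV link)
df_url = "https://archive.ics.uci.edu/dataset/320/student+performance"
# The local copy is semicolon-delimited
student_performance = pd.read_csv('../data/student-mat.csv', delimiter=';')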
student_performance.head()

Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,...,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3
0,GP,F,18,U,GT3,A,4,4,at_home,teacher,...,4,3,4,1,1,3,6,5,6,6
1,GP,F,17,U,GT3,T,1,1,at_home,other,...,5,3,3,1,1,3,4,5,5,6
2,GP,F,15,U,LE3,T,1,1,at_home,other,...,4,3,2,2,3,3,10,7,8,10
3,GP,F,15,U,GT3,T,4,2,health,services,...,3,2,2,1,1,5,2,15,14,15
4,GP,F,16,U,GT3,T,3,3,other,other,...,4,3,2,1,2,5,4,6,10,10


This provides an overview of the data set with 33 columns, each representing student attributes such as age, gender, study time, grades, and parental details.

Let's get some information on the data set to better understand it.

In [25]:
student_performance.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 395 entries, 0 to 394
Data columns (total 33 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   school      395 non-null    object
 1   sex         395 non-null    object
 2   age         395 non-null    int64 
 3   address     395 non-null    object
 4   famsize     395 non-null    object
 5   Pstatus     395 non-null    object
 6   Medu        395 non-null    int64 
 7   Fedu        395 non-null    int64 
 8   Mjob        395 non-null    object
 9   Fjob        395 non-null    object
 10  reason      395 non-null    object
 11  guardian    395 non-null    object
 12  traveltime  395 non-null    int64 
 13  studytime   395 non-null    int64 
 14  failures    395 non-null    int64 
 15  schoolsup   395 non-null    object
 16  famsup      395 non-null    object
 17  paid        395 non-null    object
 18  activities  395 non-null    object
 19  nursery     395 non-null    object
 20  higher    

The data set contains 395 observations and 33 columns covering different aspects of student demographics, academic and behavioral traits.

We can see that there are no missing values, so there is no need to handle NAs.
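The missing-value check can also be made explicit rather than read off the `info()` output. The following is a minimal sketch, assuming a small hypothetical excerpt (`sample_csv`, with only four of the 33 columns) in the same `;`-delimited layout as the real file; the names `sample`, `missing_per_column`, and `total_missing` are illustrative and not part of the original notebook.

```python
import io

import pandas as pd

# Hypothetical three-row excerpt in the same `;`-delimited layout as the
# real file; only a few of the 33 columns are included for illustration.
sample_csv = io.StringIO(
    "school;sex;age;G3\n"
    "GP;F;18;6\n"
    "GP;F;17;6\n"
    "GP;F;15;10\n"
)
sample = pd.read_csv(sample_csv, sep=";")

# Count missing values per column, then across the whole frame.
missing_per_column = sample.isna().sum()
total_missing = int(missing_per_column.sum())

# Fail loudly if anything is missing instead of relying on a visual check.
assert total_missing == 0, f"Found {total_missing} missing values"
print(missing_per_column)
```

Applied to the full data frame, the same two lines (`isna().sum()` followed by a total) would cover all 33 columns at once.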

The data set includes categorical (school, sex, Mjob) and numerical (age, G1, G2, G3) features.
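One way to list the two groups of features programmatically is `DataFrame.select_dtypes`. This is a sketch on a hypothetical miniature frame (`sample`) that mirrors a few of the real columns; on the actual data the same calls would be applied to `student_performance`.

```python
import pandas as pd

# Hypothetical miniature frame mirroring a few of the real columns.
sample = pd.DataFrame({
    "school": ["GP", "GP", "GP"],
    "sex": ["F", "F", "F"],
    "Mjob": ["at_home", "at_home", "at_home"],
    "age": [18, 17, 15],
    "G3": [6, 6, 10],
})

# Non-numeric columns are treated as categorical, numeric ones as numerical.
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()
numerical_cols = sample.select_dtypes(include="number").columns.tolist()

print("Categorical:", categorical_cols)
print("Numerical:", numerical_cols)
```

Note that several numeric columns in this data set (for example `studytime` or `Dalc`) are ordinal codes rather than measurements, so the dtype alone does not settle how a feature should be treated in a model.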

There is a large range of features but not all of them are necessary for this analysis. Let's proceed and select only the necessary ones.

Let's select the following key columns:

- Demographic attributes: sex, age
- Academic attributes: studytime, failures, G1, G2, G3 (grades for three terms)
- Behavioral attributes: goout (socializing), Dalc (weekday alcohol consumption), Walc (weekend alcohol consumption)

In [35]:
# Necessary columns
columns = ['sex', 
           'age', 
           'studytime', 
           'failures', 
           'goout', 
           'Dalc', 
           'Walc', 
           'G1', 
           'G2', 
           'G3']
student_performance_df = student_performance[columns]
student_performance_df.isnull().sum()

sex          0
age          0
studytime    0
failures     0
goout        0
Dalc         0
Walc         0
G1           0
G2           0
G3           0
dtype: int64

In [36]:
student_performance_df.head()

Unnamed: 0,sex,age,studytime,failures,goout,Dalc,Walc,G1,G2,G3
0,F,18,2,0,4,1,1,5,6,6
1,F,17,2,0,3,1,1,5,5,6
2,F,15,2,3,2,2,3,7,8,10
3,F,15,3,0,2,1,1,15,14,15
4,F,16,2,0,2,1,2,6,10,10


Let's get a summary of the subset we are going to use for the analysis.

In [37]:
student_performance_df.describe()

Unnamed: 0,age,studytime,failures,goout,Dalc,Walc,G1,G2,G3
count,395.0,395.0,395.0,395.0,395.0,395.0,395.0,395.0,395.0
mean,16.696203,2.035443,0.334177,3.108861,1.481013,2.291139,10.908861,10.713924,10.41519
std,1.276043,0.83924,0.743651,1.113278,0.890741,1.287897,3.319195,3.761505,4.581443
min,15.0,1.0,0.0,1.0,1.0,1.0,3.0,0.0,0.0
25%,16.0,1.0,0.0,2.0,1.0,1.0,8.0,9.0,8.0
50%,17.0,2.0,0.0,3.0,1.0,2.0,11.0,11.0,11.0
75%,18.0,2.0,0.0,4.0,2.0,3.0,13.0,13.0,14.0
max,22.0,4.0,3.0,5.0,5.0,5.0,19.0,19.0,20.0


Key takeaways from summary statistics: 

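- Final grades `G3` range from `0` to `20`, with an average of around `10.42`.
- The average `studytime` value is about `2.04`. In the UCI data set description this column is an ordinal code from 1 to 4 (1: under 2 hours, 2: 2 to 5 hours, 3: 5 to 10 hours, 4: over 10 hours of weekly study), so the mean is a level on that scale rather than a number of hours.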
- Most students have zero reported failures.
- Alcohol consumption (Dalc and Walc) and socializing habits (goout) appear to vary across the student population.
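Because `failures` and `studytime` take only a handful of distinct values, frequency tables can back up these takeaways more directly than means. The sketch below uses hypothetical values (not taken from the data set) and assumes the `studytime` coding given in the UCI data set description; `studytime_labels` is an illustrative helper, not part of the original notebook.

```python
import pandas as pd

# Hypothetical values for illustration only; not taken from the data set.
sample = pd.DataFrame({
    "studytime": [2, 2, 2, 3, 1, 4],
    "failures": [0, 0, 3, 0, 0, 1],
})

# Share of students at each number of past failures.
failure_share = sample["failures"].value_counts(normalize=True).sort_index()

# Assumed coding of the ordinal `studytime` levels (weekly study time),
# as described in the UCI data set documentation.
studytime_labels = {
    1: "<2 hours",
    2: "2 to 5 hours",
    3: "5 to 10 hours",
    4: ">10 hours",
}
studytime_counts = sample["studytime"].map(studytime_labels).value_counts()

print(failure_share)
print(studytime_counts)
```

On the real data, `student_performance_df` would replace `sample`.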

Let's create a visualization to explore the final grades `G3` distribution. We will use a histogram as it allows us to see the spread.

In [38]:
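# Visualization of grade distributions
eda_plot1 = alt.Chart(student_performance_df).mark_bar().encode(
    x=alt.X('G3:Q', bin=True, title='Final Grades (G3)'),
    y=alt.Y('count()', title='Number of Students'),
    # Show the bin and its count on hover; a raw 'G3' tooltip would add an
    # extra grouping field and split the counts within each bin
    tooltip=[
        alt.Tooltip('G3:Q', bin=True, title='Final Grades (G3)'),
        alt.Tooltip('count()', title='Number of Students')
    ]
).properties(
    title='Distribution of Final Grades (G3)',
    width=400,
    height=200
)
eda_plot1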

**Figure 1: Distribution of Final Grades (G3)**

The histogram shows that most students achieve grades between 8 and 15, with fewer students scoring very low or very high. 
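A reading like this can be checked numerically by computing the share of grades that fall inside the stated range. The following is a minimal sketch on a hypothetical series of grades; the values and the names `grades` and `share_in_range` are illustrative only, and on the real data `student_performance_df['G3']` would take the place of `grades`.

```python
import pandas as pd

# Hypothetical final grades for illustration only.
grades = pd.Series([6, 6, 10, 15, 10, 0, 19, 11], name="G3")

# `between` includes both endpoints by default, so this marks 8 <= G3 <= 15.
in_range = grades.between(8, 15)

# The mean of a boolean series is the fraction of True values.
share_in_range = in_range.mean()
print(f"Share of grades between 8 and 15: {share_in_range:.2f}")
```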

### Analysis

## Results & Discussion

## References

Amuda, Bitrus Glawala, Apagu Kidlindila Bulus, and Hamsatu Pur Joseph. “Marital Status and Age as Predictors of Academic Performance of Students of Colleges of Education in the Nort- Eastern Nigeria.” American Journal of Educational Research, vol. 4, no. 12, 2016, pp. 896–902.

Modi, Y. G. “The Impact of Stress on Academic Performance: Strategies for High School Students.” International Journal of Psychiatry, vol. 8, no. 5, 2023, pp. 150–152. 