## Predicting Academic Performance Using Demographic and Behavioral Data


by Zhengling Jiang, Colombe Tolokin, Franklin Aryee, Tien Nguyen


Packages:


In [1]:
import pandas as pd
import altair as alt
import altair_ally as ally
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge

## Summary


## Introduction


Mathematics teaches us to think logically and builds analytical and problem-solving skills that apply across many academic and professional fields. However, student performance in mathematics can be influenced by many factors, including individual, social, and family factors. Research has shown that attributes such as study habits, age, and family background can significantly impact a student's academic success (Amuda, Bulus, and Joseph 2016; Modi 2023). Understanding these factors is crucial for improving educational outcomes.

In this study, we aim to address this question: **“Can we predict a student's academic performance in math based on demographic and behavioral data?”** Answering this question is important because understanding the factors behind student performance can help teachers support struggling students. Furthermore, the ability to predict academic performance could help schools develop educational strategies tailored to students' different backgrounds.
The goal of this study is to develop a machine learning model capable of predicting a student's math performance with high accuracy.


## Methods & Results


The objective here is to prepare the data for our regression analysis by exploring relevant features and summarizing key insights through data wrangling and visualization.


### Dataset Description

The full data set contains the following columns:

1. `school` - student's school (binary: 'GP' - Gabriel Pereira or 'MS' - Mousinho da Silveira)
1. `sex` - student's sex (binary: 'F' - female or 'M' - male)
1. `age` - student's age (numeric: from 15 to 22)
1. `address` - student's home address type (binary: 'U' - urban or 'R' - rural)
1. `famsize` - family size (binary: 'LE3' - less or equal to 3 or 'GT3' - greater than 3)
1. `Pstatus` - parent's cohabitation status (binary: 'T' - living together or 'A' - apart)
1. `Medu` - mother's education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)
1. `Fedu` - father's education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)
1. `Mjob` - mother's job (nominal: 'teacher', 'health' care related, civil 'services' (e.g. administrative or police), 'at_home' or 'other')
1. `Fjob` - father's job (nominal: 'teacher', 'health' care related, civil 'services' (e.g. administrative or police), 'at_home' or 'other')
1. `reason` - reason to choose this school (nominal: close to 'home', school 'reputation', 'course' preference or 'other')
1. `guardian` - student's guardian (nominal: 'mother', 'father' or 'other')
1. `traveltime` - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)
1. `studytime` - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
1. `failures` - number of past class failures (numeric: n if 1<=n<3, else 4)
1. `schoolsup` - extra educational support (binary: yes or no)
1. `famsup` - family educational support (binary: yes or no)
1. `paid` - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
1. `activities` - extra-curricular activities (binary: yes or no)
1. `nursery` - attended nursery school (binary: yes or no)
1. `higher` - wants to take higher education (binary: yes or no)
1. `internet` - Internet access at home (binary: yes or no)
1. `romantic` - with a romantic relationship (binary: yes or no)
1. `famrel` - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
1. `freetime` - free time after school (numeric: from 1 - very low to 5 - very high)
1. `goout` - going out with friends (numeric: from 1 - very low to 5 - very high)
1. `Dalc` - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
1. `Walc` - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
1. `health` - current health status (numeric: from 1 - very bad to 5 - very good)
1. `absences` - number of school absences (numeric: from 0 to 93)

These columns represent the grades:

- G1 - first period grade (numeric: from 0 to 20)
- G2 - second period grade (numeric: from 0 to 20)
- G3 - final grade (numeric: from 0 to 20, output target)

_Attribution_: The dataset variable description is copied as original from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/dataset/320/student+performance).


### Data Loading, Wrangling and Summary


Let's start by loading the data and taking an initial look at the structure of the data set.


The file is a `.csv` file that uses `;` as its delimiter. Let's use `pandas` to read it in.


In [2]:
!python ../src/download_data.py

File already existed, exitting script...


In [3]:
# Load data
student_performance = pd.read_csv('../data/raw/student-mat.csv', delimiter=';')
student_performance.head()

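| | school | sex | age | address | famsize | Pstatus | Medu | Fedu | Mjob | Fjob | ... | famrel | freetime | goout | Dalc | Walc | health | absences | G1 | G2 | G3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | GP | F | 18 | U | GT3 | A | 4 | 4 | at_home | teacher | ... | 4 | 3 | 4 | 1 | 1 | 3 | 6 | 5 | 6 | 6 |
| 1 | GP | F | 17 | U | GT3 | T | 1 | 1 | at_home | other | ... | 5 | 3 | 3 | 1 | 1 | 3 | 4 | 5 | 5 | 6 |
| 2 | GP | F | 15 | U | LE3 | T | 1 | 1 | at_home | other | ... | 4 | 3 | 2 | 2 | 3 | 3 | 10 | 7 | 8 | 10 |
| 3 | GP | F | 15 | U | GT3 | T | 4 | 2 | health | services | ... | 3 | 2 | 2 | 1 | 1 | 5 | 2 | 15 | 14 | 15 |
| 4 | GP | F | 16 | U | GT3 | T | 3 | 3 | other | other | ... | 4 | 3 | 2 | 1 | 2 | 5 | 4 | 6 | 10 | 10 |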


This provides an overview of the data set with 33 columns, each representing student attributes such as age, gender, study time, grades, and parental details.


Let's get some information on the data set to better understand it.


In [4]:
student_performance.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 395 entries, 0 to 394
Data columns (total 33 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   school      395 non-null    object
 1   sex         395 non-null    object
 2   age         395 non-null    int64 
 3   address     395 non-null    object
 4   famsize     395 non-null    object
 5   Pstatus     395 non-null    object
 6   Medu        395 non-null    int64 
 7   Fedu        395 non-null    int64 
 8   Mjob        395 non-null    object
 9   Fjob        395 non-null    object
 10  reason      395 non-null    object
 11  guardian    395 non-null    object
 12  traveltime  395 non-null    int64 
 13  studytime   395 non-null    int64 
 14  failures    395 non-null    int64 
 15  schoolsup   395 non-null    object
 16  famsup      395 non-null    object
 17  paid        395 non-null    object
 18  activities  395 non-null    object
 19  nursery     395 non-null    object
 20  higher    

The data set contains 395 observations and 33 columns covering different aspects of student demographics, academic and behavioral traits.


We can see that there are no missing values, so there is no need to handle NAs.
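The non-null counts from `info()` can also be confirmed with an explicit check. The snippet below is a minimal sketch on a hypothetical toy frame (`toy_df` and its values are illustrative, not taken from the student data); on the real data the same calls would presumably be made on `student_performance`.

```python
import pandas as pd

# Toy frame standing in for the student data (values are illustrative only)
toy_df = pd.DataFrame(
    {
        "sex": ["F", "M", None],
        "age": [15, 16, 17],
        "G3": [10.0, None, 12.0],
    }
)

# Count missing values per column; a column with 0 needs no imputation
missing_per_column = toy_df.isna().sum()
print(missing_per_column)

# Single boolean answer for the whole frame
has_missing = bool(toy_df.isna().any().any())
print(has_missing)
```

If `has_missing` comes back `False` on the real data, skipping an imputation step is reasonable.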


The data set includes categorical (school, sex, Mjob) and numerical (age, G1, G2, G3) features.


There is a large range of features but not all of them are necessary for this analysis. Let's proceed and select only the necessary ones.


Let's select the following key columns:

- Demographic attributes: sex, age
- Academic Attributes: studytime, failures, G1, G2, G3 (grades for three terms)
- Behavioral Attributes: goout (socializing), Dalc (weekday alcohol consumption), Walc (weekend alcohol consumption)

We will also split the data set into train and test sets with an 80/20 ratio, setting `random_state=123` for reproducibility.


In [5]:
# Necessary columns
columns = ['sex', 
           'age', 
           'studytime', 
           'failures', 
           'goout', 
           'Dalc', 
           'Walc', 
           'G1', 
           'G2', 
           'G3']

subset_df = student_performance[columns]

train_df, test_df = train_test_split(
    subset_df, test_size=0.2, random_state=123
)

In [6]:
train_df.head()

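| | sex | age | studytime | failures | goout | Dalc | Walc | G1 | G2 | G3 |
|---|---|---|---|---|---|---|---|---|---|---|
| 288 | M | 18 | 3 | 0 | 4 | 1 | 3 | 15 | 14 | 14 |
| 6 | M | 16 | 2 | 0 | 4 | 1 | 1 | 12 | 12 | 11 |
| 226 | F | 17 | 2 | 0 | 4 | 1 | 3 | 16 | 15 | 15 |
| 319 | F | 18 | 2 | 0 | 4 | 3 | 3 | 11 | 11 | 11 |
| 216 | F | 17 | 2 | 2 | 5 | 2 | 4 | 6 | 6 | 4 |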


In [7]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 316 entries, 288 to 365
Data columns (total 10 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   sex        316 non-null    object
 1   age        316 non-null    int64 
 2   studytime  316 non-null    int64 
 3   failures   316 non-null    int64 
 4   goout      316 non-null    int64 
 5   Dalc       316 non-null    int64 
 6   Walc       316 non-null    int64 
 7   G1         316 non-null    int64 
 8   G2         316 non-null    int64 
 9   G3         316 non-null    int64 
dtypes: int64(9), object(1)
memory usage: 27.2+ KB


Let's get a summary of the training set we are going to use for the analysis.


In [8]:
train_df.describe()

| | age | studytime | failures | goout | Dalc | Walc | G1 | G2 | G3 |
|---|---|---|---|---|---|---|---|---|---|
| count | 316.0 | 316.0 | 316.0 | 316.0 | 316.0 | 316.0 | 316.0 | 316.0 | 316.0 |
| mean | 16.756329 | 2.050633 | 0.360759 | 3.098101 | 1.471519 | 2.306962 | 10.835443 | 10.601266 | 10.262658 |
| std | 1.290056 | 0.860398 | 0.770227 | 1.11833 | 0.855874 | 1.258904 | 3.252078 | 3.756797 | 4.522676 |
| min | 15.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 4.0 | 0.0 | 0.0 |
| 25% | 16.0 | 1.0 | 0.0 | 2.0 | 1.0 | 1.0 | 8.0 | 8.75 | 8.0 |
| 50% | 17.0 | 2.0 | 0.0 | 3.0 | 1.0 | 2.0 | 11.0 | 11.0 | 11.0 |
| 75% | 18.0 | 2.0 | 0.0 | 4.0 | 2.0 | 3.0 | 13.0 | 13.0 | 13.0 |
| max | 22.0 | 4.0 | 3.0 | 5.0 | 5.0 | 5.0 | 19.0 | 19.0 | 20.0 |


Key takeaways from summary statistics:

- Final grades `G3` range from `0` to `20`, with an average of around `10.26`.
- The average `studytime` code is about `2.05` on the 1 to 4 scale, which falls in the 2 to 5 hours per week bracket.
- Most students have zero reported failures.
- Alcohol consumption (Dalc and Walc) and socializing habits (goout) appear to vary across the student population.
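Because `studytime` is stored as an ordinal code rather than a number of hours, it can be easier to read as a share of students per bracket than as a mean. The sketch below uses the labels from the dataset description; the `codes` series is a hypothetical sample, not values from the real data.

```python
import pandas as pd

# Label mapping taken from the dataset description above
STUDYTIME_LABELS = {
    1: "<2 hours",
    2: "2 to 5 hours",
    3: "5 to 10 hours",
    4: ">10 hours",
}

# Hypothetical sample of coded values (not taken from the real data)
codes = pd.Series([1, 2, 2, 3, 2, 4], name="studytime")

# Share of students in each weekly study-time bracket
bracket_share = codes.map(STUDYTIME_LABELS).value_counts(normalize=True)
print(bracket_share)

# Bracket with the largest share of students
most_common = bracket_share.idxmax()
print(most_common)
```

On the real data, `codes` would presumably be replaced by `train_df['studytime']`.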


Let's create a visualization to explore the final grades `G3` distribution. We will use a histogram as it allows us to see the spread.


In [9]:
# Visualization of grade distributions
eda_plot1 = alt.Chart(train_df).mark_bar().encode(
    x=alt.X('G3:Q', bin=True, title='Final Grades (G3)'),
    y=alt.Y('count()', title='Number of Students'),
    tooltip=['G3']
).properties(
    title='Distribution of Final Grades (G3)',
    width=400,
    height=200
)
eda_plot1 

**Figure 1: Distribution of Final Grades (G3)**


The histogram shows that most students achieve grades between 8 and 15, with fewer students scoring very low or very high.


In [10]:
ally.dist(train_df).properties(title="Density Plot for all numeric columns")

**Figure 2: Density plot for each numeric column (including the target `G3`)**

Some interesting observations:

- The distributions of the grades `G3`, `G2`, and `G1` are somewhat bell-shaped.
- Most students report very low alcohol consumption.
- Most students study around 2 to 5 hours a week, and most have not failed any previous classes.


In [11]:
ally.corr(train_df).properties(title="Correlation matrices for each numeric column pair")

**Figure 3: Correlation matrices for each pair of numeric columns (including the target `G3`)**

Some interesting observations:

- The grades are highly correlated with one another.
- Alcohol consumption is somewhat negatively correlated with grades.
- Study time is somewhat positively correlated with grades.
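To put numbers on these visual impressions, the correlation of each feature with the target can be computed directly in `pandas`. This is a minimal sketch on a hypothetical mini-sample (`toy_df` is illustrative only), and it uses Pearson correlation, which may differ from the measures shown in the `altair_ally` plot.

```python
import pandas as pd

# Hypothetical mini-sample; the real analysis would use train_df
toy_df = pd.DataFrame(
    {
        "studytime": [1, 2, 2, 3, 4, 1],
        "Walc": [4, 3, 2, 2, 1, 5],
        "G1": [8, 10, 11, 13, 15, 7],
        "G3": [7, 10, 12, 13, 16, 6],
    }
)

# Pearson correlation of every numeric column with the target G3
corr_with_target = (
    toy_df.select_dtypes("number")
    .corr()["G3"]
    .drop("G3")
    .sort_values(ascending=False)
)
print(corr_with_target)
```

Since several features are ordinal codes, passing `method="spearman"` to `corr()` is a reasonable alternative.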


### Analysis


In [12]:
# Split features and target
X_train, y_train = (
    train_df.drop(columns=['G3']),
    train_df['G3']
)
X_test, y_test = (
    test_df.drop(columns=['G3']),
    test_df['G3'],
)

In [13]:
# Define categorical and numerical columns
categorical_feats = X_train.select_dtypes(include=['object']).columns
numeric_feats = X_train.select_dtypes(include=['int64']).columns

In [14]:
# Apply column transformers
preprocessor = make_column_transformer(    
    (StandardScaler(), numeric_feats),  # scaling on numeric features 
    (OneHotEncoder(), categorical_feats),  # OHE on categorical features
)

#  Make pipeline
pipe_lr = make_pipeline(preprocessor, Ridge())

In [15]:
# Define parameter grid
param_grid = {
    'ridge__alpha': [0.1, 1, 10, 100]
}

# Perform grid search with cross-validation
grid_search = GridSearchCV(pipe_lr, param_grid=param_grid, n_jobs=-1, return_train_score=True)
grid_search.fit(X_train, y_train)

In [16]:
# Best score
grid_search.best_score_

np.float64(0.8097236209574398)
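Since no `scoring` argument is passed to `GridSearchCV`, the reported `best_score_` should be the mean cross-validated R² (the default `score` of a `Ridge` regressor) over the default 5 folds, rather than a classification accuracy. The sketch below checks this on synthetic data, which is illustrative only and unrelated to the student data.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, cross_val_score

# Synthetic regression data (illustrative only, unrelated to the student data)
rng = np.random.default_rng(123)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=100)

# With scoring left unset, GridSearchCV falls back to the estimator's
# own score method, which for Ridge is R^2
search = GridSearchCV(Ridge(), param_grid={"alpha": [0.1, 1, 10]})
search.fit(X, y)

# Recompute the same quantity explicitly: mean R^2 over the default 5 folds
manual_r2 = cross_val_score(
    Ridge(alpha=search.best_params_["alpha"]), X, y, scoring="r2"
).mean()

print(search.best_score_, manual_r2)
```

Read this way, a score of about 0.81 would mean the model explains roughly 81% of the variance in `G3` on held-out folds.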

In [17]:
# Get the best hyperparameter value
grid_search.best_params_

{'ridge__alpha': 1}

In [18]:
# Define the best model
best_model = grid_search.best_estimator_

In [19]:
pd.DataFrame(grid_search.cv_results_)[
    [
        "mean_test_score",
        "param_ridge__alpha",
        "mean_fit_time",
        "rank_test_score",
    ]
].set_index("rank_test_score").sort_index().T

| rank_test_score | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| mean_test_score | 0.809724 | 0.809702 | 0.808141 | 0.764852 |
| param_ridge__alpha | 1.0 | 0.1 | 10.0 | 100.0 |
| mean_fit_time | 0.003142 | 0.004755 | 0.003367 | 0.002878 |


In [20]:
# Apply best model on test set
y_pred = pd.DataFrame(best_model.predict(X_test), columns=['G3'])
y_pred.head()

| | G3 |
|---|---|
| 0 | 8.253686 |
| 1 | 12.96319 |
| 2 | 11.922497 |
| 3 | 5.186804 |
| 4 | 9.629061 |
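The predictions alone do not show how well the model generalizes. One way to score them against the true grades is sketched below. The `regression_report` helper is a hypothetical name introduced here, and the toy arrays are illustrative values rather than results from this analysis.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def regression_report(y_true, y_pred):
    """Return R^2, RMSE and MAE for a set of predictions."""
    rmse = float(np.sqrt(mean_squared_error(y_true, y_pred)))
    return {
        "r2": float(r2_score(y_true, y_pred)),
        "rmse": rmse,
        "mae": float(mean_absolute_error(y_true, y_pred)),
    }


# Hypothetical grades and predictions, used only to exercise the helper
toy_true = np.array([10, 12, 8, 15])
toy_pred = np.array([11, 12, 7, 14])
report = regression_report(toy_true, toy_pred)
print(report)
```

On the real data, the call would presumably be `regression_report(y_test, best_model.predict(X_test))`, with RMSE and MAE expressed in grade points on the 0 to 20 scale.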


## Results & Discussion


## References


Amuda, Bitrus Glawala, Apagu Kidlindila Bulus, and Hamsatu Pur Joseph. "Marital Status and Age as Predictors of Academic Performance of Students of Colleges of Education in the Nort- Eastern Nigeria." American Journal of Educational Research 4.12 (2016): 896-902.

Modi, Y. G. “The Impact of Stress on Academic Performance: Strategies for High School Students.” International Journal of Psychiatry, vol. 8, no. 5, 2023, pp. 150–152.
