PROJECT

# Investigating and Forecasting Students' Academic Achievement through Machine Learning
- **Student Name:** Malugu Sai Deepthi  
- **Student ID:** 23070109  
- **Email:** malugusaideepthi@gmail.com  
- **Course:** 2-Year Advanced Research (MSc Data Science)  
- **Module:** 7PAM2002 – Data Science Project  
- **Semester:** A 2025/2026  
- **Supervisor:** Vanadana Das 


**Project Aim**

To apply machine learning algorithms to predict students’ academic outcomes and identify key personal, social, and academic factors influencing their success.

**Research Question**

Which personal, social, and school-related factors most strongly determine students' academic performance, and can data science methods accurately predict their final grades?


**Dataset Information**
- **Source:** UCI Machine Learning Repository – *Student Performance Dataset*  
- **URL:** [https://archive.ics.uci.edu/ml/datasets/student+performance](https://archive.ics.uci.edu/ml/datasets/student+performance)  

In [16]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


In [18]:
# Load both datasets
math = pd.read_csv('student-mat.csv', sep=';')
port = pd.read_csv('student-por.csv', sep=';')

print("Math dataset shape:", math.shape)
print("Portuguese dataset shape:", port.shape)


Math dataset shape: (395, 33)
Portuguese dataset shape: (649, 33)


In [22]:
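# Merge the two subject tables on attributes assumed to identify the same student.
# The data has no student ID column, so distinct students who share all 13
# values would be paired with each other. The default inner join keeps only
# rows that have a match in both tables.
merge_cols = [
    'school', 'sex', 'age', 'address', 'famsize', 'Pstatus',
    'Medu', 'Fedu', 'Mjob', 'Fjob', 'reason', 'nursery', 'internet'
]

merged = pd.merge(math, port, on=merge_cols, suffixes=('_math', '_port'))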

print("Merged dataset shape:", merged.shape)
merged.head()


Merged dataset shape: (382, 53)


Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,reason,guardian_math,traveltime_math,studytime_math,failures_math,schoolsup_math,famsup_math,paid_math,activities_math,nursery,higher_math,internet,romantic_math,famrel_math,freetime_math,goout_math,Dalc_math,Walc_math,health_math,absences_math,G1_math,G2_math,G3_math,guardian_port,traveltime_port,studytime_port,failures_port,schoolsup_port,famsup_port,paid_port,activities_port,higher_port,romantic_port,famrel_port,freetime_port,goout_port,Dalc_port,Walc_port,health_port,absences_port,G1_port,G2_port,G3_port
0,GP,F,18,U,GT3,A,4,4,at_home,teacher,course,mother,2,2,0,yes,no,no,no,yes,yes,no,no,4,3,4,1,1,3,6,5,6,6,mother,2,2,0,yes,no,no,no,yes,no,4,3,4,1,1,3,4,0,11,11
1,GP,F,17,U,GT3,T,1,1,at_home,other,course,father,1,2,0,no,yes,no,no,no,yes,yes,no,5,3,3,1,1,3,4,5,5,6,father,1,2,0,no,yes,no,no,yes,no,5,3,3,1,1,3,2,9,11,11
2,GP,F,15,U,LE3,T,1,1,at_home,other,other,mother,1,2,3,yes,no,yes,no,yes,yes,yes,no,4,3,2,2,3,3,10,7,8,10,mother,1,2,0,yes,no,no,no,yes,no,4,3,2,2,3,3,6,12,13,12
3,GP,F,15,U,GT3,T,4,2,health,services,home,mother,1,3,0,no,yes,yes,yes,yes,yes,yes,yes,3,2,2,1,1,5,2,15,14,15,mother,1,3,0,no,yes,no,yes,yes,yes,3,2,2,1,1,5,0,14,14,14
4,GP,F,16,U,GT3,T,3,3,other,other,home,father,1,2,0,no,yes,yes,no,yes,yes,no,no,4,3,2,1,2,5,4,6,10,10,father,1,2,0,no,yes,no,no,yes,no,4,3,2,1,2,5,0,11,13,13
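Because the merge relies on attribute values rather than a student ID, it may be worth checking whether any key combination repeats within a table before trusting the row count. The sketch below is a minimal illustration on toy data: `math_toy`, `port_toy`, the shortened `keys` list, and the helper `count_duplicate_keys` are all hypothetical stand-ins, not part of the original notebook.

```python
import pandas as pd

# Toy stand-ins for the two subject tables (hypothetical values, not the real data).
math_toy = pd.DataFrame({
    "school": ["GP", "GP", "MS"],
    "sex": ["F", "F", "M"],
    "age": [17, 17, 16],
    "G3": [10, 12, 14],
})
port_toy = pd.DataFrame({
    "school": ["GP", "MS"],
    "sex": ["F", "M"],
    "age": [17, 16],
    "G3": [11, 13],
})
keys = ["school", "sex", "age"]  # shortened key list, for illustration only


def count_duplicate_keys(df, keys):
    """Return the number of rows whose key combination occurs more than once."""
    return int(df.duplicated(subset=keys, keep=False).sum())


dup_math = count_duplicate_keys(math_toy, keys)
dup_port = count_duplicate_keys(port_toy, keys)

# A plain inner merge silently pairs every duplicate-key row with each match.
toy_merged = pd.merge(math_toy, port_toy, on=keys, suffixes=("_math", "_port"))

# validate="one_to_one" makes pandas raise instead of silently multiplying rows.
try:
    pd.merge(math_toy, port_toy, on=keys, validate="one_to_one")
    is_one_to_one = True
except pd.errors.MergeError:
    is_one_to_one = False

print("Rows with repeated keys:", dup_math, dup_port)
print("Merged shape:", toy_merged.shape)
print("One-to-one merge:", is_one_to_one)
```

On the real tables, the same check would use `math`, `port`, and `merge_cols` from the cell above; a non-zero duplicate count would suggest that some merged rows may not correspond to a single student.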


In [8]:

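# Total number of missing cells across the whole frame
print("Missing values:", merged.isnull().sum().sum())

# Column dtypes and non-null counts
merged.info()

# Number of fully duplicated rows
print("Duplicates:", merged.duplicated().sum())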


Missing values: 0
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 382 entries, 0 to 381
Data columns (total 53 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   school          382 non-null    object
 1   sex             382 non-null    object
 2   age             382 non-null    int64 
 3   address         382 non-null    object
 4   famsize         382 non-null    object
 5   Pstatus         382 non-null    object
 6   Medu            382 non-null    int64 
 7   Fedu            382 non-null    int64 
 8   Mjob            382 non-null    object
 9   Fjob            382 non-null    object
 10  reason          382 non-null    object
 11  guardian_mat    382 non-null    object
 12  traveltime_mat  382 non-null    int64 
 13  studytime_mat   382 non-null    int64 
 14  failures_mat    382 non-null    int64 
 15  schoolsup_mat   382 non-null    object
 16  famsup_mat      382 non-null    object
 17  paid_mat        382 non-null    obje
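The three checks above (missing values, dtypes, duplicates) can also be collected into one compact table, which is easier to scan than the long `info()` listing. This is only a sketch on toy data; the helper name `quality_report` and the frame `toy` are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd


def quality_report(df):
    """Return one row per column with its dtype, missing count, and distinct count."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isnull().sum(),
        "n_unique": df.nunique(),  # NaN is not counted as a distinct value
    })


# Toy frame (hypothetical values): one missing age and one fully duplicated row.
toy = pd.DataFrame({
    "age": [15, 16, np.nan, 16],
    "school": ["GP", "GP", "MS", "GP"],
})

report = quality_report(toy)
n_duplicates = int(toy.duplicated().sum())

print(report)
print("Duplicate rows:", n_duplicates)
```

Applied to the notebook's data, the call would be `quality_report(merged)`.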

In [12]:
# Summary statistics for the numeric columns only
num_cols = merged.select_dtypes(include=np.number).columns
merged[num_cols].describe()


Unnamed: 0,age,Medu,Fedu,traveltime_mat,studytime_mat,failures_mat,famrel_mat,freetime_mat,goout_mat,Dalc_mat,...,famrel_por,freetime_por,goout_por,Dalc_por,Walc_por,health_por,absences_por,G1_por,G2_por,G3_por
count,382.0,382.0,382.0,382.0,382.0,382.0,382.0,382.0,382.0,382.0,...,382.0,382.0,382.0,382.0,382.0,382.0,382.0,382.0,382.0,382.0
mean,16.586387,2.806283,2.565445,1.442408,2.034031,0.290576,3.939791,3.222513,3.112565,1.473822,...,3.942408,3.230366,3.117801,1.47644,2.290576,3.575916,3.672775,12.112565,12.23822,12.515707
std,1.17347,1.086381,1.09624,0.695378,0.845798,0.729481,0.92162,0.988233,1.131927,0.886229,...,0.908884,0.985096,1.13371,0.886303,1.282577,1.404248,4.905965,2.556531,2.468341,2.945438
min,15.0,0.0,0.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,5.0,0.0
25%,16.0,2.0,2.0,1.0,1.0,0.0,4.0,3.0,2.0,1.0,...,4.0,3.0,2.0,1.0,1.0,3.0,0.0,10.0,11.0,11.0
50%,17.0,3.0,3.0,1.0,2.0,0.0,4.0,3.0,3.0,1.0,...,4.0,3.0,3.0,1.0,2.0,4.0,2.0,12.0,12.0,13.0
75%,17.0,4.0,4.0,2.0,2.0,0.0,5.0,4.0,4.0,2.0,...,5.0,4.0,4.0,2.0,3.0,5.0,6.0,14.0,14.0,14.0
max,22.0,4.0,4.0,4.0,4.0,3.0,5.0,5.0,5.0,5.0,...,5.0,5.0,5.0,5.0,5.0,5.0,32.0,19.0,19.0,19.0
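The summary shows a minimum of 0 for some grade columns. The notebook does not say whether a 0 is a genuine score or a placeholder, so it may be useful to count such rows before modelling. The sketch below uses a toy frame with hypothetical values, and the `_port` column names follow the suffixes in the merge cell.

```python
import pandas as pd

# Toy grades (hypothetical values, not the real data).
toy = pd.DataFrame({
    "G1_port": [0, 11, 12, 9],
    "G2_port": [5, 11, 13, 8],
    "G3_port": [0, 11, 12, 0],
})
grade_cols = ["G1_port", "G2_port", "G3_port"]

# Count of zero grades in each grading period
zero_counts = (toy[grade_cols] == 0).sum()

# Rows whose final grade is zero, for manual inspection
zero_final = toy[toy["G3_port"] == 0]

print(zero_counts)
print(zero_final)
```

How to treat these rows (keep, drop, or model separately) is a design choice that depends on what a zero means in the source data.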


In [10]:
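# Run this cell after the merge cell: `merged` must already exist in the kernel, otherwise a NameError is raised.
print(merged.dtypes)
# Ensure age is numeric; values that cannot be parsed become NaN
merged['age'] = pd.to_numeric(merged['age'], errors='coerce')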



NameError: name 'merged' is not defined
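For reference, `pd.to_numeric(..., errors='coerce')` replaces any value it cannot parse with `NaN` instead of raising. In the `info()` output above `age` is already `int64`, so the coercion would be expected to leave it unchanged; the sketch below uses a toy column with a hypothetical bad entry to show what happens when a value cannot be parsed.

```python
import pandas as pd

# Toy column (hypothetical values) mixing numeric strings, an int, and a bad entry.
toy = pd.DataFrame({"age": ["15", "16", "unknown", 17]})

toy["age"] = pd.to_numeric(toy["age"], errors="coerce")

# Unparseable entries are now NaN, so the column becomes floating point.
n_coerced = int(toy["age"].isna().sum())

print(toy["age"].tolist())
print("Values coerced to NaN:", n_coerced)
```

Counting the `NaN` values right after coercion is a simple way to notice whether any data was silently lost.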