# Predicting Student Performance in Secondary Education
**Team Members:** Abby Skillestad and Miles Mercer  
**Course:** CPSC 322, Fall 2025

## Dataset Description
The dataset contains student academic performance data from two Portuguese secondary schools. It includes demographic, social, and school-related variables, along with three grade measurements (G1, G2, G3). Each instance represents a single student. The UCI metadata describes the dataset as multivariate with 649 instances and 30 features; the `student-mat.csv` file analyzed in this notebook contains 395 students and 33 columns (30 features plus G1, G2, and G3).

### Source:
**Dataset:** UCI Machine Learning Repository – Student Performance Dataset  
**Link:** https://archive.ics.uci.edu/dataset/320/student+performance  
**Format:** Semicolon-delimited CSV file (`student-mat.csv`)  
**Contents:** Student attributes covering demographics, parental background, lifestyle, and study habits, plus the three grades (G1, G2, G3).
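
The loading cell later in this notebook passes `delimiter=';'` to `csv.reader` because the file separates fields with semicolons rather than commas. The sketch below illustrates why that matters, using a hypothetical two-line excerpt (the `sample` string and the `read_rows` helper are illustrative assumptions, not part of the project code):

```python
import csv
import io

# Hypothetical excerpt in the same semicolon-separated layout as the file
sample = "school;sex;age\nGP;F;18\n"

def read_rows(text, delimiter):
    """Parse delimited text and return a list of rows (lists of strings)."""
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))

comma_rows = read_rows(sample, ",")
semicolon_rows = read_rows(sample, ";")

print(len(comma_rows[0]))      # 1: the whole header is read as a single field
print(len(semicolon_rows[0]))  # 3: school, sex, age
```

With the default comma delimiter every line collapses into one field, so a column count of 1 is a quick signal that the wrong delimiter was used.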

### Attributes and Target Variable:
The dataset includes categorical and numeric variables describing a student’s background and school life (e.g., age, parents’ education, family size, study time, absences).  
**Target Variable:**  
- **G3** (final grade), used for classification (predicting pass/fail or grade categories).

### UCI Import:

In [34]:
# import ucimlrepo dataset - according to their documentation at https://pypi.org/project/ucimlrepo/
from ucimlrepo import fetch_ucirepo

# fetch dataset
student_performance = fetch_ucirepo(id=320)

# data (as pandas dataframes)
X = student_performance.data.features
y = student_performance.data.targets

# metadata
print(student_performance.metadata)

# variable information
print(student_performance.variables)

{'uci_id': 320, 'name': 'Student Performance', 'repository_url': 'https://archive.ics.uci.edu/dataset/320/student+performance', 'data_url': 'https://archive.ics.uci.edu/static/public/320/data.csv', 'abstract': 'Predict student performance in secondary education (high school). ', 'area': 'Social Science', 'tasks': ['Classification', 'Regression'], 'characteristics': ['Multivariate'], 'num_instances': 649, 'num_features': 30, 'feature_types': ['Integer'], 'demographics': ['Sex', 'Age', 'Other', 'Education Level', 'Occupation'], 'target_col': ['G1', 'G2', 'G3'], 'index_col': None, 'has_missing_values': 'no', 'missing_values_symbol': None, 'year_of_dataset_creation': 2008, 'last_updated': 'Fri Jan 05 2024', 'dataset_doi': '10.24432/C5TG7T', 'creators': ['Paulo Cortez'], 'intro_paper': {'ID': 360, 'type': 'NATIVE', 'title': 'Using data mining to predict secondary school student performance', 'authors': 'P. Cortez, A. M. G. Silva', 'venue': 'Proceedings of 5th Annual Future Business Technolo

In [39]:
import os
import sys
import numpy as np
import matplotlib.pyplot as plt
import csv

# Add mysklearn to path
sys.path.insert(0, os.path.join(os.getcwd(), "mysklearn"))

from mysklearn.mypytable import MyPyTable
from mysklearn.myclassifiers import MyKNeighborsClassifier, MyDecisionTreeClassifier, MyDummyClassifier
from mysklearn.myevaluation import train_test_split, kfold_split, stratified_kfold_split
from mysklearn.myevaluation import confusion_matrix, accuracy_score, classification_report
from mysklearn import myutils

# Load the dataset with semicolon delimiter
data_table = MyPyTable()

# Manual load with correct delimiter
with open("data/student-mat.csv", "r", encoding="utf-8") as infile:
    reader = csv.reader(infile, delimiter=';')
    data_table.column_names = next(reader)
    data_table.data = list(reader)

# Convert to numeric where possible
data_table.convert_to_numeric()

# Get shape manually
n_rows = len(data_table.data)
n_cols = len(data_table.column_names)

print(f"Dataset shape: ({n_rows}, {n_cols})")
print(f"Column names ({n_cols} total):")
print(data_table.column_names)
print("\nFirst 3 rows (first 10 columns shown):")
for row in data_table.data[:3]:
    print(row[:10], "...")

Dataset shape: (395, 33)
Column names (33 total):
['school', 'sex', 'age', 'address', 'famsize', 'Pstatus', 'Medu', 'Fedu', 'Mjob', 'Fjob', 'reason', 'guardian', 'traveltime', 'studytime', 'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery', 'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences', 'G1', 'G2', 'G3']

First 3 rows (first 10 columns shown):
['GP', 'F', 18.0, 'U', 'GT3', 'A', 4.0, 4.0, 'at_home', 'teacher'] ...
['GP', 'F', 17.0, 'U', 'GT3', 'T', 1.0, 1.0, 'at_home', 'other'] ...
['GP', 'F', 15.0, 'U', 'LE3', 'T', 1.0, 1.0, 'at_home', 'other'] ...
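
In the rows printed above, numeric fields appear as floats (`18.0`, `4.0`) while categorical fields remain strings. One plausible way a "convert where possible" step could behave is sketched below. `to_numeric_where_possible` is a hypothetical helper written for illustration; it is not the `MyPyTable.convert_to_numeric` implementation, which may differ.

```python
def to_numeric_where_possible(rows):
    """Return a copy of rows with float-convertible values converted to float."""
    converted = []
    for row in rows:
        new_row = []
        for value in row:
            try:
                new_row.append(float(value))
            except ValueError:
                # Not a number (e.g. 'GP'), so keep the original string
                new_row.append(value)
        converted.append(new_row)
    return converted

# Hypothetical raw rows, as they would look straight out of csv.reader
raw = [["GP", "F", "18", "U"], ["GP", "F", "17", "U"]]
print(to_numeric_where_possible(raw))
# [['GP', 'F', 18.0, 'U'], ['GP', 'F', 17.0, 'U']]
```

Converting to `float` rather than `int` would explain why whole-number attributes such as age print with a trailing `.0`.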


### Converting G3 to Pass/Fail Classification
We convert the final grade (G3) from a numeric scale (0 to 20) to a binary class label:
- **Pass**: G3 ≥ 10
- **Fail**: G3 < 10

This matches the Portuguese grading system, where 10 out of 20 is the minimum passing grade.

In [40]:
# Create pass/fail labels from G3
g3_index = data_table.column_names.index("G3")
pass_fail_labels = []

for row in data_table.data:
    if row[g3_index] >= 10:
        pass_fail_labels.append("pass")
    else:
        pass_fail_labels.append("fail")

# Add pass/fail column to the table
data_table.column_names.append("pass_fail")
for i, row in enumerate(data_table.data):
    row.append(pass_fail_labels[i])

# Check class distribution
pass_count = pass_fail_labels.count("pass")
fail_count = pass_fail_labels.count("fail")
print("Class Distribution:")
print(f"  Pass: {pass_count} ({pass_count/len(pass_fail_labels)*100:.1f}%)")
print(f"  Fail: {fail_count} ({fail_count/len(pass_fail_labels)*100:.1f}%)")

Class Distribution:
  Pass: 265 (67.1%)
  Fail: 130 (32.9%)
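
The classes are imbalanced, so accuracy figures should be read against a majority-class baseline: a classifier that always predicts "pass" would already be correct for about two thirds of the students. A minimal sketch of that arithmetic, rebuilding the label list from the counts printed above (the variable names are illustrative):

```python
from collections import Counter

# Counts taken from the class distribution printed above
labels = ["pass"] * 265 + ["fail"] * 130

counts = Counter(labels)
majority_label, majority_count = counts.most_common(1)[0]
baseline_accuracy = majority_count / len(labels)

print(majority_label)              # pass
print(f"{baseline_accuracy:.3f}")  # 0.671
```

Any model evaluated on this label should be compared against roughly 67% accuracy rather than against 50%.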


## Exploratory Data Analysis

### Summary Statistics
We begin by computing summary statistics for the numeric attributes in our dataset.

In [41]:
# Get numeric columns for summary statistics
numeric_cols = ["age", "Medu", "Fedu", "traveltime", "studytime", "failures", 
                "famrel", "freetime", "goout", "Dalc", "Walc", "health", "absences",
                "G1", "G2", "G3"]

summary_stats = data_table.compute_summary_statistics(numeric_cols)

print("Summary Statistics:")
print(f"{'Attribute':<15} {'Min':<8} {'Max':<8} {'Mid':<8} {'Avg':<8} {'Median':<8}")
print("-" * 70)
for row in summary_stats.data:
    print(f"{row[0]:<15} {row[1]:<8.2f} {row[2]:<8.2f} {row[3]:<8.2f} {row[4]:<8.2f} {row[5]:<8.2f}")

Summary Statistics:
Attribute       Min      Max      Mid      Avg      Median  
----------------------------------------------------------------------
age             15.00    22.00    18.50    16.70    17.00   
Medu            0.00     4.00     2.00     2.75     3.00    
Fedu            0.00     4.00     2.00     2.52     2.00    
traveltime      1.00     4.00     2.50     1.45     1.00    
studytime       1.00     4.00     2.50     2.04     2.00    
failures        0.00     3.00     1.50     0.33     0.00    
famrel          1.00     5.00     3.00     3.94     4.00    
freetime        1.00     5.00     3.00     3.24     3.00    
goout           1.00     5.00     3.00     3.11     3.00    
Dalc            1.00     5.00     3.00     1.48     1.00    
Walc            1.00     5.00     3.00     2.29     2.00    
health          1.00     5.00     3.00     3.55     4.00    
absences        0.00     75.00    37.50    5.71     4.00    
G1              3.00     19.00    11.00    10.91    11.
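
In the table above, the Mid column equals (Min + Max) / 2 for every row shown; for age, (15 + 22) / 2 = 18.5. The sketch below computes the same five statistics for a single list of values. `summarize` is a hypothetical helper for illustration, not the `compute_summary_statistics` method, and the sample values are made up.

```python
import statistics

def summarize(values):
    """Return min, max, midrange, mean, and median for a list of numbers."""
    low, high = min(values), max(values)
    return {
        "min": low,
        "max": high,
        "mid": (low + high) / 2,  # midpoint of the range, not the median
        "avg": statistics.mean(values),
        "median": statistics.median(values),
    }

# Hypothetical ages, chosen so min and max match the table's age row
sample_ages = [15, 16, 17, 18, 22]
stats = summarize(sample_ages)
print(stats["mid"], stats["avg"], stats["median"])  # 18.5 17.6 17
```

A large gap between Mid and Avg, as in the absences row (37.50 versus 5.71), suggests that a few large values are stretching the range.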

## Citations
Cortez, P. Student Performance Dataset. UCI Machine Learning Repository. https://archive.ics.uci.edu/dataset/320/student+performance (DOI: 10.24432/C5TG7T)