# Predicting Student Performance in Secondary Education
**Team Members:** Abby Skillestad and Miles Mercer  
**Course:** CPSC 322, Fall 2025

## Dataset Description
The dataset contains student academic performance data from two Portuguese secondary schools. It includes demographic, social, and school-related variables, along with three grade measurements (G1, G2, G3). Each instance represents a single student. The dataset is multivariate with 649 instances and 30 features.

### Source:
**Dataset:** UCI Machine Learning Repository – Student Performance Dataset  
**Link:** https://archive.ics.uci.edu/dataset/320/student+performance  
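**Format:** CSV files (`student-mat.csv` for Mathematics and `student-por.csv` for Portuguese language); the 649-instance count above corresponds to `student-por.csv`  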
**Contents:** Each file contains student attributes such as demographics, parental background, lifestyle, study habits, and academic performance (G1, G2, G3).

### Attributes and Target Variable:
The dataset includes categorical and numeric variables describing a student’s background and school life (e.g., age, parents’ education, family size, study time, absences).

**Target Variable:**  
- **G3** (final grade), used for either regression (predict numeric grade) or classification (predict pass/fail or grade categories).
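
As a rough sketch of the two classification framings, the snippet below derives a pass/fail label and a coarse grade band from G3. The pass mark of 10 (on the dataset's 0 to 20 scale), the bin edges, and the band names are assumptions for illustration; the proposal does not fix them. The grade values are made up.

```python
import pandas as pd

# Hypothetical sample of final grades on the 0-20 scale (not real data).
g3 = pd.Series([0, 8, 10, 13, 17, 20], name="G3")

# Assumption: a pass mark of 10; the proposal does not specify one.
PASS_THRESHOLD = 10

# Binary target: 1 = pass, 0 = fail.
passed = (g3 >= PASS_THRESHOLD).astype(int)

# Multi-class alternative: bin grades into ordered categories.
# Bin edges and labels are illustrative only.
grade_band = pd.cut(
    g3,
    bins=[-1, 9, 13, 20],
    labels=["fail", "satisfactory", "good"],
)

print(pd.DataFrame({"G3": g3, "passed": passed, "band": grade_band}))
```

Keeping G3 numeric instead leaves the task as regression, so the same column can serve either framing.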

## Implementation / Technical Merit
The project will load the dataset, encode categorical variables, and perform exploratory data analysis. Candidate models depend on the task: linear regression for predicting the numeric grade, logistic regression for predicting categories, and decision trees, random forests, and gradient boosting for either. Models will be compared using RMSE for regression and accuracy or F1 score for classification.
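
One possible shape for the regression workflow is sketched below with scikit-learn: one-hot encode the categorical columns, fit a model, and report RMSE on a held-out split. The data is synthetic and only stands in for the real features; the column names, the choice of linear regression, and the 75/25 split are assumptions, not the project's final design.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-in for the real features (values are made up).
X = pd.DataFrame({
    "school": rng.choice(["GP", "MS"], size=n),
    "sex": rng.choice(["F", "M"], size=n),
    "studytime": rng.integers(1, 5, size=n),
    "absences": rng.integers(0, 30, size=n),
})
# Synthetic target on the 0-20 grade scale.
y = (8 + 2 * X["studytime"] - 0.1 * X["absences"]
     + rng.normal(0, 1.5, size=n)).clip(0, 20)

categorical = ["school", "sex"]

# One-hot encode categorical columns; pass numeric columns through unchanged.
preprocess = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), categorical)],
    remainder="passthrough",
)
model = Pipeline([("preprocess", preprocess), ("regressor", LinearRegression())])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
model.fit(X_train, y_train)

# RMSE computed as the square root of the mean squared error.
rmse = float(np.sqrt(mean_squared_error(y_test, model.predict(X_test))))
print(f"Test RMSE: {rmse:.2f}")
```

Swapping the `regressor` step for a tree-based model, or for a classifier scored with accuracy or F1, leaves the preprocessing unchanged.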

### Anticipated Challenges:
- **Correlation among grades:** G1 and G2 are strongly correlated with G3, so models that include them will find G3 much easier to predict than models that rely on background features alone.  
- **Potential class imbalance:** If converting G3 into categories, some groups may be underrepresented.  
- **Categorical encoding:** Many variables are categorical and require one-hot or ordinal encoding.  
- **Redundant or weak features:** Some attributes may offer little predictive power.
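
The first two challenges can be checked with a couple of pandas calls, sketched here on synthetic grades. The simulated G1, G2, and G3 values and the pass mark of 10 are assumptions for illustration, so the printed numbers say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 300

# Synthetic grades: each period drifts slightly from the previous one.
g1 = rng.integers(4, 19, size=n)
g2 = np.clip(g1 + rng.integers(-2, 3, size=n), 0, 20)
g3 = np.clip(g2 + rng.integers(-2, 3, size=n), 0, 20)
grades = pd.DataFrame({"G1": g1, "G2": g2, "G3": g3})

# Pearson correlation matrix between the three grade columns.
corr = grades.corr()
print(corr.round(2))

# Share of each class after converting G3 to pass/fail
# (assumed pass mark of 10).
passed = (grades["G3"] >= 10).astype(int)
class_share = passed.value_counts(normalize=True)
print(class_share.round(2))
```

If one class dominates, accuracy alone can be misleading, which is one reason to report F1 alongside it.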

### Feature Selection:
To reduce the number of attributes, methods such as correlation analysis, mutual information scoring, recursive feature elimination, or model-based feature importance (e.g., from tree-based models) will be explored to identify the strongest predictors.
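
Two of these methods, mutual information scoring and tree-based feature importance, are sketched below with scikit-learn. The data is synthetic, the `noise` column is a hypothetical irrelevant feature added to show how a weak predictor ranks, and the model settings are arbitrary.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(2)
n = 400

# Synthetic features (values are made up); "noise" is unrelated to the target.
X = pd.DataFrame({
    "studytime": rng.integers(1, 5, size=n),
    "failures": rng.integers(0, 4, size=n),
    "noise": rng.normal(size=n),
})
y = 10 + 1.5 * X["studytime"] - 2.0 * X["failures"] + rng.normal(0, 1.0, size=n)

# Mutual information between each feature and the target.
# The boolean mask marks which columns are discrete.
mi = pd.Series(
    mutual_info_regression(
        X, y, discrete_features=np.array([True, True, False]), random_state=0
    ),
    index=X.columns,
)

# Impurity-based importances from a random forest.
forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
importance = pd.Series(forest.feature_importances_, index=X.columns)

ranking = pd.DataFrame({"mutual_info": mi, "rf_importance": importance})
print(ranking.sort_values("rf_importance", ascending=False).round(3))
```

Features that rank low under both scores are candidates for removal, though the two methods can disagree and neither accounts for every kind of interaction.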

## Potential Impact
The results can help schools identify students who may need academic support earlier in the year. Feature importance findings can inform educators about which social or academic factors contribute most to student performance.

### Why these results are useful:
They support early intervention, guide educational planning, and help schools allocate resources more effectively. They also provide insight into broader academic trends that could guide instruction and policy.

### Stakeholders:
- Teachers and school administrators  
- Students and parents  
- Educational policy makers  
- Researchers studying student learning and outcomes

## Citations
UCI Machine Learning Repository. Student Performance Dataset. https://archive.ics.uci.edu/dataset/320/student+performance

Python libraries such as pandas, numpy, matplotlib, and scikit-learn will be cited in the final report as needed.

### UCI Import:

```python
# Import the dataset with ucimlrepo, following https://pypi.org/project/ucimlrepo/
from ucimlrepo import fetch_ucirepo

# Fetch the dataset
student_performance = fetch_ucirepo(id=320)

# Data (as pandas DataFrames)
X = student_performance.data.features
y = student_performance.data.targets

# Metadata
print(student_performance.metadata)

# Variable information
print(student_performance.variables)
```

{'uci_id': 320, 'name': 'Student Performance', 'repository_url': 'https://archive.ics.uci.edu/dataset/320/student+performance', 'data_url': 'https://archive.ics.uci.edu/static/public/320/data.csv', 'abstract': 'Predict student performance in secondary education (high school). ', 'area': 'Social Science', 'tasks': ['Classification', 'Regression'], 'characteristics': ['Multivariate'], 'num_instances': 649, 'num_features': 30, 'feature_types': ['Integer'], 'demographics': ['Sex', 'Age', 'Other', 'Education Level', 'Occupation'], 'target_col': ['G1', 'G2', 'G3'], 'index_col': None, 'has_missing_values': 'no', 'missing_values_symbol': None, 'year_of_dataset_creation': 2008, 'last_updated': 'Fri Jan 05 2024', 'dataset_doi': '10.24432/C5TG7T', 'creators': ['Paulo Cortez'], 'intro_paper': {'ID': 360, 'type': 'NATIVE', 'title': 'Using data mining to predict secondary school student performance', 'authors': 'P. Cortez, A. M. G. Silva', 'venue': 'Proceedings of 5th Annual Future Business Technolo