# Student performance prediction based on previous performance

The dataset contains many attributes. Our aim is to train random forest models that predict G1, with separate versions for female and male students, then corresponding models that predict G2 for each group, and finally a linear regression model that predicts G3 from the G1 and G2 predictions or input data.
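
The staged design described above can be sketched as follows. This is only an illustration on synthetic data, not the project's training code: the feature matrix, the hyperparameters, and the use of a random forest for the G2 stage are assumptions made for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for one gender-specific subset (all values are made up).
n_samples, n_features = 200, 5
X = rng.integers(0, 5, size=(n_samples, n_features)).astype(float)
g1 = np.clip(10 + X[:, 0] - X[:, 1] + rng.normal(0, 1, n_samples), 0, 20)
g2 = np.clip(g1 + rng.normal(0, 1, n_samples), 0, 20)
g3 = np.clip(0.3 * g1 + 0.7 * g2 + rng.normal(0, 1, n_samples), 0, 20)

# Stage 1: a random forest predicts G1 from the background attributes.
g1_model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, g1)
g1_pred = g1_model.predict(X)

# Stage 2: a second model predicts G2 (a random forest here, by assumption).
g2_model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, g2)
g2_pred = g2_model.predict(X)

# Stage 3: linear regression predicts G3 from the two grade predictions.
stacked = np.column_stack([g1_pred, g2_pred])
g3_model = LinearRegression().fit(stacked, g3)
g3_pred = g3_model.predict(stacked)

print(g3_pred.shape)
```

For brevity the sketch fits and predicts on the same rows. A real run would evaluate each stage on held-out data, and would repeat the whole chain once per gender subset.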

## Attributes and Datasets
### List of basic attributes

| # | Attribute | Description |
|---|-----------|-------------|
| 1 | school | Student's school (binary: 'GP' - Gabriel Pereira or 'MS' - Mousinho da Silveira) |
| 2 | sex | Student's sex (binary: 'F' - female or 'M' - male) |
| 3 | age | Student's age (numeric: from 15 to 22) |
| 4 | address | Student's home address type (binary: 'U' - urban or 'R' - rural) |
| 5 | famsize | Family size (binary: 'LE3' - less or equal to 3 or 'GT3' - greater than 3) |
| 6 | Pstatus | Parent's cohabitation status (binary: 'T' - living together or 'A' - apart) |
| 7 | Medu | Mother's education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education) |
| 8 | Fedu | Father's education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education) |
| 9 | Mjob | Mother's job (nominal: 'teacher', 'health' care related, civil 'services', 'at_home' or 'other') |
| 10 | Fjob | Father's job (nominal: 'teacher', 'health' care related, civil 'services', 'at_home' or 'other') |
| 11 | reason | Reason to choose this school (nominal: close to 'home', school 'reputation', 'course' preference or 'other') |
| 12 | guardian | Student's guardian (nominal: 'mother', 'father' or 'other') |
| 13 | traveltime | Home to school travel time (numeric: 1 - <15 min., 2 - 15-30 min., 3 - 30-60 min., 4 - >60 min.) |
| 14 | studytime | Weekly study time (numeric: 1 - <2 hours, 2 - 2-5 hours, 3 - 5-10 hours, 4 - >10 hours) |
| 15 | failures | Number of past class failures (numeric: n if 1<=n<3, else 4) |
| 16 | schoolsup | Extra educational support (binary: yes or no) |
| 17 | famsup | Family educational support (binary: yes or no) |
| 18 | paid | Extra paid classes within the course subject (binary: yes or no) |
| 19 | activities | Extra-curricular activities (binary: yes or no) |
| 20 | nursery | Attended nursery school (binary: yes or no) |
| 21 | higher | Wants to take higher education (binary: yes or no) |
| 22 | internet | Internet access at home (binary: yes or no) |
| 23 | romantic | With a romantic relationship (binary: yes or no) |
| 24 | famrel | Quality of family relationships (numeric: from 1 - very bad to 5 - excellent) |
| 25 | freetime | Free time after school (numeric: from 1 - very low to 5 - very high) |
| 26 | goout | Going out with friends (numeric: from 1 - very low to 5 - very high) |
| 27 | Dalc | Workday alcohol consumption (numeric: from 1 - very low to 5 - very high) |
| 28 | Walc | Weekend alcohol consumption (numeric: from 1 - very low to 5 - very high) |
| 29 | health | Current health status (numeric: from 1 - very bad to 5 - very good) |
| 30 | absences | Number of school absences (numeric: from 0 to 93) |

**Grade-related attributes:**

| # | Attribute | Description |
|---|-----------|-------------|
| 31 | G1 | First period grade (numeric: from 0 to 20) |
| 32 | G2 | Second period grade (numeric: from 0 to 20) |
| 33 | G3 | Final grade (numeric: from 0 to 20, output target) |

### Datasets
- `mat.arff` - contains attributes, G1-G3 math results
- `por.arff` - contains attributes, G1-G3 Portuguese results
- `dataset.csv` - contains all attributes, G3 math and Portuguese results

## Data Wrangling for Student Performance Datasets

This notebook demonstrates data wrangling for the student performance datasets (`mat.arff`, `por.arff`, and `dataset.csv`). The processed data will be saved in the `processed_data` folder.

### Dependencies and frameworks
Load the required dependencies:

- [Pandas](https://pandas.pydata.org/) is a library for loading, wrangling, and visualising data.
- [sklearn](https://scikit-learn.org/stable/) is a machine learning framework. We use it here for wrangling, and it also applies to training and testing.
- [os, io](https://docs.python.org/3/library) are standard-library modules that ship with Python. They let us create, save, and edit files and treat strings as file-like objects.

In [23]:
# Import frameworks
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
import os
from io import StringIO


#### Load the datasets

In [24]:
# Load ARFF files as CSV
def load_arff_as_csv(filepath):
    with open(filepath, 'r') as file:
        lines = file.readlines()
    data_start = False
    data = []
    for line in lines:
        if data_start:
            data.append(line.strip())
        if line.strip().lower() == '@data':
            data_start = True
    return pd.read_csv(StringIO('\n'.join(data)), header=None)

# Load ARFF files
mat_df = load_arff_as_csv('data/mat.arff')
por_df = load_arff_as_csv('data/por.arff')

# Set column names for mat.arff
mat_columns = ['school', 'sex', 'age', 'address', 'famsize', 'Pstatus', 'Medu', 'Fedu', 'Mjob', 'Fjob', 'reason', 'guardian', 'traveltime', 'studytime', 'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery', 'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences', 'G1', 'G2', 'G3']
mat_df.columns = mat_columns

# Set column names for por.arff
por_columns = ['school', 'sex', 'age', 'address', 'famsize', 'Pstatus', 'Medu', 'Fedu', 'Mjob', 'Fjob', 'reason', 'guardian', 'traveltime', 'studytime', 'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery', 'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences', 'G1', 'G2', 'G3']
por_df.columns = por_columns

# Load CSV file (which is actually in ARFF format)
csv_columns = ['school', 'sex', 'age', 'address', 'famsize', 'Pstatus', 'Medu', 'Fedu', 'Mjob', 'Fjob', 'reason', 'guardian', 'traveltime', 'studytime', 'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery', 'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences', 'G3']

# Use the same ARFF loading function for dataset.csv
csv_df = load_arff_as_csv('data/dataset.csv')
csv_df.columns = csv_columns
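
The column lists above are typed out by hand. Since an ARFF header already declares its attributes, the names could instead be read from the `@attribute` lines. The sketch below shows that idea on an in-memory sample; `load_arff_with_header` and the sample text are hypothetical, and the parser assumes attribute names contain no spaces.

```python
from io import StringIO
import pandas as pd

def load_arff_with_header(text):
    """Parse ARFF text, taking column names from its @attribute lines."""
    columns, rows, in_data = [], [], False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith('%'):   # skip blanks and ARFF comments
            continue
        if in_data:
            rows.append(line)
        elif line.lower().startswith('@attribute'):
            # second token is the attribute name, possibly quoted
            columns.append(line.split()[1].strip("'\""))
        elif line.lower() == '@data':
            in_data = True
    return pd.read_csv(StringIO('\n'.join(rows)), header=None, names=columns)

# Made-up three-column sample in ARFF layout
sample = """@relation student
@attribute school {GP,MS}
@attribute sex {F,M}
@attribute age numeric
@data
GP,F,18
MS,M,17
"""
df = load_arff_with_header(sample)
print(df.columns.tolist())
```

Reading the names from the header keeps the column list in step with the file, so a header that lacks `G1` and `G2` would not need a separate hand-written list.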

#### Dealing with null values

In [25]:
# Remove Null values
def remove_nulls(df):
    df = df.dropna()
    return df

mat_df = remove_nulls(mat_df)
por_df = remove_nulls(por_df)
csv_df = remove_nulls(csv_df)

#### Remove Duplicates

In [26]:
# Remove duplicates
def remove_duplicates(df):
    df = df.drop_duplicates()
    return df

mat_df = remove_duplicates(mat_df)
por_df = remove_duplicates(por_df)
csv_df = remove_duplicates(csv_df)
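
Before dropping rows it can help to see how many are affected. The following sketch uses a made-up toy frame (not the real data) to count missing values and duplicates, then applies the same two steps as above in one chain.

```python
import numpy as np
import pandas as pd

# Toy frame with one duplicate row and two rows containing a missing value
toy = pd.DataFrame({
    'sex': ['F', 'M', 'M', None, 'F'],
    'age': [15, 16, 16, 17, np.nan],
})

print(toy.isna().sum())          # missing values per column
print(toy.duplicated().sum())    # number of repeated rows

# Same cleaning as remove_nulls followed by remove_duplicates
cleaned = toy.dropna().drop_duplicates().reset_index(drop=True)
print(len(toy), '->', len(cleaned))
```

The `reset_index(drop=True)` call is optional; it only renumbers the surviving rows so that later positional lookups are not surprised by gaps in the index.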

#### Replace data

In [27]:
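# Normalise a text column to lowercase so later mappings match consistently
def replace_data(df, column):
    df = df.copy()  # work on a copy to avoid modifying a slice of another frame
    df[column] = df[column].str.lower()
    return df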

mat_df = replace_data(mat_df, 'sex')
por_df = replace_data(por_df, 'sex')
csv_df = replace_data(csv_df, 'sex')

#### Remove outliers

In [28]:
# Remove outliers
def remove_outliers(df, column):
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    df = df[(df[column] >= Q1 - 1.5 * IQR) & (df[column] <= Q3 + 1.5 * IQR)]
    return df

mat_df = remove_outliers(mat_df, 'age')
por_df = remove_outliers(por_df, 'age')
csv_df = remove_outliers(csv_df, 'age')
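
To make the IQR rule concrete, here is a small worked example on made-up ages (not taken from the datasets). It applies the same bounds as `remove_outliers`: anything below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR is dropped.

```python
import pandas as pd

ages = pd.Series([15, 15, 16, 16, 17, 17, 18, 18, 19, 22], name='age')

q1 = ages.quantile(0.25)
q3 = ages.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# between() is inclusive on both ends, matching the >= and <= used above
kept = ages[ages.between(lower, upper)]
dropped = ages[~ages.between(lower, upper)]

print(f"Q1={q1}, Q3={q3}, IQR={iqr}, bounds=[{lower}, {upper}]")
print("dropped:", dropped.tolist())
```

For these ages Q1 is 16 and Q3 is 18, so the bounds are 13 and 21 and only the value 22 is removed. Note that the bounds depend on the data, so applying the filter a second time recomputes the quartiles and can, in general, remove further rows.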

#### Scaling features to a common range

In [29]:
# Scale features
# Scaling would normally look something like this:
# scaler = MinMaxScaler()
# mat_df[['age', 'absences', 'G3']] = scaler.fit_transform(mat_df[['age', 'absences', 'G3']])
# por_df[['age', 'absences', 'G3']] = scaler.fit_transform(por_df[['age', 'absences', 'G3']])
# csv_df[['age', 'absences', 'G3']] = scaler.fit_transform(csv_df[['age', 'absences', 'G3']])

# For this use case, however, the data is left unscaled
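
For reference, the sketch below shows what min-max scaling would do if it were enabled, using a made-up toy frame. Each column is mapped to the range 0 to 1 via (x - min) / (max - min). The notebook does not state why scaling is skipped; one plausible reason, offered here only as an assumption, is that tree-based models such as random forests split on thresholds and are largely unaffected by this kind of rescaling.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy values, not taken from the datasets
toy = pd.DataFrame({'age': [15, 18, 22], 'absences': [0, 10, 40]})

scaler = MinMaxScaler()
scaled = pd.DataFrame(scaler.fit_transform(toy), columns=toy.columns)

# The same result computed by hand: (x - min) / (max - min) per column
manual = (toy - toy.min()) / (toy.max() - toy.min())

print(scaled)
```

If scaling were used for a model that needs it, the scaler would normally be fitted on the training split only and then applied to the test split with `transform`.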

#### Save the wrangled data

In [30]:
# Create directory if it doesn't exist
os.makedirs('processed_data', exist_ok=True)

# Save the processed data files
mat_df.to_csv('processed_data/Pmat.csv', index=False)
por_df.to_csv('processed_data/Ppor.csv', index=False)
csv_df.to_csv('processed_data/Pdataset.csv', index=False)
# These files contain the cleaned data before encoding

### Encoding numerical and categorical values

In [None]:
def encode_categorical_variables(df):
    df = df.copy()
    
    # Binary and nominal encodings
    encodings = {
        'school': {'gp': 0, 'ms': 1},
        'sex': {'f': 0, 'm': 1},
        'address': {'u': 0, 'r': 1},
        'famsize': {'le3': 0, 'gt3': 1},
        'Pstatus': {'t': 0, 'a': 1},
        'Mjob': {'teacher': 0, 'health': 1, 'services': 2, 'at_home': 3, 'other': 4},
        'Fjob': {'teacher': 0, 'health': 1, 'services': 2, 'at_home': 3, 'other': 4},
        'reason': {'home': 0, 'reputation': 1, 'course': 2, 'other': 3},
        'guardian': {'mother': 0, 'father': 1, 'other': 2},
        'schoolsup': {'no': 0, 'yes': 1},
        'famsup': {'no': 0, 'yes': 1},
        'paid': {'no': 0, 'yes': 1},
        'activities': {'no': 0, 'yes': 1},
        'nursery': {'no': 0, 'yes': 1},
        'higher': {'no': 0, 'yes': 1},
        'internet': {'no': 0, 'yes': 1},
        'romantic': {'no': 0, 'yes': 1}
    }
    
    # Clean and encode each column
    for col, mapping in encodings.items():
        if col in df.columns:
            # convert to lowercase and strip any whitespace/quotes
            cleaned = df[col].str.lower().str.strip().str.strip("'")
            # encode
            df[col] = cleaned.map(mapping)
            
            # verify encoding worked
            if df[col].isna().any():
                print(f"Warning: NaN values found in {col} after encoding")
                print(f"Unique values before encoding: {cleaned.unique()}")
    
    return df
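
The mapping above assigns integer codes to nominal columns such as `Mjob`, which implies an ordering (teacher < health < services, and so on) that a linear model would treat as numeric. As a possible alternative, not the approach used in this notebook, nominal columns can be one-hot encoded while binary columns keep their 0/1 mapping. The toy frame below is made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    'sex': ['f', 'm', 'f'],
    'Mjob': ['teacher', 'other', 'health'],
})

# Binary column: a 0/1 mapping loses nothing
toy['sex'] = toy['sex'].map({'f': 0, 'm': 1})

# Nominal column: one indicator column per category instead of codes 0..4
encoded = pd.get_dummies(toy, columns=['Mjob'], prefix='Mjob', dtype=int)

print(encoded.columns.tolist())
```

The trade-off is a wider table: each nominal attribute becomes one column per category present in the data.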

## Split data by gender

In [32]:
def split_by_gender(df):
    """Split dataframe by gender after encoding"""
    female_df = df[df['sex'] == 0].copy()
    male_df = df[df['sex'] == 1].copy()
    return female_df, male_df

# Process Mathematics data
mat_df = remove_outliers(mat_df, 'age')
Pmat_full = mat_df.copy()
# Encode
Pmat_full = encode_categorical_variables(Pmat_full)
# Split by gender
PmatFE, PmatM = split_by_gender(Pmat_full)

# Process Portuguese data
por_df = remove_outliers(por_df, 'age')
Ppor_full = por_df.copy()
# Encode
Ppor_full = encode_categorical_variables(Ppor_full)
# Split by gender
PporFE, PporM = split_by_gender(Ppor_full)

# Save datasets
Pmat_full.to_csv('processed_data/Pmat_full.csv', index=False)
Ppor_full.to_csv('processed_data/Ppor_full.csv', index=False)
PmatFE.to_csv('processed_data/PmatFE.csv', index=False)
PmatM.to_csv('processed_data/PmatM.csv', index=False)
PporFE.to_csv('processed_data/PporFE.csv', index=False)
PporM.to_csv('processed_data/PporM.csv', index=False)

# Print verification statistics
print("\nDataset Statistics after Processing:")
print("\nMathematics Dataset:")
print(f"Total: {len(Pmat_full)} samples")
print(f"Female: {len(PmatFE)} ({len(PmatFE)/len(mat_df)*100:.1f}%)")
print(f"Male: {len(PmatM)} ({len(PmatM)/len(mat_df)*100:.1f}%)")

print("\nPortuguese Dataset:")
print(f"Total: {len(Ppor_full)} samples")
print(f"Female: {len(PporFE)} ({len(PporFE)/len(por_df)*100:.1f}%)")
print(f"Male: {len(PporM)} ({len(PporM)/len(por_df)*100:.1f}%)")

# Print sample of encoded values for verification
print("\nSample of encoded values:")
print(Pmat_full[['sex', 'school', 'address', 'Mjob']].head())


Dataset Statistics after Processing:

Mathematics Dataset:
Total: 394 samples
Female: 208 (52.8%)
Male: 186 (47.2%)

Portuguese Dataset:
Total: 648 samples
Female: 383 (59.1%)
Male: 265 (40.9%)

Sample of encoded values:
   sex  school  address  Mjob
0    0       0        0     3
1    0       0        0     3
2    0       0        0     3
3    0       0        0     1
4    0       0        0     4


#### Split the data into training and testing sets

The code below is kept for reference but is not used, because the data is split again in the later feature step.

```python
# Function to split and save datasets
def split_save_and_print(data, name, test_size=0.2, random_state=42):
    X_train, X_test, y_train, y_test = train_test_split(
        data.drop(['sex', 'G1', 'G2', 'G3'], axis=1),
        data[['G1', 'G2', 'G3']],
        test_size=test_size,
        random_state=random_state
    )
    
    # Save splits
    base_path = f'processed_data/{name.lower().replace(" ", "_")}'
    X_train.to_csv(f'{base_path}_X_train.csv', index=False)
    X_test.to_csv(f'{base_path}_X_test.csv', index=False)
    y_train.to_csv(f'{base_path}_y_train.csv', index=False)
    y_test.to_csv(f'{base_path}_y_test.csv', index=False)
    
    print(f"\n{name}:")
    print(f"Training: {X_train.shape[0]} samples")
    print(f"Testing: {X_test.shape[0]} samples")
    return X_train, X_test, y_train, y_test

print("Performing and saving train-test splits...")

# Mathematics splits (PmatFE and PmatM come from split_by_gender above)
X_train_mat_f, X_test_mat_f, y_train_mat_f, y_test_mat_f = split_save_and_print(PmatFE, "Mathematics Female")
X_train_mat_m, X_test_mat_m, y_train_mat_m, y_test_mat_m = split_save_and_print(PmatM, "Mathematics Male")

# Portuguese splits (PporFE and PporM come from split_by_gender above)
X_train_por_f, X_test_por_f, y_train_por_f, y_test_por_f = split_save_and_print(PporFE, "Portuguese Female")
X_train_por_m, X_test_por_m, y_train_por_m, y_test_por_m = split_save_and_print(PporM, "Portuguese Male")

print("All splits have been saved to the processed_data folder")
```
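
As a quick check of what an 80/20 split produces, the sketch below runs `train_test_split` on a made-up frame with 208 rows, the size of the female Mathematics subset reported above. The feature columns and values are hypothetical. Only the row counts and the fixed `random_state` mirror the settings used in this notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with the same target columns as the processed files (values are made up)
toy = pd.DataFrame({
    'age': [15 + i % 5 for i in range(208)],
    'studytime': [1 + i % 4 for i in range(208)],
    'G1': [i % 21 for i in range(208)],
    'G2': [(i + 1) % 21 for i in range(208)],
    'G3': [(i + 2) % 21 for i in range(208)],
})

X = toy.drop(['G1', 'G2', 'G3'], axis=1)
y = toy[['G1', 'G2', 'G3']]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)
```

With 208 rows and `test_size=0.2`, the test set size is rounded up to 42 rows and the remaining 166 go to training. Because `random_state` is fixed, rerunning the split selects the same rows each time.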