# Student secondary school performance based on primary school performance in math or Portuguese

The scope and aim of this model is to use random forest models and linear regression averaging to find the relationship between student primary school performance and secondary school performance, using a range of features such as-

## Attributes of our data, 
1. school - student's school (binary: 'GP' - Gabriel Pereira or 'MS' - Mousinho da Silveira).
2. sex - student's sex (binary: 'F' - female or 'M' - male)
3. age - student's age (numeric: from 15 to 22)
4. address - student's home address type (binary: 'U' - urban or 'R' - rural)
5. famsize - family size (binary: 'LE3' - less or equal to 3 or 'GT3' - greater than 3)
6. Pstatus - parent's cohabitation status (binary: 'T' - living together or 'A' - apart)
7. Medu - mother's education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
8. Fedu - father's education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
9. Mjob - mother's job (nominal: 'teacher', 'health' care related, civil 'services' (e.g. administrative or police), 'at_home' or 'other')
10. Fjob - father's job (nominal: 'teacher', 'health' care related, civil 'services' (e.g. administrative or police), 'at_home' or 'other')
11. reason - reason to choose this school (nominal: close to 'home', school 'reputation', 'course' preference or 'other')
12. guardian - student's guardian (nominal: 'mother', 'father' or 'other')
13. traveltime - home to school travel time (numeric: 1 - <15>1 hour)
14. studytime - weekly study time (numeric: 1 - <2>10 hours)
15. failures - number of past class failures (numeric: n if 1<=n<3, else 4)
16. schoolsup - extra educational support (binary: yes or no)
17. famsup - family educational support (binary: yes or no)
18. paid - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
19. activities - extra-curricular activities (binary: yes or no)
20. nursery - attended nursery school (binary: yes or no)
21. higher - wants to take higher education (binary: yes or no)
22. internet - Internet access at home (binary: yes or no)
23. romantic - with a romantic relationship (binary: yes or no)
24. famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
25. freetime - free time after school (numeric: from 1 - very low to 5 - very high)
26. goout - going out with friends (numeric: from 1 - very low to 5 - very high)
27. Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
28. Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
29. health - current health status (numeric: from 1 - very bad to 5 - very good)
30. absences - number of school absences (numeric: from 0 to 93)

these grades are related with the course subject, Math or Portuguese:
31. G1 - first period grade (numeric: from 0 to 20)
32. G2 - second period grade (numeric: from 0 to 20)
33. G3 - final grade (numeric: from 0 to 20, output target)

### (from these datasets)
    > mat.arff // just the math results for G1-G3
    > por.arff // just the portugese resutls for G1-G3
    > dataset.csv // the "combined" data set of por and math results, without G1 or G2, with only G3 as target

## Engineered features:
1. Gvg - G1 and G2 average grade ( integer of (G1 + G2) /2)
2. Avgalc - Average Dalc and Walc ( integer of (Dalc + Walc) /2)
3. Bum - A weighted sum of failures, absences, Dalc, Walc, inverted studytime, and freetime to indicate a student's tendency to fail, skip school, drink alcohol, not study, and have free time.

## Models and training methods
We will use a random forest decision tree model for G1-G2 results, deciding the importance of the attributes at catergorising students, and a linear regression average for hte final G3 target predition. Each model is made of 3 parts, and there will be two models for math and portugese, and a seperate model for female and male. 

This will be acheived with model A, trained on finding the relationship between attributes and G1, fitting students into performance via fitting into group 1 through to 5,
And thne model B trained on G2 results, doing the same decision training,
and then model C that will train on these two models and attempt to find the average to predict a final G3 result. 

In [41]:
# Import frameworks
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
import os
from io import StringIO

#### Load the datasets

In [42]:
# Load ARFF files as CSV
def load_arff_as_csv(filepath):
    with open(filepath, 'r') as file:
        lines = file.readlines()
    data_start = False
    data = []
    for line in lines:
        if data_start:
            data.append(line.strip())
        if line.strip().lower() == '@data':
            data_start = True
    return pd.read_csv(StringIO('\n'.join(data)), header=None)

# Load ARFF files
mat_df = load_arff_as_csv('data/mat.arff')
por_df = load_arff_as_csv('data/por.arff')

# Set column names for mat.arff
mat_columns = ['school', 'sex', 'age', 'address', 'famsize', 'Pstatus', 'Medu', 'Fedu', 'Mjob', 'Fjob', 'reason', 'guardian', 'traveltime', 'studytime', 'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery', 'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences', 'G1', 'G2', 'G3']
mat_df.columns = mat_columns

# Set column names for por.arff
por_columns = ['school', 'sex', 'age', 'address', 'famsize', 'Pstatus', 'Medu', 'Fedu', 'Mjob', 'Fjob', 'reason', 'guardian', 'traveltime', 'studytime', 'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery', 'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences', 'G1', 'G2', 'G3']
por_df.columns = por_columns

# Load CSV file (which is actually in ARFF format)
csv_columns = ['school', 'sex', 'age', 'address', 'famsize', 'Pstatus', 'Medu', 'Fedu', 'Mjob', 'Fjob', 'reason', 'guardian', 'traveltime', 'studytime', 'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery', 'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences', 'G3']

# Use the same ARFF loading function for dataset.csv
csv_df = load_arff_as_csv('data/dataset.csv')
csv_df.columns = csv_columns

#### Dealing with null values

In [43]:
# Remove Null values
def remove_nulls(df):
    df = df.dropna()
    return df

mat_df = remove_nulls(mat_df)
por_df = remove_nulls(por_df)
csv_df = remove_nulls(csv_df)

#### Remove Duplicates

In [44]:
# Remove duplicates
def remove_duplicates(df):
    df = df.drop_duplicates()
    return df

mat_df = remove_duplicates(mat_df)
por_df = remove_duplicates(por_df)
csv_df = remove_duplicates(csv_df)

#### Replace data

In [45]:
# Replace data
def replace_data(df, column):
    df[column] = df[column].apply(lambda x: x.lower())
    return df

mat_df = replace_data(mat_df, 'sex')
por_df = replace_data(por_df, 'sex')
csv_df = replace_data(csv_df, 'sex')

#### Remove outliers

In [46]:
# Remove outliers using IQR method across multiple columns
def remove_outliers(df, columns_to_check=['age', 'absences', 'G1', 'G2', 'G3']):
    df_clean = df.copy()
    original_rows = len(df_clean)
    
    print("Outlier removal statistics:")
    for col in columns_to_check:
        if col in df.columns and pd.api.types.is_numeric_dtype(df[col]):
            # Calculate IQR
            Q1 = df_clean[col].quantile(0.25)
            Q3 = df_clean[col].quantile(0.75)
            IQR = Q3 - Q1
            
            # Define bounds
            lower_bound = Q1 - 1.5 * IQR
            upper_bound = Q3 + 1.5 * IQR
            
            # Count outliers before removal
            outliers_count = df_clean[(df_clean[col] < lower_bound) | (df_clean[col] > upper_bound)].shape[0]
            
            # Remove outliers
            df_clean = df_clean[(df_clean[col] >= lower_bound) & (df_clean[col] <= upper_bound)]
            
            # Print statistics
            print(f"  • {col}: removed {outliers_count} outliers (bounds: {lower_bound:.2f} to {upper_bound:.2f})")
    
    # Print summary
    rows_removed = original_rows - len(df_clean)
    print(f"Total: {rows_removed} rows removed out of {original_rows} ({rows_removed/original_rows*100:.2f}%)")
    
    return df_clean

# Apply enhanced outlier detection to all datasets
numerical_columns = ['age', 'absences', 'G1', 'G2', 'G3', 'studytime', 'failures', 'Dalc', 'Walc']
mat_df = remove_outliers(mat_df, numerical_columns)
por_df = remove_outliers(por_df, numerical_columns)
csv_df = remove_outliers(csv_df, numerical_columns)

Outlier removal statistics:
  • age: removed 1 outliers (bounds: 13.00 to 21.00)
  • absences: removed 15 outliers (bounds: -12.00 to 20.00)
  • G1: removed 0 outliers (bounds: 0.50 to 20.50)
  • G2: removed 13 outliers (bounds: 3.00 to 19.00)
  • G3: removed 25 outliers (bounds: 1.50 to 21.50)
  • studytime: removed 24 outliers (bounds: -0.50 to 3.50)
  • failures: removed 54 outliers (bounds: 0.00 to 0.00)
  • Dalc: removed 12 outliers (bounds: -0.50 to 3.50)
  • Walc: removed 0 outliers (bounds: -2.00 to 6.00)
Total: 144 rows removed out of 395 (36.46%)
Outlier removal statistics:
  • age: removed 1 outliers (bounds: 13.00 to 21.00)
  • absences: removed 21 outliers (bounds: -9.00 to 15.00)
  • G1: removed 16 outliers (bounds: 5.50 to 17.50)
  • G2: removed 15 outliers (bounds: 5.50 to 17.50)
  • G3: removed 7 outliers (bounds: 4.00 to 20.00)
  • studytime: removed 31 outliers (bounds: -0.50 to 3.50)
  • failures: removed 78 outliers (bounds: 0.00 to 0.00)
  • Dalc: removed 19 outli

#### Scaling features to a common range

In [47]:
# Scale numerical features to a common range [0,1]
def scale_features(df, columns_to_scale):
    df_scaled = df.copy()
    
    # Filter only existing numerical columns
    valid_columns = [col for col in columns_to_scale 
                    if col in df.columns and pd.api.types.is_numeric_dtype(df[col])]
    
    if valid_columns:
        # Apply scaling
        scaler = MinMaxScaler()
        df_scaled[valid_columns] = scaler.fit_transform(df[valid_columns])
        print(f"Scaled features: {', '.join(valid_columns)}")
    
    return df_scaled

# Define which numerical columns should be scaled
numerical_features = [
    'age', 'Medu', 'Fedu', 'traveltime', 'studytime', 'failures',
    'famrel', 'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences'
]

# Apply scaling to all datasets
print("\nApplying feature scaling:")
mat_df = scale_features(mat_df, numerical_features)
por_df = scale_features(por_df, numerical_features)
csv_df = scale_features(csv_df, numerical_features)


Applying feature scaling:
Scaled features: age, Medu, Fedu, traveltime, studytime, failures, famrel, freetime, goout, Dalc, Walc, health, absences
Scaled features: age, Medu, Fedu, traveltime, studytime, failures, famrel, freetime, goout, Dalc, Walc, health, absences
Scaled features: age, Medu, Fedu, traveltime, studytime, failures, famrel, freetime, goout, Dalc, Walc, health, absences


#### Save the wrangled data

In [48]:
# Create directory if it doesn't exist
os.makedirs('processed_data', exist_ok=True)

# Save the processed data files
mat_df.to_csv('processed_data/Pmat_full.csv', index=False)
por_df.to_csv('processed_data/Ppor_full.csv', index=False)
csv_df.to_csv('processed_data/Pdataset.csv', index=False)
# Raw processed data

### Split by gender

In [49]:
# Create processed_data directory if it doesn't exist
os.makedirs('processed_data', exist_ok=True)

# Process Mathematics data
mat_df = remove_outliers(mat_df, 'age')
# Check unique values in sex column
print("Unique values in sex column (Math):", mat_df['sex'].unique())
PmatFE = mat_df[mat_df['sex'].str.contains('F', case=False)].copy()
PmatM = mat_df[mat_df['sex'].str.contains('M', case=False)].copy()

# Process Portuguese data
por_df = remove_outliers(por_df, 'age')
print("Unique values in sex column (Portuguese):", por_df['sex'].unique())
PporFE = por_df[por_df['sex'].str.contains('F', case=False)].copy()
PporM = por_df[por_df['sex'].str.contains('M', case=False)].copy()

# Save gender-split datasets
PmatFE.to_csv('processed_data/PmatFE.csv', index=False)
PmatM.to_csv('processed_data/PmatM.csv', index=False)
PporFE.to_csv('processed_data/PporFE.csv', index=False)
PporM.to_csv('processed_data/PporM.csv', index=False)

# Print verification statistics
print("\nGender Distribution after Outlier Removal:")
print("\nMathematics Dataset:")
print(f"Female: {len(PmatFE)} ({len(PmatFE)/len(mat_df)*100:.1f}%)")
print(f"Male: {len(PmatM)} ({len(PmatM)/len(mat_df)*100:.1f}%)")
print("\nPortuguese Dataset:")
print(f"Female: {len(PporFE)} ({len(PporFE)/len(por_df)*100:.1f}%)")
print(f"Male: {len(PporM)} ({len(PporM)/len(por_df)*100:.1f}%)")
print("\nFiles saved in processed_data folder")

Outlier removal statistics:
Total: 0 rows removed out of 251 (0.00%)
Unique values in sex column (Math): ["'f'" "'m'"]
Outlier removal statistics:
Total: 0 rows removed out of 461 (0.00%)
Unique values in sex column (Portuguese): ["'f'" "'m'"]

Gender Distribution after Outlier Removal:

Mathematics Dataset:
Female: 132 (52.6%)
Male: 119 (47.4%)

Portuguese Dataset:
Female: 282 (61.2%)
Male: 179 (38.8%)

Files saved in processed_data folder


### Split data for training and testing

In [50]:
# Function to split and save datasets
def split_save_and_print(data, name, test_size=0.2, random_state=42):
    X = data.drop(['sex', 'G1', 'G2', 'G3'], axis=1)
    y = data[['G1', 'G2', 'G3']]
    
    X_train, X_test, y_train, y_test = train_test_split(
        X, y,
        test_size=test_size,
        random_state=random_state
    )
    
    # Save splits with new naming convention
    base_path = 'processed_data'
    X_train.to_csv(f'{base_path}/X_{name}_train.csv', index=False)
    X_test.to_csv(f'{base_path}/X_{name}_test.csv', index=False)
    y_train.to_csv(f'{base_path}/Y_{name}_train.csv', index=False)
    y_test.to_csv(f'{base_path}/Y_{name}_test.csv', index=False)
    
    print(f"\n{name}:")
    print(f"Training: {X_train.shape[0]} samples")
    print(f"Testing: {X_test.shape[0]} samples")
    return X_train, X_test, y_train, y_test

print("Performing and saving train-test splits...")

# Mathematics splits
X_train_mat_f, X_test_mat_f, y_train_mat_f, y_test_mat_f = split_save_and_print(PmatFE, "PmatFE")
X_train_mat_m, X_test_mat_m, y_train_mat_m, y_test_mat_m = split_save_and_print(PmatM, "PmatM")

# Portuguese splits
X_train_por_f,X_train_por_f, X_test_por_f, y_train_por_f, y_test_por_f = split_save_and_print(PporFE, "PporFE")
X_train_por_m, X_test_por_m, y_train_por_m, y_test_por_m = split_save_and_print(PporM, "PporM")

print("\nAll splits have been saved to the processed_data folder")

Performing and saving train-test splits...

PmatFE:
Training: 105 samples
Testing: 27 samples

PmatM:
Training: 95 samples
Testing: 24 samples

PporFE:
Training: 225 samples
Testing: 57 samples


ValueError: not enough values to unpack (expected 5, got 4)