## Feature Engineering

### Engineered Features and Justifications

1. **Gvg** - Average of the G1 and G2 grades, truncated to an integer: `int((G1 + G2) / 2)`
   - Justification: Averaging the first and second period grades provides a more stable measure of a student's performance over time, reducing the impact of any single period's anomalies.

2. **Avgalc** - Average of Dalc and Walc, truncated to an integer: `int((Dalc + Walc) / 2)`
   - Justification: Combining workday and weekend alcohol consumption into a single average value gives a more comprehensive view of a student's overall alcohol consumption habits.

3. **Bum** - A weighted sum of failures, absences, Dalc, Walc, inverted studytime, and freetime, intended as a rough indicator of academic risk: a tendency to fail classes, skip school, drink alcohol, study little, and have a lot of free time.
   - Justification for weights:
     - Failures are given a higher weight (2) because past class failures are a strong indicator of academic struggles.
     - Absences are weighted at 1.5 as frequent absences can significantly impact academic performance.
     - Both Dalc and Walc are weighted at 1 as alcohol consumption can affect both health and academic performance.
     - Studytime is inverted (5 - studytime) and weighted at 1 because less study time can lead to poorer academic outcomes.
     - Freetime is weighted at 1 as more free time might indicate less focus on academics.
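
As a worked example of the three definitions above, the sketch below computes them for one hypothetical student record. The values are invented for illustration and are not taken from the dataset. Note that `int()` truncates, so a G1 of 11 and a G2 of 12 give a Gvg of 11 rather than 11.5.

```python
# Hypothetical student record; values are illustrative only.
student = {
    "G1": 11, "G2": 12,
    "Dalc": 1, "Walc": 4,
    "failures": 1, "absences": 4,
    "studytime": 2, "freetime": 3,
}

gvg = int((student["G1"] + student["G2"]) / 2)         # 23 / 2 = 11.5 -> 11
avgalc = int((student["Dalc"] + student["Walc"]) / 2)  # 5 / 2 = 2.5 -> 2

bum = (2.0 * student["failures"]            # 2.0
       + 1.5 * student["absences"]          # 6.0
       + 1.0 * student["Dalc"]              # 1.0
       + 1.0 * student["Walc"]              # 4.0
       + 1.0 * (5 - student["studytime"])   # 3.0
       + 1.0 * student["freetime"])         # 3.0

print(gvg, avgalc, bum)
```

For this record, absences contribute 6.0 of the 19.0 points, which shows how quickly that term grows relative to the others.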

## Import dependencies

In [1]:
import pandas as pd
import numpy as np
import os

#### Load the datasets

In [2]:
# Load processed datasets
mat_df = pd.read_csv('processed_data/Pmat.csv')
por_df = pd.read_csv('processed_data/Ppor.csv')
csv_df = pd.read_csv('processed_data/Pdataset.csv')

print("Datasets loaded successfully:")
print(f"Mathematics: {len(mat_df)} rows")
print(f"Portuguese: {len(por_df)} rows")
print(f"Combined: {len(csv_df)} rows")

Datasets loaded successfully:
Mathematics: 394 rows
Portuguese: 648 rows
Combined: 648 rows


#### Encoding categorical variables



1. Label encoding is used because:
    - It assigns a unique integer code to each category
    - The machine learning algorithms we'll use work with numeric inputs
    - Our categorical variables are nominal (no inherent order), so the integer codes are arbitrary identifiers and should not be read as rankings

2. Naming convention:
    - An '_encoded' suffix is added to encoded columns for clarity
    - Original columns are dropped to avoid redundancy
    - New files are saved with an 'FE' suffix to indicate Feature Engineered and Encoded

3. Verification:
    - Print statements confirm successful encoding
    - Column counts confirm that every categorical column has an encoded counterpart
    - New datasets hold all information in numeric format
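
To make the encoding step concrete, here is a minimal sketch of what `LabelEncoder` does to a single nominal column. The `Mjob` values are a made-up sample. Fitting one encoder on the pooled values of two frames is an illustrative variation rather than what the notebook's encoding cell does; it keeps the integer codes consistent across datasets. The codes follow the sorted order of the category names, so they carry no real ranking.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up sample frames standing in for two datasets.
df_a = pd.DataFrame({"Mjob": ["teacher", "other", "health"]})
df_b = pd.DataFrame({"Mjob": ["services", "other"]})

# Fit once on the pooled values so both frames share one mapping.
le = LabelEncoder()
le.fit(pd.concat([df_a["Mjob"], df_b["Mjob"]]))

df_a["Mjob_encoded"] = le.transform(df_a["Mjob"])
df_b["Mjob_encoded"] = le.transform(df_b["Mjob"])

# classes_ holds the sorted categories; the position of each is its code.
mapping = {str(label): code for code, label in enumerate(le.classes_)}
print(mapping)
```

Keeping `mapping` around also makes it possible to translate codes back to category names later, for example when interpreting a model.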


In [3]:
# Define categorical columns
categorical_columns = ['school', 'sex', 'address', 'famsize', 'Pstatus', 
                      'Mjob', 'Fjob', 'reason', 'guardian', 'schoolsup', 
                      'famsup', 'paid', 'activities', 'nursery', 'higher', 
                      'internet', 'romantic']

# Import encoder
from sklearn.preprocessing import LabelEncoder

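# Initialize encoder
le = LabelEncoder()

# Apply encoding to all datasets.
# fit_transform refits the encoder on every column of every dataset, so each
# mapping is learned independently and `le` keeps only the last one fitted.
for df in [mat_df, por_df, csv_df]:
    for col in categorical_columns:
        df[f'{col}_encoded'] = le.fit_transform(df[col])

    # Drop original categorical columns
    df.drop(columns=categorical_columns, inplace=True)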

# Save encoded datasets with 'FE' suffix (Features Encoded)
mat_df.to_csv('processed_data/PmatFE.csv', index=False)
por_df.to_csv('processed_data/PporFE.csv', index=False)
csv_df.to_csv('processed_data/PdatasetFE.csv', index=False)

# Print encoding verification
print("\nEncoding Verification:")
for name, df in [("Mathematics", mat_df), ("Portuguese", por_df), ("Combined", csv_df)]:
    print(f"\n{name} Dataset:")
    print(f"Original categorical columns: {len(categorical_columns)}")
    print(f"New encoded columns: {len([col for col in df.columns if col.endswith('_encoded')])}")
    print(f"Total columns: {len(df.columns)}")


Encoding Verification:

Mathematics Dataset:
Original categorical columns: 17
New encoded columns: 17
Total columns: 33

Portuguese Dataset:
Original categorical columns: 17
New encoded columns: 17
Total columns: 33

Combined Dataset:
Original categorical columns: 17
New encoded columns: 17
Total columns: 31


#### Calculating new features from existing features

In [4]:
def calculate_gvg(row):
    """Return the G1/G2 average, truncated to an integer."""
    return int((row['G1'] + row['G2']) / 2)

def calculate_avgalc(row):
    """Return the Dalc/Walc average, truncated to an integer."""
    return int((row['Dalc'] + row['Walc']) / 2)

def calculate_bum(row):
    """Return the weighted academic risk score (Bum)."""
    return (2.0 * row['failures'] +
            1.5 * row['absences'] +
            1.0 * row['Dalc'] +
            1.0 * row['Walc'] +
            1.0 * (5 - row['studytime']) +  # inverted: less study, higher score
            1.0 * row['freetime'])
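
The row-wise functions above can also be expressed as column arithmetic, which pandas evaluates without a Python-level call per row. The sketch below is an equivalent formulation under the assumption that grades and alcohol scores are non-negative integers, so that floor division matches `int()` truncation. `add_features` is a hypothetical helper name and the sample rows are invented.

```python
import pandas as pd

def add_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with Avgalc, Bum and (when possible) Gvg added."""
    out = df.copy()
    out["Avgalc"] = (out["Dalc"] + out["Walc"]) // 2
    out["Bum"] = (2.0 * out["failures"] + 1.5 * out["absences"]
                  + out["Dalc"] + out["Walc"]
                  + (5 - out["studytime"]) + out["freetime"])
    # Gvg needs both period grades; skip it when they are absent.
    if {"G1", "G2"}.issubset(out.columns):
        out["Gvg"] = (out["G1"] + out["G2"]) // 2
    return out

# Invented rows for illustration only.
sample = pd.DataFrame({
    "G1": [11, 8], "G2": [12, 8],
    "Dalc": [1, 2], "Walc": [4, 2],
    "failures": [1, 0], "absences": [4, 0],
    "studytime": [2, 3], "freetime": [3, 2],
})
result = add_features(sample)
print(result[["Gvg", "Avgalc", "Bum"]])
```

Returning a copy leaves the input frame untouched, which makes it easier to rerun the cell without stacking features on an already modified dataset.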

### Apply features

In [None]:
# Load gender-split datasets
split_files = ['X_PmatM', 'X_PmatFE', 'X_PporM', 'X_PporFE']
datasets = {}

for split in split_files:
    datasets[split] = pd.read_csv(f'../processed_data/{split}.csv')

# Apply features to each split
for name, df in datasets.items():
    # Calculate features
    df['Avgalc'] = df.apply(calculate_avgalc, axis=1)
    df['Bum'] = df.apply(calculate_bum, axis=1)
    df['Gvg'] = df.apply(calculate_gvg, axis=1)
    
    # Save feature-engineered datasets with lowercase 'f' suffix
    df.to_csv(f'../processed_data/{name}f.csv', index=False)

print("Features added to gender-split datasets successfully")
print("Created files: X_PmatMf.csv, X_PmatFEf.csv, X_PporMf.csv, X_PporFEf.csv")

### Analyse features

In [6]:
# Print statistics for new features
print("\nFeature Engineering Statistics:")
for name, df in [("Mathematics", mat_df), ("Portuguese", por_df), ("Combined", csv_df)]:
    print(f"\n{name} Dataset Statistics:")
    
    # Only print Gvg statistics if the column exists
    if 'Gvg' in df.columns:
        print("Grade Average (Gvg):")
        print(f"  Mean: {df['Gvg'].mean():.2f}")
        print(f"  Std: {df['Gvg'].std():.2f}")
    else:
        print("Grade Average (Gvg): Not available for this dataset")

    print("\nAlcohol Average (Avgalc):")
    print(f"  Mean: {df['Avgalc'].mean():.2f}")
    print(f"  Std: {df['Avgalc'].std():.2f}")

    print("\nRisk Factor (Bum):")
    print(f"  Mean: {df['Bum'].mean():.2f}")
    print(f"  Std: {df['Bum'].std():.2f}")


Feature Engineering Statistics:

Mathematics Dataset Statistics:
Grade Average (Gvg):
  Mean: 10.58
  Std: 3.43

Alcohol Average (Avgalc):
  Mean: 1.71
  Std: 0.96

Risk Factor (Bum):
  Mean: 19.13
  Std: 12.75

Portuguese Dataset Statistics:
Grade Average (Gvg):
  Mean: 11.21
  Std: 2.74

Alcohol Average (Avgalc):
  Mean: 1.71
  Std: 0.97

Risk Factor (Bum):
  Mean: 15.93
  Std: 8.07

Combined Dataset Statistics:
Grade Average (Gvg): Not available for this dataset

Alcohol Average (Avgalc):
  Mean: 1.71
  Std: 0.97

Risk Factor (Bum):
  Mean: 15.93
  Std: 8.07
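
The same summary can be produced as a table rather than line-by-line prints. This is a minimal sketch that uses an invented three-row frame in place of the real datasets; `summarize` is a hypothetical helper that reports only the engineered columns actually present.

```python
import pandas as pd

ENGINEERED = ["Gvg", "Avgalc", "Bum"]

def summarize(df: pd.DataFrame) -> pd.DataFrame:
    """Mean and sample std for whichever engineered columns df contains."""
    present = [col for col in ENGINEERED if col in df.columns]
    return df[present].agg(["mean", "std"]).round(2)

# Invented values for illustration; no Gvg column, like the combined dataset.
demo = pd.DataFrame({"Avgalc": [1, 2, 3], "Bum": [10.0, 20.0, 30.0]})
stats = summarize(demo)
print(stats)
```

Filtering on the columns that exist plays the same role as the `'Gvg' in df.columns` check in the cell above.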
