## Feature Engineering

### Engineered Features and Justifications

1. **Gvg** - Average of the G1 and G2 grades, computed as (G1 + G2) / 2 and kept as a float (added only when both columns are present)
   - Justification: Averaging the first- and second-period grades gives a more stable measure of a student's performance, reducing the impact of an anomaly in any single period.

2. **Avgalc** - Average of Dalc and Walc, computed as (Dalc + Walc) / 2 and kept as a float
   - Justification: Combining workday and weekend alcohol consumption into a single average gives a more complete view of a student's overall drinking habits.

3. **Bum** - A weighted sum of failures, absences, Dalc, Walc, inverted studytime, and freetime. It serves as a rough risk score: higher values indicate a stronger tendency to fail classes, skip school, drink alcohol, study little, and have a lot of free time.
   - Justification for weights:
     - Failures are given a higher weight (2) because past class failures are a strong indicator of academic struggles.
     - Absences are weighted at 1.5 as frequent absences can significantly impact academic performance.
     - Both Dalc and Walc are weighted at 1 as alcohol consumption can affect both health and academic performance.
     - Studytime is inverted (5 - studytime) and weighted at 1 because less study time can lead to poorer academic outcomes.
     - Freetime is weighted at 1 as more free time might indicate less focus on academics.
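
As a worked example of the three definitions above, the following minimal sketch computes Gvg, Avgalc, and Bum for two students in plain Python. The two student records and the helper name `engineered_features` are illustrative assumptions, not part of the dataset or the notebook; only the formulas and weights come from the list above.

```python
# Hypothetical helper mirroring the formulas described above
# (an illustration, not the notebook's own function).
def engineered_features(student):
    """Return Gvg, Avgalc and Bum for one student record given as a dict."""
    gvg = (student["G1"] + student["G2"]) / 2
    avgalc = (student["Dalc"] + student["Walc"]) / 2
    bum = (2.0 * student["failures"]
           + 1.5 * student["absences"]
           + 1.0 * student["Dalc"]
           + 1.0 * student["Walc"]
           + 1.0 * (5 - student["studytime"])  # inverted study time
           + 1.0 * student["freetime"])
    return {"Gvg": gvg, "Avgalc": avgalc, "Bum": bum}

# Two made-up records, for illustration only.
steady = {"G1": 14, "G2": 15, "Dalc": 1, "Walc": 1,
          "failures": 0, "absences": 2, "studytime": 3, "freetime": 3}
at_risk = {"G1": 8, "G2": 7, "Dalc": 3, "Walc": 4,
           "failures": 2, "absences": 20, "studytime": 1, "freetime": 5}

steady_features = engineered_features(steady)
at_risk_features = engineered_features(at_risk)
print(steady_features)   # {'Gvg': 14.5, 'Avgalc': 1.0, 'Bum': 10.0}
print(at_risk_features)  # {'Gvg': 7.5, 'Avgalc': 3.5, 'Bum': 50.0}
```

In the second record, absences alone contribute 30 of the 50 points (1.5 × 20), which shows how strongly a large absence count can drive the score under these weights.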

## Import dependencies

In [22]:
import pandas as pd
import numpy as np
import os

### Load the datasets

In [23]:
def load_datasets():
    base_path = './processed_data/'
    datasets = {}
    
    try:
        # Load Mathematics datasets
        datasets['MatFE'] = pd.read_csv(f'{base_path}PmatFE_full.csv')
        datasets['MatM'] = pd.read_csv(f'{base_path}PmatM_full.csv')
        
        # Load Portuguese datasets
        datasets['PorFE'] = pd.read_csv(f'{base_path}PporFE_full.csv')
        datasets['PorM'] = pd.read_csv(f'{base_path}PporM_full.csv')
        
        print("Datasets loaded successfully:")
        for name, df in datasets.items():
            print(f"{name}: {df.shape[0]} rows, {df.shape[1]} columns")
        
        return datasets
        
    except FileNotFoundError as e:
        print(f"Error loading datasets: {e}")
        return None

# Load all datasets
datasets = load_datasets()

Datasets loaded successfully:
MatFE: 208 rows, 33 columns
MatM: 186 rows, 33 columns
PorFE: 383 rows, 33 columns
PorM: 265 rows, 33 columns


### Calculate new features from existing features

In [24]:
def calculate_features(df):
    """Return a copy of df with Gvg (only if G1 and G2 exist), Avgalc, and Bum added."""
    df = df.copy()
    
    # Calculate grade average if G1 and G2 exist
    if 'G1' in df.columns and 'G2' in df.columns:
        df['Gvg'] = (df['G1'].astype(float) + df['G2'].astype(float)) / 2
    
    # Calculate alcohol average
    df['Avgalc'] = (df['Dalc'].astype(float) + df['Walc'].astype(float)) / 2
    
    # Calculate risk factor (Bum)
    df['Bum'] = (2.0 * df['failures'].astype(float) + 
                 1.5 * df['absences'].astype(float) + 
                 1.0 * df['Dalc'].astype(float) + 
                 1.0 * df['Walc'].astype(float) + 
                 1.0 * (5 - df['studytime'].astype(float)) + 
                 1.0 * df['freetime'].astype(float))
    
    return df

### Apply features

In [25]:
def process_and_save_datasets(datasets):
    base_path = './processed_data/'
    processed_datasets = {}
    
    for name, df in datasets.items():
        print(f"\nProcessing {name} dataset...")
        try:
            # Add engineered features
            processed_df = calculate_features(df)
            
            # Save enhanced dataset
            output_file = f'{base_path}X_{name}_enhanced.csv'
            processed_df.to_csv(output_file, index=False)
            print(f"Saved enhanced dataset to: {output_file}")
            
            processed_datasets[name] = processed_df
            
        except Exception as e:
            print(f"Error processing {name} dataset: {e}")
            continue
    
    return processed_datasets

# Process all datasets
if datasets:
    processed_datasets = process_and_save_datasets(datasets)


Processing MatFE dataset...
Saved enhanced dataset to: ./processed_data/X_MatFE_enhanced.csv

Processing MatM dataset...
Saved enhanced dataset to: ./processed_data/X_MatM_enhanced.csv

Processing PorFE dataset...
Saved enhanced dataset to: ./processed_data/X_PorFE_enhanced.csv

Processing PorM dataset...
Saved enhanced dataset to: ./processed_data/X_PorM_enhanced.csv


### Analyse features

In [26]:
def analyze_features(processed_datasets):
    print("\nFeature Engineering Statistics:")
    
    for name, df in processed_datasets.items():
        print(f"\n{name} Dataset Statistics:")
        
        new_features = ['Gvg', 'Avgalc', 'Bum']
        for feature in new_features:
            if feature in df.columns:
                print(f"\n{feature}:")
                print(f"  Mean: {df[feature].mean():.2f}")
                print(f"  Std: {df[feature].std():.2f}")
                print(f"  Min: {df[feature].min():.2f}")
                print(f"  Max: {df[feature].max():.2f}")
            else:
                print(f"\n{feature}: Not available for this dataset")

# Analyze all processed datasets
if 'processed_datasets' in locals():
    analyze_features(processed_datasets)


Feature Engineering Statistics:

MatFE Dataset Statistics:

Gvg:
  Mean: 10.50
  Std: 3.30
  Min: 2.00
  Max: 18.50

Avgalc:
  Mean: 1.61
  Std: 0.73
  Min: 1.00
  Max: 5.00

Bum:
  Mean: 18.87
  Std: 14.70
  Min: 6.00
  Max: 118.50

MatM Dataset Statistics:

Gvg:
  Mean: 11.17
  Std: 3.50
  Min: 2.50
  Max: 19.00

Avgalc:
  Mean: 2.18
  Std: 1.13
  Min: 1.00
  Max: 5.00

Bum:
  Mean: 19.42
  Std: 10.16
  Min: 5.00
  Max: 68.00

PorFE Dataset Statistics:

Gvg:
  Mean: 11.73
  Std: 2.78
  Min: 2.50
  Max: 18.50

Avgalc:
  Mean: 1.61
  Std: 0.74
  Min: 1.00
  Max: 5.00

Bum:
  Mean: 14.94
  Std: 7.69
  Min: 4.00
  Max: 59.00

PorM Dataset Statistics:

Gvg:
  Mean: 11.15
  Std: 2.63
  Min: 2.00
  Max: 18.00

Avgalc:
  Mean: 2.28
  Std: 1.16
  Min: 1.00
  Max: 5.00

Bum:
  Mean: 17.35
  Std: 8.41
  Min: 6.00
  Max: 53.00
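
The same four statistics can also be collected into a single table with `DataFrame.agg`. The sketch below is a possible compact alternative to the per-feature print loop, not the notebook's method; the frame `toy` and its values are made up for illustration and are not taken from the datasets.

```python
import pandas as pd

# Made-up frame that already carries the engineered columns (illustration only).
toy = pd.DataFrame({
    "Gvg": [10.0, 12.5, 15.0],
    "Avgalc": [1.0, 2.5, 4.0],
    "Bum": [8.0, 14.0, 50.0],
})

# Keep only the engineered features that are actually present,
# mirroring the "Not available for this dataset" check.
new_features = [c for c in ["Gvg", "Avgalc", "Bum"] if c in toy.columns]

# One row per statistic, one column per feature.
summary = toy[new_features].agg(["mean", "std", "min", "max"]).round(2)
print(summary)
```

Applied to each frame in `processed_datasets`, this would yield one small table per dataset, which is easier to compare side by side than the printed blocks.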
