## Feature Engineering

### Engineered Features and Justifications

1. **Gvg** - Average of the G1 and G2 grades, computed as (G1 + G2) / 2 and kept as a float
   - Justification: Averaging the first- and second-period grades gives a more stable measure of a student's performance than either grade alone, reducing the impact of a single anomalous period.

2. **Avgalc** - Average of Dalc and Walc, computed as (Dalc + Walc) / 2 and kept as a float
   - Justification: Combining workday (Dalc) and weekend (Walc) alcohol consumption into one average summarizes a student's overall drinking habits in a single feature.

3. **Bum** - A weighted sum of failures, absences, Dalc, Walc, inverted studytime, and freetime, used as a rough risk score: higher values indicate a stronger tendency to fail classes, skip school, drink alcohol, study little, and have free time.
   - Justification for weights:
     - Failures are given a higher weight (2) because past class failures are a strong indicator of academic struggles.
     - Absences are weighted at 1.5 as frequent absences can significantly impact academic performance.
     - Both Dalc and Walc are weighted at 1 as alcohol consumption can affect both health and academic performance.
     - Studytime is inverted (5 - studytime) and weighted at 1 because less study time can lead to poorer academic outcomes.
     - Freetime is weighted at 1 as more free time might indicate less focus on academics.
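
As a worked illustration of the three formulas above, the sketch below computes them in plain Python for two made-up students. The helper names (`gvg`, `avgalc`, `bum`) and all input values are hypothetical and exist only for this example; the weights are the ones listed above.

```python
def gvg(g1, g2):
    """Mean of the first- and second-period grades."""
    return (g1 + g2) / 2


def avgalc(dalc, walc):
    """Mean of workday and weekend alcohol consumption."""
    return (dalc + walc) / 2


def bum(failures, absences, dalc, walc, studytime, freetime):
    """Weighted risk score using the weights from the list above."""
    return (2.0 * failures
            + 1.5 * absences
            + 1.0 * dalc
            + 1.0 * walc
            + 1.0 * (5 - studytime)   # inverted: less study time scores higher
            + 1.0 * freetime)


# Hypothetical students (values invented for illustration)
student_a = dict(failures=0, absences=0, dalc=1, walc=1, studytime=4, freetime=1)
student_b = dict(failures=1, absences=30, dalc=2, walc=3, studytime=2, freetime=3)

print(gvg(12, 13))        # 12.5
print(avgalc(2, 3))       # 2.5
print(bum(**student_a))   # 4.0
print(bum(**student_b))   # 58.0
```

For student B, the absences term alone contributes 45 of the 58 points, so a raw absence count can outweigh every other input to the score.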

## Import dependencies

In [15]:
import pandas as pd
import numpy as np
import os
from sklearn.model_selection import train_test_split

#### Load the datasets

In [16]:
def load_datasets():
    """Load encoded datasets"""
    base_path = 'processed_data/'
    datasets = {}
    
    try:
        # Load full encoded datasets
        datasets['Mat'] = pd.read_csv(f'{base_path}Pmat_full.csv')
        datasets['Por'] = pd.read_csv(f'{base_path}Ppor_full.csv')
        
        print("Datasets loaded successfully:")
        for name, df in datasets.items():
            print(f"{name}: {df.shape[0]} rows, {df.shape[1]} columns")
            print(f"Sample data types:\n{df.dtypes.head()}\n")
        
        return datasets
        
    except FileNotFoundError as e:
        print(f"Error loading datasets: {e}")
        return None

# Load all datasets
datasets = load_datasets()

Datasets loaded successfully:
Mat: 394 rows, 33 columns
Sample data types:
school     int64
sex        int64
age        int64
address    int64
famsize    int64
dtype: object

Por: 648 rows, 33 columns
Sample data types:
school     int64
sex        int64
age        int64
address    int64
famsize    int64
dtype: object



#### Calculating new features from existing features

In [17]:
def calculate_features(df):
    """Add engineered features to dataframe"""
    df = df.copy()
    
    # Calculate grade average
    df['Gvg'] = df[['G1', 'G2']].mean(axis=1)
    
    # Calculate alcohol average
    df['Avgalc'] = df[['Dalc', 'Walc']].mean(axis=1)
    
    # Calculate risk factor (Bum)
    df['Bum'] = (2.0 * df['failures'] + 
                 1.5 * df['absences'] + 
                 1.0 * df['Dalc'] + 
                 1.0 * df['Walc'] + 
                 1.0 * (5 - df['studytime']) + 
                 1.0 * df['freetime'])
    
    return df
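
`calculate_features` raises a `KeyError` if any of its input columns is missing. A minimal sketch of a pre-check follows; the helper `missing_feature_columns` and the constant `REQUIRED_COLUMNS` are hypothetical and not part of the notebook. The function accepts any iterable of column names, such as `df.columns`.

```python
# Columns read by calculate_features (taken from the function body above)
REQUIRED_COLUMNS = ('G1', 'G2', 'Dalc', 'Walc',
                    'failures', 'absences', 'studytime', 'freetime')


def missing_feature_columns(columns, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent, in listed order."""
    present = set(columns)
    return [name for name in required if name not in present]


# Example with made-up column lists
complete = ['sex', 'G1', 'G2', 'Dalc', 'Walc',
            'failures', 'absences', 'studytime', 'freetime']
partial = ['G1', 'G2', 'Dalc']

print(missing_feature_columns(complete))  # []
print(missing_feature_columns(partial))
# ['Walc', 'failures', 'absences', 'studytime', 'freetime']
```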

### Apply features

In [18]:
def process_and_save_datasets(datasets):
    """Add engineered features, split each dataset by sex, and save the results"""
    base_path = 'processed_data/'
    processed_datasets = {}
    
    for name, df in datasets.items():
        print(f"\nProcessing {name} dataset...")
        try:
            # Add engineered features
            processed_df = calculate_features(df)
            
            # Split by sex using the encoded values (0 -> FE files, 1 -> M files)
            processed_df_fe = processed_df[processed_df['sex'] == 0].copy()
            processed_df_m = processed_df[processed_df['sex'] == 1].copy()
            
            # Save enhanced gender-split datasets
            processed_df_fe.to_csv(f'{base_path}X_{name}FE_enhanced.csv', index=False)
            processed_df_m.to_csv(f'{base_path}X_{name}M_enhanced.csv', index=False)
            
            processed_datasets[f'{name}FE'] = processed_df_fe
            processed_datasets[f'{name}M'] = processed_df_m
            
        except Exception as e:
            print(f"Error processing {name} dataset: {e}")
            continue
    
    return processed_datasets

# Process datasets (load_datasets() returns None if a file is missing)
processed_datasets = process_and_save_datasets(datasets) if datasets else {}


Processing Mat dataset...

Processing Por dataset...


### Analyse features

In [19]:
def analyze_features(processed_datasets):
    """Print mean, std, min, and max for each engineered feature"""
    print("\nFeature Engineering Statistics:")
    
    for name, df in processed_datasets.items():
        print(f"\n{name} Dataset Statistics:")
        
        new_features = ['Gvg', 'Avgalc', 'Bum']
        for feature in new_features:
            if feature in df.columns:
                print(f"\n{feature}:")
                print(f"  Mean: {df[feature].mean():.2f}")
                print(f"  Std: {df[feature].std():.2f}")
                print(f"  Min: {df[feature].min():.2f}")
                print(f"  Max: {df[feature].max():.2f}")
            else:
                print(f"\n{feature}: Not available for this dataset")

# Analyze all processed datasets
if 'processed_datasets' in locals():
    analyze_features(processed_datasets)


Feature Engineering Statistics:

MatFE Dataset Statistics:

Gvg:
  Mean: 10.50
  Std: 3.30
  Min: 2.00
  Max: 18.50

Avgalc:
  Mean: 1.61
  Std: 0.73
  Min: 1.00
  Max: 5.00

Bum:
  Mean: 18.87
  Std: 14.70
  Min: 6.00
  Max: 118.50

MatM Dataset Statistics:

Gvg:
  Mean: 11.17
  Std: 3.50
  Min: 2.50
  Max: 19.00

Avgalc:
  Mean: 2.18
  Std: 1.13
  Min: 1.00
  Max: 5.00

Bum:
  Mean: 19.42
  Std: 10.16
  Min: 5.00
  Max: 68.00

PorFE Dataset Statistics:

Gvg:
  Mean: 11.73
  Std: 2.78
  Min: 2.50
  Max: 18.50

Avgalc:
  Mean: 1.61
  Std: 0.74
  Min: 1.00
  Max: 5.00

Bum:
  Mean: 14.94
  Std: 7.69
  Min: 4.00
  Max: 59.00

PorM Dataset Statistics:

Gvg:
  Mean: 11.15
  Std: 2.63
  Min: 2.00
  Max: 18.00

Avgalc:
  Mean: 2.28
  Std: 1.16
  Min: 1.00
  Max: 5.00

Bum:
  Mean: 17.35
  Std: 8.41
  Min: 6.00
  Max: 53.00
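
The same four statistics can also be collected into one table per dataset with `DataFrame.agg`. The sketch below uses a small made-up frame in place of the real data; `toy` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for one entry of processed_datasets
toy = pd.DataFrame({
    'Gvg': [10.0, 12.5, 15.0],
    'Avgalc': [1.0, 2.0, 3.0],
    'Bum': [4.0, 10.0, 58.0],
})

# One row per engineered feature, one column per statistic
summary = (toy[['Gvg', 'Avgalc', 'Bum']]
           .agg(['mean', 'std', 'min', 'max'])
           .T
           .round(2))
print(summary)
```

Putting the features in rows makes it easier to compare the four gender-split datasets side by side.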


### Splitting and saving this data for training

In [20]:
def split_save_and_print(data, name, test_size=0.3, random_state=42):
    """Split a dataset into train/test sets, save them as CSV, and print their sizes"""
    if len(data) == 0:
        print(f"\nWarning: {name} dataset is empty, skipping split")
        return None, None, None, None
        
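    # Features are all columns except G1, G2, G3 (the derived Gvg stays in X);
    # targets are G1, G2, G3
    X_train, X_test, y_train, y_test = train_test_split(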
        data.drop(['G1', 'G2', 'G3'], axis=1),
        data[['G1', 'G2', 'G3']],
        test_size=test_size,
        random_state=random_state
    )
    
    # Save splits
    base_path = f'processed_data/{name.lower()}'
    X_train.to_csv(f'{base_path}_X_train.csv', index=False)
    X_test.to_csv(f'{base_path}_X_test.csv', index=False)
    y_train.to_csv(f'{base_path}_y_train.csv', index=False)
    y_test.to_csv(f'{base_path}_y_test.csv', index=False)
    
    print(f"\n{name}:")
    print(f"Training: {X_train.shape[0]} samples")
    print(f"Testing: {X_test.shape[0]} samples")
    return X_train, X_test, y_train, y_test

# Process and save datasets
if 'processed_datasets' in locals():
    for name, df in processed_datasets.items():
        print(f"\nProcessing {name} dataset for train-test split...")
        split_save_and_print(df, name)


Processing MatFE dataset for train-test split...

MatFE:
Training: 145 samples
Testing: 63 samples

Processing MatM dataset for train-test split...

MatM:
Training: 130 samples
Testing: 56 samples

Processing PorFE dataset for train-test split...

PorFE:
Training: 268 samples
Testing: 115 samples

Processing PorM dataset for train-test split...

PorM:
Training: 185 samples
Testing: 80 samples
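
The printed sizes follow from `test_size=0.3`: for a fractional `test_size`, scikit-learn rounds the test count up and assigns the remaining rows to training, which matches the output above. A small sketch of that arithmetic, using a hypothetical helper `expected_split_sizes`, reproduces the numbers without calling scikit-learn.

```python
import math


def expected_split_sizes(n_samples, test_size=0.3):
    """Return (n_train, n_test) for a fractional test_size."""
    n_test = math.ceil(test_size * n_samples)  # test count is rounded up
    return n_samples - n_test, n_test


# Row counts are the train + test totals printed above
row_counts = {'MatFE': 208, 'MatM': 186, 'PorFE': 383, 'PorM': 265}

for name, n in row_counts.items():
    n_train, n_test = expected_split_sizes(n)
    print(f"{name}: {n_train} train, {n_test} test")
```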
