## Feature Engineering

### Engineered Features and Justifications

1. **Gvg** - Average of the G1 and G2 grades, computed as (G1 + G2) / 2 and kept as a float
   - Justification: Averaging the first and second period grades provides a more stable measure of a student's performance over time, reducing the impact of any single period's anomalies.

2. **Avgalc** - Average of Dalc and Walc, computed as (Dalc + Walc) / 2 and kept as a float
   - Justification: Combining workday and weekend alcohol consumption into a single average value gives a more comprehensive view of a student's overall alcohol consumption habits.

3. **Bum** - A weighted sum of failures, absences, Dalc, Walc, inverted studytime, and freetime to indicate a student's tendency to fail, skip school, drink alcohol, not study, and have free time.
   - Justification for weights:
     - Failures are given a higher weight (2) because past class failures are a strong indicator of academic struggles.
     - Absences are weighted at 1.5 as frequent absences can significantly impact academic performance.
     - Both Dalc and Walc are weighted at 1 as alcohol consumption can affect both health and academic performance.
     - Studytime is inverted (5 - studytime) and weighted at 1 because less study time can lead to poorer academic outcomes.
     - Freetime is weighted at 1 as more free time might indicate less focus on academics.
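
As a worked example of the three definitions above, the sketch below computes Gvg, Avgalc, and Bum for a single made-up student record. The values in `student` are illustrative assumptions, not rows from the dataset, and the `weights` dictionary simply restates the weights listed above.

```python
import pandas as pd

# Hypothetical student record (illustrative values, not taken from the dataset)
student = pd.DataFrame([{
    "G1": 10, "G2": 13,
    "Dalc": 1, "Walc": 3,
    "failures": 1, "absences": 4,
    "studytime": 2, "freetime": 3,
}])

# Weights as listed above; studytime enters inverted as (5 - studytime)
weights = {"failures": 2.0, "absences": 1.5, "Dalc": 1.0,
           "Walc": 1.0, "studytime_inv": 1.0, "freetime": 1.0}

gvg = (student["G1"] + student["G2"]) / 2         # (10 + 13) / 2
avgalc = (student["Dalc"] + student["Walc"]) / 2  # (1 + 3) / 2

# Collect the raw terms, then add the inverted study time
terms = student[["failures", "absences", "Dalc", "Walc", "freetime"]].astype(float)
terms["studytime_inv"] = 5 - student["studytime"]

# Weighted sum: 2*1 + 1.5*4 + 1*1 + 1*3 + 1*(5 - 2) + 1*3
bum = sum(weights[col] * terms[col] for col in weights)

print(gvg.iloc[0], avgalc.iloc[0], bum.iloc[0])  # 11.5 2.0 18.0
```

In this example the absences term contributes 6 of the 18 points, which shows how the unbounded absences count can outweigh the other, bounded inputs.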

## Import dependencies

In [3]:
import pandas as pd
import numpy as np
import os
from sklearn.model_selection import train_test_split

#### Load the datasets

In [4]:
def load_datasets():
    base_path = 'processed_data/'
    datasets = {}
    
    try:
        # Load full datasets
        datasets['Mat'] = pd.read_csv(f'{base_path}Pmat_full.csv')
        datasets['Por'] = pd.read_csv(f'{base_path}Ppor_full.csv')
        
        print("Datasets loaded successfully:")
        for name, df in datasets.items():
            print(f"{name}: {df.shape[0]} rows, {df.shape[1]} columns")
        
        return datasets
        
    except FileNotFoundError as e:
        print(f"Error loading datasets: {e}")
        return None

# Load all datasets
datasets = load_datasets()

Datasets loaded successfully:
Mat: 394 rows, 33 columns
Por: 648 rows, 33 columns


#### Calculating new features from existing features

In [5]:
def calculate_features(df):
    """Add engineered features to dataframe"""
    df = df.copy()
    
    # Calculate grade average if G1 and G2 exist
    if 'G1' in df.columns and 'G2' in df.columns:
        df['Gvg'] = (df['G1'].astype(float) + df['G2'].astype(float)) / 2
    
    # Calculate alcohol average
    df['Avgalc'] = (df['Dalc'].astype(float) + df['Walc'].astype(float)) / 2
    
    # Calculate risk factor (Bum)
    df['Bum'] = (2.0 * df['failures'].astype(float) + 
                 1.5 * df['absences'].astype(float) + 
                 1.0 * df['Dalc'].astype(float) + 
                 1.0 * df['Walc'].astype(float) + 
                 1.0 * (5 - df['studytime'].astype(float)) + 
                 1.0 * df['freetime'].astype(float))
    
    return df

### Apply features

In [6]:
def process_and_save_datasets(datasets):
    base_path = 'processed_data/'
    processed_datasets = {}
    
    for name, df in datasets.items():
        print(f"\nProcessing {name} dataset...")
        try:
            # Add engineered features
            processed_df = calculate_features(df)
            
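            # Split by gender. Labels are normalized first (whitespace and quotes
            # stripped, lowercased) so that 'F', 'f', and "'f'" all match.
            # Assumes F/M-style labels; adjust if `sex` is numerically encoded.
            sex = processed_df['sex'].astype(str).str.strip().str.strip("'\"").str.lower()
            processed_df_fe = processed_df[sex == 'f'].copy()
            processed_df_m = processed_df[sex == 'm'].copy()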
            
            # Save enhanced gender-split datasets
            processed_df_fe.to_csv(f'{base_path}X_{name}FE_enhanced.csv', index=False)
            processed_df_m.to_csv(f'{base_path}X_{name}M_enhanced.csv', index=False)
            
            # Store both gender splits in processed_datasets
            processed_datasets[f'{name}FE'] = processed_df_fe
            processed_datasets[f'{name}M'] = processed_df_m
            
            print(f"Saved enhanced datasets for {name}")
            
        except Exception as e:
            print(f"Error processing {name} dataset: {e}")
            continue
    
    return processed_datasets

# Process all datasets
if datasets:
    processed_datasets = process_and_save_datasets(datasets)


Processing Mat dataset...
Saved enhanced datasets for Mat

Processing Por dataset...
Saved enhanced datasets for Por
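
A gender split of this kind depends on exactly how the `sex` column was encoded upstream, and a label mismatch silently yields empty subsets while still reporting a successful save. A minimal sketch of a more defensive split follows; `split_by_sex` is a hypothetical helper, and it assumes F/M-style labels that may carry stray quotes or whitespace.

```python
import pandas as pd

def split_by_sex(df, column="sex"):
    """Hypothetical helper: return (female_df, male_df) after normalizing labels."""
    # Strip whitespace and quote characters, then lowercase, so that
    # 'F', 'f', and "'f'" are all treated as the same label.
    labels = df[column].astype(str).str.strip().str.strip("'\"").str.lower()
    female = df[labels == "f"].copy()
    male = df[labels == "m"].copy()
    # Fail loudly instead of writing empty CSV files downstream
    if female.empty and male.empty and not df.empty:
        raise ValueError(f"No rows matched 'f' or 'm'; found {df[column].unique()}")
    return female, male

# Toy data with inconsistent encodings (illustrative only)
toy = pd.DataFrame({"sex": ["F", "'m'", " f ", "M"], "G1": [12, 9, 15, 11]})
female, male = split_by_sex(toy)
print(len(female), len(male))  # 2 2
```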


### Analyse features

In [7]:
def analyze_features(processed_datasets):
    print("\nFeature Engineering Statistics:")
    
    for name, df in processed_datasets.items():
        print(f"\n{name} Dataset Statistics:")
        
        new_features = ['Gvg', 'Avgalc', 'Bum']
        for feature in new_features:
            if feature in df.columns:
                print(f"\n{feature}:")
                print(f"  Mean: {df[feature].mean():.2f}")
                print(f"  Std: {df[feature].std():.2f}")
                print(f"  Min: {df[feature].min():.2f}")
                print(f"  Max: {df[feature].max():.2f}")
            else:
                print(f"\n{feature}: Not available for this dataset")

# Analyze all processed datasets
if 'processed_datasets' in locals():
    analyze_features(processed_datasets)


Feature Engineering Statistics:

MatFE Dataset Statistics:

Gvg:
  Mean: nan
  Std: nan
  Min: nan
  Max: nan

Avgalc:
  Mean: nan
  Std: nan
  Min: nan
  Max: nan

Bum:
  Mean: nan
  Std: nan
  Min: nan
  Max: nan

MatM Dataset Statistics:

Gvg:
  Mean: nan
  Std: nan
  Min: nan
  Max: nan

Avgalc:
  Mean: nan
  Std: nan
  Min: nan
  Max: nan

Bum:
  Mean: nan
  Std: nan
  Min: nan
  Max: nan

PorFE Dataset Statistics:

Gvg:
  Mean: nan
  Std: nan
  Min: nan
  Max: nan

Avgalc:
  Mean: nan
  Std: nan
  Min: nan
  Max: nan

Bum:
  Mean: nan
  Std: nan
  Min: nan
  Max: nan

PorM Dataset Statistics:

Gvg:
  Mean: nan
  Std: nan
  Min: nan
  Max: nan

Avgalc:
  Mean: nan
  Std: nan
  Min: nan
  Max: nan

Bum:
  Mean: nan
  Std: nan
  Min: nan
  Max: nan
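
Every statistic in the recorded output is NaN, which is what pandas returns when these aggregations run on an empty column. It indicates that the gender-split dataframes contained no rows, meaning the filter on `sex` matched nothing. The sketch below reproduces that behavior on an empty Series; the `summarize` helper is hypothetical and only adds a row count so that an empty input is obvious at a glance.

```python
import pandas as pd

def summarize(series):
    """Hypothetical helper: summary statistics plus the row count."""
    return {
        "count": len(series),
        "mean": series.mean(),
        "std": series.std(),
        "min": series.min(),
        "max": series.max(),
    }

# An empty float column, as produced by a filter that matches no rows
empty = pd.Series([], dtype=float)
stats = summarize(empty)

# All aggregates are NaN, while the count makes the cause visible
print(stats["count"], pd.isna(stats["mean"]), pd.isna(stats["std"]))
```

Printing the row count alongside each feature in `analyze_features` would surface the same problem without having to infer it from the NaN values.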


### Splitting and saving this data for training

In [8]:
# Function to split and save datasets
def split_save_and_print(data, name, test_size=0.3, random_state=42):
    if len(data) == 0:
        print(f"\nWarning: {name} dataset is empty, skipping split")
        return None, None, None, None
        
    X_train, X_test, y_train, y_test = train_test_split(
        data.drop(['G1', 'G2', 'G3'], axis=1),
        data[['G1', 'G2', 'G3']],
        test_size=test_size,
        random_state=random_state
    )
    
    # Save splits
    base_path = f'processed_data/{name.lower().replace(" ", "_")}'
    X_train.to_csv(f'{base_path}_X_train.csv', index=False)
    X_test.to_csv(f'{base_path}_X_test.csv', index=False)
    y_train.to_csv(f'{base_path}_y_train.csv', index=False)
    y_test.to_csv(f'{base_path}_y_test.csv', index=False)
    
    print(f"\n{name}:")
    print(f"Training: {X_train.shape[0]} samples")
    print(f"Testing: {X_test.shape[0]} samples")
    return X_train, X_test, y_train, y_test

# Split and save each processed dataset
if 'processed_datasets' in locals():
    for name, df in processed_datasets.items():
        print(f"\nProcessing {name} dataset for train-test split...")
        print(f"Unique sex values in dataset: {df['sex'].unique()}")
        
        # Count rows per gender using normalized labels (quotes stripped, lowercased)
        sex = df['sex'].astype(str).str.strip().str.strip("'\"").str.lower()
        print(f"Female samples: {(sex == 'f').sum()}")
        print(f"Male samples: {(sex == 'm').sum()}")
        
        # Each entry of processed_datasets is already a single-gender subset
        # (e.g. MatFE, MatM), so it is split directly instead of being filtered
        # by sex again. Empty datasets are reported and skipped inside
        # split_save_and_print.
        split_save_and_print(df, name)


Processing MatFE dataset for train-test split...
Unique sex values in dataset: []
Female samples: 0
Male samples: 0


Processing MatM dataset for train-test split...
Unique sex values in dataset: []
Female samples: 0
Male samples: 0


Processing PorFE dataset for train-test split...
Unique sex values in dataset: []
Female samples: 0
Male samples: 0


Processing PorM dataset for train-test split...
Unique sex values in dataset: []
Female samples: 0
Male samples: 0
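
For reference, a minimal sketch of the 70/30 split on toy data, using the same `test_size=0.3` and `random_state=42` as above. The toy values are illustrative assumptions. Because Gvg is computed from G1 and G2, the sketch also drops it from the feature matrix when those grades are among the targets; whether to do so in the real pipeline is a modeling decision that the code above does not make.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 10 rows (illustrative values only)
toy = pd.DataFrame({
    "studytime": [1, 2, 3, 4, 2, 1, 3, 2, 4, 1],
    "absences":  [0, 4, 2, 6, 10, 0, 3, 8, 1, 5],
    "G1": [10, 12, 14, 9, 8, 15, 11, 7, 16, 13],
    "G2": [11, 12, 15, 9, 7, 14, 12, 8, 17, 12],
    "G3": [11, 13, 15, 10, 6, 15, 12, 8, 18, 12],
})
toy["Gvg"] = (toy["G1"] + toy["G2"]) / 2

targets = ["G1", "G2", "G3"]
# Assumption: Gvg is dropped too because it is computed from two of the targets
X = toy.drop(columns=targets + ["Gvg"])
y = toy[targets]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
print(X_train.shape, X_test.shape)  # (7, 2) (3, 2)
```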

