## Feature Engineering

### Engineered Features and Justifications

1. **Gvg** - Mean of the G1 and G2 grades, (G1 + G2) / 2, kept as a float
   - Justification: Averaging the first- and second-period grades gives a more stable measure of a student's performance than either period alone, because it dampens single-period anomalies.

2. **Avgalc** - Mean of Dalc and Walc, (Dalc + Walc) / 2, kept as a float
   - Justification: Combining workday and weekend alcohol consumption into a single average gives a more complete view of a student's overall drinking habits.

3. **Bum** - A weighted sum of failures, absences, Dalc, Walc, inverted studytime, and freetime. Higher values indicate a stronger tendency to fail classes, skip school, drink alcohol, study little, and have a lot of free time.
   - Justification for weights:
     - Failures are given the highest weight (2) because past class failures are a strong indicator of academic struggles.
     - Absences are weighted at 1.5 because frequent absences can significantly affect academic performance.
     - Dalc and Walc are each weighted at 1 because alcohol consumption can affect both health and academic performance.
     - Studytime is inverted (5 - studytime) so that less study time raises the score, and is weighted at 1 because less study time can lead to poorer academic outcomes.
     - Freetime is weighted at 1 because more free time might indicate less focus on academics.
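
As a worked example of the three formulas above, the following sketch computes Gvg, Avgalc, and Bum for a single student in plain Python. The student record and the helper name `engineered_features` are hypothetical and only illustrate the arithmetic; the weights are the ones listed above.

```python
# Hypothetical student record (values invented for illustration).
student = {
    "G1": 12, "G2": 15,
    "Dalc": 1, "Walc": 3,
    "failures": 1, "absences": 4,
    "studytime": 2, "freetime": 3,
}

def engineered_features(s):
    """Return Gvg, Avgalc, and Bum for one student record."""
    gvg = (s["G1"] + s["G2"]) / 2
    avgalc = (s["Dalc"] + s["Walc"]) / 2
    bum = (2.0 * s["failures"]
           + 1.5 * s["absences"]
           + 1.0 * s["Dalc"]
           + 1.0 * s["Walc"]
           + 1.0 * (5 - s["studytime"])  # inverted: less study -> higher score
           + 1.0 * s["freetime"])
    return {"Gvg": gvg, "Avgalc": avgalc, "Bum": bum}

features = engineered_features(student)
print(features)  # {'Gvg': 13.5, 'Avgalc': 2.0, 'Bum': 18.0}
```

For this record the absence term (1.5 × 4 = 6) is already the largest single contribution to the Bum score of 18.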

## Import dependencies

In [97]:
import pandas as pd
import numpy as np
import os
from sklearn.model_selection import train_test_split

### Load the datasets

In [98]:
# Load the datasets
def load_processed_datasets():
    """Load all preprocessed datasets"""
    base_path = 'processed_data/'
    try:
        # Load full datasets
        Pmat_full = pd.read_csv(f'{base_path}Pmat_full.csv')
        Ppor_full = pd.read_csv(f'{base_path}Ppor_full.csv')
        
        # Load gender splits
        PmatM = pd.read_csv(f'{base_path}PmatM.csv')
        PmatFE = pd.read_csv(f'{base_path}PmatFE.csv')
        PporM = pd.read_csv(f'{base_path}PporM.csv')
        PporFE = pd.read_csv(f'{base_path}PporFE.csv')
        
        return {
            'Pmat_full': Pmat_full,
            'Ppor_full': Ppor_full,
            'PmatM': PmatM,
            'PmatFE': PmatFE,
            'PporM': PporM,
            'PporFE': PporFE
        }
    except Exception as e:
        print(f"Error loading datasets: {e}")
        return None

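# Load all datasets; the loader returns None on failure, so stop here in that case
datasets = load_processed_datasets()
if datasets is None:
    raise RuntimeError("Could not load the processed datasets")
print("\nVerifying datasets after loading:")
for name, df in datasets.items():
    # Show the row count and the first five column names
    print(f"{name}: {df.shape[0]} rows, {df.columns.tolist()[:5]} columns")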


Verifying datasets after loading:
Pmat_full: 394 rows, ['school', 'sex', 'age', 'address', 'famsize'] columns
Ppor_full: 648 rows, ['school', 'sex', 'age', 'address', 'famsize'] columns
PmatM: 186 rows, ['school', 'sex', 'age', 'address', 'famsize'] columns
PmatFE: 208 rows, ['school', 'sex', 'age', 'address', 'famsize'] columns
PporM: 265 rows, ['school', 'sex', 'age', 'address', 'famsize'] columns
PporFE: 383 rows, ['school', 'sex', 'age', 'address', 'famsize'] columns


### Calculating new features from existing features

In [99]:
# Feature calculation function
def calculate_features(df):
    """Calculate engineered features"""
    df = df.copy()
    
    # Calculate grade average
    df['Gvg'] = df[['G1', 'G2']].mean(axis=1)
    
    # Calculate alcohol average
    df['Avgalc'] = df[['Dalc', 'Walc']].mean(axis=1)
    
    # Calculate risk factor
    df['Bum'] = (2.0 * df['failures'] + 
                 1.5 * df['absences'] + 
                 1.0 * df['Dalc'] + 
                 1.0 * df['Walc'] + 
                 1.0 * (5 - df['studytime']) + 
                 1.0 * df['freetime'])
    
    # Drop any rows with NaN values
    df = df.dropna()
    
    return df

### Encoding categorical and nominal attributes of our data

In [100]:
def encode_categorical_data(df):
    """Encode categorical variables and clean data"""
    df = df.copy()
    
    # Clean string values by removing quotes
    for col in df.select_dtypes(include=['object']).columns:
        df[col] = df[col].str.strip("'")
    
    # Drop completely empty columns
    df = df.dropna(axis=1, how='all')
    
    # Encode categorical variables
    categorical_cols = ['Mjob', 'Fjob', 'reason', 'guardian', 'Pstatus']
    df = pd.get_dummies(df, columns=categorical_cols, prefix=categorical_cols)
    
    # Encode binary variables
    binary_mapping = {
        'schoolsup': {'no': 0, 'yes': 1},
        'famsup': {'no': 0, 'yes': 1},
        'paid': {'no': 0, 'yes': 1},
        'activities': {'no': 0, 'yes': 1},
        'nursery': {'no': 0, 'yes': 1},
        'higher': {'no': 0, 'yes': 1},
        'internet': {'no': 0, 'yes': 1},
        'romantic': {'no': 0, 'yes': 1},
        'school': {'GP': 0, 'MS': 1},
        'sex': {'F': 0, 'M': 1},
        'address': {'U': 0, 'R': 1},
        'famsize': {'LE3': 0, 'GT3': 1}
    }
    
    for col, mapping in binary_mapping.items():
        if col in df.columns:
            df[col] = df[col].map(mapping)
    
    return df
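
To illustrate the two encoding steps used in `encode_categorical_data`, the sketch below applies `pd.get_dummies` and `Series.map` to a toy frame. The toy values, including the out-of-vocabulary entry `'maybe'`, are invented for illustration and do not come from the datasets.

```python
import pandas as pd

# Toy frame; the values are invented for illustration.
toy = pd.DataFrame({
    "guardian": ["mother", "father", "other"],
    "internet": ["yes", "no", "maybe"],  # 'maybe' is not in the mapping
})

# One-hot encode a nominal column: one indicator column per category.
encoded = pd.get_dummies(toy, columns=["guardian"], prefix=["guardian"])

# Map a binary column to 0/1; values missing from the mapping become NaN.
encoded["internet"] = encoded["internet"].map({"no": 0, "yes": 1})

print(sorted(encoded.columns))
print(encoded["internet"].isna().sum())  # 1 unmapped value
```

In the pipeline here, any row with such a NaN would later be removed by the `dropna()` call in `calculate_features`, so an unexpected category value shrinks the dataset without raising an error.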

### Apply features

In [101]:
def process_and_save_datasets(datasets):
    """Process datasets and save enhanced versions"""
    base_path = 'processed_data/'
    enhanced_datasets = {}
    
    for name, df in datasets.items():
        print(f"\nProcessing {name} dataset...")
        try:
            # First encode and clean data
            encoded_df = encode_categorical_data(df)
            
            # Then add engineered features
            enhanced_df = calculate_features(encoded_df)
            
            # Verify we have valid data
            if enhanced_df.empty:
                print(f"Warning: {name} dataset is empty after processing")
                continue
            
            # Save enhanced dataset
            output_path = f'{base_path}X_{name}_enhanced.csv'
            enhanced_df.to_csv(output_path, index=False)
            enhanced_datasets[name] = enhanced_df
            
            # Print processing results
            print(f"Original shape: {df.shape}")
            print(f"Enhanced shape: {enhanced_df.shape}")
            print(f"Features: {sorted(enhanced_df.columns.tolist())}")
            
        except Exception as e:
            print(f"Error processing {name}: {e}")
            continue
    
    return enhanced_datasets

# Process datasets
processed_datasets = process_and_save_datasets(datasets)


Processing Pmat_full dataset...
Original shape: (394, 33)
Enhanced shape: (394, 50)
Features: ['Avgalc', 'Bum', 'Dalc', 'Fedu', 'Fjob_at_home', 'Fjob_health', 'Fjob_other', 'Fjob_services', 'Fjob_teacher', 'G1', 'G2', 'G3', 'Gvg', 'Medu', 'Mjob_at_home', 'Mjob_health', 'Mjob_other', 'Mjob_services', 'Mjob_teacher', 'Pstatus_A', 'Pstatus_T', 'Walc', 'absences', 'activities', 'address', 'age', 'failures', 'famrel', 'famsize', 'famsup', 'freetime', 'goout', 'guardian_father', 'guardian_mother', 'guardian_other', 'health', 'higher', 'internet', 'nursery', 'paid', 'reason_course', 'reason_home', 'reason_other', 'reason_reputation', 'romantic', 'school', 'schoolsup', 'sex', 'studytime', 'traveltime']

Processing Ppor_full dataset...
Original shape: (648, 33)
Enhanced shape: (648, 50)
Features: ['Avgalc', 'Bum', 'Dalc', 'Fedu', 'Fjob_at_home', 'Fjob_health', 'Fjob_other', 'Fjob_services', 'Fjob_teacher', 'G1', 'G2', 'G3', 'Gvg', 'Medu', 'Mjob_at_home', 'Mjob_health', 'Mjob_other', 'Mjob_serv

### Analyse features

In [102]:
def analyze_features(processed_datasets):
    print("\nFeature Engineering Statistics:")
    
    for name, df in processed_datasets.items():
        if df is None or df.empty:
            print(f"\n{name} Dataset is empty or None")
            continue
            
        print(f"\n{name} Dataset Statistics:")
        
        new_features = ['Gvg', 'Avgalc', 'Bum']
        for feature in new_features:
            if feature in df.columns:
                stats = df[feature].describe()
                print(f"\n{feature}:")
                print(f"  Mean: {stats['mean']:.2f}")
                print(f"  Std: {stats['std']:.2f}")
                print(f"  Min: {stats['min']:.2f}")
                print(f"  Max: {stats['max']:.2f}")
            else:
                print(f"\n{feature}: Not available for this dataset")

# Analyze all processed datasets
if processed_datasets:
    analyze_features(processed_datasets)


Feature Engineering Statistics:

Pmat_full Dataset Statistics:

Gvg:
  Mean: 10.82
  Std: 3.41
  Min: 2.00
  Max: 19.00

Avgalc:
  Mean: 1.88
  Std: 0.98
  Min: 1.00
  Max: 5.00

Bum:
  Mean: 19.13
  Std: 12.75
  Min: 5.00
  Max: 118.50

Ppor_full Dataset Statistics:

Gvg:
  Mean: 11.49
  Std: 2.73
  Min: 2.00
  Max: 18.50

Avgalc:
  Mean: 1.89
  Std: 0.99
  Min: 1.00
  Max: 5.00

Bum:
  Mean: 15.93
  Std: 8.07
  Min: 4.00
  Max: 59.00

PmatM Dataset Statistics:

Gvg:
  Mean: 11.17
  Std: 3.50
  Min: 2.50
  Max: 19.00

Avgalc:
  Mean: 2.18
  Std: 1.13
  Min: 1.00
  Max: 5.00

Bum:
  Mean: 19.42
  Std: 10.16
  Min: 5.00
  Max: 68.00

PmatFE Dataset Statistics:

Gvg:
  Mean: 10.50
  Std: 3.30
  Min: 2.00
  Max: 18.50

Avgalc:
  Mean: 1.61
  Std: 0.73
  Min: 1.00
  Max: 5.00

Bum:
  Mean: 18.87
  Std: 14.70
  Min: 6.00
  Max: 118.50

PporM Dataset Statistics:

Gvg:
  Mean: 11.15
  Std: 2.63
  Min: 2.00
  Max: 18.00

Avgalc:
  Mean: 2.28
  Std: 1.16
  Min: 1.00
  Max: 5.00

Bum:
  Mean: 1
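
The wide spread of Bum (for example, a maximum of 118.50 against a mean of 19.13 in Pmat_full) suggests that a single input can dominate the sum. The sketch below is not part of the original notebook: it splits Bum into its weighted terms so that each term's share can be inspected. The helper name `bum_terms` and the toy rows are hypothetical.

```python
import pandas as pd

# Weights as listed in the feature description.
BUM_WEIGHTS = {"failures": 2.0, "absences": 1.5, "Dalc": 1.0,
               "Walc": 1.0, "studytime_inv": 1.0, "freetime": 1.0}

def bum_terms(df):
    """Return one column per weighted Bum term (hypothetical helper)."""
    inputs = df[["failures", "absences", "Dalc", "Walc", "freetime"]].copy()
    inputs["studytime_inv"] = 5 - df["studytime"]
    # Multiply each input column by its weight; columns align by name.
    return inputs[list(BUM_WEIGHTS)] * pd.Series(BUM_WEIGHTS)

# Toy rows, invented for illustration.
toy = pd.DataFrame({
    "failures": [0, 3], "absences": [2, 40], "Dalc": [1, 2],
    "Walc": [1, 4], "studytime": [3, 1], "freetime": [3, 4],
})

terms = bum_terms(toy)
bum = terms.sum(axis=1)                  # same value as the Bum formula
absence_share = terms["absences"] / bum  # fraction of Bum due to absences
print(terms)
print(absence_share.round(2))
```

In the second toy row, 40 absences contribute 60 of the 80 Bum points. Running the same check on the real datasets would show whether the large maxima come mainly from the absence count.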

### Splitting and saving this data for training

In [103]:
def split_save_and_print(data, name, test_size=0.3, random_state=42):
    """Split a dataset into training and testing sets, save the splits as CSV files, and print their sizes"""
    if data is None or len(data) == 0:
        print(f"\nWarning: {name} dataset is empty, skipping split")
        return None, None, None, None
    
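    # Separate the grade columns (targets) from the features.
    # 'sex' is dropped whenever it is present, which includes the *_full
    # datasets as well as the gender splits.
    features_to_drop = ['G1', 'G2', 'G3']
    if 'sex' in data.columns:
        features_to_drop.append('sex')
        
    # Note: X still contains Gvg, which is computed from G1 and G2.
    X = data.drop(features_to_drop, axis=1)
    y = data[['G1', 'G2', 'G3']]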
    
    X_train, X_test, y_train, y_test = train_test_split(
        X, y,
        test_size=test_size,
        random_state=random_state
    )
    
    # Save splits
    base_path = 'processed_data'
    X_train.to_csv(f'{base_path}/X_{name}_train.csv', index=False)
    X_test.to_csv(f'{base_path}/X_{name}_test.csv', index=False)
    y_train.to_csv(f'{base_path}/y_{name}_train.csv', index=False)
    y_test.to_csv(f'{base_path}/y_{name}_test.csv', index=False)
    
    print(f"\n{name}:")
    print(f"Training: {X_train.shape[0]} samples")
    print(f"Testing: {X_test.shape[0]} samples")
    return X_train, X_test, y_train, y_test

# Process and save datasets
if processed_datasets:
    for name, df in processed_datasets.items():
        print(f"\nProcessing {name} dataset for train-test split...")
        split_save_and_print(df, name)


Processing Pmat_full dataset for train-test split...

Pmat_full:
Training: 275 samples
Testing: 119 samples

Processing Ppor_full dataset for train-test split...

Ppor_full:
Training: 453 samples
Testing: 195 samples

Processing PmatM dataset for train-test split...

PmatM:
Training: 130 samples
Testing: 56 samples

Processing PmatFE dataset for train-test split...

PmatFE:
Training: 145 samples
Testing: 63 samples

Processing PporM dataset for train-test split...

PporM:
Training: 185 samples
Testing: 80 samples

Processing PporFE dataset for train-test split...

PporFE:
Training: 268 samples
Testing: 115 samples
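
The split sizes above can be checked by hand. The sketch below assumes, as the reported counts suggest, that with `test_size=0.3` the number of test samples is 30% of the rows rounded up and the remainder goes to training. The helper name `split_sizes` is hypothetical; the row counts are taken from the dataset verification output earlier in this notebook.

```python
import math

def split_sizes(n_samples, test_size=0.3):
    """Return (n_train, n_test), rounding the test share up."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test

# Row counts taken from the dataset verification output.
row_counts = {"Pmat_full": 394, "Ppor_full": 648, "PmatM": 186,
              "PmatFE": 208, "PporM": 265, "PporFE": 383}

sizes = {name: split_sizes(n) for name, n in row_counts.items()}
for name, (n_train, n_test) in sizes.items():
    print(f"{name}: {n_train} train, {n_test} test")
```

Under that assumption the computed sizes agree with the training and testing counts printed for all six datasets.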
