## Feature Engineering

### Engineered Features and Justifications

1. **Gvg** - Average of the G1 and G2 grades, computed as (G1 + G2) / 2
   - Justification: Averaging the first and second period grades provides a more stable measure of a student's performance over time, reducing the impact of any single period's anomalies.

2. **Avgalc** - Average of Dalc and Walc, computed as (Dalc + Walc) / 2
   - Justification: Combining workday and weekend alcohol consumption into a single average value gives a more comprehensive view of a student's overall alcohol consumption habits.

3. **Bum** - A weighted sum of failures, absences, Dalc, Walc, inverted studytime, and freetime to indicate a student's tendency to fail, skip school, drink alcohol, not study, and have free time.
   - Justification for weights:
     - Failures are given a higher weight (2) because past class failures are a strong indicator of academic struggles.
     - Absences are weighted at 1.5 as frequent absences can significantly impact academic performance.
     - Both Dalc and Walc are weighted at 1 as alcohol consumption can affect both health and academic performance.
     - Studytime is inverted (1 - studytime, because the inputs are already min-max scaled to [0, 1]) and weighted at 1 because less study time can lead to poorer academic outcomes.
     - Freetime is weighted at 0.5, matching the implementation below, as more free time might indicate less focus on academics.
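
As a rough illustration of the three definitions above, the sketch below computes them for a small made-up table. The column values are invented and the helper name `add_engineered_features` is hypothetical. The weights follow the implementation cell later in this notebook (inputs assumed to be min-max scaled to [0, 1], studytime inverted as 1 - studytime, freetime weighted at 0.5), so treat this as a sketch rather than the exact pipeline.

```python
import pandas as pd

# Weights mirror the implementation cell in this notebook; they live in a dict
# so alternative values can be tried without touching the formula.
BUM_WEIGHTS = {"failures": 2.0, "absences": 1.5, "Dalc": 1.0,
               "Walc": 1.0, "studytime_inv": 1.0, "freetime": 0.5}

def add_engineered_features(df, weights=BUM_WEIGHTS):
    """Hypothetical helper: return a copy of df with Gvg, Avgalc and Bum."""
    out = df.copy()
    out["Gvg"] = (out["G1"] + out["G2"]) / 2
    out["Avgalc"] = (out["Dalc"] + out["Walc"]) / 2
    out["Bum"] = (
        weights["failures"] * out["failures"]
        + weights["absences"] * out["absences"]
        + weights["Dalc"] * out["Dalc"]
        + weights["Walc"] * out["Walc"]
        # Assumes studytime is already scaled to [0, 1], so 1 - x inverts it.
        + weights["studytime_inv"] * (1.0 - out["studytime"])
        + weights["freetime"] * out["freetime"]
    )
    return out

# Invented toy rows: grades on the 0-20 scale, other columns scaled to [0, 1].
toy = pd.DataFrame({
    "G1": [10, 15], "G2": [13, 14],
    "failures": [0.5, 0.0], "absences": [0.2, 0.0],
    "Dalc": [0.25, 0.0], "Walc": [0.5, 0.25],
    "studytime": [0.75, 1.0], "freetime": [0.5, 0.25],
})
toy_features = add_engineered_features(toy)
print(toy_features[["Gvg", "Avgalc", "Bum"]])
```

Passing a different `weights` dictionary is one way to check how sensitive the Bum score is to the chosen weighting.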

## Import dependencies

In [2]:
import pandas as pd
import numpy as np
import os
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

#### Load the datasets
Due to our integration of pipelines, we do not need to convert from ARFF again.

In [3]:
def load_dataset(filename):
    """Load a dataset with improved error handling"""
    filepath = os.path.join('processed_data', filename)
    
    if not os.path.exists(filepath):
        print(f"Warning: File not found: {filepath}")
        return None
    
    try:
        # Load CSV with proper handling of quoted strings
        df = pd.read_csv(filepath, quotechar="'", escapechar="\\")
        print(f"Successfully loaded {filename}: {df.shape[0]} rows")
        return df
    except Exception as e:
        print(f"Error loading {filename}: {str(e)}")
        return None

# Load all datasets
datasets = {
    'Pmat_full': load_dataset('Pmat_full.csv'),
    'Ppor_full': load_dataset('Ppor_full.csv'),
    'PmatM': load_dataset('PmatM.csv'),
    'PmatF': load_dataset('PmatF.csv'),
    'PporM': load_dataset('PporM.csv'),
    'PporF': load_dataset('PporF.csv')
}

Successfully loaded Pmat_full.csv: 251 rows
Successfully loaded Ppor_full.csv: 461 rows
Successfully loaded PmatM.csv: 119 rows
Successfully loaded PporM.csv: 179 rows
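
To make the `quotechar="'"` setting in the loader more concrete, here is a minimal sketch that parses an in-memory sample instead of a file. The two-row sample is invented, and it only assumes that the processed CSV files wrap string fields in single quotes.

```python
import io
import pandas as pd

# Invented two-row sample with single-quoted string fields.
raw_csv = "school,sex,age\n'GP','F',18\n'MS','M',17\n"

# quotechar="'" makes pandas treat the single quotes as field quoting,
# so they are removed from the parsed values.
sample = pd.read_csv(io.StringIO(raw_csv), quotechar="'", escapechar="\\")
print(sample["school"].tolist())
print(sample["age"].dtype)
```

Unquoted numeric columns such as `age` are still parsed as numbers, so no extra conversion should be needed for them.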


### Encoding categorical and nominal attributes of our data
Additionally, we will clean erroneous data here before training.

In [4]:
# Encode categorical features and clean data
def encode_categorical_data(df):
    if df is None or df.empty:
        return df
        
    df = df.copy()
    
    # Clean string values by removing quotes
    for col in df.select_dtypes(include=['object']).columns:
        df[col] = df[col].str.strip("'")
    
    # Encode categorical variables
    categorical_cols = ['school', 'sex', 'address', 'famsize', 'Mjob', 'Fjob', 'reason', 'guardian', 'Pstatus']
    
    # Handle binary variables first
    binary_mapping = {
        'schoolsup': {'no': 0, 'yes': 1},
        'famsup': {'no': 0, 'yes': 1},
        'paid': {'no': 0, 'yes': 1},
        'activities': {'no': 0, 'yes': 1},
        'nursery': {'no': 0, 'yes': 1},
        'higher': {'no': 0, 'yes': 1},
        'internet': {'no': 0, 'yes': 1},
        'romantic': {'no': 0, 'yes': 1},
        'school': {'GP': 0, 'MS': 1},
        'sex': {'f': 0, 'm': 1, 'F': 0, 'M': 1},
        'address': {'U': 0, 'R': 1},
        'famsize': {'LE3': 0, 'GT3': 1},
        'Pstatus': {'T': 0, 'A': 1}
    }
    
    for col, mapping in binary_mapping.items():
        if col in df.columns:
            df[col] = df[col].map(mapping)
    
    # Remove binary columns from categorical columns
    categorical_cols = [col for col in categorical_cols if col not in binary_mapping]
    
    # One-hot encode remaining categorical variables
    if categorical_cols:
        df = pd.get_dummies(df, columns=categorical_cols, prefix=categorical_cols)
    
    return df

# Process all datasets
print("\nEncoding categorical features...")
encoded_datasets = {}

for name, df in datasets.items():
    if df is not None and not df.empty:
        print(f"Encoding {name}...")
        encoded_datasets[name] = encode_categorical_data(df)
    else:
        encoded_datasets[name] = None
        print(f"Skipped {name} (empty dataset)")

# Replace original datasets with encoded versions
datasets = encoded_datasets

print("Categorical encoding complete")


Encoding categorical features...
Encoding Pmat_full...
Encoding Ppor_full...
Encoding PmatM...
Skipped PmatF (empty dataset)
Encoding PporM...
Skipped PporF (empty dataset)
Categorical encoding complete
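
The two encoding strategies used above can be sketched on a toy frame. The data are invented, and the value `"maybe"` is included on purpose to show what happens to a value that is missing from a binary mapping.

```python
import pandas as pd

toy = pd.DataFrame({
    "higher": ["yes", "no", "maybe"],   # "maybe" is deliberately not in the mapping
    "Mjob": ["teacher", "other", "teacher"],
})

# Binary mapping: values missing from the dict become NaN rather than raising.
toy["higher"] = toy["higher"].map({"no": 0, "yes": 1})

# One-hot encoding: one indicator column per category, named <prefix>_<value>.
encoded = pd.get_dummies(toy, columns=["Mjob"], prefix=["Mjob"])

print(encoded.columns.tolist())
print(int(encoded["higher"].isna().sum()))
```

Because unmapped values turn into NaN silently, counting missing values after the mapping step is a cheap way to spot unexpected category labels.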


### Create and apply features
Refer to the top of this notebook for justifications.

In [5]:
# Create engineered features
def create_engineered_features(df):
    if df is None or df.empty:
        return df
        
    df_eng = df.copy()
    
    # Create Gvg - average of G1 and G2
    if 'G1' in df.columns and 'G2' in df.columns:
        try:
            df_eng['Gvg'] = ((pd.to_numeric(df['G1'], errors='coerce') + 
                             pd.to_numeric(df['G2'], errors='coerce')) / 2)
        except Exception as e:
            print(f"Error creating Gvg: {str(e)}")
    
    # Create Avgalc - average alcohol consumption
    if 'Dalc' in df.columns and 'Walc' in df.columns:
        try:
            df_eng['Avgalc'] = ((pd.to_numeric(df['Dalc'], errors='coerce') + 
                               pd.to_numeric(df['Walc'], errors='coerce')) / 2)
        except Exception as e:
            print(f"Error creating Avgalc: {str(e)}")
    
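    # Create Bum composite indicator (weighted sum; studytime is inverted as
    # 1 - studytime, which assumes the inputs are already scaled to [0, 1])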
    required_cols = ['failures', 'absences', 'Dalc', 'Walc', 'studytime', 'freetime']
    if all(col in df.columns for col in required_cols):
        try:
            df_eng['Bum'] = (
                2.0 * pd.to_numeric(df['failures'], errors='coerce') + 
                1.5 * pd.to_numeric(df['absences'], errors='coerce') + 
                1.0 * pd.to_numeric(df['Dalc'], errors='coerce') + 
                1.0 * pd.to_numeric(df['Walc'], errors='coerce') + 
                1.0 * (1.0 - pd.to_numeric(df['studytime'], errors='coerce')) + 
                0.5 * pd.to_numeric(df['freetime'], errors='coerce')
            )
        except Exception as e:
            print(f"Error creating Bum: {str(e)}")
            
    return df_eng

# Process each dataset
processed_datasets = {}
for name, df in datasets.items():
    if df is not None and not df.empty:
        print(f"Processing {name} dataset...")
        processed_datasets[name] = create_engineered_features(df)
    else:
        processed_datasets[name] = None
        print(f"Warning: {name} dataset is empty or None")

Processing Pmat_full dataset...
Processing Ppor_full dataset...
Processing PmatM dataset...
Processing PporM dataset...
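
The feature code relies on `pd.to_numeric(..., errors='coerce')`, which converts unparseable entries to NaN instead of raising. A small sketch with invented values shows how such an entry propagates into the engineered feature:

```python
import pandas as pd

# Invented sample: the "?" entry stands in for a value that cannot be parsed.
toy = pd.DataFrame({"G1": ["10", "12", "?"], "G2": [14, 11, 9]})

g1 = pd.to_numeric(toy["G1"], errors="coerce")  # "?" cannot be parsed -> NaN
g2 = pd.to_numeric(toy["G2"], errors="coerce")
toy["Gvg"] = (g1 + g2) / 2                      # NaN in either input gives NaN

print(toy["Gvg"].tolist())
n_missing = int(toy["Gvg"].isna().sum())
print(f"Rows with missing Gvg: {n_missing}")
```

Since coercion does not raise, the `try`/`except` blocks above would not report such rows; a missing-value count like `n_missing` is one way to surface them.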


### Analyse features
This step confirms that the features were created as expected and reports summary statistics (mean, standard deviation, minimum, and maximum) that can support testing in the MLOps development cycle and guide further model development.

In [6]:
def analyze_features(processed_datasets):
    print("\nFeature Engineering Statistics:")
    
    for name, df in processed_datasets.items():
        if df is None or df.empty:
            print(f"\n{name} Dataset is empty or None")
            continue
            
        print(f"\n{name} Dataset Statistics:")
        
        new_features = ['Gvg', 'Avgalc', 'Bum']
        for feature in new_features:
            if feature in df.columns:
                stats = df[feature].describe()
                print(f"\n{feature}:")
                print(f"  Mean: {stats['mean']:.2f}")
                print(f"  Std: {stats['std']:.2f}")
                print(f"  Min: {stats['min']:.2f}")
                print(f"  Max: {stats['max']:.2f}")
            else:
                print(f"\n{feature}: Not available for this dataset")

# Analyze all processed datasets
if processed_datasets:
    analyze_features(processed_datasets)


Feature Engineering Statistics:

Pmat_full Dataset Statistics:

Gvg:
  Mean: 11.74
  Std: 3.04
  Min: 5.00
  Max: 18.50

Avgalc:
  Mean: 0.22
  Std: 0.25
  Min: 0.00
  Max: 1.00

Bum:
  Mean: 1.56
  Std: 0.80
  Min: 0.25
  Max: 3.85

Ppor_full Dataset Statistics:

Gvg:
  Mean: 11.95
  Std: 2.22
  Min: 6.50
  Max: 17.00

Avgalc:
  Mean: 0.22
  Std: 0.25
  Min: 0.00
  Max: 1.00

Bum:
  Mean: 1.55
  Std: 0.79
  Min: 0.12
  Max: 3.92

PmatM Dataset Statistics:

Gvg:
  Mean: 12.47
  Std: 2.90
  Min: 5.50
  Max: 18.50

Avgalc:
  Mean: 0.27
  Std: 0.29
  Min: 0.00
  Max: 1.00

Bum:
  Mean: 1.78
  Std: 0.83
  Min: 0.33
  Max: 3.85

PmatF Dataset is empty or None

PporM Dataset Statistics:

Gvg:
  Mean: 11.72
  Std: 2.20
  Min: 7.50
  Max: 17.00

Avgalc:
  Mean: 0.30
  Std: 0.29
  Min: 0.00
  Max: 1.00

Bum:
  Mean: 1.81
  Std: 0.87
  Min: 0.38
  Max: 3.80

PporF Dataset is empty or None
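
If the statistics need to be compared across runs or checked automatically, one option is to collect them into a table instead of printing them. The helper name `summarize_features` and the toy data are hypothetical; this is a sketch of the idea, not part of the original pipeline.

```python
import pandas as pd

def summarize_features(datasets, features=("Gvg", "Avgalc", "Bum")):
    """Hypothetical helper: one row per (dataset, feature) with summary stats."""
    rows = []
    for name, df in datasets.items():
        if df is None or df.empty:
            continue  # mirror the notebook's handling of missing datasets
        for feature in features:
            if feature in df.columns:
                stats = df[feature].agg(["mean", "std", "min", "max"])
                rows.append({"dataset": name, "feature": feature, **stats.to_dict()})
    return pd.DataFrame(rows)

# Invented toy input: one small dataset and one missing dataset.
toy = {"A": pd.DataFrame({"Gvg": [10.0, 12.0, 14.0]}), "B": None}
summary = summarize_features(toy)
print(summary)
```

The resulting frame could be saved alongside the datasets so that later runs can be compared against it.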


### Scaling engineered features to fit our datasets
Because the features were already scaled during data wrangling, the additional engineered features need to be brought onto the same scale.

In [7]:
engineered_features = ['Gvg', 'Avgalc', 'Bum']

print("\nScaling Engineered Features")

# Process each dataset
for name, df in processed_datasets.items():
    if df is not None and not df.empty:
        # Find engineered features that exist and are non-binary numerical
        features_to_scale = []
        for col in engineered_features:
            if col in df.columns:
                try:
                    # Check if it's numerical and non-binary
                    unique_values = df[col].dropna().unique()
                    if len(unique_values) > 2:  # Skip binary features
                        features_to_scale.append(col)
                except Exception:
                    continue
        
        # Apply scaling if we have features to scale
        if features_to_scale:
            scaler = MinMaxScaler()
            # Scale each feature individually 
            for col in features_to_scale:
                # Reshape for MinMaxScaler (needs 2D array)
                values = df[col].values.reshape(-1, 1)
                processed_datasets[name][col] = scaler.fit_transform(values).flatten()
            
            print(f"{name}: Scaled {len(features_to_scale)} features - {', '.join(features_to_scale)}")
        else:
            print(f"{name}: No non-binary engineered features to scale")
    else:
        print(f"{name}: Dataset is empty or None, skipping scaling")

print("Scaling Complete")


Scaling Engineered Features
Pmat_full: Scaled 3 features - Gvg, Avgalc, Bum
Ppor_full: Scaled 3 features - Gvg, Avgalc, Bum
PmatM: Scaled 3 features - Gvg, Avgalc, Bum
PmatF: Dataset is empty or None, skipping scaling
PporM: Scaled 3 features - Gvg, Avgalc, Bum
PporF: Dataset is empty or None, skipping scaling
Scaling Complete
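
With its default `feature_range` of (0, 1), `MinMaxScaler` rescales each column as (x - min) / (max - min). The sketch below reproduces that formula in plain pandas on invented values. The helper `min_max_scale` is hypothetical, and returning zeros for a constant column is a choice made for this sketch.

```python
import pandas as pd

def min_max_scale(series):
    """Rescale a numeric Series to [0, 1] via (x - min) / (max - min)."""
    lo, hi = series.min(), series.max()
    if hi == lo:
        # Constant column: the range is zero, so return zeros instead of dividing by 0.
        return pd.Series(0.0, index=series.index)
    return (series - lo) / (hi - lo)

gvg = pd.Series([5.0, 11.75, 18.5])  # invented values on the 0-20 grade scale
scaled = min_max_scale(gvg)
print(scaled.tolist())
```

Note that the minimum and maximum here come from whatever data is passed in, which is the same behaviour as fitting the scaler on the full dataset in the cell above.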


### Saving the full enhanced datasets
To keep the original files intact and to support enhanced datasets produced by new or changed data wrangling or feature engineering, each dataset is saved under its existing name with the `_enhanced.csv` suffix appended.

In [8]:
# Save enhanced datasets post-scaling
print("\nSaving Enhanced Datasets")

# Save all processed datasets with _enhanced suffix
for name, df in processed_datasets.items():
    if df is not None and not df.empty:
        # Create enhanced filename
        output_filename = f"{name}_enhanced.csv"
        output_path = os.path.join('processed_data', output_filename)
        
        # Save to CSV
        df.to_csv(output_path, index=False)
        print(f"Saved: {output_filename} ({df.shape[0]} rows, {df.shape[1]} columns)")
    else:
        print(f"Skipped saving {name} (empty dataset)")

print("Enhanced datasets saved successfully")


Saving Enhanced Datasets
Saved: Pmat_full_enhanced.csv (251 rows, 49 columns)
Saved: Ppor_full_enhanced.csv (461 rows, 49 columns)
Saved: PmatM_enhanced.csv (119 rows, 49 columns)
Skipped saving PmatF (empty dataset)
Saved: PporM_enhanced.csv (179 rows, 49 columns)
Skipped saving PporF (empty dataset)
Enhanced datasets saved successfully


### Splitting and saving data for training and testing
The X and y train and test sets are saved in the format expected by our chosen training and testing pipeline.

In [9]:
# Split and save enhanced datasets for training
print("\nCreating Train/Test Splits for Enhanced Datasets")

for name, df in processed_datasets.items():
    if df is not None and not df.empty:
        print(f"Processing {name}...")
        
        # Define features and target
        features_to_drop = ['G1', 'G2', 'G3']
        X = df.drop(features_to_drop, axis=1)
        y = df[['G1', 'G2', 'G3']]
        
        # Create train/test split
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.3, random_state=42)
        
        # Save with enhanced naming convention
        base_path = 'processed_data'
        X_train.to_csv(f'{base_path}/X_{name}_enhanced_train.csv', index=False)
        X_test.to_csv(f'{base_path}/X_{name}_enhanced_test.csv', index=False)
        y_train.to_csv(f'{base_path}/y_{name}_enhanced_train.csv', index=False)
        y_test.to_csv(f'{base_path}/y_{name}_enhanced_test.csv', index=False)
        
        print(f"  • Training: {X_train.shape[0]} samples")
        print(f"  • Testing: {X_test.shape[0]} samples")
    else:
        print(f"Skipped {name} (empty dataset)")

print("All enhanced datasets split and saved")


Creating Train/Test Splits for Enhanced Datasets
Processing Pmat_full...
  • Training: 175 samples
  • Testing: 76 samples
Processing Ppor_full...
  • Training: 322 samples
  • Testing: 139 samples
Processing PmatM...
  • Training: 83 samples
  • Testing: 36 samples
Skipped PmatF (empty dataset)
Processing PporM...
  • Training: 125 samples
  • Testing: 54 samples
Skipped PporF (empty dataset)
All enhanced datasets split and saved
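
The sample counts printed above are consistent with the test share being rounded up to the next whole row, which matches the rounding we understand `train_test_split` to apply for a fractional `test_size`. The sketch below reproduces the arithmetic; `split_sizes` is a hypothetical helper, not a scikit-learn function.

```python
import math

def split_sizes(n_samples, test_size=0.3):
    """Return (n_train, n_test), rounding the test share up."""
    n_test = math.ceil(n_samples * test_size)
    return n_samples - n_test, n_test

# Row counts taken from the load output earlier in this notebook.
for name, n_rows in [("Pmat_full", 251), ("Ppor_full", 461),
                     ("PmatM", 119), ("PporM", 179)]:
    n_train, n_test = split_sizes(n_rows)
    print(f"{name}: {n_train} train / {n_test} test")
```

For example, 251 rows at a 0.3 test share give 75.3, which rounds up to 76 test rows and leaves 175 for training.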
