## Feature Engineering

### Engineered Features and Justifications

1. **Gvg** - Average of the first- and second-period grades, truncated to an integer: int((G1 + G2) / 2)
   - Justification: Averaging G1 and G2 gives a more stable measure of a student's performance than either grade alone, damping the effect of a single unusual period.

2. **Avgalc** - Average of workday (Dalc) and weekend (Walc) alcohol consumption, truncated to an integer: int((Dalc + Walc) / 2)
   - Justification: Combining workday and weekend alcohol consumption into a single value summarizes a student's overall drinking habits.

3. **Bum** - A weighted sum of failures, absences, Dalc, Walc, inverted studytime, and freetime. Higher values indicate a stronger tendency to fail classes, miss school, drink alcohol, study little, and have a lot of free time.
   - Justification for weights:
     - Failures are given a higher weight (2) because past class failures are a strong indicator of academic struggles.
     - Absences are weighted at 1.5 as frequent absences can significantly impact academic performance.
     - Both Dalc and Walc are weighted at 1 as alcohol consumption can affect both health and academic performance.
     - Studytime is inverted (5 - studytime) and weighted at 1 because less study time can lead to poorer academic outcomes.
     - Freetime is weighted at 1 as more free time might indicate less focus on academics.
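
As a worked example of the weights listed above, the following sketch computes Bum for a single student. The `bum_score` helper and the example values are hypothetical and serve only to illustrate the arithmetic.

```python
# Weights taken from the justification list above.
WEIGHTS = {"failures": 2.0, "absences": 1.5, "Dalc": 1.0, "Walc": 1.0, "freetime": 1.0}


def bum_score(student):
    """Return the Bum score for one student given as a dict of raw feature values."""
    weighted = sum(weight * student[name] for name, weight in WEIGHTS.items())
    # studytime is inverted so that less study time raises the score
    return weighted + (5 - student["studytime"])


# Hypothetical student; these values are invented for illustration only
example = {"failures": 1, "absences": 4, "Dalc": 1, "Walc": 3, "studytime": 2, "freetime": 3}
score = bum_score(example)
print(score)  # 2*1 + 1.5*4 + 1 + 3 + 3 + (5 - 2) = 18.0
```

In this example the four absences contribute 6 of the 18 points, which shows how the raw absence count can outweigh the other terms.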

In [1]:
# Import libraries
import pandas as pd
import os

#### Load the datasets

In [2]:
# Load CSV files
mat_df = pd.read_csv('data/mat.csv')
por_df = pd.read_csv('data/por.csv')
csv_df = pd.read_csv('data/dataset.csv')

#### Dealing with null values

In [3]:
# Remove Null values
def remove_nulls(df):
    df = df.dropna()
    return df

mat_df = remove_nulls(mat_df)
por_df = remove_nulls(por_df)
csv_df = remove_nulls(csv_df)
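
Since `dropna()` removes every row that has at least one missing value, it can help to check how many values are missing per column before dropping. The sketch below is one possible way to do that; the `null_report` helper and the toy frame are hypothetical and are not part of the original notebook.

```python
import pandas as pd


def null_report(df):
    """Return the number of missing values per column, keeping only columns that have any."""
    counts = df.isna().sum()
    return counts[counts > 0]


# Toy frame invented for illustration; the real datasets may contain no nulls at all
toy = pd.DataFrame({
    "G1": [10, None, 12],
    "G2": [11, 13, 12],
    "school": ["GP", "MS", None],
})

report = null_report(toy)
print(report)
print(len(toy), "rows before,", len(toy.dropna()), "rows after dropna()")
```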

#### Remove Duplicates

In [4]:
# Remove duplicates
def remove_duplicates(df):
    df = df.drop_duplicates()
    return df

mat_df = remove_duplicates(mat_df)
por_df = remove_duplicates(por_df)
csv_df = remove_duplicates(csv_df)

#### Encoding categorical variables

In [5]:
# Encode a yes/no column as 1/0 (any value other than 'yes' becomes 0)
def encode_categorical(df, column):
    df = df.copy()  # work on a copy so the caller's DataFrame is not modified
    df[column] = (df[column] == 'yes').astype(int)
    return df

mat_df = encode_categorical(mat_df, 'schoolsup')
por_df = encode_categorical(por_df, 'schoolsup')
csv_df = encode_categorical(csv_df, 'schoolsup')
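
The encoding above turns every value other than `'yes'` into 0, including misspellings or differently cased entries. As an alternative sketch, an explicit mapping with `Series.map` leaves unexpected values as NaN so they can be detected. The `encode_yes_no` helper and the toy data are hypothetical.

```python
import pandas as pd


def encode_yes_no(df, column):
    """Return a copy of df with a yes/no column mapped to 1/0; other values become NaN."""
    out = df.copy()
    out[column] = out[column].map({"yes": 1, "no": 0})
    return out


# Toy data invented for illustration; "Yes" stands in for an unexpected value
toy = pd.DataFrame({"schoolsup": ["yes", "no", "Yes"]})
encoded = encode_yes_no(toy, "schoolsup")

# Count the values that matched neither 'yes' nor 'no'
unexpected = encoded["schoolsup"].isna().sum()
print(encoded)
print("unexpected values:", unexpected)
```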

#### Calculating new features from existing features

In [6]:
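# Calculate Gvg feature: integer average of G1 and G2
def calculate_gvg(row):
    # int() truncates the mean so Gvg matches its integer definition
    return int((row['G1'] + row['G2']) / 2)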

mat_df['Gvg'] = mat_df.apply(calculate_gvg, axis=1)
por_df['Gvg'] = por_df.apply(calculate_gvg, axis=1)
csv_df['Gvg'] = csv_df.apply(calculate_gvg, axis=1)

In [7]:
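# Calculate Avgalc feature: integer average of Dalc and Walc
def calculate_avgalc(row):
    # int() truncates the mean so Avgalc matches its integer definition
    return int((row['Dalc'] + row['Walc']) / 2)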

mat_df['Avgalc'] = mat_df.apply(calculate_avgalc, axis=1)
por_df['Avgalc'] = por_df.apply(calculate_avgalc, axis=1)
csv_df['Avgalc'] = csv_df.apply(calculate_avgalc, axis=1)

In [8]:
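# Calculate Bum feature: weighted sum of risk indicators
def calculate_bum(row):
    return (
        2 * row['failures'] + 1.5 * row['absences']   # weights 2 and 1.5
        + row['Dalc'] + row['Walc']                   # weight 1 each
        + (5 - row['studytime']) + row['freetime']    # inverted studytime, weight 1 each
    )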

mat_df['Bum'] = mat_df.apply(calculate_bum, axis=1)
por_df['Bum'] = por_df.apply(calculate_bum, axis=1)
csv_df['Bum'] = csv_df.apply(calculate_bum, axis=1)
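
Row-wise `apply` works, but the same three features can also be computed with column arithmetic, which is usually faster on larger frames. The following is a minimal sketch of that alternative, not the notebook's own method. The `add_engineered_features` helper and the toy data are hypothetical, and floor division is used for the two averages on the assumption that the inputs are non-negative integers, following the integer definitions in the feature list.

```python
import pandas as pd


def add_engineered_features(df):
    """Return a copy of df with Gvg, Avgalc and Bum computed column-wise."""
    out = df.copy()
    # Floor division equals integer truncation for non-negative integer inputs
    out["Gvg"] = (out["G1"] + out["G2"]) // 2
    out["Avgalc"] = (out["Dalc"] + out["Walc"]) // 2
    out["Bum"] = (
        2 * out["failures"]
        + 1.5 * out["absences"]
        + out["Dalc"]
        + out["Walc"]
        + (5 - out["studytime"])
        + out["freetime"]
    )
    return out


# Toy data invented for illustration only
toy = pd.DataFrame({
    "G1": [10, 15], "G2": [11, 14],
    "Dalc": [1, 2], "Walc": [2, 5],
    "failures": [0, 1], "absences": [2, 0],
    "studytime": [2, 1], "freetime": [3, 4],
})

features = add_engineered_features(toy)
print(features[["Gvg", "Avgalc", "Bum"]])
```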

#### Save the wrangled and engineered data to CSV

In [9]:
# Create processed_data directory if it doesn't exist
os.makedirs('../processed_data', exist_ok=True)

# Save processed data
mat_df.to_csv('../processed_data/Pmat.csv', index=False)
por_df.to_csv('../processed_data/Ppor.csv', index=False)
csv_df.to_csv('../processed_data/Pdataset.csv', index=False)