# Feature Engineering

**Objective:** Create new features that improve the model's predictive power

**Why Feature Engineering Matters:**
- Raw data doesn't always capture business logic
- Combining features creates powerful predictors
- Domain knowledge translates to better features

**Features We'll Create:**
1. Tenure groups (New, Mid, Senior, Veteran)
2. Salary bins (income-level categories)
3. Income-to-age ratio
4. Work-life balance index
5. Age groups
6. Distance categories
7. Total satisfaction score
8. Years-since-promotion ratio
9. Department risk score
10. Binary and one-hot encodings for models

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

print("Libraries imported!")

Libraries imported!


In [2]:
# Load cleaned data
print("Loading cleaned data...")

df = pd.read_csv('../data/processed/cleaned_data.csv')

print("Data loaded!")
print(f"Current shape: {df.shape}")
print(f"{df.shape[0]} rows × {df.shape[1]} columns")

Loading cleaned data...
Data loaded!
Current shape: (1470, 32)
1470 rows × 32 columns


---

## FEATURE 1: Tenure Groups

**Business Logic:** Employees at different career stages show different attrition patterns
- **New (0-2 years):** Still deciding whether the company is the right fit
- **Mid (3-5 years):** Critical retention period
- **Senior (6-10 years):** Settled, lower risk
- **Veteran (11+ years):** Very low attrition risk

In [3]:
# Create Tenure Groups
print("Creating Tenure Groups...")

def categorize_tenure(years):
    """Categorize employees by tenure length"""
    if years <= 2:
        return 'New'
    elif years <= 5:
        return 'Mid'
    elif years <= 10:
        return 'Senior'
    else:
        return 'Veteran'

df['TenureGroup'] = df['YearsAtCompany'].apply(categorize_tenure)

# Check distribution
print("TenureGroup created!")
print("\nDistribution:")
print(df['TenureGroup'].value_counts().sort_index())

# Check attrition by tenure group
print("\nAttrition rate by Tenure Group:")
tenure_attrition = pd.crosstab(df['TenureGroup'], df['Attrition'], normalize='index') * 100
print(tenure_attrition['Yes'].sort_values(ascending=False).round(2))

Creating Tenure Groups...
TenureGroup created!

Distribution:
TenureGroup
Mid        434
New        342
Senior     448
Veteran    246
Name: count, dtype: int64

Attrition rate by Tenure Group:
TenureGroup
New        29.82
Mid        13.82
Senior     12.28
Veteran     8.13
Name: Yes, dtype: float64
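The same grouping can also be expressed with `pd.cut`, which avoids a row-by-row `apply` and yields an ordered categorical, so `sort_index()` lists the groups in career order instead of alphabetically. The following is a minimal sketch on made-up tenure values, not the notebook's data; the lower edge of `-1` is an assumption chosen so that a tenure of 0 falls into the first bin.

```python
import pandas as pd

# Toy tenure values (made up for illustration, not from the dataset)
years = pd.Series([0, 2, 3, 5, 6, 10, 11, 40], name="YearsAtCompany")

# Right-closed bins: (-1, 2], (2, 5], (5, 10], (10, inf)
# These reproduce the <= 2, <= 5, <= 10, else logic of categorize_tenure
tenure_group = pd.cut(
    years,
    bins=[-1, 2, 5, 10, float("inf")],
    labels=["New", "Mid", "Senior", "Veteran"],
)

print(tenure_group.tolist())
# Categories are ordered, so the index sorts New, Mid, Senior, Veteran
print(tenure_group.value_counts().sort_index())
```

One trade-off: an ordered categorical written to CSV comes back as plain text, so the ordering would need to be restored after reloading.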


---

## FEATURE 2: Salary Bins

**Business Logic:** Group salaries into meaningful brackets for HR
- Helps identify if low/mid/high earners have different attrition patterns

In [4]:
# Create Salary Bins
print("🔧 Creating Salary Categories...")

# Check salary distribution first
print("Salary quartiles:")
print(df['MonthlyIncome'].describe())

# Create bins at fixed income thresholds (loosely guided by the quartiles above)
df['SalaryBin'] = pd.cut(df['MonthlyIncome'],
                          bins=[0, 3000, 5000, 7000, 20000],
                          labels=['Low', 'Medium', 'High', 'Very High'])

print("\nSalaryBin created!")
print("\nDistribution:")
print(df['SalaryBin'].value_counts().sort_index())

# Check attrition by salary bin
print("\nAttrition rate by Salary Bin:")
salary_attrition = pd.crosstab(df['SalaryBin'], df['Attrition'], normalize='index') * 100
print(salary_attrition['Yes'].sort_values(ascending=False).round(2))

🔧 Creating Salary Categories...
Salary quartiles:
count     1470.000000
mean      6502.931293
std       4707.956783
min       1009.000000
25%       2911.000000
50%       4919.000000
75%       8379.000000
max      19999.000000
Name: MonthlyIncome, dtype: float64

SalaryBin created!

Distribution:
SalaryBin
Low          395
Medium       354
High         286
Very High    435
Name: count, dtype: int64

Attrition rate by Salary Bin:
SalaryBin
Low          28.61
Medium       14.12
Very High    10.80
High          9.44
Name: Yes, dtype: float64
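The bins above use fixed thresholds (3000, 5000, 7000) rather than the quartiles printed by `describe()`, which is why the four groups are not equal in size. If equal-sized, data-driven groups are preferred, `pd.qcut` is one option. The sketch below uses made-up income values, and the `Q1`..`Q4` labels are hypothetical names chosen for illustration.

```python
import pandas as pd

# Toy incomes (made up for illustration, not the real column)
incomes = pd.Series(
    [1009, 2500, 2911, 4000, 4919, 6000, 8379, 19999],
    name="MonthlyIncome",
)

# qcut places bin edges at the sample quantiles, so each bin
# receives (roughly) the same number of rows
salary_quartile = pd.qcut(incomes, q=4, labels=["Q1", "Q2", "Q3", "Q4"])

print(salary_quartile.value_counts().sort_index())
```

Fixed thresholds are easier to explain to HR stakeholders, while quantile bins adapt to the data; which is preferable depends on how the feature will be used.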


---

## FEATURE 3: Income-to-Age Ratio

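**Business Logic:**
- Younger employees with high pay are likely well compensated for their career stage and may be more inclined to stay
- Older employees with low pay may feel underpaid and be more likely to leave
- The ratio is a rough proxy for how compensation compares with age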

In [5]:
# Create Income-to-Age Ratio
print("Creating Income-to-Age Ratio...")

df['Income_Age_Ratio'] = df['MonthlyIncome'] / df['Age']

print("Income_Age_Ratio created!")
print("\nStatistics:")
print(df.groupby('Attrition')['Income_Age_Ratio'].describe().round(2))

# Higher ratio = better compensation for age
avg_ratio_stayed = df[df['Attrition'] == 'No']['Income_Age_Ratio'].mean()
avg_ratio_left = df[df['Attrition'] == 'Yes']['Income_Age_Ratio'].mean()

print(f"\nEmployees who stayed: Avg ratio = {avg_ratio_stayed:.2f}")
print(f"Employees who left: Avg ratio = {avg_ratio_left:.2f}")

Creating Income-to-Age Ratio...
Income_Age_Ratio created!

Statistics:
            count    mean     std    min    25%     50%     75%     max
Attrition                                                              
No         1233.0  177.18  104.08  36.42  97.34  148.80  232.02  556.00
Yes         237.0  138.50   82.21  36.03  81.45  112.63  173.43  476.71

Employees who stayed: Avg ratio = 177.18
Employees who left: Avg ratio = 138.50


---

## FEATURE 4: Work-Life Balance Index

**Business Logic:** Combine overtime and work-life balance rating
- Overtime reduces effective work-life balance
- Create composite score

In [7]:
# Create Work-Life Balance Index
print("Creating Work-Life Balance Index...")

# First convert OverTime to numeric
df['OverTime_Numeric'] = (df['OverTime'] == 'Yes').astype(int)

# Create index: penalize overtime
df['WLB_Index'] = df['WorkLifeBalance'] - (df['OverTime_Numeric'] * 2)

# Floor at 1 so the index stays on the same 1-4 scale as the original rating
df['WLB_Index'] = df['WLB_Index'].clip(lower=1)

print("WLB_Index created!")
print("\nDistribution:")
print(df['WLB_Index'].value_counts().sort_index())

# Check impact on attrition
print("\nAverage WLB Index:")
print(df.groupby('Attrition')['WLB_Index'].mean().round(2))

Creating Work-Life Balance Index...
WLB_Index created!

Distribution:
WLB_Index
1    438
2    276
3    639
4    117
Name: count, dtype: int64

Average WLB Index:
Attrition
No     2.39
Yes    1.81
Name: WLB_Index, dtype: float64
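To see what the penalty and the clip do together, it can help to enumerate every input combination. This sketch assumes `WorkLifeBalance` takes the values 1 to 4 (consistent with the index values in the output above) and builds a small grid rather than using the real data.

```python
import pandas as pd
from itertools import product

# Every combination of rating (assumed 1-4) and overtime flag (0/1)
rows = [
    {"WorkLifeBalance": wlb, "OverTime_Numeric": ot}
    for wlb, ot in product([1, 2, 3, 4], [0, 1])
]
grid = pd.DataFrame(rows)

# Same formula as the notebook: subtract 2 for overtime, then floor at 1
grid["raw"] = grid["WorkLifeBalance"] - 2 * grid["OverTime_Numeric"]
grid["WLB_Index"] = grid["raw"].clip(lower=1)

print(grid)
```

The grid shows that overtime workers with ratings 1, 2 and 3 all end up at index 1, so the clip discards some of the distinction between them. Whether that loss matters is a modeling judgment.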


---

## FEATURE 5: Age Groups

**Business Logic:** Different generations have different retention patterns

In [8]:
# Create Age Groups (if not already created in EDA)
print("Creating Age Groups...")

df['AgeGroup'] = pd.cut(df['Age'],
                        bins=[0, 30, 40, 50, 70],
                        labels=['Young', 'Mid-Career', 'Experienced', 'Senior'])

print("AgeGroup created!")
print("\nDistribution:")
print(df['AgeGroup'].value_counts().sort_index())

Creating Age Groups...
AgeGroup created!

Distribution:
AgeGroup
Young          386
Mid-Career     619
Experienced    322
Senior         143
Name: count, dtype: int64


---

## FEATURE 6: Distance Categories

**Business Logic:** Long commutes impact retention

In [9]:
# Create Distance Categories
print("Creating Distance Categories...")

df['DistanceCategory'] = pd.cut(df['DistanceFromHome'],
                                 bins=[0, 5, 15, 30],
                                 labels=['Near', 'Medium', 'Far'])

print("DistanceCategory created!")
print("\nDistribution:")
print(df['DistanceCategory'].value_counts().sort_index())

# Check attrition by distance
print("\nAttrition rate by Distance:")
distance_attrition = pd.crosstab(df['DistanceCategory'], df['Attrition'], normalize='index') * 100
print(distance_attrition['Yes'].sort_values(ascending=False).round(2))

Creating Distance Categories...
DistanceCategory created!

Distribution:
DistanceCategory
Near      632
Medium    509
Far       329
Name: count, dtype: int64

Attrition rate by Distance:
DistanceCategory
Far       20.67
Medium    16.11
Near      13.77
Name: Yes, dtype: float64
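`pd.cut` uses right-closed intervals by default, so `bins=[0, 5, 15, 30]` means (0, 5], (5, 15] and (15, 30]. Values outside those edges become `NaN` instead of raising an error. The counts above sum to 1470, so no rows were lost here, but the behaviour is worth checking whenever the edges are hard-coded. A minimal sketch on made-up distances:

```python
import pandas as pd

# Toy distances chosen to sit on and just outside the bin edges
distances = pd.Series([0, 1, 5, 6, 15, 16, 30, 31], name="DistanceFromHome")

cats = pd.cut(
    distances,
    bins=[0, 5, 15, 30],
    labels=["Near", "Medium", "Far"],
)

print(cats.tolist())
# 0 (at the open left edge) and 31 (above the last edge) are not binned
print("Unbinned values:", int(cats.isna().sum()))
```

Passing `include_lowest=True` would pull a value of exactly 0 into the first bin, and a top edge of `float("inf")` would catch unexpectedly long commutes.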


---

## FEATURE 7: Total Satisfaction Score

**Business Logic:** Combine multiple satisfaction metrics
- JobSatisfaction + EnvironmentSatisfaction + RelationshipSatisfaction
- Higher score = more likely to stay

In [10]:
# Create Total Satisfaction Score
print("Creating Total Satisfaction Score...")

df['TotalSatisfaction'] = (df['JobSatisfaction'] + 
                           df['EnvironmentSatisfaction'] + 
                           df['RelationshipSatisfaction'])

print("TotalSatisfaction created!")
print("\nStatistics:")
print(df.groupby('Attrition')['TotalSatisfaction'].describe().round(2))

# Compare
avg_sat_stayed = df[df['Attrition'] == 'No']['TotalSatisfaction'].mean()
avg_sat_left = df[df['Attrition'] == 'Yes']['TotalSatisfaction'].mean()

print(f"\nEmployees who stayed: Avg satisfaction = {avg_sat_stayed:.2f}")
print(f"Employees who left: Avg satisfaction = {avg_sat_left:.2f}")

Creating Total Satisfaction Score...
TotalSatisfaction created!

Statistics:
            count  mean   std  min  25%  50%   75%   max
Attrition                                               
No         1233.0  8.28  1.82  3.0  7.0  8.0  10.0  12.0
Yes         237.0  7.53  2.06  3.0  6.0  8.0   9.0  12.0

Employees who stayed: Avg satisfaction = 8.28
Employees who left: Avg satisfaction = 7.53


---

## FEATURE 8: Years Since Last Promotion Ratio

**Business Logic:** Career stagnation indicator
- Long time without promotion = frustration

In [11]:
# Create promotion stagnation indicator
print("Creating Promotion Stagnation Indicator...")

df['YearsSincePromotion_Ratio'] = df['YearsSinceLastPromotion'] / (df['YearsAtCompany'] + 1)

print("YearsSincePromotion_Ratio created!")
print("\nStatistics:")
print(df.groupby('Attrition')['YearsSincePromotion_Ratio'].describe().round(2))

Creating Promotion Stagnation Indicator...
YearsSincePromotion_Ratio created!

Statistics:
            count  mean   std  min  25%   50%  75%   max
Attrition                                               
No         1233.0  0.24  0.27  0.0  0.0  0.14  0.4  0.92
Yes         237.0  0.24  0.29  0.0  0.0  0.12  0.5  0.88
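The `+ 1` in the denominator is what keeps the ratio defined for employees with zero years at the company. The sketch below, on made-up rows, compares it with a naive division to show the difference.

```python
import pandas as pd

# Toy rows (made up): a brand-new hire, a one-year employee,
# and two ten-year employees with different promotion histories
toy = pd.DataFrame({
    "YearsAtCompany": [0, 1, 10, 10],
    "YearsSinceLastPromotion": [0, 0, 0, 9],
})

# Naive ratio: 0 / 0 produces NaN for the new hire
toy["naive"] = toy["YearsSinceLastPromotion"] / toy["YearsAtCompany"]

# Notebook's version: the +1 keeps every row finite
toy["YearsSincePromotion_Ratio"] = (
    toy["YearsSinceLastPromotion"] / (toy["YearsAtCompany"] + 1)
)

print(toy.round(2))
```

A side effect of the `+ 1` is that the ratio can never reach 1.0, which matches the maxima of 0.92 and 0.88 in the statistics above.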


---

## FEATURE 9: Department Risk Score

**Business Logic:** Encode historical attrition rate by department
- Sales has the highest attrition rate → highest score

In [12]:
# Create Department Risk Score
print("Creating Department Risk Score...")

# Calculate attrition rate per department
dept_risk = df.groupby('Department')['Attrition'].apply(
    lambda x: (x == 'Yes').mean()
).to_dict()

print("Department Risk Scores:")
for dept, risk in dept_risk.items():
    print(f"  {dept}: {risk:.3f}")

# Map to dataframe
df['DeptRiskScore'] = df['Department'].map(dept_risk)

print("\nDeptRiskScore created!")

Creating Department Risk Score...
Department Risk Scores:
  Human Resources: 0.190
  Research & Development: 0.138
  Sales: 0.206

DeptRiskScore created!
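Because `DeptRiskScore` is computed from the `Attrition` column itself, calculating it on the full dataset lets information about the target leak into a feature. A common way to limit this is to compute the rates on training rows only and map them onto the test rows. The following is a minimal sketch on made-up data; the split indices and the fallback to the overall training rate for unseen departments are assumptions, not part of the notebook.

```python
import pandas as pd

# Toy data (made up for illustration)
toy = pd.DataFrame({
    "Department": ["Sales", "Sales", "Sales", "Sales", "R&D", "R&D", "R&D", "R&D"],
    "Attrition": ["Yes", "No", "Yes", "No", "No", "No", "Yes", "No"],
})

# Hypothetical fixed split; a real project would use a proper train/test split
train = toy.iloc[[0, 1, 4, 5, 6]]
test = toy.iloc[[2, 3, 7]]

# Attrition rate per department, learned from training rows only
left_train = train["Attrition"] == "Yes"
dept_risk_train = left_train.groupby(train["Department"]).mean()
global_rate = left_train.mean()  # fallback for departments unseen in training

train = train.assign(DeptRiskScore=train["Department"].map(dept_risk_train))
test = test.assign(
    DeptRiskScore=test["Department"].map(dept_risk_train).fillna(global_rate)
)

print(dept_risk_train)
print(test)
```

With only three departments and over a thousand rows the leakage in this notebook is probably small, but the same pattern becomes more important for high-cardinality columns.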


---

## ENCODING FOR MACHINE LEARNING

Now we'll encode categorical variables for ML models:
1. **Binary Encoding:** Yes/No → 1/0
2. **One-Hot Encoding:** Multiple categories → Separate columns

In [13]:
# Binary Encoding
print("Encoding categorical variables for ML...")
print("="*70)

# 1. Target variable
df['Attrition_Binary'] = (df['Attrition'] == 'Yes').astype(int)
print("✓ Attrition → Attrition_Binary (0/1)")

# 2. OverTime (duplicates OverTime_Numeric, which was created for WLB_Index)
df['OverTime_Binary'] = (df['OverTime'] == 'Yes').astype(int)
print("✓ OverTime → OverTime_Binary (0/1)")

# 3. Gender
df['Gender_Binary'] = (df['Gender'] == 'Male').astype(int)
print("✓ Gender → Gender_Binary (0=Female, 1=Male)")

print("\nBinary encoding complete!")

Encoding categorical variables for ML...
✓ Attrition → Attrition_Binary (0/1)
✓ OverTime → OverTime_Binary (0/1)
✓ Gender → Gender_Binary (0=Female, 1=Male)

Binary encoding complete!
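The `== 'Yes'` comparison encodes anything that is not exactly `'Yes'` as 0, including missing values and differently cased strings. On already cleaned data that is likely fine, but an explicit mapping makes unexpected values visible as `NaN`. A small sketch on a made-up column:

```python
import pandas as pd

# Toy column with one casing variant and one missing value (made up)
overtime = pd.Series(["Yes", "No", "yes", None, "No"], name="OverTime")

# Comparison approach: everything that is not exactly 'Yes' becomes 0
by_comparison = (overtime == "Yes").astype(int)

# Mapping approach: values missing from the dict become NaN
by_map = overtime.map({"Yes": 1, "No": 0})

print(by_comparison.tolist())
print(by_map.tolist())
print("Unmapped rows:", int(by_map.isna().sum()))
```

Checking that the unmapped count is zero before casting to `int` is a cheap safeguard against silently miscoded rows.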


In [14]:
# One-Hot Encoding for multi-category variables
print("\nOne-Hot Encoding...")

# Select columns for one-hot encoding
cols_to_encode = ['Department', 'JobRole', 'MaritalStatus', 'EducationField', 'BusinessTravel']

# Create encoded dataframe
df_encoded = pd.get_dummies(df, columns=cols_to_encode, prefix=cols_to_encode, drop_first=True)

print(f"One-hot encoding complete!")
print(f"Original columns: {df.shape[1]}")
print(f"After encoding: {df_encoded.shape[1]}")
print(f"New columns created: {df_encoded.shape[1] - df.shape[1]}")


One-Hot Encoding...
One-hot encoding complete!
Original columns: 45
After encoding: 59
New columns created: 14
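`pd.get_dummies` with an explicit `columns=` list only encodes those columns. Text columns that are not in `cols_to_encode` (for example `Attrition`, `OverTime`, `Gender`, and the new group labels) are carried over unchanged, and most estimators would reject them. One way to check is to list the columns that are neither numeric nor boolean. A minimal sketch on a made-up frame:

```python
import pandas as pd

# Toy frame (made up): one column to one-hot encode, one text column
# that already has a binary twin
toy = pd.DataFrame({
    "Department": ["Sales", "R&D", "HR"],
    "OverTime": ["Yes", "No", "No"],
    "OverTime_Binary": [1, 0, 0],
})

encoded = pd.get_dummies(toy, columns=["Department"], drop_first=True)

# Anything that is neither numeric nor boolean still needs handling
leftover = encoded.select_dtypes(exclude=["number", "bool"]).columns.tolist()
print(leftover)

# One option: drop the text originals once their encoded versions exist
model_input = encoded.drop(columns=leftover)
print(model_input.columns.tolist())
```

Whether to drop or encode each leftover column depends on the column: binary twins make the originals redundant, while labels such as `TenureGroup` would need their own encoding to be used as features.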


---

## FEATURE ENGINEERING SUMMARY

In [17]:
# Summary of all new features
print("="*70)
print("FEATURE ENGINEERING COMPLETE!")
print("="*70)

new_features = [
    'TenureGroup',
    'SalaryBin',
    'Income_Age_Ratio',
    'WLB_Index',
    'AgeGroup',
    'DistanceCategory',
    'TotalSatisfaction',
    'YearsSincePromotion_Ratio',
    'DeptRiskScore',
    'Attrition_Binary',
    'OverTime_Binary',
    'Gender_Binary'
]

print(f"\nCreated {len(new_features)} new features:")
for i, feature in enumerate(new_features, 1):
    print(f"   {i:2d}. {feature}")

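# Note: OverTime_Numeric (helper column for WLB_Index) is not in new_features,
# so this count comes out one higher than the 32 columns originally loaded
print(f"\nOriginal features: {df.shape[1] - len(new_features)}")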
print(f"New features: {len(new_features)}")
print(f"Total features (with encoding): {df_encoded.shape[1]}")

FEATURE ENGINEERING COMPLETE!

Created 12 new features:
    1. TenureGroup
    2. SalaryBin
    3. Income_Age_Ratio
    4. WLB_Index
    5. AgeGroup
    6. DistanceCategory
    7. TotalSatisfaction
    8. YearsSincePromotion_Ratio
    9. DeptRiskScore
   10. Attrition_Binary
   11. OverTime_Binary
   12. Gender_Binary

Original features: 33
New features: 12
Total features (with encoding): 59


In [16]:
# Show sample of new features
print("\nSample of engineered features:")
print("="*70)

sample_cols = ['EmployeeNumber', 'Age', 'MonthlyIncome', 'TenureGroup', 
               'SalaryBin', 'Income_Age_Ratio', 'WLB_Index', 
               'TotalSatisfaction', 'Attrition']

print(df[sample_cols].head(10))


Sample of engineered features:
   EmployeeNumber  Age  MonthlyIncome TenureGroup  SalaryBin  \
0               1   41           5993      Senior       High   
1               2   49           5130      Senior       High   
2               4   37           2090         New        Low   
3               5   33           2909      Senior        Low   
4               7   27           3468         New     Medium   
5               8   32           3068      Senior     Medium   
6              10   59           2670         New        Low   
7              11   30           2693         New        Low   
8              12   38           9526      Senior  Very High   
9              13   36           5237      Senior       High   

   Income_Age_Ratio  WLB_Index  TotalSatisfaction Attrition  
0        146.170732          1                  7       Yes  
1        104.693878          3                  9        No  
2         56.486486          1                  9       Yes  
3         88.15

In [18]:
# Save both versions
print("\nSaving engineered datasets...")

# Version 1: With categorical labels (for interpretation)
df.to_csv('../data/processed/featured_data.csv', index=False)
print("✓ Saved: data/processed/featured_data.csv")

# Version 2: One-hot encoded (for ML models)
df_encoded.to_csv('../data/processed/ml_ready_data.csv', index=False)
print("✓ Saved: data/processed/ml_ready_data.csv")

print("\nAll data saved successfully!")


Saving engineered datasets...
✓ Saved: data/processed/featured_data.csv
✓ Saved: data/processed/ml_ready_data.csv

All data saved successfully!
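CSV files do not store pandas dtypes, so categorical columns created with `pd.cut` (such as `SalaryBin`) come back as plain text and lose their category order when reloaded. The sketch below demonstrates this with an in-memory buffer instead of a real file; the column values are made up.

```python
import io
import pandas as pd

# Toy ordered categorical (made up), similar in spirit to SalaryBin
order = ["Low", "Medium", "High"]
toy = pd.DataFrame({
    "SalaryBin": pd.Categorical(["Low", "High", "Medium"], categories=order, ordered=True)
})

# Round-trip through CSV using an in-memory buffer
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# The categorical dtype is gone after reloading
print(reloaded["SalaryBin"].dtype)

# Restore the ordered categorical explicitly
restored = reloaded["SalaryBin"].astype(pd.CategoricalDtype(order, ordered=True))
print(restored.sort_values().tolist())
```

If preserving dtypes matters downstream, a format such as Parquet or pickle is an alternative worth considering, at the cost of human readability.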


---

## Feature Engineering Complete!

### What We Created:

**Business Logic Features:**
1. TenureGroup - Career stage categorization
2. SalaryBin - Income level grouping
3. Income_Age_Ratio - Compensation fairness
4. WLB_Index - Work-life balance composite
5. AgeGroup - Generational cohorts
6. DistanceCategory - Commute impact
7. TotalSatisfaction - Combined satisfaction score
8. YearsSincePromotion_Ratio - Career stagnation
9. DeptRiskScore - Department-level risk

**ML-Ready Encodings:**
- Binary encoding (Yes/No → 1/0)
- One-hot encoding (Categories → Multiple columns)

### Files Created:
- `data/processed/featured_data.csv` - Human-readable with labels
- `data/processed/ml_ready_data.csv` - Ready for model training

---

**Next Step:** Proceed to `05_model_training.ipynb` 