# 02 - Feature Engineering

## Development Plan

### Objectives:
- Create meaningful features from raw match statistics
- Implement rolling averages and form indicators
- Engineer team-specific and head-to-head features
- Prepare features for both training and test datasets

### Implementation Steps:

#### 1. Data Preprocessing
- Load cleaned training data from notebook 01
- Sort data chronologically by Date
- Handle any remaining missing values
- Create season identifier from Date column
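
The sorting and season-labelling steps above could look roughly like this. It is a minimal sketch, assuming `Date` holds day-first strings such as `14/08/2021` and that a season starts in August. Both are assumptions about the data, and `add_season` and `season_start_month` are hypothetical names.

```python
import pandas as pd

def add_season(df: pd.DataFrame, season_start_month: int = 8) -> pd.DataFrame:
    """Parse Date, sort chronologically, and add a label such as '2021-22'."""
    out = df.copy()
    # Assumed input format: day/month/year strings.
    out["Date"] = pd.to_datetime(out["Date"], format="%d/%m/%Y")
    out = out.sort_values("Date", kind="stable").reset_index(drop=True)
    # Matches before the start month belong to the season that began the previous year.
    year = out["Date"].dt.year
    start_year = year.where(out["Date"].dt.month >= season_start_month, year - 1)
    end_suffix = ((start_year + 1) % 100).astype(str).str.zfill(2)
    out["Season"] = start_year.astype(str) + "-" + end_suffix
    return out

# Tiny illustrative frame, not real match data.
matches = pd.DataFrame({
    "Date": ["15/01/2022", "14/08/2021", "21/05/2022", "13/08/2022"],
    "FTR": ["H", "D", "A", "H"],
})
matches = add_season(matches)
```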

#### 2. Basic Team Statistics Features
- Overall win rate (wins / matches played, counting only matches before the one being predicted so the result does not leak into its own features)
- Home win rate and away win rate
- Average goals scored per match (home and away separately)
- Average goals conceded per match
- Average shots, shots on target per match
- Average fouls, yellow cards, red cards per match
- Average corners per match
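
One possible way to compute these averages without leaking the current result is to reshape the data to one row per team per match, then take an expanding mean over each team's earlier matches. The sketch below assumes football-data.co.uk style column names (`HomeTeam`, `AwayTeam`, `FTHG`, `FTAG`), which the plan does not confirm. The helper names are hypothetical.

```python
import pandas as pd

def to_team_rows(matches: pd.DataFrame) -> pd.DataFrame:
    """Reshape one-row-per-match data into one row per team per match."""
    home = pd.DataFrame({
        "Date": matches["Date"], "Team": matches["HomeTeam"], "IsHome": 1,
        "GoalsFor": matches["FTHG"], "GoalsAgainst": matches["FTAG"],
        "Win": (matches["FTR"] == "H").astype(int),
    })
    away = pd.DataFrame({
        "Date": matches["Date"], "Team": matches["AwayTeam"], "IsHome": 0,
        "GoalsFor": matches["FTAG"], "GoalsAgainst": matches["FTHG"],
        "Win": (matches["FTR"] == "A").astype(int),
    })
    rows = pd.concat([home, away])
    return rows.sort_values(["Date", "Team"], kind="stable").reset_index(drop=True)

def add_prior_averages(team_rows, cols=("Win", "GoalsFor", "GoalsAgainst")):
    """Add each team's average over its earlier matches only."""
    out = team_rows.copy()
    for col in cols:
        # shift(1) drops the current match, so the average uses past matches only.
        out[f"Prior{col}Avg"] = out.groupby("Team")[col].transform(
            lambda s: s.shift(1).expanding().mean()
        )
    return out

# Tiny illustrative frame, not real match data.
matches = pd.DataFrame({
    "Date": pd.to_datetime(["2021-08-14", "2021-08-21", "2021-08-28"]),
    "HomeTeam": ["A", "B", "A"],
    "AwayTeam": ["B", "A", "C"],
    "FTHG": [2, 1, 0],
    "FTAG": [0, 1, 1],
    "FTR": ["H", "D", "A"],
})
features = add_prior_averages(to_team_rows(matches))
```

A team's first match has no history, so its prior averages are `NaN` and need an explicit fill strategy later. Filtering on `IsHome` before averaging would give the separate home and away rates.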

#### 3. Form-Based Features (Rolling Windows)
- Last N matches form (e.g., N=5, 10): wins, points
- Rolling average goals scored (last 5/10 matches)
- Rolling average goals conceded
- Recent performance trend (win rate in last N matches vs overall)
- Points from last 5 matches (3 for win, 1 for draw, 0 for loss)
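
The rolling form features can be sketched along the following lines. The example assumes a frame with one row per team per match and a `Points` column already derived with the 3/1/0 scheme. `add_form` and the `FormPoints{n}` column names are hypothetical.

```python
import pandas as pd

def add_form(team_rows: pd.DataFrame, windows=(5, 10)) -> pd.DataFrame:
    """Add points earned over each team's previous n matches."""
    out = team_rows.sort_values(["Team", "Date"], kind="stable").copy()
    for n in windows:
        # shift(1) excludes the current match; min_periods=1 lets early-season
        # rows use however many past matches exist.
        out[f"FormPoints{n}"] = out.groupby("Team")["Points"].transform(
            lambda s: s.shift(1).rolling(n, min_periods=1).sum()
        )
    return out

# Tiny illustrative frame: one team, seven weekly matches.
team_rows = pd.DataFrame({
    "Date": pd.date_range("2021-08-14", periods=7, freq="7D"),
    "Team": "A",
    "Result": ["W", "D", "L", "W", "W", "D", "W"],
})
team_rows["Points"] = team_rows["Result"].map({"W": 3, "D": 1, "L": 0})
form = add_form(team_rows, windows=(3,))
```

Replacing `.sum()` with `.mean()` on a goals column gives the rolling goal averages in the same way.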

#### 4. Head-to-Head Features
- Historical win rate between specific teams
- Average goals in matchups between two teams
- Last meeting outcome
- Home advantage in this specific matchup (for example, the home side's win rate in previous meetings between the two teams)
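
A simple way to build head-to-head features is a single chronological pass that records, for each pair of teams, who won their earlier meetings. This is a sketch under assumed column names (`HomeTeam`, `AwayTeam`). `add_head_to_head` and its output columns are hypothetical.

```python
from collections import defaultdict

import pandas as pd

def add_head_to_head(matches: pd.DataFrame) -> pd.DataFrame:
    """Add counts and win rates from earlier meetings of the same two teams."""
    out = matches.sort_values("Date", kind="stable").reset_index(drop=True)
    # Unordered team pair -> winners of past meetings (None marks a draw).
    history = defaultdict(list)
    played, home_win_rate = [], []
    for row in out.itertuples(index=False):
        key = frozenset((row.HomeTeam, row.AwayTeam))
        past = history[key]
        played.append(len(past))
        # Share of past meetings won by today's home team, wherever they were played.
        if past:
            home_win_rate.append(sum(w == row.HomeTeam for w in past) / len(past))
        else:
            home_win_rate.append(float("nan"))
        # Record the current result only after its features are computed.
        history[key].append({"H": row.HomeTeam, "A": row.AwayTeam}.get(row.FTR))
    out["H2HPlayed"] = played
    out["H2HHomeTeamWinRate"] = home_win_rate
    return out

# Tiny illustrative frame, not real match data.
matches = pd.DataFrame({
    "Date": pd.to_datetime(["2020-09-12", "2021-01-30", "2021-09-18", "2022-02-05"]),
    "HomeTeam": ["A", "B", "A", "B"],
    "AwayTeam": ["B", "A", "B", "A"],
    "FTR": ["H", "D", "A", "H"],
})
h2h = add_head_to_head(matches)
```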

#### 5. Time-Based Features
- Day of week
- Month of year
- Season stage (early/mid/late season)
- Days since last match (rest days)
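
The calendar features come straight from the pandas datetime accessor, and rest days from a per-team date difference. The sketch below assumes a frame with one row per team per match and a parsed `Date` column. The output column names are hypothetical, and season stage is left out because the plan does not define its boundaries.

```python
import pandas as pd

# Tiny illustrative frame: one row per team per match.
team_rows = pd.DataFrame({
    "Date": pd.to_datetime(["2021-08-14", "2021-08-14", "2021-08-21", "2021-08-24"]),
    "Team": ["A", "B", "A", "B"],
})
team_rows = team_rows.sort_values(["Team", "Date"], kind="stable").reset_index(drop=True)

team_rows["DayOfWeek"] = team_rows["Date"].dt.dayofweek  # Monday=0 ... Sunday=6
team_rows["Month"] = team_rows["Date"].dt.month
# Days since the same team's previous match; NaN for a team's first match.
team_rows["RestDays"] = team_rows.groupby("Team")["Date"].diff().dt.days
```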

#### 6. Advanced Features
- Goal difference (cumulative)
- Shot accuracy (shots on target / total shots)
- Discipline index (yellow + red cards)
- Attack strength vs defense strength
- ELO rating (optional, if time permits)
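
For the optional ELO feature, the standard Elo update can be sketched as below. The starting rating of 1500, `k=20`, and the `home_advantage` offset are illustrative assumptions rather than values taken from this plan.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score of side A against side B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def update_elo(home: float, away: float, result: str,
               k: float = 20.0, home_advantage: float = 0.0):
    """Return updated (home, away) ratings after a match with FTR 'H', 'D', or 'A'."""
    score_home = {"H": 1.0, "D": 0.5, "A": 0.0}[result]
    exp_home = expected_score(home + home_advantage, away)
    delta = k * (score_home - exp_home)
    # Zero-sum update: what the home side gains, the away side loses.
    return home + delta, away - delta

# Walk matches in date order; the pre-match ratings are the features.
ratings = {}
pre_match = []
for home, away, result in [("A", "B", "H"), ("B", "A", "D")]:
    r_home = ratings.get(home, 1500.0)
    r_away = ratings.get(away, 1500.0)
    pre_match.append((r_home, r_away))
    ratings[home], ratings[away] = update_elo(r_home, r_away, result)
```

Recording the ratings before the update keeps the feature free of the match's own result.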

#### 7. Feature Encoding
- Encode team names (label encoding or one-hot encoding)
- Encode referee names if used as a feature
- Create features for both home and away teams
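
If one-hot encoding is chosen, the training and test frames need identical columns. One way to get that, sketched here with plain pandas, is to fix the category list from the training teams. Column names `HomeTeam` and `AwayTeam` are assumed, and `one_hot_teams` is a hypothetical helper.

```python
import pandas as pd

def one_hot_teams(df: pd.DataFrame, teams: list) -> pd.DataFrame:
    """One-hot encode home and away teams against a fixed team list."""
    dtype = pd.CategoricalDtype(categories=teams)
    # Teams missing from the list become NaN and encode as an all-zero row.
    home = pd.get_dummies(df["HomeTeam"].astype(dtype), prefix="Home", dtype=int)
    away = pd.get_dummies(df["AwayTeam"].astype(dtype), prefix="Away", dtype=int)
    return pd.concat([home, away], axis=1)

# Tiny illustrative frames, not real match data.
train = pd.DataFrame({"HomeTeam": ["A", "B"], "AwayTeam": ["B", "C"]})
test = pd.DataFrame({"HomeTeam": ["C", "D"], "AwayTeam": ["A", "B"]})

teams = sorted(set(train["HomeTeam"]) | set(train["AwayTeam"]))
encoded_train = one_hot_teams(train, teams)
encoded_test = one_hot_teams(test, teams)
```

Here team `D` appears only in the test frame, so its home columns are all zero instead of creating a column the model never saw.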

#### 8. Feature Selection & Analysis
- Calculate each feature's correlation with the target variable (FTR), after encoding FTR numerically
- Identify highly correlated features
- Remove redundant or low-importance features
- Visualize feature distributions
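
Since FTR is categorical, a correlation needs a numeric target first. The sketch below uses a home-win indicator as one possible encoding and lists feature pairs whose absolute correlation exceeds a threshold. The feature names, the 0.9 threshold, and `correlated_pairs` are illustrative assumptions.

```python
import pandas as pd

def correlated_pairs(features: pd.DataFrame, threshold: float = 0.9):
    """Return (col_a, col_b, |corr|) for feature pairs above the threshold."""
    corr = features.corr().abs()
    cols = list(corr.columns)
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if corr.iloc[i, j] > threshold:
                pairs.append((cols[i], cols[j], float(corr.iloc[i, j])))
    return pairs

# Tiny illustrative frame with made-up values.
features = pd.DataFrame({
    "HomeShotsAvg": [1.0, 2.0, 3.0, 4.0, 5.0],
    "HomeShotsOnTargetAvg": [2.0, 4.0, 6.0, 8.0, 10.0],
    "HomeRestDays": [5.0, 1.0, 4.0, 2.0, 3.0],
})
ftr = pd.Series(["A", "A", "D", "H", "H"])

home_win = (ftr == "H").astype(int)          # one possible numeric encoding of FTR
target_corr = features.corrwith(home_win)    # correlation of each feature with the target
redundant = correlated_pairs(features)       # candidates for removal
```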

#### 9. Apply to Test Data
- Load test dataset
- Apply same feature engineering pipeline
- Ensure consistency in feature creation
- Handle any missing historical data for test matches (for example, teams with no matches in the training data)
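
Test matches have no results yet, so their features have to come from training-set history. One possible approach, sketched below, looks up each team's latest training statistics and falls back to the training average for unseen teams. The fallback choice, the `WinRate` column, and `attach_team_stats` are assumptions for illustration.

```python
import pandas as pd

def attach_team_stats(test: pd.DataFrame, team_stats: pd.DataFrame,
                      default: pd.Series) -> pd.DataFrame:
    """Add Home*/Away* columns by looking up each team's training statistics."""
    out = test.copy()
    for side in ("Home", "Away"):
        for col in team_stats.columns:
            # Teams absent from team_stats map to NaN, then take the default.
            looked_up = out[f"{side}Team"].map(team_stats[col])
            out[f"{side}{col}"] = looked_up.fillna(default[col])
    return out

# Tiny illustrative frames: stats indexed by team, as of the end of training.
team_stats = pd.DataFrame({"WinRate": [0.6, 0.4]}, index=["A", "B"])
test = pd.DataFrame({"HomeTeam": ["A", "D"], "AwayTeam": ["B", "A"]})

test_features = attach_team_stats(test, team_stats, default=team_stats.mean())
```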

#### 10. Save Processed Data
- Save feature-engineered training data to data/processed/
- Save feature-engineered test data
- Export feature list and descriptions

### Expected Outputs:
- features_training.csv with all engineered features
- features_test.csv ready for prediction
- feature_descriptions.txt documenting each feature
- Correlation analysis plots in figures/feature_analysis/
- Preliminary feature importance analysis

In [None]:
# Import libraries
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

In [None]:
# Load and preprocess data
# TODO: Load training data, sort by date, handle missing values

In [None]:
# Create basic team statistics
# TODO: Implement win rates, average goals, shots, etc.

In [None]:
# Create form-based features
# TODO: Implement rolling windows for recent form

In [None]:
# Create head-to-head features
# TODO: Calculate historical matchup statistics

In [None]:
# Feature selection and analysis
# TODO: Correlation analysis, feature importance

In [None]:
# Apply to test data
# TODO: Apply same pipeline to test dataset

In [None]:
# Save processed data
# TODO: Export to data/processed/