In [1]:
import sys
from pathlib import Path

# Resolve the current working directory (Path() is the cwd, not the notebook file itself)
# Assumes the notebook is run from a subdirectory of the project root (e.g., /notebooks)
notebook_path = Path().resolve()

# Get the project root directory (the parent of the 'notebooks' directory)
project_root = notebook_path.parent

# Add BOTH the project root and the src directory to the Python path
if str(project_root) not in sys.path:
    sys.path.append(str(project_root))
if str(project_root / 'src') not in sys.path:
    sys.path.append(str(project_root / 'src'))

# Now, we can import our modules
from src.data_handling import DataHandler
from src.xgboost_model import XGBoostModel
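
The path setup above only works when the notebook is launched from a direct child of the project root. As a minimal sketch of a more tolerant alternative, the helper below walks upward from a starting directory until it finds a folder containing `src`. The function name `find_project_root` and the `marker` argument are hypothetical and not part of this project.

```python
from pathlib import Path


def find_project_root(start: Path, marker: str = "src") -> Path:
    """Walk upward from `start` until a directory containing `marker` is found.

    Hypothetical helper: assumes the project root is the nearest ancestor
    that has a `src` subdirectory.
    """
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No parent of {start} contains a '{marker}' directory")


# Example usage inside a notebook (not executed here):
# project_root = find_project_root(Path.cwd())
```

With this approach the notebook could live in `notebooks/`, `notebooks/experiments/`, or the root itself without changing the setup cell.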

In [2]:
dh = DataHandler(test_year=2020)

DataHandler initialized - Using 52 features - Test year: 2020


### **Model 1: Standard XGBoost**
Now, we will execute the full pipeline for a standard XGBoost model without using PCA.

#### Initialize XGBoostModel Handler
Here, we create an instance of the `XGBoostModel` class. We pass a unique `model_name` (`'xgb_base'`), which is used to name all of the output files (study, model, and predictions).

In [3]:
xgb_base = XGBoostModel(dh, model_name='xgb_base')

XGBoostModel initialized with model name: xgb_base
Optuna study will be stored in  : /Users/arvindsuresh/Documents/Github/Election-prediction-May-2025/2020-results-20251023/optuna/xgb_base_study.pkl
Trained model will be stored in : /Users/arvindsuresh/Documents/Github/Election-prediction-May-2025/2020-results-20251023/models/xgb_base_model.json
Final preds will be stored in   : /Users/arvindsuresh/Documents/Github/Election-prediction-May-2025/2020-results-20251023/preds/xgb_base_preds.csv


#### Run Optuna Hyperparameter Study
This cell runs the Optuna hyperparameter search with the ASHA (asynchronous successive halving) pruner, which stops unpromising trials early. Each trial is scored by 3-fold cross-validation on the training years, and the search looks for the hyperparameters that minimize the custom weighted cross-entropy metric. The study object is saved upon completion.

*   `n_trials`: The number of different hyperparameter combinations to test.
*   `timeout`: A safety stop in minutes for the entire study.
*   `min_resource`: The minimum number of boosting rounds a trial must train for before it can be pruned.
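*   `reduction_factor`: The factor by which ASHA thins out trials at each rung (e.g., with a value of 2, roughly the better half of the trials continues to the next rung, where each trial may train for about twice as many rounds).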
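
To make `min_resource` and `reduction_factor` concrete, here is a minimal, idealized sketch of the successive-halving schedule they imply. It is not the pruner's actual implementation: real ASHA makes promotion decisions asynchronously, so the survivor counts are only approximate. The function names and the `max_rounds` cap are assumptions made for illustration.

```python
from typing import List


def asha_rungs(min_resource: int, reduction_factor: int, max_rounds: int) -> List[int]:
    """Boosting-round checkpoints at which a trial can be pruned."""
    rungs = []
    step = min_resource
    while step <= max_rounds:
        rungs.append(step)
        step *= reduction_factor
    return rungs


def survivors_per_rung(n_trials: int, reduction_factor: int, n_rungs: int) -> List[int]:
    """Idealized count of trials still alive when each rung is reached."""
    counts = []
    remaining = n_trials
    for _ in range(n_rungs):
        counts.append(remaining)
        # Only about 1/reduction_factor of the trials move on to the next rung.
        remaining = max(1, remaining // reduction_factor)
    return counts


# Settings from this notebook; max_rounds=128 is an assumed cap.
rungs = asha_rungs(min_resource=8, reduction_factor=2, max_rounds=128)
alive = survivors_per_rung(n_trials=100, reduction_factor=2, n_rungs=len(rungs))
for r, n in zip(rungs, alive):
    print(f"rung at round {r:>3}: about {n} trials reach it")
```

Under these assumptions the first pruning decision falls at round 8, which is consistent with the "pruned at iteration 8" messages in the study log.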

In [4]:
xgb_base.run_optuna_study(
    n_trials=100,
    timeout=10,
    min_resource=8,  
    reduction_factor=2,  
    use_pca=False
)

[I 2025-10-23 10:33:38,734] A new study created in memory with name: xgb_base_study
[I 2025-10-23 10:33:40,436] Trial 6 pruned. Trial was pruned at iteration 8.
[I 2025-10-23 10:33:40,490] Trial 4 pruned. Trial was pruned at iteration 8.
[I 2025-10-23 10:33:40,496] Trial 1 pruned. Trial was pruned at iteration 8.
[I 2025-10-23 10:33:40,691] Trial 7 pruned. Trial was pruned at iteration 8.
[I 2025-10-23 10:33:40,811] Trial 5 pruned. Trial was pruned at iteration 8.
[I 2025-10-23 10:33:40,920] Trial 2 pruned. Trial was pruned at iteration 8.
[I 2025-10-23 10:33:40,948] Trial 3 pruned. Trial was pruned at iteration 8.
[I 2025-10-23 10:33:41,306] Trial 9 pruned. Trial was pruned at iteration 8.
[I 2025-10-23 10:33:41,725] Trial 10 pruned. Trial was pruned at iteration 8.
[I 2025-10-23 10:33:42,194] Trial 13 pruned. Trial was pruned at iteration 8.
[I 2025-10-23 10:33:42,276] Trial 14 pruned. Trial was pruned at iteration 8.
[I 2025-10-23 10:33:42,387] Trial 16 pruned. Trial was pruned at i

--------------------
Study concluded. Results:
Best trial: 87
Best loss: 0.9942566666666667
Best params: {'learning_rate': 0.20934415984202634, 'gamma': 0.01248134336564366, 'subsample': 0.7322252935015268, 'colsample_bytree': 0.7372352102775366, 'colsample_bylevel': 0.9759430519429679, 'colsample_bynode': 0.7214443735212577, 'reg_alpha': 0.4508419359396373, 'reg_lambda': 0.00043211407266855763, 'max_depth': 11, 'min_child_weight': 5}
Optimal boosting rounds: 107
Study saved to /Users/arvindsuresh/Documents/Github/Election-prediction-May-2025/2020-results-20251023/optuna/xgb_base_study.pkl


#### Train Final Model
Using the best hyperparameters and the optimal number of boosting rounds found by Optuna, this cell trains the final XGBoost model on the entire training dataset (2008, 2012, and 2016 combined). The trained model is saved to the results directory in JSON format.
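
For contrast with this final fit on all three years combined, the sketch below shows one way the study's 3-fold cross-validation could have split the same years. It assumes each fold holds out a single election year, which the notebook does not confirm. `TRAIN_YEARS` and `leave_one_year_out` are hypothetical names.

```python
TRAIN_YEARS = [2008, 2012, 2016]


def leave_one_year_out(years):
    """Yield (train_years, validation_year) pairs, one fold per year."""
    for held_out in years:
        train = [y for y in years if y != held_out]
        yield train, held_out


# Assumed split: three folds, each validating on one election year.
for train, val in leave_one_year_out(TRAIN_YEARS):
    print(f"fit on {train}, validate on {val}")
```

Once hyperparameters are fixed, no validation year is needed, so the final model can use every training year at once.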

In [5]:
xgb_base.train_final_model()

Training xgb_base for 107 boosting rounds...
Using best params: {'learning_rate': 0.20934415984202634, 'gamma': 0.01248134336564366, 'subsample': 0.7322252935015268, 'colsample_bytree': 0.7372352102775366, 'colsample_bylevel': 0.9759430519429679, 'colsample_bynode': 0.7214443735212577, 'reg_alpha': 0.4508419359396373, 'reg_lambda': 0.00043211407266855763, 'max_depth': 11, 'min_child_weight': 5}
Training completed in 0.49 seconds.
Model saved to /Users/arvindsuresh/Documents/Github/Election-prediction-May-2025/2020-results-20251023/models/xgb_base_model.json


#### Generate Final Predictions
Finally, we load the trained model and use it to make predictions on the held-out 2020 test set. The predictions are saved to a CSV file in the results directory.

In [6]:
preds_base = xgb_base.make_final_predictions()
print("\nBase XGBoost Model Predictions (first 5 rows):")
print(preds_base[:5])

xgb_base predictions saved to: /Users/arvindsuresh/Documents/Github/Election-prediction-May-2025/2020-results-20251023/preds/xgb_base_preds.csv.

Base XGBoost Model Predictions (first 5 rows):
[[0.13650045 0.29802307 0.02114831 0.5443282 ]
 [0.12210447 0.33626625 0.02544304 0.5161862 ]
 [0.16061217 0.17182407 0.0100354  0.65752834]
 [0.09442554 0.27459973 0.01240689 0.6185678 ]
 [0.08282262 0.31229147 0.02329286 0.5815931 ]]
