An end-to-end machine learning pipeline for customer churn prediction using the WA_Fn-UseC_-Telco-Customer-Churn.csv dataset and XGBoost.
- Clean preprocessing pipeline for mixed tabular data
- Robust handling of TotalCharges conversion and missing values
- One-hot encoding for categorical features
- Train/test split with reproducible random seed
- XGBoost classifier training and evaluation
- Classification metrics: Accuracy, Precision, Recall, F1-score
- Confusion matrix and ROC-AUC with ROC curve visualization
- Top feature importance extraction
- Optional hyperparameter tuning via RandomizedSearchCV
- Ready-to-run Python script and Jupyter notebooks
.
├── WA_Fn-UseC_-Telco-Customer-Churn.csv
├── xgboost_churn_pipeline.py
├── xgboost_churn_pipeline.ipynb
├── xgboost_churn_pipeline.executed.ipynb
├── churn_quickstart.ipynb
├── churn_quickstart.executed.ipynb
└── Readme.md
- Python 3.12+
- Virtual environment (venv)
Core dependencies:
- pandas
- scikit-learn
- xgboost
- matplotlib
- notebook / jupyter
python3.12 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
python -m pip install pandas scikit-learn xgboost matplotlib notebook
source .venv/bin/activate
python xgboost_churn_pipeline.py
Optional arguments:
# Specify dataset path explicitly
python xgboost_churn_pipeline.py --data-path "WA_Fn-UseC_-Telco-Customer-Churn.csv"
# Enable hyperparameter tuning (slower)
python xgboost_churn_pipeline.py --tune
source .venv/bin/activate
jupyter notebook
Open:
- xgboost_churn_pipeline.ipynb for the full pipeline
- churn_quickstart.ipynb for a quick dataset sanity check
The pipeline includes:
- Drop customerID if present
- Convert TotalCharges to numeric with coercion
- Fill missing numeric values with median
- Fill missing categorical values with mode
- Encode target Churn: Yes -> 1, No -> 0
- One-hot encode categorical predictors
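The steps above can be sketched with pandas as follows. This is a minimal illustration, not the code in xgboost_churn_pipeline.py: the `preprocess` function name and the four-row sample frame are assumptions made up for the example, and only the column names customerID, TotalCharges, and Churn come from the list.

```python
import pandas as pd


def preprocess(df: pd.DataFrame) -> tuple[pd.DataFrame, pd.Series]:
    """Apply the listed cleaning steps and return (features, target).

    Hypothetical helper written for illustration only.
    """
    df = df.copy()

    # Drop the identifier column if it is present.
    df = df.drop(columns=["customerID"], errors="ignore")

    # Blank or otherwise non-numeric TotalCharges entries become NaN.
    df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")

    # Fill numeric gaps with the median and categorical gaps with the mode.
    numeric_cols = df.select_dtypes(include="number").columns
    categorical_cols = df.columns.difference(numeric_cols)
    df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())
    for col in categorical_cols:
        df[col] = df[col].fillna(df[col].mode().iloc[0])

    # Encode the target: Yes -> 1, No -> 0.
    y = df.pop("Churn").map({"Yes": 1, "No": 0})

    # One-hot encode the remaining categorical predictors.
    X = pd.get_dummies(df)
    return X, y


# Invented sample rows; the real dataset has many more columns.
sample = pd.DataFrame({
    "customerID": ["a", "b", "c", "d"],
    "Contract": ["Monthly", None, "Yearly", "Monthly"],
    "MonthlyCharges": [20.0, 30.0, 40.0, 50.0],
    "TotalCharges": ["20.0", " ", "80.0", "100.0"],
    "Churn": ["Yes", "No", "No", "Yes"],
})
X, y = preprocess(sample)
print(X)
print(y.tolist())
```

Coercing before imputing matters here: the blank TotalCharges string only becomes a fillable NaN once the column has been converted to numeric.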
After training, the pipeline prints:
- Accuracy
- Precision
- Recall
- F1-score
- Confusion Matrix
- ROC-AUC score
And displays:
- ROC Curve
- Top important features from XGBoost
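As a rough illustration of how the printed metrics relate to each other, the sketch below computes them with scikit-learn on a handful of invented labels and probabilities. The toy data and the 0.5 decision threshold are assumptions for the example; the script's own evaluation code may differ.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
    roc_curve,
)

# Toy ground truth and predicted churn probabilities (invented values).
y_true = [0, 0, 1, 1, 0, 1]
y_prob = [0.1, 0.6, 0.8, 0.4, 0.2, 0.9]

# Hard predictions from an assumed 0.5 threshold.
y_pred = [int(p >= 0.5) for p in y_prob]

metrics = {
    "accuracy": accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),
    "recall": recall_score(y_true, y_pred),
    "f1": f1_score(y_true, y_pred),
    # ROC-AUC is computed from probabilities, not thresholded labels.
    "roc_auc": roc_auc_score(y_true, y_prob),
}
cm = confusion_matrix(y_true, y_pred)  # rows: actual, columns: predicted

# Points of the ROC curve; these can be passed to matplotlib for plotting.
fpr, tpr, thresholds = roc_curve(y_true, y_prob)

for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
print(cm)
```

Accuracy, precision, recall, and F1 depend on the chosen threshold, while ROC-AUC summarizes ranking quality across all thresholds.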
- Train/test split uses random_state=42
- Model defaults also use random_state=42
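The effect of a fixed seed on the split can be checked with a small sketch like the one below. The synthetic arrays, the 80/20 ratio, and the use of `stratify` are assumptions for illustration; only random_state=42 comes from this README.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic features and a balanced binary target (invented data).
X = np.arange(40).reshape(20, 2)
y = np.array([0, 1] * 10)


def split(seed: int):
    # test_size and stratify are illustrative choices, not taken from the script.
    return train_test_split(X, y, test_size=0.2, stratify=y, random_state=seed)


X_train_a, X_test_a, y_train_a, y_test_a = split(42)
X_train_b, X_test_b, y_train_b, y_test_b = split(42)

# The same seed yields the same partition on every call.
same_split = np.array_equal(X_test_a, X_test_b) and np.array_equal(y_test_a, y_test_b)
print("identical test sets:", same_split)
```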
- The script auto-detects a Telco churn CSV in the current directory if --data-path is not provided.
- Hyperparameter tuning can significantly increase runtime.
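One way the command-line behavior described above could be wired up is sketched below. The flags --data-path and --tune come from this README, but the helper names (`find_churn_csv`, `resolve_data_path`) and the filename-matching rule are hypothetical; the actual detection logic in xgboost_churn_pipeline.py is not shown here and may work differently.

```python
import argparse
import tempfile
from pathlib import Path
from typing import Optional


def find_churn_csv(directory: Path) -> Optional[Path]:
    """Return the first CSV whose name mentions 'telco' and 'churn'.

    Assumed matching rule, for illustration only.
    """
    for path in sorted(directory.glob("*.csv")):
        name = path.name.lower()
        if "telco" in name and "churn" in name:
            return path
    return None


def parse_args(argv=None) -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Telco churn pipeline (sketch)")
    parser.add_argument("--data-path", type=Path, default=None,
                        help="Path to the churn CSV; auto-detected if omitted")
    parser.add_argument("--tune", action="store_true",
                        help="Enable hyperparameter tuning (slower)")
    return parser.parse_args(argv)


def resolve_data_path(args: argparse.Namespace, directory: Path) -> Path:
    # An explicit --data-path always wins over auto-detection.
    if args.data_path is not None:
        return args.data_path
    found = find_churn_csv(directory)
    if found is None:
        raise FileNotFoundError(f"No Telco churn CSV found in {directory}")
    return found


# Demonstration in a temporary directory containing an empty placeholder CSV.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "WA_Fn-UseC_-Telco-Customer-Churn.csv").touch()
    detected = resolve_data_path(parse_args([]), Path(tmp))
    explicit = resolve_data_path(parse_args(["--data-path", "other.csv"]), Path(tmp))
    tune_flag = parse_args(["--tune"]).tune
    print(detected.name, explicit, tune_flag)
```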
Contributions are welcome.
- Fork the repository
- Create a feature branch
- Commit your changes with clear messages
- Open a pull request describing motivation and impact
This project is available under the MIT License.
If you reuse this code, please keep attribution in your repository.