daisyXai/data-mining

Telco Customer Churn Classification with XGBoost

An end-to-end machine learning pipeline for customer churn prediction using the WA_Fn-UseC_-Telco-Customer-Churn.csv dataset and XGBoost.

Features

  • Clean preprocessing pipeline for mixed tabular data
  • Robust handling of TotalCharges conversion and missing values
  • One-hot encoding for categorical features
  • Train/test split with reproducible random seed
  • XGBoost classifier training and evaluation
  • Classification metrics: Accuracy, Precision, Recall, F1-score
  • Confusion matrix and ROC-AUC with ROC curve visualization
  • Top feature importance extraction
  • Optional hyperparameter tuning via RandomizedSearchCV
  • Ready-to-run Python script and Jupyter notebooks

Project Structure

.
├── WA_Fn-UseC_-Telco-Customer-Churn.csv
├── xgboost_churn_pipeline.py
├── xgboost_churn_pipeline.ipynb
├── xgboost_churn_pipeline.executed.ipynb
├── churn_quickstart.ipynb
├── churn_quickstart.executed.ipynb
└── Readme.md

Requirements

  • Python 3.12+
  • Virtual environment (venv)

Core dependencies:

  • pandas
  • scikit-learn
  • xgboost
  • matplotlib
  • notebook / jupyter

Installation

python3.12 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
python -m pip install pandas scikit-learn xgboost matplotlib notebook

Usage

1) Run the Python script

source .venv/bin/activate
python xgboost_churn_pipeline.py

Optional arguments:

# Specify dataset path explicitly
python xgboost_churn_pipeline.py --data-path "WA_Fn-UseC_-Telco-Customer-Churn.csv"

# Enable hyperparameter tuning (slower)
python xgboost_churn_pipeline.py --tune
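The two flags above could be wired up with `argparse`; this is a hypothetical sketch of how the script might parse them, not the actual implementation in xgboost_churn_pipeline.py:

```python
import argparse

# Hypothetical flag parsing; the real script's argument names beyond
# --data-path and --tune (shown in the usage above) are assumptions.
parser = argparse.ArgumentParser(description="Telco churn XGBoost pipeline")
parser.add_argument("--data-path", default=None,
                    help="Path to the Telco churn CSV (auto-detected if omitted)")
parser.add_argument("--tune", action="store_true",
                    help="Enable RandomizedSearchCV hyperparameter tuning (slower)")

# Simulated invocation for illustration
args = parser.parse_args(
    ["--data-path", "WA_Fn-UseC_-Telco-Customer-Churn.csv", "--tune"])
print(args.data_path, args.tune)
```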

2) Run the Jupyter Notebook

source .venv/bin/activate
jupyter notebook

Open:

  • xgboost_churn_pipeline.ipynb for full pipeline
  • churn_quickstart.ipynb for a quick dataset sanity check

Data Preprocessing Logic

The pipeline includes:

  1. Drop customerID if present
  2. Convert TotalCharges to numeric with coercion
  3. Fill missing numeric values with median
  4. Fill missing categorical values with mode
  5. Encode target Churn: Yes -> 1, No -> 0
  6. One-hot encode categorical predictors
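The six steps above can be sketched in pandas on a toy frame. Column names mirror the Telco dataset, but the values here are made up for illustration; the script's actual implementation may differ in detail:

```python
import pandas as pd

# Toy stand-in for the Telco CSV (blank TotalCharges appears in the raw data)
df = pd.DataFrame({
    "customerID": ["0001", "0002", "0003", "0004"],
    "gender": ["Male", "Female", None, "Female"],
    "tenure": [1, 34, 2, 45],
    "TotalCharges": ["29.85", " ", "108.15", "1840.75"],
    "Churn": ["No", "No", "Yes", "No"],
})

# 1. Drop customerID if present
df = df.drop(columns=["customerID"], errors="ignore")

# 2. Convert TotalCharges to numeric; blanks coerce to NaN
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")

# 3. Fill missing numeric values with the column median
num_cols = df.select_dtypes(include="number").columns
df[num_cols] = df[num_cols].fillna(df[num_cols].median())

# 4. Fill missing categorical values with the mode
cat_cols = df.select_dtypes(include="object").columns.drop("Churn")
for col in cat_cols:
    df[col] = df[col].fillna(df[col].mode()[0])

# 5. Encode target Churn: Yes -> 1, No -> 0
y = df["Churn"].map({"Yes": 1, "No": 0})

# 6. One-hot encode categorical predictors
X = pd.get_dummies(df.drop(columns=["Churn"]), columns=list(cat_cols))
```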

Evaluation Outputs

After training, the pipeline prints:

  • Accuracy
  • Precision
  • Recall
  • F1-score
  • Confusion Matrix
  • ROC-AUC score

And displays:

  • ROC Curve
  • Top important features from XGBoost

Reproducibility

  • Train/test split uses random_state=42
  • Model defaults also use random_state=42
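As a quick illustration of why the fixed seed matters, two splits with the same `random_state` are identical across runs:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)
y = np.arange(10) % 2

# Same random_state -> same split, every run
a_train, a_test, _, _ = train_test_split(X, y, test_size=0.3, random_state=42)
b_train, b_test, _, _ = train_test_split(X, y, test_size=0.3, random_state=42)
assert (a_train == b_train).all() and (a_test == b_test).all()
```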

Notes

  • The script auto-detects a Telco churn CSV in the current directory if --data-path is not provided.
  • Hyperparameter tuning can significantly increase runtime.

Contributing

Contributions are welcome.

  1. Fork the repository
  2. Create a feature branch
  3. Commit your changes with clear messages
  4. Open a pull request describing motivation and impact

License

This project is available under the MIT License.
If you reuse this code, please keep attribution in your repository.
