# **Churn Prediction Project - XGBoost and Random Forest Models**

## **Overview**
This project implements a churn prediction model using **XGBoost** and **Random Forest** classifiers. It includes data preprocessing, exploratory data analysis (EDA), feature engineering, model training, and evaluation. The models are trained on a dataset of customer information and used to predict whether a customer will churn (Attrited) or stay (Existing).

## **Purpose**
The primary goal of this project is to predict customer churn for a company using historical customer data. By identifying which customers are likely to churn, the company can take preventative actions, thus improving customer retention.

## **Technologies Used**
- **Python** 3.x
- **Libraries**:
  - `pandas`: Data manipulation and analysis
  - `NumPy`: Numerical operations
  - `Matplotlib`, `Seaborn`: Data visualization
  - `Scikit-learn`: Machine learning algorithms and evaluation metrics
  - `XGBoost`: Gradient Boosting classifier for churn prediction
  - `Imbalanced-learn`: SMOTE (Synthetic Minority Over-sampling Technique) for handling class imbalance

## **Steps Involved**
### **1. Data Loading and Preprocessing**
- Data is loaded and inspected to check for missing values and outliers.
- Features are selected based on their relevance to the prediction.
- Categorical variables are encoded using one-hot encoding.
- **SMOTE** is used for handling class imbalance by oversampling the minority class (Attrited customers).

### **2. Exploratory Data Analysis (EDA)**
- Descriptive statistics are generated for numerical and categorical features.
- Visualizations such as histograms, box plots, and heatmaps are created to analyze the distribution of features and correlations.

### **3. Model Training**
- **Random Forest**: Trained with default settings and with balanced class weights, both before SMOTE is applied.
- **XGBoost**: Hyperparameter tuning is performed using **GridSearchCV** to find the best model configuration.

### **4. Model Evaluation**
- The models are evaluated using key metrics: Accuracy, Precision, Recall, F1 Score, Confusion Matrix, and Classification Report.
- **Confusion Matrix** and **Classification Report** visualizations are generated for better understanding.

### **5. Model Performance Comparison**
- A bar plot is used to compare the performance of Random Forest and XGBoost on key metrics.
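
One way to build such a comparison is sketched below. The helper `compare_models` is hypothetical, and the labels are toy values chosen only to make the example run; they are not results from this project.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line inside a notebook
import pandas as pd
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_score,
    recall_score,
)


def compare_models(y_true, predictions):
    """Return a table of metrics (rows) per model (columns)."""
    rows = {}
    for name, y_pred in predictions.items():
        rows[name] = {
            "Accuracy": accuracy_score(y_true, y_pred),
            "Precision": precision_score(y_true, y_pred, zero_division=0),
            "Recall": recall_score(y_true, y_pred, zero_division=0),
            "F1 Score": f1_score(y_true, y_pred, zero_division=0),
        }
    return pd.DataFrame(rows)


# Toy labels for illustration only (1 = Attrited, 0 = Existing).
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
predictions = {
    "Random Forest": [0, 0, 0, 1, 1, 0, 1, 0, 1, 0],
    "XGBoost": [0, 0, 0, 0, 1, 1, 1, 0, 0, 0],
}

table = compare_models(y_true, predictions)

# Grouped bar chart: one group per metric, one bar per model.
ax = table.plot(kind="bar", rot=0, ylim=(0, 1), figsize=(8, 4))
ax.set_ylabel("Score")
ax.set_title("Model comparison (toy data)")
ax.figure.tight_layout()
```

In the notebook, `y_true` and `predictions` would instead come from the held-out test set and the fitted models.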

## **Directory Structure**
```
churn_prediction_project/
│
├── churn_prediction_final_updated_with_xgb.ipynb  # Jupyter notebook with XGBoost model
├── churn_prediction_final.ipynb                   # Original notebook with Random Forest model
├── README.md                                      # This README file
└── requirements.txt                               # List of dependencies
```

## **How to Run the Code**
### **1. Install Dependencies**
Install the required dependencies with **pip**:
```bash
pip install -r requirements.txt
```
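If you do not have a `requirements.txt` file yet, you can generate one from the active environment. Note that `pip freeze` records every installed package with a pinned version, so run it inside the project's virtual environment:
```bash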
pip freeze > requirements.txt
```
Here is a sample **requirements.txt**:
```text
pandas
numpy
matplotlib
seaborn
scikit-learn
xgboost
imbalanced-learn
```

### **2. Run the Notebook**
- Open the **Jupyter notebook** file (`churn_prediction_final_updated_with_xgb.ipynb`) in Jupyter or any IDE that supports Jupyter notebooks (e.g., VS Code).
- Run the cells sequentially to load the data, preprocess it, train the models, and visualize the results.

### **3. Results and Visualizations**
- The **Confusion Matrix** and **Classification Report** will be displayed for both Random Forest and XGBoost models.
- A **performance comparison** bar chart will visualize each model's accuracy, precision, recall, and F1 score.

## **Conclusion**
This project demonstrates the process of building, training, and evaluating churn prediction models using **Random Forest** and **XGBoost** classifiers. **SMOTE** oversampling addresses the imbalance between Attrited and Existing customers, and **GridSearchCV** tunes the XGBoost hyperparameters.

## **Future Work**
- Explore further hyperparameter tuning, for example with `RandomizedSearchCV`.
- Incorporate other machine learning algorithms like **Logistic Regression**, **SVM**, and **Neural Networks** for comparison.
- Integrate the model into a real-time prediction system for customer retention.

## **License**
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.