# 👩‍💻 Bagging in Action: Predicting Customer Churn with Random Forest

## 📋 Overview
In this lab, you'll implement bagging techniques using a Random Forest to predict customer churn. Predicting customer churn is critical for businesses as it allows them to proactively address issues and improve customer retention strategies. By leveraging the Telco Customer Churn dataset, you will build and evaluate a Random Forest model to make these predictions.

## 🎯 Learning Outcomes
By the end of this lab, you will be able to:

- ✅ Perform data preprocessing and exploratory data analysis (EDA) on a real dataset.  
- ✅ Implement and train a Random Forest model to predict customer churn.  
- ✅ Evaluate the model's performance using relevant metrics.  
- ✅ Analyze feature importance from the trained model.

## 📂 Task 1: Data Preparation

**Context:** Before building a model, it is essential to understand and preprocess your data.

**Steps:**

1. **Load the Data:**  
     
   - Load the Telco Customer Churn dataset using `pandas`.  
   - Display the first few rows to understand the structure.

```python
# Import necessary libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder

# Load Data
df = pd.read_csv('Telco-Customer-Churn.csv')

# Display the first few rows to understand the structure
df.head()
```

2. **Explore the Data:**  
     
   - Use `.info()`, `.describe()`, and visualize data distributions to uncover insights.  
   - Identify and handle missing values if any.

```python
# Explore the Data
```
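
As a rough sketch of the exploration step, the block below runs the usual checks on a tiny made-up DataFrame. The `toy` frame, its column names, and its values are illustrative assumptions standing in for the real data; swap in your loaded `df`. It also shows one thing worth checking: a numeric column stored as text can hide missing values as blank strings, which `.isna()` does not count until the column is converted.

```python
import pandas as pd

# Hypothetical stand-in for the real dataset; replace with your loaded `df`.
toy = pd.DataFrame({
    "tenure": [1, 34, 2, 45],
    "MonthlyCharges": [29.85, 56.95, 53.85, 42.30],
    "TotalCharges": ["29.85", "1889.5", " ", "1840.75"],  # note the blank entry
    "Churn": ["No", "No", "Yes", "No"],
})

toy.info()                                        # dtypes and non-null counts
print(toy.describe())                             # summary stats for numeric columns
print(toy["Churn"].value_counts(normalize=True))  # class balance of the target

# Blank strings are not NaN, so convert to numeric before counting gaps.
missing_before = toy["TotalCharges"].isna().sum()
total_numeric = pd.to_numeric(toy["TotalCharges"], errors="coerce")
missing_after = total_numeric.isna().sum()
print(missing_before, missing_after)
```

If the two counts differ on your data, the column contains non-numeric placeholders that need handling before modeling.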

3. **Preprocess the Data:**  
     
   - Convert categorical variables to numerical using `LabelEncoder` or `pd.get_dummies()`.  
   - Separate features and target variable (Churn).

```python
# Preprocess the Data
```

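💡 **Tip:** Check how many unique values each categorical column has (for example, with `df.nunique()`) before encoding. A column with a distinct value in nearly every row, such as an ID, carries no predictive signal and should be dropped rather than encoded.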

⚙️ **Test Your Work:**
- Ensure no missing values in the final DataFrame.  
- Verify all categorical variables are encoded correctly.
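
One possible way to carry out the preprocessing steps is sketched below on a small made-up frame. The column names (`customerID`, `Contract`, `TotalCharges`) are assumptions about the dataset's layout, and dropping rows with unparseable values is just one choice among several; adapt both to what your exploration shows.

```python
import pandas as pd

# Hypothetical stand-in for the real dataset.
toy = pd.DataFrame({
    "customerID": ["A1", "B2", "C3", "D4"],
    "Contract": ["Month-to-month", "One year", "Two year", "One year"],
    "tenure": [1, 34, 2, 45],
    "TotalCharges": ["29.85", "1889.5", "108.15", " "],
    "Churn": ["No", "No", "Yes", "No"],
})

# An identifier is unique per row and carries no signal, so drop it.
toy = toy.drop(columns=["customerID"])

# Convert text-typed numbers; unparseable entries become NaN and are dropped.
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce")
toy = toy.dropna(subset=["TotalCharges"])

# Separate the target and map it to 0/1.
y = toy["Churn"].map({"No": 0, "Yes": 1})

# One-hot encode the remaining categorical features.
X = pd.get_dummies(toy.drop(columns=["Churn"]), drop_first=True)

print(X.columns.tolist())
print("Missing values:", X.isna().sum().sum())
```

`pd.get_dummies` avoids implying an order between categories, which `LabelEncoder` does when applied to feature columns. Tree-based models usually tolerate either encoding.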

## 🔍 Task 2: Splitting the Data

**Context:** Splitting the dataset ensures that your model is evaluated on data it did not see during training, so the reported metrics reflect how well it generalizes.

**Steps:**
   - Use `train_test_split` to divide the dataset into training (80%) and testing (20%) sets.

💡 **Tip:** Set `random_state` to a fixed value for reproducibility.

```python
# Splitting the Data
```

⚙️ **Test Your Work:**

- Confirm the shapes of the training and testing sets.
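
The split and the shape check could look like the sketch below, which uses randomly generated data in place of your encoded features. The `stratify=y` argument is an optional addition, not something the task requires: it keeps the churn rate the same in both sets, which can help when the classes are imbalanced.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical encoded data: 100 rows, 3 features, 25% churners.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(100, 3)), columns=["f1", "f2", "f3"])
y = pd.Series([1] * 25 + [0] * 75, name="Churn")

# 80/20 split with a fixed seed; stratify=y is an optional extra.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)
print("Churn rate (train/test):", y_train.mean(), y_test.mean())
```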

## 🧹 Task 3: Model Building and Training

**Context:** Building and training the model is the core step where the Random Forest algorithm is applied.

**Steps:**
     
   - Create a `RandomForestClassifier` with 100 trees.
     
   - Fit the classifier on the training data.
     
   - Experiment with `n_estimators`, `max_depth`, and `max_features` to optimize performance.

💡 **Tip:** Use `GridSearchCV` to automate hyperparameter tuning.

```python
# Model Building and Training
```

⚙️ **Test Your Work:**
- Ensure your model trains without errors.
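
A minimal sketch of these steps follows, using synthetic data from `make_classification` as a stand-in for the churn features. The grid values and the `f1` scoring choice are illustrative assumptions, not recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the preprocessed churn data.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Baseline: a forest of 100 trees.
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
rf_clf.fit(X_train, y_train)

# Small, illustrative grid over the three hyperparameters named above.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [None, 5],
    "max_features": ["sqrt", 0.5],
}
search = GridSearchCV(
    RandomForestClassifier(random_state=42), param_grid, cv=3, scoring="f1"
)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Best cross-validated F1:", search.best_score_)
```

The search only ever sees the training set, so the test set stays untouched for the final evaluation.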

## 🧠 Task 4: Model Evaluation
**Context:** Evaluating the model is crucial to understand its real-world performance.

**Steps:**
     
   - Use your trained model to predict on the test set.
     
   - Use `classification_report` to assess metrics like accuracy, precision, recall, and F1 score.

💡 **Tip:** Analyze the confusion matrix for more detailed insight.

```python
# Model Evaluation
```

⚙️ **Test Your Work:**
- Validate the classification report output.
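
The evaluation steps might be sketched as follows, again on synthetic data with roughly 25% positives standing in for churners. The class labels passed to `target_names` are assumptions chosen for readability.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the churn data.
X, y = make_classification(
    n_samples=300, n_features=8, weights=[0.75, 0.25], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
rf_clf.fit(X_train, y_train)
y_pred = rf_clf.predict(X_test)

# Per-class precision, recall, and F1, plus overall accuracy.
print(classification_report(y_test, y_pred, target_names=["No churn", "Churn"]))

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred)
tn, fp, fn, tp = cm.ravel()
print(cm)
print("Churners missed (false negatives):", fn)
```

With imbalanced classes, accuracy alone can look good even when many churners are missed, so the recall for the churn class is often the number to watch.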

## 🧪 Task 5: Feature Importance Analysis
**Context:** Understanding which features matter more helps in model interpretability and further feature engineering.

**Steps:**
     
   - Retrieve and print feature importances from the Random Forest model.
     
   - Identify the top contributing features and interpret their impact.

💡 **Tip:** Plot a bar chart for visual representation of feature importances.

```python
# Feature Importance Analysis
```

⚙️ **Test Your Work:**
- Ensure you get a clear ranking of features by their importance.
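
As an illustration, the ranking can be built by pairing `feature_importances_` with the column names. The data and the `feature_i` names below are synthetic placeholders for your real feature matrix.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in with hypothetical feature names.
X_arr, y = make_classification(
    n_samples=300, n_features=6, n_informative=3, random_state=42
)
X = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(6)])

rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
rf_clf.fit(X, y)

# Pair each importance with its column name and rank from high to low.
importances = pd.Series(
    rf_clf.feature_importances_, index=X.columns
).sort_values(ascending=False)
print(importances)
```

The importances sum to 1, so each value can be read as a share of the total. Calling `importances.plot.barh()` would draw the bar chart if matplotlib is available.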

### ✅ Success Checklist

- Data is preprocessed without errors.  
- Data is split into training and testing sets correctly.  
- Random Forest model is instantiated and trained successfully.  
- Model evaluation metrics are correctly calculated and interpreted.  
- Feature importances are analyzed and visualized.

### 🔍 Common Issues & Solutions

**Problem:** Data leakage during preprocessing.   
**Solution:** Split the data before fitting any transformations, or wrap the transformations and the model in a pipeline.  
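
One way to apply this is sketched below with scikit-learn's `Pipeline` and `ColumnTransformer`, so the encoder and imputer learn their parameters from the training rows only. The toy frame and its column names are assumptions for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical raw data with a categorical column and a missing value.
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "One year", "Two year", "Month-to-month"] * 10,
    "tenure": [1, 34, 60, 5] * 10,
    "MonthlyCharges": [70.0, None, 20.0, 85.5] * 10,
    "Churn": [1, 0, 0, 1] * 10,
})
X = toy.drop(columns=["Churn"])
y = toy["Churn"]

# Split first, so nothing is learned from the test rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Encoding and imputation are fitted inside the pipeline, on training data only.
preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Contract"]),
    ("num", SimpleImputer(strategy="median"), ["tenure", "MonthlyCharges"]),
])
model = Pipeline([
    ("preprocess", preprocess),
    ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
])

model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print("Test accuracy:", test_accuracy)
```

The same `model` object can be passed to `GridSearchCV` or `cross_val_score`, which refit the preprocessing within each fold.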

**Problem:** Overfitting the model.  
**Solution:** Tune `max_depth`, use cross-validation, and check that the training and test scores are close to each other.  
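
A simple check along these lines is to compare training, cross-validated, and test accuracy for different depths. The sketch below uses synthetic data, and the two `max_depth` values are arbitrary examples.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the churn data.
X, y = make_classification(
    n_samples=400, n_features=10, n_informative=4, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

results = {}
for depth in (None, 4):  # unrestricted trees vs. shallow trees
    rf = RandomForestClassifier(
        n_estimators=100, max_depth=depth, random_state=42
    )
    # 5-fold cross-validation on the training set only.
    cv_mean = cross_val_score(rf, X_train, y_train, cv=5).mean()
    rf.fit(X_train, y_train)
    results[depth] = {
        "train": rf.score(X_train, y_train),
        "cv_mean": cv_mean,
        "test": rf.score(X_test, y_test),
    }
    print(depth, results[depth])
```

A large gap between the training score and the cross-validated or test score suggests overfitting. Limiting depth typically narrows that gap, sometimes at the cost of a lower score overall.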

### 🔑 Key Points

- Bagging with Random Forests helps improve model stability and reduce overfitting.  
- Data preprocessing and proper splitting are foundational for model performance.  
- Evaluating with multiple metrics provides a comprehensive view of model capabilities.  
- Feature importance analysis aids in model interpretability and potential feature engineering.

## Exemplar Solution

After completing this activity (or if you get stuck!), take a moment to review the exemplar solution. This sample solution can offer insights into different techniques and approaches. 

Reflect on what you can learn from the exemplar solution to improve your coding skills.

Remember, multiple solutions can exist for some problems; the goal is to learn and grow as a programmer by exploring various approaches.

Use the exemplar solution as a learning tool to enhance your understanding and refine your approach to coding challenges.

<details>    
<summary><strong>Click HERE to see an exemplar solution</strong></summary>    

```python
# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.preprocessing import LabelEncoder

# Load Data
df = pd.read_csv('Telco-Customer-Churn.csv')

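# Data Preprocessing
# Assumption: the CSV follows the common Telco layout, with a `customerID`
# identifier and a `TotalCharges` column stored as text. Both steps are
# skipped harmlessly if those columns are absent.
df = df.drop(columns=['customerID'], errors='ignore')
if 'TotalCharges' in df.columns:
    df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')
    df = df.dropna(subset=['TotalCharges'])

# Encode the remaining categorical (object) columns as integers
label_encoder = LabelEncoder()
for column in df.select_dtypes(include=['object']).columns:
    df[column] = label_encoder.fit_transform(df[column])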

# Split the data
X = df.drop('Churn', axis=1)  # Features
y = df['Churn']  # Target variable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the random forest model
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
rf_clf.fit(X_train, y_train)

# Predictions
y_pred = rf_clf.predict(X_test)

# Evaluation
print(classification_report(y_test, y_pred))

# Feature Importance
importances = rf_clf.feature_importances_
feature_importances = pd.Series(importances, index=X.columns).sort_values(ascending=False)
print(feature_importances.head(10))
```