In [None]:
!git init
!git config --global user.email "wheelessbrian@yahoo.com"
!git config --global user.name "bwheeless7"
!git remote add origin https://github.com/bwheeless7/data-portfolio.git


In [None]:
!mkdir -p churn-retention-analysis/notebooks
!mkdir -p churn-retention-analysis/data
!mkdir -p churn-retention-analysis/src


In [None]:
!mv "/content/01_problem_definition_and_eda.ipynb" churn-retention-analysis/notebooks/


In [None]:
!find /content -maxdepth 2 -type f

In [None]:
from google.colab import drive
drive.mount('/content/drive')


In [None]:
%cd /content/drive/MyDrive/data-portfolio/churn-retention-analysis
!ls -a

In [None]:
!git remote -v


In [None]:
!git config --global user.name "bwheeless7"
!git config --global user.email "wheelessbrian@yahoo.com"


In [None]:
!git status
# !git add /content/drive/MyDrive/data-portfolio/churn-retention-analysis/notebooks/01_problem_definition_and_eda.ipynb
# !git commit -m "Progress on problem definition and EDA"



In [None]:
!git pull origin main --no-rebase


In [None]:
!git add .
!git commit -m "Update notebook with new analysis"
!git push

In [None]:
# New Workflow
# !git pull

# !git add .
# !git commit -m "Tune decision "
# !git push


# Business Problem
A digital bank is experiencing increased customer attrition, which directly impacts revenue and long-term growth.
The objective is to identify the key drivers of customer churn, build a predictive model, and propose retention strategies.

# Success Metrics
- Churn Rate
- Retention Rate
- Model ROC-AUC
- Precision/Recall at business-selected threshold


In [None]:
import pandas as pd
df = pd.read_csv("/content/drive/MyDrive/data-portfolio/churn-retention-analysis/data/BankChurners.csv")


In [None]:
print(df.shape)  # (rows, columns); a bare expression in the middle of a cell is not displayed
df['Attrition_Flag'].value_counts(normalize=True)

## Initial Findings

- Overall churn rate is approximately **16.1%**, which is significant for a financial services business.
- The dataset is moderately imbalanced, so evaluation will emphasize recall, precision, and ROC-AUC over raw accuracy.
- Even small improvements in churn reduction could yield substantial revenue impact.
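
To make the accuracy caveat concrete, here is a minimal sketch on synthetic labels (not the bank dataset). It assumes the 16.1% churn rate quoted above and a hypothetical population of 10,000 customers, and scores a "model" that never predicts churn.

```python
# Synthetic illustration only: 10,000 hypothetical customers, 16.1% churners.
n_customers = 10_000
churn_rate = 0.161
n_churn = round(n_customers * churn_rate)

y_true = [1] * n_churn + [0] * (n_customers - n_churn)
y_naive = [0] * n_customers  # always predicts "Existing Customer"

# Accuracy: share of predictions that match the label.
naive_accuracy = sum(t == p for t, p in zip(y_true, y_naive)) / n_customers

# Churn recall: share of actual churners that were flagged.
true_positives = sum(1 for t, p in zip(y_true, y_naive) if t == 1 and p == 1)
naive_recall = true_positives / n_churn

print(f"accuracy: {naive_accuracy:.3f}, churn recall: {naive_recall:.3f}")
```

The naive model reaches about 84% accuracy while catching no churners at all, which is why recall, precision, and ROC-AUC are the more informative metrics here.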


In [None]:
df.info()
df.head()

In [None]:
df = df.drop(columns=[
    'CLIENTNUM',
    'Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_1',
    'Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_2'
])


## Data Preparation Notes

Several columns were removed prior to modeling:

- `CLIENTNUM` (unique identifier)
- Two Naive Bayes probability columns that directly leak the target variable

The identifier carries no generalizable signal, and the Naive Bayes columns encode the target and would artificially inflate model performance, so all three are excluded to preserve model integrity.
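
One quick way to spot leaked columns of this kind is to check how well each feature separates the two classes on its own. The sketch below is illustrative only: `single_feature_auc` is a hypothetical helper, and the values are made up to mimic a leaked probability column next to an ordinary feature.

```python
def single_feature_auc(values, labels):
    """Chance that a random positive has a higher value than a random negative.

    Ties count as half a win, which matches the usual ROC-AUC definition.
    """
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up toy data: 1 = churned, 0 = stayed.
labels = [0, 0, 0, 0, 1, 1]
leaky_score = [0.01, 0.02, 0.01, 0.03, 0.99, 0.98]  # mimics a leaked probability
ordinary_feature = [3, 1, 2, 4, 2, 3]               # mimics a regular attribute

leaky_auc = single_feature_auc(leaky_score, labels)
ordinary_auc = single_feature_auc(ordinary_feature, labels)

print(f"leaky column AUC: {leaky_auc:.2f}, ordinary column AUC: {ordinary_auc:.2f}")
```

A single column that separates the classes almost perfectly on its own is a strong hint that it was derived from the target and should be dropped.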


In [None]:
df['Attrition_Flag'] = df['Attrition_Flag'].map({
    'Existing Customer': 0,
    'Attrited Customer': 1
})

X = df.drop('Attrition_Flag', axis=1)
y = df['Attrition_Flag']

X.shape, y.mean()


In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

X_train.shape, X_test.shape, y_train.mean(), y_test.mean()


In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score

categorical_cols = X.select_dtypes(include='object').columns
numeric_cols = X.select_dtypes(exclude='object').columns

preprocess = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_cols),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols)
    ]
)

model = Pipeline(steps=[
    ('preprocess', preprocess),
    ('classifier', LogisticRegression(max_iter=1000, class_weight='balanced'))
])

model.fit(X_train, y_train)

y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]

print("ROC-AUC:", roc_auc_score(y_test, y_prob))
print(classification_report(y_test, y_pred))


## Baseline Model Performance

The baseline logistic regression model achieved:

- **ROC-AUC: 0.921**, indicating excellent overall discrimination.
- **Churn Recall: 0.82**, meaning the model flags 82% of the test-set customers who actually churned.
- **Churn Precision: 0.53**, meaning roughly half of the customers flagged as churners would have stayed, which is the price of the high recall.
- **Overall Accuracy: 85%**, though accuracy is secondary due to class imbalance.

From a business perspective, the model is highly effective at capturing potential churners, enabling proactive retention strategies.
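
As a rough worked example, the recall and precision above can be translated into customer counts. The sketch assumes a hypothetical group of 1,000 customers with the 16.1% churn rate reported earlier. The counts are back-of-the-envelope arithmetic, not additional model output.

```python
# Hypothetical group size; the rates come from the figures reported above.
n_customers = 1000
churn_rate = 0.161
recall = 0.82
precision = 0.53

churners = n_customers * churn_rate   # customers who actually churn
caught = churners * recall            # churners the model flags
flagged = caught / precision          # everyone the model flags
false_alarms = flagged - caught       # flagged customers who would have stayed
missed = churners - caught            # churners the model does not flag

print(f"churners: {churners:.0f}")
print(f"caught: {caught:.0f}, missed: {missed:.0f}")
print(f"flagged for outreach: {flagged:.0f}, of which false alarms: {false_alarms:.0f}")
```

Under these assumptions, the retention team would contact about 249 customers to reach about 132 of the 161 churners.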


## Tuning the Decision Threshold

Our baseline model uses the default probability cutoff of 0.5.  
For retention efforts, **recall matters most**: we want to catch as many potential churners as possible without flooding the campaign with false positives.  
We therefore evaluate a range of thresholds and select one that reaches a target recall of about 0.80 with acceptable precision.


In [None]:
from sklearn.metrics import precision_recall_curve, f1_score
import matplotlib.pyplot as plt
import numpy as np

# Predicted probabilities for the positive class
y_probs = model.predict_proba(X_test)[:,1]

precision, recall, thresholds = precision_recall_curve(y_test, y_probs)
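# Guard against 0/0 at points where precision and recall are both zero
denom = precision + recall
f1_scores = np.divide(2 * precision * recall, denom, out=np.zeros_like(denom), where=denom > 0)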

# Plot F1 score vs threshold
plt.figure(figsize=(8,5))
plt.plot(thresholds, f1_scores[:-1], label='F1 Score')
plt.plot(thresholds, recall[:-1], label='Recall', linestyle='--')
plt.xlabel("Threshold")
plt.ylabel("Score")
plt.title("F1 Score and Recall vs Decision Threshold")
plt.legend()
plt.show()

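# Choose the threshold whose recall is closest to the target (~0.8).
# recall has one more entry than thresholds, so drop its final point (recall = 0)
# to keep the index valid for thresholds.
target_recall = 0.8
idx = np.argmin(np.abs(recall[:-1] - target_recall))
optimal_threshold = thresholds[idx]
print("Threshold closest to target recall:", optimal_threshold)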

# Apply threshold
y_pred_adjusted = (y_probs >= optimal_threshold).astype(int)


At a decision threshold of **0.53**, the model achieves approximately **80% recall** for churn prediction.

This means the model still identifies roughly four out of five customers who are at risk of churning.  
Because 0.53 sits slightly above the default cutoff of 0.5, recall is marginally lower than the baseline's 0.82, and precision is typically equal or slightly higher in return. Keeping recall near 80% is appropriate for a retention strategy where the cost of missing a potential churner is higher than the cost of contacting a customer who would have stayed.
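
The threshold search can also be written as a small reusable helper. This is a sketch of a variant, not the method used in the cell above: instead of the cutoff whose recall is closest to the target, it returns the highest cutoff whose recall is at least the target. The function name `threshold_for_recall` and the toy labels and scores are hypothetical.

```python
def threshold_for_recall(y_true, scores, target_recall):
    """Return the highest score cutoff whose recall is at least target_recall."""
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("y_true contains no positive examples")
    # Walk the candidate cutoffs from strictest to most lenient.
    for cutoff in sorted(set(scores), reverse=True):
        true_pos = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= cutoff)
        if true_pos / n_pos >= target_recall:
            return cutoff
    raise ValueError("target_recall cannot be reached; it must be at most 1.0")


# Made-up toy data: 1 = churned, 0 = stayed.
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.55, 0.3, 0.7, 0.4, 0.2, 0.1, 0.05]

cutoff = threshold_for_recall(y_true, scores, target_recall=0.8)
print("cutoff:", cutoff)
```

Taking the highest qualifying cutoff guarantees the recall target is met while flagging as few customers as possible.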


In [None]:
from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred_adjusted))

After tuning the decision threshold, churn recall stays close to the 80% target while overall accuracy remains acceptable.  
This aligns the model with business objectives focused on proactive customer retention.


In [None]:
import pandas as pd
from sklearn.metrics import precision_score, recall_score

thresholds_test = [0.4, 0.5, 0.53, 0.6]
rows = []

for t in thresholds_test:
    preds = (y_probs >= t).astype(int)
    rows.append({
        "Threshold": t,
        "Precision": precision_score(y_test, preds),
        "Recall": recall_score(y_test, preds)
    })

pd.DataFrame(rows)


This comparison indicates that 0.53 offers a reasonable balance between recall and precision for the bank's retention objectives.
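
If the bank can put rough prices on the two kinds of error, the same comparison can be expressed as a single cost figure per threshold. The sketch below is an assumption-laden illustration: `expected_cost` is a hypothetical helper, the cost values are invented, and the labels and scores are toy data, so the winning cutoff says nothing about the real model.

```python
def expected_cost(y_true, scores, cutoff, cost_missed, cost_contact):
    """Total cost of missed churners plus outreach to every flagged customer."""
    missed = sum(1 for y, s in zip(y_true, scores) if y == 1 and s < cutoff)
    contacted = sum(1 for s in scores if s >= cutoff)
    return missed * cost_missed + contacted * cost_contact


# Made-up toy data: 1 = churned, 0 = stayed.
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.55, 0.3, 0.7, 0.4, 0.2, 0.1, 0.05]

# Invented costs: losing a churner is assumed to be 10x the cost of one contact.
costs = {
    cutoff: expected_cost(y_true, scores, cutoff, cost_missed=100.0, cost_contact=10.0)
    for cutoff in (0.4, 0.5, 0.6)
}
best_cutoff = min(costs, key=costs.get)

for cutoff, cost in costs.items():
    print(f"cutoff {cutoff}: cost {cost:.0f}")
print("lowest-cost cutoff:", best_cutoff)
```

The ranking depends entirely on the assumed cost ratio, so the costs should come from the business before this is used to justify a threshold.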

## Feature Importance Analysis

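We use the coefficients from the logistic regression model to understand which features most strongly influence churn predictions.  
Positive coefficients increase the predicted likelihood of churn, while negative coefficients decrease it. Because numeric features are standardized, each numeric coefficient reflects the effect of a one-standard-deviation change in that feature.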
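
A coefficient is easier to read once converted to an odds ratio, since logistic regression is linear in the log-odds. The sketch below uses a hypothetical coefficient of 0.5 and a hypothetical baseline churn probability of 20%. Neither value is taken from the fitted model.

```python
import math

coefficient = 0.5        # hypothetical coefficient for a standardized feature
baseline_prob = 0.20     # hypothetical churn probability before the change

# exp(coefficient) is the multiplicative change in the odds of churn
# for a one-unit (here: one standard deviation) increase in the feature.
odds_ratio = math.exp(coefficient)

baseline_odds = baseline_prob / (1 - baseline_prob)
new_odds = baseline_odds * odds_ratio
new_prob = new_odds / (1 + new_odds)

print(f"odds ratio: {odds_ratio:.2f}")
print(f"churn probability: {baseline_prob:.2f} -> {new_prob:.2f}")
```

With these assumed values the odds of churn rise by a factor of about 1.65. A negative coefficient of the same size would shrink them by the same factor.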


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

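# Get feature names after preprocessing.
# The ColumnTransformer outputs the numeric columns first, then the one-hot
# encoded categorical columns, so the names are concatenated in that order
# to line up with the classifier's coefficients.
ohe = model.named_steps['preprocess'].named_transformers_['cat']
cat_features = ohe.get_feature_names_out()

# transformers_[0] is the ('num', StandardScaler, columns) entry; [2] holds its column names
num_features = model.named_steps['preprocess'].transformers_[0][2]

all_features = np.concatenate([num_features, cat_features])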

# Extract coefficients
coefs = model.named_steps['classifier'].coef_[0]

importance_df = pd.DataFrame({
    "Feature": all_features,
    "Coefficient": coefs
}).sort_values(by="Coefficient", ascending=False)

top_positive = importance_df.head(10)
top_negative = importance_df.tail(10)

# Plot
plt.figure(figsize=(10,6))
plt.barh(top_positive["Feature"], top_positive["Coefficient"])
plt.title("Top Features Increasing Churn Risk")
plt.gca().invert_yaxis()
plt.show()

top_positive, top_negative

### Key Drivers of Churn

The model identifies the following major drivers of churn:

- **Months_Inactive_12_mon** and **Contacts_Count_12_mon** are strong behavioral indicators of disengagement.
- **Total_Trans_Ct** and **Total_Trans_Amt** reflect declining usage prior to churn.
- Certain customer segments (income level, card category, education) show elevated churn risk.

Conversely, features such as high transaction volume and strong product relationships significantly reduce churn risk.

Overall, the dominant pattern is **declining engagement preceding customer attrition**.


## Translating Model Insights into Retention Strategy

The model reveals a clear behavioral pattern:  
**customers churn when engagement drops and transactional behavior declines.**

### Key Churn Risk Drivers
Customers are significantly more likely to churn when they show:

- **High transaction amounts but declining activity** (`Total_Trans_Amt`, `Total_Trans_Ct`)
- **Increased inactivity** (`Months_Inactive_12_mon`)
- **More frequent support contact** (`Contacts_Count_12_mon`)
- **Premium product segments** (`Card_Category_Gold`, high income categories)

### Protective Factors
Customers are less likely to churn when they exhibit:

- **Strong multi-product relationships** (`Total_Relationship_Count`)
- **Consistent transaction behavior**
- **High revolving balances and utilization**
- **Blue and Silver card ownership**


### Recommended Retention Actions

| Risk Indicator | Business Interpretation | Targeted Retention Action |
|---------------|----------------------|--------------------------|
| Rising inactivity | Customer disengaging | Proactive re-engagement campaigns, personalized offers |
| Falling transaction count | Reduced product usage | Usage-based incentives, cashback or loyalty rewards |
| High contact frequency | Customer experiencing friction | Priority service, issue resolution outreach |
| Premium customers showing decline | High-value customers at risk | Dedicated relationship managers, exclusive retention offers |
| Low product relationship count | Weak customer attachment | Cross-sell relevant financial products |

These actions allow the bank to intervene **before churn occurs**, prioritizing high-risk, high-value customers.
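
One simple way to prioritize high-risk, high-value customers is to rank them by expected value at risk, meaning churn probability multiplied by customer value. The sketch below is illustrative only: the customer records, the `annual_value` field, and the scoring rule are assumptions, not part of the dataset or the fitted model.

```python
# Made-up customers; churn_prob would come from model.predict_proba in practice.
customers = [
    {"id": "A", "churn_prob": 0.90, "annual_value": 200.0},
    {"id": "B", "churn_prob": 0.60, "annual_value": 1500.0},
    {"id": "C", "churn_prob": 0.30, "annual_value": 4000.0},
    {"id": "D", "churn_prob": 0.10, "annual_value": 500.0},
]

# Expected value at risk = probability of churn x value lost if the customer leaves.
for customer in customers:
    customer["value_at_risk"] = customer["churn_prob"] * customer["annual_value"]

# Highest expected loss first: these customers get outreach priority.
ranked = sorted(customers, key=lambda c: c["value_at_risk"], reverse=True)

for customer in ranked:
    print(customer["id"], round(customer["value_at_risk"], 2))
```

In this toy example the customer most likely to churn (A) is not the first priority, because the expected loss is larger for the higher-value customers C and B.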


### Executive Summary

By combining predictive modeling with interpretable insights, this churn system enables the bank to:
- Identify at-risk customers early
- Prioritize outreach based on business impact
- Design targeted retention programs
- Improve long-term customer lifetime value

This approach transforms churn prediction from a technical model into a scalable business solution.
