<a href="https://colab.research.google.com/github/daisysong76/AI--Machine--learning/blob/main/predicting_customer_churn.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In a project aimed at predicting customer churn, I started with a logistic regression model. The initial results were promising but fell short of the target accuracy, and the deployed model was slow at prediction time. To optimize the model, I employed several strategies:
Feature Engineering: I revisited feature selection and extraction to make sure the model received the most relevant information. This involved removing redundant features, creating interaction terms, and applying principal component analysis (PCA) to reduce dimensionality while retaining most of the variance in the dataset.
Hyperparameter Tuning: I used grid search with cross-validation to systematically explore a range of logistic regression hyperparameters and pick the best-scoring settings. This improved the model’s accuracy.
Model Selection: Realizing that logistic regression might be too simple for the complexity of the data, I tested several other algorithms, including Random Forest and Gradient Boosting Machines (GBM). The GBM outperformed the other models in both accuracy and execution speed in the production environment.
Ensemble Methods: To further improve performance, I used a stacking ensemble that combined the predictions from logistic regression, Random Forest, and GBM. This approach leveraged the strengths of each model and improved the overall prediction accuracy.
Post-processing Techniques: I applied probability calibration to the model's outputs, which made the predicted churn probabilities more reliable.
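The feature engineering and tuning steps above can be sketched as a single scikit-learn pipeline. This is only an illustration under stated assumptions, not the project's actual code: the data is synthetic (`make_classification`), all features are numeric, the 95% variance threshold is an arbitrary choice, and the names `X_demo`, `feature_pipeline`, and `search` are hypothetical. Wrapping the preprocessing in a `Pipeline` means each cross-validation fold fits the scaler and PCA on its own training portion only.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in for the churn data (assumption: numeric features only).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=5,
    weights=[0.8, 0.2], random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

feature_pipeline = Pipeline([
    # Pairwise interaction terms: 8 original features + 28 products = 36 columns.
    ("interactions", PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)),
    # PCA is scale-sensitive, so standardize before projecting.
    ("scale", StandardScaler()),
    # A float n_components keeps enough components to explain 95% of the variance.
    ("pca", PCA(n_components=0.95, svd_solver="full")),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Grid search over the regularization strength of the final classifier.
search = GridSearchCV(feature_pipeline, {"clf__C": [0.1, 1, 10]}, cv=5, scoring="accuracy")
search.fit(X_tr, y_tr)

n_kept = search.best_estimator_.named_steps["pca"].n_components_
test_accuracy = search.score(X_te, y_te)
print("Best C:", search.best_params_["clf__C"])
print("PCA components kept:", n_kept)
print(f"Held-out accuracy: {test_accuracy:.4f}")
```

The same pattern extends to the other models: swapping the `clf` step for a tree-based classifier lets the grid search compare them under identical preprocessing.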


# Required Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import StackingClassifier
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import classification_report, accuracy_score

# Load and Preprocess Data
data = pd.read_csv('path/to/your/dataset.csv')
X = data.drop(columns=['target'])  # Replace 'target' with your dependent variable
y = data['target']
# Note: the steps below assume every column in X is numeric; encode categorical columns first.

# Train-Test Split (stratified so both splits keep the same churn rate)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

# Feature Engineering
# Standardize first: PCA is sensitive to feature scale
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Example: PCA for Dimensionality Reduction (n_components must not exceed the number of features)
pca = PCA(n_components=10)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

# Hyperparameter Tuning for Logistic Regression
param_grid_lr = {
    'C': [0.1, 1, 10],
    'penalty': ['l2'],
    'solver': ['liblinear']
}
logistic = LogisticRegression()
grid_lr = GridSearchCV(logistic, param_grid_lr, cv=5, scoring='accuracy')
grid_lr.fit(X_train_pca, y_train)

# Best Logistic Regression Model
best_lr = grid_lr.best_estimator_

# Random Forest and GBM Model Selection
random_forest = RandomForestClassifier(n_estimators=100, random_state=42)
gbm = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=42)

# Model Training
random_forest.fit(X_train_pca, y_train)
gbm.fit(X_train_pca, y_train)

# Stacking Ensemble Model
estimators = [
    ('lr', best_lr),
    ('rf', random_forest),
    ('gbm', gbm)
]
stacking_model = StackingClassifier(estimators=estimators, final_estimator=LogisticRegression())
stacking_model.fit(X_train_pca, y_train)

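# Calibrating the Model
# The estimator is passed positionally because the keyword was renamed from
# `base_estimator` to `estimator` in scikit-learn 1.2 (the old name was removed in 1.4).
# With cv=5 the stacking model is cloned and refit on each fold before calibration.
calibrated_model = CalibratedClassifierCV(stacking_model, method='sigmoid', cv=5)
calibrated_model.fit(X_train_pca, y_train)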

# Model Evaluation
y_pred = calibrated_model.predict(X_test_pca)
accuracy = accuracy_score(y_test, y_pred)
classification_report_output = classification_report(y_test, y_pred)

print(f"Accuracy: {accuracy:.4f}")
print("Classification Report:\n", classification_report_output)
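
Accuracy alone does not show whether calibration made the probabilities more reliable. One way to check is to compare Brier scores (mean squared error between predicted probability and outcome, lower is better) before and after calibration. The sketch below is an assumption-laden illustration rather than the project's evaluation: it uses synthetic data and a single GBM instead of the stacked model, and the names `X_cal`, `raw_model`, and `calibrated` are hypothetical.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the churn data (assumption).
X_cal, y_cal = make_classification(
    n_samples=600, n_features=10, weights=[0.8, 0.2], random_state=0
)
X_fit, X_hold, y_fit, y_hold = train_test_split(
    X_cal, y_cal, test_size=0.25, stratify=y_cal, random_state=0
)

# Uncalibrated baseline.
raw_model = GradientBoostingClassifier(random_state=0).fit(X_fit, y_fit)

# Same model wrapped in sigmoid (Platt) calibration with 5-fold cross-validation.
calibrated = CalibratedClassifierCV(
    GradientBoostingClassifier(random_state=0), method="sigmoid", cv=5
).fit(X_fit, y_fit)

# Brier score on held-out data, using the predicted probability of the positive class.
brier_raw = brier_score_loss(y_hold, raw_model.predict_proba(X_hold)[:, 1])
brier_calibrated = brier_score_loss(y_hold, calibrated.predict_proba(X_hold)[:, 1])

print(f"Brier score (raw):        {brier_raw:.4f}")
print(f"Brier score (calibrated): {brier_calibrated:.4f}")
```

Whether calibration helps depends on the model and the data, so the two scores should be compared on a held-out set rather than assumed.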
