In [1]:
from sklearn.model_selection import StratifiedShuffleSplit
import pandas as pd

df = pd.read_csv("../2_data/telcocustomerchurn_featured.csv")
print(df.columns.tolist())
print(df.dtypes)


churn_columns = [col for col in df.columns if 'Churn' in col]
print(churn_columns)

['Unnamed: 0', 'Count', 'Gender', 'Age', 'Under 30', 'Senior Citizen', 'Married', 'Dependents', 'Number of Dependents', 'City', 'Zip Code', 'Latitude', 'Longitude', 'Referred a Friend', 'Number of Referrals', 'Tenure in Months', 'Phone Service', 'Avg Monthly Long Distance Charges', 'Multiple Lines', 'Internet Service', 'Avg Monthly GB Download', 'Online Security', 'Online Backup', 'Device Protection Plan', 'Premium Tech Support', 'Streaming TV', 'Streaming Movies', 'Streaming Music', 'Unlimited Data', 'Paperless Billing', 'Monthly Charge', 'Total Charges', 'Total Refunds', 'Total Extra Data Charges', 'Total Long Distance Charges', 'Total Revenue', 'Satisfaction Score', 'Churn Value', 'Churn Score', 'CLTV', 'LoyaltyID', 'Partner', 'Tenure', 'Monthly Charges', 'Churn', 'Country_United States', 'State_California', 'Quarter_Q3', 'Offer_Offer A', 'Offer_Offer B', 'Offer_Offer C', 'Offer_Offer D', 'Offer_Offer E', 'Internet Type_Cable', 'Internet Type_DSL', 'Internet Type_Fiber Optic', 'Cont

In [2]:
# Drop all columns containing "Churn" in their name except the "Churn" column
churn_columns_to_drop = [col for col in churn_columns if col != 'Churn']
df = df.drop(columns=churn_columns_to_drop)

# Display the remaining columns
print(df.columns.tolist())

# Define the features and target variable
X = df.drop(columns=['Churn', 'Unnamed: 0', 'Customer Status_Joined', 'Customer Status_Stayed', 'LoyaltyID'])
y = df['Churn']

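# StratifiedShuffleSplit preserves the churn/non-churn ratio in both subsets.
# Note: n_splits=5 yields five independent splits, but the loop below
# overwrites X_train/X_test on every pass, so only the last split is kept.
sss = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=7)
print(sss)

for train_index, test_index in sss.split(X, y):
    print("train:", train_index, "test:", test_index)
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]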

print(f"X shape: {X.shape}")
print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y shape: {y.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_test shape: {y_test.shape}")

# Save the train and test splits to CSV files
X_train.to_csv("../2_data/X_train.csv", index=False)
X_test.to_csv("../2_data/X_test.csv", index=False)
y_train.to_csv("../2_data/y_train.csv", index=False)
y_test.to_csv("../2_data/y_test.csv", index=False)

['Unnamed: 0', 'Count', 'Gender', 'Age', 'Under 30', 'Senior Citizen', 'Married', 'Dependents', 'Number of Dependents', 'City', 'Zip Code', 'Latitude', 'Longitude', 'Referred a Friend', 'Number of Referrals', 'Tenure in Months', 'Phone Service', 'Avg Monthly Long Distance Charges', 'Multiple Lines', 'Internet Service', 'Avg Monthly GB Download', 'Online Security', 'Online Backup', 'Device Protection Plan', 'Premium Tech Support', 'Streaming TV', 'Streaming Movies', 'Streaming Music', 'Unlimited Data', 'Paperless Billing', 'Monthly Charge', 'Total Charges', 'Total Refunds', 'Total Extra Data Charges', 'Total Long Distance Charges', 'Total Revenue', 'Satisfaction Score', 'CLTV', 'LoyaltyID', 'Partner', 'Tenure', 'Monthly Charges', 'Churn', 'Country_United States', 'State_California', 'Quarter_Q3', 'Offer_Offer A', 'Offer_Offer B', 'Offer_Offer C', 'Offer_Offer D', 'Offer_Offer E', 'Internet Type_Cable', 'Internet Type_DSL', 'Internet Type_Fiber Optic', 'Contract_Month-to-Month', 'Contrac

In [4]:
from sklearn.naive_bayes import GaussianNB                   # Naive Bayes

nb_classifier = GaussianNB()
nb_classifier.fit(X_train, y_train)

In [5]:
y_pred = nb_classifier.predict(X_test)

# Evaluation Metrics
1. Accuracy
    - Accuracy is the proportion of all predictions that are correct. In a churn prediction project, it is the share of customers correctly classified as either churned or not churned.
    - Accuracy gives a general sense of how well the model separates customers who will churn from those who will not. On its own it is unreliable for an imbalanced problem such as churn prediction, where churned customers are far fewer than non-churned customers.
2. Precision and Recall
    - Precision is the proportion of correctly predicted positive observations (churned customers) out of all observations predicted to be positive.
    - Recall is the proportion of correctly predicted positive observations out of all actual positive observations (all churned customers).
    - In churn prediction, high precision means few false positives: customers flagged as likely to churn really are at risk. High recall means few false negatives: the model misses few of the customers who actually churn. Both metrics are particularly useful on imbalanced datasets.
3. F1-Score
    - The F1-score is the harmonic mean of precision and recall. It gives a single score that accounts for both false positives and false negatives.
    - In churn prediction, the F1-score is useful when neither high recall nor high precision alone is sufficient. It rewards models that identify at-risk customers accurately without missing too many of them.
4. ROC-AUC Score
    - The ROC-AUC score measures the model's ability to distinguish the positive class from the negative class (churned vs. not churned). It is the area under the ROC curve; a higher value indicates better separation of the two classes.
    - In churn prediction, the ROC-AUC score summarizes performance across all classification thresholds rather than at a single cut-off, which makes it useful for comparing models independently of the threshold chosen.
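
The definitions above can be sketched directly from the four confusion-matrix counts. This is a minimal illustration in plain Python, not the notebook's own code: the helper `classification_metrics` and the toy labels are hypothetical and are not taken from the churn dataset.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 from two label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when nothing is predicted/labelled positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical toy labels (1 = churned, 0 = stayed), for illustration only.
toy_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
toy_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
toy_metrics = classification_metrics(toy_true, toy_pred)
print(toy_metrics)
```

In this toy case there are 2 true positives, 1 false positive and 1 false negative, so precision and recall are both 2/3 while accuracy is 0.8.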

## Accuracy
- Accuracy represents the overall performance of the model on the test set: the ratio of correctly classified samples to the total number of samples. In this example, the model achieved an accuracy of 0.83 (i.e., 83%), which means it correctly predicted 83% of the samples in the test set.

- An accuracy of 83% may seem reasonable, but churn is typically an imbalanced problem, so accuracy alone does not give a complete picture. On an imbalanced dataset, a model biased towards the majority class (e.g., predicting that most customers will not churn) can score high accuracy while failing to identify the minority (churned) customers. Additional metrics such as precision, recall, and F1-score are therefore needed for a fuller assessment.
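
To make the imbalance point concrete, the following sketch scores a trivial classifier that always predicts "not churned". The class balance used here (27 churners out of 100) is a hypothetical figure chosen for illustration, not a number measured from this dataset.

```python
# Hypothetical class balance, for illustration only: 27 churned, 73 stayed.
y_true_demo = [1] * 27 + [0] * 73

# A "model" that ignores its input and always predicts the majority class.
y_majority = [0] * len(y_true_demo)

correct = sum(1 for t, p in zip(y_true_demo, y_majority) if t == p)
baseline_accuracy = correct / len(y_true_demo)

churners_found = sum(1 for t, p in zip(y_true_demo, y_majority) if t == 1 and p == 1)
baseline_recall = churners_found / sum(y_true_demo)

print(f"Baseline accuracy: {baseline_accuracy:.2f}")
print(f"Baseline recall:   {baseline_recall:.2f}")
```

Under this assumed balance the baseline reaches 0.73 accuracy while finding none of the churners, which is why accuracy should be read against the majority-class rate rather than against zero.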

In [6]:
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')

Accuracy: 0.83


## Precision and Recall
- In this example, precision and recall are used to evaluate the model's performance in predicting customer churn.

- Precision measures how trustworthy the positive predictions are: the proportion of customers predicted to churn who actually churned. A precision of 0.63 indicates that 63% of the customers the model flagged as churners actually did churn.

- Recall measures the model's ability to find all actual positive cases: the proportion of actual churned customers that the model correctly predicted. A recall of 0.84 indicates that the model identified 84% of the customers who actually churned.

- These two metrics are important for assessing how well a model detects positive instances, especially when the data is imbalanced. In customer churn prediction, higher recall means more at-risk customers are caught, whereas higher precision means fewer false churn alarms. Here the model favours recall over precision.

In [7]:
from sklearn.metrics import precision_score, recall_score
precision = precision_score(y_test, y_pred, pos_label=1)  
recall = recall_score(y_test, y_pred, pos_label=1)
print(f'Precision: {precision:.2f}')
print(f'Recall: {recall:.2f}')

Precision: 0.63
Recall: 0.84


## F1-Score
- The F1-score combines precision and recall into a single value and is especially useful on imbalanced datasets. It is defined as the harmonic mean of precision and recall, so it is pulled towards the lower of the two. In this case, the F1-score is 0.72, which reflects strong recall (0.84) held back by weaker precision (0.63). This metric is most useful when it matters both to identify as many positive cases as possible and to limit false positives.

In [8]:
from sklearn.metrics import f1_score
f1 = f1_score(y_test, y_pred, pos_label=1)
print(f'F1-Score: {f1:.2f}')

F1-Score: 0.72
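
As a quick consistency check, the F1-score can be recomputed from the rounded precision (0.63) and recall (0.84) printed by the earlier cell. This is a small arithmetic sketch; the variable names are introduced here for illustration.

```python
# Rounded values copied from the precision/recall cell output above.
precision_reported = 0.63
recall_reported = 0.84

# Harmonic mean: 2PR / (P + R). It sits below the arithmetic mean
# whenever precision and recall differ.
f1_check = 2 * precision_reported * recall_reported / (precision_reported + recall_reported)
arithmetic_mean = (precision_reported + recall_reported) / 2

print(f"F1 from reported values: {f1_check:.2f}")
print(f"Arithmetic mean:         {arithmetic_mean:.3f}")
```

The recomputed value rounds to 0.72, matching the `f1_score` output, while the arithmetic mean (0.735) is slightly higher.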


## ROC-AUC Score
- The ROC-AUC score measures how well the model distinguishes between the positive and negative classes. It is the area under the Receiver Operating Characteristic (ROC) curve, which plots the true positive rate (recall) against the false positive rate across classification thresholds. A score of 0.5 corresponds to random guessing, and a score closer to 1 means the model separates churned from non-churned customers well.

In [9]:
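from sklearn.metrics import roc_auc_score

# Note: y_pred holds hard 0/1 labels, so this scores a single threshold and
# equals the balanced accuracy. Passing nb_classifier.predict_proba(X_test)[:, 1]
# instead would evaluate the full ROC curve.
roc_auc = roc_auc_score(y_test, y_pred)
print(f'ROC-AUC Score: {roc_auc:.2f}')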


ROC-AUC Score: 0.83
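
The cell above passes hard class labels (`y_pred`) to `roc_auc_score`, which collapses the ROC curve to a single point. The sketch below illustrates the difference using a plain-Python pairwise definition of AUC (the probability that a randomly chosen positive is ranked above a randomly chosen negative, with ties counted as one half). The helper `pairwise_auc` and the toy scores are hypothetical; with scikit-learn, the probability-based variant would typically use `nb_classifier.predict_proba(X_test)[:, 1]` as the score argument.

```python
def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                wins += 1.0
            elif pos_score == neg_score:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(positives) * len(negatives))


# Hypothetical toy data, for illustration only.
labels = [1, 1, 1, 0, 0, 0]
probabilities = [0.9, 0.6, 0.4, 0.55, 0.3, 0.1]
hard_predictions = [1 if p >= 0.5 else 0 for p in probabilities]

auc_from_probabilities = pairwise_auc(labels, probabilities)
auc_from_labels = pairwise_auc(labels, hard_predictions)

print(f"AUC from probabilities: {auc_from_probabilities:.3f}")
print(f"AUC from hard labels:   {auc_from_labels:.3f}")
```

On this toy data the probability-based AUC is 8/9, while the label-based value drops to 2/3, which is exactly the mean of the true positive rate and true negative rate at the 0.5 threshold.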


## Summary
Overall, the Naive Bayes model gives mixed results. Accuracy (0.83), recall (0.84), and the ROC-AUC score (0.83) are solid, so the model catches most customers who churn. Precision is noticeably lower (0.63): roughly 37% of the customers flagged as churners did not actually churn, which pulls the F1-score down to 0.72. Because the data is imbalanced, accuracy alone is not a sufficient measure. Taken together, the metrics show a model that is good at identifying churned customers but produces a substantial number of false alarms.