
# Telco Customer Churn Prediction using Machine Learning

This project aims to predict whether a telecom customer will churn (i.e., stop using the service) based on their account details, services used, and contract information. Detecting likely churners early helps the business retain customers and reduce revenue loss.

We will explore, clean, and model the Telco Customer Churn dataset using Python and machine learning.


In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import joblib


In [1]:

df = pd.read_csv("WA_Fn-UseC_-Telco-Customer-Churn.csv")
df.head()



In [None]:
df.info()

In [None]:
df.describe()

In [None]:

# Replace blank (single-space) strings with NaN
df.replace(" ", np.nan, inplace=True)

# Drop rows with missing TotalCharges (only that column is checked)
df.dropna(subset=['TotalCharges'], inplace=True)

# Convert TotalCharges from string to float
df['TotalCharges'] = df['TotalCharges'].astype(float)

# Encode target variable
df['Churn'] = df['Churn'].map({'Yes': 1, 'No': 0})
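
The cleaning step above can be checked on a small frame before running it on the full data. The sketch below uses a made-up `toy` DataFrame (not rows from the dataset) and `pd.to_numeric(..., errors="coerce")`, which is an alternative to the replace-then-cast approach rather than the method used in this notebook.

```python
import pandas as pd

# Toy frame standing in for the real data; the values are made up for illustration.
toy = pd.DataFrame({
    "tenure": [0, 12, 34],
    "TotalCharges": [" ", "350.5", "1889.5"],  # blank string in the first row
})

# errors="coerce" turns any entry that cannot be parsed (such as " ") into NaN.
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce")

n_missing = int(toy["TotalCharges"].isna().sum())
print("rows with missing TotalCharges:", n_missing)

# Drop only the rows where TotalCharges could not be parsed.
cleaned = toy.dropna(subset=["TotalCharges"])
print(cleaned.dtypes)
```

Counting the missing values before dropping them makes it clear how many rows the cleaning step removes.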


In [None]:

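# Drop the unique customerID so it is not one-hot encoded into thousands of columns
df_model = df.drop(columns=['customerID'])
cat_cols = df_model.select_dtypes(include='object').columns.tolist()
df_encoded = pd.get_dummies(df_model, columns=cat_cols, drop_first=True)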
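
One-hot encoding creates one column per category level, so a column that is unique per row, such as an identifier, expands into nearly as many columns as there are rows. A minimal sketch of the effect, using a made-up `toy` frame with hypothetical values:

```python
import pandas as pd

# Toy frame with made-up values; customerID is unique for every row.
toy = pd.DataFrame({
    "customerID": ["A-1", "B-2", "C-3", "D-4"],
    "Contract": ["Month-to-month", "One year", "Two year", "One year"],
    "MonthlyCharges": [29.85, 56.95, 53.85, 42.30],
})

# Encoding every string column, including the identifier.
with_id = pd.get_dummies(toy, drop_first=True)

# Encoding after the identifier has been removed.
without_id = pd.get_dummies(toy.drop(columns=["customerID"]), drop_first=True)

print(with_id.shape[1], "columns with the ID encoded")
print(without_id.shape[1], "columns after dropping the ID")
```

Identifier dummies carry no information that generalizes to new customers, so they only add noise and memory cost to the model.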


In [None]:

X = df_encoded.drop('Churn', axis=1)
y = df_encoded['Churn']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
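
Because churners are the minority class, a plain random split can leave the test set with a slightly different churn rate than the full data. The sketch below, on made-up labels with an assumed 25% positive rate, compares a plain split with one that passes `stratify=`. It is an optional variation, not the split used in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up labels: 25% positives, standing in for the Churn column.
y_demo = np.array([1] * 250 + [0] * 750)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

# Plain random split.
_, _, _, y_plain = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Stratified split: the test set keeps the class proportions of y_demo.
_, _, _, y_strat = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("positive rate, plain split:     ", y_plain.mean())
print("positive rate, stratified split:", y_strat.mean())
```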


In [None]:

log_model = LogisticRegression(max_iter=2000)
log_model.fit(X_train, y_train)
y_pred_log = log_model.predict(X_test)

print("Logistic Regression Accuracy:", accuracy_score(y_test, y_pred_log))
print(classification_report(y_test, y_pred_log))


In [None]:

rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
y_pred_rf = rf_model.predict(X_test)

print("Random Forest Accuracy:", accuracy_score(y_test, y_pred_rf))
print(classification_report(y_test, y_pred_rf))
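
Accuracy alone can look good on imbalanced data, since always predicting "no churn" already scores the majority-class rate. The sketch below shows how the majority-class baseline, recall on the positive class, and ROC AUC could be reported alongside accuracy. It runs on synthetic data from `make_classification` (an assumption for illustration, not the Telco data), and all variable names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the churn data (roughly 25% positives).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.75, 0.25], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

clf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)
pred = clf.predict(X_te)
proba = clf.predict_proba(X_te)[:, 1]  # predicted probability of the positive class

# Accuracy obtained by always predicting the majority class.
baseline = max(y_te.mean(), 1 - y_te.mean())

print("majority-class baseline:", round(baseline, 3))
print("accuracy:", round(accuracy_score(y_te, pred), 3))
print("recall (positive class):", round(recall_score(y_te, pred), 3))
print("ROC AUC:", round(roc_auc_score(y_te, proba), 3))
```

A model is only useful for churn detection if its accuracy clearly beats the baseline and its recall on the positive class is acceptable for the business.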


In [None]:

# Confusion Matrix
sns.heatmap(confusion_matrix(y_test, y_pred_rf), annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix - Random Forest')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()


In [None]:

# Feature Importance
importances = rf_model.feature_importances_
feat_df = pd.DataFrame({'Feature': X.columns, 'Importance': importances})
feat_df.sort_values(by='Importance', ascending=False).head(10).plot(kind='barh', x='Feature', y='Importance', title='Top 10 Features')
plt.show()


In [None]:
joblib.dump(rf_model, 'churn_model.pkl')
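
A saved model expects new data to have the same columns, in the same order, as the training matrix. The sketch below shows one way to handle that: store the column list next to the model and use `reindex` to align newly encoded rows. It uses a toy model and a temporary file named `demo_churn_model.pkl`, both hypothetical and separate from the `churn_model.pkl` saved above.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Toy training frame (made-up values) that has already been one-hot encoded.
train = pd.DataFrame({
    "tenure": [1, 24, 60, 3, 48, 2],
    "Contract_One year": [0, 1, 0, 0, 1, 0],
    "Contract_Two year": [0, 0, 1, 0, 0, 0],
})
labels = [1, 0, 0, 1, 0, 1]
model = RandomForestClassifier(n_estimators=10, random_state=42).fit(train, labels)

# Save the model together with the column order it was trained on.
path = os.path.join(tempfile.mkdtemp(), "demo_churn_model.pkl")
joblib.dump({"model": model, "columns": list(train.columns)}, path)

# Later: load the bundle, encode the new rows, and align them to the training columns.
bundle = joblib.load(path)
new_raw = pd.DataFrame({"tenure": [5], "Contract": ["One year"]})
new_encoded = pd.get_dummies(new_raw, columns=["Contract"], dtype=int)

# Dummy columns missing from the new data are added and filled with 0.
new_aligned = new_encoded.reindex(columns=bundle["columns"], fill_value=0)

print(bundle["model"].predict_proba(new_aligned)[:, 1])
```

Without the alignment step, a new batch that lacks one category level would produce a matrix with fewer columns than the model was trained on.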


## Conclusion

We built two churn classifiers, Logistic Regression and Random Forest, and evaluated both on a held-out 20% test set. The Random Forest model gave higher accuracy and better performance in detecting churn, so it is the model saved to `churn_model.pkl`. It can be used to flag customers who are likely to churn so that the telecom company can take retention action.
