## Problem Statement
The objective of this project is to predict customer churn
based on demographic and usage-related features.


In [21]:
import pandas as pd

dataset = pd.read_csv("customer_churn.csv")
dataset.head()

   age  monthly_charges  tenure  churn
0   22              250       3      1
1   25              300       6      1
2   30              400      12      0
3   35              450      18      0
4   40              500      24      0


The dataset contains customer attributes and a binary churn label.
Churn prediction is critical for customer retention strategies.


In [22]:
dataset.info()
dataset.isnull().sum()
dataset["churn"].value_counts()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   age              10 non-null     int64
 1   monthly_charges  10 non-null     int64
 2   tenure           10 non-null     int64
 3   churn            10 non-null     int64
dtypes: int64(4)
memory usage: 448.0 bytes


0    6
1    4
Name: churn, dtype: int64

The target variable is mildly imbalanced (6 retained vs. 4 churned customers),
so accuracy alone is an unreliable evaluation metric.
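
To see why accuracy alone can mislead here, consider a minimal sketch with a hypothetical baseline that always predicts "no churn". The label vector is illustrative and assumes only the dataset's 6:4 class split; `y_pred` is not a model from this notebook.

```python
# Illustrative labels with the dataset's 6:4 class split (1 = churned).
y_true = [1, 1, 0, 0, 0, 1, 0, 0, 1, 0]

# Hypothetical baseline: always predict the majority class (0 = no churn).
y_pred = [0] * len(y_true)

# Accuracy: fraction of all predictions that match the true label.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall on the churn class: fraction of actual churners that are caught.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
actual_positives = sum(t == 1 for t in y_true)
churn_recall = true_positives / actual_positives

print(f"accuracy={accuracy:.2f}, churn recall={churn_recall:.2f}")
# prints: accuracy=0.60, churn recall=0.00
```

The baseline reaches 60% accuracy without identifying a single churner, which is why recall on the churn class is tracked as well.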


In [23]:
# Features: every column except the last one (the churn label)
x = dataset.iloc[:, :-1]
# Target: binary churn label
y = dataset["churn"]

In [24]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42, stratify=y
)

scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train)
x_test_scaled = scaler.transform(x_test)

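Stratified splitting preserves the class distribution in both the training and test sets.
The scaler is fit on the training data only and then applied to the test data, which avoids
leaking test statistics into preprocessing. Scaling matters for models that are sensitive
to feature magnitude, such as logistic regression.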
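
The effect of both steps can be checked on a small example. The sketch below uses hypothetical toy data (one numeric feature, 10 rows, a 6:4 class split), not the churn CSV, and mirrors the `train_test_split` and `StandardScaler` calls used in the cell above.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: 10 rows, one feature, 6:4 class split.
X = np.arange(10, dtype=float).reshape(-1, 1)
y = np.array([1, 1, 0, 0, 0, 1, 0, 0, 1, 0])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# With stratify=y, both classes appear in the training and the test part.
print("train class counts:", np.bincount(y_train))
print("test class counts: ", np.bincount(y_test))

# Fit the scaler on the training rows only, then reuse its statistics
# for the test rows so no test information enters preprocessing.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# The training feature is centered at zero; the test feature generally is not.
print("train mean after scaling:", X_train_scaled.mean(axis=0))
print("test mean after scaling: ", X_test_scaled.mean(axis=0))
```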


In [25]:
from sklearn.linear_model import LogisticRegression

log_model = LogisticRegression()
log_model.fit(x_train_scaled, y_train)

y_log_pred = log_model.predict(x_test_scaled)

In [26]:
from sklearn.tree import DecisionTreeClassifier

tree_model = DecisionTreeClassifier(max_depth=4, random_state=42)
tree_model.fit(x_train, y_train)

y_tree_pred = tree_model.predict(x_test)

In [28]:
from sklearn.metrics import classification_report, confusion_matrix

print("Logistic Regression")
print(confusion_matrix(y_test, y_log_pred))
print(classification_report(y_test, y_log_pred))

print("Decision Tree")
print(confusion_matrix(y_test, y_tree_pred))
print(classification_report(y_test, y_tree_pred))


Logistic Regression
[[1 0]
 [0 1]]
              precision    recall  f1-score   support

           0       1.00      1.00      1.00         1
           1       1.00      1.00      1.00         1

    accuracy                           1.00         2
   macro avg       1.00      1.00      1.00         2
weighted avg       1.00      1.00      1.00         2

Decision Tree
[[1 0]
 [0 1]]
              precision    recall  f1-score   support

           0       1.00      1.00      1.00         1
           1       1.00      1.00      1.00         1

    accuracy                           1.00         2
   macro avg       1.00      1.00      1.00         2
weighted avg       1.00      1.00      1.00         2



Precision, recall, and F1-score are emphasized over accuracy because of the class imbalance
and because the business costs of missed churners (false negatives) and false alarms
(false positives) differ.
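
As a minimal sketch of how these metrics follow from a confusion matrix, the helper below computes them for the positive (churn) class. `binary_metrics` is a hypothetical helper and the counts are made up for illustration; they are not results from this notebook. The argument order follows the `[[tn, fp], [fn, tp]]` layout that `confusion_matrix` prints for labels 0 and 1.

```python
def binary_metrics(tn, fp, fn, tp):
    """Return (precision, recall, f1) for the positive (churn) class."""
    # Precision: of all predicted churners, how many actually churned.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of all actual churners, how many were predicted.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts for 100 customers (illustration only).
tn, fp, fn, tp = 50, 10, 20, 20
precision, recall, f1 = binary_metrics(tn, fp, fn, tp)
accuracy = (tn + tp) / (tn + fp + fn + tp)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
# prints: accuracy=0.70 precision=0.67 recall=0.50 f1=0.57
```

In this made-up case accuracy is 0.70 while only half of the churners are caught, which is the kind of gap the per-class metrics expose.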


## Conclusion
A complete classification pipeline was built to predict customer churn.

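Both models classified the two held-out samples correctly, so this test set is too small
to show which one generalizes better. Logistic regression is the simpler model,
while the decision tree required depth control (`max_depth=4`) to limit overfitting.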
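
The role of depth control can be illustrated on synthetic data. The sketch below is an assumption-laden illustration, not a rerun of this notebook: the data comes from `make_classification` with added label noise, and the `results` dictionary is a hypothetical name used only to collect the numbers.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (not the churn dataset); flip_y adds label noise
# that an unconstrained tree can memorize.
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=4, flip_y=0.2, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

results = {}
for depth in (None, 4):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    # Store the fitted depth plus training and test accuracy.
    results[depth] = (
        tree.get_depth(),
        tree.score(X_train, y_train),
        tree.score(X_test, y_test),
    )
    print(
        f"max_depth={depth}: depth={results[depth][0]}, "
        f"train acc={results[depth][1]:.2f}, test acc={results[depth][2]:.2f}"
    )
```

The unconstrained tree fits the training rows perfectly, noise included, while the depth-limited tree cannot. Comparing training and test accuracy for each setting shows how much of the training fit carries over to unseen data.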

Metric selection played a critical role in model evaluation.
