## Problem Statement
The goal is to predict customer churn based on age,
monthly charges, and tenure using classification models.


In [1]:
import pandas as pd

dataset = pd.read_csv("customer_churn.csv")
dataset.head()


   age  monthly_charges  tenure  churn
0   22              250       3      1
1   25              300       6      1
2   30              400      12      0
3   35              450      18      0
4   40              500      24      0


The dataset contains three numeric customer attributes and a binary churn label.

Such problems are common in business retention analysis.


In [3]:
dataset.info()
dataset.isnull().sum()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   age              10 non-null     int64
 1   monthly_charges  10 non-null     int64
 2   tenure           10 non-null     int64
 3   churn            10 non-null     int64
dtypes: int64(4)
memory usage: 448.0 bytes


age                0
monthly_charges    0
tenure             0
churn              0
dtype: int64

No column has missing values, so no imputation is needed before training.


In [4]:
dataset["churn"].value_counts()


churn
0    6
1    4
Name: count, dtype: int64

The target variable shows a slight class imbalance: 6 customers in class 0 against 4 in class 1.

With imbalanced classes, accuracy alone is an unreliable evaluation metric.
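To see why accuracy can mislead here, consider a minimal sketch of a hypothetical baseline that always predicts the majority class. The label list is reconstructed from the class counts printed above, and the baseline itself is an illustration, not a model trained in this notebook.

```python
# Labels reconstructed from the value_counts() output: 6 zeros, 4 ones.
y_true = [0] * 6 + [1] * 4

# Hypothetical baseline: always predict the majority class (0).
y_pred = [0] * len(y_true)

# Accuracy: share of predictions that match the true label.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall for class 1: correctly flagged positives / actual positives.
true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
actual_pos = sum(t == 1 for t in y_true)
churn_recall = true_pos / actual_pos

print(f"accuracy:       {accuracy:.2f}")
print(f"class-1 recall: {churn_recall:.2f}")
```

The baseline reaches 60% accuracy while identifying none of the class 1 customers, which is the failure that per-class recall exposes.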


In [5]:
x = dataset[["age", "monthly_charges", "tenure"]]
y = dataset["churn"]


In [6]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=42, stratify=y
)

scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train)
x_test_scaled = scaler.transform(x_test)


Stratified splitting preserves the class distribution in both the training and test sets.
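To make the effect of `stratify=y` concrete, here is a minimal pure-Python sketch of a stratified split. The helper `stratified_split` is hypothetical and uses simplified per-class rounding; it is not scikit-learn's implementation, and the label list is reconstructed from the class counts reported earlier.

```python
import random


def stratified_split(labels, test_size, seed=42):
    """Return (train_idx, test_idx), sampling each class separately.

    Hypothetical helper for illustration; per-class rounding is simplified.
    """
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = {}
    for i, label in enumerate(labels):
        by_class.setdefault(label, []).append(i)

    train_idx, test_idx = [], []
    for idx in by_class.values():
        rng.shuffle(idx)
        # Take roughly test_size of each class, at least one sample.
        n_test = max(1, round(len(idx) * test_size))
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)


# 6 samples of class 0 and 4 of class 1, as in value_counts().
labels = [0] * 6 + [1] * 4
train_idx, test_idx = stratified_split(labels, test_size=0.25)

print("test labels: ", [labels[i] for i in test_idx])
print("train labels:", [labels[i] for i in train_idx])
```

With 6 and 4 samples per class and a 25% test fraction, this sketch puts 2 samples of class 0 and 1 sample of class 1 in the test set, the same per-class support that appears in the classification reports.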

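Scaling is applied for models sensitive to feature magnitude, such as logistic regression. The scaler is fitted on the training set only and then reused on the test set.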
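The arithmetic behind `fit_transform` followed by `transform` can be sketched with the standard library alone. The values are the `monthly_charges` entries from the five rows shown by `dataset.head()`; the 4/1 train/test split used here is an assumption made purely for illustration.

```python
from statistics import fmean, pstdev

# monthly_charges from the first five rows; the split is hypothetical.
train = [250, 300, 400, 450]
test = [500]

# Standardization statistics come from the training data only.
# StandardScaler uses the population standard deviation (ddof=0).
mean = fmean(train)
std = pstdev(train)

# The same mean and std are applied to both splits.
train_scaled = [(v - mean) / std for v in train]
test_scaled = [(v - mean) / std for v in test]

print(f"mean={mean:.1f}, std={std:.2f}")
print("train:", [round(v, 2) for v in train_scaled])
print("test: ", [round(v, 2) for v in test_scaled])
```

The scaled training values have mean 0 and unit variance, while the held-out value is simply mapped with the training statistics, so no information from the test set leaks into preprocessing.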


In [7]:
from sklearn.linear_model import LogisticRegression

log_model = LogisticRegression()
log_model.fit(x_train_scaled, y_train)

log_pred = log_model.predict(x_test_scaled)


In [8]:
from sklearn.metrics import classification_report, confusion_matrix

print(confusion_matrix(y_test, log_pred))
print(classification_report(y_test, log_pred))


[[2 0]
 [0 1]]
              precision    recall  f1-score   support

           0       1.00      1.00      1.00         2
           1       1.00      1.00      1.00         1

    accuracy                           1.00         3
   macro avg       1.00      1.00      1.00         3
weighted avg       1.00      1.00      1.00         3



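Precision, recall, and F1-score provide a more reliable evaluation
than accuracy under class imbalance. With only three test samples,
however, the perfect scores above are weak evidence of real-world performance.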
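As a rough illustration of how the report figures follow from the confusion matrix, the sketch below recomputes the per-class metrics by hand. The helper `class_metrics` is hypothetical and not part of scikit-learn; the matrix is the one printed by `confusion_matrix` above.

```python
# Confusion matrix from the output above: rows = true class, columns = predicted.
cm = [[2, 0],
      [0, 1]]


def class_metrics(cm, k):
    """Precision, recall and F1 for class k (hypothetical helper)."""
    tp = cm[k][k]
    fp = sum(row[k] for row in cm) - tp   # predicted k, but another class
    fn = sum(cm[k]) - tp                  # true k, but predicted otherwise
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


for k in range(len(cm)):
    p, r, f = class_metrics(cm, k)
    print(f"class {k}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Because the off-diagonal entries are all zero, every metric evaluates to 1.00, in agreement with the report.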


In [9]:
from sklearn.tree import DecisionTreeClassifier

# Trees split on feature thresholds, so the unscaled features are used here.
tree = DecisionTreeClassifier(max_depth=4, random_state=42)
tree.fit(x_train, y_train)

tree_pred = tree.predict(x_test)


In [10]:
print("Logistic Regression")
print(classification_report(y_test, log_pred))

print("Decision Tree")
print(classification_report(y_test, tree_pred))


Logistic Regression
              precision    recall  f1-score   support

           0       1.00      1.00      1.00         2
           1       1.00      1.00      1.00         1

    accuracy                           1.00         3
   macro avg       1.00      1.00      1.00         3
weighted avg       1.00      1.00      1.00         3

Decision Tree
              precision    recall  f1-score   support

           0       1.00      1.00      1.00         2
           1       1.00      1.00      1.00         1

    accuracy                           1.00         3
   macro avg       1.00      1.00      1.00         3
weighted avg       1.00      1.00      1.00         3



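Both models classify all three test samples correctly, so this test set
cannot separate them. In general, logistic regression tends to give stable
performance, while a decision tree can capture more complex patterns
but risks overfitting.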


## Conclusion
A complete classification pipeline was built on a small, real-world-style dataset of 10 customers.

Model evaluation focused on precision, recall, and F1-score rather than accuracy.

Both models scored perfectly on the three-sample test set, so a larger dataset is needed to determine which one generalizes better.
