## Problem Statement
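The goal is to evaluate classification models (logistic regression and
k-nearest neighbors) on a customer churn dataset using cross-validation,
and to compare the resulting F1 estimates with a single train–test split.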


In [2]:
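import pandas as pd

# Load the churn data; the target column is named "churn".
dataset = pd.read_csv("customer_churn.csv")

# Select features by name so the target cannot end up in x,
# regardless of where "churn" sits in the file.
x = dataset.drop(columns=["churn"])
y = dataset["churn"]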

Cross-validation trains and scores the model on several different splits of the
data, so the performance estimate depends less on one particular random split.
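To make the idea concrete, here is a minimal sketch of what a 3-fold stratified cross-validation loop does by hand. It runs on synthetic data from `make_classification` as a stand-in for the churn file, and the names `X_demo`, `y_demo`, and `fold_scores` are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for the churn data (assumption, not the real file).
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# Each sample lands in the test fold exactly once across the 3 iterations.
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
fold_scores = []
for train_idx, test_idx in skf.split(X_demo, y_demo):
    clf = LogisticRegression()
    clf.fit(X_demo[train_idx], y_demo[train_idx])
    y_hat = clf.predict(X_demo[test_idx])
    fold_scores.append(f1_score(y_demo[test_idx], y_hat))

fold_scores = np.array(fold_scores)
print(fold_scores, fold_scores.mean())
```

`cross_val_score` wraps essentially this loop: one fit and one score per fold, returned as an array.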


In [3]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
x_scaled = scaler.fit_transform(x)


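Scaling is applied because distance-based models (such as k-nearest neighbors)
and regularized linear models (such as scikit-learn's default logistic regression)
are sensitive to feature magnitude. Note that fitting the scaler on the full
dataset before cross-validation lets the held-out folds influence the scaling
statistics, a mild form of data leakage; fitting the scaler inside each
training fold avoids this.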
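One common way to keep scaling inside each training fold is to wrap the scaler and the model in a scikit-learn pipeline and cross-validate the pipeline on the unscaled features. The sketch below assumes synthetic data in place of the churn file; `pipe` and `pipe_scores` are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the churn data (assumption, not the real file).
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# The scaler is re-fit on the training part of every fold,
# so the held-out fold never contributes to the mean and std.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe_scores = cross_val_score(pipe, X_demo, y_demo, cv=3, scoring="f1")

print(pipe_scores.mean(), pipe_scores.std())
```

With this pattern the raw feature matrix is passed to `cross_val_score`, and no separately scaled copy is needed.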


In [6]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

model = LogisticRegression()

cv_scores = cross_val_score(
    model,
    x_scaled,
    y,
    cv=3,
    scoring="f1"
)

cv_scores

array([0.8, 1. , 1. ])

In [7]:
cv_scores.mean(), cv_scores.std()

(np.float64(0.9333333333333332), np.float64(0.09428090415820632))

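Cross-validation provides an average performance estimate together with
a measure of its spread across folds. Here the mean F1 is about 0.93 with a
standard deviation of about 0.09 over three folds, which is more informative
than the single number a train–test split produces.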
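With only three folds, the standard deviation is itself a rough estimate. One option, sketched here under the assumption of synthetic data, is to repeat the stratified split several times with `RepeatedStratifiedKFold` and summarize all the scores; `rskf` and `repeated_scores` are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Synthetic stand-in for the churn data (assumption, not the real file).
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# 3 folds repeated 10 times with different shuffles gives 30 scores.
rskf = RepeatedStratifiedKFold(n_splits=3, n_repeats=10, random_state=0)
repeated_scores = cross_val_score(
    LogisticRegression(), X_demo, y_demo, cv=rskf, scoring="f1"
)

print(
    f"F1 = {repeated_scores.mean():.3f} +/- {repeated_scores.std():.3f} "
    f"over {len(repeated_scores)} folds"
)
```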

In [9]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

x_train, x_test, y_train, y_test = train_test_split(
    x_scaled,
    y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

model.fit(x_train, y_train)
y_pred = model.predict(x_test)

f1_score(y_test, y_pred)


1.0

Here the single split gives an F1 of 1.0, while the cross-validated mean is
about 0.93. A single train–test split reports one number and gives no
indication of how much that number would change under a different split.
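A simple way to see this instability is to repeat the single split with different seeds and look at the range of scores. The following is a sketch on synthetic data, not the churn file; `split_scores` and the choice of ten seeds are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the churn data (assumption, not the real file).
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# Same model, same data, same test size; only the split seed changes.
split_scores = []
for seed in range(10):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=0.2, random_state=seed, stratify=y_demo
    )
    clf = LogisticRegression().fit(X_tr, y_tr)
    split_scores.append(f1_score(y_te, clf.predict(X_te)))

print(min(split_scores), max(split_scores))
```

The gap between the minimum and maximum shows how much a reported score can depend on the seed alone.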


In [11]:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5)

knn_cv = cross_val_score(
    knn,
    x_scaled,
    y,
    cv=3,
    scoring="f1"
)

knn_cv.mean()


np.float64(0.6666666666666666)

## Conclusion
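Cross-validation provides a more reliable estimate of model performance
by reducing dependence on a single data split.

It matters most for small datasets, where a single test set holds few samples,
and for model comparison: under the same 3-fold setup, logistic regression
reaches a mean F1 of about 0.93 versus about 0.67 for k-nearest neighbors.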
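For model comparison, it helps to score every candidate on identical folds so that differences reflect the models rather than the splits. The sketch below assumes synthetic data and uses illustrative names (`folds`, `candidates`, `results`); it is one possible setup, not the notebook's original procedure.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the churn data (assumption, not the real file).
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# A fixed random_state makes the splitter yield the same folds for every model.
folds = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

candidates = {
    "logreg": make_pipeline(StandardScaler(), LogisticRegression()),
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
}

results = {
    name: cross_val_score(est, X_demo, y_demo, cv=folds, scoring="f1")
    for name, est in candidates.items()
}

for name, scores in results.items():
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```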
