
## Introduction
The purpose of this task is to analyze an existing machine learning notebook that reports very high accuracy for a binary classification problem.
Despite the high reported accuracy, the model does not perform reliably in real-world scenarios.
This notebook reviews the original approach, identifies issues, and applies improved evaluation practices to obtain more realistic results.


## Review of Original Notebook
The original notebook trains a binary classification model and evaluates it primarily using accuracy.
Although the reported accuracy appears high, accuracy alone is insufficient to judge real-world performance.


In [2]:
import numpy as np
import pandas as pd

np.random.seed(42)

n_samples = 6000

y = np.zeros(n_samples)
y[:120] = 1
np.random.shuffle(y)

X = pd.DataFrame({
    "feature_1": np.random.normal(50, 10, n_samples),
    "feature_2": np.random.normal(30, 5, n_samples),
    "feature_3": np.random.normal(100, 20, n_samples),
    "feature_4": y   # intentional leakage from original notebook
})

df = X.copy()
df["target"] = y

df.head()


,feature_1,feature_2,feature_3,feature_4,target
0,23.509005,30.471488,112.233421,0.0,0.0
1,63.515029,20.536776,82.5087,0.0,0.0
2,59.117653,37.428296,77.176741,0.0,0.0
3,32.666161,25.325802,75.218541,0.0,0.0
4,29.351145,34.765768,82.418838,0.0,0.0
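
The original training code is not reproduced in this notebook, so the following is only a hypothetical reconstruction of what an accuracy-only evaluation on this data might look like. It assumes a plain (non-stratified) split and a `RandomForestClassifier`; the names `X_leak`, `y_leak`, and `leaky_model` are illustrative and not from the original. The point is that, with `feature_4` present, the model can read the label straight from its inputs.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Rebuild the synthetic data the same way as the cell above.
np.random.seed(42)
n_samples = 6000
y_leak = np.zeros(n_samples)
y_leak[:120] = 1
np.random.shuffle(y_leak)

X_leak = pd.DataFrame({
    "feature_1": np.random.normal(50, 10, n_samples),
    "feature_2": np.random.normal(30, 5, n_samples),
    "feature_3": np.random.normal(100, 20, n_samples),
    "feature_4": y_leak,  # leaked copy of the target
})

# Assumed original workflow: plain split, fit, score with accuracy only.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_leak, y_leak, test_size=0.2, random_state=42
)

leaky_model = RandomForestClassifier(n_estimators=100, random_state=42)
leaky_model.fit(X_tr, y_tr)

leaky_accuracy = accuracy_score(y_te, leaky_model.predict(X_te))
print(f"Accuracy with the leaked feature: {leaky_accuracy:.3f}")
```

An accuracy this close to perfect on a 2% positive class is itself a warning sign that a feature may encode the label.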


## Identified Issues
1. Accuracy used as the sole metric, which is misleading under class imbalance.
2. Potential class imbalance not analyzed.
3. Lack of stratified train-test split.
4. Possible data leakage due to preprocessing outside a pipeline.

Additionally, `feature_4` is an exact copy of the target variable, so the model could read the label directly from its inputs. This is severe data leakage and makes the originally reported accuracy unreliable.
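
One simple way to catch this kind of leak is to check how strongly each feature correlates with the target. The sketch below is a minimal example under stated assumptions: the helper `find_leaky_columns` and the 0.95 threshold are hypothetical choices for illustration, and the small `demo` frame only mimics the structure of `df`. A high correlation is a hint to investigate, not proof of leakage.

```python
import numpy as np
import pandas as pd

# Small frame with the same structure as the notebook's df
# (assumption: 1000 rows, only to keep the example quick).
rng = np.random.default_rng(0)
target = np.zeros(1000)
target[:20] = 1
rng.shuffle(target)

demo = pd.DataFrame({
    "feature_1": rng.normal(50, 10, 1000),
    "feature_2": rng.normal(30, 5, 1000),
    "feature_4": target,  # leaked copy of the target
    "target": target,
})


def find_leaky_columns(frame, target_col, threshold=0.95):
    """Return feature columns whose absolute Pearson correlation
    with the target exceeds `threshold` (an arbitrary cutoff)."""
    features = frame.drop(columns=[target_col])
    corr = features.corrwith(frame[target_col]).abs()
    return corr[corr > threshold].index.tolist()


leaky = find_leaky_columns(demo, "target")
print(leaky)
```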

In [4]:
# Remove leaked feature
X = df.drop(["target", "feature_4"], axis=1)
y = df["target"]
y.value_counts(normalize=True)


target,proportion
0.0,0.98
1.0,0.02


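The output shows a strong class imbalance: 98% of samples belong to class 0 and only 2% to class 1. A model that always predicts class 0 would therefore reach 98% accuracy without detecting a single positive case, which is why accuracy alone is misleading here.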
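
To make the effect of the imbalance concrete, a majority-class baseline can be scored on labels with the same 98% / 2% split. This is a minimal sketch: `y_demo` and `X_demo` are stand-ins built for the example, and scikit-learn's `DummyClassifier` is used as the baseline.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Labels with the same class proportions as the notebook's target.
y_demo = np.zeros(6000)
y_demo[:120] = 1
X_demo = np.zeros((6000, 1))  # features are ignored by this baseline

# Always predict the most frequent class (class 0).
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_demo, y_demo)
y_hat = baseline.predict(X_demo)

baseline_accuracy = accuracy_score(y_demo, y_hat)
baseline_recall = recall_score(y_demo, y_hat, pos_label=1)
print(f"accuracy={baseline_accuracy:.2f}, recall(class 1)={baseline_recall:.2f}")
```

Any real model should be compared against this baseline rather than against 0% or 50%.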

In [5]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42,
    stratify=y
)


Stratified splitting preserves the class proportions (98% / 2%) in both the training and test sets, so the rare positive class is represented in each.
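
The effect of `stratify` can be checked directly by comparing the positive rate in each split. The sketch below uses stand-in arrays (`X_demo`, `y_demo`) with the same size and class proportions as the notebook's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data: 6000 samples, 120 positives (2%).
y_demo = np.zeros(6000)
y_demo[:120] = 1
X_demo = np.arange(6000).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# With stratification both splits keep the 2% positive rate.
train_rate = y_tr.mean()
test_rate = y_te.mean()
print(f"train positive rate: {train_rate:.3f}")
print(f"test positive rate:  {test_rate:.3f}")
```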

In [6]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(
    n_estimators=100,
    random_state=42
)

model.fit(X_train, y_train)


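The model is fitted on the training split only. This notebook applies no preprocessing, but any scaling, imputation, or encoding step should be wrapped in a pipeline so that it is fitted on the training data alone, which prevents information from the test set leaking into the model.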
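
As an illustration of the pipeline idea, the sketch below wraps a preprocessing step and the classifier in a scikit-learn `Pipeline`. The `StandardScaler` step is a hypothetical addition (the notebook has no preprocessing, and a random forest does not need scaling), and the data is a synthetic stand-in. The final check shows that the scaler's statistics come from the training rows only.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with a 2% positive class.
rng = np.random.default_rng(42)
X_demo = rng.normal(50, 10, size=(1000, 3))
y_demo = np.zeros(1000)
y_demo[:20] = 1
rng.shuffle(y_demo)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Preprocessing and model are fitted together, on training data only.
pipe = Pipeline([
    ("scaler", StandardScaler()),  # hypothetical preprocessing step
    ("model", RandomForestClassifier(n_estimators=100, random_state=42)),
])
pipe.fit(X_tr, y_tr)

# The scaler never saw the test rows: its mean equals the training mean.
scaler_mean = pipe.named_steps["scaler"].mean_
print(np.allclose(scaler_mean, X_tr.mean(axis=0)))

y_pred_demo = pipe.predict(X_te)
```

Calling `pipe.predict(X_te)` reuses the statistics learned from the training data instead of refitting the scaler on the test set.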

In [7]:
from sklearn.metrics import classification_report, confusion_matrix

y_pred = model.predict(X_test)

print(classification_report(y_test, y_pred))
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))


              precision    recall  f1-score   support

         0.0       0.98      1.00      0.99      1176
         1.0       0.00      0.00      0.00        24

    accuracy                           0.98      1200
   macro avg       0.49      0.50      0.49      1200
weighted avg       0.96      0.98      0.97      1200

Confusion Matrix:
[[1176    0]
 [  24    0]]




## Why the Original Results Were Misleading
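The near-perfect accuracy in the original notebook came from the leaked feature. With the leak removed, accuracy is still 0.98, but only because 98% of the samples belong to class 0: the model predicts class 0 for all 1,200 test samples and detects none of the 24 positives (recall 0.00). Precision for class 1 is undefined when there are no positive predictions, so it is reported as 0.00.
This outcome is expected, because the remaining features are random noise generated independently of the target. Precision, recall, F1-score, and the confusion matrix expose this failure, while accuracy hides it.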
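
The gap between accuracy and per-class performance can be worked out by hand from the confusion matrix printed above. The sketch below hard-codes that matrix and derives the metrics; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix from the evaluation cell: rows = true, cols = predicted.
cm = np.array([[1176, 0],
               [24, 0]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # 1176 / 1200
recall_pos = tp / (tp + fn)       # 0 / 24
recall_neg = tn / (tn + fp)       # 1176 / 1176

# Balanced accuracy is the mean of the per-class recalls.
balanced_accuracy = (recall_pos + recall_neg) / 2

print(f"accuracy:          {accuracy:.2f}")
print(f"recall (class 1):  {recall_pos:.2f}")
print(f"balanced accuracy: {balanced_accuracy:.2f}")
```

A balanced accuracy of 0.50 matches the macro-average recall in the classification report and is what a model with no predictive signal would achieve.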


## Improvements Applied
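- Removal of the leaked feature (`feature_4`)
- Stratified train-test split
- Model fitted on training data only, with pipeline-based preprocessing recommended for any preprocessing steps
- Use of precision, recall, F1-score, and confusion matrix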


## Conclusion
This task demonstrates that high accuracy alone is not sufficient evidence of a good model.
Once the leaked feature was removed and class-aware metrics were applied, the model showed no ability to detect the positive class.

This exercise highlights the importance of questioning overly optimistic results and validating models with appropriate evaluation practices.