# AI Task 2 – Model Evaluation Analysis & Improvement

## Objective
The provided notebook reports very high accuracy for a binary classification model.
However, such performance may be misleading in real-world scenarios.

The goal of this task is to critically analyze the evaluation approach, identify
why the reported results may be unreliable, and improve the evaluation using
correct machine learning principles.


In [1]:
import numpy as np
import pandas as pd

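np.random.seed(42)  # fix the seed so the synthetic data is reproducible

n_samples = 6000

# Binary target with 120 positives (2%), shuffled into random positions
y = np.zeros(n_samples)
y[:120] = 1  # Highly imbalanced target
np.random.shuffle(y)

# Three Gaussian features drawn independently of y, so they carry no signal
X = pd.DataFrame({
    "feature_1": np.random.normal(50, 10, n_samples),
    "feature_2": np.random.normal(30, 5, n_samples),
    "feature_3": np.random.normal(100, 20, n_samples)
})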


## Target Distribution Analysis

Before training a model, it is important to understand the class distribution.
In this dataset, positive cases (class = 1) represent rare events.


In [2]:
pd.Series(y).value_counts()


class  count
0.0    5880
1.0     120


### Observation
The dataset is highly imbalanced: only 120 of the 6,000 samples (2%) are positive.
In such cases, accuracy alone can be misleading because a model that
predicts only the majority class still achieves 98% accuracy.
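
To make this concrete, here is a minimal sketch of a majority-class baseline. It assumes scikit-learn's `DummyClassifier` as the baseline (not part of the original notebook) and rebuilds the label vector so the cell runs on its own; the names `labels` and `placeholder_features` are illustrative only.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Same label distribution as the notebook: 120 positives out of 6000.
labels = np.zeros(6000)
labels[:120] = 1

# The baseline ignores its inputs, so a placeholder feature column is enough.
placeholder_features = np.zeros((labels.size, 1))

# Always predict the most frequent class seen during fit (here: class 0).
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(placeholder_features, labels)
baseline_pred = baseline.predict(placeholder_features)

baseline_accuracy = accuracy_score(labels, baseline_pred)
baseline_recall = recall_score(labels, baseline_pred)  # recall for class 1

print(f"Baseline accuracy: {baseline_accuracy:.2f}")
print(f"Baseline recall (class 1): {baseline_recall:.2f}")
```

Any trained model should be compared against this baseline: matching its accuracy says nothing about whether the model has learned to detect positives.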


In [3]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.25,
    random_state=42,
    stratify=y
)


A stratified split is used to ensure that both training and testing sets
preserve the original class distribution.
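
The effect of `stratify` can be checked directly by comparing positive rates in the two partitions. The sketch below is a self-contained check under the same split settings; `labels`, `row_ids`, `train_labels`, and `test_labels` are illustrative names, and `row_ids` merely stands in for the feature matrix.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Same label distribution as the notebook: 120 positives out of 6000.
labels = np.zeros(6000)
labels[:120] = 1

# Stand-in for the feature matrix; only the labels matter for this check.
row_ids = np.arange(labels.size).reshape(-1, 1)

_, _, train_labels, test_labels = train_test_split(
    row_ids, labels,
    test_size=0.25,
    random_state=42,
    stratify=labels,
)

train_rate = train_labels.mean()
test_rate = test_labels.mean()

print(f"Train: {int(train_labels.sum())} positives, rate {train_rate:.3f}")
print(f"Test:  {int(test_labels.sum())} positives, rate {test_rate:.3f}")
```

Both partitions keep the 2% positive rate. Without `stratify`, the number of positives in the test set would vary from split to split, which makes metrics on a rare class even noisier.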


In [4]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(
    n_estimators=100,
    random_state=42
)

model.fit(X_train, y_train)


## Initial Evaluation Using Accuracy

Accuracy is calculated to match the original notebook's evaluation approach.
However, this metric alone is insufficient for imbalanced datasets.


In [5]:
from sklearn.metrics import accuracy_score

y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print("Accuracy:", accuracy)


Accuracy: 0.98


### Why Accuracy Is Misleading Here

Because positive cases are rare (30 of the 1,500 test samples), a model can
reach 98% accuracy by predicting the majority class every time while
detecting none of the actual positive cases. The confusion matrix below
shows that this is what happens here.


In [6]:
from sklearn.metrics import confusion_matrix

confusion_matrix(y_test, y_pred)


array([[1470,    0],
       [  30,    0]])
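
In scikit-learn's convention, rows are true classes and columns are predicted classes, so the second column being all zeros means the model never predicted class 1. The following sketch unpacks the matrix printed above by hand; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix copied from the output above
# (rows = true class, columns = predicted class).
cm = np.array([[1470, 0],
               [30, 0]])

# For a binary problem, ravel() yields the cells in this order.
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
recall = tp / (tp + fn)  # share of actual positives that were detected

# Precision is undefined when nothing is predicted positive.
predicted_positives = tp + fp
precision = tp / predicted_positives if predicted_positives > 0 else float("nan")

print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
print(f"accuracy={accuracy:.2f}, recall={recall:.2f}, precision={precision}")
```

The 0.98 accuracy comes entirely from the 1,470 true negatives; all 30 positives are false negatives.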

## Detailed Evaluation Metrics

To properly assess the model, precision, recall, and F1-score are required.
These metrics provide better insight into how well the model identifies
rare positive cases.
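
For reference, these are the standard definitions for the positive class, written in terms of true positives (TP), false positives (FP), and false negatives (FN):

```latex
\text{Precision} = \frac{TP}{TP + FP}, \qquad
\text{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
```

None of these formulas counts true negatives, so a large majority class cannot inflate them the way it inflates accuracy.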


In [7]:
from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred))


              precision    recall  f1-score   support

         0.0       0.98      1.00      0.99      1470
         1.0       0.00      0.00      0.00        30

    accuracy                           0.98      1500
   macro avg       0.49      0.50      0.49      1500
weighted avg       0.96      0.98      0.97      1500





### Key Findings

- Precision, recall, and F1-score for the positive class are all 0.00, even though accuracy is 0.98.
- The model never predicts class 1, so it misses all 30 positive cases in the test set.
- In real-world use, such a model would detect no rare events at all, which the accuracy score hides.
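
One way to act on these findings is to score the model's predicted probabilities rather than only its hard labels. The sketch below repeats the notebook's setup so it runs on its own and reports balanced accuracy, ROC AUC, and average precision. The choice of these three metrics is an assumption about what a follow-up evaluation might use, not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    average_precision_score,
    balanced_accuracy_score,
    roc_auc_score,
)
from sklearn.model_selection import train_test_split

# Recreate the notebook's data and split so this cell is self-contained.
np.random.seed(42)
n_samples = 6000
y = np.zeros(n_samples)
y[:120] = 1
np.random.shuffle(y)
X = pd.DataFrame({
    "feature_1": np.random.normal(50, 10, n_samples),
    "feature_2": np.random.normal(30, 5, n_samples),
    "feature_3": np.random.normal(100, 20, n_samples),
})
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Column 1 of predict_proba is the positive class (order follows model.classes_).
positive_scores = model.predict_proba(X_test)[:, 1]

metrics = {
    # Mean of per-class recall; 0.5 for a model that predicts a single class.
    "balanced_accuracy": balanced_accuracy_score(y_test, model.predict(X_test)),
    # Ranking quality of the scores, independent of any threshold.
    "roc_auc": roc_auc_score(y_test, positive_scores),
    # Summary of the precision-recall curve; compare against the positive rate.
    "average_precision": average_precision_score(y_test, positive_scores),
    "positive_rate": y_test.mean(),
}

for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Because the features in this notebook are generated independently of the target, values near chance level (balanced accuracy and ROC AUC around 0.5, average precision close to the 2% positive rate) would be the expected outcome here rather than a sign of a bug.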


## Conclusion

The original model evaluation was misleading due to:
- Heavy reliance on accuracy
- Ignoring class imbalance
- Lack of detailed evaluation metrics

By introducing proper evaluation techniques such as confusion matrices
and class-wise metrics, the model’s real-world limitations become clear.

This task demonstrates the importance of correct evaluation practices
over achieving artificially high performance scores.
