### **Load and Test-Train Split the Data**

In [1]:
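from sklearn.datasets import fetch_openml

# Download MNIST (70,000 images, 784 pixel features each) as NumPy arrays
mnist = fetch_openml('mnist_784', as_frame=False)
X, y = mnist.data, mnist.target

# The first 60,000 images are the training set, the last 10,000 the test set
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]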

### **Train a Binary Classifier**

In [2]:
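from sklearn.linear_model import SGDClassifier

# Binary targets: True for every 5, False for all other digits
# (the labels are strings, hence the comparison with '5')
y_train_5 = (y_train == '5')
y_test_5 = (y_test == '5')

sgd_clf = SGDClassifier(random_state=42)
sgd_clf.fit(X_train, y_train_5)

# Sanity check on the first image
some_digit = X[0]
sgd_clf.predict([some_digit])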

array([ True])

---
### Step 1: Evaluate Accuracy with Cross-Validation

* Use 3-fold cross-validation to get a more reliable estimate of the model's accuracy than a single evaluation would give.
* Every fold scores 95% or higher, which looks great, but Step 2 shows why that number is misleading.

In [5]:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(sgd_clf, X_train, y_train_5, cv=3, scoring="accuracy")
print("Cross-validation scores:", scores)

Cross-validation scores: [0.95035 0.96035 0.9604 ]
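
The three fold scores are usually summarized by their mean and spread. A minimal sketch using only the standard library, with the scores copied by hand from the output above (the variable names are illustrative, not part of the notebook):

```python
from statistics import mean, stdev

# Fold accuracies copied from the cross-validation output above.
fold_scores = [0.95035, 0.96035, 0.9604]

mean_accuracy = mean(fold_scores)
spread = stdev(fold_scores)  # sample standard deviation across the 3 folds

print(f"Mean accuracy: {mean_accuracy:.4f}")
print(f"Std deviation: {spread:.4f}")
```

Inside the notebook, `scores.mean()` on the array returned by `cross_val_score` gives the same mean directly.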


### Step 2: Test with a Baseline (Dummy Classifier)

* Now test a model that always predicts "not 5" (the most frequent class).
* Even though it never identifies a single 5, it still reaches ~91% accuracy, because only about 9% of the training images are 5s.
* Accuracy alone is therefore not a trustworthy metric on imbalanced data.

In [6]:
from sklearn.dummy import DummyClassifier

dummy_clf = DummyClassifier(strategy="most_frequent")
dummy_clf.fit(X_train, y_train_5)
dummy_scores = cross_val_score(dummy_clf, X_train, y_train_5, cv=3, scoring="accuracy")
print("Dummy classifier scores:", dummy_scores)

Dummy classifier scores: [0.90965 0.90965 0.90965]
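
The baseline score follows directly from the class balance: a model that always answers "not 5" is right on every non-5 and wrong on every 5. The sketch below makes that arithmetic explicit on a small made-up label list (not MNIST) with 9% positives; all names are hypothetical.

```python
# Hypothetical toy labels: 1000 samples, 90 of them positive ("is a 5").
labels = [True] * 90 + [False] * 910

# A baseline that always predicts the majority class ("not 5").
predictions = [False] * len(labels)

# Accuracy = fraction of predictions that match the true label.
correct = sum(p == t for p, t in zip(predictions, labels))
accuracy = correct / len(labels)

# Count how many actual positives the baseline managed to detect.
detected_fives = sum(p and t for p, t in zip(predictions, labels))

print(f"Accuracy: {accuracy:.2f}")
print(f"5s detected: {detected_fives} of {sum(labels)}")
```

The accuracy equals the share of negatives in the data, so it says nothing about whether the model can find the rare class.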


### Step 3 (Optional): Manual Cross-Validation with StratifiedKFold

In [7]:
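from sklearn.model_selection import StratifiedKFold
from sklearn.base import clone

# Stratified folds keep the share of 5s roughly equal in every fold.
# shuffle=True differs from cross_val_score's default (no shuffling),
# so these scores will not match Step 1 exactly.
skfolds = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

for train_idx, test_idx in skfolds.split(X_train, y_train_5):
    clone_clf = clone(sgd_clf)  # fresh, unfitted copy for each fold
    X_train_folds = X_train[train_idx]
    y_train_folds = y_train_5[train_idx]
    X_test_fold = X_train[test_idx]
    y_test_fold = y_train_5[test_idx]

    clone_clf.fit(X_train_folds, y_train_folds)
    y_pred = clone_clf.predict(X_test_fold)

    # Accuracy = fraction of correct predictions on the held-out fold
    accuracy = (y_pred == y_test_fold).mean()
    print("Manual fold accuracy:", accuracy)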

Manual fold accuracy: 0.9669
Manual fold accuracy: 0.91625
Manual fold accuracy: 0.96785
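
`StratifiedKFold` keeps the class ratio roughly the same in every fold, which matters when one class is rare. A small sketch on synthetic labels (10% positives, not MNIST; `X_toy` and `y_toy` are assumed names for illustration) that records the positive fraction of each held-out fold:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic, imbalanced labels: 30 positives out of 300 samples.
y_toy = np.array([True] * 30 + [False] * 270)
X_toy = np.zeros((len(y_toy), 1))  # feature values do not affect the split

skfolds = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

fold_positive_fractions = []
for train_idx, test_idx in skfolds.split(X_toy, y_toy):
    # Fraction of positives in this held-out fold.
    fold_positive_fractions.append(float(y_toy[test_idx].mean()))

print("Positive fraction per test fold:", fold_positive_fractions)
```

Each fold should show a positive fraction close to the overall 10%, which is the property the manual loop above relies on.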
