Competition:
https://www.kaggle.com/competitions/digit-recognizer/overview

Data:
https://www.kaggle.com/competitions/digit-recognizer/data

initial prompt:



The goal of this competition is to choose the correct digit 0,1,2,3,4,5,6,7,8, or 9 from a string of pixel values.

Each pixel column in the training set has a name like pixelx, where x is an integer between 0 and 783, inclusive. To locate this pixel on the image, suppose that we have decomposed x as x = i * 28 + j, where i and j are integers between 0 and 27, inclusive. Then pixelx is located on row i and column j of a 28 x 28 matrix, (indexing by zero).
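
The decomposition x = i * 28 + j described above can be sketched in a few lines of Python. The function names here are hypothetical, chosen only for illustration; the mapping itself follows directly from the formula.

```python
WIDTH = 28  # images are 28 x 28, so each row holds 28 pixels


def pixel_position(x: int) -> tuple[int, int]:
    """Return (row, column) of column `pixelx` in the 28 x 28 image, zero-indexed."""
    if not 0 <= x <= 783:
        raise ValueError("pixel index must be between 0 and 783")
    return divmod(x, WIDTH)  # x = i * 28 + j  ->  (i, j)


def pixel_index(i: int, j: int) -> int:
    """Inverse mapping: row i, column j back to the flat index x."""
    return i * WIDTH + j


print(pixel_position(0))    # (0, 0)
print(pixel_position(29))   # (1, 1)
print(pixel_position(783))  # (27, 27)
```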


The data for this challenge is in:
/home/john/ai/kaggle2/data/multi-classification/digit-recognition

The training dataset is:
/home/john/ai/kaggle2/data/multi-classification/digit-recognition/train.csv
The label column is the target feature, which can be 0,1,2,3,4,5,6,7,8, or 9.

The test dataset is:
/home/john/ai/kaggle2/data/multi-classification/digit-recognition/test.csv
It has no label column; the label is the value to predict for each row.

There is a file that should be used as a guide for creating the submission.csv:
/home/john/ai/kaggle2/data/multi-classification/digit-recognition/sample-submission.csv
The format should be:
ImageId,Label
1,0
2,0
3,0
etc.
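
As a rough sketch of that layout, the snippet below writes a header row followed by one row per prediction, with ImageId counting from 1. The helper name write_submission and the three placeholder predictions are assumptions for illustration; it uses only the standard csv module rather than pandas.

```python
import csv
import io


def write_submission(labels, out):
    """Write predicted labels as ImageId,Label rows; ImageId starts at 1."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["ImageId", "Label"])
    for image_id, label in enumerate(labels, start=1):
        writer.writerow([image_id, int(label)])


# Placeholder predictions, only to show the layout (not real model output).
buffer = io.StringIO()
write_submission([0, 0, 0], buffer)
print(buffer.getvalue())
```

Passing an open file handle instead of the StringIO buffer would write the same rows to disk.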

The models I would like to use are:
Simple Neural Net
SVM
K-nearest neighbor
other appropriate models

For each model, run a hyperparameter grid search over all combinations of that model's hyperparameter values.
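
To make the size of such a search concrete: a grid search tries every combination (the Cartesian product) of the listed values, and each combination is fitted once per cross-validation fold. The sketch below enumerates the SVM grid used in the notebook cell in this file; expand_grid is a hypothetical helper, not a scikit-learn function.

```python
from itertools import product


def expand_grid(grid):
    """Return every combination of the listed values as a list of dicts."""
    names = list(grid)
    return [dict(zip(names, values)) for values in product(*(grid[n] for n in names))]


# SVM grid as written in the notebook cell; 3-fold cross-validation as in that cell.
svm_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf", "poly"]}
candidates = expand_grid(svm_grid)
folds = 3
print(len(candidates), "candidates,", len(candidates) * folds, "fits")
```

This gives 9 candidates and 27 fits, which matches the corresponding "Fitting 3 folds" line in the output.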

Important: after each of the four models is trained, pick the one with the highest score on held-out data and create a submission.csv in the prescribed format. I need to know which model that was and what its hyperparameters were, for the record.
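
One possible way to keep that record is to store each model's score and parameters under its name, then pick the maximum. This is a minimal sketch: pick_best is a hypothetical helper, and the model names, scores, and parameters in the example are placeholders rather than measured results.

```python
import json


def pick_best(results):
    """results maps a model name to {"score": float, "params": dict}."""
    name = max(results, key=lambda n: results[n]["score"])
    return {"model": name, **results[name]}


# Hypothetical scores and parameters, for illustration only (not measured results).
example_results = {
    "model_a": {"score": 0.90, "params": {"alpha": 1}},
    "model_b": {"score": 0.95, "params": {"alpha": 10}},
}
record = pick_best(example_results)
print(json.dumps(record))
```

Keeping the model name in the record avoids having to infer it later from the printed estimator.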

I would also like some way to monitor training progress while it runs.
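
With scikit-learn, the verbose parameter of GridSearchCV prints a line per fit. If the search were instead run as a manual loop, progress could be reported along these lines. This is only a sketch: run_with_progress and dummy_evaluate are hypothetical, and the stand-in evaluator does not train anything.

```python
import time


def run_with_progress(candidates, evaluate):
    """Evaluate each candidate in turn, printing a progress line after each one."""
    start = time.perf_counter()
    scores = []
    total = len(candidates)
    for done, params in enumerate(candidates, start=1):
        score = evaluate(params)
        scores.append((score, params))
        elapsed = time.perf_counter() - start
        print(f"[{done}/{total}] {params} score={score:.3f} elapsed={elapsed:.1f}s")
    return scores


# Stand-in evaluator (hypothetical): a real one would fit a model
# and return its validation accuracy.
def dummy_evaluate(params):
    return 1.0 / (1.0 + params["C"])


progress_scores = run_with_progress([{"C": 0.1}, {"C": 1}, {"C": 10}], dummy_evaluate)
```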

In [3]:
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

# Load data
train = pd.read_csv('/home/john/ai/kaggle2/data/multi-classification/digit-recognition/train.csv')
test = pd.read_csv('/home/john/ai/kaggle2/data/multi-classification/digit-recognition/test.csv')

# Split features and labels
X = train.drop(columns=['label']).values.reshape(-1, 28, 28)
y = train['label']
X_test = test.values.reshape(-1, 28, 28)

# Train/validation split (test.csv has no labels, so 20% of train is held out)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Define models and hyperparameters
models = {
    'NN': (MLPClassifier(), {
        'hidden_layer_sizes': [(64,), (128, 64)],
        'activation': ['relu', 'tanh'],
        'solver': ['adam', 'sgd'],
        'max_iter': [100, 200]
    }),
    'SVM': (SVC(), {
        'C': [0.1, 1, 10],
        'kernel': ['linear', 'rbf', 'poly']
    }),
    'KNN': (KNeighborsClassifier(), {
        'n_neighbors': [3, 5, 7],
        'weights': ['uniform', 'distance'],
        'algorithm': ['auto', 'ball_tree']
    })
}

# Grid search for best hyperparameters and model
best_model = None
best_score = 0
best_params = {}

for model_name, (model, params) in models.items():
    grid = GridSearchCV(model, params, cv=3, verbose=3, n_jobs=2)
    grid.fit(X_train.reshape(-1, 784), y_train)  # Flatten 28x28 images to 1D vectors
    score = grid.score(X_val.reshape(-1, 784), y_val)
    if score > best_score:
        best_model = grid.best_estimator_
        best_score = score
        best_params = grid.best_params_

# Progress is reported by the verbose=3 setting on GridSearchCV (one line per completed fit)

# Create submission
y_test_pred = best_model.predict(X_test.reshape(-1, 784))
submission = pd.DataFrame({'ImageId': range(1, len(y_test_pred) + 1), 'Label': y_test_pred})
submission.to_csv('/home/john/ai/kaggle2/data/multi-classification/digit-recognition/submission.csv', index=False)

# Log the best model and hyperparameters
print(f"Best model: {best_model}")
print(f"Best hyperparameters: {best_params}")


Fitting 3 folds for each of 16 candidates, totalling 48 fits
Fitting 3 folds for each of 9 candidates, totalling 27 fits
Fitting 3 folds for each of 12 candidates, totalling 36 fits
Best model: SVC(C=10)
Best hyperparameters: {'C': 10, 'kernel': 'rbf'}
[CV 1/3] END activation=relu, hidden_layer_sizes=(64,), max_iter=100, solver=adam;, score=0.941 total time=  45.8s
[CV 1/3] END activation=relu, hidden_layer_sizes=(64,), max_iter=100, solver=sgd;, score=0.830 total time=  39.8s
[CV 3/3] END activation=relu, hidden_layer_sizes=(64,), max_iter=100, solver=sgd;, score=0.901 total time=  32.4s
[CV 2/3] END activation=relu, hidden_layer_sizes=(64,), max_iter=200, solver=adam;, score=0.941 total time=  33.4s
[CV 1/3] END activation=relu, hidden_layer_sizes=(64,), max_iter=200, solver=sgd;, score=0.761 total time=  54.3s
[CV 3/3] END activation=relu, hidden_layer_sizes=(64,), max_iter=200, solver=sgd;, score=0.887 total time= 1.1min
[CV 1/3] END activation=relu, hidden_layer_sizes=(128, 64), m



[CV 2/3] END activation=relu, hidden_layer_sizes=(64,), max_iter=100, solver=adam;, score=0.931 total time=  18.8s
[CV 3/3] END activation=relu, hidden_layer_sizes=(64,), max_iter=100, solver=adam;, score=0.938 total time=  43.5s
[CV 2/3] END activation=relu, hidden_layer_sizes=(64,), max_iter=100, solver=sgd;, score=0.843 total time=  36.6s
[CV 1/3] END activation=relu, hidden_layer_sizes=(64,), max_iter=200, solver=adam;, score=0.938 total time=  40.1s
[CV 3/3] END activation=relu, hidden_layer_sizes=(64,), max_iter=200, solver=adam;, score=0.947 total time=  29.6s
[CV 2/3] END activation=relu, hidden_layer_sizes=(64,), max_iter=200, solver=sgd;, score=0.839 total time=  44.0s
[CV 1/3] END activation=relu, hidden_layer_sizes=(128, 64), max_iter=100, solver=adam;, score=0.945 total time=  19.8s
[CV 2/3] END activation=relu, hidden_layer_sizes=(128, 64), max_iter=100, solver=adam;, score=0.950 total time=  34.3s
[CV 3/3] END activation=relu, hidden_layer_sizes=(128, 64), max_iter=100, 



Best model: SVC(C=10)
Best hyperparameters: {'C': 10, 'kernel': 'rbf'}

The submission file scored 0.97921.

prompt:
Train the SVC model with the following hyperparameters: {'C': 10, 'kernel': 'rbf'}.
Create the submission.csv when done.

In [4]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Load data
train = pd.read_csv('/home/john/ai/kaggle2/data/multi-classification/digit-recognition/train.csv')
test = pd.read_csv('/home/john/ai/kaggle2/data/multi-classification/digit-recognition/test.csv')

# Split features and labels
X = train.drop(columns=['label']).values
y = train['label']
X_test = test.values

# Train/validation split (test.csv has no labels, so 20% of train is held out)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Rows are already flat 784-pixel vectors, so these reshapes leave the data unchanged
X_train_flat = X_train.reshape(-1, 784)
X_val_flat = X_val.reshape(-1, 784)
X_test_flat = X_test.reshape(-1, 784)

# Train SVC model with specified hyperparameters
svc_model = SVC(C=10, kernel='rbf')
svc_model.fit(X_train_flat, y_train)

# Evaluate the model
y_val_pred = svc_model.predict(X_val_flat)
val_accuracy = accuracy_score(y_val, y_val_pred)
print(f"Validation Accuracy: {val_accuracy}")

# Predict on the test set
y_test_pred = svc_model.predict(X_test_flat)

# Create submission file
submission = pd.DataFrame({'ImageId': range(1, len(y_test_pred) + 1), 'Label': y_test_pred})
submission.to_csv('/home/john/ai/kaggle2/data/multi-classification/digit-recognition/submission.csv', index=False)

print("Submission file created successfully.")


Validation Accuracy: 0.9798809523809524
Submission file created successfully.


Submission score: 0.97921