https://www.kaggle.com/competitions/breast-cancer-competition/overview

data: https://www.kaggle.com/datasets/nancyalaswad90/breast-cancer-dataset

In [None]:
The data for this challenge is here:
/home/john/ai/kaggle2/data/binary-classification/breast-cancer/data.csv

I would like to prepare the data to train models.
The target feature is diagnosis, which is either M for malignant or B for benign. This will need to be encoded.

I want to use 80% of it to train the model and the remaining 20% to test the model.

I would like to be careful about overfitting.

I would like you to select 6 plausible models for this.

For each model I'd like to grid search over every combination of the hyperparameter values.
I would like you to include a progress bar so that I can monitor the progress of the training jobs.

When finished, select the model and hyperparameter set that resulted in the most accurate prediction and let me know about that.
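
The preparation steps requested above (encode the M/B target, hold out 20%, guard against overfitting) might be sketched as follows. This is a minimal illustration on a small made-up DataFrame, not the competition data. The column names `id`, `radius_mean`, and `texture_mean` are assumptions about the CSV layout, and `stratify=y` is an addition the request does not mention, used here so that both splits keep the same malignant/benign ratio.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for data.csv: 40 malignant and 60 benign rows.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "id": np.arange(100),                        # assumed identifier column
    "diagnosis": ["M"] * 40 + ["B"] * 60,
    "radius_mean": rng.normal(14.0, 3.5, 100),   # assumed feature name
    "texture_mean": rng.normal(19.0, 4.0, 100),  # assumed feature name
})

# Encode the target explicitly: M -> 1, B -> 0. Unexpected labels become NaN,
# so the check below fails loudly instead of silently counting them as benign.
y = df["diagnosis"].map({"M": 1, "B": 0})
assert y.notna().all(), "unexpected diagnosis label"
y = y.astype(int)

# Drop the target and any identifier column so neither is used as a feature.
X = df.drop(columns=["diagnosis", "id"], errors="ignore")

# 80/20 split; stratify=y keeps the class ratio the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(X_train.shape, X_test.shape)
print(f"malignant share: train={y_train.mean():.2f}, test={y_test.mean():.2f}")
```

Mapping with an explicit dictionary is a design choice: a typo such as a lowercase `m` in the label column would surface as an error instead of being encoded as benign.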


In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
import xgboost as xgb
from sklearn.metrics import accuracy_score
from tqdm import tqdm

# Load the data
data_path = "/home/john/ai/kaggle2/data/binary-classification/breast-cancer/data.csv"
df = pd.read_csv(data_path)

# Encode the target variable (diagnosis: M -> 1, anything else -> 0)
df['diagnosis'] = (df['diagnosis'] == 'M').astype(int)

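# Separate features (X) and target (y). Every remaining column is kept as a
# feature, including any identifier column the CSV may contain.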
X = df.drop(columns=['diagnosis'])
y = df['diagnosis']

# Split the data into train (80%) and test (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize the features (important for scale-sensitive models such as SVC and MLPClassifier)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Define the models and their hyperparameters for grid search
models = {
    "LogisticRegression": (LogisticRegression(), {
        'C': [0.1, 1, 10],
        'solver': ['lbfgs', 'liblinear']
    }),
    "RandomForestClassifier": (RandomForestClassifier(), {
        'n_estimators': [10, 50, 100],
        'max_depth': [None, 10, 20],
        'min_samples_split': [2, 5, 10]
    }),
    "GradientBoostingClassifier": (GradientBoostingClassifier(), {
        'n_estimators': [50, 100],
        'learning_rate': [0.01, 0.1],
        'max_depth': [3, 5, 7]
    }),
    "SVC": (SVC(), {
        'C': [0.1, 1, 10],
        'kernel': ['linear', 'rbf']
    }),
    "XGBoostClassifier": (xgb.XGBClassifier(), {
        'n_estimators': [50, 100],
        'max_depth': [3, 5, 7],
        'learning_rate': [0.01, 0.1]
    }),
    "MLPClassifier": (MLPClassifier(), {
        'hidden_layer_sizes': [(50,), (100,)],
        'activation': ['relu', 'tanh'],
        'solver': ['adam', 'sgd']
    })
}

best_model = None
best_params = None
best_accuracy = 0

# Iterate over models and perform grid search
for model_name, (model, params) in tqdm(models.items()):
    print(f"Running Grid Search for {model_name}...")
    grid_search = GridSearchCV(estimator=model, param_grid=params, cv=5, verbose=2, n_jobs=-1)
    grid_search.fit(X_train_scaled, y_train)
    
    # Evaluate on the test set
    y_pred = grid_search.predict(X_test_scaled)
    accuracy = accuracy_score(y_test, y_pred)
    
    print(f"{model_name} Accuracy: {accuracy}")
    
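    # Keep track of the best model. Note: selecting on test-set accuracy makes the
    # winning score optimistic; grid_search.best_score_ is the cross-validated alternative.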
    if accuracy > best_accuracy:
        best_accuracy = accuracy
        best_model = model_name
        best_params = grid_search.best_params_

# Output the best model and its parameters
print(f"The best model is {best_model} with parameters: {best_params} and accuracy: {best_accuracy}")


  0%|                                                                                                                                | 0/6 [00:00<?, ?it/s]

Running Grid Search for LogisticRegression...
Fitting 5 folds for each of 6 candidates, totalling 30 fits


 17%|████████████████████                                                                                                    | 1/6 [00:01<00:09,  1.82s/it]

LogisticRegression Accuracy: 0.9912280701754386
Running Grid Search for RandomForestClassifier...
Fitting 5 folds for each of 27 candidates, totalling 135 fits


 33%|████████████████████████████████████████                                                                                | 2/6 [00:03<00:06,  1.59s/it]

RandomForestClassifier Accuracy: 0.9649122807017544
Running Grid Search for GradientBoostingClassifier...
Fitting 5 folds for each of 12 candidates, totalling 60 fits


 50%|████████████████████████████████████████████████████████████                                                            | 3/6 [00:05<00:06,  2.11s/it]

GradientBoostingClassifier Accuracy: 0.956140350877193
Running Grid Search for SVC...
Fitting 5 folds for each of 6 candidates, totalling 30 fits
SVC Accuracy: 0.9736842105263158
Running Grid Search for XGBoostClassifier...
Fitting 5 folds for each of 12 candidates, totalling 60 fits


 83%|████████████████████████████████████████████████████████████████████████████████████████████████████                    | 5/6 [00:07<00:01,  1.43s/it]

XGBoostClassifier Accuracy: 0.956140350877193
Running Grid Search for MLPClassifier...
Fitting 5 folds for each of 8 candidates, totalling 40 fits


100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:09<00:00,  1.63s/it]

MLPClassifier Accuracy: 0.9736842105263158
The best model is LogisticRegression with parameters: {'C': 0.1, 'solver': 'liblinear'} and accuracy: 0.9912280701754386
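
The candidate counts in the log above follow directly from the grid sizes: `GridSearchCV` tries every combination of the listed values, and with `cv=5` each combination is fitted five times. A small sketch using scikit-learn's `ParameterGrid` on the random forest grid from the cell above reproduces the "27 candidates, totalling 135 fits" line. Note that this enumerates only the values listed in the grid, not every possible hyperparameter setting.

```python
from sklearn.model_selection import ParameterGrid

# Random forest grid copied from the cell above.
rf_grid = {
    "n_estimators": [10, 50, 100],
    "max_depth": [None, 10, 20],
    "min_samples_split": [2, 5, 10],
}

n_candidates = len(ParameterGrid(rf_grid))  # 3 * 3 * 3 combinations
n_folds = 5                                 # cv=5 in the GridSearchCV call
n_fits = n_candidates * n_folds

print(f"{n_candidates} candidates x {n_folds} folds = {n_fits} fits")
```

The final refit of the best candidate on the full training set (the `refit=True` default) is an additional fit that this total does not include.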





The best model is LogisticRegression with parameters: {'C': 0.1, 'solver': 'liblinear'} and accuracy: 0.9912280701754386. Run a dedicated training session for that model with those parameters.
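
One caveat about how this winner was chosen: the loop picked it by test-set accuracy, so the 0.991 figure was also used for model selection and may be optimistic. A more conservative variant, sketched below, compares models by their cross-validated score (`best_score_`) and touches the test set only once at the end. Putting the scaler inside a `Pipeline` also means it is refitted on each training fold. The sketch uses scikit-learn's bundled `load_breast_cancer` data as a stand-in for `data.csv` (an assumption; its target is coded 0 = malignant, 1 = benign, the reverse of the encoding above) and only two of the six models to keep it short.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in data (assumption): scikit-learn's bundled breast cancer dataset.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Parameter names are prefixed with the pipeline step name ("clf__").
candidates = {
    "LogisticRegression": (
        LogisticRegression(max_iter=1000),
        {"clf__C": [0.1, 1, 10], "clf__solver": ["lbfgs", "liblinear"]},
    ),
    "SVC": (
        SVC(),
        {"clf__C": [0.1, 1, 10], "clf__kernel": ["linear", "rbf"]},
    ),
}

best_name, best_search = None, None
for name, (estimator, grid) in candidates.items():
    # The scaler lives inside the pipeline, so each CV fold fits its own scaler.
    pipe = Pipeline([("scale", StandardScaler()), ("clf", estimator)])
    search = GridSearchCV(pipe, grid, cv=5, scoring="accuracy")
    search.fit(X_train, y_train)
    print(f"{name}: CV accuracy {search.best_score_:.4f} with {search.best_params_}")
    # Select on the cross-validated score, never on the test set.
    if best_search is None or search.best_score_ > best_search.best_score_:
        best_name, best_search = name, search

# The test set is used once, for the already-chosen model.
test_accuracy = best_search.score(X_test, y_test)
print(f"Selected {best_name}; held-out test accuracy {test_accuracy:.4f}")
```

Because `GridSearchCV` refits the best pipeline on the whole training split by default, `best_search` can be used for prediction directly, without a separate training cell.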


In [3]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Create the Logistic Regression model with the best parameters
log_reg = LogisticRegression(C=0.1, solver='liblinear')

# Train the model on the scaled training data
log_reg.fit(X_train_scaled, y_train)

# Predict on the test data
y_pred = log_reg.predict(X_test_scaled)

# Evaluate the accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Logistic Regression Accuracy: {accuracy}")


Logistic Regression Accuracy: 0.9912280701754386
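
The accuracy of 0.9912280701754386 equals 113/114, which suggests a test set of 114 rows with a single misclassification (an inference from the printed value, not something the notebook states). With so few rows, a point estimate hides considerable uncertainty. As a rough sketch, a 95% Wilson score interval for that proportion can be computed with the standard library alone; the helper name `wilson_interval` is made up for this illustration.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half


# 113 correct out of 114 is inferred from the printed accuracy (assumption).
low, high = wilson_interval(113, 114)
print(f"accuracy = {113 / 114:.4f}, 95% interval = [{low:.3f}, {high:.3f}]")
```

The interval spans several percentage points. For comparison, the SVC and MLP scores of 0.9737 correspond to 111/114, so they differ from the winner by only two test rows.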
