# Tumor Classification Challenge - Voting Classifier Approach
## Introduction
In this notebook, we classify tumor diagnoses as either malignant (M) or benign (B) using a voting classifier ensemble that combines SVM, Logistic Regression, and Random Forest. We tune hyperparameters with GridSearchCV, evaluate the model on a validation set, and then generate predictions for the test set.

In [9]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/tumor-classification-challenge/sample_submissions.csv
/kaggle/input/tumor-classification-challenge/train_data.csv
/kaggle/input/tumor-classification-challenge/test_data.csv


## 1. Importing Libraries
First, we import the libraries needed for data processing, machine learning model development, and evaluation.

In [10]:
# Basic libraries
import pandas as pd
import numpy as np

# Model and evaluation tools
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.svm import SVC
from sklearn.ensemble import VotingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import f1_score, accuracy_score
from imblearn.over_sampling import SMOTE
from sklearn.pipeline import Pipeline

# For saving the model
import joblib

## 2. Loading and Preprocessing the Data
We load the training dataset provided in the challenge, drop the **id** column, and encode the target variable **diagnosis** as **1** for malignant (M) and **0** for benign (B).
### 2.1. Loading the dataset

In [11]:
# 1. Load the training data
train = pd.read_csv('/kaggle/input/tumor-classification-challenge/train_data.csv')

In [12]:
# 2. Preprocess the data: Separate features (X) and target label (y)
X = train.drop(columns=['id', 'diagnosis'])
y = train['diagnosis'].map({'M': 1, 'B': 0})


### 2.2. Handling Class Imbalance
Since the dataset might have class imbalance (i.e., more benign than malignant cases), we use SMOTE (Synthetic Minority Over-sampling Technique) to balance the classes by generating synthetic samples of the minority class.

In [13]:
# 3. Handle class imbalance using SMOTE
smote = SMOTE()
X_resampled, y_resampled = smote.fit_resample(X, y)
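
To make the idea behind SMOTE more concrete, here is a minimal sketch of its interpolation step: each synthetic point is placed somewhere on the line segment between a minority sample and one of its nearest minority neighbours. This is not imblearn's implementation; the function name `smote_like_samples`, its parameters, and the toy data are assumptions made up for illustration.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def smote_like_samples(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    # Ask for k + 1 neighbours because each point is returned as its own nearest neighbour
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, neighbour_idx = nn.kneighbors(X_min)
    # Pick random minority samples, then one of the k neighbours of each (column 0 is the point itself)
    base = rng.integers(0, len(X_min), size=n_new)
    picked = neighbour_idx[base, rng.integers(1, k + 1, size=n_new)]
    # Interpolation factor in [0, 1) places the new point between the two originals
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[picked] - X_min[base])


# Toy minority class: 10 points in 2-D (made-up data, not the challenge data)
X_minority = np.random.default_rng(42).normal(size=(10, 2))
synthetic = smote_like_samples(X_minority, n_new=6, k=3)
print(synthetic.shape)
```

Because every synthetic point is a blend of two real minority samples, it always stays inside the region already covered by the minority class.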

### 2.3. Splitting the Dataset
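We split the resampled data into training and validation sets, using an 80-20 split. Note that because SMOTE is applied before the split, the validation set also contains synthetic samples, some of them interpolated from points that end up in the training set, so the validation scores reported later are likely somewhat optimistic.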

In [14]:
# 4. Split the dataset into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X_resampled, y_resampled, test_size=0.2, random_state=42)
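
The split above is purely random. After SMOTE the classes are balanced, so both parts should end up close to 50/50 anyway, but `train_test_split` also accepts an optional `stratify` argument that keeps the class ratio the same in both parts. The notebook does not use it; the sketch below only illustrates the effect on made-up, deliberately imbalanced labels.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels: 80 benign (0) and 20 malignant (1); made-up data, not the challenge data
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Share of class 1 in each split; with stratification both match the overall 20%
print(y_tr.mean(), y_va.mean())
```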

In [15]:
# 5. Define the base models for the voting classifier
svm = SVC(probability=True)  # SVM requires probability=True for soft voting
lr = LogisticRegression()
rf = RandomForestClassifier()

## 3. Create a Voting Classifier
A Voting Classifier combines the predictions of multiple models. Here, we use a soft voting classifier, which averages the predicted class probabilities from each model and predicts the class with the highest average.

In [16]:
#  Create a voting classifier with SVM, Logistic Regression, and Random Forest
voting_clf = VotingClassifier(estimators=[
    ('svm', svm),
    ('lr', lr),
    ('rf', rf)
], voting='soft')

#  Create a pipeline with the scaler and the voting classifier
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('voting', voting_clf)
])
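
As a quick check of what soft voting does, the sketch below fits a small `VotingClassifier` on synthetic data and compares its output with a manual average of the base models' `predict_proba` results. The data and the two base models (`demo_clf`, a logistic regression plus a shallow decision tree) are stand-ins chosen to keep the example fast and deterministic; they are not the notebook's models.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, not the challenge data
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

demo_clf = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("tree", DecisionTreeClassifier(max_depth=3, random_state=0)),
    ],
    voting="soft",
).fit(X_demo, y_demo)

# Average the per-model class probabilities by hand (equal weights)
manual_proba = np.mean(
    [est.predict_proba(X_demo) for est in demo_clf.estimators_], axis=0
)
# The predicted label is the class with the highest averaged probability
manual_labels = demo_clf.classes_[np.argmax(manual_proba, axis=1)]

proba_matches = np.allclose(manual_proba, demo_clf.predict_proba(X_demo))
labels_match = np.array_equal(manual_labels, demo_clf.predict(X_demo))
print(proba_matches, labels_match)
```

This is also why the SVM needs `probability=True`: without it, `SVC` has no `predict_proba` to contribute to the average.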

## 4. Hyperparameter Tuning with GridSearchCV
### 4.1. Setting the Parameter Grid
We define the parameter grid for the SVM's **C** (regularization parameter) and **gamma** (kernel coefficient), covering several orders of magnitude for each. Only the SVM is tuned; Logistic Regression and Random Forest keep their default settings.

In [17]:
#  Set the parameter grid for SVM hyperparameters
param_grid = {
    'voting__svm__C': [0.1, 1, 10, 100],
    'voting__svm__gamma': [0.001, 0.01, 0.1]
}
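
The keys in the grid follow scikit-learn's nested naming convention: `voting` is the pipeline step, `svm` is the estimator name inside the voting classifier, and `C` or `gamma` is the parameter itself, all joined by double underscores. The sketch below uses `ParameterGrid` to enumerate the combinations such a grid produces; `param_grid_demo` simply repeats the values above, and `n_folds` is taken from the 5-fold setup used in this notebook.

```python
from sklearn.model_selection import ParameterGrid

param_grid_demo = {
    "voting__svm__C": [0.1, 1, 10, 100],
    "voting__svm__gamma": [0.001, 0.01, 0.1],
}

candidates = list(ParameterGrid(param_grid_demo))
n_folds = 5  # matches the StratifiedKFold setting used in this notebook

n_candidates = len(candidates)      # 4 values of C x 3 values of gamma = 12
n_fits = n_candidates * n_folds     # 60 model fits, not counting the final refit
print(n_candidates, n_fits)
print(candidates[0])
```

Each fit trains all three base models, so the cost of the search grows quickly if more parameters are added to the grid.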

### 4.2. Cross-Validation Setup
We use **StratifiedKFold** with 5 splits to ensure that each fold has a similar class distribution. GridSearchCV will be used to find the best combination of hyperparameters.

In [18]:
# Use StratifiedKFold for cross-validation
cv = StratifiedKFold(n_splits=5)

#  Initialize GridSearchCV with the pipeline and parameter grid
grid_search = GridSearchCV(pipeline, param_grid, cv=cv, scoring='f1', n_jobs=-1)

# Fit the grid search to the training data
grid_search.fit(X_train, y_train)

## 5. Model Evaluation on Validation Set
### 5.1. Best Parameters
After performing hyperparameter tuning, we output the best parameters found by GridSearchCV.

In [19]:
# Best parameters found during GridSearchCV
print("Best parameters found:", grid_search.best_params_)
best_model = grid_search.best_estimator_

Best parameters found: {'voting__svm__C': 1, 'voting__svm__gamma': 0.01}
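
Beyond `best_params_`, the fitted search object keeps the score of every candidate in `cv_results_`, which helps to see how close the runner-up settings were. The sketch below shows one way to inspect it, using a small stand-in pipeline (`demo_pipe`, scaler plus SVM only) and synthetic data so that it runs quickly; the same column pattern would apply to `grid_search` with the `param_voting__svm__...` names.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data, not the challenge data
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

demo_pipe = Pipeline([("scaler", StandardScaler()), ("svm", SVC())])
demo_search = GridSearchCV(
    demo_pipe,
    {"svm__C": [0.1, 1, 10], "svm__gamma": [0.01, 0.1]},
    cv=StratifiedKFold(n_splits=5),
    scoring="f1",
).fit(X_demo, y_demo)

# One row per parameter combination, best-ranked first
results = pd.DataFrame(demo_search.cv_results_)[
    ["param_svm__C", "param_svm__gamma", "mean_test_score", "std_test_score", "rank_test_score"]
].sort_values("rank_test_score")
print(results.head())
```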


### 5.2. Predictions and Metrics on Validation Set
We use the best model to predict on the validation set and calculate key evaluation metrics such as F1 score and accuracy.

In [20]:
# Make predictions on the validation set
y_val_pred = best_model.predict(X_val)

#  Calculate F1 Score and Accuracy
f1 = f1_score(y_val, y_val_pred)
accuracy = accuracy_score(y_val, y_val_pred)

print("F1 Score on validation set:", f1)
print("Accuracy on validation set:", accuracy)

F1 Score on validation set: 0.9791666666666666
Accuracy on validation set: 0.9801980198019802
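
For reference, the F1 score is the harmonic mean of precision and recall for the positive class (malignant, encoded as 1). The sketch below recomputes it from a confusion matrix on made-up labels and compares the result with `f1_score`; the label lists are illustrative and unrelated to the validation results above.

```python
from sklearn.metrics import confusion_matrix, f1_score

# Made-up labels for illustration (1 = malignant, 0 = benign)
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_demo = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

precision = tp / (tp + fp)  # of the predicted malignant cases, how many were malignant
recall = tp / (tp + fn)     # of the truly malignant cases, how many were found
f1_manual = 2 * precision * recall / (precision + recall)

print(f1_manual, f1_score(y_true_demo, y_pred_demo))
```

In a diagnostic setting, recall deserves a separate look, since a false negative means a malignant tumor classified as benign.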


## 6. Generating Test Set Predictions
### 6.1. Preparing Test Data
We load the test dataset and drop the **id** column to only retain the feature columns.

In [21]:
#  Load the test dataset
test = pd.read_csv('/kaggle/input/tumor-classification-challenge/test_data.csv')

# Preprocess the test data: keep only the feature columns
# (no manual scaling here: best_model is a Pipeline and applies its fitted scaler inside predict)
X_test = test.drop(columns=['id'])

### 6.2. Making Predictions
We use the best model obtained from GridSearchCV to predict the labels for the test set.

In [22]:
# Make predictions on the test set; the pipeline scales the features before voting
test_predictions = best_model.predict(X_test)




### 6.3. Saving the Submission File
We prepare the submission file in the required format, converting the numerical predictions back to categorical ('M' for malignant, 'B' for benign).

In [23]:

# Create the submission DataFrame
submission = pd.DataFrame({'id': test['id'], 'diagnosis': np.where(test_predictions == 1, 'M', 'B')})

# Save the submission file
submission.to_csv('submissionfinal.csv', index=False)
print("Submission file saved as 'submissionfinal.csv'")


Submission file saved as 'submissionfinal.csv'
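
The notebook imports `joblib` for saving the model but never calls it. A minimal sketch of how a fitted pipeline could be persisted and reloaded is shown below. It uses a small stand-in pipeline (`demo_model`) on synthetic data and a temporary directory; the file name `voting_pipeline.joblib` is hypothetical. In the notebook, the object to save would be `best_model`, and a path under `/kaggle/working/` would keep the file with the notebook output.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in for best_model, fitted on synthetic data
X_demo, y_demo = make_classification(n_samples=100, n_features=5, random_state=0)
demo_model = Pipeline(
    [("scaler", StandardScaler()), ("lr", LogisticRegression())]
).fit(X_demo, y_demo)

# Hypothetical file name inside a temporary directory
model_path = os.path.join(tempfile.mkdtemp(), "voting_pipeline.joblib")
joblib.dump(demo_model, model_path)

# Reload and confirm the restored pipeline predicts the same labels
reloaded = joblib.load(model_path)
same_predictions = np.array_equal(reloaded.predict(X_demo), demo_model.predict(X_demo))
print(same_predictions)
```

Saving the whole pipeline rather than only the classifier keeps the fitted scaler together with the model, so the reloaded object can be applied directly to raw feature columns.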
