# Cancer Classification with Support Vector Machine (SVM)

This notebook trains a Support Vector Machine (SVM) classifier to predict cancer type. Hyperparameters are tuned with GridSearchCV, and the best configuration is then evaluated on a held-out test set.

## Overview
1. Data loading and preprocessing
2. Hyperparameter tuning with GridSearchCV
3. Model training with optimal parameters
4. Performance evaluation with various metrics
5. Confusion matrix visualization
6. Conclusion and next steps

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.utils.class_weight import compute_class_weight
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, f1_score, top_k_accuracy_score

## 1. Data Loading and Preprocessing

We start by loading the cancer classification dataset and preparing it for model training.

In [None]:
# Load the dataset
file_path = "cancer_classification_dataset.csv"
df = pd.read_csv(file_path)

# Display the first few rows to understand the data structure
print("Dataset shape:", df.shape)
df.head()

In [None]:
# Check target variable distribution
plt.figure(figsize=(12, 6))
cancer_counts = df['cancer_type'].value_counts()
sns.barplot(x=cancer_counts.index, y=cancer_counts.values)
plt.title('Cancer Type Distribution')
plt.xlabel('Cancer Type')
plt.ylabel('Count')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

In [None]:
# Separate features and target variable
X = df.drop(columns=['cancer_type'])
y = df['cancer_type']

# Encode target variable to numerical values
y_encoded = y.astype('category').cat.codes

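# Split the dataset into training and testing sets (80% train, 20% test).
# Splitting before scaling keeps test-set statistics out of the scaler.
X_train, X_test, y_train, y_test = train_test_split(X, y_encoded, test_size=0.2, random_state=42)

# Standardize the features using the mean and variance of the training set only
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)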

print(f"Training set shape: {X_train.shape}")
print(f"Testing set shape: {X_test.shape}")
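
The same "fit the scaler on training data only" rule applies inside cross-validation folds. One way to enforce it is to wrap the scaler and the classifier in a scikit-learn `Pipeline`. The sketch below is illustrative only: it uses synthetic data from `make_classification` as a stand-in for the cancer dataset, and the names `demo_X`, `demo_y`, and `demo_pipeline` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption: NOT the real cancer dataset)
demo_X, demo_y = make_classification(
    n_samples=300, n_features=20, n_informative=10, n_classes=3, random_state=42
)

demo_X_train, demo_X_test, demo_y_train, demo_y_test = train_test_split(
    demo_X, demo_y, test_size=0.2, random_state=42
)

# The pipeline fits the scaler on whatever data it is trained on,
# then applies that same transform at prediction time.
demo_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("svm", SVC(kernel="rbf", class_weight="balanced", random_state=42)),
])
demo_pipeline.fit(demo_X_train, demo_y_train)

print("Held-out accuracy:", demo_pipeline.score(demo_X_test, demo_y_test))
```

If such a pipeline is passed to `GridSearchCV`, the parameter names need the step prefix, for example `svm__C` and `svm__gamma`.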

## 2. Hyperparameter Tuning with GridSearchCV

We'll compute balanced class weights and use GridSearchCV to find the optimal hyperparameters for our SVM model.

In [None]:
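# Compute class weights to handle potential class imbalance.
# Keying the dict by the actual class labels keeps the mapping correct
# even if some class happens to be absent from y_train.
classes = np.unique(y_train)
class_weights = dict(zip(classes, compute_class_weight(class_weight='balanced', classes=classes, y=y_train)))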
print("Class weights:", class_weights)
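
For reference, the `'balanced'` setting weights each class by `n_samples / (n_classes * count_of_class)`, so rare classes receive larger weights. The toy example below (hypothetical labels, not taken from the dataset) checks that formula against `compute_class_weight`.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels (hypothetical): class 0 appears 6 times, class 1 three times, class 2 once
toy_y = np.array([0] * 6 + [1] * 3 + [2] * 1)
toy_classes = np.unique(toy_y)

# Manual formula: n_samples / (n_classes * count_of_class)
manual_weights = len(toy_y) / (len(toy_classes) * np.bincount(toy_y))

# Same computation through scikit-learn
sklearn_weights = compute_class_weight(class_weight="balanced", classes=toy_classes, y=toy_y)

print("Manual: ", manual_weights)
print("sklearn:", sklearn_weights)
```

Both arrays agree; the rarest class (one sample out of ten) gets a weight of 10 / 3.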

In [None]:
# Define the SVM model
svm_model = SVC(class_weight=class_weights, random_state=42, probability=True)

# Define hyperparameter search space
param_grid = {
    'C': [5, 9, 10, 15],  # Narrowing around best found C value
    'gamma': [1e-5, 5e-5, 1e-4, 5e-4],  # Narrowing around best found gamma
    'kernel': ['sigmoid']  # Using the best kernel from previous RandomizedSearchCV
}

# Perform Grid Search with 3-fold cross-validation
grid_search = GridSearchCV(svm_model, param_grid=param_grid, scoring='f1_weighted',
                           cv=3, verbose=2, n_jobs=-1)

print("Starting hyperparameter tuning with GridSearchCV...")
grid_search.fit(X_train, y_train)
print("Best hyperparameters found:", grid_search.best_params_)
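
The grid above has 4 × 4 × 1 = 16 parameter combinations, so 3-fold cross-validation runs 48 fits, plus one final refit on the full training set because `refit=True` is the default. To see how close the candidates are rather than only the winner, the `cv_results_` attribute can be loaded into a DataFrame. The sketch below shows the idea on synthetic data with a smaller, hypothetical grid (`demo_grid`); the data and parameter values are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in data (assumption: NOT the real cancer dataset)
demo_X, demo_y = make_classification(
    n_samples=200, n_features=10, n_informative=5, n_classes=3, random_state=0
)

# Small hypothetical grid: 2 x 2 x 1 = 4 candidates
demo_grid = {"C": [1, 10], "gamma": [1e-3, 1e-2], "kernel": ["rbf"]}
demo_search = GridSearchCV(SVC(random_state=0), param_grid=demo_grid,
                           scoring="f1_weighted", cv=3)
demo_search.fit(demo_X, demo_y)

# One row per candidate, best-ranked first
results = pd.DataFrame(demo_search.cv_results_)
summary = results[
    ["param_C", "param_gamma", "mean_test_score", "std_test_score", "rank_test_score"]
].sort_values("rank_test_score")
print(summary.to_string(index=False))
```

A large `std_test_score` relative to the gap between candidates suggests the ranking is not very stable across folds.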

To skip the grid search, run the next cell instead. It trains the model directly with the best parameters we found.

In [None]:
best_params = {
    'C': 15,
    'gamma': 0.0001,
    'kernel': 'sigmoid'
}

best_svm = SVC(**best_params, class_weight=class_weights, random_state=42, probability=True)
best_svm.fit(X_train, y_train)
print("SVM model trained with best hyperparameters.")

## 3. Model Training with Optimal Parameters

If you ran the grid search above, this cell trains the final SVM with the hyperparameters it selected.
**Skip this cell if you trained with the hard-coded parameters instead**, because `grid_search` will not be defined.

In [None]:
# Train SVM with best parameters
best_svm = SVC(**grid_search.best_params_, class_weight=class_weights, random_state=42, probability=True)
best_svm.fit(X_train, y_train)
print("SVM model trained with best hyperparameters.")

## 4. Performance Evaluation

We evaluate the model on the test set using accuracy, weighted F1, top-2 and top-3 accuracy, and a per-class classification report.

In [None]:
# Generate predictions
y_pred = best_svm.predict(X_test)
y_pred_probs = best_svm.predict_proba(X_test)

# Compute accuracy metrics
accuracy = accuracy_score(y_test, y_pred)
f1_weighted = f1_score(y_test, y_pred, average='weighted')
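# Pass labels so the probability columns are matched to classes even if
# a class is missing from y_test
top_2_accuracy = top_k_accuracy_score(y_test, y_pred_probs, k=2, labels=best_svm.classes_)
top_3_accuracy = top_k_accuracy_score(y_test, y_pred_probs, k=3, labels=best_svm.classes_)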

print(f"Accuracy on test set: {accuracy:.4f}")
print(f"Overall F1 Score (Weighted): {f1_weighted:.4f}")
print(f"Top-2 Accuracy: {top_2_accuracy:.4f}")
print(f"Top-3 Accuracy: {top_3_accuracy:.4f}")
print("\nClassification Report:\n", classification_report(y_test, y_pred))
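
Top-k accuracy counts a prediction as correct when the true class is among the k classes with the highest predicted probability. The toy example below (hypothetical labels and probabilities) computes top-2 accuracy by hand and compares it with `top_k_accuracy_score`.

```python
import numpy as np
from sklearn.metrics import top_k_accuracy_score

# Hypothetical true labels and class-probability rows for 4 samples, 3 classes
toy_true = np.array([0, 1, 2, 2])
toy_probs = np.array([
    [0.6, 0.3, 0.1],  # true class 0 is ranked 1st
    [0.5, 0.4, 0.1],  # true class 1 is ranked 2nd
    [0.5, 0.3, 0.2],  # true class 2 is ranked 3rd
    [0.1, 0.2, 0.7],  # true class 2 is ranked 1st
])

# Column indices of the 2 highest-probability classes in each row
top2 = np.argsort(toy_probs, axis=1)[:, -2:]
manual_top2 = np.mean([t in row for t, row in zip(toy_true, top2)])

sklearn_top2 = top_k_accuracy_score(toy_true, toy_probs, k=2, labels=[0, 1, 2])

print("Manual top-2: ", manual_top2)   # 3 of 4 samples hit -> 0.75
print("sklearn top-2:", sklearn_top2)
```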

## 5. Confusion Matrix Visualization

Finally, we'll visualize the confusion matrix to better understand the model's performance across different cancer types.

In [None]:
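# Map numerical labels back to original cancer types for better visualization
cancer_types = y.astype('category').cat.categories

# Confusion matrix. Passing labels keeps one row and one column per cancer type,
# so the tick labels line up even if a type is missing from the test set.
conf_matrix = confusion_matrix(y_test, y_pred, labels=np.arange(len(cancer_types)))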

plt.figure(figsize=(12, 8))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Greens', 
            xticklabels=cancer_types, yticklabels=cancer_types)
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix - SVM on Test Set')
plt.xticks(rotation=45, ha='right')
plt.yticks(rotation=0)
plt.tight_layout()
plt.show()
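
With imbalanced classes, raw counts can make frequent cancer types dominate the heatmap. A row-normalized matrix, where each row sums to 1 and the diagonal is the per-class recall, is often easier to read. The sketch below uses small hypothetical label lists to show the difference.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels for 3 classes
toy_true = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2]
toy_pred = [0, 0, 0, 1, 1, 1, 2, 2, 0, 1]

# Raw counts: rows are true labels, columns are predicted labels
raw_cm = confusion_matrix(toy_true, toy_pred, labels=[0, 1, 2])

# normalize="true" divides each row by the number of true samples in that class
norm_cm = confusion_matrix(toy_true, toy_pred, labels=[0, 1, 2], normalize="true")

# The diagonal of the row-normalized matrix is the recall of each class
per_class_recall = np.diag(norm_cm)

print(raw_cm)
print(np.round(norm_cm, 2))
print("Per-class recall:", per_class_recall)
```

The normalized matrix can be passed to `sns.heatmap` in the same way as the raw one, with `fmt='.2f'` instead of `fmt='d'`.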

## 6. Conclusion and Next Steps

In this notebook, we've implemented a Support Vector Machine model for cancer type classification. Here's a summary of what we've accomplished:

1. Processed and standardized the dataset
2. Used GridSearchCV to find optimal hyperparameters
3. Trained an SVM model with these hyperparameters
4. Evaluated the model using multiple metrics
5. Visualized the confusion matrix

Potential next steps:
- Try different feature selection techniques
- Experiment with other classification algorithms for comparison
- Implement ensemble methods to further improve performance
- Apply sampling techniques to address class imbalance
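
As a starting point for the algorithm comparison suggested above, the following sketch scores two candidate models with the same cross-validation splits and metric. It is illustrative only: the data is synthetic, and the choice of logistic regression as the comparison model and the names `candidates` and `comparison` are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption: NOT the real cancer dataset)
demo_X, demo_y = make_classification(
    n_samples=300, n_features=20, n_informative=10, n_classes=3, random_state=0
)

# Each candidate scales its own training folds, so no fold sees test statistics
candidates = {
    "svm_sigmoid": make_pipeline(
        StandardScaler(),
        SVC(kernel="sigmoid", class_weight="balanced", random_state=0),
    ),
    "logistic_regression": make_pipeline(
        StandardScaler(),
        LogisticRegression(max_iter=1000, class_weight="balanced"),
    ),
}

# Mean weighted F1 over 3 folds for each candidate
comparison = {}
for name, model in candidates.items():
    scores = cross_val_score(model, demo_X, demo_y, cv=3, scoring="f1_weighted")
    comparison[name] = scores.mean()
    print(f"{name}: mean weighted F1 = {comparison[name]:.4f}")
```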