# Kaggle Project 1 - Breast Cancer Classification

Author: Nicholas Lopez

## Overview
In this workbook I train several classification models on the Wisconsin Breast Cancer dataset, which consists of features extracted from breast mass samples. The goal is to classify each sample as either malignant (M) or benign (B) based on these features.

## Classification Models Used:
1.   Perceptron
2.   Logistic Regression
3.   SVM
4.   Decision Trees
5.   KNN
6.   Random Forest

## Performance Metric:
*   Accuracy
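
Accuracy is the fraction of samples whose predicted label matches the true label. The snippet below is a minimal sketch of that definition; the label arrays are made up for illustration and are not taken from the dataset.

```python
from sklearn.metrics import accuracy_score

# Hypothetical labels for illustration only (1 = malignant, 0 = benign)
y_true_demo = [1, 0, 0, 1, 0]
y_pred_demo = [1, 0, 1, 1, 0]

# Accuracy = number of correct predictions / total number of predictions
manual_accuracy = sum(t == p for t, p in zip(y_true_demo, y_pred_demo)) / len(y_true_demo)

# scikit-learn computes the same quantity
sklearn_accuracy = accuracy_score(y_true_demo, y_pred_demo)

print(manual_accuracy, sklearn_accuracy)
```

Accuracy weights both classes equally, so it says nothing on its own about whether errors are missed malignant cases or false alarms.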

## Import Required Libraries
The libraries used in this workbook are imported below.

In [20]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import Perceptron, LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler

### Ignore Warning Messages
Suppress the scikit-learn convergence warning so it does not clutter the output.

In [21]:
import warnings
from sklearn.exceptions import ConvergenceWarning

# Filter out the lbfgs ConvergenceWarning raised by sklearn.linear_model
warnings.filterwarnings("ignore", message="lbfgs failed to converge", category=ConvergenceWarning)

## Source Data

### Import Training Data

In [22]:
# import training data
training_data = pd.read_csv('train.csv')
training_data.tail()

Unnamed: 0,id,label,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
450,866674,M,19.79,25.12,130.4,1192.0,0.1015,0.1589,0.2545,0.1149,...,22.63,33.58,148.7,1589.0,0.1275,0.3861,0.5673,0.1732,0.3305,0.08465
451,869254,B,10.75,14.97,68.26,355.3,0.07793,0.05139,0.02251,0.007875,...,11.95,20.72,77.79,441.2,0.1076,0.1223,0.09755,0.03413,0.23,0.06769
452,859717,M,17.2,24.52,114.2,929.4,0.1071,0.183,0.1692,0.07944,...,23.32,33.82,151.6,1681.0,0.1585,0.7394,0.6566,0.1899,0.3313,0.1339
453,88249602,B,14.03,21.25,89.79,603.4,0.0907,0.06945,0.01462,0.01896,...,15.33,30.28,98.27,715.5,0.1287,0.1513,0.06231,0.07963,0.2226,0.07617
454,854941,B,13.03,18.42,82.61,523.8,0.08983,0.03766,0.02562,0.02923,...,13.3,22.81,84.46,545.9,0.09701,0.04619,0.04833,0.05013,0.1987,0.06169


## Data Preparation

### Create Target Dataset
Separate the target (breast cancer label) from the features used to train the classification models.

In [23]:
# Create Target Dataframe
y = training_data["label"]
y.tail()

Unnamed: 0,label
450,M
451,B
452,M
453,B
454,B


### Create Feature Dataset
Separate the features used to predict the breast cancer label.

In [24]:
# Create Feature Dataframe
X = training_data.drop(['label', 'id'], axis=1)
display(X.head())

# View quick stats on feature data set
X.describe()
X.info()

Unnamed: 0,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,fractal_dimension_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
0,17.99,20.66,117.8,991.7,0.1036,0.1304,0.1201,0.08824,0.1992,0.06069,...,21.08,25.41,138.1,1349.0,0.1482,0.3735,0.3301,0.1974,0.306,0.08503
1,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678
2,9.0,14.4,56.36,246.3,0.07005,0.03116,0.003681,0.003472,0.1788,0.06833,...,9.699,20.07,60.9,285.5,0.09861,0.05232,0.01472,0.01389,0.2991,0.07804
3,12.21,14.09,78.78,462.0,0.08108,0.07823,0.06839,0.02534,0.1646,0.06154,...,13.13,19.29,87.65,529.9,0.1026,0.2431,0.3076,0.0914,0.2677,0.08824
4,12.34,14.95,78.29,469.1,0.08682,0.04571,0.02109,0.02054,0.1571,0.05708,...,13.18,16.85,84.11,533.1,0.1048,0.06744,0.04921,0.04793,0.2298,0.05974


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 455 entries, 0 to 454
Data columns (total 30 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   radius_mean              455 non-null    float64
 1   texture_mean             455 non-null    float64
 2   perimeter_mean           455 non-null    float64
 3   area_mean                455 non-null    float64
 4   smoothness_mean          455 non-null    float64
 5   compactness_mean         455 non-null    float64
 6   concavity_mean           455 non-null    float64
 7   concave points_mean      455 non-null    float64
 8   symmetry_mean            455 non-null    float64
 9   fractal_dimension_mean   455 non-null    float64
 10  radius_se                455 non-null    float64
 11  texture_se               455 non-null    float64
 12  perimeter_se             455 non-null    float64
 13  area_se                  455 non-null    float64
 14  smoothness_se            4

### Create Target Processor
Create `target_processor` to convert the target variable `y` from string labels ('M' and 'B') to numerical values (1 and 0) and vice versa. The function will be called later to perform the same preprocessing steps on the test data, as well as convert the data back to the labeled values.

In [25]:
# Convert 'M' to 1 and 'B' to 0 in the target variable
def target_processor(y, proc_step):
  """
    Converts the Target label to numerical values or vice versa.

    Args:
        y: A pandas Series containing the target labels.
        proc_step: A string indicating the processing step ('pre' or 'post').
            - pre: Convert 'M' to 1 and 'B' to 0.
            - post: Convert 1 to 'M' and 0 to 'B'.
    Returns:
        A pandas Series containing the converted target labels.
    """
  # Convert y to a pandas Series if it is not one already (e.g. a NumPy array)
  if not isinstance(y, pd.Series):
    try:
      y = pd.Series(y, name='label')
    except Exception as exc:
      raise TypeError("y must be a pandas Series or convertible to one") from exc

  # Determine the processing step to perform
  if proc_step == 'pre':
    # Pre-processing step to convert factors to numbers
    y_numerical = y.apply(lambda x: 1 if x == 'M' else 0)
  elif proc_step == 'post':
    # Post-processing step to convert numbers to factors
    y_numerical = y.apply(lambda x: 'M' if x == 1 else 'B')
  else: # Error handling
    raise ValueError("Invalid processing step. Use 'pre' or 'post'.")

  # Return Results
  return y_numerical
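
The same conversion can also be written with `Series.map` and an explicit lookup table. This is an alternative sketch, not the function used in this workbook; `LABEL_TO_NUM`, `NUM_TO_LABEL`, and `demo_labels` are names introduced here for illustration. One difference in behavior: with `map`, a value that is not in the lookup table becomes `NaN` instead of falling into the `else` branch.

```python
import pandas as pd

# Hypothetical lookup tables for the two directions of the conversion
LABEL_TO_NUM = {'M': 1, 'B': 0}
NUM_TO_LABEL = {1: 'M', 0: 'B'}

# Made-up labels for illustration only
demo_labels = pd.Series(['M', 'B', 'B', 'M'], name='label')

# 'pre' direction: string labels to numbers
encoded = demo_labels.map(LABEL_TO_NUM)

# 'post' direction: numbers back to string labels
decoded = encoded.map(NUM_TO_LABEL)

print(encoded.tolist())
print(decoded.tolist())
```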

Perform the pre-processing step on the target variable.

In [26]:
# Apply the conversion to the target variable
y_processed = target_processor(y, 'pre')

# Display the last few values of the converted target variable
display(y_processed.tail())

# Display the value counts to verify the conversion
display(y_processed.value_counts())

Unnamed: 0,label
450,1
451,0
452,1
453,0
454,0


Unnamed: 0_level_0,count
label,Unnamed: 1_level_1
0,285
1,170


### Create the Training and Test Datasets
Use `train_test_split` to create the training and test sets for the features and the target. The test set is 20% of the training data, and the `stratify` parameter is set to the target so that both splits keep the same proportion of malignant and benign samples.

In [27]:
# Create train and test splits on features and targets
X_train, X_test, y_train, y_test = train_test_split(
    X, y_processed, test_size=0.2, random_state=1, stratify=y)
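
To see what stratification does, the sketch below repeats the split on a synthetic target with the same class counts as this dataset (285 benign, 170 malignant) and compares the class proportions. `X_demo` and `y_demo` are placeholders introduced for illustration, not the real features.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same class counts as the training data
y_demo = pd.Series([0] * 285 + [1] * 170)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature column

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1, stratify=y_demo)

# Split sizes
print(len(y_tr), len(y_te))

# Share of the positive class overall, in the training split, and in the test split
print(y_demo.mean(), y_tr.mean(), y_te.mean())
```

With stratification, the positive-class share in each split stays close to the overall share; without it, a small test set can drift noticeably from the overall class balance.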

### Create Feature Processor Function
Create a function `feature_processor` to standardize the feature data using `StandardScaler`. This function will be used to apply the same scaling to the test data before making predictions.

In [28]:
def feature_processor(X, proc_step, scaler=None):
    """
    Standardizes the feature data using StandardScaler.

    Args:
        X: A pandas DataFrame containing the features.
        proc_step: A string indicating the processing step ('fit_transform' or
        'transform').
            - fit_transform: Fit the scaler on the data and transform it.
            - transform: Transform the data using an existing scaler.
        scaler: An optional StandardScaler object to use for transformation.

    Returns:
        For 'fit_transform', a tuple of the standardized features (as a NumPy
        array) and the fitted scaler. For 'transform', only the standardized
        features (as a NumPy array).
    """
    # Determine the process step
    if proc_step == 'fit_transform':
        # Create StandardScaler instance
        scaler = StandardScaler()
        # Standardize training features
        X_scaled = scaler.fit_transform(X)
        # Return results
        return X_scaled, scaler
    elif proc_step == 'transform':
        # Check for scaler instance
        if scaler is None:
            raise ValueError("Scaler must be provided for 'transform' step.")
        # Standardize prediction features
        X_scaled = scaler.transform(X)
        # Return Results
        return X_scaled
    else: # Error handling
        raise ValueError("Invalid processing step. Use 'fit_transform' or 'transform'.")

### Standardize Features
Standardize the training and test feature data using the `feature_processor` function.

In [29]:
# Standardize training data and get the fitted scaler
X_train_scaled, scaler = feature_processor(X_train, 'fit_transform')

# Standardize test data using the fitted scaler
X_test_scaled = feature_processor(X_test, 'transform', scaler=scaler)
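
The sketch below illustrates why the scaler is fitted on the training split only and then reused. It uses synthetic data and calls `StandardScaler` directly rather than `feature_processor`; all `*_demo` names are assumptions made for this example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic features on very different scales (illustrative only)
X_train_demo = rng.normal(loc=[15.0, 600.0], scale=[3.0, 300.0], size=(100, 2))
X_test_demo = rng.normal(loc=[15.0, 600.0], scale=[3.0, 300.0], size=(25, 2))

# Fit on the training split only, then reuse the fitted statistics
demo_scaler = StandardScaler()
X_train_demo_scaled = demo_scaler.fit_transform(X_train_demo)
X_test_demo_scaled = demo_scaler.transform(X_test_demo)

# Training columns end up with mean 0 and standard deviation 1
print(X_train_demo_scaled.mean(axis=0).round(6), X_train_demo_scaled.std(axis=0).round(6))

# Test columns are only approximately standardized, because they are
# scaled with the training mean and standard deviation
print(X_test_demo_scaled.mean(axis=0).round(3), X_test_demo_scaled.std(axis=0).round(3))
```

Reusing the training statistics keeps information from the test split out of the preprocessing step.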

## Train Models

### Create Training and Evaluation Function
This function loops through a list of classification models, trains each model, and stores the model name and accuracy score in a new dataframe.

### Steps:
*   Create a Python function that takes the training and testing data (`X_train`, `X_test`, `y_train`, `y_test`) and a list of model instances as input.
*   Create an empty list to store the model names and accuracy scores.
*   Iterate through the list of model instances provided.
*   Train each model using the training data (X_train, y_train) inside the loop.
*   Use the trained model to make predictions on the test data (X_test) inside the loop.
*   Calculate the accuracy score by comparing the predicted values to the actual test target values (y_test) inside the loop.
*   Add a row to the results list with the model's name and its accuracy score inside the loop.
*   After the loop finishes, the function returns the results in a dataframe.

In [30]:
def train_and_evaluate_models(X_train, X_test, y_train, y_test, models):
    """
    Trains and evaluates a list of classification models.

    Args:
        X_train: Training features.
        X_test: Testing features.
        y_train: Training target.
        y_test: Testing target.
        models: A list of instantiated classification models.

    Returns:
        A pandas DataFrame containing model names and their accuracy scores.
    """
    # Create a list to store results
    results_list = []

    # Loop through all models to train them on the training data
    for model in models:

        # Train model
        model.fit(X_train, y_train)
        # Make predictions
        y_pred = model.predict(X_test)
        # Score the model
        accuracy = accuracy_score(y_test, y_pred)
        # Capture results
        model_results = {'Model': type(model).__name__, 'Accuracy': accuracy}
        # Append results to the list
        results_list.append(model_results)

    # Convert the list of results to a DataFrame
    results_df = pd.DataFrame(results_list)
    return results_df

### Instantiate models

Instantiate the following models: Perceptron, Logistic Regression, SVM, Decision Trees, KNN, and Random Forest. Add the models to a list that will be fed into the `train_and_evaluate_models` function.


In [31]:
# Create model instances
def model_instances():
  """
  Creates and returns a list of instantiated classification models.

  Returns:
      A list of instantiated classification models.
  """
  # set random seed
  random_seed = 42

  # Create model instances
  perceptron_model = Perceptron(random_state=random_seed)
  logistic_regression_model = LogisticRegression(random_state=random_seed)
  svm_model = SVC(kernel='linear', C=0.5, random_state=random_seed)
  decision_tree_model = DecisionTreeClassifier(random_state=random_seed)
  knn_model = KNeighborsClassifier(n_neighbors=5)
  random_forest_model = RandomForestClassifier(random_state=random_seed)

  return [
      perceptron_model, logistic_regression_model,
      svm_model, decision_tree_model, knn_model,
      random_forest_model
  ]

# Call the function to get the model instances
models_list = model_instances()


### Train Models on Standardized Data
Call the `train_and_evaluate_models` function with the standardized feature training/testing data and the list of model instances to see how they perform after scaling.

In [32]:
# Train models on standardized feature data
model_results_scaled = train_and_evaluate_models(X_train_scaled, X_test_scaled, y_train, y_test, models_list)

# Display results in order of best accuracy scores
display(model_results_scaled.sort_values(by='Accuracy', ascending=False))

Unnamed: 0,Model,Accuracy
2,SVC,0.989011
1,LogisticRegression,0.978022
0,Perceptron,0.967033
4,KNeighborsClassifier,0.967033
3,DecisionTreeClassifier,0.956044
5,RandomForestClassifier,0.956044


## Model Selection
Based on the results, the `SVC` model has the highest accuracy on the held-out split, at 98.9%. This is the model used to predict the test data.
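
The 80/20 split of the 455 training rows leaves 91 held-out samples, so each reported accuracy corresponds to a whole number of correct predictions. The arithmetic below is a sketch that recovers those counts; `reported_accuracy` simply copies the scores from the results table.

```python
# 20% of the 455 training rows are held out for evaluation
n_test = round(455 * 0.2)

# Accuracy scores copied from the results table
reported_accuracy = {
    "SVC": 0.989011,
    "LogisticRegression": 0.978022,
    "Perceptron": 0.967033,
    "KNeighborsClassifier": 0.967033,
    "DecisionTreeClassifier": 0.956044,
    "RandomForestClassifier": 0.956044,
}

# Convert each accuracy into the number of misclassified samples
errors = {}
for name, acc in reported_accuracy.items():
    n_correct = round(acc * n_test)
    errors[name] = n_test - n_correct
    print(f"{name}: {n_correct}/{n_test} correct, {errors[name]} misclassified")
```

With a hold-out set this small, one misclassified sample shifts accuracy by roughly 1.1 percentage points, so the models in the table differ by only one to three samples.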



In [33]:
# Instantiate the SVC model with the same settings as the best-performing model
svm_model = SVC(kernel='linear', C=0.5, random_state=42)

# Train the model with the scaled training data
svm_model.fit(X_train_scaled, y_train)

# Show best performing model
model_results_scaled.sort_values(by='Accuracy', ascending=False)[:1]

Unnamed: 0,Model,Accuracy
2,SVC,0.989011


## Make Predictions

### Import Test Data

In [34]:
# import test data
test_data = pd.read_csv('test.csv')
test_data.tail()

Unnamed: 0,id,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
109,87164,15.46,11.89,102.5,736.9,0.1257,0.1555,0.2032,0.1097,0.1966,...,18.79,17.04,125.0,1102.0,0.1531,0.3583,0.583,0.1827,0.3216,0.101
110,84348301,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
111,859471,9.029,17.33,58.79,250.5,0.1066,0.1413,0.313,0.04375,0.2111,...,10.31,22.65,65.5,324.7,0.1482,0.4365,1.252,0.175,0.4228,0.1175
112,911150,14.53,19.34,94.25,659.7,0.08388,0.078,0.08817,0.02925,0.1473,...,16.3,28.39,108.1,830.5,0.1089,0.2649,0.3779,0.09594,0.2471,0.07463
113,90944601,13.78,15.79,88.37,585.9,0.08817,0.06718,0.01055,0.009937,0.1405,...,15.27,17.5,97.9,706.6,0.1072,0.1071,0.03517,0.03312,0.1859,0.0681


### Pre-process Feature Data
To properly predict the targets, the test data needs to be preprocessed to match how the model was trained. The ID field needs to be dropped and the remaining fields standardized.

In [35]:
# Drop ID field from test data
test_data_processed = test_data.drop(['id'], axis=1)

# Process Feature data
test_data_scaled = feature_processor(test_data_processed, 'transform', scaler=scaler)
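
Because the scaler and the model were fitted on a specific set of columns in a specific order, it can help to confirm that the new data matches before transforming it. The helper below is a hypothetical sketch (`check_same_columns` is not part of this workbook or of scikit-learn), shown on two tiny made-up frames.

```python
import pandas as pd

def check_same_columns(train_features, new_features):
    """Hypothetical helper: raise if the frames differ in column names or order."""
    train_cols = list(train_features.columns)
    new_cols = list(new_features.columns)
    if train_cols != new_cols:
        missing = [c for c in train_cols if c not in new_cols]
        extra = [c for c in new_cols if c not in train_cols]
        raise ValueError(f"Column mismatch. Missing: {missing}; unexpected: {extra}")
    return True

# Made-up frames for illustration only
train_demo = pd.DataFrame({'radius_mean': [17.99], 'texture_mean': [20.66]})
test_demo = pd.DataFrame({'id': [87164], 'radius_mean': [15.46], 'texture_mean': [11.89]})

# Passes once the ID column is dropped, mirroring the step above
print(check_same_columns(train_demo, test_demo.drop(['id'], axis=1)))
```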

### Test Model on Unseen Data
Use the best-performing model, `SVC`, to predict the labels of the unseen data.

In [36]:
# Predict labels
test_predictions = svm_model.predict(test_data_scaled)

# Convert predictions to a pandas DataFrame
test_predictions = pd.DataFrame(target_processor(test_predictions, proc_step='post'),
                                columns=['label'])

# Display the last few predictions
test_predictions.tail()

Unnamed: 0,label
109,M
110,M
111,M
112,B
113,B


### Create Submission File
Take the predictions from the model and create the Kaggle submission file for assessing the results.

In [37]:
# Combine predictions with the original ID.
df_submission = pd.concat([test_data['id'], test_predictions], axis=1)

# Create Submission.csv
df_submission.to_csv('submission.csv', index=False)
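
`pd.concat(..., axis=1)` aligns rows by index, so the IDs and predictions line up only because both use the default 0..n-1 index. The sketch below checks a submission built the same way by writing it to an in-memory buffer and reading it back; the IDs and labels are made up for illustration, and the two-column `id`, `label` layout is assumed from the code above.

```python
import io
import pandas as pd

# Made-up IDs and predictions, both with the default RangeIndex
ids_demo = pd.Series([87164, 84348301, 859471], name='id')
preds_demo = pd.DataFrame({'label': ['M', 'M', 'B']})

# Same construction as the submission step
submission_demo = pd.concat([ids_demo, preds_demo], axis=1)

# Write to an in-memory buffer instead of disk, then read it back
buffer = io.StringIO()
submission_demo.to_csv(buffer, index=False)
buffer.seek(0)
round_trip = pd.read_csv(buffer)

# Basic sanity checks on the file contents
print(list(round_trip.columns))
print(len(round_trip), int(round_trip.isna().sum().sum()))
```

If either object carried a non-default index (for example after filtering rows), the concatenation would introduce `NaN` values, which the missing-value count would reveal.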