
In [3]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
# mean_absolute_error was imported here but never used, so it is dropped.
from sklearn.metrics import accuracy_score, confusion_matrix

In [4]:
df = pd.read_csv("https://raw.githubusercontent.com/brem221/breast-cancer-prediction/main/preprocessed_breast_cancer_data.csv")
df.head(5)

Unnamed: 0.1,Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
0,0,842302,1,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,1,842517,1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,2,84300903,1,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,3,84348301,1,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,4,84358402,1,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


In [8]:
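# Separate the target from the features. Note that X still keeps the two
# 'Unnamed' index columns and 'id', which are identifiers, not measurements.
X = df.drop(columns=['diagnosis'])
y = df['diagnosis']

# Hold out 35% of the rows for testing; random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.35, random_state=101)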

print(f"X_train_shape: {X_train.shape}")
print(f"X_test_shape: {X_test.shape}")
print(f"y_train_shape: {y_train.shape}")
print(f"y_test_shape: {y_test.shape}")

X_train_shape: (369, 32)
X_test_shape: (200, 32)
y_train_shape: (369,)
y_test_shape: (200,)
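
The 32 feature columns reported above suggest that `X` still contains `Unnamed: 0.1`, `Unnamed: 0`, and `id` from the `df.head()` output. The following is a minimal sketch, assuming those three columns are row indices or patient identifiers with no diagnostic signal, of dropping them and stratifying the split so both classes keep their proportions. The `toy` DataFrame is synthetic stand-in data with made-up values, and the names `non_features`, `X_clean`, and `y_clean` are hypothetical, not part of the notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the notebook's DataFrame (values are made up).
toy = pd.DataFrame({
    "Unnamed: 0.1": range(20),
    "Unnamed: 0": range(20),
    "id": range(1000, 1020),
    "diagnosis": [0, 1] * 10,
    "radius_mean": [10.0 + i for i in range(20)],
    "texture_mean": [15.0 + 0.5 * i for i in range(20)],
})

# Assumption: these columns are identifiers, not measurements.
non_features = ["Unnamed: 0.1", "Unnamed: 0", "id"]
X_clean = toy.drop(columns=non_features + ["diagnosis"])
y_clean = toy["diagnosis"]

# stratify keeps the class ratio the same in the train and test splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_clean, y_clean, test_size=0.35, random_state=101, stratify=y_clean
)

print(list(X_clean.columns))
print(X_tr.shape, X_te.shape)
```

Applied to the real data, the same `drop` call would leave only the measurement columns, so the reported shapes and accuracies would change.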


In [10]:
classifiers = {
    'Decision Tree': DecisionTreeClassifier(),
    'Logistic Regression': LogisticRegression(),
    'Random Forest': RandomForestClassifier(),
    'Gradient Boosting': GradientBoostingClassifier(),
    'K-Nearest Neighbors': KNeighborsClassifier(),
    'Support Vector Machine': SVC()
}
# Collect each model's test-set predictions next to the true labels.
predictions_df = pd.DataFrame({'Actual': y_test})

best_model_name = None
best_accuracy = 0.0

for name, classifier in classifiers.items():
    # Fit on the training split, then predict the held-out test split.
    classifier.fit(X_train, y_train)
    y_pred = classifier.predict(X_test)
    predictions_df[name] = y_pred
    accuracy = accuracy_score(y_test, y_pred)
    # Rows are true classes, columns are predicted classes.
    confusion_mat = confusion_matrix(y_test, y_pred)
    print(f"\n{name} Model:")
    print(f"Accuracy: {accuracy:.4f}")
    print(f"Confusion Matrix:\n{confusion_mat}")
    # Track the model with the highest test accuracy seen so far.
    if accuracy > best_accuracy:
        best_accuracy = accuracy
        best_model_name = name

print("\nPredictions DataFrame:")
print(predictions_df)
print(f"\nBest Model: {best_model_name}")
print(f"Best Accuracy: {best_accuracy:.4f}")



Decision Tree Model:
Accuracy: 0.9250
Confusion Matrix:
[[120   5]
 [ 10  65]]

Logistic Regression Model:
Accuracy: 0.6250
Confusion Matrix:
[[125   0]
 [ 75   0]]

Random Forest Model:
Accuracy: 0.9450
Confusion Matrix:
[[120   5]
 [  6  69]]

Gradient Boosting Model:
Accuracy: 0.9550
Confusion Matrix:
[[122   3]
 [  6  69]]

K-Nearest Neighbors Model:
Accuracy: 0.7600
Confusion Matrix:
[[113  12]
 [ 36  39]]

Support Vector Machine Model:
Accuracy: 0.6250
Confusion Matrix:
[[125   0]
 [ 75   0]]

Predictions DataFrame:
     Actual  Decision Tree  Logistic Regression  Random Forest  \
107       0              0                    0              0   
437       0              0                    0              0   
195       0              0                    0              0   
141       1              1                    0              1   
319       0              0                    0              0   
..      ...            ...                  ...            ...   
375      
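
The confusion matrices printed above carry more information than the accuracy figures alone. As a small sketch, the block below copies the six matrices from the output and derives accuracy, recall on class 1, and specificity for each model. It assumes scikit-learn's layout (rows are true classes, columns are predicted classes); the helper name `summarize` is hypothetical.

```python
import numpy as np

# Confusion matrices copied from the output above.
# Layout: rows are true classes (0, 1), columns are predicted classes (0, 1).
matrices = {
    "Decision Tree": [[120, 5], [10, 65]],
    "Logistic Regression": [[125, 0], [75, 0]],
    "Random Forest": [[120, 5], [6, 69]],
    "Gradient Boosting": [[122, 3], [6, 69]],
    "K-Nearest Neighbors": [[113, 12], [36, 39]],
    "Support Vector Machine": [[125, 0], [75, 0]],
}

def summarize(cm):
    """Return accuracy, class-1 recall, and specificity for a 2x2 matrix."""
    (tn, fp), (fn, tp) = np.asarray(cm)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "recall_1": tp / (tp + fn),       # share of true class 1 that was found
        "specificity": tn / (tn + fp),    # share of true class 0 that was found
    }

summary = {name: summarize(cm) for name, cm in matrices.items()}
for name, s in summary.items():
    print(f"{name:24s} acc={s['accuracy']:.4f} "
          f"recall(1)={s['recall_1']:.4f} spec={s['specificity']:.4f}")
```

A class-1 recall of zero for Logistic Regression and Support Vector Machine makes it explicit that both models predicted class 0 for every test sample.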

In this analysis, we applied six machine learning models to predict breast cancer diagnoses from a dataset of features derived from fine needle aspirate (FNA) images of breast masses. The objective was to distinguish malignant from benign tumors. The models were Decision Tree, Logistic Regression, Random Forest, Gradient Boosting, K-Nearest Neighbors, and Support Vector Machine, each trained with scikit-learn defaults on 369 samples and evaluated on a held-out test set of 200 samples.


Key Findings:

    Model Performance:
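        Accuracy on the 200-sample test set ranged from 0.6250 to 0.9550. The tree-based models scored highest: Gradient Boosting 0.9550, Random Forest 0.9450, and Decision Tree 0.9250. K-Nearest Neighbors reached 0.7600.
        Logistic Regression and Support Vector Machine both scored 0.6250, but their confusion matrices show that they predicted class 0 for every test sample, so that figure is simply the share of class 0 in the test set (125 of 200).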

    Best Performing Model:
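        Gradient Boosting was the most accurate predictor, with an accuracy of 0.9550 (191 of 200 test samples correct). It misclassified 3 of the 125 class-0 samples and 6 of the 75 class-1 samples. A plausible explanation is that tree ensembles do not depend on feature scaling, whereas the features here were used unscaled. Random Forest was close behind at 0.9450, a gap of only 2 test samples, so the ranking of the two should not be over-interpreted from a single split.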

    Feature Importance:
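        No feature importance analysis appears in the cells above. The tree-based models (Decision Tree, Random Forest, Gradient Boosting) expose a feature_importances_ attribute after fitting, which could be used to identify the measurements that most influence predictions. Because the feature matrix still includes the Unnamed index columns and id, any such ranking should be computed with those columns removed.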

    Limitations:
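        The results come from a single 65/35 train/test split with default hyperparameters and no cross-validation, so the reported accuracies carry sampling noise. The features were not scaled and still include identifier columns, which likely explains the poor results of Logistic Regression, K-Nearest Neighbors, and Support Vector Machine. Accuracy alone also hides the difference between false negatives and false positives, and the tree-based models may be overfitting.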

    Future Directions:
        Future work may involve scaling the features, removing identifier columns, tuning hyperparameters with cross-validation, exploring further ensemble methods such as stacking, or incorporating advanced techniques like deep learning for enhanced predictive performance.

    Clinical Applicability:
        High accuracy is necessary but not sufficient. In a diagnostic setting a missed malignant case is typically more costly than a false alarm, so recall on the positive class and the interpretability of the model matter as much as overall accuracy. Further collaboration with domain experts may refine the models for real-world implementation.
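
The feature importance point above can be sketched as follows. This is a hedged illustration, not the notebook's analysis: it uses scikit-learn's bundled copy of the Wisconsin diagnostic dataset (`load_breast_cancer`) as a stand-in for the CSV, so column names and rankings may differ from the notebook's data. Note that the bundled copy encodes the target as 0 = malignant and 1 = benign. The names `forest` and `importances` and the choice of `n_estimators=200` are assumptions made for the example.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Stand-in data: scikit-learn's bundled copy, not the notebook's CSV.
data = load_breast_cancer(as_frame=True)
X_demo, y_demo = data.data, data.target

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.35, random_state=101, stratify=y_demo
)

# Fixing random_state makes the fitted forest (and its ranking) reproducible.
forest = RandomForestClassifier(n_estimators=200, random_state=101)
forest.fit(X_tr, y_tr)

# Impurity-based importances, one per feature, normalized to sum to 1.
importances = pd.Series(
    forest.feature_importances_, index=X_demo.columns
).sort_values(ascending=False)

print(importances.head(10))
```

Impurity-based importances can favor features with many distinct values, so `sklearn.inspection.permutation_importance` on the test split is a common cross-check.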

Overall Implications:

The results suggest that machine learning models, particularly tree-based ensembles, can separate the two diagnostic classes in this dataset with high accuracy, which is promising for supporting early detection. Such models are best viewed as a complement to traditional diagnostic methods rather than a replacement.

In summary, the six models evaluated here differed widely in their performance on breast cancer prediction. Gradient Boosting performed best at 0.9550 test accuracy, while Logistic Regression and Support Vector Machine predicted a single class for every test sample. Scaling the features and validating across multiple splits are natural next steps before drawing firmer conclusions.
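
One way to probe the weak Logistic Regression, K-Nearest Neighbors, and Support Vector Machine results is to standardize the features before fitting. The sketch below is an assumption about a likely remedy, not a confirmed diagnosis of the notebook's results. It again uses scikit-learn's bundled `load_breast_cancer` data as a stand-in for the CSV, and `max_iter=1000` is an illustrative choice to help the solver converge. Wrapping the scaler in a pipeline means it is fitted on the training split only.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in data: scikit-learn's bundled copy, not the notebook's CSV.
X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.35, random_state=101, stratify=y_demo
)

# Each pipeline standardizes features (zero mean, unit variance) before the model.
scaled_models = {
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "K-Nearest Neighbors": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "Support Vector Machine": make_pipeline(StandardScaler(), SVC()),
}

scaled_accuracy = {}
for name, model in scaled_models.items():
    model.fit(X_tr, y_tr)                       # scaler statistics come from X_tr only
    scaled_accuracy[name] = model.score(X_te, y_te)
    print(f"{name}: {scaled_accuracy[name]:.4f}")
```

Whether the same change helps on the notebook's CSV would need to be checked there, ideally after the identifier columns are removed.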