<a href="https://colab.research.google.com/github/evanssamwel/Machine-Learning/blob/main/ML_%7C_Cancer_cell_classification_using_Scikit_learn2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

ML | Cancer cell classification using Scikit-learn
Machine learning is used in solving real-world problems including medical diagnostics. One such application is classifying cancer cells based on their features and determining whether they are 'malignant' or 'benign'. In this article, we will use Scikit-learn to build a classifier for cancer cell detection.

Overview of the Dataset
The Breast Cancer Wisconsin (Diagnostic) dataset consists of:

569 instances (tumor samples)
30 attributes (features), including radius, texture, perimeter, and area of tumors
Two classification labels:
0 (Malignant) : Cancerous
1 (Benign) : Non-cancerous
We will use these features to train and evaluate our machine learning model.

Implementing Cancer cell classification in Python
Below is the step-by-step implementation:

1. Importing Necessary Modules and Dataset
We will use pandas, matplotlib and scikit-learn for this.




from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
import pandas as pd
import matplotlib.pyplot as plt
2. Loading the Dataset into a Variable
For this project, we will use the Breast Cancer Wisconsin (Diagnostic) dataset which is available in Scikit-learn’s datasets module. We use the load_breast_cancer() function to load the dataset.




data = load_breast_cancer()
3. Exploring the Dataset
Before training the model, let's examine the dataset to understand how the data is structured and labeled. We will use the pandas module to create a DataFrame, and the df.sample() function to fetch a few random records from the data.





df=pd.DataFrame(data.data,columns=data.feature_names)
df.sample(5)
Output:

[Figure: Dataset sample]
To explore the data types of the columns in our dataset we will use the df.info() function. It will help us to understand the categorical and numerical columns in our dataset.




df.info()
Output:

[Figure: Dataset Info]
To investigate the numerical columns we will use the df.describe() function. This function provides key summary statistics such as the mean, standard deviation, minimum and maximum values for each numerical column. It helps us understand the distribution and scale of the data which is crucial for preprocessing and model performance.




df.describe()
Output:

[Figure: Described Dataset]
We must also analyze data.target to understand the distribution of malignant and benign cases as class imbalance can affect model performance.




df2=pd.DataFrame(data.target,columns=['target'])
df2.sample(5)
Output:

[Figure: Data Distribution]
Plotting a pie chart will help us understand the distribution of the target values.




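class_counts=df2["target"].value_counts()
# color each slice by its label: 0 (malignant) in red, 1 (benign) in green
colors=['red' if label == 0 else 'green' for label in class_counts.index]
plt.pie(class_counts, labels=class_counts.index, autopct='%1.2f%%', colors=colors)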
Output:

[Figure: Pie Chart]
A dataset is often considered imbalanced when the minority class makes up less than about 30% of the total samples. In this case the minority class (malignant) accounts for roughly 37% of the samples (212 of 569), which is acceptable. In cases of imbalance we can use techniques like oversampling, undersampling or class weighting.
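The minority share can be checked directly from the class counts. The sketch below hard-codes the counts for this dataset (212 malignant, 357 benign); the 30% threshold is only the rule of thumb mentioned in the text, and the helper name `minority_share` is hypothetical.

```python
def minority_share(class_counts):
    """Return the fraction of samples that belong to the smallest class."""
    total = sum(class_counts.values())
    return min(class_counts.values()) / total

# Class counts of the Breast Cancer Wisconsin (Diagnostic) dataset
counts = {"malignant": 212, "benign": 357}
share = minority_share(counts)

# Rule-of-thumb threshold from the text (an assumption, not a fixed standard)
IMBALANCE_THRESHOLD = 0.30
print(f"Minority class share: {share:.2%}")
print("Imbalanced" if share < IMBALANCE_THRESHOLD else "Acceptably balanced")
```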

4. Splitting the Data into Training and Testing Sets
To evaluate our classifier we split the dataset into training and test sets using train_test_split(). Here 33% of the data is used for testing while the remaining 67% is used for training.




X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.33, random_state=42)
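To see what the 33/67 split means in sample counts, the sketch below reproduces the size arithmetic. It assumes that scikit-learn rounds the test size up when `test_size` is a fraction and gives the remainder to training; the helper `split_sizes` is hypothetical.

```python
import math

def split_sizes(n_samples, test_size):
    """Sizes of the train and test sets for a fractional test_size."""
    n_test = math.ceil(test_size * n_samples)   # test share is rounded up
    n_train = n_samples - n_test                # the rest is used for training
    return n_train, n_test

n_train, n_test = split_sizes(569, 0.33)
print(n_train, n_test)
```

For the 569 samples in this dataset this gives 381 training samples and 188 test samples.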
5. Building and Training the Model
We use the Gaussian Naive Bayes algorithm, a simple probabilistic classifier that often works well for binary classification tasks. The fit() function trains the model on the training dataset.




model = GaussianNB()
model.fit(X_train, y_train)
Output:

[Figure: Model Training]
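To make the training step less of a black box, here is a minimal one-feature sketch of what Gaussian Naive Bayes does: it estimates a prior, mean and variance per class, then picks the class with the highest prior times Gaussian likelihood. The toy radius values and the helper names are made up for illustration; scikit-learn's `GaussianNB` additionally applies variance smoothing and combines many features.

```python
import math

def fit_gaussian_nb(values, labels):
    """Estimate prior, mean and variance of one feature for each class."""
    params = {}
    for cls in set(labels):
        xs = [v for v, l in zip(values, labels) if l == cls]
        mean = sum(xs) / len(xs)
        var = sum((x - mean) ** 2 for x in xs) / len(xs)
        params[cls] = (len(xs) / len(values), mean, var)
    return params

def predict_gaussian_nb(params, x):
    """Return the class with the largest log prior + log likelihood."""
    def log_score(prior, mean, var):
        log_likelihood = -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)
        return math.log(prior) + log_likelihood
    return max(params, key=lambda cls: log_score(*params[cls]))

# Hypothetical toy data: mean radius values, 0 = malignant, 1 = benign
radius = [18.0, 20.5, 19.7, 11.4, 12.5, 13.1]
target = [0, 0, 0, 1, 1, 1]

params = fit_gaussian_nb(radius, target)
print(predict_gaussian_nb(params, 19.0), predict_gaussian_nb(params, 12.0))
```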
6. Making Predictions
Now we use our trained model to predict the classification of cancer cells in the test set. The output is an array of 0s and 1s representing predicted tumor classifications.




y_pred = model.predict(X_test)
print(y_pred[:10])
Output:

[1 0 0 1 1 0 0 0 1 1]

7. Evaluating Model Accuracy
To measure how well our model performs we will compare its predictions with the actual labels to calculate its accuracy. We will use accuracy_score from the sklearn.metrics library.




accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy * 100:.2f}%")
Output:

Model Accuracy: 94.15%

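This means our Naive Bayes classifier correctly classified 94.15% of the tumors in the test set as malignant or benign. This is a solid baseline, but accuracy alone does not show how many malignant tumors were missed, so the model should be validated further before being used for medical diagnostics.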
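Accuracy is simply the share of predictions that match the true labels. The sketch below computes it by hand, together with recall for the malignant class (label 0), i.e. the share of malignant tumors that the model catches. The label lists are hypothetical and are not the notebook's actual test set.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def recall(y_true, y_pred, positive):
    """Fraction of truly positive samples that were predicted as positive."""
    predictions_for_positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    return predictions_for_positives.count(positive) / len(predictions_for_positives)

# Hypothetical labels: 0 = malignant, 1 = benign
y_true_toy = [1, 0, 0, 1, 1, 0, 0, 0, 1, 1]
y_pred_toy = [1, 0, 1, 1, 1, 0, 0, 0, 1, 0]

print(f"Accuracy: {accuracy(y_true_toy, y_pred_toy):.2f}")
print(f"Malignant recall: {recall(y_true_toy, y_pred_toy, positive=0):.2f}")
```

With scikit-learn, the same quantities are available as `accuracy_score(y_true, y_pred)` and `recall_score(y_true, y_pred, pos_label=0)` from `sklearn.metrics`.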

In [None]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
# prompt: upload dataset

from google.colab import files

uploaded = files.upload()

for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))


Saving wdbc.data to wdbc.data
Saving wdbc.names to wdbc.names
User uploaded file "wdbc.data" with length 124103 bytes
User uploaded file "wdbc.names" with length 4708 bytes


In [None]:
# prompt: i  want to  the load dataset into a variable , specifically the wdbc.data csv  data = load_breast_cancer()

# Load the uploaded 'wdbc.data' file by name: two files were uploaded,
# so relying on the first key of `uploaded` would be fragile
filename = 'wdbc.data'
data = pd.read_csv(filename, header=None) # The file has no header row
print(data.head())


         0  1      2      3       4       5        6        7       8   \
0    842302  M  17.99  10.38  122.80  1001.0  0.11840  0.27760  0.3001   
1    842517  M  20.57  17.77  132.90  1326.0  0.08474  0.07864  0.0869   
2  84300903  M  19.69  21.25  130.00  1203.0  0.10960  0.15990  0.1974   
3  84348301  M  11.42  20.38   77.58   386.1  0.14250  0.28390  0.2414   
4  84358402  M  20.29  14.34  135.10  1297.0  0.10030  0.13280  0.1980   

        9   ...     22     23      24      25      26      27      28      29  \
0  0.14710  ...  25.38  17.33  184.60  2019.0  0.1622  0.6656  0.7119  0.2654   
1  0.07017  ...  24.99  23.41  158.80  1956.0  0.1238  0.1866  0.2416  0.1860   
2  0.12790  ...  23.57  25.53  152.50  1709.0  0.1444  0.4245  0.4504  0.2430   
3  0.10520  ...  14.91  26.50   98.87   567.7  0.2098  0.8663  0.6869  0.2575   
4  0.10430  ...  22.54  16.67  152.20  1575.0  0.1374  0.2050  0.4000  0.1625   

       30       31  
0  0.4601  0.11890  
1  0.2750  0.08902  
2  0.
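The raw `wdbc.data` file has no header: column 0 is the sample ID, column 1 is the diagnosis (`M` or `B`) and the remaining columns are the features. A minimal sketch of separating these parts is shown below. As inline data it uses only the first five columns of two rows printed above plus one made-up benign row, and it encodes `M` as 0 and `B` as 1 to match the labels of `load_breast_cancer()`. The helper name `parse_wdbc` is hypothetical.

```python
import csv
import io

# First five columns of two rows from the output above, plus one made-up benign row
raw = """842302,M,17.99,10.38,122.80
842517,M,20.57,17.77,132.90
999999,B,12.00,15.00,78.00
"""

def parse_wdbc(text):
    """Split header-less WDBC rows into IDs, features and 0/1 labels."""
    label_map = {"M": 0, "B": 1}  # same encoding as load_breast_cancer()
    ids, features, labels = [], [], []
    for row in csv.reader(io.StringIO(text)):
        ids.append(row[0])
        labels.append(label_map[row[1]])
        features.append([float(value) for value in row[2:]])
    return ids, features, labels

ids, features, labels = parse_wdbc(raw)
print(labels)
print(features[0])
```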

In [None]:
# prompt: i want to do the following ,,Splitting the Data into Training and Testing Sets   Here 33% of the data is used for testing while the remaining 67% is used for training.Building and Training the Model
# We use Naive Bayes algorithm which is effective for binary classification tasks.  Making Predictions
# Now we use our trained model to predict the classification of cancer cells in the test set. The output is an array of 0s and 1s representing predicted tumor classifications

# Load the dataset
data = load_breast_cancer()

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.33, random_state=42)

# Build and Train the Model
model = GaussianNB()
model.fit(X_train, y_train)

# Making Predictions
y_pred = model.predict(X_test)
print(y_pred[:10])

[1 0 0 1 1 0 0 0 1 1]


In [None]:
# prompt:  Evaluating Model Accuracy
# To measure how well our model performs we will compare its predictions with the actual labels to calculate its accuracy.

accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy * 100:.2f}%")

Model Accuracy: 94.15%


ML | Kaggle Breast Cancer Wisconsin Diagnosis using KNN and Cross Validation
Dataset : It is provided by Kaggle, in one of its challenges, and comes from the UCI Machine Learning Repository. It is a dataset of breast cancer patients with malignant and benign tumors. The k-nearest neighbours (KNN) algorithm is used to predict whether a patient has cancer (malignant tumor) or not (benign tumor). Below is an implementation of the KNN algorithm for classification.
Code : Importing Libraries

# performing linear algebra
import numpy as np

# data processing
import pandas as pd

# visualisation
import matplotlib.pyplot as plt
import seaborn as sns
Code : Loading dataset

df = pd.read_csv("..\\breast-cancer-wisconsin-data\\data.csv")

# show the first rows of the loaded DataFrame
print(df.head())
Output :

Code: Data Info

df.info()
Output :
RangeIndex: 569 entries, 0 to 568
Data columns (total 33 columns):
id                         569 non-null int64
diagnosis                  569 non-null object
radius_mean                569 non-null float64
texture_mean               569 non-null float64
perimeter_mean             569 non-null float64
area_mean                  569 non-null float64
smoothness_mean            569 non-null float64
compactness_mean           569 non-null float64
concavity_mean             569 non-null float64
concave points_mean        569 non-null float64
symmetry_mean              569 non-null float64
fractal_dimension_mean     569 non-null float64
radius_se                  569 non-null float64
texture_se                 569 non-null float64
perimeter_se               569 non-null float64
area_se                    569 non-null float64
smoothness_se              569 non-null float64
compactness_se             569 non-null float64
concavity_se               569 non-null float64
concave points_se          569 non-null float64
symmetry_se                569 non-null float64
fractal_dimension_se       569 non-null float64
radius_worst               569 non-null float64
texture_worst              569 non-null float64
perimeter_worst            569 non-null float64
area_worst                 569 non-null float64
smoothness_worst           569 non-null float64
compactness_worst          569 non-null float64
concavity_worst            569 non-null float64
concave points_worst       569 non-null float64
symmetry_worst             569 non-null float64
fractal_dimension_worst    569 non-null float64
Unnamed: 32                0 non-null float64
dtypes: float64(31), int64(1), object(1)
memory usage: 146.8+ KB
Code: We are dropping columns - 'id' and 'Unnamed: 32' as they have no role in prediction

# drop() returns a new DataFrame, so assign the result back
df = df.drop(['Unnamed: 32', 'id'], axis = 1)
print(df.shape)
Output:
(569, 31)
Code: Converting the diagnosis value of M and B to a numerical value where M (Malignant) = 1 and B (Benign) = 0

def diagnosis_value(diagnosis):
    if diagnosis == 'M':
        return 1
    else:
        return 0

df['diagnosis'] = df['diagnosis'].apply(diagnosis_value)
Code :

sns.lmplot(x = 'radius_mean', y = 'texture_mean', hue = 'diagnosis', data = df)
Output:

Code :

sns.lmplot(x ='smoothness_mean', y = 'compactness_mean',
           data = df, hue = 'diagnosis')
Output:

Code : Input and Output data

X = np.array(df.iloc[:, 1:])
y = np.array(df['diagnosis'])
Code : Splitting data to training and testing

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size = 0.33, random_state = 42)
Code : Using Sklearn

from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors = 13)
knn.fit(X_train, y_train)
Output:
KNeighborsClassifier(algorithm='auto', leaf_size=30,
             metric='minkowski', metric_params=None,
             n_jobs=None, n_neighbors=13, p=2,
             weights='uniform')
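
For intuition, the prediction step of k-nearest neighbours can be sketched without scikit-learn: compute the distance from the query point to every training point, keep the k closest, and take a majority vote. The toy points and the helper `knn_predict` are hypothetical; `KNeighborsClassifier` uses the same Euclidean distance by default (Minkowski with p=2) but with a faster neighbour search.

```python
import math
from collections import Counter

def knn_predict(X_train, y_train, query, k):
    """Predict the label of `query` by majority vote among its k nearest points."""
    distances = sorted(
        (math.dist(point, query), label) for point, label in zip(X_train, y_train)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical 2-feature samples: 1 = malignant, 0 = benign
X_toy = [(20.0, 25.0), (19.0, 22.0), (21.0, 24.0), (12.0, 15.0), (11.0, 14.0), (13.0, 16.0)]
y_toy = [1, 1, 1, 0, 0, 0]

print(knn_predict(X_toy, y_toy, query=(18.0, 21.0), k=3))
print(knn_predict(X_toy, y_toy, query=(12.5, 15.5), k=3))
```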
Code : Prediction Score

knn.score(X_test, y_test)
Output:
0.9627659574468085
Code : Performing Cross Validation

neighbors = []
cv_scores = []

from sklearn.model_selection import cross_val_score
# perform 10 fold cross validation
for k in range(1, 51, 2):
    neighbors.append(k)
    knn = KNeighborsClassifier(n_neighbors = k)
    scores = cross_val_score(
        knn, X_train, y_train, cv = 10, scoring = 'accuracy')
    cv_scores.append(scores.mean())
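
The idea behind 10-fold cross validation is to split the training data into ten parts, hold each part out once for validation, and average the ten scores. The sketch below shows only the index bookkeeping, using a hypothetical helper `k_fold_indices` with contiguous, unshuffled folds. Note that for classifiers `cross_val_score` uses stratified folds by default, which this sketch does not reproduce.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_indices, validation_indices) for contiguous folds."""
    # The first n_samples % n_folds folds get one extra sample
    fold_sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, validation
        start += size

# 381 training samples (the 67% split of 569) divided into 10 folds
folds = list(k_fold_indices(381, 10))
print([len(validation) for _, validation in folds])
```

Every sample is used for validation exactly once, so the averaged score depends less on one particular split than a single hold-out estimate does.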
Code : Misclassification error versus k

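# misclassification error = 1 - cross-validated accuracy
# (the name MSE is kept, although this is not a mean squared error)
MSE = [1-x for x in cv_scores]

# determining the best k
optimal_k = neighbors[MSE.index(min(MSE))]
print('The optimal number of neighbors is %d' % optimal_k)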

# plot misclassification error versus k
plt.figure(figsize = (10, 6))
plt.plot(neighbors, MSE)
plt.xlabel('Number of neighbors')
plt.ylabel('Misclassification Error')
plt.show()
Output:
The optimal number of neighbors is 13

