### Detecting Breast Cancer Types: Malignant or Benign with Naive Bayes Algorithm

**Introduction:**<br>

Breast cancer is a critical health concern affecting millions of women worldwide. Early detection and accurate diagnosis are pivotal for successful treatment and improved patient outcomes. Machine learning algorithms have shown promise as an aid to medical diagnosis, and one of the simplest, the Naive Bayes classifier, can distinguish between malignant and benign breast tumors with good accuracy. In this post, we apply three Naive Bayes variants to scikit-learn's breast cancer dataset and compare how well each one separates the two tumor types.<br>

**Understanding Breast Cancer:**<br>

Breast cancer is a complex disease characterized by the uncontrolled growth of cells in the breast tissue. Tumors can be broadly classified into two types: malignant and benign. Malignant tumors are cancerous and can invade nearby tissues and spread to other parts of the body, while benign tumors are non-cancerous and do not spread. Early detection is crucial, as it enables prompt intervention and increases the chances of successful treatment.<br>

**The Role of Naive Bayes Algorithm:**<br>

The Naive Bayes algorithm is a probabilistic classification method based on Bayes' theorem, which calculates the probability of a hypothesis given the observed data. It is called "naive" because it assumes that the features are conditionally independent of one another given the class. That assumption rarely holds exactly, yet the algorithm has proven effective in various applications, including text classification and medical diagnosis.<br>
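
To make Bayes' theorem concrete, here is a minimal sketch of the posterior calculation for a single piece of evidence. The prior and likelihood values are made up for illustration and are not taken from the dataset.

```python
# Bayes' theorem: P(class | evidence) = P(evidence | class) * P(class) / P(evidence)
# All numbers below are hypothetical, chosen only to illustrate the arithmetic.
prior = {"malignant": 0.4, "benign": 0.6}       # P(class)
likelihood = {"malignant": 0.8, "benign": 0.1}  # P(evidence | class), e.g. "tumor radius is large"

# P(evidence) is the sum over classes of likelihood * prior
evidence = sum(likelihood[c] * prior[c] for c in prior)

# Posterior probability of each class given the evidence
posterior = {c: likelihood[c] * prior[c] / evidence for c in prior}
print(posterior)
```

With several features, Naive Bayes multiplies the per-feature likelihoods together for each class before normalizing, which is where the independence assumption enters.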

In [2]:
# import required libraries
import pandas as pd 
import numpy as np 
import seaborn as sns 
import matplotlib.pyplot as plt



In [3]:
# load the dataset
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()

In [4]:
data

{'data': array([[1.799e+01, 1.038e+01, 1.228e+02, ..., 2.654e-01, 4.601e-01,
         1.189e-01],
        [2.057e+01, 1.777e+01, 1.329e+02, ..., 1.860e-01, 2.750e-01,
         8.902e-02],
        [1.969e+01, 2.125e+01, 1.300e+02, ..., 2.430e-01, 3.613e-01,
         8.758e-02],
        ...,
        [1.660e+01, 2.808e+01, 1.083e+02, ..., 1.418e-01, 2.218e-01,
         7.820e-02],
        [2.060e+01, 2.933e+01, 1.401e+02, ..., 2.650e-01, 4.087e-01,
         1.240e-01],
        [7.760e+00, 2.454e+01, 4.792e+01, ..., 0.000e+00, 2.871e-01,
         7.039e-02]]),
 'target': array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
        0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0,
        1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0,
        1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1,
        1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0

**Data Collection:**<br> A dataset containing features such as tumor size, shape, and texture is needed to train the Naive Bayes model, and it must include labeled examples of both malignant and benign tumors. Here we use the Breast Cancer Wisconsin (Diagnostic) dataset that ships with scikit-learn, which provides 569 labeled samples with 30 numeric features each.

In [5]:
data.target_names

array(['malignant', 'benign'], dtype='<U9')
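
Before training, it helps to know how many examples of each class the dataset holds. A short sketch, assuming the same `load_breast_cancer` dataset; the variable name `counts` is only illustrative:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

# Count how many samples carry each label (0 = malignant, 1 = benign)
counts = np.bincount(data.target)
for name, count in zip(data.target_names, counts):
    print(f"{name}: {count}")
```

The classes are not perfectly balanced, so a model that always predicts the majority class would already score well above 50% accuracy. Keep that baseline in mind when reading the scores later on.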

In [6]:
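# convert the dataset into a DataFrame
# Note: the extra brackets around the column list create MultiIndex columns, which is why
# df.info() shows names like (mean radius,) and df["target"] returns a DataFrame, not a Series
df = pd.DataFrame(np.c_[data.data , data.target] , columns=[list(data.feature_names)+ ["target"]])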

In [7]:
df

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,target
0,17.99,10.38,122.80,1001.0,0.11840,0.27760,0.30010,0.14710,0.2419,0.07871,...,17.33,184.60,2019.0,0.16220,0.66560,0.7119,0.2654,0.4601,0.11890,0.0
1,20.57,17.77,132.90,1326.0,0.08474,0.07864,0.08690,0.07017,0.1812,0.05667,...,23.41,158.80,1956.0,0.12380,0.18660,0.2416,0.1860,0.2750,0.08902,0.0
2,19.69,21.25,130.00,1203.0,0.10960,0.15990,0.19740,0.12790,0.2069,0.05999,...,25.53,152.50,1709.0,0.14440,0.42450,0.4504,0.2430,0.3613,0.08758,0.0
3,11.42,20.38,77.58,386.1,0.14250,0.28390,0.24140,0.10520,0.2597,0.09744,...,26.50,98.87,567.7,0.20980,0.86630,0.6869,0.2575,0.6638,0.17300,0.0
4,20.29,14.34,135.10,1297.0,0.10030,0.13280,0.19800,0.10430,0.1809,0.05883,...,16.67,152.20,1575.0,0.13740,0.20500,0.4000,0.1625,0.2364,0.07678,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
564,21.56,22.39,142.00,1479.0,0.11100,0.11590,0.24390,0.13890,0.1726,0.05623,...,26.40,166.10,2027.0,0.14100,0.21130,0.4107,0.2216,0.2060,0.07115,0.0
565,20.13,28.25,131.20,1261.0,0.09780,0.10340,0.14400,0.09791,0.1752,0.05533,...,38.25,155.00,1731.0,0.11660,0.19220,0.3215,0.1628,0.2572,0.06637,0.0
566,16.60,28.08,108.30,858.1,0.08455,0.10230,0.09251,0.05302,0.1590,0.05648,...,34.12,126.70,1124.0,0.11390,0.30940,0.3403,0.1418,0.2218,0.07820,0.0
567,20.60,29.33,140.10,1265.0,0.11780,0.27700,0.35140,0.15200,0.2397,0.07016,...,39.42,184.60,1821.0,0.16500,0.86810,0.9387,0.2650,0.4087,0.12400,0.0


In [8]:
# Display first 5 rows of dataset
df.head()

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,target
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,0.0
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,0.0
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,0.0
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,0.0
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,0.0


In [9]:
# Display the summary of a DataFrame
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 31 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   (mean radius,)              569 non-null    float64
 1   (mean texture,)             569 non-null    float64
 2   (mean perimeter,)           569 non-null    float64
 3   (mean area,)                569 non-null    float64
 4   (mean smoothness,)          569 non-null    float64
 5   (mean compactness,)         569 non-null    float64
 6   (mean concavity,)           569 non-null    float64
 7   (mean concave points,)      569 non-null    float64
 8   (mean symmetry,)            569 non-null    float64
 9   (mean fractal dimension,)   569 non-null    float64
 10  (radius error,)             569 non-null    float64
 11  (texture error,)            569 non-null    float64
 12  (perimeter error,)          569 non-null    float64
 13  (area error,)               569 non

In [10]:
# Number of rows and columns in the DataFrame
df.shape

(569, 31)

In [11]:
# Display descriptive statistics.
df.describe()

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,target
count,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,...,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0
mean,14.127292,19.289649,91.969033,654.889104,0.09636,0.104341,0.088799,0.048919,0.181162,0.062798,...,25.677223,107.261213,880.583128,0.132369,0.254265,0.272188,0.114606,0.290076,0.083946,0.627417
std,3.524049,4.301036,24.298981,351.914129,0.014064,0.052813,0.07972,0.038803,0.027414,0.00706,...,6.146258,33.602542,569.356993,0.022832,0.157336,0.208624,0.065732,0.061867,0.018061,0.483918
min,6.981,9.71,43.79,143.5,0.05263,0.01938,0.0,0.0,0.106,0.04996,...,12.02,50.41,185.2,0.07117,0.02729,0.0,0.0,0.1565,0.05504,0.0
25%,11.7,16.17,75.17,420.3,0.08637,0.06492,0.02956,0.02031,0.1619,0.0577,...,21.08,84.11,515.3,0.1166,0.1472,0.1145,0.06493,0.2504,0.07146,0.0
50%,13.37,18.84,86.24,551.1,0.09587,0.09263,0.06154,0.0335,0.1792,0.06154,...,25.41,97.66,686.5,0.1313,0.2119,0.2267,0.09993,0.2822,0.08004,1.0
75%,15.78,21.8,104.1,782.7,0.1053,0.1304,0.1307,0.074,0.1957,0.06612,...,29.72,125.4,1084.0,0.146,0.3391,0.3829,0.1614,0.3179,0.09208,1.0
max,28.11,39.28,188.5,2501.0,0.1634,0.3454,0.4268,0.2012,0.304,0.09744,...,49.54,251.2,4254.0,0.2226,1.058,1.252,0.291,0.6638,0.2075,1.0


**Data Preprocessing:** <br>Cleaning and preprocessing the data are essential to ensure the model's accuracy. This may involve handling missing values, normalizing features, and converting categorical variables into a suitable format for the algorithm.

In [12]:
# Detect missing values
df.isnull().sum()

mean radius                0
mean texture               0
mean perimeter             0
mean area                  0
mean smoothness            0
mean compactness           0
mean concavity             0
mean concave points        0
mean symmetry              0
mean fractal dimension     0
radius error               0
texture error              0
perimeter error            0
area error                 0
smoothness error           0
compactness error          0
concavity error            0
concave points error       0
symmetry error             0
fractal dimension error    0
worst radius               0
worst texture              0
worst perimeter            0
worst area                 0
worst smoothness           0
worst compactness          0
worst concavity            0
worst concave points       0
worst symmetry             0
worst fractal dimension    0
target                     0
dtype: int64
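
The dataset has no missing values, so no imputation is needed. The normalization step mentioned above is not applied in this notebook; a minimal sketch of what it could look like, assuming scikit-learn's `StandardScaler` (my choice for illustration, not something the original code uses):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler

X = load_breast_cancer().data

# Rescale every feature to zero mean and unit variance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(np.round(X_scaled.mean(axis=0)[:3], 3))  # means of the first three features after scaling
print(np.round(X_scaled.std(axis=0)[:3], 3))   # standard deviations of the first three features
```

In a real workflow the scaler would be fitted on the training split only and then applied to the test split, so that no information from the test data leaks into training.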

In [14]:
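# Suppress warnings, e.g. the one scikit-learn raises when y is passed as a
# single-column DataFrame rather than a 1-D array
import warnings
warnings.filterwarnings("ignore")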

In [15]:
# Define input and output features
x = df.drop("target", axis = 1)
y = df["target"]

In [16]:
df["target"].head()

Unnamed: 0,target
0,0.0
1,0.0
2,0.0
3,0.0
4,0.0


In [17]:
x

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
0,17.99,10.38,122.80,1001.0,0.11840,0.27760,0.30010,0.14710,0.2419,0.07871,...,25.380,17.33,184.60,2019.0,0.16220,0.66560,0.7119,0.2654,0.4601,0.11890
1,20.57,17.77,132.90,1326.0,0.08474,0.07864,0.08690,0.07017,0.1812,0.05667,...,24.990,23.41,158.80,1956.0,0.12380,0.18660,0.2416,0.1860,0.2750,0.08902
2,19.69,21.25,130.00,1203.0,0.10960,0.15990,0.19740,0.12790,0.2069,0.05999,...,23.570,25.53,152.50,1709.0,0.14440,0.42450,0.4504,0.2430,0.3613,0.08758
3,11.42,20.38,77.58,386.1,0.14250,0.28390,0.24140,0.10520,0.2597,0.09744,...,14.910,26.50,98.87,567.7,0.20980,0.86630,0.6869,0.2575,0.6638,0.17300
4,20.29,14.34,135.10,1297.0,0.10030,0.13280,0.19800,0.10430,0.1809,0.05883,...,22.540,16.67,152.20,1575.0,0.13740,0.20500,0.4000,0.1625,0.2364,0.07678
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
564,21.56,22.39,142.00,1479.0,0.11100,0.11590,0.24390,0.13890,0.1726,0.05623,...,25.450,26.40,166.10,2027.0,0.14100,0.21130,0.4107,0.2216,0.2060,0.07115
565,20.13,28.25,131.20,1261.0,0.09780,0.10340,0.14400,0.09791,0.1752,0.05533,...,23.690,38.25,155.00,1731.0,0.11660,0.19220,0.3215,0.1628,0.2572,0.06637
566,16.60,28.08,108.30,858.1,0.08455,0.10230,0.09251,0.05302,0.1590,0.05648,...,18.980,34.12,126.70,1124.0,0.11390,0.30940,0.3403,0.1418,0.2218,0.07820
567,20.60,29.33,140.10,1265.0,0.11780,0.27700,0.35140,0.15200,0.2397,0.07016,...,25.740,39.42,184.60,1821.0,0.16500,0.86810,0.9387,0.2650,0.4087,0.12400


In [18]:
type(y)

pandas.core.frame.DataFrame

In [19]:
y

Unnamed: 0,target
0,0.0
1,0.0
2,0.0
3,0.0
4,0.0
...,...
564,0.0
565,0.0
566,0.0
567,0.0


In [21]:
# Split dataset into random train and test subsets
from sklearn.model_selection import train_test_split
xtrain, xtest, ytrain, ytest = train_test_split(x,y,test_size=0.3, random_state=31)

In [22]:
xtrain.shape

(398, 30)

In [23]:
ytrain.shape

(398, 1)

In [24]:
xtest.shape

(171, 30)

In [25]:
ytest.shape

(171, 1)
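
The split above is purely random, so the class proportions in the two subsets can drift apart slightly. A sketch of a stratified alternative, assuming you want both subsets to keep the dataset's malignant/benign ratio; the `stratify=y` argument and the variable names here are not part of the original notebook:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# stratify=y keeps the class ratio (almost) identical in the train and test subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=31, stratify=y
)

print(X_train.shape, X_test.shape)
print("benign share in train:", y_train.mean())
print("benign share in test: ", y_test.mean())
```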

**Training the Model:** <br>
The Naive Bayes algorithm learns from the training dataset to build a probabilistic model. It calculates the likelihood of each feature given the class (malignant or benign) and the prior probability of each class.

**Gaussian Naive Bayes (GaussianNB)** is the Naive Bayes variant for continuous features. It assumes that, within each class, every feature follows a normal (Gaussian) distribution, and it estimates the mean and variance of each feature per class from the training data. Because all 30 features in this dataset are continuous measurements, it is a natural first choice here.

In [29]:
# import the Gaussian Naive Bayes classifier and train the model
from sklearn.naive_bayes import GaussianNB
GNB = GaussianNB()
GNB.fit(xtrain,ytrain)

GaussianNB()

In [52]:
# score the model on the held-out test set
print("The test score of GNB model is  ", GNB.score(xtest,ytest))

The test score of GNB model is   0.9239766081871345


**Multinomial Naive Bayes** is a variant designed for discrete data that assumes a multinomial distribution of features. It is widely used in natural language processing, where the features are the frequencies of words or terms in documents, for tasks such as spam filtering, sentiment analysis, and document categorization. It is simple and computationally efficient. The features in this dataset are continuous measurements rather than counts, so the multinomial assumption fits poorly here; the model is included for comparison.

In [28]:
# import the Multinomial Naive Bayes classifier, train the model, and score it on the test set
from sklearn.naive_bayes import MultinomialNB
MNB = MultinomialNB()
MNB.fit(xtrain,ytrain)
print("The test score of MNB model is  ", MNB.score(xtest,ytest))

The test score of MNB model is   0.8947368421052632


**Bernoulli Naive Bayes** is a variant of the Naive Bayes algorithm designed for binary features: each feature is treated as either present or absent. It suits problems where only the existence of a feature matters, such as spam detection or sentiment analysis over word-occurrence indicators, and it is simple and computationally cheap. scikit-learn's BernoulliNB binarizes its inputs at a threshold (`binarize=0.0` by default), so on this dataset nearly every positive measurement collapses to 1 and most of the information is lost, which likely explains the low score below.

In [30]:
# import the Bernoulli Naive Bayes classifier, train the model, and score it on the test set
from sklearn.naive_bayes import BernoulliNB
BNB = BernoulliNB()
BNB.fit(xtrain,ytrain)
print("The test score of BNB model is  ", BNB.score(xtest,ytest))

The test score of BNB model is   0.6374269005847953
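
The three scores above come from a single train/test split, so they depend on the chosen `random_state`. A sketch of a more stable comparison using 5-fold cross-validation; the choice of `cv=5` and the dictionary names are assumptions for illustration:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

X, y = load_breast_cancer(return_X_y=True)

models = {
    "GaussianNB": GaussianNB(),
    "MultinomialNB": MultinomialNB(),
    "BernoulliNB": BernoulliNB(),
}

# Average accuracy over 5 folds for each Naive Bayes variant
mean_scores = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    mean_scores[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} (+/- {scores.std():.3f})")
```

Averaging over several folds gives a fairer picture of how each variant handles continuous features than any single split can.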


**Prediction for New Data:**<br>
Now, with a trained model, you can use new, unseen data to make predictions. Extract features from the new data, preprocess them, and input them into the Naive Bayes model to predict whether the tumor is malignant or benign.

In [31]:
# Create a list with the patient's value for each of the 30 input features
patient_info1 = [17.99,
 10.38,
 122.8,
 1001.0,
 0.1184,
 0.2776,
 0.3001,
 0.1471,
 0.2419,
 0.07871,
 1.095,
 0.9053,
 8.589,
 153.4,
 0.006399,
 0.04904,
 0.05373,
 0.01587,
 0.03003,
 0.006193,
 25.38,
 17.33,
 184.6,
 2019.0,
 0.1622,
 0.6656,
 0.7119,
 0.2654,
 0.4601,
 0.1189]

In [32]:
len(patient_info1)

30

In [33]:
# convert the list to a 2-D array of shape (1, 30): predict expects one row per sample
patient_info1 = np.array([patient_info1])
patient_info1

array([[1.799e+01, 1.038e+01, 1.228e+02, 1.001e+03, 1.184e-01, 2.776e-01,
        3.001e-01, 1.471e-01, 2.419e-01, 7.871e-02, 1.095e+00, 9.053e-01,
        8.589e+00, 1.534e+02, 6.399e-03, 4.904e-02, 5.373e-02, 1.587e-02,
        3.003e-02, 6.193e-03, 2.538e+01, 1.733e+01, 1.846e+02, 2.019e+03,
        1.622e-01, 6.656e-01, 7.119e-01, 2.654e-01, 4.601e-01, 1.189e-01]])

In [34]:
# Prediction based on Gaussian Naive Bayes
GNB.predict(patient_info1)

array([0.])

In [35]:
data.target_names

array(['malignant', 'benign'], dtype='<U9')

In [36]:
pred = GNB.predict(patient_info1)

In [37]:
if pred[0] == 0:
    print("Patient has cancer (malignant tumor)")
else:
    print("Patient has no cancer (benign tumor)")

Patient has cancer (malignant tumor)
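
Instead of hard-coding the meaning of 0 and 1, the predicted label can be looked up in `data.target_names`, and `predict_proba` shows how confident the model is. A minimal sketch, assuming a model refitted on the same split with a 1-D integer label array; the patient values are the first row of the dataset, which matches `patient_info1` above:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, random_state=31
)
model = GaussianNB().fit(X_train, y_train)

# One patient as a 2-D array of shape (1, 30)
patient = data.data[[0]]

# Map the numeric prediction back to its class name
label = data.target_names[model.predict(patient)[0]]

# Probability assigned to each class, in the order of data.target_names
proba = model.predict_proba(patient)[0]

print("Predicted class:", label)
for name, p in zip(data.target_names, proba):
    print(f"P({name}) = {p:.4f}")
```

Naive Bayes probabilities tend to be pushed toward 0 or 1 because of the independence assumption, so treat them as a ranking signal rather than a calibrated risk estimate.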


**Conclusion:**<br>

Applying the Naive Bayes algorithm to classify breast tumors as malignant or benign is a promising avenue in medical diagnosis. In this notebook, Gaussian Naive Bayes reached about 92% accuracy on the held-out test set, ahead of the multinomial (about 89%) and Bernoulli (about 64%) variants, which shows how much the choice of variant matters when the features are continuous. Models like these can support healthcare professionals in making accurate and timely decisions, and as research and technology advance, integrating machine learning algorithms such as Naive Bayes into medical practice offers hope for more effective and personalized healthcare.