# **Naive Bayes Classifier For Classifying Whether The Tumor Is Benign or Malignant**
***

**What is Naive Bayes algorithm?**

Naive Bayes is a classification technique based on Bayes' Theorem (*probability theory*) with the assumption that all the features that predict the target value are independent of each other. In simple terms, a Naive Bayes classifier assumes that, within a class, the presence of a particular feature is unrelated to the presence of any other feature.

> A Naive Bayes model is easy to build and particularly useful for very large data sets. Despite its simplicity, it can be competitive with, and on some problems outperform, more sophisticated classification methods.

Bayes' theorem provides a way of calculating the posterior probability P(c|x) - *(read as probability of **c** given **x**)* - from P(c), P(x) and P(x|c). Look at the equation below:
>
> $$\mathbf{P} \left({c \mid x} \right) = \frac{\mathbf{P} \left ({x \mid c} \right) \mathbf{P} \left({c} \right)}{\mathbf{P} \left( {x} \right)}$$

where,

* *x is the set of features*
* *c is the set of classes*
* P(c|x) is the posterior probability of class (c, target) given predictor (x, attributes).
* P(c) is the prior probability of class **c**.
* P(x|c) is the observation density or likelihood, which is the probability of the predictor (the query **x**) given the class.
* P(x) is the prior probability of predictor **x**, also called the evidence.
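
As a small worked illustration of how the terms above combine, the sketch below computes the posterior for two classes. The prior and likelihood numbers are hypothetical and chosen only to make the arithmetic easy to follow; they are not taken from the breast cancer data.

```python
# Hypothetical numbers, chosen only to illustrate the formula.
prior = {"B": 0.6, "M": 0.4}          # P(c)
likelihood = {"B": 0.05, "M": 0.30}   # P(x | c) for one observed x

# Evidence: P(x) = sum over classes of P(x | c) * P(c)
evidence = sum(likelihood[c] * prior[c] for c in prior)

# Posterior: P(c | x) = P(x | c) * P(c) / P(x)
posterior = {c: likelihood[c] * prior[c] / evidence for c in prior}

# The predicted class is the one with the largest posterior.
predicted = max(posterior, key=posterior.get)
print(posterior, predicted)
```

Because the evidence is the same for every class, it only rescales the scores; the predicted class would be the same if the division were skipped.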

**Why should we use Naive Bayes?**

* As stated above, it is **_easy_** to build and is particularly useful for **_very large data sets_**.
* It is **extremely fast** for both training and prediction.
* It provides straightforward probabilistic predictions.
* It is often very easily interpretable.
* It has very few (if any) tunable parameters.
* It performs well with categorical input variables compared to numerical variables. For numerical variables, a normal distribution is assumed (a bell curve, which is a strong assumption).
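
To make the normal-distribution assumption concrete, here is a minimal sketch of the likelihood for two numerical features under the independence assumption. The means, standard deviations and query point are hypothetical values, not statistics of the dataset used in this notebook.

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

# Hypothetical per-class statistics for two numerical features.
mean = np.array([14.0, 19.0])
std = np.array([3.5, 4.0])
x = np.array([12.5, 21.0])   # hypothetical query point

# Naive assumption: P(x | c) is the product of per-feature normal densities.
naive_likelihood = np.prod(norm.pdf(x, loc=mean, scale=std))

# Equivalent form: a multivariate normal with a diagonal covariance matrix.
diag_likelihood = multivariate_normal.pdf(x, mean=mean, cov=np.diag(std ** 2))

print(naive_likelihood, diag_likelihood)
```

The two values agree because a diagonal covariance matrix encodes exactly the independence assumption. A full covariance matrix, by contrast, also models correlation between the features.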

In [1]:
# Importing the libraries

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import random 
import scipy.stats as S

In [2]:
# Importing the dataset

Data = pd.read_csv("Breast Cancer Data.csv")
Data.dropna(axis=1,inplace=True)

In [3]:
Data

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
0,842302,M,17.99,10.38,122.80,1001.0,0.11840,0.27760,0.30010,0.14710,...,25.380,17.33,184.60,2019.0,0.16220,0.66560,0.7119,0.2654,0.4601,0.11890
1,842517,M,20.57,17.77,132.90,1326.0,0.08474,0.07864,0.08690,0.07017,...,24.990,23.41,158.80,1956.0,0.12380,0.18660,0.2416,0.1860,0.2750,0.08902
2,84300903,M,19.69,21.25,130.00,1203.0,0.10960,0.15990,0.19740,0.12790,...,23.570,25.53,152.50,1709.0,0.14440,0.42450,0.4504,0.2430,0.3613,0.08758
3,84348301,M,11.42,20.38,77.58,386.1,0.14250,0.28390,0.24140,0.10520,...,14.910,26.50,98.87,567.7,0.20980,0.86630,0.6869,0.2575,0.6638,0.17300
4,84358402,M,20.29,14.34,135.10,1297.0,0.10030,0.13280,0.19800,0.10430,...,22.540,16.67,152.20,1575.0,0.13740,0.20500,0.4000,0.1625,0.2364,0.07678
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
564,926424,M,21.56,22.39,142.00,1479.0,0.11100,0.11590,0.24390,0.13890,...,25.450,26.40,166.10,2027.0,0.14100,0.21130,0.4107,0.2216,0.2060,0.07115
565,926682,M,20.13,28.25,131.20,1261.0,0.09780,0.10340,0.14400,0.09791,...,23.690,38.25,155.00,1731.0,0.11660,0.19220,0.3215,0.1628,0.2572,0.06637
566,926954,M,16.60,28.08,108.30,858.1,0.08455,0.10230,0.09251,0.05302,...,18.980,34.12,126.70,1124.0,0.11390,0.30940,0.3403,0.1418,0.2218,0.07820
567,927241,M,20.60,29.33,140.10,1265.0,0.11780,0.27700,0.35140,0.15200,...,25.740,39.42,184.60,1821.0,0.16500,0.86810,0.9387,0.2650,0.4087,0.12400


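# Principal Component Analysis (PCA)

PCA is generally used for dimensionality reduction.
In this notebook we use it to decide which features carry the most weight when computing the posterior probability: entries are kept, in order of decreasing eigenvalue, while their cumulative share of the total variance stays at or below 99.9%.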
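
The idea of ranking by cumulative variance share can be sketched on synthetic data as follows. This is an illustrative sketch and not the notebook's own function: the data, the 99% threshold and the use of `np.linalg.eigh` (NumPy's eigen-decomposition routine for symmetric matrices) are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: three columns with very different variances (hypothetical).
X = rng.normal(size=(500, 3)) * np.array([10.0, 1.0, 0.1])

# Centre each column, then compute the covariance matrix (columns are variables).
centred = X - X.mean(axis=0)
cov = np.cov(centred, rowvar=False)

# eigh targets symmetric matrices and returns real eigenvalues in ascending order.
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]          # descending order
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Cumulative share of the total variance explained by the leading components.
explained = np.cumsum(eigenvalues) / eigenvalues.sum()

# Smallest number of components whose cumulative share reaches 99%.
n_components = int(np.searchsorted(explained, 0.99) + 1)

# Project the centred data onto the retained components.
projected = centred @ eigenvectors[:, :n_components]
print(explained, n_components, projected.shape)
```

Note that the retained directions are combinations of the original columns, so the projection step is what actually reduces the dimensionality.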

In [4]:
def PCA(Data):
    Data_array = np.array(Data)

    # ______ Centre the data (zero mean per column) ______
    Means_value = np.mean(Data_array, axis=0)  # mean of each column
    Means_value = Means_value.reshape(1, Data_array.shape[1])

    Centred_value = Data_array - Means_value  # subtract each column's mean from its values

    # ___ Covariance matrix of the centred data (columns are variables) ___
    Covariance_matrix = np.cov(Centred_value, rowvar=0)

    # __ Eigenvalues and eigenvectors of the covariance matrix __
    values, vectors = np.linalg.eig(Covariance_matrix)
    values = values.reshape(1, len(values))

    values_index = np.argsort(values)  # indices that sort the eigenvalues in ascending order
    values_index = values_index[0]
    values_index = values_index[::-1]  # reverse to descending order

    values = values[:, values_index]  # eigenvalues in descending order

    # _______ Cumulative share of the total variance _______
    weightage_of_features = np.cumsum(values) / np.sum(values)

    features_list_index = []  # indices kept while the cumulative share is <= 99.9%

    for i in range(0, len(weightage_of_features)):
        weightage_in_percent = weightage_of_features[i] * 100

        if weightage_in_percent <= 99.9:
            features_list_index.append(values_index[i])

    return features_list_index

In [5]:
no_features = PCA(Data.iloc[:, 2:])
print(no_features)

[0, 1]


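> Applying PCA to our dataset returns offsets 0 and 1 into `Data.iloc[:, 2:]`, which this notebook reads as columns 2+0 and 2+1, i.e. indices 2 and 3 of our data: radius_mean and texture_mean. Strictly speaking, these offsets index the sorted eigenvalues (principal components) rather than the original columns, so treating them as column positions is a simplification.

> *Now split our Data into training and testing sets*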

In [6]:
train, test = train_test_split(Data, test_size=0.3)
test

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
282,89122,M,19.40,18.18,127.20,1145.0,0.10370,0.14420,0.16260,0.09464,...,23.79,28.65,152.40,1628.0,0.1518,0.3749,0.43160,0.22520,0.3590,0.07787
148,86973702,B,14.44,15.18,93.97,640.1,0.09970,0.10210,0.08487,0.05532,...,15.85,19.85,108.60,766.9,0.1316,0.2735,0.31030,0.15990,0.2691,0.07683
558,925277,B,14.59,22.68,96.39,657.1,0.08473,0.13300,0.10290,0.03736,...,15.48,27.27,105.90,733.5,0.1026,0.3171,0.36620,0.11050,0.2258,0.08004
432,908194,M,20.18,19.54,133.80,1250.0,0.11330,0.14890,0.21330,0.12590,...,22.03,25.07,146.00,1479.0,0.1665,0.2942,0.53080,0.21730,0.3032,0.08075
281,8912055,B,11.74,14.02,74.24,427.3,0.07813,0.04340,0.02245,0.02763,...,13.31,18.26,84.70,533.7,0.1036,0.0850,0.06735,0.08290,0.3101,0.06688
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
155,8711003,B,12.25,17.94,78.27,460.3,0.08654,0.06679,0.03885,0.02331,...,13.59,25.22,86.60,564.2,0.1217,0.1788,0.19430,0.08211,0.3113,0.08132
237,883263,M,20.48,21.46,132.50,1306.0,0.08355,0.08348,0.09042,0.06022,...,24.22,26.17,161.70,1750.0,0.1228,0.2311,0.31580,0.14450,0.2238,0.07127
3,84348301,M,11.42,20.38,77.58,386.1,0.14250,0.28390,0.24140,0.10520,...,14.91,26.50,98.87,567.7,0.2098,0.8663,0.68690,0.25750,0.6638,0.17300
57,857793,M,14.71,21.59,95.55,656.9,0.11370,0.13650,0.12930,0.08123,...,17.87,30.70,115.70,985.5,0.1368,0.4290,0.35870,0.18340,0.3698,0.10940


># Training of Model and Separating By Class: { Benign, Malignant }

In [7]:
# Separating data by class
dataB = train[train['diagnosis'] == 'B']
dataM = train[train['diagnosis'] == 'M']

# This function returns the mean and covariance matrix of provided data
def calculate_mean_covMat(data):
    return data.iloc[:, 2:4].mean(), np.cov(data.iloc[:, 2:4],rowvar=0)

# Calculating mean and covariance matrix of Benign
BT_mean, BT_cov = calculate_mean_covMat(dataB)

# Calculating mean and covariance matrix of Malignant
MT_mean, MT_cov = calculate_mean_covMat(dataM)

# Calculating the P(B) and P(M) independently
P_B = dataB.shape[0]/train.shape[0]
P_M = dataM.shape[0]/train.shape[0]

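Before we go any further we should know the **Posterior Conditional Probability**, which is proportional to

$\mathbf{P} \left({c \mid x} \right) \propto \mathbf{P} \left ({x \mid c} \right) \mathbf{P} \left({c} \right)$

The evidence $\mathbf{P} \left({x} \right)$ is the same for both classes, so it can be dropped when the two classes are compared.

where, $\mathbf{P} \left ({x \mid c} \right)$ is the ***Observation Distribution***

And the mathematical formula of the Observation Distribution (a normal density over two features) is

$$ \frac {1}{(\sqrt{2\pi})^2\sqrt{|\Sigma|}}e^{-0.5 A^T \Sigma^{-1} A} $$

where,

* $ \Sigma $    is the covariance matrix of the class, and $ |\Sigma| $ is its determinant

* A is a vector which contains the deviation of sample *i* from the class means, with $R_i$ and $T_i$ its radius_mean and texture_mean
$
A=
  \left [ 
      {\begin{array}{c}
           R_i - Mean(radius\_mean) \\
           T_i - Mean(texture\_mean) \\
      \end{array} } 
  \right]
$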
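
The observation distribution can also be evaluated directly from its definition and compared with `scipy.stats.multivariate_normal.pdf`. The sketch below does this for a single point; the mean vector, covariance matrix and query point are hypothetical values chosen for illustration.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical class statistics for (radius_mean, texture_mean).
mu = np.array([12.0, 18.0])
sigma = np.array([[3.0, 0.5],
                  [0.5, 16.0]])
x = np.array([13.0, 20.0])   # hypothetical query point

A = x - mu                   # deviation from the class mean
k = len(mu)                  # number of features (2 here)

# Normalising constant: 1 / sqrt((2*pi)^k * det(Sigma))
norm_const = 1.0 / np.sqrt((2 * np.pi) ** k * np.linalg.det(sigma))

# Density: norm_const * exp(-0.5 * A^T Sigma^{-1} A)
manual_density = norm_const * np.exp(-0.5 * A @ np.linalg.inv(sigma) @ A)

scipy_density = multivariate_normal.pdf(x, mean=mu, cov=sigma)
print(manual_density, scipy_density)
```

For two features, `(2*pi)^k` under the square root reduces to the `(sqrt(2*pi))^2 = 2*pi` factor in the denominator of the formula.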

># Testing of Model

In [11]:
# This function returns the Observation Distribution
def calculateObservationDistribution(test, mean, covMat):
    return S.multivariate_normal.pdf(test, mean, covMat)


# Here we are Calculating the Posterior Conditional Probability of Benign and Malignant Data
PosteriorConditionalProbabilityB = calculateObservationDistribution(test.iloc[:, 2:4], BT_mean, BT_cov)*P_B
PosteriorConditionalProbabilityM = calculateObservationDistribution(test.iloc[:, 2:4], MT_mean, MT_cov)*P_M

># ***Prediction***

In [12]:
# In this section we are labelling whether it is Benign or Malignant

# creating empty list of label prediction
label_prediction = []

# Comparing PosteriorConditionalProbability of Benign and Malignant, sample by sample
for b, m in zip(PosteriorConditionalProbabilityB, PosteriorConditionalProbabilityM):
    if b > m:
        label_prediction.append('B')
    else:
        label_prediction.append('M')

# list to array
label_prediction = np.array(label_prediction)
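
The same decision rule can be written without an explicit loop. The following sketch uses `np.where` on two hypothetical score arrays standing in for the per-class values of P(x|c)·P(c); it also shows how dividing by the evidence would turn the scores into posteriors that sum to 1.

```python
import numpy as np

# Hypothetical unnormalised scores P(x | c) * P(c) for four samples.
score_B = np.array([0.020, 0.001, 0.015, 0.004])
score_M = np.array([0.005, 0.010, 0.015, 0.001])

# Same rule as the loop: 'B' only when its score is strictly larger, else 'M'.
labels = np.where(score_B > score_M, 'B', 'M')

# Dividing by the evidence (the sum of both scores) gives P(B | x) per sample.
posterior_B = score_B / (score_B + score_M)
print(labels, posterior_B)
```

A tie, as in the third sample, falls to 'M' because the comparison is strict, which matches the behaviour of the loop.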

> # ***Finding the Accuracy***

In [13]:
# Comparing all the rows of diagnosis of test data with label prediction
count = 0
total = len(test)
for i in range(total):
    if test.iloc[i, 1] == label_prediction[i]:
        count += 1
accuracy = count/total
print('Accuracy = ' + str(accuracy*100) + '%')

Accuracy = 87.71929824561403%
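
Accuracy alone does not show which class the errors come from. As a hedged sketch, the counting loop can be condensed with `np.mean`, and `pd.crosstab` can tabulate true against predicted labels. The label arrays below are hypothetical and are not the notebook's test results.

```python
import numpy as np
import pandas as pd

# Hypothetical true and predicted labels, for illustration only.
y_true = np.array(['B', 'M', 'B', 'M', 'B', 'M'])
y_pred = np.array(['B', 'M', 'M', 'M', 'B', 'B'])

# Fraction of positions where prediction and truth agree.
accuracy = np.mean(y_true == y_pred)

# Rows are true labels, columns are predicted labels.
confusion = pd.crosstab(pd.Series(y_true, name='actual'),
                        pd.Series(y_pred, name='predicted'))
print(accuracy)
print(confusion)
```

In the notebook itself, `test['diagnosis'].to_numpy()` and `label_prediction` would play the roles of `y_true` and `y_pred`.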
