# Naive Bayes from Scratch over Breast Cancer Data
Naive Bayes is a simple technique for constructing classifiers: models that assign class labels to problem instances, represented as vectors of feature values, where the class labels are drawn from some finite set. There is not a single algorithm for training such classifiers, but a family of algorithms based on a common principle: all naive Bayes classifiers assume that the value of a particular feature is independent of the value of any other feature, given the class variable. For example, a fruit may be considered to be an apple if it is red, round, and about 10 cm in diameter. A naive Bayes classifier considers each of these features to contribute independently to the probability that this fruit is an apple, regardless of any possible correlations between the color, roundness, and diameter features.
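
In symbols, the independence assumption described above lets the posterior factor into per-feature terms. The notation here ($C$ for the class, $x_1, \dots, x_n$ for the feature values) is introduced only for illustration and does not appear in the notebook itself.

```latex
P(C \mid x_1, \dots, x_n)
  = \frac{P(C)\,\prod_{i=1}^{n} P(x_i \mid C)}
         {\sum_{c} P(c)\,\prod_{i=1}^{n} P(x_i \mid c)}
```

With two classes and equal priors, the prior terms cancel and the posterior reduces to a ratio of products of per-feature likelihoods, which is the form `calc_posterior` takes later in this notebook.
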
### Importing Libraries

In [1]:
import numpy as np
import pandas as pd

### Calling Data
The data is loaded in this step. The columns follow the Breast Cancer Wisconsin dataset, which is easy to find on Kaggle (www.kaggle.com).

In [2]:
# Reading Data
data = pd.read_csv("./Dataset/Breast Cancer Dataset/Breast_Cancer_Data.csv")

# Dropping unnecessary columns (e.g. id, Unnamed: 32)
data.drop([data.columns[0],data.columns[32]],axis = 1, inplace = True)

# Splitting the data by diagnosis: 'M' (malignant) and 'B' (benign)
m_data = data[data["diagnosis"]=='M']
b_data = data[data["diagnosis"]=='B']

In [3]:
data.head()

| | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | symmetry_mean | ... | radius_worst | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | M | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | ... | 25.38 | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 |
| 1 | M | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | ... | 24.99 | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 |
| 2 | M | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | 0.2069 | ... | 23.57 | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 |
| 3 | M | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | 0.2597 | ... | 14.91 | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 |
| 4 | M | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | 0.1809 | ... | 22.54 | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 |


### Splitting the training and testing Data

In [4]:
# Number of rows used for training (70% of the data), split equally between the two classes
training_len=int(0.7*data.shape[0])
m_trainlen=int(training_len/2)
b_trainlen=m_trainlen

# Extracting the training and testing data.
# Training
m_train=m_data.iloc[:m_trainlen]
b_train=b_data.iloc[:b_trainlen]

# Testing
m_test=m_data.iloc[m_trainlen:]
b_test=b_data.iloc[b_trainlen:]

# Concatenating the per-class splits.
training=pd.concat([m_train,b_train])
testing=pd.concat([m_test,b_test])
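
The sizes produced by this split can be worked out by hand. The sketch below repeats the arithmetic of the cell above in plain Python. The class counts (569 rows, 212 malignant, 357 benign) are an assumption based on the standard Wisconsin Diagnostic dataset and are not stated in this notebook, so treat the resulting numbers as illustrative.

```python
# Assumed class counts for the standard Wisconsin Diagnostic dataset
# (not stated in this notebook; adjust them if your CSV differs)
n_rows = 569
n_malignant = 212
n_benign = 357

training_len = int(0.7 * n_rows)    # combined training size
m_trainlen = int(training_len / 2)  # rows taken from each class
b_trainlen = m_trainlen

n_train = m_trainlen + b_trainlen
m_test = n_malignant - m_trainlen   # malignant rows left for testing
b_test = n_benign - b_trainlen      # benign rows left for testing

print("train rows:", n_train, "per class:", m_trainlen)
print("test rows:", m_test + b_test, "malignant:", m_test, "benign:", b_test)
```

Under these assumed counts the training set is balanced, while the test set is dominated by benign rows, so an overall accuracy figure is driven mostly by the benign class.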

### Feature dictionary function
This function builds a dictionary for the requested feature and class, mapping each distinct feature value to its relative frequency within that class.

In [5]:
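def feature_dict(feature_name, cancer_class):
    p_values=[]
    # Every distinct value of the feature across the whole dataset
    feature_unique_values=data[feature_name].unique()
    # Rows of the requested class. Note: training_len is the combined training size,
    # so this slice is longer than the per-class split (m_trainlen / b_trainlen)
    # and can include rows that were assigned to the test set.
    cancer_train_data=data[data['diagnosis']==cancer_class].iloc[:training_len]
    cancer_unique=cancer_train_data[feature_name].unique()
    for feature in feature_unique_values:
        if feature in cancer_unique:
            # Relative frequency of this value within the class
            rf=cancer_train_data[cancer_train_data[feature_name]==feature].shape[0]/cancer_train_data.shape[0]
        else:
            # Value never seen in this class: small non-zero fallback probability
            rf=1/(cancer_train_data.shape[0]+feature_unique_values.shape[0])
        p_values.append(rf)
    feature_dictionary=dict(zip(feature_unique_values,p_values))
    return feature_dictionary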
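
To make the relative-frequency rule concrete, here is a minimal sketch of the same idea on a handful of made-up values, using only the standard library. The helper name `relative_frequencies` and the toy numbers are hypothetical and are not part of the notebook.

```python
from collections import Counter

def relative_frequencies(class_values, all_values):
    """Map each distinct value in all_values to its relative frequency in class_values.

    Values never seen in the class get the fallback 1 / (n_class + n_distinct),
    mirroring the else-branch of feature_dict.
    """
    counts = Counter(class_values)
    n_class = len(class_values)
    distinct = list(dict.fromkeys(all_values))  # unique values, first-seen order
    fallback = 1 / (n_class + len(distinct))
    return {v: counts[v] / n_class if v in counts else fallback for v in distinct}

# Hypothetical toy feature: four malignant samples, plus values seen in both classes
malignant_values = [17.9, 20.5, 17.9, 19.6]
all_values = [17.9, 11.4, 20.5, 17.9, 19.6, 12.1]

probs = relative_frequencies(malignant_values, all_values)
print(probs)
```

The fallback is added without renormalising, so the values in the dictionary need not sum to one. It only guarantees that no probability is zero, which keeps a single unseen value from wiping out the whole product later on.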

Creating the feature dictionaries for both classes

In [6]:
dictionary_m={}
dictionary_b={}
for i in data:
    if i=='diagnosis':
        continue
    dictionary_m[i]=feature_dict(i,'M')
    dictionary_b[i]=feature_dict(i,'B')

### Creating Dataframe from Feature Dictionary

In [7]:
# Encode the true labels: benign -> 0, malignant -> 1
testing['diagnosis']=testing['diagnosis'].map({'B':0,'M':1})

# Replace every feature value in the test set with its class-conditional probability
testing_b_pvalue=pd.DataFrame()
testing_m_pvalue=pd.DataFrame()
for i in data.columns[1:]:
    testing_m_pvalue[i]=testing[i].map(dictionary_m[i])
    testing_b_pvalue[i]=testing[i].map(dictionary_b[i])

### Posterior Probability Function
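This function computes the posterior probability of a class for test row `x`. The class whose probability table is passed first is the one being scored, and both classes are given equal prior probability.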

In [8]:
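def calc_posterior(x,testing_b,testing_m):
    # Posterior of the class passed first (testing_b) for test row x:
    # product of its per-feature probabilities, normalised by the sum of
    # both classes' products (no prior term, i.e. equal priors)
    posterior_p = np.prod(testing_b.iloc[x])/(np.prod(testing_m.iloc[x])+np.prod(testing_b.iloc[x]))
    return posterior_p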
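
Multiplying many small probabilities can underflow to zero when the number of features grows. A common alternative is to work with sums of logarithms, sketched below. This is not what the notebook does: the function name `posterior_from_likelihoods` and the example numbers are hypothetical, and the sketch assumes every probability is strictly positive (which the fallback in `feature_dict` is meant to ensure).

```python
import math

def posterior_from_likelihoods(probs_a, probs_b):
    """Posterior of class A from per-feature likelihoods under A and B, equal priors."""
    log_a = sum(math.log(p) for p in probs_a)
    log_b = sum(math.log(p) for p in probs_b)
    # P(A | x) = 1 / (1 + exp(log_b - log_a)); pick the branch that cannot overflow
    diff = log_b - log_a
    if diff > 0:
        e = math.exp(-diff)
        return e / (1 + e)
    return 1 / (1 + math.exp(diff))

# Small case: agrees with the direct product formula 0.125 / (0.125 + 0.02)
small = posterior_from_likelihoods([0.5, 0.25], [0.1, 0.2])
print(small)

# Many features: the direct products would both underflow to 0.0
many_a = [1e-3] * 200
many_b = [1e-4] * 200
print(posterior_from_likelihoods(many_a, many_b))
print(posterior_from_likelihoods(many_b, many_a))
```

For the 30 features used here the direct product is unlikely to underflow, so the log form matters mainly if the same approach is reused on wider data.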

### Testing Data (Benign)
Scoring the benign class on the test data

In [9]:
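# Posterior probability of the benign class for every test row
post_bclass = np.array(list(map(lambda x:calc_posterior(x,testing_b_pvalue,testing_m_pvalue),
                                np.arange(0,testing.shape[0]))))
# astype(int) truncates toward zero, so a posterior becomes 1 only when it is exactly 1.0;
# the result is compared with the encoded labels (0 = benign, 1 = malignant)
acc_benign = (np.count_nonzero(np.equal(testing['diagnosis'],
                                        post_bclass.astype(int)))/testing.shape[0])*100

In [10]:
print("Accuracy on testing Benign =",acc_benign)

Accuracy on testing Benign = 89.47368421052632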


### Testing Data (Malignant)
Scoring the malignant class on the test data

In [11]:
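# Posterior probability of the malignant class for every test row
post_mclass=np.array(list(map(lambda x:calc_posterior(x,testing_m_pvalue,testing_b_pvalue),
                              np.arange(0,testing.shape[0]))))
# astype(int) truncates toward zero, so a posterior becomes 1 only when it is exactly 1.0;
# the result is compared with the encoded labels (0 = benign, 1 = malignant)
acc_malignant=(np.count_nonzero(np.equal(testing['diagnosis'],
                                         post_mclass.astype(int)))/testing.shape[0])*100

In [12]:
print("Accuracy on testing Malignant =",acc_malignant)

Accuracy on testing Malignant = 100.0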
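
The cells above turn posteriors into labels with `astype(int)`, which truncates toward zero. A more common convention is to threshold the malignant posterior at 0.5. The sketch below contrasts the two on made-up numbers. The helper names, the posteriors, and the labels are all hypothetical, so the accuracies it prints say nothing about the dataset.

```python
def predict_labels(posterior_m, threshold=0.5):
    """Label a sample 1 (malignant) when its malignant posterior exceeds the threshold."""
    return [1 if p > threshold else 0 for p in posterior_m]

def accuracy(y_true, y_pred):
    """Percentage of positions where the prediction agrees with the label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return 100 * correct / len(y_true)

# Hypothetical posteriors and labels (1 = malignant, 0 = benign), not taken from the dataset
posterior_m = [0.98, 0.91, 0.03, 0.75, 0.10]
y_true = [1, 1, 0, 0, 0]

thresholded = predict_labels(posterior_m)
truncated = [int(p) for p in posterior_m]  # what astype(int) does to values below 1.0

print("thresholded accuracy:", accuracy(y_true, thresholded))
print("truncated accuracy:", accuracy(y_true, truncated))
```

With thresholding, one set of predictions serves both classes, so a single accuracy figure (or a per-class breakdown) can be reported instead of two separately computed ones.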
