# Build a machine learning classifier using the Naive Bayes Classifier

### The Naive Bayes classifier 

In machine learning, naïve Bayes classifiers are a family of simple "probabilistic classifiers" based on applying Bayes' theorem with strong (naïve) independence assumptions between the features. They are among the simplest Bayesian network models.

Example: classify whether a given person is a male or a female based on the measured features. The features include height, weight, and foot size.
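To make the example concrete, here is a minimal from-scratch sketch of a Gaussian naive Bayes classifier for it. The training values are made up for illustration (they are not real measurements), and the helper names `gaussian_pdf`, `fit`, and `predict` are hypothetical. Each class is scored as its prior times the product of per-feature Gaussian likelihoods, which is exactly where the "naive" independence assumption enters.

```python
import math
from statistics import mean, variance

# Hypothetical training data (made-up values, for illustration only):
# each row is (height in cm, weight in kg, foot size in cm)
training = {
    "male": [(183, 82, 30), (180, 86, 28), (170, 77, 30), (178, 75, 27)],
    "female": [(152, 45, 23), (168, 68, 25), (165, 59, 24), (175, 68, 26)],
}


def gaussian_pdf(x, mu, var):
    """Density of a normal distribution with mean mu and variance var at x."""
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def fit(data):
    """Estimate a class prior and per-feature (mean, variance) for each class."""
    total = sum(len(rows) for rows in data.values())
    fitted = {}
    for label, rows in data.items():
        columns = list(zip(*rows))  # one tuple of values per feature
        fitted[label] = {
            "prior": len(rows) / total,
            "stats": [(mean(c), variance(c)) for c in columns],
        }
    return fitted


def predict(fitted, sample):
    """Return the class with the highest log prior + summed log likelihoods."""
    scores = {}
    for label, params in fitted.items():
        log_score = math.log(params["prior"])
        # Naive assumption: features are independent given the class,
        # so the per-feature log likelihoods simply add up.
        for x, (mu, var) in zip(sample, params["stats"]):
            log_score += math.log(gaussian_pdf(x, mu, var))
        scores[label] = log_score
    return max(scores, key=scores.get)


nb_model = fit(training)
prediction = predict(nb_model, (160, 55, 24))
print(prediction)
```

Working with log probabilities rather than raw products is a common design choice here: multiplying many small likelihoods can underflow, while adding their logs does not.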

As a reminder to self: Bayesian statistics combines a prior probability with new data to update the probability of a hypothesis. The alternative is the frequentist perspective, which treats probability as a long-run frequency and does not assign probabilities to hypotheses, so there is no prior to update.

### Bayes' theorem
#### P(A|B) = P(B|A)P(A) / P(B)

#### P(H|D) = P(D|H)P(H) / P(D)

• The prior P(H) is the probability that H is true before the data is considered.

• The posterior P(H | D) is the probability that H is true after the data is considered.

• The likelihood P(D | H) is the evidence about H provided by the data D.

• P(D) is the total probability of the data taking into account all possible hypotheses.
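
The four terms above can be tied together with a small numeric sketch. The numbers are assumptions chosen for illustration (a 1% prior, a 90% chance of the data when H is true, a 5% chance when H is false), and `posterior` is a hypothetical helper, not a library function. P(D) is expanded over the two hypotheses "H" and "not H".

```python
def posterior(prior, likelihood, likelihood_given_not_h):
    """Compute P(H|D) from P(H), P(D|H), and P(D|not H) via Bayes' theorem."""
    # P(D): total probability of the data over both hypotheses
    evidence = likelihood * prior + likelihood_given_not_h * (1 - prior)
    return likelihood * prior / evidence


# Illustrative, assumed values: P(H) = 0.01, P(D|H) = 0.9, P(D|not H) = 0.05
p = posterior(prior=0.01, likelihood=0.9, likelihood_given_not_h=0.05)
print(round(p, 4))  # 0.1538
```

Even with strong evidence, the posterior stays modest because the prior is small, which shows how much the prior matters in the update.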


#### Bayesian inference

• uses probabilities for both hypotheses and data.

• depends on the prior and likelihood of observed data.

• requires one to know or construct a ‘subjective prior’.

• dominated statistical practice before the 20th century.

• may be computationally intensive due to integration over many parameters.

#### Frequentist inference (NHST)

• never uses or gives the probability of a hypothesis (no prior or posterior).

• depends on the likelihood P(D | H) for both observed and unobserved data.

• does not require a prior.

• dominated statistical practice during the 20th century.

• tends to be less computationally intensive.



Frequentist measures like p-values and confidence intervals continue to dominate research,
especially in the life sciences. However, in the current era of powerful computers and
big data, Bayesian methods have undergone an enormous renaissance in fields like machine
learning and genetics. There are now a number of large, ongoing clinical trials using
Bayesian protocols, something that would have been hard to imagine a generation ago.
While professional divisions remain, the consensus forming among top statisticians is that
the most effective approaches to complex problems often draw on the best insights from
both schools working in concert.

Learn more:  https://ocw.mit.edu/courses/mathematics/18-05-introduction-to-probability-and-statistics-spring-2014/readings/MIT18_05S14_Reading20.pdf



In [2]:
from sklearn.datasets import load_breast_cancer

# Load dataset
data = load_breast_cancer()

In [8]:
type(data)

sklearn.utils.Bunch

In [21]:
# Organize our data
label_names = data['target_names']
labels = data['target']
feature_names = data['feature_names']
features = data['data']
list(data)

['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename']

In [22]:
# Look at our data
print(label_names)
print(labels[0])
print(feature_names[0])
print(features[0])

['malignant' 'benign']
0
mean radius
[1.799e+01 1.038e+01 1.228e+02 1.001e+03 1.184e-01 2.776e-01 3.001e-01
 1.471e-01 2.419e-01 7.871e-02 1.095e+00 9.053e-01 8.589e+00 1.534e+02
 6.399e-03 4.904e-02 5.373e-02 1.587e-02 3.003e-02 6.193e-03 2.538e+01
 1.733e+01 1.846e+02 2.019e+03 1.622e-01 6.656e-01 7.119e-01 2.654e-01
 4.601e-01 1.189e-01]


In [23]:
print(feature_names) # features

['mean radius' 'mean texture' 'mean perimeter' 'mean area'
 'mean smoothness' 'mean compactness' 'mean concavity'
 'mean concave points' 'mean symmetry' 'mean fractal dimension'
 'radius error' 'texture error' 'perimeter error' 'area error'
 'smoothness error' 'compactness error' 'concavity error'
 'concave points error' 'symmetry error' 'fractal dimension error'
 'worst radius' 'worst texture' 'worst perimeter' 'worst area'
 'worst smoothness' 'worst compactness' 'worst concavity'
 'worst concave points' 'worst symmetry' 'worst fractal dimension']


In [25]:
# Create Training and Test sets
from sklearn.model_selection import train_test_split


# Split our data
train, test, train_labels, test_labels = train_test_split(features,
                                                          labels,
                                                          test_size=0.33,  # hold out 33% of the data to test the classifier's accuracy
                                                          random_state=42) # fix random_state so the split is reproducible
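
A quick sanity check on the split can be sketched as follows. It reloads the data so that it runs on its own, then prints the shape of each part and the number of samples per class; the use of `np.bincount` to count labels is my choice here, not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
train, test, train_labels, test_labels = train_test_split(
    data["data"], data["target"], test_size=0.33, random_state=42
)

# Rows are samples, columns are the 30 features
print(train.shape, test.shape)

# Count of class 0 (malignant) and class 1 (benign) in each part
print(np.bincount(train_labels), np.bincount(test_labels))
```

If the class counts look very different between the two parts, passing `stratify=labels` to `train_test_split` keeps the class proportions the same in both.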

In [26]:
from sklearn.naive_bayes import GaussianNB # Gaussian Naive Bayes models each continuous feature with a per-class normal distribution

# Initialize our classifier
gnb = GaussianNB()

# Train our classifier
model = gnb.fit(train, train_labels)

In [28]:
print(model)

GaussianNB(priors=None, var_smoothing=1e-09)
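
Printing the model only shows its settings. As a rough sketch of what fitting actually learned, the fitted attributes can be inspected directly: `classes_`, `class_prior_` (the estimated P(H) for each class), and `theta_` (the per-class mean of each feature). The block repeats the loading and splitting steps so that it runs on its own.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

data = load_breast_cancer()
train, test, train_labels, test_labels = train_test_split(
    data["data"], data["target"], test_size=0.33, random_state=42
)
gnb = GaussianNB().fit(train, train_labels)

print(gnb.classes_)       # class labels seen during fitting
print(gnb.class_prior_)   # fraction of training samples in each class
print(gnb.theta_.shape)   # (number of classes, number of features)
print(gnb.theta_[:, 0])   # per-class mean of the first feature, 'mean radius'
```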


In [30]:
# Make predictions on test set
preds = gnb.predict(test)
print(preds)

[1 0 0 1 1 0 0 0 1 1 1 0 1 0 1 0 1 1 1 0 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 0
 1 0 1 1 0 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 0 0 1 1 0 0 1 1 1 0 0 1 1 0 0 1 0
 1 1 1 1 1 1 0 1 1 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 1 0 0 1 0 0 1 1 1 0 1 1 0
 1 1 0 0 0 1 1 1 0 0 1 1 0 1 0 0 1 1 0 0 0 1 1 1 0 1 1 0 0 1 0 1 1 0 1 0 0
 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 0 0 1 1 0 1 1 0 1 1 1 1 1 1 0 0
 0 1 1]
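
The raw 0/1 predictions are easier to read as class names, and the classifier can also report how confident it is. The following sketch, which repeats the earlier steps so that it runs on its own, indexes `target_names` with the predictions and calls `predict_proba` to get the posterior probability of each class per sample.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

data = load_breast_cancer()
train, test, train_labels, test_labels = train_test_split(
    data["data"], data["target"], test_size=0.33, random_state=42
)
gnb = GaussianNB().fit(train, train_labels)

preds = gnb.predict(test)
probs = gnb.predict_proba(test)  # one row per sample, one column per class

# Indexing the name array with the integer predictions maps 0/1 to names
pred_names = data["target_names"][preds]

for name, row in zip(pred_names[:5], probs[:5]):
    print(name, row.round(3))
```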


In [36]:
# Time to evaluate the accuracy of our model
from sklearn.metrics import accuracy_score


# Evaluate accuracy (94.15% of predictions are accurate)
print('percent of accurate predictions: ',accuracy_score(test_labels, preds)*100)

percent of accurate predictions:  94.14893617021278
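
Accuracy alone does not show which kind of mistake the model makes, and in a cancer setting a missed malignant case is not the same as a false alarm. As one possible follow-up, not part of the original notebook, `confusion_matrix` and `classification_report` break the errors down by class. The block repeats the earlier steps so that it runs on its own.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

data = load_breast_cancer()
train, test, train_labels, test_labels = train_test_split(
    data["data"], data["target"], test_size=0.33, random_state=42
)
preds = GaussianNB().fit(train, train_labels).predict(test)

# Rows are true classes, columns are predicted classes,
# in label order 0 (malignant), 1 (benign)
cm = confusion_matrix(test_labels, preds)
print(cm)

# Per-class precision, recall, and F1 score
print(classification_report(test_labels, preds, target_names=data["target_names"]))
```

The diagonal of the matrix holds the correct predictions, so its sum divided by the total reproduces the accuracy score.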


In [None]:
# Full script
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score


# Load dataset
data = load_breast_cancer()

# Organize our data
label_names = data['target_names']
labels = data['target']
feature_names = data['feature_names']
features = data['data']

# Look at our data
print(label_names)
print('Class label = ', labels[0])
print(feature_names)
print(features[0])

# Split our data
train, test, train_labels, test_labels = train_test_split(features,
                                                          labels,
                                                          test_size=0.33,
                                                          random_state=42)

# Initialize our classifier
gnb = GaussianNB()

# Train our classifier
model = gnb.fit(train, train_labels)

# Make predictions
preds = gnb.predict(test)
print(preds)

# Evaluate accuracy
print(accuracy_score(test_labels, preds))