# Program 5 - Write a program to implement the naive Bayes classifier for a sample training data set stored as a .CSV file. Compute the accuracy, precision and recall for your data set.

### Importing the required libraries

pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real-world data analysis in Python.

In [1]:
import pandas as pd

### Reading the data as a DataFrame Object 

pd.read_csv(filepath_or_buffer, sep=',', delimiter=None) - reads a CSV file and returns its contents as a DataFrame.  
df.columns.values - returns a numpy.ndarray containing all column headers.  
df.iloc[rows, columns] - integer-location based indexing for selection by position. The first index selects rows and the second selects columns; df.iloc[:16, :] below keeps the first 16 rows and every column.


In [2]:
df = pd.read_csv('NaiveBayesClassifier.csv')
headers = df.columns.values
df = df.iloc[:16,:]
df

Unnamed: 0,OUTLOOK,TEMPERATURE,HUMIDITY,WIND,TARGET
0,1.0,1.0,1.0,1.0,5.0
1,1.0,1.0,1.0,2.0,5.0
2,2.0,1.0,1.0,2.0,10.0
3,3.0,2.0,1.0,1.0,10.0
4,3.0,3.0,2.0,1.0,10.0
5,3.0,3.0,2.0,2.0,5.0
6,2.0,3.0,2.0,2.0,10.0
7,1.0,2.0,1.0,1.0,5.0
8,1.0,3.0,2.0,1.0,10.0
9,3.0,2.0,2.0,2.0,10.0


### Partition the DataFrame

The DataFrame is now partitioned into data (features) and target using the integer-based indexing function iloc[:, :]. The last column of the DataFrame holds the target values, while the remaining four columns hold the attributes used to predict the target.

In [3]:
data=df.iloc[:,0:-1].values
target=df.iloc[:,-1].values
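
The read-and-partition steps can be tried on a tiny in-memory CSV. This is a minimal sketch: the three rows in `csv_text` and the names `toy_df`, `toy_data` and `toy_target` are made up for illustration and are not the contents of NaiveBayesClassifier.csv.

```python
import io
import pandas as pd

# Hypothetical three-row CSV using the same column layout as the notebook.
csv_text = """OUTLOOK,TEMPERATURE,HUMIDITY,WIND,TARGET
1,1,1,1,5
2,1,1,2,10
3,2,1,1,10
"""

toy_df = pd.read_csv(io.StringIO(csv_text))

# All rows, every column except the last -> feature matrix.
toy_data = toy_df.iloc[:, 0:-1].values
# All rows, last column only -> target vector.
toy_target = toy_df.iloc[:, -1].values

print(toy_data.shape)        # (3, 4)
print(toy_target.tolist())   # [5, 10, 10]
```

The slice `0:-1` stops before the last column, which is why the target never leaks into the feature matrix.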

###  Partition the data set for training and testing

train_test_split(\*arrays, \**options)


\*arrays - sequence of indexables with the same length. Allowed inputs: lists, numpy arrays, scipy-sparse matrices or DataFrames.  

\**options  
  
test_size : float, int, None, optional (default=0.25). A float gives the fraction of the data held out for testing; with the default, 25% is used for testing and the remaining 75% for training. The call below passes test_size=0.20.  
  
random_state : int, optional. Seeds the shuffle so that the same split is produced on every run.  
  
shuffle : boolean, optional (default=True). Whether or not to shuffle the data before splitting. If shuffle=False, then stratify must be None.

In [4]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.20, random_state=1) 
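
With only 16 rows, a test_size of 0.20 does not divide evenly (0.20 × 16 = 3.2). The sketch below shows the size arithmetic under the assumption that the test count is rounded up and the training set gets the remainder, which is consistent with the 12/4 split reported in the next cell. `split_sizes` is a hypothetical helper written for this illustration, not a scikit-learn function.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size.

    Hypothetical helper: assumes the test count is rounded up and the
    training set receives whatever is left.
    """
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test


print(split_sizes(16, 0.20))  # (12, 4)
```

On a data set this small, 0.20 and 0.25 therefore produce the same partition sizes.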


### Printing the details of the data set and the results of the operations done so far

In [5]:
print('Split {0} rows into'.format(len(data)))
print('Number of Training data: ' + (repr(len(X_train))))
print('Number of Test Data: ' + (repr(len(X_test))))
print("\nThe values assumed for the concept learning attributes are\n")
print("OUTLOOK=> Sunny=1 Overcast=2 Rain=3\nTEMPERATURE=> Hot=1 Mild=2 Cool=3\nHUMIDITY=> High=1 Normal=2\nWIND=> Weak=1 Strong=2")
print("TARGET CONCEPT:PLAY TENNIS=> Yes=10 No=5")
print("\nThe Training set are:")
for x,y in zip(X_train,y_train):
    print((x,y))
print("\nThe Test data set are:")
for x,y in zip(X_test,y_test):
    print((x,y))

Split 16 rows into
Number of Training data: 12
Number of Test Data: 4

The values assumed for the concept learning attributes are

OUTLOOK=> Sunny=1 Overcast=2 Rain=3
TEMPERATURE=> Hot=1 Mild=2 Cool=3
HUMIDITY=> High=1 Normal=2
WIND=> Weak=1 Strong=2
TARGET CONCEPT:PLAY TENNIS=> Yes=10 No=5

The Training set are:
(array([2., 3., 2., 2.]), 10.0)
(array([1., 2., 2., 2.]), 10.0)
(array([3., 3., 2., 1.]), 10.0)
(array([1., 1., 1., 2.]), 5.0)
(array([1., 2., 1., 2.]), 10.0)
(array([1., 1., 1., 1.]), 5.0)
(array([1., 2., 1., 2.]), 5.0)
(array([3., 2., 2., 2.]), 10.0)
(array([1., 3., 2., 1.]), 10.0)
(array([2., 1., 2., 1.]), 10.0)
(array([2., 2., 1., 2.]), 10.0)
(array([3., 3., 2., 2.]), 5.0)

The Test data set are:
(array([3., 2., 1., 1.]), 10.0)
(array([3., 2., 1., 2.]), 5.0)
(array([1., 2., 1., 1.]), 5.0)
(array([2., 1., 1., 2.]), 10.0)


### Naive Bayesian Classifier  
  
Instantiate the GaussianNB (Gaussian Naive Bayes) class. Naive Bayes classifiers are a family of simple "probabilistic classifiers" based on applying Bayes' theorem with strong independence assumptions between the features. GaussianNB additionally models each feature as normally distributed within each class, so the integer attribute codes used here are treated as continuous values.  
  
gnb.fit(X, y) - fits Gaussian Naive Bayes to X and y, where X holds the training vectors and y holds the target values.  

In [6]:
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB() 
gnb.fit(X_train, y_train) 

GaussianNB(priors=None)
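
To make the fitting step less of a black box, here is a rough from-scratch sketch of the Gaussian naive Bayes idea in plain Python, using the twelve training rows printed above. The helper names `fit_gaussian_nb` and `predict_one` and the smoothing constant `EPS` are assumptions made for this illustration; scikit-learn's own implementation smooths variances differently, so its results may not match exactly.

```python
import math
from collections import defaultdict

# Training rows copied from the printed training set above.
X_rows = [
    [2, 3, 2, 2], [1, 2, 2, 2], [3, 3, 2, 1], [1, 1, 1, 2],
    [1, 2, 1, 2], [1, 1, 1, 1], [1, 2, 1, 2], [3, 2, 2, 2],
    [1, 3, 2, 1], [2, 1, 2, 1], [2, 2, 1, 2], [3, 3, 2, 2],
]
y_rows = [10, 10, 10, 5, 10, 5, 5, 10, 10, 10, 10, 5]

EPS = 1e-9  # assumed small constant that keeps variances strictly positive


def fit_gaussian_nb(X, y):
    """Estimate the prior, per-feature means and variances for each class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        variances = [
            sum((v - m) ** 2 for v in col) / n + EPS
            for col, m in zip(zip(*rows), means)
        ]
        model[label] = (n / len(X), means, variances)
    return model


def predict_one(model, x):
    """Return the class with the highest log posterior (up to a constant)."""
    def log_posterior(label):
        prior, means, variances = model[label]
        score = math.log(prior)
        for value, m, var in zip(x, means, variances):
            # Log of the normal density for this feature, given the class.
            score += -0.5 * math.log(2 * math.pi * var) - (value - m) ** 2 / (2 * var)
        return score
    return max(model, key=log_posterior)


model = fit_gaussian_nb(X_rows, y_rows)
print(model[5][0], model[5][1])          # prior and feature means for class 5
print(predict_one(model, [3, 2, 1, 1]))  # predicted label for one test row
```

Summing log densities across features is where the "naive" independence assumption enters: each attribute contributes to the score separately.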

### Predictions from the Model

Once trained, the model can make predictions.   
gnb.predict(X) - performs classification on an array of test vectors X.  


In [7]:
y_pred = gnb.predict(X_test)
print("Predictions for the given test set",y_pred)
print("The actual testing data ",y_test)

Predictions for the given test set [10.  5.  5.  5.]
The actual testing data  [10.  5.  5. 10.]


### Computing the accuracy 

The metrics module from the sklearn package is used for this task.    
metrics.accuracy_score(y_true, y_pred) - returns the fraction of samples whose predicted label matches the true label. In multilabel classification it computes subset accuracy instead:  
the set of labels predicted for a sample must exactly match the corresponding set of labels in y_true.  
  
y_true : 1d array-like, or label indicator array / sparse matrix. Ground truth (correct) labels.  
  
y_pred : 1d array-like, or label indicator array / sparse matrix. Predicted labels, as returned by a classifier.

In [8]:
from sklearn import metrics 
print("Gaussian Naive Bayes model accuracy(in %):", metrics.accuracy_score(y_test, y_pred)*100)



Gaussian Naive Bayes model accuracy(in %): 75.0
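
The program statement also asks for precision and recall. A minimal sketch, counting outcomes by hand from the four predictions shown earlier, follows. Treating Yes=10 as the positive class is an assumption made here, since the notebook does not name one, and `y_hat` and `POSITIVE` are illustrative names.

```python
# Labels copied from the prediction output above.
y_true = [10, 5, 5, 10]
y_hat = [10, 5, 5, 5]
POSITIVE = 10  # assumed positive class (PLAY TENNIS = Yes)

pairs = list(zip(y_true, y_hat))
tp = sum(1 for t, p in pairs if t == POSITIVE and p == POSITIVE)  # true positives
fp = sum(1 for t, p in pairs if t != POSITIVE and p == POSITIVE)  # false positives
fn = sum(1 for t, p in pairs if t == POSITIVE and p != POSITIVE)  # false negatives

accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0

print("Accuracy :", accuracy)   # 0.75
print("Precision:", precision)  # 1.0
print("Recall   :", recall)     # 0.5
```

With scikit-learn, metrics.precision_score(y_test, y_pred, pos_label=10) and metrics.recall_score(y_test, y_pred, pos_label=10) should give the same figures. The pos_label argument is needed because the labels here are 5 and 10 rather than 0 and 1.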


### Hope this helps :) 

#### Follow my work [here](https://github.com/NandanSatheesh)