<center>
    <H1> NAIVE BAYES CLASSIFIER </H1>
    <br>
======================================================================================================================
<br>
</center>
Naive Bayes is a supervised machine learning algorithm for classification. Its most basic form is very easy to implement, yet it can handle fairly complex classification tasks.
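
For reference, the decision rule behind the classifier can be written as below. This is the standard textbook formulation rather than something stated in this notebook: $x_1, \dots, x_n$ are the feature values of one example, $y$ ranges over the classes, and the Gaussian form of the likelihood is the assumption made by the `GaussianNB` variant imported in Step 1.

$$
\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y),
\qquad
P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma_{y,i}^{2}}} \exp\!\left(-\frac{(x_i - \mu_{y,i})^{2}}{2\sigma_{y,i}^{2}}\right)
$$

The "naive" part is the product: each feature is treated as conditionally independent of the others given the class.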


## STEP 1: IMPORT LIBRARIES

In [1]:
import pandas as pd
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split 
from sklearn.metrics import confusion_matrix, accuracy_score

## STEP 2: LOAD DATASET

In [4]:
dataset = pd.read_csv('data/twitter_dataset.csv', encoding = 'latin-1')  #load data from csv
dataset.head()                  #show first 5 rows

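| | name_wt | statuses_count | followers_count | friends_count | favourites_count | listed_count | label |
|---|---|---|---|---|---|---|---|
| 0 | 0.9375 | 43 | 5 | 34 | 0 | 0 | 1 |
| 1 | 0.909091 | 12204 | 1182 | 1327 | 0 | 4 | 1 |
| 2 | 0.909091 | 42 | 3 | 34 | 0 | 0 | 1 |
| 3 | 1.0 | 215 | 1158 | 1545 | 0 | 21 | 1 |
| 4 | 0.285714 | 38420 | 2293 | 2198 | 1987 | 2 | 0 |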


In [5]:
dataset.shape

(6945, 7)

## STEP 3: FEATURE SELECTION

In [6]:
features=[]
for attributes in dataset.columns:
    if attributes != 'label':
        features.append(attributes)
features

['name_wt',
 'statuses_count',
 'followers_count',
 'friends_count',
 'favourites_count',
 'listed_count']

In [7]:
# Combine the selected feature columns into a single 2D NumPy array
# (DataFrame.as_matrix was removed in pandas 1.0; to_numpy() replaces it)

data = dataset[features].to_numpy()
# data


In [8]:
# NOTE: dataset.values keeps all 7 columns, so the last column of `data` is the label itself
data = dataset.values
data

array([[9.37500000e-01, 4.30000000e+01, 5.00000000e+00, ...,
        0.00000000e+00, 0.00000000e+00, 1.00000000e+00],
       [9.09090909e-01, 1.22040000e+04, 1.18200000e+03, ...,
        0.00000000e+00, 4.00000000e+00, 1.00000000e+00],
       [9.09090909e-01, 4.20000000e+01, 3.00000000e+00, ...,
        0.00000000e+00, 0.00000000e+00, 1.00000000e+00],
       ...,
       [8.18181818e-01, 1.04390000e+04, 7.80000000e+01, ...,
        0.00000000e+00, 3.90000000e+01, 1.00000000e+00],
       [9.09090909e-01, 5.25000000e+02, 3.60000000e+01, ...,
        5.30000000e+01, 0.00000000e+00, 0.00000000e+00],
       [5.45454545e-01, 1.26200000e+04, 6.13000000e+02, ...,
        2.00000000e+00, 6.00000000e+00, 1.00000000e+00]])

In [9]:
# data.shape[1] counts every column of `data`: the 6 features plus the label column
print("Total instances : ", data.shape[0], "\nNumber of features : ", data.shape[1])

Total instances :  6945 
Number of features :  7


In [11]:
# convert the label column into a 1D array

label = np.array(dataset['label'])
# label
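
Because `dataset.values` returns every column, the label also sits inside `data` as its last column. One way to keep it out of the feature matrix is to slice it off. The sketch below shows the idea on a small made-up array; the name `data_with_label` and its values are illustrative only and are not rows from `twitter_dataset.csv`.

```python
import numpy as np

# Toy stand-in for `dataset.values`: two feature columns followed by the label column
# (made-up numbers, used only to show the slicing).
data_with_label = np.array([
    [0.94,  43.0, 1.0],
    [0.29, 384.0, 0.0],
    [0.91,  42.0, 1.0],
])

X = data_with_label[:, :-1]             # every column except the last -> features only
y = data_with_label[:, -1].astype(int)  # last column -> integer class labels

print(X.shape)  # (3, 2)
print(y)        # [1 0 1]
```

Applying the same slicing to the notebook's `data` would change the numbers reported in the later steps, so the outputs shown there would need to be regenerated.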

## STEP 4: CREATE TEST AND TRAIN SETS

We randomly split the dataset in an 80–20 ratio: 80% of the data is used as the training set and the remaining 20% as the test set.

In [17]:
'''
    train_test_split randomly divides data and label into two non-overlapping sets:
        1. training set (X_train, y_train)
        2. testing set  (X_test, y_test)
'''

X_train, X_test, y_train, y_test = train_test_split(data, label, test_size=0.2, random_state=0)

print("Number of training instances: ", X_train.shape[0])
print("Number of testing instances: ", X_test.shape[0])

Number of training instances:  5556
Number of testing instances:  1389
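
To make the 80–20 split concrete, here is a minimal sketch of the idea using only the standard library. The helper `split_indices` is hypothetical and is not how scikit-learn implements `train_test_split`, so the rows it selects will not match those chosen with `random_state=0`; only the partition sizes are comparable.

```python
import random

def split_indices(n_samples, test_fraction=0.2, seed=0):
    """Shuffle row indices and cut them into disjoint train/test index lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)        # reproducible shuffle
    n_test = round(n_samples * test_fraction)   # size of the test partition
    return indices[n_test:], indices[:n_test]   # (train, test)

train_idx, test_idx = split_indices(6945)
print(len(train_idx), len(test_idx))            # 5556 1389
print(set(train_idx).isdisjoint(test_idx))      # True
```

Every row lands in exactly one of the two lists, which is what "non-overlapping" means here.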


## STEP 5: TRAIN THE CLASSIFIER 

In [18]:
# Generate the model
nb_model = GaussianNB()

# Train the model using the training sets
data = X_train
label = y_train

nb_model.fit(data, label)

GaussianNB(priors=None, var_smoothing=1e-09)
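
Conceptually, fitting a Gaussian Naive Bayes model means estimating, for each class, a prior probability plus a mean and a variance per feature. The sketch below illustrates that on made-up data; `fit_gaussian_nb` is a hypothetical helper, and it leaves out the small smoothing term that scikit-learn adds to the variances (controlled by `var_smoothing`).

```python
from collections import defaultdict

def fit_gaussian_nb(X, y):
    """Estimate the prior, per-feature mean and per-feature variance of each class."""
    rows_by_class = defaultdict(list)
    for row, label in zip(X, y):
        rows_by_class[label].append(row)

    params = {}
    for label, rows in rows_by_class.items():
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        # Population variance (divide by n), one value per feature
        variances = [sum((v - m) ** 2 for v in col) / n
                     for col, m in zip(zip(*rows), means)]
        params[label] = {"prior": n / len(X), "mean": means, "var": variances}
    return params

# Made-up toy data: two features, two classes
X_toy = [[1.0, 10.0], [3.0, 14.0], [6.0, 2.0], [8.0, 4.0]]
y_toy = [0, 0, 1, 1]

params = fit_gaussian_nb(X_toy, y_toy)
print(params[0])  # {'prior': 0.5, 'mean': [2.0, 12.0], 'var': [1.0, 4.0]}
```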

## STEP 6: TEST THE CLASSIFIER 

The model is now trained. We test it by comparing its predictions on the test set with the given labels. For each test case, the classifier computes a score for every class (using Bayes' theorem) and assigns the class with the highest score.
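
The scoring step can be sketched as follows. This is an illustration under assumptions, not the library's code: `log_score` and `predict_one` are hypothetical helpers, and the class parameters are invented rather than read from the trained `nb_model`. Scores are computed in log space, which leaves the winning class unchanged but avoids multiplying many tiny probabilities.

```python
import math

def log_score(x, prior, means, variances):
    """log P(class) + sum_i log N(x_i; mean_i, var_i) for one class."""
    score = math.log(prior)
    for value, mean, var in zip(x, means, variances):
        score += -0.5 * math.log(2 * math.pi * var) - (value - mean) ** 2 / (2 * var)
    return score

def predict_one(x, params):
    """Return the class whose score is highest."""
    return max(params, key=lambda c: log_score(x, **params[c]))

# Hypothetical class parameters (not taken from the trained nb_model)
params = {
    0: {"prior": 0.5, "means": [2.0, 12.0], "variances": [1.0, 4.0]},
    1: {"prior": 0.5, "means": [7.0, 3.0], "variances": [1.0, 1.0]},
}

print(predict_one([2.5, 11.0], params))  # 0
print(predict_one([6.5, 3.5], params))   # 1
```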

In [19]:
#test set
X_test

array([[4.73684211e-01, 4.50000000e+01, 2.98000000e+02, ...,
        0.00000000e+00, 1.70000000e+01, 1.00000000e+00],
       [1.42857143e-01, 1.50010000e+04, 5.00000000e+02, ...,
        2.78390000e+04, 2.40000000e+01, 0.00000000e+00],
       [3.33333333e-01, 1.41840000e+04, 2.79000000e+02, ...,
        3.83900000e+03, 6.00000000e+00, 0.00000000e+00],
       ...,
       [8.57142857e-01, 7.71700000e+03, 1.60000000e+01, ...,
        9.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [2.00000000e-01, 1.42600000e+03, 1.43000000e+02, ...,
        1.23000000e+02, 1.00000000e+00, 0.00000000e+00],
       [6.66666667e-01, 3.75500000e+03, 2.79000000e+02, ...,
        6.04100000e+03, 3.00000000e+00, 0.00000000e+00]])

In [20]:
nb_model.predict([X_test[1]])    #testing for single instance

array([0], dtype=int64)

In [21]:
'''
   Now, apply the model to the entire test set and predict the label for each test example

'''       
       
y_predict = []                       #to store prediction of each test example

for test_case in range(len(X_test)): 
    label = nb_model.predict([X_test[test_case]])
    
    #append to the predictions list (np.asscalar has been removed from NumPy; .item() is the replacement)
    y_predict.append(label.item())

#predictions
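
The row-by-row loop works, but `predict` also accepts the whole feature matrix at once. The sketch below checks on synthetic stand-in data (not the Twitter dataset; `X_demo`, `y_demo` and `model` are illustrative names) that both approaches give the same labels.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data: two well-separated classes with three features each
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0.0, 1.0, size=(20, 3)),
                    rng.normal(5.0, 1.0, size=(20, 3))])
y_demo = np.array([0] * 20 + [1] * 20)

model = GaussianNB().fit(X_demo, y_demo)

# One call for the whole matrix ...
batch_predictions = model.predict(X_demo).tolist()

# ... gives the same labels as predicting one row at a time
loop_predictions = [model.predict([row]).item() for row in X_demo]

print(batch_predictions == loop_predictions)  # True
```

In the notebook this would correspond to something like `y_predict = nb_model.predict(X_test).tolist()`.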

In [23]:
y_predict   #predicted labels for all the test instances

[1,
 0,
 0,
 1,
 0,
 0,
 1,
 1,
 1,
 0,
 0,
 1,
 1,
 0,
 1,
 1,
 1,
 1,
 1,
 0,
 0,
 1,
 0,
 0,
 1,
 1,
 1,
 0,
 0,
 1,
 1,
 1,
 0,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 0,
 1,
 1,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 1,
 0,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 0,
 0,
 1,
 0,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 0,
 0,
 1,
 1,
 1,
 1,
 0,
 1,
 1,
 1,
 0,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 0,
 1,
 1,
 1,
 1,
 1,
 0,
 1,
 1,
 0,
 1,
 0,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 0,
 1,
 1,
 0,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 0,
 1,
 1,
 1,
 1,
 0,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 0,
 0,
 0,
 1,
 0,
 1,
 1,
 0,
 1,
 0,
 1,
 1,
 0,
 1,
 1,
 0,
 1,
 1,
 1,
 0,
 1,
 1,
 1,
 1,
 1,
 0,
 0,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 0,
 1,
 1,
 0,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 0,
 1,
 1,
 1,
 1,
 1,
 0,
 1,
 1,
 1,
 0,
 1,
 1,
 1,
 0,
 0,
 1,
 1,
 1,
 1,
 0,
 1,
 1,
 1,
 1,
 1,
 1,
 0,
 0,
 0,
 0,
 0,
 1,
 1,
 0,
 1,
 0,
 1,
 1,
 1,
 1,
 0,
 0,
 1,
 0,
 1,
 0,
 1,
 1,
 0,
 0,
 0,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,


## STEP 7: EVALUATION OF CLASSIFICATION RESULTS

The classifier will be evaluated using accuracy, recall, precision and F-measure. First, a confusion matrix is created.

In [24]:
# confusion_matrix returns C with rows = true labels and columns = predicted labels:
# C[0][0] = true negatives, C[1][0] = false negatives, C[0][1] = false positives, C[1][1] = true positives
conf_matrix = confusion_matrix(y_test, y_predict)

In [25]:
#true_negative
TN = conf_matrix[0][0]
#false_negative
FN = conf_matrix[1][0]
#false_positive
FP = conf_matrix[0][1]
#true_positive
TP = conf_matrix[1][1]
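
To see where the four cells come from, the counts can also be obtained by comparing true and predicted labels pair by pair. This is a minimal sketch with made-up labels; `confusion_counts` is a hypothetical helper, and it assumes a binary problem where `1` is the positive class.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TN, FP, FN, TP for a binary problem by comparing labels pairwise."""
    tn = fp = fn = tp = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == positive:
            if predicted == positive:
                tp += 1   # positive example, predicted positive
            else:
                fn += 1   # positive example, predicted negative
        else:
            if predicted == positive:
                fp += 1   # negative example, predicted positive
            else:
                tn += 1   # negative example, predicted negative
    return tn, fp, fn, tp

# Made-up labels for illustration
y_true_toy = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred_toy = [1, 0, 1, 0, 1, 0, 0, 1]

print(confusion_counts(y_true_toy, y_pred_toy))  # (3, 1, 1, 3)
```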

In [26]:
# Recall is the number of correctly classified positive examples divided by the total number of positive examples.
# High recall means few positive examples are missed (small number of FN)

recall = (TP)/(TP + FN)

In [27]:
# Precision is the number of correctly classified positive examples divided by the total number of predicted positive examples.
# High precision means an example labeled as positive is usually indeed positive (small number of FP)

precision = (TP)/(TP + FP)

In [28]:
fmeasure = (2*recall*precision)/(recall+precision)   #f-measure is the harmonic mean of Recall and Precision
accuracy = (TP + TN)/(TN + FN + FP + TP) #Total number of correct predictions divided by total number of instances predicted

accuracy_score(y_test, y_predict)

0.7746580273578114

In [29]:
print("------ CLASSIFICATION PERFORMANCE OF THE NAIVE BAYES MODEL ------ \n"\
      "\n Recall : ", (recall*100) ,"%" \
      "\n Precision : ", (precision*100) ,"%" \
      "\n Accuracy : ", (accuracy*100) ,"%" \
      "\n F-measure : ", (fmeasure*100) ,"%" )


------ CLASSIFICATION PERFORMANCE OF THE NAIVE BAYES MODEL ------ 

 Recall :  99.42446043165467 %
 Precision :  69.1 %
 Accuracy :  77.46580273578114 %
 F-measure :  81.53392330383481 %
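
As a sanity check on the arithmetic, the four metrics can be recomputed from confusion-matrix counts. The notebook never prints `conf_matrix`, so the counts below are a reconstruction: they are the integers consistent with the reported percentages and the 1389 test instances, and should be treated as inferred rather than observed.

```python
# Counts inferred from the reported percentages and the 1389 test instances;
# the notebook does not print conf_matrix, so this is a reconstruction.
TP, FN, FP, TN = 691, 4, 309, 385

recall = TP / (TP + FN)
precision = TP / (TP + FP)
accuracy = (TP + TN) / (TP + TN + FP + FN)
fmeasure = 2 * recall * precision / (recall + precision)

print(round(recall * 100, 2))     # 99.42
print(round(precision * 100, 2))  # 69.1
print(round(accuracy * 100, 2))   # 77.47
print(round(fmeasure * 100, 2))   # 81.53
```

Under these counts only 4 positive examples are missed, while 309 negative examples are labelled positive, which would explain why recall is close to 100% and precision is about 69%.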
