### Getting started with Naive Bayes

In [1]:
#Install the package
install.packages("e1071")

#Loading the library
library(e1071)
#?naiveBayes #The documentation also contains an example implementation of Titanic dataset


### The Titanic dataset in R is a table covering **2201** passengers and crew, summarised by four factors: economic status (1st class, 2nd class, 3rd class or crew), sex (male or female), age category (child or adult), and whether the passenger survived. For each combination of Class, Sex, Age and Survived, the table gives the number of people who fall into that combination. We will use the Naive Bayes technique to classify these passengers and check how well it performs.

### As we know, Bayes' theorem is based on conditional probability and uses the formula:

##### $P(A \mid B) = \dfrac{P(A) \, P(B \mid A)}{P(B)}$
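As a quick sanity check of the formula, the sketch below applies it to the built-in `Titanic` table, taking A as "survived" and B as "female", and compares the result with the probability read directly from the counts. The variable names are illustrative and not part of the original notebook.

```r
# Collapse the 4-way Titanic table to a Sex x Survived table of counts
data("Titanic")
sex_by_survival <- margin.table(Titanic, c(2, 4))
total <- sum(sex_by_survival)

# Ingredients of Bayes' theorem with A = "survived", B = "female"
p_yes <- sum(sex_by_survival[, "Yes"]) / total          # P(A)
p_female <- sum(sex_by_survival["Female", ]) / total    # P(B)
p_female_given_yes <- sex_by_survival["Female", "Yes"] /
  sum(sex_by_survival[, "Yes"])                         # P(B | A)

# P(A | B) via the formula, and directly from the counts
p_yes_given_female <- p_yes * p_female_given_yes / p_female
p_direct <- sex_by_survival["Female", "Yes"] / sum(sex_by_survival["Female", ])

print(c(bayes = p_yes_given_female, direct = p_direct))
```

Both values should agree, since the formula is only a rearrangement of the same counts.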

In [2]:
#Next load the Titanic dataset
data("Titanic")

#Save into a data frame and view it
Titanic_df=as.data.frame(Titanic)
Titanic_df

Class,Sex,Age,Survived,Freq
1st,Male,Child,No,0
2nd,Male,Child,No,0
3rd,Male,Child,No,35
Crew,Male,Child,No,0
1st,Female,Child,No,0
2nd,Female,Child,No,0
3rd,Female,Child,No,17
Crew,Female,Child,No,0
1st,Male,Adult,No,118
2nd,Male,Adult,No,154


### There are 32 observations, one for each possible combination of Class, Sex, Age and Survived, together with its frequency. Because the table is summarised, it is not suitable for modelling as it stands: we need to expand it so that each passenger has their own row. Let’s create a repeating sequence of row indices based on the frequencies in the table.

In [17]:
#Creating data from table
repeating_sequence=rep.int(seq_len(nrow(Titanic_df)), Titanic_df$Freq) #This will repeat each combination equal to the frequency of each combination

#Create the dataset by repeating each row according to the sequence
Titanic_dataset=Titanic_df[repeating_sequence,]
head(Titanic_dataset)

,Class,Sex,Age,Survived,Freq
3,3rd,Male,Child,No,35
3.1,3rd,Male,Child,No,35
3.2,3rd,Male,Child,No,35
3.3,3rd,Male,Child,No,35
3.4,3rd,Male,Child,No,35
3.5,3rd,Male,Child,No,35
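One way to convince yourself that the expansion is lossless is to check that the expanded data has one row per person and that tabulating it again reproduces the original counts. This is a small sketch using base R only; the names `row_index`, `expanded` and `rebuilt` are illustrative.

```r
data("Titanic")
Titanic_df <- as.data.frame(Titanic)

# Repeat each row index as many times as its frequency
row_index <- rep.int(seq_len(nrow(Titanic_df)), Titanic_df$Freq)
expanded <- Titanic_df[row_index, c("Class", "Sex", "Age", "Survived")]

# One row per person: the row count must equal the sum of frequencies
stopifnot(nrow(expanded) == sum(Titanic_df$Freq))

# Tabulating the expanded rows should give back the original table
rebuilt <- table(expanded)
stopifnot(all(as.vector(rebuilt) == as.vector(Titanic)))

nrow(expanded)
```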


In [18]:
#We no longer need the frequency, drop the feature
Titanic_dataset$Freq=NULL

Titanic_dataset

,Class,Sex,Age,Survived
3,3rd,Male,Child,No
3.1,3rd,Male,Child,No
3.2,3rd,Male,Child,No
3.3,3rd,Male,Child,No
3.4,3rd,Male,Child,No
3.5,3rd,Male,Child,No
3.6,3rd,Male,Child,No
3.7,3rd,Male,Child,No
3.8,3rd,Male,Child,No
3.9,3rd,Male,Child,No


In [5]:
#Fitting the Naive Bayes model
Naive_Bayes_Model=naiveBayes(Survived ~., data=Titanic_dataset)

#What does the model say? Print the model summary
Naive_Bayes_Model


Naive Bayes Classifier for Discrete Predictors

Call:
naiveBayes.default(x = X, y = Y, laplace = laplace)

A-priori probabilities:
Y
      No      Yes 
0.676965 0.323035 

Conditional probabilities:
     Class
Y            1st        2nd        3rd       Crew
  No  0.08187919 0.11208054 0.35436242 0.45167785
  Yes 0.28551336 0.16596343 0.25035162 0.29817159

     Sex
Y           Male     Female
  No  0.91543624 0.08456376
  Yes 0.51617440 0.48382560

     Age
Y          Child      Adult
  No  0.03489933 0.96510067
  Yes 0.08016878 0.91983122


### The model estimates the conditional probabilities for each feature separately, given the class. It also reports the a-priori probabilities, which are the class proportions in our data (about 68% “No” and 32% “Yes”). Let’s see how the model performs on the data.
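To see how these tables turn into a prediction, the sketch below recomputes the prior and the per-feature conditional probabilities with base R and combines them for one example passenger (an adult woman travelling 1st class). It assumes no Laplace smoothing, and the helper `cond()` is a hypothetical name introduced here for illustration, not part of `e1071`.

```r
data("Titanic")
Titanic_df <- as.data.frame(Titanic)
passengers <- Titanic_df[rep.int(seq_len(nrow(Titanic_df)), Titanic_df$Freq),
                         c("Class", "Sex", "Age", "Survived")]

# A-priori probabilities: share of each class in the data
prior_table <- prop.table(table(passengers$Survived))
prior <- setNames(as.numeric(prior_table), names(prior_table))

# P(feature = value | Survived), one number per class (hypothetical helper)
cond <- function(feature, value) {
  tab <- prop.table(table(passengers$Survived, passengers[[feature]]), margin = 1)
  tab[, value]
}

# Naive Bayes score: prior times the product of the conditionals
score <- prior * cond("Class", "1st") * cond("Sex", "Female") * cond("Age", "Adult")

# Normalise the scores so they sum to 1
posterior <- score / sum(score)
print(posterior)
```

The class with the larger posterior is the prediction; for this passenger that is “Yes”.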

In [6]:
#Prediction on the dataset
NB_Predictions=predict(Naive_Bayes_Model,Titanic_dataset)

#Confusion matrix to check accuracy
table(NB_Predictions,Titanic_dataset$Survived)

              
NB_Predictions   No  Yes
           No  1364  362
           Yes  126  349

### Conclusion:

* We have the results! 

* The model classifies 1364 out of 1490 “No” cases correctly and 349 out of 711 “Yes” cases correctly.

* This means Naive Bayes identifies about 91.5% of the “No” cases but only about 49% of the “Yes” cases, for an overall accuracy of 77.8%. Note that these figures are measured on the same data the model was trained on.
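The percentages above can be reproduced from the confusion matrix. The sketch below enters the four counts by hand (rows are predictions, columns are actual outcomes), so it runs without fitting the model again; the object names are illustrative.

```r
# Confusion matrix copied from the output above, filled column by column
confusion <- matrix(c(1364, 126, 362, 349), nrow = 2,
                    dimnames = list(Predicted = c("No", "Yes"),
                                    Actual = c("No", "Yes")))

# Share of each actual class that was predicted correctly
per_class <- diag(confusion) / colSums(confusion)

# Overall accuracy: correct predictions over all predictions
accuracy <- sum(diag(confusion)) / sum(confusion)

print(round(per_class, 3))
print(round(accuracy, 3))
```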

# Part 2

### Getting started with Naive Bayes in `mlr` (Machine Learning in R)

If you have trouble installing the `devtools` package on a MacBook, do the following:
* `brew install libgit2`
* `# conda install -c r r-xml`

# ------------------------------------------------------------------------------------

In [7]:
# install.packages("devtools")

In [8]:
# library(devtools)

In [9]:
#devtools::install_github("mlr-org/mlr")

In [10]:
#Install the package
install.packages("mlr")

#Loading the library
library(mlr)
?mlr

Loading required package: ParamHelpers

Attaching package: 'mlr'

The following object is masked from 'package:e1071':

    impute



In [11]:
#Create a classification task for learning on Titanic Dataset and specify the target feature
task = makeClassifTask(data = Titanic_dataset, target = "Survived")

#Initialize the Naive Bayes classifier
selected_model = makeLearner("classif.naiveBayes")

#Train the model
NB_mlr = train(selected_model, task)

# Naive Bayes Classifier for Discrete Predictors

In [12]:
#Read the model learned  
NB_mlr$learner.model


Naive Bayes Classifier for Discrete Predictors

Call:
naiveBayes.default(x = X, y = Y, laplace = laplace)

A-priori probabilities:
Y
      No      Yes 
0.676965 0.323035 

Conditional probabilities:
     Class
Y            1st        2nd        3rd       Crew
  No  0.08187919 0.11208054 0.35436242 0.45167785
  Yes 0.28551336 0.16596343 0.25035162 0.29817159

     Sex
Y           Male     Female
  No  0.91543624 0.08456376
  Yes 0.51617440 0.48382560

     Age
Y          Child      Adult
  No  0.03489933 0.96510067
  Yes 0.08016878 0.91983122


In [13]:
#Predict on the dataset without passing the target feature
predictions_mlr = as.data.frame(predict(NB_mlr, newdata = Titanic_dataset[,1:3]))

In [14]:
##Confusion matrix to check accuracy
table(predictions_mlr[,1],Titanic_dataset$Survived)

     
        No  Yes
  No  1364  362
  Yes  126  349

# Here is the Complete Code

In [15]:
#Getting started with Naive Bayes
#Install the package
#install.packages("e1071")
#Loading the library
library(e1071)
?naiveBayes #The documentation also contains an example implementation of Titanic dataset
#Next load the Titanic dataset
data("Titanic")
#Save into a data frame and view it
Titanic_df=as.data.frame(Titanic)
#Creating data from table
repeating_sequence=rep.int(seq_len(nrow(Titanic_df)), Titanic_df$Freq) #This will repeat each combination equal to the frequency of each combination
 
#Create the dataset by row repetition created
Titanic_dataset=Titanic_df[repeating_sequence,]
#We no longer need the frequency, drop the feature
Titanic_dataset$Freq=NULL
 
#Fitting the Naive Bayes model
Naive_Bayes_Model=naiveBayes(Survived ~., data=Titanic_dataset)
#What does the model say? Print the model summary
Naive_Bayes_Model
 
#Prediction on the dataset
NB_Predictions=predict(Naive_Bayes_Model,Titanic_dataset)
#Confusion matrix to check accuracy
table(NB_Predictions,Titanic_dataset$Survived)
 
#Getting started with Naive Bayes in mlr
#Install the package
#install.packages("mlr")
#Loading the library
library(mlr)
 
#Create a classification task for learning on Titanic Dataset and specify the target feature
task = makeClassifTask(data = Titanic_dataset, target = "Survived")
 
#Initialize the Naive Bayes classifier
selected_model = makeLearner("classif.naiveBayes")
 
#Train the model
NB_mlr = train(selected_model, task)
 
#Read the model learned  
NB_mlr$learner.model
 
#Predict on the dataset without passing the target feature
predictions_mlr = as.data.frame(predict(NB_mlr, newdata = Titanic_dataset[,1:3]))
 
##Confusion matrix to check accuracy
table(predictions_mlr[,1],Titanic_dataset$Survived)


Naive Bayes Classifier for Discrete Predictors

Call:
naiveBayes.default(x = X, y = Y, laplace = laplace)

A-priori probabilities:
Y
      No      Yes 
0.676965 0.323035 

Conditional probabilities:
     Class
Y            1st        2nd        3rd       Crew
  No  0.08187919 0.11208054 0.35436242 0.45167785
  Yes 0.28551336 0.16596343 0.25035162 0.29817159

     Sex
Y           Male     Female
  No  0.91543624 0.08456376
  Yes 0.51617440 0.48382560

     Age
Y          Child      Adult
  No  0.03489933 0.96510067
  Yes 0.08016878 0.91983122


              
NB_Predictions   No  Yes
           No  1364  362
           Yes  126  349


Naive Bayes Classifier for Discrete Predictors

Call:
naiveBayes.default(x = X, y = Y, laplace = laplace)

A-priori probabilities:
Y
      No      Yes 
0.676965 0.323035 

Conditional probabilities:
     Class
Y            1st        2nd        3rd       Crew
  No  0.08187919 0.11208054 0.35436242 0.45167785
  Yes 0.28551336 0.16596343 0.25035162 0.29817159

     Sex
Y           Male     Female
  No  0.91543624 0.08456376
  Yes 0.51617440 0.48382560

     Age
Y          Child      Adult
  No  0.03489933 0.96510067
  Yes 0.08016878 0.91983122


     
        No  Yes
  No  1364  362
  Yes  126  349