# Naive Bayes classifier for Fake News recognition
The New York Times defines fake news as *"a made-up story with an intention to deceive"*.
Such stories are everywhere in daily life and spread especially through social media platforms
and online applications. Distinguishing fake content from real news is today one of the most
serious challenges facing the news industry.  

[Naive Bayes classifiers][1] are powerful algorithms used for text data analysis, in particular
for classifying text into multiple classes.  
The goal of the project is to implement a __Multinomial Naive Bayes classifier__ in `R` and test its
performance in the classification of social media posts.  
The suggested data set is available on [Kaggle][2].

[1]: https://nlp.stanford.edu/IR-book/pdf/13bayes.pdf "C. D. Manning, Chapter 13, Text Classification and Naive Bayes, in Introduction to Information Retrieval, Cambridge University Press, 2008."
[2]: https://www.kaggle.com/datasets/anmolkumar/fake-news-content-detection?select=train.csv

## Import useful libraries

In [2]:
library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

✔ ggplot2 3.3.6     ✔ purrr   0.3.4
✔ tibble  3.1.7     ✔ dplyr   1.0.9
✔ tidyr   1.2.0     ✔ stringr 1.4.0
✔ readr   2.1.2     ✔ forcats 0.5.1

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()



## Import dataset

In [3]:
df <- read_csv('train.csv', col_types = 'icc')
head(df, 10)

| Labels | Text | Text_Tag |
|---|---|---|
| `<int>` | `<chr>` | `<chr>` |
| 1 | Says the Annies List political group supports third-trimester abortions on demand. | abortion |
| 2 | When did the decline of coal start? It started when natural gas took off that started to begin in (President George W.) Bushs administration. | energy,history,job-accomplishments |
| 3 | Hillary Clinton agrees with John McCain "by voting to give George Bush the benefit of the doubt on Iran." | foreign-policy |
| 1 | Health care reform legislation is likely to mandate free sex change surgeries. | health-care |
| 2 | The economic turnaround started at the end of my term. | economy,jobs |
| 5 | The Chicago Bears have had more starting quarterbacks in the last 10 years than the total number of tenured (UW) faculty fired during the last two decades. | education |
| 0 | Jim Dunnam has not lived in the district he represents for years now. | candidates-biography |
| 2 | I'm the only person on this stage who has worked actively just last year passing, along with Russ Feingold, some of the toughest ethics reform since Watergate. | ethics |
| 2 | However, it took $19.5 million in Oregon Lottery funds for the Port of Newport to eventually land the new NOAA Marine Operations Center-Pacific. | jobs |
| 3 | Says GOP primary opponents Glenn Grothman and Joe Leibham cast a compromise vote that cost $788 million in higher electricity costs. | energy,message-machine-2014,voting-record |
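
The `col_types = 'icc'` argument above asks `read_csv` for one integer column followed by two character columns. As a rough, self-contained illustration of the same idea, the sketch below writes a tiny made-up CSV to a temporary file and reads it back with base R's `read.csv` and `colClasses`. The two rows are invented for the example and are not taken from `train.csv`; the names `tmp`, `toy` and `label_counts` are hypothetical.

```r
# Toy CSV mimicking the three columns of train.csv (rows are made up for illustration)
tmp <- tempfile(fileext = ".csv")
writeLines(c(
    "Labels,Text,Text_Tag",
    "1,A first toy statement.,health-care",
    "2,\"A second one, with a comma.\",\"economy,jobs\""
), tmp)

# Base-R counterpart of col_types = 'icc': integer, character, character
toy <- read.csv(tmp, colClasses = c("integer", "character", "character"))
str(toy)

# Number of documents per label, useful for spotting class imbalance
label_counts <- table(toy$Labels)
print(label_counts)
```

Quoted fields keep their embedded commas, which matters here because `Text_Tag` can hold several comma-separated tags in one cell.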


## Multinomial NB classifier implementation
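
As a reading aid for the functions in this section, the model they appear to implement can be written as follows. The notation is introduced here and is not part of the original notebook: $N$ is the number of training documents, $N_c$ the number of those with label $c$, $T_{ct}$ the number of occurrences of word $t$ in training documents of class $c$, $V$ the vocabulary, and $n_t(d)$ the number of occurrences of $t$ in the document $d$ being classified.

$$
\hat{P}(c) = \frac{N_c}{N},
\qquad
\hat{P}(t \mid c) = \frac{T_{ct} + 1}{\sum_{t' \in V} T_{ct'} + |V|}
$$

$$
\hat{c}(d) = \arg\max_{c} \left[ \log \hat{P}(c) + \sum_{t \in V} n_t(d) \, \log \hat{P}(t \mid c) \right]
$$

The $+1$ in the numerator and the $|V|$ in the denominator are add-one (Laplace) smoothing, so that a word never seen in a class does not drive the score to $-\infty$. Words of $d$ that are not in $V$ are simply skipped.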

In [36]:
accuracy <- function( predictions, true_labels) {
    sum(predictions==true_labels) / length(predictions)
}

In [4]:
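extract_vocabulary <- function(documents, unique=TRUE) {
    # split on single spaces, then strip punctuation and dollar signs
    words <- unlist(str_split(documents, ' '))
    words <- str_replace_all(words, "[:punct:]|[$]", "")
    words <- words[words!='']
    words <- casefold(words, upper = FALSE)
    # replace the first run of digits in each token with a placeholder
    words <- str_replace(words, "[:digit:]+", "isnumeric")
    # unique=TRUE returns the vocabulary, unique=FALSE keeps repetitions for counting
    if(unique) return(unique(words))
    else return(words)
}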
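
To make the tokenization steps concrete, here is a minimal base-R sketch that mirrors `extract_vocabulary` without depending on `stringr`. The helper name `tokenize_sketch` is hypothetical, and base R's `[[:punct:]]` class already covers `$`, so its behaviour may differ from the `stringr` version on unusual characters.

```r
# Hypothetical helper: a base-R analogue of extract_vocabulary(), for illustration only
tokenize_sketch <- function(documents, unique = TRUE) {
    words <- unlist(strsplit(documents, " ", fixed = TRUE))
    words <- gsub("[[:punct:]]", "", words)            # POSIX class, also removes "$"
    words <- words[words != ""]                        # drop tokens that were only punctuation
    words <- tolower(words)
    words <- sub("[[:digit:]]+", "isnumeric", words)   # first run of digits in each token
    if (unique) unique(words) else words
}

tokens <- tokenize_sketch("However, it took $19.5 million in Oregon Lottery funds.",
                          unique = FALSE)
print(tokens)
```

Because punctuation is stripped before digits are replaced, `$19.5` first becomes `195` and then the single placeholder token `isnumeric`.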

In [5]:
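multinomNBC.fit <- function(documents, labels) {
    # documents and labels are one-column data frames (tibbles),
    # e.g. df[1:n, 2] and df[1:n, 1]; work on the underlying vectors
    documents <- documents[[1]]
    labels    <- labels[[1]]

    log.like  <- list()

    # vocabulary over the whole training set
    V <- extract_vocabulary(documents)
    N <- length(documents)

    # log prior: log(N_c / N)
    log.prior <- log(table(labels)) - log(N)

    for(l in unique(labels)) {
        # all tokens (with repetitions) of the documents in class l
        text <- extract_vocabulary(documents[labels==l], unique=FALSE)
        # appending V once adds 1 to every count (Laplace smoothing)
        freq <- table(c(text, V))
        log.like[[as.character(l)]] <- log(freq) - log(sum(freq))
    }

    return(list(log.likelihood=log.like, log.prior=log.prior, vocabulary=V))
}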

In [59]:
multinomNBC.predict <- function(doc, log.prior, log.likelihood, voc){

    text <- extract_vocabulary(doc, unique=FALSE)
    # word counts of the document, restricted to the training vocabulary
    freq <- table(text)[voc]
    freq <- freq[!is.na(freq)]

    scores <- rep(0, length(log.prior))
    # names() works however the dimension of the prior table is named
    names(scores) <- names(log.prior)

    for( l in names(scores)) {
        # log P(c) + sum over words of count * log P(word | c)
        scores[l] <- log.prior[l] + sum(log.likelihood[[l]][names(freq)] * freq)
    }
    
    return(names(scores)[which.max(scores)[[1]]])
}
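
As a sanity check of the fit and predict logic, the following self-contained sketch runs the same computation on a four-document toy corpus. To the best of my reading it matches the small worked example in the chapter cited in the introduction, but treat the data as illustrative. All names (`docs`, `log_like`, `score_doc`, ...) are hypothetical and independent of the notebook's functions.

```r
# Toy training set: three documents of class "c", one of class "j"
docs   <- c("chinese beijing chinese", "chinese chinese shanghai",
            "chinese macao", "tokyo japan chinese")
labels <- c("c", "c", "c", "j")

tokens <- strsplit(docs, " ", fixed = TRUE)
vocab  <- unique(unlist(tokens))

# Log prior: log(N_c / N)
log_prior <- log(table(labels) / length(labels))

# One table of add-one smoothed log-likelihoods per class
log_like <- lapply(split(tokens, labels), function(class_tokens) {
    counts <- table(factor(unlist(class_tokens), levels = vocab))
    log((counts + 1) / (sum(counts) + length(vocab)))
})

# Score a document against every class, ignoring out-of-vocabulary words
score_doc <- function(doc) {
    words <- strsplit(doc, " ", fixed = TRUE)[[1]]
    words <- words[words %in% vocab]
    sapply(names(log_like), function(cl) {
        as.numeric(log_prior[cl]) + sum(log_like[[cl]][words])
    })
}

scores    <- score_doc("chinese chinese chinese tokyo japan")
predicted <- names(which.max(scores))
print(exp(scores))
print(predicted)
```

Working in log space, as the notebook does, avoids numerical underflow when many small probabilities are multiplied; exponentiating is only practical here because the test document is very short.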

In [6]:
nDoc <- 500

results <- multinomNBC.fit(df[1:nDoc, 2], df[1:nDoc, 1])

 'table' num [1:2366(1d)] -4.99 -8.24 -8.24 -8.24 -8.24 ...
 - attr(*, "dimnames")=List of 1
  ..$ : chr [1:2366] "a" "abbas" "abbott" "abdul" ...


In [58]:
multinomNBC.predict(df[1, 2][[1]], results$log.prior, results$log.likelihood, results$vocabulary)

text
          says            the         annies           list      political 
             1              1              1              1              1 
         group       supports thirdtrimester      abortions             on 
             1              1              1              1              1 
        demand 
             1 


In [61]:
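predict_all <- Vectorize(multinomNBC.predict, vectorize.args='doc', USE.NAMES=FALSE)
# note: rows 1:500 were used for training, so this is not a purely held-out score
pred <- predict_all(df[1:1000, 2][[1]], results$log.prior, results$log.likelihood, results$vocabulary)

accuracy(pred, df[1:1000, 1][[1]])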
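
Since the evaluation above mixes rows used for training with unseen rows, it can help to report accuracy on the held-out part separately and to look at a confusion table. The sketch below does this on small made-up vectors; `truth`, `pred`, `train_idx` and `test_idx` are hypothetical stand-ins for the notebook's predictions, the `Labels` column, and the row ranges.

```r
# Hypothetical toy vectors standing in for the true labels and the predictions
truth <- c(1L, 2L, 3L, 1L, 2L, 5L)
pred  <- c("1", "2", "1", "1", "3", "5")   # the classifier returns labels as strings

# Rows seen during training vs. held-out rows (indices are illustrative)
train_idx <- 1:4
test_idx  <- setdiff(seq_along(truth), train_idx)

# Comparing strings with integers coerces the integers to strings
acc_all  <- mean(pred == truth)
acc_test <- mean(pred[test_idx] == truth[test_idx])

# Confusion table: rows are predictions, columns are true labels
confusion <- table(predicted = pred, actual = truth)
print(confusion)
cat("accuracy on all rows:", acc_all, " held-out only:", acc_test, "\n")
```

With six classes, a majority-class baseline computed from `table(truth)` is another useful reference point when judging the accuracy.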

In [30]:
?Vectorize

Vectorize                 package:base                 R Documentation

_V_e_c_t_o_r_i_z_e _a _S_c_a_l_a_r _F_u_n_c_t_i_o_n

_D_e_s_c_r_i_p_t_i_o_n:

     ‘Vectorize’ creates a function wrapper that vectorizes the action
     of its argument ‘FUN’.

_U_s_a_g_e:

     Vectorize(FUN, vectorize.args = arg.names, SIMPLIFY = TRUE,
               USE.NAMES = TRUE)
     
_A_r_g_u_m_e_n_t_s:

     FUN: function to apply, found via ‘match.fun’.

vectorize.args: a character vector of arguments which should be
          vectorized.  Defaults to all arguments of ‘FUN’.

SIMPLIFY: logical or character string; attempt to reduce the result to
          a vector, matrix or higher dimensional array; see the
          ‘simplify’ argument of ‘sapply’.

USE.NAMES: logical; use names if the first ... argument has names, or
          if it is a character vector, use that character vector as the
          names.

_D_e_t_a_i_l_s:

     The arguments name

In [277]:
log(table(setdiff(voc, text)))


       abortion          across             act        actively           added 
              0               0               0               0               0 
     affordable          aflcio africanamerican           after         against 
              0               0               0               0               0 
            ago          agrees            alex             all          almost 
              0               0               0               0               0 
          along         already         america        american    americanmade 
              0               0               0               0               0 
      americans              an          anyone        anywhere         appears 
              0               0               0               0               0 
            arm          around              as     atlantaarea       authority 
              0               0               0               0               0 
     authorized            

In [258]:
#our vocabulary
voc <- extract_vocabulary(df[1:100, 2][[1]])
voc

In [115]:
example_char <- "hello I am Pietro I am 23 years old and not 50 "
example <- str_split(example_char, ' ')[[1]]
# as.numeric() returns NA for non-numeric tokens; the coercion warnings are expected, so silence them
is_num <- !is.na(suppressWarnings(as.numeric(example)))
num <- sum(is_num)
cat("\nIn the sentence there are", num, "numbers")


In the sentence there are 2 numbers

In [11]:
# extract the column as a character vector before splitting (a tibble is not an atomic vector)
str_split(df[2:100, 2][[1]], ' ')

