**Natural Language Processing**

# Bag of Words (BoW) model

A typical Bag of Words (BoW) vector represents a text document as a numerical array, where each element corresponds to the frequency (or presence) of a specific word in the vocabulary.

A structured representation of a BoW vector can look like this:

[**SoS**, **EoS**, ...frequency of words..., **frequency of special words**]

Where:

+ **SoS** (Start of Sentence Token) – Optional. Can be used to indicate the beginning of a sentence.
+ **EoS** (End of Sentence Token) – Optional. Marks the end of a sentence.
+ **frequency of words** – The main part of the vector, representing the frequency or presence of each word in the vocabulary.
+ **frequency of special words** – Can include punctuation, stop words, named entities, or special tokens (e.g., `<UNK>` for unknown words).
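
As a minimal sketch of the layout above, the base R snippet below counts tokens against a fixed vocabulary and sends every out-of-vocabulary word to an `<UNK>` slot. The vocabulary, the tokens, and the helper name `bow_vector` are hypothetical and chosen only for illustration.

```r
# Hypothetical toy vocabulary; "<UNK>" collects every out-of-vocabulary word
vocab <- c("food", "good", "service", "<UNK>")

bow_vector <- function(tokens, vocab) {
  # Map words that are not in the vocabulary to the "<UNK>" slot
  tokens[!tokens %in% vocab] <- "<UNK>"
  # Counting over a factor with fixed levels keeps zero entries and a stable order
  counts <- table(factor(tokens, levels = vocab))
  setNames(as.integer(counts), vocab)
}

tokens <- c("good", "food", "good", "price")
v <- bow_vector(tokens, vocab)
print(v)
```

Word order is lost in the result: only the counts per vocabulary slot remain.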

## Step-by-step

### Step 1. Collect and Prepare Text Data
- Gather a set of text documents (e.g., emails, reviews, or news articles).  
- Label the data if needed (e.g., spam vs. not spam).  

### Step 2. Clean and Preprocess the Text
- Convert text to lowercase.  
- Remove punctuation, numbers, and special characters.  
- Remove stop words (common words like "the", "and", "is" that don’t add much meaning).  
- Tokenize the text (split sentences into words).  
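
The cleaning steps listed above can be sketched in base R as follows. This is an illustration under simplifying assumptions: the stop-word list is a tiny hypothetical one, the helper name `preprocess` is made up, and tokenization happens before stop-word removal because filtering a token vector is simpler than editing the raw string.

```r
# Hypothetical, deliberately tiny stop-word list; real lists are much longer
stop_words <- c("the", "and", "is", "was", "a")

preprocess <- function(text, stop_words) {
  text <- tolower(text)                  # lowercase
  text <- gsub("[^a-z ]", " ", text)     # drop punctuation, digits, other symbols
  tokens <- strsplit(text, " +")[[1]]    # split on runs of spaces
  tokens <- tokens[tokens != ""]         # drop an empty token left by leading spaces
  tokens[!tokens %in% stop_words]        # remove stop words
}

tokens <- preprocess("The food was GREAT, and the service is fast!", stop_words)
print(tokens)
```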

### Step 3. Convert Sentences into Vectors using Bag of Words
- Create a vocabulary (a list of all unique words in the dataset).  
- Count how many times each word appears in each document.  
- Represent each document as a vector of word counts.  
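
A small base R sketch of these three bullets, assuming the documents are already tokenized; the three toy documents are hypothetical.

```r
# Hypothetical tokenized documents
docs <- list(
  c("food", "good", "good"),
  c("service", "slow"),
  c("good", "service")
)

# Vocabulary: every unique token in the dataset, sorted for a stable column order
vocab <- sort(unique(unlist(docs)))

# One row per document, one column per vocabulary word
bow <- t(sapply(docs, function(tokens) {
  as.integer(table(factor(tokens, levels = vocab)))
}))
colnames(bow) <- vocab
print(bow)
```

Each row is the BoW vector of one document; most entries are zero, which is why real document-term matrices are stored in a sparse format.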

### Step 4. Split Data into Training and Testing Sets
- Divide the dataset into a training set (to teach the model) and a test set (to check its accuracy).  

### Step 5. Train a Machine Learning Model
- Use a classification algorithm (e.g., Naive Bayes, Logistic Regression, or SVM) or a neural network.
- Feed the BoW vectors into the model so it learns patterns in the data.  

### Step 6. Test and Evaluate the Model
- Use the test set to check how well the model can predict labels for new text.  
- Measure accuracy and other performance metrics.  

### Step 7. Make Predictions on New Text
- Convert new sentences into BoW vectors.  
- Feed them into the trained model to get predictions.  

## Libraries Loading 

In [1]:
# install.packages('tm')
# install.packages('SnowballC')

In [26]:
library(tidyverse)
library(tm)
library(SnowballC)
library(rpart)
library(caret)
library(ggplot2)
library(MLmetrics)
library(class)
library(e1071)
library(randomForest)
library(pROC)

## Data Loading

In [27]:
# Step 1.
data = read_delim('../00_data/Restaurant_Reviews.tsv')

head(data)

Rows: 1000 Columns: 2
── Column specification ───────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: "\t"
chr (1): Review
dbl (1): Liked

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


| Review | Liked |
|---|---|
| `<chr>` | `<dbl>` |
| Wow... Loved this place. | 1 |
| Crust is not good. | 0 |
| Not tasty and the texture was just nasty. | 0 |
| Stopped by during the late May bank holiday off Rick Steve recommendation and loved it. | 1 |
| The selection on the menu was great and so were the prices. | 1 |
| Now I am getting angry and I want my damn pho. | 0 |


In [4]:
tail(data)

| Review | Liked |
|---|---|
| `<chr>` | `<dbl>` |
| I can't tell you how disappointed I was. | 0 |
| I think food should have flavor and texture and both were lacking. | 0 |
| Appetite instantly gone. | 0 |
| Overall I was not impressed and would not go back. | 0 |
| The whole experience was underwhelming, and I think we'll just go to Ninja Sushi next time. | 0 |
| Then, as if I hadn't wasted enough of my life there, they poured salt in the wound by drawing out the time it took to bring the check. | 0 |


In [5]:
dim(data)

## Data Preprocessing

In [6]:
# Step 2.
corpus <- VCorpus(VectorSource(data$Review))
as.character(corpus[[1]])

In [7]:
# convert to lowercase
corpus <- tm_map(corpus, content_transformer(tolower))
as.character(corpus[[1]])

In [8]:
# remove numbers
print(as.character(corpus[[841]]))
corpus <- tm_map(corpus, removeNumbers)
print(as.character(corpus[[841]]))

[1] "for 40 bucks a head, i really expect better food."
[1] "for  bucks a head, i really expect better food."


In [9]:
# remove punctuation
print(as.character(corpus[[841]]))
corpus <- tm_map(corpus, removePunctuation)
print(as.character(corpus[[841]]))

[1] "for  bucks a head, i really expect better food."
[1] "for  bucks a head i really expect better food"


In [10]:
# remove stopwords
corpus <- tm_map(corpus, removeWords, stopwords())
as.character(corpus[[1]])

In [11]:
# stemming
corpus <- tm_map(corpus, stemDocument)
as.character(corpus[[1]])

In [12]:
# strip extra whitespace
corpus <- tm_map(corpus, stripWhitespace)
print(as.character(corpus[[841]]))

[1] "buck head realli expect better food"


## Create Bag of Words model (Create Features)

In [13]:
# Step 3.
dtm <- DocumentTermMatrix(corpus) 
dtm

<<DocumentTermMatrix (documents: 1000, terms: 1577)>>
Non-/sparse entries: 5435/1571565
Sparsity           : 100%
Maximal term length: 32
Weighting          : term frequency (tf)

In [14]:
# drop very rare terms: keep only terms with sparsity below 99.9%,
# i.e. terms that occur in more than 0.1% of the reviews
dtm <- removeSparseTerms(dtm, 0.999)
dtm

<<DocumentTermMatrix (documents: 1000, terms: 691)>>
Non-/sparse entries: 4549/686451
Sparsity           : 99%
Maximal term length: 12
Weighting          : term frequency (tf)
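
One way to read the `sparse` argument used above: a term's sparsity is the share of documents in which it does not occur, and only terms below the threshold are kept. The base R sketch below mimics that rule on a hypothetical toy matrix. It illustrates the idea only and is not tm's implementation; the matrix and the threshold `max_sparsity` are assumptions.

```r
# Hypothetical document-term counts: 4 documents, 3 terms
m <- matrix(c(1, 0, 2,
              1, 0, 0,
              0, 0, 1,
              3, 1, 0),
            nrow = 4, byrow = TRUE,
            dimnames = list(NULL, c("food", "rare", "good")))

# Sparsity of a term = share of documents where its count is zero
term_sparsity <- colMeans(m == 0)

# Keep only terms whose sparsity is below the threshold
max_sparsity <- 0.6
kept <- m[, term_sparsity < max_sparsity, drop = FALSE]

print(term_sparsity)
print(colnames(kept))
```

Here `rare` appears in one document out of four (sparsity 0.75) and is dropped, while `food` and `good` stay.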

## Split the data into training and test sets

In [15]:
# Step 4.
data_df <- as.data.frame(as.matrix(dtm)) # convert to data frame
data_df$Liked <- factor(data$Liked, levels=c(0, 1))

In [16]:
# split
set.seed(42)

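train_data <- data_df |> slice_sample(prop = 0.8)
# anti_join matches on column values, so held-out rows identical to a training row
# are dropped as well (the test set ends up with 191 rows instead of 200)
test_data <- data_df |> anti_join(train_data, by=colnames(train_data))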
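
An alternative way to split, sketched here on a hypothetical toy data frame, is to sample row positions and take the complement as the test set. Because the split is by position rather than by value, duplicated rows are not removed from the held-out part. The names `toy`, `train_toy`, and `test_toy` are illustrative only.

```r
set.seed(42)

# Hypothetical stand-in for data_df: 10 rows, several of them identical
toy <- data.frame(good  = c(1, 0, 1, 0, 2, 0, 1, 0, 1, 0),
                  Liked = factor(c(1, 0, 1, 0, 1, 0, 1, 0, 1, 0)))

# Sample row positions, then use the complement as the test set
train_idx <- sample(nrow(toy), size = floor(0.8 * nrow(toy)))
train_toy <- toy[train_idx, ]
test_toy  <- toy[-train_idx, ]

print(c(train = nrow(train_toy), test = nrow(test_toy)))
```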

## Train a machine learning model on the training set

### Random Forest model

In [17]:
# Step 5.

# `break` and `next` are reserved words in R, so columns with these names break the model formula
colnames(train_data)[colnames(train_data) == "break"] <- "break_1"
colnames(train_data)[colnames(train_data) == "next"] <- "next_1"
colnames(test_data)[colnames(test_data) == "break"] <- "break_1"
colnames(test_data)[colnames(test_data) == "next"] <- "next_1"

fit <- randomForest(Liked ~ .,
            data = train_data, 
            ntree = 100)

fit


Call:
 randomForest(formula = Liked ~ ., data = train_data, ntree = 100) 
               Type of random forest: classification
                     Number of trees: 100
No. of variables tried at each split: 26

        OOB estimate of  error rate: 25.37%
Confusion matrix:
    0   1 class.error
0 317  83      0.2075
1 120 280      0.3000

## Predicting the test set results

In [18]:
# Step 6.
#test_data <- test_data[, !names(test_data) %in% c("break", "next")]
y_pred <- predict(fit, newdata=test_data)
head(y_pred)

## Performance Metrics

### Confusion Matrix

In [19]:

cm <- table(test_data$Liked, y_pred)
print(cm)

   y_pred
     0  1
  0 77 17
  1 27 70


### Accuracy Score

In [20]:
accuracy <- mean(test_data$Liked == y_pred)
print(paste0('Accuracy on test set: ', round(accuracy*100, 2), "%"))

[1] "Accuracy on test set: 76.96%"
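
Accuracy is only one summary of the confusion matrix. As a sketch, the snippet below re-enters the matrix printed above by hand (rows are actual labels, columns are predictions) and derives precision, recall, and F1 for the positive class `1` alongside accuracy.

```r
# Confusion matrix from the cell above: rows = actual, columns = predicted
cm <- matrix(c(77, 17,
               27, 70),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))

tp <- cm["1", "1"]  # liked, predicted liked
fp <- cm["0", "1"]  # not liked, predicted liked
fn <- cm["1", "0"]  # liked, predicted not liked

accuracy  <- sum(diag(cm)) / sum(cm)
precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1        <- 2 * precision * recall / (precision + recall)

print(round(c(accuracy = accuracy, precision = precision,
              recall = recall, f1 = f1), 4))
```

The accuracy computed this way, 147/191, agrees with the 76.96% reported above.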


## Predict a new result

In [21]:
# Step 7.

In [22]:
pos_neg_classifier <- function(classifier, model_terms, new_review){    
    # preprocess the new review exactly like the training corpus
    new_corpus <- VCorpus(VectorSource(new_review))
    new_corpus <- tm_map(new_corpus, content_transformer(tolower)) # convert to lowercase
    new_corpus <- tm_map(new_corpus, removeNumbers)                # remove numbers
    new_corpus <- tm_map(new_corpus, removePunctuation)            # remove punctuation
    new_corpus <- tm_map(new_corpus, removeWords, stopwords("en")) # remove stopwords
    new_corpus <- tm_map(new_corpus, stemDocument)                 # stemming
    new_corpus <- tm_map(new_corpus, stripWhitespace)              # strip extra whitespace
    
    new_dtm <- DocumentTermMatrix(new_corpus)
    
    new_dtm <- as.matrix(new_dtm)
    new_dtm <- new_dtm[, colnames(new_dtm) %in% model_terms, drop = FALSE]
    
    # add missing terms with zeros
    missing_terms <- setdiff(model_terms, colnames(new_dtm))
    for (term in missing_terms) {
      new_dtm <- cbind(new_dtm, matrix(0, nrow = nrow(new_dtm), ncol = 1, dimnames = list(NULL, term)))
    }
    
    new_dtm <- new_dtm[, model_terms, drop = FALSE]
    # rename reserved words
    colnames(new_dtm)[colnames(new_dtm) == "break"] <- "break_1"
    colnames(new_dtm)[colnames(new_dtm) == "next"] <- "next_1"
    
    # use the classifier passed in as an argument rather than the global `fit`
    prediction <- predict(classifier, new_dtm)
    
    return(prediction)
}
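
The column alignment inside the function above (drop unseen terms, add missing ones as zeros, restore the training order) can also be written without a loop. The sketch below is one possible variant in base R; the helper name `align_to_terms` and the toy inputs are hypothetical.

```r
align_to_terms <- function(new_counts, model_terms) {
  # Start from an all-zero matrix that has exactly the model's columns
  aligned <- matrix(0, nrow = nrow(new_counts), ncol = length(model_terms),
                    dimnames = list(rownames(new_counts), model_terms))
  # Copy over only the terms the model knows; unseen words are dropped
  shared <- intersect(colnames(new_counts), model_terms)
  aligned[, shared] <- new_counts[, shared, drop = FALSE]
  aligned
}

# Hypothetical example: the new review contains "pizza", which the model never saw
new_counts  <- matrix(c(2, 1), nrow = 1, dimnames = list("1", c("good", "pizza")))
model_terms <- c("bad", "food", "good")
aligned <- align_to_terms(new_counts, model_terms)
print(aligned)
```

Whatever the implementation, the point is that the new vector must have the same columns, in the same order, as the matrix the model was trained on.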

+ **Positive Review:**
  > The food was exceptional, with fresh ingredients and bold flavors. Friendly staff, cozy atmosphere, and quick service made it a delightful experience.

In [23]:
new_review <- c('The food was exceptional, with fresh ingredients and bold flavors. Friendly staff, cozy atmosphere, and quick service made it a delightful experience.')
pos_neg_classifier(fit, Terms(dtm), new_review)

+ **Negative Review:**
  > Disappointing experience—slow service, overpriced dishes, and bland flavors. The ambiance was noisy, and the staff seemed inattentive. Not worth the visit.

In [24]:
new_review <- c('Disappointing experience—slow service, overpriced dishes, and bland flavors. The ambiance was noisy, and the staff seemed inattentive. Not worth the visit.')
pos_neg_classifier(fit, Terms(dtm), new_review)

## Best model selection

In [25]:
random_seed <- 42

# get scores to choose the best model
scores <- function(y_pred, y_prob){
    roc_curve <- roc(test_data$Liked, y_prob, levels = c(0, 1), direction = "<")
    auc_value <- auc(roc_curve)
    accuracy <- mean(test_data$Liked == y_pred)
    list('f1' = F1_Score(y_true = test_data$Liked, y_pred = y_pred, positive = "1"),
         'roc-auc' = auc_value,
         'accuracy' = accuracy)
}

# convert probabilities to classes
prob_to_class <- function(y_prob){
    ifelse(y_prob > 0.5, 1, 0)
}

# get confusion matrix
conf_mat <- function(y_pred){
    table(test_data$Liked, y_pred)
}

# models configurations
models_config <- list(
  # 1. KNN
  'KNN' = list(
      name = 'KNN',
      train_test = function(){
          # fit the model on the train set and get probabilities predicted on test set
          y_pred <- knn(train = train_data[, -ncol(train_data)],
                        test = test_data[, -ncol(test_data)],
                        cl = train_data$Liked, # classes from training set
                        k = 5,
                        prob = TRUE) # get probabilities

          winning_class_probs <- attr(y_pred, "prob")
          # knn() reports the vote share of the winning class; convert it to the probability of class 1
          positive_probs <- ifelse(y_pred == 1, winning_class_probs, 1 - winning_class_probs)
          
          return(positive_probs)
    },
    evaluate = conf_mat,
    scores = scores
  ),
  # 2. SVM
  'SVM' = list(
      name = 'SVM',
      train_test = function(){
          kernels <- list(linear = list(kernel = 'linear',
                                        params = list(cost = c(0.1, 1, 10, 100))),
                          radial = list(kernel = 'radial',
                                        params = list(cost = c(0.1, 1, 10, 100), 
                                                      gamma = c(1, 0.1, 0.01))),
                          poly = list(kernel = 'polynomial',
                                      params = list(cost = c(0.1, 1, 10, 100), 
                                                    gamma = c(1, 0.1, 0.01),
                                                    degree = 2:4))
                         )
          # tune params and select the best SVM kernel
          best_models <- lapply(kernels, function(x){
              set.seed(random_seed)
              tuned <- tune('svm',
                            Liked ~ .,
                            data = train_data,
                            kernel = x$kernel,
                            ranges = x$params,
                            tunecontrol = tune.control(cross = 10),
                            probability = TRUE,
                            scale = FALSE
                           )
            
              best_model <- tuned$best.model
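              # note: kernels are compared on the test set here, so the final test scores are optimistic;
              # tuned$best.performance (the cross-validated error) is a leakage-free alternative
              y_pred <- predict(best_model, newdata = test_data)
              f1 <- F1_Score(y_true = test_data$Liked, y_pred = y_pred, positive = "1")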
              list(kernel = x$kernel,
                   best.params = tuned$best.parameters,
                   best.f1=f1,
                   best.model=best_model)
          })
          # select best SVM
          best_model <- best_models[[which.max(sapply(best_models, function(x) x$best.f1))]]
          # predict probabilities on test set
          y_pred <- predict(best_model$best.model, newdata = test_data, probability = TRUE)
          probs <- attr(y_pred, 'probabilities')
                                                      
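          # select the positive class by name; the column order of the probability matrix is not guaranteed
          return(probs[, "1"])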
      },
      evaluate = conf_mat,
      scores = scores
  ),
  # 3. Naive Bayes
  'Naive Bayes' = list(
      name = 'Naive Bayes',
      train_test = function(){
          # fit the model on training set
          fit <- naiveBayes(Liked ~ ., data = train_data)
          # predict probabilities on the test set
          y_pred <- predict(fit, newdata = test_data[-ncol(test_data)], type = 'raw')
          
          return(y_pred[, 2])
      },
      evaluate = conf_mat,
      scores = scores
  ),
  # 4. Decision Tree
  'Decision Tree' = list(
      name = 'Decision Tree',
      train_test = function(){
          set.seed(random_seed)
          # fit the model on training set
          fit <- rpart(Liked ~ ., data = train_data)
          # predict probabilities on the test set
          y_pred <- predict(fit, newdata = test_data[-ncol(test_data)], type = 'prob')
          
          return(y_pred[, 2])
      },
      evaluate = conf_mat,
      scores = scores
  ),
                                               
  # 5. Random Forest
  'Random Forest' = list(
      name = 'Random Forest',
      train_test = function(){
          set.seed(random_seed)
          # fit the model on training set
          fit <- randomForest(Liked ~ .,
                              data = train_data,
                              ntree = 100)        
          # predict probabilities on the test set
          y_pred <- predict(fit, newdata = test_data, type = 'prob')
          
          return(y_pred[, 2])
      },
      evaluate = conf_mat,
      scores = scores
  )
)

# run experiment
best_models <- lapply(models_config, function(x){
    print(paste0('Running ', x[['name']], '...'))
    # train
    prob_pred <- x[['train_test']]()
    #print('Model trained!')
    y_pred <- prob_to_class(prob_pred)
    #print('Evaluating...')
    cm <- x[['evaluate']](y_pred)
    # score
    #print('Scoring...')
    scores <- x[['scores']](y_pred, prob_pred)
    print(list(
        name = x[['name']],
        confusion_matrix = cm,
        scores = scores
    ))
    list(
        name = x[['name']],
        confusion_matrix = cm,
        scores = scores
    )
})

# Best model selection
max_val <- max(sapply(best_models, function(x) x$scores$f1))
best_of_the_best <- best_models[which(sapply(best_models, function(x) x$scores$f1) == max_val, arr.ind = TRUE)]

print('-------------------------------------------')
print('The best models:')

best_of_the_best

[1] "Running KNN..."
$name
[1] "KNN"

$confusion_matrix
   y_pred
     0  1
  0 82 12
  1 55 42

$scores
$scores$f1
[1] 0.5562914

$scores$`roc-auc`
Area under the curve: 0.7085

$scores$accuracy
[1] 0.6492147


[1] "Running SVM..."
$name
[1] "SVM"

$confusion_matrix
   y_pred
     0  1
  0 76 18
  1 16 81

$scores
$scores$f1
[1] 0.8265306

$scores$`roc-auc`
Area under the curve: 0.863

$scores$accuracy
[1] 0.8219895


[1] "Running Naive Bayes..."
$name
[1] "Naive Bayes"

$confusion_matrix
   y_pred
     0  1
  0 19 75
  1 26 71

$scores
$scores$f1
[1] 0.5843621

$scores$`roc-auc`
Area under the curve: 0.4696

$scores$accuracy
[1] 0.4712042


[1] "Running Decision Tree..."
$name
[1] "Decision Tree"

$confusion_matrix
   y_pred
     0  1
  0 85  9
  1 38 59

$scores
$scores$f1
[1] 0.7151515

$scores$`roc-auc`
Area under the curve: 0.7739

$scores$accuracy
[1] 0.7539267


[1] "Running Random Forest..."
$name
[1] "Random Forest"

$confusion_matrix
   y_pred
     0  1
  0 77 17
  1 26 71



$SVM
$SVM$name
[1] "SVM"

$SVM$confusion_matrix
   y_pred
     0  1
  0 76 18
  1 16 81

$SVM$scores
$SVM$scores$f1
[1] 0.8265306

$SVM$scores$`roc-auc`
Area under the curve: 0.863

$SVM$scores$accuracy
[1] 0.8219895
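
For intuition about the `roc-auc` score reported above: ROC-AUC can be read as the probability that a randomly chosen positive example receives a higher predicted probability than a randomly chosen negative one. The base R sketch below computes it that way on hypothetical labels and scores. The helper name `auc_by_pairs` is made up, and the pairwise comparison is meant for small examples only, not as a replacement for pROC.

```r
auc_by_pairs <- function(y_true, y_prob) {
  pos <- y_prob[y_true == 1]
  neg <- y_prob[y_true == 0]
  # Compare every positive score with every negative score; ties count as half
  cmp <- outer(pos, neg, function(p, n) (p > n) + 0.5 * (p == n))
  mean(cmp)
}

# Hypothetical labels and predicted probabilities for class 1
y_true <- c(0, 0, 1, 1)
y_prob <- c(0.1, 0.4, 0.35, 0.8)
auc_value <- auc_by_pairs(y_true, y_prob)
print(auc_value)
```

Three of the four positive-negative pairs are ranked correctly here, so the result is 0.75. A value near 0.5, like the Naive Bayes score above, means the ranking is no better than chance.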


